Cutting Edge

Binary Serialization of DataSets

Dino Esposito

Contents

The Serialization Problem Defined
How DataSet Serialization Really Works
Avoiding the Problem
Drilling Down a Little Further
DataSet Serialization in ADO.NET 2.0
DataTable Enhancements
Call to Action

The ADO.NET DataSet object plays an essential role in most of today's distributed, multitiered applications. Instances of the DataSet class are used to move data across the tiers and to exchange data with external services. The DataSet has most of the features needed to represent real-world business entities.

First, the DataSet is a disconnected object—you can use it without a physical connection to the data source. In addition, the DataSet provides a rich programming interface. It supports multiple tables of data, accepts relations defined between pairs of tables, and allows you to enumerate its contents. Last but not least, the DataSet is a fully serializable object. It can be serialized in three different ways—to a standard .NET formatter, to an XML writer, and through the XML serializer.

For objects that represent the real-world entities of a multitiered app, serialization is key because the object must be serialized to be moved across tiers. This month, I'll discuss a problem with serialization in ADO.NET 1.x and how serialization in ADO.NET 2.0 will improve upon it. Back in the December 2002 installment of Cutting Edge, I reviewed the serialization capabilities of the DataSet object, which you might want to refer to for a little background.

The Serialization Problem Defined

I have a client with a classic three-tier application in which a number of user interface process components were moving data up and down to business components through a .NET Remoting channel. The runtime environment of .NET Remoting used a custom host application configured to work over binary serialization. The client chose binary serialization, thinking it was the fastest way to move data between processes. But their application was running slowly and experiencing problems moving data. Hence, the client put a sniffer on the network to look at the size of the data being moved back and forth and discovered that too many bytes were being transported; the actual size of the data was grossly exceeding the expected size. So what was going on?

How DataSet Serialization Really Works

Marked with the [Serializable] attribute, the DataSet object implements the ISerializable interface to gain full control over its serialization process. The .NET formatters serialize, in a default way, objects that don't implement ISerializable and guarantee that data is stored as a SOAP payload if the SoapFormatter is used or as a binary stream if the BinaryFormatter is used. The responsibility of the formatters, however, ends here if the serialized object supports ISerializable. In that case, the formatter passes an empty memory buffer (the SerializationInfo data structure) and waits for the serializee to populate it with data. The formatter's activities are limited to flushing this memory buffer to a binary stream or wrapping it up as a SOAP packet. However, there are no guarantees made about the type of the data added to the SerializationInfo object.

The following pseudocode shows how the DataSet serializes with a .NET formatter. The GetObjectData method is the sole method of the ISerializable interface.

void GetObjectData(SerializationInfo info, StreamingContext context)
{
    // Store the schema as plain text
    info.AddValue("XmlSchema", this.GetXmlSchema());

    // Store the data as a DiffGram, again as plain text
    StringWriter strWriter = new StringWriter();
    this.WriteXml(strWriter, XmlWriteMode.DiffGram);
    info.AddValue("XmlDiffGram", strWriter.ToString());
}

Regardless of the formatter being used, the DataSet always serializes first to XML. What's worse, the DataSet uses a pretty verbose schema—the DiffGram format plus any related schema information. Now take a DataSet with a few thousand records in it and imagine such a large chunk of text traveling over the network with no sort of optimization or compression (even blank spaces aren't removed). That's exactly the problem that I'm sure many of you have been called to solve at one time or another.

Avoiding the Problem

Until ADO.NET 2.0, the prognosis was mixed. There was both good and bad news. The bad news was that this problem is virtually unavoidable if you choose the DataSet as the implementor of your business entities. Alternative choices are raw XML, custom collections, and arrays of custom classes. All of these have pros and cons. All are serializable and enumerable data types. In addition, raw XML and custom types are better at interoperating with components running on platforms such as J2EE than the DataSet is. This assumes you're employing Web services to bridge user interface process components to business entities (see Figure 1). For more information about Web services and DataSets, check out The XML Files column in the April 2003 issue of MSDN® Magazine.

Figure 1 User Interface

If you modify your architecture to do without DataSets, you avoid the problem of serialization but lose some usability and simplicity. The DataSet is a great class to code against. Using it you can pass the full DataSet around, and then if needed, you can also easily excerpt only the changed rows in each table in order to minimize bandwidth and CPU usage for a call.

The good news is that there are some workarounds. To see the various options compared, check out the DataSet FAQ. One way to improve the end-to-end transfer speed is to override the DataSet serialization mechanism.

You can create your own DataSet-derived class and implement the members of the ISerializable interface to fulfill your performance and scalability requirements. As an alternative, you can generate a typed DataSet using Visual Studio® .NET and then modify its source code to reimplement the ISerializable interface as needed. A third option is to use a little-known feature of the .NET Framework—serialization surrogates. Jeffrey Richter provides excellent coverage of serialization surrogates in his September 2002 installment of the .NET column. Technically speaking, a serialization surrogate is a class that implements the ISerializationSurrogate interface. The interface consists of two members: GetObjectData and SetObjectData.

By using surrogates you can override the way types serialize themselves. The technique has a couple of interesting practical applications: serialization of unserializable types and deserialization of an object to a different version of its type. Surrogates work in cooperation with the .NET runtime formatters to handle serialization and deserialization for a given type. For example, the BinaryFormatter class has a member named SurrogateSelector that installs a chain of surrogates for a variety of types. When an instance of any of these types is serialized or deserialized, the process won't go through the object's own serialization interface (ISerializable or the reflection-based default algorithm) but will instead go through the surrogate's ISerializationSurrogate implementation. Here's an example of how to override the serialization of a DataSet using a custom surrogate DataSetSurrogate:

SurrogateSelector ss = new SurrogateSelector();
DataSetSurrogate dss = new DataSetSurrogate();
ss.AddSurrogate(
    typeof(DataSet),
    new StreamingContext(StreamingContextStates.All),
    dss);
formatter.SurrogateSelector = ss;

In the DataSetSurrogate class, you implement the GetObjectData and SetObjectData methods to work around the known performance limitations of the standard DataSet serialization. Next, you add an instance of the DataSet surrogate to a selector class. Finally, the selector class is bound to an instance of the formatter.
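To make the shape of the surrogate concrete, here's a skeleton of such a class. The method bodies are deliberately left as comments: the actual packing scheme (per-column arrays, a zipped DiffGram, and so on) is up to you, and this sketch is not the implementation from Knowledge Base article 829740.

class DataSetSurrogate : ISerializationSurrogate
{
    public void GetObjectData(object obj, SerializationInfo info, StreamingContext context)
    {
        DataSet ds = (DataSet) obj;
        // Decompose ds into compact, serializable pieces and add them to info
        // instead of letting the DataSet write out its verbose DiffGram
    }

    public object SetObjectData(object obj, SerializationInfo info,
        StreamingContext context, ISurrogateSelector selector)
    {
        // Rebuild a DataSet from the pieces stored in info and return it
        DataSet ds = new DataSet();
        // ...
        return ds;
    }
}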

Whichever approach you choose to improve the DataSet's serialization performance (a new class or a serialization surrogate), the end goal should be to reduce the amount of information being moved.

The DataSet serializes to an XML DiffGram—a rich XML schema that contains the current snapshot of the data as well as pending errors on table rows and a history of all the changes that occurred on the rows. The history is actually the original value of the row assigned when either the DataSet was created or the last time the AcceptChanges method was invoked.

Since the main problem is the size of the data, compression is one possible answer. The solution outlined in Knowledge Base article 829740 ("Improving DataSet Serialization and Remoting Performance") employs a DataSet-like class that is simply marked as Serializable and doesn't implement ISerializable. Other ADO.NET objects (DataTable, DataColumn) are also replaced to take full control of the serialization process. This simple change provides a double advantage: it reduces the amount of data being moved and reduces the pressure on the .NET Remoting framework when serializing and deserializing large DataSet objects.
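If all you need is a quick size reduction, you can also simply compress the standard serialized stream. Here's a minimal sketch of that idea; it assumes the GZipStream class from the .NET Framework 2.0 System.IO.Compression namespace (on version 1.x you would substitute a third-party compression library), and the method names are mine, not part of any library:

static byte[] SerializeCompressed(DataSet ds)
{
    BinaryFormatter formatter = new BinaryFormatter();
    using (MemoryStream buffer = new MemoryStream())
    {
        using (GZipStream gzip = new GZipStream(buffer, CompressionMode.Compress))
        {
            // The formatter still writes the verbose DiffGram,
            // but the gzip layer shrinks it before it hits the wire
            formatter.Serialize(gzip, ds);
        }
        return buffer.ToArray();
    }
}

static DataSet DeserializeCompressed(byte[] data)
{
    BinaryFormatter formatter = new BinaryFormatter();
    using (MemoryStream buffer = new MemoryStream(data))
    using (GZipStream gzip = new GZipStream(buffer, CompressionMode.Decompress))
    {
        return (DataSet) formatter.Deserialize(gzip);
    }
}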

Another approach to the problem, one that is much more than just a smart way to serialize a DataSet, involves implementing a custom formatter, such as Angelo Scotto's CompactFormatter (available for download at https://www.freewebs.com/compactFormatter/About.html). The CompactFormatter is a generic formatter for both the Microsoft® .NET Framework and the .NET Compact Framework and is capable of producing an even more compact byte stream than the native BinaryFormatter class. In addition, it supports compression, which further reduces the amount of data being moved. More importantly, the CompactFormatter is not DataSet-specific and works with most .NET types.

Drilling Down a Little Further

All .NET distributed systems that make intensive use of disconnected data (as recommended by Microsoft architecture patterns and practices) are sensitive to the size of serialized data. The larger the DataSet, the more CPU cycles, memory, and bandwidth these systems consume. Fortunately, ADO.NET 2.0 provides a great fix. Before I explain it, though, I should make clear that I'm not saying DataSet XML serialization is a bad thing in general. Standard DataSet XML serialization is a "stateful" form of serialization in the sense that it maintains some state information such as pending errors and current changes to rows. It can be customized to a great extent by choosing your favorite combination of nodes and attributes; you can also decide whether relations should simply be tracked or rendered by nesting child records under their parents.
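For instance, with the 1.x APIs alone you control that choice on a per-relation basis. Here's a quick sketch; the relation name and file path are hypothetical:

// Nest child rows under their parents instead of tracking the relation separately
ds.Relations["Orders_OrderDetails"].Nested = true;

// A DiffGram keeps row state, errors, and original values
ds.WriteXml(@"c:\data.xml", XmlWriteMode.DiffGram);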

What's really interesting is that as long as only a few records (for example, fewer than 100) are involved in the serialization, the difference in performance between a standard DataSet and an optimized one (custom DataSet, surrogate, compressed, whatever) is negligible. Therefore, I wouldn't spend much time implementing alternative forms of DataSet serialization if my application moves only small chunks of data.

Although performance worsens as the size of the DataSet grows, even for a thousand records the performance is not much of a problem. Note that performance here relates to .NET Remoting end-to-end speed, a parameter that includes the size of the data to transfer and the costs of successive instantiations. As you scale from a thousand to a few thousand records, the performance begins to suffer significantly. Figure 2 shows the difference in .NET Remoting end-to-end speed when XML (standard) and true binary serialization are used (the latter is new to ADO.NET 2.0; more on this in a moment). For the sake of precision, I have to say that the numbers behind the graph have been obtained with a Beta build of ADO.NET 2.0.

Figure 2 Remoting End-to-End Time

Of course the worst scenario is when you move a large DataSet (thousands of rows) with numerous changes. If you're simply forwarding this DataSet to the data access layer (DAL) for applying changes, you can alleviate the issue by using the DataSet's GetChanges method. This method returns a new DataSet that contains only the rows in the various tables that have been modified, added, or deleted.
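In code, trimming the payload before the call is straightforward; SubmitChanges below stands in for whatever update method your data access layer actually exposes:

// Extract only the modified, added, and deleted rows
DataSet delta = ds.GetChanges();

// GetChanges returns null when there is nothing to send
if (delta != null)
    dal.SubmitChanges(delta);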

However, when you're moving a DataSet from tier to tier, or from a business component to an external service that implements a business process, you have no choice but to pass a stateful representation of the data—the whole DataSet with its own set of relations, pending changes, and errors.

Why does the end-to-end speed slow down once you reach a certain threshold? Handling large DataSets poses an additional problem to the .NET Remoting infrastructure aside from the time and space needed to complete the operation. This extra problem has to do with the specific algorithm used to serialize and, more importantly, deserialize DataSets saved as XML DiffGrams. To restore a DataSet from a DiffGram, the binary formatter invokes a protected constructor on the DataSet class (part of the ISerializable implementation). The pseudocode of this DataSet constructor is shown in Figure 3. As you can see, the DataSet's ReadXml method is called to process the DiffGram. Especially for large DataSets, ReadXml works by creating lots of transient, short-lived objects, a few for each row to be processed. This mechanism puts a lot of additional pressure on the .NET Remoting infrastructure and consumes a lot of memory and CPU cycles. Exactly how bad it is will depend on the runtime conditions and hardware equipment of the system. This explains why certain clients experience visible problems when moving only 7MB of data while others won't see problems until they've moved 20MB of data.

Figure 3 Pseudocode for DataSet Serialization

protected DataSet(SerializationInfo info, StreamingContext context)
{
    string schema, diffgram;
    schema = (string) info.GetValue("XmlSchema", typeof(string));
    diffgram = (string) info.GetValue("XmlDiffGram", typeof(string));

    // Rebuild the schema first, then reload the data from the DiffGram
    if (schema != null)
        ReadXmlSchema(new XmlTextReader(new StringReader(schema)), true);
    if (diffgram != null)
        ReadXml(new XmlTextReader(new StringReader(diffgram)), XmlReadMode.DiffGram);
}

DataSet Serialization in ADO.NET 2.0

When you upgrade to ADO.NET 2.0, your problems will be solved. In ADO.NET 2.0, the DataSet class provides a new serialization option specifically designed to optimize remoting serialization. As a result, remoting a DataSet uses less memory and bandwidth. The end-to-end latency is greatly improved, as Figure 2 illustrates.

In ADO.NET 2.0, the DataSet and DataTable come with a new property named RemotingFormat defined to be of type SerializationFormat (see Figure 4). By default, the new property is set to SerializationFormat.Xml to preserve backward compatibility. The property affects the behavior of the GetObjectData members on the ISerializable interface and ultimately is a way to control the serialization of the DataSet. The following code snippet represents the pseudocode of the method in ADO.NET 2.0:

void GetObjectData(SerializationInfo info, StreamingContext context)
{
    SerializationFormat fmt = RemotingFormat;
    SerializeDataSet(info, context, fmt);
}

SerializeDataSet is a helper function that fills the SerializationInfo memory buffer with data that represents the DataSet. If the format parameter is SerializationFormat.Binary, the function goes through every object in the DataSet and copies its contents into a serializable structure—mostly ArrayLists and Hashtables. Figure 5 shows the pseudocode. If you compare that to the code in the aforementioned Knowledge Base article 829740, you'll find many similarities. Note, though, that the code in Figure 5 is based on Beta 1 and might change in the future. What shouldn't change, though, is the idea behind the binary serialization. To serialize a DataSet in a true binary fashion, here's what you do:

DataSet ds = GetData();
ds.RemotingFormat = SerializationFormat.Binary;
BinaryFormatter bin = new BinaryFormatter();
bin.Serialize(stream, ds);

Aside from the new RemotingFormat property, there's nothing new in this code. The impact of this new feature on existing code is very minimal.
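As far as I can tell from the Beta bits, the receiving end needs no changes at all; the remoting format travels inside the serialized payload, so a plain Deserialize call rebuilds the DataSet:

BinaryFormatter bin = new BinaryFormatter();
DataSet ds = (DataSet) bin.Deserialize(stream);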

Figure 5 Pseudocode for DataSet's Binary Serialization

private void SerializeDataSet(
    SerializationInfo info, StreamingContext context,
    SerializationFormat remotingFormat)
{
    info.AddValue("DataSet.RemotingVersion", new Version(2, 0));
    if (remotingFormat != SerializationFormat.Xml)
    {
        int i;
        info.AddValue("DataSet.RemotingFormat", remotingFormat);
        SerializeDataSetProperties(info, context);
        info.AddValue("DataSet.Tables.Count", this.Tables.Count);
        for (i = 0; i < Tables.Count; i++)
            Tables[i].SerializeConstraints(info, context, i, true);
        SerializeRelations(info, context);
        for (i = 0; i < Tables.Count; i++)
            Tables[i].SerializeExpressionColumns(info, context, i);
        for (i = 0; i < Tables.Count; i++)
            Tables[i].SerializeTableData(info, context, i);
        return;
    }
    // 1.x code
}

Figure 4 SerializationFormat Values

Value Description
Xml Maintained for backward compatibility, serializes the DataSet object using an XML DiffGram format as in ADO.NET 1.x; this is the default value of the RemotingFormat property
Binary Instructs the internal serializer to use a true binary format when serializing the DataSet

It is interesting to measure the performance gain that you get from this new feature. Here's a simple technique you can easily reproduce. Fill a DataSet with the results of a query and persist it to a file on disk (see Figure 6). You can wrap the code in Figure 6 in either a Web Form or a Windows Form. Run the sample application and take a look at the size of the files created. Try first with a simple query like this:

SELECT lastname, firstname FROM employees

Figure 6 Testing Performance of Remoting Format

SqlDataAdapter adapter = new SqlDataAdapter(query, connString);
DataSet ds = new DataSet();
adapter.Fill(ds);

BinaryFormatter bin = new BinaryFormatter();

// Save as XML
using (StreamWriter writer1 = new StreamWriter(@"c:\xml.dat"))
{
    bin.Serialize(writer1.BaseStream, ds);
}

// Save as binary
using (StreamWriter writer2 = new StreamWriter(@"c:\bin.dat"))
{
    ds.RemotingFormat = SerializationFormat.Binary;
    bin.Serialize(writer2.BaseStream, ds);
}

If you're familiar with the Northwind database, you know that this query returns only nine records. Quite surprisingly, in this case the XML DiffGram is about half the size of the binary file! Don't worry; there's nothing wrong with the code or the underlying technology. To see the difference, run a query that returns a few thousand records, like this one:

SELECT * FROM [order details]

Now the binary file is about 10 times smaller than the XML file. With this evidence, look back at the graph in Figure 2. The yellow series shows the behavior of the DataSet binary serializer; fortunately, its curve rises very slowly as the size of the DataSet grows. The same can't be said for the DataSet XML serializer, whose curve swings sharply upward, with significantly worse absolute performance, once the number of rows exceeds a few thousand.

DataTable Enhancements

In ADO.NET 1.x, the DataTable class suffers from three main limitations. A DataTable can't be used with Web service methods and doesn't provide a direct way to serialize its contents. In addition, while DataTable instances can be passed across a remoting channel, strongly typed DataTable instances cannot be remoted. These limitations have been removed in ADO.NET 2.0. Let's take a look at exactly how this can be accomplished.

Return values and input parameters of Web service methods must be serializable through the XmlSerializer class. This class is responsible for translating .NET types into XML Schema Definition (XSD) types and vice versa. Unlike the runtime formatter classes (such as BinaryFormatter), the XmlSerializer class doesn't handle circular references. In other words, if you try to serialize a DataTable you get an error because the DataTable contains a reference to a DataSet—the DataSet property—and the DataSet, in turn, contains a reference to the DataTable through the Tables collection. For this reason, XmlSerializer raises an exception if you try to pass a DataTable to, or return a DataTable from, a Web service method. DataSet objects have the same circular reference, but unlike the DataTable class, DataSets implement the little-known IXmlSerializable interface.

The XmlSerializer class specifically looks for this interface when serializing an object. If the interface is found, the serializer yields control and waits for the WriteXml/ReadXml methods to terminate the serialization and deserialization process. The IXmlSerializable interface has the following three members:

XmlSchema GetSchema();
void ReadXml(XmlReader reader);
void WriteXml(XmlWriter writer);

In ADO.NET 2.0, the DataTable class fully supports the IXmlSerializable interface so you can finally use DataTables as input parameters or return values in Web service methods.
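For example, a Web method like the following, which would throw in ADO.NET 1.x, becomes perfectly legal; GetCustomersTable is a placeholder for your own data access code:

[WebMethod]
public DataTable GetCustomers()
{
    return GetCustomersTable();
}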

Note that the GetSchema method exists on IXmlSerializable purely for backward compatibility, and it is safe to choose a trivial implementation of GetSchema that returns null. However, if the class is being used in a Web service method's signature, an XmlSchemaProviderAttribute should be applied to the class to denote a static method on the class that can be used to get a schema. This XSD representation can then be mapped back to the actual object by implementing a SchemaImporterExtension.

In addition, the DataTable provides a simple XML I/O channel. A DataTable can be populated from an XML stream using the new ReadXml method and persisted to disk using the new WriteXml method. The methods have the same role and signature as the equivalent methods on the DataSet class. Last, but not least, a DataTable supports the RemotingFormat property and can be serialized in a true binary format.
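Here's a quick sketch of the table-level I/O, assuming the final API mirrors the corresponding DataSet methods as described above; the file name and the GetOrdersTable helper are made up for the example:

DataTable orders = GetOrdersTable();

// Persist the single table, schema included, and reload it elsewhere
orders.WriteXml(@"c:\orders.xml", XmlWriteMode.WriteSchema);
DataTable copy = new DataTable();
copy.ReadXml(@"c:\orders.xml");

// The binary remoting format is also available at the table level
orders.RemotingFormat = SerializationFormat.Binary;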

Note that the list of changes for the DataTable class doesn't end here. For example, the class now also integrates fully with streaming interfaces: it can be loaded from readers (including SQL data readers), and its contents can be read back through a reader. However, these changes don't directly (or necessarily) affect the building of the middle tiers of enterprise-class applications, and in this column I don't have the space to cover them or even the majority of the new ADO.NET 2.0 features.

Call to Action

One of the main ADO.NET 2.0 themes is the emphasis on improved performance and scalability. This encompasses a new, faster index engine to locate table records and the oft-requested binary serialization for DataSets. By adding the same capabilities to the DataTable class, I'd say the team exceeded developers' expectations by a wide margin.

So what's the best thing to do if you have an application that suffers from DataSet serialization performance problems? Once you upgrade to ADO.NET 2.0, fixing the problem is as easy as adding one line of code—the one that sets the RemotingFormat property on the DataSet. The fix is simple to apply and terrifically effective. Especially if you haven't made any progress on alternative routes (like compressors, surrogates, and the like), I suggest you stay put and start planning a full migration for when the new platform (or at least its Go-Live license) is available. If you absolutely need a fix today, I suggest that you choose the one that has the least impact on your system. My experience suggests that since most applications use typed DataSets, you can just modify the GetObjectData method of your typed DataSets to use surrogates (see Knowledge Base article 829740) or, more simply, zip the XML DiffGram. The performance improvement is immediate, though not as good as it could be. It's a workaround, after all.

ADO.NET 2.0 is evolution, not revolution. It's a fully backward-compatible data access platform in which the RemotingFormat property stands out as the feature that really fixes many existing enterprise applications, allowing Web services to bridge user interface components and business entities.

Send your questions and comments for Dino to cutting@microsoft.com.

Dino Esposito is a Wintellect instructor and consultant based in Italy. Author of Programming ASP.NET and the newest Introducing ASP.NET 2.0 (both from Microsoft Press), he spends most of his time teaching classes on ASP.NET and ADO.NET and speaking at conferences. Get in touch with Dino at cutting@microsoft.com or join the blog at https://weblogs.asp.net/despos.