Cutting Edge
Binary Serialization of DataSets
Dino Esposito

The ADO.NET DataSet object plays an essential role in most of today's distributed, multitiered applications. Instances of the DataSet class are used to move data across the tiers and to exchange data with external services. The DataSet has most of the features needed to represent real-world business entities.
First, the DataSet is a disconnected object: you can use it without a physical connection to the data source. In addition, the DataSet provides a rich programming interface. It supports multiple tables of data, accepts relations defined between pairs of tables, and allows you to enumerate its contents. Last but not least, the DataSet is a fully serializable object. It can be serialized in three different ways: through a standard .NET formatter, to an XML writer, and through the XML serializer.
For objects that represent the real-world entities of a multitiered app, serialization is key because the object must be serialized to be moved across tiers. This month, I'll discuss a problem with serialization in ADO.NET 1.x and how serialization in ADO.NET 2.0 will improve upon it. Back in the December 2002 installment of Cutting Edge, I reviewed the serialization capabilities of the DataSet object, which you might want to refer to for a little background.

The Serialization Problem Defined
I have a client with a classic three-tier application in which a number of user interface process components were moving data up and down to business components through a .NET Remoting channel. The runtime environment of .NET Remoting used a custom host application configured to work over binary serialization. The client chose binary serialization, thinking it was the fastest way to move data between processes. But their application was running slowly and experiencing problems moving data. Hence, the client put a sniffer on the network to look at the size of the data being moved back and forth and discovered that too many bytes were being transported; the actual size of the data was grossly exceeding the expected size. So what was going on?

How DataSet Serialization Really Works
Marked with the [Serializable] attribute, the DataSet object implements the ISerializable interface to gain full control over its serialization process. For objects that don't implement ISerializable, the .NET formatter applies a default, reflection-based serialization and guarantees that data is stored as a SOAP payload if the SoapFormatter is used or as a binary stream if the BinaryFormatter is used. The responsibility of the formatter, however, ends there if the serialized object supports ISerializable. In that case, the formatter passes an empty memory buffer (the SerializationInfo data structure) and waits for the serializee to populate it with data. The formatter's activities are limited to flushing this memory buffer to a binary stream or wrapping it up as a SOAP packet. However, there are no guarantees made about the type of the data added to the SerializationInfo object.
The following pseudocode shows how the DataSet serializes with a .NET formatter. The GetObjectData method is the sole method of the ISerializable interface.
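void GetObjectData(SerializationInfo info, StreamingContext context)
{
   // The schema travels as one XML string...
   info.AddValue("XmlSchema", this.GetXmlSchema());

   // ...and the data as a second one, in the verbose DiffGram format
   StringWriter strWriter = new StringWriter();
   this.WriteXml(strWriter, XmlWriteMode.DiffGram);
   info.AddValue("XmlDiffGram", strWriter.ToString());
}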
Regardless of the formatter being used, the DataSet always serializes first to XML. What's worse, the DataSet uses a pretty verbose schema—the DiffGram format plus any related schema information. Now take a DataSet with a few thousand records in it and imagine such a large chunk of text traveling over the network with no sort of optimization or compression (even blank spaces aren't removed). That's exactly the problem that I'm sure many of you have been called to solve at one time or another.
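You can check this behavior for yourself. The following is a minimal sketch, assuming the .NET Framework (where BinaryFormatter is available) and a made-up in-memory table in place of real query results. It serializes a DataSet with the binary formatter and then searches the resulting bytes for DiffGram markup.

```csharp
using System;
using System.Data;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;
using System.Text;

public static class XmlInsideBinaryDemo
{
    // Builds a one-table DataSet; table and column names are illustrative only.
    public static DataSet BuildSample(int rows)
    {
        DataSet ds = new DataSet("Sample");
        DataTable table = ds.Tables.Add("Employees");
        table.Columns.Add("ID", typeof(int));
        table.Columns.Add("LastName", typeof(string));
        for (int i = 0; i < rows; i++)
            table.Rows.Add(i, "Name" + i);
        ds.AcceptChanges();
        return ds;
    }

    // Serializes the DataSet with BinaryFormatter and returns the raw bytes.
    public static byte[] Serialize(DataSet ds)
    {
        using (MemoryStream stream = new MemoryStream())
        {
            new BinaryFormatter().Serialize(stream, ds);
            return stream.ToArray();
        }
    }

    // True if the "binary" payload carries the DiffGram as plain text.
    public static bool ContainsDiffGram(byte[] payload)
    {
        string text = Encoding.UTF8.GetString(payload);
        return text.IndexOf("<diffgr:diffgram", StringComparison.Ordinal) >= 0;
    }

    public static void Main()
    {
        byte[] payload = Serialize(BuildSample(1000));
        Console.WriteLine("Payload size in bytes: " + payload.Length);
        Console.WriteLine("DiffGram markup found: " + ContainsDiffGram(payload));
    }
}
```

The search works because the formatter writes the strings it receives through SerializationInfo as text, so the XML survives intact inside the binary stream.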

Avoiding the Problem
Until ADO.NET 2.0 the prognosis was mixed: there was both good and bad news. The bad news was that this problem is virtually unavoidable if you choose the DataSet as the implementor of your business entities. Alternative choices are raw XML, custom collections, and arrays of custom classes. All of these have pros and cons. All are serializable and enumerable data types. In addition, raw XML and custom types are better than the DataSet at interoperating with components running on platforms such as J2EE. This assumes you're employing Web services to bridge user interface process components to business entities (see Figure 1). For more information about Web services and DataSets, check out The XML Files column in the April 2003 issue of MSDN® Magazine.
Figure 1 User Interface 
If you modify your architecture to do without DataSets, you avoid the problem of serialization but lose some usability and simplicity. The DataSet is a great class to code against. Using it you can pass the full DataSet around, and then if needed, you can also easily excerpt only the changed rows in each table in order to minimize bandwidth and CPU usage for a call.
The good news is that there are some workarounds. To see the various options compared, check out the DataSet FAQ. One way to improve the end-to-end transfer speed is to override the DataSet serialization mechanism.
You can create your own DataSet-derived class and implement the members of the ISerializable interface to fulfill your performance and scalability requirements. As an alternative, you can generate a typed DataSet using Visual Studio® .NET and then modify its source code to reimplement the ISerializable interface as needed. A third option is to use a little-known feature of the .NET Framework—serialization surrogates. Jeffrey Richter provides excellent coverage of serialization surrogates in his September 2002 installment of the .NET column. Technically speaking, a serialization surrogate is a class that implements the ISerializationSurrogate interface. The interface consists of two members: GetObjectData and SetObjectData.
By using surrogates you can override the way types serialize themselves. The technique has a couple of interesting practical applications: serialization of unserializable types and deserialization of an object to a different version of its type. Surrogates work in cooperation with .NET runtime formatters to handle serialization and deserialization for a given type. For example, the BinaryFormatter class has a member named SurrogateSelector that installs a chain of surrogates for a variety of types. When instances of any of these types are going to be serialized or deserialized, the process won't go through the object's serialization interface (ISerializable or the reflection-based default algorithm) but rather takes advantage of the surrogate's capabilities using its ISerializationSurrogate interface. Here's an example of how to override the serialization of a DataSet using a custom surrogate DataSetSurrogate:
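// DataSetSurrogate is a custom class that implements ISerializationSurrogate
SurrogateSelector ss = new SurrogateSelector();
DataSetSurrogate dss = new DataSetSurrogate();
ss.AddSurrogate(typeof(DataSet), 
    new StreamingContext(StreamingContextStates.All), dss);

// Bind the selector to the formatter that will serialize the DataSet
BinaryFormatter formatter = new BinaryFormatter();
formatter.SurrogateSelector = ss;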
In the DataSetSurrogate class, you implement the GetObjectData and SetObjectData methods to work around the known performance limitations of the standard DataSet serialization. Next, you add an instance of the DataSet surrogate to a selector class. Finally, the selector class is bound to an instance of the formatter.
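To give an idea of what such a surrogate might look like, here is a minimal sketch, not the implementation from any Knowledge Base article. It assumes the .NET Framework 2.0 or later (GZipStream is not available earlier, and I'm relying on the formatter honoring the object returned from SetObjectData). The entry names "Schema" and "DiffGram" and the choice of gzip are my own assumptions for illustration.

```csharp
using System;
using System.Data;
using System.IO;
using System.IO.Compression;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Formatters.Binary;
using System.Text;

// Hypothetical surrogate: keeps the XML form but gzips it before it is stored.
public sealed class DataSetSurrogate : ISerializationSurrogate
{
    public void GetObjectData(object obj, SerializationInfo info, StreamingContext context)
    {
        DataSet ds = (DataSet)obj;
        StringWriter diffGram = new StringWriter();
        ds.WriteXml(diffGram, XmlWriteMode.DiffGram);
        info.AddValue("Schema", Compress(ds.GetXmlSchema()));
        info.AddValue("DiffGram", Compress(diffGram.ToString()));
    }

    public object SetObjectData(object obj, SerializationInfo info,
        StreamingContext context, ISurrogateSelector selector)
    {
        // obj was allocated without running a constructor, so a fresh
        // DataSet is built and returned in its place.
        DataSet ds = new DataSet();
        byte[] schema = (byte[])info.GetValue("Schema", typeof(byte[]));
        byte[] diffGram = (byte[])info.GetValue("DiffGram", typeof(byte[]));
        ds.ReadXmlSchema(new StringReader(Decompress(schema)));
        ds.ReadXml(new StringReader(Decompress(diffGram)), XmlReadMode.DiffGram);
        return ds;
    }

    private static byte[] Compress(string text)
    {
        using (MemoryStream output = new MemoryStream())
        {
            using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress))
            {
                byte[] raw = Encoding.UTF8.GetBytes(text);
                gzip.Write(raw, 0, raw.Length);
            }
            return output.ToArray();   // ToArray still works after the stream is closed
        }
    }

    private static string Decompress(byte[] data)
    {
        using (GZipStream gzip = new GZipStream(new MemoryStream(data), CompressionMode.Decompress))
        using (StreamReader reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}

public static class SurrogateDemo
{
    // Serializes and deserializes a DataSet through the surrogate.
    public static DataSet RoundTrip(DataSet ds)
    {
        SurrogateSelector ss = new SurrogateSelector();
        ss.AddSurrogate(typeof(DataSet),
            new StreamingContext(StreamingContextStates.All), new DataSetSurrogate());
        BinaryFormatter formatter = new BinaryFormatter();
        formatter.SurrogateSelector = ss;
        using (MemoryStream stream = new MemoryStream())
        {
            formatter.Serialize(stream, ds);
            stream.Position = 0;
            return (DataSet)formatter.Deserialize(stream);
        }
    }
}
```

The surrogate still pays the cost of producing and parsing the DiffGram; it only shrinks what travels over the wire. That trade-off is why I'd treat it as a workaround rather than a cure.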
Whichever approach you choose to improve the DataSet's serialization performance (a new class or a serialization surrogate), the end goal should be to reduce the amount of information being moved.
The DataSet serializes to an XML DiffGram—a rich XML schema that contains the current snapshot of the data as well as pending errors on table rows and a history of all the changes that occurred on the rows. The history is actually the original value of the row assigned when either the DataSet was created or the last time the AcceptChanges method was invoked.
Since the main problem is the size of the data, compression is one possible answer. The solution outlined in Knowledge Base article 829740 ("Improving DataSet Serialization and Remoting Performance") employs a DataSet-like class that is simply marked as Serializable and doesn't implement ISerializable. Other ADO.NET objects (DataTable, DataColumn) are also replaced to take full control of the serialization process. This simple change provides a double advantage: it reduces the amount of data being moved and reduces the pressure on the .NET Remoting framework to serialize and deserialize larger DataSet objects.
Another approach to the problem, one that is much more than just a smart way to serialize a DataSet, involves implementing a custom formatter, such as Angelo Scotto's CompactFormatter (available for download at http://www.freewebs.com/compactFormatter/About.html). The CompactFormatter is a generic formatter for both the Microsoft® .NET Framework and the .NET Compact Framework and is capable of producing an even more compact byte stream than the native BinaryFormatter class. In addition, it supports compression, which further reduces the amount of data being moved. More importantly, the CompactFormatter is not DataSet-specific and works with most .NET types.

Drilling Down a Little Further
All .NET distributed systems that make intensive use of disconnected data (as recommended by Microsoft architecture patterns and practices) are sensitive to the size of serialized data. The larger the DataSet, the more CPU cycles, memory, and bandwidth these systems consume. Fortunately, ADO.NET 2.0 provides a great fix. Before I explain it, though, I should make clear that I'm not saying DataSet XML serialization is a bad thing in general. Standard DataSet XML serialization is a "stateful" form of serialization in the sense that it maintains some state information such as pending errors and current changes to rows. It can be customized to a great extent by choosing the favorite combination of nodes and attributes; you can also decide if relations should simply be tracked or rendered by nesting child records under their parents.
What's really interesting is the fact that as long as few records (for example, fewer than 100) are involved in the serialization, the difference in performance between a standard DataSet and an optimized DataSet (custom DataSet, surrogate, compressed, whatever) is negligible. Therefore, I wouldn't spend much time implementing alternative forms of DataSet serialization if my application moves only small chunks of data.
Although performance worsens as the size of the DataSet grows, even for a thousand records the performance is not much of a problem. Note that performance here relates to .NET Remoting end-to-end speed, a parameter that includes the size of the data to transfer and the costs of successive instantiations. As you scale from a thousand to a few thousand records, the performance begins to suffer significantly. Figure 2 shows the difference in .NET Remoting end-to-end speed when XML (standard) and true binary serialization are used (the latter is new to ADO.NET 2.0; more on this in a moment). For the sake of precision, I have to say that the numbers behind the graph have been obtained with a Beta build of ADO.NET 2.0.
Figure 2 Remoting End-to-End Time 
Of course the worst scenario is when you move a large DataSet (thousands of rows) with numerous changes. If you're simply forwarding this DataSet to the data access layer (DAL) for applying changes, you can alleviate the issue by using the DataSet's GetChanges method. This method returns a new DataSet that contains only the rows in the various tables that have been modified, added, or deleted.
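Here's a short sketch of the GetChanges technique, using a made-up in-memory table rather than real query results. Out of roughly a thousand rows, only the three that were touched end up in the DataSet you would actually send to the DAL.

```csharp
using System;
using System.Data;

public static class GetChangesDemo
{
    // Builds a 1,000-row table, then modifies, deletes, and adds one row each.
    // Table and column names are illustrative only.
    public static DataSet BuildAndModify()
    {
        DataSet ds = new DataSet("Orders");
        DataTable table = ds.Tables.Add("OrderDetails");
        table.Columns.Add("OrderID", typeof(int));
        table.Columns.Add("Quantity", typeof(int));
        for (int i = 0; i < 1000; i++)
            table.Rows.Add(i, 1);
        ds.AcceptChanges();                 // every row is now Unchanged

        table.Rows[0]["Quantity"] = 5;      // Modified
        table.Rows[1].Delete();             // Deleted
        table.Rows.Add(1000, 2);            // Added
        return ds;
    }

    public static void Main()
    {
        DataSet full = BuildAndModify();

        // GetChanges returns null when nothing has changed, so check first.
        DataSet delta = full.HasChanges() ? full.GetChanges() : null;
        if (delta != null)
        {
            foreach (DataRow row in delta.Tables["OrderDetails"].Rows)
                Console.WriteLine(row.RowState);
        }
    }
}
```

The returned DataSet keeps each row's state and original values, which is exactly what a data adapter needs to generate the right INSERT, UPDATE, and DELETE commands.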
However, when you're moving a DataSet from tier to tier, or from a business component to an external service that implements a business process, you have no choice but to pass a stateful representation of the data—the whole DataSet with its own set of relations, pending changes, and errors.
Why does the end-to-end speed slow down once you reach a certain threshold? Handling large DataSets poses an additional problem to the .NET Remoting infrastructure aside from the time and space needed to complete the operation. This extra problem has to do with the specific algorithm used to serialize and, more importantly, deserialize DataSets saved as XML DiffGrams. To restore a DataSet from a DiffGram, the binary formatter invokes a protected constructor on the DataSet class (part of the ISerializable implementation). The pseudocode of this DataSet constructor is shown in Figure 3. As you can see, the DataSet's ReadXml method is called to process the DiffGram. Especially for large DataSets, ReadXml works by creating lots of transient, short-lived objects, a few for each row to be processed. This mechanism puts a lot of additional pressure on the .NET Remoting infrastructure and consumes a lot of memory and CPU cycles. Exactly how bad it is will depend on the runtime conditions and hardware equipment of the system. This explains why certain clients experience visible problems when moving only 7MB of data while others won't see problems until they've moved 20MB of data.
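The deserialization steps just described can be approximated with public DataSet methods. The following is a sketch under that assumption, not the actual constructor code: it captures the schema and DiffGram strings the way GetObjectData does, then rebuilds a DataSet from them with ReadXmlSchema and ReadXml. The helper names are mine.

```csharp
using System;
using System.Data;
using System.IO;

public static class DiffGramRoundTrip
{
    // Produces the DiffGram text for a DataSet.
    public static string ToDiffGram(DataSet ds)
    {
        StringWriter writer = new StringWriter();
        ds.WriteXml(writer, XmlWriteMode.DiffGram);
        return writer.ToString();
    }

    // Rebuilds a DataSet from a schema string and a DiffGram string.
    public static DataSet Restore(string schema, string diffGram)
    {
        DataSet ds = new DataSet();
        ds.ReadXmlSchema(new StringReader(schema));                 // structure first
        ds.ReadXml(new StringReader(diffGram), XmlReadMode.DiffGram); // then data and row states
        return ds;
    }

    // Builds a tiny DataSet with one pending change; names are illustrative.
    public static DataSet BuildSample()
    {
        DataSet ds = new DataSet("Sample");
        DataTable table = ds.Tables.Add("Employees");
        table.Columns.Add("LastName", typeof(string));
        table.Rows.Add("Davolio");
        ds.AcceptChanges();
        table.Rows[0]["LastName"] = "Fuller";   // pending modification
        return ds;
    }

    public static void Main()
    {
        DataSet source = BuildSample();
        DataSet copy = Restore(source.GetXmlSchema(), ToDiffGram(source));
        DataRow row = copy.Tables["Employees"].Rows[0];
        Console.WriteLine("State: " + row.RowState);
        Console.WriteLine("Current: " + row["LastName"]);
        Console.WriteLine("Original: " + row["LastName", DataRowVersion.Original]);
    }
}
```

Every row in the DiffGram has to be parsed back from text this way, which gives a feel for where the extra memory and CPU pressure comes from on large DataSets.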

DataSet Serialization in ADO.NET 2.0
When you upgrade to ADO.NET 2.0, your problems will be solved. In ADO.NET 2.0, the DataSet class provides a new serialization option specifically designed to optimize remoting serialization. As a result, remoting a DataSet uses less memory and bandwidth. The end-to-end latency is greatly improved, as Figure 2 illustrates.
In ADO.NET 2.0, the DataSet and DataTable come with a new property named RemotingFormat, which is of type SerializationFormat (see Figure 4). By default, the new property is set to SerializationFormat.Xml to preserve backward compatibility. The property affects the behavior of the GetObjectData member of the ISerializable interface and ultimately is a way to control the serialization of the DataSet. The following code snippet represents the pseudocode of the method in ADO.NET 2.0:
void GetObjectData(SerializationInfo info, StreamingContext context)
{
  SerializationFormat fmt = RemotingFormat;
  SerializeDataSet(info, context, fmt);
}
SerializeDataSet is a helper function that fills the SerializationInfo memory buffer with data that represents the DataSet. If the format parameter is SerializationFormat.Binary, the function goes through every object in the DataSet and copies its contents into a serializable structure—mostly ArrayLists and Hashtables. Figure 5 shows the pseudocode. If you compare that to the code in the aforementioned Knowledge Base article 829740, you'll find many similarities. Note, though, that the code in Figure 5 is based on Beta 1 and might change in the future. What shouldn't change, though, is the idea behind the binary serialization. To serialize a DataSet in a true binary fashion, here's what you do:
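DataSet ds = GetData();
ds.RemotingFormat = SerializationFormat.Binary;   // the default is Xml
BinaryFormatter bin = new BinaryFormatter();
using (MemoryStream stream = new MemoryStream())  // any writable stream works
{
    bin.Serialize(stream, ds);
}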
Aside from the new RemotingFormat property, there's nothing new in this code. The impact of this new feature on existing code is minimal.
It is interesting to measure the performance gain that you get from this new feature. Here's a simple technique you can easily reproduce. Fill a DataSet with the results of a query and persist it to a file on disk (see Figure 6). You can wrap the code in Figure 6 in either a Web Form or a Windows Form. Run the sample application and take a look at the size of the files created. Try first with a simple query like this:
SELECT lastname, firstname FROM employees
If you're familiar with the Northwind database, you know that this query returns only nine records. Quite surprisingly, in this case the XML DiffGram is about half the size of the binary file! Don't worry, there's nothing wrong in the code or in the underlying technology. To see the difference, run a query that will return a few thousand records, like this one:
SELECT * FROM [order details] 
Now the binary file is about 10 times smaller than the XML file. With this evidence, look back at the graph in Figure 2. The yellow series shows the end-to-end time of the DataSet binary serializer, which fortunately increases very slowly as the size of the DataSet grows. The same can't be said for the DataSet XML serializer, which shows a sudden upswing and significantly worse absolute performance once the number of rows exceeds a few thousand.
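If you don't have Northwind at hand, the same comparison can be sketched entirely in memory. The code below is an approximation of the experiment, assuming the .NET Framework 2.0 or later and a made-up table in place of the query results; it measures the serialized size for each RemotingFormat value rather than writing files to disk. The exact numbers will depend on your data.

```csharp
using System;
using System.Data;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

public static class RemotingFormatSizes
{
    // Builds a one-table DataSet; names and values are illustrative only.
    public static DataSet BuildSample(int rows)
    {
        DataSet ds = new DataSet("Sample");
        DataTable table = ds.Tables.Add("OrderDetails");
        table.Columns.Add("OrderID", typeof(int));
        table.Columns.Add("ProductName", typeof(string));
        table.Columns.Add("Quantity", typeof(int));
        for (int i = 0; i < rows; i++)
            table.Rows.Add(i, "Product" + (i % 50), i % 10);
        ds.AcceptChanges();
        return ds;
    }

    // Returns the number of bytes BinaryFormatter emits for the given format.
    public static long SerializedSize(DataSet ds, SerializationFormat format)
    {
        ds.RemotingFormat = format;
        using (MemoryStream stream = new MemoryStream())
        {
            new BinaryFormatter().Serialize(stream, ds);
            return stream.Length;
        }
    }

    public static void Main()
    {
        foreach (int rows in new int[] { 9, 5000 })
        {
            DataSet ds = BuildSample(rows);
            long xml = SerializedSize(ds, SerializationFormat.Xml);
            long binary = SerializedSize(ds, SerializationFormat.Binary);
            Console.WriteLine("{0} rows: XML = {1} bytes, binary = {2} bytes",
                rows, xml, binary);
        }
    }
}
```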

DataTable Enhancements
In ADO.NET 1.x, the DataTable class suffers from three main limitations. A DataTable can't be used with Web service methods and doesn't provide a direct way to serialize its contents. In addition, while DataTable instances can be passed across a remoting channel, strongly typed DataTable instances cannot be remoted. These limitations have been removed in ADO.NET 2.0. Let's take a look at exactly how this can be accomplished.
Return values and input parameters of Web service methods must be serializable through the XmlSerializer class. This class is responsible for translating .NET types into XML Schema Definition (XSD) types and vice versa. Unlike the runtime formatter classes (such as BinaryFormatter), the XmlSerializer class doesn't handle circular references. In other words, if you try to serialize a DataTable you get an error because the DataTable contains a reference to a DataSet—the DataSet property—and the DataSet, in turn, contains a reference to the DataTable through the Tables collection. For this reason, XmlSerializer raises an exception if you try to pass a DataTable to, or return a DataTable from, a Web service method. DataSet objects have the same circular reference, but unlike the DataTable class, DataSets implement the little-known IXmlSerializable interface.
The XmlSerializer class specifically looks for this interface when serializing an object. If the interface is found, the serializer yields control and waits for the WriteXml/ReadXml methods to terminate the serialization and deserialization process. The IXmlSerializable interface has the following three members:
XmlSchema GetSchema();
void ReadXml(XmlReader reader);
void WriteXml(XmlWriter writer);
In ADO.NET 2.0, the DataTable class fully supports the IXmlSerializable interface so you can finally use DataTables as input parameters or return values in Web service methods.
Note that the GetSchema method exists on IXmlSerializable purely for backward compatibility, and it is safe to choose a trivial implementation of GetSchema that returns null. However, if the class is being used in a Web service method's signature, an XmlSchemaProviderAttribute should be applied to the class to denote a static method on the class that can be used to get a schema. This XSD representation can then be mapped back to the actual object by implementing a SchemaImporterExtension. For more information on XmlSchemaProviderAttribute and SchemaImporterExtension, see New Features for Web Service Developers in Beta 1 of the .NET Framework 2.0.
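To make the contract concrete, here is a small sketch of a class that implements IXmlSerializable and round-trips through XmlSerializer. The Temperature type and its attribute name are invented for illustration and have nothing to do with ADO.NET itself; GetSchema uses the trivial implementation mentioned above.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;
using System.Xml.Serialization;

// Hypothetical type that takes over its own XML representation.
public class Temperature : IXmlSerializable
{
    public double Celsius;

    // Trivial implementation, as discussed in the text.
    public XmlSchema GetSchema()
    {
        return null;
    }

    // The serializer has already written the wrapper element; add content only.
    public void WriteXml(XmlWriter writer)
    {
        writer.WriteAttributeString("celsius", XmlConvert.ToString(Celsius));
    }

    // The reader is positioned on the wrapper element, which must be consumed.
    public void ReadXml(XmlReader reader)
    {
        Celsius = XmlConvert.ToDouble(reader.GetAttribute("celsius"));
        bool isEmpty = reader.IsEmptyElement;
        reader.ReadStartElement();
        if (!isEmpty)
            reader.ReadEndElement();
    }
}

public static class XmlSerializableDemo
{
    // Serializes to a string and back through XmlSerializer.
    public static Temperature RoundTrip(Temperature value)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(Temperature));
        StringWriter writer = new StringWriter();
        serializer.Serialize(writer, value);
        return (Temperature)serializer.Deserialize(new StringReader(writer.ToString()));
    }

    public static void Main()
    {
        Temperature t = new Temperature();
        t.Celsius = 21.5;
        Console.WriteLine(RoundTrip(t).Celsius);
    }
}
```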
In addition, the DataTable provides a simple XML I/O channel. A DataTable can be populated from an XML stream using the new ReadXml method and persisted to disk using the new WriteXml method. The methods have the same role and signature as the equivalent methods on the DataSet class. Last, but not least, a DataTable supports the RemotingFormat property and can be serialized in a true binary format.
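Both capabilities can be sketched in a few lines. The code below assumes the .NET Framework 2.0 or later and uses an invented Employees table. It writes a standalone DataTable to XML with an inline schema and reads it back, then serializes the same table in binary format.

```csharp
using System;
using System.Data;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

public static class DataTableSerializationDemo
{
    // A standalone table; it needs a name before it can be written as XML.
    public static DataTable BuildSample()
    {
        DataTable table = new DataTable("Employees");
        table.Columns.Add("LastName", typeof(string));
        table.Columns.Add("FirstName", typeof(string));
        table.Rows.Add("Davolio", "Nancy");
        table.Rows.Add("Fuller", "Andrew");
        table.AcceptChanges();
        return table;
    }

    // Writes the table with an inline schema and loads it into a new table.
    public static DataTable XmlRoundTrip(DataTable source)
    {
        StringWriter xml = new StringWriter();
        source.WriteXml(xml, XmlWriteMode.WriteSchema);
        DataTable copy = new DataTable();
        copy.ReadXml(new StringReader(xml.ToString()));
        return copy;
    }

    // Serializes the table in true binary format and deserializes it.
    public static DataTable BinaryRoundTrip(DataTable source)
    {
        source.RemotingFormat = SerializationFormat.Binary;
        BinaryFormatter formatter = new BinaryFormatter();
        using (MemoryStream stream = new MemoryStream())
        {
            formatter.Serialize(stream, source);
            stream.Position = 0;
            return (DataTable)formatter.Deserialize(stream);
        }
    }

    public static void Main()
    {
        DataTable table = BuildSample();
        Console.WriteLine("Rows after XML round trip: " + XmlRoundTrip(table).Rows.Count);
        Console.WriteLine("Rows after binary round trip: " + BinaryRoundTrip(table).Rows.Count);
    }
}
```

I write the schema inline because a DataTable, unlike a DataSet, can't infer its structure from the data alone when it reads XML.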
Note that the list of changes for the DataTable class doesn't end here. For example, the class also features full integration with streaming interfaces and can be loaded from readers (including SQL DataReaders) and its contents can be read through DataReaders. However, these changes don't directly (or necessarily) affect the building of middle tiers of enterprise class applications and in this column I don't have the space to cover them or even the majority of the new ADO.NET 2.0 features.
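Just to give a taste of the streaming integration, here is a brief sketch that reads one table through a DataTableReader and loads another from it. The sample data is invented; with a real database you would pass a SqlDataReader to Load instead.

```csharp
using System;
using System.Data;

public static class DataTableReaderDemo
{
    // Copies a table by streaming it through a reader.
    public static DataTable CopyThroughReader(DataTable source)
    {
        DataTable copy = new DataTable(source.TableName);
        using (DataTableReader reader = source.CreateDataReader())
        {
            copy.Load(reader);   // schema is taken from the reader when the table is empty
        }
        return copy;
    }

    public static void Main()
    {
        DataTable source = new DataTable("Employees");
        source.Columns.Add("LastName", typeof(string));
        source.Rows.Add("Davolio");
        source.Rows.Add("Fuller");

        DataTable copy = CopyThroughReader(source);
        foreach (DataRow row in copy.Rows)
            Console.WriteLine(row["LastName"]);
    }
}
```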

Call to Action
One of the main ADO.NET 2.0 themes is the emphasis on improved performance and scalability. This encompasses a new, faster index engine to locate table records and the oft-requested binary serialization for DataSets. By adding the same capabilities to the DataTable class, I'd say the team exceeded developers' expectations by a wide margin.
So what's the best thing to do if you have an application that suffers from DataSet serialization performance problems? Once you upgrade to ADO.NET 2.0, fixing the problem is as easy as adding one line of code: the one that sets the RemotingFormat property on the DataSet. The fix is simple to apply and terrifically effective. Especially if you haven't made any progress on alternative routes (like compressors, surrogates, and the like), I suggest you stay put and start planning a full migration for when the new platform (or at least its Go-Live license) is available. If you absolutely need a fix today, I suggest that you choose the one that has the least impact on your system. My experience suggests that since most applications use typed DataSets, you can just modify the GetObjectData method of your typed DataSets to use surrogates (see Knowledge Base article 829740) or, more simply, zip the XML DiffGram. The performance improvement is immediate, though not as good as it could be. It's a workaround after all.
ADO.NET 2.0 is evolution, not revolution. It's a fully backward-compatible data access platform in which the RemotingFormat property stands out as the feature that really fixes many existing enterprise applications, allowing Web services to bridge user interface components and business entities.

Send your questions and comments for Dino to cutting@microsoft.com.


Dino Esposito is a Wintellect instructor and consultant based in Italy. Author of Programming ASP.NET and the newest Introducing ASP.NET 2.0 (both from Microsoft Press), he spends most of his time teaching classes on ASP.NET and ADO.NET and speaking at conferences. Get in touch with Dino at cutting@microsoft.com or join the blog at http://weblogs.asp.net/despos.

© 2008 Microsoft Corporation and CMP Media, LLC. All rights reserved; reproduction in part or in whole without permission is prohibited.