Export (0) Print
Expand All
4 out of 8 rated this helpful - Rate this topic

XML Features in ADO.NET

 

Dino Esposito
Wintellect

December 13, 2001

XML and some of its related technologies, including XPath, XSL Transformation, and XML Schema, are unquestionably at the foundation of ADO.NET. As a matter of fact, XML constitutes the key element for the greatly improved interoperability of the ADO.NET object model when compared to ADO. In ADO, XML was merely an (non-default) I/O format used to persist the content of a disconnected recordset. The participation of XML in the building and in the inter-workings of ADO.NET is much deeper. The aspects of ADO.NET where the interaction and the integration with XML is stronger can be summarized in the following points:

  • Objects serialization and remoting
  • A dual programming interface
  • XML-driven batch update (for SQL Server 2000 only)

In ADO.NET, you have several options to save objects to, and restore objects from, XML documents. To say it all, this ability belongs to one object only—the DataSet—but can be extended to other container objects with minimal coding. Saving objects like DataTable and DataView to XML is substantially seen as a special case of the DataSet serialization.

Furthermore, ADO.NET and XML classes provide for a sort of unified intermediate API that is made available to programmers through a dual and synchronized programming interface. You can access and update data using either the hierarchical and node-based approach of XML ,or the relational approach of column-based tabular data sets. At any time, you can switch from a DataSet representation of the data to XMLDOM, and vice versa. Data is synchronized and any change you enter in either model is immediately reflected and visible in the other. In this article, I'll cover ADO.NET-to-XML serialization and XML data access—that is, the first two points in the list above. Next month, I'll attack XML-driven batch update—one of the coolest features you get from SQL Server 2000 XML Extensions (SQLXML 2.0).

DataSet and XML

Just like any other .NET object, the DataSet object is stored in memory in a binary format. Unlike other objects, though, the DataSet is always remoted and serialized in a special XML format called the DiffGram. When the DataSet trespasses the boundaries of the app-domains, or the physical borders of the machine, it is automatically rendered as a DiffGram. At destination, the DataSet is silently rebuilt as a binary and promptly usable object. The same serialization facilities are available to applications through a bunch of methods, a pair of which clearly stands out. They are ReadXml and WriteXml. The table below presents the DataSet's methods you can use to work with XML, both in reading and in writing.

Methods Description
GetXml
  • Returns a string that is the XML representation of the data stored in the DataSet
  • No schema information is included
GetXmlSchema
  • Returns a string that is the XML schema information for the DataSet
ReadXml
  • Populates a DataSet object with the specified XML data read from a stream or a file
ReadXmlSchema
  • Loads the specified XML schema information into the current DataSet object
WriteXml
  • Writes the XML data, and optionally the schema, that represents the DataSet
  • Can write to a stream or a file
WriteXmlSchema
  • Writes the string being the XML schema information for the DataSet
  • Can write to a stream or a file

As the table shows, when working with DataSet and XML you can manage data and schema information as distinct entities. You can take the XML schema out of a DataSet and use it as a string. Alternately, you could write it to a disk file or load it into an empty DataSet object. Side by side with the methods listed in the table above, the DataSet object also features two XML-related properties—Namespace and Prefix. Namespace determines the XML namespace used to scope XML attributes and elements when you read them into a DataSet. The prefix to alias the namespace is stored in the Prefix property.

Building a DataSet from XML

The ReadXml method fills out a DataSet object reading from a variety of sources, including disk files, .NET streams, or instances of XmlReader objects. The method can process any type of XML file, but, of course, XML files having a non-tabular and rather irregularly shaped structure may create some problems when rendered in terms of rows and columns.

The ReadXml method has several overloads, all of which are rather similar. They take the XML source plus an optional XmlReadMode value as arguments. For example:

public XmlReadMode ReadXml(String, XmlReadMode);

The method creates the relational schema for the DataSet depending on the read mode specified, and whether or not a schema already exists in the DataSet. The following code snippet illustrates the typical code you would use to load a DataSet from XML.

StreamReader sr = new StreamReader(fileName);
DataSet ds = new DataSet();
ds.ReadXml(sr);   // defaults to XmlReadMode.Auto
sr.Close();

When loading the contents of XML sources into a DataSet, ReadXml does not merge rows whose primary key information match. To merge an existing DataSet with one loaded from XML, you first have to create a new DataSet, and then merge the two using the Merge method. During the merging, the rows that get overwritten are those with matching primary keys. An alternate way to merge existing DataSet objects with contents read from XML is through the DiffGram format (more on this later).

The table below illustrates the various read modes that ReadXml supports. You can set them using the XmlReadMode enumeration.

Read Mode Description
IgnoreSchema Ignores any inline schema and relies on the DataSet's existing schema
ReadSchema Reads any inline schema and loads the data and schema
InferSchema Ignores any inline schema and infers the schema from the XML data
DiffGram Reads a DiffGram and adds the data to the current schema
Fragment Reads and adds XML fragments until the end of the stream is reached

The default read mode is not listed in the table and is XmlReadMode.Auto. When this mode is set, or when no read mode has been explicitly set, the ReadXml method examines the XML source and chooses the most appropriate option.

If the XML source turns out to be a DiffGram, then it is loaded as such. If the source happens to contain an inline schema, or a reference to an external schema, it is loaded using ReadXmlSchema. Finally, if no schema information is available in the XML source, then the ReadXml method infers the schema using the DataSet's InferXmlSchema method. The relational structure, or schema, of a DataSet is made up of tables, columns, constraints, and relations. Let's review what exactly happens when each of the modes is set.

The XmlReadMode.IgnoreSchema option causes the method to ignore any inline or referenced schema. So, the data is loaded into the existing DataSet schema and any data that does not fit in it is discarded. If no schema exists in the DataSet, no data is loaded. Notice that an empty DataSet has no schema information. Bear in mind that if the XML source is in a DiffGram format, then the IgnoreSchema option has the same effect as XmlReadMode.DiffGram.

// no schema in the DataSet, no data will be loaded
DataSet ds = new DataSet();
StreamReader sr = new StreamReader(fileName);
ds.ReadXml(sr, XmlReadMode.IgnoreSchema);

The XmlReadMode.ReadSchema option works only with inline schema and does not recognize external references. It can add new tables to the DataSet, but if any tables defined in the inline schema already exist in the DataSet, an exception is thrown. You cannot use the ReadSchema option to change the schema of an existing table. If the DataSet does not contain a schema (that is, it's empty), and there is no inline schema, no data is read and loaded. ReadXml can only read inline schemas defined using the XML Schema definition language (XSD) or XML-Data Reduced (XDR). No document type definition (DTD) is supported.

When the XmlReadMode.InferSchema option is set, ReadXml infers the schema directly from the structure of the XML data and ignores any inline schema that may be present. The data is loaded only when the schema has been inferred. Existing schemas are extended by adding new tables, or by adding new columns to existing tables, as appropriate. You can use the DataSet's InferXmlSchema method to load the schema from the specified XML file into the DataSet. You can control, to some extent, the XML elements processed during the schema inference operation. The signature of the method InferXmlSchema allows you to specify an array of namespaces whose elements will be excluded from inference.

void InferXmlSchema(String fileName, String[] rgNamespace);

A DiffGram is an XML format that ADO.NET utilizes to persist the state of a DataSet. Similar to the SQLXML's updategram format, the DiffGram contains both current and original versions of data rows. Loading a DiffGram using ReadXml will merge rows that have the matching primary keys. You explicitly instruct ReadXml to work on a DiffGram using the XmlReadMode.DiffGram flag. When using the DiffGram format, the target DataSet must have the same schema as the DiffGram, otherwise the merge operation fails and an exception is thrown.

When the XmlReadMode.Fragment option is set, the DataSet is loaded from an XML fragment. An XML fragment is a valid piece of XML that identifies elements, attributes, and documents. The XML fragment for an element is the markup text that fully qualifies the XML element (node, CDATA, processing instruction, comment). The fragment for an attribute is the attribute value, and for a document is the entire content set. When the XML data is a fragment, the root level rules for well-formed XML documents are not applied. Fragments that match the existing schema are appended to the appropriate tables and fragments that do not match the schema are discarded. ReadXml reads from the current position to the end of the stream. The XmlReadMode.Fragment option should not be used to populate an empty, and subsequently schema-less, DataSet.

Serializing DataSet Objects to XML

The XML representation of the DataSet can be written to a file, a stream, an XmlWriter object, or a string, using the WriteXml method. The XML representation can include, or not include, schema information. The actual behavior of the WriteXml method can be controlled through the optional XmlWriteMode parameter you can pass. The values in the XmlWriteMode enum determine the output's layout. The DataSet representation includes tables, relations, and constraints definitions. The rows in the DataSet's tables are written in their current versions unless you choose to employ the DiffGram format. The table below summarizes the writing options available with XmlWriteMode.

Write Mode Description
IgnoreSchema Writes the contents of the DataSet as XML data without schema
WriteSchema Writes the contents of the DataSet with an inline XSD schema
DiffGram Writes the contents of the DataSet as a DiffGram, including original and current values

XmlWriteMode.IgnoreSchema is the default option. The following code shows the typical way to serialize a DataSet to XML.

// ds is the DataSet
StreamWriter sw = new StreamWriter(fileName);
ds.WriteXml(sw);   // defaults to XmlWriteMode.IgnoreSchema
sw.Close();   

A few factors influence the final structure of the XML document created from the contents of a DataSet object. They are;

  • The overall XML format used—DiffGram or plain hierarchical representation of the current contents
  • The presence of schema information
  • Nested relations
  • How table columns are mapped to XML elements

The DiffGram format is a special XML format that I'll describe more in depth in a moment. It does not include schema information, but preserves row state and row errors. As such, it seems able to constitute a closer representation of the living instance of the DataSet.

Schema information, if present in the DataSet being created, is always written as inline XSD. There's no way for you to write it as XDR, DTD, or add a reference to an external file. The root node of the generated XML file takes the name of the DataSet, or NewDataSet, if no name has been specified. The following code snippet is an example of the XML representation of a DataSet object made of two tables, Customers and Orders, put into relationship on the CustomerID field.

<MyDataSet>
<xs:schema ... />
<Customers>
   <CustomerID>1</CustomerID>
   <FName>John</FName>
   <LName>Smith</LName>
</Customers>
<Customers>
   <CustomerID>2</CustomerID>
   <FName>Joe</FName>
   <LName>Users</LName>
</Customers>
<Orders>
   <CustomerID>1</CustomerID>
   <OrderID>000A01</OrderID>
</Orders>
<Orders>
   <CustomerID>1</CustomerID>
   <OrderID>000B01</OrderID>
</Orders>
</MyDataSet>

From the listing above, you can hardly say that the two tables are in relation. Some information about this is set in the <xs:schema> tree, but aside from this, nothing else would hint toward that conclusion. A relation set on the CustomerID field put down in words sounds like this—all the orders issued by a given customer. The XML tree above does not provide an immediate representation for this information. To change the order of the nodes when a data relation is present in the DataSet, you can set the Nested attribute of the DataRelation object to true. As a result of this change, the XML code from above changes as follows:

<MyDataSet>
<xs:schema ... />
<Customers>
   <CustomerID>1</CustomerID>
   <FName>John</FName>
   <LName>Smith</LName>
<Orders>
   <CustomerID>1</CustomerID>
   <OrderID>000A01</OrderID>
</Orders>
<Orders>
   <CustomerID>1</CustomerID>
   <OrderID>000B01</OrderID>
</Orders>
</Customers>
<Customers>
   <CustomerID>2</CustomerID>
   <FName>Joe</FName>
   <LName>Users</LName>
</Customers>
</MyDataSet>

As you can see, all the orders are now grouped under the corresponding customer subtree.

By default, in XML table columns are rendered as node elements. However, this is only a setting that can be adjusted on a per column basis. The DataColumn object has a property called ColumnMapping that determines how that column will be rendered in XML. The ColumnMapping property takes values from the MappingType enum listed below.

Mapping Description
Element Mapped to an XML node element:

<CustomerID>value</CustomerID>

Attribute Mapped to an XML node attribute:

<Customers CustomerID=value>

Hidden Not displayed in the XML data unless the DiffGram format is used
SimpleContent Mapped to simple text:

<Customers>value</Customers>

If the XML output format is the DiffGram, then the Hidden mapping type is ignored. In this case, though, the DiffGram representation of the column features a special attribute that marks the column as originally hidden from XML serialization. The SimpleContent mapping type is not always available and can be used only if the table has one column.

The DiffGram Format

A DiffGram is simply an XML string written according to a particular schema that represents the contents of a DataSet. It is in no way a .NET type. The following code snippet shows how to serialize a DataSet object to a DiffGram.

StreamWriter sw = new StreamWriter(fileName);
ds.WriteXml(sw, XmlWriteMode.DiffGram);
sw.Close();

The resulting XML code is rooted in the <diffgr:diffgram> node, and contains up to three distinct sections of data, as shown below:

<diffgr:diffgram>
   <MyDataSet>
   :
   </MyDataSet>

   <diffgr:before>
   :
   </diffgr:before>

   <diffgr:errors>
   :
   </diffgr:errors>
</diffgr:diffgram>

The first section of the DiffGram is mandatory and represents the current instance of the data. It is nearly identical to the XML output you can get from ordinary serialization. The major difference between the two is that the DiffGram format never includes schema information.

The data section includes the current values of the rows in the DataSet. The original rows, including deleted rows, are stored in the <diffgr:before> section. Only modified or deleted records are listed here. Newly added records are only listed in the data instance as they have no preceding reference to which to link. The rows in the two sections are tracked using a unique ID. These rows represent the delta between the original and the current version of the DataSet.

Finally, in the <diffgr:errors> section, any message that relates to pending errors on rows is listed . Also in this case, rows are tracked using the same unique ID discussed for changes. DiffGram nodes can be flagged with special attributes to relate elements across different sections—data instance, changes, and errors.

Attribute Description
diffgr:hasChanges The row has been modified (see related row in <diffgr:before>) or inserted.
diffgr:hasErrors The row has an error (see related row in <diffgr:errors>).
diffgr:id Identifies the ID used to couple rows across sections: TableName+RowIdentifier.
diffgr:parentId Identifies the ID used to identify the parent of the current row.
diffgr:error Contains the error text for the row in <diffgr:errors>.
msdata:rowOrder Tracks the ordinal position of the row in the DataSet.
msdata:hidden Identifies columns marked as hidden msdata:hiddenColumn=…

The ADO.NET Framework provides explicit XML support only for the DataSet object. However, converting a DataView or a DataTable to XML is not particularly hard. In both cases, you have to use a temporary DataSet as the container for the set of rows you want to save as XML. The code necessary to save a DataTable to XML is simple.

void WriteDataTableToXml(String fileName, DataTable dt)
{
   // Duplicate the table and add it to a temporary DataSet
   DataSet dsTmp = new DataSet();
   DataTable dtTmp = dt.Copy();
   dsTmp.Tables.Add(dtTmp);

   // Save the temporary DataSet to XML
   StreamWriter sr = new StreamWriter(fileName);
   dsTmp.WriteXml(sr);
   sr.Close();
}

It's important that you duplicate the DataTable object for the simple reason that each ADO.NET object can be referenced only by one container object. You cannot have the same instance of, say, a DataTable object belonging to two distinct DataSet objects.

Unlike the DataTable object, the DataView is not a standard part of a DataSet, so in order to save it to XML you should convert the DataView to a table object. This is demonstrated by the following code:

void DataViewToDataTable(DataView dv)
{
// Clone the structure of the table behind the view
DataTable dtTemp = dv.Table.Clone();
   dtTemp.TableName = "Row";   // this is arbitrary!

   // Populate the table with rows in the view
   foreach(DataRowView drv in dv)
      dtTemp.ImportRow(drv.Row);

   // giving a custom name to the DataSet can help to 
   // come up with a clearer layout but is not mandatory
   DataSet dsTemp = new DataSet(dv.Table.TableName);   

   // Add the new table to a temporary DataSet
   dsTemp.Tables.Add(dtTemp);
}

As a first step, you clone the structure of the table lying behind the DataView object that is being processed. Next, you walk through all the records in this view and add the corresponding rows to a temporary DataTable. Then this DataTable is added to a temporary DataSet and serialized. You can also manage to give the DataSet the table name and a customized format to the whole XML output. For example:

<TableName>
   <Row>
      <Column1>…</Column1>
      :
   </Row>
   <Row>
   :
   </Row>
   <Row>
   :
</Row>
</TableName>

The XmlDataDocument Class

XML and ADO.NET Frameworks offer a unified model to access data represented as both XML and relational data. The key XML class for this is XmlDataDocument, whereas the DataSet is the key ADO.NET class. XmlDataDocument, in particular, inherits from the base class XmlDocument and differs from it only in the ability to synchronize with DataSet objects. When synchronized, DataSet and XmlDataDocument classes work on the same collection of rows, and you can apply changes through both interfaces (nodes and relational tables) and make them immediately visible to both classes. Basically, DataSet and XmlDataDocument provide two sets of tools for the same data. As a result, you can apply XSLT transformations to relational data, query relational data through XPath expressions, and use SQL to select XML nodes.

There are a few ways to bind a DataSet object and XmlDataDocument object together. The first option is that you pass a non-empty DataSet object to the constructor of the XmlDataDocument class.

XmlDataDocument doc = new XmlDataDocument(dataset);

Like its base class, XmlDataDocument provides a XML DOM approach to work with XML data and, as such, turns out to be quite different from XML readers and writers. An alternate way of synchronizing the two objects is the following example that creates a valid and non-empty DataSet object from a non-empty instance of the XML DOM.

XmlDataDocument doc = new XmlDataDocument();
doc.Load(fileName);
DataSet dataset = doc.DataSet;

You turn an XML document into a DataSet object using the XmlDataDocument's DataSet property. The property instantiates, populates, and returns a DataSet object. The data set is associated with the XmlDataDocument the first time you access the DataSet property. The methods GetElementFromRow and GetRowFromElement switch between the XML and relational view of the data. In order to view the XML data relationally, you must first specify a schema to use for data mapping. This can be done by calling the ReadXmlSchema method on the same XML file. As an alternate approach, you can manually create the necessary tables and columns in the DataSet.

Yet another way to synchronize XmlDataDocument and DataSet objects is when they are empty and you fill both separately. For example,

DataSet dataset = new DataSet();
XmlDataDocument xmldoc = new XmlDataDocument(dataset);
xmldoc.Load("file.xml");

Keeping the two objects synchronized provides an unprecedented level of flexibility since, as mentioned earlier, you can use two radically different types of navigation to move through records. In fact, you can use SQL-like queries on XML nodes, as well as XPath queries on relational rows.

Not all XML files can be successfully synchronized with a DataSet. For this to happen, XML documents must have a regular, tabular structure that can be mapped to a relational architecture where each row has the same number of columns. When rendered as DataSet objects, XML documents lose any XML-specific information they may have and for which there isn't a relational counterpart. This information includes comments, declarations. and processing instructions.

Summary

In ADO.NET, XML is more than a simple output format for serializing content. You can use XML to streamline the entire contents of a DataSet object, but you can also choose the actual XML schema and control the structure of the resulting XML document. You can mirror the contents of the DataSet, including tables and relations, you can take the schema information out of the final document, and you can even resort to the DiffGram format.

ADO.NET has a lot more to offer when it comes to interacting and integrating with XML. In particular, in .NET you can simultaneously provide and exploit two equivalent but independent views of the same data that obey to different logical data representations.

Dialog Box: Using GetChanges with Batch Update

I've seen that the DataSet programming interface provides for a method called GetChanges, which returns a smaller DataSet filled out only with the updated rows in all included tables. So, I'm led to think that using this smaller DataSet, instead of the original one, results in improved performance. However, there was something in your last articleI can hardly say what and wherethat has thrown some doubtful exceptions in my mind. So the question is, can you shed some more light on the use of the DataSet's GetChanges method in batch update?

The ADO.NET batch update is based on a loop that walks its way through the rows on the specified table. The code checks the state of the row and decides which action to take. The loop works on the DataSet and the DataTable you provide as an argument to the adapter's method Fill. If you call Fill on the original DataSet, or on the smaller DataSet returned by GetChanges, the results are roughly the same. The optimization is minimal and serves only to reduce the length of the loop.

During the batch update process, rows are processed sequentially from the middle-tier to data server. There is no snapshot of data that is sent down to the database in a single shot or as a single block of data. In this case, in fact, using GetChanges would result in much more optimized code.

The key parameter that determines how many significant operations are performed during the batch update is the number of modified rows. This parameter does not change regardless of whether you use the original DataSet or the one returned by GetChanges.

By contrast, if you batch update from the DataSet returned by GetChanges, you may run into serious troubles when a conflict is detected. In this case, the rows processed prior to the one that failed are regularly committed, but not on the original DataSet! To guarantee the consistency of your application, you must accept changes on the committed rows, as well as changes on the original DataSet. This code is entirely on its own. All in all, the batch update code is much easier if you use the original DataSet.

Dino Esposito is Wintellect's ADO.NET expert and a trainer and consultant based in Rome, Italy. Dino is a contributing editor to MSDN Magazine and writes the Cutting Edge column. He also regularly contributes to Developer Network Journal and MSDN News. Dino is the author of the upcoming Building Web Solutions with ASP.NET and ADO.NET from Microsoft Press, and the cofounder of http://www.vb2themax.com/. You can reach Dino at dinoe@wintellect.com.

Show:
© 2014 Microsoft. All rights reserved.