Database-like Data Containers

Article
06/29/2006

Dino Esposito

April 26, 2001

Recently, I've thrown myself into the somewhat evangelic task of bringing ADO.NET to the masses. While debating the intrinsic beauty of the ADO.NET object model, I find a hardly surmountable hurdle in the perception of its similarity with ADO, which hails from such an object model.

The majority of the questions focus around the key improvements in ADO.NET.

The disconnected model of computing? "Do you mean the model that ensures high levels of scalability for Web applications? Oh my, it was already available since the times of ADO 2.0."

Tight integration with XML? "You're kidding, right? XML and ADO celebrated their wedding in ADO 2.1 and reinforced the pact with ADO 2.5. Indeed, it was a very successful union."

The seamless integration with the rest of the framework? "What kind of framework are you talking about? Is it COM or COM+, or perhaps you are referring to MFC?"

The word "framework," though, is the first successful attempt to break the wall of "interested coldness"—if not "hot apathy"—that sometimes surrounds and welcomes presentations about ADO.NET.

This doesn't mean that developers are really unconcerned about data access in general, and ADO.NET in particular. They would rather hear about new prodigious features and astonishing strengthening of existing services. Instead, to program with ADO.NET, they end up using connections and commands working against data providers. They grab static snapshots of records and ask the runtime to reconnect and submit changes when they're done. They persist their content to XML streams.

So people reasonably ask what's really new with ADO.NET. And the subliminal sensation that nothing is really new progressively grows as the presentation rolls out and unveils what's under the ADO.NET covers.

Again, framework is indeed the word that represents the first crack in an otherwise plain and solid skepticism that can be easily observed by you, the ADO.NET presenter.

ADO.NET has unprecedented integration with the surrounding system SDK (in this case, the .NET Framework). Such seamless integration goes from the design patterns and type system of the base class libraries all the way up through Web Services and ASP .NET. Regardless of your programming model, as long as you stay in .NET, ADO.NET classes are the preferred way to go for data access.

In the last column, I covered data readers—a simple and efficient forward cursor to scroll a set of data rows. In this column, the spotlight is for DataSet objects—the .NET class that is the quintessence of disconnected and XML-driven data containers.

DataSet vs. Recordset Objects

The most accurate description for a DataSet object is in the title of this column. It sounds even more precise if you preface it with an "in-memory" qualifier. So putting it all together, you should think of a DataSet object as an in-memory, database-like data container.

The DataSet can be filled out as the result of a query command run against an OLE DB or a Managed provider. Regardless of the different syntax and structure, at least in the simplest contexts you end up using the DataSet object in much the same way you would with ADO Recordsets. You create them, set some arguments, and then have both filled up with data, care of a special command (usually a SQL command). Next, you move your record pointer back and forth over the table of records and update or delete records.

So far nothing is really different from an ADO Recordset. And nothing really justifies such a "revolution" in the object model. Due to the lack of a Recordset object in ADO.NET, backward compatibility is significantly broken up in the existing ADO code. However, at the end of the day, just the lack of a Recordset object is what significantly brings you to optimize the design and the performance of your data access components.

Let's start by reviewing the aspects in which ADO Recordset and DataSet objects are similar. Both can be disconnected, serialize themselves to XML, and can be fabricated manually using disconnected data. And both support data shaping, local sorting, filtering, and batch update.

As many design aspects make them significantly different as well. DataSets are data-centric as opposed to the database-centric nature of recordsets. DataSets work only in memory and in no time of their existence have anything to do with databases or, in general, with data sources. DataSet objects are in-memory repositories of tables; as many tables as you need and not just one as is the case with recordsets.

In addition, a DataSet can serialize its content to XML but without the burden of necessarily creating a disk file. While in ADO, XML is a mere output format; in ADO.NET, instead, XML is the underlying representation of the object. The XML representation of the DataSet content is available all the time through a special set of properties and methods. In no way is the representation built at runtime through a spooky instance of the XML DOM object.

DataSets can use different schemas of data to create their XML output or to rebuild from a remote source. The hierarchical XML model and the relational Recordset schema happily cohabitate within the DataSet object context giving you an unprecedented power of navigation over a set of records.

And finally, a DataSet object is an instance of a .NET managed class, rather than an instance of a COM object—more often than not, an instance of a COM Automation object. This means that a DataSet object can be safely stored in an ASP .NET Session slot without concerns for the overall scalability of the application. In addition, the block of records it contains can be transmitted across a network irrespective of corporate firewalls along the way. This happens because .NET objects are thread-safe and do not require thread affinity, and because a DataSet is already a XML stream that travels quite comfortably over HTTP and port 80.

Anatomy of a DataSet Object

A DataSet object is a class that descends from Component and implements the IListSource interface. In Visual Basic .NET, the declaration looks like the following:

Public Class DataSet
   Inherits Component
   Implements IListSource

The IListSource interface, in turn, derives from IList and owes its content to some sort of data source to which the code connects. IListSource has a single method, called GetList, to expose such content. GetList returns an IList-based collection of items. IList is a descendant of the base ICollection interface and turns out to be the abstract base class of all .NET lists.

At the end of the day, a DataSet object is a collection of more specialized objects holding the data. On its own, the DataSet provides the ability to serialize according to given XML schemas and an all-encompassing nature that is remarkably helpful when you need to keep in a single place more tables of inter-related data.

In addition, a DataSet is a completely-in-memory object with all the pro and con considerations that usually applies to this role. It is easy and quick to access as long as the amount of information stored is neither particularly volatile nor growing so fast as to require frequent data refresh.

A DataSet is a container that looks like a database. All the data you manipulate is kept in memory, yet it provides you with methods to index, sort, logically join, set constraints, and check data integrity. Only a few of these extended functionalities are implemented directly by the DataSet object, which reinforces the overall feeling of integration and consistency that pervades across the whole .NET Framework.

You create a DataSet object using the ordinary new keyword with a couple of possible constructors:

DataSet data = new DataSet();
DataSet data = new DataSet("MyDataSet");

As you can see, the difference is only in the name you want to assign to the newly created data container. Either you omit the name or it must be a non-empty name. If you omit the name, and therefore you use the former constructor, the name defaults to NewDataSet. However, the dataset's name doesn't play a viable role in the object usage. The only place where you will ever see it used is in the root tag of the XML document that renders the content of the DataSet.

Geography of a DataSet Object

Let's review the geography of DataSets—namely the sub-objects and collections that form the actual programming model—and explore the various approaches you can take to populate a DataSet object.

The DataSet class comprises three main collections:

Tables
Relations
ExtendedProperties

Tables gathers all the child tables of data that you should be getting from their external sources and explicitly adds them to the repository. Tables are rendered through the DataTable object. Once part of the DataSet family, a table becomes automatically serializable through the DataSet's XML schema. Any table in a DataSet is characterized by a unique name, but can also be accessed through index.

A relation is a logical SQL JOIN statement that creates a relationship between two tables previously added to the DataSet. The relation links the two tables on the matching values in a common column in much the same way a JOIN statement does.

The big difference between a DataSet relation and JOIN statements is that the two tables involved in the link remain distinct and no unified table, no matter physical or logical, is created. A relation can be seen as an extra column added to rows on the primary table whose content is an array of matching rows on the target table.

ExtendedProperties gets the collection of custom user information whose structure and logic is completely up to you. You use this property in the same way you would use ASP Session or Application collections.

data.ExtendedProperties.Add("refreshat", "12:00");

Among the other properties of the DataSet object, a special mention deserves the DefaultView property. It returns a DataSetView object that represents a custom, filtered view on all the tables currently forming the dataset. This feature allows you to build multiple different views of the dataset content; for example, views showing different fields for different users.

To set a dataset's view, you first create and populate the DataSet. Next, you create a new instance of the DataSetView object passing the DataSet reference to the class constructor.

DataSet data = new DataSet();
// fill the data set here
DataSetView dsv = new DataSetView(data);

So far, you have created an association between two objects but you haven't provided the information needed to create a collection of table-specific views. You do this through a new type of object—the TableMapping object.

A table mapping is meant to define the custom settings used to view a table in a DataSetView view. Basically, it applies a sort of mask on top of any table in the DataSet. This mask can provide automatic filtering, sorting, as well as an alternative naming for fields. By simply assigning a DataSetView object to the DefaultView property of a DataSet, you enable a different view of the same content.

Another couple of important elements in the DataSet's geography are the Xml and XmlData properties. They let you read or change the DataSet structure and content using XML. The Xml property exposes both schema and data. The XmlData property makes available, for reading and writing, only data.

Populating the DataSet

The DataSet is a publicly creatable object, as are almost all the other ADO.NET objects. Normally, you create a DataSet with or without a custom name. However, DataSetName is the property you can use to get and set the DataSet's name. Next, you might want to set some environment attributes such as CaseSensitive and EnforceConstraints.

The former determines whether string comparisons within child DataTable objects have to be case-sensitive. By default, the property returns False.

EnforceConstraints is a Boolean value that indicates whether any constraint rule set through the DataTable's Constraints collection has to be verified when attempting an update operation. Constraints are a collection of Constraint objects, each of which defines a rule enforced by the table to maintain data integrity.

On the XML side of a DataSet, you can decide to set the namespace name (the Namespace property) and the namespace's prefix (the Prefix property) to be used when serializing to XML the content of the object.

Upon creation, a DataSet is empty as its Tables collection has no element. You can add tables to the collection in two basic ways—through a .NET data adapter or by manually creating a DataTable object and adding it to the collection.

A data adapter takes the form of a SQLDatasetCommand or a ADODatasetCommand object. (Notice that the names of these classes are subject to change with the Beta 2 of .NET.)

A data adapter exposes methods that hide what really happens under the covers and is not at all different from what you can do manually. The DataTable object is first created as follows:

DataTable dt = new DataTable();

Next, it is given a schema in terms of columns and related attributes, and finally it gets filled up with rows. You use the DataTable's Rows collection for this and append one DataRow object after the other. When the table is ready, it is inserted in the Tables collection.

A data adapter executes this procedure under the control of the FillDataSet method.

SQLDatasetCommand cmd = new SQLDatasetCommand(strCmd, strConn);
DataSet data = new DataSet();
cmd.FillDataSet(data, strTableName);

Such a method serves the purpose of updating data and schema of the specified table in the given DataSet. FillDataSet retrieves the data from the data source using the query command you passed through the adapter's class constructor.

There must be a connection object associated with the command, right? It can specified either through a SQLConnection object (an ADOConnection object if you're targeting an OLE DB data source) or a simpler connection string. Either way, if the connection needed is closed, it is opened to retrieve data, then closed. If the connection is open before FillDataSet is invoked, it is used and left open.

FillDataSet has an extra calling prototype that lets you load only a portion of all the selectable records at a time. You specify the 0-based position of the record to start loading with and the maximum number of records to retrieve for each step. This trick allows you to implement an asynchronous reading of records from virtually any data source.

Furthermore, bear in mind that if the command returns multiple results, FillDataSet will only take into account the first result—or the specified portion of it in case you're used to reading a maximum number of records at a time. By contrast, if the command returns no rows, then no table is added to the DataSet.

Any error that might occur while populating the data set is that it won't roll back the changes already submitted. In fact, any row added or modified prior to the error is acquired and never canceled. When your aim is to refresh a dataset, as opposite to fill it up, you can avoid duplicated rows if you use the same SQL statement that was initially used to populate the DataSet and primary key information is present.

Primary key information is normally inferred by the table metadata if you're fetching records out of a relational database. Otherwise, it might be set through the PrimaryKey property of the DataTable object.

DataColumn[] keys = new DataColumn[1];
DataTable dt = new DataTable("MyList");
keys[0] = dt.Columns["ID"];
dt.PrimaryKey = keys;

The DataSet object has the built-in ability to recover some viable missing information. The MissingSchemaAction property, for example, indicates how to manage potentially inconsistent situations where missing tables or columns may cause unpredictable behaviors. By assigning a predefined value to MissingSchemaAction, you decide whether the missing information must be simply added to the DataSet's schema or raise a warning or an error. If you set the property to AddWithKey as follows:

data.MissingSchemaAction = MissingSchemaAction.AddWithKey

Then all the necessary columns and key information is added to complete the schema.

Data adapters are specialized objects that end up creating and filling data tables. You can run the same procedure under your own control and populate a data set with non-database records. For example, the following snippets shows how to add a table with directory information:

DataTable dt = new DataTable();
DataColumn colName = new DataColumn();
colName.DataType = System.Type.GetType("System.String");
colName.ColumnName = "FolderName";
dt.Columns.Add(colName);
DataColumn colDesc = new DataColumn();
colDesc.DataType = System.Type.GetType("System.String");
colDesc.ColumnName = "FolderDesc";
dt.Columns.Add(colDesc);          
Directory dir = new Directory(strDir);
foreach (Directory d in dir.GetDirectories())
{
   DataRow dr = dt.NewRow();
   dr["FolderName"] = d.Name;
   dr["FolderDesc"] = "Content of " + d.Name;
   dt.Rows.Add(dr);
}
DataSet data = new DataSet();
data.Tables.Add(dt);

All the tables linked to a DataSet object can be manipulated through the same API regardless of their actual origin. Whether you created a table from SQL Server or from scanning the content of a folder makes no difference whatsoever when it comes to putting tables into relation or indexing or persisting to XML.

All the data is kept in memory and can be updated, sorted, and filtered without resorting to any server-side functionality. All database-like functionalities are implemented as in-memory features including a commit model that closely resembles the transaction model of many DBMS. You really only need to get back to the server when you want to save changes to the data source. This will raise another category of problems that I'll tackle in an upcoming column.

Dialog Box: Love Without Further Ado(.NET)

I feel that one day I could love ADO.NET and ASP .NET, but this day is yet to come. Actually, I don't feel comfortable enough with them to take the plunge to recommend it for the next project. Ideally, I would wait at least one year to start and more or less two years to go live. Production boxes are very fierce beasts. Am I missing something?

We're really close now to the Beta 2 of .NET. Many companies involved in pilot projects are already "playing" with it as I write this response. Having it publicly available is a matter of weeks. Once you have put your hands on it, have a careful look at the documentation. The more you feel comfortable with that, the more I recommend that you take the plunge.

When you write code, system documentation is your best friend. This is even truer if you consider that you will be working with a new and somehow unexplored platform. To take the project home, you must survive by relying on your team skills and the documentation available.

By the time the first .NET platform ships, we will certainly have excellent documentation. Already having good documentation for Beta 2 is a significant signal that we're headed in the right direction.

Personally, I'd first check documentation for ASP .NET server controls, in particular, the custom controls. Then, I'd make sure that the ADO.NET documentation for related objects (and for overrides of the same methods) is not the offspring of a more or less smart cut-and-paste. Finally, I'd make sure that I feel confident in my ability to deploy ASP .NET applications.

Loving .NET might not be enough to start a real-world project, but feeling comfortable with the SDK and documentation is a great place to start. It's exactly what you need to start a new "production" story.

Dino Esposito is Wintellect's ADO.NET expert and a trainer and consultant based in Rome, Italy. Dino is a contributing editor to MSDN Magazine and writes the Cutting Edge column. He also regularly contributes to Developer Network Journal and MSDN News. Dino is the author of the upcoming Building Web Solutions with ASP.NET and ADO.NET from Microsoft Press, and the cofounder of https://www.vb2themax.com/. You can reach Dino at dinoe@wintellect.com.