From the August 2002 issue of MSDN Magazine

Designing Reader Classes for .NET Documents
Dino Esposito, Cutting Edge
I have to admit, there's a little bit of hype in the title of this month's column. What the heck is a .NET document? How is a .NET document different from a Win32® document or even a Linux document? Ever since the good old days of punch cards, it is the program, not the hardware or software platform, that determines what a document is. At the highest level of abstraction, a document is the repository of some sort of information that can be codified in text or binary formats, obfuscated, and encrypted. If the document contains plain text, you can play around with it using a number of tools. But when a document has lots of formatting and other information embedded in it, you need special software to read it.
      There's a long list of document formats for which ad hoc reading tools would be useful. Some common formats that may need a specialized application include XML, HTML, comma-separated files (CSV), Windows® configuration files (INI), Microsoft® Management Console (MMC) scripts, all those proprietary file formats, and those that have fixed-width fields.
      So, what if you want to build apps to read these documents? Well, the Microsoft .NET platform provides you with a ready-made interface to build ad hoc reader classes for simple file formats.

.NET Readers and Writers

      The .NET Framework defines two rather generic software components or classes, the reader and the writer, which provide a common programming interface for both synchronous and asynchronous operations on two distinct categories of data: streams and files. These classes are contained in the System.IO namespace.
      A file is defined as an ordered and named collection of bytes and is persistently stored to a disk. A stream represents a block of bytes that is read from, and written to, a data store. The data store can be based on a variety of storage media, including memory, disk files, and remote URLs. In this sense a stream is a superset of a file: it is a sequence of bytes that can live on a variety of media, including memory, rather than only on disk. To work with streams, the .NET Framework defines several flavors of readers and writers. Figure 1 shows how each class relates to the others.

Figure 1 Class Relationships

      The base classes are TextReader, TextWriter, BinaryReader, BinaryWriter, and Stream. All but the binary classes are marked as abstract (MustInherit, if you speak Visual Basic®) and cannot be directly instantiated in code. You can use abstract classes, though, to reference living instances of derived classes.
      In the .NET Framework, there are a number of concrete implementations for base reader and writer classes, including StreamReader, StringReader, and their writing counterparts. By design, reader and writer classes work on top of .NET streams and provide programmers with a customized interface able to handle a particular type of underlying data or file format. Although each specific reader and writer class is custom-made for the content of a given type of stream, they share a common set of methods and properties which define the official .NET interface for reading and writing data.
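      As a minimal sketch of that idea, the following code writes a routine against the abstract TextReader type and then feeds it two different concrete readers. The class and method names (ReaderDemo, CountLines, FromString, FromStream) are made up for illustration, and a MemoryStream stands in for a disk file.

```csharp
using System;
using System.IO;
using System.Text;

public static class ReaderDemo
{
    // Works against the abstract TextReader type, so any derived
    // reader (StringReader, StreamReader, and so on) can be passed in.
    public static int CountLines(TextReader reader)
    {
        int count = 0;
        while (reader.ReadLine() != null)
            count++;
        return count;
    }

    // The abstract type references a live StringReader instance.
    public static int FromString(string text)
    {
        using (TextReader reader = new StringReader(text))
            return CountLines(reader);
    }

    // The same abstract type references a StreamReader working on a stream.
    // The MemoryStream is only a stand-in for a file or network stream.
    public static int FromStream(string text)
    {
        Stream stream = new MemoryStream(Encoding.UTF8.GetBytes(text));
        using (TextReader reader = new StreamReader(stream))
            return CountLines(reader);
    }
}
```

      Both ReaderDemo.FromString("a\nb\nc") and ReaderDemo.FromStream("a\nb\nc") return 3; the counting code never needs to know which reader it was handed.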
      A reader works in much the same way as a client-side database cursor. The underlying stream is seen as a logical sequence of units of information whose size and layout depend on the particular reader. The reader moves through the data in a read-only, forward-only manner. Normally a reader is not expected to cache any information, but this is more of a common practice for .NET readers than a strict guideline. Nothing would actually prevent you from writing custom readers with the built-in ability to read many units at once and cache them for further use.
      A non-cached database cursor reads one record at a time and stores the contents in some internal data structure. The unit of information read at this step is the database row. Similarly, a reader working on a disk file stream uses the single byte as its atomic unit of information, whereas a text reader could extract one row of text at a time. Configuration settings for an application, for instance, are read using an AppSettingsReader class.
      Another very peculiar type of reader is the XML reader. The XmlReader class parses the content of an XML file, moving from one node to the next. In this case, the information processed is more granular: it's the XML node, which may be an element, an attribute, a comment, or a processing instruction.
      In the .NET Framework some reader and writer classes, notably the XmlReader and the SqlDataReader, don't derive from TextReader or BinaryReader. But even classes that don't derive from a common source share a minimal programming interface with all other readers. This common interface contains the Read and Close methods. Whatever the base class is, and whatever the names of methods and properties are, all .NET readers and writers work in the same way and feature the same functions as long as the characteristics of the underlying data stream permit it.

Text Readers and Writers

      The TextReader class provides the capability to read a series of characters in various ways. The Read method gets one or more characters from the input stream, whereas the ReadLine method always stops at the next line terminator (a carriage return, a line feed, or both). The ReadToEnd method reads until the end is reached, whereas ReadBlock fills a range of a character buffer with up to a specified number of characters. The TextReader class also lets you Peek at the next character in the stream without actually extracting it. A TextReader object is not thread-safe by default, but a thread-safe wrapper class can be obtained at any time by calling the Synchronized static (shared) method:

  TextReader sync = TextReader.Synchronized(reader);
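      To see how the reading methods differ, here is a small sketch that walks a StringReader through each of them in turn. The class name TextReaderTour and the sample text are illustrative only.

```csharp
using System;
using System.IO;

public static class TextReaderTour
{
    // Returns what each call saw, in the order the calls were made.
    public static string[] Run()
    {
        TextReader reader = new StringReader(
            "first line\nsecond line\nrest of the text");

        int peeked = reader.Peek();          // looks at 'f' without consuming it
        int read = reader.Read();            // consumes 'f'
        string line = reader.ReadLine();     // rest of the first line

        char[] buffer = new char[6];
        int count = reader.ReadBlock(buffer, 0, buffer.Length); // next 6 chars

        string rest = reader.ReadToEnd();    // everything that is left
        int atEnd = reader.Peek();           // -1 once the input is exhausted
        reader.Close();

        return new string[] {
            ((char)peeked).ToString(),
            ((char)read).ToString(),
            line,
            new string(buffer, 0, count),
            rest,
            atEnd.ToString()
        };
    }
}
```

      Run returns "f", "f", "irst line", "second", " line\nrest of the text", and "-1", which shows that Peek leaves the position untouched while every other call moves the reader forward.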

      The base interface of the TextWriter is even simpler. The class implements a pair of methods, Write and WriteLine, each with a long list of overloads. Writers support .NET format providers as well as custom newline characters and the encoding set of choice. Tailor-made writer classes include the indented writer (which classes in the CodeDOM namespace use to generate code on the fly), the HTTP writer (used to send data back to the browser), and the HTML text writer (which ASP.NET server controls employ to render their HTML source code). Like the reader, the writer class is not thread-safe by default; concurrent access can be serialized by wrapping the writer with the static (shared) TextWriter.Synchronized method.
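      The following sketch shows those writer features together: a format provider, a custom newline string, a few Write overloads, and the synchronized wrapper. The class name TextWriterTour is invented for the example, and the invariant culture is chosen only to keep the output predictable.

```csharp
using System;
using System.Globalization;
using System.IO;

public static class TextWriterTour
{
    public static string Run()
    {
        // The format provider controls how numbers and dates are rendered.
        StringWriter target = new StringWriter(CultureInfo.InvariantCulture);
        target.NewLine = "\n";               // custom newline characters

        // Thread-safe wrapper around the same underlying writer.
        TextWriter writer = TextWriter.Synchronized(target);

        writer.WriteLine("Total: {0:F2}", 1234.5); // composite formatting
        writer.Write(42);                          // Write(int)
        writer.Write(' ');                         // Write(char)
        writer.Write(true);                        // Write(bool)
        writer.WriteLine();
        writer.Flush();

        return target.ToString();
    }
}
```

      Run returns "Total: 1234.50\n42 True\n".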
      To create readers or writers for proprietary data, consider implementing the Read and Peek methods in the reader and a few Write overloads in the writer. Although the base classes are abstract, only a few members are actually marked abstract and so must be overridden in derived classes; the Encoding property of TextWriter is one of them.
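      As an illustration, here is a hypothetical UpperCaseReader that overrides only Read and Peek. It is a sketch, not a class from the Framework; it relies on the base TextReader implementations of ReadLine and ReadToEnd, which are built on top of those two methods.

```csharp
using System;
using System.IO;

// Hypothetical reader that upper-cases whatever an inner reader produces.
public class UpperCaseReader : TextReader
{
    private readonly TextReader _inner;

    public UpperCaseReader(TextReader inner)
    {
        if (inner == null) throw new ArgumentNullException("inner");
        _inner = inner;
    }

    // Returns the next character without consuming it, or -1 at the end.
    public override int Peek()
    {
        int ch = _inner.Peek();
        return ch == -1 ? -1 : char.ToUpperInvariant((char)ch);
    }

    // Consumes and returns the next character, or -1 at the end.
    public override int Read()
    {
        int ch = _inner.Read();
        return ch == -1 ? -1 : char.ToUpperInvariant((char)ch);
    }

    // Closing this reader also closes the reader it wraps.
    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}
```

      With new UpperCaseReader(new StringReader("one\ntwo")), a call to ReadLine returns "ONE" and a following ReadToEnd returns "TWO", even though neither method was overridden.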

Binary Readers and Writers

      A binary reader works by reading primitive data types from a stream with a specific encoding. The default encoding is UTF-8. The set of methods includes Read, PeekChar, and a long list of ReadXxx methods specialized to extract data of the given type. In contrast, the generic Read method works by extracting a single character from the stream and moving the internal pointer ahead according to the current encoding.
      For example, the code in Figure 2 shows how to read width, height, and color depth for a BMP file. The file header of a BMP file is given by a structure called BITMAPFILEHEADER which counts a total of 14 bytes: three UInt16 and two UInt32. Notice that if you want to skip over the specified number of bytes, the reader's interface forces you to indicate the number of bytes explicitly.

  reader.ReadBytes(14)

You cannot rely on handy functions like sizeof in C# or the Len function of Visual Basic. The sizeof operator only works on unmanaged types. More important than how you get the length to skip, though, is the fact that the bytes are actually read, not skipped. To really jump over the specified number of bytes you must resort to the methods of the underlying stream object. You get the current stream using the BaseStream property, then skip the first 14 bytes like this:

  reader.BaseStream.Seek(14, SeekOrigin.Begin)

      You can also choose to move by a given offset from the end of the stream or from the current position. The following code shows how to actually skip a short integer:

  reader.BaseStream.Seek(2, SeekOrigin.Current)
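      Putting the seek calls together, a sketch of a BMP header reader might look like the following. This is not the Figure 2 listing. The field order after the 14-byte file header (a 4-byte size, 4-byte width, 4-byte height, 2-byte plane count, then the 2-byte bit count) is an assumption taken from the standard BITMAPINFOHEADER layout rather than from this column, and the BmpInfo class with its MakeSample helper is hypothetical.

```csharp
using System;
using System.IO;

public static class BmpInfo
{
    // Returns { width, height, bitsPerPixel } read from a BMP stream.
    public static int[] ReadDimensions(Stream bmp)
    {
        BinaryReader reader = new BinaryReader(bmp);

        // Jump over BITMAPFILEHEADER (14 bytes) without reading it.
        reader.BaseStream.Seek(14, SeekOrigin.Begin);

        // Assumed BITMAPINFOHEADER layout: skip the 4-byte size field.
        reader.BaseStream.Seek(4, SeekOrigin.Current);
        int width = reader.ReadInt32();
        int height = reader.ReadInt32();

        // Skip the 2-byte plane count, then read the color depth.
        reader.BaseStream.Seek(2, SeekOrigin.Current);
        int bitsPerPixel = reader.ReadUInt16();

        return new int[] { width, height, bitsPerPixel };
    }

    // Builds just enough of a fake header in memory to try the reader.
    public static Stream MakeSample(int width, int height, ushort bits)
    {
        MemoryStream stream = new MemoryStream();
        BinaryWriter writer = new BinaryWriter(stream);
        writer.Write((ushort)0x4D42);  // file type, the characters "BM"
        writer.Write((uint)0);         // file size (unused here)
        writer.Write((ushort)0);       // reserved
        writer.Write((ushort)0);       // reserved
        writer.Write((uint)54);        // offset of the pixel data
        writer.Write((uint)40);        // size of the info header
        writer.Write(width);
        writer.Write(height);
        writer.Write((ushort)1);       // planes
        writer.Write(bits);
        writer.Flush();
        stream.Position = 0;
        return stream;
    }
}
```

      BmpInfo.ReadDimensions(BmpInfo.MakeSample(640, 480, 24)) returns 640, 480, and 24. With a real file you would pass a FileStream instead of the in-memory sample.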

      Finally, if you need to read a complex structure from a binary stream you have to do it sequentially on a per-field basis. You must not use a binary formatter to read binary data into a newly created instance of an object. The .NET binary formatter can be used only to deserialize object graphs that had been previously created with the .NET formatter's serialize method.
      To write binary data, you have a variety of WriteXxx methods, one for each primitive .NET type.
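      The per-field approach can be sketched as a matching pair of routines, one writing each field with a Write overload and one reading them back in the same order. The SensorSample structure and the SampleCodec class are hypothetical, chosen only to show an integer, a floating-point value, and a string side by side.

```csharp
using System;
using System.IO;

// Hypothetical structure used only for this example.
public struct SensorSample
{
    public int Id;
    public double Value;
    public string Label;
}

public static class SampleCodec
{
    // Writes the structure one field at a time.
    public static byte[] Write(SensorSample sample)
    {
        using (MemoryStream stream = new MemoryStream())
        using (BinaryWriter writer = new BinaryWriter(stream))
        {
            writer.Write(sample.Id);     // 4 bytes
            writer.Write(sample.Value);  // 8 bytes
            writer.Write(sample.Label);  // length-prefixed string
            writer.Flush();
            return stream.ToArray();
        }
    }

    // Reads the fields back in exactly the order they were written.
    public static SensorSample Read(byte[] data)
    {
        using (BinaryReader reader = new BinaryReader(new MemoryStream(data)))
        {
            SensorSample sample = new SensorSample();
            sample.Id = reader.ReadInt32();
            sample.Value = reader.ReadDouble();
            sample.Label = reader.ReadString();
            return sample;
        }
    }
}
```

      The reader has no knowledge of the structure itself; the layout exists only in the order of the ReadXxx calls, so the two routines must be kept in step.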

Data Readers

      Readers are the key tool for data retrieval in ADO.NET. Each DBMS managed provider exposes an object acting as a connected firehose-style cursor—the data reader. The .NET data provider for SQL Server™ returns data using instances of the SqlDataReader class. By contrast, the managed provider for OLE DB providers makes use of the OleDbDataReader class. All data reader objects share the methods of the IDataReader interface, among which the Read method plays a key role. It moves the internal pointer one row ahead while copying the column values found in the current row to an internal buffer. As long as you work on a data reader, the connection with the data source remains active and open.
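      The typical read loop is sketched below against the IDataReader interface. To keep the example self-contained, it does not open a database connection; the reader comes from DataTable.CreateDataReader, which was added to the Framework after version 1.0, and stands in for a SqlDataReader or OleDbDataReader returned by a command. The class, table, and column names are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class DataReaderDemo
{
    // In-memory stand-in for the result of a database query.
    public static DataTable MakeTable()
    {
        DataTable table = new DataTable("Employees");
        table.Columns.Add("Id", typeof(int));
        table.Columns.Add("Name", typeof(string));
        table.Rows.Add(1, "Davolio");
        table.Rows.Add(2, "Fuller");
        return table;
    }

    // Works with any provider's reader because it only uses IDataReader.
    public static List<string> ReadNames(IDataReader reader)
    {
        List<string> names = new List<string>();
        try
        {
            // Read moves one row ahead and returns false past the last row.
            while (reader.Read())
                names.Add(reader.GetInt32(0) + ":" + reader.GetString(1));
        }
        finally
        {
            reader.Close();
        }
        return names;
    }
}
```

      DataReaderDemo.ReadNames(DataReaderDemo.MakeTable().CreateDataReader()) returns "1:Davolio" and "2:Fuller".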
      In ADO.NET, the provider-specific data reader class is often presented as the lightweight connected alternative to DataSet objects. The DataSet object and the DataReader object present two philosophically different ways to get data in ADO.NET. The DataSet is a database-independent object that acts as a disconnected static container of a data snapshot. The DataReader is the .NET counterpart of a simple database cursor. Although you can take for granted that DataSet objects and data readers are different software tools, don't jump to the conclusion that a DataSet and DataReader use different APIs to fetch the data. Instead, the difference between the two classes is in how they store results. The data reader class, in fact, is the primitive method used to build the in-memory cache of data that makes up the DataSet.
      A database-driven DataSet object is populated using the Fill method of the adapter class. Fill calls the provider-specific data reader, then loops through the reader's results to create and populate DataTable objects within the target DataSet. As a result, the data reader is certainly more lightweight than a DataSet and should always be used for simple, one-shot database operations. If you need to take a snapshot of the data, then copy it to a powerful, feature-rich data structure such as the DataSet. The message here is that regardless of the actual data container, ADO.NET knows just one way to get to the data—the DataReader.
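      Conceptually, the loop that Fill runs over the reader can be sketched as follows. This is a simplification under stated assumptions, not the adapter's actual implementation: it ignores keys, constraints, and multiple result sets, and the ReaderLoader class is hypothetical. As a stand-in for a provider's reader, the sample source again uses DataTable.CreateDataReader, which is not part of version 1.0.

```csharp
using System;
using System.Data;

public static class ReaderLoader
{
    // Builds a disconnected DataTable from a connected, forward-only reader.
    public static DataTable Load(IDataReader reader)
    {
        DataTable table = new DataTable();
        try
        {
            // Create one column per field exposed by the reader.
            for (int i = 0; i < reader.FieldCount; i++)
                table.Columns.Add(reader.GetName(i), reader.GetFieldType(i));

            // Copy each row's values into a new DataRow.
            object[] values = new object[reader.FieldCount];
            while (reader.Read())
            {
                reader.GetValues(values);
                table.Rows.Add(values);
            }
        }
        finally
        {
            reader.Close();
        }
        return table;
    }

    // In-memory stand-in for a provider's data reader.
    public static IDataReader MakeSource()
    {
        DataTable source = new DataTable();
        source.Columns.Add("Id", typeof(int));
        source.Columns.Add("Name", typeof(string));
        source.Rows.Add(1, "Davolio");
        source.Rows.Add(2, "Fuller");
        return source.CreateDataReader();
    }
}
```

      ReaderLoader.Load(ReaderLoader.MakeSource()) yields a table with two columns and two rows that remains usable after the reader has been closed.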
      Today, the programming interface exposed by a .NET data provider does not include a DataWriter. You can access the rows stored in the database using DBMS-specific commands, primarily SQL statements. However, although some DBMS provide ad hoc, position-based facilities, you normally identify the records to work on by key. The concept of a DataWriter is a lot like a classic database cursor. Server cursors allow you to move back and forth between records and give you the ability to modify data. As of version 1.0 of the .NET Framework, ADO.NET does not support server cursors, but this feature could easily be added in a future release.

XML Readers and Writers

      When you look at the parsing models supported by the .NET Framework, you'll notice that there is no SAX support in .NET. However, the SAX2 interfaces are fully supported by the Microsoft COM-based parser, MSXML 3.0 and higher. In parsing XML documents, .NET readers play a key role.
      A reader, as it is implemented in .NET, processes its input data according to the pull model, in which the calling application has full control of the overall read process. The application can decide what data to process, what data to skip, and what object model to build on top of the source data. When you apply this model to XML parsing, the caller application processes only the subtrees it needs and keeps track of only the pieces of information it really must keep track of.
      In a push environment, the parser is a component distinct from the caller application. The parser, not the application, controls the activity going on in the XML source data. The parser loads and preprocesses nodes and then passes them on to the application. For each node, data gets passed to the application, whether the application actually needs it or not.
      For XML parsing, the pull model is inherently more efficient and flexible. In fact, you can always arrange a push model on top of a pull model, but not vice versa. Although you don't strictly need a SAX parser in .NET, you can easily write a managed SAX parser using XML readers (see the sidebar "A SAX Parser in .NET").
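      To illustrate how a push model can sit on top of a pull model, here is a minimal sketch of an event-raising parser driven by an XmlTextReader. It is not the code from the sidebar and does not implement the SAX2 interfaces; the MiniSaxParser class and its three events are hypothetical names, and attributes, namespaces, and error callbacks are left out.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical push-style parser: it pulls nodes from an XML reader
// and pushes them to whoever subscribed to the events.
public class MiniSaxParser
{
    public event Action<string> StartElement;
    public event Action<string> EndElement;
    public event Action<string> Characters;

    public void Parse(TextReader input)
    {
        XmlTextReader reader = new XmlTextReader(input);
        try
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        if (StartElement != null) StartElement(reader.Name);
                        // An empty element such as <c/> has no EndElement node.
                        if (reader.IsEmptyElement && EndElement != null)
                            EndElement(reader.Name);
                        break;
                    case XmlNodeType.EndElement:
                        if (EndElement != null) EndElement(reader.Name);
                        break;
                    case XmlNodeType.Text:
                    case XmlNodeType.CDATA:
                        if (Characters != null) Characters(reader.Value);
                        break;
                }
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

      A caller subscribes to the events and then calls Parse; from that point on the parser, not the caller, decides when each handler runs, which is exactly the inversion of control that distinguishes push from pull.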
      XML readers are read-only cursors which jump from one node to the next in a forward-only manner. Once the internal pointer is positioned on a certain node, you can access everything rooted in that node, including its attribute nodes. You cannot modify free text or attributes and can only move forward from the current node. XML readers are built from a base abstract class called XmlReader. To access XML docs, you need to use one of its derived classes or write your own. The .NET Framework provides three reader classes: XmlTextReader, XmlValidatingReader, and XmlNodeReader.
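      A short sketch of the cursor in action follows. It uses XmlTextReader to walk a small document held in a string and records each node it lands on, visiting attributes while positioned on their element. The XmlPullDemo class and the sample markup are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class XmlPullDemo
{
    public static List<string> ListNodes(string xml)
    {
        List<string> seen = new List<string>();
        XmlTextReader reader = new XmlTextReader(new StringReader(xml));
        try
        {
            // Each Read call moves the cursor forward to the next node.
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        seen.Add("element " + reader.Name);
                        // Attributes are reached from their owner element.
                        while (reader.MoveToNextAttribute())
                            seen.Add("attribute " + reader.Name + "=" + reader.Value);
                        break;
                    case XmlNodeType.Text:
                        seen.Add("text " + reader.Value);
                        break;
                    case XmlNodeType.Comment:
                        seen.Add("comment " + reader.Value);
                        break;
                }
            }
        }
        finally
        {
            reader.Close();
        }
        return seen;
    }
}
```

      For the input <book id="1"><!--note--><title>XML</title></book>, ListNodes returns "element book", "attribute id=1", "comment note", "element title", and "text XML". The caller decides which node types to act on and simply ignores the rest.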
      One way you can write XML text to disk files is to prepare the whole output in a string and then write it out. In this case, you are responsible for ensuring the document's well-formedness and must deal with the intricacies of quotes, attributes, indentation, closing tags, and all that fun stuff.
      In .NET, XML writers come to the rescue and let you write XML documents programmatically. You can specify whether or not you want a namespace prefix, indentation, and how you want white spaces to be treated. You also create nodes using methods to write comments, attributes, and elements. Using XML writers guarantees a final output compliant with the XML 1.0 specification.
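      As a sketch of the writer side, the following code builds a small document with XmlTextWriter, leaving escaping, indentation, and closing tags to the writer. The XmlWriterDemo class and the element names are made up for the example, and the output goes to a StringWriter rather than a disk file so that the sample stays self-contained.

```csharp
using System;
using System.IO;
using System.Xml;

public static class XmlWriterDemo
{
    public static string Build()
    {
        StringWriter output = new StringWriter();
        XmlTextWriter writer = new XmlTextWriter(output);
        writer.Formatting = Formatting.Indented;   // let the writer indent

        writer.WriteStartElement("books");
        writer.WriteComment("generated programmatically");

        writer.WriteStartElement("book");
        writer.WriteAttributeString("id", "1");
        // Markup characters in the text are escaped by the writer.
        writer.WriteElementString("title", "Readers & <writers>");
        writer.WriteEndElement();                  // closes <book>

        writer.WriteEndElement();                  // closes <books>
        writer.Close();

        return output.ToString();
    }
}
```

      The returned string contains the title as Readers &amp;amp; &amp;lt;writers&amp;gt; in escaped form, that is, with the ampersand and angle brackets replaced by entity references, and loading it back into an XmlDocument restores the original text.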

Custom Readers

      So how would you design and write a custom reader? You could build it from base classes such as BinaryReader, TextReader, and XmlReader, or you can create your own. Typically, you choose BinaryReader if you need to manipulate primitive types in binary rather than text. You should choose the TextReader class when character input is critical. To successfully build on top of a TextReader, the most complicated thing you might need to do is read a line of text between two successive instances of a carriage return.
      Finally, use the XmlReader class as the base class if the content of the data you expose can be rendered, or at least traversed, as XML. XML is a very specific type of text, so the XmlReader class is more powerful and richer than any other reader class. Not all data, though, can map to XML.
      In theory, if you are writing a reader from scratch, you could use any series of methods and properties since no IReader interface actually exists. In practice, though, you might want to stick close to the interface of other readers. The guidelines for designing custom readers can be summarized in a couple of points. First, make it work according to the pull model and if possible make it read-only and forward-only. Second, give it a programming interface customized for the data it will handle. To apply these principles, try to build your class atop an existing .NET reader, but don't be afraid to outline a new one.
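      Those guidelines can be sketched with a reader for comma-separated files. The SimpleCsvReader class below is hypothetical: it sits atop a TextReader, follows the pull model with a forward-only Read method, and exposes an interface shaped for its data (a field count and an indexer). As a simplifying assumption, it splits on every comma and does not handle quoted fields.

```csharp
using System;
using System.IO;

// Hypothetical forward-only, read-only reader for simple CSV text.
public class SimpleCsvReader : IDisposable
{
    private readonly TextReader _source;
    private string[] _fields = new string[0];

    public SimpleCsvReader(TextReader source)
    {
        if (source == null) throw new ArgumentNullException("source");
        _source = source;
    }

    // Moves to the next row; returns false when there are no more rows.
    public bool Read()
    {
        string line = _source.ReadLine();
        if (line == null)
        {
            _fields = new string[0];
            return false;
        }
        _fields = line.Split(',');   // naive split, no quoting support
        return true;
    }

    // Number of fields in the current row.
    public int FieldCount
    {
        get { return _fields.Length; }
    }

    // Value of a field in the current row, by position.
    public string this[int index]
    {
        get { return _fields[index]; }
    }

    public void Close()
    {
        _source.Close();
    }

    public void Dispose()
    {
        Close();
    }
}
```

      A caller wraps any TextReader, for example new SimpleCsvReader(new StringReader("id,name\n1,Dino")), and then loops with while (reader.Read()), reading reader[0] and reader[1] for each row, much as it would with a data reader.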

Send questions and comments for Dino to cutting@microsoft.com.

Dino Esposito is an instructor and consultant based in Rome, Italy. He is author of Building Web Solutions with ASP.NET and ADO.NET and the upcoming Applied XML Programming for .NET, both from Microsoft Press. Get in touch with Dino at dinoe@wintellect.com.