This article may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist. To maintain the flow of the article, we've left these URLs in the text, but disabled the links.

Article
02/26/2008

MSDN Magazine

SAX, the Simple API for XML

Aaron Skonnard

Download the code for this article: XMLFiles1100.exe (108KB)
Browse the code for this article at Code Center: SAX ContentHandler Implementations

Today there are two widely accepted APIs for working with XML: the Simple API for XML (SAX) and the Document Object Model (DOM). Both APIs define a set of abstract programmatic interfaces that model the XML Information Set (Infoset). The DOM models the Infoset through a hierarchy of generic nodes that support well-defined interfaces. Due to the DOM's tree-based model, most implementations demand that the entire XML document be contained in memory while processing. SAX, on the other hand, models the Infoset through a linear sequence of well-known method calls. Because SAX doesn't demand resources for an in-memory representation of the document, it's a lightweight alternative to the DOM.
      The distinction between SAX and DOM is best illustrated by comparing them to database cursors. SAX is similar to firehose-mode cursors (read-only and forward-only) while DOM is closer to a static cursor that allows full traversal and updates. There are obvious trade-offs when choosing one cursor type over the other, although everyone agrees that both cursor types are valuable. The same holds true when choosing one XML API over the other.
      At first glance DOM and SAX may seem wildly different, but each is simply a different projection of the Infoset onto programmatic types. This promotes a synergy between SAX and DOM that creates a mountain of possibilities. SAX can be used to build DOM trees (or more interestingly, portions of DOM trees). Conversely, developers can traverse DOM trees and emit SAX streams.
      The DOM has been sanctioned by the W3C, while SAX hasn't even touched a standards body yet. SAX was created by a group of developers coordinated by David Megginson on the XML-DEV mailing list. Despite its humble beginnings, virtually all XML tool vendors, including Microsoft, have adopted SAX. The official SAX package and documentation can be found on Megginson's Web site at https://www.megginson.com/SAX.

SAX Interfaces

SAX defines a set of interfaces that model the Infoset (see Figure 1). SAX was originally defined for the Java programming language using Java interface definitions. Because the Java-language interfaces are not language-neutral, it's up to tool vendors to decide exactly how the SAX interfaces should map to their specific language. At some point in the future, however, standard language bindings will undoubtedly emerge.

Microsoft added support for SAX in the last two preview releases of MSXML 3.0. In May, they added C++-based support for SAX. A few months later, in July, they added SAX support in Visual Basic. Each of these language bindings requires a different set of interfaces that reflect the individual language and type restrictions. The names of the MSXML 3.0 SAX interfaces are shown in Figure 2. The names of the MSXML interfaces are the same as they are in the Java language, only prefixed with ISAX for C++ and IVBSAX for Visual Basic (ISAXContentHandler and IVBSAXContentHandler, for example). Throughout the rest of this column, I'll refer to the SAX interfaces by their native names, and you can perform the translation of your choice.

ContentHandler

      ContentHandler is the primary SAX interface that models the Infoset's core information items (the DTDHandler, LexicalHandler, and DeclHandler interfaces model the rest of the Infoset, plus more). A component that wants to receive an XML document implements ContentHandler, while a component that wants to send an XML document consumes ContentHandler. Figure 3 describes each of the ContentHandler members.
      Notice that ContentHandler contains several startXXX and endXXX member pairs that model different Infoset abstractions. For example, the startDocument method call indicates the beginning of the document information item and its descendants. Every method call that comes after startDocument represents a descendant of the document information item. The endDocument method call signals the end of the document's descendants. The same holds true for startElement and endElement.
      In other words, SAX models the structure of an XML document through a linear sequence of nested start and end method calls. For example, consider the following code that consumes the IVBSAXContentHandler interface:

  Public Sub GenerateXML(ch As IVBSAXContentHandler)
  Dim atts As New CVBSAXAttributesImpl
  ch.startDocument
     ch.startElement "urn:foons", "foo", "f:foo", atts
        ch.startElement "urn:foons", "bar", "f:bar", atts
        ch.endElement "urn:foons", "bar", "f:bar"
     ch.endElement "urn:foons", "foo", "f:foo"
  ch.endDocument
End Sub

This example generates an XML document that could be serialized as follows:

  <f:foo xmlns:f='urn:foons'>
   <f:bar/>
</f:foo>

Notice that startElement and endElement take three name parameters: the element's namespace URI, local name, and qualified name (QName). It's up to the caller to specify the correct namespace information for each element and attribute in the startElement and endElement method calls.
ContentHandler implementations also need to know when a specific namespace prefix is considered in-scope. This is not only recommended by the Infoset, but it's required to expand namespace prefixes that aren't automatically handled by the XMLReader. There are many situations, such as XML schemas, in which namespace prefixes are used in element and attribute content.

  <circle xsi:type='geo:shape'>

In order to properly process the xsi:type attribute, the ContentHandler implementation needs to know the namespace URI that is associated with the geo prefix. The startPrefixMapping and endPrefixMapping methods supply this information to ContentHandler implementations. StartPrefixMapping is called just before startElement for the element on which the namespace mapping should begin. EndPrefixMapping is called just after the endElement call that closes the corresponding startElement call. The following consumer code illustrates when they should be called for the foo document fragment cited previously:

  Public Sub GenerateXML(ch As IVBSAXContentHandler)
  Dim atts As New CVBSAXAttributesImpl
  ch.startDocument
     ch.startPrefixMapping "f", "urn:foons"
     ch.startElement "urn:foons", "foo", "f:foo", atts
        ch.startElement "urn:foons", "bar", "f:bar", atts
        ch.endElement "urn:foons", "bar", "f:bar"
     ch.endElement "urn:foons", "foo", "f:foo"
     ch.endPrefixMapping "f"
  ch.endDocument
End Sub
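
For readers who want to try the same call sequence outside of MSXML, here is a minimal sketch in Python, whose standard xml.sax package follows the SAX 2 interfaces. It is an analogue, not the MSXML API: the generate_xml function name is made up for illustration, and XMLGenerator (a stock ContentHandler that serializes what it receives) stands in for the receiving component.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesNSImpl


def generate_xml(ch):
    """Drive any ContentHandler with the same calls as the GenerateXML routine."""
    no_atts = AttributesNSImpl({}, {})  # empty, namespace-aware attribute list
    ch.startDocument()
    # The prefix mapping is announced just before the element it applies to...
    ch.startPrefixMapping("f", "urn:foons")
    ch.startElementNS(("urn:foons", "foo"), "f:foo", no_atts)
    ch.startElementNS(("urn:foons", "bar"), "f:bar", no_atts)
    ch.endElementNS(("urn:foons", "bar"), "f:bar")
    ch.endElementNS(("urn:foons", "foo"), "f:foo")
    # ...and ended just after the matching endElement call.
    ch.endPrefixMapping("f")
    ch.endDocument()


out = io.StringIO()
generate_xml(XMLGenerator(out, encoding="utf-8"))
xml_text = out.getvalue()
print(xml_text)
```

Note that the Python binding takes the namespace URI and local name as a single tuple rather than as two separate arguments; the information carried by each call is otherwise the same.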

      The last parameter to startElement is a reference to an Attributes interface, which models the element's collection of attributes. This interface makes it possible to access attribute information by index or name. When accessing information by name, either the QName or the namespace name (namespace URI plus local name) may be used for retrieval. The accessible attribute information includes the attribute's namespace URI, local name, QName, type, and value.
      Consumers of ContentHandler are responsible for supplying an object that implements Attributes as the last argument to startElement. Implementations of ContentHandler will then consume the supplied attributes through the Attributes interface. Since all ContentHandler consumers need an Attributes implementation, and it's mostly boilerplate code, the standard Java-language package includes a helper class called AttributesImpl that's been designed for this purpose.
      The current release of MSXML doesn't provide the equivalent of AttributesImpl (it's scheduled for October, after this issue went to press), but it's trivial to implement. I've provided a sample implementation in Visual Basic that is equivalent to the standard Java-language implementation. Besides implementing the IVBSAXAttributes interface, my implementation (CVBSAXAttributesImpl) also provides convenience routines that make it easier to build an attribute list. The following consumer code illustrates the building of an attribute list before calling startElement:

  Dim atts As CVBSAXAttributesImpl
Set atts = New CVBSAXAttributesImpl
atts.addAttribute "urn:personns", "fname", "p:fname", _
    "CDATA", "jane"
atts.addAttribute "urn:personns", "lname", "p:lname", _
    "CDATA", "smith"
ch.startElement "urn:personns", "person", "p:person", atts
atts.clear

This sample generates a person element that could be serialized like this:

  <p:person p:fname='jane' p:lname='smith' 
          xmlns:p='urn:personns'/>
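
The two lookup styles described above (by namespace name and by QName) can be tried with Python's xml.sax counterpart of AttributesImpl. This is a sketch using the Python standard library class AttributesNSImpl, not the CVBSAXAttributesImpl class from the download; the PERSON_NS constant is just a convenience name.

```python
from xml.sax.xmlreader import AttributesNSImpl

PERSON_NS = "urn:personns"

# First mapping: (namespace URI, local name) -> value.
# Second mapping: (namespace URI, local name) -> QName.
atts = AttributesNSImpl(
    {(PERSON_NS, "fname"): "jane", (PERSON_NS, "lname"): "smith"},
    {(PERSON_NS, "fname"): "p:fname", (PERSON_NS, "lname"): "p:lname"},
)

by_name = atts.getValue((PERSON_NS, "fname"))   # lookup by namespace name
by_qname = atts.getValueByQName("p:lname")      # lookup by QName
qnames = sorted(atts.getQNames())
count = atts.getLength()

print(by_name, by_qname, qnames, count)
```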

      The ContentHandler interface uses the characters method to model a sequence of character information items that occurs within element content. The identical ignorableWhitespace method models any ignorable whitespace that occurs within element content. Whitespace that occurs within element-only content is considered ignorable because it's only present for readability. The only way a processor can determine that a content model is element-only is by looking at the associated DTD or Schema. If no DTD or Schema is present, whitespace is always considered significant. The current MSXML SAX processor is nonvalidating, so all whitespace characters are always considered significant and are therefore passed through as characters instead of ignorableWhitespace.
      The MSXML definitions of characters and ignorableWhitespace are slightly different from the original Java-language definitions, which took three arguments: a character buffer, a start position, and a length. In Visual Basic, for example, it makes sense for these methods to take just a single String argument. Consumers of ContentHandler are free to pass the characters within a given element through a single characters method call or through multiple method calls in smaller chunks.
      The following code adds some text to element content:

  Public Sub GenerateXML(ch As IVBSAXContentHandler)
  Dim atts As New CVBSAXAttributesImpl
  ch.startDocument
     ch.startElement "", "foo", "foo", atts
        ch.characters "hello "
        ch.characters "world"
     ch.endElement "", "foo", "foo"
  ch.endDocument
End Sub

The resulting serialized document would look like this:

  <foo>hello world</foo>
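
Because text may arrive in several chunks, an implementation that needs the complete text of an element usually has to buffer it. The sketch below shows one way to do that in Python's xml.sax; the TextCollector class is hypothetical and, for brevity, assumes elements that contain only text.

```python
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl


class TextCollector(ContentHandler):
    """Joins the chunks delivered by characters() into one string per element."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self.text_by_element = {}

    def startElement(self, name, attrs):
        self._chunks = []                 # start a fresh buffer

    def characters(self, content):
        self._chunks.append(content)      # may be called more than once

    def endElement(self, name):
        self.text_by_element[name] = "".join(self._chunks)


collector = TextCollector()
collector.startDocument()
collector.startElement("foo", AttributesImpl({}))
collector.characters("hello ")            # same two chunks as the sample above
collector.characters("world")
collector.endElement("foo")
collector.endDocument()
print(collector.text_by_element)
```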

The last couple of ContentHandler content-related methods are meant for modeling processing instructions and skipped entities. ProcessingInstruction simply takes target and data arguments in the form <?target data data data?>. The data portion of a processing instruction is everything that comes after the whitespace that separates the target. The skippedEntity method signals that the caller skipped a specific entity identified by name. The following code illustrates these methods:

  Public Sub GenerateXML(ch As IVBSAXContentHandler)
  Dim atts As New CVBSAXAttributesImpl
  ch.startDocument
     ch.startElement "", "foo", "foo", atts
     ch.skippedEntity "ouch"
     ch.endElement "", "foo", "foo"
     ch.processingInstruction "hack", "attack"
  ch.endDocument
End Sub

The document that this code represents might look something like this, assuming that the ouch entity is skipped by the caller:

  <!DOCTYPE foo [
  <!ENTITY ouch SYSTEM 'urn:skip-me-baby'>
]>
<foo>&ouch;</foo><?hack attack?>
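
Seen from the receiving side, a processing instruction arrives already split into its target and data. The following Python xml.sax sketch records the calls made by the library's default parser for a small document; EventRecorder is a made-up name, and the input deliberately contains no entity reference, so skippedEntity is overridden only to show where such a notification would land.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class EventRecorder(ContentHandler):
    """Keeps a list of the content-related calls it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("startElement", name))

    def endElement(self, name):
        self.events.append(("endElement", name))

    def processingInstruction(self, target, data):
        self.events.append(("processingInstruction", target, data))

    def skippedEntity(self, name):
        # Whether and when this is called depends on the parser in use.
        self.events.append(("skippedEntity", name))


recorder = EventRecorder()
xml.sax.parseString(b"<foo/><?hack attack?>", recorder)
print(recorder.events)
```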

The last member of ContentHandler is used for passing a Locator interface reference to the ContentHandler implementation. The implementation can use the Locator interface to query for contextual information about the caller, such as line and column number and public/system identifiers of the current entity. This information comes in handy, especially when the caller is an XML parser.
The standard Java-language package also provides a default Locator implementation called LocatorImpl. Consumers of ContentHandler can access this class directly to provide Locator functionality. Implementations of ContentHandler may also find it useful to use this class to make static snapshots or copies of a Locator object. I've provided a sample implementation of IVBSAXLocator called CVBSAXLocatorImpl that you can use to test these ideas (see Figure 4). The code illustrates how a typical ContentHandler consumer would use a Locator object.
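
As an illustration of the receiving side, this sketch uses Python's xml.sax, where the parser hands a Locator to the handler through setDocumentLocator before any other call. PositionTracker is a hypothetical class; it simply notes the line on which each element starts.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class PositionTracker(ContentHandler):
    """Remembers the line number reported by the Locator for each element."""

    def __init__(self):
        super().__init__()
        self.locator = None
        self.lines = {}

    def setDocumentLocator(self, locator):
        # Keep the reference; it is only meaningful during the parse.
        self.locator = locator

    def startElement(self, name, attrs):
        self.lines[name] = self.locator.getLineNumber()


tracker = PositionTracker()
xml.sax.parseString(b"<a>\n  <b/>\n</a>", tracker)
print(tracker.lines)
```

The Locator is only valid while a callback is running, which is why a snapshot class such as LocatorImpl is useful when the position has to be kept for later.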

Now that I've looked at ContentHandler fundamentals and various ContentHandler consumer examples, let's look at some sample ContentHandler implementations.

ContentHandler Implementations

      The canonical example of a ContentHandler implementation simply serializes the received method calls back out as an XML 1.0 namespace-aware document. The code for such an implementation needs to follow the syntactical productions defined by XML 1.0 and Namespaces for serializing each information item to the output stream. I've provided a fairly generic implementation of this in a Visual Basic class named CSerializer. The partial code for CSerializer is shown in Figure 5.
      Most of the ContentHandler method implementations are straightforward, but startElement and startPrefixMapping require special attention. Serializing the start tag of an element requires three components: the element's QName, the element's list of attributes, and any namespace declarations. How you accomplish this actually depends on the XMLReader's configuration with respect to namespace processing.
      SAX processors are required to report namespace names (namespace URI plus local name) if the namespaces and namespace-prefixes features are set appropriately, but they are not required to report QNames by default. Also, because namespace-aware processors don't consider namespace declarations to be standard XML attributes, those declarations are not supposed to be delivered via the Attributes interface by default. However, SAX processors can be configured with respect to namespace behavior. SAX defines two well-known features that have the following IDs:

  http://xml.org/sax/features/namespaces
http://xml.org/sax/features/namespace-prefixes

These features have an initial state of true and false, respectively.
      In this state, the namespace processing behaves as described in the previous paragraph. If namespace-prefixes is set to true, the processor is supposed to deliver all namespace information, including the namespace declarations, as attributes. If namespaces is set to false, the processor is no longer required to deliver namespace names or the startPrefixMapping/endPrefixMapping method calls. This essentially makes the processor behave as if it didn't understand namespaces.
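
The effect of the namespaces feature can be observed with a short experiment. This sketch uses Python's xml.sax rather than MSXML; note that Python's bundled parser starts with namespace processing switched off, unlike the SAX default described above, so the feature is set explicitly in both runs. The Recorder class and the record function are names invented for the example.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

DOC = b"<f:foo xmlns:f='urn:foons'/>"


class Recorder(ContentHandler):
    def __init__(self):
        super().__init__()
        self.events = []

    def startPrefixMapping(self, prefix, uri):
        self.events.append(("startPrefixMapping", prefix, uri))

    def startElementNS(self, name, qname, attrs):
        # Called only when namespace processing is on.
        self.events.append(("startElementNS", name))

    def startElement(self, name, attrs):
        # Called only when namespace processing is off.
        self.events.append(("startElement", name, sorted(attrs.getNames())))


def record(namespaces_on):
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, namespaces_on)
    handler = Recorder()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(DOC))
    return handler.events


aware = record(True)     # namespace names plus prefix-mapping calls
unaware = record(False)  # raw QName; the declaration is an ordinary attribute
print(aware)
print(unaware)
```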
      The MSXML implementation does report QNames by default, so this was not a concern with the implementation of CSerializer. The initial state of namespace-prefixes, however, is set to true, so the namespace declarations do show up in the IVBSAXAttributes collection. My CSerializer class was designed for namespace-aware processors, so it expects the namespace declarations to be delivered through startPrefixMapping and endPrefixMapping (and not through the IVBSAXAttributes interface). This means the namespace-prefixes property must be explicitly set to false when using CSerializer with MSXML's implementation of XMLReader (you'll see how to do this in the XMLReader section).
      With that explanation in place, the implementations of startPrefixMapping and startElement should make more sense. StartPrefixMapping saves all namespace declarations that occur before a startElement call. StartElement then serializes them back out as standard namespace declarations. Executing any of the ContentHandler consumer examples that were shown in the previous section against an instance of CSerializer would generate a well-formed XML 1.0 document:

  Dim ch as IVBSAXContentHandler
Set ch = new CSerializer
GenerateXML ch
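
The interplay between startPrefixMapping and startElement described above can be sketched in a few lines. The following is not the CSerializer source (that is in Figure 5); it is a hypothetical, stripped-down Python xml.sax handler named MiniSerializer that shows only the technique of holding namespace declarations until the next start tag is written. It assumes the caller supplies QNames.

```python
import io
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import escape, quoteattr
from xml.sax.xmlreader import AttributesNSImpl


class MiniSerializer(ContentHandler):
    def __init__(self, out):
        super().__init__()
        self._out = out
        self._pending = []  # (prefix, uri) pairs waiting for the next start tag

    def startPrefixMapping(self, prefix, uri):
        self._pending.append((prefix, uri))

    def startElementNS(self, name, qname, attrs):
        self._out.write("<" + qname)
        # Emit the saved mappings as ordinary namespace declarations.
        for prefix, uri in self._pending:
            decl = "xmlns:" + prefix if prefix else "xmlns"
            self._out.write(" %s=%s" % (decl, quoteattr(uri)))
        self._pending = []
        for attr_name in attrs.getNames():
            self._out.write(" %s=%s" % (attrs.getQNameByName(attr_name),
                                        quoteattr(attrs.getValue(attr_name))))
        self._out.write(">")

    def endElementNS(self, name, qname):
        self._out.write("</%s>" % qname)

    def characters(self, content):
        self._out.write(escape(content))  # escape &, < and >


out = io.StringIO()
ser = MiniSerializer(out)
atts = AttributesNSImpl({("urn:personns", "fname"): "jane"},
                        {("urn:personns", "fname"): "p:fname"})
ser.startDocument()
ser.startPrefixMapping("p", "urn:personns")
ser.startElementNS(("urn:personns", "person"), "p:person", atts)
ser.characters("R&D")
ser.endElementNS(("urn:personns", "person"), "p:person")
ser.endPrefixMapping("p")
ser.endDocument()
serialized = out.getvalue()
print(serialized)
```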

CSerializer allows the client to register an output stream before an XMLReader begins making the content-related method calls (startElement, processingInstruction, and so on). The output stream can be one of three types: a reference to a text file (Scripting.TextStream), a reference to the ASP Response object (ASPTypeLibrary.Response), or a standard Visual Basic-based textbox control for display and testing purposes. It would be fairly easy to extend the class to support additional output types. The following example illustrates how to use CSerializer to generate an XML file on disk:

  Dim ser as New CSerializer
Dim fso As New Scripting.FileSystemObject
Dim ts As Scripting.TextStream
Set ts = fso.CreateTextFile("c:\temp\out.xml", True, True)
ser.registerOutputStream ts
GenerateXML ser

Generating XML from within an ASP page is just as trivial, as shown by the following example:

  <%
Response.ContentType = "text/xml"
' this class generates a document via 
' IVBSAXContentHandler
Set generator = _
Server.CreateObject("XMLFilesVBSAX.CGenerator")
Set ch = Server.CreateObject("XMLFilesVBSAX.CSerializer")
ch.RegisterOutputStream Response
generator.generateXML ch
%>

A more practical implementation of ContentHandler could deserialize an XML document into an application-specific type. For example, consider the following XML document that represents a CInvoice class:

  <Invoice xmlns='urn:www-develop-com:invoices'>
   <InvoiceID>2222</InvoiceID>
   <CustomerName>Jane Smith</CustomerName>
   <LineItems>
     <LineItem>
       <Sku>134</Sku>
       <Description>Don's Boxers</Description>
       <Price>9.95</Price>
     </LineItem>
   </LineItems>
</Invoice>

Another implementation of ContentHandler could simply deserialize this document back into a native CInvoice instance. To accomplish this, the ContentHandler implementation must maintain several pieces of contextual information during processing. One way to do this is to build a simple state machine that represents the current element context. Using a stack, startElement can push the current element state so future method calls can figure out which element is currently being processed, as shown in Figure 6. EndElement needs to pop the current element state off the stack to keep things balanced. It's then up to the characters method to populate the instance with the appropriate data at the appropriate time. Figure 7 shows the complete class file.
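
To make the stack-based approach concrete, here is a minimal sketch in Python's xml.sax. It is not the CInvoice code from Figures 6 and 7: the Invoice, LineItem, and InvoiceBuilder names are invented for the example, and it fills in each field when the element ends (rather than inside characters) so that text delivered in several chunks is handled correctly.

```python
import io
import xml.sax
from dataclasses import dataclass, field
from xml.sax.handler import ContentHandler, feature_namespaces

INVOICE_NS = "urn:www-develop-com:invoices"
INVOICE_XML = b"""<Invoice xmlns='urn:www-develop-com:invoices'>
  <InvoiceID>2222</InvoiceID>
  <CustomerName>Jane Smith</CustomerName>
  <LineItems>
    <LineItem>
      <Sku>134</Sku>
      <Description>Don's Boxers</Description>
      <Price>9.95</Price>
    </LineItem>
  </LineItems>
</Invoice>"""


@dataclass
class LineItem:
    sku: str = ""
    description: str = ""
    price: float = 0.0


@dataclass
class Invoice:
    invoice_id: str = ""
    customer_name: str = ""
    line_items: list = field(default_factory=list)


class InvoiceBuilder(ContentHandler):
    def __init__(self):
        super().__init__()
        self.invoice = Invoice()
        self._stack = []  # local names of the currently open elements
        self._text = []   # character chunks of the current element

    def startElementNS(self, name, qname, attrs):
        uri, local = name
        self._stack.append(local if uri == INVOICE_NS else None)
        self._text = []
        if self._stack[-1] == "LineItem":
            self.invoice.line_items.append(LineItem())

    def characters(self, content):
        self._text.append(content)

    def endElementNS(self, name, qname):
        current = self._stack.pop()  # keep the stack balanced
        parent = self._stack[-1] if self._stack else None
        text = "".join(self._text).strip()
        self._text = []
        if parent == "Invoice" and current == "InvoiceID":
            self.invoice.invoice_id = text
        elif parent == "Invoice" and current == "CustomerName":
            self.invoice.customer_name = text
        elif parent == "LineItem":
            item = self.invoice.line_items[-1]
            if current == "Sku":
                item.sku = text
            elif current == "Description":
                item.description = text
            elif current == "Price":
                item.price = float(text)


builder = InvoiceBuilder()
parser = xml.sax.make_parser()
parser.setFeature(feature_namespaces, True)
parser.setContentHandler(builder)
parser.parse(io.BytesIO(INVOICE_XML))
print(builder.invoice)
```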
Instead of deserializing the invoice document into a class, the ContentHandler implementation could transform the document into an HTML report. The provided CInvoice2HTML class performs such a transformation that then produces the HTML page shown in Figure 8.

Figure 8 Report Page

I've provided several other ContentHandler implementations with this column that illustrate how SAX and DOM can be used together. The CDOM2SAX class demonstrates how to walk a DOM tree and emit ContentHandler method calls. The CSAX2DOM class demonstrates how to build a DOM tree from a stream of ContentHandler method calls. And finally, instead of building a full DOM tree for a document, CSAXFilter2DOM demonstrates how to filter for certain elements and only build the identified element subtrees. These classes are available as part of the sample download for this issue, located at the link at the top of this article.
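
The idea behind a filtering SAX-to-DOM builder can be tried out with Python's xml.dom.pulldom module, which sits on top of a SAX parser and builds a DOM subtree only when asked. This is a different facility from the CSAXFilter2DOM class in the download, used here purely as an illustration; the sample data and the child_text helper are assumptions made for the sketch.

```python
from xml.dom import pulldom

ITEMS_XML = """<LineItems>
  <LineItem><Sku>134</Sku><Price>9.95</Price></LineItem>
  <LineItem><Sku>200</Sku><Price>120.00</Price></LineItem>
  <LineItem><Sku>305</Sku><Price>45.50</Price></LineItem>
</LineItems>"""


def child_text(element, tag):
    """Return the concatenated text of the first <tag> descendant."""
    node = element.getElementsByTagName(tag)[0]
    return "".join(c.data for c in node.childNodes
                   if c.nodeType == c.TEXT_NODE)


expensive_skus = []
events = pulldom.parseString(ITEMS_XML)
for event, node in events:
    # Everything streams past as events until an element of interest appears.
    if event == pulldom.START_ELEMENT and node.tagName == "LineItem":
        events.expandNode(node)  # build a DOM subtree for this element only
        if float(child_text(node, "Price")) > 40:
            expensive_skus.append(child_text(node, "Sku"))

print(expensive_skus)
```

Only one LineItem subtree exists in memory at a time, so the cost stays proportional to the size of an item rather than the size of the document.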

Other Interfaces

      ContentHandler implementations can also implement the ErrorHandler, DTDHandler, and EntityResolver interfaces. The ErrorHandler interface is used to provide custom handling of caller-generated errors. An implementation of the IVBSAXErrorHandler interface would look like the code shown in Figure 9.
      XML 1.0 clearly defines what constitutes a warning, an error, and a fatal error. Fatal errors are typically violations of the well-formedness rules, while errors are typically violations of application validity constraints (DTD). A warning indicates an exceptional condition that's less serious than an error. Currently, the MSXML implementation reports all errors as fatal errors. It also makes some slight changes to the Java-language interface signatures to account for the exception handling model in COM.
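
The three severity levels map onto three ErrorHandler methods. The sketch below uses Python's xml.sax (not the IVBSAXErrorHandler code in Figure 9) to show a handler that records what it is told and rethrows fatal errors; RecordingErrorHandler is a hypothetical name, and the input is deliberately not well-formed.

```python
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler


class RecordingErrorHandler(ErrorHandler):
    def __init__(self):
        self.reports = []

    def warning(self, exception):
        self.reports.append(("warning", exception.getMessage()))

    def error(self, exception):
        # Recoverable: note it and let the parser carry on.
        self.reports.append(("error", exception.getMessage()))

    def fatalError(self, exception):
        # Well-formedness violation: note it, then stop the parse.
        self.reports.append(("fatalError", exception.getMessage()))
        raise exception


errors = RecordingErrorHandler()
try:
    # <b> is never closed, so the document is not well-formed.
    xml.sax.parseString(b"<a>\n<b>\n</a>", ContentHandler(), errors)
    parsed_ok = True
except xml.sax.SAXParseException as exc:
    parsed_ok = False
    print("stopped at line", exc.getLineNumber())

print(errors.reports)
```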
      The ErrorHandler interface essentially allows the XMLReader to signal the ContentHandler implementation that it wants to abort processing. Conversely, ContentHandler implementations can indicate to the XMLReader that they want to abort processing. This can be accomplished by simply raising an application-specific exception. This is especially useful for aborting processing once the implementation finds what it is looking for:

  Private Sub IVBSAXContentHandler_characters(ByVal strChars As String)
  ' I found what I was looking for, abort processing
  Err.Raise vbObjectError + errDone, "characters", _
            "I got what I want, let's go play!"
End Sub
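
The same early-exit technique looks like this in Python's xml.sax, where an exception raised inside a handler propagates out of the parse call. The Found exception and FirstSkuFinder class are invented for the sketch, which raises from endElement so that the element's text has been received in full.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class Found(Exception):
    """Application-specific signal: the value has been located."""

    def __init__(self, value):
        super().__init__(value)
        self.value = value


class FirstSkuFinder(ContentHandler):
    def __init__(self):
        super().__init__()
        self.elements_seen = 0
        self._in_sku = False
        self._chunks = []

    def startElement(self, name, attrs):
        self.elements_seen += 1
        self._in_sku = (name == "Sku")
        self._chunks = []

    def characters(self, content):
        if self._in_sku:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "Sku":
            # Got what we wanted: abort the rest of the parse.
            raise Found("".join(self._chunks))


DOC = b"<Items><Sku>134</Sku><Sku>200</Sku><Other/></Items>"
finder = FirstSkuFinder()
try:
    xml.sax.parseString(DOC, finder)
    first_sku = None
except Found as done:
    first_sku = done.value

print(first_sku, finder.elements_seen)
```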

The DTDHandler interface makes it possible to process unparsed entities and notations that appear in the document's DTD. Unparsed entities are used to attach non-XML data streams (identified by a system/public ID pair) to an XML document for application-level processing. Every unparsed entity is also associated with a notation that specifies what type of data the entity represents. The DTDHandler interface contains two methods for processing these items: unparsedEntityDecl and notationDecl.

SAX also defines the EntityResolver interface for custom resolution of external entities. The interface contains a single method, resolveEntity, which allows the implementation to supply application-specific resolution rules for external entity identifiers. MSXML supports the DTDHandler interface, but does not currently support EntityResolver. In addition, the current MSXML release doesn't implement anything equivalent to the SAX InputSource class, which was defined to encapsulate all SAX I/O.

Extended Interfaces

Although not part of core SAX, there are two additional interfaces, LexicalHandler and DeclHandler, that make it possible to process lexical and DTD-related document information. In the Java-language space, these interfaces are distributed in a separate package called SAX-ext to emphasize that processors are not required to support them. Because these interfaces are considered extensions, they are not registered with an XMLReader in the same way as the other interfaces, as you'll see in the next section.
The LexicalHandler interface makes it possible to signal comments, the start/end of a CDATA section, the start/end of a DTD, and the start/end of an entity reference. In order to make the CSerializer implementation support comments and CDATA sections, it must also implement the IVBSAXLexicalHandler interface. The following code illustrates how CSerializer handles the comment and CDATA section serialization:

  Implements IVBSAXLexicalHandler
Private Sub IVBSAXLexicalHandler_comment(ByVal strChars As String)
    appendXML "<!--" & strChars & "-->"
End Sub
Private Sub IVBSAXLexicalHandler_startCDATA()
    appendXML "<![CDATA["
End Sub
Private Sub IVBSAXLexicalHandler_endCDATA()
    appendXML "]]>"
End Sub
' remaining methods omitted for clarity
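
A comparable experiment is possible in Python's xml.sax, whose bundled parser accepts a lexical handler through the lexical-handler property. In this sketch a single hypothetical class, LexicalSerializer, plays both the ContentHandler and the lexical handler role; attributes are left out to keep it short.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, property_lexical_handler
from xml.sax.saxutils import escape


class LexicalSerializer(ContentHandler):
    def __init__(self):
        super().__init__()
        self.parts = []
        self._in_cdata = False

    # --- content events ---
    def startElement(self, name, attrs):
        self.parts.append("<%s>" % name)

    def endElement(self, name):
        self.parts.append("</%s>" % name)

    def characters(self, content):
        # Text inside a CDATA section is written back without escaping.
        self.parts.append(content if self._in_cdata else escape(content))

    # --- lexical events ---
    def comment(self, content):
        self.parts.append("<!--" + content + "-->")

    def startCDATA(self):
        self._in_cdata = True
        self.parts.append("<![CDATA[")

    def endCDATA(self):
        self._in_cdata = False
        self.parts.append("]]>")

    def startDTD(self, name, public_id, system_id):
        pass  # not serialized in this sketch

    def endDTD(self):
        pass


serializer = LexicalSerializer()
parser = xml.sax.make_parser()
parser.setContentHandler(serializer)
parser.setProperty(property_lexical_handler, serializer)
parser.parse(io.BytesIO(b"<a><!-- note --><![CDATA[x < y]]>&amp;</a>"))
round_trip = "".join(serializer.parts)
print(round_trip)
```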

The DeclHandler interface makes it possible to signal DTD declarations including:

Element and attribute type declarations (<!ELEMENT...>, <!ATTLIST...>).
Internal entity declarations (<!ENTITY...>).
Parsed external entity declarations (<!ENTITY...SYSTEM...>) (recall that unparsed entity declarations are handled by DTDHandler).

XMLReader

XMLReader is the main interface that SAX producers implement. XMLReader serves several purposes. First, XML consumers use this interface to register their implementations of the other SAX interfaces (such as ContentHandler, ErrorHandler, and so on). Second, XMLReader makes it possible to configure the desired behavior of the SAX processor through two generic methods: setFeature and setProperty. And finally, XMLReader encapsulates parsing functionality.
For example, the following code snippet illustrates how Visual Basic-based consumer code would register its IVBSAXContentHandler and IVBSAXErrorHandler implementations with an instance of VBSAXXMLReader:

  Dim rdr As IVBSAXXMLReader, ser As CSerializer
Set rdr = New VBSAXXMLReader30
Set ser = New CSerializer 
Set rdr.contentHandler = ser
Set rdr.errorHandler = ser

XML consumers can also configure the behavior of the producer with respect to the different aspects of processing an XML document. In MSXML, the putFeature and putProperty methods (the counterparts of SAX's setFeature and setProperty) encapsulate this process. SAX defines several well-known features and properties that SAX processors are encouraged to recognize, although not all of the standard features or properties make sense for all implementations. Implementors are also free to add their own proprietary extensions as long as they use unique feature and property IDs (see the MSXML documentation for the features and properties it supports).
For example, as discussed earlier, setting the namespace-prefixes feature to false tells the SAX processor not to deliver namespace declarations as regular attributes. The following code illustrates how to accomplish this with putFeature:

  rdr.putFeature "http://xml.org/sax/features/namespace-prefixes", False

Also, for an application to register a LexicalHandler/DeclHandler interface with an XMLReader it must do so through a putProperty call, as shown here:

  rdr.putProperty "http://xml.org/sax/properties/lexical-handler", ser

Once the XML consumer has registered its interface implementations and configured the behavior, the producer can start pushing the document through the appropriate methods. One of the most common types of SAX producers is an XML parser. As the XML parser decomposes the XML 1.0 serialized stream, it can pass the information to the registered SAX consumer. The SAXXMLReader and VBSAXXMLReader coclasses do exactly this.
XML consumers use the parse method to indicate when the parser should begin. The MSXML implementation of parse takes a VARIANT, which can be of type VT_BSTR, SAFEARRAY of bytes (VT_ARRAY|VT_UI1), VT_UNK(IStream), and VT_UNK(ISequentialStream). MSXML also offers a parseURL method, which takes a URL string that identifies the file to be parsed. An XML consumer calls parse or parseURL after registering its interface implementations to begin parsing, as seen here:

  Dim reader As IVBSAXXMLReader, ser As CSerializer
Set reader = new VBSAXXMLReader30
Set ser = New CSerializer 
Set reader.contentHandler = ser
Set reader.errorHandler = ser
reader.parseURL "https://www.develop.com/courses.xml" 
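
The register-then-parse pattern is the same in other SAX bindings. As a rough Python xml.sax equivalent (an illustration, not MSXML), the sketch below registers the library's XMLGenerator serializer as the content handler and parses an in-memory document instead of a URL; the sample document is made up.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator

source = io.BytesIO(b"<courses><course>XML</course></courses>")
out = io.StringIO()

reader = xml.sax.make_parser()                              # the SAX producer
reader.setContentHandler(XMLGenerator(out, encoding="utf-8"))  # the consumer
reader.parse(source)                                        # push the document through

echoed = out.getvalue()
print(echoed)
```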

XMLFilter

      The SAX model of streaming Infosets between producer and consumer can be extended to include additional transparent interceptors. As long as a component understands the SAX interfaces, it can be placed between a producer and consumer to filter the method calls going in either direction. This makes it possible to build SAX pipelines that consist of several components, each of which has a specific processing responsibility.
      SAX codifies this practice with the XMLFilter interface, which extends XMLReader and adds two methods for getting and setting a reference to an upstream XMLReader object. An implementation of XMLFilter also typically implements one or more consumer interfaces (such as ContentHandler) to which functionality can be added before the calls are pipelined.
      I've provided a default implementation of IVBSAXXMLFilter in CVBSAXXMLFilterImpl that illustrates how this might be accomplished (see Figure 10). This implementation simply writes every method call it receives to a log file before forwarding it on through the pipeline. Figure 10 illustrates how several instances of CVBSAXXMLFilterImpl can be chained together to create a SAX pipeline. Figure 11 illustrates the flow of method calls through the pipeline created in Figure 10. When this pipeline is used to process the invoice document, the resulting log file looks something like this:

  [FILTER 1] IVBSAXContentHandler_startDocument
[FILTER 2] IVBSAXContentHandler_startDocument
...
[FILTER 1] IVBSAXContentHandler_startElement: Invoice
[FILTER 2] IVBSAXContentHandler_startElement: Invoice

This type of stream-based processing model is not possible with a traditional DOM implementation. Finally, I pulled together several SAX methods into a unified sample, shown in Figure 12. The code for this sample is available from the link at the top of this article.

Figure 12 The SAX Sample

Conclusion

SAX offers a lightweight alternative to DOM. It facilitates searching through a huge XML document to extract small pieces of information, and it allows processing to be aborted as soon as the desired piece of information is found. SAX was designed for any task where the overhead of the DOM is too expensive.
However, the performance benefits of SAX come at a price. In many situations, such as advanced queries, SAX becomes quite burdensome because of the complexities involved in managing context while processing. When this is the case, most developers turn back either to the DOM or to some combination of SAX and DOM.

Aaron Skonnard is an instructor and researcher at DevelopMentor, where he develops the XML curriculum. Aaron coauthored Essential XML (Addison-Wesley Longman, 2000) and wrote Essential WinInet (Addison-Wesley Longman, 1998). Get in touch with Aaron at https://staff.develop.com/aarons.

From the November 2000 issue of MSDN Magazine.