XmlLite Reader Programming Overview

Article
10/27/2016

This topic provides an overview of programming with the XmlLite reader.

Overview

XmlLite is implemented as a DLL. A header file and a library come with XmlLite, but they are there for compiling and linking purposes only; all of the functionality is implemented in the DLL. To use XmlLite, you call methods as appropriate to parse the XML and access the nodes.

Pull vs. Push Parsers

Both XmlLite and SAX2 are non-caching, forward-only parsers. For this type of parser, there are two varieties of programming models:

Pull programming model. In a pull model, after your application initiates parsing, it calls methods repeatedly to retrieve, or pull, the next node. After retrieving a node, your application can look at the node, including its name, value, attributes, and more. As appropriate, the parser advances so that the next time the application retrieves a node, it gets the next one. XmlLite follows a pull programming model.
Push programming model. In a push model, your application registers event handlers to receive nodes and information from the parser. After initiating parsing, the parser calls these event handlers, sending (or pushing) nodes to them. The MSXML SAX2 parser follows this programming model.

There are several advantages to the pull programming model:

It is significantly easier to program for the pull model. Your application takes the form of a loop, where you get the next node, process it, and repeat, until the entire file is processed, or until your application determines that no more processing is required.
In contrast, the push model requires your application to keep a significant amount of state. Keeping state means that your application needs to keep variables that contain the context of the current node. The combination of implementing a number of event handlers and keeping state means that much more housekeeping is necessary to implement an application.
If your application is more suitable for a push parser, it is possible to wrap some classes around a pull parser and implement a push programming model. However, the reverse is not true.
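
The difference between the two models, and the way a push interface can be layered on a pull reader, can be sketched in portable C++. This is only an illustration under stated assumptions: `ToyPullReader`, `PushAll`, and `CountUntil` are hypothetical names, the "nodes" are pre-tokenized strings, and nothing here is part of the XmlLite API.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a pull reader. It walks a pre-tokenized list of
// node names instead of parsing XML, so only the control flow is realistic.
class ToyPullReader {
public:
    explicit ToyPullReader(std::vector<std::string> nodes)
        : nodes_(std::move(nodes)) {}

    // Pull model: the caller asks for the next node. Returns false at the end.
    bool Read(std::string& node) {
        if (pos_ >= nodes_.size()) return false;
        node = nodes_[pos_++];
        return true;
    }

private:
    std::vector<std::string> nodes_;
    std::size_t pos_ = 0;
};

// Pull-style consumer: the loop lives in the application, so it can stop as
// soon as it has seen what it needs.
inline std::size_t CountUntil(ToyPullReader& reader, const std::string& stop) {
    std::size_t count = 0;
    std::string node;
    while (reader.Read(node)) {
        if (node == stop) break;
        ++count;
    }
    return count;
}

// Push adapter wrapped around the pull reader: the adapter owns the loop and
// calls the registered handler once per node.
inline void PushAll(ToyPullReader& reader,
                    const std::function<void(const std::string&)>& onNode) {
    std::string node;
    while (reader.Read(node)) onNode(node);
}
```

With the pull consumer, local variables inside the loop carry the context. With the push adapter, the handler must capture any state it needs, which is the extra housekeeping the text describes.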

Both the push and pull type parsers are significantly different in semantic behavior and performance from the Document Object Model (DOM) programming interface.

You Must Use a Class that Implements IStream

Both the XmlLite reader and writer use a stream object for reading and writing the XML. Therefore, you must either use or write a class that implements the IStream interface.

After you have created a reader by calling CreateXmlReader, you attach the IStream object to the reader by calling the SetInput method. After you have created a writer by calling CreateXmlWriter, you attach the IStream object by calling the SetOutput method.

If you implement your own class, and if you want to read XML from or write XML to another type of stream, you can change the implementation of your IStream class as appropriate. If you want to use a simple in-memory IStream implementation, you can use the function CreateStreamOnHGlobal. If you want to use an IStream implementation that reads from and writes to a file, you can use SHCreateStreamOnFile.

For details about programming with the XmlLite reader, see Reading an XML Document Using XmlLite.

For convenience, you can get a default implementation of a simple IStream class from the default IStream implementation. This implementation is the same as that used in many of the XmlLite samples.

XmlLite is a non-blocking parser

In Windows Developer Preview, the XmlLite reader now handles an E_PENDING HRESULT returned from IStream::Read or ISequentialStream::Read by propagating E_PENDING to client code, and is able to resume parsing in the next function call. With this non-blocking feature, some action can take place on the input stream, or a CPU resource can be freed up, before parsing resumes on the rest of the stream. This can happen in the following two scenarios:

The input stream is customized by the user. For example, the customized stream interleaves XML content with binary data. The customized stream raises E_PENDING when the XML content is consumed so that the client code is able to suspend the parsing and handle binary data.

The following code example handles binary data when XML content with binary data is sent through a customized IStream. This method is part of the XMLLiteNonBlockingParser sample.

HRESULT HandleEPendingOnGetValue(IXmlReader * pReader, const WCHAR ** ppwszValue, UINT * pcwchValue)
{
    HRESULT hr = S_OK;

    do
    {
        hr = pReader->GetValue(ppwszValue, pcwchValue);

        if (hr == E_PENDING)
        {
            // Add code to handle binary data.
            …
        }
        else if (FAILED(hr))
        {
            // Add code to handle failure.
            …
            break;
        }
        else
        {
            // Add code to handle XML data.
            …
        }
    } while (hr != S_FALSE);

    return hr;
}

The input stream is a standard asynchronous stream. The E_PENDING HRESULT may be raised when the data is temporarily unavailable on the network. In this case, you need to try again later in a callback or after some interval of time.

The following code example shows how E_PENDING allows XmlLite to start parsing data as soon as it is available in the buffer, rather than waiting until the buffer is full before parsing. This method is part of the XMLLiteNonBlockingParser sample.

HRESULT CRssReader::Parse()  
{  
    HRESULT hr = S_OK;  

    XmlNodeType nodeType;  

    // Parses the stream with XmlLite when progressing through the stream.  
    while (S_OK == (hr = spXmlReader->Read(&nodeType)))  
    {  
        switch (nodeType)  
        {  
        case XmlNodeType_XmlDeclaration:  
            wprintf(L"XmlDeclaration\n");  
            break;  

        case XmlNodeType_Element:  
        …  

        }  
    }  

    return hr;  
}  

HRESULT STDMETHODCALLTYPE CRssReader::OnDataAvailable(DWORD grfBSCF, DWORD dwSize, FORMATETC *pformatetc, STGMEDIUM *pstgmed)  
{  
    ATLTRACE(L"IBindStatusCallback::OnDataAvailable\n");  

    HRESULT hr;  

    switch (grfBSCF)  
    {  
    case BSCF_FIRSTDATANOTIFICATION:  
        // Gets the stream on first data notification.  
        CHKHR(spXmlReader->SetInput(pstgmed->pstm));  
        break;  

    case BSCF_INTERMEDIATEDATANOTIFICATION:  
        // Tries to parse as much of the stream as possible.  
        hr = Parse();  

        // Only partial data may be available at this point, so E_PENDING is expected.
        if (hr == E_PENDING)
            hr = S_OK;

        if (FAILED(hr))
        {
            ATLTRACE(L"Error found with code 0x%02lX\n", hr);
            return hr;
        }
        break;  

    case BSCF_LASTDATANOTIFICATION:  
        // Tries to parse as much of the stream as possible.  
        hr = Parse();  

        if (FAILED(hr))
        {
            ATLTRACE(L"Error found with code 0x%02lX\n", hr);
            return hr;
        }
        break;  
    }  

    return S_OK;  
}

Important

When an E_PENDING HRESULT occurs, the parser will reset the cursor to the end of the last node that was parsed completely and save the status according to that position. The next time a read operation is called, parsing will continue from the next unparsed node.

Because of the nature of XmlLite, there is no callback or notify interface exposed to notify you that the stream is ready for further processing; however, you can leverage callbacks from the asynchronous stream to handle the E_PENDING return value to improve robustness.
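
The resume behavior described in this note can be modeled with a small portable sketch. It is a toy under stated assumptions: `ToyNonBlockingReader`, `ReadResult`, and `Drain` are hypothetical names, a "node" is just a `<...>` tag, and `ReadResult::Pending` plays the role that E_PENDING plays in XmlLite. It is not the XmlLite implementation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class ReadResult { Ok, Pending, EndOfInput };

// Hypothetical toy reader over a buffer that grows as data arrives.
class ToyNonBlockingReader {
public:
    // Simulates more bytes arriving on an asynchronous stream.
    void Append(const std::string& bytes) { buffer_ += bytes; }

    // Marks that no more data will arrive.
    void Close() { closed_ = true; }

    // Returns the next complete node. If the next node is incomplete, the
    // cursor stays at the end of the last complete node, so a later call
    // retries from the first unparsed node.
    ReadResult Read(std::string& node) {
        const std::size_t end = buffer_.find('>', cursor_);
        if (end == std::string::npos) {
            return closed_ ? ReadResult::EndOfInput : ReadResult::Pending;
        }
        node = buffer_.substr(cursor_, end - cursor_ + 1);
        cursor_ = end + 1;
        return ReadResult::Ok;
    }

private:
    std::string buffer_;
    std::size_t cursor_ = 0;
    bool closed_ = false;
};

// Parses as much as is currently available and reports why it stopped.
inline ReadResult Drain(ToyNonBlockingReader& reader,
                        std::vector<std::string>& out) {
    std::string node;
    ReadResult result;
    while ((result = reader.Read(node)) == ReadResult::Ok) out.push_back(node);
    return result;
}
```

A caller would invoke `Drain` from each data-available callback, treat `Pending` as "try again when more data arrives", and treat anything else as final.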

The E_PENDING HRESULT can be returned from any of the following method calls:

XmlLite Is Not an ActiveX Control

XmlLite is not an ActiveX control. It is simply a DLL. As such, it cannot be used from scripting languages, such as JScript or Visual Basic Scripting Edition (VBScript). Like any DLL, XmlLite can be used from C#; however, C# applications would more typically use the XML parsers in System.Xml. XmlLite is primarily meant to be used with C++.

XmlLite Is Not Thread Safe

XmlLite is not thread-safe. If you are writing a multi-threaded application, it is up to you to make sure that you use XmlLite in a thread-safe manner. For example, if one of your threads has called the method to retrieve the next node, and that method has not returned, you must programmatically prevent another thread from attempting to retrieve a node.
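
One common way to meet this requirement is to route every reader call through a single lock. The sketch below assumes a hypothetical `ToyReader` in place of a real reader, and `SynchronizedReader` is an illustrative wrapper, not something XmlLite provides. It advances the reader and copies the value under the same lock, so no other thread can move the reader between those two steps.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical reader that is not thread-safe on its own.
class ToyReader {
public:
    explicit ToyReader(std::vector<std::string> nodes)
        : nodes_(std::move(nodes)) {}

    bool Read() {
        if (pos_ >= nodes_.size()) return false;
        current_ = nodes_[pos_++];
        return true;
    }

    // The returned pointer refers to internal storage that the next Read reuses.
    const char* GetValue() const { return current_.c_str(); }

private:
    std::vector<std::string> nodes_;
    std::string current_;
    std::size_t pos_ = 0;
};

// Serializes access to the reader with one mutex.
class SynchronizedReader {
public:
    explicit SynchronizedReader(ToyReader reader) : reader_(std::move(reader)) {}

    // Moves to the next node and copies its value while holding the lock.
    bool ReadCopy(std::string& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!reader_.Read()) return false;
        out = reader_.GetValue();
        return true;
    }

private:
    std::mutex mutex_;
    ToyReader reader_;
};
```

Handing out a copy rather than the internal pointer means the caller can keep using the value after another thread has advanced the reader.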

XmlLite Does Not Provide Access to Typed Content

Unlike the managed System.Xml.XmlReader, the IXmlReader does not provide APIs to access typed content. You can declare that an element is of a certain type, but you cannot retrieve that element as that specified data type. All values are returned as strings, and it is up to the application to convert to types other than string.
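
Because every value arrives as a string, the application does the conversion itself. The following is a minimal sketch using only the C++ standard library; `TryParseLong` and `TryParseBool` are hypothetical helper names, and the accepted Boolean spellings (`true`, `false`, `1`, `0`) follow the XML Schema lexical forms rather than anything XmlLite defines.

```cpp
#include <cerrno>
#include <cwchar>
#include <string>

// Converts a wide string (for example, a copied element value) to a long.
// Returns false for empty input, trailing garbage, or out-of-range values.
inline bool TryParseLong(const std::wstring& text, long& out) {
    if (text.empty()) return false;
    errno = 0;
    wchar_t* end = nullptr;
    const long value = std::wcstol(text.c_str(), &end, 10);
    if (errno == ERANGE) return false;                          // overflow or underflow
    if (end == text.c_str() || *end != L'\0') return false;    // not fully numeric
    out = value;
    return true;
}

// Converts the XML Schema Boolean spellings to a bool.
inline bool TryParseBool(const std::wstring& text, bool& out) {
    if (text == L"true" || text == L"1") { out = true; return true; }
    if (text == L"false" || text == L"0") { out = false; return true; }
    return false;
}
```

Returning a success flag instead of a sentinel value keeps a malformed document from being mistaken for a legitimate zero.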

Support for Namespaces

XmlLite implements namespaces in compliance with the W3C Namespaces in XML specification. That specification can be found at http://www.w3.org/TR/REC-xml-names.

Capacity

Due to implementation and security constraints, the IXmlReader imposes bounds on some of the XML constructs. For example, consider the following:

<nameOfElement attrName="attrValue">text</nameOfElement>

All names are limited to 4GB in size. In the example above, nameOfElement and attrName have this limit.
Attribute values are limited to 2GB in size. In the example above, attrValue has this limit.
Text values are limited to 4GB in size without chunking, and are unlimited with chunking. To read values with chunking, you use the ReadValueChunk method. For an example of chunking, see Read an Xml Document Using Chunking.
The number of attributes on an element is limited to 64K.

If any of these limits are exceeded, then XmlLite returns an error.
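
The chunked-reading pattern mentioned above can be sketched as a loop over a small fixed buffer. This is a toy under stated assumptions: `ToyChunkSource` is a hypothetical class whose `ReadValueChunk` only resembles the general shape of a chunked read (buffer, capacity, count read) and returns a plain `bool` instead of an HRESULT.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical source of one large text value.
class ToyChunkSource {
public:
    explicit ToyChunkSource(std::wstring value) : value_(std::move(value)) {}

    // Copies up to chunkSize characters into buffer and reports how many were
    // copied. Returns false once the value is exhausted.
    bool ReadValueChunk(wchar_t* buffer, std::size_t chunkSize, std::size_t* read) {
        const std::size_t remaining = value_.size() - pos_;
        *read = std::min(chunkSize, remaining);
        value_.copy(buffer, *read, pos_);
        pos_ += *read;
        return *read > 0;
    }

private:
    std::wstring value_;
    std::size_t pos_ = 0;
};

// Processes the value piece by piece without ever holding all of it in one
// buffer. Here the "processing" is simply counting one character.
inline std::size_t CountCharacters(ToyChunkSource& source, wchar_t target) {
    wchar_t buffer[4];  // deliberately tiny to force several chunks
    std::size_t read = 0;
    std::size_t hits = 0;
    while (source.ReadValueChunk(buffer, 4, &read)) {
        hits += static_cast<std::size_t>(std::count(buffer, buffer + read, target));
    }
    return hits;
}
```

The memory used is bounded by the buffer size, which is why chunking removes the size limit on text values.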

Special Handling of Nodes

The DOCTYPE node is handled in a special way. When you read a DOCTYPE node, the NodeType is XmlNodeType_DocumentType. The PUBLIC and/or SYSTEM literals are presented as attributes with the names "PUBLIC" and "SYSTEM", respectively. The internal subset can be accessed by using GetValue, and the subset will be returned as a string.

The XML declaration is also handled in a special way. When you read it, the NodeType is XmlNodeType_XmlDeclaration. The node has no value and has "xml" as the local name. It has no namespace and has up to three attributes: version, standalone, and encoding.

White Space Normalization

All white space in XmlLite is normalized as per the XML 1.0 specification. For more information, see Section 2.10 of the Extensible Markup Language (XML) 1.0 Specification at http://www.w3.org/TR/1998/REC-xml-19980210#sec-white-space.

Error Handling

There are three classes of errors in XmlLite:

Argument errors. These errors are recoverable and further processing can continue. An example of an argument error is passing an incorrect or illegal value for an argument.
Parsing errors. These errors are not recoverable, but the same instance of the reader can be used by resetting the input source.
All other errors. All other errors are not recoverable, and the instance of the reader can no longer be reused.

Because some errors are not recoverable and may lead to unexpected behavior in further processing, it is crucial that the application inspect the error code (HRESULT) returned by each method call before proceeding.

For more information about error codes, see Error Codes.

Encoding Detection

There are three places where an encoding can be specified when parsing an encoded XML document:

The encoding can be specified in the document via a Byte Order Mark (BOM). The BOM may appear at the beginning of an XML entity. The BOM is used both to indicate the byte order of the input stream and as a specification of the input encoding.
The encoding can be specified in the document in the XML declaration.
The encoding can be specified programmatically when creating the input stream. This specification can be either mandatory or a hint.

If the user of XmlLite specifies a mandatory encoding programmatically, XmlLite ignores the encoding specified in the XML declaration. The desired encoding is the one specified in the XML declaration, unless it is overridden programmatically.

If there is a BOM, and it conflicts with the desired encoding, parsing will fail.

If there is no BOM, and if the encoding hint was specified as TRUE, then XmlLite will try to parse with the desired encoding. If that fails, XmlLite will attempt to automatically detect the encoding of the specified document.

If there is no BOM, and if the encoding hint was specified as FALSE, then XmlLite will try to parse with the desired encoding. If there is no match between the desired encoding and the encoding of the document, XmlLite parsing will fail. In this case, XmlLite will not attempt to automatically detect the encoding of the document.

If automatic detection of the encoding of the document fails, then XmlLite reports an error on line 0.

If there is no BOM, and there is no desired encoding, XmlLite will attempt to automatically detect the encoding of the document.
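
The rules above can be summarized as a small decision function. The sketch below is a simplified model, not XmlLite's implementation: `EncodingFromBom`, `ChooseEncodingPlan`, and the `Plan` values are hypothetical names, encodings are compared as plain strings, and a BOM that does not conflict with the desired encoding is assumed to be used as-is. The BOM byte sequences themselves are the standard Unicode ones.

```cpp
#include <cstddef>
#include <string>

// Returns the encoding implied by a byte order mark, or "" if there is none.
// The 4-byte UTF-32 marks are checked first because the UTF-32LE mark begins
// with the same two bytes as the UTF-16LE mark.
inline std::string EncodingFromBom(const unsigned char* data, std::size_t size) {
    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE &&
        data[2] == 0x00 && data[3] == 0x00) return "UTF-32LE";
    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00 &&
        data[2] == 0xFE && data[3] == 0xFF) return "UTF-32BE";
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) return "UTF-8";
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) return "UTF-16LE";
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) return "UTF-16BE";
    return "";
}

enum class Plan {
    UseBom,                // BOM present and not in conflict
    UseDesired,            // no BOM, desired encoding only; a mismatch fails
    UseDesiredThenDetect,  // no BOM, hint is TRUE: fall back to auto-detection
    AutoDetect,            // no BOM and no desired encoding
    Fail                   // BOM conflicts with the desired encoding
};

// bom and desired are "" when absent; hint is the encoding-hint flag.
inline Plan ChooseEncodingPlan(const std::string& bom,
                               const std::string& desired,
                               bool hint) {
    if (!bom.empty()) {
        if (!desired.empty() && desired != bom) return Plan::Fail;
        return Plan::UseBom;
    }
    if (desired.empty()) return Plan::AutoDetect;
    return hint ? Plan::UseDesiredThenDetect : Plan::UseDesired;
}
```

A real implementation would need a notion of compatible names (for example, a generic "UTF-16" label against a little-endian BOM), which this string comparison deliberately ignores.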

The following encodings are supported natively:

UTF-8, UTF-16, and UTF-16BE
UCS-2 and UCS-4
ASCII
ISO88591, ISO88592, ISO88593, ISO88594, ISO88595, ISO88596, ISO88597, ISO88598, and ISO88599
Windows1250, Windows1251, Windows1252, Windows1253, Windows1254, Windows1255, Windows1256, Windows1257, and Windows1258

For additional encoding support, you can co-create an instance of IMultiLanguage2 and set the XmlReaderProperty_MultiLanguage property. By default, the value of this property is NULL.

DTD Support

Document Type Definition (DTD) support is limited to entity expansion and default attributes.

When a DTD is used, DTD default attributes are returned just as if they were normal attributes; the only difference is that default attributes return TRUE when the IsDefault method is called.

For example, consider the following DTD declaration:

<!ATTLIST myElement myAttr CDATA "123">

If the myAttr attribute on the myElement element is not defined in the XML stream, it will be given the default value of 123.
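
The effect of default attributes can be modeled as a merge between the attributes written in the document and the defaults declared in the DTD. This sketch is illustrative only: `ToyAttribute` and `ApplyDefaults` are hypothetical names, and the `isDefault` field stands in for what the IsDefault method reports.

```cpp
#include <map>
#include <string>
#include <vector>

struct ToyAttribute {
    std::string name;
    std::string value;
    bool isDefault;  // true when the value came from the DTD, not the document
};

// Returns the attributes an application would see for one element: those
// written in the document first, followed by any DTD default whose name was
// not written explicitly.
inline std::vector<ToyAttribute> ApplyDefaults(
    const std::map<std::string, std::string>& explicitAttrs,
    const std::map<std::string, std::string>& dtdDefaults) {
    std::vector<ToyAttribute> result;
    for (const auto& attr : explicitAttrs) {
        result.push_back({attr.first, attr.second, false});
    }
    for (const auto& attr : dtdDefaults) {
        if (explicitAttrs.find(attr.first) == explicitAttrs.end()) {
            result.push_back({attr.first, attr.second, true});
        }
    }
    return result;
}
```

For the ATTLIST declaration above, an element written without myAttr yields one attribute with value 123 and `isDefault` set, while an element that writes myAttr explicitly keeps its own value.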

Semantics of String Handling

The following IXmlReader methods return a string pointer: GetLocalName, GetNamespaceUri, GetPrefix, GetQualifiedName, and GetValue.

When calling these methods, be aware that the pointer is only valid until you move the reader to another node. When you move the reader to another node, XmlLite may reuse the memory referenced by the pointer. Therefore, you should not use the pointer after calling one of the following methods: Read, MoveToNextAttribute, MoveToFirstAttribute, MoveToAttributeByName, and MoveToElement. Although they do not move the reader, the following two methods will also make the pointer invalid: SetInput and IUnknown::Release. If you want to preserve the value that was returned in the string, you should make a deep copy.
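
The deep-copy advice can be illustrated with a toy reader that reuses one internal buffer, as the text says XmlLite may do. `ToyBufferReader` and `CollectValues` are hypothetical names used only for this sketch; the point is that each value is copied into its own `std::wstring` before the reader moves on.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical reader that overwrites a single internal buffer on every Read.
class ToyBufferReader {
public:
    explicit ToyBufferReader(std::vector<std::wstring> values)
        : values_(std::move(values)) {}

    bool Read() {
        if (pos_ >= values_.size()) return false;
        buffer_ = values_[pos_++];  // previous contents are gone after this
        return true;
    }

    // The pointer is valid only until the next call to Read.
    void GetValue(const wchar_t** value, std::size_t* length) const {
        *value = buffer_.c_str();
        *length = buffer_.size();
    }

private:
    std::vector<std::wstring> values_;
    std::wstring buffer_;
    std::size_t pos_ = 0;
};

// Copies each value out of the reader's buffer before advancing, so the
// collected strings stay valid after the reader has moved on.
inline std::vector<std::wstring> CollectValues(ToyBufferReader& reader) {
    std::vector<std::wstring> copies;
    while (reader.Read()) {
        const wchar_t* value = nullptr;
        std::size_t length = 0;
        reader.GetValue(&value, &length);
        copies.emplace_back(value, length);  // deep copy of the characters
    }
    return copies;
}
```

Storing the raw pointers instead of the copies would leave every stored pointer referring to whatever the buffer holds after the last read.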