Putting XML to Work
This article may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist. To maintain the flow of the article, we've left these URLs in the text, but disabled the links.
Greater Office
Parsing XML Data with MSXML
By Mike Gunderloy
Unless you've been on a long vacation lately, you've
probably run across Extensible Markup Language (XML), which is being
widely touted as the next great breakthrough for communication between
applications. But you may be a bit fuzzy on how to use it in your own
applications.
In brief, XML provides a text format for transmitting
structured information. The good news is that you don't have to write your
own parser for this format. Starting with Internet Explorer 4.01,
Microsoft has shipped their own XML parser, MSXML, as a COM component. In
this article, I'll show you how to use the MSXML parser from VBA to
quickly and easily extract information from an XML document.
Why would you want to do this? The fact is that XML is
one of this year's hot new terms in personal computing, and Microsoft is
making a concerted push to incorporate it into all aspects of their
product line. Other major software companies - from IBM to Oracle - have
also produced XML tools and interfaces. With that much pressure behind XML
as a standard, there's little doubt that you'll be faced with an XML data
stream, sooner rather than later. Knowing how to read it quickly and
easily will let you make the best use of your time in working with this
data.
Understanding XML
XML is a huge topic, and one which very few people
understand in depth. If you want to see the full standard, you need to go
the World Wide Web Consortium (W3C) Web site at http://www.w3.org/Consortium/siteindex#X.
A warning, though: The standard is very heavy going if you're not used to
interpreting such legalistic documents. You might find the annotated
version of the specification at http://xml.com/pub/axml/axmlintro.html
more useful, or some of the articles at Microsoft's MSDN XML Developer
Center (http://msdn.microsoft.com/en-us/xml/default.aspx).
Fortunately, if all you want to do is read and parse an
XML file from someone else, and you can make the assumption that the
document is valid XML, you don't have to know the entire standard. In
fact, there are just a few key terms you need to understand to get started
in the XML world. You can always pick up more later, but for now, just
concentrate on these four ideas:
- A document is a chunk of XML. You can think of an XML document as
the equivalent of an HTML page. It's a bit of XML that's served up by
some Web server, or other data feed, as a single chunk.
- An XML declaration is a processing directive that identifies the
document as being XML, and includes the XML version.
- Elements are the building blocks of the document. If you're familiar
with HTML, HTML tags are the rough equivalent of XML elements. One of
the key distinctions is that in XML, new elements may be defined by any
document.
- Attributes describe some feature of an element.
Of course, I've left a lot out to produce this simple
picture of XML. In particular, I'm ignoring the topic of Document Type
Declarations (DTDs). A DTD is a section of XML that explains the rules for
constructing a particular XML document, and it's important because it
allows you to validate that an XML document is actually in the intended
format. Since I'm assuming the source document comes from someone who
knows how to construct XML, I can ignore this piece of the puzzle - at
least for this article.
An XML Example
For this article, I'll be parsing the sample file
shown in FIGURE 1, traffic.xml. This file tracks Web site visits to a
fictitious Web site by country/region of origin. If you refer to the figure,
you'll be able to identify all the parts of XML that I talked about in the
previous section:
- The document is the entire file.
- The XML declaration is the first line in the file, <?xml
version="1.0"?>. This shows that the current document is XML, and
that it meets the standards of version 1.0 of the XML specification.
- The elements of the document include SiteVisits, Country,
TotalVisits, and LatestVisit. As you can see, each element is marked by
a starting and ending tag, and elements can be nested.
- CountryName is an attribute of the Country element.
<?xml version="1.0"?>
<SiteVisits>
<Country CountryName="USA">
<TotalVisits>1348</TotalVisits>
<LatestVisit>1/4/2000</LatestVisit>
</Country>
<Country CountryName="UK">
<TotalVisits>764</TotalVisits>
<LatestVisit>1/4/2000</LatestVisit>
</Country>
<Country CountryName="Argentina">
<TotalVisits>175</TotalVisits>
<LatestVisit>1/2/2000</LatestVisit>
</Country>
<Country CountryName="Brazil">
<TotalVisits>182</TotalVisits>
<LatestVisit>1/4/2000</LatestVisit>
</Country>
<Country CountryName="Canada">
<TotalVisits>688</TotalVisits>
<LatestVisit>1/3/2000</LatestVisit>
</Country>
<Country CountryName="Denmark">
<TotalVisits>204</TotalVisits>
<LatestVisit>1/1/2000</LatestVisit>
</Country>
<Country CountryName="Germany">
<TotalVisits>351</TotalVisits>
<LatestVisit>1/4/2000</LatestVisit>
</Country>
<Country CountryName="Hong Kong">
<TotalVisits>97</TotalVisits>
<LatestVisit>12/30/1999</LatestVisit>
</Country>
<Country CountryName="Ireland">
<TotalVisits>522</TotalVisits>
<LatestVisit>1/4/2000</LatestVisit>
</Country>
<Country CountryName="Malaysia">
<TotalVisits>14</TotalVisits>
<LatestVisit>12/31/1999</LatestVisit>
</Country>
<Country CountryName="Netherlands">
<TotalVisits>542</TotalVisits>
<LatestVisit>1/4/2000</LatestVisit>
</Country>
<Country CountryName="New Zealand">
<TotalVisits>599</TotalVisits>
<LatestVisit>1/3/2000</LatestVisit>
</Country>
<Country CountryName="Norway">
<TotalVisits>452</TotalVisits>
<LatestVisit>1/3/2000</LatestVisit>
</Country>
<Country CountryName="Scotland">
<TotalVisits>538</TotalVisits>
<LatestVisit>1/4/2000</LatestVisit>
</Country>
<Country CountryName="Sweden">
<TotalVisits>422</TotalVisits>
<LatestVisit>1/2/2000</LatestVisit>
</Country>
<Country CountryName="Wales">
<TotalVisits>301</TotalVisits>
<LatestVisit>1/1/2000</LatestVisit>
</Country>
<Country CountryName="Yugoslavia">
<TotalVisits>37</TotalVisits>
<LatestVisit>12/30/1999</LatestVisit>
</Country>
<Country CountryName="Zambia">
<TotalVisits>42</TotalVisits>
<LatestVisit>1/4/2000</LatestVisit>
</Country>
</SiteVisits>
FIGURE 1: The example XML file, traffic.xml, as ASCII
text
You can see that the traffic.xml file consists of markup
(the mechanics of XML: angle brackets, element names, and so on) and data
(the country/region names and numbers of visitors). Suppose someone sent you this
XML document once a day, and your job was to extract the data and display
it in Microsoft Excel? Because the structure is so simple, it's easy to
see how you might start breaking it apart with VBA string functions,
looking for angle brackets, element names, and data. You could build a
special-purpose parser that understood the structure of this particular
XML document, and feed the new document into your parser on a daily basis.
But there's a better way. Starting with Internet
Explorer 4.01, Microsoft has been shipping a generalized XML parser with
every Web browser. Even better, this generalized parser is available via
COM. So there's no need to create your own!
The MSXML Object Model
The parser in question is called simply the Microsoft
XML library, or MSXML. To understand the object model used by MSXML, it's
helpful to think of an XML document, not as a nested set of elements, but
as a tree. FIGURE 2 shows the traffic.xml document open in Microsoft XML
Notepad, which uses the tree metaphor. (Microsoft XML Notepad is in beta
as of this writing. You can download it at http://www.microsoft.com/downloads/details.aspx?familyid=72d6aa49-787d-4118-ba5f-4f30fe913628&displaylang=en.)
Rather than showing the TotalVisits and LatestVisit elements as nested
within the Country element, they're presented as child nodes on a tree of
nodes. This tree metaphor is also implemented by the MSXML parser.
FIGURE 2: Another view of traffic.xml, i.e. as it appears in
Microsoft XML Notepad
The MSXML library provides a wide range of objects that
you can use to deal with the complexity of XML. Just as most of that
complexity is beyond the scope of this article, most of the objects are
unnecessary for the simple job of parsing straightforward XML. In fact,
you can do the job with only four objects:
- The XMLDocument object
represents an entire XML document.
- The IXMLDOMNode object
represents a single entity (node in the tree).
- The IXMLDOMNodeList
object represents a collection of child nodes for a particular entity.
- The IXMLDOMNamedNodeMap
object represents a collection of attributes for an entity.
You'll note that the naming of these objects is a bit
peculiar. DOM stands for Document Object Model, a general notion that
documents of any sort can be assigned to a particular object model. This
library provides the particular object model for XML. The "I" stands for
interface, as most of these objects are actually COM interfaces.
The table in FIGURE 3 shows some of the properties and
methods of these four objects, including all of the ones that I'll be
using in this article. For a complete list, refer to the XML
Developer's Guide in the Web Services section of the Platform SDK, or
just use the object browser built into VBA to explore the objects. In
particular, I've omitted the rich set of methods concerned with modifying
XML documents, since the example here is only designed to read an existing
document.
|
Object |
Member |
Type |
Description |
|
XMLDocument |
async |
Property |
Set to True to enable asynchronous
loading |
|
XMLDocument |
attributes |
Property |
List of attributes for the root node of the
document |
|
XMLDocument |
childNodes |
Property |
Returns a NodeList containing all the
children of the root node |
|
XMLDocument |
documentElement |
Property |
Returns the root node itself |
|
XMLDocument |
getElementsByTag |
Method |
Returns a collection of elements having the
specified tag |
|
XMLDocument |
hasChildNodes |
Property |
True if the root node has
children |
|
XMLDocument |
load |
Method |
Loads an XML document from a disk file or
URL |
|
XMLDocument |
loadXML |
Method |
Loads an XML document specified as a
string |
|
XMLDocument |
url |
Property |
Returns the URL for the most recently loaded
document |
|
XMLDocument |
validateOnParse |
Property |
Set to True to validate the XML as it's
loaded |
|
IXMLDOMNode |
attributes |
Property |
List of attributes for this node |
|
IXMLDOMNode |
childNodes |
Property |
Returns a NodeList containing all the
children of this node |
|
IXMLDOMNode |
hasChildNodes |
Property |
True if this node has children |
|
IXMLDOMNode |
nodeName |
Property |
The XML name of the node |
|
IXMLDOMNode |
nodeTypedValue |
Property |
Formatted value of the node |
|
IXMLDOMNode |
parentNode |
Property |
Pointer to the parent node of this node |
|
IXMLDOMNode |
parsed |
Property |
True if this node and its children have
been parsed |
|
IXMLDOMNode |
text |
Property |
Unformatted text of this node |
|
IXMLDOMNodeList |
item |
Property |
Zero-based index to the members of this
collection |
|
IXMLDOMNodeList |
length |
Property |
Count of members in this collection |
|
IXMLDOMNamedNodeMap |
getNamedItem |
Method |
Returns the attribute with the specified name (if
any) |
|
IXMLDOMNamedNodeMap |
item |
Property |
Zero-based index to the members of this
collection |
|
IXMLDOMNamedNodeMap |
length |
Property |
Count of members in this
collection |
FIGURE 3: Representative methods and properties of
selected XML objects
Using MSXML
Now that you've seen the basic objects you need to read
XML, it's time to see what you can do with them. FIGURE 4 shows a
presentation of the data from traffic.xml in an Excel 2000 worksheet. Note
that the worksheet really doesn't contain any more information than the
original XML file; it's just been transformed into a format that's easier
for human beings to grasp.
.jpg)
FIGURE 4: Data
from traffic.xml file as an Excel 2000 worksheet.
The code to generate the worksheet and embedded chart is
shown in Listing
One. It's tied to the Load button on the worksheet. When the user
clicks this button, the VBA code loads and parses an XML file, and uses it
to generate the worksheet.
The procedure starts, of course, by declaring variables
and setting up an error handler. Then it calls the Load method of
the XMLDocument object. This sets up the association between the
object model and the actual XML. In this particular case, I don't want the
code to proceed until the XML is fully loaded, and I don't care about
validating the XML (because I trust the XML supplier). So the applicable
part of the code looks like this:
oDoc.async = False
oDoc.validateOnParse = False
fSuccess = oDoc.Load(ActiveWorkbook.Path
& "\traffic.xml")
Note that in this example, the XML is being loaded from
a disk file in the same folder as the worksheet itself. However, the
Load method is a good deal more flexible than that; it can also
accept a URL. In fact, the odds are good that you'll actually be using
that method to retrieve XML from a Web server somewhere that a supplier is
maintaining for you. So in a real example, the load line would look
something like this:
fSuccess = oDoc.Load( _
"http://TheirServer.com/traffic/traffic.xml")
Once the XML document is loaded, the MSXML object model
makes it simple to pick it apart into its component pieces. The
documentElement property of the XMLDocument object returns
an IXMLDOMNode object that represents the root node of the XML
tree. This particular code then uses the childNodes property of that root
node object with a For Each loop to visit each of the
top-level child nodes in turn. If you refer back to the XML in FIGURE 1,
you'll see that this amounts to passing through the loop once for each
country/region in the XML file.
The name of each country/region is available as an XML
attribute. To get this attribute requires three steps. First, use the
attributes property of the node to get the IXMLDOMNamedNodeMap
object representing all the attributes of the node. Then, use the
getNamedItem method of the node map to get the particular attribute
we're interested in. This attribute is returned as a node. Finally, use
the text property of the node
to retrieve the country/region name:
SetoAttributes = oCountry.Attributes
SetoCountryName = oAttributes.getNamedItem("CountryName")
ActiveSheet.Cells(intI, 1).Value = oCountryName.Text
Retrieving the visit information makes use of the fact
that the properties and methods of nodes are recursive (representing the
nested nature of XML). Given the country/region node, the code can walk through
its children to find the nodes representing the total visits and the most
recent visit. The only complication comes about in that the
IXMLDOMNodeList object (the collection of nodes) doesn't support
retrieving items by name. So the code has to step through all the children
and decide what to do with each child based on its nodeName
property:
ForEachoChild In oChildren
If oChild.nodeName = "TotalVisits" Then
ActiveSheet.Cells(intI, 2) = oChild.nodeTypedValue
EndIf
If oChild.nodeName = "LatestVisit" Then
ActiveSheet.Cells(intI, 3) = oChild.nodeTypedValue
End If
Next oChild
Once you understand these code snippets, you've grasped
all you need to know to parse simple XML documents using the MSXML object
model. There is other code in the VBA procedure in Listing One, but it's
all Excel code, concerned with clearing the worksheet and generating the
chart. The core XML parsing code is really quite simple.
XML in Your Future
If you're a VBA developer, it's very likely that there
is XML somewhere in your future. Consider just a few recent moves from
Microsoft:
- Windows DNA uses XML as a pervasive protocol to communicate between
different tiers of distributed applications.
- Microsoft is the major supporter of the BizTalk initiative, which
seeks to define standard XML documents for use in business.
- The future Microsoft BizTalk Server and Microsoft Host Integration
Server will help integrate XML with legacy and Web environments.
- The XML-based SOAP protocol, submitted to the IETF for approval,
will enable distributed COM and other object services via the Internet.
- SQL Server 2000 will support importing and exporting data in XML
formats.
Remember 1995, when just about every new Microsoft
application included HTML support in one form or another? Well, the year
2000 is shaping up as the year for pervasive XML support in Microsoft
products. While I'm not aware of any XML announcements directly related to
VBA, I'd say it's a safe bet that the VBA and Office teams won't be left
out of this initiative.
Fortunately, although XML can be complex, you can dodge
most of the complexity if you're just using it for data interchange. I
hope this article has convinced you that parsing XML is no big deal, and
that the tools already exist to integrate this into your VBA applications
more easily than many other types of data.
Mike Gunderloy (mailto:MikeG1@mcwtech.com) is a
Senior Consultant with MCW Technologies, a Microsoft Solution Provider.
He's also the author of Visual Basic Developer's Guide to ADO (SYBEX,
1999) and the forthcoming Visual Basic and VBA Developer's Guide to the
Windows Installer (SYBEX, 2000).
Begin Listing One - The
cmdLoad_Click Sub procedure
PrivateSubcmdLoad_Click()
DimoDoc As MSXML.DOMDocument
DimfSuccess AsBoolean
DimoRoot As MSXML.IXMLDOMNode
DimoCountry As MSXML.IXMLDOMNode
DimoAttributes As MSXML.IXMLDOMNamedNodeMap
Dim oCountryName AsMSXML.IXMLDOMNode
DimoChildren As MSXML.IXMLDOMNodeList
DimoChild As MSXML.IXMLDOMNode
DimintI AsInteger
On ErrorGoTo HandleErr
SetoDoc = New MSXML.DOMDocument
' Load the XML from disk, without validating it. Wait
' for the load to finish before proceeding.
oDoc.async = False
oDoc.validateOnParse = False
fSuccess = oDoc.Load( _
ActiveWorkbook.Path & "\traffic.xml")
' If anything went wrong, quit now.
If NotfSuccess Then
GoToExitHere
EndIf
' Set up a row counter.
intI = 5
' Delete the previous information.
ActiveSheet.Cells(4, 1).CurrentRegion.ClearContents
ActiveSheet.Shapes(2).Delete
' Create column headers.
ActiveSheet.Cells(4, 1) = "Country"
ActiveSheet.Cells(4, 2) = "Total Visits"
ActiveSheet.Cells(4, 3) = "Latest Visit"
' Get the root of the XML tree.
Set oRoot = oDoc.documentElement
' Go through all children of the root.
ForEachoCountry InoRoot.childNodes
' Collect the attributes for this country/region.
SetoAttributes = oCountry.Attributes
' Extract the country/region name and
' place it on the worksheet.
SetoCountryName = _
oAttributes.getNamedItem("CountryName")
ActiveSheet.Cells(intI, 1).Value = oCountryName.Text
' Go through all the children of the country/region node.
SetoChildren = oCountry.childNodes
ForEachoChild InoChildren
' Get information from each child node to the sheet.
IfoChild.nodeName = "TotalVisits" Then
ActiveSheet.Cells(intI, 2) = oChild.nodeTypedValue
EndIf
IfoChild.nodeName = "LatestVisit" Then
ActiveSheet.Cells(intI, 3) = oChild.nodeTypedValue
EndIf
NextoChild
intI = intI + 1
Next oCountry
' Now for some eye candy; build a chart of the data.
Charts.Add
WithActiveChart
.ChartType = xl3DPieExploded
.SetSourceData Source:=Sheets("Sheet1"). _
Range("A5:B" & CStr(intI - 1)), PlotBy :=xlColumns
.Location Where:=xlLocationAsObject, Name:="Sheet1"
End With
ActiveChart.HasTitle = True
ActiveChart.ChartTitle.Characters.Text =_
"Web Site Visits"
ActiveSheet.Shapes(2).Top = 0
ActiveSheet.Shapes(2).Left = 200
ExitHere:
ExitSub
HandleErr:
MsgBox "Error " & Err.Number & ": " & Err.Description
ResumeExitHere
Resume
End Sub
End Listing One