New XML Features of the Microsoft Office Word 2003 Object Model

 

Peter Vogel
PH&V Information Services

October 2003

Applies to:
    Microsoft® Office Word 2003

Summary: Use the new XML features of the Microsoft Office Word 2003 object model to manipulate files with code. Among other possibilities, the XML-based capabilities of Word allow you to harness the power of XSLT to transform document content into whatever format you need. You can apply these conversions to a single part of a document to create a mailing label from plain text, or to the document as a whole to convert an e-mail into a sales order. (14 printed pages)

Contents

Working with XML
Accessing the Document
Transforming Text
Conclusion

Working with XML

Working with the XML representation of a document in Microsoft® Office Word 2003 includes two different scenarios:

  • You can load an XML schema and create documents by adding elements and attributes from the schema. As an example of this scenario, you might load the DocBook schema (an open specification used for creating documents about hardware and software). Figure 1 shows Word 2003 with the DocBook schema loaded.
  • You can work with Word 2003 as you always have. You can also choose to have your code work with your Word 2003 document as an XML document, written in the WordML (the native Word XML format) dialect.

In both scenarios, you can access the XML representation of the document to provide new functionality.

Figure 1. Creating an XML document in Word 2003

Loading Schemas

To work with an XML schema, the first steps are to load the schema and associate it with a document. You can add schemas through code from the Application object or the Document object.

Before adding a schema to the document, make sure that the version of Word that you are working with supports adding schemas. Adding custom schemas is supported by the Microsoft Office Word 2003 core program and by Microsoft Office Word 2003 in Microsoft Office Professional Edition 2003. If the ArbitraryXMLSupportAvailable property of the Application object returns True, Word 2003 supports building documents with XML vocabularies other than WordML.

Adding Schemas with the Application Object

You can add schemas to the Application object by using the Add method of the XMLNamespaces collection. Once added, these schemas are available to the user through the XML task pane. The following code adds the DocBook schema to the XMLNamespaces collection, specifies the namespace to use with the schema if the schema does not define one (dcb), and specifies an alias for the schema that Word 2003 uses in its user interface (UI), in this case, the name DocBook:

If Application.ArbitraryXMLSupportAvailable = True Then
    Application.XMLNamespaces.Add "c:\Docbook.xsd", "dcb", "DocBook"
End If

Once added, the schema is represented by an XMLNamespace object in the XMLNamespaces collection. Each XMLNamespace object can have a collection of XSLT transformations associated with it. You can add new transformations to the XMLNamespace object using the Add method of its XSLTransforms collection, which creates an XSLTransform object. You must pass the path to the file containing the XSLT transformation to the Add method and, optionally, pass an alias that Word 2003 can use in its UI. This example adds the toc.xsl transformation to the collection, which appears in the UI as Generate TOC:

Application.XMLNamespaces(1).XSLTransforms.Add _
 "c:\transforms\toc.xsl",  "Generate TOC"

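To check what has been registered, a short routine along these lines can list each namespace and its transforms in the Immediate window. This is a minimal sketch: the routine name ListXmlNamespaces is made up for illustration, and it assumes the URI, Alias, and Location properties are read with their default arguments.

```vba
Sub ListXmlNamespaces()
    Dim ns As XMLNamespace
    Dim xf As XSLTransform

    ' Walk every schema registered with the Application object.
    For Each ns In Application.XMLNamespaces
        Debug.Print ns.URI & " (alias: " & ns.Alias & ")"

        ' List the XSLT transforms associated with this schema.
        For Each xf In ns.XSLTransforms
            Debug.Print "    " & xf.Alias & " -> " & xf.Location
        Next xf
    Next ns
End Sub
```
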
Using a Schema with a Document

Once you add a schema to the XMLNamespaces collection for the Application object, you can associate the schema with a document by using the AttachToDocument method of the XMLNamespace object. This code associates the second schema in the XMLNamespaces collection with the current document:

Application.XMLNamespaces(2).AttachToDocument ActiveDocument

The XMLSchemaReferences collection of the Document object allows you to add schemas directly to the document, using the Add method of the XMLSchemaReferences collection. This code adds the DocBook schema to the document's schema references by referring to the NamespaceURI established when you added the schema to the XMLNamespaces collection for the Application object:

ActiveDocument.XMLSchemaReferences.Add "dcb"

Alternatively, you can bypass the XMLNamespaces collection and add the schema directly to the document. This also adds the schema to the XMLNamespaces collection for the Application object:

ActiveDocument.XMLSchemaReferences.Add "dcb", _
    "DocBook", "C:\schemas\DocBook.xsd"

The first parameter to the Add method of the XMLSchemaReferences collection is the namespace to use with the schema (in this example, dcb). The second parameter provides an alias for the schema used in the UI. The third parameter is the path to the schema to load.
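
Before attaching a schema, it can be useful to check whether the document already references it. The following is a minimal sketch; SchemaIsAttached and AttachDocBookIfMissing are hypothetical helper names, and the code assumes the dcb namespace has already been registered with the Application object.

```vba
' Returns True if the active document already references the namespace.
Function SchemaIsAttached(ByVal uri As String) As Boolean
    Dim ref As XMLSchemaReference

    For Each ref In ActiveDocument.XMLSchemaReferences
        If ref.NamespaceURI = uri Then
            SchemaIsAttached = True
            Exit Function
        End If
    Next ref
End Function

Sub AttachDocBookIfMissing()
    ' Assumes "dcb" is already in Application.XMLNamespaces.
    If Not SchemaIsAttached("dcb") Then
        ActiveDocument.XMLSchemaReferences.Add "dcb"
    End If
End Sub
```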

Accessing the Document

There are two strategies that you can use for accessing the XML representation of a document in Word 2003:

  • Extract XML text
  • Access XML nodes that make up the XML Document Object Model

Extracting Text

A Word 2003 document always consists of WordML tags. When you are working with a schema in Word 2003, the tags of the schema are intermixed with the WordML tags. The WordML tags control how the document is displayed in Word 2003.

You can extract the XML text of your document through the XML property of Range and Selection objects. The XML property always returns a complete WordML document, including the WordML root element (wordDocument). For example, this code pulls the XML from the second paragraph of the active document:

ActiveDocument.Paragraphs(2).Range.XML

As an example, this sample text contains just two short paragraphs:

The quick brown fox jumps over the lazy dog.
A rolling stone gathers momentum.

Extracting the WordML representation of the first paragraph returns a lot of text—over 3,000 characters. The document returned by the XML property is large because it not only includes the text of the selected range but also all of the context information that affects text: definitions of the styles in the document, information about the document's fonts, the document's page size and margins, and so on.

Within the document returned by the XML property, the paragraphs that make up the content are inside the WordML body element. The body element contains the document's content enclosed in <t> tags (text), <r> tags (runs of text), and <p> tags (paragraphs). The <sectPr> tag that also appears in the body holds elements that define the properties of the document's sections (such as the page size or margins).

If you create a document based on some XML schema, the tags that make up that schema are intermixed with the WordML tags that allow Word 2003 to manipulate the text. As an example, this is the start of a DocBook document:

<book><title>Office MSXML</title></book>

Retrieving the XML property of this document returns a document with this in its body tags:

<w:body>
   <wx:sect>
      <ns0:book>
         <w:p>
            <ns0:title>
               <w:r><w:t>Office MSXML</w:t></w:r>
            </ns0:title>
         </w:p>
      </ns0:book>
. . .

To keep the WordML and DocBook tags separate, the DocBook tags are flagged with a prefix of ns0. This prefix ties the tags back to a DocBook namespace defined on the wordDocument element:

<w:wordDocument
xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
. . .various other namespace declarations. . .
xmlns:ns0="dcb">

To retrieve only the elements for the schema that you're working with, the XML property accepts a parameter (called DataOnly) that allows you to remove the WordML tags. When DataOnly is set to True, as in this code, the property returns only the non-WordML XML tags in the document:

ActiveDocument.Paragraphs(1).Range.XML(True)

For the sample DocBook document, the XML returned looks like this:

<?xml version="1.0" standalone="no"?>
<book xmlns="dcb">
<title>Office MSXML</title>
</book>
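
To see how much of the returned text is WordML context rather than schema data, you can compare the two forms of the XML property. This is a small illustrative sketch (the routine name CompareXmlSizes is hypothetical); the numbers printed depend entirely on the document.

```vba
Sub CompareXmlSizes()
    Dim rng As Range
    Set rng = ActiveDocument.Paragraphs(1).Range

    ' Full WordML document, including styles, fonts, and page settings.
    Debug.Print "Full WordML: " & Len(rng.XML) & " characters"

    ' DataOnly = True: only the tags from the attached schema.
    Debug.Print "Data only:   " & Len(rng.XML(True)) & " characters"
End Sub
```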

Extracting XML Nodes

In documents created with a schema, you can also access the document as a series of XML nodes, using the XMLNodes property of the Range, Selection, and Document objects. The XMLNodes collection contains XMLNode objects, whose methods and properties look very familiar if you work with the DOM parser in the MSXML toolset. This section concentrates on the features of the XMLNode object that are new to the Word 2003 object model, although some of the methods and properties common to MSXML are also useful to the Word 2003 developer.

The sample DocBook document shown previously has two nodes in it: one for the book element and one for the title element. This code retrieves the title element:

Dim nd As XMLNode
Set nd = ActiveDocument.XMLNodes(2)

Properties of the node object include the content of the node (Text), the name of the node (BaseName), and the node's namespace (NamespaceURI).
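
As an illustration of these properties, the following sketch walks every node in the active document and prints its name, namespace, and text to the Immediate window. The routine name ListXmlNodes is made up for this example.

```vba
Sub ListXmlNodes()
    Dim nd As XMLNode

    ' XMLNodes returns the schema nodes in document order.
    For Each nd In ActiveDocument.XMLNodes
        Debug.Print nd.BaseName & " {" & nd.NamespaceURI & "}: " & nd.Text
    Next nd
End Sub
```

For the sample DocBook document, this would list the book node followed by the title node.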

On most occasions, you process only some of the nodes in a document. The SelectNodes method of the Document object lets you extract the nodes that you want, using an XPath statement to specify the nodes. When searching for an element with a namespace, you must include the namespace's prefix in the tag name. The second parameter of the SelectNodes method allows you to define the namespace, using the same syntax as in an XML document. This example finds all the title elements in the DocBook namespace (given the prefix ns0 here), no matter how deeply they are nested in the document:

Dim nds As XMLNodes
Dim nd As XMLNode
Set nds = ActiveDocument.SelectNodes("//ns0:title", _
    "xmlns:ns0='dcb'")
For Each nd In nds
    MsgBox "Value for title tags: " & nd.Text
Next

The first member of the XMLNodes collection is at position 1, so this is the code to use to process all members using a For. . .Next loop:

Dim lng As Long
For lng = 1 To nds.Count
    Set nd = nds.Item(lng)
    MsgBox "Value for title tags: " & nd.Text
Next

To speed up searching, you can set the third parameter of the SelectNodes method (FastSearchSkippingTextNodes) to True, causing the search to skip any text nodes (the text content between an XML element's open and close tags). If you're not searching the content of your nodes and your document contains a lot of text (typical for a Word 2003 document), setting this parameter can speed up your search significantly.

For any particular namespace that you add to the document, you can retrieve the namespace through the NamespaceURI property of the XMLSchemaReference object. Rewriting the previous code to use the NamespaceURI property gives this code (note that there must be no spaces between the single quotation marks and the namespace):

Set nds = ActiveDocument.SelectNodes("//ns0:title", "xmlns:ns0='" & _
    ActiveDocument.XMLSchemaReferences(1).NamespaceURI & _
    "'")

The new XML-based technology of Word 2003 merges smoothly with the Word objects from previous versions of Word. For example, once you retrieve a node, you can work with the Word objects that the node represents by retrieving the Range property of the XMLNode object:

Dim rng As Range
Set rng = nd.Range

You can also get some feedback on where a node fits within the larger Word 2003 document. The Level property of a node indicates whether a node is, for example, displayable text (Level = wdXMLNodeLevelInline) or part of a paragraph (Level = wdXMLNodeLevelParagraph). The Level property also indicates whether the node is part of a table row or cell.
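
A small helper can turn the Level value into a readable label, which is handy while debugging. This is a sketch only: DescribeLevel is a hypothetical name, and the four constants used are the WdXMLNodeLevel values.

```vba
' Returns a short description of where a node sits in the Word document.
Function DescribeLevel(ByVal nd As XMLNode) As String
    Select Case nd.Level
        Case wdXMLNodeLevelInline
            DescribeLevel = "inline"
        Case wdXMLNodeLevelParagraph
            DescribeLevel = "paragraph"
        Case wdXMLNodeLevelRow
            DescribeLevel = "table row"
        Case wdXMLNodeLevelCell
            DescribeLevel = "table cell"
        Case Else
            DescribeLevel = "unknown"
    End Select
End Function
```

For example, Debug.Print DescribeLevel(ActiveDocument.XMLNodes(1)) reports the level of the first node in the document.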

Changing the Document

You can update the document using either the InsertXML method or the XMLNodes collection. Both allow manipulation of the document's content to create new material.

Updating Text

The InsertXML method of the Range and Selection objects allows you to update a document by inserting arbitrary strings of XML text. When updating a WordML document that isn't using a schema, you must insert a complete WordML document. Fortunately, a basic WordML document consists of just five tags (wordDocument, body, p, r, t) and the WordML namespace. For example, this document contains a single string in a minimal WordML document:

<w:wordDocument xmlns:w=
'http://schemas.microsoft.com/office/word/2003/wordml'>
<w:body>
   <w:p>
      <w:r>
         <w:t>Hello, World.</w:t>
      </w:r>
   </w:p>
</w:body>
</w:wordDocument>

As an example, in the following text, let's say the word jumps is selected:

The quick brown fox jumps over the lazy dog.

This code replaces the current selection (jumps) with the word leaps:

Selection.Range.InsertXML _
    "<w:wordDocument xmlns:w=" & _
    "'http://schemas.microsoft.com/office/word/2003/wordml'>" & _
    "<w:body><w:p><w:r><w:t>leaps</w:t></w:r></w:p></w:body>" & _
    "</w:wordDocument>"

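Because every insertion needs the same wrapper, it may be convenient to build the minimal WordML document in one place. The following is a sketch under stated assumptions: WrapInWordML is a hypothetical helper, it assumes the WordML 2003 namespace URI http://schemas.microsoft.com/office/word/2003/wordml, and it escapes only the three characters that would otherwise break the markup.

```vba
' Wraps plain text in a minimal WordML document (hypothetical helper).
Function WrapInWordML(ByVal plainText As String) As String
    Dim escaped As String

    ' Escape the ampersand first so the entities added below
    ' are not escaped a second time.
    escaped = Replace(plainText, "&", "&amp;")
    escaped = Replace(escaped, "<", "&lt;")
    escaped = Replace(escaped, ">", "&gt;")

    WrapInWordML = _
        "<w:wordDocument xmlns:w=" & _
        "'http://schemas.microsoft.com/office/word/2003/wordml'>" & _
        "<w:body><w:p><w:r><w:t>" & escaped & _
        "</w:t></w:r></w:p></w:body></w:wordDocument>"
End Function

Sub ReplaceSelectionWithLeaps()
    ' Replaces the current selection with the word "leaps".
    Selection.Range.InsertXML WrapInWordML("leaps")
End Sub
```
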
Updating Nodes

For a document using a schema, the InsertXML method of the Range property of the XMLNode object lets you add new XML. When new material is inserted into your document, it needs to be a combination of WordML tags (so that text is displayed in Word) and the schema's tags. However, to make the code easier to read, the next examples omit the WordML tags.

The sample DocBook document has a title of "Office MSXML." One change to this DocBook document could insert a subtitle tag to go with the title tag and give this structure:

<?xml version="1.0" standalone="no"?>
<book xmlns="dcb">
<title>Office MSXML<subtitle>XML Objects</subtitle></title>
</book>

A prefix is required when inserting the XML so that the tag is tied to the right schema. To ensure that the inserted XML is well formed, the prefix (and its namespace) must be defined within the XML being inserted. This code inserts the subtitle tag into the document:

ActiveDocument.XMLNodes(2).Range.InsertXML  _
"<ns0:subtitle xmlns:ns0='dcb'>XML Objects</ns0:subtitle>"

Inserting XML into a node replaces any existing content for that node. Using the previous code to insert new content into the title node replaces the existing content (as shown in Figure 2):

<ns0:book>
<ns0:title>
<ns0:subtitle>XML Objects</ns0:subtitle>
</ns0:title>
</ns0:book>

Figure 2. Replacing the title element content with a subtitle element

If your goal is to add to the existing content (rather than replace it), you must add a child node to the parent node. You add nodes using the Add method of the node's ChildNodes collection, passing the name of the node and the namespace to which the new node belongs (Word 2003 takes care of assigning the appropriate prefix to the node). The following example adds a child node, called subtitle, as part of the DocBook namespace. Once the node is created, the code assigns the node's Text property the string "XML Objects":

Dim nd As XMLNode
Set nd = ActiveDocument.XMLNodes(2).ChildNodes.Add("subtitle", _
    "dcb")
nd.Text = "XML Objects"

The resulting document contains this set of tags (see Figure 3):

<ns0:book>
<ns0:title>Office MSXML<ns0:subtitle>XML Objects</ns0:subtitle>
</ns0:title>
</ns0:book>

Figure 3. Adding a subtitle child element to a title element

Running the code again adds another subtitle following the subtitle just added.
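
If a second subtitle is not wanted, the code can look through the existing child nodes before adding one. This is a minimal sketch, assuming (as in the sample document) that the title element is the second node; AddSubtitleOnce is a hypothetical routine name.

```vba
Sub AddSubtitleOnce()
    Dim ndTitle As XMLNode
    Dim ndChild As XMLNode

    ' Assumes the title element is the second node, as in the sample.
    Set ndTitle = ActiveDocument.XMLNodes(2)

    ' Leave the document alone if a subtitle child already exists.
    For Each ndChild In ndTitle.ChildNodes
        If ndChild.BaseName = "subtitle" Then Exit Sub
    Next ndChild

    Set ndChild = ndTitle.ChildNodes.Add("subtitle", "dcb")
    ndChild.Text = "XML Objects"
End Sub
```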

You're not restricted to adding child nodes. Using the Add method of the XMLNodes collection lets you add a node anywhere in the existing document. Where the node is added depends on what object you use. For example, if you use the ActiveDocument object, as in this example, the new node is inserted at the cursor:

ActiveDocument.XMLNodes.Add "subtitle", "dcb"

The XMLNodes collection is also available from the Range property of any node. When you call the Add method from an existing node in the document, the new node is added so that it encloses any existing content of that node. For example, this document nests paragraph text inside a chapter element:

<ns0:book xmlns:ns0='dcb'>
<ns0:chapter>My Paragraph</ns0:chapter>
</ns0:book>

To add a paragraph node to the chapter (and place the paragraph text inside the newly created paragraph), you could use this code with the chapter's Range property. The chapter is the second node in the document, after the book node:

Dim ndChapter As XMLNode

Set ndChapter = ActiveDocument.XMLNodes(2)
ndChapter.Range.XMLNodes.Add "para", "dcb"

The result is this XML:

<ns0:book xmlns:ns0='dcb'>
<ns0:chapter>
<ns0:para>My Paragraph</ns0:para>
</ns0:chapter>
</ns0:book>

Transforming Text

The real power in the InsertXML method is in the method's second parameter, which accepts the path name to a file containing XSLT. Passed the pathname, the InsertXML method processes the XML in the first parameter using the XSLT code in the file and inserts the results into the document.

For example, this DocBook document has two chapter elements and an empty table of contents element called toc (for these examples, the relevant WordML tags are included to ensure that the text is displayed by Word 2003 after the text is inserted):

<ns0:book>
   <ns0:toc/>
   <ns0:chapter>
      <ns0:title>
         <w:p><w:r><w:t>Chapter 1</w:t></w:r></w:p>
      </ns0:title>
   </ns0:chapter>
   <ns0:chapter>
      <ns0:title>
         <w:p><w:r><w:t>Chapter 2</w:t></w:r></w:p>
      </ns0:title>
   </ns0:chapter>
</ns0:book>

The following XSLT finds the title of each chapter and generates a DocBook table of contents consisting of tocchap and tocentry tags:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:w='http://schemas.microsoft.com/office/word/2003/wordml'
    xmlns:ns0='dcb'>

<xsl:template match="/">
<w:wordDocument>
<w:body>
   <xsl:for-each select="//ns0:chapter">
      <ns0:tocchap><ns0:tocentry>
         <w:p>
            <w:r>
               <w:t>
                  <xsl:value-of select="ns0:title"/>
               </w:t>
            </w:r>
         </w:p>
      </ns0:tocentry></ns0:tocchap>
   </xsl:for-each>
</w:body>
</w:wordDocument>

</xsl:template>

</xsl:stylesheet>

In all updates done through the InsertXML method, you must insert a complete wordDocument document, so the stylesheet also adds the wordDocument and body elements. The <p>, <r>, and <t> tags are required so that the text in the tocentry elements is displayed in Word 2003.

The next bit of code retrieves a reference to the toc node in the document. The code then calls the InsertXML method for that node's range, passing the XML text for the document and the path to the file containing the XSLT:

Dim ndToc As XMLNode

Set ndToc = ActiveDocument.SelectSingleNode("//ns0:toc", _
    "xmlns:ns0='dcb'")
ndToc.Range.InsertXML ActiveDocument.Range.XML, _
    "c:/Transforms/toc.xsl"

The result is a DocBook document with a table of contents, built from the chapter elements in the document.

If you intend to transform the whole document, rather than the content of a selected node, you can call the TransformDocument method of the Document object. The TransformDocument method accepts two parameters:

  • The path to the file with the XSLT code
  • A Boolean indicating whether the whole document is to be processed or just the non-WordML tags. Setting this parameter to True causes the WordML tags to be ignored.
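
A call to the TransformDocument method might look like the following sketch. The stylesheet path is a hypothetical placeholder, not a file from this article, and the routine name is made up for illustration.

```vba
Sub TransformWholeDocument()
    ' Hypothetical stylesheet path; substitute a real XSLT file.
    Const XSLT_PATH As String = "c:\Transforms\MyTransform.xsl"

    ' Second argument True: the WordML tags are ignored and only
    ' the schema tags are passed to the stylesheet.
    ActiveDocument.TransformDocument XSLT_PATH, True
End Sub
```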

Rather than hard coding the path name in the call to these methods, you can use an XSLTransform object from an XSLTransforms collection of a specified namespace. The Location property of an XSLTransform object provides the path name to the file containing the XSLT. If you rewrite the earlier code to use the Location property you would have this code:

ndToc.Range.InsertXML ActiveDocument.Range.XML, _
Application.XMLNamespaces(1).XSLTransforms(1).Location

Transforming existing document content is only one of the uses for applying XSLT. You can just as easily use XSLT as part of reading in an XML document to convert the document into WordML format (or any other XML format, for that matter). For example, Microsoft Office Access 2003 supports exporting its tables in an XML format. A simple XSLT transform allows files in that format to be converted into WordML for processing as a Word 2003 document—or into DocBook format if that were the ultimate format for the information.

Responding to Changes

Word 2003 also fires events as XML operations are performed. The Document object fires two new XML-related events:

  • XMLAfterInsert: Fires when a new node is added. The event routine is passed a reference to the new node.
  • XMLBeforeDelete: Fires when a node is deleted. The event routine is passed a reference to a Range object that includes the parts of the document to be deleted and a reference to the node to be deleted.

Both events are also passed a Boolean variable that is set to True if the node is added or deleted during an undo or redo operation.

The Word 2003 Application object fires two new XML-related events:

  • XMLValidationError: Fires whenever a validation error occurs in the document. A single parameter is passed to the event: a reference to the node with the error.
  • XMLSelectionChange: Fires whenever you select a new node. This event routine is passed four parameters: the Selection object for the newly selected material, references to both the node that lost focus and the node that gained focus, and a reason code.

To use these events, you'll need to declare object variables with the WithEvents keyword and assign the appropriate objects to them. This code does that in a class module, assigning the objects in the class's Initialize event:

Dim WithEvents wrd As Application
Dim WithEvents doc As Document

Private Sub Class_Initialize()
   Set doc = ActiveDocument
   Set wrd = Application
End Sub

In some other routine, such as the AutoOpen routine in Word 2003, you must create an instance of the class module and keep a reference to it. Once the instance exists, the code in the class module can catch the XML events.
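
One way to do this is sketched below. It assumes the class module shown above has been named XmlEventSink (a hypothetical name) and that this code lives in a standard module of the same project.

```vba
' Module-level variable keeps the event-handling object alive
' for as long as the document's project is loaded.
Private sink As XmlEventSink   ' XmlEventSink: hypothetical class name

Sub AutoOpen()
    ' Creating the instance runs Class_Initialize, which hooks
    ' the Application and Document events.
    Set sink = New XmlEventSink
End Sub
```

The variable is declared at module level because a local variable would be released, and the events disconnected, as soon as AutoOpen finished.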

Selection Changed

The currently selected node can change for a variety of reasons: moving the cursor, inserting a new node, or deleting a node. Not all of those activities generate a selection change event. For example, you may have your cursor between the end tag of one child element and the start tag of another child element. In that scenario, the currently selected node is the parent node of the two children. If you click between two other children within the same parent, the selected node doesn't change and no selection change event fires.

When a selection change event does fire, the reason code passed as the last parameter to the XMLSelectionChange event indicates why the selection has changed. The reason code can have three values:

  • wdXMLSelectionChangeReasonMove: The selection changed because you moved the cursor in the document to select a different node. The first time you select a node after the document is opened, the OldXMLNode parameter is Nothing.
  • wdXMLSelectionChangeReasonDelete: The selection changed because a node was deleted. OldXMLNode is always Nothing. Not all deletions trigger a selection change. For example, if the currently selected node is the parent node, deleting a child does not generate a selection change event. Use the XMLBeforeDelete event to catch all deletions.
  • wdXMLSelectionChangeReasonInsert: The selection changed because a new node was inserted. The new node is passed in NewXMLNode. You can also use the XMLAfterInsert event to catch node insertions.
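
The reason codes above could be handled with a Select Case block such as the following sketch, placed in the class module that declares the wrd variable. The parameter list shown here is an assumption; the safest way to get the exact signature is to let the Visual Basic Editor generate the procedure from the object and procedure drop-down lists.

```vba
' Assumed signature; generate the stub from the editor to confirm it.
Private Sub wrd_XMLSelectionChange(ByVal Sel As Selection, _
        ByVal OldXMLNode As XMLNode, ByVal NewXMLNode As XMLNode, _
        Reason As Long)
    Select Case Reason
        Case wdXMLSelectionChangeReasonMove
            ' Either node can be Nothing, so test before using it.
            If Not NewXMLNode Is Nothing Then
                Debug.Print "Moved to " & NewXMLNode.BaseName
            End If
        Case wdXMLSelectionChangeReasonInsert
            If Not NewXMLNode Is Nothing Then
                Debug.Print "Inserted " & NewXMLNode.BaseName
            End If
        Case wdXMLSelectionChangeReasonDelete
            Debug.Print "Selection changed after a deletion"
    End Select
End Sub
```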

As an example, this code runs the transform that creates the table of contents whenever a toc element is inserted:

Private Sub doc_XMLAfterInsert(ByVal NewXMLNode As XMLNode, _
        ByVal InUndoRedo As Boolean)
    If NewXMLNode.BaseName = "toc" Then
        NewXMLNode.Range.InsertXML ActiveDocument.Range.XML, _
            "c:/Transforms/toc.xsl"
    End If
End Sub

Validating Documents

One of the benefits of working with XML schemas is the ability to validate any particular document against the schema defining the tags used in the document. When working with a schema, as users insert new tags, Word 2003 flags violations of the schema in two places: the XML task pane and in the document itself.

The generated messages are generic XML error messages (a typical message when an element is first inserted is "Some required content is missing"). These messages give the user no real guidance as to what, exactly, should be done to eliminate the validation error.

Word 2003 lets you customize the messages that are generated for any node in the document. For any node, you can call the node's SetValidationError method, passing a ValidationStatus constant, a message to display on validation errors, and a flag indicating whether the message should be cleared automatically.

In this example, the code adds a new message when a validation error occurs on a TOC element. The message provides more information about what is required in a DocBook table of contents entry:

Private Sub wrd_XMLValidationError(ByVal XMLNode As XMLNode)
    If XMLNode.BaseName = "toc" Then
        XMLNode.SetValidationError wdXMLValidationStatusCustom, _
            "You must insert a title or tocentry element", True
    End If
End Sub

You can control when validation takes place. The default is for Word 2003 to validate on every user action that changes the document text against the schemas currently in use. So, the first step in managing validation is to turn off Word's automatic validation:

ActiveDocument.XMLSchemaReferences.AutomaticValidation = False

You can then either validate the whole document by calling the Validate method of the XMLSchemaReferences collection or validate just a single node by calling that node's Validate method. This code validates only the second node and its children:

ActiveDocument.XMLNodes(2).Validate

For the Validate method to work, you must set the XMLSchemaReferences AutomaticValidation property back to True.

After the Validate method is called, you can check the ValidationStatus property of any node to determine whether an error was found. You can also process the nodes in the document's XMLSchemaViolations collection to find all the nodes in error. This code displays the error message associated with each node that has a validation error:

Dim nd As XMLNode
For Each nd In ActiveDocument.XMLSchemaViolations
    MsgBox nd.ValidationErrorText
Next
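
Putting these pieces together, a routine along the following lines validates the whole document on demand and writes a one-line report for each violation. It is a sketch only: ReportSchemaViolations is a hypothetical name, and the code follows the statement above that AutomaticValidation must be True for the Validate method to work.

```vba
Sub ReportSchemaViolations()
    Dim nd As XMLNode

    With ActiveDocument.XMLSchemaReferences
        ' Validation must be switched back on before calling Validate.
        .AutomaticValidation = True
        .Validate
    End With

    ' Each member of XMLSchemaViolations is a node that failed validation.
    For Each nd In ActiveDocument.XMLSchemaViolations
        Debug.Print nd.BaseName & ": " & nd.ValidationErrorText
    Next nd

    Debug.Print ActiveDocument.XMLSchemaViolations.Count & " violation(s)"
End Sub
```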

Saving Your Document

When you are finished with the document, you will want to save it. If you want to be able to save the document before it is complete, you may want to set the AllowSaveAsXMLWithoutValidation property of the document's XMLSchemaReferences collection to True. This prevents you from getting error messages when the only problem is that all the required portions of the document aren't yet filled in.

The Save method of the Document object still saves the complete Word 2003 document. Setting the XMLSaveDataOnly property of the Document object to True causes Word 2003 to save only the tags associated with the schema into the document's file. Setting the XMLUseXSLTWhenSaving property of the Document object to True causes an XSLT transform to be applied to the document and the output of the transform to be saved. If you set XMLUseXSLTWhenSaving to True, you must also set the XMLSaveThroughXSLT property to the path name of the file containing the XSLT code. Finally, you must set the document format to XML.

This code, for example, saves only the schema-related tags in the document, after processing them with the ConvertToHTML.XSL stylesheet:

ActiveDocument.XMLSaveDataOnly = True
ActiveDocument.XMLSaveThroughXSLT = _
    "c:\Transforms\ConvertToHTML.XSL"
ActiveDocument.XMLUseXSLTWhenSaving = True
ActiveDocument.SaveAs "MyDocument.HTM", wdFormatXML

The document should be valid against its schema before you save it with an XSLT transform. Although you can set the AllowSaveAsXMLWithoutValidation property of the XMLSchemaReferences collection to True, the results of processing an XSLT stylesheet against an invalid document are unpredictable.

The XML data-only document that is created with this code does not include all the information that the original Word 2003 document did. All the WordML tags are lost and, typically, the data in the new document is a subset of the data in the original Word 2003 document. Because of this, you also want to save the original Word 2003 document to hang on to all of the information in it. The easiest way to do that is to set the XMLSaveDataOnly and XMLUseXSLTWhenSaving properties to False and do another save:

ActiveDocument.XMLSaveDataOnly = False
ActiveDocument.XMLUseXSLTWhenSaving = False
ActiveDocument.Save

Conclusion

Word 2003 has acquired a rich set of functionality by integrating XML into its object model. This integration provides the tools you need to create applications that take advantage of XML to support document creation. By extending the base functionality of Word 2003 with customized code, you can provide new levels of support to users.

About the Author

Peter Vogel is a consultant on Office and .NET development. He is the author of Visual Basic Object and Component Handbook (Prentice Hall PTR, 2000, ISBN: 0-13023-0731) and the editor of the Smart Access newsletter.