Export a Word Document to XML
Microsoft® Word 2000 and Microsoft Word 2002
Summary: This solution allows you to export a Word document to an XML file. (12 printed pages)
Microsoft Word documents are not usually thought of as a data source in the traditional sense. However, if you author a document in Word with an eye towards converting it to XML, you can turn the document into a data source that can be easily queried and reused. XML is an ideal format because it is data-centric, easily manipulated and displayed, and accessible programmatically.
Converting any data to XML requires parsing the data and tagging it with descriptors. Within a Word document, text and hyperlinks are already tagged by their formatting. Most documents contain multiple structural elements, such as headings, bylines, footnotes, and quotations. All types of formatting can be applied to indicate what the elements are. For example, most headings are not the same size, weight, or even font as paragraph-level text.
Within a Word document, you alter text by one of two methods: by applying a style or by applying formatting manually. A style in Word is nothing more than a named set of specific instructions describing the formatting to apply. When you apply a style, you are basically tagging that text as something: a heading, a subheading, a code block, a quotation, or some other document element. When you apply formatting manually, you are tagging that text as something special, but that something is not defined. If you were to attempt to parse the document by formatting, you would know how the text appears in the document, but you wouldn't know what the text is. However, if you apply formatting only by using styles, then when you parse the document you know not only how the text appears, but also, from the style name, what the text is.
Creating a document in this manner requires that you know what your formatting represents. Instead of making text bold for emphasis, you apply a style that not only bolds the text but also describes why the text is bold to begin with. For example, suppose you are creating a document and type out a quotation. Rather than applying italic formatting to the quotation to highlight it, you should create a style called "Quotation" that includes italic formatting.
After you author a document by using styles and then convert it to XML, it becomes a queryable data source. If you have a folder of XML documents, it is essentially a database. Using the FileSystemObject object in the Microsoft Scripting Runtime object model to loop through all of the files in the folder, you could apply an Extensible Stylesheet Language (XSL) query to pull out all of the headings, author information, quotations, or whatever you want, from each of the XML articles. Possible uses for this approach are to create synopses of articles, create a code library from developer articles, or retrieve all of the references in an article set.
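As a rough illustration of the folder-as-database idea, the following Python sketch loops over every .xml file in a folder and pulls out the text tagged with one style. It stands in for the FileSystemObject loop and the XSL query described above rather than reproducing them. The style attribute on the StyledText element, the function name, and the style names are assumptions made for this example; the article does not document them.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def collect_styled_text(folder, style_name):
    """Return {file name: [text, ...]} for every <StyledText> with the given style.

    Assumption: the style name is stored in a hypothetical "style" attribute.
    """
    results = {}
    for xml_path in sorted(Path(folder).glob("*.xml")):
        root = ET.parse(xml_path).getroot()
        matches = [
            (element.text or "").strip()
            for element in root.iter("StyledText")
            if element.get("style") == style_name
        ]
        if matches:
            results[xml_path.name] = matches
    return results
```

Calling collect_styled_text with a folder path and a style name such as "Heading 1" would return the matching text grouped by file, which is the kind of result a synopsis or code-library builder could start from.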
The best document is one that can be used multiple times in multiple ways. When you create a document in Word, to reuse it, you usually copy and paste sections of text, reformat the text, resave the document with a different file name, or republish the document as a Hypertext Markup Language (HTML) file.
With a document saved in XML format, you can build different documents from the original document using XSL transformations. XSL allows you to pick and choose which pieces of data in the XML document you want to display. A full discussion of XSL is outside the scope of this article, but there are many great resources discussing XSL on MSDN. The solution described later in this article allows you to output an HTML file that uses an XSL transformation of your XML source.
Now that you know some of the reasons to put Word documents into an XML format, let me describe how to do it. I will describe how I did it, what options I discovered along the way, and some of the problem areas I encountered. Finally, I will list a template of the XML that is output by my solution.
The goal of this solution is to convert a Word document into a well-formed XML document. As noted above, the most logical way to do this is to tag data in a Word document by using styles. An overly simplified description of how this solution works is that it parses through the document paragraph by paragraph and identifies text with styles applied, and then tags that text. The complications with this approach are discussed later in this article.
The following table outlines the object library references needed for this solution:
|Control or object model reference||Implementation|
|Microsoft Office 9.0 Object Library||Used to handle graphics.|
|Microsoft Word 9.0 Object Library||Used for document access.|
|Microsoft Excel 9.0 Object Library||Used for sorting arrays.|
|Microsoft Scripting Runtime||Used to write out the XML, XSL, and HTML files.|
The following table outlines the custom classes used in this solution:
|Class or Form||Description|
|frmWizard||This form contains all of the graphical elements for this solution. Each step is contained in a separate frame control.|
|XMLConverter||This object stores all of the options that are set on the forms and contains all the methods directly involved in converting the document to XML.|
|StyleInstance||This object contains information about a specific instance of a style in the document.|
|StyleInstances||This object contains a collection of StyleInstance objects.|
|DocumentStyleInformation||This object contains a StyleInstances collection and the function to retrieve a StyleInstances collection.|
|GraphicInstance||This object contains information about a specific instance of a graphic in the document.|
|GraphicInstances||This object contains a collection of GraphicInstance objects.|
|DocumentGraphicInformation||This object contains a GraphicInstances collection and the function to retrieve a GraphicInstances collection.|
The following table shows the primary functions (and their parents) used in this solution:
|Function||Description||Functions called|
|ConvertToXML||Parent: XMLConverter. This function handles the overall build process of the XML output.||ParseParagraph, CleanString, WriteShape, WriteInlineShape, GetCSSArray, ParseTable|
|ParseParagraph||Parent: XMLConverter. This function handles individual paragraphs that are not within list paragraphs or tables.||CleanString, WriteShape, WriteInlineShape, ParseForGraphics|
|CleanString||Parent: XMLConverter. This function removes nonprinting characters from strings, and replaces characters that cause problems in XML with their appropriate escape codes.|
|WriteShape||Parent: XMLConverter. This function writes an XML string that represents a Shape object. This function is only called when the user selects to include graphics in a separate list.|
|WriteInlineShape||Parent: XMLConverter. This function writes an XML string representing an InlineShape object. This function is only called when the user selects to include graphics in a separate list.|
|ParseForGraphics||Parent: XMLConverter. This function parses a Range passed to it that represents Shape or InlineShape objects. This function is only called when graphics are written out inline.||WriteGraphicInfo|
|WriteGraphicInfo||Parent: XMLConverter. This function writes an XML string that represents a GraphicInstance object. This function is only called when graphics are written out inline.|
|GetCSSArray||Parent: XMLConverter. Using the HTML Document Object Model, this function gets a list of the Cascading Style Sheet (CSS) formatting associated with each style in the document.|
|ParseTable||Parent: XMLConverter. This function handles tables.||ParseCell|
|ParseCell||Parent: XMLConverter. This function parses individual cells within a table. This function calls the ParseTable function when a cell contains a nested table. The function calls the ParseParagraph function for each paragraph within a cell.||ParseParagraph, ParseTable|
|GetStyleInformation||Parent: DocumentStyleInformation. This function populates the StyleInstances collection of the DocumentStyleInformation object.||ExcelSort|
|GetGraphicInformation||Parent: DocumentGraphicInformation. This function populates the GraphicInstances collection of the DocumentGraphicInformation object.||ExcelSort|
|ExcelSort||This function uses Microsoft Excel to sort an array by the specified field. This function is used to sort the style and graphics in the order they appear in the document.|
|WriteXSL||This routine creates a skeleton XSL file based on the XML file structure created by this application. The XSL file is stored in the same location and with the same name as the XML file, but with an .xsl extension.|
|WriteHTM||This routine creates an HTML file that uses script to write out the contents of the XML file based on the associated XSL file. The HTML file is stored in the same location and with the same name as the XML file, but with an .htm extension.|
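To make the CleanString description more concrete, here is a minimal Python sketch of that kind of routine. It is not the article's VBA implementation. Which control characters are dropped, and the decision to keep tabs and line feeds, are assumptions made for illustration.

```python
# Characters that are markup-significant in XML, mapped to their escapes.
XML_ESCAPES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&apos;",
}


def clean_string(text):
    """Drop nonprinting control characters, then escape XML-significant ones."""
    cleaned = []
    for char in text:
        # Assumption: keep tab and line feed, drop every other control
        # character (for example the end-of-cell mark, character code 7).
        if ord(char) < 32 and char not in "\t\n":
            continue
        cleaned.append(XML_ESCAPES.get(char, char))
    return "".join(cleaned)
```

Working character by character means an ampersand is escaped exactly once, so the escapes themselves are never escaped a second time.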
The structure of a Word document creates several interesting parsing options. The first option is the granularity with which the document is parsed. I give the user the option to export the document only, or to export an enhanced XML document. If the first option is chosen, all the text is written out in <Paragraph> elements. The second option creates a much more useful document. With an enhanced document, the options all revolve around the properties, styles, and graphics contained in the document. Additionally, you have the option to output an HTML and XSL file.
Users may want to know not just about the data in the document, but also data about the document itself. The document properties options let the user choose among the following:
- No document properties.
- All document properties (both built-in and custom).
- Only built-in document properties.
- Only custom document properties (user-defined properties).
Users have the option of deciding to what level styles are handled:
- Paragraph and character styles are handled: all style formatting is represented in the XML document, including tables and lists.
- Only paragraph level styles are handled: only the style applied to the paragraph is exported. Styles applied to text within the paragraphs are ignored. Tables and lists are not exported as anything but paragraphs.
Additionally, since we are using styles to tag information in the document, it may be useful to know what formatting each style represents. Users can choose from the following options:
- List only styles actually in use in the document (this option is faster).
- List all styles available in the document.
If pictures are worth a thousand words, then they should also be converted to XML. However, unless the picture is an external file being linked to, there is no way to take the binary information that represents the picture and convert it to XML. Microsoft clipart pictures can be converted to a Vector Markup Language (VML) format, but that is outside the scope of this project. Even if the picture can't be converted, users may want to see a placeholder for the graphic, so I provide the following options:
- No graphic information is written out.
- All graphic information is written out inline.
- Only linked graphics are written out inline.
- All graphics are written out in a separate list (not inline).
When you select to output an HTML and XSL file, the solution builds an HTML wrapper for the XML file. The HTML file contains script that applies the XSL file to the XML file. The XSL is very basic but should give you a good starting point for building a more elaborate XSL transformation.
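The parsing options described above can be pictured as one settings object, much as the XMLConverter class stores the choices made on the wizard form. The Python sketch below is only an illustration of that idea. All of the class names, field names, and default values are hypothetical and do not come from the solution's code.

```python
from dataclasses import dataclass
from enum import Enum


class PropertiesOption(Enum):
    NONE = "none"
    ALL = "all"
    BUILT_IN_ONLY = "built-in"
    CUSTOM_ONLY = "custom"


class StyleLevel(Enum):
    PARAGRAPH_AND_CHARACTER = "paragraph+character"
    PARAGRAPH_ONLY = "paragraph"


class GraphicsOption(Enum):
    NONE = "none"
    ALL_INLINE = "all-inline"
    LINKED_INLINE = "linked-inline"
    SEPARATE_LIST = "separate-list"


@dataclass
class ExportOptions:
    # Defaults are assumptions chosen for the example only.
    enhanced: bool = True  # False exports plain <Paragraph> elements only
    properties: PropertiesOption = PropertiesOption.ALL
    style_level: StyleLevel = StyleLevel.PARAGRAPH_AND_CHARACTER
    list_all_styles: bool = False  # False lists only styles in use (faster)
    graphics: GraphicsOption = GraphicsOption.ALL_INLINE
    write_html_and_xsl: bool = False
```

Modeling each group of mutually exclusive choices as an enumeration keeps invalid combinations, such as "no graphics" together with "separate list", from being expressed at all.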
When I first undertook this project, I thought it would be even easier than converting an Access table or Excel data to XML. I was wrong. The intricacies and limitations of the Word object model make this a very interesting exercise. The biggest areas of concern were the Style object, graphics, tables, and lists.
The Style object in the Word object model represents a style definition, not an instance of the style in the document. While the searching mechanism in Word is powerful and lets you search for instances of a style, it does not let you search for the next instance of any style - only a specific style. When you are parsing out a document sequentially, knowing when the next instance of a specific style occurs is useless because you may have passed over an instance of another style.
To get around this problem, I created three custom classes: the StyleInstance object, the StyleInstances collection, and the DocumentStyleInformation object. The StyleInstance object has the six properties outlined in the following table:
|Property||Description|
|StartPosition||Start position of the style within the document. Retrieved from the Start property (Range object).|
|EndPosition||End position of the style within the document. Retrieved from the End property (Range object).|
|StyleName||Name of the style. Retrieved from the NameLocal property (Style object).|
|StyleType||The style type: character or paragraph. Retrieved from the Type property (Style object).|
|Target||Hyperlink address. Only written when the text is formatted with the Hyperlink style.|
|Index||The index of the StyleInstance within the StyleInstances collection.|
The DocumentStyleInformation object has two important pieces: a StyleInstances collection and the GetStyleInformation function. The function is simple: it loops through every style in the document style list and searches for instances of the style within the document. For each instance it finds, it adds the start and end position, the style name, the style type, and the target address (if it is hyperlinked) to an internal array. After it has looped through the entire style list, the array is sorted by the start position and then placed into the StyleInstances collection. The result is a sequential list of all instances where the styles have been applied. When you then parse the document sequentially, you can access your styles in sequential order.
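The collect-then-sort approach can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the solution's code. The found_by_style dictionary stands in for the results of searching the document for each style, a plain sort replaces the ExcelSort call, and the 1-based Index value is assumed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StyleInstance:
    start_position: int
    end_position: int
    style_name: str
    style_type: str  # "character" or "paragraph"
    target: Optional[str] = None  # hyperlink address, if any
    index: int = 0


def get_style_information(found_by_style):
    """Flatten per-style search results and order them by start position.

    found_by_style maps a style name to (style_type, [(start, end, target), ...]).
    It is a hypothetical stand-in for looping over the style list and
    searching the document for each style in turn.
    """
    instances = []
    for style_name, (style_type, hits) in found_by_style.items():
        for start, end, target in hits:
            instances.append(StyleInstance(start, end, style_name, style_type, target))
    # Sort by start position so instances come out in document order.
    instances.sort(key=lambda instance: instance.start_position)
    # Assumption: Index is 1-based, like a VBA collection.
    for position, instance in enumerate(instances, start=1):
        instance.index = position
    return instances
```

Because every instance carries its own start position, instances of different styles interleave correctly after the sort, which is exactly what a per-style search cannot give you on its own.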
It is possible in Word to set the style of hyperlinked text to something other than Hyperlink, or even change hyperlinked text to multiple styles. Because XML must be well-formed, it is easier to handle a hyperlink as one consecutive text stream, rather than try to break it into various style pieces. For this solution, hyperlinks are only handled if they have the Hyperlink style applied. When text with the Hyperlink style is encountered, the hyperlink's address is written as an attribute of the element.
Graphics in Word documents pose a problem similar to styles. Pictures in Word documents are contained in two types of objects: Shape or InlineShape objects. If you sequentially loop through a collection of Shape objects, you could pass over instances of InlineShape objects, and vice versa.
This problem was handled in the same way as styles. I created three custom classes to handle all of the graphics: the GraphicInstance object, the GraphicInstances collection, and the DocumentGraphicInformation object. The GraphicInstance object has the nine properties outlined in the following table:
|Property||Description|
|WordShapeType||1 for Shape objects, 2 for InlineShape objects.|
|StartPosition||Start position of the graphic within the document. Retrieved from the Start property (Anchor object) for Shape objects and the Start property (Range object) for InlineShape objects.|
|EndPosition||End position of the graphic within the document. Retrieved from the End property (Anchor object) for Shape objects and the End property (Range object) for InlineShape objects.|
|GraphicType||The type of Shape or InlineShape object; maps to a wdShapeType or wdInlineShapeType constant value.|
|LinkedPath||The path to the external file represented by the graphic. Only written for linked objects.|
|GraphicSubType||The AutoShape type, maps to a msoAutoShapeType constant value. Only written if the Shape object is an AutoShape.|
|OLEClass||The OLE class type.|
|Target||The target address if the graphic is a hyperlink.|
|Index||The index of the GraphicInstance in the GraphicInstances collection.|
The DocumentGraphicInformation object is similar to the DocumentStyleInformation object. It contains a GraphicInstances collection and the GetGraphicInformation function. The GetGraphicInformation function builds an array of information about all the Shape and InlineShape objects, sorts the array by the start position, and then places the array into the GraphicInstances collection.
One quirk with Shape and InlineShape objects is that they don't always return the same value for the Hyperlink property if they aren't hyperlinks. When an InlineShape object is not a hyperlink, the Hyperlink property returns Nothing, which is easy enough to check. However, when a Shape object is not a hyperlink, the Hyperlink property returns a Hyperlink object, but accessing any properties of the Hyperlink object returns "Run-time error 4198: Command failed." In the GetGraphicInformation routine, this error is trapped and handled appropriately.
The DocumentGraphicInformation object stores information not only about the pictures in a document, but also about any embedded or linked ActiveX® object. Since the primary use of the Shape and InlineShape objects is to represent pictures, I used "Graphic" in naming these objects, but you could just as easily use OLE or ActiveX in place of "Graphic".
You may have also noticed that the DocumentStyleInformation and DocumentGraphicInformation objects are very similar. It is possible to put all of the information into a single object, but to make it clearer what type of object I was working with, I wanted to separate the classes. Since a picture can occur in the middle of a block of styled text, it is also easier to parse with two separate classes, rather than one.
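One way to see why two separately sorted collections are still easy to consume together is to merge them by start position while parsing. The Python sketch below shows that idea with plain dictionaries. The "start" key and the function name are assumptions for the example, and the article does not state that the solution merges the collections this way.

```python
import heapq


def merge_by_start(style_instances, graphic_instances):
    """Return ("style" | "graphic", item) pairs in document order.

    Both inputs must already be sorted by their "start" value, as the
    two instance collections are after sorting.
    """
    styles = [("style", item) for item in style_instances]
    graphics = [("graphic", item) for item in graphic_instances]
    # heapq.merge walks both sorted lists in step without re-sorting them.
    return list(heapq.merge(styles, graphics, key=lambda pair: pair[1]["start"]))
```

Each pair keeps a label saying which collection it came from, so the parser can tell a graphic that falls inside a block of styled text from the styled text itself.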
I initially created the parsing routine (ConvertToXML) to walk through the document paragraph by paragraph. The first thing that the paragraph loop does is check to see if the range of the current paragraph contains a table. If it does, it sets a flag (blnResumeParagraphProcessing) not to parse out the following ranges as paragraphs, but rather table cells. Once it is determined that a range is no longer part of a table, it reverts to processing ranges as paragraphs. The downside of this mechanism is that it only parses top-level tables, not nested tables.
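A quirk within the Word object model makes it very difficult to handle nested tables. When you check a table cell to see if it contains a nested table by using the Tables collection of the Cell object, with syntax of the following form, it returns 0:
ThisDocument.Tables(1).Cell(1, 1).Tables.Count
However, when you check the range contained within that cell for nested tables by using the Tables collection of the Range object, with syntax of the following form, it always returns 1:
ThisDocument.Tables(1).Cell(1, 1).Range.Tables.Count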
There is no way to use the Range object to determine if there is a nested table, because the Range object of a cell always contains the parent table. Since the top-level parsing mechanism first checked the range to see if it contained a table, this proved to be troublesome. I got around this limitation by using two functions, ParseTable and ParseCell. The top-level parser determines that a table has been encountered, and then calls the ParseTable function. The ParseTable function then loops through each cell, writes out the appropriate tags to wrap each cell, and calls the ParseCell function to handle the data within the cell. The ParseCell function loops through each Paragraph object in a cell and calls the ParseParagraph function to write it out. If the ParseParagraph function encounters a table, it in turn calls the ParseTable function.
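The mutual recursion among the three functions can be sketched without Word at all. In the following Python illustration a table is a list of rows, a row is a list of cells, a cell is a list of paragraphs, and a paragraph is either a string or a nested table. That data model and the function names are assumptions made for the example; only the element names come from the solution's XML output.

```python
import xml.etree.ElementTree as ET


def parse_table(parent, table):
    """Write a <Table> element; a table is a list of rows, a row a list of cells."""
    table_element = ET.SubElement(parent, "Table")
    for row in table:
        row_element = ET.SubElement(table_element, "Row")
        for cell in row:
            parse_cell(ET.SubElement(row_element, "Cell"), cell)


def parse_cell(cell_element, cell):
    """A cell is a list of paragraphs; hand each one to parse_paragraph."""
    for paragraph in cell:
        parse_paragraph(cell_element, paragraph)


def parse_paragraph(parent, paragraph):
    """A paragraph is either a string of text or a nested table (a list)."""
    paragraph_element = ET.SubElement(parent, "Paragraph")
    if isinstance(paragraph, list):
        # A nested table: recurse, so any depth of nesting is handled.
        parse_table(paragraph_element, paragraph)
    else:
        ET.SubElement(paragraph_element, "Text").text = paragraph
```

Because parse_paragraph calls back into parse_table, no depth limit has to be coded anywhere; the call stack tracks how deeply the tables are nested.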
There are two caveats to the way tables are handled by this solution. First, styles applied within the table are ignored. Second, nested lists and graphics are also ignored. It would not be difficult to handle these cases, but they fall outside of the scope of this project. To handle both cases, you would simply need to pull the loop that parses paragraphs (marked by the comment "Handles paragraph XML output") into a separate function, and then call the new function from within the ParseCell function.
Lists in Word documents present a challenge similar to tables. Each item in a list is a Paragraph object. You determine whether the paragraph is part of a list by using the ListParagraphs property of the Range object:
If ThisDocument.Paragraphs(1).Range.ListParagraphs.Count > 0 Then
    ...
End If
This routine also sets the blnResumeParagraphProcessing flag. When a paragraph contains a list item, the routine stops processing as text paragraphs, and processes the list. The portion of the routine that handles the list also includes the type of list as an attribute of the list element. The value of the attribute maps to the wdListType constant. If the list contains outline indenting, the level of indenting is also written out in the level attribute.
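The switch between paragraph processing and list processing amounts to grouping consecutive list paragraphs under one list element. The Python sketch below shows that grouping on plain dictionaries. The ListItem element name, the placement of the level attribute on each item, the dictionary keys, and the position of the List element relative to Paragraph are all simplifying assumptions. The integer list type is only a stand-in for a wdListType value.

```python
import xml.etree.ElementTree as ET
from itertools import groupby


def write_paragraphs(body, paragraphs):
    """Append <Paragraph> and <List> elements to body.

    Each paragraph is a dict with "text" and, for list items only,
    "list_type" (an integer list-type code) and "level".
    """
    for list_type, run in groupby(paragraphs, key=lambda p: p.get("list_type")):
        if list_type is None:
            # Ordinary text paragraphs are written one by one.
            for paragraph in run:
                paragraph_element = ET.SubElement(body, "Paragraph")
                ET.SubElement(paragraph_element, "Text").text = paragraph["text"]
        else:
            # A run of consecutive list paragraphs becomes a single <List>.
            list_element = ET.SubElement(body, "List", type=str(list_type))
            for paragraph in run:
                item = ET.SubElement(
                    list_element, "ListItem", level=str(paragraph.get("level", 1))
                )
                item.text = paragraph["text"]
```

Grouping by list type means two adjacent lists of different types come out as two separate list elements, while the paragraphs before and after a list are written as ordinary paragraphs.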
The XML output by this application is very straightforward and very similar to the HTML output by Word itself, but it fully accounts for all styled text, tables, and lists. Listed below is an XML representation of this structure, without any data included:
<?xml version="1.0" encoding="ISO-8859-1"?>
<Document>
  <DocumentProperties>
    <BuiltInProperties>
      <Property/>
    </BuiltInProperties>
    <CustomProperties>
      <Property/>
    </CustomProperties>
  </DocumentProperties>
  <DocumentStyles>
    <Style/>
  </DocumentStyles>
  <DocumentBody>
    <Paragraph>
      <Text/>
      <StyledText/>
      <InlineShape/>
      <Shape/>
      <Table>
        <Row>
          <Cell>
            <Paragraph>
              <Text/>
              <StyledText/>
              <Table/>
            </Paragraph>
          </Cell>
        </Row>
      </Table>
      <List/>
    </Paragraph>
  </DocumentBody>
</Document>
This solution provides a starting point for building an XML parser for Word documents. In addition to the XML functionality, it shows how to build custom objects to handle sequential instances of all styles and graphics, and how to loop through tables and lists. Remember, documents shouldn't be converted to XML merely for the sake of putting them in XML. The best document to convert to XML is one that makes use of styles and will be reused in other ways.