Retrieving the Table of Contents from Word 2010 Documents by Using the Open XML SDK 2.0
Summary: Use the strongly typed classes in the Open XML SDK 2.0 to retrieve an XML block that contains the table of contents from a Word document, without loading the document into Microsoft Word.
Applies to: Excel 2010 | Office 2007 | Office 2010 | Open XML | PowerPoint 2010 | VBA | Word 2010
Published: November 2010
Provided by: Ken Getz, MCW Technologies, LLC
The Open XML file formats make it possible to retrieve blocks of content from Word documents. The Open XML SDK 2.0 adds strongly typed classes that simplify access to the Open XML file formats. In particular, the SDK simplifies retrieving the block of XML that contains the table of contents. The code sample that is included with this Visual How To shows how to use the SDK to achieve this goal.

The sample provided with this Visual How To includes the code that is required to retrieve the XML block that contains the table of contents for a Word 2007 or Word 2010 document. Word provides several ways to insert a table of contents. Only a subset of these produces the internal XML structure that the code shown in this Visual How To requires to operate correctly. For more information, see the Read It section. The following sections walk through the code.

Be aware that a document has a table of contents only if its author explicitly created one (by using the Word user interface), and that the table of contents consists of a block of XML that describes references within the document. When you use the sample code to retrieve the table of contents, the procedure returns an XML element named TOC that contains the XML block of information from the original document. You (and your application) must interpret the results of retrieving the table of contents.

Setting Up References

To use the code from the Open XML SDK 2.0, you must add several references to your project. The sample project already includes these references, but in your own code, you would have to explicitly reference the following assemblies:
Also, you should add using and Imports statements to the top of your code file, as shown in the following example.
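The statements themselves did not survive in this copy of the article. The following C# directives are a reconstruction based on the types that the walkthrough names (XElement, WordprocessingDocument, DocPartGallery, SdtBlock), so treat the exact set as an assumption. They also assume that the project references the DocumentFormat.OpenXml assembly installed by the SDK, together with WindowsBase, which supplies the packaging classes that the Open XML SDK 2.0 builds on.

```csharp
// Assumed set of directives, inferred from the types used in the walkthrough.
using System.Linq;                            // Where, FirstOrDefault
using System.Xml.Linq;                        // XElement
using DocumentFormat.OpenXml;                 // OpenXmlElement
using DocumentFormat.OpenXml.Packaging;       // WordprocessingDocument, MainDocumentPart
using DocumentFormat.OpenXml.Wordprocessing;  // DocPartGallery, SdtBlock
```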
Examining the Procedure

The WDRetrieveTOC procedure accepts a single string parameter: the name of the document from which you want to retrieve the table of contents. The procedure returns an XElement instance named TOC that contains the table of contents XML, or a null reference if the document has no table of contents.
The procedure examines the document that you specify, looking for the special table of contents element. If it exists, the procedure returns it, wrapped in an element named TOC. To call the procedure, pass the parameter value, as shown in the example code. Verify that you provide a document named C:\temp\TOC.docx, which (for demonstration) contains a table of contents, before you run the sample code.
Accessing the Document

The code starts by creating a variable named TOC that the procedure will return before it exits.
The code then opens the document by using the WordprocessingDocument.Open method and indicating that the document should be opened for read-only access (the final false parameter). Given the open document, the code uses the MainDocumentPart property to navigate to the main document, and then the Document property of the part to retrieve a reference to the contents of the main document part. The code stores this reference in a variable named doc.
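A minimal sketch of these first steps follows. It is a reconstruction from the description above, not the author's original listing; the class and method names (TocAccessSketch, OpenDocumentOnly) are hypothetical, and the body stops at the point where the search for the table of contents would begin.

```csharp
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class TocAccessSketch   // hypothetical name
{
    public static XElement OpenDocumentOnly(string fileName)
    {
        // Value returned at the end; null means "no table of contents found".
        XElement TOC = null;

        // The second argument (isEditable) is false, so the package is opened read-only.
        using (WordprocessingDocument document =
            WordprocessingDocument.Open(fileName, false))
        {
            // Navigate to the main document part, then to its root Document element.
            MainDocumentPart docPart = document.MainDocumentPart;
            Document doc = docPart.Document;

            // The search for the table of contents marker would go here.
        }

        return TOC;
    }
}
```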
Finding the Table of Contents

The sample procedure continues by searching the descendants of the document for XML elements of type DocPartGallery that have a value of "Table of Contents". The code returns a reference to the first matching element (or a null reference, if no element matches). Before it reads the value, the code checks the HasValue property of the element's Val property to ensure that a value is present.
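The search can be sketched as a small helper. The helper name FindTocMarker and its containing class are hypothetical, and the explicit check that Val is not null is an extra guard that the description above does not mention.

```csharp
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

public static class TocMarkerSketch   // hypothetical name
{
    // Returns the first w:docPartGallery element whose value is
    // "Table of Contents", or null if the document contains none.
    public static OpenXmlElement FindTocMarker(Document doc)
    {
        return doc.Descendants<DocPartGallery>()
            .Where(g => g.Val != null          // extra guard, an assumption of this sketch
                     && g.Val.HasValue         // the check described in the text
                     && g.Val.Value == "Table of Contents")
            .FirstOrDefault();
    }
}
```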
Finding the Parent and Creating the Return Value

After the code finds the table of contents marker element, it must walk back up the element hierarchy until it reaches the enclosing SdtBlock element, which contains the complete table of contents. The code loops while the block variable is not a null reference and is not an SdtBlock, each time setting block to refer to its parent node. When the loop ends, the code sets the TOC variable to contain the OuterXml property of the block variable. The code does not verify that block is non-null after the loop. The variable can be null only if the document is not well formed, and in that case the code raises an exception.
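The upward walk and the creation of the return value might look like the following sketch. The helper name WrapEnclosingBlock is hypothetical, and passing OuterXml as the content of the TOC element is one reading of the description above rather than a confirmed copy of the original listing.

```csharp
using System.Xml.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

public static class TocParentSketch   // hypothetical name
{
    // Takes the marker element and returns a TOC element that holds
    // the XML of the enclosing SdtBlock.
    public static XElement WrapEnclosingBlock(OpenXmlElement block)
    {
        // Climb until the current element is an SdtBlock or the root is passed.
        while (block != null && !(block is SdtBlock))
        {
            block = block.Parent;
        }

        // As in the description, there is no null check here: if no SdtBlock
        // ancestor exists, this line throws a NullReferenceException.
        return new XElement("TOC", block.OuterXml);
    }
}
```

Because OuterXml is a string, this form stores the markup as the text content of the TOC element. A caller that wants a navigable tree could parse that string separately.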
Sample Procedure

The following code example contains the complete sample procedure.
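The listing is missing from this copy of the article, so the version below is a reconstruction assembled from the steps described in the preceding sections. It should be read as a sketch under that assumption. The containing class name and the Main method are hypothetical additions that show how the procedure might be called with the C:\temp\TOC.docx document mentioned earlier.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class TocSample   // hypothetical containing class
{
    public static XElement WDRetrieveTOC(string fileName)
    {
        XElement TOC = null;

        // Open the document read-only (isEditable = false).
        using (WordprocessingDocument document =
            WordprocessingDocument.Open(fileName, false))
        {
            MainDocumentPart docPart = document.MainDocumentPart;
            Document doc = docPart.Document;

            // Look for the marker element with the value "Table of Contents".
            OpenXmlElement block = doc.Descendants<DocPartGallery>()
                .Where(g => g.Val != null && g.Val.HasValue
                         && g.Val.Value == "Table of Contents")
                .FirstOrDefault();

            if (block != null)
            {
                // Back up to the enclosing SdtBlock element.
                while (block != null && !(block is SdtBlock))
                {
                    block = block.Parent;
                }

                // Throws if the document is malformed and no SdtBlock was found.
                TOC = new XElement("TOC", block.OuterXml);
            }
        }

        return TOC;
    }

    public static void Main()
    {
        // Assumes the demonstration document exists at this path.
        XElement toc = WDRetrieveTOC(@"C:\temp\TOC.docx");

        if (toc == null)
        {
            Console.WriteLine("No table of contents found.");
        }
        else
        {
            // Value holds the raw XML of the SdtBlock as text.
            Console.WriteLine(toc.Value);
        }
    }
}
```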
The sample that is included with this Visual How To describes code that retrieves the table of contents from a Word document. To use the sample, you must install the Open XML SDK 2.0, available from the link listed in the Explore It section. The sample also uses a modified version of code included as part of a set of code examples for the Open XML SDK 2.0. The Explore It section also includes a link to the full set of code examples, although you can use the sample without downloading and installing the code examples.

Be aware that Word provides at least four distinct means of inserting a table of contents into a document. Two of these techniques result in internal XML that works with the code shown in this Visual How To. Specifically, the following two techniques provide a table of contents nested within an SdtBlock element:
The following two techniques also result in a valid table of contents in the document, but they do not nest the table of contents inside an SdtBlock element, so the code shown in this Visual How To will not work with them:
These two options create free-standing table of contents elements, and would require more complex code to extract the table of contents from the document. The code also does not work with a document imported from a version of Word earlier than Word 2007, because those versions never wrapped the table of contents elements in an SdtBlock.

To understand the sample code, it is useful to examine the contents of the document by using the Open XML SDK 2.0 Productivity Tool for Microsoft Office, which is included as part of the Open XML SDK 2.0. Figure 1 shows a sample document that contains a table of contents, opened in the tool. The sample code retrieves a reference to the Document part, and within that part, locates the DocPartGallery element (known as w:docPartGallery) with the value "Table of Contents".

The sample application demonstrates only some of the properties and methods provided by the Open XML SDK 2.0 that you can use to modify document structure. For more information, see the documentation that is included with the Open XML SDK 2.0 Productivity Tool. Click the Open XML SDK Documentation tab in the lower-left corner of the application window, and search for the class that you want to study. Although the documentation does not currently include code examples, the sample shown here and the documentation together should enable you to modify the sample application.
> [!VIDEO https://www.microsoft.com/en-us/videoplayer/embed/15008383-542b-4505-864e-0fc43f2ccda1]

Length: 00:09:53