Retrieving the Table of Contents from Word 2010 Documents by Using the Open XML SDK 2.0

Office Visual How To

Summary:  Use the strongly typed classes in the Open XML SDK 2.0 to retrieve an XML block that contains the table of contents from a Word document, without loading the document into Microsoft Word.

Applies to: Excel 2010 | Office 2007 | Office 2010 | Open XML | PowerPoint 2010 | VBA | Word 2010

Published:  November 2010

Provided by:  Ken Getz, MCW Technologies, LLC

Overview

The Open XML file formats make it possible to retrieve blocks of content from Word documents. The Open XML SDK 2.0 adds strongly typed classes that simplify access to the Open XML file formats. In particular, the SDK simplifies the task of retrieving the block of XML that contains the table of contents. The code sample that is included with this Visual How To shows how to use the SDK to achieve this goal.

Code It

The sample provided with this Visual How To includes the code that is required to retrieve the XML block that contains the table of contents for a Word 2007 or Word 2010 document. Word provides several ways to insert a table of contents, and only a subset of them produces the internal XML structure that the code shown in this Visual How To requires. For more information, see the Read It section. The following sections walk through the code. Be aware that a document has a table of contents only if the author explicitly created one by using the Word user interface, and that the table of contents consists of a block of XML that describes references within the document. When you use the sample code to retrieve the table of contents, the procedure returns an XML element named TOC that contains the XML block from the original document. You (and your application) must interpret the returned XML.

Setting Up References

To use the code from the Open XML SDK 2.0, you must add several references to your project. The sample project already includes these references, but in your own code, you would have to explicitly reference the following assemblies:

  • WindowsBase: This reference may be set for you, depending on the kind of project that you create.

  • DocumentFormat.OpenXml: Installed by the Open XML SDK 2.0.

Also, add the following Imports (Visual Basic) or using (C#) statements to the top of your code file.

Imports DocumentFormat.OpenXml
Imports DocumentFormat.OpenXml.Packaging
Imports DocumentFormat.OpenXml.Wordprocessing

using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

Examining the Procedure

The WDRetrieveTOC procedure accepts a single string parameter that contains the name of the document from which you want to retrieve the table of contents. The procedure returns an XElement instance named TOC that contains the table of contents XML, or a null reference if the table of contents does not exist.

Public Function WDRetrieveTOC(ByVal fileName As String) As XElement

public static XElement WDRetrieveTOC(string fileName)

The procedure examines the document that you specify, looking for the special table of contents element. If it exists, the procedure returns it, wrapped in an element named TOC. To call the procedure, pass the parameter value, as shown in the example code. Verify that you provide a document named C:\temp\TOC.docx, which (for demonstration) contains a table of contents, before you run the sample code.

Dim result = WDRetrieveTOC("C:\Temp\TOC.docx")
Console.WriteLine(result.Value)

var result = WDRetrieveTOC(@"C:\temp\toc.docx");
Console.WriteLine(result.Value.ToString());
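Because the procedure stores the block's OuterXml inside the TOC element as text, result.Value is a string of WordprocessingML rather than a tree of child elements. One way to interpret it is to parse that string with LINQ to XML and read the visible text of each paragraph. The following is a minimal sketch under that assumption. The TocReader class, the GetTocEntries helper, and the hand-written sample fragment are hypothetical and are not part of the sample project.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TocReader
{
  // WordprocessingML main namespace (the "w" prefix).
  static readonly XNamespace W =
    "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

  // Hypothetical helper: parses the XML text stored in the TOC element and
  // returns the concatenated w:t text of each non-empty paragraph.
  public static List<string> GetTocEntries(XElement toc)
  {
    if (toc == null) return new List<string>();

    XElement sdt = XElement.Parse(toc.Value);
    return sdt.Descendants(W + "p")
      .Select(p => string.Join("",
        p.Descendants(W + "t").Select(t => t.Value).ToArray()))
      .Where(text => text.Length > 0)
      .ToList();
  }

  public static void Main()
  {
    // Hand-written stand-in for the element that WDRetrieveTOC returns.
    string xml =
      "<w:sdt xmlns:w=\"" + W.NamespaceName + "\"><w:sdtContent>" +
      "<w:p><w:r><w:t>Contents</w:t></w:r></w:p>" +
      "<w:p><w:r><w:t>Introduction</w:t></w:r><w:r><w:tab/></w:r>" +
      "<w:r><w:t>1</w:t></w:r></w:p>" +
      "</w:sdtContent></w:sdt>";
    var toc = new XElement("TOC", xml);

    foreach (string entry in GetTocEntries(toc))
    {
      Console.WriteLine(entry);
    }
  }
}
```

In this sketch the tab between an entry title and its page number is a separate w:tab element, so the two text runs are joined without a separator. An application that needs the title and page number separately would have to handle w:tab (and the field codes that Word stores alongside the text) itself.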

Accessing the Document

The procedure starts by creating a variable named TOC that it will return before it exits.

Dim TOC As XElement = Nothing
' Code removed here…
Return TOC

XElement TOC = null;
// Code removed here…
return TOC;

The code then opens the document by using the WordprocessingDocument.Open method and indicating that the document should be opened for read-only access (the final false parameter). Given the open document, the code uses the MainDocumentPart property to navigate to the main document, and then the Document property of the part to retrieve a reference to the contents of the main document part. The code stores this reference in a variable named doc.

Using document = WordprocessingDocument.Open(fileName, False)
  Dim docPart = document.MainDocumentPart
  Dim doc = docPart.Document
  ' Code removed here…
End Using

using (var document = WordprocessingDocument.Open(fileName, false))
{
  var docPart = document.MainDocumentPart;
  var doc = docPart.Document;
  // Code removed here…
}

Finding the Table of Contents

The sample procedure continues by searching the descendants of the document for elements of type DocPartGallery that have a value of "Table of Contents". The code returns a reference to the first matching element (or a null reference, if no element matches). The code checks the HasValue property of the element's Val property to ensure that a value exists before it attempts to retrieve the value.

Dim block As OpenXmlElement = _
  doc.Descendants(Of DocPartGallery)().
  Where(Function(b) b.Val.HasValue AndAlso
          (b.Val.Value = "Table of Contents")).FirstOrDefault()

OpenXmlElement block = doc.Descendants<DocPartGallery>().
  Where(b => b.Val.HasValue && 
    (b.Val.Value == "Table of Contents")).FirstOrDefault();
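To make the shape of the markup concrete, the following sketch runs an equivalent search with LINQ to XML against a hand-written fragment instead of a real document. The fragment is a simplified assumption about what Word writes for a gallery-based table of contents (real documents carry more properties), and the GalleryMarkerDemo class and FindTocMarker method are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class GalleryMarkerDemo
{
  // WordprocessingML main namespace (the "w" prefix).
  static readonly XNamespace W =
    "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

  // Returns the first w:docPartGallery element whose w:val attribute is
  // "Table of Contents", or null if there is none.
  public static XElement FindTocMarker(XElement root)
  {
    return root.Descendants(W + "docPartGallery")
      .FirstOrDefault(g =>
        (string)g.Attribute(W + "val") == "Table of Contents");
  }

  public static void Main()
  {
    // Simplified, hand-written markup; not copied from a real document.
    string xml =
      "<w:body xmlns:w=\"" + W.NamespaceName + "\">" +
        "<w:sdt>" +
          "<w:sdtPr><w:docPartObj>" +
            "<w:docPartGallery w:val=\"Table of Contents\"/>" +
          "</w:docPartObj></w:sdtPr>" +
          "<w:sdtContent><w:p/></w:sdtContent>" +
        "</w:sdt>" +
      "</w:body>";

    XElement marker = FindTocMarker(XElement.Parse(xml));
    Console.WriteLine(marker != null ? "Marker found" : "No marker");
  }
}
```

In the fragment, the marker sits inside the properties of the w:sdt element, several levels below the element that holds the table of contents itself, so finding the marker is only the first half of the job.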

Finding the Parent and Creating the Return Value

When the code has found the table of contents marker element, it must traverse back up the hierarchy of elements until it finds the parent SdtBlock element that contains the complete table of contents information. The code loops while the block variable is not a null reference and is not an SdtBlock instance, setting block to refer to its parent node on each pass. When the loop ends, the code sets the TOC variable to a new element named TOC whose content is the OuterXml property of the block variable. The code does not verify that block is not a null reference after looping through the parent elements. If block is null at that point, reading OuterXml raises an exception, which indicates that the document does not have the expected structure.

If block IsNot Nothing Then
  ' Back up to the enclosing SdtBlock and return that XML.
  Do While (block IsNot Nothing) AndAlso (Not TypeOf block Is SdtBlock)
    block = block.Parent
  Loop
  TOC = New XElement("TOC", block.OuterXml)
End If

if (block != null)
{
  // Back up to the enclosing SdtBlock and return that XML.
  while ((block != null) && (!(block is SdtBlock)))
  {
    block = block.Parent;
  }
  TOC = new XElement("TOC", block.OuterXml);
}
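If you prefer a null reference over an exception when the marker has no enclosing SdtBlock, one possible variation is to ask the SDK for the nearest SdtBlock ancestor and test the result before reading OuterXml. The following is a sketch only. The TocGuard class and WrapEnclosingBlock method are hypothetical names, and the code assumes the DocumentFormat.OpenXml reference described earlier.

```csharp
using System.Linq;
using System.Xml.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

public static class TocGuard
{
  // Hypothetical variant of the parent lookup: returns null instead of
  // throwing when the marker has no enclosing SdtBlock.
  public static XElement WrapEnclosingBlock(OpenXmlElement marker)
  {
    if (marker == null) return null;

    // Ancestors<T>() walks up the parent chain, nearest ancestor first,
    // so FirstOrDefault() yields the closest enclosing SdtBlock (or null).
    SdtBlock sdt = marker.Ancestors<SdtBlock>().FirstOrDefault();
    return sdt == null ? null : new XElement("TOC", sdt.OuterXml);
  }
}
```

With this variation, a caller cannot distinguish "no table of contents marker" from "marker without an enclosing block", because both cases return null. Whether that trade-off is acceptable depends on how your application reports malformed documents.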

Sample Procedure

The following code example contains the complete sample procedure.

Public Function WDRetrieveTOC(ByVal fileName As String) As XElement
  Dim TOC As XElement = Nothing

  Using document = WordprocessingDocument.Open(fileName, False)
    Dim docPart = document.MainDocumentPart
    Dim doc = docPart.Document

    Dim block As OpenXmlElement = _
      doc.Descendants(Of DocPartGallery)().
      Where(Function(b) b.Val.HasValue AndAlso
              (b.Val.Value = "Table of Contents")).FirstOrDefault()
    If block IsNot Nothing Then
      ' Back up to the enclosing SdtBlock and return that XML.
      Do While (block IsNot Nothing) AndAlso
        (Not TypeOf block Is SdtBlock)
        block = block.Parent
      Loop
      TOC = New XElement("TOC", block.OuterXml)
    End If
  End Using
  Return TOC
End Function

public static XElement WDRetrieveTOC(string fileName)
{
  XElement TOC = null;

  using (var document = WordprocessingDocument.Open(fileName, false))
  {
    var docPart = document.MainDocumentPart;
    var doc = docPart.Document;

    OpenXmlElement block = doc.Descendants<DocPartGallery>().
      Where(b => b.Val.HasValue && 
        (b.Val.Value == "Table of Contents")).FirstOrDefault();

    if (block != null)
    {
      // Back up to the enclosing SdtBlock and return that XML.
      while ((block != null) && (!(block is SdtBlock)))
      {
        block = block.Parent;
      }
      TOC = new XElement("TOC", block.OuterXml);
    }
  }
  return TOC;
}

Read It

The sample that is included with this Visual How To describes code that retrieves the table of contents from a Word document. In order to use the sample, you must install the Open XML SDK 2.0, available from the link listed in the Explore It section. The sample also uses a modified version of code included as part of a set of code examples for the Open XML SDK 2.0. The Explore It section also includes a link to the full set of code examples, although you can use the sample without downloading and installing the code examples.

Be aware that Word provides at least four distinct means of inserting a table of contents into a document. Two of these techniques produce internal XML that works with the code shown in this Visual How To. Specifically, the following two techniques provide a table of contents nested within an SdtBlock element:

  • Use the built-in designs: On the References tab, in the Table of Contents group, click Table of Contents, and select one of the built-in designs from the gallery of choices (currently, three options).

  • Use the More Table of Contents from Office.com entry: On the References tab, in the Table of Contents group, click Table of Contents, and then click More Table of Contents from Office.com… near the bottom.

The following two techniques also produce a valid table of contents in the document, but they do not nest it inside an SdtBlock element, so the code shown in this Visual How To does not find it:

  • Use the Insert Table of Contents entry: On the References tab, in the Table of Contents group, click Table of Contents, and then click Insert Table of Contents near the bottom.

  • Use the Insert Fields option: On the Insert tab, in the Text group, click Quick Parts, and then click Field. Under Field Names, click TOC, optionally click Table of Contents, and then click OK.

These two options create free-standing table of contents elements, and would require more complex code to extract the table of contents from the document. If you import a document from a version of Word earlier than Word 2007, the code will also not work, because in those versions, Word never used the SdtBlock around the table of contents elements.
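As a first step toward handling free-standing tables of contents, an application could at least detect that one is present by looking for a TOC field instruction. The following sketch assumes that Word writes the field either as a complex field (whose instruction text appears in FieldCode elements) or as a simple field (a SimpleField element with an Instruction attribute). The TocFieldProbe class and its methods are hypothetical names, and the check only detects the field. It does not extract the entries.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

public static class TocFieldProbe
{
  // True if the instruction text is a TOC field, for example: TOC \o "1-3"
  static bool IsTocInstruction(string instruction)
  {
    if (instruction == null) return false;
    string text = instruction.Trim();
    return text.Equals("TOC", StringComparison.OrdinalIgnoreCase) ||
           text.StartsWith("TOC ", StringComparison.OrdinalIgnoreCase);
  }

  // Hypothetical helper: reports whether the element tree contains a TOC
  // field, written either as a complex field (w:instrText) or as a
  // simple field (w:fldSimple).
  public static bool ContainsTocField(OpenXmlElement root)
  {
    bool complex = root.Descendants<FieldCode>()
      .Any(code => IsTocInstruction(code.Text));
    bool simple = root.Descendants<SimpleField>()
      .Any(field => field.Instruction != null &&
                    IsTocInstruction(field.Instruction.Value));
    return complex || simple;
  }
}
```

You could pass the doc variable from the sample procedure as the root argument. Be aware that Word can split a field instruction across several runs, in which case this simple per-element test can miss the field. Treat the sketch as a starting point rather than a complete detector.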

To understand the sample code, it is useful to examine the contents of the document by using the Open XML SDK 2.0 Productivity Tool for Microsoft Office, which is included as part of the Open XML SDK 2.0. Figure 1 shows a sample document that contains a table of contents, opened in the tool. The sample code retrieves a reference to the Document part, and within that part, locates the DocPartGallery element (w:docPartGallery in the XML) that has the value Table of Contents.

Figure 1. Open the sample document and locate the element

The sample application demonstrates only a few of the properties and methods that the Open XML SDK 2.0 provides for working with document structure. For more information, see the documentation that is included with the Open XML SDK 2.0 Productivity Tool. Click the Open XML SDK Documentation tab in the lower-left corner of the application window, and search for the class that you want to study. Although the documentation does not currently include code examples, the sample shown here and the documentation together should enable you to modify the sample application successfully.

See It

Watch the video

> [!VIDEO https://www.microsoft.com/en-us/videoplayer/embed/15008383-542b-4505-864e-0fc43f2ccda1]

Length: 00:09:53

Grab the Code

Explore It

About the Author
Ken Getz is a senior consultant with MCW Technologies. He is coauthor of ASP.NET Developers Jumpstart (Addison-Wesley, 2002), Access Developer's Handbook (Sybex, 2001), and VBA Developer's Handbook, 2nd Edition (Sybex, 2001).