
Manipulating Word 2007 Files with the Open XML Format API (Part 1 of 3)

Office 2007

Summary: The Open XML Format SDK 1.0 is a library for manipulating Open XML Format files. This series of articles describes the Open XML object model code that you can use to access and manipulate Microsoft Office Word 2007 files. (26 printed pages)

The 2007 Microsoft Office system introduces new XML-based file formats called Open XML Formats. Microsoft Office Word 2007, Microsoft Office Excel 2007, and Microsoft Office PowerPoint 2007 all use these formats as their default file format. Open XML Formats are useful because they are an open standard and are based on well-known technologies: ZIP and XML. Microsoft provides a library for accessing these files as part of the .NET Framework 3.0 technologies, in the DocumentFormat.OpenXml namespace of the Open XML Format SDK 1.0. The members of the DocumentFormat.OpenXml API provide strongly typed part classes for manipulating Open XML documents. The API encapsulates many common tasks that developers perform on Open XML Format packages, so you can perform complex operations with just a few lines of code.

Note

You can find additional samples of manipulating Open XML Format files and references for each member contained in the Open XML object model in the 2007 Office System: Microsoft SDK for Open XML Formats.

The Open Packaging Conventions specification defines a set of XML files that contain the content and define the relationships for all of the parts stored in a single package. These packages combine the parts that make up the document files for the 2007 Microsoft Office programs that support the Open XML Format. The Open XML Format API discussed in this article lets you create packages and manipulate the files that make up the packages.

In the following code, you create an Office Open XML package as a Word 2007 document and then add content to the main document part in the package:

Public Sub CreateNewWordDocument(ByVal document As String)
   ' Create the package. The Using block closes it when the work is done.
   Using wordDoc As WordprocessingDocument = WordprocessingDocument.Create(document, WordprocessingDocumentType.Document)
      ' Set the content of the document so that Word can open it.
      Dim mainPart As MainDocumentPart = wordDoc.AddMainDocumentPart()
      SetMainDocumentContent(mainPart)
   End Using
End Sub

Public Sub SetMainDocumentContent(ByVal part As MainDocumentPart)
   Const docXml As String = "<?xml version=""1.0"" encoding=""UTF-8"" standalone=""yes""?>" & _
   "<w:document xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">" & _
   "<w:body><w:p><w:r><w:t>Hello world!</w:t></w:r></w:p></w:body></w:document>"
   ' Write the markup into the part as UTF-8 and close the stream afterward.
   Using stream1 As Stream = part.GetStream()
      Dim utf8encoder1 As UTF8Encoding = New UTF8Encoding()
      Dim buf() As Byte = utf8encoder1.GetBytes(docXml)
      stream1.Write(buf, 0, buf.Length)
   End Using
End Sub

In the first procedure, you pass in a parameter that represents the path and file name of the Word 2007 document to create. Then you create a WordprocessingDocument object that represents the package, based on the name of the input document. The remaining code is enclosed in a Using block, which ensures that the package is properly disposed of when the block ends.

Next, the AddMainDocumentPart method creates the main document part (document.xml) and adds it to the /word folder in the package. The SetMainDocumentContent procedure is then called with the main document part. This procedure uses a Stream object to populate the part with the WordprocessingML markup that defines the structure and content of the Word 2007 document:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
   <w:body>
      <w:p>
         <w:r>
            <w:t>Hello world!</w:t>
         </w:r>
      </w:p>
   </w:body>
</w:document>
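
To make the nesting of this markup concrete (document, body, paragraph, run, text), the following minimal sketch reads the text back out of it. It uses only System.Xml and works on a markup string rather than on a package; the module name ReadBackSketch and the function name GetDocumentText are hypothetical and are not part of the SDK.

```vb
Imports System
Imports System.Text
Imports System.Xml

Module ReadBackSketch
   Const WordmlNamespace As String = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

   ' Returns the concatenated content of every w:t element in the markup.
   Public Function GetDocumentText(ByVal docXml As String) As String
      Dim xdoc As New XmlDocument()
      xdoc.LoadXml(docXml)

      ' Register the w prefix so that XPath queries can use it.
      Dim nsManager As New XmlNamespaceManager(xdoc.NameTable)
      nsManager.AddNamespace("w", WordmlNamespace)

      Dim result As New StringBuilder()
      For Each node As XmlNode In xdoc.SelectNodes("//w:t", nsManager)
         result.Append(node.InnerText)
      Next
      Return result.ToString()
   End Function

   Sub Main()
      Dim docXml As String = "<w:document xmlns:w=""" & WordmlNamespace & """>" & _
         "<w:body><w:p><w:r><w:t>Hello world!</w:t></w:r></w:p></w:body></w:document>"
      Console.WriteLine(GetDocumentText(docXml))   ' prints: Hello world!
   End Sub
End Module
```

To run the same query against a real document, you could load the XmlDocument from the stream returned by MainDocumentPart.GetStream, as the later procedures in this article do.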

In the following code, you remove the revision nodes from the main document part, which signals to Word 2007 that the pending revisions in the document are accepted:

Public Sub WDAcceptRevisions(ByVal docName As String, ByVal authorName As String)
   ' Given a document name and an author name (leave author name blank to accept revisions
   ' for all authors), accept revisions.
   Const wordmlNamespace As String = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
   ' The Using block closes the package, even if an exception occurs.
   Using wdDoc As WordprocessingDocument = WordprocessingDocument.Open(docName, True)
      ' Manage namespaces to perform XPath queries.
      Dim nt As NameTable = New NameTable
      Dim nsManager As XmlNamespaceManager = New XmlNamespaceManager(nt)
      nsManager.AddNamespace("w", wordmlNamespace)
      ' Get the document part from the package.
      Dim xdoc As XmlDocument = New XmlDocument(nt)
      ' Load the XML in the part into an XmlDocument instance.
      xdoc.Load(wdDoc.MainDocumentPart.GetStream)
      ' Handle formatting changes.
      Dim nodes As XmlNodeList = Nothing
      If String.IsNullOrEmpty(authorName) Then
         nodes = xdoc.SelectNodes("//w:pPrChange", nsManager)
      Else
         nodes = xdoc.SelectNodes(String.Format("//w:pPrChange[@w:author='{0}']", authorName), nsManager)
      End If
      For Each node As System.Xml.XmlNode In nodes
         node.ParentNode.RemoveChild(node)
      Next
      ' Handle deletions.
      If String.IsNullOrEmpty(authorName) Then
         nodes = xdoc.SelectNodes("//w:del", nsManager)
      Else
         nodes = xdoc.SelectNodes(String.Format("//w:del[@w:author='{0}']", authorName), nsManager)
      End If
      For Each node As System.Xml.XmlNode In nodes
         node.ParentNode.RemoveChild(node)
      Next
      ' Handle insertions.
      If String.IsNullOrEmpty(authorName) Then
         nodes = xdoc.SelectNodes("//w:ins", nsManager)
      Else
         nodes = xdoc.SelectNodes(String.Format("//w:ins[@w:author='{0}']", authorName), nsManager)
      End If
      For Each node As System.Xml.XmlNode In nodes
         ' Found an insertion. Promote its runs to the same level
         ' as the w:ins node, and then delete the node.
         Dim childNodes As XmlNodeList
         childNodes = node.SelectNodes(".//w:r", nsManager)
         For Each childNode As System.Xml.XmlNode In childNodes
            If (childNode Is node.FirstChild) Then
               node.ParentNode.InsertAfter(childNode, node)
            Else
               node.ParentNode.InsertAfter(childNode, node.NextSibling)
            End If
            ' Remove the modification id from the promoted run
            ' so Word can merge it on the next save.
            childNode.Attributes.RemoveNamedItem("w:rsidR")
            childNode.Attributes.RemoveNamedItem("w:rsidRPr")
         Next
         node.ParentNode.RemoveChild(node)
      Next
      ' Save the document XML back to its part.
      xdoc.Save(wdDoc.MainDocumentPart.GetStream(FileMode.Create))
   End Using
End Sub

In this procedure, you first pass in parameters representing the path to and the name of the source Word 2007 document and, optionally, the name of the document author.

Note

If you do not specify an author name, you still need to pass in an empty string.

You open the document by using the Open method of the WordprocessingDocument object. Next, you set up a namespace manager by using the XmlNamespaceManager object and register the WordprocessingML namespace under the prefix w. The contents of the main document part (/word/document.xml) are loaded into a memory-resident XML document. If you passed in an author name, an XPath expression with an author predicate selects only the w:pPrChange nodes for revisions made by that author. If no author name was passed to the procedure, revisions by every author are affected, and all w:pPrChange nodes (if any exist) are selected by the following statement:

nodes = xdoc.SelectNodes("//w:pPrChange", nsManager)

These nodes denote pending formatting changes to the document. The same selection occurs for deletions and insertions with the node names w:del and w:ins, respectively. Each formatting-change or deletion node that is found is removed from its parent with the statement:

node.ParentNode.RemoveChild(node)

To understand how this works, consider the following example. Assume that while reviewing a document, you highlight a line of text and apply a Title style to it. Doing this generates the following WordprocessingML markup in the main document part:

<w:pPr>
   <w:pStyle w:val="Title" />
   <w:pPrChange w:id="0" w:author="Nancy Davolio" w:date="2007-07-03T08:22:00Z">
      <w:pPr />
   </w:pPrChange>
</w:pPr>
<w:r w:rsidRPr="00655EFA">
   <w:t>Gettysburg Address</w:t>
</w:r>

The w:pStyle element specifies that this is a style change; in this case, the change sets the highlighted text to the Title style. The w:pPrChange element identifies the author and date of the revision. This element also signals to Word 2007 that the change is pending. The w:r and w:t elements designate the run and the text, respectively, that contain the highlighted text; in this case, the phrase Gettysburg Address. To accept the change while reviewing the document in Word 2007, you highlight the text, click Accept on the Review tab, and then click Accept and Move Next. Word 2007 implements the change and removes the w:pPrChange element. You can emulate this behavior in code: removing the w:pPrChange element has the same result as accepting the revision. This is exactly what the following statement does:

node.ParentNode.RemoveChild(node)

Here, the current node is the w:pPrChange element. To remove the current node, you get its parent (the w:pPr element) and call the parent's RemoveChild method. Removing the current node in this manner is the same as accepting the change. A similar process is performed for deletions with the w:del element.
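
The removal step can be tried in isolation on the sample markup. The following is a minimal sketch that uses only System.Xml and an in-memory document instead of a package; the module name, the function name AcceptFormattingChanges, and the enclosing w:p element are assumptions made for illustration. It copies the matching nodes into a list before removing them, because the list returned by SelectNodes is only guaranteed to be valid while the underlying document remains unchanged.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Xml

Module AcceptFormattingSketch
   Const WordmlNamespace As String = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

   ' Removes w:pPrChange elements (optionally for one author only)
   ' and returns the number of elements removed.
   Public Function AcceptFormattingChanges(ByVal xdoc As XmlDocument, ByVal authorName As String) As Integer
      Dim nsManager As New XmlNamespaceManager(xdoc.NameTable)
      nsManager.AddNamespace("w", WordmlNamespace)

      Dim query As String = "//w:pPrChange"
      If Not String.IsNullOrEmpty(authorName) Then
         query = String.Format("//w:pPrChange[@w:author='{0}']", authorName)
      End If

      ' Copy the matches first so that removing nodes cannot disturb the enumeration.
      Dim matches As New List(Of XmlNode)()
      For Each node As XmlNode In xdoc.SelectNodes(query, nsManager)
         matches.Add(node)
      Next
      For Each node As XmlNode In matches
         node.ParentNode.RemoveChild(node)   ' ParentNode is the w:pPr element
      Next
      Return matches.Count
   End Function

   Sub Main()
      Dim xdoc As New XmlDocument()
      xdoc.LoadXml( _
         "<w:p xmlns:w=""" & WordmlNamespace & """>" & _
         "<w:pPr><w:pStyle w:val=""Title"" />" & _
         "<w:pPrChange w:id=""0"" w:author=""Nancy Davolio"" w:date=""2007-07-03T08:22:00Z"">" & _
         "<w:pPr /></w:pPrChange></w:pPr>" & _
         "<w:r><w:t>Gettysburg Address</w:t></w:r></w:p>")
      Console.WriteLine(AcceptFormattingChanges(xdoc, ""))   ' prints: 1
      Console.WriteLine(xdoc.OuterXml)                       ' w:pStyle remains, w:pPrChange is gone
   End Sub
End Module
```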

Insertions are more complicated than other formatting changes because the w:ins element may be a container for one or more insertions. For example, you may insert text and spaces in the same operation, as shown by the following WordprocessingML markup:

<w:ins w:id="12" w:author="Nancy Davolio" w:date="2007-07-03T08:23:00Z">
   <w:r w:rsidR="00655EFA">
      <w:t>word</w:t> 
   </w:r>
   <w:proofErr w:type="spellEnd" /> 
   <w:r w:rsidR="00655EFA">
      <w:t xml:space="preserve"></w:t> 
   </w:r>
</w:ins>

In this segment, the word word is inserted into the document, followed by a blank space. Both of these w:t text elements are contained within w:r run elements which, in turn, are contained in the same w:ins insertion element. In the procedure, these runs (descendants of the w:ins node) are promoted to the same level as the w:ins node. Then the w:ins node is deleted, which has the effect of accepting the revisions.

nodes = xdoc.SelectNodes("//w:ins", nsManager)
...
For Each childNode As System.Xml.XmlNode In childNodes
   If (childNode Is node.FirstChild) Then
      node.ParentNode.InsertAfter(childNode, node)
   Else
      node.ParentNode.InsertAfter(childNode, node.NextSibling)
   End If
Next
node.ParentNode.RemoveChild(node)

The difficulty in this operation is keeping the promoted runs in their original order. Continuing the previous example, the w:ins element contains a run for the word word followed by a run for the blank space. If each run were simply inserted immediately after the w:ins node, the run for the word would be placed first, and the run for the space would then be placed between the w:ins node and the word. The space would therefore precede the word, which is the opposite of what was intended. To avoid this problem, the code inserts the first run immediately after the w:ins node and inserts each subsequent run after node.NextSibling, which by then is the first promoted run.
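
An alternative way to keep the order, offered here only as a sketch and not as the article's method, is to move each child in front of the w:ins node with InsertBefore. Because every move takes the current first child out of w:ins, the children arrive in document order no matter how many there are. Note two differences from the procedure above: this version promotes all child nodes of w:ins (including elements such as w:proofErr), not only the runs, and it works on an in-memory XmlDocument. The names AcceptInsertionsSketch and AcceptInsertions are hypothetical.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Xml

Module AcceptInsertionsSketch
   Const WordmlNamespace As String = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

   ' Replaces each w:ins element with its child nodes, keeping their order.
   Public Sub AcceptInsertions(ByVal xdoc As XmlDocument)
      Dim nsManager As New XmlNamespaceManager(xdoc.NameTable)
      nsManager.AddNamespace("w", WordmlNamespace)

      ' Snapshot the matches before changing the tree.
      Dim insertions As New List(Of XmlNode)()
      For Each node As XmlNode In xdoc.SelectNodes("//w:ins", nsManager)
         insertions.Add(node)
      Next

      For Each ins As XmlNode In insertions
         ' InsertBefore moves the child out of w:ins, so FirstChild advances
         ' on every pass and the children land in their original order.
         While ins.HasChildNodes
            ins.ParentNode.InsertBefore(ins.FirstChild, ins)
         End While
         ins.ParentNode.RemoveChild(ins)
      Next
   End Sub

   Sub Main()
      Dim xdoc As New XmlDocument()
      xdoc.LoadXml( _
         "<w:p xmlns:w=""" & WordmlNamespace & """>" & _
         "<w:r><w:t>Start </w:t></w:r>" & _
         "<w:ins w:id=""12"" w:author=""Nancy Davolio"">" & _
         "<w:r><w:t>one </w:t></w:r><w:r><w:t>two </w:t></w:r><w:r><w:t>three </w:t></w:r>" & _
         "</w:ins>" & _
         "<w:r><w:t>end</w:t></w:r></w:p>")
      AcceptInsertions(xdoc)
      Console.WriteLine(xdoc.DocumentElement.InnerText)   ' prints: Start one two three end
   End Sub
End Module
```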

After the runs are promoted in the correct order, the code deletes the w:ins node and saves the updated XML back to the document part. When the updated document is opened, the absence of the revision markup signals to Word 2007 that the revisions are accepted.

In the following code, you remove the existing header part in a document and replace it with WordprocessingML markup to change the document's header:

Public Sub WDAddHeader(ByVal docName As String, ByVal headerContent As Stream)
   ' Given a document name, and a stream containing valid header content,
   ' add the stream content as a header in the document and remove the original headers
   Const wordmlNamespace As String = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
   Const relationshipNamespace As String = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
   Dim wdDoc As WordprocessingDocument = WordprocessingDocument.Open(docName, True)

   Using (wdDoc)
      ' Delete existing header part
      wdDoc.MainDocumentPart.DeleteParts(wdDoc.MainDocumentPart.HeaderParts)

      ' Create a new header part.
      Dim headerPart As HeaderPart = wdDoc.MainDocumentPart.AddNewPart(Of HeaderPart)()
      Dim rId As String = wdDoc.MainDocumentPart.GetIdOfPart(headerPart)
      Dim headerDoc As XmlDocument = New XmlDocument
      headerContent.Position = 0
      headerDoc.Load(headerContent)

      ' Write the header out to its part.
      headerDoc.Save(headerPart.GetStream)

      ' Manage namespaces to perform Xml XPath queries.
      Dim nt As NameTable = New NameTable
      Dim nsManager As XmlNamespaceManager = New XmlNamespaceManager(nt)
      nsManager.AddNamespace("w", wordmlNamespace)

      ' Get the document part from the package.
      ' Load the XML in the part into an XmlDocument instance.
      Dim xdoc As XmlDocument = New XmlDocument(nt)
      xdoc.Load(wdDoc.MainDocumentPart.GetStream)

      ' Find the node containing the document layout.
      Dim targetNodes As XmlNodeList = xdoc.SelectNodes("//w:sectPr", nsManager)
      For Each targetNode As XmlNode In targetNodes
         ' Delete any existing references to headers.
         Dim headerNodes As XmlNodeList = targetNode.SelectNodes("./w:headerReference", nsManager)
         For Each headerNode As System.Xml.XmlNode In headerNodes
            targetNode.RemoveChild(headerNode)
         Next
         ' Create the new header reference node.
         Dim node As XmlElement = xdoc.CreateElement("w:headerReference", wordmlNamespace)
         ' Point the reference at the new header part through its relationship ID.
         Dim attr As XmlAttribute = xdoc.CreateAttribute("r:id", relationshipNamespace)
         attr.Value = rId
         node.Attributes.Append(attr)
         targetNode.InsertBefore(node, targetNode.FirstChild)
      Next

      ' Save the document XML back to its part.
      xdoc.Save(wdDoc.MainDocumentPart.GetStream(FileMode.Create))
   End Using
End Sub

First, the WDAddHeader method is called passing in a reference to the Word 2007 document and a Stream object containing the replacement header markup and data. Then you set up the WordprocessingDocument object representing the Open XML File Format package and the namespace manager for the WordprocessingML markup. Next, you delete the existing header parts in the document and then create a blank header part with the following code:

wdDoc.MainDocumentPart.DeleteParts(wdDoc.MainDocumentPart.HeaderParts)

' Create a new header part.
Dim headerPart As HeaderPart = wdDoc.MainDocumentPart.AddNewPart(Of HeaderPart)()
Dim headerDoc As XmlDocument = New XmlDocument
headerContent.Position = 0
headerDoc.Load(headerContent)

' Write the header out to its part.
headerDoc.Save(headerPart.GetStream)

This code loads the replacement header markup from the input stream into a memory-resident XML document and saves it to the new header part. The remaining code searches the main document part for the references to the header parts that you just deleted. It does this by using XPath queries to find the w:headerReference elements in each w:sectPr element, deletes them, and then inserts a reference to the new header part that carries the part's relationship ID. Finally, the updated WordprocessingML markup is saved back to the main document part.
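
The procedure expects the caller to supply the header markup as a stream. One way to produce such a stream is sketched below, assuming a header that consists of a single paragraph of text inside a w:hdr root element. The module name and the function name BuildHeaderContent are hypothetical; only System.Xml and System.IO are used.

```vb
Imports System
Imports System.IO
Imports System.Xml

Module HeaderContentSketch
   Const WordmlNamespace As String = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

   ' Builds a stream that holds a w:hdr element with one paragraph of text.
   Public Function BuildHeaderContent(ByVal headerText As String) As Stream
      Dim headerDoc As New XmlDocument()
      Dim hdr As XmlElement = headerDoc.CreateElement("w", "hdr", WordmlNamespace)
      Dim para As XmlElement = headerDoc.CreateElement("w", "p", WordmlNamespace)
      Dim run As XmlElement = headerDoc.CreateElement("w", "r", WordmlNamespace)
      Dim txt As XmlElement = headerDoc.CreateElement("w", "t", WordmlNamespace)
      txt.InnerText = headerText   ' characters such as & and < are escaped on save
      run.AppendChild(txt)
      para.AppendChild(run)
      hdr.AppendChild(para)
      headerDoc.AppendChild(hdr)

      ' Save into memory and rewind so that the consumer reads from the start.
      Dim content As New MemoryStream()
      headerDoc.Save(content)
      content.Position = 0
      Return content
   End Function

   Sub Main()
      Using content As Stream = BuildHeaderContent("Draft: do not distribute")
         ' Load the stream back to confirm that it holds well-formed header markup.
         Dim check As New XmlDocument()
         check.Load(content)
         Console.WriteLine(check.DocumentElement.Name)   ' prints: w:hdr
      End Using
   End Sub
End Module
```

A stream built this way could then be passed as the headerContent argument of WDAddHeader, which resets the stream position itself before loading the markup.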

In the following code, you delete the comments part and all references to the part from a Word 2007 document:

Public Sub WDDeleteComments(ByVal docName As String)
   ' Given a document name, remove all comments.
   Const wordmlNamespace As String = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
   ' The Using block closes the package, even if an exception occurs.
   Using wdDoc As WordprocessingDocument = WordprocessingDocument.Open(docName, True)
      If (Not (wdDoc.MainDocumentPart.CommentsPart) Is Nothing) Then
         wdDoc.MainDocumentPart.DeletePart(wdDoc.MainDocumentPart.CommentsPart)

         ' Manage namespaces to perform XPath queries.
         Dim nt As NameTable = New NameTable
         Dim nsManager As XmlNamespaceManager = New XmlNamespaceManager(nt)
         nsManager.AddNamespace("w", wordmlNamespace)

         ' Get the document part from the package.
         ' Load the XML in the part into an XmlDocument instance.
         Dim xdoc As XmlDocument = New XmlDocument(nt)
         xdoc.Load(wdDoc.MainDocumentPart.GetStream)

         ' Retrieve a list of nodes representing the comment start elements, and delete them all.
         Dim nodes As XmlNodeList = xdoc.SelectNodes("//w:commentRangeStart", nsManager)
         For Each node As System.Xml.XmlNode In nodes
            node.ParentNode.RemoveChild(node)
         Next

         ' Retrieve a list of nodes representing the comment end elements, and delete them all.
         nodes = xdoc.SelectNodes("//w:commentRangeEnd", nsManager)
         For Each node As System.Xml.XmlNode In nodes
            node.ParentNode.RemoveChild(node)
         Next

         ' Retrieve a list of nodes representing the comment reference elements, and delete them all.
         nodes = xdoc.SelectNodes("//w:r/w:commentReference", nsManager)
         For Each node As System.Xml.XmlNode In nodes
            node.ParentNode.RemoveChild(node)
         Next

         ' Save the document XML back to its part.
         xdoc.Save(wdDoc.MainDocumentPart.GetStream(FileMode.Create))
      End If
   End Using
End Sub

First, the WDDeleteComments method is called, passing in a reference to the Word 2007 document. Then a WordprocessingDocument object is created from the input document, representing the Open XML File Format package. Next, the code tests whether the package contains a comments part and, if so, deletes it. A namespace manager is created to set up the XPath queries. The queries, using the //w:commentRangeStart and //w:commentRangeEnd XPath expressions, search the main document part for the starting and ending comment nodes, respectively. After each list is compiled, every node in it is removed.

Dim nodes As XmlNodeList = xdoc.SelectNodes("//w:commentRangeStart", nsManager)
For Each node As System.Xml.XmlNode In nodes
   node.ParentNode.RemoveChild(node)
Next

' Retrieve a list of nodes representing the comment end elements, and delete them all.
nodes = xdoc.SelectNodes("//w:commentRangeEnd", nsManager)
For Each node As System.Xml.XmlNode In nodes
   node.ParentNode.RemoveChild(node)
Next

The same process is used to remove the nodes that reference the comments by using the //w:r/w:commentReference XPath expression. And finally, the updated WordprocessingML markup is saved back to the main document part.
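
Because the three loops differ only in their XPath expression, they could also be folded into a single query with the XPath union operator (|). The following sketch shows that variation on an in-memory document; it is an illustration, not the article's procedure, and the names RemoveCommentMarkupSketch and RemoveCommentMarkup are hypothetical. As in the procedure above, a run that held only a comment reference is left behind as an empty w:r element.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Xml

Module RemoveCommentMarkupSketch
   Const WordmlNamespace As String = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

   ' Removes comment range markers and comment references in one pass
   ' and returns the number of elements removed.
   Public Function RemoveCommentMarkup(ByVal xdoc As XmlDocument) As Integer
      Dim nsManager As New XmlNamespaceManager(xdoc.NameTable)
      nsManager.AddNamespace("w", WordmlNamespace)

      ' The union operator combines the three node sets into one result.
      Const query As String = _
         "//w:commentRangeStart | //w:commentRangeEnd | //w:r/w:commentReference"

      ' Snapshot the matches before changing the tree.
      Dim matches As New List(Of XmlNode)()
      For Each node As XmlNode In xdoc.SelectNodes(query, nsManager)
         matches.Add(node)
      Next
      For Each node As XmlNode In matches
         node.ParentNode.RemoveChild(node)
      Next
      Return matches.Count
   End Function

   Sub Main()
      Dim xdoc As New XmlDocument()
      xdoc.LoadXml( _
         "<w:p xmlns:w=""" & WordmlNamespace & """>" & _
         "<w:commentRangeStart w:id=""0"" />" & _
         "<w:r><w:t>Reviewed text</w:t></w:r>" & _
         "<w:commentRangeEnd w:id=""0"" />" & _
         "<w:r><w:commentReference w:id=""0"" /></w:r></w:p>")
      Console.WriteLine(RemoveCommentMarkup(xdoc))   ' prints: 3
   End Sub
End Module
```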

In the following code, you delete the headers and footers from a Word 2007 document:

Public Sub WDRemoveHeadersFooters(ByVal docName As String)
   ' Given a document name, remove all headers and footers.
   Const wordmlNamespace As String = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
   Dim wdDoc As WordprocessingDocument = WordprocessingDocument.Open(docName, true)
   Using (wdDoc)
      If (wdDoc.MainDocumentPart.GetPartsCountOfType(Of HeaderPart)() > 0) OrElse (wdDoc.MainDocumentPart.GetPartsCountOfType(Of FooterPart)() > 0) Then
         wdDoc.MainDocumentPart.DeleteParts(wdDoc.MainDocumentPart.HeaderParts)
         wdDoc.MainDocumentPart.DeleteParts(wdDoc.MainDocumentPart.FooterParts)

         ' Manage namespaces to perform XPath queries.
         Dim nt As NameTable = New NameTable
         Dim nsManager As XmlNamespaceManager = New XmlNamespaceManager(nt)
         nsManager.AddNamespace("w", wordmlNamespace)

         ' Get the document part from the package.
         ' Load the XML in the part into an XmlDocument instance.
         Dim xdoc As XmlDocument = New XmlDocument(nt)
         xdoc.Load(wdDoc.MainDocumentPart.GetStream)

         ' Find the node containing the document layout.
         Dim layoutNodes As XmlNodeList = xdoc.SelectNodes("//w:sectPr", nsManager)
         For Each layoutNode As System.Xml.XmlNode In layoutNodes
            ' Delete any existing references to headers.
            Dim headerNodes As XmlNodeList = layoutNode.SelectNodes("./w:headerReference", nsManager)
            For Each headerNode As System.Xml.XmlNode In headerNodes
               layoutNode.RemoveChild(headerNode)
            Next

            ' Delete any existing references to footers.
            Dim footerNodes As XmlNodeList = layoutNode.SelectNodes("./w:footerReference", nsManager)
            For Each footerNode As System.Xml.XmlNode In footerNodes
               layoutNode.RemoveChild(footerNode)
            Next
         Next

         ' Save the document XML back to its part.
         xdoc.Save(wdDoc.MainDocumentPart.GetStream(FileMode.Create))
      End If
   End Using
End Sub

First, the WDRemoveHeadersFooters method is called, passing in a reference to the Word 2007 document. Then a WordprocessingDocument object is created from the input document, representing the Open XML File Format package. Next, the code tests whether the package contains header or footer parts and, if so, deletes them. A namespace manager is then created to set up the XPath queries.

Then you create a memory-resident XML document as a temporary holder and load it with the markup and data from the main document part. The remaining code searches for the nodes that reference the header and footer parts that you just deleted, and deletes them as well.

Dim layoutNodes As XmlNodeList = xdoc.SelectNodes("//w:sectPr", nsManager)
For Each layoutNode As System.Xml.XmlNode In layoutNodes
   ' Delete any existing references to headers.
   Dim headerNodes As XmlNodeList = layoutNode.SelectNodes("./w:headerReference", nsManager)
   For Each headerNode As System.Xml.XmlNode In headerNodes
      layoutNode.RemoveChild(headerNode)
   Next

   ' Delete any existing references to footers.
   Dim footerNodes As XmlNodeList = layoutNode.SelectNodes("./w:footerReference", nsManager)
   For Each footerNode As System.Xml.XmlNode In footerNodes
      layoutNode.RemoveChild(footerNode)
   Next
Next

Finally, the updated WordprocessingML markup is saved back to the main document part.

As this article demonstrates, working with Word 2007 files is much easier with the Microsoft SDK for Open XML Formats Technology Preview. In part two of this series of articles, I describe other common tasks that you can perform with the Open XML Formats SDK.
