Manipulating Word 2007 Files with the Open XML Format API (Part 1 of 3)

Office 2007

Summary: The Open XML Format SDK 1.0 is a library for manipulating Open XML Format files. This series of articles describes the Open XML object model code that you can use to access and manipulate Microsoft Office Word 2007 files. (26 printed pages)

The 2007 Microsoft Office system introduces new XML-based file formats called Open XML Formats. Microsoft Office Word 2007, Microsoft Office Excel 2007, and Microsoft Office PowerPoint 2007 all use these formats as their default file format. Open XML Formats are useful because they are an open standard and are based on well-known technologies: ZIP and XML. Microsoft provides a library for accessing these files in the Open XML Format SDK 1.0, which builds on the .NET Framework 3.0 technologies. The library's members are contained in the DocumentFormat.OpenXml namespace and provide strongly typed part classes for manipulating Open XML documents. The Open XML Format API encapsulates many common tasks that developers perform on Open XML Format packages, so you can perform complex operations with just a few lines of code.

Note

You can find additional samples of manipulating Open XML Format files and references for each member contained in the Open XML object model in the 2007 Office System: Microsoft SDK for Open XML Formats.

The Open Packaging Conventions specification defines a set of XML files that contain the content and define the relationships for all of the parts stored in a single package. These packages combine the parts that make up the document files for the 2007 Microsoft Office programs that support the Open XML Format. The Open XML Format API discussed in this article lets you create packages and manipulate the files inside them.

In the following code, you create an Office Open XML package as a Word 2007 document and then add content to the main document part in the package:

// How to: Create a new package as a Word document.
public static void CreateNewWordDocument(string document)
{
   using (WordprocessingDocument wordDoc = WordprocessingDocument.Create(document, WordprocessingDocumentType.Document))
   {
      // Set the content of the document so that Word can open it.
      MainDocumentPart mainPart = wordDoc.AddMainDocumentPart();

      SetMainDocumentContent(mainPart);
   }
}

// Set content of MainDocumentPart.
public static void SetMainDocumentContent(MainDocumentPart part)
{
   const string docXml =
@"<?xml version=""1.0"" encoding=""UTF-8"" standalone=""yes""?> 
<w:document xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">
    <w:body><w:p><w:r><w:t>Hello world!</w:t></w:r></w:p></w:body>
</w:document>";

    using (Stream stream = part.GetStream())
    {
       byte[] buf = (new UTF8Encoding()).GetBytes(docXml);
       stream.Write(buf, 0, buf.Length);
    }
}

In the first procedure, you pass in a parameter that represents the path to and name of a Word 2007 document. Then you create a WordprocessingDocument object representing the package based on the name of the input document. The remaining code is encapsulated in a using statement that allows you to ensure that any objects created are properly disposed of.

Next, the AddMainDocumentPart method creates the main document part (document.xml) and adds it to the /word folder in the package. The SetMainDocumentContent procedure is then called with the main document part. This procedure uses a Stream object to write XML markup into the part. The markup defines the structure and content of the Word 2007 document:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
   <w:body>
      <w:p>
         <w:r>
            <w:t>Hello world!</w:t>
         </w:r>
      </w:p>
   </w:body>
</w:document>
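Before markup like this is written into a part, it can be checked by loading it into an XmlDocument and querying it with a namespace-aware XPath expression. The following is a minimal sketch, not part of the SDK: the MarkupInspector class and its ReadRunText method are hypothetical names, and the markup is passed in as a string instead of being read from a package.

```csharp
using System.Text;
using System.Xml;

public static class MarkupInspector
{
    private const string WordmlNamespace =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the concatenated text of every w:t element in the markup.
    public static string ReadRunText(string documentXml)
    {
        XmlDocument xdoc = new XmlDocument();
        xdoc.LoadXml(documentXml);

        // XPath has no notion of a default namespace, so the w prefix
        // must be mapped explicitly before it can be used in a query.
        XmlNamespaceManager nsManager = new XmlNamespaceManager(xdoc.NameTable);
        nsManager.AddNamespace("w", WordmlNamespace);

        StringBuilder text = new StringBuilder();
        foreach (XmlNode textNode in xdoc.SelectNodes("//w:t", nsManager))
        {
            text.Append(textNode.InnerText);
        }
        return text.ToString();
    }
}
```

Passing the Hello world markup shown above to ReadRunText should return the text of its single run.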

In the following code, you accept the pending revisions in a document by removing the revision-tracking nodes from the main document part:

public static void WDAcceptRevisions(string docName, string authorName)
{
   // Given a document name and an author name (leave author name blank to accept revisions
   // for all authors), accept revisions.

   const string wordmlNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

   using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(docName, true))
   {
       // Manage namespaces to perform Xml XPath queries.
       NameTable nt = new NameTable();
       XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
       nsManager.AddNamespace("w", wordmlNamespace);

       // Get the document part from the package.
       XmlDocument xdoc = new XmlDocument(nt);
       // Load the XML in the part into an XmlDocument instance.
       xdoc.Load(wdDoc.MainDocumentPart.GetStream());

       // Handle formatting changes.
       XmlNodeList nodes = null;
       if (string.IsNullOrEmpty(authorName))
       {
           nodes = xdoc.SelectNodes("//w:pPrChange", nsManager);
       }
       else
       {
          nodes = xdoc.SelectNodes(string.Format("//w:pPrChange[@w:author='{0}']", authorName), nsManager);
       }
       foreach (System.Xml.XmlNode node in nodes)
       {
          node.ParentNode.RemoveChild(node);
       }

       // Handle deletions.
       if (string.IsNullOrEmpty(authorName))
       {
          nodes = xdoc.SelectNodes("//w:del", nsManager);
       }
       else
       {
          nodes = xdoc.SelectNodes(string.Format("//w:del[@w:author='{0}']", authorName), nsManager);
       }

       foreach (System.Xml.XmlNode node in nodes)
       {
           node.ParentNode.RemoveChild(node);
       }


       // Handle insertions.
       if (string.IsNullOrEmpty(authorName))
       {
          nodes = xdoc.SelectNodes("//w:ins", nsManager);
       }
       else
       {
          nodes = xdoc.SelectNodes(string.Format("//w:ins[@w:author='{0}']", authorName), nsManager);
       }

       foreach (System.Xml.XmlNode node in nodes)
       {
          // One or more inserted runs were found.
          // Promote them to the same level as the w:ins node, and then
          // delete the w:ins node.
          XmlNodeList childNodes;
          childNodes = node.SelectNodes(".//w:r", nsManager);
          foreach (System.Xml.XmlNode childNode in childNodes)
          {
             if (childNode == node.FirstChild)
             {
                 node.ParentNode.InsertAfter(childNode, node);
             }
             else
             {
                 node.ParentNode.InsertAfter(childNode, node.NextSibling);
             }
          }
          node.ParentNode.RemoveChild(node);

          // Remove the modification id from the node 
          // so Word can merge it on the next save.
          node.Attributes.RemoveNamedItem("w:rsidR");
          node.Attributes.RemoveNamedItem("w:rsidRPr");
       }

       // Save the document XML back to its part.
       xdoc.Save(wdDoc.MainDocumentPart.GetStream(FileMode.Create));
   }
}

In this procedure, you first pass in parameters representing the path to and the name of the source Word 2007 document and, optionally, the name of the author whose revisions you want to accept.

Note

If you do not specify an author name, you still need to pass in an empty string.
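Because the author name is placed inside an XPath string literal, a name that contains an apostrophe (for example, O'Brien) would produce an invalid query if it were inserted as is. The following sketch shows one way to build the filter defensively. The XPathText class and both of its methods are hypothetical helpers, not part of the SDK or of the article's code.

```csharp
public static class XPathText
{
    // Wraps a value as an XPath 1.0 string literal. XPath 1.0 has no escape
    // character, so a value that contains both quote styles is assembled
    // with concat().
    public static string ToLiteral(string value)
    {
        if (!value.Contains("'"))
        {
            return "'" + value + "'";
        }
        if (!value.Contains("\""))
        {
            return "\"" + value + "\"";
        }
        // Split on apostrophes and join the pieces with a quoted apostrophe.
        string[] pieces = value.Split('\'');
        return "concat('" + string.Join("', \"'\", '", pieces) + "')";
    }

    // Builds the query for a revision element such as pPrChange, del, or ins.
    // An empty author name selects the revisions of every author.
    public static string ForRevisions(string elementName, string authorName)
    {
        if (string.IsNullOrEmpty(authorName))
        {
            return "//w:" + elementName;
        }
        return "//w:" + elementName + "[@w:author=" + ToLiteral(authorName) + "]";
    }
}
```

The result of ForRevisions can be passed to SelectNodes together with the namespace manager, in place of the string.Format calls.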

You open the document by using the Open method of the WordprocessingDocument object. Next, you set up a namespace manager by using the XmlNamespaceManager object and map the w prefix to the WordprocessingML namespace. The contents of the main document part (/word/document.xml) are loaded into a memory-resident XML document. If you passed in an author name, an XPath expression selects only the w:pPrChange nodes whose w:author attribute matches that name. If no author name was passed to the procedure, the w:pPrChange nodes of every author are selected by the following statement:

nodes = xdoc.SelectNodes("//w:pPrChange", nsManager);

These nodes denote pending formatting changes to the document. The same selection occurs for deletions and insertions with the node names w:del and w:ins, respectively. Each w:pPrChange or w:del node that is found is removed from its parent with the statement:

node.ParentNode.RemoveChild(node);

To understand how this works, consider the following example. Assume that while reviewing a document, you highlight a line of text and apply a Title style to it. Doing this generates the following WordprocessingML markup in the main document part:

<w:pPr>
   <w:pStyle w:val="Title" />
   <w:pPrChange w:id="0" w:author="Nancy Davolio" w:date="2007-07-03T08:22:00Z">
      <w:pPr />
   </w:pPrChange>
</w:pPr>
<w:r w:rsidRPr="00655EFA">
  <w:t>Gettysburg Address</w:t> 
</w:r>

The w:pStyle element specifies the paragraph style; in this case, the change sets the highlighted text to the Title style. The w:pPrChange element identifies the author and date of the revision. This element also signals to Word 2007 that the change is pending. The w:r and w:t elements designate the run and the text, respectively, that contain the highlighted text; in this case, the phrase Gettysburg Address. To accept the change while reviewing the document in Word 2007, you highlight the text, click Accept on the Review tab, and then click Accept and Move to Next. Word 2007 implements the change and removes the w:pPrChange element. You can emulate this behavior using code. Using code to remove the w:pPrChange element has the same result as accepting the revision. This is exactly what the following statement does:

node.ParentNode.RemoveChild(node);

Here, the current node is the w:pPrChange element. To remove the current node, you take its parent (the w:pPr element) and call the RemoveChild method on it. The sibling w:pStyle element stays in place, so the Title style remains applied. Removing the current node in this manner is the same as accepting the change. A similar process is performed for deletions with the w:del element.
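The removal step can be tried on an in-memory document without opening a package. The sketch below is an illustration under stated assumptions: the FormattingRevisions class and AcceptFormattingChanges method are hypothetical names, and the matches are copied into a list before removal as a defensive choice that the article's code does not make.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class FormattingRevisions
{
    private const string WordmlNamespace =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Removes every w:pPrChange element and returns how many were removed.
    public static int AcceptFormattingChanges(XmlDocument xdoc)
    {
        XmlNamespaceManager nsManager = new XmlNamespaceManager(xdoc.NameTable);
        nsManager.AddNamespace("w", WordmlNamespace);

        // Copy the matches first so that removal cannot disturb the iteration.
        List<XmlNode> pending = new List<XmlNode>();
        foreach (XmlNode node in xdoc.SelectNodes("//w:pPrChange", nsManager))
        {
            pending.Add(node);
        }

        foreach (XmlNode node in pending)
        {
            // The parent is w:pPr; the sibling w:pStyle is left untouched.
            node.ParentNode.RemoveChild(node);
        }
        return pending.Count;
    }
}
```

Run against the Gettysburg Address fragment above (wrapped in a w:p element that declares the w namespace), the method should remove one node and leave the w:pStyle element in place.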

Insertions are more complicated than formatting changes because the w:ins element may be a container for more than one inserted run. For example, you may insert text and spaces in the same operation, as shown by the following WordprocessingML markup:

<w:ins w:id="12" w:author="Nancy Davolio" w:date="2007-07-03T08:23:00Z">
   <w:r w:rsidR="00655EFA">
      <w:t>word</w:t> 
   </w:r>
   <w:proofErr w:type="spellEnd" /> 
   <w:r w:rsidR="00655EFA">
      <w:t xml:space="preserve"></w:t> 
   </w:r>
</w:ins>

In this segment, the word word is inserted into the document followed by a blank space. Both of these w:t text elements are contained within w:r run elements which, in turn, are contained in the same w:ins insertion element. In the code, these runs (child nodes of the w:ins node) are promoted to the same level as the w:ins node. Then the w:ins node is deleted, which has the effect of accepting the revisions.

nodes = xdoc.SelectNodes("//w:ins", nsManager);
...
childNodes = node.SelectNodes(".//w:r", nsManager);
foreach (System.Xml.XmlNode childNode in childNodes)
{
   if (childNode == node.FirstChild)
   {
       node.ParentNode.InsertAfter(childNode, node);
   }
   else
   {
       node.ParentNode.InsertAfter(childNode, node.NextSibling);
   }
}
node.ParentNode.RemoveChild(node);

The potential difficulty of this operation lies in the order in which the inserted runs are processed. Continuing the previous example, the w:ins element contains a run for the word word followed by a run for the blank space. If each w:r element were simply inserted immediately after the w:ins node, the run for the word would be placed first, and the run for the space would then be placed between the w:ins node and that run. The result would be the space followed by the word, which is the opposite of the intended order. To avoid this problem, the code distinguishes the first run from the rest: it inserts the first w:r element immediately after the w:ins node and inserts each subsequent run after that first run.

After the runs are in the correct order, the code deletes the w:ins node and saves the updated XML back to the document part. When the updated document is opened, Word 2007 finds no revision markup around the content and treats the revisions as accepted.
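The ordering logic in the listing depends on which node is currently first inside the w:ins element, so it is worth testing against markup with three or more runs. As an alternative, the following sketch moves every child of w:ins in front of it, one at a time, which keeps document order regardless of how many children there are. This is a swapped-in technique, not the article's method; the InsertionRevisions class and AcceptInsertions method are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class InsertionRevisions
{
    private const string WordmlNamespace =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Replaces each w:ins element with its own children, in document order.
    public static void AcceptInsertions(XmlDocument xdoc)
    {
        XmlNamespaceManager nsManager = new XmlNamespaceManager(xdoc.NameTable);
        nsManager.AddNamespace("w", WordmlNamespace);

        // Copy the matches first so that tree changes cannot disturb the iteration.
        List<XmlNode> insertions = new List<XmlNode>();
        foreach (XmlNode node in xdoc.SelectNodes("//w:ins", nsManager))
        {
            insertions.Add(node);
        }

        foreach (XmlNode ins in insertions)
        {
            XmlNode parent = ins.ParentNode;

            // InsertBefore detaches the child from w:ins before re-inserting it,
            // so the loop ends once w:ins has no children left.
            while (ins.FirstChild != null)
            {
                parent.InsertBefore(ins.FirstChild, ins);
            }
            parent.RemoveChild(ins);
        }
    }
}
```

Because each child is placed immediately before the w:ins node, the first child ends up first and the last child ends up directly ahead of the position that w:ins occupied. Non-run children such as w:proofErr are carried along as well.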

In the following code, you remove the existing header parts in a document and add a new header part built from WordprocessingML markup that you supply:

public static void WDAddHeader(string docName, Stream headerContent)
{
   // Given a document name, and a stream containing valid header content,
   // add the stream content as a header in the document and remove the original headers

   const string wordmlNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
   const string relationshipNamespace = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

   using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(docName, true))
   {
      // Delete existing header part.
      wdDoc.MainDocumentPart.DeleteParts(wdDoc.MainDocumentPart.HeaderParts);

      // Create a new header part.
      HeaderPart headerPart = wdDoc.MainDocumentPart.AddNewPart<HeaderPart>();
      string rId = wdDoc.MainDocumentPart.GetIdOfPart(headerPart);                
      XmlDocument headerDoc = new XmlDocument();
      headerContent.Position = 0;
      headerDoc.Load(headerContent);

      // Write the header out to its part.
      headerDoc.Save(headerPart.GetStream());

      // Manage namespaces to perform Xml XPath queries.
      NameTable nt = new NameTable();
      XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
      nsManager.AddNamespace("w", wordmlNamespace);

      // Get the document part from the package.
      // Load the XML in the part into an XmlDocument instance.
      XmlDocument xdoc = new XmlDocument(nt);
      xdoc.Load(wdDoc.MainDocumentPart.GetStream());

      // Find the node containing the document layout.
      XmlNodeList targetNodes = xdoc.SelectNodes("//w:sectPr", nsManager);
      foreach (XmlNode targetNode in targetNodes)
      {
         // Delete any existing references to headers.
         XmlNodeList headerNodes = targetNode.SelectNodes("./w:headerReference", nsManager);
         foreach (System.Xml.XmlNode headerNode in headerNodes)
         {
            targetNode.RemoveChild(headerNode);
         }

         // Create the new header reference node.
         XmlElement node = xdoc.CreateElement("w:headerReference", wordmlNamespace);
         XmlAttribute attr = node.Attributes.Append(xdoc.CreateAttribute("r:id", relationshipNamespace));
         attr.Value = rId;
         node.Attributes.Append(attr);
         targetNode.InsertBefore(node, targetNode.FirstChild);
      }

      // Save the document XML back to its part.
      xdoc.Save(wdDoc.MainDocumentPart.GetStream(FileMode.Create));
   }
}

First, the WDAddHeader method is called, passing in the path to the Word 2007 document and a Stream object containing the replacement header markup and data. Then you open the WordprocessingDocument object representing the Open XML Format package. Next, you delete the existing header parts in the document and then create and fill a new header part with the following code:

wdDoc.MainDocumentPart.DeleteParts(wdDoc.MainDocumentPart.HeaderParts);

// Create a new header part.
HeaderPart headerPart = wdDoc.MainDocumentPart.AddNewPart<HeaderPart>();

XmlDocument headerDoc = new XmlDocument();
headerContent.Position = 0;
headerDoc.Load(headerContent);

// Write the header out to its part.
headerDoc.Save(headerPart.GetStream());

The headerDoc object is a memory-resident XML document that serves as a temporary holder for the replacement header markup before it is written to the new part. The remaining code loads the main document part, sets up a namespace manager for the XPath queries, and finds each w:sectPr (section properties) element. Within each section, it deletes the w:headerReference elements that pointed to the header parts you just deleted and inserts a reference to the new header part, using the relationship ID returned by the GetIdOfPart method. Finally, the updated WordprocessingML markup is saved back to the main document part.
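The reference-replacement step can be exercised on an in-memory document. The following sketch mirrors the loop in the listing under a few labeled assumptions: the HeaderReferences class and its methods are hypothetical names, the relationship ID is passed in as a plain string instead of coming from GetIdOfPart, and a w:type attribute with the value default is added on the assumption that the WordprocessingML schema expects one on each header reference (the article's listing omits it).

```csharp
using System.Collections.Generic;
using System.Xml;

public static class HeaderReferences
{
    private const string WordmlNamespace =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    private const string RelationshipNamespace =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    // Copies a node list so the tree can be changed while looping over it.
    private static List<XmlNode> Snapshot(XmlNodeList nodes)
    {
        List<XmlNode> copy = new List<XmlNode>();
        foreach (XmlNode node in nodes)
        {
            copy.Add(node);
        }
        return copy;
    }

    // Points every section at the header part identified by relationshipId.
    public static void PointSectionsAt(XmlDocument xdoc, string relationshipId)
    {
        XmlNamespaceManager nsManager = new XmlNamespaceManager(xdoc.NameTable);
        nsManager.AddNamespace("w", WordmlNamespace);

        foreach (XmlNode sectPr in Snapshot(xdoc.SelectNodes("//w:sectPr", nsManager)))
        {
            // Remove references to header parts that no longer exist.
            foreach (XmlNode stale in Snapshot(sectPr.SelectNodes("./w:headerReference", nsManager)))
            {
                sectPr.RemoveChild(stale);
            }

            XmlElement reference = xdoc.CreateElement("w", "headerReference", WordmlNamespace);

            // Assumption: w:type is set explicitly; the article's code omits it.
            XmlAttribute type = xdoc.CreateAttribute("w", "type", WordmlNamespace);
            type.Value = "default";
            reference.Attributes.Append(type);

            XmlAttribute id = xdoc.CreateAttribute("r", "id", RelationshipNamespace);
            id.Value = relationshipId;
            reference.Attributes.Append(id);

            // With a null reference node, InsertBefore appends to an empty w:sectPr.
            sectPr.InsertBefore(reference, sectPr.FirstChild);
        }
    }
}
```

Each attribute is appended once; the second Attributes.Append call in the listing is redundant because the first call already attaches the attribute to the element.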

In the following code, you delete the comments part and all references to the part from a Word 2007 document:

public static void WDDeleteComments(string docName)
{
   // Given a document name, remove all comments.

   const string wordmlNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

   using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(docName, true))
   {
      if (wdDoc.MainDocumentPart.CommentsPart != null)
      {
         wdDoc.MainDocumentPart.DeletePart(wdDoc.MainDocumentPart.CommentsPart);
         // Manage namespaces to perform Xml XPath queries.
         NameTable nt = new NameTable();
         XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
         nsManager.AddNamespace("w", wordmlNamespace);

         // Get the document part from the package.
         // Load the XML in the part into an XmlDocument instance:
         XmlDocument xdoc = new XmlDocument(nt);
         xdoc.Load(wdDoc.MainDocumentPart.GetStream());

         // Retrieve a list of nodes representing the comment start elements, and delete them all.
         XmlNodeList nodes = xdoc.SelectNodes("//w:commentRangeStart", nsManager);
         foreach (System.Xml.XmlNode node in nodes)
         {
            node.ParentNode.RemoveChild(node);
         }

         // Retrieve a list of nodes representing the comment end elements, and delete them all.
         nodes = xdoc.SelectNodes("//w:commentRangeEnd", nsManager);
         foreach (System.Xml.XmlNode node in nodes)
         {
            node.ParentNode.RemoveChild(node);
         }

         // Retrieve a list of nodes representing the comment reference elements, and delete them all.
         nodes = xdoc.SelectNodes("//w:r/w:commentReference", nsManager);
         foreach (System.Xml.XmlNode node in nodes)
         {
            node.ParentNode.RemoveChild(node);
         }

         // Save the document XML back to its part.
         xdoc.Save(wdDoc.MainDocumentPart.GetStream(FileMode.Create));
      }
   }
}

First, the WDDeleteComments method is called, passing in the path to the Word 2007 document. Then a WordprocessingDocument object is created from the input document, representing the Open XML Format package. Next, the code tests whether the package contains a comments part and, if so, deletes it. A namespace manager is created to set up the XPath queries. The queries, using the //w:commentRangeStart and //w:commentRangeEnd XPath expressions, search the main document part for the starting and ending comment nodes, respectively. After each list is compiled, every node in it is removed.

XmlNodeList nodes = xdoc.SelectNodes("//w:commentRangeStart", nsManager);
foreach (System.Xml.XmlNode node in nodes)
{
   node.ParentNode.RemoveChild(node);
}

// Retrieve a list of nodes representing the comment end elements, and delete them all.
nodes = xdoc.SelectNodes("//w:commentRangeEnd", nsManager);
foreach (System.Xml.XmlNode node in nodes)
{
   node.ParentNode.RemoveChild(node);
}

The same process removes the nodes that reference the comments, this time by using the //w:r/w:commentReference XPath expression. Finally, the updated WordprocessingML markup is saved back to the main document part.
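The three queries can also be combined into a single XPath union expression, so that one pass removes all of the comment markup. The sketch below shows that variation on an in-memory document; the CommentMarkup class and RemoveAll method are hypothetical names, and the union query is an alternative to the article's three separate loops, not a description of them.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class CommentMarkup
{
    private const string WordmlNamespace =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // The | operator returns the union of the three node sets.
    private const string CommentQuery =
        "//w:commentRangeStart | //w:commentRangeEnd | //w:r/w:commentReference";

    // Removes all comment anchors and references; returns the number removed.
    public static int RemoveAll(XmlDocument xdoc)
    {
        XmlNamespaceManager nsManager = new XmlNamespaceManager(xdoc.NameTable);
        nsManager.AddNamespace("w", WordmlNamespace);

        // Copy the matches first so that removal cannot disturb the iteration.
        List<XmlNode> matches = new List<XmlNode>();
        foreach (XmlNode node in xdoc.SelectNodes(CommentQuery, nsManager))
        {
            matches.Add(node);
        }

        foreach (XmlNode node in matches)
        {
            node.ParentNode.RemoveChild(node);
        }
        return matches.Count;
    }
}
```

As in the listing, the runs that contained a w:commentReference element are left in the document as empty runs.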

In the following code, you delete the headers and footers from a Word 2007 document:

public static void WDRemoveHeadersFooters(string docName)
{
   // Given a document name, remove all headers and footers.

   const string wordmlNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

   using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(docName, true))
   {
      if (wdDoc.MainDocumentPart.GetPartsCountOfType<HeaderPart>() > 0 ||
        wdDoc.MainDocumentPart.GetPartsCountOfType<FooterPart>() > 0)
      {
         wdDoc.MainDocumentPart.DeleteParts(wdDoc.MainDocumentPart.HeaderParts);
         wdDoc.MainDocumentPart.DeleteParts(wdDoc.MainDocumentPart.FooterParts);

         // Manage namespaces to perform XPath queries.
         NameTable nt = new NameTable();
         XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
         nsManager.AddNamespace("w", wordmlNamespace);

         // Get the document part from the package.
         // Load the XML in the part into an XmlDocument instance.
         XmlDocument xdoc = new XmlDocument(nt);
         xdoc.Load(wdDoc.MainDocumentPart.GetStream());

         // Find the node containing the document layout.
         XmlNodeList layoutNodes = xdoc.SelectNodes("//w:sectPr", nsManager);
         foreach (System.Xml.XmlNode layoutNode in layoutNodes)
         {
            // Delete any existing references to headers.
            XmlNodeList headerNodes = layoutNode.SelectNodes("./w:headerReference", nsManager);
            foreach (System.Xml.XmlNode headerNode in headerNodes)
            {
               layoutNode.RemoveChild(headerNode);
            }

            // Delete any existing references to footers.
            XmlNodeList footerNodes = layoutNode.SelectNodes("./w:footerReference", nsManager);
            foreach (System.Xml.XmlNode footerNode in footerNodes)
            {
               layoutNode.RemoveChild(footerNode);
            }
         }

         // Save the document XML back to its part.
         xdoc.Save(wdDoc.MainDocumentPart.GetStream(FileMode.Create));
      }
   }
}

First, the WDRemoveHeadersFooters method is called, passing in the path to the Word 2007 document. Then a WordprocessingDocument object is created from the input document, representing the Open XML Format package. Next, the code tests whether the package contains header or footer parts and, if so, deletes them. A namespace manager is then created to set up the XPath queries.

Then you create a memory-resident XML document as a temporary holder and load it with the markup and data from the main document part. The remaining code searches for the nodes that reference the header and footer parts that you just deleted, and deletes them as well.

XmlNodeList layoutNodes = xdoc.SelectNodes("//w:sectPr", nsManager);
foreach (System.Xml.XmlNode layoutNode in layoutNodes)
{
   // Delete any existing references to headers.
   XmlNodeList headerNodes = layoutNode.SelectNodes("./w:headerReference", nsManager);
   foreach (System.Xml.XmlNode headerNode in headerNodes)
   {
      layoutNode.RemoveChild(headerNode);
   }

   // Delete any existing references to footers.
   XmlNodeList footerNodes = layoutNode.SelectNodes("./w:footerReference", nsManager);
   foreach (System.Xml.XmlNode footerNode in footerNodes)
   {
      layoutNode.RemoveChild(footerNode);
   }
}

Finally, the updated WordprocessingML markup is saved back to the main document part.
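When the edited markup is shorter than the original, the old content has to be discarded before the new content is written; otherwise leftover bytes would follow the closing tag. Passing FileMode.Create to GetStream appears to serve that purpose in the listings. The sketch below shows the same load, edit, and save cycle with an explicit truncation. It makes several assumptions: the PartStreams class and its methods are hypothetical names, a seekable stream such as a MemoryStream stands in for the part stream, and the caller remains responsible for disposing the stream.

```csharp
using System.IO;
using System.Xml;

public static class PartStreams
{
    // Reads the whole stream into an XmlDocument. The stream must be seekable.
    public static XmlDocument Load(Stream partStream)
    {
        XmlDocument xdoc = new XmlDocument();
        partStream.Position = 0;
        xdoc.Load(partStream);
        return xdoc;
    }

    // Overwrites the stream with the document, discarding any bytes
    // left over from longer, earlier content.
    public static void Save(XmlDocument xdoc, Stream partStream)
    {
        partStream.SetLength(0);
        partStream.Position = 0;
        xdoc.Save(partStream);
        partStream.Flush();
    }
}
```

With a real package, a using statement around each stream returned by GetStream would release it promptly, in the same way that the first listing in this article wraps its stream.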

As this article demonstrates, working with Word 2007 files is much easier with the Microsoft SDK for Open XML Formats Technology Preview. In part two of this series of articles, I describe other common tasks that you can perform with the Open XML Formats SDK.

© 2014 Microsoft