
Manipulating Word 2007 Files with the Open XML Format API (Part 2 of 3)

Office 2007

Summary: This is the second in a series of three articles that describes the Open XML object model code that you can use to access and manipulate Microsoft Office Word 2007 files. (8 printed pages)

Frank Rice, Microsoft Corporation

August 2007 (Revised August 2008)

Applies to: Microsoft Office Word 2007

View Part 1: Manipulating Word 2007 Files with the Open XML Format API (Part 1 of 3).

Overview

The 2007 Microsoft Office system introduces new XML-based file formats called Open XML Formats. Microsoft Office Word 2007, Microsoft Office Excel 2007, and Microsoft Office PowerPoint 2007 all use these formats as their default file format. Open XML Formats are useful because they are an open standard and are based on well-known technologies: ZIP and XML. Microsoft provides a library for accessing these files in the DocumentFormat.OpenXml namespace, which is part of the Microsoft SDK for Open XML Formats Technology Preview and builds on the .NET Framework 3.0 technologies. The members of the DocumentFormat.OpenXml API provide strongly-typed part classes for manipulating Open XML documents. Because the API encapsulates many common tasks that developers perform on Open XML Format packages, you can perform complex operations with just a few lines of code.
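To make the "ZIP and XML" point concrete, the following is a minimal sketch that builds a tiny package in memory and lists its parts. It uses the System.IO.Packaging classes from the .NET Framework 3.0 (WindowsBase.dll) rather than the SDK, which is an assumption made only to keep the example self-contained. The module name, the helper BuildSamplePackage, and the part content are hypothetical.

```vb
Imports System
Imports System.IO
Imports System.IO.Packaging   ' Requires a reference to WindowsBase.dll.
Imports System.Text

Module PackageSketch
    ' Builds a one-part package in memory and returns its raw bytes.
    ' (Hypothetical helper; the part content is a placeholder, not real WordprocessingML.)
    Function BuildSamplePackage() As Byte()
        Dim ms As New MemoryStream()
        Using pkg As Package = Package.Open(ms, FileMode.Create, FileAccess.ReadWrite)
            Dim partUri As Uri = PackUriHelper.CreatePartUri(New Uri("word/document.xml", UriKind.Relative))
            Dim part As PackagePart = pkg.CreatePart(partUri, "application/xml")
            Using writer As New StreamWriter(part.GetStream(FileMode.Create, FileAccess.Write), Encoding.UTF8)
                writer.Write("<doc>Hello</doc>")
            End Using
        End Using
        ' ToArray still works after the package has been closed.
        Return ms.ToArray()
    End Function

    Sub Main()
        Dim bytes As Byte() = BuildSamplePackage()

        ' A ZIP archive begins with the ASCII signature "PK".
        Console.WriteLine(Encoding.ASCII.GetString(bytes, 0, 2))

        ' Reopen the same bytes as a package and list the parts it contains.
        Using pkg As Package = Package.Open(New MemoryStream(bytes), FileMode.Open, FileAccess.Read)
            For Each part As PackagePart In pkg.GetParts()
                Console.WriteLine("{0} ({1})", part.Uri, part.ContentType)
            Next
        End Using
    End Sub
End Module
```

The strongly-typed classes used in the rest of this article, such as WordprocessingDocument and MainDocumentPart, spare you from working with part URIs and content types directly.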

Removing Hidden Text from Documents

In the following code, you remove nodes from the main document part that contain hidden text in a Word 2007 document.

Public Sub WDDeleteHiddenText(ByVal docName As String)
   ' Given a document name, delete all the hidden text.
   Const wordmlNamespace As String = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
   Dim wdDoc As WordprocessingDocument = WordprocessingDocument.Open(docName, true)
   Using (wdDoc)
      ' Manage namespaces to perform XPath queries.
      Dim nt As NameTable = New NameTable
      Dim nsManager As XmlNamespaceManager = New XmlNamespaceManager(nt)
      nsManager.AddNamespace("w", wordmlNamespace)
      ' Get the document part from the package.
      ' Load the XML in the part into an XmlDocument instance.
      Dim xdoc As XmlDocument = New XmlDocument(nt)
      xdoc.Load(wdDoc.MainDocumentPart.GetStream)
      Dim hiddenNodes As XmlNodeList = xdoc.SelectNodes("//w:vanish", nsManager)
      For Each hiddenNode As System.Xml.XmlNode In hiddenNodes
         Dim topNode As XmlNode = hiddenNode.ParentNode.ParentNode
         Dim topParentNode As XmlNode = topNode.ParentNode
         topParentNode.RemoveChild(topNode)
         If Not topParentNode.HasChildNodes Then
            topParentNode.ParentNode.RemoveChild(topParentNode)
         End If
      Next
      ' Save the document XML back to its part.
      xdoc.Save(wdDoc.MainDocumentPart.GetStream(FileMode.Create, FileAccess.Write))
   End Using
End Sub

First, you call the WDDeleteHiddenText method, passing in the name of the Word 2007 document. The method creates a WordprocessingDocument object from the input document, representing the Office Open XML Format package, and then creates a namespace manager to set up the XPath queries. Next, it creates a memory-resident XML document as a temporary holder and loads it with the markup and data from the main document part. The XPath expression //w:vanish then selects every w:vanish element, which is the run property that marks text as hidden, and returns the matches as a node list. The code loops through the node list. For each match, it removes the element two levels up (typically the w:r run whose properties contain w:vanish) and, if that leaves the run's parent without child nodes, removes the parent as well.

For Each hiddenNode As System.Xml.XmlNode In hiddenNodes
   Dim topNode As XmlNode = hiddenNode.ParentNode.ParentNode
   Dim topParentNode As XmlNode = topNode.ParentNode
   topParentNode.RemoveChild(topNode)
   If Not topParentNode.HasChildNodes Then
      topParentNode.ParentNode.RemoveChild(topParentNode)
   End If
Next

Finally, save the updated WordprocessingML markup back to the main document part.
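The removal logic can be tried without a document file. The following is a minimal sketch that applies the same XPath query and node removal to a hand-written WordprocessingML fragment held in a string. The module name, the helper RemoveHiddenRuns, and the sample markup are hypothetical, and copying the matches into a list before removing anything is a defensive choice of this sketch rather than something the original procedure does.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Xml

Module HiddenTextSketch
    Const wordmlNamespace As String = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    ' Removes hidden runs from WordprocessingML held in a string and
    ' returns the resulting markup. (Hypothetical helper for illustration.)
    Function RemoveHiddenRuns(ByVal markup As String) As String
        Dim nt As New NameTable()
        Dim nsManager As New XmlNamespaceManager(nt)
        nsManager.AddNamespace("w", wordmlNamespace)

        Dim xdoc As New XmlDocument(nt)
        xdoc.LoadXml(markup)

        ' Copy the matches first so that removals cannot disturb the query results.
        Dim hiddenNodes As New List(Of XmlNode)()
        For Each node As XmlNode In xdoc.SelectNodes("//w:vanish", nsManager)
            hiddenNodes.Add(node)
        Next

        For Each hiddenNode As XmlNode In hiddenNodes
            ' w:vanish -> w:rPr -> w:r (the run that carries the hidden text).
            Dim topNode As XmlNode = hiddenNode.ParentNode.ParentNode
            Dim topParentNode As XmlNode = topNode.ParentNode
            topParentNode.RemoveChild(topNode)
            ' Remove the paragraph too if the hidden run was its only child.
            If Not topParentNode.HasChildNodes Then
                topParentNode.ParentNode.RemoveChild(topParentNode)
            End If
        Next
        Return xdoc.OuterXml
    End Function

    Sub Main()
        Dim sample As String = _
            "<w:document xmlns:w=""" & wordmlNamespace & """><w:body>" & _
            "<w:p><w:r><w:t>Visible</w:t></w:r>" & _
            "<w:r><w:rPr><w:vanish/></w:rPr><w:t>Secret</w:t></w:r></w:p>" & _
            "<w:p><w:r><w:rPr><w:vanish/></w:rPr><w:t>All hidden</w:t></w:r></w:p>" & _
            "</w:body></w:document>"
        Console.WriteLine(RemoveHiddenRuns(sample))
    End Sub
End Module
```

In the sample, the first paragraph keeps its visible run, and the second paragraph is removed entirely because its only run was hidden.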

Retrieving Document Application Properties

In the following code, you retrieve the value of an application property contained in the extended properties part (app.xml) in a document.

Public Shared Function WDRetrieveAppProperty(ByVal docName As String, ByVal propertyName As String) As String
   ' Given a document name and an app property, retrieve the value of the property.
   ' Note that because this code uses the SelectSingleNode method,
   ' the search is case sensitive. That is, looking for "Words" is not 
   ' the same as looking for "words".
   Const appPropertiesSchema As String = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"
   Dim propertyValue As String = string.Empty
   Dim wdDoc As WordprocessingDocument = WordprocessingDocument.Open(docName, false)

   ' Manage namespaces to perform Xml XPath queries.
   Dim nt As NameTable = New NameTable
   Dim nsManager As XmlNamespaceManager = New XmlNamespaceManager(nt)
   nsManager.AddNamespace("d", appPropertiesSchema)

   ' Get the properties from the package.
   Dim xdoc As XmlDocument = New XmlDocument(nt)

   ' Load the XML in the part into an XmlDocument instance.
   xdoc.Load(wdDoc.ExtendedFilePropertiesPart.GetStream)
   Dim searchString As String = String.Format("//d:Properties/d:{0}", propertyName)
   Dim xNode As XmlNode = xdoc.SelectSingleNode(searchString, nsManager)
   If Not (xNode Is Nothing) Then
      propertyValue = xNode.InnerText
   End If

   Return propertyValue
End Function

First, you call the WDRetrieveAppProperty method, passing in the name of the Word 2007 document and the name of the application property that you want to retrieve. The method creates a WordprocessingDocument object from the input document, representing the Office Open XML Format package, and then creates a namespace manager to set up the XPath queries. Next, it creates a memory-resident XML document as a temporary holder and loads it with the markup and data from the extended properties part (app.xml). It then builds an XPath query that looks for a child of the d:Properties node whose name matches the requested property.

Dim searchString As String = String.Format("//d:Properties/d:{0}", propertyName)
Dim xNode As XmlNode = xdoc.SelectSingleNode(searchString, nsManager)
If Not (xNode Is Nothing) Then
   propertyValue = xNode.InnerText
End If

Then you select the node that contains the specified application property and assign its text to a return variable.
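To see how the query behaves, including its case sensitivity, the following minimal sketch runs the same XPath expression against a hand-written fragment that stands in for app.xml. The module name, the helper FindAppProperty, and the sample values are hypothetical.

```vb
Imports System
Imports System.Xml

Module AppPropertySketch
    Const appPropertiesSchema As String = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"

    ' Returns the text of the named application property, or an empty
    ' string when no element with exactly that name exists.
    Function FindAppProperty(ByVal appXml As String, ByVal propertyName As String) As String
        Dim nt As New NameTable()
        Dim nsManager As New XmlNamespaceManager(nt)
        nsManager.AddNamespace("d", appPropertiesSchema)

        Dim xdoc As New XmlDocument(nt)
        xdoc.LoadXml(appXml)

        Dim searchString As String = String.Format("//d:Properties/d:{0}", propertyName)
        Dim xNode As XmlNode = xdoc.SelectSingleNode(searchString, nsManager)
        If xNode Is Nothing Then
            Return String.Empty
        End If
        Return xNode.InnerText
    End Function

    Sub Main()
        ' Sample markup with made-up values.
        Dim sample As String = "<Properties xmlns=""" & appPropertiesSchema & """>" & _
            "<Application>Microsoft Office Word</Application><Words>42</Words></Properties>"
        Console.WriteLine(FindAppProperty(sample, "Words"))          ' 42
        Console.WriteLine(FindAppProperty(sample, "words").Length)   ' 0 (names are case sensitive)
    End Sub
End Module
```

Note that the sample declares the extended-properties namespace as its default namespace. The "d" prefix exists only in the namespace manager, which is why the query can use it even though the markup does not.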

Converting Macro-Enabled Documents to DOCX Files

The following code removes all Microsoft Visual Basic for Applications (VBA) parts from a document. It also converts the document from a .docm file, which has macros enabled, to a .docx file, which has macros disabled.

Public Sub WDConvertDOCMToDOCX(ByVal docName As String)
   ' Given a DOCM file (with macro storage), remove the VBA 
   ' project, reset the document type, and save the document with a new name.
   Dim vbaRelationshipType As String = "http://schemas.microsoft.com/office/2006/relationships/vbaProject"
   Dim wdDoc As WordprocessingDocument = WordprocessingDocument.Open(docName, True)
   Using (wdDoc)
      wdDoc.ChangeDocumentType(WordprocessingDocumentType.Document)
      Dim partsToDel As List(Of ExtendedPart) = New List(Of ExtendedPart)()
      For Each part As ExtendedPart In wdDoc.MainDocumentPart.GetPartsOfType(Of ExtendedPart)()
         If (part.RelationshipType = vbaRelationshipType) Then
            partsToDel.Add(part)
         End If
      Next part
      wdDoc.MainDocumentPart.DeleteParts(partsToDel)
   End Using

   ' Generate the new file name.
   Dim newFileName As String = Path.ChangeExtension(docName, ".docx")
   ' If the new file exists, delete it. You may
   ' want to make this code less destructive.
   If File.Exists(newFileName) Then
      File.Delete(newFileName)
   End If
   File.Move(docName, newFileName)
End Sub

First, you call the WDConvertDOCMToDOCX method, passing in a reference to the Word 2007 document. Then, you create a WordprocessingDocument object from the input document, representing the Open XML File Format package. Next, call the ChangeDocumentType method of the WordprocessingDocument object to change the type of the document from a macro-enabled file to the default document format, which does not have macros enabled.

wdDoc.ChangeDocumentType(WordprocessingDocumentType.Document)

The procedure then loops through the extended parts of the main document part and adds each part whose relationship type identifies a VBA project to a list of parts to delete.

For Each part As ExtendedPart In wdDoc.MainDocumentPart.GetPartsOfType(Of ExtendedPart)()
   If (part.RelationshipType = vbaRelationshipType) Then
      partsToDel.Add(part)
   End If
Next part
wdDoc.MainDocumentPart.DeleteParts(partsToDel)

The parts are deleted by calling the DeleteParts method of the main document part. Finally, the name of the original document is changed to reflect its updated type.
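The comment in the listing suggests making the rename step less destructive. One possible approach, sketched below under the assumption that appending a counter to the file name is acceptable, is to pick a target name that does not exist yet instead of deleting an existing file. The module name and the helper GetUnusedDocxName are hypothetical.

```vb
Imports System
Imports System.IO

Module RenameSketch
    ' Returns a .docx path based on docName that does not collide with an
    ' existing file, by appending " (1)", " (2)", ... when needed.
    Function GetUnusedDocxName(ByVal docName As String) As String
        Dim candidate As String = Path.ChangeExtension(docName, ".docx")
        Dim folder As String = Path.GetDirectoryName(candidate)
        If folder Is Nothing Then
            folder = String.Empty
        End If
        Dim baseName As String = Path.GetFileNameWithoutExtension(candidate)
        Dim counter As Integer = 1
        While File.Exists(candidate)
            candidate = Path.Combine(folder, String.Format("{0} ({1}).docx", baseName, counter))
            counter += 1
        End While
        Return candidate
    End Function

    Sub Main()
        ' Create a scratch folder that already contains Report.docx.
        Dim folder As String = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName())
        Directory.CreateDirectory(folder)
        File.WriteAllText(Path.Combine(folder, "Report.docx"), "placeholder")

        Dim target As String = GetUnusedDocxName(Path.Combine(folder, "Report.docm"))
        Console.WriteLine(Path.GetFileName(target))   ' Report (1).docx

        Directory.Delete(folder, True)
    End Sub
End Module
```

With a helper like this, the conversion procedure could pass the returned name to File.Move and skip the File.Delete call.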

Retrieving Core Document Properties

In the following code, you retrieve the value of a core property contained in the core properties part (core.xml) in a document.

Public Function WDRetrieveCoreProperty(ByVal docName As String, ByVal propertyName As String) As String
   ' Given a document name and a core property, retrieve the value of the property.
   ' Note that because this code uses the SelectSingleNode method, 
   ' the search is case sensitive. That is, looking for "Author" is not 
   ' the same as looking for "author".

   Const corePropertiesSchema As String = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
   Const dcPropertiesSchema As String = "http://purl.org/dc/elements/1.1/"
   Const dcTermsPropertiesSchema As String = "http://purl.org/dc/terms/"
   Dim propertyValue As String = String.Empty
   Dim wdPackage As WordprocessingDocument = WordprocessingDocument.Open(docName, False)

   ' Get the core properties part (core.xml).
   Dim corePropertiesPart As CoreFilePropertiesPart = wdPackage.CoreFilePropertiesPart

   ' Manage namespaces to perform XML XPath queries.
   Dim nt As NameTable = New NameTable
   Dim nsManager As XmlNamespaceManager = New XmlNamespaceManager(nt)
   nsManager.AddNamespace("cp", corePropertiesSchema)
   nsManager.AddNamespace("dc", dcPropertiesSchema)
   nsManager.AddNamespace("dcterms", dcTermsPropertiesSchema)

   ' Get the properties from the package.
   Dim xdoc As XmlDocument = New XmlDocument(nt)
   ' Load the XML in the part into an XmlDocument instance.
   xdoc.Load(corePropertiesPart.GetStream)

   Dim searchString As String = String.Format("//cp:coreProperties/{0}", propertyName)
   Dim xNode As XmlNode = xdoc.SelectSingleNode(searchString, nsManager)
   If Not (xNode Is Nothing) Then
      propertyValue = xNode.InnerText
   End If
   Return propertyValue
End Function

First, you call the WDRetrieveCoreProperty method, passing in the name of the Word 2007 document and the name of the core property that you want to retrieve. Because the property name is inserted directly into the XPath query, it must include its namespace prefix, for example dc:creator or cp:lastModifiedBy. The method creates a WordprocessingDocument object from the input document, representing the Office Open XML Format package, retrieves the CoreFilePropertiesPart part, and creates a namespace manager to set up the XPath query. It then creates a memory-resident XML document as a temporary holder and loads it with the markup and data from the core file properties part (core.xml). Next, it builds an XPath query that searches the cp:coreProperties node for the requested property.

Dim searchString As String = String.Format("//cp:coreProperties/{0}", propertyName)
Dim xNode As XmlNode = xdoc.SelectSingleNode(searchString, nsManager)
If Not (xNode Is Nothing) Then
   propertyValue = xNode.InnerText
End If

Then the code selects the node that contains the specified property and assigns its text to a return variable.
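Core properties live in more than one namespace, so the prefix in the property name matters. The following minimal sketch runs the same query against a hand-written fragment that stands in for core.xml. The module name, the helper FindCoreProperty, and the sample values are hypothetical.

```vb
Imports System
Imports System.Xml

Module CorePropertySketch
    Const corePropertiesSchema As String = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
    Const dcPropertiesSchema As String = "http://purl.org/dc/elements/1.1/"

    ' Returns the text of the named core property, or an empty string.
    ' propertyName must carry its prefix, for example "dc:creator".
    Function FindCoreProperty(ByVal coreXml As String, ByVal propertyName As String) As String
        Dim nt As New NameTable()
        Dim nsManager As New XmlNamespaceManager(nt)
        nsManager.AddNamespace("cp", corePropertiesSchema)
        nsManager.AddNamespace("dc", dcPropertiesSchema)

        Dim xdoc As New XmlDocument(nt)
        xdoc.LoadXml(coreXml)

        Dim searchString As String = String.Format("//cp:coreProperties/{0}", propertyName)
        Dim xNode As XmlNode = xdoc.SelectSingleNode(searchString, nsManager)
        If xNode Is Nothing Then
            Return String.Empty
        End If
        Return xNode.InnerText
    End Function

    Sub Main()
        ' Sample markup with made-up values.
        Dim sample As String = _
            "<cp:coreProperties xmlns:cp=""" & corePropertiesSchema & """ xmlns:dc=""" & dcPropertiesSchema & """>" & _
            "<dc:creator>Author One</dc:creator>" & _
            "<cp:lastModifiedBy>Editor Two</cp:lastModifiedBy>" & _
            "</cp:coreProperties>"
        Console.WriteLine(FindCoreProperty(sample, "dc:creator"))          ' Author One
        Console.WriteLine(FindCoreProperty(sample, "cp:lastModifiedBy"))   ' Editor Two
        Console.WriteLine(FindCoreProperty(sample, "creator").Length)      ' 0 (no prefix, no match)
    End Sub
End Module
```

An unprefixed name such as "creator" matches only elements that are in no namespace, so the query finds nothing and the function returns an empty string.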

Retrieving Custom Document Properties

In the following code, you retrieve the value of a custom property contained in the custom properties part (custom.xml) in a document.

Public Function WDRetrieveCustomProperty(ByVal docName As String, ByVal propertyName As String) As String
   ' Given a document name and a custom property, retrieve the value of the property.
   ' Note that because this code uses the SelectSingleNode method
   ' the search is case sensitive. That is, looking for "Author" is not 
   ' the same as looking for "author".

   Const customPropertiesSchema As String = "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"
   Const customVTypesSchema As String = "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes"
   Dim propertyValue As String = String.Empty
   Dim wdPackage As WordprocessingDocument = WordprocessingDocument.Open(docName, False)

   ' Get the custom properties part (custom.xml).
   Dim customPropertiesPart As CustomFilePropertiesPart = wdPackage.CustomFilePropertiesPart
   ' There may not be a custom properties part.
   If (Not (customPropertiesPart) Is Nothing) Then
      ' Manage namespaces to perform Xml XPath queries.
      Dim nt As NameTable = New NameTable
      Dim nsManager As XmlNamespaceManager = New XmlNamespaceManager(nt)
      nsManager.AddNamespace("d", customPropertiesSchema)
      nsManager.AddNamespace("vt", customVTypesSchema)

      ' Get the properties from the package.
      Dim xdoc As XmlDocument = New XmlDocument(nt)

      ' Load the XML in the part into an XmlDocument instance.
      xdoc.Load(customPropertiesPart.GetStream)
      Dim searchString As String = String.Format("d:Properties/d:property[@name='{0}']", propertyName)
      Dim xNode As XmlNode = xdoc.SelectSingleNode(searchString, nsManager)
      If (Not (xNode) Is Nothing) Then
         propertyValue = xNode.InnerText
      End If
    End If
   Return propertyValue
End Function

First, you call the WDRetrieveCustomProperty method, passing in the name of the Word 2007 document and the name of the custom property that you want to retrieve. The method creates a WordprocessingDocument object from the input document, representing the Office Open XML Format package. Next, it retrieves the CustomFilePropertiesPart part and assigns it to a variable. Because the package may not contain a custom properties part, the code tests whether the variable is Nothing before it continues.

If a custom properties part exists, the code creates a namespace manager to set up the XPath query. It then creates a memory-resident XML document as a temporary holder and loads it with the markup and data from the custom file properties part (custom.xml). Next, it builds an XPath query that searches for the d:Properties/d:property node whose name attribute matches the specified property.

xdoc.Load(customPropertiesPart.GetStream)
Dim searchString As String = String.Format("d:Properties/d:property[@name='{0}']", propertyName)
Dim xNode As XmlNode = xdoc.SelectSingleNode(searchString, nsManager)
If (Not (xNode) Is Nothing) Then
   propertyValue = xNode.InnerText
End If

Then the code example selects the node that contains the specified property and assigns its text to a return variable.
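Unlike the application and core properties, a custom property is identified by its name attribute, and its value sits in a typed child element from the docPropsVTypes namespace. The following minimal sketch runs the same query against a hand-written fragment that stands in for custom.xml and also reports the name of the typed child element. The module name, the helper DescribeCustomProperty, and the sample property names and values are hypothetical.

```vb
Imports System
Imports System.Xml

Module CustomPropertySketch
    Const customPropertiesSchema As String = "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"
    Const customVTypesSchema As String = "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes"

    ' Returns "<type>: <value>" for the named custom property,
    ' or an empty string when the property is not present.
    Function DescribeCustomProperty(ByVal customXml As String, ByVal propertyName As String) As String
        Dim nt As New NameTable()
        Dim nsManager As New XmlNamespaceManager(nt)
        nsManager.AddNamespace("d", customPropertiesSchema)
        nsManager.AddNamespace("vt", customVTypesSchema)

        Dim xdoc As New XmlDocument(nt)
        xdoc.LoadXml(customXml)

        Dim searchString As String = String.Format("d:Properties/d:property[@name='{0}']", propertyName)
        Dim xNode As XmlNode = xdoc.SelectSingleNode(searchString, nsManager)
        If xNode Is Nothing OrElse xNode.FirstChild Is Nothing Then
            Return String.Empty
        End If
        ' The first child is the typed value element, such as vt:lpwstr or vt:i4.
        Return xNode.FirstChild.LocalName & ": " & xNode.InnerText
    End Function

    Sub Main()
        ' Sample markup with made-up property names and values.
        Dim sample As String = _
            "<Properties xmlns=""" & customPropertiesSchema & """ xmlns:vt=""" & customVTypesSchema & """>" & _
            "<property fmtid=""{D5CDD505-2E9C-101B-9397-08002B2CF9AE}"" pid=""2"" name=""Reviewer"">" & _
            "<vt:lpwstr>Contoso</vt:lpwstr></property>" & _
            "<property fmtid=""{D5CDD505-2E9C-101B-9397-08002B2CF9AE}"" pid=""3"" name=""Revision"">" & _
            "<vt:i4>3</vt:i4></property>" & _
            "</Properties>"
        Console.WriteLine(DescribeCustomProperty(sample, "Reviewer"))   ' lpwstr: Contoso
        Console.WriteLine(DescribeCustomProperty(sample, "Revision"))   ' i4: 3
    End Sub
End Module
```

Because the property name is placed inside single quotation marks in the XPath expression, a name that itself contains an apostrophe produces an invalid query. If such names are possible, validate or escape the name before building the search string.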

Setting Core Document Properties

In the following code, you set the value of a core property in a document. If you successfully set the property to a new value, the original value is returned to the calling procedure; if the property does not exist in the core properties part, the code throws an ArgumentException.

Public Function WDSetCoreProperty(ByVal docName As String, ByVal propertyName As String, ByVal propertyValue As String) As String
   ' Given a document name, a property name, and a value, update the document.
   ' Note that you cannot set the value of a property that does not
   ' exist within the core.xml part. If you successfully updated the property
   ' value, return its old value.
   ' Attempting to modify a non-existent property raises an exception.

   ' Property names are case-sensitive.

   Const corePropertiesSchema As String = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
   Const dcPropertiesSchema As String = "http://purl.org/dc/elements/1.1/"
   Dim retVal As String = Nothing
   Dim wdPackage As WordprocessingDocument = WordprocessingDocument.Open(docName, True)
   ' The Using block closes the package, which writes the changes to the file.
   Using (wdPackage)
      Dim corePropertiesPart As CoreFilePropertiesPart = wdPackage.CoreFilePropertiesPart

      ' Manage namespaces to perform Xml XPath queries.
      Dim nt As NameTable = New NameTable
      Dim nsManager As XmlNamespaceManager = New XmlNamespaceManager(nt)
      nsManager.AddNamespace("cp", corePropertiesSchema)
      nsManager.AddNamespace("dc", dcPropertiesSchema)

      ' Load the XML in the part into an XmlDocument instance.
      Dim xdoc As XmlDocument = New XmlDocument(nt)
      xdoc.Load(corePropertiesPart.GetStream)
      Dim searchString As String = String.Format("//cp:coreProperties/{0}", propertyName)
      Dim xNode As XmlNode = xdoc.SelectSingleNode(searchString, nsManager)
      If (xNode Is Nothing) Then
         ' Trying to set the value of a property that 
         ' does not exist? Throw an exception.
         Throw New ArgumentException("Invalid property name.")
      Else
         ' Get the current value.
         retVal = xNode.InnerText
         ' Now update the value.
         xNode.InnerText = propertyValue
         ' Save the properties XML back to its part, replacing the old content.
         xdoc.Save(corePropertiesPart.GetStream(FileMode.Create, FileAccess.Write))
      End If
   End Using
   Return retVal
End Function

First, you call the WDSetCoreProperty method, passing in the name of the Word 2007 document, the name of the core property, and the new value that you want to assign to the property. The method creates a WordprocessingDocument object from the input document, representing the Office Open XML Format package, retrieves the CoreFilePropertiesPart part, and creates a namespace manager to set up the XPath query. It then creates a memory-resident XML document as a temporary holder and loads it with the markup and data from the core file properties part (core.xml). Next, it builds an XPath query that searches the cp:coreProperties node for the requested property. Because the part may not contain the property, the code tests whether the returned node is Nothing.

Dim searchString As String = String.Format("//cp:coreProperties/{0}", propertyName)
Dim xNode As XmlNode = xdoc.SelectSingleNode(searchString, nsManager)
If (xNode Is Nothing) Then
   ' Trying to set the value of a property that 
   ' does not exist? Throw an exception.
   Throw New ArgumentException("Invalid property name.")
Else
   ' Get the current value.
   retVal = xNode.InnerText
   ' Now update the value.
   xNode.InnerText = propertyValue
   ' Save the properties XML back to its part.
   xdoc.Save(corePropertiesPart.GetStream(FileMode.Create, FileAccess.Write))
End If
Return retVal

If the property does not exist, the code throws an exception. Otherwise, it retrieves the current value of the property and sets the node that contains the property to the new value. Finally, it saves the updated markup back to the package and then returns the original value to the calling procedure.
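When you save XML back over content that already exists, it matters whether the target stream is truncated first. The following minimal sketch uses a MemoryStream as a stand-in for a part stream (an assumption made to keep the example self-contained) and shows that writing shorter content over longer content leaves stale bytes behind unless the stream is emptied. The module and function names are hypothetical.

```vb
Imports System
Imports System.IO
Imports System.Text

Module TruncateSketch
    ' Writes updated over original, starting at position 0, and returns the
    ' stream content as text. When truncate is True, the stream is emptied first.
    Function Overwrite(ByVal original As String, ByVal updated As String, ByVal truncate As Boolean) As String
        Dim ms As New MemoryStream()
        Dim oldBytes As Byte() = Encoding.UTF8.GetBytes(original)
        ms.Write(oldBytes, 0, oldBytes.Length)

        ' Rewind, as if the stream had just been reopened for writing.
        ms.Position = 0
        If truncate Then
            ms.SetLength(0)
        End If

        Dim newBytes As Byte() = Encoding.UTF8.GetBytes(updated)
        ms.Write(newBytes, 0, newBytes.Length)
        Return Encoding.UTF8.GetString(ms.ToArray())
    End Function

    Sub Main()
        Dim original As String = "<title>A long original title</title>"
        Dim updated As String = "<title>Short</title>"

        ' Without truncation, the tail of the old content survives after the new content.
        Console.WriteLine(Overwrite(original, updated, False))
        ' With truncation, only the new content remains.
        Console.WriteLine(Overwrite(original, updated, True))
    End Sub
End Module
```

Opening the part stream with FileMode.Create, as the hidden-text procedure earlier in this article does, is one way to make sure that the old content is discarded before the new markup is written.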

Conclusion

As this article demonstrates, working with Word 2007 files is much easier with the Microsoft SDK for Open XML Formats Technology Preview. In part three of this series of articles, I describe other common tasks that you can perform with the Open XML Formats SDK.
