Creating a Simple Search and Replace Utility for Word 2007 Open XML Format Documents

Summary:  Create a simple Windows Form project that uses the Open XML Format API to search for and replace terms in Microsoft Office Word 2007 documents located in folders and subfolders. (10 printed pages)

Frank Rice, Microsoft Corporation

October 2008

**Applies to:** Microsoft Office Excel 2007, Microsoft Office PowerPoint 2007, Microsoft Office Word 2007, Microsoft Visual Studio 2005, Microsoft Visual Studio 2005 Tools for the 2007 Microsoft Office system

Contents

  • Overview

  • Creating the Windows Form Project

  • Creating the User Interface

  • Adding Code to the Project

  • Testing the Project

  • Conclusion

  • Additional Resources

Overview

The 2007 Microsoft Office system introduced a new file format based on the Standard ECMA-376: Office Open XML File Formats for Microsoft Office Excel 2007, Microsoft Office PowerPoint 2007, and Microsoft Office Word 2007. This file format consists of a zip file containing various XML parts. These document parts contain the content and define the relationships for all of the parts in a single package. The beauty of this file format is that, unlike the proprietary binary file formats used in previous versions of Microsoft Office, you can create and manipulate files in this format outside of the Microsoft Office applications. For example, you can write an application that creates Word 2007 .docx (macro-free) files or .docm (macro-enabled) files without having Word 2007 installed.

To work with this file format, one option is to use the Open XML Format Application Programming Interface (API) in the DocumentFormat.OpenXml.Packaging namespace. The classes, methods, and properties in this namespace are located in the DocumentFormat.OpenXml.dll file. You can install this DLL file by installing the Open XML Format SDK version 1.0. The members in this namespace allow you to easily work with the package contents for Excel 2007 workbooks, PowerPoint 2007 presentations, and Word 2007 documents.

In this article, I show you how to create a Microsoft Visual Basic Windows Form application in Microsoft Visual Studio 2005 or Microsoft Visual Studio 2008 that allows you to do simple text searches and replacements in Word 2007 .docx and .docm files. The tool also gives you the option to search only a single folder or to include its subfolders.

Creating the Windows Form Project

In the following steps, you create the Windows Form project.

To create the Windows Form project

  1. Start Microsoft Visual Studio 2005 or Microsoft Visual Studio 2008.

  2. On the File menu, click New Project.

  3. In the New Project dialog box, in the Project types pane, expand the Visual Basic node, and in the Templates pane, click Windows Form Application.

  4. Provide a name for the project and then click OK. This creates the Form1 class.

Creating the User Interface

In the following steps, you create the form for the project as shown in Figure 1.

To create the form

  1. In the Solution Explorer, expand the Form1.vb node and then double-click the Form1.Designer.vb node.

  2. On the Windows form designer, drag the following controls from the toolbox and then set the properties in the Properties pane as shown in Table 1. You can display the toolbox by pressing Ctrl + Alt + X and then display the Properties pane by pressing F4.

    Table 1. Add these controls to the Windows Form designer

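    | Control   | Property | Value                  |
    |-----------|----------|------------------------|
    | Form1     | Text     | Search and Replace     |
    | Label1    | Text     | Search Directory       |
    | TextBox1  | Name     | txtDirectory           |
    | CheckBox1 | Checked  | True                   |
    | CheckBox1 | Name     | ckbIncludeSub          |
    | CheckBox1 | Text     | Include Subfolders     |
    | Label2    | Text     | Original Text          |
    | TextBox2  | Name     | txtOldText             |
    | Label3    | Text     | New Text               |
    | TextBox3  | Name     | txtNewText             |
    | Button1   | Name     | btnGetFiles            |
    | Button1   | Text     | Search and Replace     |
    | Button2   | Name     | btnClose               |
    | Button2   | Text     | Close Form             |
    | Label4    | Text     | Number of Changes Made |
    | TextBox4  | Name     | txtNumChanged          |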

    The finished form should look like Figure 1.

    Figure 1. The Search and Replace form


Adding Code to the Project

In the following steps, you set up the project and add Visual Basic code to it.

To add functionality to the project

  1. First, you need to add a reference to the DocumentFormat.OpenXml library (DocumentFormat.OpenXml.dll) to the project. This DLL file contains the DocumentFormat.OpenXml.Packaging namespace and the programming members that you use to work with the Open XML Format files.

    On the Project menu, click Show All Files.

  2. In the Solution Explorer, right-click the References node and then click Add Reference.

  3. In the Add Reference dialog box, on the .NET tab, select DocumentFormat.OpenXml, and then click OK.

  4. On the Form1.vb Designer, double-click the btnGetFiles button to add the btnGetFiles_Click event procedure to the Form1 class.

  5. Add the following code to the btnGetFiles_Click procedure.

    Dim rootdir As String = txtDirectory.Text
    numChanged = 0
    SearchFolders(rootdir)
    

    This event procedure runs when the user clicks the Search and Replace button. It assigns the starting directory to a variable, resets the counter of changes to zero, and then calls the SearchFolders procedure.

  6. Next, before the Public Class Form1 statement, add the following namespaces. Adding these namespaces here allows you to use the name for each member without having to fully qualify the member name. Note that the DocumentFormat.OpenXml.Packaging namespace contains the members used for working with the Open XML Format files.

    Imports System
    Imports System.IO
    Imports System.Text
    Imports System.Xml
    Imports System.Windows.Forms
    Imports DocumentFormat.OpenXml.Packaging
    
  7. After the Public Class Form1 statement, add the following class fields.

    Private fileType() As String = {"*.docx", "*.docm"}
    Private numChanged As Integer
    Const wordmlNamespace As String = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    

    Here, you specify the file extensions that you will search. Instead of hard-coding the file extensions, you could add these selections to a combo box in the form and let the user select which types of document to search. You also specify a constant containing the WordprocessingML namespace.

  8. Next, add the SearchFolders procedure to the project. This procedure is responsible for searching the folders and subfolders, if requested by the user, for the document files.

    Private Sub SearchFolders(ByVal rootdir As String)
       Dim files() As String
       Dim file As String
       Dim subdir As String

       Try
          ' Search the files in the given directory.
          For i As Integer = 0 To fileType.Length - 1
             files = Directory.GetFiles(rootdir, fileType(i))
             For Each file In files
                Search_Replace(file)
             Next
          Next
          ' Also recursively search subfolders if requested.
          If ckbIncludeSub.Checked Then
             ' List the subfolders of the folder currently being searched.
             For Each subdir In Directory.GetDirectories(rootdir)
                SearchFolders(subdir)
             Next
          End If

       Catch
          ' Ignore any errors.
       End Try
    End Sub
    

    This procedure uses the GetFiles method of the Directory class to get the list of files from the given directory. The first pass of the loop looks for .docx files and the second pass looks for .docm files, as specified in the fileType() array that you assigned in the class field declarations. For each file found, the Search_Replace procedure is called.

    Next, if the user selected the Include Subfolders check box in the form (it is selected by default), the GetDirectories method of the Directory class is called with the path of the folder currently being searched. Each subfolder path that it returns is assigned to the subdir string variable and passed recursively to the SearchFolders procedure, which repeats the search process in that subfolder. Because every call lists the subfolders of the folder it was given, the search reaches subfolders at every level below the root directory. Note that because the Catch block is empty, an error (for example, a folder that you do not have permission to read) silently ends the search of the current folder.

  9. Now add the Search_Replace procedure to the project. This procedure is where you search the Open XML Format files for the text terms.

    Private Sub Search_Replace(ByVal file As String)
       ' Open the package with read and write access. The Using block closes
       ' the package and releases the file when the work is done.
       Using wdDoc As WordprocessingDocument = WordprocessingDocument.Open(file, True)

          ' Manage namespaces to perform XML XPath queries.
          Dim nt As NameTable = New NameTable
          Dim nsManager As XmlNamespaceManager = New XmlNamespaceManager(nt)
          nsManager.AddNamespace("w", wordmlNamespace)

          ' Load the XML in the main document part into an XmlDocument instance.
          Dim xdoc As XmlDocument = New XmlDocument(nt)
          Using inStream As Stream = wdDoc.MainDocumentPart.GetStream()
             xdoc.Load(inStream)
          End Using

          ' Get the text nodes in the document.
          Dim nodes As XmlNodeList = xdoc.SelectNodes("//w:t", nsManager)
          Dim node As XmlNode
          Dim nodeText As String = ""

          ' Make the swap.
          Dim oldText As String = txtOldText.Text
          Dim newText As String = txtNewText.Text
          For Each node In nodes
             nodeText = node.FirstChild.InnerText
             If (InStr(nodeText, oldText) > 0) Then
                nodeText = nodeText.Replace(oldText, newText)
                ' Write the replaced text back to the node.
                node.FirstChild.InnerText = nodeText
                ' Increment the occurrences counter.
                numChanged += 1
             End If
          Next

          ' Write the changes back to the document.
          Using outStream As Stream = wdDoc.MainDocumentPart.GetStream(FileMode.Create)
             xdoc.Save(outStream)
          End Using
       End Using

       ' Display the number of change occurrences.
       txtNumChanged.Text = numChanged.ToString()
    End Sub
    

    First, the file name is passed into the procedure. Then, the file is opened as an Open XML Format WordprocessingML document with read and write access. Next, the namespace alias used in the document is added to a namespace manager. This namespace alias is used with the XPath query to find the text nodes in the document.

    Then, an instance of an XML document is created and the contents of the main document part in the Open XML Format file are loaded into the XML document. The structure of the XML in the main document part of a very simple document is as follows.

    <w:document>
       <w:body>
          <w:p>
             <w:r>
                <w:t> </w:t>
             </w:r>
          </w:p>
       </w:body>
    </w:document>
    

    Where
    w:p – Denotes a paragraph.
    w:r – Denotes a run of text nodes.
    w:t – Contains the text.

    Thus, the following XPath expression selects all of the text nodes in the main document part.

    nodes = xdoc.SelectNodes("//w:t", nsManager)
    

    The next section in the procedure loops through each of the text nodes and searches for an occurrence of the original term by using the InStr function. If the term is found, it is swapped with the new term and the updated text is written back to the node.

    For Each node In nodes
       nodeText = node.FirstChild.InnerText
       If (InStr(nodeText, oldText) > 0) Then
          nodeText = nodeText.Replace(oldText, newText)
          node.FirstChild.InnerText = nodeText
    

    Next, a counter for the number of change occurrences is incremented. After all of the nodes have been searched, the updated XML is saved back to the main document part. Finally, the txtNumChanged text box is set to the number of changes made.

    As you can see, this process is fairly straightforward and is made simple by the members of the DocumentFormat.OpenXml.Packaging namespace.

  10. Finally, add the following procedure to the project. This event procedure is called when the user clicks the Close Form button.

    Private Sub btnClose_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles btnClose.Click
       Close()
    End Sub
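
The search-and-replace loop described in the steps above can also be tried without a form or a document on disk. The following is a minimal sketch, not part of the article's project: the ReplaceSketch module and the ReplaceInTextNodes helper are hypothetical names, and the XML is a hand-written stand-in for a main document part. It uses only the System.Xml classes already imported in the project.

```vb
Imports System
Imports System.Xml

Module ReplaceSketch
    Const wordmlNamespace As String = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    ' Hypothetical helper: replaces oldText with newText in every w:t element
    ' and returns the number of w:t elements that were changed.
    Function ReplaceInTextNodes(ByVal xdoc As XmlDocument, ByVal oldText As String, ByVal newText As String) As Integer
        ' An empty search term cannot be replaced, so change nothing.
        If String.IsNullOrEmpty(oldText) Then Return 0

        Dim nsManager As New XmlNamespaceManager(xdoc.NameTable)
        nsManager.AddNamespace("w", wordmlNamespace)

        Dim changed As Integer = 0
        For Each node As XmlNode In xdoc.SelectNodes("//w:t", nsManager)
            Dim nodeText As String = node.InnerText
            If nodeText.Contains(oldText) Then
                ' Write the replaced text back to the node.
                node.InnerText = nodeText.Replace(oldText, newText)
                changed += 1
            End If
        Next
        Return changed
    End Function

    Sub Main()
        ' Hand-written sample XML standing in for a main document part.
        Dim xdoc As New XmlDocument()
        xdoc.LoadXml("<w:document xmlns:w=""" & wordmlNamespace & """><w:body><w:p>" & _
                     "<w:r><w:t>Four score and seven years ago</w:t></w:r>" & _
                     "<w:r><w:t>our fathers</w:t></w:r>" & _
                     "</w:p></w:body></w:document>")

        Dim changed As Integer = ReplaceInTextNodes(xdoc, "Four score", "Eighty")
        Console.WriteLine("Text nodes changed: " & changed.ToString())
        Console.WriteLine(xdoc.DocumentElement.InnerText)
    End Sub
End Module
```

Because each w:t element is searched on its own, a term that Word has split across two runs (for example, when part of the phrase is formatted differently) is not matched by this approach.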
    

Testing the Project

Before testing the project, you need to create some test documents in Word 2007 containing the term you want to search for. Also remember to add some documents in a subfolder below the root directory.

For my test, I created several documents containing the Gettysburg Address and replaced the term Four score with the term Eighty. As an aside, a score is equal to twenty years, so four score is equal to eighty years.

To test the project

  1. On the Debug menu, click Start Debugging.

  2. In the Search and Replace form, type in the path containing the files and subfolders.

  3. Type the term to search for and the term you want to swap it with.

  4. Click Search and Replace.

    If everything goes well, you see the number of changes displayed in the Number of Changes Made text box (see Figure 2).

  5. Click Close Form.

    Figure 2. The form displays the results of the search and replace process


Conclusion

As you saw in this article, using the members in the DocumentFormat.OpenXml.Packaging namespace makes working with Open XML Format files relatively simple. I encourage you to add other features to the project, such as letting the user specify the document types to search or providing the ability to search and replace text in Excel 2007 files.
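
As a starting point for letting the user choose the document types, the following minimal sketch collects the matching files before anything is changed. It is an illustration under assumptions, not part of the article's project: the FindDocumentsSketch module and the FindDocuments helper are hypothetical names, and the patterns array stands in for values that could come from a combo box on the form.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO

Module FindDocumentsSketch
    ' Hypothetical helper: returns the files under rootdir that match any of
    ' the given patterns. When includeSubfolders is True, it calls itself for
    ' each subfolder, so folders at every level are visited.
    Function FindDocuments(ByVal rootdir As String, ByVal patterns() As String, ByVal includeSubfolders As Boolean) As List(Of String)
        Dim found As New List(Of String)

        ' Collect the matching files in the current folder.
        For Each pattern As String In patterns
            found.AddRange(Directory.GetFiles(rootdir, pattern))
        Next

        ' Descend into the subfolders of the current folder if requested.
        If includeSubfolders Then
            For Each subdir As String In Directory.GetDirectories(rootdir)
                found.AddRange(FindDocuments(subdir, patterns, True))
            Next
        End If

        Return found
    End Function

    Sub Main(ByVal args() As String)
        ' Use the folder given on the command line, or the current folder.
        Dim startDir As String = Directory.GetCurrentDirectory()
        If args.Length > 0 Then startDir = args(0)

        ' Assumed patterns; in the form these could be chosen by the user.
        Dim patterns() As String = {"*.docx", "*.docm"}
        For Each docPath As String In FindDocuments(startDir, patterns, True)
            Console.WriteLine(docPath)
        Next
    End Sub
End Module
```

Returning a list keeps the folder search separate from the replacement step, so the files can be counted or shown to the user first. The Directory.GetFiles overload that takes SearchOption.AllDirectories is an alternative to the explicit recursion shown here.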

Additional Resources