Using Open XML WordprocessingML Documents as Data Sources

Office 2010

Summary: You can use Microsoft Office 2010 or the 2007 Microsoft Office system as part of a comprehensive collaboration system. You can send preformatted documents to your customers and extract data and content from those documents after they are returned. This article contains guidance and links to other resources to help you get started.

Last modified: April 11, 2011

Applies to: Office 2010 | Open XML | Visual Studio Tools for Microsoft Office | Word 2007 | Word 2010

Published: March 2010

Provided by: Eric White, Microsoft Corporation

Developers sometimes use word-processing documents as data sources to enable interesting collaboration applications. For example, a software system can send a carefully configured document by e-mail to multiple users. Each user then reads and fills out the document, perhaps by typing text into content controls, or by inserting rows into a table. Parts of the document that should not be edited by the user can be locked. This prevents confusion for the user, providing a more reliable collaboration system. The user then sends the document back, and the software system extracts the relevant information from the document.

When you use a document as a data source, you typically extract data such as integers, floating point numbers, or strings. A variation on this approach is to use a word-processing document as a content source. In this case, you extract content such as formatted paragraphs, formatted runs of text, and formatted tables. Developers may want to extract content from multiple locations in a source document and construct one or more new documents from that content. For example, inspectors may visit several construction sites and write a site report after each visit. These reports may contain narrative descriptions of progress, as well as factual details such as the number of linear feet of conduit installed or the cubic yards of concrete poured. Before each document is submitted to a software system, the appropriate content in the document is marked so that it can be extracted reliably.

The easiest way to write such a system is to use content controls to contain the data or content that you want to pull from the document. You can use the Open XML SDK for Microsoft Office to extract the data, or you can write a managed add-in to access the data.
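The following is a minimal sketch of this kind of extraction, written in Python with only the standard library rather than with the Open XML SDK. It assumes that the main document part is stored at word/document.xml (the usual location, although a robust reader would resolve it through the package relationships) and that each content control of interest has a tag. The function name read_content_controls is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def read_content_controls(docx_file):
    """Return {tag: text} for every tagged content control in the main part.

    Assumption: the main document part is word/document.xml.
    """
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("word/document.xml"))

    values = {}
    for sdt in root.iter(f"{{{W}}}sdt"):
        tag = sdt.find("w:sdtPr/w:tag", NS)
        content = sdt.find("w:sdtContent", NS)
        if tag is None or content is None:
            continue  # untagged controls are skipped in this sketch
        name = tag.get(f"{{{W}}}val")
        # Join every text node so that text split across runs is reassembled.
        values[name] = "".join(t.text or "" for t in content.iter(f"{{{W}}}t"))
    return values
```

If two content controls share a tag, the later one overwrites the earlier one in this sketch, and the text of a nested control also appears in the text of its parent. A real system would decide how to handle both cases.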

If users must supply tabular information, you can let them complete a table and then retrieve the data from that table. One approach is to enclose the table in a content control. You can then locate the table by its content control instead of relying on its position in the document (the first, second, or nth table), which can break if a user adds or removes a table.
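As an illustration of locating a table through its content control, the sketch below reads the rows of the table enclosed by a control with a given tag. It is written in Python with the standard library only and makes several assumptions: the main part is word/document.xml, the table is a direct child of the control's content, and the function name read_table_in_control and the tag value passed to it are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def read_table_in_control(docx_file, tag_name):
    """Return the table inside the content control tagged tag_name.

    The result is a list of rows, each a list of cell strings, or None
    if no such control encloses a table.
    """
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("word/document.xml"))

    for sdt in root.iter(f"{{{W}}}sdt"):
        tag = sdt.find("w:sdtPr/w:tag", NS)
        if tag is None or tag.get(f"{{{W}}}val") != tag_name:
            continue
        table = sdt.find("w:sdtContent/w:tbl", NS)
        if table is None:
            continue
        return [
            ["".join(t.text or "" for t in cell.iter(f"{{{W}}}t"))
             for cell in row.findall("w:tc", NS)]
            for row in table.findall("w:tr", NS)
        ]
    return None
```

Because the lookup is keyed on the tag, other tables that appear before or after the enclosed table do not affect the result. The sketch does not handle rows or cells that are themselves wrapped in content controls or revision markup.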

A more general and robust approach to extracting data from a document is to transform the WordprocessingML document into a simpler form and then extract the data from the transformed XML. This approach is especially useful when extracting data from a WordprocessingML table, and it is one reason to convert Open XML WordprocessingML documents to HTML. For more information, see Conversion of an Open XML WordprocessingML Document to Html. If you first convert to HTML by using the HtmlConverter class, you can then easily extract data from the tables in the HTML, knowing that issues such as tracked changes and split runs have already been addressed.
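To make the idea of a simpler form concrete, here is a small Python sketch that is not the HtmlConverter class and does not produce HTML. It reduces the main document XML to a flat list of paragraphs, joining split runs and showing the text as it would read if tracked changes were accepted (deleted and moved-from content is dropped, inserted content is kept). The output element names doc and p, and the function names, are hypothetical, and revision cases such as deleted paragraph marks are not handled.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Subtrees that disappear when tracked changes are accepted.
SKIPPED = {f"{{{W}}}del", f"{{{W}}}moveFrom"}


def accepted_text(element):
    """Concatenate w:t text under element, ignoring deleted or moved-away content."""
    if element.tag in SKIPPED:
        return ""
    if element.tag == f"{{{W}}}t":
        return element.text or ""
    return "".join(accepted_text(child) for child in element)


def simplify(document_xml):
    """Return a <doc> element with one <p> per WordprocessingML paragraph."""
    root = ET.fromstring(document_xml)
    simple = ET.Element("doc")
    for para in root.iter(f"{{{W}}}p"):
        ET.SubElement(simple, "p").text = accepted_text(para)
    return simple
```

Queries against the simplified tree no longer need to know about runs or revision markup, which is the benefit the transformation is meant to provide. Paragraphs nested inside other paragraphs (for example, in text boxes) would appear twice in this sketch.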

Summary of Resources



Working with Content Controls

Provides an introduction to content controls.

How To: Add Content Controls to Word Documents

Shows how to create document-level and application-level projects that add content controls at design time or run time.

Building Word 2007 Document Templates Using Content Controls

Shows how to write managed add-in code that accesses content controls.

Associating Data with Content Controls

Provides guidance and an example that shows how to associate arbitrary amounts of data with each content control in a document.

"How do I" articles on MSDN

Contains several topics about content controls and binding to content controls in various scenarios.

Visual "How to" articles (with accompanying video) for Word 2007

Contains several topics with accompanying videos that show how to implement various scenarios that require programming.

Converting Open XML WordprocessingML to Html

Blog post series that explains issues that you can encounter when you transform Open XML WordprocessingML documents to another format, such as HTML.

Transforming WordprocessingML to Simpler XML for Easier Processing

Shows one approach of transforming Open XML word-processing documents into simpler forms. You can then write your own application to do any necessary querying or additional processing of the transformed documents.

Using Nested Content Controls for Data and Content Extraction from Open XML WordprocessingML Documents

Shows how to nest content controls, which enables more complex semantic structure in WordprocessingML documents. Shows how to use Design Mode, which provides a more convenient user interface for content controls.