Building Publishing Systems that Use Word 2010 or Word 2007
Summary: Using Word 2010 or Word 2007 as an important part of a content publishing system is a powerful approach. This article provides guidance about how to build a content management system and how to transform Open XML WordprocessingML documents to other document formats.
Applies to: Microsoft Word 2010 | Microsoft Office Word 2007
Published: March 2010
Provided by: Eric White, Microsoft Corporation
Building Publishing Systems
Custom publishing solutions are perhaps the most complex application of document formats such as Open XML WordprocessingML. Content authoring requires a publishing system to present a friendly, powerful user interface. Using Word 2007 and Word 2010 as a basic part of a content publishing system is a great way to do this. In the publishing process, you can transform Open XML WordprocessingML to your desired destination document format.
Some projects transform Open XML documents to another form for publishing, perhaps by using a specialized transform to HTML. These are typically larger projects with a dedicated development and test team, and they are not simple to build. The key to building such a transformation is to design it carefully. A development team may spend as much time writing the specifications for the system as it spends on code development, or more. Typically, such publishing systems are written in a language that is designed for writing document-centric transformations.
A typical approach to publishing systems uses styles as a means for supplying semantic meaning to content. Relying on styles alone is problematic in some scenarios, but content controls solve those problems. For example, you could format a document as follows.
Figure 1. Example of a document with styles
Figure 1 uses an option in Microsoft Word 2010 and Microsoft Office Word 2007 that enables you to display the style name for every paragraph to the left of the paragraph.
You could then write a transform from this to the desired format. This is problematic because extracting content involves grouping together adjacent paragraphs that share a particular style, and this can lead to an idiosyncratic experience for content writers. For example, I have seen a system where the writer needed to supply a specially formatted line at the start of a code block to indicate the language of the code. To supply code examples for multiple languages, the writer also needed to insert a blank line formatted with the Normal paragraph style between the code examples:
Figure 2. Blank line formatted with Normal
This is, of course, fragile. If the writer did not supply this blank line, the transform would conjoin the two code blocks. The developers who are writing the transform could instead watch for the magic lines that contain [c#] or [vb], but writing code that looks for magic values is never a good idea. What if some valid code contains that exact string on a line as part of a multiline string? The transform would break in a non-intuitive way, and worse, the writer would have to artificially modify the sample code to make the transform work correctly.
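The failure mode described above can be reproduced in a few lines. The following is a minimal sketch in Python, not code from any real publishing system. It assumes a hypothetical paragraph style named Code, builds small WordprocessingML fragments by hand with the hypothetical helpers para and body, and groups adjacent paragraphs by style the way a style-only transform would.

```python
from itertools import groupby
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def para(style, text):
    """Build one w:p element (as a string) with the given paragraph style."""
    return (f'<w:p><w:pPr><w:pStyle w:val="{style}"/></w:pPr>'
            f'<w:r><w:t>{text}</w:t></w:r></w:p>')


def body(*paras):
    """Wrap paragraphs in a w:body element that declares the w namespace."""
    return f'<w:body xmlns:w="{W}">' + "".join(paras) + "</w:body>"


def style_of(p):
    """Return the paragraph style id, or Normal when none is specified."""
    node = p.find("w:pPr/w:pStyle", NS)
    return node.get(f"{{{W}}}val") if node is not None else "Normal"


def text_of(p):
    """Concatenate the text of every w:t element in the paragraph."""
    return "".join(t.text or "" for t in p.iterfind(".//w:t", NS))


def code_blocks_by_style(xml):
    """Group adjacent paragraphs that use the (assumed) Code style."""
    paragraphs = ET.fromstring(xml).findall("w:p", NS)
    blocks = []
    for style, group in groupby(paragraphs, key=style_of):
        if style == "Code":
            blocks.append([text_of(p) for p in group])
    return blocks


WITH_SEPARATOR = body(
    para("Code", "[c#]"), para("Code", "int x = 1;"),
    para("Normal", ""),
    para("Code", "[vb]"), para("Code", "Dim x = 1"),
)
WITHOUT_SEPARATOR = body(
    para("Code", "[c#]"), para("Code", "int x = 1;"),
    para("Code", "[vb]"), para("Code", "Dim x = 1"),
)

print(len(code_blocks_by_style(WITH_SEPARATOR)))     # 2
print(len(code_blocks_by_style(WITHOUT_SEPARATOR)))  # 1
```

When the writer forgets the Normal separator, the two examples collapse into a single four-line block, and nothing in the markup lets the transform tell them apart.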
A better way to design a content publishing system is to use content controls to group multiple lines together, and to supply appropriate metadata about those lines.
Figure 3. Document with content control
The transform can then extract the contents of each code block in a deterministic manner.
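As an illustration of that deterministic extraction, here is a minimal sketch in Python. Content controls are stored in WordprocessingML as w:sdt elements, with properties in w:sdtPr and the wrapped paragraphs in w:sdtContent. The convention of storing the language in the control's tag as code:c# or code:vb is an assumption made for this example, as are the function name and the sample markup.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written sample: two adjacent content controls with no separator.
# The first code block deliberately contains the string [vb].
SAMPLE = f'''<w:body xmlns:w="{W}">
  <w:sdt>
    <w:sdtPr><w:tag w:val="code:c#"/></w:sdtPr>
    <w:sdtContent>
      <w:p><w:r><w:t>string s = "[vb]";</w:t></w:r></w:p>
      <w:p><w:r><w:t>Console.WriteLine(s);</w:t></w:r></w:p>
    </w:sdtContent>
  </w:sdt>
  <w:sdt>
    <w:sdtPr><w:tag w:val="code:vb"/></w:sdtPr>
    <w:sdtContent>
      <w:p><w:r><w:t>Dim s = 1</w:t></w:r></w:p>
    </w:sdtContent>
  </w:sdt>
</w:body>'''


def extract_code_blocks(xml):
    """Return (language, lines) for each content control tagged code:<lang>.

    The code: tag prefix is a hypothetical convention for this sketch.
    """
    blocks = []
    for sdt in ET.fromstring(xml).iterfind("w:sdt", NS):
        tag = sdt.find("w:sdtPr/w:tag", NS)
        value = (tag.get(f"{{{W}}}val") or "") if tag is not None else ""
        if not value.startswith("code:"):
            continue  # some other kind of content control
        lines = [
            "".join(t.text or "" for t in p.iterfind(".//w:t", NS))
            for p in sdt.iterfind("w:sdtContent/w:p", NS)
        ]
        blocks.append((value[len("code:"):], lines))
    return blocks


for language, lines in extract_code_blocks(SAMPLE):
    print(language, lines)
```

Because the language comes from the control's metadata rather than from the text, a code line that happens to contain [vb] is treated as ordinary content, and adjacent blocks stay separate without a blank separator paragraph.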
The following list summarizes this approach to building a content publishing system.
If there is a direct transform from a single paragraph in a source document to a corresponding construct in the transformed document, use paragraph styles. You can alter the user experience in Word to allow for only valid styles. One of the links in the summary of resources explains how to do this.
If there is a direct transform from runs in a paragraph to desired constructs in the transformed document, use character styles.
Using styles gives the writer a great user experience. The Word user interface (UI) is optimized to enable writers to select paragraphs and runs of text, and apply paragraph and character styles.
If you must select multiple paragraphs for transform to some construct in the transformed document, use content controls to group those paragraphs. This also gives the writer a great experience. The metadata about grouped paragraphs (such as the language of the code block in the previous example) is specified explicitly, rather than through some magically formatted line or some other questionable technique. If you are the developer for this system, you can supply macros for easy insertion of content controls, although it is easy to insert content controls by using the stock UI.
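The first two items in the list (paragraph styles and character styles mapping directly to output constructs) can be sketched as a pair of lookup tables. The following Python example is a minimal illustration under stated assumptions: the style-to-HTML mappings, the CodeInline character style, and all function names are hypothetical, and a production transform would need to handle many more constructs.

```python
import xml.etree.ElementTree as ET
from html import escape

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical mappings from style ids to HTML element names.
PARAGRAPH_TAGS = {"Heading1": "h1", "Heading2": "h2"}
CHARACTER_TAGS = {"CodeInline": "code", "Emphasis": "em"}

SAMPLE = f'''<w:body xmlns:w="{W}">
  <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Overview</w:t></w:r></w:p>
  <w:p><w:r><w:t xml:space="preserve">Call </w:t></w:r><w:r><w:rPr><w:rStyle w:val="CodeInline"/></w:rPr><w:t>Save()</w:t></w:r><w:r><w:t xml:space="preserve"> to finish.</w:t></w:r></w:p>
</w:body>'''


def style_id(element, path):
    """Return the w:val attribute of the style node at path, or None."""
    node = element.find(path, NS)
    return node.get(f"{{{W}}}val") if node is not None else None


def run_to_html(run):
    """Map one run to HTML by using its character style, if it has one."""
    text = escape("".join(t.text or "" for t in run.iterfind("w:t", NS)))
    tag = CHARACTER_TAGS.get(style_id(run, "w:rPr/w:rStyle"))
    return f"<{tag}>{text}</{tag}>" if tag else text


def paragraph_to_html(p):
    """Map one paragraph to HTML by using its paragraph style (default: p)."""
    tag = PARAGRAPH_TAGS.get(style_id(p, "w:pPr/w:pStyle"), "p")
    inner = "".join(run_to_html(r) for r in p.iterfind("w:r", NS))
    return f"<{tag}>{inner}</{tag}>"


def body_to_html(xml):
    """Transform each top-level paragraph independently of its neighbors."""
    root = ET.fromstring(xml)
    return "\n".join(paragraph_to_html(p) for p in root.iterfind("w:p", NS))


print(body_to_html(SAMPLE))
```

Each paragraph and run is transformed without looking at its neighbors, which is what makes styles a good fit for one-to-one mappings. In this sketch an unrecognized paragraph style falls back to a plain p element; a system that restricts writers to valid styles might prefer to report an error instead.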
Summary of Resources
Explores what a document-centric transform is, and describes an approach for writing them.
Describes an approach of converting an Open XML package to flat OPC, transforming it to a new form, and then converting back to Open XML OPC format.
Explains an approach for writing document-centric transforms by using Microsoft Visual C# 3.0.
Describes an approach for integrating content publishing systems with Word 2007.
Sometimes developers of publishing systems want to programmatically limit styles so that the transform is more deterministic.
Blog post series that explains issues that you can encounter when you transform Open XML WordprocessingML documents to another format, such as HTML.
CodePlex project for an XSLT-based transform from Open XML WordprocessingML to HTML.