Building Server-Side Document Generation Solutions Using the Open XML Object Model (Part 1 of 2)
Summary: Learn the basics of the Open XML architecture and WordprocessingML. Discover the advantages of creating document packages and manipulating document parts by using the new Open XML object model. (7 printed pages)
Erika Ehrli, Microsoft Corporation
Applies to: 2007 Microsoft Office Suites, Microsoft Office Excel 2007, Microsoft Office PowerPoint 2007, Microsoft Office Word 2007
Download the accompanying sample: 2007 Office Sample: Building a Server-Side Document Generation Solution Using the Open XML Object Model.
Document assembly is one of the most common feature requirements in the world of information systems. Every day companies create millions of reports that help business decision makers get relevant information based on data stored in different line-of-business (LOB) systems.
Companies are always looking for software-based solutions that help them improve business processes, optimize workflows, and reduce the time and cost to output information to improve their business. Some need solutions to generate documents such as invoices, legal forms, memos, or letters for customers based on a template.
Some of these companies use Web-based applications or SharePoint sites as a front-end Web server to manipulate information. These applications may need to provide a means to export a Web-based report to Microsoft Office Word 2007 or Microsoft Office Excel 2007.
In the past, generating Microsoft Office documents from the server was a challenge. Earlier versions of Microsoft Office were based on binary file formats, so you needed to use Automation to manipulate and generate Microsoft Office documents. Although technically possible, server-side Automation is not a supported approach with earlier versions of Microsoft Office.
The 2007 Microsoft Office system provides a new feature that helps you generate server-side Word documents, Excel worksheets, and Microsoft Office PowerPoint 2007 presentations programmatically: the Open XML Formats.
This article, Building Server-Side Document Generation Solutions Using the Open XML Object Model (Part 1 of 2), reviews how to design and build a server-side solution, based on the Microsoft .NET Framework, to assemble Microsoft Office documents using the Open XML Formats and the Open XML object model. It describes the Open XML Formats architecture and the basic concepts for creating document packages, manipulating document parts, and writing WordprocessingML code using the Open XML object model. It also explores the architecture of a server-side document integration solution.
Part 2, Building Server-Side Document Generation Solutions Using the Open XML Object Model (Part 2 of 2), presents a business scenario and explores how to generate sales reports in Word 2007 by using a Microsoft ASP.NET 2.0 application.
The download, 2007 Office Sample: Building a Server-Side Document Generation Solution Using the Open XML Object Model, provides sample files in Microsoft Visual Basic 2005 and Microsoft Visual C# that show how to use the Open XML object model to assemble documents from a server-side application.
The Open XML Package specification defines documents as a set of XML files (document parts) and defines relationships between the document parts. Most document parts are XML files that describe application data, metadata, and even customer data, stored inside the container file (package). Other non-XML parts can also be included within the container package, including such document parts as binary files representing images or OLE objects embedded in the document. In addition, there are relationship parts that specify the relationships between document parts. This design provides the structure for a file in the 2007 Microsoft Office system.
While the document parts make up the contents of the file, the relationships describe how the document parts work together. It is also important to notice that relationships represent not only internal document references but also links to external resources. For example, if a document contains linked pictures or objects, these are also represented by using relationships. This makes links to external resources easy to locate, inspect, and change.
In Word 2007, the package represents a document. Within the package, there are parts that, when aggregated, compose the document. For example, a file in Word 2007 can contain (but is not limited to) the following folders and files:
docProps folder. Contains the application's properties parts.
App.xml file. Contains application-specific properties.
Core.xml file. Contains common file properties for all files based on the Open Packaging Conventions document format.
_rels folder. Stores the relationship part for a specified document part.
.rels file. A relationship part that describes the relationships that define the document structure.
document.xml. The only required part in the Word XML format file. This part contains XML markup (WordprocessingML) that defines the contents of the document.
[Content_Types].xml. Describes the content type for each part that appears in the package.
styles.xml file. Contains the style definition for the document.
An Open XML document is an Open Packaging Conventions (OPC) package. All document parts are stored in a container file (the package) by using the industry-standard ZIP format. This package holds all of the content that is contained within the document.
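As a rough illustration of the package layout described above, the following Python sketch writes the three parts a minimal word-processing package typically needs ([Content_Types].xml, _rels/.rels, and word/document.xml) into an in-memory ZIP container. It uses only the Python standard library rather than the Open XML object model, and the function name build_minimal_package is hypothetical.

```python
import io
import zipfile
from xml.sax.saxutils import escape

# Declares the content type of each part in the package.
CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

# Package-level relationship part that points to the main document part.
ROOT_RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)

# Main document part with a single paragraph, run, and text element.
DOCUMENT_TEMPLATE = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body><w:p><w:r><w:t>{text}</w:t></w:r></w:p></w:body></w:document>'
)


def build_minimal_package(text: str) -> bytes:
    """Return the bytes of a ZIP container holding a one-paragraph document."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", ROOT_RELS)
        # Escape the text so characters such as < and & stay well-formed.
        package.writestr("word/document.xml",
                         DOCUMENT_TEMPLATE.format(text=escape(text)))
    return buffer.getvalue()


if __name__ == "__main__":
    data = build_minimal_package("I saw a rabbit today.")
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        for name in package.namelist():
            print(name)
```

Saving the returned bytes to a file with a .docx extension should give a package that a consumer can navigate from the root relationship part to the main document part. The optional docProps and styles parts are left out to keep the sketch short.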
For more information related to the architecture of the Open XML Formats, see Introducing the Office (2007) Open XML File Formats.
For more information related to the architecture of the Word 2007 XML format, see the Walkthrough: Word 2007 XML Format.
Word 2007 documents are defined by using WordprocessingML markup. Microsoft created WordprocessingML based on the old binary format. WordprocessingML uses markup language to describe the textual contents and formatting of a document. WordprocessingML does not represent data in a hierarchical manner. Instead, a Word document is composed of a collection of stories (main document, comments, headers) and properties (styles, numbering definitions). Each story contains a specific type of hierarchy defined through the use of structured document tags (SDTs).
In WordprocessingML, stories are defined as unique regions of content into which the user can type. The most important story in a WordprocessingML document is the main document story, which contains the primary contents of the document.
The following code is typical of the WordprocessingML markup in a document.xml part in a Word 2007 document.
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r>
        <w:t>I saw a rabbit today.</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:document>
XML Declaration, Root Element, and Namespaces
The first line is the XML declaration, which specifies the XML version and the character encoding of the part.
Word supports UTF-8 encoding. Make sure you use UTF-8 encoding when you are creating WordprocessingML. For more information related to UTF-8 encoding, see UTF-8 and Unicode Standards.
The second line is the document root element. An XML document must contain one and only one root element. The root element defines the namespaces for the elements and attributes in the file.
For a complete list of the namespaces used in WordprocessingML documents, see the Ecma Office Open XML File Formats Standard Final draft - Part 4 - Markup Language Reference.
Main Document Story Parts, Paragraph, Runs, and Text
The main document story is stored inside the body element. Analogous to HTML, the content of the Word document is contained within a <w:body> tag.
Paragraphs <w:p> are the most basic unit of a WordprocessingML document. The fact that the paragraph element is inside the body element makes it part of the main document story. The paragraph element contains three pieces of information:
Optional paragraph properties
Inline content (typically runs)
Optional revision IDs used for document merge and compare actions
It can also contain inline structures such as:
Runs <w:r> containing <w:t> text regions.
Custom markup, which can occur in a block of text or inline text.
Annotations such as comments, tracked changes, bookmarks.
Fields such as date, page number, document title or creator.
A paragraph can occur at any location that allows block level content:
At the top-most level within a story (header, footer, main document)
Nested within a table cell
Nested within a structured document tag or within annotation markers
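To make the paragraph, run, and text hierarchy concrete, the following Python sketch reads the markup of a main document part and returns the text of each paragraph by joining the <w:t> elements found in its runs. It is a minimal illustration that uses the standard library XML parser; the names SAMPLE_DOCUMENT and paragraph_texts are hypothetical.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Markup equivalent to the document.xml example shown earlier.
SAMPLE_DOCUMENT = (
    b'<?xml version="1.0" encoding="utf-8" standalone="yes"?>'
    b'<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    b'<w:body><w:p><w:r><w:t>I saw a rabbit today.</w:t></w:r></w:p></w:body>'
    b'</w:document>'
)


def paragraph_texts(document_xml: bytes) -> list:
    """Return one string per <w:p> found in the main document story."""
    root = ET.fromstring(document_xml)
    body = root.find(f"{{{W_NS}}}body")
    texts = []
    for paragraph in body.iter(f"{{{W_NS}}}p"):
        # A paragraph's visible text is the concatenation of its <w:t> elements.
        pieces = [t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t")]
        texts.append("".join(pieces))
    return texts


if __name__ == "__main__":
    for line in paragraph_texts(SAMPLE_DOCUMENT):
        print(line)
```

The sketch ignores fields, annotations, and custom markup, so it is only suitable for simple content such as the example above.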
Tables are a set of paragraphs that are arranged into rows and columns. WordprocessingML tables contain block level content and are specified using the <w:tbl> element. This is analogous to the HTML <table> element.
A WordprocessingML table contains four types of content:
Properties. The <w:tblPr> section specifies properties specific to the table. You can define properties such as sizing, alignment, text wrap, styles, borders, shadings.
Grid. Defines a virtual grid used to lay out cells in the table. You use the <w:tblGrid> element to define a grid in the table.
Rows. The <w:tr> element defines a table row. This is analogous to the HTML <tr> element.
Cells. The <w:tc> element defines a table cell. This is analogous to the HTML <td> element.
The following sample code defines the properties and content of a table.
<w:tbl>
  <w:tblPr>
    <w:tblStyle w:val="LightGrid-Accent2" />
    <w:tblW w:w="0" w:type="auto" />
    <w:tblLook w:val="04A0" />
  </w:tblPr>
  <w:tblGrid>
    <w:gridCol w:w="2394" />
  </w:tblGrid>
  <w:tr w:rsidR="00D66682" w:rsidTr="00616703">
    <w:trPr>
      <w:cnfStyle w:val="100000000000" />
    </w:trPr>
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="2394" w:type="dxa" />
      </w:tcPr>
      <w:p>
        <w:r>
          <w:t>WordprocessingML is fun</w:t>
        </w:r>
      </w:p>
    </w:tc>
  </w:tr>
</w:tbl>
You cannot embed a <w:tbl> element in a <w:p> paragraph element.
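The four kinds of table content (properties, grid, rows, and cells) can also be generated from data. The following Python sketch builds a <w:tbl> element from a list of rows by using the standard library; the helper names w and build_table and the fixed column width are assumptions made for illustration, not part of the Open XML object model.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)  # serialize with the conventional w: prefix


def w(name: str) -> str:
    """Return the namespace-qualified name for a WordprocessingML tag or attribute."""
    return f"{{{W_NS}}}{name}"


def build_table(rows, col_width: int = 2394) -> ET.Element:
    """Build a <w:tbl> element with one <w:tr> per row and one <w:tc> per value."""
    column_count = max(len(row) for row in rows)
    tbl = ET.Element(w("tbl"))

    # Properties: let the consumer size the table automatically.
    tbl_pr = ET.SubElement(tbl, w("tblPr"))
    ET.SubElement(tbl_pr, w("tblW"), {w("w"): "0", w("type"): "auto"})

    # Grid: one <w:gridCol> per column.
    grid = ET.SubElement(tbl, w("tblGrid"))
    for _ in range(column_count):
        ET.SubElement(grid, w("gridCol"), {w("w"): str(col_width)})

    # Rows and cells: each cell holds block-level content (a paragraph).
    for row in rows:
        tr = ET.SubElement(tbl, w("tr"))
        for value in row:
            tc = ET.SubElement(tr, w("tc"))
            tc_pr = ET.SubElement(tc, w("tcPr"))
            ET.SubElement(tc_pr, w("tcW"), {w("w"): str(col_width), w("type"): "dxa"})
            paragraph = ET.SubElement(tc, w("p"))
            run = ET.SubElement(paragraph, w("r"))
            ET.SubElement(run, w("t")).text = str(value)
    return tbl


if __name__ == "__main__":
    table = build_table([["Territory", "Sales"], ["Northwest", "1200"]])
    print(ET.tostring(table, encoding="unicode"))
```

Because every cell wraps its text in a paragraph, the generated markup respects the rule that a table contains paragraphs and not the other way around.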
Style Part, Styles, and Formatting
A style part defines a specific set of values for formatting properties that you can apply as a single logical unit. For example, the Normal style in Word 2007 defines these formatting properties:
Font = Calibri (body)
Font Size = 11 point
Font Language = Word default (as configured by user)
Justification = Left
Line Spacing = Single
Within the Open XML package, styles are stored in a unique part (styles.xml) that contains style definitions in <w:style> elements.
Line breaks are added to the following code example to facilitate online viewing. You must remove them before using this in your own code.
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<w:styles xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
          xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:docDefaults> … </w:docDefaults>
  <w:latentStyles w:defLockedState="0" w:defUIPriority="99" w:defSemiHidden="1"
                  w:defUnhideWhenUsed="1" w:defQFormat="0" w:count="267">
    <w:lsdException w:name="Normal" w:semiHidden="0" w:uiPriority="0"
                    w:unhideWhenUsed="0" w:qFormat="1" />
    <w:lsdException w:name="heading 1" w:semiHidden="0" w:uiPriority="9"
                    w:unhideWhenUsed="0" w:qFormat="1" />
    <w:lsdException w:name="Title" w:semiHidden="0" w:uiPriority="10"
                    w:unhideWhenUsed="0" w:qFormat="1" />
    <w:lsdException w:name="Light List Accent 1" w:semiHidden="0" w:uiPriority="61"
                    w:unhideWhenUsed="0" />
  </w:latentStyles>
  <w:style w:type="paragraph" w:default="1" w:styleId="Normal">
    <w:name w:val="Normal" />
    <w:qFormat />
  </w:style>
  <w:style w:type="character" w:default="1" w:styleId="DefaultParagraphFont"> … </w:style>
  <w:style w:type="table" w:default="1" w:styleId="TableNormal"> … </w:style>
  <w:style w:type="numbering" w:default="1" w:styleId="NoList"> … </w:style>
</w:styles>
Styles define the appearance of a document. You define styles in the styles.xml part and you reference them by their IDs when you want to use them to format tables, paragraphs, numbered lists, and so on. For example, the following paragraph references a style through the <w:pStyle> element in its paragraph properties.
<w:p>
  <w:pPr>
    <w:pStyle w:val="Heading 3" />
    <w:spacing w:before="200" w:after="0" />
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:b />
      <w:sz w:val="26" />
    </w:rPr>
    <w:t>Sales by Territory - Northwest</w:t>
  </w:r>
</w:p>
In Open XML, you must perform four tasks to work with styles:
Create the style part in the document package MySampleDocument.docx\word\styles.xml. This file contains style markup definition as shown in the previous code listing.
Define the relationship at the MySampleDocument.docx\word\_rels\document.xml.rels file.
Define the style content type at the MySampleDocument.docx\[Content_Types].xml file.
Use a style in a document by adding a reference to a predefined style.
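The four tasks above can be sketched as the markup that each one contributes to the package. The following Python example returns that markup keyed by part name; it is an illustration under stated assumptions, not the article's sample code. The style ID SalesHeading, the style name, and the function name styled_package_parts are hypothetical, and the package-level _rels/.rels part is omitted for brevity.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
STYLE_ID = "SalesHeading"  # hypothetical style ID used only in this sketch


def styled_package_parts(text: str) -> dict:
    """Return part name -> markup covering the four style-related tasks."""
    # Task 1: the style part, holding one paragraph style definition.
    styles = (
        f'<w:styles xmlns:w="{W_NS}">'
        f'<w:style w:type="paragraph" w:styleId="{STYLE_ID}">'
        '<w:name w:val="Sales Heading"/>'
        '<w:rPr><w:b/><w:sz w:val="26"/></w:rPr>'
        '</w:style></w:styles>'
    )
    # Task 2: the relationship from the main document part to the style part.
    rels = (
        '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
        '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
        'officeDocument/2006/relationships/styles" Target="styles.xml"/>'
        '</Relationships>'
    )
    # Task 3: the content type of the style part.
    content_types = (
        '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
        '<Default Extension="rels" '
        'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
        '<Default Extension="xml" ContentType="application/xml"/>'
        '<Override PartName="/word/document.xml" ContentType="application/'
        'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
        '<Override PartName="/word/styles.xml" ContentType="application/'
        'vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"/>'
        '</Types>'
    )
    # Task 4: a paragraph that references the style by its ID.
    document = (
        f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
        f'<w:pPr><w:pStyle w:val="{STYLE_ID}"/></w:pPr>'
        f'<w:r><w:t>{escape(text)}</w:t></w:r>'
        '</w:p></w:body></w:document>'
    )
    return {
        "word/styles.xml": styles,
        "word/_rels/document.xml.rels": rels,
        "[Content_Types].xml": content_types,
        "word/document.xml": document,
    }


if __name__ == "__main__":
    for name, markup in styled_package_parts("Sales by Territory - Northwest").items():
        ET.fromstring(markup)  # raises ParseError if the markup is not well-formed
        print(name)
```

The value of w:val on <w:pStyle> must match the w:styleId of a definition in the style part; the relationship and content type entries are what let a consumer find and interpret that part.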
Image Part and Pictures
Pictures are part of the DrawingML namespace. The DrawingMLMain namespace defines all of the base constructs for all types of DrawingML objects (charts, diagrams, shapes, pictures, and so on). To insert an image into a document, you insert a <w:drawing> element inside a run <w:r> element.
The <pic:pic> element defines the picture and the <pic:blipFill> element specifies the type of picture fill that the picture object has.
<w:p>
  <w:r>
    <w:drawing>
      <wp:inline xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing">
        <wp:extent cx="1628936" cy="733586" />
        <wp:docPr name="image.jpeg" id="1" />
        <a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
          <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
            <pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
              <pic:nvPicPr>
                <pic:cNvPr id="0" name="image.jpeg" />
                <pic:cNvPicPr />
              </pic:nvPicPr>
              <pic:blipFill>
                <a:blip r:embed="rId2"
                        xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" />
                <a:stretch>
                  <a:fillRect />
                </a:stretch>
              </pic:blipFill>
              <pic:spPr>
                <a:xfrm>
                  <a:off x="0" y="0" />
                  <a:ext cx="1630190" cy="734151" />
                </a:xfrm>
                <a:prstGeom prst="rect" />
              </pic:spPr>
            </pic:pic>
          </a:graphicData>
        </a:graphic>
      </wp:inline>
    </w:drawing>
  </w:r>
</w:p>
You add a reference to the actual image with a relationship defined by using the r:embed attribute of the <a:blip> element. The relationship points to an image part in the package.
In Open XML, you must complete four tasks to add images to document parts and show them as document content:
Create the image part in the document package MySampleDocument.docx\media\myimage.jpg.
Define the relationship at the MySampleDocument.docx\word\_rels\document.xml.rels file.
Define the image content type at the MySampleDocument.docx\[Content_Types].xml file.
Finally, insert a <w:drawing> element inside a run <w:r> element of the document, as shown in the previous code listing.
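The link between the r:embed attribute and the image part can be followed programmatically. The following Python sketch looks up every <a:blip> element in a main document part and resolves its relationship ID against the relationship part. The sample markup is heavily trimmed, the Target value media/myimage.jpg is an assumed location relative to the main document part, and the function name embedded_image_targets is hypothetical. A JPEG image part would additionally need a content type entry such as <Default Extension="jpg" ContentType="image/jpeg"/> in [Content_Types].xml.

```python
import xml.etree.ElementTree as ET

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# Trimmed main document markup: only the elements needed to reach <a:blip>.
SAMPLE_DOCUMENT = (
    b'<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"'
    b' xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"'
    b' xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">'
    b'<w:body><w:p><w:r><w:drawing><a:blip r:embed="rId2"/></w:drawing></w:r></w:p>'
    b'</w:body></w:document>'
)

# Relationship part for the main document (word/_rels/document.xml.rels).
SAMPLE_RELS = (
    b'<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    b'<Relationship Id="rId2" Type="http://schemas.openxmlformats.org/'
    b'officeDocument/2006/relationships/image" Target="media/myimage.jpg"/>'
    b'</Relationships>'
)


def embedded_image_targets(document_xml: bytes, rels_xml: bytes) -> list:
    """Return the relationship targets of all embedded pictures, in document order."""
    relationships = ET.fromstring(rels_xml)
    target_by_id = {
        rel.get("Id"): rel.get("Target")
        for rel in relationships.iter(f"{{{REL_NS}}}Relationship")
    }
    document = ET.fromstring(document_xml)
    targets = []
    for blip in document.iter(f"{{{A_NS}}}blip"):
        rel_id = blip.get(f"{{{R_NS}}}embed")
        if rel_id in target_by_id:  # skip linked or unresolved pictures
            targets.append(target_by_id[rel_id])
    return targets


if __name__ == "__main__":
    print(embedded_image_targets(SAMPLE_DOCUMENT, SAMPLE_RELS))
```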
Word 2007 allows you to create "what you see is what you get" (WYSIWYG) documents with WordprocessingML markup. This is a very powerful feature offered by the 2007 Microsoft Office system. You can create elaborate documents with autonumbered lists, custom content XML, themes, footers, headers, and more by using WordprocessingML markup.
For more information about creating advanced documents and using WordprocessingML markup, see the Ecma Office Open XML File Formats Standard Final draft - Part 4 - Markup Language Reference. This document provides the schema definition for WordprocessingML. It is an essential reference for generating Word 2007 documents programmatically. The guide also provides reference material for SpreadsheetML, PresentationML, DrawingML and additional Microsoft Office markup schemas.
Manipulating Document Packages by using the Open XML Object Model
You can manipulate document packages manually by using standard ZIP utilities. You can also explore and manipulate the contents (markup) of the files stored in a document package by using tools such as the Open XML Package Explorer or XMLSpy.
You can also manipulate document packages and the document contents programmatically by using Microsoft developer applications, third-party solutions, or open source APIs.
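Because a package is an ordinary ZIP container, any ZIP library can read and rewrite its parts. The following Python sketch copies a package and replaces a string in the main document part along the way. It is a deliberately naive illustration with a hypothetical function name: text that Word has split across several runs will not be matched, and the replacement text is not XML-escaped.

```python
import io
import zipfile


def replace_text_in_package(package: bytes, old: str, new: str) -> bytes:
    """Return a copy of the package with old replaced by new in word/document.xml."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package)) as source, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
        for name in source.namelist():
            data = source.read(name)
            if name == "word/document.xml":
                # Parts are UTF-8 encoded XML; edit the markup as text.
                data = data.decode("utf-8").replace(old, new).encode("utf-8")
            # All other parts are copied unchanged, in their original order.
            target.writestr(name, data)
    return output.getvalue()


if __name__ == "__main__":
    # Build a tiny stand-in package so the sketch runs on its own.
    original = io.BytesIO()
    with zipfile.ZipFile(original, "w") as package:
        package.writestr("word/document.xml", "<w:t>I saw a rabbit today.</w:t>")
    updated = replace_text_in_package(original.getvalue(), "rabbit", "fox")
    with zipfile.ZipFile(io.BytesIO(updated)) as package:
        print(package.read("word/document.xml").decode("utf-8"))
```

For anything beyond simple substitutions, an API that understands parts and relationships, such as the Open XML object model described next, is usually the safer choice.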
If you are building a document integration solution based on the .NET Framework, you should consider using the Open XML Format SDK 1.0. This software development kit (SDK) contains the Open XML object model that allows you to create packages and manipulate the files that make up the packages.
Using the Open XML object model is simple. In your project or application, add a reference to the Microsoft.Office.DocumentFormat.OpenXml.dll. You can also download the Open XML Format SDK 1.0 from the Microsoft Office Developer Center.
When you build the business logic layer of your application, consider creating a custom-built report generation helper class that uses the Open XML object model. By using the Open XML object model, the code you write to create and manipulate a document package and its parts is simplified significantly. However, the amount of markup code you need to create depends on the complexity of the document content. As with all markup languages, WordprocessingML is verbose.
The following Microsoft Visual Basic 2005 code example shows how to create a simple document package with a main document part and simple WordprocessingML content.
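' Namespaces exposed by Microsoft.Office.DocumentFormat.OpenXml.dll (Open XML Format SDK 1.0).
Imports System.IO
Imports System.Text
Imports Microsoft.Office.DocumentFormat.OpenXml
Imports Microsoft.Office.DocumentFormat.OpenXml.Packaging

' Creates a new Word 2007 package at the path passed in document.
Public Sub CreateNewWordDocument(ByVal document As String)
    Using wordDoc As WordprocessingDocument = _
        WordprocessingDocument.Create(document, WordprocessingDocumentType.Document)
        ' Set the content of the document so that Word can open it.
        Dim mainPart As MainDocumentPart = wordDoc.AddMainDocumentPart()
        SetMainDocumentContent(mainPart)
    End Using
End Sub

' Writes minimal WordprocessingML markup to the main document part.
Public Sub SetMainDocumentContent(ByVal part As MainDocumentPart)
    Const docXml As String = _
        "<?xml version=""1.0"" encoding=""UTF-8"" standalone=""yes""?>" & _
        "<w:document xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">" & _
        "<w:body><w:p><w:r><w:t>I love coffee</w:t></w:r></w:p></w:body></w:document>"
    ' Encode the markup as UTF-8 and write it to the part's stream.
    Dim utf8encoder1 As UTF8Encoding = New UTF8Encoding()
    Dim buf() As Byte = utf8encoder1.GetBytes(docXml)
    Using stream1 As Stream = part.GetStream()
        stream1.Write(buf, 0, buf.Length)
    End Using
End Sub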
Compatibility with Earlier Versions
A common requirement for document integration solutions is backward compatibility with earlier versions of the Microsoft Office system. The document packages created with the Open Packaging Conventions have a different format than the old binary formats used in earlier versions of Microsoft Office.
Documents that you generate by using the Open XML Formats can still be opened, edited, and saved by users of earlier versions of Microsoft Office if they download and install the Microsoft Office Compatibility Pack for Word, Excel, and PowerPoint 2007. This free download adds Open XML support to Word, Excel, and PowerPoint in Microsoft Office 2000, Microsoft Office XP, and Microsoft Office 2003.
You can categorize solutions developed on the Microsoft Office platform as Office Business Applications (OBA). A common task of an OBA solution is document integration, which is automating the generation of documents with data from another system or processing the documents to extract data.
You can combine different Microsoft Office programs, tools, and technologies depending on what you are trying to accomplish. You can use Microsoft Visual Studio 2005 Tools for Office, the extensible Microsoft Office Fluent user interface, Microsoft SharePoint Products and Technologies, and the Office Open XML Formats to bring together different building blocks to create a custom OBA solution.
A minimal server-side document integration solution integrates LOB systems and data with custom LOB software and components that help you build a richer presentation layer for the user.
You can deconstruct a server-side document integration solution into four tiers:
LOB systems and data
LOB software and components
Front-end Web server
Client application
LOB Systems and Data
The LOB systems and data encapsulate the data layer logic of an application. Companies can store data in different cross-platform LOB systems such as SAP, Siebel, Microsoft Dynamics, SharePoint sites, or other third-party applications. Data can also be stored in database systems such as Microsoft SQL Server or Oracle.
You can create custom data helper classes to encapsulate the process of retrieving data from different LOB systems. The data helper classes can use Web services, third-party APIs, or ODBC connections to access the different LOB back-end systems and data.
LOB Software and Components
The LOB software and components encapsulate the business logic tier of an application. You can build custom classes to define business processes and workflows that consume existing LOB services or create additional ones. You can also use your application in a services-oriented architecture. You can create custom report generation classes to encapsulate the process of document integration. These classes use the Open XML object model to generate document packages and output document content based on LOB data.
You do not need to install the 2007 Microsoft Office system on the server to generate Microsoft Office documents. The Open XML Formats enable you to generate and manipulate documents without Automation.
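The separation between data helper classes and report generation classes can be sketched as follows. This Python example is only an outline of the layering, with hypothetical class names and hard-coded sample rows standing in for a real LOB system; a .NET solution would implement the same roles with the Open XML object model.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


@dataclass
class SalesRow:
    territory: str
    amount: float


class InMemorySalesData:
    """Hypothetical data helper; a real one would query an LOB system."""

    def get_sales(self) -> list:
        # Illustrative values only, not real data.
        return [SalesRow("Northwest", 1200.0), SalesRow("Southeast", 950.5)]


class SalesReportGenerator:
    """Hypothetical report class that turns LOB data into WordprocessingML."""

    def __init__(self, data_helper):
        self._data_helper = data_helper

    def build_document_xml(self) -> str:
        # One paragraph per row; the result is the markup of the main document part.
        paragraphs = "".join(
            "<w:p><w:r><w:t>"
            + escape(f"{row.territory}: {row.amount:.2f}")
            + "</w:t></w:r></w:p>"
            for row in self._data_helper.get_sales()
        )
        return (
            f'<w:document xmlns:w="{W_NS}">'
            f"<w:body>{paragraphs}</w:body></w:document>"
        )


if __name__ == "__main__":
    generator = SalesReportGenerator(InMemorySalesData())
    print(generator.build_document_xml())
```

Keeping the data helper behind a small interface means the report class does not change when the data moves from one LOB system to another.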
Front-end Web Server
A front-end Web server encapsulates the productivity layer of an application. You can use a SharePoint site or an ASP.NET application to process, filter, select, and define custom reports. The front-end Web server can provide Web Parts or Web pages that have buttons or predefined custom actions that enable users to export reports to Word, Excel, or PowerPoint.
Client Application
The client application represents the presentation layer of the solution. After the server-side solution creates the documents, you can open the reports using Microsoft Office programs to view or manipulate the documents.
Companies around the world need systems to help generate documents that provide information contained in different LOB systems and data.
The Open XML Formats allow you to create document packages and document parts that store document content as WordprocessingML for Word documents, SpreadsheetML for Excel workbooks, and PresentationML for PowerPoint presentations.
The Open XML object model simplifies the process of programmatically generating and manipulating Open XML document packages and document parts in applications that use the .NET Framework.
You can extract and expose LOB systems data by building custom data helper classes (the data access layer). You can also build custom report generation classes (the business logic layer) and use the Open XML object model to generate and manipulate Word 2007 documents, Excel 2007 worksheets, or PowerPoint 2007 presentations. You can expose "export to" Word, Excel, or PowerPoint functionality through front-end Web servers such as SharePoint sites or through ASP.NET applications.
Part 2 of this document shows you how to build a server-side document solution by using the Open XML object model and ASP.NET 2.0.
I want to thank Doug Mahugh, Wouter van Vugt, and Frank Rice for their contributions to this article.
Ecma Office Open XML File Formats Standard Final draft - Part 4 - Markup Language Reference. Note: This is a download, in PDF format, of substantial size.