Transforming Open XML WordprocessingML to XHTML Using the Open XML SDK 2.0

Summary:   Learn how to transform Open XML WordprocessingML documents to XHTML by using the Open XML SDK 2.0 and the HtmlConverter class.

Applies to: Office 2007 | Office 2010 | Open XML | Visual Studio Tools for Microsoft Office | Word 2007 | Word 2010

In this article
Benefits to Converting WordprocessingML to XHTML
Using the HtmlConverter Class
Example: Use of HtmlConverter Without Formatting or Images
Converting Images
Supplying a CSS to the Transformation
Accepting Revisions Before Transforming
Simplifying Markup Before Transforming
Recursive Pure Functional Transformations
Other Limitations
Conclusion
Additional Resources

Published:   April 2010

Provided by:   Eric White, Microsoft Corporation

Benefits to Converting WordprocessingML to XHTML

Converting WordprocessingML documents to XHTML has several interesting uses.

  • You may want to provide a simple preview of Open XML WordprocessingML documents in your application. By converting to XHTML, you can use one of many HTML viewers to integrate this functionality.

  • Another interesting application is interoperating with other software systems where you can use HTML to achieve rich functionality. For example, you could populate a SharePoint wiki from a set of word-processing documents.

  • For developers, a transform to XHTML is especially interesting. If you need a simple query of a word-processing document, perhaps selecting paragraphs by content or style name, it is easier to write the query for XHTML than it is for WordprocessingML.
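To illustrate the last point, here is a minimal sketch of such a query over XHTML by using LINQ to XML. It assumes that the XHTML carries a class attribute on each paragraph, as described later in this article. The class name XhtmlQuerySketch and the method name are hypothetical and are not part of PowerTools for Open XML.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XhtmlQuerySketch
{
    private static readonly XNamespace Xhtml = "http://www.w3.org/1999/xhtml";

    // Returns the text of every XHTML paragraph whose class attribute
    // equals the specified class name.
    public static List<string> ParagraphTextByClass(XElement html, string className)
    {
        return html
            .Descendants(Xhtml + "p")
            .Where(p => (string)p.Attribute("class") == className)
            .Select(p => p.Value)
            .ToList();
    }
}
```

For example, calling XhtmlQuerySketch.ParagraphTextByClass(html, "PtNormal") would return the text of the paragraphs that carry the class PtNormal. The equivalent WordprocessingML query would have to look inside paragraph properties and concatenate the text of individual runs.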

The converter presented in this article has some limitations. It does not transform formatting such as styles and fonts. Instead it transforms the actual textual content and images. Although the example does not transform formatting from the source document, you can supply a cascading style sheet that defines appropriate classes for paragraph styles. The cascading style sheet can then supply formatting at a paragraph level.

Using the HtmlConverter Class

To download the HtmlConverter class and examples, go to PowerTools for Open XML, click the Downloads tab, and download HtmlConverter.zip.

To build HtmlConverter, you must also download and install the Open XML SDK 2.0 for Microsoft Office.

If you are setting up your own project to include the HtmlConverter class, you need to add references to the following four assemblies:

  • DocumentFormat.OpenXml

  • OpenXmlPowerTools

  • System.Drawing

  • WindowsBase

Figure 1 shows a WordprocessingML document that I used as a test source.

Figure 1. Test source WordprocessingML document

Screen clipping of Word document

When transformed to XHTML, the document looks like Figure 2.

Figure 2. Rendered XHTML

Screen clipping of Internet Explorer

This first version of this example has the following goals.

  • Accurately transform the text of the document. It transforms the text of each paragraph to an appropriate XHTML element: either a paragraph element (x:p) or an appropriate heading element (such as x:h1 or x:h2). The textual transform is robust. It accurately transforms the text even if the WordprocessingML contains revision tracking, content controls, or various annotations such as fields.

  • Do not convert formatting, such as paragraph fonts or character fonts. However, the HtmlConverter class optionally adds a class attribute (derived from the paragraph style name) for each paragraph or heading element. You can then supply a cascading style sheet that applies basic formatting at the paragraph level.

  • Transform WordprocessingML tables to simple XHTML tables. The transform is recursive. It handles tables that are embedded in cells in other tables.

  • Transform hyperlinks appropriately. This is somewhat more complex than it first seems, because there are two forms of markup for hyperlinks.

  • Transform numbered lists and bulleted lists. The article Working with Numbered Lists in Open XML WordprocessingML explains the semantics of the markup for numbered and bulleted lists. The code in this article includes a new class, the ListItemRetriever class, which retrieves the text of the list item for any paragraph.

  • Correctly transform images. This issue is somewhat more difficult because images in WordprocessingML are embedded in the package (the .zip file), but in XHTML, images have their own URL. You must write images to separate files on the disk, or upload them to a server depending on how you want to use the XHTML. The browser that displays the XHTML needs access to those images. The HtmlConverter class enables you to supply an event handler that takes an image as an argument and returns the markup to insert in the XHTML. In that event handler, you can upload or save the image as appropriate, and then return an XHTML IMG element that contains the appropriate URI or URL that points to the image.

The most important goal of the HtmlConverter class is to give you source code and guidance on how to perform this kind of transformation and to provide a flexible starting point so that you can code your own conversion to HTML that meets your goals.

Example: Use of HtmlConverter Without Formatting or Images

The following example shows the simplest use of the HtmlConverter class.

// This example shows the simplest conversion. No images are converted.
// A cascading style sheet is not used.
byte[] byteArray = File.ReadAllBytes("Test.docx");
using (MemoryStream memoryStream = new MemoryStream())
{
    memoryStream.Write(byteArray, 0, byteArray.Length);
    using (WordprocessingDocument doc =
        WordprocessingDocument.Open(memoryStream, true))
    {
        HtmlConverterSettings settings = new HtmlConverterSettings()
        {
            PageTitle = "My Page Title"
        };
        XElement html = HtmlConverter.ConvertToHtml(doc, settings);

        // Note: The XHTML returned by the ConvertToHtml method contains objects of type
        // XEntity. PtOpenXmlUtil.cs defines the XEntity class. See
        // https://blogs.msdn.com/ericwhite/archive/2010/01/21/writing-entity-references-using-linq-to-xml.aspx
        // for a detailed explanation.
        //
        // If you further transform the XML tree returned by the ConvertToHtml method, you
        // must do it correctly, or entities do not serialize properly.

        File.WriteAllText("Test.html", html.ToStringNewLineOnAttributes());
    }
}

The HtmlConverter class performs two transformations on the Open XML source document before converting it to XHTML. First, it accepts all revisions. Second, it simplifies the markup. You do not want to write these changes back to the source document. Therefore, all examples use the approach of reading the source document into a byte array, creating a resizable memory stream, writing the byte array to the memory stream, and then opening the word-processing document from the memory stream. For more information about this, see the blog entry Simplifying Open XML WordprocessingML Queries by First Accepting Revisions.
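The reason for writing the bytes into a parameterless MemoryStream, instead of wrapping the byte array directly, can be illustrated with a small sketch. A MemoryStream that is constructed over an existing byte array has a fixed capacity, so it cannot grow if the modified package needs more room. The class and method names below are hypothetical and are not part of PowerTools for Open XML.

```csharp
using System;
using System.IO;

public static class MemoryStreamSketch
{
    // Copies the bytes into an expandable MemoryStream, as the examples in this article do.
    public static MemoryStream ToResizableStream(byte[] bytes)
    {
        MemoryStream stream = new MemoryStream();   // parameterless constructor: expandable
        stream.Write(bytes, 0, bytes.Length);
        return stream;
    }

    // Returns true if the stream accepts a write past its current end.
    // Note that this probe appends one byte when it succeeds.
    public static bool CanGrow(MemoryStream stream)
    {
        try
        {
            stream.Seek(0, SeekOrigin.End);
            stream.WriteByte(0);
            return true;
        }
        catch (NotSupportedException)
        {
            // Thrown by a fixed-size stream that was created over an existing array.
            return false;
        }
    }
}
```

With this sketch, CanGrow(ToResizableStream(bytes)) returns true, whereas CanGrow(new MemoryStream(bytes)) returns false for a non-empty array, which is why the examples do not open the document directly over the byte array.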

You set various conversion settings by creating an instance of the HtmlConverterSettings class, setting fields in it, and passing it to the HtmlConverter.ConvertToHtml method. The Open XML source document does not necessarily contain text that is appropriate for the page title of the XHTML page. Therefore, you specify the page title by using the HtmlConverterSettings object.

The HtmlConverter.ConvertToHtml method returns an XElement object that contains the XHTML. This example writes the XHTML by using an extension method, ToStringNewLineOnAttributes, which is defined in the PtExtensions class, from the PtUtil.cs module. This extension method serializes the XML with each attribute on its own line, which makes the XHTML easier to read when an element contains several attributes. The following XML shows attributes, each on its own line.

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta
      http-equiv="Content-Type"
      content="text/html; charset=windows-1252" />
    <meta
      name="Generator"
      content="PowerTools for Open XML" />
    <title>My Page Title</title>
  </head>
  <body>
    <p>This is a test document to transform from WordprocessingML to XHTM.</p>
    <p />
    <p>Find out more about <A
        href="https://www.codeplex.com/powertools">PowerTools for Open XML</A>.</p>
    . . .
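One plausible way to produce this kind of layout is the NewLineOnAttributes property of the XmlWriterSettings class. The following sketch shows the idea; it is an assumption about how such a method could be written, not necessarily how ToStringNewLineOnAttributes is implemented in PtUtil.cs, and the names are hypothetical.

```csharp
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class AttributeFormattingSketch
{
    // Serializes an element with indentation and with each attribute on its own line.
    public static string ToStringWithAttributesOnNewLines(XElement element)
    {
        XmlWriterSettings settings = new XmlWriterSettings
        {
            Indent = true,
            OmitXmlDeclaration = true,
            NewLineOnAttributes = true
        };
        StringBuilder sb = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            element.WriteTo(writer);
        }
        return sb.ToString();
    }
}
```

Note that a plain XmlWriter approach like this one does not by itself address entity serialization, which is discussed in the note that follows.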

Important

To serialize entities correctly, the XHTML returned by the HtmlConverter.ConvertToHtml method contains objects of type XEntity. PtOpenXmlUtil.cs defines the XEntity class. For more information, see Writing Entity References using LINQ to XML. If you further transform the XML tree that the method returns, you must take care to preserve the XEntity objects, or entities do not serialize correctly.
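To make the idea concrete, the following sketch shows one way a text node can be made to serialize as an entity reference: derive from XText and override WriteTo so that it calls XmlWriter.WriteEntityRef. The class EntityNodeSketch is a hypothetical stand-in; the actual XEntity class in PtOpenXmlUtil.cs may differ in its details.

```csharp
using System.Xml;
using System.Xml.Linq;

// Hypothetical stand-in for an entity node; not the PowerTools XEntity class itself.
public class EntityNodeSketch : XText
{
    public EntityNodeSketch(string entityName) : base(entityName) { }

    // Writes the stored name as an entity reference instead of as escaped text.
    public override void WriteTo(XmlWriter writer)
    {
        writer.WriteEntityRef(Value);
    }
}

public static class EntitySketchDemo
{
    public static string Build()
    {
        // Explicit XText nodes are used for the surrounding text so that
        // LINQ to XML does not append string content to the entity node's value.
        XElement p = new XElement("p",
            new XText("a"),
            new EntityNodeSketch("nbsp"),
            new XText("b"));
        return p.ToString(SaveOptions.DisableFormatting);
    }
}
```

A design consequence of this approach is that the entity exists only as a node type in the in-memory tree. Code that copies the tree, for example by passing an element to the XElement copy constructor, may turn the node back into ordinary text, which is one way that a further transformation can go wrong.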

Converting Images

A more complex example is one where you convert images in the WordprocessingML to images in the XHTML. As mentioned earlier, images are stored directly in an Open XML package, but in XHTML they must be stored as separate files, each with a distinct URI or URL. Therefore, you must supply a delegate that correctly stores images. This delegate is passed an ImageInfo object, which is defined as follows.

public class ImageInfo
{
    public Bitmap Bitmap;
    public XAttribute ImgStyleAttribute;
    public string ContentType;
    public XElement DrawingElement;
    public string AltText;

    public static int EmusPerInch = 914400;
    public static int EmusPerCm = 360000;
}

This gives you all the information that is needed to decide what to do with the image. The DrawingElement field contains the XML for the WordprocessingML drawing object.

A convenient way to supply that delegate is to supply a lambda expression. This allows the code inside your method (the lambda expression) to access local variables in the containing scope.

The method that you write for the delegate returns the markup to insert into the XHTML at the appropriate position. The ImgStyleAttribute object is an XAttribute object that contains appropriate sizing information for the image. It is appropriate to add directly to the IMG element that you create and return from the delegate. For example, it may contain the style attribute style="width: 0.5in; height: 0.4833333in". When you create the IMG element, it might look as follows.

<img
  src="Test_files/image1.jpeg"
  style="width: 0.5in; height: 0.4833333in"
  alt="Picture 1" />
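The sizes in such a style attribute can be derived from the extent of the drawing, which DrawingML expresses in English Metric Units (EMUs), with 914400 EMUs per inch (the value of ImageInfo.EmusPerInch). The following sketch shows the arithmetic. How the HtmlConverter class actually computes the attribute is not shown in this article, so treat the class and method names here as hypothetical.

```csharp
using System.Globalization;
using System.Xml.Linq;

public static class ImageSizeSketch
{
    public const int EmusPerInch = 914400;

    // Builds a style attribute from a drawing extent given in EMUs.
    public static XAttribute StyleFromExtent(long cxEmus, long cyEmus)
    {
        double widthInches = (double)cxEmus / EmusPerInch;
        double heightInches = (double)cyEmus / EmusPerInch;

        // The invariant culture keeps the decimal separator a period, as CSS requires.
        string style = string.Format(CultureInfo.InvariantCulture,
            "width: {0}in; height: {1}in", widthInches, heightInches);
        return new XAttribute("style", style);
    }
}
```

For example, an extent of 457200 by 914400 EMUs corresponds to half an inch by one inch.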

The following example creates a subdirectory to contain images, saves the images into that subdirectory, and returns an IMG element that contains the path to the image. The name of the subdirectory is the document name with "_files" appended to the name. To make it easy to run the example multiple times, if the directory already exists, the example deletes and recreates the directory.

The following example shows conversion of a document with images to XHTML.

// This example shows conversion of images. A cascading style sheet is not used.
string sourceDocumentFileName = "Test.docx";
FileInfo fileInfo = new FileInfo(sourceDocumentFileName);
string imageDirectoryName = fileInfo.Name.Substring(0,
    fileInfo.Name.Length - fileInfo.Extension.Length) + "_files";
DirectoryInfo dirInfo = new DirectoryInfo(imageDirectoryName);
if (dirInfo.Exists)
{
    // Delete the directory and files.
    foreach (var f in dirInfo.GetFiles())
        f.Delete();
    dirInfo.Delete();
}
int imageCounter = 0;
byte[] byteArray = File.ReadAllBytes(sourceDocumentFileName);
using (MemoryStream memoryStream = new MemoryStream())
{
    memoryStream.Write(byteArray, 0, byteArray.Length);
    using (WordprocessingDocument doc =
        WordprocessingDocument.Open(memoryStream, true))
    {
        HtmlConverterSettings settings = new HtmlConverterSettings()
        {
            PageTitle = "Test Title",
            ConvertFormatting = false,
        };
        XElement html = HtmlConverter.ConvertToHtml(doc, settings,
            imageInfo =>
            {
                DirectoryInfo localDirInfo = new DirectoryInfo(imageDirectoryName);
                if (!localDirInfo.Exists)
                    localDirInfo.Create();
                ++imageCounter;
                string extension = imageInfo.ContentType.Split('/')[1].ToLower();
                ImageFormat imageFormat = null;
                if (extension == "png")
                {
                    // Convert the .png file to a .jpeg file.
                    extension = "jpeg";
                    imageFormat = ImageFormat.Jpeg;
                }
                else if (extension == "bmp")
                    imageFormat = ImageFormat.Bmp;
                else if (extension == "jpeg")
                    imageFormat = ImageFormat.Jpeg;
                else if (extension == "tiff")
                    imageFormat = ImageFormat.Tiff;

                // If the image format is not one that you expect, ignore it,
                // and do not return markup for the link.
                if (imageFormat == null)
                    return null;

                string imageFileName = imageDirectoryName + "/image" +
                    imageCounter.ToString() + "." + extension;
                try
                {
                    imageInfo.Bitmap.Save(imageFileName, imageFormat);
                }
                catch (System.Runtime.InteropServices.ExternalException)
                {
                    return null;
                }
                XElement img = new XElement(Xhtml.img,
                    new XAttribute(NoNamespace.src, imageFileName),
                    imageInfo.ImgStyleAttribute,
                    imageInfo.AltText != null ?
                        new XAttribute(NoNamespace.alt, imageInfo.AltText) : null);
                return img;
            });

        // Note: The XHTML returned by the ConvertToHtml method contains objects of type
        // XEntity. PtOpenXmlUtil.cs defines the XEntity class. For more information, see
        // https://blogs.msdn.com/ericwhite/archive/2010/01/21/writing-entity-references-using-linq-to-xml.aspx.
        //
        // If you transform the XML tree returned by the ConvertToHtml method further, you
        // must do it correctly, or entities do not serialize correctly.

        File.WriteAllText(fileInfo.Directory.FullName + "/" + fileInfo.Name.Substring(0,
            fileInfo.Name.Length - fileInfo.Extension.Length) + ".html",
            html.ToStringNewLineOnAttributes());
    }
}

There are many transformations that WordprocessingML markup can apply to an image, such as resizing, rotating, or flipping. This example does not apply rotation or flipping transformations to the image before storing it.

Supplying a CSS to the Transformation

You can configure the HtmlConverter class to output a class attribute for every paragraph and heading element and you can supply a CSS to the conversion. The class attribute is the name of the style for the paragraph or heading, with a specified string prefixed to the style name. These class attributes enable you to configure some formatting at the paragraph level. The following example prepends "Pt" to the style name. The style sheet then defines classes for PtNormal, PtHeading1, and PtHeading2. This approach gives you a fair amount of flexibility in configuring some styling for the transformation.
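As a rough illustration of how a style name and a prefix combine into a class attribute, consider the following sketch. The rule that maps styles named Heading1 through Heading6 to heading elements is an assumption made for the illustration, and the class and method names are hypothetical; the HtmlConverter class may choose elements differently.

```csharp
using System.Xml.Linq;

public static class CssClassSketch
{
    private static readonly XNamespace Xhtml = "http://www.w3.org/1999/xhtml";

    // Creates a paragraph or heading element whose class attribute is the
    // prefix followed by the paragraph style name.
    public static XElement ToXhtmlBlock(string styleName, string text, string cssClassPrefix)
    {
        string elementName = "p";

        // Assumption for this sketch: Heading1..Heading6 become h1..h6.
        if (styleName.Length == 8 && styleName.StartsWith("Heading", System.StringComparison.Ordinal))
        {
            char level = styleName[7];
            if (level >= '1' && level <= '6')
                elementName = "h" + level;
        }

        return new XElement(Xhtml + elementName,
            new XAttribute("class", cssClassPrefix + styleName),
            text);
    }
}
```

With the prefix "Pt", a paragraph in the Normal style becomes a p element with the class PtNormal, and a Heading1 paragraph becomes an h1 element with the class PtHeading1, which matches the selectors p.PtNormal and h1.PtHeading1 in a style sheet.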

The following example shows conversion where the resulting XHTML uses a cascading style sheet.

// This example shows conversion using a cascading style sheet. It also converts images.
string css = @"
        p.PtNormal
            {margin-bottom:10.0pt;
            font-size:11.0pt;
            font-family:""Times"";}
        h1.PtHeading1
            {margin-top:24.0pt;
            font-size:14.0pt;
            font-family:""Helvetica"";
            color:blue;}
        h2.PtHeading2
            {margin-top:10.0pt;
            font-size:13.0pt;
            font-family:""Helvetica"";
            color:blue;}";

string sourceDocumentFileName = "Test.docx";
FileInfo fileInfo = new FileInfo(sourceDocumentFileName);
string imageDirectoryName = fileInfo.Name.Substring(0,
    fileInfo.Name.Length - fileInfo.Extension.Length) + "_files";
DirectoryInfo dirInfo = new DirectoryInfo(imageDirectoryName);
if (dirInfo.Exists)
{
    // Delete the directory and files.
    foreach (var f in dirInfo.GetFiles())
        f.Delete();
    dirInfo.Delete();
}
int imageCounter = 0;
byte[] byteArray = File.ReadAllBytes(sourceDocumentFileName);
using (MemoryStream memoryStream = new MemoryStream())
{
    memoryStream.Write(byteArray, 0, byteArray.Length);
    using (WordprocessingDocument doc =
        WordprocessingDocument.Open(memoryStream, true))
    {
        HtmlConverterSettings settings = new HtmlConverterSettings()
        {
            PageTitle = "Test Title",
            CssClassPrefix = "Pt",
            Css = css,
            ConvertFormatting = false,
        };
        XElement html = HtmlConverter.ConvertToHtml(doc, settings,
            imageInfo =>
            {
                DirectoryInfo localDirInfo = new DirectoryInfo(imageDirectoryName);
                if (!localDirInfo.Exists)
                    localDirInfo.Create();
                ++imageCounter;
                string extension = imageInfo.ContentType.Split('/')[1].ToLower();
                ImageFormat imageFormat = null;
                if (extension == "png")
                {
                    // Convert the .png file to a .jpeg file.
                    extension = "jpeg";
                    imageFormat = ImageFormat.Jpeg;
                }
                else if (extension == "bmp")
                    imageFormat = ImageFormat.Bmp;
                else if (extension == "jpeg")
                    imageFormat = ImageFormat.Jpeg;
                else if (extension == "tiff")
                    imageFormat = ImageFormat.Tiff;

                // If the image format is not one that you expect, ignore it,
                // and do not return markup for the link.
                if (imageFormat == null)
                    return null;

                string imageFileName = imageDirectoryName + "/image" +
                    imageCounter.ToString() + "." + extension;
                try
                {
                    imageInfo.Bitmap.Save(imageFileName, imageFormat);
                }
                catch (System.Runtime.InteropServices.ExternalException)
                {
                    return null;
                }
                XElement img = new XElement(Xhtml.img,
                    new XAttribute(NoNamespace.src, imageFileName),
                    imageInfo.ImgStyleAttribute,
                    imageInfo.AltText != null ?
                        new XAttribute(NoNamespace.alt, imageInfo.AltText) : null);
                return img;
            });

        // Note: The XHTML returned by the ConvertToHtml method contains objects of type
        // XEntity. PtOpenXmlUtil.cs defines the XEntity class. For more information, see
        // https://blogs.msdn.com/ericwhite/archive/2010/01/21/writing-entity-references-using-linq-to-xml.aspx.
        //
        // If you transform the XML tree returned by the ConvertToHtml method further, you
        // must do it correctly, or entities do not serialize correctly.

        File.WriteAllText(fileInfo.Directory.FullName + "/" + fileInfo.Name.Substring(0,
            fileInfo.Name.Length - fileInfo.Extension.Length) + ".html",
            html.ToStringNewLineOnAttributes());
    }
}

Now that you have seen some examples of how to use the HtmlConverter class, the next sections describe how I wrote the transformation.

Accepting Revisions Before Transforming

The first step I take to transform WordprocessingML to XHTML is to accept all tracked revisions. The article, Accepting Revisions in Open XML Word-Processing Documents, discusses the semantics of tracked revisions. I also posted on CodePlex a Microsoft Visual C# 3.0 code example that accepts tracked revisions. To download this example, see PowerTools for Open XML. The RevisionAccepter class is available in its own download, but is also included in the HtmlConverter example. To download RevisionAccepter.zip or HtmlConverter.zip, click the Downloads tab at PowerTools for Open XML.

Accepting revisions significantly reduces the complexity of the markup to transform. There are more than 40 elements that complicate the processing of content. Many of those elements have complex semantics. It is much better to first process them and then transform the contents of the document.
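To give a feel for what accepting revisions means at the markup level, the following deliberately simplified sketch handles only two revision elements: it drops w:del elements (deleted content) and unwraps w:ins elements (inserted content). It is not the RevisionAccepter class, which must deal with many more elements, such as w:moveFrom and w:moveTo. The class name is hypothetical.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class AcceptRevisionsSketch
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns a new tree in which deleted runs are removed and
    // inserted runs are kept without their w:ins wrapper.
    public static object Accept(XNode node)
    {
        XElement element = node as XElement;
        if (element == null)
            return node;    // Text and other nodes are copied unchanged.

        if (element.Name == W + "del")
            return null;    // Accepting a deletion removes the content.

        if (element.Name == W + "ins")
            return element.Nodes().Select(n => Accept(n));  // Keep only the children.

        return new XElement(element.Name,
            element.Attributes(),
            element.Nodes().Select(n => Accept(n)));
    }
}
```

After a pass like this one, the rest of the transformation no longer needs to know that w:ins or w:del exist, which is the simplification that this section describes.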

Simplifying Markup Before Transforming

WordprocessingML has a very rich vocabulary. There are many aspects of it that are not relevant to transforming the content to XHTML. It is easier to write a robust transformation if I first modify a document to a simpler, valid document. I want to put the WordprocessingML in the ideal form for the XHTML transformation.

To implement this, I wrote a class, MarkupSimplifier, which contains a single method, SimplifyMarkup. You pass an open WordprocessingDocument object to the SimplifyMarkup method, and when the method returns, the document is simplified. To control simplification, you create a SimplifyMarkupSettings object, initialize it, and pass it to the MarkupSimplifier.SimplifyMarkup method.

The following code snippet shows how to call the MarkupSimplifier.SimplifyMarkup method.

SimplifyMarkupSettings settings = new SimplifyMarkupSettings
{
    RemoveComments = true,
    RemoveContentControls = true,
    RemoveEndAndFootNotes = true,
    RemoveFieldCodes = false,
    RemoveLastRenderedPageBreak = true,
    RemovePermissions = true,
    RemoveProof = true,
    RemoveRsidInfo = true,
    RemoveSmartTags = true,
    RemoveSoftHyphens = true,
    ReplaceTabsWithSpaces = true,
};
MarkupSimplifier.SimplifyMarkup(wordDoc, settings);

One aspect of simplification deals with merging adjacent runs that have the same formatting. Through the process of editing, runs can be arbitrarily split into multiple runs. It is more convenient if you merge identically formatted, adjacent runs into a single run, because the HtmlConverter class can then transform a single run into a single span. After it applies the selected options, the SimplifyMarkup method merges adjacent runs that have the same formatting.
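The following sketch illustrates the idea of merging adjacent runs. It is a simplification under stated assumptions, not the MarkupSimplifier implementation: it merges only runs that contain nothing but run properties (w:rPr) and a single text element (w:t), and it treats two runs as identically formatted when their w:rPr elements are deeply equal. The class name is hypothetical.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class RunMergeSketch
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // A "simple" run here contains only w:rPr and exactly one w:t.
    private static bool IsSimpleRun(XElement e)
    {
        return e.Name == W + "r"
            && e.Elements().All(c => c.Name == W + "rPr" || c.Name == W + "t")
            && e.Elements(W + "t").Count() == 1;
    }

    // Returns a copy of the paragraph in which adjacent simple runs
    // with equal run properties are combined into one run.
    public static XElement MergeAdjacentRuns(XElement paragraph)
    {
        XElement result = new XElement(paragraph.Name, paragraph.Attributes());
        XElement lastRun = null;    // Last simple run added to the result, if any.

        foreach (XElement child in paragraph.Elements())
        {
            bool canMerge = lastRun != null
                && IsSimpleRun(child)
                && XNode.DeepEquals(lastRun.Element(W + "rPr"), child.Element(W + "rPr"));

            if (canMerge)
            {
                XElement text = lastRun.Element(W + "t");
                text.Value += child.Element(W + "t").Value;
                // Preserve leading and trailing spaces in the combined text.
                text.SetAttributeValue(XNamespace.Xml + "space", "preserve");
                continue;
            }

            XElement copy = new XElement(child);
            result.Add(copy);
            lastRun = IsSimpleRun(copy) ? copy : null;
        }
        return result;
    }
}
```

Comparing run properties after other simplifications have been applied matters, because attributes such as revision identifiers can otherwise make runs that look the same compare as different.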

Removing Rsid information is not necessary for the HtmlConverter object to work, but in the process of developing or enhancing this conversion, you often want to look at the source XML. Removing Rsid information makes that much easier.
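As an illustration, revision identifiers appear as attributes such as w:rsidR, w:rsidRPr, and w:rsidRDefault. A minimal sketch that strips them, assuming that every attribute in the WordprocessingML namespace whose local name starts with "rsid" can be discarded, might look like the following. It is not the MarkupSimplifier implementation, and the class name is hypothetical.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class RsidRemovalSketch
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    private static bool IsRsidAttribute(XAttribute a)
    {
        return a.Name.Namespace == W
            && a.Name.LocalName.StartsWith("rsid", StringComparison.Ordinal);
    }

    // Returns a copy of the element tree without w:rsid* attributes.
    public static XElement RemoveRsid(XElement element)
    {
        return new XElement(element.Name,
            element.Attributes().Where(a => !IsRsidAttribute(a)),
            element.Nodes().Select(n =>
                n is XElement ? RemoveRsid((XElement)n) : n));
    }
}
```

The result is markup that is much easier to read when you inspect the source XML during development.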

I did not have a use for content controls in this first version of the XHTML transformation, but I certainly will for future ones.

Replacing tabs with spaces is an interesting issue. WordprocessingML has a notion of hard, physical tabs, and XHTML does not. Most transformations attempt to approximate tabs by using spaces, but for this version, I chose not to take that approach. Many modern documents use tables instead of physical tabs, and tables convert in a reasonable fashion.

Recursive Pure Functional Transformations

You can use the HtmlConverter object very easily without understanding how the transformations are written. You can do quite a bit of customization of the transformation by using the options that you can specify in the HtmlConverterSettings argument. However, if you want to extend this code for your own scenario, perhaps significantly altering the generated XHTML, or giving special significance to content controls, you must understand these types of transformations. Transforming WordprocessingML to XHTML is a document-centric transformation, and by definition, document-centric transformations are recursive. Document-Centric XML Transforms Using LINQ to XML explores this idea.

The HtmlConverter class is written using a recursive C# technique, but it is helpful to compare its approach to that of XSLT transformations. XSLT style sheets, when written properly, are a form of recursive pure functional transformations. Good XSLT developers write stateless, functional code. Sometimes this is referred to as an XSLT style sheet that is written with the push model. Some developers try to solve problems in XSLT by using a pull model, where they write imperative code to pull out the XML and then form new XML. This leads to XSLT programs that are longer and harder to debug. Writing using a recursive technique in C# 3.0 is analogous to writing pure, functional XSLT push code.

The first thing to learn is how to write pure functional queries by using LINQ. Some time ago, I wrote a tutorial entitled Query Composition using Functional Programming Techniques in C# 3.0. I recommend that tutorial if you are not yet familiar with writing LINQ queries.

Recursive pure functional transformations can be described very simply. You write a method where every element of the source document is passed through the method (except that you can conveniently bypass elements as desired). In this method, you are free to test for the identity of an element in any way that you like: you can test on the element name, attributes of the element, elements following or preceding, or existence of specific ancestors. When you match your desired element, you return a new element (or set of elements) that is the transformed shape of that element in the new XML tree. You typically write a bit of functional construction code, or you write a query that returns a set of elements that replaces the matched element. You also can return null, which effectively removes the element from the new XML tree. Recursive Approach to Pure Functional Transformations of XML walks through creating these types of transformations.
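The shape of such a method can be sketched in a few lines. The following example is a toy transformation, not the HtmlConverter code: it maps w:body to an XHTML body element and w:p to a p element, flattens runs, keeps the text of w:t elements, and returns null for everything else. The class name is hypothetical.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class RecursiveTransformSketch
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    private static readonly XNamespace Xhtml = "http://www.w3.org/1999/xhtml";

    // Every node passes through this method. The return value is the
    // transformed shape of the node in the new tree, or null to remove it.
    public static object Transform(XNode node)
    {
        XElement element = node as XElement;
        if (element == null)
            return null;    // Ignore stray text, comments, and so on.

        if (element.Name == W + "body")
            return new XElement(Xhtml + "body",
                element.Elements().Select(e => Transform(e)));

        if (element.Name == W + "p")
            return new XElement(Xhtml + "p",
                element.Elements().Select(e => Transform(e)));

        if (element.Name == W + "r")
            return element.Elements().Select(e => Transform(e));  // A set of nodes.

        if (element.Name == W + "t")
            return element.Value;   // Keep the text only.

        return null;    // Properties and anything unrecognized are removed.
    }
}
```

The method holds no state and never modifies the source tree; each branch is an independent rule, which is what makes this style comparable to XSLT templates written in the push model.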

Other Limitations

The purpose of this example is to demonstrate one approach to document-centric transformations. It also teaches some key aspects of WordprocessingML. It is not meant as a full-fidelity transform to HTML. As such, it has the following limitations.

  • It works only with left-to-right languages. I have not validated it with languages other than English.

  • It implements only a subset of numbering systems.

  • It does not convert math formulas, SmartArt, or DrawingML drawings.

  • It does not convert text boxes.

  • It does not handle merged cells.

Conclusion

Converting WordprocessingML text to XHTML is useful functionality in two scenarios. Sometimes you want to have a very limited file preview of Open XML word-processing documents and if you have an HTML viewer, you can implement this with a minimum of effort. In addition, sometimes you want to query a word-processing document for some specific content. It may be easier to write an extremely simple query over XHTML instead of a more involved WordprocessingML query.

Additional Resources

For more information, see the following resources: