Using Word to publish content for SharePoint and web applications

Summary:  Learn how to use Microsoft Word to publish formatted contents for SharePoint and web applications by customizing PowerTools for Open XML.

Applies to: Office 2010 | Open XML | Visual Studio Tools for Microsoft Office | Word 2007 | Word 2010

In this article
Introduction to using Word to publish contents
Customize HtmlConverter in PowerTools for Open XML
Create a Helper Class for HTML Conversion
Conclusion
Additional resources
About the authors

Published:  April 2013

Provided by:  Jinhui (Ivory) Feng and Omar Villavicencio-Calderon

Download the code

Introduction to using Word to publish contents

Businesses often want to use Microsoft Word to publish formatted contents for SharePoint and web applications. Users are familiar with, and feel comfortable with, the powerful formatting features in Word, and they love the offline editing capability.

This article describes how to design and implement a simple solution that uses Word to publish formatted contents.

What are the Challenges?

Suppose you want to use a predefined Word template to prepare contents that you want to publish. Also suppose that you do not want to show the whole document in a webpage, but instead, only want to show the contents of some predefined sections of the document. That means you have to convert some Word contents to HTML contents, and that is the main challenge. How do you do that, and what is the best tool to use?

HtmlConverter in PowerTools for Open XML is a good fit for this task, with one limitation: it converts the whole Word document into HTML, and not just the sections that you specify. This article describes how to customize HtmlConverter so that it converts only a selected section.

In addition to converting to HTML, you want control over some of the formatting of the converted output. For example, you want a bulleted list in Word to become an HTML unordered list that uses <ul> tags.

When you display the converted HTML contents, you also want to enable users to download the published Word contents into another Word template that differs from the original publishing template. This is difficult when the contents include hyperlinks, as explained in the following sections.

Customize HtmlConverter in PowerTools for Open XML

The HtmlConverter class in PowerTools for Open XML has some public methods to convert a Word document into HTML. To convert a section of the Word document, you can customize this class by adding the following method to the HtmlConverter class.

        public static XElement ConvertToHtml(WordprocessingDocument wordDoc, XNode node, HtmlConverterSettings htmlConverterSettings)
        {
            InitEntityMap();
            if (htmlConverterSettings.ConvertFormatting)
            {
                throw new InvalidSettingsException("Conversion with formatting is not supported");
            }
            RevisionAccepter.AcceptRevisions(wordDoc);
            SimplifyMarkupSettings settings = new SimplifyMarkupSettings
            {
                RemoveComments = true,
                RemoveContentControls = true,
                RemoveEndAndFootNotes = true,
                RemoveFieldCodes = false,
                RemoveLastRenderedPageBreak = true,
                RemovePermissions = true,
                RemoveProof = true,
                RemoveRsidInfo = true,
                RemoveSmartTags = true,
                RemoveSoftHyphens = true,
                ReplaceTabsWithSpaces = true,
            };
            MarkupSimplifier.SimplifyMarkup(wordDoc, settings);
            AnnotateHyperlinkContent((XElement)node);
            XElement xhtml = (XElement)ConvertToHtmlTransform(wordDoc, htmlConverterSettings, node, null);

            return xhtml;
        }

The second parameter of the method is the section of the Word document that you want to convert to HTML. You still have to pass the whole Word document as the first parameter because some information is stored at the document level, and not in the specific section to be converted. For example, if a section contains a hyperlink, the section contains only a hyperlink identifier that points to an entry in the HyperlinkRelationships collection of the Word document. The actual hyperlink target URL is stored in that relationship.
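To make the relationship lookup concrete, the following sketch parses a small, hand-written WordprocessingML fragment and resolves the r:id of each w:hyperlink against a dictionary that stands in for the document-level HyperlinkRelationships collection. The fragment, the dictionary contents, and the names HyperlinkLookupSketch and ResolveTargets are illustrative assumptions; they are not part of PowerTools for Open XML or the sample code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class HyperlinkLookupSketch
{
    // Standard WordprocessingML and relationship namespaces.
    static readonly XNamespace W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static readonly XNamespace R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    // Returns the target URL of every w:hyperlink in the fragment, in document order.
    // "relationships" is a hypothetical stand-in for HyperlinkRelationships (id -> URL).
    // Hyperlinks whose identifier is not in the dictionary are skipped.
    public static List<string> ResolveTargets(XElement fragment, IDictionary<string, string> relationships)
    {
        return fragment.Descendants(W + "hyperlink")
            .Select(h => (string)h.Attribute(R + "id"))
            .Where(id => id != null && relationships.ContainsKey(id))
            .Select(id => relationships[id])
            .ToList();
    }

    public static void Main()
    {
        // The section itself only carries the identifier "rId5".
        XElement cell = XElement.Parse(
            "<w:tc xmlns:w='" + W.NamespaceName + "' xmlns:r='" + R.NamespaceName + "'>" +
            "<w:p><w:hyperlink r:id='rId5'><w:r><w:t>Contoso</w:t></w:r></w:hyperlink></w:p>" +
            "</w:tc>");

        // The URL lives outside the section, at the document level.
        var relationships = new Dictionary<string, string> { { "rId5", "http://www.contoso.com/" } };

        foreach (string url in ResolveTargets(cell, relationships))
        {
            Console.WriteLine(url);
        }
    }
}
```

Without the dictionary, the fragment alone gives no way to recover the URL, which is why the custom ConvertToHtml method still needs the whole document.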

Create a Helper Class for HTML Conversion

The HtmlConverter class in PowerTools for Open XML provides many features, but not everything that you need. Although you can customize PowerTools for Open XML, it is a good idea to minimize direct customizations, and instead, introduce a local helper class to provide the functionality that you need.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using DocumentFormat.OpenXml.Wordprocessing;
using DocumentFormat.OpenXml.Packaging;
using System.Xml;
using System.Xml.Linq;
using OpenXmlPowerTools;

namespace OpenXMLHtmlConverter.Helper
{
    public class HTMLConverterHelper
    {
        public static XElement ConvertToHtml(WordprocessingDocument wordDocument, XNode node)
        {
            HtmlConverterSettings settings = new HtmlConverterSettings()
            {
                PageTitle = "My Page Title",
                CssClassPrefix = "",
                Css = "",
                ConvertFormatting = false
            };

            XElement html = HtmlConverter.ConvertToHtml(wordDocument, node, settings);
            return html;
        }

    }
}

The helper class provides a static method to convert a section in Word to HTML. That method calls the custom method that you previously added to the HtmlConverter class in PowerTools for Open XML.

The plan is to use a table cell to store the content that you want to convert to HTML. The following method converts a table cell into HTML.

        public static string ConvertToHtml(WordprocessingDocument wordDocument, TableCell tableCell)
        {
            XElement xElement = HTMLConverterHelper.ConvertToHtml(wordDocument, XElement.Parse(tableCell.OuterXml));
            string html = xElement.ToString();
            html = ConvertToHtmlList(html);

            // remove the <td> tag
            int startIndex = html.IndexOf("<", html.IndexOf("<td") + 3);
            html = html.Substring(startIndex, html.LastIndexOf(">", html.Length - 3) - startIndex + 1);

            html = html.Replace(" class=\"Normal\"", string.Empty);
            // add target="_blank" so that links open in a new window.
            html = html.Replace("href=", "target=\"_blank\" href=");

            return html;
        }

This method takes a TableCell as an input parameter and returns an HTML string. First, the method converts the TableCell into an XNode by calling the XElement.Parse method on the cell's OuterXml. Then it calls the ConvertToHtml(WordprocessingDocument wordDocument, XNode node) method that you first defined in this helper class and converts the resulting XElement to a string. Next, it calls the ConvertToHtmlList method to convert the HTML string into another one that supports HTML unordered lists that use <ul> tags (more on the ConvertToHtmlList method later). Finally, it strips the enclosing <td> tag, removes the class="Normal" attribute, and adds target="_blank" to each hyperlink.
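The string manipulation at the end of the method is easier to follow on a concrete input. The following sketch applies the same index arithmetic and replacements to a hand-written HTML string; the class name TableCellCleanupSketch, its method names, and the sample markup are assumptions made for illustration, and the real converter output is typically indented and more verbose.

```csharp
using System;

public static class TableCellCleanupSketch
{
    // Returns the markup between <td ...> and </td>, using the same index
    // arithmetic as the table-cell method. Assumes the string ends with </td>
    // and that the cell contains at least one child element.
    public static string StripOuterTd(string html)
    {
        // First '<' after the "<td" prefix: the start of the first child element.
        int startIndex = html.IndexOf("<", html.IndexOf("<td") + 3);
        // Last '>' that precedes the trailing </td>: the end of the last child element.
        int endIndex = html.LastIndexOf(">", html.Length - 3);
        return html.Substring(startIndex, endIndex - startIndex + 1);
    }

    // Strips the <td> wrapper, drops class="Normal", and marks links to open in a new window.
    public static string CleanUp(string html)
    {
        html = StripOuterTd(html);
        html = html.Replace(" class=\"Normal\"", string.Empty);
        return html.Replace("href=", "target=\"_blank\" href=");
    }

    public static void Main()
    {
        string sample = "<td><p class=\"Normal\"><a href=\"http://www.contoso.com/\">Contoso</a></p></td>";
        // Prints: <p><a target="_blank" href="http://www.contoso.com/">Contoso</a></p>
        Console.WriteLine(CleanUp(sample));
    }
}
```

Because the replacements are plain string operations, they also affect any body text that happens to contain href= or class="Normal". That is acceptable for a controlled publishing template, but worth keeping in mind for free-form content.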

To support HTML unordered lists using <ul> tags

When you convert a bulleted list in Word to HTML, the output format from the HtmlConverter class in PowerTools for Open XML uses the &bull; HTML entity for bullets instead of <ul> tags. Because <ul> tags work better with Cascading Style Sheets (CSS), add the following method.

        private static string ConvertToHtmlList(string html)
        {
            List<string> elementLists = new List<string>();
            const string ListParagraphClassTag = "class=\"ListParagraph\"";
            const string ListParagraphStartTag = "<p class=\"ListParagraph\">";
            const string ListParagraphEndTag = "</p>";

            int startIndex = 0;
            int tagIndex = 0;
            int previousTagIndex = 0;
            int stringLength = 0;


            if (html.Contains(ListParagraphStartTag))
            {
                tagIndex = html.IndexOf(ListParagraphStartTag, startIndex);
                stringLength = tagIndex - startIndex; // the portion before the first list paragraph start tag
                elementLists.Add(html.Substring(startIndex, stringLength));

                previousTagIndex = tagIndex;
                startIndex = tagIndex + ListParagraphStartTag.Length; // next start index


                while (html.IndexOf(ListParagraphStartTag, startIndex) > 0) // next start tag exists
                {
                    tagIndex = html.IndexOf(ListParagraphStartTag, startIndex); // next start tag index
                    stringLength = tagIndex - previousTagIndex;
                    
                    string sectionHtml = html.Substring(previousTagIndex, stringLength);
                    if (sectionHtml.IndexOf(ListParagraphEndTag) != (sectionHtml.Length - ListParagraphEndTag.Length))
                    {
                        // something more after ListParagraphEndTag
                        elementLists.Add(sectionHtml.Substring(0, sectionHtml.IndexOf(ListParagraphEndTag) + ListParagraphEndTag.Length));
                        string secondSectionHtml = sectionHtml.Substring(sectionHtml.IndexOf(ListParagraphEndTag) + ListParagraphEndTag.Length);
                        if (secondSectionHtml.Trim() != string.Empty)
                        {
                            elementLists.Add(secondSectionHtml);
                        }
                    }
                    else
                    {
                        elementLists.Add(sectionHtml);
                    }
                    previousTagIndex = tagIndex;
                    startIndex = tagIndex + ListParagraphStartTag.Length; // next start index
                }

                elementLists.Add(html.Substring(previousTagIndex)); // last piece

                for (int i = 0; i < elementLists.Count; i++)
                {
                    if (elementLists[i].Contains(ListParagraphClassTag))
                    {
                        // it's the first element or the previous element is not a list paragraph.
                        bool isFirstItem = i == 0 || !elementLists[i - 1].Contains(ListParagraphClassTag);
                        // it's the last element or the next element is not a list paragraph.
                        bool isLastItem = i == elementLists.Count - 1 || !elementLists[i + 1].Contains(ListParagraphClassTag);

                        // the first item opens the <ul>; the last item closes it.
                        // a single-item list is both first and last, so it gets both tags.
                        string itemStartTag = isFirstItem ? "<ul>\r\n  <li>" : "<li>";
                        string itemEndTag = isLastItem ? "</li>\r\n</ul>" : "</li>";

                        html = html.Replace(elementLists[i], elementLists[i].Replace(ListParagraphStartTag, itemStartTag).Replace(ListParagraphEndTag, itemEndTag));
                    }
                }
            }

            html = html.Replace("&bull; ", "&nbsp;");

            return html;
        }

This method detects the beginning and ending of each list and list item, and then adds the HTML unordered list tags.
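For comparison, the same grouping idea can be expressed on the parsed XML tree instead of on the string. The following is a minimal alternative sketch, not the article's method: it assumes the container has only element children (for example, a <td> that holds paragraphs), it does not remove the bullet characters that the converter emits, and the names ListGroupingSketch and WrapListParagraphs are hypothetical.

```csharp
using System;
using System.Xml.Linq;

public static class ListGroupingSketch
{
    static bool IsListParagraph(XElement e)
    {
        return e.Name.LocalName == "p" && (string)e.Attribute("class") == "ListParagraph";
    }

    // Returns a copy of "container" in which each run of consecutive
    // <p class="ListParagraph"> children is wrapped in one <ul>, and each
    // such paragraph becomes an <li>. Other children are copied unchanged.
    public static XElement WrapListParagraphs(XElement container)
    {
        XNamespace ns = container.Name.Namespace; // keep the XHTML namespace, if any
        XElement result = new XElement(container.Name, container.Attributes());
        XElement currentList = null;

        foreach (XElement child in container.Elements())
        {
            if (IsListParagraph(child))
            {
                if (currentList == null)
                {
                    // First item of a run: open a new list.
                    currentList = new XElement(ns + "ul");
                    result.Add(currentList);
                }
                currentList.Add(new XElement(ns + "li", child.Nodes()));
            }
            else
            {
                // Any other element ends the current run.
                currentList = null;
                result.Add(child);
            }
        }
        return result;
    }

    public static void Main()
    {
        XElement cell = XElement.Parse(
            "<td><p>Intro</p>" +
            "<p class=\"ListParagraph\">One</p><p class=\"ListParagraph\">Two</p>" +
            "<p>Outro</p></td>");
        Console.WriteLine(WrapListParagraphs(cell));
    }
}
```

Working on the tree avoids the index bookkeeping and handles a single-item list without a special case, at the cost of parsing the HTML a second time.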

If you want to enable users to download the published content to another Word template, one easy way is to store the original WordML of the published content for download. However, when the content contains a hyperlink, you have to do something special to support it. Otherwise, the hyperlink in the downloaded document will not have a target URL (because the hyperlink identifier does not exist in the target document’s HyperlinkRelationships collection), or the hyperlink will point to the wrong URL if the target document already uses the same hyperlink identifier for another URL.

That means that you have to change the original hyperlink identifiers in the published content to be unique so that they do not conflict with those in the target document. Add the following method to the helper class.

        public static string UpdateHyperlinkIds(WordprocessingDocument wordDocument, string wordML, string documentId)
        {
            string wordMLString = wordML;

            foreach (var hyperlink in wordDocument.MainDocumentPart.HyperlinkRelationships)
            {
                string oldId = "\"" + hyperlink.Id + "\"";
                string newId = "\"" + documentId + hyperlink.Id + "\"";

                wordMLString = wordMLString.Replace(oldId, newId);
            }

            return wordMLString;
        }

This method prefixes all the hyperlink identifiers with a unique documentId by looping through the document’s HyperlinkRelationships. The documentId string has to start with a letter (not a digit), or an exception is thrown. The prefixing ensures that the modified hyperlink identifiers are unique in the WordML of the published content, and the method returns a modified WordML string for download.
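The quotation marks around the old and new identifiers matter: they keep a short identifier such as rId1 from matching inside a longer one such as rId10. The following sketch shows the same quoted replacement on a hand-written string, driven by a plain list of identifiers instead of a WordprocessingDocument. The class name HyperlinkIdPrefixSketch, the method name PrefixIds, and the doc42_ prefix are assumptions for illustration, and the sketch assumes that attribute values are written with double quotes.

```csharp
using System;
using System.Collections.Generic;

public static class HyperlinkIdPrefixSketch
{
    // Same quoted-replace technique as UpdateHyperlinkIds: each identifier is
    // matched together with its surrounding double quotes, so "rId1" does not
    // match inside "rId10".
    public static string PrefixIds(string wordML, IEnumerable<string> hyperlinkIds, string documentId)
    {
        foreach (string id in hyperlinkIds)
        {
            wordML = wordML.Replace("\"" + id + "\"", "\"" + documentId + id + "\"");
        }
        return wordML;
    }

    public static void Main()
    {
        string wordML = "<w:hyperlink r:id=\"rId1\"/><w:hyperlink r:id=\"rId10\"/>";
        string updated = PrefixIds(wordML, new[] { "rId1", "rId10" }, "doc42_");
        // Prints: <w:hyperlink r:id="doc42_rId1"/><w:hyperlink r:id="doc42_rId10"/>
        Console.WriteLine(updated);
    }
}
```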

Next, you have to store the related HyperlinkRelationships information together with the modified hyperlink identifiers of the original document for download. Add the following method to the helper class.

        public static string GetHyperlinks(WordprocessingDocument wordDocument, string documentId)
        {
            XElement hyperlinksXElement = new XElement("hyperlinks",  wordDocument.MainDocumentPart.HyperlinkRelationships.Select(a => new XElement("hyperlink", new XAttribute("uri", a.Uri.OriginalString), new XAttribute("id", documentId + a.Id))));

            return hyperlinksXElement.ToString();
        }

This method returns an XML string that lists the hyperlinks, with modified hyperlink identifiers, for download. It modifies the hyperlink identifiers by prefixing a documentId that comes from the second parameter, which has to be the same value that you pass to the UpdateHyperlinkIds method.
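The shape of the returned XML can be seen by building it from sample data. The following sketch constructs the same hyperlinks element from a list of identifier and URL pairs that stands in for HyperlinkRelationships. The class name HyperlinkListSketch, the method name BuildHyperlinksXml, and the sample values are assumptions, and the sketch disables indentation only to keep the output on one line.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class HyperlinkListSketch
{
    // Builds <hyperlinks><hyperlink uri="..." id="..."/>...</hyperlinks>, where
    // each id is the original identifier prefixed with documentId.
    // "relationships" is a hypothetical stand-in: Key = identifier, Value = target URL.
    public static string BuildHyperlinksXml(IEnumerable<KeyValuePair<string, string>> relationships, string documentId)
    {
        XElement hyperlinks = new XElement("hyperlinks",
            relationships.Select(r => new XElement("hyperlink",
                new XAttribute("uri", r.Value),
                new XAttribute("id", documentId + r.Key))));

        return hyperlinks.ToString(SaveOptions.DisableFormatting);
    }

    public static void Main()
    {
        var relationships = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("rId5", "http://www.contoso.com/"),
            new KeyValuePair<string, string>("rId6", "http://www.contoso.com/news")
        };
        Console.WriteLine(BuildHyperlinksXml(relationships, "doc42_"));
    }
}
```

Storing this string next to the modified WordML keeps everything that the download step needs in two plain strings, which fits a database column or a SharePoint list field.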

Finally, when a user downloads the published content, you have to add the related hyperlinks into the target document’s HyperlinkRelationships. Add the following method to the helper class.

        public static void AddHyperlinks(WordprocessingDocument wordDocument, string hyperlinksString)
        {
            XElement hyperlinksXElement = XElement.Parse(hyperlinksString);
            foreach (var hyperlink in hyperlinksXElement.Descendants("hyperlink"))
            {
                // Add the HyperlinkRelationship
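                // (UriKind.RelativeOrAbsolute also accepts a relative target that was stored as-is)
                wordDocument.MainDocumentPart.AddHyperlinkRelationship(new Uri(hyperlink.Attribute("uri").Value, UriKind.RelativeOrAbsolute), true, hyperlink.Attribute("id").Value);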
            }
        }

This method adds every hyperlink from the second parameter to the WordprocessingDocument that is specified by the first parameter. It does not check whether the target document already has a relationship with the same identifier.
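If you prefer to filter the list before adding it, a small pre-check is one option. The following sketch parses the stored hyperlinks string and keeps only the entries that are not already present in a set of existing identifiers, whose identifier is a valid XML name, and whose URI can be parsed. It is an illustration under stated assumptions: the names HyperlinkImportSketch and SelectNewHyperlinks are hypothetical, the existingIds set stands in for the target document's relationship identifiers, and the validity rule assumes that relationship identifiers must be valid XML names (the reason that the documentId has to start with a letter).

```csharp
using System;
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

public static class HyperlinkImportSketch
{
    // Returns the (id, uri) pairs from the stored hyperlinks XML that look safe to add.
    public static List<KeyValuePair<string, Uri>> SelectNewHyperlinks(string hyperlinksString, ICollection<string> existingIds)
    {
        var result = new List<KeyValuePair<string, Uri>>();

        foreach (XElement hyperlink in XElement.Parse(hyperlinksString).Descendants("hyperlink"))
        {
            string id = (string)hyperlink.Attribute("id");
            string uriText = (string)hyperlink.Attribute("uri");
            Uri uri;

            if (string.IsNullOrEmpty(id) || uriText == null) continue; // incomplete entry
            if (existingIds.Contains(id)) continue;                     // identifier already in use
            if (!IsValidXmlName(id)) continue;                          // for example, starts with a digit
            if (!Uri.TryCreate(uriText, UriKind.RelativeOrAbsolute, out uri)) continue;

            result.Add(new KeyValuePair<string, Uri>(id, uri));
        }
        return result;
    }

    static bool IsValidXmlName(string id)
    {
        try
        {
            XmlConvert.VerifyNCName(id);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        string stored =
            "<hyperlinks>" +
            "<hyperlink uri=\"http://www.contoso.com/\" id=\"doc42_rId5\" />" +
            "<hyperlink uri=\"http://www.contoso.com/news\" id=\"42rId6\" />" +
            "</hyperlinks>";

        foreach (var pair in SelectNewHyperlinks(stored, new HashSet<string>()))
        {
            Console.WriteLine(pair.Key + " -> " + pair.Value);
        }
    }
}
```

Each surviving pair could then be passed to AddHyperlinkRelationship in the same way as in the AddHyperlinks method.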

You can download all the code from this article in a sample code package on Code Gallery. The sample code package was created in Microsoft Visual Studio 2010. It contains a simple ASP.NET MVC 4 web application to upload a Word document to publish contents and download the published contents into another Word document. This sample application does not store the published contents anywhere, but in real-world applications, you can store them in a database or a SharePoint list.

Conclusion

By customizing the HtmlConverter class in PowerTools for Open XML, you can convert a specific section of a Word document to HTML. By using the custom helper class in this article, you can provide better HTML conversion control and enable users to download the published contents back to another Word document. This solution provides an effective application design alternative when Word is the preferred publication tool for an application, and makes it easy for content authors to prepare contents offline.

Additional resources

For more information, see the following:

About the authors

Jinhui (Ivory) Feng is a senior software development engineer at Microsoft who works on internal human resources applications.

Omar Villavicencio-Calderon is a software development engineer at Microsoft who works on internal human resources applications.