Transforming Word Documents into the XSL-FO Format

Summary: Learn how to transform Word documents into the XSL Formatting Objects (XSL-FO) format. From the XSL-FO format, you can convert documents into formats such as Adobe Portable Document Format (PDF) and Hypertext Markup Language (HTML). (7 printed pages)

Alexei Gagarinov, RenderX

Mark Iverson, Microsoft Corporation

February 2005

Applies to: Microsoft Office Word 2003, Microsoft Office 2003 Editions

Download OfficeWordWordMLtoXSL-FOSample.exe.

Contents

  • Introduction to WordprocessingML and XSL-FO Formatting

  • Creating an XSL-FO Document from Word

  • Word Features Supported by the XSL-FO Style Sheet

  • Limitations to Transforming Documents from Word into the XSL-FO Format

  • Examining the XSL-FO Style Sheet

  • Extended Style Sheet for Transforming Word Documents to an XSL-FO Format

  • Conclusion

  • Additional Resources

Introduction to WordprocessingML and XSL-FO Formatting

Microsoft made customizing Microsoft Office Word documents much easier when it introduced a new XML file format called WordprocessingML with Microsoft Office Word 2003. WordprocessingML is an XML representation, or schema, of the Word document file format. The flexibility of XML allows you to export data seamlessly from a Word document using standard XML representations of objects in a document. The new WordprocessingML schema also provides easy access to the contents of Word documents without programming effort or knowledge of the internal format of a Word document.
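
As a rough illustration of how accessible the contents become, the following Python sketch reads paragraph text from a WordprocessingML string using only the standard library. The sample document is hand-written and heavily simplified (an assumption for illustration; files saved by Word contain many more elements), and the helper name paragraph_texts is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of Word 2003 WordprocessingML elements.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"

# Hand-written, simplified sample; not output produced by Word.
SAMPLE_WORDML = f"""<w:wordDocument xmlns:w="{W_NS}">
  <w:body>
    <w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>WordprocessingML.</w:t></w:r></w:p>
    <w:p><w:r><w:t>Second paragraph.</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""


def paragraph_texts(wordml):
    """Return the text of each w:p element, joining the w:t runs inside it."""
    root = ET.fromstring(wordml)
    texts = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        runs = paragraph.iter(f"{{{W_NS}}}t")
        texts.append("".join(run.text or "" for run in runs))
    return texts


for line in paragraph_texts(SAMPLE_WORDML):
    print(line)
```

The sketch ignores formatting entirely; it only shows that document text is reachable with ordinary XML tooling.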

XSL Formatting Objects

In 2001, before the advent of WordprocessingML, the W3C endorsed an XML formatting language known as XSL Formatting Objects (XSL-FO). XSL-FO is synonymous with eXtensible Stylesheet Language (XSL), one of three recommendations by the W3C's XSL working group. The three recommendations are:

  • XSL Transformations (XSLT), for transforming XML files

  • XML Path Language (XPath), for defining parts of an XML document

  • Extensible Stylesheet Language (XSL), or XSL-FO, for XML formatting information

XSL-FO is an intermediate form that results from applying an XSLT style sheet to an XML structured document. The XSL-FO form describes how pages appear when presented to a reader, such as a Web browser. Currently, there are no readers that directly interpret XSL-FO documents. To interpret them, you must run them through a formatter, along with other data, such as graphics and font metrics, to create a final displayable or printable file. Possible formats for the resulting file include Adobe's Portable Document Format (PDF) and Hypertext Markup Language (HTML).
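
To make the shape of an XSL-FO document concrete, the following Python sketch builds a minimal one: a page master that describes the page geometry, and a page sequence whose flow holds a single block of text. The page size, margin, font size, and the helper names fo and build_minimal_fo are assumptions chosen for illustration; they are not taken from the style sheet discussed in this article.

```python
import xml.etree.ElementTree as ET

# Namespace of XSL Formatting Objects.
FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)


def fo(tag):
    """Return the namespace-qualified name of a formatting object."""
    return f"{{{FO_NS}}}{tag}"


def build_minimal_fo(text):
    """Serialize a one-block XSL-FO document containing the given text."""
    root = ET.Element(fo("root"))

    # Page geometry: one simple page master with a body region.
    masters = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(masters, fo("simple-page-master"), {
        "master-name": "page",
        "page-width": "8.5in",
        "page-height": "11in",
        "margin": "1in",
    })
    ET.SubElement(page, fo("region-body"))

    # Content: a page sequence that flows one block into the body region.
    sequence = ET.SubElement(root, fo("page-sequence"),
                             {"master-reference": "page"})
    flow = ET.SubElement(sequence, fo("flow"),
                         {"flow-name": "xsl-region-body"})
    block = ET.SubElement(flow, fo("block"), {"font-size": "12pt"})
    block.text = text

    return ET.tostring(root, encoding="unicode")


print(build_minimal_fo("Hello, XSL-FO."))
```

A formatter would take the printed markup as input and produce the final paginated file.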

When compared to Cascading Style Sheets (CSS), XSL-FO provides a more sophisticated visual layout model. You can use CSS to apply specific style elements to an XML or HTML document. By contrast, XSL-FO is a language for describing a complete document. It includes everything needed to paginate and format a document. Some of the formatting supported by XSL-FO, but not by CSS, includes right-to-left and top-to-bottom text, footnotes, margin notes, page numbers in cross-references, and more. Note that while CSS is primarily intended for use on the Web, XSL-FO is designed for broader use. As an example, you could use an XSL-FO document to lay out an XML document as a printed book. You could write a completely separate XSL-FO document to transform the same XML document into HTML.

Using XSL-FO and WordprocessingML to Transform Documents

The advent of WordprocessingML created a powerful use for XSL-FO. You can use WordprocessingML and XSL-FO to transform Word documents into other viewable or printable industry-adopted formats, such as PDF or HTML. The promise of XSL-FO as an intermediate format led to the creation of several tools that facilitate the transformation of XSL-FO documents into PDF, HTML, and other types of files. This article focuses on style sheets created explicitly for converting WordprocessingML files into XSL-FO documents.

Creating an XSL-FO Document from Word

The style sheet presented with this article is designed for using WordprocessingML to convert documents to the XSL-FO format. There are two simple ways to obtain an XSL-FO document directly from Word using the style sheet.

  • Save a Word document as XML by using the Save As command in Word and then use a third-party parser to convert it into an XSL-FO document.

  • Apply a transformation when using the Save As command in Word. To do this:

    1. In the Save As dialog, check the Apply transform box.

      Note

      Ensure that the Save data only checkbox is cleared. If Save data only is checked when you apply the transform, then Word discards the formatting stored in the document.

    2. Click Transform...

    3. In the Choose an XML Transform dialog, browse to the Word2FO.xsl style sheet and click Open.

Each of these two methods results in a file with an XML extension. While you may expect something like a .FO extension, the XML extension is intentional. The Word2FO.xsl style sheet creates an XSL-FO document regardless of the resulting extension.
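
Because the file extension does not reveal the format, one way to confirm that a saved file is XSL-FO is to inspect its root element. The following Python sketch does this for XML held in a string; the function name is_xsl_fo and both sample strings are hypothetical and written only for illustration.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def is_xsl_fo(xml_text):
    """Return True if the root element is fo:root in the XSL-FO namespace."""
    root = ET.fromstring(xml_text)
    return root.tag == f"{{{FO_NS}}}root"


# Simplified samples: one XSL-FO root, one WordprocessingML root.
fo_sample = f'<fo:root xmlns:fo="{FO_NS}"/>'
wordml_sample = f'<w:wordDocument xmlns:w="{W_NS}"/>'

print(is_xsl_fo(fo_sample))
print(is_xsl_fo(wordml_sample))
```

To check a file on disk, read its contents and pass them to the same function.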

Word Features Supported by the XSL-FO Style Sheet

The following sections describe the Word features supported in the current version of the XSL-FO style sheet. They do not discuss specific implementation details or techniques. For a more extensive examination of style sheet design and workings, view the transforms in a text editor and read the comments included in the style sheet.

Word Styles

Styles are the recommended tool for altering the appearance of your document. Properly designed styles make your document consistent, simplify maintenance, and provide better results when transferring data to other Microsoft Office applications or converting documents into XSL-FO.

The Word2FO.xsl style sheet supports the use of styles in Word, including paragraph and inline styles, style derivation, and style overrides on specific elements.
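
Style derivation and overrides can be pictured as layered property lookups: a style inherits from the style it is based on, and formatting applied directly to an element wins over both. The following Python sketch models that idea with plain dictionaries. It is not how Word2FO.xsl is implemented; the data layout, the style names, the property values, and the function resolve_style are all assumptions made for illustration.

```python
# Hypothetical style table: each style may name a parent and its own properties.
STYLES = {
    "Normal": {
        "based_on": None,
        "props": {"font-family": "Times New Roman", "font-size": "12pt"},
    },
    "Heading1": {
        "based_on": "Normal",
        "props": {"font-size": "16pt", "font-weight": "bold"},
    },
}


def resolve_style(styles, style_id, overrides=None):
    """Merge properties from the base style down to the element overrides."""
    chain = []
    seen = set()
    current = style_id
    # Walk up the derivation chain; 'seen' guards against circular references.
    while current is not None and current not in seen:
        seen.add(current)
        chain.append(styles[current])
        current = styles[current]["based_on"]

    resolved = {}
    # Apply the most basic style first so derived styles overwrite it.
    for style in reversed(chain):
        resolved.update(style["props"])
    # Direct formatting on the element takes precedence over any style.
    resolved.update(overrides or {})
    return resolved


print(resolve_style(STYLES, "Heading1", {"color": "red"}))
```

In this model, Heading1 keeps the font family of Normal, replaces its font size, and gains the color applied directly to the element.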

Text Formatting

The style sheet provides support for the following text formatting properties:

  • Font attributes. Family, size, weight, slant, sub/superscripts, letter spacing, and color

  • Properties for spans of text. Decorations, case transformations, shadowed text

    Note

    Not all decoration styles are supported.

Paragraph Formatting

The style sheet provides support for the following paragraph properties:

  • Alignment, margins, and indentation

  • Keeping lines of a paragraph together on a page or in a column

  • Forcing a page break before a paragraph

  • Widow and orphan control

    Note

    A widow is the last line of a paragraph printed by itself at the top of a page. An orphan is the first line of a paragraph by itself at the bottom of a page.

  • Borders, padding, and background

    Note

    Not all border styles are supported.
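
Most of the paragraph properties listed above have direct counterparts among the properties of an XSL-FO block. The following Python sketch maps a simplified paragraph description to fo:block attributes. The input keys, the choice of 2 and 1 as line counts for widow and orphan control, and the function name block_attributes are assumptions for illustration and may differ from what the style sheet emits.

```python
# Word uses "both" for justified text; XSL-FO uses "justify".
ALIGNMENT = {"left": "left", "center": "center", "right": "right", "both": "justify"}


def block_attributes(para):
    """Translate a simplified paragraph description into fo:block attributes."""
    attrs = {}
    if "align" in para:
        attrs["text-align"] = ALIGNMENT[para["align"]]
    if "left_indent" in para:
        attrs["margin-left"] = para["left_indent"]
    if "first_line_indent" in para:
        attrs["text-indent"] = para["first_line_indent"]
    if para.get("keep_lines"):
        # keep-together.within-column is the analogous component for columns.
        attrs["keep-together.within-page"] = "always"
    if para.get("page_break_before"):
        attrs["break-before"] = "page"
    # Assumed mapping: control on requires 2 lines, control off allows 1.
    lines = "2" if para.get("widow_control", True) else "1"
    attrs["widows"] = lines
    attrs["orphans"] = lines
    return attrs


example = block_attributes({
    "align": "both",
    "first_line_indent": "0.5in",
    "keep_lines": True,
    "page_break_before": True,
    "widow_control": False,
})
print(example)
```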

Lists

The style sheet provides support for numbered, bulleted, and multi-level lists.

Tables

The following tabular element properties are supported:

  • Horizontal and vertical spans (merged cells)

  • Explicit setting of column widths

  • Borders, padding, and background
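
In XSL-FO, these table properties appear as explicit column elements and as span attributes on cells. The following Python sketch builds a small fo:table with fixed column widths and one horizontally merged cell. The border and padding values and the helper names fo and build_table are assumptions for illustration, not the output of Word2FO.xsl.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)


def fo(tag):
    """Return the namespace-qualified name of a formatting object."""
    return f"{{{FO_NS}}}{tag}"


def build_table(column_widths, rows):
    """Build an fo:table; each cell is a (text, col_span, row_span) tuple."""
    table = ET.Element(fo("table"), {"table-layout": "fixed"})
    # Explicit column widths.
    for width in column_widths:
        ET.SubElement(table, fo("table-column"), {"column-width": width})
    body = ET.SubElement(table, fo("table-body"))
    for row in rows:
        row_el = ET.SubElement(body, fo("table-row"))
        for text, col_span, row_span in row:
            attrs = {"border": "0.5pt solid black", "padding": "2pt"}
            # Merged cells are expressed as spans on the first cell.
            if col_span > 1:
                attrs["number-columns-spanned"] = str(col_span)
            if row_span > 1:
                attrs["number-rows-spanned"] = str(row_span)
            cell = ET.SubElement(row_el, fo("table-cell"), attrs)
            ET.SubElement(cell, fo("block")).text = text
    return table


table = build_table(
    ["2in", "3in"],
    [
        [("Merged heading", 2, 1)],
        [("Left", 1, 1), ("Right", 1, 1)],
    ],
)
print(ET.tostring(table, encoding="unicode"))
```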

Images

The style sheet provides support for inline bitmap graphics, both embedded and linked from an external file.

Note

Floating graphics and vector drawings are not supported in the current version of the style sheet.

Footnotes

Currently, the style sheet provides support for only the simplest footnote-numbering scheme.

Note

Footnote layout may differ from that of Word, especially in multi-column sections.

Hyperlinks

The style sheet provides support for internal and external hyperlinks.

Page Headers and Footers

The style sheet supports the use of page headers and footers. The implementation is subject to a number of limitations. For more information, see Using Page Headers and Footers.

Sections

The style sheet provides support for sections. Not all Word layouts render correctly in XSL-FO format. Specifically, XSL-FO has no provision for support of continuous section breaks. However, some XSL-FO rendering engines provide extensions to the XSL-FO specification in this area. For example, RenderX includes examples of using RenderX XEP software to support continuous section breaks in Word.

Limitations to Transforming Documents from Word into the XSL-FO Format

Both XSL-FO and WordprocessingML are powerful languages for describing richly formatted content. Their scopes overlap but do not match exactly. Therefore, in order to achieve maximum fidelity in layout patterns it is important to have a clear idea about the capabilities and limitations of transforming documents from one format into the other format. The following sections provide an overview of the differences, and some formatting tips for Word users to follow in order to achieve maximum visual fidelity.

Using Tab Marks to Lay Out Text

Many Office users use the TAB key to alter spacing, create indentations, or create table structures. While this method is acceptable with Word, there is no easy way to reproduce the same behavior with XSL-FO because it lacks the equivalent of a tab mark. Although the Word2FO.xsl style sheet provides some methods to represent tab stops, the resulting output may not look exactly like the original Word document. Users are encouraged to use different formatting techniques instead of tab marks. For example:

  • To create first-line indents or negative indents in paragraphs. Replace tab marks by using explicit indentation.

  • To create table structures. Replace tab marks with explicit tables.

  • To create dotted rules in a table of contents, fill-in form fields, and so on. You can approximate this by applying a specific Underline style to a sequence of spaces, though this lacks the "stretch ability" found with the dotted rules in a table of contents formatted using tab marks.

Using Page Headers and Footers

In Word, the size of header and footer areas can change dynamically. The central region of the page increases or decreases when the contents of a side region change. The XSL-FO format has no means to express dynamically changing header and footer areas. In the XSL-FO format, side regions must have fixed dimensions, regardless of their actual contents. Consequently, you must reserve the space for headers and footers manually by adjusting page margins.

  • To adjust page margins. Drag the top or bottom ruler up or down, so that the contents of the header or footer fit within the allotted space.

    Note

    If there are headers or footers with different sizes within a section, you must break the section into multiple sections. Margins must be the same for each page within a section.
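
The fixed dimensions appear in XSL-FO as the extent of the header and footer regions, and the body region needs matching margins so that it does not overlap them. The following Python sketch builds a page master in which the body margins equal the header and footer extents. The page size, the sample values, and the function name page_master are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)


def fo(tag):
    """Return the namespace-qualified name of a formatting object."""
    return f"{{{FO_NS}}}{tag}"


def page_master(name, header_extent, footer_extent):
    """Build a page master with fixed-size header and footer regions."""
    master = ET.Element(fo("simple-page-master"), {
        "master-name": name,
        "page-width": "8.5in",
        "page-height": "11in",
    })
    # The body reserves exactly the space taken by the side regions.
    ET.SubElement(master, fo("region-body"), {
        "margin-top": header_extent,
        "margin-bottom": footer_extent,
    })
    # Header and footer regions have a fixed extent, whatever they contain.
    ET.SubElement(master, fo("region-before"), {"extent": header_extent})
    ET.SubElement(master, fo("region-after"), {"extent": footer_extent})
    return master


master = page_master("section-1", "0.75in", "0.5in")
print(ET.tostring(master, encoding="unicode"))
```

A section whose headers need more room would get its own page master with a larger extent, which mirrors the advice to split such a section in Word.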

Other Limitations to Transforming Documents to XSL-FO Format

The style sheet does not support the following:

  • Floating graphics and vector drawings (VML).

  • Translating complex borders into the XSL-FO format because they are not defined in the XSL-FO specification.

  • Reproducing adjacent continuous sections with different column counts by using XSL-FO elements only.

    Note

    The third-party rendering engine, RenderX, offers an extension element for this purpose. This extension element is used in the Word2FO_ext.xsl style sheet to implement this feature.

  • Translating custom user schema. You must remove user tags before converting the file.

  • Fields other than "PAGE", "PAGENUM", and "REF". These are the only fields that the style sheet supports.

Examining the XSL-FO Style Sheet

The style sheet uses no extension elements so it remains usable by any XSL-FO rendering engine. The style sheet is designed to process WordprocessingML generated by Word. When the actual markup used by Word differs from the specification, the style sheet favors the real implementation, rather than the theory. For more information about specific implementation choices, see the comments within the style sheet.

Structure of the XSL-FO Style Sheet

The style sheet is divided into several sub-style sheets for ease of maintenance and support. The following are brief descriptions of the files:

  • Word2FO.xsl. The main style sheet that defines global constants and processes contents of a Word XML document.

  • pageLayout.xsl. Defines physical page layouts and creates page sequences.

  • elementStructure.xsl. Defines presentation and structure of the resulting XSL-FO document.

  • elementProperties.xsl. Controls translation of properties.

  • profile.xsl. Contains parameters to adjust the text layout of the resulting XSL-FO document.

  • Word2FO_ext.xsl. Contains RenderX extensions to the XSL-FO format. Intended for use with RenderX XEP formatter.

Global Parameters in the XSL-FO Style Sheet

A number of parameters, defined as global constants in profile.xsl, determine the behavior of the style sheet. Table 1 is a list of available global variables and their descriptions. The values are selected to match Word 2003 choices as closely as possible.

Table 1. Description of global variables in the XSL-FO style sheet

Variable                        Description
default-width.list-label        Specifies the default width of list labels
default-font-size               Specifies the default font size of text
default-font-size.list-label    Specifies the default font size of list labels
default-font-size.symbol        Specifies the default font size of Symbol font
default-widows                  Specifies the default value for the widows property of paragraphs
default-orphans                 Specifies the default value for the orphans property of paragraphs
white-space-collapse            Specifies whether to collapse white space characters
default-header-extent           Specifies the default extent of page headers
default-footer-extent           Specifies the default extent of page footers
default-line-height             Specifies the factor for line-spacing calculation
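
To review the values in effect before a conversion, you can list the top-level parameters and variables of a style sheet. The following Python sketch does so for an XSLT string. The sample style sheet is hypothetical: it borrows three names from Table 1, but its values and its use of xsl:param and xsl:variable are assumptions, so check profile.xsl itself for the real declarations. The function name global_settings is also hypothetical.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"

# Hypothetical excerpt; the values are made up for illustration.
SAMPLE_PROFILE = f"""<xsl:stylesheet version="1.0" xmlns:xsl="{XSL_NS}">
  <xsl:param name="default-font-size">10pt</xsl:param>
  <xsl:param name="default-widows">2</xsl:param>
  <xsl:variable name="default-header-extent" select="'0.5in'"/>
</xsl:stylesheet>"""


def global_settings(stylesheet_text):
    """Map each top-level xsl:param or xsl:variable name to its declaration."""
    root = ET.fromstring(stylesheet_text)
    wanted = (f"{{{XSL_NS}}}param", f"{{{XSL_NS}}}variable")
    settings = {}
    for child in root:
        if child.tag in wanted:
            # A value may be given as a select expression or as element content.
            value = child.get("select") or (child.text or "").strip()
            settings[child.get("name")] = value
    return settings


for name, value in global_settings(SAMPLE_PROFILE).items():
    print(name, "=", value)
```

A value taken from a select attribute is reported as the raw XPath expression, so string literals keep their quotation marks.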

Extended Style Sheet for Transforming Word Documents to an XSL-FO Format

In addition to the Word2FO.xsl style sheet for the general transformation, RenderX includes an additional style sheet that supports several RenderX extensions to the Extensible Stylesheet Language (XSL) Version 1.0 Specification. You can use this additional style sheet, Word2FO_ext.xsl, in place of the general style sheet to produce an XSL-FO formatted document for further processing with RenderX XEP software. The additional features in this style sheet include:

  • Mapping document properties from the Word document into meta information in the output file.

  • Support for continuous section breaks as continuous flow sections in the output document. This enables such features as supporting different column layouts on a single page.

  • Mapping outline levels on section headings to PDF hierarchical bookmarks.

Conclusion

XSL-FO and WordprocessingML, when used together, create an extremely useful tool for transforming Word documents into other standard formats. XSL-FO makes it possible for an increasing number of systems to work with Word by enabling easy conversion into formats such as PDF or HTML. Check out RenderX or CambridgeDocs XML Conversion and Publishing Technologies if you would like to learn more about companies with products that take advantage of XSL-FO and WordprocessingML.

Additional Resources

The following is a list of additional resources to assist you when developing custom solutions using these style sheets.