Transform XML Files with XSLT When Importing into Microsoft Excel 2002
Frank C. Rice
Microsoft® Excel 2002
Summary: This article explains the Microsoft Excel 2002 XML Spreadsheet format and how XSLT can be used to transform XML data imported into Excel to another format. Examples guide the reader through creating an XML spreadsheet file and changing the format by using an XSLT style sheet. (15 printed pages)
Displaying XML Data in Excel
Creating an XML Data File
Displaying Data in the XML Spreadsheet Format
Importing Data by Using a Style Sheet
An Example of an XSLT Style Sheet
Microsoft® Excel 2002 supports loading and saving Extensible Markup Language (XML) files by using a default XML Spreadsheet (XMLSS) format. XMLSS is a format that is common to both Excel 2002 and the Microsoft® Office XP Spreadsheet Component, so files in XMLSS may be shared between the two. You can also create your own XSL Transformations (XSLT) style sheets for XMLSS files to transform the data to another format so that it can be shared with any number of applications.
In addition, Excel also supports the use of Extensible Stylesheet Language (XSL) and XSLT files when importing data to transform XML data into other formats for display in Excel.
This article explains the XMLSS format and provides an example of its use by importing a sample XML data file from Microsoft Access 2002. It will also demonstrate the steps to use XSL to format the XML and the steps to use XSLT to transform the XML data into another format during the import process.
This article assumes that you have a basic understanding of XML and the terms used to describe XML-related files. A glossary of common XML-, XSL-, and XSLT-related terms is provided at the end of this article. In addition, the MSDN Online XML Developer Center is a great resource for information on XML.
XML is a markup language used for describing structured data (such as that in a worksheet) so that it can be read by a variety of applications. Designers can create customized XML elements, enabling the definition, transmission, validation, and interpretation of data between applications and between organizations. This versatility provides many opportunities for data interchange, but it also has at least one inherent challenge. By manipulating the elements of an XML document, designers can create any number of nested element dimensions within a document. These parent/child element relationships can lead to ambiguity in the two-dimensional row-and-column paradigm of an Excel worksheet.
To deal with this ambiguity, Excel forces a format to any XML data it receives, unless the data comes with a style sheet that predefines another format. This forced format is created by using a flattening algorithm to populate the rows and columns that comprise a worksheet. Yet while this format is effective in forcing multidimensional data into a two-dimensional format, it doesn't always present the data in the most optimal format for a human reader.
So how does Excel know when to use either the flattening algorithm or the XSLT style sheet? When Excel opens or imports an XML data file, it looks for an element inside the file that points to an XSLT style sheet. If that element is present, Excel displays a dialog box that prompts you to apply the style sheet or to open the file without a style sheet.
Figure 1. The Import XML dialog box
If you choose to apply a style sheet, the XSLT file will format and display the data based on the instructions in the file.
If there is no style sheet element or you choose not to apply the XSLT style sheet, Excel then uses the flattening algorithm to display your data. Let's apply the flattening algorithm format to some actual data.
First, we'll create the Employees.xml data file by exporting the Employees table from the sample Northwind database in Microsoft® Access 2002. The following steps explain how to create the Employees.xml file. The file is also available in the download accompanying this article.
- Start Access 2002, and then open the sample Northwind database. You can find the Northwind database on your computer by clicking Start, pointing to Search, and then selecting For Files or Folders. Enter Northwind.mdb and then click Search Now. If the Northwind database isn't found on your computer, you can install it from the same location you used to install Microsoft Office XP.
- In the Database window, click Tables. Click the Employees table, and then on the File menu, click Export.
- In the Save as type list, click XML Documents.
- Click the arrow to the right of the Save in list and select the drive or folder to export to.
- In the File name list, type Employees (or keep the default), and then click Export.
- In the Export XML dialog box, make sure the Data (XML) box is checked. Uncheck the options for Schema of the data and Presentation of your data (XSL) if they are checked.
- Click OK to export the data to an XML document.
- To view the data, open the file in a text editor such as Notepad.
<?xml version="1.0" encoding="UTF-8"?> <dataroot xmlns:od="urn:schemas-microsoft-com:officedata"> <Employees> <EmployeeID>1</EmployeeID> <LastName>Davolio</LastName> <FirstName>Nancy</FirstName> <Title>Sales Representative</Title> <TitleOfCourtesy>Ms.</TitleOfCourtesy> ... </Employees> ... </dataroot>
Looking at the XML data fragment, you see a series of HyperText Markup Language (HTML)-like elements that describe a group of Northwind Traders employees. Although it may look like just a long list of elements enclosing data, there is a structure to the elements in the document. You can also look at this document as a tree structure containing a hierarchy of elements (or nodes), as illustrated in Figure 2:
Figure 2. Employee data as a tree structure
The EmployeeID, FirstName, and LastName elements are children of the Employees element, which is then a child of the dataroot element. In other words, the descriptive elements such as EmployeeID and FirstName are nested inside of the Employees element, which is nested inside of the dataroot element. In the absence of a style sheet, this is the layered-structure of elements that Excel must interpret in order to bring the data in to a worksheet.
Now let's see what the file looks like in the flattened format inside Excel.
Import the Employees.xml data file into Excel 2002 using the following steps:
- Start Excel 2002 and create a new workbook.
- Import the Employees.xml file by pointing to Import External Data on the Data menu and then clicking Import Data. In the Select Data Source dialog box, select XML Files in the Files of type list, locate the Employees.xml file, and then click Open.
- Click Open in the Import Data dialog box.
Figure 3. Employee data in the flattened format
Figure 3 illustrates how Excel interpreted our nested data to fit its two-dimensional model. As you can see, although all of the information is present, the display is not very pleasant.
The first element (cell A1) is the document element and is used like a document title: /
Beneath this cell, we see the children of the Employees element spread across several columns starting with the Address element:
Excel spreads elements that contain nested elements across multiple columns. Likewise, for each element that contains multiple sets of data, that data is spread across multiple rows. Also notice that the nested elements are sorted alphabetically across the columns instead of in the order that they appear in the Employees.xml file. If we had additional nested layers of elements or data, those in turn would also be spread across multiple columns or rows.
The following example shows how Excel displays two nested XML data islands:
<Sauces> <Ingredients> <Ingredient1>Water</Ingredient1> <Ingredient2>Flour</Ingredient2> </Ingredients> </Sauces>
Figure 4. Nested elements
<Sauces> <Ingredients> <Ingredient>Water</Ingredient> <Ingredient>Flour</Ingredient> </Ingredients> </Sauces>
Figure 5. Nested data
Continuing to scroll across our worksheet, we see a #agg column:
With the #agg designation, Excel sees the column of data as a series of numbers and tries to perform an aggregation on the values. As an example, for values that apply to multiple rows (multiple sale items) in an order, Excel would attempt an Auto Fill of the data by including the number in the rows where it applies. If Excel can't determine how to apply the value, as in this case, it just lists the value of the element.
Now let's look at using a style sheet to improve the presentation of our XML data.
Style sheets can provide significant improvements in the presentation of your XML data in Excel. For example, you can control how data is placed in cells and even provide some formatting for the cells. To use an XSLT style sheet, you'll need to add a style sheet declaration to your XML data file. A sample XSLT style sheet (SortNames.xsl) is included in the download available with this article. The code for the style sheet is also presented later in this article.
To use the SortNames.xsl file when importing the Employees.xml data file into Excel, complete the following steps:
- Open the Employees.xml data file with a text editor such as Notepad.
- Create a blank line after the first line in the document. Copy and paste the following line of text into the blank line:
<?xml-stylesheet type="text/xsl" href="SortNames.xsl"?>
This line tells Excel to apply the SortNames.xsl style sheet when importing the Employees.xml file.
- Save your changes to Employee.xml.
- Create a new workbook in Excel and import the Employee.xml file. The data should open in your worksheet to display a list of employee information sorted by last name.
Figure 6. Employees.xml data displayed with a style sheet
As you can see, this display is much more pleasant and effective than the previous example. To see how we accomplished this transformation, we'll take a detailed look at XSLT and the style sheet we used.
XSLT uses a combination of HTML elements and template rules to restructure and reformat the output file. These template rules are triggered when the pattern of an element in the input XML data file matches the pattern in a template rule. Templates within the XSLT style sheet also contain the instructions used to locate parent and child elements by using XML Path Language (XPath) expressions.
XPath is a language used for addressing parts of an XML document and is designed to be used by XSLT. A detailed discussion of XPath is beyond the scope of this article, but additional information can be found at the MSDN Online XML Developer' Center. XPath expressions are mainly used to identify and navigate the elements and items in a tree structure, allowing us to process them with XSLT instructions.
To illustrate, let's look at the XSLT style sheet file (SortNames.xsl) we used to display our data:
<?xml version="1.0" encoding="UTF-8"?> <!-- This style sheet is used with the Employees.xml file exported from Microsoft Access to produce a listing of employee names, addresses, and phone numbers in Microsoft Excel. Used with: Employees.xml Output: Table --> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"> <!-- Output as an HTML file. --> <xsl:output method="html"/> <xsl:template match="/"> <!-- Create a table with headings, then search for children of the Employees element. Once processing is completed, sort the table by the LastName field. --> <TABLE> <TR> <TH>Name</TH> <TH>Address</TH> <TH>Home Phone</TH> </TR> <xsl:apply-templates select="dataroot/Employees"> <xsl:sort select="LastName" order="ascending"/> </xsl:apply-templates> </TABLE> </xsl:template> <!-- Process and output the child elements of the Employees element. --> <xsl:template match="dataroot/Employees"> <TR> <TD> <xsl:value-of select="FirstName" /> <xsl:text> </xsl:text> <xsl:value-of select="LastName" /> </TD> <TD> <xsl:value-of select="Address" /> <br/> <xsl:value-of select="City" /> <xsl:text>, </xsl:text> <xsl:value-of select="Region"/> <xsl:text> </xsl:text> <xsl:value-of select="PostalCode"/> </TD> <TD> <xsl:value-of select="HomePhone" /> <!-- Test whether the phone number has an extension. If not, skip the template instruction. --> <xsl:if test="string-length(Extension)>0"> <xsl:text> Ext </xsl:text> <xsl:value-of select="Extension" /> </xsl:if> </TD> </TR> </xsl:template> </xsl:stylesheet>
The initial section of the style sheet, starting from the XML declaration to the first template instruction, is sometimes referred to as the prolog. XSL and XSLT are XML-based languages, so it follows that their style sheets (or transformation files) must follow the rules for XML. Accordingly, the first statement of the style sheet contains a standard XML declaration statement including the encoding format and version number.
<?xml version="1.0" encoding="UTF-8"?>
Comments are inserted throughout the document and begin with the <!--
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
The xmlns:xsl attribute defines a namespace prefix for use by the objects in the style sheet. Namespaces provide a context for elements in an XML document. For example, by using a namespace, you can tell the processor that our use of the Table element signifies a data structure with rows and columns. Like XML documents, XSL/XSLT style sheets are required to have a version number as seen in the attribute in the xsl:stylesheet element.
Next comes the xsl:output element, which is used to help the processor identify and control the format of the output.
Then we come to the first in a series of template elements.
Each XSL template rule is executed when a particular part of the source document is encountered. In this case, the template element contains an attribute called match. The value of this attribute is a pattern that matches the particular node or element in the input tree to which the template rule should be applied. This specific attribute contains the value "
/", which matches the root node of the input file. The root node is the top-most node in a tree and always contains just one child element representing the document element. The document element is the outermost element in an XML document and contains all of the other elements in the document. In an Access 2002 XML file, the document element is the dataroot element that contains all of the other elements in the file. As we'll see later in this article, the "
/" symbol also provides a means of navigating through the nodes in a tree structure by using XPath expressions.
Next, we see a series of HTML elements that create a table to store our data.
<TABLE> <TR> <TH>Name</TH> <TH>Address</TH> <TH>Home Phone</TH> </TR> .... </TABLE>
HTML elements can be used in your style sheets to format the display of the data.
More Template Rules
Next come XSL instructions that identify and process the remaining elements in our input document.
<xsl:apply-templates select="dataroot/Employees"> <xsl:sort select="LastName" order="ascending"/> </xsl:apply-templates>
The xsl:apply-templates element defines a set of elements to process by selecting matching template rules. The select attribute contains the XPath expression "
dataroot/Employees" that navigates through the elements in the input tree structure.
Tip The "
dataroot/Employees" expression could have been shortened to "
//Employees". The two forward slashes are XPath shorthand representing the root and succeeding levels, in this case, the dataroot element.
So what exactly does
<xsl:apply-templates select="dataroot/Employees"> mean? It says "from the current location (the root node based on the previous element
<xsl:template match="/">), select the children of the dataroot/Employees element, and for each one, find the matching template rule and process it." The specific template rules will be discussed shortly.
Note If xsl:apply-templates is invoked and there is no matching template rule, a default template is used instead. Determining which default template rule is used depends on the node type being processed. For example, the default template rule for an element-type node tells the processor to "process all of the children of the current node and copy their data value to the output." In the instruction
<xsl:apply-templates select="dataroot/Employees">, if no template rule is found, all of the children of the Employees element would be processed. In other words, the instructions for the LastName, FirstName, and Address elements would be processed and any data they contain would be copied to the output.
Continuing with our example, the next statement is used to specify a sort key, in this case, the LastName
<xsl:sort select="LastName" order="ascending"/>
Notice that we have also included an order="ascending" attribute to explicitly define the sort order (also the default sort order). Following the sort instructions are closing HTML TABLE elements.
Then comes a template rule which matches the pattern from the
<xsl:apply-templates select="dataroot/Employees"> element discussed earlier.
Again, the pattern determines which nodes in the input tree the template rule will process. In this case, all of the child nodes of the Employees element are processed.
Note If more than one rule matches the node pattern, then XSLT will use conflict resolution. For example, rules have a priority value assigned to them, either by the user with a priority attribute in the xsl:template element, or by a system default value. For system-defined priorities, the more specific the pattern is, the higher the rule value. In our example, the pattern "
dataroot/Employees" would receive a higher priority value than would "
*", which represents any element in the current location.
Since the template rule matches the pattern in the xsl:apply-templates element, we start by processing the body of the template. First, HTML TABLE elements are output and then the following instructions are processed.
<xsl:value-of select="FirstName" /> <xsl:text> </xsl:text> <xsl:value-of select="LastName" />
The first instruction evaluates the XPath expression select="FirstName" and writes its data to the output based on the xsl:value-of element. The next instruction
<xsl:text> </xsl:text> is used to output literal text to the result tree, in this case a space between the FirstName node and the LastName node.
Tip Another way to copy literal text to the output is simply to include it in the line with the instructions. For example, in the fragment
<xsl:value-of select="PhoneNumber" /> Ext <xsl:value-of select="Extension" />
The text "Ext" would be copied to the output between the text of the PhoneNumber element and the Extension element. By using xsl:text, you can easily include white space around text.
Next, the following XSLT instructions perform essentially the same function as described earlier.
<xsl:value-of select="Address" /> <br/> <xsl:value-of select="City" /> <xsl:text>, </xsl:text> <xsl:value-of select="Region"/> <xsl:text> </xsl:text> <xsl:value-of select="PostalCode"/>
The various child elements of the Employees element are selected and their values are copied to the output tree. In the case of the Address
Next we see an xsl:if element being used to test whether the employee has a phone extension. If the employee does not have a phone extension, the xsl:if element bypasses the instruction processing the Extension element.
<xsl:if test="string-length(Extension)>0"> <xsl:text> Ext </xsl:text> <xsl:value-of select="Extension" /> </xsl:if>
An xsl:if element works just like the typical
if-end-if control statement found in many programming languages; it processes the instructions encapsulated in its loop only if the condition specified in the test attribute is true. In this case, we used a built-in function called string-length to determine if the Extension element contained any characters. If the expression passed to the function was empty, it returns a zero. If test is zero, the test evaluates as False and the loop is bypassed.
Transformation Process Summary
To summarize the use of XSLT in the transformation process:
- XSLT uses a combination of HTML elements and template instructions to restructure and reformat the output file.
- Template rules contain XPath expressions used to locate parent and child elements. Template instructions specify how to process the input. Much of the template body consists of HTML elements for the output.
- Instructions are usually placed in the order that the data is to be output, regardless of the actual order of the elements in the input XML data file.
- Conditional programming instructions allow you to further filter and process the data.
In this article, we introduced the flattening algorithm used by Excel to structure XML data in the absence of another style sheet. We also looked at the XSLT markup language and described its file structures and a few processing instructions. We discussed its use in transforming XML data files from one format to another. These concepts will provide a starting point for you to build your own XSLT files and use them in Excel.
For a more in-depth discussion of the XML concepts presented in this article, see the MSDN Online XML Developer Center or the W3C Extensible Markup Language (XML) Web site.
For more information about XSL and XSLT, see the XSL Developer's Guide on MSDN or the W3C Extensible Stylesheet Language (XSL) Web site.
This glossary defines some of the technical terms presented in this article. For more information on any of these terms, see the links in the "Additional Resources" section above.
- Attribute: A name/value pair associated with an element such as
- Comment: An annotation that the XML/XSLT processor ignores. Written between the delimiters
- Document Element: The outermost element of a document. The document element contains all of the other elements. Also known as the root element.
- Element: A discrete unit within an XSLT style sheet delimited by start and end tags such as
- Expression: An XPath construct that can be evaluated to a string, a number, a Boolean, or a set of elements. Expressions are used in many contexts such as the select attribute of the xsl:templates element.
- Namespace: A namespace insures uniqueness by providing context for elements and attributes.
- Node: An element in a tree structure.
- Output Method: An attribute that defines the format of the output. XSLT defines three output methods, XML, HTML, and text.
- Pattern: A construct that defines a condition that every node either satisfies or does not satisfy. Used as XPath expressions in XSLT elements such as xsl:template.
- Processing Instruction Nodes: Nodes containing processing instructions. Processing instructions carry instructions to the XML parser. Written between the delimiters
?>such as the
<?xml version="1.0" encoding="UTF-8" ?>declaration in an XML document.
- Root Element: See Document Element.
- Root Node: The top-most node in a tree structure. If the tree structure represents a well-formed XML document, the root node will have exactly one element node, representing the document element, and no child nodes. It may also contain comment nodes and processing instructions nodes as children.
- Template Body: A sequence of XSLT instructions, HTML elements, and text forming the contents of an xsl:template element.
- Template Rule: An xsl:template element with a match attribute. A template rule may be invoked by using the xsl:apply-templates element.
- Tree Structure: An abstract data structure presented hierarchically with a root node and branches of elements representing the contents of an XML document. The tree structure always has a single root node.
- URI: Uniform Resource Identifier: a generalization of the URLs (Uniform Resource Locators) used to uniquely identify resources such as addresses on the Internet.
- Valid XML Document: A document is valid if, in addition to meeting the criteria of a well-formed XML document, it also contains a declaration that points to a schema, which defines the structure of the document (such as a XSL/XSLT style sheet). To be valid the document must also adhere to the structure defined in the schema.
- Well-Formed XML Document: A document is well-formed if it follows the syntax rules in the XML specification. These include such rules as there must be a single outermost element that encloses all other elements and all opening tags must have closing tags.
- XML Spreadsheet Format: The XML format that Microsoft Excel applies to data when that data is saved as XML.