Dive into SpreadsheetML (Part 1 of 2)
Microsoft Office Excel Professional 2003
Microsoft Office Excel Enterprise 2003
Microsoft Office 2003 Edition XML Schema References
Summary: SpreadsheetML is an XML dialect developed by Microsoft to represent information in an Excel workbook. This column, the first of a two-part series, explores the concept, schema definition, and some scenarios in which it makes sense to programmatically generate Excel workbooks using Office 2003 and SpreadsheetML. (10 printed pages)
Have you ever thought about all the data that is stored in Microsoft Office Excel workbooks around the world? Excel workbooks are an extraordinary data store, but they also offer great potential as a resource of dynamic information that can move across data-management workflows and business processes.
The future of Microsoft Office development is moving toward greater support to integrate data with documents and workbooks, and it is important for developers to understand the XML story. I have been talking with some developers who are exploring development possibilities offered by Excel, and I decided to write this column for all of you who want to dive into SpreadsheetML to learn more about the concept, recommended usage scenarios, and the steps you need to follow to programmatically create Excel workbooks.
Because this is a long topic, I decided to split this column into two parts:
- This article, Part 1, explores the concept and the schema definition of SpreadsheetML. Part 1 also explains some scenarios in which you might consider using SpreadsheetML to programmatically generate Excel workbooks.
- Dive into SpreadsheetML (Part 2 of 2) presents a business scenario and explores how to generate an Excel workbook from scratch.
SpreadsheetML is an XML dialect developed by Microsoft to represent the information in an Excel workbook. SpreadsheetML allows you to save Excel workbooks as XML documents and to open them in Excel. Microsoft created a format that allows you to save, in an XML-based file, almost every Excel customization (including formulas, data, and formatting). Microsoft created SpreadsheetML to represent core spreadsheet models, so the following embedded objects cannot be represented using XML:
- Drawing shapes or AutoShapes
- OLE objects
- Microsoft Visual Basic for Applications (VBA) projects
- Groups and outlines
Note I strongly recommend downloading the Microsoft Office 2003 Edition XML Schema References to explore the schema references and read an overview of SpreadsheetML (Excel 2003).
In this column, I start by walking you through a simple "Hello World" sample to explain what kind of information is stored inside Excel when you save your files as SpreadsheetML. To accomplish this task, you will create a simple Excel file, save it as an XML Spreadsheet file, and open it in a text editor to review the structure of the file.
To create a simple SpreadsheetML sample file
- Start Excel.
- Type Hello World into cell A1, as shown in Figure 1.
- Type 1 into cell B1, as shown in Figure 1.
- Type 2 into cell B2, as shown in Figure 1.
- Type the formula =SUM(B1:B2) into cell B3, as shown in Figure 1.
Figure 1. Create a simple SpreadsheetML sample file
- On the File menu, click Page Setup.
- On the Page tab, under Orientation, select the Landscape option, as shown in Figure 2.
Figure 2. Select Landscape page setup
- On the File menu, click Save As.
- In the Save as type list, select XML Spreadsheet (*.xml).
- In the File name box, type a new name for the workbook (for example, SpreadsheetMLDemo.xml), as shown in Figure 3.
Figure 3. Save the file as an XML Spreadsheet
- Click Save.
After you create the XML Spreadsheet file, you can open it by using a text editor program.
To open and explore the XML Spreadsheet file
- Open a text editor program, such as Notepad, and then open the SpreadsheetMLDemo.xml file.
The first line that appears in the file represents the XML declaration that all XML files must have:
The second line is a processing instruction that defines the document as an Excel Spreadsheet file:
The third line introduces the Workbook root element. This element stores characteristics and properties of the workbook, such as the namespaces used in SpreadsheetML.
<Workbook xmlns="urn:schemas-microsoft-com:office:Spreadsheet" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:ss="urn:schemas-microsoft-com:office:Spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
Table 1 shows the namespaces that appear by default.
Table 1. SpreadsheetML namespaces
|urn:schemas-microsoft-com:office:Spreadsheet||ss||This namespace contains the elements and attributes related to basic spreadsheet functionality.||XML Spreadsheet 2002|
|urn:schemas-microsoft-com:office:office||o||This namespace defines the shared office schema and document properties collection (such as author, creation date) and document statistics (such as word count).||Common Properties|
|urn:schemas-microsoft-com:office:excel||x||This namespace defines elements and attributes used to describe more complex features of Excel, such as PivotTables, worksheet options, and validation.||XML Spreadsheet 2000|
|http://www.w3.org/TR/REC-html40||This namespace references the W3C HTML 4.01 specification.|
Note Depending on the Excel Spreadsheet file that you create, the namespaces might change. For a complete reference of namespaces that can appear in SpreadsheetML, please review the Microsoft Office 2003 Edition XML Schema References.
The first child element of the WorkBook element is DocumentProperties. Office documents store metadata related to the document—for example, the author name, company, creation date, and more. The XML representation of Excel workbooks stores this metadata in the DocumentProperties element.
<DocumentProperties xmlns="urn:schemas-microsoft-com:office:office"> <Author>Erika Ehrli Cabral</Author> <LastAuthor>Erika Ehrli Cabral</LastAuthor> <Created>2005-11-09T23:14:06Z</Created> <Company>Microsoft Corporation</Company> <Version>11.6568</Version> </DocumentProperties>
The next element, ExcelWorkbook, represents workbook-level characteristics and properties of the document.
<ExcelWorkbook xmlns="urn:schemas-microsoft-com:office:excel"> <WindowHeight>12660</WindowHeight> <WindowWidth>15180</WindowWidth> <WindowTopX>480</WindowTopX> <WindowTopY>105</WindowTopY> <ProtectStructure>False</ProtectStructure> <ProtectWindows>False</ProtectWindows> </ExcelWorkbook>
The next element, Styles, represents information related to individual styles that can be used to format components of the workbook.
<Styles> <Style ss:ID="Default" ss:Name="Normal"> <Alignment ss:Vertical="Bottom"/> <Borders/> <Font/> <Interior/> <NumberFormat/> <Protection/> </Style> </Styles>
The next element is the Worksheet element, the heart of Excel XML workbooks:
Within a Worksheet element, a hierarchy of elements defines a worksheet:
Table Column Row Cell Data
As you can see in this example, the first child element is Table. This element keeps all the information related to the table that belongs to the worksheet and holds the row and column elements. The Table element has attributes that define the column count (ss:ExpandedColumnCount) and row count (ss:ExpandedRowCount) of the table.
Nested inside the Table element, you see a Column element that defines column-level properties and a Row element for each row of cells defined in the worksheet. The Row element contains a Cell child element for each cell of a row. The Cell element contains information for an individual cell—for example, the data type and the value stored in a cell.
<Table ss:ExpandedColumnCount="2" ss:ExpandedRowCount="3" x:FullColumns="1" x:FullRows="1"> <Column ss:AutoFitWidth="0" ss:Width="60"/> <Row> <Cell><Data ss:Type="String">Hello World</Data></Cell> <Cell><Data ss:Type="Number">1</Data></Cell> </Row> <Row> <Cell ss:Index="2"><Data ss:Type="Number">2</Data></Cell> </Row> <Row> <Cell ss:Index="2" ss:Formula="=SUM(R[-2]C:R[-1]C)"><Data ss:Type="Number">3</Data></Cell> </Row> </Table>
After the Table element is the WorksheetOptions element, which stores information related to the operation and presentation of the worksheet. When you built this sample file, you set the page orientation to Landscape. This kind of information appears inside the WorksheetOptions element.
<Worksheet> <WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel"> <PageSetup> <Layout x:Orientation="Landscape"/> </PageSetup> <Print> <ValidPrinterInfo/> <HorizontalResolution>600</HorizontalResolution> <VerticalResolution>600</VerticalResolution> </Print> <Selected/> <ProtectObjects>False</ProtectObjects> <ProtectScenarios>False</ProtectScenarios> </WorksheetOptions> </Worksheet>
You see a Worksheet element for each worksheet stored in the Excel workbook. By default, new Excel files always contain three worksheets. For this reason, you see three Worksheet elements inside your file.
<Worksheet ss:Name="Sheet2"> <WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel"> <ProtectObjects>False</ProtectObjects> <ProtectScenarios>False</ProtectScenarios> </WorksheetOptions> </Worksheet> <Worksheet ss:Name="Sheet3"> <WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel"> <ProtectObjects>False</ProtectObjects> <ProtectScenarios>False</ProtectScenarios> </WorksheetOptions> </Worksheet>
The last element is the closing tag for the Workbook element:
In a production environment, Excel workbooks are more complex than the one I just used to explain the content of an XML Spreadsheet file. The important thing is that you understand the kind of information stored in the Workbook, Worksheet, Row, Cell, and Data elements. With this understanding, you can either create an Excel file or extract information stored in an XML Spreadsheet file to process it with extra business logic and then save it in a different data store.
I have seen articles, white papers, blogs, and code samples that explore different approaches to generate Excel workbooks or to extract data from workbooks. As a developer, I am faced with a common question: What is my best option? Here is a list of available options you can use to insert or extract data from an Excel workbook:
- Define an Excel workbook as an OLE DB data source, and extract or pump information by using OLE objects such as OleDbDataReader and OleDbCommand.
- Use the Excel Primary Interop Assemblies and ADO.NET objects such as DataSet and DataTable to extract and pump information from and to Excel workbooks. For more information, see Understanding the Excel Object Model from a .NET Developer's Perspective.
- Use Microsoft Office Excel 2003 (Professional and Enterprise editions) as a user interface to build Excel workbooks that load data from well-formed XML files.
- Use the XML support provided by Microsoft Office XP and Microsoft Office 2003 to save, create, and open Excel workbooks using SpreadsheetML.
There are certain scenarios in which it makes sense to choose one approach over all others. An XML Spreadsheet file is a text document, and many tools across a variety of platforms support XML technologies. You might consider using SpreadsheetML for integrating data into Excel if:
- You have a Web-based application or Web service that needs to repeatedly extract data from Excel files to process information or store it in a database or repository.
- Your application needs to use a text-based messaging service, such as e-mail, or a mobile device.
- Your application needs to generate XML-based files that will be processed by other applications (such as Microsoft BizTalk Server) or other cross-platform systems.
- You need to avoid the use of Automation by way of COM or primary interop assemblies to connect to Excel workbooks in order to insert and extract information.
- You need to be able to open the files in a text-based editor for further processing.
Figure 4 shows how you can build tools that use SpreadsheetML to extract data from sources such as XML documents, databases, and different systems and applications to generate Excel workbooks. You can also build tools to extract data from Excel workbooks to process and transform information and send it to messaging systems (e-mail and mobile devices), other XML documents, databases, and cross-platform systems and applications.
Figure 4. Generating Excel workbooks and extracting data from an Excel workbook
Just as Excel has an XML definition for workbooks (SpreadsheetML), Microsoft Office Word 2003 has an XML definition for Word documents (WordprocessingML). You can download and explore in detail the Microsoft Office 2003 Edition XML Schema References to explore XML schemas for Microsoft Office 2003 Editions, including Microsoft Office Word 2003, Microsoft Office Excel 2003, Microsoft Office InfoPath 2003, Microsoft Office Visio 2003, Microsoft Office OneNote 2003, Microsoft Office Project 2003, and Microsoft Office Research Services.
You can also download the Office "12" XML Schema Reference - PDC 2005 Preview to review the documentation about a selection of Microsoft Office 2007 XML schemas and learn more about updates to the schemas.
In Dive into SpreadsheetML (Part 2 of 2), I explain how to create an Excel file from scratch by using SpreadsheetML.
For more information about the tools referenced in this article, see these resources:
- Office 2003 XML Reference Schemas
- Office 2003 XML Reference Schemas Frequently Asked Questions
- Excel 2003 Tool: Excel XML Toolbox for Microsoft Office Excel 2003
- Microsoft Office Developer Center: XML in Office Developer Portal
- Brian Jones' Blog: Some Background Information on the Reasons We Have Moved to an XML Format as the Default in Office "12"
- Article from Simon St. Laurent: 2+2=4, in SpreadsheetML
I would like to thank Frank Rice and Chad Rothschiller for their contributions to this article.