How Excel 2003 Infers XSDs When Importing XML Data

 

Frank Rice
Microsoft Corporation

February 2005

Applies to:
    Microsoft Office Excel 2003

Summary: Learn the rules and instances where Microsoft Office Excel 2003 infers schema from XML data during import and export. Understand why Excel creates the schemas it does and learn to modify your own XML data to create a schema that fits your needs. (6 printed pages)

Contents

Introduction
General Rules
Rules Applying to Double-Byte Numbers
How Excel Handles XSD Data Types
Unsupported XML Schema Constructs
XSD Processing Rules
Exceptions to Processing Rules
Effect of Multiple Namespaces on Importing and Exporting XML Data
Conclusion
Additional Resources

Introduction

You can import and export XML data into Microsoft Office Excel 2003. In both cases, you can use a schema (XSD) file to define the structure of the resulting file. However, when no schema is explicitly stated during these operations, Excel creates or infers one. This article reviews the rules and what is inferred in a specific instance of XML data.

General Rules

The following rules are applicable when importing and exporting XML data to Excel.

  • If every data value of an element conforms to a pattern then the corresponding data type is inferred in the schema. For example, "abcdef" is inferred as String; "12345" is inferred as numeric.
  • If every data value of an element does not conform to a single pattern, then the String data type is inferred in the schema and passed to the grid in such a way that Excel treats numbers as numbers and strings as Strings.
  • Elements with xsi:nil="true" are ignored when inferring schema from XML data. The attribute nil is a Boolean value that indicates if an element should contain content. If the value is True, the element must be empty. If the value is False, the element cannot be empty.
  • Leading and trailing white space is ignored. For example, " 11", "11 ", and " 11 " are treated as numbers. "1 1", " 1 1", " 1 1 ", and "1 1 " are all treated as strings (stripping out leading and trailing white space for the second, third, and fourth results in the first, which Excel does not understand as a number).

Rules Applying to Double-Byte Numbers

Excel only infers a numeric data type for an element when all of the characters in the data for that element are XSD numeric characters. That means that double-byte numbers are inferred as String data type (see the exception noted below). Double-byte numbers provide support for many different East Asian language alphabets, such as Chinese, Japanese, and Korean. ASCII characters are only one byte in length, whereas Japanese, Korean, and other East Asian characters are two bytes in length.

In Extended ASCII (EA) Mode (InstallLanguage = JPN, KOR, CHS, CHT), Excel converts double-byte numbers imported or entered into the grid to numbers. In non-EA Mode, Excel treats double-byte numbers as strings.

How Excel Handles XSD Data Types

The following table lists the XSD data types and describes how Excel handles them:

Table 1. XSD data types

XSD Simple Type Examples Comments
String "AbCdEf" Supported and used as is.
Integer -1, 12678967543233 Supported and used as is.
Decimal 1.23, 123.4, 1000.00 Not supported.
Double -INF, -1E4, -0, 0, 12.78E-2,INF, NaN, 1.23,-1.23,123.4 Supported.

INF – infinity; NaN – Not a number. Each of these have a binary representation in Excel. The XML Mapper converts INF and NaN to binary, and passes that to Excel, and converts it back during export. Excel displays as much of the number as the cell width allows.

Boolean true, false, 1, 0 Supported.

If the data element contains only ones and zeros Boolean is not inferred. If the data element contains only True or False, with optional ones and zeros, Boolean is inferred.

Time 13:20:00, 13:20:00Z-05:00 (1.20pm Eastern Time, which is 5 hours behind Coordinated Universal Time (UTC) Supported.

Excel format: H:MM:SS

Fractional seconds are optional and supported. For example, Excel infers 13:20:00 as a Time value. Also, Excel infers 13:20:00.001 as Time. Excel supports fractional seconds in the grid for exporting as well.

Note   Eastern Time is five hours behind Coordinated Universal Time. Excel supports everything except the UTC notation "-05:00", and "Z". The "Z" follows the decimal seconds, indicating that the specified time is UTC. Excel formats and infers the time component as date/time, but treats values containing "Z" and UTC differences as a String in a date/time cell. Excel does not incorporate the UTC difference calculation into the data.

On export, the XML Mapper reconstructs the date/time Variant value into the proper XSD format.

Date 1999-05-31 Supported.

Excel format: YYYY-MM-DD

Excel infers the notation "1/02/2005" as a String on import. If, in the grid, you format this cell as a Date, Excel exports the displayed value. For example, if you format to mm/dd/yyyy, Excel exports 01/02/2005. If you format the cell to m/d, Excel exports 1/2.

Excel treats dates before 1900 and after 9999 in the grid as a String, but formats and infers them as date/time, per the XSD data type.

DateTime 1999-05-31T13:20:00Z-05:00 (May 31st 1999 at 1.20pm Eastern Time, which is 5 hours behind Coordinated Universal Time.) Excel format: M/D/YYYY H:MM

As with the Time format, Excel supports everything except the UTC notation "-05:00", and "Z". In this format, "Z" follows the decimal seconds, indicating that the specified time is Coordinated Universal Time. Values containing "Z" and Coordinated Universal Time differences are treated as a String.

gMonth --05-- Supported.

Excel format: M

As displayed in the example, the value that is handed to the grid is 5/1/1900.

Export: Value exported is the month portion of what is in the grid, surrounded by two dashes.

gYearMonth 1999-02 Supported.

Excel format: YYYY-MM

The value handed to the grid is 2/1/1999.

Export: Value exported is year month portion of date/time variant, separated by a dash.

gDay ---31 Supported.

Excel format: D

The value handed to the grid is 1/31/1900.

Export: Value exported is day portion of what's in the grid, preceded by three dashes.

gMonthDay --05-31 Supported.

Excel format: MM-DD

The value handed to the grid is 5/31/1900.

Export: Value exported is the month day portion of what's in the grid, preceded by two dashes, separated by a single dash.

anyURI http://www.example.com/, http://www.example.com/doc.html#ID5 Supported URIs:
  • http
  • https
  • ftp
  • \\
  • file://

Note   For additional information on XSD data type support, see Excel Help.

Unsupported XML Schema Constructs

There are some XML schema constructs that Excel does not support. The following list details the XML schema constructs that you cannot import into Excel:

  • <any>. The <any> element allows you to include elements that are not declared by the schema and is not supported.
  • <anyAttribute>. The <anyAttribute> element allows you to include attributes that are not declared by the schema and is not supported.
  • Recursive structures. Excel does not support recursive structures that are more than one level deep.
  • Abstract elements. Abstract elements are meant to be declared in the schema, but never used as elements. Abstract elements depend on other elements being substituted for the abstract element. Excel does not support abstract elements.
  • Substitution groups. Substitution groups allow you to swap an element wherever you refer to another element. An element indicates that it is a member of another element's substitution group through the <substitutionGroup> attribute. Excel does not support substitution groups.
  • Mixed content. Mixed content, which is not supported by Excel, occurs when an element contains a child element and simple text outside of a child element. One common case is where formatting tags, such as bold tags, are used to mark up data within an element.

XSD Processing Rules

The following rules are used by Excel for processing XSDs:

  1. A single root must always be identifiable. This is accomplished by using a nested XSD structure.

  2. minOccurs must always be zero for elements. This assumes that all elements are optional.

  3. Attributes are always optional.

  4. Elements that contain attributes but do not have child elements are inferred as simple content. If Excel can discern the content type according to the data type rules, then the simple content shall be of that type.

  5. Elements should never use a definite upper bound other than one. If more iterations of an element are required, <element maxOccurs="unbounded"> is used.

  6. Compositors use the default minOccurs/maxOccurs values (1/1); the exception is an unbounded <sequence>. Compositors define ordered groups of elements.

    Note   The <sequence> compositor aggregates other elements and compositors, and in the instance document, all of the elements/compositors must appear in the prescribed order, subject to the minOccurs/maxOccurs optional properties defined for them.

  7. <Choice> compositor are never used in inference.

    Note   A <choice> compositor aggregates elements and compositors, and in the instance document, any one (and only one) of the elements/compositors may appear.

  8. <all> compositor is the default choice. If a child element occurs more than once, <sequence> may be used.

    Note   The <all> compositor, like all other compositors, groups together items that must occur in an instance of the complex type to which it belongs. Unlike <sequence> compositor, the grouped items may occur in any order in an instance document, not just in the order they are listed under the compositor.

    1. <all> allows entry to any order while remaining restrictive on the number of occurrences.
    2. <sequence> compositors allow entry in a specific order while not being restrictive in the number of occurrences.
    3. <sequence> compositors use the default maxOccurs="1" unless the order of elements needs to be unrestricted.
    4. If it is necessary to be unrestrictive in both the order and number of occurrences, <sequence maxOccurs="unbounded"> may be used.
    5. When there is one and only one child element, <sequence> is used.
  9. The <sequence> compositor should be set to maxOccurs="1". Note, this is usually not specified as this is the default value. Any elements wrapped by the <sequence> compositor are counted within their immediate context. If an element occurrence is greater than 1 and occurs consecutively in the instance, the element's maxOccurs is set to unbounded. If an element occurrence equals 1, then it is set to maxOccurs="1".

Exceptions to Processing Rules

If the elements wrapped by the <sequence> compositor occur more than one time and at least one element does not occur consecutively in the instance document in their immediate context, then the sequence is unbounded, and each element's maxOccurs="1". For example, they are A,B,C,A,B,C or some other random order other than A,A,A,B,B,B,C,C,C.

Effect of Multiple Namespaces on Importing and Exporting XML Data

In general, XML instances that contain elements from more than one namespace have as many different XSD schema files as there are namespaces referred to in the XML. Excel generates multiple XSD files on export, one for each namespace. The XML that Excel exports may be invalid because of the action of deriving the schema. If you want a better experience, provide Excel with the XSD files so it does not need to infer the schema.

Conclusion

Excel infers a schema when importing or exporting XML data in the absence of a schema. The structure of the inferred schema depends on the XML data you are importing or exporting. Understanding how the structure of your data affects the resulting schema can help you understand how to modify the structure to get the schema you need.

Additional Resources

For more information on schemas and XML in Excel, see the following articles:

Using Schemas with Word 2003 and Excel 2003

Importing XML Maps, XML Lists, and Dynamic Chart Sources in Excel 2003

Using the Excel 2003 Object Model to Add XML Data Integration