Microsoft Office Document Format Compatibility and Extensibility
Summary: This article discusses challenges in the evolution of the Microsoft Office file formats, in particular the move from the binary file formats (BFFs) to the Office Open XML (Open XML) ISO/IEC 29500 format. It highlights various Open XML extensibility mechanisms as potential solutions.
Applies to: Microsoft Office
File formats must store data that is defined by product features. To support innovation as new product versions are released, new features are introduced that must persist in a specific format in the file. This persistence creates multiple versions of the file format, each containing new features that older versions of the software cannot read. For example, sparklines, a new feature in Microsoft Excel 2010, are not recognized by Microsoft Office Excel 2007.
Similarly, in multi-vendor environments, extensions by one vendor may create a difference in the file format that is unrecognized by software from other vendors. The difference in file formats creates interoperability and extensibility issues across versions of a product and between products produced by different vendors.
The 2007 Microsoft Office system and Microsoft Office 2010 address these issues by implementing the Office Open XML (Open XML) standard, an accepted ISO/IEC standard. Open XML includes built-in extensibility mechanisms that enable software vendors to add features as the versions of their products evolve. Vendors can also create extensions to the format of Open XML documents. Even with the new features and extensions, vendors can still provide a baseline version of the document to consumers whose software cannot read the updated version.
Microsoft File Formats
Microsoft Office Binary File Formats
Microsoft Office 97 through Microsoft Office 2003 use the .doc, .xls, and .ppt binary file formats (BFFs), as described in the Microsoft Office Open Specifications. These file formats store information as binary streams, and therefore are not human-readable. This can present challenges to vendors who are creating software that reads, modifies, or writes these files.
Office Open XML File Formats
In the 2007 Microsoft Office system and Microsoft Office 2010, BFFs were replaced by the XML-based Open XML formats. The file name extensions are .docx, .xlsx, and .pptx for Microsoft Office Word 2007, Microsoft Word 2010, Microsoft Office Excel 2007, Excel 2010, Microsoft Office PowerPoint 2007, and Microsoft PowerPoint 2010. Open XML is an open standard specification for electronic documents that can be implemented by any technology provider.
All software vendors strive to innovate and provide their customers with products that address their needs and increase productivity. However, if a new product causes compatibility issues in the file format, this creates issues for the customer. Software vendors must carefully consider the benefits and risks of providing their customers with new features if those new features introduce upgrade issues.
Issues Caused by Major Format Changes
Significant issues can occur during deployment when the new version of software introduces an entirely new file format standard. For example, an individual user may share documents created in one version of a product with another user who does not have that same version. Software deployment in a large enterprise can take months or years, and therefore it is common for multiple versions of an application to be in use at the same time.
The most common response to this challenge is to provide compatibility packs, which convert the new format to a format understood by an earlier version of the software. For instance, Microsoft provides downloadable compatibility packs for Microsoft Office 2003 and earlier. The compatibility packs enable earlier versions of Microsoft Office to open, edit, and save files in the Open XML file formats. However, compatibility packs are expensive to produce and often entail some compromise in performance.
Another common solution for dealing with new file formats is to make new versions of the product backward-compatible with the old file formats. Backward compatibility is built into Word 2010. For example, it is capable of viewing .doc files created by Microsoft Word 2003 and can also save files from other versions of Microsoft Word in that format so those files are viewable within Word 2003. However, users of Word 2003 cannot view a .docx file unless they install a compatibility pack.
A better solution is to build cross-version compatibility into the file formats themselves. This is the strategy Microsoft has used with the adoption of Office Open XML file formats. This approach enables previous versions of the software to ignore features they are not designed to support. As an example, the 2007 Microsoft Office system does not need a compatibility pack to read, edit, or write files produced by Office 2010 because Open XML handles versioning issues by using the Markup Compatibility and Extensibility (MCE) mechanisms specified in the Open XML standard.
Enterprise Deployment Practices
For many large enterprises, the adoption of new software is accomplished in phases. The phased approach depends on such conditions as the number of Information Technology (IT) personnel available to perform the upgrade work, the network bandwidth required for automated upgrades, and specific organizational requirements. In a large-scale move to a new product version that makes it difficult to share documents with older product versions, the organization’s IT professionals must consider any risks and devise workable deployment strategies that minimize productivity loss.
As an example of obstacles faced by IT departments, imagine that an IT manager is considering deploying a new version of an existing program across her entire organization. The vendor of this program provides a compatibility pack that enables it to read files produced by the previous versions. Before deployment, she must determine whether the compatibility pack will enable file sharing with all earlier versions of the software that her organization uses. Her IT department would need a strategy for making the compatibility pack available, especially in situations where employees are prohibited from downloading software. She also would need to ensure that the compatibility pack is deployed across the entire organization. In addition, the IT manager may need to provide communication and training about the compatibility pack and about how to share files across versions or instances of the application.
The three major productivity issues for large corporations conducting such a deployment are the network bandwidth required for deployment, the time and storage space required to download compatibility packs, and the short-term need to train personnel to use compatibility packs.
Consumer Deployment Practices
Consumer deployment practices are often driven by the need for newer and faster computer systems. The deployment issues faced by an individual consumer are similar to the deployment issues of large enterprises. The individual consumer also wants to share files with other users of the product.
For example, suppose a user writes a document in Office Word 2007, and emails it to another user who has Word 2003. Either the second user must know how to download the compatibility pack to open the file, or the first user must remember to save it in the Word 2003 format before sending it.
Markup Compatibility and Extensibility
Beginning with the release of the 2007 Microsoft Office system, compatibility packs are no longer a concern in deployment because both forward and backward compatibility are built into the Open XML format. Individuals in organizations using the 2007 Microsoft Office system and Office 2010 can freely exchange files. Individuals with versions of Microsoft Office that predate the 2007 Microsoft Office system must install compatibility packs.
ISO/IEC-29500 solves most of the issues related to file format version incompatibility through Markup Compatibility and Extensibility (MCE). MCE combines several approaches to enable the addition of new content in a file while annotating that content so that it can be ignored or downgraded by applications that don’t understand the new content. MCE as discussed here is specified in ISO/IEC-29500:2008 – Part 3.
The issue of extensibility and interoperability can be encapsulated by using two hypothetical applications, where:
Considerations for a Good Interoperability Solution
To implement an interoperable file format, consider three main points: visual fidelity, ability to edit, and security.
Visual fidelity means that a document in one product version has the same visual representation when opened in another version of the product. However, because the product versions have different features, visual fidelity might not be possible, and the newer feature most likely will display according to the ability of the least capable application.
The ability to edit means that changes can be made to the document in either version of the product. Again, the editing ability of the least capable application will likely need to be the baseline. Alternatively, the application can discard an object when a user attempts to edit it in the least capable application.
Security considerations can arise when a product with the lesser capability tries to save information in markup that it does not understand. In such cases, two representations of the same data can potentially be passed between the products. If the data contains sensitive material, the most secure strategy may be to discard all unrecognized content; however, a software vendor might choose to retain such content to enable better interoperability.
Ignorable Attribute for Namespaces
The Ignorable attribute is defined in ISO/IEC-29500:2008 – Part 3, section 10.1.1. It contains a whitespace-delimited list of namespace prefixes that can be ignored by an application that does not understand those namespaces.
The Ignorable attribute can be used to define content that is understood by a newer application but ignored by an earlier version.
Word 2010 uses the Ignorable attribute to render a feature called Glow text effect, shown in the following figure.
The following XML shows the Open XML markup produced by Word 2010 to define the Glow text effect, and the use of the Ignorable attribute to remove this effect in an application that does not understand the specified namespace.
When a Word 2010 file with this markup is opened in Office Word 2007, the Glow text effect is lost, but the text is still editable. If the text is changed in Office Word 2007 and the file is subsequently opened in Word 2010, the edits made in Office Word 2007 are retained, but the Glow text effect is lost because Office Word 2007 ignored the elements that defined the effect.
In this example, the feature was lost during the round trip between the two versions to maintain the ability to edit. This is considered a more secure solution because Microsoft Office Word 2007 does not attempt to maintain two forms of the same text in an effort to preserve an effect that is readable by Word 2010 only. For more information about Word 2010 extensions, see [MS-DOCX].
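The round-trip behavior described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Word's implementation: the sample markup is hand-written and abbreviated, the helper names parse_with_prefixes and strip_ignorable are hypothetical, and the sketch assumes that the Ignorable attribute appears only on the root element and that each prefix is declared once.

```python
import io
import xml.etree.ElementTree as ET

MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W14_NS = "http://schemas.microsoft.com/office/word/2010/wordml"

# Hand-written, abbreviated sample; not actual Word 2010 output.
SAMPLE = f"""<w:document xmlns:w="{W_NS}" xmlns:mc="{MC_NS}"
    xmlns:w14="{W14_NS}" mc:Ignorable="w14">
  <w:body><w:p><w:r>
    <w:rPr><w:b/><w14:glow w14:rad="101600"/></w:rPr>
    <w:t>Glow text</w:t>
  </w:r></w:p></w:body>
</w:document>"""


def parse_with_prefixes(xml_text):
    """Parse XML and return (root, {prefix: namespace URI})."""
    prefixes, root = {}, None
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            prefixes[prefix] = uri
        elif root is None:
            root = payload  # the first "start" event is the root element
    return root, prefixes


def namespace_of(name):
    """Return the namespace URI of an ElementTree '{uri}local' name."""
    return name[1:].split("}", 1)[0] if name.startswith("{") else ""


def strip_ignorable(root, prefixes, understood):
    """Drop content in ignorable namespaces that the consumer does not understand."""
    listed = root.attrib.pop(f"{{{MC_NS}}}Ignorable", "").split()
    ignorable = {prefixes[p] for p in listed} - set(understood)

    def clean(elem):
        # Remove attributes, then child elements, that belong to an ignored namespace.
        for key in [k for k in elem.attrib if namespace_of(k) in ignorable]:
            del elem.attrib[key]
        for child in list(elem):
            if namespace_of(child.tag) in ignorable:
                elem.remove(child)
            else:
                clean(child)

    clean(root)
    return root
```

Calling strip_ignorable(root, prefixes, understood={W_NS}) models the earlier consumer: the glow element is removed while the bold property and the text run remain editable. Passing understood={W_NS, W14_NS} models the newer consumer, which keeps the effect.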
Extension Lists
The <extLst> element is defined in several sections of ISO/IEC-29500:2008 – Part 1. The <ext> element is defined in ISO/IEC-29500:2008 – Part 1, sections 10.1.2 and 18.2.7. Extension lists extend a schema where no alternate representation is desired or provided.
An <extLst> element is always preserved if the parent element continues to exist. Extension lists are used for behavior that can remain valid even if an earlier version of the product edits the content of the parent element. For more information about extensions that are specific to Microsoft Excel, see all Excel 2010 extensions in [MS-XLSX].
Each <ext> block is examined in turn by the consuming product and processed if the identifier of the <ext> block is understood. <ext> blocks that are not understood by the product are preserved entirely; however, if an <ext> block is edited or otherwise processed, it is not preserved.
Excel 2010 uses extension lists to define a new feature called sparklines. The following figure shows an example of sparklines embedded in a chart.
The following XML shows the Open XML markup produced by Excel 2010 to define the sparklines by using an extension list.
When an Excel 2010 file with this markup is opened in Excel 2007, the sparklines are not displayed. If the file is subsequently opened in Excel 2010, the sparklines are displayed. If edits were made to the data cells for the sparklines while the file was open in Excel 2007, the sparklines will reflect the data changes when opened in Excel 2010.
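The per-block processing rule for extension lists can be sketched as follows. This is a minimal Python illustration, not Excel's implementation: the two uri identifiers, the urn:example:sparklines namespace, and the function name process_extensions are placeholders invented for the example.

```python
import xml.etree.ElementTree as ET

SML_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"

# Hypothetical identifiers; real extensions use identifiers published by the vendor.
SPARK_URI = "{11111111-2222-3333-4444-555555555555}"
UNKNOWN_URI = "{99999999-8888-7777-6666-555555555555}"

# Hand-written sample with a placeholder extension namespace.
SAMPLE = f"""<worksheet xmlns="{SML_NS}" xmlns:ex="urn:example:sparklines">
  <sheetData><row r="1"><c r="A1"><v>3</v></c></row></sheetData>
  <extLst>
    <ext uri="{SPARK_URI}">
      <ex:sparklineGroups><ex:sparkline range="A1:A5"/></ex:sparklineGroups>
    </ext>
    <ext uri="{UNKNOWN_URI}"><ex:futureFeature/></ext>
  </extLst>
</worksheet>"""


def process_extensions(parent, handlers):
    """Examine each <ext> block in turn.

    Blocks whose uri has a handler are processed; all other blocks are
    left untouched in the tree so that they survive a later save.
    Returns (processed_uris, preserved_uris).
    """
    processed, preserved = [], []
    ext_lst = parent.find(f"{{{SML_NS}}}extLst")
    if ext_lst is None:
        return processed, preserved
    for ext in ext_lst.findall(f"{{{SML_NS}}}ext"):
        uri = ext.get("uri")
        handler = handlers.get(uri)
        if handler is None:
            preserved.append(uri)  # not understood: keep the block as-is
        else:
            handler(ext)  # understood: hand the block to the feature code
            processed.append(uri)
    return processed, preserved
```

A consumer that models the earlier version would call process_extensions(root, {}) and write the tree back out unchanged, so both blocks survive. A consumer that models the later version would register a handler for SPARK_URI and still preserve the block it does not recognize.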
Alternate Content Blocks
The <AlternateContent> element block is defined in ISO/IEC-29500:2008 – Part 3, section 10.2. Alternate content blocks offer multiple solutions that target different versions of the consuming products and depend on the abilities of the product.
Alternate content blocks define a <Choice> element that is used to offer multiple solutions for representation of an object or text in an Open XML file. For example, a vendor might choose to reproduce the Glow text effect in Word 2010 as colored text in a version of an application that does not understand the effect.
The <Choice> element blocks are processed in the order they appear, and the first choice that can be processed by the consuming application is taken.
If the consuming application cannot select a <Choice> element, a <Fallback> element can be offered as a default baseline. If <Fallback> is not provided and the application cannot select a <Choice> element it understands, then it ignores the <AlternateContent> block, as shown in the following DrawingML usage example.
The following is a hypothetical example where an earlier version of an application uses DrawingML, which creates two separate objects to create a line with a label, whereas the later version uses a version of DrawingML that extends the connector object to include the label. The following figure shows how a connector line with a label might be represented in the earlier version of the application.
The following code blocks show the Open XML markup produced by using the alternate content <Choice> blocks to target a different display solution depending on the capabilities of the consuming application.
The first block defines the structure of the overall Open XML with placeholders for the two <Choice> blocks. The <Choice> for the later version is listed first, because it is the most recent version, and should be processed first if the application has the ability to do so.
The comment text “more content” in the following XML denotes that some Open XML was omitted for clarity.
The following Open XML block defines the <Choice> element for the later version of DrawingML. In this example, the object is represented as one object. The comment text “more content” denotes that some Open XML was omitted for clarity.
The following Open XML block defines the <Choice> element for the earlier version of DrawingML. In this example the object is represented as two objects, the connector and the text box. The comment text “more content” denotes that some Open XML was omitted for clarity.
The specific behavior for a round trip between the two versions is implementation-specific. For example, hypothetically, in the newer application the embedded label might be lost if the file is opened by using the earlier version of the application. The user of the newer application can then choose to accept the old version of the object with the label as a separate object, or to discard it.
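The selection rule for alternate content blocks (take the first <Choice> the consumer can process, otherwise the <Fallback>, otherwise nothing) can be sketched in Python. This is a minimal illustration of the rule rather than any product's implementation: the v1 and v2 drawing namespaces, the element names in the sample, and the function name select_alternate are hypothetical, and the sketch assumes each prefix is declared once.

```python
import io
import xml.etree.ElementTree as ET

MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006"
V1_NS = "urn:example:drawing:v1"  # hypothetical earlier drawing namespace
V2_NS = "urn:example:drawing:v2"  # hypothetical later drawing namespace

# Hand-written sample: the later representation is listed first.
SAMPLE = f"""<doc xmlns:mc="{MC_NS}" xmlns:v1="{V1_NS}" xmlns:v2="{V2_NS}">
  <mc:AlternateContent>
    <mc:Choice Requires="v2"><v2:labeledConnector label="Approve"/></mc:Choice>
    <mc:Choice Requires="v1"><v1:connector/><v1:textBox text="Approve"/></mc:Choice>
    <mc:Fallback><picture/></mc:Fallback>
  </mc:AlternateContent>
</doc>"""


def parse_with_prefixes(xml_text):
    """Parse XML and return (root, {prefix: namespace URI})."""
    prefixes, root = {}, None
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            prefixes[prefix] = uri
        elif root is None:
            root = payload
    return root, prefixes


def select_alternate(alt, prefixes, understood):
    """Return the child elements of the branch a consumer should use.

    Choices are tried in document order; a Choice is usable only if every
    namespace named in its Requires attribute is understood.
    """
    for branch in alt:
        if branch.tag != f"{{{MC_NS}}}Choice":
            continue
        required = {prefixes[p] for p in branch.get("Requires", "").split()}
        if required and required <= set(understood):
            return list(branch)
    fallback = alt.find(f"{{{MC_NS}}}Fallback")
    # With no usable Choice and no Fallback, the whole block is ignored.
    return list(fallback) if fallback is not None else []
```

With understood={V1_NS, V2_NS} the function returns the single labeled connector; with understood={V1_NS} it returns the connector and the text box as two objects; with an empty set it returns the Fallback content.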
PreserveElements, PreserveAttributes, and MustUnderstand Attributes
In addition to extension lists, Ignorable attributes, and <AlternateContent> blocks, MCE offers the PreserveElements, PreserveAttributes, and MustUnderstand attributes. For more information about these attributes, see ISO/IEC-29500:2008 – Part 3, sections 10.1.3 and 10.1.4. These attributes can tell the application that is consuming the Open XML what content to keep. This behavior, however, is specific to the implementation of the application.
The MustUnderstand attribute, like the Ignorable attribute, contains a whitespace-delimited list of namespace prefixes. Unlike Ignorable, it lists namespaces that an application must understand to consume the information in the Open XML file.
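A consumer-side check for MustUnderstand might look like the following Python sketch. It is an illustration under assumptions rather than a reference implementation: the v2 namespace, the sample markup, the UnsupportedMarkupError exception, and the function name check_must_understand are all hypothetical, and the sketch assumes each prefix is declared once.

```python
import io
import xml.etree.ElementTree as ET

MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006"
V2_NS = "urn:example:drawing:v2"  # hypothetical namespace for a newer feature

# Hand-written sample that insists the consumer understands the v2 namespace.
SAMPLE = f"""<doc xmlns:mc="{MC_NS}" xmlns:v2="{V2_NS}" mc:MustUnderstand="v2">
  <v2:labeledConnector label="Approve"/>
</doc>"""


class UnsupportedMarkupError(Exception):
    """Raised when a document requires a namespace the consumer does not understand."""


def parse_with_prefixes(xml_text):
    """Parse XML and return (root, {prefix: namespace URI})."""
    prefixes, root = {}, None
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            prefixes[prefix] = uri
        elif root is None:
            root = payload
    return root, prefixes


def check_must_understand(root, prefixes, understood):
    """Raise UnsupportedMarkupError if any MustUnderstand namespace is unknown."""
    attr = f"{{{MC_NS}}}MustUnderstand"
    for elem in root.iter():
        for prefix in elem.get(attr, "").split():
            uri = prefixes[prefix]
            if uri not in understood:
                # A real application could surface this as a message that
                # points the user to a newer version of the product.
                raise UnsupportedMarkupError(f"required namespace not understood: {uri}")
```

A consumer would run this check before any other processing and stop with an error message if it fails, instead of silently dropping content as it would for an ignorable namespace.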
Storing the features of a new product version in a file format without breaking earlier product versions is challenging. MCE addresses this challenge in the Open XML standard by providing mechanisms that enable software vendors to choose how new features should behave when interoperating with earlier versions of their products.
<AlternateContent> blocks can offer solutions to many versions of consuming applications without losing the ability to edit in any version. It is possible that some of the newer features can be lost during the round trips between products, but feature loss is preferred over compatibility issues that can cause productivity loss and unexpected results for users.
The Ignorable attribute can be used to display effects that can be lost without loss of understanding. The MustUnderstand attribute can be used to provide users with an error message that directs them to a download site for an appropriate version of the product.
Extension lists can be used when no alternative representation is desired. Features that are only meaningful in the new product version can be saved, but they will not be represented in earlier product versions.
Through alternate content blocks, the Ignorable attribute, and extension lists, MCE enables vendors to innovate and choose how their features interoperate across product versions with minimal loss of functionality.