Understanding Office Binary File Formats
Summary: Learn about the binary file formats that are used in current and previous Microsoft Office products, including how to use them, their basic structures, and key concepts for interacting with them programmatically.
Applies to: Microsoft Word | Microsoft PowerPoint | Microsoft Excel | Microsoft Outlook
Published: February 2011
Provided by: Microsoft Corporation
This article is the first in a series of articles that introduce the binary file formats used by Microsoft Office products. This first article provides an overview of how to work with Microsoft Office binary file formats in general, and explains some of the shared structural traits and key concepts that the different formats have in common. The other articles in the series provide more detail about the individual file formats. These articles are designed to be used in conjunction with the Microsoft Office File Format Documents available on MSDN.
This article series deals with only the four core Microsoft Office products: Microsoft Word, Microsoft PowerPoint, Microsoft Excel, and Microsoft Outlook.
What Are Binary File Formats?
A binary file format is any file format that contains primarily binary data. This includes compiled programs, images, media, most compressed files, and files that contain textual information but store it as binary data. The binary file formats used by Microsoft Office products fit in this last category. Non-binary formats include plain text (.txt), .html, .xml, and their derivatives, as well as interpreted scripts and source code files.
All of the file data in Microsoft Office binary file formats exists in one or more streams. Each stream contains data structures to store metadata, such as user and system information and file properties, formatting information, text content, and media content. These data structures are expressed as groups of hexadecimal numbers that the host program interprets and presents through its user interface.
The organization of data structures varies within a stream. The most common unit of data is a record. A record typically contains some metadata about the file in the form of fields and flags. These include one or more offset values that indicate the locations of other relevant records or other data. Text is stored as numeric values that represent ANSI or Unicode characters. Images can be stored as pointers to external files, or embedded within the file in their own binary file formats, such as .gif, .jpeg, or .png. More active content, such as a PowerPoint slide transition, is marked with the information that is needed for interpretation, such as the transition properties, and then rendered by the host program.
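The following Python sketch illustrates the idea of a record that carries fields, a flag, an offset, and text stored as numeric values. The record layout, the names HEADER, FLAG_UNICODE, pack_record, and parse_record, and the use of code page 1252 for ANSI text are all assumptions made for illustration. None of them is taken from an Office file format specification.

```python
import struct

# Hypothetical record layout (an assumption, not from any Office specification):
#   uint16 record type, uint16 flags, uint32 offset of the next record,
#   uint16 payload length in bytes, followed by the text payload.
HEADER = struct.Struct("<HHIH")  # little-endian, no padding, 10 bytes
FLAG_UNICODE = 0x0001  # hypothetical flag: payload is UTF-16LE rather than ANSI


def pack_record(rec_type, text, next_offset, unicode_text):
    """Serialize one record; the flag records which encoding was used."""
    payload = text.encode("utf-16-le" if unicode_text else "cp1252")
    flags = FLAG_UNICODE if unicode_text else 0
    return HEADER.pack(rec_type, flags, next_offset, len(payload)) + payload


def parse_record(data, offset=0):
    """Read the header fields at `offset`, then decode the payload."""
    rec_type, flags, next_offset, length = HEADER.unpack_from(data, offset)
    start = offset + HEADER.size
    encoding = "utf-16-le" if flags & FLAG_UNICODE else "cp1252"
    return {
        "type": rec_type,
        "flags": flags,
        "next_offset": next_offset,
        "text": data[start:start + length].decode(encoding),
    }


raw = pack_record(1, "Hello, world", 0, unicode_text=True)
record = parse_record(raw)
print(raw.hex(" "))
print(record)
```

In this sketch the same twelve characters occupy 12 bytes when stored as ANSI and 24 bytes when stored as UTF-16LE, which is why a reader needs the flag before it can interpret the payload.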
The file formats used by Microsoft Word, Microsoft PowerPoint, Microsoft Excel, and Microsoft Outlook are all documented, comprehensively, in the MSDN library in the following location: Microsoft Office File Format Documents. From there, you can open the full specification for the file format, either directly on the MSDN site or as a .pdf file.
The recommended way to perform most programming tasks in Microsoft Office is to use the Office Primary Interop Assemblies. These are a set of .NET classes that provide a complete object model for working with Microsoft Office. This article series deals only with advanced scenarios, such as where Microsoft Office is not installed.
What Versions of Microsoft Office Use Binary File Format Files?
The Microsoft Office binary file formats discussed in this article are primarily used by Microsoft Outlook, Microsoft Excel, and previous versions of Microsoft Word and Microsoft PowerPoint. Microsoft Office Word 2007 and Office PowerPoint 2007 use XML-based file formats as their default file format, and Microsoft Excel 2010 uses a newer binary format. The following table shows the binary file format files that apply to specific versions of Word, Excel, PowerPoint, and Outlook.
|File format|Application version|
Microsoft Office binary file format–based files are also used by companies that work with Microsoft Office files without using the original host application. Some of the more common uses outside Microsoft include custom cross-document search tools, data recovery from damaged files, and reading and writing for compatibility with other applications.
Viewing Content in Microsoft Office Binary File Format–Based Files
By far the easiest way to view a Microsoft Office binary file is with the host program that created it. For example, use Word to view a .doc file, or PowerPoint to view a .ppt file. That approach shows the user view of the content, such as text, formatting, and the general state of the user interface.
You can get a more structural picture of a binary file by using the Office Visualizer Tool, offvis.exe. The following link allows you to directly download this tool from the Download Center: http://download.microsoft.com/download/1/2/7/127BA59A-4FE1-4ACD-BA47-513CEEF85A85/OffVis.zip
When you load any Microsoft Office binary file into the Visualizer, you are presented with two panes. The navigation pane shows the raw file contents, with each row showing current offset, a chain of hexadecimal numbers, and their text representation if any. The results pane shows the parsing results that consist of the name of the current data structure, its value, offset location, size, and type. The following screen shot shows part of a .doc file that contains the text “Hello, world” in the visualizer. The letter “w” is selected. This causes the visualizer to highlight the corresponding hexadecimal number and data structure.
Figure 1. HelloWorld.doc rendered in offvis.exe
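The row layout described for the navigation pane (offset, hexadecimal bytes, text representation) can be approximated in a few lines of Python. This is a minimal sketch and not how offvis.exe is implemented. The function name hex_dump and the 16-byte row width are assumptions.

```python
def hex_dump(data, width=16):
    """Return one formatted row per `width` bytes: offset, hex bytes, text."""
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        rows.append(f"{offset:08X}  {hex_part:<{width * 3 - 1}}  {text_part}")
    return rows


# The sample text as ANSI bytes; the letter "w" appears as the byte 77.
for row in hex_dump("Hello, world".encode("cp1252")):
    print(row)
```

To inspect a real file, you could pass the result of reading the file in binary mode to the same function.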
Creating Custom Binary File Format Viewers
You can create a custom viewer, which you can use to target specific content, or as a way to become familiar with the file format. Your viewer has to read the data stream, interpret the structures in it, and navigate the offsets to find the text and any other content that you want to show. These data structures are different for each file type, but in every case, the process is similar.
To find content in binary file format-based files
Read the file stream.
Identify the structure or structures that may contain the content that you are looking for.
In the first structure, find the offset value that specifies the location of the next section that you are looking for.
Go to that section in the stream.
Repeat the previous two steps until you locate the content that you want.
Read and parse the content.
Depending on your needs, this can take anywhere from fewer than a hundred lines of code for a simple text extractor to millions of lines to emulate the original host program.
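The procedure above can be sketched in Python against a made-up format. Everything about the format here is hypothetical and chosen only to show the loop of reading a structure, finding an offset, and jumping to the next section: the TOY1 signature, the section layout, and the names SECTION, KIND_TEXT, build_sample, and find_text. Real Office formats use different and far richer structures, which are described in the specifications.

```python
import struct

# Hypothetical toy format (an assumption for illustration only):
#   file header: 4-byte signature b"TOY1", uint32 offset of the first section
#   section:     uint16 kind, uint32 offset of next section (0 = none),
#                uint16 payload length, then the payload
SECTION = struct.Struct("<HIH")  # 8 bytes
KIND_TEXT = 2  # hypothetical section kind that holds UTF-16LE text


def build_sample():
    """Build a small in-memory file: one non-text section, then a text section."""
    text = "Hello, world".encode("utf-16-le")
    props = b"\x00" * 4
    first = 8
    second = first + SECTION.size + len(props)
    return (b"TOY1" + struct.pack("<I", first)
            + SECTION.pack(1, second, len(props)) + props
            + SECTION.pack(KIND_TEXT, 0, len(text)) + text)


def find_text(data):
    """Follow the offset chain until a text section is found."""
    if data[:4] != b"TOY1":
        raise ValueError("not a toy-format file")
    (offset,) = struct.unpack_from("<I", data, 4)
    seen = set()
    while offset != 0:
        # Guard against loops and offsets that point outside the data.
        if offset in seen or offset + SECTION.size > len(data):
            raise ValueError("corrupt offset chain")
        seen.add(offset)
        kind, next_offset, length = SECTION.unpack_from(data, offset)
        if kind == KIND_TEXT:
            start = offset + SECTION.size
            return data[start:start + length].decode("utf-16-le")
        offset = next_offset
    return None


print(find_text(build_sample()))
```

The checks for repeated or out-of-range offsets are a design choice in this sketch: offsets read from a file are untrusted input, and a damaged file could otherwise send the loop in circles.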
Editing Office Binary File Format–Based Files
In general, you should never attempt to directly edit a Microsoft Office binary file. Instead, use a Save operation, which is similar to how you send a document to a printer. For example, when you print a Word document, you do not send the whole .doc file to the printer to render. Instead, Word creates a snapshot of your document, formatted according to the printer specifications. The printer may have logic to interpret fonts, but all of the layout information is processed by the sending application.
Similarly, when you save a file in a binary format, the host application translates the data in memory to the specified binary format and creates the file. If a file already exists with the same name, the new file overwrites it.
This approach has several advantages.
Your application can store and manipulate the file contents in any format that you choose, which is much easier than working with binary data directly.
By reading the original binary file into memory once and then immediately converting the data into an internal representation, you avoid having to recalculate multiple pointers to different offset positions, which may change with every edit.
After your application has an internal representation of the file in memory, it can save that file to any format the application supports.
By using a shared internal representation, your application can include logic to read multiple file formats and then work with them in the same manner.
So, the process of editing a binary format file really has three steps.
To edit a binary file format-based file
Read the file into an internal representation.
Edit the internal representation in your application.
Save the representation into the binary format, with the same file name and location as the source file.
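The three steps can be sketched in Python with another made-up format. The layout (a count, a table of offsets and lengths, then UTF-16LE text runs) and the function names load and save are assumptions for illustration and do not correspond to any Office format. The point of the sketch is that save recalculates every offset from the internal representation, so an edit that changes the length of one run does not require patching pointers by hand.

```python
import struct

# Hypothetical toy format (an assumption for illustration only):
#   uint32 run count, then one (uint32 offset, uint32 length) pair per run,
#   then the UTF-16LE bytes of each run.


def load(data):
    """Step 1: read the binary data into an internal representation (a list of str)."""
    (count,) = struct.unpack_from("<I", data, 0)
    runs = []
    for i in range(count):
        offset, length = struct.unpack_from("<II", data, 4 + 8 * i)
        runs.append(data[offset:offset + length].decode("utf-16-le"))
    return runs


def save(runs):
    """Step 3: write the representation back out, recomputing every offset."""
    encoded = [run.encode("utf-16-le") for run in runs]
    offset = 4 + 8 * len(encoded)  # text starts right after the offset table
    table = b""
    for blob in encoded:
        table += struct.pack("<II", offset, len(blob))
        offset += len(blob)
    return struct.pack("<I", len(encoded)) + table + b"".join(encoded)


original = save(["Hello, ", "world"])
runs = load(original)          # step 1: read
runs[0] = "Goodbye, "          # step 2: edit the internal representation
edited = save(runs)            # step 3: save
print(load(edited))
```

In a real application, the final step would write the resulting bytes to the same file name and location as the source file, replacing it.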
Understanding and working with binary file formats can be a challenge. Hopefully, by learning about the basic structures and experimenting with some procedures provided in this article series, you will be ready to delve into more complex implementations with nothing more than the open specification documentation and some downloadable tools.
For more information, see the following resources: