Understanding Graphics in Office Binary File Formats
Summary: This article describes the MS-ODRAW binary file format, also known as OfficeArt, which is used to store drawing objects in current and previous Microsoft Office products.
Applies to: Microsoft Office
Provided by: Microsoft Corporation
This article describes elements of the MS-ODRAW binary file format, and gives examples of how to extract art content from a binary file. The intention of this article is to show some practical tasks that you can perform on binary files, and to equip you with a basic understanding of the format that will help you in more in-depth explorations. This article follows a series of articles that introduce the binary file formats that are used by Microsoft Office products. These articles are designed to be used together with the open file format specifications available on Microsoft MSDN:
What Is the MS-ODRAW File Format?
The [MS-ODRAW]: Office Drawing Binary File Format Structure Specification, also known as OfficeArt, is a binary file format that is used by Microsoft Office applications to store drawing elements, such as pictures, shapes, and WordArt, and their associated formatting. These elements can be contained in other drawings, or in charts, diagrams, tables, and controls, or may appear as self-contained components in the file.
The MS-ODRAW format stores drawing objects within files that are created by a host program, such as Microsoft Word. Those drawing objects may be found in the context of a chart, graph, or table that is defined by the MS-OGRAPH format, which is also stored inside a host file. There are no MS-ODRAW files.
This format and the file formats used by Microsoft Word, Microsoft PowerPoint, Microsoft Excel, and Microsoft Outlook are all documented comprehensively in the MSDN library under Microsoft Office File Format Documents. From this location, you can open the full specification for the file format, either directly on the MSDN site or as a .pdf file.
The recommended way to perform most programming tasks in Microsoft Office is to use the Office Primary Interop Assemblies. These are a set of .NET classes that provide a complete object model for working with Microsoft Office. This article series deals only with advanced scenarios, such as where Microsoft Office is not installed.
Structures in the MS-ODRAW File Format
A drawing object is composed of a series of records, which may contain other records. A record that contains other records is called a container, whereas a record that holds data is called an atom. All records share a common Record Header, which specifies the record type and length. The record header also has fields used by atom records to specify version and instance, to differentiate from other atoms in the same container.
Because the records share a common header, you can selectively parse records by reading only the recType and recLen fields of each header until the desired record is found. In addition, you can define custom record types that apply only to your application by creating unique recType values, which is how host applications position drawing elements in a document and render associated text.
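The selective parsing described above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names iter_records and find_record are hypothetical, and the sketch assumes that the record bytes are already in memory and that multi-byte fields are little-endian, as in the other Office binary formats.

```python
import struct

HEADER_SIZE = 8  # every OfficeArt record starts with an 8-byte header


def iter_records(buf, start=0, end=None):
    """Yield (rec_type, body_offset, rec_len) for each sibling record in buf[start:end]."""
    end = len(buf) if end is None else end
    pos = start
    while pos + HEADER_SIZE <= end:
        # Little-endian: 2 bytes of version/instance bits, 2-byte recType, 4-byte recLen.
        _ver_inst, rec_type, rec_len = struct.unpack_from("<HHI", buf, pos)
        body = pos + HEADER_SIZE
        if body + rec_len > end:
            raise ValueError("record at offset %d overruns its parent" % pos)
        yield rec_type, body, rec_len
        pos = body + rec_len  # jump over the body without parsing it


def find_record(buf, wanted_type, start=0, end=None):
    """Return (body_offset, rec_len) of the first sibling of wanted_type, or None."""
    for rec_type, body, rec_len in iter_records(buf, start, end):
        if rec_type == wanted_type:
            return body, rec_len
    return None
```

Because only recType and recLen are examined, records of no interest are stepped over in constant time regardless of their size.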
The following sections explain the organization of the object container hierarchy inside a host document, and describe some key structures that apply to different kinds of graphical elements. To view a diagram and additional explanation of the object container hierarchy, see section 1.3 Structure Overview (Synopsis) in the MS-ODRAW specification.
Each host file that contains graphical elements contains a single Drawing Group, which hosts all of the drawings in the file. The drawings are stored in individual drawing objects, which are not stored in the drawing group itself, but are logically associated with it. The Drawing Group contains a Property Table, which stores the defaults for new shapes, and a BLIP Store, which contains all of the static images that are used in the file. Individual drawings reference these images from a central location to reduce duplication of picture data.
The Drawing Group is specified in an OfficeArtDggContainer record.
A drawing object represents a complete graphical element inside a host document, such as a piece of clip art, WordArt, or a grouping of shapes in a Venn diagram. A drawing object contains one or more shape objects and a collection of rules that apply to all shapes in the drawing.
A Drawing object is represented by an OfficeArtDgContainer record. It contains a common record header, and 16 bytes of drawing data in an OfficeArtFDG record to specify the number of shapes in the drawing and the identifier of the last shape in the drawing.
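As a rough illustration, the 16 bytes of an OfficeArtFDG record could be decoded as shown below. The function name parse_fdg and the dictionary keys are hypothetical. The record type value 0xF008, the little-endian byte order, and the field order (shape count first, then last shape identifier) reflect one reading of the MS-ODRAW specification and should be verified against it.

```python
import struct


def parse_fdg(buf, offset=0):
    """Decode a 16-byte OfficeArtFDG record that starts at offset."""
    # 8-byte record header followed by two 4-byte unsigned integers.
    _ver_inst, rec_type, rec_len, shape_count, last_shape_id = struct.unpack_from(
        "<HHIII", buf, offset)
    # Assumed values: recType 0xF008 and a fixed 8-byte body.
    if rec_type != 0xF008 or rec_len != 8:
        raise ValueError("not an OfficeArtFDG record")
    return {"shape_count": shape_count, "last_shape_id": last_shape_id}
```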
A drawing object also contains a collection of rules that apply to connectors, arcs, and callouts in the drawing, and four different collections of shapes, sorted according to the shapes' grouping and deletion status, as shown in the following table.
Shapes Collections in a Drawing Record
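|Attribute||Type||Description|
|deletedShapes||An array of OfficeArtSpgrContainerFileBlock||Defines deleted shapes and shape groups.|
|groupShape||OfficeArtSpgrContainer||Defines the shape properties for the group shape, and contains all of the shapes in the group as an array.|
|regroupItems||OfficeArtFRITContainer||Holds group identifiers for shapes that are ungrouped.|
|shape||OfficeArtSpContainer||Defines the default shape properties for the current drawing.|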
Most drawings created in Microsoft Office applications consist of shapes. Individual shapes have properties, which determine the shape type, such as rounded rectangle or double arrow, its relationships to other shapes, size, position, and various details about how it is rendered, such as line style and fill. An individual shape is defined in an OfficeArtSpContainer record.
All of the OfficeArt shape types are listed in the MSOSPT enumeration. Inside the OfficeArtSpContainer record that defines a shape, there is a shapeProp attribute, which is an OfficeArtFSP record. The recInstance field of the OfficeArtFSP record header holds an MSOSPT enumeration value, which sets the type of the shape. The following diagram shows how a four-pointed star is defined in the container hierarchy.
Coordinates and Grouping
Coordinates and grouping are intimately connected because how a shape is positioned and sized depends on its grouping status. All of the shapes in a drawing are part of a top-level shape group, which is defined in an OfficeArtSpgrContainer record. Within a shape group, there is a core shape which other shapes are anchored to, called the group shape. This shape is not displayed. The top-level shape group is positioned on the client UI surface by the clientAnchor attribute of the OfficeArtSpContainer record that defines the group shape. The clientAnchor is an OfficeArtClientAnchor record, which is defined by the client application.
Position and size relative to the client anchor are determined by grouping. A simple shape is defined as a direct child of the top-level shape group, whereas shapes that have been grouped to other shapes in the UI become a subordinate shape group. As with the top level group, a subordinate shape group is defined in an OfficeArtSpgrContainer record, which wraps the OfficeArtSpContainer record that defines the shape in an OfficeArtSpgrContainerFileBlock record. The size and location of the group shape are defined by the OfficeArtSpContainer.shapeGroup property, which is an OfficeArtFSPGR record. The coordinates of each child shape are defined by its OfficeArtSpContainer.childAnchor property, which is an OfficeArtChildAnchor record.
The following example shows two similar shapes.
The compass rose is composed of two diamond shapes grouped together, whereas the four-pointed star is a simple shape and not part of a subordinate group.
In the compass rose, the user created the vertical diamond first, and then laid the horizontal diamond over it, and grouped them together. This puts them in a subordinate group shape, with the two diamonds as child shapes. The following diagram shows where the two diamonds appear in the drawing container hierarchy, and where their respective coordinates are found.
The coordinates for the group shape are found in the .shapeGroup attribute. The second and third shapes in the array, being child shapes, keep their coordinates in their respective .childAnchor attributes. Any additional shapes added to this group would also store their coordinates in the .childAnchor attribute.
The four-pointed star from the previous picture was created as a single shape, and is stored as a direct child of the top-level group shape, as shown in the following diagram:
In a single drawing, rules define connectors, arcs, and callouts. Each OfficeArtDgContainer drawing object in a file contains an OfficeArtSolverContainer instance, which holds an array of OfficeArtSolverContainerFileBlock records, each of which may be an OfficeArtFConnectorRule, an OfficeArtFArcRule, or an OfficeArtFCalloutRule.
An OfficeArtFConnectorRule record defines a connector according to which shapes it connects to, such as in a flow chart, and at what point on the shape it anchors to. Connector rules are the main rules of interest. Callout rules and arc rules only give the ID of the callout or arc they apply to and no additional information.
To see an example of an OfficeArtSolverContainer with connector rules, see 3.1.5 OfficeArtSolverContainer in the MS-ODRAW documentation.
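The rule records inside a solver container might be sorted by kind as in the following sketch. The function name classify_solver_rules is hypothetical, and the three record type values are taken from one reading of the MS-ODRAW specification rather than from this article, so treat them as assumptions to verify. The sketch deliberately leaves each rule body undecoded.

```python
import struct

# Record type values as read from the MS-ODRAW specification; verify before relying on them.
RULE_KINDS = {
    0xF012: "connector",  # OfficeArtFConnectorRule
    0xF014: "arc",        # OfficeArtFArcRule
    0xF017: "callout",    # OfficeArtFCalloutRule
}


def classify_solver_rules(buf, start, end):
    """Return [(kind, body_bytes)] for each file block in buf[start:end].

    start and end delimit the body of an OfficeArtSolverContainer record.
    """
    rules = []
    pos = start
    while pos + 8 <= end:
        _ver_inst, rec_type, rec_len = struct.unpack_from("<HHI", buf, pos)
        body = pos + 8
        kind = RULE_KINDS.get(rec_type, "unknown")
        rules.append((kind, bytes(buf[body:body + rec_len])))
        pos = body + rec_len
    return rules
```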
The MS-ODRAW format refers to bitmap images as binary large images or pictures (BLIPs). Bitmaps in a Microsoft Office document are stored centrally in the BLIP store, which is an OfficeArtBStoreContainer record in the top level drawing group. The BLIP store contains an array of OfficeArtBStoreContainerFileBlock records. In the record header of the BLIP store, the recInstance and recLen fields specify the number of OfficeArtBStoreContainerFileBlock records and the total of their size in bytes.
Each OfficeArtBStoreContainerFileBlock record contains either an OfficeArtFBSE record or a bare OfficeArtBlip record, as specified by the recType field of its record header. An OfficeArtFBSE record wraps an OfficeArtBlip record in name, size, type, and offset information. An OfficeArtBlip record contains a record header and the actual bit stream of the image.
The following diagram shows the location of binary image data in the container hierarchy.
Properties define shapes and bitmaps. The MS-ODRAW format stores properties in property tables. Each shape has property tables for properties that apply only to that shape, which consist of an OfficeArtFOPT record, an OfficeArtSecondaryFOPT record, and an OfficeArtTertiaryFOPT record. In the record names, “PT” means “Property Table.” Additionally, the top-level drawing group contains property tables that hold properties that apply across the file, which consist only of an OfficeArtFOPT record and an OfficeArtTertiaryFOPT record. Properties defined in the top-level drawing group are used as defaults for properties not defined by individual shapes; where individual shape properties are set, they override the defaults. Properties not set in either location revert to system or application defaults. The record header for each property table specifies the number of properties and the length in bytes of the record.
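The order of precedence described above can be expressed as a simple lookup chain. In this sketch, the function name resolve_property is hypothetical, and each property table is assumed to have been parsed already into a dictionary that maps a property identifier to its value.

```python
def resolve_property(prop_id, shape_props, group_defaults, app_defaults):
    """Return the effective value of prop_id, or None if it is not set anywhere.

    Lookup order: the shape's own tables, then the drawing group defaults,
    then the system or application defaults.
    """
    for table in (shape_props, group_defaults, app_defaults):
        if prop_id in table:
            return table[prop_id]
    return None
```

A value of 0 or an empty byte string in an earlier table still counts as set, which is why the lookup tests for the key rather than for a truthy value.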
Within an OfficeArtFOPT, OfficeArtSecondaryFOPT, or OfficeArtTertiaryFOPT record is an array of categorized OfficeArtRGFOPTE records, each of which represents a group of related properties. In the record names, “RG” means “Related Group,” and “PTE” means “Property Table Entry.” The individual properties are stored as OfficeArtFOPTE records. These records do not have a record header, but begin with an opid field, which specifies header information for the property. The opid is stored as an OfficeArtFOPTEOPID record. If the last bit of an OfficeArtFOPTEOPID record, the fComplex field, equals 0x1, the property is a complex property. In this case, information that exceeds the 4-byte limit of an OfficeArtFOPTE record is moved to the complexData section of the parent OfficeArtRGFOPTE table.
Extracting Art from a Binary File
The process of extracting art from Microsoft Office binary files depends in part on the types of art to extract and in part on the host application. The following table shows which structures host drawings and drawing groups in different Microsoft Office binary file formats.
Drawing Group Locations by Host Application
|Host Application||Structures Containing Drawings||Structures Containing Drawing Groups|
The following procedures show generally how to extract MS-ODRAW content from a host application, without including anything specific to that application.
Every record header in the MS-ODRAW format is 8 bytes long. The bytes are counted from zero, so the third byte is byte 2 and the last byte is byte 7. Bytes 0-1 hold the version and instance fields, bytes 2-3 give the record type, and bytes 4-7 give the record length in bytes. Once you know the type and length of the record, you can either read the number of bytes specified in the record length, or skip that same number of bytes and go on to the next record.
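The header layout can be decoded with a few lines of Python. This is a sketch under stated assumptions: the names RecordHeader and parse_record_header are hypothetical, the byte order is assumed to be little-endian, and the split of the first two bytes into a 4-bit version and a 12-bit instance follows the bit ranges that this article uses later (bits 4-15 for the instance).

```python
import struct
from collections import namedtuple

RecordHeader = namedtuple("RecordHeader", "rec_ver rec_instance rec_type rec_len")


def parse_record_header(buf, offset=0):
    """Decode the 8-byte record header that starts at offset."""
    ver_inst, rec_type, rec_len = struct.unpack_from("<HHI", buf, offset)
    rec_ver = ver_inst & 0x000F   # bits 0-3
    rec_instance = ver_inst >> 4  # bits 4-15
    return RecordHeader(rec_ver, rec_instance, rec_type, rec_len)
```

For example, calling parse_record_header on the first 8 bytes of a record tells you whether to read the next rec_len bytes or to advance the offset by 8 + rec_len and continue with the following record.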
Shapes and Shape Groups
The following procedures show how to reconstruct simple shapes and shape groups in Microsoft Office documents. This does not include connectors or callouts, which are specified in solver containers, or charts and graphs, which use the MS-OGRAPH format. These procedures also do not specify the exact position of the shape or shape group in the client document, nor do they include undo history or revision history options.
To reconstruct a shape group or solo shape in the MS-ODRAW format
Find the OfficeArtDgContainer record in the file for the drawing that contains the shape group in question, or iterate through all of the drawings until you find the one that matches your criteria.
Read the record header to get the number of bytes to the end of the drawing.
Check each record header in the container until you find one where record type = OfficeArtSpgrContainer (0xF003). If there is no OfficeArtSpgrContainer record, there are no shapes in the drawing.
The OfficeArtSpgrContainer record represents the .groupShape field. This record contains all the active shapes in the drawing, as an array of OfficeArtSpgrContainerFileBlock records.
Read the record header to get the length of the container, and then begin reading the first OfficeArtSpgrContainerFileBlock record. Because this is the first OfficeArtSpgrContainerFileBlock record in the array, it must contain an OfficeArtSpContainer record, which must correspond to the group shape for the current group.
Read the OfficeArtSpContainer record as described in the next procedure, “To reconstruct a single shape in the MS-ODRAW format.”
Begin reading the next OfficeArtSpgrContainerFileBlock record, starting with the record header.
If .recType = 0xF004, the rest of the current file block is an OfficeArtSpContainer record. Read the record as described in the next procedure, “To reconstruct a single shape in the MS-ODRAW format.”
If .recType = 0xF003, the rest of the current file block is an OfficeArtSpgrContainer record, which represents a subordinate shape group. Read the record as described in the previous three steps.
Read the remaining OfficeArtSpgrContainerFileBlock records in the same manner.
Find the OfficeArtDggContainer that represents the drawing group for the file.
Inside the OfficeArtDggContainer, find the property tables by checking each record header, reading the records of type OfficeArtFOPT and OfficeArtTertiaryFOPT, and skipping the rest.
These property tables represent default properties across the file. Parse the property tables as before, but apply properties from these tables only to shapes that do not already specify the property in question.
Render the shape group in your application according to the information collected.
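The traversal in the preceding procedure could be sketched as a recursive generator. The name walk_shape_group is hypothetical, the byte order is assumed to be little-endian, and the sketch only locates the shape containers; reading each shape and applying the file-wide defaults are left to the caller.

```python
import struct

SPGR_CONTAINER = 0xF003  # OfficeArtSpgrContainer
SP_CONTAINER = 0xF004    # OfficeArtSpContainer


def walk_shape_group(buf, start, end, depth=0):
    """Yield (depth, is_group_shape, body_offset, rec_len) for every shape.

    start and end delimit the body of an OfficeArtSpgrContainer record.
    The first file block in each group is treated as its group shape.
    """
    first = True
    pos = start
    while pos + 8 <= end:
        _ver_inst, rec_type, rec_len = struct.unpack_from("<HHI", buf, pos)
        body = pos + 8
        if rec_type == SP_CONTAINER:
            yield depth, first, body, rec_len
        elif rec_type == SPGR_CONTAINER:
            # A subordinate shape group: descend into it.
            yield from walk_shape_group(buf, body, body + rec_len, depth + 1)
        first = False
        pos = body + rec_len
```

Each yielded body_offset and rec_len pair delimits one OfficeArtSpContainer, which can then be read as described in the next procedure.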
To reconstruct a single shape in the MS-ODRAW format
Starting with an OfficeArtSpContainer record, read the record header for each record inside the container, and continue as follows:
If record type = OfficeArtFSPGR (0xF009), AND this is the first shape in the drawing, and therefore the group shape, the record represents the .shapeGroup field. Skip the record header, and read the remaining 16 bytes into memory as four 4-byte signed integers that give the left, top, right, and bottom coordinates of the top-level group shape.
If record type = OfficeArtChildAnchor (0xF00F), AND the current shape is not the group shape, the record represents the .childAnchor field. Skip the record header and read the remaining 16 bytes into memory as four signed integers that specify the left, top, right, and bottom coordinates of the current shape in relation to its parent group shape.
If record type = OfficeArtFSP (0xF00A), the record represents the .shapeProp field, which is 16 bytes long including the record header. Read the record into memory. Bits 4-15, which form the recInstance field of the record header, specify an MSOSPT enumeration value that defines the shape type. Bits 102 and 103 specify whether the shape is flipped horizontally or vertically from its default orientation.
If record type = OfficeArtFOPT (0xF00B), OfficeArtSecondaryFOPT (0xF121), or OfficeArtTertiaryFOPT (0xF122), the record is a property table. Parse the properties as described in the next procedure, “To parse properties in the MS-ODRAW format.”
Skip all other records.
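The steps for a single shape might look like the following sketch. The function name parse_shape and the dictionary keys are hypothetical, and the byte order is assumed to be little-endian. The flip flags are read as bits 6 and 7 of the 4-byte flag field that follows the 4-byte shape identifier in the OfficeArtFSP body; that placement comes from one reading of the MS-ODRAW specification and should be checked against it.

```python
import struct

PROPERTY_TABLES = (0xF00B, 0xF121, 0xF122)  # FOPT, SecondaryFOPT, TertiaryFOPT


def parse_shape(buf, start, end):
    """Collect the fields of interest from the body of an OfficeArtSpContainer."""
    shape = {"property_tables": []}
    pos = start
    while pos + 8 <= end:
        ver_inst, rec_type, rec_len = struct.unpack_from("<HHI", buf, pos)
        body = pos + 8
        if rec_type == 0xF009 and rec_len >= 16:    # OfficeArtFSPGR
            shape["group_rect"] = struct.unpack_from("<4i", buf, body)
        elif rec_type == 0xF00F and rec_len >= 16:  # OfficeArtChildAnchor
            shape["child_rect"] = struct.unpack_from("<4i", buf, body)
        elif rec_type == 0xF00A and rec_len >= 8:   # OfficeArtFSP
            shape["shape_type"] = ver_inst >> 4     # recInstance holds the MSOSPT value
            _spid, flags = struct.unpack_from("<II", buf, body)
            shape["flip_h"] = bool(flags & 0x40)    # assumed flag position, see lead-in
            shape["flip_v"] = bool(flags & 0x80)    # assumed flag position, see lead-in
        elif rec_type in PROPERTY_TABLES:
            # Remember where each table lives so it can be parsed separately.
            shape["property_tables"].append((rec_type, body, rec_len))
        # Any other record type is skipped.
        pos = body + rec_len
    return shape
```

The rectangles are returned as (left, top, right, bottom) tuples of signed integers. The sketch records whichever of the two rectangles is present and leaves it to the caller to decide whether the shape is the group shape.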
To parse properties in the MS-ODRAW format
Starting with a property table, such as an OfficeArtFOPT record, begin reading the first member of the .fopt array, which consists of OfficeArtRGFOPTE property groups. Each group represents a category of properties in the property table.
Begin reading the first OfficeArtRGFOPTE group. OfficeArtRGFOPTE records have no record header, but begin with the first member of the .rgfopte array, which consists of OfficeArtFOPTE records.
Each OfficeArtFOPTE record represents a single property. Read the 2-byte .opid field, which is an OfficeArtFOPTEOPID record, into memory.
Read the first 14 bits, which give the identifier of the property, as an unsigned integer. To find which integer values correspond to which properties, consult the properties list in section 2.3 of the MS-ODRAW specification.
The last bit, if set to 1, means that this is a complex property.
Read the remaining 4-byte .op field of the current OfficeArtFOPTE record as a signed integer, which specifies the value for the property or, if the property is complex, the number of bytes of complex data that hold the value.
Read the remaining entries in the .rgfopte array in the same manner.
After the last OfficeArtFOPTE record, read the complex data. The complex data appears in the same order as the properties in the array, in the sizes specified in their .op fields.
Read the remaining OfficeArtRGFOPTE property groups in the same manner as in the previous steps until the property table is completed.
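A simplified version of this procedure is sketched below. The function name parse_property_table is hypothetical. The sketch assumes little-endian byte order, assumes that the property count comes from the record header of the table, treats the table body as one run of 6-byte entries followed by the complex data, and ignores any special cases that the specification defines for particular property types.

```python
import struct


def parse_property_table(buf, body, rec_len, prop_count):
    """Return {property_id: value} for a property table body.

    Simple properties map to a signed integer; complex properties map to bytes.
    """
    limit = body + rec_len
    entries = []
    pos = body
    for _ in range(prop_count):
        if pos + 6 > limit:
            raise ValueError("property table is truncated")
        opid, op = struct.unpack_from("<Hi", buf, pos)  # 2-byte opid, 4-byte op
        prop_id = opid & 0x3FFF           # first 14 bits
        is_complex = bool(opid & 0x8000)  # last bit
        entries.append((prop_id, is_complex, op))
        pos += 6
    props = {}
    # Complex data follows the entries, in the same order as the entries.
    for prop_id, is_complex, op in entries:
        if is_complex:
            if op < 0 or pos + op > limit:
                raise ValueError("complex data overruns the table")
            props[prop_id] = bytes(buf[pos:pos + op])
            pos += op
        else:
            props[prop_id] = op
    return props
```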
To extract bitmap images in the MS-ODRAW format
Locate the OfficeArtDggContainer record for the file.
Read the record header to get the length of the container.
Check the record header for each record inside the container until you find an OfficeArtBStoreContainer (0xF001) record, which represents the .blipStore field, or until you reach the end of the OfficeArtDggContainer. If there is no OfficeArtBStoreContainer record there are no bitmaps to extract.
The OfficeArtBStoreContainer record holds all of the bitmaps for the file in an array of OfficeArtBStoreContainerFileBlock records. Read the record header.
Bits 4-15 of the record header, which form the recInstance field, are an unsigned integer that specifies the number of records in the array.
For each OfficeArtBStoreContainerFileBlock record in the array, do the following.
Read bytes 2-3 of the record header to get the record type.
If record type = 0xF018-0xF117, the record is an OfficeArtBlip. Continue to the next step in this procedure.
If record type = 0xF007, the record is an OfficeArtFBSE. Do the following:
Skip the first 20 bytes after the record header.
Read the next 4 bytes, which give the size of the bitmap as an unsigned integer.
Skip the next 12 bytes.
Read the .name field, which is a variable-length, null-terminated Unicode string that gives the name of the bitmap.
The next field is .embeddedBlip, which is an OfficeArtBlip record.
Read the record header of the OfficeArtBlip record. Bytes 2-3 specify the file type the image would have if it were saved separately. The last 4 bytes of the record header give the length of the rest of the record. For more information about which type values correspond to which file types, see section 2.2.23 of the MS-ODRAW specification.
Read the rest of the record into memory and save as the file type specified by the record header.
Begin reading the next OfficeArtBStoreContainerFileBlock record, and continue in the same manner until all of the bitmaps are processed.
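The bitmap procedure could be sketched as follows. The function name extract_blips and the dictionary keys are hypothetical, and the byte order is assumed to be little-endian. One detail goes beyond the steps above and is an assumption to verify: the sketch takes the length of the name from a single byte at offset 33 of the OfficeArtFBSE body (inside the 12 skipped bytes) instead of scanning for a null terminator, because an empty name may occupy no bytes at all. The sketch returns the whole body of each OfficeArtBlip record; the specification describes type-specific fields that precede the image bytes within that body.

```python
import struct

FBSE = 0xF007
BLIP_FIRST, BLIP_LAST = 0xF018, 0xF117


def extract_blips(buf, start, end):
    """Return one dict per bitmap found in an OfficeArtBStoreContainer body."""
    found = []
    pos = start
    while pos + 8 <= end:
        _ver_inst, rec_type, rec_len = struct.unpack_from("<HHI", buf, pos)
        body = pos + 8
        block_end = body + rec_len
        name, declared_size, blip_pos = "", None, pos
        if rec_type == FBSE and rec_len >= 36:
            (declared_size,) = struct.unpack_from("<I", buf, body + 20)
            name_len = buf[body + 33]  # assumed name-length byte, see lead-in
            name_start = body + 36
            raw = bytes(buf[name_start:name_start + name_len])
            name = raw.decode("utf-16-le", errors="replace").rstrip("\x00")
            blip_pos = name_start + name_len
        if blip_pos + 8 <= block_end:
            _vi, blip_type, blip_len = struct.unpack_from("<HHI", buf, blip_pos)
            if BLIP_FIRST <= blip_type <= BLIP_LAST:
                data = bytes(buf[blip_pos + 8:blip_pos + 8 + blip_len])
                found.append({"name": name, "blip_type": blip_type,
                              "declared_size": declared_size, "data": data})
        pos = block_end
    return found
```

An OfficeArtFBSE record that carries no embedded OfficeArtBlip record simply contributes no entry to the result.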
Making changes to any Microsoft Office binary file requires reading the file into an internal representation, editing the representation, and then re-writing the file. Because of these requirements, save operations require a thorough understanding of the file formats involved, which exceeds the scope of this article. This article and the related articles in the series provide the steps and information to perform simple extractions and facilitate a better understanding of the MS-ODRAW format.
Specifically, a basic knowledge of the MS-ODRAW format can help improve your facility with the other Microsoft Office binary file formats by enabling you to identify and parse drawing objects and bitmaps as you encounter them in Microsoft Office binary files.
For more information, see the following resources: