Finding Graphics in a Binary Word .doc File

Office 2010

Summary: This article describes how to extract drawings and bitmap images from a binary Microsoft Word .doc file in the Word Binary File Format.

The Word Binary File Format (.doc) is used by Microsoft Office Word 2003, Microsoft Word 2002, Microsoft Word 2000, and Microsoft Word 97. Extracting images directly from the binary file lets you process images from many files quickly, without opening the Microsoft Word application. You can also extract drawings, such as particular pieces of clip art, WordArt, or shapes.

There are three ways to extract graphical elements from binary Microsoft Word documents.

  1. Open the document in Microsoft Word or another compatible program. This is the easiest way to process small numbers of files.

  2. Use the Word Primary Interop Assemblies. These are a set of .NET classes that provide a complete object model for working with Microsoft Word, and are included by default in Visual Studio Tools for Office (VSTO) projects. This is recommended for most automated solutions.

  3. Interpret the binary file directly, as described in this article. This is recommended only for advanced scenarios where the Word Primary Interop Assemblies are not appropriate, such as creating applications that must run without Microsoft Word installed.

The first step to locating most graphical content in a binary Word file is to find the OfficeArtContent container, which is located in the Table Stream. The location of this structure is found by reading the File Information Block (FIB).

Inline pictures are stored in the Data stream, in PICFAndOfficeArtData structures. These pictures can only be found by parsing character properties, as described in this article in the procedure To locate inline pictures in a binary Word document.

To find the office art content in a binary Word document

  1. Open the Word Document stream ("WordDocument").

    The File Information Block (FIB) begins at byte 0 of this stream.

  2. Within the FIB, skip ahead 154 bytes to the FibRgFcLcb blob, which starts with the FibRgFcLcb97 structure.

  3. Within the FibRgFcLcb97 structure, skip ahead 400 bytes to the .fcDggInfo attribute, which is 4 bytes long. Read that attribute as an unsigned integer.

    This number specifies the offset of the art content in the 1Table stream or 0Table stream.

  4. Open the 1Table or 0Table stream, and go to the offset specified in the previous step.

    This is the OfficeArtContent container for the file.
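
The steps above can be sketched in Python as follows. This is a minimal illustration rather than a complete reader: it assumes the Word Document stream has already been read into a byte string by an OLE compound file reader of your choice, and the function name is hypothetical. Two details are not stated in this article and are assumptions based on [MS-DOC]: the 4-byte lcbDggInfo length immediately follows fcDggInfo, and bit 0x0200 of the 16-bit flags field at byte 10 of the FIB selects the 1Table stream over the 0Table stream.

```python
import struct

FIB_RGFCLCB97_OFFSET = 154  # FibRgFcLcb97 starts here in the FIB (step 2)
FC_DGG_INFO_OFFSET = 400    # fcDggInfo within FibRgFcLcb97 (step 3)
F_WHICH_TBL_STM = 0x0200    # assumption: table stream flag at FIB byte 10


def locate_office_art_content(word_document):
    """Return (table_stream_name, offset, length) of OfficeArtContent."""
    pos = FIB_RGFCLCB97_OFFSET + FC_DGG_INFO_OFFSET
    # fcDggInfo and (assumed) lcbDggInfo are little-endian unsigned integers.
    fc_dgg_info, lcb_dgg_info = struct.unpack_from("<II", word_document, pos)
    (flags,) = struct.unpack_from("<H", word_document, 10)
    table_name = "1Table" if flags & F_WHICH_TBL_STM else "0Table"
    return table_name, fc_dgg_info, lcb_dgg_info
```

The returned offset and length can then be used to slice the OfficeArtContent bytes out of the named table stream.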

The OfficeArtContent container starts with the .DrawingGroupData attribute, which is an OfficeArtDggContainer record, as described in section 2.2.12 of the [MS-ODRAW]: Office Drawing Binary File Format Structure Specification. It contains the bitmap images for the document, and shared properties for all of the drawings in the document.

The next attribute of the OfficeArtContent object is .Drawings, which is an array of OfficeArtWordDrawing objects. Each OfficeArtWordDrawing object starts with a 1-byte label that specifies whether the drawing is in the Main Document (0x00) or in the Header Document (0x01). The rest of the OfficeArtWordDrawing object is an OfficeArtDgContainer record that contains the drawing information, as specified in [MS-ODRAW] section 2.2.13.

To extract bitmap images from a binary Word document

  1. Open the OfficeArtContent container for the document. The first part of this structure is an OfficeArtDggContainer record.

  2. Follow the bitmap extraction procedure at http://msdn.microsoft.com/en-us/library/gg985447#OfficeArtBFF_Bitmaps_Extract in the article "Understanding Graphics in Office Binary File Formats".

To extract shapes and drawings from a binary Word document

  1. Open the OfficeArtContent container for the document. The first part of this structure is an OfficeArtDggContainer record.

  2. Read the first 8 bytes of the OfficeArtDggContainer record.

    Bytes 4-7 specify the length of the container.

  3. Inside the OfficeArtDggContainer record, find the shared property tables by checking each record header, saving the records of type OfficeArtFOPT (0xF00B) and OfficeArtTertiaryFOPT (0xF122) into memory, and skipping the rest.

  4. Skip to the next child element of the OfficeArtContent container, which is the .Drawings array.

  5. For each member of the array:

    1. Read the 1-byte OfficeArtWordDrawing.dgglbl label.

      If OfficeArtWordDrawing.dgglbl = 0x00, the drawing is in the main document. If OfficeArtWordDrawing.dgglbl = 0x01, the drawing is in the header.

    2. The rest of the OfficeArtWordDrawing object is an OfficeArtDgContainer record, which represents a drawing object.

      You can parse the drawing by following the shape group procedure at http://msdn.microsoft.com/en-us/library/gg985447#OfficeArtBFF_Shapes_ReconstructShapeGroup in the article "Understanding Graphics in Office Binary File Formats".
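
As a rough sketch of the procedure above, the following Python code walks the OfficeArtContent bytes, collects the shared property records, and splits the .Drawings array into labeled drawing containers. The function names are hypothetical. The code assumes `content` holds exactly the bytes of the OfficeArtContent container, and it assumes the 8-byte record header layout from [MS-ODRAW] (a 2-byte version and instance field, a 2-byte record type, then the 4-byte length that this article places at bytes 4-7).

```python
import struct

SHARED_PROPERTY_TYPES = (0xF00B, 0xF122)  # OfficeArtFOPT, OfficeArtTertiaryFOPT


def read_record_header(buf, pos):
    """Return (rec_type, rec_len) from an 8-byte OfficeArt record header."""
    # Assumed layout: bytes 0-1 version/instance, 2-3 type, 4-7 length.
    _ver_inst, rec_type, rec_len = struct.unpack_from("<HHI", buf, pos)
    return rec_type, rec_len


def parse_office_art_content(content):
    """Split OfficeArtContent into shared property records and drawings."""
    _, dgg_len = read_record_header(content, 0)
    dgg_end = 8 + dgg_len

    # Step 3: scan the direct children of the OfficeArtDggContainer.
    shared_props = []
    pos = 8
    while pos + 8 <= dgg_end:
        rec_type, rec_len = read_record_header(content, pos)
        if rec_type in SHARED_PROPERTY_TYPES:
            shared_props.append((rec_type, content[pos + 8:pos + 8 + rec_len]))
        pos += 8 + rec_len

    # Steps 4-5: each array member is a 1-byte label plus an OfficeArtDgContainer.
    drawings = []
    pos = dgg_end
    while pos + 9 <= len(content):
        dgglbl = content[pos]
        _, dg_len = read_record_header(content, pos + 1)
        location = "main" if dgglbl == 0x00 else "header"
        drawings.append((location, content[pos + 1:pos + 9 + dg_len]))
        pos += 9 + dg_len
    return shared_props, drawings
```

Each returned drawing is the raw OfficeArtDgContainer record, header included, ready to be handed to a shape parser.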

A picture in the Word Binary File Format can be any drawing object, image, or combination thereof. Pictures can be either floating or inline.

A floating picture is represented by an anchor character with a Unicode value of 0x0008, with the sprmCFSpec character property applied with a value of 1. Floating pictures are referenced by a PlcfSpa structure, which contains additional data about the picture. A floating picture can appear anywhere on the same page as its anchor, according to the text-wrapping options set by the document author. For more information about character properties, see [MS-DOC] section 2.6.1.

An inline picture is represented by a character whose Unicode value is 0x0001 and has the following character properties applied.

Property Modifier    Value
sprmCFSpec           1
sprmCPicLocation     The location of the picture data in the Data Stream.
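
To illustrate how these two property modifiers might be detected, the following Python sketch scans the Prl array of a Chpx structure that has already been located (the procedure for inline pictures later in this article explains how to find it). Several details are assumptions taken from [MS-DOC] rather than from this article: the opcode values 0x0855 and 0x6A03, and the rule that the top three bits of a Sprm select the operand size. The function names are hypothetical.

```python
import struct

SPRM_C_F_SPEC = 0x0855        # assumed opcode for sprmCFSpec
SPRM_C_PIC_LOCATION = 0x6A03  # assumed opcode for sprmCPicLocation

# Assumed operand sizes in bytes, keyed by the top three bits of the Sprm.
# A key of 6 is handled separately: the operand starts with its own length.
FIXED_OPERAND_SIZES = {0: 1, 1: 1, 2: 2, 3: 4, 4: 2, 5: 2, 7: 3}


def iter_prls(grpprl):
    """Yield (sprm, operand_bytes) for each Prl in a byte string."""
    pos = 0
    while pos + 2 <= len(grpprl):
        (sprm,) = struct.unpack_from("<H", grpprl, pos)
        pos += 2
        spra = sprm >> 13
        if spra == 6:
            if pos >= len(grpprl):
                break  # truncated data; stop rather than guess
            size = grpprl[pos] + 1  # length byte plus payload
        else:
            size = FIXED_OPERAND_SIZES[spra]
        yield sprm, grpprl[pos:pos + size]
        pos += size


def inline_picture_location(grpprl):
    """Return the Data stream offset for an inline picture, else None."""
    is_special = False
    location = None
    for sprm, operand in iter_prls(grpprl):
        if sprm == SPRM_C_F_SPEC:
            is_special = operand[:1] == b"\x01"
        elif sprm == SPRM_C_PIC_LOCATION and len(operand) == 4:
            # Read here as an unsigned little-endian offset.
            (location,) = struct.unpack("<I", operand)
    return location if is_special else None
```

A caller would apply `inline_picture_location` only to characters whose Unicode value is 0x0001.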

Each bitmap image is anchored to a drawing object, which may or may not contain visible shapes. The drawing object uses its .clientAnchor property to anchor to a Character Position (CP). To determine the text character that corresponds to a given CP, follow the procedure titled Extracting Text from Word Files in the article "Understanding the Word MS-DOC Binary File Format".

To locate floating pictures in a binary Word document

  1. Extract the drawing objects as described in this article in the procedure titled To extract shapes and drawings from a binary Word document, and save the OfficeArtFSP records into an array.

    An OfficeArtFSP record fills the .ShapeProp attribute of an OfficeArtSpContainer record, which represents an individual shape.

  2. Open the FIB at byte 0 of the Word Document Stream, and skip ahead 154 bytes to the FibRgFcLcb97 structure.

  3. Read the 4-byte .fcPlcSpaMom and .lcbPlcSpaMom attributes starting at byte 320 of the FibRgFcLcb97 structure.

  4. In the Table stream, go to the offset specified by FibRgFcLcb97.fcPlcSpaMom and read the number of bytes specified by FibRgFcLcb97.lcbPlcSpaMom. These bytes make up a PlcfSpa structure.

  5. The PlcfSpa structure consists of two arrays: .aCP and .aSpa. The .aCP array consists of the CPs, which specify positions of anchors for floating pictures. The .aSpa array contains Spa structures, which specify positional information for how the corresponding picture relates to the surrounding text.

    Note

    The .aCP array always contains one more element than the .aSpa array. For more information, see [MS-DOC] section 2.8.27.

    For each Spa structure in the .aSpa array, do the following:

    1. Read the first 4 bytes (Spa.lid), which specify the ID of the drawing object that is anchored to the corresponding CP.

    2. Iterate through the array of OfficeArtFSP records, reading bytes 8-11 of each record to find the .spid attribute.

      Where OfficeArtFSP.spid is equal to Spa.lid, the parent shape of the current OfficeArtFSP corresponds to the current Spa structure, and is anchored to the corresponding CP.

    3. Read the rest of the current Spa structure to determine the page coordinates and text wrapping attributes of the shape.
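
The matching step in this procedure could be sketched in Python as follows. The function names are hypothetical, and the code assumes that each Spa structure is 26 bytes long, which comes from [MS-DOC] and is not stated in this article. It also assumes `plcf_spa` holds exactly the bytes read in step 4 and that each entry in `fsp_records` is a complete OfficeArtFSP record including its 8-byte header.

```python
import struct

SPA_SIZE = 26  # assumed size in bytes of one Spa structure


def parse_plcf_spa(plcf_spa):
    """Return a list of (cp, lid, spa_bytes), one per floating picture."""
    # n Spa entries and n + 1 CPs occupy 4 * (n + 1) + SPA_SIZE * n bytes.
    count = (len(plcf_spa) - 4) // (4 + SPA_SIZE)
    if count <= 0:
        return []
    cps = struct.unpack_from("<%dI" % (count + 1), plcf_spa, 0)
    spa_start = 4 * (count + 1)
    anchors = []
    for i in range(count):
        start = spa_start + i * SPA_SIZE
        spa = plcf_spa[start:start + SPA_SIZE]
        (lid,) = struct.unpack_from("<I", spa, 0)  # Spa.lid, first 4 bytes
        anchors.append((cps[i], lid, spa))
    return anchors


def match_shapes(anchors, fsp_records):
    """Map each anchor CP to the OfficeArtFSP whose spid equals Spa.lid."""
    # OfficeArtFSP.spid sits at bytes 8-11 of the record.
    by_spid = {struct.unpack_from("<I", fsp, 8)[0]: fsp for fsp in fsp_records}
    return {cp: by_spid[lid] for cp, lid, _ in anchors if lid in by_spid}
```

Building a dictionary keyed by spid avoids rescanning the OfficeArtFSP array for every Spa structure; the remaining bytes of each Spa are returned untouched so the page coordinates and wrapping attributes can be decoded separately.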

To locate inline pictures in a binary Word document

  1. Read the text and character positions of the document by following the procedure titled Extracting Text from Word Files in the article "Understanding the Word MS-DOC Binary File Format", and load them into memory.

  2. From the in-memory representation, create a dictionary of text characters, their CPs, and the offsets in the Word Document stream where they appear.

  3. Open the FIB at byte 0 of the Word Document stream, and skip ahead 154 bytes to the FibRgFcLcb97 structure.

  4. Read the 4-byte FibRgFcLcb97.fcPlcfBteChpx and FibRgFcLcb97.lcbPlcfBteChpx attributes, which start at byte 96 of the FibRgFcLcb97 structure.

  5. In the Table stream, go to the offset specified by FibRgFcLcb97.fcPlcfBteChpx and read the number of bytes specified by FibRgFcLcb97.lcbPlcfBteChpx. These bytes contain a PlcBteChpx structure.

  6. The PlcBteChpx structure contains two arrays: .aFC, and .aPnBteChpx. .aFC contains an array of unsigned integers that specify offsets where text begins. .aPnBteChpx contains 4-byte PnFkpChpx structures. The PnFkpChpx structures point to ChpxFkp structures, which contain property information.

    Note

    The .aFC array always contains one more element than the .aPnBteChpx array. For more information, see [MS-DOC] section 2.8.5.

    Load the PlcBteChpx structure into memory.

  7. Iterate through your in-memory dictionary of characters, CPs, and offsets.

    For each character whose Unicode value is 0x0001, do the following:

    1. Find the last member of the PlcBteChpx.aFC array whose value is less than or equal to the offset of the current character. The members of the .aFC array are ordered from lowest to highest.

    2. Read the first 22 bits of the corresponding member of the PlcBteChpx.aPnBteChpx array as an unsigned integer, and go to the offset specified by that number * 512.

    3. The next 512 bytes contain a ChpxFkp structure. Read the structure into memory as described in [MS-DOC] section 2.9.33, and mark the position of each array element.

    4. Find the last member of the ChpxFkp.rgfc array whose value is less than or equal to the offset of the current character. The members of the .rgfc array are ordered from lowest to highest.

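    5. Find the member of the ChpxFkp.rgb array that has the same index as the current ChpxFkp.rgfc member. The .rgb array begins immediately after the .rgfc array, so the position of this member in the structure is that index plus the length of ChpxFkp.rgfc in bytes. This member specifies the location of a Chpx structure within the ChpxFkp, as described in [MS-DOC] section 2.9.33.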

    6. The Chpx structure starts with a 1-byte integer that specifies the size in bytes of the rest of the structure, which consists of an array of Prl structures. Each Prl structure contains a 2-byte Single Property Modifier (Sprm), and a variable-length operand that specifies the value for the property.

      Iterate through the Prl structures in the array by reading the Sprm at the first 2 bytes of each Prl structure to determine the property and the length of the operand, as described in [MS-DOC] section 2.2.5.1.

      If the Sprm is sprmCPicLocation:

      1. Open the Data Stream and go to the offset specified by the value of the current Prl.Operand, which marks the beginning of a PICFAndOfficeArtData structure.

      2. If bytes 6-7 of the PICFAndOfficeArtData structure equal 0x0066, read byte 68 and skip ahead that number from byte 69 of the structure. Otherwise, go to byte 68.

        The rest of the structure is the .picture attribute, which is an OfficeArtInlineSpContainer record.

      3. An OfficeArtInlineSpContainer record contains an OfficeArtSpContainer record, which may be a visible shape or an invisible shape anchor, followed by an array of zero or more OfficeArtBStoreContainerFileBlock records, which contain bitmap images.

        Parse the OfficeArtSpContainer record as described in the solo shape procedure at http://msdn.microsoft.com/en-us/library/gg985447#OfficeArtBFF_Shapes_ReconstructSoloShape in the article "Understanding Graphics in Office Binary File Formats". Then parse the OfficeArtBStoreContainerFileBlock records as described in the bitmap extraction procedure at http://msdn.microsoft.com/en-us/library/gg985447#OfficeArtBFF_Bitmaps_Extract in the same article, starting on step 5.

    7. In the memory model, associate the contents of the OfficeArtInlineSpContainer record with the current character, and its corresponding CP.
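
The lookup steps in this procedure can be sketched in Python as shown below. The function names are hypothetical, and the code treats a character that sits exactly on a boundary offset as belonging to the run that starts there. Details of the ChpxFkp layout are assumptions taken from [MS-DOC] and are not stated in this article: the last byte of the 512-byte page holds the run count, the 1-byte .rgb entries follow the .rgfc array, and each .rgb entry multiplied by 2 is the position of a Chpx within the page (0 meaning no Chpx).

```python
import struct
from bisect import bisect_right

FKP_SIZE = 512


def find_fkp_offset(plc_bte_chpx, char_offset):
    """Return the Word Document stream offset of the ChpxFkp for a character."""
    # n PnFkpChpx entries and n + 1 offsets occupy 4 * (n + 1) + 4 * n bytes.
    count = (len(plc_bte_chpx) - 4) // 8
    a_fc = struct.unpack_from("<%dI" % (count + 1), plc_bte_chpx, 0)
    a_pn = struct.unpack_from("<%dI" % count, plc_bte_chpx, 4 * (count + 1))
    index = bisect_right(a_fc, char_offset) - 1
    if count == 0 or index < 0:
        raise ValueError("offset is before the first formatted run")
    index = min(index, count - 1)
    pn = a_pn[index] & 0x3FFFFF  # the first 22 bits are the low-order bits
    return pn * FKP_SIZE


def find_chpx(fkp, char_offset):
    """Return the Prl array bytes of the Chpx for a character, or b""."""
    crun = fkp[FKP_SIZE - 1]  # assumed: last byte holds the run count
    rgfc = struct.unpack_from("<%dI" % (crun + 1), fkp, 0)
    index = bisect_right(rgfc, char_offset) - 1
    if not 0 <= index < crun:
        return b""
    chpx_pos = fkp[4 * (crun + 1) + index] * 2  # assumed: rgb entry times 2
    if chpx_pos == 0:
        return b""
    cb = fkp[chpx_pos]  # 1-byte size of the Prl array that follows
    return fkp[chpx_pos + 1:chpx_pos + 1 + cb]


def picture_offset(picf):
    """Return where .picture starts in a PICFAndOfficeArtData structure."""
    (mm,) = struct.unpack_from("<H", picf, 6)  # bytes 6-7
    if mm == 0x0066:
        return 69 + picf[68]  # skip the length byte and the name it counts
    return 68
```

In use, the offset from `find_fkp_offset` selects the 512-byte page to pass to `find_chpx`, and the returned Prl bytes are then scanned for sprmCPicLocation.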

© 2014 Microsoft