Finding OLE Objects in a Binary Word .doc File

Summary: This article describes how to extract OLE objects from a binary Microsoft Word .doc file.

Applies to: Office 2010 | Open XML | Visual Studio Tools for Microsoft Office | Word 2000 | Word 2002 | Word 2003 | Word 2007 | Word 2010 | Word 97

Published:   April 2012

Provided by:   Microsoft Corporation

Contents

  • Introduction to OLE Objects in Binary Word Documents

  • Finding OLE Objects in Binary Word Documents

    • To find all of the OLE Objects in a binary Word file
  • Finding the positions of OLE objects in a Word Document

    • To check for Office Binary Document RC4 CryptoAPI Encryption

    • To find the positions of OLE Objects in a binary Word document

  • Conclusion

  • Additional Resources

Introduction to OLE Objects in Binary Word Documents

The Microsoft Word binary file format (.doc) is used by Microsoft Office Word 2003, Microsoft Word 2002, Microsoft Word 2000, and Microsoft Word 97. Extracting OLE objects directly from the binary file lets you quickly scan many files for a particular object without opening the Word application.

There are three ways to extract OLE objects from binary Microsoft Word documents.

  1. Open the document in Microsoft Word or another compatible program. This is the easiest way to process small numbers of files.

  2. Use the Word Primary Interop Assemblies. These are a set of .NET classes that provide a complete object model for working with Microsoft Word, and are included by default in Visual Studio Tools for Office (VSTO) projects. This is recommended for most automated solutions.

  3. Interpret the binary file directly, as described in this article. This is recommended only for advanced scenarios where the Word Primary Interop Assemblies are not appropriate, such as creating applications that must run without Microsoft Word installed.

Finding OLE Objects in Binary Word Documents

All OLE objects in a binary Microsoft Word document are stored in the ObjectPool storage, which is located in the Data Stream. The ObjectPool storage contains one storage for each object, and each of those storages contains a stream named \003ObjInfo, where "\003" is the single character with value 0x0003, not the string literal "\003". This stream contains an ODT structure, which specifies information about the embedded OLE object.

You can find all of the OLE objects in a file by finding the storages that contain \003ObjInfo streams inside the Data Stream. An \003ObjInfo stream contains a 6-byte ODT structure, which provides information about the OLE object. After the ODT structure, the stream ends. The rest of the storage is the OLE object itself.
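The following Python sketch illustrates the stream name and the 6-byte ODT structure. It is a minimal example, not a definitive implementation. The names OBJINFO_STREAM_NAME and split_odt are hypothetical, and the split into three little-endian 16-bit words (labeled persist1, cf, and persist2 here) is an assumption that you should check against the ODT definition in [MS-DOC].

```python
import struct

# The stream name starts with the single control character 0x0003,
# followed by the seven letters "ObjInfo" (8 characters in total).
OBJINFO_STREAM_NAME = "\x03ObjInfo"


def split_odt(objinfo_stream):
    """Split the 6-byte ODT structure into three 16-bit words.

    Assumption: the structure is three little-endian 16-bit fields.
    The names persist1, cf, and persist2 are illustrative labels only.
    """
    if len(objinfo_stream) < 6:
        raise ValueError("expected at least 6 bytes of ODT data")
    persist1, cf, persist2 = struct.unpack_from("<3H", objinfo_stream, 0)
    return persist1, cf, persist2
```

Comparing a stream name against the 10-character literal "\\003ObjInfo" would never match, which is why the constant uses the "\x03" escape.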

To find all of the OLE objects in a binary Word file

  1. Open the Data Stream, and inside the stream open the ObjectPool storage. If there is no Data Stream, or if the stream does not contain an ObjectPool storage, there are no OLE objects in the file, and you may exit this procedure.

  2. Check each storage inside the ObjectPool for an \003ObjInfo stream.

  3. For each \003ObjInfo stream, read the 6-byte ODT structure, as described in the Word (.doc) Binary File Format ([MS-DOC] section 2.9.165), and then read the remainder of the storage as an OLE object.
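The procedure above can be sketched in Python as follows. This example assumes that the ObjectPool storage has already been loaded into memory by a compound-file reader of your choice, and that it is represented as a dictionary that maps each storage name to a dictionary of stream names and stream bytes. That in-memory model and the function name find_ole_objects are hypothetical and are not part of [MS-DOC].

```python
OBJINFO = "\x03ObjInfo"  # character 0x0003 followed by "ObjInfo"


def find_ole_objects(object_pool):
    """Return {storage_name: (odt_bytes, other_streams)} for each OLE object.

    object_pool is the assumed in-memory model of the ObjectPool storage:
    {storage_name: {stream_name: stream_bytes}}, or None if it is absent.
    """
    found = {}
    if not object_pool:
        return found  # no ObjectPool storage means no OLE objects
    for name, streams in object_pool.items():
        info = streams.get(OBJINFO)
        if info is None:
            continue  # this storage has no \003ObjInfo stream
        odt = info[:6]  # the 6-byte ODT structure
        # Everything else in the storage belongs to the OLE object itself.
        rest = {key: value for key, value in streams.items() if key != OBJINFO}
        found[name] = (odt, rest)
    return found
```

Returning an empty dictionary when the ObjectPool storage is missing mirrors step 1, where the procedure ends because the file contains no OLE objects.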

Finding the positions of OLE objects in a Word Document

The OLE objects in a binary Microsoft Word document are logically bound to field separator characters in the document and their locations in the Data Stream are specified by properties on those characters. To find the positions of OLE objects in a document, you must first locate the separator characters and then read their properties. You must also know whether the document is encrypted by using Office Binary Document RC4 CryptoAPI Encryption because this changes how the locations of OLE objects are arranged in the file.

To check for Office Binary Document RC4 CryptoAPI Encryption

  1. Begin reading the File Information Block (FIB) at offset 0 of the Word Document stream.

  2. Read the 1-bit FibBase.fEncrypted flag, which is the first (least significant) bit of byte 11 of the FIB.

    If fEncrypted = 0, the file is not encrypted, and you may exit this procedure.

  3. If fEncrypted = 1, read the 1-bit FibBase.fObfuscated flag, which is the last (most significant) bit of byte 11.

    If fObfuscated = 1, the file uses XOR obfuscation, which does not affect the arrangement of OLE objects, so you may exit this procedure.

  4. If fObfuscated = 0, read the first 2 bytes of the table stream as an unsigned integer that specifies the encryption version. If this number is larger than 0x0001, the file uses Office Binary Document RC4 CryptoAPI Encryption.
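The following Python sketch walks through the same checks. It assumes that the Word Document stream and the table stream have already been read into bytes objects, that fEncrypted is the least significant bit of byte 11 (mask 0x01), and that fObfuscated is the most significant bit (mask 0x80). The function name and the returned labels are hypothetical.

```python
import struct

F_ENCRYPTED = 0x01   # assumed mask: lowest bit of FIB byte 11
F_OBFUSCATED = 0x80  # assumed mask: highest bit of FIB byte 11


def classify_encryption(word_document, table_stream):
    """Return "none", "xor", "rc4-cryptoapi", or "other"."""
    if len(word_document) < 12:
        raise ValueError("Word Document stream is too short to hold the flags")
    flags = word_document[11]
    if not flags & F_ENCRYPTED:
        return "none"  # step 2: the file is not encrypted
    if flags & F_OBFUSCATED:
        return "xor"   # step 3: XOR obfuscation
    # Step 4: the first 2 bytes of the table stream give the version.
    (version,) = struct.unpack_from("<H", table_stream, 0)
    return "rc4-cryptoapi" if version > 0x0001 else "other"
```

Only the "rc4-cryptoapi" result changes how the positions of OLE objects are found in the procedure that follows.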

To find the positions of OLE Objects in a binary Word document

  1. Read the text and character positions of the document by following the procedure titled Extracting Text from Word Files in the article "Understanding the Word MS-DOC Binary File Format" and load them into memory.

  2. From the in-memory representation, create a dictionary of characters, their Character Positions (CPs), and the offsets in the Word Document Stream where they appear.

  3. Open the FIB at byte 0 of the Word Document Stream, and skip ahead 154 bytes to the FibRgFcLcb97 structure.

  4. Read the 4-byte FibRgFcLcb97.fcPlcfBteChpx and FibRgFcLcb97.lcbPlcfBteChpx fields, which start at byte 96 and byte 100 of the FibRgFcLcb97 structure, respectively.

  5. Open the 0Table or 1Table stream, as indicated by the FibBase.fWhichTblStm flag.

    For more information about the table stream, see [MS-DOC] section 2.1.2.1.

  6. Go to the offset in the table stream specified by FibRgFcLcb97.fcPlcfBteChpx and read the number of bytes specified by FibRgFcLcb97.lcbPlcfBteChpx, which contains a PlcBteChpx structure.

  7. The PlcBteChpx structure contains two arrays: .aFC, and .aPnBteChpx. .aFC contains an array of unsigned integers that specify offsets where text begins. .aPnBteChpx contains 4-byte PnFkpChpx structures. The PnFkpChpx structures point to ChpxFkp structures, which contain property information.

    Load the PlcBteChpx structure into memory.

  8. Iterate through your in-memory dictionary of characters, CPs, and offsets.

    For each character whose Unicode value is 0x0014, do the following:

    1. Find the last member of the PlcBteChpx.aFC array whose value is less than or equal to the offset of the current character. The members of the .aFC array are ordered from lowest to highest.

    2. Read the first 22 bits of the corresponding member of the PlcBteChpx.aPnBteChpx array as an unsigned integer, and go to the offset in the Word Document Stream that is equal to that number multiplied by 512.

    3. The next 512 bytes contain a ChpxFkp structure. Read the structure into memory as described in [MS-DOC] section 2.9.33, and mark the position of each array element.

    4. Find the last member of the ChpxFkp.rgfc array whose value is less than or equal to the offset of the current character. The members of the .rgfc array are ordered from lowest to highest.

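    5. Find the member of the ChpxFkp.rgb array that has the same index as the ChpxFkp.rgfc member from the previous step. The .rgb array immediately follows the .rgfc array in the ChpxFkp structure. Multiply the value of this 1-byte member by 2 to get the offset, from the start of the ChpxFkp structure, of the corresponding Chpx structure.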

    6. The Chpx structure starts with a 1-byte integer that specifies the size in bytes of the rest of the structure, which consists of an array of Prl structures. Each Prl structure contains a 2-byte Single Property Modifier (Sprm), and a variable-length operand which specifies the value for the property.

      Iterate through the Prl structures in the array by reading the Sprm at the first 2 bytes of each Prl to determine the property and the length of the operand, as described in [MS-DOC] section 2.2.5.1.

      If sprmCFOLE2 equals true, do the following:

      1. If the file uses Office Binary Document RC4 CryptoAPI Encryption:

        1. Open the Data Stream and go to the offset specified by the value of the current Prl.Operand.

        2. Read the FOBJH structure that begins at the current offset.

          If the 1-bit fCompressed flag at byte 2 is set to "1", the OLE storage is compressed, as specified in RFC 1950.

          The last 4 bytes of the structure are the .cbObj field. This field specifies the total number of bytes of the current FOBJH structure and the object storage that follows it, as a signed integer.

      2. If the file does not use Office Binary Document RC4 CryptoAPI Encryption, do the following:

        1. Convert the decimal value of the current Prl.Operand to a string, and prefix the string with an underscore ("_") character.

        2. Open the Data Stream, and search inside the object pool storage for a storage whose name matches the string from the previous step.

    7. Load the object storage you found in the previous step, and the \003ObjInfo stream it contains, into memory.

      The \003ObjInfo stream begins with a 6-byte ODT structure, as defined in [MS-DOC] section 2.9.165, which provides information about the OLE object. The rest of the object storage is the OLE object itself.

    8. In the memory model, associate the OLE object with the current character, and its corresponding CP.
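The lookup steps in this procedure can be sketched in Python as follows. This is a minimal example under stated assumptions, not a definitive implementation. It assumes that the table stream and the Word Document stream are already in memory as bytes objects. It also assumes the following layout details, which you should verify against [MS-DOC]: the last byte of each 512-byte ChpxFkp holds the number of runs, each .rgb member holds the offset of its Chpx divided by 2, the top three bits of a Sprm select the operand size, and an array entry that is equal to the character offset counts as a match because each entry marks where a run begins. All function names are hypothetical.

```python
import struct
from bisect import bisect_right

FKP_SIZE = 512  # each ChpxFkp occupies one 512-byte page


def parse_plc_bte_chpx(table_stream, fc, lcb):
    """Split a PlcBteChpx into its aFC offsets and 22-bit page numbers."""
    if lcb < 4:
        return [], []
    data = table_stream[fc:fc + lcb]
    n = (lcb - 4) // 8  # n page entries and n + 1 aFC entries, 4 bytes each
    a_fc = list(struct.unpack_from(f"<{n + 1}I", data, 0))
    raw = struct.unpack_from(f"<{n}I", data, 4 * (n + 1))
    return a_fc, [value & 0x3FFFFF for value in raw]  # keep the low 22 bits


def last_at_or_before(sorted_values, target):
    """Index of the last entry <= target, or -1 if there is none."""
    return bisect_right(sorted_values, target) - 1


def find_grpprl(word_document, a_fc, pns, char_offset):
    """Return the Prl bytes that apply to char_offset, or b"" if none."""
    i = last_at_or_before(a_fc, char_offset)
    if i < 0 or i >= len(pns):
        return b""
    page = word_document[pns[i] * FKP_SIZE:(pns[i] + 1) * FKP_SIZE]
    if len(page) < FKP_SIZE or 5 * page[-1] + 4 >= FKP_SIZE:
        return b""  # truncated page or implausible run count
    crun = page[-1]  # assumed: the run count is the last byte of the page
    rgfc = struct.unpack_from(f"<{crun + 1}I", page, 0)
    j = last_at_or_before(rgfc, char_offset)
    if j < 0 or j >= crun:
        return b""
    chpx_at = page[4 * (crun + 1) + j] * 2  # assumed: rgb holds offset / 2
    if chpx_at == 0:
        return b""  # no Chpx is associated with this run
    cb = page[chpx_at]  # 1-byte size of the Prl array that follows
    return page[chpx_at + 1:chpx_at + 1 + cb]


# Assumed operand sizes, keyed by the top three bits (spra) of the Sprm.
FIXED_OPERAND_SIZES = {0: 1, 1: 1, 2: 2, 3: 4, 4: 2, 5: 2, 7: 3}


def iter_prls(grpprl):
    """Yield (sprm, operand_bytes) for each Prl in the array."""
    pos = 0
    while pos + 2 <= len(grpprl):
        (sprm,) = struct.unpack_from("<H", grpprl, pos)
        pos += 2
        spra = sprm >> 13
        if spra == 6:  # variable length: first operand byte sizes the rest
            if pos >= len(grpprl):
                return
            size = 1 + grpprl[pos]
        else:
            size = FIXED_OPERAND_SIZES[spra]
        yield sprm, grpprl[pos:pos + size]
        pos += size


def object_storage_name(operand):
    """Build the object storage name used when the file is not encrypted."""
    return "_" + str(operand)
```

For each character with value 0x0014, you would call find_grpprl with the character's offset in the Word Document Stream, pass the result to iter_prls, and then act on the property you are looking for. The binary search relies on the .aFC and .rgfc arrays being ordered from lowest to highest.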

Additional Resources

For more information, see the following resources: