Parsing Formulas in a Binary Excel .xls File
Summary: This article describes how to find and interpret formulas from a binary Microsoft Excel file.
Applies to: Microsoft Excel 97 | Microsoft Excel 2000 | Microsoft Excel 2002 | Microsoft Office Excel 2003
Published: April 2012
Provided by: Microsoft Corporation
The Microsoft Excel binary file format (.xls) is used by Microsoft Office Microsoft Excel 2003, Microsoft Excel 2002, Microsoft Excel 2000, and Microsoft Excel 97. Use the procedures in this article to find formula cells in a binary Excel file, and extract the row and column information, current value, and formula elements for the cell.
The recommended way to perform most programming tasks in Microsoft Excel is to use the Excel Primary Interop Assemblies. These are a set of .NET classes that provide a complete object model for working with Microsoft Excel. This article series deals only with advanced scenarios, such as creating applications that must run without Microsoft Excel installed.
Extracting formulas directly from the binary file lets you quickly scan many files for a particular formula without opening the Microsoft Excel application. You can then mark all files that contain a given formula for review, or even programmatically update instances of a formula with minimal changes to the file.
If you make changes to a binary Excel file, they must be reflected in the count of bytes of the changed record. Those changes must in turn be reflected up to the parent record and its parent records, all the way up to the stream level. For information about how count of bytes is calculated, see [MS-XLS] section 2.1.4.
Finding and Parsing Formula Cells in Binary Excel Files
A Microsoft Excel workbook is organized into streams and substreams, with most of the content residing in sheet substreams within the Workbook stream. Each worksheet in the file is stored as a Worksheet substream. Within a sheet substream, data is contained in records, which may contain child records and other structures. Each record begins with a 2-byte unsigned integer to specify record type (rt), and another 2-byte unsigned integer to specify count of bytes (cb). Records may be read or skipped by reading these values, then either reading or skipping the number of bytes specified by cb, depending on the record type specified by rt.
Cell data is stored in various types of cell records, depending on the type of data. Cells containing formulas are stored as Formula cells. Within a Formula cell, a FormulaValue structure stores the current value, while the formula itself is stored as an array of Ptg structures.
The following procedures describe ways to extract formula data from a binary Excel file. To update the binary file, you must also preserve the bytes that are not used in formula calculations, and update the count of bytes values of any records you change.
To find formula cells in a binary Excel document
Follow the procedure titled Extracting Data from Excel Files from the article “Understanding the Excel MS-XLS Binary File Format”.
The cells where record type (rt) = 6 contain formulas.
To parse formula cells in binary Excel documents
Starting with a Formula cell record, record the count of bytes (cb) of the record.
Read the next two bytes as an unsigned integer to specify the row position of the current cell.
Read the next two bytes as an unsigned integer to specify the column position of the current cell.
Skip the next two bytes, which contain formatting information.
The next 8 bytes are a FormulaValue structure, which gives the current value of the cell.
If the last two bytes of the structure = 0xFFFF, you can interpret the value as described in [MS-XLS]: Excel Binary File Format (.xls) section 220.127.116.11. Otherwise, read the value as a 64-bit floating point number.
Skip the next six bytes.
The rest of the record is a CellParsedFormula structure. Read the first two bytes as an unsigned integer that specifies the length of the .rgce array that follows.
The .rgce array consists of Ptg structures, which specify elements of a formula. The Rgce array documentation lists the lengths of the different types of Ptg structures. Ptg types that represent a simple operator such as PtgAdd, which specifies an addition operation, are 1-byte in length. In contrast, Ptg types such as PtgArea, which specifies a reference to a rectangular range of cells, may include additional fields.
For each Ptg in the .rgce array:
Read the first one or two bytes to determine the type of Ptg, as described in the Ptg documentation.
If the type of Ptg specified is more than one byte long, read the rest of the Ptg according to its specification.
Some of the Ptg types in the .rgce array may require an RgbExtra structure to hold additional data. The RgbExtra structures reside in the CellParsedFormula.rgcb array, which if present, immediately follows the .rgce array. For a list of the Ptg types that require RgbExtra structures, see the RgbExtra documentation.
If RgbExtra structures are required for the current cell, read each element in the array according to its specification. All array elements are arranged in the same order as the .rgce array members they correspond to. For example, if .rgce and .rgce are PtgArray structures, .rgce is a PtgMemArea structure, and all of the other Ptgs are simple operators, .rgcb array will look like
.rgcb[PtgExtraArray, PtgExtraArray, PtgExtraMem].
Save the row, column, value, and formula information into memory into an internal representation of the cell.
The length of the record should equal the value of Formula.cb.
For more information, see the following resources: