Understanding the Outlook MS-PST Binary File Format
Summary: Learn about the MS-PST binary file format that is used in Microsoft Outlook, including the basic structures and key concepts for interacting with it programmatically.
Applies to: Microsoft Outlook 2010 | Microsoft Office Outlook 2007 | Microsoft Office Outlook 2003 | Microsoft Outlook 2002 | Microsoft Outlook 2000
Published: February 2011
Provided by: Microsoft Corporation
This article describes the structures and some procedures for working with MS-PST files. It is part of a series of articles that introduce the binary file formats used by Microsoft Office products. These articles are designed to be used in conjunction with the Microsoft Office File Format Documents on MSDN.
Overview of the MS-PST File Format
The MS-PST binary file format is the local message store for Microsoft Outlook. Microsoft Outlook 2010, Microsoft Office Outlook 2007, Microsoft Office Outlook 2003, Microsoft Outlook 2002, and Microsoft Outlook 2000 use this format. It is based on the Exchange data store, which is proprietary and unrelated to SQL or any other general-purpose database environment. A .pst file represents a message store that contains a hierarchy of folders, and these folders contain messages, which may themselves contain attachments. Information about folders, messages, and attachments is stored in properties.
The recommended way to perform most programming tasks in Microsoft Outlook is to use the Outlook Primary Interop Assemblies. These are a set of .NET classes that provide a complete object model for working with Microsoft Outlook. This article series deals only with advanced scenarios, such as where Microsoft Outlook is not installed.
Key Components of the MS-PST File Format
At the logical level, a .pst file has three layers: the Node Database (NDB) layer, the Lists, Tables, and Properties (LTP) layer, and the Messaging layer.
The NDB layer includes the header, file allocation information, and the nodes and blocks that hold the message data, plus nodes that help locate that data. It uses two BTrees to help locate data: the Node BTree (NBT) and the Block BTree (BBT).
The Lists, Tables, and Properties (LTP) layer is concerned primarily with properties, which it stores in two-dimensional tables.
The Messaging layer has the logic to combine the other two layers into folders, messages, attachments, and properties.
At the physical level, the file starts with a header, followed by an optional density list, and then a series of mapping structures interspersed at set intervals between blocks of data. The mapping structures are of fixed size, and repeat as often as needed to encapsulate areas of data as the file grows. To see the order in which these structures appear in a file, see the diagram in section 1.3.2 of the MS-PST documentation.
Most .pst files use Unicode text, but some older versions of Outlook create ANSI-based .pst files. Your code should detect whether a .pst file is Unicode or ANSI, because in ANSI files the offsets at which different parts of the file are located must be calculated differently.
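As a minimal sketch of that check, the following Python classifies a file from the version field near the start of the header. The field positions (a 4-byte magic value at offset 0 and a 16-bit version field, wVer, at offset 10) and the version values (14 or 15 for ANSI, 23 or higher for Unicode) are assumptions based on the HEADER structure in the MS-PST documentation and should be verified there. The name detect_pst_format is illustrative only.

```python
import struct

PST_MAGIC = b"!BDN"  # assumed magic value at offset 0 (verify against MS-PST)


def detect_pst_format(header: bytes) -> str:
    """Classify the start of a .pst header as 'ansi' or 'unicode'."""
    if len(header) < 12:
        raise ValueError("need at least the first 12 bytes of the file")
    if header[:4] != PST_MAGIC:
        raise ValueError("not a .pst file: unexpected magic value")
    # Assumed layout: wVer is a little-endian 16-bit integer at offset 10.
    (w_ver,) = struct.unpack_from("<H", header, 10)
    if w_ver in (14, 15):
        return "ansi"
    if w_ver >= 23:
        return "unicode"
    raise ValueError(f"unrecognized wVer value: {w_ver}")
```

In practice you would open the file in binary mode, read its first 12 bytes, and pass them to the function before computing any other offsets.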
The header structure resides at the very beginning of the file, and contains three main groups of information: metadata, the root record, and the initial free map (FMap) and free page map (FPMap).
An Allocation Map (AMap) page tracks the allocation status of the data section that immediately follows the AMap page in the file. You can view the complete AMap page as an array of bits, where each bit corresponds to the allocation state of 64 bytes of data. An AMap page appears approximately every 250 KB in the .pst file.
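The bit-to-bytes mapping can be sketched as simple arithmetic. The example below assumes a Unicode file in which the first AMap page sits at offset 0x4400, each AMap page carries 496 bytes of bitmap, the mapped region is measured from the start of the AMap page itself, and a set bit means "allocated". All of these are assumptions to confirm against the MS-PST documentation. The function names are hypothetical. The bit index is a logical position only, because the order of bits within each byte is not covered here.

```python
from typing import Tuple

AMAP_FIRST_OFFSET = 0x4400   # assumed offset of the first AMap page
AMAP_BITMAP_BYTES = 496      # assumed bitmap bytes per AMap page (Unicode)
BYTES_PER_BIT = 64           # from the text: one bit per 64 bytes of data
AMAP_SPAN = AMAP_BITMAP_BYTES * 8 * BYTES_PER_BIT  # bytes covered per AMap page


def locate_amap_bit(file_offset: int) -> Tuple[int, int]:
    """Return (AMap page offset, logical bit index) covering file_offset."""
    if file_offset < AMAP_FIRST_OFFSET:
        raise ValueError("offset lies before the first AMap page")
    page_index, within = divmod(file_offset - AMAP_FIRST_OFFSET, AMAP_SPAN)
    return AMAP_FIRST_OFFSET + page_index * AMAP_SPAN, within // BYTES_PER_BIT


def allocated_bytes(bitmap: bytes) -> int:
    """Count bytes marked allocated, assuming a set bit means 'allocated'."""
    return sum(bin(b).count("1") for b in bitmap) * BYTES_PER_BIT
```

Under these assumptions one AMap page covers 253,952 bytes, which is consistent with the "approximately every 250 KB" interval.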
The Density List (DList) is a list of references to AMap pages, sorted in order of ascending density. It optimizes the space allocation so that data is written to the sections with the most free space first. The DList is always located at offset 0x4200 in the file.
Remember that some older versions of Outlook do not use a DList. Also, the DList can sometimes be overwritten by transient processes, and may return an invalid cyclic redundancy check (CRC).
Other map pages
In .pst files that do not contain a valid DList, you can navigate by using the following AMap-like legacy mapping structures, which are maintained for backward compatibility and to maintain fixed file positions.
Page Map (PMap) page
A Page Map (PMap) page tracks the allocation of the pages that store the BBT and NBT, which contain most of the metadata in the .pst file, so that available pages can be found quickly. The PMap page is 512 bytes, and maps 512-byte pages. A PMap page appears approximately every 2 MB, or one PMap page for every 8 AMap pages.
Free Map (FMap) page
A Free Map (FMap) page provides a mechanism to locate contiguous free space quickly. Each byte in the FMap page corresponds to one AMap page. The value of each byte indicates the length of the longest run of free bits found in the corresponding AMap page. Each FMap page (496 bytes) spans around 125 MB of data.
Free Page Map (FPMap) page
Each bit in the Free Page Map (FPMap) page corresponds to a PMap page, and the value of the bit indicates whether there are any free pages within that PMap page. At 496 bytes, an FPMap page spans about 8 GB of space.
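The spans quoted for the PMap, FMap, and FPMap pages follow from the 64-byte AMap granularity. The short calculation below reproduces them, assuming that every map page carries 496 bytes of map data (the text gives this size for FMap and FPMap pages; applying it to AMap and PMap pages is an assumption to confirm against the MS-PST documentation).

```python
MAP_BYTES = 496  # map data per page (assumed to apply to all four map types)

AMAP_SPAN = MAP_BYTES * 8 * 64           # one bit per 64 bytes
PMAP_SPAN = 8 * AMAP_SPAN                # one PMap page per 8 AMap pages
FMAP_SPAN = MAP_BYTES * AMAP_SPAN        # one FMap byte per AMap page
FPMAP_SPAN = MAP_BYTES * 8 * PMAP_SPAN   # one FPMap bit per PMap page

# Cross-check: a PMap bitmap with one bit per 512-byte page covers the same span.
assert PMAP_SPAN == MAP_BYTES * 8 * 512

SPANS = {
    "AMap": AMAP_SPAN,    # 253,952 bytes (about 250 KB)
    "PMap": PMAP_SPAN,    # 2,031,616 bytes (about 2 MB)
    "FMap": FMAP_SPAN,    # 125,960,192 bytes (about 125 MB)
    "FPMap": FPMAP_SPAN,  # 8,061,452,288 bytes (about 8 GB)
}

for name, span in SPANS.items():
    print(f"{name}: {span:,} bytes")
```

The cross-check in the middle is the useful part: eight AMap spans and one PMap bitmap of 512-byte pages describe the same number of bytes, which is why the two structures line up at fixed intervals.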
There are too many important structures in a .pst file to define them all in the scope of this article. The following are some of the core structures on which they are built.
Blocks are the fundamental units of data storage at the NDB layer. Blocks are assigned in sizes that are multiples of 64 bytes, and aligned on 64-byte boundaries, up to a maximum of 8 KB. Each block stores its metadata in a block trailer at the end of the block. Data blocks store raw data. Subnode blocks represent subnodes contained in a node.
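The sizing rule above can be sketched as a rounding calculation. The example assumes that the block trailer counts toward the 8 KB maximum and that the trailer is 16 bytes in Unicode files (12 bytes in ANSI files). Both points are assumptions to check against the MS-PST documentation, and block_size_on_disk is a hypothetical helper name.

```python
BLOCK_ALIGN = 64        # blocks are sized in multiples of 64 bytes
MAX_BLOCK_SIZE = 8192   # 8 KB upper limit from the text


def block_size_on_disk(data_len: int, trailer_len: int = 16) -> int:
    """Round data plus trailer up to the next 64-byte multiple.

    trailer_len defaults to an assumed 16-byte Unicode trailer;
    pass 12 for the assumed ANSI trailer size.
    """
    if data_len <= 0:
        raise ValueError("data_len must be positive")
    total = data_len + trailer_len
    if total > MAX_BLOCK_SIZE:
        raise ValueError("data does not fit in a single block")
    return (total + BLOCK_ALIGN - 1) // BLOCK_ALIGN * BLOCK_ALIGN
```

For example, under these assumptions 48 bytes of data plus a 16-byte trailer fill exactly one 64-byte unit, while 49 bytes of data require 128 bytes on disk.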
A node consists of a data block and a Subnode BTree. It is used to divide .pst data into logical streams.
The Node BTree (NBT) and Block BTree (BBT) contain references to all of the accessible nodes and blocks in the file. Their root pages are located through the root record in the header.
Property Context (PC) records
Message properties are stored at the LTP layer as Property Context (PC) records. A PC record is built on a BTree-on-Heap (BTH), which resides in the node’s data stream.
Extracting Data from an Outlook File
Files in the .pst format are large and complex. Rather than trying to construct a custom .pst reader, you can use the PST File Format SDK. The PST File Format SDK includes sufficient tools and documentation to perform message extraction and other basic tasks. You can also browse the internal structures of a .pst file by using the PST Data Structure View Tool.
Understanding and working with binary file formats in general, and the MS-PST file format in particular, can be a challenge. Fortunately, the PST File Format SDK exists to make this easier. By combining the information in this article with the tools and documentation provided with the SDK, and using the Open Specifications documents as a reference, you will have several tools to help you accomplish your objectives.
For more information, see the following resources: