OfficeTalk: Essentials of the Open Packaging Conventions


Summary: Understanding the Open Packaging Conventions is key to working with Open XML. Review the structure and learn about the underlying architecture. Compare the different approaches to working with the Open Packaging Conventions and the Open XML Formats.

Eric White, Microsoft Corporation

September 2009

Applies to:   2007 Microsoft Office System, Microsoft Office Excel 2007, Microsoft Office PowerPoint 2007, Microsoft Office Word 2007

Contents

  • Overview of Open Packaging Conventions

  • Learning about ZIP Packages and Document Parts

  • Using Relationships to Connect Document Parts

  • Understanding Relationship Types

  • Retrieving Document Parts by Using Streams

  • Working with the Unique URI for a Document Part

  • What are Content Types?

  • Working with Relationship IDs

  • Diagramming a ZIP Package and Document Parts

  • Looking at ZIP Packages and Document Parts with Visual Studio Tools for the Office System Power Tools

  • Navigating from Document Part to Part by Using the System.IO.Packaging Namespace

  • Navigating from Document Part to Part by Using the Open XML SDK 2.0 for Microsoft Office

  • Conclusion

  • Additional Resources

Overview of Open Packaging Conventions

This column presents the absolute minimum you need to know about the Open Packaging Conventions if you are building applications by using the Open XML Format SDK 1.0, the Open XML SDK 2.0 for Microsoft Office, or the System.IO.Packaging namespace. If you are new to the Open XML Formats and the Open Packaging Conventions, this article is for you. If you are already familiar with the Open Packaging Conventions, make sure that you understand the semantics of relationships, especially the concept that two or more source document parts can have relationships that refer to the same target document part. Understanding this increases your facility with Open XML and the Open Packaging Conventions.

Open XML documents are really ZIP files that contain XML, binary, and other types of files. You can take any Microsoft Office Word 2007, Microsoft Office Excel 2007, or Microsoft Office PowerPoint 2007 file that is saved in the new document format, append .zip to the file name, and then use any of a variety of tools, including WinZip or Windows Explorer, to look inside it. This brings many benefits: files are smaller and easy to access, and all you need is a library that knows how to open a ZIP file, along with some sort of XML application programming interface (API).
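
To look inside a document from code rather than from Windows Explorer, the following is a minimal sketch that lists the entries of the ZIP file. It assumes the general-purpose ZipArchive class from System.IO.Compression, which requires the .NET Framework 4.5 or later and so postdates the tools that this article covers. The class name ZipListing, the helper GetEntryNames, and the file name Test.docx are placeholders.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

class ZipListing
{
    // Returns the full name of every entry in the ZIP archive read from the stream.
    // The stream is left open so that the caller controls its lifetime.
    public static List<string> GetEntryNames(Stream zipStream)
    {
        using (ZipArchive archive = new ZipArchive(zipStream, ZipArchiveMode.Read, true))
            return archive.Entries.Select(entry => entry.FullName).ToList();
    }

    static void Main(string[] args)
    {
        // "Test.docx" is a placeholder; any Open XML document should work.
        string fileName = args.Length > 0 ? args[0] : "Test.docx";

        using (FileStream stream = File.OpenRead(fileName))
            foreach (string name in GetEntryNames(stream))
                Console.WriteLine(name);
    }
}
```

For a typical word-processing document, the listing includes entries such as [Content_Types].xml, _rels/.rels, and word/document.xml.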

The standard that defines how Open XML documents are stored in ZIP files is the Open Packaging Conventions. For more information about the Open Packaging Conventions, see ECMA 376-2 and ISO/IEC 29500-2. The specification contains details that are certainly necessary for those who are implementing an API to access packages, such as the Open XML Format SDK 1.0, the Open XML SDK 2.0 for Microsoft Office, or the classes in the System.IO.Packaging namespace. However, for the rest of us who simply need to use those APIs, there are just a few vital pieces of information to know about them.

Note

If you are using a language that doesn't contain a library that knows about packages and document parts, and you only have a library that can read or write ZIP files and read or write XML documents, you need to know more about the internals of the Open Packaging Conventions. It's not hard, but there is more to know. However, if you are using the .NET Framework and the Open XML Format SDK 1.0, the Open XML SDK 2.0 for Microsoft Office, or the classes in the System.IO.Packaging namespace, you can disregard some of the mechanics of how Open Packaging Conventions packages are put together.

Learning about ZIP Packages and Document Parts

To start with the basics: A ZIP package contains document parts. The Open XML Package specification defines documents as a set of XML files (document parts) and defines relationships between the document parts. The ZIP package is the ZIP file and the document parts are the files contained in the ZIP file. Like files on your disk, document parts can contain any type of content: text files, XML files, binary files, image files, or audio files—whatever. I oversimplified here a bit, because packages are not tied to the ZIP file format, and the parts in a package are not necessarily files, but are streams, per the package abstraction. But for all practical purposes, you can think of ZIP packages as ZIP files and document parts as files contained within the ZIP file.

Using Relationships to Connect Document Parts

While the document parts make up the contents of the file, the relationships describe how the document parts work together. These relationships are themselves stored in special relationship parts (special XML files stored within the ZIP). The System.IO.Packaging namespace, the Open XML Format SDK 1.0, and the Open XML SDK 2.0 for Microsoft Office address these special relationship files, each in its own way. These relationships are important; I'll explain why in a bit.

The ZIP package itself can contain relationships to a set of initial document parts. In the case of word-processing documents, the package contains relationships to three document parts—the main document part, the core properties part, and the extended properties part.

Note

According to the Open XML Package specification, a word-processing document package only needs to contain a relationship to the main document part. The core properties part and the extended properties part are optional, although Word 2007 includes them as part of its default document. However, it's valid to create a word-processing document (.docx) where the package contains only a relationship to the main document part, and a conforming application happily opens the document.

Document parts can also have relationships that refer to other parts. For example, the main document part of a word-processing document can have a relationship to a number of other parts, such as the settings part, the styles part, a theme part, a font table part, and a Web settings part.

Note

Now for a terminology check. In this article, the source part is the "from" part in the relationship, and the target part is the "to" part in the relationship. The Open Packaging Conventions standard, the Open XML Format SDK 1.0, and the Open XML SDK 2.0 for Microsoft Office use these terms as well.

Two (or more) source parts can have a relationship to the same target part. In other words, the network of parts forms a directed graph, not a hierarchy. A common example is where several parts refer to the same image, such as a corporate logo. The corporate logo is stored in the package only once, and multiple parts can have a relationship that refers to the same image part. It follows that any document part can be the target of relationships from multiple source parts, and can itself have relationships to multiple target parts.
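
The shared-logo case can be sketched with the System.IO.Packaging namespace. The sample below is only an illustration under stated assumptions: it builds a tiny in-memory package instead of opening a real document, the parts are left empty, and the names SharedTargetDemo, BuildPackage, and GetImageTarget are hypothetical. On the .NET Framework, the namespace lives in WindowsBase.dll.

```csharp
using System;
using System.IO;
using System.IO.Packaging;
using System.Linq;

class SharedTargetDemo
{
    const string ImageRelType =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image";
    const string MainContentType =
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml";
    const string HeaderContentType =
        "application/vnd.openxmlformats-officedocument.wordprocessingml.header+xml";

    // Resolves the target of the first image relationship of a source part.
    public static Uri GetImageTarget(PackagePart source)
    {
        PackageRelationship rel = source.GetRelationshipsByType(ImageRelType).First();
        return PackUriHelper.ResolvePartUri(source.Uri, rel.TargetUri);
    }

    // Builds a small package in which two source parts refer to one image part.
    public static Package BuildPackage(Stream stream)
    {
        Package package = Package.Open(stream, FileMode.Create, FileAccess.ReadWrite);
        PackagePart document = package.CreatePart(
            new Uri("/word/document.xml", UriKind.Relative), MainContentType);
        PackagePart header = package.CreatePart(
            new Uri("/word/header1.xml", UriKind.Relative), HeaderContentType);
        package.CreatePart(
            new Uri("/word/media/image1.png", UriKind.Relative), "image/png");

        // Both source parts get their own relationship to the single image part.
        Uri relativeTarget = new Uri("media/image1.png", UriKind.Relative);
        document.CreateRelationship(relativeTarget, TargetMode.Internal, ImageRelType);
        header.CreateRelationship(relativeTarget, TargetMode.Internal, ImageRelType);
        return package;
    }

    static void Main()
    {
        using (MemoryStream stream = new MemoryStream())
        using (Package package = BuildPackage(stream))
        {
            Uri fromDocument = GetImageTarget(
                package.GetPart(new Uri("/word/document.xml", UriKind.Relative)));
            Uri fromHeader = GetImageTarget(
                package.GetPart(new Uri("/word/header1.xml", UriKind.Relative)));

            Console.WriteLine("Document -> {0}", fromDocument);
            Console.WriteLine("Header   -> {0}", fromHeader);
            Console.WriteLine("Same target part: {0}",
                PackUriHelper.ComparePartUri(fromDocument, fromHeader) == 0);
        }
    }
}
```

Both relationships resolve to /word/media/image1.png, so the image is stored once even though two parts use it.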

Most relationships in a ZIP package are references from one document part (or the package) to another part within the package, but there is a different kind of relationship—an external relationship. External relationships refer to a file or resource outside of the package. All of the information about the external relationship is stored in the relationship itself.

Here is what one of those relationship XML parts looks like. The last relationship is an example of an external relationship:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships
  xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship
    Id="rId3"
    Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings"
    Target="webSettings.xml"/>
  <Relationship
    Id="rId2"
    Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings"
    Target="settings.xml"/>
  <Relationship
    Id="rId1"
    Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles"
    Target="styles.xml"/>
  <Relationship
    Id="rId6"
    Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme"
    Target="theme/theme1.xml"/>
  <Relationship
    Id="rId5"
    Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable"
    Target="fontTable.xml"/>
  <!-- External Relationship -->
  <Relationship
    Id="rId4"
    Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
    Target="http://blogs.msdn.com/EricWhite"
    TargetMode="External"/>
</Relationships>
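
The difference between internal and external relationships also shows up when you read them back through the System.IO.Packaging namespace. The following sketch is an illustration only: it recreates two of the relationships above on an empty in-memory part (with a generic application/xml content type as a placeholder), and the names RelationshipListing and Describe are hypothetical. The point is that only internal targets are resolved to part URIs, while an external target is used as it stands.

```csharp
using System;
using System.IO;
using System.IO.Packaging;

class RelationshipListing
{
    const string OfficeRels =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/";

    // Formats one relationship. Internal targets are resolved against the source;
    // external targets are already complete and must not be resolved as part URIs.
    public static string Describe(PackageRelationship rel)
    {
        Uri target = rel.TargetMode == TargetMode.External
            ? rel.TargetUri
            : PackUriHelper.ResolvePartUri(rel.SourceUri, rel.TargetUri);
        return string.Format("{0} [{1}] -> {2}", rel.Id, rel.TargetMode, target);
    }

    // Creates an empty stand-in for the main document part with two relationships.
    public static PackagePart CreateSamplePart(Package package)
    {
        PackagePart document = package.CreatePart(
            new Uri("/word/document.xml", UriKind.Relative), "application/xml");
        document.CreateRelationship(new Uri("styles.xml", UriKind.Relative),
            TargetMode.Internal, OfficeRels + "styles", "rId1");
        document.CreateRelationship(new Uri("http://blogs.msdn.com/EricWhite"),
            TargetMode.External, OfficeRels + "hyperlink", "rId4");
        return document;
    }

    static void Main()
    {
        using (MemoryStream stream = new MemoryStream())
        using (Package package =
            Package.Open(stream, FileMode.Create, FileAccess.ReadWrite))
        {
            PackagePart document = CreateSamplePart(package);
            foreach (PackageRelationship rel in document.GetRelationships())
                Console.WriteLine(Describe(rel));
        }
    }
}
```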

The Open XML Format SDK 1.0 and Open XML SDK 2.0 for Microsoft Office put a strongly-typed object model abstraction over the document parts. You navigate from document part to part by accessing properties in the ZIP package object or part object. Although much of the mechanics of the Open Packaging Conventions are hidden by the Open XML Format SDK 1.0 and Open XML SDK 2.0 for Microsoft Office, in my opinion, it is still important to understand how ZIP packages and document parts are put together as well as how relationships work. There may be a time where you want to examine the contents of a package directly. Understanding how packages are put together helps.

Understanding Relationship Types

Each relationship has a relationship type. A relationship type is defined in the same way that namespaces are defined for XML. By using relationship types patterned after the Internet domain-namespace, independent parties can safely create non-conflicting relationship types. For example, the relationship type for the relationship from the package to the main document part is http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument. The relationship type from the main document part to a header is http://schemas.openxmlformats.org/officeDocument/2006/relationships/header.

When using the Open XML Format SDK 1.0 and Open XML SDK 2.0 for Microsoft Office, you don't have to deal directly with relationship types. When you move from document part to part using the strongly-typed part object model, the Open XML Format SDK 1.0 and Open XML SDK 2.0 for Microsoft Office find the related part by using the appropriate relationship type. All we need to do is to access public methods and properties to retrieve the related part.

Retrieving Document Parts by Using Streams

Document parts are always stored and retrieved by using streams. When updating an existing part or adding a new one, you stream the content to the part.

Note

According to the Open XML Package specification, ZIP files are only one possible implementation of the Open Packaging Conventions. Because document parts are always retrieved by using streams, someone could create a conformant Open Packaging Conventions API where the document store is not a ZIP file; it could be served over a network connection from a server. That server could store the parts in a SQL database or some data store other than a ZIP file. I have not yet seen an Open Packaging Conventions API that serves parts from a data store other than ZIP files, but the specification allows for this. This probably isn't something that you need to consider at this point, but I find this interesting.

When accessing XML parts using the Open XML SDK 2.0 for Microsoft Office, you typically do not need to deal with these streams. Instead, you just use the strongly-typed XML object model. The Open XML SDK 2.0 for Microsoft Office automatically populates the XML objects from streams. However, when accessing binary parts such as images, you need to interact with streams to write the binary parts to the package or to retrieve the binary part from the package.
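
As a rough illustration of streaming binary content, the following sketch writes a few bytes to an image part and reads them back. It uses the System.IO.Packaging namespace rather than the Open XML SDK so that it has no SDK dependency, and it works against an in-memory package. The names BinaryPartStreams, WriteBytes, and ReadBytes are hypothetical, and the bytes are only the PNG file signature, not a complete image.

```csharp
using System;
using System.IO;
using System.IO.Packaging;

class BinaryPartStreams
{
    // Replaces the content of a part by streaming the bytes into it.
    public static void WriteBytes(PackagePart part, byte[] data)
    {
        using (Stream partStream = part.GetStream(FileMode.Create))
            partStream.Write(data, 0, data.Length);
    }

    // Reads the whole content of a part back into a byte array.
    public static byte[] ReadBytes(PackagePart part)
    {
        using (Stream partStream = part.GetStream(FileMode.Open, FileAccess.Read))
        using (MemoryStream buffer = new MemoryStream())
        {
            partStream.CopyTo(buffer);
            return buffer.ToArray();
        }
    }

    static void Main()
    {
        using (MemoryStream store = new MemoryStream())
        using (Package package =
            Package.Open(store, FileMode.Create, FileAccess.ReadWrite))
        {
            PackagePart image = package.CreatePart(
                new Uri("/word/media/image1.png", UriKind.Relative), "image/png");

            // Placeholder content: the 8-byte PNG signature only.
            byte[] signature = { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };
            WriteBytes(image, signature);

            Console.WriteLine("Read back {0} bytes.", ReadBytes(image).Length);
        }
    }
}
```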

Working with the Unique URI for a Document Part

Every part has a unique URI in the ZIP file: the full path name of the file in the ZIP file, always with a leading / character. The uniqueness only makes sense, because a ZIP archive can contain only one file with a given path and file name.

As mentioned previously, the Open XML Format SDK 1.0 and the Open XML SDK 2.0 for Microsoft Office put a strongly-typed object model abstraction over the document parts. You navigate through the part graph by accessing the properties in a package object or part object. The Open XML Format SDK 1.0 and the Open XML SDK 2.0 for Microsoft Office use the underlying URI to retrieve the document part, but this happens under the covers. But the same basic principles apply—the package is related to a key set of starting parts, those document parts are in turn related to other parts. As noted before, multiple parts can have relationships that refer to the same part.

You can also relate document parts to other parts by using a relative path name. For example, you can relate the /word/document.xml part to theme/theme1.xml, and the full path of the theme part is then /word/theme/theme1.xml. In this example, the lack of a leading / makes the target a relative URI, which is resolved against the folder that contains the source part. It is also valid for the relative URI of the related part to be something like ../theme1.xml, in which case the full path of the related part is /word/../theme1.xml. When converted to its canonical form, the full path is /theme1.xml. Alternatively, you can relate document parts to other parts by using an absolute path.
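
You can check how a target resolves with the PackUriHelper.ResolvePartUri method from the System.IO.Packaging namespace. The following is a small sketch; the class name PartUriResolution, the helper Resolve, and the customXml and docProps targets are illustrative choices rather than part of the examples above.

```csharp
using System;
using System.IO.Packaging;

class PartUriResolution
{
    // Resolves a relationship target against the URI of its source part
    // and returns the canonical full path of the target part.
    public static string Resolve(string sourcePart, string target)
    {
        Uri resolved = PackUriHelper.ResolvePartUri(
            new Uri(sourcePart, UriKind.Relative),
            new Uri(target, UriKind.Relative));
        return resolved.ToString();
    }

    static void Main()
    {
        // Relative target: resolved against the folder of the source part.
        // Prints /word/theme/theme1.xml
        Console.WriteLine(Resolve("/word/document.xml", "theme/theme1.xml"));

        // Relative target that climbs out of the /word folder.
        // Prints /customXml/item1.xml
        Console.WriteLine(Resolve("/word/document.xml", "../customXml/item1.xml"));

        // Absolute target: the location of the source part plays no role.
        // Prints /docProps/app.xml
        Console.WriteLine(Resolve("/word/document.xml", "/docProps/app.xml"));
    }
}
```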

But the most important rule about the URI of document parts is that you should never rely on the location of a part. You never know what the URI will be. If it is referred to properly by another document part with a correct relative path, the actual location represented by the URI has no significance whatsoever.

To move from document part to part through relationships by using the System.IO.Packaging namespace, you first get a list of relationships from the source part (or package). You filter the list (by relationship type or relationship ID, more on this later) and find the relationship you want. You get the relative URI from the relationship and construct the full path URI to the target part by using the full path of the source part and the relative URI from the relationship. You can then open the target part. This is the only time that you actually use the URI of a part.

What are Content Types?

Document parts have a content type associated with them. Like a relationship type, a content type is a string defined so that it is naturally unique. It is similar to a MIME media type, for example application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml. This example is the content type of the main document part in a word-processing document. By far the most common content types indicate that the part is XML that must validate against a particular schema. Other common content types identify the part as an image of a certain format.

There is a strong tie between relationship types and content types. In many cases, you do not need to deal directly with content types. When using the System.IO.Packaging namespace, you retrieve related target parts by using relationship types (or relationship IDs, see below). After you retrieve the part by using a relationship type or relationship ID, you most often implicitly know its content type. In the case of XML parts, your code can assume that the XML in the part conforms to the schema defined for the content type. You can produce and query the XML appropriately. When using the Open XML SDK 2.0 for Microsoft Office, you access the XML in parts by using strongly-typed objects. The object hierarchy that represents the XML in the part reflects the schema defined for the content type of the part.

If your code processes images in a package, you must work with content types. If you have a document that contains two images, one a JPEG, the other a PNG, the relationship type for both images is as follows: http://schemas.openxmlformats.org/officeDocument/2006/relationships/image, but the content types are image/jpeg and image/png respectively.
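
The following sketch shows one way this might look with the System.IO.Packaging namespace: the relationship type finds the images, and the content type of each target part tells them apart. It is an illustration under stated assumptions: the package is built in memory with empty parts and a placeholder content type for the main part, and the names ImageContentTypes, ExtensionFor, and GetImageContentTypes are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Packaging;

class ImageContentTypes
{
    const string ImageRelType =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image";

    // Maps an image content type to a file extension (illustrative, not exhaustive).
    public static string ExtensionFor(string contentType)
    {
        switch (contentType)
        {
            case "image/jpeg": return ".jpg";
            case "image/png": return ".png";
            default: return ".bin";
        }
    }

    // Returns the content type of each embedded image, keyed by relationship ID.
    public static Dictionary<string, string> GetImageContentTypes(PackagePart source)
    {
        var result = new Dictionary<string, string>();
        foreach (PackageRelationship rel in source.GetRelationshipsByType(ImageRelType))
        {
            // Linked (external) images have no part inside the package.
            if (rel.TargetMode != TargetMode.Internal)
                continue;
            Uri imageUri = PackUriHelper.ResolvePartUri(source.Uri, rel.TargetUri);
            result[rel.Id] = source.Package.GetPart(imageUri).ContentType;
        }
        return result;
    }

    // Creates an empty stand-in document part that is related to two images.
    public static PackagePart CreateSamplePart(Package package)
    {
        PackagePart document = package.CreatePart(
            new Uri("/word/document.xml", UriKind.Relative), "application/xml");
        package.CreatePart(
            new Uri("/word/media/image1.jpeg", UriKind.Relative), "image/jpeg");
        package.CreatePart(
            new Uri("/word/media/image2.png", UriKind.Relative), "image/png");
        document.CreateRelationship(new Uri("media/image1.jpeg", UriKind.Relative),
            TargetMode.Internal, ImageRelType, "rId1");
        document.CreateRelationship(new Uri("media/image2.png", UriKind.Relative),
            TargetMode.Internal, ImageRelType, "rId2");
        return document;
    }

    static void Main()
    {
        using (MemoryStream stream = new MemoryStream())
        using (Package package =
            Package.Open(stream, FileMode.Create, FileAccess.ReadWrite))
        {
            PackagePart document = CreateSamplePart(package);
            foreach (KeyValuePair<string, string> pair in GetImageContentTypes(document))
                Console.WriteLine("{0}: {1} (save as {2})",
                    pair.Key, pair.Value, ExtensionFor(pair.Value));
        }
    }
}
```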

Working with Relationship IDs

Each relationship also has a relationship ID. These relationship IDs need to be unique only within the list of relationships of the source part. So for example, you could have a package that is related to three parts, with the relationship IDs of rId1, rId2, and rId3. The main document part can be related to several parts with the relationship IDs of rId1, rId2, rId3, rId4, and rId5. There's no conflict between these relationship IDs because they are from different source parts.
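
To see that the same ID can appear under two different sources, consider the following minimal sketch. It uses the System.IO.Packaging namespace on an in-memory package; the class name RelationshipIdScope, the helper Build, and the generic application/xml content type are placeholders.

```csharp
using System;
using System.IO;
using System.IO.Packaging;

class RelationshipIdScope
{
    const string OfficeRels =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/";

    // Creates a package in which the package and a part both own an rId1.
    public static Package Build(Stream stream)
    {
        Package package = Package.Open(stream, FileMode.Create, FileAccess.ReadWrite);
        Uri documentUri = new Uri("/word/document.xml", UriKind.Relative);
        PackagePart document = package.CreatePart(documentUri, "application/xml");

        // Source: the package itself.
        package.CreateRelationship(documentUri, TargetMode.Internal,
            OfficeRels + "officeDocument", "rId1");

        // Source: the main document part. Reusing rId1 is fine here
        // because the ID only has to be unique per source.
        document.CreateRelationship(new Uri("styles.xml", UriKind.Relative),
            TargetMode.Internal, OfficeRels + "styles", "rId1");
        return package;
    }

    static void Main()
    {
        using (MemoryStream stream = new MemoryStream())
        using (Package package = Build(stream))
        {
            PackagePart document = package.GetPart(
                new Uri("/word/document.xml", UriKind.Relative));
            Console.WriteLine("Package  rId1 -> {0}",
                package.GetRelationship("rId1").TargetUri);
            Console.WriteLine("Document rId1 -> {0}",
                document.GetRelationship("rId1").TargetUri);
        }
    }
}
```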

Relationship IDs are important. In many cases, markup in one part identifies a related part through its relationship ID. For example, if you embed an image in a word-processing document, the markup in the main document part might look like the following example.

Note

Some XML omitted to simplify the example.

<w:p>
  <w:r>
    <w:drawing>
      <wp:inline>
        ...
        <a:graphic>
          <a:graphicData>
            <pic:pic>
              ...
              <pic:blipFill>
                <a:blip r:embed="rId4"/>  <!-- Relationship Id -->
                <a:srcRect/>
                <a:stretch>
                  <a:fillRect/>
                </a:stretch>
              </pic:blipFill>
              ...
            </pic:pic>
          </a:graphicData>
        </a:graphic>
      </wp:inline>
    </w:drawing>
  </w:r>
</w:p>

You can see that the markup refers to the image part with the relationship ID of rId4. To retrieve the appropriate image part, filter the related target parts by the relationship ID. The markup that refers to another part by relationship ID does not necessarily use the r:embed attribute. There are other attribute names that you can also use to refer to another part, such as r:id. The key point here is that you retrieve the part using its ID.

We must concern ourselves with relationship IDs whether we use the System.IO.Packaging namespace, the Open XML Format SDK 1.0, or the Open XML SDK 2.0 for Microsoft Office. When retrieving an image, you identify the correct part by using its relationship ID.
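
Put together, the lookup can be sketched as follows with the System.IO.Packaging namespace and LINQ to XML: read the r:embed attribute of the a:blip element, find the relationship with that ID on the same source part, and resolve its target. The names EmbeddedImageLookup and FindFirstEmbeddedImage are hypothetical, Test.docx is a placeholder for a document that contains a picture, and the sketch looks only at the first a:blip element in the main document part.

```csharp
using System;
using System.IO;
using System.IO.Packaging;
using System.Linq;
using System.Xml.Linq;

class EmbeddedImageLookup
{
    static readonly XNamespace A =
        "http://schemas.openxmlformats.org/drawingml/2006/main";
    static readonly XNamespace R =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships";
    const string DocumentRelType =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

    // Returns the image part that the first a:blip element in the source part
    // refers to through r:embed, or null if there is no such part.
    public static PackagePart FindFirstEmbeddedImage(PackagePart source)
    {
        XDocument xml;
        using (Stream partStream = source.GetStream(FileMode.Open, FileAccess.Read))
            xml = XDocument.Load(partStream);

        // The markup carries only the relationship ID, for example rId4.
        string id = (string)xml.Descendants(A + "blip")
            .Attributes(R + "embed")
            .FirstOrDefault();
        if (id == null || !source.RelationshipExists(id))
            return null;

        PackageRelationship rel = source.GetRelationship(id);
        if (rel.TargetMode != TargetMode.Internal)
            return null;

        Uri imageUri = PackUriHelper.ResolvePartUri(source.Uri, rel.TargetUri);
        return source.Package.GetPart(imageUri);
    }

    static void Main(string[] args)
    {
        string fileName = args.Length > 0 ? args[0] : "Test.docx";

        using (Package package = Package.Open(fileName, FileMode.Open, FileAccess.Read))
        {
            PackageRelationship docRel = package
                .GetRelationshipsByType(DocumentRelType)
                .FirstOrDefault();
            if (docRel == null)
                return;

            PackagePart mainPart = package.GetPart(PackUriHelper.ResolvePartUri(
                new Uri("/", UriKind.Relative), docRel.TargetUri));
            PackagePart image = FindFirstEmbeddedImage(mainPart);

            Console.WriteLine(image == null
                ? "No embedded image found."
                : string.Format("{0} ({1})", image.Uri, image.ContentType));
        }
    }
}
```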

Diagramming a ZIP Package and Document Parts

The following diagram shows a ZIP package related to some document parts, which are, in turn, related to other parts:

Figure 1. ZIP package and document parts

Content Types

  • *1 application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml

  • *2 application/vnd.openxmlformats-officedocument.wordprocessingml.header+xml

  • *3 image/png

Relationship Types

  • *a http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument

  • *b http://schemas.openxmlformats.org/officeDocument/2006/relationships/image

  • *c http://schemas.openxmlformats.org/officeDocument/2006/relationships/header

  • *d http://schemas.openxmlformats.org/officeDocument/2006/relationships/image

In Figure 1, I made the relationship IDs unique. However, rId3 and rId4 could each just as easily have been rId1, and there would be no conflict, because relationship IDs need only be unique within the list of relationships from a specific source part.

Looking at ZIP Packages and Document Parts with Visual Studio Tools for the Office System Power Tools

One of the most convenient ways to explore ZIP packages and document parts is by using the Microsoft Visual Studio Tools for the Office System Power Tools v1.0.0.0 as shown in the following figures.

Figure 2. Three relationships from the ZIP package to document parts

Figure 3. Main document part

If we expand the main document part node (document.xml), we see the relationships from the main document part to a variety of other parts:

Figure 4. Relationships from the main part to other parts

Clicking a document part updates the Properties window to show various properties of the part, including its full path and its content type. Clicking a relationship part updates the Properties window to show the properties of the relationship, including its relationship type.

Navigating from Document Part to Part by Using the System.IO.Packaging Namespace

The following sample uses the System.IO.Packaging namespace to find the main document part from the package, and then the styles part from the main document part. Comments identify the code that finds each part.

using System;
using System.Linq;
using System.IO;
using System.IO.Packaging;
using System.Xml;
using System.Xml.Linq;

class Program
{
    static void Main(string[] args)
    {
        const string fileName = "Test.docx";

        const string documentRelationshipType =
          "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";
        const string stylesRelationshipType =
          "http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles";

        XDocument xDocMainDocument = null;
        XDocument xDocStyleDocument = null;

        using (Package wdPackage = Package.Open(fileName, FileMode.Open, FileAccess.Read))
        {
            Console.WriteLine("Finding part, documentRelationshipType:{0}",
                documentRelationshipType);
            
            // Get the relationship for the main document part.

            PackageRelationship docPackageRelationship = wdPackage
                .GetRelationshipsByType(documentRelationshipType)
                .FirstOrDefault();
            if (docPackageRelationship != null)
            {

                // Assemble the URI of the main document part.

                Uri documentUri = PackUriHelper.ResolvePartUri(
                       new Uri("/", UriKind.Relative), docPackageRelationship.TargetUri);
                Console.WriteLine("Uri of document part:{0}", documentUri.ToString());

                // Get the main document part.

                PackagePart documentPart = wdPackage.GetPart(documentUri);
                Console.WriteLine("Document part content type:{0}",
                    documentPart.ContentType);
                Console.WriteLine();

                // Load the document XML in the part into an XDocument instance.
                using (Stream documentPartStream = documentPart.GetStream())
                using (XmlReader documentPartXmlReader =
                       XmlReader.Create(documentPartStream))
                    xDocMainDocument = XDocument.Load(documentPartXmlReader);

                //  Find the styles part. There will only be one.
                Console.WriteLine("Finding styles part, stylesRelationshipType:{0}",
                    stylesRelationshipType);
            
                // Get the relationship for the style part.

                PackageRelationship styleRelation = documentPart
                    .GetRelationshipsByType(stylesRelationshipType)
                    .FirstOrDefault();
                if (styleRelation != null)
                {

                    // Assemble the URI of the style part.

                    Uri styleUri = PackUriHelper.ResolvePartUri(documentUri,
                        styleRelation.TargetUri);
                    Console.WriteLine("Uri of styles part:{0}", styleUri.ToString());

                    // Get the style part.

                    PackagePart stylePart = wdPackage.GetPart(styleUri);
                    Console.WriteLine("Style part content type:{0}",
                        stylePart.ContentType);
                    Console.WriteLine();

                    // Load the style XML in the part into an XDocument instance.
                    using (Stream stylePartStream = stylePart.GetStream())
                    using (XmlReader stylePartXmlReader =
                           XmlReader.Create(stylePartStream))
                        xDocStyleDocument = XDocument.Load(stylePartXmlReader);
                }
            }
        }
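        // Either document is still null if its relationship was not found.
        if (xDocMainDocument != null)
            Console.WriteLine("The main document part has {0} nodes.",
                xDocMainDocument.DescendantNodes().Count());
        if (xDocStyleDocument != null)
            Console.WriteLine("The style part has {0} nodes.",
                xDocStyleDocument.DescendantNodes().Count());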
    }
}

Navigating from Document Part to Part by Using the Open XML SDK 2.0 for Microsoft Office

The following sample uses the Open XML SDK 2.0 for Microsoft Office to find the same two parts: the main document part from the package, and the styles part from the main document part. Comments identify the code that finds each part. As you can see, the code that uses the Open XML SDK 2.0 for Microsoft Office is significantly simpler than the code that uses the System.IO.Packaging namespace.

using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;

class Program
{
    static void Main(string[] args)
    {
        const string filename = "Test.docx";

        // Open read-only (isEditable: false); this sample only reads the parts.
        using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(filename, false))
        {

            // Get the main document part.

            MainDocumentPart mainPart = wordDoc.MainDocumentPart;

            // Get the style part.

            StyleDefinitionsPart styleDefinitionsPart = mainPart.StyleDefinitionsPart;

            Console.WriteLine("The main document part has {0} nodes.",
                mainPart.RootElement.Descendants().Count());
            Console.WriteLine("The style part has {0} nodes.",
                styleDefinitionsPart.RootElement.Descendants().Count());
        }
    }
}

Conclusion

There are some core concepts to understand about the Open Packaging Conventions, but after you get past the basic mechanics, what you find is that the Open Packaging Conventions is an extensible framework for assembling and organizing related content. It's a powerful framework for managing multiple content files of varying types that have strong interrelationships.

Additional Resources

Explore the following resources to build on the information presented here: