Custom Document Parsers

Last modified: November 01, 2010

Applies to: SharePoint Foundation 2010

Managing the metadata that is associated with your documents is one of the most powerful advantages of storing your enterprise content in Microsoft SharePoint Foundation 2010. However, keeping that information synchronized between the document library and the document itself is a challenge. SharePoint Foundation provides the document parser infrastructure, which enables you to create and install custom document parsers that parse your custom file types, update a document when its properties change at the document library level, and update the document library when properties change in the document. Using a document parser for your custom file types helps ensure that document metadata stays current and synchronized between the document library and the document itself.

A document parser is a custom COM object that, by implementing the ISPDocumentParser Interface, does the following when it is invoked by SharePoint Foundation:

  • Extracts document property values from a document of a certain file type, and passes those property values to SharePoint Foundation for promotion to the document library property columns.

  • Receives document properties and then demotes those property values into the document itself.

  • Additionally, updates content type information, performs link discovery and repair, and extracts a thumbnail image.

This functionality enables users to edit document properties in the document itself and have the property values on the document library automatically update to reflect their changes. Likewise, users can update property values at the document library level and have those changes written back into the document automatically.

For more information about how SharePoint Foundation invokes document parsers, and how those parsers promote and demote document metadata, see Document Property Promotion and Demotion.

For SharePoint Foundation to use a custom document parser, the document parser must meet the following conditions:

  • The document parser must be a COM component that implements the ISPDocumentParser Interface.

    For more information, see Document Parser Interface Overview.

  • You must install and register the COM component on each web server in the SharePoint Foundation installation.

  • You must register the document parser with SharePoint Foundation by adding it to the collection returned by the PluggableParsers property of the SPWebService class. (For this reason, custom parsers are sometimes called "pluggable.")

SharePoint Foundation selects the document parser to invoke based on the file type of the document to be parsed. Any given file type can be associated with only one document parser. However, the same document parser can be associated with several different file types.

You can discover which parser is associated with a particular file type by examining the collection that is returned by the PluggableParsers property of the SPWebService class. For example, the following console application enumerates the pluggable parsers that are registered with the content service on the local server and prints each file name extension and corresponding program ID.

using System;
using System.Collections.Generic;
using Microsoft.SharePoint.Administration;

namespace Test
{
    class Program
    {
        static void Main(string[] args)
        {
            // Get the content service on the local server.
            SPWebService service = SPWebService.ContentService;

            // PluggableParsers maps each file name extension to its parser.
            Dictionary<string, SPDocumentParser> parsers = service.PluggableParsers;

            Console.WriteLine("Ext       ProgID");
            Console.WriteLine("---       ------");
            foreach (SPDocumentParser parser in parsers.Values)
            {
                Console.WriteLine("{0, -7}   {1}", parser.FileExtension, parser.ProgId);
            }
            Console.Write("\nPress ENTER to continue...");
            Console.ReadLine();
        }
    }
}
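
When you need the parser for only one file type, you can query the collection directly instead of enumerating it. The following is a minimal sketch under stated assumptions: the default extension "docx" and the command-line argument are illustrative choices, and the keys are assumed to be stored without a leading period, as in the other samples in this topic.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.SharePoint.Administration;

namespace Test
{
    class Program
    {
        static void Main(string[] args)
        {
            // Illustrative default; pass a different extension on the command line.
            string extension = args.Length > 0 ? args[0] : "docx";

            SPWebService service = SPWebService.ContentService;
            Dictionary<string, SPDocumentParser> parsers = service.PluggableParsers;

            // TryGetValue avoids the exception that the indexer throws
            // when no parser is registered for the extension.
            SPDocumentParser parser;
            if (parsers.TryGetValue(extension, out parser))
            {
                Console.WriteLine("{0} -> {1}", parser.FileExtension, parser.ProgId);
            }
            else
            {
                Console.WriteLine("No pluggable parser is registered for \"{0}\".", extension);
            }
        }
    }
}
```

Like the preceding sample, this code must run on a server in the SharePoint Foundation installation.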

To associate a custom document parser with a file type, add the parser to the collection that is returned by the PluggableParsers property of the SPWebService class. For example, the following console application creates a custom document parser and associates it with files that have a "txt" extension by adding the parser to the collection of pluggable parsers for the content service on the local server.

using System;
using System.Collections.Generic;
using Microsoft.SharePoint.Administration;

namespace Test
{
    class Program
    {
        static void Main(string[] args)
        {
            SPWebService service = SPWebService.ContentService;
            Dictionary<string, SPDocumentParser> parsers = service.PluggableParsers;

            // Describe the parser. This sample reuses the ProgID of the parser that is
            // registered for "docx" files; substitute the ProgID of your own COM component.
            string extension = "txt";
            string progID = parsers["docx"].ProgId;
            SPDocumentParser customParser = new SPDocumentParser(progID, extension);

            // Remove any existing parser for the file type, because a file type
            // can be associated with only one parser.
            if (parsers.ContainsKey(extension))
            {
                parsers.Remove(extension);
                service.Update();
            }

            // Add the new parser for the file type and save the change.
            parsers.Add(extension, customParser);
            service.Update();
        }
    }
}
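
Because the same document parser can be associated with several file types, you can register one program ID under more than one extension. The following sketch illustrates the idea. The program ID "Contoso.CustomParser" and the extensions "abc" and "abd" are hypothetical placeholders, not values that SharePoint Foundation defines. The sketch skips any extension that already has a parser instead of replacing it.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.SharePoint.Administration;

namespace Test
{
    class Program
    {
        static void Main(string[] args)
        {
            // Hypothetical values: use the ProgID of your registered COM
            // component and the extensions that it can parse.
            string progID = "Contoso.CustomParser";
            string[] extensions = { "abc", "abd" };

            SPWebService service = SPWebService.ContentService;
            Dictionary<string, SPDocumentParser> parsers = service.PluggableParsers;

            foreach (string extension in extensions)
            {
                // A file type can have only one parser, so leave existing entries alone.
                if (parsers.ContainsKey(extension))
                {
                    Console.WriteLine("Skipping \"{0}\": already mapped to {1}.",
                        extension, parsers[extension].ProgId);
                    continue;
                }

                // Each extension gets its own SPDocumentParser entry with the same ProgID.
                parsers.Add(extension, new SPDocumentParser(progID, extension));
            }

            // Save the updated collection.
            service.Update();
        }
    }
}
```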

Caution

SharePoint Foundation 2010 includes several built-in document parsers. You can replace a built-in document parser with a custom parser, but you should do so only after careful consideration, particularly if you are thinking of replacing a parser for an HTML-based file type. The built-in parsers sometimes do more than the pluggable parser interface allows.
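
If you do decide to replace an existing parser, it may help to record the program ID that was registered before the change. The following sketch shows one possible approach. The ReplaceParser helper, the "abc" extension, and the "Contoso.CustomParser" program ID are hypothetical names introduced for illustration. This topic does not confirm that re-adding a recorded program ID fully restores the behavior of a built-in parser, so treat the recorded value as a note for the administrator and not as a guaranteed rollback.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.SharePoint.Administration;

namespace Test
{
    class Program
    {
        // Hypothetical helper: replaces the parser for an extension and returns
        // the ProgID that was registered before, or null if there was none.
        static string ReplaceParser(SPWebService service, string extension, string newProgID)
        {
            Dictionary<string, SPDocumentParser> parsers = service.PluggableParsers;
            string previousProgID = null;

            SPDocumentParser existing;
            if (parsers.TryGetValue(extension, out existing))
            {
                // Remember what was registered, then remove it.
                previousProgID = existing.ProgId;
                parsers.Remove(extension);
                service.Update();
            }

            parsers.Add(extension, new SPDocumentParser(newProgID, extension));
            service.Update();
            return previousProgID;
        }

        static void Main(string[] args)
        {
            // Hypothetical extension and ProgID.
            string previous = ReplaceParser(
                SPWebService.ContentService, "abc", "Contoso.CustomParser");

            Console.WriteLine(previous == null
                ? "No parser was registered before."
                : "Previously registered ProgID: " + previous);
        }
    }
}
```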

To ensure that SharePoint Foundation can invoke a given parser whenever necessary, you must install the COM component for each parser on each front-end web server in your SharePoint Foundation installation. Because of this, you can specify only one parser for a given file type across a SharePoint Foundation installation.

The document parser infrastructure does not include the ability to package and deploy a custom document parser as part of a SharePoint Foundation Feature.
