Custom XML Item Processing

Applies to: SharePoint Server 2010

In this article
XML Mapper Overview
Crawling XML Content
Format Detection and Item Parsing
Customizing xmlmapper.xml
File Format of xmlmapper.xml
XML Mapper Configuration Example

The item processing pipeline in FAST Search Server 2010 for SharePoint includes an optional XML Mapper stage to create crawled properties from specific parts of crawled XML items. For information about how to enable the XML Mapper stage in the item processing pipeline, see Configuring Optional Item Processing.

This article describes how to configure the property mapping by using the configuration file xmlmapper.xml.

XML Mapper Overview

Use XML Mapper for web and file share crawling of XML-based documents that require customized extraction and transformation. When XML Mapper is enabled in the pipeline, all crawled items that are XML-based documents are mapped. Do not use XML Mapper when you crawl unknown content or multiple sources, because the configuration may unintentionally match crawled XML items of no interest.

You specify which parts of the XML content to map by using XPath 1.0 together with functionality that is provided by the configuration file itself. For example, you can handle multiple expressions, add string separators, split strings, and remove white space.
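
As a minimal sketch of such a mapping, the following fragment assumes a hypothetical input item that contains several Keyword elements whose text is padded with line breaks and spaces. The element name Keyword and the crawled property name mykeywords are assumptions made for illustration only. The property set GUID is reused from the configuration example later in this article, and the exact effect of ignore-whitespace should be checked against the XML Mapper schema reference.

```xml
<XMLPropertiesCreator>
  <!-- Default property set and variant type for all mappings in this file -->
  <propset>d6ee4933-09c4-46e3-a5e4-b3787cb4a090</propset>
  <type>31</type>
  <XMLMappings>
    <!-- Hypothetical: collect every Keyword element into one crawled property
         and separate multiple matches with a semicolon -->
    <Mapping attr="mykeywords" path="//Keyword" sep-str=";" ignore-whitespace="yes"/>
  </XMLMappings>
</XMLPropertiesCreator>
```

The XPath expression selects the nodes, and the attributes on the Mapping element control how the selected text is post-processed before it is stored in the crawled property.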

For each mapping statement, you specify the crawled property that the extracted XML content is mapped to. For more information about crawled properties, see Plan the Index Schema (FAST Search Server 2010 for SharePoint) and Index Schema Reference.

Crawling XML Content

When XML Mapper is enabled in the pipeline, all crawled items that are XML-based documents are mapped. You should therefore ensure that you crawl only known XML content. If you crawl unknown content or multiple sources, you may unintentionally match crawled XML items that use the mapped XML elements for other purposes. This may lead to incorrect metadata in your index.

There are two main ways of crawling XML content into FAST Search Server 2010 for SharePoint:

  • Crawl XML documents by using the Content SSA (FAST Search Connector). This will enable you to retrieve XML documents from web servers, file shares, and SharePoint document libraries.

    Note

    A crawled item can include multiple parts, such as a SharePoint list item or an email message with multiple attachments. The XML Mapper will map XML content that appears only in the first (main) part of the crawled item, not in the attachments.

  • Use the FAST Search database connector to retrieve XML content from a column in a database table (for more information, see Crawling Database Content with the FAST Search Database Connector on Microsoft TechNet). In the SQL-based crawl rule, you can specify that a given column in a database table is mapped to the predefined internal property named data. This is the same internal property that contains the main content of a document when passed from an indexing connector to the item processing pipeline.

    The following example shows a simple crawl rule that retrieves an XML-based formal description of a product from a table named Product to the internal property named data.

    SELECT Product.formalDescription AS data FROM Product
    

    Important

    The data property cannot be mapped as a crawled property, but in this case its content is treated as an XML-type document. The content type considerations described in Format Detection and Item Parsing also apply to these kinds of items.

Format Detection and Item Parsing

The item processing pipeline detects the type of content by analyzing the actual data in the retrieved item. If it contains valid XML, it will be treated as XML and converted by using the XML Mapper. Some XML content may not have valid XML declarations and may contain element names that are frequently used in HTML. In such cases, the crawled XML items might be mistaken for HTML items. One solution to this problem is to bypass format detection for crawled items with ".xml" as the file name extension. You can do that by adding the following conversion rule in the configuration file user_converter_rules.xml.

<ConverterRules>
   <IFilter>
      <trust>
         <ext name=".xml" mimetype="text/xml" />
      </trust>
   </IFilter>
</ConverterRules>

This configuration ensures that files with extension ".xml" are always treated as XML content. For more information, see Configure FAST Search Server for SharePoint to use a Third-Party IFilter.

The item processing pipeline maps the main body of all items, including XML items, to the managed property named body. This means that the content of all elements and attributes is searchable in the default full-text index and in the managed property named body. This is done in addition to the optional mapping performed by XML Mapper. If part (or all) of the XML is metadata, you may want to control this mapping on a more granular level. You can map a more specific part of the XML (or even a dummy value) to the managed property named body. The following procedure specifies how to perform this mapping.

To map specific XML content to the body managed property

  1. Specify an XML Mapper configuration that maps specific parts of the XML to a new crawled property with a unique name.

  2. Specify a mapping of this crawled property to the managed property named body. For more information, see Manage Crawled Properties by Using Windows PowerShell (FAST Search Server 2010 for SharePoint) on Microsoft TechNet.

  3. Ensure that body has the MergeCrawledProperties flag set in the index schema. For more information, see Manage Managed Properties by Using Windows PowerShell (FAST Search Server 2010 for SharePoint) on Microsoft TechNet.
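
Step 1 of this procedure might be sketched as follows. The element path /Document/Abstract and the crawled property name mybodytext are hypothetical and stand in for whatever part of your XML should be searchable as body text. The property set GUID is the one used in the configuration example later in this article.

```xml
<XMLPropertiesCreator>
  <propset>d6ee4933-09c4-46e3-a5e4-b3787cb4a090</propset>
  <type>31</type>
  <XMLMappings>
    <!-- Hypothetical: expose only the Abstract element as a crawled property
         that can then be mapped to the managed property named body -->
    <Mapping attr="mybodytext" path="/Document/Abstract"/>
  </XMLMappings>
</XMLPropertiesCreator>
```

This fragment covers only the XML Mapper side. Steps 2 and 3 are performed in the index schema, as described in the TechNet articles referenced in the procedure.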

Customizing xmlmapper.xml

To modify this configuration file, you must be a member of the FASTSearchAdministrators local group on the FAST Search Server 2010 for SharePoint administration node.

To modify xmlmapper.xml

  1. Ensure that the XML Mapper stage is enabled in the item processing pipeline. For more information, see Configuring Optional Item Processing.

  2. On the administration node, create the following file if it does not exist: %FASTSEARCH%\etc\config_data\DocumentProcessor\XMLMapper.xml.

  3. Edit %FASTSEARCH%\etc\config_data\DocumentProcessor\XMLMapper.xml in a text editor.

  4. Run the command psctrl reset to reset all currently running item processors in the system.

File Format of xmlmapper.xml

The following is the basic structure of the xmlmapper.xml file.

<XMLPropertiesCreator>
  <propset>propertySetValue</propset> 
  <type>variantTypeValue</type> 
  <paragraph-sep>paragraphSeparatorValue</paragraph-sep>
  <XMLMappings>
    <Namespace name='namespaceName' uri='uriName' />
    <Mapping path='XPath' attr='propName' propset='GUID' type='varType'
             sep-str='separator' post-str='postString'
             ignore-whitespace='yes|no' strip-tags='yes|no'
             shallow='yes|no' mode='append|prepend|overwrite' />

    <SubTree base-path='basePath' >
      <Mapping path='XPath' attr='propName' propset='GUID' type='varType'
               sep-str='separator' post-str='postString'
               ignore-whitespace='yes|no' strip-tags='yes|no'
               shallow='yes|no' mode='append|prepend|overwrite' />
    </SubTree> 
    <MappingGroup base-path='basePath' attr='propName' propset='GUID' type='varType'
                  sep-str='separator' pre-str='preString' post-str='postString'
                  rec-sep-str='recSeparator' rec-pre-str='recPreString' rec-post-str='recPostString'
                  select='merge|first|longest' mode='append|prepend|overwrite' >
      <Mapping path='XPath' sep-str='separator' post-str='postString'
               ignore-whitespace='yes|no' strip-tags='yes|no' shallow='yes|no' /> 
    </MappingGroup>
  </XMLMappings>
</XMLPropertiesCreator>

For information about the XML syntax for the individual elements, see XML Mapper Schema (FAST Search Server 2010 for SharePoint).
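
To make the SubTree element more concrete, here is a minimal sketch that targets the Tags element of the sample item shown in XML Mapper Configuration Example. It assumes, as the base-path attribute name suggests, that Mapping paths inside a SubTree are evaluated relative to the base path. Verify this behavior against the schema reference before relying on it.

```xml
<XMLPropertiesCreator>
  <propset>d6ee4933-09c4-46e3-a5e4-b3787cb4a090</propset>
  <type>31</type>
  <XMLMappings>
    <!-- Assumption: paths inside SubTree are resolved relative to base-path -->
    <SubTree base-path="/Document/Tags">
      <Mapping attr="mytags" path="Tag" sep-str=";"/>
    </SubTree>
  </XMLMappings>
</XMLPropertiesCreator>
```

Under that assumption, grouping mappings in a SubTree keeps the individual path expressions short when many crawled properties are extracted from the same part of the item.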

XML Mapper Configuration Example

For this example, consider the following XML item as the content input.

<Document>
  <Title>My title</Title>
  <Date>2010-01-06T14:25:04Z</Date>
  <Size>128</Size>
  <Tags>
    <Tag>funny</Tag>
    <Tag>hilarious</Tag>
  </Tags>
  <Tutti>Hello</Tutti>
  <Frutti>World</Frutti>
</Document>

The following example provides an XML Mapper configuration that extracts information from such an item.

<XMLPropertiesCreator>
  <propset>d6ee4933-09c4-46e3-a5e4-b3787cb4a090</propset>
  <type>31</type>
  <XMLMappings>
    <Mapping attr="mytitle" path="//Title"/>
    <Mapping attr="mysize" path="//Size" type="3"/>
    <Mapping attr="mydate" path="//Date" type="64" propset="38c35ad5-69ee-4776-886f-95961a73d52d"/>
    <Mapping attr="mytags" path="//Tag" sep-str=";"/>
    <MappingGroup attr="mymulti" base-path="/Document" select="first">
      <Mapping path="Tutti"/>
      <Mapping path="Frutti"/>
    </MappingGroup>
  </XMLMappings>
</XMLPropertiesCreator>

In this example, XML Mapper will apply the mapping shown in Table 1.

Table 1. XML Mapper mappings

| Element | Maps To |
| --- | --- |
| //Title | mytitle in the propset "d6ee4933-09c4-46e3-a5e4-b3787cb4a090" with variant type set to "31". |
| //Size | mysize in the propset "d6ee4933-09c4-46e3-a5e4-b3787cb4a090" with variant type set to "3". |
| //Date | mydate in the propset "38c35ad5-69ee-4776-886f-95961a73d52d" with variant type set to "64". |
| //Tag | mytags in the propset "d6ee4933-09c4-46e3-a5e4-b3787cb4a090" with variant type set to "31". Multiple element matches will be separated by a semicolon. |
| /Document/Tutti | mymulti in the propset "d6ee4933-09c4-46e3-a5e4-b3787cb4a090" with variant type set to "31". select="first" means that only the first matching element is used. In this case, the text "Hello" is extracted. |
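
As a variation on the MappingGroup in this example, the file format lists merge and longest as alternatives to first for the select attribute. The sketch below switches to merge and adds a separator. The crawled property name mycombined is hypothetical, and how the merged values are joined (including the role of sep-str here) is an assumption that should be confirmed in the XML Mapper schema reference.

```xml
<XMLPropertiesCreator>
  <propset>d6ee4933-09c4-46e3-a5e4-b3787cb4a090</propset>
  <type>31</type>
  <XMLMappings>
    <!-- Hypothetical: combine the Tutti and Frutti matches instead of
         keeping only the first one -->
    <MappingGroup attr="mycombined" base-path="/Document" select="merge" sep-str=" ">
      <Mapping path="Tutti"/>
      <Mapping path="Frutti"/>
    </MappingGroup>
  </XMLMappings>
</XMLPropertiesCreator>
```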

See Also

Concepts

XML Mapper Schema (FAST Search Server 2010 for SharePoint)

Configuring Optional Item Processing