Configuring Optional Item Processing

Applies to: SharePoint Server 2010

In this article
Customizing optionalprocessing.xml
File Format for optionalprocessing.xml
Property Extraction
Document Conversion
Offensive Content Filtering
Metadata Extraction

This article describes how to update the configuration file for the optional item processing stages in the pipeline.

Customizing optionalprocessing.xml

You enable or disable the optional item processing stages in the optionalprocessing.xml configuration file.

This configuration file is read every time that the item processors are reset, started, or restarted. The file must contain the name and activation status for each optional stage. By default, all optional processing stages are deactivated.

To modify this configuration file, you must be a member of the FASTSearchAdministrators local group on the FAST Search Server 2010 for SharePoint administration server.

Note

You can enable or disable optional item processing stages by using optionalprocessing.xml. However, you cannot use this file to add custom processing stages to the pipeline. For information about how to add custom item processing, see Integrating an External Item Processing Component.

Use a text editor or XML editor to change this file.

To change the optionalprocessing.xml file

  1. On the FAST Search Server 2010 for SharePoint administration server, edit <FASTSearchFolder>\etc\config_data\DocumentProcessor\OptionalProcessing.xml.

    Here, <FASTSearchFolder> is the folder in which FAST Search Server 2010 for SharePoint is installed, for example C:\FASTSearch.

  2. On the FAST Search Server 2010 for SharePoint administration server, run the following command.

    <FASTSearchFolder>\bin\psctrl reset

    This command resets all item processors that are currently running in the system.
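The two steps above can also be scripted. The following is a minimal sketch, not a supported tool: the function names set_stage_active and reset_command are hypothetical, and the sketch assumes that the file uses the structure shown in this article. It only builds the psctrl reset command line; running it, and writing the changed XML back to OptionalProcessing.xml, are left to the administrator.

```python
import xml.etree.ElementTree as ET


def set_stage_active(xml_text: str, stage: str, active: bool) -> str:
    """Return the XML text with the named processor switched on or off.

    Hypothetical helper: it changes only the active attribute and never
    adds or removes processor elements.
    """
    root = ET.fromstring(xml_text)
    for proc in root.findall("processor"):
        if proc.get("name") == stage:
            proc.set("active", "yes" if active else "no")
            return ET.tostring(root, encoding="unicode")
    raise KeyError(f"no processor named {stage!r} in the file")


def reset_command(fast_search_folder: str) -> list:
    """Build the command from step 2 as an argument list (not executed here)."""
    return [fast_search_folder + r"\bin\psctrl", "reset"]


# Shortened sample content; a real file lists every optional stage.
sample = (
    "<optionalprocessing>"
    '<processor name="personnameextraction" active="no" />'
    '<processor name="XMLMapper" active="no" />'
    "</optionalprocessing>"
)

updated = set_stage_active(sample, "XMLMapper", True)
print(updated)
print(reset_command(r"C:\FASTSearch"))
```

Asking for a stage name that is not in the file raises an error instead of adding an element, which matches the rule that entries are normally only toggled.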

File Format for optionalprocessing.xml

The optionalprocessing.xml configuration file has the following syntax.

<optionalprocessing>
  <processor name="personnameextraction" active="yes|no" />
  <processor name="XMLMapper" active="yes|no" />
  <processor name="OffensiveContentFilter" active="yes|no" />
  <processor name="FFDDumper" active="yes|no" />
  <processor name="wholewordsextractor1" active="yes|no" />
  <processor name="wholewordsextractor2" active="yes|no" />
  <processor name="wholewordsextractor3" active="yes|no" />
  <processor name="wordpartextractor1" active="yes|no" />
  <processor name="wordpartextractor2" active="yes|no" />
  <processor name="MetadataExtraction" active="yes|no" />
  <processor name="SearchExportConverter" active="yes|no" />
</optionalprocessing>

Note

Do not add or remove entries in the file, with the exception of the MetadataExtraction entry. For all other processor elements, change only the value of the active attribute.
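Before you reset the item processors, it can help to check an edited file against the syntax above. The following is a minimal sketch under stated assumptions: check_optional_processing is a hypothetical helper, and it assumes that the only accepted attribute values are the lowercase strings yes and no, as in the examples in this article. Note that the placeholder yes|no from the syntax listing is not itself a valid value.

```python
import xml.etree.ElementTree as ET


def check_optional_processing(xml_text: str) -> list:
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != "optionalprocessing":
        problems.append(f"unexpected root element: {root.tag}")
    seen = set()
    for proc in root:
        if proc.tag != "processor":
            problems.append(f"unexpected element: {proc.tag}")
            continue
        name = proc.get("name")
        active = proc.get("active")
        if not name:
            problems.append("processor without a name attribute")
        elif name in seen:
            problems.append(f"duplicate processor: {name}")
        else:
            seen.add(name)
        # Assumption: only lowercase 'yes' and 'no' are accepted.
        if active not in ("yes", "no"):
            problems.append(f"{name}: active must be 'yes' or 'no', got {active!r}")
    return problems


good = '<optionalprocessing><processor name="XMLMapper" active="yes" /></optionalprocessing>'
bad = '<optionalprocessing><processor name="XMLMapper" active="yes|no" /></optionalprocessing>'

print(check_optional_processing(good))
print(check_optional_processing(bad))
```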

Table 1 describes the optional item processing stages.

Table 1. Optional item processing stages

Optional Stage Name | Description
personnameextraction | Enables the built-in person name property extraction. For more information, see Property Extraction.
XMLMapper | Enables mapping of XML content by using custom mapping of XML elements to crawled properties. For more information, see Custom XML Item Processing.
OffensiveContentFilter | Enables the built-in offensive content filtering. This feature removes items that contain pornographic content. For more information, see Offensive Content Filtering.
FFDDumper | Specifies the item processing pipeline debugging stage. For more information, see Debugging Custom Item Processing. Note: Use this stage only during testing, because it has a major effect on the feed rate and may quickly fill up the local hard disk (<FASTSearchFolder>\data\ffd\).
wholewordsextractor1, wholewordsextractor2, wholewordsextractor3, wordpartextractor1, wordpartextractor2 | These custom property extraction stages are included for backward compatibility. Important: If you have installed FAST Search Server 2010 for SharePoint Service Pack 1, do not use these stages; instead, follow the steps in Creating a Custom Property Extractor. For information about migration of custom property extractors based on these stages, see Migrating Custom Property Extractors That Are Defined Prior to Service Pack 1.
MetadataExtraction | Enables extended metadata extraction for Microsoft Word and Microsoft PowerPoint documents. When this stage is enabled, the title and date are based on the content of the document, instead of the document metadata. For more information, see Metadata Extraction. Important: Extended metadata extraction is introduced in FAST Search Server 2010 for SharePoint Service Pack 1, and is enabled by default when you install the service pack. You only have to include this line in the configuration file if you want to disable the feature.
SearchExportConverter | Enables conversion of additional document formats. For more information, see Document Conversion. Note: Do not enable or disable this feature directly in the configuration file optionalprocessing.xml. Instead, follow the procedure that is described in Enable Advanced Filter Pack (FAST Search Server 2010 for SharePoint) on Microsoft TechNet.

Note

If you change the item processing configuration, you must re-crawl all content that is affected by the item processing configuration change.

The following example shows how to enable the generation of a personnames crawled property that contains the person names extracted from the processed content. You enable the stage by changing the value of the active attribute to yes.

<optionalprocessing>
    <processor name="personnameextraction" active="yes"/>
</optionalprocessing>

The following example shows how to enable mapping of XML content to crawled properties.

<optionalprocessing>
    <processor name="XMLMapper" active="yes"/>
</optionalprocessing>

Note

The XMLMapper processing stage requires an additional configuration file for the XML mapping. For information, see Custom XML Item Processing.

Property Extraction

Property extraction is a process that extracts information from the visible textual content of an item and stores that information as additional crawled properties for the document.

There are three built-in property extraction stages in the FAST Search Server 2010 for SharePoint item processing pipeline, which do the following:

  • The person name extractor extracts names of persons, based on a generic dictionary. By default, this stage is disabled, as FAST Search Server 2010 for SharePoint includes other features related to person name extraction (author property and people search feature). If you also want to extract names that are not specific to your company or organization, you can enable the stage in optionalprocessing.xml.

  • The location extractor extracts names of geographical locations, based on a generic dictionary. By default, this stage is enabled. If this property extraction is not relevant for your application, you do not have to map the resulting crawled property to a managed property in the index.

  • The company extractor extracts names of companies, based on a generic dictionary. By default, this stage is enabled. If this property extraction is not relevant for your application, you do not have to map the resulting crawled property to a managed property in the index.

The built-in property extraction stages support the following languages:

  • Arabic

  • Dutch

  • English

  • French

  • German

  • Italian

  • Japanese

  • Norwegian

  • Portuguese

  • Russian

  • Spanish

The default dictionaries for people, locations, and companies were created to give reasonable coverage of public news content in the languages listed earlier in this section.

You can modify the built-in property extractors by adding inclusion lists and exclusion lists. For information, see Manage Property Extraction (FAST Search Server 2010 for SharePoint) on Microsoft TechNet.

You can add custom property extractors to the pipeline. For information, see Creating a Custom Property Extractor.

Document Conversion

The processing stage named SearchExportConverter controls the FAST Search Server 2010 for SharePoint Advanced Filter Pack. This feature enables text and metadata extraction from several hundred file formats, complementing the document formats that are supported by the standard Filter Pack. By default, the Advanced Filter Pack is disabled.

Note

Do not enable or disable this feature directly in the configuration file optionalprocessing.xml. Instead, follow the procedure that is described in Enable Advanced Filter Pack (FAST Search Server 2010 for SharePoint) on Microsoft TechNet.

You can also deploy custom IFilter components that are developed for specific file formats. This is controlled via the user_converter_rules.xml configuration file. For information, see Configure FAST Search Server for SharePoint to use a Third-Party IFilter.

Offensive Content Filtering

The FAST Search Server 2010 for SharePoint offensive content filtering is implemented as a separate item processing stage. Item content that is run through the filter is compared to predefined terms in dictionaries. The output of the filter is an overall score that indicates the likelihood that an item is pornographic. The item's offensive score is written to the crawled property OCF::Score. Any item whose score exceeds the threshold of 30 is dropped from indexing.

The FAST Search Server 2010 for SharePoint offensive content filter uses single words and multiword expressions as the basis for the filtering.
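The threshold rule can be illustrated with a small sketch. It does not reproduce the filter's scoring, which this article does not describe. The dictionary representation of an item and the function name is_dropped are hypothetical, and the sketch assumes that "exceeds" means strictly greater than 30, so that an item with a score of exactly 30 is kept.

```python
OCF_SCORE_PROPERTY = "OCF::Score"  # crawled property named in this article
OCF_DROP_THRESHOLD = 30            # threshold stated in this article


def is_dropped(crawled_properties: dict) -> bool:
    """Return True when the offensive score exceeds the threshold.

    Assumption: an item without a score is treated as having score 0.
    """
    score = crawled_properties.get(OCF_SCORE_PROPERTY, 0)
    return score > OCF_DROP_THRESHOLD


# Hypothetical items with made-up scores, for illustration only.
items = [
    {"title": "a", "OCF::Score": 5},
    {"title": "b", "OCF::Score": 30},
    {"title": "c", "OCF::Score": 31},
]

kept = [item["title"] for item in items if not is_dropped(item)]
print(kept)  # ['a', 'b']
```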

By default, the offensive content filter is not enabled. You enable it by using the activation key OffensiveContentFilter in optionalprocessing.xml, as shown in the following example.

<optionalprocessing>
      <processor name="OffensiveContentFilter" active="yes"/>
</optionalprocessing>

Note

The offensive content filter does not use site information and does not take visual information (images) into account. The functionality is limited to pages that contain offensive text. For such pages, it provides a very high identification rate.

Offensive content filtering supports the following languages:

  • Arabic

  • Chinese

  • Czech

  • English

  • Finnish

  • French

  • German

  • Hindi

  • Italian

  • Japanese

  • Korean

  • Lithuanian

  • Norwegian

  • Russian

  • Spanish

  • Swedish

  • Turkish

The offensive content filter scans the crawled properties title, body and ocfcontribution. The last property is not set by the crawlers, but can be used to scan additional content. You can, for example, map custom content to ocfcontribution by using XMLMapper.

Items that are considered pornographic are dropped during processing and appropriate feedback is supplied to the indexing connector.

Metadata Extraction

Certain crawled properties contain metadata from Microsoft Office documents. Authors typically start a new document from a template or from another document. In many cases the author does not update the metadata, so the metadata is misleading. For Microsoft Word and Microsoft PowerPoint documents, the date and title can instead be extracted from the content of the document, which in most cases produces better metadata.

FAST Search Server 2010 for SharePoint includes an extended metadata extraction stage. When this stage is enabled, the title and date are based on the content of the document, instead of the document metadata.

By default, the extended metadata extraction is enabled. To disable extended metadata extraction, you add the key MetadataExtraction in optionalprocessing.xml, as shown in the following example.

<optionalprocessing>
      <processor name="MetadataExtraction" active="no" />
</optionalprocessing>

If you disable extended metadata extraction, then title and date will be based on the document metadata.
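Because the MetadataExtraction entry may be missing from the file, a script that disables the stage has to handle both cases. The following is a minimal sketch, assuming the file structure shown in this article; disable_metadata_extraction is a hypothetical helper name. It sets the existing entry to no, or appends the entry when it is absent.

```python
import xml.etree.ElementTree as ET

STAGE = "MetadataExtraction"


def disable_metadata_extraction(xml_text: str) -> str:
    """Return the XML text with the MetadataExtraction stage set to 'no'."""
    root = ET.fromstring(xml_text)
    for proc in root.findall("processor"):
        if proc.get("name") == STAGE:
            # The entry already exists: only change its active attribute.
            proc.set("active", "no")
            break
    else:
        # The entry is absent: this is the one entry that may be added.
        ET.SubElement(root, "processor", {"name": STAGE, "active": "no"})
    return ET.tostring(root, encoding="unicode")


# Shortened sample content; a real file lists every optional stage.
sample = '<optionalprocessing><processor name="XMLMapper" active="no" /></optionalprocessing>'
print(disable_metadata_extraction(sample))
```

Running the function on its own output does not add a second entry. As with any change to the file, the item processors must be reset afterward.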

Important

Extended metadata extraction is introduced in FAST Search Server 2010 for SharePoint Service Pack 1, and is enabled by default after you have installed the service pack. The Service Pack upgrade does not modify optionalprocessing.xml.

If you use extended metadata extraction, then two crawled properties are created in the item processing pipeline:

  • Extracted title:

    • Crawled property name: 302

    • Property set: 012357BD-1113-171D-1F25-292BB0B0B0B0

    • Variant type: 31

    • Mapped to managed property: Title

  • Extracted date:

    • Crawled property name: 263

    • Property set: 012357BD-1113-171D-1F25-292BB0B0B0B0

    • Variant type: 64

    • Mapped to managed property: Write
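For scripts that need to refer to these two crawled properties, the values above can be kept in one place. The following is a sketch only: the CrawledProperty class and its field names are hypothetical and are not part of any FAST Search Server API. The variant type names come from the standard Windows VARENUM values, where 31 is VT_LPWSTR (a Unicode string) and 64 is VT_FILETIME (a date and time).

```python
from dataclasses import dataclass

# Property set shared by both crawled properties, as listed above.
PROPERTY_SET = "012357BD-1113-171D-1F25-292BB0B0B0B0"

# Standard Windows VARENUM names for the two variant types used here.
VARIANT_NAMES = {31: "VT_LPWSTR", 64: "VT_FILETIME"}


@dataclass(frozen=True)
class CrawledProperty:
    """Hypothetical record type for the values listed in this article."""
    label: str
    name: str
    property_set: str
    variant_type: int
    managed_property: str


EXTRACTED = (
    CrawledProperty("Extracted title", "302", PROPERTY_SET, 31, "Title"),
    CrawledProperty("Extracted date", "263", PROPERTY_SET, 64, "Write"),
)

for prop in EXTRACTED:
    type_name = VARIANT_NAMES[prop.variant_type]
    print(f"{prop.label}: {prop.property_set}:{prop.name} ({type_name}) -> {prop.managed_property}")
```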

See Also

Concepts

Creating a Custom Property Extractor

Custom XML Item Processing

Configure FAST Search Server for SharePoint to use a Third-Party IFilter

Integrating an External Item Processing Component