Configuring Optional Item Processing

SharePoint 2010

Last modified: August 16, 2011

Applies to: SharePoint Server 2010

In this article
Customizing optionalprocessing.xml
File Format for optionalprocessing.xml
Property Extraction
Document Conversion
Offensive Content Filtering
Metadata Extraction

This article describes how to update the configuration file that controls the optional item processing stages in the item processing pipeline.

Customizing optionalprocessing.xml

You enable or disable the optional item processing stages in the optionalprocessing.xml configuration file.

This configuration file is read every time that the item processors are reset, started, or restarted. The file must contain the name and activation status for each optional stage. By default, all optional processing stages are deactivated.
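As a rough illustration of the requirement that every entry carries a name and an activation status, the following Python sketch checks the content of the file before the item processors are reset. It assumes the <processor name="..." active="yes|no" /> layout described in this article. The function name check_optional_processing is hypothetical and not part of FAST Search Server 2010 for SharePoint, and accepting only lowercase yes and no is an assumption taken from the documented syntax.

```python
import xml.etree.ElementTree as ET


def check_optional_processing(xml_text):
    """Return a list of problems found in optionalprocessing.xml content.

    Hypothetical helper: an empty list means every entry has a name and
    an active attribute set to "yes" or "no".
    """
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != "optionalprocessing":
        problems.append(f"unexpected root element: {root.tag}")
    for element in root:
        if element.tag != "processor":
            problems.append(f"unexpected element: {element.tag}")
            continue
        name = element.get("name")
        active = element.get("active")
        if not name:
            problems.append("processor element without a name attribute")
        # Assumption: only the lowercase values shown in the syntax are valid.
        if active not in ("yes", "no"):
            problems.append(
                f"processor {name!r}: active must be 'yes' or 'no', got {active!r}"
            )
    return problems
```

An empty result only says that the file is structurally plausible; it does not confirm that the item processors accept it.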

To modify this configuration file, you must be a member of the FASTSearchAdministrators local group on the FAST Search Server 2010 for SharePoint administration server.

Note

You can enable or disable optional item processing stages by using optionalprocessing.xml. However, you cannot use this file to add custom processing stages to the pipeline. For information about how to add custom item processing, see Integrating an External Item Processing Component.

Use a text editor or XML editor to change this file.

To change the optionalprocessing.xml file

  1. On the FAST Search Server 2010 for SharePoint administration server, edit <FASTSearchFolder>\etc\config_data\DocumentProcessor\OptionalProcessing.xml.

    Where <FASTSearchFolder> is the folder where you have installed FAST Search Server 2010 for SharePoint, for example C:\FASTSearch.

  2. On the FAST Search Server 2010 for SharePoint administration server, run the following command.

    <FASTSearchFolder>\bin\psctrl reset

    This will reset all currently running item processors in the system.
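The two locations used in this procedure can be derived from the installation folder. The following Python sketch builds them with the standard ntpath module so that the Windows separators are produced on any platform. The function names are hypothetical, and the installation folder is passed in explicitly rather than discovered.

```python
import ntpath


def optional_processing_path(fast_search_folder):
    """Return the path of the configuration file edited in step 1."""
    return ntpath.join(
        fast_search_folder,
        "etc", "config_data", "DocumentProcessor", "OptionalProcessing.xml",
    )


def psctrl_reset_command(fast_search_folder):
    """Return the step 2 command as an argument list (program, argument)."""
    return [ntpath.join(fast_search_folder, "bin", "psctrl"), "reset"]
```

On the administration server, the returned list could be passed to subprocess.run(..., check=True). How the psctrl program name resolves depends on the environment, so treat that call as an assumption to verify rather than a tested recipe.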

File Format for optionalprocessing.xml

The optionalprocessing.xml configuration file has the following syntax.

<optionalprocessing>
  <processor name="personnameextraction" active="yes|no" />
  <processor name="XMLMapper" active="yes|no" />
  <processor name="OffensiveContentFilter" active="yes|no" />
  <processor name="FFDDumper" active="yes|no" />
  <processor name="wholewordsextractor1" active="yes|no" />
  <processor name="wholewordsextractor2" active="yes|no" />
  <processor name="wholewordsextractor3" active="yes|no" />
  <processor name="wordpartextractor1" active="yes|no" />
  <processor name="wordpartextractor2" active="yes|no" />
  <processor name="MetadataExtraction" active="yes|no" />
  <processor name="SearchExportConverter" active="yes|no" />
</optionalprocessing>

Note

You must not add or remove entries in the file, except the entry for the optional processing stage MetadataExtraction. For all other processor elements, change only the value of the active attribute.
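The rule of changing only the active attribute can be sketched in Python as follows. The helper set_stage_active is hypothetical. It changes the attribute of an existing entry and raises an error instead of adding an entry that is not already in the file.

```python
import xml.etree.ElementTree as ET


def set_stage_active(xml_text, stage_name, active):
    """Return the XML text with one stage switched on or off.

    Hypothetical helper: only the active attribute of an existing
    processor entry is changed; no entries are added or removed.
    """
    root = ET.fromstring(xml_text)
    for processor in root.iter("processor"):
        if processor.get("name") == stage_name:
            processor.set("active", "yes" if active else "no")
            return ET.tostring(root, encoding="unicode")
    # Refuse to invent entries that are not already in the file.
    raise KeyError(f"no processor entry named {stage_name!r}")
```

ElementTree discards XML comments and may change whitespace when it writes the document back, so for a production file a text editor or XML editor remains the safer choice. The MetadataExtraction entry is the one exception to the rule and would have to be added by hand if it is missing.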

Table 1 describes the optional item processing stages.

Table 1. Optional item processing stages

Optional Stage Name

Description

personnameextraction

Enables the built-in person name property extraction. For more information, see Property Extraction.

XMLMapper

Enables mapping of XML content by using custom mapping of XML elements to crawled properties. For more information, see Custom XML Item Processing.

OffensiveContentFilter

Enables the built-in Offensive Content Filtering. This feature removes items that contain pornographic content. For more information, see Offensive Content Filtering.

FFDDumper

Specifies the item processing pipeline debugging stage. For more information, see Debugging Custom Item Processing.

Note

You should use this stage only during testing, because it has a major effect on the feed rate and may quickly fill up the local hard disk (<FASTSearchFolder>\data\ffd\).

wholewordsextractor1, wholewordsextractor2, wholewordsextractor3, wordpartextractor1, wordpartextractor2

These custom property extraction stages are included for backward compatibility.

Important

If you have installed FAST Search Server 2010 for SharePoint Service Pack 1, you should not use these stages, but instead follow the steps in Creating a Custom Property Extractor.

For information about migration of custom property extractors based on these stages, see Migrating Custom Property Extractors That Are Defined Prior to Service Pack 1.

MetadataExtraction

Enables extended metadata extraction for Microsoft Word and Microsoft PowerPoint documents. When this stage is enabled, the title and date are based on the content of the document, instead of the document metadata. For more information, see Metadata Extraction.

Important

Extended metadata extraction was introduced in FAST Search Server 2010 for SharePoint Service Pack 1, and is enabled by default when you install the service pack. You have to include this line in the configuration file only if you want to disable the feature.

SearchExportConverter

Enables conversion of additional document formats. For more information, see Document Conversion.

Note

Do not enable or disable this feature directly in the configuration file optionalprocessing.xml, but instead follow the procedure that is described in Enable Advanced Filter Pack (FAST Search Server 2010 for SharePoint) on Microsoft TechNet.

Note

If you change the item processing configuration, you must re-crawl all content that is affected by the item processing configuration change.

The following example shows how to enable the generation of a personnames crawled property that contains person names extracted from the processed content. You enable the stage by changing the value of the active attribute to yes.

<optionalprocessing>
    <processor name="personnameextraction" active="yes"/>
</optionalprocessing>

The following example shows how to enable mapping of XML content to crawled properties.

<optionalprocessing>
    <processor name="XMLMapper" active="yes"/>
</optionalprocessing>

Note

The XMLMapper processing stage requires an additional configuration file for the XML mapping. For information, see Custom XML Item Processing.

Property Extraction

Property extraction is a process that extracts information from the visible textual content of an item and stores that information as additional crawled properties for the document.

There are three built-in property extraction stages in the FAST Search Server 2010 for SharePoint item processing pipeline, which do the following:

  • The person name extractor extracts names of persons, based on a generic dictionary. By default, this stage is disabled, as FAST Search Server 2010 for SharePoint includes other features related to person name extraction (author property and people search feature). If you also want to extract names that are not specific to your company or organization, you can enable the stage in optionalprocessing.xml.

  • The location extractor extracts names of geographical locations, based on a generic dictionary. By default, this stage is enabled. If this property extraction is not relevant for your application, you do not have to map the resulting crawled property to a managed property in the index.

  • The company extractor extracts names of companies, based on a generic dictionary. By default, this stage is enabled. If this property extraction is not relevant for your application, you do not have to map the resulting crawled property to a managed property in the index.

The built-in property extraction stages support the following languages:

  • Arabic

  • Dutch

  • English

  • French

  • German

  • Italian

  • Japanese

  • Norwegian

  • Portuguese

  • Russian

  • Spanish

The default dictionaries for people, locations, and companies were created to achieve reasonable coverage of public news content in the languages listed earlier in this section.

You can modify the built-in property extractors by adding inclusion lists and exclusion lists. For information, see Manage Property Extraction (FAST Search Server 2010 for SharePoint) on Microsoft TechNet.

You can add custom property extractors to the pipeline. For information, see Creating a Custom Property Extractor.

Document Conversion

The processing stage named SearchExportConverter controls the FAST Search Server 2010 for SharePoint Advanced Filter Pack. This feature enables text and metadata extraction from several hundred file formats, complementing the document formats that are supported by the standard Filter Pack. By default, the Advanced Filter Pack is disabled.

Note

Do not enable or disable this feature directly in the configuration file optionalprocessing.xml. Instead, follow the procedure that is described in Enable Advanced Filter Pack (FAST Search Server 2010 for SharePoint) on Microsoft TechNet.

You can also deploy custom IFilter components that are developed for specific file formats. This is controlled via the user_converter_rules.xml configuration file. For information, see Configure FAST Search Server for SharePoint to use a Third-Party IFilter.

Offensive Content Filtering

The FAST Search Server 2010 for SharePoint offensive content filtering is implemented as a separate item processing stage. Item content that is run through the filter is compared to predefined terms in dictionaries. The output of the filter is an overall score that indicates the likelihood that an item is pornographic. The item's offensive score is written to the crawled property OCF::Score. Any item that exceeds the score threshold of 30 is dropped from indexing.
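The threshold rule can be expressed as a simple comparison. The following Python sketch is only an illustration of that rule and does not reproduce how the filter computes the score. The names are hypothetical, and reading "exceeds" as strictly greater than 30 is an assumption about the boundary case.

```python
# Threshold stated in this article for the OCF::Score crawled property.
OCF_SCORE_THRESHOLD = 30


def is_dropped(ocf_score, threshold=OCF_SCORE_THRESHOLD):
    """Return True if an item with this score would be dropped from indexing.

    Assumption: "exceeds" means strictly greater than the threshold, so an
    item that scores exactly 30 is kept.
    """
    return ocf_score > threshold
```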

The FAST Search Server 2010 for SharePoint offensive content filter uses single words and multiword expressions as the basis for the filtering.

By default, the offensive content filter is not enabled. You enable it by using the activation key OffensiveContentFilter in optionalprocessing.xml, as shown in the following example.

<optionalprocessing>
    <processor name="OffensiveContentFilter" active="yes"/>
</optionalprocessing>

Note

The offensive content filter does not use site information and does not take visual information (images) into account. The functionality is limited to pages that contain offensive text. For such pages, it provides a very high identification rate.

Offensive content filtering supports the following languages:

  • Arabic

  • Chinese

  • Czech

  • English

  • Finnish

  • French

  • German

  • Hindi

  • Italian

  • Japanese

  • Korean

  • Lithuanian

  • Norwegian

  • Russian

  • Spanish

  • Swedish

  • Turkish

The offensive content filter scans the crawled properties title, body and ocfcontribution. The last property is not set by the crawlers, but can be used to scan additional content. You can, for example, map custom content to ocfcontribution by using XMLMapper.

Items that are considered pornographic are dropped during processing and appropriate feedback is supplied to the indexing connector.

Metadata Extraction

Certain crawled properties contain metadata from Microsoft Office documents. When an author creates a new document, he or she will typically use a template or another document as a starting point. In many cases the author does not update the metadata, and the metadata will be misleading. For Microsoft Word and Microsoft PowerPoint documents, it is possible to extract date and title information from the content of the document instead. In most cases this generates better metadata.

FAST Search Server 2010 for SharePoint includes an extended metadata extraction stage. When this stage is enabled, the title and date are based on the content of the document, instead of the document metadata.

By default, the extended metadata extraction is enabled. To disable extended metadata extraction, you add the key MetadataExtraction in optionalprocessing.xml, as shown in the following example.

<optionalprocessing>
    <processor name="MetadataExtraction" active="no" />
</optionalprocessing>

If you disable extended metadata extraction, then title and date will be based on the document metadata.

Important

Extended metadata extraction was introduced in FAST Search Server 2010 for SharePoint Service Pack 1, and is enabled by default after you have installed the service pack. The Service Pack upgrade does not modify optionalprocessing.xml.

If you use extended metadata extraction, then two crawled properties are created in the item processing pipeline:

  • Extracted title:

    • Crawled property name: 302

    • Property set: 012357BD-1113-171D-1F25-292BB0B0B0B0

    • Variant type: 31

    • Mapped to managed property: Title

  • Extracted date:

    • Crawled property name: 263

    • Property set: 012357BD-1113-171D-1F25-292BB0B0B0B0

    • Variant type: 64

    • Mapped to managed property: Write
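For scripts that need to refer to these two crawled properties, the values listed above can be kept in a small lookup structure. The following Python sketch only restates the list. The class and function names are hypothetical, and the comments on the variant types use the standard COM VARIANT type codes (31 is VT_LPWSTR, 64 is VT_FILETIME).

```python
from dataclasses import dataclass

# Property set shared by both extracted properties, as listed in this article.
PROPERTY_SET = "012357BD-1113-171D-1F25-292BB0B0B0B0"


@dataclass(frozen=True)
class ExtractedProperty:
    label: str
    crawled_property_name: str
    property_set: str
    variant_type: int
    managed_property: str


EXTRACTED_PROPERTIES = (
    # Variant type 31 is VT_LPWSTR (a Unicode string).
    ExtractedProperty("Extracted title", "302", PROPERTY_SET, 31, "Title"),
    # Variant type 64 is VT_FILETIME (a date and time value).
    ExtractedProperty("Extracted date", "263", PROPERTY_SET, 64, "Write"),
)


def find_by_managed_property(name):
    """Return the extracted property mapped to a managed property, or None."""
    for prop in EXTRACTED_PROPERTIES:
        if prop.managed_property == name:
            return prop
    return None
```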
