Creating a Custom Property Extractor
Published: July 2010
Applies to: Microsoft FAST Search Server 2010 for SharePoint
A property extractor enables you to automatically extract entities or concepts from the visible text content of an item and map that to a managed property. In turn, you can use these properties for narrowing queries by using property filters, or as query refinement options.
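This article describes how to create a custom property extractor: you create a custom dictionary, configure the item processing stage, upload the dictionary to the resource store, and map the extracted crawled property to a managed property.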
The configuration file format for custom property extractors has changed in FAST Search Server 2010 for SharePoint Service Pack 1. This article describes the new format. For information about the old configuration file format, see Migrating Custom Property Extractors That Are Defined Prior to Service Pack 1.
The custom dictionary defines which words will be searched for in the indexed items and indexed in the associated managed property. For information about the XML file syntax, see Linguistic Dictionary Schema (FAST Search Server 2010 for SharePoint). The custom dictionary must be in the same format as indicated in the following example. The custom dictionary must be saved in UTF-8 format without a byte order mark (BOM). Each entry has a key and an optional value.
The key is the string that must be present in the item. Depending on the type of property extractor, the matching of the key can be case-sensitive or case-insensitive. For more information, see Property Extractor Types.
A key should not contain any apostrophes. If it does, the term will never be matched.
You can configure the property extractor to extract either the key or the value. You configure this behavior by using the yield-value attribute in the property extractor definition. It is often useful to specify a normalized representation of the word or phrase in the value attribute. If yield-value is set to yes and an entry has no value attribute, no entities are extracted for that key.
Ensure there are no spaces or new lines after the closing dictionary tag, or the dictionary will generate an error.
The following example defines a property extraction dictionary that will extract terms related to wine terminology. In order to handle case variations, you use a property extractor of type Verbatim.
<dictionary>
  <entry key="wine" value="wine" />
  <entry key="red wine" value="red wine" />
  <entry key="white wine" value="white wine" />
  <entry key="chardonnay" value="Chardonnay" />
  <entry key="chiraz" value="Chiraz" />
</dictionary>
The associated property extractor will extract these wine-related terms to the crawled property you have associated with the property extractor. The matching will handle different casing of the terms, and normalize the casing in the resulting crawled property.
You can find templates for the property extraction dictionaries in the resource store folder on the administration server, in the path <FASTSearchFolder>\components\resourcestore\dictionaries\matching\.
If there is an error in the format of your dictionary, you are informed of it only when feeding an item, not when uploading the dictionary to the resource store. The item processing log will contain an error indicating that the dictionary file cannot be compiled because there is an error in the automaton (compiled dictionary format).
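Because format errors surface only when an item is fed, it can help to check the dictionary file before you upload it. The following Python sketch is not part of FAST Search Server 2010 for SharePoint; check_dictionary is a hypothetical helper that tests only the rules stated in this section (UTF-8 without a BOM, nothing after the closing dictionary tag, no apostrophes in keys, and a value for every key when the value is to be extracted). It does not reproduce the product's own dictionary compilation.

```python
import xml.etree.ElementTree as ET

UTF8_BOM = b"\xef\xbb\xbf"


def check_dictionary(raw, yield_value=True):
    """Return a list of problems found in the raw bytes of a dictionary file."""
    problems = []
    if raw.startswith(UTF8_BOM):
        problems.append("file starts with a UTF-8 byte order mark")
        raw = raw[len(UTF8_BOM):]  # keep checking the rest of the file
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return problems + ["file is not valid UTF-8"]
    if text != text.rstrip():
        problems.append("whitespace or new lines after the closing dictionary tag")
    try:
        root = ET.fromstring(raw)
    except ET.ParseError as err:
        return problems + ["XML is not well formed: %s" % err]
    for entry in root.iter("entry"):
        key = entry.get("key", "")
        if not key:
            problems.append("entry without a key attribute")
        if "'" in key:
            problems.append("key %r contains an apostrophe" % key)
        # Assumption: yield_value mirrors the extractor's yield-value setting.
        if yield_value and entry.get("value") is None:
            problems.append("key %r has no value, so nothing is extracted for it" % key)
    return problems
```

To use it, read the file in binary mode, for example check_dictionary(open(r"c:\temp\wine_dictionary.xml", "rb").read()). An empty list means none of the checked rules were violated.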
Property Extractor Types
You can define property extractors based on whole-word matching or word-part matching.
Whole Word Matching Property Extractors
These property extractors are suited to match strings in all languages except East Asian languages (Korean, Chinese, Japanese, and Thai).
The entries in the custom dictionary can be single words or a string of words. The matching of the string is performed after a basic tokenization, which replaces separator characters (such as comma, punctuation mark, colon, and dash) in the text with spaces. The extractors must match the complete string after the basic tokenization.
Use the property extractor type Verbatim if you want case-insensitive matching. Use the property extractor type WholeWords if you want case-sensitive matching.
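The matching behavior described above can be illustrated with a short Python sketch. It is a simplified model, not the product's matcher: the separator set in SEPARATORS is an assumption based on the characters named in the text, and tokenize and extract_whole_words are hypothetical names. The case_sensitive flag stands in for the choice between WholeWords and Verbatim, and yield_value stands in for the yield-value setting.

```python
import re

# Assumed separator characters; the product's actual list may differ.
SEPARATORS = re.compile(r"[,.:;!?\-]")


def tokenize(text):
    """Replace separator characters with spaces and collapse runs of spaces."""
    return " ".join(SEPARATORS.sub(" ", text).split())


def extract_whole_words(text, dictionary, case_sensitive=False, yield_value=True):
    """Return dictionary hits that match complete words or phrases in text.

    dictionary maps each key to its value; the value may be None.
    """
    haystack = " " + tokenize(text) + " "
    if not case_sensitive:
        haystack = haystack.lower()
    hits = []
    for key, value in dictionary.items():
        needle = " " + tokenize(key if case_sensitive else key.lower()) + " "
        if needle not in haystack:
            continue
        if not yield_value:
            hits.append(key)
        elif value is not None:
            hits.append(value)  # entries without a value yield nothing
    return hits


wine = {"wine": "wine", "red wine": "red wine", "chardonnay": "Chardonnay"}
print(extract_whole_words("A crisp CHARDONNAY, then red-wine.", wine))
# prints ['wine', 'red wine', 'Chardonnay']
```

In this model the text "winery" produces no hit for the key "wine", because the key must match a complete token sequence.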
Word Part (Substring) Matching Property Extractors
These property extractors are suited to match strings in all documents in East Asian languages (Korean, Chinese, Japanese, or Thai), as the words in these languages are not separated by spaces.
You can also use these property extractors for specific use cases in which substring matches are needed, for example, searching for a DNA sequence within longer sequences. In this case, the custom dictionary would contain the interesting DNA sequences, for example, "AAAGTCTGAC". It would match a document that contains the sequence "ATATGAATGGAAAGTCTGACTGATATCTGG".
Use the property extractor type Substring if you want case-insensitive matching. Use the property extractor type WordParts if you want case-sensitive matching.
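As a rough illustration of substring matching, the following Python sketch checks whether each dictionary key occurs anywhere inside the text, with no regard for word boundaries. The function name extract_substrings is hypothetical, and the sketch only models the behavior described here; the case_sensitive flag stands in for the choice between WordParts and Substring.

```python
def extract_substrings(text, keys, case_sensitive=False):
    """Return the keys that occur anywhere inside text."""
    haystack = text if case_sensitive else text.lower()
    hits = []
    for key in keys:
        needle = key if case_sensitive else key.lower()
        if needle in haystack:
            hits.append(key)
    return hits


sequence = "ATATGAATGGAAAGTCTGACTGATATCTGG"
print(extract_substrings(sequence, ["AAAGTCTGAC", "CCCC"]))
# prints ['AAAGTCTGAC']
```

The same approach finds a key such as "chardonnay" when it is embedded directly between Chinese or Japanese characters, where whole-word matching would find no surrounding spaces.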
Items with Both East Asian and Non-East Asian Content
If the custom dictionary entry should match document strings surrounded by words in an East Asian language, then a word-part matcher should be used.
This is because foreign language words in a Chinese-language document or Japanese-language document are not always separated from the Chinese or Japanese characters by a space.
You must configure the custom property extraction stage in the XML configuration file named CustomPropertyExtractors.xml.
To configure the item processing stage
On the FAST Search Server 2010 for SharePoint administration server, edit <FASTSearchFolder>\etc\config_data\DocumentProcessor\CustomPropertyExtractors.xml.
Here, <FASTSearchFolder> is the path of the folder where you have installed FAST Search Server 2010 for SharePoint, for example C:\FASTSearch.
If this is your first custom property extractor, you must create the configuration file with your property extractor configuration.
To add another property extractor, add a new extractor element to the existing configuration file.
For more information about the configuration file format, see Custom Property Extractor Schema (FAST Search Server 2010 for SharePoint).
The following configuration file example creates one property extractor for wine terms. The extractor performs case-insensitive whole word matching, and adds the extracted terms to the crawled property named mywineterms. The dictionary file used is winedictionary.xml.
If the custom property extraction dictionary is of type Verbatim or Substring, you should normalize the dictionary before uploading to the resource store. In this way you ensure that all words and phrases are in lowercase.
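To see which entries normalization would affect, you could list the keys that still contain uppercase characters. The following Python sketch is an illustration only and does not replace the normalization tool; keys_not_lowercase is a hypothetical helper, and it assumes that the keys are what must be lowercase for matching (how the tool treats value attributes is not covered here).

```python
import xml.etree.ElementTree as ET


def keys_not_lowercase(xml_bytes):
    """Return the dictionary keys that still contain uppercase characters."""
    root = ET.fromstring(xml_bytes)
    offenders = []
    for entry in root.iter("entry"):
        key = entry.get("key")
        if key is not None and key != key.lower():
            offenders.append(key)
    return offenders


sample = (b'<dictionary><entry key="Merlot" value="Merlot" />'
          b'<entry key="wine" value="wine" /></dictionary>')
print(keys_not_lowercase(sample))
# prints ['Merlot']
```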
Use the tool <FASTSearchFolder>\bin\lowercase to normalize the dictionary.
In the following code example, the name of your edited dictionary file is c:\temp\wine_dictionary.xml, and the normalized dictionary name is c:\temp\wine_dictionary_normalized.xml.
Upload the custom property extraction dictionary to the FAST Search Server 2010 for SharePoint resource store by using the Windows PowerShell command Add-FASTSearchResource.
In the following code example, the name of your dictionary file is c:\temp\wine_dictionary_normalized.xml, and the dictionary name in the resource store is winedictionary.xml. On the administration server, in a FAST Search Server 2010 for SharePoint command shell, run the following command.
Add-FASTSearchResource -FilePath c:\temp\wine_dictionary_normalized.xml -Path dictionaries\matching\winedictionary.xml
FilePath specifies the local path to your custom property extraction dictionary file. Path specifies the relative path of the dictionary in the resource store. The file name that is used in Path must be the same as the dictionary name in CustomPropertyExtractors.xml.
For information about the resource store commands, see Administration cmdlets on Microsoft TechNet.
On the administration server, type the following command:
This resets all currently running item processors in the system, and activates the new item processing configuration.
To use the extracted data in queries or query refinement, you must create the crawled property and map it to a managed property within the index schema.
All extracted crawled properties must be in the crawled property category named MESG Linguistics with property set value 48385c54-cdfc-4e84-8117-c95b3cf8911c. Before you create the new crawled property, you can list all existing crawled properties in this category:
(Get-FASTSearchMetadataCategory -Name 'MESG Linguistics').GetAllCrawledProperties() | Select-Object name
The following Windows PowerShell example creates a crawled property named mywineterms, creates a managed property named wineterms, and maps the crawled property to the managed property.
$cp = New-FASTSearchMetadataCrawledProperty -Name mywineterms -Propset 48385c54-cdfc-4e84-8117-c95b3cf8911c -VariantType 31
$mp = New-FASTSearchMetadataManagedProperty -Name wineterms -type 1
$mp.StemmingEnabled=0
$mp.RefinementEnabled=1
$mp.MergeCrawledProperties=1
$mp.Update()
New-FASTSearchMetadataCrawledPropertyMapping -ManagedProperty $mp -CrawledProperty $cp
In New-FASTSearchMetadataCrawledProperty, you set the variant type for the crawled property to 31 (string).
In New-FASTSearchMetadataManagedProperty, you set the type of the managed property to 1. This specifies that it will have the data type string.
You disable stemming for the managed property by using the command $mp.StemmingEnabled=0. In most cases, this is the desired behavior for an extracted property.
You enable query refinement for the managed property by using the command $mp.RefinementEnabled=1.
You enable multivalued data for the managed property by using the command $mp.MergeCrawledProperties=1.
For information about the Windows PowerShell commands, see Manage Crawled Properties by Using Windows PowerShell and Manage Managed Properties by Using Windows PowerShell on Microsoft TechNet.
You can also configure the property mapping by using the FAST Search Server 2010 for SharePoint Central Administration graphical user interface. For more information, see Property Management on Microsoft Office.com.
You must perform a full re-crawl to apply the property extraction on existing content.
Although your custom extractor is now configured, a refiner for it does not appear in the search UI until you add one to the Refinement Panel Web Part.
Follow the steps in Adding a Refiner to the Refinement Panel Web Part to configure the refiner in the Web Part.
When you have configured the custom item processing and the index schema, you can submit a test document to verify that the properties are extracted and are displayed in the query results.
For information about how to debug the custom item processing, see Debugging Custom Item Processing.
For information about how to test that the extracted properties are displayed in query results, see Testing Query Features.
If no properties are extracted, you can inspect the item processing log and look for the following error messages.
Log messages:
VERBOSE systemmsg Configserver call: LoadConfigFileBase64 ('DocumentProcessor', 'CustomPropertyExtractors.xml').
WARNING systemmsg Error parsing configuration: Please check the syntax.
Possible error cause: There is an error in the configuration file (CustomPropertyExtractors.xml).

Log messages:
VERBOSE systemmsg Configserver call: LoadConfigFileBase64 ('DocumentProcessor', 'CustomPropertyExtractors.xml')
ERROR systemmsg Custom Extractor (Myextractor): Failed to instantiate the matcher from its configuration string.
ERROR systemmsg Custom Extractor (Myextractor): Reason was: RuntimeError: Matcher creation failed.
WARNING systemmsg Failed to instantiate type "verbatim": Extractor "Myextractor" deactivated.
Possible error cause: The dictionary is not available in the resource store.

Log messages:
Configserver call: LoadConfigFileBase64 ('DocumentProcessor', 'CustomPropertyExtractors.xml')
WARNING systemmsg Unknown type "foo": Extractor "Myextractor" deactivated.
Possible error cause: The property extractor type specified in the configuration file is not valid.
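If the item processing log is long, a small script can help you locate these messages. The following Python sketch is an assumption-laden illustration: diagnose and KNOWN_ERRORS are hypothetical names, the patterns are copied from the messages listed above, and the sketch takes the log lines as input because the location of the log file is not covered here.

```python
import re

# Patterns taken from the log messages listed above, each with its likely cause.
KNOWN_ERRORS = [
    (re.compile(r"Error parsing configuration"),
     "There is an error in CustomPropertyExtractors.xml."),
    (re.compile(r"Failed to instantiate type"),
     "The dictionary is not available in the resource store."),
    (re.compile(r'Unknown type "[^"]*"'),
     "An incorrect type is specified."),
]


def diagnose(log_lines):
    """Return (line, cause) pairs for each known error message in log_lines."""
    findings = []
    for line in log_lines:
        for pattern, cause in KNOWN_ERRORS:
            if pattern.search(line):
                findings.append((line.strip(), cause))
                break  # report at most one cause per line
    return findings


log = [
    "VERBOSE systemmsg Configserver call: LoadConfigFileBase64 "
    "('DocumentProcessor', 'CustomPropertyExtractors.xml')",
    'WARNING systemmsg Unknown type "foo": Extractor "Myextractor" deactivated.',
]
for line, cause in diagnose(log):
    print(cause, "<-", line)
```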