In SQL Server, full-text queries can search for synonyms of user-specified terms through the use of a thesaurus. A SQL Server thesaurus defines a set of synonyms for a specific language. System administrators can define two forms of synonyms: expansion sets and replacement sets. By developing a thesaurus tailored to your full-text data, you can effectively broaden the scope of full-text queries on that data. Thesaurus matching occurs only for CONTAINS and CONTAINSTABLE queries that specify the FORMSOF THESAURUS clause and for FREETEXT and FREETEXTTABLE queries.
Before full-text search queries on your server instance can look for synonyms in a given language, you must define thesaurus mappings (synonyms) for that language. Each thesaurus must be manually configured to define its diacritics setting, its expansion sets, and its replacement sets.
SQL Server provides a set of XML thesaurus files, one for each supported language. These files are essentially empty. They contain only the top-level XML structure that is common to all SQL Server thesauruses and a commented-out sample thesaurus.
The thesaurus files that are released with SQL Server 2008 all contain the following XML code:
<XML ID="Microsoft Search Thesaurus">
<!-- Commented out
    <thesaurus xmlns="x-schema:tsSchema.xml">
        <diacritics_sensitive>0</diacritics_sensitive>
        <expansion>
            <sub>Internet Explorer</sub>
            <sub>IE</sub>
            <sub>IE5</sub>
        </expansion>
        <replacement>
            <pat>NT5</pat>
            <pat>W2K</pat>
            <sub>Windows 2000</sub>
        </replacement>
        <expansion>
            <sub>run</sub>
            <sub>jog</sub>
        </expansion>
    </thesaurus>
-->
</XML>
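To make the structure of the file concrete, the following Python sketch parses an uncommented copy of the sample thesaurus with the standard library's xml.etree.ElementTree module and lists its expansion and replacement sets. The helper name parse_thesaurus and the layout of the returned dictionary are illustrative assumptions; they are not part of SQL Server.

```python
import xml.etree.ElementTree as ET

# Uncommented copy of the sample thesaurus found in the shipped files.
SAMPLE = """<XML ID="Microsoft Search Thesaurus">
  <thesaurus xmlns="x-schema:tsSchema.xml">
    <diacritics_sensitive>0</diacritics_sensitive>
    <expansion>
      <sub>Internet Explorer</sub>
      <sub>IE</sub>
      <sub>IE5</sub>
    </expansion>
    <replacement>
      <pat>NT5</pat>
      <pat>W2K</pat>
      <sub>Windows 2000</sub>
    </replacement>
    <expansion>
      <sub>run</sub>
      <sub>jog</sub>
    </expansion>
  </thesaurus>
</XML>"""

# The <thesaurus> element declares a default namespace, so its children
# must be looked up with a namespace prefix.
NS = {"ts": "x-schema:tsSchema.xml"}


def parse_thesaurus(xml_text):
    """Hypothetical helper: return the diacritics flag and the synonym sets."""
    root = ET.fromstring(xml_text)
    thesaurus = root.find("ts:thesaurus", NS)
    flag = thesaurus.findtext("ts:diacritics_sensitive", default="0", namespaces=NS)
    expansions = [
        [sub.text for sub in exp.findall("ts:sub", NS)]
        for exp in thesaurus.findall("ts:expansion", NS)
    ]
    replacements = [
        {
            "pat": [pat.text for pat in rep.findall("ts:pat", NS)],
            "sub": [sub.text for sub in rep.findall("ts:sub", NS)],
        }
        for rep in thesaurus.findall("ts:replacement", NS)
    ]
    return {
        "diacritics_sensitive": flag.strip() == "1",
        "expansions": expansions,
        "replacements": replacements,
    }


if __name__ == "__main__":
    print(parse_thesaurus(SAMPLE))
```

Because the shipped files keep the thesaurus element inside a comment, running the same helper on an unedited file would find no thesaurus element at all; the sample must be uncommented first.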
The default location of the thesaurus files is:
SQL_Server_install_path\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\FTDATA\
This default location contains the language-specific thesaurus files and the global thesaurus file.
You can change the location and names of a thesaurus file by changing its registry key. For each language, the location of the thesaurus file is specified in the following value in the registry:
HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\<instance name>\MSSearch\Language\<language-abbreviation>\TsaurusFile
The global thesaurus file corresponds to the Neutral language with LCID 0. This value can be changed by administrators only.
A thesaurus query uses both a language-specific thesaurus and the global thesaurus. First, the query looks up the language-specific file and loads it for processing (unless it is already loaded). The query is expanded to include the language-specific synonyms specified by the expansion set and replacement set rules in the thesaurus file. These steps are then repeated for the global thesaurus. However, if a term is already part of a match in the language-specific thesaurus file, the term is ineligible for matching in the global thesaurus.
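The lookup order described above can be sketched in Python as follows. This is a simplified model, not SQL Server's implementation: each thesaurus is reduced to a hypothetical dictionary that maps a lowercase term to its synonyms, every rule is treated like an expansion, and the function name expand_query is invented for illustration.

```python
def expand_query(terms, language_thesaurus, global_thesaurus):
    """Expand each term using the language-specific thesaurus first.

    A term that matches in the language-specific thesaurus is not
    looked up again in the global thesaurus.
    """
    expanded = {}
    for term in terms:
        key = term.lower()
        if key in language_thesaurus:
            # Matched here, so the global thesaurus is skipped for this term.
            expanded[term] = [term] + language_thesaurus[key]
        elif key in global_thesaurus:
            expanded[term] = [term] + global_thesaurus[key]
        else:
            expanded[term] = [term]
    return expanded


if __name__ == "__main__":
    language = {"run": ["jog"]}
    neutral = {"run": ["sprint"], "ie": ["Internet Explorer"]}
    print(expand_query(["run", "IE", "walk"], language, neutral))
```

In this example, "run" picks up only "jog" because the language-specific match makes it ineligible for the global entry, while "IE" falls through to the global thesaurus.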
Each thesaurus file defines an XML container whose ID is Microsoft Search Thesaurus, and a comment, <!-- … -->, that contains a sample thesaurus. The thesaurus is defined in a <thesaurus> element that contains samples of the child elements that define the diacritics setting, expansion sets, and replacement sets, as follows:
Accent insensitive: value 0, written as <diacritics_sensitive>0</diacritics_sensitive>
Accent sensitive: value 1, written as <diacritics_sensitive>1</diacritics_sensitive>
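As a rough illustration of what the diacritics setting means for matching, the Python sketch below compares two terms with and without accent folding. It assumes that accent insensitivity can be approximated by stripping Unicode combining marks; this is an illustrative model only, and the function names fold_diacritics and terms_match are hypothetical rather than anything SQL Server exposes.

```python
import unicodedata


def fold_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def terms_match(a, b, diacritics_sensitive):
    """Compare two terms the way the 0/1 setting suggests (assumed model)."""
    if diacritics_sensitive:
        return a == b
    return fold_diacritics(a) == fold_diacritics(b)


if __name__ == "__main__":
    print(terms_match("caf\u00e9", "cafe", diacritics_sensitive=False))
    print(terms_match("caf\u00e9", "cafe", diacritics_sensitive=True))
```

With the setting at 0 the accented and unaccented spellings are treated as the same term in this model; with the setting at 1 they are distinct.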
<expansion>
    <sub>writer</sub>
    <sub>author</sub>
    <sub>journalist</sub>
</expansion>

<replacement>
    <pat>W2K</pat>
    <sub>Windows 2000</sub>
    <sub>XP</sub>
</replacement>

<replacement>
    <pat>Internet</pat>
    <sub>intranet</sub>
</replacement>

<replacement>
    <pat>Internet Explorer</pat>
    <sub>IE</sub>
    <sub>IE 5</sub>
</replacement>
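The difference between the two kinds of sets can be sketched in Python using the sample sets above. The sketch assumes that a term in an expansion set is searched together with every other member of the set, whereas a term that matches a replacement pattern is searched only as its substitutes. Matching here is a case-insensitive comparison of the whole phrase, so how overlapping patterns are resolved inside longer phrases is not modeled. The function name thesaurus_forms is hypothetical.

```python
# Sample sets from the XML above, reduced to plain Python data.
EXPANSIONS = [["writer", "author", "journalist"]]
REPLACEMENTS = [
    (["W2K"], ["Windows 2000", "XP"]),
    (["Internet"], ["intranet"]),
    (["Internet Explorer"], ["IE", "IE 5"]),
]


def thesaurus_forms(phrase):
    """Return the phrases a thesaurus query would search for (simplified)."""
    needle = phrase.lower()

    # Expansion set: every member, including the original term, is searched.
    for members in EXPANSIONS:
        if needle in (member.lower() for member in members):
            return list(members)

    # Replacement set: the pattern is dropped and only substitutes are searched.
    for patterns, substitutes in REPLACEMENTS:
        if needle in (pattern.lower() for pattern in patterns):
            return list(substitutes)

    # No thesaurus rule applies, so the phrase is searched as typed.
    return [phrase]


if __name__ == "__main__":
    for phrase in ["author", "w2k", "Internet", "Internet Explorer", "database"]:
        print(phrase, "->", thesaurus_forms(phrase))
```

Under these assumptions, a query on "author" also looks for "writer" and "journalist", while a query on "W2K" looks for "Windows 2000" and "XP" but not for "W2K" itself.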
To edit a thesaurus file
To load an updated thesaurus file
To view the tokenization result of a word breaker, thesaurus, and stoplist combination