Configuring Thesaurus Files

All thesaurus files that are included with Microsoft SQL Server 2005 are formatted as follows.

<XML ID="Microsoft Search Thesaurus">

<!--  Commented out
    <thesaurus xmlns="x-schema:tsSchema.xml">
      <diacritics = false/>
        <expansion>
            <sub>Internet Explorer</sub>
            <sub>IE</sub>
            <sub>IE5</sub>
        </expansion>
        <replacement>
            <pat>NT5</pat>
            <pat>W2K</pat>
            <sub>Windows 2000</sub>
        </replacement>
        <expansion>
            <sub>run</sub>
            <sub>jog</sub>
        </expansion>
    </thesaurus>
-->
</XML>

Each thesaurus file has one or more of the following sections:

  • Expansion set
    An expansion set contains a group of synonyms. These synonyms are identified in code by "substitution" tags (<sub> and </sub>). Queries that contain matches in one substitution are expanded to include all other substitutions in the expansion set.
  • Replacement set
    A replacement set contains a text pattern to be replaced by a substitution set. For an example, see the section "Replacement Set" later in this topic.

Additionally, the thesaurus file includes a <diacritics = false/> tag. The value false indicates that the terms specified in the expansion and replacement sets are accent-insensitive. To make searches that use the thesaurus accent-sensitive, change this tag to <diacritics = true/>. For example, suppose you specify the pattern "café" to be replaced by other terms in a Full-Text Search query. If the thesaurus file is accent-insensitive, Full-Text Search replaces both "café" and "cafe". If the thesaurus file is accent-sensitive, Full-Text Search replaces only "café". This setting can be specified only once in the file and applies to all the search patterns in the file; it cannot be specified for individual patterns.
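The difference between accent-sensitive and accent-insensitive matching can be sketched in a few lines of Python. This is only a conceptual illustration, not how Full-Text Search is implemented; the function names fold_accents and pattern_matches are hypothetical, and the sketch assumes that accent-insensitive matching behaves like stripping Unicode combining marks before comparing.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Return text with combining accent marks removed (illustrative only)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def pattern_matches(pattern: str, term: str, accent_sensitive: bool) -> bool:
    """Compare a thesaurus pattern with a query term.

    accent_sensitive plays the role of the diacritics setting:
    True compares the terms as written, False ignores accents.
    """
    if accent_sensitive:
        # Normalize to one composed form so equal strings compare equal.
        return (unicodedata.normalize("NFC", pattern)
                == unicodedata.normalize("NFC", term))
    return fold_accents(pattern) == fold_accents(term)
```

With the setting off, pattern_matches("café", "cafe", accent_sensitive=False) is true; with it on, only "café" itself matches the pattern "café".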

Important

When you edit thesaurus files in a text editor, you must save the files in Unicode format with a byte order mark (BOM).
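If you generate or rewrite a thesaurus file from a script instead of a text editor, the same requirement applies. The following Python sketch assumes that "Unicode format" means UTF-16 little-endian, which is what Windows tools usually mean by that label; the helper names save_thesaurus and has_utf16_bom are hypothetical.

```python
import codecs
from pathlib import Path


def save_thesaurus(path: Path, xml_text: str) -> None:
    """Write xml_text as UTF-16 LE, preceded by a byte order mark.

    Assumption: "Unicode format" is taken to mean UTF-16 little-endian.
    """
    data = codecs.BOM_UTF16_LE + xml_text.encode("utf-16-le")
    path.write_bytes(data)


def has_utf16_bom(path: Path) -> bool:
    """Return True if the file starts with a UTF-16 byte order mark."""
    with path.open("rb") as handle:
        head = handle.read(2)
    return head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
```

Calling has_utf16_bom on a file before you copy it into place is a quick way to catch an editor that silently saved the file as ANSI or as UTF-8.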

Expansion Set

Each expansion set is enclosed within <expansion> tags. Within the expansion set, you specify one or more substitutions, each enclosed by <sub> tags. The substitutions in an expansion set are treated as synonyms of each other.

For example, you can edit the expansion section to treat the substitutions "writer", "author", and "journalist" as synonyms. Full-Text Search queries that contain matches in one substitution are expanded to include all other substitutions specified in the expansion set. Therefore, when you issue a FORMSOF THESAURUS or a FREETEXT query for the word "author", Full-Text Search also returns search results containing the words "writer" and "journalist".

The expansion set for this example looks like this:

<expansion>
    <sub>writer</sub>
    <sub>author</sub>
    <sub>journalist</sub>
</expansion>
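To make the expansion behavior concrete, the following Python sketch parses an expansion fragment and expands a query term. It only models the behavior described here and is not the Full-Text Search implementation; load_expansion and expand_term are hypothetical names, and the sketch assumes exact, case-sensitive matching of single terms.

```python
import xml.etree.ElementTree as ET

EXPANSION_XML = """
<expansion>
    <sub>writer</sub>
    <sub>author</sub>
    <sub>journalist</sub>
</expansion>
"""


def load_expansion(xml_fragment: str) -> list[str]:
    """Return the substitutions listed in one <expansion> fragment."""
    root = ET.fromstring(xml_fragment.strip())
    return [sub.text.strip() for sub in root.findall("sub") if sub.text]


def expand_term(term: str, expansion: list[str]) -> list[str]:
    """Expand term to every synonym in the set, or leave it unchanged."""
    if term in expansion:
        return list(expansion)
    return [term]
```

Here expand_term("author", load_expansion(EXPANSION_XML)) yields all three synonyms, including the original term, while a term that is not in the set comes back unchanged.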

Replacement Set

Each replacement set is enclosed within <replacement> tags. Within each replacement set, you specify one or more patterns, each enclosed by <pat> tags, and one or more substitutions, each enclosed by <sub> tags. A query term that matches a pattern is replaced by the substitutions. Patterns and substitutions can each contain a single word or a sequence of words.

For example, suppose you want queries for the pattern "W2K" to be replaced by the substitutions "Windows 2000" or "XP". If you run a full-text query for "W2K", Full-Text Search returns only search results containing "Windows 2000" or "XP". It does not return results containing "W2K", because the pattern "W2K" has been replaced by the substitutions "Windows 2000" and "XP".

The replacement set for this example looks like this:

<replacement>
    <pat>W2K</pat>
    <sub>Windows 2000</sub>
    <sub>XP</sub>
</replacement>
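The key difference from an expansion set is that the pattern itself is dropped from the rewritten query. The following Python sketch models that behavior for a single term; it is an illustration under the assumption of exact, case-sensitive matching, and load_replacement and rewrite_term are hypothetical names.

```python
import xml.etree.ElementTree as ET

REPLACEMENT_XML = """
<replacement>
    <pat>W2K</pat>
    <sub>Windows 2000</sub>
    <sub>XP</sub>
</replacement>
"""


def load_replacement(xml_fragment: str) -> dict[str, list[str]]:
    """Map every <pat> in one <replacement> fragment to its <sub> list."""
    root = ET.fromstring(xml_fragment.strip())
    subs = [sub.text.strip() for sub in root.findall("sub") if sub.text]
    return {pat.text.strip(): list(subs)
            for pat in root.findall("pat") if pat.text}


def rewrite_term(term: str, rules: dict[str, list[str]]) -> list[str]:
    """Replace term by its substitutions; the pattern itself is not kept."""
    return rules.get(term, [term])
```

For the fragment above, rewrite_term("W2K", load_replacement(REPLACEMENT_XML)) returns the two substitutions and no longer contains "W2K".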

If a query matches patterns in two replacement sets, the longer pattern takes precedence. For example, if you run a FORMSOF THESAURUS query for "Internet Explorer online community" and you have the following replacement sets, the "Internet Explorer" replacement set takes precedence over the "Internet" replacement set. The query is therefore processed as "IE online community" or "IE 5 online community".

<replacement>
    <pat>Internet</pat>
    <sub>intranet</sub>
</replacement>

and

<replacement>
    <pat>Internet Explorer</pat>
    <sub>IE</sub>
    <sub>IE 5</sub>
</replacement>
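The longest-pattern rule can be modeled as a greedy scan over the words of the query. The following Python sketch is one plausible way to reproduce the result described here, not the actual Full-Text Search algorithm; the RULES table and the rewrite_query function are hypothetical, and the sketch assumes whitespace-separated words and exact matching.

```python
# Hypothetical in-memory form of the two replacement sets shown above.
RULES = {
    "Internet": ["intranet"],
    "Internet Explorer": ["IE", "IE 5"],
}


def rewrite_query(query: str, rules: dict[str, list[str]]) -> list[str]:
    """Rewrite query, preferring the longest pattern at each word position."""
    words = query.split()
    max_len = max((len(pattern.split()) for pattern in rules), default=0)
    variants: list[list[str]] = [[]]
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in rules:
                # One rewritten query per substitution.
                variants = [v + [sub] for v in variants
                            for sub in rules[candidate]]
                i += n
                break
        else:
            # No pattern starts at this word; keep it unchanged.
            variants = [v + [words[i]] for v in variants]
            i += 1
    return [" ".join(v) for v in variants]
```

Because the two-word phrase is tried before the single word, "Internet Explorer online community" is rewritten with "IE" and "IE 5", and the shorter "Internet" pattern applies only when "Explorer" does not follow.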

See Also

Concepts

Full-Text Search Architecture
Thesaurus
Full-Text Search

Other Resources

CONTAINS (Transact-SQL)
FREETEXT (Transact-SQL)
FREETEXTTABLE (Transact-SQL)

Change History

Release History

12 December 2006

Changed content:
  • Corrected the syntax of the <diacritics_sensitive> tag to <diacritics = false/> and updated the explanation of this tag.
New content:
  • Added the Important note stating that thesaurus files must be saved in Unicode format and Byte Order Marks must be specified.

17 July 2006

New content:
  • Clarified the meaning of the <diacritics_sensitive> tag.