Languages Supported by Windows Search
This topic describes how Windows Search supports multiple languages.
Windows Search is language-independent, but the accuracy of search across languages may vary because of the way wordbreakers tokenize text. Wordbreakers implement various tokenization rules for languages and break text into individual tokens, or words, to be indexed or searched.
Both the indexed text and the query string are broken into tokens. Because tokenization rules vary by language, there are separate wordbreakers for each language or family of languages. If the query language does not match the language of the indexed text, the results can be unpredictable.
Windows Search ships with a well-defined set of wordbreakers. Classic wordbreaker and stemmer components are supported in Windows Vista and later. If the language of a document cannot be determined, Windows Search attempts to detect it in order to identify the most appropriate wordbreaker. It first calls the GetSystemPreferredUILanguages function to determine the first Multilingual User Interface (MUI) language (typically the system UI language unless MUI language packs are installed). If that call succeeds, the wordbreaker for the first MUI language is used. If the call fails, Windows Search retrieves the system locale by calling the GetSystemDefaultLCID function and uses the wordbreaker associated with that locale.
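The fallback order described above can be modeled in a few lines. The following Python sketch is illustrative only: the function name `choose_wordbreaker_language` and the two callables passed to it are hypothetical stand-ins for GetSystemPreferredUILanguages and GetSystemDefaultLCID, not real bindings, and the sketch makes no claim about how Windows Search is implemented internally. Treating an empty language list as a failed call is also an assumption.

```python
from typing import Callable, List, Optional


def choose_wordbreaker_language(
    document_language: Optional[str],
    get_preferred_ui_languages: Callable[[], List[str]],
    get_system_default_locale: Callable[[], str],
) -> str:
    """Return the language whose wordbreaker would be used (hypothetical model)."""
    # 1. Use the document's own language when it is known.
    if document_language:
        return document_language
    # 2. Otherwise try the first preferred UI (MUI) language.
    try:
        languages = get_preferred_ui_languages()
        if languages:
            return languages[0]
    except OSError:
        pass  # In this model, an OSError means "the call failed".
    # 3. Fall back to the system default locale.
    return get_system_default_locale()


def failing_call() -> List[str]:
    """Simulate a failed call to the preferred-UI-languages lookup."""
    raise OSError("simulated failure")


print(choose_wordbreaker_language(None, lambda: ["nl-NL", "en-US"], lambda: "en-US"))  # nl-NL
print(choose_wordbreaker_language(None, failing_call, lambda: "cs-CZ"))  # cs-CZ
```

Passing the lookups in as callables keeps the sketch runnable on any platform; on Windows they could be replaced by real calls to the two functions.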
If no wordbreaker is installed for a language, Windows Search breaks on white space by using the Neutral wordbreaker.
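To make the effect of white-space breaking concrete, here is a minimal Python sketch. The function `neutral_break` is a hypothetical toy stand-in, not the Neutral wordbreaker itself, which may apply additional rules beyond what is shown here.

```python
from typing import List


def neutral_break(text: str) -> List[str]:
    """Split text on runs of white space (toy model of white-space breaking)."""
    return text.split()


# Text in a language that separates words with spaces tokenizes as expected.
print(neutral_break("Windows Search  indexes\ttext"))
# ['Windows', 'Search', 'indexes', 'text']

# Text written without spaces between words stays a single token.
print(neutral_break("全文検索"))
# ['全文検索']
```

The second call shows why a language-specific wordbreaker matters: when the whole phrase is indexed as one token, a query for a single word inside it is unlikely to match.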
You can remove a language through the registry, as illustrated in the following example.
HKEY_LOCAL_MACHINE
   SYSTEM
      CurrentControlSet
         Control
            ContentIndex
               Language
                  Dutch_Dutch
                     (Default)
                     Locale
                     NoiseFile
                     StemmerClass = CLSID
                     WBreakerClass = CLSID
When Windows Search requires a new wordbreaker, the class identifier (CLSID) is read, and the instantiated wordbreaker is cached.
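The read-once, reuse-afterwards pattern described above can be sketched as a small cache keyed by CLSID. Everything in this Python example is hypothetical: `WordBreakerCache`, the `create_instance` factory, and the all-zero placeholder CLSID are illustration only and do not correspond to any Windows Search API.

```python
from typing import Callable, Dict


class WordBreakerCache:
    """Create a wordbreaker the first time its CLSID is requested, then reuse it."""

    def __init__(self, create_instance: Callable[[str], object]) -> None:
        self._create_instance = create_instance  # hypothetical factory
        self._instances: Dict[str, object] = {}

    def get(self, clsid: str) -> object:
        key = clsid.upper()  # GUID strings are compared case-insensitively
        if key not in self._instances:
            self._instances[key] = self._create_instance(key)
        return self._instances[key]


created = []  # records every CLSID the factory was asked to instantiate


def fake_factory(clsid: str) -> object:
    created.append(clsid)
    return object()


cache = WordBreakerCache(fake_factory)
placeholder = "{00000000-0000-0000-0000-000000000000}"  # placeholder, not a real CLSID
first = cache.get(placeholder)
second = cache.get(placeholder)
print(first is second, len(created))  # True 1
```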
You can create a custom wordbreaker for a language by implementing the IWordBreaker interface. Windows Search then calls the IWordBreaker methods when it builds content indexes and runs queries.
Locale information for indexed content is retrieved from the source of the content. If the source implementer does not know the locale of the indexed content, it should set the locale to LOCALE_NEUTRAL.
For example, if you implement a filter handler (an implementation of the IFilter interface), property handler, or protocol handler, you should set the locale for indexed content to LOCALE_NEUTRAL unless you have specific locale information and are confident of its accuracy.
Windows Search includes wordbreakers to support the following languages.
|Registry key|Language (sublanguage)|LCID|
|---|---|---|
|Chinese_HongKong|Chinese (Hong Kong SAR, PRC)|0x0C04|
|Czech_Default|Czech (Czech Republic)|0x0405|
|English_UK|English (United Kingdom)|0x0809|
|English_US|English (United States)|0x0409|
|Norwegian_Bokmal|Norwegian (Bokmål, Norway)|0x0414|
|Serbian_Cyrillic|Serbian (Serbia and Montenegro, Former, Cyrillic)|0x0C1A|
|Serbian_Latin|Serbian (Serbia and Montenegro, Former, Latin)|0x081A|
|Spanish_Modern|Spanish (Spain, Modern Sort)|0x0C0A|
For more information about languages and associated identifiers, see Language Identifier Constants and Strings.
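The LCID values in the table pack a primary language identifier and a sublanguage identifier into one number. The Python sketch below decomposes an LCID using the same bit layout as the Windows LANGIDFROMLCID, PRIMARYLANGID, SUBLANGID, and SORTIDFROMLCID macros; the helper name `split_lcid` is made up for this example.

```python
from typing import Dict


def split_lcid(lcid: int) -> Dict[str, int]:
    """Break an LCID into its sort, language, primary-language, and sublanguage parts."""
    lang_id = lcid & 0xFFFF  # low 16 bits: language identifier
    return {
        "sort_id": (lcid >> 16) & 0xF,   # bits 16-19: sort order
        "lang_id": lang_id,
        "primary_lang": lang_id & 0x3FF,  # low 10 bits of the language identifier
        "sub_lang": lang_id >> 10,        # high 6 bits of the language identifier
    }


for name, lcid in [("English_US", 0x0409), ("English_UK", 0x0809)]:
    parts = split_lcid(lcid)
    print(name, hex(parts["primary_lang"]), hex(parts["sub_lang"]))
# English_US 0x9 0x1
# English_UK 0x9 0x2
```

Both English entries share the primary language 0x09 and differ only in the sublanguage, which is why they appear as separate rows with separate registry keys.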
Beginning in Windows 8.1, the preferred way to use wordbreakers is via the WinRT API WordsSegmenter class.
- For information on how to implement and use custom word breakers and stemmers for additional languages and locales, see Extending Language Resources in Windows Search.
- If you need to identify the language of a piece of text, you can use Language Auto-Detection (LAD), which is available in Windows 7 and later. For more information, see Extended Linguistic Services (ELS).
- For information on managing, querying, and extending the index, see the Windows Search Developer's Guide.
- Windows Search Overview
- Windows Search as a Development Platform
- Using Managed Code with Shell Data and Windows Search