Languages Supported by Windows Search

This topic describes how Windows Search supports multiple languages.

Tokenization, Wordbreakers, and Language Resources

Windows Search is language-independent, but the accuracy of search across languages may vary because of the way wordbreakers tokenize text. Wordbreakers implement various tokenization rules for languages and break text into individual tokens, or words, to be indexed or searched.

Both the indexed text and the query string are broken into tokens. Because tokenization rules vary by language, there are separate wordbreakers for each language or family of languages. If the query language and the indexed language do not match, the two sides are tokenized under different rules and the results can be unpredictable.

Windows Search ships with a well-defined set of wordbreakers. Classic wordbreaker and stemmer components are supported in Windows Vista and later.

If the language of a document cannot be determined, Windows Search falls back to system settings to identify the most appropriate wordbreaker. It first calls the GetSystemPreferredUILanguages function to determine the first Multilingual User Interface (MUI) language, which is typically the system UI language unless MUI language packs are installed. If that call succeeds, the wordbreaker for the first MUI language is used. If the call fails, Windows Search retrieves the system locale by calling the GetSystemDefaultLCID function and uses the wordbreaker associated with that locale.
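The fallback order described above can be modeled in a few lines. The following is a minimal sketch, not the Windows Search implementation: the two callables are hypothetical stand-ins for GetSystemPreferredUILanguages and GetSystemDefaultLCID, and the sketch assumes both report languages as numeric locale identifiers.

```python
from typing import Callable, Optional, Sequence


def pick_fallback_locale(
    get_mui_languages: Callable[[], Optional[Sequence[int]]],
    get_system_lcid: Callable[[], int],
) -> int:
    """Return the locale whose wordbreaker would be used as a fallback.

    get_mui_languages is a hypothetical stand-in for
    GetSystemPreferredUILanguages; it returns None or an empty
    sequence to model a failed call.
    get_system_lcid is a hypothetical stand-in for GetSystemDefaultLCID.
    """
    mui_languages = get_mui_languages()
    if mui_languages:
        # The call succeeded: use the first MUI language.
        return mui_languages[0]
    # The call failed: fall back to the system locale.
    return get_system_lcid()


# First MUI language is Dutch (0x0413), so Dutch is chosen.
dutch_first = pick_fallback_locale(lambda: [0x0413, 0x0409], lambda: 0x0409)

# The MUI call fails, so the system locale, German (0x0407), is chosen.
system_locale = pick_fallback_locale(lambda: None, lambda: 0x0407)
```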

If no wordbreaker is installed for a language, Windows Search breaks on white space by using the Neutral wordbreaker.
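To see why breaking on white space is only a last resort, consider the following sketch. It models just the white-space rule stated above, using the hypothetical helper neutral_break; the actual Neutral wordbreaker may apply further rules that this sketch does not cover.

```python
from typing import List


def neutral_break(text: str) -> List[str]:
    """Split text into tokens on runs of white space only."""
    return text.split()


# Works reasonably for languages that separate words with spaces.
english_tokens = neutral_break("Windows Search indexes text")

# A language written without spaces yields a single oversized token,
# so a query for one word inside the sentence cannot match it.
japanese_tokens = neutral_break("東京都に住む")
```

The first call produces four tokens, while the Japanese sentence comes back as one token. This is the kind of accuracy loss a language-specific wordbreaker avoids.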

You can remove a language through the registry, as illustrated in the following example.

HKEY_LOCAL_MACHINE
   SYSTEM
      CurrentControlSet
         Control
            ContentIndex
               Language
                  Dutch_Dutch
                     (Default)
                     Locale
                     NoiseFile
                     StemmerClass = CLSID
                     WBreakerClass = CLSID

Tip  If you make changes to the registry, restart Windows Search.
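Before you change the registry, it can help to see which languages are currently registered. The following read-only sketch uses the standard Python winreg module and the key path shown in the preceding example. The helper names are hypothetical, and the sketch assumes the key is readable by the current user; it returns an empty list if the key cannot be opened.

```python
import sys
from typing import List

# Key path taken from the registry example above (under HKEY_LOCAL_MACHINE).
LANGUAGE_KEY = r"SYSTEM\CurrentControlSet\Control\ContentIndex\Language"


def language_subkey(name: str) -> str:
    """Build the full subkey path for one language, such as Dutch_Dutch."""
    return LANGUAGE_KEY + "\\" + name


def list_registered_languages() -> List[str]:
    """Return the language subkey names. Works only on Windows."""
    import winreg  # standard library, available only on Windows

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, LANGUAGE_KEY) as key:
            subkey_count = winreg.QueryInfoKey(key)[0]
            return [winreg.EnumKey(key, i) for i in range(subkey_count)]
    except OSError:
        # The key is missing or cannot be read.
        return []


if __name__ == "__main__" and sys.platform == "win32":
    for language_name in list_registered_languages():
        print(language_name)
```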

When Windows Search requires a new wordbreaker, the class identifier (CLSID) is read, and the instantiated wordbreaker is cached.

You can create a custom wordbreaker for a language by implementing the IWordBreaker interface. Windows Search then calls the IWordBreaker methods when it builds content indexes and runs queries.

Locale information for indexed content is retrieved from the source of the content. If the source implementer does not know the locale of the indexed content, it should set the locale to LOCALE_NEUTRAL.

For example, if you implement a filter handler (an implementation of the IFilter interface), property handler, or protocol handler, you should set the locale for indexed content to LOCALE_NEUTRAL unless you have specific locale information and are confident of its accuracy.

Tip  If an index query is based on user input, the locale should match the language in which the user is typing. You can determine this locale by calling the GetKeyboardLayout function.
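GetKeyboardLayout returns an input locale identifier whose low word contains the language identifier. The following sketch shows one way to extract it from Python through ctypes. The helper names are hypothetical, and the call into user32 works only on Windows; the bit masking itself is platform-independent.

```python
def langid_from_hkl(hkl: int) -> int:
    """Extract the language identifier from an input locale identifier.

    The low word of the value returned by GetKeyboardLayout holds the
    language identifier of the input language.
    """
    return hkl & 0xFFFF


def current_input_langid() -> int:
    """Return the input language of the calling thread. Windows only."""
    import ctypes

    user32 = ctypes.windll.user32
    user32.GetKeyboardLayout.argtypes = [ctypes.c_uint32]
    user32.GetKeyboardLayout.restype = ctypes.c_void_p
    # A thread identifier of 0 means the calling thread.
    hkl = user32.GetKeyboardLayout(0) or 0
    return langid_from_hkl(hkl)


# An illustrative handle value whose low word is English (United States).
example_langid = langid_from_hkl(0x04090409)
```

With the default sort order, the resulting language identifier has the same numeric value as the LCID, so it can be compared directly with the LCID column in the table that follows.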

Languages Supported by Wordbreakers

Windows Search includes wordbreakers to support the following languages.

Registry key          Language (sublanguage)                              LCID
Arabic_SaudiArabia    Arabic (Saudi Arabia)                               0x0001
Bengali_Default       Bengali                                             0x0045
Bulgarian_Default     Bulgarian (Bulgaria)                                0x0402
Catalan_Default       Catalan (Spain)                                     0x0403
Chinese_HongKong      Chinese (Hong Kong SAR, PRC)                        0x0C04
Chinese_Simplified    Chinese (Simplified)                                0x0804
Chinese_Traditional   Chinese (Traditional)                               0x0404
Croatian_Default      Croatian (Croatia)                                  0x041A
Dutch_Dutch           Dutch (Netherlands)                                 0x0413
English_UK            English (United Kingdom)                            0x0809
English_US            English (United States)                             0x0409
French_French         French (France)                                     0x040C
German_German         German (Germany)                                    0x0407
Gujarati_Default      Gujarati (India)                                    0x0447
Hebrew_Default        Hebrew                                              0x000D
Hindi_Default         Hindi (India)                                       0x0439
Icelandic_Default     Icelandic (Iceland)                                 0x040F
Indonesian_Default    Indonesian (Indonesia)                              0x0421
Italian_Italian       Italian (Italy)                                     0x0410
Japanese_Default      Japanese (Japan)                                    0x0411
Kannada_Default       Kannada (India)                                     0x044B
Korean_Default        Korean (Korea)                                      0x0412
Latvian_Default       Latvian (Latvia)                                    0x0426
Lithuanian_Default    Lithuanian (Lithuania)                              0x0427
Malay_Malaysia        Malay (Malaysia)                                    0x043E
Malayalam_Default     Malayalam                                           0x004C
Marathi_Default       Marathi (India)                                     0x044E
Norwegian_Bokmal      Norwegian (Bokmål, Norway)                          0x0414
Porguguese_Portugal   Portuguese (Portugal)                               0x0816
Portuguese_Brazil     Portuguese (Brazil)                                 0x0416
Punjabi_Default       Punjabi (India)                                     0x0446
Romanian_Default      Romanian (Romania)                                  0x0418
Russian_Default       Russian                                             0x0019
Serbian_Cyrillic      Serbian (Serbia and Montenegro, Former, Cyrillic)   0x0C1A
Serbian_Latin         Serbian (Serbia and Montenegro, Former, Latin)      0x081A
Slovak_Default        Slovak (Slovakia)                                   0x041B
Slovenian_Default     Slovenian (Slovenia)                                0x0424
Spanish_Modern        Spanish (Spain, Modern Sort)                        0x0C0A
Swedish_Default       Swedish (Sweden)                                    0x041D
Tamil_Default         Tamil (India)                                       0x0449
Telugu_Default        Telugu (India)                                      0x044A
Thai_Default          Thai (Thailand)                                     0x041E
Ukrainian_Default     Ukrainian (Ukraine)                                 0x0422
Urdu_Default          Urdu (Pakistan)                                     0x0420
Vietnamese_Default    Vietnamese (Vietnam)                                0x042A

 

Note  LCIDs for some languages in the table are generated using the language identifier, sublanguage identifier, and sort identifier.

For more information about languages and associated identifiers, see Language Identifier Constants and Strings.
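The way an LCID is assembled from its parts can be illustrated with a short sketch. The functions below mirror the bit layout used by the Windows MAKELANGID and MAKELCID macros: the primary language occupies the low 10 bits, the sublanguage the next 6 bits, and the sort identifier the 4 bits above the language identifier. The Python function names are hypothetical.

```python
from typing import Tuple


def make_langid(primary: int, sublanguage: int) -> int:
    """Combine a primary language and sublanguage into a language identifier."""
    return (sublanguage << 10) | primary


def make_lcid(langid: int, sort_id: int = 0) -> int:
    """Combine a language identifier and sort identifier into an LCID."""
    return (sort_id << 16) | langid


def split_lcid(lcid: int) -> Tuple[int, int, int]:
    """Return (primary language, sublanguage, sort identifier) for an LCID."""
    langid = lcid & 0xFFFF
    return langid & 0x3FF, langid >> 10, (lcid >> 16) & 0xF


# English (0x09) with the United States sublanguage (0x01), default sort.
english_us = make_lcid(make_langid(0x09, 0x01))  # 0x0409

# Spanish (Spain, Modern Sort) is primary 0x0A with sublanguage 0x03.
spanish_modern_parts = split_lcid(0x0C0A)  # (0x0A, 0x03, 0)
```

Entries in the table such as 0x0001 or 0x0019 carry only a primary language; splitting them yields a sublanguage of 0.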

Additional Resources

Related topics

Windows Search Overview
Windows Search as a Development Platform
Using Managed Code with Shell Data and Windows Search

 

 

© 2014 Microsoft