Languages Supported by Windows Search

This topic describes how Windows Search supports multiple languages.

Tokenization, Wordbreakers, and Language Resources

Windows Search is language-independent, but the accuracy of search across languages may vary because of the way wordbreakers tokenize text. Wordbreakers implement various tokenization rules for languages and break text into individual tokens, or words, to be indexed or searched.

Both the indexed text and the query string are broken into tokens. Because tokenization rules vary by language, there are separate wordbreakers for each language or family of languages. If the query language and the indexed language do not match, the results can be unpredictable.

Windows Search ships with a well-defined set of wordbreakers. Classic wordbreaker and stemmer components are supported in Windows Vista and later.

If the language of a document cannot be determined, Windows Search attempts to detect the language in order to select the most appropriate wordbreaker. It first calls the GetSystemPreferredUILanguages function to determine the first Multilingual User Interface (MUI) language, which is typically the system UI language unless MUI language packs are installed. If that call succeeds, the wordbreaker for the first MUI language is used. If the call fails, Windows Search retrieves the system locale by calling the GetSystemDefaultLCID function and uses the wordbreaker associated with that locale.
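
The fallback order described above can be sketched as follows. This is only an illustration of the decision logic, not Windows Search's implementation: the two callables are hypothetical stand-ins for GetSystemPreferredUILanguages and GetSystemDefaultLCID, and the function name and stub values are assumptions made for the example.

```python
from typing import Callable, Optional, Sequence, Tuple, Union


def pick_fallback_language(
    get_preferred_ui_languages: Callable[[], Optional[Sequence[str]]],
    get_system_default_lcid: Callable[[], int],
) -> Tuple[str, Union[str, int]]:
    """Return ("mui", name) or ("lcid", lcid) for the wordbreaker lookup.

    Both callables are hypothetical stand-ins for the Win32 functions
    named in the text; a failed call is modeled as None or an empty list.
    """
    languages = get_preferred_ui_languages()
    if languages:
        # The call succeeded: use the first MUI language in the list.
        return ("mui", languages[0])
    # The call failed: fall back to the system locale.
    return ("lcid", get_system_default_lcid())


# Stubbed usage: the first lookup "succeeds", the second "fails".
first = pick_fallback_language(lambda: ["en-US", "fr-FR"], lambda: 0x0409)
second = pick_fallback_language(lambda: None, lambda: 0x0411)
print(first)
print(second)
```

Passing the two lookups in as callables keeps the sketch runnable on any platform; on Windows they would wrap the real API calls.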

If no wordbreaker is installed for a language, Windows Search breaks on white space by using the Neutral wordbreaker.
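
A rough sketch of white-space breaking shows what this fallback means in practice. It is an approximation for illustration only: `neutral_break` is a hypothetical helper, and the actual Neutral wordbreaker may apply rules beyond a plain split.

```python
def neutral_break(text: str) -> list:
    """Split text on runs of white space (hypothetical approximation)."""
    return text.split()


# Space-delimited text yields one token per word.
english = neutral_break("Windows Search indexes documents")

# Japanese is written without spaces between words, so white-space
# breaking returns the whole phrase as a single token.
japanese = neutral_break("日本語のテキスト")

print(english)
print(len(japanese))
```

For languages that do not separate words with spaces, a single oversized token like this is one reason search accuracy drops when no language-specific wordbreaker is available.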

You can remove a language through the registry, as illustrated in the following example.

    StemmerClass = CLSID
    WBreakerClass = CLSID

Tip  If you make changes to the registry, restart Windows Search.

When Windows Search requires a new wordbreaker, the class identifier (CLSID) is read, and the instantiated wordbreaker is cached.

You can create a custom wordbreaker for a language by implementing the IWordBreaker interface. Windows Search then calls the IWordBreaker methods when it builds content indexes and runs queries.

Locale information for indexed content is retrieved from the source of the content. If the source implementer does not know the locale of the indexed content, it should set the locale to LOCALE_NEUTRAL.

For example, if you implement a filter handler (an implementation of the IFilter interface), property handler, or protocol handler, you should set the locale for indexed content to LOCALE_NEUTRAL unless you have specific locale information and are confident of its accuracy.

Tip  If an index query is based on user input, the locale should match the language in which the user is typing. You can determine this locale by calling the GetKeyboardLayout function.
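
As a minimal sketch of that tip: the low word of the handle returned by GetKeyboardLayout is the language identifier of the input language. The helper names below are assumptions for the example, and the ctypes call runs only on Windows; elsewhere the function returns None.

```python
import ctypes
import sys


def langid_from_hkl(hkl: int) -> int:
    """Extract the input language identifier (low word) from an HKL."""
    return hkl & 0xFFFF


def current_input_langid():
    """Return the calling thread's input LANGID, or None off Windows."""
    if sys.platform != "win32":
        return None
    user32 = ctypes.WinDLL("user32")
    user32.GetKeyboardLayout.argtypes = [ctypes.c_uint32]
    user32.GetKeyboardLayout.restype = ctypes.c_void_p  # HKL is a handle
    hkl = user32.GetKeyboardLayout(0) or 0  # 0 means the current thread
    return langid_from_hkl(hkl)


# Worked example with a fixed value: 0x04090409 is a typical HKL for
# the US English layout, and its low word is 0x0409.
us_english = langid_from_hkl(0x04090409)
print(hex(us_english))
```

The resulting identifier can be compared against the LCID column of the table in the next section to see which wordbreaker corresponds to the user's input language.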

Languages Supported by Wordbreakers

Windows Search includes wordbreakers to support the following languages.

Registry key         Language (sublanguage)                             LCID
Arabic_SaudiArabia   Arabic (Neutral)                                   0x0001
Bengali_Default      Bengali (Neutral)                                  0x0045
Bulgarian_Default    Bulgarian (Bulgaria)                               0x0402
Catalan_Default      Catalan (Spain)                                    0x0403
Chinese_HongKong     Chinese (Hong Kong SAR, PRC)                       0x0C04
Chinese_Simplified   Chinese (Simplified)                               0x0804
Chinese_Traditional  Chinese (Traditional)                              0x0404
Croatian_Default     Croatian (Croatia)                                 0x041A
Czech_Default        Czech (Czech Republic)                             0x0405
Danish_Default       Danish (Denmark)                                   0x0406
Dutch_Dutch          Dutch (Netherlands)                                0x0413
English_UK           English (United Kingdom)                           0x0809
English_US           English (United States)                            0x0409
Finnish_Default      Finnish (Finland)                                  0x040B
French_French        French (France)                                    0x040C
German_German        German (Germany)                                   0x0407
Greek_Default        Greek (Greece)                                     0x0408
Gujarati_Default     Gujarati (India)                                   0x0447
Hebrew_Default       Hebrew (Neutral)                                   0x000D
Hindi_Default        Hindi (India)                                      0x0439
Hungarian_Default    Hungarian (Hungary)                                0x040E
Icelandic_Default    Icelandic (Iceland)                                0x040F
Indonesian_Default   Indonesian (Indonesia)                             0x0421
Italian_Italian      Italian (Italy)                                    0x0410
Japanese_Default     Japanese (Japan)                                   0x0411
Kannada_Default      Kannada (India)                                    0x044B
Korean_Default       Korean (Korea)                                     0x0412
Latvian_Default      Latvian (Latvia)                                   0x0426
Lithuanian_Default   Lithuanian (Lithuania)                             0x0427
Malay_Malaysia       Malay (Malaysia)                                   0x043E
Malayalam_Default    Malayalam (Neutral)                                0x004C
Marathi_Default      Marathi (India)                                    0x044E
Norwegian_Bokmal     Norwegian (Bokmål, Norway)                         0x0414
Polish_Default       Polish (Poland)                                    0x0415
Portuguese_Portugal  Portuguese (Portugal)                              0x0816
Portuguese_Brazil    Portuguese (Brazil)                                0x0416
Punjabi_Default      Punjabi (India)                                    0x0446
Romanian_Default     Romanian (Romania)                                 0x0418
Russian_Default      Russian (Neutral)                                  0x0019
Serbian_Cyrillic     Serbian (Serbia and Montenegro, Former, Cyrillic)  0x0C1A
Serbian_Latin        Serbian (Serbia and Montenegro, Former, Latin)     0x081A
Slovak_Default       Slovak (Slovakia)                                  0x041B
Slovenian_Default    Slovenian (Slovenia)                               0x0424
Spanish_Modern       Spanish (Spain, Modern Sort)                       0x0C0A
Swedish_Default      Swedish (Sweden)                                   0x041D
Tamil_Default        Tamil (India)                                      0x0449
Telugu_Default       Telugu (India)                                     0x044A
Thai_Default         Thai (Thailand)                                    0x041E
Turkish_Default      Turkish (Turkey)                                   0x041F
Ukrainian_Default    Ukrainian (Ukraine)                                0x0422
Urdu_Default         Urdu (Pakistan)                                    0x0420
Vietnamese_Default   Vietnamese (Vietnam)                               0x042A


Note  LCIDs for some languages in the table are generated using the language identifier, sublanguage identifier, and sort identifier.
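
To make the composition concrete, the sketch below splits an LCID into its parts and rebuilds it, using the standard Windows layout: the low 16 bits hold the language identifier (10 bits of primary language, 6 bits of sublanguage) and bits 16 through 19 hold the sort identifier. The function names are assumptions for the example.

```python
def split_lcid(lcid: int) -> dict:
    """Break an LCID into primary language, sublanguage, and sort ID."""
    langid = lcid & 0xFFFF              # low word: language identifier
    return {
        "primary": langid & 0x3FF,      # low 10 bits of the LANGID
        "sublang": langid >> 10,        # high 6 bits of the LANGID
        "sort": (lcid >> 16) & 0xF,     # bits 16-19 of the LCID
    }


def make_lcid(primary: int, sublang: int, sort: int = 0) -> int:
    """Rebuild an LCID from its three parts."""
    return (sort << 16) | (sublang << 10) | primary


# Spanish_Modern (0x0C0A): primary language 0x0A, sublanguage 3.
spanish = split_lcid(0x0C0A)
# Arabic_SaudiArabia (0x0001): sublanguage 0, i.e. a neutral entry.
arabic = split_lcid(0x0001)
print(spanish, arabic, hex(make_lcid(0x09, 1)))
```

Entries listed as Neutral in the table have a sublanguage of 0, which is why their LCIDs look like bare primary language identifiers.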

For more information about languages and associated identifiers, see Language Identifier Constants and Strings.

Note  There is no guarantee that all of these language registry keys are present on any given machine. Depending on user settings, the wordbreaker for a given language may or may not be installed on the machine.

Beginning in Windows 8.1, the preferred way to use wordbreakers is through the WinRT WordsSegmenter class.

Additional Resources

Related topics

Windows Search Overview
Windows Search as a Development Platform
Using Managed Code with Shell Data and Windows Search



© 2015 Microsoft