CompareString Function

Compares two character strings, using the specified locale.

Note  For compatibility with Unicode, applications should prefer CompareStringEx or the Unicode version of CompareString (CompareStringW).

Syntax

int CompareString(
    LCID Locale,
    DWORD dwCmpFlags,
    LPCTSTR lpString1,
    int cchCount1,
    LPCTSTR lpString2,
    int cchCount2
);

Parameters

Locale
[in] Specifies the locale used for the comparison. This parameter can be one of the following predefined locale identifiers or a locale identifier created by the MAKELCID macro. For more information about locale identifiers, see Locales.
LOCALE_SYSTEM_DEFAULT
The system's default locale.
LOCALE_USER_DEFAULT
The current user's default locale.
LOCALE_NEUTRAL
Neutral locale; language neutral.
LOCALE_INVARIANT
Used to perform a locale-independent comparison.
dwCmpFlags
[in] A set of flags that indicate how the function compares the two strings. By default, these flags are not set. This parameter can specify zero to get the default behavior, or it can be any combination of the following values.
LINGUISTIC_IGNORECASE
Windows Vista and later. Ignore case in the search as is linguistically appropriate. The Remarks section below explains how this differs from NORM_IGNORECASE.
LINGUISTIC_IGNOREDIACRITIC
Windows Vista and later. Ignore non-spacing characters, as is linguistically appropriate. The Remarks section below explains how this differs from NORM_IGNORENONSPACE.
NORM_IGNORECASE
Ignore case. The Remarks section below explains how this differs from LINGUISTIC_IGNORECASE.
NORM_IGNOREKANATYPE
Do not differentiate between Hiragana and Katakana characters. Corresponding Hiragana and Katakana characters compare as equal.
NORM_IGNORENONSPACE
Ignore nonspacing characters. The Remarks section below explains how this differs from LINGUISTIC_IGNOREDIACRITIC.
NORM_IGNORESYMBOLS
Ignore symbols and punctuation.
NORM_IGNOREWIDTH
Ignore the difference between half-width and full-width characters, for example, Ｃａｔ == cat. The full-width form is a formatting distinction used in Chinese and Japanese scripts.
NORM_LINGUISTIC_CASING
Windows Vista and later. Uses linguistic rules for casing, rather than file system rules (the default). The Remarks section below explains this in more detail.
SORT_STRINGSORT
Treat punctuation the same as symbols.
lpString1
[in] Pointer to the first string to be compared.
cchCount1
[in] Specifies the number of TCHARs in the string pointed to by the lpString1 parameter. This refers to bytes for ANSI versions of the function or WCHARs for Unicode versions. The count does not include the null-terminator. If this parameter is any negative value, the string is assumed to be null-terminated and the length is calculated automatically.
lpString2
[in] Pointer to the second string to be compared.
cchCount2
[in] Specifies the number of TCHARs in the string pointed to by the lpString2 parameter. The count does not include the null-terminator. If this parameter is any negative value, the string is assumed to be null-terminated and the length is calculated automatically.

Return Value

Returns one of the following values.

To maintain the C run-time convention of comparing strings, the value 2 can be subtracted from a nonzero return value. The meaning of < 0, == 0, and > 0 is then consistent with the C run-time library.

CSTR_LESS_THAN (1)
The string pointed to by the lpString1 parameter is less in lexical value than the string pointed to by the lpString2 parameter.
CSTR_EQUAL (2)
The string pointed to by lpString1 is equivalent in lexical value to the string pointed to by lpString2. The two strings are equivalent for collating purposes, though not necessarily identical.
CSTR_GREATER_THAN (3)
The string pointed to by lpString1 is greater in lexical value than the string pointed to by lpString2.
0
The function failed. To get extended error information, call GetLastError. GetLastError can return one of the following error codes.

ERROR_INVALID_FLAGS

ERROR_INVALID_PARAMETER
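
The subtract-2 convention described above can be wrapped in a small helper. The following is a minimal sketch, not part of the API: the names cstr_failed and cstr_to_crt are hypothetical, and the CSTR_* values are repeated under a guard only so that the fragment compiles without the Windows headers.

```c
/* Same values as in the Windows headers; defined here only if missing. */
#ifndef CSTR_EQUAL
#define CSTR_LESS_THAN    1
#define CSTR_EQUAL        2
#define CSTR_GREATER_THAN 3
#endif

/* Hypothetical helper: nonzero when a CompareString result signals failure.
   Check this before converting, because 0 is not a comparison result. */
static int cstr_failed(int result)
{
    return result == 0;
}

/* Hypothetical helper: maps a nonzero CompareString result to the C run-time
   convention (negative, zero, positive). */
static int cstr_to_crt(int result)
{
    return result - 2;
}
```

A caller would test cstr_failed first (and consult GetLastError if it is set), then pass the result to cstr_to_crt, for example inside a qsort comparison callback.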

Remarks

Equivalent Strings

Usually, CompareString, CompareStringEx, lstrcmp, and lstrcmpi evaluate strings character-by-character. However, many languages have multiple-character elements, such as the two-character pair 'CH' in Traditional Spanish. CompareString and CompareStringEx use the locale passed in Locale (or lpLocaleName) to identify multiple-character elements, while lstrcmp and lstrcmpi use the user locale.

Another example is Vietnamese, which contains many two-character elements such as the valid uppercase, title case, and lowercase forms of 'GI', which are 'GI', 'Gi', and 'gi', respectively. Any of these forms is treated as a single collation element and, if casing is ignored, compares as equal. However, because 'gI' is not valid as a single element, CompareString, CompareStringEx, lstrcmp, and lstrcmpi treat 'gI' as two separate elements.

In contrast, CompareStringOrdinal, introduced in Windows Vista, performs a strictly binary comparison: except for the option of being case-insensitive, it disregards all non-binary equivalences, and (unlike CompareString) it will test all code points for equality, including those that are not given any weight in linguistic collation schemes.

Evaluating Strings

Typically, strings are compared using what is called a "word sort" technique. In a word sort, all punctuation marks and other nonalphanumeric characters, except for the hyphen and the apostrophe, come before any alphanumeric character. The hyphen and the apostrophe are treated differently than the other nonalphanumeric symbols, in order to ensure that words such as "coop" and "co-op" stay together within a sorted list.

If the SORT_STRINGSORT flag is specified, strings are compared using what is called a "string sort" technique. In a string sort, the hyphen and apostrophe are treated just like any other nonalphanumeric symbols. Their positions in the collating sequence are before the alphanumeric symbols.

The following table shows a list of words sorted both ways.

Word Sort   String Sort   Word Sort   String Sort
billet      bill's        t-ant       t-ant
bills       billet        tanya       t-aria
bill's      bills         t-aria      tanya
cannot      can't         sued        sue's
cant        cannot        sues        sued
can't       cant          sue's       sues
con         co-op         went        we're
coop        con           were        went
co-op       coop          we're       were

The lstrcmp and lstrcmpi functions use a word sort. The CompareString, LCMapString, and FindNLSString functions (and the corresponding locale-name-based CompareStringEx, LCMapStringEx, and FindNLSStringEx functions) default to using a word sort, but use a string sort if their caller sets the SORT_STRINGSORT flag.
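
The difference between the two techniques can be sketched with a simplified model. This is an illustration only and not the algorithm Windows uses: it assumes lowercase ASCII words, and the function names are hypothetical. The string-sort model relies on the fact that the ASCII hyphen and apostrophe already precede the letters, while the word-sort model skips those two characters on a first pass and uses them only to break ties.

```c
#include <string.h>

/* The two characters that a word sort treats specially. */
static int is_word_punct(char c)
{
    return c == '-' || c == '\'';
}

/* String-sort model: hyphen and apostrophe are ordinary symbols that sort
   before letters, which plain strcmp already does for lowercase ASCII. */
static int string_sort_cmp(const char *a, const char *b)
{
    return strcmp(a, b);
}

/* Word-sort model: compare with hyphens and apostrophes skipped; if that
   ties, the string containing fewer of them sorts first. */
static int word_sort_cmp(const char *a, const char *b)
{
    const char *p = a, *q = b;
    int na = 0, nb = 0;   /* skipped characters seen in a and b */

    for (;;) {
        while (is_word_punct(*p)) { p++; na++; }
        while (is_word_punct(*q)) { q++; nb++; }
        if (*p != *q)
            return (unsigned char)*p < (unsigned char)*q ? -1 : 1;
        if (*p == '\0')
            break;        /* both strings ended: first pass tied */
        p++;
        q++;
    }
    if (na != nb)
        return na < nb ? -1 : 1;
    return strcmp(a, b);
}
```

Under this model "coop" sorts before "co-op" in a word sort, while "co-op" sorts before "con" in a string sort, matching the ordering shown in the table.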

When to use CompareString, lstrcmp, and lstrcmpi

When the comparison should follow the user's language preferences, for example, when sorting items for an ordered ListView control, do either of the following:

  • Call lstrcmp or lstrcmpi, which use the user locale, or
  • Call CompareString or CompareStringEx to define a locale for the comparison, to pass additional flags, to embed null characters, or to pass explicit lengths to match parts of a string.

When the results of the comparison should be consistent regardless of locale, for example, when comparing retrieved data against a predefined list or an internal value, use CompareString or CompareStringEx with the Locale parameter set to LOCALE_INVARIANT. Either of the following calls will match even if mystr is "INLAP", whereas the locale-sensitive call to lstrcmpi fails if the current locale is Vietnamese. CompareStringOrdinal can also be used for locale-insensitive comparisons.

  • On Windows XP and later:
    CompareString(LOCALE_INVARIANT, NORM_IGNORECASE, mystr, -1, _T("InLap"), -1);
  • For earlier operating systems:
    DWORD lcid = MAKELCID(MAKELANGID(LANG_ENGLISH, SUBLANG_ENGLISH_US), SORT_DEFAULT);
    CompareString(lcid, NORM_IGNORECASE, mystr, -1, _T("InLap"), -1);

The CompareString function can return data from custom locales. Custom locales allow administrators to change many aspects of locale formats, but this does not change collation behavior.

Applications that are intended to run only on Windows Vista and later should use CompareStringEx in preference to this function. CompareStringEx provides good support for supplemental locales. However, CompareStringEx is not supported for versions of Windows prior to Windows Vista.

Performance

The CompareString function is optimized to run at the highest speed when dwCmpFlags is set to 0 or NORM_IGNORECASE, cchCount1 and cchCount2 have the value -1, and the passed-in locale does not support any linguistic compressions (such as when Traditional Spanish collation treats "ch" as a single character).

Language-specific Notes

If you are calling the ANSI version (CompareStringA), then rather than converting via the default system code page, CompareString converts parameters via the default code page of the locale you pass in. Among other things, this means that you can never use CompareStringA to handle 8-bit Unicode Transformation Format (UTF-8) text.

The NORM_IGNORENONSPACE flag only has an effect for the locales in which accented characters are sorted in a second pass from main characters. All characters in the string are first compared without regard to accents and (if the strings are equal) a second pass over the strings is performed to compare accents. In this case, this flag causes the second pass to not be performed. (In effect, NORM_IGNORENONSPACE causes diacritics to be stripped from the string before comparison. For a discussion of the consequences, see the blog entry "Why can't we strip the diacritics?" (http://blogs.msdn.com/shawnste/archive/2007/06/08/why-can-t-we-strip-the-diacritics.aspx).) For locales that sort accented characters in the first pass, this flag has no effect.

For many scripts, NORM_IGNORENONSPACE will coincide with LINGUISTIC_IGNOREDIACRITIC and NORM_IGNORECASE will coincide with LINGUISTIC_IGNORECASE. Notably, this is true for Latin scripts. However:

  • NORM_IGNORENONSPACE ignores any "secondary distinction" whether it is actually a diacritic or not. Scripts for Korean, Japanese, Chinese, and Indic languages (among others) use this distinction for purposes other than diacritics. LINGUISTIC_IGNOREDIACRITIC will ignore only actual diacritics, instead of simply ignoring the second collation weight.
  • NORM_IGNORECASE ignores any "tertiary distinction" whether it is actually linguistic case or not. For example, in Arabic and Indic scripts this distinguishes alternate forms of a character, but the differences do not correspond to linguistic case. LINGUISTIC_IGNORECASE will ignore only actual linguistic casing, instead of simply ignoring the third collation weight.
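
The multi-pass behavior and the effect of ignoring the second or third collation weight can be sketched as follows. This is a simplified model under stated assumptions, not the Windows implementation: the Weight structure, the flag names, and every weight value are hypothetical and chosen only for illustration, and both inputs are assumed to have the same number of elements.

```c
#include <stddef.h>

/* Hypothetical collation element with one weight per level. */
typedef struct {
    int primary;    /* base character */
    int secondary;  /* secondary distinction, such as a diacritic */
    int tertiary;   /* tertiary distinction, such as case */
} Weight;

/* Hypothetical flags for this model; not the Windows flag values. */
enum { IGNORE_SECONDARY = 1, IGNORE_TERTIARY = 2 };

/* Illustrative weights for three two-element strings. */
static const Weight plain[]    = { {1, 0, 0}, {2, 0, 0} };
static const Weight accented[] = { {1, 0, 0}, {2, 1, 0} };
static const Weight capital[]  = { {1, 0, 1}, {2, 0, 0} };

static int level_weight(const Weight *w, int level)
{
    return level == 0 ? w->primary : level == 1 ? w->secondary : w->tertiary;
}

/* Compares two equal-length weight arrays one level at a time. A later level
   is consulted only if every earlier level compared equal, and a level named
   in flags is skipped entirely. */
static int compare_weights(const Weight *a, const Weight *b, size_t n,
                           unsigned flags)
{
    for (int level = 0; level < 3; level++) {
        if (level == 1 && (flags & IGNORE_SECONDARY)) continue;
        if (level == 2 && (flags & IGNORE_TERTIARY)) continue;
        for (size_t i = 0; i < n; i++) {
            int wa = level_weight(&a[i], level);
            int wb = level_weight(&b[i], level);
            if (wa != wb)
                return wa < wb ? -1 : 1;
        }
    }
    return 0;
}
```

In this model, skipping the secondary level makes plain and accented compare as equal regardless of what the secondary weight actually encodes, which is the kind of over-broad matching that the LINGUISTIC_ flags are described as avoiding.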

Normally, for case-insensitive comparison, this function maps the lowercase "i" to the uppercase "I", even when the Locale parameter specifies Turkish or Azeri. The NORM_LINGUISTIC_CASING flag overrides this for Turkish or Azeri. If this flag is specified in conjunction with a value for the Locale parameter that indicates Turkish or Azeri, then LATIN SMALL LETTER DOTLESS I (U+0131) is the lowercase form of LATIN CAPITAL LETTER I (U+0049) and LATIN SMALL LETTER I (U+0069) is the lowercase form of LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130).

For double-byte character set (DBCS) locales, the flag NORM_IGNORECASE has an effect on all the wide (two-byte) characters as well as the narrow (one-byte) characters. This includes the wide Greek and Cyrillic characters.

The CompareString function ignores Arabic Kashidas during the comparison. Thus, if two strings are identical, save for the presence of Kashidas, CompareString returns a value of CSTR_EQUAL; the strings are considered "equal" in the collation sense, though they are not necessarily identical.

Custom locales

Windows Vista and later. Note that this API can return data from custom locales (see Custom Locale Information). Custom locales allow administrators to change many aspects of locale formats but this will not change collation behavior.

Security Alert  Using this function incorrectly can compromise the security of your application. Strings that are not compared correctly can produce invalid input. Test strings to make sure they are valid before using them and provide error handlers. For more information, see Security Considerations: International Features.
Windows 95/98/Me: CompareStringW is supported by the Microsoft Layer for Unicode (MSLU). To use this, you must add certain files to your application, as outlined in Microsoft Layer for Unicode on Windows 95/98/Me Systems.

Function Information

Minimum DLL Version         kernel32.dll
Header                      Declared in Winnls.h, include Windows.h
Import library              Kernel32.lib
Minimum operating systems   Windows 95, Windows NT 3.1
Unicode                     Implemented as ANSI and Unicode versions.

Community Content

Use of CompareStringEx is preferred.
Added by:Shawn Steele [MSFT]
If you can't use the Ex version, at least use the Unicode version if possible (CompareStringW).
The NORM_IGNOREWIDTH description above is confusing, it means half and full width characters
Added by:Shawn Steele [MSFT]

NORM_IGNOREWIDTH means to ignore the difference between half-width and full-width characters, e.g., Ｃａｔ == cat. The full-width form is a formatting distinction used in Chinese and Japanese scripts.

See additional security considerations for collation
Added by:Shawn Steele [MSFT]

See http://msdn2.microsoft.com/en-us/library/ms776406.aspx#comparison_functions for security considerations regarding collation (comparison) function choices

Inconsistent results when comparing Thai
Added by:Michael S. Kaplan

The following code demonstrates inconsistent comparison by CompareStringW when sorting Thai.

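#include <stdio.h>
#include <windows.h>

/* Thai locale with the default sort order. */
const LCID g_lcid = MAKELCID(MAKELANGID(LANG_THAI, SUBLANG_DEFAULT), SORT_DEFAULT);

int main(void)
{
    const wchar_t *s1 = L"ไม่";
    const wchar_t *s2 = L"ไม่กิน";
    const wchar_t *s3 = L"ไมค์";

    /* Each call prints 1 (CSTR_LESS_THAN), 2 (CSTR_EQUAL), or 3 (CSTR_GREATER_THAN). */
    printf("%d ", CompareStringW(g_lcid, 0, s1, -1, s2, -1));
    printf("%d ", CompareStringW(g_lcid, 0, s2, -1, s3, -1));
    printf("%d ", CompareStringW(g_lcid, 0, s3, -1, s1, -1));
    return 0;
}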

The output is "1 1 1," which indicates a cycle, a violation of transitive sort consistency. The results are the same if specifying SORT_STRINGSORT, and/or the LINGUISTIC_IGNOREDIACRITIC flag on Vista.
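
Why "1 1 1" is a cycle can be checked mechanically. The sketch below is only an illustration with a hypothetical helper name: given the three results in the order used above (first against second, second against third, third against first), it reports whether they contradict transitivity. The CSTR_* values are repeated under a guard so the fragment compiles without the Windows headers.

```c
/* Same values as in the Windows headers; defined here only if missing. */
#ifndef CSTR_EQUAL
#define CSTR_LESS_THAN    1
#define CSTR_EQUAL        2
#define CSTR_GREATER_THAN 3
#endif

/* Hypothetical helper: returns 1 when the pairwise results a-vs-b, b-vs-c,
   and c-vs-a form a cycle, that is, all three say "less than" or all three
   say "greater than". A consistent ordering cannot produce either pattern. */
static int is_cycle(int ab, int bc, int ca)
{
    if (ab != bc || bc != ca)
        return 0;
    return ab == CSTR_LESS_THAN || ab == CSTR_GREATER_THAN;
}
```

With the reported output, is_cycle(1, 1, 1) is true: s1 < s2 and s2 < s3 imply s1 < s3, yet the third call claims s3 < s1.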

---------------------------------
The sort keys for the three strings on Win2000, XP, Server 2003, Vista, and Server 2008 (RC1) indicate that the order should be s1 < s2 < s3. It repros both when the strings are compared the other way around and when explicit lengths are passed rather than -1.

It is an edge case (most Thai strings will not have this kind of problem), but it is a genuine, long-standing bug which should be investigated and fixed....

Michael S. Kaplan
Microsoft Corp.
http://blogs.msdn.com/michkap

----------------------------------

I agree that s1 < s2 < s3 is fine. But the sample output also indicates that s3 < s1. This makes the comparison unsuitable for qsort.

----------------------------------

Gigabytes of Thai data have been used in Windows and SQL Server since Win2000/SQL 7.0 first shipped, without this bug ever being reported.

I agree it is a bug, but the claim that comparison is unsuitable for qsort seems a bit extreme. Bugs exist, right? Products run even in spite of that....

And in any case, is the additional comments section for MSDN the place where bugs should be reported?

NORM_IGNORENONSPACE is a bit misleading
Added by:Shawn Steele [MSFT]

Technically a "non-spacing" character would be something like U+0308 ( ̈), while Ä would be a spacing character. NORM_IGNORENONSPACE tries to get rid of all diacritics, so it'll reduce Ä to A + U+0308, then drop the diaeresis and you end up with "A". So basically it does a normalization form D and then drops the non-spacing characters.

I can't think of any good reason to ignore diacritics in this way, except perhaps stripping diacritics for old ASCII printers. (See http://blogs.msdn.com/shawnste/archive/2007/06/08/why-can-t-we-strip-the-diacritics.aspx for why this is a bad idea).

NORM_IGNORESYMBOLS treats punctuation as symbols
Added by:Shawn Steele [MSFT]
These flags were invented prior to the Unicode labels, so they don't necessarily map in an expected way to the Unicode tables. NORM_IGNORESYMBOLS ignores punctuation as well as symbols.
LINGUISTIC_IGNOREDIACRITIC doesn't work
Added by:Shawn Steele [MSFT]

LINGUISTIC_IGNOREDIACRITIC doesn't work in some cases. Decomposed characters confuse it right now.

© 2009 Microsoft Corporation. All rights reserved.