Content Moved

This content can be found here: CompareString.

 

 

Community Content
LINGUISTIC_IGNOREDIACRITIC doesn't work

LINGUISTIC_IGNOREDIACRITIC doesn't work in some cases. It currently mishandles decomposed characters (a base letter followed by a combining mark).

NORM_IGNORESYMBOLS treats punctuation as symbols
These flags were invented before the Unicode category labels, so they don't necessarily map to the Unicode tables in the way you might expect. NORM_IGNORESYMBOLS ignores punctuation as well as symbols.
NORM_IGNORENONSPACE is a bit misleading

Technically a "non-spacing" character would be something like U+0308 ( ̈), while Ä would be a spacing character. NORM_IGNORENONSPACE tries to get rid of all diacritics, so it reduces Ä to A + U+0308, then drops the diaeresis, and you end up with "A". So it effectively applies Normalization Form D and then drops the non-spacing characters.
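
To make the "decompose, then drop" description concrete, here is a minimal portable sketch in C. It is not how Windows implements NORM_IGNORENONSPACE. The three-entry table k_decomp and the function strip_diacritics are hypothetical names invented for this illustration, and a real implementation would need the full Unicode decomposition data.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical, tiny decomposition table (illustration only):
   precomposed letter -> base letter + combining mark. */
struct decomp { wchar_t composed; wchar_t base; wchar_t mark; };

static const struct decomp k_decomp[] = {
    { 0x00C4, L'A', 0x0308 }, /* A with diaeresis = A + COMBINING DIAERESIS */
    { 0x00E4, L'a', 0x0308 }, /* a with diaeresis */
    { 0x00E9, L'e', 0x0301 }, /* e with acute = e + COMBINING ACUTE ACCENT */
};

/* Assumption for this sketch: only the Combining Diacritical Marks
   block (U+0300..U+036F) counts as non-spacing. */
static int is_nonspacing_mark(wchar_t c)
{
    return c >= 0x0300 && c <= 0x036F;
}

/* Copies src to dst without diacritics and returns the new length.
   dst must have room for wcslen(src) + 1 elements. */
size_t strip_diacritics(const wchar_t *src, wchar_t *dst)
{
    size_t n = 0;
    for (; *src; ++src) {
        /* Step 1: decompose into base + mark when the table knows the character. */
        wchar_t parts[2] = { *src, 0 };
        for (size_t i = 0; i < sizeof k_decomp / sizeof k_decomp[0]; ++i) {
            if (k_decomp[i].composed == *src) {
                parts[0] = k_decomp[i].base;
                parts[1] = k_decomp[i].mark;
                break;
            }
        }
        /* Step 2: keep everything except the non-spacing marks. */
        for (size_t j = 0; j < 2; ++j) {
            if (parts[j] != 0 && !is_nonspacing_mark(parts[j]))
                dst[n++] = parts[j];
        }
    }
    dst[n] = L'\0';
    return n;
}
```

With this sketch, both the precomposed U+00C4 and the decomposed sequence A + U+0308 come out as a plain "A", which is the behavior described above.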

I can't think of any good reason to ignore diacritics in this way, except perhaps stripping diacritics for old ASCII printers. (See http://blogs.msdn.com/shawnste/archive/2007/06/08/why-can-t-we-strip-the-diacritics.aspx for why this is a bad idea).

Inconsistent results when comparing Thai

The following code demonstrates inconsistent comparison by CompareStringW when sorting Thai.

#include <windows.h>
#include <stdio.h>

const LCID g_lcid = MAKELCID(MAKELANGID(LANG_THAI, SUBLANG_DEFAULT), SORT_DEFAULT);

int main(void)
{
    const wchar_t *s1 = L"ไม่";
    const wchar_t *s2 = L"ไม่กิน";
    const wchar_t *s3 = L"ไมค์";

    /* CompareStringW returns 1 (CSTR_LESS_THAN), 2 (CSTR_EQUAL), or 3 (CSTR_GREATER_THAN). */
    printf("%d ", CompareStringW(g_lcid, 0, s1, -1, s2, -1));
    printf("%d ", CompareStringW(g_lcid, 0, s2, -1, s3, -1));
    printf("%d ", CompareStringW(g_lcid, 0, s3, -1, s1, -1));
    return 0;
}

The output is "1 1 1". Since 1 is CSTR_LESS_THAN, this says s1 < s2, s2 < s3, and s3 < s1, which is a cycle and a violation of transitive sort consistency. The results are the same when SORT_STRINGSORT and/or (on Vista) LINGUISTIC_IGNOREDIACRITIC is specified.
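
For anyone who wants to test their own data for this kind of cycle, here is a rough portable sketch of a transitivity check over a qsort-style comparator. The names cmp_fn, has_cycle, find_cycle, cyclic_cmp, and int_cmp are all made up for this example. cyclic_cmp is a deliberately broken comparator standing in for the Thai case, and int_cmp is a well-behaved one for contrast.

```c
#include <stddef.h>

/* Comparator shape used by qsort: negative, zero, or positive. */
typedef int (*cmp_fn)(const void *, const void *);

/* Returns 1 if a < b and b < c and yet c < a under cmp (a cycle), else 0. */
int has_cycle(cmp_fn cmp, const void *a, const void *b, const void *c)
{
    return cmp(a, b) < 0 && cmp(b, c) < 0 && cmp(c, a) < 0;
}

/* Brute-force scan of every ordered triple in an array of n elements.
   O(n^3), so only meant for small test sets. */
int find_cycle(cmp_fn cmp, const void *base, size_t n, size_t size)
{
    const char *p = (const char *)base;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            for (size_t k = 0; k < n; ++k) {
                if (i == j || j == k || i == k)
                    continue;
                if (has_cycle(cmp, p + i * size, p + j * size, p + k * size))
                    return 1;
            }
    return 0;
}

/* Deliberately broken demo comparator on ints 0..2: 0 < 1, 1 < 2, 2 < 0. */
int cyclic_cmp(const void *x, const void *y)
{
    int a = *(const int *)x, b = *(const int *)y;
    if (a == b)
        return 0;
    return ((a + 1) % 3 == b) ? -1 : 1;
}

/* Ordinary, transitive integer comparator for contrast. */
int int_cmp(const void *x, const void *y)
{
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);
}
```

To run the same check against CompareStringW, one option is a small wrapper comparator that returns the CompareStringW result minus 2, which turns a nonzero result into the negative/zero/positive convention used here. That wrapper is left out of the sketch because it only builds on Windows.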

---------------------------------
The sort keys for the three strings on Win2000, XP, Server 2003, Vista, and Server 2008 (RC1) indicate that the order should be s1 < s2 < s3. The bug also reproduces when the strings are compared in the opposite order and when explicit lengths are passed rather than -1.

It is an edge case (most Thai strings will not have this kind of problem), but it is a genuine, long-standing bug which should be investigated and fixed....

Michael S. Kaplan
Microsoft Corp.
http://blogs.msdn.com/michkap

----------------------------------

I agree that s1 < s2 < s3 is fine. But the sample output also indicates that s3 < s1. This makes the comparison unsuitable for qsort.

----------------------------------

Gigabytes of Thai data have been used in Windows and SQL Server since Win2000/SQL 7.0 first shipped, without this bug ever being reported.

I agree it is a bug, but the claim that comparison is unsuitable for qsort seems a bit extreme. Bugs exist, right? Products run even in spite of that....

And in any case, is the additional comments section for MSDN the place where bugs should be reported?

The NORM_IGNOREWIDTH description above is confusing; it refers to half-width and full-width characters

NORM_IGNOREWIDTH means to ignore the difference between half-width and full-width characters, e.g. Ｃａｔ == Cat. The full-width form is a formatting distinction used in Chinese and Japanese scripts.
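
As an illustration of what "ignoring width" means, here is a small portable sketch that folds the full-width ASCII variants (U+FF01 through U+FF5E) onto their ordinary counterparts (U+0021 through U+007E) before comparing. The function names fold_width and equal_ignoring_width are invented for this example. This is not the Windows implementation, and it does not cover other width pairs such as half-width katakana.

```c
#include <stddef.h>
#include <wchar.h>

/* Maps a full-width ASCII variant (U+FF01..U+FF5E) to its ordinary form
   (U+0021..U+007E). The two ranges are a fixed 0xFEE0 apart.
   Every other character is returned unchanged. */
wchar_t fold_width(wchar_t c)
{
    if (c >= 0xFF01 && c <= 0xFF5E)
        return (wchar_t)(c - 0xFEE0);
    return c;
}

/* Returns 1 if a and b are equal once width differences are folded away,
   0 otherwise. Case is still significant. */
int equal_ignoring_width(const wchar_t *a, const wchar_t *b)
{
    while (*a && *b) {
        if (fold_width(*a) != fold_width(*b))
            return 0;
        ++a;
        ++b;
    }
    return *a == *b; /* both strings must end at the same point */
}
```

Under this sketch the full-width string U+FF23 U+FF41 U+FF54 matches "Cat" but not "cat", since folding width does not fold case.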

See additional security considerations for collation


See http://msdn2.microsoft.com/en-us/library/ms776406.aspx#comparison_functions for security considerations regarding collation (comparison) function choices

Use of CompareStringEx is preferred.
If you can't use the Ex version, at least use the Unicode version if possible (CompareStringW).