LINGUISTIC_IGNOREDIACRITIC doesn't work in some cases. Decomposed characters confuse it right now.
- 6/6/2008
- Shawn Steele [MSFT]
- 3/27/2008
- Shawn Steele [MSFT]
Technically a "non-spacing" character would be something like U+0308 ( ̈), while Ä would be a spacing character. NORM_IGNORENONSPACE tries to get rid of all diacritics, so it reduces Ä to A + U+0308, then drops the diaeresis, and you end up with "A". In effect it normalizes to form D (canonical decomposition) and then drops the non-spacing characters.
I can't think of any good reason to ignore diacritics in this way, except perhaps stripping diacritics for old ASCII printers. (See http://blogs.msdn.com/shawnste/archive/2007/06/08/why-can-t-we-strip-the-diacritics.aspx for why this is a bad idea).
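The decompose-then-drop behavior described above can be sketched in portable C. This is only an illustration, not how Windows implements NORM_IGNORENONSPACE: the two-entry decomposition table, the `is_nonspacing` range check, and the name `strip_diacritics` are all assumptions made up for this example, and real normalization data is far larger.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, tiny decomposition table (real NFD data has thousands of entries). */
typedef struct { uint32_t composed, base, mark; } Decomp;
static const Decomp kDecomp[] = {
    { 0x00C4, 0x0041, 0x0308 },  /* Ä -> A + COMBINING DIAERESIS */
    { 0x00E9, 0x0065, 0x0301 },  /* é -> e + COMBINING ACUTE ACCENT */
};

/* Assumption: only the Combining Diacritical Marks block counts as non-spacing. */
static int is_nonspacing(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Decomposes each code point, then drops the non-spacing marks.
   Writes at most cap code points to out and returns the number written. */
size_t strip_diacritics(const uint32_t *in, size_t n, uint32_t *out, size_t cap)
{
    size_t w = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t parts[2] = { in[i], 0 };
        size_t nparts = 1;
        for (size_t k = 0; k < sizeof kDecomp / sizeof kDecomp[0]; k++) {
            if (kDecomp[k].composed == in[i]) {
                parts[0] = kDecomp[k].base;
                parts[1] = kDecomp[k].mark;
                nparts = 2;
                break;
            }
        }
        for (size_t p = 0; p < nparts; p++) {
            if (!is_nonspacing(parts[p]) && w < cap)
                out[w++] = parts[p];
        }
    }
    return w;
}
```

With this sketch, both the precomposed Ä (U+00C4) and the decomposed sequence A + U+0308 reduce to a plain "A", which is the result a diacritic-ignoring comparison would be expected to treat as equal.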
- 3/27/2008
- Shawn Steele [MSFT]
The following code demonstrates inconsistent comparison by CompareStringW when sorting Thai.
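#include <windows.h>
#include <stdio.h>

const LCID g_lcid = MAKELCID(MAKELANGID(LANG_THAI, SUBLANG_DEFAULT), SORT_DEFAULT);
int main(void)
{
    const wchar_t *s1 = L"ไม่";
    const wchar_t *s2 = L"ไม่กิน";
    const wchar_t *s3 = L"ไมค์";
    /* CompareStringW returns 1 (CSTR_LESS_THAN), 2 (CSTR_EQUAL) or 3 (CSTR_GREATER_THAN). */
    printf("%d ", CompareStringW(g_lcid, 0, s1, -1, s2, -1));
    printf("%d ", CompareStringW(g_lcid, 0, s2, -1, s3, -1));
    printf("%d ", CompareStringW(g_lcid, 0, s3, -1, s1, -1));
    return 0;
}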
The output is "1 1 1". Each 1 is CSTR_LESS_THAN, so the results claim s1 < s2, s2 < s3, and s3 < s1: a cycle, which violates transitive sort consistency. The results are the same when SORT_STRINGSORT and/or (on Vista) the LINGUISTIC_IGNOREDIACRITIC flag is specified.
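A quick way to test three comparison results for this kind of inconsistency is sketched below. The function name `is_consistent` and the local enum are made up for this example; the enum values are assumed to match the CSTR_* constants (1, 2, 3), and the caller is assumed to have already rejected a 0 (failure) return from CompareStringW.

```c
/* Local stand-ins whose values match CSTR_LESS_THAN, CSTR_EQUAL, CSTR_GREATER_THAN. */
enum { CMP_LESS = 1, CMP_EQUAL = 2, CMP_GREATER = 3 };

/* Takes the results of compare(a,b), compare(b,c) and compare(c,a).
   Returns 1 if some ordering of a, b, c could produce them, 0 otherwise.
   Reasoning: the three differences (a-b), (b-c), (c-a) sum to zero, so
   either all are zero or at least one is positive and one is negative. */
int is_consistent(int ab, int bc, int ca)
{
    int results[3] = { ab, bc, ca };
    int less = 0, greater = 0;
    for (int i = 0; i < 3; i++) {
        if (results[i] == CMP_LESS)
            less++;
        else if (results[i] == CMP_GREATER)
            greater++;
    }
    return (less == 0 && greater == 0) || (less > 0 && greater > 0);
}
```

Passing the reported results, `is_consistent(1, 1, 1)`, returns 0, while a well-behaved outcome such as `is_consistent(1, 1, 3)` (s1 < s2 < s3 and s3 > s1) returns 1.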
---------------------------------
The sort keys for the three strings on Win2000, XP, Server 2003, Vista, and Server 2008 (RC1) indicate that the order should be s1 < s2 < s3. It repros both when the strings are compared the other way around and when explicit lengths are passed rather than -1.
It is an edge case (most Thai strings will not have this kind of problem), but it is a genuine, long-standing bug which should be investigated and fixed....
Michael S. Kaplan
Microsoft Corp.
http://blogs.msdn.com/michkap
----------------------------------
I agree that s1 < s2 < s3 is fine. But the sample output also indicates that s3 < s1. This makes the comparison unsuitable for qsort.
----------------------------------
Gigabytes of Thai data have been used in Windows and SQL Server since Win2000/SQL 7.0 first shipped, without this bug ever being reported.
I agree it is a bug, but the claim that comparison is unsuitable for qsort seems a bit extreme. Bugs exist, right? Products run even in spite of that....
And in any case, is the additional comments section for MSDN the place where bugs should be reported?
- 12/11/2007
- Glenn Slayden
- 12/12/2007
- Michael S. Kaplan
NORM_IGNOREWIDTH means to ignore the difference between half-width and full-width characters, e.g. full-width Ｃａｔ == half-width Cat. The full-width form is a formatting distinction that is used in Chinese and Japanese scripts.
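To illustrate what "ignoring width" means, here is a minimal sketch that folds full-width ASCII variants to their half-width forms before comparing. It relies on the Unicode layout in which U+FF01 through U+FF5E mirror U+0021 through U+007E at a fixed offset of 0xFEE0. The function names are made up for this example, it does not cover half-width katakana or Hangul, and it is not a description of how NORM_IGNOREWIDTH is actually implemented.

```c
#include <stddef.h>
#include <stdint.h>

/* Maps a full-width ASCII variant to its half-width code point;
   every other code point is returned unchanged. */
uint32_t fold_width(uint32_t cp)
{
    if (cp >= 0xFF01 && cp <= 0xFF5E)
        return cp - 0xFEE0;          /* e.g. U+FF23 (full-width C) -> U+0043 */
    if (cp == 0x3000)
        return 0x0020;               /* IDEOGRAPHIC SPACE -> SPACE */
    return cp;
}

/* Returns 1 if the two code point sequences are equal once width is folded. */
int equal_ignoring_width(const uint32_t *a, size_t na,
                         const uint32_t *b, size_t nb)
{
    if (na != nb)
        return 0;
    for (size_t i = 0; i < na; i++) {
        if (fold_width(a[i]) != fold_width(b[i]))
            return 0;
    }
    return 1;
}
```

Note that only width is folded here: full-width Ｃａｔ matches Cat, but it still differs from lowercase cat unless case is ignored separately.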
- 11/16/2007
- Shawn Steele [MSFT]
- 12/11/2007
- Shawn Steele [MSFT]
See additional security considerations for collation
See http://msdn2.microsoft.com/en-us/library/ms776406.aspx#comparison_functions for security considerations regarding collation (comparison) function choices
- 11/19/2007
- Shawn Steele [MSFT]
- 11/1/2007
- Shawn Steele [MSFT]