Microsoft Jet 4.0 Sorting APIs: DBCompareStringW and DBLCMapStringW

Article
10/08/2009

Lee Woods
Microsoft Corporation

March 2007

Applies to:
   Microsoft Jet Database Engine 4.0
   SQL Server
   Microsoft Windows API

Summary: The Microsoft Jet 4.0 relational database engine utilizes character sorting tables when creating and querying field indexes made up of character data. Two Microsoft Windows APIs are generally utilized for this functionality: CompareString and LCMapString. (7 printed pages)

Introduction
DBCompareStringW Function
DBLCMapStringW Function

Introduction

The Microsoft Jet 4.0 relational database engine utilizes character sorting tables when creating and querying field indexes made up of character (string) data. Two Microsoft Windows APIs are generally utilized for this functionality: CompareString and LCMapString.

If Jet were to utilize these Windows APIs, however, a potential compatibility issue would exist for applications sharing Jet databases from different versions of Windows, or when databases are transported between computers running different versions of Windows. The compatibility issue that may exist because CompareString and LCMapString can yield different results across versions of Microsoft Windows, potentially creating index lookup failures when indexes are based on character (string) data.

To ensure 100 percent character indexing compatibility across all previous, current, and future versions of Microsoft Windows, Jet utilizes its own static versions of these APIs; DBCompareStringW and DBLCMapStringW. The Microsoft Jet Database Engine component MSWSTR10 exposes these APIs, which were created to stabilize the operation of the Windows sorting functions utilized by Jet, by freezing a single version of the Windows sorting tables (the sorts of Windows 2000). This effectively isolates it from any potential changes in sort tables by ensuring consistent and identical sorting results, regardless of the Windows version applications (and thus, Jet) is running on.

Table 1. Windows and Jet equivalent APIs

Windows	Jet (MSWSTR10)
CompareString	DBCompareStringW
LCMapString	DBLCMapStringW

The input parameters and return values of each function are identical to their Windows equivalent functions, with the following exceptions: the DBLCMapStringW function only supports a sub-set of the mapping operations that the Windows function has, and the DBCompareStringW function supports an additional flag for doing prefix comparisons.

DBCompareStringW Function

The DBCompareStringW function compares two character strings, using the specified locale.

Syntax

int DBCompareStringW(      

    LCID Locale,
    DWORD dwCmpFlags,
    UNALIGNED WCHAR * lpString1,
    int cchCount1,
    UNALIGNED WCHAR * lpString2,
    int cchCount2);

Parameters

Locale

[in] Specifies the locale used for the comparison. This parameter can be one of the following predefined locale identifiers. This parameter can also be a locale identifier created by the MAKELCID macro.

dwCmpFlags

[in] Indicates what type of transformation is to occur during mapping. Several flags can be combined on a single transformation (though some combinations are illegal). Mapping options include:

Table 2. Mapping options for the DBCompareStringW function

Flag	Meaning
NORM_IGNORECASE	If set, this flag will cause the function to ignore any differences in upper/lower case.
NORM_IGNORENONSPACE	Nonspace marks are diacritical accents that are added to certain characters in many languages, either as an integral part of a new glyph or as a separate character that should be combined in some way with a base character. This flag, if set, will cause all nonspace marks to be masked off and ignored, whether they are separate characters or not.
NORM_IGNORESYMBOLS	If set, this flag will cause the function to ignore any punctuation and/or other symbol characters that appear in the string.
NORM_IGNOREKANATYPE	If set, this flag will cause the function to ignore any differences of Hiragana and Katakana form.
NORM_IGNOREWIDTH	If set, this flag will cause the function to ignore any differences of character byte size in the ANSII equivalent of the Unicode character.
SORT_STRINGSORT	This flag can be set to force the apostrophe and the hyphen to be treated as if they are regular symbol characters; see last comment in this section
NORM_PREFIX	If set, this flag requests that the two strings be considered equal if no collation differences are found by the time that lpString1 is used up; that is, lpString1 can be a prefix of lpString2.
NORM_PADSPACE	If set, this flag requests that, when the end of the shorter string is found, the code should continue matching against the longer string by inserting as many space characters as required to compare up to the end of the longer string. This flag will affect the results of comparing two strings that only differ because of control codes on the end of the longer string.

lpString1

[in] Pointer to the first Unicode string to be compared.

cchCount1

[in] Specifies the number of WCHARs in the string pointed to by the lpString1 parameter. The count does not include the null-terminator. If this parameter is any negative value, the string is assumed to be null terminated and the length is calculated automatically. If cchCount1 and cchCount2 are not –1, then the comparison continues for the number of characters specified. It will not terminate if a null-terminator is found within the string when cchCount1 and cchCount2 are not set to –1.

lpString2

[in] Pointer to the second Unicode string to be compared.

cchCount2

[in] Specifies the number of WCHARs in the string pointed to by the lpString2 parameter. The count does not include the null-terminator. If this parameter is any negative value, the string is assumed to be null terminated and the length is calculated automatically.

Return Value

Success: 1 (CSTR_LESS_THAN)—if lpString1 is less than lpString2.

2 (CSTR_EQUAL)—if lpString1 is equal to lpString2.

3 (CSTR_GREATER_THAN)—if lpString1 is greater than lpString2.

Failure: 0

Remarks

Note that if the return value is 2, the two strings are "equal" in the collation sense, though not necessarily identical (that is, the case might be ignored).
If the two strings are of different lengths, they are compared up to the length of the shortest one. If they are equal to that point, then the return value will indicate that the longer string is greater, unless the NORM_PREFIX flag is set and the first string is shorter; in this case, the two strings are considered to be equal.
All symbols will sort before any other alphanumeric. The hyphen/minus and apostrophe characters are treated as being symbol character if the SORT_STRINGSORT flag is specified; otherwise, they are ignored as symbols unless no other difference are found in the strings. This is to ensure that words like coop and co-op stay together within a list.
For Japanese Kana, using the NORM_IGNORENONSPACE will drop the DW field that contain the Daku-on and Handaku-on markers, and also will ignore the 4th and 5th weight fields in the XW field that contain the Repeat, Cho-on, and Small Kana character markers.
If the strings supplied contain Arabic Kashidas, the Kashidas will be ignored during the comparison. Therefore, if the two strings are identical save for Kashidas within the strings, DBCompareStringW will return a value of 2; that is, they are "equal" in the collation sense, though not necessarily identical.

DBLCMapStringW Function

The DBLCMapStringStringW function compares two character strings, using the specified locale. DBLCMapStringW provides sort key generation for database indexing. The sort key strings can be compared against each other by doing a byte by byte compare. In addition, this API provides mappings from one character string to another, performing the specified locale-dependent uppercase or lowercase translation. All other string mappings are not supported by the MSWSTR10.DLL.

Syntax

int DBLCMapStringW(      

    LCID Locale,
    DWORD dwMapFlags,
    UNALIGNED WCHAR * lpSrcStr,
    int cchSrc,
    UNALIGNED WCHAR * lpDestStr,
    int cchDest
);

Parameters

Locale

dwCmpFlags

[in] Indicates what type of transformation is to occur during mapping. Several flags can be combined on a single transformation (though some combinations are illegal). Mapping options include:

Table 3. Mapping options for the DBLCMapStringStringW function

Flag	Meaning
LCMAP_LOWERCASE	If set, this flag will convert all characters in the string to their lowercase equivalent characters.
LCMAP_UPPERCASE	If set, this flag will convert all characters in the string to their uppercase equivalent characters.
LCMAP_LINGUISTIC_CASING	Valid in combination with LCMAP_LOWERCASE or LCMAP_UPPERCASE only. If set, this flag will cause the casing to be done based on linguistic rules rather than file system rules (default behavior).
LCMAP_SORTKEY	If set, this flag will convert the string into sort key (normalized form).
NORM_IGNORECASE	Valid in combination with LCMAP_SORTKEY. If set, this flag will cause the function to ignore any differences in upper/lower case.
NORM_IGNORENONSPACE	Valid in combination with LCMAP_SORTKEY. Nonspace marks are diacritical accents that are added to certain characters in many languages, either as an integral part of a new glyph or as a separate character that should be combined in some way with a base character. This flag, if set, will cause all nonspace marks to be masked off and ignored, whether they are separate characters or not.
NORM_IGNORESYMBOLS	Valid in combination with LCMAP_SORTKEY. If set, this flag will cause the function to ignore any punctuation and/or other symbol characters that appear in the string.
NORM_IGNOREKANATYPE	Valid in combination with LCMAP_SORTKEY. If set, this flag will cause the function to ignore any differences of Hiragana and Katakana form.
NORM_IGNOREWIDTH	Valid in combination with LCMAP_SORTKEY. If set, this flag will cause the function to ignore any differences of character byte size in the ANSII equivalent of the Unicode character.
SORT_STRINGSORT	Valid in combination with LCMAP_SORTKEY. This flag can be set to force the apostrophe and the hyphen to be treated as if they are regular symbol characters; see last comment in this section

lpSrcStr

[in] Pointer to the supplied string to be mapped.

cchSrc

[in] Character count of the input string buffer. This count can include the NULL terminator or not. If the NULL terminator is included in this count, it will not show up in the sort key value anyway, as a NULL is considered "unsortable" and a NULL is always mapped to a NULL, so it will not greatly affect the mapping behavior. Note that this is a WORD count for wide character strings and a BYTE count for multi-byte strings. If –1, lpSrcStr is assumed to be null-terminated; the length will be calculated automatically, and the return string will also be NULL terminated.

lpDestStr

[in] Pointer to the memory buffer to store the resulting mapped string. If LCMAP_SORTKEY is specified, the resulting string will always be terminated by a NULL byte, regardless of what value is in cchSrc, and the resulting buffer is an LPBYTE with the following format (see sorting section for details):

[all Unicode weights]0x01 [all Diacritic weights]0x01 [all Case weights]0x01 [all Special weights]0x00
If some of these weights are absent from the sort key string, due to ignore flags, the 0x01 separators and the NULL terminator are still present.

cchDest

[in] The character count of the memory buffer pointed to by lpDestStr, or the count of bytes if LCMAP_SORTKEY is specified. If the NULL terminator is included in cchSrc, then cchDest must also include the NULL terminator in the character count. If cchDest is 0, then the return value of this function is the number of characters required to hold the mapped string. The lpDestStr pointer is not referenced in this case. This is a WORD count for wide character strings and a BYTE count for multi-byte strings.

Return Value

Success: number of characters written to lpDestStr including the terminator.
(OR the number of bytes if LCMAP_SORTKEY was specified.)

Failure: 0

Remarks

The mapped string will be null-terminated if the source string is null-terminated.
The lpSrcStr and lpDestStr pointers may not be the same.
The output string is not strictly in [WCHAR|CHAR] format, since some of the mapping flavors return byte values (sort key) rather than characters.
The Arabic Kashida is ignored. Thus, if you create a sort key for a string containing an Arabic Kashida, there will not be a sort key value for the Kashida.
All symbols will sort before any other alphanumeric. The hyphen/minus and apostrophe characters are treated as being symbol characters if the SORT_STRINGSORT flag is specified; otherwise they are ignored as symbols unless no other difference are found in the strings. This is to ensure that words like coop and co-op stay together within a list.
For Japanese Kana, using the NORM_IGNORENONSPACE will cause the DW field that contain the Daku-on and Handaku-on markers to be omitted from the output, as will the 4th and 5th weight fields in the XW field that contain the Repeat, Cho-on, and Small Kana character markers.
When the LCMAP_SORTKEY flag is used with any Chinese locale, it is preferable to use the NORM_IGNORENONSPACE flag, as well. This will result in a shorter sortkey; however, the end result of the sort with or without the NORM_IGNORENONSPACE flag will be identical. These weights are unnecessary in Chinese, but are necessary in other ideographic languages.

Security Alert Using these functions incorrectly can compromise the security of your application. Strings that are not compared correctly can produce invalid input. Test strings to make sure they are valid before using them and provide error handlers. For more information, see Security Considerations: International Features.

Microsoft Jet 4.0 Sorting APIs: DBCompareStringW and DBLCMapStringW

Contents

Introduction

DBCompareStringW Function

DBLCMapStringW Function

Additional resources