Translating Strings Between Unicode and ANSI

You use the Windows function MultiByteToWideChar to convert multibyte-character strings to wide-character strings. MultiByteToWideChar is shown here:

int MultiByteToWideChar( 
UINT uCodePage, 
DWORD dwFlags, 
PCSTR pMultiByteStr, 
int cbMultiByte, 
PWSTR pWideCharStr, 
int cchWideChar);

The uCodePage parameter identifies the code page associated with the multibyte string. The dwFlags parameter gives you additional control over the conversion; it affects characters with diacritical marks such as accents. Usually the flags aren’t used, and 0 is passed in the dwFlags parameter. (For more details about the possible values for this flag, see the MSDN online help.) The pMultiByteStr parameter specifies the string to be converted, and the cbMultiByte parameter indicates the length (in bytes) of the string. The function automatically determines the length of the source string if you pass –1 for the cbMultiByte parameter.

The Unicode version of the string resulting from the conversion is written to the buffer located in memory at the address specified by the pWideCharStr parameter. You must specify the maximum size of this buffer (in characters) in the cchWideChar parameter. If you call MultiByteToWideChar, passing 0 for the cchWideChar parameter, the function doesn’t perform the conversion and instead returns the number of wide characters that the buffer must provide for the conversion to succeed. This count includes the terminating '\0' character whenever the source length covers it, as it does when you pass –1 for cbMultiByte. Typically, you convert a multibyte-character string to its Unicode equivalent by performing the following steps:

  1. Call MultiByteToWideChar, passing NULL for the pWideCharStr parameter and 0 for the cchWideChar parameter and –1 for the cbMultiByte parameter.
  2. Allocate a block of memory large enough to hold the converted Unicode string. This size is computed based on the value returned by the previous call to MultiByteToWideChar multiplied by sizeof(wchar_t).
  3. Call MultiByteToWideChar again, this time passing the address of the buffer as the pWideCharStr parameter and passing the value returned by the first call to MultiByteToWideChar as the cchWideChar parameter. Because cchWideChar is a count of characters, not bytes, do not multiply it by sizeof(wchar_t) here.
  4. Use the converted string.
  5. Free the memory block occupied by the Unicode string.

The function WideCharToMultiByte converts a wide-character string to its multibyte-string equivalent, as shown here:

int WideCharToMultiByte( 
UINT uCodePage, 
DWORD dwFlags, 
PCWSTR pWideCharStr, 
int cchWideChar, 
PSTR pMultiByteStr, 
int cbMultiByte, 
PCSTR pDefaultChar, 
PBOOL pfUsedDefaultChar);

This function is similar to the MultiByteToWideChar function. Again, the uCodePage parameter identifies the code page to be associated with the newly converted string. The dwFlags parameter allows you to specify additional control over the conversion. The flags affect characters with diacritical marks and characters that the system is unable to convert. Usually, you won’t need this degree of control over the conversion, and you’ll pass 0 for the dwFlags parameter.

The pWideCharStr parameter specifies the address in memory of the string to be converted, and the cchWideChar parameter indicates the length (in characters) of this string. The function determines the length of the source string if you pass –1 for the cchWideChar parameter.

The multibyte version of the string resulting from the conversion is written to the buffer indicated by the pMultiByteStr parameter. You must specify the maximum size of this buffer (in bytes) in the cbMultiByte parameter. Passing 0 as the cbMultiByte parameter of the WideCharToMultiByte function causes the function to return the size required by the destination buffer. You’ll typically convert a wide-character string to a multibyte-character string using a sequence of events similar to those discussed when converting a multibyte string to a wide-character string, except that the return value is directly the number of bytes required for the conversion to succeed.

You’ll notice that the WideCharToMultiByte function accepts two parameters more than the MultiByteToWideChar function: pDefaultChar and pfUsedDefaultChar. These parameters are used by the WideCharToMultiByte function only if it comes across a wide character that doesn’t have a representation in the code page identified by the uCodePage parameter. If the wide character cannot be converted, the function uses the character pointed to by the pDefaultChar parameter. If this parameter is NULL, which is most common, the function uses a system default character. This default character is usually a question mark. This is dangerous for filenames because the question mark is a wildcard character.

The pfUsedDefaultChar parameter points to a Boolean variable that the function sets to TRUE if at least one character in the wide-character string could not be converted to its multibyte equivalent. The function sets the variable to FALSE if all the characters convert successfully. You can test this variable after the function returns to check whether the wide-character string was converted successfully. Again, you usually pass NULL for this parameter.

For a more complete description of how to use these functions, please refer to the Platform SDK documentation.

Exporting ANSI and Unicode DLL Functions

You could use these two functions to easily create both Unicode and ANSI versions of functions. For example, you might have a dynamic-link library containing a function that reverses all the characters in a string. You could write the Unicode version of the function as shown here:

BOOL StringReverseW(PWSTR pWideCharStr, DWORD cchLength) {
   // Get a pointer to the last character in the string.
   PWSTR pEndOfStr = pWideCharStr + wcsnlen_s(pWideCharStr, cchLength) - 1;
   wchar_t cCharT;

   // Repeat until we reach the center character in the string.
   while (pWideCharStr < pEndOfStr) {
      // Save a character in a temporary variable.
      cCharT = *pWideCharStr;
      // Put the last character in the first character.
      *pWideCharStr = *pEndOfStr;
      // Put the temporary character in the last character.
      *pEndOfStr = cCharT;
      // Move in one character from the left.
      pWideCharStr++;
      // Move in one character from the right.
      pEndOfStr--;
   }

   // The string is reversed; return success.
   return(TRUE);
}

And you could write the ANSI version of the function so that it doesn’t perform the actual work of reversing the string at all. Instead, you could write the ANSI version so that it converts the ANSI string to Unicode, passes the Unicode string to the StringReverseW function, and then converts the reversed string back to ANSI. The function would look like this:

BOOL StringReverseA(PSTR pMultiByteStr, DWORD cchLength) {
   PWSTR pWideCharStr;
   int nLenOfWideCharStr;
   BOOL fOk = FALSE;

   // Calculate the number of characters needed to hold
   // the wide-character version of the string.
   nLenOfWideCharStr = MultiByteToWideChar(CP_ACP, 0,
      pMultiByteStr, cchLength, NULL, 0);

   // Allocate memory from the process' default heap to
   // accommodate the size of the wide-character string.
   // Don't forget that MultiByteToWideChar returns the
   // number of characters, not the number of bytes, so
   // you must multiply by the size of a wide character.
   pWideCharStr = (PWSTR)HeapAlloc(GetProcessHeap(), 0,
      nLenOfWideCharStr * sizeof(wchar_t));
   if (pWideCharStr == NULL)
      return(fOk);

   // Convert the multibyte string to a wide-character string.
   MultiByteToWideChar(CP_ACP, 0, pMultiByteStr, cchLength,
      pWideCharStr, nLenOfWideCharStr);

   // Call the wide-character version of this
   // function to do the actual work.
   fOk = StringReverseW(pWideCharStr, cchLength);
   if (fOk) {
      // Convert the wide-character string back
      // to a multibyte string.
      WideCharToMultiByte(CP_ACP, 0, pWideCharStr, cchLength,
         pMultiByteStr, (int)strlen(pMultiByteStr), NULL, NULL);
   }

   // Free the memory containing the wide-character string.
   HeapFree(GetProcessHeap(), 0, pWideCharStr);

   return(fOk);
}

Finally, in the header file that you distribute with the dynamic-link library, you prototype the two functions as follows:

BOOL StringReverseW(PWSTR pWideCharStr, DWORD cchLength); 
BOOL StringReverseA(PSTR pMultiByteStr, DWORD cchLength); 
#ifdef UNICODE 
#define StringReverse StringReverseW 
#else 
#define StringReverse StringReverseA 
#endif // !UNICODE

Determining If Text Is ANSI or Unicode

The Windows Notepad application allows you to open both Unicode and ANSI files as well as create them. In fact, Figure 2-5 shows Notepad’s File Save As dialog box. Notice the different ways that you can save a text file.


Figure 2-5 The Windows Vista Notepad File Save As dialog box

For many applications that open text files and process them, such as compilers, it would be convenient if, after opening a file, the application could determine whether the text file contained ANSI characters or Unicode characters. The IsTextUnicode function exported by AdvApi32.dll and declared in WinBase.h can help make this distinction:

BOOL IsTextUnicode(CONST PVOID pvBuffer, int cb, PINT pResult);

The problem with text files is that there are no hard and fast rules as to their content. This makes it extremely difficult to determine whether the file contains ANSI or Unicode characters. IsTextUnicode uses a series of statistical and deterministic methods to guess at the content of the buffer. Because this is not an exact science, it is possible that IsTextUnicode will return an incorrect result.
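
To make the idea of a deterministic method concrete, one check a detector can perform is to look for a byte-order mark (BOM) at the start of the buffer. The sketch below is a hand-rolled illustration and not the implementation of IsTextUnicode; the BomKind type and the DetectBom function are hypothetical names, and UTF-32 marks are deliberately ignored. A missing BOM proves nothing, which is one reason guessing is still needed.

```c
#include <stddef.h>

// Hypothetical result type for this illustration.
typedef enum {
   BOM_NONE,       // No recognized byte-order mark
   BOM_UTF8,       // Bytes EF BB BF
   BOM_UTF16_LE,   // Bytes FF FE (little-endian UTF-16)
   BOM_UTF16_BE    // Bytes FE FF (big-endian UTF-16)
} BomKind;

// Inspects the first bytes of a buffer for a byte-order mark.
// cb is a count of bytes because the content type is unknown.
BomKind DetectBom(const void* pvBuffer, size_t cb) {
   const unsigned char* pb = (const unsigned char*)pvBuffer;

   if (cb >= 3 && pb[0] == 0xEF && pb[1] == 0xBB && pb[2] == 0xBF)
      return BOM_UTF8;
   if (cb >= 2 && pb[0] == 0xFF && pb[1] == 0xFE)
      return BOM_UTF16_LE;
   if (cb >= 2 && pb[0] == 0xFE && pb[1] == 0xFF)
      return BOM_UTF16_BE;
   return BOM_NONE;
}
```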

The first parameter, pvBuffer, identifies the address of a buffer that you want to test. The data is a void pointer because you don’t know whether you have an array of ANSI characters or an array of Unicode characters.

The second parameter, cb, specifies the number of bytes that pvBuffer points to. Again, because you don’t know what’s in the buffer, cb is a count of bytes rather than a count of characters. Note that you do not have to specify the entire length of the buffer. Of course, the more bytes IsTextUnicode can test, the more accurate a response you’re likely to get.

The third parameter, pResult, is the address of an integer that you must initialize before calling IsTextUnicode. You initialize this integer to indicate which tests you want IsTextUnicode to perform. You can also pass NULL for this parameter, in which case IsTextUnicode will perform every test it can. (See the Platform SDK documentation for more details.)

If IsTextUnicode thinks that the buffer contains Unicode text, TRUE is returned; otherwise, FALSE is returned. If specific tests were requested in the integer pointed to by the pResult parameter, the function sets the bits in the integer before returning to reflect the results of each test.

The FileRev sample application presented in Chapter 17, “Memory-Mapped Files,” demonstrates the use of the IsTextUnicode function.
