September 2016

Volume 31 Number 9

[C++]

Unicode Encoding Conversions with STL Strings and Win32 APIs

By Giovanni Dicanio

Unicode is the de facto standard for representing international text in modern software. According to the official Unicode Consortium’s Web site (bit.ly/1Rtdulx), “Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.” Each of these unique numbers is called a code point, and is typically represented using the “U+” prefix, followed by the unique number written in hexadecimal form. For example, the code point associated with the character “C” is U+0043. Note that Unicode is an industry standard that covers most of the world’s writing systems, including ideographs. So, for example, the Japanese kanji ideograph 学, which has “learning” and “knowledge” among its meanings, is associated with the code point U+5B66. Currently, the Unicode standard defines more than 1,114,000 code points.

From Abstract Code Points to Actual Bits: UTF-8 and UTF-16 Encodings

A code point is an abstract concept, though. For a programmer, the question is: How are these Unicode code points represented concretely using computer bits? The answer to this question leads directly to the concept of Unicode encoding. Basically, a Unicode encoding is a particular, well-defined way of representing Unicode code point values in bits. The Unicode standard defines several encodings, but the most important ones are UTF-8 and UTF-16, both of which are variable-length encodings capable of encoding all possible Unicode “characters” or, better, code points. Therefore, conversions between these two encodings are lossless: No Unicode character will be lost during the process.

UTF-8, as its name suggests, uses 8-bit code units. It was designed with two important characteristics in mind. First, it’s backward-compatible with ASCII; this means that each valid ASCII character code has the same byte value when encoded using UTF-8. In other words, valid ASCII text is automatically valid UTF-8-encoded text.

Second, because Unicode text encoded in UTF-8 is just a sequence of 8-bit byte units, there’s no endianness complication. The UTF-8 encoding (unlike UTF-16) is endian-neutral by design. This is an important feature when exchanging text across different computing systems that can have different hardware architectures with different endianness.

Considering the two Unicode characters I mentioned before, the capital letter C (code point U+0043) is encoded in UTF-8 using the single byte 0x43 (43 hexadecimal), which is exactly the ASCII code associated with the character C (as per the UTF-8 backward compatibility with ASCII). In contrast, the Japanese ideograph 学 (code point U+5B66) is encoded in UTF-8 as the three-byte sequence 0xE5 0xAD 0xA6.

UTF-8 is the most-used Unicode encoding on the Internet. According to recent W3Techs statistics available at bit.ly/1UT5EBC, UTF-8 is used by 87 percent of all the Web sites it analyzed.

UTF-16 is basically the de facto standard encoding used by Windows Unicode-enabled APIs. UTF-16 is the “native” Unicode encoding in many other software systems, as well. For example, Qt, Java and the International Components for Unicode (ICU) library, just to name a few, use UTF-16 encoding to store Unicode strings.

UTF-16 uses 16-bit code units. Just like UTF-8, UTF-16 can encode all possible Unicode code points. However, while UTF-8 encodes each valid Unicode code point using one to four 8-bit byte units, UTF-16 is, in a way, simpler. In fact, Unicode code points are encoded in UTF-16 using just one or two 16-bit code units. However, having code units larger than a single byte implies endianness complications: in fact, there are both big-endian UTF-16 and little-endian UTF-16 (while there’s just one endian-neutral UTF-8 encoding).

Unicode defines the concept of a plane as a contiguous group of 65,536 (2^16) code points. The first plane is identified as plane 0 or Basic Multilingual Plane (BMP). Characters for almost all modern languages and many symbols are located in the BMP, and all these BMP characters are represented in UTF-16 using a single 16-bit code unit.

Supplementary characters are located in planes other than the BMP; they include pictographic symbols like Emoji and historic scripts like Egyptian hieroglyphs. These supplementary characters outside the BMP are encoded in UTF-16 using a pair of 16-bit code units, known as a surrogate pair.

The capital letter C (U+0043) is encoded in UTF-16 as a single 16-bit code unit 0x0043. The ideograph 学 (U+5B66) is encoded in UTF-16 as the single 16-bit code unit 0x5B66. For many Unicode characters there’s an immediate, direct correspondence between their code point “abstract” representation (such as U+5B66) and their associated UTF-16 encoding in hex (for example, the 0x5B66 16-bit word).

To have a little fun, let’s take a look at some pictographic symbols. The Unicode character “snowman” (U+2603) is encoded in UTF-8 as the three-byte sequence 0xE2 0x98 0x83; however, its UTF-16 encoding is the single 16-bit unit 0x2603. The Unicode character “beer mug” (U+1F37A), which is located outside the BMP, is encoded in UTF-8 by the four-byte sequence 0xF0 0x9F 0x8D 0xBA. Its UTF-16 encoding instead uses two 16-bit code units, 0xD83C 0xDF7A, an example of a UTF-16 surrogate pair.
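
These encodings are easy to verify directly in C++ with Unicode string literals. The following is a minimal sketch, assuming the Visual C++ 2015 compiler or later (where wchar_t is 16 bits and the u8 literal prefix is available); it isn’t part of the article’s download:

#include <cassert>
#include <string>
int main()
{
  // "Snowman" (U+2603): three bytes in UTF-8 ...
  const std::string snowmanUtf8 = u8"\u2603";
  assert(snowmanUtf8 == "\xE2\x98\x83");
  // ... but a single 16-bit code unit in UTF-16
  const std::wstring snowmanUtf16 = L"\u2603";
  assert(snowmanUtf16.size() == 1);
  // "Beer mug" (U+1F37A): four bytes in UTF-8,
  // two 16-bit code units (a surrogate pair) in UTF-16
  assert(std::string(u8"\U0001F37A").size() == 4);
  assert(std::wstring(L"\U0001F37A").size() == 2);
}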

Converting Between UTF-8 and UTF-16 Using Win32 APIs

As discussed in the previous paragraphs, Unicode text is represented inside a computer’s memory using different bits, based on the particular Unicode encoding. Which encoding should you use? There isn’t a single answer to this question.

Recently, conventional wisdom seems to suggest that a good approach consists of storing Unicode text encoded in UTF-8 in instances of the std::string class in cross-platform C++ code. Moreover, there’s general agreement that UTF-8 is the encoding of choice for exchanging text across application boundaries and across different machines. The fact that UTF-8 is an endian-neutral format plays an important role in this. In any case, conversions between UTF-8 and UTF-16 are required at least at the Win32 API boundary, because Windows Unicode-enabled APIs use UTF-16 as their native encoding.

Now let’s dive into some C++ code to implement these Unicode UTF-8/UTF-16 encoding conversions. There are two key Win32 APIs that can be used for this purpose: MultiByteToWideChar and its symmetric WideCharToMultiByte. The former can be invoked to convert from UTF-8 (a “multi-byte” string in the specific API terminology) to UTF-16 (a “wide char” string); the latter can be used for the opposite conversion. Because these Win32 functions have similar interfaces and usage patterns, I’ll focus only on MultiByteToWideChar in this article, but I have included compilable C++ code that uses the other API as part of this article’s download.

Using Standard STL String Classes to Store Unicode Text

Because this is a C++ article, there’s a valid expectation of storing Unicode text in some sort of string class. So, the question now becomes: What kind of C++ string classes can be used to store Unicode text? The answer is based on the particular encoding used for the Unicode text. If UTF-8 encoding is used, because it’s based on 8-bit code units, a simple char can be used to represent each of these code units in C++. In this case the STL std::string class, which is char-based, is a good option to store UTF-8-encoded Unicode text.

On the other hand, if the Unicode text is encoded in UTF-16, each code unit is represented by 16-bit words. In Visual C++, the wchar_t type is exactly 16 bits in size; consequently, the STL std::wstring class, which is wchar_t-based, works fine to store UTF-16 Unicode text.

It’s worth noting that the C++ standard doesn’t specify the size of the wchar_t type, so while it amounts to 16 bits with the Visual C++ compiler, other C++ compilers are free to use different sizes. And, in fact, the size of wchar_t defined by the GNU GCC C++ compiler on Linux is 32 bits. Because the wchar_t type has different sizes on different compilers and platforms, the std::wstring class, which is based on that type, is non-portable. In other words, wstring can be used to store Unicode text encoded in UTF-16 on Windows with the Visual C++ compiler (where the size of wchar_t is 16 bits), but not on Linux with the GCC C++ compiler, which defines a different-sized 32-bit wchar_t type.
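
In Windows-specific code that relies on that 16-bit wchar_t size, it can be useful to state the assumption explicitly with a compile-time check; this is just an optional one-line sketch, not part of the article’s code:

// Document the assumption that wchar_t holds a 16-bit UTF-16 code unit
// (true for Visual C++ on Windows; fails on GCC/Linux, where wchar_t is 32 bits)
static_assert(sizeof(wchar_t) == 2, "This code assumes 16-bit wchar_t (UTF-16 code units)");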

There is actually another Unicode encoding, which is less well-known and less used in practice than its siblings: UTF-32. As its name clearly suggests, it’s based on 32-bit code units. So a GCC/Linux 32-bit wchar_t is a good candidate for the UTF-32 encoding on the Linux platform.

This variation in the size of wchar_t makes C++ code based on it (including the std::wstring class itself) non-portable. On the other hand, std::string, which is char-based, is portable. However, from a practical perspective, it’s worth noting that the use of wstring to store UTF-16-encoded text is just fine in Windows-specific C++ code. In fact, these portions of code already interact with Win32 APIs, which are, of course, platform-specific by definition. So adding wstring to the mix doesn’t change the situation.

Finally, it’s important to note that, because both UTF-8 and UTF-16 are variable-length encodings, the return values of the string::length and wstring::length methods in general don’t correspond to the number of Unicode characters (or code points) stored in the strings.

The Conversion Function Interface

Let’s develop a function to convert Unicode text encoded in UTF-8 to the equivalent text encoded using UTF-16. This can come in handy, for example, when you have some cross-platform C++ code that stores UTF-8-encoded Unicode strings using the STL std::string class, and you want to pass that text to Unicode-enabled Win32 APIs, which typically use the UTF-16 encoding. Because this code is talking to Win32 APIs, it’s already non-portable, so std::wstring is well-suited to store UTF-16 text here. A possible function prototype is:

std::wstring Utf8ToUtf16(const std::string& utf8);

This conversion function takes as input a Unicode UTF-8-encoded string, which is stored in the standard STL std::string class. Because this is an input parameter, it’s passed by const reference (const &) to the function. As the result of the conversion, a string encoded in UTF-16 is returned, stored in a std::wstring instance. However, during Unicode encoding conversions, things can go wrong. For example, the input UTF-8 string might contain an invalid UTF-8 sequence (which can be the result of a bug in other parts of the code, or it could end up there as a result of some malicious activity). In such cases, the best thing from a security perspective is to fail the conversion, instead of consuming potentially dangerous byte sequences. The conversion function can handle cases of invalid UTF-8 input sequences by throwing a C++ exception.

Defining an Exception Class for Conversion Errors

What kind of C++ class can be used to throw an exception in case a Unicode encoding conversion fails? An option might be using a class already defined in the standard library, for example: std::runtime_error. However, I prefer defining a new custom C++ exception class for that purpose, deriving it from std::runtime_error. When Win32 APIs like MultiByteToWideChar fail, it’s possible to call GetLastError to get further information about the cause of the failure. For example, in case of invalid UTF-8 sequences in the input string, a typical error code returned by GetLastError is ERROR_NO_UNICODE_TRANSLATION. It makes sense to add this piece of information to the custom C++ exception class; it can come in handy later for debugging purposes. The definition of this exception class can start like this:

// utf8except.h
#pragma once
#include <stdint.h>   // for uint32_t
#include <stdexcept>  // for std::runtime_error
// Represents an error during UTF-8 encoding conversions
class Utf8ConversionException
  : public std::runtime_error
{
  // Error code from GetLastError()
  uint32_t _errorCode;

Note that the value returned by GetLastError is of type DWORD, which represents a 32-bit unsigned integer. However, DWORD is a Win32-specific non-portable typedef. Even if this C++ exception class is thrown from Win32-specific portions of C++ code, it can be caught by cross-platform C++ code! So, it does make sense to use portable typedefs instead of Win32-specific ones; uint32_t is an example of such types.

Next, a constructor can be defined, to initialize instances of this custom exception class with an error message and error code:

public:
  Utf8ConversionException(
    const char* message,
    uint32_t errorCode
  )
    : std::runtime_error(message)
    , _errorCode(errorCode)
  { }

Finally, a public getter can be defined to provide read-only access to the error code:

uint32_t ErrorCode() const
  {
    return _errorCode;
  }
}; // Exception class

Because this class is derived from std::runtime_error, it’s possible to call the what method to get the error message passed in the constructor. Note that in the definition of this class, only portable standard elements have been used, so this class is perfectly consumable in cross-platform portions of C++ code, even those located far away from the Windows-specific throwing point.
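
As a hypothetical usage example (ProcessText and its input are placeholders, and the Utf8ToUtf16 function whose prototype was shown earlier is assumed to be declared in an included header), cross-platform code could catch the exception like this:

#include <iostream>     // For std::cerr
#include <string>       // For std::string and std::wstring
#include "utf8except.h" // For Utf8ConversionException
void ProcessText(const std::string& maybeInvalidUtf8)
{
  try
  {
    const std::wstring utf16 = Utf8ToUtf16(maybeInvalidUtf8);
    // ... use the UTF-16 text ...
  }
  catch (const Utf8ConversionException& ex)
  {
    // what() returns the message passed to the constructor;
    // ErrorCode() returns the value captured from GetLastError
    std::cerr << ex.what() << " (error code: " << ex.ErrorCode() << ")\n";
  }
}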

Converting from UTF-8 to UTF-16: MultiByteToWideChar in Action

Now that the prototype of the conversion function has been defined and a custom C++ exception class implemented to properly represent UTF-8 conversion failures, it’s time to develop the body of the conversion function. As mentioned earlier, the conversion work from UTF-8 to UTF-16 can be done using the MultiByteToWideChar Win32 API. The terms “multi-byte” and “wide char” are historical. Basically, this API and its symmetric sibling WideCharToMultiByte were initially meant to convert between text stored in specific code pages and Unicode text, which uses the UTF-16 encoding in Win32 Unicode-enabled APIs. Wide char refers to wchar_t, so it’s associated with a wchar_t-based string, which is a UTF-16-encoded string. In contrast, a multi-byte string is a sequence of bytes expressed in a code page. The legacy concept of code page was then extended to include the UTF-8 encoding.

A typical usage pattern of this API consists of first calling MultiByteToWideChar to get the size of the result string. Then, some string buffer is allocated according to that size value. This is typically done using the std::wstring::resize method in case the destination is a UTF-16 string. (For further details, you may want to read my July 2015 article, “Using STL Strings at Win32 API Boundaries” at msdn.com/magazine/mt238407.) Finally, the MultiByteToWideChar function is invoked a second time to do the actual encoding conversion, using the destination string buffer previously allocated. Note that the same usage pattern applies to the symmetric WideCharToMultiByte API.

Let’s implement this pattern in C++ code, inside the body of the custom Utf8ToUtf16 conversion function. Start by handling the special case of an empty input string, in which an empty output wstring is simply returned:

#include <Windows.h> // For Win32 APIs
#include <string>    // For std::string and std::wstring
std::wstring Utf8ToUtf16(const std::string& utf8)
{
  std::wstring utf16; // Result
  if (utf8.empty())
  {
    return utf16;
  }

Conversion Flags

MultiByteToWideChar can be called for the first time to get the size of the destination UTF-16 string. This Win32 function has a relatively complex interface, and its behavior is defined according to some flags. Because this API will be called twice in the Utf8ToUtf16 conversion function’s body, it’s good practice for code readability and maintainability to define a named constant that can be used in both calls:

// Safely fails if an invalid UTF-8 character
// is encountered in the input string
constexpr DWORD kFlags = MB_ERR_INVALID_CHARS;

It’s also good practice from a security perspective to make the conversion process fail if an invalid UTF-8 sequence is found in the input string. The use of the MB_ERR_INVALID_CHARS flag is also encouraged in Michael Howard and David LeBlanc’s book, “Writing Secure Code, Second Edition” (Microsoft Press, 2003).

If your project uses an older Visual C++ compiler version that doesn’t support the constexpr keyword, you can substitute static const in that context.
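
For example, the same constant declared without constexpr would be:

// Same conversion flag, without constexpr (older Visual C++ compilers)
static const DWORD kFlags = MB_ERR_INVALID_CHARS;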

String Lengths and Safe Conversions from size_t to int

MultiByteToWideChar expects the input string length parameter expressed using the int type, while the length method of the STL string classes returns a value of type equivalent to size_t. In 64-bit builds, the Visual C++ compiler emits a warning signaling a potential loss of data for the conversion from size_t (whose size is 8 bytes) to int (which is 4 bytes in size). But even in 32-bit builds, where both size_t and int are defined as 32-bit integers by the Visual C++ compiler, there’s an unsigned/signed mismatch: size_t is unsigned, while int is signed. This isn’t a problem for strings of reasonable length, but for gigantic strings of length greater than (2^31 - 1)—that is, more than 2 billion bytes in size—the conversion from an unsigned integer (size_t) to a signed integer (int) can generate a negative number, and negative lengths don’t make sense.

So, instead of just invoking utf8.length to get the size of the source UTF-8 input string and passing that to the MultiByteToWideChar API, it’s better to check that the actual size_t value of the length can be safely and meaningfully converted to an int, and only then pass it to the MultiByteToWideChar API.

The following code can be used to make sure that size_t-length doesn’t exceed the maximum value for a variable of type int, throwing an exception if it does:

if (utf8.length() > static_cast<size_t>(std::numeric_limits<int>::max()))
{
  throw std::overflow_error(
    "Input string too long: size_t-length doesn't fit into int.");
}

Note the use of the std::numeric_limits class template (from the <limits> C++ standard header) to query the largest possible value for the type int. However, this code might actually not compile. How’s that? The problem is in the definition of the min and max macros in the Windows Platform SDK headers. In particular, the Windows-specific definition of the max preprocessor macro conflicts with the std::numeric_limits<int>::max member function call. There are some ways to prevent that.

A possible solution is to #define NOMINMAX before including <Windows.h>. This will prevent the definition of the min and max Windows-specific preprocessor macros. However, preventing the definition of these macros may actually cause problems with other Windows headers, such as <gdiplus.h>, which do require the definitions of these Windows-specific macros.
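
In practice, that approach looks like this (a minimal sketch; whether it’s suitable depends on the other Windows headers your project includes):

// Prevent <Windows.h> from defining the min and max macros
#define NOMINMAX
#include <Windows.h>
#include <limits> // std::numeric_limits<int>::max can now be called normally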

So, another option is to use an extra pair of parentheses around the std::numeric_limits::max member function call, to prevent the unwanted macro expansion:

if (utf8.length() > static_cast<size_t>((std::numeric_limits<int>::max)()))
{
  throw std::overflow_error(
    "Input string too long: size_t-length doesn't fit into int.");
}

Alternatively, the INT_MAX constant could be used instead of the C++ std::numeric_limits class template.
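
That alternative check might look like this (INT_MAX is defined in the <climits> header):

// Using INT_MAX (from <climits>) instead of std::numeric_limits
if (utf8.length() > static_cast<size_t>(INT_MAX))
{
  throw std::overflow_error(
    "Input string too long: size_t-length doesn't fit into int.");
}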

Whatever approach is used, once the size check is done and the length value is found to be suitable for a variable of type int, the cast from size_t to int can be safely performed using static_cast:

// Safely convert from size_t (STL string's length)
// to int (for Win32 APIs)
const int utf8Length = static_cast<int>(utf8.length());

Note that the UTF-8 string’s length is measured in 8-bit char units; that is, in bytes.

First API Call: Getting the Destination String’s Length

Now MultiByteToWideChar can be called for the first time, to get the destination UTF-16 string’s length:

const int utf16Length = ::MultiByteToWideChar(
  CP_UTF8,       // Source string is in UTF-8
  kFlags,        // Conversion flags
  utf8.data(),   // Source UTF-8 string pointer
  utf8Length,    // Length of the source UTF-8 string, in chars
  nullptr,       // Unused - no conversion done in this step
  0              // Request size of destination buffer, in wchar_ts
);

Note how the function is invoked passing zero as the last argument. This instructs the MultiByteToWideChar API to just return the required size for the destination string; no conversion is done in this step. Note also that the size of the destination string is expressed in wchar_ts (not in 8-bit chars), which makes sense, because the destination string is a UTF-16-encoded Unicode string, made by sequences of 16-bit wchar_ts.

To get read-only access to the input UTF-8 std::string’s content, the std::string::data method is called. Because the UTF-8 string’s length is explicitly passed as an input parameter, this code will also work for instances of std::string that have embedded NULs.

Note also the use of the CP_UTF8 constant to specify that the input string is encoded in UTF-8.

Handling the Error Case

If the preceding function call fails, for example, in the presence of invalid UTF-8 sequences in the input string, the MultiByteToWideChar API returns zero. In this case, the GetLastError Win32 function can be invoked to get further details about the cause of the failure. A typical error code returned in case of invalid UTF-8 characters is ERROR_NO_UNICODE_TRANSLATION.

In case of failure, it’s time to throw an exception. This can be an instance of the custom Utf8ConversionException class designed earlier:

if (utf16Length == 0)
{
  // Conversion error: capture error code and throw
  const DWORD error = ::GetLastError();
  throw Utf8ConversionException(
    "Cannot get result string length when converting " \
    "from UTF-8 to UTF-16 (MultiByteToWideChar failed).",
    error);
}

Allocating Memory for the Destination String

If the Win32 function call succeeds, the required destination string length is stored in the utf16Length local variable, so the destination memory for the output UTF-16 string can be allocated. For UTF-16 strings stored in instances of the std::wstring class, a simple call to the resize method would be just fine:

utf16.resize(utf16Length);

Note that because the length of the input UTF-8 string was explicitly passed to MultiByteToWideChar (instead of just passing -1 and asking the API to scan the whole input string until a NUL-terminator is found), the Win32 API won’t add an additional NUL-terminator to the resulting string: The API will just process the exact number of chars in the input string specified by the explicitly passed length value. Therefore, there’s no need to invoke std::wstring::resize with a “utf16Length + 1” value: Because no additional NUL-terminator will be scribbled in by the Win32 API, you don’t have to make room for it inside the destination std::wstring (more details on that can be found in my July 2015 article).

Second API Call: Doing the Actual Conversion

Now that the UTF-16 wstring instance has enough space to host the resulting UTF-16-encoded text, it’s finally time to call MultiByteToWideChar for the second time, to get the actual converted bits in the destination string:

// Convert from UTF-8 to UTF-16
int result = ::MultiByteToWideChar(
  CP_UTF8,       // Source string is in UTF-8
  kFlags,        // Conversion flags
  utf8.data(),   // Source UTF-8 string pointer
  utf8Length,    // Length of source UTF-8 string, in chars
  &utf16[0],     // Pointer to destination buffer
  utf16Length    // Size of destination buffer, in wchar_ts          
);

Note the use of the “&utf16[0]” syntax to gain write access to the std::wstring’s internal memory buffer (this, too, was already discussed in my July 2015 article).

If the first call to MultiByteToWideChar succeeds, it’s unlikely this second call will fail. Still, checking the API return value is certainly a good, safe coding practice:

if (result == 0)
{
  // Conversion error: capture error code and throw
  const DWORD error = ::GetLastError();
  throw Utf8ConversionException(
    "Cannot convert from UTF-8 to UTF-16 "\
    "(MultiByteToWideChar failed).",
    error);
}

Otherwise, in case of success, the resulting UTF-16 string can finally be returned to the caller:

return utf16;
} // End of Utf8ToUtf16

Usage Sample

So, if you have a UTF-8-encoded Unicode string (for example, coming from some cross-platform C++ code) and you want to pass it to a Win32 Unicode-enabled API, this custom conversion function can be invoked like so:

std::string utf8Text = /* ...some UTF-8 Unicode text ... */;
// Convert from UTF-8 to UTF-16 at the Win32 API boundary
::SetWindowText(myWindow, Utf8ToUtf16(utf8Text).c_str());
// Note: In Unicode builds (Visual Studio default) SetWindowText
// is expanded to SetWindowTextW

The Utf8ToUtf16 function returns a wstring instance containing the UTF-16-encoded string, and the c_str method is invoked on this instance to get a C-style raw pointer to a NUL-terminated string, to be passed to Win32 Unicode-enabled APIs.

Very similar code can be written for the reverse conversion from UTF-16 to UTF-8, this time calling the WideCharToMultiByte API. As I noted earlier, Unicode conversions between UTF-8 and UTF-16 are lossless—no characters will be lost during the conversion process.
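
To give an idea of its shape, here’s a condensed sketch of that reverse function, following the same two-call pattern (the complete, tested version is in the article’s download; the WC_ERR_INVALID_CHARS flag requires Windows Vista or later):

#include <Windows.h>    // For Win32 APIs
#include <limits>       // For std::numeric_limits
#include <stdexcept>    // For std::overflow_error
#include <string>       // For std::string and std::wstring
#include "utf8except.h" // For Utf8ConversionException
std::string Utf16ToUtf8(const std::wstring& utf16)
{
  std::string utf8; // Result
  if (utf16.empty())
  {
    return utf8;
  }
  // Safely fails if an invalid UTF-16 sequence (e.g. an unpaired
  // surrogate) is encountered in the input string
  constexpr DWORD kFlags = WC_ERR_INVALID_CHARS;
  if (utf16.length() > static_cast<size_t>((std::numeric_limits<int>::max)()))
  {
    throw std::overflow_error(
      "Input string too long: size_t-length doesn't fit into int.");
  }
  const int utf16Length = static_cast<int>(utf16.length());
  // First call: get the length of the destination UTF-8 string, in chars
  const int utf8Length = ::WideCharToMultiByte(
    CP_UTF8,           // Destination string is in UTF-8
    kFlags,            // Conversion flags
    utf16.data(),      // Source UTF-16 string pointer
    utf16Length,       // Length of source UTF-16 string, in wchar_ts
    nullptr,           // Unused - no conversion done in this step
    0,                 // Request size of destination buffer, in chars
    nullptr, nullptr); // Must both be nullptr when converting to UTF-8
  if (utf8Length == 0)
  {
    const DWORD error = ::GetLastError();
    throw Utf8ConversionException(
      "Cannot get result string length when converting "
      "from UTF-16 to UTF-8 (WideCharToMultiByte failed).",
      error);
  }
  utf8.resize(utf8Length);
  // Second call: do the actual conversion into the destination buffer
  const int result = ::WideCharToMultiByte(
    CP_UTF8,           // Destination string is in UTF-8
    kFlags,            // Conversion flags
    utf16.data(),      // Source UTF-16 string pointer
    utf16Length,       // Length of source UTF-16 string, in wchar_ts
    &utf8[0],          // Pointer to destination buffer
    utf8Length,        // Size of destination buffer, in chars
    nullptr, nullptr); // Must both be nullptr when converting to UTF-8
  if (result == 0)
  {
    const DWORD error = ::GetLastError();
    throw Utf8ConversionException(
      "Cannot convert from UTF-16 to UTF-8 "
      "(WideCharToMultiByte failed).",
      error);
  }
  return utf8;
}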

Unicode Encoding Conversion Library

Sample compilable C++ code is included in the downloadable archive associated with this article. This is reusable code, compiling cleanly at the Visual C++ warning level 4 (/W4) in both 32-bit and 64-bit builds. It’s implemented as a header-only C++ library. Basically, this Unicode encoding conversion module consists of two header files: utf8except.h and utf8conv.h. The former contains the definition of a C++ exception class used to signal error conditions during the Unicode encoding conversions. The latter implements the actual Unicode encoding conversion functions.

Note that utf8except.h contains only cross-platform C++ code, making it possible to catch the UTF-8 encoding conversion exception anywhere in your C++ projects, including portions of code that aren’t Windows-specific, and instead use cross-platform C++ by design. In contrast, utf8conv.h contains C++ code that’s Windows-specific, because it directly interacts with the Win32 API boundary.

To reuse this code in your projects, just #include the aforementioned header files. The downloadable archive contains an additional source file implementing some test cases, as well.

Wrapping Up

Unicode is the de facto standard for representing international text in modern software. Unicode text can be encoded in various formats: The two most important ones are UTF-8 and UTF-16. In C++ Windows code there’s often a need to convert between UTF-8 and UTF-16, because Unicode-enabled Win32 APIs use UTF-16 as their native Unicode encoding. UTF-8 text can be conveniently stored in instances of the STL std::string class, while std::wstring is well-suited to store UTF-16-encoded text in Windows C++ code targeting the Visual C++ compiler.

The Win32 APIs MultiByteToWideChar and WideCharToMultiByte can be used to perform conversions between Unicode text represented using the UTF-8 and UTF-16 encodings. I gave a detailed description of the usage pattern of the MultiByteToWideChar API, wrapping it in a reusable modern C++ helper function to perform conversions from UTF-8 to UTF-16. The reverse conversion follows a very similar pattern, and reusable C++ code implementing it is available in this article’s download.


Giovanni Dicanio is a computer programmer specializing in C++ and Windows, a Pluralsight author and a Visual C++ MVP. Besides programming and course authoring, he enjoys helping others on forums and communities devoted to C++, and can be contacted at giovanni.dicanio@gmail.com. He also writes a blog at blogs.msmvps.com/gdicanio.

Thanks to the following technical experts for reviewing this article: David Cravey and Marc Gregoire
David Cravey is an Enterprise Architect at GlobalSCAPE, leads several C++ user groups, and was a four-time Visual C++ MVP.

Marc Gregoire is a senior software engineer from Belgium, the founder of the Belgian C++ Users Group, author of “Professional C++” (Wiley), co-author of “C++ Standard Library Quick Reference” (Apress), technical editor on numerous books, and since 2007, has received the yearly MVP award for his VC++ expertise. Marc can be contacted at marc.gregoire@nuonsoft.com.

