This content has moved to another location. See WideCharToMultiByte for the latest version.
Correction: under "Windows Vista and later:" the documentation says: "code that uses this function on valid UTF-8 operating systems will behave the same way on Windows Vista and later as on earlier Windows operating systems."
It probably meant to say valid UTF-8 strings, not valid UTF-8 operating systems.
As it turns out, WideCharToMultiByte() fills the buffer lpMultiByteStr even when it is too small to hold the converted wide string, but it does not zero-terminate it. I'm not sure why the function would fill the output buffer if it knows the buffer is too small and then not zero-terminate it. I can only assume it's because it converts byte after byte.
Consider this sample code:
wchar_t wstrTest[] = L"12345678";
char strTest[4] = { '\0' };
WideCharToMultiByte(CP_UTF8, 0, (LPCWSTR) wstrTest, -1, strTest, sizeof(strTest), NULL, NULL);
printf("Converted: %s\n", strTest);
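If you don't check the return value, any subsequent string operation will read past the end of strTest, because the buffer is completely filled and has no terminator. The output looks like this:
Converted: 1234╠╠╠╠╠╠╠╠1
WideCharToMultiByte() returns 0 in this case (GetLastError() reports ERROR_INSUFFICIENT_BUFFER), but it still fills strTest.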
In my opinion, WideCharToMultiByte() should either NOT fill the destination variable if it's too small, or terminate it with a 0 character.
If it is acceptable for the resulting multibyte string to be truncated and for data to be lost (which can be the case when you work with a fixed output buffer), then it's better to call the function like this:
WideCharToMultiByte(CP_UTF8, 0, (LPCWSTR) wstrTest, -1, strTest, (sizeof(strTest) - sizeof(char)), NULL, NULL);
... assuming that strTest has been zeroed out first, so that the last byte stays 0 and terminates the string. The call still returns 0 when the output is truncated.
With a Unicode code page (CP_UTF8 or CP_UTF7), if lpDefaultChar and lpUsedDefaultChar are set to non-NULL values, the function fails. However, it works with the same arguments if the last two are set to NULL. This was observed on Windows XP SP2.
----------
Shawn Steele
The replacement character isn't interesting for Unicode, since any valid string should be convertible to UTF-8. (UTF-7 isn't terribly secure and is not recommended.) The only exception is an invalid surrogate pair (a high surrogate with no following low surrogate, etc.), in which case it will be replaced with U+FFFD.
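If you would rather detect that case yourself than have U+FFFD show up silently in the output, the input can be scanned before converting. This is a sketch only: has_unpaired_surrogate is a hypothetical helper, and it takes uint16_t code units (rather than wchar_t) so that it does not depend on the platform's wchar_t size.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if the UTF-16 sequence s[0..len-1]
 * contains an unpaired surrogate, 0 if every surrogate is part of a
 * valid high/low pair. */
static int has_unpaired_surrogate(const uint16_t *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        uint16_t u = s[i];

        if (u >= 0xD800 && u <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate. */
            if (i + 1 >= len || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return 1;
            i += 2;
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            /* Low surrogate with no preceding high surrogate. */
            return 1;
        } else {
            i += 1;
        }
    }
    return 0;
}
```

A string for which this returns 1 is one where the conversion would substitute U+FFFD as described above.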
- lpUsedDefaultChar
- ...
- For the CP_UTF7 and CP_UTF8 settings for CodePage, this parameter must be set to a null pointer. Otherwise, the function fails with ERROR_INVALID_PARAMETER.
P.S. Probably, this was added after Skyfaller's comment? :)
(Shawn Steele - No, it was there before :))
See http://blogs.msdn.com/shawnste/archive/2006/01/19/515047.aspx and http://blogs.msdn.com/shawnste/archive/2007/06/08/why-can-t-we-strip-the-diacritics.aspx for some of my reasoning.
Best fit behavior causes some characters to behave identically to others. This can change the linguistic meaning in a very unfortunate way sometimes, and it can cause security problems.
If you do use best fit, realize that the mappings are somewhat random and that you might mangle customers' names, etc. Do any security verification after the conversion (in case the conversion creates new mappings that matter for your app's security).