CLR Inside Out
Windows Vista Globalization Features
Shawn Steele

Windows XP and the Microsoft .NET Framework both have APIs that support globalization. Windows Vista™ will further extend globalization support by introducing several new features.
When you begin to think about Windows Vista, you'll probably want to know how cultures in the .NET Framework compare to locales in Windows Vista, what the difference is between a managed encoding and a Windows®-based code page, and why Windows Vista has Internationalized Domain Name (IDN) risk mitigation functions that aren't available in managed code. This month's column will answer those and other questions about globalization features in the .NET Framework 2.0 and Windows Vista.
Custom Locales and Custom Cultures
The .NET Framework uses the term "culture" (or "region") to refer to what Windows refers to as a "locale." In the October 2005 issue of MSDN® Magazine, Michael Kaplan and Cathy Wissink discussed the need for user-defined culture data in the .NET Framework (see msdn.microsoft.com/msdnmag/issues/05/10/Globalization). Custom cultures provide such a mechanism for defining data for the CultureInfo and RegionInfo objects used across the machine. Using the new .NET Framework 2.0 CultureAndRegionInfoBuilder class, you can create new cultures or modify existing ones and register them for use on the machine.
All of the scenarios that require .NET Framework custom cultures can also apply to Windows Vista. For example, a company in the United States may prefer a particular short date format such as yyyy/mm/dd instead of the U.S. default (mm/dd/yyyy). In such a case, you can create a custom culture that replaces the existing en-US culture data with this specific short date format. That customized culture can then be installed on all machines in the company to support this corporate preference.
Another scenario is to derive a custom culture from another culture. For example, you could use an English language culture as a template to create a new English language culture for India. Many fields, such as month and day names, would remain the same, but by changing the currency symbol, region names, and other information you could create a culture appropriate for English in India.
For completely new cultures you could create a new custom locale entirely from scratch, specifying all of the region and culture data appropriate to that locale. When registered on the machine, all applications would then have access to this completely new locale by using the normal CultureInfo constructor.
.NET-based cultures and regions are comparable to locales in Windows, and Windows has the same need for custom locales that the .NET Framework has for custom cultures. The .NET Framework 2.0 CultureAndRegionInfoBuilder fulfills this need for Windows Vista as well. Any custom cultures registered by the .NET Framework are automatically shared as custom locales in Windows Vista, so a single custom culture definition serves both managed and unmanaged code.
In Windows Vista a custom locale created or modified by the .NET Framework 2.0 CultureAndRegionInfoBuilder can be selected by a user as his default locale. This gives legacy unmanaged applications access to custom locale data when it is the user's default (see Figure 1). The locale in Figure 2 requires a custom font and keyboard, but shows that even complex custom locale solutions can now be implemented.
Figure 1 Custom Locale in Windows Vista
Replacement .NET Framework cultures now become replacement Windows Vista locales. This allows users to modify the short time format or other values for a particular locale on the entire system. Those culture/locale changes will be respected by both managed and unmanaged applications. Modifying the data of existing locales can be fairly straightforward, as you can see in Figure 3.

Figure 3 Modifying Existing Locales
Code
|
Imports System
Imports System.Globalization

Public Class Example
    Public Shared Sub Main()
        ' Show a date format before our changes
        Console.WriteLine("Before Modification:")
        Console.WriteLine(DateTime.Now.ToString("d", _
            New CultureInfo("en-US", False)))

        ' Make a replacement for en-US
        Dim carib As New CultureAndRegionInfoBuilder("en-US", _
            CultureAndRegionModifiers.Replacement)

        ' Modify the short date format
        Dim dtfi As DateTimeFormatInfo = carib.GregorianDateTimeFormat
        dtfi.ShortDatePattern = "yyyy/MM/dd"
        carib.GregorianDateTimeFormat = dtfi

        ' Make it available on this machine
        ' note that this impacts all applications
        carib.Register()

        ' Show a date format after our changes, specify no user
        ' override to demonstrate that this comes from the custom
        ' culture, not a user override
        Console.WriteLine("After Modification:")
        Console.WriteLine(DateTime.Now.ToString("d", _
            New CultureInfo("en-US", False)))

        ' Remove the replacement
        CultureAndRegionInfoBuilder.Unregister("en-US")
    End Sub
End Class
|
Output
|
> vbc /r:sysglobl.dll example.vb
> example.exe
Before Modification:
2/13/2006
After Modification:
2006/02/13
|
Figure 2 Complex Custom Locale Solution
Some locales may have multiple acceptable spellings or other user preferences dictated by regional dialects or other cultural differences. Occasionally the generic data provided by Windows Vista must choose one form which, although technically correct, may not be appropriate for all users in that locale. Users whose expectations differ from those of the community where the data was collected can now use the CultureAndRegionInfoBuilder to change both the .NET Framework and Windows Vista culture data to fit their preferences, even if theirs is a minority preference.
Microsoft® Locale Builder for Windows Vista will help users build these custom locales by providing a user-friendly tool that will allow even novices to customize their locale data. Even though it modifies Windows Vista locale data, the tool itself is written in C# and relies on the .NET Framework 2.0 CultureAndRegionInfoBuilder class to create these custom locales. See the sidebar "Some Custom Culture Gotchas" for more information.
Historically, Microsoft has shipped the most recent culture or locale data available, whether in the .NET Framework or Windows. Unfortunately, the data doesn't always stay the same. Cultures continue to be added, cultural changes can occur (such as the adoption of the Euro), governments can change standards, and cultural variations can come into favor. Since .NET and Windows have been shipped at different times, they are usually out of sync.
By using the CultureAndRegionInfoBuilder class to build custom cultures/locales, data can be shared between the .NET Framework 2.0 and Windows Vista, ensuring the consistency of culture data between managed and unmanaged applications.
Windows-Only Cultures and Regions
Microsoft continues to expand the list of locales supported by Windows and the .NET Framework. The .NET Framework 2.0 shipped with the locales that were available in Windows XP SP1. Since that time, Microsoft has continued to add locales to Windows. With Windows Vista, even more locales will be added. The .NET Framework 2.0 can access all locales that it finds in Windows by synthesizing a "Windows-only" culture and region from the data returned by Windows APIs. The .NET Framework 2.0 provides all of the data for this Windows-only culture by querying Windows for all values, including data formats, strings, sort behavior, and every other value related to this Windows-only culture or region.
One interesting difference between the Windows-only cultures and the built-in Framework cultures is that Windows does not support region-neutral cultures or language-neutral regions. Region-neutral cultures are cultures without a specific region. In the Framework, each specific culture, such as en-US, has a neutral parent culture such as en, which then has the Invariant culture as its parent. Language-neutral regions such as US do not exist in Windows.
In Windows, locales with only a language name do not exist. When the Framework cannot find a region-neutral parent for a specific Windows-only culture, the Invariant culture is used instead for that culture's parent. Managed applications need to use a specific full culture name to take advantage of these synthetic Windows-only cultures.
Region-only information is similarly unavailable to Windows and Windows-only regions. Microsoft recommends that users construct RegionInfo objects from a complete culture name (like en-US) rather than just a region name (like US). Using a full culture name is also necessary to get the region name in a culturally appropriate language, such as United States for en-US and Estados Unidos for es-US. In some cases this difference in region names is a geopolitical issue, so a specific culture should always be used when possible.
The Windows-only cultures provide a way for the .NET Framework to support the same set of locales that the operating system supports. One caveat is that .NET installations on one machine may not have the same set of Windows-only cultures as another machine. For example, one machine could have a different version of Windows or service pack than another computer. The CultureInfo.GetCultures method can filter on the type of culture desired, and the CultureInfo.CultureTypes property will include the WindowsOnlyCultures value for Windows-only cultures.
No More LCIDs?
It's probably a little early to say that locale identifiers (LCIDs) are no longer needed, but I can imagine such a future. Windows has historically relied on the LCID to identify locales. From its inception, the .NET Framework has been more flexible, allowing the use of names to identify cultures. .NET still allows CultureInfo construction by LCID, but names are preferred, and with custom locales names are required.
With the advent of custom cultures, names have become even more important since LCIDs can only be assigned by Microsoft. This is as true for Windows Vista locales as for the .NET Framework. Since C code doesn't allow for overloading, Ex versions have been created to accept locale names for all of the National Language Support (NLS) APIs in Windows Vista. So GetLocaleInfoEx, EnumSystemLocalesEx, and so on have been created to allow querying locale data by name identifiers. These functions will allow native applications to access data in new .NET Framework custom cultures.
It is expected that custom cultures/locales will become very important to Windows and the .NET Framework.
Ethnologue lists 6,912 languages in the world, with 162 spoken in the U.S. alone. Microsoft is adding over 60 new locales to Windows Vista on top of the 137 available for Windows XP. This includes new support for emerging markets in Africa, South America, and Asia, increased support for regional languages in established markets in Europe, and support for Spanish (U.S.). However, sourcing and maintaining data for all the language and country combinations in the world would be very difficult, particularly in emerging markets that are newer to technology. Even identifying a required locale set is challenging since everyone's locale is important to them and it is difficult to get a clear perspective on the needs of so many different groups.
For interoperability with legacy applications, data, and standards, Microsoft has added LCIDToLocaleName and LocaleNameToLCID functions to Windows Vista. Combined with the CultureInfo.LCID property, applications should have the tools necessary to migrate to locale names even in mixed managed and unmanaged code environments.
While the LCID will need to remain for legacy applications and other cases in the near future, culture names (and locale names) are now predominant. The .NET Framework 2.0 has enabled a plethora of custom cultures/locales, and the Microsoft Locale Builder for Windows Vista (shown in Figure 4) will make it even easier to create and distribute custom locale solutions. Users can select their own name identifiers without having to rely on Microsoft to assign an LCID, and they can use the new name-based APIs to query their data. The process started by .NET is actually hastening the demise of the LCID.
Figure 4 Microsoft Locale Builder for Windows Vista
Remember, though, that without LCIDs you need to use culture and locale names. Microsoft recommends getting culture and region data by full culture name, like en-US. Also, keep in mind that Microsoft .NET culture names respect the syntax of the Internet Engineering Task Force (IETF) standard names. When names include language and region, they are of the form language-region (for example, en-US, fr-FR). Some locales also include the script used in the locale. For example, there are two locales for Uzbek-Uzbekistan: uz-Cyrl-UZ for the Cyrillic script locale and uz-Latn-UZ for the Latin script locale. Additionally, some IETF names have unusual codes. For example, en-029 is the IETF code for the Windows English (Caribbean) locale. Unmanaged applications should be using GetLocaleInfoEx and related functions that accept names.
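The name shapes described above (language, optional four-letter script, optional region) can be pulled apart with a few lines of portable C. This is only a sketch for illustration: the LocaleParts type and the parse_locale_name function are hypothetical helpers, not part of Windows or the .NET Framework, and they accept just the three forms mentioned in this column rather than the full IETF grammar.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type, not part of any Windows or .NET API. */
typedef struct {
    char language[9];  /* for example "en" or "uz" */
    char script[5];    /* for example "Cyrl"; empty when absent */
    char region[4];    /* for example "US" or "029"; empty when absent */
} LocaleParts;

/* Copies len characters of a subtag into dst; fails if empty or too long. */
static int copy_subtag(char *dst, size_t cap, const char *s, size_t len)
{
    if (len == 0 || len >= cap) return 0;
    memcpy(dst, s, len);
    dst[len] = '\0';
    return 1;
}

/* Splits language[-script][-region]. Returns 1 on success, 0 if malformed. */
int parse_locale_name(const char *name, LocaleParts *out)
{
    const char *p = name;
    const char *dash;
    size_t len;

    memset(out, 0, sizeof *out);

    /* First subtag: the language. */
    dash = strchr(p, '-');
    len = dash ? (size_t)(dash - p) : strlen(p);
    if (!copy_subtag(out->language, sizeof out->language, p, len)) return 0;
    if (!dash) return 1;
    p = dash + 1;

    /* Optional script: a four-character subtag that starts with a letter. */
    dash = strchr(p, '-');
    len = dash ? (size_t)(dash - p) : strlen(p);
    if (len == 4 && isalpha((unsigned char)p[0])) {
        if (!copy_subtag(out->script, sizeof out->script, p, len)) return 0;
        if (!dash) return 1;
        p = dash + 1;
        dash = strchr(p, '-');
        len = dash ? (size_t)(dash - p) : strlen(p);
    }

    /* Region: whatever remains, at most three characters, with nothing after it. */
    if (dash != NULL) return 0;
    return copy_subtag(out->region, sizeof out->region, p, len);
}
```

Passing "uz-Cyrl-UZ" fills in all three fields, while "en-029" leaves the script empty and keeps the numeric region code as is.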
.NET Encodings and Windows Code Pages
Code pages, encodings, and character sets map character data from one form, such as Unicode, to another, such as ASCII. Most code pages support only a subset of the characters supported by the .NET Framework or Windows. Unfortunately, there are variations in how those subsets are implemented. The easiest way to avoid unexpected conversion behavior is to use standard Unicode encodings such as UTF-8 or UTF-16.
The .NET Framework uses encoding and the Encoding class to refer to the mapping of the internal UTF-16 Unicode representation to a different byte representation. Windows uses the term code page to refer to the mapping tables used by its APIs to do the same thing.
Most encodings only handle a subset of Unicode. ASCII, for example, can only encode 128, or 0.1 percent, of the more than 95,000 characters in Unicode. Depending on the mechanism used, the remaining 99.9 percent of Unicode would be ignored, converted to another character like ?, or result in errors. Other single-byte code pages, like Windows-1252, only raise that figure to about 0.25 percent. Some common double-byte code pages in Windows still only accurately convert about 10 percent of the characters in Unicode.
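The three outcomes for unencodable characters (ignored, replaced, or rejected) are easy to see in a toy converter. The sketch below is plain C with hypothetical names (UnmappablePolicy, to_ascii); it is not how Windows or the .NET Framework implement ASCII conversion, only an illustration of the choices a converter has to make.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical policy names for characters the target encoding cannot hold. */
typedef enum { POLICY_SKIP, POLICY_REPLACE, POLICY_ERROR } UnmappablePolicy;

/* Converts n Unicode code points to ASCII in out (capacity of at least n + 1).
   Returns the number of bytes written, or -1 if POLICY_ERROR is selected and
   a code point outside the 128 ASCII values is found. */
long to_ascii(const uint32_t *in, size_t n, char *out, UnmappablePolicy policy)
{
    size_t written = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            out[written++] = (char)in[i];   /* representable: copy through */
        } else if (policy == POLICY_REPLACE) {
            out[written++] = '?';           /* substitute a stand-in character */
        } else if (policy == POLICY_ERROR) {
            return -1;                      /* refuse to lose data silently */
        }
        /* POLICY_SKIP: the character is dropped without a trace */
    }
    out[written] = '\0';
    return (long)written;
}
```

Feeding it the four code points of "café" yields "caf" under POLICY_SKIP, "caf?" under POLICY_REPLACE, and -1 under POLICY_ERROR; only the last makes the data loss visible to the caller.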
Another issue with encodings is that new standards have been created and existing code pages have been extended, corrected, or otherwise changed. Some encodings have unassigned code points that can be used by applications for their own behavior. As the character set of those encodings increases, sometimes those code points are reused for new characters. Sometimes a private use region becomes well-known and accepted as part of the standard. Those changes can break applications or data that depend on the old behavior.
Historically, different operating systems, vendors, and applications have had unique algorithms for mapping characters between various encodings, often because different versions of encoding standards were adopted by different vendors. In other situations vendors may have had needs that differed slightly from existing standards. Sometimes there are even unintentional bugs in encoding behavior.
These differences between encoding implementations can cause data to be corrupted. Fortunately, those corruptions are usually confined to edge cases or to differences between revisions of a standard; however, that is little comfort to people whose data depends on one of the affected characters. Round-tripping from Unicode to an encoding and back to Unicode is usually less lossy if the same API set is used for both encoding and decoding the data.
All of these complexities mean that even though a character may encode and decode correctly using one set of APIs, it may not correctly round-trip when it shifts between platforms. Everyone has probably seen square boxes instead of rich quotes on well-known Web sites or in popular publications. This is often because the computer, application, or browser client hasn't interpreted the character data in the intended way, even if it thought it was using the correct encoding. Plus, in some languages it is common for certain characters to become corrupted on Web pages because of inconsistent interpretations between vendors, changes to a code page standard, or mislabeled character set tags.
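A mislabeled character set is enough to produce the quote problem just described. In Windows-1252 the bytes 0x91 through 0x94 are the curly quotation marks, while ISO-8859-1 assigns those same bytes to control characters that have no visible glyph. The following C sketch makes the mismatch concrete; the function names are hypothetical, and the Windows-1252 decoder deliberately covers only the four quote bytes (every other byte is passed through unchanged, which is a simplification and not the real table).

```c
#include <stdint.h>

/* ISO-8859-1: every byte maps to the code point with the same value. */
uint32_t decode_latin1(unsigned char b)
{
    return b;
}

/* Windows-1252, restricted to the four curly-quote bytes for illustration.
   Assumption: all other bytes are passed through as if they were Latin-1. */
uint32_t decode_cp1252_quotes(unsigned char b)
{
    switch (b) {
    case 0x91: return 0x2018;  /* left single quotation mark */
    case 0x92: return 0x2019;  /* right single quotation mark */
    case 0x93: return 0x201C;  /* left double quotation mark */
    case 0x94: return 0x201D;  /* right double quotation mark */
    default:   return b;
    }
}

/* Returns 1 if the byte means the same thing under both interpretations. */
int same_under_both(unsigned char b)
{
    return decode_latin1(b) == decode_cp1252_quotes(b);
}
```

A page saved as Windows-1252 but labeled ISO-8859-1 therefore turns byte 0x93 into U+0093 rather than U+201C, and a reader may see a box or nothing at all where the opening quote should be.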
At Microsoft, the philosophy has been that if a user saves a document using a set of APIs, he should be able to get the same data back with the same set of APIs. Historically, some code pages have encoded characters not yet assigned by a standard. Other standards have evolved to include new characters, and Microsoft has provided varying levels of support for other standards. Newer standards may be more correct than those provided by Microsoft, but the extension of a standard must be balanced with the cost of causing customers to lose data they saved several years ago.
.NET Encoding Classes and Windows Code Page Functions
In Windows there are two primary ways to convert character data: MultiByteToWideChar/WideCharToMultiByte and MLang. In the .NET Framework there are the Encoding classes. All three have their own unique quirks, which means that none of them can read the data written by the others with 100 percent accuracy.
MLang has several features designed to work with international code pages. In general, MultiByteToWideChar is preferred to calling the MLang conversion functions. MLang is used by Microsoft Internet Explorer® for its code page detection and conversion. The .NET Framework Encoding class provides similar conversion functionality, although some code points may convert differently in some code pages. For example, in ASCII, MLang maps characters with the eighth bit set by just dropping the eighth bit, which could allow spoofing. Internally, MLang converts from Unicode to a specific code page and then tries to adapt that result to the target code page. Its algorithms sometimes cause random mappings, particularly of undefined sequences in a code page. The .NET Framework has cleaner implementations without the random mappings. The .NET Encoding class is also more consistent about restoring the state to a baseline when writing multiple sequences. MLang could cause inadvertent mode changes, such as reading single byte data as double byte data. For more information on MLang, see msdn.microsoft.com/workshop/misc/mlang/mlang.asp.
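To see why dropping the eighth bit invites spoofing, consider what happens to byte 0xBC: with the high bit cleared it becomes 0x3C, the "<" character. Input that looked harmless to a filter before conversion can contain markup afterward. The two C functions below are hypothetical illustrations of the risky mapping and of a safer replacement strategy; they are not taken from MLang or the .NET Framework.

```c
/* Risky: forces a byte into 7-bit ASCII by discarding its high bit. */
unsigned char drop_eighth_bit(unsigned char b)
{
    return (unsigned char)(b & 0x7F);
}

/* Safer: any byte outside ASCII becomes a visible stand-in character. */
unsigned char replace_non_ascii(unsigned char b)
{
    return b < 0x80 ? b : (unsigned char)'?';
}
```

With the first function, bytes 0xBC and 0xBE come out as "<" and ">"; with the second, both come out as "?", so nothing new and meaningful can appear in the converted text.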
MultiByteToWideChar and WideCharToMultiByte are the workhorses of the Windows APIs, and are used internally by most of Windows. They are also called by MLang. Some code pages differ from the .NET Encoding class versions, particularly the iso-2022-xx code pages. MultiByteToWideChar and WideCharToMultiByte also expect data to be converted in its entirety, whereas the .NET Framework provides the Encoder and Decoder classes. Calling MultiByteToWideChar from unmanaged C code is fairly straightforward, as you see here:
|
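int length;
WCHAR wcTemp[BUFFER_SIZE];
const char* bytes = "Input byte string";
/* -1 means the input is null-terminated; the result counts the
   terminating null, and 0 indicates failure */
length = MultiByteToWideChar(CP_UTF8, 0, bytes, -1,
    wcTemp, BUFFER_SIZE);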
|
The Encoding classes are used by managed code and include features not present in the Windows APIs. The Encoder and Decoder classes can be used to provide encoding or decoding of large buffers in multiple steps. Additionally the .NET Framework 2.0 has support for fallbacks, which provide user-specified behavior for unknown or corrupted characters.
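The reason a stateful decoder matters is that a multibyte character can straddle two buffers. The sketch below shows the idea in portable C for UTF-8: the decoder remembers a partially read sequence between calls. The Utf8Decoder type and utf8_decode_chunk function are hypothetical names for illustration, not the .NET Decoder class, and the code is simplified (it does not reject overlong forms or encoded surrogates).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder state; zero-initialize before the first chunk. */
typedef struct {
    uint32_t partial;  /* bits collected from an unfinished sequence */
    int      needed;   /* continuation bytes still expected */
} Utf8Decoder;

/* Decodes one chunk of UTF-8 into code points in out (capacity of at least
   len + 1). Returns the number of code points written. Malformed input is
   reported as U+FFFD. State carries over to the next call. */
size_t utf8_decode_chunk(Utf8Decoder *d, const unsigned char *bytes,
                         size_t len, uint32_t *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char b = bytes[i];
        if (d->needed > 0) {
            if ((b & 0xC0) == 0x80) {
                /* Continuation byte: add six more bits. */
                d->partial = (d->partial << 6) | (uint32_t)(b & 0x3F);
                if (--d->needed == 0) {
                    out[n++] = d->partial;
                }
                continue;
            }
            /* Sequence was cut short: flag it, then treat b as a new start. */
            out[n++] = 0xFFFD;
            d->needed = 0;
        }
        if (b < 0x80) {
            out[n++] = b;
        } else if ((b & 0xE0) == 0xC0) {
            d->partial = b & 0x1F; d->needed = 1;
        } else if ((b & 0xF0) == 0xE0) {
            d->partial = b & 0x0F; d->needed = 2;
        } else if ((b & 0xF8) == 0xF0) {
            d->partial = b & 0x07; d->needed = 3;
        } else {
            out[n++] = 0xFFFD;  /* stray continuation or invalid lead byte */
        }
    }
    return n;
}
```

If the three bytes of the euro sign (0xE2 0x82 0xAC) are split so that the last byte arrives in a second buffer, the first call emits nothing for them and the second call emits U+20AC. A one-shot conversion of each buffer in isolation would instead see two broken sequences.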
Here's a simple UTF-8 example using the default behavior:
|
Dim bytes() As Byte = Encoding.UTF8.GetBytes( _
"Convert this string to UTF8")
|
Here's one using the constructor that selects an exception fallback to throw on illegal input data:
|
Dim bytes() As Byte = New UTF8Encoding(true, true).GetBytes("Convert " & _
"this string and throw on illegal characters like " & ChrW(&hD800))
|
And here's an example using custom fallback to replace unknown characters with {unknown}:
|
Dim utf8 As Encoding = CType(Encoding.UTF8.Clone(), Encoding)
utf8.EncoderFallback = New EncoderReplacementFallback("{unknown}")
Dim bytes() As Byte = utf8.GetBytes( _
"Convert and replace illegal " & ChrW(&hD800) & " with {unknown}")
|
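The character used in these examples, U+D800, is illegal because it is a lone surrogate: surrogates exist only as halves of UTF-16 pairs and have no UTF-8 form of their own. A minimal encoder in portable C shows where such a check sits. The utf8_encode function is a hypothetical illustration rather than any platform's implementation; what the caller does when it returns 0 (throw, substitute text, or skip) corresponds to the fallback choices shown above.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one code point as UTF-8 into out (capacity of at least 4).
   Returns the number of bytes written, or 0 if cp is a surrogate or lies
   beyond U+10FFFF and therefore cannot be encoded. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* lone surrogate */
    if (cp > 0x10FFFF) return 0;                 /* outside Unicode */

    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Calling it with 0x20AC writes the three bytes 0xE2 0x82 0xAC, while calling it with 0xD800 writes nothing and returns 0.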
.NET-based applications that communicate with native Windows-based applications, or with applications on other platforms for that matter, need to be aware that encoding differences can occur. The preferable way to exchange data is to use Unicode, since that is the exact set of the characters supported by Windows or .NET. Unfortunately, some protocols rely on other standards and a conversion is necessary. As more and more applications move to Unicode, these cases should be limited. Fortunately, many newer or updated standards have some provision for Unicode.
In recognition of the importance of Unicode in both Windows and the .NET Framework, the .NET Framework 2.0 team put serious effort into optimizing the UTF-8 and UTF-16 Encodings. In nearly all cases UTF-8 is now significantly faster than most of the non-Unicode encodings. Similar effort was spent on Windows Vista to improve the speed of UTF-8 used by the MultiByteToWideChar functions. Incompatibilities between unmanaged and managed character conversion algorithms can be avoided by using these Unicode encodings.
Windows Vista and Managed IDN Mapping APIs
Windows Vista has also added unmanaged versions of the Unicode Normalization and IDN mapping APIs. As mentioned briefly in the aforementioned article by Michael Kaplan and Cathy Wissink, the IDN standard exposes a potential for spoofing attacks by way of the confusion some characters cause. Windows Vista has added APIs that can help mitigate some of those concerns. These APIs are not available to managed code, so developers of such code should be aware of potential security issues.
The primary concern with the IDN standard is that the character repertoire of a host name is increased greatly. That increases the possibilities of domain name spoofing through homographs. A homograph is a word that looks like another word but isn't. A common example is the creation of a paypal.com URL that is spelled pаypal.com instead of paypal.com. Obviously, it is very difficult for a human to see that the first "a" here in pаypal.com has been changed to a Cyrillic "а", code point U+0430, from a Latin "a", code point U+0061.
Deceptive URLs that need no IDN characters at all, such as mine.com/paypal.com/safe/logon, can also trap some users.
One approach to mitigate homographs is to detect the permissible scripts in a domain name. The idea is that domain names are probably restricted to a single script such as Latin or Cyrillic, but not both. So if a domain name is encountered that uses more scripts than expected, it is considered suspect. In some locales it is common to see Latin or other scripts mixed with the native script, such as in the name of a local division of a foreign company. In other cases, a single script can have many characters that appear similar, particularly if rendered in small fonts in a browser address bar. So this mitigation is not perfect, but can be used as part of a broader strategy.
The script detection approach is the primary idea in the Windows Vista mitigation APIs. These APIs allow for the detection and comparison of the scripts present in an IDN. The new unmanaged Windows Vista GetStringScripts function will return a list of scripts present in a string. GetLocaleInfoEx can now query for LOCALE_SSCRIPTS to discover the scripts expected for a particular locale, and VerifyScripts compares the two previous results to determine whether the scripts found in the string fall within the set expected by the locale (see Figure 5).
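The detect-then-compare idea can be tried out without the Windows Vista APIs. The portable C sketch below is not GetStringScripts or VerifyScripts: the names (script_of, scripts_in, scripts_allowed) are hypothetical, and the classification uses a few Unicode block ranges as a coarse approximation of real script data. It only distinguishes Latin, Cyrillic, and everything else.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical script flags; real script data is far more detailed. */
enum { SCRIPT_LATIN = 1, SCRIPT_CYRILLIC = 2, SCRIPT_OTHER = 4 };

/* Classifies one code point by block range (a coarse approximation). */
static unsigned script_of(uint32_t cp)
{
    if ((cp >= 0x41 && cp <= 0x5A) || (cp >= 0x61 && cp <= 0x7A))
        return SCRIPT_LATIN;                       /* A-Z, a-z */
    if (cp >= 0x00C0 && cp <= 0x024F && cp != 0x00D7 && cp != 0x00F7)
        return SCRIPT_LATIN;                       /* accented Latin letters */
    if (cp >= 0x0400 && cp <= 0x04FF)
        return SCRIPT_CYRILLIC;                    /* Cyrillic block */
    if (cp < 0x80)
        return 0;                                  /* digits, '.', '-': shared */
    return SCRIPT_OTHER;
}

/* Returns the set of scripts found in a string of n code points. */
unsigned scripts_in(const uint32_t *s, size_t n)
{
    unsigned found = 0;
    for (size_t i = 0; i < n; i++) {
        found |= script_of(s[i]);
    }
    return found;
}

/* Returns 1 if every script found is within the expected set. */
int scripts_allowed(unsigned found, unsigned expected)
{
    return (found & ~expected) == 0;
}
```

For the spoofed name discussed earlier, scripts_in reports both Latin and Cyrillic, so scripts_allowed fails when only Latin is expected. As the column notes, a check like this is one layer of a broader strategy, since legitimate names in some locales do mix scripts.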
Since the homograph concerns with IDN were raised after the .NET Framework 2.0 was released, there are not any managed equivalents to these Windows Vista APIs. As mentioned earlier, remember that these APIs cannot solve all of the security concerns regarding IDN names by themselves. Trusted managed applications could use P/Invoke to call these native Windows Vista APIs.
Windows and the .NET Framework support globalization needs in complementary ways. Each has its own strengths, but together they can provide excellent customer solutions—for example, the shared custom culture and locale behavior in Windows Vista. Each also approaches the tasks in different ways, as seen with the encoding and code page behavior. Therefore, a certain amount of care needs to be taken when creating solutions that utilize both managed and unmanaged functionality.