Bugslayer

Basics of .NET Internationalization

John Robbins

Code download available at: Bugslayer0403.exe (120 KB)

Contents

Why Worry About Internationalization?
Definitions and Rules
The CultureInfo Class
Wrap-up
The Tips

It seems that most software developers in the U.S. don't think about making their software friendly to the rest of the world. If you're a long-time reader of the Bugslayer column, you'll remember that back in the Win32® days I always supported internationalizing my code. Internationalization has always been a passion of mine. Recently, I was involved in a Windows® Forms project, which had a firm requirement to be fully internationalized.

As I started researching internationalization in the Microsoft® .NET Framework, I was amazed to find excellent support for multiple languages baked right in. The only problem was that information on internationalization was scattered throughout the documentation and various books, which made getting started a lot harder than it should have been. In this, the first of two Bugslayer columns dedicated to .NET internationalization, I want to bring together the high points of this topic. In my next column, I'll discuss some utilities I whipped up for working with ResX files—the .NET equivalent of a Win32 resource file—to make your internationalization life much easier.

Everything I'll talk about for internationalization in these columns applies to both Windows Forms and ASP.NET applications. As ResX files are somewhat more relevant to Windows Forms applications, I'll get more specific with Windows Forms. Before I jump into the details, I'll discuss why internationalization is important so you can make a strong argument for incorporating it in your projects.

Why Worry About Internationalization?

From a business standpoint, a single statistic should convince your employer to deliver internationalized software: Microsoft makes over 60 percent of its revenue from outside the U.S. The more Microsoft supports localized languages, number formats, date formats, and so on, the more it increases sales. For example, if you're working in .NET, you don't have to write your own Input Method Editor (IME), the component that users of East Asian languages such as Japanese and Chinese rely on to enter characters that a keyboard can't produce directly. Instead, you simply tweak the setup and you're ready.

Even if you don't plan to ship a fully internationalized app, writing internationalized code means writing better code. The first major benefit is that the code is easier to maintain because you don't have any hardcoded strings, so changing the wording on a UI display means you can change just the resource, not force a recompile of the whole application. Second, you get much better UI modularity—the thing most people think of when they think about internationalization. By keeping the UI and the code as separate as possible, you make it easier to experiment with user interface design. For these reasons, you should always work internationalization into all your applications.

Definitions and Rules

If you've never looked at internationalization before, I want to define some key concepts so you can understand the rest of this column. The most important internationalization concept is the locale, which describes the country, region, and cultural conventions pertaining to the user. The locale tells you the user's language, how numbers, dates, and times should be formatted, how currency must be displayed, and what measurement system is used.

You can set your own locale at any time in Windows by opening Regional and Language Options in the Control Panel (Regional Options in Windows 2000) and, on the Regional Options tab (General tab in Windows 2000), selecting a locale from the first dropdown at the top of the dialog. On Windows XP and later, you'll see samples of what the formatted values will look like. There are no public API functions for changing the locale programmatically; it's something only the user can do.

To determine the locale, .NET follows the conventions specified by RFC 1766 from the Internet Engineering Task Force (IETF), which specifies a string with the following format: language code-country/region. Except for a few special cases, both portions of the format are two-character codes. For example, Spanish spoken in Mexico is "es-MX" and German spoken in Liechtenstein is "de-LI". As you can see, the language appears before the hyphen, and the country/region follows the hyphen. All languages and regions supported by .NET are listed in the documentation for the CultureInfo class.
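To make the two halves of the name concrete, here is a minimal sketch that splits a culture name into its language and country/region codes by using CultureInfo and RegionInfo. The class and method names (CultureNameDemo, Describe) are made up for illustration and are not part of this column's download.

```csharp
using System;
using System.Globalization;

public static class CultureNameDemo
{
    // Returns "language / region" for a specific culture name such as "es-MX".
    public static string Describe(string cultureName)
    {
        // false = ignore any user overrides from the Control Panel.
        CultureInfo culture = new CultureInfo(cultureName, false);

        // RegionInfo can be built from the culture's locale ID (LCID).
        RegionInfo region = new RegionInfo(culture.LCID);

        return culture.TwoLetterISOLanguageName + " / " +
               region.TwoLetterISORegionName;
    }

    public static void Main()
    {
        Console.WriteLine(Describe("es-MX")); // es / MX
        Console.WriteLine(Describe("de-LI")); // de / LI
    }
}
```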

In some examples later in this column, you'll see a locale string consisting of just the language portion, such as "fr" for French. By specifying just the language, you are saying that the language is set, but no country/region is applied (.NET calls this a neutral culture). This is very important when reusing common translations between multiple cultures. I will also discuss the invariant culture, represented by the empty string (""), which means that no language or culture is applied. The invariant culture can be used for storing data in a neutral culture format as well as providing the base culture for an assembly. One final comment on the locale string is that languages such as Azeri that have both Latin and Cyrillic variants will be expressed by appending "-Latn" or "-Cyrl," respectively, onto the locale string. Thus, to set the culture to Cyrillic Azeri as spoken in Azerbaijan, the locale string is "az-AZ-Cyrl".
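The relationship between a full locale string, its language-only form, and the invariant culture can be seen by walking the CultureInfo.Parent property. The following is a small sketch; the NeutralCultureDemo class and its ParentChain helper are hypothetical names chosen for this example.

```csharp
using System;
using System.Globalization;

public static class NeutralCultureDemo
{
    // Builds a string such as "fr-CH > fr > (invariant)" by following
    // CultureInfo.Parent until the invariant culture (empty name) is reached.
    public static string ParentChain(string cultureName)
    {
        CultureInfo culture = new CultureInfo(cultureName);
        string chain = "";
        while (culture.Name.Length != 0)
        {
            chain += culture.Name + " > ";
            culture = culture.Parent;
        }
        return chain + "(invariant)";
    }

    public static void Main()
    {
        // A language-only name is a neutral culture.
        Console.WriteLine(new CultureInfo("fr").IsNeutralCulture);    // True
        Console.WriteLine(new CultureInfo("fr-CH").IsNeutralCulture); // False

        Console.WriteLine(ParentChain("fr-CH")); // fr-CH > fr > (invariant)
    }
}
```

This parent chain is also why language-only translations can be shared: a lookup that finds nothing for "fr-CH" can fall back to "fr" and then to the invariant culture.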

To store the actual characters, Windows uses Unicode. In Unicode, all the characters of the world's alphabets can be represented. If you've ever experienced the pain of dealing with 8-bit ANSI and multibyte character sets, where some languages such as Japanese had one, two, or three bytes per character, you'll love Unicode. In the UTF-16 encoding, each character is generally stored as a 16-bit value, though there are 8-bit (UTF-8) and 32-bit (UTF-32) encodings as well. The good news for everyone using the .NET Framework is that UTF-16 Unicode is the native type for characters and strings. As long as you keep your data in the default character and string types, you get Unicode for free.

If you are sharing your string data with applications that do not support 16-bit Unicode, the various encoding classes in the System.Text namespace come into play. For example, most data sent across the wire uses UTF-8 (8-bit Unicode) as the encoding. In order to convert a standard .NET string into UTF-8, you'd use the UTF8Encoding class.
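As a rough illustration of that conversion, the sketch below round-trips a short string through UTF-8 with the UTF8Encoding class. The Utf8Demo class and its two helper methods are hypothetical names used only for this example.

```csharp
using System;
using System.Text;

public static class Utf8Demo
{
    // Encodes a .NET (UTF-16) string as UTF-8 bytes.
    public static byte[] ToUtf8(string text)
    {
        return new UTF8Encoding().GetBytes(text);
    }

    // Decodes UTF-8 bytes back into a .NET string.
    public static string FromUtf8(byte[] bytes)
    {
        return new UTF8Encoding().GetString(bytes);
    }

    public static void Main()
    {
        string original = "\u0391\u03A9"; // Greek capital Alpha and Omega
        byte[] utf8 = ToUtf8(original);

        // Each of these two letters needs two bytes in UTF-8.
        Console.WriteLine(utf8.Length);                // 4
        Console.WriteLine(FromUtf8(utf8) == original); // True
    }
}
```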

As there's an ASCIIEncoding class, you probably think that simply using it will automatically convert any Unicode string into its exact ASCII equivalent. Unfortunately, that's not the case because the ASCIIEncoding class handles only the 7-bit ASCII range (U+0000 through U+007F); so if you are trying to convert Greek text, every Greek character comes out as a question mark. A better way is to get the bytes of the Unicode string and the encoding based on the code page for the conversion, then pass that encoding to the Encoding.Convert method. Figure 1 shows the proper way to convert from Greek "ASCII" (Windows code page 1253) to Unicode and back.

Figure 1 Converting Between ASCII and Unicode

using System ;
using System.Text ;
using System.Windows.Forms ;

namespace ASCII_To_Unicode
{
    class StartUp
    {
        static void Main ( )
        {
            // All the Greek uppercase characters (code page 1253 values).
            Byte[] GreekBytes =
            {
                0xB8 , // Epsilon with Tonos
                0xB9 , // Eta with Tonos
                0xBA , // Iota with Tonos
                0xBC , // Omicron with Tonos
                0xBE , // Upsilon with Tonos
                0xBF , // Omega with Tonos
                0x20 , // space
                0xC1 , // Alpha
                0xC2 , // Beta
                0xC3 , // Gamma
                0xC4 , // Delta
                0xC5 , // Epsilon
                0xC6 , // Zeta
                0xC7 , // Eta
                0xC8 , // Theta
                0xC9 , // Iota
                0xCA , // Kappa
                0xCB , // Lambda
                0xCC , // Mu
                0xCD , // Nu
                0xCE , // Xi
                0xCF , // Omicron
                0xD0 , // Pi
                0xD1 , // Rho
                0xD3 , // Sigma
                0xD4 , // Tau
                0xD5 , // Upsilon
                0xD6 , // Phi
                0xD7 , // Chi
                0xD8 , // Psi
                0xD9 , // Omega
                0x20 , // space
                0xDA , // Iota with Dialytika
                0xDB   // Upsilon with Dialytika
            } ;

            // Get the code page for Greek.
            Encoding EncodingGreek = Encoding.GetEncoding ( 1253 ) ;

            // Convert the Greek bytes to Unicode bytes.
            // 1st parameter - Current encoding of the bytes.
            // 2nd parameter - The encoding to convert to.
            // 3rd parameter - The bytes to convert.
            Byte[] Cvted = Encoding.Convert ( EncodingGreek ,
                                              Encoding.Unicode ,
                                              GreekBytes ) ;

            // Convert from the raw bytes to Unicode characters.
            String DoneString = Encoding.Unicode.GetString ( Cvted ) ;

            // As it's Unicode, Windows will display it correctly.
            MessageBox.Show ( DoneString , "Greek String" ) ;

            // Convert the Unicode string to Greek ASCII bytes.
            Byte[] AsciiBytes =
                Encoding.Convert ( Encoding.Unicode ,
                                   EncodingGreek ,
                                   Encoding.Unicode.GetBytes ( DoneString ) ) ;

            // Ensure the converted bytes are equal to the originals.
            if ( GreekBytes.Length == AsciiBytes.Length )
            {
                for ( int i = 0 ; i < GreekBytes.Length ; i++ )
                {
                    if ( GreekBytes[i] != AsciiBytes[i] )
                    {
                        MessageBox.Show ( "Unicode to ASCII conversion failed" ) ;
                        // One report is enough; stop at the first mismatch.
                        break ;
                    }
                }
            }
            else
            {
                MessageBox.Show ( "Failed to convert Unicode back to ASCII" ) ;
            }
        }
    }
}
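For contrast with Figure 1, the following sketch shows what happens when a Greek character is pushed through the ASCIIEncoding class instead of a code page encoding. The AsciiLossDemo class and RoundTripThroughAscii method are hypothetical names for this illustration.

```csharp
using System;
using System.Text;

public static class AsciiLossDemo
{
    // Encodes text as 7-bit ASCII and decodes it again. Any character
    // outside the ASCII range is replaced with '?' during encoding.
    public static string RoundTripThroughAscii(string text)
    {
        Encoding ascii = new ASCIIEncoding();
        return ascii.GetString(ascii.GetBytes(text));
    }

    public static void Main()
    {
        // "A", Greek capital Alpha (U+0391), "B"
        Console.WriteLine(RoundTripThroughAscii("A\u0391B")); // A?B
    }
}
```

The Latin letters survive, but the Greek Alpha is lost for good, which is why Figure 1 asks for the Greek code page explicitly.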

Now that you're well on your way to using internationalization, you can also take a look at "Best Practices for Developing World-Ready Applications" from MSDN®. Now I'll turn to the magical CultureInfo class.

The CultureInfo Class

For .NET internationalization, the CultureInfo class from the System.Globalization namespace is your one-stop shop for all your cultural needs. It contains everything from number formatting to calendaring and sorting. Each thread has a default CultureInfo instance where you can get the proper locale formatting based on the user's locale. Interestingly, in .NET there's no way you can set a global culture for an entire process that would override the individual thread cultures. Having worked on numerous applications where one of the key requirements was that the application use a different culture and language from the one in use by the operating system, I know many of you will struggle with this issue. The InternationalizationDemo program that's part of this month's source code download shows one way of tackling this problem (see the link at the top of this article).

While you might think that there's only one CultureInfo value per thread, there are actually two, and you'll see in a moment how this makes sense. You get the first CultureInfo instance through either the CultureInfo.CurrentCulture property or the Thread.CurrentThread.CurrentCulture property. This instance holds the locale information for performing all number, time, date, and calendar formatting. The other CultureInfo instance is the UI display culture used by the ResourceManager class for resource loading. You can get the user interface display culture by reading either the CultureInfo.CurrentUICulture property or the Thread.CurrentThread.CurrentUICulture property.

In order for you to change the locale for a thread, you should set the Thread.CurrentThread.CurrentCulture and Thread.CurrentThread.CurrentUICulture properties. Since the Thread.CurrentThread.CurrentCulture is the CultureInfo instance used to display numbers, dates, and such, when you construct the new CultureInfo class, you need to specify both the language and country/region value in RFC-1766 format. If you try to set Thread.CurrentThread.CurrentCulture to a CultureInfo instance that only has the language portion set, you'll encounter an exception. As I hinted at earlier, you can set the Thread.CurrentThread.CurrentUICulture to a CultureInfo class that only specifies the language portion in its constructor. That way you can share UI resources, such as translated Windows Forms, across cultures that speak the same language, but don't share other cultural formatting.
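Here is a minimal sketch of that split, assuming you want Swiss German formatting but UI resources shared by all German-speaking cultures. The ThreadCultureDemo class and UseCulture method are hypothetical names. CultureInfo.CreateSpecificCulture is used for the formatting culture because it maps a language-only name such as "de" to a default specific culture, which sidesteps the exception described above.

```csharp
using System;
using System.Globalization;
using System.Threading;

public static class ThreadCultureDemo
{
    // Sets the formatting culture and the UI (resource) culture
    // separately for the current thread.
    public static void UseCulture(string formattingCulture, string uiCulture)
    {
        // Formatting needs a specific culture (language plus region).
        Thread.CurrentThread.CurrentCulture =
            CultureInfo.CreateSpecificCulture(formattingCulture);

        // Resource lookup is happy with a language-only culture.
        Thread.CurrentThread.CurrentUICulture = new CultureInfo(uiCulture);
    }

    public static void Main()
    {
        UseCulture("de-CH", "de");
        Console.WriteLine(Thread.CurrentThread.CurrentCulture.Name);   // de-CH
        Console.WriteLine(Thread.CurrentThread.CurrentUICulture.Name); // de

        // A language-only name is promoted to a default region.
        Console.WriteLine(CultureInfo.CreateSpecificCulture("de").Name); // de-DE
    }
}
```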

Figure 2 shows the code for a simple console application that outputs the current date and time using three different CultureInfo values. The first value uses the system default. In my case, that's English as spoken in the U.S. After setting the current thread's CultureInfo, the second value is the date, displayed in German as spoken in Germany. The final display is the result of passing a CultureInfo instance as the second parameter to the DateTime.ToString method. Since the CultureInfo class implements the IFormatProvider interface, there's nothing else you need to do. The final output of the program shows the properly formatted output for French as spoken in Switzerland.

Figure 2 A Quick Internationalization Demo

en-US = Monday, September 22, 2003 3:49:48 PM
de-DE = Montag, 22. September 2003 15:49:48
fr-CH = lundi, 22. septembre 2003 15:49:48

using System ;
using System.Threading ;
using System.Globalization ;

namespace QuickIntlDemo
{
    class StartUp
    {
        static void Main ( string[] args )
        {
            DateTime currTime = DateTime.Now ;

            // Show the current date based on the user's locale.
            Console.WriteLine ( "{0} = {1}" ,
                                CultureInfo.CurrentCulture.Name ,
                                currTime.ToString ( "F" ) ) ;

            // Change the culture to German spoken in Germany.
            CultureInfo deDE = new CultureInfo ( "de-DE" ) ;
            Thread.CurrentThread.CurrentCulture = deDE ;
            Thread.CurrentThread.CurrentUICulture = deDE ;

            // Show the current date (which will now be in German.)
            Console.WriteLine ( "{0} = {1}" ,
                                CultureInfo.CurrentCulture.Name ,
                                currTime.ToString ( "F" ) ) ;

            // Create a culture for French spoken in Switzerland.
            CultureInfo frCH = new CultureInfo ( "fr-CH" ) ;

            // Pass the culture to ToString to show that it can be done.
            Console.WriteLine ( "{0} = {1}" ,
                                frCH.Name ,
                                currTime.ToString ( "F" , frCH ) ) ;
        }
    }
}
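The same trick of passing a CultureInfo as the format provider works for numbers as well as dates. The sketch below is only an illustration; the NumberFormatDemo class and FormatAmount method are hypothetical names, and the second constructor argument (false) is there so that Control Panel overrides don't change the result.

```csharp
using System;
using System.Globalization;

public static class NumberFormatDemo
{
    // Formats a number with two decimals and group separators
    // according to the named culture.
    public static string FormatAmount(double amount, string cultureName)
    {
        CultureInfo culture = new CultureInfo(cultureName, false);
        return amount.ToString("N2", culture);
    }

    public static void Main()
    {
        Console.WriteLine(FormatAmount(1234.56, "en-US")); // 1,234.56
        Console.WriteLine(FormatAmount(1234.56, "de-DE")); // 1.234,56
    }
}
```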

Earlier, I mentioned the invariant culture. With this option, no cultural information is applied, thus it is culture insensitive. This is important in two kinds of situations. The first is when you need to make a security decision that depends on the return value of a string comparison or case change. String.Compare uses the CultureInfo.CurrentCulture culture to perform the comparison, so security vulnerabilities may creep into the process if the comparison runs under locales that the developer never tested. In dealing with security and strings, always pass the invariant culture to any comparison or casing methods to ensure that there aren't any surprises. Getting the invariant culture is easy, and the following two C# lines are equivalent:

CultureInfo cultInvariant = new CultureInfo ( "" ) ;
CultureInfo cultInvariant = CultureInfo.InvariantCulture ;
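A well-known case where the active locale changes a casing result is Turkish, in which a lowercase "i" uppercases to a dotted capital I (U+0130). The sketch below shows the difference and one way to make the check culture-proof. The InvariantCompareDemo class and IsFileScheme method are hypothetical names, and the Turkish result assumes the runtime has full culture data available.

```csharp
using System;
using System.Globalization;

public static class InvariantCompareDemo
{
    // Case-insensitive check that gives the same answer regardless of
    // the thread's current culture.
    public static bool IsFileScheme(string scheme)
    {
        // Third argument: ignoreCase. Fourth: the culture whose rules apply.
        return String.Compare(scheme, "FILE", true,
                              CultureInfo.InvariantCulture) == 0;
    }

    public static void Main()
    {
        CultureInfo turkish = new CultureInfo("tr-TR");

        // Under Turkish rules "file" becomes "F\u0130LE", not "FILE".
        Console.WriteLine("file".ToUpper(turkish) == "FILE");

        // Under invariant rules the result is always "FILE".
        Console.WriteLine("file".ToUpper(CultureInfo.InvariantCulture) == "FILE"); // True

        Console.WriteLine(IsFileScheme("file")); // True
    }
}
```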

The other valuable use for the invariant culture is to store data in nonserialized situations. Instead of storing the data in a format that could change or require reparsing, you store it in a known format that has no culture associated with it. There's a fine demo of using the invariant culture in "Using the InvariantCulture Property".
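As a sketch of that storage idea, the helpers below write a date and a number as invariant-culture text and read them back, so the stored form does not depend on the locale of the machine that wrote it. The InvariantStorageDemo class and its methods are hypothetical names, and the "s" (sortable) and "R" (round-trip) format strings are just one reasonable choice.

```csharp
using System;
using System.Globalization;

public static class InvariantStorageDemo
{
    // Writes a date in the sortable "s" pattern (yyyy-MM-ddTHH:mm:ss).
    public static string SaveDate(DateTime value)
    {
        return value.ToString("s", CultureInfo.InvariantCulture);
    }

    public static DateTime LoadDate(string text)
    {
        return DateTime.ParseExact(text, "s", CultureInfo.InvariantCulture);
    }

    // Writes a number with '.' as the decimal separator in every locale.
    public static string SaveNumber(double value)
    {
        return value.ToString("R", CultureInfo.InvariantCulture);
    }

    public static double LoadNumber(string text)
    {
        return Double.Parse(text, CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        DateTime when = new DateTime(2003, 9, 22, 15, 49, 48);
        Console.WriteLine(SaveDate(when));                   // 2003-09-22T15:49:48
        Console.WriteLine(LoadDate(SaveDate(when)) == when); // True
        Console.WriteLine(SaveNumber(1234.5));               // 1234.5
    }
}
```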

Finally, I want to touch on string sorting. When you create a CultureInfo class or access the user's default, the sort order for strings is set to the default sorting for that language. However, some languages, such as Chinese as spoken in China (zh-CN), have two sort orders. For zh-CN, the default sort order is by pronunciation and the alternate is by stroke count. In order to use the alternate sort order, create the CultureInfo class using the integer locale ID (LCID) for the alternate sort order. You can also create a CompareInfo class using the alternate LCID if that's all you actually need. For the list of languages with alternate sort orders and LCIDs, see Comparing and Sorting Data for a Specific Culture.
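The following sketch shows what asking for the alternate sort order might look like. It assumes the documented LCID for the zh-CN stroke-count sort, 0x00020804, and falls back to the default zh-CN sort if the platform does not provide the alternate. The AlternateSortDemo class and its method are hypothetical names, and no output is shown because the ordering depends on the sort tables installed on the machine.

```csharp
using System;
using System.Globalization;

public static class AlternateSortDemo
{
    // LCID for Chinese (China) with the stroke-count sort order.
    public const int ChineseStrokeCountLcid = 0x00020804;

    // Returns a CompareInfo for the alternate sort, or the default
    // zh-CN CompareInfo if the alternate LCID is not supported.
    public static CompareInfo GetStrokeCountCompareInfo()
    {
        try
        {
            return new CultureInfo(ChineseStrokeCountLcid).CompareInfo;
        }
        catch (ArgumentException)
        {
            return new CultureInfo("zh-CN").CompareInfo;
        }
    }

    public static void Main()
    {
        CompareInfo byStroke = GetStrokeCountCompareInfo();
        CompareInfo byDefault = new CultureInfo("zh-CN").CompareInfo;

        // Compare two characters under each sort order.
        string first = "\u4E8C";  // the character for "two"
        string second = "\u4E09"; // the character for "three"
        Console.WriteLine(Math.Sign(byStroke.Compare(first, second)));
        Console.WriteLine(Math.Sign(byDefault.Compare(first, second)));
    }
}
```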

String sorting with Array.Sort uses the CultureInfo.CurrentCulture implicitly behind the scenes, as you would expect. Unfortunately, if you want to control the sorting for an array based on culture, none of the overloads of the Sort method take a CultureInfo or CompareInfo class directly. You have to provide a class that implements the IComparer interface. Fortunately, it's simple enough to whip one up to do the necessary work. Figure 3 shows the CulturalStringComparer class I use to ensure that my string arrays remain properly sorted. When constructing the CulturalStringComparer, you can get the CompareInfo class for the parameter from the CompareInfo property of a CultureInfo class.

Figure 3 CulturalStringComparer for Array.Sort

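using System ;
using System.Collections ;
using System.Globalization ;

// Compares strings using the rules of a specific culture so that
// Array.Sort can sort in that culture's order.
internal class CulturalStringComparer : IComparer
{
    private CompareInfo m_compareInfo = null ;

    // Pass null to use the current thread's culture.
    public CulturalStringComparer ( CompareInfo cultureCompareInfo )
    {
        if ( null == cultureCompareInfo )
        {
            m_compareInfo = CultureInfo.CurrentCulture.CompareInfo ;
        }
        else
        {
            m_compareInfo = cultureCompareInfo ;
        }
    }

    public int Compare ( Object a , Object b )
    {
        // Anything that is not a string is treated as null, which
        // CompareInfo.Compare sorts before every non-null string.
        String sa = a as String ;
        String sb = b as String ;
        return ( m_compareInfo.Compare ( sa , sb ) ) ;
    }
}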
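To show the comparer from Figure 3 in use, here is a short sketch that sorts the same words under Swedish and U.S. English rules. The class is repeated in condensed form only so that the example compiles on its own; SortDemo and SortFor are hypothetical names.

```csharp
using System;
using System.Collections;
using System.Globalization;

// Condensed copy of the Figure 3 class so this sketch is self-contained.
internal class CulturalStringComparer : IComparer
{
    private CompareInfo m_compareInfo;

    public CulturalStringComparer(CompareInfo cultureCompareInfo)
    {
        m_compareInfo = (cultureCompareInfo != null)
            ? cultureCompareInfo
            : CultureInfo.CurrentCulture.CompareInfo;
    }

    public int Compare(object a, object b)
    {
        return m_compareInfo.Compare(a as string, b as string);
    }
}

public static class SortDemo
{
    // Returns a copy of the words sorted by the named culture's rules.
    public static string[] SortFor(string cultureName, string[] words)
    {
        string[] sorted = (string[])words.Clone();
        CompareInfo rules = new CultureInfo(cultureName).CompareInfo;
        Array.Sort(sorted, new CulturalStringComparer(rules));
        return sorted;
    }

    public static void Main()
    {
        string[] words = { "zebra", "\u00E4pple", "apple" };

        // Swedish treats a-umlaut as a separate letter after z.
        Console.WriteLine(String.Join(", ", SortFor("sv-SE", words)));

        // English treats a-umlaut as an accented a.
        Console.WriteLine(String.Join(", ", SortFor("en-US", words)));
    }
}
```

With full culture data, the Swedish sort should put the a-umlaut word last, while the English sort should put it right after "apple".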

Wrap-up

Now that I've covered the basics of .NET internationalization, you should be able to write more flexible code. If you want to learn more, I encourage you to spend some time looking at the CultureInfo class to see how to manipulate and format the various items in a culturally sensitive manner. You should consider writing a program that takes a language and country/region pairing, such as "de-DE", on the command line and shows the current date and time in various formats. In the next Bugslayer column, I'll discuss .NET resource handling and provide utilities to help make your internationalization and internationalization testing much easier.
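One possible starting point for that exercise is sketched below. It is not part of this column's download; the ShowDateFormats class, its FormatAll helper, and the particular list of standard format strings are assumptions made for the example.

```csharp
using System;
using System.Globalization;

public static class ShowDateFormats
{
    // Standard date and time format strings to demonstrate.
    private static readonly string[] Formats = { "d", "D", "t", "T", "F", "G" };

    // Returns one "format = value" line per format string.
    public static string[] FormatAll(DateTime when, CultureInfo culture)
    {
        string[] lines = new string[Formats.Length];
        for (int i = 0; i < Formats.Length; i++)
        {
            lines[i] = Formats[i] + " = " + when.ToString(Formats[i], culture);
        }
        return lines;
    }

    public static void Main(string[] args)
    {
        // Use the culture named on the command line, else the user's own.
        string name = (args.Length > 0) ? args[0]
                                        : CultureInfo.CurrentCulture.Name;
        CultureInfo culture;
        try
        {
            culture = CultureInfo.CreateSpecificCulture(name);
        }
        catch (ArgumentException)
        {
            Console.WriteLine("Unknown culture: " + name);
            return;
        }

        Console.WriteLine(culture.Name + " (" + culture.EnglishName + ")");
        foreach (string line in FormatAll(DateTime.Now, culture))
        {
            Console.WriteLine(line);
        }
    }
}
```

Running it with "de-DE" and then "fr-CH" is a quick way to see how much the same DateTime value changes from one culture to the next.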

All developers should have the book Developing International Software by Dr. International (Microsoft Press®, 2003). This excellent resource includes many advanced details of internationalization that you won't find anywhere else. And if you ever see Dr. International on the street, tell him I said "Hi."

The Tips

Tip 59 In this column, I've covered handling internationalization from the .NET programming perspective. If you're using ASP.NET, you probably have many HTML pages with static text that will need to be internationalized. One great tool you'll want to check out is the Enterprise Localization Toolkit.

Tip 60 If you need to format a selected section of code in a jiffy with Visual Studio® .NET, press Ctrl+K, Ctrl+F on the default keyboard, and the Edit.FormatSelection command will do the work.

Send your questions and comments for John to slayer@microsoft.com.

John Robbins is a cofounder of Wintellect, a software consulting, education, and development firm that specializes in programming for the .NET Framework and Windows. His latest book is Debugging Applications for Microsoft .NET and Microsoft Windows (Microsoft Press, 2003). You can contact John at https://www.wintellect.com.