Database-driven international Web sites are becoming increasingly important now that the Internet reaches even the most obscure places in the world. The success of an international site can depend on the many relationships between your code and the software on your client and server. The HTML and ASP code, Microsoft® SQL Server™ data, and the Microsoft Data Access Components (MDAC) must work together in spite of the constraints defined by the character sets you use and languages you support. The purpose of this article is to help you develop a seamless site for an international audience and avoid some of the common problems that can occur in the process.
After briefly discussing how to choose character sets and display localized content, we will explain how to correctly specify the character set on the client and server, let users choose their language for browsing, and use Unicode and ANSI for static and dynamic content on Microsoft Internet Information Server (IIS) 4.0 and IIS 5.0. Then we'll cover some additional tasks such as installing code pages, enabling server state, and using the best techniques for validating your localized data.
Choosing the Right Character Sets
A character set is a mapping of characters to their identifying code values; it dictates how your application will handle data. The standard character sets used in Web applications are Unicode and ANSI. Microsoft products use code pages to provide a mapping to both the Unicode and ANSI character sets. A code page is an internal table that the operating system uses to map symbols (letters, numerals, and punctuation characters) to a character number. Different code pages provide support for the character sets used in different countries. Code pages are referred to by number; for example, code page 932 represents the Japanese character set, and code page 950 represents the Traditional Chinese character set (see Figure 1). Using code pages ensures that the character set being used is handled and interpreted correctly so that no data is corrupted.
Figure 1 Chinese Character Set on microsoft.com/china
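To make the idea of a code page as a lookup table concrete, here is a small sketch that prints the byte values two code pages assign to the same characters. It is written in Python purely as a neutral way to inspect bytes and is not part of the ASP stack this article describes. We assume Python's codec names cp932 and cp950 stand in for the Windows code pages of the same numbers.

```python
# Illustrative only: Python's "cp932" and "cp950" codecs are assumed to
# match Windows code pages 932 (Japanese) and 950 (Traditional Chinese).
text = "日本"  # two characters that exist in both code pages

for code_page in ("cp932", "cp950"):
    encoded = text.encode(code_page)
    # Show the byte values this code page assigns to the same characters.
    print(code_page, encoded.hex(" "))

japanese_bytes = text.encode("cp932")

# Decoding with the code page that produced the bytes restores the text.
round_trip = japanese_bytes.decode("cp932")

# The same characters are represented by different bytes in another code page.
same_bytes_in_both = japanese_bytes == text.encode("cp950")
```

The point of the comparison is that bytes are only meaningful together with the code page that produced them.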
Unicode code pages allow you to use one specific code page to represent all of the languages that you want to support. (To help you understand references to Unicode encodings later in this article, see the sidebar "Unicode Encodings.") The ANSI character set, on the other hand, contains many code pages that represent languages found throughout the world. You will often hear the term single-byte character set (SBCS) or double-byte character set (DBCS) associated with ANSI code pages. Most languages are SBCS, but Asian languages such as Japanese, Chinese, and Korean use DBCS characters.
Having to use a specific code page for every language that you use is the biggest drawback to ANSI code pages. This means that Unicode is easier to implement than ANSI since you only need to specify one code page in Unicode, and you don't have to dynamically change the code page based on the language being supported. In many cases, however, ANSI may be useful for legacy products or data. As you read this article, you will also find that there are some specific issues that may require you to use ANSI instead of Unicode for international Web sites.
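The trade-off between per-language ANSI code pages and a single Unicode encoding can be sketched as follows. This is a Python illustration, not ASP code; the sample strings and the encodable helper are our own hypothetical names, and cp1252 and cp932 are assumed to match the Western European and Japanese Windows code pages.

```python
# Hypothetical sample strings, one per language a site might support.
samples = {
    "English": "Hello",
    "French": "café",
    "Japanese": "日本語",
}


def encodable(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# An ANSI code page covers only some languages; UTF-8 covers all of them.
for language, text in samples.items():
    print(language,
          "cp1252:", encodable(text, "cp1252"),
          "cp932:", encodable(text, "cp932"),
          "utf-8:", encodable(text, "utf-8"))

# SBCS versus DBCS: one byte per character versus two.
sbcs_length = len("é".encode("cp1252"))
dbcs_length = len("語".encode("cp932"))
```

With ANSI you have to pick the code page that fits each language, whereas the UTF-8 column succeeds for every row.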
Specifying Character Sets on the Client and Server
Once you have the presentation area covered, you need to understand how to correctly submit and receive data between the Web server and the Web browser. To handle the data correctly, you should programmatically address this in the client-side HTML code by adding a meta tag that specifies the charset property to use in your Web pages. The charset property tells the Web browser which character set you are using so it knows how to post data to the Web server and also knows the form of the data that it is receiving from the server. Each available charset value has a corresponding character set. A list of available charsets can be found on the MSDN® Online Web site at http://msdn.microsoft.com/workshop/Author/dhtml/reference/charsets/charsets.asp.
The following code shows the use of the charset property on a Japanese page:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=shift_jis">
The charset attribute must be set on the client. However, as an alternative, you can also set this value in the HTTP header of your page if you are using ASP code. You can use either of these code samples to set the charset to Japanese from server-side ASP code:
<% Response.AddHeader "Content-Type","text/html; charset=shift_jis" %>
<% Response.CharSet = "shift_jis"%>
Allowing Users to Select Their Browser Language
For international sites, you need to allow users to select the language that they will use to browse the site. In the best-case scenario you should allow the user to select the language and then store that information either in a cookie or in a database on the Web server. Storing the preference on the server allows the user to browse your site from any machine and still use the language of their choice. Customers who travel frequently and need to access information from anywhere in the world will appreciate this feature.
Alternatively, the browser language can be selected by using the information in the HTTP_ACCEPT_LANGUAGE variable from the HTTP header. You can use the ASP Request.ServerVariables collection to retrieve the value. HTTP_ACCEPT_LANGUAGE contains a comma-separated list of language codes, each with an optional quality value, that can be parsed to find out not only what languages the browser will support, but also the order of importance that the user has defined (see the sidebar "Language Preferences Set by Users" for more information). Then, based on the value, you can redirect the user to the appropriate content. This is an example of using the ServerVariables collection to retrieve the HTTP_ACCEPT_LANGUAGE variable:
txtLanguage = Request.ServerVariables("HTTP_ACCEPT_LANGUAGE")
The advantage of using HTTP_ACCEPT_LANGUAGE is that you can determine what languages have been installed in the Web browser and provide the user with the most compatible site. The downside is that the user may be traveling, and they will need to understand how to change settings in Microsoft Internet Explorer to get the language they would like to use.
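Parsing the HTTP_ACCEPT_LANGUAGE value into an ordered preference list might look like the sketch below. It is written in Python rather than ASP so the logic can be run in isolation. The function names, the sample header values, and the fallback from a regional code such as ja-jp to ja are our own assumptions, not something the browser or IIS prescribes.

```python
def parse_accept_language(header):
    """Return language codes from an Accept-Language value, most preferred first."""
    weighted = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        tag = parts[0].lower()
        if not tag:
            continue
        quality = 1.0  # a code with no q parameter has full weight
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed weight as "not acceptable"
        if quality > 0:
            # Sort by descending quality, then by original position.
            weighted.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(weighted)]


def choose_language(header, supported, default="en"):
    """Pick the first preferred code the site supports, trying 'ja' for 'ja-jp'."""
    for tag in parse_accept_language(header):
        for candidate in (tag, tag.split("-")[0]):
            if candidate in supported:
                return candidate
    return default


preferred = parse_accept_language("ja,en-us;q=0.8,fr;q=0.5")
chosen = choose_language("de-ch,fr;q=0.9", supported={"en", "fr", "ja"})
```

The result of choose_language is what you would use to decide where to redirect the user. A stored user preference, when one exists, should still take precedence over this header.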
Using Unicode for Static Content
If the content on your site is static and you are supporting multiple languages from one server, you will want to save your files as Unicode. By doing so, the code will not be translated to the default language of the machine where you are editing your Web page. (On an English version of Windows®, this is typically the Windows-1252 ANSI character set.) On both IIS 4.0 and IIS 5.0, the file can be saved in a Unicode format and with an .htm extension and, as long as the browser supports Unicode, it should work just fine. Visual InterDev® and Notepad are two editing tools that will allow you to save files in a Unicode format.
On IIS 4.0, if the file is saved in Unicode format with an .asp extension, the ASP code will not execute; it will be sent to the browser instead. Not only does this not produce the desired result, but it's also a security issue since your code will be exposed. If you need to save your file in a Unicode format on IIS 4.0, you can create a separate ASP page and then use the File System Object to read the file that was saved as Unicode.
If you are going to create dynamic pages with ASP using Unicode, you will need to host your site on IIS 5.0. On IIS 5.0, the file must be saved with a UTF-8 Unicode encoding format for .asp pages to work. For more information on saving ASP files in Unicode, see the Microsoft Knowledge Base article Q245000.
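Because the saved encoding of a page decides whether it works on a given server, it can help to check how a file was actually saved before deploying it. The sketch below, in Python rather than ASP, classifies a file's contents by its leading byte order mark. It assumes the editor writes such a signature, which this article does not state, and the detect_bom helper and sample page are hypothetical.

```python
import codecs


def detect_bom(data):
    """Name the Unicode encoding indicated by a leading byte order mark, if any."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None  # no signature: ANSI, or Unicode saved without a BOM


# A hypothetical one-line ASP page saved three different ways.
page = '<% Response.Write "日本" %>'
as_utf8 = codecs.BOM_UTF8 + page.encode("utf-8")
as_utf16 = codecs.BOM_UTF16_LE + page.encode("utf-16-le")
as_ansi = page.encode("cp932")
```

In practice you would read the first few bytes of each .asp file in binary mode and pass them to detect_bom. A result of None is inconclusive, since the file may still be Unicode without a signature.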
Lack of Unicode Support for ASP on IIS 4.0
If you are currently using Unicode on IIS 4.0, you may be wondering why we said that it is not supported. If your Web server is configured correctly, you should get an Invalid Code Page error as described in Knowledge Base article Q254313. However, you may be using Unicode on IIS 4.0 and not getting the error. There are two reasons for this. First, Unicode was turned off in Windows NT® Service Pack 4 because ASP was originally written to handle ANSI data, and Unicode characters larger than two bytes in size were truncated. Second, if you have Service Pack 4 or greater but can still use Unicode, it is because the Windows NT Service Pack was not reapplied after the Windows NT Option Pack was installed. We have heard that some customers didn't install the Windows NT Service Pack if they wanted to guarantee that the characters in their applications were less than two bytes in size. (This isn't a good idea in general, because you lose the benefits of the service pack updates, and you can get mixed results due to mismatched components in the operating system.)
Dynamic Content on IIS 4.0
Since Unicode is not supported on IIS 4.0 for ASP code, you will have to use an ANSI character set. This means that the character set you specify in your client-side code must match the character set (called the code page) that you set on the server. You can set the code page in ASP server-side code using the @CODEPAGE directive and the Session.CodePage property.
The @CODEPAGE directive affects all the internal handling of characters, not just string conversions. Active Server Pages assumes that strings passed between the Web server and the browser are in the same code page you have set for your script. The Session.CodePage property controls the necessary conversions for both the input and the output of data when you use the ASP Request and Response objects. On IIS 4.0, both @CODEPAGE and Session.CodePage must be set using a numeric code page value that corresponds with the client-side charset value. The code page values are available in the same document as the charset values.
Not setting the ASP code page properties can cause problems for international sites. ASP will use the default system code page if the properties are not explicitly set. On an English-language Windows-based server this will be the Windows-1252 charset. This corrupts any characters that cannot be represented by that charset.
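The corruption that results from falling back to Windows-1252 can be reproduced outside of ASP. The following Python sketch is only an illustration of the byte-level effect, not of how ASP performs the conversion internally; the codecs cp932 and cp1252 are assumed to match the Windows code pages.

```python
original = "日本"

# The browser posts Shift_JIS bytes because the page declared charset=shift_jis.
posted_bytes = original.encode("cp932")

# Interpreted with the matching code page, the text survives.
correct = posted_bytes.decode("cp932")

# Interpreted with the English-language default, every byte is misread
# as a separate Western European character.
misread = posted_bytes.decode("cp1252")

# Converting the other way is lossy: cp1252 cannot represent these
# characters, so each one is replaced with a question mark.
lossy = original.encode("cp1252", errors="replace")

print(repr(correct), repr(misread), lossy)
```

Two characters became four unrelated ones in the first failure and two question marks in the second. Once either happens, the original text cannot be recovered from what is stored.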
In addition, setting the Session.CodePage will not dynamically set the @CODEPAGE if it is omitted, nor will the @CODEPAGE set the Session.CodePage if it is omitted. While the Session.CodePage can be dynamically set, the @CODEPAGE is hardcoded into each page and cannot be changed dynamically. As a result, you must create a Web application for each language that you would like to support. In each of the applications you must hardcode the @CODEPAGE value to a specific ANSI code page. Alternatively, you could have a separate Windows NT-based server running in the language that you would like to support and omit the ASP code page settings so the default code page of the server is used instead.
Not only do you need to make sure that your ASP has the proper code page settings, but you also need to make sure that the Web server has the code page used by ASP installed. If the code page is not installed, you will get the Invalid Code Page error described earlier. The code page comes in the form of a National Language Support (.NLS) file and can be installed according to the instructions found in Knowledge Base article Q164948.
The information we've presented so far relates to handling string data, but what if you need to handle Date, Time, and Currency formats? The settings of the @LCID directive and Session.LCID property affect those data formats. The @LCID directive sets the locale identifier (LCID) for a script. The LCID is a DWORD containing the language identifier in the lower word and a reserved value in the upper word. The identifier supplied in an LCID is a standard international numeric abbreviation. The Session.LCID property determines the locale identifier that will be used to display dynamic content. The @LCID directive is just like the @CODEPAGE directive in that you need to use both the @LCID directive and the Session.LCID property to obtain the correct results on IIS 4.0. You also need to be aware that the LCID formatting only works with VBScript format functions like the FormatDateTime function. You can find more information about using the @LCID directive in your ASP pages in Knowledge Base article Q229690.
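The DWORD layout of an LCID described above can be unpacked with a few bit operations. This Python sketch is for illustration only. The further split of the language identifier into a primary language (low 10 bits) and a sublanguage (high 6 bits) is a Windows convention not covered in this article, and the two function names are hypothetical.

```python
def split_lcid(lcid):
    """Split a 32-bit LCID into its upper word and its language identifier."""
    language_id = lcid & 0xFFFF          # lower word: language identifier
    upper_word = (lcid >> 16) & 0xFFFF   # upper word: reserved value
    return upper_word, language_id


def split_language_id(language_id):
    """Split a language identifier into primary language and sublanguage."""
    primary = language_id & 0x3FF    # low 10 bits
    sublanguage = language_id >> 10  # high 6 bits
    return primary, sublanguage


# 1033 (0x0409) is English (United States); 1041 (0x0411) is Japanese.
for lcid in (1033, 1041):
    upper, language_id = split_lcid(lcid)
    print(lcid, hex(language_id), split_language_id(language_id), upper)
```

The decimal values 1033 and 1041 are the kind of number you would supply to the @LCID directive or Session.LCID.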
Dynamic Content on IIS 5.0
IIS 5.0 now supports Unicode code pages in the UTF-8 and UTF-7 encoding formats. However, there are some limitations. The Response.Write method was designed to work with Unicode, but there are other methods that still only offer ANSI support. This does not mean that you should not use Unicode; on the contrary, we encourage it because it allows you to use one common ASP code page and client-side charset for all languages. You just need to test your applications as you would in any development scenario. Figure 2 is an example of using UTF-8 with a common page that posts data to the Web server.
Improved code page support has also been added for ANSI on IIS 5.0. If you omit either the @CODEPAGE directive or the Session.CodePage setting, the one you do set now sets the other dynamically. For example, if you omit Session.CodePage and set @CODEPAGE, the directive dictates the way ASP handles data, since Session.CodePage defaults to the value of the @CODEPAGE directive.
Figure 3 Regional Options
Improved regional settings allow you to easily change the default system language on the server. This makes it possible to test a Web application on Windows 2000 that you may be deploying to a foreign-language operating system. You can change the regional settings from the Regional Options applet in Control Panel. In the Regional Options applet, click the Set default button and choose the language that you would like to test (see Figure 3).
Avoiding Known Issues
The Server.HTMLEncode method in ASP will corrupt Unicode and DBCS character sets on IIS 4.0 and Unicode character sets on IIS 5.0. When you use Server.HTMLEncode in the server-side script of an ASP page that contains Unicode or DBCS data, the method does not recognize those data formats and converts the data as if it were in SBCS format. The Server.HTMLEncode method on Windows 2000 platforms does recognize DBCS data formats, so on Windows 2000 only Unicode data is converted as if it were in SBCS format. More information regarding the Server.HTMLEncode method can be found in Knowledge Base article Q259352. In addition, many of the design-time controls in Visual InterDev 6.0 use Server.HTMLEncode. You can find more information on this topic in Knowledge Base article Q261154.
Another common mistake on both IIS 4.0 and IIS 5.0 is using the default collections for the ASP Request object. The default collections are used when you omit the Form or QueryString collection name on the Request object when retrieving posted values from the client.
Setting the Default Language for Web Applications
IIS 5.0 also has an added feature in the metabase that allows you to set the default language for each specific Web application. The metabase is the repository for most IIS 5.0 configuration information. You can use the MetaEdit 2.1 utility to modify IIS metabase values. Using MetaEdit, you can create a new DWORD key called ASPCodepage that will override any ASP code page settings. You can download and install MetaEdit using the instructions in Knowledge Base article Q232068.
Figure 4 Edit Metabase Data
Once the tool has been installed, navigate down the tree-like structure that represents your Web server to find your site. Right-click the site in the left pane, then choose New and DWORD from the context menu. Figure 4 shows an example of the settings for a Web site that is using a shift_jis character set (for a Japanese page) in which the corresponding Windows-932 code page value is used.
IIS 5.0 Code Page Installation
In IIS 5.0, if you are using an ANSI character set you will need to make sure that it has been installed. However, installing a code page is easy since you can change language settings for a system from the Regional Options applet in the Control Panel. In Regional Options, you should place a checkmark next to the languages you want to support in the Language settings for the system section, as shown in Figure 3.
As in IIS 4.0, if you need to provide formatting for dates, time, or currency, you can use the @LCID directive and Session.LCID property. The important point to remember is that the LCID settings behave just like @CODEPAGE and Session.CodePage on IIS 5.0: omitting either one will dynamically set the other. However, just as in IIS 4.0, the LCID formatting only works with the VBScript format functions such as the FormatDateTime function.
Enabling Session State
Session state must be enabled in order for all of the ASP localization settings to work. Without session state, only the @CODEPAGE directive will work. On IIS 4.0 this can be crippling, since you need to set the Session.CodePage to handle input and output. On IIS 5.0, setting the @CODEPAGE directive will dynamically handle the input and output. This is a pretty good workaround if you planned to use Unicode. However, if you are using ANSI, you can't dynamically set the @CODEPAGE directive.
Using Localized SQL Server 7.0 Data
The MDAC includes ADO, OLE DB, and ODBC. Data-driven client/server applications deployed over the Web or a LAN can use these components to easily integrate information from a variety of sources, including the SQL Server data used by your ASP code. If you are using Unicode, then you will need to have MDAC 2.1 or greater installed on the database client (IIS) since it provides strong support for Unicode. The MDAC components are downloadable from the Microsoft Universal Data Access site at http://www.microsoft.com/data. Once the proper data components are installed on the server, you need to ensure that SQL Server 7.0 will handle the data correctly when it is passed from MDAC.
You can manage localized data in SQL Server in several ways. The recommended method is to store the data as Unicode, since it only requires that you have one box running SQL Server. A second and less attractive method is to store the data as ANSI. When storing data this way, you need to have multiple machines running SQL Server since you can only assign one code page for each box. As a side note, SQL Server 2000 now supports the use of multiple code pages on one machine, so it makes using ANSI character sets much more convenient. The nice part of this story is that it does not matter if ASP handles your data as ANSI or Unicode. You will find that SQL Server handles both character sets the same and does any necessary Unicode conversions automatically.
Internally, SQL Server stores the data in a Unicode encoding of UCS-2. When storing Unicode data in SQL Server, the data type that you choose must be ntext, nvarchar, or nchar. Once you have defined the correct data types, you need to modify your SQL statements to include the N prefix. In a SQL statement the N prefix will prevent SQL Server from converting the data to the installed SQL Server code page. Here is an example of a SQL statement that uses the N prefix:
SELECT * FROM International WHERE Int_ID = N'value'
More information regarding the N prefix can be found in Knowledge Base article Q239530.
Validating Localized SQL Data
Perhaps one of the most important tasks when developing a good international site is ensuring that the data is being stored and handled correctly. When you try to view the data in your tables, the SQL Enterprise Manager will show your data as question marks. The SQL Query Analyzer provides a convenient way to validate your data by casting your data to binary. This code shows a sample SQL statement to retrieve the binary values of the cu_fname field:
SELECT Cast(cu_fname as varbinary) FROM customers
Once you have the binary representation of the data, you can use the Character Map system tool (included with Windows 2000) to compare the values with the corresponding character (see Figure 5). Remember that the data is stored in a little-endian UCS-2 encoding format, so you need to take the 2-byte representation and flip the byte order so you get the raw Unicode values. With the Character Map program, you select a font that is compatible with your language, then set the Character set option to Unicode. Once the font and character set are set correctly, you can type in the flipped 2-byte representation of your character in the Go to Unicode textbox. In addition, you could also go to http://www.unicode.org and download or browse the charts that represent the characters.
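Flipping the byte order by hand is error-prone, so here is a short Python sketch of the same step. It takes a varbinary value as Query Analyzer might display it and returns the Unicode values you would type into Character Map. The sample dump and the function name are hypothetical, and the data is assumed to be little-endian UCS-2 as described in this section.

```python
def code_points_from_varbinary(hex_string):
    """Turn a varbinary dump of UCS-2 data into a list of U+XXXX labels."""
    digits = hex_string[2:] if hex_string.lower().startswith("0x") else hex_string
    raw = bytes.fromhex(digits)
    labels = []
    # Walk the bytes two at a time; a trailing odd byte is ignored.
    for i in range(0, len(raw) - 1, 2):
        low, high = raw[i], raw[i + 1]  # little-endian: low byte comes first
        labels.append("U+%04X" % ((high << 8) | low))
    return labels


dump = "0xE5652C67"  # hypothetical output for a two-character name
labels = code_points_from_varbinary(dump)

# Decoding the same bytes as little-endian UTF-16 gives the text directly.
text = bytes.fromhex(dump[2:]).decode("utf-16-le")
print(labels, text)
```

Each label is the flipped 2-byte value, so E5 65 in the dump becomes 65E5 for lookup.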
Figure 5 Character Map Tool
In addition to using SQL Query Analyzer to validate your data, you can use the Microsoft Network Monitor to view data as it is sent from the browser to the Web server, and also from the Web server back to the browser. By examining the HTTP packets, you can look at the bytes and use the Character Map program to validate the data. If the data is in Unicode and you do not have access to the Windows 2000 Character Map program, use the charts at http://www.unicode.org for Unicode data. For ANSI data you can use the charts from Nadine Kano's book, Developing International Software For Windows and Windows NT (Microsoft Press, 1995).
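When you inspect a captured form post, the field values usually appear percent-encoded in the page's character set, so the same text looks different depending on the charset the page declared. The Python sketch below shows that difference and how to decode a captured value. It illustrates the byte-level encoding only and is not a Network Monitor feature; the sample text and captured value are our own.

```python
from urllib.parse import quote, unquote

text = "日本"

# How the same form value is percent-encoded under each page charset.
posted_as_shift_jis = quote(text, encoding="shift_jis")
posted_as_utf8 = quote(text, encoding="utf-8")

# Decoding a captured value requires the charset the page declared.
decoded = unquote("%93%FA%96%7B", encoding="shift_jis")

print(posted_as_shift_jis, posted_as_utf8, decoded)
```

If a captured value decodes to the expected text only under a charset other than the one your page declares, the client and server settings do not match.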
Conclusion
Getting the right character sets and code pages set for your database-driven international Web site can go a long way toward trouble-free operations. Understanding how various versions of software work together with your application code can also help you develop or maintain your site. The resources we've mentioned provide a lot of additional help. But as always, the best way to ensure that these combinations work for you is to test your application before going public to ensure that the data stays intact through the entire process.
Remember, one key to developing a good presentation is to have people on your team who are familiar with the cultures you are developing for. While this article focused on handling data for international Web sites, you will find helpful information about displaying localized content in the October 1999 issue of Microsoft Internet Developer, where Nick Dallett addresses many presentation issues in his article, "DHTML Localization on the Windows Update Site". You will also find Michael Kaplan's article, "Designing Your ASP-based Web Site to Support Globalization" in the July 2000 issue of MSDN Magazine very helpful in discussing not only ASP localization but also good presentation guidelines.