Globalization Step-by-Step


Encodings and Code Pages

Overview and Description

A codepage is a list of selected character codes (characters represented as code points) in a certain order. Codepages are usually defined to support specific languages or groups of languages that share common writing systems. All Windows codepages can contain only 256 code points. Most of the first 128 code points (0-127) represent the same characters, which allows for continuity and legacy code. It is the upper 128 code points, 128-255 (0-based), where codepages differ considerably.

For example, codepage 1253 provides the character codes required by the Greek writing system, and codepage 1252 provides the characters for Latin writing systems, including English, German, and French. It is the upper 128 code points that contain either the accented characters or the Greek characters. Thus you cannot store Greek and German in the same code stream unless you include some identifier to indicate which codepage you are referencing.
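
To see this in C#, here is a minimal sketch that decodes the same upper-range byte under the two codepages. One assumption to note: on .NET Core and later, the Windows codepages must first be registered through the System.Text.Encoding.CodePages package, whereas the .NET Framework has them built in.

using System;
using System.Text;

public class CodePageDemo
{
    public static void Main()
    {
        // Required on .NET Core/.NET 5+ only; the .NET Framework
        // already knows the Windows codepages.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] data = { 0xC1 }; // a code point from the upper 128

        // The same byte means different characters in different codepages.
        Console.WriteLine(Encoding.GetEncoding(1252).GetString(data)); // "Á" (Latin)
        Console.WriteLine(Encoding.GetEncoding(1253).GetString(data)); // "Α" (Greek Alpha)
    }
}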

This becomes even more complex when dealing with Asian character sets. Because Chinese, Japanese, and Korean contain more than 256 characters, a different scheme had to be developed, but it had to be based on the concept of 256-character codepages. Thus DBCS (double-byte character sets) were born.

Each Asian character is represented by a pair of code points (thus double-byte). For programming awareness, a range of code points is set aside to represent the first byte of the pair; these have no value unless they are immediately followed by a defined second byte. DBCS meant that you had to write code that would treat each pair of code points as one character, and it still did not allow combining, say, Japanese and Chinese in the same data stream, because depending on the codepage the same double-byte code points represent different characters for the different languages.
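
A short C# sketch makes the pairing visible: encoding a single Japanese character with codepage 932 (Shift-JIS) yields a lead byte followed by a trail byte. (The same provider-registration caveat as above applies on .NET Core and later.)

using System;
using System.Text;

public class DbcsDemo
{
    public static void Main()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // .NET Core/5+ only

        Encoding shiftJis = Encoding.GetEncoding(932); // Japanese DBCS codepage

        // One character, two bytes: 0x82 is a lead byte, 0xA0 the trail byte.
        byte[] bytes = shiftJis.GetBytes("\u3042"); // HIRAGANA LETTER A
        Console.WriteLine(BitConverter.ToString(bytes)); // prints "82-A0"
    }
}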

In order to allow for the storage of different languages in the same data stream, Unicode was created. This one "codepage" can represent 64,000+ characters, and now, with the introduction of surrogates, it can represent over 1,000,000 characters. The use of Unicode in Windows 2000 allows for easier creation of world-ready code, because you no longer have to worry about which codepage you are addressing, nor whether you have to group code points to represent one character.
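
Because .NET strings are UTF-16, a character beyond the original 64,000+ range occupies two char values, a surrogate pair. A minimal sketch:

using System;

public class SurrogateDemo
{
    public static void Main()
    {
        // U+10400 (DESERET CAPITAL LETTER LONG I) lies outside the
        // Basic Multilingual Plane, so UTF-16 stores it as two code units.
        string s = char.ConvertFromUtf32(0x10400);

        Console.WriteLine(s.Length);                   // 2
        Console.WriteLine(char.IsHighSurrogate(s[0])); // True
        Console.WriteLine(char.IsLowSurrogate(s[1]));  // True
    }
}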

Please note that when writing Unicode applications for Windows 95/98/Me you still need to convert the Unicode code points back to Windows codepages, because the GDI in Windows 95/98/Me is still ANSI based. This is made easy with the functions WideCharToMultiByte and MultiByteToWideChar. See "Unicode and Character Sets" on MSDN.
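
For managed code that needs to call one of these Win32 functions directly, a P/Invoke declaration might look like the sketch below. Treat this as an illustration rather than a definitive implementation: the parameter names are ours, CP_ACP (0) denotes the system's default Windows ANSI codepage, and the call-twice pattern (first to get the required size, then to convert) is the usual Win32 idiom.

using System;
using System.Runtime.InteropServices;

public class AnsiConversionDemo
{
    [DllImport("kernel32.dll", CharSet = CharSet.Unicode)]
    static extern int WideCharToMultiByte(
        uint codePage, uint flags, string wideStr, int wideLen,
        byte[] multiByteStr, int multiByteLen,
        IntPtr defaultChar, IntPtr usedDefaultChar);

    const uint CP_ACP = 0; // current Windows ANSI codepage

    public static void Main()
    {
        string text = "Hello";

        // First call: pass no buffer to learn the required byte count
        // (a wideLen of -1 includes the terminating null).
        int size = WideCharToMultiByte(CP_ACP, 0, text, -1, null, 0,
                                       IntPtr.Zero, IntPtr.Zero);

        // Second call: perform the actual conversion.
        byte[] buffer = new byte[size];
        WideCharToMultiByte(CP_ACP, 0, text, -1, buffer, size,
                            IntPtr.Zero, IntPtr.Zero);

        Console.WriteLine(BitConverter.ToString(buffer));
    }
}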

For information about encodings in web pages, please see MLang on MSDN.


Encodings in Win32

See Unicode and Character Sets on MSDN.

Encodings in the .NET Framework

The .NET Framework is a platform for building, deploying, and running Web services and applications that provides a highly productive, standards-based, multilanguage environment for integrating existing or legacy investments with next-generation applications and services. The .NET Framework uses Unicode UTF-16 to represent characters, although in some cases it uses UTF-8 internally. The System.Text namespace provides classes that allow you to encode and decode characters, with support that includes the following encodings:

  • Unicode UTF-16 encoding. Use the UnicodeEncoding class to convert characters to and from UTF-16 encoding.
  • Unicode UTF-8 encoding. Use the UTF8Encoding class to convert characters to and from UTF-8 encoding (see the sketch after this list).
  • ASCII encoding. ASCII encodes the Latin alphabet as single 7-bit characters. Because this encoding only supports character values from U+0000 through U+007F, in most cases it is inadequate for internationalized applications. You can use the ASCIIEncoding class to convert characters to and from ASCII encoding whenever you need to interoperate with legacy encodings and systems.
  • Windows/ISO Encodings. The System.Text.Encoding class provides support for a wide range of Windows/ISO encodings.
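
As a quick illustration of the UTF8Encoding class, the following sketch round-trips a string through UTF-8 bytes:

using System;
using System.Text;

public class Utf8RoundTrip
{
    public static void Main()
    {
        UTF8Encoding utf8 = new UTF8Encoding();

        // Encode a Unicode string to UTF-8 bytes, then decode it back.
        byte[] bytes = utf8.GetBytes("Phi: \u03A6");
        string text = utf8.GetString(bytes);

        Console.WriteLine(BitConverter.ToString(bytes)); // the Φ becomes CE-A6
        Console.WriteLine(text);
    }
}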

The .NET Framework provides support for data encoded using code pages. You can use the Encoding.GetEncoding(Int32) method to create a target encoding object for a specified code page, passing the code page number as the Int32 parameter. The following code example creates an encoding object, enc, for code page 1252.

[Visual Basic]
Dim enc As Encoding = Encoding.GetEncoding(1252)

[C#]
Encoding enc = Encoding.GetEncoding(1252);

After you create an encoding object that corresponds to a specified code page, you can use the object to perform other operations supported by the System.Text.Encoding class.

The one additional type of support introduced in ASP.NET is the ability to clearly distinguish between file, request, and response encodings. You can set the encoding in ASP.NET in code, in page directives, and in configuration files, as follows.

In code:

Response.ContentEncoding=<value>

Request.ContentEncoding=<value>

File.ContentEncoding=<value>

In page directives:

<%@Page ResponseEncoding=<value>%>
<%@Page RequestEncoding=<value>%>
<%@Page FileEncoding=<value>%>

In a configuration file:

<configuration>
<globalization
fileEncoding=<value>
requestEncoding=<value>
responseEncoding=<value>
/>
</configuration>

The following code example in C# uses the Encoding.GetEncoding method to create a target encoding object for a specified code page. The Encoding.GetBytes method is called on the target encoding object to convert a Unicode string to its byte representation in the target encoding. The byte representations of the strings in the specified code pages are then displayed.


using System;
using System.IO;
using System.Globalization;
using System.Text;

public class Encoding_UnicodeToCP
{
    public static void Main()
    {
        // Convert ASCII characters to bytes and display the string's
        // byte representation in the specified code pages.
        // Code page 1252 represents Latin characters.
        PrintCPBytes("Hello, World!", 1252);
        // Code page 932 represents Japanese characters.
        PrintCPBytes("Hello, World!", 932);

        // Convert Japanese characters to bytes.
        PrintCPBytes("\u307b,\u308b,\u305a,\u3042,\u306d", 1252);
        PrintCPBytes("\u307b,\u308b,\u305a,\u3042,\u306d", 932);
    }

    public static void PrintCPBytes(string str, int codePage)
    {
        Encoding targetEncoding;
        byte[] encodedChars;

        // Get the encoding for the specified code page.
        targetEncoding = Encoding.GetEncoding(codePage);

        // Get the byte representation of the specified string.
        encodedChars = targetEncoding.GetBytes(str);

        // Print the bytes.
        Console.WriteLine(
            "Byte representation of '{0}' in Code Page '{1}':", str, codePage);
        for (int i = 0; i < encodedChars.Length; i++)
            Console.WriteLine("Byte {0}: {1}", i, encodedChars[i]);
    }
}

To determine the encoding to use for response characters in an Active Server Pages for the .NET Framework (ASP.NET) application, set the value of the HttpResponse.ContentEncoding property to the value returned by the appropriate method. The following code example illustrates how to set HttpResponse.ContentEncoding.

// Explicitly set the encoding to UTF-8.
Response.ContentEncoding = Encoding.UTF8;

// Set ContentEncoding using the name of an encoding.
Response.ContentEncoding = Encoding.GetEncoding(name);

// Set ContentEncoding using a code page number.
Response.ContentEncoding = Encoding.GetEncoding(codepageNumber);

The last major area of discussion involves encodings in console or text-mode programming. In the section that follows, you'll find information on using the Win32 API and C run-time (CRT) library functions, CRT console input/output (I/O), and Win32 text-mode I/O, should you need to deal with this sort of application.


Encodings in Web Pages

Generally speaking, there are four different ways of setting the character set or the encoding of a Web page.

  • Code pages. With this approach, you select from the list of supported code pages to create your Web content. The downside is that you are limited to the languages included in the selected character set, making true multilingual Web content impossible; this limits you to a single-script Web page.
  • Number entities. Number entities can be used to represent a few symbols outside the currently selected code page or encoding. Say, for example, you have created a Web page with the previous approach using the Latin ISO 8859-1 charset, and you also want to display some Greek characters in a mathematical equation; Greek characters, however, are not part of the Latin code page. Take, for instance, the Greek character Φ, which has the Unicode code point U+03A6. Using the decimal number entity of this code point preceded by &#, you would write: This is my text with a Greek Phi: &#934; and the output would be: This is my text with a Greek Phi: Φ. Unfortunately, this approach makes it impractical to compose large amounts of text and makes editing your Web content very hard.
  • UTF-16. Unlike Win32 applications, where UTF-16 is by far the best approach, Web content encoded as UTF-16 can be used safely only on Windows NT networks that have full Unicode support. It is therefore not a suggested encoding for Internet sites, where the capabilities of the client Web browser as well as the network's Unicode support are not known.
  • UTF-8. This Unicode encoding is the best and safest approach for multilingual Web pages. It allows you to encode the whole repertoire of Unicode characters, and all versions of Internet Explorer 4 and later, as well as Netscape 4 and later, support it without restrictions on network or wire capabilities. UTF-8 allows you to create multilingual Web content without having to change the encoding based on the target language.

Figure 1: Example of a multilingual Web page encoded in UTF-8.

Setting and Manipulating Encodings

Since Web content is currently based on Windows or other encoding schemes, you'll need to know how to set and manipulate encodings. The following describes how to do this for HTML pages, Active Server Pages (ASP), and XML pages.

HTML pages

Internet Explorer uses the character set specified for a document to determine how to translate the bytes in the document into characters on the screen or on paper. By default, Internet Explorer uses the character set specified in the HTTP content type returned by the server to determine this translation. If this parameter is not given, Internet Explorer uses the character set specified by the meta element in the document, taking into account the user's preferences if no meta element is specified.

To apply a character set to an entire document, you must insert the meta element before the body element. For clarity, it should appear as the first element after the head, so that all browsers can translate the meta element before the document is parsed. The meta element applies to the document containing it. This means, for example, that a compound document (a document consisting of two or more documents in a set of frames) can use different character sets in different frames. Here is how it works:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=<value>">

You substitute <value> with any supported character-set friendly name (for example, UTF-8) or any code-page name (for example, windows-1251). (For more information, see Character Set Recognition on MSDN.)

ASP pages

Internally, ASP and the language engines it calls (such as Microsoft Visual Basic Scripting Edition (VBScript), JScript, and so forth) all communicate in Unicode strings. However, Web pages currently consist of content that can be in Windows or other character-encoding schemes besides Unicode. Therefore, when form or query-string values come in from the browser in an HTTP request, they must be converted from the character set used by the browser into Unicode for processing by the ASP script. Similarly, when output is sent back to the browser, any strings returned by scripts must be converted from Unicode back to the code page used by the client. In ASP these internal conversions are done using the default code page of the Web server. This works great if the users and the server are all using the same language or script (more precisely, if they use the same code page). However, if you have a Japanese client connecting to an English server, the code-page translations just mentioned won't work, because ASP will try to treat Japanese characters as English ones.

The solution is to set the code page that ASP uses to perform these inbound and outbound string translations. Two mechanisms exist to set the code page:

  • Per page, at design time. For example: <% @ LANGUAGE=VBScript CODEPAGE=1252 %>

  • In script code, at run time. The Session.CodePage property sets the code page to use for the current session's string translations. In Microsoft Internet Information Services (IIS) 5.1 and later, the Response.CodePage property defines the code page of the response sent to the client. Once explicitly set, Response.CodePage overrides Session.CodePage, which in turn overrides the @CODEPAGE setting. For example: <% @ LANGUAGE=VBScript CODEPAGE=65001 %> <% Response.Write (Session.CodePage) Session.CodePage = 1252 %>

How are these code-page settings applied? First, any static content (HTML) in the .asp file is not affected at all; it is returned exactly as written. Any static strings in the script code (and in fact the script code itself) will be converted based on the CODEPAGE setting in the .asp file. Think of CODEPAGE as the way an author (or better yet, the authoring tool, which should be able to place this in the .asp file automatically) tells ASP the code page in which the .asp file was written.

Any dynamic content, such as Response.Write(x) calls where x is a variable, is converted using the value of Response.CodePage, which defaults to the CODEPAGE setting but can be overridden. You'll need this override, since the code page used to write the script might differ from the code page you use to send output to a particular client. For example, the author may have written the ASP page in a tool that generates text encoded in JIS, but the end user's browser might use UTF-8. With this code-page control feature, ASP enables correct handling of code-page conversion.

The browser behavior that the meta tags (described earlier) control can be achieved in server-side script by setting the Response.Charset property. Setting this property instructs the browser how to interpret the encoding of the incoming stream. Generally, this value should always match the session's code page.

For example, for an ASP page that did not include the Response.Charset property, the content-type header would be:

content-type:text/html

If the same .asp file included:

<% Response.Charset= "ISO-LATIN-7" %>

the content-type header would be:

content-type:text/html; charset=ISO-LATIN-7

XML pages

All XML processors are required to understand two transformations of the Unicode character encoding: UTF-8 (the default encoding) and UTF-16. The Microsoft XML Parser (MSXML) supports more encodings, but all text in XML documents is treated internally as the Unicode UTF-16 character encoding.

The encoding declaration identifies which encoding is used to represent the characters in the document. Although XML parsers can determine automatically if a document uses the UTF-8 or UTF-16 Unicode encoding, this declaration should be used in documents that support other encodings.

For example, the following is the encoding declaration for a document that uses the ISO 8859-1 encoding (Latin 1):

<?xml version="1.0" encoding="ISO-8859-1"?>


Encodings in Console

Programmers can use both Unicode and SBCS or DBCS encodings when programming console, or "text-mode," applications. For legacy reasons, non-Unicode console I/O functions use the console code page, which is an OEM code page by default. All other non-Unicode functions in Windows use the Windows code page. This means that strings returned by the console functions might not be processed correctly by the other functions and vice versa. For example, if FindFirstFileA returns a string that contains certain non-ASCII characters, WriteConsoleA will not display the string properly.

Always keeping track of which encoding is required by which function, and appropriately converting the encodings of textual parameters, can be hard. This task was simplified with the introduction of the functions SetFileApisToOEM and SetFileApisToANSI, plus the helper function AreFileApisANSI. The first two affect non-Unicode functions exported by KERNEL32.dll that accept or return a file name. As the names suggest, SetFileApisToOEM sets those functions to accept or return file names in the OEM character set corresponding to the current system locale, and SetFileApisToANSI restores the default Windows ANSI encoding for those names. The currently selected encoding can be queried with AreFileApisANSI.

With SetFileApisToOEM at hand, the problem of results from FindFirstFileA (or GetCurrentDirectoryA, or any of the file-handling functions of the Win32 API) that cannot be passed directly to WriteConsoleA is easily solved: after SetFileApisToOEM is called, FindFirstFileA returns text encoded in OEM, not in the Windows ANSI character set. This solution is not a universal remedy against all Windows ANSI versus OEM incompatibilities, however. Imagine you need to get text from a file-handling function, output it to the console, and then process it with another function that is not affected by SetFileApisToOEM. This perfectly realistic scenario still requires the encoding to be changed; otherwise, you would need to call SetFileApisToOEM to get data for console output, then call SetFileApisToANSI and fetch the same text again, just in another encoding, for internal processing. Another case where SetFileApisToOEM does not help is the handling of command-line parameters: when the entry point of your application is main (and not wmain), the arguments are always passed as an array of Windows ANSI strings. All this clearly complicates the life of a programmer who writes non-Unicode console applications.

To make things more complex, 8-bit code written for the console has to deal with two different types of locales. To write your code, you can use either Win32 API or C run-time library functions. ANSI functions of the Win32 API assume the text is encoded for the current console code page, which the system locale defines by default. The SetConsoleCP and SetConsoleOutputCP functions change the code page used in these operations. A user can run the chcp or mode con cp select= commands at the command prompt; this changes the code page of the current console. Another way to set a fixed console code page is to create a console shortcut with a default code page set (only available on East Asian localized versions of the operating system). Applications should be able to respond to such user actions.

Locale-sensitive functions of C run-time library (CRT functions) handle text according to the settings defined by a (_w)setlocale call. If (_w)setlocale is not called in the code, CRT functions use the ANSI "C" language invariant locale for those operations, losing language-specific functionality.

The declaration of the function is:

setlocale( int category, const char *locale)

or

_wsetlocale( int category, const wchar_t *locale)

The "category" defines the locale-specific settings affected (or all of them, if LC_ALL is specified). The variable-locale -is either the explicit locale name or one of the following:

  • ".OCP"-refers to the OEM code page corresponding to the current user locale
  • ".ACP" or ""-refers to the Windows code page corresponding to the current user locale

".OCP" and ".ACP" parameters always refer to the settings of the user locale, not the system locale. Hence they should not be used to set LC_CTYPE. This category defines the rules for Unicode to 8-bit conversion and other text-handling procedures, and must follow the settings of the console, accessible with GetConsoleCP and GetConsoleOutputCP.

The best long-term solution for a console application is to use Unicode, since Unicode interfaces are defined for both the Win32 API and C run-time library. The latter programming model still requires you to set the locale explicitly, but at least you can be sure the text seen by Win32 and CRT does not require transcoding.
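
The same advice holds for managed console applications. As a minimal sketch (a .NET analog of the Win32/CRT discussion above, not a replacement for it), setting the console's output encoding to a Unicode encoding sidesteps the OEM-versus-ANSI mismatch entirely, although whether every glyph actually displays still depends on the console font:

using System;
using System.Text;

public class UnicodeConsoleDemo
{
    public static void Main()
    {
        // Emit UTF-8 instead of the default OEM console codepage.
        Console.OutputEncoding = Encoding.UTF8;

        Console.WriteLine("English, Ελληνικά, 日本語");
    }
}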
