Globalization Step-by-StepOn This PageOverview and DescriptionA codepage is a list of selected character codes characters represented as code points) in a certain order. Codepages are usually defined to support specific languages or groups of languages that share common writing systems. All Window codepages can only contain 256 code points. Most of the first 127 code pointsrepresent the same characters. This is to allow for continuity and legacy code. It is the upper 128 code points 128-255 (0-based) where codepages differ considerably. For example, codepage 1253 provides character codes required in the Greek writing system and codepage 1250 provides the characters for Latin writing systems including English, German and French. It is the upper 128 code points that contain either the accent characters or the Greek characters. Thus you cannot store Greek and German in the same code stream unless you put some type of identifier to indicate what codepage you are referencing. This becomes even more complex when dealing with Asian characters sets. Because Chinese, Japanese and Korean contain more than 256 characters, a different scheme needed to be developed but it had to be based on the concept of 256 character codepages. Thus DBCS (Double Byte Character Sets) were born. Each Asian character is represented by a pair of code points (thus double-byte). For programming awareness, a set of points are set aside to represent the first byte of the set and are not valued unless they are immediately followed by a defined second byte. DBCS meant that you had to write code that would treat these pair of code points as one,and this still disallowed the combining of say Japanese and Chinese in the same data stream, because depending on the codepage the same double-byte code points represent different characters for the different languages. In order to allow for the storage of different languages in the same data stream, Unicode was created. This one "codepage" can represent 64000+ characters and now with the introduction of surrogates it can represent 1,000,000,000+ characters. The use of Unicode in Windows 2000 allows for easier creation of World-Ready code, because you no longer have to worry about which codepage you are addressing, nor whether you had to group character points to represent one character. Please note that when writing Unicode applications for Win95/98/ME you still need to convert the Unicode code points back to Window codepages. This is because Win95/98/ME GDI is still ANSI based. But this is made easy with the functions WideChartoMultiByte and MultiByteToWideChar. See "Unicode and Character Sets" on MSDN. For information about encodings in web pages, please see MLang on MSDN. Encodings in Win32See Unicode and Character Sets on MSDN. Encodings in the .NET FrameworkThe .NET Framework is a platform for building, deploying, and running Web services and applications that provides a highly productive, standards-based, multilanguage environment for integrating existing or legacy investments with next-generation applications and services. The .NET Framework uses Unicode UTF-16 to represent characters, although in some cases it uses UTF-8 internally. The System.Text namespace provides classes that allow you to encode and decode characters, with support that includes the following encodings:
The .NET Framework provides support for data encoded using code pages. You can use the Encoding.GetEncoding method (Int32) to create a target encoding object for a specified code page. Specify a code page number as the Int32 parameter. The following code example creates an encoding object-enc -for the code page 1252.
After you create an encoding object that corresponds to a specified code page, you can use the object to perform other operations supported by the System.Text.Encoding class. The one additional type of support introduced to ASP.NET is the ability to clearly distinguish between file, request, and response encodings. To set the encoding in ASP.NET for code, page directives, and configuration files, you'll need to do the following. In code:
In page directives:
In a configuration file:
The following code example in C# uses the Encoding.GetEncoding method to create a target encoding object for a specified code page. The Encoding.-GetBytes method is called on the target encoding object to convert a Unicode string to its byte representation in the target encoding. The byte representations of the strings in the specified code pages are displayed.
To determine the encoding to use for response characters in an Active Server Pages for the .NET Framework (ASP.NET) application, set the value of the HttpResponse.ContentEncoding property to the value returned by the appropriate method. The following code example illustrates how to set HttpResponse.ContentEncoding.
The last major area of discussion involves encodings in console or text-mode programming. In the section that follows, you'll find information on using the Win32 API and C run-time (CRT) library functions, CRT console input/output (I/O), and Win32 text-mode I/O, should you need to deal with this sort of application. Encodings in Web PagesGenerally speaking, there are four different ways of setting the character set or the encoding of a Web page.
Setting and Manipulating EncodingsSince Web content is currently based on Windows or other encoding schemes, you'll need to know how to set and manipulate encodings. The following describes how to do this for HTML pages, Active Server Pages (ASP), and XML pages. HTML pagesInternet Explorer uses the character set specified for a document to determine how to translate the bytes in the document into characters on the screen or on paper. By default, Internet Explorer uses the character set specified in the HTTP content type returned by the server to determine this translation. If this parameter is not given, Internet Explorer uses the character set specified by the meta element in the document, taking into account the user's preferences if no meta element is specified. To apply a character set to an entire document, you must insert the meta element before the body element. For clarity, it should appear as the first element after the head, so that all browsers can translate the meta element before the document is parsed. The meta element applies to the document containing it. This means, for example, that a compound document (a document consisting of two or more documents in a set of frames) can use different character sets in different frames. Here is how it works:
You substitute with any supported character-set-friendly name (for example, UTF-8) or any code-page name (for example, windows 1251). (For more information, see Character Set Recognition on MSDN). ASP pagesInternally, ASP and the language engines it calls-such as Microsoft Visual Basic Scripting Edition (VBScript), JScript, and so forth-all communicate in Unicode strings. However, Web pages currently consist of content that can be in Windows or other character-encoding schemes besides Unicode. Therefore, when form or query-string values come in from the browser in an HTTP request, they must be converted from the character set used by the browser into Unicode for processing by the ASP script. Similarly, when output is sent back to the browser, any strings returned by scripts must be converted from Unicode back to the code page used by the client. In ASP these internal conversions are done using the default code page of the Web server. This works great if the users and the server are all using the same language or script (more precisely, if they use the same code page). However, if you have a Japanese client connecting to an English server, the code page translations just mentioned won't work because ASP will try to treat Japanese characters as English ones. The solution is to set the code page that ASP uses to perform these inbound and outbound string translations. Two mechanisms exist to set the code page:
How are these code-page settings applied; First, any static content (HTML) in the .asp file is not affected at all; it is returned exactly as written. Any static strings in the script code (and in fact the script code itself) will be converted based on the CODEPAGE setting in the .asp file. Think of CODEPAGE as the way an author (or better yet, the authoring tool, which should be able to place this in the .asp file automatically) tells ASP the code page in which the .asp file was written. Any dynamic content-such as Response.Write(x) calls, where the x is a variable-is converted using the value of Response.CodePage, which defaults to the CODEPAGE setting but can be overridden. You'll need this override, since the code page used to write the script might differ from the code page you use to send output to a particular client. For example, the author may have written the ASP page in a tool that generates text encoded in JIS, but the end user's browser might use UTF-8. With this code-page control feature, ASP now enables correct handling of code-page conversion. The behavior of the browser set by the meta tags (described earlier) in the server-side script can be achieved by setting the Response.Charset property. Setting this property would instruct the browser how to interpret the encoding of the incoming stream. Generally this value should always match the value of the session's code page. For example, for an ASP page that did not include the Response.Charset property, the content-type header would be:
If the same .asp file included:
the content-type header would be:
XML pagesAll XML processors are required to understand two transformations of the Unicode character encoding: UTF-8 (the default encoding) and UTF-16. The Microsoft XML Parser (MSXML) supports more encodings, but all text in XML documents is treated internally as the Unicode UTF-16 character encoding. The encoding declaration identifies which encoding is used to represent the characters in the document. Although XML parsers can determine automatically if a document uses the UTF-8 or UTF-16 Unicode encoding, this declaration should be used in documents that support other encodings. For example, the following is the encoding declaration for a document that uses the ISO 8859-1 encoding (Latin 1):
Encodings in ConsoleProgrammers can use both Unicode and SBCS or DBCS encodings when programming console, or "text-mode," applications. For legacy reasons, non-Unicode console I/O functions use the console code page, which is an OEM code page by default. All other non-Unicode functions in Windows use the Windows code page. This means that strings returned by the console functions might not be processed correctly by the other functions and vice versa. For example, if FindFirstFileA returns a string that contains certain non-ASCII characters, WriteConsoleA will not display the string properly. Always keeping track of which encoding is required by which function-and appropriately converting encodings of textual parameters-can be hard. This task was simplified with the introduction of the functions SetFileApisToOEM, SetFileApisToANSI, and a helper function AreFileApisANSI. The first two affect non-Unicode functions exported by KERNEL32.dll that accept or return a file name. As the names suggest, the SetFileApisToOEM sets those functions to accept or return file names in the OEM character set corresponding to the current system locale, and SetFileApisToANSI restores the default, Windows ANSI encoding for those names. Currently selected encoding can be queried with AreFileApisANSI. With SetFileApisToOEM at hand, the problem with the results of WindFirstFileA (or GetCurrentDirectoryA, or any of the file-handling functions of Win32 API) that cannot be passed directly to WriteConsoleA is easily solved: after SetFileApisToOEM is called, WindFirstFileA returns text encoded in OEM, not in the Windows ANSI character set. This solution is not a universal remedy against all Windows ANSI versus OEM incompatibilities, however. Imagine you need to get text from a file-handling function, output it to console, and then process it by another function, which is not affected by SetFileApisToOEM. This absolutely realistic scenario will require the encoding to be changed. Otherwise, you will need to call SetFileApisToOEM to get data for console output, then SetFileApisToANSI and get the same text, just in another encoding, for internal processing. Another case when SetFileApisToOEM does not help is handling of the command-line parameters: when the entry point of your application is main (and not wmain), the arguments are always passed as an array of Windows ANSI strings. All this clearly complicates the life of a programmer who writes non-Unicode console applications. To make things more complex, 8-bit code written for console has to deal with two different types of locales. To write your code, you can use either Win32 API or C run-time library functions. ANSI functions of Win32 API assume the text is encoded for the current console code page, which the system locale defines by default. The SetConsoleCP and SetConsoleOutputCP functions change the code page used in these operations. A user can call chcp or mode con cp select= commands in the command prompt; this will change the code page for the current console. Another way to set a fixed console code page is to create a console shortcut with a default code-page set (only available on East Asian localized versions of the operating system). Applications should be able to respond to a user's actions. Locale-sensitive functions of C run-time library (CRT functions) handle text according to the settings defined by a (_w)setlocale call. If (_w)setlocale is not called in the code, CRT functions use the ANSI "C" language invariant locale for those operations, losing language-specific functionality. The declaration of the function is:
or
The "category" defines the locale-specific settings affected (or all of them, if LC_ALL is specified). The variable-locale -is either the explicit locale name or one of the following:
".OCP" and ".ACP" parameters always refer to the settings of the user locale, not the system locale. Hence they should not be used to set LC_CTYPE. This category defines the rules for Unicode to 8-bit conversion and other text-handling procedures, and must follow the settings of the console, accessible with GetConsoleCP and GetConsoleOutputCP. The best long-term solution for a console application is to use Unicode, since Unicode interfaces are defined for both the Win32 API and C run-time library. The latter programming model still requires you to set the locale explicitly, but at least you can be sure the text seen by Win32 and CRT does not require transcoding.
| In This Article
|