Language issues result from differences in how languages around the world handle display, alphabets, grammar, and syntactic rules.
Bidirectional (Bidi) is the term used to describe text containing scripts that flow both left-to-right (LTR) and right-to-left (RTL). Text that mixes English and Arabic is a good example.
There are several issues you must keep in mind when making sure your application is Bidi-aware.
- Internal Data Storage — As mentioned above, Bidi text contains both LTR and RTL scripts. Although the two scripts flow in different directions on screen, both are stored in the same logical order, from the first character to the last. The best way to envision this is to think of the data stored from the top of a buffer to the bottom.
- Display Stream — Most Latin-based languages can be displayed one character at a time. Bidi text cannot: character position prescribes script flow, and Arabic ligatures change shape depending on the preceding and following characters. It is therefore best to keep the currently displayed line in a buffer and output the whole buffer every time you modify or add a character in the line.
- Line Length — Because of the ligature changes mentioned in the bullet above, it is not a good practice to sum cached character lengths to calculate the length of a line.
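The storage-versus-display distinction above can be demonstrated in a short sketch (Python is used here purely for illustration): a mixed English/Arabic string is stored in logical order, first character to last, regardless of how the bidi algorithm later reorders it for display.

```python
# Bidi text is stored in logical order: the order in which it is typed,
# not the order in which it is displayed.
s = 'abc سلام'          # English (LTR) followed by Arabic (RTL)

# The first Arabic letter typed, seen (س), is stored right after the
# space, even though it is displayed as the rightmost Arabic character.
assert s[4] == 'س'

# Storage order is simply first-to-last; display reordering is a
# separate step performed by the bidi layout engine.
assert list(s) == ['a', 'b', 'c', ' ', 'س', 'ل', 'ا', 'م']
```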
A built-in feature of ASCII is that you can create the lowercase and uppercase form of each letter in the English alphabet by adding or subtracting 0x0020 to or from its corresponding code point:
A[0x0041] + 0x0020 = a[0x0061]
Therefore, converting to either case was a simple addition or subtraction:
if ((c >= 'a') && (c <= 'z')) upper = c - 0x0020;
This is not the case for accented Latin characters (Ă [U+0102] and ă [U+0103] differ by 1, not 0x0020). You cannot simply add or subtract the same value to or from all characters to get their corresponding upper- and lowercase representations.
There are several other reasons why algorithmic solutions for case handling do not cover all cases:
- Some languages do not have a one-to-one mapping between upper- and lowercase characters. For example:
- European French accented characters lose their accents in uppercase (é becomes E). However, French-Canadian accented characters keep their accents (é becomes É).
- The uppercase equivalent of the German ß is SS.
- Most non-Latin scripts do not even use the concept of lower- and uppercase. Chinese, Japanese, and Thai, for example, have no case distinction at all.
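These exceptions are easy to see with Unicode's default case mappings; the following sketch (in Python, whose `str.upper` implements the Unicode case mappings) contrasts them with the naive ±0x20 trick:

```python
# The ASCII trick: upper and lower case differ by exactly 0x0020.
assert ord('a') - 0x0020 == ord('A')

# Accented Latin: Ă (U+0102) and ă (U+0103) differ by 1, not 0x0020.
assert ord('ă') - ord('Ă') == 1

# One-to-many mapping: German ß uppercases to the two-character "SS".
assert 'ß'.upper() == 'SS'

# Caseless scripts: uppercasing Japanese text changes nothing.
assert '日本語'.upper() == '日本語'
```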
A code page is a list of selected character codes (characters represented as code points) in a certain order. Code pages are usually defined to support specific languages or groups of languages that share common writing systems. All single-byte Windows code pages contain 256 code points. In most code pages, the first 128 code points (0-127) represent the same characters, which provides continuity for legacy code. It is in the upper 128 code points (128-255) that code pages differ considerably.
For example, code page 1253 provides character codes that are required in the Greek writing system. Code page 1250 provides the characters for Latin writing systems including English, German, and French. It is the upper 128 code points that contain either the accent characters or the Greek characters. Consequently, you cannot store Greek and German in the same code stream unless you include some type of identifier that indicates the referenced code page.
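This ambiguity can be reproduced with Python's codecs for these Windows code pages: the same byte value decodes to different characters depending on which code page is assumed, and characters missing from a code page simply cannot be encoded.

```python
# The same upper-range byte means different things in different code pages.
b = bytes([0xC3])
assert b.decode('cp1253') == 'Γ'   # Greek capital gamma in code page 1253
assert b.decode('cp1250') == 'Ă'   # Latin A with breve in code page 1250

# Greek text cannot be stored under a Latin code page at all.
try:
    'Γειά'.encode('cp1250')
except UnicodeEncodeError:
    pass                            # expected: Γ has no cp1250 code point
else:
    raise AssertionError('expected Greek to be unencodable in cp1250')
```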
Because Chinese, Japanese, and Korean contain more than 256 characters, a different scheme, based on the concept of code pages that contain 256 code points, needed to be developed. The result was Double-Byte Character Sets (DBCS).
In DBCS, a pair of code points (a double byte) represents each character. A range of code points is set aside to serve as lead bytes; a lead byte has no value on its own and must be immediately followed by a defined trail byte. DBCS therefore requires code that treats these pairs of code points as one character. Even so, DBCS does not allow two languages, for example Japanese and Chinese, to be combined in the same data stream, because the same double-byte code points represent different characters depending on the code page.
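A DBCS such as Shift-JIS (Japanese code page 932) illustrates the lead-byte scheme; the sketch below uses Python's built-in `shift_jis` codec:

```python
# In Shift-JIS, ASCII characters take one byte but ideographs take two.
mixed = 'A漢'.encode('shift_jis')
assert len(mixed) == 3             # 1 byte for 'A', 2 bytes for '漢'

# The first byte of a double-byte character falls in a reserved
# lead-byte range (0x81-0x9F or 0xE0-0xFC); it has no meaning alone
# and must be paired with the trail byte that follows it.
lead = mixed[1]
assert 0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xFC
```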
The special processing required by a complex script can involve one or more of the following characteristics: character reordering; contextual shaping; display of combining characters and diacritics; specialized word-break and justification rules; cursor positioning; filtering out illegal character combinations. Scripts considered complex include Arabic, Hebrew, Thai, Vietnamese, and the Indic family.
It is important to respect the following points:
- When displaying typed text, do not output characters one at a time.
- To allocate character/glyph buffers, do not assume one character equals one glyph.
- To measure line lengths, do not sum cached character widths.
Windows has the ability to select an appropriate font to display a particular script. Windows accomplishes this by using a new face name called MS Shell Dlg. MS Shell Dlg is a mapping mechanism that makes it possible for Windows to support cultures/locales that have characters that are not contained in code page 1252. It is not a font, but is instead a face name for a nonexistent font. The MS Shell Dlg face name maps to the default shell font associated with the current culture/locale. For example, in U.S. English Windows 98 this maps to MS Sans Serif. However, in Greek Windows 98, this maps to MS Sans Serif Greek. In U.S. English Windows 2000, it maps to Tahoma. However, MS Shell Dlg does not work on East Asian versions of Windows 9x. For more information, see Localization and the Shell Font.
However, application developers often overlook fonts when creating world-ready applications. Here are two issues that you must watch when dealing with fonts:
- Hard-Coded Font Names — With the use of Unicode, we now deal with thousands of different characters instead of hundreds. Most fonts do not cover the entire Unicode character set. Thus, if you hard-code a font name that displays English characters but not Japanese, all of your localized Japanese text will display incorrectly. Another reason not to hard-code font names is that the font you want may not be installed on the system displaying your text.
- Hard-Coded Font Sizes — Some scripts are more complex than others and need more pixels to be displayed properly. For example, most English characters can be displayed on a 5x7 grid, but Japanese characters need at least a 16x16 grid to be clearly seen. Chinese needs a 24x24 grid, whereas Thai needs only 8 pixels of width but at least 22 pixels of height. Thus, it is easy to understand that some characters may not be legible at a small font size.
The best way to treat font names and sizes is to consider them another localizable resource. Using MS Shell Dlg solves the problem of running an application in any language on any language version of Windows NT/Windows 2000. Setting the font as a localizable resource makes it possible for your localizer to change the font for the localized UI.
Input Method Editors (IMEs), also called front-end processors, are applets that make it possible for the user to enter the thousands of different characters used in East Asian written languages using a standard 101-key keyboard.
The user composes each character in one of several ways: by radical, by phonetic representation, or by typing in the character's numeric code page index. IMEs are widely available; Windows ships with IMEs based on the most popular input methods used in each target area.
An IME consists of an engine that converts keystrokes into phonetic and ideographic characters plus a dictionary of commonly used ideographic words. As the user enters keystrokes, the IME engine attempts to convert the keystrokes into an ideographic character or characters.
Because many ideographs have identical pronunciation, the IME engine's first guess is not always correct. When the suggestion is incorrect, the user can choose from a list of homophones; the homophone that the user selects then becomes the IME engine's first guess the next time around.
You do not need to use a localized keyboard to enter ideographic characters. While localized keyboards can generate phonetic syllables (such as kana or hangul) directly, the user can represent phonetic syllables using Latin characters.
In Japanese, romaji are Latin characters representing kana. Japanese keyboards contain extra keys that make it possible for the user to toggle between entering romaji and entering kana. If you are using a non-Japanese keyboard, you need to type in romaji to generate kana.
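The romaji-to-kana conversion step can be illustrated with a toy sketch (this is not a real IME engine; the mapping table below covers only a few syllables chosen for illustration):

```python
# A tiny, illustrative romaji-to-kana table (real IMEs use full syllable
# tables plus a dictionary for kana-to-kanji conversion).
ROMAJI_TO_KANA = {'ka': 'か', 'na': 'な', 'ni': 'に', 'ho': 'ほ', 'n': 'ん'}

def romaji_to_kana(text):
    """Greedily convert a romaji string to kana, longest syllable first."""
    out, i = [], 0
    while i < len(text):
        for length in (2, 1):          # try two-letter syllables first
            chunk = text[i:i + length]
            if chunk in ROMAJI_TO_KANA:
                out.append(ROMAJI_TO_KANA[chunk])
                i += max(len(chunk), 1)
                break
        else:
            out.append(text[i])        # pass through anything unmapped
            i += 1
    return ''.join(out)
```

For example, `romaji_to_kana('kana')` yields 'かな', and `romaji_to_kana('nihon')` yields 'にほん'.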
There are three discrete levels of IME support for applications running on Windows: no support, partial support, and fully customized support. Applications can customize IME support in small ways — by repositioning windows, for example — or they can completely change the look of the IME user interface.
- No Support — IME-unaware applications ignore all IME-specific Windows messages. Most applications that target single-byte languages are IME-unaware. Applications that are IME-unaware inherit the default user interface of the active IME through a predefined global class, appropriately called IME. For each thread, Windows automatically creates a window based on the IME global class; all IME-unaware windows of the thread share this default IME window.
- Partial Support — IME-aware applications can create their own IME windows instead of relying on the system default. Applications that contain partial support for IMEs can use these functions to set the style and the position of the IME user interface windows, but the IME DLL is still responsible for drawing them — the general appearance of the IME's user interface remains unchanged.
- Full Support — In contrast, fully IME-aware applications take over responsibility for painting the IME windows (the status, composition, and candidate windows) from the IME DLL. Such applications can fully customize the appearance of these windows, including determining their screen position and selecting which fonts and font styles are used to display characters in them. This is especially convenient and effective for word processing and similar programs whose primary function is text manipulation and which therefore benefit from smooth interaction with IMEs, creating a "natural" interface with the user.
For more information, see Input Method Editor.
Line-breaking and word-wrapping algorithms are important to text parsing as well as to text display. Western languages typically break lines on hyphenation rules or word boundaries, and break words based on white space (spaces, tabs, end-of-line characters, punctuation, and so on).
However, the rules for Asian DBCS languages are quite different from the rules for Western languages. For example, unlike most Western written languages, Chinese, Japanese, Korean, and Thai do not necessarily separate one word from the next with a space. Thai does not even use punctuation.
For these languages, world-ready software applications cannot conveniently base line breaks and word-wrapping algorithms on a space character or on standard hyphenation rules. They must follow different guidelines.
For example, the kinsoku rule determines Japanese line breaking. You can break lines between any two characters, with several exceptions:
- A line of text cannot end with any leading characters — such as opening quotation marks, opening parentheses, and currency signs — that should not be separated from succeeding characters.
- A line of text cannot begin with any following characters — such as closing quotation marks, closing parentheses, and punctuation marks — that should not be separated from preceding characters.
- Certain overflow characters (punctuation characters) can extend beyond the right margin for horizontal text or below the bottom margin for vertical text.
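A minimal sketch of the first two rules (the character sets here are a small illustrative subset, not the full kinsoku tables):

```python
# Illustrative subsets of the kinsoku character classes.
NO_LINE_START = set('。、」）！？')   # must not begin a line (following chars)
NO_LINE_END = set('「（')            # must not end a line (leading chars)

def kinsoku_break(text, width):
    """Return the index at which to break `text` so the first line is at
    most `width` characters, moved left as needed to satisfy kinsoku."""
    if len(text) <= width:
        return len(text)
    pos = width
    # Pull the break left while it would start the next line with a
    # forbidden character, or end this line with one.
    while pos > 1 and (text[pos] in NO_LINE_START or
                       text[pos - 1] in NO_LINE_END):
        pos -= 1
    return pos
```

For example, with a width of 4, the break in 'ああああ」いい' moves back to index 3 so the new line does not begin with '」'.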
Keyboard layouts change according to culture/locale. Some characters do not exist in all keyboard layouts. When assigning shortcut-key combinations, make sure that you can reproduce them using international keyboards, especially if you plan to use the shortcut-key combinations with the Windows 2000 MUI (Multilanguage User Interface).
Because each culture/locale may use a different keyboard, consider using numbers and function keys (F4, F5, and so on) instead of letters in shortcut-key combinations.
Although you do not need to localize number and function-key combinations, they are not as intuitive for the user as letter combinations. Also, some shortcut keys may not work with every keyboard layout used in a particular culture/locale; some cultures/locales, such as those in Eastern Europe and most Arabic-speaking countries/regions, use more than one keyboard layout.
For right-to-left (RTL) languages, not only do text alignment and reading order go from right to left, but the UI layout should also follow this natural direction. Of course, this layout change applies only to localized RTL languages.
Note The .NET Framework does not support mirroring.
Arabic and Hebrew Windows 98 introduced mirroring technology to resolve the issues with flipping the UI; Windows 2000 uses this same technology. It gives a true RTL look and feel to the UI. On Windows 98, this technology is available only on the localized Arabic and Hebrew operating systems. On Windows 2000 and later, however, all versions of the operating system are mirroring-aware, making it possible for you to easily create a mirrored application.
To avoid confusion around coordinates, try to replace the concept of left/right with the concept of near/far. Mirroring is in fact nothing more than a coordinate transformation:
- Origin (0,0) is in the upper RIGHT corner of a window
- X scale factor = -1 (i.e., values of X increase from right to left)
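As a sketch (plain Python rather than actual GDI calls), the transformation amounts to negating x relative to the window width; note that mirroring a rectangle swaps its left and right edges:

```python
def mirror_point(x, y, window_width):
    """Map an LTR client coordinate to its mirrored (RTL) equivalent:
    the origin moves to the upper-right corner and x is negated."""
    return (window_width - x, y)   # y is unaffected; only x flips

def mirror_rect(left, top, right, bottom, window_width):
    """Mirroring swaps a rectangle's left and right edges, which is why
    thinking in near/far rather than left/right avoids confusion."""
    return (window_width - right, top, window_width - left, bottom)
```

For example, in an 800-unit-wide window, the point (0, 0) maps to (800, 0), and the rectangle (10, 5, 110, 50) maps to (690, 5, 790, 50).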
The following figure illustrates the coordinate transformation from LTR to RTL:
To minimize the rewriting needed for applications to support mirroring, system components such as GDI and User have been modified so that mirroring can be turned on and off with almost no additional code changes, except for a few considerations regarding owner-drawn controls and bitmaps.
For more information, see Window Layout and Mirroring in Window Features.
All applications at some time process data, whether text or numerical. In the past, different culture/locale language requirements meant that applications used diverse encodings to represent this data internally. These encodings caused fragmented code bases for operating systems and applications (single-byte editions for European languages, double-byte editions for East Asian languages, and bidirectional editions for Middle Eastern languages). This fragmentation has made it hard to share data and even harder to support a multilingual UI.
Since a goal of globalization is writing code that functions equally well in any of the supported cultures/locales, a data encoding scheme that represents each character uniquely across all the required cultures/locales is essential. Unicode meets this requirement.
Unicode makes it possible to store different languages in the same data stream. This one encoding can represent 64,000+ characters and, with the introduction of surrogates, more than 1,000,000 characters. The use of Unicode in Windows makes it easier to create world-ready code because you no longer need to reference a code page or group code points to represent one character.
Unicode is a 16-bit international character encoding that covers values for over 45,000 characters (with room for over a million more). Unicode text is usually easier to process than text in other encodings: it eliminates the need to keep track of which code page the characters came from and which encoding scheme produced them.
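Both properties, mixed-language storage without code-page tags and surrogate pairs for characters beyond the 16-bit range, can be seen directly:

```python
# Greek and German coexist in one stream; no code-page identifier needed.
s = 'Grüße Γειά'
assert 'ü' in s and 'Γ' in s

# Each character in the 16-bit range is one UTF-16 code unit (2 bytes)...
assert len('Γ'.encode('utf-16-le')) == 2

# ...but a supplementary character such as U+20000 needs a surrogate
# pair: two 16-bit code units (4 bytes).
supp = '\U00020000'
assert len(supp.encode('utf-16-le')) == 4
```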
Note A Unicode-enabled product is still not fully world-ready. In fact, enabling your code to use Unicode is probably only 10 percent of the work.
Using Unicode encoding to represent all international characters enables Windows 2000 to support over 64 scripts and hundreds of languages.