Globalization Step-by-Step

Complex Scripts Awareness

Overview and Description

All language versions of Windows 2000 and Windows XP are enabled for all supported languages, thereby empowering applications that use Unicode as their encoding model to handle mixed text from any of the supported scripts. For example, in Notepad you can display text containing English, Farsi, Greek, Hindi, Korean, and Thai text all at once. Among these scripts there are several that require special processing to display and edit because the characters are not laid out in a simple linear progression from left to right, as most European characters are. These writing systems are referred to as "complex scripts."

When writing applications, follow the guidelines below so that your application works properly with complex scripts:

  • When displaying typed text, do not output characters one at a time. Save the text in a buffer and display the whole string with Uniscribe or a Win32 API.

  • When allocating character or glyph buffers, do not assume that one character equals one glyph. You can use Uniscribe to allocate glyph buffers.

  • When measuring line lengths, do not sum cached character widths; use the GetTextExtent function or Uniscribe instead.

Characteristics of Complex Scripts

Bidirectionality and Character Reordering

Arabic scripts (used for languages such as Arabic, Farsi, Pashtu, and Urdu) and Hebrew scripts (used for Hebrew and Yiddish) are not only right-justified, but are also read from right to left (RTL); in other words, these two scripts have an RTL reading order. For these scripts, the logical order (the order in which the user enters text with a sequence of virtual-key inputs) and the visual order (the order in which characters are presented to the user) differ in most cases. (See Figure 1.) Character positioning and caret movement in a bidirectional context, in which RTL characters and left-to-right (LTR) characters coexist, are the biggest hurdles to overcome when dealing with RTL scripts. A bidirectional context can be a mixture of Latin and Arabic or Hebrew text, or it can involve Arabic and Hebrew characters with numerals, which have an LTR attribute in Arabic and Hebrew. Figure 2 shows one such challenge: in bidirectional text, the trailing edge of one character is not necessarily adjacent to the leading edge of the next character.

 

Figure 1: Bidirectional text (Arabic) where the logical order (first row) and the visual order (second row) are not of the same sequence of characters.


 

Figure 2: In bidirectional text the trailing edge of one character is not necessarily adjacent to the leading edge of the next character. In this example, the text selection follows the logical order.


The Unicode bidirectional algorithm resolves the layout of mixed-direction text in the absence of higher-level protocols. The following are some of the general assumptions this algorithm makes:

  • Adjacent runs of words of opposite language direction are laid out according to the base level: left to right for an English paragraph, right to left for an Arabic paragraph. (See the Uniscribe Glossary for the definition of run.)

  • Numbers following LTR words should be displayed to the right of the words.

  • Numbers following RTL words should be displayed to the left of the words.

  • Punctuation between words of the same language direction should be displayed between those words.

  • Punctuation between runs of words of opposite language direction appears between those runs.

  • Punctuation at the beginning or end of a paragraph is laid out according to the paragraph direction and is not affected by the direction of adjacent text.

  • The digits of numbers are laid out left to right in the number.

  • Commas and periods are considered part of a number when immediately surrounded by digits. Other characters, such as currency signs, are considered part of a number when immediately adjacent to a digit.
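Two of the assumptions above (numbers after RTL words appear to their left, and digits inside a number stay left to right) can be illustrated with a small sketch. This is not the Unicode bidirectional algorithm: it assumes the whole string is a single RTL run, uses uppercase ASCII letters as stand-ins for RTL letters, and the function name visual_order_rtl is made up for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Reverse the characters in s[first, last) in place. */
static void reverse_span(char *s, size_t first, size_t last)
{
    while (first + 1 < last) {
        char t = s[first];
        s[first] = s[last - 1];
        s[last - 1] = t;
        first++;
        last--;
    }
}

/* A character is part of a number if it is a digit, or a comma or
   period immediately surrounded by digits. */
static int in_number(const char *s, size_t i, size_t n)
{
    if (isdigit((unsigned char)s[i]))
        return 1;
    if ((s[i] == '.' || s[i] == ',') && i > 0 && i + 1 < n)
        return isdigit((unsigned char)s[i - 1]) &&
               isdigit((unsigned char)s[i + 1]);
    return 0;
}

/* Hypothetical helper: produce the visual order of a pure RTL run.
   'visual' must have room for strlen(logical) + 1 characters. */
void visual_order_rtl(const char *logical, char *visual)
{
    size_t n, i = 0;

    assert(logical != NULL && visual != NULL);
    n = strlen(logical);
    memcpy(visual, logical, n + 1);

    /* Step 1: an RTL run is displayed in reverse of its logical order. */
    reverse_span(visual, 0, n);

    /* Step 2: digits inside each number are laid out left to right,
       so reverse every number back to its logical order. */
    while (i < n) {
        if (in_number(visual, i, n)) {
            size_t start = i;
            while (i < n && in_number(visual, i, n))
                i++;
            reverse_span(visual, start, i);
        } else {
            i++;
        }
    }
}
```

With the logical string "ABC 123" the sketch produces "123 CBA": the number ends up to the left of the RTL word, and its digits keep their left-to-right order.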

The algorithm makes a valiant and surprisingly successful stab at resolving what can be very ambiguous text. In applications such as databases and forms, this algorithm is often sufficient. In applications such as word processors, it is usually considered necessary to give the user more direct control over bidirectional-text layout. (To learn more about bidirectionality as well as Unicode algorithms and implementation, go to the Unicode site.)

Contextual Shaping

For the Arabic and Indic families of languages, the glyph used to display a character (that is, the character's visual representation) can change greatly depending on the character's position within a word and on the characters that precede or follow it. In Arabic, the same character can have several different shapes depending on the context (see the forms listed below).

Forms of the Arabic letter "ein":

  • Isolated form
  • Initial form (beginning of a word)
  • Middle form
  • Final form (end of a word)

The difficult part of contextual shaping is that, for all the various glyphs, the encoding models define only one code point. The layout and display mechanisms must therefore select, at run time, the appropriate glyph from the font tables, depending on the context.
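As a rough illustration of choosing a form from context, the sketch below picks one of the four forms for the character at a given index. It rests on loud simplifications: only the basic Arabic letters U+0626 to U+064A are considered, a small hard-coded set of letters is treated as never joining to the following letter, and the result is mapped to the Unicode presentation-form code points for the letter ain (U+FEC9 to U+FECC) rather than to glyphs in a font table, which is what a real shaping engine would use. The names select_form and ain_glyph are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { FORM_ISOLATED, FORM_INITIAL, FORM_MEDIAL, FORM_FINAL } ArabicForm;

/* Simplification: only the basic Arabic letters are treated as joining. */
static int is_letter(uint32_t c)
{
    return c >= 0x0626 && c <= 0x064A;
}

/* Letters such as alef, teh marbuta, dal, thal, reh, zain, and waw
   never connect to the letter that follows them. */
static int joins_forward(uint32_t c)
{
    if (!is_letter(c))
        return 0;
    switch (c) {
    case 0x0627: case 0x0629: case 0x062F: case 0x0630:
    case 0x0631: case 0x0632: case 0x0648:
        return 0;
    default:
        return 1;
    }
}

/* Choose the contextual form of text[i] from its logical neighbors. */
ArabicForm select_form(const uint32_t *text, size_t len, size_t i)
{
    int joined_before, joined_after;

    assert(text != NULL && i < len);
    joined_before = i > 0 && joins_forward(text[i - 1]) && is_letter(text[i]);
    joined_after = i + 1 < len && joins_forward(text[i]) && is_letter(text[i + 1]);

    if (joined_before && joined_after) return FORM_MEDIAL;
    if (joined_before) return FORM_FINAL;
    if (joined_after) return FORM_INITIAL;
    return FORM_ISOLATED;
}

/* Presentation-form code point of the letter ain (U+0639) for a form. */
uint32_t ain_glyph(ArabicForm form)
{
    /* Indexed by ArabicForm: isolated, initial, medial, final. */
    static const uint32_t forms[] = { 0xFEC9, 0xFECB, 0xFECC, 0xFECA };
    return forms[form];
}
```

Note that the single code point U+0639 is all the backing store ever contains; the form is derived each time the text is laid out.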

Combining Characters

For Latin script, there is often a direct one-to-one mapping between a character and its glyph. (For instance, the character "h" is always represented by the same glyph "h.") For complex scripts, several characters can combine to create a whole new glyph independent of the original characters. There are also cases where the number of resulting glyphs is greater than the number of characters used to generate them. Characters are often stacked or combined to create a cluster (see the Uniscribe Glossary for the definition of cluster), which is indivisible for most complex scripts. In Arabic, on the other hand, a cluster can be divided by breaking a glyph into its component characters and diacritics. (See Figure 3.)

 

Figure 3: Cluster formatting for (from left to right) Hindi, where four individual characters get resolved to one indivisible cluster of one glyph; Tamil, where two individual characters get resolved to one indivisible cluster of three glyphs; Arabic, where two individual characters get resolved to one divisible cluster of one glyph.


An indivisible cluster is treated as a single entity in UI-bound text handling. When selected or deleted, an indivisible cluster is selected or deleted as one symbol; when a caret is moved over such a cluster, it skips over the cluster in one cursor move. Divisible clusters, on the other hand, allow the user to position a caret within a cluster and to delete characters that were combined in the cluster.
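The caret behavior for indivisible clusters can be sketched as follows. The sketch assumes that a layout engine has already assigned a cluster index to every character in logical order (the cluster_of array here is a hypothetical, simplified stand-in for such a map), and that caret positions are counted between characters, from 0 to the text length.

```c
#include <assert.h>
#include <stddef.h>

/* Move the caret forward past the whole cluster that starts at 'pos'.
   cluster_of[i] is the index of the cluster containing character i. */
size_t next_caret(const unsigned short *cluster_of, size_t len, size_t pos)
{
    unsigned short current;

    assert(pos <= len);
    if (pos == len)
        return len;
    current = cluster_of[pos];
    while (pos < len && cluster_of[pos] == current)
        pos++;
    return pos;
}

/* Move the caret backward past the whole cluster that ends at 'pos'. */
size_t prev_caret(const unsigned short *cluster_of, size_t len, size_t pos)
{
    unsigned short current;

    assert(pos <= len);
    if (pos == 0)
        return 0;
    current = cluster_of[pos - 1];
    while (pos > 0 && cluster_of[pos - 1] == current)
        pos--;
    return pos;
}
```

For a divisible cluster, such as the Arabic case described above, an editor would instead allow the caret to stop at every character boundary inside the cluster.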

Word Breaking and Line Breaking

Word breaking and line breaking for Latin script follow some straightforward rules, such as breaking a line at a space, tab, or hyphen. For languages like Thai and Khmer, words run together, with no space between the characters that end one word and those that begin the next, unlike Latin script. This makes word breaking in such languages a more complex process, since syntax rules require line breaking on word boundaries. Thus for languages like Thai and Khmer, word breaking is based on grammatical analysis and on word matching in dictionaries during text processing at run time.
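The dictionary-matching part of this process can be sketched with a greedy longest-match segmenter. This is only an assumption about one possible approach, not the method Windows uses: it works on ASCII stand-in text, the dictionary is a plain array of strings, and the function names are invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the longest dictionary word that is a prefix of 'text',
   or 0 if no dictionary word matches. */
static size_t longest_match(const char *text,
                            const char *const *dict, size_t dict_len)
{
    size_t best = 0, i;

    for (i = 0; i < dict_len; i++) {
        size_t n = strlen(dict[i]);
        if (n > best && strncmp(text, dict[i], n) == 0)
            best = n;
    }
    return best;
}

/* Store the offset of each word end in 'breaks' and return how many
   were found. A character that starts no dictionary word is treated
   as a one-character word. */
size_t find_word_breaks(const char *text,
                        const char *const *dict, size_t dict_len,
                        size_t *breaks, size_t max_breaks)
{
    size_t pos = 0, count = 0, len;

    assert(text != NULL && dict != NULL && breaks != NULL);
    len = strlen(text);
    while (pos < len && count < max_breaks) {
        size_t n = longest_match(text + pos, dict, dict_len);
        pos += n ? n : 1;
        breaks[count++] = pos;
    }
    return count;
}
```

Greedy matching can pick the wrong segmentation when one dictionary word is a prefix of another (for example, "cats" followed by "at" instead of "cat" followed by "sat"), which is one reason grammatical analysis is needed in addition to dictionary lookup.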

Text Justification

To justify Latin text, spaces are added between words and characters. This approach cannot be used to justify Arabic text or the contextual shaping will break. Instead, continuous lines (or kashidas) are inserted between adjoining characters to make each word look longer. Figure 4 shows an example of Arabic text with kashidas inserted for justification purposes.

Figure 4: Arabic text with kashidas (in gray) inserted for justification purposes.
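As a simple arithmetic sketch of kashida justification, the function below spreads the extra width needed to fill a line evenly over the points where a kashida may be inserted. Even distribution is an assumption made for brevity; a real layout engine would typically weigh the insertion points by typographic priority. The name distribute_kashida and the pixel units are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Split 'extra_width' (in pixels) across 'count' kashida insertion
   points. Each entry of kashida_widths receives the width of the
   kashida to insert at that point. */
void distribute_kashida(int extra_width, int *kashida_widths, size_t count)
{
    size_t i;
    int base, remainder;

    assert(kashida_widths != NULL || count == 0);
    if (count == 0)
        return;
    if (extra_width <= 0) {
        /* The line already fills the available width. */
        for (i = 0; i < count; i++)
            kashida_widths[i] = 0;
        return;
    }

    base = extra_width / (int)count;
    remainder = extra_width % (int)count;
    for (i = 0; i < count; i++) {
        /* The first 'remainder' points get one extra pixel so that
           the widths add up exactly to extra_width. */
        kashida_widths[i] = base + (i < (size_t)remainder ? 1 : 0);
    }
}
```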


Windows Support for Complex Scripts

Supported complex scripts in Windows XP are Arabic, Divehi, Hebrew, Syriac, Thai, Vietnamese, and the Indic family of scripts including Bengali, Devanagari, Gujarati, Gurmukhi, Kannada, Oriya, Tamil, and Telugu. The Unicode Script Processor (USP10.dll), also known as Uniscribe, is the system engine used to shape and lay out complex scripts. It is shipped with Windows 2000 and Windows XP, with Microsoft Internet Explorer 5 and later, and with Microsoft Office 2000 and Microsoft Office XP.

Although applications can interface directly with Uniscribe to render complex scripts, the easiest and most efficient way of supporting complex scripts is through the support that Windows 2000 and Windows XP provide automatically, as discussed in the following section. Each time the system's User or GDI components are called to perform text output, if support for complex scripts is installed, the text is passed to a language module called "LPK.dll" (a name derived from [L]anguage [P]ac[K]). This module analyzes the text and looks for complex-script portions, or runs. If the text does not contain any such runs, it is immediately returned to the User or GDI components for processing. If the text does contain complex-script runs, its processing is handed to Uniscribe, which sends the final processed text to the User or GDI components to be displayed. (See Figure 5.)

Figure 5: Flowchart of complex-script processing.


Options for Displaying Text

Text Layout in Win32

In order to display text in a multilingual context, which also entails the output of complex scripts, there are four possible options:

  • Calling Win32 text APIs

  • Instantiating Win32 standard edit controls

  • Instantiating rich edit controls

  • Calling Uniscribe

The following sections briefly explain the advantages of each of these possibilities. It's then up to you to decide which is the best option for your application based on the application's complexity and its design features.

Win32 Text APIs

Many applications deal mostly with plaintext: text that is all in the same typeface, weight, color, and so on. Such applications have traditionally displayed text using the standard Win32 display entry points (TextOut, ExtTextOut, TabbedTextOut, and DrawText) to write text to a window, and the GetTextExtent family of functions to measure line lengths. Starting with Windows 2000, the standard entry points have been extended to support display of multilingual Unicode text and complex scripts, to display vertical text, and to handle special rules regarding line breaking and word breaking. In general, this support is transparent to the application itself, so properly designed applications may require no changes to support complex scripts through these interfaces. Certain design guidelines must be followed, however. Requirements for the support of complex scripts, including bidirectional text, are described below.

Figure 6 shows how ExtTextOut can be used to lay out multilingual Unicode text including complex scripts. There is no need for you to do anything other than call ExtTextOut; it handles everything for you.

Figure 6: Multilingual text output using the ExtTextOut API.


Here is sample code illustrating the use of ExtTextOut to draw text:

HDC    hDC;
HFONT  hFont, hOldFont;

// Create a font object to display text using Microsoft Sans Serif.
hFont = CreateFont(14, 0, 0, 0, FW_NORMAL, FALSE, FALSE, FALSE, DEFAULT_CHARSET,
                   OUT_CHARACTER_PRECIS, CLIP_DEFAULT_PRECIS, PROOF_QUALITY,
                   VARIABLE_PITCH | FF_SWISS, TEXT("Microsoft Sans Serif"));
hDC = GetDC(hDlg);

// Select the new font and remember the font that was selected before.
hOldFont = (HFONT)SelectObject(hDC, hFont);

// Output the whole buffer lpszText into the selected device context.
ExtTextOut(hDC, 10, 10, ETO_CLIPPED, NULL, lpszText, (UINT)_tcslen(lpszText), NULL);

// Restore the original font before releasing the DC and deleting the new font.
SelectObject(hDC, hOldFont);
ReleaseDC(hDlg, hDC);
DeleteObject(hFont);

There are three requirements for displaying complex scripts correctly in applications that use the standard Win32 text APIs:

  • First, applications should save characters in a buffer and display the whole line of text at once rather than, for example, calling ExtTextOut on each character as it is typed in by the user. When characters are written out one by one, the complex-script shaping modules cannot determine the context for correct reordering and glyph shaping.

    Figure 7 demonstrates the results of calling ExtTextOut on each character versus on the whole string.

    Figure 7: In the first row the word is being written one character at a time, which does not provide enough information to Uniscribe to do the layout and shaping. In the second row, the string is passed as a whole to the Win32 API, and the final result is properly laid out.


  • Second, applications should use one of the GetTextExtentXXX functions to determine line length rather than computing line lengths from cached character widths. This is because the width of a glyph used to display a character can vary by context. (See Figure 8.)

    Figure 8: The width of the shaped string might be shorter or longer than the sum of individual character widths. In the first row, the two Arabic letters "Beh" and "Alef," measured individually, are together wider than the same two characters once they are combined.


  • Third, bidirectional-aware applications should make sure that Arabic and Hebrew scripts are rendered using RTL alignment and reading order. You can use the GetTextAlign and SetTextAlign APIs to retrieve and set, respectively, the alignment of the text in a given device context. As for the reading order, calls to ExtTextOut and DrawText should specify the appropriate RTL reading-order flags, which are, respectively, ETO_RTLREADING and DT_RTLREADING. (See Figure 9.)

    Figure 9: Bidirectional text output. On the left side, the reading order for the sentence "123-52 equals 71." is broken. On the right side, the same sentence is drawn using ETO_RTLREADING to produce the proper reading order.


Depending on the purpose of your application, you might also find it necessary to support vertical text. Although text that is read horizontally from left to right is becoming more common in East Asian countries (text in Japanese technical and business journals, for example, is often printed horizontally), many books, magazines, and newspapers still print text vertically.

As Figure 10 shows, displaying text vertically does not mean that you simply rotate an entire line of text by 90 degrees. Most characters remain upright, but others, such as those identified by arrows, change orientation.

Fortunately, with Win32 you do not need to write code to rotate characters. To display text vertically on Windows 2000 and Windows XP, enumerate the available fonts as usual, and select a font whose face name begins with the at sign (@). Then create a LOGFONT structure, setting both the escapement and the orientation to 270 degrees (the lfEscapement and lfOrientation members are expressed in tenths of a degree, so the value to store is 2700). Calls to TextOut are the same as for horizontal text.

Figure 10: Text displayed vertically.


The Windows Platform SDK contains a sample application called "TATE" (short for "tategaki," meaning "vertical writing"), which demonstrates how to create fonts and display vertical text. (For more information on vertical writing, see the Windows Platform SDK documentation, available at http://msdn2.microsoft.com.)

The line-breaking and word-breaking rules for Asian languages are quite different from those for Western languages. For example, unlike most Western written languages, Chinese, Japanese, Korean, and Thai do not necessarily use spaces to separate words. Although the Thai language does not use spacing between words, it still requires lines to be broken on word boundaries.

For these languages, world-ready software applications cannot conveniently base line-breaking and word-wrapping algorithms on a space character or on standard hyphenation rules. They must follow different guidelines.

Take Japanese, for example. Japanese line breaking is based on the kinsoku rules: you can break lines between any two characters, with several exceptions. The first exception is that a line of text cannot end with any leading characters (such as opening quotation marks, opening parentheses, and currency signs) that shouldn't be separated from succeeding characters. The second exception is that a line of text cannot begin with any following characters (such as closing quotation marks, closing parentheses, and punctuation marks) that shouldn't be separated from preceding characters. The third exception is that certain overflow characters (such as punctuation characters) are allowed to extend beyond the right margin for horizontal text or below the bottom margin for vertical text.
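The first two kinsoku exceptions can be sketched as a check on the pair of characters around a candidate break. The character sets below are small, illustrative samples chosen for this sketch, not a complete kinsoku table, and the function name can_break_between is hypothetical. The third exception (overflow characters hanging past the margin) is a layout decision and is not modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Leading characters that must not end a line (sample only):
   left corner bracket, fullwidth left parenthesis,
   left double quotation mark, fullwidth yen sign. */
static const uint32_t NO_LINE_END[] = { 0x300C, 0xFF08, 0x201C, 0xFFE5 };

/* Following characters that must not begin a line (sample only):
   right corner bracket, fullwidth right parenthesis,
   right double quotation mark, ideographic comma, ideographic full stop. */
static const uint32_t NO_LINE_START[] = { 0x300D, 0xFF09, 0x201D, 0x3001, 0x3002 };

static int in_set(uint32_t c, const uint32_t *set, size_t n)
{
    size_t i;

    assert(set != NULL);
    for (i = 0; i < n; i++) {
        if (set[i] == c)
            return 1;
    }
    return 0;
}

/* Return 1 if a line may be broken between the code points 'before'
   and 'after', and 0 if one of the two rules forbids it. */
int can_break_between(uint32_t before, uint32_t after)
{
    if (in_set(before, NO_LINE_END, sizeof NO_LINE_END / sizeof NO_LINE_END[0]))
        return 0;   /* would leave a leading character at the end of the line */
    if (in_set(after, NO_LINE_START, sizeof NO_LINE_START / sizeof NO_LINE_START[0]))
        return 0;   /* would put a following character at the start of the line */
    return 1;
}
```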

As you can see, these rules and exceptions can become somewhat complicated. However, by using the appropriate APIs (DrawTextEx, ExtTextOut, TextOut, and so on), you don't need to worry about which rule to use. Win32 functions take care of line breaking and word breaking for you. (For more information on East Asian line and word breaking, go to Rules for Breaking Lines in Asian Languages.)

Standard Edit Control

The second option to display text in a multilingual context is to instantiate the standard edit control. This control has been extended in Windows 2000 and Windows XP to support data containing multilingual text and complex scripts, and includes not only input and display, but also correct cursor movement over character clusters (in Thai and Devanagari script, for example). As with the standard Win32 API functions, a well-written application will receive this support automatically, without modification. Again, you should consider adding support for right-to-left reading order and right alignment. In this case, toggle the extended style flags of the edit control window to manage these attributes, as shown in the following code:

// ID_EDITCONTROL is the control ID in the resource file.
// GetDlgItem returns a window handle (HWND) for the control.
HWND hWndEdit = GetDlgItem(hDlg, ID_EDITCONTROL);
LONG lAlign = GetWindowLong(hWndEdit, GWL_EXSTYLE);

// To toggle alignment
lAlign ^= WS_EX_RIGHT;

// To toggle reading order
lAlign ^= WS_EX_RTLREADING;

After setting the lAlign value, enable the new display by setting the extended style of the edit control window as follows:

// This assumes your edit control is in a dialog box. If not,
// you can get the edit control handle from another source.
SetWindowLong(hWndEdit, GWL_EXSTYLE, lAlign);
InvalidateRect(hWndEdit, NULL, FALSE);

One new feature of the standard edit control is a context menu (activated by pressing the right mouse button while the cursor is in the field) that allows the user to toggle the reading order and to insert or display Unicode bidirectional control characters. (See Figure 11.)

Figure 11: The edit control's context menu allows the user to insert Unicode control characters and to toggle the text reading order.


Rich Edit Control

A third option for multilingual text display is Rich Edit. Rich Edit 3 is a higher-level collection of interfaces that takes advantage of Uniscribe to further insulate text-layout clients from the complexities of certain scripts. Rich Edit provides fast, versatile editing of rich Unicode multilingual text and simple plaintext. It includes extensive message and Component Object Model (COM) interfaces, text editing, formatting, line breaking, simple table layout, vertical text layout, bidirectional-text layout, Indic and Thai support, a Word-like edit UI, and Text Object Model (TOM) interfaces. Rich Edit is the simplest way for a client to support features of complex scripts. Clients use its TextOut function to automatically parse, shape, position, and break lines.

Uniscribe

Uniscribe, the last of the four options, supports the complex rules found in scripts such as Arabic, Thai, and the scripts used for Indic languages. Uniscribe also handles scripts written from right to left, such as Arabic or Hebrew, and supports the mixing of scripts. For plaintext clients, Uniscribe provides a range of ScriptString functions that are similar to TextOut, with additional support for caret placement. The remainder of the Uniscribe interfaces provide finer control to clients.

Uniscribe uses multiple shaping engines that contain the layout knowledge for particular scripts. It also takes advantage of the OpenType layout shaping engine for handling font-specific script features such as glyph generation, extent measurement, and word-breaking support.

Uniscribe subdivides strings of characters into items, runs, and clusters. The client builds runs based on its own stored formatting attributes and on the item boundaries obtained by calling the Uniscribe ScriptItemize API. The Uniscribe ScriptShape API breaks a run into clusters according to script rules and then generates glyphs. The ScriptPlace API generates advance widths and x and y offsets for the glyphs. The ScriptTextOut API then displays the glyphs using those positions.

Uniscribe supports line breaking at word boundaries through ScriptBreak. Hit testing and cursor positioning are supported by ScriptCPtoX and ScriptXtoCP. Character-to-glyph mapping is provided by ScriptGetCMap. Uniscribe manages bidirectional character reordering using the Unicode bidirectional algorithm, and also understands non-OpenType layout font formats for Arabic, Hebrew, and Thai shaping and positioning.

Using Uniscribe, the text-layout client only needs to manage a backing store of Unicode character codes. The client does not need to maintain any other buffer or mapping table to track character order, but rather only needs to store and manage the order in which the characters were entered by the user. This is the same logical order as defined by Unicode. The client's backing store never changes as a result of layout operations. Uniscribe maintains an index from the reordered clusters to the original character boundaries passed by the client.
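The idea that the backing store stays in logical order while an index maps it to the display order can be sketched as follows. The visual-to-logical array stands in for whatever mapping a layout engine returns; the function names and the one-character-per-position simplification are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Given visual_to_logical[v] = logical index shown at visual position v,
   fill logical_to_visual[l] = visual position of logical index l. */
void invert_order(const size_t *visual_to_logical,
                  size_t *logical_to_visual, size_t count)
{
    size_t v;

    for (v = 0; v < count; v++) {
        assert(visual_to_logical[v] < count);
        logical_to_visual[visual_to_logical[v]] = v;
    }
}

/* Build the display string from the backing store without modifying
   the backing store. 'out' must have room for count + 1 characters. */
void build_visual(const char *backing, const size_t *visual_to_logical,
                  size_t count, char *out)
{
    size_t v;

    assert(backing != NULL && out != NULL);
    for (v = 0; v < count; v++)
        out[v] = backing[visual_to_logical[v]];
    out[count] = '\0';
}
```

With a backing store of "abc" and a visual-to-logical map of {2, 0, 1}, build_visual yields "cab" while the backing store is left untouched; the inverse map {1, 2, 0} is what hit testing and caret placement would consult to go from a logical character to its position on screen.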

Text I/O in Web Pages and .NET Framework

Text input, output, and display in Web content has been made a lot easier because HTML rendering in Internet Explorer is handled by the Trident module (Mshtml.dll), which is one of the Uniscribe clients. All support for different input languages and complex scripts is provided to Web-based pages automatically and transparently, as long as Unicode encoding (either UTF-8 or UTF-16) is used. For Web content within the .NET Framework, system support hides all implementation details for Microsoft Windows Forms and for other .NET applications.
