Complex Scripts Awareness

All language versions of Windows since Windows 2000 are enabled for all supported languages, so applications that use Unicode as their encoding model can handle mixed text from any of the supported scripts. For example, Notepad can display English, Persian, Greek, Hindi, Korean, and Thai text all at once. Several of these scripts require special processing to display and edit because their characters are not laid out in a simple linear progression from left to right, as most European characters are. These writing systems are referred to as “complex scripts.”

When writing applications, follow the guidelines below so that your application works properly with complex scripts:

  • When displaying typed text, do not output characters one at a time! Save the text in a buffer and display the whole string with DirectWrite.
  • To allocate character/glyph buffers, do not assume one character = one glyph. You can use DirectWrite to allocate glyph buffers.
  • To measure line lengths, do not sum cached character widths but rather use the GetTextExtent function or DirectWrite.
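
The first guideline can be sketched as follows. This is a minimal illustration, not a DirectWrite sample: `TypedLine` and its `draw_string` callback are hypothetical names that stand in for whatever whole-string text API the application uses. The point is only that each keystroke is appended to a buffer and the entire string is handed to the renderer, so the renderer can shape and reorder characters in context.

```python
class TypedLine:
    """Accumulates typed characters and re-renders the whole string each time."""

    def __init__(self, draw_string):
        # draw_string is a hypothetical callback standing in for a
        # whole-string text-output API.
        self._draw_string = draw_string
        self._buffer = []

    def on_char(self, ch):
        """Handle one typed character."""
        self._buffer.append(ch)
        # Pass the full string, never the single new character, so that the
        # renderer sees each character together with its neighbors.
        self._draw_string("".join(self._buffer))


# Usage: record what the renderer would be asked to draw.
calls = []
line = TypedLine(calls.append)
for ch in "\u0643\u062A\u0628":  # three Arabic letters: kaf, teh, beh
    line.on_char(ch)
print(len(calls), "render calls; the last one covers", len(calls[-1]), "characters")
```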

Bidirectionality and Character Reordering

Arabic scripts (used for languages such as Arabic, Persian, Pashto, Urdu, and others) and Hebrew scripts (used for Hebrew, Yiddish, and others) are read from right to left (RTL); in other words, these two scripts have an RTL reading order. (Additionally, readers of these languages commonly expect the text to be right-aligned.) For these scripts, the logical order (the order in which the user enters text with a sequence of virtual-key inputs) and the visual order (the order in which characters are presented to the user) differ in most cases. (See Figure 1.) Character positioning and caret movement in a bidirectional context, in which RTL characters and left-to-right (LTR) characters coexist, are the biggest hurdles to overcome when dealing with RTL scripts. A bidirectional context can be a mixture of Latin and Arabic or Hebrew text, or it can involve Arabic or Hebrew characters together with numerals, which have an LTR attribute even in Arabic and Hebrew. Figure 2 shows one such challenge: the trailing edge of one character in bidirectional text is not necessarily adjacent to the leading edge of the next character.

Figure 1: Bidirectional text (Arabic) in which the logical order (first row) and the visual order (second row) are different sequences of characters.

Figure 2: In bidirectional text the trailing edge of one character is not necessarily adjacent to the leading edge of the next character. In this example, the text selection follows the logical order.

The Unicode bidirectional algorithm resolves the layout of mixed-direction text in the absence of higher-level protocols. The following are some of the general assumptions this algorithm makes:

  • Adjacent runs of words of opposite language direction are laid out according to the base level: left to right for an English paragraph, right to left for an Arabic paragraph. (See the Uniscribe Glossary for the definition of run.)
  • Numbers following LTR words should be displayed to the right of the words.
  • Numbers following RTL words should be displayed to the left of the words.
  • Punctuation between words of the same language direction should be displayed between those words.
  • Punctuation between runs of words of opposite language direction appears between those runs.
  • Punctuation at the beginning or end of a paragraph is laid out according to the paragraph direction and is not affected by the direction of adjacent text.
  • The digits of numbers are laid out left to right in the number.
  • Commas and periods are considered part of a number when immediately surrounded by digits. Other characters, such as currency signs, are considered part of a number when immediately adjacent to a digit.

The algorithm makes a valiant and surprisingly successful stab at resolving what can be very ambiguous text. In applications such as databases and forms, this algorithm is often sufficient. In applications such as word processors, it is usually considered necessary to give the user more direct control over bidirectional-text layout. (To learn more about bidirectionality as well as Unicode algorithms and implementation, go to the Unicode site.)
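
As a rough illustration of the inputs the algorithm works from, the sketch below uses Python's standard `unicodedata` module to read each character's bidirectional class and to guess a paragraph's base direction from its first strong character. The helper names `bidi_classes` and `base_direction` are made up for this example, and the base-direction logic is a simplification that ignores directional isolates, embeddings, and any higher-level protocol. It does not perform reordering.

```python
import unicodedata

STRONG_RTL = {"R", "AL"}  # Hebrew-type and Arabic-type strong characters


def bidi_classes(text):
    """Return the Unicode bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


def base_direction(text, default="ltr"):
    """Guess the paragraph direction from the first strong character.

    Simplification: weak and neutral characters (digits, punctuation,
    spaces) are skipped; isolates and embeddings are not handled.
    """
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return "ltr"
        if cls in STRONG_RTL:
            return "rtl"
    return default


sample = "abc \u05D0\u05D1 12,5"  # Latin letters, two Hebrew letters, a number
print(list(zip(sample, bidi_classes(sample))))
print(base_direction(sample))
print(base_direction("12 \u0628\u062A abc"))  # digits are weak, Arabic decides
```

Digits report a weak class (`EN`) and the comma a separator class (`CS`), which is the information behind the assumptions above about numbers and the punctuation inside them.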

Contextual Shaping

For the Arabic and Indic families of languages, a character’s glyph (that is, its visual representation) can change greatly depending on the character’s position within a word and on the characters that precede or follow it. In Arabic, the same character can have several different shapes depending on the context (see table below).

Arabic Letter "ein" Forms

  • Isolated form
  • Initial form (beginning of a word)
  • Middle form
  • Final form (end of a word)

The difficult part of contextual shaping is that, in most encoding models, all of these glyphs share a single defined code point. Layout and display mechanisms must therefore select the appropriate glyph from the font tables at run time, depending on the context.
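
The single-code-point idea can be seen in Unicode's own data. Unicode does include separate compatibility code points for the four positional forms of the letter the table calls "ein" (Unicode name AIN), but each one decomposes back to the same letter, U+0639. The sketch below reads those decompositions with Python's standard `unicodedata` module; `describe` is a helper name invented for this example. Text is normally stored with U+0639 only, and the shaping engine chooses the glyph.

```python
import unicodedata

# Compatibility presentation forms of ARABIC LETTER AIN.
AIN_FORMS = ["\uFEC9", "\uFECB", "\uFECC", "\uFECA"]


def describe(ch):
    """Return (positional tag, base letter) for a presentation-form character."""
    tag, base = unicodedata.decomposition(ch).split()
    return tag.strip("<>"), chr(int(base, 16))


for form in AIN_FORMS:
    kind, base = describe(form)
    print(f"U+{ord(form):04X} {kind:<8} -> U+{ord(base):04X} {unicodedata.name(base)}")
```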

Combining Characters

For Latin script, there is often a direct one-to-one mapping between a character and its glyph. (For instance, the character "h" is always represented by the same glyph "h.") For complex scripts, several characters may combine to create a whole new glyph independent of the original characters. There are also cases where the number of resulting glyphs is greater than the number of characters used to generate them. Glyphs are often stacked or combined to create a cluster (see the Uniscribe Glossary for the definition of cluster), which is indivisible in most complex scripts. In Arabic, on the other hand, a cluster can be divided by breaking a glyph into its component characters and diacritics. (See Figure 3.)

Figure 3: Cluster formatting for (from left to right) Hindi, where four individual characters are resolved to one indivisible cluster of one glyph; Tamil, where two individual characters are resolved to one indivisible cluster of three glyphs; Arabic, where two individual characters are resolved to one divisible cluster of one glyph.

An indivisible cluster is treated as a single entity in UI-bound text handling. When selected or deleted, an indivisible cluster is selected or deleted as one symbol; when a caret is moved over such a cluster, it skips over the cluster in one cursor move. Divisible clusters, on the other hand, allow the user to position a caret within a cluster and to delete characters that were combined in the cluster.
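
The difference between the two behaviors can be sketched with a deliberately simplified cluster rule: a base character plus any combining marks that follow it. This is an assumption made for illustration only. Real shaping engines apply script-specific rules (for example, Devanagari conjuncts joined by the virama, U+094D) that this rule does not cover, and the function names `simple_clusters` and `backspace` are hypothetical.

```python
import unicodedata


def simple_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # General categories Mn, Mc, and Me are combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def backspace(text, divisible):
    """Delete one unit from the end of text.

    divisible=True  removes only the last character (Arabic-style cluster).
    divisible=False removes the whole last cluster (indivisible cluster).
    """
    if not text:
        return text
    if divisible:
        return text[:-1]
    return "".join(simple_clusters(text)[:-1])


hindi = "\u0915\u093F"   # DEVANAGARI LETTER KA + VOWEL SIGN I
arabic = "\u0628\u064E"  # ARABIC LETTER BEH + FATHA
print(len(hindi), "characters,", len(simple_clusters(hindi)), "cluster")
print(len(backspace(hindi, divisible=False)), "characters left after backspace")
print(len(backspace(arabic, divisible=True)), "character left after backspace")
```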

Word Breaking and Line Breaking

Word breaking and line breaking for Latin script follow some straightforward rules, such as breaking a line at a space, tab, or hyphen. In languages such as Thai, Khmer, Tibetan, and others, words run together: unlike Latin script, there is no space between the characters that end one word and those that begin the next. This makes word breaking in such languages a more complex process, since syntax rules still require lines to break on word boundaries. For these languages, word breaking is therefore based on grammatical analysis and on dictionary word matching during text processing at run time.
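
Dictionary word matching can be illustrated with a greedy longest-match segmenter. This is one simple, well-known technique and not necessarily what Windows uses; it also performs no grammatical analysis. The two-word Thai dictionary and the function name `segment` are assumptions made for this example.

```python
def segment(text, dictionary):
    """Split text into words by greedy longest match against a dictionary."""
    max_len = max((len(word) for word in dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                break
        else:
            candidate = text[i]  # unknown character: emit it on its own
        words.append(candidate)
        i += len(candidate)
    return words


# Toy dictionary: a greeting and a polite particle, written without a space.
WORDS = ["สวัสดี", "ครับ"]
text = "".join(WORDS)
print(segment(text, set(WORDS)))
```

Greedy matching can pick the wrong boundary when one dictionary word is a prefix of another, which is one reason production word breakers combine dictionaries with further analysis.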

Text Justification

To apply full justification to Latin text, extra space is added between words and characters. This approach cannot be used to justify Arabic text, because separating the characters would break the contextual shaping. Instead, continuous lines (or kashidas) are inserted between adjoining characters to make each word look longer. Figure 4 shows an example of Arabic text with kashidas inserted for justification purposes.

Figure 4: Arabic text with kashidas (in gray) inserted for justification purposes.
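
At the character level, a kashida corresponds to U+0640 (ARABIC TATWEEL). The sketch below lengthens a word by inserting tatweels at the last position where two letters connect. It is a toy model under stated assumptions: the set of letters that do not connect forward is a partial list, text with diacritics is not handled, the choice of the last connecting position is arbitrary, and the name `stretch_word` is hypothetical. Layout engines typically perform kashida justification on glyphs with font support rather than by editing the text.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL, the kashida character

# Letters that do not connect to the letter after them (partial list):
# hamza, alef variants, waw with hamza, teh marbuta, dal, thal, reh, zain, waw.
NO_FORWARD_JOIN = set(
    "\u0621\u0622\u0623\u0624\u0625\u0627\u0629\u062F\u0630\u0631\u0632\u0648"
)
NO_BACKWARD_JOIN = {"\u0621"}  # hamza does not connect to the letter before it


def is_arabic_letter(ch):
    """True for letters in the basic Arabic block, excluding the tatweel."""
    return "\u0621" <= ch <= "\u064A" and ch != TATWEEL


def stretch_word(word, target_len):
    """Pad word to target_len characters with tatweels at a connecting pair."""
    padding = target_len - len(word)
    if padding <= 0:
        return word
    # Scan from the end of the word for two letters that connect.
    for i in range(len(word) - 1, 0, -1):
        prev, nxt = word[i - 1], word[i]
        if (is_arabic_letter(prev) and is_arabic_letter(nxt)
                and prev not in NO_FORWARD_JOIN
                and nxt not in NO_BACKWARD_JOIN):
            return word[:i] + TATWEEL * padding + word[i:]
    return word  # no connecting pair: leave the word unchanged


word = "\u0643\u062A\u0628"  # kaf, teh, beh: all three letters connect
print(stretch_word(word, 5))
```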

Windows Script Support

Supported scripts in Windows are Arabic, Armenian, Avestan, Bamum, Batak, Bengali, Bopomofo, Brahmi, Braille, Canadian aboriginal syllabics, Cherokee, Chinese (Simplified & Traditional), Coptic, Cyrillic, Devanagari, Egyptian hieroglyphs, Ethiopic, Georgian, Glagolitic, Greek, Gujarati, Gurmukhi, Hebrew, Imperial Aramaic, Inscriptional Pahlavi, Inscriptional Parthian, Japanese, Javanese, Kaithi, Kannada, Khmer, Korean, Lao, Latin, Lisu (Fraser), Malayalam, Mandaic, Meetei Mayek, Mongolian, Myanmar, New Tai Lue, Ogham, Old South Arabian, Old Turkic (Orkhon), Oriya, 'Phags-pa, Runic, Samaritan, Sinhala, Syriac, Tai Le, Tai Tham (Lanna), Tai Viet, Tamil, Telugu, Thaana, Thai, Tibetan, and Yi. DirectWrite is the system engine used to shape and lay out scripts. It ships with Windows 7 and later and with Microsoft Internet Explorer 9 and later.

Although applications can interface directly with DirectWrite to render scripts, the easiest and most efficient way to support them is through the support that Windows provides automatically. Each time a system component is called to perform text output, the text is passed to a language module. This module analyzes the text, looks for complex-script portions (runs), and returns the processed text to the calling component for display.

Figure 5: Architecture of DirectWrite.