Rules for Breaking Lines in Asian Languages

Line-breaking and word-wrapping algorithms are important to text parsing as well as to text display. The rules for Asian languages, however, differ considerably from the rules for Western languages. For example, unlike most Western written languages, Chinese, Japanese, Korean, and Thai do not necessarily use spaces to separate words. Thai also uses very little punctuation. For these languages, software applications cannot simply base line-breaking and word-wrapping algorithms on the space character or on standard hyphenation rules; they must follow different guidelines.

Because the Win32 full-text search engine for Microsoft WinHelp recognizes that word wrapping is more complex for some languages than for others, it supports the IWordBreak OLE interface. That way, if a third-party developer creates a superior word-wrapping algorithm for any language, the WinHelp engine can take advantage of it through OLE.

Dividing Lines of Text in Japanese

Glossary

  • Leading characters: Characters—such as opening quotation marks, opening parentheses, and currency signs—that shouldn't be separated from succeeding characters.
  • Following characters: Characters—such as closing quotation marks, closing parentheses, and punctuation marks—that shouldn't be separated from preceding characters.
  • Overflow characters: Punctuation characters that are allowed to extend beyond the right margin for horizontal text or below the bottom margin for vertical text.

Japanese line breaking is based on the kinsoku rule—you can break lines between any two characters, with several exceptions. The first exception is that a line of text cannot end with any leading characters, which are listed below. (Characters are shown with their hexadecimal code points for Shift-JIS.)

24      A2      816D    8177
28      8165    816F    8179
5B      8167    8171    818F
5C      8169    8173    8190
7B      816B    8175    8192
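
As a rough illustration, code points like the ones in this table can be turned into actual characters with a codec for the matching character set. The sketch below uses Python and its built-in `shift_jis` codec. The helper name `chars_from_hex` and the variable `LEADING_SAMPLE` are made up for this example, and only a few of the table entries are included.

```python
def chars_from_hex(code_points, encoding="shift_jis"):
    """Decode hexadecimal code points (for example "8175") into a set of characters."""
    return {bytes.fromhex(cp).decode(encoding) for cp in code_points}


# A handful of leading characters taken from the table above (not the full list).
LEADING_SAMPLE = chars_from_hex(["28", "5B", "7B", "A2", "8169", "8175", "8177"])

print(sorted(LEADING_SAMPLE))
```

A set built this way can be handed to whatever line-breaking routine the application uses, so the rule tables stay as data rather than being hard-coded.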

The second exception is that a line of text cannot begin with any following characters. Following characters are listed below.

21      B0      8168    82A7
25      DE      816A    82C1
29      DF      816C    82E1
2C      8141    816E    82E3
2E      8142    8170    82E5
3F      8143    8172    82EC
5D      8144    8174    8340
7D      8145    8176    8342
A1      8146    8178    8344
A3      8147    817A    8346
A4      8148    818B    8348
A5      8149    818C    8362
A7      814A    818D    8383
A8      814B    818E    8385
A9      8152    8191    8387
AA      8153    8193    838E
AB      8154    81F1    8395
AC      8155    829F    8396
AD      8158    82A1
AE      815B    82A3
AF      8166    82A5

The third exception is that certain overflow characters are allowed to extend past the right or bottom margin. These characters are listed below.

    
2C      8141
2E      8142
A1      8143
A4      8144


You must decide how to implement special cases. For example, you might choose to break lines only between complete words instead of between two individual characters. Also, you might want to follow standard English or European line-breaking rules for any text that the user enters using Latin characters.
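
The three exceptions can be sketched as a small greedy line breaker. This is only an illustration in Python, under several assumptions that are not from the text: the text is a Unicode string, every character occupies one column, `width` is a character count, and the three sets hold just a few sample characters. The function name `kinsoku_break` is hypothetical. When no legal break point exists on a line, the sketch breaks anyway.

```python
LEADING = {"「", "『", "（"}               # must not end a line (sample only)
FOLLOWING = {"」", "』", "）", "、", "。"}   # must not begin a line (sample only)
OVERFLOW = {"、", "。"}                     # may hang past the margin (sample only)


def kinsoku_break(text, width, leading=LEADING, following=FOLLOWING,
                  overflow=OVERFLOW):
    """Split text into lines of about `width` characters, kinsoku style."""
    lines = []
    start = 0
    n = len(text)
    while n - start > width:
        end = start + width  # tentative line is text[start:end]
        hang_ok = text[end] in overflow and (
            end + 1 == n or text[end + 1] not in following)
        if hang_ok:
            # Third exception: let the punctuation hang past the margin.
            end += 1
        else:
            # First and second exceptions: move the break to the left until
            # the line no longer ends with a leading character and the next
            # line no longer begins with a following character.
            while end > start + 1 and (
                    text[end - 1] in leading or text[end] in following):
                end -= 1
        lines.append(text[start:end])
        start = end
    if start < n:
        lines.append(text[start:])
    return lines


for line in kinsoku_break("あい「うえ」お。", 3):
    print(line)
```

A production implementation would also measure real glyph widths and apply Western rules to runs of Latin text, as the paragraph above suggests.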

 

Dividing Lines of Text in Chinese

The Chinese language uses the same line-breaking rule as the Japanese language does. A line cannot end with any leading characters, which are listed below.

Traditional Chinese (characters are shown with their hexadecimal code point for the Chinese-BIG5 CharSet)

0028    A165    A173    A1A3
005B    A167    A175    A1A5
007B    A169    A177    A1A7
A15D    A16B    A179    A1A9
A15F    A16D    A17B    A1AB
A161    A16F    A17D
A163    A171    A1A1

Simplified Chinese (characters are shown with their hexadecimal code point for the GB-2312 CharSet)

0028    A1B6    A3B0    A3B7
005B    A1B8    A3B1    A3B8
007B    A1BA    A3B2    A3B9
A1AE    A1BC    A3B3    A3DB
A1B0    A1BE    A3B4    A3FB
A1B2    A3A8    A3B5
A1B4    A3AE    A3B6

A line cannot begin with any following characters, which are listed below.

Traditional Chinese (characters are shown with their hexadecimal code point for the Chinese-BIG5 CharSet)

0021    A147    A156    A16E
0029    A148    A157    A170
002C    A149    A158    A172
002E    A14A    A159    A174
003A    A14B    A15A    A176
003B    A14C    A15B    A178
003F    A14D    A15C    A17A
005D    A14E    A15E    A17C
007D    A14F    A160    A17E
A141    A150    A162    A1A2
A142    A151    A164    A1A4
A143    A152    A166    A1A6
A144    A153    A168    A1A8
A145    A154    A16A    A1AA
A146    A155    A16C    A1AC

Simplified Chinese (characters are shown with their hexadecimal code point for the GB-2312 CharSet)

0021    A1A4    A1B1    A3A7
0029    A1A5    A1B3    A3A9
002C    A1A6    A1B5    A3AC
002E    A1A7    A1B7    A3AE
003A    A1A8    A1B9    A3BA
003B    A1A9    A1BB    A3BB
003F    A1AA    A1BD    A3BF
005D    A1AB    A1BF    A3DD
007D    A1AC    A1C3    A3E0
A1A2    A1AD    A3A1    A3FC
A1A3    A1AF    A3A2    A3FD

The following overflow characters are allowed to extend past the right margin:

Traditional Chinese and Simplified Chinese (characters are shown with their hexadecimal code points, which are the same for the Chinese-BIG5 CharSet and the GB-2312 CharSet)

0021    003B
0029    003F
002C    005D
002E    007D
003A
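
Because the same rule applies to both character sets, one option is to decode the code points into Unicode and keep a single rule table for Traditional and Simplified Chinese. That design is an assumption of this example, not something the text prescribes. The minimal Python sketch below uses the built-in `big5` and `gb2312` codecs; the helper name `to_char` and the variable `FOLLOWING_SAMPLE` are hypothetical.

```python
def to_char(hex_code, encoding):
    """Decode one hexadecimal code point from the given character set."""
    return bytes.fromhex(hex_code).decode(encoding)


# The ideographic full stop appears in both following-character tables,
# under a different code point in each character set.
big5_stop = to_char("A143", "big5")
gb_stop = to_char("A1A3", "gb2312")

print(big5_stop == gb_stop)

# One Unicode set can then serve both Traditional and Simplified Chinese text.
FOLLOWING_SAMPLE = {big5_stop, gb_stop, to_char("A3AC", "gb2312")}
```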


Dividing Lines of Text in Korean

Korean words expressed in hangul are separated by spaces, as they are in Western languages. Some Korean-language applications allow the user to choose whether or not to break lines between hangul characters.

This example breaks lines only between words.

The example below breaks lines between individual hangul characters.

The standard rule for breaking lines between hangul characters, called geumchik, is very similar to the Japanese kinsoku rule—you can break lines between any two characters, with the following exceptions. A line of text cannot end with any leading characters. (Characters are shown with their hexadecimal code point for Korean standard code, KSC 5601.)

28 (    A1B0    A1BA    A3DC
5B [    A1B2    A1BC    A3FB
5C \    A1B4    A3A4
7B {    A1B6    A3A8
A1AE    A1B8    A3DB

A line of text cannot begin with any following characters, listed below:

21 !    7D }    A1BD    A3AC
25 %    A1A2    A1C6    A3AE
29 )    A1AF    A1C7    A3BA
2C ,    A1B1    A1C8    A3BB
2E .    A1B3    A1C9    A3BF
3A :    A1B5    A1CB    A3DC
3B ;    A1B7    A3A1    A3DD
3F ?    A1B9    A3A5    A3FD
5D ]    A1BB    A3A9

The geumchik rule defines three methods for dealing with following characters. The first method, the JalLaNaeGi method, breaks the line before the first character to the left of the following character, as shown below:

The MilEoNuGi method breaks the line after the following character and compresses the text that falls before it, as shown below:

The GeuNyangDuGi method extends the right margin slightly to accommodate the following character, as shown below:

This method can also extend the bottom margin.

There is no special category for overflow characters in Korean.
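
The three ways of handling a following character can be compared with a short Python sketch. It rests on assumptions that are not from the text: every character counts as one column, only following characters are checked, and each line is returned with a label describing how a renderer would have to treat it. The function name `geumchik_break` and the labels are hypothetical. The sketch does not handle two following characters in a row.

```python
def geumchik_break(text, width, following, method="jallanaegi"):
    """Return (line, treatment) pairs for one of the three geumchik methods."""
    if method not in ("jallanaegi", "mileonugi", "geunyangdugi"):
        raise ValueError("unknown method: " + method)
    lines = []
    start = 0
    n = len(text)
    while n - start > width:
        end = start + width
        treatment = "normal"
        if text[end] in following:
            if method == "jallanaegi":
                if end - start > 1:
                    # Carry the preceding character down with the
                    # following character, leaving this line short.
                    end -= 1
                    treatment = "short"
            elif method == "mileonugi":
                # Keep the following character and squeeze the line.
                end += 1
                treatment = "compressed"
            else:
                # GeuNyangDuGi: let the character extend past the margin.
                end += 1
                treatment = "overhang"
        lines.append((text[start:end], treatment))
        start = end
    if start < n:
        lines.append((text[start:], "normal"))
    return lines


sample = "안녕하세요,반갑습니다."
for name in ("jallanaegi", "mileonugi", "geunyangdugi"):
    print(name, geumchik_break(sample, 5, {",", "."}, name))
```

MilEoNuGi and GeuNyangDuGi split the text at the same positions. They differ only in whether the extra character is absorbed by compressing the line or by extending the margin, which is why the sketch reports a treatment label alongside each line.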

Dividing Lines of Text in Thai

Thai editions of Windows come with a fairly sophisticated line-breaking algorithm. If you are writing a Thai-language application, take advantage of what the system provides rather than trying to come up with your own line-breaking code. To give you an idea of what would be involved, try to decipher the following line:

Imaginethatthisisastringtobewordwrappedtheonlywaytodosoinenglishwouldbetoidentifytheindividualwordsandthendeterminethebestplacetobreaktheline

Translation: Imagine that this is a string to be word-wrapped. The only way to do so in English would be to identify the individual words and then determine the best place to break the line.

The line-breaking algorithm provided by the system solves these problems for you.
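
To get a feel for the work involved, here is a toy Python sketch that finds word boundaries in a run-together English string by dictionary lookup with backtracking. It is not how the system's Thai algorithm works, which the text does not describe. The function name `segment` and the tiny vocabulary are invented for this example, and a real segmenter would need a full dictionary and a way to rank ambiguous readings.

```python
from functools import lru_cache


def segment(text, vocabulary):
    """Split text into vocabulary words, trying longer words first.

    Returns a list of words, or None if no complete segmentation exists.
    """
    max_len = max(len(word) for word in vocabulary)

    @lru_cache(maxsize=None)
    def solve(i):
        if i == len(text):
            return ()
        # Try the longest candidate first and back off to shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            word = text[i:j]
            if word in vocabulary:
                rest = solve(j)
                if rest is not None:
                    return (word,) + rest
        return None

    result = solve(0)
    return None if result is None else list(result)


VOCABULARY = {"imagine", "that", "this", "is", "a", "as", "at", "his", "string"}
print(segment("imaginethatthisisastring", VOCABULARY))
```

Even in this small case the segmenter has to reject "as" after "is" and back up to "a", because "tring" is not a word. Ambiguities of this kind are what make a hand-written breaker hard to get right.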

Summary

  • To create Chinese-language, Japanese-language, or Korean-language applications for Windows NT or Windows 95, you need the appropriate Far East Win32 Software Development Kit (SDK) and a compiler that understands Unicode or double-byte character sets.
  • As a first step toward creating a Far East–edition code base, enable your code to handle double-byte character sets or Unicode, following the guidelines presented in Chapter 3.
  • The Chinese, Japanese, and Korean writing systems contain thousands of ideographic characters. Therefore, entering characters efficiently on computers requires Input Method Editors (IMEs), which are software modules that map multiple keystrokes into single ideographs. Different text input methods are popular for each language.
  • To support IMEs on Windows NT 3.5, your application needs to parse the WM_IME_REPORT message and its various wParam values. IME support on Windows NT 3.5 differs slightly from one language to another.
  • The IME model for Windows has been revised for Windows 95 and Windows NT 3.51. It includes a single IME API for all Far East editions of the operating systems. Applications following this model can be IME-unaware, partially IME-aware, or fully IME-aware.
  • Applications can customize IME support on Windows 95 by controlling the appearance of the IME windows.
  • Win32-based applications can display text vertically using fonts whose typeface names begin with the at (@) character.
  • As long as your application relies on the Win32 API, you do not have to write special code to handle hardware differences found in the Japanese PC market.
  • Windows 95 supports Windows Intelligent Font Emulator (WIFE) fonts for compatibility reasons, but new applications should use TrueType fonts instead. With TrueType fonts, the user can define characters not supported by the system's character encoding or standard fonts.
  • Far East editions of Windows support additional IME-related functions that other editions of Windows do not. It is possible, however, to display Far Eastern characters on any edition of Windows NT and to create a single binary that will run on Far East editions and other editions of Windows.
  • Far East editions of Windows support different methods for sorting ideographic characters. Characters can be sorted in stroke order, phonetically, or by code-point value, depending on the locale.
  • Chinese, Japanese, Korean, and Thai written text follows special rules for breaking lines.
