Unicode Support for Surrogate Pairs and Combining Character Sequences

Article
11/16/2012

The Unicode Standard defines a surrogate pair as a coded character representation for a single abstract character that consists of a sequence of two code units. The first value of the surrogate pair is the high surrogate, a 16-bit code value in the range of U+D800 through U+DBFF. The second value of the pair is the low surrogate, in the range of U+DC00 through U+DFFF.
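The ranges above imply simple arithmetic for converting between a supplementary code point and its surrogate pair. The following is a minimal sketch in Python, not .NET code; the function names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical helpers introduced here for illustration.

```python
from typing import Tuple

# Surrogate ranges as stated in the text above.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000          # 20-bit value
    high = HIGH_START + (offset >> 10)     # top 10 bits
    low = LOW_START + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into one code point."""
    if not HIGH_START <= high <= HIGH_END:
        raise ValueError("first value is not a high surrogate")
    if not LOW_START <= low <= LOW_END:
        raise ValueError("second value is not a low surrogate")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


# U+1D11E (MUSICAL SYMBOL G CLEF) is encoded as D834 DD1E.
high, low = to_surrogate_pair(0x1D11E)
print(f"{high:04X} {low:04X}")  # prints "D834 DD1E"
```

Each surrogate carries 10 bits of the 20-bit offset, which is why the two ranges each span 1,024 values.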

The Unicode Standard defines a combining character sequence as a combination of a base character and one or more combining characters. A surrogate pair can represent a base character or a combining character. For more information on surrogate pairs and combining character sequences, see The Unicode Standard at the Unicode home page.
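As a rough illustration of these definitions, the Python sketch below (not .NET code) treats any character whose Unicode general category starts with "M" as a combining character and everything else as a potential base character. That is a simplification of the standard's definitions, and the helper names are hypothetical.

```python
import unicodedata


def is_combining_mark(ch: str) -> bool:
    """True if the character's general category is Mn, Mc, or Me."""
    return unicodedata.category(ch).startswith("M")


def is_combining_sequence(text: str) -> bool:
    """Simplified check: one base character followed by one or more marks."""
    if len(text) < 2:
        return False
    base, marks = text[0], text[1:]
    return not is_combining_mark(base) and all(is_combining_mark(m) for m in marks)


def needs_surrogate_pair(ch: str) -> bool:
    """True if the character lies above U+FFFF, so UTF-16 needs two code units."""
    return ord(ch) > 0xFFFF


# "e" followed by U+0301 COMBINING ACUTE ACCENT forms a combining character sequence.
print(is_combining_sequence("e\u0301"))  # prints "True"

# U+1D11E (MUSICAL SYMBOL G CLEF): a base character that needs a surrogate pair.
# U+1D165 (MUSICAL SYMBOL COMBINING STEM): a combining character that needs one.
for ch in ("\U0001D11E", "\U0001D165"):
    print(f"U+{ord(ch):X}", needs_surrogate_pair(ch), is_combining_mark(ch))
```

The last two characters show both cases named above: a surrogate pair standing for a base character and a surrogate pair standing for a combining character.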

The key point to remember is that a surrogate pair represents a single character by using two 16-bit code units, 32 bits in total. You cannot assume that one 16-bit Unicode encoding value maps to exactly one character. By using surrogate pairs, a 16-bit Unicode encoded system can address an additional 1,048,576 code points (U+10000 through U+10FFFF), roughly one million, to which the Unicode Standard can assign characters.
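To make the mismatch between 16-bit values and characters concrete, the following Python sketch (not .NET code) encodes a string as UTF-16 and compares the number of 16-bit code units with the number of code points. The helper `utf16_code_units` is a hypothetical name used only for this example.

```python
import struct
from typing import List


def utf16_code_units(text: str) -> List[int]:
    """Return the 16-bit code units of the UTF-16 encoding of text."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return list(struct.unpack(f">{len(data) // 2}H", data))


sample = "a\U0001D11E"  # "a" followed by U+1D11E (MUSICAL SYMBOL G CLEF)
units = utf16_code_units(sample)

print(len(sample))                    # code points: prints "2"
print(len(units))                     # 16-bit code units: prints "3"
print([f"{u:04X}" for u in units])    # prints "['0061', 'D834', 'DD1E']"

# Code points reachable only through surrogate pairs: U+10000 through U+10FFFF.
print(0x10FFFF - 0x10000 + 1)         # prints "1048576"
```

Code that indexes a UTF-16 string by code unit would see three positions here, two of which belong to the same character.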

The .NET Framework supports text elements. A text element is a unit of text that is displayed as a single character, called a grapheme. A text element can be a base character, a surrogate pair, or a combining character sequence. The StringInfo class provides methods that allow your application to split a string into its text elements and iterate through the text elements. For an example of using the StringInfo class, see String Indexing.
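The idea of splitting a string into text elements can be sketched outside .NET as well. The Python example below is a simplified approximation, not the StringInfo implementation: it attaches each combining mark (general category starting with "M") to the preceding character and does not apply the full Unicode grapheme cluster rules. The function name `text_elements` is hypothetical. Python strings are indexed by code point, so a character that would be a surrogate pair in UTF-16 already appears as a single item.

```python
import unicodedata
from typing import List


def text_elements(text: str) -> List[str]:
    """Group each base character with the combining marks that follow it.

    Simplified sketch: a mark with no preceding character starts its own element.
    """
    elements: List[str] = []
    for ch in text:
        if elements and unicodedata.category(ch).startswith("M"):
            elements[-1] += ch      # extend the current text element
        else:
            elements.append(ch)     # start a new text element
    return elements


# "e" + combining acute accent, then U+1D11E, then "a".
sample = "e\u0301\U0001D11Ea"
elements = text_elements(sample)

print(len(sample))    # code points: prints "4"
print(len(elements))  # text elements: prints "3"
```

The three elements correspond to the three cases described above: a combining character sequence, a character that UTF-16 stores as a surrogate pair, and a plain base character.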

