Unicode Support for Surrogate Pairs and Combining Character Sequences

The Unicode Standard defines a surrogate pair as a coded character representation for a single abstract character that consists of a sequence of two 16-bit code units. The first code unit of the pair is the high surrogate, with a value in the range U+D800 through U+DBFF. The second code unit is the low surrogate, with a value in the range U+DC00 through U+DFFF.
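The mapping between a code point and its surrogate pair is plain arithmetic defined by UTF-16. The following is a minimal, language-neutral sketch in Python rather than .NET code; the function names to_surrogate_pair and from_surrogate_pair are hypothetical and are not part of any .NET API.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as a surrogate pair")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> U+DC00..U+DFFF
    return high, low


def from_surrogate_pair(high, low):
    """Combine a high and a low surrogate back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the 16-bit range.
high, low = to_surrogate_pair(0x1D11E)
print(f"U+{high:04X} U+{low:04X}")      # prints "U+D834 U+DD1E"
```

Each surrogate carries 10 bits of the 20-bit offset, which is why the high and low ranges each span 1,024 values.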

The Unicode Standard defines a combining character sequence as a combination of a base character and one or more combining characters. A surrogate pair can represent a base character or a combining character. For more information on surrogate pairs and combining character sequences, see The Unicode Standard at www.unicode.org.
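As a rough illustration, the sketch below checks whether a string is a single combining character sequence. It is written in Python, not .NET, and rests on a simplifying assumption: a combining character is taken to be any character whose Unicode general category is a mark (Mn, Mc, or Me). The helper name is_combining_sequence is hypothetical.

```python
import unicodedata


def is_mark(ch):
    """True if ch has a Unicode general category of Mn, Mc, or Me."""
    return unicodedata.category(ch).startswith("M")


def is_combining_sequence(text):
    """True if text is one base character followed by one or more marks."""
    if len(text) < 2:
        return False
    base, marks = text[0], text[1:]
    return not is_mark(base) and all(is_mark(m) for m in marks)


# "e" followed by U+0301 COMBINING ACUTE ACCENT
print(is_combining_sequence("e\u0301"))

# Base and combining character that both need a surrogate pair in UTF-16:
# U+1D158 MUSICAL SYMBOL NOTEHEAD BLACK + U+1D165 MUSICAL SYMBOL COMBINING STEM
print(is_combining_sequence("\U0001D158\U0001D165"))

# Two base characters are not a combining character sequence.
print(is_combining_sequence("ab"))
```

Python strings are indexed by code point, so the supplementary characters in the second example each count as one item here, whereas in a 16-bit encoding each would occupy a surrogate pair.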

The key point to remember is that a surrogate pair represents a single character using two 16-bit code units (32 bits in total), so you cannot assume that one 16-bit Unicode code unit maps to exactly one character. By using surrogate pairs, a 16-bit Unicode encoded system can address an additional 1,048,576 code points (U+10000 through U+10FFFF), to which the Unicode Standard assigns characters.
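The difference between counting 16-bit code units and counting characters can be seen in a short sketch. It is written in Python, whose strings count code points, so the UTF-16 length is obtained by encoding the string; the helper name utf16_code_units is hypothetical.

```python
def utf16_code_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


sample = "a\U0001D11Eb"                 # "a", MUSICAL SYMBOL G CLEF, "b"

code_points = len(sample)               # 3 characters
code_units = utf16_code_units(sample)   # 4 code units: the clef needs a pair

# 1,024 high surrogates x 1,024 low surrogates
supplementary_code_points = 0x400 * 0x400

print(code_points, code_units, supplementary_code_points)
# prints "3 4 1048576"
```

Code that indexes a UTF-16 string by code unit would see four items in this sample, two of which are only meaningful together.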

The .NET Framework supports text elements. A text element, also called a grapheme, is a unit of text that is displayed as a single character. A text element can be a base character, a surrogate pair, or a combining character sequence. The StringInfo class provides methods that allow you to split a string into its text elements and iterate through them. For example, the StringInfo.GetNextTextElement method allows you to retrieve a surrogate pair as one text element. For an example of using the StringInfo class, see String Indexing.
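The idea of splitting a string into text elements can be sketched as follows. This is an illustrative approximation in Python, not the StringInfo implementation: it assumes a text element is a character plus any immediately following marks (general category Mn, Mc, or Me), and it relies on Python strings already treating a supplementary character as one item. The function name text_elements is hypothetical.

```python
import unicodedata


def text_elements(text):
    """Group each character with the combining marks that follow it."""
    elements = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and elements:
            elements[-1] += ch          # attach the mark to the previous element
        else:
            elements.append(ch)         # start a new text element
    return elements


# "e" + COMBINING ACUTE ACCENT, MUSICAL SYMBOL G CLEF, "z"
sample = "e\u0301\U0001D11Ez"
parts = text_elements(sample)

print(len(sample))   # prints 4 (code points)
print(len(parts))    # prints 3 (text elements)
```

In UTF-16 the same sample occupies five code units, so the counts of code units, code points, and text elements all differ.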

See Also

Developing World-Ready Applications | Unicode in the .NET Framework | System.Text Namespace | String Indexing