String Indexing

The System.Globalization.StringInfo class provides methods that allow you to split a string into text elements and iterate through those text elements. A text element is a unit of text that is displayed as a single character, called a grapheme. A text element can be a base character, a surrogate pair, or a combining character sequence. For more information on surrogate pairs and combining character sequences, see Unicode Support for Surrogate Pairs and Combining Character Sequences.
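The difference between UTF-16 code units and text elements can be seen by comparing String.Length with StringInfo.LengthInTextElements. The following is a minimal sketch, not part of the original sample; the class name TextElementCount, the helper CountTextElements, and the sample string are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Globalization;

// Hypothetical illustration class; not part of the original sample.
public static class TextElementCount
{
    // Returns the number of text elements (not UTF-16 code units) in a string.
    public static int CountTextElements(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    public static void Main()
    {
        // 'a' followed by a combining acute accent (U+0301), then
        // U+1F600 encoded as the surrogate pair U+D83D U+DE00.
        string sample = "a\u0301\uD83D\uDE00";

        // The string holds four code units but only two text elements.
        Console.WriteLine("Length (code units): {0}", sample.Length);
        Console.WriteLine("Text elements:       {0}", CountTextElements(sample));
    }
}
```

Indexing this string by char would split both the combining character sequence and the surrogate pair, which is why the StringInfo methods described next operate on text elements instead.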

Use the StringInfo.GetTextElementEnumerator method to create an enumerator that can iterate through the elements of a string. Use the StringInfo.ParseCombiningCharacters method to return the indexes of each base character, high surrogate, or control character in a specified string.

The following code example first creates a string of Arabic characters that contains combining character sequences. In strCombining, for example, the Unicode code U+0625 represents an Arabic base character (Arabic letter Alef with Hamza below), and the Unicode code U+0650 represents an Arabic combining character (Arabic Kasra). Together, these codes represent a combining character sequence and therefore must be parsed as a single text element. Next, the example creates a string that contains surrogate pairs. In strSurrogates, for example, the Unicode code U+DACE represents a high surrogate and the Unicode code U+DEFF represents a low surrogate. Together, these codes represent a surrogate pair and must be parsed as a single text element. Each string is parsed once using the ParseCombiningCharacters method and again using the GetTextElementEnumerator method. Both methods parse the text elements in strCombining at the indexes 0, 2, 3, 5, and 6. Both methods parse the text elements in strSurrogates at the indexes 0, 2, 4, and 5; index 6 holds the low surrogate of the final pair, so it does not start a text element. The results of the parsing operations are displayed.

using System;
using System.IO;
using System.Globalization;
using System.Text;

public class StringInfoSample
{
   public static void Main()
   {
      // Creates a string with text elements at <0;2;3;5;6>.
      // The Unicode code points specify Arabic
      // combining character sequences.
      // NOTE: The original literal is missing from this copy of the page.
      // The value below is an assumed reconstruction that begins with
      // U+0625 U+0650 and has text elements at the indexes stated above.
      string strCombining =
            "\u0625\u0650\u064A\u0647\u064E\u0627\u064A";
      // Creates a string with text elements at <0;2;4;5>.
      // The UTF-16 code units form surrogate pairs.
      string strSurrogates = "\uDACE\uDEFF\uDAAF\uDEFCa\uD8BF\uDD99";

      EnumerateTextElements(strCombining);
      EnumerateTextElements(strSurrogates);
   }

   public static void EnumerateTextElements(string str)
   {
      // Declares the index array and the TextElementEnumerator.
      int[] TEIndices = null;
      TextElementEnumerator TEEnum = null;

      // Parses the string using the ParseCombiningCharacters() method.
      Console.WriteLine
         ("\r\nParsing '{0}' Using ParseCombiningCharacters()...",str);
      int i;
      TEIndices = StringInfo.ParseCombiningCharacters(str);
      // Prints every element except the last; each one ends just before
      // the start index of the next element.
      for (i = 0; i < (TEIndices.Length - 1); i++)
      {
         Console.WriteLine
            ("Text Element {0} ({1}..{2})= {3}",i,TEIndices[i],
            TEIndices[i+1] - 1,
            str.Substring(TEIndices[i],TEIndices[i+1] - TEIndices[i]));
      }
      // Prints the last element, which runs to the end of the string.
      Console.WriteLine
         ("Text Element {0} ({1}..{2})= {3}",i,TEIndices[i],
         str.Length - 1, str.Substring(TEIndices[i]));

      // Parses the string using the GetTextElementEnumerator method.
      Console.WriteLine
         ("\r\nParsing '{0}' Using TextElementEnumerator...",str);
      TEEnum = StringInfo.GetTextElementEnumerator(str);

      bool Continue = false;
      int TECount = -1;

      // Note: Begins at element -1 (none).
      Continue = TEEnum.MoveNext();
      while (Continue)
      {
         // Prints the current element.
         // Both GetTextElement() and Current retrieve the current
         // text element. The latter returns it as an Object.
         TECount++;
         Console.WriteLine("Text Element {0} ({1}..{2})= {3}",
               TECount,TEEnum.ElementIndex,
               TEEnum.ElementIndex + TEEnum.GetTextElement().Length - 1,
               TEEnum.Current);

         // Moves to the next element.
         Continue = TEEnum.MoveNext();
      }
   }
}
If you execute this code in a console application, the specified Unicode text elements will not be displayed correctly, because the console environment does not support all Unicode characters.
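One way to inspect the results regardless of what the console can render is to print each text element as a list of hexadecimal code units instead of as glyphs. The following is a sketch under that assumption; the class name TextElementDump and the helper ToCodeUnits are hypothetical and do not appear in the original sample.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical illustration class; not part of the original sample.
public static class TextElementDump
{
    // Formats every UTF-16 code unit of a text element as U+XXXX,
    // separated by single spaces.
    public static string ToCodeUnits(string element)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in element)
        {
            if (sb.Length > 0)
            {
                sb.Append(' ');
            }
            sb.AppendFormat("U+{0:X4}", (int)c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Same surrogate-pair string as strSurrogates in the sample above.
        string str = "\uDACE\uDEFF\uDAAF\uDEFCa\uD8BF\uDD99";

        // Prints the start index of each text element and its code units.
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(str);
        while (e.MoveNext())
        {
            Console.WriteLine("{0}: {1}",
                e.ElementIndex, ToCodeUnits(e.GetTextElement()));
        }
    }
}
```

Because the output contains only ASCII characters, it reads the same in any console, and a surrogate pair or combining character sequence shows up as one line with more than one code unit.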