Normalization and Sorting

Some Unicode characters have multiple equivalent binary representations consisting of sets of combining and/or composite Unicode characters. Consequently, two strings can look identical but actually consist of different characters. The existence of multiple representations for a single character complicates sorting operations. The solution to this problem is to normalize each string, then use an ordinal comparison to sort the strings.

The Unicode Standard defines a process called normalization that returns one binary representation of a character when given any of the equivalent binary representations of that character. The Unicode Standard defines four different algorithms, called normalization forms, that you can use to normalize a string. Each normalization form obeys different rules, and therefore yields a different binary representation of an input string. However, if two strings are normalized to the same normalization form, the strings can subsequently be compared using an ordinal (case-insensitive) comparison.

An ordinal comparison of two strings is a binary comparison of the numeric value, or code point, of each corresponding pair of Char structures in the two String objects that represent the normalized strings. The .NET Framework provides several methods that can perform an ordinal comparison.

Your application can use the following process to normalize and sort strings:

  1. Obtain two strings to be sorted from an input source such as a file or user input.

  2. Use the String.Normalize() method to normalize both strings to normalization form C, or the String.Normalize(NormalizationForm) method to normalize both strings to a normalization form of your choosing.

  3. Use an ordinal string comparison, such as the Compare(String, Int32, String, Int32, Int32, StringComparison) method with an Ordinal or OrdinalIgnoreCase value, to compare the two strings. The comparison operation determines whether the first string lexically precedes the second or the two strings are lexically equal.

  4. Emit the strings in the sorted output based on the results of step 3. If the strings are not equal, emit the strings in the order that achieves ascending or descending order.

    If the strings are equal, either string can be emitted first unless it is appropriate to arrange the strings based on some characteristic other than lexical order. For example, if the application is sorting file names but also writing the properties of each file to the output, it might write equal file names in the order of their file creation dates.

  5. Repeat this process until all the input is sorted.

For more information about normalization form support in the .NET Framework, see the description of the NormalizationForm enumeration. For more information about normalizing a string, see the Normalize method.

For more information about normalization, character decompositions, and equivalence, see Unicode Standard Annex #15, "Unicode Normalization Forms," at the Unicode home page.

See Also

Concepts

Sorting with Cultures

Comparing and Sorting Data for a Specific Culture

Other Resources

Encoding and Localization