Normalization and SortingĀ 

Some Unicode characters have multiple equivalent binary representations consisting of sets of combining and/or composite Unicode characters. Consequently, two strings can look identical but actually consist of different characters. The existence of multiple representations for a single character complicates sorting operations. The solution to this problem is to normalize each string, then use an ordinal comparison to sort the strings.

The Unicode standard defines a process called normalization that returns one binary representation of a character when given any of the equivalent binary representations of that character. The Unicode standard defines four different algorithms, called normalization forms, that you can use to normalize a string. Each normalization form obeys different rules, and therefore yields a different binary representation of an input string. However, if two strings are normalized to the same normalization form, the strings can subsequently be compared using an ordinal comparison.

An ordinal comparison of two strings is a binary comparison of the numeric value, or code point, of each corresponding pair of Char structures in the two String objects that represent the normalized strings. The .NET Framework provides several methods that can perform an ordinal comparison.

Use the following process to normalize and sort strings:

  1. Obtain two strings to be sorted from an input source such as a file or user input.

  2. Use the System.String.Normalize method to normalize both strings to normalization form C, or the System.String.Normalize(System.Text.NormalizationForm) method to normalize both strings to a normalization form of your choosing.

  3. Use an ordinal string comparison, such as the Compare method with an Ordinal or OrdinalIgnoreCase argument, to compare the two strings. The comparison operation determines whether the first string lexically precedes the second or if the two strings are lexically equal.

  4. Emit the strings in the sorted output of your application based on the results of step 3. If the strings are not equal, emit the strings in the order that achieves ascending or descending order.

    If the strings are equal, either string can be emitted first unless it is appropriate in your application to arrange the strings based on some other characteristic than lexical order. For example, if you are sorting file names but you are also writing the properties of each file to the output, you might write equal file names in the order of their file creation dates.

  5. Repeat this process until all the input is sorted.

For more information about normalization form support in the .NET Framework, see the NormalizationForm enumeration. For more information about normalizing a string, see the Normalize method.

For more information about normalization, character decompositions, and equivalence, see Unicode Standard Annex #15, "Unicode Normalization Forms," at the Unicode home site.

See Also

Reference

IsNormalized

Concepts

Normalization
Sorting with Cultures

Other Resources

Encoding and Localization