Supplementary Characters

The data types nchar and nvarchar store each character as a 16-bit value in an encoding called UCS-2. This encoding, defined by versions of Unicode prior to 1996, supports characters in the range U+0000 to U+FFFF. Later versions of Unicode define additional characters in the range U+10000 to U+10FFFF, called supplementary characters. Each supplementary character is stored as a pair of 16-bit values, called a surrogate pair, in an encoding called UTF-16. All new _100 level collations support linguistic sorting with supplementary characters.
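
The mapping between a supplementary character and its surrogate pair is simple arithmetic defined by UTF-16. The following is a minimal sketch in Python, not SQL Server code; the function names to_surrogate_pair and from_surrogate_pair are hypothetical and chosen only for illustration.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into
    its UTF-16 high and low surrogate values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary character")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+20000 is stored as the two 16-bit values 0xD840 and 0xDC00.
pair = to_surrogate_pair(0x20000)
```

For example, to_surrogate_pair(0x20000) yields 0xD840 and 0xDC00, which is why a single supplementary character occupies two 16-bit positions in an nchar or nvarchar value.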

If you use supplementary characters, consider the following limitations:

  • Ordering and comparison operations support supplementary characters only in collation versions 90 or greater.

  • Because each supplementary character is stored as two 16-bit values, the LEN() function returns the value 2 for each supplementary character that is contained in the argument string. Similarly, the functions CHARINDEX and PATINDEX count positions in 16-bit values rather than in characters, so they misrepresent the position of anything that follows a supplementary character in a character string.

  • The LEFT, RIGHT, SUBSTRING, STUFF, and REVERSE functions can split a surrogate pair, which leads to unexpected results.

  • Supplementary characters are not supported for use with the underscore (_), percent (%), and caret (^) wildcard characters.

  • Supplementary characters are not supported for use in metadata, such as in names of database objects.
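
The length, position, and splitting limitations listed above all follow from counting 16-bit values instead of characters. The following Python sketch models that behavior outside SQL Server; it is not how SQL Server implements these functions, and the names utf16_units, left_units, and left_safe are hypothetical helpers introduced for illustration.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


def left_units(units, n):
    """Naive LEFT-style cut: the first n code units, even if that
    separates a high surrogate from its low surrogate."""
    return units[:n]


def left_safe(units, n):
    """Like left_units, but drops a trailing high surrogate whose
    low surrogate would be cut off."""
    cut = units[:n]
    if (cut and is_high_surrogate(cut[-1])
            and len(units) > n and is_low_surrogate(units[n])):
        cut = cut[:-1]
    return cut


sample = "a\U00020000b"              # 'a', U+20000, 'b': three characters
units = utf16_units(sample)          # four 16-bit values
unit_position_of_b = units.index(ord("b")) + 1   # 1-based, in code units
char_position_of_b = sample.index("b") + 1       # 1-based, in characters
```

Here the three-character string occupies four 16-bit values, the letter b sits at position 4 when positions are counted in 16-bit values but at position 3 when counted in characters, and left_units(units, 2) ends on an unpaired high surrogate, while left_safe(units, 2) returns only the first value.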

For a Transact-SQL script related to this scenario, see the Supplementary-Aware String Manipulation sample. For information about samples, see Considerations for Installing SQL Server Samples and Sample Databases.