Printer Friendly Version      Send     
Click to Rate and Give Feedback
MSDN
MSDN Library
SQL Server
SQL Server 2005
Technical Articles
SQL Server 2005
 International Features in Microsoft...
Microsoft SQL Server 9.0 Technical Articles
International Features in Microsoft SQL Server 2005
 

SQL Server User Education
Microsoft Corporation

February 2007
Revised April 2007

Applies to:
   Microsoft SQL Server 2005
   Unicode

Summary: This white paper introduces Microsoft SQL Server developers to the international features of Microsoft SQL Server 2005. Topics covered include an explanation of Unicode, added support for supplementary characters in SQL Server 2005, the changes in collation in different versions of SQL Server, changes in data types, performance, updates on data providers, and new international support features in SQL Server 2005 Analysis Services and Integration Services. (59 printed pages)

Click for the Word version of this document.

Contents

Introduction
Unicode Support in SQL Server 2005
Data Types in SQL Server 2005
Performance and Storage Space
Migration of Metadata Information in System Tables
Collations
Server-Client Communication (Data-Access Technologies)
Multilingual Data in the User Interface
Multilingual Transact-SQL
Locale Support in SQL Server 2005
SQL Server 2005 Integration Services
Using Command-Line Utilities with Multilingual Data
International Features in SQL Server Analysis Services
XML Support in SQL Server 2005
Enhancements to Full-Text Search
Interacting with Other Database Products
Conclusion
Acknowledgments
Appendix A: Collation Suffixes
Appendix B: Locales Supported in Windows
Appendix C: SQL-Specific Sort Orders
Appendix D: Languages Supported in SQL Server 2005
Appendix E: Full-Text Languages Added in SQL Server 2005
Appendix F: Collations Updated in SQL Server 2005

Introduction

Microsoft SQL Server 2005 builds on the Unicode and XML support introduced in SQL Server 2000, and adds a powerful new set of development and query tools with SQL Server Management Studio and Business Intelligence Development Studio. Robust multilingual features make SQL Server 2005 a compelling database product and applications platform for the support of international operations and environments.

This white paper provides an overview of these features in a global context. It lists features related to international and multilingual requirements, and explains how design decisions can affect many aspects of a project.

Note   This paper uses the following international fonts to provide examples of some international characters: Arial Unicode MS, Latha, Mangal, PmingLiu, Gulim, SimSun, and MS-Mincho. Not having these fonts installed will not seriously affect the usability of this paper.

Unicode Support in SQL Server 2005

Unicode support is the foundation for multilingual support in SQL Server 2005.

Unicode is a standard that was created by the Unicode Consortium, an organization that promotes a single character set for all languages. SQL Server 2005 supports the Unicode Standard, Version 3.2. Version 3.01 of the Unicode Standard is identical to ISO-10646, an international standard that matches all of the code points in Unicode.

Unicode works by supplying a unique code point for every character, regardless of platform, program, or language. A program that supports Unicode can handle data in any language. Because it is designed to cover all the characters of all the languages of the world, there is no need for different code pages to handle different sets of characters.

Because all Unicode systems consistently use the same bit patterns to represent all characters, there is no problem with characters being converted incorrectly when moving from one system to another.

The easiest way to manage character data in international databases is always to use the Unicode nchar, nvarchar, and nvarchar(max) data types, instead of their non-Unicode equivalents: char, varchar, and text. That way, clients will see the same characters in the data as all other clients. If all the applications that work with international databases also use Unicode variables instead of non-Unicode variables, character translations do not have to be performed anywhere in the system.

Note   The ntext data type will be removed in a future version of Microsoft SQL Server.

Unicode code points and the characters they represent are separate from the glyph that is used for visual rendering. A glyph is defined by the ISO standard (ISO/IEC 9541-1) as "a recognizable abstract graphic symbol which is independent of a specific design." Therefore, a character is not necessarily always represented by the same glyph, or even a unique glyph. The typeface that you choose determines what glyph will be used to represent a particular code point or series of code points.

For more information, see the Unicode Consortium Web site.

Encodings

Unicode maps code points to characters, but does not actually specify how the data will be represented in memory, in a database, or on a Web page. This is where the encoding of Unicode data comes into play. There are many different encodings for Unicode. Most of the time you can simply choose a Unicode data type and not worry about these details; however, it is important to understand the encoding when you are in the following situations:

  • When dealing with an application that may encode Unicode differently
  • When sending data to other platforms (non-Microsoft Windows) or Web servers
  • When importing data from or exporting data to other encodings

The Unicode standard defines multiple encodings of its single character set: UTF-7, UTF-8, UTF-16, and UTF-32. This section describes these common encodings:

  • UCS-2
  • UTF-16
  • UTF-8

Generally, SQL Server stores Unicode in the UCS-2 encoding scheme. However, many clients process Unicode in another encoding scheme, such as UTF-8. This scenario frequently occurs in Web-based applications. In Microsoft Visual Basic applications, character strings are processed in the UCS-2 encoding scheme. Therefore, you do not have to explicitly specify encoding scheme conversions between Visual Basic applications and an instance of SQL Server.

SQL Server 2005 encodes XML data using Unicode (UTF-16). Data in a column of type xml is stored in an internal format as large binary objects (BLOBs) in order to support XML model characteristics, such as document order and recursive structures. Therefore, XML data retrieved from the server comes out in UTF-16; if you want a different encoding for data retrieval, your application must perform the necessary conversion on the retrieved UTF-16 data. XML Best Practices in SQL Server 2005 Books Online provides an example of how to explicitly declare an encoding for XML data retrieved from a varchar(max) column.

The UTF-16 encoding is used because it can handle 2-byte or 4-byte characters, and is processed according to a byte-oriented protocol. These qualities make UTF-16 well-suited for traversing different computers that use different encodings and byte-ordering systems. Because XML data is typically shared widely across networks, it makes sense to maintain the default UTF-16 storage of XML data in your database, and when you export XML data to clients.

UCS-2

UCS-2 is a predecessor of UTF-16. UCS-2 differs from UTF-16 in that UCS-2 is a fixed-length encoding that represents all characters as a 16-bit value (2 bytes), and therefore does not support supplementary characters. UCS-2 is frequently confused with UTF-16, which is used to internally represent text in the Microsoft Windows operating systems (Windows NT, Windows 2000, Windows XP, and Windows CE), but UCS-2 is more limited.

Note   For up-to-date information about the use of Unicode in the Windows operating system, see Unicode in the Microsoft Developer Network (MSDN) Library. It is recommended that a Windows application use UTF-16 internally and convert as part of a "thin layer" over the interface only if another format must be used.

Information that is stored in Unicode in Microsoft SQL Server 2000 and Microsoft SQL Server 2005 uses the UCS-2 encoding, which stores every character as two bytes, regardless of which character is used. Therefore, the Latin letter "A" is treated the same way as the Cyrillic letter Sha (), the Hebrew letter Lamed (), the Tamil letter Rra (), or the Japanese Hiragana letter E (). Each one has a unique code point (for these letters, the code points are U+0041, U+0248, U+05DC, U+0BB1, and U+3048, respectively, where each four-digit hexadecimal number represents the two bytes that UCS-2 uses).

Because UCS-2 allows for encoding of only 65,536 different code points, it does not natively handle supplementary characters, and instead treats supplementary characters as a pair of undefined Unicode surrogate characters that, when paired together, define a supplementary character. However, SQL Server can store supplementary characters without risk of loss or corruption. You can extend the capabilities of SQL Server to work with surrogate pairs by creating custom CLR functions. For more information about working with surrogate pairs and supplementary characters, see the "Supplementary Characters and Surrogate Pairs" section, later in this paper.

Note   Supplementary characters are defined as A Unicode-encoded character having a supplementary code point. Supplementary code points are in the range between U+10000 and U+10FFFF.

UTF-8

UTF-8 is an encoding scheme that is designed to treat Unicode data in a way that is independent of the byte ordering on the computer. UTF-8 is useful for working with ASCII and other byte-oriented systems that require 8-bit encodings, such as mail servers that must span a vast array of computers that use different encodings, different byte orders, and different languages. Although SQL Server 2005 does not store data in UTF-8 format, it supports UTF-8 for handling Extensible Markup Language (XML) data. For more information, see the XML Support in SQL Server 2005 section of this paper.

Other database systems, such as Oracle and Sybase SQL Server, support Unicode by using UTF-8 storage. Depending on a server's implementation, this can be technically easier for a database engine to implement, because existing text management code on the server does not require major changes to deal with data one byte at a time. However, in the Windows environment, UTF-8 storage has several disadvantages:

  • The Component Object Model (COM) supports only UTF-16/UCS-2 in its APIs and interfaces. Therefore, if data is stored in UTF-8 format, constant conversion is required. This issue applies only when COM is used; the SQL Server database engine does not typically call COM interfaces.
  • The Windows XP and Windows Server 2003 kernels are both Unicode. UTF-16 is the standard encoding for Windows 2000, Windows XP, and Windows Server 2003. However, Windows 2000, Windows XP, and Windows Server 2003 are UTF-8 aware. Therefore, use of a UTF-8 storage format in the database requires many extra conversions. Typically, the extra resources needed for conversion do not affect the SQL Server database engine, but could potentially affect many client-side operations.
  • UTF-8 can be slower for many string operations. Sorting, comparing, and virtually any string operation can be slowed because characters do not have a fixed width.
  • UTF-8 often needs more than 2 bytes, and the increased size can make for a larger footprint on disk and in memory.

Despite these drawbacks, given the fact that XML has become an important standard for communication over the Internet, you may want to consider defaulting to UTF-8.

Note   Older versions of Oracle and Java also use UCS-2, and cannot recognize surrogates. Oracle Corporation started supporting Unicode as a database character set in Oracle 7. Oracle currently supports two methods for Unicode data storage: (1) UTF-8 as the encoding for CHAR and VARCHAR2 character data types, and for all SQL names and literals; (2) UTF-16 for storage of the NCHAR, NVARCHAR, and NCLOB Unicode data types. Oracle allows you to use both methods simultaneously.

UTF-16

UTF-16 is the encoding standard at Microsoft and in the Windows operating system UTF-16 is an extension that was devised to handle an additional 1,048,576 characters. The need for a surrogate range was recognized even before Unicode 2.0 was released, as it became clear that the Unicode goal of having a single code point for every character in every language could not be achieved by using only 65,536 characters. For example, some languages, such as Chinese, require at least that many characters to encode characters such as historical and literary ideographs, which, although rarely used, are nonetheless important for publishing and scholarship. The next section provides more information about the surrogate range.

Like UCS-2, UTF-16 uses a little endian byte order, as does everything on Windows. Little endian, as opposed to big endian, means that the low-order byte is stored at the lowest address in memory. The ordering of bytes is important at the operating-system level. SQL Server, together with other applications that run on the Windows platform, uses the little endian byte order. Therefore, a hexadecimal word such as 0x1234 is stored in memory as 0x34 0x12. In certain cases you may need to explicitly reverse the byte order to correctly read the encoding for a character. SQL Server Integration Services provides functions for converting the byte order of Unicode text. For more information, see the SQL Server 2005 Integration Services section of this paper.

Supplementary Characters and Surrogate Pairs

Microsoft Windows normally uses UTF-16 to represent character data. The use of 16 bits allows representation of 65,536 unique characters. However, even this is not enough to cover all of the symbols used in human languages. In UTF-16, certain code points use another code point right after the first two bytes to define the character as part of the surrogate range.

In the Unicode standard, there are 16 planes of characters, with the potential to define as many as 1,114,112 characters. Plane 0, or the Basic Multilingual Plane (BMP), can represent most of the world's written scripts, characters used in publishing, mathematical and technical symbols, geometric shapes, all level-100 Zapf Dingbats, and punctuation marks.

Outside of the BMP, the characters in most planes are still undefined, but can be used to represent supplementary characters. Supplementary characters are used primarily for historical and classical literary documents to help with the encoding of the rich literary heritage of the Chinese, Korean, and Japanese languages. Supplementary characters also include runes and other historic scripts, musical symbols, and so forth.

In UTF-16, a pair of code points, called a surrogate pair, is used to represent characters outside the main character set (the BMP). The surrogate area is a range in Unicode from U+D800 to U+DFFF that contains 1,024 low surrogate values and 1,024 high surrogate values. A high surrogate (the range U+D800 to U+DBFF) and a low surrogate (the range U+DC00 to U+DFFF) are combined to give access to over a million possible characters.

It is not considered valid to have only one-half of a surrogate pair; to be valid, there must always be a high surrogate followed by a low surrogate. This makes checking for a surrogate a matter of range checking, which is easy compared to the rather complex rules that are required to detect DBCS (double-byte character system) characters.

Both SQL Server 2000 and SQL Server 2005 can store surrogate pairs, even though UCS-2 is not aware of surrogates. SQL Server treats the surrogate pairs as two undefined Unicode characters rather than as a single character. Such applications are usually referred to as surrogate-neutral or surrogate-safe, meaning that there is no intrinsic ability to interact with the data, but at least the data can be stored without loss.

In contrast, "surrogate-aware" applications can not only take surrogate pairs into consideration, but can also process combining characters and other characters that require special handling. A well-written application can detect separated surrogates, and recombine them, with just a few subroutines. Surrogate-aware applications include Microsoft Word and Internet Explorer 5 and later.

When working with supplementary characters in SQL Server, remember the following points:

  • Because surrogate pairs are considered to be two separate Unicode code points, the size of nvarchar(n) needs to be 2 to hold a single supplementary character (in other words, space for a surrogate pair).
  • Supplementary characters are not supported for use in metadata, such as in names of database objects. In general, text used in metadata must meet the rules for identifiers. For more information, see Identifiers in SQL Server 2005 Books Online.
  • Standard string operations are not aware of supplementary characters. Operations such as SUBSTRING(nvarchar(2),1,1) return only the high surrogate of the supplementary character's surrogate pair. The LEN function returns the count of two characters for every supplementary character encountered: one for the high surrogate and one for the low surrogate. However, you can create custom functions that are aware of supplementary characters. The StringManipulate sample in Supplementary-Aware String Manipulation, in SQL Server 2005 Books Online, demonstrates how to create such functions.
  • Sorting and searching behavior for supplementary characters may change depending on the collation. In the new 90_and BIN2 collations, supplementary characters are correctly compared, whereas, in older collations and standard Windows collations, all supplementary characters compare equal to all other supplementary characters. For example, the default Japanese and Korean collations do not handle supplementary characters, whereas Japanese_90 and Korean_90 do.

For more information about Unicode code points, best practices for designing surrogate-aware applications, and working with Unicode data, see Globalization Step-by-Step: Unicode Enabled. For information about character ranges supported in the Unicode standard, see the section on Unicode Regular Expressions in Unicode Technical Standard #18.

Combining Characters

Combining characters are characters that are used together with other characters to modify their appearance or meaning. The combined characters form a single glyph. For example, diacritics used in European languages are combining characters that can either appear as a character plus diacritic, or as a precomposed character.

In the .NET Framework, the sequence of combining characters is treated as a text element—that is, a unit of text that is displayed as a single character. A text element is different from a sort element. For example, in some collations, the letters "CH" are not combining characters; they are two separate text elements, but can be treated as one sort element.

Note   SQL functions, on the other hand, generally treat combining characters the same as supplementary characters: They process them as two separate Unicode code points. For an example of how to create a custom function that more accurately counts and compares supplementary characters, see the StringManipulate sample.

The way that combining characters map to sort elements depends on both the Unicode standard and the collation. Some combined characters are always regarded as equivalent to their variant forms, no matter how many different code points they include (for example, Latin letter a plus a diacritic is treated as equivalent to the precomposed letter including diacritic), whereas in certain collations it is possible to sort or compare strings differently depending on the presence of the diacritic.

Combining characters were originally defined in Unicode 2.0. For more information, see the section of the Unicode 4.0.1 specification that deals with Special Areas and Format Characters. The Unicode Consortium also publishes a FAQ specifically related to combining characters and their processing. For more information about methods for processing combining characters in the .NET Framework, see Normalization in the .NET Framework Developer's Guide.

Support for GB18030

GB18030 (GB18030-2000) is a separate standard mandated by the People's Republic of China (PRC) for encoding Chinese characters. It specifies both an extended code page and a mapping table to Unicode. As of August 1, 2006, support for this character set is officially required for all software products sold in the PRC. GB18030 conformance includes requirements to support some previously unsupported languages—for example, Tibetan, Mongolian, Yi, and Uyghur.

In GB18030, characters can be 1, 2, or 4 bytes. Surrogate pairs are used to enable mapping of the GB18030 4-byte sequences to Unicode.

SQL Server 2005 provides support of GB18030-encoded characters by recognizing them when they enter the server from a client-side application. SQL Server 2005 converts and stores these characters natively as Unicode. After they are stored in the server, they are treated as Unicode characters in any subsequent operations performed on them. GB18030 does not have a system locale; it has only a code page identifier, to allow for conversions to and from Unicode. The Microsoft code page identifier for GB18030-2000 is 54936.

When you use GB18030 characters, remember that these characters can be used in ordering and comparison operations, but in collations older than SQL Server 90, comparisons are only based on their code points and not on other linguistically meaningful ways. Therefore, be careful when you use GB18030 characters in operations such as ORDER BY, GROUP BY, and DISTINCT, especially when GB18030 and non-GB18030 characters are included in the same operation. To enable meaningful string comparisons that use GB18030 characters, use the new SQL Server 90 collation version, signified by the 90 suffix added to its name. For example, instead of the Chinese_PRC collation, use Chinese_PRC_90.

To enable support of the GB18030 standard, you can install a support package, available from the Microsoft Product Help and Support portal, that includes a font file and libraries to support conversion between GB18030 and Unicode. The support package is a single, world–wide binary that works on Windows XP or Windows 2000. However, the system must have the optional East Asian language support installed. In Windows Vista, support for the GB18030 standard is included, including fonts and keyboard layouts for Chinese minority languages such as Tibetan, Mongolian, Yi, and Uighur. These languages use the Chinese (PRC) locale.

Note   Some functions for converting GB18030 bytes to Unicode characters, such as BytesToUnicode, are not supported in Vista. When converting GB18030 bytes to Unicode characters in Vista, use the MultiByteToWideChar function.

For more information about support for this standard in Microsoft products, see the Microsoft Global Development and Computing portal.

Data Types in SQL Server 2005

This section describes the new data types in SQL Server 2005, and explains issues related to the use of SQL Server 2005 data types for storing international data.

Non-Unicode Text Types: char, varchar, text, varchar(max)

When you deal with text data that is stored in the char, varchar, varchar(max), or text data type, the most important limitation to consider is that only information from a single code page can be validated by the system. (You can store data from multiple code pages, but this is not recommended.) The exact code page used to validate and store the data depends on the collation of the column. If a column-level collation has not been defined, the collation of the database is used. To determine the code page that is used for a given column, you can use the COLLATIONPROPERTY function, as shown in the following code examples:

SELECT COLLATIONPROPERTY('Chinese_PRC_Stroke_CI_AI_KS_WS', 'CodePage')
936

SELECT COLLATIONPROPERTY('Latin1_General_CI_AI', 'CodePage')
1252

SELECT COLLATIONPROPERTY('Hindi_CI_AI_WS', 'CodePage')
0

The last example returns 0 (Unicode) as the code page for Hindi. This example illustrates the fact that many locales, such as Georgian and Hindi, do not have code pages, as they are Unicode-only collations. Those collations are not appropriate for columns that use the char, varchar, or text data type, and some collations have been deprecated. For a list of available collations and which collations are Unicode-only, see Collation Settings in Setup in SQL Server 2005 Books Online.

Important   In SQL Server 2005, use the varchar(max) data type instead of the text data type. The text data type will be removed in a future version of Microsoft SQL Server. For more information, see Using Large-Value Data Types in SQL Server 2005 Books Online.

When Unicode data must be inserted into non-Unicode columns, the columns are internally converted from Unicode by using the WideCharToMultiByte API and the code page associated with the collation. If a character cannot be represented on the given code page, the character is replaced by a question mark (?). Therefore, the appearance of random question marks within your data is a good indication that your data has been corrupted due to unspecified conversion. It also is a good indication that your application could benefit from conversion to a Unicode data type.

If you use a string literal of a non-Unicode type that is not supported by the collation, the string is converted first using the database's default code page, which is derived from the default collation of the database.

Another problem you might encounter is the inability to store data when not all of the characters you wish to support are contained in the code page. In many cases, Windows considers a particular code page to be a "best fit" code page, which means there is no guarantee that you can rely on the code page to handle all text; it is merely the best one available. An example of this is the Arabic script: it supports a wide array of languages, including Baluchi, Berber, Farsi, Kashmiri, Kazakh, Kirghiz, Kurdish, Pashto, Sindhi, Uighur, Urdu, and more. All of these languages have additional characters beyond those in the Arabic language as defined in Windows code page 1256. If you attempt to store these extra characters in a non-Unicode column that has the Arabic collation, the characters are converted into question marks.

Unicode Text Types: nchar, nvarchar, nvarchar(max), ntext

The SQL-92 specification defines the data types prefaced with "N" (meaning "national" data types), but does not specifically require that the data types be used for Unicode; the actual definition of these data types is left to the database platform or developer. In SQL Server 2000 and SQL Server 2005, these data types are defined as being equivalent to UCS-2, which is a Unicode encoding. However, when you work with other database servers, it is important to know that the "N" data types do not specifically mean Unicode. The decision to define the "N" data types as Unicode is specific to Microsoft SQL Server.

The nvarchar(max) data type, which is new in SQL Server 2005, holds up to 2 gigabytes (GB) of data and is the preferred alternative to the ntext data type.

Important   In SQL Server 2005, use the nvarchar(max) data type instead of the ntext data type. The ntext data type will be removed in a future version of Microsoft SQL Server. For more information, see Using Large-Value Data Types in SQL Server 2005 Books Online.

For the storage of complex scripts, such as Hindi and Tamil, make sure that the data is in the proper ordering. Many languages, such as Tamil, specify that certain letters be reordered when the text is rendered; therefore, the logical order of text as it is stored in memory can be different from the visual order that will be seen in a user interface. Data should always be stored in the proper logical order for any complex script language, which includes all of the Indic languages, Arabic, Farsi, Hebrew, and many others. The actual rendering of such data is a separate issue (see the Multilingual Data in the User Interface section of this paper).

Although the "N" columns can support data of any language or combination of languages, when you sort the data, you can only choose a single collation. To learn more about how to choose a collation, see the Collations section of this paper. None of the code page limitations that were mentioned previously in this paper apply to Unicode columns.

In SQL Server 2005, you can create additional functions to improve string manipulation and collation behavior with supplementary characters. For example, the StringManipulate sample for Microsoft SQL Server 2005 demonstrates supplementary character-aware string processing. This sample shows how to implement five Transact-SQL string functions that provide the same string manipulation functions as built-in string functions, but with additional supplementary character-aware capability to handle both Unicode and supplementary character strings.

CLR Data Types

Microsoft SQL Server gives you the ability to extend the SQL type system by defining a custom data type for use in SQL Server programming. These user-defined types are suited to the creation of custom date, time, currency, and extended numeric types, or for encoded or encrypted data.

A user-defined type (UDT) can be used to define the type of a column in a table, or a variable or routine parameter in the Transact-SQL language. An instance of a user-defined type can be a column in a table, a variable in a batch, a function or stored procedure, or an argument of a function or stored procedure. A user-defined type is implemented as a managed class in any one of the CLR languages, and then registered with SQL Server. For information about how to implement a user-defined type by using Visual Basic or Microsoft Visual C#, see Coding User-Defined Types in SQL Server 2005 Books Online.

You can use user-defined types to extend the scalar type system of the server, enabling storage of CLR objects in a SQL Server database. UDTs can contain multiple elements and can have special behaviors that you define. This differentiates them from the traditional alias data type, which consists of a single SQL Server system data type. For example, the Currency UDT sample that is provided in SQL Server 2005 Books Online supports handing amounts of money in the monetary system of a particular culture. You must define two fields: a string value for CultureInfo, which specifies the source of the currency (for example, en-us), and a decimal value for CurrencyValue, to represent the amount of money.

Because UDTs are accessed by the system as a whole, their use for complex data types may negatively affect performance. Complex data is generally best modeled using traditional rows and tables. SQL Server 2005 Books Online includes several samples that demonstrate how to create and work with custom user-defined types. The UTF8String sample for SQL Server 2005 demonstrates how to implement a user-defined data type that extends the type system of the database to provide storage for UTF8-encoded values. The new type also implements code to convert Unicode strings to and from UTF8. For details, see UTF8 String User-Defined Data Type (UDT) in SQL Server 2005 Books Online.

For additional examples, see CLR Programmability Samples in SQL Server 2005 Books Online.

xml Data Type

The xml data type allows you to store an XML fragment or document in SQL Server databases. Instances of the xml data type can be columns in a table, functions, or stored procedure arguments, or variables in a function or stored procedure. In addition, the xml data type can be specialized by indicating an associated XML schema that provides both validation constraints and type information for the data for the XML instance.

You perform operations on an instance of an xml data type by using built-in XML query methods. These methods accept queries and data-manipulation statements that are appropriate for XML data. You can then specify queries (XQuery) against the XML that is stored in the xml data-type variable or column, and apply updates (using insert, update, or delete) to the XML instance. You can also use an XSD to create an index for the XML column, which will improve query performance.

For more information about the xml data type, and the features in SQL Server 2005 to support handling of XML data, see the XML Support in SQL Server 2005 section of this paper.

Date/Time Types: datetime, smalldatetime

The date and time data types used in SQL Server 2000 and SQL Server 2005 have the following definitions:

datetime

Date and time data in the Gregorian calendar from January 1, 1753, through December 31, 9999, to an accuracy of one three-hundredth of a second (equivalent to 3.33 milliseconds or 0.00333 seconds).

smalldatetime

Date and time data in the Gregorian calendar from January 1, 1900, through June 6, 2079, with accuracy to the minute. smalldatetime values with 29.998 seconds or lower are rounded down to the nearest minute; values with 29.999 seconds or higher are rounded up to the nearest minute.

Microsoft SQL Server rejects data that falls outside of these ranges. The actual data is stored internally as two integers that represent the date and time in question (4-byte integers for datetime, and 2-byte integers for smalldatetime).

The data stored does not represent either a local time or a universal time, and contains no time zone information. If you need to convert dates to universal time, you can use one of the UTC date functions. The current UTC time is derived from the current local time and the time zone setting in the operating system of the computer on which the instance of Microsoft SQL Server is running. Because the value has no intrinsic locale-specific formatting, it is up to the developer to define conversions as needed. SQL Server supports many different locale-specific conversions that can be performed at the server instead of relying on custom solutions from developers. These date styles can be accessed through the CONVERT function, which takes a data type, an expression, and an optional style, as shown in the following table.

W/century W/o century Standard Input (converting to datetime)
Output (converting to text)
0 or 100 - Default mon dd yyyy hh:miAM (or PM)
101 1 U.S. mm/dd/yy
102 2 ANSI yy.mm.dd
103 3 British/French dd/mm/yy
104 4 German dd.mm.yy
105 5 Italian dd-mm-yy
106 6 - dd mon yy
107 7 - Mon dd, yy
108 8 - hh:mm:ss
9 or 109 - Default + milliseconds mon dd yyyy hh:mi:ss:mmmAM (or PM)
110 10 U.S. mm-dd-yy
111 11 JAPAN yy/mm/dd
112 12 ISO yymmdd
13 or 113 - Europe default + milliseconds dd mon yyyy hh:mm:ss:mmm(24h)
114 14 - hh:mi:ss:mmm(24h)
20 or 120 - ODBC canonical yyyy-mm-dd hh:mi:ss(24h)
21 or 121 - ODBC canonical + milliseconds yyyy-mm-dd hh:mi:ss.mmm(24h)
126 - ISO8601 (no spaces) yyyy-mm-dd Thh:mm:ss:mmm
130 - Kuwaiti (Hijri) dd mon yyyy hh:mi:ss:mmmAM
131 - Kuwaiti (Hijri) dd/mm/yy hh:mi:ss:mmmAM

The following example shows how the CONVERT function is used to represent the current date in a specified style:

SELECT CONVERT(char, GETDATE(), 100) AS [100]
RETURNS:
Aug 16 2000 11:50AM

You can then convert the data from a string to a date value in much the same manner:

SELECT CONVERT(datetime, 'Aug 16 2000 11:50AM', 100) AS [100]
RETURNS:
100
2000-08-16 11:50:00.000

If you convert dates in Style 130 (Kuwaiti or Hijri) to a char data type, the data may be corrupted if the collation is not one of the Arabic collations that use code page 1256 for Unicode conversions. For example, Figure 1 shows a column that has been converted to char and a second column that has been converted to nchar. In this example, the client computer uses the EN-US locale. Therefore, when you attempt to use the char data type, the Arabic characters are converted into question marks, whereas if you use the nchar data type, Arabic characters are displayed.

Click here for larger image

Figure 1. Using CONVERT with date/time data (Click on the picture for a larger image)

However, even the string represented using nchar is still not correctly formatted, as it would be on an Arabic client computer, because of limitations in the Query Editor. Figure 2 shows how the actual Hijri date string should appear.

Figure 2. Hijri date string

The reason that Arabic cannot be rendered correctly is that complex scripts, such as Arabic, have shaping rules that control how the data is rendered. In the case of bidirectional (BIDI) languages such as Hebrew, shaping causes all data to be reversed. In the case of Arabic, shaping has a more marked effect, because the actual shapes of letters can change depending on the surrounding letters. This problem does not happen in versions of Windows after Windows 2000, or in any earlier, Arabic-enabled versions of Windows.

Additionally, the date string that is returned can cause problems in the bidirectional cases where it is needed, because the rules for the layout of bidirectional text used by an application, such as Internet Explorer, causes the date to appear as shown in Figure 3.

Figure 3. Bidirectional date string example

This visual order (dd hh:mi:ss yyyy mon :) is obviously not the order that would be expected; the problem can be considered to be a general limitation of the 130 style in the CONVERT function. You can work around this problem by adding the proper Unicode control character in front of the string, as in the following query:

SELECT NCHAR(8207) + CONVERT(nchar, GETDATE(), 130)

The NCHAR function returns a character based upon the passed-in Unicode code point; 8207 or hexadecimal 0x200F is the Right-to-Left Marker (RLM), and it causes the string to be displayed correctly.

Performance and Storage Space

If you define a column using one of the Unicode data types, you may experience the following issues relating to storage space and speed.

Increased Storage Space

Unicode data types use two or more bytes per character, whereas non-Unicode data types use one byte for all non-DBCS text and two bytes for Asian languages that use DBCS. Therefore, unless your data uses one of the Asian code pages, when you convert to Unicode you will use twice as much space to store the data. The increased storage space requirements must be considered when you upgrade existing databases, or when you are deciding on the proper data types for a new project. If you are storing data only in a column that is on a single (non-Asian) code page, you might prefer to not use Unicode so you can save space on disk and in memory. However, the advantages of Unicode conversion generally outweigh the storage impacts.

Note   When storing Asian DBCS data, the UCS-2 encoding method used by SQL Server 2005 tends to be more efficient than the UTF-8 method used by many other database programs. This is because UTF-8 uses three bytes to store most Asian language characters, while UCS-2 uses just two (with the exception of supplementary characters and combining characters). On the other hand, for non-DBCS languages, such as ASCII-based characters, UTF-8 usually uses only one byte per character, while UCS-2 uses two.

Speed Issues

The effect of using Unicode data types on performance is complex. You must consider the following issues:

  • If you are running on Windows XP, Windows Server 2003, or Windows Vista, the operating system expects Unicode data; therefore, in many cases non-Unicode columns have to be converted when you display data or use the operating system services.
  • When you are dealing with a native DBCS data format, additional time may be needed to load the larger amount of data.
  • If you are working between instances, between database server products, or exchanging data with other applications, the number of conversions can affect performance.
  • If you are dealing with Asian languages, Unicode is faster than the language-specific DBCS code page. This is because DBCS data does not have fixed width; it is a mixture of double-byte and single-byte characters.
  • If you are dealing with non-Asian languages, sorting Unicode data can be slower than sorting non-Unicode data.
  • Binary is the fastest sorting order, and it is case sensitive, but it can yield unexpected sort orders. If a binary sort order is selected, the case-sensitive, accent-sensitive, kana-sensitive, and width-sensitive options are not available.
Important   To evaluate performance realistically, you must test in both the Unicode and non-Unicode scenarios.

Migration of Metadata Information in System Tables

The system tables in SQL Server 2005 store all the data in the system tables, including identifiers for objects, as Unicode. This minimizes the problems that can occur with different collations between databases and columns.

If you are migrating from SQL Server 2000 to SQL Server 2005, the only significant change is that SQL Server 2000 uses the Unicode 2.0 standard for the list of valid characters in identifiers, whereas SQL Server 2005 supports the Unicode Standard, Version 3.2.

You can directly upgrade instances of SQL Server 2000 Service Pack 3 (SP3) or later, and instances of SQL Server 7.0 SP4 or later, to SQL Server 2005. You can perform most upgrade operations through Setup; however, some components support or require migration of applications or solutions after running Setup.

Collations

Sort order is a critical but often overlooked part of database definition. Users tend to take sorting in their own alphabet for granted. However, some languages, such as Greek, Russian, and Thai, use different alphabets. Some languages, such as Japanese, use multiple alphabets, with complex rules for ordering. Even in the European languages, there is wide difference in the way that individual characters are handled. For example, Spanish users expect the letter combination "ch" to sort as a single character after the letter "h."

Basic sorting works through collations, which control sort orders, among other behaviors. Sorting can be optimized through the creation of indexes.

Sorting works together with a technique known as string normalization. This kind of normalization is different than the sense of the word that database developers are used to. String normalization refers to a method of comparing two strings so that they can be sorted: for example, you can specify whether different kana types should be treated as equivalent, or whether accents should be ignored.

For non-Unicode columns, collation has a second meaning that is very important: the collation specifies the code page for the data and therefore determines which characters can be represented. Data can be moved between Unicode columns without requiring special conversion or character mappings; data moved between non-Unicode columns is frequently garbled or corrupted, or at the very least cannot be displayed.

Collations in SQL Server 6.5 and Earlier

In SQL Server versions 6.5 and earlier, collations were used to specify the code page to use for language in general. There were numerous limitations related to different sort orders. For example, you could only support Western European languages if you used Latin-1. The collations also limited the number of different locales that you could represent in a single instance of SQL Server. In other words, you could only store or display the language used in a specific region. To use a different language you needed to set up a separate database or even a separate server.

These issues apply also to the collation of non-Unicode fields in later versions of SQL Server.

Collations in SQL Server 7.0

SQL Server 7.0 provided for one Unicode and one non-Unicode collation per server. The non-Unicode collations were defined as a combination of a code page and a sort order. Often, each code page can support more than one sort order; for example, Latin languages usually allow both case-sensitive and case-insensitive sorting. Simplified Chinese allows sorting both by stroke count and phonetic sorts.

In a Unicode collation, any character of any language can be included in the column, and the various collations that are provided were designed to make sure that any collation-specific differences are properly handled. In other words, you choose the "best fit" to give users the data they expect. For example, the general Unicode collation lets you sort Farsi data.

A Unicode collation consists of a locale and several comparison styles. Locales are usually named after countries or cultural areas. They sort characters according to the standard in that area. The Unicode collation still provides a sort order for all characters in the Unicode standard, but precedence is given to the locale specified. Any locale that does not have a supported, unique Unicode collation should use the General Unicode Collation.

The general Unicode sort order correctly handles not only the data but also the sorting of Afrikaans, Albanian, Arabic, Basque, Belarusian, Bulgarian, English, Faeroese, Farsi, Georgian (Traditional), Greek, Hebrew, Hindi, Indonesian, Malay, Russian, Serbian, Swahili, and Urdu.

One important change in SQL Server 7.0 was the provision of an operating system–independent model for string comparison, so that the collations between all operating systems from Windows 95 through Windows 2000 would be consistent. This string comparison code was based on the same code that Windows 2000 uses for its own string normalization, and is encapsulated to be the same on all computers and in all versions of SQL Server.

Collations in SQL Server 2000

In SQL Server 2000, the collation model was changed to eliminate inconsistencies between the two different systems of collations: Windows and SQL Server collations. A more flexible model was needed to handle all of the new requirements for collations. New collations were also required in SQL Server 2000 to handle the code page of non-Unicode columns.

To address these needs, a single, consistent model was designed to handle both Unicode and non-Unicode sorts. Each of the collations was combined with suffixes that help define whether the collation is sensitive to case, accent, width, or kana type. Appendix A contains a table that lists the suffixes. These 17 suffixes, combined as appropriate with the 40 languages supported in SQL Server 2000, created a total of 680 Windows collations.

The language names used for the collations were arbitrary and chosen to represent each unique supported code page for non-Unicode data and a sort order for all data. In many cases, multiple languages could be completely represented on a single code page, or processed by the same sort order used by another language, and in these cases the additional languages were removed from the list.

Collations in SQL Server 2005

SQL Server 2005 supports all languages supported by the Microsoft Windows operating systems. This means that SQL Server 2005 added support for new and updated collations in Windows Server 2003 and Windows XP. (These collations are installed as part of SQL Server 2005 setup. You control the choice of collation for the server and database during setup. Updates to the operating system do not affect collations used in SQL Server.)

An important part of the new collations were East Asian collations that support supplementary characters. Support was also added for string comparison of supplementary characters, based on code points, and a binary flag (BIN2) was introduced to enable true code point comparisons.

Binary collations sort and compare data in SQL Server based on the bit pattern for each character. Each binary collation in SQL Server maps to a specific language locale and ANSI code page, and each performs case-sensitive and accent-sensitive data sorts. Binary collations provide the fastest data sorts.

The Binary option (_BIN) sorts and compares data in SQL Server tables based on the bit patterns defined for each character. Binary sort order is case sensitive and accent sensitive. Binary is also the fastest sorting order. If this option is not selected, SQL Server follows sorting and comparison rules as defined in dictionaries for the associated language or alphabet.

The Binary-code point (_BIN2) option sorts and compares data in SQL Server tables based on Unicode code points for Unicode data. For non-Unicode data, Binary-code point will use comparisons identical to binary sorts. The advantage of using a Binary-code point sort order is that no data resorting is required in applications that compare sorted SQL Server data. As a result, a Binary-code point sort order provides simpler application development and possible performance increases.

Appendix E contains a list of the collations that were updated in SQL Server 2005. Unless you need backward compatibility with SQL Server 2000 or earlier, it is best to use the updated collations.

Collations in SQL Server 2005 include the following collation types: Windows collations, and SQL collations.

Windows Collations

Windows collations define rules for storing character data based on an associated Windows locale. (There are more Windows locales than there are SQL Server Windows collations.) The base Windows collation rules specify which alphabet or language is used when dictionary sorting is applied, as well as the code page used to store non-Unicode character data. In SQL Server, Windows collations are combined with a series of suffixes to additionally define sorting and comparison rules based on case, accent, kana, and width sensitivity. The complete Windows collation name is composed of the collation designator and the comparison styles.

In a Windows collation, comparison and sorting of non-Unicode data is implemented using the same algorithm as Unicode data. This provides consistency across data types within SQL Server, and it also provides developers with the ability to sort strings in their applications using the same rules that are used by SQL Server—that is, by calling the CompareStringW function of the Microsoft Win32 API.

Binary Collations

Binary collations sort data based on the sequence of coded values defined by the locale and data type. A binary collation in SQL Server defines the language locale and the ANSI code page to be used, enforcing a binary sort order. Binary collations are useful in achieving improved application performance due to their relative simplicity. For non-Unicode data types, data comparisons are based on the code points defined in the ANSI code page. For Unicode data types, data comparisons are based on the Unicode code points. For binary collations on Unicode data types, the locale is not considered in data sorts. For example, Latin_1_General_BIN and Japanese_BIN2 yield identical sorting results when used on Unicode data. That is because the collation compares Unicode as Unicode, and compares non-Unicode data as binary.

In SQL Server 2000, binary collations performed an incomplete code-point–to–code-point comparison for Unicode data. The first character was compared as WCHAR, followed by a byte-by-byte comparison. For backward compatibility, existing binary collation semantics will not be changed.

Binary collations in SQL Server 2005 include both the previous BIN collation and a new set of pure code-point comparison collations. Customers can choose to migrate to the new binary collations to take advantage of true code-point comparisons, and they should utilize the new binary collations when developing new applications. The new BIN2 suffix identifies collation names that implement the new code-point collation semantics. In addition, a new comparison flag is added corresponding to BIN2 for the new binary sort.

SQL Server automatically recommends a default collation based on the system locale. You should change the settings for a Windows default collation only if your installation of SQL Server must match the collation settings used by another instance of SQL Server, or if the collations setting must match the Windows system locale of another computer.

If you need to handle supplementary characters, change the default collation to one of the newer collations that support ordering and comparison operations on supplementary characters. These comparisons are based on code points only, and not on other linguistically meaningful ways. Only the 90 collation versions, which are signified by the 90 suffix added to their names, support these operations. For example, instead of the Japanese collation, use Japanese_90. If you do not use a supplementary-aware collation, be careful when you use supplementary characters in operations such as ORDER BY, GROUP BY, and DISTINCT, and especially when supplementary and non-supplementary characters are included in the same operation.

SQL Server Collations

SQL Server collations provide sort-order compatibility with earlier versions of SQL Server. (For a full list of SQL Server collations, see SQL Collation Name in SQL Server 2005 Books Online.) SQL Server collations are based on legacy SQL Server sort orders for non-Unicode data—for example, char and varchar data types—defined by SQL Server. The dictionary sorting rules for non-Unicode data are not compatible with any sorting routine provided by Windows operating systems, but the sorting of Unicode data is compatible with a particular version of Windows sorting rules. Because SQL Server collations use different comparison rules for non-Unicode data and for Unicode data, you might see different results for comparisons of the same data, depending on the underlying data type.

When you upgrade an instance of SQL Server, SQL Server collations can be specified for compatibility with existing instances of SQL Server. Because the default collation for an instance of SQL Server is defined during setup, it is important to specify collation settings carefully when:

  • Your application code depends in some way on the behavior of previous SQL Server collations.
  • You are going to use SQL Server 2005 replication with existing installations of SQL Server 6.5 or SQL Server 7.0.
  • You must store character data that reflects multiple languages.

If you have a mix of Unicode and non-Unicode columns in your database, you should primarily use Windows collations. Windows collations apply Unicode-based sorting rules to both Unicode and non-Unicode data. This means that SQL Server internally converts non-Unicode data to Unicode to perform comparison operations. This provides consistency across data types in SQL Server, and also provides developers with the ability to sort strings in their applications that use the same rules that SQL Server uses.

SQL collations, on the other hand, apply non-Unicode sorting rules to non-Unicode data, and Unicode sorting rules to Unicode data, by using a corresponding Windows collation for the Unicode data. This difference can cause inconsistent results for comparisons of the same characters. Therefore, if you have a mix of Unicode and non-Unicode columns in your database, they should all be defined by using Windows collations so that the same sorting rules are used across Unicode and non-Unicode data.

Backward Compatibility in Collations

The BIN binary collation provided in SQL Server 2000 performed an incomplete code-point to code-point comparison for Unicode data. These binary collations compared the first character as WCHAR, followed by a byte-by-byte comparison. This could cause Unicode character data to be sorted in an unexpected manner.

To ensure backward compatibility, existing binary collation semantics will not be changed. However, the functionality may have been superseded by newer collations. The Appendix lists collations that have been preserved for backward compatibility with SQL Server 2000 or SQL Server 7.0.

To obtain information about collations in a database, you can use the following system views:

Catalog view Description
sys.databases Returns information about the collation of a database.
sys.columns Returns information about the collation of a column of a table or view.
COLLATIONPROPERTY Returns information about collations in SQL Server 2005.

You can pass the CodePage value, or LCID, which returns the Windows Locale ID, or Null for SQL collations.

You can also specify Windows ComparisonStyle (returns Null for both Binary and SQL collations). ComparisonStyle can be used to verify that there is an equivalence between string normalization in Windows and in SQL Server for the Windows collation.

fn_helpcollations Returns a list of available collations in SQL Server 2005.
SELECT * FROM ::fn_helpcollations()

In SQL Server 2005, this query returns 1,011 collations. (In SQL Server 2000, 753 collations.)

SERVERPROPERTY Returns the collation associated with the server.
SELECT CONVERT(char, SERVERPROPERTY('collation'))
DATABASEPROPERTYEX Determines the collation of a database, for example:
SELECT CONVERT(char, 
DATABASEPROPERTYEX('pubs', 
'collation'))

Additional collations cannot be added, unless they are added in service packs or future versions.

Collations and Sorting of Data

As a general rule, every collation defined in SQL Server on a Unicode column will sort all defined Unicode characters. However, there are many different collations because there are many differences in how the data can be sorted. A good example of this is the Georgian modern sort. Although the traditional sorting of Georgian text places all of the characters in a specific order, it is common for modern usage to place certain rarely used characters at the end. These characters are HE (), HEI (), WE (), and HAR (). Thus, there are two ways to sort the Georgian alphabet: traditional, and modern.

Figure 4. Traditional Georgian sort order

Figure 5. Modern Georgian sort order

The existence of this collation does not prevent other Unicode data from being sorted according to the sorting order provided in the Latin1_General collation. All collations, in fact, sort Georgian in the traditional form, with the sole exception of the Georgian_Modern_Sort collations. In other words, the same general rules apply to all collations; only the exceptions change between collations.

Collation Levels

SQL Server 2005 supports setting collations at the following levels of a SQL Server 2005 instance:

  • Server level
  • Database level
  • Column level
  • Expression level

This section explains these collation levels and how they interact when multiple collations are used.

Collations Specified at the Server Level

The default collation of a SQL Server instance is set during setup. The default collation of the instance also becomes the default collation of the system databases. After a collation has been assigned to any object other than a column or a database, you cannot change the collation except by dropping and re-creating the object. Instead of changing the default collation of a SQL Server instance, you can specify the collation when you create a new database or database column.

In earlier versions, the collation was always set at the server level, and in many cases it was the only collation you ever needed to set. The server collation acts as the default whenever a new database is created on the server, unless you explicitly set the collation at the database level. Because each database has its own collation, the server-level collation is referenced only when the database is first created.

In SQL Server 2000, you could change the default collation for the server without rerunning Setup by using the Rebuild Master utility (RebuildM.exe), which is located in the Program Files\Microsoft SQL Server\80\Tools\BINN directory. For more information, see How to Rebuild the Master Database (Rebuild Master Utility) in SQL Server 2005 Books Online for SQL Server 2000. In SQL Server 2005, this utility is not supported; instead, you use the REBUILDDATABASE option in Setup.exe. However, because the server collation is not often used, instead of changing the default collation of an instance of SQL Server 2005, you can specify a default collation for each new database you create.

If you must change the default collation for an instance of SQL Server 2005, you must first script or back up your database, drop all user databases, and then rebuild the master database—specifying the new collation in the SQLCOLLATION property of the setup command, as follows:

start /wait setup.exe /qb INSTANCENAME=MSSQLSERVER REINSTALL=SQL_Engine
REBUILDDATABASE=1 SAPWD=test SQLCOLLATION=SQL_Latin1_General_CP1_CI_AI

Collations at the Database Level

When a database is created, the COLLATE clause of the CREATE DATABASE statement can be used to specify the default collation of the database. If no collation is specified during database creation, the database is assigned the default collation of the model database. The default collation for the model database is the same as the default collation of the SQL Server instance.

Every database can have a unique collation, with the sort order being set at the database level. Figure 6 shows how the collation is set using SQL Server Management Studio.

Click here for larger image

Figure 6. Setting the database collation using the Options tab of the Database Properties window (Click on the picture for a larger image)

You can also set the collation order by using Transact-SQL. For example, to create a new database that uses the Czech sort order and is case- and accent-sensitive, use a statement like the following:

USE master
GO
CREATE DATABASE Products
ON 
( NAME = products_dat,
   FILENAME = 'c:\program files\microsoft sql 
server\mssql\data\products.mdf' )
COLLATE Czech_CS_AS
GO

In SQL Server 2005, you can also change the collation of an existing database by using SQL Server Management Studio. In SQL Server 2000, this functionality was not available in SQL Server Enterprise Manager. Instead, you had to use the ALTER DATABASE statement. For example, the following statement changes the collation of the Products database from Czech_CS_AS to Czech_CI_AI (case- and accent-sensitive to case- and accent-insensitive):

ALTER DATABASE Products
COLLATE Czech_CI_AI

Considerations Before Changing the Collation of a Database

If you have data in a text, varchar, or char field and there is no explicit collation on the column, changing the collation of the database alters the way that the data's encoding is interpreted, resulting in a form of corruption of any characters beyond the ASCII range.

Therefore, avoid changing the collation of any database that contains non-Unicode text data, unless the data is stored in columns that have their own explicit collations set.

To change the collation of a database, all of the following must be true:

  • No one else can be using the database.
  • No schema-bound object can be dependent on the database collation. Schema-bound objects include any of the following:
    • User-defined functions and views created with SCHEMABINDING
    • Computed columns
    • CHECK constraints
    • Table-valued functions that return tables with character columns with collations inherited from the default database collation
  • Changing the collation of the database does not create duplicates among any of the system names. If duplicates are found, an error is raised and the action to change the collation fails.

    Objects that can potentially cause such duplications include:

    • Object names (such as procedure, table, trigger, or view).
    • Schema names (such as group, role, or user).
    • Scalar-type names (such as system and user-defined types).
    • Full-text catalog names.
    • Column or parameter names within an object.
    • Index names within a table.

For example, suppose your database contains two tables named Table1 and TABLE1, and you try to change the collation from French_CS_AS (case-sensitive, accent-sensitive) to French_CI_AS (case-insensitive, accent-sensitive). With the first collation, it is possible to have two tables. However, changing to the second collation causes duplicates.

Collations Specified at the Column Level

In SQL Server 2005, you can specify the collation of text in a particular column. This can be very useful, for example, if you need to force case sensitivity for a password column. Other scenarios where column-level collations might be useful involve multiple languages within the same table. For example, the customer name column might need to use Unicode with the Latin1_General collation to enable sorting that is appropriate for a variety of names, whereas the product line column might always contain Greek, and for this column a Greek collation might make sense. Figure 7 shows how you can choose a collation and set the sort-order options during table design.

Figure 7. Specifying a collation

If you have not previously set a collation on the column, when you click the column, the dialog box lists <database default> for the Collation property of the column. To change the collation, click the ellipsis (...) button. This opens the Collation dialog box, where you select either a Windows collation or a SQL Server collation, and set sorting options.

When you select a Windows collation, you can specify whether the collation is sensitive to case, accent, kana type, and width.

You can also set column-level collations using Transact-SQL by adding a COLLATE clause to the column definition in the CREATE TABLE statement.

The following example shows how to use Transact-SQL to specify a collation for the job-description column that is set to Arabic, is case- and accent-insensitive, and is kana-type insensitive.

CREATE TABLE jobs
(
   job_id  smallint
      IDENTITY(1,1)
      PRIMARY KEY CLUSTERED,
   job_desc varchar(50)
      COLLATE Arabic_CI_AI_KS
      NOT NULL
      DEFAULT 'New Position - title not formalized yet',
)

You can also use the ALTER TABLE statement to change the collation at the column level (except for an ntext or text column). If the COLLATE clause is not specified, changing a column's data type causes the collation of the column to change to the default collation of the database.

In SQL Server 2005, you can also change the collation programmatically by using the column.collation property in SQL Management Objects (SMO).

The COLLATE clause can be used to change the collations of columns of only the char, varchar, nchar, and nvarchar data types. To change the collation of a user-defined alias data-type column, you must execute separate ALTER TABLE statements to change the column to a SQL Server system data type and change its collation, and then change the column back to an alias data type.

ALTER COLUMN cannot have a collation change if one or more of the following conditions exist:

  • If a CHECK constraint, FOREIGN KEY constraint, or computed column references the column that is being changed.
  • If any indexes, statistics, or full-text indexes are created on the column. Statistics created automatically on the column changed are dropped if the column collation is changed.
  • If a schema-bound view or function references the column.

You can insert or update values in a text column whose collation is different from the code page of the default collation of the database. SQL Server implicitly converts the values to the collation of the column.

Collations Specified in Expressions

Expression-level collations are set at the time a statement is run and they affect the way a result set is returned. This allows sorting results so that the ORDER BY clause can be language-specific.

During SQL Server setup, you are prompted to select either Windows collations or binary collations. Your choice of collations affects the data comparison and sort-order behaviors of your instance of Microsoft SQL Server.

There may be many instances when you need to display data to people in different countries and want locale-appropriate sorting. In both SQL Server 2000 and 2005, you can specify collations in expressions. This powerful feature allows you to sort in a particular manner so that the ORDER BY clause can be language-specific.

For example, the following query on the AdventureWorks database sorts the Person.Address table by the city in their address. The Lithuanian sort order provides a good example of collation differences because the rules about how the letter Y sorts are so striking.

SELECT *
FROM Person.Address
ORDER BY City COLLATE Lithuanian_CI_AI

The following example assumes that Table1 has no explicit column-level collations. In this case, both columns are compared by using the Turkish sort order.

SELECT   *
FROM  Table1
WHERE Field1 = Field2 COLLATE Turkish_ci_ai

For an explanation of how collations are used in comparisons, see the "Rules of Precedence for Collations" section, later in this paper.

COLLATE Keyword

A collation can be defined at the database level, the column level, or in an expression. To specify the collation for a database, column, or character expression, use the following syntax:

COLLATE [<Windows_Collation_name>|<SQL_Collation_Name]

If the column is not defined using a Unicode data type (ntext, nvarchar, nvarchar(max), nchar), the collation is converted into a code page.

The COLLATE keyword provides the option to use the following two types of collations:

Windows collations

These are defined by Windows. You can change options to specify case, accent, kana, and width sensitivity, and you can choose a binary sort order.

SQL collations

These collations are provided for backward compatibility. You cannot configure the sort order.

In general, try to use the Windows collations whenever you can. The following example presents a list of country codes and names, showing how sorting behavior can change depending on the collation. The upper part of the Query Window shows the list sorted by the default collations; the lower half of the Query Window shows the same data sorted by the Lithuanian collation.

In the default collation, shown first, Y comes between X and Z. In the Lithuanian collation, shown second, Y comes between I and J.

Click here for larger image

Figure 8. Effect of Lithuanian collation on sorting (Click on the picture for a larger image)

Rules of Precedence for Collations

Because you can specify collations at the level of the server, the database, columns, and in expressions, it is important to understand how the collations interact. Collation precedence determines how expressions that evaluate to a character string are collated, and determine the collation that is used by operators that use character string inputs but do not return a character string, such as LIKE and IN.

The collation precedence rules in SQL Server 2005 apply only to the character string data types: char, varchar, text, nchar, nvarchar, and ntext. Objects that have other data types do not participate in collation evaluations.

The comparison operators and the MAX, MIN, BETWEEN, LIKE, and IN operators, are collation-sensitive. The string used by the operators is assigned the collation label of the operand that has the higher precedence. The UNION operator is also collation-sensitive, and all string operands and the final result is assigned the collation of the operand with the highest precedence. The collation precedence of the UNION operands and result are evaluated column by column.

The assignment operator is collation-insensitive and the right expression is cast to the left collation.

The string concatenation operator is collation-insensitive. The two string operands and the result are assigned the collation label of the operand that has the highest collation precedence. The UNION ALL and CASE operators are collation-insensitive, and all string operands and the final results are assigned the collation label of the operand with the highest precedence. The collation precedence of the UNION ALL operands and result are evaluated column by column.

Because objects can have different collations at various levels, SQL Server 2005 introduces collation labels to help you manage the complex interaction of collations. A collation label names a category of objects that can take a collation. The collation rules are described in terms of interactions between collation labels.

If an expression references only one character string object, the collation label of the referenced object is used. If the expression references two operand expressions that have the same collation label, that becomes the collation label of the operand expression.

If a complex expression references two operand expressions with different collations, the collation label of the final result uses a series of rules to determine the precedence of the collations. For more information, see Collation Precedence (Transact-SQL) in SQL Server 2005 Books Online.

The following list describes the different types of collation labels. The list is followed by a chart that summarizes possible interactions of the collation labels.

Explicit

A collation is explicitly defined for a given expression, or explicitly cast to a specific collation (X) by using a COLLATE clause in the expression.

Implicit

A column is referenced. Even if the column were explicitly assigned a collation by using a COLLATE clause in the CREATE TABLE or CREATE VIEW statement, a column reference is classified as implicit.

Coercible-Default

The database-level collation is used for any Transact-SQL character string variable, parameter, literal, the output of a catalog built-in function, or the output of a built-in function that does not take string inputs but produces a string output.

If the object is declared in a user-defined function, stored procedure, or trigger, the object is assigned the default collation of the database in which the function, stored procedure, or trigger is created. If the object is declared in a batch, the object is assigned the default collation of the current database for the connection.

No Collation

When the value of an expression is the result of an operation between two strings that have conflicting collations of the implicit collation label, the expression result is defined as not having a collation.

  Explicit C1 Implicit C1 Default No Collation
Explicit C2 Run-time Error Explicit C1 Explicit C1 Explicit C1
Implicit C1 Explicit C2 No Collation Implicit C1 No Collation
Default Explicit C2 Implicit C2 Default No Collation
No Collation Explicit C2 No Collation No Collation No Collation

This table shows that the only time SQL Server cann