Export (0) Print
Expand All
Creating Audit Tables, Invoking COM Objects, and More
Exception-handling Techniques
Exploring SQL Server Triggers: Part 2
Deliver User-Friendly Reports from Your Application with SQL Server Reporting Services
SQL Server: Display Your Data Your Way with Custom Renderers for Reporting Services
Updating Data in Linked Servers, Information Schema Views, and More
XML Features in SQL Server 2000
Expand Minimize

International Features in Microsoft SQL Server 2000

SQL Server 2000
 

Michael Kaplan
Trigeminal Software, Inc.

April 2001

Summary: This article introduces Microsoft SQL Server developers to the international features of SQL Server 2000. Topics covered include an explanation of Unicode, SQL Server international data types, and key issues regarding implementation. (57 printed pages)

Contents

Introduction
What Is Unicode and How Can It Be Used?
Data Types in SQL Server 2000
Performance and Storage Space
Metadata Information In System Tables
Collation
Communication Between Server and Client (Code Page and Collation Issues)
Multilingual Data In the User Interface
Getting to SQL Server Data (Data Access Methods)
Multilingual Transact-SQL
Locale Support In SQL Server 2000
Data Transformation Services
Using the bcp Utility with Multilingual Data
The Microsoft Search Service and FTS
Dealing with OLAP/Hierarchical Data
Using XML Support in SQL Server 2000 with Multilingual Data
Interacting with Other Database Products
Conclusion
Acknowledgments
About the Author

Introduction

Microsoft® SQL Server™ 2000 includes powerful features to support international operations and environments. Extensive multilingual features make SQL Server 2000 a compelling database product and applications platform. This article provides a complete overview of how to use these features in a global context. This article is not limited to a list of features but also will explain how international/multilingual requirements can affect many aspects of a project.

What Is Unicode and How Can It Be Used?

Unicode support is the foundation of the multilingual support of SQL Server 2000. Unicode is a standard that is designed to support all the world's scripts. Unicode supplies a unique code point for every character, regardless of platform, program, or language. A program that supports Unicode can handle data in any language. Unicode 3.0 can handle up to 1,114,112 characters.

Unicode is an industry standard managed by the Unicode Consortium, an organization that recognizes the importance of having a single character set for all languages. Microsoft is a member of the Unicode Consortium. Most companies joined for many of the same reasons that Microsoft did: in creating global software solutions, the importance of being able to represent multilingual data is obvious. Many other companies and individuals join to understand the issues and techniques for handling multilingual data.

The Unicode Standard, now at version 3.01, is identical to ISO-10646, an international standard that matches all of the code points in Unicode since just after Unicode 1.1 was released. The effective combination of both an industry and international standard helps to keep any individual interest from subverting the goal of both standards: one character set for everyone!

For more information, go to the Unicode Consortium Web site.

Encodings

Unicode maps code points to characters, but do not actually specify how the data will be represented in memory, in a database, or on a Web page. This is where the actual encoding of Unicode data comes into play. There are many different encodings for Unicode. This section describes these common encodings:

  • UCS-2
  • UTF-16
  • UTF-8

The information about encoding is presented here to help you better understand Unicode and some of the many ways in which it can be stored. Most of the time you can simply choose a Unicode data type and not worry about these details; however, it is important to understand the encoding any time you:

  • Are dealing with an application that may encode Unicode differently
  • Must send data to other platforms (non-Microsoft Windows®) or Web servers
  • Must manage the importation of data from or exportation of data to other encodings

UCS-2

UCS-2 is the main Unicode encoding used by Microsoft Windows NT® 4.0, Microsoft® SQL Server™ version 7.0, and Microsoft SQL Server 2000. UCS-2 allows for encoding of 65,536 different code points. All information that is stored in Unicode in SQL Server 2000 is stored in this encoding, which uses two bytes for every character, regardless of the character being used. Therefore, the Latin letter "A" is treated the same way as the:

  • Cyrillic letter Sha Aa902644.intlfeainsqlsvr2000.rtf25(en-us,SQL.80).gif
  • Hebrew letter Lamed Aa902644.intlfeainsqlsvr2000.rtf26(en-us,SQL.80).gif
  • Tamil letter Rra Aa902644.intlfeainsqlsvr2000.rtf27(en-us,SQL.80).gif
  • Japanese Hiragana letter E Aa902644.intlfeainsqlsvr2000.rtf28(en-us,SQL.80).gif

Each one has a unique code point (for these letters, the code points are U+0041, U+0248, U+05DC, U+0BB1, and U+3048, respectively, where each four-digit hexadecimal number represents the 2 bytes that UCS-2 uses).

The ordering of bytes can be crucially important at the operating-system level. Because SQL Server runs on the Windows platform, it uses a Little Endian encoding system (meaning "little end in"). Therefore, a hexadecimal word such as 0x1234 is stored in memory as 0x34 0x12.

UTF-16

UTF-16 is the primary Unicode encoding used by Microsoft Windows 2000. Even before Unicode 2.0 was released, it became clear that the goal of Unicode (to support a single code point for every character in every language) could not be achieved using only 65,536 characters. Some languages, such as Chinese, require that many characters to encode just the rarely used characters. Thus, support was added for a surrogate range to handle an additional 1,048,576 characters. UTF-16 is the encoding that fully supports this extension to the original standard. For information about the surrogate range, see the topic What are Surrogates?

In UTF-16, the same standard of 2 bytes per code point is followed; however, with UTF-16 certain code points use another code point right after them to define the character.

Like UCS-2, UTF-16 is stored in a Little Endian manner, as is everything on Windows, by default.

Important   Although UCS-2 is not aware of surrogates, it will not corrupt the actual data in a database that contains them; it will treat them as two separate (undefined) characters.

Although SQL Server 7.0 and SQL Server 2000 can store surrogate pairs losslessly, they will treat the surrogate pairs as two undefined Unicode characters rather than as a single character. Such applications are usually referred to as surrogate "neutral" or surrogate "safe" (where safe refers to the ability to store the data even if there is no intrinsic ability to interact with it). Surrogate "aware" applications are rare at this time, because currently there are no surrogate characters officially defined. Microsoft Word 2000, Microsoft Windows 2000, and Microsoft Internet Explorer version 5.0 and later are a few surrogate-aware applications.

UTF-8

Many ASCII and other byte-oriented systems that require 8-bit encodings (such as mail servers) must span a vast array of computers that use different encodings, different byte orders, and different languages. UTF-8 is an encoding scheme that is designed to treat Unicode data in a way that is independent of the byte ordering on the computer. Although SQL Server 2000 does not store data in UTF-8 format, it supports UTF-8 in at least one crucial scenario: its support of the Extensible Markup Language (XML). For more information, see Using XML Support in SQL Server 2000 with Multilingual Data later in this article.

Many other database systems (such as Oracle and Sybase SQL Server) support Unicode using UTF-8 storage. Depending on a server's implementation, this can be technically easier for a database engine to implement (all of the existing text management code on the server that is designed to deal with data 1 byte at a time does not require major changes). In the Windows environment, UTF-8 storage has these disadvantages:

  • The Component Object Model (COM) supports only UTF-16/UCS-2 in its APIs and interfaces, which would require constant conversion if data were stored in UTF-8 format (this issue only applies when COM is used; SQL Server database engine does not typically call COM interfaces).
  • The Windows NT and Windows 2000 kernels are both Unicode and use UCS-2 and UTF-16, respectively. Once again, a UTF-8 storage format would require many extra conversions (as with the previous note on COM, this would not result in a conversion hit in the SQL Server database engine, but would potentially affect many client-side operations).
  • UTF-8 can be slower for many string operations. Sorting, comparing, and virtually any string operation can be slowed because characters do not have a fixed width.
  • UTF-8 will often need more than 2 bytes, and the increased size can make for a larger footprint on disk and in memory.

Because XML is, among other things, a very important standard for communication over the Internet (which has a strong byte-oriented bias), its default to UTF-8 is something that can make a lot of sense.

What are Surrogates?

The surrogates' area is a range in Unicode from U+D800 to U+DFFF that contains 1024 low surrogate values and 1024 high surrogate values. A high surrogate and a low surrogate can be combined to give access to over a million possible characters. It is not considered valid to have only one-half of a surrogate pair; to be valid, there must always be a high surrogate followed by a low surrogate. This makes checking for a surrogate an easy matter of range checking compared to the rather complex rules that are required to detect DBCS (double-byte character system) characters.

There were no surrogates when SQL Server 7.0 was released. When SQL Server 2000 was released, the only surrogate characters were those relating to language tags in plain text.

Important   As mentioned previously, surrogates can be stored without any danger of data loss, but you should be very careful in trying to use the SQL Server string manipulation functions with such data. In addition, Windows 2000, at this time, supports only code-point sorting for surrogate characters.

After SQL Server 2000 released, through the efforts of the ISO and Unicode standard group, more characters were added to the surrogate range, including about 40,000 CJKV (Chinese, Japanese, Korean, and Vietnamese) ideographs. These characters are used primarily for historical and classical literary documents to help with the encoding of the rich CJKV literary heritage.

Data Types In SQL Server 2000

The central task of a database is obviously the storage of data. This section covers some issued involved in using SQL Server 2000 data types for storing international data.

Non-Unicode Text Types: char, varchar, text

When you deal with text data that is stored in the char, varchar, or text data types, the most important limitation to consider is that only information from a single code page can be stored. The exact code page depends on the collation of the column (if there is no column-level collation then the collation of the database is used). To determine the code page that is used for a given column, you can use the COLLATIONPROPERTY function, as shown in the following examples:

SELECT COLLATIONPROPERTY('Chinese_PRC_Stroke_CI_AI_KS_WS', 'CodePage')
936

SELECT COLLATIONPROPERTY('Latin1_General_CI_AI', 'CodePage')
1252

SELECT COLLATIONPROPERTY('Hindi_CI_AI_WS', 'CodePage')
0

In the last example, Hindi was added to point out that many locales (such as Georgian and Hindi) do not have code pages, as they are "Unicode only" collations. Those collations are not appropriate for these data types.

Any time Unicode data must be inserted into these columns, the columns will be internally converted from Unicode using the WideCharToMultiByte API and the code page associated with the collation. Any time a character cannot be represented on the given code page, it will be replaced by a question mark (?); this makes random question marks a good indication of data that has been corrupted due to this conversion. It also is a good indication that you really needed a Unicode data type. If you use a string literal of a non-Unicode type, it will be converted first using the database's default code page (derived from its collation).

You might encounter another problem if you try to store data when not all of the characters you wish to support are contained in the code page. One of the best examples of this is the Arabic script: it supports a wide array of languages, including Baluchi, Berber, Farsi, Kashmiri, Kazakh, Kirghiz, Pashto, Sindhi, Uighur, Urdu, and more. All of these languages have additional characters beyond those in the Arabic language, which is the basis for Windows code page 1256. These extra characters are thus converted into question marks if they are stored in a non-Unicode column with the Arabic collation. This problem occurs because, in many cases, Windows will consider a particular code page to be a "best fit" code page. That means there is no guarantee you will be able to rely on the code page to handle all text, but it is the best one available.

Unicode Text Types: nchar, nvarchar, ntext

The SQL-92 specification defines these "N" (stands for national) data types but does not specifically require them to be used for Unicode; the actual definition of these data types is left to the database platform or developer. In SQL Server 7.0 and SQL Server 2000, these data types are defined as being equivalent to UCS-2/UTF-16 Unicode. It is important to keep in mind that this is specific to Microsoft SQL Server. When you work with other database servers (such as Sybase SQL Server), its important to know that the "N" data types do not specifically mean Unicode.

For the storage of complex scripts, such as Hindi and Tamil, it is important to note that the data is expected to be in the proper order. Many languages such as Tamil will actually specify that certain letters must be reordered when the text is rendered, thus making the logical order of text as it is stored in memory different from the visual order that will be seen in a user interface. Data should always be stored in the proper logical order for any complex script language, which includes all of the Indic languages, Arabic, Farsi, Hebrew, and many others. The actual rendering of such data is a separate issue (see Multilingual Data In the User Interface later in this article).

Although the "N" columns can indeed support data of any language or combination of languages, the actual sorting of this data can only be in a single collation (the meaning and consequences of this issue are discussed further in the section on collation). None of the code page limitations that were mentioned previously in this article apply to Unicode columns.

Date/Time Types: datetime, smalldatetime

The actual data types have no actual international meaning; they represent a date/time value with the following definitions:

datetime
Date and time data in the Gregorian calendar from January 1, 1753 through December 31, 9999, to an accuracy of one three-hundredth of a second (equivalent to 3.33 milliseconds or 0.00333 seconds).
smalldatetime
Date and time data in the Gregorian calendar from January 1, 1900, through June 6, 2079, with accuracy to the minute. smalldatetime values with 29.998 seconds or lower are rounded down to the nearest minute; values with 29.999 seconds or higher are rounded up to the nearest minute.

Microsoft SQL Server rejects data that falls outside of these ranges. The actual data is stored internally as two integers (4-byte integers for datetime, and 2-byte integers for smalldatetime) that represent the date and time in question. Because the actual value has no actual connotations in regard to locale-specific formatting, it is up to the developer to define such conversions as needed.

SQL Server 2000 supports many different locale-specific conversions that can be performed at the server instead of relying on custom solutions from developers. These date styles can be accessed through the CONVERT function, which takes a data type, an expression, and an optional style, as shown in the following table.

W/centuryW/o centuryStandardInput (converting to datetime)
Output (converting to text)
0 or 100-Defaultmon dd yyyy hh:miAM (or PM)
1011USAmm/dd/yy
1022ANSIyy.mm.dd
1033British/Frenchdd/mm/yy
1044Germandd.mm.yy
1055Italiandd-mm-yy
1066-dd mon yy
1077-Mon dd, yy
1088-hh:mm:ss
9 or 109 -Default + millisecondsmon dd yyyy hh:mi:ss:mmmAM (or PM)
11010USAmm-dd-yy
11111JAPANyy/mm/dd
11212ISOyymmdd
13 or 113 -Europe default + millisecondsdd mon yyyy hh:mm:ss:mmm(24h)
11414-hh:mi:ss:mmm(24h)
20 or 120 -ODBC canonicalyyyy-mm-dd hh:mi:ss(24h)
21 or 121-ODBC canonical + millisecondsyyyy-mm-dd hh:mi:ss.mmm(24h)
126-ISO8601 (no spaces)yyyy-mm-dd Thh:mm:ss:mmm
130-Kuwaiti (Hijri)dd mon yyyy hh:mi:ss:mmmAM
131-Kuwaiti (Hijri)dd/mm/yy hh:mi:ss:mmmAM

The following example shows how the CONVERT function is used:

SELECT CONVERT(char, GETDATE(), 100) AS [100]
Aug 16 2000 11:50AM 

You can then convert the data from a string to a date value in much the same manner:

SELECT CONVERT(datetime, 'Aug 16 2000 11:50AM', 100) AS [100]

It is worth noting that when you convert such dates with Style 130 (Kuwaiti or Hijri), converting to the char data type may result in the data corruption if the collation is not one of the Arabic collations that use code page 1256 for Unicode conversions. The following illustration (Figure 1) shows this issue.

Aa902644.intlfeainsqlsvr2000.rtf01(en-us,SQL.80).gif

Figure 1. Convert date/time Transact-SQL

Note that on a U.S. client computer, attempting to use the char data type results in the Arabic characters being converted into question marks, and the nchar data type rendering Arabic characters. This particular string is one that is still not properly formatted (as it would be on an Arabic) client computer due to limitations in the SQL grid in SQL Query Analyzer. The following illustration (Figure 2) shows how the actual Hijri date string should appear.

Aa902644.intlfeainsqlsvr2000.rtf02(en-us,SQL.80).gif

Figure 2. Hijri date string

This is because complex scripts, such as Arabic, have shaping rules that must be applied so that the data can be rendered properly. In the case of bidirectional (BIDI) languages such as Hebrew, it will cause all data to be reversed; the effect is more marked with Arabic. That is because the actual shapes of letters can change, depending on the surrounding letters. This problem does not happen in Windows 2000 or in any earlier, Arabic-enabled versions of Windows.

Additionally, the date string that is returned can itself cause problems in the bidirectional cases where it is needed, because the rules for the layout of bidirectional text used by an application, such as Internet Explorer or Windows 2000, causes the date to appear as shown in the following illustration (Figure 3).

Aa902644.intlfeainsqlsvr2000.rtf03(en-us,SQL.80).gif

Figure 3. Bi-directional date string example

This visual order (dd hh:mi:ss yyyy mon :) is obviously not the order that would be expected; the problem can be considered a general limitation of the 130 styles in the CONVERT function, although you can work around it easily enough by adding the proper Unicode control character in front of the string, as shown in the following query:

SELECT NCHAR(8207) + CONVERT(nchar, GETDATE(), 130)

The NCHAR function returns a character based upon the passed-in Unicode code point; 8207 or hexadecimal 0x200F is the Right-to-Left Marker (RLM), and causes the string to be displayed properly.

Performance and Storage Space

Ideally, every column is defined with one of the Unicode data types; however, doing so when you do not need to support multilingual data can create issues relating to storage space and speed.

Storage Space Issues

The actual amount of space required for the Unicode data types is 2 bytes per character, and the amount of space for the non-Unicode data types is 1 byte for all non-DBCS text and 2 bytes for Asian languages that use DBCS. Therefore, unless your data is on one of the Asian code pages, you will be using twice as much space to store the data. This must be considered when you upgrade existing databases or when you are deciding on the proper data types of new projects. If you are storing data only in a column that is on a single (non-Asian) code page, you might prefer to not use Unicode so you can save the space on disk and in memory.

Speed Issues

The speed issue is a complicated one. Here are some of the issues:

  • If you are running on Windows NT or Windows 2000, the kernel is expecting Unicode data, and thus non-Unicode columns will have to be converted in many cases, such as when you display data or use the operating system services.
  • There is the additional time needed to load the larger amount of data that must also be considered when you are dealing with DBCS data.
  • If you are dealing with a Windows 95 or Windows 98 client or server, much of the information can also face conversions from Unicode when operating-system services, such as data display, are needed.
  • If you are working between servers (see Communication Between Server and Client later in this article), database server products, or other products, the number of conversions can also play a great role in the performance battle.
  • If you are dealing with Asian languages, Unicode will actually be faster than using the language-specific DBCS code page. This is because DBCS data does not have fixed width; it is a mixture of double-byte and single-byte characters.
  • If you are dealing with non-Asian languages, sorting Unicode data can be up to 30-percent slower than non-Unicode data. This can be considered one cost of being able to represent global data.
Important   To realistically evaluate a performance issue, you must test to get conclusive data about the situation.

Metadata Information In System Tables

The system tables in SQL Server 2000 store all the data they contain as Unicode. This minimizes the problems that can occur with different collations between databases and columns. There is no other way to deal with the fact that different databases on the same server can have any mix of Unicode and non-Unicode column names. Even if you are supporting only a single language now, SQL Server must be ready to support any language you might choose to support in the future.

When you are converting databases and servers from SQL Server 6.5 or earlier, it is easy to become concerned about the metadata that is being converted; however, there is no need to worry. The conversion to Unicode is straightforward because these earlier versions of SQL Server are only in one code page/collation, defined at the server level.

One important issue involves using identifiers for objects in the system tables. SQL Server 2000 uses the Unicode 2.0 character property definition to create the list of valid characters in identifiers (Unicode 3.0 was not released when SQL Server 2000 development was completed). To avoid running into issues with international characters that are not defined in the Unicode 2.0 character property definition, you should limit your identifiers with brackets ([]) or double quotation marks ("). This prevents the server from checking for valid characters.

Collation

One of the things that everyone takes for granted is sorted data; after all, what could be more basic than the alphabet? Some of us may acknowledge languages that use different sets of letters such as Greek, Russian, Thai, or Japanese. But at least in the United States everyone seems to think that if they can count on anything, it's the alphabet.

The problem is that they are wrong! Whether or not you choose to understand why users who know Spanish might expect the letter combination "ch" to sort as a single character after the letter "h", you must understand that non-English languages have different sorting expectations. In general, one of the most effective ways to alienate end users of an application is to get a task such as basic sorting wrong.

The way this works is through collations, or sort orders, and a technique known as string normalization. This is a different meaning of "normalization" than database developers are used to, as it is a not a design issue; when you refer to string normalization, you are considering how to compare two strings so that they can be sorted. This can be optimized through the creation of indexes.

For non-Unicode columns, collation has a second meaning that is very important: Collation specifies the code page for the data and, therefore, what characters can be represented. Data can be moved between Unicode columns seamlessly, while data moved between non-Unicode columns cannot.

Collation in SQL Server 6.5 and Earlier

In SQL Server version 6.5 and earlier, collation was also relied on to specify the code page to use for language in general. There are some limitations related to different sort orders—in various Latin languages, for example. Also, you could only support Western European languages if you used Latin-1. Because of this, the number of different locales&151;that is, the number of different languages used in a specific region—that could be represented in information on a single instance of SQL Server was limited. The basic issues also apply to the collation of non-Unicode fields in later versions of SQL Server. In addition, the issue about languages that have a "best fit" code page, such as Farsi, (mentioned previously in Non-Unicode Text Types: char, varchar, text) can apply here.

Collation in SQL Server 7.0

SQL Server 7.0 has one Unicode and one non-Unicode collation per server. The non-Unicode collations are made up of the decision for code page and sort order ID, because each code page can support more than one sort. For example, Latin languages will usually allow both case sensitive and case insensitive sorts, and Simplified Chinese will allow sorts both by stroke count and phonetic sorts.

In the Unicode collation, any character of any language can be included in the column, so the individual collations that are available are there to make sure any collation-specific differences are properly handled. This is the proper solution to the "best fit" problem, because, for example, sorting Farsi data with the general Unicode collation gives users the data they expect. A Unicode collation consists of a locale and several comparison styles. Locales are usually named after countries or cultural regions. They sort characters according to the standard in that area. The Unicode collation still provides a sort order for all characters in the Unicode standard, but precedence is given to the locale specified.

The following table lists the supported, unique Unicode collations in SQL Server 7.0. Any locale not listed should use the General Unicode Collation.

Locale ID (LCID)Description
1033General Unicode
33280Binary Order
1027Catalan
197636Chinese Bopomofo (Taiwan)
2052Chinese Punctuation
133124Chinese Stroke Count
1028Chinese Stroke Count (Taiwan)
1050Croatian
1029Czech
1043Dutch
1061 Estonian
1036French
66615Georgian Modern
1031German
66567German Phone Book
1038Hungarian
66574Hungarian Technical
1039Icelandic
1040Italian
1041Japanese
66577Japanese Unicode
1042Korean
66578Korean Unicode
1062Latvian
1063Lithuanian
1071 Macedonian (FYROM)
1044Norwegian/Danish
1045Polish
1046Portuguese
1048Romanian
1051Slovak
1060Slovenian
1034Spanish (Traditional)
3082Spanish (Spain)
1053Swedish/Finnish
1054Thai
2057UK English
1058 Ukrainian
1066Vietnamese

As you can see by this list, not all languages are included; this is okay, however, as not all of them have to be. For example, the general Unicode sort order properly handles not only the data, but also the sorting of Afrikaans, Albanian, Arabic, Basque, Belarusian, Bulgarian, English, Faeroese, Farsi, Georgian (Traditional), Greek, Hebrew, Hindi, Indonesian, Malay, Russian, Serbian, Swahili, and Urdu. The other languages listed in the table, however, have one or more differences from the general Unicode collation.

It should be emphasized that the developers of SQL Server are not "political" people and there really is no desire to offend any one country/region by asking them to "use another country/region's sort order." In working with customers in other countries/regions, just use the numbers, because the names are really arbitrary descriptions. What is most important is that you can choose a collation that will allow your data to be handled appropriately.

One very important change in SQL Server 7.0 is the provision of an operating system-independent model for string comparison, so that the collations between all operating systems from Windows 95 through Windows 2000 can be consistent. This code, based on the same code that Windows 2000 uses for its own string normalization, is encapsulated to be the same on all computers. With this change, SQL Server no longer relies on the operating system for its international functionality—from the smallest MSDE installation to the largest SQL Server Enterprise Edition.

Collation in SQL Server 2000

In SQL Server 2000, the collation model has changed because:

  • The requirement for two different collations was confusing.
  • A more flexible model was needed to handle all of the new places for which a collation could be specified.
  • In SQL Server 2000, collations are also used to handle the code page of non-Unicode columns; therefore, more collations were required.

A single, consistent model was designed to handle both Unicode and non-Unicode sorts. This model supports the languages displayed in the following list.

Collation in SQL Server 2000  
AlbanianArabicChinese_PRC
Chinese_PRC_StrokeChinese_Taiwan_BopomofoChinese_Taiwan_Stroke
Cyrillic_GeneralCroatianCzech
Danish_NorwegianEstonianFinnish_Swedish
FrenchGeorgian_Modern_sortGerman_PhoneBook
GreekHebrewHindi
HungarianHungarian_TechnicalIcelandic
JapaneseJapanese_UnicodeKorean_Wansung
Korean_Wansung_UnicodeLatin1_GeneralLatvian
LithuanianLithuanian_ClassicMacedonian (FYROM)
Spanish (Spain)PolishRomanian
SlovakSlovenianThai
Traditional_SpanishTurkishUkrainian
Vietnamese  

Each of these collations is combined with a series of suffixes that help define whether there is case, accent, width, or kana sensitivity. The exact suffixes that are possible are shown in the following table. Each of the 40 languages in the previous list supports the 17 suffixes in the following table, for a total of 680 Windows collations.

Suffix for collationMeaning
_BINBinary sort
_CI_AICase-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive
_CI_AI_WSCase-insensitive, accent-insensitive, kanatype-insensitive, width-sensitive
_CI_AI_KSCase-insensitive, accent-insensitive, kanatype-sensitive, width-insensitive
_CI_AI_KS_WSCase-insensitive, accent-insensitive, kanatype-sensitive, width-sensitive
_CI_ASCase-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive
_CI_AS_WSCase-insensitive, accent-sensitive, kanatype-insensitive, width-sensitive
_CI_AS_KSCase-insensitive, accent-sensitive, kanatype-sensitive, width-insensitive
_CI_AS_KS_WSCase-insensitive, accent-sensitive, kanatype-sensitive, width-sensitive
_CS_AICase-sensitive, accent-insensitive, kanatype-insensitive, width-insensitive
_CS_AI_WSCase-sensitive, accent-insensitive, kanatype-insensitive, width-sensitive
_CS_AI_KSCase-sensitive, accent-insensitive, kanatype-sensitive, width-insensitive
_CS_AI_KS_WSCase-sensitive, accent-insensitive, kanatype-sensitive, width-sensitive
_CS_ASCase-sensitive, accent-sensitive, kanatype-insensitive, width-insensitive
_CS_AS_WSCase-sensitive, accent-sensitive, kanatype-insensitive, width-sensitive
_CS_AS_KSCase-sensitive, accent-sensitive, kanatype-sensitive, width-insensitive
_CS_AS_KS_WSCase-sensitive, accent-sensitive, kanatype-sensitive, width-sensitive

These language names are arbitrary and were chosen to properly represent each unique supported code page for non-Unicode data and sort order for all data. In the many cases where a language can be completely represented on another code page, or when a sort order that one language needs is covered by another, the language was "removed" from the list because it is adequately covered. Note that the default setting of kana sensitivity and width sensitivity is set to insensitive.

To ensure that code pages from earlier versions of SQL Server are properly supported, many backward-compatible, SQL-specific sort orders are also included in SQL Server 2000. These SQL-specific sort orders are listed below. Many of these support some of the various parts of the suffixes described previously, but not all of the suffixes are supported.

SQL-specific sort orders  
SQL_1xCompat_CP850SQL_Estonian_CP1257SQL_Latin1_General_Pref_CP437
SQL_AltDiction_CP1253SQL_Hungarian_CP1250SQL_Latin1_General_Pref_CP850
SQL_AltDiction_CP850SQL_Icelandic_Pref_CP1SQL_Latvian_CP1257
SQL_AltDiction_Pref_CP850SQL_Latin1_General_CP1SQL_Lithuanian_CP1257
SQL_Croatian_CP1250SQL_Latin1_General_CP1250SQL_MixDiction_CP1253
SQL_Czech_CP1250SQL_Latin1_General_CP1251SQL_Polish_CP1250
SQL_Danish_Pref_CP1SQL_Latin1_General_CP1253SQL_Romanian_CP1250
SQL_EBCDIC037_CP1SQL_Latin1_General_CP1254SQL_Scandinavian_CP850
SQL_EBCDIC273_CP1SQL_Latin1_General_CP1255SQL_Scandinavian_Pref_CP850
SQL_EBCDIC277_CP1SQL_Latin1_General_CP1256SQL_Slovak_CP1250
SQL_EBCDIC278_CP1SQL_Latin1_General_CP1257SQL_Slovenian_CP1250
SQL_EBCDIC280_CP1SQL_Latin1_General_CP437SQL_SwedishPhone_Pref_CP1
SQL_EBCDIC284_CP1SQL_Latin1_General_CP850SQL_SwedishStd_Pref_CP1
SQL_EBCDIC285_CP1SQL_Latin1_General_Pref_CP1SQL_Ukrainian_CP1251
SQL_AltDiction_CP1253SQL_Hungarian_CP1250 
SQL_Latin1_General_Pref_CP850  

You can retrieve the actual information about collations by using the COLLATIONPROPERTY function. In addition to the CodePage value used earlier, you can also pass other information types, such as LCID, which returns the Windows Locale ID (returns Null for SQL collations). You can also specify Windows ComparisonStyle (returns Null for both Binary and SQL collations). This information can be used to verify that there is indeed an equivalence between string normalization in Windows 2000 and SQL Server 2000 for all of the Windows collations.

All of the available collations can be returned by using the fn_helpcollations() function, for example:

SELECT * FROM ::fn_helpcollations()

This query returns 753 rows in SQL Server 2000. Additional collations cannot be added, unless they are added in service packs or future versions.

How Collations Specify the Sorting of Data

It is important to briefly explain how collations actually operate on Unicode data. As a general rule that has no exception, every single defined collation in SQL Server on a Unicode column will sort all defined Unicode characters. There are many different collations because there are many differences in how the data can be sorted. A good example of this is the Georgian modern sort. Although the traditional sorting of Georgian text places all of the characters in a specific order, it is common for modern usage to place certain rarely used characters at the end. These characters are:

  • HE, which displays as: Aa902644.intlfeainsqlsvr2000.rtf21(en-us,SQL.80).gif
  • HEI, which displays as: Aa902644.intlfeainsqlsvr2000.rtf22(en-us,SQL.80).gif
  • WE, which displays as: Aa902644.intlfeainsqlsvr2000.rtf23(en-us,SQL.80).gif
  • HAR, which displays as: Aa902644.intlfeainsqlsvr2000.rtf24(en-us,SQL.80).gif

Thus, there are two ways to sort the Georgian alphabet, as displayed in Figures 4 and 5.

Aa902644.intlfeainsqlsvr2000.rtf04(en-us,SQL.80).gif
Figure 4. Traditional way to sort the Georgian alphabet

Aa902644.intlfeainsqlsvr2000.rtf05(en-us,SQL.80).gif
Figure 5. Modern way to sort the Georgian alphabet

This does not stop any other Unicode data from being sorted according to the same sort provided in the Latin1_General collation. In fact, all collations sort Georgian in the traditional form, with the sole exception of the Georgian_Modern_Sort collations. The same rule applies to all other collations; only the exceptions change between collations.

Collations Specified at Multiple Levels

In SQL Server 2000, collations can be specified:

  • At the server level
  • At the database level
  • At the column level
  • In an expression

Collation specified at the server level

The server level is where the collation always used to be, and in many cases is the only collation you ever need to set. This collation acts as the default for all collations in all databases on the server when they are created, if you do not explicitly set the collation at the database level. Because a database is always given a collation, the server-level collation is never really consulted except when the database is created.

You can change this collation without rerunning Setup by using the Rebuild Master utility (RebuildM.exe), which is located in Program Files\Microsoft SQL Server\80\Tools\BINN directory. For more information, see the topic "Rebuild master Utility," in SQL Server Books Online for SQL Server 2000.

You can also query the server for the collation with the Transact-SQL SERVERPROPERTY function, for example:

SELECT CONVERT(char, SERVERPROPERTY('collation'))

Collations at the database level

Every database can have a unique collation, with the sort order being set at the database level. The following illustration (Figure 6) shows how the collation is set using SQL Server Enterprise Manager.

Aa902644.intlfeainsqlsvr2000.rtf06(en-us,SQL.80).gif

Figure 6. Setting Collation using the Enterprise Manager

You can set the collation order with Transact-SQL, as well. For example, to create a new database in Czech Republic sort order, case and accent sensitive, use a statement such as:

USE master
GO
CREATE DATABASE Products
ON 
( NAME = products_dat,
   FILENAME = 'c:\program files\microsoft sql server\mssql\data\products.mdf' )
COLLATE Czech_CS_AS
GO

Interestingly, you can even change the collation of an existing database using the ALTER DATABASE statement (this is not available using SQL Server Enterprise Manager). For example, the following statement changes the collation of the Products database from Czech_CS_AS to Czech_CI_AI (case and accent sensitive to case and accent insensitive):

ALTER DATABASE Products
COLLATE Czech_CI_AI

Considerations before changing the collation of a database

To change the collation of a database, all of the following must be true:

  • No one else can be using the database.
  • No schema-bound object can be dependent on the database collation. Schema bound objects that qualify include any of the following:
    • User-defined functions and views created with SCHEMABINDING
    • Computed columns
    • CHECK constraints
    • Table-valued functions that return tables with character columns with collations inherited from the default database collation.
  • The act of trying to change the collation of the database does not result in duplicates among any of the system names. This is very easy to imagine, for example, in the case of trying to change the collation from French_CI_AS to French_CS_AS (case insensitive to case sensitive). With the collation in earlier versions of SQL Server, it is possible to have two tables named Table1 and TABLE1; whereas, in the collation in SQL Server 2000, this causes duplicates. Objects that can potentially cause such duplications include:
    • Object names (such as procedure, table, trigger, or view).
    • Schema names (such as group, role, or user).
    • Scalar-type names (such as system and user-defined types).
    • Full-text catalog names.
    • Column or parameter names within an object.
    • Index names within a table.

None of these restrictions is unrealistic, and they keep you from doing things that can corrupt your data or your database. In fact, some might think the rules are not strict enough! If you have data in a text, varchar, or char field and there is no explicit collation on the column, changing the collation of the database alters the way that the data's encoding is interpreted, resulting in a form of corruption of any characters beyond the ASCII range (which is contained by all code pages). Rather than tempting fate in this way, you should really avoid changing the collation of any database that contains text data columns that are not one of the Unicode types, unless those columns have their own explicit collations set (see Collations specified at the column level below).

You can also use the Transact-SQL DATABASEPROPERTYEX function to find out the collation of a database, for example:

SELECT CONVERT(char, DATABASEPROPERTYEX('pubs', 'collation'))

Collations specified at the column level

In SQL Server 2000, you can change the collation of text in a particular column. This can be very useful, for example, in a circumstance when you might need to force case sensitivity for a password column. Different language columns would be useful in other scenarios. For example, a customer name might need to be in Unicode using Latin1_General for the broadest appropriate sorting, and a product line might always be in Greek, in which case a Greek collation might make sense. The following illustration (Figure 7) displays a collation specification during table design, using SQL Server Enterprise Manager.

Aa902644.intlfeainsqlsvr2000.rtf07(en-us,SQL.80).gif

Figure 7. Specifying Collation during table design using the Enterprise Manager

When you click on the "..." button, the dialog box shown in the following illustration appears. In this dialog box (Figure 8), you can choose a collation.

Aa902644.intlfeainsqlsvr2000.rtf08(en-us,SQL.80).gif

Figure 8. Collation dialog box

You can also set column-level collations using Transact-SQL. In the CREATE TABLE statement, simply add a COLLATE clause to the column definition. In the following example, the job description has a collation set for Arabic (case and accent insensitive, kanatype insensitive).

CREATE TABLE jobs
(
   job_id  smallint
      IDENTITY(1,1)
      PRIMARY KEY CLUSTERED,
   job_desc varchar(50)
      COLLATE Arabic_CI_AI_KS
      NOT NULL
      DEFAULT 'New Position - title not formalized yet',
)

You can use the ALTER TABLE statement to change the collation at the column level (except for an ntext or text column) with a new data type. However, you can change the collation on an ntext column using SQL Server Enterprise Manager. This is because SQL Server Enterprise Manager creates a temp table, moves data to the new temp table, drops the old table, creates a new table with the new collation, and then copies the data back to the new table.

Collations specified in expressions

There may be many instances when you need to display data to people in different countries and want locale-appropriate sorting. With SQL Server 2000, you can specify collations in expressions. This powerful new feature allows you to sort in a particular manner so that the ORDER BY clause can be language-specific.

For example, the following query sorts the Customers table by last name and first name, (using the Lithuanian sort order provides a good example of collation differences because the rules about how the letter Y sorts after the letter are so striking and noticeable).

SELECT 
  * 
FROM 
  tblCustomers 
ORDER BY 
  LastName COLLATE Lithuanian_AI_CI,
  FirstName COLLATE Lithuanian_AI_CI

You can also refer to the examples in SQL Server Books Online for SQL Server 2000 that relate to regular comparisons, such as:

SELECT 
  * 
FROM 
  Table1 
WHERE 
  Field1 = Field2 COLLATE Turkish_ci_ai

Assuming that Table1 has no explicit column-level collations, both columns are compared with the Turkish sort order. For a more complete explanation on why this is the case, see The rules of precedence for collations later in this article.

The COLLATE keyword

The syntax for using the COLLATE keyword is:

COLLATE [<Windows_Collation_name>|<SQL_Collation_Name]

The choice of collation name can be a difficult one to make. The keyword can be specified at the database level, the column level, or in expressions. As a rule, any time the field is not a Unicode one (ntext, nvarchar, or nchar) the earlier process in which the collation is converted into a code page is used.

There are two types of collations:

Windows collations
These are defined by Windows. You have full options to specify case, accent, kana, and width sensitivity, as well as being able to define binary sorts.
SQL collations
These collations are defined by SQL Server for legacy reasons. You do not have full options to configure these sorts.

Generally, you should try to use the Windows collations whenever you can. The following illustration (Figure 9) presents a simple example of how sorting rules can change. In this example the pubs database is used. Whether or not Y comes between I and J, or between X and Z, depends on whether or not a Lithuanian collation is used, which clearly affects the way items in the query are ordered.

Aa902644.intlfeainsqlsvr2000.rtf09(en-us,SQL.80).gif

Figure 9. Example of collations' effect on sorting

The rules of precedence for collations

In SQL Server 2000, you can specify a collation at the server, in the database, in the column, and in an expression. How do these interact?

The interaction works with some straightforward rules that may seem confusing at first, but are needed. The following table shows an interaction in which A and B act as the two different parts of the comparison.

 Explicit BImplicit BDefaultNo Collation
Explicit ARun-time ErrorExplicit AExplicit AExplicit A
Implicit AExplicit BNo CollationImplicit ANo Collation
DefaultExplicit BImplicit BDefaultNo Collation
No CollationExplicit BNo CollationNo CollationNo Collation

The terms in this table are defined below.

Explicit A/Explicit B
A collation is explicitly defined for a given expression.
Implicit A/Implicit B
A collation has been defined at the column level.
Default
The database-level collation is being used.
No Collation
There is a conflict between the two operators; the expression will be handled with no collation.

As you can see, the only instances that SQL Server cannot handle an expression is when you explicitly define two different, conflicting collations, or when you try to compare two items and no common ground for the comparison can be found. These are not really blocking limitations; they are understandable rules. SQL Server simply needs you to provide some basis for comparison!

For example, consider the following Transact-SQL statement for creating a table:

CREATE TABLE TestTab (
   id int, 
   GreekCol nvarchar(10) COLLATE greek_ci_as, 
   LatinCol nvarchar(10) COLLATE latin1_general_cs_as
   )
INSERT TestTab VALUES (1, N'A', N'a')
GO

This statement creates a table with one column using a case-insensitive, accent sensitive Greek collation and the other column using a case-sensitive, accent sensitive General Latin1 collation.

You could attempt to use a query to explicitly compare the two:

SELECT * 
FROM TestTab 
WHERE GreekCol = LatinCol

However, this returns an error:

Msg 446, Level 16, State 9, Server V-MICHKA3, Line 1
Cannot resolve collation conflict for equal to operation.

This occurs because the server cannot compare the two segments of text with different collations. If, however, you use the COLLATE keyword to explicitly create an expression that allows them to be compatible, the query will work like this:

SELECT * 
FROM TestTab 
WHERE GreekCol = LatinCol COLLATE greek_ci_as

You will also want to note that although LatinCol usually has a case-sensitive collation, the case-insensitive collation of the expression overrides this and allows the uppercase and lowercase 'A' to be treated as equal.

Limitations of the COLLATE Keyword

The COLLATE keyword and all of its related features are pretty incredible and this author believes they are without compare among contemporary enterprise database products. There are some limitations, however, which I've listed below. Note these limitations all have workarounds. The limitations are described here to help you understand what you can do directly and what requires a little extra work.

Returning less than a full list of collations

The fn_helpcollations function (see the illustration in Collations specified at the column level earlier in this article) returns a full list of collations. However, as shown by the dialog box presented in Collations at the database level earlier in this article, SQL Server clearly can list one locale (such as Albanian) and provide the rest of the flags as options, returning the full string in the end. If you want to provide a user interface for this functionality, you must do a little extra work yourself.

Issues with defining collation at the column level

How often would you have a database that needs one sort order (for example, Latin1_General) and a column that needs a different one (for example Greek)? Sometimes, this may be crucial, but in other cases if the data in your database does not use a single collation, then it is probably multilingual data that may need to be sorted according to more than one collation. Being able to define multiple collations, each of which can be indexed, allows you to access the Greek data by specifying the Greek collation, and to have this query be an indexed search.

That last clause "and to have this query be an indexed search" is the crux of the matter. In the example provided earlier, using a COLLATE expression in the ORDER BY clause of a query gives you the functionality; however, this will not be an indexed ordering, so it will be slower for large datasets. As it stands, column-level collation makes sense only if you do not have monolingual data in a column, or if you denormalize your database to store different languages in different columns.

LCIDs and collations

Windows uses the locale ID (LCID) to define sorts. If you are doing work to format your results, you will probably already have the LCID on hand (or you can use the default LCID by specifying 1024 or using Microsoft Visual Basic® formatting functions to do your work). In fact, if you are doing this work in a Web-based ASP application, you can use the SetLocale function in Microsoft Visual Basic Scripting Edition (VBScript) to change the formatting to use the date/time, number, and currency formatting preferences of any locale. Unfortunately, there is no way to map the two: You can get an LCID from a collation, but you cannot get a collation from an LCID due to many to 1 mapping from LCID to collation.

Why is this inconvenient? Well, imagine a case where you have a multilingual Web site, with people visiting from different countries and examining product information. You may already be mapping their browsers' HTTP_ACCEPT_LANGUAGE variables to LCIDs for formatting date and currency values using the Session.LCID property, and you decide that sorting using their locale is a sensible option for the sake of usability.

To help build your own mapping function to work around this issue, see the conversion table in the topic "Windows Collation Designators," in SQL Server Books Online for SQL Server 2000.

ISO strings and collations

You can obtain the HTTP_ACCEPT_LANGUAGE variable in VBScript with a script such as:

Dim stLang

stLang = Request.ServerVariables("HTTP_ACCEPT_LANGUAGE")

Recognizing that this value is the only value many Web developers will have when it comes to locale information, the VBScript SetLocale function was designed to not only take LCID values but to accept this value directly. This means you do not have to go through the intermediate step of mapping the value to an LCID. Because SQL Server2000 does not accept a string such as "en-us" (English-United States) and properly map that to a Latin1_General collation, or "vi" (Vietnamese) and map that to a Vietnamese collation, you have to map all of them yourself.

How are custom collations defined?

One commonly asked question that many developers have after seeing the myriad of collation options is how can they define their own. The answer is they cannot. Unless a collation is added to Windows 2000, it cannot be added to SQL Server 2000. That's because a collation is literally designed to define the method for sorting every defined character in the Unicode standard, and there is no user interface designed to allow for creating such a thing.

Note   All the new SQL Server collations are derived from information in Windows, which is why they are referred to as Windows Collations.

Communication Between Server and Client (Code Page and Collation Issues)

In a very small number of cases, everything you do with a SQL Server will happen on the same computer on which the server resides and only SQL Server tools such as SQL Query Analyzer or SQL Server Enterprise Manager will be used. In most cases, however, the server will be interacting with other servers or clients, and might be using one or more data access standards. You will want to be aware of how such issues are handled by SQL Server 2000. In this context, anyone who talks to SQL Server is a client, and there are basically two types of clients:

  • Unicode clients: OLE DB, and ODBC versions 3.7 and later
  • Non-Unicode clients: ODBC version 3.6 and earlier, and DB-Library

One important issue that comes into play with non-Unicode data is the way that the data is translated between code pages or to and from Unicode when ODBC is used. There are two possible settings of the SQL_COPT_SS_TRANSLATE attribute when sent to SQLSetConnectAttr:

  • SQL_XL_OFF

    The driver does not translate characters from one code page to another in character data exchanged between the client and the server.

  • SQL_XL_ON

    The driver translates characters from one code page to another in character data exchanged between the client and the server. The driver automatically configures the character translation, determining the code page installed on the server and that in use by the client.

By default, SQL_XL_ON is the attribute that is used. You can also set it using the SQL-DMO TranslateChar method off the SQLServer object. Usually, this default provides the desired behavior (which is to turn auto_translate on) any time you are dealing with non-Unicode data.

The possible scenarios of client and server connections, with some of their issues, are presented in the topics that follow.

Unicode Server and Client

This is the ideal type of configuration. By keeping data in Unicode throughout the process, you can guarantee the best performance and protection from corruption of retrieved data. This is the case with ADO and OLE DB.

Unicode Server and One or More Non-Unicode Clients

In this type of configuration, you may not have any problems storing data, but there is obviously a serious limitation when it comes to bringing the data to the client and using it. The client code page must be used to convert the Unicode data at some point.

An example of this at the data layer is when you are connecting to a SQL Server 2000 database from a computer that is using DB-Library. DB-Library is a Call Level Interface that allows C applications to access SQL Server. DB-Library has not been upgraded significantly since SQL Server 6.5, and for our purposes this explicitly spells out the limitations that any client using DB-Library will face. Data can be based on only one code page, the default OEM code page for the system. You can also choose whether locale information will be based on the locale settings of the client system or not. As shown in the following illustration (Figure 10), in the DB-Library Options tab of the SQL Server Client Network Utility, you can choose between two options for how DB-Library converts information. Both of these options are selected by default.

Aa902644.intlfeainsqlsvr2000.rtf10(en-us,SQL.80).gif

Figure 10. Default DB-Library Options

Because you cannot handle data on other code pages, the only time DB-Library really makes sense as a data layer is in a legacy system that only needs to deal with a subset of data from SQL Server. Although it is clearly technology that is only there so developers already using it are not forced to rewrite their applications, it may be worth considering such a rewrite if you need to support multilingual data.

Another case of a non-Unicode client is programs that are not Unicode enabled, such as Microsoft Access 97. While an Access database can be linked to a SQL Server 2000 database, there are some limitations you should be aware of. If you connect to a database that, for example, has Japanese table names from a US English computer, you may see a dialog box in which the table names have been converted to question marks. The following illustration (Figure 11) displays an example of such a dialog box.

Aa902644.intlfeainsqlsvr2000.rtf11(en-us,SQL.80).gif

Figure 11. Illustration of table names converted to question marks in Access 97

The cause of this problem is easy to understand; Access 97 is using a version of ODBC that is earlier than 3.7, and, thus, the data is being converted from Unicode to ANSI using the default system code page. Even if you install a later version of ODBC, Jet 3.5 in Access 97 will do the same conversion. Because Japanese characters are not on code page 1252 of the US English computer, they are replaced by question marks.

It is not possible to connect to these tables; connection attempts will result in the error message displayed in Figure 12.

Aa902644.intlfeainsqlsvr2000.rtf12(en-us,SQL.80).gif

Figure 12. Microsoft Access error message

This too, is easy to understand. Once data has been converted to the wrong code page and replaced by question marks, there is no way for it to ever be converted back. This causes Jet and ODBC to literally try to connect to a table named dbo.????, which will obviously fail because it does not exist. This will occur with any data not in that code page.

A similar problem will occur in data in the tables themselves. For example, in a table that contains Korean data, you will see the data presented in a non-Unicode client (such as Access) display as question marks. This is illustrated below in Figure 13.

Figure 13. Example of questions marks in non-Unicode client database (click to enlarge)

Figure 13. Example of questions marks in non-Unicode client database (click to enlarge)

In all three of these clients (DB-Library, ODBC, and Jet 3.5), a component that simply does not understand Unicode, except for how to convert it to the default system code page, will not be able to handle this type of multilingual data. Such a client will be limited to using data that is containable on its default system code page.

Non-Unicode Server and Unicode Client

This is not an ideal configuration for multilingual data, because you will not be able to keep such data on the server. At the very least, however, you can be sure the data will show up properly. This configuration has all the limitations of the previous case, without the risk of corrupting received data due to invalid conversions. An example is when a SQL Server 2000 database defines a linked server to a SQL Server 6.5 database. All information that is received from the server running SQL Server 6.5 will be valid, but do not try to insert any off code page data!

Non-Unicode Server and Client

This is the most limiting configuration, because you are basically restricted to a single code page at all times.

Conversion of Multilingual Data from Earlier Versions of SQL Server

Not all users could wait for the Unicode features included in SQL Server 7.0 and SQL Server 2000 to handle their multilingual data. As a result, some users have created custom encoding schemes to store such data. If you were thinking ahead in such a way, then you will need to use the bulk copy utility (bcp) to save the data as binary (which means no conversion), and then bulk copy back in using the appropriate code page with the -C command line parameter. For more information about the bcp utility, see Using the bcp Utility with Multilingual Data later in this article.

Using the Access 2000 New ADP Format

Microsoft Access 2000 adds a new file format option that is not a Jet database, but rather an Access Data Project (ADP). These files can act directly as a front end to SQL Server.

The capabilities of ADPs are beyond the scope of this article, but two important issues to point out are:

  • Access 2000 does not support SQL Server 2000 unless the SQL Server 2000 client tools are installed.
  • In all of the entry points to SQL Server data within Access (forms, data access pages, table datasheet view, ADO), there is one layer that sits between Access and SQL Server: COM. This can affect the input of date/time values, numbers, and currency values, because the regional settings of the client (in this case, the computer on which the ADP is sitting) are used to interpret the meaning of the data. This is important to keep in mind when you are using Access, because SQL Server's rules are less restrictive in some cases. For more information, see Dealing with COM's Locale Interference later in this article.

Multilingual Data In the User Interface

Although SQL Server is first and foremost a server, it has many administration and management tools. These tools have been updated in SQL Server 2000 to support multilingual data as needed.

General UI Changes (Unicode Support)

SQL Server Enterprise Manager does a fairly good job of handling table names that are on the current default code page or server code page. It can also take advantage of some of the font linking technologies in Windows to "borrow" characters from other fonts when needed, as shown in the following illustration (Figure 14).

Aa902644.intlfeainsqlsvr2000.rtf14(en-us,SQL.80).gif

Figure 14. Font linking technology example

As shown, font linking cannot do 100 percent of the job. There are many languages (such as Armenian, Sylfaen, Georgian, and Hindi) about which Windows has no advanced font linking information. Also, when a language uses characters that are not as commonly used within a script (such as Azeri – Cyrillic), most of the string will be displayed, but a few characters might not be.

There are three interesting things to note:

  • Any time the name is not supported within the font and font linking cannot be done, you will not see characters corrupted; instead, you will see the boxes shown above, which indicates there is a character that could not be displayed.
  • Because the SQL Server 2000 client tools do not use the Uniscribe technology for rendering complex scripts, unless you are using Windows 2000, the bidirectional languages (such as Hebrew, Yiddish, Arabic, and Farsi shown in the previous illustration) will have their characters displayed backwards. On Windows 2000 and BIDI-enabled platforms, however, these characters will display properly. Other complex script rendering issues such as Thai word breaking can see the same limitations (although not on Windows 2000).
  • The "base" font used by SQL Server Enterprise Manager is the one defined in the desktop display settings of the computer and cannot be overridden.

Multilingual Information in the Grid and SQL Panes of SQL Query Analyzer

In SQL Server Enterprise Manager, you cannot make font changes; however, in SQL Query Analyzer, in the Fonts tab in the Options dialog box (Figure 15), you can explicitly change the font for many parts of the user interface.

Figure 15. SQL Server 2000 Query Analyzer font dialog box (click to enlarge)

Figure 15. SQL Server 2000 Query Analyzer font dialog box (click to enlarge)

The reason for this may not be apparent at first, because font linking seems to allow most strings to be displayed. However, there is more to the proper display of a string than simply finding a font that can represent the characters. Often, the font choice can make a difference—therefore, this functionality can be important for properly representing multilingual data. In cases in which font linking does not seem to work, this feature will enable you to see the string instead of boxes.

Format Issues in the Query Designer

In the Query Designer, for the most part, you can enter information in the grid pane that matches the default regional settings of the computer, or you can explicitly use the CONVERT function to cause a string in an arbitrary format to be handled.

There are a few design limitations with regard to this method of using the regional settings that you should be aware of:

  • Long data formats are not supported.
  • Currency symbols should not be entered in the grid pane, although the US dollar sign ($) can optionally be used. Either way, the currency symbol retrieved from the regional settings will be used in the Results pane.
  • Unary minus always appears on the left side without parentheses, regardless of the regional settings. Thus, -1 should be represented as -1 instead of 1- or (1) or any other valid variation that may be specified in the Regional Options dialog box.

These limitations are necessary to allow a certain amount of worldwide support in the Query Designer, and are not actually something that will block most efforts to use locale-specific data.

Note that any information entered in the Grid pane will be translated to a locale-independent format in the SQL pane, thus "03.09.65" on a Standard German computer will be translated to { ts ' 1965-09-03 00:00:00 }. All data entered directly into the SQL pane should be in this format or, otherwise, include an explicit CONVERT call.

Sort Order

The sorting of data displayed in the Results pane is not influenced by the Regional Settings; instead, the collation rules (presented in Collation in SQL Server 2000 earlier in this article) control how an ORDER BY clause is interpreted.

Double-Byte (DBCS) Characters

You can enter DBCS characters for literals and database object names, aliases, parameter names, and parameter marker characters. However, you cannot use DBCS characters for SQL language elements such as function names or SQL keywords. Thus, you should use the keyword SELECT instead of the Japanese Fullwidth Aa902644.intlfeainsqlsvr2000.rtf29(en-us,SQL.80).gif

Getting to SQL Server Data (Data Access Methods)

How you access data in SQL Server can be crucial. There are many different data access methods, and the rules for how each handles multilingual text can be very important. The sections that follow describe several of these data access methods.

OLE DB

OLE DB is the central component of the Microsoft Data Access Components (MDAC), which is MDAC version 2.1 in SQL Server 7.0 and MDAC version 2.6 in SQL Server 2000. OLE DB is based on COM, and thus all the strings are Unicode BSTRs (UTF-16 on Windows 2000, and UCS-2 on all other operating systems). For SQL Server, the provider is the Microsoft OLE DB Provider for SQL Server (SQLOLEDB). Data is converted to Unicode as needed, using the collation of the actual data. The most optimized case would require you to keep the data as Unicode throughout the process.

ADO

Microsoft ActiveX® Data Objects is a Visual Basic and scripting friendly interface that acts as a wrapper around OLE DB. It is also a COM component and thus has the same support for Unicode. There is no way to decouple ADO and OLE DB in a way that would ever allow conversions to happen between the two, so when problems do exist, they will always be at the OLE DB layer.

ODBC

Whether ODBC is a Unicode layer depends on the version of ODBC being used. For the rules that apply to the use of ODBC, see Communication Between Server and Client earlier in this article.

DB-Library

There is no later Unicode version of DB-Library. For more information, see Communication Between Server and Client earlier in this article.

SQL-DMO

SQL Distributed Management Objects (SQL-DMO) is a COM layer that encapsulates SQL Server 2000 database and replication management. Because it is COM, the same rules that applied to ADO and OLE DB apply to SQL-DMO. SQL-DMO also has properties that can be used for features mentioned earlier, such as the Collation property on the SQLServer2, Database2, Column2, SystemDateType2, and UserDefinedDataType2 objects.

Multilingual Transact-SQL

When you send an SQL statement to the server that contains multilingual data, the two primary issues that will affect whether the data makes it to the server properly are the:

  • Encoding of the SQL statement itself
  • Encoding of string literals inside the statement

Encoding of String Literals in an SQL Statement

After the SQL string itself is encoded, a technique for handling string literals must also be employed. Essentially, the only choices here are a string in the computer's default code page or a Unicode string: the latter is designated by placing an "n" (for National) prefix in front of the string, as in:

Aa902644.intlfeainsqlsvr2000.rtf30(en-us,SQL.80).gif



This string (the Hindi word for the Hindi language) will be converted to "??????" if the "n" prefix is not placed there. This will also happen with data that does have a code page, but does not match the system defaults.

Warning: Remember that the use of the "n" prefix to represent Unicode data in both string literals and data types (nchar, nvarchar, and ntext) is specific to SQL Server. The ANSI-92 SQL specification does define the National character data types but does not specify them as having to be Unicode. The ANSI-99 SQL specification (which was not complete when SQL Server 2000 was released and will likely be discussed and modified from its present status in regards to Unicode support) does discuss using a set of Unicode types with a "u" prefix (for example, utext, uchar, and uvarchar). These data types are not available in SQL Server 2000.This is not true for some other server database products. For more information, see Interacting with Other Database Products later in this article.

SQL String Encoding

If the SQL string uses Unicode (as would any SQL string using ADO, for example), you can encode any type of character. If the string does not use Unicode (such as a string in a non-Unicode batch file or .SQL file), the conversion will have to be made at some point, and usually this will be done with the default system code page of the computer on which the conversion is done. This can be very problematic, if not planned properly, in a multilingual application.

If a string literal is not in Unicode (marked with the n prefix), it will be converted to Unicode using the database's default code page. With multilingual data, it is best to use a Unicode data type and Unicode string literals.

String-Handling Functions

Transact-SQL has built-in string-handling functions that have important multilingual considerations:

ASCII
Returns the code point of the first character in a string using the current default system code page. If the character is not on that code page, a 63 is returned (the code point for a question mark). This is similar to the Asc() function in Visual Basic and VBScript.
CHAR
Returns a character given the ANSI code point; essentially the inverse operation of the ASCII function, it is similar to the Chr() function in Visual Basic and VBScript. If the code point is not in the 0-255 range, it returns Null.
NCHAR
The Unicode equivalent of the CHAR function. It returns a character given its Unicode code point. It is similar to the ChrW() function in Visual Basic and VBScript.
UNICODE
The Unicode equivalent to the ASCII function, and returns the Unicode code point of the first character in a string. It is similar to the AscW() function in Visual Basic and VBScript.

Note that the NCHAR function was used in an example earlier in this article (see Date/Time Types: datetime, smalldatetime) to add the RLM (right to left mark) in front of a Hijri date to allow it to be formatted in the expected manner.

Locale Support In SQL Server 2000

SQL Server 2000 includes some specific locale support for 33 different languages. This is not the full locale support that is available in the NLS database in Windows, but it does provide many basic functions. The following list displays the supported languages (in both English and the native language).

List of languages supported in SQL Server 2000 and Windows 2000

English TranslationNative Language
ArabicArabic
British English British
BrazilianPortuguês - Brasil
BulgarianAa902644.intlfeainsqlsvr2000.rtf32(en-us,SQL.80).gif
Simplified Chinese Aa902644.intlfeainsqlsvr2000.rtf33(en-us,SQL.80).gif
Traditional Chinese Aa902644.intlfeainsqlsvr2000.rtf34(en-us,SQL.80).gif
Croatian hrvatski
Czech Aa902644.intlfeainsqlsvr2000.rtf35(en-us,SQL.80).gif
DanishDansk
Dutch Nederlands
Englishus_english
Estonianeesti
FinnishSuomi
FrenchFrançais
GermanDeutsch
Greek Ελλζν͖à
Hungarian magyar
ItalianItaliano
JapaneseAa902644.intlfeainsqlsvr2000.rtf38(en-us,SQL.80).gif
KoreanAa902644.intlfeainsqlsvr2000.rtf39(en-us,SQL.80).gif
Latvian Latviešu
Lithuanianlietuvių
Norwegian Norsk
Polishpolski
PortuguesePortuguês
RomanianRomână
Slovakslovenĉina
Slovenianslovenski
Spanish Español
SwedishSvenska
Thai Aa902644.intlfeainsqlsvr2000.rtf40(en-us,SQL.80).gif
TurkishTüurkçe
RussianAa902644.intlfeainsqlsvr2000.rtf42(en-us,SQL.80).gif

You can enumerate these languages and information about them using the sp_helplanguage stored procedure. Note that although every version of SQL Server will store full information about many of the items listed below, you will not get translated system messages for all locales unless you have installed a localized version of the product.

The information on languages is stored in the syslanguages table, except for messages, which are stored in sysmessages.

Language Settings

Every SQL Server must have a default language that it uses to handle items such as date formats and messages. This information is stored for each login to the server that is created. Although this is initially set during setup, it can be overridden at the server level in the Server Settings tab in the SQL Server Properties dialog box, as shown in Figure 16.

Figure 16. SQL Server language settings dialog box (click to enlarge)

Figure 16. SQL Server language settings dialog box (click to enlarge)

You can also use the sp_configure stored procedure, for example, in a call to change the default language to Italian:

sp_configure "language", 6

The following table shows the language IDs (langids) that are available (and can be queried using a query from the syslanguages table).

langidLanguage name
0English
1German
2French
3Japanese
4Danish
5Spanish
6Italian
7Dutch
8Norwegian
9Portuguese
10Finnish
11Swedish
12Czech
13Hungarian
14Polish
15Romanian
16Croatian
17Slovak
18Slovenian
19Greek
20Bulgarian
21Russian
22Turkish
23British English
24Estonian
25Latvian
26Lithuanian
27Brazilian
28Traditional Chinese
29Korean
30Simplified Chinese
31Arabic
32Thai

The default language setting can be overridden on a per-login basis using the sp_addlogin stored procedure or the Login Properties dialog box shown in Figure 17.

Figure 17. Login Properties dialog box (click to enlarge)

Figure 17. Login Properties dialog box

Finally, the language can be overridden at the session level, using the SET LANGUAGE statement, as shown in the following examples:

Aa902644.intlfeainsqlsvr2000.rtf43(en-us,SQL.80).gif

You should use the n prefix in front of strings to make sure they are passed as Unicode. This can help avoid the unintended problems with conversions using the default system code page of the server.

Individual data access methods provide their own methods of specifying the language setting outside of a SET LANGUAGE call:

  • ADO supports a provider-specific language keyword in the ConnectionString.
  • OLE DB can set the provider-specific SSPROP_INIT_CURRENTLANGUAGE property.
  • ODBC can specify a language in the data source definition or in a LANGUAGE keyword in the connect string.
  • DB-Library can use dblogin to allocate a LOGINREC, and then DBSETNATLANG to specify a language setting.

The language settings (whether specified at the server, login, or session level) affect these items:

  • Messages
  • Date/Time
  • First Day of Week
  • Currency and currency symbols
  • Month/day names and abbreviated month names

Messages

SQL Server 2000 supports having multiple, language-specific copies of system error strings and messages. These messages are stored in the sysmessages table of the master database. When you install a localized version of SQL Server 2000 these system messages are translated for that language version you are installing. You also get by default the US English set of these messages. You can use Set Language to specify the language of the server session. By default, it is the language of the installed version. When SQL Server sends a message to a connection, it uses the localized message if the language ID Set matches one of the language IDs found in the msglangid column of the sysmessages table. These IDs are in decimal format and represent the locale ID (LCID) of the message. If there is no message in the sysmessages table with the same LCID, the US English messages are sent.

You can add a language-specific user-defined message to the sysmessages table by using @lang parameter of the sp_addmessage system stored procedure. (Error numbers should be greater then 50,000.) The language is @lang—the Named Alias that maps to the LCID —for the message.

Because multiple language-specific, system-error strings and messages can be installed on the server, the value of language specifies the language in which the messages should be written to the sysmessages table. When language is omitted, the language is the default language of the server session. The supported language definitions are stored in the master.dbo.syslanguages. If you must install multiple language versions of sysmessages, contact your Product Support Services (PSS) representative.

Date/time

The basic change that can be made is whether the short date format is mdy, dmy, or ymd. You can override this at the connection level by using the SET DATEFORMAT setting, but each language has an appropriate default. The default can be retrieved by using the sp_helplanguage stored procedure. The value is stored in the dateformat column.

First day of week

The first day of the week varies among different locales; among the 33 languages in the syslanguages table, it varies between 1 (Monday) and 7 (Sunday). This information can be retrieved by using the sp_helplanguage stored procedure. The value is stored in the datefirst column.

Currency and currency symbols

Any column of money or smallmoney type can include a currency symbol. The symbol does not have to be the one specified in the Regional Options dialog box, and can be any of the characters shown in the following table.

Currency symbolCurrency nameUnicode (hexadecimal) value
$Dollar sign (USA)0024
Aa902644.intlfeainsqlsvr2000.rtf44(en-us,SQL.80).gifPound sign (UK)00A3
Aa902644.intlfeainsqlsvr2000.rtf45(en-us,SQL.80).gif(Universal) Currency sign00A4
Aa902644.intlfeainsqlsvr2000.rtf46(en-us,SQL.80).gifYen sign00A5
Aa902644.intlfeainsqlsvr2000.rtf59(en-us,SQL.80).gifBengali Rupee mark09F2
Aa902644.intlfeainsqlsvr2000.rtf60(en-us,SQL.80).gifBengali Rupee sign09F3
Aa902644.intlfeainsqlsvr2000.rtf47(en-us,SQL.80).gifThai Baht symbol03EF
Aa902644.intlfeainsqlsvr2000.rtf48(en-us,SQL.80).gifColon sign20A1
Aa902644.intlfeainsqlsvr2000.rtf49(en-us,SQL.80).gifCruzeiro sign20A2
Aa902644.intlfeainsqlsvr2000.rtf50(en-us,SQL.80).gifFrench Franc sign20A3
Aa902644.intlfeainsqlsvr2000.rtf51(en-us,SQL.80).gifLira sign20A4
Aa902644.intlfeainsqlsvr2000.rtf52(en-us,SQL.80).gifNaira sign20A6
Aa902644.intlfeainsqlsvr2000.rtf53(en-us,SQL.80).gifPeseta sign20A7
Aa902644.intlfeainsqlsvr2000.rtf54(en-us,SQL.80).gifRupee sign20A8
Aa902644.intlfeainsqlsvr2000.rtf55(en-us,SQL.80).gifWon sign20A9
Aa902644.intlfeainsqlsvr2000.rtf56(en-us,SQL.80).gifNew Sheqel sign20AA
Aa902644.intlfeainsqlsvr2000.rtf57(en-us,SQL.80).gifDong sign20AB
Aa902644.intlfeainsqlsvr2000.rtf58(en-us,SQL.80).gifEuro sign20AC

You will probably want to note that the SQL Server 2000 Books Online incorrectly lists the Euro sign as having a hexadecimal value of 20A0; the real value is 20AC. The character represented by 20A0 is Aa902644.intlfeainsqlsvr2000.rtf61(en-us,SQL.80).gif, the Euro-Currency (ECU) sign. This is not the Euro and should not be used as such; attempting to use Aa902644.intlfeainsqlsvr2000.rtf61(en-us,SQL.80).gif in a money value will result in an error.

Month/day names and abbreviated month names

The names of the months and days are included in the syslanguages table. They can be retrieved using the sp_helplanguage stored procedure, using the following columns:

months
A comma-delimited list of month names, January through December.
shortmonths
A comma-delimited list of abbreviated month names, January through December.
days
A comma-delimited list of the days of the week, Monday through Sunday.

Dealing with COM's Locale Interference

Although SQL Server has some very powerful features when it comes to handling date/time and currency values, if you are using any COM service such as ADO to access the server, you must for its intervention. For example, you can have problems getting Visual Basic to recognize that a number value prefaced by any of the currency symbols (shown in the preceding table) is a currency value. You can also have serious problems getting COM to properly use date/time values stored in strings.

To properly deal with this situation, you must understand when your application is converting a string to a date/time or currency value. Once you know whether it is happening on the client or the server, you can decide which rules apply.

The Access 2000 ADP is an example of such a client (see Using the Access 2000 New ADP Format earlier in this article). Because Access is working through OLE DB, all operations from Access 2000 will be governed by the COM rules that use the client computer's regional settings.

Note   OLE DB providers such as the Microsoft OLE DB Provider for SQL Server will properly convert a valid COM date to and from a SQL Server date. It is best to not rely on date formats in strings when you can avoid it, because this is the type of functionality that can break as you move the client side between different locales.

Data Transformation Services

When dealing with multilingual data, there are some issues in the user interface of the DTS Import/Export Wizard, shown in Figure 18.

Figure 18. DTS Wizard dialog box (click to enlarge)

Figure 18. DTS Wizard dialog box (click to enlarge)

The wizard provides an interface for working with heterogeneous data. There are a few dialog boxes for showing the data that are not as fully Unicode enabled as SQL Server Enterprise Manager and SQL Query Analyzer (see Multilingual Data In the User Interface earlier in this article). Because of this, you may encounter situations in which the data may not look right in the wizard, but will transfer properly. The reliability and stability of DTS operations are not affected by these UI limitations! In most cases, you simply will not be able to see the data because it will be replaced by boxes, as shown in Figure 19.

Figure 19. DTS Wizard illustration (click to enlarge)

Figure 19. DTS Wizard illustration (click to enlarge)

This is a Unicode text file with Simplified Chinese data in it. Once imported, the data will be properly handled, whether it is imported into a Unicode text column or a Simplified Chinese text column.

However, if you have a text file that uses a particular code page that does not match the system default, the wizard does not have the proper context to read it, as shown in Figure 20. Note that this is not an uncommon limitation. Many programs, including Notepad and WordPad, do not have the capability to understand files that are not in the expected code page.

Figure 20. Illustration of a file that will not import properly using DTS Wizard (click to enlarge)

Figure 20. Illustration of a file that will not import properly using DTS Wizard (click to enlarge)

This Korean file will not import properly on a U.S. computer, and the wizard does not allow means to specify a code page for the import process. (That is why the data will neither display nor import properly). As in many other cases, the rules are very clear: If you want to use multilingual data, you should use Unicode.

There are three rules you can use for encoding information during DTS transformations: you can choose to "use src," "use dest," or "use collation."

The "use collation" option is most likely to give you the results you want. It is actually the default in both SQL-DMO and in the DTS Designer. (It is not the default in the DTS wizard, but it is easy enough to set). The "use collation" option only works when both source and destination are SQL Server 2000 databases, but it will use the collation information of the respective columns to best determine how to do the transfer.

The other two transformation options are not quite so flexible and require you to use either the source or destination code pages of the server to determine how the transform should be done. Unfortunately, these settings are quite literal in their adherence to the code page settings of the server. Although they will respect the collation of a particular multiple-instance server, they will not support a database collation that is different from the server.

The other feature that exists in DTS Transformation operations is a very powerful ability to convert strings and other non-datetime data, using the NLS information of the instance of SQL Server. For formats that have date/time data stored as strings, this crucial functionality can allow the import to proceed as the user would expect it to. This can be very important because not all formats have the large number of data formats SQL Server 2000 does, and DTS is the important bridge between these limited formats and the server's capabilities.

Other problems with DTS can occur when moving data between non-Unicode columns with a SQL Server 7.0 source and a SQL Server 2000 destination, when the collation of the SQL Server 2000 column is not one that is easily represented in SQL Server 7.0. (Conversely, a SQL Server 2000 source and SQL Server 7.0 destination can also cause the same problem). The default of using the server code page on the SQL Server 7.0 end is what will be used in these cases. Once again, using Unicode data types prevents such problems from occurring.

When you perform a copy operation with DTS rather than a transform, there are three possibilities:

  • OLE DB provider column

    This choice just copies the raw data, which obviously will have problems if the two columns do not contain data on the same code pages. It is a very fast option when the source and destination use the same code page, but the copy will not be effective for copying between columns with data that uses two different code pages.

  • NCHAR to CHAR

    DTS will automatically use the code page of the server to do conversions from Unicode.

  • CHAR to NCHAR

    DTS will automatically use the code page of the source server to do conversions to Unicode.

Summary of DTS Conversion Issues

To start, Unicode source and target will always work. This fact cannot be stressed strongly enough. The issues occur only with non-Unicode data in specific circumstances, which I will summarize here.

When you are creating a destination table:

  • Translation fails when the source is SQL Server 7.0 and the SQL Server 2000 destination server code page does not match the destination database code page.
  • Translation fails when the source is SQL Server 2000, the destination is SQL Server 7.0, and the source column collation does not match the source server code page.
  • When the source and destination is SQL Server 2000, translation fails if Use Collation is not selected; otherwise, no data is lost.

When the destination table already exists:

  • Translation fails when the source is SQL Server 7.0 and the SQL Server 2000 destination server code page does not match the destination column code page.
  • Translation fails when the source is SQL Server 2000, the destination is SQL Server 7.0, and the source column collation does not match the source server code page.
  • When the source and destination are SQL Server 2000, translation succeeds if either of these exist:
    • The source and destination column code pages match and Use Collation is selected.
    • The destination column code page matches destination server code page and Use Collation is not selected.

      If Use Collation is true, data transfer is raw. If false, data is cast into the source server code page and then translated into the destination column code page.

For the copy column transform:

  • char to char is a raw transform. Translation fails if the source and destination column code pages do not match.
  • char to Unicode casts source data into the computer code page and translates to Unicode, but fails if the source data does not appear in that code page.
  • Unicode to char casts translates data into the computer code page, regardless of destination code page. As such it will fail if the destination column code page does not match the computer's code page.

As in the other cases, Unicode to Unicode will work properly with no conversions that could fail or corrupt data, and is preferred for use with multilingual data.

Using the bcp Utility with Multilingual Data

When you want to import data from or export data to a particular code page, you can still use the bcp utility, which supports the flags listed in the table below for lossless conversion of data.

FlagMeaningExplanation and notes
-C xxxCode page specifierxxx can specify a code page, ANSI, OEM, or RAW (for direct copy with no conversion—the fastest option).
-NUse Unicode native formatUses native (database) data types for all noncharacter data, and Unicode character format for all character data.
-wUse Unicode character formatUses the Unicode character data format for all columns.

You can also use format files and specify collations at the column level. (If you do not specify -C, -N, or -w, bcp will actually query for each column, the collation, and code page information prior to performing the import/export). You will then be prompted to save a format file, as shown in the following example (refers to the authors table in the pubs database):

8.0
9
1   SQLCHAR   0   11   ""1   au_id   Latin1_General_CI_AI
2   SQLCHAR   0   40   ""   2   au_lname   Latin1_General_CI_AI
3   SQLCHAR   0   20   ""   3   au_fname   Latin1_General_CI_AI
4   SQLCHAR   0   12   ""   4   phone   Latin1_General_CI_AI
5   SQLCHAR   0   40   ""   5   address   Latin1_General_CI_AI
6   SQLCHAR   0   20   ""   6   city   Latin1_General_CI_AI
7   SQLCHAR   0   2   ""   7   state   Latin1_General_CI_AI
8   SQLCHAR   0   5   ""   8   zip   Latin1_General_CI_AI
9   SQLBIT   0   1   ""   9   contract   ""

The allowable data types are displayed in the following table. The collation is the default specified for the column, which may have been inherited from the database or the server.

TypeFull name
cChar
TText
iInt
sSmallint
tTinyint
fFloat
mMoney
bBit
dDatetime
xBinary
IImage
DSmalldatetime
rReal
MSmallmoney
nNumeric
eDecimal
wNchar
WNtext
uUniqueidentifier

Note that the varchar and nvarchar data types are not listed in the table. For the bcp utility, the char and nchar types should be used in their place, respectively.

Finally, bcp supports the -R flag for "Regional Enable." This flag has the same effect as the ODBC "use regional settings" option (see Communication Between Server and Client earlier in this article) and can relate to the way the date/time, number, and currency data being stored in nontext fields is interpreted.

The Microsoft Search Service and FTS

Many Microsoft products use the Microsoft Search service; the service itself is used in both SQL Server 2000 and Exchange 2000. The engine is the same one used in Index Server, and many other products will likely include its capabilities in the future. Its most important capability is its ability to provide stemming (verb conjugation) and word breaking functionality that are language-specific; for example, allowing you to properly index words such as l'unique in French. The languages supported by Microsoft Search service are:

  • Dutch
  • English (UK)
  • English (US)
  • French
  • German
  • Italian
  • Japanese
  • Korean
  • Simplified Chinese
  • Spanish (Spain)
  • Swedish
  • Traditional Chinese
  • Thai

In addition to these language-specific stemmers/word breakers, a neutral one is also provided for use in other languages. There are specific cases where you may be more pleased with the results of using a specific language rather than neutral (for example you may wish to use the Spanish language for Catalan, or Dutch for Afrikaans), but the official recommendation for such cases is to use the neutral choice. Therefore, testing it out on your data rather than just assuming that "similar" languages will provide good results is recommended. The language support is very sophisticated, and this can potentially be a bad thing if you try to provide data that is not specifically in that language.

The actual implementation details of the various providers that use the Microsoft Search service and related components are, for the most part, based on your need: If you are indexing data in an Exchange store, the Exchange implementation is preferred. But if you are using the file system, Index Server is preferred. Obviously, if you are using SQL Server 2000, then Full-Text Search in SQL Server 2000 is the best choice.

Microsoft Search service can allow clients to "tag" areas of text with a particular language to allow for multilingual indexing on a data stored in an IMAGE column.  You can also specify a language to use on queries with the @language parameter of the sp_fulltext_column stored procedure.

You can also specify a default language to use for Full-Text Search by using the default full-text language option to the sp_configure stored procedure, which you can call as follows:

sp_configure 'show advanced options', 1
GO
RECONFIGURE
GO
sp_configure 'default full-text language', 1041
GO
RECONFIGURE
GO

If you are using the default language, you do not need to specify a language in the call to sp_fulltext_column.

Users commonly ask how they can add support for their own languages in Full-Text Search. Unfortunately, there is currently no way to do this.

Surrogate pairs (see What are Surrogates? earlier in this article) are not supported in FTS or Microsoft Search service. In addition, you cannot reliably perform searches based on these values, or, for that matter, on any characters that SQL Server 2000 considers to be undefined.

Dealing with OLAP/Hierarchical Data

In general, all the rules about multilingual text in Unicode fields that apply to relational use of SQL Server 2000 apply equally to the Online Analytical Processing Capabilities (OLAP) included in the Analysis Services component of SQL Server 2000. Analysis Services can handle any character that SQL Server 2000 itself can support. There are a few cases in the user interface for OLAP and its wizards where such data will be displayed as question marks if it is not on the client system's default code page. However, this does not affect the actual data and is purely a display issue in the UI.

One feature that is not supported in SQL Server 2000 for OLAP is the ability to have different collations in different parts of hierarchical data structures that make up a Data Warehouse. Although not generally useful in the analysis of data, there are cases such as partitioning of data alphabetically (for analysis) that can be impacted. In such cases, you need to determine another way to partition the data that defines the ordering and partitions explicitly, and does not assume letter order. This would also be very useful in the display of hierarchical data. If you need such functionality, you will have to do sorting on the subset of the data returned from the OLAP source.

Using XML Support In SQL Server 2000 with Multilingual Data

The rich XML support in SQL Server 2000 provides the ability to support multilingual data, because XML itself has a default encoding of Unicode using UTF-8, and in many cases SQL Server will use UCS-2 encoding in the XML it creates. Here are several ways you can specify an encoding in XML:

  • If you are formatting data as XML in an ADO Stream object and then persisting the stream, you can specify an output encoding and the proper encoding will be marked in the XML-formatted data.
  • You can specify an output encoding in a URL.
  • XML templates can specify an encoding.

Even if you do not use any of these methods, Unicode is supported by default and will work properly.

One important issue to keep in mind is that the characters allowed in names in XML are a lot more restricted than those allowed as identifiers in SQL Server. To fully support SQL Server identifiers, the identifier characters not supported in XML—they are replaced by a special form, _x0000_, where 0000 is replaced by the Unicode code point number. SQL Server will properly recognize these characters and return them correctly. Note that Updategrams interpret the encoding either way. Using them is only necessary, however, if you do not have a mapping schema.

For example, the following Transact-SQL statement can be used to create an XML file in UCS-2 encoding:

USE Northwind
GO

SELECT TOP 1 * FROM "Order Details" FOR XML AUTO

DECLARE @h int

EXEC sp_xml_preparedocument @h output, 
 N'<?xml version="1.0" encoding="ucs-2"?>

<root test_x0020_2="foo">Aa902644.intlfeainsqlsvr2000.rtf62(en-us,SQL.80).gif</root>'

SELECT * FROM OPENXML(@h,'/root') WITH("test 2" varchar(200) )

EXEC sp_xml_removedocument @h

This bit of script creates an XML document that encodes the Cyrillic string Aa902644.intlfeainsqlsvr2000.rtf62(en-us,SQL.80).gif, which happens to be a word for the language name Ukranian.

Annotated schemas shipped with XPath support in SQL Server 2000. Updategrams are currently in beta release on the Web and planned for final release to the Web in early 2001. Both features are designed to work well with multilingual text and support methods for specifying encoding; they also support the syntax mentioned earlier for the use of identifiers.

Interacting with Other Database Products

When working with other database systems, your most important task is simply a matter of determining the code page and similar rules of that system. The majority of the recommend data access methods on the SQL Server side involve COM, and thus use Unicode data. Therefore, the main piece of information you need to determine how well an international application will run between them is how well the other database products support Unicode.

For example, other database products such as Oracle and Sybase SQL Server support Unicode using the UTF-8 encoding. Usually this will not affect you because the data must be converted to UTF-16 using ADO/OLE DB before you ever see the information. But you should be aware of the difference if you try to interact with data in such products directly.

Another major issue you must allow for is that products such as Sybase SQL Server do support the National character data types but do not treat them as Unicode data types. For them, nchar and nvarchar are fields that you could use, for example, to store Japanese data in an otherwise US English database. When you run queries against another product, it is very important that you know how information is being handled in the other database so that using commands such as OPENROWSET will properly handle international text. The use of the National character data types for specifying Unicode text is Microsoft SQL Server specific in order to properly handle encoding that supports all languages.

Conclusion

Microsoft SQL Server 2000 includes a variety of very powerful international features. By building on SQL Server 7.0, the first truly multilingual-capable version of SQL Server, SQL Server 2000 has added a compelling set of features that allows the creation of truly global applications. With the importance of the Internet and the World Wide Web, it is crucial that applications and databases are able to meet this need—and the increasing needs of e-commerce and global communication require a database product that can support them. SQL Server 2000 is the database of choice for global organizations.

Acknowledgments

This article would have been impossible without the hard work of many people, and I would be remiss if I did not mention them.

Michael Kung, in his role as a Program Manager for SQL Server with a focus on globalization, provided not only invaluable review material for many different parts of this article, but also helped point me to the right people for all the questions that came up in product areas with which I was less familiar. His broad knowledge of many different areas has always been a great resource, even beyond this article, and I was very happy for his help in this project.

Peter Carlin, a Development Manager whose responsibilities include the SQL Server relational engine, was not only able to find the time to provide more feedback than any other person I contacted, but he also crafted an e-mail that contained a lot of the important information regarding Unicode support in SQL Server. This e-mail was actually the direct inspiration for this piece that exploded into an article almost 25 times the size!

Fernando Caro, a lead international program manager, was not only able to help take me through the OLAP features of SQL Server, but also helped change the discussion on localized system messages from a curt "sorry, not supported" to the helpful pointer to the appropriate resources. I am grateful that this article gave me the opportunity to point out the problem, but even more grateful that Fernando and Peter decided it was important to solve it. It is widely due to people like them who always want to help users in any (reasonable) way possible that SQL Server is such a great product.

I also would like to thank several program mangers and testers for their assistance. Without their help, many of the features that are scattered throughout the full SQL Server product would not have had their international features and issues discussed here. These people include: Michael Rys, Euan Garden, Fadi Fakhouri, and James Howey. I would also like to thank Margaret Li for the insight she gave me into how the Microsoft Search service that sits underneath SQL Server's Full-Text Search does its work.

Many of the hard working people on the Windows 2000 team were also very helpful, both for providing information on how basic issues such as collation and locale support are supposed to work and of course for providing the original data on which SQL Server 7.0 and 2000's collation support is based. I would especially like to thank Julie Bennett, Cathy Wissink, and John McConnell for being around to answer questions and provide encouragement for getting the word out!

Of course, the full list of people who were involved in the planning, developing, and testing of SQL Server's international and multilingual features would probably be almost as long as the list of people who worked on the product, because these features are clearly not an "add-on" but are a core part of SQL Server 2000. Therefore, I would like to thank everyone involved for producing a great product that is so globally useful!

About the Author

Michael Kaplan is the president and lead developer of Trigeminal Software, Inc., a software development and consulting company that specializes in the internationalization and localizability of Microsoft Visual Basic, Microsoft SQL Server, Microsoft Access, and ASP applications. He is the author of the book Internationalization with Visual Basic from Sams Publishing, and his next book, Internationalization with SQL Server, is due out in the middle of 2001. He has also written many articles and spoken on international development. His e-mail address is michka@trigeminal.com, and the Web site for Trigeminal Software, Inc. is http://www.trigeminal.com/.

The information contained in this article represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, this article should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

This article is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, AS TO THE INFORMATION IN THIS ARTICLE..

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this article may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this article. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this article does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

© 2001 Microsoft Corporation. All rights reserved.

Microsoft, ActiveX, Visual Basic, Windows, and Windows NT are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.

The names of actual companies and products mentioned herein may be the trademarks of their respective owners.

Show:
© 2014 Microsoft