Capitalization, Uppercasing, and Lowercasing

When creating a locale-aware application, you'll need to consider how linguistic nuances are handled. These nuances might seem trivial, but they can have a large impact on application design and functionality. For example, Windows allows you to convert characters into either uppercase or lowercase equivalents. Some applications use this feature to automatically convert the first letter of every sentence into uppercase or to assume that certain types of words should always be capitalized. In Russian, however, names of the days of the week are not capitalized: capitalizing the word for "Wednesday" changes the meaning to "environment," and capitalizing the word for "Sunday" changes the meaning to "resurrection."

In the past, as localized products were developed, language-sensitive issues such as casing were sometimes handled with what were thought of as well-designed, intelligent algorithms. For example, an uppercasing macro that relies on the code-point numbers of ASCII characters and the linear relationship between uppercase characters (A = 0x41) and lowercase characters (a = 0x61) can be written as:

#define ToUpper(ch) ((ch)<='Z' ? (ch) : (ch)+'A' - 'a')
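To make the macro's behavior concrete, the following minimal sketch applies it to a handful of byte values. The helper names naive_upper and print_naive_upper_samples are hypothetical, introduced only for illustration, and the sketch assumes an ASCII-compatible single-byte code page such as Windows-1252.

```c
#include <stddef.h>
#include <stdio.h>

/* The macro from the text, unchanged. */
#define ToUpper(ch) ((ch)<='Z' ? (ch) : (ch)+'A' - 'a')

/* Hypothetical helper: applies the macro to one byte value (0..255). */
unsigned char naive_upper(unsigned char ch)
{
    return (unsigned char)ToUpper(ch);
}

/* Hypothetical helper: prints each sample byte next to the macro's result. */
void print_naive_upper_samples(void)
{
    /* 'a' and 'z' are the intended case; the rest are not lowercase
       ASCII letters, yet the macro still subtracts 0x20 from them. */
    static const unsigned char samples[] = { 'a', 'z', '_', '{', 0xE9, 0xF7, 0xFF };
    size_t i;

    for (i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        printf("0x%02X -> 0x%02X\n",
               (unsigned)samples[i], (unsigned)naive_upper(samples[i]));
    }
}
```

Under these assumptions, 'a' maps to 'A' as intended, but '_' becomes '?' and '{' becomes '['. In Windows-1252, 0xE9 (é) happens to land on 0xC9 (É) only because that code page preserves the same offset, while 0xF7 (÷) becomes 0xD7 (×) and 0xFF (ÿ) becomes 0xDF (ß).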

Can you see the problems this English-centric approach presents for non-Latin scripts, or for languages with accented characters, where the character mapping doesn't follow the assumed linear relationship between lowercase and uppercase characters? There are several other reasons why algorithmic solutions for case conversion do not cover all occurrences.

First, some languages do not have a one-to-one mapping between their uppercase and lowercase characters. For instance, the uppercase equivalent of the German "ß" is "SS." Second, some characters have different mappings depending upon the language in which they are used. For example, the lowercase "i" in English maps to a dotless uppercase letter: "I." However, in Turkish the lowercase "i" maps to a dotted uppercase letter: "İ." Finally, most non-Latin scripts do not even use the concept of lowercase and uppercase, as in the case of Chinese, Japanese, and Korean; Arabic, Farsi, and Hebrew; as well as Thai. For example, since Farsi has no notion of uppercasing, running a Farsi string through the uppercasing macro above produces output composed of random and unsupported glyphs.

[Figure: Capitalization in English. The English-centric uppercasing macro used on an English string and on a Farsi string, where the notion of casing does not exist.]
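The three cases described above (a one-to-many mapping, a language-dependent mapping, and caseless scripts) can be sketched as a small table-like function. This is only an illustration under stated assumptions: the Lang enumeration and the upper_code_point function are hypothetical names, the input is assumed to be a single Unicode code point, and only the examples from the text plus basic Latin letters are covered. A real application would rely on the platform's locale-aware casing services instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical language tag for this sketch; a real application would
   pass a full locale identifier to the platform's casing API. */
typedef enum { LANG_DEFAULT, LANG_TURKISH } Lang;

/*
 * Writes the uppercase form of code point cp into out (room for two
 * code points) and returns how many code points were written.
 * Only the cases named in the text and a..z are handled; every other
 * code point is passed through unchanged.
 */
size_t upper_code_point(uint32_t cp, Lang lang, uint32_t out[2])
{
    if (cp == 0x00DFu) {                        /* German sharp s: one becomes two */
        out[0] = 0x0053u;                       /* 'S' */
        out[1] = 0x0053u;                       /* 'S' */
        return 2;
    }
    if (cp == 0x0069u && lang == LANG_TURKISH) {
        out[0] = 0x0130u;                       /* Turkish: i -> dotted capital I */
        return 1;
    }
    if (cp >= 0x0061u && cp <= 0x007Au) {       /* a..z, including default i -> I */
        out[0] = cp - 0x0020u;
        return 1;
    }
    out[0] = cp;                                /* caseless scripts stay as they are */
    return 1;
}
```

Returning a count rather than a single character is the key design choice here: it lets the caller handle "ß" expanding to two letters, and leaving unknown code points untouched means a Farsi letter such as U+0633 is returned as is rather than being shifted to an unrelated glyph.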