.NET Framework Class Library
Char.GetUnicodeCategory Method

Categorizes a Unicode character into a group identified by one of the UnicodeCategory values.

Overload List

  Name                                 Description
  GetUnicodeCategory(Char)             Public static method. Categorizes a specified Unicode character into a group identified by one of the UnicodeCategory values. Supported by the .NET Compact Framework and the XNA Framework.
  GetUnicodeCategory(String, Int32)    Public static method. Categorizes the character at the specified position in a specified string into a group identified by one of the UnicodeCategory values. Supported by the .NET Compact Framework and the XNA Framework.
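
As a minimal sketch of how the two overloads listed above might be called (the class name OverloadDemo and its helper methods are made up for illustration and are not part of the class library):

```csharp
using System;
using System.Globalization;

public static class OverloadDemo
{
    // Thin wrapper around the GetUnicodeCategory(Char) overload.
    public static UnicodeCategory OfChar(char c)
    {
        return Char.GetUnicodeCategory(c);
    }

    // Thin wrapper around the GetUnicodeCategory(String, Int32) overload.
    public static UnicodeCategory OfStringIndex(string s, int index)
    {
        return Char.GetUnicodeCategory(s, index);
    }

    public static void Main()
    {
        Console.WriteLine(OfChar('a'));              // LowercaseLetter
        Console.WriteLine(OfChar('7'));              // DecimalDigitNumber
        Console.WriteLine(OfStringIndex("a b", 1));  // SpaceSeparator
    }
}
```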

Community Content

Shawn Steele [MSFT]
WARNING: Chars don't make sense in many languages

It is worth mentioning that the "char" type represents a single 16-bit value, that is, one UTF-16 code unit. In Unicode, some characters consist of two UTF-16 code units (a surrogate pair), so in that case a single "char" cannot represent a complete "character". This doesn't happen in English, but many Chinese and other characters lie outside the BMP (i.e., they require two chars to represent one Unicode code point).
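
To make the surrogate-pair point concrete, here is a small sketch. The choice of U+20000 (a CJK ideograph outside the BMP) and the class name SurrogateDemo are assumptions made for illustration only. The last line prints whatever category the runtime reports for the pair as a whole, so no expected value is given for it.

```csharp
using System;
using System.Globalization;

public static class SurrogateDemo
{
    // U+20000, written with a \U escape; stored as two UTF-16 code units.
    public const string Ideograph = "\U00020000";

    public static void Main()
    {
        string s = Ideograph;

        Console.WriteLine(s.Length);                                 // 2
        Console.WriteLine(Char.IsSurrogatePair(s, 0));               // True
        // Looking at the first char alone only tells you it is half of a pair.
        Console.WriteLine(Char.GetUnicodeCategory(s[0]));            // Surrogate
        // Recover the full code point from the pair.
        Console.WriteLine(Char.ConvertToUtf32(s, 0).ToString("X"));  // 20000
        // The string-based overload can see both halves of the pair.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory(s, 0));
    }
}
```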

Also note that the notion of a "character" is itself flexible. Many people think of characters as "glyphs", but many "glyphs" require multiple code points. For example, ä can be "a" + U+0308 (combining diaeresis) or the single code point "ä" (U+00E4). In some languages, some "letters/characters/glyphs" cannot be represented correctly by a single Unicode code point and instead require multiple code points.
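
The ä example can be checked with a short sketch like the one below. The class name CombiningDemo and the helper TextElementCount are hypothetical. String.Normalize and StringInfo are used here as one way to treat the two-code-point form as a single unit; the comment above does not prescribe them.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class CombiningDemo
{
    // Counts user-perceived "text elements" rather than chars.
    public static int TextElementCount(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    public static void Main()
    {
        string decomposed = "a\u0308";  // "a" + combining diaeresis
        string composed = decomposed.Normalize(NormalizationForm.FormC);

        Console.WriteLine(decomposed.Length);                       // 2
        Console.WriteLine(composed.Length);                         // 1
        Console.WriteLine(composed == "\u00E4");                    // True
        // The combining mark on its own is not a letter.
        Console.WriteLine(Char.GetUnicodeCategory(decomposed[1]));  // NonSpacingMark
        Console.WriteLine(TextElementCount(decomposed));            // 1
    }
}
```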

Additionally, some concepts get confused by this behavior. For example, there is a ΰ (U+03B0, Greek small letter upsilon with dialytika and tonos), but there is no equivalent capital letter as a single code point. Calling ToUpper() returns the same value, although you could perhaps argue for Ϋ́ (U+03AB + U+0301: Greek capital letter upsilon with dialytika, followed by a combining tonos). Some other operating systems and environments choose that as the ToUpper() value for U+03B0, so a single "char" ends up with a two-"char" uppercase form.
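
A sketch for inspecting what ToUpper does with U+03B0 on a given runtime follows. The class UpperDemo and the helper ToCodeUnits are hypothetical names. The comment above reports that the value comes back unchanged; the code simply prints what it observes rather than assuming a result.

```csharp
using System;
using System.Text;

public static class UpperDemo
{
    // Formats each UTF-16 code unit of a string as U+XXXX, separated by spaces.
    public static string ToCodeUnits(string s)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string lower = "\u03B0";
        string upper = lower.ToUpperInvariant();

        // Shows the input and whatever uppercase form this runtime produced.
        Console.WriteLine(ToCodeUnits(lower) + " -> " + ToCodeUnits(upper));
        Console.WriteLine("chars in result: " + upper.Length);
    }
}
```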

Another example is when combinations of characters cause their form to change. This isn't common in "Latin" characters, but it's kind of like æ (U+00E6) looking like a and e crammed together, or, in German, ß being the equivalent of ss. In some scripts the form changes a lot depending on the subsequent letters. An oversimplification would be to describe it as a kind of hyperactive cursive, where the letters connect in different ways depending on the letters that follow.

There are many other cases where the "character" concept breaks down, so use caution. Strings are generally preferable for representing linguistic content.


Tags : unicode

Shawn Steele [MSFT]
CharUnicodeInfo is preferred
For backward-compatibility reasons, CharUnicodeInfo and Char have slightly different behavior for GetUnicodeCategory. CharUnicodeInfo has the more "correct" behavior.
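
One way to see where the two classes disagree on a particular runtime is to compare them over every char value, as in this sketch. The class CategoryDiff and the method FindDifferences are hypothetical names. Which characters differ, if any, depends on the framework version, so no expected output is shown.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class CategoryDiff
{
    // Returns one line per char value for which the two classes disagree.
    public static List<string> FindDifferences()
    {
        List<string> diffs = new List<string>();
        for (int i = 0; i <= Char.MaxValue; i++)
        {
            char c = (char)i;
            UnicodeCategory fromChar = Char.GetUnicodeCategory(c);
            UnicodeCategory fromInfo = CharUnicodeInfo.GetUnicodeCategory(c);
            if (fromChar != fromInfo)
            {
                diffs.Add(String.Format("U+{0:X4}: Char={1}, CharUnicodeInfo={2}",
                                        i, fromChar, fromInfo));
            }
        }
        return diffs;
    }

    public static void Main()
    {
        List<string> diffs = FindDifferences();
        Console.WriteLine("Differences found: " + diffs.Count);
        foreach (string line in diffs)
        {
            Console.WriteLine(line);
        }
    }
}
```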
Tags : unicode
