Understanding Encodings

11/03/2006

Internally, the .NET Framework stores text as Unicode UTF-16. An encoder transforms this text data to a sequence of bytes. A decoder transforms a sequence of bytes into this internal format. An encoding describes the rules by which an encoder or decoder operates. For example, the UTF8Encoding class describes the rules for encoding to and decoding from a sequence of bytes representing text as Unicode UTF-8. Encoding and decoding can also include certain validation steps. For example, the UnicodeEncoding class checks all surrogates to make sure they constitute valid surrogate pairs. Both of these classes inherit from the Encoding class.
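The encoder and decoder roles described above can be sketched in a language-neutral way. The following uses Python's built-in codecs as an analogue, not the .NET classes themselves; the helper names `round_trip` and `rejects_lone_surrogate` are hypothetical and exist only for this illustration. It also shows the kind of validation mentioned above: a lone surrogate is rejected rather than silently encoded.

```python
def round_trip(text: str, codec: str) -> bytes:
    """Encode text with the named codec and confirm decoding restores it."""
    data = text.encode(codec)          # encoder: text -> bytes
    if data.decode(codec) != text:     # decoder: bytes -> text
        raise ValueError(f"{codec} did not round-trip the input")
    return data


def rejects_lone_surrogate(codec: str) -> bool:
    """Return True if the codec refuses to encode an unpaired surrogate."""
    try:
        "\ud800".encode(codec)         # high surrogate with no low surrogate
    except UnicodeEncodeError:
        return True
    return False


# "naive" with U+00EF, a space, and U+1D11E (outside the BMP, so UTF-16
# needs a surrogate pair to represent it).
sample = "na\u00efve \U0001D11E"

utf8_bytes = round_trip(sample, "utf-8")
utf16_bytes = round_trip(sample, "utf-16-le")
```

The same text yields different byte sequences under each encoding, but both decode back to the identical string.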

Choosing an Encoding

The Encoding class is very general. Supported classes inheriting from Encoding allow .NET applications to work with the common encodings they are likely to encounter in legacy applications, and .NET developers can implement additional encodings. However, when you have the opportunity to choose an encoding, it is strongly recommended that you use a Unicode encoding, typically either UTF8Encoding or UnicodeEncoding (UTF32Encoding is also supported). In particular, UTF8Encoding is generally preferred over ASCIIEncoding. If the content is ASCII, the two encodings produce identical bytes, but UTF8Encoding can also represent every Unicode character, while ASCIIEncoding supports only the Unicode character values between U+0000 and U+007F. Because ASCIIEncoding does not provide error detection, UTF8Encoding is also better for security.

UTF8Encoding has been tuned to be as fast as possible and should be faster than any other encoding. Even for content that is entirely ASCII, operations performed with UTF8Encoding will be faster than operations performed with ASCIIEncoding. Developers should consider using ASCIIEncoding only for certain legacy applications. However, even in this case, UTF8Encoding may still be a better choice. Assuming default settings, the following scenarios can occur:

If you have internal content that is not strictly ASCII and you encode it with ASCIIEncoding, each non-ASCII character will encode as a question mark ("?"). If you then decode this data, the original characters cannot be recovered.
If you have internal content that is not strictly ASCII, and you encode it with UTF8Encoding, the result will seem to be unintelligible if interpreted as ASCII. However, if you then decode this data, it will round-trip successfully.
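The two scenarios above can be reproduced with a short sketch. It uses Python's `ascii` and `utf-8` codecs as stand-ins for ASCIIEncoding and UTF8Encoding, which is an assumption of this example. Passing `errors="replace"` mimics the question-mark default described above. Note that Python marks undecodable bytes with U+FFFD, which is a Python detail and not a statement about .NET.

```python
# "cafe" with an acute accent on the final letter (U+00E9).
non_ascii = "caf\u00e9"

# Scenario 1: encode with ASCII. The non-ASCII character becomes "?"
# and cannot be recovered on decode.
ascii_lossy = non_ascii.encode("ascii", errors="replace")
ascii_back = ascii_lossy.decode("ascii")

# Scenario 2: encode with UTF-8. Read as ASCII the bytes look wrong,
# but decoding them as UTF-8 restores the original text.
utf8_data = non_ascii.encode("utf-8")
misread_as_ascii = utf8_data.decode("ascii", errors="replace")
utf8_back = utf8_data.decode("utf-8")

# For pure ASCII content the two encodings produce identical bytes.
plain = "cafe"
identical_for_ascii = plain.encode("ascii") == plain.encode("utf-8")
```

Only the UTF-8 path round-trips: `utf8_back` equals the original string, whereas `ascii_back` ends in a literal question mark.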

Choosing a Fallback Strategy

When an application tries to encode or decode a character, but no mapping exists, you must implement a fallback strategy. A fallback strategy is a failure-handling mechanism. There are two types of fallback strategies:

Best fit fallback

When characters do not have an exact match in the target encoding/decoding, the application can decide whether to try to map them to a similar character.

Replacement string fallback

If there is no appropriate similar character, the application can specify what to insert in the string to mark the omission.

For example, an application can call GetEncoding(1252, 0, 0) (see GetEncoding); this specifies Code Page 1252 (the Windows Code Page for Western European languages) with encoderFallback and decoderFallback specified as zero. The default behavior will be a best fit mapping for certain Unicode characters. For example, CIRCLED LATIN CAPITAL LETTER S (U+24C8) will be changed to LATIN CAPITAL LETTER S (U+0053) before it is encoded; SUPERSCRIPT FIVE (U+2075) will be changed to DIGIT FIVE (U+0035). If you then decode from Code Page 1252 back to Unicode, the circle around the letter will be lost and 2⁵ will become 25. Other conversions may be even more drastic: the Unicode INFINITY symbol (U+221E) may be mapped to DIGIT EIGHT (U+0038).

Best fit strategies vary for different code pages and they are not documented in detail. For example, for some code pages, full-width Latin characters will map to the more common half-width Latin characters; for others, they will not.

Even under an aggressive best fit strategy, there is no imaginable fit for some characters in some encodings. For example, a Chinese ideograph has no reasonable mapping to Code Page 1252. In that case, a replacement string is used. By default, this string is just a single QUESTION MARK (U+003F).
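To make the combined behavior concrete, here is a minimal sketch of "best fit first, replacement string second". It is written in Python and is not the .NET implementation. The table `HYPOTHETICAL_BEST_FIT` contains only the three mappings mentioned in the text and is not the real Code Page 1252 best fit table, which is not documented in detail. The function name `encode_best_fit` is also hypothetical.

```python
# Illustrative table only: the three mappings named in the text above.
HYPOTHETICAL_BEST_FIT = {
    "\u24c8": "S",  # CIRCLED LATIN CAPITAL LETTER S -> LATIN CAPITAL LETTER S
    "\u2075": "5",  # SUPERSCRIPT FIVE -> DIGIT FIVE
    "\u221e": "8",  # INFINITY -> DIGIT EIGHT
}


def encode_best_fit(text: str, codec: str = "cp1252", replacement: str = "?") -> bytes:
    """Encode text, trying an exact mapping, then best fit, then a replacement."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codec)  # exact mapping exists
        except UnicodeEncodeError:
            # No exact mapping: use a similar character if the table has one,
            # otherwise fall back to the replacement string.
            fitted = HYPOTHETICAL_BEST_FIT.get(ch, replacement)
            out += fitted.encode(codec)
    return bytes(out)


# 2 followed by SUPERSCRIPT FIVE silently becomes the two digits "25".
lossy = encode_best_fit("2\u2075").decode("cp1252")
```

The result decodes without error, which is exactly why best fit is risky: nothing signals that the meaning of the text changed.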

Best fit mapping is the default behavior for an Encoding object that encodes Unicode data into code page data, and some legacy applications rely on this behavior. However, most new applications should avoid best fit behavior for security reasons. (For example, applications should not put a domain name through a best fit encoding.) Use the following alternatives to best fit mapping:

Use only Unicode encodings (UTF8Encoding, UnicodeEncoding, and UTF32Encoding) to avoid fallback issues.

Caution
While UTF7Encoding is technically a Unicode encoding, it is less robust and less secure than the other encodings. In some situations, changing one bit can radically alter the interpretation of an entire UTF-7 string. In other situations, substantially different UTF-7 strings can encode the same text. Consequently, UTF-7 should not be used when you have a choice; UTF-8 is preferred.

Use EncoderExceptionFallback and DecoderExceptionFallback, which throw an exception (EncoderFallbackException and DecoderFallbackException, respectively) if a character does not map exactly.
Use EncoderReplacementFallback and DecoderReplacementFallback to always substitute a replacement string if a character does not map exactly. (This is the default behavior for ASCIIEncoding.) By default, this string is a single question mark, but methods are provided that allow an application to choose a different string. Although the replacement is typically a single character, that is not a requirement. For DecoderReplacementFallback, which is used when transforming text into Unicode, one commonly used character is REPLACEMENT CHARACTER (U+FFFD).
Write your own EncoderFallback and/or DecoderFallback, to implement the strategy you prefer. See the Fallback Encoding Application Sample.
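The alternatives listed above have rough counterparts in Python's codec error handlers, sketched below as an analogue and not as the .NET API. In this sketch, `errors="strict"` plays the role of an exception fallback, `errors="replace"` on decode plays the role of a replacement fallback that emits U+FFFD, and a handler registered with `codecs.register_error` plays the role of a custom fallback. The handler name `mark_omission` and the marker string `[?]` are hypothetical choices for illustration.

```python
import codecs


def encode_strict(text: str, codec: str = "cp1252") -> bytes:
    """Exception-style fallback: raise if any character does not map exactly."""
    return text.encode(codec, errors="strict")


def mark_omission(error: UnicodeError):
    """Custom fallback: insert a multi-character marker per unmappable character."""
    if not isinstance(error, UnicodeEncodeError):
        raise error  # this sketch only handles encoding failures
    count = error.end - error.start
    # Return the replacement text and the position at which to resume encoding.
    return ("[?]" * count, error.end)


codecs.register_error("mark_omission", mark_omission)


def encode_marked(text: str, codec: str = "cp1252") -> bytes:
    """Replacement-style fallback that uses the custom marker string."""
    return text.encode(codec, errors="mark_omission")


def decode_with_replacement(data: bytes, codec: str = "utf-8") -> str:
    """Decoder replacement: invalid bytes become REPLACEMENT CHARACTER (U+FFFD)."""
    return data.decode(codec, errors="replace")
```

For example, `encode_marked("a\u4e2db")` keeps the surrounding text and marks the omitted ideograph, while `encode_strict` on the same input raises `UnicodeEncodeError`, which lets the caller decide how to proceed.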

Two further notes about best fit encoding (or decoding) fallback strategies:

Best fit is mostly an encoding issue, not a decoding issue. There are very few code pages that contain characters that cannot be mapped successfully to Unicode. Those characters are not commonly used, which is why they were omitted from Unicode.
There are no supported named objects corresponding to the best fit fallbacks, and the best fit fallback for each code page is distinct. If you want to be able to switch back and forth between the best fit and some other fallback for a single Encoding object, copy the original best fit object to a variable before you assign any other fallback object. You can then recover the best fit fallback by assigning that value back to System.Text.Encoding.EncoderFallback (or System.Text.Encoding.DecoderFallback).
