Character Encoding and Decoding in XML

Character encoding provides a map between a series of numbers and the characters people expect to see when they enter text into a computer. The uppercase letter "A", for example, is represented by the decimal number 65 (41 in hexadecimal) in a variety of character encodings, including the ASCII text familiar to many Western programmers and Windows Code Page-1252, the default encoding used by most Microsoft Windows Western systems.
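As a quick illustration of the point above, the following Python sketch encodes the letter "A" under a few encodings and prints the resulting byte. The particular list of encodings is only an assumption chosen for the example.

```python
# Encode the letter "A" under several encodings and inspect the single byte produced.
ENCODINGS = ("ascii", "cp1252", "iso-8859-1", "utf-8")

values = {name: "A".encode(name) for name in ENCODINGS}

for name, encoded in values.items():
    # encoded[0] is the integer value of the first (and only) byte.
    print(f"{name:12} decimal={encoded[0]} hex={encoded[0]:X}")
```

Each of these encodings yields the same single byte, decimal 65 (hexadecimal 41).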

Character encodings are not fonts. A font provides the graphic representations, or glyphs, for the characters in one or more character encodings. Microsoft Word, for example, includes a version of Arial (Arial Unicode MS) with glyphs for tens of thousands of characters.

All XML processors are required to understand two transformations of the Unicode character encoding: UTF-8 and UTF-16.

The following table shows how two platforms map the same byte values to different Western characters. As a result, the same character can be represented by different bytes: å, for example, is byte 229 in Windows CP1252 but byte 140 in MacRoman.

Byte   Windows (CP1252)   Macintosh (MacRoman)
140    Œ                  å
229    å                  Â
231    ç                  Á
232    è                  Ë
233    é                  È
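The table can be reproduced with a short Python sketch that decodes each byte value under both code pages. The codec names "cp1252" and "mac_roman" are the names Python's standard library uses for these encodings.

```python
# Decode the same byte values under Windows CP1252 and Macintosh MacRoman.
BYTE_VALUES = (140, 229, 231, 232, 233)

rows = []
for value in BYTE_VALUES:
    raw = bytes([value])  # a single byte with this value
    rows.append((value, raw.decode("cp1252"), raw.decode("mac_roman")))

for value, windows_char, mac_char in rows:
    print(f"{value:<6} {windows_char:<18} {mac_char}")
```

The bytes are identical in both columns; only the encoding used to interpret them differs.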

Parsers can read documents written in ISO-8859-1, Big-5, or Shift-JIS, but the processing rules treat everything as Unicode. Parsers cannot always detect a character encoding automatically, however. For example, 7-bit ASCII text is valid UTF-8, but UTF-8 is more than 7-bit ASCII text, so a document that happens to contain only ASCII characters gives no hint about how other bytes should be read. For reliable processing, XML documents that use character encodings other than UTF-8 or UTF-16 must include an encoding declaration in the XML declaration. This makes it possible for a parser to read the characters correctly or to report an error when it cannot process an encoding.
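The relationship between ASCII and UTF-8 can be checked directly. The following Python sketch (the sample strings are assumptions made for the example) shows that pure ASCII bytes decode as UTF-8, while the ISO-8859-1 bytes for an accented character do not.

```python
# Pure ASCII bytes are also valid UTF-8.
ascii_bytes = "plain text".encode("ascii")
ascii_as_utf8 = ascii_bytes.decode("utf-8")

# The same text with an accented character differs between encodings.
latin1_bytes = "café".encode("iso-8859-1")  # one byte for é
utf8_bytes = "café".encode("utf-8")         # two bytes for é

# ISO-8859-1 bytes outside the ASCII range are generally not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

print(ascii_as_utf8, latin1_bytes, utf8_bytes, latin1_is_valid_utf8)
```

This is why a parser that assumes UTF-8 needs an encoding declaration before it can read an ISO-8859-1 document reliably.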

The XML declaration is written using only basic ASCII characters; therefore parsers can read its contents even if the rest of the document uses a very different encoding. The encoding declaration significantly increases the likelihood that documents with encodings other than UTF-8 and UTF-16 are interpreted correctly.
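A minimal sketch of this idea in Python follows. The helper name sniff_declared_encoding and the regular expression are hypothetical, written only for this illustration; the sketch handles ASCII-compatible encodings only, whereas a real parser also considers byte order marks and encodings such as UTF-16.

```python
import re

# Hypothetical pattern: matches an encoding pseudo-attribute inside a leading XML declaration.
_DECLARATION = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def sniff_declared_encoding(data: bytes):
    """Return the declared encoding name, or None if no declaration is found."""
    match = _DECLARATION.match(data)
    return match.group(1).decode("ascii") if match else None

# The declaration is plain ASCII even though the content is Shift_JIS.
document = '<?xml version="1.0" encoding="Shift_JIS"?><t>日本</t>'.encode("shift_jis")
declared = sniff_declared_encoding(document)
text = document.decode(declared)
print(declared, text)
```

Because the declaration itself uses only ASCII characters, it can be read from the raw bytes before the encoding of the rest of the document is known.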

The following is an example of a simple XML declaration:

<?xml version="1.0"?>

The following XML declaration uses the optional encoding and standalone attributes:

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>

The value of the encoding attribute must name a legal character encoding scheme. The value of the standalone attribute is a string that can be either yes or no. For more information on the standalone attribute and the meaning of its values, see XmlDeclaration.Standalone Property.
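To see the effect of the encoding declaration, the following Python sketch parses the same ISO-8859-1 bytes with and without a declaration, using the standard library's xml.etree.ElementTree. The element name and text are assumptions made for the example, and other parsers may report the failure differently.

```python
import xml.etree.ElementTree as ET

body = "<name>café</name>"

# The same content, encoded as ISO-8859-1, with and without an encoding declaration.
declared = (
    '<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>' + body
).encode("iso-8859-1")
undeclared = body.encode("iso-8859-1")

# With the declaration, the parser decodes the accented character correctly.
parsed_text = ET.fromstring(declared).text

# Without it, the parser assumes UTF-8 and rejects the byte sequence.
try:
    ET.fromstring(undeclared)
    undeclared_parsed = True
except ET.ParseError as error:
    undeclared_parsed = False
    print("Parse failed:", error)

print(parsed_text, undeclared_parsed)
```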

Some transactions, for example those carried over HTTP and e-mail protocols, also provide information about character encodings. Microsoft Internet Explorer uses that information in document processing. That information is not available, however, when you load an XML document from a local hard drive or a file server.
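For HTTP and e-mail, the encoding information typically travels in the charset parameter of the Content-Type header. The following Python sketch extracts it with the standard library's email.message module; the helper name charset_from_content_type and the header values are hypothetical, and this is not a description of how Internet Explorer does it.

```python
from email.message import Message

def charset_from_content_type(header_value: str):
    """Return the charset parameter of a Content-Type header value, or None."""
    message = Message()
    message["Content-Type"] = header_value
    # get_content_charset() returns the charset in lowercase, or None if absent.
    return message.get_content_charset()

with_charset = charset_from_content_type("text/xml; charset=ISO-8859-1")
without_charset = charset_from_content_type("text/xml")
print(with_charset, without_charset)
```

A document loaded from a local file has no such header, which leaves the XML declaration as the only source of encoding information.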

See Also

Concepts

Character Encoding of XML Names and Conversion of XML Data Types

Encode and Decode XML Element and Attribute Names

Conversion of XML Data Types

Reference

XmlConvert