How to Encode XML Data

Article
08/30/2006

By Chris Lovett
Microsoft Corporation
March 2000

Summary: This article explains how character encoding works and specifically how it works in XML and the MSXML DOM.

Contents

Cross-Platform Data Formats
XML and Character Encoding
Character Sets and the MSXML DOM
Creating New XML Documents with MSXML
Conclusion
For More Information

A lot of people have been asking me questions lately about how to make their XML files transfer data properly between different platforms. They create an XML document, type in data, stick a few tags around it, make the tags well-formed, and even put the <?xml version="1.0"?> declaration in for good measure. Then they try to load it, but get an unexpected error message from the Microsoft® XML Parser (MSXML) saying that there's something wrong with their data. This can be frustrating to the new XML author. Shouldn't it just work?

Well, not quite. It's likely that when you receive the unexpected error message from MSXML, the platform that is receiving your data stores it differently than the platform from which you sent it, resulting in character encoding problems.

Cross-Platform Data Formats

Creating cross-platform technologies and allowing different platforms to share data is an area that computer software and hardware industries have struggled with ever since they managed to connect two computers together. Since the early days, things have only become more and more complex with the explosion in the number of different types of computers, different ways of connecting them, and different kinds of data that you might want to share between them.

After decades of research into cross-platform programming technologies, the only real cross-platform solution today (and probably for a long time to come) is that which is achieved by simple standard data formats. The success of the Web was built on exactly these formats. The main thing that passes between Web servers and Web browsers today is HTTP headers and HTML pages, both of which are standard text formats.

In the next few sections, I'll discuss character encoding and standard character sets, Unicode, the HTTP Content-Type header, the HTML Content-Type metatags, and character entities. If you are familiar with these concepts, you can skip ahead to the tips and tricks of encoding XML data for the XML Document Object Model (DOM) programmer. For details, see XML and Character Encoding.

A Lesson in Character Encoding

Standard text formats are built on standard character sets. Remember that all computers store text as numbers. However, different systems can also store the same text using different numbers. The following table shows how a range of bytes is stored, first on a typical computer running Microsoft Windows® using the default code page 1252, and second on a typical Apple® Macintosh® computer using the Macintosh Roman code page.

Byte	Windows	Macintosh
140	Œ	å
229	å	Â
231	ç	Á
232	è	Ë
233	é	È

For example, when your grandma places an order for a new book from https://www.barnesandnoble.com/, she is unaware that her Macintosh computer stores its characters differently than the new Windows 2000 Web server running www.barnesandnoble.com. As she enters your Swedish home address in the ship-to field of the Internet order form, she believes the Internet will correctly deliver the character å (byte value 140 on her Macintosh), unaware that her message will be received and processed by computers that translate byte value 140 as the letter Œ.
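To make the mismatch concrete, here is a minimal sketch that interprets the same byte under both code pages. The two lookup tables hold only the five sample bytes from the table above, and the decodeByte helper is a hypothetical name used for illustration, not part of any library.

```javascript
// Byte-to-character tables copied from the sample table (five bytes only).
var WINDOWS_1252 = { 140: "\u0152", 229: "\u00E5", 231: "\u00E7", 232: "\u00E8", 233: "\u00E9" };
var MAC_ROMAN    = { 140: "\u00E5", 229: "\u00C2", 231: "\u00C1", 232: "\u00CB", 233: "\u00C8" };

// Hypothetical helper: interpret one byte value using the given table.
function decodeByte(byteValue, codePage) {
    var ch = codePage[byteValue];
    if (ch === undefined) {
        throw new Error("Byte " + byteValue + " is not in this sample table");
    }
    return ch;
}

var sentByte = 140;                                  // what the Macintosh stores for å
var meant    = decodeByte(sentByte, MAC_ROMAN);      // "\u00E5" (å)
var received = decodeByte(sentByte, WINDOWS_1252);   // "\u0152" (Œ)
```

The byte never changes in transit. Only the table used to interpret it does, which is why the receiving side needs to know which character set the sender used.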

Unicode

The Unicode Consortium decided it would be a good idea to define one universal code page (using 2 bytes instead of one per character) that covers all the languages of the world so that this mapping problem between different code pages would be gone forever.

So if Unicode solves cross-platform character encoding issues, why hasn't it become the only standard? The first problem is that switching to Unicode sometimes means doubling the size of all your files, which in a network-bound world is not ideal. Some people therefore still prefer to use the older single-byte and multibyte character sets such as ISO-8859-1 to ISO-8859-15, Shift-JIS, EUC-KR, and so forth.

The second problem is that there are still many systems out there that are not Unicode-based at all, which means that on a network, some of the byte values that make up the Unicode characters can cause major problems for those older systems. So Unicode Transformation Formats (UTF) have been defined; they use bit-shifting techniques to encode Unicode characters as byte values that will be "transparent" (or flow through safely) on those older systems.

The most popular of these character encodings is UTF-8. UTF-8 takes the first 128 characters of the Unicode standard (which happen to be the basic Latin characters, A-Z, a-z, and 0-9, and a few punctuation characters) and maps those directly to single byte values. It then applies a bit-shifting technique using the high bit of the bytes to encode the rest of the Unicode characters. The result of all this is that the little Swedish character å (0xE5) becomes the following 2-byte gibberish Ã¥ (0xC3 0xA5). So unless you can do bit shifting in your head, data encoded in UTF-8 is not human readable.
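If you would rather not do the bit shifting in your head, the following sketch shows it in code. The utf8Encode function is my own illustrative helper. It takes a single Unicode code point and does not combine surrogate pairs or reject invalid values, so treat it as a demonstration of the technique rather than a complete encoder.

```javascript
// Encode one Unicode code point (0 to 0x10FFFF) as an array of UTF-8 byte values.
function utf8Encode(codePoint) {
    if (codePoint < 0x80) {
        // Basic Latin: stored unchanged in a single byte.
        return [codePoint];
    }
    if (codePoint < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        return [0xC0 | (codePoint >> 6),
                0x80 | (codePoint & 0x3F)];
    }
    if (codePoint < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (codePoint >> 12),
                0x80 | ((codePoint >> 6) & 0x3F),
                0x80 | (codePoint & 0x3F)];
    }
    // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return [0xF0 | (codePoint >> 18),
            0x80 | ((codePoint >> 12) & 0x3F),
            0x80 | ((codePoint >> 6) & 0x3F),
            0x80 | (codePoint & 0x3F)];
}

var swedishBytes = utf8Encode(0xE5);   // [0xC3, 0xA5]
var latinBytes   = utf8Encode(0x41);   // [0x41], the letter A is unchanged
```

Every byte of a multibyte sequence has its high bit set, which is what keeps those bytes from being mistaken for basic Latin characters on older systems.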

Content-Type Header

Because the older single-byte character sets are still in use, the problem of transferring data is not solved until we also specify what actual character set the data is in. Recognizing this, the Internet e-mail and HTTP protocols groups defined a standard way to specify the character set in the message header Content-Type property. The property specifies a character set from the list of the registered character set names defined by the Internet Assigned Numbers Authority (IANA). A typical HTTP header might contain the following text:

HTTP/1.1 200 OK
Content-Length: 15327
Content-Type: text/html; charset=ISO-8859-1
Server: Microsoft-IIS/5.0
Content-Location: https://www.microsoft.com/Default.htm
Date: Wed, 08 Dec 1999 00:55:26 GMT
Last-Modified: Mon, 06 Dec 1999 22:56:30 GMT

This header indicates to the application that what follows after the header is in the ISO-8859-1 character set.
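As a rough sketch of how a receiving application might pull the character set out of that header, assuming the standard charset= parameter syntax, consider the following. The charsetFromContentType function is a hypothetical helper, and a production parser would need to handle more of the header grammar than this regular expression does.

```javascript
// Return the charset parameter of a Content-Type header value,
// or null when the header does not name one.
function charsetFromContentType(headerValue) {
    var match = /;\s*charset\s*=\s*"?([^\s;"]+)"?/i.exec(headerValue);
    return match ? match[1] : null;
}

var declared = charsetFromContentType("text/html; charset=ISO-8859-1"); // "ISO-8859-1"
var missing  = charsetFromContentType("text/xml");                      // null
```

A null result is the interesting case: the receiver has to fall back on whatever the document says about itself, which is the subject of the next sections.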

Content-Type Metatags

The Content-Type property is optional, and in some applications, the HTTP header information is stripped off and just the HTML itself is passed along. To remedy this, the HTML standards group defined an optional metatag as a way to specify the character set in the HTML document itself, making the HTML document character set self-describing.

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">

In this case, the character set ISO-8859-1 declares that in this particular HTML page, the byte value of 229 means å. This page is now completely unambiguous on any system, and data is not misinterpreted. Unfortunately, because this metatag is optional, it leaves room for error.

Character Entities

Not every system supports every registered character set. For example, I don't think many platforms actually support the IBM mainframe character set called EBCDIC. Windows NT does, but probably not many others, which is most likely why the https://www.ibm.com home page is generating ASCII.

As a backup plan, HTML allows the encoding of individual characters in the page by specifying their exact Unicode character value. These character entities are then parsed independently of the character set, and their Unicode values can be determined unambiguously. The syntax for this is "&#229;" (decimal) or "&#xE5;" (hexadecimal).

XML and Character Encoding

XML borrowed these ideas from HTML and took them even further, defining a completely unambiguous algorithm to determine the character set encoding used. In XML, an optional encoding attribute on the XML declaration defines the character encoding. The following algorithm determines the default encodings:

If the file starts with a Unicode byte-order mark [0xFF 0xFE] or [0xFE 0xFF], the document is considered to be in UTF-16 encoding. Otherwise, it is in UTF-8.

The following are all correct and equivalent XML documents:

Character set or encoding	HTTP header	XML document
ISO-8859-1	Content-Type: text/xml; charset=ISO-8859-1	<test>å</test>
UTF-8	Content-Type: text/xml;	<test>Ã¥</test>
ISO-8859-1	Content-Type: text/xml;	<?xml version="1.0" encoding="ISO-8859-1"?> <test>å</test>
UTF-8 (using character entities)	Content-Type: text/xml;	<test>&#229;</test>
UTF-16 (Unicode with byte-order mark)	Content-Type: text/xml;	ff fe 3c 00 74 00 65 00 73 00 74 00 3e 00 e5 00 ..<.t.e.s.t.>... 3c 00 2f 00 74 00 65 00 73 00 74 00 3e 00 0d 00 <./.t.e.s.t.>... 0a 00
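The UTF-16 row is easier to follow once you see where each byte comes from. This sketch, using hypothetical helper names of my own, writes the little-endian byte-order mark and then the low and high byte of each 16-bit character, assuming the document ends with a carriage return and line feed as the hex dump in the table does.

```javascript
// Encode a string as UTF-16 little endian, preceded by the FF FE byte-order mark.
function toUtf16LeWithBom(text) {
    var bytes = [0xFF, 0xFE];
    for (var i = 0; i < text.length; i++) {
        var unit = text.charCodeAt(i);
        bytes.push(unit & 0xFF);   // low byte first (little endian)
        bytes.push(unit >> 8);     // then the high byte
    }
    return bytes;
}

// Format an array of byte values as two-digit lowercase hex, separated by spaces.
function toHex(bytes) {
    var parts = [];
    for (var i = 0; i < bytes.length; i++) {
        var h = bytes[i].toString(16);
        parts.push(h.length < 2 ? "0" + h : h);
    }
    return parts.join(" ");
}

var dump = toHex(toUtf16LeWithBom("<test>\u00E5</test>\r\n"));
// "ff fe 3c 00 74 00 65 00 73 00 74 00 3e 00 e5 00 3c 00 2f 00 74 00 65 00 73 00 74 00 3e 00 0d 00 0a 00"
```

Notice that å is simply e5 00 here: in UTF-16 its Unicode value 229 is stored directly, with no bit shifting.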

Character Sets and the MSXML DOM

Now that we've discussed various ways to encode characters, let's look at how to load XML documents in the MSXML DOM and the types of error messages you might get when encountering ambiguously-encoded characters. The two main methods of loading XML DOM documents are the LoadXML method and the Load method.

The LoadXML method always takes a Unicode BSTR that is encoded in UCS-2 or UTF-16 only. If you pass in anything other than a valid Unicode BSTR to LoadXML, it will fail to load.

The Load method can take the following as a VARIANT:

Value	Description
URL	If the VARIANT is a BSTR, it is interpreted as a URL.
VT_ARRAY | VT_UI1	The VARIANT can also be a SAFEARRAY containing the raw encoded bytes.
IUnknown	If the VARIANT is an IUnknown interface, the DOM document calls QueryInterface for IStream, IPersistStream, and IPersistStreamInit.

The Load method implements the following algorithm for determining the character encoding or character set of the XML document:

If the Content-Type HTTP header defines a character set, this character set overrides anything in the XML document itself. This obviously doesn't apply to SAFEARRAY and IStream mechanisms because there is no HTTP header.
If there is a 2-byte Unicode byte-order mark, it assumes the encoding is UTF-16. It can handle both big endian and little endian.
If there is a 4-byte Unicode byte-order mark (0xFF 0xFE 0xFF 0xFE), it assumes the encoding is UTF-32. It can handle both big endian and little endian.
Otherwise, it assumes the encoding is UTF-8 unless it finds an XML declaration with an encoding attribute that specifies some other character set (such as ISO-8859-1, Windows-1252, Shift-JIS, and so on).
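The steps above can be sketched in a few lines of script. This is only my simplified model of the decision order, not MSXML's actual implementation: detectXmlEncoding and bytesOf are hypothetical names, the 4-byte byte-order mark check is left out, and the XML declaration is read by treating the first 200 bytes as ASCII.

```javascript
// Demonstration helper: turn an ASCII string into an array of byte values.
function bytesOf(text) {
    var bytes = [];
    for (var i = 0; i < text.length; i++) {
        bytes.push(text.charCodeAt(i) & 0xFF);
    }
    return bytes;
}

// Decide which encoding to use for a document given its raw bytes and,
// if the document arrived over HTTP, the charset from the Content-Type header.
function detectXmlEncoding(bytes, httpCharset) {
    // Step 1: a character set in the HTTP header overrides everything else.
    if (httpCharset) {
        return httpCharset;
    }
    // Step 2: a 2-byte byte-order mark (either byte order) means UTF-16.
    if (bytes.length >= 2) {
        if ((bytes[0] === 0xFF && bytes[1] === 0xFE) ||
            (bytes[0] === 0xFE && bytes[1] === 0xFF)) {
            return "UTF-16";
        }
    }
    // Last step: look for an encoding attribute on the XML declaration.
    var head = "";
    for (var i = 0; i < bytes.length && i < 200; i++) {
        head += String.fromCharCode(bytes[i]);
    }
    var match = /^<\?xml\s[^>]*?encoding\s*=\s*["']([^"']+)["']/.exec(head);
    // No declaration, or no encoding attribute: assume UTF-8.
    return match ? match[1] : "UTF-8";
}

var declaredDoc = bytesOf('<?xml version="1.0" encoding="ISO-8859-1"?><test/>');
var fromDecl   = detectXmlEncoding(declaredDoc, null);                    // "ISO-8859-1"
var fromHeader = detectXmlEncoding(declaredDoc, "Shift-JIS");             // "Shift-JIS"
var fromBom    = detectXmlEncoding([0xFF, 0xFE, 0x3C, 0x00], null);       // "UTF-16"
var fallback   = detectXmlEncoding(bytesOf("<test/>"), null);             // "UTF-8"
```

Passing null for the second argument corresponds to the SAFEARRAY and IStream cases, where there is no HTTP header to consult.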

There are two errors you will see returned from the XML DOM that indicate encoding problems. The first usually indicates that a character in the document does not match the encoding of the XML document:

An invalid character was found in text content.

The ParseError object will tell you exactly where on a particular line this rogue character occurs so that you can fix the problem.

The second error indicates that you started off with a Unicode byte-order mark (or you called the LoadXML method), and then an encoding attribute specified something other than a 2-byte encoding (such as UTF-8 or Windows-1250):

Switch from current encoding to specified encoding not supported.

Alternatively, you could have called the Load method and started off with a single-byte encoding (no byte-order mark), but then it found an encoding attribute that specified a 2- or 4-byte encoding (such as UTF-16 or UCS-4).

The bottom line is that you cannot use the encoding attribute on an XML declaration to switch between a byte-oriented character set such as UTF-8, Shift-JIS, or Windows-1250 and a 2- or 4-byte Unicode encoding such as UTF-16, UCS-2, or UCS-4, because the declaration itself has to use the same number of bytes per character as the rest of the document.
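One way to picture this rule is as a comparison of code-unit widths. The sketch below is only a model of the rule as I have described it, with hypothetical function names and a deliberately short list of encoding names. It is not how MSXML implements the check.

```javascript
// Width in bytes of the basic code unit for a few well-known encoding names.
// Anything not listed is treated as byte-oriented (width 1).
function unitWidth(encodingName) {
    var name = encodingName.toUpperCase();
    if (name === "UTF-16" || name === "UCS-2") {
        return 2;
    }
    if (name === "UTF-32" || name === "UCS-4") {
        return 4;
    }
    return 1;
}

// The parser can honor a declared encoding only if it uses the same
// code-unit width as the encoding it has already started reading in.
function canSwitchEncoding(currentEncoding, declaredEncoding) {
    return unitWidth(currentEncoding) === unitWidth(declaredEncoding);
}

var okSwitch  = canSwitchEncoding("UTF-8", "ISO-8859-1");  // true
var badSwitch = canSwitchEncoding("UTF-16", "UTF-8");      // false
```

The second call corresponds to a document that starts with a byte-order mark (or a string passed to LoadXML) and then declares a byte-oriented encoding.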

Lastly, the IXMLHttpRequest interface provides the following ways of accessing downloaded data:

Methods	Description
ResponseXML	Represents the response entity body as parsed by the MSXML DOM parser, using the same rules as the Load method.
ResponseText	Represents the response entity body as a string. This method blindly decodes the received message body from UTF-8. This is a known problem that should be fixed in the upcoming MSXML Web Release.
ResponseBody	Represents the response entity body as an array of unsigned bytes.
ResponseStream	Represents the response entity body as an IStream interface.

Creating New XML Documents with MSXML

Once the XML document is loaded, you can manipulate that XML document using the DOM without concern for any encoding issues because the document is stored in memory as Unicode. All the XML DOM interfaces are based on COM BSTRs, which are 2-byte Unicode strings. This means you can build an MSXML DOM document from scratch in memory that contains all sorts of Unicode characters and all components will be able to share this DOM in memory without any confusion over the meaning of the Unicode character values. When you save this, however, MSXML will encode all data in UTF-8 by default. For example, suppose you do the following:

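// Build a one-element document in memory (all strings are Unicode BSTRs).
var xmldoc = new ActiveXObject("Microsoft.XMLDOM");
var e = xmldoc.createElement("test");
e.text = "å";
xmldoc.appendChild(e);
// No XML declaration is present, so Save writes UTF-8 by default.
xmldoc.save("foo.xml");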

The following UTF-8 encoded file will result:

<test>Ã¥</test>

Note: The preceding example will only work if you run the code outside the browser environment. Calling the Save method while inside the browser will not produce the same results because of security restrictions.

Even though this looks weird, it is correct. The following test loads up the UTF-8 encoded file and tests whether the UTF-8 is decoded back to the Unicode character value 229. It is:

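var xmldoc = new ActiveXObject("Microsoft.XMLDOM");
// Load synchronously so the document is ready before it is inspected.
xmldoc.async = false;
xmldoc.load("foo.xml");
// 229 (0xE5) is the Unicode value of å.
if (xmldoc.documentElement.text.charCodeAt(0) == 229)
{
    WScript.echo("Yippee - it worked !!");
}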

To change the encoding that the XML DOM Save method uses, you need to create an XML declaration with an encoding attribute at the top of your document as follows:

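var pi = xmldoc.createProcessingInstruction("xml",
                        " version='1.0' encoding='ISO-8859-1'");
// Insert before the first child so the declaration sits at the top of the document.
xmldoc.insertBefore(pi, xmldoc.firstChild);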

When you call the save method, you will then get an ISO-8859-1 encoded file as follows:

<?xml version="1.0" encoding="ISO-8859-1"?>
<test>å</test>

Now, be careful you don't let the XML property confuse you. The XML property returns a Unicode string. If you call the XML property on the DOMDocument object after creating the ISO-8859-1 encoding declaration, you will get the following Unicode string back:

<?xml version="1.0"?>
<test>å</test>

Notice that the ISO-8859-1 encoding declaration is gone. This is normal. The XML property drops it so that you can turn around and call LoadXML with this string and it will work. If the declaration were left in, LoadXML would fail with the error message: "Switch from current encoding to specified encoding not supported."

Conclusion

Hopefully this article has helped explain how character encoding works and specifically how it works in XML and the MSXML DOM. Character set encoding is pretty simple once you understand it, and XML is great because it leaves no room for ambiguity in this respect. The MSXML DOM has a few quirks to watch out for, but it remains a powerful tool that allows you to both read and write any XML encoding.

For More Information

Microsoft MSDN Online Library: XML DOM Reference
Character Encoding Model by Ken Whistler and Mark Davis
IANA Character Sets
Internet Engineering Task Force (IETF) at https://www.ietf.org for a list of RFCs
Microsoft Global Software Development: Compatibility Issues with Mixed Environments