Unless your files or customers are monolingual (i.e., one language only), UTF-8 is the recommended encoding for:
- Web pages and sites containing non-European or multi-lingual text
- Embedded HTML files
- HTML-based help files
- Server-based services
[Figure: Japanese page not encoded in UTF-8]
UTF-8 is simply another encoding form (transformation format) of Unicode. It represents exactly the same repertoire of characters as UTF-16, which Windows uses internally in its little-endian form.
Encoding in UTF-8 prevents incorrect code page conversions once these files are localized, or when the system code page does not match the language of the HTML. In fact, UTF-8 has become the default encoding scheme for XML, the Microsoft .NET Framework, and other emerging web technologies.
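The wrong code page conversion described above can be reproduced in a few lines. The following is a minimal sketch in Python, not taken from the original article; the sample string and the choice of Shift_JIS and Windows-1252 as the mismatched code pages are assumptions made for illustration.

```python
# Sample text (an assumption for this sketch): the word "Japanese" in Japanese,
# written with escapes so the source file itself stays pure ASCII.
text = "\u65e5\u672c\u8a9e"

# Bytes as a Japanese system might store them in a legacy code page.
legacy_bytes = text.encode("shift_jis")

# A system whose code page is Western European misinterprets those bytes.
# errors="replace" guards against octets that Windows-1252 leaves undefined.
misread = legacy_bytes.decode("cp1252", errors="replace")

# The same text stored as UTF-8 round-trips exactly, because decoding UTF-8
# does not consult the system code page.
utf8_bytes = text.encode("utf-8")
restored = utf8_bytes.decode("utf-8")

print(ascii(misread))      # garbled characters, shown as ASCII escapes
print(restored == text)
```

The misread string contains unrelated Western European characters, which is the kind of corruption shown on a Japanese page that is not encoded in UTF-8.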
Advantages of UTF-8:
- Supported by Internet Explorer 4.0 and later, as well as Netscape Navigator 4.0 and later.
- Character values from U+0000 to U+007F (the US-ASCII repertoire) correspond to octets 00 to 7F (7-bit US-ASCII values). Octets in this range never appear as part of a multi-octet sequence, so a US-ASCII value in a UTF-8 encoded stream always stands for the US-ASCII character itself. This provides compatibility with file systems and other software that parse on US-ASCII values but pass other values through unchanged.
- Round-trip conversion between UTF-8 and UTF-16 is easy. In Win32 programming, set the CodePage parameter to CP_UTF8 when calling WideCharToMultiByte (UTF-16 to UTF-8) or MultiByteToWideChar (UTF-8 to UTF-16).
- The first octet of a multi-octet sequence indicates the number of octets in the sequence:
|Unicode Range|UTF-8 Encoded Bytes|
|0x00000000 - 0x0000007F|0xxxxxxx|
|0x00000080 - 0x000007FF|110xxxxx 10xxxxxx|
|0x00000800 - 0x0000FFFF|1110xxxx 10xxxxxx 10xxxxxx|
|0x00010000 - 0x0010FFFF|11110xxx 10xxxxxx 10xxxxxx 10xxxxxx|
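The properties listed above can be checked with a short program. The sketch below, in Python, is illustrative only: the function names `utf8_encode`, `sequence_length`, and `utf16le_to_utf8` are hypothetical, and Python's built-in codecs stand in for the Win32 conversion functions. It builds the octets directly from the bit layout in the table, reads the sequence length from the first octet, and converts UTF-16 little-endian data to UTF-8.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point using the bit layout from the table."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0x7F:            # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:           # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:          # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),          # 11110xxx + 3 more
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def sequence_length(first_octet: int) -> int:
    """Return the number of octets in a sequence, given its first octet."""
    if (first_octet & 0x80) == 0x00:
        return 1
    if (first_octet & 0xE0) == 0xC0:
        return 2
    if (first_octet & 0xF0) == 0xE0:
        return 3
    if (first_octet & 0xF8) == 0xF0:
        return 4
    raise ValueError("continuation octet or invalid first octet")


def utf16le_to_utf8(data: bytes) -> bytes:
    """Convert UTF-16 little-endian bytes to UTF-8 via Python's codecs."""
    return data.decode("utf-16-le").encode("utf-8")


if __name__ == "__main__":
    # One sample per table row: U+0041, U+00E9, U+8A9E, U+1F600.
    for ch in ("A", "\u00e9", "\u8a9e", "\U0001F600"):
        encoded = utf8_encode(ord(ch))
        # The hand-built octets agree with the built-in encoder.
        assert encoded == ch.encode("utf-8")
        # The first octet alone gives the sequence length.
        assert sequence_length(encoded[0]) == len(encoded)
        # Multi-octet sequences never contain US-ASCII octets (00 to 7F).
        assert len(encoded) == 1 or all(b >= 0x80 for b in encoded)
        # UTF-16 to UTF-8 and back returns the original character.
        utf8 = utf16le_to_utf8(ch.encode("utf-16-le"))
        assert utf8.decode("utf-8") == ch
        print(f"U+{ord(ch):04X} -> {encoded.hex(' ')}")
```

Every octet of a multi-octet sequence has its high bit set, which is why software that scans only for US-ASCII values can pass UTF-8 text through unchanged.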
Visit the Unicode Consortium Web site to learn more about Unicode and UTF-8.