You Are Not World-Ready If...

Your Web-based file encoding is other than UTF-8

Unless your files or customers are monolingual (i.e., one language only), UTF-8 is the recommended encoding for:

  • Web pages and sites containing non-European or multi-lingual text
  • Embedded HTML files
  • HTML-based help files
  • Server-based services

Figure: Japanese page not encoded in UTF-8

UTF-8 is simply a different transformation format, or representation, of Unicode characters. Windows encodes Unicode as UTF-16 little endian; UTF-8 maps the same repertoire of characters to a different sequence of bytes.
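As a minimal sketch of this point, the following Python snippet encodes one string in both forms and checks that the bytes differ while the characters stay the same. The sample string and variable names are illustrative assumptions and do not come from the original guidance.

```python
# Illustrative sample: an ASCII letter, an accented Latin letter,
# a Japanese character, and a character outside the 16-bit range.
text = "A\u00e9\u65e5\U0001F600"

utf16 = text.encode("utf-16-le")  # UTF-16 little endian, as used by Windows
utf8 = text.encode("utf-8")

print(utf16.hex(" "))
print(utf8.hex(" "))

# The byte sequences differ, but both decode to the same characters.
assert utf16 != utf8
assert utf16.decode("utf-16-le") == utf8.decode("utf-8") == text

# Converting UTF-8 -> UTF-16 -> UTF-8 returns the original bytes.
as_utf16 = utf8.decode("utf-8").encode("utf-16-le")
round_trip = as_utf16.decode("utf-16-le").encode("utf-8")
assert round_trip == utf8
```

The two encodings are interchangeable views of the same text, which is why conversion between them is lossless.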

Encoding in UTF-8 prevents incorrect code page conversions when these files are localized, or when the system code page does not match the language of the HTML. In fact, UTF-8 has become the default encoding for XML, the Microsoft .NET Framework, and other emerging web technologies.
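The kind of wrong code page conversion described above can be reproduced with a short Python sketch. It assumes a Japanese page saved in code page 932 and a client that decodes it with the Western European code page 1252; the sample text and variable names are hypothetical.

```python
text = "\u65e5\u672c\u8a9e"  # illustrative sample: "Japanese language" in Japanese

# A legacy page stored in code page 932 (the Windows variant of Shift_JIS).
legacy_bytes = text.encode("cp932")

# A client that assumes code page 1252 misreads the same bytes.
# errors="replace" substitutes U+FFFD for any byte cp1252 leaves undefined.
garbled = legacy_bytes.decode("cp1252", errors="replace")
print(garbled)

# UTF-8 bytes that are decoded as UTF-8 give the same text on every system,
# whatever the system code page is.
utf8_bytes = text.encode("utf-8")
restored = utf8_bytes.decode("utf-8")

assert garbled != text
assert restored == text
```

The legacy bytes are only readable when the reader guesses the right code page, whereas a file that is stored and declared as UTF-8 needs no guess.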

Advantages of UTF-8:

  • Supported by Internet Explorer 4.0 and later, as well as Netscape Navigator 4.0 and later.
  • Character values from 0000 to 007F (the US-ASCII repertoire) are encoded as the single octets 00 to 7F (the 7-bit US-ASCII values), and every octet of a multi-octet sequence has its high bit set. As a result, US-ASCII octets never appear in a UTF-8 encoded character stream except as the characters they represent. This provides compatibility with file systems and other software that parse based on US-ASCII values but are transparent to other values.
  • Round-trip conversion between UTF-8 and UTF-16 is easy. To convert between the two in Win32 programming, call MultiByteToWideChar (UTF-8 to UTF-16) or WideCharToMultiByte (UTF-16 to UTF-8) with the CodePage parameter set to CP_UTF8.
  • The first octet of a multi-octet sequence indicates the number of octets in the sequence:
    Unicode Range              UTF-8 Encoded Bytes
    0x00000000 - 0x0000007F    0xxxxxxx
    0x00000080 - 0x000007FF    110xxxxx 10xxxxxx
    0x00000800 - 0x0000FFFF    1110xxxx 10xxxxxx 10xxxxxx
    0x00010000 - 0x0010FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
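
The bit patterns in the table above can be turned into a small encoder. The following Python sketch is for illustration only: the function names `utf8_encode` and `sequence_length` are hypothetical, the surrogate check is an addition that the table does not show, and production code would normally rely on a library routine such as `str.encode("utf-8")`.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point using the bit patterns in the table."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption beyond the table: surrogates are not valid in UTF-8.
        raise ValueError("surrogate code points are not encoded in UTF-8")
    if code_point <= 0x7F:        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def sequence_length(first_octet: int) -> int:
    """Read the sequence length from the leading bits of the first octet.

    This only inspects the bit pattern; it is not a full UTF-8 validator.
    """
    if first_octet < 0x80:
        return 1
    if first_octet >> 5 == 0b110:
        return 2
    if first_octet >> 4 == 0b1110:
        return 3
    if first_octet >> 3 == 0b11110:
        return 4
    raise ValueError("not a valid first octet")


# One sample code point from each row of the table.
for cp in (0x41, 0xE9, 0x65E5, 0x1F600):
    encoded = utf8_encode(cp)
    assert encoded == chr(cp).encode("utf-8")         # matches the built-in codec
    assert sequence_length(encoded[0]) == len(encoded)
    if len(encoded) > 1:
        # No octet of a multi-octet sequence falls in the US-ASCII range.
        assert all(octet >= 0x80 for octet in encoded)
```

The loop checks one code point per table row against Python's built-in codec, and it also confirms the US-ASCII property from the list above: every octet of a multi-octet sequence is 0x80 or higher.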

Visit the Unicode Consortium Web site to learn more about Unicode and UTF-8.
