In today's marketplace, where any company can have a Web site that can be seen in virtually any country in the world, it is still not easy to create a worldwide presence. Users prefer a site that has been localized into their own language and will often choose to purchase products from a site in their language rather than from one that isn't. This preference is actually stronger with less technical users, and if your Web site doesn't appeal to many of those customers, you might be losing business you didn't even know existed.
This article discusses the Trigeminal Software Web site at http://www.trigeminal.com and the issues faced in developing and maintaining this site. Trigeminal.com is localized into between four and 48 languages, depending on which page you are viewing. It uses a completely extensible framework that allows any number of languages to be added, defaults to the user's preferred language (yet lets the user override that choice at any time to view whatever languages are available), and makes use of both static and dynamic content.
I'll explain the overall framework for the site and how the site manages language-specific content. Then I'll discuss implementing and maintaining an international site, whether the site should use frames, which character set for multibyte languages is best, whether the site should use static or dynamic pages, what database should be used for storing dynamic content, and issues to be resolved if your platform isn't Microsoft® Windows® 2000.
The Site Framework
Parts of the trigeminal.com site use frames. (Later I'll discuss the issues around implementing a site with frames so you can decide whether this is the right choice for your site.) Some pages use a main frame with at least one page in that main frame. A smaller frame at the bottom of the screen contains the list of languages supported by the given page. This smaller page, lang.asp, is the real workhorse of the site. It performs all of the following:
- Determines the language to use for displaying the list of language names in the floating frame at the bottom of the screen.
- Determines what language versions are available for a given page.
- Creates hyperlinks for each page that exists.
- Chooses the page to display if a specific language is requested and a version of the page is available for that language. If the requested language does not have a corresponding page, it defaults to English.
- If the user does not request a specific language, lang.asp uses the language determined via parsing the HTTP_ACCEPT_LANGUAGE variable (described below) to choose the page to display if a version for that page exists. If the detected language does not have a corresponding page, the page displayed defaults to English.
- Sets the server's codepage and the client's character set so that data will be displayed properly.
- Puts the client-side JScript® code in place so that when the client processes this page, the correct page will be selected.
The language to use is determined by parsing the HTTP_ACCEPT_LANGUAGE variable. This is handled in the UiLcid function (see Figure 1). UiLcid handles the fact that multiple languages can be selected by picking the first supported language that is found in the HTTP_ACCEPT_LANGUAGE variable. This function combines several languages together (such as all the versions of Arabic and Spanish) on the assumption that you will only have one of them. If you need to support all the different versions of the Arabic or Spanish languages, for example, then you can break them out into individual lines of the select case.
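The first-supported-language rule can be sketched in a few lines. This is a Python illustration rather than the site's VBScript UiLcid function (which returns a locale ID and is shown in Figure 1). The SUPPORTED table, the function name pick_language, and the use of directory names as return values are all assumptions made for the example.

```python
# Hypothetical table: primary language subtag -> directory name used by the site.
SUPPORTED = {"en": "en", "es": "es", "ar": "ar", "he": "he"}
DEFAULT_LANGUAGE = "en"


def pick_language(accept_language, supported=SUPPORTED, default=DEFAULT_LANGUAGE):
    """Return the first supported language listed in an Accept-Language value."""
    for item in accept_language.split(","):
        # Drop any ";q=..." weight; the header order is used as-is.
        tag = item.split(";", 1)[0].strip().lower()
        if not tag:
            continue
        # Fold regional variants together: "es-MX" and "es-AR" both become "es".
        primary = tag.split("-", 1)[0]
        if primary in supported:
            return supported[primary]
    return default


print(pick_language("es-MX,es;q=0.8,en-US;q=0.5"))  # es
print(pick_language("ja,ar-EG;q=0.7"))              # ar
print(pick_language(""))                            # en
```

Folding every regional variant into its primary subtag mirrors the grouping described above. Supporting the variants separately would mean adding full tags such as "es-mx" to the table and checking them before the primary subtag.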
The code in lang.asp also determines what language versions are available for a given page since the site may not have all pages localized for all languages. This is implemented with the FileSystemObject and works because lang.asp assumes that the Web site will be structured in a particular way. The layout for the site (in the case where only two languages are supported) is shown in Figure 2. Any time there is a localized version of a particular page, it will be in the language-specific directory.
Figure 2 Localized Site Architecture
Note that in Figure 2, the pages are all mirrored in the language-specific directories. Although all of the pages show up for both languages, you could easily be missing pages in a specific language if those pages had not been localized. The code in lang.asp checks to see which pages are actually present.
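The presence check can be illustrated with a short Python sketch (the site itself does this in ASP with the FileSystemObject). It assumes one directory per language directly under the site root, named by a language code such as "en" or "de". Those directory names and the function names are hypothetical, since Figure 2 defines the real layout.

```python
import tempfile
from pathlib import Path


def available_languages(site_root, page_name):
    """List the language directories under site_root that contain page_name."""
    root = Path(site_root)
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and (entry / page_name).is_file()
    )


def choose_page(site_root, page_name, requested, default="en"):
    """Return the page for the requested language, falling back to the default."""
    languages = available_languages(site_root, page_name)
    language = requested if requested in languages else default
    return Path(site_root) / language / page_name


# Build a throwaway site in which about.html has not been localized into German.
with tempfile.TemporaryDirectory() as tmp:
    layout = {"en": ["index.html", "about.html"], "de": ["index.html"]}
    for lang, pages in layout.items():
        (Path(tmp) / lang).mkdir()
        for page in pages:
            (Path(tmp) / lang / page).write_text("placeholder", encoding="utf-8")

    print(available_languages(tmp, "index.html"))
    print(available_languages(tmp, "about.html"))
    print(choose_page(tmp, "about.html", "de").relative_to(tmp))
```

Because the list is rebuilt from the file system on each request, adding a language only requires dropping the translated page into its directory.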
In the case of frameless pages, the same architecture is used, but a redirect moves to the localized content. (On Windows 2000 I use Server.Execute, which allows me to completely hide the site's architecture from people viewing its content. Unfortunately, Server.Execute is a call specific to version 5.0 of Microsoft Internet Information Services, so you can't do this with Windows® NT 4.0.)
For the frame pages, server-side scripting takes care of building the list and setting up links, while the client side does the actual navigation work. In the frameless pages, everything is done on the server.
The use of locale IDs (LCID) also makes it easy to provide localized content, for instance specific date, time, currency, or number formats for a given locale. You can read more about this in the Microsoft Knowledge Base article Q229690; I use a slightly more complex scheme than the one that's described in the Knowledge Base article.
Frames versus Server-side Includes
Choosing whether to use frames is an important decision for international sites. Many people dislike using frames for several reasons. First, some browsers, including a few on handheld devices, do not support frames. Second, frames require at least three calls to the server: one for the frameset and one for each of the two frames. Third, client-side scripting is required both on the frameset page itself (which sends the information to lang.asp to determine the language and page to process), and in lang.asp (to make the actual page selection). If the user has scripting disabled, then the site will not work properly.
For these reasons, using server-side includes is often a better solution. If you implement the site with server-side includes, the structure of the site can be the same, but all script will be server-side ASP instead. The .inc files containing language-specific information will be chosen dynamically based on script that runs on the server.
Using includes means more work to make sure the language list remains at the bottom of the visible browser window instead of at the bottom of the page. If you decide to present the language list as a movable <div>, you will need client-side script again and lose the benefit of avoiding it. On the other hand, you'll avoid the extra round-trips to the server with server-side includes. The worst thing that will happen if scripts are turned off in the browser is that users will have to scroll to the bottom of the page to see the language list; the functionality otherwise stays intact.
Choosing Character Sets
The choice of character set (or language grouping) to use for specific pages can be a difficult one. The Microsoft definition of Unicode is actually UCS-2, which defines two bytes per character. This standard applies to the Unicode that Microsoft uses in the kernel for Windows NT 4.0 and Windows 2000, COM, and everywhere wide functions are supplied. The UTF-7 and UTF-8 encodings both support Unicode text as well, but Windows considers them multibyte encodings. I point this out so that people who are used to thinking of Unicode as UTF-8 do not think I am being platform-provincial.
Since I was using a Windows NT 4.0-based server, Unicode seemed like the intuitive choice. The problem, however, is that FrontPage® 2000 will not let you even look at Unicode pages, and many browsers and client operating systems will not support them either. The fact that Microsoft Internet Information Services (IIS) 4.0 and 5.0 will not support Unicode pages was the last straw. I had to find another option.
My next thought was to use UTF-8 since a much broader range of browsers support this character encoding, and FrontPage 2000 handles it well. (I did not really consider UTF-7 since it can require up to five bytes per character, while UTF-8 requires at most three bytes for any UCS-2 character.) If you are not just working on Microsoft platforms, you will probably find that most people consider UTF-8 to be the best standard to use for Unicode data.
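The byte counts are easy to check. The following Python sketch (an illustration, not part of the site) encodes three isolated sample characters chosen for the example. "utf-16-le" stands in for the two-byte UCS-2 form, which is an assumption that holds for these characters.

```python
# Sample characters picked for illustration only.
SAMPLES = {
    "Latin capital A": "A",
    "Hebrew alef": "\u05d0",
    "Devanagari ha": "\u0939",
}


def encoded_sizes(ch):
    """Return the number of bytes one character needs in each encoding."""
    return {
        "utf-16-le": len(ch.encode("utf-16-le")),
        "utf-8": len(ch.encode("utf-8")),
        "utf-7": len(ch.encode("utf-7")),
    }


for name, ch in SAMPLES.items():
    print(name, encoded_sizes(ch))
```

The five-byte UTF-7 figure is the cost of a single non-ASCII character standing alone, because the shift-in and shift-out markers surround it. In a longer run of such characters the markers are shared, so the average per character is lower.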
The only real problem I have with using UTF-8 is that the Microsoft Internet Explorer 5.0 "Auto-download of uninstalled languages" feature only works when the browser knows the language to install based on the page's character set, and UTF-8 is not designed for a specific language. You can work around this by specifically adding calls to the addComponentRequest and doComponentRequest methods. For example, you would use the script shown in Figure 3 to install Hebrew language support, although this can be problematic if you are trying to avoid client-side code.
The full list of component IDs for language support is shown in Figure 4.
Perhaps future versions of Internet Explorer will support an HTML tag such as <content-language> that you can use to make auto-download work properly with UTF-8 pages. (Hopefully, it will be a method that is part of the HTML standard and other browsers will support it as well.) In the meantime, I limited my UTF-8 usage to the Unicode-only languages such as Armenian, Georgian, Hindi, and Tamil in the interest of simplicity.
For the other languages, this left me with ANSI. FrontPage 2000 support for ANSI is excellent, and as long as you set the character set for the page properly, it will look right in all views. Because the browser can tell from an ANSI character set which language support to install, the Internet Explorer language auto-download functionality also works. Users of my Trigeminal Software site frequently send me e-mail saying how impressed they are with this feature of my site. I have to tell them that I have nothing to do with the feature; it's their browser doing all the work.
Most localizers who translate your pages will probably be working in a specific code page, so you will be using the same charset settings that they are. The only real downside to this approach occurs because of a problem with FrontPage 2000. If there are errors in HTML tags, FrontPage 2000 will occasionally ignore the language setting. If this happens and you make any changes to the page (including fixing the broken tag), all of the localized characters will be replaced by the Unicode code points in the &#xxx; format, which will bloat the file tremendously. The only current workaround to this problem is to cut the text to the clipboard, save and close the page, then reopen and paste the text back in (fixing the broken tag before you save again). A fix for this problem is currently planned for the next version of FrontPage.
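To get a feel for the bloat, the Python sketch below compares a short Hebrew word stored in its ANSI code page with the same word written as numeric character references. It is only an illustration: FrontPage is not involved, and the sample word and the choice of code page 1255 are assumptions made for the example.

```python
def as_character_references(text):
    """Replace every non-ASCII character with a decimal &#NNNN; reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


hebrew = "\u05e9\u05dc\u05d5\u05dd"   # four Hebrew letters
native = hebrew.encode("cp1255")      # Hebrew ANSI code page: one byte per letter
escaped = as_character_references(hebrew)

print(len(native))   # 4
print(escaped)       # &#1513;&#1500;&#1493;&#1501;
print(len(escaped))  # 28
```

Each letter grows from one byte to seven here, which is why a fully converted page balloons in size.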
Static versus Dynamic Content
If you have pages on an international site that never (or at least seldom) change, it makes sense to keep them as static HTML. This also applies to static text that appears in ASP pages. Since your localizers will most likely be able to handle static HTML, using the exact file they hand off to you is beneficial. If changes must be made, they can take the static HTML page, modify it, and send it back to you.
If the content changes often, however, it is worth considering the slight overhead of storing most of the content in a database and reading it out at runtime via VBScript. I do this for a downloadable wizard that can be seen at http://www.trigeminal.com/frmrpt2dap.html. Most of the text is completely static, but the version of the wizard will change every time there is a new release, and the wizard's localized list of supported languages will have new entries each time as well. Both the language list and the version number are stored in a database and I read them out at runtime. As I'm writing this article the wizard supports 64 languages, so any technique that keeps me from updating 64 Web pages has got to be a good thing.
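The idea of keeping only the frequently changing values in a database can be sketched as follows. This is a Python illustration with an in-memory SQLite database standing in for Jet or SQL Server. The table names, column names, and rows are placeholders invented for the example, not the site's real schema or data.

```python
import sqlite3


def open_demo_database():
    """Create an in-memory database with placeholder rows (not real site data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE wizard_release (version TEXT NOT NULL);
        CREATE TABLE wizard_language (sort_order INTEGER, name TEXT NOT NULL);
        INSERT INTO wizard_release VALUES ('0.0-placeholder');
        INSERT INTO wizard_language VALUES (1, 'English');
        INSERT INTO wizard_language VALUES (2, 'Deutsch');
        INSERT INTO wizard_language VALUES (3, 'Español');
        """
    )
    return conn


def wizard_summary(conn):
    """Read the values that change per release and merge them into static text."""
    (version,) = conn.execute("SELECT version FROM wizard_release").fetchone()
    names = [
        row[0]
        for row in conn.execute("SELECT name FROM wizard_language ORDER BY sort_order")
    ]
    return f"Version {version}, available in {len(names)} languages: {', '.join(names)}"


conn = open_demo_database()
print(wizard_summary(conn))
conn.close()
```

With this split, a new release means updating two tables rather than editing every localized page.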
The next question is which database to use. My site is designed to use either SQL Server™ 7.0 or Jet, in both cases using the most recent OLE DB providers. I usually run the site off of the Jet database as a proof of concept to show that performance is not hampered by the thread-safe Jet OLE DB provider when you are doing simple querying rather than complex inserts, updates, or deletes.
If you have control over the server where your international site will be hosted, then your choice is easy: use IIS 5.0 and Windows 2000 Server, which includes Microsoft Data Access Components (MDAC) 2.5. Everything is just so much easier when you do this. In many cases, however, you may not get to choose what the server will run. My own ISP, for example, is running a Windows NT 4.0-based box with Service Pack 4 (and it was a one-year battle to get them to upgrade to SP4). If you aren't running on Windows 2000, there are two issues that will affect your site: the version of MDAC running on the server and its implications for character set choices.
If you do not explicitly install MDAC 2.1 or 2.5 on the server, then Jet 4.0 is simply not an option for your database needs (unless you are using SQL Server 7.0, in which case you will have at least MDAC 2.1 automatically). If you do not have at least MDAC 2.1, then any database operations you perform in ASP code will probably be against SQL Server 6.5 or a Jet 3.5x datastore. In this case, you are limited to a single ANSI codepage choice for languages.
The only real workaround to this limitation is to store the Unicode code points in your database (in a comma-delimited list), which you can parse out at runtime with the Split and ChrW functions in your VBScript code. This is obviously not an ideal solution, as it will slow down your pages a bit. If you must use this technique, be sure to close the database connection before you start parsing to minimize the amount of time you are using up database resources. You should also take special care to streamline the ASP code as much as possible. If your site generates any real traffic, then an ISP's refusal to upgrade to at least MDAC 2.1 might be a good reason to look for another ISP.
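The comma-delimited technique looks like this in outline. The sketch is in Python, where str.split and chr play the roles of the VBScript Split and ChrW functions. The function names are hypothetical and the sample text is chosen only for illustration.

```python
def encode_code_points(text):
    """Turn text into a comma-delimited list of decimal code points for storage."""
    return ",".join(str(ord(ch)) for ch in text)


def decode_code_points(stored):
    """Rebuild the text from a comma-delimited list of decimal code points."""
    return "".join(chr(int(field)) for field in stored.split(",") if field.strip())


stored = encode_code_points("\u05e9\u05dc\u05d5\u05dd")
print(stored)  # 1513,1500,1493,1501
print(decode_code_points(stored) == "\u05e9\u05dc\u05d5\u05dd")  # True
```

The stored value is plain ASCII digits and commas, so it survives a single-codepage database unchanged. The cost is the extra parsing on every request, as noted above.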
UTF-7 and UTF-8 Support in ASP
The UTF-8 problem I touched on earlier is one of the most compelling reasons to run Windows 2000 Server. Windows NT 4.0 did not support UTF-8 until SP4, but SP4 and later service packs disabled UTF-8 support in ASP. How frustrating! The explanation I was given for this decision is that it has always been assumed that when IIS makes a WideCharToMultiByte call to convert from Unicode to ANSI, the buffer it creates only needs to be one or two bytes per Unicode character.
For all non-Asian languages, one byte per character is enough; for the Asian languages, DBCS might require two bytes. This assumption fails in the case of UTF-8, where it can take up to three bytes for a given Unicode character. (UTF-7 is even worse, since up to five bytes per character might be needed.) Since the buffers are not guaranteed to be large enough for the data, it is possible you will overrun them and cause page faults on the server. As a result, you can never reliably set Session.CodePage to 65000 (UTF-7) or 65001 (UTF-8). However, there are two workarounds to this problem, one supported and one unsupported.
The supported workaround is to never set Session.CodePage at all, so that IIS never performs the conversion for you. Instead, use Response.BinaryWrite rather than Response.Write to write out the UTF-7 or UTF-8 data directly. Assuming you set the page's Charset properly, the text will be displayed as you would like.
But how do you get the UTF-7 and UTF-8 data in the first place? There are two options. You can create your own COM component that does the actual conversion or use the sample one included with this article. To use this utility, put TsiAtoW.dll on your server, set a reference to it in your Visual Basic® or Visual Basic for Applications (VBA) project, and then just run the following:
Set cnv = Server.CreateObject("TsiAToW.Convert")
' Convert the string in stUnicode to UTF-8 and write
' it out with BinaryWrite so that the IIS 4.0 ASP can
' handle the job
Response.BinaryWrite cnv.WToA(stUnicode, 65001)
Alternatively, you can store the raw data as UTF-8 in the database. Although storing the raw data in the database works, it is not easily maintainable. Therefore, I usually prefer to set Session.CodePage to 65000/65001 and trap for the runtime error. If the error happens, I either default to some other language such as English or use the component solution.
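The principle behind the BinaryWrite workaround is to do the encoding yourself and hand the server raw bytes, while declaring the matching charset. The Python sketch below illustrates that idea only. It is not ASP, and write_page, the header layout, and the sample text are assumptions made for the example.

```python
import io


def write_page(stream, text, charset="utf-8"):
    """Encode the text ourselves and write raw bytes, declaring the same charset."""
    body = text.encode(charset)
    header = f"Content-Type: text/html; charset={charset}\r\n\r\n"
    stream.write(header.encode("ascii"))
    stream.write(body)
    return len(body)


out = io.BytesIO()
# A Georgian greeting; each Georgian letter takes three bytes in UTF-8.
size = write_page(out, "<p>\u10d2\u10d0\u10db\u10d0\u10e0\u10ef\u10dd\u10d1\u10d0</p>")
print(size)
print(out.getvalue().decode("utf-8"))
```

Because the bytes are produced before they reach the output stream, nothing downstream needs to guess how large a conversion buffer to allocate.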
The unsupported workaround is one I found quite by accident, and it is one that my ISP actually uses. I had been using the previous method of conditionally supporting UTF-8 and I expected the attempt to fail on their installation of Windows NT 4.0 and succeed on my Windows 2000-based test server. However, the UTF-8 data worked! My ISP had installed SP4 before installing the Windows NT Option Pack that provides IIS. When you do this, whatever magical method Microsoft used to turn off UTF-7 and UTF-8 support is not present, so your server will work as if it were running Windows 2000. I recommend extreme caution if you are using such a solution. Your mileage may vary, so test your site thoroughly and make sure it works correctly before you go down this route.
Where Do You Go from Here?
Although this article describes a working site, there is obviously much more to the story of implementing and maintaining an international site. Hopefully this first look will get you thinking about the best way to bring your site to the world.
Your site may need to take commercial transactions or otherwise accept information from users that involves inserting or editing data in their local languages. If so, it is extremely important that your e-commerce provider understands the issues of character set support described in this article. The resources mentioned as background information for this article are a place to start.
If you want your site to support search pages, a good choice is using Microsoft Index Server on your server. FrontPage knows how to make use of Index Server if it is on the server. You cannot use the built-in FrontPage search pages because they will only work properly for the default system code page, which makes them useless for international text.
Finding good localizers to do the real translation work for your site is always a challenge, but one place to start is http://www.aquarius.net. Another option is machine-translated text. The small utility on my site that translates a phrase is available from http://translator.go.com. You can also check out one available on http://world.altavista.com.
Deciding which languages are important to your business is often key to its success. You'll want to investigate which targets are worthwhile for providing localized information.