Summary: The key to Web-application internationalization is to know both your application and its future audience. This article aims to answer many of the critical questions that must be answered before an application is internationalized.
When well-known, multinational companies want to sell in China, they make sure that their users understand their content. At the time of this writing, one such company of which I know has 26 localized versions of its online-shopping portal. They and other global corporations have realized that, although you can buy in any language, you must use the customer's language to sell [Morgan et al., 2001]. Challenges in transforming a Web site or a product, so that it can cater to a global audience, can be immense.
This article attempts to highlight the main issues of which a practicing architect would have to be aware upon attempting internationalization of a Web application with both static and dynamic content. Every platform on which an application is built provides a detailed, platform-specific internationalization guide, which should be carefully read and understood.
The article does not attempt to provide solutions for the various issues that it highlights in every layer of the application. However, it does recommend an approach through the choices that are made in our fictitious scenario.
Your company, A. Datum Corporation, is celebrating its fourth consecutive year as the leading Web-based mall-management ASP in the U.S. Your CEO, Scott Bishop, is addressing shareholders with current-year numbers, when someone asks a question that he had anticipated: "With the U.S. market captured and saturated, how do we plan to maintain explosive growth in revenue and profits?"
Scott explains, "By selling to the Chinese, Indians, and Russians. Seventy percent of the shopping malls to be built in the next 10 years will be in these countries. We want to be there when they build them." This is when it all starts. This is when A. Datum embarks on its journey to become a global corporation.
You are the architect of the special task force that is created by Scott and is responsible for achieving this important milestone of releasing the product in China in six months. You know that the effort will affect every part of the existing application and will not be limited only to the presentation layer, as many in the company have been suggesting. From experience, you also know that globalizing an application has two parts, both of which are orthogonal to the application layers (client, presentation, business, and integration): internationalization (i18n) and localization (L10n).
Figure 1. Internationalization and localization are orthogonal to the application layers.
What Is the Current State of the Application's Internationalization?
Embarking on a project without complete information is a recipe for failure. Hence, in the initial kickoff meeting, you help the team to understand the scope and impact of the effort. Areas of the application are divided up between teams, which analyze the application to identify the elements in the application that are affected by locale.
Do You Go "Big Bang" or in a Phased Manner?
You also realize that it is better to approach the internationalization effort in a phased manner, instead of by executing the "Big Bang" approach. You know this, because going in a phased manner will:
· Reduce risk, as well as make it easier to isolate and resolve issues quickly and effectively.
· Make more sense, because development of the application is happening in parallel.
· Enable regression-testing to happen easily, because development of the application is happening in parallel.
Part of the team will work on automating certain parts of the effort by using commercial off-the-shelf (COTS) or homegrown tools. The following are general areas in which you think that this might help:
· Extracting embedded content out to resource bundles
· Scanning code, and identifying use of locale-specific functions, routines, and methods
· Extracting image references
Following your analysis, you define a phased process that is punctuated by regression-testing efforts (see the section titled "Process").
You realize that, although the immediate goal is to release in China, that market is not going to be the only one in which you will eventually deploy.
Internationalization of the application should consider the following character-set features:
· The database must support the character set of the data that is coming in. The choice of character set must consider future language requirements. The ISO 8859 character set supports English and most of the western European languages, but it does not support Chinese. Consequently, you choose Unicode, which provides unrestricted multilingual support.
· The application should be refactored appropriately, so that localization in various languages does not result in code change and recompilation. After this has been done, the time-to-market in new geographies will be reduced tremendously.
· The external interfaces of the application should be able to handle data in every character set.
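The character-set choice in the first item above can be checked mechanically. The following is a minimal sketch, not part of the A. Datum code base: the class name `CharsetCoverage` and the sample strings are assumptions for illustration. It asks a charset encoder whether it can represent a given piece of text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetCoverage {
    /** Returns true if every character of the text can be encoded in the given charset. */
    static boolean canRepresent(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String english = "Purchase Order";
        // Two Chinese characters, written as escapes so that the source-file
        // encoding does not matter.
        String chinese = "\u8ba2\u5355";

        System.out.println(canRepresent(english, StandardCharsets.ISO_8859_1));
        System.out.println(canRepresent(chinese, StandardCharsets.ISO_8859_1));
        System.out.println(canRepresent(chinese, StandardCharsets.UTF_8));
    }
}
```

A check of this kind could be run against sample data for each target market before the database character set is fixed.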
You determine that, because existing data is encoded in ASCII, which is a subset of Unicode, its migration into a database with Unicode support is not an issue. For data in other encodings, tools are available from database vendors for exporting and importing data to switch its character set.
Every character in English is encoded within a single byte. Hence, a database column of width CHAR (10) holds 10 characters in English or in any language that is encoded in ISO 8859. In a multibyte encoding such as UTF-8, however, a Russian character takes two bytes and a Chinese character typically takes three. Consequently, you give the go-ahead to increase the size of all character fields in the database to at least three times their current size, to accommodate Asian and other multibyte languages.
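The arithmetic behind the column-size decision can be seen by counting encoded bytes. This sketch assumes the database stores text as UTF-8 and measures column width in bytes; the class name `Utf8Width` is hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Width {
    /** Number of bytes the text occupies when encoded as UTF-8. */
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Bytes("A"));      // Latin letter: 1 byte
        System.out.println(utf8Bytes("\u0416")); // Cyrillic letter: 2 bytes
        System.out.println(utf8Bytes("\u4e2d")); // Chinese character: 3 bytes
    }
}
```

Under these assumptions, a ten-character Chinese value needs 30 bytes, which is why a byte-sized CHAR (10) column is too small for it.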
The database contains business logic inside of procedures and functions. Modifications to column sizes can result in this code being refactored. However, specific database vendors might provide certain features that can limit the amount of modifications that are needed to accommodate these changes (for example NLS_LENGTH_SEMANTICS in Oracle 9i [Oracle, 2005]).
The business layer is where most of your application code lies. It broadly encompasses the application server/middleware/MOM/ESB/processing engines.
Because your application is J2EE-based, it is inherently Unicode-compliant. Most of the application development platforms that are available today support Unicode.
The application currently assumes an en_US locale (U.S. English, as indicated in Java properties); every user in the application defaults to this. In a global scenario, locale negotiation would determine the user's locale. There are various ways in which this can be done for a Web application:
· Deduce the locale from the Accept-Language HTTP header.
· Provide separate application entry points for different locales.
· Store the locale as a user preference, and service all requests to the user based on that preference.
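Two of the options above can be combined: use the stored preference when there is one, and otherwise fall back to the Accept-Language header. The sketch below is one possible arrangement, not the article's implementation. The class name, the list of supported locales, and the en_US fallback are assumptions. It relies on `Locale.LanguageRange.parse` and `Locale.filter`, which are available in Java 8 and later.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class LocaleNegotiator {
    // Hypothetical set of locales for which the application has been localized.
    static final List<Locale> SUPPORTED = Arrays.asList(
            Locale.US, Locale.SIMPLIFIED_CHINESE, Locale.forLanguageTag("ru-RU"));

    /**
     * Picks a locale from a stored user preference (a language tag such as
     * "zh-CN", or null) and the raw Accept-Language header value (or null).
     */
    static Locale negotiate(String storedPreference, String acceptLanguage) {
        if (storedPreference != null && !storedPreference.isEmpty()) {
            Locale preferred = Locale.forLanguageTag(storedPreference);
            if (SUPPORTED.contains(preferred)) {
                return preferred;
            }
        }
        if (acceptLanguage != null && !acceptLanguage.isEmpty()) {
            // parse() orders the ranges by their q-weights, highest first.
            List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(acceptLanguage);
            List<Locale> matches = Locale.filter(ranges, SUPPORTED);
            if (!matches.isEmpty()) {
                return matches.get(0);
            }
        }
        return Locale.US; // the application's current default
    }

    public static void main(String[] args) {
        System.out.println(negotiate(null, "ru,en;q=0.5"));
        System.out.println(negotiate("zh-CN", "en-US"));
        System.out.println(negotiate(null, "fr-FR"));
    }
}
```

In a servlet, the header value would come from the request object. It is passed in as a plain string here so that the sketch runs without a container.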
Multilingual data has created new requirements/validations in your code. These requirements relate to locale-specific validations, currency conversions, and the exporting and importing of data. For instance, a CSV file import/export will have problems for a French locale, because the decimal separator there is a comma and not a period (that is, 4.5 becomes 4,5), so the value collides with the field delimiter. You realize that fixes to these problems would have to be determined on a case-by-case basis, and that these are more of design issues.
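The decimal-separator problem is easy to reproduce with the platform's own formatter. The following sketch uses `java.text.NumberFormat`; the class and method names are assumptions for illustration.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

final class DecimalSeparators {
    /** Formats a number using the conventions of the given locale. */
    static String format(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    /** Parses text using the conventions of the given locale. */
    static double parse(String text, Locale locale) {
        try {
            return NumberFormat.getNumberInstance(locale).parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number for " + locale + ": " + text, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(format(4.5, Locale.US));     // period as decimal separator
        System.out.println(format(4.5, Locale.FRANCE)); // comma as decimal separator
        System.out.println(parse("4,5", Locale.FRANCE));
    }
}
```

One design option that follows from this is to write machine-readable exports in a single fixed, locale-neutral number format and to apply locale-specific formatting only when values are displayed.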
The most important aspect of any Web application is its presentation; this is what the users interact with, and it has the maximum impact on their perception of the application. Your application presents two types of content to the user.
Static content includes help files, pages about terms and conditions, images, and so on. The best approach for these is to maintain separate copies of them per language, and have the application pick the appropriate version depending upon the user's locale. This content best resides inside of a content-management system (CMS), and is best served through HTTP servers like Apache or Microsoft Internet Information Services (IIS). The main task here is to decide on a directory structure that keeps static content for each locale separate and easily maintainable.
Supporting multiple locales for dynamic content means that the internationalization architecture must:
· Treat text as a resource. This resource must be accessible to a translator easily.
· Automatically render entities, such as numeric and monetary values, according to locale.
· Allow groups of templates to be treated as a unit, to support different page designs for different locales.
You list the following areas as those in which most of the work will lie, in the presentation layer.
The internationalized application must treat text and images (images with text) as dynamically generated data. Existing textual content has to be extracted into resource bundles. Your team has developed certain in-house tools that can scan the existing code base and perform the extraction. You suggest enhancing the tool to replace the extracted occurrence with the result of a call to the resource bundle by using a generated key. These resource bundles will be the targets of localization efforts in the various user languages.
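After extraction, a page asks a resource bundle for its text by key instead of embedding the text. The sketch below shows that lookup in miniature. The key `po.number.label`, the class name, and the Russian wording are illustrative assumptions, and the bundles are built from inline strings only to keep the example self-contained. In a deployed application, the entries would more likely live in per-locale properties files loaded with `ResourceBundle.getBundle`.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

final class BundleLookup {
    /** Builds a small in-memory bundle for the locale's language (English by default). */
    static ResourceBundle bundleFor(Locale locale) {
        // Illustrative Russian translation, written with escapes so that the
        // source-file encoding does not matter.
        String properties = "ru".equals(locale.getLanguage())
                ? "po.number.label=\u041d\u043e\u043c\u0435\u0440 \u0437\u0430\u043a\u0430\u0437\u0430\n"
                : "po.number.label=Purchase Order Number\n";
        try {
            return new PropertyResourceBundle(new StringReader(properties));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Looks up the text for a generated key in the bundle for the user's locale. */
    static String label(String key, Locale locale) {
        return bundleFor(locale).getString(key);
    }

    public static void main(String[] args) {
        System.out.println(label("po.number.label", Locale.US));
        System.out.println(label("po.number.label", Locale.forLanguageTag("ru-RU")));
    }
}
```

Because translators only ever touch the bundle entries, adding a language under this arrangement does not require recompiling the page code.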
Each written language has different characters, and they take up a different amount of real estate on the screen. Hence, it is possible, after translating to Russian, that "Purchase Order Number" will not fit in the current 100-pixel width that is defined for its label. You need a way, then, to externalize (or parameterize) the screen layout per locale. You make an informed decision to use HTML DIV-based layouts in the Web pages. This allows you to control the layout completely by using CSS. The idea is to have a separate style sheet for every language that is supported. This enables UI designers to work on screen layouts by using only style sheets and not complex dynamic JSP pages.
Localization of Data Values
Your team has identified various data elements that must be rendered differently, according to the user's locale. Generally, the platform provides application programming interfaces (APIs) for dealing with these issues. In your case, J2SE provides extensive localization support through its NumberFormat and DateFormat classes.
· Currency formats—For an amount of 10000.00, display "$10,000.00" for the en_US locale and "10 000,00 €" for the fr_FR locale.
· Date formats—Locales vary in date and calendar-format displays. "DD/MM/YYYY," "MM/DD/YYYY," and "MMM DD, YYYY" are some of the common formats that are used. The names of the days and months also need localization. Figure 2 shows the current date lookup in your application—internationalized, and then localized in Russian:
Figure 2. Date lookup, both internationalized and localized
· Address/Phone number—Addresses also vary from country to country. There are differences in the list of states and ZIP code formats (for example, "XXXXX-XXXX" for a ZIP code in the United States, or the six-digit "XXXXXX" PIN code in India).
· Validation—Differing formats for numbers, dates, addresses, ZIP codes, and phone numbers lead to the related problem of locale-specific validation. You come up with two ways to solve this problem:
· Certain validations might require some server-side support. The J2SE platform provides extensive locale support for currency, date, and numbers. Your company has contracted with a third-party Web service for validation of international addresses. You decide to implement such complex validations by using AJAX.
· Text truncation—The length of a phrase with the same meaning might vary in different languages. Because a Web page has finite space, you decide on implementing a truncation scheme, in which noncritical data is truncated. The user has an option of drilling down to see the full content.
· HTTP encoding—The server can set a CHARSET parameter in the HTTP header to specify the character encoding of the response. Because the application is going to support multiple languages, your recommendation is to use Unicode. Hence, you instruct the team to put the following line in every JSP in the application:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
The reasons that you choose Unicode are simple:
· Unicode encoding allows you to incorporate multiple languages in a single page.
· It eliminates server-side logic to set the appropriate CHARSET in the page.
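For the currency and date items in the list above, the NumberFormat and DateFormat classes mentioned earlier do most of the work. The following is a minimal sketch; the class name, the helper methods, and the sample date are assumptions. Note that the currency formatter only changes how the amount is rendered. It does not convert between currencies.

```java
import java.text.DateFormat;
import java.text.NumberFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

final class LocaleRendering {
    /** Renders an amount with the currency symbol and separators of the locale. */
    static String currency(double amount, Locale locale) {
        return NumberFormat.getCurrencyInstance(locale).format(amount);
    }

    /** Renders a date in the locale's long style, interpreted in UTC. */
    static String longDate(Date date, Locale locale) {
        DateFormat format = DateFormat.getDateInstance(DateFormat.LONG, locale);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(date);
    }

    /** Builds a Date for midnight UTC on the given day (month is 1-based). */
    static Date utcDate(int year, int month, int day) {
        Calendar calendar = new GregorianCalendar(TimeZone.getTimeZone("UTC"), Locale.ROOT);
        calendar.clear();
        calendar.set(year, month - 1, day);
        return calendar.getTime();
    }

    public static void main(String[] args) {
        Date sample = utcDate(2007, 1, 15);
        Locale russian = Locale.forLanguageTag("ru-RU");

        System.out.println(currency(10000.00, Locale.US));
        System.out.println(currency(10000.00, Locale.FRANCE));
        System.out.println(longDate(sample, Locale.US));
        System.out.println(longDate(sample, russian));
    }
}
```

The exact separator characters and month spellings come from the locale data that ships with the Java runtime, so they can differ slightly between runtime versions. Tests that compare formatted output character by character should allow for that.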
The client for the application is a browser. Your application supports Microsoft Internet Explorer version 5.5 and later. The following issues are those with which your team has to deal, in this layer:
For a browser to show data in a particular character set, it needs a font, which will map the code points of that character set to appropriate visual representations. Certain fonts are part of the basic installation of the operating system. For instance, a Japanese version of Windows will have fonts installed for the Japanese language. However, if the user wants to view Japanese data on a machine that has the Windows-1252 code page installed, a compatible font is required. There are general-purpose fonts available, too, which support all of the languages. As an example, Arial Unicode MS is one such font and is part of the Microsoft Office distribution. Your application will specify the appropriate font as part of style sheets, and the expectation would be that the appropriate font is installed on the user's machine.
Every third-party interface to the application has to be looked upon in the light of data in multiple languages passing through it.
· XML interaction—Interactions with external Web services that use XML encoded in UTF-8 are capable of handling multilingual content.
· Flat file—The application allows certain flat-file downloads and uploads. Your team ensures correct encoding of the files that are downloaded and uploaded, so that no information is lost. Every export/import that involves CSV files must be looked at, as to whether such a method is still viable. Some situations might warrant a switch to a different format, such as XML.
· Third-party tools—Each third-party tool/API that is used in the application must support multibyte character sets and Unicode compliance; for example, a PDF driver is used to generate reports as a PDF document.
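For the flat-file item in the list above, the main safeguard is to name the encoding explicitly on both the export and the import side, instead of relying on the platform default. The sketch below writes rows to a temporary file and reads them back with the same charset. The class name, file prefix, and sample row are assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

final class FlatFileEncoding {
    /** Writes the rows to a temporary file in the given charset and reads them back. */
    static List<String> exportAndImport(List<String> rows, Charset charset) {
        try {
            Path file = Files.createTempFile("export", ".csv");
            try {
                Files.write(file, rows, charset);         // export with an explicit encoding
                return Files.readAllLines(file, charset); // import with the same encoding
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // One row that mixes ASCII with two Chinese characters (written as escapes).
        List<String> rows = Arrays.asList("1001,\u8ba2\u5355,4.5");
        System.out.println(exportAndImport(rows, StandardCharsets.UTF_8).equals(rows));
    }
}
```

A round-trip check like this one fits naturally into the regression suite for every export and import path.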
The process diagram that is shown in Figure 3 depicts the phased process for this effort:
Figure 3. A process diagram for internationalization
In this article, we have answered many of the critical questions that are required before an application is internationalized. The answers to these questions are critical for the effort to both succeed and realize the expected return on investment (ROI).
· How many languages an application can possibly support in the future will decide on the character set that is to be used. It is recommended to use Unicode, when in doubt.
· It is also critical to carefully list the elements of the application that will be affected by multilingual data. This exercise is important to scope the effort.
· Internationalization will affect every part of the application, including third-party interfaces. This also implies a significant testing effort. It always helps if the application has an existing automated test suite, which can be used for regression-testing.
· Internationalization can lead to a lot of work in many source files. An example is extracting static text out of JSP pages and into resource bundles. It is always better to look for tools (COTS or homegrown) that can ease some of this pain.
· What is the current state of my application, from an i18n perspective?
· Do I go "Big Bang" or in a phased manner?
· Can my persistence handle data in multiple character sets? What should be the character set of the database?
· What areas of the presentation layer are affected by the user's locale?
· [Morgan et al., 2001] Morgan, Terri, Carol Luttrell, and Yuzeng Liu. "Designing Multilingual Web Sites: Applied Authoring Techniques." 2001. ACM Digital Library.
· [Oracle, 2005] Hardman, Ron. "Globalization: Going Global." Oracle Magazine, 2005.
· Murray, Greg. "J2EE Internationalization and Localization." 2002. Sun Microsystems.
· Various. "W3C Internationalization (I18n) Activity." 2007. W3C.
ASCII (American Standard Code for Information Interchange)—The most common character set that is used to represent American English. Code points in 7-bit ASCII (called US-ASCII) range from 0 to 127. ASCII contains uppercase and lowercase Roman alphabets, European numerals, punctuation, a set of control codes (nongraphical code points from decimal 0 to 31), and a few miscellaneous symbols. Many early Internet protocols were based on 7-bit ASCII, which greatly complicated Web-application support of languages other than American English.
Character set—A set of graphical, textual symbols, each of which is mapped to an integer (for example, ASCII and ISO 8859).
Collation—The process of ordering text by using language or specific rules, instead of by using binary comparison.
Encoding—A way of mapping the code points of a character set to units of specific width, and defining byte serialization and ordering rules. Unicode has UTF-8 and UTF-16 encodings.
Internationalization (i18n)—The process of designing an application to make it adaptable to different languages and regions, without requiring engineering changes.
ISO 8859—A character-set series that was created to overcome some of the limitations of ASCII. Each ISO 8859 character set may have up to 256 characters. ISO 8859–1 ("Latin–1") comprises the ASCII character set, plus characters with diacritics (accents, diereses, cedillas, circumflexes, and the like), and additional symbols. The ISO 8859 series defines 15 character sets (ISO 8859–1 through ISO 8859–11, and ISO 8859–13 through ISO 8859–16) that can represent text in dozens of languages.
Locale—A set of political, cultural, and region-specific elements that are represented in an application. As per ISO standards, locale is a combination of a language + country + variant (for example, en_US and en_GB).
Localization (L10n)—The process of adapting software for a specific region or language, by adding locale-specific components and translating text.
Unicode—Also known as ISO 10646, defines a character set with 21-bit code points. Unicode can represent all of the character sets in the world. The Java Programming language internally represents all character and string objects in 16-bit Unicode. Hence, programs that are written in Java can process data in multiple languages.
Windows-1252—Character encoding of the Latin alphabet in Microsoft Windows. Windows-1252 is a superset of the printable characters of ISO 8859–1.
About the author
Puneet Sachdev is a practicing architect at NIIT Technologies, Inc. He specializes in designing high-throughput and scalable Web sites, developing multilingual J2EE Web applications and event-driven architecture (EDA), and developing Web 2.0/RIA using AJAX/AFLAX. Within NIIT, Puneet heads the Open Source Center of Competence, in which his charter is to suggest to NIIT's customers innovative solutions that use open-source software. Puneet is a member of both WWISA and ACM. He can be contacted at either firstname.lastname@example.org or email@example.com.
This article was published in Skyscrapr, an online resource provided by Microsoft. To learn more about architecture and the architectural perspective, please visit skyscrapr.net.