Puneet Sachdev
December 2007
Summary: The key to Web-application internationalization
is to know both your application and the future audience. This article aims to
answer many of the critical questions that are required before an application
is internationalized. (9 printed pages)
Contents
Introduction
Scenario: A. Datum Corporation
Database-Layer Issues
Business-Layer Issues
Presentation-Layer Issues
Client-Layer Issues
Process
Conclusion
Critical-Thinking Questions
Sources
Glossary
Introduction
When well-known, multinational companies want to sell in China,
they make sure that their users understand their content. At the time of this
writing, one such company of which I know has 26 localized versions of its
online-shopping portal. They and other global corporations have realized that,
although you can buy in any language, you must use the customer's language to
sell [Morgan et al., 2001]. Challenges in transforming
a Web site or a product, so that it can cater to a global audience, can be
immense.
This article highlights the main issues of which a practicing architect
must be aware when internationalizing a Web application that has both static
and dynamic content. Most platforms on which applications are built provide a
detailed, platform-specific internationalization guide, which should be read
carefully and understood.
The article does not attempt to provide solutions for the various
issues that it highlights in every layer of the application. However, it does
recommend an approach through the choices that are made in our fictitious
scenario.
Scenario: A. Datum Corporation
Your company, A. Datum Corporation, is celebrating its fourth
consecutive year as the leading Web-based mall-management ASP in the U.S. Your
CEO, Scott Bishop, is addressing shareholders with current-year numbers, when
someone asks a question that he had anticipated: "With the U.S. market
captured and saturated, how do we plan to maintain explosive growth in revenue
and profits?"
Scott explains, "By selling to the Chinese, Indians, and
Russians. Seventy percent of the shopping malls to be built in the next 10
years will be in these countries. We want to be there when they build
them." This is when it all starts. This is when A. Datum embarks on its journey
to become a global corporation.
You are the architect of the special task force that is created
by Scott and is responsible for achieving this important milestone of releasing
the product in China in six months. You know that the effort will affect every part
of the existing application and will not be limited only to the presentation
layer, as many in the company have been suggesting. From experience, you also
know that globalizing an application has two parts, both of which are
orthogonal to the application layers (client, presentation, business, and
integration): internationalization (i18n) and localization (L10n).
Figure 1. Internationalization and localization are orthogonal
to the application layers.
What Is the Current State of the Application's Internationalization?
Embarking on a project without complete information is a recipe
for failure. Hence, in the initial kickoff meeting, you help the team to
understand the scope and impact of the effort. Areas of the application are
divided up between teams, which analyze the application to identify the
elements in the application that are affected by locale.
Do You Go "Big Bang" or in a Phased Manner?
You also realize that it is better to approach the internationalization
effort in a phased manner, instead of by executing the "Big Bang"
approach. You know this, because going in a phased manner will:
· Reduce risk,
as well as make it easier to isolate and resolve issues quickly and
effectively.
· Make more sense,
because development of the application is happening in parallel.
· Enable
regression-testing to happen easily, because development of the application is
happening in parallel.
Part of the team will work on automating certain parts of the
effort by using commercial off-the-shelf (COTS) or homegrown tools. The
following are general areas in which you think that this might help:
· Extracting
embedded content out to resource bundles
· Scanning code,
and identifying use of locale-specific functions, routines, and methods
· Extracting
image references
Following your analysis, you define a phased process that is
punctuated by regression-testing efforts (see the section titled "Process").
Database-Layer Issues
You realize that, although the immediate goal is to release in
China, that market is not going to be the only one in which you will eventually
deploy.
Character Set
Internationalization of the application should consider the
following character-set features:
· The database must
support the character set of the data that is coming in. The choice of
character set must consider future language requirements. The ISO 8859–1
character set supports English and most of the western European languages, but
it does not support Chinese. Consequently, you choose Unicode, which provides
unrestricted multilingual support.
· The
application should be refactored appropriately, so that localization in various
languages does not result in code change and recompilation. After this has been
done, the time-to-market in new geographies will be reduced tremendously.
· The external
interfaces of the application should be able to handle data in every character
set.
Data Migration
You determine that, because existing data is encoded in ASCII,
its migration into a database with Unicode support is not an issue: every ASCII
character keeps the same single-byte representation in UTF-8. For data in other
encodings, database vendors provide tools for exporting and importing data to
switch its character set.
Character Widths
Every character in English is encoded within a single byte.
Hence, a database column of width CHAR(10) holds 10 characters in English or
in any language that is encoded in ISO 8859. In UTF-8, however, a Russian
character takes two bytes, and a Chinese character typically takes three.
Consequently, you give the go-ahead to increase the size of all character
fields in the database to at least three times their current size, to
accommodate Asian and other multibyte languages.
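The difference between character count and byte count can be checked directly. The following is a minimal sketch, assuming that the database stores text as UTF-8; the class and method names are hypothetical. With it, "Order" measures 5 bytes, the five-character Russian word "Заказ" measures 10 bytes, and the two-character Chinese word "订单" measures 6 bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the scenario's code base.
final class Utf8Width {
    private Utf8Width() {}

    // Number of bytes that the text occupies once it is encoded as UTF-8.
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if the text fits in a column whose width is declared in bytes.
    static boolean fitsInByteColumn(String text, int columnBytes) {
        return utf8Bytes(text) <= columnBytes;
    }
}
```

A check such as `Utf8Width.fitsInByteColumn(value, 10)` could be run against sample translated data to find columns that are still too narrow after resizing.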
Business Logic
The database contains business logic inside of procedures and
functions. Modifications to column sizes can result in this code being
refactored. However, specific database vendors might provide certain features
that can limit the amount of modifications that are needed to accommodate these
changes (for example, NLS_LENGTH_SEMANTICS in Oracle 9i, which lets column
lengths be declared in characters instead of bytes [Oracle, 2005]).
Business-Layer Issues
This is the layer in which most of your application code lies. It
broadly encompasses the application server/middleware/MOM/ESB/processing
engines.
Character Set
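Because your application is J2EE-based, it handles text as Unicode
internally. Encodings still have to be set explicitly wherever text enters or
leaves the application (requests, responses, files, and database connections).
Most of the application development platforms that are available today support
Unicode.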
Locale Negotiation
The application currently assumes an en_US locale (U.S. English,
as indicated in Java properties); every user in the application defaults to
this. In a global scenario, locale negotiation would determine the user's
locale. There are various ways in which this can be done for a Web application:
· Deduce the
locale from the Accept-Language HTTP header.
· Provide
separate application entry points for different locales.
· Store the
locale as a user preference, and service all requests to the user based on that
preference.
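The options above can be combined into a single fallback chain. The following is a minimal sketch, not a definitive implementation: it assumes Java 8 or later (for Locale.LanguageRange and Locale.lookup), and the class name, the list of supported locales, and the default locale are hypothetical. A stored user preference wins, then the Accept-Language header, then the default.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper that combines the negotiation options listed above.
final class LocaleNegotiator {
    // Assumed set of locales for which the application has been localized.
    static final List<Locale> SUPPORTED =
            Arrays.asList(Locale.US, Locale.SIMPLIFIED_CHINESE, Locale.FRANCE);
    static final Locale DEFAULT = Locale.US;

    private LocaleNegotiator() {}

    // storedPreference: the user's saved locale, or null if none is stored.
    // acceptLanguage: the raw Accept-Language header value, or null if absent.
    static Locale negotiate(Locale storedPreference, String acceptLanguage) {
        if (storedPreference != null && SUPPORTED.contains(storedPreference)) {
            return storedPreference;
        }
        if (acceptLanguage != null && !acceptLanguage.trim().isEmpty()) {
            try {
                // Ranges come back ordered by their q weights, highest first.
                List<Locale.LanguageRange> ranges =
                        Locale.LanguageRange.parse(acceptLanguage);
                Locale match = Locale.lookup(ranges, SUPPORTED);
                if (match != null) {
                    return match;
                }
            } catch (IllegalArgumentException malformedHeader) {
                // A malformed header is ignored; fall through to the default.
            }
        }
        return DEFAULT;
    }
}
```

In a servlet, the header value would typically come from the request object; it is passed in as a plain string here so that the sketch stays independent of the servlet API.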
Business Logic
Multilingual data has created new requirements/validations in
your code. These requirements relate to locale-specific validations, currency
conversions, and the exporting and importing of data. For instance, a CSV file
import/export will have problems for a French locale, because the decimal
separator is a comma and not a period (that is, 4.5 becomes 4,5, which a CSV
parser reads as two fields). You realize that fixes to these problems would
have to be determined on a case-by-case basis, and that these are primarily
design issues.
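One possible design for the CSV case is to keep file contents locale-neutral and to apply locale conventions only when values are shown to or typed by a user. The following is a minimal sketch of that idea, under the assumption that the file format can be fixed to a period as decimal separator; the class and method names are hypothetical.

```java
import java.math.BigDecimal;
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

// Hypothetical helper; separates display formatting from file formatting.
final class CsvNumbers {
    private CsvNumbers() {}

    // Formats a value the way a user in the given locale expects to read it.
    static String forDisplay(double value, Locale locale) {
        NumberFormat format = NumberFormat.getNumberInstance(locale);
        format.setGroupingUsed(false);
        return format.format(value);
    }

    // Formats a value for a CSV field: always a period as the decimal
    // separator, no grouping, and no exponent notation.
    static String forCsv(double value) {
        return BigDecimal.valueOf(value).toPlainString();
    }

    // Parses text that a user typed in the conventions of the given locale.
    static double parseUserInput(String text, Locale locale) {
        try {
            return NumberFormat.getNumberInstance(locale).parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number: " + text, e);
        }
    }
}
```

With this split, `forDisplay(4.5, Locale.FRANCE)` yields the French form with a comma, while `forCsv(4.5)` always writes 4.5, so the comma stays available as the field delimiter.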
Presentation-Layer Issues
The most important aspect of any Web application is its
presentation; this is what the users interact with, and it has the maximum
impact on their perception of the application. Your application presents two
types of content to the user.
Static Content
This includes help files, pages about terms and conditions,
images, and so on. The best approach for these is to maintain separate copies
of them per language, and have the application pick the appropriate version
depending upon the user's locale. This content best resides inside of a content-management
system (CMS), and is best served through HTTP servers like Apache or Microsoft
Internet Information Services (IIS). The main task here is to decide on a
directory structure that keeps static content for each locale separate and
easily maintainable.
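As an illustration of such a directory structure, the following minimal sketch resolves a static file to a per-locale directory and falls back from the full locale to the language alone, and then to a default. The /static prefix, the directory names, and the class name are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical resolver; assumes a layout of /static/<locale directory>/<file>.
final class StaticContentPaths {
    // Assumed set of locale directories that exist on the HTTP server.
    static final Set<String> AVAILABLE_DIRS =
            new HashSet<String>(Arrays.asList("en_US", "zh_CN", "ru"));
    static final String DEFAULT_DIR = "en_US";

    private StaticContentPaths() {}

    static String resolve(Locale locale, String fileName) {
        String full = locale.getLanguage() + "_" + locale.getCountry();
        String dir;
        if (AVAILABLE_DIRS.contains(full)) {
            dir = full;                      // exact match, such as zh_CN
        } else if (AVAILABLE_DIRS.contains(locale.getLanguage())) {
            dir = locale.getLanguage();      // language-only match, such as ru
        } else {
            dir = DEFAULT_DIR;               // nothing localized yet
        }
        return "/static/" + dir + "/" + fileName;
    }
}
```

The fallback order means that a newly added locale serves the default content until its own directory is populated.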
Dynamic Content
Supporting multiple locales for dynamic content means that the
internationalization architecture must:
· Treat text as
a resource. This resource must be accessible to a translator easily.
· Automatically
render entities, such as numeric and monetary values, according to locale.
· Allow groups
of templates to be treated as a unit, to support different page designs for
different locales.
You list the following areas as those in which most of the work
will lie, in the presentation layer.
Textual Content
The internationalized application must treat text and images
(images with text) as dynamically generated data. Existing textual content has
to be extracted into resource bundles. Your team has developed certain in-house
tools that can scan the existing code base and perform the extraction. You
suggest enhancing the tool to replace the extracted occurrence with the result
of a call to the resource bundle by using a generated key. These resource
bundles will be the targets of localization efforts in the various user
languages.
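The result of such an extraction might look like the following minimal sketch, in which a label that was once embedded in a page becomes a keyed lookup. The key name, the class name, and the French translation are illustrative assumptions. The bundles are built in memory with ListResourceBundle only to keep the example self-contained; a real application would more typically load .properties files through ResourceBundle.getBundle.

```java
import java.util.HashMap;
import java.util.ListResourceBundle;
import java.util.Locale;
import java.util.Map;
import java.util.ResourceBundle;

// Hypothetical lookup class; stands in for the generated resource-bundle calls.
final class UiText {
    // One bundle per language code; the contents here are illustrative only.
    private static final Map<String, ResourceBundle> BUNDLES =
            new HashMap<String, ResourceBundle>();
    static {
        BUNDLES.put("en", bundle("label.po.number", "Purchase Order Number"));
        BUNDLES.put("fr", bundle("label.po.number", "Num\u00e9ro de bon de commande"));
    }

    private UiText() {}

    // Builds a one-entry bundle in memory.
    private static ResourceBundle bundle(final String key, final String text) {
        return new ListResourceBundle() {
            @Override
            protected Object[][] getContents() {
                return new Object[][] { { key, text } };
            }
        };
    }

    // Returns the text for the key in the user's language, or the English
    // text if that language has no bundle or no entry for the key.
    static String get(String key, Locale locale) {
        ResourceBundle bundle = BUNDLES.get(locale.getLanguage());
        if (bundle == null || !bundle.containsKey(key)) {
            bundle = BUNDLES.get("en");
        }
        return bundle.getString(key);
    }
}
```

A page would then call something like `UiText.get("label.po.number", userLocale)` where the literal text used to be.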
Screen Layout
Each written language has different characters, and they take up
a different amount of real estate on the screen. Hence, it is possible, after
translating to Russian, that "Purchase Order Number" will not fit in
the current 100-pixel width that is defined for its label. You need a way,
then, to externalize (or parameterize) the screen layout per locale. You make
an informed decision to use HTML DIV-based layouts in the Web pages. This
allows you to control the layout completely by using CSS. The idea is to have a
separate style sheet for every language that is supported. This has enabled UI
designers to work on screen layouts by using only style sheets and not complex
dynamic JSP pages.
Localization of Data Values
Your team has identified various data elements that must be
rendered differently, according to the user's locale. Generally, the platform
provides application programming interfaces (APIs) for dealing with these
issues. In your case, J2SE provides extensive localization support through its NumberFormat
and DateFormat classes.
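As a minimal sketch of how these two classes are used, the following wrapper formats an amount and a date for a given locale. The class and method names are hypothetical, and a double is used for the amount only for brevity. For 10000.00, the en_US result is "$10,000.00"; the fr_FR result has the form "10 000,00 €", in which the exact space characters depend on the JDK version. Note that only the format changes; no currency conversion takes place.

```java
import java.text.DateFormat;
import java.text.NumberFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical wrapper around the standard J2SE formatting classes.
final class LocaleFormats {
    private LocaleFormats() {}

    // Renders the amount with the locale's currency symbol, grouping
    // separator, and decimal separator.
    static String currency(double amount, Locale locale) {
        return NumberFormat.getCurrencyInstance(locale).format(amount);
    }

    // Renders the date in the locale's long style, including a localized
    // month name.
    static String longDate(Date date, Locale locale) {
        return DateFormat.getDateInstance(DateFormat.LONG, locale).format(date);
    }
}
```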
· Currency
formats—For an amount of 10000.00, display "$10,000.00" for the
en_US locale and "10 000,00 €" for the fr_FR locale.
· Date
formats—Locales vary in date and calendar-format displays.
"DD/MM/YYYY," "MM/DD/YYYY," and "MMM DD, YYYY"
are some of the common formats that are used. The names of the days and months
also need localization. Figure 2 shows the current date lookup in your
application—internationalized, and then localized in Russian:
Figure 2. Date lookup, both internationalized and localized
·
Address/Phone number—Addresses also vary from country to country. There are
differences in the list of states and ZIP code formats (for example,
"XXXXX-XXXX" for a ZIP code in the United States, or
"XXXXXX" for a PIN code in India).
· Validation—Differing
formats for numbers, dates, addresses, ZIP codes, and phone numbers lead to the
related problem of locale-specific validation. You come up with two ways to
solve this problem:
·
Locale-sensitive validations that can be accomplished completely in JavaScript
can be implemented on a per-locale basis, in separate JavaScript files, which
will be dynamically included in the pages depending on locale.
· Certain
validations might require some server-side support. The J2SE platform provides
extensive locale support for currency, date, and numbers. Your company has
contracted with a third-party Web service for validation of international
addresses. You decide to implement such complex validations by using AJAX.
· Text
truncation—The length of a phrase with the same meaning might vary in
different languages. Because a Web page has finite space, you decide on
implementing a truncation scheme, in which noncritical data is truncated. The
user has an option of drilling down to see the full content.
· HTTP
encoding—The server can set a CHARSET parameter in the HTTP header
to specify the character encoding of the response.
Because the application is going to support multiple languages, your
recommendation is to use Unicode. Hence, you instruct the team to put the
following line in every JSP in the application:
<code>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
</code>
The reasons that you choose Unicode are simple:
· Unicode
encoding allows you to incorporate multiple languages in a single page.
· It eliminates
server-side logic to set the appropriate CHARSET in the page.
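For the validation item in the list above, simple server-side rules can be kept in a table that is keyed by country. The following is a minimal sketch under that assumption; the class name is hypothetical, and the patterns are deliberately simplified (a U.S. ZIP or ZIP+4 code, and six digits for India and China). Countries without a rule are accepted here, on the assumption that a fuller check, such as the third-party address service, runs later.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical validator; the patterns are simplified for illustration.
final class PostalCodeValidator {
    private static final Map<String, Pattern> PATTERNS =
            new HashMap<String, Pattern>();
    static {
        PATTERNS.put("US", Pattern.compile("\\d{5}(-\\d{4})?")); // ZIP or ZIP+4
        PATTERNS.put("IN", Pattern.compile("\\d{6}"));            // six-digit PIN code
        PATTERNS.put("CN", Pattern.compile("\\d{6}"));            // six-digit postal code
    }

    private PostalCodeValidator() {}

    // Validates against the rule for the locale's country, if one is known.
    static boolean isValid(String postalCode, Locale locale) {
        Pattern pattern = PATTERNS.get(locale.getCountry());
        if (pattern == null) {
            return true; // no rule known for this country: defer the decision
        }
        return pattern.matcher(postalCode).matches();
    }
}
```

Adding a market then means adding one table entry instead of another code branch.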
Client-Layer Issues
The client for the application is a browser. Your application
supports Microsoft Internet Explorer version 5.5 and later. The following
issues are those with which your team has to deal, in this layer:
Fonts
For a browser to show data in a particular character set, it
needs a font, which will map the code points of that character set to
appropriate visual representations. Certain fonts are part of the basic
installation of the operating system. For instance, a Japanese version of
Windows will have fonts installed for the Japanese language. However, if the
user wants to view Japanese data on a machine that has the Windows-1252
code page installed, a compatible font is required. There are general-purpose
fonts available, too, which cover a wide range of languages. As an example, Arial
Unicode MS is one such font and is part of the Microsoft Office distribution.
Your application will specify the appropriate font as part of style sheets, and
the expectation would be that the appropriate font is installed on the user's
machine.
JavaScript
An important aspect to consider on the client side is JavaScript.
Your application uses a lot of JavaScript for performing validation and showing
alert messages to the user. These messages will be pre-read from the resource
bundles in the browser and shown to the user by using JavaScript alerts;
therefore, they must be localized.
Third-Party Interfaces
Every third-party interface to the application has to be looked
upon in the light of data in multiple languages passing through it.
· XML
interaction—Interactions with external Web services using XML encoded in
UTF-8 can handle multilingual content.
· Flat file—The
application allows certain flat-file downloads and uploads. Your team ensures
correct encoding of the files that are downloaded and uploaded, so that no
information is lost. Every export/import that involves CSV files must be looked
at, as to whether such a method is still viable. Some situations might warrant
a switch to a different format, such as XML.
· Third-party
tools—Each third-party tool/API that is used in the application must support
multibyte character sets and be Unicode-compliant. One example is the PDF driver
that is used to generate reports as PDF documents.
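The risk of information loss in flat files can be demonstrated with a simple round trip. The following is a minimal sketch with hypothetical names: it encodes text to bytes in a given character set and decodes it again. Chinese text survives the trip through UTF-8, but through ISO 8859-1 every unmappable character is silently replaced with a question mark.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for checking whether a file encoding preserves text.
final class FlatFileEncoding {
    // Assumed encoding for all files that the application reads and writes.
    static final Charset FILE_CHARSET = StandardCharsets.UTF_8;

    private FlatFileEncoding() {}

    // Encodes the text to bytes in the given charset and decodes it again,
    // as happens when a file is written and later read back.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    // True if no character is lost or altered by the round trip.
    static boolean survives(String text, Charset charset) {
        return text.equals(roundTrip(text, charset));
    }
}
```

Because the loss is silent, the same charset should be named explicitly on both the writing and the reading side instead of relying on the platform default.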
Process
The process diagram that is shown in Figure 3 depicts the phased
process for this effort:
Figure 3. A process diagram for internationalization
Conclusion
In this article, we have answered many of the critical questions
that are required before an application is internationalized. The answers to these
questions are critical for the effort to both succeed and realize the expected
return on investment (ROI).
· The number of
languages that an application might have to support in the future determines
the character set that is to be used. When in doubt, use Unicode.
· It is also
critical to carefully list the elements of the application that will be
affected by multilingual data. This exercise is important to scope the effort.
·
Internationalization will affect every part of the application, including
third-party interfaces. This also implies a significant testing effort. It
always helps if the application has an existing automated test suite, which can
be used for regression-testing.
·
Internationalization can lead to a lot of work in many source files. An example
is extracting static text out of JSP pages and into resource bundles. It is
always better to look for tools (COTS or homegrown) that can ease some of this
pain.
Critical-Thinking Questions
· What is the
current state of my application, from an i18n perspective?
· Do I go
"Big Bang" or in a phased manner?
· Can my
persistence layer handle data in multiple character sets? What should be the
character set of the database?
· What areas of
the presentation layer are affected by the user's locale?
Sources
· [Morgan et
al., 2001] Morgan, Terri, Carol Luttrell, and Yuzeng Liu. "Designing
Multilingual Web Sites: Applied Authoring Techniques." 2001. ACM Digital
Library.
· [Oracle, 2005]
Hardman, Ron. "Globalization:
Going Global." Oracle Magazine, 2005.
· Murray, Greg.
"J2EE
Internationalization and Localization." 2002. Sun Microsystems.
· Various.
"W3C Internationalization
(I18n) Activity." 2007. W3C.
Glossary
ASCII (American Standard Code for Information Interchange)—The
most common character set that is used to represent American English. Code
points in 7-bit ASCII (called US-ASCII) range from 0 to 127. ASCII contains
uppercase and lowercase Roman alphabets, European numerals, punctuation, a set
of control codes (nongraphical code points from decimal 0 to 31), and a few
miscellaneous symbols. Many early Internet protocols were based on 7-bit ASCII,
which greatly complicated Web-application support of languages other than
American English.
Character set—A set of graphical, textual symbols, each of
which is mapped to an integer (for example, ASCII and ISO 8859).
Collation—The process of ordering text by using language
or locale-specific rules, instead of by using binary comparison.
Encoding—A way of mapping the code points of a character
set to units of specific width, and defining byte serialization and ordering
rules. Unicode has UTF-8 and UTF-16 encodings.
Internationalization (i18n)—The process of designing an
application to make it adaptable to different languages and regions, without
requiring engineering changes.
ISO 8859—A character-set series that was created to
overcome some of the limitations of ASCII. Each ISO 8859 character set may have
up to 256 characters. ISO 8859–1 ("Latin–1") comprises the ASCII
character set, plus characters with diacritics (accents, diereses, cedillas,
circumflexes, and the like), and additional symbols. The ISO 8859 series
defines 15 character sets (ISO 8859–1 through ISO 8859–11, and ISO 8859–13
through ISO 8859–16) that can represent text in dozens of languages.
Locale—A set of political, cultural, and region-specific
elements that are represented in an application. As per ISO standards, locale
is a combination of a language + country + variant (for example, en_US and
en_GB).
Localization (L10n)—The process of adapting software for a
specific region or language, by adding locale-specific components and
translating text.
Unicode—Also known as ISO 10646, defines a character set
with 21-bit code points. Unicode aims to represent the characters of all of the
world's writing systems. The Java programming language internally represents
all character and string objects in 16-bit Unicode (UTF-16). Hence, programs
that are written in Java can process data in multiple languages.
Windows-1252—Character encoding of the Latin alphabet in
Microsoft Windows. Windows-1252 is a superset of ISO 8859–1.
About the author
Puneet Sachdev is a practicing architect at NIIT Technologies,
Inc. He specializes in designing high-throughput and scalable Web sites,
developing multilingual J2EE Web applications and event-driven architecture
(EDA), and developing Web 2.0/RIA using AJAX/AFLAX. Within NIIT, Puneet heads
the Open Source Center of Competence, in which his charter is to suggest to
NIIT's customers innovative solutions that use open-source software. Puneet is
a member of both WWISA and ACM. He can be contacted at either puneet.sachdev@niit-tech.com or puneet.sachdev@gmail.com.
This article was published in Skyscrapr, an online resource provided
by Microsoft. To learn more about architecture and the architectural
perspective, please visit skyscrapr.net.