In
the following article, we will examine the major
internationalized components of a web application,
and extend some principles to all software internationalization
efforts. This application was built in Java with
an Oracle back end, primarily by Zia Consulting
(http://www.ziaconsulting.com), and provides quotes
for equipment to people in 27 countries. The
software libraries referenced in this article will
be Java standard libraries; other languages may
or may not have the same level of support.
The
three main phases involved in internationalizing
an application are:
- finding the user’s preferred locale within the set of supported
locales
- displaying information appropriate to the chosen locale
- operational concerns
Please
note that internationalization is only half of the
process—when the application can handle different
locales, locale-specific data and information must
be provided. That process is called localization,
and this article will not cover the topic.
Finding the Appropriate Locale
First, a user must be associated with a locale.
This is a configuration setting like any other.
Allowing the user to explicitly choose the appropriate
locale is crucial. There are some hints that the
user’s computer can provide: in Microsoft Windows,
for instance, there is a “Country” key in the registry¹,
and web browsers can provide preferences for a given
locale and communicate this information to the server
in a process called language negotiation (http://www.w3.org/).
However,
while these can serve as starting points for a piece
of software, they certainly do not represent the
entire answer. The operating system may be configured
incorrectly, or a user might be at an Internet cafe
in a foreign country with settings adapted for that
cafe’s typical user, but not for him or her.
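As a minimal sketch of treating the browser hint as a starting point only, the following Java fragment resolves a locale from an Accept-Language header while letting an explicit user choice win. The class name, the list of supported locales and the default are assumptions made for illustration; only `Locale.LanguageRange.parse` and `Locale.lookup` come from the standard library (Java 8 and later).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Resolves a starting locale from a browser hint; an explicit user choice always wins. */
final class LocaleNegotiator {
    // Hypothetical set of locales the application supports.
    static final List<Locale> SUPPORTED =
            Arrays.asList(Locale.US, Locale.CANADA_FRENCH, Locale.JAPAN);
    static final Locale DEFAULT = Locale.US;

    /**
     * @param explicitChoice locale the user picked earlier, or null if unknown
     * @param acceptLanguage raw Accept-Language header value, or null if absent
     */
    static Locale resolve(Locale explicitChoice, String acceptLanguage) {
        if (explicitChoice != null && SUPPORTED.contains(explicitChoice)) {
            return explicitChoice; // never second-guess the user
        }
        if (acceptLanguage == null || acceptLanguage.trim().isEmpty()) {
            return DEFAULT;
        }
        try {
            // Ranges come back ordered by descending q-value.
            List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(acceptLanguage);
            Locale match = Locale.lookup(ranges, SUPPORTED);
            return match != null ? match : DEFAULT;
        } catch (IllegalArgumentException malformedHeader) {
            return DEFAULT; // a bad hint is still only a hint
        }
    }
}
```

A caller would pass whatever it already knows about the user (for example, a value read from a cookie) as the first argument, and fall through to the header only when nothing is known.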
When
the locale of an application is not set correctly,
a typical user blames the application, not the operating
system or browser. And unlike other configuration
options, an incorrect choice of locale often renders
an application useless. There may also be business
reasons for the user to choose a locale explicitly,
regardless of the machine setting.
For
example, in the quoting system to which this article
refers, users must be able to receive a quote in
US dollars. This often happens when multinational
corporations have a procurement entity that deals
with US dollar quotes only.

[Figure: GFTPKlient locale install dialog]
In
short, the best method for finding the proper locale
for a given user is to ask. For desktop applications,
this usually occurs at installation time, as shown
in the illustration above. For web applications,
it is a bit more complex, but generally the application
should not ask for the preferred locale whenever
it can recognize that a request has come from a
user for whom the preferred locale is already known.
By asking the user explicitly, the application also
lets users know that it is prepared to handle a
different locale, and allows them to choose whatever
locale is appropriate, regardless of the existing
settings.
Of
course, there is a bootstrapping problem here: In
what language should the initial locale query be
posed? There are two choices: use a “lowest common
denominator” language that is widely known by the
target audience, or use a graphic that has simple
instructions in several languages. The reason to
use an image rather than text is that users may
not have fonts for all supported languages installed.
Besides, an image looks better than a series of
question marks, which is how some operating systems
render characters they do not recognize.
However,
images come at a cost. It is less
expensive to have someone modify a single piece
of text when you need to update and redeploy that
message. Working with images is more expensive and
complicated; someone needs to provide the new text
and then the existing image must be modified, usually
by a graphics specialist. If the image is changed
in size or shape, the QA (quality assurance) department
must make sure that none of the non-modified characters
has changed.
In
addition to cost considerations, the “locale bootstrap”
option chosen for an application depends largely
on the target audience. It may be safe for a scientific
web application to use English for the startup instructions,
while a consumer application may need to have simple
startup text translated in several languages. The
quoting application, as shown below, chose to do
both—have unknown users default to the American
English locale (as evidenced by the text “If your
country...”), and at the same time send all such
users to a page with messages in a number of languages
stating, “Please select your country.” (The reason
to use “country,” rather than “locale,” in this
message is that typical users have no idea what
the latter means.)

[Figure: The quoting application locale chooser]
Whatever
the user’s choice, the locale should be stored using
the ISO (International Organization for Standardization)
codes for country and language to ensure maximum
compatibility. Java uses the two-letter, lowercase
language code, then an underscore, then the two-letter,
uppercase country code². For example, en_US represents American English, while fr_CA stands for Canadian French. The web-based quoting
application used different locale codes, primarily
for legacy compatibility. This caused some issues
when new, internationalized software was integrated
into the website—a translation layer had to be written.
If you have a choice, choose to store locale information
in the standard format.
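The following sketch shows one way to turn a stored standard code into a `java.util.Locale`, along with the kind of translation layer described above. The legacy codes in the map are invented for illustration; the article does not list the codes the quoting application actually used.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class LocaleCodes {
    // Hypothetical legacy codes; the real ones are not given in the article.
    private static final Map<String, String> LEGACY_TO_STANDARD = new HashMap<>();
    static {
        LEGACY_TO_STANDARD.put("USA-ENG", "en_US");
        LEGACY_TO_STANDARD.put("CAN-FRE", "fr_CA");
    }

    /** Parses a standard code such as "fr_CA" into a java.util.Locale. */
    static Locale parse(String code) {
        String[] parts = code.split("_");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected language_COUNTRY, got: " + code);
        }
        // The builder rejects ill-formed language or region subtags.
        return new Locale.Builder().setLanguage(parts[0]).setRegion(parts[1]).build();
    }

    /** Translation layer: maps a legacy code to a Locale, or fails loudly. */
    static Locale fromLegacy(String legacyCode) {
        String standard = LEGACY_TO_STANDARD.get(legacyCode);
        if (standard == null) {
            throw new IllegalArgumentException("Unknown legacy locale code: " + legacyCode);
        }
        return parse(standard);
    }
}
```

Keeping the mapping in one place means the rest of the code can work with standard `Locale` objects only.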
Once
the user has chosen a locale, the application should
never ask again. For desktop software, that is not
an issue since the developer should store the chosen
value in the registry or in a configuration file.
However, for web software, re-querying is inescapable.
If the user deletes his or her cookies or views
the application from a different computer, the server
cannot identify the user. This means that locale
choice should be as easy as possible in a web application,
because it is likely to happen more often than in
a desktop application.
Even so,
the user should be able to reconfigure the application
and choose a new locale. In practice, typical desktop
applications often do not provide this choice (other
than via
re-installation), although Windows does allow users
to install a new locale without requiring them to
re-install. On the other hand, and because they
are built to deal with transient users, web applications
must allow for easy changes of the preferred locale.
In the quoting application, for example, every page
has a drop down box where the user can select a
different locale.
Rendering Locale-Specific Information
Once the user has chosen a locale, the application
must respond by showing information relevant to
that locale. The most important type of information
is the language of the displayed text. Other locale-specific
information includes business rules, such as whether
or not to show a product that may be illegal or
unavailable in a given country, date formatting,
number formatting and sorting. We will examine each
of these options in the context of the quoting application.
The
general way to place the appropriate text for a
given locale in the application’s user interface
is to make sure all user interface text is replaced
with tokens. Each token is a key string used when
the developer is building the user interface. The
quoting application used as the running example in this
article was built with JSPs (JavaServer Pages) and tag libraries,
but these concepts apply to almost any display technology.
In separate files, at least one for each locale,
a key receives a localized value.
The
value should be stored using a Unicode encoding such as UTF-8
(UTF stands for Unicode Transformation Format), which can use
multiple bytes to represent a single character. How the text
is stored on disk is specific to the programming language.
In the case of Java, properties files are traditionally read
as ISO-8859-1, so characters outside that character set are
written as Unicode escapes
(for more information on this topic, please visit
http://java.sun.com/).
An
escaped Unicode character looks like this: \u65e5.
When the user has identified a preferred locale,
the application, represented by the framework in
the illustration below, generates the correct user
interface by replacing all the keys with the corresponding
values, which are drawn from a file or a database
(the datastore below). Note that this assumes the
user’s system is set up correctly, with fonts, etc.
for the chosen locale.

Many
modern programming languages have library support
for this key-value separation, including Java and
XUL (XML User Interface Language, part of the Mozilla
application framework). If you are building an internationalized
application and are able to choose a development
language, definitely take a look at the internationalization
features of each considered language, because some
programming languages have better support than others.
An
example of a key-value pair for the American English
locale is:
HELLO_KEY=Hello there!
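The key lookup can be sketched with the standard `PropertyResourceBundle` class. To keep the example self-contained, the key-value text is passed in as strings rather than read from one file per locale; the `MessageCatalog` class, the `BYE_KEY` and `DAY_KEY` entries and the fallback policy are assumptions, not the quoting application's actual design.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

final class MessageCatalog {
    private final Map<Locale, ResourceBundle> bundles = new HashMap<>();
    private final Locale fallback;

    MessageCatalog(Locale fallback) {
        this.fallback = fallback;
    }

    /** Registers key=value text for a locale; in practice this would be a file per locale. */
    void register(Locale locale, String propertiesText) {
        try {
            bundles.put(locale, new PropertyResourceBundle(new StringReader(propertiesText)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the localized value, falling back to the default locale, then to the key. */
    String text(Locale locale, String key) {
        for (Locale candidate : new Locale[] {locale, fallback}) {
            ResourceBundle bundle = bundles.get(candidate);
            if (bundle != null && bundle.containsKey(key)) {
                return bundle.getString(key);
            }
        }
        return key; // a visible key makes missing translations easy to spot
    }

    static MessageCatalog sample() {
        MessageCatalog catalog = new MessageCatalog(Locale.US);
        catalog.register(Locale.US, "HELLO_KEY=Hello there!\nBYE_KEY=Goodbye!\n");
        // The doubled backslash keeps the Unicode escape intact for the properties parser.
        catalog.register(Locale.JAPAN, "DAY_KEY=\\u65e5\n");
        return catalog;
    }
}
```

Returning the key itself when no value exists is one possible choice; an application could just as well log the miss or throw.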
Extracting
all the text in the user interface is tedious work,
although some automation is possible. This is one
major reason to internationalize the application
from the beginning: guaranteeing that all
text has been extracted is easier while the
application is still small. It has fewer states,
and each additional component or feature built can
be internationalized in turn.
At
this point, the developer should begin to think
about the localization process as well: how will
all the text be translated, tested and deployed
in an efficient manner? For text stored in files,
the quoting application used a combination of Excel
spreadsheets for data entry, an Access database
to store all localized strings and various scripts
to extract and generate files for use by the web
application.
In
addition to simple key-value replacement, it may
be desirable to put dynamic content in a string
to be localized. This is typically done with another
type of token to represent the dynamic content.
The following is a Java example:
HELLO_KEY=Hello {0}!
Not
only will “HELLO_KEY” in the application user interface
be replaced with “Hello {0}!,” but the token “{0}”
can be dynamically replaced with any value the application
supplies. This token replacement is extremely useful
when dealing with languages that have different
subject-verb-object orders. There is library support
in Java for this functionality. You can find more
information at http://java.sun.com/.../MessageFormat.html.
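A short sketch of this substitution with `java.text.MessageFormat` follows. The `Greeting` class and the sample patterns are hypothetical; the point is that the argument order is fixed in code while each localized pattern decides where the arguments appear.

```java
import java.text.MessageFormat;
import java.util.Locale;

final class Greeting {
    /** Substitutes arguments into a localized pattern such as "Hello {0}!". */
    static String render(String pattern, Locale locale, Object... args) {
        // The locale controls how numbers and dates among the arguments are formatted.
        return new MessageFormat(pattern, locale).format(args);
    }

    public static void main(String[] unused) {
        System.out.println(render("Hello {0}!", Locale.US, "Dan"));
        // Same arguments, different word order chosen by the pattern.
        System.out.println(render("{0} requested {1}", Locale.US, "Dan", "a quote"));
        System.out.println(render("{1} was requested by {0}", Locale.US, "Dan", "a quote"));
    }
}
```

One thing to watch for: `MessageFormat` treats the single quote as an escape character, so translated patterns containing apostrophes need them doubled.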
Locale-specific
files work fine for text that rarely changes. However,
for the quoting application, there was a significant
amount of dynamic text, primarily product data. While
most modern databases can store Unicode
data, developers need to make sure
the database is configured correctly, and that any
other tools used to manipulate that data are equipped
to do so. The quoting application was built on Oracle,
which supports Unicode character sets; developers
just needed to be sure that the NLS_LANG environment
variable was set to “american_america.AL32UTF8”.
Other applications like SQL*Loader also needed to
be configured correctly to handle multibyte
characters.
Be
careful using Unicode strings as keys into hash tables
or with third-party libraries. What looks the same
on the screen may not be the same sequence of Unicode
characters. For example, an accented letter may be stored
as one precomposed character or as a base letter followed
by a combining accent. (For
more on this topic, see the Unicode FAQ at http://unicode.org/faq/char_combmark.html#2).
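One common way to guard against this, sketched below, is to normalize strings to a single Unicode form before using them as keys. The `NormalizedKeys` wrapper is a hypothetical helper; `java.text.Normalizer` is part of the standard library. Whether NFC is the right form depends on the data and on what other systems expect.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

/** A map wrapper that normalizes keys to NFC so visually identical strings collide. */
final class NormalizedKeys {
    private final Map<String, String> values = new HashMap<>();

    private static String canonical(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    void put(String key, String value) {
        values.put(canonical(key), value);
    }

    String get(String key) {
        return values.get(canonical(key));
    }

    public static void main(String[] unused) {
        String precomposed = "caf\u00e9";  // e with acute as one character
        String combining = "cafe\u0301";   // plain e followed by a combining acute accent
        NormalizedKeys keys = new NormalizedKeys();
        keys.put(precomposed, "coffee shop");
        System.out.println(precomposed.equals(combining)); // false: different code points
        System.out.println(keys.get(combining));           // found after normalization
    }
}
```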
Number
and date formatting are typically performed by language
libraries (http://java.sun.com/.../DateFormat.html). In the
quoting application, the locale-specific files were
leveraged rather than using the locale-specific
date and number formatters. Two additional keys
were added, one for number formatting and another
for date formatting. Each would be pulled out and
passed on to the appropriate formatting class whenever
a date or number was rendered to the user. This
meant that all locale-specific information was stored
in one file. In addition, the Java libraries could
not process the application’s non-standard locale
codes, which as mentioned above, were required for
compatibility reasons.
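The approach of passing a stored pattern to a formatting class might look roughly like the sketch below. The patterns stand in for the values of the two extra keys, whose real names and contents the article does not give. The date side uses `java.time`, which is newer than the `DateFormat` class linked above; that substitution is a choice made for this example.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class PatternFormatting {
    /** Formats a number with a pattern taken from the locale-specific file. */
    static String number(double value, String pattern, Locale locale) {
        // The symbols supply the locale's decimal and grouping separators.
        return new DecimalFormat(pattern, DecimalFormatSymbols.getInstance(locale)).format(value);
    }

    /** Formats a date with a pattern taken from the locale-specific file. */
    static String date(LocalDate value, String pattern, Locale locale) {
        return DateTimeFormatter.ofPattern(pattern, locale).format(value);
    }

    public static void main(String[] unused) {
        LocalDate day = LocalDate.of(2004, 3, 15);
        System.out.println(number(1234.5, "#,##0.00", Locale.US));      // 1,234.50
        System.out.println(number(1234.5, "#,##0.00", Locale.GERMANY)); // 1.234,50
        System.out.println(date(day, "MM/dd/yyyy", Locale.US));         // 03/15/2004
        System.out.println(date(day, "dd/MM/yyyy", Locale.FRANCE));     // 15/03/2004
    }
}
```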
Business
rules are intensely application-specific and, after
displaying a user interface in a user’s chosen language,
are the second most important part of internationalizing
an application. Developers should be aware of locale-related
business rules and build support for them early
on. For example, in the quoting application, people
in different countries had different sets of available
products. Few languages or frameworks are going
to provide any support, as these rules are very
application-dependent, so developers should plan
to build in customized rule sets based on the locale
choice.
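A customized rule set of this kind can be as simple as a lookup keyed by country. The sketch below is illustrative only: the `ProductRules` class and the product identifiers are made up, and a real application would likely load the rules from its database.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

final class ProductRules {
    // Hypothetical rule table: country code -> product IDs that may be quoted there.
    private final Map<String, Set<String>> allowedByCountry = new HashMap<>();

    void allow(String countryCode, String... productIds) {
        allowedByCountry.computeIfAbsent(countryCode, c -> new HashSet<>())
                .addAll(Arrays.asList(productIds));
    }

    /** Keeps only products available in the locale's country; unknown countries see nothing. */
    List<String> visible(Locale locale, List<String> candidates) {
        Set<String> allowed =
                allowedByCountry.getOrDefault(locale.getCountry(), Collections.emptySet());
        return candidates.stream().filter(allowed::contains).collect(Collectors.toList());
    }
}
```

Defaulting to an empty set for unknown countries is a deliberately conservative assumption; the opposite default may suit other applications.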
Note
that this allows for some types of security attacks: if
French users are prohibited from buying certain
items that Japanese users can buy, a Japanese-speaking
French user can choose the Japanese locale and view
the prohibited items. Since the user is typically
asked to choose a locale, they can circumvent some
of the business rules based on that choice. Sometimes
it is then necessary to implement business processes
to handle locale-specific issues. In the case mentioned
above, the call centers that handle the improper
quote can choose to ignore it or to respond directly
to the user that the locale is not supported.
Sorting
text correctly for a given locale is a complicated
issue. Again, Java provides library support through its
Collator class (http://java.sun.com/.../Collator.html), although
initially the quoting application did not support
locale-specific sorting. However, later releases
required the quoting application to sort localized
text. Rather than using library support, a custom
interface was built to allow business users to optionally
define a sort order for any particular column in
a table. This allowed administrators to apply a
default sort to the column alphanumerically and
then let users change it.
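For comparison with the custom sort order described above, here is a minimal sketch of the library route using `java.text.Collator`. This is not what the quoting application did; the `LocaleSort` helper and the sample words are assumptions for illustration.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class LocaleSort {
    /** Returns a copy of the input sorted by the collation rules of the given locale. */
    static List<String> sorted(List<String> input, Locale locale) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] unused) {
        List<String> words = Arrays.asList("zebra", "Banana", "apple", "\u00e9tude");

        // Plain String ordering compares code units: uppercase first, accented letters last.
        List<String> naive = new ArrayList<>(words);
        Collections.sort(naive);
        System.out.println(naive);

        // Collation ignores case and accents until letters are otherwise equal.
        System.out.println(sorted(words, Locale.FRENCH));
    }
}
```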
When
all the user interface text has been replaced with
tokens, including the date and number formatting
as may be required, one can run the application
and test the user interface. The actual generation
of the application user interface is executed when
the application knows the user’s locale, so that
the appropriate language, number format and other
locale-specific features can be displayed correctly.
In desktop applications, this could be performed
at installation; in web applications, the substitution
is typically done at runtime.
Operational Concerns
Internationalization is more than just pulling strings
out of files and finding a user’s preferred locale.
For web applications, availability can be a challenge.
Deploying new versions of the quoting application
was an issue due to the time zones covered by the
supported countries. Deploying the application severely
affected its availability for a short period of
time, and it was almost always business hours
somewhere. While the developers were
able to view usage logs and find a weekday time
when the fewest number of people were using the
application, this period was only useful for quick
deployments or configuration changes. For any situation
that required more downtime than a few minutes,
the solution was to deploy on Friday evening or
Saturday, when it is the weekend in every time zone. This
was not very popular among the developers supporting
the application.
Furthermore,
from a data modeling point of view, internationalized
applications can be a bit of a headache. In every
table that contains text that will be displayed
to a user, there will be a key to the locale table.
The quoting application ended up with thirty or
so tables with foreign keys to the locale table.
One alternative is to de-normalize the locale information
and place the actual code, rather than a foreign
key, in every table that contains user displayable
text. If this alternative is chosen, triggers can
be used for validation and to enforce locale correctness.
Both these solutions are perfectly reasonable in
terms of fulfilling business requirements and application
support, but tough to manage and, less importantly,
a bit distasteful from a data standpoint.
Lessons Learned
The quoting application did not support all 27 countries
out of the box—locales were added incrementally.
This gradual approach allowed for the resolution
of process-specific issues, especially the localization
process. It also allowed for the maintenance and
administration of the site to mature.
Using
standard locale codes whenever possible will maximize
compatibility and can save future headaches. However,
there may be compelling business reasons prohibiting
this, and the incompatibility can be worked around.
Internationalization
should be considered from the very beginning. Both
the tedium of extracting all
user interface strings and the complexity
of business rule support argue for building
some support into all applications from the beginning.
In short, internationalizing software, whether desktop software or
a web application, is not extremely difficult. There
is a set number of issues to be dealt with, the
longest and most tedious being the extraction of
all displayed strings to separate files. Almost
every internationalization task is easier if dealt
with at the beginning of a project rather than bolted
on at the end—and going from one to two supported
locales is typically more difficult from an internationalization
viewpoint than going from two to N supported locales.
Most modern programming languages have extensive
library support for internationalization, which
should be leveraged whenever possible.
¹ For
more information on this topic, please check the
“How to read, add or modify Windows registry entries
with REGEDIT” topic of Rob van der Woude’s website
(http://www.robvanderwoude.com/index.html). It can be found by
clicking the “Batch files” link under the “Scripting”
section on the home page.
²
Language codes can be found in http://www.loc.gov/.../English_list.php, and country codes in
http://www.iso.org/.../list-en1.html. Java libraries use the
ISO codes in the locale classes: http://java.sun.com/.../Locale.html.
Dan
Moore is an independent consultant who
has been working with web technologies since 1997.
He helped Zia Consulting extend the quoting web
application outlined above, and became familiar
with some of the “gotchas” of software internationalization
and localization. Moore has written articles and
given presentations to local technical groups on
topics ranging from internationalization to Java
on the cell phone and Java authentication technology.
He has a degree in physics from Whitman College
and maintains a weblog that covers a variety of
technical topics (and an occasional rant) at http://www.mooreds.com/weblog.