The Basics of Software Internationalization

Software internationalization builds support for multiple locales into an application. A locale is “[a] subset of a user’s environment that defines conventions for a specified culture” (cf. http://publib16.boulder.ibm.com/), and typically includes language. Supporting multiple locales lets users choose the most appropriate one, which makes the application easier to use. It is best to internationalize the application as it is being built, since adding such support after the fact can be expensive and complicated.

In the following article, we will examine the major internationalized components of a web application, and extend some principles to all software internationalization efforts. This application was built in Java with an Oracle back end, primarily by Zia Consulting (http://www.ziaconsulting.com), and provides quotes for equipment to people in 27 countries. The software libraries referenced in this article will be Java standard libraries; other languages may or may not have the same level of support.

The three main phases involved in internationalizing an application are:

finding the user’s preferred locale within the set of supported locales
displaying information appropriate to the chosen locale
addressing operational concerns

Please note that internationalization is only half of the process—when the application can handle different locales, locale-specific data and information must be provided. That process is called localization, and this article will not cover the topic.

Finding the Appropriate Locale
First, a user must be associated with a locale. This is a configuration setting like any other. Allowing the user to explicitly choose the appropriate locale is crucial. There are some hints that the user’s computer can provide: in Microsoft Windows, for instance, there is a “Country” key in the registry¹, and web browsers can provide preferences for a given locale and communicate this information to the server in a process called language negotiation (http://www.w3.org/).

However, while these can serve as starting points for a piece of software, they certainly do not represent the entire answer. The operating system may be configured incorrectly, or a user might be at an Internet cafe in a foreign country with settings adapted for that cafe’s typical user, but not for him or her.
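As a rough illustration of using the browser hint as a starting point only, the sketch below matches an Accept-Language header value against a list of supported locales with the standard java.util.Locale matching methods (available since Java 8). The class name, the SUPPORTED list and the suggest method are hypothetical and are not taken from the quoting application.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class LocaleNegotiation {
    // Hypothetical set of locales the application supports.
    static final List<Locale> SUPPORTED =
            Arrays.asList(Locale.US, Locale.CANADA_FRENCH, Locale.JAPAN);

    /**
     * Returns the best supported match for an Accept-Language header value,
     * or null when the header is missing, malformed, or matches nothing.
     */
    static Locale suggest(String acceptLanguage) {
        if (acceptLanguage == null || acceptLanguage.trim().isEmpty()) {
            return null;
        }
        List<Locale.LanguageRange> ranges;
        try {
            ranges = Locale.LanguageRange.parse(acceptLanguage);
        } catch (IllegalArgumentException malformedHeader) {
            return null;
        }
        // filter() returns the matching locales, highest browser priority first.
        List<Locale> matches = Locale.filter(ranges, SUPPORTED);
        return matches.isEmpty() ? null : matches.get(0);
    }
}
```

A null result, or any suggestion at all, should only pre-select an entry in the locale chooser; the user still makes the final choice.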

When the locale of an application is not set correctly, a typical user blames the application, not the operating system or browser. And unlike other configuration options, an incorrect choice of locale often renders an application useless. There also may be business requirements that require the user to choose their locale regardless of the machine setting.

For example, in the quoting system to which this article refers, users must be able to receive a quote in US dollars. This often happens when multinational corporations have a procurement entity that deals with US dollar quotes only.

GFTPKlient Locale Install Dialog

In short, the best method for finding the proper locale for a given user is to ask. For desktop applications, this usually occurs at installation time, as shown in the illustration above. For web applications, it is a bit more complex: the application should ask only when it cannot recognize that a request has come from a user whose preferred locale is already known. By asking the user explicitly, the application also lets users know that it is prepared to handle a different locale, and allows them to choose whatever locale is appropriate, regardless of the existing settings.

Of course, there is a bootstrapping problem here: In what language should the initial locale query be posed? There are two choices: use a “lowest common denominator” language that is widely known by the target audience, or use a graphic that has simple instructions in several languages. The reason to use an image rather than text is that users may not have fonts for all supported languages installed. Besides, an image looks better than a series of question marks, which is how some operating systems render characters they do not recognize.

However, images are not free. Updating and redeploying a message is less expensive when someone only has to modify a single piece of text. Working with images is more expensive and complicated: someone needs to provide the new text, and then the existing image must be modified, usually by a graphics specialist. If the image changes in size or shape, the QA (quality assurance) department must make sure that none of the unmodified characters has changed.

In addition to cost considerations, the “locale bootstrap” option chosen for an application depends largely on the target audience. It may be safe for a scientific web application to use English for the startup instructions, while a consumer application may need to have simple startup text translated in several languages. The quoting application, as shown below, chose to do both—have unknown users default to the American English locale (as evidenced by the text “If your country...”), and at the same time send all such users to a page with messages in a number of languages stating, “Please select your country.” (The reason to use “country,” rather than “locale,” in this message is that typical users have no idea what the latter means.)

The quoting application locale chooser

Whatever the user’s choice, the locale should be stored using the ISO (International Organization for Standardization) codes for country and language to ensure maximum compatibility. Java uses the two-letter, lowercase language code, then an underscore, then the two-letter, uppercase country code². For example, en_US represents American English, while fr_CA stands for Canadian French. The web-based quoting application used different locale codes, primarily for legacy compatibility. This caused some issues when new, internationalized software was integrated into the website, because a translation layer had to be written. If you have a choice, store locale information in the standard format.
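To make the standard format and the idea of a translation layer concrete, here is a minimal sketch. The parse method handles the language_COUNTRY form described above using java.util.Locale.Builder. The legacy codes in the map are invented for illustration only, since the quoting application’s real codes are not described in this article.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class LocaleCodes {
    // Hypothetical legacy codes; the real ones are not documented here.
    private static final Map<String, String> LEGACY_TO_STANDARD = new HashMap<>();
    static {
        LEGACY_TO_STANDARD.put("USA-ENG", "en_US");
        LEGACY_TO_STANDARD.put("CAN-FRE", "fr_CA");
    }

    /** Parses a standard code such as "fr_CA" (language, underscore, country). */
    static Locale parse(String code) {
        String[] parts = code.split("_");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected language_COUNTRY: " + code);
        }
        // The builder rejects ill-formed language or region subtags.
        return new Locale.Builder().setLanguage(parts[0]).setRegion(parts[1]).build();
    }

    /** Translates a legacy code to a Locale, or returns null if it is unknown. */
    static Locale fromLegacy(String legacyCode) {
        String standard = LEGACY_TO_STANDARD.get(legacyCode);
        return standard == null ? null : parse(standard);
    }
}
```

Keeping the mapping in one place means the rest of the code can work with standard Locale objects only.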

Once the user has chosen a locale, the application should never ask again. For desktop software, that is not an issue since the developer should store the chosen value in the registry or in a configuration file. However, for web software, re-querying is inescapable. If the user deletes his or her cookies or views the application from a different computer, the server cannot identify the user. This means that locale choice should be as easy as possible in a web application, because it is likely to happen more often than in a desktop application.

However, the user should be able to reconfigure the application and choose a new locale. In practice, typical desktop applications often do not provide this choice (other than via re-installation), although Windows does allow users to install a new locale without requiring them to re-install. Web applications, on the other hand, are built to deal with transient users and must allow for easy changes of the preferred locale. In the quoting application, for example, every page has a drop-down box where the user can select a different locale.

Rendering Locale-Specific Information
Once the user has chosen a locale, the application must respond by showing information relevant to that locale. The most important type of information is the language of the displayed text. Other locale-specific information includes business rules (such as whether or not to show a product that may be illegal or unavailable in a given country), date formatting, number formatting and sorting. We will examine each of these in the context of the quoting application.

The general way to place the appropriate text for a given locale in the application’s user interface is to make sure all user interface text is replaced with tokens. Each token is a key string used when the developer is building the user interface. The quoting application we have been referring to as an example used JSPs (Java Server Pages) and tag libraries, but these concepts apply to almost any display technology. In separate files, at least one for each locale, a key receives a localized value.

The value should be stored using a UTF (Unicode Transformation Format) representation, which can use multiple bytes to represent a single character. How such a representation is stored is specific to the programming language. In the case of Java, locale-specific files are typically stored in the ISO-8859-1 character set, with characters outside that set escaped (for more information on this topic, please visit http://java.sun.com/).

An escaped UTF sequence looks like this: \u65e5. When the user has identified a preferred locale, the application, represented by the framework in the illustration below, generates the correct user interface by replacing all the keys with the corresponding values, which are drawn from a file or a database (the datastore below). Note that this assumes the user’s system is set up correctly, with fonts, etc. for the chosen locale.

Illustration: the framework replaces user interface keys with locale-specific values from the datastore

Many modern programming languages have library support for this key-value separation, including Java and XUL (XML User Interface Language, part of the Mozilla application framework). If you are building an internationalized application and are able to choose a development language, definitely take a look at the internationalization features of each considered language, because some programming languages have better support than others.

An example of a key-value pair for the American English locale is:

HELLO_KEY=Hello there!
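The key lookup can be sketched in a few lines of Java. To stay self-contained, this example loads the key-value text from in-memory strings with java.util.PropertyResourceBundle; a real application would more likely keep one properties file per locale and load it with ResourceBundle.getBundle. The TokenLookup class, its fallback locale and its method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class TokenLookup {
    private final Map<Locale, ResourceBundle> bundles = new HashMap<>();
    private final Locale fallback;

    TokenLookup(Locale fallback) {
        this.fallback = fallback;
    }

    /** Registers key=value text (properties syntax) for one locale. */
    void register(Locale locale, String propertiesText) {
        try {
            bundles.put(locale, new PropertyResourceBundle(new StringReader(propertiesText)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the localized value for a key, using the fallback locale if needed. */
    String text(Locale locale, String key) {
        ResourceBundle bundle = bundles.getOrDefault(locale, bundles.get(fallback));
        if (bundle == null) {
            throw new IllegalStateException("No text registered for " + locale);
        }
        return bundle.getString(key);
    }
}
```

Registering "HELLO_KEY=Hello there!" for Locale.US and then calling text(Locale.US, "HELLO_KEY") returns the American English greeting. Escaped sequences in the registered text are decoded when the bundle is loaded.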

Extracting all the text in the user interface is tedious work, although some automation is possible. This is one major reason to internationalize the application from the beginning: it is easier to guarantee that all text has been extracted when the application has fewer states, and each additional component or feature can be internationalized as it is built.

At this point, the developer should begin to think about the localization process as well: how will all the text be translated, tested and deployed in an efficient manner? For text stored in files, the quoting application used a combination of Excel spreadsheets for data entry, an Access database to store all localized strings and various scripts to extract and generate files for use by the web application.

In addition to simple key-value replacement, it may be desirable to put dynamic content in a string to be localized. This is typically done with another type of token that represents the dynamic content. The following is a Java example:

HELLO_KEY=Hello {0}!

Not only will “HELLO_KEY” in the application user interface be replaced with “Hello {0}!,” but the token “{0}” can be dynamically replaced with any value the application supplies. This token replacement is extremely useful when dealing with languages that have different subject-verb-object orders. There is library support in Java for this functionality. You can find more information at http://java.sun.com/.../MessageFormat.html.
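A minimal sketch of this replacement with java.text.MessageFormat follows. The Messages class and the fill method are hypothetical helpers, and the patterns would normally come from the locale-specific files rather than from string literals.

```java
import java.text.MessageFormat;
import java.util.Locale;

class Messages {
    /** Fills the numbered placeholders of a localized pattern with the given values. */
    static String fill(String pattern, Locale locale, Object... values) {
        return new MessageFormat(pattern, locale).format(values);
    }
}
```

For example, Messages.fill("Hello {0}!", Locale.US, "Ana") produces "Hello Ana!". Because placeholders are numbered, a translator can reorder them freely: the pattern "{1} sent a quote to {0}." uses the second value first.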

Locale-specific files work fine for text that rarely changes. However, for the quoting application, there was a significant amount of dynamic text, primarily product data. While most modern databases support the storage of UTF data, developers need to make sure the database is configured correctly, and that any other tools used to manipulate that data are equipped to do so. The quoting application was built on Oracle, which supports the UTF character set; developers just needed to be sure that the NLS_LANG environment variable was set to “american_america.AL32UTF8”. Other applications like SQL*Loader also needed to be configured correctly to handle multibyte characters.

Be careful using UTF strings as keys into hash tables or with third party libraries. What looks the same on the screen may not be the same string of UTF characters. For example, combining accent characters may or may not be used to represent accents. (For more on this topic, see the Unicode FAQ at http://unicode.org/faq/char_combmark.html#2).
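One common way to reduce this risk, shown here as a sketch rather than as something the quoting application did, is to normalize strings to a single Unicode form before using them as keys. The standard java.text.Normalizer class supports this; the TextKeys class and the canonical method are hypothetical names.

```java
import java.text.Normalizer;

class TextKeys {
    /** Returns the NFC (composed) form, so equivalent spellings compare equal. */
    static String canonical(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }
}
```

The word "café" written with a single precomposed character and the same word written as "e" followed by a combining acute accent are different Java strings, but they become equal once both have passed through canonical.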

Number and date formatting are typically performed by language libraries (http://java.sun.com/.../DateFormat.html). The quoting application, however, leveraged the locale-specific files rather than using the locale-specific date and number formatters directly. Two additional keys were added, one for number formatting and another for date formatting. Each value was pulled out and passed to the appropriate formatting class whenever a date or number was rendered to the user. This meant that all locale-specific information was stored in one file. It also avoided a problem: the Java libraries could not process the application’s non-standard locale codes, which, as mentioned above, were required for compatibility reasons.
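The approach of passing a stored pattern to a formatting class might look roughly like the sketch below. It assumes the patterns use the syntax of java.text.DecimalFormat and java.text.SimpleDateFormat; the article does not say which pattern syntax or key names the quoting application actually used, and the LocaleFormats class is hypothetical.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class LocaleFormats {
    /** Formats a number with a pattern read from the locale-specific file. */
    static String formatNumber(double value, String pattern, Locale locale) {
        // The symbols supply the locale's decimal and grouping separators.
        DecimalFormatSymbols symbols = DecimalFormatSymbols.getInstance(locale);
        return new DecimalFormat(pattern, symbols).format(value);
    }

    /** Formats a date with a pattern read from the locale-specific file. */
    static String formatDate(Date date, String pattern, Locale locale, TimeZone zone) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, locale);
        format.setTimeZone(zone);
        return format.format(date);
    }
}
```

With the pattern "#,##0.00", the value 1234.5 renders as "1,234.50" for Locale.US and as "1.234,50" for Locale.GERMANY. Storing "dd/MM/yyyy" for one locale and "MM/dd/yyyy" for another changes the date order without any code changes.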

Business rules are intensely application-specific and, after displaying a user interface in a user’s chosen language, are the second most important part of internationalizing an application. Developers should be aware of locale-related business rules and build support for them early on. For example, in the quoting application, people in different countries had different sets of available products. Few languages or frameworks are going to provide any support, as these rules are very application-dependent, so developers should plan to build in customized rule sets based on the locale choice.
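As one possible shape for such a rule set, the sketch below keeps a list of hidden products per country and consults it using the country part of the chosen locale. Everything here (the ProductRules class, its methods and the product identifiers used in examples) is hypothetical; the quoting application’s actual rules are not described in this article.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class ProductRules {
    // Country code mapped to the product identifiers that must not be shown there.
    private final Map<String, Set<String>> hiddenByCountry = new HashMap<>();

    /** Marks the given products as unavailable in one country. */
    void hide(String countryCode, String... productIds) {
        hiddenByCountry.computeIfAbsent(countryCode, c -> new HashSet<>())
                .addAll(Arrays.asList(productIds));
    }

    /** Returns true unless the product is hidden for the locale's country. */
    boolean isVisible(String productId, Locale locale) {
        Set<String> hidden =
                hiddenByCountry.getOrDefault(locale.getCountry(), Collections.emptySet());
        return !hidden.contains(productId);
    }
}
```

After rules.hide("FR", "laser-x"), the product is hidden for Locale.FRANCE but still visible for Locale.JAPAN.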

Note that this allows for some types of security attacks: if French users are prohibited from buying certain items that Japanese users can buy, a Japanese-speaking French user can choose the Japanese locale and view the prohibited items. Since the user is typically asked to choose a locale, they can circumvent some of the business rules based on that choice. Sometimes it is then necessary to implement business processes to handle locale-specific issues. In the case mentioned above, the call centers that handle the improper quote can choose to ignore it or to respond directly to the user that the locale is not supported.

Sorting text correctly for a given locale is a complicated issue. Again, there appears to be some library support (http://java.sun.com/.../Collator.html), although initially the quoting application did not support locale-specific sorting. However, later releases required the quoting application to sort localized text. Rather than using library support, a custom interface was built to allow business users to optionally define a sort order for any particular column in a table. This allowed administrators to apply a default sort to the column alphanumerically and then let users change it.
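For comparison, the library route that the quoting application did not take can be sketched with java.text.Collator. The LocalizedSort class is a hypothetical helper.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class LocalizedSort {
    /** Returns a copy of the values sorted by the collation rules of the locale. */
    static List<String> sorted(List<String> values, Locale locale) {
        List<String> copy = new ArrayList<>(values);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }
}
```

Sorting "cherry", "Banana" and "apple" this way for Locale.US yields apple, Banana, cherry. A plain String comparison would put "Banana" first, because uppercase letters precede lowercase ones in Unicode order.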

When all the user interface text has been replaced with tokens, including the date and number formatting as may be required, one can run the application and test the user interface. The actual generation of the application user interface is executed when the application knows the user’s locale, so that the appropriate language, number format and other locale-specific features can be displayed correctly. In desktop applications, this could be performed at installation; in web applications, the substitution is typically done at runtime.

Operational Concerns
Internationalization is more than just pulling strings out of files and finding a user’s preferred locale. For web applications, availability can be a challenge. Deploying new versions of the quoting application was an issue due to the time zones covered by the supported countries. Deploying the application severely affected its availability for a short period of time, and it was almost always business hours somewhere. While the developers were able to view usage logs and find a weekday time when the fewest people were using the application, this period was only useful for quick deployments or configuration changes. For any situation that required more than a few minutes of downtime, the solution was to deploy on Friday evening or Saturday, which falls on the weekend in all time zones. This was not very popular among the developers supporting the application.
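The reasoning about deployment windows can be checked mechanically. The sketch below uses the java.time API to list the time zones in which a proposed deployment instant falls within business hours. It assumes a Monday to Friday, 09:00 to 17:00 working week, which does not hold in every country, and the DeploymentWindow class is hypothetical rather than a tool the developers used.

```java
import java.time.DayOfWeek;
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.ArrayList;
import java.util.List;

class DeploymentWindow {
    /** Returns the zones where the instant falls on a weekday between 09:00 and 17:00. */
    static List<ZoneId> zonesInBusinessHours(Instant when, List<ZoneId> zones) {
        List<ZoneId> busy = new ArrayList<>();
        for (ZoneId zone : zones) {
            ZonedDateTime local = when.atZone(zone);
            DayOfWeek day = local.getDayOfWeek();
            // Assumption: the working week is Monday to Friday everywhere.
            boolean weekday = day != DayOfWeek.SATURDAY && day != DayOfWeek.SUNDAY;
            boolean officeHours = local.getHour() >= 9 && local.getHour() < 17;
            if (weekday && officeHours) {
                busy.add(zone);
            }
        }
        return busy;
    }
}
```

An empty result means nobody is in business hours at that instant under the stated assumption. The usage logs mentioned above remain the better guide to actual traffic.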

Furthermore, from a data modeling point of view, internationalized applications can be a bit of a headache. In every table that contains text that will be displayed to a user, there will be a key to the locale table. The quoting application ended up with thirty or so tables with foreign keys to the locale table. One alternative is to de-normalize the locale information and place the actual code, rather than a foreign key, in every table that contains user displayable text. If this alternative is chosen, triggers can be used for validation and to enforce locale correctness. Both these solutions are perfectly reasonable in terms of fulfilling business requirements and application support, but tough to manage and, less importantly, a bit distasteful from a data standpoint.

Lessons Learned
The quoting application did not support all 27 countries out of the box—locales were added incrementally. This gradual approach allowed for the resolution of process-specific issues, especially the localization process. It also allowed for the maintenance and administration of the site to mature.

Using standard locale codes whenever possible will maximize compatibility and can save future headaches. However, there may be compelling business reasons prohibiting this, and the incompatibility can be worked around.

Internationalization should be considered from the very beginning. Both the tedium of extracting all user interface strings and the complexity of business rule support argue for building some support into all applications from the start.

In short, internationalizing software, whether desktop software or a web application, is not extremely difficult. There is a set number of issues to be dealt with, the longest and most tedious being the extraction of all displayed strings to separate files. Almost every internationalization task is easier if dealt with at the beginning of a project rather than bolted on at the end—and going from one to two supported locales is typically more difficult from an internationalization viewpoint than going from two to N supported locales. Most modern programming languages have extensive library support for internationalization, which should be leveraged whenever possible.

¹ For more information on this topic, please check the “How to read, add or modify Windows registry entries with REGEDIT” topic of Rob van der Woude’s website (http://www.robvanderwoude.com/index.html). It can be found by clicking the “Batch files” link under the “Scripting” section on the home page.

² Language codes can be found in http://www.loc.gov/.../English_list.php, and country codes in http://www.iso.org/.../list-en1.html. Java libraries use the ISO codes in the locale classes: http://java.sun.com/.../Locale.html.

Dan Moore is an independent consultant who has been working with web technologies since 1997. He helped Zia Consulting extend the quoting web application outlined above, and became familiar with some of the “gotchas” of software internationalization and localization. Moore has written articles and given presentations to local technical groups on topics ranging from internationalization to Java on the cell phone and Java authentication technology. He has a degree in Physics from Whitman College and maintains a weblog that covers a variety of technical topics (and an occasional rant) at http://www.mooreds.com/weblog.