Article for translators: "Unicode - Where Are We?"

Unicode Has Come a Long Way in Recent Years, But There Is Still a Long Way to Go

Unicode has held out the promise of simplified multilingual workflows, improved publishing support for the world’s languages, and elimination of many hassles that now plague work in the GILT community. In this article Arle Lommel looks at changes in the past two years regarding Unicode support in applications localizers typically deal with.

In 2001 I reported on Unicode implementation and OpenType in the LISA Newsletter (available for LISA members). In essence, I found that Unicode had made little or no impact on how most localizers worked at the time, but that some changes were on the horizon.

The problem with Unicode so far has been that most of the work has gone into the back end of systems. While this makes perfect sense from an implementation standpoint (the front end can’t support Unicode if the back end doesn’t), it does mean that Unicode has affected most computer users very little. Unicode support in a database or an operating system makes little difference if the applications people use either ignore Unicode or, even worse, corrupt Unicode data passed to them.
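The two failure modes mentioned above, ignoring Unicode and corrupting it, can be illustrated with a minimal Python sketch. It is not taken from the original article: the function names are hypothetical, and it assumes a legacy application that treats all text as the single-byte Latin-1 code page.

```python
def misread_as_latin1(text: str) -> str:
    """Simulate an application that receives UTF-8 bytes but assumes Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def squeeze_into_latin1(text: str) -> str:
    """Simulate an application that stores text in a single-byte code page.

    Characters outside Latin-1 are replaced with '?' and cannot be recovered.
    """
    return text.encode("latin-1", errors="replace").decode("latin-1")


if __name__ == "__main__":
    # U+00E9 (e with acute) becomes two unrelated Latin-1 characters.
    print(misread_as_latin1("caf\u00e9"))
    # U+03A9 (Greek capital omega) has no Latin-1 byte and is lost.
    print(squeeze_into_latin1("\u03a9 = ohm"))
```

The first kind of damage can be undone if the data is passed along byte for byte; the second cannot, because the original character is no longer present in the data.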

Unfortunately, the vast majority of applications on the market still assume a monolingual world where Unicode is unimportant. Until the key software most users deal with supports Unicode, those users will not be able to take advantage of it. As but one example, Quark XPress is a staple of almost every publisher in the world, yet it does not support Unicode in any meaningful way, so no level of operating system or font support for Unicode will make one whit of difference to a publisher doing everything in Quark XPress.

Most multilingual computer users are still using the same fundamental technologies for international text that they were using over a decade ago, and the time- and labor-saving potential of Unicode was, until recently, by and large an empty promise. That said, recent developments in end-user Unicode support (as opposed to systems Unicode support) indicate that Unicode is truly entering the mainstream. The number of consumer applications supporting Unicode (and OpenType, which is likely to be the default Unicode implementation for most users) has risen dramatically in the past few years, as has the quality of their implementations.

Fonts

In 2001, when I last wrote on this subject, Adobe had 21 fonts available in OpenType format (out of hundreds in the company’s font library), but Adobe was actively porting their fonts from PostScript Type 1 to OpenType. This process is now complete, and Adobe no longer sells PostScript Type 1 fonts. Most of these conversions are to “Standard” fonts, i.e., the fonts are identical in their glyph complements to the older versions; but a significant number of the fonts have been converted to “Pro” versions that contain additional characters, including, in some cases, full support for all European languages that use Roman script, as well as Cyrillic and Greek scripts, plus historical character variants and dingbat (decorative) character forms.

Other major type foundries (an outdated term if ever there was one!) have made similar conversions, so there are now literally thousands of fonts in OpenType format to choose from, rather than the slim selection of a few years ago. In addition, those who wish to build their own OpenType fonts now have a solid font-editing choice in FontLab 4.5 (see the review in this issue), so legacy custom fonts can be converted to the new format and take advantage of the Unicode underpinnings of OpenType.

Both Microsoft and Apple have also bundled Unicode-rich fonts with their operating systems and make liberal use of these fonts, enabling many of the basic applications made by both companies to support a truly amazing array of languages (see notes on each operating system below). The bundling of Unicode fonts with operating systems (and the implementation of system calls that recognize these fonts) is a major milestone in the progress of Unicode, for these fonts provide developers with access to resources that would be prohibitively expensive otherwise, and allow various applications to “talk” to each other. This level of system support is what is needed to open the gates for more and more Unicode-enabled products to enter the market.

Operating Systems

A few years ago OpenType and Unicode support in major operating systems seemed half-baked at best. Despite claims that OSes were Unicode “under the hood,” this did not translate into usable support in most instances, and input and display support were spotty. Even if the “innards” of the OS were Unicode, it did not mean much to users if they could not see or access this support. Fortunately, OS support for Unicode has improved dramatically since 2001. Because localizers overwhelmingly deal with just two platforms, Microsoft Windows and Apple’s Mac OS, I will focus on these two OSes. Unix Unicode support varies by Unix “flavor” and installation (Unix installations tend to be much less uniform than Windows or Mac OS installations) and is beyond the scope of this article.

Windows

Windows has supported OpenType and Unicode since at least Windows 98, but Windows XP has taken this support to a new level. Within the desktop environment scripts can be mixed and matched with little difficulty, and input method support is excellent. Applications such as WordPad that rely on system text-handling calls inherit Unicode support from Windows, and so are now inherently multilingual. Unfortunately OS support does not automatically make most legacy applications Unicode-capable, and many major DTP and drawing applications do not support anything but the traditional single-byte font range at this point. These applications will need to be re-engineered to take advantage of OS Unicode support.

Macintosh

Unicode support in Apple’s Mac OS X is very similar to that in Windows XP. Within the desktop environment scripts can be freely mixed within file names, and even bi-directional text is handled properly. In one test, a text file was given a name consisting of Japanese, Greek, Hebrew, Devanagari, Hangul (Korean) and Arabic characters. While most of the name is random characters, the scripts coexist quite nicely.
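To make the mixed-script, bi-directional case concrete, here is a small Python sketch. It is not from the original article: the file name is a hypothetical stand-in for the one described above, and the helper `describe` is an illustrative name. It uses the standard `unicodedata` module to show that each character carries its own directionality, which is what lets a Unicode-aware system lay out left-to-right and right-to-left runs in one name.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (first word of the Unicode character name, bidirectional class)."""
    script_hint = unicodedata.name(ch).split()[0]
    return script_hint, unicodedata.bidirectional(ch)


# Hypothetical file name: Hiragana, Greek, Hebrew, Devanagari, Hangul, Arabic.
FILE_NAME = "\u3042\u03b1\u05d0\u0915\ud55c\u0628.txt"

if __name__ == "__main__":
    for ch in FILE_NAME:
        hint, bidi = describe(ch)
        print(f"U+{ord(ch):04X} {hint} {bidi}")
```

The first word of a character name is only a rough hint at the script, not a formal script property, but it is enough to see that the Hebrew and Arabic letters report right-to-left classes (`R` and `AL`) while the others report `L`.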

Unfortunately, as with Windows, most legacy applications are unable to take advantage of the OS’s Unicode support. Applications written in the fully native OS X “Cocoa” environment, including most of Apple’s bundled applications, seem to deal with Unicode data very well, while those written in the “Carbon” environment used to port OS 9 applications to OS X do not seem to make use of the OS support. Most major applications are Carbon ports from OS 9 at this point, so real Unicode support is spotty. This means, unfortunately, that the large amount of DTP work done on Macintoshes is still using the same text and font technology it was a decade ago…

Applications

This section will focus on two classes of applications: web browsers and word processing/DTP applications. I have chosen to focus on these applications, rather than other sorts of applications, for the simple reason that browsers and DTP applications represent the destination for a large percentage of the work GILT companies do, thus real Unicode support in these areas would have a disproportionately large impact on GILT companies.

Web Browsers

Web browser support is a perpetual problem for localizers given the wide variety of browsers and the number of users of antiquated browsers (such as Netscape 4.7). The good news is that the latest browsers from Microsoft and Netscape support Unicode very well, and the two account for the majority of potential users of web content. How well these browsers work, however, depends not only on the browsers’ own capability, but also on the resources (such as fonts) of the operating system under which they run. Unicode support will get even better as the browser developers implement recommendations of the W3C’s Internationalization Working Group (see Richard Ishida’s presentation on W3C internationalization activity for more information on the recommendations).

Even Unicode-capable browsers differ in their capabilities and may or may not be able to display certain scripts. In the following example, two Unicode-capable browsers running under Mac OS X, Microsoft Internet Explorer 5.2 and Safari, show striking differences in their display capabilities. Internet Explorer fails to render Arabic, Hebrew, or Devanagari in an intelligible manner, replaces a number of Greek characters with incorrect Roman glyphs, and gives the whole Greek line an odd appearance, to say the least. Both browsers fail to display certain characters in the Kazakh line because the fonts available to them do not include these characters. Note as well how each browser handles the Kazakh characters it cannot display: Internet Explorer renders them as ?, while Safari renders them with a glyph that identifies them as unavailable Cyrillic characters.

Figure 2. Different Unicode-capable browsers differ in their ability to actually render Unicode text. (The example screenshots are of Richard Ishida’s presentation on W3C internationalization activity.)
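The two fallback behaviours described for the Kazakh line can be mimicked in a few lines of Python. This is only a sketch under stated assumptions, not a description of how either browser works internally: the font coverage set is hypothetical (ASCII plus the basic Russian Cyrillic letters), and the function names are made up for illustration.

```python
# Hypothetical font coverage: printable ASCII plus basic Russian Cyrillic
# (U+0410..U+044F). Kazakh-specific letters fall outside this range.
BASIC_CYRILLIC_FONT = (
    {chr(c) for c in range(0x20, 0x7F)} | {chr(c) for c in range(0x0410, 0x0450)}
)


def question_mark(ch: str) -> str:
    """Uninformative fallback: every missing character looks the same."""
    return "?"


def labelled_box(ch: str) -> str:
    """Informative fallback: the placeholder identifies the missing character."""
    return f"[U+{ord(ch):04X}]"


def render(text: str, covered: set, fallback) -> str:
    """Keep characters the font covers; substitute a fallback for the rest."""
    return "".join(ch if ch in covered else fallback(ch) for ch in text)


if __name__ == "__main__":
    word = "\u049a\u0430\u0437\u0430\u049b"  # a Kazakh word using U+049A and U+049B
    print(render(word, BASIC_CYRILLIC_FONT, question_mark))
    print(render(word, BASIC_CYRILLIC_FONT, labelled_box))
```

With the second strategy a reader can at least tell which characters are missing, which is the practical advantage of the behaviour attributed to Safari above.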

At present the overwhelming majority of web browsers in use do not support Internationalized Resource Identifiers (IRIs), the internationalized replacement for URI web addresses, so developers cannot count on being able to use IRIs for some time and are still limited to US-ASCII URIs for web addresses.
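For readers who want to see what the restriction to US-ASCII means in practice, the following Python sketch converts an IRI into an ASCII-only URI. It is an illustration rather than a complete implementation: the function name `iri_to_uri` and the example address are hypothetical, user names and passwords in the address are ignored, and the host is converted with Python's built-in `idna` codec while the remaining parts are percent-encoded as UTF-8.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Convert an IRI to an ASCII-only URI (simplified; ignores userinfo)."""
    parts = urlsplit(iri)
    # Host names use the ASCII-compatible "xn--" encoding.
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Other components: non-ASCII characters become percent-encoded UTF-8.
    # "%" is kept safe so existing escapes are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


if __name__ == "__main__":
    print(iri_to_uri("http://b\u00fccher.example/stra\u00dfe?q=\u00fc"))
```

The result is legal everywhere a URI is legal, but it is no longer readable as text in the original language, which is why native IRI support in browsers matters to localizers.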

In short, Unicode browsers are here, but they are not yet perfect, and Unicode text cannot be relied upon in all circumstances on the web. This is an area of rapid change, however, and just a few years ago browser support was much less reliable than it is now. My prediction is that within two years, most users’ needs for Unicode capability in browsers will be met, and reliance on legacy code pages for web content will be increasingly needless.

Word Processing/DTP Applications

Microsoft Office

The latest versions of Microsoft Office for Windows support Unicode TrueType fonts quite well, but do not support any OpenType advanced features beyond those required for specific scripts; Roman-script Office documents under Windows cannot automatically substitute glyphs or take advantage of any of the advanced typographic features of OpenType. Most business users of Office are unlikely to need anything beyond basic language support for various scripts, however, so Office’s support for OpenType is adequate for its intended audience.

Office X for Macintosh does not seem to support Unicode font display or input, and instead relies on old code pages for its international support. This means that Office documents created on Windows may not display properly on the Macintosh (although Office does pass Unicode data through unharmed and does not crash or corrupt the data, so a file opened under Mac Office X can be returned to Windows without damage). Office X will recognize Roman OpenType fonts, but uses only the 232 characters available in a traditional single-byte Roman font. (It will correctly interpret and use CJK OpenType fonts, however, so Asian-language users of Office are in luck.)
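A quick way to see what a single-byte Roman code page leaves out is to test text against it. The Python sketch below is an illustration only: the function name is hypothetical, and it assumes Python's `mac_roman` codec as a stand-in for the legacy code page an application like this relies on, which may not match its actual character set exactly.

```python
def unsupported_characters(text: str, codec: str) -> list:
    """Return the characters in text that the given legacy codec cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


if __name__ == "__main__":
    samples = [
        "caf\u00e9",                  # French: fits in a Roman code page
        "erd\u0151",                  # Hungarian: U+0151 does not
        "\u042f\u0437\u044b\u043a",   # Russian: no Cyrillic at all
    ]
    for sample in samples:
        missing = unsupported_characters(sample, "mac_roman")
        print(sample, [f"U+{ord(c):04X}" for c in missing])
```

Any character reported as missing is one that a code-page-based application cannot display from a Roman font, even when the OpenType font itself contains the glyph.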

Quark XPress

Quark XPress is the undisputed king of page layout programs (despite Adobe InDesign’s strong showing in this area), and real Unicode support in Quark would mark a major turning point in Unicode acceptance. Unfortunately, it seems that the recent release of Quark XPress 6.0 has not introduced Unicode or OpenType support. This is disappointing, but also consistent with Quark’s requirement that users purchase separate language versions of the software to handle various scripts. Beyond hampering multilingual work, the lack of real Unicode and OpenType support means that Quark cannot take advantage of the advanced typographical capabilities afforded by OpenType.

Adobe InDesign

Although Adobe’s “Quark killer” has not come close to breaking Quark’s dominance in the DTP market, it does have vastly superior Unicode and multilingual support when compared to Quark XPress. The support is a bit quirky, but no other mainstream DTP application on the market rivals it. InDesign was built around OpenType and Unicode, and it is only with OpenType fonts that InDesign’s capabilities reach their maximum potential. It supports a large number of advanced OpenType features and allows free mixing of scripts (with the notable exception of bi-directional scripts like Arabic and Hebrew), and can readily import and display Microsoft Office files that include Unicode text (something, as noted above, that Office for Mac cannot do).

Strangely, at least under Mac OS X, Unicode input methods for text are not supported, while older code page based input methods work fine (presumably this is because InDesign on Mac is a Carbon application that cannot fully make use of OS X’s native resources). Use of an “Insert Character” palette does allow characters to be selected and input, but this is really not an option for serious text entry. However, if text can be entered into Microsoft Office, InDesign can import that, so there are routes to get Unicode text into InDesign.

InDesign’s Unicode support should help make InDesign an attractive platform for localization. It would be especially attractive for projects that require multiple languages to be supported within a single document. It is not hard to conceive of projects that would require four or more separate copies of Quark XPress’s various localized versions that could, in principle, be handled with a single standard installation of InDesign (e.g., a project in English with small amounts of text in Simplified Chinese, Japanese and Korean would need to be opened in four separate copies of Quark - one to work on each language - while it could be done entirely within the English version of InDesign with no special installation or plug-ins).

Adobe Photoshop

Photoshop’s OpenType support is similar to that of InDesign. It supports some of OpenType’s advanced layout features (such as ligatures and old style figures), but not as many as InDesign supports, and it suffers from the same input method restrictions as InDesign (at least under Mac OS X), while lacking an Insert Character palette. This means that Photoshop cannot access the full character complement of many OpenType fonts (it cannot access the Greek characters in Adobe’s flagship MinionPro, for example). For languages with appropriate input methods, however, the OpenType support is solid and adequate.

Figure 3. Example of multilingual OpenType text in Adobe Photoshop. A single OpenType font (Adobe MinionPro) is used to display text in English, Russian, and Hungarian. Unicode characters not available to non-Unicode applications are shown in red.

Adobe Illustrator

Adobe Illustrator has limited multiple script support for TrueType-flavored OpenType fonts at this time, while PostScript-flavored OpenType fonts are treated essentially as old-fashioned single-byte fonts. Illustrator does, however, work very well with CJK OpenType fonts and has additional language-specific support for Japanese in the English version.

Adobe Acrobat

For obvious reasons Adobe Acrobat depends on other applications for OpenType support, but output from OpenType-aware applications to PDF via Acrobat Distiller is generally flawless and accurate. Getting text back out of Acrobat, however, is harder, and most characters not found in old-fashioned single-byte fonts are lost when text is exported from Acrobat. Most language professionals know that PDFs are not always a final destination format, however, so this is one area where Unicode/OpenType support could stand to improve.

Other applications

Most applications not mentioned above will support at least the basic Roman range of OpenType fonts, and some may have support for other code ranges, but support tends to be quite basic. This will likely change in the next few years as more and more applications begin to take advantage of the OS-level support made available in recent operating systems.

Summary

Although meaningful Unicode support is far from universal, real end-user support for Unicode has become a reality, and is improving. There are real options for those wanting to use Unicode today, and these options are getting better all the time. Unicode is finally beginning to live up to its promise. While there is a long way to go before Unicode is pervasive in everything we do, we are no longer waiting for Unicode to have an impact on the bread and butter of the GILT industry.

Reprinted by permission from the Globalization Insider,
15 July 2003, Volume XII, Issue 3.2.
Copyright the Localization Industry Standards Association
(Globalization Insider: www.localization.org, LISA: www.lisa.org)
and S.M.P. Marketing Sarl (SMP) 2004
