Using XML For Localization


By Yves Savourel,
Localization Solutions Architect

ENLASO Corporation

XML is one of the safest, most powerful and flexible ways to store, manipulate, localize and present data in different languages. With the vast array of internationalization features and companion technologies, XML provides many advantages in translation and localization projects.

Unicode - While the encoding of a saved document can be anything, the character set of a parsed XML document is always ISO 10646/Unicode and therefore can include characters in most of the languages used today. Not only can the content of the XML files consist of almost any Unicode characters, but the names of elements and attributes can also be made of non-ASCII characters. For instance, the document shown below has Japanese and Russian content, and the element and attribute names are in Russian.

Listing 1 - Non-ASCII XML document:

<?xml version="1.0"?>
<Собрание версия="2.0">
 <Объект id="12">
  <НомерОбъекта>12-3456-0050</НомерОбъекта>
  <ВНаличии>123</ВНаличии>
  <Описание xml:lang="ja">第二発電機</Описание>
  <Описание xml:lang="ru">Запасной генератор</Описание>
 </Объект>
</Собрание>
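
As a small illustration of the point above, the sketch below parses a shortened variant of Listing 1 with Python's standard xml.etree.ElementTree module and reads values through the Russian element and attribute names. The shortened document and the variable names are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

# Shortened variant of Listing 1; element and attribute names are non-ASCII.
doc = ET.fromstring(
    '<Собрание версия="2.0">'
    '<Объект id="12"><ВНаличии>123</ВНаличии></Объект>'
    '</Собрание>'
)

# Non-ASCII names are used exactly like ASCII ones.
version = doc.get("версия")
in_stock = int(doc.find("Объект/ВНаличии").text)
```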

Encoding Identification - One of the main sources of problems when localizing almost any type of data is opening files with an incorrect encoding. XML solves this issue by providing a clear mechanism, the encoding declaration, that (if applied properly) always identifies the encoding used to save the document.
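
The following sketch, using Python's standard xml.etree.ElementTree module, shows the effect of the encoding declaration. The sample content and variable names are made up for the illustration: the same bytes parse correctly when the declaration matches the encoding actually used, and are rejected when it does not.

```python
import xml.etree.ElementTree as ET

# A document saved in ISO-8859-1 and declared as such.
text = '<?xml version="1.0" encoding="iso-8859-1"?>\n<msg>Café</msg>'
raw = text.encode("iso-8859-1")  # "é" is stored as the single byte 0xE9

# The parser reads the declaration and decodes the bytes accordingly.
declared_text = ET.fromstring(raw).text

# Same bytes, but wrongly declared as UTF-8: the lone 0xE9 byte is not
# valid UTF-8, so the parser reports an error instead of guessing.
mislabeled = raw.replace(b"iso-8859-1", b"utf-8")
try:
    ET.fromstring(mislabeled)
    mislabeled_ok = True
except ET.ParseError:
    mislabeled_ok = False
```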

Escape Mechanism - XML also offers a safe way to escape extended characters that are not supported by a given encoding. This allows the document to contain any character, even one that is not supported in the encoding the document is using. The hexadecimal form (&#xHHH;) and the decimal form (&#DDD;) both use the Unicode code-point value of the character. For example, the Greek capital letter Omega (Ω) can be escaped as &#x3A9; or &#937; if a given document uses an encoding where the Greek alphabet is not supported.
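
A minimal sketch of this mechanism with Python's standard xml.etree.ElementTree module: both escaped forms of Omega parse to the same character, and serializing to an encoding that lacks the character (US-ASCII here, picked only for the example) produces a numeric character reference.

```python
import xml.etree.ElementTree as ET

# Both numeric forms refer to the same code point, U+03A9.
hex_form = ET.fromstring("<p>&#x3A9;</p>").text
dec_form = ET.fromstring("<p>&#937;</p>").text

# Writing the character in an encoding that cannot represent it:
# the serializer falls back to a character reference.
elem = ET.Element("p")
elem.text = "\u03a9"
ascii_bytes = ET.tostring(elem, encoding="us-ascii")
```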

Language Identification - Since an XML document can store data in different languages, it is important to be able to identify the language of the content. The XML namespace offers an attribute for this: xml:lang. The values of xml:lang are the same as the ones for the LANG attributes in HTML. They are defined by the RFC 3066 (an update to the RFC 1766), allowing 2 and 3-letter language codes, with an optional 2-letter country code. For instance: xml:lang="ar" indicates Arabic in general, while xml:lang="ar-IQ" identifies Arabic for Iraq. Note that, unlike other attribute values, the values of xml:lang are not case sensitive. Note also that the current values offered by RFC 3066 do not cover all needs of localization. For instance, there is currently no code for Latin-American Spanish; and there is no way to make a distinction between different scripts such as Azeri in Cyrillic script and Azeri in Latin script.
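
As a sketch of how an application might use xml:lang, the code below computes the language in effect for each element (the value is inherited from the nearest ancestor that sets it) and compares values without regard to case. The catalog document and the helper names effective_languages and lang_matches are hypothetical, introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def effective_languages(root):
    """Map each element to the xml:lang value in effect for it."""
    result = {}

    def walk(elem, inherited):
        lang = elem.get(XML_LANG, inherited)  # own value, else the ancestor's
        result[elem] = lang
        for child in elem:
            walk(child, lang)

    walk(root, None)
    return result

def lang_matches(value, wanted):
    """Case-insensitive test: 'ar' matches both 'ar' and 'ar-IQ'."""
    if value is None:
        return False
    value, wanted = value.lower(), wanted.lower()
    return value == wanted or value.startswith(wanted + "-")

doc = ET.fromstring(
    '<catalog xml:lang="en">'
    '<item id="1"><desc>Spare generator</desc></item>'
    '<item id="2" xml:lang="AR-iq"><desc>...</desc></item>'
    '</catalog>'
)
langs = effective_languages(doc)
arabic_ids = [i.get("id") for i in doc.iter("item") if lang_matches(langs[i], "ar")]
```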

Transformation Mechanism - In addition to the features of XML itself, some XML-related technologies make it easier to implement multilingual XML solutions. For example, XSLT is a powerful declarative language to manipulate XML data. It uses XPath, the standard mechanism to point to any node in an XML document. Both XSLT and XPath offer internationalization features such as language identification, sorting, number formatting, itemization using various digit sets, and so forth.

Rendering - Lastly, two other technologies related to XML, XSL (also called XSL-FO, for XSL Formatting Objects) and CSS (Cascading Style Sheets), give XML a powerful and well-internationalized framework to render content. Combined with the possibilities offered by XSLT, they let you present any given XML document in a wide variety of outputs.

Migrating to XML
Obviously, not everyone has data stored in an XML format today. But there are ways to use XML even in these circumstances, and to pave the way for a slow and often painless migration to XML, while still harvesting the benefits of its use early on.

There are three ways to integrate XML into your process:

Scenario 1: As a temporary source

The localizer uses a filter to convert or extract/merge the original format to XML, back and forth. In this first scenario, nothing changes for the developer or the author of the documentation. The localization provider can take advantage of XML to fit the proprietary format into an existing localization process (including for example, TM leveraging, terminology management, etc.).

Scenario 2: As an alternate source

The developer/author works in XML and uses a one-way converter to create the file normally used by the application. This is an attractive solution if you need to keep most of your process untouched. By developing one conversion tool to go from your new XML format to your old proprietary format, you keep the benefits of an existing process and gain the advantages of providing XML data to the localizer. Converters from XML to another format are usually not costly to develop; in many cases a simple XSLT template will do.

Scenario 3: As the original source

The original format is in XML. This is obviously the best way to integrate XML within your architecture. One type of document that can be easily migrated this way is HTML. The official hypertext file format recommended by the W3C is now XHTML, not HTML. The conversion from HTML to XHTML can be done easily, even automatically in many cases.

The most important aspect is to move the XML format into the developer/author domain. By doing this, you allow the developers and the authors to use metadata directly within the original source file, ultimately giving them much more control over what happens to the data during localization.

Taking Advantage of XML
Once you have the source data in an XML format, you can start taking advantage of it in many different ways:

Insertion of metadata related to localization. For example: comments for the translators, indication of some length limitation for content, etc.
Addition of unique identifiers for given sections of text, so you can reuse existing translations based on ID matching rather than text matching. The traditional translation memory mechanism can be used later on for what is left to translate.
Using technologies such as XML Linking, you can set up automatic referencing. For instance: pulling error messages from a resource file into a help document. This ensures a single translation and an always-perfect correspondence between the message in the UI and the one in the documentation.
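
The ID-based reuse described above can be sketched as follows. This is only an illustration under assumptions: the <resources>/<msg> vocabulary, the note and translated attributes, and the reuse_by_id helper are all hypothetical, and the sketch additionally checks that the source text is unchanged before reusing a translation.

```python
import xml.etree.ElementTree as ET

def reuse_by_id(root, previous):
    """Reuse stored translations by ID; return the IDs still to translate."""
    remaining = []
    for msg in root.iter("msg"):
        old = previous.get(msg.get("id"))
        # Reuse only if the ID is known and the source text has not changed.
        if old is not None and old["source"] == msg.text:
            msg.text = old["target"]
            msg.set("translated", "yes")  # hypothetical status attribute
        else:
            remaining.append(msg.get("id"))
    return remaining

# Hypothetical resource file: IDs plus a translator note as metadata.
resources = ET.fromstring(
    '<resources>'
    '<msg id="err.open" note="Keep under 40 characters">Cannot open the file.</msg>'
    '<msg id="err.save">Cannot save the file.</msg>'
    '</resources>'
)
# Translations kept from an earlier version, keyed by ID.
previous = {
    "err.open": {
        "source": "Cannot open the file.",
        "target": "Impossible d'ouvrir le fichier.",
    }
}
todo = reuse_by_id(resources, previous)
```

Whatever remains in todo would then go through the usual translation memory step.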

In some cases, you can go one step further and use existing XML vocabularies instead of inventing your own. This will increase interoperability for your data. In the past few years, several specialized XML applications have been developed for particular purposes:

OLIF (Open Lexicon Interchange Format) -- for exchanging machine translation-oriented terminology between tools.
TBX (TermBase eXchange) -- for exchanging glossaries between terminology tools.
TMX (Translation Memory eXchange) -- for exchanging translation memories between translation tools.
XLIFF (XML Localization Interchange File Format) -- for transporting extracted localizable data from one step to the other in the localization process.

Using these formats when appropriate can bring many additional benefits.

Translating XML Documents
While the support for XML among translation tools is improving, it is still not perfect, and you may have to work around many problems if the XML schemas you are using are somewhat complex. Here are some potential issues:

Very few tools can deal with multilingual files. The xml:lang attribute is not recognized as a special attribute, so it is neither taken into account nor updated.
Do not expect tools to give you back translated files with CDATA sections as they were in the original files. Some products don't even handle CDATA correctly.
Most tools, with the exception of very few like GlobalSight System 4, work with an element/attribute paradigm rather than a node paradigm. In other words, they can't deal with element content that should be translated or not depending on conditions such as a specific value for a given attribute.

One way to work around this last problem is to slightly modify the original file so that it becomes better input for tools with limited XML support. For example, the XML document below has only one element content to be translated: "Cancel". However, very few tools will be able to set the necessary conditions for that because the case is too complex for them: translate only the content of any <data> element if it has an attribute type set to "text", and if it is inside a <component> element that has an attribute type set to "caption".

Listing 2 - Example of UI description in XML:

<?xml version="1.0"?>
<dialogue xml:lang="en-gb">
 <rsrc id="123">
  <component id="456" type="image">
   <!-- Do not translate -->
   <data type="text">images/cancel.gif</data>
   <data type="coordinates">12,20,50,14</data>
  </component>
  <component id="789" type="caption">
   <!-- Translate -->
   <data type="text">Cancel</data>
   <data type="coordinates">12,34,50,14</data>
  </component>
 </rsrc>
</dialogue>
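
For comparison, a node-oriented selection expresses the condition above directly. The sketch below uses the XPath subset supported by Python's standard xml.etree.ElementTree module on the content of Listing 2 (comments omitted); the variable names are chosen for the example.

```python
import xml.etree.ElementTree as ET

UI_XML = """<dialogue xml:lang="en-gb">
 <rsrc id="123">
  <component id="456" type="image">
   <data type="text">images/cancel.gif</data>
   <data type="coordinates">12,20,50,14</data>
  </component>
  <component id="789" type="caption">
   <data type="text">Cancel</data>
   <data type="coordinates">12,34,50,14</data>
  </component>
 </rsrc>
</dialogue>"""

root = ET.fromstring(UI_XML)

# <data type="text"> elements, but only inside <component type="caption">.
path = ".//component[@type='caption']/data[@type='text']"
translatable = [data.text for data in root.findall(path)]
```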

One method to address this type of issue that will work with any tool is to reduce the conditions into a new temporary element. For instance, add an element <tbt> (to be translated) where the content needs to be localized. The tools can then use a simple condition: translate the content of any <tbt> element. If the document has to be validated during the localization process, you may have to declare the new element as part of a different namespace; otherwise you can use it directly, as long as you remove it after localization.

Listing 3 - Example of UI description in XML with extra element for translation:

<?xml version="1.0"?>
<dialogue xml:lang="en-gb">
 <rsrc id="123">
  <component id="456" type="image">
   <!-- Do not translate -->
   <data type="text">images/cancel.gif</data>
   <data type="coordinates">12,20,50,14</data>
  </component>
  <component id="789" type="caption">
   <!-- Translate -->
   <data type="text"><tbt>Cancel</tbt></data>
   <data type="coordinates">12,34,50,14</data>
  </component>
 </rsrc>
</dialogue>

A simple XSL transformation template can be used to automate the changes as shown below.

Listing 4 - XSL Transformation for adding special translation elements:

<?xml version="1.0" ?>
<xsl:stylesheet
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 version="1.0">
 <xsl:output encoding="utf-8" />
 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>
 <xsl:template
  match="//component[@type='caption']/
           data[@type='text']">
  <xsl:copy>
   <xsl:apply-templates select="@*"/>
   <tbt><xsl:apply-templates/></tbt>
  </xsl:copy>
 </xsl:template>
</xsl:stylesheet>
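
The temporary element also has to be removed once the translation is done. The sketch below shows the round trip in Python with the standard xml.etree.ElementTree module rather than XSLT. It is a simplified illustration that assumes the marked elements contain plain text only; the function names add_tbt and remove_tbt are made up for the example.

```python
import xml.etree.ElementTree as ET

PATH = ".//component[@type='caption']/data[@type='text']"

def add_tbt(root):
    """Wrap the text of each matching <data> element in a <tbt> element."""
    for data in root.findall(PATH):
        tbt = ET.SubElement(data, "tbt")
        tbt.text, data.text = data.text, None

def remove_tbt(root):
    """Put the (translated) text of each <tbt> back into its parent."""
    for parent in list(root.iter()):
        for tbt in parent.findall("tbt"):
            parent.text = (parent.text or "") + (tbt.text or "")
            parent.remove(tbt)

root = ET.fromstring(
    '<dialogue>'
    '<component type="image"><data type="text">images/cancel.gif</data></component>'
    '<component type="caption"><data type="text">Cancel</data></component>'
    '</dialogue>'
)

add_tbt(root)
marked = ET.tostring(root, encoding="unicode")

# Stand-in for the translation step: the tool only touches <tbt> content.
for tbt in root.iter("tbt"):
    tbt.text = "Annuler"

remove_tbt(root)
restored = ET.tostring(root, encoding="unicode")
```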

In addition to indicating what element content and/or attribute values need translation, you also need to provide some other information to the localizer:

Are there any elements that have pre-formatted content? (like <pre> in XHTML)
Are there any elements that have content other than normal text? (like <script> in XHTML)
Which elements should be treated as inline codes? (like <b> in XHTML)

Just like for other file formats, there are various strategies for translating XML documents.

The first option is to use an XML-specialized translation tool that will allow you to edit the text directly into the original format, as shown below with Trados TagEditor:

[Screenshot: Trados TagEditor]

As many XML documents are simple repositories of data that are used in a context different from the way they are stored, seeing the original file may not always be a big advantage.

The second option is to use tools that extract the translatable text into a database and present it to the localizer in a tabular form. For instance, Catalyst, from Alchemy Software, uses this mechanism as shown below:

[Screenshot: Alchemy Catalyst presenting the extracted translatable text in tabular form]

Such applications usually offer a preview method that reconstructs the original format and displays it, so the translator can see the data in context. Obviously the same limitations as in the first option apply here as well: sometimes XML files are simply not the best place to see the text in the context it will be used.

Finally, the last option is to prepare the XML document with a utility that adds a color-coded RTF layer on top of the content. This way the file can be opened and translated in a classic word-processor like Word. This is illustrated below with Wordfast used as the translation tool:

[Screenshot: XML content with a color-coded RTF layer, translated in Word with Wordfast]

This method has the advantage of being more flexible regarding the choice of translation tool you want to use (or if you do not want to use any translation tool). It may also be more accommodating if the original format is not very well supported by translation tools: you can always add manual formatting to delimit the translatable parts from the sections to leave untouched.

Regardless of the method used, at some point the localizer has to create some sort of definition file that stores the localization-related information for the format to translate: which parts of the content need translation and which parts need to be left untouched, the names of the translatable attributes, and so forth.

For example, the following screen shows how this is done with SDLX from SDL International:

[Screenshot: SDLX from SDL International]

Each tool has its own way of storing this type of information, but overall, they all use more or less the same information. At some point, the standardization of both the information needed and a common format to specify it would be helpful to everyone.

One of the attractive aspects of XML is that by applying different style-sheets and/or transformation templates to the document you can provide different views of the data. For example: one showing the data as it will be formatted in the final output, one displaying only items that have changed, and another showing source and target side by side for easier comparison, and so on.

Using XSLT allows you to go even further. You can create templates that validate the document and display the items with errors. If the XML format you are using is one of the standard formats such as XLIFF, you can develop a library of templates offering a wide range of functions that you can reuse across different original formats. In the example below, a simple template provides a comparison between the source and the target text, flagging any target entries where an ending exclamation point is missing. You can apply similar methods to verify (according to each language) other punctuation requirements, preservation of leading or trailing spaces, missing or extra variables, etc. This specific example uses a very useful extension in the Microsoft XSL engine: the ability to call scripts in the template.

Listing 5 - XSLT template for verification:

<?xml version="1.0" encoding="iso-8859-1" ?>
<xsl:stylesheet
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 version="1.0"
 xmlns:msxsl="urn:schemas-microsoft-com:xslt"
 xmlns:rws="http://www.translate.com/xml-verification">
 <msxsl:script language="JScript" 
        implements-prefix="rws">
<![CDATA[
var g_nCount = 0;
function ErrorCount () {
 return(g_nCount);
}
function CheckLastChar(p_Src, p_Trg, p_Char) {
 var sSrc = new String(p_Src.nextNode().text);
 if ( sSrc.length<1 ) return("");
 var sTrg = new String(p_Trg.nextNode().text);
 if ( sTrg.length<1 ) return("Empty Target");
 var cTmp = sSrc.charAt(sSrc.length-1);
 if ( cTmp == p_Char ) {
  if ( cTmp != sTrg.charAt(sTrg.length-1) ) {
   g_nCount++;
   return("Missing character '" + p_Char +
    "' at the end of the target text.");
  }
 }
 return("");
}
]]>
 </msxsl:script>
 <xsl:template match="text()"/>
 <xsl:template match="comment()"/>
 <xsl:template match="//alt-trans"/>
 <xsl:template match="/xliff">
  <html>
   <head>
    <title>Verification</title>
   </head>
   <body>
    <h1>Verification</h1>
    <table border="1" cellspacing="0" cellpadding="3">
     <xsl:apply-templates/>
    </table>
    <p>Number of errors =
     <b><xsl:value-of select="rws:ErrorCount()"/></b></p>
   </body>
  </html>
 </xsl:template>
 <xsl:template match="//source">
 <xsl:variable name="R1"
  select="rws:CheckLastChar(.,../target,'!')"/>
  <xsl:if test="$R1!=''">
   <tr>
    <xsl:attribute name="style">
         background:silver</xsl:attribute>
    <td>Error <xsl:value-of 
        select="rws:ErrorCount()"/></td>
    <td><xsl:value-of select="$R1"/></td>
   </tr>
   <tr>
    <td>Source:</td>
    <td>[<xsl:value-of select="."/>]</td>
   </tr>
   <tr>
    <td>Target:</td>
    <td>[<xsl:value-of select="../target"/>]</td>
   </tr>
  </xsl:if>
 </xsl:template>
</xsl:stylesheet>
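
Because Listing 5 depends on the script extension of the Microsoft XSL engine, the same check may be easier to reuse elsewhere as a small standalone script. The Python sketch below mirrors the logic of CheckLastChar with the standard xml.etree.ElementTree module. It is an approximation, not a port of the template: the trimmed, un-namespaced XLIFF-like sample and the verify helper are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def check_last_char(source, target, char):
    """Return an error message, or "" when there is nothing to report."""
    if not source:
        return ""
    if not target:
        return "Empty Target"
    if source.endswith(char) and not target.endswith(char):
        return f"Missing character '{char}' at the end of the target text."
    return ""

def verify(xliff_root, char="!"):
    """Check every <trans-unit>; return a list of (id, message) pairs."""
    problems = []
    for unit in xliff_root.iter("trans-unit"):
        source = "".join(unit.find("source").itertext())
        # Direct child only, so <target> inside <alt-trans> is ignored.
        target_elem = unit.find("target")
        target = "".join(target_elem.itertext()) if target_elem is not None else ""
        message = check_last_char(source, target, char)
        if message:
            problems.append((unit.get("id"), message))
    return problems

# Trimmed sample: a real XLIFF file carries more required attributes.
SAMPLE = """<xliff version="1.0"><file><body>
<trans-unit id="1"><source>Warning!</source><target>Attention !</target></trans-unit>
<trans-unit id="2"><source>Stop!</source><target>Arrêt</target></trans-unit>
<trans-unit id="3"><source>Done.</source><target></target></trans-unit>
</body></file></xliff>"""

problems = verify(ET.fromstring(SAMPLE))
```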

This type of mechanism allows you to extend and easily customize the capabilities of any translation tool. Because most tools offer a function to preview the translation, the user can see the results of the validation directly in his/her working environment, as shown below for TagEditor:

[Screenshot: output of the XSLT verification template previewed in TagEditor]

The Missing Parts

There are still a few pieces that need to be added to the overall XML framework.

A. Localization Properties
The localization properties are the information translation tools need in order to correctly process an XML document. This information needs to be defined at the document type level. For example: one set of localization properties for XHTML, another for SVG, another for ebXML, and so forth. It would be even better if these properties could be integrated into the XML Schema of each of the document types. An illustration of such a possible definition file is shown below.

Listing 6 - Localization properties for the document in Listing 2:

<?xml version="1.0" ?>
<locprop version="0.1">
 <rules name="Example1" root="dialogue">
  <element-defaults localize="no"/>
  <attribute-defaults localize="no"/>
  <rule item="//component[@type='caption']/ 
       data[@type='text']"
   localize="yes"/>
 </rules>
</locprop>
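
To make the idea concrete, here is a rough sketch of how a tool might apply such a file. The locprop format is only an illustration in this article, so everything here is an assumption: the translatable_elements helper is hypothetical, attribute rules are ignored, and the rule path is adapted to the XPath subset of Python's xml.etree.ElementTree by removing whitespace and making it relative to the root.

```python
import xml.etree.ElementTree as ET

LOCPROP = """<locprop version="0.1">
 <rules name="Example1" root="dialogue">
  <element-defaults localize="no"/>
  <attribute-defaults localize="no"/>
  <rule item="//component[@type='caption']/
       data[@type='text']"
   localize="yes"/>
 </rules>
</locprop>"""

UI_XML = """<dialogue xml:lang="en-gb"><rsrc id="123">
 <component id="456" type="image">
  <data type="text">images/cancel.gif</data>
 </component>
 <component id="789" type="caption">
  <data type="text">Cancel</data>
  <data type="coordinates">12,34,50,14</data>
 </component>
</rsrc></dialogue>"""

def translatable_elements(doc_root, locprop_root):
    """Return the elements whose content should be localized."""
    rules = locprop_root.find("rules")
    if rules.get("root") != doc_root.tag:
        return []  # the rule set is for another document type
    default = rules.find("element-defaults").get("localize") == "yes"
    decisions = {elem: default for elem in doc_root.iter()}
    for rule in rules.findall("rule"):
        # Simplification: drop all whitespace and anchor the path at the root.
        path = "." + "".join(rule.get("item").split())
        for elem in doc_root.findall(path):
            decisions[elem] = rule.get("localize") == "yes"
    return [elem for elem in doc_root.iter() if decisions[elem]]

selected = translatable_elements(ET.fromstring(UI_XML), ET.fromstring(LOCPROP))
selected_text = [elem.text for elem in selected]
```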

B. Localization Directives
Localization directives are information specified at the document instance level to indicate specific behavior or information for a given part of the content. For instance: a span of text not to translate in a paragraph otherwise translatable, a specification of a maximum number of characters to use for a given field, etc.

Localization directives are to be used as metadata, using the standard XML namespace mechanism. This allows you to embed, within the document of your own document type, common directives that can be understood by any authoring and translation application. You can imagine that at some point an author could simply highlight a section of the text and click a button to make it non-translatable, or to identify it as a glossary term, etc.

For example, the XHTML document below includes a few of those possible types of directives.

Listing 7 - Localization directives in an XHTML document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
 xml:lang="en" lang="en"
 xmlns:loc="urn:the-localization-directives-standard">
 <head><title loc:id="100">Title</title></head>
 <body>
  <h1 id="101">Introduction to <loc:span term="yes">
   Document Management</loc:span></h1>
  <p id="102">Our company, <loc:span localize="no">
Infinite Wisdom Inc.</loc:span>, provides quality  
courses on how to
manage your documentation.</p>
 </body>
</html>
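
A tool that understood such directives could split a paragraph into translatable and protected segments. The sketch below illustrates this in Python with the standard xml.etree.ElementTree module, on a shortened paragraph modeled on Listing 7. The namespace URI is the placeholder used in that listing, and the segments helper is hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
LOC = "urn:the-localization-directives-standard"  # placeholder URI from Listing 7

def segments(elem):
    """Yield (text, translatable) pairs for the content of elem, in order."""
    if elem.text:
        yield elem.text, True
    for child in elem:
        protected = (child.tag == f"{{{LOC}}}span"
                     and child.get("localize") == "no")
        if protected:
            # Everything inside the span is passed through untouched.
            yield "".join(child.itertext()), False
        else:
            yield from segments(child)
        if child.tail:
            yield child.tail, True

doc = ET.fromstring(
    f'<html xmlns="{XHTML}" xmlns:loc="{LOC}"><body>'
    '<p id="102">Our company, <loc:span localize="no">Infinite Wisdom Inc.'
    '</loc:span>, provides quality courses.</p></body></html>'
)
para = doc.find(f".//{{{XHTML}}}p")
result = list(segments(para))
```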

Such vocabulary could also be re-used by schema developers to include localization-specific attributes in their own vocabularies, the same way xml:lang is used today.
