Using XML For Localization
By Yves Savourel,
Localization Solutions Architect
ENLASO Corporation
XML
is one of the safest, most powerful and flexible ways to
store, manipulate, localize and present data in different
languages. With the vast array of internationalization features
and companion technologies, XML provides many advantages
in translation and localization projects.
Unicode - While the encoding
of a saved document can be anything, the character set of
a parsed XML document is always ISO
10646/Unicode and therefore can include characters
in most of the languages used today. Not only can the content
of the XML files consist of almost any Unicode characters,
but the names of elements and attributes can also be made
of non-ASCII characters. For instance, the document shown
below has Japanese and Russian content, and the element
and attribute names are in Russian.
Listing 1 - Non-ASCII
XML document:
<?xml version="1.0"?>
<Собрание версия="2.0">
<Объект id="12">
<НомерОбъекта>12-3456-0050</НомерОбъекта>
<ВНаличии>123</ВНаличии>
<Описание xml:lang="ja">第二発電機</Описание>
<Описание xml:lang="ru">Запасной генератор</Описание>
</Объект>
</Собрание>
Encoding
Identification -
One of the main sources of problems during the process of
localizing almost any type of data is related to opening
the files using an incorrect encoding. XML solves this issue
by providing a clear mechanism that always ensures (if applied
properly) the identification of the encoding used to save
the document.
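A minimal sketch of this mechanism, using Python's standard xml.etree.ElementTree module. The sample documents and the function name read_text are illustrative assumptions, not part of the original article. The parser reads the encoding declaration itself, so the caller passes raw bytes and never has to guess how the file was saved.

```python
import xml.etree.ElementTree as ET

def read_text(raw_bytes):
    """Parse raw XML bytes; the parser takes the encoding from the XML declaration."""
    return ET.fromstring(raw_bytes).text

# The same content saved in two different encodings (hypothetical sample data).
latin1_doc = ('<?xml version="1.0" encoding="iso-8859-1"?>'
              '<msg>Annulé</msg>').encode("iso-8859-1")
utf8_doc = ('<?xml version="1.0" encoding="utf-8"?>'
            '<msg>Annulé</msg>').encode("utf-8")

# The stored bytes differ, but both parse to the same Unicode string.
same_text = read_text(latin1_doc) == read_text(utf8_doc)
```

If the declaration is wrong or missing for a non-UTF-8 file, the parser has nothing reliable to go on, which is the "if applied properly" caveat above.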
Escape
Mechanism - XML
also offers a safe way to escape extended characters that
are not supported by a given encoding. This allows the document
to contain any characters, even when they are not supported
in the encoding the document is using. The hexadecimal form
(&#xHHH;) and the decimal form (&#DDD;) both use
the Unicode code-point value of the character. For example,
the Greek capital letter Omega (Ω, U+03A9)
can be escaped as &#x3A9; or &#937; if a given document
uses an encoding where the Greek alphabet is not supported.
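The escape mechanism can be tried out with a short Python sketch. The element name t is an arbitrary placeholder. Python's "xmlcharrefreplace" error handler produces the decimal form when writing, and the parser resolves both forms when reading.

```python
import xml.etree.ElementTree as ET

omega = "\u03A9"  # GREEK CAPITAL LETTER OMEGA

# Writing: characters the target encoding cannot represent are replaced
# by decimal numeric character references.
escaped = f"<t>{omega}</t>".encode("ascii", errors="xmlcharrefreplace")

# Reading: the hexadecimal and the decimal forms resolve to the same character.
from_hex = ET.fromstring("<t>&#x3A9;</t>").text
from_dec = ET.fromstring("<t>&#937;</t>").text
```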
Language
Identification -
Since an XML document can store data in different languages,
it is important to be able to identify the language of the
content. The XML namespace offers an attribute for this:
xml:lang. The values of xml:lang are the same as the ones
for the LANG attributes in HTML. They are defined by the
RFC
3066 (an update to the RFC 1766), allowing 2
and 3-letter language codes, with an optional 2-letter country
code. For instance: xml:lang="ar" indicates Arabic
in general, while xml:lang="ar-IQ" identifies
Arabic for Iraq. Note that, unlike other attribute values,
the values of xml:lang are not case sensitive. Note also
that the current values offered by RFC 3066 do not cover
all needs of localization. For instance, there is currently
no code for Latin-American Spanish; and there is no way
to make a distinction between different scripts such as
Azeri in Cyrillic script and Azeri in Latin script.
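As a sketch of how a tool might use xml:lang, the Python code below collects text together with its effective language. It relies on two facts: the attribute applies to an element and everything inside it until overridden, and its values compare case-insensitively. The function names, the sample document and the simple prefix matching are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def texts_by_language(element, inherited=None):
    """Yield (language, text) pairs; children inherit the language of their parent."""
    lang = element.get(XML_LANG, inherited)
    if element.text and element.text.strip():
        yield (lang.lower() if lang else None, element.text.strip())
    for child in element:
        yield from texts_by_language(child, lang)

def matches(lang, wanted):
    """Case-insensitive test: is `lang` equal to `wanted` or a sub-tag of it?"""
    lang, wanted = lang.lower(), wanted.lower()
    return lang == wanted or lang.startswith(wanted + "-")

doc = ET.fromstring(
    '<list xml:lang="en"><d>Spare generator</d>'
    '<d xml:lang="ja">第二発電機</d>'
    '<d xml:lang="RU">Запасной генератор</d></list>'
)
pairs = list(texts_by_language(doc))
```

With this helper, matches("ar-IQ", "ar") is true while matches("ar", "ar-IQ") is not, which mirrors the general/specific relationship between "ar" and "ar-IQ" described above.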
Transformation
Mechanism - In addition
to the features of XML, some XML-related technologies provide
efficient capabilities to facilitate the implementation
of XML multilingual solutions. For example, XSLT
is a powerful declarative language to manipulate XML data.
It uses XPath,
the standard mechanism to point to any node in an XML document.
Both XSLT and XPath offer internationalization features
such as language identification, sorting, number formatting,
itemization using various digit sets, and so forth.
Rendering
- Lastly, two other technologies
related to XML, XSL
(also called XSL-FO for XSL Formatting Object) and CSS
(Cascading Style Sheets), give XML a powerful,
well-internationalized framework to render content. Combined
with the possibilities offered by XSLT, they let you present any given
XML document in a wide variety of outputs.
Migrating
to XML
Obviously, not everyone has data
stored in an XML format today. But there are ways to use
XML even in these circumstances, and to pave the way for
a gradual and often painless migration to XML, while still
harvesting the benefits of its use early on.
There are three
ways to integrate XML into your process:
Scenario
1: As a temporary source
The localizer uses a filter
to convert or extract/merge the original format to XML,
back and forth. In this first scenario, nothing changes
for the developer or the author of the documentation. The
localization provider can take advantage of XML to fit the
proprietary format into an existing localization process
(including for example, TM leveraging, terminology management,
etc.).
Scenario
2: As an alternate source
The developer/author works
in XML and utilizes a one-way converter to create the file
normally used by the application. This is an attractive
solution if you need to keep most of your process untouched.
By developing one conversion tool to go from your new XML
format to your old proprietary format, you keep the benefits
of an existing process, and the advantages of providing
XML data for the localizer. Converters from XML to another
format are usually not costly to develop; in many cases
a simple XSLT template will do.
Scenario
3: As the original source
The original format
is in XML. This is obviously the best way to integrate XML
within your architecture. One type of document that can
be easily migrated to this is HTML. The official hypertext
file format recommended by the W3C is now XHTML, not HTML.
The conversion from HTML to XHTML can be easily done, even
automatically in many cases.
The most important
aspect is to move the XML format into the developer/author
domain. By doing this, you allow the developers and the
authors to use metadata directly within the original source
file, ultimately giving them much more control over what happens
to the data during the localization.
Taking
Advantage of XML
Once you have the source data in
an XML format, you can start taking advantage of it in many
different ways:
- Insertion of
metadata related to localization. For example: comments
for the translators, indication of some length limitation
for content, etc.
- Addition of unique
identifiers for given sections of text, so you can reuse
existing translations based on ID matching rather
than text matching. The traditional translation memory
mechanism can be used later on for what is left to translate.
- Using technologies
such as XML
Linking, you can set up automatic referencing.
For instance: pulling error messages from a resource file
into a help document. This ensures a single translation
and an always-perfect correspondence between the message
in the UI and the one in the documentation.
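The ID-based reuse mentioned in the list above could look like the following Python sketch. The msg element, its id attribute and the dictionary of earlier translations are hypothetical, chosen only to illustrate the idea. A real process would probably also check that the source text behind an ID has not changed.

```python
import xml.etree.ElementTree as ET

def leverage_by_id(new_source, old_translations):
    """Reuse translations whose ID is already known; return the IDs left to translate."""
    remaining = []
    for msg in new_source.iter("msg"):
        msg_id = msg.get("id")
        if msg_id in old_translations:
            msg.text = old_translations[msg_id]   # exact reuse, no text matching
        else:
            remaining.append(msg_id)              # goes on to translation memory
    return remaining

source = ET.fromstring(
    '<messages><msg id="m1">Cancel</msg><msg id="m2">Retry</msg></messages>'
)
remaining = leverage_by_id(source, {"m1": "Annuler"})
```

Only the entries returned in remaining would then be sent through the traditional translation memory step.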
In some cases, you can
go one step further and use existing XML vocabularies instead
of inventing your own. This will increase interoperability
for your data. In the past few years, several specialized
XML applications have been developed for particular purposes:
Using these formats
when appropriate can bring many additional benefits.
Translating
XML Documents
While the support among translation
tools for XML is improving, it is still not perfect and
you may have to work around many problems if the XML schemas
you are using are somewhat complex. Here are some potential
issues:
- Very few tools
can deal with multilingual files. The xml:lang attribute
is not recognized as a special attribute and is not taken
into account or updated.
- Do not expect
tools to give you back translated files with CDATA sections
as they were in the original files. Some products don't
even handle CDATA correctly.
- Most tools,
with the exception of very few like GlobalSight System
4, work with an element/attribute paradigm rather
than a node paradigm. In other words, they can't
deal with element content that should be translated or
not depending on conditions such as a specific value for
a given attribute.
One way to work around
this last problem is to slightly modify the original file
so that it is better input for tools with limited XML support.
For example, the XML document below has only one element
content to be translated: "Cancel". However, very
few tools will be able to set the necessary conditions for
that because the case is too complex for them: translate
only the content of any <data> element if it has an
attribute type set to "text", and if it is inside
a <component> element that has an attribute type set
to "caption".
Listing 2
- Example of UI description in XML:
<?xml version="1.0"?>
<dialogue xml:lang="en-gb">
<rsrc id="123">
<component id="456" type="image">
<!-- Do not translate -->
<data type="text">images/cancel.gif</data>
<data type="coordinates">12,20,50,14</data>
</component>
<component id="789" type="caption">
<!-- Translate -->
<data type="text">Cancel</data>
<data type="coordinates">12,34,50,14</data>
</component>
</rsrc>
</dialogue>
One method to address this type of issue
that will work with any tool is to reduce the conditions
into a new temporary element. For instance, add an element
<tbt> (to be translated) where the content needs to
be localized. The tools can then use a simple condition:
translate the content of any <tbt> element. If
the document has to be validated during the localization
process, you may have to declare it as part of a different
namespace, otherwise you can use it directly and make sure
to remove it after localization.
Listing 3
- Example of UI description in XML with extra element for
translation:
<?xml version="1.0"?>
<dialogue xml:lang="en-gb">
<rsrc id="123">
<component id="456" type="image">
<!-- Do not translate -->
<data type="text">images/cancel.gif</data>
<data type="coordinates">12,20,50,14</data>
</component>
<component id="789" type="caption">
<!-- Translate -->
<data type="text"><tbt>Cancel</tbt></data>
<data type="coordinates">12,34,50,14</data>
</component>
</rsrc>
</dialogue>
A simple XSL transformation
template can be used to automate the changes as shown below.
Listing 4
- XSL Transformation for adding special translation elements:
<?xml version="1.0" ?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output encoding="utf-8" />
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template
match="//component[@type='caption']/
data[@type='text']">
<xsl:copy>
<xsl:apply-templates select="@*"/>
<tbt><xsl:apply-templates/></tbt>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
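The reverse step, removing the temporary <tbt> elements once localization is done, is not shown in the listings. One possible sketch in Python follows. The function name strip_tbt is an assumption, and the code assumes each <tbt> holds plain text only, as in Listing 3.

```python
import xml.etree.ElementTree as ET

def strip_tbt(root):
    """Unwrap every <tbt> element, keeping its (translated) text in the parent."""
    # Take a snapshot of the elements first, since the tree is modified in the loop.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "tbt":
                # Assumes <tbt> is the only child of its parent and contains
                # plain text, as in Listing 3.
                parent.text = ((parent.text or "") + (child.text or "")
                               + (child.tail or ""))
                parent.remove(child)
    return root

localized = ET.fromstring(
    '<component id="789" type="caption">'
    '<data type="text"><tbt>Annuler</tbt></data>'
    '<data type="coordinates">12,34,50,14</data>'
    '</component>'
)
strip_tbt(localized)
```

After the call, the first <data> element contains the translated text directly and the document matches the structure of Listing 2 again.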
In addition to indicating
what element content and/or attribute values need translation,
you also need to provide some other information to the localizer:
- Are there any
elements that have pre-formatted content? (like <pre>
in XHTML)
- Are there any
elements that have content other than normal text? (like
<script> in XHTML)
- Which elements
should be treated as inline codes? (like <b> in
XHTML)
Just like for other
file formats, there are various strategies for translating
XML documents.
The first option
is to use an XML-specialized translation tool that will
allow you to edit the text directly into the original format,
as shown below with Trados TagEditor:

As many XML documents
are simple repositories of data that are used in a context
different from the way they are stored, seeing the original
file may not always be a big advantage.
The second option
is to use tools that extract the translatable text into
a database and present it to the localizer in a tabular
form. For instance, Catalyst, from Alchemy Software, uses
this mechanism as shown below:

Such applications
usually offer a preview method that reconstructs the original
format and displays it, so the translator can see the data
in context. Obviously the same limitations as in the first
option apply here as well: sometimes XML files are simply
not the best place to see the text in the context it will
be used.
Finally, the last option
is to prepare the XML document with a utility that adds
a color-coded RTF layer on top of the content. This way
the file can be opened and translated in a classic word-processor
like Word. This is illustrated below with Wordfast used
as the translation tool:

This
method has the advantage of being more flexible regarding
the choice of translation tool you want to use (or if you
do not want to use any translation tool). It may also be
more accommodating if the original format is not very well
supported by translation tools: you can always add manual
formatting to delimit the translatable parts from the sections
to leave untouched.
Regardless of the
method utilized, at some point the localizer has to create
some sort of definition file where localization-related
information for the format to translate is stored: what
parts of the content need translation and what parts need
to be left untouched, the name of the translatable attributes,
and so forth.
For example, the
following screen shows how this is done with SDLX from SDL
International:

Each tool has its
own way of storing this type of information, but overall,
they all use more or less the same information. At some
point, the standardization of both the information needed
and a common format to specify it would be helpful to everyone.
One of the attractive
aspects of XML is that by applying different style-sheets
and/or transformation templates to the document you can
provide different views of the data. For example: one showing
the data as it will be formatted in the final output, one
displaying only items that have changed, and another showing
source and target side by side for easier comparison, and
so on.
Using XSLT allows
you to go even further. You can create templates that validate
the document and display the items with errors. If the XML
format you are using is one of the standard formats such
as XLIFF, you can develop a library of different templates
offering a wide range of functionalities that you can re-use
across different original formats. In the example below,
a simple template provides a comparison between the source
and the target text, flagging any target entries where an
ending exclamation point is missing. You can apply similar
methods to verify (according to each language) other punctuation
requirements, preservation of leading or trailing spaces,
missing or extra variables, etc. This specific example uses
a very useful extension in the Microsoft XSL engine: the
ability to call scripts in the template.
Listing
5 - XSLT template for verification:
<?xml version="1.0" encoding="iso-8859-1" ?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0"
xmlns:msxsl="urn:schemas-microsoft-com:xslt"
xmlns:rws="http://www.translate.com/xml-verification">
<msxsl:script language="JScript"
implements-prefix="rws">
<![CDATA[
var g_nCount = 0;
function ErrorCount () {
return(g_nCount);
}
function CheckLastChar(p_Src, p_Trg, p_Char) {
var sSrc = new String(p_Src.nextNode().text);
if ( sSrc.length<1 ) return("");
var sTrg = new String(p_Trg.nextNode().text);
if ( sTrg.length<1 ) return("Empty Target");
var cTmp = sSrc.charAt(sSrc.length-1);
if ( cTmp == p_Char ) {
if ( cTmp != sTrg.charAt(sTrg.length-1) ) {
g_nCount++;
return("Missing character '" + p_Char +
"' at the end of the target text.");
}
}
return("");
}
]]>
</msxsl:script>
<xsl:template match="text()"/>
<xsl:template match="comment()"/>
<xsl:template match="//alt-trans"/>
<xsl:template match="/xliff">
<html>
<head>
<title>Verification</title>
</head>
<body>
<h1>Verification</h1>
<table border="1" cellspacing="0" cellpadding="3">
<xsl:apply-templates/>
</table>
<p>Number of errors =
<b><xsl:value-of select="rws:ErrorCount()"/></b></p>
</body>
</html>
</xsl:template>
<xsl:template match="//source">
<xsl:variable name="R1"
select="rws:CheckLastChar(.,../target,'!')"/>
<xsl:if test="$R1!=''">
<tr>
<xsl:attribute name="style">background:silver</xsl:attribute>
<td>Error <xsl:value-of select="rws:ErrorCount()"/></td>
<td><xsl:value-of select="$R1"/></td>
</tr>
<tr>
<td>Source:</td>
<td>[<xsl:value-of select="."/>]</td>
</tr>
<tr>
<td>Target:</td>
<td>[<xsl:value-of select="../target"/>]</td>
</tr>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
This type of mechanism
allows you to extend and easily customize the capabilities
of any translation tool. Because most of them offer a function to
preview the translation, the user can observe
the results of the validation directly in his/her working environment,
as shown below for TagEditor:

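Listing 5 depends on a script extension specific to the Microsoft XSL engine. Where that extension is not available, the same end-of-segment check could be sketched outside XSLT, for example in Python. This is an illustrative alternative, not the method used in the article. It assumes an XLIFF file without a namespace, as matched by Listing 5, and unlike the listing it also reports an empty target as an error entry.

```python
import xml.etree.ElementTree as ET

def check_last_char(xliff_root, char="!"):
    """Report targets that lack an ending `char` which the source has."""
    errors = []
    for unit in xliff_root.iter("trans-unit"):
        # Only direct children are read, so <alt-trans> content is ignored.
        source = unit.findtext("source", default="")
        target = unit.findtext("target", default="")
        if not source:
            continue
        if not target:
            errors.append((unit.get("id"), "Empty Target"))
        elif source.endswith(char) and not target.endswith(char):
            errors.append((unit.get("id"),
                           f"Missing character '{char}' at the end of the target text."))
    return errors

sample = ET.fromstring(
    '<xliff version="1.0"><file><body>'
    '<trans-unit id="1"><source>Stop!</source><target>Arrêtez</target></trans-unit>'
    '<trans-unit id="2"><source>Done!</source><target>Terminé !</target></trans-unit>'
    '</body></file></xliff>'
)
errors = check_last_char(sample)
```

The same function can be called with other characters, or extended with per-language rules, in the way the article suggests for other punctuation checks.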
The
Missing Parts
There are still a few
pieces that need to be added to the overall XML framework.
A. Localization
Properties
The localization properties are
the information translation tools need in order to correctly
process an XML document. This information needs to be defined
at the document type level. For example: one set of localization
properties for XHTML, another for SVG, another for ebXML,
and so forth. It would be even better if these properties
could be integrated into the XML Schema of each of the document
types. An illustration of such possible definition file
is shown below.
Listing 6
- Localization properties for the document in Listing 2:
<?xml version="1.0" ?>
<locprop version="0.1">
<rules name="Example1" root="dialogue">
<element-defaults localize="no"/>
<attribute-defaults localize="no"/>
<rule item="//component[@type='caption']/
data[@type='text']"
localize="yes"/>
</rules>
</locprop>
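To illustrate how a tool might consume such a definition file, the Python sketch below applies the rules of Listing 6 to a shortened version of the document in Listing 2. The locprop format is itself only a proposal in this article, and the function name is an assumption. The path handling is deliberately simple: it removes all whitespace from the rule, so it would not cope with attribute values that contain spaces.

```python
import xml.etree.ElementTree as ET

def translatable_elements(document_root, locprop_root):
    """Return the elements selected by rules marked localize="yes"."""
    selected = []
    for rule in locprop_root.iter("rule"):
        if rule.get("localize") != "yes":
            continue
        # Remove the line break inside the path, then make the path relative,
        # because ElementTree only evaluates relative paths on an element.
        path = "".join(rule.get("item").split())
        if path.startswith("//"):
            path = "." + path
        selected.extend(document_root.findall(path))
    return selected

DOC = """<dialogue xml:lang="en-gb"><rsrc id="123">
<component id="456" type="image"><data type="text">images/cancel.gif</data></component>
<component id="789" type="caption"><data type="text">Cancel</data></component>
</rsrc></dialogue>"""

RULES = """<locprop version="0.1"><rules name="Example1" root="dialogue">
<rule item="//component[@type='caption']/
 data[@type='text']" localize="yes"/>
</rules></locprop>"""

found = translatable_elements(ET.fromstring(DOC), ET.fromstring(RULES))
texts = [element.text for element in found]
```

Everything not selected by a rule stays untouched, which corresponds to the localize="no" defaults in the listing.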
B. Localization
Directives
Localization directives are information
specified at the document instance level to indicate specific
behavior or information for a given part of the content.
For instance: a span of text not to translate in a paragraph
otherwise translatable, a specification of a maximum number
of characters to use for a given field, etc.
Localization directives
are to be used as metadata, using the standard XML namespace
mechanism. This allows you to embed, within the document
of your own document type, common directives that can be
understood by any authoring and translation application.
You can imagine that at some point an author could simply
highlight a section of the text and click a button to make
it non-translatable, or to identify it as a glossary term,
etc.
For example, the
XHTML document below includes a few of those
possible types of directives.
Listing
7 - Localization directives in an XHTML document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
xml:lang="en" lang="en"
xmlns:loc="urn:the-localization-directives-standard">
<head><title loc:id="100">Title</title></head>
<body>
<h1 id="101">Introduction to <loc:span term="yes">
Document Management</loc:span></h1>
<p id="102">Our company, <loc:span localize="no">
Infinite Wisdom Inc.</loc:span>, provides quality
courses on how to
manage your documentation.</p>
</body>
</html>
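A translation tool could read such directives with very little code. The Python sketch below collects the protected spans and the glossary terms from a shortened version of Listing 7. The namespace URI is the placeholder used in the listing, not an existing standard, and the function name is an assumption.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI taken from Listing 7; no such standard exists yet.
LOC = "urn:the-localization-directives-standard"

def directive_spans(root):
    """Return (protected texts, glossary terms) found in <loc:span> elements."""
    protected, terms = [], []
    for span in root.iter(f"{{{LOC}}}span"):
        text = " ".join((span.text or "").split())  # collapse line breaks
        if span.get("localize") == "no":
            protected.append(text)
        if span.get("term") == "yes":
            terms.append(text)
    return protected, terms

XHTML = """<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"
 xmlns:loc="urn:the-localization-directives-standard">
<body>
<h1 id="101">Introduction to <loc:span term="yes">
Document Management</loc:span></h1>
<p id="102">Our company, <loc:span localize="no">
Infinite Wisdom Inc.</loc:span>, provides quality courses.</p>
</body></html>"""

protected, terms = directive_spans(ET.fromstring(XHTML))
```

Because the directives live in their own namespace, the same function works unchanged on any host vocabulary that embeds them.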
Such vocabulary
could also be re-used by schema developers to include localization-specific
attributes in their own vocabularies, the same way xml:lang
is used today.