Article for translators: Coping with Babel: How to Localize XML

In this article Andrzej Zydron outlines pitfalls that are often encountered by authors, programmers and localizers when first using XML, as well as ways to avoid these problems. Following Zydron’s advice can save developers time, money and headaches, and can help them reach out effectively to the world.

Introduction

Andrzej Zydron, CTO, xml-Intl Ltd.
OSCAR Steering Committee

The adoption of XML as a standard for the storage, retrieval and delivery of information has meant that many enterprises have large corpora in this format. Very often, information components in these corpora require translation. Normally, such enterprises have enjoyed all of the benefits of XML on the information creation side, but very often have failed to maximize all the benefits that XML-based translation can provide.

The separation of form and content, which is inherent within the concept of XML, makes XML documents easier to localize than those created with traditional proprietary text processing or composition systems. Nevertheless, decisions made during the creation of the XML structure and during the authoring of documents can have a significant effect on the ease with which the source language text can be localized into other languages. For example, the difficulties introduced into XML documents through inappropriate use of syntactical tools can have a profound effect on translatability and cost. In the worst case, documents may have to be completely re-authored in order to make them translatable. This is worth noting, as a very high proportion of XML documents are candidates for translation into other languages.

Designing XML Documents for Translation

It is very important to consider the implications for localization when designing an XML document. Wrong decisions can cause considerable problems for the translation process and thus increase costs. All of the following examples assume that the text to be translated is to be extracted into an intermediate form such as XLIFF (XML Localization Interchange File Format). Anyone planning to deliver an XML document directly to translators will soon be disabused of this idea after the first attempt. The intermediate format protects the original file format and guarantees that you get back a target language document equivalent to that of the original source.

An additional concept that is important regarding the localization of XML documents is the “inline” element. Inline elements are those that can exist within normal text (PCDATA, Parsed Character Data). They do not cause a linguistic or structural break in the text being extracted, but are part of the PCDATA content.

The following is a list of guidelines based on (often bitter) experience. Most of the problems are caused by not following the fundamental principles of XML and XML best practice. It is, nevertheless, surprising how often you will come across instances of the following types of problems. Please note that this is not a proscriptive list, i.e., there may be special circumstances where the proposed rules may have to be broken:

Avoid the Use of Specially Defined Entity References

Although entity references can look like a 'slick' technique for substituting variable text such as a model name or feature in a publication, they can cause more problems than they resolve. The following example shows how XML designers might be tempted to use entities:

<para>Use a &tool; to release the catch.</para>

Example 1: Incorrect Use of Entity References

Entities can cause the following problems:

Grammatical difficulties. If the entity represents a noun or noun phrase, this can potentially cause serious problems for languages in which nouns are strongly inflected, such as in many Slavic and Germanic languages. What appears to be correct as an entity substitution in English can cause insurmountable problems in inflected languages. The solution is to resolve all entities in the serialized version of the XML document prior to translation.
Parsing difficulties. During the translation process, the text will typically be transformed into different XML-based translation formats, such as XLIFF, where an entity that is no longer declared will cause a parsing error.
Problems with leveraged translation memories. The use of specially defined entity references can also cause problems with leveraged memories. The leveraged memory may contain entities not declared in the current document.

It is generally better to use alternative techniques rather than entity references, for example:

<para>
Use a <tool id="a1098">claw hammer</tool>
to release the CPU retention catch.
</para>

Example 2: Proposed Solution
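
The parsing difficulty described above is easy to reproduce. The following is a minimal sketch that uses Python's standard xml.etree.ElementTree module (chosen here purely for illustration; any non-validating parser should behave in a similar way) to parse the fragment from Example 1 once it has been separated from the DTD that declared the entity. The helper name parses_cleanly is hypothetical.

```python
import xml.etree.ElementTree as ET

# Fragment from Example 1, cut loose from the DTD that declared "tool".
fragment = "<para>Use a &tool; to release the catch.</para>"

# The same sentence with the entity resolved before extraction.
resolved = "<para>Use a claw hammer to release the catch.</para>"


def parses_cleanly(xml_text):
    """Return True if xml_text can be parsed without a DTD at hand."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(parses_cleanly(fragment))  # False: the entity is undefined here
print(parses_cleanly(resolved))  # True
```

This is one reason for resolving entities in the serialized document before it enters the translation workflow.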

One area where entities CAN be used to great effect is that of boilerplate text. The technique here is to use parameter entities to store the text. The text must always be linguistically complete in that it cannot rely on positional dependencies with regard to other entities, etc. Boilerplate text is used solely within a DTD (Document Type Definition). There need to be parallel target language versions of the DTD for this technique to be used. This can add to the maintenance cost, although judicious use of INCLUDE directives and DTD design can mitigate this.

Avoid Translatable Attributes

Translatable attributes can also look like a smart way of embedding variable information in an element.

<para>
Use a <tool id="a1098" name="claw hammer"/>
to release the CPU retention catch.
</para>

Example 3: Incorrect Use of Translatable Attributes

Unfortunately, they present the translation process with the following difficulties:

Grammatical difficulties. The same problems can arise as with entity references. If you want to use the text for indexing, etc., then you cannot rely on the contents of translatable attributes to be consistent for inflected languages.
Flow of text difficulties. With translatable attributes, there are two possibilities regarding the flow of text:
- The text is part of the logical text flow.
- The text should be treated outside of the text flow.

If the text is to be part of the text flow, then the translatable attribute causes the insertion of extra inline elements in the translatable format (typically XLIFF) of the file. If it is to be translated separately, then the translatable attribute forms a new text unit, and the translator needs to know whether it is to be translated within the context of the original text unit or in isolation.

With extra inline elements, the burden is on the translator to preserve the encapsulating encoding, bearing in mind that there may be significant changes in the sequence of such attribute text in the target language. Translation may often require that the position of the various components of a text unit be significantly rearranged.

<para>
Use a <tool id="a1098">claw hammer</tool>
to release the CPU retention catch.
</para>

Example 4: Proposed Solution

The following guideline usually applies in this case: if text has more than one word, then it should not be used in attributes. As a syntactical instrument, attributes are much more limited than elements, e.g., you can only have one attribute of a given name. The use of attributes should be reserved for single "word" values that qualify, in a meaningful way, an aspect of their element.
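
To see why attribute text falls outside the text flow, consider what a simple extractor sees. The sketch below again assumes Python's standard xml.etree.ElementTree module; the function name flow is hypothetical, and a real extraction tool would be considerably more elaborate. It collects the character data of the paragraph in document order, first for the attribute form and then for the element form.

```python
import xml.etree.ElementTree as ET

attr_form = ET.fromstring(
    '<para>Use a <tool id="a1098" name="claw hammer"/> '
    "to release the CPU retention catch.</para>"
)
elem_form = ET.fromstring(
    '<para>Use a <tool id="a1098">claw hammer</tool> '
    "to release the CPU retention catch.</para>"
)


def flow(elem):
    """Join all character data under elem and normalize whitespace."""
    return " ".join("".join(elem.itertext()).split())


print(flow(attr_form))  # Use a to release the CPU retention catch.
print(flow(elem_form))  # Use a claw hammer to release the CPU retention catch.
```

With the attribute form, the extractor has to be told separately that "name" carries translatable text and where that text belongs in the sentence. With the element form, the text is already in place.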

Avoid the Use of CDATA Sections That May Contain Translatable Text

CDATA sections are typically used as a means of escaping multiple '<' and '&' characters. Unfortunately, they pose particular problems for tools that are extracting such text. The problem is not one of the escaped characters, but how to treat the CDATA text.

<TEMPLATE><![CDATA[<p>Please refer to the
<em>index page
</em> for further information</p>]]>
</TEMPLATE>

Example 5: CDATA Section Problems

The problem is similar to that posed by translatable attributes. Is the text to be treated as 'inline' to the surrounding text? Should escape sequence characters be replaced during translation with the appropriate characters that were originally escaped, or are they to be left in their escaped form? How is the software to know?

I have come across entire XML documents being embedded as CDATA within an encompassing XML document. This poses significant problems regarding the treatment of the CDATA text. It must first be extracted and then re-parsed before it can be extracted for translation.

Unless the text within CDATA sections is never to be translated, avoid CDATA sections and use the standard predefined entity references (such as &lt; and &amp;) to escape the text instead.

<TEMPLATE>
&lt;p&gt;Please refer to the &lt;em&gt;index page
&lt;/em&gt; for further information&lt;/p&gt;
</TEMPLATE>

Example 6: Proposed Solution
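
The extract-and-re-parse step mentioned above can be sketched as follows. This is only an illustration using Python's standard xml.etree.ElementTree module, not a description of any particular extraction tool. It shows that markup inside a CDATA section reaches the application as one flat string with no element structure, so a second parse is needed before the translatable text can be found.

```python
import xml.etree.ElementTree as ET

outer = ET.fromstring(
    "<TEMPLATE><![CDATA[<p>Please refer to the <em>index page</em>"
    " for further information</p>]]></TEMPLATE>"
)

# The embedded <p> and <em> are plain characters, not child elements.
child_count = len(outer)
raw_text = outer.text

# Second pass: parse the extracted string as XML in its own right.
# This only works if the embedded markup happens to be well-formed.
inner = ET.fromstring(raw_text)
inner_text = "".join(inner.itertext())

print(child_count)  # 0
print(inner.tag)    # p
print(inner_text)   # Please refer to the index page for further information
```

The first pass cannot tell whether the string it receives is markup, code or ordinary prose, which is exactly the decision an extraction tool cannot make on its own.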

As an alternative, link to an external resource rather than embedding XML as CDATA:

Example 7: Link to an External Resource

Avoid the Use of Infinite Naming Schemes

Do not use open-ended element naming schemes of the following type (elm001, elm002, elm003, and so on) in well-formed documents.

<?xml version="1.0" ?>
<resources xml:lang="en">
<err001>Cannot open file $1.</err001>
<hint001>Hint: does file $1 exist.</hint001>
<err002>Incorrect value.</err002>
<hint002>Hint: value must be between $1 and $2.</hint002>
<err003>Connection timeout.</err003>
.
.
</resources>

Example 8: Example of Infinite Naming Scheme Usage

This presents problems for extraction programs and is not regarded as good XML practice. A much better way of doing this is to use the ID and IDREF attribute mechanisms to link elements together.

<?xml version="1.0" ?>
<resources xml:lang="en">
<error id="001">
<caption>Cannot open file $1.</caption>
<hint>Does file $1 exist.</hint>
</error>
<error id="002">
<caption>Incorrect value.</caption>
<hint>Value must be between $1 and $2.</hint>
</error>
.
.
</resources>

Example 9: Proposed Solution
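
The difference for an extraction program can be sketched in a few lines. The example below assumes Python's standard xml.etree.ElementTree module and abbreviated versions of Examples 8 and 9; the variable names and the regular expression are illustrative assumptions only.

```python
import re
import xml.etree.ElementTree as ET

numbered = ET.fromstring(
    '<resources xml:lang="en">'
    "<err001>Cannot open file $1.</err001>"
    "<hint001>Hint: does file $1 exist.</hint001>"
    "<err002>Incorrect value.</err002>"
    "<hint002>Hint: value must be between $1 and $2.</hint002>"
    "</resources>"
)
structured = ET.fromstring(
    '<resources xml:lang="en">'
    '<error id="001"><caption>Cannot open file $1.</caption>'
    "<hint>Does file $1 exist.</hint></error>"
    '<error id="002"><caption>Incorrect value.</caption>'
    "<hint>Value must be between $1 and $2.</hint></error>"
    "</resources>"
)

# Open-ended names: the extractor must guess the naming pattern
# and pair captions with hints by their numeric suffix.
pattern = re.compile(r"^(err|hint)(\d+)$")
from_numbered = {}
for elem in numbered:
    match = pattern.match(elem.tag)
    if match:
        kind, number = match.groups()
        from_numbered.setdefault(number, {})[kind] = elem.text

# Fixed names: three element names cover any number of messages.
from_structured = {
    error.get("id"): (error.findtext("caption"), error.findtext("hint"))
    for error in structured.iter("error")
}

print(from_structured["002"])
```

In the first case the extraction rules live in a regular expression that has to be kept in step with the authors' habits. In the second they can be expressed against a fixed set of element names.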

Avoid Processing Instructions (PIs) in Translatable Text

Processing Instructions are a very 'weak' syntactical instrument in XML. There is no built-in mechanism in XML to assist syntactically in the preservation of Processing Instructions. Above all, avoid translatable text in PIs.

<para>
Use a <?tool name="claw hammer"?> to release
the CPU retention catch.
</para>

Example 10: Incorrect Use of Translatable Text in PIs.

<para>
Use a <tool id="a1098">claw hammer</tool>
to release the CPU retention catch.
</para>

Example 11: Proposed Solution

It is generally not a good idea to have any PIs present within translatable text. There is no guarantee that they will survive the translation process unless special processing is carried out to preserve them, and it is hard to decide whether a given PI is significant or not. PIs also cause problems with translation memory (TM) systems: the insertion of a PI can cause otherwise linguistically identical text to fail TM matching. Because PIs are syntactically weak and lack the handling capabilities of elements, it is not easy for off-the-shelf extraction software to parameterize their handling. It is better to strip out all PIs prior to translation.
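
How easily PIs get lost can be seen with an ordinary parser. The sketch below assumes Python's standard xml.etree.ElementTree module, whose default tree builder does not retain processing instructions; other tools behave differently, so treat this as one illustration rather than a general rule.

```python
import xml.etree.ElementTree as ET

source = (
    '<para>Use a <?tool name="claw hammer"?> to release '
    "the CPU retention catch.</para>"
)

# Parse and serialize again without any special PI handling.
root = ET.fromstring(source)
round_tripped = ET.tostring(root, encoding="unicode")

pi_survived = "<?tool" in round_tripped
print(pi_survived)  # False: the PI and its "translatable" text are gone
```

No error or warning is raised along the way, so the loss is only noticed when the tool name is missing from the translated document.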

Avoid the Use of Text in Bitmap Graphics

With the existence of the SVG (Scalable Vector Graphics) format, there should be no excuse to use bitmapped graphics. They pose particular problems in that the original bitmap will need to be recreated for the target language with the translated text. This is usually a very costly and error-prone process and requires appropriate target language knowledge by the person who edits the graphics.

Never Make Any Assumptions About Text Length Sizes in Your Design

Always allow for the fact that the target language text may be significantly longer than the source. For example, "Welcome" becomes "шчыра запрашаем" in Belarusian and "maligayang pagdating" in Tagalog. Design your output with flexibility in mind.

Always Use UTF-8 (Or Alternatively UTF-16) Encoding Throughout Your Process

With English source, we are often tempted to use 7-bit ASCII or ISO 8859-1 encoding. As soon as you are required to translate into a language that is not covered by ISO 8859-1, you will discover that maintaining documents in different encoding schemes is a real problem.

Always use UTF-8 from the start. It gives you immediate access to commonly used punctuation characters such as the em dash and en dash. It also significantly simplifies your document processing.

All XML parsing tools are required to handle both UTF-8 and UTF-16. UTF-8 is more economical in terms of space usage for most European languages whose scripts are based on the Latin alphabet.
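
The limits of ISO 8859-1 are easy to check. The following sketch uses only Python's built-in string encoding; the sample strings are taken from this article, and the helper name fits is hypothetical.

```python
samples = {
    "Belarusian": "шчыра запрашаем",
    "Polish": "Mój poduszkowiec jest pełen węgorzy.",
    "punctuation": "em dash \u2014 and en dash \u2013",
}


def fits(text, encoding):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for label, text in samples.items():
    # Each sample fails for ISO 8859-1 and succeeds for UTF-8.
    print(label, fits(text, "iso-8859-1"), fits(text, "utf-8"))
```

Note that even the Polish sentence, which is written in the Latin alphabet, cannot be stored in ISO 8859-1.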

Never Break a Linguistically Complete Text Unit Over More Than One Non-inline Element

Never start a sentence in one non-inline element and continue it in another. You cannot rely on the translated text being in the same word sequence in the target language. It also makes the job of translation much more difficult as the translator cannot see the whole sentence.

<para>
  <line>This text should not be</line>
  <line>broken this way – the translated
  text may well be in a different order.</line>
</para>

Example 12: Example of a Sentence Broken Over More Than One Element.

Avoid the Use of Typographical Elements

Use logical elements that encompass the text, instead of typographical elements.

<para><b>Do not use</b>
'<br/>' type elements.
</para>

Example 13: Example of Typographical Element Usage.

Use "emph" instead of "bold." Encompass any text that must be included on the same line with line elements.

<para>
<emph>Do not use</emph> 'br' type elements.
</para>

Example 14: Suggested Correct Usage.

Avoid at all costs introducing any line breaks into the text stream. If you do so, it is unconditionally guaranteed that this will cause problems in some, if not all, of the target languages.

Do Not Mix Translatable and Non-translatable Text in the Same Elements

Keep non-translatable PCDATA in different elements than translatable PCDATA.

<data-items>
  <data id="class">
  com.xmlintl.data.dataDefinition
  </data>
  <data id="text">
Replace generic data
definitions with specific instances.
  </data>
</data-items>

Example 15: Example of Mixed PCDATA.

Most XML translation tools will have problems with this type of construct. It is only when inspecting the 'id' attribute that a decision can be made as to whether the PCDATA should be extracted or not.

<data-items>
<class id="com.xmlintl.data.dataDefinition">
<text>
Replace generic data
definitions with specific instances.
</text>
</class>
</data-items>

Example 16: Suggested Solution.

Avoid Holding Source and Target PCDATA in the Same Document

This can cause all manner of problems for processing and extraction tools.

<para>
<text xml:lang="en">
My hovercraft is full of eels.
</text>
<text xml:lang="fr">
Mon aéroglisseur est plein d'anguilles.
</text>
<text xml:lang="hu">
Légpárnás hajóm tele van angolnákkal.
</text>
<text xml:lang="ja">
私のホバークラフトは鰻で一杯です。
</text>
<text xml:lang="pl">
Mój poduszkowiec jest pełen węgorzy.
</text>
<text xml:lang="es">
Mi aerodeslizador está lleno de anguilas.
</text>
<text xml:lang="zh-HK">
我隻氣墊船裝滿晒鱔．
</text>
<text xml:lang="zh-TW">
我的氣墊船充滿了鱔魚 [我的气垫船充满了鳝鱼]
</text>
</para>

Example 17: Example of Mixed Source and Target PCDATA

Unless your document requires mixed language content, use a separate document instance to store each target language version. If you store both source and target data in the same document, it will become unwieldy, overly large and cumbersome to process.

Clearly Define Text That Requires Translation

Keep any PCDATA that requires translation in different elements from PCDATA that does not require translation. Use special elements for text within PCDATA that is specifically not to be translated.

<para>
  The following part of this sentence should
  <notrans>not be translated</notrans>
  at all.
</para>

Example 18: Suggested Solution.
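
A dedicated element makes the extraction rule trivial to state. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the function name extract_unit and the "{1}" placeholder syntax are illustrative assumptions and do not correspond to XLIFF or any other real interchange format. It replaces each protected span with a numbered placeholder and sets the protected text aside.

```python
import xml.etree.ElementTree as ET


def extract_unit(elem, protected=("notrans",)):
    """Return (text with placeholders, list of protected spans).

    Simplification: markup of other inline elements is flattened to text.
    """
    parts, spans = [], []
    if elem.text:
        parts.append(elem.text)
    for child in elem:
        content = "".join(child.itertext())
        if child.tag in protected:
            spans.append(content)
            parts.append("{%d}" % len(spans))
        else:
            parts.append(content)
        if child.tail:
            parts.append(child.tail)
    # Normalize the whitespace introduced by pretty-printing.
    return " ".join("".join(parts).split()), spans


para = ET.fromstring(
    "<para>\n"
    "  The following part of this sentence should\n"
    "  <notrans>not be translated</notrans>\n"
    "  at all.\n"
    "</para>"
)
unit, spans = extract_unit(para)
print(unit)   # The following part of this sentence should {1} at all.
print(spans)  # ['not be translated']
```

The translator is free to move the placeholder within the sentence, and the protected text is restored unchanged when the target document is built.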

Suggested Further Reading

Yves Savourel of ENLASO Corporation, who has done so much good work in the field of localizing XML, maintains an excellent web page on the subject, the XML Internationalization and Localization FAQ. Another very good reference work is the paper by Richard Ishida of W3C, Localisation Considerations in DTD Design.

Finally – Please Invest Time and Effort in the Quality of the Source Text

If the source text is properly written in a clear and understandable manner, then it will be easy to read and easier to localize. It is worth investing in tools that will check the grammar and terminology in your source text. Without tools, your authors do not have a benchmark against which to test themselves, and it is thus all too easy for poorly written text to make its way into your documents.

Andrzej Zydron is a member of the LISA OSCAR Steering Committee. He is the technical architect and editor of the GILT Metrics proposed specification suite, as well as editor of the proposed TBX Link specification. Zydron also sits on the OASIS technical committees for Translation Web Services, XLIFF and XLIFF segmentation. As CTO for xml-Intl Ltd., he is currently developing the next generation of XML-based text memory systems to reduce authoring and translation costs for documentation. Zydron is fluent in English, Polish and French.

Reprinted by permission from the Globalization Insider,
10 December 2004, Volume XIII, Issue 4.2.
Copyright the Localization Industry Standards Association
(Globalization Insider: www.localization.org, LISA: www.lisa.org)
and S.M.P. Marketing Sarl (SMP) 2004
