In this article Andrzej Zydron
outlines pitfalls that are often encountered by
authors, programmers and localizers when first using
XML, as well as ways to avoid these problems. Following
Zydron’s advice can save developers time, money
and headaches, and can help them reach out effectively
to the world.
Introduction
The adoption of XML as a standard for the storage, retrieval
and delivery of information has meant that many
enterprises have large corpora in this format. Very
often, information components in these corpora require
translation. Normally, such enterprises have enjoyed
all of the benefits of XML on the information creation
side, but very often have failed to maximize all
the benefits that XML-based translation can provide.
The separation of form and content,
which is inherent within the concept of XML, makes
XML documents easier to localize than those created
with traditional proprietary text processing or
composition systems. Nevertheless, decisions made
during the creation of the XML structure and during
the authoring of documents can have a significant
effect on the ease with which the source language
text can be localized into other languages. For
example, the difficulties introduced into XML documents
through inappropriate use of syntactical tools can
have a profound effect on translatability and cost.
Documents may even have to be completely re-authored
in order to make them translatable. This is worth
noting, as a very high proportion of XML documents
are candidates for translation into other languages.
Designing XML Documents for Translation
It is very important to consider
the implications for localization when designing
an XML document. Wrong decisions can cause considerable
problems for the translation process and thus increase
costs. All of the following examples assume that
the text to be translated is to be extracted into
an intermediate form such as XLIFF
(XML Localization Interchange File Format). Anyone
planning to deliver an XML document directly to
translators will soon be disabused of this idea
after the first attempt. The intermediate format
protects the original file format and guarantees
that you get back a target language document equivalent
to that of the original source.
An additional concept that is important
regarding the localization of XML documents is the
“inline” element. Inline elements are
those that can exist within normal text (PCDATA,
Parsed Character Data). They do not cause a
linguistic or structural break in the text being
extracted, but are part of the PCDATA content.
The following is a list of guidelines
based on (often bitter) experience. Most of the
problems are caused by not following the fundamental
principles of XML and XML best practice. It is,
nevertheless, surprising how often you will come
across instances of the following types of problems.
Please note that this is not a prescriptive list,
i.e., there may be special circumstances where the
proposed rules may have to be broken:
Avoid the Use of Specially Defined
Entity References
Although entity references can look like a 'slick'
technique for substituting variable text such as
a model name or feature in a publication, they can
cause more problems than they resolve. The following
example shows how XML designers might be tempted
to use entities:
<para>Use a &tool; to release the catch.</para>
Example 1: Incorrect
Use of Entity References
Entities can cause the following
problems:
- Grammatical difficulties.
If the entity represents a noun or noun phrase,
this can potentially cause serious problems for
languages in which nouns are strongly inflected,
such as in many Slavic and Germanic languages.
What appears to be correct as an entity substitution
in English can cause insurmountable problems in
inflected languages. The solution is to resolve
all entities in the serialized version of the
XML document prior to translation.
- Parsing difficulties.
During the translation process, the text will
typically be transformed into different XML-based
translation formats, such as XLIFF, where the
entity will cause a parsing error.
- Problems with leveraged translation
memories. The use of specially defined entity
references can also cause problems with leveraged
memories. The leveraged memory may contain entities
not declared in the current document.
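The first point above recommends resolving entities before translation. The following is a minimal sketch of that step, assuming the entity is declared in the document's internal DTD subset and using Python's standard xml.etree.ElementTree module. The sample document and the function name resolve_entities are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document: the entity "tool" is declared in the
# internal DTD subset, so the parser can expand it while reading.
SOURCE = """<!DOCTYPE para [
  <!ENTITY tool "claw hammer">
]>
<para>Use a &tool; to release the catch.</para>"""


def resolve_entities(xml_text: str) -> str:
    """Parse the document and serialize it again with entities expanded."""
    root = ET.fromstring(xml_text)
    return ET.tostring(root, encoding="unicode")


resolved = resolve_entities(SOURCE)
print(resolved)
```

Entities declared only in an external DTD are not expanded by this parser, so a real pipeline would probably need a validating parser or a dedicated pre-processing step for that case.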
It is generally better to use alternative techniques
rather than entity references, e.g.,
<para>
Use a <tool id="a1098">claw hammer</tool>
to release the CPU retention catch.
</para>
Example 2: Proposed
Solution
One area where entities CAN be used
to great effect is that of boilerplate text. The
technique here is to use parameter entities to store
the text. The text must always be linguistically
complete in that it cannot rely on positional dependencies
with regard to other entities, etc. Boilerplate
text is used solely within a DTD (Document Type
Definition). There need to be parallel target language
versions of the DTD for this technique to be used.
This can add to the maintenance cost, although judicious
use of INCLUDE directives and DTD design can mitigate
this.
Avoid Translatable Attributes
Translatable attributes can also look like a smart
way of embedding variable information in an element.
<para>
Use a <tool id="a1098" name="claw hammer"/>
to release the CPU retention catch.
</para>
Example 3: Incorrect
Use of Translatable Attributes
Unfortunately, they present the
translation process with the following difficulties:
- Grammatical difficulties.
The same problems can arise as with entity references.
If you want to use the text for indexing, etc.,
then you cannot rely on the contents of translatable
attributes to be consistent for inflected languages.
- Flow of text difficulties.
With translatable attributes, there are two possibilities
regarding the flow of text:
- The text is part of the logical
text flow.
- The text should be treated
outside of the text flow.
If the text is to be part of the
text flow, then the translatable attribute causes
the insertion of extra inline elements in the translatable
format of the file (typically XLIFF). If
it is to be translated separately, then the translatable
attribute forms a new text unit. The translator
then needs to know if it is to be translated within
the context of the original text unit or in isolation.
With extra inline elements, the
burden is on the translator to preserve the encapsulating
encoding, bearing in mind that there may be significant
changes in the sequence of such attribute text in
the target language. Translation may often require
that the position of the various components of a
text unit be significantly rearranged.
<para>
Use a <tool id="a1098">claw hammer</tool>
to release the CPU retention catch.
</para>
Example 4: Proposed
Solution
The following guideline usually
applies in this case: if text has more than one
word, then it should not be used in attributes.
As a syntactical instrument, attributes are much
more limited than elements, e.g., you can only have
one attribute of a given name. The use of attributes
should be reserved for single "word" values that
qualify, in a meaningful way, an aspect of their
element.
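The "more than one word" guideline can be checked mechanically. The sketch below is a simple lint pass, not part of any existing tool: it reports every attribute whose value contains whitespace. The function name multiword_attributes is hypothetical, and a real check would probably exempt attributes such as URLs or class lists.

```python
import xml.etree.ElementTree as ET


def multiword_attributes(xml_text: str):
    """Return (element, attribute, value) for attribute values with more than one word."""
    findings = []
    for elem in ET.fromstring(xml_text).iter():
        for name, value in elem.attrib.items():
            if len(value.split()) > 1:
                findings.append((elem.tag, name, value))
    return findings


SAMPLE = """<para>
Use a <tool id="a1098" name="claw hammer"/>
to release the CPU retention catch.
</para>"""

print(multiword_attributes(SAMPLE))
```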
Avoid the Use of CDATA Sections
That May Contain Translatable Text
CDATA sections are typically used as a means of
escaping multiple '<' and '&' characters.
Unfortunately, they pose particular problems for
tools that are extracting such text. The problem
is not one of the escaped characters, but how to
treat the CDATA text.
<TEMPLATE><![CDATA[<p>Please refer to the
<em>index page
</em> for further information</p>]]>
</TEMPLATE>
Example 5: CDATA Section
Problems
The problem is similar to that posed
by translatable attributes. Is the text to be treated
as 'inline' to the surrounding text? Should escape
sequence characters be replaced during translation
with the appropriate characters that were originally
escaped, or are they to be left in their escaped
form? How is the software to know?
I have come across entire XML documents
being embedded as CDATA within an encompassing XML
document. This poses significant problems regarding
the treatment of the CDATA text. It must first be
extracted and then re-parsed before it can be extracted
for translation.
Unless the text within CDATA sections
is never to be translated, use the standard built-in
character references to escape the text. Avoid using
CDATA sections.
<TEMPLATE>
&lt;p&gt;Please refer to the &lt;em&gt;index page
&lt;/em&gt; for further information&lt;/p&gt;
</TEMPLATE>
Example 6: Proposed
Solution
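The advice to use the built-in character references instead of CDATA can be applied when the document is generated. The sketch below uses xml.sax.saxutils.escape from the Python standard library and then parses the result to confirm that the reader gets the same characters back. The MARKUP string is an illustrative sample.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

MARKUP = "<p>Please refer to the <em>index page</em> for further information</p>"

# escape() replaces &, < and > with the built-in references &amp;, &lt; and &gt;.
escaped = escape(MARKUP)
document = "<TEMPLATE>" + escaped + "</TEMPLATE>"

# A parser hands back the original characters, just as it would for CDATA.
root = ET.fromstring(document)
print(document)
print(root.text == MARKUP)
```

Because the escaped form is ordinary PCDATA, an extraction tool can treat it like any other element content.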
As an alternative, link to an external
resource rather than embedding XML as CDATA:
<TEMPLATE xlink="ftp://ftp.xml-intl.com/res/ex1.xml"/>
Example 7: Link to an
External Resource
Avoid the Use of Infinite Naming
Schemes
Do not use open-ended, numbered element names such as elm001,
elm002, elm003 in well-formed documents.
<?xml version="1.0" ?>
<resources xml:lang="en">
<err001>Cannot open file $1.</err001>
<hint001>Hint: does file $1 exist.</hint001>
<err002>Incorrect value.</err002>
<hint002>Hint: value must be between $1 and $2.</hint002>
<err003>Connection timeout.</err003>
.
.
</resources>
Example 8: Example of
Infinite Naming Scheme Usage
Because the set of element names is open-ended, extraction
programs cannot be configured with a fixed list of translatable
elements, and the scheme is not regarded as good XML practice.
A much better way of doing this is to use the ID
and IDREF attribute mechanisms to link elements
together.
<?xml version="1.0" ?>
<resources xml:lang="en">
<error id="001">
<caption>Cannot open file $1.</caption>
<hint>Does file $1 exist.</hint>
</error>
<error id="002">
<caption>Incorrect value.</caption>
<hint>Value must be between $1 and $2.</hint>
</error>
.
.
</resources>
Example 9: Proposed
Solution
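With fixed element names, an extraction routine needs only one rule regardless of how many messages the file contains. The following sketch shows one possible way to do this with xml.etree.ElementTree. The function name extract_messages is hypothetical.

```python
import xml.etree.ElementTree as ET

RESOURCES = """<?xml version="1.0" ?>
<resources xml:lang="en">
<error id="001">
<caption>Cannot open file $1.</caption>
<hint>Does file $1 exist.</hint>
</error>
<error id="002">
<caption>Incorrect value.</caption>
<hint>Value must be between $1 and $2.</hint>
</error>
</resources>"""


def extract_messages(xml_text: str):
    """Map each error id to its (caption, hint) pair."""
    root = ET.fromstring(xml_text)
    return {
        error.get("id"): (error.findtext("caption"), error.findtext("hint"))
        for error in root.iter("error")
    }


messages = extract_messages(RESOURCES)
for error_id, (caption, hint) in messages.items():
    print(error_id, caption, hint)
```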
Avoid Processing Instructions
(PIs) in Translatable Text
Processing Instructions are a very
'weak' syntactical instrument in XML. There is no
built-in mechanism in XML to assist syntactically
in the preservation of Processing Instructions.
Above all, avoid translatable text in PIs.
<para>
Use a <?tool name="claw hammer"?>
to release
the CPU retention catch.
</para>
Example
10: Incorrect Use of Translatable Text in PIs.
<para>
Use a <tool id="a1098">claw hammer</tool>
to release the CPU retention catch.
</para>
Example 11: Proposed
Solution
It is generally not a good idea to have any PIs
present within translatable text. There is no guarantee
that they will survive the translation process
unless special processing is carried out to preserve
them. The difficulty lies in deciding whether a given PI
is significant or not. Because PIs are syntactically weak
and lack the handling capabilities of elements,
it is not easy for off-the-shelf extraction software
to parameterize their handling. PIs can also cause
problems with translation memory (TM) systems: the insertion of
a PI can cause otherwise linguistically identical
text to fail TM matching. It is better to strip out all PIs prior
to translation.
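Stripping PIs before translation can be done with a short pre-processing pass. The sketch below uses xml.dom.minidom from the Python standard library, which keeps PIs in the tree so that they can be removed explicitly. The pagebreak PI in the sample and the function name strip_processing_instructions are hypothetical.

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def strip_processing_instructions(xml_text: str) -> str:
    """Return the document element serialized without any processing instructions."""
    doc = parseString(xml_text)

    def walk(node):
        # Copy the child list first, because we remove nodes while iterating.
        for child in list(node.childNodes):
            if child.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
                node.removeChild(child)
            else:
                walk(child)

    walk(doc)
    doc.normalize()  # merge text nodes that were separated by a removed PI
    return doc.documentElement.toxml()


SAMPLE = "<para>Release the catch.<?pagebreak?> Then lift the cover.</para>"
cleaned = strip_processing_instructions(SAMPLE)
print(cleaned)
```

If a PI is significant for later processing, it would have to be recorded and reinserted after translation. That step is not shown here.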
Avoid the Use of Text in Bitmap Graphics
With the existence of the SVG (Scalable Vector
Graphics) format, there should be no excuse to use
bitmapped graphics. They pose particular problems
in that the original bitmap will need to be recreated
for the target language with the translated text.
This is usually a very costly and error-prone process
and requires appropriate target language knowledge
by the person who edits the graphics.
Never Make Any Assumptions About Text Length Sizes
in Your Design
Always allow for the fact that the target language
text may be significantly longer than the source.
For example, "Welcome" becomes "шчыра запрашаем"
in Belarusian and "maligayang pagdating" in Tagalog.
Design your output with flexibility in mind.
Always Use UTF-8 (Or Alternatively UTF-16) Encoding
Throughout Your Process
With English source, we are often tempted to use
7-bit ASCII or ISO 8859-1 encoding. As soon as you
are required to translate into a language
that is not covered by ISO 8859-1, you will discover
that trying to maintain documents in different encoding
schemes is a real problem.
Always use UTF-8 from the start. It gives you immediate
access to commonly used punctuation characters such
as the em dash and en dash. It also significantly
simplifies your document processing.
All XML parsing tools are required to handle both
UTF-8 and UTF-16. UTF-8 is more economical in terms
of space usage for most European languages whose
scripts are based on the Latin alphabet.
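The encoding trade-offs described above are easy to observe. The following sketch compares UTF-8 and UTF-16 byte counts for a few of the sample strings used in this article and checks whether each string can be represented in ISO 8859-1 at all. The helper names are illustrative.

```python
SAMPLES = {
    "en": "My hovercraft is full of eels.",
    "pl": "Mój poduszkowiec jest pełen węgorzy.",
    "be": "шчыра запрашаем",
}


def encoded_sizes(text: str):
    """Return byte counts for UTF-8 and UTF-16 (without a byte order mark)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


def fits_latin1(text: str) -> bool:
    """True if every character of the text exists in ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


for lang, text in SAMPLES.items():
    utf8, utf16 = encoded_sizes(text)
    print(lang, "utf-8:", utf8, "utf-16:", utf16, "latin-1 ok:", fits_latin1(text))
```

Only the English sentence fits in ISO 8859-1. The Polish and Belarusian strings need a Unicode encoding.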
Never Break a Linguistically Complete Text Unit
Over More Than One Non-inline Element
Never start a sentence in one non-inline element
and continue it in another. You cannot rely on the
translated text being in the same word sequence
in the target language. It also makes the job of
translation much more difficult as the translator
cannot see the whole sentence.
<para>
<line>This text should not be</line>
<line>broken this way – the translated
text may well be in a different order.</line>
</para>
Example 12: Example
of a Sentence Broken Over More Than One Element.
Avoid the Use of Typographical Elements
Use logical elements that encompass the text, instead
of typographical elements.
<para><b>Do
not use</b>
'<br/>' type elements.
</para>
Example 13: Example
of Typographical Element Usage.
Use "emph" instead of "bold." Encompass any text
that must be included on the same line with line
elements.
<para>
<emph>Do not use</emph> 'br' type
elements.
</para>
Example 14: Suggested
Correct Usage.
Avoid at all costs introducing any line breaks into
the text stream. If you do so, it is unconditionally
guaranteed that this will cause problems in some,
if not all, of the target languages.
Do Not Mix Translatable and Non-translatable Text
in the Same Elements
Keep non-translatable PCDATA in different elements
than translatable PCDATA.
<data-items>
<data id="class">
com.xmlintl.data.dataDefinition
</data>
<data id="text">
Replace generic data
definitions with specific instances.
</data>
</data-items>
Example 15: Example
of Mixed PCDATA.
Most XML translation tools will have problems with
this type of construct. It is only when inspecting
the 'id' attribute that a decision can be made as
to whether the PCDATA should be extracted or not.
<data-items>
<class id="com.xmlintl.data.dataDefinition">
<text>
Replace generic data
definitions with specific instances.
</text>
</class>
</data-items>
Example
16: Suggested Solution.
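Once translatable and non-translatable content live in different elements, extraction can be driven by element names alone. The sketch below assumes a project-specific set of translatable element names (TRANSLATABLE). Both this set and the function name translatable_units are hypothetical.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<data-items>
<class id="com.xmlintl.data.dataDefinition">
<text>
Replace generic data
definitions with specific instances.
</text>
</class>
</data-items>"""

# Assumption: a project-specific list of elements whose content is translatable.
TRANSLATABLE = {"text"}


def translatable_units(xml_text: str):
    """Collect whitespace-normalized text of every translatable element."""
    units = []
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag in TRANSLATABLE and elem.text and elem.text.strip():
            units.append(" ".join(elem.text.split()))
    return units


print(translatable_units(DOCUMENT))
```

No attribute values need to be inspected, which is the point of the restructured markup.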
Avoid Holding Source and Target PCDATA in the Same
Document
This can cause all manner of problems for processing
and extraction tools.
<para>
<text xml:lang="en">
My hovercraft is full of eels.
</text>
<text xml:lang="fr">
Mon aéroglisseur est plein d'anguilles.
</text>
<text xml:lang="hu">
Légpárnás hajóm tele van angolnákkal.
</text>
<text xml:lang="ja">
私のホバークラフトは鰻で一杯です。
</text>
<text xml:lang="pl">
Mój poduszkowiec jest pełen węgorzy.
</text>
<text xml:lang="es">
Mi aerodeslizador está lleno de anguilas.
</text>
<text xml:lang="zh-HK">
我隻氣墊船裝滿晒鱔.
</text>
<text xml:lang="zh-TW">
我的氣墊船充滿了鱔魚 [我的气垫船充满了鳝鱼]
</text>
</para>
Example 17: Example
of Mixed Source and Target PCDATA
Unless your document requires mixed language content,
use a separate document instance to store each target
language version. If you store both source and target
data in the same document, it will become unwieldy,
overly large and cumbersome to process.
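An existing mixed-language document can be split into one instance per language. The following is a rough sketch using xml.etree.ElementTree. It assumes the flat para/text layout of the example above, and the function name split_by_language is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

MIXED = """<para>
<text xml:lang="en">My hovercraft is full of eels.</text>
<text xml:lang="fr">Mon aéroglisseur est plein d'anguilles.</text>
<text xml:lang="pl">Mój poduszkowiec jest pełen węgorzy.</text>
</para>"""


def split_by_language(xml_text: str):
    """Return one serialized <para> per xml:lang value found on <text> children."""
    source = ET.fromstring(xml_text)
    documents = {}
    for text in source.findall("text"):
        lang = text.get(XML_LANG)
        para = ET.Element(source.tag, {XML_LANG: lang})
        child = ET.SubElement(para, "text")
        child.text = text.text
        documents[lang] = ET.tostring(para, encoding="unicode")
    return documents


for lang, doc in split_by_language(MIXED).items():
    print(lang, doc)
```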
Clearly Define Text That Requires
Translation
Keep any PCDATA that requires translation in different
elements from PCDATA that does not require translation.
Use special elements for text within PCDATA that
is specifically not to be translated.
<para>
The following part of this sentence
should
<notrans>not be translated</notrans>
at all.
</para>
Example 18: Suggested
Solution.
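An extraction tool can use such an element to protect text from translation. The sketch below replaces the content of each notrans element with a numbered placeholder and keeps the protected text in a separate list. The {n} placeholder syntax and the function name extract_segment are assumptions for illustration, not a feature of XLIFF or of any particular tool.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<para>
The following part of this sentence
should
<notrans>not be translated</notrans>
at all.
</para>"""


def extract_segment(xml_text: str):
    """Return (segment, protected): <notrans> content becomes {n} tokens."""
    para = ET.fromstring(xml_text)
    parts = [para.text or ""]
    protected = []
    for child in para:
        if child.tag == "notrans":
            protected.append(child.text or "")
            parts.append("{%d}" % len(protected))
        else:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    # Collapse line breaks and runs of spaces into single spaces.
    segment = " ".join("".join(parts).split())
    return segment, protected


segment, protected = extract_segment(SAMPLE)
print(segment)
print(protected)
```

The translator sees the whole sentence with the placeholder in position and can move the placeholder if the target language needs a different word order.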
Suggested Further Reading
Yves Savourel of ENLASO Corporation, who has done
so much good work in the field of localizing XML,
maintains an excellent web page on the subject, the XML
Internationalization and Localization FAQ.
Another very good reference work is the paper by
Richard Ishida of W3C, Localisation
Considerations in DTD Design.
Finally – Please Invest Time and Effort in the
Quality of the Source Text
If the source text is properly written in a clear
and understandable manner, then it will be easy
to read and easier to localize. It is worth investing
in tools that will check the grammar and terminology
in your source text. Without tools, your authors
do not have a benchmark against which to test themselves,
and it is thus all too easy for poorly written text
to make its way into your documents.
Andrzej Zydron
is a member of the LISA
OSCAR Steering Committee. He is the technical
architect and editor of the GILT Metrics proposed
specification suite, as well as editor of the proposed
TBX
Link specification. Zydron also sits
on the OASIS technical committees for Translation
Web Services, XLIFF and XLIFF segmentation. As CTO
for xml-Intl
Ltd., he is currently developing the
next generation of XML-based text memory systems
to reduce authoring and translation costs for documentation.
Zydron is fluent in English, Polish and French.
Reprinted
by permission from the Globalization Insider,
10 December 2004, Volume XIII, Issue 4.2.
Copyright
the Localization Industry Standards Association
(Globalization Insider: www.localization.org,
LISA: www.lisa.org)
and S.M.P. Marketing Sarl (SMP) 2004