Coping with Babel: How to Localize XML
In this article Andrzej Zydron outlines pitfalls that authors, programmers and localizers often encounter when first using XML, as well as ways to avoid them. Following Zydron’s advice can save developers time, money and headaches, and can help them reach out effectively to the world.

Introduction

The adoption of XML as a standard for the storage, retrieval and delivery of information has meant that many enterprises have large corpora in this format. Very often, information components in these corpora require translation. Such enterprises have usually enjoyed all of the benefits of XML on the information creation side, but have very often failed to realize the benefits that XML-based translation can provide.

The separation of form and content, which is inherent in the concept of XML, makes XML documents easier to localize than those created with traditional proprietary text processing or composition systems. Nevertheless, decisions made during the creation of the XML structure and during the authoring of documents can have a significant effect on the ease with which the source language text can be localized into other languages. For example, the difficulties introduced into XML documents through inappropriate use of syntactical tools can have a profound effect on translatability and cost. In some cases, documents may even have to be completely re-authored to make them translatable. This is worth noting, as a very high proportion of XML documents are candidates for translation into other languages.

Designing XML Documents for Translation

It is very important to consider the implications for localization when designing an XML document. Wrong decisions can cause considerable problems for the translation process and thus increase costs. All of the following examples assume that the text to be translated is to be extracted into an intermediate form such as XLIFF (XML Localization Interchange File Format).
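To make the idea of an intermediate form more concrete, here is a minimal sketch of an extraction step in Python. It is an illustration under stated assumptions, not the author's tooling: the `extract_to_xliff` helper, the `para` unit element and the `sample.xml` file name are all hypothetical, and inline markup is simply flattened to text, whereas a real extractor would preserve it as placeholders.

```python
import xml.etree.ElementTree as ET

# Namespace of XLIFF 1.2 documents.
XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def extract_to_xliff(source_xml: str, unit_tag: str = "para") -> str:
    """Copy the text of every <unit_tag> element into an XLIFF-style trans-unit."""
    root = ET.fromstring(source_xml)
    xliff = ET.Element("xliff", {"version": "1.2", "xmlns": XLIFF_NS})
    file_el = ET.SubElement(xliff, "file", {
        "original": "sample.xml",  # hypothetical source file name
        "source-language": "en",
        "datatype": "xml",
    })
    body = ET.SubElement(file_el, "body")
    for number, unit in enumerate(root.iter(unit_tag), start=1):
        # Flatten any inline markup and normalise whitespace (simplification).
        text = " ".join("".join(unit.itertext()).split())
        trans_unit = ET.SubElement(body, "trans-unit", {"id": str(number)})
        ET.SubElement(trans_unit, "source").text = text
    return ET.tostring(xliff, encoding="unicode")


sample = (
    "<doc>"
    "<para>Use a claw hammer to release the catch.</para>"
    "<para>Close the cover.</para>"
    "</doc>"
)
xliff_text = extract_to_xliff(sample)
print(xliff_text)
```

The translator then works only on the `source` text of each unit, and the original XML file is never edited by hand.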
Anyone planning to deliver an XML document directly to translators will be disabused of this idea after the first attempt. The intermediate format protects the original file format and guarantees that you get back a target language document equivalent to the original source.

An additional concept that is important for the localization of XML documents is the “inline” element. Inline elements are those that can exist within normal text (PCDATA, Parsed Character Data). They do not cause a linguistic or structural break in the text being extracted, but are part of the PCDATA content.

The following is a list of guidelines based on (often bitter) experience. Most of the problems are caused by not following the fundamental principles of XML and XML best practice. It is, nevertheless, surprising how often you will come across instances of the following types of problems. Please note that this is not a proscriptive list, i.e., there may be special circumstances where the proposed rules may have to be broken.

Avoid the Use of Specially Defined Entity References

Although entity references can look like a 'slick' technique for substituting variable text such as a model name or feature in a publication, they can cause more problems than they resolve. The following example shows how XML designers might be tempted to use entities:

<para>Use a &tool; to release the catch.</para>
Example 1: Incorrect Use of Entity References

Entities can cause the following problems:
<para> Use a <tool id="a1098">claw hammer</tool> to release the CPU retention catch. </para>
Example 2: Proposed Solution

One area where entities CAN be used to great effect is that of boilerplate text. The technique here is to use parameter entities to store the text. The text must always be linguistically complete, in that it cannot rely on positional dependencies with regard to other entities, etc. Boilerplate text is used solely within a DTD (Document Type Definition). There need to be parallel target language versions of the DTD for this technique to be used. This can add to the maintenance cost, although judicious use of INCLUDE directives and DTD design can mitigate this.

Avoid Translatable Attributes

Translatable attributes can also look like a smart way of embedding variable information in an element.

<para> Use a <tool id="a1098" name="claw hammer"/> to release the CPU retention catch. </para>
Example 3: Incorrect Use of Translatable Attributes

Unfortunately, they present the translation process with the following difficulties:
If the text is to be part of the text flow, then the translatable attribute causes the insertion of extra inline elements in the translatable format of the file (typically XLIFF). If it is to be translated separately, then the translatable attribute forms a new text unit. The translator then needs to know whether it is to be translated within the context of the original text unit or in isolation. With extra inline elements, the burden is on the translator to preserve the encapsulating encoding, bearing in mind that there may be significant changes in the sequence of such attribute text in the target language. Translation may often require that the position of the various components of a text unit be significantly rearranged.

<para> Use a <tool id="a1098">claw hammer</tool> to release the CPU retention catch. </para>
Example 4: Proposed Solution

The following guideline usually applies in this case: if text has more than one word, then it should not be used in attributes. As a syntactical instrument, attributes are much more limited than elements, e.g., you can only have one attribute of a given name. The use of attributes should be reserved for single "word" values that qualify, in a meaningful way, an aspect of their element.

Avoid the Use of CDATA Sections That May Contain Translatable Text

CDATA sections are typically used as a means of escaping multiple '<' and '&' characters. Unfortunately, they pose particular problems for tools that are extracting such text. The problem is not one of the escaped characters, but how to treat the CDATA text.

<TEMPLATE><![CDATA[<p>Please refer to the <em>index page</em> for further information</p>]]></TEMPLATE>
Example 5: CDATA Section Problems

The problem is similar to that posed by translatable attributes. Is the text to be treated as 'inline' to the surrounding text?
Should escape sequence characters be replaced during translation with the appropriate characters that were originally escaped, or are they to be left in their escaped form? How is the software to know? I have come across entire XML documents being embedded as CDATA within an encompassing XML document. This poses significant problems regarding the treatment of the CDATA text: it must first be extracted and then re-parsed before its content can be extracted for translation. Unless the text within CDATA sections is never to be translated, use the standard built-in character references to escape the text. Avoid using CDATA sections.

<TEMPLATE> <p>Please refer to the <em>index page</em> for further information</p> </TEMPLATE>
Example 6: Proposed Solution

As an alternative, link to an external resource rather than embedding XML as CDATA:

<TEMPLATE xlink="ftp://ftp.xml-intl.com/res/ex1.xml"/>
Example 7: Link to an External Resource

Avoid the Use of Infinite Naming Schemes

Do not use element names of the following type (elm001, elm002, elm003) in well-formed documents.

<?xml version="1.0" ?>
<resources xml:lang="en">
<err001>Cannot open file $1.</err001>
<hint001>Hint: does file $1 exist.</hint001>
<err002>Incorrect value.</err002>
<hint002>Hint: value must be between $1 and $2.</hint002>
<err003>Connection timeout.</err003>
. .
</resources>
Example 8: Example of Infinite Naming Scheme Usage

This presents problems for extraction programs and is not regarded as good XML practice. A much better way of doing this is to use the ID and IDREF attribute mechanisms to link elements together.

<?xml version="1.0" ?>
<resources xml:lang="en">
<error id="001"> <caption>Cannot open file $1.</caption> <hint>Does file $1 exist.</hint> </error>
<error id="002"> <caption>Incorrect value.</caption> <hint>Value must be between $1 and $2.</hint> </error>
. .
</resources>
Example 9: Proposed Solution

Avoid Processing Instructions (PIs) in Translatable Text

Processing Instructions are a very 'weak' syntactical instrument in XML. There is no built-in mechanism in XML to assist syntactically in the preservation of Processing Instructions. Above all, avoid translatable text in PIs.

<para>
Example 10: Incorrect Use of Translatable Text in PIs.

<para>
Example 11: Proposed Solution

It is generally not a good idea to have any PIs present within translatable text. There is no guarantee that they will survive the translation process, unless special processing is carried out to preserve them. The problem is deciding whether the PIs are significant or not. This can cause problems with translation memory (TM) systems: the insertion of a PI can cause otherwise linguistically identical text to fail TM matching. Because PIs are syntactically weak and lack the handling capabilities of elements, it is not easy for off-the-shelf extraction software to parameterize their handling. It is better to strip out all PIs prior to translation.

Avoid the Use of Text in Bitmap Graphics

With the existence of the SVG (Scalable Vector Graphics) format, there should be no excuse to use bitmapped graphics. They pose particular problems in that the original bitmap will need to be recreated for the target language with the translated text. This is usually a very costly and error-prone process and requires appropriate target language knowledge on the part of the person who edits the graphics.

Never Make Any Assumptions About Text Length Sizes in Your Design

Always allow for the fact that the target language text may be significantly longer than the source. For example, "Welcome" becomes "шчыра запрашаем" in Belarusian and "maligayang pagdatíng" in Tagalog. Design your output with flexibility in mind.
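As a rough illustration of how much translated text can grow, the following Python sketch compares character counts for the "Welcome" example. The `expansion_ratio` and `exceeds_budget` helpers and the 1.5 threshold are assumptions made for this example only; character count is just a proxy, since rendered width also depends on font and script.

```python
def expansion_ratio(source: str, target: str) -> float:
    """Return the target length divided by the source length, in characters."""
    return len(target) / len(source)


def exceeds_budget(source: str, target: str, max_ratio: float = 1.5) -> bool:
    """Flag a translation that grows beyond the (hypothetical) layout allowance."""
    return expansion_ratio(source, target) > max_ratio


source_text = "Welcome"
belarusian = "шчыра запрашаем"

ratio = expansion_ratio(source_text, belarusian)
print(f"Expansion ratio: {ratio:.2f}")
print("Needs layout review:", exceeds_budget(source_text, belarusian))
```

A check like this can be run over extracted text units to find strings that are likely to overflow a fixed-width layout before the translated document is composed.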
Always Use UTF-8 (Or Alternatively UTF-16) Encoding Throughout Your Process

With English source, we are often tempted to use 7-bit ASCII or ISO 8859-1 encoding. As soon as you are required to translate into a language that is not covered by ISO 8859-1, you will discover that trying to maintain documents in different encoding schemes is a real problem. Always use UTF-8 from the start. It gives you immediate access to commonly used punctuation characters such as the 'em dash' and 'en dash', etc. It also significantly simplifies your document processing. All XML parsing tools are required to handle both UTF-8 and UTF-16. UTF-8 is more economical in terms of space usage for most European languages whose scripts are based on the Latin alphabet.

Never Break a Linguistically Complete Text Unit Over More Than One Non-inline Element

Never start a sentence in one non-inline element and continue it in another. You cannot rely on the translated text being in the same word sequence in the target language. It also makes the job of translation much more difficult, as the translator cannot see the whole sentence.

<para>
Example 12: Example of a Sentence Broken Over More Than One Element.

Avoid the Use of Typographical Elements

Use logical elements that encompass the text, instead of typographical elements.

<para><b>Do not use</b>
Example 13: Example of Typographical Element Usage.

Use "emph" instead of "bold." Encompass any text that must be included on the same line with line elements.

<para>
Example 14: Suggested Correct Usage.

Avoid at all costs introducing any line breaks into the text stream. If you do so, it is unconditionally guaranteed that this will cause problems in some, if not all, of the target languages.

Do Not Mix Translatable and Non-translatable Text in the Same Elements

Keep non-translatable PCDATA in different elements from translatable PCDATA.

<data-items>
Example 15: Example of Mixed PCDATA.

Most XML translation tools will have problems with this type of construct. It is only when inspecting the 'id' attribute that a decision can be made as to whether the PCDATA should be extracted or not.

<data-items>
Example 16: Suggested Solution.

Avoid Holding Source and Target PCDATA in the Same Document

This can cause all manner of problems for processing and extraction tools.

<para>
Example 17: Example of Mixed Source and Target PCDATA
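When translatable and non-translatable text live in different elements, an extraction tool can decide what to extract from the element name alone. The following Python sketch illustrates this under stated assumptions: the `label` and `part-number` element names and the `translatable_text` helper are hypothetical and are not taken from the article's examples.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule: only <label> elements hold translatable text.
TRANSLATABLE_TAGS = {"label"}


def translatable_text(xml_text: str) -> list[str]:
    """Return the text of every element whose name marks it as translatable."""
    root = ET.fromstring(xml_text)
    return [
        element.text.strip()
        for element in root.iter()
        if element.tag in TRANSLATABLE_TAGS and element.text and element.text.strip()
    ]


sample = """<data-items>
  <data-item><label>Maximum speed</label><part-number>AB-1098</part-number></data-item>
  <data-item><label>Colour</label><part-number>CD-2210</part-number></data-item>
</data-items>"""

units = translatable_text(sample)
print(units)
```

No attribute inspection or document-specific logic is needed, which is what makes this kind of structure easy for off-the-shelf tools to handle.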
Clearly Define Text That Requires Translation

Keep any PCDATA that requires translation in different elements from PCDATA that does not require translation. Use special elements for text within PCDATA that is specifically not to be translated.

<para>
Example 18: Suggested Solution.

Suggested Further Reading

Yves Savourel of ENLASO Corporation, who has done so much good work in the field of localizing XML, has an excellent web page dedicated to the subject, the XML Internationalization and Localization FAQ. Another very good reference work is the paper by Richard Ishida of W3C, Localisation Considerations in DTD Design.

Finally – Please Invest Time and Effort in the Quality of the Source Text

If the source text is written in a clear and understandable manner, then it will be easier to read and easier to localize. It is worth investing in tools that will check the grammar and terminology in your source text. Without tools, your authors do not have a benchmark against which to test themselves, and it is thus all too easy for poorly written text to make its way into your documents.
Andrzej Zydron is a member of the LISA OSCAR Steering Committee. He is the technical architect and editor of the GILT Metrics proposed specification suite, as well as editor of the proposed TBX Link specification. Zydron also sits on the OASIS technical committees for Translation Web Services, XLIFF and XLIFF segmentation. As CTO for xml-Intl Ltd., he is currently developing the next generation of XML-based text memory systems to reduce authoring and translation costs for documentation. Zydron is fluent in English, Polish and French.
Reprinted by permission from the Globalization Insider, 10 December 2004, Volume XIII, Issue 4.2. Copyright the Localization Industry Standards Association (Globalization Insider: www.localization.org, LISA: www.lisa.org) and S.M.P. Marketing Sarl (SMP) 2004