Using XML technology to reduce the cost of authoring and translation
By Andrzej Zydron
xml-Intl Ltd.
At the recent LISA Forum Europe in London, Andrzej Zydron reported on the history of publishing formats and the development of a new XML-based format for translation and authoring memory called xml:tm. xml:tm leverages the architecture of XML to allow XML-based documents to contain multiple language versions, translation information, revision history and other important information that enables end-users to much more easily handle many aspects of a document’s creation and subsequent use. In this article Zydron describes xml:tm and its benefits for users of XML-based publishing formats.
A Little History…
The advent of text
in electronic format posed a number of problems for
translators. These problems were:
- How to manage the differing encoding
standards and their corresponding font support and
availability.
- How to present the text to translators
without having to purchase additional copies of
the original creation program.
- How to translate the text while
preserving the formatting.
- How to build translation memories
for these documents to reduce the cost of translation
and improve consistency.
The problem was exacerbated
by the veritable “Tower of Babel” of differing
authoring and composition environments from Interleaf
through to PageMaker. The typical approach was to
write filters that would “lift” the text
to be translated from its proprietary embedded environment
and to present it to translators in a uniform but
equally proprietary translation environment. After
translation the text would then be merged with the
original document, replacing the source language text.
ISO 8879:1986 SGML
A serious attempt to
tackle the plethora of competing formats and their
embedded nature was made in 1986 with the advent of
ISO 8879 Standard Generalized Markup Language (SGML).
This attempted to separate the content of documents
from their form. SGML arose at a time of great and
rapid change in the IT industry. The architects attempted
to make the standard as flexible and open to change
as possible. This laudable aim unfortunately produced
something that was very difficult and expensive to
implement. In addition SGML only tackled the aspect
of content. Form was tackled by ISO/IEC 10179:1996
Document Style Semantics and Specification Language
(DSSSL), but this proved equally difficult to implement.
HTML
The efforts of the
ISO 8879 committee were not in vain. SGML allowed
for the creation of HTML which allowed the early World
Wide Web to catapult the Internet from a vehicle used
by academics and computer scientists to what we know
today. HTML was initially based on strict adherence
to the SGML standard, but soon diverged as the limitations
of ISO 8879 became apparent.
XML
By 1996 the World Wide
Web Consortium (W3C) began to look for a solution
that was better than HTML. What was required was something
that would allow for the semantic exchange of information.
It needed to be able to propel the Internet from displaying
only static pages to a core semantic web, allowing
for the exchange of data. The efforts of W3C resulted
in XML 1.0. This addressed many of the architectural
limitations of SGML, allowing for easier manipulation
and parsing of the semantics. In addition to many
very good features, the architects of XML introduced
a powerful new feature called “namespace”.
XML Namespace allows for the mapping of more than
one representation of meaning onto a given document.
This feature is now used extensively in supporting
standards such as XSL, XSLT and XML Schema, and in tools such as FOP.
The Future
The success of XML
has been phenomenal, although much of its success
has yet to become visible to end users. It is now
driving the future of the World Wide Web through providing
the foundation for important web standards such as
XML Web Services, electronic data exchange, etc. It
has spawned much feverish activity in the developer
community and has created some excellent Open Source
tools and libraries such as those provided by the
Apache foundation (xml.apache.org)
and Source Forge (www.sourceforge.net). Even strongly
proprietary companies have had to accept the importance
of XML. Much excellent work is also being conducted
by standards organizations such as OASIS (www.oasis-open.org)
and W3C (www.w3c.org) on XML-based standards,
including XLIFF for the translation
of documents.
Our premise is that the
case for XML is so compelling that all leading vendors
of word processing and composition systems will have
to support it in the near future. In terms of translation
the arguments are even more convincing. It can be
up to five times more expensive to translate and correct
the layout of documents written in proprietary systems
than in XML. Sun Microsystems (www.sun.com),
along with the OpenOffice organization (www.openoffice.org),
already supplies an excellent XML-based alternative
to proprietary systems, which can also read proprietary
formats such as Word and convert them to XML. Microsoft
has also announced support for XML in the next version
of Office.
With this view of the
future in mind, we have concentrated our efforts on
how best to exploit the very rich syntax and capabilities
of XML.
xml:tm
xml:tm radically changes
the approach to the translation of XML-based documents.
It is an Open Standard created and maintained by xml-Intl,
for the benefit of those involved in the translation
of XML documents.
At the core of xml:tm
is the concept of “text memory,” made
up of two components:
- Author Memory
- Translation Memory (TM)
Author Memory
XML namespace is used
to map a text memory view onto a document. This process
is called segmentation. The text memory view works
at the sentence level of granularity – the text
unit. Each individual xml:tm text unit is allocated
a unique identifier. This unique identifier is immutable
for the life of the document. As a document goes through
its life cycle, the unique identifiers are maintained,
while new ones are allocated as required. This aspect
of text memory is called “author memory.”
It can be used to build author memory systems to simplify
and improve authoring consistency.
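The segmentation step described above can be illustrated with a short Python sketch. It is not the xml-Intl implementation: the sentence splitter is deliberately naive, the function names are hypothetical, and it covers only the first allocation of identifiers (keeping identifiers stable across revisions is not shown). The namespace URI and the "e1" / "u1.1" identifier style are taken from the examples in this article.

```python
import re
import xml.etree.ElementTree as ET

TM_NS = "urn:xmlintl-tm-tags"  # namespace URI used in the article's examples
ET.register_namespace("tm", TM_NS)

# Naive sentence boundary: whitespace that follows ., !, ? or :
# (an assumption for this sketch; real segmentation rules are richer).
_SENTENCE_END = re.compile(r"(?<=[.!?:])\s+")


def split_sentences(text):
    """Return the non-empty sentences found in text."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def segment(root, text_tags):
    """Wrap the leading text of each matching element in tm:te / tm:tu.

    text_tags holds element tags in ElementTree form, for example
    "{http://openoffice.org/2000/text}p". Mixed content (text that
    follows child elements) is ignored to keep the sketch short.
    """
    counter = 0
    # Copy the element list first so the tree can be modified safely.
    for elem in list(root.iter()):
        if elem.tag not in text_tags or not (elem.text and elem.text.strip()):
            continue
        counter += 1
        sentences = split_sentences(elem.text)
        elem.text = None
        te = ET.Element(f"{{{TM_NS}}}te", {"id": f"e{counter}"})
        for i, sentence in enumerate(sentences, start=1):
            tu = ET.SubElement(te, f"{{{TM_NS}}}tu", {"id": f"u{counter}.{i}"})
            tu.text = sentence
        elem.insert(0, te)
    return root
```

Calling segment(root, {"p"}) on a parsed document whose paragraphs are plain "p" elements wraps each paragraph's text in one text element containing one text unit per sentence.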
The following diagram
shows how the tm namespace maps onto an existing XML
document:
In the above diagram, "te" stands for "text element"
(an XML element that contains text) and "tu" stands
for "text unit" (a single sentence or stand alone
piece of text).
The following is an example
of part of an xml:tm document. The elements in the
tm namespace show how xml:tm maps onto
an existing XML document:
<?xml version="1.0" encoding="UTF-8" ?>
<office:document-content xmlns:text="http://openoffice.org/2000/text" xmlns:tm="urn:xmlintl-tm-tags" xmlns:xlink="http://www.w3.org/1999/xlink">
<tm:tm>
..........
<text:p text:style-name="Text body">
  <tm:te id="e1" tuval="2">
    <tm:tu id="u1.1">Xml:tm is a revolutionary technology for dealing with the problems of translation memory for XML documents by using XML techniques to embed memory directly into the XML documents themselves.</tm:tu>
    <tm:tu id="u1.2">It makes extensive use of XML namespace.</tm:tu>
  </tm:te>
</text:p>
<text:p text:style-name="Text body">
  <tm:te id="e2">
    <tm:tu id="u2.1">The “tm” stands for “text memory”.</tm:tu>
    <tm:tu id="u2.2">There are two aspects to text memory:</tm:tu>
  </tm:te>
</text:p>
......................
And the composed document:
Translation Memory
When an xml:tm namespace
document is ready for translation, the namespace itself
specifies the text that is to be translated. The tm
namespace can be used to create an XLIFF document
for translation.
(XML Localization Interchange
File Format [XLIFF] is an OASIS
standard that is optimized for translation. Please
refer to the section below entitled XLIFF and Online
Translation for more information.)
When the translation
of XLIFF files has been completed, the target language
text can be merged with the original document to create
a new target language version of that document. The
net result is a perfectly aligned source and target
language document.
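The round trip described above (extract text units to XLIFF, translate, merge back) can be sketched as follows. This is an illustration under stated assumptions rather than the xml-Intl tooling: the function names are hypothetical, the XLIFF namespace declaration is omitted for brevity, reusing the text unit identifier as the trans-unit id is an assumption, and inline markup inside a text unit is not handled.

```python
import xml.etree.ElementTree as ET

TM_NS = "urn:xmlintl-tm-tags"  # namespace URI from the article's examples


def to_xliff(tm_root, original, source_lang, target_lang):
    """Build a minimal XLIFF 1.1-style tree with one trans-unit per tm:tu."""
    xliff = ET.Element("xliff", {"version": "1.1"})
    file_el = ET.SubElement(xliff, "file", {
        "original": original,
        "source-language": source_lang,
        "target-language": target_lang,
        "datatype": "xml",
    })
    body = ET.SubElement(file_el, "body")
    for tu in tm_root.iter(f"{{{TM_NS}}}tu"):
        # Assumption: the text unit id doubles as the trans-unit id.
        unit = ET.SubElement(body, "trans-unit", {"id": tu.get("id")})
        ET.SubElement(unit, "source").text = "".join(tu.itertext())
    return xliff


def merge_targets(tm_root, xliff):
    """Replace each tm:tu text with the <target> of the matching trans-unit.

    Units without a <target> are left in the source language.
    """
    targets = {
        unit.get("id"): unit.findtext("target")
        for unit in xliff.iter("trans-unit")
    }
    for tu in tm_root.iter(f"{{{TM_NS}}}tu"):
        translated = targets.get(tu.get("id"))
        if translated is not None:
            tu.text = translated
    return tm_root
```

Because the merge is keyed on the text unit identifiers, the document structure around the text units is never touched by the translator.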
The following is an example
of a translated xml:tm document:
<?xml version="1.0" encoding="UTF-8" ?>
<office:document-content xmlns:text="http://openoffice.org/2000/text" xmlns:tm="urn:xmlintl-tm-tags" xmlns:xlink="http://www.w3.org/1999/xlink">
<tm:tm>
..........
<text:p text:style-name="Text body">
  <tm:te id="e1" tuval="2">
    <tm:tu id="u1.1">Xml:tm es una técnica revolucionaria que trata los problemas de memoria de traducción en documentos XML usando técnicas XML e incluyendo la memoria en el documento mismo.</tm:tu>
    <tm:tu id="u1.2">Esta técnica hace extenso uso de XML namespace.</tm:tu>
  </tm:te>
</text:p>
<text:p text:style-name="Text body">
  <tm:te id="e2">
    <tm:tu id="u2.1">“tm” significa “memoria de texto”.</tm:tu>
    <tm:tu id="u2.2">Hay dos aspectos de memoria de texto:</tm:tu>
  </tm:te>
</text:p>
..........
This is an example of the composed translated text:
The source and target text are linked at the sentence
level by the unique xml:tm identifiers. When the document
is revised, new identifiers are allocated to modified
or new text units. When extracting text for translation
of the updated source document the text units that
have not changed can be automatically replaced with
the target language text.
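A minimal Python sketch of this identifier-based replacement follows. The function name and the dictionary-based data model are assumptions made for illustration; the only behavior taken from the article is that a modified or new text unit receives a new identifier, so an identifier that survives revision marks unchanged text.

```python
def pretranslate(revised_units, previous_translations):
    """Split revised text units into perfect matches and units needing work.

    revised_units: dict of text unit id -> source text in the revised document.
    previous_translations: dict of text unit id -> target text from the
    previous translation. A changed or new unit carries a new id, so it is
    simply absent from previous_translations.
    """
    perfect, to_translate = {}, {}
    for unit_id, source in revised_units.items():
        if unit_id in previous_translations:
            # Unchanged unit: reuse the existing target text as-is.
            perfect[unit_id] = previous_translations[unit_id]
        else:
            # New or modified unit: send the source text for translation.
            to_translate[unit_id] = source
    return perfect, to_translate
```

Only the units returned in the second dictionary need to go to the translator; the first dictionary can be merged into the target document directly.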
Different Types of
Matching
The matching described
in the previous section is called “perfect”
matching. xml:tm offers unique TM matching possibilities
to reduce the quantity of text for translation and
to provide the human translator with alternative translations.
The following diagram
shows how Perfect Matching is achieved:
The following types of matching are available:
- Perfect matching
Author memory provides exact details of any changes
to a document. Where text units have not been changed
from a previously translated document, there is
a “perfect match”. The concept of perfect
matching is an important one. With traditional TM
systems a translator still has to proof each match,
as there is no way to ascertain the appropriateness
of the match. Proofing has to be paid for –
typically at 60% of the standard translation cost.
With perfect matching, there is no need to proof
read, thereby saving on the cost of translation.
- Leveraged matching
When an xml:tm document is translated the translation
process provides perfectly aligned source and target
language text units. These can be used to create
traditional translation memories, but in a consistent
and automatic fashion.
- In-document leveraged matching
xml:tm can be used to find in-document leveraged
matches. These are more appropriate to a given document
than matches leveraged through TM.
- In-document fuzzy matching
During the maintenance of author memory, text units
that have only changed slightly can be tracked.
If a corresponding translation exists for the previous
version of the source text unit, then the previous
source and target versions can be offered to the
translator as a type of close fuzzy match.
- Non-translatable text
In technical documents, there is often a large number
of text units that are made up solely of numeric,
alphanumeric, punctuation or measurement items.
With xml:tm, these can be identified during authoring
and flagged as non-translatable, thus reducing the
word count. For numeric and measurement text units
it is also possible to automatically convert the
decimal and thousands designators as required by
the target language.
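The match types listed above can be pictured as a simple classifier applied to each text unit. The Python sketch below is one possible arrangement, not the xml-Intl implementation: the function name, the order in which the checks run, the 0.8 similarity threshold, the list of measurement units, and the use of the standard library's difflib for similarity are all assumptions.

```python
import re
from difflib import SequenceMatcher

# Digits, separators and an optional short unit of measure, for example
# "10 mm" or "10.50 m". The pattern and unit list are assumptions.
_NON_TRANSLATABLE = re.compile(r"^[\d.,\s]+(mm|cm|m|km|g|kg)?$")


def classify(unit_id, source, previous_source, translated_ids,
             in_document, memory, threshold=0.8):
    """Return the match type for one text unit.

    previous_source: earlier wording of this unit if author memory tracked
        it as a modification, otherwise None.
    translated_ids: ids of unchanged units that already have a translation.
    in_document: dict of source -> target for units in this document.
    memory: dict of source -> target from a leveraged TM database.
    """
    if _NON_TRANSLATABLE.match(source.strip()):
        return "non-translatable"
    if unit_id in translated_ids:
        return "perfect"
    if source in in_document:
        return "in-document leveraged"
    if source in memory:
        return "leveraged"
    if previous_source is not None:
        # Character-level similarity between the old and new wording.
        ratio = SequenceMatcher(None, previous_source, source).ratio()
        if ratio >= threshold:
            return "in-document fuzzy"
    return "new"
```

In-document matches are checked before the database here because the article rates them as more appropriate to the document than matches leveraged through TM.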
The following is an example
of non-translatable text in xml:tm:
......................
<text:list-header>
  <text:p text:style-name="P9">
    <tm:te id="e41">
      <tm:tu id="u41.1">Some new text with examples of text that does not require translation:</tm:tu>
    </tm:te>
  </text:p>
</text:list-header>
</text:ordered-list>
<text:p text:style-name="Hanging indent">
  <tm:te id="e42">
    <tm:tu id="u42.1" type="measure">10 mm</tm:tu>
  </tm:te>
</text:p>
<text:p text:style-name="Hanging indent">
  <tm:te id="e43">
    <tm:tu id="u43.1" type="measure">10.50 m</tm:tu>
  </tm:te>
</text:p>
.....................
And an example of the composed text:
Word Counts
The output from the
text extraction process can be used to generate automatic
word and match counts by the customer. This puts the
customer in control of the word counts.
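As a rough illustration, word counts per match type could be produced from the extracted text units along these lines. The function name and the (match type, text) input shape are hypothetical, and counting words by splitting on whitespace is a simplifying assumption that does not suit languages written without spaces.

```python
from collections import Counter


def word_counts(units):
    """Sum whitespace-separated words per match type.

    units: iterable of (match_type, text) pairs, for example
    ("perfect", "It makes extensive use of XML namespace.").
    """
    counts = Counter()
    for match_type, text in units:
        counts[match_type] += len(text.split())
    return counts
```

The resulting totals give the customer a figure for each match type before any text leaves the building.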
XLIFF and Online Translation
XLIFF is an OASIS standard
for the interchange of translatable text in XML format.
xml:tm translatable files can be created in XLIFF
format. The XLIFF format can then be used to create
dynamic web pages for translation. A translator can
access these pages via a browser and complete the entire
translation process over the Internet.
This has many potential
benefits. The challenges posed by (1) filters, (2)
delays inherent in sending data out for translation,
(3) inadvertent corruption of character encoding or
document syntax, and (4) simple human work-flow problems
can be totally avoided. By applying XML technology,
it is now possible to both decrease and control the
cost of translation, as well as reduce the time it
takes while improving reliability.
An example of a web-based
translator environment can be seen at the following
web address:
http://www.xml-intl.com/demo/trans.html
Benefits of Using
xml:tm
The following is a
list of the main benefits of using the xml:tm approach
to authoring and translation:
- The ability to build consistent
authoring systems.
- Automatic production of authoring
statistics.
- Automatic alignment of source and
target text.
- Aligned texts can be used to populate
leveraged matching TM database tables.
- Perfect translation matching for
unchanged text units.
- In-document leveraged and modified
text unit matching.
- Automatic production of word count
statistics.
- Automatic generation of perfect,
leveraged, previously modified or fuzzy matches.
- Automatic generation of XLIFF files.
- Protection of the original document
structure.
- The ability to provide online
access for translators.
- Can be used transparently for relay
translation.
Summary
xml:tm is an Open Standard
created and maintained by xml-Intl based on XML and
XLIFF. Full details of the xml:tm definitions (XML
Document Type Definition and XML Schema) are available
from the xml-Intl web site (http://www.xml-intl.com).
Xml-Intl also supplies an implementation of xml:tm
using Java and Oracle, which includes linguistically
aware database leveraged and fuzzy matching.
xml:tm is best suited
to enterprise-level implementation for corporations
with large annual translation requirements and content
management systems. During the implementation process,
xml:tm is integrated with the customer’s content
management system.
xml:tm reduces translation
costs in the following ways:
- TM is held by the customer within
the documents.
- Perfect matching reduces translation
costs by eliminating the need for translators to
proof these matches.
- TM matching is much more focused
than is the case with traditional TM systems providing
better results.
- It allows for relay TM processing
via an intermediate language.
- All TM, extractions and merge processing
are automatic with no manual intervention required.
- Translation can take place directly
via the customer’s web site.
- All word counts are controlled
by the customer.
- The original XML documents are
protected from accidental damage.
- The system is totally integrated
into the XML framework, making maximum use of the
capabilities of XML to address authoring and translation.
Andrzej Zydron
was born in England. Educated in France, he started
working in IT in 1976. His experience has covered
all aspects of computing, with in-depth knowledge
of Software Engineering, SGML, XML, encoding methodologies
and translation memory. Highlights of his career include:
- The design and architecture of
the European Patent Office patent data capture system
for Xerox Business Services.
- The design and architecture of
the Xerox Language Services XTM translation memory
system in 1996.
- Author of XML and SGML filters
for SDL International's SDLX Translation Suite.
- Assisting the Oxford University
Press, the British Council and Oxford University
in work on the New Dictionary of National Biography.
Currently he is developing
the next generation of XML based “text memory”
systems, which offer substantial benefits by reducing
the costs for authoring and translation of XML based
documents.
Andrzej Zydron is a member
of the British Computer Society. He is also technical
and research director of xml-Intl Ltd. and sits on
the OASIS technical committee for Translation Web
Services.
Reprinted
by permission from the Globalization Insider,
26 August 2003, Volume XII, Issue 3.5.
Copyright
the Localization Industry Standards Association
(Globalization Insider: www.localization.org,
LISA: www.lisa.org)
and S.M.P. Marketing Sarl (SMP) 2004