Making Reuse Intelligent: Improving Enterprise Information Quality Management
By Andrew Bredenkamp,
CEO acrolinx GmbH,
Berlin, Germany
Reuse has become a buzzword in
technical communication and localization. For one thing,
businesses want to be sure that they write the same information
only once. They also want to avoid translating information
repeatedly, because it is expressed using different words
or a different word order. But in a distributed writing
environment where disparate groups are contributing to huge
content repositories, how can you make sure that content
is created only once? This article looks at the role technology
can play in promoting content reuse. In particular, innovations
in linguistic technology are making it possible for companies
to take a systematic approach to this challenge.
Topic-based reuse
We can think about reuse under two broad
headings, topic-based reuse and linguistic reuse. Companies
already recognize the tremendous benefits of DITA, the Darwin
Information Typing Architecture. It is an emerging trend
in XML and the leading technology infrastructure for topic-based
reuse.
Through a process often called “chunking,”
DITA helps create recyclable, transferable units from extensive
documents by breaking them into smaller topics. It provides
a structure that eliminates the need for user-defined DTDs,
while letting users create customized topic extensions for
their own needs. In essence, DITA provides a framework for
breaking often enormous documents into manageable packages.
But the key advantage of DITA involves reuse.
Imagine five different products with a power supply that
has to be connected in a standard way. DITA helps create
a single topic to describe the setup process instead of
five. It thus eliminates 80% of the setup content that companies
previously had to manage: edit, maintain, and translate.
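The 80% figure follows directly from the example: five copies of one topic collapse into a single shared topic. A minimal sketch of that arithmetic, assuming hypothetical product and topic names that do not come from the article:

```python
# Hypothetical manuals: each of five products needs the same setup topic.
manuals = {f"product-{n}": ["connect-power-supply"] for n in range(1, 6)}

# Without reuse, every manual carries its own copy of the topic.
copies_without_reuse = sum(len(topics) for topics in manuals.values())

# With topic-based reuse, each distinct topic is written only once.
unique_topics = len({t for topics in manuals.values() for t in topics})

reduction = 1 - unique_topics / copies_without_reuse
print(f"{copies_without_reuse} copies -> {unique_topics} topic "
      f"({reduction:.0%} less content to edit, maintain, and translate)")
```

The saving grows with the number of products that share a topic, which is why the benefit is largest for standard procedures.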
The current state of linguistic reuse
Although more and more organizations recognize
that topic-based reuse is good for them, sentence-level
(or sentence-fragment level) reuse remains a relatively
unexplored territory. Yet reusing linguistic segments ensures
consistency across documents and makes localization more
cost-effective by eliminating the need for retranslation.
Remember that translation memory systems work not at the
topic level, but at the sentence or segment level. Working
on this level is therefore the key to controlling translation
costs.
Most of the current solutions to this challenge
rely on “fuzzy matching” algorithms. These algorithms measure
the similarity between two character strings (sentences
or sentence fragments). On a superficial level, fuzzy matching
seems like a useful solution to the problem. But the reality
is quite different. Fuzzy matching works in translating,
but is far less suited to writing environments.
Consider this example:
WARNING: Switch power off only
when the fan has stopped.
Fuzzy matching offers the following potential
suggestions, which have different, or even opposite, meanings:
WARNING: Switch power on only when
the fan has stopped.
-and-
Switch power off before the fan has
stopped.
There are similar problems for sentences
with variables.
For the example:
Operating temperature must not
exceed 45 degrees Celsius.
fuzzy matching offers:
Operating temperature should not
exceed 50 degrees Celsius.
-and-
Operating temperature must not
exceed 65 degrees Celsius.
and so on.
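The effect is easy to reproduce. The sketch below scores the examples above with Python's standard difflib module, which serves here only as a stand-in for a character-based fuzzy matcher; the algorithms inside commercial translation memory tools may differ.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# (sentence being written, sentence offered as a fuzzy match)
pairs = [
    ("WARNING: Switch power off only when the fan has stopped.",
     "WARNING: Switch power on only when the fan has stopped."),
    ("WARNING: Switch power off only when the fan has stopped.",
     "Switch power off before the fan has stopped."),
    ("Operating temperature must not exceed 45 degrees Celsius.",
     "Operating temperature must not exceed 65 degrees Celsius."),
]

for source, candidate in pairs:
    print(f"{similarity(source, candidate):.2f}  {candidate}")
```

All three candidates score as close matches, although the first two reverse the instruction and the third changes a safety limit. A string-based score cannot tell these cases apart from harmless rewording.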
In terms of usability, authors might have
to wade through tens of suggestions for a single input,
or thousands for a single document, which not only discourages
writers from using the tool, but also increases the risk
that they will introduce inaccurate information.
Matching Meaning
Up until now, technology has not met the
challenges of applying reuse at the sentence level. The
technologies that are available have delivered few tangible
results for authoring and editing. The tools currently available
do not address the single most important aspect of linguistic
reuse: matching sentences or sentence fragments in terms
of meaning. At the same time, the tools have often proven
unwieldy or unusable in practice.
Acrolinx has recently introduced a new Intelligent
Reuse component for its Information Quality software that
meets these two challenges, combining meaning-based reuse
with usability.
Consider the following examples:
- Follow this link to find out more
- To find out more, follow this link
- Click here to find out more
- For more information, please go here
These segments are simply different ways
of saying the same thing, but translating them individually
increases costs. Tools based on fuzzy matching are not useful
here because the words and word sequences are too different.
However, Intelligent Reuse identifies the similarity in
meaning so that authors do not have to write different sentences
to express the same thing.
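To see why word-based matching struggles with these four segments, consider a simple word-overlap score (the Jaccard index). This is an illustrative measure chosen for this sketch, not the method used by any particular tool.

```python
import re

def word_overlap(a: str, b: str) -> float:
    """Jaccard index on word sets: shared words / all words used."""
    wa = set(re.findall(r"[a-z]+", a.lower()))
    wb = set(re.findall(r"[a-z]+", b.lower()))
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

segments = [
    "Follow this link to find out more",
    "To find out more, follow this link",
    "Click here to find out more",
    "For more information, please go here",
]

for other in segments[1:]:
    print(f"{word_overlap(segments[0], other):.2f}  {other}")
```

The second segment scores 1.0 because it uses the same words in a different order, but the fourth shares only the word "more" with the first and scores below 0.1, even though all four mean the same thing.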
Behind the scenes, a technology based on
Artificial Intelligence extracts sentences from a translation
memory or content management system. It groups sentences
with similar meanings into so-called “micro-clusters.” The
previous example is one such cluster. The following sentences
are drawn from a cluster of approximately 25 sentences:
End Date must be greater than or equal to
Start Date.
End date must be equal to or later than
the start date..
End date should be greater than start date.
The start date cannot be later than the
end date.
Start date must be before end date!
The start date must be on or before the
end date.
Your end date must be after your start
date.
Your start date must be before your end
date.
The end date must be later than or the
same as the start date.
The actual end date must be on or after
the actual start date.
You cannot enter an “End Date” that is
before your “Start Date.”
Please enter an end date that is later
than the start date.
Please enter an End Date that is later
than or the same as the Start
End Time must be later than the Start Time.
Please enter a start date that is before
the end date.
Typically, content repositories or translation
memories contain many segments that are redundant or of
questionable quality. Based on initial experiences with
the tool in business settings, Intelligent Reuse reduces
redundancy in content by 15-35%. It also filters the micro-clusters
for quality, checking for spelling and grammar, corporate
style, and terminology. Compare the first and second sentences
in the micro-cluster. “Start Date” is capitalized in the
first, but not in the second; the second also ends with a double
period. Juxtaposing the sentences in this way
enables users to detect issues that typical spellcheckers
might not catch. Intelligent Reuse provides spellchecking
and quality assurance on a sentence level, rather than a
word level, with an overhead similar to regular spellchecking.
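The two issues mentioned above can be found with very simple sentence-level checks once the sentences sit side by side. The following is a minimal sketch with hypothetical function names; it illustrates the idea only and says nothing about how Intelligent Reuse implements its checks.

```python
import re

def has_doubled_end_punctuation(sentence: str) -> bool:
    """True if the sentence ends in two or more of . ! ? (also flags '...')."""
    return re.search(r"[.!?]{2,}\s*$", sentence) is not None

def casing_variants(sentences: list[str], term: str) -> set[str]:
    """Collect every spelling of `term` found in the cluster, ignoring case."""
    variants: set[str] = set()
    for sentence in sentences:
        variants.update(re.findall(re.escape(term), sentence, flags=re.IGNORECASE))
    return variants

cluster = [
    "End Date must be greater than or equal to Start Date.",
    "End date must be equal to or later than the start date..",
    "End date should be greater than start date.",
]

flagged = [s for s in cluster if has_doubled_end_punctuation(s)]
spellings = casing_variants(cluster, "start date")
print(flagged)
print(sorted(spellings))
```

More than one spelling of the same term within a cluster signals an inconsistency that a word-level spellchecker would accept, because each variant is spelled correctly on its own.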
After checking for quality, Intelligent
Reuse chooses a representative sentence, a “winner” in terms
of representativeness and quality. For this cluster, Intelligent
Reuse chooses the following sentence, which is highlighted
in a web-based interface:
Please enter an end date that
is later than the start date.
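The article does not describe how the representative is computed. One common way to pick a "most typical" member of a group is to choose the sentence with the highest average similarity to the others, after discarding sentences that fail a quality check. The sketch below assumes that approach; the word-overlap score and the punctuation check are placeholders for a real meaning-based measure and a real quality filter.

```python
import re

def word_overlap(a: str, b: str) -> float:
    """Placeholder similarity: shared words divided by all words used."""
    wa = set(re.findall(r"[a-z]+", a.lower()))
    wb = set(re.findall(r"[a-z]+", b.lower()))
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def passes_quality(sentence: str) -> bool:
    """Placeholder quality gate: reject doubled end punctuation."""
    return re.search(r"[.!?]{2,}\s*$", sentence) is None

def pick_representative(cluster: list[str]) -> str:
    """Return the quality-checked sentence most similar to the rest."""
    candidates = [s for s in cluster if passes_quality(s)]
    if not candidates:
        raise ValueError("no sentence in the cluster passes the quality gate")

    def mean_similarity(sentence: str) -> float:
        others = [o for o in cluster if o != sentence]
        if not others:
            return 0.0
        return sum(word_overlap(sentence, o) for o in others) / len(others)

    return max(candidates, key=mean_similarity)

cluster = [
    "End date must be equal to or later than the start date..",
    "Please enter an end date that is later than the start date.",
    "Please enter a start date that is before the end date.",
]
winner = pick_representative(cluster)
print(winner)
```

On this three-sentence sample the sketch happens to select the same sentence as the article's example, but that depends on the sample and on the placeholder score. In any case the result is only a suggestion that an administrator still has to validate.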
At this point, linguistic administrators
can accept the suggested representative sentence, choose
another one, or even move sentences from one cluster to
another using the interface. This validation process is
a key aspect of quality assurance because it helps administrators
choose only correct sentences. Once a representative sentence
has been chosen, the administrator activates its cluster
for document checking.
Putting Intelligent Reuse into practice
From the perspective of writers, the tool
now functions exactly like a spellchecker. For any sentence
that approximates a representative sentence in meaning,
writers receive a single standard sentence as a suggestion.
Intelligent Reuse provides suggestions for sentences already
stored in a content repository. But what is truly new about
the tool is that it makes suggestions for newly authored
sentences that match a representative sentence in meaning.
For the preceding example, a writer comes up with the sentence:
The start date must precede
the end date.
Intelligent Reuse would suggest the validated
representative sentence:
Please enter an end date that
is later than the start date.
It does so even though the new sentence is not part
of the original micro-cluster. In addition, the tool does
not detract from productivity because it makes only one
high-quality suggestion for any input.
This reuse is always intelligent because
the suggestions match in meaning, not proximity of letters
or words. The system can understand numbers and units and
other complex entities. If we turn back to our previous
example for fuzzy matching:
Operating temperature must not exceed 45
degrees Celsius.
Operating temperature should not exceed 50
degrees Celsius.
Operating temperature must not exceed 65
degrees Celsius.
Intelligent Reuse recognizes that the temperature
variables in the sentences make a difference. It would place
these sentences in the same micro-cluster, but recognize
the temperature values as variables. Let us say that the
linguistic administrator validates the first sentence of
the cluster. If an author writes:
The operating temperature should not
exceed 80 degrees Celsius.
Intelligent Reuse suggests:
Operating temperature must not exceed 80
degrees Celsius.
In other words, it offers the validated
representative sentence, but preserves the value (80 degrees)
that the author typed. More than translation cost or usability
is at issue in this case, since the difference in operating
temperature affects product safety.
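The value-preserving step can be sketched as a simple template substitution. The code below assumes that the author's sentence has already been matched to the validated sentence by some meaning-based component, which is not shown; the function name and the rule for refusing a suggestion are assumptions made for this illustration.

```python
import re
from typing import Optional

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def suggest(validated: str, authored: str) -> Optional[str]:
    """Return the validated wording carrying the author's numeric values.

    Returns None when the two sentences do not contain the same number
    of values, because guessing which value belongs where is unsafe.
    """
    values = NUMBER.findall(authored)
    if len(values) != len(NUMBER.findall(validated)):
        return None
    remaining = iter(values)
    # Replace each number in the validated sentence, left to right.
    return NUMBER.sub(lambda match: next(remaining), validated)

validated = "Operating temperature must not exceed 45 degrees Celsius."
authored = "The operating temperature should not exceed 80 degrees Celsius."
print(suggest(validated, authored))
```

Refusing to make a suggestion when the values do not line up is a deliberately cautious choice: a wrong number in a safety statement is worse than no suggestion.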
While this new technology has the potential
to cut costs significantly in the translation and localization
cycles, one of its most promising fields of application
concerns text authored by non-native speakers. Intelligent
Reuse helps non-native speakers meet the challenge of formulating
text in a foreign language by offering them a representative
sentence that has already been checked for quality and validated.
As more and more companies employ non-native speakers to
author their technical documentation, Intelligent Reuse could offer
enormous benefits.
Finally, Intelligent Reuse extends beyond
technical documentation to software strings, where developers
struggle to find out whether a suitable message
already exists. Here, Intelligent Reuse represents a novel
approach to a problem that currently has few solutions.
For acrolinx, Intelligent Reuse forms part of a holistic
view of enterprise information quality management.
Its initial results have been immensely
promising, in terms of decreasing redundancy, improving
quality, increasing productivity, and cutting costs. For
the first time, information developers can implement a linguistically based
reuse strategy that makes sense.
About acrolinx
acrolinx is the market leader in quality assurance
tools for professional information developers. These tools
help companies worldwide to maintain their corporate image,
address compliance issues, improve quality, and control
document production and localization costs. Its flagship
product, acrocheck™, is used internationally by thousands
of customers in a variety of industries, including software,
automotive, life sciences, and aerospace. acrocheck has
been deployed at global enterprises like SAP, Symantec,
SAS, Philips, Siemens, Motorola, and Bosch.
acrolinx maintains its headquarters in Berlin,
Germany with a sales and support subsidiary in North America.
ClientSide News Magazine - www.clientsidenews.com