High quality machine translation (MT) of human languages
has been a quest for more than five decades. Almost
as soon as computers were invented, developers and business
people could imagine the solutions automated translation
would provide in supporting international business,
aiding communications, and furthering collaboration
in the medical and research communities. Scientists
and linguists started to develop programs that would
correlate the words, grammar rules, and syntax of one
language to another. But after initial successes with
basic word translation, progress was slow; few significant
gains in MT accuracy followed.
In fact, human translators delighted in finding the
worst examples they could uncover, and MT became the
butt of many jokes!
Basic word translation, however, did spawn large
volumes of previously translated material that could
be kept in translation memory (TM) phrase databases:
i.e., French phrase = English phrase. These databases
track and allow the reuse of repetitive information,
both within and across a series of documents. This
computer-aided translation aids productivity for human
translators, as long as the TM sees a known phrase.
However, the words must appear in exactly the same order
as in the TM database for the translation
to match.
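The exact-match behavior described above can be pictured with a minimal sketch. It assumes a TM is nothing more than a dictionary of source phrases and stored translations; the phrases and the `tm_lookup` helper are hypothetical and are not taken from any commercial TM product.

```python
# Hypothetical translation memory: exact source phrase -> stored translation.
translation_memory = {
    "appuyez sur le bouton rouge": "press the red button",
    "fermez la porte": "close the door",
}


def tm_lookup(phrase, memory):
    """Return the stored translation, or None when there is no exact match."""
    # Only case and whitespace are normalized; word order must be identical.
    key = " ".join(phrase.lower().split())
    return memory.get(key)


# Same words in the stored order: the TM finds a match.
print(tm_lookup("Appuyez sur le bouton rouge", translation_memory))
# Same words in a different order: the TM finds nothing.
print(tm_lookup("sur le bouton rouge appuyez", translation_memory))
```

The second lookup fails even though every word is known, which is the limitation the paragraph above describes.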
About five years ago, Language Weaver (LW) came along
with a technology to change the MT paradigm. As computers
grew in speed and capacity and more translated documents
became available, statistical methods became feasible
for automated translation. Rather than trying to correlate
the grammar of one language to another, as rule-based
MT does, Language Weaver’s statistical machine
translation (SMT) uses proprietary statistical models
to find the most likely translation for any given
input, using pattern recognition and statistical probabilities.
The quality of Language Weaver’s SMT improves
with larger volumes of bilingual translated text for
training documents. Unlike TM, SMT works well with
documents of previously unseen text; unlike rule-based
methods, statistical translation models allow the
system to generate many possible translations for
each sentence and then choose the best option from among
the possibilities, depending on context, producing
fluent, more natural-sounding translations. The statistical
and pattern recognition process also allows for more
rapid, automated customization and adaptation to customer-specific
domains, style, and vocabulary.
This applies even to languages written in non-Roman scripts,
such as Chinese, Japanese, or Arabic, and results in
a lower development cost and shorter time to deployment.
In addition, SMT is more easily integrated with other
statistical method-based technologies, like OCR and
data mining. When the software is trained on a parallel
corpus (already translated text), it retains the word
and phrase relationships in a database as probability
tables.
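As a rough illustration of what a probability table can look like, the sketch below counts how often each target phrase is paired with a source phrase and turns the counts into relative frequencies. This is a common textbook estimate, not Language Weaver's proprietary model; the function name `build_phrase_table` and the sample pairs are invented for the example, and the pairs are assumed to be aligned already.

```python
from collections import Counter, defaultdict


def build_phrase_table(aligned_pairs):
    """Estimate P(target | source) by relative frequency over aligned pairs."""
    counts = defaultdict(Counter)
    for source, target in aligned_pairs:
        counts[source][target] += 1

    table = {}
    for source, targets in counts.items():
        total = sum(targets.values())
        table[source] = {target: n / total for target, n in targets.items()}
    return table


# Hypothetical, already-aligned phrase pairs from a parallel corpus.
pairs = [
    ("la maison", "the house"),
    ("la maison", "the house"),
    ("la maison", "the home"),
    ("bleue", "blue"),
]

phrase_table = build_phrase_table(pairs)
for source, options in phrase_table.items():
    print(source, options)
```

With more parallel text, the counts grow and the estimates become more reliable, which is one way to read the claim that quality improves with training volume.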
When new documents are presented to the SMT engine,
familiar words and phrases may appear in previously
unseen material, surrounded by different words. The
software makes educated guesses about what they mean,
assigning a confidence probability to perhaps three or four
possible options. If it has never seen the exact phrase before,
but it has seen the individual words, it will go through
its database of probabilities and come up with the
ones that best match the pattern. And if a company
has a large database of TM phrases, SMT can use the
TM as training data to increase the confidence of its estimates.
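The ranking step described above might be sketched as follows. The sketch assumes two hypothetical tables (one for whole phrases, one for single words), combines word probabilities by simple multiplication, and keeps the source word order; a production SMT engine would also reorder words and weigh target-language fluency, so treat this as a simplified picture rather than the actual engine logic.

```python
from itertools import product
from math import prod


def top_candidates(phrase, phrase_table, word_table, k=3):
    """Return up to k (translation, probability) options, best first."""
    if phrase in phrase_table:
        # The exact phrase was seen in training: use its stored options.
        options = list(phrase_table[phrase].items())
    else:
        # Unseen phrase: fall back to the individual words. A word that was
        # never seen is passed through unchanged with probability 1.0.
        per_word = [
            list(word_table.get(word, {word: 1.0}).items())
            for word in phrase.split()
        ]
        options = [
            (" ".join(text for text, _ in combo), prod(p for _, p in combo))
            for combo in product(*per_word)
        ]
    return sorted(options, key=lambda option: option[1], reverse=True)[:k]


# Hypothetical probability tables, invented for this example.
phrase_probs = {"la maison": {"the house": 0.7, "the home": 0.3}}
word_probs = {
    "la": {"the": 0.9, "it": 0.1},
    "porte": {"door": 0.8, "carries": 0.2},
}

print(top_candidates("la maison", phrase_probs, word_probs))  # seen phrase
print(top_candidates("la porte", phrase_probs, word_probs))   # unseen phrase
```

For the unseen phrase, the highest-ranked option is the one whose word choices are jointly most probable, and the lower-ranked options remain available as alternatives.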
Example-based and statistical methodologies have
created renewed excitement in the translation and
localization communities. For proof, note that Language
Weaver is no longer the only company in the market
using SMT; Google recently went live with consumer-oriented
SMT engines for Chinese, Arabic and other languages.
At Language Weaver, we have seen significant movement
in the last six to eight months, with many commercial
language service providers, localization companies,
and global management systems companies starting to
integrate Language Weaver’s SMT software into
their workflow processes – and typically this is the
first time they have integrated statistical machine
translation (and in some cases, ANY machine translation).
These companies have included ITP (of Belgium and
Japan), Interligare (Spain), across Systems (Germany),
Idiom Technologies (Boston), and Janya (Buffalo, NY)
in addition to Cross Language (Belgium), Zylab (Vienna,
VA), and Clay Tablet (Canada).
The business drivers for this integration are fairly
obvious – with the rise of the Internet and the globalization
of business, there is a greater volume of material
to be translated than ever before and it is unreasonable
to assume that human translators can keep up with
it unless they have better productivity tools. In
addition, IT departments have realized that simple
document translation is not enough; integration of
translation with other applications adds more value.
Multinational companies in custom industry domains,
in particular, have massive databases of digital parallel
content that can be leveraged to increase productivity
and decrease time to market of translated materials;
when automated translation and human post-editing
are combined, quality is not compromised. In addition,
when translation workflow is integrated with other
business processes, it can facilitate data mining
for business intelligence (such as patent research
or marketing knowledge), enhance communications, and
improve customer satisfaction levels. Integration
will take translation where it has never gone before,
and will grow the market for translation services
more than ever.
Let’s look at a couple of examples. Idiom has
recently integrated Language Weaver software into
its workflow. The solution is not designed to replace
human translators, but rather, to give them tools
that will help them do their jobs more effectively.
The promise is to help localization managers better
manage translation projects.
Both Idiom and Language Weaver use TMs, but in different
ways. The workflow software Idiom sells enables users to match
sentences in new documents to be translated against
existing TMs. If 70 percent of a software manual has
not changed from one version to another, the Idiom
software can find the 70 percent of the
sentences that have not changed and suggest translations
for them by simple lookup in the database.
Language Weaver’s software learns from the
TMs how to produce translations for new, unseen sentences
and how to translate words and phrases and assemble
them into grammatical outputs. With the LW software,
users can obtain translations of high quality for
the 30 percent of the sentences that do not match
a previous translation.
"The key difference between TM and SMT," Daniel Marcu,
interim CEO of Language Weaver, points out, "is that
a TM system never tries to propose translations it
hasn’t previously seen, whereas Language Weaver’s
SMT system will."
The localization manager can estimate the scope of
work and cost based on the number of words in each
match range before the job is sent to a translator.
When it is sent to the translator, the source text
segments are opened in the workbench for editing.
The translator can view the possible translations
generated by the TM and by Language Weaver, along
with the percentage match confidence, as well as the
source text. The translator populates the target language
column with the best possible translation, using his or her
own judgment. Productivity for the translator
is expected to increase by 100-200 percent, with higher
productivity increases anticipated as the SMT software
accumulates more data.
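To make the scoping step concrete, here is a minimal sketch of counting words per match range before a job goes to a translator. The band boundaries, the helper names, and the use of Python's `difflib.SequenceMatcher` as a similarity score are all assumptions made for illustration; commercial workbenches compute their match percentages in their own ways.

```python
from difflib import SequenceMatcher

# Hypothetical match bands: (minimum score, label), checked from the top down.
BANDS = [(100, "100%"), (90, "90-99%"), (70, "70-89%"), (0, "below 70%")]


def best_match_percent(segment, tm_sources):
    """Score a segment against its closest TM source, as a whole percentage."""
    return max(
        (int(SequenceMatcher(None, segment, source).ratio() * 100)
         for source in tm_sources),
        default=0,
    )


def words_per_band(segments, tm_sources):
    """Total the word count of the segments that fall into each match band."""
    totals = {label: 0 for _, label in BANDS}
    for segment in segments:
        score = best_match_percent(segment, tm_sources)
        label = next(name for floor, name in BANDS if score >= floor)
        totals[label] += len(segment.split())
    return totals


# Hypothetical TM sources and new segments.
tm_sources = ["Close the door.", "Press the red button to start the engine."]
new_segments = [
    "Close the door.",
    "Press the red button to stop the engine.",
    "Check the oil level every month.",
]

print(words_per_band(new_segments, tm_sources))
```

A localization manager could multiply each band's word count by a per-word rate for that band to estimate cost, with the lowest bands being the ones where machine translation output is most likely to help.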
Eric Richard, VP of engineering at Idiom, says, "By
combining Idiom translation memories with Language
Weaver’s SMT, the SMT system can ‘learn’
from the corpus of translation memory that continues
to build up over time in WorldServer. As a result,
Idiom and Language Weaver can significantly accelerate
translation and localization activities for customers,
enabling them to translate far greater volumes of
content than ever before."
The promise of time and money savings is a good
reason to work with SMT. But, as we all know, accuracy
drives adoption. ITP is responsible for translating
high volumes of automotive content. It is not unusual
for them to have to output 3,000-page manuals within
a two-week turnaround. ITP deployed Language Weaver
SMT software a few months ago and now has data to
assess it.
According to Gert van Assche, director of technology
of ITP-Europe, in Belgium, "The translators simply
could not believe that Language Weaver was capable
of producing this kind of high quality translation.
In general, the TM matches lower than 90 percent confidence
ratings are more reliable from Language Weaver than
from our TM. In one instance, only 5 out of 1000 segments
were unusable. And most surprisingly, long sentences
are just as good as short ones. The results achieved
are beyond the high expectations we had after the
first evaluation."
This kind of success doesn’t necessarily appear
right out of the box. Language Weaver pioneered the
approach to statistically based automated translation
and has developed customization methods that lead
to high quality translations. The good news is that
it doesn’t take a team of linguists months (or
years) to fine-tune the work; Language Weaver can
do it pretty quickly.
There are several factors that contribute to high
quality translations that can be more easily post-edited: