Statistical Machine Translation and Translation Memory: an Integration Made in Heaven
High-quality machine translation (MT) of human languages has been a quest for more than five decades. Almost as soon as computers were invented, developers and business people could imagine what automated translation would offer: support for international business, easier communication, and closer collaboration in the medical and research communities. Scientists and linguists started to develop programs that would map the words, grammar rules, and syntax of one language onto another. But after initial successes with basic word translation, progress was slow, and significant gains in MT accuracy were rare. In fact, human translators delighted in finding the worst examples they could uncover, and MT became the butt of many jokes!
Basic word translation, however, did spawn large volumes of previously translated material that could be kept in translation memory (TM) phrase databases: i.e., French phrase = English phrase. These databases track and allow the reuse of repetitive information, both within and across a series of documents. This computer-aided translation aids productivity for human translators, as long as the TM sees a known phrase. However, the words must appear in the exact same order as what the TM has in its database in order to make the translation match.
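The exact-match behavior described above can be sketched in a few lines of Python. This is only an illustration: the two stored segments and the `tm_lookup` helper are hypothetical and do not come from any real TM product.

```python
# Toy translation memory: exact source segment -> stored translation.
# The entries are invented for illustration only.
translation_memory = {
    "Press the start button.": "Appuyez sur le bouton de démarrage.",
    "Close the cover.": "Fermez le couvercle.",
}


def tm_lookup(segment, memory):
    """Return the stored translation on an exact match, otherwise None."""
    return memory.get(segment)


# Identical wording and word order: the TM finds the stored translation.
hit = tm_lookup("Close the cover.", translation_memory)

# One extra word is enough to make the exact lookup fail.
miss = tm_lookup("Close the front cover.", translation_memory)
```

The second lookup returns nothing even though the sentence differs by a single word, which is the limitation that statistical methods aim to address.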
About five years ago, Language Weaver (LW) came along with a technology to change the MT paradigm. As computers grew in speed and capacity and more translated documents became available, statistical methods became feasible for automated translation. Rather than trying to correlate the grammar of one language with another, as rule-based MT does, Language Weaver’s statistical machine translation (SMT) uses proprietary statistical models to find the most likely translation for any given input, using pattern recognition and statistical probabilities. The quality of Language Weaver’s SMT improves as the volume of bilingual translated text used for training grows. Unlike TM, SMT works well with documents of previously unseen text. Unlike rule-based methods, statistical translation models allow the system to generate many possible translations for each sentence and then choose the best option from among them, depending on context, producing fluent, more natural-sounding translations. The statistical and pattern recognition process also allows for more rapid, automated customization and adaptation to customer-specific domains, style, and vocabulary.
This applies even to languages that do not use the Roman alphabet, such as Chinese, Japanese, or Arabic, and results in a lower development cost and shorter time to deployment. In addition, SMT is more easily integrated with other technologies based on statistical methods, such as OCR and data mining. When the software is trained on a parallel corpus (already translated text), it retains the word and phrase relationships in a database as probability tables.
When new documents are presented to the SMT engine, known phrases and words may appear in previously unseen material, surrounded by different words. The software makes educated guesses about what they mean, assigning a probability, or confidence level, to perhaps three or four possible options. If it has never seen the exact phrase before but has seen the individual words, it searches its database of probabilities and comes up with the options that best match the pattern. So if a company has a large database of TM phrases, SMT can use the TM as input data to increase its confidence.
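To make the idea of probability tables concrete, here is a minimal sketch of a phrase table and a greedy lookup over it. It is not Language Weaver’s proprietary method: the table entries, the probabilities, and the `translate` function are all assumptions made for illustration, and a real SMT system also handles word reordering and scores fluency across the whole sentence.

```python
# Hypothetical phrase table: source phrase -> {candidate translation: probability}.
# All phrases and numbers are invented for illustration.
phrase_table = {
    ("la", "maison"): {"the house": 0.8, "the home": 0.2},
    ("maison",): {"house": 0.7, "home": 0.3},
    ("la",): {"the": 0.9, "her": 0.1},
    ("est",): {"is": 0.9, "east": 0.1},
    ("bleue",): {"blue": 0.95, "sad": 0.05},
}


def translate(words, table, max_len=3):
    """Greedy left-to-right lookup: prefer the longest known phrase and
    keep its most probable translation. Returns (text, combined score)."""
    output, score, i = [], 1.0, 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in table:
                best, prob = max(table[key].items(), key=lambda kv: kv[1])
                output.append(best)
                score *= prob
                i += n
                break
        else:
            # Word never seen in training: pass it through unchanged.
            output.append(words[i])
            i += 1
    return " ".join(output), score


text, confidence = translate("la maison est bleue".split(), phrase_table)
partial, _ = translate("la voiture est bleue".split(), phrase_table)
```

In the first call the two-word phrase "la maison" is matched as a unit; in the second, the unseen word "voiture" is passed through while the known words around it are still translated, which mirrors how the engine falls back to smaller known pieces when the exact phrase is new.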
Example-based and statistical methodologies have created renewed excitement in the translation and localization communities. For proof, note that Language Weaver is no longer the only company in the market using SMT; Google recently went live with consumer-oriented SMT engines for Chinese, Arabic and other languages.
At Language Weaver, we have seen significant movement in the last six to eight months, with many commercial language service providers, localization companies, and global management systems companies starting to integrate Language Weaver’s SMT software into their workflow processes. Typically this is the first time they have integrated statistical machine translation (and in some cases, any machine translation). These companies include ITP (Belgium and Japan), Interligare (Spain), across Systems (Germany), Idiom Technologies (Boston), and Janya (Buffalo, NY), in addition to Cross Language (Belgium), Zylab (Vienna, VA), and Clay Tablet (Canada).
The business drivers for this integration are fairly obvious. With the rise of the Internet and the globalization of business, there is a greater volume of material to be translated than ever before, and it is unreasonable to assume that human translators can keep up with it unless they have better productivity tools. In addition, IT departments have realized that simple document translation is not enough; integrating translation with other applications adds more value.
Multinational companies in specialized industry domains, in particular, have massive databases of digital parallel content that can be leveraged to increase productivity and shorten time to market for translated materials; when automated translation and human post-editing are combined, quality is not compromised. In addition, when translation workflow is integrated with other business processes, it can facilitate data mining for business intelligence (such as patent research or marketing knowledge), enhance communications, and improve customer satisfaction. Integration will take translation where it has never gone before and will grow the market for translation services more than ever.
Let’s look at a couple of examples. Idiom has recently integrated Language Weaver software into its workflow. The solution is not designed to replace human translators, but rather, to give them tools that will help them do their jobs more effectively. The promise is to help localization managers better manage translation projects.
Both Idiom and Language Weaver use TMs. The workflow software that Idiom sells enables users to match sentences in new documents against existing TMs. If 70 percent of a software manual has not changed from one version to the next, the Idiom software can find the 70 percent of sentences that are unchanged and suggest translations for them by simple lookup in the database.
Language Weaver’s software learns from the TMs how to produce translations for new, unseen sentences and how to translate words and phrases and assemble them into grammatical outputs. With the LW software, users can obtain translations of high quality for the 30 percent of the sentences that do not match a previous translation.
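The division of labor between the two systems can be sketched as a simple routing step. This is a hypothetical illustration, not the actual Idiom or Language Weaver integration: `route_segments` and `placeholder_mt` are invented names, and the placeholder engine only tags the text instead of translating it.

```python
def route_segments(segments, memory, mt_engine):
    """Use the TM for exact matches and the MT engine for everything else."""
    routed = []
    for segment in segments:
        if segment in memory:
            routed.append((segment, memory[segment], "TM"))
        else:
            routed.append((segment, mt_engine(segment), "MT"))
    return routed


def placeholder_mt(segment):
    """Stand-in for a real MT engine; it only marks the segment as a draft."""
    return f"[MT draft] {segment}"


# Seven sentences carried over from the previous version of a manual...
memory = {f"Sentence {i}.": f"Phrase {i}." for i in range(1, 8)}
# ...and a new version with ten sentences, three of them new.
new_manual = [f"Sentence {i}." for i in range(1, 11)]

routed = route_segments(new_manual, memory, placeholder_mt)
tm_share = sum(1 for _, _, source in routed if source == "TM") / len(routed)
```

With these invented inputs, 70 percent of the segments are answered from the TM and the remaining 30 percent are handed to the MT engine, matching the split used in the example above.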
"The key difference between TM and SMT," Daniel Marcu, interim CEO of Language Weaver, points out, "is that a TM system never tries to propose translations it hasn’t previously seen, whereas Language Weaver’s SMT system will."
The localization manager can estimate the scope of work and cost based on the number of words in each match range before the job is sent to a translator. When it is sent to the translator, the source text segments are opened in the workbench for editing. The translator can view the possible translations generated by the TM and by Language Weaver, along with the percentage match confidence, as well as the source text. The translator populates the target language column with the best possible translation, using his/her own best judgment. Productivity for the translator is expected to increase by 100-200 percent, with higher productivity increases anticipated as the SMT software accumulates more data.
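Counting words per match range, as the localization manager does when scoping a job, might look like the sketch below. The band boundaries, the function names, and the use of Python’s `difflib.SequenceMatcher` as the similarity measure are all assumptions for illustration; the article does not say how Idiom’s software computes match percentages.

```python
from difflib import SequenceMatcher

# Assumed match bands (lower bound in percent, label); not taken from any product.
BANDS = ((100, "100%"), (85, "85-99%"), (70, "70-84%"), (0, "below 70%"))


def best_match_percent(segment, memory_sources):
    """Highest character-level similarity (0-100) against any stored segment."""
    return max(
        (SequenceMatcher(None, segment, src).ratio() * 100 for src in memory_sources),
        default=0.0,
    )


def words_per_band(segments, memory_sources, bands=BANDS):
    """Total the word count of the segments that fall into each match band."""
    totals = {label: 0 for _, label in bands}
    for segment in segments:
        score = best_match_percent(segment, memory_sources)
        for floor, label in bands:
            if score >= floor:
                totals[label] += len(segment.split())
                break
    return totals


memory_sources = ["Close the cover."]
job = ["Close the cover.", "xyz"]
totals = words_per_band(job, memory_sources)
```

A cost estimate could then apply a different per-word rate to each band, with the lowest rate for exact matches and the highest for segments that need the most editing.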
Eric Richard, VP of engineering at Idiom, says, "By combining Idiom translation memories with Language Weaver’s SMT, the SMT system can ‘learn’ from the corpus of translation memory that continues to build up over time in WorldServer. As a result, Idiom and Language Weaver can significantly accelerate translation and localization activities for customers, enabling them to translate far greater volumes of content than ever before."
The promise of time and money savings is a good reason to work with SMT. But, as we all know, accuracy drives adoption. ITP is responsible for translating high volumes of automotive content. It is not unusual for the company to have to deliver 3,000-page manuals with a two-week turnaround. ITP deployed Language Weaver SMT software a few months ago and now has data to assess it.
According to Gert van Assche, director of technology of ITP-Europe, in Belgium, "The translators simply could not believe that Language Weaver was capable of producing this kind of high quality translation. In general, the TM matches lower than 90 percent confidence ratings are more reliable from Language Weaver than from our TM. In one instance, only 5 out of 1000 segments were unusable. And most surprisingly, long sentences are just as good as short ones. The results achieved are beyond the high expectations we had after the first evaluation."
This kind of success doesn’t necessarily appear right out of the box. Language Weaver pioneered the approach to statistically based automated translation and has developed customization methods that lead to high-quality translations. The good news is that it doesn’t take a team of linguists months (or years) to fine-tune the work; Language Weaver can do it fairly quickly.
There are several factors that contribute to high quality translations that can be more easily post-edited:
The best is yet to come. We are confident that with the integration of translation memory databases and SMT, especially as it continues to evolve, the quest of those early computer programmers will be realized.
ClientSide News Magazine - www.clientsidenews.com