Article for translators: Systran MT/TM integration

Home

Join as a Member!

Post Your Job - Free!

All Translation Agencies

Advertisements

Systran MT/TM integration

By Jean Senellart
Director Research and Development,
SYSTRAN

Become a member of TranslationDirectory.com at just $12 per month (paid per year)

With the longest history of any MT developer in the world, SYSTRAN, founded in 1968, has an R&D investment measured in thousands of person years. From the beginning, SYSTRAN’s development has been deeply rooted in linguistics and the system still remains in constant evolution. The translation engines, linguistic resources and user-interactive customization tools are in step with the latest computing standards and new development techniques, as the company is always working on introducing innovative linguistic advancements.

SYSTRAN’s goal is to better describe languages, the gateway to continue improving the quality of automatic translation.

Today SYSTRAN is the market leading provider of language translation software products and solutions for the desktop, enterprise, and Internet that facilitate communication in 52 commercially available language pairs and in 20 vertical domains. Almost 20 additional language pairs have been developed for specific projects or customers.

SYSTRAN is the choice of leading global corporations, portals that include AltaVista™, Apple, Google™, Yahoo!®, and governmental institutions throughout the world like the US Intelligence Community. Use of SYSTRAN products and solutions enhances multilingual communication and increases user productivity and time-savings for B2E, B2B, and B2C market segments as they deliver real-time language solutions.

Although there is a wide spectrum of SYSTRAN products and solutions to choose from, each allows users to instantly translate any written text for gisting (understanding the general idea of what is written, such as quick understanding of foreign language Web content) and for publishing (near-perfect translations that also require post-editing, like user guides, technical support content, and other common localization projects).

When used for publishing purposes, professional users often combine MT and TM’s (translation memories). The simpler integration applies TM’s first, and MT on “no match” segments as a default translation. A richer approach uses TM’s and, more generally user feedback, to supply MT resources.

It is in the context of the second approach that we provide an overview of how SYSTRAN’s translation engines work, highlighting the importance of the linguistic descriptions, how existing TM’s can be reused to customize the translation engines, and available tools for managing Translation Memory and User Terminology within a translation workflow based on MT.

LINGUISTICS AND THE IMPORTANCE OF DESCRIBING LANGUAGES

Two key elements distinguishing SYSTRAN’s MT system from the others are that it’s an incremental system and is deterministic. The system is designed to produce incremental translation quality results between versions which can be easily validated by users. Additionally, the system produces deterministic output, meaning the results are consistent and based on available resources. As a result, users are able to interact with the system to modify results by customizing linguistic resources.

It all starts with the three types of linguistic descriptions provided for each language pair (source language to target language) implanted in the system: Analysis, Transfer, and Generation.

The following diagram illustrates the process for translating a source language sentence into a target language sentence. The deeper the source language analysis, the smaller the transfer will be (and the smaller the effort to build new language pairs).

Systran MT/TM integration 1. The description of the source language, also referred to as the analysis is composed of the following analyses:

• Global Document Analysis considers the input text as one unit and performs several rounds of analysis that identify important elements that help describe the source language. These include language identification at the paragraph level, the named entities (such as dates and proper nouns) that define the local terminology, and subject detection of the document which enables the system to automatically select preferred meanings by domain.

• Grammatical Analysis provides the system with the data required to create the internal representation of each by displaying the complete linguistic structure of each sentence. Included are part of speech for each word, the syntagm to which they belong, the relationship between different entities, and the function of main elements (verb, object, subject, and complement). This deep description builds a hierarchical representation of the dependencies between the different elements.

The system’s analysis involves several rounds represented by a sequence, each of which has several dependencies to other components, including:

• Sentence segmentation – identification of sentences in the text
• Normalization of the languages
• Morphological analysis
• Grammatical Disambiguation
• Clause identification
• Basic local relationships
• Enumeration analysis
• Predicate/Subject analysis
• Preposition Rattachement
• Semantic Analysis of the different words used in context allows the system to automatically tag words with associated semantic features that are used in a later part of the process.

2. The description of the transfer from the source language to target language focuses on the transfer of structures and the transfer of the lexicon. It is the only description dependent on both the source and target languages. For instance, the internal analysis of the previous sentence is transferred (from English to German) into the structure represented in following figure.

3. The description of the target language also referred to as the generation.

These three linguistic descriptions are based on linguistic resources; monolingual for the analysis and generation or bilingual for the transfer. There are two primary types of resources: rules and dictionaries. Dictionaries describe individual terminological units. Rules generalize these descriptions while providing high-level descriptions of linguistic phenomena.

Typically linguistic resources are very large, and for some language pairs can reach up one million entries. Most of these resources can be learned (with supervised semi-automatic extraction tools) from a bilingual corpus – such as existing translation memories - and easily therefore adapted to specific domains, which means the translation is customized.

In comparison, the rules count is smaller but rules rank much higher in terms of complexity and require linguistic expertise for creation and maintenance.

ARCHITECTURE

SYSTRAN systems are highly modular and are based on an XML workflow, a mechanism enabling communication between users and the different modules, and between the different modules themselves. This mechanism provides interaction between users and the system’s internal rules.

The interaction between users and internal rules is enabled by rich interactive tools embedding MT. For high productivity, professional users must be able to understand the translation process, as well as interact with the rules and resources in order to fine-tune their translations. An example of a rich interactive tool is the SYSTRAN Translation Project Manager, a translation workbench available in select SYSTRAN products and solutions. All of the features mentioned below are included in this tool.

• The translation engine produces rich markup and allows users to view the impact of their resources, alternative meanings, and indicators on sentence complexity.
• Build and apply TM’s.
• A step that goes beyond TM’s is the ability to store and reuse any user-choice resource for other translations using SYSTRAN’s translation choice feature.

SYSTRAN PRODUCTS

SYSTRAN’s array of products and solutions for the desktop, enterprise, and Internet help enterprise and home users understand foreign language content in real-time and create multilingual documents. Released in February of this year, SYSTRAN 6 brings 12 new language pairs, more than one million new terms created from aligned data, a dictionary lookup, a comprehensive environment for post-editing and QA that includes terminology extraction, flexible TM’s, collaborative dictionary management, built-in comparators, and other tools to efficiently optimize translations in a cost-effective manner.

Noteworthy rich interactive tools and technology associated to the translation engines follow:

• SYSTRAN Translation Project Manager (STPM) is a “translation workbench” used to create, manage, and refine localization projects consisting of hundreds of files. Using STPM users perform side-by-side comparisons between original and translated documents and affect changes to both, as well as add terms to User Dictionaries and process dictionary updates. In addition, STPM offers a selection of powerful built-in review tools, including terminology review, analysis of the original document, full sentence review, use of alternative meanings, and others for applying, reviewing, and building TM’s and other advanced features.

• SYSTRAN Dictionary Manager (SDM) allows users to create and manage three levels of linguisti c data types to improve translation quality. The data types are:

o User Dictionaries (UDs) – user created bilingual or multilingual glossaries that are used alongside SYSTRAN’s built-in domain dictionaries.

o Normalization Dictionaries (NDs) - monolingual resources that can be used to standardize or correct source text prior to translation, or to correct target text after translation.

o TM’s

• SDM is based on IntuitiveCoding, a proprietary SYSTRAN technology that allows users to massively import and manage entries in User Dictionaries.

• Dictionary Lookup provides additional contextual information for alternative meanings of selected source language terms. Users can select a term at any time across multiple dictionaries covering the SYSTRAN Main Dictionary, 20 SYSTRAN Domain Dictionaries, User Dictionaries, and other integrated dictionaries.

Set translation choices and customize terminology with contextual review tools in SYSTRAN Translation Project Manager.

CONCLUSION

It is essential to customize the translation engines for use within a localization workflow. This customization process is based on a variety of rich user-interactive tools that leverage existing TM’s and exploit the post-editing effort to create complementary linguistic resources. Based on linguistic tools, this customization process advances the quality of the linguistic descriptions involved in the translation process and progressively increases overall translation quality.

Source sentence
Target sentence
Post-analysis representation
Pre-generation representation
Language independent representation

Surface level

Analysis
Transfer
Generation

ClientSide News Magazine - www.clientsidenews.com

Submit your article!