Uncontrolled Terminology and MT: The Importance of Making Good Comparisons

Home

Join as a Member!

Post Your Job - Free!

All Translation Agencies

Advertisements

Uncontrolled Terminology and MT: The Importance of Making Good Comparisons

By Rafael Guzmán,
Symantec Corporation, Ireland

http://www.invenis.net/rguzman.php

Source: http://accurapid.com/Journal/48mt.htm

Become a member of TranslationDirectory.com - click here!

Key words: machine translation, uncontrolled terminology, duplicates, comparison criteria

Introduction

Feeding Machine Translation (MT) dictionaries with uncontrolled terminology is never a good thing. When this is done, MT engines consistently produce translations containing the same unwanted or wrong terminology over and over again. Post-editing these terminology errors tends to be time-consuming and expensive, which defeats the purpose of using MT.

Feeding Machine Translation (MT) dictionaries with uncontrolled terminology is never a good thing.

One of the most effective ways of controlling MT terminology is to automatically check for the existence of terms within an MT dictionary and across multiple terminology repositories, and determine if they are duplicated and how they are translated. This type of control can involve extensive terminology comparisons. Unfortunately, there is still a broad unawareness of why and how appropriate comparison criteria need to be used for each specific situation. This is confirmed by the lack of sufficient comparison functionality and flexibility available in current terminology tools.

This article briefly explains why terminology for MT dictionaries needs to be controlled, and suggests necessary comparison criteria as well as why Excel spreadsheets can help in filling in the gaps in current terminology tools to automate and customize terminology comparisons. Although the emphasis of this article is on terminology tasks affecting rule-based machine translation, they can be applicable to terminology management in general.

Good-quality terminology: nice to have or critical?

A while ago, Client Side News (2006) warned that "businesses often fail to see terminology management as a way to cut costs," and Warburton (2005) pointed out that "terminology is the biggest factor in poor translation quality." Also, it has been discussed the "unexpected return on investment (ROI)" generated by good-quality terminology (Wittner, 2007).

These statements seem to be confirmed by Lionbridge (2009), which reports that "it is estimated that 15 percent of all globalization project costs arise from rework and that the primary cause of rework is inconsistent terminology."

This poses the question of whether the success of rule-based MT is exempt from the dangers of uncontrolled terminology.

Unfortunately, it is not. Because rule-based MT relies on dictionaries, apart from grammar rules, the danger is twofold:

Necessary terminology is missing
Encoded (appropriate or inappropriate) terminology is incorrectly or inconsistently translated

The first problem is the result of poor terminology harvesting and validation, while the second problem is caused by the lack of appropriate quality controls. This article will focus only on the latter.

Once incorrect translated terminology makes its way into an MT dictionary, the same wrong translations will automatically be repeated over and over again in every new MT output.

This is caused by the fact that MT dictionaries are usually expected to contain just one single instance of each term and one unique translation and part of speech (POS)—maybe some contextual information as well—in each semantic domain. If there are terms that are duplicated and translated inconsistently, the MT engine will not know which translation to use. Consequently, it is possible that the wrong translation may always end up being used in every MT output. But even if there are no duplicates, or if they are translated consistently, the MT engine will use the wrong translation if terms contain an unwanted or wrong translation.

The severity of this problem will normally depend on the number of terms affected and their frequency in the source text. This suggests how critical it can be that necessary terminology be captured and translated correctly in the MT dictionary.

To sum up, post-editing incorrectly translated terms in the MT output is a reactive approach that causes unnecessary extra costs. What is needed is a proactive terminology control strategy that will prevent MT from causing the exact same problems over and over again.

MT-related terminology tasks

A very effective strategy to control MT terminology is to automatically analyze MT dictionaries and check for:

Duplicated source terms. If found, are they translated consistently?
Key source terms occurring as head of compound nouns. If found, are they translated consistently?
Deprecated (unwanted) or wrong translations

There are numerous MT-related terminology tasks in which these controls are required. These controls always involve terminology comparisons within an MT dictionary and across multiple terminology repositories (Table 1).

Table 1: terminology tasks involving comparisons

MT-related terminology task	Reason why a comparison task is required
Validation of harvested terminology for an MT dictionary	Once harvested terms have been manually "cleaned," they need to be checked for duplicates.
Creation of a global list of key terms from different project glossaries to feed an MT dictionary	Once all the project glossaries are merged into a single glossary, terms need to be checked for unwanted terms, duplicates, and translation inconsistencies.
Populating a list of source terms by leveraging translations and other data from legacy glossaries to create an MT dictionary	Before automatically allocating translations to each term, their part of speech values (POS) values need to be compared to avoid false positives (e.g. "open" can be a verb or an adjective; "scan" could be a verb or a noun) .
Customization of a project glossary into an MT dictionary from scratch	Some rule-based MT dictionaries often require that terms are surrounded by linguistic clues based on their POS values (e.g. open > to open). This requires comparing POS values as a pre-condition to allocate the appropriate clues.
Feeding an MT dictionary with terminology from a project glossary	Some of the terms in the glossary may already exist in the MT dictionary with the same or a different translation.
Diagnosing (auditing) an MT dictionary	Terms may contain unwanted translations. There can also be duplicates translated inconsistently.
MT dictionary metrics	Based on any of the tasks above, metrics need to capture the number of duplicates, deprecated translations and inconsistencies

Finally, when dealing with MT dictionaries, "it is not wise to hope that all source texts have been written well, and have followed any type of standardization guidelines with respect to the use of terminology grammatical rules, or writing style" (Allen, 2006). Sometimes it may be necessary to encode incorrect source terms (e.g. deprecated or misspelled terms) that tend to recur frequently, otherwise, they will not be recognized by the MT engine. However, a better solution would be to use a normalization dictionary. What really matters is to ensure that their translation in the dictionary is correct and consistent.

Terminology comparison criteria

In the previous sections, the importance of checking for duplicates as well as for inconsistent and deprecated translations in MT dictionaries has been discussed. But is this enough? In other words, is there anything that can invalidate an automatic terminology comparison or make it inaccurate causing the issues to remain hidden?

The tables below show some basic self-explanatory examples of likely results after comparing terms automatically using specific criteria. Each table represents a different comparison scenario. The equal and unequal signs indicate the result of the comparison. Of course, many of these scenarios often appear combined in practice.

Table 2: Spelling

Scenario	Example
Same term, same spelling	administrator = administrator
Same term, but different spelling	administrator ≠ admin

Tables 3 and 4 represent two of the most typical comparison scenarios. Suppose that a rule-based MT dictionary (e.g. Systran) requires that generic terms are entered in lower case in order to enable the engine to recognize them in both lower and upper case while translating the source text. If a term is entered in upper case only, the MT engine will only be able to recognize the exact occurrence of that term in upper case.

Based on this, if the term "Virtual Memory" (upper case) already exists in the MT dictionary and the term "virtual memory" (lower case) is compared against it using a case-insensitive comparison, it will be "wrongly" concluded that "virtual memory" already exists in the dictionary and that there is no need to add it again. On the other hand, if a case-sensitive comparison is run, "virtual memory" will be flagged as an non-existent term in the dictionary. But if this term gets added to the dictionary in this case, will this action not create a duplicate?

Table 3: case sensitive versus case insensitive

Scenario	Example
Same term, same case	virtual memory = virtual memory
Same term, but different case	Virtual Memory ≠ virtual memory

Table 4: Part of speech (POS) values

Scenario	Example
Same term, same POS	scan (noun) = scan (noun)
Same term, different POS	scan (noun) ≠ scan (verb)
Same term, but POS missing in one of them	scan (noun) ≠ scan
Same term, same POS spelled differently	fast (adj.) ≠ fast (adjective)

Using POS values can help to fine-tune terminology comparisons. For instance, proper nouns are normally (although not necessarily) expected to be encoded in upper case, while nouns, verbs and adjectives are expected to be encoded in lower case. Occurrences of the same term with different part of speech values are not considered duplicates. For instance: "Virtual Memory" (proper noun) and "virtual memory" (noun) could be safely added to the MT dictionary.

When POS values are not used as part of the comparison criteria, misleading results are likely to occur.

Obviously, this comparison criterion assumes that all the candidate terms for addition to the MT dictionary, as well as the existing terms in the dictionary, have the correct spelling and correct standard POS values (e.g. TBX or OLIF). This might require a little bit of manual "cleaning" before hand.

Table 5 shows some scenarios of possible false positives if a "match whole word" option is not enabled prior to the comparison.

Table 5: Whole word matching

Scenario	Example
Term fully matches a term	backup = backup
Term partially matches the same term	backup = backups
Term partially matches a different term	backup = BackupExec

Some terms surrounded by "noise" (e.g. linguistic clues, trade mark sign) may need to be added to the MT dictionary even if the same term already exists without noise. In many cases, they do not need to be treated as duplicates necessarily; otherwise they will not be recognized by the MT engine (e.g. Linux, Linux(tm)). In these cases, what really matters is that the translation remains consistent and correct. To check this, any existing "noise" may need to be temporarily removed before running the comparison analysis.

Table 8: "Noise" (linguistic clues, punctuation, ampersand, tags, TM, variables)

Scenario	Example
Same verb, preceded inconsistently by "to"	open ≠ to open
Same term, but linguistic clues follow one of them	offline ≠ offline (flexible hyphen) off-host ≠ "off-host" (a)
Same term, but asterisk indicating the existence of a comment follows one of them	backup ≠ backup (*)
Same term, but colon sign follows one of them	Space available ≠ Space available:
Same term, but trade mark (TM) sign or similar appended to one of them	Linux ≠ Linux(tm)
Same term, but hotkey in the middle of one of them	File ≠ F&ile
Term surrounded by tags	Activate ≠ <tag>Activate</tag>

Table 6: semantic domain values

Scenario	Example
Same term, same meaning in different O/S	file (in Windows) = file (in Macintosh)
Same term, different meaning in different O/S	Dock (in Windows) ≠ Dock (in Macintosh)

Table 7: Leading and trailing blank spaces

Scenario	Example
Same term, no leading or trailing spaces	CPU Usage = CPU Usage
Same term, but leading space	CPU Usage ≠ CPU Usage

Table 9: translations

Scenario	Example
Same term, same translation	scan (analisis) = scan (analisis)
Same term, different translation	scan (analisis) ≠ scan (escaneo)

Excel spreadsheets as terminology tools companion

So far, this article has discussed why rule-based MT needs terminology controls focused on terminology comparisons using appropriate criteria based on the relevant MT terminology task. However, carrying out these tasks manually is tedious, time-consuming and prone to errors.

This raises one more question: what is the best technology environment to run good terminology comparisons?

While it is true that a number of well-known terminology tools have improved significantly providing online storage and management capabilities, the comparison functionality described in this article is still very limited in many tools. In addition, there is also the lack of flexibility to customize and automate the terminology tasks required by each situation. Also, practice shows that terminology stored in powerful terminology repositories often ends up being exported to text files that are subsequently opened in spreadsheets for evaluation or even usability purposes.

This is the context in which Microsoft Excel spreadsheets can be of great assistance to terminology tools, not necessarily as storage repositories but as terminology workbenches for analysis and customization purposes.

In Excel, users can easily record macros and develop them further in the Excel Visual Basic (VB) editor according to their own needs (e.g. T-Manager Terminology Tool).

For instance, the VB editor allows users to instruct a macro to compare all the cells in a dictionary or glossary in a spreadsheet and flag each duplicated term using the cell next to each duplicate. Then the user can start adding more macros and features such as a drop-down list in one of the cells in the spreadsheet to instruct the macro to use the required comparison criterion (e.g. case-sensitive or case-insensitive).

Obviously, apart from a bit of creativity, some previous knowledge of VB will be also needed, but there are plenty of online tutorials and this can be easily learned as the user goes along.

Conclusion

Good-quality and controlled terminology is critical for the success of rule-based Machine Translation. Key to this is to automatically identify duplicates and translation inconsistencies, as well as deprecated translations within an MT dictionary and across multiple terminology tools. This requires making sure that the necessary terminology comparisons are made using appropriate comparison criteria. As a result, MT outputs will require less post-editing, and MT translation costs will be reduced. Finally, while many current terminology tools provide little or no comparison functionality and flexibility, Excel macros and spreadsheets can help to provide these.

Acknowledgments

The author thanks Dr Johann Roturier and Fabio Fantino for discussion and useful comments.

References

Allen, J., 2006. Improved translation quality with Machine Translation Dictionary Building. [Online]. Available at: http://www.translatorscafe.com/cafe/article59.htm [accessed 20 November 07].

ClientSide News. 2006, A New Mid-level Solution for Terminology Management. [Online]. Available at: http://www.lexicool.com/lingo4-terminology-management-article.asp?IL=1 [accessed 20 January 2008].

Guzman, R., 2007. T-Manager Terminology Tool. [Online]. Available at http://www.invenis.net/resources/tw/index.php (Accessed 18 February 2009].

Guzman, R. 2008. Advanced automatic MT post-editing. Multilingual, 9 (3), p.52-57.

Itagaki, M., Aikawa , T, and He, X., 2007. Automatic Validation of Terminology Translation Consistency with Statistical Method.[Online]. Available at: http://research.microsoft.com/en-us/um/people/xiaohe/publication/mtsummit-2007.pdf [accessed 2 February 2009].

Lionbridge, Standardizing Business Terminology through Localization Management. [Online]. Available at http://www.lionbridge.com/lionbridge/en-US/services/localization-translation/terminology-management.htm [Accessed 3 February 2009).

Warburton, K., 2005, Terminology: Getting Down to Business, The Globalization Insider [online]. Available at http://www.lisa.org/globalizationinsider/2005/07/terminology_get.html [accessed 26 January 2006].

Wittner, J., 2007. Unexpected ROI from terminology. MultiLingual, (3), p.51-54.

Published - June 2009

This article was originally published at Translation Journal (http://accurapid.com/journal).

Submit your article!