Uncontrolled Terminology and MT: The Importance of Making Good Comparisons
Key words: machine translation, uncontrolled terminology, duplicates, comparison criteria
Feeding Machine Translation (MT) dictionaries with uncontrolled terminology is never a good thing. When this is done, MT engines consistently produce translations containing the same unwanted or wrong terminology over and over again. Post-editing these terminology errors tends to be time-consuming and expensive, which defeats the purpose of using MT.
This article briefly explains why terminology for MT dictionaries needs to be controlled, and suggests necessary comparison criteria as well as why Excel spreadsheets can help in filling in the gaps in current terminology tools to automate and customize terminology comparisons. Although the emphasis of this article is on terminology tasks affecting rule-based machine translation, they can be applicable to terminology management in general.
Good-quality terminology: nice to have or critical?
A while ago, Client Side News (2006) warned that "businesses often fail to see terminology management as a way to cut costs," and Warburton (2005) pointed out that "terminology is the biggest factor in poor translation quality." Also, it has been discussed the "unexpected return on investment (ROI)" generated by good-quality terminology (Wittner, 2007).
These statements seem to be confirmed by Lionbridge (2009), which reports that "it is estimated that 15 percent of all globalization project costs arise from rework and that the primary cause of rework is inconsistent terminology."
This poses the question of whether the success of rule-based MT is exempt from the dangers of uncontrolled terminology.
Unfortunately, it is not. Because rule-based MT relies on dictionaries, apart from grammar rules, the danger is twofold:
The first problem is the result of poor terminology harvesting and validation, while the second problem is caused by the lack of appropriate quality controls. This article will focus only on the latter.
Once incorrect translated terminology makes its way into an MT dictionary, the same wrong translations will automatically be repeated over and over again in every new MT output.
This is caused by the fact that MT dictionaries are usually expected to contain just one single instance of each term and one unique translation and part of speech (POS)maybe some contextual information as wellin each semantic domain. If there are terms that are duplicated and translated inconsistently, the MT engine will not know which translation to use. Consequently, it is possible that the wrong translation may always end up being used in every MT output. But even if there are no duplicates, or if they are translated consistently, the MT engine will use the wrong translation if terms contain an unwanted or wrong translation.
The severity of this problem will normally depend on the number of terms affected and their frequency in the source text. This suggests how critical it can be that necessary terminology be captured and translated correctly in the MT dictionary.
To sum up, post-editing incorrectly translated terms in the MT output is a reactive approach that causes unnecessary extra costs. What is needed is a proactive terminology control strategy that will prevent MT from causing the exact same problems over and over again.
MT-related terminology tasks
A very effective strategy to control MT terminology is to automatically analyze MT dictionaries and check for:
There are numerous MT-related terminology tasks in which these controls are required. These controls always involve terminology comparisons within an MT dictionary and across multiple terminology repositories (Table 1).
Table 1: terminology tasks involving comparisons
Finally, when dealing with MT dictionaries, "it is not wise to hope that all source texts have been written well, and have followed any type of standardization guidelines with respect to the use of terminology grammatical rules, or writing style" (Allen, 2006). Sometimes it may be necessary to encode incorrect source terms (e.g. deprecated or misspelled terms) that tend to recur frequently, otherwise, they will not be recognized by the MT engine. However, a better solution would be to use a normalization dictionary. What really matters is to ensure that their translation in the dictionary is correct and consistent.
Terminology comparison criteria
In the previous sections, the importance of checking for duplicates as well as for inconsistent and deprecated translations in MT dictionaries has been discussed. But is this enough? In other words, is there anything that can invalidate an automatic terminology comparison or make it inaccurate causing the issues to remain hidden?
The tables below show some basic self-explanatory examples of likely results after comparing terms automatically using specific criteria. Each table represents a different comparison scenario. The equal and unequal signs indicate the result of the comparison. Of course, many of these scenarios often appear combined in practice.
Tables 3 and 4 represent two of the most typical comparison scenarios. Suppose that a rule-based MT dictionary (e.g. Systran) requires that generic terms are entered in lower case in order to enable the engine to recognize them in both lower and upper case while translating the source text. If a term is entered in upper case only, the MT engine will only be able to recognize the exact occurrence of that term in upper case.
Based on this, if the term "Virtual Memory" (upper case) already exists in the MT dictionary and the term "virtual memory" (lower case) is compared against it using a case-insensitive comparison, it will be "wrongly" concluded that "virtual memory" already exists in the dictionary and that there is no need to add it again. On the other hand, if a case-sensitive comparison is run, "virtual memory" will be flagged as an non-existent term in the dictionary. But if this term gets added to the dictionary in this case, will this action not create a duplicate?
Table 4: Part of speech (POS) values
Using POS values can help to fine-tune terminology comparisons. For instance, proper nouns are normally (although not necessarily) expected to be encoded in upper case, while nouns, verbs and adjectives are expected to be encoded in lower case. Occurrences of the same term with different part of speech values are not considered duplicates. For instance: "Virtual Memory" (proper noun) and "virtual memory" (noun) could be safely added to the MT dictionary.
When POS values are not used as part of the comparison criteria, misleading results are likely to occur.
Obviously, this comparison criterion assumes that all the candidate terms for addition to the MT dictionary, as well as the existing terms in the dictionary, have the correct spelling and correct standard POS values (e.g. TBX or OLIF). This might require a little bit of manual "cleaning" before hand.
Table 5 shows some scenarios of possible false positives if a "match whole word" option is not enabled prior to the comparison.
Table 5: Whole word matching
Some terms surrounded by "noise" (e.g. linguistic clues, trade mark sign) may need to be added to the MT dictionary even if the same term already exists without noise. In many cases, they do not need to be treated as duplicates necessarily; otherwise they will not be recognized by the MT engine (e.g. Linux, Linux(tm)). In these cases, what really matters is that the translation remains consistent and correct. To check this, any existing "noise" may need to be temporarily removed before running the comparison analysis.
Table 8: "Noise" (linguistic clues, punctuation, ampersand, tags, TM, variables)
Table 6: semantic domain values
Table 7: Leading and trailing blank spaces
Table 9: translations
Excel spreadsheets as terminology tools companion
So far, this article has discussed why rule-based MT needs terminology controls focused on terminology comparisons using appropriate criteria based on the relevant MT terminology task. However, carrying out these tasks manually is tedious, time-consuming and prone to errors.
This raises one more question: what is the best technology environment to run good terminology comparisons?
While it is true that a number of well-known terminology tools have improved significantly providing online storage and management capabilities, the comparison functionality described in this article is still very limited in many tools. In addition, there is also the lack of flexibility to customize and automate the terminology tasks required by each situation. Also, practice shows that terminology stored in powerful terminology repositories often ends up being exported to text files that are subsequently opened in spreadsheets for evaluation or even usability purposes.
This is the context in which Microsoft Excel spreadsheets can be of great assistance to terminology tools, not necessarily as storage repositories but as terminology workbenches for analysis and customization purposes.
In Excel, users can easily record macros and develop them further in the Excel Visual Basic (VB) editor according to their own needs (e.g. T-Manager Terminology Tool).
For instance, the VB editor allows users to instruct a macro to compare all the cells in a dictionary or glossary in a spreadsheet and flag each duplicated term using the cell next to each duplicate. Then the user can start adding more macros and features such as a drop-down list in one of the cells in the spreadsheet to instruct the macro to use the required comparison criterion (e.g. case-sensitive or case-insensitive).
Obviously, apart from a bit of creativity, some previous knowledge of VB will be also needed, but there are plenty of online tutorials and this can be easily learned as the user goes along.
Good-quality and controlled terminology is critical for the success of rule-based Machine Translation. Key to this is to automatically identify duplicates and translation inconsistencies, as well as deprecated translations within an MT dictionary and across multiple terminology tools. This requires making sure that the necessary terminology comparisons are made using appropriate comparison criteria. As a result, MT outputs will require less post-editing, and MT translation costs will be reduced. Finally, while many current terminology tools provide little or no comparison functionality and flexibility, Excel macros and spreadsheets can help to provide these.
The author thanks Dr Johann Roturier and Fabio Fantino for discussion and useful comments.
ClientSide News. 2006, A New Mid-level Solution for Terminology Management. [Online]. Available at: http://www.lexicool.com/lingo4-terminology-management-article.asp?IL=1 [accessed 20 January 2008].
Guzman, R., 2007. T-Manager Terminology Tool. [Online]. Available at http://www.invenis.net/resources/tw/index.php (Accessed 18 February 2009].
Guzman, R. 2008. Advanced automatic MT post-editing. Multilingual, 9 (3), p.52-57.
Itagaki, M., Aikawa , T, and He, X., 2007. Automatic Validation of Terminology Translation Consistency with Statistical Method.[Online]. Available at: http://research.microsoft.com/en-us/um/people/xiaohe/publication/mtsummit-2007.pdf [accessed 2 February 2009].
Lionbridge, Standardizing Business Terminology through Localization Management. [Online]. Available at http://www.lionbridge.com/lionbridge/en-US/services/localization-translation/terminology-management.htm [Accessed 3 February 2009).
Warburton, K., 2005, Terminology: Getting Down to Business, The Globalization Insider [online]. Available at http://www.lisa.org/globalizationinsider/2005/07/terminology_get.html [accessed 26 January 2006].
Wittner, J., 2007. Unexpected ROI from terminology. MultiLingual, (3), p.51-54.
Published - June 2009
Please see some ads as well as other content from TranslationDirectory.com: