Specialized Monolingual Corpora in Translation

By Maryam Mohammadi Dehcheshmeh,
English Language Department, Faculty of Humanities,
Shahrekord University, P. O. Box 115,
Shahrekord, Iran

In the new world of technology, the translation profession, like other disciplines, cannot be deprived of modern tools such as electronic corpora. Recently, large monolingual, comparable and parallel corpora have played a crucial role in solving various problems of linguistics, including translation. In this study we attempt to show the effectiveness of a specialized monolingual corpus in translating various collocations usually found in political texts from English into Persian. The experiment compares the accuracy of collocation translation using a specialized monolingual corpus with that obtained from conventional resources (e.g. monolingual as well as bilingual dictionaries). The results show how the quality of translation can be improved using corpus-based translation tools.

In recent years computers have increasingly found their way into different branches of science, including the humanities. Language studies are no exception in this respect. In this new world of technology, the translation profession, like other disciplines, cannot be deprived of modern tools such as electronic corpora. Constructing and exploiting different types of corpora are among the computer applications available to researchers in various language fields. Recently, large monolingual, comparable and parallel corpora have played a crucial role in solving various problems of linguistics such as language learning and teaching (Aston, 2000; Leech, 1997; Nesselhauf, 2004), translation studies (Mosavi Miangah, 2006), information retrieval (Braschler & Schauble, 2000), statistical machine translation (Brown et al., 1990) and the like. In this study, we attempt to show the effectiveness of a specialized monolingual corpus in translating various collocations usually found in political texts from English into Persian. The experiment compares the accuracy of collocation translation using a specialized monolingual corpus with that obtained from conventional resources (e.g. monolingual as well as bilingual dictionaries). The results show how the quality of translation can be improved using corpus-based translation tools.

Generally, a corpus can be defined as a collection of naturally occurring examples of language. A corpus includes no new information about language, but it gives new perspectives to linguistic research and helps in the development of different processes such as language learning and teaching and translation.

Depending on the purpose and the form, different types of corpora may be distinguished.

A specialized corpus is a corpus which includes a particular type of text. This specialization has no definite boundaries, but some criteria that specify the type of text in question should be considered. Such corpora may contain texts specialized in terms of a particular timeframe (texts from 1822 to 1876), a particular subject (art, politics, medicine) or some other factor. Some famous LSP (Language for Special Purposes) corpora are the 5-million-word Cambridge and Nottingham Corpus of Discourse in English (CANCODE) and the Michigan Corpus of Academic Spoken English (MICASE).

A general corpus is a type of corpus which includes various types of texts, either written or spoken, on a variety of subjects. It is sometimes called a "reference corpus" because of its function as reference material for language learning, translation, etc. Some of the best-known general corpora are the 100-million-word British National Corpus (BNC) and the 400-million-word Bank of English.

A comparable corpus is a corpus consisting of texts of the same type and content in different languages (e.g. legal contracts in English and French, or articles about linguistics from English and Persian journals). The ICE corpus (International Corpus of English) is a one-million-word comparable corpus of different varieties of English.

Parallel corpora are those consisting of texts with their translations into two or more languages, e.g. a medical article translated into Spanish, Finnish, and French. They can be of great help to translators and learners in searching for equivalent expressions in each language and in investigating the differences between languages.

A learner corpus is a collection of texts (essays, for example) produced by learners of a language (Hunston, S. 2006). Such a corpus is prepared to help find the differences between texts produced by learners and texts produced by native speakers. The International Corpus of Learner English (ICLE) with 20,000 words and the Louvain Corpus of Native English Essays (LOCNESS) are examples of the numerous well-known learner corpora.

A pedagogic corpus is a corpus consisting of all the texts to which a learner has been exposed (Hunston, S. 2006). A pedagogic corpus collected by a teacher or researcher may consist of all the course books, readers, etc. used by a learner and the tapes they have listened to. It includes all instances of a word or phrase that learners encounter in different contexts and so helps to improve their knowledge of the language.

A historical corpus is a corpus which includes texts belonging to various periods of time, in order to show the development of language over a specified timeframe. The most famous English historical corpus is the Helsinki Corpus, with 1.5 million words.

A monitor corpus is a corpus which consists of texts of the same type and is designed to trace changes in the language by being added to annually, monthly, or even daily. The texts of one year (month or day) can thus be compared to those of another, similar, period.

Different types of corpora may be annotated differently in accordance with the needs of the researchers. Some types of information which are encoded in a corpus and are useful in translation tasks are parts of speech (POS), syntactic structure, parsing, word senses, and anaphoric relations.

In recent years, the importance of corpora in the field of translation has become noticeable to trainers and researchers. Therefore, some researchers believe that the analysis of corpora should be integrated into translator education. There have been a number of studies on monolingual corpora (general and specialized) and various kinds of exploitation of such corpora like extraction of collocations.

The website "Gateway to corpus linguistics on the Internet" at http://www.corpus-linguistics.de/ is a useful reference for obtaining information about many of the best-known corpora and their features, such as their size, content, and accessibility, as well as when and by whom they were compiled.

Most of the latest research in translation knowledge acquisition is based on parallel corpora (Brown et al., 1993). However, since large aligned bilingual corpora are hard to obtain, some researchers have tried to extract translation knowledge from non-parallel corpora such as comparable corpora or monolingual corpora. One of the best-known large-scale monolingual corpora is the British National Corpus (BNC), a 100-million-word collection of samples of written and spoken language from a wide range of sources. However, despite its large size, the BNC has serious limitations as a translation aid for anyone translating contemporary specialized text (Wilkinson, M. 2006).

In a pilot experiment, Bowker (1998) found that learners using a specialized corpus of texts in the target language (their L1) showed better term choice and greater idiomaticity than a matched group using bilingual dictionaries alone. In that study, Bowker determined that a specialized monolingual native-language corpus helps translators to improve on two of the most important criteria for producing high-quality translation: subject-field understanding and specialized native-language competence (Bowker, L. 1998).

Bowker & Pearson (2002) provide a good experiment on exploiting such monolingual corpora in translating texts on mechanical engineering. They investigate the term "nut" and its various collocations in the 100-million-word BNC, where they found 670 occurrences of the term. However, they found most of the concordance lines unhelpful, since most of the contexts show "nut" being used in other meanings, such as a food or an eccentric person. Although some of the occurrences describe the type of nuts used in engineering, it takes time to identify them; there is excessive "noise" because "nut" is a homonym (it has various meanings), and so separating the wheat from the chaff is a time-consuming process.

Bowker & Pearson go on to report that a search for the term "nut" in a 10,000-word corpus containing catalogues, product descriptions and assembly instructions from companies in the manufacturing industry generated 49 occurrences. Although this was far fewer than the BNC search, the findings were far more relevant, since the noise was considerably reduced, and it was easy to spot the many different types of "nut" used in manufacturing (e.g. collar nut, compression nut, flare nut, knurled nut, winged nut), as well as the verbs that collocate with nut (e.g. thread, screw, tighten, loosen).
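The kind of search described here (listing every occurrence of a term in context and noting the words that accompany it) can be sketched in a few lines of Python. This is only an illustrative sketch, not the tool used by Bowker & Pearson: the function names `concordance` and `left_collocates` and the three-sentence `SAMPLE` text are assumptions made up for the example.

```python
import re
from collections import Counter

# Hypothetical toy text standing in for a small specialized corpus.
SAMPLE = (
    "Tighten the collar nut by hand. "
    "Loosen the flare nut before removing the pipe. "
    "Screw the knurled nut onto the bolt."
)

def concordance(text, term, width=30):
    """Return keyword-in-context lines for each whole-word match of `term`."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keyword forms a column.
        lines.append(f"{left:>{width}}[{match.group(0)}]{right}")
    return lines

def left_collocates(text, term):
    """Count the word immediately preceding each occurrence of `term`."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(
        tokens[i - 1] for i, tok in enumerate(tokens) if tok == term and i > 0
    )

if __name__ == "__main__":
    for line in concordance(SAMPLE, "nut"):
        print(line)
    print(left_collocates(SAMPLE, "nut").most_common())
```

In a small specialized corpus nearly every line returned this way is relevant, which is the point of the comparison with the BNC search.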

Thus, the role of specialized corpora in translating different types of texts becomes more prominent. Such specialized corpora, which are restricted to the language of a particular specialized field and focus on Language for Special Purposes, are sometimes referred to as LSP corpora (Wilkinson, M. 2006).

Nowadays, specialized corpora play a crucial role in translation. However, since ready-made LSP corpora are often unavailable, translators can construct their own specialized corpora. To this end, we compiled a specialized monolingual corpus of Persian texts in the field of politics consisting of over 5 million words (about 150 MB). These texts are mainly extracted from political articles, journals, interviews, etc. found on the Internet and are preprocessed before being entered into the corpus. That is, all tables, pictures, figures and diagrams are deleted from the texts. The texts are then converted to an XML format so that they are suitable for use on Internet sites. At this stage the texts can be entered into the corpus and used by translators translating political texts from English into Persian. At present, the Persian monolingual corpus is freely available from the following URL:

Since concepts and terms within a particular field are evolving constantly, the corpus needs to be open so that texts can be added or removed when required. As mentioned in the definition of a corpus, corpora by themselves are nothing more than collections of examples of language. Used alongside other tools, however, they become invaluable and find their place in the translation task.
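The preprocessing described above (dropping tables, pictures and figures from texts found on the Internet, then converting what remains to XML) might look roughly like the following Python sketch. It is not the procedure actually used to build the corpus: the assumption that the source texts are HTML pages, the set of skipped tags, and the `<text>`/`<p>` element names are all hypothetical choices made for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumption: these HTML elements carry the non-text material to be removed.
SKIPPED_TAGS = {"table", "figure", "script", "style"}

class TextExtractor(HTMLParser):
    """Collect text chunks, ignoring anything nested inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Images have no text content, so they drop out automatically.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def html_to_corpus_xml(html, source):
    """Return a small XML document holding only the running text of `html`."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    root = ET.Element("text", {"source": source})  # hypothetical schema
    for chunk in extractor.chunks:
        ET.SubElement(root, "p").text = chunk
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    page = (
        "<html><body><p>مذاكرات هسته اي</p>"
        "<table><tr><td>cell</td></tr></table>"
        "<img src='chart.png'><p>Second paragraph.</p></body></html>"
    )
    print(html_to_corpus_xml(page, "demo"))
```

Each text chunk becomes its own `<p>` element here, so inline markup such as bold text would split a paragraph; a fuller version would need to handle that.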

Here, two applications of specialized corpora are introduced to describe their role in producing a high-quality translation.

Referring to a monolingual corpus in the field of politics (containing about 5 million words), we search for different collocations which translators frequently encounter. We also use a bilingual dictionary (Aryanpur, English to Persian) in order to compare a bilingual English-Persian dictionary with a monolingual Persian corpus. Consider the noun phrase "pre-emptive war." First, we refer to a conventional resource such as a bilingual dictionary, in which, naturally, the collocation does not appear as an individual entry. However, suggested equivalents for the two components of the collocation can be found. For the word "pre-emptive" we found three suggested equivalents: پيش دستانه، پيش گيرانه and بازدارنده. For the word "war" only جنگ is suggested. Then we turned to our corpus and found no occurrences of جنگ پيش گيرانه, no occurrences of جنگ پيش دستانه, and 14 occurrences of جنگ بازدارنده. So we selected the third equivalent as the most probable translation of this collocation because of its higher frequency in the corpus. In the same way, the corpus can help us obtain the most probable translation of other collocations, too.
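The selection procedure just described (combine the dictionary's suggestions into candidate phrases, count each candidate in the corpus, keep the most frequent) can be sketched as follows. The sketch is an illustration under stated assumptions rather than the software used in this study: `choose_equivalent` is a hypothetical helper, and `TOY_CORPUS` is a made-up stand-in that simply repeats one phrase so that its counts mirror the figures reported above.

```python
def choose_equivalent(corpus_text, head, modifiers):
    """Count each "head modifier" candidate and return (best, counts).

    `best` is None when no candidate occurs in the corpus at all.
    """
    counts = {}
    for modifier in modifiers:
        candidate = f"{head} {modifier}"  # Persian order: noun, then adjective
        counts[candidate] = corpus_text.count(candidate)
    best = max(counts, key=counts.get)
    return (best if counts[best] > 0 else None), counts

# Toy stand-in for the political corpus: 14 repetitions of one phrase.
TOY_CORPUS = " . ".join(["جنگ بازدارنده"] * 14)

if __name__ == "__main__":
    best, counts = choose_equivalent(
        TOY_CORPUS,
        "جنگ",
        ["پيش دستانه", "پيش گيرانه", "بازدارنده"],
    )
    for candidate, n in counts.items():
        print(candidate, n)
    print("Corpus decision:", best)
```

Plain substring counting is the simplest possible choice. On real Persian text one would probably also need to normalize spelling variants (for example Arabic versus Persian forms of the letters yeh and kaf) before counting, since otherwise valid occurrences may be missed.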

In a parallel step, we considered the more common collocation "increasing relations," and found "توسعه" and "گسترش" suggested by the dictionary as two equivalents for "increasing." To a translator it may seem to make no difference whether "توسعه" or "گسترش" is used. But it is striking to find 199 occurrences of "گسترش روابط" and only 79 of "توسعه روابط." Thus, even when we think that our decision is right, the corpus can change the picture and reveal which choice is actually preferred. The following table gives some other examples:

Collocation	Dictionary Suggestions	Occurrences in Corpus	Corpus decision
Military Confrontation	1. برخورد نظامي 2. درگيري نظامي 3. مقابله نظامي 4. رويارويي نظام 5. مواجهه نظامي	12 0 1 2 0	برخورد نظامي
Nuclear Talks	1.مذاكرات اتمي 2.مذاكرات هسته اي	3 16	مذاكرات هسته اي
Slow Pace of Negotiations	1. كندي روند مذاكرات 2. كندي پيشرفت مذاكرات	11 2	كندي روند مذاكرات
Suspension of Uranium Enrichment	1.توقف غني سازي اورانيوم 2.تعليق غني سازي اورانيوم	5 16	تعليق غني سازي اورانيوم

Table 1. Corpus decision on certain collocations' equivalents suggested by dictionary

While traditional translation tools (such as dictionaries) suggest more than one equivalent, and sometimes improper ones, corpora offer an effective solution to these problems. When you are in doubt about which of the equivalents suggested by the dictionary to choose, corpora are great tools for verifying or rejecting the suggested translation(s). A number of equivalents for "trade-off" suggested by the dictionary are as follows: "مبادله," "تهاتر," "پاياپاي كاري," "بده بستاني." The occurrences are illustrated in the following table.

Collocation	Dictionary Suggestions	Occurrences in Corpus	Corpus decision
Trade-off	1. پاياپاي كاري 2. بده بستاني 3. مبادله 4. تهاتر		مبادله

We can also use this strategy in translation criticism, to evaluate the naturalness of a translation. For the word "confidence" in the phrase "confidence-building" the Aryanpur dictionary suggests two equivalents, "اعتماد" and "اطمينان." Because these two words are very similar and both are highly frequent in Persian, it is hard even for a native speaker to choose between the two translations "اعتماد سازي" and "اطمينان سازي." But when they are searched for in our corpus of political texts, it is surprising to find no occurrences of "اطمينان سازي" and 18 occurrences of "اعتماد سازي."

According to Larson, to translate effectively one must discover the meaning of the source language and use receptor language forms which express this meaning in a natural way (Larson, M. 1984). So, in addition to other conventional translation tools, a translator should use corpora to become more certain that his/her choice is a proper and natural one. As the explanations above show, corpora can be of great help in finding suitable collocates and in verifying or rejecting the translations suggested by dictionaries. Varantola reports her students' general comment about corpus evidence: "This evidence helps translators to be less bound to the source material and feel much more confident when deviating from the way things are expressed in the source material if they feel that the changes are justified." (Varantola, 2003, p. 67).

Large monolingual as well as bilingual electronic corpora have only recently become available to translators, and this is a good opportunity for them to obtain more precise, natural, and up-to-date information about the senses of words and collocations than before. Open parallel corpora can play their greatest role in resolving different translation problems. Unfortunately, this invaluable tool has not been widely used by translators in Iran. This may be because they have not been exposed to the potential of corpus analysis tools during their college education. The unavailability of ready-made special-field corpora may be another reason. We therefore decided to describe the effective applications of a specialized monolingual corpus of Persian in the sensitive task of translating political texts.

We hope to expand this study to cover experiments dealing with other subject fields such as medicine, sports, business, religion, literature, and the like. It is suggested that such experiments be also performed with other language pairs to see if more definitive conclusions in terms of the effect of monolingual corpora on the translator's work can be drawn.

Aryanpur, A. and Aryanpur, M. (1991). English-Persian Collegiate Dictionary. (Ninth Edition) Amir-Kabir Publication Organization, Tehran, Iran.

Aston, G. (2000). I corpora come risorse per la traduzione e l'apprendimento. In Silvia Bernardini and Federico Zanettin (eds.) I corpora nella didattica della traduzione. Bologna: CLUEB, 21-29.

Bowker, L. (1998). Using specialized monolingual native-language corpora as a translation resource: A pilot study. Meta, 43(4), 631-651.

Bowker, L. and Pearson, J. (2002). Working with Specialized Language: A practical guide to using corpora. London: Routledge, xiv + 242 pp.

Brown, P. F., Pietra, S. A. D., Pietra, V. J. D., and Mercer, R. L. (1993). The mathematics of machine translation: Parameter estimation. Computational Linguistics, 19(2), 263-313.

Braschler, M. and Schauble, P. (2000). Using corpus-based approaches in a system for multilingual information retrieval. Information Retrieval, 3, 273-284.

Brown, P., Cocke, J., Della Pietra, V., Della Pietra, S., Jelinek, F., Lafferty, J., Mercer, R. and Roossin, P. (1990). A statistical approach to machine translation. Computational Linguistics, 16(2), 79-85.

Larson, M. L. (1998). Meaning-based translation: A guide to cross-language equivalence. Lanham, MD: University Press of America and Summer Institute of Linguistics.

Leech, G. (1997). Teaching and language corpora: A convergence. In A. Wichmann, S. Fligelstone, T. McEnery and G. Knowles (eds.), Teaching and language corpora. New York: Addison Wesley Longman, 1-23.

Mosavi Miangah, T. (2006). Applications of corpora in translation. Translation Studies, 12, 43-56.

Nesselhauf, N. (2004). Learner corpora and their potential for language teaching. In J. McH. Sinclair (ed.), How to use corpora in language teaching. Amsterdam: Benjamins, 125-152.

Varantola, K. (2003). Translators and disposable corpora. In F. Zanettin, S. Bernardini and D. Stewart (eds.), Corpora in Translator Education. Manchester: St Jerome, 55-70.