Specialized Monolingual Corpora in Translation
By
Maryam Mohammadi Dehcheshmeh,
English Language Department, Faculty of Humanities,
Shahrekord University, P. O. Box 115,
Shahrekord, Iran
Abstract
In
the new world of technology, the translation profession,
like other disciplines, cannot be deprived of modern
tools such as electronic corpora. Recently, large
monolingual, comparable and parallel corpora have
played a crucial role in solving various problems
of linguistics, including translation. In this study
we shall attempt to show the effectiveness of a specialized
monolingual corpus in translating various collocations
usually found in political texts from English into
Persian. The experiment compares the accuracy of collocation
translation using a specialized monolingual corpus with that
achieved using conventional resources (e.g. monolingual
and bilingual dictionaries). The results show
how the quality of translation can be improved using
corpus-based translation tools.
1. Introduction
In recent years computers have increasingly
found their way into different branches of science,
including the humanities. Language studies are no exception
in this respect. In this new world of technology,
the translation profession, like other disciplines,
cannot be deprived of modern tools such as electronic
corpora. Constructing and exploiting different
types of corpora are among the computer applications
available to researchers in various language fields.
Recently, large monolingual, comparable and parallel
corpora have played a crucial role in solving various
problems of linguistics such as language learning
and teaching (Aston, 2000; Leech, 1997; Nesselhauf,
2004), translation studies (Mosavi Miangah, 2006),
information retrieval (Braschler & Schauble,
2000), statistical machine translation (Brown et al.,
1990) and the like. In this study, we shall attempt
to show the effectiveness of a specialized monolingual
corpus in translating various collocations usually
found in political texts from English into Persian.
The experiment compares the accuracy of collocation
translation using a specialized monolingual corpus
with that achieved using conventional resources (e.g.
monolingual and bilingual dictionaries). The results show
how the quality of translation can be improved using
corpus-based translation tools.
2. What Is a Corpus?
Generally, a corpus can be defined
as a collection of naturally occurring examples of
language. A corpus contains no new information about
language, but it gives linguistic research new
perspectives and supports activities such as
language learning, language teaching, and
translation.
Depending on the purpose and the form,
different types of corpora may be distinguished.
2.1. Specialized corpus
A specialized corpus is a corpus that
includes texts of a particular type. The specialization
has no definite boundaries, but some criteria that
specify the type of text in question should be
considered. Such corpora may be restricted to
a particular timeframe (e.g. texts from 1822 to 1876),
a particular subject (art, politics, medicine),
or some other factor. Some famous LSP (Language
for Special Purposes) corpora are the 5-million-word
Cambridge and Nottingham Corpus of Discourse in English
(CANCODE) and the Michigan Corpus of Academic Spoken
English (MICASE).
2.2. General corpus
This is a type of corpus that includes
various types of texts, either written or spoken,
on a variety of subjects. It is sometimes called a "reference
corpus" because it serves as reference material
for language learning, translation, etc. Some of the
best-known general corpora are the 100-million-word
British National Corpus (BNC) and the 400-million-word
Bank of English.
2.3. Comparable corpus
A comparable corpus consists of texts of the
same type and content in different languages or language
varieties, e.g. legal contracts in English and French, or
articles about linguistics from English and Persian journals.
The ICE corpus (International Corpus of English) is
a one-million-word comparable corpus of different
varieties of English.
2.4. Parallel corpus
Parallel corpora consist
of texts together with their translations into two or more
languages, e.g. a medical article translated into Spanish,
Finnish, and French. They can be of great help to
translators and learners in searching for equivalent
expressions in each language and in investigating
the differences between languages.
2.5. Learner corpus
A collection of texts (essays,
for example) produced by learners of a language
(Hunston, 2006). Such a corpus is compiled to help
identify the differences between texts produced by
learners and texts produced by native speakers.
The International Corpus of Learner English (ICLE),
with 20,000 words, and the Louvain Corpus of Native English
Essays (LOCNESS) are examples of the numerous well-known
corpora in this area.
2.6. Pedagogic corpus
A pedagogic corpus is a corpus consisting
of all the texts to which a learner has been exposed
(Hunston, 2006). A pedagogic corpus collected by a teacher
or researcher may consist of all the course books, readers,
etc. used by a learner, together with the tapes the learner
has listened to. It thus includes all the instances of a word
or phrase that learners have encountered in different
contexts, and it can be used to improve their knowledge
of the language.
2.7. Historical and diachronic corpus
This is a corpus which includes texts
belonging to various periods of time, to show the
development of language over a specified timeframe.
The most famous English historical corpus is the Helsinki
Corpus, with 1.5 million words.
2.8. Monitor corpus
This is a corpus of texts of the same type
that is extended annually, monthly, or even daily
in order to trace changes in the language.
The texts of one year (month or day) can thus be compared
to those of another, similar, period.
Different types of corpora may be
annotated differently, according to the needs
of the researchers. Types of information that can be
encoded in a corpus and that are useful in translation
tasks include parts of speech (POS), syntactic structure
(parsing), word senses, and anaphoric relations.
3. Related Work
In recent years, the importance of
corpora in the field of translation has become apparent
to trainers and researchers, and some researchers
believe that corpus analysis should be integrated
into translator education. There have been a number
of studies on monolingual corpora (general and specialized)
and on ways of exploiting such corpora,
such as the extraction of collocations.
The website "Gateway to corpus linguistics
on the Internet" at http://www.corpus-linguistics.de/
is a useful reference for information about
many of the best-known corpora and their features, such
as size, content, and accessibility, as well
as when and by whom they were compiled.
Most recent research on translation
knowledge acquisition is based on parallel corpora
(Brown et al., 1993). However, since large aligned bilingual
corpora are hard to obtain, some researchers have tried
to extract translation knowledge from non-parallel
corpora such as comparable or monolingual
corpora. One of the best-known large-scale monolingual
corpora is the British National Corpus (BNC), a
100-million-word collection of samples of written and
spoken language from a wide range of sources. Despite
its size, however, the BNC has serious limitations
as a translation aid for contemporary
specialized texts (Wilkinson, 2006).
In a pilot experiment, Bowker (1998)
found that learners using a specialized corpus of
texts in the target language (their L1) chose correct
terms more often and wrote more idiomatically than a matched
group using bilingual dictionaries alone. In that study,
Bowker concluded that a specialized monolingual native-language
corpus helps translators improve on two of the most
important criteria for producing high-quality translation:
subject-field understanding and specialized native-language
competence (Bowker, 1998).
Bowker & Pearson (2002) report
an instructive experiment on exploiting such monolingual
corpora in translating texts on mechanical engineering. They
investigated the term "nut" and its various
collocations in the 100-million-word BNC and
found 670 occurrences of the term. However,
most of the concordance lines were not helpful, since most
contexts showed "nut" used with other
meanings, such as a food or an eccentric person. Although
some of the occurrences describe the type of nut
used in engineering, it takes time to identify them;
there is excessive "noise" because "nut"
is a homonym (it has various meanings), and
so separating the wheat from the chaff is a time-consuming
process.
Bowker & Pearson go on to report
that a search for the term "nut" in a 10,000-word
corpus containing catalogues, product descriptions
and assembly instructions from companies in the manufacturing
industry generated 49 occurrences. Although this was
far fewer than in the BNC search, the findings were far
more relevant: the noise was considerably reduced,
and it was easy to spot the many different types of
"nut" used in manufacturing (e.g. collar nut, compression
nut, flare nut, knurled nut, winged nut), as well
as the verbs that collocate with "nut" (e.g. thread,
screw, tighten, loosen).
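The kind of concordance search described above can be sketched in a few lines of Python. This is only a minimal illustration, not the tool Bowker & Pearson used: the function name kwic, the window size, and the three sample sentences are assumptions made for this example.

```python
import re

def kwic(text, keyword, window=3):
    """Return (left context, keyword, right context) for each hit of keyword."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines

# Hypothetical sample sentences, not taken from any real corpus.
sample = (
    "Tighten the collar nut with a wrench. "
    "Loosen the flare nut before removing the pipe. "
    "He ate a nut from the bowl."
)

for left, hit, right in kwic(sample, "nut"):
    print(f"{left:>25} [{hit}] {right}")
```

In a general corpus many lines would resemble the last one (the food sense), whereas in a specialized corpus most lines would resemble the first two, which is the reduction in noise reported above.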
Thus, the role of specialized corpora
in translating different types of texts becomes more
prominent. Such corpora, which are restricted
to the language of a particular specialized field
and focus on Language for Special Purposes, are sometimes
referred to as LSP corpora (Wilkinson, 2006).
4. Compiling and Exploiting Specialized Monolingual Corpora
Nowadays, specialized corpora play
a crucial role in translation. Because ready-made
LSP corpora are often unavailable, translators
can construct their own specialized corpora. To this
end, we compiled a specialized monolingual
corpus of Persian texts in the field of politics, consisting
of over 5 million words (about 150 MB). The texts are
mainly extracted from political articles, journals,
interviews, etc. found on the Internet and are preprocessed
before being entered into the corpus: all tables,
pictures, figures, and diagrams are deleted, and the
texts are converted to an XML format suitable
for use on Internet sites. At this stage the texts
can be entered into the corpus and used by translators
translating political texts from English into
Persian. At present, the Persian monolingual corpus
is freely available from the following URL:
www.persiancorpus.com
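The conversion step can be pictured with a short Python sketch. The article does not describe the XML schema actually used, so the element names (document, paragraph), the attributes (source, lang), and the function name text_to_xml are all assumptions made for illustration; the input is assumed to be text from which tables and figures have already been removed.

```python
import xml.etree.ElementTree as ET

def text_to_xml(raw_text, source):
    """Wrap the non-empty paragraphs of a cleaned text in a simple XML tree."""
    # Hypothetical schema: one <document> holding <paragraph> children.
    doc = ET.Element("document", attrib={"source": source, "lang": "fa"})
    for block in raw_text.split("\n\n"):
        paragraph = " ".join(block.split())  # collapse line breaks and extra spaces
        if paragraph:
            ET.SubElement(doc, "paragraph").text = paragraph
    return doc

# Hypothetical input: two short phrases separated by a blank line.
raw = "مذاكرات هسته اي\n\nگسترش\nروابط"
doc = text_to_xml(raw, source="example-article")
print(ET.tostring(doc, encoding="unicode"))
```

Keeping one element per paragraph makes it easy to add or remove whole texts later, which matters if the corpus is to stay open.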
Since the concepts and terms of a particular field
evolve constantly, the corpus needs to remain open so
that texts can be added or removed when required. As
noted in the definition of a corpus, corpora by
themselves are nothing more than collections of examples
of language. Used alongside other tools, however, they
become invaluable and find their place in the translation task.
4.1. Two main applications of the corpora in translation
Here, two
applications of specialized corpora are introduced
to describe their role in producing a high-quality
translation.
4.1.1. Translating Collocations
Using a monolingual corpus in the field
of politics (about 5 million words), we
searched for collocations that translators frequently
encounter. For comparison, we also consulted a bilingual
English-Persian dictionary (Aryanpur). Consider the noun
phrase "pre-emptive war." We first turned to a
conventional resource, the bilingual dictionary,
which, as expected, does not list the collocation
as a separate entry. It does, however, suggest equivalents
for the two components of the collocation.
For the word "pre-emptive" it gives three equivalents:
پيش دستانه, پيش گيرانه and بازدارنده.
For the word "war" it gives only جنگ. We then turned to our corpus
and found no occurrences of جنگ پيش گيرانه,
no occurrences of جنگ پيش دستانه, and 14 occurrences of جنگ بازدارنده.
We therefore selected the third equivalent as the most
probable translation, since it is the only one attested in
the corpus. In the same way, the corpus can help us find
the most probable translation of other collocations.
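The selection procedure used for "pre-emptive war" amounts to counting each candidate phrase in the corpus and keeping the most frequent one. A minimal Python sketch follows; the function names are hypothetical, the three-phrase string merely stands in for the real corpus, and plain substring counting is an assumption (a real search would probably also need tokenization and orthographic normalization, which are omitted here).

```python
def count_occurrences(corpus_text, phrase):
    """Count non-overlapping occurrences of phrase in corpus_text."""
    return corpus_text.count(phrase)

def choose_equivalent(corpus_text, candidates):
    """Return (most frequent candidate, or None if none occurs; counts per candidate)."""
    counts = {c: count_occurrences(corpus_text, c) for c in candidates}
    best = max(counts, key=counts.get)
    return (best if counts[best] > 0 else None), counts

# Candidate translations of "pre-emptive war" built from the dictionary suggestions.
candidates = ["جنگ پيش گيرانه", "جنگ پيش دستانه", "جنگ بازدارنده"]

# Toy stand-in for the corpus: it contains the third candidate twice.
corpus_text = " . ".join([candidates[2], "گسترش روابط", candidates[2]])

best, counts = choose_equivalent(corpus_text, candidates)
print(best, counts)
```

Returning None when no candidate occurs keeps the sketch from presenting an unattested phrase as a corpus-backed choice.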
Similarly, we considered the more common collocation "increasing
relations," for which the dictionary suggests two equivalents of "increasing": "توسعه" and "گسترش."
A translator might see no difference between using "توسعه" and using "گسترش."
Yet the corpus contains 199 occurrences of "گسترش روابط" against 79 of "توسعه روابط."
So even when we believe that our decision
is right, the corpus can show that one option is clearly
preferred in actual usage. The following table gives
some further examples:
| Collocation | Dictionary suggestions | Occurrences in corpus | Corpus decision |
| --- | --- | --- | --- |
| Military Confrontation | 1. برخورد نظامي | 12 | برخورد نظامي |
| | 2. درگيري نظامي | 0 | |
| | 3. مقابله نظامي | 1 | |
| | 4. رويارويي نظامي | 2 | |
| | 5. مواجهه نظامي | 0 | |
| Nuclear Talks | 1. مذاكرات اتمي | 3 | مذاكرات هسته اي |
| | 2. مذاكرات هسته اي | 16 | |
| Slow Pace of Negotiations | 1. كندي روند مذاكرات | 11 | كندي روند مذاكرات |
| | 2. كندي پيشرفت مذاكرات | 2 | |
| Suspension of Uranium Enrichment | 1. توقف غني سازي اورانيوم | 5 | تعليق غني سازي اورانيوم |
| | 2. تعليق غني سازي اورانيوم | 16 | |
Table 1. Corpus decisions on dictionary-suggested
equivalents for selected collocations
4.1.2. Verifying or rejecting decisions based on other tools
Traditional translation tools such as dictionaries
often suggest more than one equivalent, and sometimes improper
ones; corpora offer an effective solution to this
problem. When you are in doubt about which of the
equivalents suggested by a dictionary to choose,
a corpus is a great tool for verifying or rejecting
the suggested translation(s). For "trade-off," for example,
the dictionary suggests the following equivalents:
"مبادله," "تهاتر," "پاياپاي كاري," "بده بستاني."
Their occurrences in the corpus are shown in the following
table.
| Collocation | Dictionary suggestions | Occurrences in corpus | Corpus decision |
| --- | --- | --- | --- |
| Trade-off | 1. پاياپاي كاري | 0 | مبادله |
| | 2. بده بستاني | 18 | |
| | 3. مبادله | 52 | |
| | 4. تهاتر | 0 | |
Table 2. Corpus decision
on "trade-off" equivalents
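The counts reported in Table 2 can also be read as a simple verify-or-reject report. The Python sketch below is one hypothetical way to do this: the labels "attested" and "unattested", the rule that zero occurrences means unattested, and the function name assess_candidates are assumptions made for this illustration rather than part of the study.

```python
def assess_candidates(counts):
    """Label each candidate and give its share of all corpus occurrences."""
    total = sum(counts.values())
    report = {}
    for candidate, n in counts.items():
        status = "attested" if n > 0 else "unattested"
        share = n / total if total else 0.0
        report[candidate] = (status, round(share, 2))
    return report

# Occurrence counts for the "trade-off" equivalents, as reported in Table 2.
table2_counts = {
    "پاياپاي كاري": 0,
    "بده بستاني": 18,
    "مبادله": 52,
    "تهاتر": 0,
}

for candidate, (status, share) in assess_candidates(table2_counts).items():
    print(candidate, status, share)
```

On these counts, two of the four dictionary suggestions never occur in the corpus, and the chosen equivalent accounts for 52 of the 70 attested occurrences.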
We can also use this strategy in translation criticism,
to evaluate the naturalness of a translation. For
the word "confidence" in the phrase "confidence-building,"
the Aryanpur dictionary suggests two equivalents:
"اعتماد" and "اطمينان."
Because these two words are so similar and both are highly
frequent in Persian, even a native speaker finds it hard
to choose between the two translations
"اعتماد سازي" and "اطمينان سازي."
But when we search for them in our corpus of political
texts, we find, surprisingly, no occurrences of
"اطمينان سازي" and 18 occurrences of "اعتماد سازي."
According to Larson, effective translation requires discovering the
meaning of the source language and using receptor-language
forms that express this meaning in a natural way
(Larson, 1998). So, in addition to other conventional
translation tools, a translator should use corpora
to become more certain that his or her choice is a proper
and natural one.
As the explanations above show, corpora can be of great
help in finding suitable collocates and in verifying
or rejecting the translations suggested by dictionaries.
Varantola reports a general comment made by her
students about corpus evidence: "This evidence
helps translators to be less bound to the source material
and feel much more confident when deviating from the
way things are expressed in the source material if
they feel that the changes are justified." (Varantola,
2003, p. 67).
5. Conclusion
Large monolingual and bilingual electronic
corpora have only recently become available to translators,
giving them access to more precise, natural, and
up-to-date information about the senses of words and
collocations than before.
Open parallel corpora can play a major role
in resolving various translation problems. Unfortunately,
these invaluable tools have not been widely used by translators
in Iran. One reason may be
that translators were not exposed to the potential
of corpus analysis tools during their college education.
The unavailability of ready-made special-field corpora
may be another. We therefore set out
to describe the effective applications of a specialized
monolingual corpus of Persian in the sensitive task
of translating political texts.
We hope to expand this study to cover experiments dealing with other
subject fields such as medicine, sports, business,
religion, literature, and the like. We also suggest
that such experiments be performed with other
language pairs to see whether more definitive conclusions
about the effect of monolingual corpora on the
translator's work can be drawn.
References
Aryanpur, A. and Aryanpur, M. (1991). English-Persian Collegiate Dictionary (9th ed.). Tehran, Iran: Amir-Kabir Publication Organization.
Aston, G. (2000). I corpora come risorse per la traduzione e l'apprendimento. In S. Bernardini and F. Zanettin (eds.), I corpora nella didattica della traduzione. Bologna: CLUEB, pp. 21-29.
Bowker, L. (1998). Using specialized monolingual native-language corpora as a translation resource: a pilot study. Meta, 43(4), pp. 631-651.
Bowker, L. and Pearson, J. (2002). Working with Specialized Language: A practical guide to using corpora. London: Routledge, xiv + 242 pp.
Braschler, M. and Schauble, P. (2000). Using corpus-based approaches in a system for multilingual information retrieval. Information Retrieval, 3, pp. 273-284.
Brown, P., Cocke, J., Della Pietra, S., Della Pietra, V., Jelinek, F., Lafferty, J., Mercer, R. and Roossin, P. (1990). A statistical approach to machine translation. Computational Linguistics, 16(2), pp. 79-85.
Brown, P. F., Della Pietra, S. A., Della Pietra, V. J. and Mercer, R. L. (1993). The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2), pp. 263-313.
Larson, M. L. (1998). Meaning-based translation: A guide to cross-language equivalence. Lanham, MD: University Press of America and Summer Institute of Linguistics.
Leech, G. (1997). Teaching and language corpora: A convergence. In A. Wichmann, S. Fligelstone, T. McEnery and G. Knowles (eds.), Teaching and language corpora. New York: Addison Wesley Longman, pp. 1-23.
Mosavi Miangah, T. (2006). Applications of corpora in translation. Translation Studies, 12, pp. 43-56.
Nesselhauf, N. (2004). Learner corpora and their potential for language teaching. In J. McH. Sinclair (ed.), How to use corpora in language teaching. Amsterdam: Benjamins, pp. 125-152.
Varantola, K. (2003). Translators and disposable corpora. In F. Zanettin, S. Bernardini and D. Stewart (eds.), Corpora in Translator Education. Manchester: St Jerome, pp. 55-70.
Wilkinson, M. (2006). Compiling corpora for use as translation resources. Translation Journal, 10(1).