The Comparable Corpus-Based Chinese-English Translation

The Comparable Corpus-Based Chinese-English Translation - A Case Study of City Introduction

By Guangsa Jin,
Peking University, China

Abstract

Since only a limited pool of qualified native English-speaking translators can do Chinese-English translation, it is inevitable that native Chinese-speaking translators translate out of their native language. Influenced by their mother tongue, Chinese translators often use awkward expressions that do not exist in English. This paper explores how a comparable corpus can be applied in Chinese-English translation to help native Chinese-speaking translators make their translated texts sound natural to native English speakers. To illustrate the point, a comparable corpus on the subject of city introduction is constructed. With the help of comparable corpus analysis tools, sentence length, lexical density, and other statistics that reflect the stylistic features of the translated texts are derived. It is argued that a comparable corpus, which provides examples of natural expressions in the target language, plays an irreplaceable role in terminology extraction and in spotting awkward collocations, and that it can also pick up small errors that non-native English-speaking translators often neglect, such as the usage of articles.

Introduction

In terms of prerequisites for translators, the ideal candidates are native speakers of the target language. This guideline is followed by many translation agencies serving international institutions. The Occupational Outlook Handbook of the U.S. Department of Labor also clearly states that the nature of translation is for translators to put their secondary, or passive, language into their native, or active, language. However, this is not the case in Chinese-English (C-E) translation. According to Xu Meijiang (2004), a senior translator in China's Central Translation Bureau, although some qualified native English-speaking translators are involved in C-E translating, editing or proofreading, large volumes of C-E translation are done by native Chinese-speaking translators alone. This situation will not change in the near future for two reasons: first, only a limited pool of qualified native English-speaking translators is available; second, the fees charged by native English-speaking translators are much higher than those of their Chinese counterparts. Statistics from Beijing Evening News (2007) indicate that 60% of the translation market demand cannot be met, and that China is in desperate need of qualified C-E translators. The present problem is therefore how to improve the quality of the C-E translation done by native Chinese-speaking translators. Corpora can be a helpful tool for them.

The development of computer technology and the Internet makes the comparable corpus-based approach accessible. In this approach, two subcorpora are to be constructed:

Subcorpus A—C-E translation done by Chinese translators;

Subcorpus B—English texts on the same subject written by native English speakers.

First, since computers are widely used in translators' everyday work, electronic translation texts are available, which enables the construction of Subcorpus A. Second, the Internet provides a huge archive of texts written by native English speakers, storing the most recently updated language and information on various subjects and making the construction of Subcorpus B easier than ever before. Third, the advancement of software engineering offers tools to process the corpus. Customizable corpus analysis software is produced to meet different research and study needs. Wordsmith (Scott 1996), MonoConc (Barlow 1999) and AntConc (Laurence Anthony 2007) are the most common corpus analysis software packages and are widely used in the fields of literature, pedagogy, linguistics and translation studies. Machine-readable texts and computer programs make quantitative language study possible, offering new approaches to improve the quality of translation.

This paper aims to examine how comparable corpora can be used to enhance the quality of Chinese-English translations done by non-native English-speaking translators. To illustrate the point, a comparable corpus comprising original English texts and texts translated into English on the subject of city introduction is constructed, and the question of how it can help translators who translate out of their native language to use idiomatic expressions is examined.

1 Corpus-based Translation Study: A Review

The 20th century saw a dramatic change in translation studies: a shift from traditional prescriptive study to descriptive study, which directly promoted the development of corpus-based translation studies. Scholars and translation education professionals, who used to conduct translation studies or provide translation training on an intuitive basis, started to do empirical research, relying on both original and translated texts. As a result, various kinds of translational corpora have been constructed to meet different needs in descriptive and practical translation studies.

1.1 Current Research

1.1.1 Corpus-based Descriptive Translation Study

It is generally acknowledged that Mona Baker is a pioneer in corpus-based translation studies, since she was the first to conceive the idea of constructing a translational corpus and to actually set one up: the Translational English Corpus (TEC). TEC, a project funded by the British Academy, was started in 1996 and opened to the public online in 1999. The texts in the corpus are taken directly from publications and are translated from European languages such as French, German and Spanish and from non-European languages such as Chinese and Thai. Mona Baker and other faculty members at the University of Manchester Institute of Science & Technology (UMIST) have done translation studies on the basis of TEC. Basically, TEC-based translation studies fall into three categories: features of translationese, translators' styles, and social and cultural influences on translation.

Compared with original texts, translationese, the language of translated texts, has its own special features, and comparative studies have been done to reveal the differences. Baker (1996) observed that translated versions usually show explicitation, simplification, normalization and leveling out. By comparing TEC and the British National Corpus (BNC) on the usage of "that" preceding an object clause, Olohan and Baker (2000) found that the ratio of "that" in TEC was much higher than in the BNC, which further demonstrated the feature of explicitation. Besides simplification, explicitation and normalization, Sara Laviosa (1998) added three more features: avoidance of repetitions present in the source text, discourse transfer and the law of interference, and distinctive distribution of target-language items.

TEC has also been used to study the different styles of translators. By comparing the type-token ratio, sentence length and narrative structure of the translations of Peter Bush and Peter Clark, two British translators, Baker (2000) came to the conclusion that Clark had a more direct style than Bush.

Cultural differences between nations are revealed through comparisons between TEC and texts originally written in English. For example, Laviosa (2002) showed the differences between cultural messages by making a comparison between the news subcorpora of the English Comparable Corpus (ECC), a corpus constructed by herself, which included 396 articles from the Guardian and Europe Journal and the news subcorpora in TEC which included news translated from German, Slavic, Italian, etc.

Descriptive translation study lays the foundation for practical translation studies. The universals of translation revealed in corpus-based descriptive translation studies suggest ways in which translators can make their translations sound more natural to target-language readers. Besides, the methodologies used in descriptive translation study are instructive for those involved in translation practice and in other practical translation studies.

1.1.2 Corpus-based Practical Translation Study

While a wide array of corpora has been applied in descriptive translation studies, efforts have also been made to adapt corpora to practical translation studies. Federico Zanettin raised the idea of using corpora in the training of translators in 1998 and illustrated the point by presenting an experiment in which the Olympics corpus was used by a group of trainee translators to translate an Italian sports article into English. Since then, scholars have begun to pay attention to the role corpora can play in translation education, and new approaches have been developed. Jennifer Pearson (2000) noted that parallel corpora were very useful in the translator training environment because they could show the trainees "how professional translators have overcome specific translation problems." Natalie Kübler (2000) illustrated how to use specialized and general corpora and corpus query tools to look for term candidates and their phraseology. Krista Varantola (2000) introduced a new type of corpus, the disposable corpus, which was used as a performance-enhancing tool in the training of prospective professional translators, and she also demonstrated how to apply Wordsmith Tools in corpus analysis.

1.2 Problems in Corpus-based C-E Translation Study

Since equivalents can be easily extracted from aligned parallel corpora, they are extensively used in translation practice. The significant role parallel corpora play in terminology extraction is not in dispute here. However, when focusing on Chinese-English translation study, relying solely on parallel corpora represents a problem.

First of all, high-quality C-E translations are comparatively rare, since most C-E translations are done by native Chinese translators, who live in a Chinese-speaking environment and have little peer support from native English speakers. One can easily spot "unconventional" and "creative" expressions in these translations which, in most cases, confuse native English readers. Such translations can hardly meet the need for communication between source-language writers and target-language readers. Therefore, the quality of a parallel corpus built from poor translations as raw material is in doubt.

Secondly, it is difficult to align a parallel corpus of high-quality C-E translation since English is a language of hypotaxis while Chinese is a language of parataxis. To make the translation sound natural to native English readers, translators need to bring out the implied logic in Chinese texts by using discourse markers or other means. Absolute equivalence in syntactic structures does not exist. Therefore, a huge amount of aligning work will be involved in parallel corpus compiling since automatic construction is difficult to carry out.

Therefore, a comparable corpus, which provides samples of language as it is used naturally by native English speakers, is extremely useful for translators who translate out of their mother tongue. A comparable corpus includes a collection of texts written by native speakers of the target language on the same topic as the translated texts (city introduction in this paper). Translators can imitate the sentence patterns and idiomatic expressions used by native speakers.

2 Methodology

This paper aims to illustrate the value a monolingual comparable corpus has in Chinese-English translation practice and to demonstrate how a comparable corpus can be used in C-E translation practice to enhance the quality of the translation done by a non-native English speaker. Therefore, it is a practical translation study.

2.1 Comparable Corpus

In the experiment, a comparable corpus comprising two English subcorpora, a translated text collection and an original text collection, is constructed. The comparable corpus is the most important translation corpus for translators who translate out of their mother tongue. As already mentioned, it is inevitable that native Chinese translators are involved in C-E translation, and the ultimate goal of their translation practice should be to make the translated texts understandable and natural-sounding to native target-language readers. This aim is not easy to achieve in C-E translation without the participation of native English speakers. Therefore, the comparable corpus, which serves as an English consultant, plays an irreplaceable role in C-E translation practice.

2.2 Corpus Analysis Tools

Unlike paper texts, electronic corpora can be processed automatically by computer. In this study, three freely available programs are used for corpus construction, corpus analysis and terminology extraction: A Corpus Worker's Toolkit (ACWT), AntConc and GoTagger.

3 A Case Study of City Introduction: Procedure

3.1 Corpus Construction

As the largest corpus, the Internet provides an almost unlimited number of electronic articles updated every minute. The vast pool of information serves well as a translation corpus resource. The comparable corpus used in the experiment is a disposable corpus which has two subcorpora on the same subject—City Introduction.

3.1.1 Subcorpus A—translated texts done by native Chinese speakers

Two steps are involved in Subcorpus A's construction—data collecting and compilation.

In the process of data collecting, it was found that C-E translated articles on city introduction can be obtained from several kinds of websites, including tourism websites, investment-promotion websites and local government websites. Since tourism websites are, in most cases, commercial, their city introductions unavoidably contain several descriptive paragraphs and function as advertisements. Therefore, most articles compiled in the corpus are from government-run websites and mainly provide factual information. The search strategy in the data collecting process is accordingly quite simple: downloading the city introduction pages from China's local government websites (usually those of provincial capital cities).

However, web pages cannot be processed by corpus analysis software directly; the articles in HTML format need to be converted into plain-text (txt) format. In this step, "A Corpus Worker's Toolkit" (ACWT) is used to do the conversion. First, the web page is opened in NoteTab. Then the "HTML <-> Text" conversion tool is run to get the article in txt form. After all texts have been converted into txt form, the merge file tool is applied to obtain a single file. ACWT greatly reduces the tedious, mechanical work of corpus compilation.
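The convert-then-merge step can also be sketched in a few lines of Python. This is only an illustration of the idea, not the way ACWT or NoteTab work internally; the function names, the folder layout and the choice to skip script and style content are assumptions made for the example.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def merge_pages(html_dir, out_file):
    """Convert every .html file in html_dir and merge the results into one txt file.

    html_dir and out_file are hypothetical paths chosen by the user.
    Returns the number of pages merged.
    """
    texts = [
        html_to_text(page.read_text(encoding="utf-8", errors="ignore"))
        for page in sorted(Path(html_dir).glob("*.html"))
    ]
    Path(out_file).write_text("\n\n".join(texts), encoding="utf-8")
    return len(texts)
```

Navigation menus and footers survive this kind of conversion, so the merged file would still need manual cleaning before analysis.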

3.1.2 Subcorpus B—original texts written by native English speakers

Since the articles in Subcorpus A are factual information on cities, an English on-line electronic encyclopedia is chosen as the source for Subcorpus B. Compared with articles in Wikipedia, Encyclopedia Britannica and other on-line English encyclopedias, texts in Encarta, a digital multimedia encyclopedia published by Microsoft Corporation, are more relevant to the texts in Subcorpus A. Therefore, five metropolitan introductions are selected from Encarta. The same compilation strategy as in Subcorpus A construction is applied here. The detailed quantitative characteristics of the corpus are shown in Table 1.

	Number of Articles	Tokens	Types
Subcorpus A	22	28947	4653
Subcorpus B	5	28816	4949

Table 1 Corpus Characteristics

As Table 1 shows, the two subcorpora are comparable, as their sizes are quite similar. This matters because some measures are meaningful only when the token counts are similar, such as the type/token ratio, which is the ratio of different words (types) to total words (tokens). Since the total number of English words is finite, the number of types is bounded while the number of tokens can grow indefinitely (Yang Huizhong, 2002), so the ratio tends to fall as a corpus grows.

3.2 Data Processing

The corpus analysis tools introduced above are applied to the comparable corpus to obtain information on stylistic features, to extract terminology, and to check whether expressions in the translated texts sound natural to target-language readers. The main processing steps are shown below.

3.2.1 Stylistic Features

The formula for lexical density (LD) is taken from ACWT: Lexical Density = (Number of different words / Total number of words) x 100. To obtain the two numbers, the word counter tool in Microsoft Word and the wordlist tool in AntConc are applied. The figures are then entered into the formula to get the LD of the two subcorpora.

In measuring sentence length, the formula is Sentence Length = token / (number of full stops + number of exclamatory marks + number of interrogation marks). ACWT is applied in counting the punctuation mentioned above.
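Both formulas can be applied directly to a plain-text file. The sketch below assumes a simple regular-expression tokenizer, which will not count words exactly as Microsoft Word or AntConc do, and, like the sentence length formula itself, it treats every full stop as a sentence boundary, including those in abbreviations. All names are illustrative.

```python
import re


def tokenize(text):
    """Lower-cased word tokens; hyphenated and apostrophe forms stay together."""
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())


def lexical_density(text):
    """(number of different words / total number of words) x 100."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no words")
    return 100 * len(set(tokens)) / len(tokens)


def sentence_length(text):
    """tokens / (full stops + exclamation marks + question marks)."""
    enders = text.count(".") + text.count("!") + text.count("?")
    if enders == 0:
        raise ValueError("text contains no sentence-final punctuation")
    return len(tokenize(text)) / enders


# Invented sample: 12 tokens, 6 types, 3 sentence-final marks.
sample = "The city is old. The city is large! Is the city rich?"
print(lexical_density(sample))   # 50.0
print(sentence_length(sample))   # 4.0
```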

	Full Stop	Exclamatory Marks	Interrogation Marks	Sentence Length	Lexical Density
Subcorpus A	2075	0	8	13.9	16.1%
Subcorpus B	1466	1	2	19.6	17.2%

Table 2 Data on Stylistic Features
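The derived columns of Table 2 can be rechecked from the raw counts. In the short Python sketch below, the counts are copied from Tables 1 and 2; the dictionary layout and function names are illustrative only.

```python
# Counts copied from Table 1 (tokens, types) and Table 2 (punctuation).
COUNTS = {
    "Subcorpus A": {"tokens": 28947, "types": 4653,
                    "full_stops": 2075, "exclamations": 0, "questions": 8},
    "Subcorpus B": {"tokens": 28816, "types": 4949,
                    "full_stops": 1466, "exclamations": 1, "questions": 2},
}


def table2_row(c):
    """Return (sentence length, lexical density in %) rounded to one decimal."""
    sentences = c["full_stops"] + c["exclamations"] + c["questions"]
    length = round(c["tokens"] / sentences, 1)
    density = round(100 * c["types"] / c["tokens"], 1)
    return length, density


for name, c in COUNTS.items():
    print(name, table2_row(c))
# Subcorpus A (13.9, 16.1)
# Subcorpus B (19.6, 17.2)
```

The results agree with the values reported in Table 2.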

3.2.2 Terminology Extraction & Concordance

Besides revealing the stylistic features of the translation, comparable corpora can be used in terminology extraction and to show the context in which terms occur in native speakers' writing. AntConc is the corpus query software used in this process. Since the number of texts compiled in the disposable corpus is limited, the British National Corpus (BNC) is used as a supplement to Subcorpus B for terminology extraction. Parallel corpora and on-line dictionaries play a complementary role in actual C-E translation practice: the equivalents of the Chinese terms are looked up in a parallel corpus or an on-line dictionary, and several candidate terms are usually found. It is then the comparable corpus that tells which candidate is the natural expression in the target language and how to combine it with other words in context. The following example illustrates the idea.

The term "公共交通" (gōng gòng jiāo tōng, literally public transport) often occurs in city introductions. Looking up the Chinese term in the China National Knowledge Infrastructure (CNKI) on-line dictionary, a dictionary based on parallel corpora, one gets three candidate terms: public transportation, public traffic and public transport. Querying the three terms in the comparable corpus with the concordance tool in AntConc, one finds that only "public traffic" appears in Subcorpus A, while "public transport" and "public transportation" occur in Subcorpus B. The two terms occur with similar frequency in Subcorpus B, whereas in the BNC "public transport" has 929 occurrences, "public transportation" has 6, and "public traffic" has none. The ratio of "public transport" to "public transportation" would have been more favorable to the latter if a U.S. corpus had been consulted. Clearly, "public traffic" is an unnatural expression in English.
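The candidate check described above amounts to counting each candidate term in each subcorpus. The following Python sketch illustrates the idea; it is not AntConc's implementation, the function names are hypothetical, and the two sample texts are invented for the example rather than taken from the corpus.

```python
import re


def count_phrase(phrase, text):
    """Count whole-word, case-insensitive occurrences of phrase in text."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def compare_candidates(candidates, corpora):
    """Return {candidate: {corpus name: count}} for every candidate term."""
    return {
        cand: {name: count_phrase(cand, text) for name, text in corpora.items()}
        for cand in candidates
    }


# Invented sample texts standing in for the two subcorpora.
corpora = {
    "translated": "Public traffic is convenient. The public traffic network covers the city.",
    "native": "Public transport is cheap. Many residents rely on public "
              "transportation, and public transport runs all night.",
}
candidates = ["public transportation", "public traffic", "public transport"]

for term, counts in compare_candidates(candidates, corpora).items():
    print(term, counts)
```

The word-boundary markers keep "public transport" from also matching inside "public transportation", so the two candidates are counted separately.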

3.2.3 Part of Speech Tagging

GoTagger07 is the tool used for part-of-speech (POS) tagging. The statistics in Table 3 are derived from the tagged texts.

	Subcorpus A	Subcorpus B
Determiner	2729	3540
Coordinating Conjunction	1341	1181
Adjective	3500	3153
Noun (excluding proper nouns)	6372	5148
Personal Pronoun	136	223
Adverb	481	707
Verb, base form	216	317
Verb, past tense	844	1082
Verb, non-3rd ps. sing.	299	274
Verb, 3rd ps. sing. Present	572	562
Verb (sum of the four rows above)	1931	2235
Verb, gerund/present participle	586	451
Verb, past participle	821	777
wh-determiner	98	138
wh-pronoun	18	53
Possessive wh-pronoun	7	5
wh-adverb	19	66

Table 3 Data on Parts of Speech

4 A Case Study of City Introduction: Discussion

4.1 Style of the Translated Texts

Translated texts differ so markedly in style from texts originally written in the target language that a term, translationese, was coined to describe it. In this study, some special features of the translated texts have been spotted. Compared with the city introductions originally written in English in Subcorpus B, the translated city introductions tend to have shorter sentences with simpler sentence patterns, fewer different words, more nouns, and fewer verbs.

First of all, translators form shorter sentences and are more likely to use simple and compound sentences than target-language writers. As Table 2 shows, the average sentence length in Subcorpus A is 5.7 words shorter than that in Subcorpus B. Moreover, there is a considerable difference in sentence patterns between the two subcorpora. Table 2 shows that Subcorpus A has only eight interrogative sentences and Subcorpus B only two (plus one exclamatory sentence), so most wh- words must be used to link subordinate clauses rather than to form questions. As Table 3 shows, Subcorpus B has almost twice as many wh- words as Subcorpus A. Thus we can conclude that more complex sentences are used in texts originally written by native English speakers than in translations done by native Chinese speakers.

Secondly, compared with the texts in Subcorpus B, the translated texts show less word variety, although the difference is not striking. As Table 2 demonstrates, the lexical density in Subcorpus A is only 1.1 percentage points lower than that in Subcorpus B.

Thirdly, the translated texts have more nouns and fewer verbs than the texts originally written in English. Table 3 shows that nouns (excluding proper nouns) make up 22.0% of all words in Subcorpus A and 17.9% in Subcorpus B, while predicate verbs account for 6.7% in Subcorpus A and 7.8% in Subcorpus B. The share of nouns in Subcorpus A is thus 4.1 percentage points higher, and the share of predicate verbs 1.1 percentage points lower, than in Subcorpus B.
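The shares quoted in this section can be rechecked from the counts in Table 3 and the token totals in Table 1. The Python sketch below copies those counts; the tag names and helper functions are illustrative, and the predicate-verb total is assumed to be the sum of the four finite verb rows, which matches the "Verb" row of Table 3.

```python
TOKENS = {"A": 28947, "B": 28816}           # from Table 1

POS_COUNTS = {                               # from Table 3
    "noun":          {"A": 6372, "B": 5148},
    "verb_base":     {"A": 216,  "B": 317},
    "verb_past":     {"A": 844,  "B": 1082},
    "verb_non3rd":   {"A": 299,  "B": 274},
    "verb_3rd":      {"A": 572,  "B": 562},
    "wh_determiner": {"A": 98,   "B": 138},
    "wh_pronoun":    {"A": 18,   "B": 53},
    "wh_possessive": {"A": 7,    "B": 5},
    "wh_adverb":     {"A": 19,   "B": 66},
}
FINITE_VERBS = ("verb_base", "verb_past", "verb_non3rd", "verb_3rd")
WH_WORDS = ("wh_determiner", "wh_pronoun", "wh_possessive", "wh_adverb")


def total(tags, corpus):
    """Sum the counts of the given tags in one subcorpus."""
    return sum(POS_COUNTS[tag][corpus] for tag in tags)


def share(count, corpus):
    """Percentage of all tokens in the subcorpus, rounded to one decimal."""
    return round(100 * count / TOKENS[corpus], 1)


for corpus in ("A", "B"):
    print(corpus,
          "nouns %:", share(total(("noun",), corpus), corpus),
          "predicate verbs %:", share(total(FINITE_VERBS, corpus), corpus),
          "wh- words:", total(WH_WORDS, corpus))
# A nouns %: 22.0 predicate verbs %: 6.7 wh- words: 142
# B nouns %: 17.9 predicate verbs %: 7.8 wh- words: 262
```

With 262 wh- words against 142, Subcorpus B has about 1.85 times as many as Subcorpus A.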

As the three stylistic features of the translated texts mentioned above show, sentence patterns and word variety can be improved to make the translation sound more natural to target-language readers.

4.2 Comparable Corpus's Role in Making Translation Sound Natural

Native English speakers are the intended readers of the English translation. Therefore, the basic quality that a good piece of translation should have is that the language should sound natural to the target-language readers. Though an easy criterion for translators who translate into their mother tongue, it is quite a challenge for translators who translate out of their native language. In this sense, a comparable corpus, which provides examples of native English speakers' expressions, can assist native Chinese translators to use idiomatic expressions by providing the context in which terms occur in native speakers' writings, spotting awkward collocations and highlighting some small errors which are often overlooked by non-native speakers, such as the use of articles.

First of all, to produce a good translation with accurate use of terminology, a corpus is an indispensable tool because it can display the context in which terms occur in native speakers' writing. Compared with a traditional paper dictionary, a comparable corpus is more efficient in terminology extraction. Looking words up in a thick, heavy paper dictionary is quite time-consuming. Moreover, the word entries in a paper dictionary are fixed. Since vast numbers of new words are coined every day, a paper dictionary can never keep up with the development of society and technology. These shortcomings are overcome by on-line dictionaries: an Internet-based dictionary (such as the yodao dictionary) can be updated every day. However, almost all the examples provided by these on-line dictionaries are extracted from C-E translations which, in most cases, were done by native Chinese translators. Besides, since the example sentences are retrieved from the Internet without careful selection, their quality cannot be guaranteed. In contrast, illustrative sentences extracted from a comparable corpus based on well-selected texts written by native target-language speakers are more reliable and sound more natural.

Secondly, together with corpus analysis tools, comparable corpora can be applied in spotting awkward collocations in the translated texts. Beyond choosing suitable words, combining them is a complicated problem. Awkward collocations are a common error that hinders the target-language reader's understanding. In finding these unnatural expressions, a concordancer, which is included in most corpus analysis software, is a very effective tool. One may simply type in the phrase one is unsure about and query both subcorpora. If the collocation occurs in a similar context in the native speakers' texts, it can be used. If not, the core of the phrase is typed in and the corpus of texts written by native speakers is queried to derive a natural expression. For example, "自然条件" (zì rán tiáo jiàn, literally natural resources) was translated as "natural condition(s)" in three translated texts (the Suzhou, Foshan and Shaoguan city introductions). However, no hit was returned in Subcorpus B. The phrase was then queried in the BNC, which returned ten hits. But a close look at the sentences showed that "natural condition" there means a condition that is not made or controlled by human beings, which is not the sense intended in the city introductions.
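The two-step query described above (try the full phrase, then fall back to its core word) can be sketched as a small keyword-in-context concordancer in Python. It only illustrates the procedure and is not the concordancer built into AntConc; the function names, the choice of "natural" as the core word and the sample text are all assumptions made for the example.

```python
import re


def concordance(phrase, text, width=30):
    """Return one keyword-in-context line per whole-word hit of phrase."""
    flat = " ".join(text.split())          # collapse line breaks and extra spaces
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        lines.append(f"{left:>{width}}[{match.group(0)}]{right}")
    return lines


def check_collocation(phrase, core, native_text, width=30):
    """Query the full phrase first; if there is no hit, query its core word.

    Returns the string that was finally queried and its concordance lines.
    """
    hits = concordance(phrase, native_text, width)
    if hits:
        return phrase, hits
    return core, concordance(core, native_text, width)


# Invented sample standing in for texts written by native speakers.
native_text = ("The city enjoys a mild climate and rich natural resources. "
               "Its natural harbor made it a major trading port.")

queried, hits = check_collocation("natural conditions", "natural", native_text)
print("queried:", queried)
for line in hits:
    print(line)
```

Reading the context lines for the core word shows which nouns native writers actually combine it with, which is the information the translator needs to replace the awkward collocation.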

Thirdly, a quantitative comparison between the translated texts and texts written by native English speakers can reveal subtle errors which impair the quality of the translation but are often overlooked by non-native English speakers. For example, articles (a, an, the) do not greatly affect the meaning, but their absence can make a text sound strange. Since articles do not exist in Chinese, they are often forgotten by native Chinese translators. As Table 4 shows, the share of articles in Subcorpus A is 2.9 percentage points lower than that in Subcorpus B, and the number of occurrences of "the" in the translated texts is considerably lower than in the texts written by native English speakers. Correcting that slight difference would considerably improve the translation.

	the	a	an	Total articles	Share of tokens
Subcorpus A	1685	296	127	2108	7.3%
Subcorpus B	2447	427	74	2948	10.2%

Table 4 Articles
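The article counts behind Table 4 are easy to reproduce for any text. The Python sketch below counts the three articles in a text and also rechecks the shares in Table 4 from the reported counts and the token totals in Table 1; the tokenizer, the function names and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter

ARTICLES = ("the", "a", "an")


def article_share(text):
    """Return (Counter of articles, articles as a percentage of all tokens)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        raise ValueError("text contains no words")
    counts = Counter(tok for tok in tokens if tok in ARTICLES)
    return counts, 100 * sum(counts.values()) / len(tokens)


# Invented sample sentence: 12 tokens, 4 of them articles.
counts, share = article_share("The city is a port and an old capital of the province.")
print(counts["the"], counts["a"], counts["an"], round(share, 1))   # 2 1 1 33.3

# Recheck of Table 4 from the reported counts (tokens from Table 1).
REPORTED = {
    "Subcorpus A": {"the": 1685, "a": 296, "an": 127, "tokens": 28947},
    "Subcorpus B": {"the": 2447, "a": 427, "an": 74, "tokens": 28816},
}


def reported_share(row):
    """Total number of articles and their share of tokens in percent."""
    total = row["the"] + row["a"] + row["an"]
    return total, round(100 * total / row["tokens"], 1)


for name, row in REPORTED.items():
    print(name, reported_share(row))
# Subcorpus A (2108, 7.3)
# Subcorpus B (2948, 10.2)
```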

Conclusion

This study conducted an experiment to explore ways of using comparable corpora in translation studies, with the aim of helping translators who translate out of their native language to enhance the quality of their translated texts. By carrying out a quantitative analysis, we obtained data which indicate the special stylistic features of the translated texts in Subcorpus A, produced by native Chinese-speaking translators, compared with the texts written by native English speakers in Subcorpus B. Translators can work on these stylistic differences to improve their translations. Furthermore, this paper argues that the comparable corpus is an indispensable tool in terminology extraction by showing how to use the corpus in the process. Besides, this paper explores ways of applying a comparable corpus to make the translation sound natural.

As this paper has shown, comparable corpora play a significant role in translation study and practice. However, they also have some limitations. First, one may not find comparable texts in the target language. For example, it is difficult to find comparable material for fictional works, which usually contain many cultural elements unique to a nation. Surely, one cannot find an English novel which is comparable to the Chinese novel Dream of the Red Chamber (《红楼梦》). Therefore, the comparable corpus is mainly useful in translating universal topics. Second, comparable corpora are not very helpful in translating materials in which creative expressions are required, since they only allow translators to use expressions that already exist. However, for native Chinese translators, parroting English speakers' words is not a bad idea because it at least makes the translated texts readable and understandable to the target-language reader.

Although comparable corpora have some shortcomings, their potential in translation studies is not to be underestimated. In addition to studies at the word level and the syntactic level, further studies can be carried out on the application of comparable corpora to discourse-level translation studies. Cohesive devices such as discourse markers, as well as theme and rheme distribution, can be studied quantitatively. Furthermore, strategies for constructing comparable corpora using the Internet as their source can be developed.

References

Baker, M. (1996). Corpus-based translation studies: the challenges that lie ahead. In H. Somers (ed.). Terminology, LSP and Translation: Studies in Language Engineering, in Honour of Juan C. Sager. Amsterdam: John Benjamins.
Baker, M. (2000). Towards a Methodology for Investigating the Style of a Literary Translator. Target 12, 241-266.
Laviosa, S. (1998). Universals of Translation. In Mona Baker (ed.). Routledge Encyclopedia of Translation Studies. London: Routledge.
Laviosa, S. (2002). Corpus-based Translation Studies: Theory, Findings, Applications. Amsterdam: Rodopi.
Olohan, M. & Baker, M. (2000). Reporting "that" in translated English: Evidence for subconscious processes of explicitation? Across Languages and Cultures, 1(2), 141-158.
Kübler, N. (2000). Corpora and LSP Translation. In Federico Zanettin, Silvia Bernardini, Dominic Stewart (eds.). Corpora in Translator Education. Beijing: Foreign Language Teaching and Research Press.
Pearson, J. (2000). Using Parallel Texts in the Translator Training Environment. In Federico Zanettin, Silvia Bernardini, Dominic Stewart (eds.). Corpora in Translator Education. Beijing: Foreign Language Teaching and Research Press.
Varantola, K. (2000). Translators and Disposable Corpora. In Federico Zanettin, Silvia Bernardini, Dominic Stewart (eds.). Corpora in Translator Education. Beijing: Foreign Language Teaching and Research Press.
Zanettin, F. (1998). Bilingual Comparable Corpora and the Training of Translators. Meta 4, 1—14.
陈江宏，(2007)，中国翻译人才缺口达60% 翻译界最缺汉译英人才，《北京晚报》， 2007/8/20。
徐梅江， (2004)，汉译英基本模式及其发展趋势， http://www.cctb.net/wjjg/wxb/wxbkycg/200408040002.htm ，2008/4/6
杨惠中，(2002)，《语料库语言导论》，上海：上海外语教育出版社。

This article was originally published at Translation Journal (http://accurapid.com/journal).
