Corpus Linguistics Approach: A Novel Framework for Translation Studies Research

1. Introduction

Corpus linguistics is a relatively new approach that emerged in the 1960s, an important point in the application of corpus studies to the language sciences. The term "corpus linguistics" was first introduced by Leech in the 1980s (Leech, 1992). Leech (1992) recounts the general history of corpus linguistics: from the 1940s to the 1950s, American structuralists were increasingly interested in using corpora. "A corpus of authentically occurring discourse was the thing that the linguist was meant to be studying" (Leech, 1992, p. 105).

At almost the same time, in 1965, Chomsky published his famous work Aspects of the Theory of Syntax, which was warmly welcomed and admired by linguists throughout the world. Chomsky was quite hostile to the corpus-based methodology of language study and criticized it severely. Most of the criticism from Chomsky and the Chomskyans concerned the "skewedness" of corpora. It should be mentioned, however, that corpora at that time were generally very small and served mainly as sources for investigating distinct features in phonetics (Ling, 1999). Therefore, after its rise in the 1960s, corpus linguistics came to a halt owing to these Chomskyan criticisms and was nearly, if not totally, ignored for about 30 years, i.e. until the 1990s.

As Leech (1992) contends, Chomskyan linguistics affected corpus linguistics so strongly in the 1950s that it was left in a backwater and received no attention for a quarter of a century. According to Leon (2005), Chomsky, a well-known critic of structuralism, argued that corpus linguistics procedures yielded only a static inventory of signs, bearing no significance and providing no theoretical explanation. In his view, a description obtained by this method was valid only for the collected data and produced no information about the nature of language.

Together, these arguments against the corpus linguistics approach kept it largely silent for about 25 years.

2. What Is a Corpus?

In modern linguistics, a corpus is defined as a body of naturally occurring language (McEnery, Xiao and Tono, 2006). Leech (1992), a leading figure in the revival of corpus linguistics, states that computer corpora are not random collections of texts. They are usually gathered with a specific goal and are meant to be representative of particular languages or text types.

In Sinclair's (1996) terms, corpora are pieces of language chosen and organized according to explicit linguistic criteria so that they can be critically examined; as Johnson (1998) contends, such criteria help to select and assemble the texts "in a principled way" (p. 3).

Consequently, a corpus is not an accidentally selected body of texts; instead, it "can be best defined as a collection of sampled texts, written or spoken in machine-readable form which may be annotated with various forms of linguistic information" (McEnery et al., 2006, p. 4). Baker (1995) puts forward this definition of the term "corpus": "a collection of texts held in machine-readable form and capable of being analyzed automatically in a variety of ways" (p. 225).

Teubert and Čermáková (2007) provide another definition: "a collection of naturally occurring language texts in electronic form, often compiled according to specific design criteria and typically containing many millions of words" (p. 140).

Taking into account all the proposed definitions of the term "corpus", it can be concluded that a corpus is usually expected to have four main characteristics: sampling and representativeness, finite size, machine-readable form and standard reference (McEnery and Wilson, 2001).

A useful corpus should be representative, and achieving that requires care in selecting its texts. McEnery and Wilson (2001) hold that, in collecting a corpus, the aim is to gather a wide range of authors and genres so that, taken together, they present an accurate picture of the whole language population to be studied.

A finite corpus is supposed to contain a fixed body of data that does not change continuously, so that quantitative findings drawn from it remain stable.

A machine-readable corpus is one organized in such a way that it can be read and analyzed by a machine.

A standard reference corpus is a "corpus which constitutes a standard reference for the language variety that it presents" (McEnery and Wilson, 2001, p. 14).

3. Famous Corpora in the 1950s and 1960s

Randolph Quirk's Survey of English Usage (SEU) was the first large English corpus; it later resulted in A Comprehensive Grammar of the English Language (Quirk, Greenbaum, Leech and Svartvik, 1985), which served for years as the standard English grammar. The project was launched in the late 1950s. The Survey did not, however, benefit from computerized data (Teubert and Čermáková, 2007). The project included about 50,000 words of spoken English alongside about 500,000 written words. In the late 1970s, Jan Svartvik put the spoken part on computer as the London-Lund Corpus. It became the first widely accessible spoken corpus and was published as a book, but unfortunately it was not available as a sound recording (Svartvik, 1990). The main focus of the Survey was grammar. Because of the dominance of Chomskyan theory, however, this kind of data-oriented research was not welcomed.

The Brown Corpus was another corpus-based research project of the 1960s, compiled by Henry Kucera and Nelson Francis at Brown University. It consisted of 1,000,000 words, drawn as samples of 2,000 words each from 500 American texts of five types.

The Brown Corpus was considered fairly reliable, since it was reviewed many times to remove mistakes. At first, it was thought that a corpus of this kind could answer questions about grammar and the lexicon. But 1,000,000 words could not cover the whole vocabulary, which is why linguists gradually lost interest in the Brown Corpus.

The most important corpus was English Lexical Studies, which was begun in Edinburgh in 1963 and completed in Birmingham. John Sinclair was the main figure in compiling English Lexical Studies and a pioneer in using a corpus for research on the lexicon. He took up the idea of collocation, introduced earlier by Harold Palmer and A.S. Hornby in the Second Interim Report on English Collocations (1933). This corpus contained a small electronic sample of spoken and written language, fewer than 1,000,000 words (Teubert and Čermáková, 2007). The focus of the study was the meaning of lexical items, including collocations. Sinclair did not entirely reject the word as a unit of meaning, but he sought to revise the view that the word was the sole unit of meaning.

However, corpus-based studies were largely abandoned until the beginning of the 1990s. After that, they experienced a revival and attracted new interest. In 1991, an international conference was held which brought together linguists from Britain, Germany, Sweden and Norway (Proceedings in Svartvik (ed.) 1992). Researchers began to publish various collective volumes as well as an international journal, The International Journal of Corpus Linguistics. In 2002, a well-known figure in corpus linguistics, Geoffrey Leech, spoke of the corpus linguistics community as a well-established research community (Leech, 2002). Since then, thanks to the efforts of linguists around the world, together with advances in computer technology and the opportunity to build huge electronic corpora, corpus linguistics has undergone widespread development. Since corpus linguistics covers fields such as lexicography, descriptive linguistics, applied linguistics and other fields which need corpora (Leon, 2005), it is also related to descriptive translation studies (DTS), in which the search for specific translation features depends greatly on the use of corpora.

Recent developments in corpus linguistics have also affected corpus-based studies in translation, directing them toward new horizons that translation scholars have welcomed.

4. Corpus Translation Studies

Recent developments in technology, and the tools it provides for language research, have brought about novel approaches in various scientific fields. Corpus translation studies (CTS), a newly adopted research mode in translation studies (TS), is a growing product of the information age, which facilitates the storing, retrieving and manipulating of information. CTS is rooted in corpus linguistics and its relation to translation studies. It is also connected to descriptive rather than prescriptive translation studies, an approach developed by scholars such as Toury (2002), Even-Zohar (1990) and Holmes (1988). CTS gives access to huge quantities of data, encoded in effective formats, which could never be collected or organized by any individual translator or author (Tymoczko, 1998). In addition, in her terms, this new approach makes it possible to gather data from populations of different sizes, including divergent cultures and both dominant and minor languages.

According to Baker (1992) and Holmes (1988), CTS concentrates on both the product and the process of translation. Scholars working in DTS have benefited greatly from this new method, which provides them with large-scale corpora and texts that can be investigated from the most detailed characteristics to the broadest patterns, whether cultural or linguistic. Tymoczko (1998) lists among the strengths of the approach its flexibility and adaptability, together with the open-endedness of corpus construction. Baker (1993) is a pioneer in applying the instruments and methods of corpus linguistics to DTS. She states that gathering large corpora of translated and untranslated texts, with the help of novel corpus linguistics methods and tools, would soon provide an advantageous methodology for TS, revealing the very nature of translated texts. The language of translation, according to her, has specific features which can be discovered by investigating large corpora. But, as she believes, this is not the sole object of CTS. Behind this special language of translation lie particular motivations, forces or limitations. Finding out these reasons is the main goal of CTS.

Translation universals, discussed earlier, are among the features of translated texts which can best be sought in corpora using CTS methodology.

In her proposed methodology, Baker (1995) defines three distinct types of electronic corpora to be used in TS, each of which is of special importance to translation scholars. The following subsections describe those corpora.

4.1. Parallel Corpora

A parallel corpus, in Baker's methodology, includes source language texts together with their translations. Parallel corpora are used to obtain information about the translational behavior of language pairs and to investigate the relationship between lexical or structural equivalents in the source text and the target text.

This type of corpus can be applied in areas such as translator training, the improvement of machine translation systems and materials writing (Shuttleworth and Cowie, 1997). Malmkjær (1993) also holds that, when they include sufficient information about the translator's background, parallel corpora can potentially provide the data needed to search for differences between first and second language acquisition.
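The idea of a parallel corpus as source texts held together with their translations can be illustrated with a minimal sketch. The sketch assumes sentence-level alignment, which the text above does not specify; the class `AlignedPair`, the function `find_equivalents` and the three English-French sentence pairs are hypothetical and invented purely for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    """One source sentence held together with its translation (hypothetical structure)."""
    source: str
    target: str


def find_equivalents(pairs, source_word):
    """Return the aligned pairs whose source sentence contains source_word.

    Reading the target side of each hit shows how the word was rendered,
    which is the kind of lexical-equivalence question a parallel corpus
    is meant to answer.
    """
    word = source_word.lower()
    return [p for p in pairs if word in re.findall(r"\w+", p.source.lower())]


# Toy data invented for this sketch; a real parallel corpus would be far larger.
toy_corpus = [
    AlignedPair("The house is old.", "La maison est vieille."),
    AlignedPair("She bought a house.", "Elle a acheté une maison."),
    AlignedPair("The garden is large.", "Le jardin est grand."),
]

hits = find_equivalents(toy_corpus, "house")
for pair in hits:
    print(pair.source, "|", pair.target)
```

Matching is done on whole word tokens rather than substrings, so a query for "house" would not match "household". Real parallel corpora also need an alignment step before such lookups are possible, which this sketch takes as given.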

4.2. Multilingual Corpora

Defined by Baker in 1995, multilingual corpora are bodies of texts in different languages, all selected according to similar design criteria. This kind of corpus therefore contains texts in their original language, and no translated text is included. Consequently, multilingual corpora are of special importance to contrastive linguistics, since they allow the naturally occurring structures or patterns of two or more languages to be compared by analyzing texts produced in those languages. Translation studies does not benefit much from this type of corpus, since it does not concentrate on translated texts.

4.3. Comparable Corpora

As one of the most useful kinds of corpora in TS, and the kind used in the present study, a comparable corpus is defined by Baker (1995) as a collection of texts originally written in one language together with texts translated into that same language. Baker believes that this type of corpus is of great help in the search for features that generally occur more frequently in translated texts, typically known as translation universals, among which "simplification" is the focus of the present study.

Although comparable corpora play no role in areas like translator training, materials writing or machine translation (Shuttleworth and Cowie, 1997), they play a significant role in investigating the very nature of translated texts (Baker, 1993).

Simplification can best be investigated through a comparable corpus by calculating mean sentence length, lexical density and type-token ratio. All the calculations can easily be done with computer software. If translated texts show a lower type-token ratio, lower lexical density and shorter mean sentence length than untranslated texts in the same language, they can be considered "simpler", i.e. the findings support the "simplification" hypothesis.
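The three measures named above can be sketched in a few lines of code. This is a minimal illustration, not the procedure used in the present study, and it rests on several assumptions: tokens are runs of letters, sentences end at ".", "!" or "?", and lexical density is the share of tokens that are not in a short, hand-picked list of English function words. The names `FUNCTION_WORDS`, `simplification_profile` and `looks_simpler` are hypothetical.

```python
import re

# Short illustrative list of English function words (an assumption; real
# studies use a fuller list or part-of-speech tagging).
FUNCTION_WORDS = frozenset(
    "a an the and or but if of in on at to for with by from as "
    "is are was were be been it this that these those he she they we you i not".split()
)


def tokenize(text):
    """Lower-case the text and return its word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def split_sentences(text):
    """Split on sentence-final punctuation and drop empty fragments."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def simplification_profile(text):
    """Compute type-token ratio, lexical density and mean sentence length."""
    tokens = tokenize(text)
    sentences = split_sentences(text)
    if not tokens or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    lexical_tokens = [t for t in tokens if t not in FUNCTION_WORDS]
    return {
        "type_token_ratio": len(set(tokens)) / len(tokens),
        "lexical_density": len(lexical_tokens) / len(tokens),
        "mean_sentence_length": len(tokens) / len(sentences),
    }


def looks_simpler(translated, untranslated):
    """True if the translated text scores lower on all three measures."""
    t = simplification_profile(translated)
    u = simplification_profile(untranslated)
    return all(t[key] < u[key] for key in t)


profile = simplification_profile("The cat sat. The cat ran.")
print(profile)
```

Type-token ratio falls as a text gets longer, so the translated and untranslated samples should be of comparable size (or the ratio computed over fixed-size chunks) before the values are compared.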


