Legal Aspects of Compiling Corpora to be used as Translation Resources

By Michael Wilkinson,
Finland
teaches courses in translation from Finnish to English,
oral expression and liaison interpreting

Michael.Wilkinson at uef.fi

Questions of Copyright

Corpora and corpus analysis tools

In the last issue of Translation Journal (Wilkinson 2006) I described various ways of compiling your own corpus to be used as a translation resource in conjunction with corpus analysis tools by downloading texts from websites, by scanning documents such as brochures, or by converting translation briefs into plain text format. But if one compiles corpora in this way, is it necessary to obtain permission from the copyright holders? When I began to compile a corpus of tourist brochure texts, I consulted colleagues and browsed through Internet discussion forums dealing with the legal aspects of corpora compiling. In the process, I encountered a spectrum of attitudes:

The reassuringly confident:
"It's okay for non-commercial education and research purposes."
The carefree:
"You'll not get caught anyway."
The cavalier:
"Even if you do get caught, they'll not sue you."
The cautious:
"It's better to be safe than sorry."

Of course, if you are a freelance translator using a self-compiled corpus as a private reference aid, or if you are a teacher or researcher using a corpus purely for private study and research, there is almost certainly no need to go through the process of requesting permission. But do you need permission if you write an article based on your experiences or on your research with examples of Key-Word-In-Context (KWIC) displays (see Fig. 1) containing short segments of text, as I have done in previous issues of Translation Journal (Wilkinson 2005a & 2005b)? And do you need permission if the corpus is made accessible to a wider user group - for example if it is shared amongst translators in a company, or if it can be freely accessed by others within an educational institution, just as my tourism corpus can be used by all students and staff at Savonlinna School of Translation Studies for teaching and research purposes?


Fig 1: Edited KWIC display for the search word permission generated by WordSmith Tools

Citations in Articles

In the US, Canada and UK, reference is often made to the concept of "fair dealing" or "fair use", which permits certain acts without requiring the permission of the copyright owner. As Hilton (2001) states: "If the use of a work furthers progress in the sciences and the arts (i.e. if it promotes learning, knowledge, and the public good) and if its use will do relatively little harm to the author's property rights, then it is not necessary to get the author's permission to use the work."

The US Copyright Law Section 107 lays down the following four factors to be used to determine whether the use of copyright material in a particular case is a "fair use" or not:

the purpose and character of the use, including whether such use is of a commercial nature or is for non-profit educational purposes;
the nature of the copyrighted work;
the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
the effect of the use upon the potential market for or value of the copyrighted work.

So it would seem that if you display concordance lines from your corpus in order to elucidate certain lexical features, you will not be sued by US copyright owners (and probably not by UK or Canadian copyright holders either) provided you pay attention to the fair use factors, especially the fourth one, which many experts suggest carries the most weight.

How does my corpus of tourism texts comply with these conditions? In regard to the first factor, the Tourism Corpus is for non-profit and educational purposes (the situation would be different if copies of the corpus were sold); in regard to the second factor, the brochures used in the corpus are freely available to the public at no charge (the situation would be more dubious with regard to a school textbook or a best-selling novel); in regard to the third factor, all or most of the text in each brochure is included in the corpus, but in citations, only a few words appear; in regard to the fourth factor, there is absolutely no adverse market effect - on the contrary, it seems that the copyright holders of tourist brochures and tourism websites welcome all the exposure they can get.

However, Davies (2002) points out that two lawyers he consulted explained that the copyright law that matters, at least regarding making a corpus available on the Web, is the law of the country from which the corpus is distributed, NOT the country where the original texts were created OR the country from which end users access the materials.

So what does Finnish legislation have to say in this matter? According to a lawyer from the Finnish Ministry of Education, downloading material from the Internet and saving it as a corpus requires permission from the copyright holders, as does making the corpus accessible to other user-groups, including students, since there are no fair-use exceptions regarding educational usage in Finnish law as in US law. Similarly, a representative from Finland's Copyright Information and Anti-Piracy Centre agreed that Finnish copyright law does not include any exonerating conditions akin to those in US and UK law, except for the right to copy material for purely personal use (such as private study and research or leisure pursuits).

However, since Finland joined the European Union in 1995, the development of copyright legislation has been closely linked with Community law. Directive 2001/29/EC of the European Parliament and of the Council of 22 May 2001 on the harmonisation of certain aspects of copyright and related rights in the information society contains the following statement: "This Directive should seek to promote learning and culture by protecting works and other subject-matter while permitting exceptions or limitations in the public interest for the purpose of education and teaching."

Moreover, Finland adheres to the Berne Convention on the protection of literary and artistic works, which is perhaps the most important international copyright convention. Article 10 (1) of the Convention states: "It shall be permissible to make quotations from a work which has already been lawfully made available to the public, provided that their making is compatible with fair practice, and their extent does not exceed that justified by the purpose, including quotations from newspaper articles and periodicals in the form of press summaries."

So it seems that Finnish law, through its adherence to international law, also recognises the concept of "fair use", though not as explicitly as US law. Unfortunately, Article 10(3) of the Berne Convention states: "Where use is made of works in accordance with the preceding paragraphs of this Article, mention shall be made of the source, and of the name of the author, if it appears thereon." Mentioning the source of every concordance line in a KWIC display would be a rather cumbersome process!

McEnery et al (2006) maintain that the fair-use provisions of copyright law as they apply to citations in published works should operate differently when they apply to corpus-building so as to allow corpus builders to build corpora quickly and legally. McEnery et al suggest that limited reproduction of copyrighted works, for instance in chunks of 3,000 words or one-third of the whole text (whichever is shorter) should be allowed under fair use for non-profit making research and educational purposes.
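The "3,000 words or one-third, whichever is shorter" suggestion can be expressed as a simple calculation. The following is only a sketch of that arithmetic, not a statement of what any law permits: the function names are hypothetical, words are counted by splitting on whitespace, and the excerpt is taken from the start of the text, all of which are assumptions made for illustration.

```python
def excerpt_word_limit(total_words: int, cap: int = 3000) -> int:
    """Word limit under the suggested rule: `cap` words or one-third
    of the whole text, whichever is shorter."""
    return min(cap, total_words // 3)


def take_excerpt(text: str, cap: int = 3000) -> str:
    """Return the opening chunk of `text`, cut at the word limit.

    Assumption: words are whitespace-separated tokens, and the chunk
    is taken from the beginning of the text.
    """
    words = text.split()
    limit = excerpt_word_limit(len(words), cap)
    return " ".join(words[:limit])


# A 12,000-word brochure is capped at 3,000 words;
# a 600-word leaflet is capped at one-third, i.e. 200 words.
print(excerpt_word_limit(12000))
print(excerpt_word_limit(600))
```

For short texts such as tourist leaflets, the one-third limit will nearly always be the operative one.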

Accessibility to the corpus within educational institutions

When considering accessibility to your corpus, the legal situation is perhaps even murkier than that regarding citations from the corpora. Davies (2002), writing about the situation in the USA, states:

"A couple of months ago I was talking to a lawyer/professor from another university who specializes in copyright law as it applies to electronic materials and more specifically, electronic materials on the Web. I explained to him a project where I had a large amount of material in a web-based corpus, but users could only see the hits in very short context concordance lines. His view was that because the material that was made available to the end user was so radically different from the original format (i.e. complete texts), there was no problem at all. In addition, I emailed a second professor at another university, who also specializes in copyright law as it applies to the Internet, and she said basically the same thing."

Fig 2: Edited KWIC display for the search pattern accessib* generated by WordSmith Tools
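To make concrete what "very short context concordance lines" means, here is a minimal Python sketch of a KWIC display in which the amount of visible context is a fixed parameter. It is not how WordSmith Tools or Davies's web-based corpus works; the function name `kwic`, the `width` parameter and the sample sentence are assumptions made for illustration, and the wildcard pattern accessib* is approximated by a regular expression.

```python
import re


def kwic(text: str, pattern: str, width: int = 30) -> list[str]:
    """Return one concordance line per match, showing at most `width`
    characters of context on either side of the search word."""
    flat = " ".join(text.split())  # collapse line breaks and extra spaces
    lines = []
    for m in re.finditer(pattern, flat, flags=re.IGNORECASE):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-align the left context so the search words line up.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines


sample = "The corpus is accessible to students. Accessibility matters."
for line in kwic(sample, r"\baccessib\w*", width=10):
    print(line)
```

With a small `width`, the end user never sees more than a fragment of each source text; raising `width` moves the display towards the wider contexts that some tools allow.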

However, many corpus analysis tools enable the user to view the concordance line in a wider context, ranging from several paragraphs to, in the case of WordSmith Tools 4 (Scott, 2004), the entire file. Bearing this in mind, one must consider whether the "fair use" philosophy allows in-house accessibility, whereby colleagues use the corpus for research purposes or students use the corpus in the translation lab as a reference tool for improving their translations. Here again, US law suggests that "multiple copies for classroom use" are covered by fair use, and Part III of Canada's Copyright Act suggests that, with certain provisions, there is no infringement of copyright by educational institutions where copies are made of works in printed form.

However, the ICT4LT (Information and Communications Technology for Language Teachers) website, referring to advice given by the British Educational Communications and Technology Agency (BECTA) concerning copyright involving electronic materials, suggests that making multiple copies of electronic materials for classroom use has been established as being outside fair dealing definitions. This would suggest that if the corpus is made available to students on CD-ROMs, or on the hard discs of the computers in the translation lab, for them to use as a reference tool when carrying out translation tasks, then fair dealing would no longer apply.

And of course if you are intending to sell your corpus, it is extremely advisable to get permission. To quote Kilgarriff (2002): "Copyright law is in general about the case where someone makes money from selling intellectual property: if you are going to sell a corpus, the issues need taking very seriously, as people will be upset by you making money out of selling their text (unless you give them a share)."

Degrees of necessity

The following table attempts to summarise some of the points discussed above, though it must be stressed that this "guide" is to a large extent speculative and should not be followed blindly, and that the legal situation varies from country to country. But if the circumstances surrounding your corpus project conform mainly to the criteria in the left-hand column of the table, you might consider not bothering with the time-consuming effort of requesting permission - and keeping track of permission granted - whereas if it scores hits in the right-hand column of the table, you should be on your guard.

Need for obtaining permission to include texts in your corpus:
← Relatively low?	Grey area?	Relatively high? →
Corpus used for private study & research within an educational institution	Multiple copies accessible to students & colleagues for study or research within an educational institution	Multiple copies accessible to staff and students for study or research outside the educational institution
Users are able to see only very short concordance lines	Users are able to see hits in the context of a few paragraphs	Users are able to view the entire text of the corpus
Research papers and articles read by a relatively small audience containing very limited citations of concordance lines		Articles read by a wide audience containing extensive citations of concordance lines
Corpus compiled by a freelance translator and used as a translation aid	Corpus compiled by translators within a small company and used as a translation aid	Corpus compiled by translators within a large company and used as a translation aid
Corpus contains relatively small portions (less than a third) of original source text	Corpus contains a substantial proportion of the original source text	Corpus contains the entire source text
Corpus contains texts that are available to the public free of charge		Corpus contains texts that are commercially marketed
Corpus is used for non-commercial purposes	Corpus is used indirectly for commercial gain, e.g. by professional translators to enhance their productivity	Corpus is commercially marketed

Kilgarriff (2002) maintains that "to be unequivocally, completely, totally in the clear you need to get copyright clearance from all copyright holders", although he does go on to say that "the law is in its infancy and there is very little which is obviously right or wrong/legal or illegal". He reveals a more cavalier attitude when he adds that, if the corpus is only for in-house use, then one simple issue is "who will ever know?".

Use it and lose it

A number of translator-trainers (e.g. Varantola 2003; Zanettin 2002) have reported on the use of student-compiled "ad hoc" corpora (also referred to as "virtual", "DIY" or "disposable" corpora) in their courses.


Fig 3: KWIC display for the search pattern do-it-* generated by WordSmith Tools

But why do such corpora need to be disposable? Couldn't they be open-ended collections - constantly added to, updated and revised - and perhaps even pooled amongst the students in a group? Or do some translator-trainers think that "ad hoc" corpora are somehow exempt from the copyright laws? If so, I suspect they are mistaken - compiling a corpus on a "use it and lose it" basis is not a way of getting around the copyright laws, though it does reduce the risk of getting caught.

Requesting Permission

Requesting permission to use texts in a corpus can be a time-consuming process. Not only do you have to keep careful track of whom you have asked for permission and who has granted it, but you also need to take care to compose your letters in such a way that the recipients will bother to reply: the letter shouldn't be too long, but the recipient should clearly understand the nature of your project.
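Keeping track of requests need not be elaborate. As one possible approach (a sketch only, with hypothetical field names and invented example entries), a small log could record each request and let you list which texts are cleared and which are still awaiting a reply:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class PermissionRequest:
    source: str                      # text or website the request concerns
    holder: str                      # copyright holder who was contacted
    requested_on: date               # date the letter or e-mail was sent
    granted: Optional[bool] = None   # None means no reply has arrived yet


def cleared_sources(log: list[PermissionRequest]) -> list[str]:
    """Sources whose copyright holders have explicitly granted permission."""
    return [r.source for r in log if r.granted is True]


def awaiting_reply(log: list[PermissionRequest]) -> list[str]:
    """Sources for which a request was sent but no answer received."""
    return [r.source for r in log if r.granted is None]


# Invented entries, purely for illustration.
log = [
    PermissionRequest("Lakeland brochure", "Tourist Board A", date(2006, 1, 10), True),
    PermissionRequest("City guide website", "Company B", date(2006, 1, 12)),
    PermissionRequest("Hotel leaflet", "Hotel C", date(2006, 1, 15), False),
]
print(cleared_sources(log))
print(awaiting_reply(log))
```

Only the texts returned by `cleared_sources` would then be added to a corpus that requires permission.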

A number of teachers and researchers have expressed their irritation at the time-consuming need to seek permission before using texts in corpora. For example, Cooper (2003) expresses his concern at suggestions that it is necessary or even advisable to obtain permissions, and possibly pay compensation, before using texts in such a way, and points out that although this may be a consistent position for corpus developers who are also publishers, it may unnecessarily discourage researchers in other environments.

References

Cooper, Doug (2003). In Corpora List Archive "Legal aspects of corpora compiling". Online at http://torvald.aksis.uib.no/corpora/2003-1/0596.html

Davies, Mark (2002). In Corpora List Archive "Legal aspects of corpora compiling". Online at http://torvald.aksis.uib.no/corpora/2002-4/0016.html

Hilton, James (2001). "Copyright Assumptions and Challenges" EDUCAUSE Review 36/6 November/December, pp 48-55. Online at http://www.educause.edu/ir/library/pdf/erm0163.pdf

Kilgarriff, Adam (2002). In Corpora List Archive "Legal aspects of corpora compiling". Online at http://torvald.aksis.uib.no/corpora/2002-3/0253.html

McEnery, Tony, Richard Xiao & Yukio Tono (2006). Corpus-Based Language Studies: an advanced resource book. London: Routledge.

Scott, Mike (2004). WordSmith Tools version 4, Oxford University Press.

Varantola, Krista (2003). "Translators and Disposable Corpora", in Zanettin, F., Bernardini, S. and Stewart, D. (eds.) Corpora in Translator Education. Manchester: St Jerome, pp 55-70.

Wilkinson, Michael (2005a). "Using a Specialized Corpus to Improve Translation Quality", in Translation Journal, Volume 9, No 3.
Online at: http://accurapid.com/journal/33corpus.htm

Wilkinson, Michael (2005b). "Discovering Translation Equivalents in a Tourism Corpus by Means of Fuzzy Searching", in Translation Journal, Volume 9, No 4.
Online at: http://accurapid.com/journal/34corpus.htm

Wilkinson, Michael (2006). "Compiling Corpora for use as Translation Resources", in Translation Journal, Volume 10, No 1.
Online at: http://accurapid.com/journal/35corpus.htm

Zanettin, Federico (2002). "DIY Corpora: The WWW and the Translator", in Maia, Belinda / Haller, Jonathan / Ulrych, Margherita (eds.) Training the Language Services Provider for the New Millennium. Porto: Faculdade de Letras, Universidade do Porto, pp 239-248.
Online at: http://www.federicozanettin.net/DIYcorpora.htm

This article was originally published at Translation Journal (http://accurapid.com/journal).
