Building a Legal TM and Glossary from an English-Malay Parallel Corpus
This article is a brief report on a research project currently being carried out at the School of Languages, Literacies and Translation of Universiti Sains Malaysia. The main objective of the project is to build a Translation Memory (TM) of legal texts and a Glossary of legal terminology. Legal texts and documents have one thing in common: repetition. They contain similar language and messages, with recurring phrases and statements that can make up a considerable percentage of the texts in the same genre. Without a translation memory to capture this repeated content for future reuse, translators end up localizing the same phrases time and time again. Terminology, meanwhile, is growing in importance as terms are increasingly adopted by legal entities and organizations. Left unmanaged, terminology becomes inconsistent, leading to translations that contain competing definitions; this lack of consistency means that translations cannot be reused. For the purposes of this project, which started in January 2009, we have been developing a legal English-Malay Parallel Corpus which has currently reached 210,000 words. Legal texts and their translations go through a rigorous check; after selection they are scanned and converted to soft copy using the latest version of Readiris Pro. The latest version of Trados is then used to build the TM and the Glossary.
Translation Memories (TM), as defined by Bowker, are repositories of “source texts segments explicitly aligned with their target texts counterparts” (2002, p. 92). They can be considered data banks from which translators can retrieve already translated segments that match a current segment to be translated. There are generally two ways to build such repositories: using TMs to carry out translations, and using existing translations as input for TMs. The first way is described by Somers (2003) as the simplest method of building TMs, since each sentence is automatically added to the translation memory database as one goes along translating. The second method is to “take an existing translation together with the original text and have the software build a TM database from it automatically” (ibid, p. 34).
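To make the data-bank idea concrete, the retrieval step can be sketched in a few lines. This is a minimal illustration, not the algorithm used by any commercial TM tool; the English-Malay segment pairs and the 75% threshold are invented for demonstration.

```python
import difflib

# A toy translation memory: source segments mapped to their translations.
# (Hypothetical example pairs, not drawn from the project corpus.)
tm = {
    "The tenant shall pay the rent monthly.": "Penyewa hendaklah membayar sewa setiap bulan.",
    "This agreement shall be governed by Malaysian law.": "Perjanjian ini hendaklah ditadbir oleh undang-undang Malaysia.",
}

def lookup(segment, memory, threshold=0.75):
    """Return the best (source, target, score) match above the fuzzy-match threshold."""
    best = None
    for src, tgt in memory.items():
        score = difflib.SequenceMatcher(None, segment.lower(), src.lower()).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (src, tgt, score)
    return best

# A new segment differing in one word still retrieves the stored translation
# as a starting point for the translator.
match = lookup("The tenant shall pay the rent weekly.", tm)
print(match)
```

In practice a fuzzy match like this is presented to the translator for post-editing rather than reused verbatim.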
The only issue which needs to be considered is that of alignment, defined by Bowker (2002, p. 109) as “the process of comparing a source text and its translation, matching the corresponding segments, and binding them together as translation units in a TM”. Automatic alignment tools can be used to carry out this process; however, they have some limitations that must be considered (see Bowker, 2002, pp. 109-110). In this paper we describe our ongoing efforts to build an English-Malay Translation Memory of legal texts and a termbase glossary of legal terminology. We chose to focus on the legal genre because of its importance and the repetitive nature of legal texts and documents.
2 The construction of the English-Malay Parallel Corpus of Legal Texts
Corpora are of paramount importance in translation training and translation evaluation. Parallel corpora “can act as expert systems, drawing the learner’s attention to (un)typical solutions for typical problems found by mature, expert translators” (Bernardini, 2004:20). Moreover, parallel corpora can provide translators with information that bilingual dictionaries may lack: they offer evidence of how professional translators have dealt with the lack of direct equivalence at word level and can thus provide learners with the best possible choices (Zanettin 2002).
Another area which can benefit from the availability of corpora is translation evaluation. According to Lynne Bowker (2000:183), different kinds of corpora “can be used to significantly reduce the subjective element in translation evaluation”. This subjectivity has been the cause of much confusion among evaluators and dissatisfaction among students, who more often than not seek objectivity in the evaluation of their translations.
The English-Malay Parallel Corpus construction process we followed in this project included three stages: designing the corpus, converting the hardcopies to softcopies, and correcting typos. These stages are explained hereunder.
2.1 Design of the corpus
Before venturing into corpus compilation, there were a number of issues we had to address, including the overall length of the corpus and the number and length of the texts to be included. The target size of the parallel corpus being compiled for this research is 500,000 words, with each sub-corpus containing approximately 250,000 words. The languages included are English and Malay, with the direction of translation from English to Malay.
The legal documents and texts are selected, based on availability, from published translations of English legal texts whose source texts are also available. As for the sampling technique, we decided to use the first 25,000 words of each source and translation book. This decision was made to ensure that our parallel corpus covers a variety of domains within the legal genre.
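The sampling rule described above can be expressed as a one-line text operation. This is just a sketch of the cut-off logic under the assumption that words are whitespace-delimited; it is not a tool used in the project.

```python
def sample_first_n_words(text, n=25000):
    """Keep only the first n whitespace-delimited words of a text,
    mirroring the project's rule of sampling the first 25,000 words
    from each source and translation book."""
    return " ".join(text.split()[:n])

# Toy demonstration with a small n; the real cut-off is 25,000 words.
excerpt = sample_first_n_words("one two three four five", n=3)
print(excerpt)  # → one two three
```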
2.2 Conversion of hardcopy to softcopy format
After text selection, we have been converting the hardcopies into machine-readable texts. To do so, we scan the pages using a Canon CanoScan 4400F scanner and then convert the images into editable text using Readiris Pro 11, a professional conversion tool from I.R.I.S.
2.3 Correction of typos
Although conversion tools have improved dramatically in recent years in terms of their final output, they are not 100% reliable; thus, we have had to review the texts to correct any mistakes. The spelling tool of Microsoft Word has helped speed up this review process. As mentioned earlier, the size of the parallel corpus has reached 210,000 words. The processes described above constitute the procedure we have adopted in constructing the first product of the project.
3 Building a Translation Memory from the English-Malay Parallel Corpus of Legal Texts
Translation Memories can be considered “repositories of translation units and their equivalents in the target language” (Teubert 2002:204). TMs are parallel corpora in a sense: they are previously translated texts stored in databases during translation processes, with source text and target text aligned and segments from both languages linked together. As mentioned earlier, TMs can also be constructed from parallel corpora, as we are doing in the current project. TM software can save a great deal of time by retrieving texts and translations similar to the text being translated, thus sparing the translator from starting from scratch (see also Heyn, 1998).
Once a time-consuming procedure, building TMs out of parallel corpora is now easier and faster. The TM building procedure we have been following in this project is explained hereunder.
3.1 Preparing the texts
The preliminary step in building a TM is the preparation of the texts. We saved the text files in RTF format so that the software can process them. This is done through the Save As function of Microsoft Word: we open each file, click Save As, choose Rich Text Format from the Save as type drop-down menu, and click Save. The files are then in RTF format and ready for the next step.
3.2 Aligning the sentences, verifying the alignments and exporting the TM
After constructing the parallel corpus, the first step in building the TM is the alignment of the source- and target-language sub-corpora. Alignment is the process by which source text segments are matched with their counterpart target text segments.
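Automatic aligners commonly rely on the observation that a sentence and its translation tend to have similar lengths. The sketch below is a deliberately simplified, hypothetical illustration of that idea, covering only one-to-one pairings (real aligners, following Gale and Church's length-based method, use dynamic programming to handle 1:2 and 2:1 merges); the English-Malay sentence pairs are invented.

```python
def naive_align(src_sentences, tgt_sentences, max_ratio=1.6):
    """Pair sentences in order and flag pairs whose character-length
    ratio deviates sharply, marking them for human review.

    Returns (source, target, looks_aligned) triples. Only the 1:1 case
    is handled; this is a sketch, not a production aligner.
    """
    pairs = []
    for src, tgt in zip(src_sentences, tgt_sentences):
        ratio = max(len(src), len(tgt)) / max(1, min(len(src), len(tgt)))
        pairs.append((src, tgt, ratio <= max_ratio))
    return pairs

# Invented example segments for illustration.
en = ["The parties agree as follows.", "Notices must be given in writing."]
ms = ["Pihak-pihak bersetuju seperti berikut.", "Notis mesti diberikan secara bertulis."]
for src, tgt, ok in naive_align(en, ms):
    print(ok, "|", src, "=>", tgt)
```

Pairs flagged as suspect would then go to the interactive verification pass that alignment tools provide.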
For the alignment task in this project, we have used SDL Trados WinAlign, an interactive visual alignment tool. WinAlign allows the user to create a TM from existing translated documents or parallel corpora. It determines which parts of the source and target language files belong together and places them side by side. Users can take an interactive part in the alignment process: they are able to optimize the alignment results by modifying alignments and editing text segments directly. After running WinAlign, we choose our source and target texts, which are in fact our source- and target-language sub-corpora. We then click the Align File Names button and the software starts aligning the texts, as shown in Figure 1.
Figure 1 Sentence alignment in WinAlign
Afterwards, the alignments produced by WinAlign are reviewed and the necessary modifications are made. We then save and export the project, which results in the construction of the TM. Later on, as new hardcopies of legal texts are turned into softcopies, we go through the same procedure to construct a TM and then merge it into the main TM.
4 Building the termbase and glossary
The last but not least product of this project will be the termbase and glossary of legal terms, which will be built from the previously constructed TM. A termbase is a database that contains a list of multilingual terms and rules regarding their usage. It increases the accuracy and consistency of every project by standardizing terms and reducing inconsistencies within the translation supply chain, which in turn allows for more efficient and effective translations.
Terminology is typically used in conjunction with a translation memory. Although a flat file can store terms, its ability to offer long-term value is limited, since flat files are not scalable, shareable or embeddable. To achieve maximum flexibility, our termbase needs to be searchable in any direction and allow for an unlimited number of terms, users and languages. This is where terminology management tools come in.
At this stage of the project, terms and their translations will be identified in the TM, extracted and entered into an SDL MultiTerm termbase. The software used for extracting terminology from the TM is SDL MultiTerm Extract, another stand-alone application from SDL. SDL MultiTerm Extract uses a statistical extraction method based on the frequency of appearance of candidate terms: it extracts term candidates and their probable translations found in sentences and presents them as candidate words or phrases.
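The core of frequency-based term extraction can be illustrated with a short sketch. This is only the general statistical idea, not SDL MultiTerm Extract's actual (proprietary) algorithm; the sample sentences, stop list and thresholds are invented.

```python
from collections import Counter
import re

def extract_candidates(corpus_sentences, stop_words, min_freq=2, max_len=3):
    """Count word n-grams (up to max_len words) across the corpus and keep
    the frequent ones as term candidates, skipping n-grams that start or
    end with a stop word. A sketch of frequency-based term extraction."""
    counts = Counter()
    for sent in corpus_sentences:
        tokens = re.findall(r"[a-z]+", sent.lower())
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if gram[0] in stop_words or gram[-1] in stop_words:
                    continue
                counts[" ".join(gram)] += 1
    return [(term, c) for term, c in counts.most_common() if c >= min_freq]

# Invented miniature corpus: "statement of claim" recurs, so it surfaces
# as a multi-word term candidate.
sentences = [
    "The plaintiff filed a statement of claim.",
    "A statement of claim must be served on the defendant.",
]
stops = {"the", "a", "of", "on", "be", "must"}
candidates = extract_candidates(sentences, stops)
print(candidates)
```

A human reviewer then validates or rejects each candidate, exactly as the validation stage described below in section 4.2 does in the real tool.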
The processes of preparing the parallel corpus and building the TM have been ongoing, meaning that TM building started right after the scanning of the first set of books rather than being postponed until the parallel corpus reaches 500,000 words. However, for the final product of the project, the termbase and glossary of legal terminology, we decided to wait until our parallel corpus reaches its target size. This decision was made to prevent multiple entries for a single term. The methodology we will use to build the termbase and glossary is described below.
4.1 Creating a Dictionary Compilation project
As mentioned above, in order to extract the legal terminology from our parallel corpus, we use the SDL MultiTerm Extract application. After running the application and choosing New Project from the File menu, we are presented with five options in the New Project Wizard, one of which is the Dictionary Compilation project. TMX, TMW and TTX are all among the supported file formats for this operation. After selecting Dictionary Compilation project, we add our TM; clicking the Finish button at this point would create the project using the default settings. However, since we are creating a legal termbase and glossary, we need to prevent general terms from entering them. This option is available by clicking Next and using the Excluded Terms page.
Before being able to use the Excluded Terms page, we have to separate general terms from legal terms. To do so, we first need to create a wordlist from the English sub-corpus, using WordSmith Tools 5, an integrated suite of programs for looking at how words behave in texts. The WordList tool of WordSmith lets us see a list of all the words or word clusters in a text, set out in alphabetical or frequency order. After creating the wordlist from the English sub-corpus, we manually go through the list, select all the general terms, and save them in a separate file named excluded terms. Returning to the Excluded Terms page, we add the excluded terms file; SDL MultiTerm Extract ignores any terms that already exist in the file specified under Exclusion Settings. We can also add a stop list defining function words to be excluded from the extraction process as well. After clicking the Finish button, the Completing the New Project Wizard page is displayed. Clicking the Finish button again displays the SDL MultiTerm Extract Confirm dialog box, and clicking Yes starts the project processing operation.
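The wordlist step itself is conceptually simple; the following sketch approximates what a WordList-style tool produces (a frequency-ordered list of words), under the assumption that a word is a run of letters or apostrophes. It is an illustration, not WordSmith's implementation, and the sample sentence is invented.

```python
from collections import Counter
import re

def wordlist(text):
    """Build a frequency wordlist: (word, count) pairs in descending
    frequency order, with ties broken alphabetically."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

sample = "The court may order costs. The order of the court is final."
wl = wordlist(sample)
print(wl)
```

From such a list, high-frequency general words ("the", "of", "is") are easy to spot and move into the excluded terms file, leaving the domain-specific vocabulary behind.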
4.2 Terminology extraction
After creating the project, it is time to extract the terms from the TM we added during project creation. In fact, SDL MultiTerm Extract prompts us to start the process once the project is created. Clicking the Yes button displays the Term Extraction dialog box with a progress bar indicating the completion of the extraction process. Once the progress bar has reached 100%, we click the OK button to view the extracted terms in SDL MultiTerm Extract. SDL MultiTerm Extract also allows us to add more terms to the term candidate list by extracting them manually from the TM, in case we feel some terms have been overlooked.
The SDL MultiTerm Extract interface is divided into four parts (Figure 2): the menu and toolbars, the Term window, the Term Properties window, and the Concordance window. The Term window contains a list of all the terms that SDL MultiTerm Extract has extracted in the current project, allowing us to validate or invalidate each term and its candidate translation. The Term Properties window has three main areas: the term and translation information area, in which we can validate the source language term and its translation, enter a category, synonyms and antonyms for the term, and see information about the word forms, the file name, and the date and time the term was created or modified; the Definition and Note boxes, in which we can enter a definition of the term and add comments and notes about the term or its translation; and the context information area, in which we can add any number of new sentences showing the term in context or generate sentences containing the term from the TM. Finally, the Concordance window displays the extracted terms in context.
Figure 2 MultiTerm Extract interface
Upon validating the terms and adding the extra data for each term, the second stage of building the termbase and glossary is complete.
4.3 Exporting the terms
For the purpose of this project, we will export the terms in two formats: Project Termbase and tab-delimited text. The Project Termbase format is used to run the termbase alongside the TM built earlier, and the tab-delimited text format is used to create the glossary.
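A tab-delimited glossary export is simply one term pair per line with a tab between languages. The sketch below shows the format; the term pairs are hypothetical placeholders, not entries from the project termbase.

```python
import csv
import io

# Hypothetical English-Malay term pairs for illustration only.
glossary = [
    ("plaintiff", "plaintif"),
    ("statement of claim", "pernyataan tuntutan"),
]

# Write a tab-delimited glossary with a header row, as a tool exporting
# to "tab-delimited text" would.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["English", "Malay"])
writer.writerows(glossary)
tsv = buffer.getvalue()
print(tsv)
```

The same file, written to disk instead of an in-memory buffer, can be opened directly in a spreadsheet or re-imported into other terminology tools.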
5 Conclusion
The stages followed in the project for the development of a legal English-Malay Parallel Corpus, Translation Memory and termbase have been explained above; so far, the number of words in our English-Malay Parallel Corpus has reached 210,000. The products of the project have multipurpose applications. The parallel corpus, for example, can be used in translation classrooms and for translation evaluation. The translation memory and the termbase, on the other hand, are important tools for translators of legal documents actively working in the market, improving translation quality, terminology consistency and speed simultaneously.
References
Bernardini, S. (2004). ‘Corpora in the Classroom: An overview and some reflections on the future developments’, in John Sinclair (ed.).
Bowker, L. (2000). ‘A Corpus-Based Approach to Evaluating Student Translations’, in The Translator 6(2): 183-210.
Bowker, L. (2002). Computer-Aided Translation Technology: A Practical Introduction. Canada: University of Ottawa Press.
Heyn, M. (1998). ‘Translation Memories: Insights and Prospects’, in Bowker et al. (eds.).
Scott, M. (2004). Oxford WordSmith Tools version 5. Oxford: Oxford University. Available online: http://www.lexically.net/wordsmith/index.html
Somers, H. (2003). Computers and Translation: A Translator’s Guide. Amsterdam and Philadelphia: John Benjamins.
Teubert, W. (2002). ‘Corpus-based Bilingual Lexicography: The role of parallel corpora in translation and multilingual lexicography’, in Bengt Altenberg and S. Granger (eds.).
Zanettin, F. (2002). ‘Corpora in translation practice’, in Elia Yuste-Rodrigo (ed.) Language Resources for Translation Work and Research, LREC 2002 Workshop Proceedings, Las Palmas de Gran Canaria, 10-14.
Tengku Sepora Tengku Mahadi, Helia Vaezian, Mahmoud Akbari, Nor Aini Ali and Chew Saw Cheng (2009). ‘Building a Legal TM and Glossary from an English-Malay Parallel Corpus’, in H. Che Omar, H. Haroon & A. Abd. Ghani (eds) The Sustainability of the Translation Field: The 12th International Conference on Translation, Kuala Lumpur: Malaysian Translators Association, pp. 362-369.
Published - October 2009