Evaluation of machine translation
By Wikipedia,
the free encyclopedia,
http://en.wikipedia.org/wiki/Evaluation_of_machine_translation
Various methods for the evaluation of machine translation
have been employed. This article focuses on the evaluation
of the output of machine
translation, rather than on performance or usability
evaluation.
Before covering the large-scale studies, a brief comment
will be made on one of the more pervasive evaluation techniques,
that of round-trip
translation (or "back translation"). One of the typical
ways for lay people to assess the quality of a machine translation
engine is through translating from a source language into
a target language, and then back to the source language
using the same engine.
Round-trip translation
Although this may intuitively be a good method of evaluation,
it has been shown that round-trip translation is a "poor
predictor of quality".[1]
The reason why it is such a poor predictor of quality is
reasonably intuitive. When a round-trip translation is performed,
the method is not testing one system, but two: the
language pair of the engine for translating into
the target language, and the language pair translating back
from the target language.
Consider the following examples of round-trip translation
performed from English
to Italian
and Portuguese
from Somers (2005):
| Original text   | Select this link to look at our home page. |
| Translated      | Selezioni questo collegamento per guardare il nostro Home Page. |
| Translated back | Selections this connection in order to watch our Home Page. |

| Original text   | Tit for tat |
| Translated      | Melharuco para o tat |
| Translated back | Tit for tat |
In the first example, where the text is translated into
Italian
and then back into English,
although the English text is significantly garbled, the
Italian is a serviceable translation. In the second example,
although the text that is translated back into English is
perfect, the Portuguese
translation is meaningless.
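The point that a round trip exercises two systems rather than one can be sketched in a few lines. The two lookup tables below are hypothetical stand-ins for the forward and backward directions of an engine (they are not a real MT system), filled only with the example strings from Somers (2005) quoted above.

```python
# Hypothetical stand-ins for the two directions of an MT engine.
# The strings are the examples quoted above; nothing here calls a real system.
FORWARD = {
    "Select this link to look at our home page.":
        "Selezioni questo collegamento per guardare il nostro Home Page.",
    "Tit for tat": "Melharuco para o tat",
}
BACKWARD = {
    "Selezioni questo collegamento per guardare il nostro Home Page.":
        "Selections this connection in order to watch our Home Page.",
    "Melharuco para o tat": "Tit for tat",
}


def round_trip(text):
    """Translate forward, then translate the result back."""
    return BACKWARD[FORWARD[text]]


def round_trip_is_exact(text):
    """The naive round-trip check: does the text survive unchanged?"""
    return round_trip(text) == text
```

Under this check the meaningless Portuguese output passes and the serviceable Italian output fails, because the result depends on the backward system as much as on the forward one.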
While round-trip translation may be useful in order to
generate a "surplus of fun"[2],
the methodology is deficient for any serious study of the
quality of machine translation output.
Human evaluation
This section covers two of the large-scale evaluation
studies that have had a significant impact on the field.
The ALPAC 1966 study is presented first, followed by the
ARPA study.[3]
Automatic Language Processing Advisory
Committee (ALPAC)
One of the constituent parts of the ALPAC report was a
study comparing different levels of human translation with
machine translation output, using human subjects as judges.
The human judges were specially trained for the purpose.
The evaluation study compared an MT system translating from
Russian
into English
with human translators, on two variables.
The variables studied were "intelligibility" and "fidelity".
Intelligibility was a measure of how "understandable" the
sentence was, and was measured on a scale of 1 to 9. Fidelity
was a measure of how much information the translated sentence
retained compared to the original, and was measured on a
scale of 0 to 9. Each point on the scale was associated with
a textual description. For example, 3 on the intelligibility
scale was described as "Generally unintelligible; it tends
to read like nonsense but, with a considerable amount of
reflection and study, one can at least hypothesize the idea
intended by the sentence".[4]
Intelligibility was measured without reference to the original,
while fidelity was measured indirectly. The translated sentence
was presented, and after reading it and absorbing the content,
the original sentence was presented. The judges were asked
to rate the original sentence on informativeness. So, the
more informative the original sentence, the lower the quality
of the translation.
The study showed that the variables were highly correlated
when the human judgement was averaged per sentence. The
variation
among raters was small, but the researchers recommended
that at the very least, three or four raters should be used.
The evaluation methodology managed to separate translations
by humans from translations by machines with ease.
The study concluded that "highly reliable assessments
can be made of the quality of human and machine translations".[5]
Advanced Research Projects Agency
(ARPA)
As part of the Human Language Technologies Program, the
Advanced Research Projects Agency (ARPA) created a methodology
to evaluate machine translation systems. The evaluation
programme was instigated in 1991 and continues to perform
evaluations based on this methodology. Details of the programme
can be found in White et al. (1994) and White (1995).
The evaluation programme involved testing several systems
based on different theoretical approaches: statistical,
rule-based and human-assisted. A number of methods for the
evaluation of the output from these systems were tested
in 1992 and the most recent suitable methods were selected
for inclusion in the programmes for subsequent years. The
methods were comprehension evaluation, quality panel evaluation,
and evaluation based on adequacy and fluency.
Comprehension evaluation aimed to directly compare systems
based on the results from multiple choice comprehension
tests, as in Church et al. (1993). The texts chosen were
a set of articles in English on the subject of financial
news. These articles were translated by professional translators
into a series of language pairs, and then translated back
into English using the machine translation systems. This
was judged inadequate as a standalone method
of comparing systems and was abandoned, owing to issues
with the modification of meaning in the process of translating
from English.
The idea of quality panel evaluation was to submit translations
to a panel of expert native English speakers who were professional
translators and have them evaluate the translations. The evaluations
were done on the basis of a metric, modelled on a standard
US government metric used to rate human translations. This
was good from the point of view that the metric was "externally
motivated"[6],
since it was not specifically developed for machine translation.
However, the quality panel evaluation was very difficult
to set up logistically, as it necessitated having a number
of experts together in one place for a week or more, and
furthermore for them to reach consensus. This method was
also abandoned.
Along with a modified form of the comprehension evaluation
(re-styled as informativeness evaluation), the most popular
method was to obtain ratings from monolingual judges for
segments of a document. The judges were presented with a
segment, and asked to rate it for two variables, adequacy
and fluency. Adequacy is a rating of how much information
is transferred between the original and the translation,
and fluency is a rating of how good the English is. This
technique was found to cover the relevant parts of the quality
panel evaluation, while at the same time being easier to
deploy, as it didn't require expert judgement.
Measuring systems based on adequacy and fluency, along
with informativeness, is now the standard methodology for
the ARPA evaluation program.[7]
Automatic evaluation
In the context of this article, a metric
will be understood as a measurement. A metric for the evaluation
of machine translation output is a measurement of the quality
of the output. The quality of a translation is inherently
subjective; there is no objective or quantifiable "good".
Therefore, the task for any metric is to assign scores of
quality in such a way that they correlate with human judgement
of quality. That is, a metric should score highly those
translations which humans score highly, and give low scores
to those which humans give low scores to. Human judgement
is used as the benchmark for assessing the automatic metrics
as humans are the end-users of any translation output.
The measure of evaluation for metrics is correlation
with human judgement. This is generally done at two levels.
At the sentence level, scores are calculated by the
metric for a set of translated sentences, and then correlated
against human judgement for the same sentences. At the
corpus level, scores over the sentences are aggregated
for both human judgements and metric judgements, and these
aggregate scores are then correlated. Figures for correlation
at the sentence level are rarely reported, although Banerjee
and Lavie (2005) do give correlation figures which show that,
at least for their metric, sentence level correlation is
substantially worse than corpus level correlation.
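The two levels of correlation can be illustrated with a minimal sketch. It assumes Pearson correlation (the text does not say which coefficient is used) and assumes that corpus-level aggregation means one mean score per MT system; the function names and data layout are illustrative only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def sentence_level_correlation(metric_scores, human_scores):
    """Correlate metric and human scores sentence by sentence."""
    return pearson(metric_scores, human_scores)


def corpus_level_correlation(metric_by_system, human_by_system):
    """Aggregate sentence scores per system (assumed: the mean), then correlate.

    Both arguments map a system name to its list of sentence scores.
    """
    systems = sorted(metric_by_system)
    metric_means = [sum(metric_by_system[s]) / len(metric_by_system[s])
                    for s in systems]
    human_means = [sum(human_by_system[s]) / len(human_by_system[s])
                   for s in systems]
    return pearson(metric_means, human_means)
```

Aggregating first averages out per-sentence noise, which is one plausible reason corpus-level figures tend to look better than sentence-level ones.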
While not widely reported, it has been noted that the genre,
or domain, of a text has an effect on the correlation obtained
when using metrics. Coughlin (2003) reports that comparing
the candidate text against a single reference translation
does not adversely affect the correlation of metrics when
working in a restricted domain text.
Even if a metric is shown to correlate well with human
judgement in one study, on one corpus, it does not follow
that this correlation will carry over to another corpus.
Good performance of a metric, across text types or domains,
is important for the reusability of the metric. A metric
that only works for text in a specific domain is useful,
but less useful than one that works across many domains,
for the reason that the necessity to create a new metric
for every new evaluation or domain is undesirable.
Another important factor in the usefulness of an evaluation
metric is to have good correlation, even when working with
small amounts of data, that is, candidate sentences and reference
translations. Turian et al. (2003) point out that "Any
MT evaluation measure is less reliable on shorter translations",
and show that increasing the amount of data improves the
reliability of a metric. However, they add that "... reliability
on shorter texts, as short as one sentence or even one phrase,
is highly desirable because a reliable MT evaluation measure
can greatly accelerate exploratory data analysis".[8]
Banerjee and Lavie (2005) highlight five attributes that a
good automatic metric must possess: correlation, sensitivity,
consistency, reliability and generality. Any good metric
must correlate highly with human judgement, and it must be
consistent, giving similar results to the same MT system on
similar text. It must be sensitive to differences between MT
systems and reliable in that MT systems that score similarly should
be expected to perform similarly. Finally, the metric must
be general, that is, it should work with different text
domains, in a wide range of scenarios and MT tasks.
The aim of this subsection is to give an overview of the
state of the art in automatic metrics for evaluating machine
translation.[9]
BLEU
BLEU was one of the first metrics to report high correlation
with human judgements of quality. The metric is currently
one of the most popular in the field. The central idea behind
the metric is that "the closer a machine translation is
to a professional human translation, the better it is".[10]
The metric calculates scores for individual segments, generally
sentences, and then averages these scores over the whole
corpus in order to reach a final score. It has been shown
to correlate highly with human judgements of quality at
the corpus level.[11]
BLEU uses a modified form of precision to compare a candidate
translation against multiple reference translations. The
metric modifies simple precision since machine translation
systems have been known to generate more words than appear
in a reference text.
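The modified precision can be sketched as follows: each candidate n-gram count is clipped at the largest count of that n-gram in any single reference, so a candidate cannot gain credit by repeating a word. This is a minimal illustration of that one component only. It assumes whitespace-tokenised input and leaves out the brevity penalty and the combination of several n-gram orders used in the full metric.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n=1):
    """Clipped n-gram precision of a candidate against several references.

    candidate is a list of tokens; references is a list of token lists.
    """
    candidate_counts = Counter(ngrams(candidate, n))
    if not candidate_counts:
        return 0.0
    # Largest count of each n-gram seen in any one reference.
    max_reference_counts = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference, n)).items():
            max_reference_counts[gram] = max(max_reference_counts[gram], count)
    # Clip each candidate count at the reference maximum.
    clipped = sum(min(count, max_reference_counts[gram])
                  for gram, count in candidate_counts.items())
    return clipped / sum(candidate_counts.values())
```

For the candidate "the the the the the the the" with references "the cat is on the mat" and "there is a cat on the mat", simple unigram precision would be 7/7, while the clipped version gives 2/7.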
NIST
The NIST metric is based on the BLEU
metric, but with some alterations. Where BLEU
simply calculates n-gram precision, giving equal weight
to each n-gram, NIST also calculates
how informative a particular n-gram
is. That is to say, when a correct n-gram
is found, the rarer that n-gram is, the more weight it will
be given.[12]
For example, if the bigram "on the" is correctly matched,
it will receive lower weight than the correct matching of
bigram "interesting calculations", as this is less likely
to occur. NIST also differs from BLEU
in its calculation of the brevity penalty insofar as small
variations in translation length do not impact the overall
score as much.
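A minimal sketch of the weighting idea follows. It assumes the information weight of an n-gram is the base-2 logarithm of the count of its (n-1)-word prefix divided by the count of the full n-gram, with counts taken over the reference translations (for unigrams, the numerator is the total word count). Only the weights are computed here, not the full NIST score or its brevity penalty, and the function names are illustrative.

```python
from collections import Counter
from math import log2


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def information_weights(reference_sentences, n):
    """Information weight for every n-gram found in the references.

    reference_sentences is a list of token lists. Assumed formula:
    log2(count of the (n-1)-gram prefix / count of the n-gram).
    """
    ngram_counts = Counter()
    prefix_counts = Counter()
    total_words = 0
    for sentence in reference_sentences:
        total_words += len(sentence)
        ngram_counts.update(ngrams(sentence, n))
        if n > 1:
            prefix_counts.update(ngrams(sentence, n - 1))
    weights = {}
    for gram, count in ngram_counts.items():
        context = prefix_counts[gram[:-1]] if n > 1 else total_words
        weights[gram] = log2(context / count)
    return weights
```

In a toy reference set where "on" is always followed by "the", the bigram "on the" carries a weight of zero, whereas "interesting calculations" carries a positive weight if "interesting" is also followed by other words.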
Word error rate
The Word error rate (WER) is a metric based on the Levenshtein
distance; where the Levenshtein distance works at the
character level, WER works at the word level. It was originally
used for measuring the performance of speech
recognition systems, but is also used in the evaluation
of machine translation. The metric is based on the calculation
of the number of words which differ between a piece of machine
translated text and a reference translation.
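As a minimal sketch, WER can be computed with the usual dynamic-programming edit distance applied to words instead of characters. The sketch assumes whitespace tokenisation and normalisation by the length of the reference, which is a common convention but is not stated in the text above.

```python
def word_error_rate(hypothesis, reference):
    """Word-level Levenshtein distance divided by the reference length."""
    hyp_words = hypothesis.split()
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # previous[j]: edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1] / len(ref_words)
```

Because extra words in the hypothesis count as errors, the value can exceed 1 when the hypothesis is much longer than the reference.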
A related metric is the Position-independent word error
rate (PER), which allows for re-ordering of words and sequences
of words between a translated text and a reference translation.
METEOR
The METEOR metric is designed to address some of the deficiencies
inherent in the BLEU metric. The metric is based on the
weighted harmonic
mean of unigram precision and unigram recall. The metric
was designed after research by Lavie et al. (2004) into the significance
of recall in evaluation metrics. Their research showed that
metrics based on recall consistently achieved higher correlation
than those based on precision alone, cf. BLEU and NIST.[13]
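The weighted harmonic mean can be sketched as below. The sketch assumes exact word matching on whitespace tokens and a default recall weight of 9 (that is, 10PR / (R + 9P)); the parameter name and default are assumptions for illustration. It covers only this core of the score, not the synonym or stem matching, nor any other component of the full metric.

```python
from collections import Counter


def unigram_f_mean(candidate, reference, recall_weight=9.0):
    """Weighted harmonic mean of unigram precision and unigram recall.

    A larger recall_weight (an assumed parameter) makes recall count more.
    """
    candidate_words = candidate.split()
    reference_words = reference.split()
    # Exact-match unigrams, counting each reference word at most once.
    matches = sum((Counter(candidate_words) & Counter(reference_words)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(candidate_words)
    recall = matches / len(reference_words)
    return ((1 + recall_weight) * precision * recall
            / (recall + recall_weight * precision))
```

With a recall weight of 1 this reduces to the ordinary balanced F-score, so the parameter makes the emphasis on recall explicit.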
METEOR also includes some other features not found in other
metrics, such as synonymy matching, where instead of matching
only on the exact word form, the metric will also match
on synonyms. For example, if the word "good" appears in
the reference and the word "well" appears in the translation,
this will be counted as a match. The metric also includes
a stemmer, which lemmatises words and matches on the lemmatised
forms. The implementation of the metric is modular insofar
as the algorithms that match words are implemented as modules,
and new modules that implement different matching strategies
may easily be added.
Notes
[1] Somers (2005)
[2] Gaspari (2006)
[3] White et al. (1994)
[4] ALPAC (1966)
[5] ALPAC (1966)
[6] White et al. (1994)
[7] White (1995)
[8] Turian et al. (2003)
[9] While the metrics are described as for the evaluation
of machine translation, in practice they may also be
used to measure the quality of human translation. The
same metrics have even been used for plagiarism detection;
for details see Somers et al. (2006).
[10] Papineni et al. (2002)
[11] Papineni et al. (2002), Coughlin (2003)
[12] Doddington (2002)
[13] Lavie et al. (2004)
References
- Banerjee, S. and Lavie, A. (2005) "METEOR: An Automatic
Metric for MT Evaluation with Improved Correlation with
Human Judgments" in Proceedings of Workshop on Intrinsic
and Extrinsic Evaluation Measures for MT and/or Summarization
at the 43rd Annual Meeting of the Association for Computational
Linguistics (ACL-2005), Ann Arbor, Michigan, June 2005
- Church, K. and Hovy, E. (1993) "Good Applications for
Crummy Machine Translation". Machine Translation,
8 pp. 239--258
- Coughlin, D. (2003) "Correlating Automated and Human
Assessments of Machine Translation Quality" in MT Summit
IX, New Orleans, USA pp. 23--27
- Doddington, G. (2002) "Automatic evaluation of machine
translation quality using n-gram cooccurrence statistics".
Proceedings of the Human Language Technology Conference
(HLT), San Diego, CA pp. 128--132
- Gaspari, F. (2006) "Look Who's Translating. Impersonations,
Chinese Whispers and Fun with Machine Translation on the
Internet" in Proceedings of the 11th Annual Conference
of the European Association for Machine Translation
- Lavie, A., Sagae, K. and Jayaraman, S. (2004) "The Significance
of Recall in Automatic Metrics for MT Evaluation" in Proceedings
of AMTA 2004, Washington DC. September 2004
- Papineni, K., Roukos, S., Ward, T., and Zhu, W. J. (2002).
"BLEU: a method for automatic evaluation of machine translation"
in ACL-2002: 40th Annual meeting of the Association
for Computational Linguistics pp. 311--318
- Somers, H. (2005) "Round-trip Translation: What Is It Good For?"
- Somers, H., Gaspari, F. and Niño, A. (2006) "Detecting
Inappropriate Use of Free Online Machine Translation by
Language Students - A Special Case of Plagiarism Detection".
Proceedings of the 11th Annual Conference of the European
Association for Machine Translation, Oslo University (Norway)
pp. 41--48
- ALPAC (1966) "Languages and machines: computers in translation
and linguistics". A report by the Automatic Language Processing
Advisory Committee, Division of Behavioral Sciences, National
Academy of Sciences, National Research Council. Washington,
D.C.: National Academy of Sciences, National Research
Council, 1966. (Publication 1416.)
- Turian, J., Shen, L. and Melamed, I. D. (2003) "Evaluation
of Machine Translation and its Evaluation". Proceedings
of the MT Summit IX, New Orleans, USA, 2003 pp. 386--393
- White, J., O'Connell, T. and O'Mara, F. (1994) "The
ARPA MT Evaluation Methodologies: Evolution, Lessons,
and Future Approaches". Proceedings of the 1st Conference
of the Association for Machine Translation in the Americas.
Columbia, MD pp. 193--205
- White, J. (1995) "Approaches to Black Box MT Evaluation".
Proceedings of MT Summit V