An Analysis of Google Translate Accuracy
Although not appropriate for all situations, machine translation (MT) is now being used by many translators to aid their work. Many others use MT to get a quick grasp of foreign text from email, Web pages, or other computer-based material which they would not otherwise understand. Free, Web-based MT services are available to assist with this task, but relatively few studies have analyzed their accuracies. In particular, to our knowledge, there has been no comprehensive analysis of how well Google Translate (GT) performs, perhaps the most used system. Here, we investigate the translation accuracy of 2,550 language-pair combinations provided by this software. Results show that the majority of these combinations provide adequate comprehension, but translations among Western languages are generally best and those among Asian languages are often poor.
Even if a human translator is available, results can be obtained from MT much quicker. One study (Ablanedo, et al., 2007) found that a free Web-based MT system was 195 times faster than humans. Further, MT and human translation are not mutually exclusive. Once the reader has skimmed the results from the software, he or she might pay a professional if a more accurate translation is required.
Several free Web-based MT systems are available, including:
However, few studies have comprehensively evaluated their translation accuracy. One study (Bezhanova, et al., 2005) compared three systems and found LogoMedia to be best, followed by PROMT, and SYSTRAN. Another study (Aiken, et al., 2009a) compared four systems and found that Google Translate was best, followed by Yahoo, X10, and Applied Language. Finally, an NIST comparison of 22 MT systems in 2005 (many not free or Web-based) found that GT was often first and never lower than third in the rankings using text translated from Arabic to English and from Chinese to English. More detailed studies have been made of individual systems, e.g. Yahoo-SYSTRAN (Aiken, et al., 2006), but we believe Google Translate is used more frequently, provides more language-pair combinations, and is probably more accurate overall, and therefore, we will be focusing on it (Aiken & Ghosh, 2009; Och, 2009).
Automatic Evaluation of GT
A few studies have attempted to assess GT’s accuracy (e.g., Aiken, et al., 2009b), but to our knowledge, there has been no published, comprehensive evaluation of Google Translate’s accuracy, i.e., an analysis of all language pairs. Because it is impractical to obtain human translators for all 51 languages to analyze 50 passages of text each, automatic evaluation is necessary.
Although several techniques exist, BLEU (Bilingual Evaluation Understudy) is perhaps the most common (Papineni, et al., 2002), and some studies have shown that it achieves a high correlation with human judgments of quality (Coughlin, 2003; Culy & Richemann, 2003). Using this technique, scores ranging from 0 to 100 are calculated for a sample of translated text by comparing it to a reference translation, and the method takes into account the number of words that are the same in the two passages as well as the word order.
To help judge the appropriateness of using BLEU to analyze 2,550 (51 x 50) language pairs, we had two independent evaluators assess the comprehensibility of 50 non-English text samples translated to English with GT. (Note: At the time of the evaluation, GT supported only 51 languages. Armenian, Azerbaijani, Basque, Georgian, Haitian Creole, Latin, and Urdu have since been added to make a total of 58 languages supported.) The equivalent text for each of the following sentences (Flesch Reading Ease score of 81.6 on a scale of 0=hard to100=easy, Flesch-Kincaid grade level of 3.6 on a scale of 1 to 14) was obtained for all of the 50 non-English languages from Omniglot:
The scores from Evaluator 1 and Evaluator 2 were significantly correlated (R = 0.527, p < 0.001), and the scores from each evaluator were also significantly correlated with the BLEU scores calculated using Asia Online’s Language Studio Lite software (Evaluator 1: R = 0.789, p < 0.001; Evaluator 2: R = 0.506, p < 0.001). In addition, BLEU scores were significantly correlated (R = 0.499, p = 0.003) with comprehension measures from Aiken & Vanjani (2009) (R = 0.447, p = 0.010). Thus, we believe BLEU scores give a good indication of how well humans would understand the translated text.
We used Language Studio Lite to calculate BLEU scores for each of 2,550 translations obtained with GT for the text above. For example, the equivalent text for the six items was retrieved for French, translated to German with GT, and then, the BLEU score was calculated to see how well the translation matched the equivalent German phrases. The table of 2,550 BLEU scores can be found HERE. In this table, the source languages appear as row headings, and the destination languages appear as column headings. For example, the text in Arabic (source) was translated to Afrikaans (destination), and this translation was compared to the equivalent text in Afrikaans, resulting in a BLEU score of 46.
Next, we averaged the translations for each language pair to obtain an overall measure of how understandable translations between two languages would be. For example, the BLEU score for Icelandic to Bulgarian is 42, and the score for Bulgarian to Icelandic is 49, giving a final average of 45.5. The sorted list of 1,275 combined language pairs can be found HERE. (Note: The language order in this list is random. That is, Japanese-Malay is the same as Malay-Japanese.)
The language pair ranking gives an indication of the relative accuracy of translations among various language pairs, but it does not provide information on how adequate they are. For example, would translations between a pair of languages with an average BLEU score of 50.0 be sufficiently understandable, or is a minimum score of 70.0 required? Obviously, a higher standard should be placed on translations of legal, financial, medical, or other critical information, and a lower standard is probably alright for gathering the gist of informal, relatively unimportant material.
One standard is the reading comprehension score from the Test of English as a Foreign Language (TOEFL) required by many universities in the United States for students whose primary language is not English. For example, UCLA’s graduate program requires a minimum score of 21 out of 30, while Auburn’s MBA program requires a minimum of 16. In one study (Aiken, et al., 2011), 75 American students whose primary language was English took reading comprehension tests comprised of TOEFL passages in Chinese, German, Hindi, Korean, Malay, and Spanish translated to English using Google Translate. Results showed an average reading score of 21.90, just above the 21 minimum required by UCLA’s graduate program, indicating that the comprehension of these translations was, on average, sufficient for material that a graduate student might encounter during the course of studies. The corresponding average BLEU score for these six tests was 19.67.
However, the material in our analysis was easier (Flesch Reading Ease = 81.6: Grade = 3.6 versus the TOEFL tests’ Reading Ease = 63.5 and Grade Level = 8.3), and therefore, the average BLEU score of the six items above translated to English from Chinese, German, Hindi, Korean, Malay, and Spanish was much higher (58.83). If we assume a linear relationship between the BLEU and TOEFL reading comprehension scores, the corresponding TOEFL score for the six items translated to English from the six languages would be 24.5 out of 30. The adjusted minimum TOEFL reading scores for this easier material for UCLA would be 26.2 and for Auburn would be 20.0. Using 26.2 as the standard, 737 of the 1,275 language combinations would be sufficient for comprehension of graduate college material at UCLA, and 865 would be sufficient using Auburn’s MBA standard.
Although Google Translate provides translations among a large number of languages, the accuracies vary greatly. This study gives for the first time an estimate of how good a potential translation might be using the software. Our analysis shows that translations between European languages are usually good, while those involving Asian languages are often relatively poor. Further, the vast majority of language combinations probably provide sufficient accuracy for reading comprehension in college.
There are several limitations to the study, however. First, a very limited text sample was used due to the difficulty of acquiring equivalent text for 50 different languages. Other, more complicated text samples are likely to result in lower BLEU scores. On the other hand, only one reference text was used in the calculations, again, due to the problem of obtaining similar passages. That is, each translation was compared to only one "correct" result. Other acceptable translations using alternative wording and synonyms would result in higher BLEU scores. Finally, human judgments of comprehension are usually preferable to automatic evaluation, but in this case, it was impractical due to the many language combinations that had to be assessed.
Finally, this evaluation is not static. Google Translate continually adds new languages, and the existing language translation algorithm is constantly improved as the software is trained with additional text and volunteers correct mistranslations. Although its performance is never likely to reach the level of an expert human’s, it can provide quick, cheap translations for unusual language pairs.
Published - June 2011