5. bleu

BLEU: a Method for Automatic
Evaluation of Machine Translation
(BiLingual Evaluation Understudy)
Kishore Papineni, Salim Roukos, Todd
Ward, and Wei-Jing Zhu
Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics (ACL),
Philadelphia, July 2002, pp. 311-318

Viewpoint
• The idea: the closer a machine translation is to a
professional human translation, the better it is.
• To judge translation quality, closeness must be measured
– Numerical metric
• So an MT evaluation system requires:
1. A numerical “translation closeness” metric
2. A corpus of good quality human reference translations
• The closeness metric is modeled on the word error rate metric used in speech recognition
– Idea: use a weighted average of variable-length phrase
matches against the reference translations

Baseline BLEU Metric
• The primary programming task for a BLEU
implementor is to compare n-grams of the
candidate with the n-grams of the reference
translation and count the number of matches

• So, we look at computing unigram matches

n-gram precision
• Precision measure
– Counts the number of candidate translation words
(unigrams) that occur in any reference translation, then
divides by the total number of words in the candidate
translation
• However, MT systems can overgenerate “reasonable” words,
producing improbable but high-precision translations
– Fix: a reference word should be considered exhausted once a
matching candidate word has been identified
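
The weakness of plain precision can be seen in a minimal sketch. The function name `unigram_precision` is hypothetical, and the sentences are lowercased, whitespace-tokenized versions of the overgeneration example from the BLEU paper; a non-empty candidate is assumed.

```python
def unigram_precision(candidate, references):
    """Standard (unclipped) unigram precision of a tokenized candidate."""
    # Every word that appears in at least one reference translation.
    reference_words = {word for ref in references for word in ref}
    # Each candidate word counts as a match, however often it repeats.
    matches = sum(1 for word in candidate if word in reference_words)
    return matches / len(candidate)


candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
print(unigram_precision(candidate, references))  # 1.0
```

All seven candidate words occur in a reference, so the degenerate candidate gets a perfect score. This is the behavior the "exhausted reference word" rule is meant to prevent.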

Modified n-gram precision
• Modified unigram precision
– Counts the maximum number of times a word occurs in any single reference
translation
– Clips the total count of each candidate word by its maximum reference count
– Adds these clipped counts up
– Divides by the total (unclipped) number of candidate words
• Modified n-gram precision
– All candidate n-gram counts & corresponding maximum reference counts are
collected
– The candidate counts are clipped by their corresponding reference maximum
value, summed and divided by the total number of candidate n-grams
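
The clipping steps listed above could be sketched as follows. The helper names `ngrams` and `modified_precision` are illustrative, not taken from the paper, and tokenization is assumed to be a plain whitespace split.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against its references."""
    cand_counts = Counter(ngrams(candidate, n))
    # Maximum count of each n-gram in any single reference translation.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Clip each candidate count by its reference maximum, then sum.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    total = sum(cand_counts.values())  # unclipped candidate n-gram count
    return clipped / total if total else 0.0


candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
print(modified_precision(candidate, references, 1))  # 2/7, about 0.286
print(modified_precision(candidate, references, 2))  # 0.0
```

"the" occurs at most twice in any single reference, so the seven candidate occurrences are clipped to 2. Returning 0.0 when the candidate has no n-grams is a convention chosen for this sketch.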

Modified n-gram precision on text
blocks
• Basic unit of evaluation is the sentence
• Compute the n-gram matches sentence by sentence
• Add clipped n-gram counts for all the candidate sentences
• Divide by the number of candidate n-grams in the test corpus to compute
a modified precision score
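
A sketch of the block-level computation, assuming each candidate sentence comes paired with a non-empty list of tokenized references. The names `ngram_counts` and `corpus_modified_precision` are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Counter of the contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_modified_precision(candidates, reference_sets, n):
    """Sum clipped counts over all sentences, then divide once."""
    clipped_total = 0
    candidate_total = 0
    for candidate, references in zip(candidates, reference_sets):
        cand = ngram_counts(candidate, n)
        ref_counts = [ngram_counts(ref, n) for ref in references]
        for gram, count in cand.items():
            # Matching is done sentence by sentence.
            max_ref = max(rc[gram] for rc in ref_counts)
            clipped_total += min(count, max_ref)
        candidate_total += sum(cand.values())
    return clipped_total / candidate_total if candidate_total else 0.0


refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
candidates = ["the the the the the the the".split(),
              "the cat is on the mat".split()]
reference_sets = [refs, refs]
print(corpus_modified_precision(candidates, reference_sets, 1))  # 8/13
```

The score is a ratio of sums (2 + 6 clipped matches over 7 + 6 candidate unigrams), not an average of per-sentence precisions, so longer sentences contribute more.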

Ranking systems
• Compared a (good) human translation with a (poor) machine translation
• 4 reference translations for each of 127 source sentences
• Result:

• From this result:
– Any single n-gram precision score can distinguish a good translation from a bad one
• To be useful, the metric must also distinguish between translations that do not differ so
greatly in quality, such as two human translations

Ranking systems
• Translations produced by:
– A human lacking native proficiency in both the source and target languages (SL/TL)
– A native English speaker
– Three commercial systems

• Result:
– The ordering of the systems by modified n-gram precision is the
same as the rank order assigned by human judges

Combining the modified n-gram
precisions
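• The results on the previous slide show:
– Modified n-gram precision decays roughly exponentially with n
– Modified unigram precision > bigram precision > trigram precision
• BLEU therefore uses the average logarithm with uniform
weights, which is equivalent to the geometric mean of the
modified n-gram precisions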

Recall
• BLEU considers multiple reference translations,
each of which may use a different word choice
to translate the same source word.
• A good candidate translation will only use
(recall) one of these possible choices, but not
all. Indeed, recalling all choices leads to a bad
translation

Sentence brevity penalty
• Candidate translations longer than their references are already penalized by the
modified n-gram precision measure, so they need no further penalty
• Brevity penalty factor (multiplicative):
– With it, a high-scoring candidate translation must match the reference translations in
length, in word choice and in word order
• The brevity penalty is 1.0 when the candidate’s length is the same as any reference translation’s length.
• c: the length of the candidate translation
• r: the effective reference corpus length
• Brevity penalty: BP = 1 if c > r, and BP = exp(1 - r/c) if c ≤ r
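
A minimal sketch of the brevity penalty, with r obtained by summing, for each candidate sentence, the length of the reference closest to it in length. The function names are hypothetical. Two choices here are assumptions of this sketch rather than statements from the slides: ties between equally close references go to the shorter one, and an empty candidate corpus gets a penalty of 0.0.

```python
import math


def effective_reference_length(candidates, reference_sets):
    """Sum of best-match reference lengths, one per candidate sentence."""
    total = 0
    for candidate, references in zip(candidates, reference_sets):
        c_len = len(candidate)
        # Closest reference length; ties go to the shorter one (assumption).
        total += min((len(ref) for ref in references),
                     key=lambda r_len: (abs(r_len - c_len), r_len))
    return total


def brevity_penalty(c, r):
    """1 if the candidate corpus is longer than r, else exp(1 - r/c)."""
    if c > r:
        return 1.0
    if c == 0:
        return 0.0  # convention for an empty candidate corpus (assumption)
    return math.exp(1 - r / c)


candidates = ["the cat sat on the mat today".split()]
reference_sets = [["the cat is on the mat".split(),
                   "there is a cat on the mat".split()]]
r = effective_reference_length(candidates, reference_sets)
c = sum(len(sentence) for sentence in candidates)
print(r, c, brevity_penalty(c, r))  # 7 7 1.0
```

The penalty is computed once over the whole corpus rather than averaged per sentence, which avoids punishing short sentences too harshly.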

BLEU details
• Take the geometric mean of the test corpus’ modified precision scores and
then multiply the result by an exponential brevity penalty factor.
• We first compute the geometric average of the modified n-gram precisions,
pn, using n-grams up to length N and positive weights wn summing to one.

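• BLEU = BP · exp( sum over n = 1..N of wn log pn ), where BP is the brevity penalty
• Baseline: N = 4 and uniform weights wn = 1/N
• To make the ranking behavior more apparent, take logarithms:
log BLEU = min(1 - r/c, 0) + sum over n = 1..N of wn log pn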
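
The combination step could be sketched in the log domain as below. The function `bleu_from_statistics` is a hypothetical name. It takes already-computed corpus-level modified precisions pn plus the lengths c and r, and uses uniform weights unless others are given. Returning 0.0 when any pn is zero is a convention of this sketch, since log 0 is undefined.

```python
import math


def bleu_from_statistics(precisions, c, r, weights=None):
    """Brevity penalty times the weighted geometric mean of precisions."""
    n = len(precisions)
    if weights is None:
        weights = [1.0 / n] * n  # uniform weights wn = 1/N
    if c == 0 or any(p == 0 for p in precisions):
        return 0.0  # convention chosen for this sketch
    log_bp = min(1.0 - r / c, 0.0)  # log of the brevity penalty
    log_mean = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return math.exp(log_bp + log_mean)


# Illustrative numbers only, not results from the paper.
print(bleu_from_statistics([0.5, 0.5, 0.5, 0.5], c=10, r=10))
print(bleu_from_statistics([0.5, 0.5, 0.5, 0.5], c=10, r=12))
```

Working with logarithms turns the geometric mean into a weighted sum, and the brevity penalty becomes a single additive term that is zero whenever c > r.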

The BLEU Evaluation
• The BLEU metric ranges from 0 to 1
• A score of 1 is very rare: only a translation identical to a reference attains it
• The more reference translations per sentence, the higher the score
• A human translator scored 0.3468 against four references and 0.2571
against two references
• Table 1: 5 systems against two references

• Is the difference in BLEU metric reliable?
• What is the variance of the BLEU score?
• If we were to pick another random set of 500 sentences, would we still judge S3 to
be better than S2?

• Divided the test corpus into 20 blocks of 25 sentences each and computed the BLEU metric on each block
• Computed the means, variances, and paired t-statistics
• What Table 2 indicates:
– Table 1 is computed on all 500 sentences, Table 2 on blocks of 25 sentences
– A t-statistic of 1.7 or above is considered 95% significant
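
A sketch of the paired t-statistic for two systems scored on the same blocks. The function name and the block scores are hypothetical placeholders, not figures from the paper. The sketch assumes at least two blocks and differences that are not all identical, since otherwise the standard deviation is zero.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """t = mean(d) / (s_d / sqrt(n)) over per-block score differences d."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Made-up per-block BLEU scores for two systems on the same four blocks.
system_a = [0.10, 0.12, 0.11, 0.13]
system_b = [0.08, 0.11, 0.10, 0.10]
print(paired_t_statistic(system_a, system_b))
```

Pairing by block removes the variation in difficulty between blocks, so the test only looks at whether one system is consistently ahead of the other.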

Evaluation
• Two groups of human judges, 10 people each
– Monolingual group
– Bilingual group
• Evaluated the previous 5 systems
• Rating scale: 1 (very bad) to 5 (very good)
• Some judges were more liberal in their ratings than
others

BLEU vs Bilingual and Monolingual Judgments
