MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation with Reinforced Factors

MT SUMMIT 2013
Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu,
Junwen Xing and Xiaodong Zeng
September 2nd-6th, 2013, Nice, France
Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory
Department of Computer and Information Science
University of Macau

• The importance of machine translation (MT) evaluation
• Automatic MT evaluation metrics introduction
1. Lexical similarity
2. Linguistic features
3. Metrics combination
• Designed metric: LEPOR Series
1. Motivation
2. LEPOR Metrics Description
3. Performances on international ACL-WMT corpora
4. Publications and Open source tools
• Further information

• Growing need for communication among people of different
nationalities
– Promotes translation technology
• Rapid development of machine translation
– Machine translation (MT) began as early as the 1950s
(Weaver, 1955)
– Big progress since the 1990s due to the development of
computers (storage capacity and computational power)
and enlarged bilingual corpora (Marino et al. 2006)

• Some recent works of MT research:
– Och (2003) presents MERT (Minimum Error Rate Training)
for log-linear SMT
– Su et al. (2009) use the Thematic Role Templates model to
improve translation
– Xiong et al. (2011) employ the maximum-entropy model,
etc.
– Data-driven methods, including example-based MT
(Carl and Way, 2003) and statistical MT (Koehn, 2010),
have become the main approaches in the MT literature.

• How well do MT systems perform, and are they
making progress?
• Difficulties of MT evaluation
– Language variability means there is no single correct translation
– Natural languages are highly ambiguous, and different
languages do not always express the same content in the
same way (Arnold, 2003)

• Traditional manual evaluation criteria:
– intelligibility (measuring how understandable the
sentence is)
– fidelity (measuring how much information the translated
sentence retains as compared to the original) by the
Automatic Language Processing Advisory Committee
(ALPAC) around 1966 (Carroll, 1966)
– adequacy (similar to fidelity), fluency (whether the
sentence is well-formed and fluent) and comprehension
(improved intelligibility) by the Defense Advanced Research
Projects Agency (DARPA) of the US (White et al., 1994)

• Problems of manual evaluation:
– Time-consuming
– Expensive
– Unrepeatable
– Low agreement (Callison-Burch et al., 2011)

2.1 Lexical similarity
2.2 Linguistic features
2.3 Metrics combination

• Precision-based
BLEU (Papineni et al., 2002 ACL)
• Recall-based
ROUGE (Lin, 2004 WAS)
• Precision and Recall
Meteor (Banerjee and Lavie, 2005 ACL)

• Word-order based
NKT_NSR (Isozaki et al., 2010 EMNLP), Port (Chen
et al., 2012 ACL), ATEC (Wong et al., 2008 AMTA)
• Word-alignment based
AER (Och and Ney, 2003 J.CL)
• Edit distance-based
WER (Su et al., 1992 Coling), PER (Tillmann et al.,
1997 EUROSPEECH), TER (Snover et al., 2006
AMTA)

• Language model
LM-SVM (Gamon et al., 2005 EAMT)
• Shallow parsing
GLEU (Mutton et al., 2007 ACL), TerrorCat (Fishel
et al., 2012 WMT)
• Semantic roles
Named entity, morphological, synonymy,
paraphrasing, discourse representation, etc.

• MTeRater-Plus (Parton et al., 2011 WMT)
– Combine BLEU, TERp (Snover et al., 2009) and Meteor
(Banerjee and Lavie, 2005; Lavie and Denkowski, 2009)
• MPF & WMPBleu (Popovic, 2011 WMT)
– Arithmetic mean of F score and BLEU score
• SIA (Liu and Gildea, 2006 ACL)
– Combine the advantages of n-gram-based metrics and
loose-sequence-based metrics

• hLEPOR: harmonic mean of enhanced Length Penalty,
Precision, n-gram Position difference Penalty and
Recall

• Weaknesses in existing metrics:
– They perform well on certain language pairs but poorly on others,
which we call the language-bias problem;
– They consider either no linguistic information (leading to
low correlation with human judgments) or too
many linguistic features (making them hard to replicate), which we
call the extremism problem;
– They use incomplete factors (e.g. BLEU focuses on
precision only).
– What to do?

• To address some of the existing problems:
– Design tunable parameters to address the language-bias
problem;
– Use concise or optimized linguistic features for the
linguistic extremism problem;
– Design augmented factors.

• Sub-factors:
• $ELP = \begin{cases} e^{1 - \frac{r}{c}} & \text{if } c < r \\ e^{1 - \frac{c}{r}} & \text{if } c \ge r \end{cases}$ (1)
• 𝑟: length of reference sentence
• 𝑐: length of candidate (system-output) sentence
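
Equation (1) can be illustrated with a minimal Python sketch. The function name `enhanced_length_penalty` and the input validation are assumptions made for illustration; they are not taken from the released hLEPOR tool.

```python
import math


def enhanced_length_penalty(candidate_len: int, reference_len: int) -> float:
    """Sketch of Equation (1): ELP for one candidate/reference pair.

    c is the candidate (system-output) length and r the reference length.
    The score is 1.0 when the lengths match and decays exponentially when
    the candidate is either shorter or longer than the reference.
    """
    if candidate_len <= 0 or reference_len <= 0:
        # Assumption: empty sentences are rejected rather than scored.
        raise ValueError("sentence lengths must be positive")
    c, r = candidate_len, reference_len
    if c < r:
        return math.exp(1 - r / c)
    return math.exp(1 - c / r)
```

For instance, `enhanced_length_penalty(10, 10)` gives 1.0, while a candidate half as long as the reference and one twice as long both receive `exp(-1)`.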

• $NPosPenal = \exp(-NPD)$ (2)
• $NPD = \frac{1}{Length_{output}} \sum_{i=1} |PD_i|$ (3)
• $PD_i = |MatchN_{output} - MatchN_{ref}|$ (4)
• $MatchN_{output}$: position of matched token in
output sentence
• $MatchN_{ref}$: position of matched token in reference
sentence
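
A rough sketch of Equations (2) to (4), assuming the word alignment has already been computed and is passed in as pairs of matched positions. The function name `n_gram_position_penalty` is hypothetical. The slide text does not say whether positions are raw indices or values normalized by sentence length, so the caller supplies them in whichever form is intended.

```python
import math
from typing import Sequence, Tuple


def n_gram_position_penalty(
    aligned_positions: Sequence[Tuple[float, float]],
    output_length: int,
) -> float:
    """Sketch of NPosPenal = exp(-NPD).

    aligned_positions holds one (position_in_output, position_in_reference)
    pair per matched token. Producing these pairs (the n-gram alignment
    step) is outside the scope of this sketch.
    """
    if output_length <= 0:
        # Assumption: an empty output sentence is rejected.
        raise ValueError("output_length must be positive")
    # Equation (4): absolute position difference for each matched token.
    differences = [abs(out_pos - ref_pos) for out_pos, ref_pos in aligned_positions]
    # Equation (3): sum of differences, normalized by the output length.
    npd = sum(differences) / output_length
    # Equation (2): turn the difference into a penalty in (0, 1].
    return math.exp(-npd)
```

When every matched token sits at the same position in both sentences, NPD is 0 and the penalty is 1.0; larger reorderings push the value toward 0.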

Fig. 1. N-gram word alignment algorithm

Fig. 2. Example of n-gram word alignment

Fig. 3. Example of NPD calculation

• N-gram precision and recall:
• $Precision = \frac{Aligned_{num}}{Length_{output}}$ (5)
• $Recall = \frac{Aligned_{num}}{Length_{reference}}$ (6)
• $HPR = \frac{(\alpha + \beta)\, Precision \times Recall}{\alpha\, Precision + \beta\, Recall}$ (7)
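
Equations (5) to (7) could be sketched as follows. The function name `harmonic_precision_recall` is hypothetical, precision is assumed to be normalized by the output length (the usual definition), and returning 0.0 when nothing is aligned is a convention chosen for this sketch.

```python
def harmonic_precision_recall(
    aligned_num: int,
    output_length: int,
    reference_length: int,
    alpha: float = 1.0,
    beta: float = 1.0,
) -> float:
    """Sketch of HPR, the weighted harmonic mean of precision and recall.

    alpha and beta are the tunable weights from Equation (7); the defaults
    of 1.0 are placeholders, not the tuned values from the talk.
    """
    if output_length <= 0 or reference_length <= 0:
        raise ValueError("sentence lengths must be positive")
    precision = aligned_num / output_length      # Equation (5), assumed denominator
    recall = aligned_num / reference_length      # Equation (6)
    denominator = alpha * precision + beta * recall
    if denominator == 0:
        # Assumption: no aligned tokens means a score of zero.
        return 0.0
    return (alpha + beta) * precision * recall / denominator  # Equation (7)
```

With `alpha == beta` this reduces to the familiar F1 score; other weightings shift the emphasis between the two terms, which is how a tunable parameter can adapt the metric to a language pair.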

• Sentence-level hLEPOR Metric:
• $hLEPOR = Harmonic(w_{LP} LP,\ w_{NPosPenal} NPosPenal,\ w_{HPR} HPR) = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} \frac{w_i}{Factor_i}} = \frac{w_{LP} + w_{NPosPenal} + w_{HPR}}{\frac{w_{LP}}{LP} + \frac{w_{NPosPenal}}{NPosPenal} + \frac{w_{HPR}}{HPR}}$ (8)
• System-level hLEPOR Metric:
• $hLEPOR = \frac{1}{num_{sent}} \sum_{i=1}^{num_{sent}} |hLEPOR_i|$ (9)
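
Equations (8) and (9) can be sketched as a weighted harmonic mean followed by an average over sentences. The function names and the default weights of 1.0 are illustrative assumptions, as is the choice to return 0.0 when any factor is zero (the harmonic mean is otherwise undefined there).

```python
from typing import Sequence


def sentence_hlepor(
    lp: float,
    npos_penal: float,
    hpr: float,
    w_lp: float = 1.0,
    w_npos_penal: float = 1.0,
    w_hpr: float = 1.0,
) -> float:
    """Sketch of Equation (8): weighted harmonic mean of the three factors."""
    factors = (lp, npos_penal, hpr)
    weights = (w_lp, w_npos_penal, w_hpr)
    if any(f <= 0 for f in factors):
        # Assumption: a zero factor drives the whole score to zero.
        return 0.0
    return sum(weights) / sum(w / f for w, f in zip(weights, factors))


def system_hlepor(sentence_scores: Sequence[float]) -> float:
    """Sketch of Equation (9): mean of the sentence-level scores."""
    if not sentence_scores:
        raise ValueError("at least one sentence score is required")
    return sum(abs(s) for s in sentence_scores) / len(sentence_scores)
```

Because the harmonic mean is dominated by its smallest term, a candidate that does badly on any single factor cannot be rescued by high values on the other two.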

• Example of employing linguistic features:
Fig. 4. Example of n-gram POS alignment
Fig. 5. Example of NPD calculation

• Enhanced version with linguistic features:
• $hLEPOR_E = \frac{1}{w_{hw} + w_{hp}} \left( w_{hw}\, hLEPOR_{word} + w_{hp}\, hLEPOR_{POS} \right)$ (10)
• The system-level scores $hLEPOR_{word}$
and $hLEPOR_{POS}$ use the same algorithm on the word
sequence and the POS sequence, respectively.
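
Equation (10) is a weighted arithmetic mean of the two system-level scores. A minimal sketch, with a hypothetical function name and placeholder default weights:

```python
def enhanced_hlepor(
    hlepor_word: float,
    hlepor_pos: float,
    w_hw: float = 1.0,
    w_hp: float = 1.0,
) -> float:
    """Sketch of Equation (10): combine word-level and POS-level scores.

    hlepor_word and hlepor_pos are system-level hLEPOR scores computed on
    the word sequence and the POS sequence; w_hw and w_hp are their weights.
    """
    total_weight = w_hw + w_hp
    if total_weight <= 0:
        raise ValueError("the weights must sum to a positive value")
    return (w_hw * hlepor_word + w_hp * hlepor_pos) / total_weight
```

Setting `w_hp` to 0 recovers the word-only score, so the linguistic feature can be switched off for languages without a POS tagger.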

• With multiple references:
• Select the alignment that results in the minimum NPD
score.
Fig. 6. N-gram alignment with multiple references

• How reliable is the automatic metric?
• Evaluation criteria for evaluation metrics:
– Human judgments are currently the gold standard to approach.
• Correlation with human judgments:
• System-level Spearman rank correlation coefficient:
– $\rho_{XY} = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$ (11)
– $X = \{x_1, \ldots, x_n\}$, $Y = \{y_1, \ldots, y_n\}$
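
Equation (11) can be sketched in a few lines of Python. The function name `spearman_rho` is hypothetical, and the sketch assumes there are no tied scores, which is the condition under which this closed form is exact; here `d_i` is the difference between the ranks of system `i` under the metric and under human judgment.

```python
from typing import List, Sequence


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Sketch of Equation (11): Spearman rank correlation without ties."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")

    def ranks(values: Sequence[float]) -> List[int]:
        # Rank 1 goes to the highest score; ties are assumed not to occur.
        order = sorted(range(n), key=lambda i: values[i], reverse=True)
        result = [0] * n
        for rank, index in enumerate(order, start=1):
            result[index] = rank
        return result

    rank_x, rank_y = ranks(x), ranks(y)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n * (n * n - 1))
```

A metric that ranks the MT systems exactly as the human judges do scores 1.0, and one that reverses the human ranking scores -1.0.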

• Training data (WMT08)
– 2,028 sentences for each document
– English vs Spanish/German/French/Czech
• Testing data (WMT11)
– 3,003 sentences for each document
– English vs Spanish/German/French/Czech

Table 1. values of tuned parameters

Table 2. correlation with human judgments on WMT11 corpora

• Language-independent Model for Machine
Translation Evaluation with Reinforced Factors
– Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Yi
Lu, Junwen Xing, Xiaodong Zeng. Proceedings of MT
Summit 2013. Nice, France.
• Machine Translation evaluation tool-hLEPOR:
https://github.com/aaronlifenghan/aaron-project-hlepor

• Ongoing and further work:
– Combining translation and evaluation: tuning the
translation model using evaluation metrics
– Evaluation models from the perspective of semantics
– Exploring unsupervised evaluation models that
extract features from the source and target languages

• Evaluation is closely related to similarity
measurement. We have applied it to MT evaluation,
but this work can be extended to other areas:
– information retrieval
– question answering
– search
– text analysis
– etc.

