Publisher: Springer-Verlag Berlin Heidelberg 2013
Authors: Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He
Proceedings of the 16th International Conference of Text, Speech and Dialogue (TSD 2013). Plzen, Czech Republic, September 2013. LNAI Vol. 8082, pp. 121-128. Volume Editors: I. Habernal and V. Matousek. Springer-Verlag Berlin Heidelberg 2013. Open tool https://github.com/aaronlifenghan/aaron-project-hlepor
AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFORMATION
1. Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, and Liangye He
Open source code: https://github.com/aaronlifenghan/aaron-project-hlepor
May 16th, 2012
Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory
Department of Computer and Information Science
University of Macau
TSD 2013, LNAI Vol. 8082, pp. 121-128. Springer Verlag Berlin Heidelberg 2013
2. Introduction and some related work in MT Evaluation
Problem and designed idea for MT evaluation
Employed linguistic feature
Designed measuring formula
Evaluation method of evaluation metric
Experiment on WMT corpora
Conclusion
References
3. • Machine translation (MT) began as early as the 1950s (Weaver, 1955)
• big progress since the 1990s, due to the development of computers (storage capacity and computational power) and enlarged bilingual corpora (Marino et al. 2006)
• Some recent works of MT:
• (Och 2003) presented MERT (Minimum Error Rate Training) for log-linear
SMT
• (Su et al. 2009) used the Thematic Role Templates model to improve the
translation
• (Xiong et al. 2011) employed the maximum-entropy model etc.
• Rule-based and data-driven methods, including example-based MT (Carl and Way 2003) and statistical MT (Koehn 2010), became the main approaches in the MT literature.
4. • Due to the widespread development of MT systems, MT evaluation becomes increasingly important for telling how well the MT systems perform and whether they make progress.
• However, the MT evaluation is difficult:
• language variability results in no single correct translation
• the natural languages are highly ambiguous and different languages do
not always express the same content in the same way (Arnold 2003)
5. • Human evaluation:
• the intelligibility (measuring how understandable the sentence is)
• fidelity (measuring how much information the translated sentence retains
compared to the original) used by the Automatic Language Processing
Advisory Committee (ALPAC) around 1966 (Carroll 1966)
• adequacy (similar to fidelity), fluency (whether the sentence is well-formed and fluent) and comprehension (improved intelligibility) by the Defense Advanced Research Projects Agency (DARPA) of the US (White et al. 1994).
• Problems with manual evaluation:
• time-consuming and thus too expensive to conduct frequently.
6. • automatic evaluation metrics :
• word error rate WER (Su et al. 1992) (edit distance between the system
output and the closest reference translation)
• position independent word error rate PER (Tillmann et al. 1997) (variant of
WER that disregards word ordering)
• BLEU (Papineni et al. 2002) (the geometric mean of n-gram precision by
the system output with respect to reference translations)
• NIST (Doddington 2002) (adding the information weight)
• GTM (Turian et al. 2003)
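As a rough illustration of these classic metrics, WER and the unigram component of BLEU can be sketched in a few lines. This is a toy, single-reference version: real BLEU combines higher-order n-grams with a brevity penalty, and the sentence pair is invented for illustration.

```python
from collections import Counter

def wer(hyp, ref):
    """Word error rate: word-level edit distance divided by reference length."""
    h, r = hyp.split(), ref.split()
    # Dynamic-programming table for the Levenshtein distance over words
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(h)][len(r)] / len(r)

def unigram_precision(hyp, ref):
    """Clipped unigram precision, the 1-gram component of BLEU."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    matched = sum(min(c, r[w]) for w, c in h.items())
    return matched / sum(h.values())

print(wer("there is a large bag", "there is a big bag"))                # 0.2
print(unigram_precision("there is a large bag", "there is a big bag"))  # 0.8
```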
7. • Recently, many other methods have appeared:
• The METEOR metric (Banerjee and Lavie 2005) conducts flexible matching, considering stems, synonyms and paraphrases.
• The matching process involves computationally expensive word alignment, and there are parameters, such as the relative weight of recall to precision and the weights for stemming or synonym matching, that must be tuned.
• Meteor-1.3 (Denkowski and Lavie 2011), a modified version of Meteor, includes ranking and adequacy versions and overcomes some weaknesses of the previous version, such as noise in the paraphrase matching, lack of punctuation handling, and no discrimination between word types.
8. • Snover et al. (2006) pointed out that one disadvantage of the Levenshtein distance is that mismatches in word order require the deletion and re-insertion of the misplaced words.
• They proposed TER by adding an editing step that allows the movement of
word sequences from one part of the output to another. This is something
a human post-editor would do with the cut-and-paste function of a word
processor.
• However, finding the shortest sequence of editing steps is a
computationally hard problem.
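The point about misplaced words can be checked with a small word-level Levenshtein sketch (toy sentences of our own; a real TER implementation additionally searches over block shifts):

```python
def word_edit_distance(hyp, ref):
    """Plain word-level Levenshtein distance (no block shifts)."""
    h, r = hyp.split(), ref.split()
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        cur = [i]
        for j, rw in enumerate(r, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (hw != rw)))   # substitution / match
        prev = cur
    return prev[-1]

# The two phrases are merely swapped, yet Levenshtein charges 4 edits
# (delete the misplaced phrase and re-insert it); TER counts a single shift.
print(word_edit_distance("mary too john likes", "john likes mary too"))  # 4
```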
9. • AMBER (Chen and Kuhn 2011), with its AMBER-TI and AMBER-NL variants, is declared a modified version of BLEU: it attaches more kinds of penalty coefficients, combining n-gram precision and recall with the arithmetic average of the F-measure.
• Before the evaluation, it provides eight kinds of preparations on the corpus: whether or not the words are tokenized, extracting the stem, prefix and suffix of the words, and splitting the words into several parts with different ratios.
10. • F15 (Bicici and Yuret 2011) and F15G3 perform evaluation with the F1 measure (assigning the same weight to precision and recall) over target features as a metric for evaluating translation quality.
• The target features they defined include the TP (true positive), TN (true negative), FP (false positive), and FN (false negative) rates, etc. To consider the surrounding phrase of a missing token in the translation, they employed the gapped word sequence kernels approach (Taylor and Cristianini 2004) to evaluate translations.
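With such counts in hand, the F1 computation itself is simple; the counts below are toy values, not the paper's actual feature statistics:

```python
def f1(tp, fp, fn):
    """F1 measure: harmonic mean of precision and recall (equal weights)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. 8 matched features, 2 spurious, 4 missing
print(round(f1(tp=8, fp=2, fn=4), 3))  # 0.727
```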
11. • Other related works:
• (Wong and Kit 2008), (Isozaki et al. 2010) and (Talbot et al. 2011) on word order
• ROSE (Song and Cohn 2011), MPF and WMPF (Popovic 2011) on employing POS information
• MP4IBM1 (Popovic et al. 2011), which needs no reference translations, etc.
12. • The previously proposed evaluation methods all suffer, to some degree, from several main weaknesses:
• performing well on certain language pairs but weakly on others, which we call the language-bias problem;
• considering no linguistic information (leading to low correlation with human judgments) or too many linguistic features (making replication difficult), which we call the extremism problem;
• presenting incomprehensive factors (e.g. BLEU focuses on precision only).
• What to do?
• This paper: to address some of the above problems
13. • How?
• Enhanced factors
• Tunable parameters
• Organic, mathematically grounded combination of factors
• Concise linguistic features
14. • To address the variability phenomenon, researchers have employed synonyms, paraphrasing or textual entailment as auxiliary information. All of these approaches have their advantages and weaknesses, e.g.
• synonym lists can hardly cover all the acceptable expressions.
• Instead, in the designed metric, we perform the measuring on the part-of-
speech (POS) information (also applied by ROSE (Song and Cohn 2011),
MPF and WMPF (Popovic 2011)).
• If a system output sentence is a good translation, then it is likely to carry semantic information similar to the reference (the two sentences may not contain exactly the same words, but words with similar semantic meaning).
• For example, "there is a big bag" and "there is a large bag" could be the same expression, since "big" and "large" have similar meanings (both with POS adjective).
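The example can be made concrete with a toy POS lookup. The tag dictionary below is hypothetical; a real system would use a trained POS tagger.

```python
# Hypothetical toy tag dictionary; a real system uses a trained POS tagger.
TAGS = {"there": "EX", "is": "VBZ", "a": "DT",
        "big": "JJ", "large": "JJ", "bag": "NN"}

def pos_sequence(sentence):
    """Map each word of the sentence to its POS tag."""
    return [TAGS[w] for w in sentence.split()]

hyp = "there is a large bag"
ref = "there is a big bag"
print(hyp.split() == ref.split())              # False: surface words differ
print(pos_sequence(hyp) == pos_sequence(ref))  # True: identical at POS level
```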
27. • an evaluation metric based on the mathematically weighted harmonic mean
• tunable weights
• enhanced factors
• employs a concise linguistic feature: the POS of each word
• better performance than similar POS-based metrics such as ROSE, MPF and WMPF
• performance can be further enhanced by adding POS tools and adjusting the parameter values
• BLEU uses n-grams and other researchers count POS tags, e.g. (Avramidis et al. 2011); we combine the n-gram and POS information together
• evaluation methods that need no reference perform poorly, e.g. the MP4IBM1 metric ranked near the bottom in the experiments
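The "weighted harmonic mean with tunable weights" can be sketched as follows for two factors; the weight values are illustrative only, not the tuned settings reported in the paper:

```python
def weighted_harmonic_mean(precision, recall, alpha=9.0, beta=1.0):
    """Weighted harmonic mean of recall and precision.

    With alpha > beta the combined score leans toward recall; alpha = beta
    reduces to the ordinary harmonic mean (the F1 measure).
    Assumes precision and recall are both non-zero.
    """
    return (alpha + beta) / (alpha / recall + beta / precision)

print(round(weighted_harmonic_mean(0.8, 0.6), 4))  # 0.6154
```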
28. • More language pairs will be tested
• Combination of both word and POS will be explored
• Parameter tuning will be achieved automatically
• Evaluation without golden references will be developed
29. • 1. Weaver, W.: Translation. In: Locke, W., Booth, A. D. (eds.) Machine Translation of Languages: Fourteen Essays, pp. 15-23. John Wiley and Sons, New York (1955)
• 2. Marino, J. B., Banchs, R. E., Crego, J. M., de Gispert, A., Lambert, P., Fonollosa, J. A., Costa-jussa, M. R.: N-gram based machine translation. Computational Linguistics, Vol. 32, No. 4, pp. 527-549. MIT Press (2006)
• 3. Och, F. J.: Minimum Error Rate Training for Statistical Machine Translation. In: Proceedings of ACL-2003, pp. 160-167 (2003)
• 4. Su, H.-Y., Wu, C.-H.: Improving Structural Statistical Machine Translation for Sign Language With Small Corpus Using Thematic Role Templates as Translation Memory. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 17, No. 7 (2009)
• 5. Xiong, D., Zhang, M., Li, H.: A Maximum-Entropy Segmentation Model for Statistical Machine Translation. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 19, No. 8, pp. 2494-2505 (2011)
• 6. Carl, M., Way, A. (eds.): Recent Advances in Example-Based Machine Translation. Kluwer Academic Publishers, Dordrecht, The Netherlands (2003)
30. • 7. Koehn, P.: Statistical Machine Translation. Cambridge University Press (2010)
• 8. Arnold, D.: Why translation is difficult for computers. In: Computers and Translation: A translator's guide. Benjamins Translation Library (2003)
• 9. Carroll, J. B.: An experiment in evaluating the quality of translation. In: Pierce, J. (chair), Languages and Machines: Computers in Translation and Linguistics. A report by the Automatic Language Processing Advisory Committee (ALPAC), Publication 1416, Division of Behavioral Sciences, National Academy of Sciences, National Research Council, pp. 67-75 (1966)
• 10. White, J. S., O'Connell, T. A., O'Mara, F. E.: The ARPA MT evaluation methodologies: Evolution, lessons, and future approaches. In: Proceedings of AMTA 1994, pp. 193-205 (1994)
• 11. Su, K.-Y., Wu, M.-W., Chang, J.-S.: A New Quantitative Quality Measure for Machine Translation Systems. In: Proceedings of the 14th International Conference on Computational Linguistics, pp. 433-439, Nantes, France (1992)
31. • 12. Tillmann, C., Vogel, S., Ney, H., Zubiaga, A., Sawaf, H.: Accelerated DP Based Search for Statistical Translation. In: Proceedings of the 5th European Conference on Speech Communication and Technology (EUROSPEECH-97) (1997)
• 13. Papineni, K., Roukos, S., Ward, T., Zhu, W. J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of ACL 2002, pp. 311-318, Philadelphia, PA, USA (2002)
• 14. Doddington, G.: Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In: Proceedings of HLT 2002, pp. 138-145, San Diego, California, USA (2002)
• 15. Turian, J. P., Shen, L., Melamed, I. D.: Evaluation of machine translation and its evaluation. In: Proceedings of MT Summit IX, pp. 386-393, New Orleans, LA, USA (2003)
• 16. Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In: Proceedings of ACL-WMT, pp. 65-72, Prague, Czech Republic (2005)
32. • 17. Denkowski, M., Lavie, A.: Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In: Proceedings of ACL-WMT, pp. 85-91, Edinburgh, Scotland, UK (2011)
• 18. Snover, M., Dorr, B., Schwartz, R., Micciulla, L., Makhoul, J.: A study of translation edit rate with targeted human annotation. In: Proceedings of AMTA, pp. 223-231, Boston, USA (2006)
• 19. Chen, B., Kuhn, R.: AMBER: A modified BLEU, enhanced ranking metric. In: Proceedings of ACL-WMT, pp. 71-77, Edinburgh, Scotland, UK (2011)
• 20. Bicici, E., Yuret, D.: RegMT system for machine translation, system combination, and evaluation. In: Proceedings of ACL-WMT, pp. 323-329, Edinburgh, Scotland, UK (2011)
• 21. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press (2004)
• 22. Wong, B. T-M., Kit, C.: Word choice and word position for automatic MT evaluation. In: Workshop: MetricsMATR of AMTA, short paper, Waikiki, Hawai'i, USA (2008)
33. • 23. Isozaki, H., Hirao, T., Duh, K., Sudoh, K., Tsukada, H.: Automatic evaluation of translation quality for distant language pairs. In: Proceedings of EMNLP 2010, pp. 944-952, Cambridge, MA (2010)
• 24. Talbot, D., Kazawa, H., Ichikawa, H., Katz-Brown, J., Seno, M., Och, F.: A Lightweight Evaluation Framework for Machine Translation Reordering. In: Proceedings of the Sixth ACL-WMT, pp. 12-21, Edinburgh, Scotland, UK (2011)
• 25. Song, X., Cohn, T.: Regression and ranking based optimisation for sentence level MT evaluation. In: Proceedings of ACL-WMT, pp. 123-129, Edinburgh, Scotland, UK (2011)
• 26. Popovic, M.: Morphemes and POS tags for n-gram based evaluation metrics. In: Proceedings of ACL-WMT, pp. 104-107, Edinburgh, Scotland, UK (2011)
• 27. Popovic, M., Vilar, D., Avramidis, E., Burchardt, A.: Evaluation without references: IBM1 scores as evaluation metrics. In: Proceedings of ACL-WMT, pp. 99-103, Edinburgh, Scotland, UK (2011)
• 28. Petrov, S., Barrett, L., Thibaux, R., Klein, D.: Learning accurate, compact, and interpretable tree annotation. In: Proceedings of the 21st ACL, pp. 433-440, Sydney (2006)
34. • 29. Callison-Burch, C., Koehn, P., Monz, C., Zaidan, O. F.: Findings of the 2011 Workshop on Statistical Machine Translation. In: Proceedings of ACL-WMT, pp. 22-64, Edinburgh, Scotland, UK (2011)
• 30. Callison-Burch, C., Koehn, P., Monz, C., Peterson, K., Przybocki, M., Zaidan, O. F.: Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation. In: Proceedings of ACL-WMT, pp. 17-53, PA, USA (2010)
• 31. Callison-Burch, C., Koehn, P., Monz, C., Schroeder, J.: Findings of the 2009 Workshop on Statistical Machine Translation. In: Proceedings of ACL-WMT, pp. 1-28, Athens, Greece (2009)
• 32. Callison-Burch, C., Koehn, P., Monz, C., Schroeder, J.: Further meta-evaluation of machine translation. In: Proceedings of ACL-WMT, pp. 70-106, Columbus, Ohio, USA (2008)
• 33. Avramidis, E., Popovic, M., Vilar, D., Burchardt, A.: Evaluate with Confidence Estimation: Machine ranking of translation outputs using grammatical features. In: Proceedings of the Sixth Workshop on Statistical Machine Translation (ACL-WMT), pp. 65-70, Edinburgh, Scotland, UK (2011)