This paper describes a universal phrase tagset mapping between the French Treebank and the English Penn Treebank using 9 phrase categories. It then applies the mapping in an unsupervised machine translation evaluation method that calculates the similarity between source and target sentences without reference translations. The method extracts phrase tags from the source and target, maps them to universal tags, and measures n-gram precision, recall, and position difference as similarity factors. Evaluation on French-English data shows promising correlation with human judgments, though there is still room for improvement. The tagset and methods could facilitate future multilingual research.
Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation
1. 25th International Conference, GSCL 2013
Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Shuo Li,
and Ling Zhu
September 25th -27th, 2013, Darmstadt, Germany
Natural Language Processing & Portuguese-Chinese Machine Translation
Laboratory
Department of Computer and Information Science
University of Macau
2. Background of language treebanks
Motivation
Designed phrase tagset mapping
Application in MT evaluation
1. Manual evaluations
2. Traditional automatic MT evaluation methods
3. Designed unsupervised MT evaluation
4. Evaluating the evaluation method
5. Experiments
6. Open source code
Discussion
Further information
3. • To promote the development of syntactic analysis
• Many language treebanks are developed
– English Penn Treebank (Marcus et al., 1993; Mitchell et al.,
1994)
– German Negra Treebank (Skut et al., 1997)
– French Treebank (Abeillé et al., 2003)
– Chinese Sinica Treebank (Chen et al., 2003)
– Etc.
4. • Problems
– Different treebanks use their own syntactic tagsets
– The number of tags ranges from tens (e.g. the English Penn
Treebank) to hundreds (e.g. the Chinese Sinica Treebank)
– Inconvenient for multilingual or cross-lingual research
5. • To bridge the gap between these treebanks and
facilitate future research
– E.g. the unsupervised induction of syntactic structure
• Petrov et al. (2012) develop a universal POS tagset
• What about phrase-level tags?
• The disagreement problem in phrase-level tags
remains unsolved
– Let’s try to solve it
6. • Tentative design of phrase tagset mapping
– On English Penn Treebank I, II & French Treebank
• 9 universal phrasal categories covering
– 14 phrase tags in English Penn Treebank I
– 26 phrase tags in English Penn Treebank II
– 14 phrase tags in French Treebank
7. Table 1: phrase tagset mapping for French and English treebanks
8. • Universal phrasal categories: NP (noun phrase),
VP (verb phrase), AJP (adjective phrase), AVP
(adverbial phrase), PP (prepositional phrase), S (sentence
or sub-sentence), CONJP (conjunction phrase), COP
(coordinated phrase), X (other or unknown phrases)
• NP covering
– French tags: NP
– English tags: NP, NAC (the scope of certain prenominal
modifiers within an NP), NX (within certain complex NPs to
mark the head of NP), WHNP (wh-noun phrase), QP
(quantifier phrase)
9. • VP covering
– French tags: VN (verbal nucleus), VP (infinitives and
nonfinite clauses)
– English tags: VP (verb phrase)
• AJP covering
– French tags: AP (adjectival phrase)
– English tags: ADJP (adjective phrase), WHADJP (wh-adjective
phrase)
10. • AVP covering
– French tags: AdP (adverbial phrase)
– English tags: ADVP (adverb phrase), WHADVP (wh-adverb
phrase), PRT (particle)
• PP covering
– French tags: PP
– English tags: PP, WHPP (wh-prepositional phrase)
11. • S covering
– French tags: SENT (sentence), S (finite clause)
– English tags: S (simple declarative clause), SBAR (clause
introduced by a subordinating conjunction), SBARQ (direct
question introduced by a wh-phrase), SINV (declarative
sentence with subject-aux inversion), SQ (sub-constituent
of SBARQ), PRN (parenthetical), FRAG (fragment), RRC
(reduced relative clause).
• CONJP covering
– French tags: N/A
– English tags: CONJP
12. • COP covering
– French tags: COORD (coordinated phrase)
– English tags: UCP (coordinated phrases belonging to
different categories)
• X covering
– French tags: unknown
– English tags: X (unknown or uncertain), INTJ (interjection),
LST (list marker)
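The mapping enumerated on slides 8-12 can be written down directly as lookup tables. A minimal sketch in Python (the dictionary contents follow the slides above; only the tags named there are included, and the helper name map_to_universal is ours):

```python
# Table 1 as lookup tables: treebank phrase tag -> universal category.
# Contents follow slides 8-12; unseen tags fall back to the catch-all X.

FRENCH_TO_UNIVERSAL = {
    "NP": "NP",
    "VN": "VP", "VP": "VP",
    "AP": "AJP",
    "AdP": "AVP",
    "PP": "PP",
    "SENT": "S", "S": "S",
    "COORD": "COP",
}

ENGLISH_TO_UNIVERSAL = {
    "NP": "NP", "NAC": "NP", "NX": "NP", "WHNP": "NP", "QP": "NP",
    "VP": "VP",
    "ADJP": "AJP", "WHADJP": "AJP",
    "ADVP": "AVP", "WHADVP": "AVP", "PRT": "AVP",
    "PP": "PP", "WHPP": "PP",
    "S": "S", "SBAR": "S", "SBARQ": "S", "SINV": "S", "SQ": "S",
    "PRN": "S", "FRAG": "S", "RRC": "S",
    "CONJP": "CONJP",
    "UCP": "COP",
    "X": "X", "INTJ": "X", "LST": "X",
}

def map_to_universal(tag: str, table: dict[str, str]) -> str:
    """Map one treebank phrase tag to its universal category (X if unknown)."""
    return table.get(tag, "X")
```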
14. • Rapid development of machine translation
– MT began as early as the 1950s (Weaver, 1955)
– Big progress since the 1990s due to the development of
computers (storage capacity and computational power)
and the enlarged bilingual corpora (Marino et al., 2006)
• Difficulties of MT evaluation
– language variability results in no single correct translation
– the natural languages are highly ambiguous and different
languages do not always express the same content in the
same way (Arnold, 2003)
15. • Traditional manual evaluation criteria:
– intelligibility (measuring how understandable the
sentence is)
– fidelity (measuring how much information the translated
sentence retains as compared to the original) by the
Automatic Language Processing Advisory Committee
(ALPAC) around 1966 (Carroll, 1966)
– adequacy (similar to fidelity), fluency (whether the
sentence is well-formed and fluent) and comprehension
(improved intelligibility) by Defense Advanced Research
Projects Agency (DARPA) of US (White et al., 1994)
16. • Problems of manual evaluations :
– Time-consuming
– Expensive
– Unrepeatable
– Low agreement (Callison-Burch, et al., 2011)
17. • Measuring the similarity of automatic translation and
reference translation
– Automatic translation (or hypothesis translation, target
translation): by automatic MT system
– Reference translation: by professional translators
– Source language and source document: not used
• Traditional automatic evaluation:
– BLEU: n-gram precisions (Papineni et al., 2002)
– TER: edit distances (Snover et al., 2006)
– METEOR: precision and recall (Banerjee and Lavie, 2005)
18. • Problems in supervised MT evaluation
– Reference translations are expensive
– Reference translations are not available in some cases
• Could we get rid of the reference translation?
– Unsupervised MT evaluation method
– Extract information from source and target language
– How to use the designed universal phrase tagset?
19. • Assume that the translated sentence should have a
similar set of phrase categories to the source
sentence.
– This design is inspired by the synonymous relation
between the source and target sentences.
• Two sentences that have similar sets of phrases may
still talk about different things.
– However, this evaluation approach is not designed for
general circumstances
– We assume that the target sentences are indeed
translations of the source document
20. • First, we parse the source and target languages
respectively
• Then we extract the phrase sets from the source and
target sentences
• Third, we convert the phrases into the developed
universal phrase categories
• Last, we measure the similarity of the source and target
languages over the universal phrase sequences (a sketch of
these steps follows below)
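Putting the four steps together, a minimal end-to-end sketch (parse_french and parse_english are hypothetical parser wrappers, not names from the paper; the remaining helpers are sketched on nearby slides):

```python
# End-to-end sketch of the four steps. parse_french/parse_english stand in
# for any constituency parsers for the two languages.

def evaluate_pair(src_sentence: str, hyp_sentence: str) -> float:
    src_tree = parse_french(src_sentence)      # 1. parse source ...
    hyp_tree = parse_english(hyp_sentence)     #    ... and target

    src_tags = extract_phrase_tags(src_tree)   # 2. extract phrase tag sequences
    hyp_tags = extract_phrase_tags(hyp_tree)

    src_uni = [map_to_universal(t, FRENCH_TO_UNIVERSAL) for t in src_tags]   # 3. map
    hyp_uni = [map_to_universal(t, ENGLISH_TO_UNIVERSAL) for t in hyp_tags]

    return hppr(position_difference_penalty(hyp_uni, src_uni),               # 4. score
                ngram_precision_score(hyp_uni, src_uni),
                ngram_recall_score(hyp_uni, src_uni))
```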
22. The level of extracted phrase tags: the level immediately above the POS tags, read bottom-up (see the extraction sketch below)
Figure 2: converting the extracted phrase tags into universal phrase tags
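One way to read off exactly that level, assuming parses are available as nltk.Tree objects (a sketch; collapsing consecutive duplicates so a multi-word phrase contributes one tag is our simplification for illustration):

```python
from nltk.tree import Tree

def extract_phrase_tags(tree: Tree) -> list[str]:
    """Collect the phrase label immediately above each word's POS tag,
    left to right, collapsing runs so one phrase yields one tag."""
    tags: list[str] = []
    for i in range(len(tree.leaves())):
        leaf_pos = tree.leaf_treeposition(i)   # tree position of the i-th word
        phrase_pos = leaf_pos[:-2]             # skip the POS node, take its parent
        label = tree[phrase_pos].label() if phrase_pos else tree.label()
        if not tags or tags[-1] != label:      # collapse consecutive duplicates
            tags.append(label)
    return tags

# Example: Tree.fromstring("(S (NP (DT the) (NN cat)) (VP (VBD sat)))")
# yields ["NP", "VP"].
```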
23. • What similarity metric do we employ?
• Designed similarity metric: HPPR
– N1-gram position order difference penalty
– Weighted N2-gram precision
– Weighted N3-gram recall
– Weighted geometric mean over the n-gram precisions & recalls
– Weighted harmonic mean to combine the sub-factors
– The parameters are tunable according to different
language pairs (see the n-gram sketch after this list)
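The deck does not spell out the precision and recall formulas, so the following is a hedged sketch under the usual reading: clipped n-gram matching combined by a weighted geometric mean over n = 1..N2 (precision) and n = 1..N3 (recall); the uniform weight defaults are an assumption:

```python
from collections import Counter
from math import exp, log

def ngram_counts(tags: list[str], n: int) -> Counter:
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

def clipped_match_ratio(a: list[str], b: list[str], n: int) -> float:
    """Fraction of a's n-grams matched (with count clipping) in b."""
    a_counts, b_counts = ngram_counts(a, n), ngram_counts(b, n)
    total = sum(a_counts.values())
    if total == 0:
        return 0.0
    return sum(min(c, b_counts[g]) for g, c in a_counts.items()) / total

def weighted_geometric_mean(scores: list[float], weights: list[float]) -> float:
    if any(s <= 0.0 for s in scores):
        return 0.0
    return exp(sum(w * log(s) for w, s in zip(weights, scores)) / sum(weights))

def ngram_precision_score(hyp, src, n_max=3, weights=(1.0, 1.0, 1.0)):
    """Weighted N2-gram precision: geometric mean of clipped precisions."""
    return weighted_geometric_mean(
        [clipped_match_ratio(hyp, src, n) for n in range(1, n_max + 1)],
        list(weights)[:n_max])

def ngram_recall_score(hyp, src, n_max=3, weights=(1.0, 1.0, 1.0)):
    """Weighted N3-gram recall: the same ratio measured over source n-grams."""
    return weighted_geometric_mean(
        [clipped_match_ratio(src, hyp, n) for n in range(1, n_max + 1)],
        list(weights)[:n_max])
```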
24. • HPPR is the weighted harmonic mean of the three sub-factors:

$$\mathrm{HPPR} = Har(w_{Ps} \cdot N_1PsDif,\; w_{Pr} \cdot N_2Pre,\; w_{Rc} \cdot N_3Rec) = \frac{w_{Ps} + w_{Pr} + w_{Rc}}{\frac{w_{Ps}}{N_1PsDif} + \frac{w_{Pr}}{N_2Pre} + \frac{w_{Rc}}{N_3Rec}}$$

• $N_1PsDif$, $N_2Pre$, and $N_3Rec$ are the corpus-level
scores of the sub-factors position difference penalty,
precision, and recall.
25. • The sentence-level $N_1PsDif$ score:

$$N_1PsDif = \exp(-N_1PD), \qquad N_1PD = \frac{1}{Length_{hyp}} \sum_i |PD_i|, \qquad PD_i = |PsN_{hyp} - MatchPsN_{src}|$$

• $PsN_{hyp}$ and $MatchPsN_{src}$ are the position numbers
of the matching tag in the hypothesis and source
sentence respectively. When there is no match for a tag:
$PD_i = |PsN_{hyp} - 0|$
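A unigram sketch of this sentence-level penalty (the paper tunes N1 = 2, i.e. bigram matching; taking the nearest source position when a tag occurs several times is our simplification):

```python
from math import exp

def position_difference_penalty(hyp: list[str], src: list[str]) -> float:
    """Sentence-level N1PsDif = exp(-N1PD) over universal tag sequences."""
    if not hyp:
        return 0.0
    total = 0.0
    for i, tag in enumerate(hyp, start=1):                    # 1-based positions
        matches = [j for j, t in enumerate(src, start=1) if t == tag]
        match = min(matches, key=lambda j: abs(i - j)) if matches else 0
        total += abs(i - match)                               # PD_i
    return exp(-total / len(hyp))                             # exp(-N1PD)
```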
30. • How reliable is the automatic metric?
• Evaluation criteria for evaluation metrics:
– Human judgments are currently the gold standard to approximate
– Correlation with human judgments (Callison-Burch et al.,
2011, 2012)
• Spearman rank correlation coefficient $r_s$:

$$r_s(XY) = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

– for two rank sequences $X = \{x_1, \dots, x_n\}$ and $Y = \{y_1, \dots, y_n\}$, where $d_i = x_i - y_i$
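The coefficient in a few lines, as used here to compare system rankings (assumes untied ranks, e.g. metric ranks vs. human ranks of the MT systems):

```python
def spearman_rs(x_ranks: list[int], y_ranks: list[int]) -> float:
    """Spearman rank correlation for two rank sequences without ties."""
    n = len(x_ranks)
    d_squared = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Example: five systems ranked by a metric vs. by human judges.
print(spearman_rs([1, 2, 3, 4, 5], [2, 1, 3, 5, 4]))  # 0.8
```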
31. • Corpus from WMT
– Workshop on Statistical Machine Translation
– SIGMT, ACL's special interest group on machine translation
• Training data (WMT11), used to tune the parameters
– 3,003 sentences per document
– 18 automatic French-to-English MT systems
• Testing data (WMT12)
– 3,003 sentences per document
– 15 automatic French-to-English MT systems
32. • Training, tuning the parameters
– N1, N2 and N3 are tuned to 2, 3 and 3, since
4-gram chunk matching usually results in a 0 score.
– Tuned values of the factor weights are shown in Table 2
Table 2: tuned parameter values
33. • Comparisons with:
– BLEU, which measures the closeness of hypothesis and
reference translations via n-gram precision
– TER, which measures the edit distance from hypothesis to
reference translations
34. Table 3: training (development) scores on WMT11 corpus
Table 4: testing scores on WMT12 corpus
35. Table 5: interpretation of correlation scores (Cohen, 1988)
The experimental results on the development and testing corpora show that
HPPR, without using reference translations, yields promising
correlation scores (0.63 and 0.59 respectively).
There is still potential to improve the performance of all three
metrics, even though correlation scores higher than 0.5
are already considered strong correlations, as shown in Table 5.
36. • Phrase Tagset Mapping for French and English
Treebanks and Its Application in Machine
Translation Evaluation
– Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He,
Shuo Li, and Ling Zhu. GSCL 2013, Darmstadt, Germany.
LNCS Vol. 8105, pp. 119-131, Volume Editors: Iryna
Gurevych, Chris Biemann and Torsten Zesch.
• Open source tool for phrase tagset mapping
and HPPR similarity measuring algorithms:
https://github.com/aaronlifenghan/aaron-project-hppr
37. • To facilitate future multilingual and cross-lingual
research, this paper designs a phrase tagset
mapping between the French Treebank and the
English Penn Treebank using 9 phrase categories.
• One potential application of the designed
universal phrase tagset is shown in the unsupervised
MT evaluation task in the experiment section.
38. • There are still some limitations in this work to be
addressed in the future.
– The designed universal phrase categories may not
cover all the phrase tags of other language
treebanks, so the tagset could be expanded when
necessary.
– The designed HPPR formula contains the n-gram factors
position difference, precision and recall, which may not
be sufficient or suitable for some other language
pairs, so different measuring factors should be added or
swapped in when facing new tasks.
39. • The designed models are essentially similarity
measures; here we have applied them to MT evaluation.
These methods may be further developed for other areas:
– information retrieval
– question answering
– search
– text analysis
– etc.
40. • Ongoing and further work:
– The combination of translation and evaluation, tuning the
translation model using evaluation metrics
– Evaluation models from the perspective of semantics
– The further explorations of unsupervised evaluation
models, extracting other features from source and target
languages
• Aaron's open source tools: https://github.com/aaronlifenghan
• Aaron's profile: http://www.linkedin.com/in/aaronhan
41. GSCL 2013, Darmstadt, Germany
Aaron L.-F. Han
email: hanlifengaaron AT gmail DOT com
Natural Language Processing & Portuguese-Chinese Machine Translation
Laboratory
Department of Computer and Information Science
University of Macau