A Description of Tunable Machine Translation Evaluation Systems in
WMT13 Metrics Task
Aaron L.-F. Han
hanlifengaaron@gmail.com
Derek F. Wong
derekfw@umac.mo
Lidia S. Chao
lidiasc@umac.mo
Yi Lu
mb25435@umac.mo
Liangye He
wutianshui0515@gmail.com
Yiming Wang
mb25433@umac.mo
Jiaji Zhou
mb25473@umac.mo
Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory
Department of Computer and Information Science
University of Macau, Macau S.A.R. China
Abstract
This paper describes the machine translation evaluation systems we submitted to the WMT13 shared Metrics Task: two automatic MT evaluation systems, nLEPOR_baseline and LEPOR_v3.1. nLEPOR_baseline is an n-gram based, language-independent MT evaluation metric employing the factors of modified sentence length penalty, position difference penalty, n-gram precision, and n-gram recall. nLEPOR_baseline measures the similarity of the system output translations and the reference translations on word sequences only. LEPOR_v3.1 is a new version of the LEPOR metric that uses the mathematical harmonic mean to group the factors and employs linguistic features such as part-of-speech information. The WMT13 evaluation results show that LEPOR_v3.1 yields the highest average score, 0.86, with human judgments at system level using the Pearson correlation criterion on English-to-other (FR, DE, ES, CS, RU) language pairs.1

1 The final publication is available at http://aclweb.org/anthology/
1 Introduction
Machine translation has a long history dating back to the 1950s (Weaver, 1955) and has developed rapidly in recent years with the advance of computing technology. For instance, Och (2003) presents the Minimum Error Rate Training (MERT) method for log-linear statistical machine translation models to achieve better translation quality; Menezes et al. (2006) introduce a syntactically informed phrasal SMT system for English-to-Spanish translation using a phrase translation model based on global reordering and the dependency tree; Su et al. (2009) use the Thematic Role Templates model to improve translation; and Costa-jussà et al. (2012) develop a phrase-based SMT system for Chinese-Spanish translation using a pivot language. With the rapid development of Machine Translation (MT), the evaluation of MT has become a challenge for researchers. MT evaluation is not an easy task due to the diversity of languages, especially in the evaluation between distant languages (English, Russian, Japanese, etc.).
2 Related Works
The earliest human assessment methods for machine translation include intelligibility and fidelity, used around the 1960s (Carroll, 1966), and adequacy (similar to fidelity), fluency, and comprehension (an improved form of intelligibility) (White et al., 1994). Because manual evaluation is expensive, automatic evaluation metrics and systems have appeared more recently.
The early automatic evaluation metrics include the word error rate WER (Su et al., 1992) and the position-independent word error rate PER (Tillmann et al., 1997), both based on the Levenshtein distance. Venues that have promoted the MT and MT evaluation literature include the ACL's annual Workshop on Statistical Machine Translation WMT (Koehn and Monz, 2006; Callison-Burch et al., 2012), the NIST Open Machine Translation (OpenMT) Evaluation series (Li, 2005), and the International Workshop on Spoken Language Translation IWSLT, organized annually since 2004 (Eck and Hori, 2005; Paul, 2008, 2009; Paul et al., 2010; Federico et al., 2011).
BLEU (Papineni et al., 2002) is one of the most commonly used evaluation metrics; it is designed to calculate document-level precisions. The NIST metric (Doddington, 2002) builds on BLEU but adds information weights to the n-gram approaches. TER (Snover et al., 2006) is another well-known MT evaluation metric, designed to calculate the amount of work needed to correct the hypothesis translation according to the reference translations; its edit categories include insertion, deletion, and substitution of single words, as well as shifts of word chunks. Other related works include METEOR (Banerjee and Lavie, 2005), which uses semantic matching (word stems, synonyms, and paraphrases), and the metrics of Wong and Kit (2008), Popovic (2012), and Chen et al. (2012), which introduce word-order factors. Traditional evaluation metrics tend to perform well on language pairs with English as the target language. This paper introduces evaluation models that also perform well on language pairs with English as the source language.
3 Description of Systems
3.1 Sub-Factors
First, we introduce the sub-factor of modified sentence length penalty, inspired by the brevity penalty of the BLEU metric:

LP = \begin{cases} e^{1 - r/c}, & c < r \\ 1, & c = r \\ e^{1 - c/r}, & c > r \end{cases}    (1)

In the formula, LP denotes the sentence length penalty, designed to penalize a translated sentence (hypothesis translation) that is either shorter or longer than the reference sentence. The parameters c and r represent the lengths of the candidate sentence and the reference sentence respectively.
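For concreteness, a minimal Python sketch of Eq. (1) follows; the function name and interface are ours, not those of the released LEPOR toolkit.

```python
import math

def length_penalty(c, r):
    """Modified sentence length penalty LP of Eq. (1).
    c: length of the candidate (hypothesis) sentence;
    r: length of the reference sentence."""
    if c == r:
        return 1.0
    # penalize both shorter (c < r) and longer (c > r) hypotheses
    return math.exp(1.0 - r / c) if c < r else math.exp(1.0 - c / r)
```

For example, an 8-word hypothesis against a 10-word reference gives LP = e^{1 - 10/8} ≈ 0.78.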
Second, we consider the factors of n-gram precision and n-gram recall:

P_n = \frac{\#ngram_{matched}}{\#ngram_{hyp}}    (2)

R_n = \frac{\#ngram_{matched}}{\#ngram_{ref}}    (3)

The variable \#ngram_{matched} represents the number of matched n-gram chunks between the hypothesis sentence and the reference sentence. The n-gram precision and n-gram recall are first calculated at the sentence level, instead of the corpus level used in BLEU. We then define the weighted n-gram harmonic mean of precision and recall (WNHPR):

WNHPR = \exp\left( \sum_{n=1}^{N} w_n \log \frac{\alpha + \beta}{\alpha/R_n + \beta/P_n} \right)    (4)
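A sketch of the sentence-level computation behind Eqs. (2)-(4); the clipped-count matching and the example α, β values are our illustrative choices, not the tuned ones (see Section 5.1).

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision_recall(hyp_tokens, ref_tokens, n):
    """Sentence-level n-gram precision and recall, Eqs. (2)-(3);
    each reference n-gram is matched at most once (clipping)."""
    hyp_ng = Counter(ngrams(hyp_tokens, n))
    ref_ng = Counter(ngrams(ref_tokens, n))
    matched = sum((hyp_ng & ref_ng).values())
    precision = matched / max(len(hyp_tokens) - n + 1, 1)
    recall = matched / max(len(ref_tokens) - n + 1, 1)
    return precision, recall

def hpr(p, r, alpha, beta):
    """Weighted harmonic mean of precision and recall,
    the inner term of Eq. (4)."""
    return 0.0 if p == 0.0 or r == 0.0 else (alpha + beta) / (alpha / r + beta / p)
```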
Third, we introduce the n-gram based position difference penalty (NPosPenal). This factor penalizes differences in the order of successfully matched words between the reference sentence and the hypothesis sentence. The alignment direction is from the hypothesis sentence to the reference sentence. The matching step employs an n-gram method: a potential matched word is assigned higher priority if it also has a nearby match. The nearest match is accepted as a backup choice when there are several nearby matches, or when there is no other matched word around the potential pairs.

NPosPenal = e^{-NPD}    (5)

NPD = \frac{1}{Length_{hyp}} \sum_{i=1}^{Length_{hyp}} |PD_i|    (6)

PD_i = |MatchN_{hyp}(i) - MatchN_{ref}(i)|    (7)

The variable Length_{hyp} denotes the length of the hypothesis sentence; the variables MatchN_{hyp} and MatchN_{ref} represent the position numbers of matched words in the hypothesis sentence and the reference sentence respectively.
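Assuming the raw (1-based) word positions of Eq. (7), the penalty itself can be sketched as follows once the word alignment has been fixed; the alignment search (context priority, nearest-match backup) is omitted here.

```python
import math

def npos_penalty(matched_positions, hyp_length):
    """N-gram position difference penalty, Eqs. (5)-(7).
    matched_positions: (pos_in_hyp, pos_in_ref) pairs, 1-based,
    one per matched word."""
    npd = sum(abs(h - r) for h, r in matched_positions) / hyp_length
    return math.exp(-npd)
```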
3.2 Linguistic Features
Linguistic features can easily be incorporated into our evaluation models. In the experiment results submitted to the WMT Metrics Task, we used the part-of-speech information of the words in question. In grammar, a part of speech (also called a word class, a lexical class, or a lexical category) is a linguistic category of lexical items, generally defined by the syntactic or morphological behavior of the lexical item in question. The POS information utilized in our metric LEPOR_v3.1, an enhanced version of LEPOR (Han et al., 2012), is extracted using the Berkeley parser (Petrov et al., 2006) for English, German, and French, the COMPOST Czech morphology tagger (Collins, 2002) for Czech, and TreeTagger (Schmid, 1994) for Spanish and Russian.
Ratio               CZ-EN  DE-EN  ES-EN  FR-EN  EN-CZ  EN-DE  EN-ES  EN-FR
HPR:LP:NPP (word)   7:2:1  3:2:1  7:2:1  3:2:1  7:2:1  1:3:7  3:2:1  3:2:1
HPR:LP:NPP (POS)    NA     3:2:1  NA     3:2:1  7:2:1  7:2:1  NA     3:2:1
α:β (word)          1:9    9:1    1:9    9:1    9:1    9:1    9:1    9:1
α:β (POS)           NA     9:1    NA     9:1    9:1    9:1    NA     9:1
w_word:w_POS        NA     1:9    NA     9:1    1:9    1:9    NA     9:1
(The first four columns are other-to-English pairs; the last four are English-to-other.)
Table 1. The tuned weight values in the LEPOR_v3.1 system
System            CZ-EN  DE-EN  ES-EN  FR-EN  EN-CZ  EN-DE  EN-ES  EN-FR  Mean
LEPOR_v3.1        0.93   0.86   0.88   0.92   0.83   0.82   0.85   0.83   0.87
nLEPOR_baseline   0.95   0.61   0.96   0.88   0.68   0.35   0.89   0.83   0.77
METEOR            0.91   0.71   0.88   0.93   0.65   0.30   0.74   0.85   0.75
BLEU              0.88   0.48   0.90   0.85   0.65   0.44   0.87   0.86   0.74
TER               0.83   0.33   0.89   0.77   0.50   0.12   0.81   0.84   0.64
Table 2. The performances (correlation scores with human judgment) of the nLEPOR_baseline and LEPOR_v3.1 systems on WMT11 corpora
3.3 The nLEPOR_baseline System
The nLEPOR_baseline system uses the simple product of the factors: modified length penalty, n-gram position difference penalty, and the weighted n-gram harmonic mean of precision and recall.

nLEPOR = LP \cdot NPosPenal \cdot WNHPR    (8)

The system-level score is the arithmetic mean of the sentence-level evaluation scores. In the Metrics Task experiments using the nLEPOR_baseline system, we assign N = 1 in the factor WNHPR, i.e., the weighted unigram harmonic mean of precision and recall.
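A sketch of Eq. (8) and the system-level aggregation; the names are ours.

```python
import math

def nlepor_sentence(lp, npos_penal, hpr_values, weights):
    """nLEPOR sentence score, Eq. (8): LP x NPosPenal x WNHPR,
    with WNHPR as in Eq. (4). hpr_values[n-1] holds HPR(n) and is
    assumed nonzero (smoothed) so the log is defined."""
    wnhpr = math.exp(sum(w * math.log(h) for w, h in zip(weights, hpr_values)))
    return lp * npos_penal * wnhpr

def system_score(sentence_scores):
    """System-level score: arithmetic mean of sentence-level scores."""
    return sum(sentence_scores) / len(sentence_scores)
```

With N = 1, as in our submission, WNHPR reduces to the unigram HPR itself, so the sentence score is simply LP · NPosPenal · HPR(1).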
3.4 The LEPOR_v3.1 System
The LEPOR_v3.1 system (also called hLEPOR) combines the sub-factors using a weighted mathematical harmonic mean instead of a simple product:

hLEPOR = \frac{w_{LP} + w_{NPosPenal} + w_{HPR}}{\frac{w_{LP}}{LP} + \frac{w_{NPosPenal}}{NPosPenal} + \frac{w_{HPR}}{HPR}}    (9)

Furthermore, this system takes linguistic features into account, such as the POS of the words. First, we calculate the hLEPOR score on the surface words, hLEPOR_{word} (the closeness of the hypothesis translation and the reference translation). Then, we calculate the hLEPOR score on the extracted POS sequences, hLEPOR_{POS} (the closeness of the corresponding POS tags between the hypothesis sentence and the reference sentence). The final score is the weighted combination of the two sub-scores:

LEPOR_{v3.1} = \frac{w_{word} \cdot hLEPOR_{word} + w_{POS} \cdot hLEPOR_{POS}}{w_{word} + w_{POS}}    (10)
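The two formulas can be sketched as below; the default weights merely echo two of the ratios reported in Table 1 and are otherwise arbitrary.

```python
def hlepor(lp, npos_penal, hpr, w_hpr=7.0, w_lp=2.0, w_npp=1.0):
    """hLEPOR, Eq. (9): weighted harmonic mean of the three factors
    (defaults follow the 7:2:1 HPR:LP:NPP ratio from Table 1)."""
    return (w_hpr + w_lp + w_npp) / (w_hpr / hpr + w_lp / lp + w_npp / npos_penal)

def lepor_v31(hlepor_word, hlepor_pos, w_word=1.0, w_pos=9.0):
    """Final LEPOR_v3.1 score, Eq. (10): weighted combination of the
    surface-word and POS-sequence hLEPOR scores."""
    return (w_word * hlepor_word + w_pos * hlepor_pos) / (w_word + w_pos)
```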
4 Evaluation Method
In MT evaluation tasks, the Spearman rank correlation coefficient is commonly used by the ACL WMT to evaluate the correlation of different MT evaluation metrics with human judgment. We therefore use the Spearman rank correlation coefficient to evaluate the performance of nLEPOR_baseline and LEPOR_v3.1 in system-level correlation with human judgments. When there are no ties, \rho is calculated as:

\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}    (11)

The variable d_i is the difference between the ranks assigned to system i, and n is the number of systems. We also report the Pearson correlation coefficient. Given a sample of paired data (X, Y) as (x_i, y_i), i = 1, \dots, n, the Pearson correlation coefficient is:

\rho_{XY} = \frac{\sum_{i=1}^{n} (x_i - \mu_X)(y_i - \mu_Y)}{\sqrt{\sum_{i=1}^{n} (x_i - \mu_X)^2} \sqrt{\sum_{i=1}^{n} (y_i - \mu_Y)^2}}    (12)

where \mu_X and \mu_Y denote the means of the discrete random variables X and Y respectively.
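Both coefficients are straightforward to compute; a self-contained sketch (assuming no ties, as in Eq. (11)):

```python
def spearman_rho(x, y):
    """Spearman rank correlation, Eq. (11); assumes no ties."""
    def ranks(values):
        order = sorted(values, reverse=True)
        return [order.index(v) + 1 for v in values]  # rank 1 = highest score
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))

def pearson_r(x, y):
    """Pearson correlation coefficient, Eq. (12)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```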
Directions         EN-FR  EN-DE  EN-ES  EN-CS  EN-RU  Av
LEPOR_v3.1         .91    .94    .91    .76    .77    .86
nLEPOR_baseline    .92    .92    .90    .82    .68    .85
SIMPBLEU_RECALL    .95    .93    .90    .82    .63    .84
SIMPBLEU_PREC      .94    .90    .89    .82    .65    .84
NIST-mteval-inter  .91    .83    .84    .79    .68    .81
Meteor             .91    .88    .88    .82    .55    .81
BLEU-mteval-inter  .89    .84    .88    .81    .61    .80
BLEU-moses         .90    .82    .88    .80    .62    .80
BLEU-mteval        .90    .82    .87    .80    .62    .80
CDER-moses         .91    .82    .88    .74    .63    .80
NIST-mteval        .91    .79    .83    .78    .68    .79
PER-moses          .88    .65    .88    .76    .62    .76
TER-moses          .91    .73    .78    .70    .61    .75
WER-moses          .92    .69    .77    .70    .61    .74
TerrorCat          .94    .96    .95    na     na     .95
SEMPOS             na     na     na     .72    na     .72
ACTa               .81    -.47   na     na     na     .17
ACTa5+6            .81    -.47   na     na     na     .17
Table 3. System-level Pearson correlation scores on WMT13 English-to-other language pairs
5 Experiments
5.1 Training
In the training stage, we used the officially released data from the past WMT series. Russian did not appear in the past WMT shared tasks, so we trained our systems on the other eight language pairs: English to other (French, German, Spanish, Czech) and the inverse translation directions. To avoid overfitting, we used the WMT11 corpora as training data to tune the parameter weights toward a higher correlation with human judgments in system-level evaluation. For the nLEPOR_baseline system, the tuned values of α and β are 9 and 1 respectively for all language pairs except CS-EN, where (α = 1, β = 9). For the LEPOR_v3.1 system, the tuned weight values are shown in Table 1.
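The tuning itself is a simple search: score the WMT11 system outputs under each candidate weight setting and keep the setting with the highest system-level correlation. A hypothetical sketch (the grid and function names are ours, not from the released toolkit):

```python
from scipy.stats import pearsonr

def tune_weights(score_fn, human_scores, candidate_settings):
    """Pick the weight setting maximizing system-level Pearson
    correlation with human judgments on the training data.
    score_fn(setting) -> one metric score per MT system."""
    return max(candidate_settings,
               key=lambda s: pearsonr(score_fn(s), human_scores)[0])

# e.g. a small grid over HPR:LP:NPP ratios like those in Table 1
grid = [(7, 2, 1), (3, 2, 1), (1, 3, 7)]
```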
The evaluation scores of the training results on the WMT11 corpora are shown in Table 2. The designed methods show promising correlation scores with human judgments at the system level: mean scores of 0.87 and 0.77 over the eight language pairs for LEPOR_v3.1 and nLEPOR_baseline respectively. Compared with METEOR, BLEU, and TER, we achieve higher correlation scores with human judgments.
5.2 Testing
In the WMT13 shared Metrics Task, we also submitted our system performances on the English-to-Russian and Russian-to-English language pairs. However, since Russian did not appear in the past WMT shared tasks, we assigned the default parameter weights to Russian for the two submitted systems. The officially released results on the WMT13 corpora are shown in Tables 3, 4, and 5, covering system-level (Pearson and Spearman) and segment-level performance on the English-to-other language pairs.
Directions         EN-FR  EN-DE  EN-ES  EN-CS  EN-RU  Av
SIMPBLEU_RECALL    .92    .93    .83    .87    .71    .85
LEPOR_v3.1         .90    .90    .84    .75    .85    .85
NIST-mteval-inter  .93    .85    .80    .90    .77    .85
CDER-moses         .92    .87    .86    .89    .70    .85
nLEPOR_baseline    .92    .90    .85    .82    .73    .84
NIST-mteval        .91    .83    .78    .92    .72    .83
SIMPBLEU_PREC      .91    .88    .78    .88    .70    .83
Meteor             .92    .88    .78    .94    .57    .82
BLEU-mteval-inter  .92    .83    .76    .90    .66    .81
BLEU-mteval        .89    .79    .76    .90    .63    .79
TER-moses          .91    .85    .75    .86    .54    .78
BLEU-moses         .90    .79    .76    .90    .57    .78
WER-moses          .91    .83    .71    .86    .55    .77
PER-moses          .87    .69    .77    .80    .59    .74
TerrorCat          .93    .95    .91    na     na     .93
SEMPOS             na     na     na     .70    na     .70
ACTa5+6            .81    -.53   na     na     na     .14
ACTa               .81    -.53   na     na     na     .14
Table 4. System-level Spearman rank correlation scores on WMT13 English-to-other language pairs
Table 3 shows that LEPOR_v3.1 and nLEPOR_baseline yield the highest and second-highest average Pearson correlation scores with human judgments at the system level, 0.86 and 0.85 respectively, on the five English-to-other language pairs. LEPOR_v3.1 and nLEPOR_baseline also yield the highest Pearson correlation scores on the English-to-Russian (0.77) and English-to-Czech (0.82) language pairs respectively. The testing results of LEPOR_v3.1 and nLEPOR_baseline show better correlation scores than METEOR (0.81), BLEU (0.80), and TER-moses (0.75) on the English-to-other language pairs, consistent with the training results.
On the other hand, using the Spearman rank correlation coefficient, SIMPBLEU_RECALL yields the highest correlation score, 0.85, with human judgments. Our metric LEPOR_v3.1 also yields the highest Spearman correlation score on the English-to-Russian (0.85) language pair, consistent with the Pearson result and indicating robust performance on this language pair.
Directions               EN-FR  EN-DE  EN-ES  EN-CS  EN-RU  Av
SIMPBLEU_RECALL          .16    .09    .23    .06    .12    .13
Meteor                   .15    .05    .18    .06    .11    .11
SIMPBLEU_PREC            .14    .07    .19    .06    .09    .11
sentBLEU-moses           .13    .05    .17    .05    .09    .10
LEPOR_v3.1               .13    .06    .18    .02    .11    .10
nLEPOR_baseline          .12    .05    .16    .05    .10    .10
dfki_logregNorm-411      na     na     .14    na     na     .14
TerrorCat                .12    .07    .19    na     na     .13
dfki_logregNormSoft-431  na     na     .03    na     na     .03
Table 5. Segment-level Kendall's tau correlation scores on WMT13 English-to-other language pairs
However, we find a problem in the Spearman rank correlation method. For instance, let two evaluation metrics MA and MB assign the evaluation scores (0.95, 0.50, 0.45) and (0.77, 0.75, 0.74) respectively to three MT systems (M1, M2, M3). Before the correlation with human judgments is calculated, both score vectors are converted to the same rank sequence (1, 2, 3) by the Spearman rank method; thus the two evaluation systems will get the same correlation with human judgments, whatever the values of the human judgments are. But the two metrics in fact reflect different results: MA gives an outstanding score (0.95) to system M1 and puts very low scores (0.50 and 0.45) on the other two systems M2 and M3, while MB regards the three MT systems as having similar performance (scores from 0.74 to 0.77). This information is lost under the Spearman rank correlation methodology.
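The example can be checked numerically; the human judgment values below are made up solely to share the metrics' ranking (scipy's spearmanr and pearsonr are used for brevity):

```python
from scipy.stats import pearsonr, spearmanr

scores_ma = [0.95, 0.50, 0.45]  # metric MA over systems M1, M2, M3
scores_mb = [0.77, 0.75, 0.74]  # metric MB over the same systems
human = [0.90, 0.60, 0.55]      # hypothetical human judgments

print(spearmanr(scores_ma, human)[0])  # 1.0
print(spearmanr(scores_mb, human)[0])  # 1.0 -- identical: magnitudes are lost
print(pearsonr(scores_ma, human)[0])   # ~0.999
print(pearsonr(scores_mb, human)[0])   # ~0.98 -- Pearson separates the two
```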
The segment-level performance of LEPOR_v3.1 is moderate, with an average Kendall's tau correlation score of 0.10 on the five English-to-other language pairs, which is due to the fact that we trained our metrics at the system level in this shared metrics task. Lastly, the officially released results on the WMT13 other-to-English language pairs are shown in Tables 6, 7, and 8, covering system-level (Pearson and Spearman) and segment-level performance.
Directions         FR-EN  DE-EN  ES-EN  CS-EN  RU-EN  Av
Meteor             .98    .96    .97    .99    .84    .95
SEMPOS             .95    .95    .96    .99    .82    .93
Depref-align       .97    .97    .97    .98    .74    .93
Depref-exact       .97    .97    .96    .98    .73    .92
SIMPBLEU_RECALL    .97    .97    .96    .94    .78    .92
UMEANT             .96    .97    .99    .97    .66    .91
MEANT              .96    .96    .99    .96    .63    .90
CDER-moses         .96    .91    .95    .90    .66    .88
SIMPBLEU_PREC      .95    .92    .95    .91    .61    .87
LEPOR_v3.1         .96    .96    .90    .81    .71    .87
nLEPOR_baseline    .96    .94    .94    .80    .69    .87
BLEU-mteval-inter  .95    .92    .94    .90    .61    .86
NIST-mteval-inter  .94    .91    .93    .84    .66    .86
BLEU-moses         .94    .91    .94    .89    .60    .86
BLEU-mteval        .95    .90    .94    .88    .60    .85
NIST-mteval        .94    .90    .93    .84    .65    .85
TER-moses          .93    .87    .91    .77    .52    .80
WER-moses          .93    .84    .89    .76    .50    .78
PER-moses          .84    .88    .87    .74    .45    .76
TerrorCat          .98    .98    .97    na     na     .98
Table 6. System-level Pearson correlation scores on WMT13 other-to-English language pairs
METEOR yields the highest average correlation scores, 0.95 and 0.94 respectively with the Pearson and Spearman rank correlation methods, on the other-to-English language pairs. The average performance of nLEPOR_baseline is a little better than that of LEPOR_v3.1 on the five other-to-English language pairs, even though both are moderate compared with other metrics. Notably, with the Pearson correlation method, nLEPOR_baseline yields an average correlation score of 0.87, which already beats BLEU (0.86) and TER (0.80), as shown in Table 6.
Directions         FR-EN  DE-EN  ES-EN  CS-EN  RU-EN  Av
Meteor             .98    .96    .98    .96    .81    .94
Depref-align       .99    .97    .97    .96    .79    .94
UMEANT             .99    .95    .96    .97    .79    .93
MEANT              .97    .93    .94    .97    .78    .92
Depref-exact       .98    .96    .94    .94    .76    .92
SEMPOS             .94    .92    .93    .95    .83    .91
SIMPBLEU_RECALL    .98    .94    .92    .91    .81    .91
BLEU-mteval-inter  .99    .90    .90    .94    .72    .89
BLEU-mteval        .99    .89    .89    .94    .69    .88
BLEU-moses         .99    .90    .88    .94    .67    .88
CDER-moses         .99    .88    .89    .93    .69    .87
SIMPBLEU_PREC      .99    .85    .83    .92    .72    .86
nLEPOR_baseline    .95    .95    .83    .85    .72    .86
LEPOR_v3.1         .95    .93    .75    .80    .79    .84
NIST-mteval        .95    .88    .77    .89    .66    .83
NIST-mteval-inter  .95    .88    .76    .88    .68    .83
TER-moses          .95    .83    .83    .80    .60    .80
WER-moses          .95    .67    .80    .75    .61    .76
PER-moses          .85    .86    .36    .70    .67    .69
TerrorCat          .98    .96    .97    na     na     .97
Table 7. System-level Spearman rank correlation scores on WMT13 other-to-English language pairs
Once again, our metrics perform moderately at the segment level on the other-to-English language pairs, because they were trained at the system level. We note that some evaluation metrics did not submit results on all the language pairs; however, their performance on the submitted language pairs is sometimes very good, such as the dfki_logregFSS-33 metric with a segment-level correlation score of 0.27 on the German-to-English language pair.
6 Conclusion
This paper describes our participation in the WMT13 Metrics Task. We submitted two systems, nLEPOR_baseline and LEPOR_v3.1, both trained and tested on the officially released data. LEPOR_v3.1 yields the highest average Pearson correlation score, 0.86, with human judgments on the five English-to-other language pairs, and nLEPOR_baseline performs better than LEPOR_v3.1 on the other-to-English language pairs. Furthermore, LEPOR_v3.1 shows robust system-level performance on the English-to-Russian language pair, and nLEPOR_baseline shows the best system-level performance on the English-to-Czech language pair under the Pearson correlation criterion. Compared with nLEPOR_baseline, the experimental results of LEPOR_v3.1 also show that the proper use of linguistic information can increase the performance of evaluation systems.
Directions         FR-EN  DE-EN  ES-EN  CS-EN  RU-EN  Av
SIMPBLEU_RECALL    .19    .32    .28    .26    .23    .26
Meteor             .18    .29    .24    .27    .24    .24
Depref-align       .16    .27    .23    .23    .20    .22
Depref-exact       .17    .26    .23    .23    .19    .22
SIMPBLEU_PREC      .15    .24    .21    .21    .17    .20
nLEPOR_baseline    .15    .24    .20    .18    .17    .19
sentBLEU-moses     .15    .22    .20    .20    .17    .19
LEPOR_v3.1         .15    .22    .16    .19    .18    .18
UMEANT             .10    .17    .14    .16    .11    .14
MEANT              .10    .16    .14    .16    .11    .14
dfki_logregFSS-33  na     .27    na     na     na     .27
dfki_logregFSS-24  na     .27    na     na     na     .27
TerrorCat          .16    .30    .23    na     na     .23
Table 8. Segment-level Kendall's tau correlation scores on WMT13 other-to-English language pairs
The source code developed for this paper is freely available online for research purposes. The nLEPOR and hLEPOR measuring algorithms are available at https://github.com/aaronlifenghan/aaron-project-lepor and https://github.com/aaronlifenghan/aaron-project-hlepor respectively. The final publication is available at http://www.statmt.org/wmt13/pdf/WMT53.pdf.
Acknowledgments
The authors wish to thank the anonymous reviewers for many helpful comments. The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau for funding support for our research, under reference numbers 017/2009/A and RG060/09-10S/CS/FST.
References
Banerjee, Satanjeev and Alon Lavie. 2005. METEOR:
An Automatic Metric for MT Evaluation with Im-
proved Correlation with Human Judgments. In
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), pages 65–72, Ann Arbor, US, June. Association for Computational Linguistics.
Callison-Burch, Chris, Philipp Koehn, Christof Monz,
and Omar F. Zaidan. 2011. Findings of the 2011
Workshop on Statistical Machine Translation.
In Proceedings of the Sixth Workshop on Sta-
tistical Machine Translation (WMT '11). Asso-
ciation for Computational Linguistics, Stroudsburg,
PA, USA, 22-64.
Callison-Burch, Chris, Philipp Koehn, Christof Monz,
Matt Post, Radu Soricut, and Lucia Specia. 2012.
Findings of the 2012 Workshop on Statistical Ma-
chine Translation. In Proceedings of the Seventh
Workshop on Statistical Machine Translation,
pages 10–51, Montreal, Canada. Association for
Computational Linguistics.
Carroll, John B. 1966. An Experiment in Evaluating
the Quality of Translations, Mechanical Transla-
tion and Computational Linguistics, vol.9,
nos.3 and 4, September and December 1966, page
55-66, Graduate School of Education, Harvard
University.
Chen, Boxing, Roland Kuhn and Samuel Larkin. 2012.
PORT: a Precision-Order-Recall MT Evaluation
Metric for Tuning, Proceedings of the 50th An-
nual Meeting of the Association for Computa-
tional Linguistics, pages 930–939, Jeju, Republic
of Korea, 8-14 July 2012.
Collins, Michael. 2002. Discriminative Training
Methods for Hidden Markov Models: Theory and
Experiments with Perceptron Algorithms. In Pro-
ceedings of the ACL-02 conference on Empiri-
cal methods in natural language processing,
Volume 10 (EMNLP 02), pages 1-8. Association
for Computational Linguistics, Stroudsburg, PA,
USA.
Costa-jussà, Marta R., Carlos A. Henríquez and Ra-
fael E. Banchs. 2012. Evaluating Indirect Strategies
for Chinese-Spanish Statistical Machine Transla-
tion. Journal of artificial intelligence research,
Volume 45, pages 761-780.
Doddington, George. 2002. Automatic evaluation of
machine translation quality using n-gram co-
occurrence statistics. In Proceedings of the sec-
ond international conference on Human Lan-
guage Technology Research (HLT '02). Morgan
Kaufmann Publishers Inc., San Francisco, CA,
USA, 138-145.
Eck, Matthias and Chiori Hori. 2005. Overview of the
IWSLT 2005 Evaluation Campaign. Proceedings
of IWSLT 2005.
Federico, Marcello, Luisa Bentivogli, Michael Paul,
and Sebastian Stüker. 2011. Overview of the
IWSLT 2011 Evaluation Campaign. In Proceed-
ings of IWSLT 2011, 11-27.
Han, Aaron Li-Feng, Derek F. Wong and Lidia S.
Chao. 2012. LEPOR: A Robust Evaluation Metric
for Machine Translation with Augmented Factors.
Proceedings of the 24th International Confer-
ence on Computational Linguistics (COLING
2012: Posters), Mumbai, India.
Koehn, Philipp and Christof Monz. 2006. Manual and
Automatic Evaluation of Machine Translation be-
tween European Languages. Proceedings of the
ACL Workshop on Statistical Machine Trans-
lation, pages 102–121, New York City.
Li, A. 2005. Results of the 2005 NIST machine translation evaluation. In Machine Translation Workshop.
Menezes, Arul, Kristina Toutanova and Chris Quirk.
2006. Microsoft Research Treelet Translation Sys-
tem: NAACL 2006 Europarl Evaluation, Proceed-
ings of the ACL Workshop on Statistical Ma-
chine Translation, pages 158-161, New York
City.
Och, Franz Josef. 2003. Minimum Error Rate Train-
ing for Statistical Machine Translation. In Pro-
ceedings of the 41st Annual Meeting of the As-
sociation for Computational Linguistics (ACL-
2003). pp. 160-167.
Papineni, Kishore, Salim Roukos, Todd Ward, and
Wei-Jing Zhu. 2002. BLEU: a method for automat-
ic evaluation of machine translation.
In Proceedings of the 40th Annual Meeting on
Association for Computational Linguis-
tics (ACL '02). Association for Computational
Linguistics, Stroudsburg, PA, USA, 311-318.
Paul, Michael. 2008. Overview of the IWSLT 2008 Evaluation Campaign. Proceedings of IWSLT 2008, Hawaii, USA.
Paul, Michael. 2009. Overview of the IWSLT 2009
Evaluation Campaign. In Proc. of IWSLT 2009,
Tokyo, Japan, pp. 1–18.
Paul, Michael, Marcello Federico and Sebastian
Stüker. 2010. Overview of the IWSLT 2010 Eval-
uation Campaign, Proceedings of the 7th Inter-
national Workshop on Spoken Language
Translation, Paris, December 2nd and 3rd, 2010,
page 1-25.
Petrov, Slav, Leon Barrett, Romain Thibaux, and Dan
Klein. 2006. Learning accurate, compact, and in-
terpretable tree annotation. In Proceedings of the
21st International Conference on Computa-
tional Linguistics and the 44th annual meeting
of the Association for Computational Linguis-
tics (ACL-44). Association for Computational
Linguistics, Stroudsburg, PA, USA, 433-440.
Popovic, Maja. 2012. Class error rates for evaluation
of machine translation output. Proceedings of the
7th Workshop on Statistical Machine Transla-
tion, pages 71–75, Canada.
Schmid, Helmut. 1994. Probabilistic Part-of-Speech
Tagging Using Decision Trees. In Proceedings of
International Conference on New Methods in
Language Processing, Manchester, UK.
Snover, Matthew, Bonnie J. Dorr, Richard Schwartz,
Linnea Micciulla, and John Makhoul. 2006. A
Study of Translation Edit Rate with Targeted Hu-
man Annotation. In Proceedings of the 7th Con-
ference of the Association for Machine Trans-
lation in the Americas (AMTA-06), pages 223–
231, USA. Association for Machine Translation in
the Americas.
Su, Hung-Yu and Chung-Hsien Wu. 2009. Improving
Structural Statistical Machine Translation for Sign
Language With Small Corpus Using Thematic
Role Templates as Translation Memory, IEEE
TRANSACTIONS ON AUDIO, SPEECH, AND
LANGUAGE PROCESSING, VOL. 17, NO. 7.
Su, Keh-Yih, Wu Ming-Wen and Chang Jing-Shin.
1992. A New Quantitative Quality Measure for
Machine Translation Systems. In Proceedings of
the 14th International Conference on Compu-
tational Linguistics, pages 433–439, Nantes,
France.
Tillmann, Christoph, Stephan Vogel, Hermann Ney,
Arkaitz Zubiaga, and Hassan Sawaf. 1997. Accel-
erated DP Based Search For Statistical Translation.
In Proceedings of the 5th European Confer-
ence on Speech Communication and Technol-
ogy (EUROSPEECH-97).
Weaver, Warren. 1955. Translation. In William Locke
and A. Donald Booth, editors, Machine Transla-
tion of Languages: Fourteen Essays. John
Wiley & Sons, New York, pages 15–23.
White, John S., Theresa O’Connell, and Francis
O’Mara. 1994. The ARPA MT evaluation method-
ologies: Evolution, lessons, and future approaches.
In Proceedings of the Conference of the Asso-
ciation for Machine Translation in the Ameri-
cas (AMTA 1994). pp193-205.
Wong, Billy and Chunyu Kit. 2008. Word choice and
word position for automatic MT evaluation. In
Workshop: MetricsMATR of the Association for
Machine Translation in the Americas (AMTA),
short paper, Waikiki, Hawai'i, USA. Association
for Machine Translation in the Americas.
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 

Kürzlich hochgeladen (20)

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

ACL-WMT2013.A Description of Tunable Machine Translation Evaluation Systems in WMT13 Metrics Task

  • 1. A Description of Tunable Machine Translation Evaluation Systems in WMT13 Metrics Task Aaron L.-F. Han hanlifengaaron@gmail.com Derek F. Wong derekfw@umac.mo Lidia S. Chao lidiasc@umac.mo Yi Lu mb25435@umac.mo Liangye He wutianshui0515@gmail.com Yiming Wang mb25433@umac.mo Jiaji Zhou mb25473@uamc.mo Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory Department of Computer and Information Science University of Macau, Macau S.A.R. China Abstract This paper is to describe our machine transla- tion evaluation systems used for participation in the WMT13 shared Metrics Task. In the Metrics task, we submitted two automatic MT evaluation systems nLEPOR_baseline and LEPOR_v3.1. nLEPOR_baseline is an n-gram based language independent MT evaluation metric employing the factors of modified sen- tence length penalty, position difference penal- ty, n-gram precision and n-gram recall. nLEPOR_baseline measures the similarity of the system output translations and the refer- ence translations only on word sequences. LEPOR_v3.1 is a new version of LEPOR met- ric using the mathematical harmonic mean to group the factors and employing some linguis- tic features, such as the part-of-speech infor- mation. The evaluation results of WMT13 show LEPOR_v3.1 yields the highest average- score 0.86 with human judgments at system- level using Pearson correlation criterion on English-to-other (FR, DE, ES, CS, RU) lan- guage pairs.1 1 Introduction Machine translation has a long history since the 1950s (Weaver, 1955) and gains a fast develop- ment in the recent years because of the higher level of computer technology. For instances, Och (2003) presents Minimum Error Rate Training (MERT) method for log-linear statistical ma- chine translation models to achieve better trans- 1 The final publication is available at http://aclweb.org/anthology/ lation quality; Menezes et al. (2006) introduce a syntactically informed phrasal SMT system for English-to-Spanish translation using a phrase translation model, which is based on global reor- dering and the dependency tree; Su et al. (2009) use the Thematic Role Templates model to im- prove the translation; Costa-jussàet al. (2012) develop the phrase-based SMT system for Chi- nese-Spanish translation using a pivot language. With the rapid development of Machine Transla- tion (MT), the evaluation of MT has become a challenge in front of researchers. However, the MT evaluation is not an easy task due to the fact of the diversity of the languages, especially for the evaluation between distant languages (Eng- lish, Russia, Japanese, etc.). 2 Related works The earliest human assessment methods for ma- chine translation include the intelligibility and fidelity used around 1960s (Carroll, 1966), and the adequacy (similar as fidelity), fluency and comprehension (improved intelligibility) (White et al., 1994). Because of the expensive cost of manual evaluations, the automatic evaluation metrics and systems appear recently. The early automatic evaluation metrics in- clude the word error rate WER (Su et al., 1992) and position independent word error rate PER (Tillmann et al., 1997) that are based on the Le- venshtein distance. Several promotions for the MT and MT evaluation literatures include the ACL’s annual workshop on statistical machine translation WMT (Koehn and Monz, 2006; Calli- son-Burch et al., 2012), NIST open machine translation (OpenMT) Evaluation series (Li,
2005), and the International Workshop on Spoken Language Translation (IWSLT), which has been organized annually since 2004 (Eck and Hori, 2005; Paul, 2008, 2009; Paul et al., 2010; Federico et al., 2011).

BLEU (Papineni et al., 2002) is one of the most commonly used evaluation metrics; it calculates document-level n-gram precisions. The NIST metric (Doddington, 2002) is based on BLEU but adds information weights to the n-gram matches. TER (Snover et al., 2006) is another well-known MT evaluation metric; it calculates the amount of editing needed to correct the hypothesis translation according to the reference translations, covering the edit categories of insertion, deletion, and substitution of single words, as well as shifts of word chunks. Other related work includes METEOR (Banerjee and Lavie, 2005), which uses semantic matching (word stems, synonyms, and paraphrases), and the metrics of Wong and Kit (2008), Popovic (2012), and Chen et al. (2012), which introduce word-order factors. Traditional evaluation metrics tend to perform well on language pairs with English as the target language. This paper introduces evaluation models that also perform well on language pairs with English as the source language.

3 Description of Systems

3.1 Sub Factors

Firstly, we introduce the sub-factor of modified sentence length penalty, inspired by the brevity penalty of the BLEU metric:

\[
LP =
\begin{cases}
e^{1 - r/c}, & c < r \\
1, & c = r \\
e^{1 - c/r}, & c > r
\end{cases}
\tag{1}
\]

In the formula, $LP$ denotes the sentence length penalty, which is designed to penalize translated sentences (hypothesis translations) that are either shorter or longer than the reference sentence. The parameters $c$ and $r$ represent the lengths of the candidate sentence and the reference sentence respectively.

Secondly, we define the factors of n-gram precision and n-gram recall:

\[
P_n = \frac{\#\text{matched\_ngrams}}{\#\text{ngrams}(\text{hypothesis})}
\tag{2}
\]

\[
R_n = \frac{\#\text{matched\_ngrams}}{\#\text{ngrams}(\text{reference})}
\tag{3}
\]

The variable $\#\text{matched\_ngrams}$ represents the number of matched n-gram chunks between the hypothesis sentence and the reference sentence. The n-gram precision and n-gram recall are first calculated at sentence level, instead of the corpus level used in BLEU. We then define the weighted n-gram harmonic mean of precision and recall (WNHPR):

\[
\mathrm{WNHPR} = \sum_{n=1}^{N} w_n \left( \frac{\alpha + \beta}{\alpha / R_n + \beta / P_n} \right)
\tag{4}
\]

Thirdly, there is the n-gram based position difference penalty (NPosPenal). This factor is designed to penalize differences in the order of the successfully matched words between the reference sentence and the hypothesis sentence. The alignment direction is from the hypothesis sentence to the reference sentence. The matching step employs an n-gram method: a potential matched word is assigned higher priority if it also has a nearby match. The nearest match is accepted as a backup choice when there are several nearby matches, or when there is no other matched word around the potential pairs.

\[
\mathrm{NPosPenal} = e^{-\mathrm{NPD}}
\tag{5}
\]

\[
\mathrm{NPD} = \frac{1}{\mathrm{Length}_{hyp}} \sum_{i=1}^{\mathrm{Length}_{hyp}} |PD_i|
\tag{6}
\]

\[
PD_i = \frac{\mathrm{MatchN}_{hyp,i}}{\mathrm{Length}_{hyp}} - \frac{\mathrm{MatchN}_{ref,i}}{\mathrm{Length}_{ref}}
\tag{7}
\]

The variable $\mathrm{Length}_{hyp}$ denotes the length of the hypothesis sentence; the variables $\mathrm{MatchN}_{hyp,i}$ and $\mathrm{MatchN}_{ref,i}$ represent the position numbers of the $i$-th matched word in the hypothesis sentence and the reference sentence respectively.
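To make these sub-factors concrete, the following is a minimal Python sketch of Eqs. (1)-(7). It assumes whitespace tokenization and exact word matching, and it simplifies the alignment to the nearest-normalized-position backup rule rather than the full n-gram context alignment described above; all function names are ours, not those of the released LEPOR tool.

import math
from collections import Counter

def length_penalty(c, r):
    # Eq. (1): penalize hypotheses shorter or longer than the reference.
    if c < r:
        return math.exp(1 - r / c)
    if c > r:
        return math.exp(1 - c / r)
    return 1.0

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision_recall(hyp, ref, n):
    # Eqs. (2)-(3): clipped n-gram matches, computed at sentence level.
    hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
    matched = sum((Counter(hyp_ng) & Counter(ref_ng)).values())
    precision = matched / len(hyp_ng) if hyp_ng else 0.0
    recall = matched / len(ref_ng) if ref_ng else 0.0
    return precision, recall

def wnhpr(hyp, ref, weights=(1.0,), alpha=9.0, beta=1.0):
    # Eq. (4): weighted sum over n of the harmonic mean of alpha*R_n and beta*P_n.
    total = 0.0
    for n, w in enumerate(weights, start=1):
        p, r = ngram_precision_recall(hyp, ref, n)
        if p > 0 and r > 0:
            total += w * (alpha + beta) / (alpha / r + beta / p)
    return total

def npos_penal(hyp, ref):
    # Eqs. (5)-(7): each hypothesis word is aligned to the occurrence of the
    # same word in the reference whose normalized position is nearest (the
    # full metric prefers matches with n-gram context; this is the backup rule).
    diffs = []
    for i, tok in enumerate(hyp):
        candidates = [j for j, t in enumerate(ref) if t == tok]
        if candidates:
            j = min(candidates,
                    key=lambda k: abs((k + 1) / len(ref) - (i + 1) / len(hyp)))
            diffs.append(abs((i + 1) / len(hyp) - (j + 1) / len(ref)))
    npd = sum(diffs) / len(hyp)  # Eq. (6); assumes a non-empty hypothesis
    return math.exp(-npd)        # Eq. (5)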
3.2 Linguistic Features

Linguistic features can easily be incorporated into our evaluation models. In the experiment results submitted to the WMT13 Metrics Task, we used the part-of-speech (POS) information of the words in question. In grammar, a part of speech (also called a word class, a lexical class, or a lexical category) is a linguistic category of lexical items, generally defined by the syntactic or morphological behaviour of the lexical item in question.

The POS information utilized in our metric LEPOR_v3.1, an enhanced version of LEPOR (Han et al., 2012), is extracted using the Berkeley parser (Petrov et al., 2006) for English, German, and French, the COMPOST Czech morphology tagger (Collins, 2002) for Czech, and TreeTagger (Schmid, 1994) for Spanish and Russian.

3.3 The nLEPOR_baseline System

The nLEPOR_baseline system uses the simple product of the factors: modified length penalty, n-gram position difference penalty, and weighted n-gram harmonic mean of precision and recall:

\[
\mathrm{nLEPOR} = LP \cdot \mathrm{NPosPenal} \cdot \mathrm{WNHPR}
\tag{8}
\]

The system-level score is the arithmetic mean of the sentence-level evaluation scores. In the Metrics Task experiments with nLEPOR_baseline, we set $N = 1$ in the factor WNHPR, i.e. the weighted unigram harmonic mean of precision and recall.

3.4 The LEPOR_v3.1 System

The LEPOR_v3.1 system (also called hLEPOR) combines the sub-factors using a weighted mathematical harmonic mean instead of a simple product:

\[
\mathrm{hLEPOR} = \frac{w_{LP} + w_{NPosPenal} + w_{HPR}}{\dfrac{w_{LP}}{LP} + \dfrac{w_{NPosPenal}}{\mathrm{NPosPenal}} + \dfrac{w_{HPR}}{\mathrm{HPR}}}
\tag{9}
\]

Furthermore, this system takes linguistic features such as the POS of the words into account. Firstly, we calculate the hLEPOR score on surface words (the closeness of the hypothesis translation to the reference translation). Then, we calculate the hLEPOR score on the extracted POS sequences (the closeness of the corresponding POS tags of the hypothesis sentence and the reference sentence). The final score is the weighted combination of the two sub-scores $\mathrm{hLEPOR}_{word}$ and $\mathrm{hLEPOR}_{POS}$:

\[
\mathrm{hLEPOR}_{final} = \frac{w_{hw} \cdot \mathrm{hLEPOR}_{word} + w_{hp} \cdot \mathrm{hLEPOR}_{POS}}{w_{hw} + w_{hp}}
\tag{10}
\]

Ratio               CZ-EN  DE-EN  ES-EN  FR-EN  EN-CZ  EN-DE  EN-ES  EN-FR
HPR:LP:NPP (word)   7:2:1  3:2:1  7:2:1  3:2:1  7:2:1  1:3:7  3:2:1  3:2:1
HPR:LP:NPP (POS)    NA     3:2:1  NA     3:2:1  7:2:1  7:2:1  NA     3:2:1
α:β (word)          1:9    9:1    1:9    9:1    9:1    9:1    9:1    9:1
α:β (POS)           NA     9:1    NA     9:1    9:1    9:1    NA     9:1
w_hw:w_hp           NA     1:9    NA     9:1    1:9    1:9    NA     9:1

Table 1. The tuned weight values in the LEPOR_v3.1 system (the first four columns are other-to-English pairs, the last four English-to-other).

System           CZ-EN  DE-EN  ES-EN  FR-EN  EN-CZ  EN-DE  EN-ES  EN-FR  Mean
LEPOR_v3.1       0.93   0.86   0.88   0.92   0.83   0.82   0.85   0.83   0.87
nLEPOR_baseline  0.95   0.61   0.96   0.88   0.68   0.35   0.89   0.83   0.77
METEOR           0.91   0.71   0.88   0.93   0.65   0.30   0.74   0.85   0.75
BLEU             0.88   0.48   0.90   0.85   0.65   0.44   0.87   0.86   0.74
TER              0.83   0.33   0.89   0.77   0.50   0.12   0.81   0.84   0.64

Table 2. The performances (system-level correlation with human judgments) of the nLEPOR_baseline and LEPOR_v3.1 systems on the WMT11 corpora.
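For illustration, the sketch below combines the sub-factors as the simple product of Eq. (8) and as the weighted harmonic mean of Eqs. (9)-(10). It assumes the helper functions from the previous sketch are in scope; the default weights and the POS tag lists in the usage line are placeholders, not the tuned values of Table 1.

def nlepor(hyp, ref, weights=(1.0,), alpha=9.0, beta=1.0):
    # Eq. (8): simple product of the three sub-factors; hyp and ref are strings.
    hyp, ref = hyp.split(), ref.split()
    return (length_penalty(len(hyp), len(ref))
            * npos_penal(hyp, ref)
            * wnhpr(hyp, ref, weights, alpha, beta))

def hlepor(hyp, ref, w_lp=2.0, w_npp=1.0, w_hpr=7.0, alpha=9.0, beta=1.0):
    # Eq. (9): weighted harmonic mean of the sub-factors; hyp and ref are token lists.
    lp = length_penalty(len(hyp), len(ref))
    npp = npos_penal(hyp, ref)
    hpr = wnhpr(hyp, ref, (1.0,), alpha, beta)
    if hpr == 0.0:
        return 0.0  # no n-gram match at all
    return (w_lp + w_npp + w_hpr) / (w_lp / lp + w_npp / npp + w_hpr / hpr)

def hlepor_final(hyp, ref, hyp_pos, ref_pos, w_hw=1.0, w_hp=9.0):
    # Eq. (10): weighted combination of the word-level and POS-level scores.
    return (w_hw * hlepor(hyp.split(), ref.split())
            + w_hp * hlepor(hyp_pos, ref_pos)) / (w_hw + w_hp)

# Placeholder POS tags; in the submitted systems they come from the Berkeley
# parser, the COMPOST tagger, or TreeTagger, depending on the language.
print(hlepor_final("the cat sat", "the cat sat down",
                   ["DT", "NN", "VBD"], ["DT", "NN", "VBD", "RP"]))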
4 Evaluation Method

In the MT evaluation task, the Spearman rank correlation coefficient is conventionally used by the ACL WMT organizers to assess how well different MT evaluation metrics correlate with human judgments. We therefore use the Spearman rank correlation coefficient to evaluate the system-level correlation of nLEPOR_baseline and LEPOR_v3.1 with human judgments. When there are no ties, $\rho$ is calculated as:

\[
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
\tag{11}
\]

The variable $d_i$ is the difference between the two ranks assigned to system $i$, and $n$ is the number of systems. We also report the Pearson correlation coefficient. Given a sample of paired data $(X, Y)$ as $(x_i, y_i)$, $i = 1, \dots, n$, the Pearson correlation coefficient is:

\[
\rho_{XY} = \frac{\sum_{i=1}^{n} (x_i - \mu_X)(y_i - \mu_Y)}{\sqrt{\sum_{i=1}^{n} (x_i - \mu_X)^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \mu_Y)^2}}
\tag{12}
\]

where $\mu_X$ and $\mu_Y$ specify the means of the discrete random variables $X$ and $Y$ respectively.
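For quick reference, all three correlation criteria used in this paper can be computed with SciPy. The sketch below uses invented metric and human score vectors, not WMT data; note also that WMT's official segment-level Kendall's tau is computed from pairwise human preference judgments, so scipy's kendalltau only approximates that criterion.

from scipy.stats import spearmanr, pearsonr, kendalltau

# Invented system-level scores for five MT systems (illustrative only).
metric_scores = [0.31, 0.27, 0.24, 0.22, 0.18]
human_scores = [0.62, 0.58, 0.49, 0.51, 0.40]

rho_s, _ = spearmanr(metric_scores, human_scores)   # Eq. (11)
rho_p, _ = pearsonr(metric_scores, human_scores)    # Eq. (12)
tau, _ = kendalltau(metric_scores, human_scores)    # segment-level criterion

print(f"Spearman {rho_s:.2f}, Pearson {rho_p:.2f}, Kendall {tau:.2f}")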
5 Experiments

5.1 Training

In the training stage, we used the officially released data of the past WMT series. There was no Russian language in the past WMT shared tasks, so we trained our systems on the other eight language pairs, covering English-to-other (French, German, Spanish, Czech) and the inverse translation directions. To avoid overfitting, we used the WMT11 corpora as training data to tune the parameter weights towards a higher correlation with human judgments at system level. For the nLEPOR_baseline system, the tuned values of $\alpha$ and $\beta$ are 9 and 1 respectively for all language pairs, except ($\alpha:\beta = 1:9$) for the CS-EN language pair. For the LEPOR_v3.1 system, the tuned weight values are shown in Table 1.

The evaluation scores of the training results on the WMT11 corpora are shown in Table 2. The designed methods show promising correlation with human judgments at system level: mean scores of 0.87 and 0.77 over the eight language pairs for LEPOR_v3.1 and nLEPOR_baseline respectively. Compared with METEOR, BLEU and TER, we achieved higher correlation scores with human judgments.

5.2 Testing

In the WMT13 shared Metrics Task, we also submitted our system performances on the English-to-Russian and Russian-to-English language pairs. However, since Russian did not appear in the past WMT shared tasks, we assigned the default parameter weights for Russian in the two submitted systems. The officially released results on the WMT13 corpora are shown in Tables 3, 4 and 5: system-level Pearson, system-level Spearman, and segment-level Kendall's tau correlations on the English-to-other language pairs.

Directions          EN-FR  EN-DE  EN-ES  EN-CS  EN-RU  Av
LEPOR_v3.1           .91    .94    .91    .76    .77   .86
nLEPOR_baseline      .92    .92    .90    .82    .68   .85
SIMPBLEU_RECALL      .95    .93    .90    .82    .63   .84
SIMPBLEU_PREC        .94    .90    .89    .82    .65   .84
NIST-mteval-inter    .91    .83    .84    .79    .68   .81
Meteor               .91    .88    .88    .82    .55   .81
BLEU-mteval-inter    .89    .84    .88    .81    .61   .80
BLEU-moses           .90    .82    .88    .80    .62   .80
BLEU-mteval          .90    .82    .87    .80    .62   .80
CDER-moses           .91    .82    .88    .74    .63   .80
NIST-mteval          .91    .79    .83    .78    .68   .79
PER-moses            .88    .65    .88    .76    .62   .76
TER-moses            .91    .73    .78    .70    .61   .75
WER-moses            .92    .69    .77    .70    .61   .74
TerrorCat            .94    .96    .95    na     na    .95
SEMPOS               na     na     na     .72    na    .72
ACTa                 .81   -.47    na     na     na    .17
ACTa5+6              .81   -.47    na     na     na    .17

Table 3. System-level Pearson correlation scores on WMT13 English-to-other language pairs.

Directions          EN-FR  EN-DE  EN-ES  EN-CS  EN-RU  Av
SIMPBLEU_RECALL      .92    .93    .83    .87    .71   .85
LEPOR_v3.1           .90    .90    .84    .75    .85   .85
NIST-mteval-inter    .93    .85    .80    .90    .77   .85
CDER-moses           .92    .87    .86    .89    .70   .85
nLEPOR_baseline      .92    .90    .85    .82    .73   .84
NIST-mteval          .91    .83    .78    .92    .72   .83
SIMPBLEU_PREC        .91    .88    .78    .88    .70   .83
Meteor               .92    .88    .78    .94    .57   .82
BLEU-mteval-inter    .92    .83    .76    .90    .66   .81
BLEU-mteval          .89    .79    .76    .90    .63   .79
TER-moses            .91    .85    .75    .86    .54   .78
BLEU-moses           .90    .79    .76    .90    .57   .78
WER-moses            .91    .83    .71    .86    .55   .77
PER-moses            .87    .69    .77    .80    .59   .74
TerrorCat            .93    .95    .91    na     na    .93
SEMPOS               na     na     na     .70    na    .70
ACTa5+6              .81   -.53    na     na     na    .14
ACTa                 .81   -.53    na     na     na    .14

Table 4. System-level Spearman rank correlation scores on WMT13 English-to-other language pairs.

Table 3 shows that LEPOR_v3.1 and nLEPOR_baseline yield the highest and the second highest average Pearson correlation scores, 0.86 and 0.85 respectively, with human judgments at system level on the five English-to-other language pairs. LEPOR_v3.1 and nLEPOR_baseline also yield the highest Pearson correlation scores on the English-to-Russian (0.77) and English-to-Czech (0.82) language pairs respectively. The testing results of LEPOR_v3.1 and nLEPOR_baseline show better correlation scores than METEOR (0.81), BLEU (0.80) and TER-moses (0.75) on the English-to-other language pairs, which is consistent with the training results.

On the other hand, using the Spearman rank correlation coefficient, SIMPBLEU_RECALL yields the highest correlation score, 0.85, with human judgments. Our metric LEPOR_v3.1 also yields the highest Spearman correlation score on the English-to-Russian language pair (0.85), which is consistent with the Pearson result and shows its robust performance on this language pair.

Directions               EN-FR  EN-DE  EN-ES  EN-CS  EN-RU  Av
SIMPBLEU_RECALL           .16    .09    .23    .06    .12   .13
Meteor                    .15    .05    .18    .06    .11   .11
SIMPBLEU_PREC             .14    .07    .19    .06    .09   .11
sentBLEU-moses            .13    .05    .17    .05    .09   .10
LEPOR_v3.1                .13    .06    .18    .02    .11   .10
nLEPOR_baseline           .12    .05    .16    .05    .10   .10
dfki_logregNorm-411       na     na     .14    na     na    .14
TerrorCat                 .12    .07    .19    na     na    .13
dfki_logregNormSoft-431   na     na     .03    na     na    .03

Table 5. Segment-level Kendall's tau correlation scores on WMT13 English-to-other language pairs.

However, we find a problem in the Spearman rank correlation method. For instance, let two evaluation metrics MA and MB score three MT systems (M1, M2, M3): MA gives the scores 0.95, 0.50 and 0.45, while MB gives three similar scores between 0.77 and 0.74 in the same order. Before the calculation of correlation with human judgments, both score vectors are converted by the Spearman rank method into the same rank sequence (1, 2, 3); thus, the two evaluation metrics obtain the same correlation with human judgments, whatever the human judgment values are. But the two metrics in fact reflect different results: MA gives an outstanding score (0.95) to system M1 and very low scores (0.50 and 0.45) to the other two systems M2 and M3, while MB judges the three MT systems to have similar performances (scores from 0.74 to 0.77). This information is lost under the Spearman rank correlation methodology.
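This information loss is easy to reproduce. In the sketch below (the exact score values and the human score vector are invented, chosen only to preserve the ordering described above), Spearman returns an identical, perfect correlation for both metrics, while Pearson still distinguishes them:

from scipy.stats import spearmanr, pearsonr

ma = [0.95, 0.50, 0.45]     # metric MA: one outstanding system, two poor ones
mb = [0.77, 0.76, 0.74]     # metric MB: three similar systems (illustrative values)
human = [0.90, 0.60, 0.55]  # invented human scores with the same ordering

# After rank conversion both vectors become (1, 2, 3), so Spearman cannot
# tell the two metrics apart.
print(spearmanr(ma, human).correlation, spearmanr(mb, human).correlation)  # 1.0 1.0
# Pearson keeps the score magnitudes and separates them.
print(pearsonr(ma, human)[0], pearsonr(mb, human)[0])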
The segment-level performance of LEPOR_v3.1 is moderate, with an average Kendall's tau correlation score of 0.10 on the five English-to-other language pairs, which is due to the fact that we trained our metrics at system level in this shared metrics task.
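As context for the segment-level numbers here and in Table 8 below: WMT's segment-level Kendall's tau counts, over all human pairwise ranking judgments, how often the metric agrees (concordant) or disagrees (discordant) with the human preference. A minimal sketch of that counting, with our own data layout and a simplified treatment of ties, is:

def wmt_segment_tau(judgments):
    """judgments: iterable of (human_prefers_a, metric_score_a, metric_score_b),
    one entry per human pairwise comparison with a strict preference."""
    concordant = discordant = 0
    for human_prefers_a, score_a, score_b in judgments:
        if score_a == score_b:
            continue  # metric ties: handling varied across WMT years; skipped here
        if (score_a > score_b) == human_prefers_a:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Three invented pairwise judgments: tau = (2 - 1) / (2 + 1) = 0.33
print(wmt_segment_tau([(True, 0.8, 0.5), (False, 0.7, 0.6), (False, 0.4, 0.9)]))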
Lastly, the officially released results on the WMT13 other-to-English language pairs are shown in Tables 6, 7 and 8: system-level Pearson, system-level Spearman, and segment-level Kendall's tau correlations.

Directions          FR-EN  DE-EN  ES-EN  CS-EN  RU-EN  Av
Meteor               .98    .96    .97    .99    .84   .95
SEMPOS               .95    .95    .96    .99    .82   .93
Depref-align         .97    .97    .97    .98    .74   .93
Depref-exact         .97    .97    .96    .98    .73   .92
SIMPBLEU_RECALL      .97    .97    .96    .94    .78   .92
UMEANT               .96    .97    .99    .97    .66   .91
MEANT                .96    .96    .99    .96    .63   .90
CDER-moses           .96    .91    .95    .90    .66   .88
SIMPBLEU_PREC        .95    .92    .95    .91    .61   .87
LEPOR_v3.1           .96    .96    .90    .81    .71   .87
nLEPOR_baseline      .96    .94    .94    .80    .69   .87
BLEU-mteval-inter    .95    .92    .94    .90    .61   .86
NIST-mteval-inter    .94    .91    .93    .84    .66   .86
BLEU-moses           .94    .91    .94    .89    .60   .86
BLEU-mteval          .95    .90    .94    .88    .60   .85
NIST-mteval          .94    .90    .93    .84    .65   .85
TER-moses            .93    .87    .91    .77    .52   .80
WER-moses            .93    .84    .89    .76    .50   .78
PER-moses            .84    .88    .87    .74    .45   .76
TerrorCat            .98    .98    .97    na     na    .98

Table 6. System-level Pearson correlation scores on WMT13 other-to-English language pairs.

METEOR yields the highest average correlation scores on the other-to-English language pairs, 0.95 and 0.94 respectively under the Pearson and Spearman rank correlation methods. The average performance of nLEPOR_baseline is a little better than that of LEPOR_v3.1 on the five other-to-English language pairs, even though both are moderate compared with the other metrics. Under the Pearson correlation method, however, nLEPOR_baseline yields an average correlation score of 0.87, which already beats BLEU (0.86) and TER (0.80), as shown in Table 6.

Directions          FR-EN  DE-EN  ES-EN  CS-EN  RU-EN  Av
Meteor               .98    .96    .98    .96    .81   .94
Depref-align         .99    .97    .97    .96    .79   .94
UMEANT               .99    .95    .96    .97    .79   .93
MEANT                .97    .93    .94    .97    .78   .92
Depref-exact         .98    .96    .94    .94    .76   .92
SEMPOS               .94    .92    .93    .95    .83   .91
SIMPBLEU_RECALL      .98    .94    .92    .91    .81   .91
BLEU-mteval-inter    .99    .90    .90    .94    .72   .89
BLEU-mteval          .99    .89    .89    .94    .69   .88
BLEU-moses           .99    .90    .88    .94    .67   .88
CDER-moses           .99    .88    .89    .93    .69   .87
SIMPBLEU_PREC        .99    .85    .83    .92    .72   .86
nLEPOR_baseline      .95    .95    .83    .85    .72   .86
LEPOR_v3.1           .95    .93    .75    .80    .79   .84
NIST-mteval          .95    .88    .77    .89    .66   .83
NIST-mteval-inter    .95    .88    .76    .88    .68   .83
TER-moses            .95    .83    .83    .80    .60   .80
WER-moses            .95    .67    .80    .75    .61   .76
PER-moses            .85    .86    .36    .70    .67   .69
TerrorCat            .98    .96    .97    na     na    .97

Table 7. System-level Spearman rank correlation scores on WMT13 other-to-English language pairs.

Once again, our metrics perform moderately at segment level on the other-to-English language pairs, because they were trained at system level. We notice that some of the evaluation metrics did not submit results on all the language pairs; their performance on the submitted pairs is nonetheless sometimes very good, such as the dfki_logregFSS-33 metric with a segment-level correlation score of 0.27 on the German-to-English language pair.

Directions          FR-EN  DE-EN  ES-EN  CS-EN  RU-EN  Av
SIMPBLEU_RECALL      .19    .32    .28    .26    .23   .26
Meteor               .18    .29    .24    .27    .24   .24
Depref-align         .16    .27    .23    .23    .20   .22
Depref-exact         .17    .26    .23    .23    .19   .22
SIMPBLEU_PREC        .15    .24    .21    .21    .17   .20
nLEPOR_baseline      .15    .24    .20    .18    .17   .19
sentBLEU-moses       .15    .22    .20    .20    .17   .19
LEPOR_v3.1           .15    .22    .16    .19    .18   .18
UMEANT               .10    .17    .14    .16    .11   .14
MEANT                .10    .16    .14    .16    .11   .14
dfki_logregFSS-33    na     .27    na     na     na    .27
dfki_logregFSS-24    na     .27    na     na     na    .27
TerrorCat            .16    .30    .23    na     na    .23

Table 8. Segment-level Kendall's tau correlation scores on WMT13 other-to-English language pairs.

6 Conclusion

This paper describes our participation in the WMT13 Metrics Task. We submitted two systems, nLEPOR_baseline and LEPOR_v3.1, both trained and tested on the officially released data. LEPOR_v3.1 yields the highest average Pearson correlation score, 0.86, with human judgments on the five English-to-other language pairs, and nLEPOR_baseline yields better performance than LEPOR_v3.1 on the other-to-English language pairs. Furthermore, LEPOR_v3.1 shows robust system-level performance on the English-to-Russian language pair, and nLEPOR_baseline shows the best system-level performance on the English-to-Czech language pair under the Pearson correlation criterion. Compared with nLEPOR_baseline, the experiment results of LEPOR_v3.1 also show that the proper use of linguistic information can increase the performance of evaluation systems.

The source code developed for this paper can be freely downloaded for research purposes. The nLEPOR and hLEPOR measuring algorithms are available at https://github.com/aaronlifenghan/aaron-project-lepor and https://github.com/aaronlifenghan/aaron-project-hlepor respectively. The final publication is available at http://www.statmt.org/wmt13/pdf/WMT53.pdf.
Acknowledgments

The authors wish to thank the anonymous reviewers for many helpful comments. The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau for funding support for our research, under reference nos. 017/2009/A and RG060/09-10S/CS/FST.

References

Banerjee, Satanjeev and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), pages 65-72, Ann Arbor, US.

Callison-Burch, Chris, Philipp Koehn, Christof Monz, and Omar F. Zaidan. 2011. Findings of the 2011 Workshop on Statistical Machine Translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation (WMT '11), pages 22-64, Stroudsburg, PA, USA.

Callison-Burch, Chris, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2012. Findings of the 2012 Workshop on Statistical Machine Translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 10-51, Montreal, Canada.

Carroll, John B. 1966. An Experiment in Evaluating the Quality of Translations. Mechanical Translation and Computational Linguistics, 9(3-4):55-66.

Chen, Boxing, Roland Kuhn and Samuel Larkin. 2012. PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 930-939, Jeju, Republic of Korea.

Collins, Michael. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In Proceedings of EMNLP 2002, pages 1-8, Stroudsburg, PA, USA.

Costa-jussà, Marta R., Carlos A. Henríquez and Rafael E. Banchs. 2012. Evaluating Indirect Strategies for Chinese-Spanish Statistical Machine Translation. Journal of Artificial Intelligence Research, 45:761-780.

Doddington, George. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT '02), pages 138-145, San Francisco, CA, USA.

Eck, Matthias and Chiori Hori. 2005. Overview of the IWSLT 2005 Evaluation Campaign. In Proceedings of IWSLT 2005.

Federico, Marcello, Luisa Bentivogli, Michael Paul, and Sebastian Stüker. 2011. Overview of the IWSLT 2011 Evaluation Campaign. In Proceedings of IWSLT 2011, pages 11-27.

Han, Aaron Li-Feng, Derek F. Wong and Lidia S. Chao. 2012. LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors. In Proceedings of COLING 2012: Posters, Mumbai, India.

Koehn, Philipp and Christof Monz. 2006. Manual and Automatic Evaluation of Machine Translation between European Languages. In Proceedings of the ACL Workshop on Statistical Machine Translation, pages 102-121, New York City.

Li, A. 2005. Results of the 2005 NIST machine translation evaluation. In Machine Translation Workshop.

Menezes, Arul, Kristina Toutanova and Chris Quirk. 2006. Microsoft Research Treelet Translation System: NAACL 2006 Europarl Evaluation. In Proceedings of the ACL Workshop on Statistical Machine Translation, pages 158-161, New York City.

Och, Franz Josef. 2003. Minimum Error Rate Training for Statistical Machine Translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003), pages 160-167.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pages 311-318, Stroudsburg, PA, USA.

Paul, Michael. 2008. Overview of the IWSLT 2008 Evaluation Campaign. In Proceedings of IWSLT 2008, Hawaii, USA.

Paul, Michael. 2009. Overview of the IWSLT 2009 Evaluation Campaign. In Proceedings of IWSLT 2009, pages 1-18, Tokyo, Japan.

Paul, Michael, Marcello Federico and Sebastian Stüker. 2010. Overview of the IWSLT 2010 Evaluation Campaign. In Proceedings of the 7th International Workshop on Spoken Language Translation, pages 1-25, Paris.

Petrov, Slav, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL 2006, pages 433-440, Stroudsburg, PA, USA.

Popovic, Maja. 2012. Class error rates for evaluation of machine translation output. In Proceedings of the 7th Workshop on Statistical Machine Translation, pages 71-75, Canada.

Schmid, Helmut. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK.

Snover, Matthew, Bonnie J. Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas (AMTA-06), pages 223-231, USA.

Su, Hung-Yu and Chung-Hsien Wu. 2009. Improving Structural Statistical Machine Translation for Sign Language with Small Corpus Using Thematic Role Templates as Translation Memory. IEEE Transactions on Audio, Speech, and Language Processing, 17(7).

Su, Keh-Yih, Wu Ming-Wen and Chang Jing-Shin. 1992. A New Quantitative Quality Measure for Machine Translation Systems. In Proceedings of the 14th International Conference on Computational Linguistics, pages 433-439, Nantes, France.

Tillmann, Christoph, Stephan Vogel, Hermann Ney, Arkaitz Zubiaga, and Hassan Sawaf. 1997. Accelerated DP Based Search for Statistical Translation. In Proceedings of the 5th European Conference on Speech Communication and Technology (EUROSPEECH-97).

Weaver, Warren. 1955. Translation. In William Locke and A. Donald Booth, editors, Machine Translation of Languages: Fourteen Essays, pages 15-23. John Wiley & Sons, New York.

White, John S., Theresa O'Connell, and Francis O'Mara. 1994. The ARPA MT evaluation methodologies: Evolution, lessons, and future approaches. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA 1994), pages 193-205.

Wong, Billy and Chunyu Kit. 2008. Word choice and word position for automatic MT evaluation. In Workshop: MetricsMATR of the Association for Machine Translation in the Americas (AMTA), short paper, Waikiki, Hawai'i, USA.