Dr. Dimitar Shterionov (KantanLabs) and Laura Casanellas (KantanMT Professional Services) presented results from a comparative ranking of neural and statistical MT systems. The systems were built with KantanMT and ranked on the KantanLQR quality-evaluation platform. As ranked by professional translators, neural MT showed clear quality improvements in fluency and adequacy over the equivalent statistical outputs.
4. Yet another MT paradigm?
31/07/2017 KantanFest, Dublin, Ireland 4
5. Yet another MT paradigm?
Which technique is faster?
Which technique is better?
How can I integrate NMT in my pipeline?
How can I compare PBSMT and NMT?
How can I improve my NMT engine?
When to use PBSMT and when NMT?
Is NMT better than PBSMT???
Can NMT be better than PBSMT?
8. Various empirical evaluations
(since 2015)
Scientific Rigour – NMT vs PBSMT
9. Experiment Setup
Identical Training, Test and Tune Data
NMT training limited to 4 days
Evaluation:
Automated Scores: F-Measure, TER, BLEU
Ranking with KantanLQR™, A/B Testing
Publications and Presentations
EAMT 2017
MT Summit 2017
LocWorld34 NMT GALA Track
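To make the automated scores concrete, here is a minimal, unsmoothed sketch of sentence-level BLEU: the geometric mean of modified n-gram precisions times a brevity penalty. This is an illustration only, not the implementation used in the evaluation; production scorers use smoothed, corpus-level BLEU.

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped matches
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:       # any zero precision -> BLEU = 0 (no smoothing)
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(log_avg)
```

Note that without smoothing, a candidate sharing no 4-gram with the reference scores 0, which is one reason fluent but freely reworded NMT output can receive very low sentence-level BLEU.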
10. A small parenthesis…
There are many factors:
Learning algorithm and learning rate
Number of epochs
ANN properties
Data: preprocessing, segmentation
You need the right data!
12. Training: Automated Scores

| Language Arc | SMT F-Measure | SMT BLEU | SMT TER | SMT Time | NMT F-Measure | NMT BLEU | NMT TER | NMT Perplexity | NMT Time |
|---|---|---|---|---|---|---|---|---|---|
| English→German | 62.00% | 54.08% | 54.31% | 18h | 62.53% | 47.53% | 53.41% | 3.02 | 92h |
| English→Chinese (Simplified) | 77.16% | 45.36% | 46.85% | 6h | 71.85% | 39.39% | 47.01% | 2.00 | 10h |
| English→Japanese | 80.04% | 63.27% | 43.77% | 9h | 69.51% | 40.55% | 49.46% | 1.89 | 68h |
| English→Italian | 69.74% | 56.98% | 42.54% | 8h | 64.88% | 42.00% | 48.73% | 2.70 | 83h |
| English→Spanish | 71.53% | 54.78% | 41.87% | 9h | 69.41% | 49.24% | 44.89% | 2.59 | 71h |
“In information theory, perplexity is a measurement of how well a
probability distribution or probability model predicts a sample. It may be
used to compare probability models. A low perplexity indicates the
probability distribution is good at predicting the sample.”
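The definition above can be sketched in a few lines: perplexity is 2 raised to the cross-entropy, i.e. the average negative log2-probability the model assigns to each observed token. A purely illustrative sketch, unrelated to how KantanMT computes the values in the table:

```python
import math

def perplexity(probs):
    """Perplexity of a model over a sample: 2 ** cross-entropy, where
    cross-entropy is the mean negative log2-probability assigned to
    each observed token. Lower is better."""
    h = -sum(math.log2(p) for p in probs) / len(probs)
    return 2 ** h

# A model assigning probability 0.5 to every observed token is as
# "surprised" as a fair coin flip: perplexity 2.
print(perplexity([0.5, 0.5, 0.5]))  # -> 2.0
```

On this scale, the NMT perplexities in the table (1.89–3.02) mean the models are, on average, choosing among roughly two to three equally likely continuations per token.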
15. Alternative translations
Example 1 (EN→DE)
Source: All dossiers must be individually analysed by the ministry responsible for the economy and scientific policy.
Reference: Jeder Antrag wird von den Dienststellen des zuständigen Ministers für Wirtschaft und Wissenschaftspolitik individuell geprüft.
PBSMT (BLEU: 58%): Alle Unterlagen müssen einzeln analysiert werden von den Dienststellen des zuständigen Ministers für Wirtschaft und Wissenschaftspolitik.
NMT (BLEU: 0%): Alle Unterlagen müssen von dem für die Volkswirtschaft und die wissenschaftliche Politik zuständigen Ministerium einzeln analysiert werden.

Example 2 (ES→EN)
Source: En este punto muestro mi desacuerdo con el informe.
Reference: On this point, I am not in agreement with the report before us.
PBSMT (BLEU: 72%): At this point, I am not in agreement with the report.
NMT (BLEU: 7%): In this point I disagree with the report.

Example 3 (ES→EN)
Source: Debemos apoyarles a todos para que alcancen este objetivo.
Reference: We must give them all our support to reach that goal.
PBSMT (BLEU: 100%): We must give them all our support to reach that goal.
NMT (BLEU: 0%): We have to support everyone to achieve this goal.
16. Ranking

Average Scores from A/B Testing (in percent):

|      | EN→ZH-CN | EN→JA | EN→DE | EN→IT | EN→ES | Average |
|------|----------|-------|-------|-------|-------|---------|
| Same | 37       | 21    | 13    | 24    | 10    | 21      |
| SMT  | 24       | 21    | 34    | 19    | 28    | 25.2    |
| NMT  | 39       | 58    | 53    | 56    | 62    | 53.6    |
19. BLEU underestimation of NMT
Take the NMT translations that reviewers judged better than their PBSMT counterparts.
How many of those does BLEU score lower than their PBSMT counterparts?
Do the same for the PBSMT translations.
|       | EN→ZH-CN | EN→JA | EN→DE | EN→IT | EN→ES | Average |
|-------|----------|-------|-------|-------|-------|---------|
| NMT   | 40%      | 59%   | 55%   | 34%   | 53%   | 48%     |
| PBSMT | 12%      | 0%    | 9%    | 9%    | 0%    | 6%      |
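The counting procedure behind this table can be sketched as follows. All names and the toy data are invented for illustration; the actual experiment used three reviewers' unanimous judgements and per-sentence BLEU from KantanLQR.

```python
def underestimation_rate(preferred, bleu_a, bleu_b):
    """Of the sentences where system A was preferred by the reviewers,
    return the fraction that BLEU nevertheless scores lower than
    system B. `preferred` holds booleans (True = A preferred);
    bleu_a and bleu_b hold per-sentence BLEU scores."""
    wins = [i for i, p in enumerate(preferred) if p]
    if not wins:
        return 0.0
    under = sum(1 for i in wins if bleu_a[i] < bleu_b[i])
    return under / len(wins)

# Toy example: NMT preferred on sentences 0, 1 and 3, but BLEU ranks
# it below PBSMT on sentences 0 and 3 -> underestimation rate 2/3.
nmt_preferred = [True, True, False, True]
nmt_bleu = [0.10, 0.50, 0.90, 0.20]
pbsmt_bleu = [0.30, 0.40, 0.10, 0.60]
print(underestimation_rate(nmt_preferred, nmt_bleu, pbsmt_bleu))
```

A high rate for NMT alongside a near-zero rate for PBSMT, as in the table above, is the signature of a metric that systematically undervalues one system's output.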
21. Take-away messages…
NMT is a new, efficient paradigm for MT
NMT does not solve the problem of language … but it is getting there
NMT can be much better than PBSMT
Evaluating NMT:
BLEU, TER and F-Measure may underestimate NMT when compared to PBSMT
KantanLQR™ (A/B Testing) facilitates MT ranking
To NMT or not to NMT?
According to the PBSMT paradigm, a sentence is translated phrase by phrase. The translation of each phrase is drawn from a phrase table (i.e., a representation of a translation model). These phrase-level translations are then combined into a sentence in a way that maximises the likelihood of a correct sentence in the target language (i.e., using a language model). Sometimes a third model is used to fix the casing.
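The phrase-table-plus-language-model pipeline described above can be sketched as a toy monotone decoder. Everything here (the phrase table entries, the lookup language model, the scoring) is invented for illustration; real PBSMT systems such as Moses learn these models from parallel data and also search over segmentations and reorderings.

```python
import math
from itertools import product

# Toy phrase table (translation model): source phrase -> candidate
# target phrases with probabilities. Entries are invented.
phrase_table = {
    "we must": [("we must", 0.7), ("we have to", 0.3)],
    "support them all": [("give them all our support", 0.4),
                         ("support everyone", 0.6)],
}

# Toy language model: probability of a full target sentence, with a
# small default for unseen sentences. Real systems use an n-gram LM.
lm = {
    "we must give them all our support": 0.5,
    "we have to support everyone": 0.3,
}

def decode(source_phrases):
    """Monotone phrase-by-phrase decoding: try every combination of
    phrase translations and keep the one maximising
    P(translation model) * P(language model)."""
    best, best_score = None, -math.inf
    for choice in product(*(phrase_table[p] for p in source_phrases)):
        target = " ".join(t for t, _ in choice)
        tm = math.prod(p for _, p in choice)   # translation model score
        score = tm * lm.get(target, 0.01)      # weighted by the LM
        if score > best_score:
            best, best_score = target, score
    return best

print(decode(["we must", "support them all"]))
# -> "we must give them all our support"
```

Note how the language model overrules the locally most probable phrase choice ("support everyone") in favour of the combination that forms a more likely target sentence; this interplay is exactly what the paragraph above describes.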
(give 30 seconds for people to check and ask which translation they prefer).
Next, we investigated our hypothesis that BLEU underestimates NMT quality. To do so, we needed to find discrepancies between human evaluation and BLEU scores. First, for each language pair, we took the translations from the evaluated set that all three reviewers marked as better for NMT. Next, from this set we counted the translations whose BLEU score was lower than that of their PBSMT counterparts. Third, we computed the ratio of the two counts.
We did the same for PBSMT: take the set of better translations, count the ones with a BLEU score lower than their NMT counterparts, and calculate the ratio between the two numbers.
It is clear from our results that BLEU is indeed not that reliable for NMT. Furthermore, these results indicate that BLEU underestimates NMT quality, confirming our hypothesis.
Now, can we actually trust BLEU? Several remarks need to be made. First, the numbers in our table are similar across language pairs, meaning the effect of the BLEU underestimation is roughly the same across the NMT engines; that is, we can still compare NMT engines against each other based on BLEU and get a sense of their quality differences. Second, we notice the same tendency in the F-Measure score, which is also an n-gram-based metric. That indicates the issue arises from the underlying principles of PBSMT and NMT (recall the 2D picture with the points linked to the John/Mary sentences). This can push future research in quality estimation in a particular direction.
And third, something not shown in a table or a graph: remember that our engines were trained under a time restriction. Assume we let training continue until the neural network reaches its full potential, that is, until it models the training data optimally. Given that the test data is very similar to the training data, the engine would then also model each test sentence very well, even at the phrase level. As such, the scores (BLEU, F-Measure and TER) would improve and get closer to, or even surpass, the PBSMT scores. This is supported by other research (e.g., Google's paper from November last year), which reports very good scores but also trains each model for almost two weeks.
A translation production line nowadays typically combines an MT component with human post-editing. While the MT component is simply a means to obtain a raw translation of the original text, which is then modified to meet certain translation quality standards, the choice of the right MT toolset impacts the efficiency of this pipeline.