We will discuss the history of machine translation (MT), rule-based MT, statistical MT, left-to-right hierarchical phrase-based MT, and finally how to use LR-Hiero in simultaneous translation.
(LR-Hiero)
Maryam Siahbani
2. Overview
• History of Machine Translation
• Rule-based MT
• Statistical MT
– Training
– Decoding
• Left-to-Right Hierarchical Phrase-based MT
• Using LR-Hiero in Simultaneous Translation
3. History of Machine Translation
• Late 1940’s: Early rule-based systems
– it was predicted that computers would replace human
translators within 5 years!
• 1966: ALPAC report cuts research funding
• Early 1970’s: First commercial system (Systran)
• Late 1980’s: IBM developed first statistical
models inspired by speech research
• Late 2000’s: Explosion in MT research
• 2006: First version of Google Translate
4. Rule-based Machine Translation
• Rules hand-written by linguists
• State of the art until early 2000’s
– e.g. Systran
• Expensive to create, maintain, and adapt
[Figure: transfer example — the French NP "chat noir" (Noun "chat" + Adjective "noir") maps to the English NP "black cat" (Adjective "black" + Noun "cat")]
5. Statistical Machine Translation
• Data-driven approaches to MT
• Learn translation from textual data
– Parallel Data
• Language independent
• Normally use probabilistic models
– The best translation = the most probable translation
e* = argmax_e P(e|f), where f is the source sentence and e ranges over candidate translations
• State of the art for most language pairs
– Best systems include rules (hybrid)
8. Statistical Machine Translation (SMT)
Aligned Words (En–Zh)
[Figure: the Chinese sentence 我们 十分 关注 非洲 地区 发生 的 事情 aligned word-by-word with the English sentence "we are very much concerned with what happens in the African region"]
Learn alignment from parallel text
9. Statistical Machine Translation (SMT)
Aligned Words (En–Zh) → Translation rules
[Figure: the same word-aligned sentence pair as the previous slide]
Learn alignment from parallel text
Id Source Target Weight
r1 关注 X_1 concerned with X_1 -5.3
r2 X_1 发生 X_2 事情 what happens X_2 X_1 -4.8
r3 非洲 地区 African region -3.1
Learn weighted translation rules from word aligned text
10. Translation Rules (phrase-pairs)
10
Source Target p(e|f)
den Vorschlag the proposal 0.6227
den Vorschlag ‘s proposal 0.1068
den Vorschlag a proposal 0.0341
den Vorschlag the idea 0.0250
den Vorschlag this proposal 0.0227
den Vorschlag proposal 0.0205
den Vorschlag of the proposal 0.0159
den Vorschlag the proposals 0.0159
* German-English phrase table trained on Europarl; real systems contain millions of translation rules. Weights are typically stored as log probabilities (e.g. -1.7986).
11. Statistical Machine Translation (SMT)
e* = argmax_e P(e|f) = argmax_e P(f|e) · P(e)
(translation model P(f|e) combined with language model P(e))
Aligned Words → Translation rules → Decoder
[Figure: the full SMT pipeline — the word-aligned sentence pair, the rule table (r1–r3) learned from word-aligned text, and a decoder]
Decoder generates many candidate translations,
scores them and returns the most likely one
Find the translation (e) for any given input (f)
12. Measuring Translation Quality:
BLEU score
• BLEU is a simple but effective scoring metric shown
to correlate with human judgments of translation
quality
• The idea is to measure overlap between the
translation generated by MT system and the
reference translation
• Measure one-word overlaps, two-word
overlaps, … (n-grams)
• Compute a precision score for each n-gram order
• Impose a brevity penalty on candidates that are
shorter than the reference
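The n-gram precision and brevity penalty described above can be sketched as follows — a minimal, smoothing-free illustration of the idea at the sentence level, not the exact BLEU implementation (real BLEU is normally computed over a whole corpus):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: clipped n-gram precisions + brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped matches: a candidate n-gram counts at most as often
        # as it occurs in the reference
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if matches == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        total = max(1, len(cand) - n + 1)
        log_prec += math.log(matches / total)
    # Brevity penalty for candidates shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_prec / max_n)
```

On the example from the next slide, the fluent candidate scores higher than the short, broken one, mirroring the low/high BLEU contrast shown there (the exact 41.1/89.0 values come from the real corpus-level metric, which this sketch does not reproduce).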
13. Measuring Translation Quality:
BLEU score
• Input:
– Ich war in meinen Zwanzigern bevor ich erstmals in
ein Kunstmuseum ging .
• Reference translation:
– I was in my twenties before I ever went to an art
museum .
• Low BLEU score (41.1):
– I was twenty I ever went to art .
• High BLEU score (89.0):
– I was in my twenties before I first went to an art
museum .
15. Hierarchical Phrase-based Translation (Hiero)
Synchronous Context-Free Grammar (SCFG)
Aligned Words (En–Zh) → Translation Rules:
X -> <我们 十分 X_1 / we are very much X_1>
X -> <关注 X_1 发生 的 X_2 / concerned with X_2 happens in X_1>
X -> <事情 / what>
X -> <非洲 地区 / African region>
[Figure: these rules compose to translate 我们 十分 关注 非洲 地区 发生 的 事情 into "we are very much concerned with what happens in African region"; the rules form the translation model used by the decoder]
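How such SCFG rules compose can be illustrated with a small string-substitution sketch (slots are renamed by hand to A, B, C so substitutions don't collide; this is only an illustration, not a real SCFG implementation):

```python
def substitute(pair, slot, rule_pair):
    """Replace `slot` on both the source and target sides
    with the corresponding sides of a rule."""
    src, tgt = pair
    r_src, r_tgt = rule_pair
    return (src.replace(slot, r_src, 1), tgt.replace(slot, r_tgt, 1))

# Start from the top rule and expand non-terminals one at a time,
# using the rules from the slide (A = X_1 of the first rule, etc.).
derivation = ("我们 十分 A", "we are very much A")
derivation = substitute(derivation, "A",
                        ("关注 B 发生 的 C", "concerned with C happens in B"))
derivation = substitute(derivation, "B", ("非洲 地区", "African region"))
derivation = substitute(derivation, "C", ("事情", "what"))

src, tgt = derivation
# src == "我们 十分 关注 非洲 地区 发生 的 事情"
# tgt == "we are very much concerned with what happens in African region"
```

Note how a single rule reorders its sub-phrases: C ("what") comes before B ("African region") on the target side even though their source counterparts occur in the opposite order.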
16. Hiero Decoder
• Bottom-up dynamic programming algorithm (CKY-style), O(n^3)
• Language model (LM) computation at each combination step
[Figure: applying X -> <关注 X_1 发生 的 X_2 / concerned with X_2 happens in X_1> with X_1 = "African region" and X_2 = "what"; LM queries are issued as partial translations such as "we are very much concerned with" are assembled]
18. Left-to-Right Target Generation
(Watanabe et al. 2006)
Rules in Greibach Normal Form (GNF) — the target side starts with terminal words, followed only by non-terminals — allow the target to be generated strictly left to right:
X -> <我们 十分 X_1 / we are very much X_1>
X -> <关注 X_1 / concerned with X_1>
X -> <X_1 发生 X_2 事情 / what happens X_2 X_1>
Non-GNF:
X -> <X_1 发生 的 X_2 / X_2 happens in X_1>
[Figure: the GNF derivation expands non-terminals so that the target string "we are very much concerned with what happens in African region" is produced left to right]
19. LR-Hiero Rule Extraction
• Search for sub-phrases within larger ones
– Smaller phrases are replaced by the non-terminal X
• Dynamic programming algorithm to extract rules for LR-Hiero
– Linear time complexity (in the number of rules)
[Figure: extracting <我们 十分 X_1 / we are very much X_1> from the word-aligned sentence pair by replacing the inner aligned phrase with X_1]
20. LR-Hiero Rule Extraction
• Search for sub-phrases within larger ones
– Smaller phrases are replaced by the non-terminal X
• A novel dynamic programming algorithm to extract rules for LR-Hiero
– Linear time complexity vs. exhaustive search
[Figure: extracting <我们 十分 X_1 / we are very much X_1> and <X_1 发生 X_2 事情 / what happens X_2 X_1> from the word-aligned sentence pair]
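The core step — carving an inner aligned phrase-pair out of a larger one and leaving a non-terminal behind — can be sketched as follows (a simplified string-based illustration; the actual DP extractor works on alignment spans, not strings):

```python
def extract_rule(phrase_pair, sub_pair, slot="X_1"):
    """Replace an inner aligned phrase-pair with a non-terminal
    on both the source and target sides, yielding a rule."""
    src, tgt = phrase_pair
    sub_src, sub_tgt = sub_pair
    assert sub_src in src and sub_tgt in tgt, "sub-phrase must occur on both sides"
    return (src.replace(sub_src, slot, 1), tgt.replace(sub_tgt, slot, 1))

# From the aligned sentence pair on the slide:
pair = ("我们 十分 关注 非洲 地区 发生 的 事情",
        "we are very much concerned with what happens in African region")
sub = ("关注 非洲 地区 发生 的 事情",
       "concerned with what happens in African region")
rule = extract_rule(pair, sub)
# rule == ("我们 十分 X_1", "we are very much X_1")
```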
21. LR-Hiero Rule Extraction
• Linear time complexity vs. exhaustive search
• Can easily extract rules with more non-terminals
[Chart: effect of the number of non-terminals (1–4) on extraction time (0–4000 sec.), comparing the Hiero heuristic against the DP extractor]
Expressive Hierarchical Rule Extraction for Left-to-Right Translation. M. Siahbani and A. Sarkar. AMTA (2014)
22. Left-to-Right Decoding
Source (word positions 0–8): 我们 十分 关注 非洲 地区 发生 的 事情
Rules:
X -> <我们 十分 X_1 / we are very much X_1>
X -> <关注 X_1 / concerned with X_1>
X -> <X_1 发生 X_2 事情 / what happens X_2 X_1>
X -> <的 / in>
X -> <非洲 地区 / African region>
Each partial hypothesis pairs a growing translation with the still-uncovered source spans:
<s> [0,8]
<s> we are very much [2,8]
<s> we are very much concerned with [3,8]
<s> we are very much concerned with what happens [6,7][3,5]
<s> we are very much concerned with what happens in [3,5]
23. Left-to-Right Decoding
Source (word positions 0–8): 我们 十分 关注 非洲 地区 发生 的 事情
<s> [0,8], -3.3
<s> we are very much [2,8], -4.5
<s> we are very much concerned with [3,8], -5.9
<s> we are very much concerned with what happens [6,7][3,5], -7.1
<s> we are very much concerned with what happens in [3,5], -7.7
<s> we are very much concerned with what happens in African region
Complexity: O(n^2), vs. O(n^3) for typical CKY decoding
Candidate translations are scored by summing the weighted features of the rules used in the derivation:
t* = argmax_t Σ_{r ∈ d(t)} w · f(r)
Rule weights:
<我们 十分 X_1 / we are very much X_1>, -4.7
<X_1 发生 X_2 事情 / what happens X_2 X_1>, -3.6
<非洲 地区 / African region>, -2.7
<关注 X_1 / concerned with X_1>, -3.8
<的 / in>, -1.2
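Ignoring the language model and the other features a real decoder adds (so these sums will not reproduce the hypothesis scores on the slide), a derivation's score reduces to the sum of its rule weights:

```python
# Rule weights from the slide (log-scale scores, higher is better).
rule_weights = {
    "<我们 十分 X_1 / we are very much X_1>": -4.7,
    "<X_1 发生 X_2 事情 / what happens X_2 X_1>": -3.6,
    "<非洲 地区 / African region>": -2.7,
    "<关注 X_1 / concerned with X_1>": -3.8,
    "<的 / in>": -1.2,
}

def derivation_score(rules):
    """Score of a derivation = sum of its rule weights (log scores add)."""
    return sum(rule_weights[r] for r in rules)

# Using all five rules to cover the whole sentence: ≈ -16.0
full_score = derivation_score(list(rule_weights))
```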
28. Incremental Translation
• Facilitate continuous translation with low
latency
– Latency: the time difference between the start of the
source speech and the start of the target speech
• Ensure acceptable translation accuracy
[Figure: Non-incremental translation waits for the whole source "Good evening, I would like a taxi to the airport please" and produces "Buenas noches. Quiero un taxi al aeropuerto por favor" after ~6 sec. Incremental translation emits "Buenas noches quiero", "como un taxi", and "al aeropuerto por favor" roughly 0.2–0.7 sec after each source chunk ("Good evening, I would", "like a taxi", "to the airport please").]
33. Publications
• Efficient Left-to-Right Hierarchical Phrase-Based Translation
with Improved Reordering. Siahbani, Maryam and
Sankaran, Baskaran and Sarkar, Anoop. EMNLP (2013)
• Two Improvements to Left-to-Right Decoding for
Hierarchical Phrase-based Machine Translation. Siahbani,
Maryam and Sarkar, Anoop. EMNLP (2014)
• Expressive Hierarchical Rule Extraction for Left-to-Right
Translation. Siahbani, Maryam and Sarkar, Anoop.
AMTA (2014)
• Incremental Translation using a Hierarchical Phrase-based
Translation System. Siahbani, Maryam and Mehdizadeh
Seraj, Ramtin and Sankaran, Baskaran and Sarkar, Anoop. SLT (2014)
35. Partial Hypothesis
Source (word positions 0–8): 我们 十分 关注 非洲 地区 发生 的 事情
<s> [0,8], -3.3
<s> we are very much [2,8], -4.5
<s> we are very much concerned with [3,8], -5.9
<s> we are very much concerned with what happens [6,7][3,5], -7.1
36. LR-Decoding with Beam Search
• LR-Decoding integrated with beam-search
(Watanabe et al. 2006)
• Stacks: hypotheses with same number of source side
words covered
• Exhaustively generates all possible partial
hypotheses for a given stack
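The stack organisation can be sketched as follows (a toy grouping with an assumed beam size; `add_to_stacks` and the tuple layout are illustrative, not the actual decoder's data structures):

```python
BEAM = 2  # keep at most BEAM hypotheses per stack (toy size)

def add_to_stacks(stacks, hyp):
    """Insert (words_covered, partial_translation, score) into the stack
    for its coverage count, keeping only the best BEAM hypotheses
    (higher score = better here)."""
    covered, _, _ = hyp
    stack = stacks.setdefault(covered, [])
    stack.append(hyp)
    stack.sort(key=lambda h: h[2], reverse=True)
    del stack[BEAM:]  # prune beyond the beam

stacks = {}
for hyp in [(2, "<s> we are very much", -4.5),
            (2, "<s> we very", -6.0),
            (2, "<s> us much", -7.2),
            (3, "<s> we are very much concerned with", -5.9)]:
    add_to_stacks(stacks, hyp)
# stacks[2] now holds only the two best hypotheses covering 2 source words
```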
37. Cube pruning
• Each cube: a group of hypotheses and applicable
rules
• Cubes are fed to a priority queue which fills the
current stack
38. Cube pruning
• Rows: hypotheses
• Columns: rules
• Rows and columns are sorted by score
• Assumption: the best hypothesis is in the top-left cell
– The next best are its neighbours
[Figure: a 3×3 grid crossing hypotheses ("students have not yet" 10.2, "pupils have not yet" 11.5, "student has not" 12.7) with rules ("made" 0.9, "done" 1.1, "do" 3.2); each cell holds a combined score (12.5, 12.4, 14.3, …) and cells are explored outward from the top-left corner]
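The top-left-first exploration can be sketched with a best-first pop over one cube (a generic sketch assuming purely additive costs and no LM — exactly the case where the monotonicity assumption holds):

```python
import heapq

def cube_pruning_pop(hyp_scores, rule_scores, k):
    """Pop the k best (hypothesis, rule) combinations from one cube.

    Assumes the combined score is simply hyp + rule (no LM), with lower
    cost = better, so the grid is monotone and the best cell really is
    the top-left one.
    """
    hyps = sorted(hyp_scores)    # rows, best (lowest cost) first
    rules = sorted(rule_scores)  # columns, best first
    heap = [(hyps[0] + rules[0], 0, 0)]
    seen, popped = {(0, 0)}, []
    while heap and len(popped) < k:
        score, i, j = heapq.heappop(heap)
        popped.append(score)
        # Push the lower and right neighbours of the popped cell.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(hyps) and nj < len(rules) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (hyps[ni] + rules[nj], ni, nj))
    return popped

# Hypothesis and rule costs loosely following the slide's grid:
best_four = cube_pruning_pop([10.2, 11.5, 12.7], [0.9, 1.1, 3.2], k=4)
```

Only a frontier of cells is ever scored, which is the whole point: most of the n×m grid is never touched.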
39. Time Efficiency: average number of LM queries
[Chart: LR-Hiero vs. the baseline of Watanabe et al. (2006)]
Efficient Left-to-Right Hierarchical Phrase-Based Translation with Improved Reordering. M. Siahbani, B. Sankaran and A. Sarkar. EMNLP (2013)
40. Reordering Features
• LR-Hiero as proposed by Watanabe et al. (2006) scores
about 2 BLEU points lower than Hiero
[Chart: BLEU comparison against Watanabe et al. (2006)]
41. Reordering Features
• Distortion feature (computed when each rule is applied)
• Number of reordering rules (rules whose non-terminals appear in different orders on the source and target sides):
r<> = 1: <X_1 发生 X_2 事情 / what happens X_2 X_1> (reordered)
r<> = 0: <X_1 发生 X_2 事情 / what happens X_1 X_2> (monotone)
Source (word positions 0–8): 我们 十分 关注 非洲 地区 发生 的 事情
Distortion for applying <X_1 发生 X_2 事情 / what happens X_2 X_1>:
d = (5-3) + (7-6) + (8-6) + (7-3) + (8-5)
42. Translation Quality
[Chart: BLEU comparison of LR-Hiero with improved reordering against Watanabe et al. (2006)]
Efficient Left-to-Right Hierarchical Phrase-Based Translation with Improved Reordering. M. Siahbani, B. Sankaran and A. Sarkar. EMNLP (2013)
43. Search Error in Cube Pruning
• Assumption: the best hypothesis is in the top-left cell
– The next best are its neighbours
• Adding the LM score violates this assumption
[Figure: two score grids; once LM scores are added, the best entries no longer sit near the top-left corner, so cube pruning can pop worse entries first]
44. Search Error in Cube Pruning
• Assumption: the best hypothesis is in the top-left cell
– The next best are its neighbours
• Adding the LM score violates this assumption
• Queue diversity: pop more candidates from each cube into the priority queue to reduce such search errors
[Figure: the same grids; with queue diversity, entries missed by standard cube pruning are recovered]
45. Queue Diversity
[Charts: Chinese-English results — BLEU score (23.5–26.5) and number of LM calls (0–40000) for LR-Hiero, LR-Hiero+CP, and LR-Hiero+CP (QD=10)]
Two Improvements to Left-to-Right Decoding for Hierarchical Phrase-based Machine Translation. M. Siahbani and A. Sarkar. EMNLP (2014)
46. Lexicalized Reordering Model
• The distortion penalty is weak
– it only penalizes deviation from monotonic translation
• Learn reordering preferences for each phrase (with respect to the previous phrase):
– Monotone
– Swap
– Discontinuous
[Figure: orientation types between source (F) and target (E) phrases, from "Statistical Machine Translation", Koehn 2010]
47. Lexicalized Reordering Model
• Collect orientation information during rule extraction
– Convert each rule to a phrase-pair (possibly discontinuous)
– M: if there is an adjacent phrase-pair to the top-left
– S: if there is an adjacent phrase-pair to the top-right
– D: otherwise
• Estimation by relative frequency:
P_o(orientation | e, f) = count(orientation, e, f) / Σ_o count(o, e, f)
[Figure: orientation extraction from the alignment grid between source (F) and target (E), from "Statistical Machine Translation", Koehn 2010]
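The relative-frequency estimate above is a plain count ratio; here is a sketch over toy orientation counts (the numbers are invented for illustration):

```python
from collections import Counter

# Orientation counts collected during rule extraction (toy numbers).
counts = Counter({
    ("monotone", "the proposal", "den Vorschlag"): 80,
    ("swap", "the proposal", "den Vorschlag"): 15,
    ("discontinuous", "the proposal", "den Vorschlag"): 5,
})

ORIENTATIONS = ("monotone", "swap", "discontinuous")

def p_orientation(orientation, e, f):
    """P(orientation | e, f) = count(orientation, e, f) / sum_o count(o, e, f)."""
    total = sum(counts[(o, e, f)] for o in ORIENTATIONS)
    return counts[(orientation, e, f)] / total

# p_orientation("monotone", "the proposal", "den Vorschlag") == 80/100 == 0.8
```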
Speaker notes
In statistical machine translation we are basically looking for a translation e which maximizes the probability of e given the source sentence f.
Statistical approaches to machine translation have achieved impressive performance by leveraging large amounts of parallel corpora. However, such data are available only for a few dozen language pairs in limited domains.
# Currently we only have parallel data for a few language pairs, such as French-English, Arabic-English, and so on. But more than 5000 languages are spoken in the world, and for most pairs no parallel data exists.
- Hiero uses a simple rule extraction algorithm based on word alignments. To avoid excessively large grammars, constraints are applied on the length of phrase-pairs and on rule configurations. Each phrase-pair is assumed to have unit count, which is distributed uniformly as fractional counts over all rules extracted from it.
Left-to-right decoding is a potential alternative. It is an Earley-style decoder which generates the target side in left-to-right order. Each partial hypothesis consists of a partial translation and a sequence of uncovered spans on the source side. It is a faster decoder compared to CKY.
In incremental translation we need to optimize two criteria: facilitate continuous translation with low latency (latency being the time difference between the start of the source speech and the start of the target speech), and ensure acceptable translation accuracy.