Bilingual sentence-aligned parallel corpora, or bi-texts, are a useful resource for solving many computational linguistics problems, including part-of-speech tagging, syntactic parsing, named entity recognition, word sense disambiguation, and sentiment analysis; they are also a critical resource for real-world applications such as statistical machine translation (SMT) and cross-language information retrieval. Unfortunately, building large bi-texts is hard, and thus most of the 6,500+ world languages remain resource-poor in bi-texts. However, many resource-poor languages are related to some resource-rich language, with which they overlap in vocabulary and share cognates, which offers opportunities for reusing the resource-rich language's bi-texts.
We explore various options for bi-text reuse: (i) direct combination of bi-texts, (ii) combination of models trained on such bi-texts, and (iii) a sophisticated combination of (i) and (ii).
We further explore the idea of generating bi-texts for a resource-poor language by adapting a bi-text for a resource-rich language. We build a lattice of adaptation options for each word and phrase, and we then decode it using a language model for the resource-poor language. We compare word- and phrase-level adaptation, and we further make use of cross-language morphology. For the adaptation, we experiment with (a) a standard phrase-based SMT decoder, and (b) a specialized beam-search adaptation decoder.
Finally, we observe that for closely-related languages, many of the differences are at the subword level. Thus, we explore the idea of reducing translation to character-level transliteration. We further demonstrate the potential of combining word- and character-level models.
Dr. Preslav Nakov — Combining, Adapting and Reusing Bi-texts between Related Languages — Application to Statistical Machine Translation — part 2
1. Combining, Adapting and Reusing Bi-texts
between Related Languages:
Application to Statistical Machine Translation
Preslav Nakov, Qatar Computing Research Institute
(collaborators: Jorg Tiedemann, Pidong Wang, Hwee Tou Ng)
Yandex seminar
August 13, 2014, Moscow, Russia
2.
Plan
Part I
Introduction to Statistical Machine Translation
Part II
Combining, Adapting and Reusing Bi-texts between Related
Languages: Application to Statistical Machine Translation
Part III
Further Discussion on SMT
4.
Overview
Statistical Machine Translation (SMT) systems
Need large sentence-aligned bilingual corpora (bi-texts).
Problem
Such training bi-texts do not exist for most languages.
Idea
Adapt a bi-text for a related resource-rich language.
5.
Building an SMT System for a New Language Pair
In theory: requires only a few hours/days
In practice: large bi-texts are needed
Only available for
the official languages of the UN
Arabic, Chinese, English, French, Russian, Spanish
the official languages of the EU
some other languages
However, most of the 6,500+ world languages remain
resource-poor from an SMT viewpoint.
This number is even more striking if we consider language pairs.
Even resource-rich language pairs become resource-poor in new domains.
6.
Most Language Pairs Have Little Resources
Zipfian distribution of language resources
7.
Building a Bi-text for SMT
Small bi-texts
Relatively easy to build
Large bi-texts
Hard to get, e.g., because of copyright
Sources: parliament debates and legislation
national: Canada, Hong Kong
international
United Nations
European Union: Europarl, Acquis
Becoming an official language of the EU
is an easy recipe for getting rich in bi-texts quickly.
Not all languages are so “lucky”,
but many can still benefit.
8.
How Google/Bing (Yandex?) Translate
Resource-Poor Languages
How do we translate from Russian to Malay?
Use Triangulation
Cascaded translation (Utiyama & Isahara, 2007; Koehn & al., 2009)
Russian → English → Malay
Phrase Table Pivoting (Cohn & Lapata,2007; Wu & Wang, 2007)
рамочное соглашение ||| framework agreement ||| 0.7 …
perjanjian kerangka kerja ||| framework agreement ||| 0.8 …
THUS
рамочное соглашение ||| perjanjian kerangka kerja ||| 0.56 …
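The pivoting step above can be sketched in a few lines of Python; the toy phrase tables mirror the slide's example, and the sketch is illustrative rather than the actual Moses pivoting implementation:

```python
# Phrase-table pivoting over English:
#   p(my | ru) = sum over shared English phrases of p(my | en) * p(en | ru)

# Toy source->pivot and target->pivot tables, keyed by (phrase, english_phrase).
ru_en = {("рамочное соглашение", "framework agreement"): 0.7}
my_en = {("perjanjian kerangka kerja", "framework agreement"): 0.8}

def pivot(src_pivot_table, tgt_pivot_table):
    """Combine src->pivot and tgt->pivot tables into a src->tgt table."""
    table = {}
    for (src, en1), p1 in src_pivot_table.items():
        for (tgt, en2), p2 in tgt_pivot_table.items():
            if en1 == en2:  # shared English pivot phrase
                table[(src, tgt)] = table.get((src, tgt), 0.0) + p1 * p2
    return table

ru_my = pivot(ru_en, my_en)
# ("рамочное соглашение", "perjanjian kerangka kerja") -> 0.7 * 0.8 = 0.56
```

Real pivoting must also handle phrase pairs with multiple shared pivot phrases (the sum in the formula) and the other phrase-table scores, which are omitted here.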
9.
Idea: reuse bi-texts from related resource-rich
languages to build an improved SMT system for a
related resource-poor language.
NOTE 1: this is NOT triangulation
we focus on translation into English
e.g., Indonesian-English using Malay-English
rather than
Indonesian → English → Malay
Indonesian → Malay → English
NOTE 2: We exploit the fact that the source languages
are related
What if We Want to Translate into English?
10.
Resource-poor vs. Resource-rich
11.
Related EU – non-EU/unofficial languages
Swedish – Norwegian
Bulgarian – Macedonian
Irish – Scottish Gaelic
Standard German – Swiss German
Related EU languages
Spanish – Catalan
Czech – Slovak
Related languages outside Europe
Russian – Ukrainian
MSA – Dialectal Arabic (e.g., Egyptian, Gulf, Levantine, Iraqi)
Hindi – Urdu
Turkish – Azerbaijani
Malay – Indonesian
Resource-rich vs. Resource-poor Languages
We will explore
these pairs.
12.
Related languages have
overlapping vocabulary (cognates)
similar
word order
syntax
Motivation
13. Source Language Adaptation for Resource-Poor Machine Translation (Wang, Nakov, & Ng)
Improving
Indonesian-English SMT
Using Malay-English
14.
Malay vs. Indonesian
Malay
Semua manusia dilahirkan bebas dan samarata dari segi
kemuliaan dan hak-hak.
Mereka mempunyai pemikiran dan perasaan hati dan
hendaklah bertindak di antara satu sama lain dengan
semangat persaudaraan.
Indonesian
Semua orang dilahirkan merdeka dan mempunyai martabat
dan hak-hak yang sama.
Mereka dikaruniai akal dan hati nurani dan hendaknya
bergaul satu sama lain dalam semangat persaudaraan.
~50% exact word overlap
from Article 1 of the Universal Declaration of Human Rights
15.
Malay Can Look “More Indonesian”…
Malay
Semua manusia dilahirkan bebas dan samarata dari
segi kemuliaan dan hak-hak.
Mereka mempunyai pemikiran dan perasaan hati
dan hendaklah bertindak di antara satu sama lain
dengan semangat persaudaraan.
~75% exact word overlap
Post-edited Malay to look “Indonesian” (by an Indonesian speaker).
Indonesian
Semua manusia dilahirkan bebas dan mempunyai martabat
dan hak-hak yang sama.
Mereka mempunyai pemikiran dan perasaan dan hendaklah
bergaul satu sama lain dalam semangat persaudaraan.
from Article 1 of the Universal Declaration of Human Rights
We attempt to do this automatically:
adapt Malay to look Indonesian
Then, use it to improve SMT…
16.
Method at a Glance
[Diagram] Step 1 (Adaptation): adapt the Malay side of the Malay-English bi-text into “Indonesian”-English. Step 2 (Combination): combine the small Indonesian-English bi-text with the adapted “Indonesian”-English bi-text.
Note that we have no Malay-Indonesian bi-text!
17.
Step 1:
Adapting Malay-English
to “Indonesian”-English
18.
Word-Level Bi-text Adaptation:
Overview
Given a Malay-English sentence pair
1. Adapt the Malay sentence to “Indonesian”
• Word-level paraphrases
• Phrase-level paraphrases
• Cross-lingual morphology
2. Pair the adapted “Indonesian” with the English side of
the Malay-English sentence pair
Thus, we generate a new “Indonesian”-English sentence pair.
Source Language Adaptation for Resource-Poor Machine Translation. (EMNLP 2012)
Pidong Wang, Preslav Nakov, Hwee Tou Ng
19.
Word-Level Bi-text Adaptation:
Motivation
In many cases, word-level substitutions are enough
Adapt Malay to Indonesian (train)
KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010.
PDB Malaysia akan mencapai 8 persen pada tahun 2010.
Malaysia’s GDP is expected to reach 8 per cent in 2010.
20.
Malay: KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010.
Decode using a large Indonesian LM
Word-Level Bi-text Adaptation:
Overview
Probs: pivoting over English
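The decoding step above can be sketched as a Viterbi search over the substitution lattice: each Malay word gets “Indonesian” candidates with pivoted probabilities, and a bigram Indonesian LM scores the paths. All words, candidates, probabilities, and LM scores below are invented for illustration:

```python
import math

# Toy lattice: one slot per Malay word, each with "Indonesian" candidates
# and their pivoted substitution probabilities (all numbers invented).
lattice = [
    [("pdb", 0.7), ("kdnk", 0.3)],
    [("akan", 0.6), ("dijangka", 0.4)],
    [("mencapai", 0.8), ("cecah", 0.2)],
]

# Toy bigram LM p(w2 | w1); "<s>" marks the sentence start.
bigram = {
    ("<s>", "pdb"): 0.5, ("<s>", "kdnk"): 0.1,
    ("pdb", "akan"): 0.6, ("pdb", "dijangka"): 0.1,
    ("kdnk", "akan"): 0.2, ("kdnk", "dijangka"): 0.2,
    ("akan", "mencapai"): 0.7, ("akan", "cecah"): 0.05,
    ("dijangka", "mencapai"): 0.3, ("dijangka", "cecah"): 0.3,
}

def decode(lattice, bigram, floor=1e-6):
    # Viterbi: state = last word chosen; value = (log score, best path).
    best = {"<s>": (0.0, [])}
    for slot in lattice:
        nxt = {}
        for prev, (score, path) in best.items():
            for word, p_sub in slot:
                lm = bigram.get((prev, word), floor)
                s = score + math.log(p_sub) + math.log(lm)
                if word not in nxt or s > nxt[word][0]:
                    nxt[word] = (s, path + [word])
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]
```

With these toy numbers, `decode(lattice, bigram)` prefers the fluent path "pdb akan mencapai"; the real system additionally keeps n-best adaptations rather than only the single best path.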
21.
Malaysia’s GDP is expected to reach 8 per cent in 2010.
Pair each with the English counterpart
Thus, we generate a new “Indonesian”-English bi-text.
Word-Level Bi-text Adaptation:
Overview
22.
Indonesian translations for Malay: pivoting over English
[Diagram]
ML-EN bi-text:  Malay sentence      ML1 ML2 ML3 ML4 ML5
                English sentence    EN1 EN2 EN3 EN4
IN-EN bi-text:  English sentence    EN11 EN3 EN12
                Indonesian sentence IN1 IN2 IN3 IN4
(paraphrase weights come from pivoting over the shared English words)
Word-Level Adaptation:
Extracting Paraphrases
Note: we have no Malay-Indonesian bi-text, so we pivot.
23.
IN-EN bi-text is small, thus:
Unreliable IN-EN word alignments → bad ML-IN paraphrases
Solution:
improve IN-EN alignments using the ML-EN bi-text
concatenate: IN-EN*k + ML-EN
» k ≈ |ML-EN| / |IN-EN|
word alignment
get the alignments for one copy of IN-EN only
Word-Level Adaptation:
Issue 1
Works because of cognates between Malay and Indonesian.
24.
IN-EN bi-text is small, thus:
Small IN vocabulary for the ML-IN paraphrases
Solution:
Add cross-lingual morphological variants:
Given ML word: seperminuman
Find ML lemma: minum
Propose all known IN words sharing the same lemma:
» diminum, diminumkan, diminumnya, makan-minum,
makananminuman, meminum, meminumkan, meminumnya,
meminum-minuman, minum, minum-minum, minum-minuman,
minuman, minumanku, minumannya, peminum, peminumnya,
perminum, terminum
Word-Level Adaptation:
Issue 2
Note: The IN variants are from a larger monolingual IN text.
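The expansion step above can be sketched as follows; the lemmatizers are stub dictionaries here (labeled hypothetical), where real systems would use morphological analyzers for Malay and Indonesian:

```python
# Cross-lingual morphological expansion (sketch): lemmatize the Malay word,
# then propose every known Indonesian word that shares that lemma.

# Stub lemma dictionaries -- stand-ins for real morphological analyzers.
ml_lemma = {"seperminuman": "minum"}
in_lemma = {"minuman": "minum", "peminum": "minum", "makanan": "makan"}

def variants(ml_word):
    """Indonesian candidates for a Malay word via its shared lemma."""
    lemma = ml_lemma.get(ml_word)
    return sorted(w for w, l in in_lemma.items() if l == lemma)

variants("seperminuman")  # ["minuman", "peminum"]
```

In the actual system the Indonesian vocabulary (and hence `in_lemma`) comes from the large monolingual Indonesian text, so the candidate list can be long, as in the slide's `minum` example.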
25.
Word-level pivoting
Ignores context, and relies on LM
Cannot drop/insert/merge/split/reorder words
Solution:
Phrase-level pivoting
Build ML-EN and EN-IN phrase tables
Induce ML-IN phrase table (pivoting over EN)
Adapt the ML side of ML-EN to get “IN”-EN bi-text:
» using Indonesian LM and n-best “IN” as before
Also, use cross-lingual morphological variants
Word-Level Adaptation:
Issue 3
- Models context better: not only Indonesian LM, but also phrases.
- Allows many word operations, e.g., insertion, deletion.
26.
Step 2:
Combining
IN-EN + “IN”-EN
27.
Combining IN-EN and “IN”-EN bi-texts
Simple concatenation: IN-EN + “IN”-EN
Balanced concatenation: IN-EN * k + “IN”-EN
Sophisticated phrase table combination
Improved word alignments for IN-EN
Phrase table combination with extra features
Improved Statistical Machine Translation for Resource-Poor Languages Using Related Resource-Rich Languages (EMNLP 2009). Preslav Nakov, Hwee Tou Ng
Improving Statistical Machine Translation for a Resource-Poor Language Using Related Resource-Rich Languages (JAIR, 2012). Preslav Nakov, Hwee Tou Ng
28.
Concatenating bi-texts
Merging phrase tables
Combined method
Bi-text Combination Strategies
30.
Summary: Concatenate X1-Y and X2-Y
Advantages
improved word alignments
e.g., for rare words
more translation options
fewer unknown words
useful non-compositional phrases (improved fluency)
phrases with words from X2 that do not exist in X1: ignored
Disadvantages
X2-Y will dominate: it is larger
translation probabilities are distorted
phrases from X1-Y and X2-Y cannot be distinguished
Concatenating Bi-texts (1)
31.
Concat×k: Concatenate k copies of the original
and one copy of the additional training bi-text
Concat×k:align
1. Concatenate k copies of the original and one copy of the
additional bi-text.
2. Generate word alignments.
3. Truncate them, keeping only the alignments for one copy of the
original bi-text.
4. Build a phrase table.
5. Tune the system using MERT.
The value of k is optimized on the development dataset.
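The concat×k:align recipe above can be sketched with toy corpora; the sizes, sentence pairs, and the alignment strings are stand-ins (a real pipeline would call a word aligner such as GIZA++ here):

```python
# Sketch of concat x k : align with toy data. Repeat the small original
# bi-text k times before word alignment, then keep the alignments of a
# single copy, so only original sentence pairs feed phrase extraction.

orig = [("ini rumah", "this house")] * 3              # toy IN-EN bi-text
extra = [("ini rumah besar", "this big house")] * 12  # toy ML-EN bi-text

# Balance the corpora: k is roughly |extra| / |orig| (tuned on dev in practice).
k = max(1, round(len(extra) / len(orig)))

aligner_input = orig * k + extra          # corpus fed to the word aligner

# Stand-in for aligner output: one alignment string per sentence pair.
alignments = ["0-0 1-1"] * len(aligner_input)

# Truncate: keep the alignments for one copy of `orig` only.
kept = alignments[: len(orig)]
```

The point of the truncation is that the extra corpus improves the alignment model's statistics without letting its (possibly divergent) phrase pairs enter the final phrase table.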
Concatenating Bi-texts (2)
33.
Summary: Build two separate phrase tables, then
(a) use them together
(b) merge them
(c) interpolate them
Advantages
phrases from X1-Y and X2-Y can be distinguished
the larger bi-text X2-Y does not dominate X1-Y
more translation options
probabilities are combined in a more principled manner
Disadvantages
improved word alignments are not possible
Merging Phrase Tables (1)
34.
Two-tables: Build two separate phrase tables and use
them as alternative decoding paths (Birch et al., 2007).
Merging Phrase Tables (2)
35.
Interpolation: Build two separate phrase tables, T_orig and
T_extra, and combine them using linear interpolation:
Pr(e|s) = α·Pr_orig(e|s) + (1 − α)·Pr_extra(e|s).
The value of α is optimized on a development dataset.
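The interpolation formula is straightforward to sketch; the phrase pairs and probabilities below are toy values:

```python
# Linear interpolation of two phrase tables:
#   Pr(e|s) = alpha * Pr_orig(e|s) + (1 - alpha) * Pr_extra(e|s)

def interpolate(t_orig, t_extra, alpha):
    keys = set(t_orig) | set(t_extra)
    return {k: alpha * t_orig.get(k, 0.0) + (1 - alpha) * t_extra.get(k, 0.0)
            for k in keys}

t_orig = {("rumah", "house"): 0.9}
t_extra = {("rumah", "house"): 0.6, ("rumah", "home"): 0.4}
t = interpolate(t_orig, t_extra, alpha=0.7)
# ("rumah", "house"): 0.7*0.9 + 0.3*0.6 = 0.81
# ("rumah", "home"):  0.7*0.0 + 0.3*0.4 = 0.12
```

Note the design choice made explicit here: a phrase pair missing from one table contributes probability 0 from that table, which is one of several reasonable conventions; a full implementation would interpolate every score in the phrase table, not just one.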
Merging Phrase Tables (3)
36.
Merge:
1. Build separate phrase tables: T_orig and T_extra.
2. Keep all entries from T_orig.
3. Add those entries from T_extra that are not in T_orig.
4. Add extra features:
F1: 1 if the entry came from T_orig, 0 otherwise.
F2: 1 if the entry came from T_extra, 0 otherwise.
F3: 1 if the entry was in both tables, 0 otherwise.
The feature weights are set using MERT, and the number of features
is optimized on the development set.
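The four Merge steps above can be sketched directly; each table maps a phrase pair to a single toy score here, whereas a real phrase table carries several:

```python
# Merge strategy (sketch): keep all of T_orig, add T_extra entries not
# already present, and attach the indicator features F1-F3 to each entry.

def merge(t_orig, t_extra):
    merged = {}
    for key, score in t_orig.items():
        # Entry comes from T_orig (F1=1, F2=0); F3 flags presence in both.
        merged[key] = (score, {"F1": 1, "F2": 0, "F3": int(key in t_extra)})
    for key, score in t_extra.items():
        if key not in merged:
            # Entry exists only in T_extra (F1=0, F2=1, F3=0).
            merged[key] = (score, {"F1": 0, "F2": 1, "F3": 0})
    return merged

t_orig = {("a", "x"): 0.9}
t_extra = {("a", "x"): 0.5, ("b", "y"): 0.4}
m = merge(t_orig, t_extra)
```

For entries in both tables, the T_orig scores win, matching step 2 of the recipe; the indicator features then let MERT learn how much to trust each source table.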
Merging Phrase Tables (4)
38.
Use Merge to combine the phrase tables
for concat×k:align (as T_orig) and
for concat×1 (as T_extra).
Two parameters to tune
number of repetitions k
# of extra features to use with Merge:
(a) F1 only;
(b) F1 and F2,
(c) F1, F2 and F3
Improved word alignments.
Improved lexical coverage.
Distinguish phrases by source table.
Combined Method
40.
Data (sizes in tokens)
Translation data (for IN-EN)
IN2EN-train: 0.9M
IN2EN-dev: 37K
IN2EN-test: 37K
EN-monolingual: 5M
Adaptation data (for ML-EN → “IN”-EN)
ML2EN: 8.6M
IN-monolingual: 20M
41.
Isolated Experiments:
Training on “IN”-EN only
[Bar chart] BLEU: 14.50, 18.67, 19.50, 20.06, 20.63, 20.89, 21.24
System combination using MEMT (Heafield and Lavie, 2010). Wang, Nakov & Ng (EMNLP 2012)
42.
[Bar chart] BLEU over simple concatenation, balanced concatenation, and phrase table combination:
ML2EN (baseline): 18.49, 19.79, 20.10
System combination: 21.55, 21.64, 21.62
Combined Experiments:
Training on IN-EN + “IN”-EN
Wang, Nakov & Ng (EMNLP 2012)
43.
Experiments: Improvements
[Bar chart] BLEU: 14.50, 18.67, 20.10, 21.24, 21.64
Wang, Nakov & Ng (EMNLP 2012)
44.
Improve Macedonian-English SMT by adapting
Bulgarian-English bi-text
Adapt BG-EN (11.5M words) to “MK”-EN (1.2M words)
OPUS movie subtitles
Application to Other Languages & Domains
BLEU:
BG2EN (A): 27.33
Word paraphrases + morphology (B): 27.97
Phrase paraphrases + morphology (C): 28.38
System combination of A+B+C: 29.05
45.
Analysis
46.
Paraphrasing
Non-Indonesian Malay Words Only
So, we do need to paraphrase all words.
Wang, Nakov & Ng (EMNLP 2012)
47.
Human Judgments
Is the adapted sentence better Indonesian than the original Malay sentence? (100 random sentences)
Morphology yields worse top-3 adaptations, but better phrase tables, due to coverage.
Wang, Nakov & Ng (EMNLP 2012)
48.
Reverse Adaptation
Idea:
Adapt dev/test Indonesian input to “Malay”,
then translate with a Malay-English system
Input to SMT:
- “Malay” lattice
- 1-best “Malay” sentence from the lattice
Adapting dev/test is worse than adapting the training bi-text:
so we need both the n-best adaptations and the LM
Wang, Nakov & Ng (EMNLP 2012)
49.
A Specialized Decoder
(Instead of Moses)
50.
Beam-Search Text Rewriting Decoder:
The Algorithm
A Beam-Search Decoder for Normalization of Social Media Text
with Application to Machine Translation. (NAACL 2013). Pidong Wang, Hwee Tou Ng
51.
Beam-Search Text Rewriting Decoder:
An Example (Twitter Normalization)
Wang, Nakov & Ng (NAACL 2013)
52.
Hypothesis producers
Word-level mapping
Phrase-level mapping
Cross-lingual morphology mapping
Features
Indonesian LM
Word penalty (target)
Malay word penalty (source)
Phrase count
Wang, Nakov & Ng (NAACL 2013)
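A minimal sketch of such a beam-search rewriting decoder: a word-level hypothesis producer proposes rewrites, and a weighted sum of feature scores (mapping probability plus LM) ranks partial outputs. The mappings, weights, and the stand-in LM below are all invented for illustration and do not reproduce the actual NAACL 2013 decoder:

```python
import heapq
import math

# Word-level hypothesis producer: rewrite candidates with probabilities.
mappings = {"cecah": [("mencapai", 0.8), ("cecah", 0.2)]}

def lm_score(prev, word):
    # Stand-in for a real Indonesian LM (effectively unigram here,
    # but keeping the bigram interface a real LM feature would use).
    return math.log({"mencapai": 0.7}.get(word, 0.1))

def decode(words, beam_size=3, w_map=1.0, w_lm=1.0):
    beam = [(0.0, [])]                        # (log score, output so far)
    for w in words:
        cands = mappings.get(w, [(w, 1.0)])   # identity if no producer fires
        nxt = []
        for score, out in beam:
            for rep, p in cands:
                prev = out[-1] if out else "<s>"
                s = score + w_map * math.log(p) + w_lm * lm_score(prev, rep)
                nxt.append((s, out + [rep]))
        beam = heapq.nlargest(beam_size, nxt, key=lambda t: t[0])
    return beam[0][1]

decode(["kdnk", "cecah"])
```

The real decoder works at the sentence level with several producers (word, phrase, morphology) and more features (word penalties, phrase count), with the feature weights tuned like MERT weights.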
53.
Moses vs. the Specialized Decoder
Decoding level
phrase vs. sentence
Features
Moses vs. richer, e.g., Malay word penalty
word-level + phrase-level
(potentially, manual rules)
Cross-lingual variants
input lattice vs. feature function
Wang, Nakov & Ng (NAACL 2013)
54.
Moses vs. a Specialized Decoder:
Isolated “IN”-EN Experiments:
BLEU (Moses vs. specialized decoder):
WordPar: 19.50 vs. 20.39
WordPar+Morph: 20.06 vs. 20.46
PhrasePar: 20.63 vs. 20.85
PhrasePar+Morph: 20.89 vs. 21.07
System combination: 21.24 vs. 21.76
55.
Moses vs. a Specialized Decoder:
Combining IN-EN and “IN”-EN
BLEU over simple concatenation, balanced concatenation, and phrase table combination:
ML2EN (baseline): 18.49, 19.79, 20.10
Moses: 21.55, 21.64, 21.62
Specialized decoder: 21.74, 21.81, 22.03
56.
Experiments: Improvements
BLEU:
ML2EN (baseline): 14.50
IN2EN (baseline): 18.67
phrase table combination (Moses): 20.10
best isolated system (Moses): 21.24
best combined system (Moses): 21.64
best combination (DD): 22.03
57.
MK→EN, Adapting BG-EN to “MK”-EN
BLEU:
BG2EN: 27.33
WordPar+morph (Moses): 27.97
PhrasePar+morph (Moses): 28.38
combination (Moses): 29.05
combination (DD): 29.35
58.
Transliteration
59.
Spanish vs. Portuguese
Spanish–Portuguese
Spanish
Todos los seres humanos nacen libres e iguales en dignidad y derechos
y, dotados como están de razón y conciencia, deben comportarse
fraternalmente los unos con los otros.
Portuguese
Todos os seres humanos nascem livres e iguais em dignidade e em
direitos. Dotados de razão e de consciência, devem agir uns para com os
outros em espírito de fraternidade.
(from Article 1 of the Universal Declaration of Human Rights)
17% exact word overlap
60.
Spanish vs. Portuguese
Spanish–Portuguese
17% exact word overlap
67% approx. word overlap
The actual overlap is even higher.
61.
Cognates
Linguistics
Def: Words derived from a common root, e.g.,
Latin tu (‘2nd person singular’)
Old English thou
French tu
Spanish tú
German du
Greek sú
Orthography/phonetics/semantics: ignored.
Computational linguistics
Def: Words in different languages that are mutual translations and
have a similar orthography, e.g.,
evolution vs. evolución vs. evolução vs. evoluzione
Orthography & semantics: important.
Origin: ignored.
Cognates can differ a lot:
• night vs. nacht vs. nuit vs. notte vs. noite
• star vs. estrella vs. stella vs. étoile
• arbeit vs. rabota vs. robota (‘work’)
• father vs. père
• head vs. chef
62.
Spelling Differences Between Cognates
Systematic spelling differences
Spanish – Portuguese
different spelling
-nh- ↔ -ñ- (senhor vs. señor)
phonetic
-ción ↔ -ção (evolución vs. evolução)
-é ↔ -ei (1st sing. past) (visité vs. visitei)
-ó ↔ -ou (3rd sing. past) (visitó vs. visitou)
Occasional differences
Spanish – Portuguese
decir vs. dizer (‘to say’)
Mario vs. Mário
María vs. Maria
Many of these can be
learned automatically.
63.
Automatic Transliteration
Transliteration
1. Extract likely cognates for Portuguese-Spanish
2. Learn a character-level transliteration model
3. Transliterate the Portuguese side of pt-en, to look like Spanish
64.
Automatic Transliteration (2)
Extract pt-es cognates using English (en):
1. Induce pt-es word translation probabilities.
2. Filter out candidate pairs by translation probability (thresholds: constants proposed in the literature).
3. Filter out candidate pairs by orthographic similarity, based on the longest common subsequence.
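The orthographic-similarity filter can be sketched with the longest-common-subsequence ratio (LCSR); the 0.5 threshold below is illustrative, not the constant used in the original experiments:

```python
# Cognate filtering by orthographic similarity: keep candidate pairs
# whose longest-common-subsequence ratio (LCSR) is high enough.

def lcs_len(a, b):
    """Length of the longest common subsequence of strings a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            if ca == cb:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def lcsr(a, b):
    return lcs_len(a, b) / max(len(a), len(b))

def likely_cognates(pairs, threshold=0.5):  # threshold is illustrative
    return [(pt, es) for pt, es in pairs if lcsr(pt, es) >= threshold]

likely_cognates([("evolução", "evolución"), ("dizer", "decir"), ("casa", "perro")])
```

For example, lcsr("dizer", "decir") = 3/5 = 0.6, so the pair survives, while an unrelated pair like ("casa", "perro") scores 0 and is filtered out.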
65.
SMT-based Transliteration
Train & tune a monotone character-level SMT system
Use it to transliterate the Portuguese side of pt-en
66.
ESEN, Adapting PT-EN to “ES”-EN
5.34
22.87
24.23
13.79
26.24
0
5
10
15
20
25
30
PT-EN ES-EN phrase table combination
BLEU
original transliterated
10K ES-EN, 1.23M PT-EN
67.
Transliteration
vs.
Character-Level Translation
68.
Macedonian vs. Bulgarian
69.
MK→BG: Transliteration vs. Translation
BLEU:
MK (original): 10.74
MK (simple translit.): 12.07
MK (cognate translit.): 22.74
MK-BG (words): 31.10
MK-BG (words + cogn. translit.): 32.19
MK-BG (chars): 32.71
MK-BG (words + cogn. translit. + chars): 33.94
Combining Word-Level and Character-Level Models for Machine
Translation Between Closely-Related Languages (ACL 2012).
Preslav Nakov, Jorg Tiedemann.
70.
Character-Level SMT
• MK: Никогаш не сум преспала цела сезона.
• BG: Никога не съм спала цял сезон.
• MK: Н и к о г а ш _ н е _ с у м _ п р е с п а л а _ ц е л а _ с е з о н а _ .
• BG: Н и к о г а _ н е _ с ъ м _ с п а л а _ ц я л _ с е з о н _ .
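The character-level representation shown above is easy to produce: spaces become "_" and every character is treated as a token, so a standard phrase-based system can be trained on it unchanged:

```python
# Convert a sentence into the character-level representation used for
# character-level SMT: "_" marks word boundaries, characters are tokens.

def to_chars(sentence):
    return " ".join("_" if c == " " else c for c in sentence)

to_chars("Никога не съм")  # "Н и к о г а _ н е _ с ъ м"
```

Decoding then works over character "phrases" of up to 10 characters with a 10-gram character LM (per the settings on the next slide), and the output is detokenized by reversing this mapping.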
71.
Character-Level Phrase Pairs
Can cover:
word prefixes/suffixes
entire words
word sequences
combinations thereof
Max-phrase-length=10
LM-order=10
72.
MK->BG: The Impact of Data Size
Slavic Languages in Europe
MK -> XX
[Map: BG, MK, SR, CZ, SL]
MK -> SR, SL, CZ
MK->EN: Pivoting over BG
Macedonian: Никогаш не сум преспала цела сезона.
Bulgarian: Никога не съм спала цял сезон.
English: I’ve never slept for an entire season.
For related languages
• subword transformations
• character-level translation
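The pivoting cascade above can be sketched as follows; the two system arguments are placeholder callables standing in for trained SMT models, not real APIs:

```python
def pivot_translate(mk_sentence, mk_to_bg_chars, bg_to_en_words):
    """Pivot MK -> EN over BG: a character-level model handles the
    closely-related MK->BG step (mostly subword transformations),
    a word-level model handles BG->EN.
    Both arguments are stand-ins for trained translation systems."""
    # Character-level representation for MK->BG ('_' marks word boundaries).
    mk_chars = " _ ".join(" ".join(token) for token in mk_sentence.split())
    bg_chars = mk_to_bg_chars(mk_chars)
    # Back to a word-level sentence for the BG->EN step.
    bg_sentence = bg_chars.replace(" ", "").replace("_", " ")
    return bg_to_en_words(bg_sentence)
```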
MK->EN: Pivoting over BG
Jörg Tiedemann, Preslav Nakov. Analyzing the Use of Character-Level Translation with Sparse and Noisy Datasets (RANLP 2013).
MK->EN: Using Synthetic “MK”-EN Bi-Texts
Translate Bulgarian to Macedonian in a BG-XX corpus
All synthetic data combined (+mk-en): 36.69 BLEU
Tiedemann & Nakov (RANLP 2013)
Conclusion & Future Work
Adapt bi-texts for related resource-rich languages, using
confusion networks
word-level & phrase-level paraphrasing
cross-lingual morphological analysis
Character-level models
translation
transliteration
pivoting vs. synthetic data
Future work
other languages & NLP problems
robustness: noise and domain shift
Thank you!
Related Work
Related Work (1)
Machine translation between related languages
E.g.
Cantonese–Mandarin (Zhang, 1998)
Czech–Slovak (Hajic & al., 2000)
Turkish–Crimean Tatar (Altintas & Cicekli, 2002)
Irish–Scottish Gaelic (Scannell, 2006)
Bulgarian–Macedonian (Nakov & Tiedemann, 2012)
We do not translate (no training data), we “adapt”.
Related Work (2)
Adapting dialects to standard language (e.g., Arabic)
(Bakr & al., 2008; Sawaf, 2010; Salloum & Habash, 2011)
manual rules and/or language-specific tools
Normalizing Tweets and SMS
(Aw & al., 2006; Han & Baldwin, 2011)
informal text: spelling, abbreviations, slang
same language
Related Work (3)
Adapt Brazilian to European Portuguese (Marujo & al. 2011)
rule-based, language-dependent
tiny improvements for SMT
Reuse bi-texts between related languages (Nakov & Ng, 2009)
no language adaptation (just transliteration)
Cascaded/pivoted translation
(Utiyama & Isahara, 2007; Cohn & Lapata, 2007; Wu & Wang, 2009)
poor -> rich -> X: requires an additional poor-rich bi-text
rich -> X applied to poor: does not use the poor-rich similarity
[Diagram: our approach links poor, rich, and X]
Speaker notes
Statistical machine translation (or SMT) systems learn how to translate from large sentence-aligned bilingual corpora of human-generated translations.
We often call such kind of corpora bi-texts.
A well-known problem with the current SMT systems is that collecting sufficiently large training bi-texts is very hard, so most languages in the world are still resource-poor for SMT.
To solve this problem, we want to adapt a bi-text of a resource-rich language to improve machine translation for a related resource-poor language.
Let’s start with an introduction first.
We want to reuse bi-texts from related resource-rich languages to improve resource-poor SMT.
Why does this work?
Because many resource-poor languages are related to some resource-rich languages.
And related languages often share overlapping vocabulary and cognates.
They often have similar word order and syntax.
There are many resource-rich and resource-poor languages which are closely related.
[CLICK]
In our work, we focus on the pair, Malay and Indonesian.
We also show the applicability of our method to another language pair: Bulgarian and Macedonian.
Here is our main focus: improve Indonesian-English SMT using additional Malay-English bi-text.
Malay and Indonesian are closely related languages.
A native speaker of Indonesian can understand Malay texts, and vice versa.
Here are two example sentence pairs which show the similarity: about 50 percent of the words overlap.
So, we can train an SMT system on one language and apply it to the other directly: there are matching words and short phrases.
We asked a native Indonesian speaker to adapt the same Malay sentences into Indonesian while preserving as many Malay words as possible.
As a result, the overlap reached 75 percent.
[CLICK]
Our goal is to do this automatically: adapt Malay to look like Indonesian.
Then, we can use this adapted bi-text to improve Indonesian-English SMT.
Suppose we have a small Indonesian to English bi-text, which is resource-poor.
And we also have another large bi-text for Malay-English, which is resource-rich.
Our method has two steps.
[CLICK]
The first step is bi-text adaptation.
We adapt the Malay side of the Malay-English bi-text to look like Indonesian.
[CLICK]
The second step is bi-text combination.
We try to combine the adapted bi-text with the original small Indonesian-English bi-text in order to improve Indonesian-English SMT.
[CLICK]
Note that we have no Malay-Indonesian bi-text.
The first step is bi-text adaptation: adapting a Malay-English bi-text to Indonesian-English.
Given a Malay-English sentence pair
We first adapt the Malay sentence to look like “Indonesian” using word-level and phrase-level paraphrases, and cross-lingual morphology.
Then, we pair the adapted “Indonesian” sentence with the English sentence of the Malay-English sentence pair.
[CLICK]
Finally, we can generate a new Indonesian-English sentence pair.
For example,
given a Malay sentence,
[CLICK]
we generate a confusion network.
In the confusion network, each Malay word is augmented with multiple Indonesian word-level paraphrases.
[CLICK]
Then we decode this confusion network using a large Indonesian language model.
Thus, a ranked list of some adapted “Indonesian” sentences is obtained.
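Decoding the confusion network with the language model can be sketched as a simple beam search; the bigram dictionary below is a tiny stand-in for the large Indonesian LM, and the candidate words are illustrative toys:

```python
import math

def decode_confusion_network(network, bigram_lm, beam=5):
    """Beam-search decoding of a word-level confusion network.

    network:   list of positions; each position is a list of
               (candidate_word, translation_prob) pairs.
    bigram_lm: dict mapping (previous_word, word) -> probability,
               a stand-in for a real large language model.
    Returns the highest-scoring adapted sentence.
    """
    hypotheses = [(0.0, ["<s>"])]  # (log-score, words so far)
    for options in network:
        expanded = []
        for score, words in hypotheses:
            for word, p_trans in options:
                p_lm = bigram_lm.get((words[-1], word), 1e-6)  # crude smoothing
                expanded.append((score + math.log(p_trans) + math.log(p_lm),
                                 words + [word]))
        expanded.sort(key=lambda h: h[0], reverse=True)
        hypotheses = expanded[:beam]  # prune to the beam width
    return " ".join(hypotheses[0][1][1:])  # drop the <s> marker
```

The LM score is what lets the decoder pick the paraphrase that fits the context, not just the most probable translation option in isolation.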
After that, we pair each adapted “Indonesian” sentence with the English counter-part for the Malay sentence in the Malay-English bi-text.
[CLICK]
We thus end up with a synthetic “Indonesian”–English bi-text.
How do we find the Indonesian word-level paraphrases for a Malay word?
We use pivoting over English to induce potential Indonesian paraphrases for a given Malay word.
First, we generate separate word alignments for the Indonesian–English and the Malay–English bi-texts.
If a Malay word ML3 and an Indonesian word IN3 are both aligned to the same English word EN3,
[CLICK]
then, we consider the Indonesian word IN3 as a potential translation option for the Malay word ML3.
[CLICK]
each translation pair is associated with a conditional probability in the confusion network.
The probability is estimated by pivoting over English.
[CLICK]
Note that we have no Malay-Indonesian bi-text, so we pivot over English to get Malay-Indonesian translation pairs.
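The pivoted paraphrase probability marginalizes over the shared English words, p(in | ml) = sum over en of p(in | en) * p(en | ml); a minimal sketch with toy lexical distributions (the example words are illustrative, not from the talk's data):

```python
from collections import defaultdict

def pivot_paraphrases(p_en_given_ml, p_in_given_en):
    """Estimate p(indonesian_word | malay_word) by pivoting over English:
        p(in | ml) = sum_en p(in | en) * p(en | ml)
    Both inputs are dicts of dicts holding conditional probabilities
    read off the two word alignments."""
    p_in_given_ml = {}
    for ml_word, en_dist in p_en_given_ml.items():
        scores = defaultdict(float)
        for en_word, p_en in en_dist.items():
            for in_word, p_in in p_in_given_en.get(en_word, {}).items():
                scores[in_word] += p_in * p_en
        p_in_given_ml[ml_word] = dict(scores)
    return p_in_given_ml
```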
Since the Indonesian-English bi-text is small, its word alignments are unreliable.
As a result, we get bad Malay-Indonesian paraphrases from the word alignments.
[CLICK]
We try to improve the word alignments using the Malay-English bi-text. Since Malay and Indonesian share some vocabulary, we combine the Indonesian-English and Malay-English bi-texts to carry out word alignment. As a result, we obtain an improved Indonesian-English word alignment.
When we concatenate the Indonesian-English and the Malay-English bi-text, we concatenate multiple copies of the small Indonesian-English bi-text. The reason is that the Malay-English bi-text is much larger than the small Indonesian-English bi-text.
The second issue is that
Since the Indonesian-English bi-text is small, the Indonesian word-level paraphrases for a Malay word are restricted to the small Indonesian vocabulary of the small Indonesian–English bi-text.
[CLICK]
To enlarge the small Indonesian vocabulary, we use cross-lingual morphological variants.
Now let me explain how we add cross-lingual morphological variants to a confusion network.
If the input Malay sentence has the word seperminuman, we first find its lemma minum, and then determine all Indonesian words sharing the same lemma.
These Indonesian words are considered as the cross-lingual morphological variants for the Malay word.
[CLICK]
Note that here the Indonesian morphological variants are from a large monolingual Indonesian text, so there are new Indonesian words which are not in the small Indonesian-English bi-text.
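Collecting the cross-lingual morphological variants amounts to a lemma lookup; a sketch assuming a Malay lemmatizer and an Indonesian word-to-lemma lexicon built from the monolingual text (both are stand-ins for the real resources):

```python
def morphological_variants(malay_word, malay_lemmatize, indonesian_lexicon):
    """Return the Indonesian words sharing a lemma with the Malay word.

    malay_lemmatize:    callable mapping a Malay word to its lemma
                        (e.g. seperminuman -> minum).
    indonesian_lexicon: dict mapping each Indonesian word to its lemma,
                        built from large monolingual Indonesian text.
    """
    lemma = malay_lemmatize(malay_word)
    return sorted(word for word, l in indonesian_lexicon.items() if l == lemma)
```

Because the lexicon comes from monolingual text, the variants can include Indonesian words that never occur in the small Indonesian-English bi-text.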
Word-level pivoting ignores context.
It relies on the Indonesian language model to make the right contextual choice.
[CLICK]
We also try to model the context more directly by generating adaptation options at the phrase level using pivoted phrase tables.
We use standard phrase-based SMT techniques to build two separate phrase tables for the Indonesian–English and the Malay–English bi-texts.
Then we pivot the two phrase tables over English phrases.
The obtained pivoted phrase table is used to adapt Malay to Indonesian.
We also add cross-lingual morphological variants to enlarge the Indonesian vocabulary.
[CLICK]
As a result, we can model the context better by using both Indonesian language model and phrases.
Another advantage is that we can have more word operations here, since we use phrases.
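Pivoting the two phrase tables over shared English phrases follows the same marginalization as the word-level case; a sketch using a simple renormalized inversion of the Indonesian-English table (the phrases are toy examples, and real phrase tables carry several scores per pair, not just one):

```python
from collections import defaultdict

def pivot_phrase_tables(ml_to_en, in_to_en):
    """Build a Malay->Indonesian phrase table by pivoting over English:
        p(in_phrase | ml_phrase) = sum_en p(in_phrase | en) * p(en | ml_phrase)

    ml_to_en: dict {malay_phrase: {english_phrase: p(en | ml)}}
    in_to_en: dict {indonesian_phrase: {english_phrase: p(en | in)}}
    """
    # Invert the Indonesian table into p(in | en) with simple renormalization.
    en_to_in = defaultdict(dict)
    for in_phrase, en_dist in in_to_en.items():
        for en_phrase, p in en_dist.items():
            en_to_in[en_phrase][in_phrase] = p
    for dist in en_to_in.values():
        total = sum(dist.values())
        for in_phrase in dist:
            dist[in_phrase] /= total

    pivoted = {}
    for ml_phrase, en_dist in ml_to_en.items():
        scores = defaultdict(float)
        for en_phrase, p_en in en_dist.items():
            for in_phrase, p_in in en_to_in.get(en_phrase, {}).items():
                scores[in_phrase] += p_in * p_en
        pivoted[ml_phrase] = dict(scores)
    return pivoted
```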
Recall that the second step of our method is bi-text combination.
We combine the original small Indonesian–English bi-text with the adapted “Indonesian”–English bi-text in three ways:
[CLICK]
The first way is to simply concatenate the two bi-texts as the training bi-text.
In this way, we assume the two bi-texts have the same quality.
[CLICK]
The second way is called balanced concatenation.
Since the adapted bi-text is much larger than the original Indonesian-English bi-text, the adapted bi-text will dominate the concatenation.
In order to overcome this problem, we repeat the smaller Indonesian–English bi-text enough times so that the amounts of the two bi-texts are the same before concatenation.
[CLICK]
Finally, we experiment with a method for combining phrase tables proposed in previous work by Nakov and Ng, which improves the word alignments and then combines the phrase tables with extra features.
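The balanced concatenation described above can be sketched as follows; the rounding heuristic is my assumption, since the talk only says the smaller bi-text is repeated until the two amounts match:

```python
def balanced_concatenation(small_bitext, large_bitext):
    """Concatenate two bi-texts, repeating the smaller one so that it is
    roughly the same size as the larger one and the large adapted bi-text
    does not dominate. A bi-text is a list of (src, tgt) sentence pairs."""
    if not small_bitext:
        return list(large_bitext)
    repeats = max(1, round(len(large_bitext) / len(small_bitext)))
    return small_bitext * repeats + list(large_bitext)
```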
I will now present our experiments.
In our experiments, we use the following datasets.
For Indonesian–English: we have a small training bi-text, a development set, and also a test set.
We also use a large Malay–English bi-text, which is then adapted into Indonesian-English.
We have carried out two kinds of experiments:
The first kind is called isolated experiments.
In isolated experiments, we only use the adapted bi-text but not the original Indonesian-English bi-text.
These experiments provide a direct comparison to using the original bi-text.
The green bars show the two baseline systems.
Although the original Malay-English bi-text is about 10 times bigger than the original Indonesian-English bi-text, training on the Malay-English bi-text is much worse than training on the small Indonesian-English bi-text.
This shows the existence of important differences between Malay and Indonesian.
Using our method, we can see that word-level paraphrasing improves by 5 BLEU points over the original Malay-English baseline.
And it improves by close to one BLEU point over the original Indonesian-English baseline.
By adding cross-lingual morphological variants to word-level paraphrasing, we get about half a BLEU point of improvement. This confirms that the cross-lingual morphological variants are actually effective.
As we discussed before, phrase-level paraphrasing can model context better, so phrase-level paraphrasing gets larger improvement.
Finally, we use the system combination method, MEMT, to combine the best word-level paraphrasing system and the best phrase-level paraphrasing system, and it yields even further improvements.
This shows that the two kinds of paraphrasing methods are actually complementary.
The second kind of experiments is combined experiments.
In these experiments, we try to combine the adapted bi-text with the original Indonesian-English bi-text using the three bi-text combination methods.
Similar to the isolated experiments, we get improvements using both word-level and phrase-level paraphrasing methods. This is consistent with the isolated experiments.
One interesting finding is that, with our method, the results for the three bi-text combination methods do not differ as much as they do for the baselines.
To summarize, this graph shows the overall improvements that we obtain in our experiments.
The first three bars are the baselines using existing methods,
And the fourth one is our best isolated system, which improves about 1 BLEU point over the baselines.
The last one is the best combined system, and it gives us 1.5 BLEU point improvement over the baselines.
We have also applied our method to other languages.
We try to improve Macedonian-English SMT by adapting a Bulgarian-English bi-text.
We get similar results.
This confirms the applicability of our method to other language pairs.
While Indonesian is closely related to Malay, there are also some false friends.
They share some words, but the words may have very different meanings in the two languages.
That’s why we paraphrase all the words in our experiments.
We asked a native Indonesian speaker who does not speak Malay to judge whether our adapted “Indonesian” sentences are more understandable to him than the original Malay input.
It turned out that the two were similarly understandable to him.
The adapted sentences did work better than the original Malay sentences in our experiments.
We think there can be two reasons for this:
The first one is that SMT systems can tolerate noisy training data;
The second reason can be that the judgments were at the sentence level, while phrases are sub-sentential; there can be many good ones in a “bad” sentence.
We also tried to adapt Indonesian to Malay, and then use a Malay-English translation system to translate the adapted Malay sentences to English.
However, the results turned out to be worse than adapting Malay to Indonesian.
Some related work.
Next I will conclude our work.
In summary, to improve resource-poor machine translation, we adapt bi-texts for a related resource-rich language, using confusion networks, word-level and phrase-level paraphrasing, and morphological analysis.
We achieved very sizable improvements over the baselines.
In the future, we would like to add more word operations, for example, splitting, and merging words.
We also want to find some methods to better integrate our word-level and phrase-level paraphrasing methods.
Lastly, we want to apply our methods to other languages and NLP problems.
Some related work.
There is some related work on translating text between related languages, similar to our bi-text adaptation step.
Most of it uses rule-based translation systems, whereas our method is statistical and language-independent.