Bilingual sentence-aligned parallel corpora, or bi-texts, are a useful resource for solving many computational linguistics problems, including part-of-speech tagging, syntactic parsing, named entity recognition, word sense disambiguation, and sentiment analysis; they are also a critical resource for real-world applications such as statistical machine translation (SMT) and cross-language information retrieval. Unfortunately, building large bi-texts is hard, and thus most of the 6,500+ world languages remain resource-poor in bi-texts. However, many resource-poor languages are related to some resource-rich language, with which they overlap in vocabulary and share cognates, which offers opportunities for reusing the resource-rich language's bi-texts.
We explore various options for bi-text reuse: (i) direct combination of bi-texts, (ii) combination of models trained on such bi-texts, and (iii) a sophisticated combination of (i) and (ii).
We further explore the idea of generating bi-texts for a resource-poor language by adapting a bi-text for a resource-rich language. We build a lattice of adaptation options for each word and phrase, and we then decode it using a language model for the resource-poor language. We compare word- and phrase-level adaptation, and we further make use of cross-language morphology. For the adaptation, we experiment with (a) a standard phrase-based SMT decoder, and (b) a specialized beam-search adaptation decoder.
Finally, we observe that for closely-related languages, many of the differences are at the subword level. Thus, we explore the idea of reducing translation to character-level transliteration. We further demonstrate the potential of combining word- and character-level models.
Dr. Preslav Nakov — Combining, Adapting and Reusing Bi-texts between Related Languages — Application to Statistical Machine Translation — part 2
1. Combining, Adapting and Reusing Bi-texts
between Related Languages:
Application to Statistical Machine Translation
Preslav Nakov, Qatar Computing Research Institute
(collaborators: Jorg Tiedemann, Pidong Wang, Hwee Tou Ng)
Yandex seminar
August 13, 2014, Moscow, Russia
2.
Plan
Part I
Introduction to Statistical Machine Translation
Part II
Combining, Adapting and Reusing Bi-texts between Related
Languages: Application to Statistical Machine Translation
Part III
Further Discussion on SMT
4.
Overview
Statistical Machine Translation (SMT) systems
Need large sentence-aligned bilingual corpora (bi-texts).
Problem
Such training bi-texts do not exist for most languages.
Idea
Adapt a bi-text for a related resource-rich language.
5.
Building an SMT System for a New Language Pair
In theory: requires only a few hours/days
In practice: large bi-texts are needed
Only available for
the official languages of the UN
Arabic, Chinese, English, French, Russian, Spanish
the official languages of the EU
some other languages
However, most of the 6,500+ world languages remain
resource-poor from an SMT viewpoint.
This number is even more striking if we consider language pairs.
Even resource-rich language pairs become resource-poor in new domains.
6.
Most Language Pairs Have Little Resources
Zipfian distribution of language resources
7.
Building a Bi-text for SMT
Small bi-texts
Relatively easy to build
Large bi-texts
Hard to get, e.g., because of copyright
Sources: parliament debates and legislation
national: Canada, Hong Kong
international
United Nations
European Union: Europarl, Acquis
Becoming an official language of the EU
is an easy recipe for getting rich in bi-texts quickly.
Not all languages are so “lucky”,
but many can still benefit.
8.
How Google/Bing (Yandex?) Translate
Resource-Poor Languages
How do we translate from Russian to Malay?
Use Triangulation
Cascaded translation (Utiyama & Isahara, 2007; Koehn & al., 2009)
Russian → English → Malay
Phrase Table Pivoting (Cohn & Lapata,2007; Wu & Wang, 2007)
рамочное соглашение ||| framework agreement ||| 0.7 …
perjanjian kerangka kerja ||| framework agreement ||| 0.8 …
THUS
рамочное соглашение ||| perjanjian kerangka kerja ||| 0.56 …
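The pivoting step above can be sketched in a few lines of Python; the toy phrase tables mirror the slide's example, and the sketch is illustrative rather than the actual Moses pivoting implementation:

```python
# Phrase-table pivoting over English:
#   p(my | ru) = sum over shared English phrases of p(my | en) * p(en | ru)

# Toy source->pivot and target->pivot tables, keyed by (phrase, english_phrase).
ru_en = {("рамочное соглашение", "framework agreement"): 0.7}
my_en = {("perjanjian kerangka kerja", "framework agreement"): 0.8}

def pivot(src_pivot_table, tgt_pivot_table):
    """Combine src->pivot and tgt->pivot tables into a src->tgt table."""
    table = {}
    for (src, en1), p1 in src_pivot_table.items():
        for (tgt, en2), p2 in tgt_pivot_table.items():
            if en1 == en2:  # shared English pivot phrase
                table[(src, tgt)] = table.get((src, tgt), 0.0) + p1 * p2
    return table

ru_my = pivot(ru_en, my_en)
# ("рамочное соглашение", "perjanjian kerangka kerja") -> 0.7 * 0.8 = 0.56
```

Real pivoting must also handle phrase pairs with multiple shared pivot phrases (the sum in the formula) and the other phrase-table scores, which are omitted here.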
9.
Idea: reuse bi-texts from related resource-rich
languages to build an improved SMT system for a
related resource-poor language.
NOTE 1: this is NOT triangulation
we focus on translation into English
e.g., Indonesian-English using Malay-English
rather than
Indonesian → English → Malay
Indonesian → Malay → English
NOTE 2: We exploit the fact that the source languages
are related
What if We Want to Translate into English?
10.
Resource-poor vs. Resource-rich
11.
Related EU – non-EU/unofficial languages
Swedish – Norwegian
Bulgarian – Macedonian
Irish – Scottish Gaelic
Standard German – Swiss German
Related EU languages
Spanish – Catalan
Czech – Slovak
Related languages outside Europe
Russian – Ukrainian
MSA – Dialectal Arabic (e.g., Egyptian, Gulf, Levantine, Iraqi)
Hindi – Urdu
Turkish – Azerbaijani
Malay – Indonesian
Resource-rich vs. Resource-poor Languages
We will explore
these pairs.
12.
Related languages have
overlapping vocabulary (cognates)
similar
word order
syntax
Motivation
13. Source Language Adaptation for Resource-Poor Machine Translation (Wang, Nakov, & Ng)
Improving
Indonesian-English SMT
Using Malay-English
14.
Malay vs. Indonesian
Malay
Semua manusia dilahirkan bebas dan samarata dari segi
kemuliaan dan hak-hak.
Mereka mempunyai pemikiran dan perasaan hati dan
hendaklah bertindak di antara satu sama lain dengan
semangat persaudaraan.
Indonesian
Semua orang dilahirkan merdeka dan mempunyai martabat
dan hak-hak yang sama.
Mereka dikaruniai akal dan hati nurani dan hendaknya
bergaul satu sama lain dalam semangat persaudaraan.
~50% exact word overlap
from Article 1 of the Universal Declaration of Human Rights
15.
Malay Can Look “More Indonesian”…
Malay
Semua manusia dilahirkan bebas dan samarata dari
segi kemuliaan dan hak-hak.
Mereka mempunyai pemikiran dan perasaan hati
dan hendaklah bertindak di antara satu sama lain
dengan semangat persaudaraan.
~75% exact word overlap
Post-edited Malay to look “Indonesian” (by an Indonesian speaker).
Indonesian
Semua manusia dilahirkan bebas dan mempunyai martabat
dan hak-hak yang sama.
Mereka mempunyai pemikiran dan perasaan dan hendaklah
bergaul satu sama lain dalam semangat persaudaraan.
from Article 1 of the Universal Declaration of Human Rights
We attempt to do this automatically:
adapt Malay to look Indonesian
Then, use it to improve SMT…
16.
Method at a Glance
[Diagram] Step 1 (Adaptation): adapt the Malay side of the Malay-English bi-text into “Indonesian”-English. Step 2 (Combination): combine the small Indonesian-English bi-text with the adapted “Indonesian”-English bi-text.
Note that we have no Malay-Indonesian bi-text!
17.
Step 1:
Adapting Malay-English
to “Indonesian”-English
18.
Word-Level Bi-text Adaptation:
Overview
Given a Malay-English sentence pair
1. Adapt the Malay sentence to “Indonesian”
• Word-level paraphrases
• Phrase-level paraphrases
• Cross-lingual morphology
2. Pair the adapted “Indonesian” with the English side of
the Malay-English sentence pair
Thus, we generate a new “Indonesian”-English sentence pair.
Source Language Adaptation for Resource-Poor Machine Translation. (EMNLP 2012)
Pidong Wang, Preslav Nakov, Hwee Tou Ng
19.
Word-Level Bi-text Adaptation:
Motivation
In many cases, word-level substitutions are enough
Adapt Malay to Indonesian (train)
KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010.
PDB Malaysia akan mencapai 8 persen pada tahun 2010.
Malaysia’s GDP is expected to reach 8 per cent in 2010.
20.
Malay: KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010.
Decode using a large Indonesian LM
Word-Level Bi-text Adaptation:
Overview
Probs: pivoting over English
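The decoding step above can be sketched as a Viterbi search over the substitution lattice: each Malay word gets “Indonesian” candidates with pivoted probabilities, and a bigram Indonesian LM scores the paths. All words, candidates, probabilities, and LM scores below are invented for illustration:

```python
import math

# Toy lattice: one slot per Malay word, each with "Indonesian" candidates
# and their pivoted substitution probabilities (all numbers invented).
lattice = [
    [("pdb", 0.7), ("kdnk", 0.3)],
    [("akan", 0.6), ("dijangka", 0.4)],
    [("mencapai", 0.8), ("cecah", 0.2)],
]

# Toy bigram LM p(w2 | w1); "<s>" marks the sentence start.
bigram = {
    ("<s>", "pdb"): 0.5, ("<s>", "kdnk"): 0.1,
    ("pdb", "akan"): 0.6, ("pdb", "dijangka"): 0.1,
    ("kdnk", "akan"): 0.2, ("kdnk", "dijangka"): 0.2,
    ("akan", "mencapai"): 0.7, ("akan", "cecah"): 0.05,
    ("dijangka", "mencapai"): 0.3, ("dijangka", "cecah"): 0.3,
}

def decode(lattice, bigram, floor=1e-6):
    # Viterbi: state = last word chosen; value = (log score, best path).
    best = {"<s>": (0.0, [])}
    for slot in lattice:
        nxt = {}
        for prev, (score, path) in best.items():
            for word, p_sub in slot:
                lm = bigram.get((prev, word), floor)
                s = score + math.log(p_sub) + math.log(lm)
                if word not in nxt or s > nxt[word][0]:
                    nxt[word] = (s, path + [word])
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]
```

With these toy numbers, `decode(lattice, bigram)` prefers the fluent path "pdb akan mencapai"; the real system additionally keeps n-best adaptations rather than only the single best path.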
21.
Malaysia’s GDP is expected to reach 8 per cent in 2010.
Pair each with the English counterpart
Thus, we generate a new “Indonesian”-English bi-text.
Word-Level Bi-text Adaptation:
Overview
22.
Indonesian translations for Malay: pivoting over English
[Diagram]
ML-EN bi-text:  Malay sentence      ML1 ML2 ML3 ML4 ML5
                English sentence    EN1 EN2 EN3 EN4
IN-EN bi-text:  English sentence    EN11 EN3 EN12
                Indonesian sentence IN1 IN2 IN3 IN4
(paraphrase weights come from pivoting over the shared English words)
Word-Level Adaptation:
Extracting Paraphrases
Note: we have no Malay-Indonesian bi-text, so we pivot.
23.
IN-EN bi-text is small, thus:
Unreliable IN-EN word alignments → bad ML-IN paraphrases
Solution:
improve IN-EN alignments using the ML-EN bi-text
concatenate: IN-EN*k + ML-EN
» k ≈ |ML-EN| / |IN-EN|
word alignment
get the alignments for one copy of IN-EN only
Word-Level Adaptation:
Issue 1
Works because of cognates between Malay and Indonesian.
24.
IN-EN bi-text is small, thus:
Small IN vocabulary for the ML-IN paraphrases
Solution:
Add cross-lingual morphological variants:
Given ML word: seperminuman
Find ML lemma: minum
Propose all known IN words sharing the same lemma:
» diminum, diminumkan, diminumnya, makan-minum,
makananminuman, meminum, meminumkan, meminumnya,
meminum-minuman, minum, minum-minum, minum-minuman,
minuman, minumanku, minumannya, peminum, peminumnya,
perminum, terminum
Word-Level Adaptation:
Issue 2
Note: The IN variants are from a larger monolingual IN text.
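The expansion step above can be sketched as follows; the lemmatizers are stub dictionaries here (labeled hypothetical), where real systems would use morphological analyzers for Malay and Indonesian:

```python
# Cross-lingual morphological expansion (sketch): lemmatize the Malay word,
# then propose every known Indonesian word that shares that lemma.

# Stub lemma dictionaries -- stand-ins for real morphological analyzers.
ml_lemma = {"seperminuman": "minum"}
in_lemma = {"minuman": "minum", "peminum": "minum", "makanan": "makan"}

def variants(ml_word):
    """Indonesian candidates for a Malay word via its shared lemma."""
    lemma = ml_lemma.get(ml_word)
    return sorted(w for w, l in in_lemma.items() if l == lemma)

variants("seperminuman")  # ["minuman", "peminum"]
```

In the actual system the Indonesian vocabulary (and hence `in_lemma`) comes from the large monolingual Indonesian text, so the candidate list can be long, as in the slide's `minum` example.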
25.
Word-level pivoting
Ignores context, and relies on LM
Cannot drop/insert/merge/split/reorder words
Solution:
Phrase-level pivoting
Build ML-EN and EN-IN phrase tables
Induce ML-IN phrase table (pivoting over EN)
Adapt the ML side of ML-EN to get “IN”-EN bi-text:
» using Indonesian LM and n-best “IN” as before
Also, use cross-lingual morphological variants
Word-Level Adaptation:
Issue 3
- Models context better: not only Indonesian LM, but also phrases.
- Allows many word operations, e.g., insertion, deletion.
26.
Step 2:
Combining
IN-EN + “IN”-EN
27.
Combining IN-EN and “IN”-EN bi-texts
Simple concatenation: IN-EN + “IN”-EN
Balanced concatenation: IN-EN * k + “IN”-EN
Sophisticated phrase table combination
Improved word alignments for IN-EN
Phrase table combination with extra features
Improved Statistical Machine Translation for Resource-Poor Languages Using Related Resource-Rich Languages (EMNLP 2009). Preslav Nakov, Hwee Tou Ng
Improving Statistical Machine Translation for a Resource-Poor Language Using Related Resource-Rich Languages (JAIR, 2012). Preslav Nakov, Hwee Tou Ng
28.
Concatenating bi-texts
Merging phrase tables
Combined method
Bi-text Combination Strategies
30.
Summary: Concatenate X1-Y and X2-Y
Advantages
improved word alignments
e.g., for rare words
more translation options
fewer unknown words
useful non-compositional phrases (improved fluency)
phrases with words from X2 that do not exist in X1: ignored
Disadvantages
X2-Y will dominate: it is larger
translation probabilities are distorted
phrases from X1-Y and X2-Y cannot be distinguished
Concatenating Bi-texts (1)
31.
Concat×k: Concatenate k copies of the original
and one copy of the additional training bi-text
Concat×k:align
1. Concatenate k copies of the original and one copy of the
additional bi-text.
2. Generate word alignments.
3. Truncate them, keeping only the alignments for one copy of the
original bi-text.
4. Build a phrase table.
5. Tune the system using MERT.
The value of k is optimized on the development dataset.
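The concat×k:align recipe above can be sketched with toy corpora; the sizes, sentence pairs, and the alignment strings are stand-ins (a real pipeline would call a word aligner such as GIZA++ here):

```python
# Sketch of concat x k : align with toy data. Repeat the small original
# bi-text k times before word alignment, then keep the alignments of a
# single copy, so only original sentence pairs feed phrase extraction.

orig = [("ini rumah", "this house")] * 3              # toy IN-EN bi-text
extra = [("ini rumah besar", "this big house")] * 12  # toy ML-EN bi-text

# Balance the corpora: k is roughly |extra| / |orig| (tuned on dev in practice).
k = max(1, round(len(extra) / len(orig)))

aligner_input = orig * k + extra          # corpus fed to the word aligner

# Stand-in for aligner output: one alignment string per sentence pair.
alignments = ["0-0 1-1"] * len(aligner_input)

# Truncate: keep the alignments for one copy of `orig` only.
kept = alignments[: len(orig)]
```

The point of the truncation is that the extra corpus improves the alignment model's statistics without letting its (possibly divergent) phrase pairs enter the final phrase table.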
Concatenating Bi-texts (2)
33.
Summary: Build two separate phrase tables, then
(a) use them together
(b) merge them
(c) interpolate them
Advantages
phrases from X1-Y and X2-Y can be distinguished
the larger bi-text X2-Y does not dominate X1-Y
more translation options
probabilities are combined in a more principled manner
Disadvantages
improved word alignments are not possible
Merging Phrase Tables (1)
34.
Two-tables: Build two separate phrase tables and use
them as alternative decoding paths (Birch et al., 2007).
Merging Phrase Tables (2)
35.
Interpolation: Build two separate phrase tables, T_orig and
T_extra, and combine them using linear interpolation:
Pr(e|s) = α·Pr_orig(e|s) + (1 − α)·Pr_extra(e|s).
The value of α is optimized on a development dataset.
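The interpolation formula is straightforward to sketch; the phrase pairs and probabilities below are toy values:

```python
# Linear interpolation of two phrase tables:
#   Pr(e|s) = alpha * Pr_orig(e|s) + (1 - alpha) * Pr_extra(e|s)

def interpolate(t_orig, t_extra, alpha):
    keys = set(t_orig) | set(t_extra)
    return {k: alpha * t_orig.get(k, 0.0) + (1 - alpha) * t_extra.get(k, 0.0)
            for k in keys}

t_orig = {("rumah", "house"): 0.9}
t_extra = {("rumah", "house"): 0.6, ("rumah", "home"): 0.4}
t = interpolate(t_orig, t_extra, alpha=0.7)
# ("rumah", "house"): 0.7*0.9 + 0.3*0.6 = 0.81
# ("rumah", "home"):  0.7*0.0 + 0.3*0.4 = 0.12
```

Note the design choice made explicit here: a phrase pair missing from one table contributes probability 0 from that table, which is one of several reasonable conventions; a full implementation would interpolate every score in the phrase table, not just one.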
Merging Phrase Tables (3)
36.
Merge:
1. Build separate phrase tables: T_orig and T_extra.
2. Keep all entries from T_orig.
3. Add those entries from T_extra that are not in T_orig.
4. Add extra features:
F1: 1 if the entry came from T_orig, 0 otherwise.
F2: 1 if the entry came from T_extra, 0 otherwise.
F3: 1 if the entry was in both tables, 0 otherwise.
The feature weights are set using MERT, and the number of features
is optimized on the development set.
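The four Merge steps above can be sketched directly; each table maps a phrase pair to a single toy score here, whereas a real phrase table carries several:

```python
# Merge strategy (sketch): keep all of T_orig, add T_extra entries not
# already present, and attach the indicator features F1-F3 to each entry.

def merge(t_orig, t_extra):
    merged = {}
    for key, score in t_orig.items():
        # Entry comes from T_orig (F1=1, F2=0); F3 flags presence in both.
        merged[key] = (score, {"F1": 1, "F2": 0, "F3": int(key in t_extra)})
    for key, score in t_extra.items():
        if key not in merged:
            # Entry exists only in T_extra (F1=0, F2=1, F3=0).
            merged[key] = (score, {"F1": 0, "F2": 1, "F3": 0})
    return merged

t_orig = {("a", "x"): 0.9}
t_extra = {("a", "x"): 0.5, ("b", "y"): 0.4}
m = merge(t_orig, t_extra)
```

For entries in both tables, the T_orig scores win, matching step 2 of the recipe; the indicator features then let MERT learn how much to trust each source table.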
Merging Phrase Tables (4)
38.
Use Merge to combine the phrase tables
for concat×k:align (as T_orig) and
for concat×1 (as T_extra).
Two parameters to tune
number of repetitions k
# of extra features to use with Merge:
(a) F1 only;
(b) F1 and F2,
(c) F1, F2 and F3
Improved word alignments.
Improved lexical coverage.
Distinguish phrases by source table.
Combined Method
40.
Data (sizes in tokens)
Translation data (for IN-EN)
IN2EN-train: 0.9M
IN2EN-dev: 37K
IN2EN-test: 37K
EN-monolingual: 5M
Adaptation data (for ML-EN → “IN”-EN)
ML2EN: 8.6M
IN-monolingual: 20M
41.
Isolated Experiments:
Training on “IN”-EN only
[Bar chart] BLEU: 14.50, 18.67, 19.50, 20.06, 20.63, 20.89, 21.24
System combination using MEMT (Heafield and Lavie, 2010). Wang, Nakov & Ng (EMNLP 2012)
42.
[Bar chart] BLEU over simple concatenation, balanced concatenation, and phrase table combination:
ML2EN (baseline): 18.49, 19.79, 20.10
System combination: 21.55, 21.64, 21.62
Combined Experiments:
Training on IN-EN + “IN”-EN
Wang, Nakov & Ng (EMNLP 2012)
43.
Experiments: Improvements
[Bar chart] BLEU: 14.50, 18.67, 20.10, 21.24, 21.64
Wang, Nakov & Ng (EMNLP 2012)
44.
Improve Macedonian-English SMT by adapting
Bulgarian-English bi-text
Adapt BG-EN (11.5M words) to “MK”-EN (1.2M words)
OPUS movie subtitles
Application to Other Languages & Domains
BLEU:
BG2EN (A): 27.33
Word paraphrases + morphology (B): 27.97
Phrase paraphrases + morphology (C): 28.38
System combination of A+B+C: 29.05
45.
Analysis
46.
Paraphrasing
Non-Indonesian Malay Words Only
So, we do need to paraphrase all words.
Wang, Nakov & Ng (EMNLP 2012)
47.
Human Judgments
Is the adapted sentence better Indonesian than the original Malay sentence? (100 random sentences)
Morphology yields worse top-3 adaptations, but better phrase tables, due to coverage.
Wang, Nakov & Ng (EMNLP 2012)
48.
Reverse Adaptation
Idea:
Adapt dev/test Indonesian input to “Malay”,
then translate with a Malay-English system
Input to SMT:
- “Malay” lattice
- 1-best “Malay” sentence from the lattice
Adapting dev/test is worse than adapting the training bi-text:
so we need both the n-best adaptations and the LM
Wang, Nakov & Ng (EMNLP 2012)
49.
A Specialized Decoder
(Instead of Moses)
50.
Beam-Search Text Rewriting Decoder:
The Algorithm
A Beam-Search Decoder for Normalization of Social Media Text
with Application to Machine Translation. (NAACL 2013). Pidong Wang, Hwee Tou Ng
51.
Beam-Search Text Rewriting Decoder:
An Example (Twitter Normalization)
Wang, Nakov & Ng (NAACL 2013)
52.
Hypothesis producers
Word-level mapping
Phrase-level mapping
Cross-lingual morphology mapping
Features
Indonesian LM
Word penalty (target)
Malay word penalty (source)
Phrase count
Wang, Nakov & Ng (NAACL 2013)
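A minimal sketch of such a beam-search rewriting decoder: a word-level hypothesis producer proposes rewrites, and a weighted sum of feature scores (mapping probability plus LM) ranks partial outputs. The mappings, weights, and the stand-in LM below are all invented for illustration and do not reproduce the actual NAACL 2013 decoder:

```python
import heapq
import math

# Word-level hypothesis producer: rewrite candidates with probabilities.
mappings = {"cecah": [("mencapai", 0.8), ("cecah", 0.2)]}

def lm_score(prev, word):
    # Stand-in for a real Indonesian LM (effectively unigram here,
    # but keeping the bigram interface a real LM feature would use).
    return math.log({"mencapai": 0.7}.get(word, 0.1))

def decode(words, beam_size=3, w_map=1.0, w_lm=1.0):
    beam = [(0.0, [])]                        # (log score, output so far)
    for w in words:
        cands = mappings.get(w, [(w, 1.0)])   # identity if no producer fires
        nxt = []
        for score, out in beam:
            for rep, p in cands:
                prev = out[-1] if out else "<s>"
                s = score + w_map * math.log(p) + w_lm * lm_score(prev, rep)
                nxt.append((s, out + [rep]))
        beam = heapq.nlargest(beam_size, nxt, key=lambda t: t[0])
    return beam[0][1]

decode(["kdnk", "cecah"])
```

The real decoder works at the sentence level with several producers (word, phrase, morphology) and more features (word penalties, phrase count), with the feature weights tuned like MERT weights.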
53.
Moses vs. the Specialized Decoder
Decoding level
phrase vs. sentence
Features
Moses vs. richer, e.g., Malay word penalty
word-level + phrase-level
(potentially, manual rules)
Cross-lingual variants
input lattice vs. feature function
Wang, Nakov & Ng (NAACL 2013)
54.
Moses vs. a Specialized Decoder:
Isolated “IN”-EN Experiments:
BLEU (Moses vs. specialized decoder):
WordPar: 19.50 vs. 20.39
WordPar+Morph: 20.06 vs. 20.46
PhrasePar: 20.63 vs. 20.85
PhrasePar+Morph: 20.89 vs. 21.07
System combination: 21.24 vs. 21.76
55.
Moses vs. a Specialized Decoder:
Combining IN-EN and “IN”-EN
BLEU over simple concatenation, balanced concatenation, and phrase table combination:
ML2EN (baseline): 18.49, 19.79, 20.10
Moses: 21.55, 21.64, 21.62
Specialized decoder: 21.74, 21.81, 22.03
56.
Experiments: Improvements
BLEU:
ML2EN (baseline): 14.50
IN2EN (baseline): 18.67
phrase table combination (Moses): 20.10
best isolated system (Moses): 21.24
best combined system (Moses): 21.64
best combination (DD): 22.03
57.
MK→EN, Adapting BG-EN to “MK”-EN
BLEU:
BG2EN: 27.33
WordPar+morph (Moses): 27.97
PhrasePar+morph (Moses): 28.38
combination (Moses): 29.05
combination (DD): 29.35
58.
Transliteration
59.
Spanish vs. Portuguese
Spanish–Portuguese
Spanish
Todos los seres humanos nacen libres e iguales en dignidad y derechos
y, dotados como están de razón y conciencia, deben comportarse
fraternalmente los unos con los otros.
Portuguese
Todos os seres humanos nascem livres e iguais em dignidade e em
direitos. Dotados de razão e de consciência, devem agir uns para com os
outros em espírito de fraternidade.
(from Article 1 of the Universal Declaration of Human Rights)
17% exact word overlap
60.
Spanish vs. Portuguese
Spanish–Portuguese
17% exact word overlap
67% approx. word overlap
The actual overlap is even higher.
61.
Cognates
Linguistics
Def: Words derived from a common root, e.g.,
Latin tu (‘2nd person singular’)
Old English thou
French tu
Spanish tú
German du
Greek sú
Orthography/phonetics/semantics: ignored.
Computational linguistics
Def: Words in different languages that are mutual translations and
have a similar orthography, e.g.,
evolution vs. evolución vs. evolução vs. evoluzione
Orthography & semantics: important.
Origin: ignored.
Cognates can differ a lot:
• night vs. nacht vs. nuit vs. notte vs. noite
• star vs. estrella vs. stella vs. étoile
• arbeit vs. rabota vs. robota (‘work’)
• father vs. père
• head vs. chef
62.
Spelling Differences Between Cognates
Systematic spelling differences
Spanish – Portuguese
different spelling
-nh- ↔ -ñ- (senhor vs. señor)
phonetic
-ción ↔ -ção (evolución vs. evolução)
-é ↔ -ei (1st sing. past) (visité vs. visitei)
-ó ↔ -ou (3rd sing. past) (visitó vs. visitou)
Occasional differences
Spanish – Portuguese
decir vs. dizer (‘to say’)
Mario vs. Mário
María vs. Maria
Many of these can be
learned automatically.
63.
Automatic Transliteration
Transliteration
1. Extract likely cognates for Portuguese-Spanish
2. Learn a character-level transliteration model
3. Transliterate the Portuguese side of pt-en, to look like Spanish
64.
Automatic Transliteration (2)
Extract pt-es cognates using English (en):
1. Induce pt-es word translation probabilities.
2. Filter out candidate pairs by translation probability (thresholds: constants proposed in the literature).
3. Filter out candidate pairs by orthographic similarity, based on the longest common subsequence.
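The orthographic-similarity filter can be sketched with the longest-common-subsequence ratio (LCSR); the 0.5 threshold below is illustrative, not the constant used in the original experiments:

```python
# Cognate filtering by orthographic similarity: keep candidate pairs
# whose longest-common-subsequence ratio (LCSR) is high enough.

def lcs_len(a, b):
    """Length of the longest common subsequence of strings a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            if ca == cb:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def lcsr(a, b):
    return lcs_len(a, b) / max(len(a), len(b))

def likely_cognates(pairs, threshold=0.5):  # threshold is illustrative
    return [(pt, es) for pt, es in pairs if lcsr(pt, es) >= threshold]

likely_cognates([("evolução", "evolución"), ("dizer", "decir"), ("casa", "perro")])
```

For example, lcsr("dizer", "decir") = 3/5 = 0.6, so the pair survives, while an unrelated pair like ("casa", "perro") scores 0 and is filtered out.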
65.
SMT-based Transliteration
Train & tune a monotone character-level SMT system
Use it to transliterate the Portuguese side of pt-en
66.
ESEN, Adapting PT-EN to “ES”-EN
5.34
22.87
24.23
13.79
26.24
0
5
10
15
20
25
30
PT-EN ES-EN phrase table combination
BLEU
original transliterated
10K ES-EN, 1.23M PT-EN
67.
Transliteration
vs.
Character-Level Translation
68.
Macedonian vs. Bulgarian
69.
MK→BG: Transliteration vs. Translation
BLEU:
MK (original): 10.74
MK (simple translit.): 12.07
MK (cognate translit.): 22.74
MK-BG (words): 31.10
MK-BG (words + cogn. translit.): 32.19
MK-BG (chars): 32.71
MK-BG (words + cogn. translit. + chars): 33.94
Combining Word-Level and Character-Level Models for Machine
Translation Between Closely-Related Languages (ACL 2012).
Preslav Nakov, Jorg Tiedemann.
70.
Character-Level SMT
• MK: Никогаш не сум преспала цела сезона.
• BG: Никога не съм спала цял сезон.
• MK: Н и к о г а ш _ н е _ с у м _ п р е с п а л а _ ц е л а _ с е з о н а _ .
• BG: Н и к о г а _ н е _ с ъ м _ с п а л а _ ц я л _ с е з о н _ .
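The character-level representation shown above is easy to produce: spaces become "_" and every character is treated as a token, so a standard phrase-based system can be trained on it unchanged:

```python
# Convert a sentence into the character-level representation used for
# character-level SMT: "_" marks word boundaries, characters are tokens.

def to_chars(sentence):
    return " ".join("_" if c == " " else c for c in sentence)

to_chars("Никога не съм")  # "Н и к о г а _ н е _ с ъ м"
```

Decoding then works over character "phrases" of up to 10 characters with a 10-gram character LM (per the settings on the next slide), and the output is detokenized by reversing this mapping.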
71.
Character-Level Phrase Pairs
Can cover:
word prefixes/suffixes
entire words
word sequences
combinations thereof
Max-phrase-length=10
LM-order=10
72.
MK->BG: The Impact of Data Size
Slavic Languages in Europe
MK -> XX
[Map: BG, MK, SR, CZ, SL]
MK -> SR, SL, CZ
MK->EN: Pivoting over BG
Macedonian: Никогаш не сум преспала цела сезона.
Bulgarian: Никога не съм спала цял сезон.
English: I’ve never slept for an entire season.
For related languages
• subword transformations
• character-level translation
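The pivoting cascade above can be sketched as follows; the two system arguments are placeholder callables standing in for trained SMT models, not real APIs:

```python
def pivot_translate(mk_sentence, mk_to_bg_chars, bg_to_en_words):
    """Pivot MK -> EN over BG: a character-level model handles the
    closely-related MK->BG step (mostly subword transformations),
    a word-level model handles BG->EN.
    Both arguments are stand-ins for trained translation systems."""
    # Character-level representation for MK->BG ('_' marks word boundaries).
    mk_chars = " _ ".join(" ".join(token) for token in mk_sentence.split())
    bg_chars = mk_to_bg_chars(mk_chars)
    # Back to a word-level sentence for the BG->EN step.
    bg_sentence = bg_chars.replace(" ", "").replace("_", " ")
    return bg_to_en_words(bg_sentence)
```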
MK->EN: Pivoting over BG
Jörg Tiedemann, Preslav Nakov. Analyzing the Use of Character-Level Translation with Sparse and Noisy Datasets (RANLP 2013).
MK->EN: Using Synthetic “MK”-EN Bi-Texts
Translate Bulgarian to Macedonian in a BG-XX corpus
All synthetic data combined (+mk-en): 36.69 BLEU
Tiedemann & Nakov (RANLP 2013)
Conclusion & Future Work
Adapt bi-texts for related resource-rich languages, using
confusion networks
word-level & phrase-level paraphrasing
cross-lingual morphological analysis
Character-level models
translation
transliteration
pivoting vs. synthetic data
Future work
other languages & NLP problems
robustness: noise and domain shift
Thank you!
Related Work
Related Work (1)
Machine translation between related languages
E.g.
Cantonese–Mandarin (Zhang, 1998)
Czech–Slovak (Hajic & al., 2000)
Turkish–Crimean Tatar (Altintas & Cicekli, 2002)
Irish–Scottish Gaelic (Scannell, 2006)
Bulgarian–Macedonian (Nakov & Tiedemann, 2012)
We do not translate (no training data), we “adapt”.
Related Work (2)
Adapting dialects to standard language (e.g., Arabic)
(Bakr & al., 2008; Sawaf, 2010; Salloum & Habash, 2011)
manual rules and/or language-specific tools
Normalizing Tweets and SMS
(Aw & al., 2006; Han & Baldwin, 2011)
informal text: spelling, abbreviations, slang
same language
Related Work (3)
Adapt Brazilian to European Portuguese (Marujo & al. 2011)
rule-based, language-dependent
tiny improvements for SMT
Reuse bi-texts between related languages (Nakov & Ng, 2009)
no language adaptation (just transliteration)
Cascaded/pivoted translation
(Utiyama & Isahara, 2007; Cohn & Lapata, 2007; Wu & Wang, 2009)
poor -> rich -> X: requires an additional poor-rich bi-text
rich -> X applied to poor: does not use the poor-rich similarity
[Diagram: our approach links poor, rich, and X]
Speaker notes
Statistical machine translation (or SMT) systems learn how to translate from large sentence-aligned bilingual corpora of human-generated translations.
We often call such kind of corpora bi-texts.
A well-known problem with the current SMT systems is that collecting sufficiently large training bi-texts is very hard, so most languages in the world are still resource-poor for SMT.
To solve this problem, we want to adapt a bi-text of a resource-rich language to improve machine translation for a related resource-poor language.
Let’s start with an introduction first.
We want to reuse bi-texts from related resource-rich languages to improve resource-poor SMT.
Why does this work?
Because many resource-poor languages are related to some resource-rich languages.
And related languages often share overlapping vocabulary and cognates.
They often have similar word order and syntax.
There are many resource-rich and resource-poor languages which are closely related.
[CLICK]
In our work, we focus on the pair, Malay and Indonesian.
We also show the applicability of our method to another language pair: Bulgarian and Macedonian.
Here is our main focus: improve Indonesian-English SMT using additional Malay-English bi-text.
Malay and Indonesian are closely related languages.
A native speaker of Indonesian can understand Malay texts, and vice versa.
Here are two example sentence pairs which show the similarity: about 50 percent of the words overlap.
So, we can train an SMT system on one language and apply it to the other directly: there are matching words and short phrases.
We asked a native Indonesian speaker to adapt the same Malay sentences into Indonesian while preserving as many Malay words as possible.
As a result, the overlap reached 75 percent.
[CLICK]
Our goal is to do this automatically: adapt Malay to look like Indonesian.
Then, we can use this adapted bi-text to improve Indonesian-English SMT.
Suppose we have a small Indonesian to English bi-text, which is resource-poor.
And we also have another large bi-text for Malay-English, which is resource-rich.
Our method has two steps.
[CLICK]
The first step is bi-text adaptation.
We adapt the Malay side of the Malay-English bi-text to look like Indonesian.
[CLICK]
The second step is bi-text combination.
We try to combine the adapted bi-text with the original small Indonesian-English bi-text in order to improve Indonesian-English SMT.
[CLICK]
Note that we have no Malay-Indonesian bi-text.
The first step is bi-text adaptation: adapting a Malay-English bi-text to Indonesian-English.
Given a Malay-English sentence pair
We first adapt the Malay sentence to look like “Indonesian” using word-level and phrase-level paraphrases, and cross-lingual morphology.
Then, we pair the adapted “Indonesian” sentence with the English sentence of the Malay-English sentence pair.
[CLICK]
Finally, we can generate a new Indonesian-English sentence pair.
For example,
given a Malay sentence,
[CLICK]
we generate a confusion network.
In the confusion network, each Malay word is augmented with multiple Indonesian word-level paraphrases.
[CLICK]
Then we decode this confusion network using a large Indonesian language model.
Thus, a ranked list of some adapted “Indonesian” sentences is obtained.
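Decoding the confusion network with the language model can be sketched as a simple beam search; the bigram dictionary below is a tiny stand-in for the large Indonesian LM, and the candidate words are illustrative toys:

```python
import math

def decode_confusion_network(network, bigram_lm, beam=5):
    """Beam-search decoding of a word-level confusion network.

    network:   list of positions; each position is a list of
               (candidate_word, translation_prob) pairs.
    bigram_lm: dict mapping (previous_word, word) -> probability,
               a stand-in for a real large language model.
    Returns the highest-scoring adapted sentence.
    """
    hypotheses = [(0.0, ["<s>"])]  # (log-score, words so far)
    for options in network:
        expanded = []
        for score, words in hypotheses:
            for word, p_trans in options:
                p_lm = bigram_lm.get((words[-1], word), 1e-6)  # crude smoothing
                expanded.append((score + math.log(p_trans) + math.log(p_lm),
                                 words + [word]))
        expanded.sort(key=lambda h: h[0], reverse=True)
        hypotheses = expanded[:beam]  # prune to the beam width
    return " ".join(hypotheses[0][1][1:])  # drop the <s> marker
```

The LM score is what lets the decoder pick the paraphrase that fits the context, not just the most probable translation option in isolation.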
After that, we pair each adapted “Indonesian” sentence with the English counter-part for the Malay sentence in the Malay-English bi-text.
[CLICK]
We thus end up with a synthetic “Indonesian”–English bi-text.
How do we find the Indonesian word-level paraphrases for a Malay word?
We use pivoting over English to induce potential Indonesian paraphrases for a given Malay word.
First, we generate separate word alignments for the Indonesian–English and the Malay–English bi-texts.
If a Malay word ML3 and an Indonesian word IN3 are both aligned to the same English word EN3,
[CLICK]
then, we consider the Indonesian word IN3 as a potential translation option for the Malay word ML3.
[CLICK]
each translation pair is associated with a conditional probability in the confusion network.
The probability is estimated by pivoting over English.
[CLICK]
Note that we have no Malay-Indonesian bi-text, so we pivot over English to get Malay-Indonesian translation pairs.
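The pivoted paraphrase probability marginalizes over the shared English words, p(in | ml) = sum over en of p(in | en) * p(en | ml); a minimal sketch with toy lexical distributions (the example words are illustrative, not from the talk's data):

```python
from collections import defaultdict

def pivot_paraphrases(p_en_given_ml, p_in_given_en):
    """Estimate p(indonesian_word | malay_word) by pivoting over English:
        p(in | ml) = sum_en p(in | en) * p(en | ml)
    Both inputs are dicts of dicts holding conditional probabilities
    read off the two word alignments."""
    p_in_given_ml = {}
    for ml_word, en_dist in p_en_given_ml.items():
        scores = defaultdict(float)
        for en_word, p_en in en_dist.items():
            for in_word, p_in in p_in_given_en.get(en_word, {}).items():
                scores[in_word] += p_in * p_en
        p_in_given_ml[ml_word] = dict(scores)
    return p_in_given_ml
```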
Since the Indonesian-English bi-text is small, its word alignments are unreliable.
As a result, we get bad Malay-Indonesian paraphrases from the word alignments.
[CLICK]
We try to improve the word alignments using the Malay-English bi-text. Since Malay and Indonesian share some vocabulary, we combine the Indonesian-English and Malay-English bi-texts to carry out word alignment. As a result, we obtain an improved Indonesian-English word alignment.
When we concatenate the Indonesian-English and the Malay-English bi-text, we concatenate multiple copies of the small Indonesian-English bi-text. The reason is that the Malay-English bi-text is much larger than the small Indonesian-English bi-text.
The second issue is that
Since the Indonesian-English bi-text is small, the Indonesian word-level paraphrases for a Malay word are restricted to the small Indonesian vocabulary of the small Indonesian–English bi-text.
[CLICK]
To enlarge the small Indonesian vocabulary, we use cross-lingual morphological variants.
Now let me explain how we add cross-lingual morphological variants to a confusion network.
If the input Malay sentence has the word seperminuman, we first find its lemma minum, and then determine all Indonesian words sharing the same lemma.
These Indonesian words are considered as the cross-lingual morphological variants for the Malay word.
[CLICK]
Note that here the Indonesian morphological variants are from a large monolingual Indonesian text, so there are new Indonesian words which are not in the small Indonesian-English bi-text.
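Collecting the cross-lingual morphological variants amounts to a lemma lookup; a sketch assuming a Malay lemmatizer and an Indonesian word-to-lemma lexicon built from the monolingual text (both are stand-ins for the real resources):

```python
def morphological_variants(malay_word, malay_lemmatize, indonesian_lexicon):
    """Return the Indonesian words sharing a lemma with the Malay word.

    malay_lemmatize:    callable mapping a Malay word to its lemma
                        (e.g. seperminuman -> minum).
    indonesian_lexicon: dict mapping each Indonesian word to its lemma,
                        built from large monolingual Indonesian text.
    """
    lemma = malay_lemmatize(malay_word)
    return sorted(word for word, l in indonesian_lexicon.items() if l == lemma)
```

Because the lexicon comes from monolingual text, the variants can include Indonesian words that never occur in the small Indonesian-English bi-text.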
Word-level pivoting ignores context.
It relies on the Indonesian language model to make the right contextual choice.
[CLICK]
We also try to model the context more directly by generating adaptation options at the phrase level using pivoted phrase tables.
We use standard phrase-based SMT techniques to build two separate phrase tables for the Indonesian–English and the Malay–English bi-texts.
Then we pivot the two phrase tables over English phrases.
The obtained pivoted phrase table is used to adapt Malay to Indonesian.
We also add cross-lingual morphological variants to enlarge the Indonesian vocabulary.
[CLICK]
As a result, we can model the context better by using both Indonesian language model and phrases.
Another advantage is that we can have more word operations here, since we use phrases.
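Pivoting the two phrase tables over shared English phrases follows the same marginalization as the word-level case; a sketch using a simple renormalized inversion of the Indonesian-English table (the phrases are toy examples, and real phrase tables carry several scores per pair, not just one):

```python
from collections import defaultdict

def pivot_phrase_tables(ml_to_en, in_to_en):
    """Build a Malay->Indonesian phrase table by pivoting over English:
        p(in_phrase | ml_phrase) = sum_en p(in_phrase | en) * p(en | ml_phrase)

    ml_to_en: dict {malay_phrase: {english_phrase: p(en | ml)}}
    in_to_en: dict {indonesian_phrase: {english_phrase: p(en | in)}}
    """
    # Invert the Indonesian table into p(in | en) with simple renormalization.
    en_to_in = defaultdict(dict)
    for in_phrase, en_dist in in_to_en.items():
        for en_phrase, p in en_dist.items():
            en_to_in[en_phrase][in_phrase] = p
    for dist in en_to_in.values():
        total = sum(dist.values())
        for in_phrase in dist:
            dist[in_phrase] /= total

    pivoted = {}
    for ml_phrase, en_dist in ml_to_en.items():
        scores = defaultdict(float)
        for en_phrase, p_en in en_dist.items():
            for in_phrase, p_in in en_to_in.get(en_phrase, {}).items():
                scores[in_phrase] += p_in * p_en
        pivoted[ml_phrase] = dict(scores)
    return pivoted
```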
Recall that the second step of our method is bi-text combination.
We combine the original small Indonesian–English bi-text with the adapted “Indonesian”–English bi-text in three ways:
[CLICK]
The first way is to simply concatenate the two bi-texts as the training bi-text.
In this way, we assume the two bi-texts have the same quality.
[CLICK]
The second way is called balanced concatenation.
Since the adapted bi-text is much larger than the original Indonesian-English bi-text, the adapted bi-text will dominate the concatenation.
In order to overcome this problem, we repeat the smaller Indonesian–English bi-text enough times so that the amounts of the two bi-texts are the same before concatenation.
[CLICK]
Finally, we experiment with a method for combining phrase tables proposed in previous work by Nakov and Ng, which improves the word alignments and then combines the phrase tables with extra features.
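The balanced concatenation described above can be sketched as follows; the rounding heuristic is my assumption, since the talk only says the smaller bi-text is repeated until the two amounts match:

```python
def balanced_concatenation(small_bitext, large_bitext):
    """Concatenate two bi-texts, repeating the smaller one so that it is
    roughly the same size as the larger one and the large adapted bi-text
    does not dominate. A bi-text is a list of (src, tgt) sentence pairs."""
    if not small_bitext:
        return list(large_bitext)
    repeats = max(1, round(len(large_bitext) / len(small_bitext)))
    return small_bitext * repeats + list(large_bitext)
```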
I will now present our experiments.
In our experiments, we use the following datasets.
For Indonesian–English: we have a small training bi-text, a development set, and also a test set.
We also use a large Malay–English bi-text, which is then adapted into Indonesian-English.
We have carried out two kinds of experiments:
The first kind is called isolated experiments.
In isolated experiments, we only use the adapted bi-text but not the original Indonesian-English bi-text.
These experiments provide a direct comparison to using the original bi-text.
The green bars show the two baseline systems.
Although the original Malay-English bi-text is about 10 times bigger than the original Indonesian-English bi-text, training on the Malay-English bi-text is much worse than training on the small Indonesian-English bi-text.
This shows the existence of important differences between Malay and Indonesian.
Using our method, we can see that word-level paraphrasing improves by 5 BLEU points over the original Malay-English baseline.
And it improves by close to one BLEU point over the original Indonesian-English baseline.
By adding cross-lingual morphological variants to word-level paraphrasing, we get about half a BLEU point of improvement. This confirms that the cross-lingual morphological variants are actually effective.
As we discussed before, phrase-level paraphrasing can model context better, so phrase-level paraphrasing gets larger improvement.
Finally, we use the system combination method, MEMT, to combine the best word-level paraphrasing system and the best phrase-level paraphrasing system, and it yields even further improvements.
This shows that the two kinds of paraphrasing methods are actually complementary.
The second kind of experiments is combined experiments.
In these experiments, we try to combine the adapted bi-text with the original Indonesian-English bi-text using the three bi-text combination methods.
Similar to the isolated experiments, we get improvements using both word-level and phrase-level paraphrasing methods. This is consistent with the isolated experiments.
One interesting finding is that, with our method, the results for the three bi-text combination methods do not differ as much as they do for the baselines.
To summarize, this graph shows the overall improvements that we obtain in our experiments.
The first three bars are the baselines using existing methods,
And the fourth one is our best isolated system, which improves about 1 BLEU point over the baselines.
The last one is the best combined system, and it gives us 1.5 BLEU point improvement over the baselines.
We have also applied our method to other languages.
We try to improve Macedonian-English SMT by adapting a Bulgarian-English bi-text.
We get similar results.
This confirms the applicability of our method to other language pairs.
While Indonesian is closely related to Malay, there are also some false friends.
They share some words, but the words may have very different meanings in the two languages.
That’s why we paraphrase all the words in our experiments.
We asked a native Indonesian speaker who does not speak Malay to judge whether our adapted “Indonesian” sentences are more understandable to him than the original Malay input.
It turned out that the two were similarly understandable to him.
The adapted sentences did work better than the original Malay sentences in our experiments.
We think there can be two reasons for this:
The first one is that SMT systems can tolerate noisy training data;
The second reason can be that the judgments were at the sentence level, while phrases are sub-sentential; there can be many good ones in a “bad” sentence.
We also tried to adapt Indonesian to Malay, and then use a Malay-English translation system to translate the adapted Malay sentences to English.
However, the results turned out to be worse than adapting Malay to Indonesian.
Some related work.
Next I will conclude our work.
In summary, to improve resource-poor machine translation, we adapt bi-texts for a related resource-rich language, using confusion networks, word-level and phrase-level paraphrasing, and morphological analysis.
We achieved very sizable improvements over the baselines.
In the future, we would like to add more word operations, for example, splitting, and merging words.
We also want to find some methods to better integrate our word-level and phrase-level paraphrasing methods.
Lastly, we want to apply our methods to other languages and NLP problems.
Some related work.
There is some related work on translating text between related languages, similar to our bi-text adaptation step.
Most of it uses rule-based translation systems, whereas our method is statistical and language-independent.