Combining, Adapting and Reusing Bi-texts
between Related Languages:
Application to Statistical Machine Translation
Preslav Nakov, Qatar Computing Research Institute
(collaborators: Jorg Tiedemann, Pidong Wang, Hwee Tou Ng)
Yandex seminar
August 13, 2014, Moscow, Russia
Plan
- Part I: Introduction to Statistical Machine Translation
- Part II: Combining, Adapting and Reusing Bi-texts between Related Languages: Application to Statistical Machine Translation
- Part III: Further Discussion on SMT
The Problem: Lack of Resources
Overview
- Statistical Machine Translation (SMT) systems need large sentence-aligned bilingual corpora (bi-texts).
- Problem: such training bi-texts do not exist for most languages.
- Idea: adapt a bi-text for a related resource-rich language.
Building an SMT System for a New Language Pair
- In theory: requires only a few hours or days.
- In practice: large bi-texts are needed, and they are only available for:
  - the official languages of the UN: Arabic, Chinese, English, French, Russian, Spanish
  - the official languages of the EU
  - some other languages
However, most of the world's 6,500+ languages remain resource-poor from an SMT viewpoint. The number is even more striking if we consider language pairs. Moreover, even resource-rich language pairs become resource-poor in new domains.
Most Language Pairs Have Few Resources
Zipfian distribution of language resources.
Building a Bi-text for SMT
- Small bi-texts: relatively easy to build.
- Large bi-texts: hard to get, e.g., because of copyright.
- Sources: parliament debates and legislation
  - national: Canada, Hong Kong
  - international: United Nations; European Union (Europarl, Acquis)
Becoming an official language of the EU is an easy recipe for quickly getting rich in bi-texts. Not all languages are so "lucky", but many can still benefit.
How Google/Bing (Yandex?) Translate
Resource-Poor Languages
- How do we translate from Russian to Malay? Use triangulation.
- Cascaded translation (Utiyama & Isahara, 2007; Koehn & al., 2009):
  - Russian → English → Malay
- Phrase table pivoting (Cohn & Lapata, 2007; Wu & Wang, 2007):
  - рамочное соглашение ||| framework agreement ||| 0.7 …
  - perjanjian kerangka kerja ||| framework agreement ||| 0.8 …
  - THUS: рамочное соглашение ||| perjanjian kerangka kerja ||| 0.56 …
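The pivoting step above can be sketched as follows. This is a hedged toy example, not the real Moses phrase-table machinery: the induced score sums the product of the two probabilities over shared English pivot phrases.

```python
# Toy phrase-table pivoting: induce a Russian->Malay phrase pair by
# marginalizing over shared English pivot phrases:
#   p(ms | ru) = sum over en of p(en | ru) * p(ms | en)
# The probabilities are the illustrative scores from the slide.

def pivot(ru_en, ms_en):
    """ru_en: {(ru, en): p(en|ru)}; ms_en: {(ms, en): p(ms|en)}."""
    induced = {}
    for (ru, en1), p1 in ru_en.items():
        for (ms, en2), p2 in ms_en.items():
            if en1 == en2:  # shared English pivot phrase
                induced[(ru, ms)] = induced.get((ru, ms), 0.0) + p1 * p2
    return induced

ru_en = {("рамочное соглашение", "framework agreement"): 0.7}
ms_en = {("perjanjian kerangka kerja", "framework agreement"): 0.8}
table = pivot(ru_en, ms_en)
# 0.7 * 0.8 = 0.56, matching the induced entry on the slide
```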
What if We Want to Translate into English?
- Idea: reuse bi-texts from related resource-rich languages to build an improved SMT system for a related resource-poor language.
- NOTE 1: this is NOT triangulation. We focus on translation into English,
  - e.g., Indonesian-English using Malay-English,
  - rather than Indonesian → English → Malay or Indonesian → Malay → English.
- NOTE 2: we exploit the fact that the source languages are related.
Resource-poor vs. Resource-rich
Resource-rich vs. Resource-poor Languages
- Related EU – non-EU/unofficial languages:
  - Swedish – Norwegian
  - Bulgarian – Macedonian
  - Irish – Scottish Gaelic
  - Standard German – Swiss German
- Related EU languages:
  - Spanish – Catalan
  - Czech – Slovak
- Related languages outside Europe:
  - Russian – Ukrainian
  - MSA – Dialectal Arabic (e.g., Egyptian, Gulf, Levantine, Iraqi)
  - Hindi – Urdu
  - Turkish – Azerbaijani
  - Malay – Indonesian
We will explore these pairs.
Motivation
Related languages have:
- overlapping vocabulary (cognates)
- similar word order
- similar syntax
Source Language Adaptation for Resource-Poor Machine Translation (Wang, Nakov, & Ng)
Improving Indonesian-English SMT Using Malay-English
Malay vs. Indonesian
Malay
 Semua manusia dilahirkan bebas dan samarata dari segi
kemuliaan dan hak-hak.
 Mereka mempunyai pemikiran dan perasaan hati dan
hendaklah bertindak di antara satu sama lain dengan
semangat persaudaraan.
Indonesian
 Semua orang dilahirkan merdeka dan mempunyai martabat
dan hak-hak yang sama.
 Mereka dikaruniai akal dan hati nurani dan hendaknya
bergaul satu sama lain dalam semangat persaudaraan.
~50% exact word overlap
from Article 1 of the Universal Declaration of Human Rights
Malay Can Look “More Indonesian”…
Malay
 Semua manusia dilahirkan bebas dan samarata dari
segi kemuliaan dan hak-hak.
 Mereka mempunyai pemikiran dan perasaan hati
dan hendaklah bertindak di antara satu sama lain
dengan semangat persaudaraan.
~75% exact word overlap
Post-edited Malay to look “Indonesian” (by an Indonesian speaker).
Indonesian
 Semua manusia dilahirkan bebas dan mempunyai martabat
dan hak-hak yang sama.
 Mereka mempunyai pemikiran dan perasaan dan hendaklah
bergaul satu sama lain dalam semangat persaudaraan.
from Article 1 of the Universal Declaration of Human Rights
We attempt to do this automatically:
adapt Malay to look Indonesian
Then, use it to improve SMT…
Method at a Glance
- Step 1: Adaptation. Adapt the Malay side of the Malay-English bi-text to "Indonesian", producing an "Indonesian"-English bi-text.
- Step 2: Combination. Combine the Indonesian-English bi-text with the "Indonesian"-English bi-text and train an SMT system into English.
Note that we have no Malay-Indonesian bi-text!
Step 1: Adapting Malay-English to "Indonesian"-English
Word-Level Bi-text Adaptation: Overview
Given a Malay-English sentence pair
1. Adapt the Malay sentence to “Indonesian”
• Word-level paraphrases
• Phrase-level paraphrases
• Cross-lingual morphology
2. Pair the adapted "Indonesian" sentence with the English side of the Malay-English sentence pair.
Thus, we generate a new "Indonesian"-English sentence pair.
Source Language Adaptation for Resource-Poor Machine Translation. Pidong Wang, Preslav Nakov, Hwee Tou Ng. (EMNLP 2012)
Word-Level Bi-text Adaptation:
Motivation
- In many cases, word-level substitutions are enough.
- Adapt Malay to Indonesian (training data):
  - Malay: KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010.
  - Indonesian: PDB Malaysia akan mencapai 8 persen pada tahun 2010.
  - English: Malaysia's GDP is expected to reach 8 per cent in 2010.
Word-Level Bi-text Adaptation: Overview
- Start from the Malay sentence: KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010.
- Generate Indonesian word candidates, with probabilities obtained by pivoting over English, and decode using a large Indonesian LM.
- Pair each adapted sentence with the English counterpart: Malaysia's GDP is expected to reach 8 per cent in 2010.
Thus, we generate a new "Indonesian"-English bi-text.
Word-Level Adaptation: Extracting Paraphrases
- Indonesian translations for Malay words: pivoting over English.
- A Malay word is linked to its aligned English words in the ML-EN bi-text; each such English word is linked to its aligned Indonesian words in the IN-EN bi-text. The resulting Indonesian candidates come with weights.
Note: we have no Malay-Indonesian bi-text, so we pivot.
Word-Level Adaptation: Issue 1
The IN-EN bi-text is small, thus:
- unreliable IN-EN word alignments → bad ML-IN paraphrases.
Solution: improve the IN-EN alignments using the ML-EN bi-text:
- concatenate: IN-EN × k + ML-EN, where k ≈ |ML-EN| / |IN-EN|
- run word alignment
- keep the alignments for one copy of IN-EN only
This works because of the cognates between Malay and Indonesian.
Word-Level Adaptation: Issue 2
The IN-EN bi-text is small, thus:
- small IN vocabulary for the ML-IN paraphrases.
Solution: add cross-lingual morphological variants:
- given an ML word: seperminuman
- find its ML lemma: minum
- propose all known IN words sharing the same lemma: diminum, diminumkan, diminumnya, makan-minum, makananminuman, meminum, meminumkan, meminumnya, meminum-minuman, minum, minum-minum, minum-minuman, minuman, minumanku, minumannya, peminum, peminumnya, perminum, terminum
Note: the IN variants come from a larger monolingual IN text.
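A minimal sketch of this variant-proposal step, assuming a toy Malay lemmatizer and a toy Indonesian lemma index (the real system derives these from a morphological analyzer and the large monolingual Indonesian text):

```python
# Propose cross-lingual morphological variants: map a Malay word to its
# lemma, then return all Indonesian surface forms sharing that lemma.
# Both dictionaries below are illustrative stand-ins.

ML_LEMMA = {"seperminuman": "minum"}  # Malay word -> lemma
IN_FORMS = {  # Indonesian lemma -> surface forms from monolingual text
    "minum": ["diminum", "meminum", "minum", "minuman", "peminum"],
}

def variants(ml_word):
    """Indonesian candidates sharing the Malay word's lemma."""
    lemma = ML_LEMMA.get(ml_word)
    return IN_FORMS.get(lemma, [])

print(variants("seperminuman"))
# -> ['diminum', 'meminum', 'minum', 'minuman', 'peminum']
```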
Word-Level Adaptation: Issue 3
Word-level pivoting:
- ignores context and relies on the LM
- cannot drop/insert/merge/split/reorder words
Solution: phrase-level pivoting:
- build ML-EN and EN-IN phrase tables
- induce an ML-IN phrase table by pivoting over EN
- adapt the ML side of ML-EN to get an "IN"-EN bi-text, using an Indonesian LM and n-best "IN" as before
- also use cross-lingual morphological variants
This models context better (not only the Indonesian LM, but also phrases) and allows many word operations, e.g., insertion and deletion.
Step 2: Combining IN-EN + "IN"-EN
Combining IN-EN and “IN”-EN bi-texts
 Simple concatenation: IN-EN + “IN”-EN
 Balanced concatenation: IN-EN * k + “IN”-EN
 Sophisticated phrase table combination
 Improved word alignments for IN-EN
 Phrase table combination with extra features
Improved Statistical Machine Translation for Resource-Poor Languages Using Related Resource-Rich Languages. Preslav Nakov, Hwee Tou Ng. (EMNLP 2009)
Improving Statistical Machine Translation for a Resource-Poor Language Using Related Resource-Rich Languages. Preslav Nakov, Hwee Tou Ng. (JAIR 2012)
Bi-text Combination Strategies
- Concatenating bi-texts
- Merging phrase tables
- Combined method
Concatenating Bi-texts (1)
- Summary: concatenate X1-Y and X2-Y, where X1 is the resource-poor source language, X2 is the related resource-rich one, and Y is the target.
- Advantages:
  - improved word alignments, e.g., for rare words
  - more translation options: fewer unknown words, useful non-compositional phrases (improved fluency); phrases with words from X2 that do not exist in X1 are simply ignored
- Disadvantages:
  - X2-Y will dominate, since it is larger
  - the translation probabilities get distorted
  - phrases from X1-Y and X2-Y cannot be distinguished
Concatenating Bi-texts (2)
- Concat×k: concatenate k copies of the original and one copy of the additional training bi-text.
- Concat×k:align:
  1. Concatenate k copies of the original and one copy of the additional bi-text.
  2. Generate word alignments.
  3. Truncate them, keeping only the alignments for one copy of the original bi-text.
  4. Build a phrase table.
  5. Tune the system using MERT.
The value of k is optimized on the development dataset.
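The concat×k:align steps can be sketched as follows. This is a hedged sketch: `word_align` is a stub standing in for a real word aligner (e.g., GIZA++ or fast_align), and choosing k as the corpus-size ratio is one simple balancing heuristic.

```python
# Concat-k-align: concatenate k copies of the small original bi-text with
# the large additional one, word-align the concatenation, then keep the
# alignments for a single copy of the original.

def concat_k_align(orig, extra, word_align):
    """orig, extra: lists of (src, tgt) sentence pairs.
    word_align: maps a list of pairs to a list of alignments."""
    k = max(1, round(len(extra) / len(orig)))  # balance the two corpora
    corpus = orig * k + extra                  # k copies of the original
    alignments = word_align(corpus)
    return alignments[:len(orig)]              # one copy of the original only

# Toy demo with a fake aligner that just tags each source sentence:
orig = [("a b", "x y")] * 2
extra = [("c d", "z w")] * 6
aligned = concat_k_align(orig, extra, lambda c: [f"align({s})" for s, _ in c])
# -> alignments for the 2 original sentence pairs only
```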
Merging Phrase Tables (1)
- Summary: build two separate phrase tables, then (a) use them together, (b) merge them, or (c) interpolate them.
- Advantages:
  - phrases from X1-Y and X2-Y can be distinguished
  - the larger bi-text X2-Y does not dominate X1-Y
  - more translation options
  - probabilities are combined in a more principled manner
- Disadvantage:
  - improved word alignments are not possible
Merging Phrase Tables (2)
- Two-tables: build two separate phrase tables and use them as alternative decoding paths (Birch et al., 2007).
Merging Phrase Tables (3)
- Interpolation: build two separate phrase tables, T_orig and T_extra, and combine them using linear interpolation:
  Pr(e|s) = α · Pr_orig(e|s) + (1 − α) · Pr_extra(e|s)
  The value of α is optimized on a development dataset.
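The interpolation can be sketched with toy tables (entries and the α value below are illustrative; in practice α is tuned on the development set):

```python
# Linear interpolation of two phrase tables:
#   Pr(e|s) = a * Pr_orig(e|s) + (1 - a) * Pr_extra(e|s)
# Tables are toy dicts {(source, target): probability}.

def interpolate(t_orig, t_extra, a):
    merged = {}
    for key in set(t_orig) | set(t_extra):
        merged[key] = a * t_orig.get(key, 0.0) + (1 - a) * t_extra.get(key, 0.0)
    return merged

t_orig = {("rumah", "house"): 0.6}
t_extra = {("rumah", "house"): 0.8, ("rumah", "home"): 0.2}
merged = interpolate(t_orig, t_extra, a=0.5)
# ("rumah", "house") -> 0.5*0.6 + 0.5*0.8 = 0.7
# ("rumah", "home")  -> 0.5*0.0 + 0.5*0.2 = 0.1
```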
Merging Phrase Tables (4)
- Merge:
  1. Build separate phrase tables: T_orig and T_extra.
  2. Keep all entries from T_orig.
  3. Add those entries from T_extra that are not in T_orig.
  4. Add extra features:
     - F1: 1 if the entry came from T_orig, 0 otherwise.
     - F2: 1 if the entry came from T_extra, 0 otherwise.
     - F3: 1 if the entry was in both tables, 0 otherwise.
  The feature weights are set using MERT, and the number of features is optimized on the development set.
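A minimal sketch of the Merge strategy with the F1-F3 indicator features (toy dict tables, not the Moses phrase-table format):

```python
# Merge: keep all of T_orig, add T_extra entries missing from T_orig,
# and attach indicator features (F1, F2, F3) to each entry.

def merge(t_orig, t_extra):
    merged = {}
    for key, prob in t_orig.items():
        # came from T_orig: F1=1, F2=0; F3 flags presence in both tables
        merged[key] = (prob, 1, 0, 1 if key in t_extra else 0)
    for key, prob in t_extra.items():
        if key not in merged:
            merged[key] = (prob, 0, 1, 0)  # came from T_extra only
    return merged

t_orig = {("rumah", "house"): 0.6}
t_extra = {("rumah", "house"): 0.8, ("rumah", "home"): 0.2}
table = merge(t_orig, t_extra)
# ("rumah", "house") keeps the T_orig score, with F1=1, F2=0, F3=1
# ("rumah", "home") comes from T_extra only, with F1=0, F2=1, F3=0
```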
Combined Method
- Use Merge to combine the phrase tables for concat×k:align (as T_orig) and for concat×1 (as T_extra).
- Two parameters to tune:
  - the number of repetitions k
  - the number of extra features to use with Merge: (a) F1 only; (b) F1 and F2; (c) F1, F2 and F3
This gives improved word alignments, improved lexical coverage, and the ability to distinguish phrases by source table.
Experiments & Evaluation
Data
- Translation data (for IN-EN):
  - IN2EN-train: 0.9M tokens
  - IN2EN-dev: 37K tokens
  - IN2EN-test: 37K tokens
  - EN monolingual: 5M tokens
- Adaptation data (for ML-EN → "IN"-EN):
  - ML2EN: 8.6M tokens
  - IN monolingual: 20M tokens
Isolated Experiments: Training on "IN"-EN only
BLEU scores (system combination via MEMT (Heafield and Lavie, 2010)):
- ML2EN (baseline): 14.50
- IN2EN (baseline): 18.67
- word paraphrases: 19.50
- word paraphrases + morphology: 20.06
- phrase paraphrases: 20.63
- phrase paraphrases + morphology: 20.89
- system combination: 21.24
Wang, Nakov & Ng (EMNLP 2012)
Combined Experiments: Training on IN-EN + "IN"-EN
BLEU scores, combining IN-EN with the unadapted ML-EN bi-text (baseline) vs. with the adapted "IN"-EN bi-text:
- simple concatenation: 18.49 → 21.55
- balanced concatenation: 19.79 → 21.64
- phrase table combination: 20.10 → 21.62
Wang, Nakov & Ng (EMNLP 2012)
Experiments: Improvements
BLEU progression:
- ML2EN (baseline): 14.50
- IN2EN (baseline): 18.67
- phrase table combination: 20.10
- best isolated system: 21.24
- best combined system: 21.64
Wang, Nakov & Ng (EMNLP 2012)
Application to Other Languages & Domains
- Improve Macedonian-English SMT by adapting a Bulgarian-English bi-text.
- Adapt BG-EN (11.5M words) to "MK"-EN (1.2M words).
- Data: OPUS movie subtitles.
BLEU scores:
- BG2EN (A): 27.33
- word paraphrases + morphology (B): 27.97
- phrase paraphrases + morphology (C): 28.38
- system combination of A+B+C: 29.05
Analysis
Paraphrasing
Non-Indonesian Malay Words Only
So, we do need to paraphrase all words.
Wang, Nakov & Ng (EMNLP 2012)
Human Judgments
Question: is the adapted sentence better Indonesian than the original Malay sentence? (100 random sentences)
Finding: morphology yields worse top-3 adaptations, but better phrase tables, due to coverage.
Wang, Nakov & Ng (EMNLP 2012)
Reverse Adaptation
- Idea: adapt the dev/test Indonesian input to "Malay", then translate with a Malay-English system.
- Input to SMT: a "Malay" lattice, or the 1-best "Malay" sentence from the lattice.
- Result: adapting dev/test is worse than adapting the training bi-text; so, we need both the n-best list and the LM.
Wang, Nakov & Ng (EMNLP 2012)
A Specialized Decoder (Instead of Moses)
Beam-Search Text Rewriting Decoder: The Algorithm
A Beam-Search Decoder for Normalization of Social Media Text with Application to Machine Translation. Pidong Wang, Hwee Tou Ng. (NAACL 2013)
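In the spirit of that algorithm (the actual NAACL 2013 decoder has more hypothesis producers and features), a minimal beam-search rewriter might look like this; the word mappings and the unigram "LM" are toy stand-ins:

```python
# Beam-search text rewriting: process the source left to right; at each
# position, producers propose rewrites (here, a word-mapping producer plus
# a copy fallback); hypotheses are scored (here, by a stub unigram "LM")
# and pruned to the best `beam_size`.

def beam_rewrite(words, word_map, score, beam_size=2):
    beams = [([], 0.0)]  # (partial output, score)
    for w in words:
        candidates = []
        for prefix, s in beams:
            for cand in word_map.get(w, [w]):  # producer: mapping or copy
                candidates.append((prefix + [cand], s + score(cand)))
        candidates.sort(key=lambda h: h[1], reverse=True)
        beams = candidates[:beam_size]  # prune to the beam
    return " ".join(beams[0][0])

# Toy Malay->Indonesian mappings from the slide example:
word_map = {"cecah": ["mencapai"], "peratus": ["persen"]}
lm = {"mencapai": 0.9, "persen": 0.8}
out = beam_rewrite("KDNK Malaysia cecah 8 peratus".split(), word_map,
                   lambda w: lm.get(w, 0.1))
# -> "KDNK Malaysia mencapai 8 persen"
```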
Beam-Search Text Rewriting Decoder:
An Example (Twitter Normalization)
Wang, Nakov & Ng (NAACL 2013)
Hypothesis Producers
- Mappings that propose rewrites:
  - word-level mapping
  - phrase-level mapping
  - cross-lingual morphology mapping
- Scoring features:
  - Indonesian LM
  - word penalty (target)
  - Malay word penalty (source)
  - phrase count
Wang, Nakov & Ng (NAACL 2013)
Moses vs. the Specialized Decoder
- Decoding level: phrase-level (Moses) vs. sentence-level (specialized decoder).
- Features: the standard Moses features vs. richer ones, e.g., a Malay word penalty, word-level + phrase-level mappings (and, potentially, manual rules).
- Cross-lingual variants: handled via an input lattice (Moses) vs. a feature function (specialized decoder).
Wang, Nakov & Ng (NAACL 2013)
Moses vs. a Specialized Decoder: Isolated "IN"-EN Experiments
BLEU scores (Moses vs. specialized decoder):
- word paraphrases: 19.50 vs. 20.39
- word paraphrases + morphology: 20.06 vs. 20.46
- phrase paraphrases: 20.63 vs. 20.85
- phrase paraphrases + morphology: 20.89 vs. 21.07
- system combination: 21.24 vs. 21.76
Moses vs. a Specialized Decoder: Combining IN-EN and "IN"-EN
BLEU scores (ML2EN baseline vs. Moses vs. specialized decoder):
- simple concatenation: 18.49 vs. 21.55 vs. 21.74
- balanced concatenation: 19.79 vs. 21.64 vs. 21.81
- phrase table combination: 20.10 vs. 21.62 vs. 22.03
Experiments: Improvements
BLEU progression:
- ML2EN (baseline): 14.50
- IN2EN (baseline): 18.67
- phrase table combination (Moses): 20.10
- best isolated system (Moses): 21.24
- best combined system (Moses): 21.64
- best combination (DD): 22.03
MK→EN, Adapting BG-EN to "MK"-EN
BLEU scores:
- BG2EN: 27.33
- word paraphrases + morphology (Moses): 27.97
- phrase paraphrases + morphology (Moses): 28.38
- combination (Moses): 29.05
- combination (DD): 29.35
Transliteration
Spanish vs. Portuguese
Spanish:
- Todos los seres humanos nacen libres e iguales en dignidad y derechos y, dotados como están de razón y conciencia, deben comportarse fraternalmente los unos con los otros.
Portuguese:
- Todos os seres humanos nascem livres e iguais em dignidade e em direitos. Dotados de razão e de consciência, devem agir uns para com os outros em espírito de fraternidade.
(from Article 1 of the Universal Declaration of Human Rights)
17% exact word overlap
67% approx. word overlap
The actual overlap is even higher.
Cognates
- Linguistics definition: words derived from a common root, e.g.,
  - Latin tu ('2nd person singular')
  - Old English thou
  - French tu
  - Spanish tú
  - German du
  - Greek sú
  Orthography/phonetics/semantics: ignored.
- Computational linguistics definition: words in different languages that are mutual translations and have a similar orthography, e.g.,
  - evolution vs. evolución vs. evolução vs. evoluzione
  Orthography & semantics: important. Origin: ignored.
Cognates can differ a lot:
- night vs. nacht vs. nuit vs. notte vs. noite
- star vs. estrella vs. stella vs. étoile
- arbeit vs. rabota vs. robota ('work')
- father vs. père
- head vs. chef
Spelling Differences Between Cognates
- Systematic spelling differences (Spanish – Portuguese):
  - different spelling: -ñ- ↔ -nh- (señor vs. senhor)
  - phonetic: -ción ↔ -ção (evolución vs. evolução); -é ↔ -ei (1st person singular past: visité vs. visitei); -ó ↔ -ou (3rd person singular past: visitó vs. visitou)
- Occasional differences (Spanish – Portuguese):
  - decir vs. dizer ('to say')
  - Mario vs. Mário
  - María vs. Maria
Many of these can be learned automatically.
Automatic Transliteration
Transliteration pipeline:
1. Extract likely cognates for Portuguese-Spanish.
2. Learn a character-level transliteration model.
3. Transliterate the Portuguese side of pt-en to look like Spanish.
Automatic Transliteration (2)
Extract pt-es cognates using English (en):
1. Induce pt-es word translation probabilities.
2. Filter out candidate pairs by translation probability (using threshold constants proposed in the literature).
3. Filter out candidate pairs by orthographic similarity, measured using the longest common subsequence.
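The orthographic filter can be sketched with the longest common subsequence ratio (LCSR): LCS length divided by the length of the longer word. The 0.58 threshold below is illustrative, not the constant from the slides.

```python
# Longest common subsequence ratio as a cognate filter.

def lcs_len(a, b):
    """Length of the longest common subsequence of strings a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ca == cb
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)]

def lcsr(a, b):
    return lcs_len(a, b) / max(len(a), len(b))

def is_cognate_candidate(a, b, threshold=0.58):  # illustrative threshold
    return lcsr(a, b) >= threshold

print(lcsr("diretor", "director"))  # pt/es pair with high overlap: 7/8
```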
SMT-based Transliteration
- Train and tune a monotone character-level SMT system on the extracted cognate pairs (words are represented as sequences of characters).
- Use it to transliterate the Portuguese side of pt-en.
ES→EN, Adapting PT-EN to "ES"-EN
BLEU scores (10K ES-EN, 1.23M PT-EN), original vs. transliterated Portuguese:
- PT-EN: 5.34 (original) vs. 13.79 (transliterated)
- ES-EN: 22.87
- phrase table combination: 24.23 (original) vs. 26.24 (transliterated)
Transliteration vs. Character-Level Translation
Macedonian vs. Bulgarian
MK→BG: Transliteration vs. Translation
BLEU scores:
- MK (original): 10.74
- MK (simple transliteration): 12.07
- MK (cognate transliteration): 22.74
- MK-BG (words): 31.10
- MK-BG (words + cognate transliteration): 32.19
- MK-BG (characters): 32.71
- MK-BG (words + cognate transliteration + characters): 33.94
Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages. Preslav Nakov, Jorg Tiedemann. (ACL 2012)
Character-Level SMT
• MK: Никогаш не сум преспала цела сезона.
• BG: Никога не съм спала цял сезон.
• MK: Н и к о г а ш _ н е _ с у м _ п р е с п а л а _ ц е л а _ с е з о н а _ .
• BG: Н и к о г а _ н е _ с ъ м _ с п а л а _ ц я л _ с е з о н _ .
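The character-level representation above can be produced mechanically; a sketch, assuming the sentence is already tokenized (spaces between tokens become "_", and the characters themselves are space-separated):

```python
# Convert a tokenized sentence into the character-level representation
# used for training a character-level SMT system.

def to_char_level(tokens):
    return " _ ".join(" ".join(tok) for tok in tokens)

mk = ["Никогаш", "не", "сум", "преспала", "цела", "сезона", "."]
print(to_char_level(mk))
# -> "Н и к о г а ш _ н е _ с у м _ п р е с п а л а _ ц е л а _ с е з о н а _ ."
```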
Character-Level Phrase Pairs
Character-level phrase pairs can cover:
- word prefixes/suffixes
- entire words
- word sequences
- combinations thereof
Settings: max-phrase-length = 10; LM order = 10.
MK→BG: The Impact of Data Size
Slavic Languages in Europe
MK → XX (BG, SR, SL, CZ)
MK → SR, SL, CZ
Pivoting
MK→EN: Pivoting over BG
- Macedonian: Никогаш не сум преспала цела сезона.
- Bulgarian: Никога не съм спала цял сезон.
- English: I've never slept for an entire season.
For related languages, pivoting can exploit subword transformations and character-level translation.
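Cascaded pivoting reduces to composing two translation systems; in the sketch below, stub dictionaries stand in for full MK→BG and BG→EN SMT models:

```python
# Cascaded (pivot) translation MK -> BG -> EN: translate into the related
# resource-rich language first, then into English.

mk_to_bg = {"Никогаш не сум преспала цела сезона.":
            "Никога не съм спала цял сезон."}
bg_to_en = {"Никога не съм спала цял сезон.":
            "I've never slept for an entire season."}

def cascade(sentence, first, second):
    return second[first[sentence]]

out = cascade("Никогаш не сум преспала цела сезона.", mk_to_bg, bg_to_en)
# -> "I've never slept for an entire season."
```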
MK→EN: Pivoting over BG
Analyzing the Use of Character-Level Translation with Sparse and Noisy Datasets. Jorg Tiedemann, Preslav Nakov. (RANLP 2013)
MK→EN: Using Synthetic "MK"-EN Bi-texts
- Translate Bulgarian to Macedonian in a BG-XX corpus.
- All synthetic data combined (+mk-en): 36.69 BLEU.
Tiedemann & Nakov (RANLP 2013)
Conclusion
 Adapt bi-texts for related resource-rich languages, using
 confusion networks
 word-level & phrase-level paraphrasing
 cross-lingual morphological analysis
 Character-level models
 translation
 transliteration
 pivoting vs. synthetic data
 Future work
 other languages & NLP problems
 robustness: noise and domain shift
Thank you!
Conclusion & Future Work
Related Work
Related Work (1)
 Machine translation between related languages
 E.g.
 Cantonese–Mandarin (Zhang, 1998)
 Czech–Slovak (Hajic & al., 2000)
 Turkish–Crimean Tatar (Altintas & Cicekli, 2002)
 Irish–Scottish Gaelic (Scannell, 2006)
 Bulgarian–Macedonian (Nakov & Tiedemann, 2012)
 We do not translate (no training data), we “adapt”.
Related Work (2)
 Adapting dialects to standard language (e.g., Arabic)
(Bakr & al., 2008; Sawaf, 2010; Salloum & Habash, 2011)
 manual rules and/or language-specific tools
 Normalizing Tweets and SMS
(Aw & al., 2006; Han & Baldwin, 2011)
 informal text: spelling, abbreviations, slang
 same language
Related Work (3)
 Adapting Brazilian to European Portuguese (Marujo & al., 2011)
 rule-based, language-dependent
 tiny improvements for SMT
 Reusing bi-texts between related languages (Nakov & Ng, 2009)
 no language adaptation (just transliteration)
 Cascaded/pivoted translation
(Utiyama & Isahara, 2007; Cohn & Lapata, 2007; Wu & Wang, 2009)
 poor  rich  X requires an additional poor-rich bi-text
 rich  X  poor does not use the similarity poor-rich
(diagram contrasting these setups with our approach over the languages poor, rich, and X)
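The cascaded/pivoted setup (poor -> rich -> X) is simply a composition of two translation systems. A minimal sketch, with hypothetical translation callables as arguments:

```python
def pivot_translate(sentence, poor_to_rich, rich_to_x):
    # Cascaded/pivoted translation: first translate the resource-poor
    # language into the related resource-rich one, then into the target X.
    return rich_to_x(poor_to_rich(sentence))
```

This makes explicit why the first step needs a poor-rich bi-text: `poor_to_rich` must itself be trained.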
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
 
VVVIP Call Girls In Connaught Place ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...
VVVIP Call Girls In Connaught Place ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...VVVIP Call Girls In Connaught Place ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...
VVVIP Call Girls In Connaught Place ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...
 
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine ServiceHot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
 
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
 

Dr. Preslav Nakov — Combining, Adapting and Reusing Bi-texts between Related Languages — Application to Statistical Machine Translation — part 2

  • 1. Combining, Adapting and Reusing Bi-texts between Related Languages: Application to Statistical Machine Translation Preslav Nakov, Qatar Computing Research Institute (collaborators: Jorg Tiedemann, Pidong Wang, Hwee Tou Ng) Yandex seminar August 13, 2014, Moscow, Russia
  • 2. Plan
     - Part I: Introduction to Statistical Machine Translation
     - Part II: Combining, Adapting and Reusing Bi-texts between Related Languages: Application to Statistical Machine Translation
     - Part III: Further Discussion on SMT
  • 4. Overview
     - Statistical Machine Translation (SMT) systems need large sentence-aligned bilingual corpora (bi-texts).
     - Problem: such training bi-texts do not exist for most languages.
     - Idea: adapt a bi-text from a related resource-rich language.
  • 5. Building an SMT System for a New Language Pair
     - In theory: requires only a few hours/days.
     - In practice: large bi-texts are needed, and they are only available for the official languages of the UN (Arabic, Chinese, English, French, Russian, Spanish), the official languages of the EU, and some other languages.
     - However, most of the 6,500+ world languages remain resource-poor from an SMT viewpoint. This number is even more striking if we consider language pairs. Even resource-rich language pairs become resource-poor in new domains.
  • 6. Most Language Pairs Have Few Resources: the distribution of language resources is Zipfian (chart).
  • 7. Building a Bi-text for SMT
     - Small bi-texts: relatively easy to build.
     - Large bi-texts: hard to get, e.g., because of copyright. Sources: parliament debates and legislation, both national (Canada, Hong Kong) and international (United Nations; European Union: Europarl, Acquis).
     - Becoming an official language of the EU is an easy recipe for getting rich in bi-texts quickly. Not all languages are so "lucky", but many can still benefit.
  • 8. How Google/Bing (Yandex?) Translate Resource-Poor Languages
     - How do we translate from Russian to Malay? Use triangulation.
     - Cascaded translation (Utiyama & Isahara, 2007; Koehn et al., 2009): Russian→English→Malay.
     - Phrase table pivoting (Cohn & Lapata, 2007; Wu & Wang, 2007):
         рамочное соглашение ||| framework agreement ||| 0.7 …
         perjanjian kerangka kerja ||| framework agreement ||| 0.8 …
       THUS: рамочное соглашение ||| perjanjian kerangka kerja ||| 0.56 …
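The pivoting step on this slide can be sketched in a few lines. This is a toy illustration with hypothetical two-entry tables (real phrase tables carry several scores per entry and need pruning); probabilities of the two component tables are multiplied and summed over the shared English side, matching the slide's 0.7 × 0.8 = 0.56 example.

```python
# Minimal sketch of phrase-table pivoting over English.
def pivot(src2en, tgt2en):
    """src2en: dict (src_phrase, en_phrase) -> score from the source-English table.
    tgt2en: dict (tgt_phrase, en_phrase) -> score from the target-English table.
    Returns dict (src_phrase, tgt_phrase) -> induced score."""
    # index target phrases by their English side
    en2tgt = {}
    for (tgt, en), p in tgt2en.items():
        en2tgt.setdefault(en, []).append((tgt, p))
    induced = {}
    for (src, en), p_src in src2en.items():
        for tgt, p_tgt in en2tgt.get(en, []):
            key = (src, tgt)
            # sum over all shared English pivot phrases
            induced[key] = induced.get(key, 0.0) + p_src * p_tgt
    return induced

ru2en = {("рамочное соглашение", "framework agreement"): 0.7}
ms2en = {("perjanjian kerangka kerja", "framework agreement"): 0.8}
table = pivot(ru2en, ms2en)
print(table[("рамочное соглашение", "perjanjian kerangka kerja")])  # 0.56
```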
  • 9. What if We Want to Translate into English?
     - Idea: reuse bi-texts from related resource-rich languages to build an improved SMT system for a related resource-poor language.
     - NOTE 1: this is NOT triangulation. We focus on translation into English, e.g., Indonesian-English using Malay-English, rather than Indonesian→English→Malay or Indonesian→Malay→English.
     - NOTE 2: we exploit the fact that the source languages are related.
  • 10. Resource-poor vs. Resource-rich
  • 11. Resource-rich vs. Resource-poor Languages (we will explore these pairs)
     - Related EU vs. non-EU/unofficial languages: Swedish-Norwegian; Bulgarian-Macedonian; Irish-Scottish Gaelic; Standard German-Swiss German.
     - Related EU languages: Spanish-Catalan; Czech-Slovak.
     - Related languages outside Europe: Russian-Ukrainian; MSA vs. Dialectal Arabic (e.g., Egyptian, Gulf, Levantine, Iraqi); Hindi-Urdu; Turkish-Azerbaijani; Malay-Indonesian.
  • 12. Motivation: related languages have overlapping vocabulary (cognates) and similar word order and syntax.
  • 13. Source Language Adaptation for Resource-Poor Machine Translation (Wang, Nakov & Ng): Improving Indonesian-English SMT Using Malay-English
  • 14. Malay vs. Indonesian (from Article 1 of the Universal Declaration of Human Rights; ~50% exact word overlap)
     - Malay: Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak. Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan.
     - Indonesian: Semua orang dilahirkan merdeka dan mempunyai martabat dan hak-hak yang sama. Mereka dikaruniai akal dan hati nurani dan hendaknya bergaul satu sama lain dalam semangat persaudaraan.
  • 15. Malay Can Look "More Indonesian"… (same example; ~75% exact word overlap after post-editing the Malay to look "Indonesian", done by an Indonesian speaker)
     - Malay: Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak. Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan.
     - Post-edited "Indonesian": Semua manusia dilahirkan bebas dan mempunyai martabat dan hak-hak yang sama. Mereka mempunyai pemikiran dan perasaan dan hendaklah bergaul satu sama lain dalam semangat persaudaraan.
     - We attempt to do this automatically: adapt Malay to look Indonesian, then use it to improve SMT.
  • 16. Method at a Glance (note that we have no Malay-Indonesian bi-text!)
     - Step 1 (Adaptation): adapt the Malay side of the Malay-English bi-text into "Indonesian", yielding an "Indonesian"-English bi-text.
     - Step 2 (Combination): combine the small Indonesian-English bi-text with the adapted "Indonesian"-English bi-text.
  • 17. Step 1: Adapting Malay-English to "Indonesian"-English
  • 18. Word-Level Bi-text Adaptation: Overview. Given a Malay-English sentence pair:
     1. Adapt the Malay sentence to "Indonesian", using word-level paraphrases, phrase-level paraphrases, and cross-lingual morphology.
     2. Pair the adapted "Indonesian" with the English side of the Malay-English sentence pair.
     Thus, we generate a new "Indonesian"-English sentence pair.
     [Source Language Adaptation for Resource-Poor Machine Translation. Pidong Wang, Preslav Nakov, Hwee Tou Ng. EMNLP 2012]
  • 19. Word-Level Bi-text Adaptation: Motivation. In many cases, word-level substitutions are enough to adapt Malay to Indonesian (training side):
     - Malay: KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010.
     - Indonesian: PDB Malaysia akan mencapai 8 persen pada tahun 2010.
     - English: Malaysia's GDP is expected to reach 8 per cent in 2010.
  • 20. Word-Level Bi-text Adaptation: Overview. Malay: KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010. Candidate substitutions are decoded using a large Indonesian LM; the substitution probabilities come from pivoting over English.
  • 21. Word-Level Bi-text Adaptation: Overview. Each adapted sentence is paired with its English counterpart, e.g., "Malaysia's GDP is expected to reach 8 per cent in 2010." Thus, we generate a new "Indonesian"-English bi-text.
  • 22. Word-Level Adaptation: Extracting Paraphrases. Indonesian translations for Malay words are induced by pivoting over English, and weights are computed accordingly (diagram: a Malay sentence aligned to its English translation in the ML-EN bi-text, and an Indonesian sentence aligned in the IN-EN bi-text to an English sentence sharing words with it). Note: we have no Malay-Indonesian bi-text, so we pivot.
  • 23. Word-Level Adaptation: Issue 1. The IN-EN bi-text is small, so the IN-EN word alignments are unreliable, which yields bad ML-IN paraphrases. Solution: improve the IN-EN alignments using the ML-EN bi-text: concatenate IN-EN*k + ML-EN, where k ≈ |ML-EN| / |IN-EN|, run word alignment, and keep the alignments for one copy of IN-EN only. This works because of the cognates between Malay and Indonesian.
  • 24. Word-Level Adaptation: Issue 2. The IN-EN bi-text is small, so the IN vocabulary available for the ML-IN paraphrases is small. Solution: add cross-lingual morphological variants. Given a ML word, e.g., seperminuman, find its ML lemma (minum) and propose all known IN words sharing the same lemma: diminum, diminumkan, diminumnya, makan-minum, makanan-minuman, meminum, meminumkan, meminumnya, meminum-minuman, minum, minum-minum, minum-minuman, minuman, minumanku, minumannya, peminum, peminumnya, perminum, terminum. Note: the IN variants come from a larger monolingual IN text.
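The lemma-sharing lookup above can be sketched as follows. The lemmatizer and the tiny lemma index here are hypothetical stand-ins (the real system uses a Malay lemmatizer and a lemma index built from a large monolingual Indonesian corpus):

```python
# Sketch of the cross-lingual morphology step: ML word -> ML lemma -> IN variants.
def variants(ml_word, ml_lemmatizer, in_lemma_index):
    lemma = ml_lemmatizer(ml_word)
    return sorted(in_lemma_index.get(lemma, set()))

def toy_ml_lemmatizer(word):
    # stand-in lemmatizer: recognizes forms built on the root "minum"
    return "minum" if "minum" in word else word

# toy Indonesian vocabulary grouped by lemma (a small subset of the slide's list)
in_lemma_index = {
    "minum": {"minum", "minuman", "diminum", "peminum", "meminum"},
}

print(variants("seperminuman", toy_ml_lemmatizer, in_lemma_index))
# ['diminum', 'meminum', 'minum', 'minuman', 'peminum']
```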
  • 25. Word-Level Adaptation: Issue 3. Word-level pivoting ignores context (relying only on the LM) and cannot drop/insert/merge/split/reorder words. Solution: phrase-level pivoting. Build ML-EN and EN-IN phrase tables, induce a ML-IN phrase table by pivoting over EN, and adapt the ML side of ML-EN to get an "IN"-EN bi-text, using the Indonesian LM and the n-best "IN" as before; also use cross-lingual morphological variants. This models context better (not only the Indonesian LM, but also phrases) and allows many word operations, e.g., insertion and deletion.
  • 26. Step 2: Combining IN-EN + "IN"-EN
  • 27. Combining the IN-EN and "IN"-EN Bi-texts: simple concatenation (IN-EN + "IN"-EN); balanced concatenation (IN-EN * k + "IN"-EN); sophisticated phrase table combination, with improved word alignments for IN-EN and phrase table combination with extra features.
     [Improved Statistical Machine Translation for Resource-Poor Languages Using Related Resource-Rich Languages. Preslav Nakov, Hwee Tou Ng. EMNLP 2009]
     [Improving Statistical Machine Translation for a Resource-Poor Language Using Related Resource-Rich Languages. Preslav Nakov, Hwee Tou Ng. JAIR, 2012]
  • 28. Bi-text Combination Strategies: concatenating bi-texts; merging phrase tables; combined method.
  • 29. Bi-text Combination Strategies: concatenating bi-texts.
  • 30. Concatenating Bi-texts (1). Summary: concatenate X1-Y and X2-Y, where X1 is the resource-poor source language and X2 is the related resource-rich one.
     - Advantages: improved word alignments (e.g., for rare words); more translation options, hence fewer unknown words; useful non-compositional phrases (improved fluency); phrases with words from X2 that do not exist in X1 are simply ignored.
     - Disadvantages: X2-Y will dominate, since it is larger; the translation probabilities get distorted; phrases from X1-Y and X2-Y cannot be distinguished.
  • 31. Concatenating Bi-texts (2).
     - Concat×k: concatenate k copies of the original and one copy of the additional training bi-text.
     - Concat×k:align: (1) concatenate k copies of the original and one copy of the additional bi-text; (2) generate word alignments; (3) truncate them, keeping only the alignments for one copy of the original bi-text; (4) build a phrase table; (5) tune the system using MERT. The value of k is optimized on the development dataset.
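Steps (1)-(3) of concat×k:align can be sketched as below. The word aligner is mocked here (the real pipeline calls an IBM-model aligner such as GIZA++); the key point is truncating the alignment output to one copy of the original bi-text:

```python
# Sketch of concat×k:align: repeat the original bi-text k times, append the
# additional bi-text, align everything, then keep alignments for one copy
# of the original only.
def concat_k_align(orig_bitext, extra_bitext, k, aligner):
    corpus = orig_bitext * k + extra_bitext
    alignments = aligner(corpus)           # one alignment per sentence pair
    return alignments[:len(orig_bitext)]   # truncate to one original copy

def mock_aligner(corpus):
    # stand-in for GIZA++: pretend every pair gets a trivial 0-0 alignment
    return [[(0, 0)] for _ in corpus]

orig = [("a b", "x y"), ("c", "z")]   # toy small bi-text
extra = [("a", "x")] * 5              # toy large related bi-text
aligned = concat_k_align(orig, extra, k=2, aligner=mock_aligner)
print(len(aligned))  # 2: one alignment per original sentence pair
```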
  • 32. Bi-text Combination Strategies: merging phrase tables.
  • 33. Merging Phrase Tables (1). Summary: build two separate phrase tables, then (a) use them together, (b) merge them, or (c) interpolate them.
     - Advantages: phrases from X1-Y and X2-Y can be distinguished; the larger bi-text X2-Y does not dominate X1-Y; more translation options; probabilities are combined in a more principled manner.
     - Disadvantage: improved word alignments are not possible.
  • 34. Merging Phrase Tables (2). Two-tables: build two separate phrase tables and use them as alternative decoding paths (Birch et al., 2007).
  • 35. Merging Phrase Tables (3). Interpolation: build two separate phrase tables, T_orig and T_extra, and combine them using linear interpolation: Pr(e|s) = α·Pr_orig(e|s) + (1 − α)·Pr_extra(e|s). The value of α is optimized on a development dataset.
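The interpolation formula can be sketched directly. One detail the slide leaves open is what to do with entries present in only one table; a common convention, assumed here, is to treat the missing probability as zero:

```python
# Sketch of linear phrase-table interpolation:
# Pr(e|s) = alpha * Pr_orig(e|s) + (1 - alpha) * Pr_extra(e|s)
def interpolate(t_orig, t_extra, alpha):
    merged = {}
    for key in set(t_orig) | set(t_extra):
        p1 = t_orig.get(key, 0.0)   # 0.0 if the entry is absent (assumption)
        p2 = t_extra.get(key, 0.0)
        merged[key] = alpha * p1 + (1 - alpha) * p2
    return merged

t_orig = {("rumah", "house"): 0.6}                            # toy table
t_extra = {("rumah", "house"): 0.4, ("rumah", "home"): 0.5}   # toy table
m = interpolate(t_orig, t_extra, alpha=0.5)
print(m[("rumah", "house")])  # 0.5
print(m[("rumah", "home")])   # 0.25
```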
  • 36. Merging Phrase Tables (4). Merge: (1) build separate phrase tables T_orig and T_extra; (2) keep all entries from T_orig; (3) add those entries from T_extra that are not in T_orig; (4) add extra features: F1 = 1 if the entry came from T_orig, 0 otherwise; F2 = 1 if the entry came from T_extra, 0 otherwise; F3 = 1 if the entry was in both tables, 0 otherwise. The feature weights are set using MERT, and the number of features is optimized on the development set.
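The four Merge steps above can be sketched with toy one-score tables (real entries carry several scores; the indicator features let MERT learn how much to trust each source table):

```python
# Sketch of Merge: keep all of T_orig, add T_extra entries not already
# present, and attach the three binary indicator features per entry.
def merge(t_orig, t_extra):
    merged = {}
    for key, p in t_orig.items():
        in_both = 1 if key in t_extra else 0
        merged[key] = (p, 1, 0, in_both)   # (score, F1, F2, F3)
    for key, p in t_extra.items():
        if key not in merged:
            merged[key] = (p, 0, 1, 0)
    return merged

t_orig = {("air", "water"): 0.7}
t_extra = {("air", "water"): 0.6, ("api", "fire"): 0.8}
m = merge(t_orig, t_extra)
print(m[("air", "water")])  # (0.7, 1, 0, 1): from T_orig, seen in both
print(m[("api", "fire")])   # (0.8, 0, 1, 0): from T_extra only
```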
  • 37. Bi-text Combination Strategies: combined method.
  • 38. Combined Method: use Merge to combine the phrase tables for concat×k:align (as T_orig) and for concat×1 (as T_extra). Two parameters to tune: the number of repetitions k, and the number of extra features to use with Merge: (a) F1 only; (b) F1 and F2; (c) F1, F2 and F3. This gives improved word alignments, improved lexical coverage, and the ability to distinguish phrases by source table.
  • 39. Experiments & Evaluation
  • 40. Data
     - Translation data (for IN-EN): IN2EN-train: 0.9M; IN2EN-dev: 37K; IN2EN-test: 37K; EN monolingual: 5M.
     - Adaptation data (for ML-EN → "IN"-EN): ML2EN: 8.6M; IN monolingual: 20M (tokens).
  • 41. Isolated Experiments: Training on "IN"-EN only (chart). BLEU scores: 14.50, 18.67, 19.50, 20.06, 20.63, 20.89, and 21.24, the last being system combination using MEMT (Heafield and Lavie, 2010). [Wang, Nakov & Ng, EMNLP 2012]
  • 42. Combined Experiments: Training on IN-EN + "IN"-EN (chart). Systems: ML2EN (baseline), simple concatenation, balanced concatenation, phrase table combination, and system combination; BLEU scores: 18.49, 19.79, 20.10, 21.55, 21.64, 21.62. [Wang, Nakov & Ng, EMNLP 2012]
  • 43. Experiments: Improvements (chart). BLEU: 14.50, 18.67, 20.10, 21.24, 21.64. [Wang, Nakov & Ng, EMNLP 2012]
  • 44. Application to Other Languages & Domains: improve Macedonian-English SMT by adapting a Bulgarian-English bi-text. Adapt BG-EN (11.5M words) to "MK"-EN (1.2M words), using OPUS movie subtitles. BLEU: BG2EN (A): 27.33; WordParaph+morph (B): 27.97; PhraseParaph+morph (C): 28.38; system combination of A+B+C: 29.05.
  • 45. Analysis
  • 46. Paraphrasing Non-Indonesian Malay Words Only (chart). Conclusion: we do need to paraphrase all words. [Wang, Nakov & Ng, EMNLP 2012]
  • 47. Human Judgments: is the adapted sentence better Indonesian than the original Malay sentence? (100 random sentences.) Morphology yields worse top-3 adaptations but better phrase tables, due to coverage. [Wang, Nakov & Ng, EMNLP 2012]
  • 48. Reverse Adaptation. Idea: adapt the dev/test Indonesian input to "Malay", then translate with a Malay-English system. Input to SMT: a "Malay" lattice, or the 1-best "Malay" sentence from the lattice. Adapting dev/test is worse than adapting the training bi-text: so we need both the n-best lists and the LM. [Wang, Nakov & Ng, EMNLP 2012]
  • 49. A Specialized Decoder (Instead of Moses)
  • 50. Beam-Search Text Rewriting Decoder: The Algorithm. [A Beam-Search Decoder for Normalization of Social Media Text with Application to Machine Translation. Pidong Wang, Hwee Tou Ng. NAACL 2013]
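A toy version of such a beam-search rewriter is sketched below, using the slide-19 Malay example. The unigram "LM" and the mapping table are hypothetical; the real decoder combines word-level and phrase-level mappings, cross-lingual morphology, an Indonesian LM, and penalty features with tuned weights rather than a single unigram score:

```python
# Toy left-to-right beam-search rewriting: each input word may be kept or
# replaced by a mapped candidate; hypotheses are scored by a unigram "LM".
import math

def beam_rewrite(words, mappings, lm, beam_size=2):
    beams = [([], 0.0)]  # list of (partial output, log-score)
    for w in words:
        # candidate rewrites plus the option of keeping the word, deduplicated
        options = list(dict.fromkeys(mappings.get(w, []) + [w]))
        expanded = []
        for out, score in beams:
            for cand in options:
                s = score + math.log(lm.get(cand, 1e-6))
                expanded.append((out + [cand], s))
        # prune to the top-scoring hypotheses
        beams = sorted(expanded, key=lambda h: h[1], reverse=True)[:beam_size]
    return " ".join(beams[0][0])

mappings = {"cecah": ["mencapai"], "peratus": ["persen"]}  # toy ML->IN table
lm = {"mencapai": 0.05, "persen": 0.05, "cecah": 0.001, "peratus": 0.001, "8": 0.01}
print(beam_rewrite(["cecah", "8", "peratus"], mappings, lm))
# mencapai 8 persen
```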
  • 51. Beam-Search Text Rewriting Decoder: An Example (Twitter Normalization). [Wang, Nakov & Ng, NAACL 2013]
  • 52. Hypothesis Producers: word-level mapping; phrase-level mapping; cross-lingual morphology mapping; Indonesian LM; word penalty (target); Malay word penalty (source); phrase count. [Wang, Nakov & Ng, NAACL 2013]
  • 53. Moses vs. the Specialized Decoder. Decoding level: phrase vs. sentence. Features: Moses's standard features vs. richer ones, e.g., a Malay word penalty and word-level + phrase-level mappings (potentially, manual rules). Cross-lingual variants: input lattice vs. feature function. [Wang, Nakov & Ng, NAACL 2013]
  • 54. Moses vs. the Specialized Decoder: Isolated "IN"-EN Experiments (BLEU).
     WordPar: Moses 19.50, specialized decoder 20.39
     WordPar+Morph: Moses 20.06, specialized decoder 20.46
     PhrasePar: Moses 20.63, specialized decoder 20.85
     PhrasePar+Morph: Moses 20.89, specialized decoder 21.07
     System combination: Moses 21.24, specialized decoder 21.76
  • 55. Moses vs. the Specialized Decoder: Combining IN-EN and "IN"-EN (chart). Systems: ML2EN (baseline), then simple concatenation, balanced concatenation, and phrase table combination under Moses and under the specialized decoder; BLEU scores: 18.49, 19.79, 20.10, 21.55, 21.64, 21.62, 21.74, 21.81, 22.03.
  • 56. Experiments: Improvements (BLEU). ML2EN (baseline): 14.50; IN2EN (baseline): 18.67; phrase table combination (Moses): 20.10; best isolated system (Moses): 21.24; best combined system (Moses): 21.64; best combination (DD): 22.03.
  • 57. MK→EN, Adapting BG-EN to "MK"-EN (BLEU). BG2EN: 27.33; wordPar+morph (Moses): 27.97; PhrasePar+morph (Moses): 28.38; combination (Moses): 29.05; combination (DD): 29.35.
  • 58. Transliteration
  • 59. Spanish vs. Portuguese (from Article 1 of the Universal Declaration of Human Rights; 17% exact word overlap)
     - Spanish: Todos los seres humanos nacen libres e iguales en dignidad y derechos y, dotados como están de razón y conciencia, deben comportarse fraternalmente los unos con los otros.
     - Portuguese: Todos os seres humanos nascem livres e iguais em dignidade e em direitos. Dotados de razão e de consciência, devem agir uns para com os outros em espírito de fraternidade.
  • 60. Spanish vs. Portuguese (same example): 17% exact word overlap, but 67% approximate word overlap; the actual overlap is even higher.
  • 61. Cognates
     - Linguistics definition: words derived from a common root, e.g., Latin tu ('2nd person singular'), Old English thou, French tu, Spanish tú, German du, Greek sú. Orthography/phonetics/semantics: ignored.
     - Computational linguistics definition: words in different languages that are mutual translations and have a similar orthography, e.g., evolution vs. evolución vs. evolução vs. evoluzione. Orthography & semantics: important; origin: ignored.
     - Cognates can differ a lot: night vs. nacht vs. nuit vs. notte vs. noite; star vs. estrella vs. stella vs. étoile; arbeit vs. rabota vs. robota ('work'); father vs. père; head vs. chef.
  • 62. Spelling Differences Between Cognates (Spanish vs. Portuguese). Many of these can be learned automatically.
     - Systematic spelling differences: -ñ- vs. -nh- (señor vs. senhor); phonetic differences: -ción vs. -ção (evolución vs. evolução), -é vs. -ei (1st person singular past: visité vs. visitei), -ó vs. -ou (3rd person singular past: visitó vs. visitou).
     - Occasional differences: decir vs. dizer ('to say'); Mario vs. Mário; María vs. Maria.
  • 63. Automatic Transliteration: (1) extract likely cognates for Portuguese-Spanish; (2) learn a character-level transliteration model; (3) transliterate the Portuguese side of pt-en to look like Spanish.
  • 64. Automatic Transliteration (2). Extract pt-es cognates using English (en): (1) induce pt-es word translation probabilities; (2) filter out pairs by translation probability (threshold formula not shown on the scraped slide); (3) filter out pairs by orthographic similarity, based on the longest common subsequence (threshold formula not shown; the constants are those proposed in the literature).
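For the orthographic filter, the longest common subsequence ratio (LCSR) is a standard similarity measure for cognate extraction; a minimal sketch is below. The exact thresholds used on the slide are not reproduced here, since the formulas were images in the original deck:

```python
# LCSR(a, b) = |LCS(a, b)| / max(|a|, |b|): orthographic similarity in [0, 1].
def lcs_len(a, b):
    # classic dynamic program for the longest common subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            if ca == cb:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def lcsr(a, b):
    return lcs_len(a, b) / max(len(a), len(b))

print(round(lcsr("evolution", "evolución"), 2))  # 0.78: likely cognates
```

Candidate pairs whose LCSR falls below a threshold would then be discarded as non-cognates.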
  • 65. SMT-based Transliteration: train and tune a monotone character-level SMT system (the character-level representation is shown on the slide), and use it to transliterate the Portuguese side of pt-en.
  • 66. ES→EN, Adapting PT-EN to "ES"-EN (chart; 10K ES-EN, 1.23M PT-EN). Systems: PT-EN, ES-EN, and phrase table combination, with original vs. transliterated Portuguese; BLEU scores: 5.34, 22.87, 24.23 (original) and 13.79, 26.24 (transliterated).
  • 67. Transliteration vs. Character-Level Translation
  • 68. Macedonian vs. Bulgarian
  • 69. MK→BG: Transliteration vs. Translation
  BLEU: MK (original) 10.74; MK (simple translit.) 12.07; MK (cognate translit.) 22.74; MK-BG (words) 31.10; MK-BG (words + cogn. translit.) 32.19; MK-BG (chars) 32.71; MK-BG (words + cogn. translit. + chars) 33.94
  Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages (ACL 2012). Preslav Nakov, Jorg Tiedemann.
  • 70. Character-Level SMT
  • MK: Никогаш не сум преспала цела сезона.
  • BG: Никога не съм спала цял сезон.
  • MK: Н и к о г а ш _ н е _ с у м _ п р е с п а л а _ ц е л а _ с е з о н а _ .
  • BG: Н и к о г а _ н е _ с ъ м _ с п а л а _ ц я л _ с е з о н _ .
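The character-level representation shown above (one token per character, with `_` marking the original word boundaries) lets a standard word-level SMT pipeline be reused unchanged at the character level. A minimal sketch of the conversion; the helper names are hypothetical:

```python
def to_char_level(sentence: str) -> str:
    """Space-separate every character, writing '_' for original spaces,
    so each character becomes a 'word' for the SMT pipeline."""
    return " ".join("_" if ch == " " else ch for ch in sentence)

def from_char_level(tokens: str) -> str:
    """Invert to_char_level: rejoin characters, restoring spaces."""
    return "".join(" " if t == "_" else t for t in tokens.split(" "))
```

The decoder output is then mapped back to ordinary text with `from_char_level`.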
  • 71. Character-Level Phrase Pairs
  Can cover: word prefixes/suffixes, entire words, word sequences, and combinations thereof.
  Settings: max-phrase-length = 10, LM-order = 10.
  • 72. MK→BG: The Impact of Data Size
  • 73. Slavic Languages in Europe
  • 74. [Diagram: BG, MK, SR, CZ, SL; MK → XX]
  • 75. MK → SR, SL, CZ
  • 77. MK->EN: Pivoting over BG
  Macedonian: Никогаш не сум преспала цела сезона.
  Bulgarian: Никога не съм спала цял сезон.
  English: I’ve never slept for an entire season.
  For related languages: • subword transformations • character-level translation
  • 78. MK->EN: Pivoting over BG
  Analyzing the Use of Character-Level Translation with Sparse and Noisy Datasets (RANLP 2013). Jorg Tiedemann, Preslav Nakov.
  • 79. MK->EN: Using Synthetic “MK”-EN Bi-Texts
  Translate Bulgarian to Macedonian in a BG-XX corpus.
  All synthetic data combined (+mk-en): 36.69 BLEU (Tiedemann & Nakov, RANLP 2013).
  • 81. Conclusion & Future Work
  - Adapt bi-texts for related resource-rich languages, using confusion networks, word-level & phrase-level paraphrasing, and cross-lingual morphological analysis
  - Character-level models: translation, transliteration, pivoting vs. synthetic data
  - Future work: other languages & NLP problems; robustness to noise and domain shift
  Thank you!
  • 83. Related Work
  • 84. Related Work (1)
  - Machine translation between related languages, e.g.:
    - Cantonese–Mandarin (Zhang, 1998)
    - Czech–Slovak (Hajič et al., 2000)
    - Turkish–Crimean Tatar (Altintas & Cicekli, 2002)
    - Irish–Scottish Gaelic (Scannell, 2006)
    - Bulgarian–Macedonian (Nakov & Tiedemann, 2012)
  - We do not translate (no training data); we “adapt”.
  • 85. Related Work (2)
  - Adapting dialects to a standard language, e.g., Arabic (Bakr et al., 2008; Sawaf, 2010; Salloum & Habash, 2011): manual rules and/or language-specific tools
  - Normalizing tweets and SMS (Aw et al., 2006; Han & Baldwin, 2011): informal text (spelling, abbreviations, slang); same language
  • 86. Related Work (3)
  - Adapting Brazilian to European Portuguese (Marujo et al., 2011): rule-based, language-dependent; tiny improvements for SMT
  - Reusing bi-texts between related languages (Nakov & Ng, 2009): no language adaptation (just transliteration)
  - Cascaded/pivoted translation (Utiyama & Isahara, 2007; Cohn & Lapata, 2007; Wu & Wang, 2009): poor → rich → X requires an additional poor–rich bi-text; rich → X → poor does not use the poor–rich similarity; ours: adapt poor to rich, then translate rich → X

Editor's Notes

  1. Statistical machine translation (SMT) systems learn how to translate from large sentence-aligned bilingual corpora of human-generated translations; such corpora are often called bi-texts. A well-known problem with current SMT systems is that collecting sufficiently large training bi-texts is very hard, so most languages in the world are still resource-poor for SMT. To solve this problem, we want to adapt a bi-text of a resource-rich language to improve machine translation for a related resource-poor language.
  2. Let’s start with an introduction first.
  4. We want to reuse bi-texts from related resource-rich languages to improve resource-poor SMT. Why does this work? Because many resource-poor languages are related to some resource-rich languages. And related languages often share overlapping vocabulary and cognates. They often have similar word order and syntax.
  7. There are many resource-rich and resource-poor languages which are closely related. [CLICK] In our work, we focus on the pair, Malay and Indonesian. We also show the applicability of our method to another language pair: Bulgarian and Macedonian.
  9. Here is our main focus: improve Indonesian-English SMT using additional Malay-English bi-text.
  10. Malay and Indonesian are closely related languages. A native speaker of Indonesian can understand Malay texts, and vice versa. Here are two example sentence pairs which show the similarity: about 50 percent of the words overlap. So, we can train an SMT system on one language and apply it to the other directly: there are matching words and short phrases.
  11. We asked a native Indonesian speaker to adapt the same Malay sentences into Indonesian while preserving as many Malay words as possible. As a result, the overlap reached 75 percent. [CLICK] Our goal is to do this automatically: adapt Malay to look like Indonesian. Then, we can use this adapted bi-text to improve Indonesian-English SMT.
  12. Suppose we have a small Indonesian to English bi-text, which is resource-poor. And we also have another large bi-text for Malay-English, which is resource-rich. Our method has two steps. [CLICK] The first step is bi-text adaptation. We adapt the Malay side of the Malay-English bi-text to look like Indonesian. [CLICK] The second step is bi-text combination. We try to combine the adapted bi-text with the original small Indonesian-English bi-text in order to improve Indonesian-English SMT. [CLICK] Note that we have no Malay-Indonesian bi-text.
  13. The first step of our method is bi-text adaptation: adapting a Malay-English bi-text to Indonesian-English.
  14. Given a Malay-English sentence pair We first adapt the Malay sentence to look like “Indonesian” using word-level and phrase-level paraphrases, and cross-lingual morphology. Then, we pair the adapted “Indonesian” sentence with the English sentence of the Malay-English sentence pair. [CLICK] Finally, we can generate a new Indonesian-English sentence pair.
  15. For example, given a Malay sentence, [CLICK] we generate a confusion network. In the confusion network, each Malay word is augmented with multiple Indonesian word-level paraphrases. [CLICK] Then we decode this confusion network using a large Indonesian language model. Thus, a ranked list of some adapted “Indonesian” sentences is obtained.
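The decoding step described in this note can be sketched for a toy confusion network by exhaustively scoring every path; a real decoder uses beam search and an n-gram language model, and the nested-dict representation and names below are illustrative only:

```python
import itertools
import math

def decode_confusion_network(slots, lm_logprob):
    """Exhaustively score every path through a tiny confusion network.

    slots: one dict per Malay input word, mapping each Indonesian
    candidate word to its paraphrase probability.
    lm_logprob: scores a full candidate sentence; stands in for the
    large Indonesian language model described in the notes.
    """
    best_words, best_score = None, -math.inf
    for path in itertools.product(*(slot.items() for slot in slots)):
        words = [w for w, _ in path]
        # Combine channel (paraphrase) log-probabilities with the LM score.
        score = sum(math.log(p) for _, p in path) + lm_logprob(words)
        if score > best_score:
            best_words, best_score = words, score
    return best_words
```

With a uniform LM the highest-probability paraphrase per slot wins; a real LM can override the channel model and pick a less likely paraphrase that fits the context better.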
  17. After that, we pair each adapted “Indonesian” sentence with the English counterpart of the Malay sentence in the Malay-English bi-text. [CLICK] We thus end up with a synthetic “Indonesian”–English bi-text.
  18. How do we find the Indonesian word-level paraphrases for a Malay word? We use pivoting over English to induce potential Indonesian paraphrases for a given Malay word. First, we generate separate word alignments for the Indonesian–English and the Malay–English bi-texts. If a Malay word ML3 and an Indonesian word IN3 are both aligned to the same English word EN3, [CLICK] then, we consider the Indonesian word IN3 as a potential translation option for the Malay word ML3. [CLICK] each translation pair is associated with a conditional probability in the confusion network. The probability is estimated by pivoting over English. [CLICK] Note that we have no Malay-Indonesian bi-text, so we pivot over English to get Malay-Indonesian translation pairs.
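The pivoted probability estimation described in this note, p(in | ml) = Σ_en p(in | en) · p(en | ml), can be sketched as follows; the nested-dict representation and function name are hypothetical (the real system estimates the component distributions from word alignments):

```python
from collections import defaultdict

def pivot_probs(p_in_given_en, p_en_given_ml):
    """Pivot over English: p(in | ml) = sum over en of p(in | en) * p(en | ml).

    p_in_given_en[en][in_word]  -- from the Indonesian-English alignment
    p_en_given_ml[ml_word][en]  -- from the Malay-English alignment
    """
    p_in_given_ml = defaultdict(dict)
    for ml_word, en_dist in p_en_given_ml.items():
        for en, p_en in en_dist.items():
            for in_word, p_in in p_in_given_en.get(en, {}).items():
                prev = p_in_given_ml[ml_word].get(in_word, 0.0)
                p_in_given_ml[ml_word][in_word] = prev + p_in * p_en
    return dict(p_in_given_ml)
```

The resulting per-word distributions are what populate the confusion-network arcs described above.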
  19. Since the Indonesian-English bi-text is small, its word alignments are unreliable. As a result, we get bad Malay-Indonesian paraphrases from the word alignments. [CLICK] We try to improve the word alignments using the Malay-English bi-text. Since Malay and Indonesian share some vocabulary, we combine the Indonesian-English and Malay-English bi-texts to carry out word alignment. As a result, we obtain an improved Indonesian-English word alignment. When we concatenate the Indonesian-English and the Malay-English bi-texts, we concatenate multiple copies of the small Indonesian-English bi-text, because the Malay-English bi-text is much larger.
  20. The second issue is that, since the Indonesian-English bi-text is small, the Indonesian word-level paraphrases for a Malay word are restricted to the small Indonesian vocabulary of that bi-text. [CLICK] To enlarge this vocabulary, we use cross-lingual morphological variants. Now let me explain how we add cross-lingual morphological variants to a confusion network. If the input Malay sentence contains the word seperminuman, we first find its lemma minum, and then determine all Indonesian words sharing the same lemma. These Indonesian words are considered cross-lingual morphological variants of the Malay word. [CLICK] Note that the Indonesian morphological variants come from a large monolingual Indonesian text, so they include new Indonesian words that are not in the small Indonesian-English bi-text.
  21. Word-level pivoting ignores context. It relies on the Indonesian language model to make the right contextual choice. [CLICK] We also try to model the context more directly by generating adaptation options at the phrase level using pivoted phrase tables. We use standard phrase-based SMT techniques to build two separate phrase tables for the Indonesian–English and the Malay–English bi-texts. Then we pivot the two phrase tables over English phrases. The obtained pivoted phrase table is used to adapt Malay to Indonesian. We also add cross-lingual morphological variants to enlarge the Indonesian vocabulary. [CLICK] As a result, we can model the context better by using both Indonesian language model and phrases. Another advantage is that we can have more word operations here, since we use phrases.
  22. Recall that the second step of our method is bi-text combination.
  23. We combine the original small Indonesian–English bi-text with the adapted “Indonesian”–English bi-text in three ways. [CLICK] The first way is to simply concatenate the two bi-texts as the training bi-text; this assumes the two bi-texts have the same quality. [CLICK] The second way is balanced concatenation. Since the adapted bi-text is much larger than the original Indonesian-English bi-text, it would dominate a plain concatenation. To overcome this, we repeat the smaller Indonesian–English bi-text enough times so that the two bi-texts have the same size before concatenation. [CLICK] Finally, we experiment with a method for combining phrase tables proposed in previous work by Nakov and Ng, which improves word alignments and then combines phrase tables with extra features.
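The balanced-concatenation scheme in this note can be sketched as follows, with bi-texts represented as lists of sentence pairs; the rounding policy and function name are assumptions for illustration:

```python
def balanced_concat(small_bitext, large_bitext):
    """Repeat the smaller bi-text so the two sides contribute roughly
    equal numbers of sentence pairs, then concatenate them."""
    repetitions = max(1, round(len(large_bitext) / len(small_bitext)))
    return small_bitext * repetitions + large_bitext
```

Plain concatenation is the special case with one repetition; the phrase-table combination method works at a later stage of the pipeline and is not shown here.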
  24. I will now present our experiments.
  25. In our experiments, we use the following datasets. For Indonesian–English: we have a small training bi-text, a development set, and also a test set. We also use a large Malay–English bi-text, which is then adapted into Indonesian-English.
  26. We have carried out two kinds of experiments: The first kind is called isolated experiments. In isolated experiments, we only use the adapted bi-text but not the original Indonesian-English bi-text. These experiments provide a direct comparison to using the original bi-text. The green bars show the two baseline systems. Although the original Malay-English bi-text is about 10 times bigger than the original Indonesian-English bi-text, training on the Malay-English bi-text is much worse than training on the small Indonesian-English bi-text. This shows the existence of important differences between Malay and Indonesian. Using our method, we can see that word-level paraphrasing improves by 5 BLEU points over the original Malay-English baseline. And it improves by close to one BLEU point over the original Indonesian-English baseline. By adding cross-lingual morphological variants to word-level paraphrasing, we get about half a BLEU point of improvement. This confirms that the cross-lingual morphological variants are actually effective. As we discussed before, phrase-level paraphrasing can model context better, so phrase-level paraphrasing gets larger improvement. Finally, we use the system combination method, MEMT, to combine the best word-level paraphrasing system and the best phrase-level paraphrasing system, and it yields even further improvements. This shows that the two kinds of paraphrasing methods are actually complementary.
  27. The second kind of experiments is combined experiments. In these experiments, we combine the adapted bi-text with the original Indonesian-English bi-text using the three bi-text combination methods. As in the isolated experiments, we get improvements from both the word-level and the phrase-level paraphrasing methods. One interesting finding is that with our method, the results of the three bi-text combination methods differ much less than they do for the baselines.
  28. To summarize, this graph shows the overall improvements that we obtain in our experiments. The first three bars are the baselines using existing methods, And the fourth one is our best isolated system, which improves about 1 BLEU point over the baselines. The last one is the best combined system, and it gives us 1.5 BLEU point improvement over the baselines.
  29. We have also applied our method to other languages. We try to improve Macedonian-English SMT by adapting a Bulgarian-English bi-text. We get similar results. This confirms the applicability of our method to other language pairs.
  30. While Indonesian is closely related to Malay, there are also some false friends. They share some words, but the words may have very different meanings in the two languages. That’s why we paraphrase all the words in our experiments.
  31. We asked a native Indonesian speaker who does not speak Malay to judge whether our adapted “Indonesian” sentences are more understandable to him than the original Malay input. It turns out that the two looked similar to him. Still, the adapted sentences did work better than the original Malay sentences in our experiments. We see two possible reasons: first, SMT systems can tolerate noisy training data; second, the judgments were at the sentence level, while phrases are sub-sentential, and a “bad” sentence can still contain many good phrases.
  32. We also tried to adapt Indonesian to Malay, and then use a Malay-English translation system to translate the adapted Malay sentences to English. However, the results turned out to be worse than adapting Malay to Indonesian.
  33. Some related work.
  46. Next I will conclude our work.
  48. In summary, to improve resource-poor machine translation, we adapt bi-texts for a related resource-rich language, using confusion networks, word-level and phrase-level paraphrasing, and morphological analysis. We achieved very sizable improvements over the baselines. In the future, we would like to add more word operations, for example splitting and merging words. We also want to better integrate our word-level and phrase-level paraphrasing methods. Lastly, we want to apply our methods to other languages and NLP problems.
  50. There is some related work on translating text between related languages, which resembles our bi-text adaptation step. Most of it uses rule-based translation systems, whereas our method is statistical and language-independent.