Parallel text extraction from multimodal comparable corpora
1. Introduction
Existing Works
Proposed Approach
Conclusion
Haithem Afli, Loïc Barrault and Holger Schwenk
LIUM, University of Le Maine
72085 Le Mans cedex 9, FRANCE
FirstName.LastName@lium.univ-lemans.fr
Oct 22, 2012
JapTal 2012, Kanazawa - JAPAN
Outline
1 Introduction and Context
Statistical Machine Translation
Parallel and Comparable Corpora
2 Existing Works
Exploiting Comparable Corpora
Main Existing Methods
3 Proposed Approach
System Architecture
Several Issues
Task Description
Experimental setup
Results
4 Conclusion and Discussion
Statistical Machine Translation
Purpose : text translation
Approach : Statistical, given by :
$$t^{*} = \arg\max_{t} P(s \mid t)\, P(t)$$
Modeling
Translation Model : P(s|t)
Language Model : P(t)
Decoding algorithm : argmax
Some open source tools are available like Moses and Joshua
⇒ needs parallel data
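As a minimal sketch of the decision rule above, the argmax can be taken over a small set of candidate translations by summing log-probabilities from the two models. The candidate sentences and all probability values below are hypothetical toy numbers, not taken from the talk; a real decoder such as Moses searches a far larger hypothesis space.

```python
import math

# Hypothetical toy scores for one fixed source sentence s.
# A real system estimates both models from data.
translation_model = {  # P(s | t)
    "the house is small": 0.20,
    "the home is small": 0.25,
    "small the house is": 0.25,
}
language_model = {  # P(t)
    "the house is small": 0.010,
    "the home is small": 0.004,
    "small the house is": 0.0001,
}


def decode(candidates):
    """Return the candidate t that maximises log P(s|t) + log P(t)."""
    return max(
        candidates,
        key=lambda t: math.log(translation_model[t]) + math.log(language_model[t]),
    )


best = decode(list(translation_model))
print(best)
```

The language model is what separates the fluent candidate from the others here, since the translation model alone scores it lowest.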
Parallel Corpora
Texts that are translations of each other
An essential resource for MT
Provide training data for statistical translation models
Also useful for other NLP applications
Expensive and time consuming to prepare
Translate, Sentence Align, ...
But limited in
Size, Language and Domain
⇒ There are no better data than more data
Comparable Corpora
Generally not parallel, but overlapping information
Readily available
Mainly from Newswire
AFP, Al Jazeera, BBC ...
Much larger quantities than parallel corpora
Multiple languages and Genres
Large collections available for NLP tasks
e.g. Gigaword corpora from LDC
English, Arabic, Chinese, French, Spanish
Exploiting comparable corpora
Extract parallel documents
Using structural information
Extending parallel sentence alignment algorithms
Extract parallel sentence pairs
With sentence alignment algorithms
Cross-lingual IR methods
Translation approach
Main Existing Methods
Web crawling [Resnik and Smith, 2003] : use URLs to find
matching documents
Alignment [Brown et al., 1991] : use word alignment models
to judge how close a source and a target document (or sentence)
are
Cross-lingual IR [Munteanu and Marcu, 2005] : use a lexicon to
translate source words and apply information retrieval
techniques
Translation [Rauf and Schwenk, 2011] : use an SMT system to
translate documents and apply information retrieval
Goal : Exploiting multimodal comparable corpora
[Figure: text and audio inputs feed a parallel text extraction step, which outputs parallel text]
Proposed Approach
Build a baseline SMT system (using generic data )
Transcribe the audio data
Translate the transcribed text
Use translations as queries for IR to find the "matching"
sentences in the target comparable corpus
Use TER between SMT translation and the found sentences
to detect parallel ones
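The final filtering step can be sketched as follows. This is a simplified stand-in for TER, not the metric itself: true TER also allows block shifts, which this word-level edit distance omits. Normalising by the length of the retrieved sentence and the threshold value of 50 are both assumptions made for illustration.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over tokens (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def approx_ter(translation, candidate):
    """Edit rate in percent, normalised by the candidate length (no shifts)."""
    hyp, ref = translation.split(), candidate.split()
    if not ref:
        return 100.0 if hyp else 0.0
    return 100.0 * word_edit_distance(hyp, ref) / len(ref)


def keep_pair(translation, candidate, threshold=50.0):
    """Keep the pair when the edit rate does not exceed the threshold."""
    return approx_ter(translation, candidate) <= threshold


print(approx_ter("a b c d", "a b x d"))
```

A lower threshold keeps fewer but cleaner pairs; a higher one keeps more pairs at the cost of more noise.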
System Architecture
[Figure: system architecture. Multimodal comparable corpora: Audio L1 → ASR → Transc. L1 → SMT → Transl. L2 → IR over Texts L2 → Bitext]
Several issues
Feasibility : Are multimodal comparable corpora useful for
extracting parallel text ?
Quality : Is parallel text extracted from multimodal corpora
as good as bitext extracted from comparable text ?
Effectiveness : Since one of our motivations for exploiting
comparable corpora is to adapt an SMT system to a specific
domain, the extracted bitext must improve SMT
performance.
Task description (1)
Analyze the impact of the errors of each module
⇒ conducted three different types of experiments
Exp 1 : we use the reference translations as queries for the IR
system → This is the most favorable condition: it simulates
the case where the ASR and SMT systems make no
errors.
Exp 2 : we use the reference transcription as input to the SMT
system → In this case, the errors come only from the SMT
system since no ASR is involved.
Exp 3 : represents the complete proposed framework → It
corresponds to a real scenario.
Task description (2)
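[Figure: the three experimental conditions]
Exp 1 : TEDbi.FR → IR
Exp 2 : TEDbi.En → SMT → TEDbi_tran.FR → IR
Exp 3 : TED audio → ASR → TEDasr.En → SMT → TEDasr_tran.FR → IR
In all three conditions, IR searches the indexed French text (Texte FR) : ccb2 + %TrainTED.fr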
Task description (3)
Importance of the degree of similarity between the two parts
of the comparable corpora
⇒ we artificially created four comparable corpora with
different degrees of similarity
the source part of our comparable corpus is always the same
the target language part of the comparable corpus consists of a
large generic corpus plus 25%, 50%, 75% and 100%
respectively of the reference translations
Evaluation of the approach
final parallel data extracted are re-injected into the baseline
system
systems are evaluated using the BLEU score
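The construction of the four target-side corpora can be sketched as below. The function name, the toy sentences, and the choice of taking the first portion of the reference translations (rather than a random sample) are assumptions; the slides do not say how the 25%, 50% and 75% subsets were selected.

```python
def build_index_corpus(generic, reference_translations, fraction):
    """Generic sentences plus the first `fraction` of the reference translations."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    n = round(len(reference_translations) * fraction)
    return list(generic) + list(reference_translations[:n])


# Hypothetical stand-ins for the large generic corpus and the TED references.
generic = ["g1", "g2", "g3"]
refs = ["r1", "r2", "r3", "r4"]

corpora = {p: build_index_corpus(generic, refs, p / 100) for p in (25, 50, 75, 100)}
for p, corpus in corpora.items():
    print(p, len(corpus))
```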
Experimental Setup : Data (TED task in IWSLT)
Training
bitexts       # words    in domain ?
nc7           3.7M       no
eparl7        56.4M      no
TEDasr        1.8M       yes
TEDbi         1.9M       yes
Development and test
Dev           # words
dev.outASR    36k
dev.refSMT    38k
Test          # words
tst.outASR    8.7k
tst.refSMT    9.1k
Experimental Setup : Modules
ASR : a five-pass system based on CMU Sphinx
has a WER of about 18%
SMT : a phrase-based system built with the Moses SMT toolkit
trained on generic bitext only
word alignments in both directions are calculated using
GIZA++
phrases and lexical reordering are extracted using the default
settings of the Moses toolkit
the parameters were tuned on dev.outASR, using the MERT
tool
IR : a system based on the Lemur IR toolkit
all target language (French) text data are indexed
the translations of the source language (English) sentences are
converted to queries using the Indri Query Language
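A minimal sketch of the query conversion step, assuming each translated sentence becomes a bag-of-words query wrapped in Indri's `#combine` operator. The slides do not say which Indri operators were used, and the regex tokenisation and the helper name `to_indri_query` are illustrative choices, not part of the described system.

```python
import re


def to_indri_query(sentence):
    """Wrap the sentence's word tokens in an Indri #combine operator."""
    # Lowercase and keep only word characters so punctuation cannot
    # be misread as query syntax.
    tokens = re.findall(r"\w+", sentence.lower())
    if not tokens:
        return ""
    return "#combine(" + " ".join(tokens) + ")"


query = to_indri_query("J'ai écrit un article sur les produits alimentaires.")
print(query)
```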
Results
Table: BLEU scores on dev and test after adaptation of a baseline system
with bitexts extracted in conditions Exp1, Exp2 and Exp3 (100% TEDbi)
Experiment         Dev      Test
Baseline system    22.93    23.96
Exp1               24.14    25.14
Exp2               23.90    25.15
Exp3               23.40    24.69
Extracted sentences do improve the SMT system
BLEU score of the adapted system matches that of Exp1
in most cases
⇒ errors introduced by the SMT and ASR systems have no
major impact on the performance of the parallel sentence
extraction algorithm
Results
Table: BLEU scores for different degrees of parallelism of the comparable
corpus.
Experiment         Dev      Test     # injected words
Baseline system    22.93    23.96    -
25% TEDbi          23.11    24.40    ∼110k
50% TEDbi          23.27    24.58    ∼215k
75% TEDbi          23.43    24.42    ∼293k
100% TEDbi         23.40    24.69    ∼393k
The degree of similarity of the comparable corpus is important
in terms of the performance of the extraction process
and the quality of the extracted parallel sentences
Results
e.g. 1: Domain Adaptation
Source sentence: i wrote a story about genetically engineered food
Baseline Sys: j'ai écrit un article sur la nourriture génétiquement modifiée
Adapted Sys: j'ai écrit un article sur les produits alimentaires génétiquement modifiés
e.g. 2: Oral vocabulary correction
Source sentence: yeah you're right let's fix it
Baseline Sys: yeah tu as raison de réparer
Adapted Sys: euh oui tu as raison il faut réparer
Conclusion
We proposed extending the exploitation of comparable corpora to
multimodal comparable corpora, i.e. the source side is
available as audio and the target side as text
An encouraging result: we automatically aligned source
audio in one language with texts in another language, without
human intervention to transcribe and translate the
data
We were able to adapt a generic SMT system to the task of lecture
translation by extracting parallel data from a multimodal
comparable corpus
Perspectives
Apply this task at a much larger scale, i.e. using hundreds of
hours of speech and hundreds of millions of words
Work on different specific domains or subdomains
Iterate the process in order to use the extracted bitexts to
translate the source sentences again
Calculate the degree of similarity of the corpus before
using it
Brown, P. F., Lai, J. C., and Mercer, R. L. (1991). Aligning sentences in parallel corpora. In Proceedings of the 29th Annual Meeting of the ACL, pages 169–176.
Munteanu, D. S. and Marcu, D. (2005). Improving Machine Translation Performance by Exploiting Non-Parallel Corpora. Computational Linguistics, 31(4):477–504.
Rauf, S. A. and Schwenk, H. (2011). Parallel sentence generation from comparable corpora for improved SMT. Machine Translation, 25(4):341–375.
Resnik, P. and Smith, N. A. (2003). The web as a parallel corpus. Computational Linguistics, 29:349–380.
Thank you
Results (1)
[Figure: BLEU score on dev (y-axis) against TER threshold (x-axis, 0 to 100) for Exp1, Exp2 and Exp3, using SMT systems adapted with bitexts extracted from the ccb2 + 100% TEDbi index corpus.]
[Figure: BLEU score on dev (y-axis) against TER threshold (x-axis, 0 to 100) for Exp1, Exp2 and Exp3, using SMT systems adapted with bitexts extracted from the ccb2 + 75% TEDbi index corpus.]
The choice of the appropriate TER threshold
depends on the type of data
Crawling the Web [Resnik and Smith, 2003]
Search for web pages with similar URLs
Many companies and organizations have their web pages in
multiple languages
Identified by language ID, e.g.
http://x.../y.../z.en and http://x.../y.../z.fr
Pages have links to parallel pages
A web crawler can exploit this structural information
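The URL-matching idea can be illustrated with a small sketch that pairs pages whose addresses differ only in a trailing language ID. The `example.org` URLs and the function name are hypothetical, and real sites encode the language in many other ways (path segments, subdomains, query parameters) that this sketch does not handle.

```python
from collections import defaultdict


def pair_by_language_suffix(urls, lang_a="en", lang_b="fr"):
    """Pair URLs that differ only in a trailing language ID such as .en / .fr."""
    by_stem = defaultdict(dict)
    for url in urls:
        stem, dot, suffix = url.rpartition(".")
        if dot and suffix in (lang_a, lang_b):
            by_stem[stem][suffix] = url
    # Keep only stems for which both language versions were seen.
    return [
        (found[lang_a], found[lang_b])
        for stem, found in sorted(by_stem.items())
        if lang_a in found and lang_b in found
    ]


urls = [
    "http://example.org/news/item1.en",
    "http://example.org/news/item1.fr",
    "http://example.org/news/item2.en",
]
pairs = pair_by_language_suffix(urls)
print(pairs)
```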
Alignment Approach [Brown et al., 1991]
Train initial lexicon based on parallel data
Use lexicon to calculate alignment score between documents
(or sentences)
Typically IBM1
Select most reliable document (sentence) pairs
Add to parallel training data and retrain -> bootstrapping
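The scoring step can be sketched with the IBM Model 1 sentence likelihood, computed up to its length-dependent constant. The lexicon entries below are hypothetical toy probabilities t(f|e), and the probability floor for unseen word pairs is an assumption added to keep the logarithm defined.

```python
import math


def ibm1_log_score(source, target, lexicon, floor=1e-9):
    """log P(target | source) under IBM Model 1, up to the length constant."""
    src = ["NULL"] + source.split()  # Model 1 adds an empty source word
    log_prob = 0.0
    for f in target.split():
        # Sum the lexical probabilities t(f | e) over all source positions.
        total = sum(lexicon.get((f, e), 0.0) for e in src)
        log_prob += math.log(max(total, floor) / len(src))
    return log_prob


# Hypothetical lexicon keyed by (target word, source word).
lexicon = {("maison", "house"): 0.8, ("la", "the"): 0.7, ("petite", "small"): 0.6}

good = ibm1_log_score("the small house", "la petite maison", lexicon)
bad = ibm1_log_score("the small house", "un chat noir", lexicon)
print(good > bad)
```

In a bootstrapping loop, pairs scoring above some cut-off would be added to the training data before the lexicon is re-estimated.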
Finding Comparable Documents [Zhao and Vogel, 2002]
Given comparable documents, find (nearly) parallel sentences
Xinhua News Agency publishes news in English and Chinese
Calculate similarity based on lexicon
Iterative process
CLIR Approach [Munteanu and Marcu, 2005]
Figure: CLIR Approach
Translation Approach [Rauf and Schwenk, 2011]
Figure: Translation Approach