Parallel text extraction from multimodal comparable corpora

Introduction
Existing Works
Proposed Approach
Conclusion

Haithem Afli, Loïc Barrault and Holger Schwenk
LIUM, University of Le Maine
72085 Le Mans cedex 9, FRANCE
FirstName.LastName@lium.univ-lemans.fr

Oct 22, 2012
JapTal 2012, Kanazawa - JAPAN

Outline

1 Introduction and Context
Statistical Machine Translation
Parallel and Comparable Corpora
2 Existing Works
Exploiting Comparable Corpora
Main Existing Methods
3 Proposed Approach
System Architecture
Several Issues
Task Description
Experimental setup
Results
4 Conclusion and Discussion


Statistical Machine Translation

Purpose : text translation
Approach : Statistical, given by :

$$t^* = \arg\max_{t} P(s \mid t)\, P(t)$$

Modeling
Translation Model : P(s|t)
Language Model : P(t)
Decoding Algorithm : argmax
Some open source tools are available like Moses and Joshua
⇒ needs parallel data
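
As a minimal sketch of the decision rule above, the following Python snippet scores a few candidate translations with a toy translation model and a toy language model and returns the arg max. The candidate sentences and all probabilities are invented for illustration; real decoders such as Moses search a far larger hypothesis space.

```python
import math

# Toy probabilities, invented purely for illustration.
# translation_model[t] stands for P(s | t) for one fixed source sentence s.
translation_model = {
    "the house is small": 0.20,
    "the home is small": 0.25,
    "small the house is": 0.20,
}
# language_model[t] stands for P(t).
language_model = {
    "the house is small": 0.010,
    "the home is small": 0.004,
    "small the house is": 0.0001,
}


def decode(candidates):
    """Return the candidate t that maximizes log P(s|t) + log P(t)."""
    def score(t):
        return math.log(translation_model[t]) + math.log(language_model[t])
    return max(candidates, key=score)


best = decode(list(translation_model))
print(best)
```

The language model is what rules out the badly ordered candidate here, even though its translation-model score ties with the best one.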


Parallel Corpora

Texts that are translations of each other
An essential resource for MT
Provide training data for statistical translation models
Also useful for other NLP applications
Expensive and time consuming to prepare
Translate, Sentence Align, ...
But limited in
Size, Language and Domain
⇒ There is no better data than more data


Comparable Corpora

Generally not parallel, but overlapping information
Readily available
Mainly from Newswire
AFP, Al Jazeera, BBC, ...
Much larger quantities than parallel corpora
Multiple languages and Genres
Large collections available for NLP tasks
e.g. Gigaword corpora from LDC
English, Arabic, Chinese, French, Spanish


Exploiting comparable corpora
Extract parallel documents
Using structural information
Extending parallel sentence alignment algorithms

Extract parallel sentence pairs
With sentence alignment algorithms
Cross-lingual IR methods
Translation approach


Main Existing Methods

Webcrawling [Resnik and Smith, 2003] : use URLs to find
matching documents
Alignment [Brown et al., 1991] : use word alignment models
to judge how close a source and a target document (or sentence)
are
Crosslingual IR [Munteanu and Marcu, 2005] : use a lexicon to
translate source words and apply information retrieval
techniques
Translation [Rauf and Schwenk, 2011] : use an SMT system to
translate documents and apply information retrieval


Goal : Exploiting multimodal comparable corpora

[Figure: text and audio inputs → parallel text extraction → parallel text]


Proposed Approach

Build a baseline SMT system (using generic data)
Transcribe the audio data
Translate the transcribed text
Use the translations as queries for IR to find the "matching"
sentences in the target side of the comparable corpus
Use TER between the SMT translation and the retrieved sentences
to detect parallel ones
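
The last step can be sketched in Python as below. This is only an approximation under stated assumptions: the score is a word-level edit distance divided by the length of the retrieved sentence, whereas real TER also allows block shifts; treating the retrieved sentence as the reference, the function names and the threshold value are all choices made for this illustration.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approx_ter(hypothesis, reference):
    """Edit distance over reference length, in percent (no shifts, unlike real TER)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        return 0.0 if not hyp else 100.0
    return 100.0 * word_edit_distance(hyp, ref) / len(ref)


def filter_pairs(candidates, threshold):
    """Keep (source, retrieved) pairs whose SMT translation is close to the retrieved sentence.

    Each candidate is a (source, smt_translation, retrieved_sentence) triple.
    """
    return [(src, found) for src, translation, found in candidates
            if approx_ter(translation, found) <= threshold]


candidates = [
    ("i wrote a story", "j'ai écrit un article", "j'ai écrit un article"),
    ("thank you", "merci beaucoup", "il pleut à paris aujourd'hui"),
]
kept = filter_pairs(candidates, threshold=50)
print(kept)
```

Only the first pair survives: its translation is identical to the retrieved sentence, while the second needs five edits for a five-word sentence.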


System Architecture

[Figure: audio L1 from the multimodal comparable corpora → ASR → transcription L1 → SMT → translation L2 → IR over the L2 texts → retrieved text L2 → bitext]

Several issues

Feasibility : Are multimodal comparable corpora useful for
extracting parallel text?
Quality : Is parallel text extracted from multimodal corpora
as good as bitext extracted from comparable text?
Effectiveness : since one of our motivations for exploiting
comparable corpora is to adapt an SMT system to a specific
domain, the extracted bitext needs to improve SMT
performance.


Task description (1)

Analyze the impact of the errors of each module
⇒ we conducted three different types of experiments
Exp 1 : we use the reference translations as queries for the IR
system → This is the most favorable condition; it simulates
the case where the ASR and SMT systems make no
errors.
Exp 2 : we use the reference transcriptions as input to the SMT
system → In this case, the errors come only from the SMT
system since no ASR is involved.
Exp 3 : the complete proposed framework → It
corresponds to a real scenario.


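[Figure: the three experimental conditions.
Exp 1 : TEDbi.FR → IR
Exp 2 : TEDbi.En → SMT → TEDbi_tran.FR → IR
Exp 3 : TED audio → ASR → TEDasr.En → SMT → TEDasr_tran.FR → IR
In each condition, IR searches the French text index ccb2 + %TrainTED.fr and returns French text.]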


Importance of the degree of similarity between the two parts
of the comparable corpora
⇒ we artificially created four comparable corpora with
different degrees of similarity
the source part of our comparable corpus is always the same
the target language part of the comparable corpus consists of a
large generic corpus plus 25%, 50%, 75% and 100%
respectively of the reference translations
Evaluation of the approach
the extracted parallel data are re-injected into the baseline
system as additional training data
systems are evaluated using the BLEU score
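
A minimal sketch of how the target side of such corpora could be assembled is shown below. The slides do not say how the 25%, 50% and 75% subsets were selected, so the random sampling, the fixed seed, the function name and the toy sentences are all assumptions made for illustration.

```python
import random


def build_target_side(generic, reference_translations, fraction, seed=0):
    """Return the generic sentences plus a random `fraction` of the reference translations."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    k = round(fraction * len(reference_translations))
    return list(generic) + rng.sample(reference_translations, k)


# Toy stand-ins for the large generic corpus and the reference translations.
generic = [f"generic sentence {i}" for i in range(10)]
references = [f"reference translation {i}" for i in range(8)]

corpora = {pct: build_target_side(generic, references, pct / 100)
           for pct in (25, 50, 75, 100)}
sizes = {pct: len(corpus) for pct, corpus in corpora.items()}
print(sizes)
```

The source side stays fixed in every condition; only the share of true translations hidden in the target side changes.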


Experimental Setup : Data (TED task in IWSLT)

Training
bitexts   # words   in domain ?
nc7       3.7M      no
eparl7    56.4M     no
TEDasr    1.8M      yes
TEDbi     1.9M      yes

Development and test
Dev          # words
dev.outASR   36k
dev.refSMT   38k
Test         # words
tst.outASR   8.7k
tst.refSMT   9.1k


Experimental Setup : Modules

ASR : a five-pass system based on CMU Sphinx
has a WER of about 18%
SMT : a phrase-based system built with the Moses SMT toolkit
trained on generic bitexts only
word alignments in both directions are calculated using
GIZA++
phrases and lexical reordering are extracted using the default
settings of the Moses toolkit
the parameters were tuned on dev.outASR, using the MERT
tool
IR : a system based on the Lemur IR toolkit
indexes all target-language (French) text data
transforms the translated source-language (English) text into
queries using the Indri Query Language
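
The query construction step might look like the sketch below. The slides do not state which Indri operators were used, so the bag-of-words `#combine(...)` form, the lowercasing and the tokenization rule are assumptions made for this example.

```python
import re


def to_indri_query(sentence):
    """Build a bag-of-words Indri query from one translated sentence."""
    # Keep only runs of word characters (Unicode-aware); punctuation is
    # dropped so that it cannot be misread as query syntax.
    terms = re.findall(r"\w+", sentence.lower())
    return "#combine(" + " ".join(terms) + ")"


query = to_indri_query("J'ai écrit un article, sur la nourriture.")
print(query)
```

Each translated sentence yields one query, and the top-ranked French sentences returned for it become the candidates for the TER filter.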


Results

Table: BLEU scores on dev and test after adaptation of a baseline system
with bitexts extracted in conditions Exp1, Exp2 and Exp3 (100% TEDbi)

Experiment        Dev     Test
Baseline system   22.93   23.96
Exp1              24.14   25.14
Exp2              23.90   25.15
Exp3              23.40   24.69

Extracted sentences do improve the SMT system
BLEU scores of the adapted systems in Exp2 and Exp3 are close to
those of Exp1 in most cases
⇒ errors introduced by the SMT and ASR systems have no
major impact on the performance of the parallel sentence
extraction algorithm
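
To make the comparison concrete, the short Python snippet below computes the BLEU gain of each condition over the baseline, using only the scores from the table; the variable names are illustrative.

```python
# BLEU scores copied from the table (100% TEDbi).
baseline = {"dev": 22.93, "test": 23.96}
adapted = {
    "Exp1": {"dev": 24.14, "test": 25.14},
    "Exp2": {"dev": 23.90, "test": 25.15},
    "Exp3": {"dev": 23.40, "test": 24.69},
}

# Gain of each adapted system over the baseline, rounded to two decimals.
gains = {
    name: {split: round(scores[split] - baseline[split], 2) for split in baseline}
    for name, scores in adapted.items()
}

for name, gain in gains.items():
    print(f"{name}: dev +{gain['dev']:.2f}, test +{gain['test']:.2f}")
```

On the test set, the complete pipeline (Exp3) gains 0.73 BLEU, against 1.18 for the error-free condition (Exp1).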

Results

Table: BLEU scores for different degrees of parallelism of the comparable
corpus.

Experiment        Dev     Test    # injected words
Baseline system   22.93   23.96   -
25% TEDbi         23.11   24.40   ∼110k
50% TEDbi         23.27   24.58   ∼215k
75% TEDbi         23.43   24.42   ∼293k
100% TEDbi        23.40   24.69   ∼393k

The degree of similarity of the comparable corpus affects both
the performance of the extraction process
and the quality of the extracted parallel sentences


Results
e.g. 1: Domain Adaptation
Source sentence: i wrote a story about genetically engineered food
Baseline Sys: j'ai écrit un article sur la nourriture génétiquement modifiée
Adapted Sys: j'ai écrit un article sur les produits alimentaires génétiquement modifiés

e.g. 2: Oral vocabulary correction
Source sentence: yeah you're right let's fix it
Baseline Sys: yeah tu as raison de réparer
Adapted Sys: euh oui tu as raison il faut réparer


Conclusion

Proposed extending the exploitation of comparable corpora to
multimodal comparable corpora, i.e. the source side is
available as audio and the target side as text
An encouraging result : we automatically aligned source
audio in one language with texts in another language, without
the need for human intervention to transcribe and translate the
data
Able to adapt a generic SMT system to the task of lecture
translation by extracting parallel data from a multimodal
comparable corpus


Perspectives

Apply this approach at a much larger scale, i.e. using hundreds of
hours of speech and hundreds of millions of words
Work on different specific domains or subdomains
Iterate the process, using the extracted bitexts to
translate the source sentences again
Calculate the degree of similarity of the corpus before
using it


References

Brown, P. F., Lai, J. C., and Mercer, R. L. (1991).
Aligning sentences in parallel corpora.
In Proceedings of the 29th Annual Meeting of the ACL, pages
169–176.
Munteanu, D. S. and Marcu, D. (2005).
Improving machine translation performance by exploiting
non-parallel corpora.
Computational Linguistics, 31(4):477–504.
Rauf, S. A. and Schwenk, H. (2011).
Parallel sentence generation from comparable corpora for
improved SMT.
Machine Translation, 25(4):341–375.
Resnik, P. and Smith, N. A. (2003).
The web as a parallel corpus.
Computational Linguistics, 29:349–380.

Thank you


Results (1)
[Figure, two panels: BLEU score on dev (y-axis, 22.5 to 24.5) versus TER threshold (x-axis, 0 to 100) for Exp1, Exp2 and Exp3, using SMT systems adapted with bitexts extracted from the ccb2 + 100% TEDbi index corpus (left) and from the ccb2 + 75% TEDbi index corpus (right).]

The choice of the appropriate TER threshold
depends on the type of data


Crawling the Web [Resnik and Smith, 2003]

Search for web pages with similar URLs
Many companies and organizations have their web pages in
multiple languages
Identified by language ID, e.g.
http://x.../y.../z.en and http://x.../y.../z.fr
Pages have links to parallel pages
A web crawler exploits this structural information
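
The URL matching idea can be sketched in a few lines of Python. This is an illustration of the principle only, not the method of [Resnik and Smith, 2003]: it assumes the language ID is the final dot-separated suffix, and the function name and the example.org URLs are hypothetical.

```python
from collections import defaultdict


def pair_by_language_suffix(urls, lang_a="en", lang_b="fr"):
    """Pair URLs that are identical except for a trailing language ID."""
    by_stem = defaultdict(dict)
    for url in urls:
        stem, _, suffix = url.rpartition(".")
        if stem and suffix in (lang_a, lang_b):
            by_stem[stem][suffix] = url
    # Keep only stems for which both language versions were seen.
    return [(versions[lang_a], versions[lang_b])
            for _, versions in sorted(by_stem.items())
            if lang_a in versions and lang_b in versions]


# Hypothetical URLs for illustration.
urls = [
    "http://example.org/news/story1.en",
    "http://example.org/news/story1.fr",
    "http://example.org/news/story2.en",
]
pairs = pair_by_language_suffix(urls)
print(pairs)
```

Only story1 is paired, since story2 has no French counterpart in the list.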


Alignment Approach [Brown et al., 1991]

Train initial lexicon based on parallel data
Use lexicon to calculate alignment score between documents
(or sentences)
Typically IBM1
Select most reliable document (sentence) pairs
Add to parallel training data and retrain -> bootstrapping
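
As a rough sketch of the scoring step, the Python snippet below computes an IBM Model 1 log-probability of a target sentence given a source sentence from a lexicon of word translation probabilities. The toy lexicon, the length normalization, the probability floor for unseen words and the function name are assumptions made for this example.

```python
import math


def ibm1_log_score(source, target, lexicon, epsilon=1.0, floor=1e-9):
    """Length-normalized IBM Model 1 log-probability of target given source.

    lexicon maps (target_word, source_word) to a translation probability.
    """
    src = ["NULL"] + source.split()  # Model 1 adds an empty source word
    tgt = target.split()
    if not tgt:
        return float("-inf")
    # log( epsilon / (l + 1)^m ), with l source words and m target words
    log_p = math.log(epsilon) - len(tgt) * math.log(len(src))
    for t in tgt:
        total = sum(lexicon.get((t, s), 0.0) for s in src)
        log_p += math.log(max(total, floor))  # floor avoids log(0)
    return log_p / len(tgt)


# Toy lexicon, invented for illustration.
lexicon = {("la", "the"): 0.7, ("petite", "small"): 0.6, ("maison", "house"): 0.8}

score_good = ibm1_log_score("the small house", "la petite maison", lexicon)
score_bad = ibm1_log_score("the small house", "il pleut aujourd'hui", lexicon)
print(score_good > score_bad)
```

Pairs scoring above a chosen cutoff would be added to the training data before the lexicon is retrained, which is the bootstrapping loop the slide describes.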


Finding Comparable Documents [Zhao and Vogel, 2002]

Given comparable documents, find (nearly) parallel sentences
Xinhua News Agency publishes news in English and Chinese
Calculate similarity based on lexicon
Iterative process


CLIR Approach [Munteanu and Marcu, 2005]

Figure: CLIR Approach


Translation Approach [Rauf and Schwenk, 2011]

Figure: Translation Approach

