Introduction
                                  Existing Works
                              Proposed Approach
                                      Conclusion




           Parallel text extraction from multimodal
                      comparable corpora

           Haithem Afli, Loïc Barrault and Holger Schwenk
                                LIUM, University of Le Maine
                              72085 Le Mans cedex 9, FRANCE
                          FirstName.LastName@lium.univ-lemans.fr

                                  Oct 22, 2012
                         JapTal 2012, Kanazawa - JAPAN






    Outline

        1   Introduction and Context
                  Statistical Machine Translation
                  Parallel and Comparable Corpora
        2   Existing Works
                  Exploiting Comparable Corpora
                  Main Existing Methods
        3   Proposed Approach
                  System Architecture
                  Several Issues
                  Task Description
                  Experimental setup
                  Results
        4   Conclusion and Discussion



    Statistical Machine Translation

          Purpose: text translation
          Approach: statistical, given by:

                                      $$ t^{*} = \arg\max_{t} P(s \mid t)\, P(t) $$

          Modeling
                Translation model: P(s|t)
                Language model: P(t)
                Decoding algorithm: argmax
          Open-source tools such as Moses and Joshua are available
          ⇒ needs parallel data
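
The decision rule above can be illustrated on a toy example. The sketch below is only an illustration, not how Moses or Joshua decode: the candidate list and every probability are invented for the example, and a real decoder searches a far larger hypothesis space.

```python
import math

# Hypothetical toy models: every probability below is invented for illustration.
# translation_model[(s, t)] stands for P(s | t); language_model[t] stands for P(t).
translation_model = {
    ("la maison", "the house"): 0.6,
    ("la maison", "house the"): 0.6,
    ("la maison", "the home"): 0.3,
}
language_model = {
    "the house": 0.05,
    "house the": 0.0001,
    "the home": 0.02,
}


def decode(source, candidates):
    """Return the candidate t that maximizes log P(s|t) + log P(t)."""
    def score(t):
        return math.log(translation_model[(source, t)]) + math.log(language_model[t])
    return max(candidates, key=score)


best = decode("la maison", ["the house", "house the", "the home"])
print(best)
```

The translation model alone cannot separate "the house" from "house the" in this toy setup; the language model term breaks the tie in favor of the fluent word order.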





    Parallel Corpora


            Texts that are translations of each other
            An essential resource for MT
                  Provide training data for statistical translation models
                  Also useful for other NLP applications
                  Expensive and time consuming to prepare
                         Translate, Sentence Align, ...
            But limited in
                  Size, Language and Domain
        ⇒ There is no better data than more data





    Comparable Corpora


         Generally not parallel, but overlapping information
         Readily available
               Mainly from newswire
                      AFP, Al Jazeera, BBC ...
               Much larger quantities than parallel corpora
               Multiple languages and Genres
         Large collections available for NLP tasks
               e.g. Gigaword corpora from LDC
                      English, Arabic, Chinese, French, Spanish






    Exploiting comparable corpora
          Extract parallel documents
                Using structural information
                Extending parallel sentence alignment algorithms




          Extract parallel sentence pairs
                With sentence alignment algorithms
                Cross-lingual IR methods
                Translation approach



    Main Existing Methods


         Web crawling [Resnik and Smith, 2003]: use URLs to find
         matching documents
         Alignment [Brown et al., 1991]: use word alignment models
         to judge how close a source and a target document (or sentence)
         are
         Cross-lingual IR [Munteanu and Marcu, 2005]: use a lexicon to
         translate source words and apply information retrieval
         techniques
         Translation [Rauf and Schwenk, 2011]: use an SMT system to
         translate documents and apply information retrieval





    Goal : Exploiting multimodal comparable corpora




         [Figure: Audio and Text are the inputs to Parallel text extraction, which produces Parallel text]




    Proposed Approach



         Build a baseline SMT system (using generic data)
         Transcribe the audio data
         Translate the transcribed text
         Use translations as queries for IR to find the "matching"
         sentences in the target comparable corpus
         Use TER between SMT translation and the found sentences
         to detect parallel ones
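
The last step above can be sketched as a filter on an edit rate. This is a simplified stand-in rather than the actual TER tool: it counts word insertions, deletions and substitutions but ignores TER's block shifts, so it can overestimate the true TER. Treating the retrieved sentence as the reference, and the threshold value used here, are assumptions made for the example.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h
                            curr[j - 1] + 1,      # insert r
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approx_ter(hypothesis, reference):
    """Edit rate in percent; no block shifts, so this is not exact TER."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        # Arbitrary convention for an empty reference.
        return 0.0 if not hyp else 100.0
    return 100.0 * word_edit_distance(hyp, ref) / len(ref)


def keep_parallel(pairs, threshold):
    """Keep (translation, retrieved sentence) pairs at or below the threshold."""
    return [(t, r) for t, r in pairs if approx_ter(t, r) <= threshold]


pairs = [
    ("le chat dort sur le tapis", "le chat dort sur le tapis rouge"),
    ("le chat dort sur le tapis", "la bourse a chuté hier"),
]
kept = keep_parallel(pairs, threshold=50.0)
print(kept)
```

Only the first pair survives: it needs one insertion against a seven-word reference, while the second pair shares no words with the translation.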






    System Architecture


         [Figure: system architecture. From the multimodal comparable corpora:
          Audio L1 → ASR → Transc. L1 → SMT (trained on the Bitext) → Transl. L2
          → IR (over Texts L2) → Text L2]
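
The chain of modules in this architecture can be sketched as a composition of three callables. The function names and the toy stand-ins below are hypothetical and only make the sketch runnable; they are not a real recognizer, translation system or index.

```python
from typing import Callable, Iterable, List, Tuple


def extract_candidates(
    audio_segments: Iterable[str],
    asr: Callable[[str], str],
    smt: Callable[[str], str],
    ir: Callable[[str], List[str]],
) -> List[Tuple[str, str, str]]:
    """Run ASR -> SMT -> IR and return (transcript, translation, retrieved) triples."""
    results = []
    for segment in audio_segments:
        transcript = asr(segment)          # Audio L1 -> Transc. L1
        translation = smt(transcript)      # Transc. L1 -> Transl. L2
        for retrieved in ir(translation):  # Transl. L2 -> candidate Text L2
            results.append((transcript, translation, retrieved))
    return results


# Toy stand-ins (hypothetical): dictionary lookups instead of real ASR and SMT.
fake_asr = {"seg1.wav": "hello world"}.get
fake_smt = {"hello world": "bonjour le monde"}.get
corpus_l2 = ["bonjour le monde entier", "il pleut"]


def fake_ir(query):
    # Return L2 sentences sharing at least one word with the query.
    words = set(query.split())
    return [s for s in corpus_l2 if words & set(s.split())]


triples = extract_candidates(["seg1.wav"], fake_asr, fake_smt, fake_ir)
print(triples)
```

Passing the modules in as callables keeps the sketch independent of any particular toolkit, and it makes it easy to swap a module for its reference output.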






    Several issues


          Feasibility: are multimodal comparable corpora useful for
          extracting parallel text?
          Quality: is parallel text extracted from multimodal corpora
          as good as bitext extracted from comparable text?
          Effectiveness: since one of our motivations for exploiting
          comparable corpora is to adapt an SMT system to a specific
          domain, the extracted bitext needs to improve SMT
          performance.






    Task description (1)


          Analyze the impact of the errors of each module
          ⇒ we conducted three different types of experiments
                Exp 1: the reference translations are used as queries for the IR
                system → this is the most favorable condition; it simulates
                the case where the ASR and SMT systems make no errors.
                Exp 2: the reference transcriptions are used as input to the SMT
                system → the errors come only from the SMT system, since no
                ASR is involved.
                Exp 3: the complete proposed framework → it corresponds to a
                real scenario.






    Task description (2)
          [Figure: the three experimental conditions; the IR index corpus is ccb2 + %TrainTED.fr]
                Exp 1: TEDbi FR → IR → Text FR
                Exp 2: TEDbi En → SMT → TEDbi_tran FR → IR → Text FR
                Exp 3: TED audio → ASR → TEDasr En → SMT → TEDasr_tran FR → IR → Text FR





    Task description (3)

          Importance of the degree of similarity between the two parts
          of the comparable corpora
          ⇒ we artificially created four comparable corpora with
          different degrees of similarity
                the source part of our comparable corpus is always the same
                the target language part of the comparable corpus consists of a
                large generic corpus plus 25%, 50%, 75% and 100%
                respectively of the reference translations
          Evaluation of the approach
                final parallel data extracted are re-injected into the baseline
                system
                systems are evaluated using the BLEU score
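
The construction of the four target-side corpora can be sketched as follows. The slides do not say how the 25%, 50% and 75% subsets of the reference translations were chosen, so random sampling with a fixed seed is an assumption here, and the corpus contents are placeholders.

```python
import random


def build_index_corpus(generic, references, fraction, seed=0):
    """Return the generic corpus plus a random `fraction` of the reference translations."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    k = round(fraction * len(references))
    return list(generic) + rng.sample(list(references), k)


# Placeholder data standing in for the generic corpus and the reference translations.
generic = [f"generic sentence {i}" for i in range(10)]
references = [f"reference translation {i}" for i in range(8)]

corpora = {
    pct: build_index_corpus(generic, references, pct / 100)
    for pct in (25, 50, 75, 100)
}
sizes = {pct: len(corpus) for pct, corpus in corpora.items()}
print(sizes)
```

Each corpus keeps the whole generic part, so only the share of truly parallel material on the target side changes between conditions.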





    Experimental Setup : Data (TED task in IWSLT)
         Training
                              Bitext          # words      In-domain?
                              nc7             3.7M         no
                              eparl7          56.4M        no
                              TEDasr          1.8M         yes
                              TEDbi           1.9M         yes
         Development and test
                              Set             # words
                              dev.outASR      36k
                              dev.refSMT      38k
                              tst.outASR      8.7k
                              tst.refSMT      9.1k



    Experimental Setup : Modules

         ASR: a five-pass system based on CMU Sphinx
               WER of about 18%
         SMT: a phrase-based system built with the Moses SMT toolkit
               trained on generic bitexts only
               word alignments in both directions are computed with
               GIZA++
               phrases and lexical reordering are extracted using the default
               settings of the Moses toolkit
               parameters are tuned on dev.outASR using the MERT tool
         IR: a system based on the Lemur IR toolkit
               indexes all target-language (French) text data
               translations of the source-language (English) text are turned
               into queries in the Indri Query Language
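
Turning a translated sentence into a query can be sketched as below. The slides do not say which Indri operators were used; wrapping the tokens in `#combine`, lowercasing, and dropping punctuation are assumptions made for this example.

```python
import re


def to_indri_query(sentence):
    """Wrap the word tokens of a translated sentence in an Indri #combine query."""
    # Keep runs of letters and digits only; punctuation may confuse the query parser.
    tokens = re.findall(r"[^\W_]+", sentence.lower())
    return "#combine( " + " ".join(tokens) + " )" if tokens else ""


query = to_indri_query("J'ai écrit un article, hier !")
print(query)
```

Splitting on the apostrophe turns "J'ai" into two tokens; whether that matches the tokenization of the indexed text depends on how the index was built.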



    Results

         Table: BLEU scores on dev and test after adapting the baseline system
         with bitexts extracted in conditions Exp1, Exp2 and Exp3 (100% TEDbi)

                                Experiment                     Dev      Test
                                Baseline system               22.93     23.96
                                Exp1                          24.14     25.14
                                Exp2                          23.90     25.15
                                Exp3                          23.40     24.69

              Extracted sentences do improve the SMT system
              The BLEU score of the adapted system matches that of Exp1
              in most cases
              ⇒ errors introduced by the SMT and ASR systems have no
              major impact on the performance of the parallel sentence
              extraction algorithm
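
The size of each improvement is easier to read as a difference from the baseline. The short calculation below uses only the scores from the table above; the variable names are ours.

```python
# BLEU scores copied from the table above: (dev, test).
scores = {
    "Baseline system": (22.93, 23.96),
    "Exp1": (24.14, 25.14),
    "Exp2": (23.90, 25.15),
    "Exp3": (23.40, 24.69),
}

base_dev, base_test = scores["Baseline system"]
gains = {
    name: (round(dev - base_dev, 2), round(test - base_test, 2))
    for name, (dev, test) in scores.items()
    if name != "Baseline system"
}
for name, (d_dev, d_test) in gains.items():
    print(f"{name}: dev {d_dev:+.2f}, test {d_test:+.2f}")
```

On the test set, the complete pipeline (Exp3) gains 0.73 BLEU over the baseline, against 1.18 for the error-free condition (Exp1).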


    Results

         Table: BLEU scores for different degrees of parallelism of the comparable
         corpus.
                 Experiment                   Dev        Test      # injected words
                 Baseline system             22.93       23.96     -
                 25% TEDbi                   23.11       24.40     ∼110k
                 50% TEDbi                   23.27       24.58     ∼215k
                 75% TEDbi                   23.43       24.42     ∼293k
                 100% TEDbi                  23.40       24.69     ∼393k

              The degree of similarity of the comparable corpus matters
              for both the performance of the extraction process
              and the quality of the extracted parallel sentences



    Results
          e.g. 1: Domain adaptation
                Source sentence: i wrote a story about genetically engineered food
                Baseline Sys:    j'ai écrit un article sur la nourriture génétiquement modifiée
                Adapted Sys:     j'ai écrit un article sur les produits alimentaires génétiquement modifiés

          e.g. 2: Oral vocabulary correction
                Source sentence: yeah you're right let's fix it
                Baseline Sys:    yeah tu as raison de réparer
                Adapted Sys:     euh oui tu as raison il faut réparer






    Conclusion


         Proposed to extend the exploitation of comparable corpora to
         multimodal comparable corpora, i.e. the source side is
         available as audio and the target side as text
         An encouraging result: source audio in one language was
         automatically aligned with texts in another language, without
         the need for human intervention to transcribe and translate the
         data
         Able to adapt a generic SMT system to the task of lecture
         translation by extracting parallel data from a multimodal
         comparable corpus





    Perspectives



          Apply this approach at a much larger scale, i.e. using hundreds of
          hours of speech and hundreds of millions of words
          Work on different specific domains or subdomains
          Iterate the process, using the extracted bitexts to
          translate the source sentences again
          Estimate the degree of similarity of the corpus before
          using it






         Brown, P. F., Lai, J. C., and Mercer, R. L. (1991).
         Aligning sentences in parallel corpora.
         In Proceedings of the 29th Annual Meeting of the ACL, pages 169–176.
         Munteanu, D. S. and Marcu, D. (2005).
         Improving machine translation performance by exploiting
         non-parallel corpora.
         Computational Linguistics, 31(4):477–504.
         Rauf, S. A. and Schwenk, H. (2011).
         Parallel sentence generation from comparable corpora for
         improved SMT.
         Machine Translation, 25(4):341–375.
         Resnik, P. and Smith, N. A. (2003).
         The web as a parallel corpus.
         Computational Linguistics, 29:349–380.


    Thank you






    Results (1)
         [Figure, left: BLEU score on dev (y-axis) against TER threshold (x-axis, 0 to 100),
          with curves for Exp1, Exp2 and Exp3, using SMT systems adapted with bitexts
          extracted from the ccb2 + 100% TEDbi index corpus.]
         [Figure, right: the same plot for bitexts extracted from the ccb2 + 75% TEDbi
          index corpus.]


                         The choice of the appropriate TER threshold
                         depends on the type of data



    Crawling the Web [Resnik and Smith, 2003]



         Search for web pages with similar URLs
               Many companies and organizations have their web pages in
               multiple languages
               Identified by a language ID, e.g.
                      http://x.../y.../z.en and http://x.../y.../z.fr
               Pages have links to parallel pages
         Webcrawler, which exploits this structural information
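
The URL-matching idea can be sketched as follows. This is a minimal illustration of pairing pages by a trailing language ID, not the method of [Resnik and Smith, 2003]; the example.org URLs and the function name are hypothetical.

```python
from collections import defaultdict


def pair_by_language_suffix(urls, langs=("en", "fr")):
    """Pair URLs that differ only in a trailing language ID such as .en / .fr."""
    by_stem = defaultdict(dict)
    for url in urls:
        stem, _, suffix = url.rpartition(".")
        if stem and suffix in langs:
            by_stem[stem][suffix] = url
    first, second = langs
    return [
        (group[first], group[second])
        for stem, group in sorted(by_stem.items())
        if first in group and second in group
    ]


# Hypothetical URLs for illustration only.
urls = [
    "http://example.org/about/team.en",
    "http://example.org/about/team.fr",
    "http://example.org/news/today.en",
]
pairs = pair_by_language_suffix(urls)
print(pairs)
```

The news page has no French counterpart in the list, so only the team page yields a candidate document pair.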






    Alignment Approach [Brown et al., 1991]



         Train initial lexicon based on parallel data
         Use lexicon to calculate alignment score between documents
         (or sentences)
               Typically IBM1
         Select most reliable document (sentence) pairs
         Add to parallel training data and retrain -> bootstrapping
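
The scoring step can be sketched with IBM Model 1. The lexicon values and the floor for unseen word pairs below are invented for illustration, and the constant factor of the model is dropped because it does not change the ranking of candidate pairs of the same length.

```python
import math

# Hypothetical lexicon t(f | e); the values are invented for illustration.
# In the approach above it would be trained on parallel data.
lexicon = {
    ("maison", "house"): 0.8,
    ("la", "the"): 0.7,
    ("la", "NULL"): 0.1,
    ("bleue", "blue"): 0.9,
}
FLOOR = 1e-6  # assumed smoothing value for word pairs missing from the lexicon


def ibm1_log_score(target, source):
    """Length-normalized IBM Model 1 log-likelihood of `target` given `source`."""
    src = ["NULL"] + source.split()
    tgt = target.split()
    if not tgt:
        return float("-inf")
    total = 0.0
    for f in tgt:
        # Sum over all source positions with a uniform alignment prior 1 / len(src).
        p = sum(lexicon.get((f, e), FLOOR) for e in src) / len(src)
        total += math.log(p)
    return total / len(tgt)


good = ibm1_log_score("la maison bleue", "the blue house")
bad = ibm1_log_score("la maison bleue", "stock prices fell")
print(good > bad)
```

Pairs whose score exceeds some cutoff would be added to the training data before the lexicon is retrained, which is the bootstrapping loop described above.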






    Finding Comparable Documents [Zhao and Vogel, 2002]




         Given comparable documents, find (nearly) parallel sentences
         Xinhua News Agency publishes news in English and Chinese
         Calculate similarity based on lexicon
         Iterative process






    CLIR Approach [Munteanu and Marcu, 2005]




                                    Figure: CLIR Approach





    Translation Approach [Rauf and Schwenk, 2011]




                               Figure: Translation Approach



Weitere ähnliche Inhalte

Was ist angesagt?

GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003butest
 
Information Flow based Ontology Mapping - 2002
Information Flow based Ontology Mapping - 2002Information Flow based Ontology Mapping - 2002
Information Flow based Ontology Mapping - 2002Yannis Kalfoglou
 
Generating Lexical Information for Terminology in a Bioinformatics Ontology
Generating Lexical Information for Terminologyin a Bioinformatics OntologyGenerating Lexical Information for Terminologyin a Bioinformatics Ontology
Generating Lexical Information for Terminology in a Bioinformatics OntologyHammad Afzal
 
Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...
Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...
Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...IOSR Journals
 
Text Mining using LDA with Context
Text Mining using LDA with ContextText Mining using LDA with Context
Text Mining using LDA with ContextSteffen Staab
 
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...ijaia
 
Topic Extraction on Domain Ontology
Topic Extraction on Domain OntologyTopic Extraction on Domain Ontology
Topic Extraction on Domain OntologyKeerti Bhogaraju
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingNimrita Koul
 
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...Editor IJARCET
 
Semantic Web languages: Expressivity vs scalability
Semantic Web languages: Expressivity vs scalabilitySemantic Web languages: Expressivity vs scalability
Semantic Web languages: Expressivity vs scalabilitynvitucci
 

Was ist angesagt? (15)

IJCTT-V4I9P137
IJCTT-V4I9P137IJCTT-V4I9P137
IJCTT-V4I9P137
 
GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003
 
Towards Integrating Ontologies An EDM-Based Approach
Towards Integrating Ontologies An EDM-Based ApproachTowards Integrating Ontologies An EDM-Based Approach
Towards Integrating Ontologies An EDM-Based Approach
 
Oop
OopOop
Oop
 
Information Flow based Ontology Mapping - 2002
Information Flow based Ontology Mapping - 2002Information Flow based Ontology Mapping - 2002
Information Flow based Ontology Mapping - 2002
 
Generating Lexical Information for Terminology in a Bioinformatics Ontology
Generating Lexical Information for Terminologyin a Bioinformatics OntologyGenerating Lexical Information for Terminologyin a Bioinformatics Ontology
Generating Lexical Information for Terminology in a Bioinformatics Ontology
 
Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...
Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...
Computational Intelligence Methods for Clustering of Sense Tagged Nepali Docu...
 
Text Mining using LDA with Context
Text Mining using LDA with ContextText Mining using LDA with Context
Text Mining using LDA with Context
 
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
 
Topic Extraction on Domain Ontology
Topic Extraction on Domain OntologyTopic Extraction on Domain Ontology
Topic Extraction on Domain Ontology
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
 
CALICO 2010 Workshop
CALICO 2010  Workshop CALICO 2010  Workshop
CALICO 2010 Workshop
 
Semantic Web languages: Expressivity vs scalability
Semantic Web languages: Expressivity vs scalabilitySemantic Web languages: Expressivity vs scalability
Semantic Web languages: Expressivity vs scalability
 
AICOL2015_paper_16
AICOL2015_paper_16AICOL2015_paper_16
AICOL2015_paper_16
 

Parallel text extraction from multimodal comparable corpora

  • 1. Introduction, Existing Works, Proposed Approach, Conclusion. Parallel text extraction from multimodal comparable corpora. Haithem Afli, Loïc Barrault and Holger Schwenk. LIUM, University of Le Maine, 72085 Le Mans cedex 9, FRANCE. FirstName.LastName@lium.univ-lemans.fr. Oct 22, 2012. JapTal 2012, Kanazawa - JAPAN.
  • 2. Outline. 1 Introduction and Context: Statistical Machine Translation; Parallel and Comparable Corpora. 2 Existing Works: Exploiting Comparable Corpora; Main Existing Methods. 3 Proposed Approach: System Architecture; Several Issues; Task Description; Experimental setup; Results. 4 Conclusion and Discussion.
  • 3. Statistical Machine Translation. Purpose: text translation. Approach: statistical, given by $t^* = \arg\max_t P(s|t)\,P(t)$. Modeling: translation model P(s|t) and language model P(t). Decoding algorithm: argmax. Some open source tools are available, like Moses and Joshua ⇒ needs parallel data.
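
The decoding rule on this slide can be illustrated with a minimal sketch, reading s as the source sentence and t as a candidate translation. It is not how Moses or Joshua search; it only scores a small, fixed candidate list. The probability tables and the `decode` helper are hypothetical, hand-made values chosen for illustration.

```python
import math

# Toy, hand-made probabilities (assumptions for illustration only).
# translation_model[(s, t)] stands in for P(s|t); language_model[t] for P(t).
translation_model = {
    ("le chat", "the cat"): 0.7,
    ("le chat", "cat the"): 0.7,
    ("le chat", "the dog"): 0.01,
}
language_model = {
    "the cat": 0.05,
    "cat the": 0.0001,
    "the dog": 0.05,
}


def decode(source, candidates):
    """Return the candidate t that maximizes log P(s|t) + log P(t)."""
    def score(t):
        return math.log(translation_model[(source, t)]) + math.log(language_model[t])
    return max(candidates, key=score)


best = decode("le chat", ["the cat", "cat the", "the dog"])
print(best)
```

In this toy setting the translation model cannot tell "the cat" from "cat the"; the language model term is what separates them, which is the role P(t) plays in the formula.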
  • 4. Parallel Corpora. Texts that are translations of each other. An essential resource for MT: they provide training data for statistical translation models and are also useful for other NLP applications. Expensive and time-consuming to prepare (translate, sentence-align, ...), and limited in size, language and domain ⇒ there is no better data than more data.
  • 5. Comparable Corpora. Generally not parallel, but with overlapping information. Readily available, mainly from newswire (AFP, Al Jazeera, BBC, ...). Much larger quantities than parallel corpora, in multiple languages and genres. Large collections are available for NLP tasks, e.g. the Gigaword corpora from LDC (English, Arabic, Chinese, French, Spanish).
  • 6. Exploiting comparable corpora. Extract parallel documents: using structural information; extending parallel sentence alignment algorithms. Extract parallel sentence pairs: with sentence alignment algorithms; with cross-lingual IR methods; with a translation approach.
  • 7. Main Existing Methods. Web crawling [Resnik and Smith, 2003]: use URLs to find matching documents. Alignment [Brown et al., 1991]: use word alignment models to judge how close a source and a target document (sentence) are. Cross-lingual IR [Munteanu and Marcu, 2005]: use a lexicon to translate source words and apply information retrieval techniques. Translation [Rauf and Schwenk, 2011]: use an SMT system to translate documents and apply information retrieval.
  • 8. Goal: exploiting multimodal comparable corpora. [Diagram: Text + Audio → Parallel text extraction → Parallel text]
  • 9. Proposed Approach. (1) Build a baseline SMT system (using generic data). (2) Transcribe the audio data. (3) Translate the transcribed text. (4) Use the translations as queries for IR to find the "matching" sentences in the target comparable corpus. (5) Use TER between the SMT translation and the retrieved sentences to detect the parallel ones.
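The last step of the approach above, filtering retrieved sentences by TER, could be sketched as follows. This is a minimal illustration and not the authors' implementation: real TER also counts block shifts, whereas `approx_ter` here is a plain word-level edit distance, and treating the retrieved sentence as the reference is an assumption. The function names and the 0 to 100 threshold scale are hypothetical choices made for this example.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            cur.append(min(prev[j] + 1,          # delete a hypothesis word
                           cur[j - 1] + 1,       # insert a reference word
                           prev[j - 1] + cost))  # match or substitute
        prev = cur
    return prev[-1]


def approx_ter(translation, candidate):
    """Shift-free approximation of TER, as a percentage.

    Assumption: the retrieved candidate plays the role of the reference.
    """
    hyp = translation.split()
    ref = candidate.split()
    if not ref:
        return 0.0 if not hyp else float("inf")
    return 100.0 * word_edit_distance(hyp, ref) / len(ref)


def select_parallel(triples, threshold):
    """Keep (source, candidate) pairs whose approximate TER is at most threshold.

    triples: iterable of (source transcription, SMT translation, IR candidate).
    """
    return [(src, cand) for src, trans, cand in triples
            if approx_ter(trans, cand) <= threshold]
```

A lower threshold keeps only near-identical sentences, while a higher one admits noisier pairs.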
  • 10. System Architecture. [Diagram: multimodal comparable corpora (Audio L1, Texts L2); Audio L1 → ASR → Transc. L1 → SMT → Transl. L2 → IR over Texts L2 → Text L2; output: Bitext]
  • 11. Several issues. Feasibility: are multimodal comparable corpora useful for extracting parallel text? Quality: can parallel text generated from multimodal corpora be as good as bitext extracted from comparable text? Effectiveness: since one of our motivations for exploiting comparable corpora is to adapt an SMT system to a specific domain, the extracted bitext needs to improve SMT performance.
  • 12. Task description (1). Analyze the impact of the errors of each module ⇒ we conducted three types of experiments. Exp 1: we use the reference translations as queries for the IR system → this is the most favorable condition; it simulates the case where the ASR and SMT systems make no errors. Exp 2: we use the reference transcription as input to the SMT system → in this case the errors come only from the SMT system, since no ASR is involved. Exp 3: the complete proposed framework → it corresponds to a real scenario.
  • 13. Task description (2). [Diagram: Exp 1: TEDbi.FR → IR → French text. Exp 2: TEDbi.En → SMT → TEDbi_tran.FR → IR → French text. Exp 3: TED audio → ASR → TEDasr.En → SMT → TEDasr_tran.FR → IR → French text. IR index in all three cases: ccb2 + %TrainTED.fr]
  • 14. Task description (3). Importance of the degree of similarity between the two parts of the comparable corpora ⇒ we artificially created four comparable corpora with different degrees of similarity: the source part of the comparable corpus is always the same; the target-language part consists of a large generic corpus plus 25%, 50%, 75% and 100%, respectively, of the reference translations. Evaluation of the approach: the extracted parallel data are injected back into the baseline system, and the resulting systems are evaluated using the BLEU score.
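The construction of the four target-side corpora described above could be sketched as follows. The slides do not say how the 25%, 50% and 75% subsets were chosen, so the random sampling with a fixed seed is an assumption, and `build_index_corpus` is a hypothetical helper name.

```python
import random


def build_index_corpus(generic, reference_translations, fraction, seed=0):
    """Return the target-side corpus: generic text plus a share of the references.

    Assumption: the share is drawn by uniform random sampling without
    replacement; the original selection procedure is not specified.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    k = round(fraction * len(reference_translations))
    rng = random.Random(seed)  # fixed seed so the corpora are reproducible
    sampled = rng.sample(list(reference_translations), k)
    return list(generic) + sampled


# Four corpora with increasing similarity to the source side.
generic = ["generic sentence %d" % i for i in range(100)]
references = ["reference translation %d" % i for i in range(8)]
corpora = {pct: build_index_corpus(generic, references, pct / 100)
           for pct in (25, 50, 75, 100)}
```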
  • 15. Experimental Setup: Data (TED task in IWSLT). Training bitexts (# words, in-domain?): nc7 (3.7M, no); eparl7 (56.4M, no); TEDasr (1.8M, yes); TEDbi (1.9M, yes). Development sets (# words): dev.outASR (36k); dev.refSMT (38k). Test sets (# words): tst.outASR (8.7k); tst.refSMT (9.1k).
  • 16. Experimental Setup: Modules. ASR: a five-pass system based on CMU Sphinx, with a WER of about 18%. SMT: a phrase-based system built with the Moses SMT toolkit and trained on generic bitext only; word alignments in both directions are calculated using GIZA++; phrases and lexical reordering are extracted using the default settings of the Moses toolkit; the parameters were tuned on dev.outASR using the MERT tool. IR: a system based on the Lemur IR toolkit; all target-language (French) text data is indexed, and the translated source-language (English) data is transformed into queries using the Indri Query Language.
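The query-building step of the IR module could be sketched as follows. The slides do not show the queries that were actually used; wrapping the terms in Indri's `#combine` operator, lowercasing, and dropping punctuation are assumptions made for this illustration, and `to_indri_query` is a hypothetical helper name.

```python
import re


def to_indri_query(sentence):
    """Turn one translated sentence into a bag-of-words Indri query string.

    Punctuation is dropped so that characters with a special meaning in the
    query language cannot break parsing. Returns an empty string when the
    sentence contains no word characters.
    """
    terms = re.findall(r"\w+", sentence.lower())
    if not terms:
        return ""
    return "#combine(" + " ".join(terms) + ")"


print(to_indri_query("J'ai écrit un article."))
```

A weighted query that favours rarer terms would be a natural refinement, but nothing in the slides indicates whether such weighting was used.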
  • 17. Results. Table: BLEU scores on dev and test after adaptation of a baseline system with bitexts extracted in conditions Exp1, Exp2 and Exp3 (100% TEDbi). Experiment (Dev, Test): Baseline system (22.93, 23.96); Exp1 (24.14, 25.14); Exp2 (23.90, 25.15); Exp3 (23.40, 24.69). The extracted sentences do improve the SMT system. The BLEU score of the adapted system matches that of Exp1 in most cases ⇒ errors introduced by the SMT and ASR systems have no major impact on the performance of the parallel sentence extraction algorithm.
  • 18. Results. Table: BLEU scores for different degrees of parallelism of the comparable corpus. Experiment (Dev, Test, # injected words): Baseline system (22.93, 23.96, -); 25% TEDbi (23.11, 24.40, ∼110k); 50% TEDbi (23.27, 24.58, ∼215k); 75% TEDbi (23.43, 24.42, ∼293k); 100% TEDbi (23.40, 24.69, ∼393k). The degree of similarity of the comparable corpus matters both for the performance of the extraction process and for the quality of the extracted parallel sentences.
  • 19. Results. Example 1 (domain adaptation). Source sentence: "i wrote a story about genetically engineered food". Baseline system: "j'ai écrit un article sur la nourriture génétiquement modifiée". Adapted system: "j'ai écrit un article sur les produits alimentaires génétiquement modifiés". Example 2 (oral vocabulary correction). Source sentence: "yeah you're right let's fix it". Baseline system: "yeah tu as raison de réparer". Adapted system: "euh oui tu as raison il faut réparer".
  • 20. Conclusion. We proposed to extend the exploitation of comparable corpora to multimodal comparable corpora, i.e. the source side is available as audio and the target side as text. The result is encouraging: we automatically aligned source audio in one language with texts in another language, without human intervention to transcribe and translate the data. We were able to adapt a generic SMT system to the task of lecture translation by extracting parallel data from a multimodal comparable corpus.
  • 21. Perspectives. Apply this approach at a much larger scale, i.e. using hundreds of hours of speech and hundreds of millions of words. Work on different specific domains or subdomains. Iterate the process, using the extracted bitexts to translate the source sentences again. Calculate the degree of similarity of the corpus before using it.
  • 22. References. Brown, P. F., Lai, J. C., and Mercer, R. L. (1991). Aligning sentences in parallel corpora. In Proceedings of the 29th Annual Meeting of the ACL, pages 169–176. Munteanu, D. S. and Marcu, D. (2005). Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4):477–504. Rauf, S. A. and Schwenk, H. (2011). Parallel sentence generation from comparable corpora for improved SMT. Machine Translation, 25(4):341–375. Resnik, P. and Smith, N. A. (2003). The web as a parallel corpus. Computational Linguistics, 29:349–380.
  • 23. Thank you
  • 24. Results (1). [Two plots: BLEU score on dev (y-axis, 22.5 to 24.5) against TER threshold (x-axis, 0 to 100), with one curve each for Exp1, Exp2 and Exp3.] Figure: BLEU score on dev using SMT systems adapted with bitexts extracted from the ccb2 + 100% TEDbi index corpus. Figure: BLEU score on dev using SMT systems adapted with bitexts extracted from the ccb2 + 75% TEDbi index corpus. The choice of the appropriate TER threshold depends on the type of data.
  • 25. Crawling the Web [Resnik and Smith, 2003]. Search for web pages with similar URLs. Many companies and organizations have their web pages in multiple languages, identified by a language ID, e.g. http://x.../y.../z.en and http://x.../y.../z.fr. Pages have links to their parallel pages. A web crawler exploits this structural information.
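The URL-matching idea above could be sketched as follows. This is a deliberately simplified illustration rather than the system of Resnik and Smith: it only pairs URLs that are identical except for a trailing language ID, and the helper name `pair_urls_by_language` and the example URLs are hypothetical.

```python
from collections import defaultdict


def pair_urls_by_language(urls, lang_a="en", lang_b="fr"):
    """Pair URLs that differ only in a trailing language ID (e.g. .en / .fr)."""
    by_stem = defaultdict(dict)
    for url in urls:
        stem, sep, suffix = url.rpartition(".")
        if sep and suffix in (lang_a, lang_b):
            by_stem[stem][suffix] = url
    # Keep only stems for which both language versions were seen.
    return [(found[lang_a], found[lang_b])
            for _, found in sorted(by_stem.items())
            if lang_a in found and lang_b in found]


urls = [
    "http://example.org/a/index.en",
    "http://example.org/a/index.fr",
    "http://example.org/b/page.en",  # no French counterpart, so it is skipped
]
pairs = pair_urls_by_language(urls)
```

Real sites encode the language in many other ways (path segments, query parameters, subdomains), so a practical crawler would need more patterns than this single suffix rule.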
  • 26. Alignment Approach [Brown et al., 1991]. Train an initial lexicon on parallel data. Use the lexicon to calculate an alignment score between documents (or sentences), typically with IBM1. Select the most reliable document (sentence) pairs. Add them to the parallel training data and retrain -> bootstrapping.
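The lexicon-based alignment score mentioned above could be sketched with an IBM Model 1 style likelihood. This is an illustrative sketch only: the length normalization, the probability floor for unseen word pairs, and the toy lexicon are assumptions, and `ibm1_log_score` is a hypothetical function name.

```python
import math


def ibm1_log_score(source, target, lexicon, floor=1e-6):
    """Length-normalized IBM Model 1 log-likelihood of target given source.

    lexicon maps (source_word, target_word) to a translation probability.
    Each target word is explained by the average over all source words,
    plus a NULL word, as in IBM Model 1.
    """
    if not target:
        return float("-inf")
    src = ["NULL"] + list(source)
    total = 0.0
    for t in target:
        p = sum(lexicon.get((s, t), 0.0) for s in src) / len(src)
        total += math.log(max(p, floor))  # floor avoids log(0) for unseen pairs
    return total / len(target)


# Toy lexicon with made-up probabilities, for illustration only.
lexicon = {("the", "la"): 0.5, ("house", "maison"): 0.8}
good = ibm1_log_score(["the", "house"], ["la", "maison"], lexicon)
bad = ibm1_log_score(["the", "house"], ["le", "chat"], lexicon)
```

In a bootstrapping loop, pairs scoring above some cut-off would be added to the training data before the lexicon is re-estimated.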
  • 27. Finding Comparable Documents [Zhao and Vogel, 2002]. Given comparable documents, find (nearly) parallel sentences. Xinhua News Agency publishes news in English and Chinese. Calculate similarity based on a lexicon. Iterative process.
  • 28. CLIR Approach [Munteanu and Marcu, 2005]. Figure: CLIR approach.
  • 29. Translation Approach [Rauf and Schwenk, 2011]. Figure: Translation approach.