Capturing Word-level
 Dependencies in Morpheme-
  based Language Modeling



  Martha Yifiru Tachbelie and Wolfgang Menzel

University of Hamburg, Department of Informatics,
         Natural Language Systems Group
Outline
   Language Modeling
   Morphology of Amharic
   Language Modeling for Amharic
    −   Capturing word-level dependencies
   Language Modeling Experiment
    −   Word segmentation
    −   Factored data preparation
    −   The language models
   Speech recognition experiment
    −   The baseline speech recognition system
    −   Lattice re-scoring experiment
   Conclusion and future work
Language Modeling
   Language models are fundamental to many
    natural language processing applications
   Statistical language models – the most
    widely used ones
    −   provide an estimate of the probability of a word
        sequence W for a given task
    −   require large training data
          data sparseness problem  – both serious for
          OOV problem              – morphologically rich languages

   Languages with rich morphology:
    −   high vocabulary growth rate => high perplexity
        and a large number of OOV words
         ➢   Sub-word units are used in language modeling
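The OOV problem mentioned above can be made concrete with a small sketch: the same held-out text has fewer unseen tokens when both training and test data are segmented into sub-word units. The tokens below are invented for illustration only, not drawn from the paper's corpus.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens whose type never occurs in the training data."""
    vocab = set(train_tokens)
    unseen = sum(1 for t in test_tokens if t not in vocab)
    return unseen / len(test_tokens)

# Toy, invented tokens: segmenting words into morphemes shrinks the
# vocabulary, so fewer test tokens are out-of-vocabulary.
train_words = ["sebere", "sebresh", "meta"]
test_words = ["sebern", "meta"]            # "sebern" unseen as a whole word
train_morphs = ["seber", "e", "seber", "esh", "meta"]
test_morphs = ["seber", "n", "meta"]       # only the suffix "n" is unseen

word_oov = oov_rate(train_words, test_words)     # 0.5
morph_oov = oov_rate(train_morphs, test_morphs)  # ~0.33
```

The same mechanism explains the OOV counts reported later in the deck (295 for the root-based models vs. 2,672 for the word-based models).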
Morphology of Amharic
   Amharic is one of the morphologically rich
    languages
   Spoken mainly in Ethiopia
    −   the second most widely spoken Semitic language
   Exhibits root-pattern, non-concatenative
    morphological phenomena
    −   e.g. the root sbr ('break')
   Uses different affixes to create inflectional
    and derivational word forms
➢   Data sparseness and OOV are serious
    problems
    ➔   Sub-word based language modeling has been
        recommended (Solomon, 2006)
Language Modeling for Amharic
   Sub-word based language models have
    been developed
   A substantial reduction in the OOV rate has
    been obtained
   Morphemes have been used as units in
    language modeling
       at the cost of losing word-level dependencies
   Solution:
    −   Higher-order n-grams => increased model complexity
    ➢   factored language modeling
Capturing Word-level
               Dependencies
   In FLM a word is viewed as a bundle (vector)
    of K parallel features or factors

                    W_n ≡ ⟨ f_n^1, f_n^2, ..., f_n^K ⟩

    −   Factors: linguistic features
   Some of the features can define the word
    ➢   the probability can be calculated on the basis of
        these features
   In Amharic: roots represent the lexical
    meaning of a word
    ➢   root-based models to capture word-level
        dependencies
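The factor-bundle view and a root-based model can be sketched in a few lines. The factor values and counts below are illustrative, and the additive smoothing is a simple stand-in for the Kneser-Ney smoothing actually used:

```python
from collections import Counter

# A word as a bundle of parallel factors, in the FLM view
# (factor values here are illustrative, not from the paper's annotation).
word = {"W": "yisebral", "POS": "verb", "R": "sbr", "PA": "yCeCC", "SU": "al"}

def root_bigram_prob(r_prev, r, bigrams, unigrams, vocab_size, alpha=0.1):
    """Additively smoothed bigram over roots: P(r | r_prev)."""
    return (bigrams[(r_prev, r)] + alpha) / (unigrams[r_prev] + alpha * vocab_size)

# Tiny toy root sequence to estimate counts from.
roots = ["sbr", "hyd", "sbr", "sbr", "hyd"]
unigrams = Counter(roots)
bigrams = Counter(zip(roots, roots[1:]))
p = root_bigram_prob("sbr", "hyd", bigrams, unigrams, vocab_size=2)
```

Conditioning on roots rather than surface forms is what lets the model capture word-level dependencies despite morpheme-based segmentation.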
Morphological Analysis

   There is a need for morphological analyser
    −   attempts (Bayou, 2000; Bayu,2002; Amsalu and
        Gibbon, 2005)
    −   suffer from lack of data
       cannot be used for our purpose

   Unsupervised morphology learning tools
    ➢   not applicable for this study
   Manual segmentation
    −   72,428 word types found in a corpus of 21,338
        sentences have been segmented
          ambiguities considered: polysemous or homonymous,
          geminated or non-geminated forms
Factored Data Preparation

   Each word is considered as a bundle of
    features: Word, POS, prefix, root, pattern
    and suffix
    −   W-Word:POS-noun:PR-prefix:R-root:PA-
        pattern:SU-suffix
    −   A given tag-value pair may be missing - the tag
        takes a special value 'null'
   When roots are considered in language
    modeling:
    −   words not derived from roots would otherwise
        be excluded
    ➢   for these words, stems are used instead
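Reading the factored representation back into a feature bundle is a simple parsing step. A minimal sketch (the example word and values are invented; real tokens follow the tag-value format shown above, with absent factors taking the special value 'null'):

```python
def parse_factored(token, tags=("W", "POS", "PR", "R", "PA", "SU")):
    """Parse a factored token such as 'W-sebere:POS-verb:R-sbr' into a
    dict; factors absent from the token get the special value 'null'."""
    factors = dict.fromkeys(tags, "null")
    for part in token.split(":"):
        tag, sep, value = part.partition("-")
        if sep and tag in factors:
            factors[tag] = value
    return factors

f = parse_factored("W-sebere:POS-verb:R-sbr:SU-e")
# f["R"] is "sbr"; the missing prefix factor f["PR"] is "null"
```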
Factored Data Preparation --
           cont.

    [Diagram: the manually segmented word list and the
     text corpus are fed into the factored representation
     step, which produces the factored data]
The Language Models

   The corpus is divided into training and test
    sets (80:10:10)
●   Root-based models of order 2 to 5 have been
    developed
    ●   smoothed with the Kneser-Ney smoothing technique
The Language Models -- cont.
   Perplexity of root-based models on
    development test set
                  Root ngram Perplexity
                 Bigram       278.57
                 Trigram      223.26
                 Quadrogram   213.14
                 Pentagram    211.93


   The largest improvement: bigram vs. trigram
   Only 295 OOV tokens
   The best model has:
    −   a logprob of -53102.3
    −   a perplexity of 204.95 on the evaluation test set
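The reported logprob and perplexity are linked by PP = 10^(-logprob/N), assuming the usual SRILM-style base-10 convention (whether sentence-boundary tokens are counted in N varies by tool). Under that assumption, the token count of the evaluation set can be backed out from the two reported numbers:

```python
import math

def perplexity(logprob10, n_tokens):
    """Perplexity from a base-10 corpus log-probability:
    PP = 10 ** (-logprob / N)."""
    return 10 ** (-logprob10 / n_tokens)

# Back out the (approximate) token count implied by the reported
# logprob of -53102.3 and perplexity of 204.95.
n_tokens = 53102.3 / math.log10(204.95)  # roughly 23k tokens
pp = perplexity(-53102.3, n_tokens)      # recovers 204.95
```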
The Language Models -- cont.
   Word-based model
    −   with the same training data
    −   smoothed with Kneser-Ney smoothing

                 Word ngram Perplexity
                 Bigram      1148.76
                 Trigram      989.95
                 Quadrogram   975.41
                 Pentagram    972.58


   The largest improvement: bigram vs. trigram
   2,672 OOV tokens
   The best model has a logprob of -61106.0
The Language Models -- cont.
   Word-based models that use an additional
    feature in the ngram history have also been
    developed
              Language models  Perplexity
             W/W2,POS2,W1,POS1  885.81
             W/W2,PR2,W1,PR1    857.61
             W/W2,R2,W1,R1      896.59
             W/W2,PA2,W1,PA1    958.31
             W/W2,SU2,W1,SU1    898.89

   Root-based models seem better than all the
    others, but might be less constraining
    ➢   Speech recognition experiment – lattice
        rescoring
Speech Recognition Experiment
   The baseline speech recognition system
    (Abate, 2006)
   Acoustic model:
    −   trained on 20 hours of read speech corpus
    −   a set of intra-word triphone HMMs with 3 emitting
        states and 12 Gaussian mixtures per state
   The language model
    −   trained on a corpus consisting of 77,844
        sentences (868,929 tokens or 108,523 types)
    −   a closed vocabulary backoff bigram model
    −   smoothed with the absolute discounting method
    −   perplexity of 91.28 on a test set that consists of
        727 sentences (8,337 tokens)
Speech Recognition Experiment
          -- cont.
   Performance:
    −   5k development test set (360 sentences read by
        20 speakers) has been used to generate the
        lattices
    −   Lattices have been generated from the 100 best
        alternatives for each sentence
    −   Best path transcription has been decoded
            91.67% word recognition accuracy
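The rescoring used in the following slides combines acoustic and language-model scores over the lattice alternatives. A minimal sketch as N-best rescoring (all scores, weights, and hypotheses below are hypothetical; real lattice rescoring works on the lattice graph directly):

```python
def rescore_nbest(hypotheses, lm_score, lm_weight=10.0):
    """Return the hypothesis maximizing acoustic + weighted LM score.
    Each hypothesis is a (words, acoustic_logprob) pair."""
    return max(hypotheses, key=lambda h: h[1] + lm_weight * lm_score(h[0]))

# Hypothetical scores for two competing transcriptions.
hyps = [(["a", "b"], -100.0), (["a", "c"], -102.0)]
lm = {("a", "b"): -3.0, ("a", "c"): -1.0}

def bigram_lm(words):
    # Back off to a fixed penalty for unseen bigrams.
    return sum(lm.get(pair, -5.0) for pair in zip(words, words[1:]))

best = rescore_nbest(hyps, bigram_lm)
```

Here the language model overturns the acoustically preferred hypothesis, which is exactly the effect the lattice-rescoring experiments measure.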
Speech Recognition Experiment
          -- cont.
   To make the results comparable
    −   root-based and factored language models have
        been developed
             [Diagram: the corpus used in the baseline
              system is converted to a factored version,
              from which root-based and factored
              language models are trained]
Speech Recognition Experiment
          -- cont.
   Perplexity of root-based models trained on
    the corpus used in the baseline speech rec.

            Root ngram   Perplexity   Logprob
           Bigram         113.57      -18628.9
           Trigram         24.63      -12611.8
           Quadrogram      11.20      -9510.29
           Pentagram        8.72      -8525.42
Speech Recognition Experiment
          -- cont.
   Perplexity of factored models

           Language models   Perplexity   Logprob
         W/W2,POS2,W1,POS1     10.61      -9298.57
         W/W2,PR2,W1,PR1       10.67      -9322.02
         W/W2,R2,W1,R1         10.36       -9204.7
         W/W2,PA2,W1,PA1       10.89      -9401.08
         W/W2,SU2,W1,SU1       10.70      -9330.96
Speech Recognition Experiment
          -- cont.
   Word lattice to factored lattice
    −   [Diagram: the word lattice is converted to a
        factored lattice; decoding the best path with the
        factored word bigram model (FBL) gives a word
        recognition accuracy of 91.60%]
Speech Recognition Experiment
          -- cont.
   WRA with factored models

               Language models        WRA in %
         Factored word bigram (FBL)    91.60
         FBL + W/W2,POS2,W1,POS1       93.60
         FBL + W/W2,PR2,W1,PR1         93.82
         FBL + W/W2,R2,W1,R1           93.65
         FBL + W/W2,PA2,W1,PA1         93.68
         FBL + W/W2,SU2,W1,SU1         93.53
Speech Recognition Experiment
          -- cont.
   WRA with root-based language models

                Language models        WRA in %
          Factored word bigram (FBL)    91.60
          FBL + Bigram                  90.77
          FBL + Trigram                 90.87
          FBL + Quadrogram              90.99
          FBL + Pentagram               91.14
Conclusion and Future Work
   Root-based models have low perplexity and
    high logprob
   But they did not improve word recognition
    accuracy
   Future work: improving these models by adding
    other word features while still maintaining
    word-level dependencies
   Exploring other ways of integrating root-based
    models into a speech recognition system
Thank you

Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

Capturing Word-level Dependencies
   In FLM, a word is viewed as a bundle (vector) of K parallel features or factors:

        w_n ≡ {f_n^1, f_n^2, ..., f_n^K}

    −   Factors: linguistic features
   Some of the features can define the word
    ➢   the probability can be calculated on the basis of these features
   In Amharic, roots represent the lexical meaning of a word
    ➢   root-based models to capture word-level dependencies

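The root-based idea can be sketched in a few lines of Python: strip each segmented word down to its root (or its stem, for words not derived from a root) and collect root n-gram counts, so that lexical dependencies between content words survive morpheme segmentation. The toy sentences and transliterations below are invented for illustration, not taken from the corpus.

```python
from collections import Counter

def root_bigrams(sentences):
    """sentences: lists of (surface, root) pairs; root is None for words
    not derived from a root, in which case the surface/stem is used."""
    counts = Counter()
    for sent in sentences:
        roots = [root if root is not None else surface
                 for surface, root in sent]
        counts.update(zip(roots, roots[1:]))
    return counts

# Invented toy data: 'sebbere' standing in for a verb with root 'sbr'.
toy = [[("lijju", None), ("sebbere", "sbr")],
       [("sebbere", "sbr"), ("degeme", "dgm")]]
counts = root_bigrams(toy)
```

The same counts, fed into a standard n-gram estimator, give the root-based models discussed below.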
Morphological Analysis
   A morphological analyser is needed
    −   attempts exist (Bayou, 2000; Bayu, 2002; Amsalu and Gibbon, 2005)
    −   but they suffer from a lack of data
          they cannot be used for our purpose
   Unsupervised morphology learning tools
    ➢   not applicable for this study
   Manual segmentation
    −   72,428 word types found in a corpus of 21,338 sentences have been segmented
          polysemous or homonymous
          geminated or non-geminated

Factored Data Preparation
   Each word is considered as a bundle of features: word, POS, prefix, root, pattern and suffix
    −   W-Word:POS-noun:PR-prefix:R-root:PA-pattern:SU-suffix
    −   A given tag-value pair may be missing; the tag then takes the special value 'null'
   When roots are considered in language modeling:
    −   words not derived from roots would be excluded
    ➢   stems of these words are used instead

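The tagged notation above can be produced mechanically. A minimal sketch (the factor tags follow the slide; the example word and factor values are hypothetical):

```python
FACTOR_TAGS = ["W", "POS", "PR", "R", "PA", "SU"]

def to_factored(factors):
    """Serialize a factor bundle into 'TAG-value:TAG-value' notation;
    any missing factor gets the special value 'null'."""
    return ":".join(f"{tag}-{factors.get(tag, 'null')}" for tag in FACTOR_TAGS)

def from_factored(token):
    """Parse a factored token back into a tag -> value dict."""
    return dict(part.split("-", 1) for part in token.split(":"))

# Hypothetical verb form derived from the root 'sbr'.
token = to_factored({"W": "sebbere", "POS": "verb", "R": "sbr"})
```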
Factored Data Preparation -- cont.
   [Diagram: text corpus + manually segmented word list → factored representation → factored data]

The Language Models
   The corpus is divided into training, development test and evaluation test sets (80:10:10)
    ●   Root-based models of order 2 to 5 have been developed
    ●   smoothed with the Kneser-Ney smoothing technique

The Language Models -- cont.
   Perplexity of root-based models on the development test set

        Root ngram     Perplexity
        Bigram         278.57
        Trigram        223.26
        Quadrogram     213.14
        Pentagram      211.93

   The largest improvement is from bigram to trigram
   Only 295 OOV words
   The best model has:
    −   a logprob of -53102.3
    −   a perplexity of 204.95 on the evaluation test set

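For reference, perplexity and the (base-10) total logprob reported by toolkits such as SRILM are two views of the same quantity: ppl = 10^(-logprob / N), where N is the number of scored tokens. A small sanity check in Python (the numbers used here are illustrative, not the actual test-set sizes):

```python
def perplexity(logprob10, n_tokens):
    """Convert a base-10 total log-probability over n_tokens scored
    tokens (words plus end-of-sentence events) into perplexity."""
    return 10 ** (-logprob10 / n_tokens)

# e.g. a total logprob of -200 over 100 tokens gives perplexity 100
ppl = perplexity(-200.0, 100)
```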
The Language Models -- cont.
   A word-based model
    −   with the same training data
    −   smoothed with Kneser-Ney smoothing

        Word ngram     Perplexity
        Bigram         1148.76
        Trigram        989.95
        Quadrogram     975.41
        Pentagram      972.58

   The largest improvement is from bigram to trigram
   2,672 OOV words
   The best model has a logprob of -61106.0

The Language Models -- cont.
   Word-based models that use an additional feature in the ngram history have also been developed

        Language model           Perplexity
        W/W2,POS2,W1,POS1        885.81
        W/W2,PR2,W1,PR1          857.61
        W/W2,R2,W1,R1            896.59
        W/W2,PA2,W1,PA1          958.31
        W/W2,SU2,W1,SU1          898.89

   Root-based models seem better than all the others, but might be less constraining
    ➢   Speech recognition experiment – lattice rescoring

Speech Recognition Experiment
   The baseline speech recognition system (Abate, 2006)
   Acoustic model:
    −   trained on a 20-hour read-speech corpus
    −   a set of intra-word triphone HMMs with 3 emitting states and 12 Gaussian mixtures
   The language model:
    −   trained on a corpus of 77,844 sentences (868,929 tokens, 108,523 types)
    −   a closed-vocabulary backoff bigram model
    −   smoothed with the absolute discounting method
    −   perplexity of 91.28 on a test set of 727 sentences (8,337 tokens)

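A minimal sketch of the baseline's smoothing scheme, absolute discounting for a backoff bigram: a fixed discount d is subtracted from every seen bigram count, and the freed mass is redistributed over unseen continuations in proportion to their unigram probabilities. This is a simplification (single discount, unigram MLE backoff, no sentence-boundary handling) run on an invented toy corpus, not the actual training data.

```python
from collections import Counter

def train_abs_discount_bigram(tokens, d=0.5):
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    vocab = set(tokens)
    # N1+(v, .): number of distinct words observed after v
    followers = Counter(v for (v, _w) in bi)

    def prob(v, w):
        if (v, w) in bi:
            return (bi[(v, w)] - d) / uni[v]
        # mass freed by discounting, spread over unseen continuations
        # in proportion to their unigram MLE
        alpha = d * followers[v] / uni[v]
        unseen = sum(uni[u] for u in vocab if (v, u) not in bi) / total
        return alpha * (uni[w] / total) / unseen
    return prob

p = train_abs_discount_bigram("a b a b a c".split())
```

On this toy corpus the distribution for a non-final history sums to one, e.g. p("a", ·) over {a, b, c}.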
Speech Recognition Experiment -- cont.
   Performance:
    −   the 5k development test set (360 sentences read by 20 speakers) has been used to generate the lattices
    −   lattices have been generated from the 100 best alternatives for each sentence
    −   the best-path transcription has been decoded
          91.67% word recognition accuracy

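Word recognition accuracy in HTK-style scoring is WRA = (N - S - D - I) / N x 100, where S, D and I are the substitutions, deletions and insertions of the minimum-cost alignment between reference and hypothesis. A generic sketch (not the scoring tool actually used in the experiments):

```python
def word_accuracy(ref, hyp):
    """WRA = (N - S - D - I) / N * 100 via Levenshtein alignment."""
    n, m = len(ref), len(hyp)
    # dp[i][j]: minimal edit cost aligning ref[:i] with hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i          # deletions
    for j in range(m + 1):
        dp[0][j] = j          # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return 100.0 * (n - dp[n][m]) / n
```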
Speech Recognition Experiment -- cont.
   To make the results comparable, root-based and factored language models have been developed
   [Diagram: corpus used in the baseline system → factored version → root-based and factored language models]

Speech Recognition Experiment -- cont.
   Perplexity of root-based models trained on the corpus used in the baseline speech recognition system

        Root ngram     Perplexity    Logprob
        Bigram         113.57        -18628.9
        Trigram        24.63         -12611.8
        Quadrogram     11.20         -9510.29
        Pentagram      8.72          -8525.42

Speech Recognition Experiment -- cont.
   Perplexity of factored models

        Language model           Perplexity    Logprob
        W/W2,POS2,W1,POS1        10.61         -9298.57
        W/W2,PR2,W1,PR1          10.67         -9322.02
        W/W2,R2,W1,R1            10.36         -9204.7
        W/W2,PA2,W1,PA1          10.89         -9401.08
        W/W2,SU2,W1,SU1          10.70         -9330.96

Speech Recognition Experiment -- cont.
   Word lattice to factored lattice
   [Diagram: word lattice → factored version → factored lattice; rescored with the factored word bigram model (FBL) → best-path transcription]
   Best-path transcription: 91.60% word recognition accuracy

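The word-lattice-to-factored-lattice step amounts to replacing every word label on a lattice edge with its factored bundle, looked up in the segmented word list. A sketch under invented data structures (the edge tuples and lexicon format are assumptions, not the actual lattice format):

```python
def factorize_lattice(edges, factored_lexicon):
    """edges: (start_node, end_node, word, acoustic_score) tuples.
    Returns edges relabeled with factored tokens; words missing from
    the lexicon keep 'null' for every factor except the word form."""
    out = []
    for start, end, word, score in edges:
        label = factored_lexicon.get(
            word, f"W-{word}:POS-null:PR-null:R-null:PA-null:SU-null")
        out.append((start, end, label, score))
    return out

# Toy lexicon entry and lattice edges (hypothetical values).
lex = {"sebbere": "W-sebbere:POS-verb:PR-null:R-sbr:PA-null:SU-null"}
edges = factorize_lattice([(0, 1, "sebbere", -3.2), (0, 1, "unk", -5.0)], lex)
```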
Speech Recognition Experiment -- cont.
   WRA with factored models

        Language model               WRA in %
        Factored word bigram (FBL)   91.60
        FBL + W/W2,POS2,W1,POS1      93.60
        FBL + W/W2,PR2,W1,PR1        93.82
        FBL + W/W2,R2,W1,R1          93.65
        FBL + W/W2,PA2,W1,PA1        93.68
        FBL + W/W2,SU2,W1,SU1        93.53

Speech Recognition Experiment -- cont.
   WRA with root-based language models

        Language model               WRA in %
        Factored word bigram (FBL)   91.60
        FBL + Bigram                 90.77
        FBL + Trigram                90.87
        FBL + Quadrogram             90.99
        FBL + Pentagram              91.14

Conclusion and Future Work
   Root-based models have low perplexity and high logprob
   But they did not contribute to an improvement of word recognition accuracy
   Future work:
    −   improving these models by adding other word features while still maintaining word-level dependencies
    −   other ways of integrating the root-based models into a speech recognition system