Capturing Word-level Dependencies in Morpheme-based Language Modeling
1. Capturing Word-level Dependencies in Morpheme-based Language Modeling
Martha Yifiru Tachbelie and Wolfgang Menzel
University of Hamburg, Department of Informatics,
Natural Language Systems Group
2. Outline
Language Modeling
Morphology of Amharic
Language Modeling for Amharic
− Capturing word-level dependencies
Language Modeling Experiment
− Word segmentation
− Factored data preparation
− The language models
Speech recognition experiment
− The baseline speech recognition system
− Lattice re-scoring experiment
Conclusion and future work
3. Language Modeling
Language models are fundamental to many natural language processing applications
Statistical language models – the most widely used ones
− provide an estimate of the probability of a word sequence W for a given task
− require large amounts of training data
➔ data sparseness and OOV problems: serious for morphologically rich languages
Languages with rich morphology:
− high vocabulary growth rate => high perplexity and a large number of OOVs
➢ Sub-word units are used in language modeling
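As an illustration of the probability estimate mentioned above, here is a minimal Python sketch of an unsmoothed bigram model with a perplexity computation; the helper names and toy setup are ours, not from the study, and real models add smoothing (see the later slides):

    from collections import Counter
    import math

    def train_bigram(sentences):
        """Collect unigram and bigram counts over boundary-padded sentences."""
        uni, bi = Counter(), Counter()
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            uni.update(tokens)
            bi.update(zip(tokens, tokens[1:]))
        return uni, bi

    def perplexity(sentences, uni, bi):
        """exp of the average negative log-probability per predicted token.
        Unsmoothed, so every test bigram must have been seen in training."""
        logprob, n = 0.0, 0
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            for h, w in zip(tokens, tokens[1:]):
                logprob += math.log(bi[(h, w)] / uni[h])  # ML estimate of P(w|h)
                n += 1
        return math.exp(-logprob / n)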
4. Morphology of Amharic
Amharic is one of the morphologically rich
languages
Spoken mainly in Ethiopia
− the second most widely spoken Semitic language
Exhibits the root-pattern, non-concatenative morphological phenomenon
− e.g. the root sbr ('break'); see the sketch below
Uses different affixes to create inflectional
and derivational word forms
➢ Data sparseness and OOV are serious
problems
➔ Sub-word based language modeling has been
recommended (Solomon, 2006)
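To make the root-pattern phenomenon concrete, here is a small Python sketch of non-concatenative word formation; the pattern notation, the surface transcription 'sebbere' and its gloss are illustrative approximations on our part, not data from the study:

    def apply_pattern(root, pattern):
        """Fill numbered slots (1-based indices into the consonantal root)
        in a vocalic pattern; repeating a digit models gemination."""
        return "".join(root[int(ch) - 1] if ch.isdigit() else ch
                       for ch in pattern)

    # Illustrative only: the slide's root 'sbr' plus a perfective-like pattern.
    apply_pattern("sbr", "1e22e3e")  # -> 'sebbere', approx. "he broke"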
5. Language Modeling for Amharic
Sub-word based language models have
been developed
A substantial reduction in the OOV rate has been obtained
Morphemes have been used as units in language modeling
➔ loss of word-level dependencies
Solutions:
− higher-order n-grams => increased model complexity
➢ factored language modeling
6. Capturing Word-level Dependencies
In FLM, a word is viewed as a bundle or vector of K parallel features or factors
$w_n \equiv \{ f_n^1, f_n^2, \ldots, f_n^K \}$
− Factors: linguistic features
Some of the features can define the word
➢ the probability can be calculated on the basis of
these features
In Amharic: roots represent the lexical
meaning of a word
➢ root-based models to capture word-level
dependencies
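A minimal count-based sketch of a root-based model in this factored view, assuming each token has already been analysed into a factor bundle; the estimator and names are ours, and a real FLM adds smoothing and generalized backoff:

    from collections import Counter

    def root_trigram_counts(factored_sents):
        """factored_sents: lists of dicts such as {'word': ..., 'root': ...}.
        Because one root stands in for a whole word, a root trigram still
        spans three words of context (the word-level dependency we want)."""
        full, hist = Counter(), Counter()
        for sent in factored_sents:
            roots = ["<s>", "<s>"] + [t["root"] for t in sent] + ["</s>"]
            for i in range(2, len(roots)):
                full[tuple(roots[i - 2:i + 1])] += 1
                hist[tuple(roots[i - 2:i])] += 1
        return full, hist

    def p_root(r, h1, h2, full, hist):
        """ML estimate of P(r | h1, h2); smoothing omitted for brevity."""
        return full[(h1, h2, r)] / hist[(h1, h2)]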
7. Morphological Analysis
There is a need for a morphological analyser
− attempts exist (Bayou, 2000; Bayu, 2002; Amsalu and Gibbon, 2005)
− but they suffer from a lack of data and cannot be used for our purpose
Unsupervised morphology learning tools
➢ not applicable for this study
Manual segmentation
− 72,428 word types found in a corpus of 21,338 sentences have been segmented
− ambiguities had to be resolved manually: polysemous or homonymous forms, geminated or non-geminated forms (see the sketch below)
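A sketch of how such a manually segmented word list could be applied to a corpus; the file format and helper names are assumptions of ours. As noted above, ambiguous types (polysemy, gemination) are exactly what a type-level lookup cannot resolve automatically:

    def load_segmentations(path):
        """Read 'word<TAB>prefix root pattern suffix' entries into a dict.
        (This exact file format is an assumption for illustration.)"""
        seg = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, analysis = line.rstrip("\n").split("\t", 1)
                seg[word] = analysis.split()
        return seg

    def segment_corpus(sentences, seg):
        """Replace each word type by its manual segmentation where available."""
        return [[seg.get(w, [w]) for w in sent] for sent in sentences]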
8. Factored Data Preparation
Each word is considered as a bundle of features: word, POS, prefix, root, pattern and suffix
− W-word:POS-noun:PR-prefix:R-root:PA-pattern:SU-suffix
− A given tag-value pair may be missing - the tag
takes a special value 'null'
When roots are considered in language modeling:
− words not derived from roots would be excluded
➢ therefore, the stems of these words are used instead
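A sketch of emitting the factored representation described above, with 'null' for missing tags and a stem standing in for the root of non-root-derived words (both behaviors taken from this slide); the function itself is our illustration:

    def factored_token(word, pos=None, prefix=None, root=None,
                       pattern=None, suffix=None, stem=None):
        """Render one token as 'W-...:POS-...:PR-...:R-...:PA-...:SU-...'."""
        if root is None and stem is not None:
            root = stem  # non-root-derived word: use its stem instead
        fields = [("W", word), ("POS", pos), ("PR", prefix),
                  ("R", root), ("PA", pattern), ("SU", suffix)]
        return ":".join(f"{tag}-{val if val is not None else 'null'}"
                        for tag, val in fields)

    factored_token("sebbere", pos="verb", root="sbr", pattern="1e22e3e")
    # -> 'W-sebbere:POS-verb:PR-null:R-sbr:PA-1e22e3e:SU-null'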
9. Factored Data Preparation -- cont.
[Diagram: the text corpus and the manually segmented word list are combined into a factored representation, yielding the factored data.]
10. The Language Models
The corpus is divided into training, development test and evaluation test sets (80:10:10)
● Root-based models of order 2 to 5 have been developed
● smoothed with the Kneser-Ney smoothing technique (a training sketch follows)
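The slides do not name a toolkit, so as a minimal sketch, here is how Kneser-Ney-smoothed n-gram models of a given order could be trained and evaluated with NLTK's lm module (our choice of library, not necessarily the authors'); out-of-vocabulary handling is glossed over:

    from nltk.lm import KneserNeyInterpolated
    from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
    from nltk.util import ngrams

    def train_kn(train_sents, order):
        """Fit an order-n interpolated Kneser-Ney model on tokenized sentences."""
        train_data, vocab = padded_everygram_pipeline(order, train_sents)
        lm = KneserNeyInterpolated(order)
        lm.fit(train_data, vocab)
        return lm

    def test_perplexity(lm, test_sents, order):
        """Perplexity over boundary-padded test n-grams."""
        test_ngrams = [ng for sent in test_sents
                       for ng in ngrams(pad_both_ends(sent, n=order), order)]
        return lm.perplexity(test_ngrams)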
11. The Language Models -- cont.
Perplexity of root-based models on
development test set
Root n-gram    Perplexity
Bigram         278.57
Trigram        223.26
Quadrogram     213.14
Pentagram      211.93
The largest improvement: bigram vs. trigram
Only 295 OOVs
The best model has:
− a logprob of -53102.3
− a perplexity of 204.95 on the evaluation test set
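As a consistency check (assuming base-10 log-probabilities over N test tokens, the convention of common LM toolkits): perplexity and logprob are related by PPL = 10^{-logprob/N}, so N ≈ 53102.3 / log10(204.95) ≈ 22,970 root tokens, a plausible size for a 10% test split of the corpus.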
12. The Language Models -- cont.
Word-based model
− with the same training data
− smoothed with Kneser-Ney smoothing
Word n-gram    Perplexity
Bigram         1148.76
Trigram        989.95
Quadrogram     975.41
Pentagram      972.58
The largest improvement: bigram vs. trigram
2,672 OOVs
The best model has a logprob of -61106.0
13. The Language Models -- cont.
Word-based models that use additional features in the n-gram history have also been developed
Language model         Perplexity
W/W2,POS2,W1,POS1      885.81
W/W2,PR2,W1,PR1        857.61
W/W2,R2,W1,R1          896.59
W/W2,PA2,W1,PA1        958.31
W/W2,SU2,W1,SU1        898.89
Root-based models seem better than all the
others, but might be less constraining
➢ Speech recognition experiment – lattice
rescoring
14. Speech Recognition Experiment
The baseline speech recognition system
(Abate, 2006)
Acoustic model:
− trained on a 20-hour read-speech corpus
− a set of intra-word triphone HMMs with 3 emitting states and 12 Gaussian mixtures
The language model
− trained on a corpus consisting of 77,844
sentences (868,929 tokens or 108,523 types)
− a closed vocabulary backoff bigram model
− smoothed with the absolute discounting method (see the formula below)
− perplexity of 91.28 on a test set that consists of
727 sentences (8,337 tokens)
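For reference, a standard bigram formulation of absolute discounting (one of several variants; the slide does not specify which was used) is

P(w | h) = max(c(h, w) − D, 0) / c(h) + λ(h) · P(w)

where 0 < D < 1 is a fixed discount and λ(h) redistributes the discounted probability mass over the lower-order distribution.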
15. Speech Recognition Experiment -- cont.
Performance:
− 5k development test set (360 sentences read by
20 speakers) has been used to generate the
lattices
− Lattices have been generated from the 100 best
alternatives for each sentence
− Best path transcription has been decoded
➔ 91.67% word recognition accuracy
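Word recognition accuracy here is presumably the usual (N − S − D − I)/N measure; since the minimal edit distance equals S + D + I under the optimal alignment, it can be computed as in this sketch (names ours):

    def wra(ref, hyp):
        """100 * (N - S - D - I) / N via Levenshtein alignment of word lists."""
        n, m = len(ref), len(hyp)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i
        for j in range(m + 1):
            d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution or match
        return 100.0 * (n - d[n][m]) / n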
16. Speech Recognition Experiment -- cont.
To make the results comparable
− root-based and factored language models have been developed on the corpus used in the baseline system
[Diagram: the corpus used in the baseline system is converted into a factored version, from which the root-based and factored language models are trained.]
17. Speech Recognition Experiment -- cont.
Perplexity of root-based models trained on the corpus used in the baseline speech recognition system
Root n-gram    Perplexity    Logprob
Bigram         113.57        -18628.9
Trigram        24.63         -12611.8
Quadrogram     11.20         -9510.29
Pentagram      8.72          -8525.42
19. Speech Recognition Experiment -- cont.
Word lattice to factored lattice
[Diagram: each word lattice is converted into its factored version, the factored lattice; decoding the best path transcription with the factored word bigram model (FBL) yields 91.60% word recognition accuracy.]
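A minimal sketch of the rescoring idea, treating the lattice as its 100-best path list and combining the baseline score with the new language model score by linear interpolation; the weighting scheme and names are our assumptions, not the authors' exact setup:

    def rescore(paths, new_lm_logprob, weight=0.5):
        """paths: list of (word_sequence, baseline_score) from the lattice.
        Returns the best-path transcription under the combined score."""
        scored = [(words,
                   (1 - weight) * score + weight * new_lm_logprob(words))
                  for words, score in paths]
        return max(scored, key=lambda p: p[1])[0]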
21. Speech Recognition Experiment -- cont.
WRA with root-based language models
Language model                  WRA in %
Factored word bigram (FBL)      91.60
FBL + Bigram                    90.77
FBL + Trigram                   90.87
FBL + Quadrogram                90.99
FBL + Pentagram                 91.14
22. Conclusion and Future Work
Root-based models have low perplexity and
high logprob
But they did not contribute to an improvement in word recognition accuracy
Future work:
− improving these models by adding other word features while still maintaining word-level dependencies
− other ways of integrating the root-based models into a speech recognition system