Capturing Word-level Dependencies in Morpheme-based Language Modeling
1. Capturing Word-level Dependencies in Morpheme-based Language Modeling
Martha Yifiru Tachbelie and Wolfgang Menzel
University of Hamburg, Department of Informatics,
Natural Language Systems Group
2. Outline
Language Modeling
Morphology of Amharic
Language Modeling for Amharic
− Capturing word-level dependencies
Language Modeling Experiment
− Word segmentation
− Factored data preparation
− The language models
Speech recognition experiment
− The baseline speech recognition system
− Lattice re-scoring experiment
Conclusion and future work
3. Language Modeling
Language models are fundamental to many natural language processing applications
Statistical language models – the most widely used ones
− provide an estimate of the probability of a word sequence W for a given task
− require large amounts of training data
➔ data sparseness and OOV problems: serious for morphologically rich languages
Languages with rich morphology:
− high vocabulary growth rate => high perplexity and a large number of OOVs
➢ Sub-word units are used in language modeling
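As an illustration of the probability estimate mentioned above, here is a minimal Python sketch of an unsmoothed bigram model with a perplexity computation; the helper names and toy setup are ours, not from the study, and real models add smoothing (see the later slides):

    from collections import Counter
    import math

    def train_bigram(sentences):
        """Collect unigram and bigram counts over boundary-padded sentences."""
        uni, bi = Counter(), Counter()
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            uni.update(tokens)
            bi.update(zip(tokens, tokens[1:]))
        return uni, bi

    def perplexity(sentences, uni, bi):
        """exp of the average negative log-probability per predicted token.
        Unsmoothed, so every test bigram must have been seen in training."""
        logprob, n = 0.0, 0
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            for h, w in zip(tokens, tokens[1:]):
                logprob += math.log(bi[(h, w)] / uni[h])  # ML estimate of P(w|h)
                n += 1
        return math.exp(-logprob / n)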
4. Morphology of Amharic
Amharic is one of the morphologically rich
languages
Spoken mainly in Ethiopia
− the second most widely spoken Semitic language
Exhibits the root-pattern, non-concatenative morphological phenomenon
− e.g. the root sbr ('break'); see the sketch below
Uses different affixes to create inflectional
and derivational word forms
➢ Data sparseness and OOV are serious
problems
➔ Sub-word based language modeling has been
recommended (Solomon, 2006)
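To make the root-pattern phenomenon concrete, here is a small Python sketch of non-concatenative word formation; the pattern notation, the surface transcription 'sebbere' and its gloss are illustrative approximations on our part, not data from the study:

    def apply_pattern(root, pattern):
        """Fill numbered slots (1-based indices into the consonantal root)
        in a vocalic pattern; repeating a digit models gemination."""
        return "".join(root[int(ch) - 1] if ch.isdigit() else ch
                       for ch in pattern)

    # Illustrative only: the slide's root 'sbr' plus a perfective-like pattern.
    apply_pattern("sbr", "1e22e3e")  # -> 'sebbere', approx. "he broke"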
5. Language Modeling for Amharic
Sub-word based language models have
been developed
A substantial reduction in the OOV rate has been obtained
Morphemes have been used as units in language modeling
➔ loss of word-level dependencies
Solutions:
− higher-order n-grams => increased model complexity
➢ factored language modeling
6. Capturing Word-level Dependencies
In FLM, a word is viewed as a bundle or vector of K parallel features or factors
$w_n \equiv \{ f_n^1, f_n^2, \ldots, f_n^K \}$
− Factors: linguistic features
Some of the features can define the word
➢ the probability can be calculated on the basis of
these features
In Amharic: roots represent the lexical
meaning of a word
➢ root-based models to capture word-level
dependencies
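A minimal count-based sketch of a root-based model in this factored view, assuming each token has already been analysed into a factor bundle; the estimator and names are ours, and a real FLM adds smoothing and generalized backoff:

    from collections import Counter

    def root_trigram_counts(factored_sents):
        """factored_sents: lists of dicts such as {'word': ..., 'root': ...}.
        Because one root stands in for a whole word, a root trigram still
        spans three words of context (the word-level dependency we want)."""
        full, hist = Counter(), Counter()
        for sent in factored_sents:
            roots = ["<s>", "<s>"] + [t["root"] for t in sent] + ["</s>"]
            for i in range(2, len(roots)):
                full[tuple(roots[i - 2:i + 1])] += 1
                hist[tuple(roots[i - 2:i])] += 1
        return full, hist

    def p_root(r, h1, h2, full, hist):
        """ML estimate of P(r | h1, h2); smoothing omitted for brevity."""
        return full[(h1, h2, r)] / hist[(h1, h2)]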
7. Morphological Analysis
There is a need for a morphological analyser
− attempts exist (Bayou, 2000; Bayu, 2002; Amsalu and Gibbon, 2005)
− but they suffer from a lack of data and cannot be used for our purpose
Unsupervised morphology learning tools
➢ not applicable for this study
Manual segmentation
− 72,428 word types found in a corpus of 21,338 sentences have been segmented
− ambiguities had to be resolved manually: polysemous or homonymous forms, geminated or non-geminated forms (see the sketch below)
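A sketch of how such a manually segmented word list could be applied to a corpus; the file format and helper names are assumptions of ours. As noted above, ambiguous types (polysemy, gemination) are exactly what a type-level lookup cannot resolve automatically:

    def load_segmentations(path):
        """Read 'word<TAB>prefix root pattern suffix' entries into a dict.
        (This exact file format is an assumption for illustration.)"""
        seg = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, analysis = line.rstrip("\n").split("\t", 1)
                seg[word] = analysis.split()
        return seg

    def segment_corpus(sentences, seg):
        """Replace each word type by its manual segmentation where available."""
        return [[seg.get(w, [w]) for w in sent] for sent in sentences]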
8. Factored Data Preparation
Each word is considered as a bundle of features: word, POS, prefix, root, pattern and suffix
− W-word:POS-noun:PR-prefix:R-root:PA-pattern:SU-suffix
− A given tag-value pair may be missing - the tag
takes a special value 'null'
When roots are considered in language modeling:
− words not derived from roots would be excluded
➢ therefore, the stems of these words are used instead
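A sketch of emitting the factored representation described above, with 'null' for missing tags and a stem standing in for the root of non-root-derived words (both behaviors taken from this slide); the function itself is our illustration:

    def factored_token(word, pos=None, prefix=None, root=None,
                       pattern=None, suffix=None, stem=None):
        """Render one token as 'W-...:POS-...:PR-...:R-...:PA-...:SU-...'."""
        if root is None and stem is not None:
            root = stem  # non-root-derived word: use its stem instead
        fields = [("W", word), ("POS", pos), ("PR", prefix),
                  ("R", root), ("PA", pattern), ("SU", suffix)]
        return ":".join(f"{tag}-{val if val is not None else 'null'}"
                        for tag, val in fields)

    factored_token("sebbere", pos="verb", root="sbr", pattern="1e22e3e")
    # -> 'W-sebbere:POS-verb:PR-null:R-sbr:PA-1e22e3e:SU-null'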
9. Factored Data Preparation -- cont.
[Diagram: the text corpus and the manually segmented word list are combined into a factored representation, yielding the factored data.]
10. The Language Models
The corpus is divided into training, development test and evaluation test sets (80:10:10)
● Root-based models of order 2 to 5 have been developed
● smoothed with the Kneser-Ney smoothing technique (a training sketch follows)
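The slides do not name a toolkit, so as a minimal sketch, here is how Kneser-Ney-smoothed n-gram models of a given order could be trained and evaluated with NLTK's lm module (our choice of library, not necessarily the authors'); out-of-vocabulary handling is glossed over:

    from nltk.lm import KneserNeyInterpolated
    from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
    from nltk.util import ngrams

    def train_kn(train_sents, order):
        """Fit an order-n interpolated Kneser-Ney model on tokenized sentences."""
        train_data, vocab = padded_everygram_pipeline(order, train_sents)
        lm = KneserNeyInterpolated(order)
        lm.fit(train_data, vocab)
        return lm

    def test_perplexity(lm, test_sents, order):
        """Perplexity over boundary-padded test n-grams."""
        test_ngrams = [ng for sent in test_sents
                       for ng in ngrams(pad_both_ends(sent, n=order), order)]
        return lm.perplexity(test_ngrams)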
11. The Language Models -- cont.
Perplexity of root-based models on
development test set
Root n-gram    Perplexity
Bigram         278.57
Trigram        223.26
Quadrogram     213.14
Pentagram      211.93
The largest improvement: bigram vs. trigram
Only 295 OOVs
The best model has:
− a logprob of -53102.3
− a perplexity of 204.95 on the evaluation test set
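As a consistency check (assuming base-10 log-probabilities over N test tokens, the convention of common LM toolkits): perplexity and logprob are related by PPL = 10^{-logprob/N}, so N ≈ 53102.3 / log10(204.95) ≈ 22,970 root tokens, a plausible size for a 10% test split of the corpus.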
12. The Language Models -- cont.
Word-based model
− with the same training data
− smoothed with Kneser-Ney smoothing
Word n-gram    Perplexity
Bigram         1148.76
Trigram        989.95
Quadrogram     975.41
Pentagram      972.58
The largest improvement: bigram vs. trigram
2,672 OOVs
The best model has a logprob of -61106.0
13. The Language Models -- cont.
Word-based models that use additional features in the n-gram history have also been developed
Language model         Perplexity
W/W2,POS2,W1,POS1      885.81
W/W2,PR2,W1,PR1        857.61
W/W2,R2,W1,R1          896.59
W/W2,PA2,W1,PA1        958.31
W/W2,SU2,W1,SU1        898.89
Root-based models seem better than all the
others, but might be less constraining
➢ Speech recognition experiment – lattice
rescoring
14. Speech Recognition Experiment
The baseline speech recognition system
(Abate, 2006)
Acoustic model:
− trained on a 20-hour read-speech corpus
− a set of intra-word triphone HMMs with 3 emitting states and 12 Gaussian mixtures
The language model
− trained on a corpus consisting of 77,844
sentences (868,929 tokens or 108,523 types)
− a closed vocabulary backoff bigram model
− smoothed with the absolute discounting method (see the formula below)
− perplexity of 91.28 on a test set that consists of
727 sentences (8,337 tokens)
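For reference, a standard bigram formulation of absolute discounting (one of several variants; the slide does not specify which was used) is

P(w | h) = max(c(h, w) − D, 0) / c(h) + λ(h) · P(w)

where 0 < D < 1 is a fixed discount and λ(h) redistributes the discounted probability mass over the lower-order distribution.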
15. Speech Recognition Experiment -- cont.
Performance:
− 5k development test set (360 sentences read by
20 speakers) has been used to generate the
lattices
− Lattices have been generated from the 100 best
alternatives for each sentence
− Best path transcription has been decoded
➔ 91.67% word recognition accuracy
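Word recognition accuracy here is presumably the usual (N − S − D − I)/N measure; since the minimal edit distance equals S + D + I under the optimal alignment, it can be computed as in this sketch (names ours):

    def wra(ref, hyp):
        """100 * (N - S - D - I) / N via Levenshtein alignment of word lists."""
        n, m = len(ref), len(hyp)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i
        for j in range(m + 1):
            d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution or match
        return 100.0 * (n - d[n][m]) / n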
16. Speech Recognition Experiment -- cont.
To make the results comparable
− root-based and factored language models have been developed on the corpus used in the baseline system
[Diagram: the corpus used in the baseline system is converted into a factored version, from which the root-based and factored language models are trained.]
17. Speech Recognition Experiment -- cont.
Perplexity of root-based models trained on the corpus used in the baseline speech recognition system
Root n-gram    Perplexity    Logprob
Bigram         113.57        -18628.9
Trigram        24.63         -12611.8
Quadrogram     11.20         -9510.29
Pentagram      8.72          -8525.42
19. Speech Recognition Experiment -- cont.
Word lattice to factored lattice
[Diagram: each word lattice is converted into its factored version, the factored lattice; decoding the best path transcription with the factored word bigram model (FBL) yields 91.60% word recognition accuracy.]
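A minimal sketch of the rescoring idea, treating the lattice as its 100-best path list and combining the baseline score with the new language model score by linear interpolation; the weighting scheme and names are our assumptions, not the authors' exact setup:

    def rescore(paths, new_lm_logprob, weight=0.5):
        """paths: list of (word_sequence, baseline_score) from the lattice.
        Returns the best-path transcription under the combined score."""
        scored = [(words,
                   (1 - weight) * score + weight * new_lm_logprob(words))
                  for words, score in paths]
        return max(scored, key=lambda p: p[1])[0]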
21. Speech Recognition Experiment -- cont.
WRA with root-based language models
Language model                  WRA in %
Factored word bigram (FBL)      91.60
FBL + Bigram                    90.77
FBL + Trigram                   90.87
FBL + Quadrogram                90.99
FBL + Pentagram                 91.14
22. Conclusion and Future Work
Root-based models have low perplexity and
high logprob
But they did not contribute to an improvement in word recognition accuracy
Future work:
− improving these models by adding other word features while still maintaining word-level dependencies
− other ways of integrating the root-based models into a speech recognition system