Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction
Web Science & Technologies
University of Koblenz ▪ Landau, Germany
Martin Körner
Oberseminar
25.07.2013
Introduction: Motivation
Next word prediction: What is the next word a user will type?
Use cases for next word prediction:
Augmentative and Alternative Communication (AAC)
Small keyboards (Smartphones)
Introduction to next word prediction
How do we predict words?
1. Rationalist approach
• Manually encoding information about language
• “Toy” problems only
2. Empiricist approach
• Statistical, pattern recognition, and machine learning methods applied on corpora
• Result: Language models
Language models in general
Language model: How likely is a sentence $s$?
Probability distribution: $P(s)$
Calculate $P(s)$ by multiplying conditional probabilities
Example:
$P(\text{If you're going to San Francisco , be sure} \ldots)$
$= P(\text{you're} \mid \text{If}) \cdot P(\text{going} \mid \text{If you're}) \cdot P(\text{to} \mid \text{If you're going})$
$\cdot P(\text{San} \mid \text{If you're going to}) \cdot P(\text{Francisco} \mid \text{If you're going to San}) \cdots$
Estimating these conditional probabilities directly from a corpus would fail (full histories are far too sparse)
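A minimal sketch of this chain-rule decomposition, assuming a hypothetical conditional_probability(word, history) lookup is available:

```python
# Minimal sketch of the chain-rule decomposition of P(s).
# conditional_probability() is a hypothetical placeholder for a model lookup.
def sentence_probability(words, conditional_probability):
    probability = 1.0
    for i in range(1, len(words)):   # as in the example: start with P(you're | If)
        probability *= conditional_probability(words[i], words[:i])
    return probability
```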
Conditional probabilities simplified
Markov assumption [JM80]:
Only the last n-1 words are relevant for a prediction
Example with n=5:
$P(\text{sure} \mid \text{If you're going to San Francisco , be}) \approx P(\text{sure} \mid \text{San Francisco , be})$
(The comma "," counts as a word.)
Definitions and Markov assumption
n-gram: Sequence of length n together with its count
E.g., a 5-gram: "If you're going to San" with count 4
Sequence naming:
$w_1^{i-1} := w_1 w_2 \ldots w_{i-1}$
Markov assumption formalized:
$P(w_i \mid w_1^{i-1}) \approx P(w_i \mid w_{i-n+1}^{i-1})$
(the condition on the right-hand side covers only the last n-1 words)
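A minimal sketch of this truncation, assuming the history is given as a list of words:

```python
# Minimal sketch: under the Markov assumption only the last n-1 words
# of the history are kept.
def truncate_history(history, n):
    return history[-(n - 1):] if n > 1 else []

# truncate_history(["If", "you're", "going", "to", "San", "Francisco", ",", "be"], 5)
# returns ["San", "Francisco", ",", "be"]
```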
Formalizing next word prediction
Instead of $P(s)$:
Only one conditional probability $P(w_i \mid w_{i-n+1}^{i-1})$
• Simplify $P(w_i \mid w_{i-n+1}^{i-1})$ to $P(w_n \mid w_1^{n-1})$ (both condition on n-1 words)
$\mathrm{NWP}(w_1^{n-1}) = \operatorname*{arg\,max}_{w_n \in W} P(w_n \mid w_1^{n-1})$
where $W$ is the set of all words in the corpus and $P(w_n \mid w_1^{n-1})$ is the conditional probability with the Markov assumption
How do we calculate the probability $P(w_n \mid w_1^{n-1})$?
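A minimal sketch of this argmax, assuming a hypothetical model_probability(word, context) lookup and a vocabulary list:

```python
# Minimal sketch of next word prediction as an argmax over the vocabulary W.
# vocabulary and model_probability() are hypothetical placeholders.
def predict_next_word(context, vocabulary, model_probability):
    return max(vocabulary, key=lambda word: model_probability(word, context))
```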
How to calculate $P(w_n \mid w_1^{n-1})$
The easiest way:
Maximum likelihood:
$P_{\mathrm{ML}}(w_n \mid w_1^{n-1}) = \dfrac{c(w_1^{n})}{c(w_1^{n-1})}$
Example:
$P(\text{San} \mid \text{If you're going to}) = \dfrac{c(\text{If you're going to San})}{c(\text{If you're going to})}$
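A minimal sketch of the maximum likelihood estimate, assuming n-gram counts are stored in plain dictionaries keyed by word tuples (names are illustrative):

```python
# Minimal sketch of the maximum likelihood estimate P_ML(w_n | w_1^{n-1}).
# ngram_counts and history_counts are hypothetical dicts keyed by word tuples.
def maximum_likelihood(history, word, ngram_counts, history_counts):
    history = tuple(history)
    denominator = history_counts.get(history, 0)
    if denominator == 0:
        return 0.0  # unseen history: this is where smoothing becomes necessary
    return ngram_counts.get(history + (word,), 0) / denominator
```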
Intro Generalized Language Models (GLMs)
Main idea:
Insert wildcard words (∗) into sequences
Example:
Instead of $P(\text{San} \mid \text{If you're going to})$:
• $P(\text{San} \mid \text{If} \ast \ast \ast)$
• $P(\text{San} \mid \text{If} \ast \ast \text{to})$
• $P(\text{San} \mid \text{If} \ast \text{going} \ast)$
• $P(\text{San} \mid \text{If} \ast \text{going to})$
• $P(\text{San} \mid \text{If you're} \ast \ast)$
• …
Separate different types of GLMs based on:
1. Sequence length
2. Number of wildcard words
(e.g., $P(\text{San} \mid \text{If} \ast \ast \text{to})$ belongs to the type with length 5 and 2 wildcard words)
Aggregate the results of the different types (see the sketch below for generating such wildcard contexts)
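A minimal sketch of generating the generalized contexts, assuming the wildcard is represented by the string "*" and that, as in the example above, the first context word is kept:

```python
from itertools import combinations

# Minimal sketch: generate generalized contexts by replacing subsets of the
# context words (all but the first, as in the example above) with "*".
def wildcard_contexts(context):
    patterns = []
    positions = range(1, len(context))  # keep the first word
    for k in range(len(context)):
        for wildcard_positions in combinations(positions, k):
            patterns.append(["*" if i in wildcard_positions else word
                             for i, word in enumerate(context)])
    return patterns

# wildcard_contexts(["If", "you're", "going", "to"]) yields, among others,
# ["If", "*", "*", "to"] and ["If", "*", "going", "*"].
```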
Why Generalized Language Models?
Data sparsity of n-grams:
"If you're going to San" is seen less often than, for example, "If ∗ ∗ to San"
Question: Does that really improve the prediction?
Result of evaluation: Yes
… but we should use smoothing for language models
Smoothing
Problem: Unseen sequences
Try to estimate probabilities of unseen sequences
Probabilities of seen sequences need to be reduced
Two approaches:
1. Backoff smoothing
2. Interpolation smoothing
Backoff smoothing
If sequence unseen: use shorter sequence
E.g.: if $P(\text{San} \mid \text{going to}) = 0$, use $P(\text{San} \mid \text{to})$
$P_{\mathrm{back}}(w_n \mid w_i^{n-1}) =
\begin{cases}
\tau(w_n \mid w_i^{n-1}) & \text{if } c(w_i^{n}) > 0 \\
\gamma \cdot P_{\mathrm{back}}(w_n \mid w_{i+1}^{n-1}) & \text{if } c(w_i^{n}) = 0
\end{cases}$
Here $\tau(w_n \mid w_i^{n-1})$ is the higher order probability, $\gamma$ is a weight, and $P_{\mathrm{back}}(w_n \mid w_{i+1}^{n-1})$ is the lower order probability (recursion).
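A minimal sketch of the backoff case distinction, with count(), tau() and gamma() as hypothetical placeholders for the count lookup, the higher order probability and the backoff weight:

```python
# Minimal sketch of backoff smoothing as a case distinction.
# count(), tau() and gamma() are hypothetical placeholders.
def p_backoff(word, context, count, tau, gamma):
    context = tuple(context)
    if not context:
        return tau(word, ())                       # end of the recursion
    if count(context + (word,)) > 0:
        return tau(word, context)                  # higher order probability
    return gamma(context) * p_backoff(word, context[1:], count, tau, gamma)
```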
Interpolated Smoothing
Always use shorter sequence for calculation
$P_{\mathrm{inter}}(w_n \mid w_i^{n-1}) = \tau(w_n \mid w_i^{n-1}) + \gamma \cdot P_{\mathrm{inter}}(w_n \mid w_{i+1}^{n-1})$
Here $\tau(w_n \mid w_i^{n-1})$ is the higher order probability, $\gamma$ is a weight, and $P_{\mathrm{inter}}(w_n \mid w_{i+1}^{n-1})$ is the lower order probability (recursion).
Seems to work better than backoff smoothing
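The same sketch adapted to interpolation: the lower order probability is always added, not only when the higher order count is zero (tau() and gamma() again hypothetical):

```python
# Minimal sketch of interpolated smoothing: higher and lower order
# probabilities are always combined.
def p_interpolated(word, context, tau, gamma):
    context = tuple(context)
    if not context:
        return tau(word, ())                       # end of the recursion
    return (tau(word, context)
            + gamma(context) * p_interpolated(word, context[1:], tau, gamma))
```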
Kneser-Ney smoothing [KN95] intro
Interpolated smoothing
Idea: Improve lower order calculation
Example: the word "visiting" was not seen in the corpus
$P(\text{Francisco} \mid \text{visiting}) = 0$
Normal interpolation: $0 + \gamma \cdot P(\text{Francisco})$
$P(\text{San} \mid \text{visiting}) = 0$
Normal interpolation: $0 + \gamma \cdot P(\text{San})$
Result: Francisco is as likely as San at that position
Is that correct?
Difference between Francisco and San?
Answer: Number of different contexts
Kneser-Ney smoothing idea
For lower order calculation:
Don't use $c(w_n)$
Instead: the number of different bigrams the word completes:
$N_{1+}(\bullet\, w_n) := |\{ w_{n-1} : c(w_{n-1}^{n}) > 0 \}|$
Or in general:
$N_{1+}(\bullet\, w_{i+1}^{n}) := |\{ w_i : c(w_i^{n}) > 0 \}|$
In addition:
$N_{1+}(\bullet\, w_{i+1}^{n-1}\, \bullet) := \sum_{w_n} N_{1+}(\bullet\, w_{i+1}^{n})$
$N_{1+}(w_i^{n-1}\, \bullet) := |\{ w_n : c(w_i^{n}) > 0 \}|$
($c$ denotes the count of a sequence)
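A minimal sketch of these continuation counts, assuming n-gram counts are available in a dictionary keyed by word tuples (all names illustrative):

```python
# Minimal sketch of the continuation counts used by Kneser-Ney smoothing.
# ngram_counts is a hypothetical dict mapping word tuples to counts.
def n1plus_preceding(suffix, ngram_counts):
    """N_{1+}(* suffix): number of distinct words preceding the suffix."""
    suffix = tuple(suffix)
    return len({ngram[0] for ngram, count in ngram_counts.items()
                if count > 0 and ngram[1:] == suffix})

def n1plus_following(context, ngram_counts):
    """N_{1+}(context *): number of distinct words following the context."""
    context = tuple(context)
    return len({ngram[-1] for ngram, count in ngram_counts.items()
                if count > 0 and ngram[:-1] == context})
```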
Kneser-Ney smoothing equation (highest)
Highest order calculation:
$P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \dfrac{\max\{c(w_i^{n}) - D,\, 0\}}{c(w_i^{n-1})} + \dfrac{D}{c(w_i^{n-1})}\, N_{1+}(w_i^{n-1}\, \bullet)\, P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$
Here $c(w_i^{n})$ is the count and $c(w_i^{n-1})$ the total count, $\max\{\cdot,\, 0\}$ assures a positive value, $D$ is a discount value with $0 \le D \le 1$, $\frac{D}{c(w_i^{n-1})}\, N_{1+}(w_i^{n-1}\, \bullet)$ is the lower order weight, and $P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$ is the lower order probability (recursion).
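A minimal sketch of this highest order term, with counts provided by a hypothetical counts object and the lower order recursion sketched after the next equation:

```python
# Minimal sketch of the highest order Kneser-Ney probability.
# counts.count() and counts.n1plus_following() are hypothetical lookups;
# p_kn_lower() is the lower order recursion sketched below.
def p_kn_highest(word, context, counts, D=0.75):
    context = tuple(context)
    denominator = counts.count(context)
    if denominator == 0:
        # assumption: if the context was never seen, fall back to the lower order
        return p_kn_lower(word, context[1:], counts, D)
    discounted = max(counts.count(context + (word,)) - D, 0.0) / denominator
    weight = (D / denominator) * counts.n1plus_following(context)
    return discounted + weight * p_kn_lower(word, context[1:], counts, D)
```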
Kneser-Ney smoothing equation
Lower order calculation:
$P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \dfrac{\max\{N_{1+}(\bullet\, w_i^{n}) - D,\, 0\}}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)} + \dfrac{D}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)}\, N_{1+}(w_i^{n-1}\, \bullet)\, P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$
Lowest order calculation:
$P_{\mathrm{KN}}(w_n) = \dfrac{N_{1+}(\bullet\, w_n)}{N_{1+}(\bullet\, \bullet)}$
Here $N_{1+}(\bullet\, w_i^{n})$ is the continuation count and $N_{1+}(\bullet\, w_i^{n-1}\, \bullet)$ the total continuation count, $\max\{\cdot,\, 0\}$ assures a positive value, $D$ is the discount value, $\frac{D}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)}\, N_{1+}(w_i^{n-1}\, \bullet)$ is the lower order weight, and $P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$ is the lower order probability (recursion).
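A minimal sketch of the lower and lowest order recursion, reusing the continuation count helpers sketched above via a hypothetical counts object:

```python
# Minimal sketch of the lower/lowest order Kneser-Ney recursion.
# counts.n1plus_preceding(), counts.n1plus_surrounding(), counts.n1plus_following()
# and counts.n1plus_total are hypothetical continuation count lookups.
def p_kn_lower(word, context, counts, D=0.75):
    if not context:
        # lowest order: continuation probability of the word alone
        return counts.n1plus_preceding((word,)) / counts.n1plus_total
    context = tuple(context)
    denominator = counts.n1plus_surrounding(context)     # N_{1+}(* context *)
    if denominator == 0:
        return p_kn_lower(word, context[1:], counts, D)  # assumption: back off further
    discounted = max(counts.n1plus_preceding(context + (word,)) - D, 0.0) / denominator
    weight = (D / denominator) * counts.n1plus_following(context)
    return discounted + weight * p_kn_lower(word, context[1:], counts, D)
```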
Modified Kneser-Ney smoothing [CG98]
Different discount values for different absolute counts
Lower order calculation:
$P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \dfrac{\max\{N_{1+}(\bullet\, w_i^{n}) - D(c(w_i^{n})),\, 0\}}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)} + \dfrac{D_1 N_1(w_i^{n-1}\, \bullet) + D_2 N_2(w_i^{n-1}\, \bullet) + D_{3+} N_{3+}(w_i^{n-1}\, \bullet)}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)}\, P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$
State of the art (for 15 years!)
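A minimal sketch of the count-dependent discount. The discount values $D_1$, $D_2$ and $D_{3+}$ can be estimated from the numbers of n-grams occurring exactly once, twice, three and four times, as proposed in [CG98]; the variable names here are assumptions:

```python
# Minimal sketch of the count-dependent discounts of modified Kneser-Ney [CG98].
# n1..n4 are the numbers of n-grams seen exactly 1, 2, 3 and 4 times.
def estimate_discounts(n1, n2, n3, n4):
    y = n1 / (n1 + 2 * n2)
    d1 = 1 - 2 * y * n2 / n1
    d2 = 2 - 3 * y * n3 / n2
    d3plus = 3 - 4 * y * n4 / n3
    return d1, d2, d3plus

def discount(count, d1, d2, d3plus):
    """D(c): choose the discount according to the absolute count."""
    if count == 0:
        return 0.0
    if count == 1:
        return d1
    if count == 2:
        return d2
    return d3plus
```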
Smoothing of GLMs
We can use all smoothing techniques on GLMs as well!
Small modification for the lower order sequence:
E.g., for $P(\text{San} \mid \text{If} \ast \text{going} \ast)$:
– normally the lower order sequence would be $P(\text{San} \mid \ast\, \text{going} \ast)$
– instead use $P(\text{San} \mid \text{going} \ast)$
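A minimal sketch of this modification, assuming the wildcard is represented by the string "*": when forming the lower order context, the leading word is dropped and any wildcard that would then lead the context is skipped as well:

```python
# Minimal sketch of the GLM lower order step: drop the first word and
# skip any wildcard ("*") that would then lead the context.
def glm_lower_order_context(context):
    shorter = list(context[1:])
    while shorter and shorter[0] == "*":
        shorter = shorter[1:]
    return shorter

# glm_lower_order_context(["If", "*", "going", "*"]) == ["going", "*"]
```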
Progress
Done so far:
Extracting text from XML files
Building GLMs
Kneser-Ney and modified Kneser-Ney smoothing
Indexing with MySQL
To do:
Finish evaluation program
Run evaluation
Analyze results
Sources
Images:
Wheelchair Joystick (Slide 4):
http://i01.i.aliimg.com/img/pb/741/422/527/527422741_355.jpg
Smartphone Keyboard (Slide 4):
https://activecaptain.com/articles/mobilePhones/iPhone/iPhone_Keyboard.jpg
References:
[CG98]: Stanley Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, August 1998.
[JM80]: F. Jelinek and R. L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, pages 381–397, 1980.
[KN95]: Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP-95), volume 1, pages 181–184. IEEE, 1995.