4. Conditional Probability
Probability of tomorrow being cloudy if today and yesterday are snowy?
P(cloudy|snowy, snowy) = C(snowy, snowy, cloudy) / C(snowy, snowy) = 1/1 = 1
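A minimal Python sketch of this count-based estimate (the weather history below is made-up data, not from the slides):

```python
from collections import Counter

# Toy weather history; one observation per day (made-up data).
days = ["snowy", "snowy", "cloudy", "snowy", "snowy", "cloudy",
        "sunny", "snowy", "snowy", "rainy"]

# Count consecutive trigrams (x, y, z) and bigrams (x, y).
trigrams = Counter(zip(days, days[1:], days[2:]))
bigrams = Counter(zip(days, days[1:]))

# P(cloudy|snowy, snowy) = C(snowy, snowy, cloudy) / C(snowy, snowy)
p = trigrams[("snowy", "snowy", "cloudy")] / bigrams[("snowy", "snowy")]
print(p)  # 2/3 for this toy history
```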
6. Joint Probability
Probability of next 3 days being snowy, cloudy, sunny?
P(snowy, cloudy, sunny) = P(snowy) · P(cloudy|snowy) · P(sunny|snowy, cloudy)
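The same chain-rule decomposition with hypothetical numbers plugged in (the probabilities are illustrative only, not from the slides):

```python
# Chain rule: P(a, b, c) = P(a) * P(b|a) * P(c|a, b).
# The probabilities below are hypothetical, for illustration only.
p_snowy = 0.3              # P(snowy)
p_cloudy_given_s = 0.4     # P(cloudy|snowy)
p_sunny_given_sc = 0.2     # P(sunny|snowy, cloudy)

print(p_snowy * p_cloudy_given_s * p_sunny_given_sc)  # 0.024
```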
7. N-gram Models
1-gram (Unigram):
P(wi) = C(wi) / Σ_k C(wk) = C(wi) / N

2-gram (Bigram):
P(wi+1|wi) = C(wi, wi+1) / Σ_k C(wi, wk) = C(wi, wi+1) / C(wi)

N = # of tokens (token vs. type?)
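Both estimates are ratios of counts; a minimal Python sketch, assuming whitespace-tokenized input:

```python
from collections import Counter

def ngram_probs(tokens):
    """MLE unigram and bigram probabilities from a token list (a sketch)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)  # N = number of tokens (not types)
    p_uni = {w: c / n for w, c in unigrams.items()}
    # P(wi+1|wi) = C(wi, wi+1) / C(wi)
    p_bi = {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}
    return p_uni, p_bi

p_uni, p_bi = ngram_probs("the cat sat on the mat".split())
print(p_uni["the"])          # 2/6
print(p_bi[("the", "cat")])  # 1/2
```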
8. N-gram Models
• Unigram model
- Given any word w, it shows how likely w appears in context.
- This is known as the likelihood (probability) of w, written as P(w).
- How likely is the word “Emory” to appear in context?
- Does this mean “Emory” appears 17.39% of the time in any context?
- How can we measure more accurate likelihoods?
Emory University was founded as Emory College by John Emory.
Emory University is 16th among the colleges and universities in US.
P(Emory) = 4/23 ≈ 0.1739
9. N-gram Models
• Bigram model
- Given any words wi and wj in sequence, it shows the likelihood of wj following wi in context.
- This can be represented as the conditional probability P(wj|wi).
- What is the most likely word following “Emory”?
Emory University was founded as Emory College by John Emory.
Emory University is the 20th among the national universities in US.
P(University|Emory) = 2/4 = 0.50
P(College|Emory) = 1/4 = 0.25
P(.|Emory) = 1/4 = 0.25

argmax_k P(wk|Emory)
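A sketch of this argmax over the slide's two sentences, with counts from Counter:

```python
from collections import Counter

# The two sentences from the slide, whitespace-tokenized.
text = ("Emory University was founded as Emory College by John Emory . "
        "Emory University is the 20th among the national universities in US .")
tokens = text.split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

# P(w|Emory) = C(Emory, w) / C(Emory) for every word w following "Emory"
cands = {w2: c / unigrams["Emory"]
         for (w1, w2), c in bigrams.items() if w1 == "Emory"}
print(max(cands, key=cands.get))  # University (argmax_k P(wk|Emory))
print(cands)  # {'University': 0.5, 'College': 0.25, '.': 0.25}
```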
11. Maximum Likelihood
• Maximum likelihood
- Given any word sequence wi, …, wn, how likely this sequence is to appear in context.
- This can be represented as the joint probability P(wi, …, wn).
- How likely does the sequence “you know” appear in context?
you know , I know you know that you do .
Chain rule:
P(you, know) = 2/11 ← not 10?
P(you) · P(know|you) = 3/11 · 2/3 = 2/11
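A sketch that verifies the arithmetic and shows where the 11 comes from: P(you) is a unigram estimate over all 11 tokens, even though the corpus has only 10 bigram positions:

```python
from collections import Counter

tokens = "you know , I know you know that you do .".split()
unigrams = Counter(tokens)                   # 11 tokens
bigrams = Counter(zip(tokens, tokens[1:]))   # only 10 bigram positions

p_you = unigrams["you"] / len(tokens)                          # 3/11
p_know_given_you = bigrams[("you", "know")] / unigrams["you"]  # 2/3
print(p_you * p_know_given_you)  # 2/11 ≈ 0.1818
```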
12. Word Segmentation
• Word segmentation
- Segment a chunk of text into a sequence of words.
- Is there more than one possible sequence?
- Choose the sequence that most likely appears in context (sketched below).
youknow
P(you) · P(know|you) > P(yo) · P(uk|yo) · P(now|uk)
log(P(you) · P(know|you)) > log(P(yo) · P(uk|yo) · P(now|uk))
log(P(you)) + log(P(know|you)) > log(P(yo)) + log(P(uk|yo)) + log(P(now|uk))
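A minimal sketch of scoring the two candidate segmentations in log space; the probability tables are hypothetical placeholders, and the floor value merely stands in for proper smoothing:

```python
import math

def seq_logprob(words, p_uni, p_bi, floor=1e-12):
    """Log-probability of a segmentation under a bigram model (a sketch).

    p_uni maps w -> P(w); p_bi maps (w1, w2) -> P(w2|w1).
    Sums logs instead of multiplying raw probabilities to avoid underflow;
    `floor` stands in for unseen events (no real smoothing here).
    """
    logp = math.log(p_uni.get(words[0], floor))
    for w1, w2 in zip(words, words[1:]):
        logp += math.log(p_bi.get((w1, w2), floor))
    return logp

# Hypothetical probability tables, for illustration only.
p_uni = {"you": 0.27, "yo": 0.001}
p_bi = {("you", "know"): 0.67, ("yo", "uk"): 0.001, ("uk", "now"): 0.001}

a = seq_logprob(["you", "know"], p_uni, p_bi)
b = seq_logprob(["yo", "uk", "now"], p_uni, p_bi)
print("you know" if a > b else "yo uk now")  # you know
```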
13. Laplace Smoothing
P(x1, …, xn) ≈ P(x1) · P(x2|x1) · P(x3|x2) ⋯ P(xn|xn−1)

What if P(x1) = 0? Then P(x1, …, xn) ≈ 0 ← BAD!!

Laplace Smoothing:

MLE: P(xi) = C(xi) / Σ_k C(xk)

Smoothed: Pl(xi) = (C(xi) + α) / Σ_k (C(xk) + α) = (C(xi) + α) / (Σ_k C(xk) + α|X|) = (C(xi) + α) / (N + α|X|)

Unseen: Pl(x?) = (C(x?) + α) / (Σ_k C(xk) + α|X|) = α / (N + α|X|)
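A minimal Python sketch of the smoothed estimate (the counts and vocabulary size are made up):

```python
from collections import Counter

def laplace_prob(x, counts, vocab_size, alpha=1.0):
    """Pl(x) = (C(x) + alpha) / (N + alpha * |X|): a sketch of Laplace smoothing.

    vocab_size is |X|, which must cover unseen types as well so the
    distribution still sums to 1. Unseen x gets alpha / (N + alpha * |X|).
    """
    n = sum(counts.values())  # N
    return (counts[x] + alpha) / (n + alpha * vocab_size)

counts = Counter({"a": 3, "b": 1})
print(laplace_prob("a", counts, vocab_size=3))  # (3+1)/(4+3) ≈ 0.571
print(laplace_prob("z", counts, vocab_size=3))  # (0+1)/(4+3) ≈ 0.143, not 0
```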
15. Discount Smoothing
• Issues with Laplace smoothing
- Unfair discounts.
- Unseen likelihood may get penalized too harshly when the minimum count is much greater than α.
- How to reduce the gap between the minimum count and the unseen count?
With α = 1, N = 100, and |X| = 10:

10/100 = 0.1 → (10 + 1)/(100 + 10) = 0.1 (±0)
50/100 = 0.5 → (50 + 1)/(100 + 10) ≈ 0.46 (−0.04)
1/100 = 0.01 → (1 + 1)/(100 + 10) ≈ 0.018 (+0.008)
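The same arithmetic in a few lines of Python; note the slide rounds 51/110 ≈ 0.464 to 0.46:

```python
# The arithmetic above: N = 100 tokens, |X| = 10 types, alpha = 1.
for c in (10, 50, 1):
    mle = c / 100
    laplace = (c + 1) / (100 + 10)
    print(f"C={c:2d}: {mle:.3f} -> {laplace:.3f} ({laplace - mle:+.3f})")
```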
16. Discount Smoothing
Laplace:
Pl(xi) = (C(xi) + α) / (N + α|X|)
Pl(xj|xi) = (C(xi, xj) + α) / (C(xi) + α|Xi,∗|)
Pl(x?) = α / (N + α|X|)
Pl(x?|xi) = α / (C(xi) + α|Xi,∗|)

Discount:
Pd(xi) = (C(xi) − Pd(x?)) / N
Pd(xj|xi) = (C(xi, xj) − Pd(x?|xi)) / C(xi)
Pd(x?) = α · min_k P(xk)
Pd(x?|xi) = α · min_k P(xk|xi)
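A sketch of the discount variant as written above, for the unigram case only, assuming the intended reading is that the unseen mass α · min_k P(xk) is subtracted from each seen count before normalizing (the counts are made up):

```python
def discount_probs(counts, alpha=0.9):
    """A sketch of the slide's discount idea for unigrams.

    The unseen mass is tied to the smallest seen probability,
    Pd(x?) = alpha * min_k P(xk), and that mass is subtracted from each
    seen count before normalizing: Pd(xi) = (C(xi) - Pd(x?)) / N.
    """
    n = sum(counts.values())                     # N
    p_unseen = alpha * min(counts.values()) / n  # alpha * min_k P(xk)
    p_seen = {x: (c - p_unseen) / n for x, c in counts.items()}
    return p_seen, p_unseen

p_seen, p_unseen = discount_probs({"a": 10, "b": 1})  # made-up counts
print(p_seen, p_unseen)
```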
17. Good-Turing Smoothing
Nc = the count of n-grams that appear c times
Type        C    MLE    C*
carp        4    0.33   5.00
perch       3    0.25   4.00
whitefish   2    0.17   3.00
trout       1    0.08   0.67
salmon      1    0.08   0.67
eel         1    0.08   0.67
?           0    0.00   0.17
Total       12   1.00

N1 = 3, N2 = 1, N3 = 1, N4 = 1

C*(xi) = (C(xi) + 1) · N_{C(xi)+1} / N_{C(xi)}
P(x?) = N1 / N

C*(eel) = (C(eel) + 1) · N2 / N1 = 2/3
P(eel) = (2/3) / 12
P(carp) = ?
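A sketch of the adjusted counts on the fish table; note that N_{C+1} = 0 for the most frequent type, which is presumably why the slide leaves P(carp) = ? open:

```python
from collections import Counter

def good_turing(counts):
    """Good-Turing adjusted counts C* = (C + 1) * N_{C+1} / N_C (a sketch)."""
    n = sum(counts.values())
    n_c = Counter(counts.values())  # N_c: number of types seen exactly c times
    adjusted = {x: (c + 1) * n_c[c + 1] / n_c[c] for x, c in counts.items()}
    p_unseen = n_c[1] / n           # P(x?) = N1 / N
    return adjusted, p_unseen

fish = {"carp": 4, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
adjusted, p_unseen = good_turing(fish)
print(adjusted["eel"])   # (1+1) * N2/N1 = 2/3
print(p_unseen)          # N1/N = 3/12 = 0.25
print(adjusted["carp"])  # 0.0, because N5 = 0 -- hence the P(carp) = ? question
```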
18. Backoff
• Backoff
- Bigrams are more accurate than unigrams.
- Bigrams are sparser than unigrams.
- Use bigrams in general, and back off to unigrams only when bigrams don’t exist (sketched below).

Pb(xj|xi) = { Pl|d(xj|xi)   if P(xj|xi) > 0
            { λ · Pl|d(xj)  otherwise

How to measure?
λ = α · ⟨P(xj|xi)⟩_{i,j} / ⟨P(xj)⟩_j
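A minimal sketch of this backoff rule; the probability tables and the λ value are hypothetical, and estimating λ properly is the open question above:

```python
def backoff_prob(w1, w2, p_bi, p_uni, lam=0.4):
    """A sketch of backoff: use the bigram estimate when it exists,
    otherwise fall back to the unigram estimate scaled by lam (λ)."""
    if (w1, w2) in p_bi:
        return p_bi[(w1, w2)]
    return lam * p_uni.get(w2, 0.0)

# Hypothetical smoothed estimates, for illustration only.
p_uni = {"know": 0.18, "now": 0.05}
p_bi = {("you", "know"): 0.67}
print(backoff_prob("you", "know", p_bi, p_uni))  # 0.67 (bigram exists)
print(backoff_prob("you", "now", p_bi, p_uni))   # 0.4 * 0.05 = 0.02 (backed off)
```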
19. Interpolation
• Interpolation
- Unigrams and bigrams provide different but useful information.
- Use them both with different weights (sketched below).
P̂(xj|xi) = λ1 · Pl|d(xj) + λ2 · Pl|d(xj|xi), where Σ_k λk = 1
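A minimal sketch of the interpolated estimate; the λ weights here are made up (in practice they would be tuned, e.g. on held-out data):

```python
def interpolated_prob(w1, w2, p_uni, p_bi, lambdas=(0.3, 0.7)):
    """P^(xj|xi) = λ1 * P(xj) + λ2 * P(xj|xi), with Σ_k λk = 1 (a sketch)."""
    l1, l2 = lambdas
    assert abs(l1 + l2 - 1.0) < 1e-9  # Σ_k λk = 1
    return l1 * p_uni.get(w2, 0.0) + l2 * p_bi.get((w1, w2), 0.0)

# Hypothetical smoothed estimates, for illustration only.
p_uni = {"know": 0.18}
p_bi = {("you", "know"): 0.67}
print(interpolated_prob("you", "know", p_uni, p_bi))  # 0.3*0.18 + 0.7*0.67 = 0.523
```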