4. Conditional Probability
Probability of tomorrow being cloudy if today and yesterday are snowy?
P(cloudy|snowy, snowy) = C(snowy, snowy, cloudy) / C(snowy, snowy) = 1/1 = 1
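A minimal Python sketch of this count-based estimate (the weather history below is made-up data, not from the slides):

```python
from collections import Counter

# Toy weather history; one observation per day (made-up data).
days = ["snowy", "snowy", "cloudy", "snowy", "snowy", "cloudy",
        "sunny", "snowy", "snowy", "rainy"]

# Count consecutive trigrams (x, y, z) and bigrams (x, y).
trigrams = Counter(zip(days, days[1:], days[2:]))
bigrams = Counter(zip(days, days[1:]))

# P(cloudy|snowy, snowy) = C(snowy, snowy, cloudy) / C(snowy, snowy)
p = trigrams[("snowy", "snowy", "cloudy")] / bigrams[("snowy", "snowy")]
print(p)  # 2/3 for this toy history
```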
6. Joint Probability
Probability of next 3 days being snowy, cloudy, sunny?
P(snowy, cloudy, sunny) = P(snowy) · P(cloudy|snowy) · P(sunny|snowy, cloudy)
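The same chain-rule decomposition with hypothetical numbers plugged in (the probabilities are illustrative only, not from the slides):

```python
# Chain rule: P(a, b, c) = P(a) * P(b|a) * P(c|a, b).
# The probabilities below are hypothetical, for illustration only.
p_snowy = 0.3              # P(snowy)
p_cloudy_given_s = 0.4     # P(cloudy|snowy)
p_sunny_given_sc = 0.2     # P(sunny|snowy, cloudy)

print(p_snowy * p_cloudy_given_s * p_sunny_given_sc)  # 0.024
```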
7. N-gram Models
1-gram (Unigram):
P(wi) = C(wi) / Σ_k C(wk) = C(wi) / N

2-gram (Bigram):
P(wi+1|wi) = C(wi, wi+1) / Σ_k C(wi, wk) = C(wi, wi+1) / C(wi)

N = # of tokens (token vs. type?)
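Both estimates are ratios of counts; a minimal Python sketch, assuming whitespace-tokenized input:

```python
from collections import Counter

def ngram_probs(tokens):
    """MLE unigram and bigram probabilities from a token list (a sketch)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)  # N = number of tokens (not types)
    p_uni = {w: c / n for w, c in unigrams.items()}
    # P(wi+1|wi) = C(wi, wi+1) / C(wi)
    p_bi = {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}
    return p_uni, p_bi

p_uni, p_bi = ngram_probs("the cat sat on the mat".split())
print(p_uni["the"])          # 2/6
print(p_bi[("the", "cat")])  # 1/2
```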
8. N-gram Models
• Unigram model
- Given any word w, it shows how likely w appears in context.
- This is known as the likelihood (probability) of w, written as P(w).
- How likely is the word “Emory” to appear in context?
- Does this mean “Emory” appears 17.39% of the time in any context?
- How can we measure more accurate likelihoods?
Emory University was founded as Emory College by John Emory.
Emory University is 16th among the colleges and universities in US.
P(Emory) = 4/23 ≈ 0.1739
9. N-gram Models
• Bigram model
- Given any words wi and wj in sequence, it shows the likelihood of wj following wi in context.
- This can be represented as the conditional probability P(wj|wi).
- What is the most likely word following “Emory”?
Emory University was founded as Emory College by John Emory.
Emory University is the 20th among the national universities in US.
P(University|Emory) = 2/4 = 0.50
P(College|Emory) = 1/4 = 0.25
P(.|Emory) = 1/4 = 0.25

argmax_k P(wk|Emory)
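A sketch of this argmax over the slide's two sentences, with counts from Counter:

```python
from collections import Counter

# The two sentences from the slide, whitespace-tokenized.
text = ("Emory University was founded as Emory College by John Emory . "
        "Emory University is the 20th among the national universities in US .")
tokens = text.split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

# P(w|Emory) = C(Emory, w) / C(Emory) for every word w following "Emory"
cands = {w2: c / unigrams["Emory"]
         for (w1, w2), c in bigrams.items() if w1 == "Emory"}
print(max(cands, key=cands.get))  # University (argmax_k P(wk|Emory))
print(cands)  # {'University': 0.5, 'College': 0.25, '.': 0.25}
```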
11. Maximum Likelihood
• Maximum likelihood
- Given any word sequence wi, …, wn, how likely this sequence is to appear in context.
- This can be represented as the joint probability P(wi, …, wn).
- How likely does the sequence “you know” appear in context?
you know , I know you know that you do .
Chain rule:
P(you, know) = 2/11 ← not 10?
P(you) · P(know|you) = 3/11 · 2/3 = 2/11
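A sketch that verifies the arithmetic and shows where the 11 comes from: P(you) is a unigram estimate over all 11 tokens, even though the corpus has only 10 bigram positions:

```python
from collections import Counter

tokens = "you know , I know you know that you do .".split()
unigrams = Counter(tokens)                   # 11 tokens
bigrams = Counter(zip(tokens, tokens[1:]))   # only 10 bigram positions

p_you = unigrams["you"] / len(tokens)                          # 3/11
p_know_given_you = bigrams[("you", "know")] / unigrams["you"]  # 2/3
print(p_you * p_know_given_you)  # 2/11 ≈ 0.1818
```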
12. Word Segmentation
• Word segmentation
- Segment a chunk of text into a sequence of words.
- Is there more than one possible sequence?
- Choose the sequence that most likely appears in context (sketched below).
youknow
P(you) · P(know|you) > P(yo) · P(uk|yo) · P(now|uk)
log(P(you) · P(know|you)) > log(P(yo) · P(uk|yo) · P(now|uk))
log(P(you)) + log(P(know|you)) > log(P(yo)) + log(P(uk|yo)) + log(P(now|uk))
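A minimal sketch of scoring the two candidate segmentations in log space; the probability tables are hypothetical placeholders, and the floor value merely stands in for proper smoothing:

```python
import math

def seq_logprob(words, p_uni, p_bi, floor=1e-12):
    """Log-probability of a segmentation under a bigram model (a sketch).

    p_uni maps w -> P(w); p_bi maps (w1, w2) -> P(w2|w1).
    Sums logs instead of multiplying raw probabilities to avoid underflow;
    `floor` stands in for unseen events (no real smoothing here).
    """
    logp = math.log(p_uni.get(words[0], floor))
    for w1, w2 in zip(words, words[1:]):
        logp += math.log(p_bi.get((w1, w2), floor))
    return logp

# Hypothetical probability tables, for illustration only.
p_uni = {"you": 0.27, "yo": 0.001}
p_bi = {("you", "know"): 0.67, ("yo", "uk"): 0.001, ("uk", "now"): 0.001}

a = seq_logprob(["you", "know"], p_uni, p_bi)
b = seq_logprob(["yo", "uk", "now"], p_uni, p_bi)
print("you know" if a > b else "yo uk now")  # you know
```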
13. Laplace Smoothing
P(x1, …, xn) ≈ P(x1) · P(x2|x1) · P(x3|x2) ⋯ P(xn|xn−1)

What if P(x1) = 0? Then P(x1, …, xn) ≈ 0 ← BAD!!

Laplace Smoothing:

MLE: P(xi) = C(xi) / Σ_k C(xk)

Smoothed: Pl(xi) = (C(xi) + α) / Σ_k (C(xk) + α) = (C(xi) + α) / (Σ_k C(xk) + α|X|) = (C(xi) + α) / (N + α|X|)

Unseen: Pl(x?) = (C(x?) + α) / (Σ_k C(xk) + α|X|) = α / (N + α|X|)
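A minimal Python sketch of the smoothed estimate (the counts and vocabulary size are made up):

```python
from collections import Counter

def laplace_prob(x, counts, vocab_size, alpha=1.0):
    """Pl(x) = (C(x) + alpha) / (N + alpha * |X|): a sketch of Laplace smoothing.

    vocab_size is |X|, which must cover unseen types as well so the
    distribution still sums to 1. Unseen x gets alpha / (N + alpha * |X|).
    """
    n = sum(counts.values())  # N
    return (counts[x] + alpha) / (n + alpha * vocab_size)

counts = Counter({"a": 3, "b": 1})
print(laplace_prob("a", counts, vocab_size=3))  # (3+1)/(4+3) ≈ 0.571
print(laplace_prob("z", counts, vocab_size=3))  # (0+1)/(4+3) ≈ 0.143, not 0
```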
15. Discount Smoothing
• Issues with Laplace smoothing
- Unfair discounts.
- Unseen likelihood may get penalized too harshly when the minimum count is much greater than α.
- How to reduce the gap between the minimum count and the unseen count?
With α = 1, N = 100, and |X| = 10:

10/100 = 0.1 → (10 + 1)/(100 + 10) = 0.1 (±0)
50/100 = 0.5 → (50 + 1)/(100 + 10) ≈ 0.46 (−0.04)
1/100 = 0.01 → (1 + 1)/(100 + 10) ≈ 0.018 (+0.008)
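The same arithmetic in a few lines of Python; note the slide rounds 51/110 ≈ 0.464 to 0.46:

```python
# The arithmetic above: N = 100 tokens, |X| = 10 types, alpha = 1.
for c in (10, 50, 1):
    mle = c / 100
    laplace = (c + 1) / (100 + 10)
    print(f"C={c:2d}: {mle:.3f} -> {laplace:.3f} ({laplace - mle:+.3f})")
```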
16. Discount Smoothing
Laplace:
Pl(xi) = (C(xi) + α) / (N + α|X|)
Pl(xj|xi) = (C(xi, xj) + α) / (C(xi) + α|Xi,∗|)
Pl(x?) = α / (N + α|X|)
Pl(x?|xi) = α / (C(xi) + α|Xi,∗|)

Discount:
Pd(xi) = (C(xi) − Pd(x?)) / N
Pd(xj|xi) = (C(xi, xj) − Pd(x?|xi)) / C(xi)
Pd(x?) = α · min_k P(xk)
Pd(x?|xi) = α · min_k P(xk|xi)
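A sketch of the discount variant as written above, for the unigram case only, assuming the intended reading is that the unseen mass α · min_k P(xk) is subtracted from each seen count before normalizing (the counts are made up):

```python
def discount_probs(counts, alpha=0.9):
    """A sketch of the slide's discount idea for unigrams.

    The unseen mass is tied to the smallest seen probability,
    Pd(x?) = alpha * min_k P(xk), and that mass is subtracted from each
    seen count before normalizing: Pd(xi) = (C(xi) - Pd(x?)) / N.
    """
    n = sum(counts.values())                     # N
    p_unseen = alpha * min(counts.values()) / n  # alpha * min_k P(xk)
    p_seen = {x: (c - p_unseen) / n for x, c in counts.items()}
    return p_seen, p_unseen

p_seen, p_unseen = discount_probs({"a": 10, "b": 1})  # made-up counts
print(p_seen, p_unseen)
```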
17. Good-Turing Smoothing
Nc = the count of n-grams that appear c times
Type        C    MLE    C*
carp        4    0.33   5.00
perch       3    0.25   4.00
whitefish   2    0.17   3.00
trout       1    0.08   0.67
salmon      1    0.08   0.67
eel         1    0.08   0.67
?           0    0.00   0.17
Total       12   1.00

N1 = 3, N2 = 1, N3 = 1, N4 = 1

C*(xi) = (C(xi) + 1) · N_{C(xi)+1} / N_{C(xi)}
P(x?) = N1 / N

C*(eel) = (C(eel) + 1) · N2 / N1 = 2/3
P(eel) = (2/3) / 12
P(carp) = ?
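A sketch of the adjusted counts on the fish table; note that N_{C+1} = 0 for the most frequent type, which is presumably why the slide leaves P(carp) = ? open:

```python
from collections import Counter

def good_turing(counts):
    """Good-Turing adjusted counts C* = (C + 1) * N_{C+1} / N_C (a sketch)."""
    n = sum(counts.values())
    n_c = Counter(counts.values())  # N_c: number of types seen exactly c times
    adjusted = {x: (c + 1) * n_c[c + 1] / n_c[c] for x, c in counts.items()}
    p_unseen = n_c[1] / n           # P(x?) = N1 / N
    return adjusted, p_unseen

fish = {"carp": 4, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
adjusted, p_unseen = good_turing(fish)
print(adjusted["eel"])   # (1+1) * N2/N1 = 2/3
print(p_unseen)          # N1/N = 3/12 = 0.25
print(adjusted["carp"])  # 0.0, because N5 = 0 -- hence the P(carp) = ? question
```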
18. Backoff
• Backoff
- Bigrams are more accurate than unigrams.
- Bigrams are sparser than unigrams.
- Use bigrams in general, and back off to unigrams only when bigrams don’t exist (sketched below).

Pb(xj|xi) = { Pl|d(xj|xi)   if P(xj|xi) > 0
            { λ · Pl|d(xj)  otherwise

How to measure?
λ = α · ⟨P(xj|xi)⟩_{i,j} / ⟨P(xj)⟩_j
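A minimal sketch of this backoff rule; the probability tables and the λ value are hypothetical, and estimating λ properly is the open question above:

```python
def backoff_prob(w1, w2, p_bi, p_uni, lam=0.4):
    """A sketch of backoff: use the bigram estimate when it exists,
    otherwise fall back to the unigram estimate scaled by lam (λ)."""
    if (w1, w2) in p_bi:
        return p_bi[(w1, w2)]
    return lam * p_uni.get(w2, 0.0)

# Hypothetical smoothed estimates, for illustration only.
p_uni = {"know": 0.18, "now": 0.05}
p_bi = {("you", "know"): 0.67}
print(backoff_prob("you", "know", p_bi, p_uni))  # 0.67 (bigram exists)
print(backoff_prob("you", "now", p_bi, p_uni))   # 0.4 * 0.05 = 0.02 (backed off)
```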
19. Interpolation
• Interpolation
- Unigrams and bigrams provide different but useful information.
- Use them both with different weights (sketched below).
P̂(xj|xi) = λ1 · Pl|d(xj) + λ2 · Pl|d(xj|xi), where Σ_k λk = 1
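A minimal sketch of the interpolated estimate; the λ weights here are made up (in practice they would be tuned, e.g. on held-out data):

```python
def interpolated_prob(w1, w2, p_uni, p_bi, lambdas=(0.3, 0.7)):
    """P^(xj|xi) = λ1 * P(xj) + λ2 * P(xj|xi), with Σ_k λk = 1 (a sketch)."""
    l1, l2 = lambdas
    assert abs(l1 + l2 - 1.0) < 1e-9  # Σ_k λk = 1
    return l1 * p_uni.get(w2, 0.0) + l2 * p_bi.get((w1, w2), 0.0)

# Hypothetical smoothed estimates, for illustration only.
p_uni = {"know": 0.18}
p_bi = {("you", "know"): 0.67}
print(interpolated_prob("you", "know", p_uni, p_bi))  # 0.3*0.18 + 0.7*0.67 = 0.523
```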