These are the presentation slides from the Machine Learning Summer School in Korea.
http://prml.yonsei.ac.kr/
The talk covers the Dirichlet distribution, the Dirichlet process, and Hierarchical Dirichlet Processes (HDP).
1. Bayesian Nonparametric Topic Modeling
Hierarchical Dirichlet Processes
JinYeong Bak
Department of Computer Science
KAIST, Daejeon
South Korea
jy.bak@kaist.ac.kr
August 22, 2013
Parts of these slides are adapted from a presentation by Yee Whye Teh (y.w.teh@stats.ox.ac.uk).
4. Introduction
Bayesian topic models
Latent Dirichlet Allocation (LDA) [BNJ03]
Hierarchical Dirichlet Processes (HDP) [TJBB06]
In this talk:
The Dirichlet distribution and the Dirichlet process
The concept of Hierarchical Dirichlet Processes (HDP)
How to infer the latent variables in HDP
10. Motivation
What are the topics discussed in the article?
How can we describe the topics?
16. Topic Modeling
Each topic has a word distribution
17. Topic Modeling
Each document has a topic proportion
Each word has its own topic index
25. Latent Dirichlet Allocation
Generative process of LDA:
For each topic k ∈ {1,...,K}:
Draw a word distribution βk ∼ Dir(η)
For each document d ∈ {1,...,D}:
Draw topic proportions θd ∼ Dir(α)
For each word n ∈ {1,...,N} in the document:
Draw a topic index zdn ∼ Mult(θd)
Generate the word from the chosen topic: wdn ∼ Mult(β_{zdn})
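To make the generative story concrete, here is a minimal runnable sketch of the LDA generative process in Python/NumPy (not part of the original slides; K, D, N, V and the hyperparameter values are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, N, V = 5, 10, 50, 100   # topics, documents, words per doc, vocabulary size
eta, alpha = 0.1, 0.5          # illustrative Dirichlet hyperparameters

# For each topic k: draw a word distribution beta_k ~ Dir(eta)
beta = rng.dirichlet(np.full(V, eta), size=K)              # shape (K, V)

docs = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))             # topic proportions
    z_d = rng.choice(K, size=N, p=theta_d)                 # topic index per word
    w_d = np.array([rng.choice(V, p=beta[z]) for z in z_d])  # words from topics
    docs.append(w_d)
```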
29. Latent Dirichlet Allocation
Our interests
What are the topics discussed in the article?
How can we describe the topics?
30. Latent Dirichlet Allocation
What we can see
Words in documents
31. Latent Dirichlet Allocation
What we want to see
32. Latent Dirichlet Allocation
Our interests
What are the topics discussed in the article?
=> Topic proportion of each document
How can we describe the topics?
=> Word distribution of each topic
33. Latent Dirichlet Allocation
What we can see: w
What we want to see: θ, z, β
∴ Compute p(θ,z,β|w,α,η) = p(θ,z,β,w|α,η) / p(w|α,η)
But this posterior is intractable to compute (the normalization term p(w|α,η) is intractable)
So we use approximate inference methods:
Gibbs sampling
Variational inference
36. Limitation of Latent Dirichlet Allocation
Latent Dirichlet Allocation is a parametric model
People must specify the number of topics in a corpus
People must find the best number of topics
Q) Can we get it from the data automatically?
A) Hierarchical Dirichlet Processes
40. Dice modeling
Think about the probability of a number from dice
Each die has its own pmf
According to the textbook, it is widely assumed to be uniform
=> 1/6 for a six-sided die
Is it true?
Ans) No!
41. Dice modeling
We should model the randomness of the pmfs of the dice
How can we do that?
Let's imagine a bag holding many dice
We cannot see inside the bag
We can draw one die from the bag
OK, but what is the formal description?
43. Standard Simplex
A generalization of the notion of a triangle or tetrahedron
All points are non-negative and sum to 1 (http://en.wikipedia.org/wiki/Simplex)
A pmf can be thought of as a point in the standard simplex
Ex) A point p = (x,y,z), where x ≥ 0, y ≥ 0, z ≥ 0 and x + y + z = 1
45. Dirichlet distribution
Definition [BN06]
A probability distribution over the (K−1)-dimensional standard simplex
A distribution over pmfs of length K
Notation
θ ∼ Dir(α)
where θ = [θ1,...,θK] is a random pmf and α = [α1,...,αK]
Probability density function
p(θ;α) = [Γ(∑_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)] ∏_{k=1}^K θk^(αk−1)
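A small sketch of drawing pmfs from a Dirichlet distribution with NumPy (illustrative, not from the slides); each draw lies on the simplex:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([2.0, 3.0, 5.0])       # K = 3, illustrative parameters
theta = rng.dirichlet(alpha, size=4)    # four sample pmfs, shape (4, 3)
print(theta)                            # each row is non-negative ...
print(theta.sum(axis=1))                # ... and sums to 1
print(alpha / alpha.sum())              # mean of Dir(alpha) is alpha / sum(alpha)
```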
49. Property of Dirichlet distribution
Density plots [BAFG10]
50. Property of Dirichlet distribution
Sample pmfs from Dirichlet distribution [BAFG10]
51. Property of Dirichlet distribution
When K = 2, it is the Beta distribution
Conjugate prior for the Multinomial distribution
Likelihood X ∼ Mult(n,θ), Prior θ ∼ Dir(α)
∴ Posterior (θ|X) ∼ Dir(α + x), where x = (x1,...,xK) are the observed counts
Proof)
p(θ|X) = p(X|θ)p(θ) / p(X)
∝ p(X|θ)p(θ)
= [n! / (x1!···xK!)] ∏_{k=1}^K θk^xk · [Γ(∑_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)] ∏_{k=1}^K θk^(αk−1)
= C ∏_{k=1}^K θk^(αk+xk−1)
∝ Dir(α + x)
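A quick numerical illustration of this conjugacy (an illustrative sketch; the prior and count values are arbitrary): the posterior is Dir(α + x), and its mean approaches the true pmf as counts grow:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = np.array([1.0, 1.0, 1.0])              # prior Dir(alpha)
theta_true = np.array([0.2, 0.3, 0.5])         # unknown pmf we observe from
x = rng.multinomial(n=100, pvals=theta_true)   # observed counts

alpha_post = alpha + x                         # posterior is Dir(alpha + x)
print(alpha_post / alpha_post.sum())           # posterior mean, near theta_true
```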
53. Property of Dirichlet distribution
Aggregation property
Let (θ1,θ2,...,θK) ∼ Dir(α1,α2,...,αK),
then (θ1+θ2,θ3,...,θK) ∼ Dir(α1+α2,α3,...,αK)
In general, if {A1,...,AR} is any partition of {1,...,K},
then (∑_{k∈A1} θk,...,∑_{k∈AR} θk) ∼ Dir(∑_{k∈A1} αk,...,∑_{k∈AR} αk)
Decimative property
Let (θ1,θ2,...,θK) ∼ Dir(α1,α2,...,αK)
and (τ1,τ2) ∼ Dir(α1β1,α1β2) where β1+β2 = 1,
then (θ1τ1,θ1τ2,θ2,...,θK) ∼ Dir(α1β1,α1β2,α2,...,αK)
Neutrality property
Let (θ1,θ2,...,θK) ∼ Dir(α1,α2,...,αK),
then θk is independent of the vector (1/(1−θk))(θ1,...,θk−1,θk+1,...,θK)
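The aggregation property can be checked empirically by Monte Carlo; this illustrative sketch compares merged coordinates of Dir(α) draws against direct draws from the aggregated Dirichlet:

```python
import numpy as np

rng = np.random.default_rng(3)
a = np.array([1.0, 2.0, 3.0, 4.0])                 # illustrative parameters
theta = rng.dirichlet(a, size=100_000)
# merge the first two coordinates: (t1 + t2, t3, t4)
merged = np.column_stack([theta[:, 0] + theta[:, 1], theta[:, 2:]])
# draw directly from Dir(a1 + a2, a3, a4)
direct = rng.dirichlet(np.array([a[0] + a[1], a[2], a[3]]), size=100_000)
print(merged.mean(axis=0), direct.mean(axis=0))    # means agree closely
```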
58. Dice modeling
Think about the probability of a number from dice
Each die has its own pmf
Draw a die from a bag
Problem) We do not know the number of faces of the dice in the bag
Solution) The Dirichlet process
60. Dirichlet Process
Definition [BAFG10]
A distribution over probability measures
A distribution whose realizations are themselves distributions over a sample space
Formal definition
(Ω,B) is a measurable space
G0 is a distribution over the sample space Ω
α0 is a positive real number
G is a random probability measure over (Ω,B)
G ∼ DP(α0,G0)
if for any finite measurable partition (A1,...,AR) of Ω,
(G(A1),...,G(AR)) ∼ Dir(α0G0(A1),...,α0G0(AR))
62. Posterior Dirichlet Processes
G ∼ DP(α0,G0) can be treated as a random distribution over Ω
We can draw a sample θ1 from G
We can also form a finite partition (A1,...,AR) of Ω,
then p(θ1 ∈ Ar|G) = G(Ar), p(θ1 ∈ Ar) = G0(Ar)
(G(A1),...,G(AR)) ∼ Dir(α0G0(A1),...,α0G0(AR))
Using Dirichlet-multinomial conjugacy, the posterior is
(G(A1),...,G(AR))|θ1 ∼ Dir(α0G0(A1)+δθ1(A1),...,α0G0(AR)+δθ1(AR))
where δθ(Ar) = 1 if θ ∈ Ar and 0 otherwise
This holds for every finite partition of Ω
66. Posterior Dirichlet Processes
For every finite partition of Ω,
(G(A1),...,G(AR))|θ1 ∼ Dir(α0G0(A1)+δθ1(A1),...,α0G0(AR)+δθ1(AR))
where δθ1(Ar) = 1 if θ1 ∈ Ar and 0 otherwise
The posterior process is also a Dirichlet process:
G|θ1 ∼ DP(α0+1, (α0G0+δθ1)/(α0+1))
Summary)
θ1|G ∼ G, G ∼ DP(α0,G0)
⇐⇒ θ1 ∼ G0, G|θ1 ∼ DP(α0+1, (α0G0+δθ1)/(α0+1))
69. Blackwell-MacQueen Urn Scheme
Now we draw samples θ1,...,θN
First sample:
θ1|G ∼ G, G ∼ DP(α0,G0)
⇐⇒ θ1 ∼ G0, G|θ1 ∼ DP(α0+1, (α0G0+δθ1)/(α0+1))
Second sample:
θ2|θ1,G ∼ G, G|θ1 ∼ DP(α0+1, (α0G0+δθ1)/(α0+1))
⇐⇒ θ2|θ1 ∼ (α0G0+δθ1)/(α0+1), G|θ1,θ2 ∼ DP(α0+2, (α0G0+δθ1+δθ2)/(α0+2))
73. Blackwell-MacQueen Urn Scheme
The Blackwell-MacQueen urn scheme produces a sequence θ1,θ2,... with the following conditionals:
θN|θ1,...,θN−1 ∼ (α0G0 + ∑_{n=1}^{N−1} δθn) / (α0+N−1)
Pólya urn analogy:
An infinite number of ball colors
An empty urn
Filling the Pólya urn (n starts at 1):
With probability proportional to α0, pick a new color from the set of infinite ball colors G0, paint a new ball that color, and add it to the urn
With probability proportional to n−1, pick a ball from the urn, record its color, and put it back into the urn together with another ball of the same color
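An illustrative simulation of the urn scheme (assuming G0 = Uniform(0,1) purely for concreteness; any base distribution works):

```python
import numpy as np

def blackwell_macqueen(n_samples, alpha0, rng):
    thetas = []
    for n in range(1, n_samples + 1):
        # new draw from G0 with probability alpha0 / (alpha0 + n - 1),
        # otherwise reuse a uniformly chosen previous theta
        if rng.random() < alpha0 / (alpha0 + n - 1):
            thetas.append(rng.random())                    # theta ~ G0
        else:
            thetas.append(thetas[rng.integers(len(thetas))])
    return thetas

rng = np.random.default_rng(4)
draws = blackwell_macqueen(20, alpha0=2.0, rng=rng)
print(len(set(draws)), "distinct values among", len(draws), "draws")
```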
74. Chinese Restaurant Process
Draw θ1,θ2,...,θN from the Blackwell-MacQueen urn scheme
With probability proportional to α0, pick a new color from the set of infinite ball colors G0, paint a new ball that color, and add it to the urn
With probability proportional to n−1, pick a ball from the urn, record its color, and put it back into the urn together with another ball of the same color
The θ's can take the same value, θi = θj
There are K ≤ N distinct values, φ1,...,φK
This induces a partition of the samples:
θ1,θ2,...,θN induce φ1,...,φK
The distribution over partitions is called the Chinese Restaurant Process (CRP)
76. Chinese Restaurant Process
θ1,θ2,...,θN induce φ1,...,φK
Chinese Restaurant Process interpretation:
There is a Chinese restaurant with infinitely many tables
Each customer sits at a table
Generating from the Chinese Restaurant Process:
The first customer sits at the first table
The n-th customer sits at:
a new table with probability α0/(α0+n−1)
table k with probability nk/(α0+n−1),
where nk is the number of customers at table k
80. Chinese Restaurant Process
The CRP exhibits the clustering property of the DP
Tables are clusters, φk ∼ G0
Customers are the actual realizations, θn = φ_{zn} where zn ∈ {1,...,K}
81. Stick Breaking Construction
The Blackwell-MacQueen urn scheme / CRP generates θ ∼ G, not G itself
To construct G, we use the stick-breaking construction
Review) Posterior Dirichlet processes:
θ1|G ∼ G, G ∼ DP(α0,G0)
⇐⇒ θ1 ∼ G0, G|θ1 ∼ DP(α0+1, (α0G0+δθ1)/(α0+1))
Consider the partition ({θ1}, Ω∖{θ1}) of Ω. Then
(G({θ1}), G(Ω∖{θ1})) ∼ Dir((α0G0+δθ1)({θ1}), (α0G0+δθ1)(Ω∖{θ1}))
= Dir(1,α0) = Beta(1,α0)
(since G0 is smooth, G0({θ1}) = 0)
84. Stick Breaking Construction
Consider the partition ({θ1}, Ω∖{θ1}) of Ω. Then
(G({θ1}), G(Ω∖{θ1})) = (β1, 1−β1) ∼ Beta(1,α0)
G has a point mass located at θ1:
G = β1δθ1 + (1−β1)G′, β1 ∼ Beta(1,α0)
where G′ is the probability measure with the point mass θ1 removed (and renormalized)
What is G′?
87. Stick Breaking Construction
Summary) Posterior Dirichlet processes:
θ1|G ∼ G, G ∼ DP(α0,G0)
⇐⇒ θ1 ∼ G0, G|θ1 ∼ DP(α0+1, (α0G0+δθ1)/(α0+1))
G = β1δθ1 + (1−β1)G′, β1 ∼ Beta(1,α0)
Consider a further partition ({θ1},A1,...,AR) of Ω:
(G({θ1}),G(A1),...,G(AR)) = (β1,(1−β1)G′(A1),...,(1−β1)G′(AR))
∼ Dir(1,α0G0(A1),...,α0G0(AR))
Using the decimative property of the Dirichlet distribution (proof):
(G′(A1),...,G′(AR)) ∼ Dir(α0G0(A1),...,α0G0(AR))
∴ G′ ∼ DP(α0,G0)
90. Stick Breaking Construction
Repeat this with the distinct values φ1,φ2,···:
G ∼ DP(α0,G0)
G = β1δφ1 + (1−β1)G1
G = β1δφ1 + (1−β1)(β2δφ2 + (1−β2)G2)
...
G = ∑_{k=1}^∞ πk δφk
where πk = βk ∏_{i=1}^{k−1}(1−βi), ∑_{k=1}^∞ πk = 1, βk ∼ Beta(1,α0), φk ∼ G0
Draws from the DP look like a sum of point masses, with masses drawn from a stick-breaking construction.
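A truncated stick-breaking sampler, as an illustrative sketch (the truncation level and the base distribution G0 = Uniform(0,1) are arbitrary choices):

```python
import numpy as np

def stick_breaking(alpha0, truncation, rng):
    betas = rng.beta(1.0, alpha0, size=truncation)
    # prod_{i<k} (1 - beta_i), with an empty product (= 1) for k = 1
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    pi = betas * remaining            # pi_k = beta_k * prod_{i<k}(1 - beta_i)
    phi = rng.random(truncation)      # atoms phi_k ~ G0
    return pi, phi

rng = np.random.default_rng(5)
pi, phi = stick_breaking(alpha0=2.0, truncation=1000, rng=rng)
print(pi.sum())   # close to 1 for a large enough truncation
```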
92. Summary of DP
Definition
G is a random probability measure over (Ω,B)
G ∼ DP(α0,G0)
if for any finite measurable partition (A1,...,AR) of Ω,
(G(A1),...,G(AR)) ∼ Dir(α0G0(A1),...,α0G0(AR))
Chinese Restaurant Process
Stick Breaking Construction
94. Dirichlet Process Mixture Models
We model a data set x1,...,xN using the following model [Nea00]:
xn ∼ F(θn)
θn ∼ G
G ∼ DP(α0,G0)
Each θn is a latent parameter modelling xn, while G is the unknown distribution over parameters, modelled using a DP
96. Dirichlet Process Mixture Models
Since G is of the form
G = ∑_{k=1}^∞ πk δφk
we have θn = φk with probability πk
Let kn take on value k with probability πk. We can equivalently define θn = φ_{kn}
An equivalent model:
xn ∼ F(θn), θn ∼ G, G ∼ DP(α0,G0)
⇐⇒ xn ∼ F(φ_{kn}), p(kn = k) = πk,
πk = βk ∏_{i=1}^{k−1}(1−βi), βk ∼ Beta(1,α0), φk ∼ G0
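An illustrative sketch of generating data from a DP mixture via the equivalent model above, with F taken to be a unit-variance Gaussian and G0 a wide Gaussian (both arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
alpha0, T, N = 1.0, 500, 200             # concentration, truncation, data size

betas = rng.beta(1.0, alpha0, size=T)
pi = betas * np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
pi /= pi.sum()                           # renormalize the truncated weights
phi = rng.normal(0.0, 5.0, size=T)       # component parameters phi_k ~ G0

k = rng.choice(T, size=N, p=pi)          # k_n with p(k_n = k) = pi_k
x = rng.normal(phi[k], 1.0)              # x_n ~ F(phi_{k_n})
print(len(np.unique(k)), "clusters used among", N, "points")
```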
101. Topic modeling with documents
Each document consists of a bag of words
Each word in a document has a latent topic index
Latent topics for words in a document can be grouped
Each document has a topic proportion
Each topic has a word distribution
Topics must be shared across documents
103. Problem of the Naive Dirichlet Process Mixture Model
Use a DP mixture for each document:
xdn ∼ F(θdn), θdn ∼ Gd, Gd ∼ DP(α0,G0)
But there is no sharing of clusters across different groups because G0 is smooth:
G1 = ∑_{k=1}^∞ π1k δ_{φ1k}, G2 = ∑_{k=1}^∞ π2k δ_{φ2k}, φ1k,φ2k ∼ G0
105. Problem of the Naive Dirichlet Process Mixture Model
Solution
Make the base distribution G0 discrete
Put a DP prior on the common base distribution
Hierarchical Dirichlet Process:
G0 ∼ DP(γ,H)
G1,G2|G0 ∼ DP(α0,G0)
107. Hierarchical Dirichlet Processes
Making G0 discrete forces clusters to be shared between G1 and G2
108. Stick Breaking Construction
A Hierarchical Dirichlet Process with documents 1,...,D:
G0 ∼ DP(γ,H)
Gd|G0 ∼ DP(α0,G0)
The stick-breaking construction for the HDP:
G0 = ∑_{k=1}^∞ βk δφk, φk ∼ H
βk = β′k ∏_{i=1}^{k−1}(1−β′i), β′k ∼ Beta(1,γ)
Gd = ∑_{k=1}^∞ πdk δφk
πdk = π′dk ∏_{i=1}^{k−1}(1−π′di), π′dk ∼ Beta(α0βk, α0(1−∑_{i=1}^k βi))
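An illustrative sketch of this two-level construction (truncated; the values of γ, α0, K, D and the base H = N(0,1) are arbitrary choices), showing that all Gd share the same atoms φk:

```python
import numpy as np

rng = np.random.default_rng(9)
gamma, alpha0, K, D = 1.0, 1.0, 200, 3

bp = rng.beta(1.0, gamma, size=K)                              # beta'_k
beta = bp * np.concatenate([[1.0], np.cumprod(1.0 - bp)[:-1]])  # global weights
phi = rng.normal(size=K)                                        # atoms phi_k ~ H

Pi = np.empty((D, K))
for d in range(D):
    # pi'_dk ~ Beta(alpha0 * beta_k, alpha0 * (1 - sum_{i<=k} beta_i))
    tail = 1.0 - np.cumsum(beta)
    pp = rng.beta(alpha0 * beta, np.maximum(alpha0 * tail, 1e-12))
    Pi[d] = pp * np.concatenate([[1.0], np.cumprod(1.0 - pp)[:-1]])

print(Pi.sum(axis=1))   # each close to 1; every Gd reuses the same phi_k
```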
111. Chinese Restaurant Franchise
G0 ∼ DP(γ,H), φk ∼ H
Gd|G0 ∼ DP(α0,G0), θdn ∼ Gd
For each document d, draw θd1,θd2,... from a Blackwell-MacQueen urn scheme
θd1,θd2,... induce φd1,φd2,...
For another document d′, draw θd′1,θd′2,... from a Blackwell-MacQueen urn scheme
θd′1,θd′2,... induce φd′1,φd′2,...
112. Chinese Restaurant Franchise
Chinese Restaurant Franchise interpretation:
Each restaurant has infinitely many tables
All restaurants share the food menu
Each customer sits at a table
Generating from the Chinese Restaurant Franchise:
For each restaurant:
The first customer sits at the first table and chooses a new menu item
The n-th customer sits at:
a new table with probability α0/(α0+n−1)
table t with probability ndt/(α0+n−1),
where ndt is the number of customers at table t
A customer sitting at a new table chooses:
a new menu item with probability γ/(γ+m−1)
existing menu item k with probability mk/(γ+m−1),
where m is the number of tables in all restaurants and mk is the number of tables serving menu item k in all restaurants
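An illustrative simulation of the franchise's seating process (α0, γ and the sizes are arbitrary), which makes the sharing of dishes (topics) across restaurants (documents) visible:

```python
import numpy as np

rng = np.random.default_rng(7)
alpha0, gamma, D, N = 1.0, 1.0, 3, 50

menu_counts = []                         # m_k: tables serving dish k, globally
for d in range(D):
    table_counts, table_dish = [], []    # n_dt and the dish of each table
    for n in range(N):
        # choose a table: existing with prob prop. to n_dt, new with alpha0
        probs = np.array(table_counts + [alpha0], dtype=float)
        t = rng.choice(len(probs), p=probs / probs.sum())
        if t == len(table_counts):       # new table: choose its dish
            dish_probs = np.array(menu_counts + [gamma], dtype=float)
            k = rng.choice(len(dish_probs), p=dish_probs / dish_probs.sum())
            if k == len(menu_counts):
                menu_counts.append(0)    # brand-new dish
            menu_counts[k] += 1
            table_counts.append(0)
            table_dish.append(k)
        table_counts[t] += 1

print(len(menu_counts), "dishes shared across", D, "restaurants")
```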
117. HDP for Topic modeling
Questions
What can we assume about the topics in a document?
What can we assume about the words in the topics?
Solution
Each document consists of a bag of words
Each word in a document has a latent topic
Latent topics for words in a document can be grouped
Each document has a topic proportion
Each topic has a word distribution
Topics must be shared across documents
120. Gibbs Sampling
Definition
A special case of the Markov chain Monte Carlo (MCMC) method
An iterative algorithm that constructs a dependent sequence of parameter values whose distribution converges to the target joint posterior distribution [Hof09]
Algorithm
Find the full conditional distributions of the latent variables of the target distribution
Initialize all latent variables
Sample until converged:
Sample each latent variable from its full conditional distribution
122. Collapsed Gibbs sampling
Collapsed Gibbs sampling integrates out one or more variables when sampling the other variables.
Example)
There are three latent variables A, B and C.
Plain Gibbs samples p(A|B,C), p(B|A,C) and p(C|A,B) sequentially.
But when we integrate out B, we sample only p(A|C) and p(C|A) sequentially.
125. Review) Chinese Restaurant Franchise
Generating from the Chinese Restaurant Franchise (as on slide 112):
For each restaurant:
The first customer sits at the first table and chooses a new menu item
The n-th customer sits at a new table with probability α0/(α0+n−1), or at table t with probability ndt/(α0+n−1), where ndt is the number of customers at table t
A customer sitting at a new table chooses a new menu item with probability γ/(γ+m−1), or existing menu item k with probability mk/(γ+m−1), where m is the total number of tables and mk is the number of tables serving menu item k
126. Alternative form of HDP
G0 ∼ DP(γ,H), φdt ∼ G0
∴ G0|φdt,... ∼ DP(γ+m, (γH + ∑_{k=1}^K mk δφk)/(γ+m))
Then G0 is given as
G0 = ∑_{k=1}^K πk δφk + πu Gu
where
Gu ∼ DP(γ,H)
π = (π1,...,πK,πu) ∼ Dir(m1,...,mK,γ)
p(φk|·) ∝ h(φk) ∏_{dn: zdn=k} f(xdn|φk)
129. Gibbs Sampling for HDP
Joint distribution:
p(θ,z,φ,x,π,m|α0,η,γ) = p(π|m,γ) ∏_{k=1}^K p(φk|η) ∏_{d=1}^D [ p(θd|α0,π) ∏_{n=1}^N p(zdn|θd) p(xdn|zdn,φ) ]
Integrate out θ and φ:
p(z,x,π,m|α0,η,γ) = [Γ(∑_{k=1}^K m·k + γ) / (∏_{k=1}^K Γ(m·k) Γ(γ))] ∏_{k=1}^K πk^{m·k−1} · π_{K+1}^{γ−1}
· ∏_{k=1}^K [Γ(∑_{v=1}^V ηv) / ∏_{v=1}^V Γ(ηv)] · [∏_{v=1}^V Γ(ηv + n^k_{(·),v}) / Γ(∑_{v=1}^V (ηv + n^k_{(·),v}))]
· ∏_{d=1}^D [Γ(∑_{k=1}^K α0πk) / ∏_{k=1}^K Γ(α0πk)] · [∏_{k=1}^K Γ(α0πk + n^k_{d,(·)}) / Γ(∑_{k=1}^K (α0πk + n^k_{d,(·)}))]
where n^k_{(·),v} counts words of vocabulary type v assigned to topic k, and n^k_{d,(·)} counts words in document d assigned to topic k
130. Gibbs Sampling for HDP
Full conditional distribution of z:
p(z_{d′n′} = k′ | z^{−(d′n′)}, m, π, x, ·) = p(z_{d′n′} = k′, z^{−(d′n′)}, m, π, x | ·) / p(z^{−(d′n′)}, m, π, x | ·)
∝ p(z_{d′n′} = k′, z^{−(d′n′)}, m, π, x | ·)
∝ (α0πk′ + n^{k′,−(d′n′)}_{d′,(·)}) · (ηv + n^{k′,−(d′n′)}_{(·),v}) / ∑_{v=1}^V (ηv + n^{k′,−(d′n′)}_{(·),v})
where the superscript −(d′n′) denotes counts computed with word (d′,n′) excluded, and v is the vocabulary type of x_{d′n′}
131. Gibbs Sampling for HDP
Full conditional distribution of m:
The probability that word x_{d′n′} is assigned to some table t such that kdt = k:
p(θ_{d′n′} = φt | φdt = φk, θ^{−(d′n′)}, π) ∝ n^{(·),−(d′n′)}_{d,(·),t}
p(θ_{d′n′} = new table | φ_{dt_new} = φk, θ^{−(d′n′)}, π) ∝ α0πk
These equations form a Dirichlet process with concentration parameter α0πk and assignments n^{(·),−(d′n′)}_{d,(·),t} to components
The corresponding distribution over the number of components is the desired conditional distribution of mdk
Antoniak [Ant74] has shown that
p(mdk = m | z, m^{−dk}, π) = [Γ(α0πk) / Γ(α0πk + n^k_{d,(·),(·)})] s(n^k_{d,(·),(·)}, m) (α0πk)^m
where s(n,m) is the unsigned Stirling number of the first kind
134. Gibbs Sampling for HDP
Full conditional distribution of π:
(π1,π2,...,πK,πu)|· ∼ Dir(m·1,m·2,...,m·K,γ)
135. Gibbs Sampling for HDP
Algorithm 1 Gibbs Sampling for HDP
1: Initialize all latent variables randomly
2: repeat
3:   for each document d do
4:     for each word n in document d do
5:       Sample z(d,n) with p(z(d,n) = k|·) ∝ (α0πk + n^{k,−(d,n)}_{d,(·)}) (ηv + n^{k,−(d,n)}_{(·),v}) / ∑_{v=1}^V (ηv + n^{k,−(d,n)}_{(·),v})
6:     end for
7:     Sample m with p(mdk = m|·) ∝ [Γ(α0πk) / Γ(α0πk + n^k_{d,(·),(·)})] s(n^k_{d,(·),(·)}, m) (α0πk)^m
8:     Sample π ∼ Dir(m·1,m·2,...,m·K,γ)
9:   end for
10: until converged
137. Stick Breaking Construction
Review) The stick-breaking construction for the HDP (as on slide 108), with documents 1,...,D, G0 ∼ DP(γ,H) and Gd|G0 ∼ DP(α0,G0):
G0 = ∑_{k=1}^∞ βk δφk, φk ∼ H, βk = β′k ∏_{i=1}^{k−1}(1−β′i), β′k ∼ Beta(1,γ)
Gd = ∑_{k=1}^∞ πdk δφk, πdk = π′dk ∏_{i=1}^{k−1}(1−π′di), π′dk ∼ Beta(α0βk, α0(1−∑_{i=1}^k βi))
138. Alternative Stick Breaking Construction
Problem)
In the original stick-breaking construction, the weights βk and πdk are tightly correlated:
βk = β′k ∏_{i=1}^{k−1}(1−β′i), β′k ∼ Beta(1,γ)
πdk = π′dk ∏_{i=1}^{k−1}(1−π′di), π′dk ∼ Beta(α0βk, α0(1−∑_{i=1}^k βi))
Alternative stick-breaking construction for each document [FSJW08]:
ψdt ∼ G0
πdt = π′dt ∏_{i=1}^{t−1}(1−π′di), π′dt ∼ Beta(1,α0)
Gd = ∑_{t=1}^∞ πdt δψdt
140. Alternative Stick Breaking Construction
The stick-breaking construction for the HDP:
G0 = ∑_{k=1}^∞ βk δφk, φk ∼ H, βk = β′k ∏_{i=1}^{k−1}(1−β′i), β′k ∼ Beta(1,γ)
Gd = ∑_{t=1}^∞ πdt δψdt, ψdt ∼ G0, πdt = π′dt ∏_{i=1}^{t−1}(1−π′di), π′dt ∼ Beta(1,α0)
To connect ψdt and φk:
We add an auxiliary variable cdt ∼ Mult(β)
Then ψdt = φ_{cdt}
141. Alternative Stick Breaking Construction
Generative process:
1 For each global-level topic k ∈ {1,...,∞}:
  1 Draw topic word proportions φk ∼ Dir(η)
  2 Draw a corpus breaking proportion β′k ∼ Beta(1,γ)
2 For each document d ∈ {1,...,D}:
  1 For each document-level topic t ∈ {1,...,∞}:
    1 Draw a document-level topic index cdt ∼ Mult(σ(β′))
    2 Draw a document breaking proportion π′dt ∼ Beta(1,α0)
  2 For each word n ∈ {1,...,N}:
    1 Draw a topic index zdn ∼ Mult(σ(π′d))
    2 Generate a word wdn ∼ Mult(φ_{c_{d,zdn}})
where σ(β′) ≡ (β1,β2,...) with βk = β′k ∏_{i=1}^{k−1}(1−β′i)
142. Variational Inference
Main idea [JGJS98]
Approximate the original graphical model with a simpler model
Minimize the dissimilarity between the original and the simplified one
More formally:
Observed data X, latent variables Z
We want to compute p(Z|X)
Introduce q(Z)
Minimize the dissimilarity between p and q (commonly the KL divergence of p from q, DKL(q||p))
144. KL-divergence of p from q
Find a lower bound of the log evidence log p(X):
log p(X) = log ∑_{Z} p(Z,X) = log ∑_{Z} p(Z,X) · q(Z|X)/q(Z|X)
= log ∑_{Z} q(Z|X) [p(Z,X)/q(Z|X)]
≥ ∑_{Z} q(Z|X) log [p(Z,X)/q(Z|X)]   (by Jensen's inequality)
Gap between log p(X) and its lower bound:
log p(X) − ∑_{Z} q(Z|X) log [p(Z,X)/q(Z|X)] = ∑_{Z} q(Z) log [q(Z)/p(Z|X)] = DKL(q||p)
146. KL-divergence of p from q
log p(X) = ∑_{Z} q(Z|X) log [p(Z,X)/q(Z|X)] + DKL(q||p)
The log evidence log p(X) is fixed with respect to q
∴ Minimizing DKL(q||p) ≡ maximizing the lower bound of log p(X)
147. Variational Inference
Main idea [JGJS98]
Approximate the original graphical model with a simpler model
Minimize the dissimilarity between the original and the simplified one
More formally:
Observed data X, latent variables Z
We want to compute p(Z|X)
Introduce q(Z)
Minimize the dissimilarity between p and q (commonly DKL(q||p)):
Find a lower bound of log p(X)
Maximize it
148. Variational Inference for HDP
q(β,φ,π,c,z) = ∏_{k=1}^K q(φk|λk) ∏_{k=1}^{K−1} q(βk|a¹k,a²k) ∏_{d=1}^D [ ∏_{t=1}^T q(cdt|ζdt) ∏_{t=1}^{T−1} q(πdt|γ¹dt,γ²dt) ∏_{n=1}^N q(zdn|ϕdn) ]
(K and T are the corpus-level and document-level truncations)
149. Variational Inference for HDP
Find a lower bound of log p(w|α0,γ,η):
ln p(w|α0,γ,η)
= ln ∫∫∫ ∑_c ∑_z p(w,β,φ,π,c,z|α0,γ,η) dβ dφ dπ
= ln ∫∫∫ ∑_c ∑_z p(w,β,φ,π,c,z|α0,γ,η) · q(β,φ,π,c,z)/q(β,φ,π,c,z) dβ dφ dπ
≥ ∫∫∫ ∑_c ∑_z ln [p(w,β,φ,π,c,z|α0,γ,η)/q(β,φ,π,c,z)] · q(β,φ,π,c,z) dβ dφ dπ   (Jensen's inequality)
= ∫∫∫ ∑_c ∑_z ln p(w,β,φ,π,c,z|α0,γ,η) · q(β,φ,π,c,z) dβ dφ dπ − ∫∫∫ ∑_c ∑_z ln q(β,φ,π,c,z) · q(β,φ,π,c,z) dβ dφ dπ
= Eq[ln p(w,β,φ,π,c,z|α0,γ,η)] − Eq[ln q(β,φ,π,c,z)]
150. Variational Inference for HDP
ln p(w|α0,γ,η)
≥ Eq[ln p(w,β,φ,π,c,z|α0,γ,η)] − Eq[ln q(β,φ,π,c,z)]
= Eq[ln p(β|γ) p(φ|η) ∏_{d=1}^D p(πd|α0) p(cd|β) ∏_{n=1}^N p(wdn|cd,zdn,φ) p(zdn|πd)]
− Eq[ln ∏_{k=1}^K q(φk|λk) ∏_{k=1}^{K−1} q(βk|a¹k,a²k) ∏_{d=1}^D ∏_{t=1}^T q(cdt|ζdt) ∏_{t=1}^{T−1} q(πdt|γ¹dt,γ²dt) ∏_{n=1}^N q(zdn|ϕdn)]
= ∑_{d=1}^D { Eq[ln p(πd|α0)] + Eq[ln p(cd|β)] + Eq[ln p(wd|cd,zd,φ)] + Eq[ln p(zd|πd)]
− Eq[ln q(cd|ζd)] − Eq[ln q(πd|γ¹d,γ²d)] − Eq[ln q(zd|ϕd)] }
+ Eq[ln p(β|γ)] + Eq[ln p(φ|η)] − Eq[ln q(φ|λ)] − Eq[ln q(β|a¹,a²)]
We can run variational EM to maximize this lower bound of log p(w|α0,γ,η)
151. Variational Inference for HDP
Maximize the lower bound of log p(w|α0,γ,η)
Take the derivative with respect to each variational parameter:
γ¹dt = 1 + ∑_{n=1}^N ϕdnt,   γ²dt = α0 + ∑_{n=1}^N ∑_{b=t+1}^T ϕdnb
ζdtk = exp{ ∑_{e=1}^{k−1}(Ψ(a²e) − Ψ(a¹e+a²e)) + (Ψ(a¹k) − Ψ(a¹k+a²k)) + ∑_{n=1}^N ∑_{v=1}^V w^v_{dn} ϕdnt (Ψ(λkv) − Ψ(∑_{l=1}^V λkl)) }
ϕdnt = exp{ ∑_{h=1}^{t−1}(Ψ(γ²dh) − Ψ(γ¹dh+γ²dh)) + (Ψ(γ¹dt) − Ψ(γ¹dt+γ²dt)) + ∑_{k=1}^K ∑_{v=1}^V w^v_{dn} ζdtk (Ψ(λkv) − Ψ(∑_{l=1}^V λkl)) }
a¹k = 1 + ∑_{d=1}^D ∑_{t=1}^T ζdtk,   a²k = γ + ∑_{d=1}^D ∑_{t=1}^T ∑_{f=k+1}^K ζdtf
λkv = ηv + ∑_{d=1}^D ∑_{n=1}^N ∑_{t=1}^T w^v_{dn} ϕdnt ζdtk
where Ψ is the digamma function
152. Variational Inference for HDP
Maximize the lower bound of log p(w|α0,γ,η) by variational EM:
E step: compute the document-level parameters γ¹dt, γ²dt, ζdtk, ϕdnt
M step: compute the corpus-level parameters a¹k, a²k, λkv
Algorithm 2 Variational Inference for HDP
1: Initialize the variational parameters
2: repeat
3:   for each document d do
4:     repeat
5:       Compute document parameters γ¹dt, γ²dt, ζdtk, ϕdnt
6:     until converged
7:   end for
8:   Compute topic parameters a¹k, a²k, λkv
9: until converged
154. Online Variational Inference
Stochastic optimization of the variational objective [WPB11]:
Subsample the documents
Compute an approximation of the gradient based on the subsample
Follow that gradient with a decreasing step size
155. Variational Inference for HDP
Lower bound of log p(w|α0,γ,η):
ln p(w|α0,γ,η)
≥ ∑_{d=1}^D { Eq[ln p(πd|α0)] + Eq[ln p(cd|β)] + Eq[ln p(wd|cd,zd,φ)] + Eq[ln p(zd|πd)]
− Eq[ln q(cd|ζd)] − Eq[ln q(πd|γ¹d,γ²d)] − Eq[ln q(zd|ϕd)] }
+ Eq[ln p(β|γ)] + Eq[ln p(φ|η)] − Eq[ln q(φ|λ)] − Eq[ln q(β|a¹,a²)]
= ∑_{d=1}^D Ld + Lk
= Ej[D·Lj + Lk]
where j is a document index drawn uniformly at random
156. Online Variational Inference for HDP
Lower bound of log p(w|α0,γ,η) = Ej[D·Lj + Lk]
Online learning algorithm for HDP:
Sample a document d
Compute its optimal document-level parameters γ¹dt, γ²dt, ζdtk, ϕdnt
Take the gradient of the corpus-level parameters a¹k, a²k, λkv based on this noisy estimate
Update the corpus-level parameters a¹k, a²k, λkv with a decreasing learning rate:
a¹k = (1−ρe)a¹k + ρe(1 + D ∑_{t=1}^T ζdtk)
a²k = (1−ρe)a²k + ρe(γ + D ∑_{t=1}^T ∑_{f=k+1}^K ζdtf)
λkv = (1−ρe)λkv + ρe(ηv + D ∑_{n=1}^N ∑_{t=1}^T w^v_{dn} ϕdnt ζdtk)
where ρe is the learning rate, which must satisfy ∑_{e=1}^∞ ρe = ∞ and ∑_{e=1}^∞ ρe² < ∞
(The natural gradient is structurally equivalent to the variational-inference update.)
157. Online Variational Inference for HDP
Algorithm 3 Online Variational Inference for HDP
1: Initialize the variational parameters
2: e = 0
3: for each document d ∈ {1,...,D} do
4:   repeat
5:     Compute document parameters γ¹dt, γ²dt, ζdtk, ϕdnt
6:   until converged
7:   e = e + 1
8:   Compute learning rate ρe = (τ0+e)^{−κ} where τ0 > 0, κ ∈ (0.5,1]
9:   Update topic parameters a¹k, a²k, λkv
10: end for
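An illustrative sketch of the learning-rate schedule and the generic online step from Algorithm 3 (the τ0 and κ values are arbitrary within the stated ranges):

```python
import numpy as np

def learning_rate(e, tau0=1.0, kappa=0.7):
    # rho_e = (tau0 + e)^(-kappa) with tau0 > 0 and kappa in (0.5, 1],
    # so that sum(rho_e) diverges while sum(rho_e^2) converges
    return (tau0 + e) ** (-kappa)

def online_step(old, estimate, rho):
    # the generic update applied to a1_k, a2_k and lambda_kv:
    # new = (1 - rho) * old + rho * (noisy single-document estimate)
    return (1.0 - rho) * old + rho * estimate

rho = np.array([learning_rate(e) for e in range(1, 10_001)])
print(rho[:3])                       # decreasing step sizes
print(rho.sum(), (rho ** 2).sum())   # first partial sum keeps growing,
                                     # the squared sum stays bounded
```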
159. Motivation
Problem 1: Inference for HDP takes a long time
Problem 2: A continuously expanding corpus necessitates continuous updates of the model parameters
But updating the model parameters is not possible with plain HDP
We must re-train with the entire updated corpus
Our approach: combine distributed inference and online learning
160. Distributed Online HDP
Based on variational inference
Mini-batch updates via stochastic learning (variational EM)
Distribute the variational EM using MapReduce
161. Distributed Online HDP
Algorithm 4 Distributed Online HDP - Driver
1: Initialize the variational parameters
2: e = 0
3: while true do
4:   Collect new documents s ∈ {1,...,S}
5:   e = e + 1
6:   Compute learning rate ρe = (τ0+e)^{−κ} where τ0 > 0, κ ∈ (0.5,1]
7:   Run a MapReduce job
8:   Get the result of the job and update the topic parameters
9: end while
162. Distributed Online HDP
Algorithm 5 Distributed Online HDP - Mapper
1: The mapper gets one document s ∈ {1,...,S}
2: repeat
3:   Compute document parameters γ¹dt, γ²dt, ζdtk, ϕdnt
4: until converged
5: Output the sufficient statistics for the topic parameters
Algorithm 6 Distributed Online HDP - Reducer
1: The reducer gets the sufficient statistics for each topic parameter
2: Compute the change of the topic parameter from the sufficient statistics
3: Output the change of the topic parameter
163. Experimental Setup
Data: 973,266 Twitter conversations, 7.54 tweets/conversation
Approximately 7,297,000 tweets
60-node Hadoop system
Each node with 8 × 2.30 GHz cores
164. Result
Distributed Online HDP runs faster than online HDP
Distributed Online HDP preserves the quality of the result (perplexity)
165. Practical Tips
Until now, I talked about Bayesian Nonparametric Topic Modeling:
The concept of Hierarchical Dirichlet Processes
How to infer the latent variables in HDP
These are theoretical interests
Someone who attended the last machine learning winter school said:
"Wow! These are good and interesting machine learning topics! But I want to know about practical issues, because I am in industry."
So I prepared some tips for him/her and you
169. Some tips for using topic models
How to manage hyper-parameters (Dirichlet parameters)?
How to manage learning rate and mini-batch size in online learning?
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 107 / 121
171. HDP
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 109 / 121
172. Property of Dirichlet distribution
Sample pmfs from Dirichlet distribution [BAFG10]
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 110 / 121
173. Assign Dirichlet parameters
Set the Dirichlet parameters to less than 1
People usually use only a few topics to write a document
People usually do not use all topics
Each topic usually uses only a few words to represent itself
Each topic does not use all words
We can also assign individual weights to topics/words
Some topics are more general than others
Some words are more general than others
Words with positive/negative meanings appear in positive/negative
sentiments [JO11]
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 111 / 121
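To see the sparsity effect of parameters below 1 directly, one can sample from symmetric Dirichlet distributions; the following minimal NumPy snippet (my own illustration) draws PMFs over 10 categories with α = 0.1 and α = 10.

```python
import numpy as np

np.random.seed(0)

# Symmetric Dirichlet over 10 categories: parameters below 1 concentrate
# probability mass on a few entries (sparse PMFs), while parameters above 1
# spread it almost uniformly.
for alpha in (0.1, 10.0):
    pmf = np.random.dirichlet(np.full(10, alpha))
    print(alpha, np.round(pmf, 3))
```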
177. Some tips for using topic models
How to manage hyper-parameters (Dirichlet parameters)?
How to manage learning rate and mini-batch size in online learning?
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 112 / 121
178. Compute learning rate ρ_e = (τ_0 + e)^{-κ} where τ_0 > 0, κ ∈ (0.5, 1]
a^1_k = (1 − ρ_e) a^1_k + ρ_e (1 + D ∑_{t=1}^{T} ζ_{dtk})
a^2_k = (1 − ρ_e) a^2_k + ρ_e (γ + D ∑_{t=1}^{T} ∑_{f=k+1}^{K} ζ_{dtf})
λ_{kv} = (1 − ρ_e) λ_{kv} + ρ_e (η_v + D ∑_{n=1}^{N} ∑_{t=1}^{T} w^v_{dn} ϕ_{dnt} ζ_{dtk})
Meaning of each parameter
τ_0: slows down the early iterations of the algorithm
κ: rate at which old values of the topic parameters are forgotten
Good values depend on the dataset
Usually, we set τ_0 = 1.0, κ = 0.7
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 113 / 121
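As a concrete rendering of the learning-rate schedule and the λ update, here is a minimal NumPy sketch; the array shapes and the function names are my own assumptions for illustration, with ϕ, ζ, and the word indicators taken as given arrays.

```python
import numpy as np

def learning_rate(e, tau0=1.0, kappa=0.7):
    """rho_e = (tau0 + e)^(-kappa), with tau0 > 0 and kappa in (0.5, 1]."""
    return (tau0 + e) ** (-kappa)

def update_lambda(lam, eta, D, w, phi, zeta, rho):
    """One stochastic update of lambda from a single document.

    lam:  (K, V) topic-word parameters
    eta:  (V,)   Dirichlet prior on words
    w:    (N, V) one-hot word indicators w^v_dn
    phi:  (N, T) word-to-table assignments phi_dnt
    zeta: (T, K) table-to-topic assignments zeta_dtk
    """
    # Sum over n and t of w^v_dn * phi_dnt * zeta_dtk, giving a (K, V) matrix.
    stats = np.einsum('nv,nt,tk->kv', w, phi, zeta)
    return (1.0 - rho) * lam + rho * (eta[None, :] + D * stats)
```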
181. Mini-batch size
When the mini-batch size is large, distributed online HDP runs faster
Perplexity remains similar across mini-batch sizes
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 114 / 121
182. Summary
Bayesian Nonparametric Topic Modeling
Hierarchical Dirichlet Processes
Chinese Restaurant Franchise
Stick Breaking Construction
Posterior Inference for HDP
Gibbs Sampling
Variational Inference
Online Learning
Slides and other materials are available at http://uilab.kaist.ac.kr/members/jinyeongbak
Implementations are available at http://github.com/NoSyu/Topic_Models
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 115 / 121
183. Further Reading
Dirichlet Process
Dirichlet Process
Dirichlet distribution and Dirichlet Process + Indian Buffet Process
Bayesian Nonparametric model
Machine Learning Summer School - Yee Whye Teh
Machine Learning Summer School - Peter Orbanz
Introductory article
Inference
MCMC
Variational Inference
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 116 / 121
184. Thank You!
JinYeong Bak
jy.bak@kaist.ac.kr, linkedin.com/in/jybak
Users & Information Lab, KAIST
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 117 / 121
185. References I
Charles E. Antoniak, Mixtures of Dirichlet processes with applications to
Bayesian nonparametric problems, The Annals of Statistics (1974),
1152–1174.
Bela A. Frigyik, Amol Kapila, and Maya R. Gupta, Introduction to the
Dirichlet distribution and related processes, Tech. Report
UWEETR-2010-0006, Department of Electrical Engineering, University of
Washington, Seattle, WA 98195, December 2010.
Christopher M. Bishop and Nasser M. Nasrabadi, Pattern recognition and
machine learning, vol. 1, Springer New York, 2006.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan, Latent Dirichlet
allocation, Journal of Machine Learning Research 3 (2003), 993–1022.
Emily B. Fox, Erik B. Sudderth, Michael I. Jordan, and Alan S. Willsky, An
HDP-HMM for systems with state persistence, Proceedings of the 25th
International Conference on Machine Learning, ACM, 2008, pp. 312–319.
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 118 / 121
186. References II
Peter D. Hoff, A first course in Bayesian statistical methods, Springer, 2009.
Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and
Lawrence K. Saul, An introduction to variational methods for graphical
models, Springer, 1998.
Yohan Jo and Alice H. Oh, Aspect and sentiment unification model for
online review analysis, Proceedings of the Fourth ACM International
Conference on Web Search and Data Mining (New York, NY, USA), WSDM
'11, ACM, 2011, pp. 815–824.
Radford M. Neal, Markov chain sampling methods for Dirichlet process
mixture models, Journal of Computational and Graphical Statistics 9
(2000), no. 2, 249–265.
Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei,
Hierarchical Dirichlet processes, Journal of the American Statistical
Association 101 (2006), no. 476.
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 119 / 121
187. References III
Chong Wang, John W. Paisley, and David M. Blei, Online variational
inference for the hierarchical Dirichlet process, International Conference
on Artificial Intelligence and Statistics, 2011, pp. 752–760.
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 120 / 121
189. Measurable space (Ω,B)
Def) A set considered together with a σ-algebra on the set
(http://mathworld.wolfram.com/MeasurableSpace.html)
Ω: the set of all outcomes, the sample space
B: a σ-algebra over Ω
A special kind of collection of subsets of the sample space Ω
Closed under complementation
If A ∈ B, then its complement A^C is also in B
Closed under countable unions and intersections
If A, B ∈ B, then A ∪ B and A ∩ B are also in B
A collection of events
Properties
Smallest possible σ-algebra: {∅, Ω}
Largest possible σ-algebra: the power set of Ω
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 122 / 121
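A tiny worked example, written in LaTeX and assuming Ω = {1, 2, 3} (my own illustration): the smallest σ-algebra, the one generated by {1}, and the largest. Each collection below is closed under complementation and countable unions.

```latex
% Three sigma-algebras on \Omega = \{1,2,3\}.
\begin{align*}
\mathcal{B}_{\min} &= \{\emptyset, \Omega\} \\
\sigma(\{1\})      &= \{\emptyset, \{1\}, \{2,3\}, \Omega\} \\
\mathcal{B}_{\max} &= 2^{\Omega}
  = \{\emptyset, \{1\}, \{2\}, \{3\}, \{1,2\}, \{1,3\}, \{2,3\}, \Omega\}
\end{align*}
```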
191. Proof 1
Decimative property
Let (θ_1, θ_2, ..., θ_K) ∼ Dir(α_1, α_2, ..., α_K)
and (τ_1, τ_2) ∼ Dir(α_1β_1, α_1β_2) where β_1 + β_2 = 1,
then (θ_1τ_1, θ_1τ_2, θ_2, ..., θ_K) ∼ Dir(α_1β_1, α_1β_2, α_2, ..., α_K)
Then
(G({θ_1}), G(A_1), ..., G(A_R)) = (β_1, (1 − β_1)G′(A_1), ..., (1 − β_1)G′(A_R))
∼ Dir(1, α_0G_0(A_1), ..., α_0G_0(A_R))
reduces to
(G′(A_1), ..., G′(A_R)) ∼ Dir(α_0G_0(A_1), ..., α_0G_0(A_R))
G′ ∼ DP(α_0, G_0)
using the decimative property with
α_1 = α_0, θ_1 = (1 − β_1)
β_k = G_0(A_k), τ_k = G′(A_k)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 123 / 121
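For concreteness, here is a small worked instance of the decimative property with K = 2, written in LaTeX (my own illustration, not from the slides).

```latex
% Decimative property with K = 2, splitting the first coordinate:
% if (\theta_1,\theta_2) \sim \mathrm{Dir}(\alpha_1,\alpha_2) and, independently,
% (\tau_1,\tau_2) \sim \mathrm{Dir}(\alpha_1\beta_1,\alpha_1\beta_2) with
% \beta_1+\beta_2=1, then
\[
(\theta_1\tau_1,\; \theta_1\tau_2,\; \theta_2)
  \sim \mathrm{Dir}(\alpha_1\beta_1,\; \alpha_1\beta_2,\; \alpha_2).
\]
% For example, \alpha_1 = \alpha_2 = 1 and \beta_1 = \beta_2 = \tfrac{1}{2}
% give \mathrm{Dir}(\tfrac{1}{2}, \tfrac{1}{2}, 1).
```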