Bayesian Nonparametric Topic Modeling
Hierarchical Dirichlet Processes
JinYeong Bak
Department of Computer Science
KAIST, Daejeon
South Korea
jy.bak@kaist.ac.kr
August 22, 2013
Part of these slides is adapted from a presentation by Yee Whye Teh (y.w.teh@stats.ox.ac.uk).
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 1 / 121
Outline
1 Introduction
Motivation
Topic Modeling
2 Background
Dirichlet Distribution
Dirichlet Processes
3 Hierarchical Dirichlet Processes
Dirichlet Process Mixture Models
Hierarchical Dirichlet Processes
4 Inference
Gibbs Sampling
Variational Inference
Online Learning
Distributed Online Learning
5 Practical Tips
6 Summary
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 2 / 121
Introduction
Bayesian topic models
Latent Dirichlet Allocation (LDA) [BNJ03]
Hierarchical Dirichlet Processes (HDP) [TJBB06]
In this talk,
Dirichlet distribution, Dirichlet process
Concept of Hierarchical Dirichlet Processes (HDP)
How to infer the latent variables in HDP
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 4 / 121
Motivation
What are the topics discussed in the article?
How can we describe the topics?
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 7 / 121
Outline
1 Introduction
Motivation
Topic Modeling
2 Background
Dirichlet Distribution
Dirichlet Processes
3 Hierarchical Dirichlet Processes
Dirichlet Process Mixture Models
Hierarchical Dirichlet Processes
4 Inference
Gibbs Sampling
Variational Inference
Online Learning
Distributed Online Learning
5 Practical Tips
6 Summary
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 8 / 121
Topic Modeling
Each topic has word distribution
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 10 / 121
Topic Modeling
Each document has topic proportion
Each word has its own topic index
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 11 / 121
Latent Dirichlet Allocation
Generative process of LDA
For each topic k ∈ {1,...,K}:
Draw word distributions βk ∼ Dir(η)
For each document d ∈ {1,...,D}:
Draw topic proportions θd ∼ Dir(α)
For each word in a document n ∈ {1,...,N}:
Draw a topic index zdn ∼ Mult(θd)
Generate a word from the chosen topic wdn ∼ Mult(β_{zdn})
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 13 / 121
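A minimal sketch of this generative process, assuming numpy; the sizes K, D, N, V and the hyperparameters α, η below are illustrative, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, N, V = 3, 5, 20, 50      # topics, documents, words per document, vocabulary size
alpha, eta = 0.5, 0.1          # symmetric Dirichlet hyperparameters

beta = rng.dirichlet(np.full(V, eta), size=K)        # beta_k ~ Dir(eta): word distribution per topic
docs = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))       # theta_d ~ Dir(alpha): topic proportions
    z_d = rng.choice(K, size=N, p=theta_d)           # z_dn ~ Mult(theta_d): topic index per word
    w_d = [rng.choice(V, p=beta[z]) for z in z_d]    # w_dn ~ Mult(beta_{z_dn}): observed words
    docs.append(w_d)
```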
Latent Dirichlet Allocation
Our interests
What are the topics discussed in the article?
How can we describe the topics?
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 14 / 121
Latent Dirichlet Allocation
What we can see
Words in documents
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 15 / 121
Latent Dirichlet Allocation
What we want to see
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 16 / 121
Latent Dirichlet Allocation
Our interests
What are the topics discussed in the article?
=> Topic proportion of each document
How can we describe the topics?
=> Word distribution of each topic
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 17 / 121
Latent Dirichlet Allocation
What we can see: w
What we want to see: θ, z, β
∴ Compute p(θ,z,β|w,α,η) = p(θ,z,β,w|α,η) / p(w|α,η)
But this distribution is intractable to compute (because of the normalization term)
So we use approximate methods
Gibbs Sampling
Variational Inference
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 18 / 121
Limitation of Latent Dirichlet Allocation
Latent Dirichlet Allocation is a parametric model
People should assign the number of topics in a corpus
People should find the best number of topics
Q) Can we get it from data automatically?
A) Hierarchical Dirichlet Processes
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 20 / 121
Outline
1 Introduction
Motivation
Topic Modeling
2 Background
Dirichlet Distribution
Dirichlet Processes
3 Hierarchical Dirichlet Processes
Dirichlet Process Mixture Models
Hierarchical Dirichlet Processes
4 Inference
Gibbs Sampling
Variational Inference
Online Learning
Distributed Online Learning
5 Practical Tips
6 Summary
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 21 / 121
Dice modeling
Think about the probability of a number rolled with dice
Each die has its own pmf
According to the textbook, it is widely known to be uniform
=> 1/6 for a 6-sided die
Is it true?
Ans) No!
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 23 / 121
Dice modeling
We should model the randomness of the pmf of each die
How can we do that?
Let's imagine a bag which holds many dice
We cannot see inside the bag
We can draw one die from the bag
OK, but what is the formal description?
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 24 / 121
Standard Simplex
A generalization of the notion of a triangle or tetrahedron
All points are non-negative and sum to 1 [1]
A pmf can be thought of as a point in the standard simplex
Ex) A point p = (x,y,z), where x ≥ 0, y ≥ 0, z ≥ 0 and x + y + z = 1
[1] http://en.wikipedia.org/wiki/Simplex
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 25 / 121
Dirichlet distribution
Definition [BN06]
A probability distribution over the (K−1)-dimensional standard simplex
A distribution over pmfs of length K
Notation
θ ∼ Dir(α)
where θ = [θ1,...,θK] is a random pmf and α = [α1,...,αK]
Probability density function
p(θ;α) = (Γ(∑_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)) ∏_{k=1}^K θk^{αk−1}
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 26 / 121
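A quick numerical check, assuming numpy: Dirichlet draws are pmfs on the simplex, and the concentration α controls how spread out they are.

```python
import numpy as np

rng = np.random.default_rng(1)
for alpha in ([0.1, 0.1, 0.1], [1.0, 1.0, 1.0], [10.0, 10.0, 10.0]):
    theta = rng.dirichlet(alpha, size=5)
    assert np.allclose(theta.sum(axis=1), 1.0) and (theta >= 0).all()  # points on the 2-simplex
    print(alpha, theta.round(3))  # small alpha -> sparse pmfs, large alpha -> near-uniform pmfs
```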
Latent Dirichlet Allocation
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 27 / 121
Property of Dirichlet distribution
Density plots [BAFG10]
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 28 / 121
Property of Dirichlet distribution
Sample pmfs from Dirichlet distribution [BAFG10]
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 29 / 121
Property of Dirichlet distribution
When K = 2, it is the Beta distribution
Conjugate prior for the Multinomial distribution
Likelihood X ∼ Mult(n,θ), Prior θ ∼ Dir(α)
∴ Posterior (θ|X) ∼ Dir(α + n)
Proof)
p(θ|X) = p(X|θ)p(θ) / p(X) ∝ p(X|θ)p(θ)
= (n! / (x1!···xK!)) ∏_{k=1}^K θk^{xk} · (Γ(∑_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)) ∏_{k=1}^K θk^{αk−1}
= C ∏_{k=1}^K θk^{αk+xk−1}
= Dir(α + n)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 30 / 121
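A small numerical illustration of this conjugacy, assuming numpy and made-up face counts: the posterior is Dir(α + n), so its mean is (α + n) normalized.

```python
import numpy as np

alpha = np.ones(6)                    # prior Dir(alpha) over a 6-sided die
n = np.array([3, 1, 0, 2, 7, 5])      # observed face counts, X ~ Mult(n, theta)
posterior = alpha + n                 # (theta | X) ~ Dir(alpha + n)
print(posterior / posterior.sum())    # posterior mean estimate of the die's pmf
```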
Property of Dirichlet distribution
Aggregation property
Let (θ1,θ2,...,θK) ∼ Dir(α1,α2,...,αK),
then (θ1+θ2,θ3,...,θK) ∼ Dir(α1+α2,α3,...,αK)
In general, if {A1,...,AR} is any partition of {1,...,K},
then (∑_{k∈A1} θk,...,∑_{k∈AR} θk) ∼ Dir(∑_{k∈A1} αk,...,∑_{k∈AR} αk)
Decimative property
Let (θ1,θ2,...,θK) ∼ Dir(α1,α2,...,αK)
and (τ1,τ2) ∼ Dir(α1β1,α1β2) where β1+β2 = 1,
then (θ1τ1,θ1τ2,θ2,...,θK) ∼ Dir(α1β1,α1β2,α2,...,αK)
Neutrality property
Let (θ1,θ2,...,θK) ∼ Dir(α1,α2,...,αK),
then θk is independent of the vector (1/(1−θk)) (θ1,...,θk−1,θk+1,...,θK)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 31 / 121
Outline
1 Introduction
Motivation
Topic Modeling
2 Background
Dirichlet Distribution
Dirichlet Processes
3 Hierarchical Dirichlet Processes
Dirichlet Process Mixture Models
Hierarchical Dirichlet Processes
4 Inference
Gibbs Sampling
Variational Inference
Online Learning
Distributed Online Learning
5 Practical Tips
6 Summary
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 32 / 121
Dice modeling
Think about the probability of a number rolled with dice
Each die has its own pmf
Draw a die from a bag
Problem) We do not know the number of faces of the dice in the bag
Solution) Dirichlet process
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 33 / 121
Dirichlet Process
Definition [BAFG10]
A distribution over probability measures
A distribution whose realizations are distributions over a sample space
Formal definition
(Ω,B) is a measurable space
G0 is a distribution over the sample space Ω
α0 is a positive real number
G is a random probability measure over (Ω,B)
G ∼ DP(α0,G0)
if for any finite measurable partition (A1,...,AR) of Ω
(G(A1),...,G(AR)) ∼ Dir(α0G0(A1),...,α0G0(AR))
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 34 / 121
Posterior Dirichlet Processes
G ∼ DP(α0,G0) can be treated as a random distribution over Ω
We can draw a sample θ1 from G
We can also make a finite partition (A1,...,AR) of Ω,
then p(θ1 ∈ Ar|G) = G(Ar), p(θ1 ∈ Ar) = G0(Ar)
(G(A1),...,G(AR)) ∼ Dir(α0G0(A1),...,α0G0(AR))
Using Dirichlet-multinomial conjugacy, the posterior is
(G(A1),...,G(AR))|θ1 ∼ Dir(α0G0(A1)+δθ1(A1),...,α0G0(AR)+δθ1(AR))
where δθ(Ar) = 1 if θ ∈ Ar and 0 otherwise
This holds for every finite partition of Ω
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 35 / 121
Posterior Dirichlet Processes
For every finite partition of Ω,
(G(A1),...,G(AR))|θ1 ∼ Dir(α0G0(A1)+δθ1(A1),...,α0G0(AR)+δθ1(AR))
where δθ1(Ar) = 1 if θ1 ∈ Ar and 0 otherwise
The posterior process is also a Dirichlet process:
G|θ1 ∼ DP(α0+1, (α0G0+δθ1)/(α0+1))
Summary)
θ1|G ∼ G, G ∼ DP(α0,G0) ⇐⇒ θ1 ∼ G0, G|θ1 ∼ DP(α0+1, (α0G0+δθ1)/(α0+1))
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 36 / 121
Blackwell-MacQueen Urn Scheme
Now we draw samples θ1,...,θN
First sample
θ1|G ∼ G, G ∼ DP(α0,G0) ⇐⇒ θ1 ∼ G0, G|θ1 ∼ DP(α0+1, (α0G0+δθ1)/(α0+1))
Second sample
θ2|θ1,G ∼ G, G|θ1 ∼ DP(α0+1, (α0G0+δθ1)/(α0+1))
⇐⇒ θ2|θ1 ∼ (α0G0+δθ1)/(α0+1), G|θ1,θ2 ∼ DP(α0+2, (α0G0+δθ1+δθ2)/(α0+2))
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 37 / 121
Blackwell-MacQueen Urn Scheme
Nth sample
θN|θ1,...,N−1, G ∼ G, G|θ1,...,N−1 ∼ DP(α0+N−1, (α0G0+∑_{n=1}^{N−1} δθn)/(α0+N−1))
⇐⇒ θN|θ1,...,N−1 ∼ (α0G0+∑_{n=1}^{N−1} δθn)/(α0+N−1), G|θ1,...,N ∼ DP(α0+N, (α0G0+∑_{n=1}^{N} δθn)/(α0+N))
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 38 / 121
Blackwell-MacQueen Urn Scheme
The Blackwell-MacQueen urn scheme produces a sequence θ1,θ2,... with the following conditionals
θN|θ1,...,N−1 ∼ (α0G0+∑_{n=1}^{N−1} δθn)/(α0+N−1)
Polya urn analogy
Infinite number of ball colors
Empty urn
Filling the Polya urn (n starts at 1)
With probability proportional to α0, pick a new color from the set of infinite ball colors according to G0, paint a new ball that color and add it to the urn
With probability proportional to n−1, pick a ball from the urn, record its color, and put it back into the urn with another ball of the same color
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 39 / 121
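A sketch of the urn scheme above, assuming numpy, with G0 taken to be a standard normal purely for illustration.

```python
import numpy as np

def blackwell_macqueen(N, alpha0, rng):
    thetas = []
    for n in range(N):
        # with prob alpha0/(alpha0 + n) draw a new value from G0, else copy an earlier theta
        if rng.random() < alpha0 / (alpha0 + n):
            thetas.append(rng.normal())              # new "color" drawn from G0
        else:
            thetas.append(thetas[rng.integers(n)])   # reuse a previous draw, chosen uniformly
    return thetas

rng = np.random.default_rng(2)
print(len(set(blackwell_macqueen(100, alpha0=2.0, rng=rng))))  # number of distinct values
```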
Chinese Restaurant Process
Draw θ1,θ2,...,θN from the Blackwell-MacQueen Urn Scheme
With probability proportional to α0, pick a new color from the set of infinite ball colors according to G0, paint a new ball that color and add it to the urn
With probability proportional to n−1, pick a ball from the urn, record its color, and put it back into the urn with another ball of the same color
The θs can take the same value, θi = θj
There are K < N distinct values, φ1,...,φK
This induces a partition of Ω
θ1,θ2,...,θN induce φ1,...,φK
The distribution over partitions is called the Chinese Restaurant Process (CRP)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 40 / 121
Chinese Restaurant Process
θ1,θ2,...,θN induce φ1,...,φK
Chinese Restaurant Process interpretation
There is a Chinese restaurant with an infinite number of tables
Each customer sits at a table
Generating from the Chinese Restaurant Process
The first customer sits at the first table
The n-th customer sits at
A new table with probability α0/(α0+n−1)
Table k with probability nk/(α0+n−1), where nk is the number of customers at table k
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 41 / 121
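A sketch of sampling table assignments from the CRP, assuming numpy; α0 and the number of customers are illustrative.

```python
import numpy as np

def crp(N, alpha0, rng):
    tables = []                   # tables[k] = number of customers at table k
    seating = []
    for n in range(N):            # n customers are already seated
        probs = np.array(tables + [alpha0], dtype=float) / (alpha0 + n)
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(0)      # open a new table
        tables[k] += 1
        seating.append(k)
    return seating, tables

rng = np.random.default_rng(3)
seats, counts = crp(50, alpha0=1.0, rng=rng)
print(len(counts), counts)        # number of occupied tables and their sizes
```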
Chinese Restaurant Process
The CRP exhibits the clustering property of DP
Tables are clusters, φk ∼ G0
Customers are the actual realizations, θn = φzn where zn ∈ {1,...,K}
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 42 / 121
Stick Breaking Construction
The Blackwell-MacQueen Urn Scheme / CRP generates θ ∼ G, not G itself
To construct G, we use the Stick Breaking Construction
Review) Posterior Dirichlet Processes
θ1|G ∼ G, G ∼ DP(α0,G0) ⇐⇒ θ1 ∼ G0, G|θ1 ∼ DP(α0+1, (α0G0+δθ1)/(α0+1))
Consider the partition ({θ1}, Ω\{θ1}) of Ω. Then
(G(θ1), G(Ω\{θ1})) ∼ Dir((α0+1) [(α0G0+δθ1)/(α0+1)](θ1), (α0+1) [(α0G0+δθ1)/(α0+1)](Ω\{θ1}))
= Dir(1, α0) = Beta(1, α0)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 43 / 121
Stick Breaking Construction
Consider the partition ({θ1}, Ω\{θ1}) of Ω. Then
(G(θ1), G(Ω\{θ1})) = (β1, 1−β1) ∼ Beta(1,α0)
G has a point mass located at θ1:
G = β1 δθ1 + (1−β1) G′, β1 ∼ Beta(1,α0)
where G′ is the probability measure with the point mass θ1 removed
What is G′?
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 44 / 121
Stick Breaking Construction
Summary) Posterior Dirichlet Processes
θ1|G ∼ G, G ∼ DP(α0,G0) ⇐⇒ θ1 ∼ G0, G|θ1 ∼ DP(α0+1, (α0G0+δθ1)/(α0+1))
G = β1 δθ1 + (1−β1) G′, β1 ∼ Beta(1,α0)
Consider a further partition ({θ1},A1,...,AR) of Ω:
(G(θ1),G(A1),...,G(AR)) = (β1,(1−β1)G′(A1),...,(1−β1)G′(AR)) ∼ Dir(1,α0G0(A1),...,α0G0(AR))
Using the decimative property of the Dirichlet distribution (proof),
(G′(A1),...,G′(AR)) ∼ Dir(α0G0(A1),...,α0G0(AR))
∴ G′ ∼ DP(α0,G0)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 45 / 121
Stick Breaking Construction
Repeat this with the distinct values φ1,φ2,...
G ∼ DP(α0,G0)
G = β1 δφ1 + (1−β1) G1
G = β1 δφ1 + (1−β1)(β2 δφ2 + (1−β2) G2)
...
G = ∑_{k=1}^∞ πk δφk
where
πk = βk ∏_{i=1}^{k−1} (1−βi), ∑_{k=1}^∞ πk = 1, βk ∼ Beta(1,α0), φk ∼ G0
Draws from the DP look like a sum of point masses, with masses drawn from a stick-breaking construction.
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 46 / 121
Stick Breaking Construction
Summary)
G = ∑_{k=1}^∞ πk δφk
πk = βk ∏_{i=1}^{k−1} (1−βi), ∑_{k=1}^∞ πk = 1, βk ∼ Beta(1,α0), φk ∼ G0
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 47 / 121
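A sketch of the (truncated) stick-breaking construction, assuming numpy, with G0 a standard normal as an illustrative base distribution.

```python
import numpy as np

def stick_breaking(alpha0, K_max, rng):
    beta = rng.beta(1.0, alpha0, size=K_max)                       # beta_k ~ Beta(1, alpha0)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))
    pi = beta * remaining                                          # pi_k = beta_k * prod_{i<k} (1 - beta_i)
    phi = rng.normal(size=K_max)                                   # atoms phi_k ~ G0
    return pi, phi

rng = np.random.default_rng(4)
pi, phi = stick_breaking(alpha0=2.0, K_max=1000, rng=rng)
print(pi.sum())            # close to 1 for a large enough truncation
print((pi > 1e-3).sum())   # only a handful of atoms carry noticeable mass
```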
Summary of DP
Definition
G is a random probability measure over (Ω,B)
G ∼ DP(α0,G0)
if for any finite measurable partition (A1,...,Ar ) of Ω
(G(A1),...,G(Ar )) ∼ Dir(α0G0(A1),...,α0G0(Ar ))
Chinese Restaurant Process
Stick Breaking Construction
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 48 / 121
Outline
1 Introduction
Motivation
Topic Modeling
2 Background
Dirichlet Distribution
Dirichlet Processes
3 Hierarchical Dirichlet Processes
Dirichlet Process Mixture Models
Hierarchical Dirichlet Processes
4 Inference
Gibbs Sampling
Variational Inference
Online Learning
Distributed Online Learning
5 Practical Tips
6 Summary
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 49 / 121
Dirichlet Process Mixture Models
We model a data set x1,...,xN using the following
model [Nea00]
xn ∼ F(θn)
θn ∼ G
G ∼ DP(α0,G0)
Each θn is a latent parameter modelling xn, while
G is the unknown distribution over parameters
modelled using a DP
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 50 / 121
Dirichlet Process Mixture Models
Since G is of the form
G = ∑_{k=1}^∞ πk δφk
we have θn = φk with probability πk
Let kn take on value k with probability πk. We can equivalently define θn = φ_{kn}
An equivalent model:
xn ∼ F(θn), θn ∼ G, G ∼ DP(α0,G0)
⇐⇒
xn ∼ F(φ_{kn}), p(kn = k) = πk, πk = βk ∏_{i=1}^{k−1}(1−βi), βk ∼ Beta(1,α0), φk ∼ G0
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 51 / 121
Dirichlet Process Mixture Models
xn ∼ F(θn), θn ∼ G, G ∼ DP(α0,G0)
⇐⇒
xn ∼ F(φ_{kn}), p(kn = k) = πk, πk = βk ∏_{i=1}^{k−1}(1−βi), βk ∼ Beta(1,α0), φk ∼ G0
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 52 / 121
Outline
1 Introduction
Motivation
Topic Modeling
2 Background
Dirichlet Distribution
Dirichlet Processes
3 Hierarchical Dirichlet Processes
Dirichlet Process Mixture Models
Hierarchical Dirichlet Processes
4 Inference
Gibbs Sampling
Variational Inference
Online Learning
Distributed Online Learning
5 Practical Tips
6 Summary
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 53 / 121
Topic modeling with documents
Each document consists of bags of words
Each word in a document has latent topic index
Latent topics for words in a document can be grouped
Each document has topic proportion
Each topic has word distribution
Topics must be shared across documents
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 54 / 121
Problem of Naive Dirichlet Process Mixture Model
Use a DP mixture for each document
xdn ∼ F(θdn), θdn ∼ Gd, Gd ∼ DP(α0,G0)
But there is no sharing of clusters across different groups because G0 is smooth
G1 = ∑_{k=1}^∞ π1k δφ1k, G2 = ∑_{k=1}^∞ π2k δφ2k, φ1k, φ2k ∼ G0
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 55 / 121
Problem of Naive Dirichlet Process Mixture Model
Solution
Make the base distribution G0 discrete
Put a DP prior on the common base distribution
Hierarchical Dirichlet Process
G0 ∼ DP(γ,H)
G1,G2|G0 ∼ DP(α0,G0)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 56 / 121
Hierarchical Dirichlet Processes
Making G0 discrete forces shared cluster between G1 and G2
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 57 / 121
Stick Breaking Construction
A Hierarchical Dirichlet Process with documents 1,...,D
G0 ∼ DP(γ,H)
Gd|G0 ∼ DP(α0,G0)
The stick-breaking construction for the HDP
G0 = ∑_{k=1}^∞ βk δφk, φk ∼ H
βk = β′k ∏_{i=1}^{k−1}(1−β′i), β′k ∼ Beta(1,γ)
Gd = ∑_{k=1}^∞ πdk δφk
πdk = π′dk ∏_{i=1}^{k−1}(1−π′di), π′dk ∼ Beta(α0βk, α0(1−∑_{i=1}^k βi))
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 58 / 121
Chinese Restaurant Franchise
G0 ∼ DP(γ,H), φk ∼ H
Gd|G0 ∼ DP(α0,G0), θdn ∼ Gd
For document d: draw θd1,θd2,... from a Blackwell-MacQueen Urn Scheme; θd1,θd2,... induce φd1,φd2,...
For document d′: draw θd′1,θd′2,... from a Blackwell-MacQueen Urn Scheme; θd′1,θd′2,... induce φd′1,φd′2,...
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 61 / 121
Chinese Restaurant Franchise
Chinese Restaurant Franchise interpretation
Each restaurant has an infinite number of tables
All restaurants share the food menu
Each customer sits at a table
Generating from the Chinese Restaurant Franchise
For each restaurant:
The first customer sits at the first table and chooses a new menu item
The n-th customer sits at
A new table with probability α0/(α0+n−1)
Table t with probability ndt/(α0+n−1), where ndt is the number of customers at table t
A customer who opens a new table chooses
A new menu item with probability γ/(γ+m−1)
Existing menu item k with probability mk/(γ+m−1)
where m is the number of tables in all restaurants and mk is the number of tables serving menu item k in all restaurants
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 62 / 121
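A sketch of the seating process above, assuming numpy; it returns the dish (topic) index of every word and the number of shared dishes that emerge. All hyperparameters and sizes are illustrative, and off-by-one conventions are simplified.

```python
import numpy as np

def crf(doc_lengths, alpha0, gamma, rng):
    dish_tables = []                       # dish_tables[k] = number of tables serving dish k (m_k)
    all_dishes = []
    for N in doc_lengths:                  # one restaurant (document) at a time
        table_counts, table_dish, dishes = [], [], []
        for n in range(N):
            probs = np.array(table_counts + [alpha0], dtype=float) / (alpha0 + n)
            t = rng.choice(len(probs), p=probs)
            if t == len(table_counts):     # new table: choose its dish from the shared menu
                m = sum(dish_tables)
                dprobs = np.array(dish_tables + [gamma], dtype=float) / (gamma + m)
                k = rng.choice(len(dprobs), p=dprobs)
                if k == len(dish_tables):
                    dish_tables.append(0)  # brand-new dish (topic)
                dish_tables[k] += 1
                table_counts.append(0)
                table_dish.append(k)
            table_counts[t] += 1
            dishes.append(table_dish[t])
        all_dishes.append(dishes)
    return all_dishes, len(dish_tables)

rng = np.random.default_rng(5)
topics, K = crf([100] * 20, alpha0=1.0, gamma=1.5, rng=rng)
print(K)   # number of topics shared across the 20 documents
```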
Chinese Restaurant Franchise
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 63 / 121
HDP for Topic modeling
Questions
What can we assume about the topics in a document?
What can we assume about the words in the topics?
Solution
Each document consists of bags of words
Each word in a document has latent topic
Latent topics for words in a document can be grouped
Each document has topic proportion
Each topic has word distribution
Topics must be shared across documents
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 64 / 121
Outline
1 Introduction
Motivation
Topic Modeling
2 Background
Dirichlet Distribution
Dirichlet Processes
3 Hierarchical Dirichlet Processes
Dirichlet Process Mixture Models
Hierarchical Dirichlet Processes
4 Inference
Gibbs Sampling
Variational Inference
Online Learning
Distributed Online Learning
5 Practical Tips
6 Summary
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 65 / 121
Gibbs Sampling
Definition
A special case of Markov-chain Monte Carlo (MCMC) method
An iterative algorithm that constructs a dependent sequence of parameter
values whose distribution converges to the target joint posterior
distribution [Hof09]
Algorithm
Find full conditional distribution of latent variables of target distribution
Initialize all latent variables
Sampling until converged
Sample one latent variable from full conditional distribution
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 66 / 121
Collapsed Gibbs sampling
A collapsed Gibbs sampler integrates out one or more variables when
sampling some other variable.
Example)
There are three latent variables A,B and C.
Sampling p(A|B,C), p(B|A,C) and p(C|A,B) sequentially
But when we integrate out B,
Sampling only p(A|C), p(C|A) sequentially
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 67 / 121
Review) Dirichlet Process Mixture Models
xn ∼ F(θn), θn ∼ G, G ∼ DP(α0,G0)
⇐⇒
xn ∼ F(φ_{kn}), p(kn = k) = πk, πk = βk ∏_{i=1}^{k−1}(1−βi), βk ∼ Beta(1,α0), φk ∼ G0
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 68 / 121
Review) Blackwell-MacQueen Urn Scheme for DP
Nth sample
θN|θ1,...,N−1, G ∼ G, G|θ1,...,N−1 ∼ DP(α0+N−1, (α0G0+∑_{n=1}^{N−1} δθn)/(α0+N−1))
⇐⇒ θN|θ1,...,N−1 ∼ (α0G0+∑_{n=1}^{N−1} δθn)/(α0+N−1), G|θ1,...,N ∼ DP(α0+N, (α0G0+∑_{n=1}^{N} δθn)/(α0+N))
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 69 / 121
Review) Chinese Restaurant Franchise
Generating from the Chinese Restaurant Franchise
For each restaurant:
The first customer sits at the first table and chooses a new menu item
The n-th customer sits at
A new table with probability α0/(α0+n−1)
Table t with probability ndt/(α0+n−1), where ndt is the number of customers at table t
A customer who opens a new table chooses
A new menu item with probability γ/(γ+m−1)
Existing menu item k with probability mk/(γ+m−1)
where m is the number of tables in all restaurants and mk is the number of tables serving menu item k in all restaurants
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 70 / 121
Alternative form of HDP
G0 ∼ DP(γ,H), φdt ∼ G0
∴ G0|φdt,... ∼ DP(γ+m, (γH + ∑_{k=1}^K mk δφk)/(γ+m))
Then G0 is given as
G0 = ∑_{k=1}^K πk δφk + πu Gu
where
Gu ∼ DP(γ,H)
π = (π1,...,πK,πu) ∼ Dir(m1,...,mK,γ)
p(φk|·) ∝ h(φk) ∏_{dn: zdn=k} f(xdn|φk)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 71 / 121
Hierarchical Dirichlet Processes
xdn ∼ F(θdn), θdn ∼ Gd, Gd ∼ DP(α0,G0), G0 ∼ DP(γ,H)
⇐⇒
xdn ∼ Mult(φ_{zdn}), zdn ∼ Mult(θd), φk ∼ Dir(η), θd ∼ Dir(α0π), π ∼ Dir(m.1,...,m.K,γ)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 72 / 121
Gibbs Sampling for HDP
Joint distribution
p(θ,z,φ,x,π,m|α0,η,γ) = p(π|m,γ) ∏_{k=1}^K p(φk|η) ∏_{d=1}^D [ p(θd|α0,π) ∏_{n=1}^N p(zdn|θd) p(xdn|zdn,φ) ]
Integrate out θ, φ:
p(z,x,π,m|α0,η,γ) = [Γ(∑_{k=1}^K m.k + γ) / (∏_{k=1}^K Γ(m.k) Γ(γ))] ∏_{k=1}^K πk^{m.k−1} · π_{K+1}^{γ−1}
· ∏_{k=1}^K [Γ(∑_{v=1}^V ηv) / ∏_{v=1}^V Γ(ηv)] [∏_{v=1}^V Γ(ηv + n^k_{(·),v}) / Γ(∑_{v=1}^V (ηv + n^k_{(·),v}))]
· ∏_{d=1}^D [Γ(∑_{k=1}^K α0πk) / ∏_{k=1}^K Γ(α0πk)] [∏_{k=1}^K Γ(α0πk + n^k_{d,(·)}) / Γ(∑_{k=1}^K (α0πk + n^k_{d,(·)}))]
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 73 / 121
Gibbs Sampling for HDP
Full conditional distribution of z
p(z_{d′n′} = k′ | z^{−(d′n′)}, m, π, x, ·) = p(z_{d′n′} = k′, z^{−(d′n′)}, m, π, x | ·) / p(z^{−(d′n′)}, m, π, x | ·)
∝ p(z_{d′n′} = k′, z^{−(d′n′)}, m, π, x | ·)
∝ (α0πk′ + n^{k′,−(d′n′)}_{d′,(·)}) · (ηv + n^{k′,−(d′n′)}_{(·),v}) / (∑_{v=1}^V (ηv + n^{k′,−(d′n′)}_{(·),v}))
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 74 / 121
Gibbs Sampling for HDP
Full conditional distribution of m
The probability that word x_{d′n′} is assigned to some table t such that φdt = φk:
p(θ_{d′n′} = φt | φdt = φk, θ^{−(d′n′)}, π) ∝ n^{(·),−(d′n′)}_{d,(·),t}
p(θ_{d′n′} = new table | φ_{d,tnew} = φk, θ^{−(d′n′)}, π) ∝ α0πk
These equations form a Dirichlet process with concentration parameter α0πk and assignments of n^{(·),−(d′n′)}_{d,(·),t} customers to components
The corresponding distribution over the number of components is the desired conditional distribution of mdk
Antoniak [Ant74] has shown that
p(m_{d′k′} = m | z, m^{−(d′k′)}, π) = [Γ(α0πk) / Γ(α0πk + n^k_{d,(·),(·)})] s(n^k_{d,(·),(·)}, m) (α0πk)^m
where s(n,m) is the unsigned Stirling number of the first kind
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 75 / 121
Gibbs Sampling for HDP
Full conditional distribution of π
(π1,π2,...,πK ,πu)|· ∼ Dir(m.1,m.2,...,m.K ,γ)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 76 / 121
Gibbs Sampling for HDP
Algorithm 1 Gibbs Sampling for HDP
1: Initialize all latent variables at random
2: repeat
3: for each document d do
4: for each word n in document d do
5: Sample z(d,n) from the multinomial with probabilities ∝ (α0πk + n^{k,−(d,n)}_{d,(·)}) (ηv + n^{k,−(d,n)}_{(·),v}) / (∑_{v=1}^V (ηv + n^{k,−(d,n)}_{(·),v}))
6: end for
7: Sample mdk with probability ∝ [Γ(α0πk) / Γ(α0πk + n^k_{d,(·),(·)})] s(n^k_{d,(·),(·)}, m) (α0πk)^m
8: Sample π ∼ Dir(m.1,m.2,...,m.K,γ)
9: end for
10: until converged
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 77 / 121
Outline
1 Introduction
Motivation
Topic Modeling
2 Background
Dirichlet Distribution
Dirichlet Processes
3 Hierarchical Dirichlet Processes
Dirichlet Process Mixture Models
Hierarchical Dirichlet Processes
4 Inference
Gibbs Sampling
Variational Inference
Online Learning
Distributed Online Learning
5 Practical Tips
6 Summary
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 78 / 121
Stick Breaking Construction
A Hierarchical Dirichlet Process with documents 1,...,D
G0 ∼ DP(γ,H)
Gd|G0 ∼ DP(α0,G0)
The stick-breaking construction for the HDP
G0 = ∑_{k=1}^∞ βk δφk, φk ∼ H
βk = β′k ∏_{i=1}^{k−1}(1−β′i), β′k ∼ Beta(1,γ)
Gd = ∑_{k=1}^∞ πdk δφk
πdk = π′dk ∏_{i=1}^{k−1}(1−π′di), π′dk ∼ Beta(α0βk, α0(1−∑_{i=1}^k βi))
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 79 / 121
Alternative Stick Breaking Construction
Problem)
In the original Stick Breaking Construction, the weights βk and πdk are tightly correlated:
βk = β′k ∏_{i=1}^{k−1}(1−β′i), β′k ∼ Beta(1,γ)
πdk = π′dk ∏_{i=1}^{k−1}(1−π′di), π′dk ∼ Beta(α0βk, α0(1−∑_{i=1}^k βi))
Alternative Stick Breaking Construction for each document [FSJW08]
ψdt ∼ G0
πdt = π′dt ∏_{i=1}^{t−1}(1−π′di), π′dt ∼ Beta(1,α0)
Gd = ∑_{t=1}^∞ πdt δψdt
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 80 / 121
Alternative Stick Breaking Construction
The stick-breaking construction for the HDP
G0 = ∑_{k=1}^∞ βk δφk, φk ∼ H
βk = β′k ∏_{i=1}^{k−1}(1−β′i), β′k ∼ Beta(1,γ)
Gd = ∑_{t=1}^∞ πdt δψdt, ψdt ∼ G0
πdt = π′dt ∏_{i=1}^{t−1}(1−π′di), π′dt ∼ Beta(1,α0)
To connect ψdt and φk,
we add an auxiliary variable cdt ∼ Mult(β)
Then ψdt = φ_{cdt}
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 82 / 121
Alternative Stick Breaking Construction
Generative process
1 For each global-level topic k ∈ {1,...,∞}:
1 Draw topic word proportions, φk ∼ Dir(η)
2 Draw a corpus breaking proportion, β′k ∼ Beta(1,γ)
2 For each document d ∈ {1,...,D}:
1 For each document-level topic t ∈ {1,...,∞}:
1 Draw document-level topic indices, cdt ∼ Mult(σ(β′))
2 Draw a document breaking proportion, π′dt ∼ Beta(1,α0)
2 For each word n ∈ {1,...,N}:
1 Draw a topic index zdn ∼ Mult(σ(π′d))
2 Generate a word wdn ∼ Mult(φ_{c_{d,zdn}})
where σ(β′) ≡ {β1,β2,...}, βk = β′k ∏_{i=1}^{k−1}(1−β′i)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 83 / 121
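A truncated sketch of this generative process, assuming numpy; the corpus truncation K, document truncation T, and all hyperparameters are illustrative, and the stick weights are renormalized to handle the truncation.

```python
import numpy as np

rng = np.random.default_rng(6)
K, T, D, N, V = 20, 10, 5, 50, 100
gamma, alpha0, eta = 1.0, 1.0, 0.1

def sbp(b):                                # sigma(b): stick-breaking weights from proportions b
    w = b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))
    return w / w.sum()                     # renormalize because of the truncation

phi = rng.dirichlet(np.full(V, eta), size=K)     # phi_k ~ Dir(eta)
beta = sbp(rng.beta(1.0, gamma, size=K))         # corpus-level weights sigma(beta')

docs = []
for d in range(D):
    c_d = rng.choice(K, size=T, p=beta)          # c_dt ~ Mult(sigma(beta'))
    pi_d = sbp(rng.beta(1.0, alpha0, size=T))    # document-level weights sigma(pi'_d)
    z_d = rng.choice(T, size=N, p=pi_d)          # z_dn ~ Mult(sigma(pi'_d))
    w_d = [rng.choice(V, p=phi[c_d[z]]) for z in z_d]  # w_dn ~ Mult(phi_{c_{d,z_dn}})
    docs.append(w_d)
```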
Variational Inference
Main idea [JGJS98]
Modify the original graphical model into a simpler model
Minimize the dissimilarity between the original and the modified one
More Formally
Observed data X, latent variable Z
We want to compute p(Z|X)
Make q(Z)
Minimize the dissimilarity between p and q [2]
[2] Commonly the KL-divergence of p from q, DKL(q||p)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 84 / 121
KL-divergence of p from q
Find a lower bound of the log evidence logp(X)
logp(X) = log ∑_{Z} p(Z,X) = log ∑_{Z} p(Z,X) q(Z|X)/q(Z|X)
= log ∑_{Z} q(Z|X) [p(Z,X)/q(Z|X)]
≥ ∑_{Z} q(Z|X) log [p(Z,X)/q(Z|X)] [3]
Gap between this lower bound and logp(X):
logp(X) − ∑_{Z} q(Z|X) log [p(Z,X)/q(Z|X)] = ∑_{Z} q(Z) log [q(Z)/p(Z|X)] = DKL(q||p)
[3] Using Jensen's inequality
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 85 / 121
KL-divergence of p from q
logp(X) = ∑_{Z} q(Z|X) log [p(Z,X)/q(Z|X)] + DKL(q||p)
The log evidence logp(X) is fixed with respect to q
Minimizing DKL(q||p) ≡ maximizing the lower bound of logp(X)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 86 / 121
Variational Inference
Main idea [JGJS98]
Modify the original graphical model into a simpler model
Minimize the dissimilarity between the original and the modified one
More Formally
Observed data X, latent variable Z
We want to compute p(Z|X)
Make q(Z)
Minimize the dissimilarity between p and q [4]
Find a lower bound of logp(X)
Maximize it
[4] Commonly the KL-divergence of p from q, DKL(q||p)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 87 / 121
Variational Inference for HDP
q(β,φ,π,c,z) = ∏_{k=1}^K q(φk|λk) ∏_{k=1}^{K−1} q(βk|a1_k, a2_k) ∏_{d=1}^D [ ∏_{t=1}^T q(cdt|ζdt) ∏_{t=1}^{T−1} q(πdt|γ1_dt, γ2_dt) ∏_{n=1}^N q(zdn|ϕdn) ]
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 88 / 121
Variational Inference for HDP
Find a lower bound of logp(w|α0,γ,η)
lnp(w|α0,γ,η)
= ln ∫_β ∫_φ ∫_π ∑_c ∑_z p(w,β,φ,π,c,z|α0,γ,η) dβ dφ dπ
= ln ∫_β ∫_φ ∫_π ∑_c ∑_z p(w,β,φ,π,c,z|α0,γ,η) · q(β,φ,π,c,z)/q(β,φ,π,c,z) dβ dφ dπ
≥ ∫_β ∫_φ ∫_π ∑_c ∑_z ln [p(w,β,φ,π,c,z|α0,γ,η)/q(β,φ,π,c,z)] · q(β,φ,π,c,z) dβ dφ dπ (Jensen's inequality)
= ∫_β ∫_φ ∫_π ∑_c ∑_z lnp(w,β,φ,π,c,z|α0,γ,η) · q(β,φ,π,c,z) dβ dφ dπ − ∫_β ∫_φ ∫_π ∑_c ∑_z lnq(β,φ,π,c,z) · q(β,φ,π,c,z) dβ dφ dπ
= Eq[lnp(w,β,φ,π,c,z|α0,γ,η)] − Eq[lnq(β,φ,π,c,z)]
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 89 / 121
Variational Inference for HDP
lnp(w|α0,γ,η)
≥ Eq[lnp(w,β,φ,π,c,z|α0,γ,η)] − Eq[lnq(β,φ,π,c,z)]
= Eq[ln p(β|γ) p(φ|η) ∏_{d=1}^D p(πd|α0) p(cd|β) ∏_{n=1}^N p(wdn|cd,zdn,φ) p(zdn|πd)]
− Eq[ln ∏_{k=1}^K q(φk|λk) ∏_{k=1}^{K−1} q(βk|a1_k,a2_k) ∏_{d=1}^D ∏_{t=1}^T q(cdt|ζdt) ∏_{t=1}^{T−1} q(πdt|γ1_dt,γ2_dt) ∏_{n=1}^N q(zdn|ϕdn)]
= ∑_{d=1}^D { Eq[lnp(πd|α0)] + Eq[lnp(cd|β)] + Eq[lnp(wd|cd,zd,φ)] + Eq[lnp(zd|πd)]
− Eq[lnq(cd|ζd)] − Eq[lnq(πd|γ1_d,γ2_d)] − Eq[lnq(zd|ϕd)] }
+ Eq[lnp(β|γ)] + Eq[lnp(φ|η)] − Eq[lnq(φ|λ)] − Eq[lnq(β|a1,a2)]
We can run variational EM to maximize this lower bound of logp(w|α0,γ,η)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 90 / 121
Variational Inference for HDP
Maximize lower bound of logp(w|α0,γ,η)
Derivative of it with respect to each variational parameter
γ1dt = 1 + ∑_{n=1..N} ϕdnt ,    γ2dt = α0 + ∑_{n=1..N} ∑_{b=t+1..T} ϕdnb

ζdtk = exp{ ∑_{e=1..k−1} (Ψ(a2e) − Ψ(a1e + a2e)) + (Ψ(a1k) − Ψ(a1k + a2k))
            + ∑_{n=1..N} ∑_{v=1..V} wvdn ϕdnt (Ψ(λkv) − Ψ(∑_{l=1..V} λkl)) }

ϕdnt = exp{ ∑_{h=1..t−1} (Ψ(γ2dh) − Ψ(γ1dh + γ2dh)) + (Ψ(γ1dt) − Ψ(γ1dt + γ2dt))
            + ∑_{k=1..K} ∑_{v=1..V} wvdn ζdtk (Ψ(λkv) − Ψ(∑_{l=1..V} λkl)) }

a1k = 1 + ∑_{d=1..D} ∑_{t=1..T} ζdtk ,    a2k = γ + ∑_{d=1..D} ∑_{t=1..T} ∑_{f=k+1..K} ζdtf

λkv = ηv + ∑_{d=1..D} ∑_{n=1..N} ∑_{t=1..T} wvdn ϕdnt ζdtk
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 91 / 121
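As a concrete illustration of the first pair of updates, here is a minimal runnable sketch (toy sizes, a random ϕ standing in for the optimized one; not from the slides) that computes γ1dt and γ2dt for one document.

```python
import numpy as np

# γ1_dt = 1 + Σ_n ϕ_dnt,   γ2_dt = α0 + Σ_n Σ_{b>t} ϕ_dnb   (one document)
N, T = 50, 10                # words in the document, document-level truncation
alpha0 = 1.0
rng = np.random.default_rng(0)

phi = rng.dirichlet(np.ones(T), size=N)   # ϕ_dn: rows sum to 1 over tables t

counts = phi.sum(axis=0)                  # Σ_n ϕ_dnt for each t, shape (T,)
gamma1 = 1.0 + counts[:-1]                # defined for sticks t = 1..T-1
tail = np.cumsum(counts[::-1])[::-1]      # tail[t] = Σ_{b>=t} counts[b]
gamma2 = alpha0 + tail[1:]                # Σ_{b>t} ϕ_dnb for t = 1..T-1

print(gamma1.shape, gamma2.shape)         # (T-1,), (T-1,)
```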
Variational Inference for HDP
Maximize lower bound of logp(w|α0,γ,η)
Derivative of it with respect to each variational parameter
Run Variational EM
E step: compute document-level parameters γ1dt, γ2dt, ζdtk, ϕdnt
M step: compute corpus-level parameters a1k, a2k, λkv
Algorithm 2 Variational Inference for HDP
1: Initialize the variational parameters
2: repeat
3: for each document d do
4: repeat
5: Compute document parameters γ1dt, γ2dt, ζdtk, ϕdnt
6: until converged
7: end for
8: Compute topic parameters a1k, a2k, λkv
9: until converged
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 92 / 121
Outline
1 Introduction
Motivation
Topic Modeling
2 Background
Dirichlet Distribution
Dirichlet Processes
3 Hierarchical Dirichlet Processes
Dirichlet Process Mixture Models
Hierarchical Dirichlet Processes
4 Inference
Gibbs Sampling
Variational Inference
Online Learning
Distributed Online Learning
5 Practical Tips
6 Summary
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 93 / 121
Online Variational Inference
Apply stochastic optimization to the variational objective [WPB11]
Subsample the documents
Compute an approximation of the gradient based on the subsample
Follow that gradient with a decreasing step-size
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 94 / 121
Variational Inference for HDP
Lower bound of log p(w|α0,γ,η)
ln p(w|α0,γ,η)
  ≥ ∑_{d=1..D} { Eq[ln p(πd|α0)] + Eq[ln p(cd|β)] + Eq[ln p(wd|cd,zd,φ)] + Eq[ln p(zd|πd)]
      − Eq[ln q(cd|ζd)] − Eq[ln q(πd|γ1d,γ2d)] − Eq[ln q(zd|ϕd)] }
    + Eq[ln p(β|γ)] + Eq[ln p(φ|η)] − Eq[ln q(φ|λ)] − Eq[ln q(β|a1,a2)]
  = ∑_{d=1..D} Ld + Lk
  = Ed[ D·Ld + Lk ]   (expectation over a document d drawn uniformly from the corpus)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 95 / 121
Online Variational Inference for HDP
Lower bound of logp(w|α0,γ,η) = Eqj
[DLd + 1
D
Lk ]
Online learning algorithm for HDP
Sample a document d
Compute its optimal document-level parameters γ1dt, γ2dt, ζdtk, ϕdnt
Take the natural gradient 5 of the corpus-level parameters a1k, a2k, λkv (a noisy estimate based on the single document)
Update the corpus-level parameters a1k, a2k, λkv with a decreasing learning rate
   a1k = (1 − ρe) a1k + ρe (1 + D ∑_{t=1..T} ζdtk)
   a2k = (1 − ρe) a2k + ρe (γ + D ∑_{t=1..T} ∑_{f=k+1..K} ζdtf)
   λkv = (1 − ρe) λkv + ρe (ηv + D ∑_{n=1..N} ∑_{t=1..T} wvdn ϕdnt ζdtk)
where ρe is the learning rate, which must satisfy ∑_{e=1..∞} ρe = ∞ and ∑_{e=1..∞} ρe² < ∞
5 The natural gradient is structurally equivalent to the batch variational inference update
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 96 / 121
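Below is a minimal runnable sketch (not the slide author's code) of the λ update for one sampled document, with toy dimensions and random ζ and ϕ standing in for the quantities that would come out of the document-level E step; it only illustrates the form of the interpolation with the learning rate ρe.

```python
import numpy as np

# λ_kv ← (1 − ρ_e) λ_kv + ρ_e (η_v + D Σ_n Σ_t w^v_dn ϕ_dnt ζ_dtk)
K, T, V, N, D = 20, 10, 1000, 50, 100000   # toy sizes; D is the corpus size
eta, tau0, kappa = 0.01, 1.0, 0.7
rng = np.random.default_rng(0)

lam = rng.gamma(1.0, 1.0, size=(K, V))     # current topic-word parameters
word_ids = rng.integers(V, size=N)         # words of the sampled document
phi = rng.dirichlet(np.ones(T), size=N)    # ϕ_dn (would come from the E step)
zeta = rng.dirichlet(np.ones(K), size=T)   # ζ_dt (would come from the E step)

e = 5                                      # documents processed so far
rho = (tau0 + e) ** (-kappa)               # decreasing learning rate

contrib = phi @ zeta                       # (N, K): Σ_t ϕ_dnt ζ_dtk for each word n
stats = np.zeros((K, V))                   # Σ_n Σ_t w^v_dn ϕ_dnt ζ_dtk
for n, w in enumerate(word_ids):
    stats[:, w] += contrib[n]

lam = (1.0 - rho) * lam + rho * (eta + D * stats)
```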
Online Variational Inference for HDP
Algorithm 3 Online Variational Inference for HDP
1: Initialize the variational parameters
2: e = 0
3: for each document d ∈ {1,...,D} do
4: repeat
5: Compute document parameters γ1dt, γ2dt, ζdtk, ϕdnt
6: until converged
7: e = e + 1
8: Compute learning rate ρe = (τ0 + e)^(−κ) where τ0 > 0, κ ∈ (0.5,1]
9: Update topic parameters a1k, a2k, λkv
10: end for
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 97 / 121
Outline
1 Introduction
Motivation
Topic Modeling
2 Background
Dirichlet Distribution
Dirichlet Processes
3 Hierarchical Dirichlet Processes
Dirichlet Process Mixture Models
Hierarchical Dirichlet Processes
4 Inference
Gibbs Sampling
Variational Inference
Online Learning
Distributed Online Learning
5 Practical Tips
6 Summary
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 98 / 121
Motivation
Problem 1: Inference for HDP takes a long time
Problem 2: A continuously expanding corpus requires continuous updates of the model parameters
But incremental updates of the model parameters are not possible with plain (batch) HDP
It must be re-trained on the entire updated corpus
Our Approach: Combine distributed inference and online learning
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 99 / 121
Distributed Online HDP
Based on variational inference
Mini-batch updates via stochastic learning (variational EM)
Distribute variational EM using MapReduce
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 100 / 121
Distributed Online HDP
Algorithm 4 Distributed Online HDP - Driver
1: Initialize the variational parameters
2: e = 0
3: while true do
4: Collect new documents s ∈ {1,...,S}
5: e = e + 1
6: Compute learning rate ρe = (τ0 + e)^(−κ) where τ0 > 0, κ ∈ (0.5,1]
7: Run MapReduce job
8: Get result of job and update topic parameters
9: end while
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 101 / 121
Distributed Online HDP
Algorithm 5 Distributed Online HDP - Mapper
1: Mapper gets one document s ∈ {1,...,S}
2: repeat
3: Compute document parameters γ1dt, γ2dt, ζdtk, ϕdnt
4: until converged
5: Output the sufficient statistics for the topic parameters
Algorithm 6 Distributed Online HDP - Reducer
1: Reducer gets the sufficient statistics for each topic parameter
2: Compute the change of each topic parameter from the sufficient statistics
3: Output the changes of the topic parameters
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 102 / 121
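Here is a toy single-process simulation (not the actual Hadoop job, and not the repository's code) of the mapper/reducer split above: each "mapper" emits per-document sufficient statistics for λ, the "reducer" sums them, and the driver applies the online update. The topic weights inside the mapper are random stand-ins for the E-step output, and rescaling the mini-batch by D/S follows the usual convention in online variational inference; the slides themselves only show the single-document case.

```python
import numpy as np

K, V, D_corpus = 20, 1000, 100000      # toy topic count, vocabulary, corpus size
eta, rho = 0.01, 0.05
rng = np.random.default_rng(0)
lam = rng.gamma(1.0, 1.0, size=(K, V))

def mapper(doc_word_ids):
    """Emit this document's sufficient statistics for λ (E-step stand-in)."""
    stats = np.zeros((K, V))
    weights = rng.dirichlet(np.ones(K), size=len(doc_word_ids))  # toy Σ_t ϕζ per word
    for w, tw in zip(doc_word_ids, weights):
        stats[:, w] += tw
    return stats

def reducer(all_stats):
    """Sum the sufficient statistics emitted by every mapper."""
    return sum(all_stats)

mini_batch = [rng.integers(V, size=50) for _ in range(8)]   # S = 8 new documents
batch_stats = reducer([mapper(doc) for doc in mini_batch])
scale = D_corpus / len(mini_batch)                          # rescale mini-batch to corpus size
lam = (1.0 - rho) * lam + rho * (eta + scale * batch_stats)
```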
Experimental Setup
Data: 973,266 Twitter conversations, 7.54 tweets / conv
Approximately 7,297,000 tweets
60 node Hadoop system
Each node with 8 x 2.30GHz cores
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 103 / 121
Result
Distributed Online HDP runs faster than online HDP
Distributed Online HDP preserves the quality of the results (perplexity)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 104 / 121
Practical Tips
Until now, I have talked about Bayesian Nonparametric Topic Modeling
The concept of Hierarchical Dirichlet Processes
How to infer the latent variables in HDP
These are of theoretical interest
Someone who attended the last machine learning winter school said
Wow! There are good and interesting machine learning
topics! But I want to know about practical issues, because I am
in the industrial field.
So I prepared some practical tips for him/her and you
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 105 / 121
Implementation
https://github.com/NoSyu/Topic_Models
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 106 / 121
Some tips for using topic models
How to manage hyper-parameters (Dirichlet parameters)?
How to manage learning rate and mini-batch size in online learning?
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 107 / 121
Some tips for using topic models
How to manage hyper-parameters (Dirichlet parameters)?
How to manage learning rate and mini-batch size in online learning?
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 108 / 121
HDP
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 109 / 121
Property of Dirichlet distribution
Sample pmfs from Dirichlet distribution [BAFG10]
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 110 / 121
Assign Dirichlet parameters
Set the Dirichlet parameters to values less than 1
People usually use only a few topics to write a document
People usually do not use all topics
Each topic usually uses only a few words to represent itself
Each topic does not use all words
We can assign each topic/word its own weight
Some topics are more general than others
Some words are more general than others
Words with positive/negative meaning appear in positive/negative sentiments [JO11]
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 111 / 121
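A small runnable illustration (not from the slides) of why parameters below 1 encode the "few topics per document, few words per topic" intuition: samples from a Dirichlet with α < 1 put most of their mass on a few components, while α > 1 spreads mass evenly. The 5% threshold is just an arbitrary way to count "active" components.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 20

for alpha in (0.1, 1.0, 10.0):
    theta = rng.dirichlet(np.full(K, alpha), size=1000)   # 1000 sampled topic proportions
    active = (theta > 0.05).sum(axis=1).mean()            # average number of "used" topics
    print(f"alpha={alpha:5.1f}: {active:4.1f} of {K} topics get more than 5% mass on average")
# Small alpha (e.g. 0.1) yields sparse proportions; large alpha yields near-uniform ones.
```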
Some tips for using topic models
How to manage hyper-parameters (Dirichlet parameters)?
How to manage learning rate and mini-batch size in online learning?
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 112 / 121
Compute learning rate ρe = (τ0 + e)^(−κ) where τ0 > 0, κ ∈ (0.5,1]
   a1k = (1 − ρe) a1k + ρe (1 + D ∑_{t=1..T} ζdtk)
   a2k = (1 − ρe) a2k + ρe (γ + D ∑_{t=1..T} ∑_{f=k+1..K} ζdtf)
   λkv = (1 − ρe) λkv + ρe (ηv + D ∑_{n=1..N} ∑_{t=1..T} wvdn ϕdnt ζdtk)
Meaning of each parameter
τ0: slows down the early iterations of the algorithm
κ: rate at which old values of the topic parameters are forgotten
Good values depend on the dataset
Usually we set τ0 = 1.0, κ = 0.7
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 113 / 121
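A tiny sketch (illustrative values only) of how the two knobs interact, printing ρe for a few update counts e with the default τ0 = 1.0, κ = 0.7 next to two alternatives.

```python
# Learning-rate schedule ρ_e = (τ0 + e)^(-κ): a larger τ0 damps the very first
# updates, and a larger κ makes ρ_e decay faster as e grows.
for tau0, kappa in [(1.0, 0.7), (10.0, 0.7), (1.0, 0.9)]:
    rhos = [(tau0 + e) ** (-kappa) for e in (1, 10, 100, 1000, 10000)]
    print(f"tau0={tau0:5.1f} kappa={kappa:.1f}:", " ".join(f"{r:.4f}" for r in rhos))
```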
Mini-batch size
When the mini-batch size is large, distributed online HDP runs faster
Perplexity remains similar across mini-batch sizes
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 114 / 121
Summary
Bayesian Nonparametric Topic Modeling
Hierarchical Dirichlet Processes
Chinese Restaurant Franchise
Stick Breaking Construction
Posterior Inference for HDP
Gibbs Sampling
Variational Inference
Online Learning
Slides and other materials are available at http://uilab.kaist.ac.kr/members/jinyeongbak
Implementations are maintained at http://github.com/NoSyu/Topic_Models
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 115 / 121
Further Reading
Dirichlet Process
Dirichlet Process
Dirichlet distribution and Dirichlet Process + Indian Buffet Process
Bayesian Nonparametric model
Machine Learning Summer School - Yee Whye Teh
Machine Learning Summer School - Peter Orbanz
Introductory article
Inference
MCMC
Variational Inference
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 116 / 121
Thank You!
JinYeong Bak
jy.bak@kaist.ac.kr, linkedin.com/in/jybak
Users & Information Lab, KAIST
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 117 / 121
References I
Charles E. Antoniak, Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems, The Annals of Statistics (1974), 1152–1174.
Bela A. Frigyik, Amol Kapila, and Maya R. Gupta, Introduction to the Dirichlet distribution and related processes, Tech. Report UWEETR-2010-0006, Department of Electrical Engineering, University of Washington, Seattle, WA 98195, December 2010.
Christopher M. Bishop and Nasser M. Nasrabadi, Pattern Recognition and Machine Learning, vol. 1, Springer, New York, 2006.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003), 993–1022.
Emily B. Fox, Erik B. Sudderth, Michael I. Jordan, and Alan S. Willsky, An HDP-HMM for systems with state persistence, Proceedings of the 25th International Conference on Machine Learning, ACM, 2008, pp. 312–319.
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 118 / 121
References II
Peter D. Hoff, A First Course in Bayesian Statistical Methods, Springer, 2009.
Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul, An introduction to variational methods for graphical models, Springer, 1998.
Yohan Jo and Alice H. Oh, Aspect and sentiment unification model for online review analysis, Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM '11), New York, NY, USA, ACM, 2011, pp. 815–824.
Radford M. Neal, Markov chain sampling methods for Dirichlet process mixture models, Journal of Computational and Graphical Statistics 9 (2000), no. 2, 249–265.
Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei, Hierarchical Dirichlet processes, Journal of the American Statistical Association 101 (2006), no. 476.
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 119 / 121
References III
Chong Wang, John W. Paisley, and David M. Blei, Online variational inference for the hierarchical Dirichlet process, International Conference on Artificial Intelligence and Statistics, 2011, pp. 752–760.
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 120 / 121
Images source I
http://christmasstockimages.com/free/ideas_concepts/slides/dice_throw.htm
http://www.flickr.com/photos/autumn2may/3965964418/
http://www.flickr.com/photos/ppix/1802571058/
http://yesurakezu.deviantart.com/art/Domo-s-head-exploding-with-dice-298452871
http://www.flickr.com/photos/jwight/2710392971/
http://www.flickr.com/photos/jasohill/2511594886/
http://en.wikipedia.org/wiki/Kim_Yuna
http://en.wikipedia.org/wiki/Hand_in_Hand_%28Olympics%29
http://en.wikipedia.org/wiki/Gangnam_Style
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 121 / 121
Measurable space (Ω,B)
Def) A set considered together with a σ-algebra on that set6.
Ω: the set of all outcomes, the sample space
B: a σ-algebra over Ω
A special kind of collection of subsets of the sample space Ω
Closed under complementation
If A ∈ B, then its complement AC is also in B
Closed under countable unions and intersections
If A and B are in B, then A ∪ B and A ∩ B are also in B
A collection of events
Property
Smallest possible σ-algebra: {Ω, ∅}
Largest possible σ-algebra: the power set of Ω
6 http://mathworld.wolfram.com/MeasurableSpace.html
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 122 / 121
Proof 1
Decimative property
   Let (θ1,θ2,...,θK) ∼ Dir(α1,α2,...,αK) and (τ1,τ2) ∼ Dir(α1β1,α1β2) where β1 + β2 = 1,
   then (θ1τ1, θ1τ2, θ2,...,θK) ∼ Dir(α1β1, α1β2, α2,...,αK)
Then
   (G(θ1), G(A1),...,G(AR)) = (β1, (1 − β1)G′(A1),...,(1 − β1)G′(AR)) ∼ Dir(1, α0G0(A1),...,α0G0(AR))
changes to
   (G′(A1),...,G′(AR)) ∼ Dir(α0G0(A1),...,α0G0(AR)),  i.e. G′ ∼ DP(α0,G0)
using the decimative property with α1 = α0, θ1 = (1 − β1), βk = G0(Ak), τk = G′(Ak)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 123 / 121
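As a quick sanity check of the decimative property (a simulation sketch, not part of the proof), the snippet below compares the empirical means of (θ1τ1, θ1τ2, θ2, θ3) with the means of a direct Dir(α1β1, α1β2, α2, α3) sample; the specific α and β values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
alpha = np.array([2.0, 1.0, 3.0])   # (α1, α2, α3)
beta = np.array([0.3, 0.7])         # β1 + β2 = 1

theta = rng.dirichlet(alpha, size=n)                 # (θ1, θ2, θ3)
tau = rng.dirichlet(alpha[0] * beta, size=n)         # (τ1, τ2) ~ Dir(α1β1, α1β2)
split = np.column_stack([theta[:, 0] * tau[:, 0],
                         theta[:, 0] * tau[:, 1],
                         theta[:, 1], theta[:, 2]])

direct = rng.dirichlet(np.concatenate([alpha[0] * beta, alpha[1:]]), size=n)

print(split.mean(axis=0))    # ≈ (α1β1, α1β2, α2, α3) / Σα
print(direct.mean(axis=0))   # matches the line above up to Monte Carlo error
```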

Weitere ähnliche Inhalte

Was ist angesagt?

Jarrar: Probabilistic Language Modeling - Introduction to N-grams
Jarrar: Probabilistic Language Modeling - Introduction to N-gramsJarrar: Probabilistic Language Modeling - Introduction to N-grams
Jarrar: Probabilistic Language Modeling - Introduction to N-gramsMustafa Jarrar
 
Reinforcement Learning and Neuroscience
Reinforcement Learning and NeuroscienceReinforcement Learning and Neuroscience
Reinforcement Learning and NeuroscienceMichael Bosello
 
Evaluating the Impact of Word Embeddings on Similarity Scoring in Practical I...
Evaluating the Impact of Word Embeddings on Similarity Scoring in Practical I...Evaluating the Impact of Word Embeddings on Similarity Scoring in Practical I...
Evaluating the Impact of Word Embeddings on Similarity Scoring in Practical I...Lukas Galke
 
Dr azimifar pattern recognition lect4
Dr azimifar pattern recognition lect4Dr azimifar pattern recognition lect4
Dr azimifar pattern recognition lect4Zahra Amini
 
Do Neural Models Learn Transitivity of Veridical Inference?
Do Neural Models Learn Transitivity of Veridical Inference?Do Neural Models Learn Transitivity of Veridical Inference?
Do Neural Models Learn Transitivity of Veridical Inference?Hitomi Yanaka
 
Rohan's Masters presentation
Rohan's Masters presentationRohan's Masters presentation
Rohan's Masters presentationrohan_anil
 
Reading "Bayesian measures of model complexity and fit"
Reading "Bayesian measures of model complexity and fit"Reading "Bayesian measures of model complexity and fit"
Reading "Bayesian measures of model complexity and fit"Christian Robert
 
Probing the Efficacy of the Algebra Project: A Summary of Findings
Probing the Efficacy of the Algebra Project: A Summary of FindingsProbing the Efficacy of the Algebra Project: A Summary of Findings
Probing the Efficacy of the Algebra Project: A Summary of FindingsEDD SFSU
 
Term Frequency and its Variants in Retrieval Models
Term Frequency and its Variants in Retrieval ModelsTerm Frequency and its Variants in Retrieval Models
Term Frequency and its Variants in Retrieval ModelsVenkatesh Vinayakarao
 
Dependent Types in Natural Language Semantics
Dependent Types in Natural Language SemanticsDependent Types in Natural Language Semantics
Dependent Types in Natural Language SemanticsDaisuke BEKKI
 
[seminar] Tues, 23 April, 2019
[seminar] Tues, 23 April, 2019[seminar] Tues, 23 April, 2019
[seminar] Tues, 23 April, 2019Naoto Agawa
 
A method for constructing fuzzy test statistics with application
A method for constructing fuzzy test statistics with applicationA method for constructing fuzzy test statistics with application
A method for constructing fuzzy test statistics with applicationAlexander Decker
 

Was ist angesagt? (18)

Jarrar: Probabilistic Language Modeling - Introduction to N-grams
Jarrar: Probabilistic Language Modeling - Introduction to N-gramsJarrar: Probabilistic Language Modeling - Introduction to N-grams
Jarrar: Probabilistic Language Modeling - Introduction to N-grams
 
Reinforcement Learning and Neuroscience
Reinforcement Learning and NeuroscienceReinforcement Learning and Neuroscience
Reinforcement Learning and Neuroscience
 
Linear models2
Linear models2Linear models2
Linear models2
 
Evaluating the Impact of Word Embeddings on Similarity Scoring in Practical I...
Evaluating the Impact of Word Embeddings on Similarity Scoring in Practical I...Evaluating the Impact of Word Embeddings on Similarity Scoring in Practical I...
Evaluating the Impact of Word Embeddings on Similarity Scoring in Practical I...
 
Dr azimifar pattern recognition lect4
Dr azimifar pattern recognition lect4Dr azimifar pattern recognition lect4
Dr azimifar pattern recognition lect4
 
12_applications.pdf
12_applications.pdf12_applications.pdf
12_applications.pdf
 
LDA on social bookmarking systems
LDA on social bookmarking systemsLDA on social bookmarking systems
LDA on social bookmarking systems
 
Do Neural Models Learn Transitivity of Veridical Inference?
Do Neural Models Learn Transitivity of Veridical Inference?Do Neural Models Learn Transitivity of Veridical Inference?
Do Neural Models Learn Transitivity of Veridical Inference?
 
Logic 2
Logic 2Logic 2
Logic 2
 
Rohan's Masters presentation
Rohan's Masters presentationRohan's Masters presentation
Rohan's Masters presentation
 
Reading "Bayesian measures of model complexity and fit"
Reading "Bayesian measures of model complexity and fit"Reading "Bayesian measures of model complexity and fit"
Reading "Bayesian measures of model complexity and fit"
 
Probing the Efficacy of the Algebra Project: A Summary of Findings
Probing the Efficacy of the Algebra Project: A Summary of FindingsProbing the Efficacy of the Algebra Project: A Summary of Findings
Probing the Efficacy of the Algebra Project: A Summary of Findings
 
Open-ended Visual Question-Answering
Open-ended  Visual Question-AnsweringOpen-ended  Visual Question-Answering
Open-ended Visual Question-Answering
 
Term Frequency and its Variants in Retrieval Models
Term Frequency and its Variants in Retrieval ModelsTerm Frequency and its Variants in Retrieval Models
Term Frequency and its Variants in Retrieval Models
 
Dependent Types in Natural Language Semantics
Dependent Types in Natural Language SemanticsDependent Types in Natural Language Semantics
Dependent Types in Natural Language Semantics
 
[seminar] Tues, 23 April, 2019
[seminar] Tues, 23 April, 2019[seminar] Tues, 23 April, 2019
[seminar] Tues, 23 April, 2019
 
A method for constructing fuzzy test statistics with application
A method for constructing fuzzy test statistics with applicationA method for constructing fuzzy test statistics with application
A method for constructing fuzzy test statistics with application
 
Infrastructures et recommandations pour les Humanités Numériques - Big Data e...
Infrastructures et recommandations pour les Humanités Numériques - Big Data e...Infrastructures et recommandations pour les Humanités Numériques - Big Data e...
Infrastructures et recommandations pour les Humanités Numériques - Big Data e...
 

Andere mochten auch

Chinese Restaurant Process
Chinese Restaurant ProcessChinese Restaurant Process
Chinese Restaurant ProcessMohitdeep Singh
 
Approximate Bayesian Computation on GPUs
Approximate Bayesian Computation on GPUsApproximate Bayesian Computation on GPUs
Approximate Bayesian Computation on GPUsMichael Stumpf
 
Topic Modeling for Learning Analytics Researchers LAK15 Tutorial
Topic Modeling for Learning Analytics Researchers LAK15 TutorialTopic Modeling for Learning Analytics Researchers LAK15 Tutorial
Topic Modeling for Learning Analytics Researchers LAK15 TutorialVitomir Kovanovic
 
Dirichlet Process
Dirichlet ProcessDirichlet Process
Dirichlet ProcessSangwoo Mo
 
KAIST Web Engineering Lab Introduction (2017 ver.)
KAIST Web Engineering Lab Introduction (2017 ver.)KAIST Web Engineering Lab Introduction (2017 ver.)
KAIST Web Engineering Lab Introduction (2017 ver.)webeng-kaist
 
directmarketing plan for restaurant
directmarketing plan for restaurantdirectmarketing plan for restaurant
directmarketing plan for restaurantgautamsushil90
 
Recommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human CategorizationRecommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human CategorizationChristoph Trattner
 
AutoCardSorter - Designing the Information Architecture of a web site using L...
AutoCardSorter - Designing the Information Architecture of a web site using L...AutoCardSorter - Designing the Information Architecture of a web site using L...
AutoCardSorter - Designing the Information Architecture of a web site using L...Christos Katsanos
 
Analysis of Reviews on Sony Z3
Analysis of Reviews on Sony Z3Analysis of Reviews on Sony Z3
Analysis of Reviews on Sony Z3Krishna Bollojula
 
Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)muzzy4friends
 
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...Damiano Spina
 
Mathematical approach for Text Mining 1
Mathematical approach for Text Mining 1Mathematical approach for Text Mining 1
Mathematical approach for Text Mining 1Kyunghoon Kim
 
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)rchbeir
 
20 cv mil_models_for_words
20 cv mil_models_for_words20 cv mil_models_for_words
20 cv mil_models_for_wordszukun
 
Mining Features from the Object-Oriented Source Code of a Collection of Softw...
Mining Features from the Object-Oriented Source Code of a Collection of Softw...Mining Features from the Object-Oriented Source Code of a Collection of Softw...
Mining Features from the Object-Oriented Source Code of a Collection of Softw...Ra'Fat Al-Msie'deen
 
SNAPP - Learning Analytics and Knowledge Conference 2011
SNAPP - Learning Analytics and Knowledge Conference 2011SNAPP - Learning Analytics and Knowledge Conference 2011
SNAPP - Learning Analytics and Knowledge Conference 2011aneeshabakharia
 
A Semantics-based Approach to Machine Perception
A Semantics-based Approach to Machine PerceptionA Semantics-based Approach to Machine Perception
A Semantics-based Approach to Machine PerceptionCory Andrew Henson
 
Latent Semantic Transliteration using Dirichlet Mixture
Latent Semantic Transliteration using Dirichlet MixtureLatent Semantic Transliteration using Dirichlet Mixture
Latent Semantic Transliteration using Dirichlet MixtureRakuten Group, Inc.
 

Andere mochten auch (20)

Chinese Restaurant Process
Chinese Restaurant ProcessChinese Restaurant Process
Chinese Restaurant Process
 
Approximate Bayesian Computation on GPUs
Approximate Bayesian Computation on GPUsApproximate Bayesian Computation on GPUs
Approximate Bayesian Computation on GPUs
 
Topic Modeling for Learning Analytics Researchers LAK15 Tutorial
Topic Modeling for Learning Analytics Researchers LAK15 TutorialTopic Modeling for Learning Analytics Researchers LAK15 Tutorial
Topic Modeling for Learning Analytics Researchers LAK15 Tutorial
 
Dirichlet Process
Dirichlet ProcessDirichlet Process
Dirichlet Process
 
KAIST Web Engineering Lab Introduction (2017 ver.)
KAIST Web Engineering Lab Introduction (2017 ver.)KAIST Web Engineering Lab Introduction (2017 ver.)
KAIST Web Engineering Lab Introduction (2017 ver.)
 
directmarketing plan for restaurant
directmarketing plan for restaurantdirectmarketing plan for restaurant
directmarketing plan for restaurant
 
Recommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human CategorizationRecommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human Categorization
 
AutoCardSorter - Designing the Information Architecture of a web site using L...
AutoCardSorter - Designing the Information Architecture of a web site using L...AutoCardSorter - Designing the Information Architecture of a web site using L...
AutoCardSorter - Designing the Information Architecture of a web site using L...
 
Analysis of Reviews on Sony Z3
Analysis of Reviews on Sony Z3Analysis of Reviews on Sony Z3
Analysis of Reviews on Sony Z3
 
Geometric Aspects of LSA
Geometric Aspects of LSAGeometric Aspects of LSA
Geometric Aspects of LSA
 
Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)
 
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
 
Mathematical approach for Text Mining 1
Mathematical approach for Text Mining 1Mathematical approach for Text Mining 1
Mathematical approach for Text Mining 1
 
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
 
20 cv mil_models_for_words
20 cv mil_models_for_words20 cv mil_models_for_words
20 cv mil_models_for_words
 
Practical Machine Learning
Practical Machine Learning Practical Machine Learning
Practical Machine Learning
 
Mining Features from the Object-Oriented Source Code of a Collection of Softw...
Mining Features from the Object-Oriented Source Code of a Collection of Softw...Mining Features from the Object-Oriented Source Code of a Collection of Softw...
Mining Features from the Object-Oriented Source Code of a Collection of Softw...
 
SNAPP - Learning Analytics and Knowledge Conference 2011
SNAPP - Learning Analytics and Knowledge Conference 2011SNAPP - Learning Analytics and Knowledge Conference 2011
SNAPP - Learning Analytics and Knowledge Conference 2011
 
A Semantics-based Approach to Machine Perception
A Semantics-based Approach to Machine PerceptionA Semantics-based Approach to Machine Perception
A Semantics-based Approach to Machine Perception
 
Latent Semantic Transliteration using Dirichlet Mixture
Latent Semantic Transliteration using Dirichlet MixtureLatent Semantic Transliteration using Dirichlet Mixture
Latent Semantic Transliteration using Dirichlet Mixture
 

Ähnlich wie Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

zkStudyClub - cqlin: Efficient linear operations on KZG commitments
zkStudyClub - cqlin: Efficient linear operations on KZG commitments zkStudyClub - cqlin: Efficient linear operations on KZG commitments
zkStudyClub - cqlin: Efficient linear operations on KZG commitments Alex Pruden
 
Engineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisEngineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisDavid Gleich
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet AllocationMarco Righini
 
Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Hady Elsahar
 
Topic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic ModelsTopic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic ModelsClaudia Wagner
 
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015rusbase
 
Ph d sem_1@iitm
Ph d sem_1@iitmPh d sem_1@iitm
Ph d sem_1@iitmVinu Ev
 
Дмитрий Игнатов для ФИSNA
Дмитрий Игнатов для ФИSNAДмитрий Игнатов для ФИSNA
Дмитрий Игнатов для ФИSNAAndzhey Arshavskiy
 
Anti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutAnti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutDavid Gleich
 
[PR12] intro. to gans jaejun yoo
[PR12] intro. to gans   jaejun yoo[PR12] intro. to gans   jaejun yoo
[PR12] intro. to gans jaejun yooJaeJun Yoo
 
Introduction to prolog
Introduction to prologIntroduction to prolog
Introduction to prologRakhi Sinha
 
Self-supervised Learning Lecture Note
Self-supervised Learning Lecture NoteSelf-supervised Learning Lecture Note
Self-supervised Learning Lecture NoteSangwoo Mo
 
Dialectica Categories... and Lax Topological Spaces?
Dialectica Categories... and Lax Topological Spaces?Dialectica Categories... and Lax Topological Spaces?
Dialectica Categories... and Lax Topological Spaces?Valeria de Paiva
 
Introduction to ambient GAN
Introduction to ambient GANIntroduction to ambient GAN
Introduction to ambient GANJaeJun Yoo
 
An optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slideAn optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slideWooSung Choi
 
高次元空間におけるハブの出現 (第11回ステアラボ人工知能セミナー)
高次元空間におけるハブの出現 (第11回ステアラボ人工知能セミナー)高次元空間におけるハブの出現 (第11回ステアラボ人工知能セミナー)
高次元空間におけるハブの出現 (第11回ステアラボ人工知能セミナー)STAIR Lab, Chiba Institute of Technology
 

Ähnlich wie Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes (20)

zkStudyClub - cqlin: Efficient linear operations on KZG commitments
zkStudyClub - cqlin: Efficient linear operations on KZG commitments zkStudyClub - cqlin: Efficient linear operations on KZG commitments
zkStudyClub - cqlin: Efficient linear operations on KZG commitments
 
Mapping Keywords to
Mapping Keywords to Mapping Keywords to
Mapping Keywords to
 
Engineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisEngineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network Analysis
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet Allocation
 
Collaborative DL
Collaborative DLCollaborative DL
Collaborative DL
 
Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Word Embeddings, why the hype ?
Word Embeddings, why the hype ?
 
Canini09a
Canini09aCanini09a
Canini09a
 
Topic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic ModelsTopic Models - LDA and Correlated Topic Models
Topic Models - LDA and Correlated Topic Models
 
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
 
Ph d sem_1@iitm
Ph d sem_1@iitmPh d sem_1@iitm
Ph d sem_1@iitm
 
Дмитрий Игнатов для ФИSNA
Дмитрий Игнатов для ФИSNAДмитрий Игнатов для ФИSNA
Дмитрий Игнатов для ФИSNA
 
Anti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCutAnti-differentiating Approximation Algorithms: PageRank and MinCut
Anti-differentiating Approximation Algorithms: PageRank and MinCut
 
[PR12] intro. to gans jaejun yoo
[PR12] intro. to gans   jaejun yoo[PR12] intro. to gans   jaejun yoo
[PR12] intro. to gans jaejun yoo
 
话题模型2
话题模型2话题模型2
话题模型2
 
Introduction to prolog
Introduction to prologIntroduction to prolog
Introduction to prolog
 
Self-supervised Learning Lecture Note
Self-supervised Learning Lecture NoteSelf-supervised Learning Lecture Note
Self-supervised Learning Lecture Note
 
Dialectica Categories... and Lax Topological Spaces?
Dialectica Categories... and Lax Topological Spaces?Dialectica Categories... and Lax Topological Spaces?
Dialectica Categories... and Lax Topological Spaces?
 
Introduction to ambient GAN
Introduction to ambient GANIntroduction to ambient GAN
Introduction to ambient GAN
 
An optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slideAn optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slide
 
高次元空間におけるハブの出現 (第11回ステアラボ人工知能セミナー)
高次元空間におけるハブの出現 (第11回ステアラボ人工知能セミナー)高次元空間におけるハブの出現 (第11回ステアラボ人工知能セミナー)
高次元空間におけるハブの出現 (第11回ステアラボ人工知能セミナー)
 

Mehr von JinYeong Bak

20160320 workshop seoul
20160320 workshop seoul20160320 workshop seoul
20160320 workshop seoulJinYeong Bak
 
질문법과 구글링
질문법과 구글링질문법과 구글링
질문법과 구글링JinYeong Bak
 
Five Centuries of Monarchy in Korea: Mining the Text of the Annals of the Jos...
Five Centuries of Monarchy in Korea: Mining the Text of the Annals of the Jos...Five Centuries of Monarchy in Korea: Mining the Text of the Annals of the Jos...
Five Centuries of Monarchy in Korea: Mining the Text of the Annals of the Jos...JinYeong Bak
 
2014 KAIST CS Student Representative Final Report
2014 KAIST CS Student Representative Final Report2014 KAIST CS Student Representative Final Report
2014 KAIST CS Student Representative Final ReportJinYeong Bak
 
Self-disclosure topic model for twitter conversations - EMNLP 2014
Self-disclosure topic model for twitter conversations - EMNLP 2014Self-disclosure topic model for twitter conversations - EMNLP 2014
Self-disclosure topic model for twitter conversations - EMNLP 2014JinYeong Bak
 
Self-disclosure in twitter conversations - talk in QCRI
Self-disclosure in twitter conversations - talk in QCRISelf-disclosure in twitter conversations - talk in QCRI
Self-disclosure in twitter conversations - talk in QCRIJinYeong Bak
 
Self-Disclosure and Relationship Strength in Twitter Conversations
Self-Disclosure and Relationship Strength in Twitter ConversationsSelf-Disclosure and Relationship Strength in Twitter Conversations
Self-Disclosure and Relationship Strength in Twitter ConversationsJinYeong Bak
 

Mehr von JinYeong Bak (7)

20160320 workshop seoul
20160320 workshop seoul20160320 workshop seoul
20160320 workshop seoul
 
질문법과 구글링
질문법과 구글링질문법과 구글링
질문법과 구글링
 
Five Centuries of Monarchy in Korea: Mining the Text of the Annals of the Jos...
Five Centuries of Monarchy in Korea: Mining the Text of the Annals of the Jos...Five Centuries of Monarchy in Korea: Mining the Text of the Annals of the Jos...
Five Centuries of Monarchy in Korea: Mining the Text of the Annals of the Jos...
 
2014 KAIST CS Student Representative Final Report
2014 KAIST CS Student Representative Final Report2014 KAIST CS Student Representative Final Report
2014 KAIST CS Student Representative Final Report
 
Self-disclosure topic model for twitter conversations - EMNLP 2014
Self-disclosure topic model for twitter conversations - EMNLP 2014Self-disclosure topic model for twitter conversations - EMNLP 2014
Self-disclosure topic model for twitter conversations - EMNLP 2014
 
Self-disclosure in twitter conversations - talk in QCRI
Self-disclosure in twitter conversations - talk in QCRISelf-disclosure in twitter conversations - talk in QCRI
Self-disclosure in twitter conversations - talk in QCRI
 
Self-Disclosure and Relationship Strength in Twitter Conversations
Self-Disclosure and Relationship Strength in Twitter ConversationsSelf-Disclosure and Relationship Strength in Twitter Conversations
Self-Disclosure and Relationship Strength in Twitter Conversations
 

Kürzlich hochgeladen

How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17Celine George
 
How to Use api.constrains ( ) in Odoo 17
How to Use api.constrains ( ) in Odoo 17How to Use api.constrains ( ) in Odoo 17
How to Use api.constrains ( ) in Odoo 17Celine George
 
AUDIENCE THEORY -- FANDOM -- JENKINS.pptx
AUDIENCE THEORY -- FANDOM -- JENKINS.pptxAUDIENCE THEORY -- FANDOM -- JENKINS.pptx
AUDIENCE THEORY -- FANDOM -- JENKINS.pptxiammrhaywood
 
What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?TechSoup
 
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdf
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdfMaximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdf
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdfTechSoup
 
M-2- General Reactions of amino acids.pptx
M-2- General Reactions of amino acids.pptxM-2- General Reactions of amino acids.pptx
M-2- General Reactions of amino acids.pptxDr. Santhosh Kumar. N
 
Clinical Pharmacy Introduction to Clinical Pharmacy, Concept of clinical pptx
Clinical Pharmacy  Introduction to Clinical Pharmacy, Concept of clinical pptxClinical Pharmacy  Introduction to Clinical Pharmacy, Concept of clinical pptx
Clinical Pharmacy Introduction to Clinical Pharmacy, Concept of clinical pptxraviapr7
 
How to Add a many2many Relational Field in Odoo 17
How to Add a many2many Relational Field in Odoo 17How to Add a many2many Relational Field in Odoo 17
How to Add a many2many Relational Field in Odoo 17Celine George
 
Patterns of Written Texts Across Disciplines.pptx
Patterns of Written Texts Across Disciplines.pptxPatterns of Written Texts Across Disciplines.pptx
Patterns of Written Texts Across Disciplines.pptxMYDA ANGELICA SUAN
 
How to Solve Singleton Error in the Odoo 17
How to Solve Singleton Error in the  Odoo 17How to Solve Singleton Error in the  Odoo 17
How to Solve Singleton Error in the Odoo 17Celine George
 
CapTechU Doctoral Presentation -March 2024 slides.pptx
CapTechU Doctoral Presentation -March 2024 slides.pptxCapTechU Doctoral Presentation -March 2024 slides.pptx
CapTechU Doctoral Presentation -March 2024 slides.pptxCapitolTechU
 
Practical Research 1: Lesson 8 Writing the Thesis Statement.pptx
Practical Research 1: Lesson 8 Writing the Thesis Statement.pptxPractical Research 1: Lesson 8 Writing the Thesis Statement.pptx
Practical Research 1: Lesson 8 Writing the Thesis Statement.pptxKatherine Villaluna
 
Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.EnglishCEIPdeSigeiro
 
Ultra structure and life cycle of Plasmodium.pptx
Ultra structure and life cycle of Plasmodium.pptxUltra structure and life cycle of Plasmodium.pptx
Ultra structure and life cycle of Plasmodium.pptxDr. Asif Anas
 
Drug Information Services- DIC and Sources.
Drug Information Services- DIC and Sources.Drug Information Services- DIC and Sources.
Drug Information Services- DIC and Sources.raviapr7
 
5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...CaraSkikne1
 
The Stolen Bacillus by Herbert George Wells
The Stolen Bacillus by Herbert George WellsThe Stolen Bacillus by Herbert George Wells
The Stolen Bacillus by Herbert George WellsEugene Lysak
 
Education and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxEducation and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxraviapr7
 
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptxPISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptxEduSkills OECD
 

Kürzlich hochgeladen (20)

How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17
 
How to Use api.constrains ( ) in Odoo 17
How to Use api.constrains ( ) in Odoo 17How to Use api.constrains ( ) in Odoo 17
How to Use api.constrains ( ) in Odoo 17
 
AUDIENCE THEORY -- FANDOM -- JENKINS.pptx
AUDIENCE THEORY -- FANDOM -- JENKINS.pptxAUDIENCE THEORY -- FANDOM -- JENKINS.pptx
AUDIENCE THEORY -- FANDOM -- JENKINS.pptx
 
What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?
 
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdf
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdfMaximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdf
Maximizing Impact_ Nonprofit Website Planning, Budgeting, and Design.pdf
 
M-2- General Reactions of amino acids.pptx
M-2- General Reactions of amino acids.pptxM-2- General Reactions of amino acids.pptx
M-2- General Reactions of amino acids.pptx
 
Clinical Pharmacy Introduction to Clinical Pharmacy, Concept of clinical pptx
Clinical Pharmacy  Introduction to Clinical Pharmacy, Concept of clinical pptxClinical Pharmacy  Introduction to Clinical Pharmacy, Concept of clinical pptx
Clinical Pharmacy Introduction to Clinical Pharmacy, Concept of clinical pptx
 
How to Add a many2many Relational Field in Odoo 17
How to Add a many2many Relational Field in Odoo 17How to Add a many2many Relational Field in Odoo 17
How to Add a many2many Relational Field in Odoo 17
 
Patterns of Written Texts Across Disciplines.pptx
Patterns of Written Texts Across Disciplines.pptxPatterns of Written Texts Across Disciplines.pptx
Patterns of Written Texts Across Disciplines.pptx
 
Finals of Kant get Marx 2.0 : a general politics quiz
Finals of Kant get Marx 2.0 : a general politics quizFinals of Kant get Marx 2.0 : a general politics quiz
Finals of Kant get Marx 2.0 : a general politics quiz
 
How to Solve Singleton Error in the Odoo 17
How to Solve Singleton Error in the  Odoo 17How to Solve Singleton Error in the  Odoo 17
How to Solve Singleton Error in the Odoo 17
 
CapTechU Doctoral Presentation -March 2024 slides.pptx
CapTechU Doctoral Presentation -March 2024 slides.pptxCapTechU Doctoral Presentation -March 2024 slides.pptx
CapTechU Doctoral Presentation -March 2024 slides.pptx
 
Practical Research 1: Lesson 8 Writing the Thesis Statement.pptx
Practical Research 1: Lesson 8 Writing the Thesis Statement.pptxPractical Research 1: Lesson 8 Writing the Thesis Statement.pptx
Practical Research 1: Lesson 8 Writing the Thesis Statement.pptx
 
Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.
 
Ultra structure and life cycle of Plasmodium.pptx
Ultra structure and life cycle of Plasmodium.pptxUltra structure and life cycle of Plasmodium.pptx
Ultra structure and life cycle of Plasmodium.pptx
 
Drug Information Services- DIC and Sources.
Drug Information Services- DIC and Sources.Drug Information Services- DIC and Sources.
Drug Information Services- DIC and Sources.
 
5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...
 
The Stolen Bacillus by Herbert George Wells
The Stolen Bacillus by Herbert George WellsThe Stolen Bacillus by Herbert George Wells
The Stolen Bacillus by Herbert George Wells
 
Education and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxEducation and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptx
 
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptxPISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
 

Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

  • 1. Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes JinYeong Bak Department of Computer Science KAIST, Daejeon South Korea jy.bak@kaist.ac.kr August 22, 2013 Part of this slides adopted from presentation by Yee Whye Teh (y.w.teh@stats.ox.ac.uk). JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 1 / 121
  • 2. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 2 / 121
  • 3. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 3 / 121
  • 4. Introduction Bayesian topic models Latent Dirichlet Allocation (LDA) [BNJ03] Hierarchical Dircihlet Processes (HDP) [TJBB06] In this talk, Dirichlet distribution, Dircihlet process Concept of Hierarchical Dircihlet Processes (HDP) How to infer the latent variables in HDP JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 4 / 121
  • 5. Motivation JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 5 / 121
  • 6. Motivation JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 6 / 121
  • 7. Motivation JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 6 / 121
  • 8. Motivation JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 6 / 121
  • 9. Motivation JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 6 / 121
  • 10. Motivation What are the topics discussed in the article? How can we describe the topics? JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 7 / 121
  • 11. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 8 / 121
  • 12. Topic Modeling JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 9 / 121
  • 13. Topic Modeling JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 9 / 121
  • 14. Topic Modeling JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 9 / 121
  • 15. Topic Modeling JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 9 / 121
  • 16. Topic Modeling Each topic has word distribution JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 10 / 121
  • 17. Topic Modeling Each document has topic proportion Each word has its own topic index JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 11 / 121
  • 18. Topic Modeling Each document has topic proportion Each word has its own topic index JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 11 / 121
  • 19. Topic Modeling Each document has topic proportion Each word has its own topic index JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 11 / 121
  • 20. Topic Modeling JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 12 / 121
  • 21. Topic Modeling JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 12 / 121
  • 22. Topic Modeling JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 12 / 121
  • 23. Topic Modeling JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 12 / 121
  • 24. Topic Modeling JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 12 / 121
  • 25. Latent Dirichlet Allocation Generative process of LDA For each topic k ∈ {1,...,K}: Draw word distributions βk ∼ Dir(η) For each document d ∈ {1,...,D}: Draw topic proportions θd ∼ Dir(α) For each word in a document n ∈ {1,...,N}: Draw a topic index zdn ∼ Mult(θ) Generate word from chosen topic wdn ∼ Mult(βzdn ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 13 / 121
  • 26. Latent Dirichlet Allocation Generative process of LDA For each topic k ∈ {1,...,K}: Draw word distributions βk ∼ Dir(η) For each document d ∈ {1,...,D}: Draw topic proportions θd ∼ Dir(α) For each word in a document n ∈ {1,...,N}: Draw a topic index zdn ∼ Mult(θ) Generate word from chosen topic wdn ∼ Mult(βzdn ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 13 / 121
  • 27. Latent Dirichlet Allocation Generative process of LDA For each topic k ∈ {1,...,K}: Draw word distributions βk ∼ Dir(η) For each document d ∈ {1,...,D}: Draw topic proportions θd ∼ Dir(α) For each word in a document n ∈ {1,...,N}: Draw a topic index zdn ∼ Mult(θ) Generate word from chosen topic wdn ∼ Mult(βzdn ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 13 / 121
  • 28. Latent Dirichlet Allocation Generative process of LDA For each topic k ∈ {1,...,K}: Draw word distributions βk ∼ Dir(η) For each document d ∈ {1,...,D}: Draw topic proportions θd ∼ Dir(α) For each word in a document n ∈ {1,...,N}: Draw a topic index zdn ∼ Mult(θ) Generate word from chosen topic wdn ∼ Mult(βzdn ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 13 / 121
  • 29. Latent Dirichlet Allocation Our interests What are the topics discussed in the article? How can we describe the topics? JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 14 / 121
  • 30. Latent Dirichlet Allocation What we can see Words in documents JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 15 / 121
  • 31. Latent Dirichlet Allocation What we want to see JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 16 / 121
  • 32. Latent Dirichlet Allocation Our interests What are the topics discussed in the article? => Topic proportion of each document How can we describe the topics? => Word distribution of each topic JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 17 / 121
  • 33. Latent Dirichlet Allocation What we can see: w What we want to see: θ,z,β ∴ Compute p(θ,z,β|w,α,η) = p(θ,z,β,w|α,η) p(w|α,η) But this distribution is intractable to compute ( normalization term) So we do approximate methods Gibbs Sampling Variational Inference JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 18 / 121
  • 34. Latent Dirichlet Allocation What we can see: w What we want to see: θ,z,β ∴ Compute p(θ,z,β|w,α,η) = p(θ,z,β,w|α,η) p(w|α,η) But this distribution is intractable to compute ( normalization term) So we do approximate methods Gibbs Sampling Variational Inference JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 18 / 121
  • 36. Limitation of Latent Dirichlet Allocation Latent Dirichlet Allocation is a parametric model People must specify the number of topics in a corpus People must search for the best number of topics Q) Can we get it from the data automatically? A) Hierarchical Dirichlet Processes JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 20 / 121
  • 37. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 21 / 121
  • 38. Dice modeling Think about the probability of a number rolled on a die Each die has its own pmf According to the textbook, it is widely assumed to be uniform => 1/6 for a 6-sided die Is it true? JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 22 / 121
  • 40. Dice modeling Think about the probability of a number rolled on a die According to the textbook, it is widely assumed to be uniform => 1/6 for a 6-sided die Is it true? Ans) No! JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 23 / 121
  • 41. Dice modeling We should model the randomness of the pmf of each die How can we do that? Let’s imagine a bag that contains many dice We cannot see inside the bag We can draw one die out of the bag OK, but what is the formal description? JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 24 / 121
  • 43. Standard Simplex A generalization of the notion of a triangle or tetrahedron All points are non-negative and sum to 1 (see http://en.wikipedia.org/wiki/Simplex) A pmf can be thought of as a point in the standard simplex Ex) A point p = (x,y,z), where x ≥ 0, y ≥ 0, z ≥ 0 and x + y + z = 1 JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 25 / 121
  • 45. Dirichlet distribution Definition [BN06] A probability distribution over the (K−1)-dimensional standard simplex A distribution over pmfs of length K Notation θ ∼ Dir(α) where θ = [θ1,...,θK] is a random pmf, α = [α1,...,αK] Probability density function p(θ;α) = [Γ(∑_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)] ∏_{k=1}^K θk^(αk−1) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 26 / 121
  • 48. Latent Dirichlet Allocation JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 27 / 121
  • 49. Property of Dirichlet distribution Density plots [BAFG10] JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 28 / 121
  • 50. Property of Dirichlet distribution Sample pmfs from Dirichlet distribution [BAFG10] JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 29 / 121
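These sample pmfs are easy to reproduce. A quick NumPy sketch that draws sample pmfs for a 6-sided die under different concentration values (the values are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    for alpha in (0.1, 1.0, 10.0):
        samples = rng.dirichlet(np.full(6, alpha), size=5)   # 5 sample pmfs over 6 outcomes
        print("alpha =", alpha)
        print(np.round(samples, 3))                          # each row is non-negative and sums to 1

Small alpha concentrates mass on a few faces (corners of the simplex); large alpha gives near-uniform pmfs.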
  • 51. Property of Dirichlet distribution When K = 2, it is the Beta distribution Conjugate prior for the Multinomial distribution Likelihood X ∼ Mult(n,θ) with counts x = (x1,...,xK), Prior θ ∼ Dir(α) ∴ Posterior (θ|X) ∼ Dir(α + x) Proof) p(θ|X) = p(X|θ)p(θ) / p(X) ∝ p(X|θ)p(θ) = [n! / (x1!···xK!)] ∏_{k=1}^K θk^xk · [Γ(∑_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)] ∏_{k=1}^K θk^(αk−1) = C ∏_{k=1}^K θk^(αk+xk−1) ∝ Dir(α + x) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 30 / 121
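The conjugate update is simply "add the observed counts to the prior parameters". A small sketch, with an assumed true pmf for illustration:

    import numpy as np

    rng = np.random.default_rng(1)
    alpha = np.array([1.0, 1.0, 1.0])           # prior Dir(alpha) over a 3-outcome pmf
    theta_true = np.array([0.6, 0.3, 0.1])      # pmf we pretend generated the data
    x = rng.multinomial(100, theta_true)        # observed counts x ~ Mult(100, theta_true)

    posterior = alpha + x                       # conjugacy: (theta | x) ~ Dir(alpha + x)
    print("posterior parameters:", posterior)
    print("posterior mean:", posterior / posterior.sum())   # close to theta_true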
  • 53. Property of Dirichlet distribution Aggregation property Let (θ1,θ2,...,θK) ∼ Dir(α1,α2,...,αK) then (θ1+θ2,θ3,...,θK) ∼ Dir(α1+α2,α3,...,αK) In general, if {A1,...,AR} is any partition of {1,...,K}, then (∑_{k∈A1} θk,...,∑_{k∈AR} θk) ∼ Dir(∑_{k∈A1} αk,...,∑_{k∈AR} αk) Decimative property Let (θ1,θ2,...,θK) ∼ Dir(α1,α2,...,αK) and (τ1,τ2) ∼ Dir(α1β1,α1β2) where β1+β2 = 1, then (θ1τ1,θ1τ2,θ2,...,θK) ∼ Dir(α1β1,α1β2,α2,...,αK) Neutrality property Let (θ1,θ2,...,θK) ∼ Dir(α1,α2,...,αK) then θk is independent of the vector (1/(1−θk))(θ1,θ2,...,θk−1,θk+1,...,θK) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 31 / 121
  • 57. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 32 / 121
  • 58. Dice modeling Think about the probability of a number rolled on a die Each die has its own pmf Draw a die out of a bag Problem) We do not know the number of faces of the dice in the bag Solution) Dirichlet process JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 33 / 121
  • 60. Dirichlet Process Definition [BAFG10] A distribution over probability measures A distribution whose realizations are themselves distributions over a sample space Formal definition (Ω,B) is a measurable space G0 is a distribution over the sample space Ω α0 is a positive real number G is a random probability measure over (Ω,B) G ∼ DP(α0,G0) if for any finite measurable partition (A1,...,AR) of Ω (G(A1),...,G(AR)) ∼ Dir(α0G0(A1),...,α0G0(AR)) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 34 / 121
  • 62. Posterior Dirichlet Processes G ∼ DP(α0,G0) can be treated as a random distribution over Ω We can draw a sample θ1 from G We can also make a finite partition (A1,...,AR) of Ω; then p(θ1 ∈ Ar|G) = G(Ar), p(θ1 ∈ Ar) = G0(Ar) (G(A1),...,G(AR)) ∼ Dir(α0G0(A1),...,α0G0(AR)) Using Dirichlet-multinomial conjugacy, the posterior is (G(A1),...,G(AR))|θ1 ∼ Dir(α0G0(A1)+δθ1(A1),...,α0G0(AR)+δθ1(AR)) where δθ(Ar) = 1 if θ ∈ Ar and 0 otherwise This holds for every finite partition of Ω JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 35 / 121
  • 66. Posterior Dirichlet Processes For every finite partition of Ω, (G(A1),...,G(AR))|θ1 ∼ Dir(α0G0(A1)+δθ1(A1),...,α0G0(AR)+δθ1(AR)) where δθ1(Ar) = 1 if θ1 ∈ Ar and 0 otherwise The posterior process is also a Dirichlet process: G|θ1 ∼ DP(α0+1, (α0G0+δθ1)/(α0+1)) Summary) θ1|G ∼ G, G ∼ DP(α0,G0) ⇐⇒ θ1 ∼ G0, G|θ1 ∼ DP(α0+1, (α0G0+δθ1)/(α0+1)) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 36 / 121
  • 69. Blackwell-MacQueen Urn Scheme Now we draw samples θ1,...,θN First sample: θ1|G ∼ G, G ∼ DP(α0,G0) ⇐⇒ θ1 ∼ G0, G|θ1 ∼ DP(α0+1, (α0G0+δθ1)/(α0+1)) Second sample: θ2|θ1,G ∼ G, G|θ1 ∼ DP(α0+1, (α0G0+δθ1)/(α0+1)) ⇐⇒ θ2|θ1 ∼ (α0G0+δθ1)/(α0+1), G|θ1,θ2 ∼ DP(α0+2, (α0G0+δθ1+δθ2)/(α0+2)) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 37 / 121
  • 72. Blackwell-MacQueen Urn Scheme Nth sample: θN|θ1,...,N−1,G ∼ G, G|θ1,...,N−1 ∼ DP(α0+N−1, (α0G0+∑_{n=1}^{N−1} δθn)/(α0+N−1)) ⇐⇒ θN|θ1,...,N−1 ∼ (α0G0+∑_{n=1}^{N−1} δθn)/(α0+N−1), G|θ1,...,N ∼ DP(α0+N, (α0G0+∑_{n=1}^{N} δθn)/(α0+N)) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 38 / 121
  • 73. Blackwell-MacQueen Urn Scheme The Blackwell-MacQueen urn scheme produces a sequence θ1,θ2,... with the following conditionals: θN|θ1,...,N−1 ∼ (α0G0+∑_{n=1}^{N−1} δθn)/(α0+N−1) As a Polya urn analogy Infinite number of ball colors Empty urn Filling the Polya urn (n starts at 1) With probability α0/(α0+n−1), pick a new color from the set of infinite ball colors according to G0, paint a new ball that color and add it to the urn With probability (n−1)/(α0+n−1), pick a ball from the urn, record its color, and put it back into the urn with another ball of the same color JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 39 / 121
  • 74. Chinese Restaurant Process Draw θ1,θ2,...,θN from a Blackwell-MacQueen urn scheme With probability α0/(α0+n−1), pick a new color from the set of infinite ball colors according to G0, paint a new ball that color and add it to the urn With probability (n−1)/(α0+n−1), pick a ball from the urn, record its color, and put it back into the urn with another ball of the same color The θs can take the same values, θi = θj There are K ≤ N distinct values φ1,...,φK This induces a partition of the samples: θ1,θ2,...,θN induce the distinct values φ1,...,φK The distribution over partitions is called the Chinese Restaurant Process (CRP) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 40 / 121
  • 76. Chinese Restaurant Process θ1,θ2,...,θN induce the distinct values φ1,...,φK Chinese Restaurant Process interpretation There is a Chinese restaurant with infinitely many tables Each customer sits at a table Generating from the Chinese Restaurant Process The first customer sits at the first table The n-th customer sits at A new table with probability α0/(α0+n−1) Table k with probability nk/(α0+n−1), where nk is the number of customers at table k JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 41 / 121
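The table-assignment rule above is straightforward to simulate. A minimal sketch of a CRP sampler in Python/NumPy (function name and parameter values are illustrative):

    import numpy as np

    def sample_crp(N, alpha0, seed=0):
        """Assign N customers to tables under CRP(alpha0); returns a table index per customer."""
        rng = np.random.default_rng(seed)
        tables = []                               # tables[k] = number of customers at table k
        assignments = []
        for n in range(1, N + 1):
            probs = np.array(tables + [alpha0], dtype=float)
            probs /= alpha0 + n - 1               # table k: n_k/(alpha0+n-1); new table: alpha0/(alpha0+n-1)
            k = rng.choice(len(probs), p=probs)
            if k == len(tables):
                tables.append(0)                  # open a new table
            tables[k] += 1
            assignments.append(k)
        return assignments

    print(sample_crp(20, alpha0=1.0))

Larger alpha0 opens new tables more often, i.e., more clusters for the same number of customers.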
  • 80. Chinese Restaurant Process The CRP exhibits the clustering property of DP Tables are clusters, φk ∼ G0 Customers are the actual realizations, θn = φzn where zn ∈ {1,...,K} JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 42 / 121
  • 81. Stick Breaking Construction The Blackwell-MacQueen urn scheme / CRP generates θ ∼ G, not G itself To construct G, we use the Stick Breaking Construction Review) Posterior Dirichlet Processes θ1|G ∼ G, G ∼ DP(α0,G0) ⇐⇒ θ1 ∼ G0, G|θ1 ∼ DP(α0+1, (α0G0+δθ1)/(α0+1)) Consider a partition ({θ1}, Ω\{θ1}) of Ω. Then (G({θ1}),G(Ω\{θ1})) ∼ Dir((α0+1)·[(α0G0+δθ1)/(α0+1)]({θ1}), (α0+1)·[(α0G0+δθ1)/(α0+1)](Ω\{θ1})) = Dir(1,α0) = Beta(1,α0) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 43 / 121
  • 84. Stick Breaking Construction Consider a partition ({θ1}, Ω\{θ1}) of Ω. Then (G({θ1}),G(Ω\{θ1})) = (β1, 1−β1) ∼ Beta(1,α0) G has a point mass located at θ1: G = β1δθ1 + (1−β1)G′, β1 ∼ Beta(1,α0) where G′ is the probability measure with the point mass at θ1 removed What is G′? JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 44 / 121
  • 87. Stick Breaking Construction Summary) Posterior Dirichlet Processes θ1|G ∼ G, G ∼ DP(α0,G0) ⇐⇒ θ1 ∼ G0, G|θ1 ∼ DP(α0+1, (α0G0+δθ1)/(α0+1)) G = β1δθ1 + (1−β1)G′, β1 ∼ Beta(1,α0) Consider a further partition ({θ1},A1,...,AR) of Ω (G({θ1}),G(A1),...,G(AR)) = (β1, (1−β1)G′(A1),...,(1−β1)G′(AR)) ∼ Dir(1,α0G0(A1),...,α0G0(AR)) Using the decimative property of the Dirichlet distribution (proof) (G′(A1),...,G′(AR)) ∼ Dir(α0G0(A1),...,α0G0(AR)) ∴ G′ ∼ DP(α0,G0) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 45 / 121
  • 90. Stick Breaking Construction Repeat this with the distinct values φ1,φ2,···: G ∼ DP(α0,G0) G = β1δφ1 + (1−β1)G1 G = β1δφ1 + (1−β1)(β2δφ2 + (1−β2)G2) ... G = ∑_{k=1}^∞ πk δφk where πk = βk ∏_{i=1}^{k−1}(1−βi), ∑_{k=1}^∞ πk = 1, βk ∼ Beta(1,α0), φk ∼ G0 Draws from the DP look like a sum of point masses, with masses drawn from a stick-breaking construction JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 46 / 121
  • 91. Stick Breaking Construction Summary) G = ∑_{k=1}^∞ πk δφk, πk = βk ∏_{i=1}^{k−1}(1−βi), ∑_{k=1}^∞ πk = 1, βk ∼ Beta(1,α0), φk ∼ G0 JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 47 / 121
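A truncated version of this construction is easy to sample. A sketch assuming a truncation level T chosen large enough that the leftover stick is negligible (the function name and values are illustrative):

    import numpy as np

    def stick_breaking_weights(alpha0, T=100, seed=0):
        """Truncated stick-breaking draw of DP weights pi_1,...,pi_T (sketch)."""
        rng = np.random.default_rng(seed)
        b = rng.beta(1.0, alpha0, size=T)                            # beta_k ~ Beta(1, alpha0)
        pi = b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))   # pi_k = beta_k * prod_{i<k}(1 - beta_i)
        return pi

    pi = stick_breaking_weights(alpha0=1.0)
    print(pi[:10], "leftover mass:", 1.0 - pi.sum())

Together with atoms phi_k drawn i.i.d. from G0, the pairs (pi_k, phi_k) give an (approximate) draw of G itself.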
  • 92. Summary of DP Definition G is a random probability measure over (Ω,B) G ∼ DP(α0,G0) if for any finite measurable partition (A1,...,Ar ) of Ω (G(A1),...,G(Ar )) ∼ Dir(α0G0(A1),...,α0G0(Ar )) Chinese Restaurant Process Stick Breaking Construction JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 48 / 121
  • 93. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 49 / 121
  • 94. Dirichlet Process Mixture Models We model a data set x1,...,xN using the following model [Nea00] xn ∼ F(θn) θn ∼ G G ∼ DP(α0,G0) Each θn is a latent parameter modelling xn, while G is the unknown distribution over parameters modelled using a DP JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 50 / 121
  • 96. Dirichlet Process Mixture Models Since G is of the form G = ∑_{k=1}^∞ πk δφk we have θn = φk with probability πk Let kn take on value k with probability πk. We can equivalently define θn = φ_{kn} An equivalent model: xn ∼ F(θn), θn ∼ G, G ∼ DP(α0,G0) ⇐⇒ xn ∼ F(φ_{kn}), p(kn = k) = πk, πk = βk ∏_{i=1}^{k−1}(1−βi), βk ∼ Beta(1,α0), φk ∼ G0 JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 51 / 121
  • 99. Dirichlet Process Mixture Models xn ∼ F(θn), θn ∼ G, G ∼ DP(α0,G0) ⇐⇒ xn ∼ F(φ_{kn}), p(kn = k) = πk, πk = βk ∏_{i=1}^{k−1}(1−βi), βk ∼ Beta(1,α0), φk ∼ G0 JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 52 / 121
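To make the equivalence concrete, here is a sketch of sampling data from a DP mixture via truncated stick-breaking; the choices F = Gaussian with unit variance, G0 = N(0, 5^2) and the truncation level are illustrative assumptions, not part of the original model:

    import numpy as np

    def generate_dpmm_data(N=200, alpha0=1.0, T=50, seed=0):
        """Sketch of the DP mixture generative process with Gaussian components."""
        rng = np.random.default_rng(seed)
        b = rng.beta(1.0, alpha0, size=T)
        pi = b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))   # stick-breaking weights
        pi /= pi.sum()                                               # renormalize the truncated weights
        phi = rng.normal(0.0, 5.0, size=T)                           # component parameters phi_k ~ G0
        k = rng.choice(T, size=N, p=pi)                              # p(k_n = k) = pi_k
        x = rng.normal(phi[k], 1.0)                                  # x_n ~ F(phi_{k_n})
        return x, k

    x, k = generate_dpmm_data()
    print("clusters actually used:", len(np.unique(k)))

Even with a large truncation, only a handful of components receive data, which is the nonparametric clustering effect.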
  • 100. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 53 / 121
  • 101. Topic modeling with documents Each document consists of bags of words Each word in a document has latent topic index Latent topics for words in a document can be grouped Each document has topic proportion Each topic has word distribution Topics must be shared across documents JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 54 / 121
  • 103. Problem of Naive Dirichlet Process Mixture Model Use a DP mixture for each document xdn ∼ F(θdn), θdn ∼ Gd, Gd ∼ DP(α0,G0) But there is no sharing of clusters across different groups, because G0 is smooth: G1 = ∑_{k=1}^∞ π1k δφ1k, G2 = ∑_{k=1}^∞ π2k δφ2k, φ1k,φ2k ∼ G0, so the atoms φ1k and φ2k almost surely differ JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 55 / 121
  • 105. Problem of Naive Dirichlet Process Mixture Model Solution Make the base distribution G0 discrete Put a DP prior on the common base distribution Hierarchical Dirichlet Process G0 ∼ DP(γ,H) G1,G2|G0 ∼ DP(α0,G0) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 56 / 121
  • 107. Hierarchical Dirichlet Processes Making G0 discrete forces clusters to be shared between G1 and G2 JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 57 / 121
  • 108. Stick Breaking Construction A Hierarchical Dirichlet Process with documents 1,...,D: G0 ∼ DP(γ,H), Gd|G0 ∼ DP(α0,G0) The stick-breaking construction for the HDP G0 = ∑_{k=1}^∞ βk δφk, φk ∼ H, βk = β′k ∏_{i=1}^{k−1}(1−β′i), β′k ∼ Beta(1,γ) Gd = ∑_{k=1}^∞ πdk δφk, πdk = π′dk ∏_{i=1}^{k−1}(1−π′di), π′dk ∼ Beta(α0βk, α0(1−∑_{i=1}^k βi)) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 58 / 121
  • 109. Chinese Restaurant Franchise Gd|G0 ∼ DP(α0,G0), θdn ∼ G0 Draw θd1,θd2,... from a Blackwell-MacQueen urn scheme θd1,θd2,... induce the distinct table values φd1,φd2,... JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 59 / 121
  • 110. Chinese Restaurant Franchise Gd|G0 ∼ DP(α0,G0), θdn ∼ G0 Draw θd1,θd2,... from a Blackwell-MacQueen urn scheme θd1,θd2,... induce the distinct table values φd1,φd2,... For another document d′, draw θd′1,θd′2,... from a Blackwell-MacQueen urn scheme θd′1,θd′2,... induce the distinct table values φd′1,φd′2,... JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 60 / 121
  • 111. Chinese Restaurant Franchise G0 ∼ DP(γ,H), φk ∼ H Gd|G0 ∼ DP(α0,G0), θdn ∼ G0 Draw θd1,θd2,... from a Blackwell-MacQueen urn scheme θd1,θd2,... induce the distinct table values φd1,φd2,... Draw θd′1,θd′2,... from a Blackwell-MacQueen urn scheme θd′1,θd′2,... induce the distinct table values φd′1,φd′2,... JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 61 / 121
  • 112. Chinese Restaurant Franchise Chinese Restaurant Franchise interpretation Each restaurant has infinitely many tables All restaurants share the same menu Each customer sits at a table Generating from the Chinese Restaurant Franchise For each restaurant The first customer sits at the first table and chooses a new menu item The n-th customer sits at A new table with probability α0/(α0+n−1) Table t with probability ndt/(α0+n−1), where ndt is the number of customers at table t When a customer opens a new table, that table chooses A new menu item with probability γ/(γ+m−1) Existing menu item k with probability mk/(γ+m−1), where m is the total number of tables in all restaurants and mk is the number of tables serving menu item k across all restaurants JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 62 / 121
  • 116. Chinese Restaurant Franchise JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 63 / 121
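A minimal sketch of sampling seatings and dishes from the Chinese Restaurant Franchise (this is a simulation of the generative metaphor, not an inference algorithm and not the author's code; names and the exact counting convention for a freshly opened table are illustrative assumptions):

    import numpy as np

    def sample_crf(doc_sizes, alpha0, gamma, seed=0):
        """Per document: a table index per customer and a dish (topic) index per table."""
        rng = np.random.default_rng(seed)
        dish_tables = []                                  # dish_tables[k] = tables serving dish k, all restaurants
        seatings, dishes = [], []
        for N in doc_sizes:
            table_counts, table_dish, seats = [], [], []
            for n in range(1, N + 1):
                p = np.array(table_counts + [alpha0], dtype=float) / (alpha0 + n - 1)
                t = rng.choice(len(p), p=p)
                if t == len(table_counts):                # new table: pick its dish from the shared menu
                    q = np.array(dish_tables + [gamma], dtype=float)
                    q /= q.sum()                          # dish k: m_k/(gamma+m); new dish: gamma/(gamma+m)
                    k = rng.choice(len(q), p=q)
                    if k == len(dish_tables):
                        dish_tables.append(0)
                    dish_tables[k] += 1
                    table_counts.append(0)
                    table_dish.append(k)
                table_counts[t] += 1
                seats.append(t)
            seatings.append(seats)
            dishes.append(table_dish)
        return seatings, dishes

    seatings, dishes = sample_crf([30, 30, 30], alpha0=1.0, gamma=1.0)
    print("dishes used per document:", [sorted(set(d)) for d in dishes])

Because the menu is shared, the same dish (topic) can be served in several restaurants (documents).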
  • 117. HDP for Topic modeling Questions What can we assume about the topics in a document? What can we assume about the words in the topics? Solution Each document consists of bags of words Each word in a document has latent topic Latent topics for words in a document can be grouped Each document has topic proportion Each topic has word distribution Topics must be shared across documents JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 64 / 121
  • 119. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 65 / 121
  • 120. Gibbs Sampling Definition A special case of Markov-chain Monte Carlo (MCMC) method An iterative algorithm that constructs a dependent sequence of parameter values whose distribution converges to the target joint posterior distribution [Hof09] Algorithm Find full conditional distribution of latent variables of target distribution Initialize all latent variables Sampling until converged Sample one latent variable from full conditional distribution JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 66 / 121
  • 122. Collapsed Gibbs sampling Collapsed Gibbs sampling integrates out one or more variables when sampling the other variables. Example) There are three latent variables A, B and C. Normally we sample p(A|B,C), p(B|A,C) and p(C|A,B) sequentially But when we integrate out B, we sample only p(A|C) and p(C|A) sequentially JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 67 / 121
  • 123. Review) Dirichlet Process Mixture Models xn ∼ F(θn), θn ∼ G, G ∼ DP(α0,G0) ⇐⇒ xn ∼ F(φ_{kn}), p(kn = k) = πk, πk = βk ∏_{i=1}^{k−1}(1−βi), βk ∼ Beta(1,α0), φk ∼ G0 JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 68 / 121
  • 124. Review) Blackwell-MacQueen Urn Scheme for DP Nth sample: θN|θ1,...,N−1,G ∼ G, G|θ1,...,N−1 ∼ DP(α0+N−1, (α0G0+∑_{n=1}^{N−1} δθn)/(α0+N−1)) ⇐⇒ θN|θ1,...,N−1 ∼ (α0G0+∑_{n=1}^{N−1} δθn)/(α0+N−1), G|θ1,...,N ∼ DP(α0+N, (α0G0+∑_{n=1}^{N} δθn)/(α0+N)) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 69 / 121
  • 125. Review) Chinese Restaurant Franchise Generating from the Chinese Restaurant Franchise For each restaurant The first customer sits at the first table and chooses a new menu item The n-th customer sits at A new table with probability α0/(α0+n−1) Table t with probability ndt/(α0+n−1), where ndt is the number of customers at table t When a customer opens a new table, that table chooses A new menu item with probability γ/(γ+m−1) Existing menu item k with probability mk/(γ+m−1), where m is the total number of tables in all restaurants and mk is the number of tables serving menu item k JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 70 / 121
  • 126. Alternative form of HDP G0 ∼ DP(γ,H), φdt ∼ G0 ∴ G0|φdt,... ∼ DP(γ+m, (γH+∑_{k=1}^K mk δφk)/(γ+m)) Then G0 can be written as G0 = ∑_{k=1}^K πk δφk + πu Gu where Gu ∼ DP(γ,H), π = (π1,...,πK,πu) ∼ Dir(m1,...,mK,γ) and p(φk|·) ∝ h(φk) ∏_{dn: zdn=k} f(xdn|φk) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 71 / 121
  • 128. Hierarchical Dirichlet Processes xdn ∼ F(θdn), θdn ∼ Gd, Gd ∼ DP(α0,G0), G0 ∼ DP(γ,H) ⇐⇒ xdn ∼ Mult(φ_{zdn}), zdn ∼ Mult(θd), φk ∼ Dir(η), θd ∼ Dir(α0π), π ∼ Dir(m.1,...,m.K,γ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 72 / 121
  • 129. Gibbs Sampling for HDP Joint distribution p(θ,z,φ,x,π,m|α0,η,γ) = p(π|m,γ) ∏_{k=1}^K p(φk|η) ∏_{d=1}^D p(θd|α0,π) ∏_{n=1}^N p(zdn|θd) p(xdn|zdn,φ) Integrating out θ and φ: p(z,x,π,m|α0,η,γ) = [Γ(∑_{k=1}^K m.k+γ) / (∏_{k=1}^K Γ(m.k)·Γ(γ))] ∏_{k=1}^K πk^(m.k−1) · πK+1^(γ−1) · ∏_{k=1}^K [Γ(∑_{v=1}^V ηv) / ∏_{v=1}^V Γ(ηv)] · [∏_{v=1}^V Γ(ηv + nk (·),v) / Γ(∑_{v=1}^V ηv + nk (·),v)] · ∏_{d=1}^D [Γ(∑_{k=1}^K α0πk) / ∏_{k=1}^K Γ(α0πk)] · [∏_{k=1}^K Γ(α0πk + nk d,(·)) / Γ(∑_{k=1}^K α0πk + nk d,(·))] JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 73 / 121
  • 130. Gibbs Sampling for HDP Full conditional distribution of z p(z(d′,n′) = k′|z−(d′,n′),m,π,x,·) = p(z(d′,n′) = k′, z−(d′,n′),m,π,x|·) / p(z−(d′,n′),m,π,x|·) ∝ p(z(d′,n′) = k′, z−(d′,n′),m,π,x|·) ∝ (α0πk′ + nk′,−(d′,n′) d′,(·)) · (ηv + nk′,−(d′,n′) (·),v) / (∑_{v=1}^V ηv + nk′,−(d′,n′) (·),v) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 74 / 121
  • 131. Gibbs Sampling for HDP Full conditional distribution of m The probability that word xd′n′ is assigned to some table t such that kdt = k: p(θd′n′ = φt|φdt = φk, θ−(d′,n′), π) ∝ n(·),−(d′,n′) d,(·),t p(θd′n′ = new table|φd,tnew = φk, θ−(d′,n′), π) ∝ α0πk These equations form a Dirichlet process with concentration parameter α0πk and assignment of n(·),−(d′,n′) d,(·),t to components The corresponding distribution over the number of components is the desired conditional distribution of mdk Antoniak [Ant74] has shown that p(mdk = m|z, m−dk, π) = [Γ(α0πk) / Γ(α0πk + nk d,(·),(·))] s(nk d,(·),(·), m)(α0πk)^m where s(n,m) is the unsigned Stirling number of the first kind JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 75 / 121
  • 134. Gibbs Sampling for HDP Full conditional distribution of π (π1,π2,...,πK ,πu)|· ∼ Dir(m.1,m.2,...,m.K ,γ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 76 / 121
  • 135. Gibbs Sampling for HDP Algorithm 1 Gibbs Sampling for HDP 1: Initialize all latent variables randomly 2: repeat 3: for Each document d do 4: for Each word n in document d do 5: Sample z(d,n) ∼ Mult( (α0πk + nk,−(d,n) d,(·)) (ηv + nk,−(d,n) (·),v) / (∑_{v=1}^V ηv + nk,−(d,n) (·),v) ) 6: end for 7: Sample m ∼ Mult( [Γ(α0πk) / Γ(α0πk + nk d,(·),(·))] s(nk d,(·),(·), m)(α0πk)^m ) 8: Sample π ∼ Dir(m.1,m.2,...,m.K,γ) 9: end for 10: until Converged JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 77 / 121
  • 136. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 78 / 121
  • 137. Stick Breaking Construction A Hierarchical Dirichlet Process with documents 1,...,D: G0 ∼ DP(γ,H), Gd|G0 ∼ DP(α0,G0) The stick-breaking construction for the HDP G0 = ∑_{k=1}^∞ βk δφk, φk ∼ H, βk = β′k ∏_{i=1}^{k−1}(1−β′i), β′k ∼ Beta(1,γ) Gd = ∑_{k=1}^∞ πdk δφk, πdk = π′dk ∏_{i=1}^{k−1}(1−π′di), π′dk ∼ Beta(α0βk, α0(1−∑_{i=1}^k βi)) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 79 / 121
  • 138. Alternative Stick Breaking Construction Problem) In the original stick-breaking construction, the weights βk and πdk are tightly correlated: βk = β′k ∏_{i=1}^{k−1}(1−β′i), β′k ∼ Beta(1,γ), πdk = π′dk ∏_{i=1}^{k−1}(1−π′di), π′dk ∼ Beta(α0βk, α0(1−∑_{i=1}^k βi)) Alternative stick-breaking construction for each document [FSJW08]: ψdt ∼ G0, πdt = π′dt ∏_{i=1}^{t−1}(1−π′di), π′dt ∼ Beta(1,α0), Gd = ∑_{t=1}^∞ πdt δψdt JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 80 / 121
  • 139. Alternative Stick Breaking Construction The stick-breaking construction for the HDP G0 = ∑_{k=1}^∞ βk δφk, φk ∼ H, βk = β′k ∏_{i=1}^{k−1}(1−β′i), β′k ∼ Beta(1,γ) Gd = ∑_{t=1}^∞ πdt δψdt, ψdt ∼ G0, πdt = π′dt ∏_{i=1}^{t−1}(1−π′di), π′dt ∼ Beta(1,α0) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 81 / 121
  • 140. Alternative Stick Breaking Construction The stick-breaking construction for the HDP G0 = ∑_{k=1}^∞ βk δφk, φk ∼ H, βk = β′k ∏_{i=1}^{k−1}(1−β′i), β′k ∼ Beta(1,γ) Gd = ∑_{t=1}^∞ πdt δψdt, ψdt ∼ G0, πdt = π′dt ∏_{i=1}^{t−1}(1−π′di), π′dt ∼ Beta(1,α0) To connect ψdt and φk we add an auxiliary variable cdt ∼ Mult(β) Then ψdt = φ_{cdt} JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 82 / 121
  • 141. Alternative Stick Breaking Construction Generative process 1 For each global-level topic k ∈ {1,...,∞}: 1 Draw topic word proportions, φk ∼ Dir(η) 2 Draw a corpus breaking proportion, β′k ∼ Beta(1,γ) 2 For each document d ∈ {1,...,D}: 1 For each document-level topic t ∈ {1,...,∞}: 1 Draw a document-level topic index, cdt ∼ Mult(σ(β′)) 2 Draw a document breaking proportion, π′dt ∼ Beta(1,α0) 2 For each word n ∈ {1,...,N}: 1 Draw a topic index zdn ∼ Mult(σ(π′d)) 2 Generate a word wdn ∼ Mult(φ_{c_{d,zdn}}) where σ(β′) ≡ {β1,β2,...}, βk = β′k ∏_{i=1}^{k−1}(1−β′i) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 83 / 121
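A sketch of this generative process with finite truncations K (corpus-level topics) and T (document-level topics); the truncation levels, sizes and hyperparameter values are illustrative assumptions, and the truncated stick weights are renormalized for sampling:

    import numpy as np

    def sigma(b):
        """Map breaking proportions b'_1, b'_2, ... to stick weights (the sigma(.) above)."""
        return b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))

    def generate_hdp_corpus(D=20, N=50, K=50, T=10, V=500, gamma=1.0, alpha0=1.0, eta=0.01, seed=0):
        rng = np.random.default_rng(seed)
        phi = rng.dirichlet(np.full(V, eta), size=K)                 # phi_k ~ Dir(eta)
        beta = sigma(rng.beta(1.0, gamma, size=K))                   # corpus-level stick weights
        beta /= beta.sum()
        docs = []
        for d in range(D):
            c = rng.choice(K, size=T, p=beta)                        # c_dt ~ Mult(sigma(beta'))
            pi = sigma(rng.beta(1.0, alpha0, size=T))                # document-level stick weights
            pi /= pi.sum()
            z = rng.choice(T, size=N, p=pi)                          # z_dn ~ Mult(sigma(pi'_d))
            w = np.array([rng.choice(V, p=phi[c[t]]) for t in z])    # w_dn ~ Mult(phi_{c_{d,z_dn}})
            docs.append(w)
        return docs

    docs = generate_hdp_corpus()
    print(len(docs), "documents,", len(docs[0]), "words each")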
  • 142. Variational Inference Main idea [JGJS98] Modify the original graphical model into a simpler model Minimize the dissimilarity between the original and the modified one More formally Observed data X, latent variable Z We want to compute p(Z|X) Make q(Z) Minimize the dissimilarity between p and q (commonly the KL-divergence of p from q, DKL(q||p)) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 84 / 121
  • 144. KL-divergence of p from q Find a lower bound of the log evidence logp(X): logp(X) = log ∑_{Z} p(Z,X) = log ∑_{Z} p(Z,X)·q(Z|X)/q(Z|X) = log ∑_{Z} q(Z|X)·[p(Z,X)/q(Z|X)] ≥ ∑_{Z} q(Z|X) log[p(Z,X)/q(Z|X)] (Jensen’s inequality) Gap between logp(X) and its lower bound: logp(X) − ∑_{Z} q(Z|X) log[p(Z,X)/q(Z|X)] = ∑_{Z} q(Z) log[q(Z)/p(Z|X)] = DKL(q||p) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 85 / 121
  • 146. KL-divergence of p from q logp(X) = ∑_{Z} q(Z|X) log[p(Z,X)/q(Z|X)] + DKL(q||p) The log evidence logp(X) is fixed with respect to q ∴ Minimizing DKL(q||p) ≡ Maximizing the lower bound of logp(X) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 86 / 121
  • 147. Variational Inference Main idea [JGJS98] Modify the original graphical model into a simpler model Minimize the dissimilarity between the original and the modified one More formally Observed data X, latent variable Z We want to compute p(Z|X) Make q(Z) Minimize the dissimilarity between p and q (commonly the KL-divergence of p from q, DKL(q||p)) Find a lower bound of logp(X) and maximize it JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 87 / 121
  • 148. Variational Inference for HDP q(β,φ,π,c,z) = ∏_{k=1}^K q(φk|λk) ∏_{k=1}^{K−1} q(βk|a1 k,a2 k) ∏_{d=1}^D [ ∏_{t=1}^T q(cdt|ζdt) ∏_{t=1}^{T−1} q(πdt|γ1 dt,γ2 dt) ∏_{n=1}^N q(zdn|ϕdn) ] JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 88 / 121
  • 149. Variational Inference for HDP Find a lower bound of lnp(w|α0,γ,η): lnp(w|α0,γ,η) = ln ∫β ∫φ ∫π ∑c ∑z p(w,β,φ,π,c,z|α0,γ,η) dβ dφ dπ = ln ∫β ∫φ ∫π ∑c ∑z [p(w,β,φ,π,c,z|α0,γ,η)/q(β,φ,π,c,z)]·q(β,φ,π,c,z) dβ dφ dπ ≥ ∫β ∫φ ∫π ∑c ∑z ln[p(w,β,φ,π,c,z|α0,γ,η)/q(β,φ,π,c,z)]·q(β,φ,π,c,z) dβ dφ dπ (∵ Jensen’s inequality) = Eq[lnp(w,β,φ,π,c,z|α0,γ,η)] − Eq[lnq(β,φ,π,c,z)] JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 89 / 121
  • 150. Variational Inference for HDP lnp(w|α0,γ,η) ≥ Eq[lnp(w,β,φ,π,c,z|α0,γ,η)] − Eq[lnq(β,φ,π,c,z)] = Eq[lnp(β|γ)p(φ|η) ∏_{d=1}^D p(πd|α0)p(cd|β) ∏_{n=1}^N p(wdn|cd,zdn,φ)p(zdn|πd)] − Eq[ln ∏_{k=1}^K q(φk|λk) ∏_{k=1}^{K−1} q(βk|a1 k,a2 k) ∏_{d=1}^D ∏_{t=1}^T q(cdt|ζdt) ∏_{t=1}^{T−1} q(πdt|γ1 dt,γ2 dt) ∏_{n=1}^N q(zdn|ϕdn)] = ∑_{d=1}^D { Eq[lnp(πd|α0)] + Eq[lnp(cd|β)] + Eq[lnp(wd|cd,zd,φ)] + Eq[lnp(zd|πd)] − Eq[lnq(cd|ζd)] − Eq[lnq(πd|γ1 d,γ2 d)] − Eq[lnq(zd|ϕd)] } + Eq[lnp(β|γ)] + Eq[lnp(φ|η)] − Eq[lnq(φ|λ)] − Eq[lnq(β|a1,a2)] We can run variational EM to maximize this lower bound of lnp(w|α0,γ,η) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 90 / 121
  • 151. Variational Inference for HDP Maximize the lower bound of lnp(w|α0,γ,η) Take the derivative with respect to each variational parameter and set it to zero: γ1 dt = 1 + ∑_{n=1}^N ϕdnt, γ2 dt = α0 + ∑_{n=1}^N ∑_{b=t+1}^T ϕdnb ζdtk ∝ exp{ ∑_{e=1}^{k−1}(Ψ(a2 e)−Ψ(a1 e+a2 e)) + (Ψ(a1 k)−Ψ(a1 k+a2 k)) + ∑_{n=1}^N ∑_{v=1}^V wv dn ϕdnt (Ψ(λkv)−Ψ(∑_{l=1}^V λkl)) } ϕdnt ∝ exp{ ∑_{h=1}^{t−1}(Ψ(γ2 dh)−Ψ(γ1 dh+γ2 dh)) + (Ψ(γ1 dt)−Ψ(γ1 dt+γ2 dt)) + ∑_{k=1}^K ∑_{v=1}^V wv dn ζdtk (Ψ(λkv)−Ψ(∑_{l=1}^V λkl)) } a1 k = 1 + ∑_{d=1}^D ∑_{t=1}^T ζdtk, a2 k = γ + ∑_{d=1}^D ∑_{t=1}^T ∑_{f=k+1}^K ζdtf λkv = ηv + ∑_{d=1}^D ∑_{n=1}^N ∑_{t=1}^T wv dn ϕdnt ζdtk JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 91 / 121
  • 152. Variational Inference for HDP Maximize lower bound of logp(w|α0,γ,η) Derivative of it with respect to each variational parameter Run Variational EM E step: compute document level parameters γ1 dt ,γ2 dt ,ζdtk ,ϕdnt M step: compute corpus level parameters a1 k ,a2 k ,λkv Algorithm 2 Variational Inference for HDP 1: Initialize the variational parameters 2: repeat 3: for Each document d do 4: repeat 5: Compute document parameters γ1 dt ,γ2 dt ,ζdtk ,ϕdnt 6: until Converged 7: end for 8: Compute topic parameters a1 k ,a2 k ,λkv 9: until Converged JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 92 / 121
  • 153. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 93 / 121
  • 154. Online Variational Inference Stochastic optimization to the variational objective [WPB11] Subsample the documents Compute approximation of the gradient based on subsample Follow that gradient with a decreasing step-size JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 94 / 121
  • 155. Variational Inference for HDP Lower bound of lnp(w|α0,γ,η): lnp(w|α0,γ,η) ≥ ∑_{d=1}^D { Eq[lnp(πd|α0)] + Eq[lnp(cd|β)] + Eq[lnp(wd|cd,zd,φ)] + Eq[lnp(zd|πd)] − Eq[lnq(cd|ζd)] − Eq[lnq(πd|γ1 d,γ2 d)] − Eq[lnq(zd|ϕd)] } + Eq[lnp(β|γ)] + Eq[lnp(φ|η)] − Eq[lnq(φ|λ)] − Eq[lnq(β|a1,a2)] = ∑_{d=1}^D Ld + Lk = Ed[ D·(Ld + (1/D)·Lk) ] where d is sampled uniformly from {1,...,D} JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 95 / 121
  • 156. Online Variational Inference for HDP Lower bound of lnp(w|α0,γ,η) = Ed[ D·(Ld + (1/D)·Lk) ] Online learning algorithm for HDP Sample a document d Compute its optimal document-level parameters γ1 dt,γ2 dt,ζdtk,ϕdnt Take the (natural) gradient of the corpus-level parameters a1 k,a2 k,λkv, which is noisy Update the corpus-level parameters a1 k,a2 k,λkv with a decreasing learning rate: a1 k = (1−ρe)a1 k + ρe(1 + D ∑_{t=1}^T ζdtk) a2 k = (1−ρe)a2 k + ρe(γ + D ∑_{t=1}^T ∑_{f=k+1}^K ζdtf) λkv = (1−ρe)λkv + ρe(ηv + D ∑_{n=1}^N ∑_{t=1}^T wv dn ϕdnt ζdtk) where ρe is the learning rate, which must satisfy ∑_{e=1}^∞ ρe = ∞ and ∑_{e=1}^∞ ρe² < ∞ (the natural gradient is structurally equivalent to the batch variational update) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 96 / 121
  • 157. Online Variational Inference for HDP Algorithm 3 Online Variational Inference for HDP 1: Initialize the variational parameters 2: e = 0 3: for Each document d ∈ {1,...,D} do 4: repeat 5: Compute document parameters γ1 dt,γ2 dt,ζdtk,ϕdnt 6: until Converged 7: e = e+1 8: Compute learning rate ρe = (τ0+e)^(−κ) where τ0 > 0, κ ∈ (0.5,1] 9: Update topic parameters a1 k,a2 k,λkv 10: end for JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 97 / 121
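The corpus-level update is the usual stochastic-approximation recipe: blend the old value with a noisy estimate computed from one document, with a step size that decays over time. A sketch (names, shapes and constants are illustrative assumptions):

    import numpy as np

    def learning_rate(e, tau0=1.0, kappa=0.7):
        """rho_e = (tau0 + e)^(-kappa); kappa in (0.5, 1] satisfies the Robbins-Monro conditions."""
        return (tau0 + e) ** (-kappa)

    def online_update(lam, doc_stats, eta, D, e, tau0=1.0, kappa=0.7):
        """One stochastic update of lambda_kv; doc_stats is the K x V contribution of one document."""
        rho = learning_rate(e, tau0, kappa)
        return (1.0 - rho) * lam + rho * (eta + D * doc_stats)

    K, V, D = 5, 20, 1000
    lam = np.full((K, V), 0.01)
    for e in range(1, 4):                                      # pretend we visited three documents
        fake_stats = np.random.default_rng(e).random((K, V))   # stand-in for real sufficient statistics
        lam = online_update(lam, fake_stats, eta=0.01, D=D, e=e)
    print(lam.shape, learning_rate(1), learning_rate(100))

The a1 k and a2 k updates follow exactly the same blending pattern with their own sufficient statistics.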
  • 158. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 98 / 121
  • 159. Motivation Problem 1: Inference for HDP takes a long time Problem 2: Continuously expanding corpus necessitates continuous updates of model parameters But updating of model parameters is not possible with plain HDP Must re-train with the entire updated corpus Our Approach: Combine distributed inference and online learning JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 99 / 121
  • 160. Distributed Online HDP Based on variational inference Mini-batch updates via stochastic learning (variational EM) Distribute variational EM using MapReduce JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 100 / 121
  • 161. Distributed Online HDP Algorithm 4 Distributed Online HDP - Driver 1: Initialize the variational parameters 2: e = 0 3: while Run forever do 4: Collect new documents s ∈ {1,...,S} 5: e = e+1 6: Compute learning rate ρe = (τ0+e)^(−κ) where τ0 > 0, κ ∈ (0.5,1] 7: Run MapReduce job 8: Get result of job and update topic parameters 9: end while JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 101 / 121
  • 162. Distributed Online HDP Algorithm 5 Distributed Online HDP - Mapper 1: Mapper gets one document s ∈ {1,...,S} 2: repeat 3: Compute document parameters γ1 dt,γ2 dt,ζdtk,ϕdnt 4: until Converged 5: Output the sufficient statistics for the topic parameters Algorithm 6 Distributed Online HDP - Reducer 1: Reducer gets the sufficient statistics for each topic parameter 2: Compute the change of the topic parameter from the sufficient statistics 3: Output the change of the topic parameter JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 102 / 121
  • 163. Experimental Setup Data: 973,266 Twitter conversations, 7.54 tweets / conv Approximately 7,297,000 tweets 60 node Hadoop system Each node with 8 x 2.30GHz cores JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 103 / 121
  • 164. Result Distributed Online HDP runs faster than online HDP Distributed Online HDP preserves the quality of the result (perplexity) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 104 / 121
  • 165. Practical Tips Unitl now, I talked about Bayesian Nonparametric Topic Modeling Concept of Hierarchical Dirichlet Processes How to infer the latent variables in HDP These are theoretical interests Someone who attended last machine learning winter school said Wow! There are good and interesting machine learning topics! But I want to know about practical issues, because I am in the industrial field. So I prepared some tips for him/her and you JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 105 / 121
  • 168. Implementation https://github.com/NoSyu/Topic_Models JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 106 / 121
  • 169. Some tips for using topic models How to manage hyper-parameters (Dirichlet parameters)? How to manage learning rate and mini-batch size in online learning? JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 107 / 121
  • 171. HDP JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 109 / 121
  • 172. Property of Dirichlet distribution Sample pmfs from Dirichlet distribution [BAFG10] JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 110 / 121
  • 173. Assign Dirichlet parameters In practice, Dirichlet parameters are set to less than 1 People usually use a few topics to write a document People usually do not use all topics Each topic usually uses a few words to represent itself A topic does not use all words We can assign different weights to individual topics/words Some topics are more general than others Some words are more general than others Words with positive/negative meaning appear under positive/negative sentiments [JO11] JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 111 / 121
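A quick way to see the effect of parameters below 1 is to draw a few probability vectors from a symmetric Dirichlet in numpy; the sketch below uses arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10  # number of topics (illustrative)

# alpha < 1: most of the mass concentrates on a few components (sparse topic use)
sparse = rng.dirichlet(np.full(K, 0.1), size=3)
# alpha > 1: mass spreads almost evenly over all components
dense = rng.dirichlet(np.full(K, 10.0), size=3)

print(np.round(sparse, 2))   # rows dominated by one or two entries
print(np.round(dense, 2))    # rows with all entries close to 1/K
```

Asymmetric parameter vectors implement the per-topic/per-word weights mentioned above: giving a general topic or a general word a larger entry makes it more likely to receive mass in every draw.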
  • 177. Some tips for using topic models How to manage hyper-parameters (Dirichlet parameters)? How to manage learning rate and mini-batch size in online learning? JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 112 / 121
  • 178. Compute learning rate ρ_e = (τ_0 + e)^(−κ) where τ_0 > 0, κ ∈ (0.5, 1] a¹_k = (1 − ρ_e) a¹_k + ρ_e (1 + D Σ_{t=1..T} ζ_dtk) a²_k = (1 − ρ_e) a²_k + ρ_e (γ + D Σ_{t=1..T} Σ_{f=k+1..K} ζ_dtf) λ_kv = (1 − ρ_e) λ_kv + ρ_e (η_v + D Σ_{n=1..N} Σ_{t=1..T} w^v_dn φ_dnt ζ_dtk) Meaning of each parameter τ_0: slows down the early iterations of the algorithm κ: rate at which old values of the topic parameters are forgotten So it depends on the dataset Usually, we set τ_0 = 1.0, κ = 0.7 JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 113 / 121
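To get a feel for how τ_0 and κ interact, the short snippet below prints ρ_e at a few iteration counts for some example settings (the settings are illustrative, not recommendations beyond the τ_0 = 1.0, κ = 0.7 default mentioned above).

```python
# rho_e = (tau0 + e)^(-kappa): weight given to the newest mini-batch at update e
settings = [(1.0, 0.7), (1.0, 0.9), (64.0, 0.7)]   # (tau0, kappa), illustrative

for tau0, kappa in settings:
    rhos = [(tau0 + e) ** (-kappa) for e in (1, 10, 100, 1000)]
    print(f"tau0={tau0:5.1f}, kappa={kappa}: "
          + "  ".join(f"{r:.4f}" for r in rhos))

# Larger tau0 shrinks the very first steps; kappa controls how fast the
# step size decays as more mini-batches are processed.
```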
  • 181. Mini-batch size With a larger mini-batch size, distributed online HDP runs faster Perplexity stays similar across mini-batch sizes JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 114 / 121
  • 182. Summary Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes Chinese Restaurant Franchise Stick Breaking Construction Posterior Inference for HDP Gibbs Sampling Variational Inference Online Learning Slides and other materials are uploaded in http://uilab.kaist.ac.kr/members/jinyeongbak Implementations are updated in http://github.com/NoSyu/Topic_Models JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 115 / 121
  • 183. Further Reading Dirichlet Process Dirichlet Process Dirichlet distribution and Dirichlet Process + Indian Buffet Process Bayesian Nonparametric model Machine Learning Summer School - Yee Whye Teh Machine Learning Summer School - Peter Orbanz Introductory article Inference MCMC Variational Inference JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 116 / 121
  • 184. Thank You! JinYeong Bak jy.bak@kaist.ac.kr, linkedin.com/in/jybak Users & Information Lab, KAIST JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 117 / 121
  • 185. References I Charles E Antoniak, Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems, The Annals of Statistics (1974), 1152–1174. Bela A. Frigyik, Amol Kapila, and Maya R. Gupta, Introduction to the Dirichlet distribution and related processes, Tech. Report UWEETR-2010-0006, Department of Electrical Engineering, University of Washington, Seattle, WA 98195, December 2010. Christopher M Bishop and Nasser M Nasrabadi, Pattern recognition and machine learning, vol. 1, Springer, New York, 2006. David M Blei, Andrew Y Ng, and Michael I Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003), 993–1022. Emily B Fox, Erik B Sudderth, Michael I Jordan, and Alan S Willsky, An HDP-HMM for systems with state persistence, Proceedings of the 25th International Conference on Machine Learning, ACM, 2008, pp. 312–319. JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 118 / 121
  • 186. References II Peter D Hoff, A first course in Bayesian statistical methods, Springer, 2009. Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul, An introduction to variational methods for graphical models, Springer, 1998. Yohan Jo and Alice H. Oh, Aspect and sentiment unification model for online review analysis, Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (New York, NY, USA), WSDM '11, ACM, 2011, pp. 815–824. Radford M Neal, Markov chain sampling methods for Dirichlet process mixture models, Journal of Computational and Graphical Statistics 9 (2000), no. 2, 249–265. Yee Whye Teh, Michael I Jordan, Matthew J Beal, and David M Blei, Hierarchical Dirichlet processes, Journal of the American Statistical Association 101 (2006), no. 476. JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 119 / 121
  • 187. References III Chong Wang, John W Paisley, and David M Blei, Online variational inference for the hierarchical Dirichlet process, International Conference on Artificial Intelligence and Statistics, 2011, pp. 752–760. JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 120 / 121
  • 189. Measurable space (Ω,B) Def) A set considered together with a σ-algebra on the set⁶ Ω: the set of all outcomes, the sample space B: σ-algebra over Ω A special kind of collection of subsets of the sample space Ω Closed under complementation: if A ∈ B, then A^C ∈ B Closed under countable unions and intersections: if A, B ∈ B (more generally, any countable family of sets in B), then A ∪ B and A ∩ B are also in B A collection of events Property Smallest possible σ-algebra: {∅, Ω} Largest possible σ-algebra: the power set of Ω For example, with Ω = {1,2,3}, the collection {∅, {1}, {2,3}, Ω} is a σ-algebra. 6 http://mathworld.wolfram.com/MeasurableSpace.html JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 122 / 121
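For completeness, the defining properties of a σ-algebra can be written compactly; this is the standard textbook formulation rather than anything stated on the slide itself.

```latex
\Omega \in \mathcal{B}, \qquad
A \in \mathcal{B} \;\Rightarrow\; A^{\mathsf{c}} \in \mathcal{B}, \qquad
A_1, A_2, \ldots \in \mathcal{B} \;\Rightarrow\; \bigcup_{i=1}^{\infty} A_i \in \mathcal{B}
\quad\text{(and hence, by De Morgan, } \bigcap_{i=1}^{\infty} A_i \in \mathcal{B}\text{).}
```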
  • 191. Proof 1 Decimative property: Let (θ_1, θ_2, ..., θ_K) ∼ Dir(α_1, α_2, ..., α_K) and (τ_1, τ_2) ∼ Dir(α_1 β_1, α_1 β_2) where β_1 + β_2 = 1; then (θ_1 τ_1, θ_1 τ_2, θ_2, ..., θ_K) ∼ Dir(α_1 β_1, α_1 β_2, α_2, ..., α_K) Then (G({θ_1}), G(A_1), ..., G(A_R)) = (β_1, (1 − β_1) G′(A_1), ..., (1 − β_1) G′(A_R)) ∼ Dir(1, α_0 G_0(A_1), ..., α_0 G_0(A_R)) reduces to (G′(A_1), ..., G′(A_R)) ∼ Dir(α_0 G_0(A_1), ..., α_0 G_0(A_R)), i.e. G′ ∼ DP(α_0, G_0), using the decimative property with α_1 = α_0, θ_1 = (1 − β_1), β_k = G_0(A_k), τ_k = G′(A_k) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 123 / 121