2. Outline
Bayes’ Rule
Parametric Bayesian Learning
Concept & Example
Discrete & Continuous Data
Text Clustering & Topic Modeling
Pros and Cons
Some Important Concepts
Non-parametric Bayesian Learning
Dirichlet Process and Process Construction
Dirichlet Process Mixture
Hierarchical Dirichlet Process
Chinese Restaurant Process
5/10/2016 Yueshen Xu Middleware, CCNT, ZJU
Example: Hierarchical Topic Modeling
Markov Chain Monte Carlo
Reference
Discussion
3. Bayes’ Rule
Posterior = Prior * Likelihood
p(Hypothesis | Data) = p(Data | Hypothesis) · p(Hypothesis) / p(Data)
where p(Hypothesis | Data) is the posterior, p(Data | Hypothesis) the likelihood, p(Hypothesis) the prior, and p(Data) the evidence
Update beliefs in hypotheses in response to data
Parametric or non-parametric: whether the structure of the hypothesis space is constrained or not
Examples come later
The prior encodes your confidence before seeing the data
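A minimal sketch of the update rule above. The two hypotheses and all the numbers are illustrative, not from the slides:

```python
# Bayes' rule: Posterior = Prior * Likelihood / Evidence,
# for two made-up hypotheses H1 and H2.
priors = {"H1": 0.5, "H2": 0.5}        # p(Hypothesis): belief before data
likelihoods = {"H1": 0.8, "H2": 0.2}   # p(Data | Hypothesis)

# Evidence p(Data) = sum over hypotheses of likelihood * prior
evidence = sum(likelihoods[h] * priors[h] for h in priors)

# Posterior belief in each hypothesis after seeing the data
posteriors = {h: likelihoods[h] * priors[h] / evidence for h in priors}
print(posteriors)  # {'H1': 0.8, 'H2': 0.2}
```

Note the evidence is a constant across hypotheses, which is why it can be dropped under proportionality.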
4. Parametric Bayesian Learning
𝑝 𝐻𝑦𝑝𝑜𝑡ℎ𝑒𝑠𝑖𝑠 𝐷𝑎𝑡𝑎 ∝ 𝑝 𝐷𝑎𝑡𝑎 𝐻𝑦𝑝𝑜𝑡ℎ𝑒𝑠𝑖𝑠 𝑝(𝐻𝑦𝑝𝑜𝑡ℎ𝑒𝑠𝑖𝑠)
Parametric or non-parametric hypothesis
Evidence: the data is an observed fact, so p(Data) is a constant, not a probability over hypotheses; dropping it is a commonly used trick
Non-parametric != No parameters
Hyper-parameters
• Parameters of distributions
• Parameter vs. Variable
Dir(θ | α) = (Γ(α₀) / (Γ(α₁)⋯Γ(α_K))) ∏_{k=1}^{K} θ_k^{α_k − 1}
Variable: θ; hyper-parameters: α = (α₁, …, α_K), with α₀ = Σ_k α_k; θ is in turn the parameter of the multinomial
p(θ|X) ∝ p(X|θ)p(θ)
6. Parametric Bayesian Learning
Serious Problems
How could we know
the number of clusters?
the number of topics?
the number of layers?
Heuristic pre-processing?
Guessing and Tuning
7. Parametric Bayesian Learning
Some basics
Discrete Data & Continuous Data
Discrete data: text, modeled as natural numbers (counts)
Continuous data: stock prices, trading records, signals, quality scores, ratings, modeled as real numbers
Some important concepts (Also used in non-parametric case)
Discrete distribution: 𝑋𝑖|𝜃~𝐷𝑖𝑠𝑐𝑟𝑒𝑡𝑒(𝜃)
p(X | θ) = ∏_{i=1}^{n} Discrete(X_i; θ) = ∏_{j=1}^{m} θ_j^{N_j}
Multinomial distribution: 𝑁|𝑛, 𝜃~𝑀𝑢𝑙𝑡𝑖(𝜃, 𝑛)
p(N | n, θ) = (n! / ∏_{j=1}^{m} N_j!) ∏_{j=1}^{m} θ_j^{N_j}
Computer scientists often mix them up
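The distinction can be made concrete in code: a discrete/categorical distribution generates individual symbols, while the multinomial governs the count vector of n such draws, adding the combinatorial factor. The vocabulary and θ below are illustrative:

```python
import random
from collections import Counter
from math import factorial

random.seed(0)
theta = {"a": 0.5, "b": 0.3, "c": 0.2}  # parameter of the categorical

# n i.i.d. draws X_i ~ Discrete(theta)
n = 10
draws = random.choices(list(theta), weights=list(theta.values()), k=n)
counts = Counter(draws)                  # the count vector N ~ Multi(theta, n)

# Discrete/categorical likelihood of the sequence: prod_j theta_j^{N_j}
seq_lik = 1.0
for word, c in counts.items():
    seq_lik *= theta[word] ** c

# Multinomial likelihood of the counts: same product times n! / prod_j N_j!
coef = factorial(n)
for c in counts.values():
    coef //= factorial(c)               # exact integer division
multi_lik = coef * seq_lik
```

The multinomial likelihood is larger because many sequences share the same count vector.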
8. Parametric Bayesian Learning
Some important concepts (cont.)
Dirichlet distribution: θ | α ~ Dir(α)
Dir(θ | α) = (Γ(α₀) / (Γ(α₁)⋯Γ(α_K))) ∏_{k=1}^{K} θ_k^{α_k − 1}, where α₀ = Σ_{k=1}^{K} α_k
Conjugate prior
If the posterior p(θ|X) is in the same family as the prior p(θ), then the prior is called a conjugate prior of the likelihood p(X|θ)
Examples
Binomial Distribution ←→ Beta Distribution
Multinomial Distribution ←→ Dirichlet Distribution
p(θ | N, α) = Dir(θ | N + α) = (Γ(α₀ + n) / (Γ(α₁ + N₁)⋯Γ(α_K + N_K))) ∏_{k=1}^{K} θ_k^{α_k + N_k − 1} ∝ p(θ | α) p(N | θ)
Why is it better for the prior and the posterior to be conjugate distributions?
…
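One practical answer: with conjugacy, the posterior update is just adding counts to pseudo-counts, as Dir(θ | N + α) above shows. A tiny sketch with illustrative numbers:

```python
# Dirichlet-multinomial conjugacy: posterior = Dirichlet with updated counts.
alpha = [1.0, 2.0, 3.0]   # hyper-parameters of the Dir(alpha) prior
N = [10, 0, 5]            # observed category counts (multinomial data)

# Posterior parameters: Dir(theta | N + alpha) -- no integration needed
posterior = [a + n for a, n in zip(alpha, N)]
print(posterior)          # [11.0, 2.0, 8.0]

# Posterior mean of theta_k is (alpha_k + N_k) / (alpha_0 + n)
total = sum(posterior)
post_mean = [p / total for p in posterior]
```

Without conjugacy the posterior would generally have no closed form and would require approximate inference.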
9. Parametric Bayesian Learning
Some important concepts (cont.)
Probabilistic Graphical Model
Modeling Bayesian Network using plates and circles
Generative model vs. discriminative model: two routes to p(θ|X)
Generative model: p(θ|X) ∝ p(X|θ)p(θ)
Naïve Bayes, GMM, pLSA, LDA, HMM, HDP, … (mostly unsupervised learning)
Discriminative model: models p(θ|X) directly
LR, KNN, SVM, Boosting, Decision Tree, … (supervised learning)
These also have graphical model representations
10. Non-parametric Bayesian Learning
When we talk about non-parametric, what do we usually talk
about?
Discrete Data: Dirichlet Distribution, Dirichlet Process, Chinese
Restaurant Process, Polya Urn, Pitman-Yor Process, Hierarchical
Dirichlet Process, Dirichlet Process Mixture, Dirichlet Process
Multinomial Model, Clustering, …
Continuous Data: Gaussian Distribution, Gaussian Process,
Regression, Classification, Factorization, Gradient Descent,
Covariance Matrix… Brownian Motion
Infinite (∞)
11. Non-parametric Bayesian Learning
Dirichlet Process [Yee Whye Teh, et al.]
G₀: a probability measure/distribution (the base distribution); α₀: a positive real number; (A₁, A₂, …, A_r): any finite partition of the space; G: a probability distribution. Iff
(G(A₁), …, G(A_r)) ~ Dir(α₀G₀(A₁), …, α₀G₀(A_r))
holds for every such partition, then G ~ DP(α₀, G₀)
G₀: which exact distribution is G₀? We don't know
G: which exact distribution is G? We don't know
12. Non-parametric Bayesian Learning
Where is the infinity? Construction of a DP
We need to construct a DP, since the definition does not give one explicitly
Constructions: stick-breaking, Pólya urn scheme, Chinese restaurant process
Stick-breaking construction
(β_k)_{k=1}^{∞}, (ϕ_k)_{k=1}^{∞}: i.i.d. sequences
β_k | α₀ ~ Beta(1, α₀)
ϕ_k | G₀ ~ G₀
π_k = β_k ∏_{l=1}^{k−1} (1 − β_l), with Σ_{k=1}^{∞} π_k = 1
G = Σ_{k=1}^{∞} π_k δ_{ϕ_k}
π_k is the probability of ϕ_k, so (π_k) is a distribution over the positive integers
Why DP? …
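The construction above can be simulated by truncating the infinite sum. The truncation level K = 100 and the choice of a standard normal as G₀ are assumptions for illustration only:

```python
import random

random.seed(1)
alpha0 = 2.0
K = 100                                  # truncation of the infinite sum

remaining = 1.0                          # length of stick not yet broken off
pi, phi = [], []
for _ in range(K):
    beta_k = random.betavariate(1.0, alpha0)   # beta_k ~ Beta(1, alpha0)
    pi.append(remaining * beta_k)        # pi_k = beta_k * prod_{l<k}(1 - beta_l)
    remaining *= (1.0 - beta_k)
    phi.append(random.gauss(0.0, 1.0))   # phi_k ~ G0 (assumed standard normal)

# G is the discrete measure sum_k pi_k * delta_{phi_k};
# for large K the weights pi sum to nearly 1
print(sum(pi))
```

Even though G₀ is continuous, G is almost surely discrete: that is where the clustering behavior comes from.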
13. Non-parametric Bayesian Learning
Chinese Restaurant Process
A restaurant has an infinite number of tables, and customers (words, generated from θ_i, one-to-one) enter sequentially. The i-th customer (θ_i) sits at an occupied table (ϕ_k) with probability n_k/(i − 1 + α₀), where n_k is the number of customers already at that table, or opens a new table with probability α₀/(i − 1 + α₀)
ϕ_k defines a clustering, and clustering is roughly two thirds of unsupervised learning: clustering, topic modeling (two-layer clustering), hierarchical concept building, collaborative filtering, similarity computation, …
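The seating rule is easy to simulate; the number of occupied tables grows without a pre-set bound. α₀ and the number of customers are illustrative:

```python
import random

random.seed(42)
alpha0 = 1.0
tables = []                     # tables[k] = number of customers at table k

for i in range(1, 101):         # 100 customers arrive sequentially
    # existing tables weighted by occupancy n_k, plus a "new table" option
    # weighted by alpha0; random.choices normalizes by (i - 1 + alpha0)
    weights = tables + [alpha0]
    k = random.choices(range(len(weights)), weights=weights)[0]
    if k == len(tables):
        tables.append(1)        # open a new table
    else:
        tables[k] += 1          # join an existing table

print(sum(tables))  # 100 customers seated
```

Rich-get-richer: crowded tables attract more customers, so a few tables dominate while new ones keep appearing at a slowing rate.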
14. Non-parametric Bayesian Learning
Dirichlet Process Mixture (DPM)
DP alone is not enough: we need similarity instead of cloning → mixture models
(You can draw the graphical model yourself)
Mixture Models: an element is generated from a mixture/group of
variables (usually latent variables) ∶ GMM, LDA, pLSA…
DPM: 𝜃𝑖|𝐺~𝐺, 𝑥𝑖|𝜃𝑖~𝐹(𝜃𝑖) For text data, 𝐹(𝜃𝑖) is Discrete/Multinomial
Intuitive but not helpful
Construction
β_k | α₀ ~ Beta(1, α₀), ϕ_k | G₀ ~ G₀
π_k = β_k ∏_{l=1}^{k−1} (1 − β_l), G = Σ_{k=1}^{∞} π_k δ_{ϕ_k}
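A sketch of the DPM generative step θ_i | G ~ G, x_i | θ_i ~ F(θ_i). Here G is a tiny hand-made discrete measure standing in for a DP draw, and F is assumed Gaussian for illustration; for text, F would be Discrete/Multinomial:

```python
import random

random.seed(7)
pi = [0.5, 0.3, 0.2]            # weights of G (illustrative)
phi = [-5.0, 0.0, 5.0]          # atoms of G: the cluster parameters

data = []
for _ in range(50):
    theta_i = random.choices(phi, weights=pi)[0]   # theta_i | G ~ G
    data.append(random.gauss(theta_i, 1.0))        # x_i | theta_i ~ F(theta_i)
```

Because G is discrete, many θ_i coincide exactly, so the x_i form clusters around the atoms: similarity via F on top of the cloning that G provides.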
15. Non-parametric Bayesian Learning
Dirichlet Process Mixture (DPM)
Finite case: Dirichlet Multinomial Mixture Model
What can DMMM do?
(0,0,0,Caption,0,0,0,0,0,0,USA,0,0,0,0,0,0,0,0,0,Action,0,0,0,0,0,0,0,Hero,0,0,0,0,0,0,…)
Clustering
16. Non-parametric Bayesian Learning
Hierarchical Dirichlet Process (HDP)
Construction
HDP: 𝜃𝑗𝑖|𝐺~𝐺, 𝑥𝑗𝑖|𝜃𝑗𝑖~𝐹(𝜃𝑗𝑖)
LDA
A very natural model for the statistics folks, but for us computer folks… hehe…
Finite case (F: Mult): LDA, i.e., a hierarchical Dirichlet multinomial mixture model
17. Non-parametric Bayesian Learning
Hierarchical Topic Modeling
What can we get from reviews, blogs, question answers, tweets, news, …? Only topics? Far from enough
What we really need is a hierarchy that illustrates what exactly the text tells people, like
18. Non-parametric Bayesian Learning
Hierarchical Topic Modeling
Prior: Nested CRP/DP (nCRP) [Blei and Jordan, NIPS, 04]
nCRP: at the 1st level there is one restaurant with one table, which is linked to an infinite number of tables at the 2nd level. Each table at the 2nd level is in turn linked to an infinite number of tables at the 3rd level. This structure repeats...
CRP is the prior for choosing a table, and the chosen tables form a path
One document, one path (like a Matryoshka doll)
19. Non-parametric Bayesian Learning
Hierarchical Topic Modeling
Generative Process
1. Let c₁ be the root restaurant (only one table)
2. For each level l ∈ {2, …, L}: draw a table from restaurant c_{l−1} using the CRP, and set c_l to be the restaurant referred to by that table
3. Draw an L-dimensional topic proportion vector θ ~ Dir(α)
4. For each word w_n: draw z ∈ {1, …, L} ~ Mult(θ), then draw w_n from the topic associated with restaurant c_z
[Graphical model: plates M (documents), N (words), T; nodes α, γ, β_k, path restaurants c₁, c₂, …, c_L, level indicator z_{m,n}, word w_{m,n}]
L can be infinite, but it need not be
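Step 2 of the generative process (one path per document, with table counts shared across documents) can be sketched as follows. The depth L, α₀, and the tuple-based table labels are assumptions for illustration:

```python
import random

random.seed(3)
alpha0, L = 1.0, 3
counts = {}                      # counts[path] = customers at table 'path'

def draw_path():
    """Draw one root-to-leaf path of tables with a CRP at each level."""
    path = ("root",)
    for _level in range(1, L):
        # tables one level below the current restaurant
        children = [p for p in counts if p[:-1] == path]
        # CRP: existing tables weighted by occupancy, new table by alpha0
        weights = [counts[p] for p in children] + [alpha0]
        k = random.choices(range(len(weights)), weights=weights)[0]
        nxt = path + (len(children),) if k == len(children) else children[k]
        counts[nxt] = counts.get(nxt, 0) + 1
        path = nxt
    return path

paths = [draw_path() for _ in range(20)]   # 20 documents, one path each
```

Popular subtrees attract more documents, which is how the nCRP infers a topic hierarchy rather than a flat set of topics.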
23. Reference
• Yee Whye Teh. Dirichlet Processes: Tutorial and Practical Course, 2007
• Yee Whye Teh, Michael I. Jordan, et al. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 2006
• David Blei. Probabilistic Topic Models. Communications of the ACM, 2012
• David Blei, et al. Latent Dirichlet Allocation. JMLR, 2003
• David Blei, et al. The Nested Chinese Restaurant Process and Bayesian Inference of Topic Hierarchies. Journal of the ACM, 2010
• Gregor Heinrich. Parameter Estimation for Text Analysis, 2008
• T. S. Ferguson. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1973
• Martin J. Wainwright. Graphical Models, Exponential Families, and Variational Inference
• Rick Durrett. Probability: Theory and Examples, 2010
• Christopher Bishop. Pattern Recognition and Machine Learning, 2007
• Vasilis Vryniotis. DatumBox: The Dirichlet Process Mixture Model, 2014
• David P. Williams. Gaussian Processes, Duke University, 2006