2. Outline
Bayes’ Rule
Parametric Bayesian Learning
Concept & Example
Discrete & Continuous Data
Text Clustering & Topic Modeling
Pros and Cons
Some Important Concepts
Non-parametric Bayesian Learning
Dirichlet Process and Process Construction
Dirichlet Process Mixture
Hierarchical Dirichlet Process
Chinese Restaurant Process
5/10/2016 Yueshen Xu Middleware, CCNT, ZJU
Example: Hierarchical Topic Modeling
Markov Chain Monte Carlo
Reference
Discussion
3. Bayes’ Rule
Posterior = Prior * Likelihood
p(Hypothesis | Data) = p(Data | Hypothesis) · p(Hypothesis) / p(Data)
where p(Hypothesis | Data) is the posterior, p(Data | Hypothesis) the likelihood, p(Hypothesis) the prior, and p(Data) the evidence
Update beliefs in hypotheses in response to data
Parametric or non-parametric: whether the structure of the hypothesis space is constrained or not
Examples come later
The prior encodes your confidence before seeing the data
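A minimal sketch of the update rule above. The two hypotheses and all the numbers are illustrative, not from the slides:

```python
# Bayes' rule: Posterior = Prior * Likelihood / Evidence,
# for two made-up hypotheses H1 and H2.
priors = {"H1": 0.5, "H2": 0.5}        # p(Hypothesis): belief before data
likelihoods = {"H1": 0.8, "H2": 0.2}   # p(Data | Hypothesis)

# Evidence p(Data) = sum over hypotheses of likelihood * prior
evidence = sum(likelihoods[h] * priors[h] for h in priors)

# Posterior belief in each hypothesis after seeing the data
posteriors = {h: likelihoods[h] * priors[h] / evidence for h in priors}
print(posteriors)  # {'H1': 0.8, 'H2': 0.2}
```

Note the evidence is a constant across hypotheses, which is why it can be dropped under proportionality.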
4. Parametric Bayesian Learning
𝑝 𝐻𝑦𝑝𝑜𝑡ℎ𝑒𝑠𝑖𝑠 𝐷𝑎𝑡𝑎 ∝ 𝑝 𝐷𝑎𝑡𝑎 𝐻𝑦𝑝𝑜𝑡ℎ𝑒𝑠𝑖𝑠 𝑝(𝐻𝑦𝑝𝑜𝑡ℎ𝑒𝑠𝑖𝑠)
Parametric or non-parametric hypothesis
Evidence: the data is an observed fact, so p(Data) is a constant, not a probability over hypotheses; dropping it is a commonly used trick
Non-parametric != No parameters
Hyper-parameters
• Parameters of distributions
• Parameter vs. Variable
Dir(θ | α) = (Γ(α₀) / (Γ(α₁)⋯Γ(α_K))) ∏_{k=1}^{K} θ_k^{α_k − 1}
Variable: θ; hyper-parameters: α = (α₁, …, α_K), with α₀ = Σ_k α_k; θ is in turn the parameter of the multinomial
p(θ|X) ∝ p(X|θ)p(θ)
6. Parametric Bayesian Learning
Serious Problems
How could we know
the number of clusters?
the number of topics?
the number of layers?
Heuristic pre-processing?
Guessing and Tuning
7. Parametric Bayesian Learning
Some basics
Discrete Data & Continuous Data
Discrete data: text, modeled as natural numbers (counts)
Continuous data: stock prices, trading records, signals, quality scores, ratings, modeled as real numbers
Some important concepts (Also used in non-parametric case)
Discrete distribution: 𝑋𝑖|𝜃~𝐷𝑖𝑠𝑐𝑟𝑒𝑡𝑒(𝜃)
p(X | θ) = ∏_{i=1}^{n} Discrete(X_i; θ) = ∏_{j=1}^{m} θ_j^{N_j}
Multinomial distribution: 𝑁|𝑛, 𝜃~𝑀𝑢𝑙𝑡𝑖(𝜃, 𝑛)
p(N | n, θ) = (n! / ∏_{j=1}^{m} N_j!) ∏_{j=1}^{m} θ_j^{N_j}
Computer scientists often mix them up
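The distinction can be made concrete in code: a discrete/categorical distribution generates individual symbols, while the multinomial governs the count vector of n such draws, adding the combinatorial factor. The vocabulary and θ below are illustrative:

```python
import random
from collections import Counter
from math import factorial

random.seed(0)
theta = {"a": 0.5, "b": 0.3, "c": 0.2}  # parameter of the categorical

# n i.i.d. draws X_i ~ Discrete(theta)
n = 10
draws = random.choices(list(theta), weights=list(theta.values()), k=n)
counts = Counter(draws)                  # the count vector N ~ Multi(theta, n)

# Discrete/categorical likelihood of the sequence: prod_j theta_j^{N_j}
seq_lik = 1.0
for word, c in counts.items():
    seq_lik *= theta[word] ** c

# Multinomial likelihood of the counts: same product times n! / prod_j N_j!
coef = factorial(n)
for c in counts.values():
    coef //= factorial(c)               # exact integer division
multi_lik = coef * seq_lik
```

The multinomial likelihood is larger because many sequences share the same count vector.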
8. Parametric Bayesian Learning
Some important concepts (cont.)
Dirichlet distribution: θ | α ~ Dir(α)
Dir(θ | α) = (Γ(α₀) / (Γ(α₁)⋯Γ(α_K))) ∏_{k=1}^{K} θ_k^{α_k − 1}, where α₀ = Σ_{k=1}^{K} α_k
Conjugate prior
If the posterior p(θ|X) is in the same family as the prior p(θ), then the prior is called a conjugate prior of the likelihood p(X|θ)
Examples
Binomial Distribution ←→ Beta Distribution
Multinomial Distribution ←→ Dirichlet Distribution
p(θ | N, α) = Dir(θ | N + α) = (Γ(α₀ + n) / (Γ(α₁ + N₁)⋯Γ(α_K + N_K))) ∏_{k=1}^{K} θ_k^{α_k + N_k − 1} ∝ p(θ | α) p(N | θ)
Why is it better for the prior and the posterior to be conjugate distributions?
…
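One practical answer: with conjugacy, the posterior update is just adding counts to pseudo-counts, as Dir(θ | N + α) above shows. A tiny sketch with illustrative numbers:

```python
# Dirichlet-multinomial conjugacy: posterior = Dirichlet with updated counts.
alpha = [1.0, 2.0, 3.0]   # hyper-parameters of the Dir(alpha) prior
N = [10, 0, 5]            # observed category counts (multinomial data)

# Posterior parameters: Dir(theta | N + alpha) -- no integration needed
posterior = [a + n for a, n in zip(alpha, N)]
print(posterior)          # [11.0, 2.0, 8.0]

# Posterior mean of theta_k is (alpha_k + N_k) / (alpha_0 + n)
total = sum(posterior)
post_mean = [p / total for p in posterior]
```

Without conjugacy the posterior would generally have no closed form and would require approximate inference.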
9. Parametric Bayesian Learning
Some important concepts (cont.)
Probabilistic Graphical Model
Modeling Bayesian Network using plates and circles
Generative model vs. discriminative model: two routes to p(θ|X)
Generative model: p(θ|X) ∝ p(X|θ)p(θ)
Naïve Bayes, GMM, pLSA, LDA, HMM, HDP, … (mostly unsupervised learning)
Discriminative model: models p(θ|X) directly
LR, KNN, SVM, Boosting, Decision Tree, … (supervised learning)
These also have graphical model representations
10. Non-parametric Bayesian Learning
When we talk about non-parametric, what do we usually talk
about?
Discrete Data: Dirichlet Distribution, Dirichlet Process, Chinese
Restaurant Process, Polya Urn, Pitman-Yor Process, Hierarchical
Dirichlet Process, Dirichlet Process Mixture, Dirichlet Process
Multinomial Model, Clustering, …
Continuous Data: Gaussian Distribution, Gaussian Process,
Regression, Classification, Factorization, Gradient Descent,
Covariance Matrix… Brownian Motion
Infinite (∞)
11. Non-parametric Bayesian Learning
Dirichlet Process [Yee Whye Teh, et al.]
G₀: a probability measure/distribution (the base distribution); α₀: a positive real number; (A₁, A₂, …, A_r): any finite partition of the space; G: a probability distribution. Iff
(G(A₁), …, G(A_r)) ~ Dir(α₀G₀(A₁), …, α₀G₀(A_r))
holds for every such partition, then G ~ DP(α₀, G₀)
G₀: which exact distribution is G₀? We don't know
G: which exact distribution is G? We don't know
12. Non-parametric Bayesian Learning
Where is the infinity? Construction of a DP
We need to construct a DP, since the definition does not give one explicitly
Constructions: stick-breaking, Pólya urn scheme, Chinese restaurant process
Stick-breaking construction
(β_k)_{k=1}^{∞}, (ϕ_k)_{k=1}^{∞}: i.i.d. sequences
β_k | α₀ ~ Beta(1, α₀)
ϕ_k | G₀ ~ G₀
π_k = β_k ∏_{l=1}^{k−1} (1 − β_l), with Σ_{k=1}^{∞} π_k = 1
G = Σ_{k=1}^{∞} π_k δ_{ϕ_k}
π_k is the probability of ϕ_k, so (π_k) is a distribution over the positive integers
Why DP? …
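The construction above can be simulated by truncating the infinite sum. The truncation level K = 100 and the choice of a standard normal as G₀ are assumptions for illustration only:

```python
import random

random.seed(1)
alpha0 = 2.0
K = 100                                  # truncation of the infinite sum

remaining = 1.0                          # length of stick not yet broken off
pi, phi = [], []
for _ in range(K):
    beta_k = random.betavariate(1.0, alpha0)   # beta_k ~ Beta(1, alpha0)
    pi.append(remaining * beta_k)        # pi_k = beta_k * prod_{l<k}(1 - beta_l)
    remaining *= (1.0 - beta_k)
    phi.append(random.gauss(0.0, 1.0))   # phi_k ~ G0 (assumed standard normal)

# G is the discrete measure sum_k pi_k * delta_{phi_k};
# for large K the weights pi sum to nearly 1
print(sum(pi))
```

Even though G₀ is continuous, G is almost surely discrete: that is where the clustering behavior comes from.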
13. Non-parametric Bayesian Learning
Chinese Restaurant Process
A restaurant has an infinite number of tables, and customers (words, generated from θ_i, one-to-one) enter sequentially. The i-th customer (θ_i) sits at an occupied table (ϕ_k) with probability n_k/(i − 1 + α₀), where n_k is the number of customers already at that table, or opens a new table with probability α₀/(i − 1 + α₀)
ϕ_k defines a clustering, and clustering is roughly two thirds of unsupervised learning: clustering, topic modeling (two-layer clustering), hierarchical concept building, collaborative filtering, similarity computation, …
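The seating rule is easy to simulate; the number of occupied tables grows without a pre-set bound. α₀ and the number of customers are illustrative:

```python
import random

random.seed(42)
alpha0 = 1.0
tables = []                     # tables[k] = number of customers at table k

for i in range(1, 101):         # 100 customers arrive sequentially
    # existing tables weighted by occupancy n_k, plus a "new table" option
    # weighted by alpha0; random.choices normalizes by (i - 1 + alpha0)
    weights = tables + [alpha0]
    k = random.choices(range(len(weights)), weights=weights)[0]
    if k == len(tables):
        tables.append(1)        # open a new table
    else:
        tables[k] += 1          # join an existing table

print(sum(tables))  # 100 customers seated
```

Rich-get-richer: crowded tables attract more customers, so a few tables dominate while new ones keep appearing at a slowing rate.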
14. Non-parametric Bayesian Learning
Dirichlet Process Mixture (DPM)
DP alone is not enough: we need similarity instead of cloning → mixture models
(You can draw the graphical model yourself)
Mixture Models: an element is generated from a mixture/group of
variables (usually latent variables) ∶ GMM, LDA, pLSA…
DPM: 𝜃𝑖|𝐺~𝐺, 𝑥𝑖|𝜃𝑖~𝐹(𝜃𝑖) For text data, 𝐹(𝜃𝑖) is Discrete/Multinomial
Intuitive but not helpful
Construction
β_k | α₀ ~ Beta(1, α₀), ϕ_k | G₀ ~ G₀
π_k = β_k ∏_{l=1}^{k−1} (1 − β_l), G = Σ_{k=1}^{∞} π_k δ_{ϕ_k}
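A sketch of the DPM generative step θ_i | G ~ G, x_i | θ_i ~ F(θ_i). Here G is a tiny hand-made discrete measure standing in for a DP draw, and F is assumed Gaussian for illustration; for text, F would be Discrete/Multinomial:

```python
import random

random.seed(7)
pi = [0.5, 0.3, 0.2]            # weights of G (illustrative)
phi = [-5.0, 0.0, 5.0]          # atoms of G: the cluster parameters

data = []
for _ in range(50):
    theta_i = random.choices(phi, weights=pi)[0]   # theta_i | G ~ G
    data.append(random.gauss(theta_i, 1.0))        # x_i | theta_i ~ F(theta_i)
```

Because G is discrete, many θ_i coincide exactly, so the x_i form clusters around the atoms: similarity via F on top of the cloning that G provides.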
15. Non-parametric Bayesian Learning
Dirichlet Process Mixture (DPM)
Finite case: Dirichlet Multinomial Mixture Model
What can DMMM do?
(0,0,0,Caption,0,0,0,0,0,0,USA,0,0,0,0,0,0,0,0,0,Action,0,0,0,0,0,0,0,Hero,0,0,0,0,0,0,…)
Clustering
16. Non-parametric Bayesian Learning
Hierarchical Dirichlet Process (HDP)
Construction
HDP: 𝜃𝑗𝑖|𝐺~𝐺, 𝑥𝑗𝑖|𝜃𝑗𝑖~𝐹(𝜃𝑗𝑖)
LDA
A very natural model for the statistics folks, but for us computer folks… hehe…
Finite case (F: Mult): LDA, i.e., a hierarchical Dirichlet multinomial mixture model
17. Non-parametric Bayesian Learning
Hierarchical Topic Modeling
What can we get from reviews, blogs, question answers, tweets, news, …? Only topics? Far from enough
What we really need is a hierarchy that illustrates what exactly the text tells people, like
18. Non-parametric Bayesian Learning
Hierarchical Topic Modeling
Prior: Nested CRP/DP (nCRP) [Blei and Jordan, NIPS, 04]
nCRP: at the 1st level there is one restaurant with one table, which is linked to an infinite number of tables at the 2nd level. Each table at the 2nd level is in turn linked to an infinite number of tables at the 3rd level. This structure repeats...
CRP is the prior for choosing a table, and the chosen tables form a path
One document, one path (like a Matryoshka doll)
19. Non-parametric Bayesian Learning
Hierarchical Topic Modeling
Generative Process
1. Let c₁ be the root restaurant (only one table)
2. For each level l ∈ {2, …, L}: draw a table from restaurant c_{l−1} using the CRP, and set c_l to be the restaurant referred to by that table
3. Draw an L-dimensional topic proportion vector θ ~ Dir(α)
4. For each word w_n: draw z ∈ {1, …, L} ~ Mult(θ), then draw w_n from the topic associated with restaurant c_z
[Graphical model: plates M (documents), N (words), T; nodes α, γ, β_k, path restaurants c₁, c₂, …, c_L, level indicator z_{m,n}, word w_{m,n}]
L can be infinite, but it need not be
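Step 2 of the generative process (one path per document, with table counts shared across documents) can be sketched as follows. The depth L, α₀, and the tuple-based table labels are assumptions for illustration:

```python
import random

random.seed(3)
alpha0, L = 1.0, 3
counts = {}                      # counts[path] = customers at table 'path'

def draw_path():
    """Draw one root-to-leaf path of tables with a CRP at each level."""
    path = ("root",)
    for _level in range(1, L):
        # tables one level below the current restaurant
        children = [p for p in counts if p[:-1] == path]
        # CRP: existing tables weighted by occupancy, new table by alpha0
        weights = [counts[p] for p in children] + [alpha0]
        k = random.choices(range(len(weights)), weights=weights)[0]
        nxt = path + (len(children),) if k == len(children) else children[k]
        counts[nxt] = counts.get(nxt, 0) + 1
        path = nxt
    return path

paths = [draw_path() for _ in range(20)]   # 20 documents, one path each
```

Popular subtrees attract more documents, which is how the nCRP infers a topic hierarchy rather than a flat set of topics.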
23. Reference
• Yee Whye Teh. Dirichlet Processes: Tutorial and Practical Course, 2007
• Yee Whye Teh, Michael I. Jordan, et al. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 2006
• David Blei. Probabilistic Topic Models. Communications of the ACM, 2012
• David Blei, et al. Latent Dirichlet Allocation. JMLR, 2003
• David Blei, et al. The Nested Chinese Restaurant Process and Bayesian Inference of Topic Hierarchies. Journal of the ACM, 2010
• Gregor Heinrich. Parameter Estimation for Text Analysis, 2008
• T. S. Ferguson. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1973
• Martin J. Wainwright. Graphical Models, Exponential Families, and Variational Inference
• Rick Durrett. Probability: Theory and Examples, 2010
• Christopher Bishop. Pattern Recognition and Machine Learning, 2007
• Vasilis Vryniotis. DatumBox: The Dirichlet Process Mixture Model, 2014
• David P. Williams. Gaussian Processes, Duke University, 2006