1. Comparing estimation algorithms for block clustering models
Gilles Celeux
Projet SELECT INRIA Saclay-Île-de-France
January 6, 2011 - BIG’MC seminar
2. Block clustering setting
Block clustering of (binary) data
Let y = {y_ij ; i ∈ I, j ∈ J} be an n × d binary matrix, where I is a set of n objects and J a set of d variables.
Permuting the rows and columns of y to discover a clustering structure on I × J.
Getting a simple summary of the data matrix y.
Many applications: recommendation systems, genomic data analysis, text mining, archeology, ...
3. Example
[Figure: a 10 × 7 binary data matrix (rows A-J, columns 1-7) shown in four panels, with row classes I-III and column classes a-c in the summary panel]
(1) Binary data matrix
(2) A partition on I
(3) A couple of partitions on I and J
(4) Summary of the binary matrix
4. Model-based clustering framework
Assume that the data arise from a finite mixture of parametrised densities.
A cluster is made of observations arising from the same density.
In a block clustering model, clusters are defined on blocks of I × J.
In a block clustering model, the data of a block are modelled by the same univariate density.
5. Latent block mixture model
The density of the observed data is supposed to be

f(y \mid g, m, \phi, \alpha) = \sum_{u \in U} p(u \mid g, m, \phi) \, f(y \mid g, m, u, \alpha)

where u is the block indicator vector.
It is assumed that u_{ijkl} = z_{ik} w_{jl}, z (resp. w) being the row (resp. column) cluster indicator vector.
Assuming that the n × d variables Y_ij are conditionally independent knowing z and w leads to the model

f(y \mid g, m, \pi, \rho, \alpha) = \sum_{(z,w) \in Z \times W} \prod_{i,k} \pi_k^{z_{ik}} \prod_{j,l} \rho_l^{w_{jl}} \prod_{i,j,k,l} \varphi(y_{ij} \mid \alpha_{kl})^{z_{ik} w_{jl}}
6. An example: Bernoulli latent block model
Mixing proportions
For fixed g, the mixing proportions for the rows are π1, . . . , πg.
For fixed m, the mixing proportions for the columns are ρ1, . . . , ρm.
The Bernoulli density per block

\varphi(y_{ij}; \alpha_{kl}) = (\alpha_{kl})^{y_{ij}} (1 - \alpha_{kl})^{1 - y_{ij}}

where \alpha_{kl} \in (0, 1).
The mixture density is

f(y \mid g, m, \pi, \rho, \alpha) = \sum_{(z,w) \in Z \times W} \prod_{i,k} \pi_k^{z_{ik}} \prod_{j,l} \rho_l^{w_{jl}} \prod_{i,j,k,l} \left[ (\alpha_{kl})^{y_{ij}} (1 - \alpha_{kl})^{1 - y_{ij}} \right]^{z_{ik} w_{jl}}.

The parameters to be estimated are the πs, the ρs and the αs.
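[Editorial note] To make the model concrete, here is a minimal NumPy sketch of the complete-data loglikelihood log p(y, z, w | θ) of the Bernoulli latent block model. It is our own illustration, not code from the talk; z and w are one-hot indicator matrices.

import numpy as np

def complete_loglik(y, z, w, pi, rho, alpha):
    # y: (n, d) binary matrix; z: (n, g) and w: (d, m) one-hot indicators;
    # pi: (g,) and rho: (m,) mixing proportions; alpha: (g, m) block parameters.
    ll = z.sum(axis=0) @ np.log(pi) + w.sum(axis=0) @ np.log(rho)
    ones = z.T @ y @ w                              # count of ones per block (g, m)
    size = np.outer(z.sum(axis=0), w.sum(axis=0))   # block sizes (g, m)
    return ll + np.sum(ones * np.log(alpha) + (size - ones) * np.log(1 - alpha))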
7. Maximum likelihood estimation
The loglikelihood of the model parameter is L(\theta) = \log f(y \mid g, m, \pi, \rho, \alpha) (g and m fixed).

L(\theta) = \log p(y, w, z \mid g, m, \theta) - \log p(w, z \mid y; g, m, \theta)
         = IE[\log p(y, w, z; \theta) \mid y; \theta^{(c)}] - IE[\log p(w, z \mid y; \theta) \mid y; \theta^{(c)}]
         = Q(\theta \mid \theta^{(c)}) - H(\theta \mid \theta^{(c)})

If \tilde\theta \in \arg\max_\theta Q(\theta \mid \theta^{(c)}), then

L(\tilde\theta) - L(\theta^{(c)}) = Q(\tilde\theta \mid \theta^{(c)}) - Q(\theta^{(c)} \mid \theta^{(c)}) + H(\theta^{(c)} \mid \theta^{(c)}) - H(\tilde\theta \mid \theta^{(c)}) \ge 0

EM algorithm
E step: computing the conditional expectation of the complete loglikelihood Q(\theta \mid \theta^{(c)})
M step: maximising Q(\theta \mid \theta^{(c)}) in \theta: \theta^{(c)} \to \tilde\theta
8. Conditional expectation of the complete loglikelihood
For the latent block model, it is
Q(\theta \mid \theta^{(c)}) = \sum_{i,k} s_{ik}^{(c)} \log \pi_k + \sum_{j,l} t_{jl}^{(c)} \log \rho_l + \sum_{i,j,k,l} e_{ijkl}^{(c)} \log \varphi(y_{ij}; \alpha_{kl})

where

s_{ik}^{(c)} = P(Z_{ik} = 1 \mid y; \theta^{(c)}), \quad t_{jl}^{(c)} = P(W_{jl} = 1 \mid y; \theta^{(c)})

and

e_{ijkl}^{(c)} = P(Z_{ik} W_{jl} = 1 \mid y; \theta^{(c)}).

→ The e_{ijkl}^{(c)} are difficult to compute... Approximations are needed.
9. Variational interpretation of EM
From the identity L(\theta) = \log p(y, z, w \mid \theta) - \log p(z, w \mid y, \theta), we get

L(\theta) = IE_{q_{zw}}\left[ \log \frac{p(y, z, w \mid \theta)}{q_{zw}(z, w)} \right] + KL(q_{zw} || p(z, w \mid y; \theta))
         = F(q_{zw}, \theta) + KL(q_{zw} || p(z, w \mid y; \theta))

EM as an alternating optimisation algorithm of F(q_{zw}, \theta)
E step: maximising F(q_{zw}, \theta^{(c)}) in q_{zw}(.) with \theta^{(c)} fixed leads to

p(z, w \mid y; \theta^{(c)}) = \arg\min_{q_{zw}} KL(q_{zw} || p(z, w \mid y; \theta^{(c)}))

M step: maximising F(q_{zw}^{(c)}, \theta) in \theta with q_{zw}^{(c)}(.) fixed; it amounts to finding \arg\max_\theta Q(\theta \mid \theta^{(c)}).
10. Variational approximation of EM (VEM)
Restricting q_{zw} to a function set for which the E step is easily tractable: it is assumed that q_{zw}(z, w) = q_z(z) q_w(w). Then

s_{ik}^{(c)} = P_{q_z}(Z_{ik} = 1 \mid y; \theta^{(c)}), \quad t_{jl}^{(c)} = P_{q_w}(W_{jl} = 1 \mid y; \theta^{(c)}), \quad e_{ijkl}^{(c)} = s_{ik}^{(c)} t_{jl}^{(c)}

Govaert and Nadif (2008)
1. E step: maximising the free energy F(q_{zw}, \theta^{(c)}) until convergence
1.1 computing the s_{ik}^{(c+1)} with the t_{jl}^{(c)} and \theta^{(c)} fixed
1.2 computing the t_{jl}^{(c+1)} with the s_{ik}^{(c+1)} and \theta^{(c)} fixed
→ s^{(c+1)} and t^{(c+1)}
2. M step: updating \theta^{(c+1)}
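[Editorial note] As an illustration, here is a minimal NumPy sketch of a VEM pass for the Bernoulli latent block model. It is our own reading of the algorithm, not the authors' code: initialisation and stopping are simplistic, and a single row/column update is done per E step instead of iterating the free-energy maximisation to convergence.

import numpy as np

def vem_bernoulli_lbm(y, g, m, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = y.shape
    s = rng.dirichlet(np.ones(g), size=n)    # s_ik = q_z(Z_ik = 1)
    t = rng.dirichlet(np.ones(m), size=d)    # t_jl = q_w(W_jl = 1)
    for _ in range(n_iter):
        # M step: closed-form updates of pi, rho, alpha
        pi, rho = s.mean(axis=0), t.mean(axis=0)
        alpha = (s.T @ y @ t) / np.outer(s.sum(axis=0), t.sum(axis=0))
        alpha = alpha.clip(1e-10, 1 - 1e-10)
        la, l1a = np.log(alpha), np.log1p(-alpha)
        # E step, rows: s_ik proportional to
        # exp(log pi_k + sum_{j,l} t_jl log phi(y_ij; alpha_kl))
        ls = np.log(pi) + (y @ t) @ la.T + ((1 - y) @ t) @ l1a.T
        s = np.exp(ls - ls.max(axis=1, keepdims=True))
        s /= s.sum(axis=1, keepdims=True)
        # E step, columns: the symmetric update given the new s
        lt = np.log(rho) + (y.T @ s) @ la + ((1 - y).T @ s) @ l1a
        t = np.exp(lt - lt.max(axis=1, keepdims=True))
        t /= t.sum(axis=1, keepdims=True)
    return pi, rho, alpha, s, t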
11. Some characteristics of VEM
The optimised free energy F(qzw , θ) is a lower bound of
the observed loglikelihood.
The parameter maximising the free energy could be
expected to be a good, if not consistent, approximation of
the maximum likelihood estimator.
Since VEM minimises KL(q_{zw} || p(z, w | y; \theta)) rather than KL(p(z, w | y; \theta) || q_{zw}), it is expected to be sensitive to starting values.
12. The SEM-Gibbs algorithm
SEM
The SEM algorithm (Celeux and Diebolt, 1985): after the E step, an S step is introduced to simulate the missing data according to the distribution p(z, w | y; \theta^{(c)}).
A difficulty for the latent block model is to simulate p(z, w | y; \theta).
Gibbs sampling
The distribution p(z, w | y; \theta^{(c)}) is simulated using a Gibbs sampler. Repeat:
Simulate z^{(t+1)} according to p(z | y, w^{(t)}; \theta^{(c)})
Simulate w^{(t+1)} according to p(w | y, z^{(t+1)}; \theta^{(c)})
→ The stationary distribution of the Markov chain is p(z, w | y; \theta^{(c)})
13. SEM-Gibbs for Bernoulli latent block model
1. E and S steps:
1.1 computation of p(z | y, w^{(t)}; \theta^{(c)}), then simulation of z^{(t+1)}:

p(z_i = k \mid y_{i.}, w^{(c)}) = \frac{\pi_k \psi_k(y_{i.}; \alpha_{k.})}{\sum_{k'} \pi_{k'} \psi_{k'}(y_{i.}; \alpha_{k'.})}, \quad k = 1, \ldots, g

\psi_k(y_{i.}; \alpha_{k.}) = \prod_l \alpha_{kl}^{u_{il}} (1 - \alpha_{kl})^{d_l - u_{il}}, \quad u_{il} = \sum_j w_{jl} y_{ij}, \quad d_l = \sum_j w_{jl}

1.2 computation of p(w | y, z^{(t+1)}; \theta^{(c)}), then simulation of w^{(t+1)}
→ w^{(c+1)} and z^{(c+1)}
2. M step:

\pi_k^{(c+1)} = \frac{\sum_i z_{ik}^{(c+1)}}{n}, \quad \rho_l^{(c+1)} = \frac{\sum_j w_{jl}^{(c+1)}}{d}

and

\alpha_{kl}^{(c+1)} = \frac{\sum_{ij} z_{ik}^{(c+1)} w_{jl}^{(c+1)} y_{ij}}{\sum_{ij} z_{ik}^{(c+1)} w_{jl}^{(c+1)}}
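[Editorial note] For comparison, a minimal sketch of one SEM-Gibbs iteration under the same Bernoulli model; here z and w are integer label vectors, empty-cluster handling is omitted, and all names are ours.

import numpy as np

def sem_gibbs_step(y, z, w, pi, rho, alpha, rng):
    g, m = alpha.shape
    la, l1a = np.log(alpha), np.log1p(-alpha)
    # S step, rows: draw z_i from p(z_i = k | y, w; theta)
    W = np.eye(m)[w]                        # (d, m) one-hot column labels
    u = y @ W                               # u_il = sum_j w_jl y_ij
    dl = W.sum(axis=0)                      # d_l = sum_j w_jl
    lp = np.log(pi) + u @ la.T + (dl - u) @ l1a.T
    p = np.exp(lp - lp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(g, p=row) for row in p])
    # S step, columns: draw w_j from p(w_j = l | y, z; theta)
    Z = np.eye(g)[z]
    v = y.T @ Z                             # v_jk = sum_i z_ik y_ij
    nk = Z.sum(axis=0)
    lq = np.log(rho) + v @ la + (nk - v) @ l1a
    q = np.exp(lq - lq.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    w = np.array([rng.choice(m, p=row) for row in q])
    # M step: frequency updates from the simulated (z, w)
    Z, W = np.eye(g)[z], np.eye(m)[w]
    pi, rho = Z.mean(axis=0), W.mean(axis=0)
    alpha = (Z.T @ y @ W) / np.outer(Z.sum(axis=0), W.sum(axis=0))
    return z, w, pi, rho, alpha.clip(1e-10, 1 - 1e-10)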
14. SEM features
SEM does not increase the loglikelihood at each iteration.
SEM generates an irreducible Markov chain with a unique stationary distribution.
The parameter estimates fluctuate around the ML estimate.
→ A natural estimator of (θ, z, w) is the mean of (θ^{(c)}, z^{(c)}, w^{(c)}; c = B, . . . , B + C) obtained after a burn-in period.
How many Gibbs iterations inside the E-S step?
→ default version: one Gibbs sampler iteration.
15. Numerical experiments
Simulation design
n = 100 rows, d = 60 columns,
g = 3 components for I, m = 2 components for J,
equal proportions on I and J.
The parameters α form a 3 × 2 matrix whose entries are ε and 1 − ε, where ε defines the overlap between the mixture components (see the simulation sketch below).
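[Editorial note] This design is easy to simulate. In the sketch below the exact placement of ε inside the 3 × 2 matrix α is our assumption (the slide only specifies that the entries are ε and 1 − ε); only the role of ε as the overlap level matters.

import numpy as np

def simulate_design(eps, n=100, d=60, seed=0):
    rng = np.random.default_rng(seed)
    alpha = np.array([[eps, 1 - eps],          # assumed layout of alpha
                      [1 - eps, eps],
                      [1 - eps, 1 - eps]])
    z = rng.integers(0, 3, size=n)             # equal row proportions
    w = rng.integers(0, 2, size=d)             # equal column proportions
    y = rng.binomial(1, alpha[z][:, w])        # y_ij ~ Bernoulli(alpha_{z_i, w_j})
    return y, z, w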
16. Comparing VEM and SEM-Gibbs
Criteria of comparison
Estimated parameter values vs. actual parameter values for θ.
Distance between the MAP partition and the actual partition, where the distance between two couples of partitions u = (z, w) and u' = (z', w') is the relative frequency of disagreements

\delta(u, u') = 1 - \frac{1}{nd} \sum_{i,j,k,l} z_{ik} w_{jl} z'_{ik} w'_{jl}.
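[Editorial note] This distance is a direct computation once the labels are given; since cluster labels are arbitrary, the small sketch below also minimises over label permutations, a step the slide leaves implicit.

import numpy as np
from itertools import permutations

def delta(z, w, z_hat, w_hat, g, m):
    # delta(u, u') = 1 - (1/nd) sum_{ijkl} z_ik w_jl z'_ik w'_jl
    #             = 1 - (fraction of agreeing rows) * (fraction of agreeing columns)
    z, w, z_hat, w_hat = map(np.asarray, (z, w, z_hat, w_hat))
    best = 1.0
    for pz in permutations(range(g)):
        r = np.mean(np.take(pz, z_hat) == z)
        for pw in permutations(range(m)):
            c = np.mean(np.take(pw, w_hat) == w)
            best = min(best, 1.0 - r * c)
    return best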
18. SEM variance from a unique starting position
n = 100, d = 60, π = (0.30, 0.34, 0.36), ρ = (0.53, 0.47),
δSEM = 0.18 (0.01), δVEM = 0.18
[Figure: boxplots of five parameter estimates across repeated SEM runs; values range from about 0.25 to 0.55]
19. Comparing VEM and SEM with starting position θ0
The comparison is made on 100 different samples
δVEM = 0.28(0.17), δSEM = 0.34(0.17)
[Figure: twelve boxplots of δ, one per run (six VEM runs, six SEM runs); values range from 0 to 1]
20. VEM and SEM with random starting positions
Comparisons made on one sample from 100 different starting positions
δVEM = 0.49(0.16), δSEM = 0.17(0.02)
[Figure: twelve boxplots of the estimated α_kl, one per run (six runs labelled BEM, six SEM); values range from about 0.4 to 0.65]
21. Same comparison: less noisy case
Comparisons made on one sample from 100 different starting positions
δVEM = 0.20(0.23), δSEM = 0.045(0.004)
[Figure: twelve boxplots of the estimated α_kl, one per run (six runs labelled BEM, six SEM); values range from about 0.35 to 0.65]
22. Discussion: VEM vs. SEM
Numerical comparisons lead to the following conclusions:
VEM leads rapidly to reasonable parameter estimates when its initial position is close enough to the ML estimate.
VEM is quite sensitive to starting values.
SEM-Gibbs is (essentially) insensitive to starting values.
→ Coupling SEM and VEM should be beneficial to derive sensible ML estimates for the latent block model.
23. Difficulties with Maximum likelihood
These difficulties concern the computation of information criteria for model selection.
The likelihood remains difficult to compute.
What is the sample size in a latent block model ?
There are many combinations (g, m) to be considered to
choose a relevant number of blocks.
→ Bayesian inference could be thought of as attractive for the
latent block model.
24. Bayesian inference: choosing the priors
Choosing conjugate priors is essential for the latent block model.
The choice is easy in the binary case: the priors for π, ρ and α are D(1, . . . , 1) or D(1/2, . . . , 1/2). They are non-informative priors.
In the continuous case, the conjugate priors for α = (µ, σ²) are weakly informative.
Priors for the number of clusters
This sensitive choice jeopardizes Bayesian inference for mixtures (Aitkin 2000).
It seems that choosing truncated Poisson P(1) priors over the ranges 1, . . . , gmax and 1, . . . , mmax is often a reasonable choice (Nobile 2005).
25. Bayesian inference: Reversible Jump sampler
A possible advantage of Bayesian inference could be to make use of an RJMCMC sampler to choose relevant values for g and m, since the likelihood is unavailable.
But, in the latent block context, the standard RJMCMC is (remains?...) unattractive since there is a couple of partitions (rows and columns) to deal with.
Fortunately, the allocation sampler of Nobile and Fearnside
(2007) could be used instead.
26. The allocation sampler: collapsing
The point of the allocation sampler is to use a (RJ)MCMC algorithm on a collapsed model.
Collapsed joint posterior
Using conjugacy properties, by integrating the full posterior with respect to π, ρ and α we get

P(g, m, z, w \mid y) = P(g) P(m) CF(.) \prod_{k=1}^{g} \prod_{l=1}^{m} M_{kl}

where CF(.) is a closed-form function made of Gamma functions and

M_{kl} = \int P(\alpha_{kl}) \prod_{i / z_i = k} \; \prod_{j / w_j = l} p(y_{ij} \mid \alpha_{kl}) \, d\alpha_{kl}.
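[Editorial note] In the binary case each M_kl is available in closed form by Beta-Bernoulli conjugacy. A minimal sketch, assuming a Beta(a, b) prior on α_kl (a = b = 1/2 matches the D(1/2, . . . , 1/2)-type priors of slide 24); names are ours.

import numpy as np
from scipy.special import betaln

def log_block_marginal(y, z, w, k, l, a=0.5, b=0.5):
    # log M_kl = log Beta(a + n1, b + n0) - log Beta(a, b), where n1 and n0
    # are the counts of ones and zeros in block (k, l)
    block = y[np.asarray(z) == k][:, np.asarray(w) == l]
    n1 = block.sum()
    n0 = block.size - n1
    return betaln(a + n1, b + n0) - betaln(a, b)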
27. The allocation sampler: MCMC moves
Moves with fixed numbers of clusters
Updating the label of row i from its current cluster k to cluster \tilde{k}:

P(z_i = \tilde{k}) \propto \frac{n_{\tilde{k}} + 1}{n_k} \prod_{l=1}^{m} \frac{M_{\tilde{k}l}^{+i} \, M_{kl}^{-i}}{M_{\tilde{k}l} \, M_{kl}}, \quad \tilde{k} \ne k,

where M_{\tilde{k}l}^{+i} (resp. M_{kl}^{-i}) denotes the block marginal with row i added (resp. removed).
Other moves are possible (Nobile and Fearnside 2007).
Moves to split or combine clusters
Two reversible moves to split a cluster or combine two clusters, analogous to the RJMCMC moves of Richardson and Green (1997), are defined.
But, thanks to collapsing, those moves are of fixed dimension.
Integrating out the parameters reduces the sampling variability.
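[Editorial note] A sketch of the row-reallocation probabilities built from these closed-form marginals (redefined here so the snippet is self-contained). The staying probability P(z_i = k) is handled separately by Nobile and Fearnside, so it is left out; all names are ours.

import numpy as np
from scipy.special import betaln

def log_M(n1, n0, a=0.5, b=0.5):
    # log Beta-Bernoulli block marginal from its counts of ones and zeros
    return betaln(a + n1, b + n0) - betaln(a, b)

def row_move_logprobs(y, z, w, i, g, m):
    # unnormalised log P(z_i = k~) for moving row i out of its cluster k
    z, w = np.asarray(z), np.asarray(w)
    k = z[i]
    nk = np.bincount(z, minlength=g)
    out = np.full(g, -np.inf)
    for kt in range(g):
        if kt == k:
            continue
        lp = np.log(nk[kt] + 1) - np.log(nk[k])
        for l in range(m):
            cols = (w == l)
            yi1 = y[i, cols].sum()                        # ones of row i in class l
            yi0 = cols.sum() - yi1                        # zeros of row i in class l
            for kk, d1, d0 in ((kt, yi1, yi0), (k, -yi1, -yi0)):
                b1 = y[z == kk][:, cols].sum()            # ones in block (kk, l)
                b0 = (z == kk).sum() * cols.sum() - b1    # zeros in block (kk, l)
                lp += log_M(b1 + d1, b0 + d0) - log_M(b1, b0)   # M^{+/-i} / M
        out[kt] = lp
    return out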
28. The allocation sampler: label switching
Following Nobile and Fearnside (2007), Wyse and Friel (2010) used a post-processing procedure with the cost function

C(k_1, k_2) = \sum_{t=1}^{T-1} \sum_{i=1}^{n} I\left( z_i^{(t)} = k_1, \; z_i^{(T)} = k_2 \right).

1. The z^{(t)} MCMC sequence has been rearranged such that, for s < t, z^{(s)} uses no more components than z^{(t)}.
2. An algorithm returns the permutation σ(.) of the labels in z^{(T)} which minimises the total cost \sum_{k=1}^{g} C(k, \sigma(k)).
3. z^{(T)} is relabelled using the permutation σ(.).
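[Editorial note] The optimal σ need not be found by scanning all g! permutations: since the row sums of C are fixed, minimising a disagreement cost is the same as maximising the agreement counts Σ_k C(k, σ(k)), which is a standard assignment problem. A sketch with SciPy's Hungarian solver (names and encoding are ours):

import numpy as np
from scipy.optimize import linear_sum_assignment

def relabel_last_draw(z_draws, g):
    # z_draws: (T, n) integer row labels, one row per MCMC draw
    z_draws = np.asarray(z_draws)
    C = np.zeros((g, g))
    for k1 in range(g):
        for k2 in range(g):
            # matches between label k1 in draws 1..T-1 and label k2 in draw T
            C[k1, k2] = np.sum((z_draws[:-1] == k1) & (z_draws[-1] == k2))
    _, sigma = linear_sum_assignment(C, maximize=True)
    inv = np.empty(g, dtype=int)
    inv[sigma] = np.arange(g)              # sigma^{-1}
    return inv[z_draws[-1]]                # z^(T) rewritten in the chain's labels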
29. Remarks on the procedure to deal with label switching
Due to collapsing, the cost function does not involve sampled model parameters.
The row and column allocations are post-processed separately.
Simple algebra leads to an efficient on-line post-processing procedure.
When g and m are large, g! and m! are huge.
30. Summarizing MCMC output
Most visited model: for each couple (g, m), its posterior probability is estimated by the relative frequency of visits, after post-processing to undo label switching.
MAP cluster model: the visited (g, m, z, w) having the highest posterior probability among the MCMC samples.
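[Editorial note] Both summaries are straightforward to compute from the relabelled chain; a small sketch (variable names assumed):

from collections import Counter

def posterior_gm(gm_draws):
    # gm_draws: list of (g, m) couples visited after burn-in and relabelling
    counts = Counter(gm_draws)
    total = sum(counts.values())
    return {gm: c / total for gm, c in counts.items()}

The most visited model is then the argmax of this dictionary; the MAP cluster model additionally requires storing, along the run, the visited (g, m, z, w) with the highest collapsed posterior value.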
31. Simulated data
A 200 × 200 binary table. The posterior model probabilities of the generating model were, respectively (from left to right and from top to bottom): .96, .95, .90; .93, .89, .84; .80, .30, .15.
32. Congressional voting data
The data set records the votes of 435 members (267 Democrats, 168 Republicans) of the 98th U.S. Congress on 16 different key issues.
[Figure: three panels - the voting data, the collapsed LBM reorganisation, and the BEM2 reorganisation]
33. An example on microarray experiments
The data consist of the expression level of 419 genes under 70
conditions.
Weakly informative hyperprior parameters have been chosen.
The sampler has been run for 220,000 iterations, with 20,000 as burn-in.
Below is a detail of the posterior distribution over block cluster models.
rows \ columns      3       4       5
24               .064    .071    .042
25               .102    .120    .070
26               .037    .046    .023
Most visited model: (25, 4)
MAP cluster model: (26, 4).
34. References
Govaert, G. and Nadif, M. (2008) Block clustering with Bernoulli mixture models: Comparison of different approaches. Computational Statistics and Data Analysis, 52, 3233-3245.
Nobile, A. and Fearnside, A. T. (2007) Bayesian finite mixtures with an unknown number of components: The allocation sampler. Statistics and Computing, 17, 147-162.
Wyse, J. and Friel, N. (2010) Block clustering with collapsed latent block models. In revision at Statistics and Computing (http://arxiv.org/abs/1011.2948).