CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures
1. Advanced Information Theory in CVPR "in a Nutshell"
CVPR Tutorial, June 13-18, 2010, San Francisco, CA
Gaussian Mixtures:
Classification & PDF Estimation
Francisco Escolano & Anand Rangarajan
2. Gaussian Mixtures
Background. Gaussian Mixtures are ubiquitous in CVPR. For
instance, in CBIR, it is sometimes interesting to model the image as a
pdf over the pixel colors and positions (see for instance [Goldberger et
al.,03], where a KL-divergence computation method is presented).
GMs often provide a model for the pdf associated with the image, and
this is useful for segmentation. GMs, as we have seen in the
previous lesson, are also useful for modeling shapes.
Therefore GM estimation has been a recurrent topic in CVPR.
Traditional methods, based on the EM algorithm, have evolved
to incorporate IT elements like the MDL principle for model-order
selection [Figueiredo and Jain,02], in parallel with the development of
Variational Bayes (VB) [Constantinopoulos and Likas,07].
2/43
3. Uses of Gaussian Mixtures
Figure: Gaussian Mixtures for modeling images (top) and for color-based
segmentation (bottom)
3/43
4. Review of Gaussian Mixtures
Definition
A d-dimensional random variable Y follows a finite-mixture
distribution when its pdf p(Y|Θ) can be described by a weighted
sum of known pdfs, named kernels. When all of these kernels are
Gaussian, the mixture is called a Gaussian mixture:
p(Y|Θ) = Σ_{i=1}^{K} π_i p(Y|Θ_i),

where 0 ≤ π_i ≤ 1, i = 1, . . . , K, Σ_{i=1}^{K} π_i = 1, K is the number of
kernels, π_1, . . . , π_K are the a priori probabilities of each kernel, and
Θ_i are the parameters describing the i-th kernel. In GMs, Θ_i = {µ_i, Σ_i},
that is, the mean vector and covariance matrix.
4/43
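As a complement to the definition above, here is a minimal numerical sketch (not part of the original slides) that evaluates p(y|Θ) for a toy two-component, two-dimensional mixture; the parameter values are made up for illustration.

```python
# Sketch (assumed toy parameters): p(y|Theta) = sum_k pi_k N(y; mu_k, Sigma_k)
import numpy as np
from scipy.stats import multivariate_normal

pis = np.array([0.4, 0.6])                                  # a priori probabilities
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]          # mean vectors
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]    # covariance matrices

def gm_pdf(y, pis, mus, Sigmas):
    """Weighted sum of Gaussian kernels evaluated at y."""
    return sum(pi * multivariate_normal.pdf(y, mean=mu, cov=S)
               for pi, mu, S in zip(pis, mus, Sigmas))

print(gm_pdf(np.array([1.0, 1.0]), pis, mus, Sigmas))
```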
5. Review of Gaussian Mixtures (2)
GMs and Maximum Likelihood
The whole set of parameters of a given K-mixture is denoted by
Θ ≡ {Θ1 , . . . , ΘK , π1 , . . . , πK }. Obtaining the optimal set of
parameters Θ∗ is usually posed in terms of maximizing the
log-likelihood of the pdf to be estimated, based on a set of N i.i.d.
samples of the variable Y = {y1 , . . . , yN }:
L(Θ, Y) = ℓ(Y|Θ) = log p(Y|Θ) = Σ_{n=1}^{N} log p(y_n|Θ)
                 = Σ_{n=1}^{N} log Σ_{k=1}^{K} π_k p(y_n|Θ_k).
5/43
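A possible way to evaluate this log-likelihood numerically is sketched below (not from the slides); the log-sum-exp trick avoids underflow when the kernel densities are tiny. Parameter values are illustrative.

```python
# Sketch: L(Theta, Y) = sum_n log sum_k pi_k N(y_n; mu_k, Sigma_k), computed stably
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gm_log_likelihood(Y, pis, mus, Sigmas):
    # log_p[n, k] = log pi_k + log N(y_n; mu_k, Sigma_k)
    log_p = np.stack([np.log(pi) + multivariate_normal.logpdf(Y, mean=mu, cov=S)
                      for pi, mu, S in zip(pis, mus, Sigmas)], axis=1)
    return logsumexp(log_p, axis=1).sum()

Y = np.random.default_rng(0).normal(size=(100, 2))           # toy samples
print(gm_log_likelihood(Y, [0.5, 0.5],
                        [np.zeros(2), np.ones(2)],
                        [np.eye(2), np.eye(2)]))
```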
6. Review of Gaussian Mixtures (3)
GMs and EM
The EM algorithm makes it possible to find maximum-likelihood solutions to
problems where there are hidden variables. In the case of Gaussian
mixtures, these variables are a set of N labels Z = {z^(1), . . . , z^(N)}
associated with the samples. Each label is a binary vector
z^(n) = [z_1^(n), . . . , z_K^(n)], K being the number of components, with z_m^(n) = 1
and z_p^(n) = 0 for p ≠ m, denoting that y_n has been generated by the
m-th kernel. Then, given the complete set of data X = {Y, Z}, the
log-likelihood of this set is given by

log p(Y, Z|Θ) = Σ_{n=1}^{N} Σ_{k=1}^{K} z_k^(n) log[π_k p(y_n|Θ_k)].
6/43
7. Review of Gaussian Mixtures (4)
E-Step
Consists of estimating the expected value of the hidden variables
given the visible data Y and the current estimate of the
parameters Θ*(t):

E[z_k^(n)|Y, Θ*(t)] = P[z_k^(n) = 1 | y_n, Θ*(t)]
                    = π_k*(t) p(y_n|Θ_k*(t)) / Σ_{j=1}^{K} π_j*(t) p(y_n|Θ_j*(t)).

Thus, the probability of generating y_n with the kernel k is given by:

p(k|y_n) = π_k p(y_n|k) / Σ_{j=1}^{K} π_j p(y_n|j).
7/43
8. Review of Gaussian Mixtures (5)
M-Step
Given the expected Z , the new parameters Θ∗ (t + 1) are given by:
π_k = (1/N) Σ_{n=1}^{N} p(k|y_n),

µ_k = Σ_{n=1}^{N} p(k|y_n) y_n / Σ_{n=1}^{N} p(k|y_n),

Σ_k = Σ_{n=1}^{N} p(k|y_n)(y_n − µ_k)(y_n − µ_k)^T / Σ_{n=1}^{N} p(k|y_n).
8/43
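The two steps above can be put together in a short update routine; the following sketch (illustrative, not the authors' code) performs one E-step and one M-step on a data matrix Y of shape N x d.

```python
# Sketch of one EM iteration for a Gaussian mixture
import numpy as np
from scipy.stats import multivariate_normal

def em_step(Y, pis, mus, Sigmas, reg=1e-6):
    N, d = Y.shape
    K = len(pis)
    # E-step: responsibilities p(k|y_n)
    resp = np.stack([pi * multivariate_normal.pdf(Y, mean=mu, cov=S)
                     for pi, mu, S in zip(pis, mus, Sigmas)], axis=1)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: update priors, means and covariances
    Nk = resp.sum(axis=0)
    new_pis = Nk / N
    new_mus = [resp[:, k] @ Y / Nk[k] for k in range(K)]
    new_Sigmas = []
    for k in range(K):
        diff = Y - new_mus[k]
        S = (resp[:, k, None] * diff).T @ diff / Nk[k]
        new_Sigmas.append(S + reg * np.eye(d))    # small ridge to avoid singularity
    return new_pis, new_mus, new_Sigmas
```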
9. Model Order Selection
Two Extreme Approaches
How many kernels are needed to describe the distribution?
In [Figueiredo and Jain,02] it is proposed to perform EM for
different values of K and take the one optimizing ML and an
MDL-like criterion. Starting from a high K, kernel fusions are
performed if needed. Local optima arise.
In EBEM [Peñalver et al.,09] we show that it is possible to apply
MDL more efficiently and robustly by starting from a unique
kernel and splitting only if the underlying data is not Gaussian.
The main challenge of this approach is how to estimate
Gaussianity for multi-dimensional data.
9/43
10. Model Order Selection (2)
MDL
Minimum Description Length and related principles choose a
representation of the data that allows us to express them with the
shortest possible message from a postulated set of models.
Rissanen's MDL implies minimizing

C_MDL(Θ^(K), K) = −L(Θ^(K), Y) + (N(K)/2) log n,

where N(K) is the number of parameters required to define a
K-component mixture, and n is the number of samples:

N(K) = (K − 1) + K (d + d(d + 1)/2).
10/43
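A direct transcription of this criterion (an illustrative sketch, not the authors' code):

```python
# Sketch: MDL-style cost C_MDL = -L + N(K)/2 * log(n) for a full-covariance GM
import numpy as np

def n_params(K, d):
    """Free parameters: K-1 priors, K means (d each), K covariances (d(d+1)/2 each)."""
    return (K - 1) + K * (d + d * (d + 1) / 2)

def mdl_cost(log_likelihood, K, d, n):
    return -log_likelihood + 0.5 * n_params(K, d) * np.log(n)

# e.g. compare a 2- and a 3-component fit of 500 samples in 2-D (made-up likelihoods)
print(mdl_cost(-1234.5, K=2, d=2, n=500), mdl_cost(-1200.0, K=3, d=2, n=500))
```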
11. Gaussian Deficiency
Maximum Entropy of a Mixture
According to the 2nd Gibbs Theorem, Gaussian variables have the
maximum entropy among all the variables with equal variance. This
theoretical maximum entropy for a d-dimensional variable Y only
depends on the covariance Σ and is given by:

Hmax(Y) = (1/2) log[(2πe)^d |Σ|].

Therefore, the maximum entropy of the mixture is given by

Hmax(Y) = Σ_{k=1}^{K} π_k Hmax(k).
11/43
12. Gaussian Deficiency (2)
Gaussian Deficiency
Instead of using the MDL principle we may compare the estimated
entropy of the underlying data with the entropy of a Gaussian. We
define the Gaussianity Deficiency GD of the whole mixture as the
normalized weighted sum of the differences between maximum and
real entropy of each kernel:
GD = Σ_{k=1}^{K} π_k (Hmax(k) − Hreal(k)) / Hmax(k) = Σ_{k=1}^{K} π_k (1 − Hreal(k)/Hmax(k)),

where Hreal(k) is the real entropy of the data under the k-th
kernel. We have: 0 ≤ GD ≤ 1 (0 iff Gaussian). If the GD is low
enough we may stop the algorithm.
12/43
13. Gaussian Deficiency (3)
Kernel Selection
If the GD ratio is below a given threshold, we consider that all
kernels are well fitted. Otherwise, we select the kernel with the
highest individual ratio and replace it by two other kernels that
are conveniently placed and initialized. Then, a new EM epoch with
K + 1 kernels starts. The worst kernel is given by

k* = arg max_k π_k (Hmax(k) − Hreal(k)) / Hmax(k).

Independently of using MDL or GD, in order to decide which kernel
should be split into two other kernels (if needed), we compute the latter
expression and decide whether to split k* according to MDL or GD.
13/43
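The following sketch (illustrative; Hreal would come from a non-parametric entropy estimator such as the entropic-graph methods discussed later) computes the GD and picks the worst kernel:

```python
# Sketch: Gaussianity Deficiency and worst-kernel selection
import numpy as np

def h_max_gaussian(Sigma):
    """Maximum (Gaussian) entropy for covariance Sigma: 0.5*log((2*pi*e)^d |Sigma|)."""
    d = Sigma.shape[0]
    return 0.5 * np.log(((2 * np.pi * np.e) ** d) * np.linalg.det(Sigma))

def gaussian_deficiency(pis, H_real, H_max):
    """GD = sum_k pi_k (1 - H_real(k)/H_max(k)); 0 iff every kernel is Gaussian."""
    ratios = 1.0 - np.asarray(H_real) / np.asarray(H_max)
    gd = float(np.dot(pis, ratios))
    k_star = int(np.argmax(np.asarray(pis) * ratios))   # candidate kernel to split
    return gd, k_star

gd, k_star = gaussian_deficiency([0.5, 0.5], H_real=[2.1, 1.4], H_max=[2.2, 2.0])
print(gd, k_star)
```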
14. Split Process
Split Constraints
The k* component must be decomposed into the kernels k1 and k2
with parameters Θ_k1 = (µ_k1, Σ_k1) and Θ_k2 = (µ_k2, Σ_k2). In
multivariate settings, the corresponding priors, the mean vectors and
the covariance matrices should satisfy the following split equations:

π* = π1 + π2,
π* µ* = π1 µ1 + π2 µ2,
π* (Σ* + µ* µ*^T) = π1 (Σ1 + µ1 µ1^T) + π2 (Σ2 + µ2 µ2^T).

Clearly, the split move is an ill-posed problem because the number of
equations is less than the number of unknowns.
14/43
15. Split Process (2)
Split
Following [Dellaportas,06], let Σ* = V* Λ* V*^T. Let also D be a d × d
rotation matrix with orthonormal unit vectors as columns. Then:

π1 = u1 π*,   π2 = (1 − u1) π*,
µ1 = µ* − (Σ_{i=1}^{d} u2^i √(λ*^i) V*^i) √(π2/π1),
µ2 = µ* + (Σ_{i=1}^{d} u2^i √(λ*^i) V*^i) √(π1/π2),
Λ1 = diag(u3) diag(ι − u2) diag(ι + u2) Λ* (π*/π1),
Λ2 = diag(ι − u3) diag(ι − u2) diag(ι + u2) Λ* (π*/π2),
V1 = D V*,   V2 = D^T V*,
15/43
16. Split Process (3)
Split (cont.)
The latter spectral split method has a non-evident random
component, because ι is a d × 1 vector of ones, and
u1, u2 = (u2^1, u2^2, . . . , u2^d)^T and u3 = (u3^1, u3^2, . . . , u3^d)^T are 2d + 1
random variables needed to build priors, means and eigenvalues for
the new components in the mixture. They are calculated as:

u1 ∼ β(2, 2),   u2^1 ∼ β(1, 2d),
u2^j ∼ U(−1, 1),   u3^1 ∼ β(1, d),   u3^j ∼ U(0, 1),

with j = 2, . . . , d, where U(·, ·) and β(·, ·) denote the Uniform and Beta
distributions, respectively.
16/43
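A rough sketch of this split move under the reconstruction above (illustrative only; the rotation D is simply set to the identity):

```python
# Sketch: spectral split of one Gaussian kernel into two
import numpy as np

def split_kernel(pi_s, mu_s, Sigma_s, rng=np.random.default_rng()):
    d = mu_s.shape[0]
    lam, V = np.linalg.eigh(Sigma_s)               # Sigma* = V* Lambda* V*^T
    u1 = rng.beta(2, 2)
    u2 = np.concatenate(([rng.beta(1, 2 * d)], rng.uniform(-1, 1, d - 1)))
    u3 = np.concatenate(([rng.beta(1, d)], rng.uniform(0, 1, d - 1)))
    pi1, pi2 = u1 * pi_s, (1 - u1) * pi_s
    shift = (V * (u2 * np.sqrt(lam))).sum(axis=1)  # sum_i u2^i sqrt(lam^i) V^i
    mu1 = mu_s - shift * np.sqrt(pi2 / pi1)
    mu2 = mu_s + shift * np.sqrt(pi1 / pi2)
    lam1 = u3 * (1 - u2) * (1 + u2) * lam * (pi_s / pi1)
    lam2 = (1 - u3) * (1 - u2) * (1 + u2) * lam * (pi_s / pi2)
    D = np.eye(d)                                  # identity rotation for simplicity
    V1, V2 = D @ V, D.T @ V
    return (pi1, mu1, V1 @ np.diag(lam1) @ V1.T), (pi2, mu2, V2 @ np.diag(lam2) @ V2.T)

print(split_kernel(1.0, np.zeros(2), np.eye(2)))
```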
18. EBEM Algorithm
Alg. 1: EBEM - Entropy Based EM Algorithm
Input: convergence_th
K = 1, i = 0, π1 = 1, Θ1 = {µ1, Σ1} where µ1 = (1/N) Σ_{i=1}^{N} y_i, Σ1 = (1/(N−1)) Σ_{i=1}^{N} (y_i − µ1)(y_i − µ1)^T
Final = false
repeat
    i = i + 1
    repeat
        EM iteration
        Estimate log-likelihood in iteration i: ℓ(Y|Θ(i))
    until |ℓ(Y|Θ(i)) − ℓ(Y|Θ(i − 1))| < convergence_th
    Evaluate Hreal(Y) and Hmax(Y)
    Select k* with the highest ratio: k* = arg max_k π_k (Hmax(k) − Hreal(k))/Hmax(k)
    Estimate C_MDL in iteration i: N(K) = (K − 1) + K(d + d(d + 1)/2), C_MDL(Θ(i)) = −ℓ(Y|Θ(i)) + (N(K)/2) log n
    if C_MDL(Θ(i)) ≥ C_MDL(Θ(i − 1)) then
        Final = true
        K = K − 1, Θ* = Θ(i − 1)
    else
        Decompose k* into k1 and k2
    end
until Final = true
Output: Optimal mixture model: K, Θ*
18/43
22. EBEM Algorithm (5)
EBEM in Higher Dimensions
We have also tested the algorithm with the well-known Wine
data set, which contains 3 classes and 178 (13-dimensional)
instances.
The number of samples, 178, is not enough to build the pdf
using the Parzen window method in a 13-dimensional space.
With the MST approach (see below), where no pdf estimation is
needed, the algorithm has been applied to this data set.
After EBEM ends with K = 3, a maximum a posteriori
classifier was built. The classification performance was 96.1%.
This result is similar to or even better than the experiments
reported in the literature.
22/43
23. Entropic Graphs
EGs and Rényi Entropy
Entropic Spanning Graphs obtained from data to estimate Rényi's
α-entropy [Hero and Michel,02] belong to the "non plug-in" methods
for entropy estimation. Rényi's α-entropy of a probability density
function p is defined as:

Hα(p) = (1/(1 − α)) ln ∫_z p^α(z) dz

for α ∈ [0, 1[. The α-entropy converges to the Shannon entropy,
lim_{α→1} Hα(p) = H(p) ≡ −∫ p(z) ln p(z) dz, so it is possible to
obtain the Shannon entropy from Rényi's one if the latter limit
is either solved or numerically approximated.
23/43
24. Entropic Graphs (2)
EGs and Rényi Entropy (cont.)
Let G be a graph consisting of a set of vertices Xn = {x1, . . . , xn},
with xi ∈ R^d, and edges {e} that connect vertices: e_ij = (x_i, x_j). If
we denote by M(Xn) the possible sets of edges in the class of acyclic
graphs spanning Xn (spanning trees), the total edge length
functional of the Euclidean power-weighted Minimal Spanning Tree
is:

L_γ^MST(Xn) = min_{M(Xn)} Σ_{e∈M(Xn)} ||e||^γ,

with γ ∈ [0, d] and ||·|| the Euclidean distance. The MST has been
used in order to measure the randomness of a set of points.
24/43
25. Entropic Graphs (3)
EGs and Rényi Entropy (cont.)
It is intuitive that the length of the MST for uniformly
distributed points increases at a greater rate than does the MST
spanning a more concentrated nonuniform set of points. For
d ≥ 2:

Hα(Xn) = (d/γ) [ln(Lγ(Xn)/n^α) − ln β_{Lγ,d}]

is an asymptotically unbiased and almost surely consistent estimator
of the α-entropy of p, where α = (d − γ)/d and β_{Lγ,d} is a constant
bias correction for which there are only known approximations and
bounds: (i) Monte Carlo simulation of uniform random samples on the
unit cube [0, 1]^d; (ii) large-d approximation: (γ/2) ln(d/(2πe)).
25/43
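A compact sketch of this estimator (illustrative; the bias term ln β must be supplied, e.g. from a Monte Carlo calibration on the unit cube or from the large-d approximation above):

```python
# Sketch: MST-based estimate of the Renyi alpha-entropy, alpha = (d - gamma)/d
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def renyi_entropy_mst(X, gamma=1.0, ln_beta=0.0):
    n, d = X.shape
    alpha = (d - gamma) / d
    D = squareform(pdist(X))                 # pairwise Euclidean distances
    mst = minimum_spanning_tree(D)           # sparse matrix holding MST edge lengths
    L_gamma = (mst.data ** gamma).sum()      # total power-weighted edge length
    return (d / gamma) * (np.log(L_gamma / n ** alpha) - ln_beta), alpha

X = np.random.default_rng(1).normal(size=(500, 3))
H_alpha, alpha = renyi_entropy_mst(X, gamma=1.0)
print(alpha, H_alpha)
```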
27. Entropic Graphs (5)
Figure: Extrapolation to Shannon: α* = 1 − (a + b·e^{c·d})/N
27/43
28. Variational Bayes
Problem Definition
Given N i.i.d. samples X = {x 1 , . . . , x N } of a d-dimensional
random variable X , their associated hidden variables
Z = {z 1 , . . . , z N } and the parameters Θ of the model, the Bayesian
posterior is given by [Watanabe et al.,09] :
p(Z, Θ|X) = p(Θ) Π_{n=1}^{N} p(x^n, z^n|Θ) / ∫ p(Θ) Π_{n=1}^{N} p(x^n, z^n|Θ) dΘ.
Since the integration w.r.t. Θ is analytically intractable, the
posterior is approximated by a factorized distribution
q(Z , Θ) = q(Z )q(Θ) and the optimal approximation is the one that
minimizes the variational free energy.
28/43
29. Variational Bayes (2)
Problem Definition (cont.)
The variational free energy is given by:

L(q) = ∫ q(Z, Θ) log [ q(Z, Θ) / p(Z, Θ|X) ] dΘ − log ∫ p(Θ) Π_{n=1}^{N} p(x^n|Θ) dΘ,

where the first term is the Kullback-Leibler divergence between the
approximation and the true posterior. As the second term is
independent of the approximation, the Variational Bayes (VB)
approach reduces to minimizing that divergence. Such
minimization is addressed in an EM-like process alternating the
updating of q(Θ) and the updating of q(Z).
29/43
30. Variational Bayes (3)
Problem Definition (cont.)
The EM-like process alternating the updating of q(Θ) and the
updating of q(Z ) is given by
q(Θ) ∝ p(Θ) exp⟨ Σ_{n=1}^{N} log p(x^n, z^n|Θ) ⟩_{q(Z)},

q(Z) ∝ exp⟨ Σ_{n=1}^{N} log p(x^n, z^n|Θ) ⟩_{q(Θ)},

where ⟨·⟩_{q} denotes the expectation with respect to q.
30/43
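For a quick, hands-on feel of VB applied to GMs (this is not the VBgmmSplit scheme discussed in the following slides, just a generic VB mixture with Dirichlet priors on the weights, as implemented in scikit-learn):

```python
# Sketch: off-the-shelf variational Bayes for Gaussian mixtures
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])

vb = BayesianGaussianMixture(n_components=10,   # deliberately too many components
                             weight_concentration_prior_type="dirichlet_distribution",
                             max_iter=500, random_state=0).fit(X)
# Superfluous components receive negligible weights: an implicit model-order selection
print(np.round(vb.weights_, 3))
```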
31. Variational Bayes (4)
Problem Definition (cont.)
In [Constantinopoulos and Likas,07], the optimization of the
variational free energy yields (where N(·) and W(·) are respectively
the Gaussian and Wishart densities):

q(Z) = Π_{n=1}^{N} [ Π_{k=1}^{s} (r_k^n)^{z_k^n} · Π_{k=s+1}^{K} (ρ_k^n)^{z_k^n} ],
q(µ) = Π_{k=1}^{K} N(µ_k | m_k, Σ_k),
q(Σ) = Π_{k=1}^{K} W(Σ_k | ν_k, V_k),
q(β) = (1 − Σ_{k=1}^{s} π_k)^{−K+s} · [ Γ(Σ_{k=s+1}^{K} γ̃_k) / Π_{k=s+1}^{K} Γ(γ̃_k) ] · Π_{k=s+1}^{K} [ π_k / (1 − Σ_{k=1}^{s} π_k) ]^{γ̃_k − 1}.

After the maximization of the free energy w.r.t. q(·), the method proceeds to
update the coefficients in α, which denote the free components.
31/43
32. Model Selection in VB
Fixed and Free Components
In the latter framework, it is assumed that a number of K − s
components fit the data well in their region of influence (fixed
components) and then model order selection is posed in terms
of optimizing the parameters of the remaining s (free
components).
Let α = {π_k}_{k=1}^{s} be the coefficients of the free components and
β = {π_k}_{k=s+1}^{K} the coefficients of the fixed components. Under
the i.i.d. sampling assumption, the prior distribution of Z given
α and β can be modeled by a product of multinomials:

p(Z|α, β) = Π_{n=1}^{N} [ Π_{k=1}^{s} π_k^{z_k^n} · Π_{k=s+1}^{K} π_k^{z_k^n} ].
32/43
33. Model Selection in VB (2)
Fixed and Free Components (cont.)
Moreover, assuming conjugate Dirichlet priors over the set of
mixing coefficients, we have that

p(β|α) = (1 − Σ_{k=1}^{s} π_k)^{−K+s} · [ Γ(Σ_{k=s+1}^{K} γ_k) / Π_{k=s+1}^{K} Γ(γ_k) ] · Π_{k=s+1}^{K} [ π_k / (1 − Σ_{k=1}^{s} π_k) ]^{γ_k − 1}.

Then, considering fixed coefficients, Θ is redefined as
Θ = {µ, Σ, β} and we have the following factorization:
q(Z, Θ) = q(Z) q(µ) q(Σ) q(β).
33/43
34. Model Selection in VB (3)
Kernel Splits
In [Constantinopoulos and Likas,07], the VBgmm method is used
for training an initial K = 2 model. Then, in the so-called
VBgmmSplit, they proceed by sorting the obtained kernels and
then trying to split them recursively.
Each split consists of:
Removing the original component.
Replacing it by two kernels with the same covariance matrix as
the original but with means placed in opposite directions along
the direction of maximum variability.
34/43
35. Model Selection in VB (4)
Kernel Splits (cont)
Independently of the split strategy, the critical point of
VBgmmSplit is the number of splits needed until convergence.
At each iteration of the latter algorithm the K currently existing
kernels are split. Consider the case in which some split is detected as
proper (non-zero π after running the VB update described in
the previous section, where each new kernel is considered as
free).
Then, the number of components increases and a new set
of splitting tests starts in the next iteration. This means that if
the algorithm stops (all splits failed) with K kernels, the
number of splits has been 1 + 2 + . . . + K = K(K + 1)/2.
35/43
36. Model Selection in VB (5)
EBVS Split
We split only one kernel per iteration. In order to do so, we
implement a selection criterion based on measuring the entropy
of the kernels.
If one uses Leonenko's estimator then there is no need for
extrapolation as in EGs, and asymptotic consistency is ensured.
Then, at each iteration of the algorithm we select the worst
kernel, in terms of low entropy, to be split. If the split is successful we
will have K + 1 kernels to feed the VB optimization in the next
iteration. Otherwise, there is no need to add a new kernel and
the process converges to K kernels. The key point here is
that the overall process is linear (one split per iteration).
36/43
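The slides refer to Leonenko's estimator; one common form is the Kozachenko-Leonenko k-nearest-neighbour estimator of Shannon entropy, sketched below (illustrative, not necessarily the exact variant used in EBVS):

```python
# Sketch: Kozachenko-Leonenko k-NN entropy estimator (in nats)
import numpy as np
from scipy.special import digamma, gammaln
from scipy.spatial import cKDTree

def knn_entropy(X, k=3):
    n, d = X.shape
    tree = cKDTree(X)
    eps, _ = tree.query(X, k=k + 1)        # the query point itself is neighbour 0
    eps = eps[:, -1]                       # distance to the k-th neighbour
    log_c_d = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)   # log volume of unit ball
    return digamma(n) - digamma(k) + log_c_d + d * np.mean(np.log(eps))

X = np.random.default_rng(3).normal(size=(2000, 2))
print(knn_entropy(X))   # close to 0.5*log((2*pi*e)^2) ~ 2.84 for a standard 2-D Gaussian
```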
39. EBVS: Fast BV (3)
MD Experiments
With this approach using Leonenko’s estimator, the
classification performance we obtain on this data set is 86%.
Although experiments in higher dimensions can be performed,
when the number of samples is not high enough, the risk of
unbounded maxima of the likelihood function is higher, due to
singular covariance matrices.
The entropy estimation method, however, performs very well
with thousands of dimensions.
39/43
40. Conclusions
Summarizing Ideas in GMs
In the multi-dimensional case, efficient entropy estimators
become critical.
In VB where model-order selection is implicit, it is possible to
reduce the complexity at least by one order of magnitude.
The same approach can be used for shapes in 2D and 3D.
Once we have the mixtures, new measures for comparing them
are waiting to be discovered and used. Let's do it!
40/43
41. References
[Goldberger et al.,03] Goldberger, J., Gordon, S., Greenspan, H.
(2003). An Efficient Image Similarity Measure Based on
Approximations of KL-Divergence Between Two Gaussian Mixtures.
ICCV’03
[Figueiredo and Jain, 02] Figueiredo, M. and Jain, A. (2002).
Unsupervised learning of finite mixture models. IEEE Trans. Pattern
Anal. Mach. Intell., vol. 24, no. 3, pp. 381–399.
[Constantinopoulos and Likas,07] Constantinopoulos, C. and Likas, A.
(2007). Unsupervised Learning of Gaussian Mixtures based on
Variational Component Splitting. IEEE Trans. Neural Networks, vol.
18., no. 3, 745–755.
41/43
42. References (2)
[Peñalver et al.,09] Peñalver, A., Escolano, F., Sáez, J.M. (2009). Learning
Gaussian Mixture Models with Entropy-Based Criteria. IEEE Trans.
on Neural Networks, vol. 20, no. 11, pp. 1756–1771.
[Dellaportas,06] Dellaportas, P. and Papageorgiou I. (2006).
Multivariate mixtures of normals with unknown number of
components. Statistics and Computing, vol. 16, no. 1, pp. 57–68
[Hero and Michel,02] Hero, A. and Michel, O. (2002). Applications of
entropic spanning graphs. IEEE Signal Processing Magazine, vol. 19,
no. 5, pp. 85–95
[Watanabe et al.,09] Watanabe, K., Akaho, S., Omachi, S. (2009).
Variational Bayesian mixture model on a subspace of exponential
family distributions. IEEE Transactions on Neural Networks, vol. 20,
no. 11, pp. 1783–1796.
42/43
43. References (3)
[Escolano et al.,10] Escolano, F., Peñalver, A. and Bonev, B. (2010).
Entropy-based Variational Scheme for Fast Bayes Learning of
Gaussian Mixtures. SSPR’2010 (accepted)
[Rajwade et al.,09] Rajwade, A., Banerjee, A., Rangarajan, A. (2009).
Probability Density Estimation Using Isocontours
and Isosurfaces: Applications to Information-Theoretic Image
Registration. IEEE Trans. Pattern Anal. Mach. Intell. 31(3):
475–491
[Chen et al.,10] Chen, T., Vemuri, B.C., Rangarajan, A.,
Eisenschenk, S.J. (2010). Group-wise Point-set Registration
using a novel CDF-based Havrda-Charvát Divergence. Int. J. Comput.
Vis., 86(1):111–124.
43/43