1. Machine Learning
Devdatt Dubhashi
Department of Computer Science and Engineering
Chalmers University
Gothenburg, Sweden.
LP3 2007
2. Outline
1 k-Means Clustering
2 Mixtures of Gaussians and EM Algorithm
4. Clustering
Data set {x_1, ..., x_N} of N observations of a random d-dimensional Euclidean variable x.
The goal is to partition the data set into K clusters (K known).
Intuitively, the points within a cluster should be "close" to each other compared to points outside the cluster.
5. Cluster centers and assignments
Find a set of centers µ_k, k ∈ [K].
Assign each data point to one of the centers so as to minimize the sum of the squares of the distances to the assigned centers.
6. Assignment and Distortion
Introduce binary indicator variables

r_{n,k} := \begin{cases} 1, & \text{if } x_n \text{ is assigned to } \mu_k \\ 0, & \text{otherwise.} \end{cases}

Minimize the distortion measure (a small numerical sketch follows below)

J := \sum_{n \in [N]} \sum_{k \in [K]} r_{n,k} \, \| x_n - \mu_k \|^2 .
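As a minimal illustration, J can be computed directly from this definition; the names X, centers and assign below are hypothetical, with assign[n] playing the role of the indicators r_{n,k}.

    import numpy as np

    def distortion(X, centers, assign):
        # X: (N, d) data points, centers: (K, d) cluster centers,
        # assign: (N,) index of the center each point is assigned to
        diffs = X - centers[assign]        # x_n - mu_{k(n)}
        return np.sum(diffs ** 2)          # J = sum_n ||x_n - mu_{k(n)}||^2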
7. Two-Step Optimization
Start with some initial values of µ_k. The basic iteration consists of two steps, repeated until convergence:
E Minimize J wrt r_{n,k}, keeping µ_k fixed.
M Minimize J wrt µ_k, keeping r_{n,k} fixed.
8. Two-Step Optimization: E Step
Minimize J wrt r_{n,k}, keeping µ_k fixed:

r_{n,k} := \begin{cases} 1, & \text{if } k = \arg\min_j \| x_n - \mu_j \|^2 \\ 0, & \text{otherwise.} \end{cases}
9. Two-Step Optimization: M Step
Minimize J wrt µ_k, keeping r_{n,k} fixed: J is a quadratic function of µ_k, so setting the derivative to zero gives

\sum_{n \in [N]} r_{n,k} \, (x_n - \mu_k) = 0,

hence

\mu_k = \frac{\sum_n r_{n,k} \, x_n}{\sum_n r_{n,k}} .

In words: set µ_k to be the mean of the points assigned to cluster k, hence the name K-means algorithm (a sketch of the full iteration follows below).
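A minimal numpy sketch of the two-step iteration, assuming random initialisation from the data and a fixed iteration cap (neither is prescribed by the slides); empty clusters are not handled.

    import numpy as np

    def k_means(X, K, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # initialise mu_k with K randomly chosen data points
        centers = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(n_iter):
            # E step: assign each x_n to its nearest center (the indicators r_{n,k})
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            assign = dists.argmin(axis=1)
            # M step: set mu_k to the mean of the points assigned to cluster k
            new_centers = np.array([X[assign == k].mean(axis=0) for k in range(K)])
            if np.allclose(new_centers, centers):   # J no longer decreases
                break
            centers = new_centers
        return centers, assign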
10. K-Means Algorithm Analysis
Since J decreases at each iteration, convergence is guaranteed.
But the algorithm may converge to a local rather than a global optimum.
16. K-Means and Image Segmentation
Image segmentation problem: partition an image into regions of homogeneous visual appearance, corresponding to objects or parts of objects.
Each pixel is a 3-dim point corresponding to the intensities of the red, blue and green channels.
Perform K-means and redraw the image, replacing each pixel by the corresponding center µ_k (a sketch follows below).
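A sketch of this procedure, assuming scikit-learn's KMeans is available (the slides do not prescribe any particular implementation):

    import numpy as np
    from sklearn.cluster import KMeans   # assumption: scikit-learn is available

    def segment_image(img, K):
        # img: (H, W, 3) array of R, G, B intensities; each pixel is a 3-dim point
        pixels = img.reshape(-1, 3).astype(float)
        km = KMeans(n_clusters=K, n_init=10).fit(pixels)
        # redraw the image, replacing each pixel by its cluster center mu_k
        recoloured = km.cluster_centers_[km.labels_]
        return recoloured.reshape(img.shape)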
18. K-Means Algorithm: Example
Original image
19. K-Means Algorithm: Example
20. K-Means Algorithm: Example
21. K-Means and Data Compression
Lossy compression, as opposed to lossless compression: we accept some errors in the reconstruction in return for a higher rate of compression.
Instead of storing all N data points, store only the identity of the assigned cluster for each point, plus the cluster centers.
Significant savings provided K ≪ N.
Each data point is approximated by its nearest center µ_k; the centers are called code-book vectors.
New data is compressed by finding the nearest center and storing only the label k of the corresponding cluster (see the sketch below).
This scheme is called Vector Quantization.
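A minimal encode/decode sketch of this scheme; the codebook array is assumed to hold the cluster centers from a K-means run such as the one above.

    import numpy as np

    def vq_encode(X, codebook):
        # store only the label k of the nearest code-book vector for each point
        dists = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
        return dists.argmin(axis=1)

    def vq_decode(labels, codebook):
        # lossy reconstruction: each point is replaced by its code-book vector
        return codebook[labels]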
22. K-Means and Data Compression: Example
Suppose the original image has N pixels comprising {R, G, B} values, each stored with 8-bit precision. Then the total space required is 24N bits.
If instead we first run K-means and transmit only the label of the corresponding cluster for each pixel, this takes ⌈log2 K⌉ bits per pixel, for a total of N⌈log2 K⌉ bits.
We also need to transmit the K code-book vectors, which requires 24K bits.
In the example, the original image has 240 × 180 = 43,200 pixels, requiring 24 × 43,200 = 1,036,800 bits.
The compressed images require 43,248 (K = 2), 86,472 (K = 3) and 173,040 (K = 10) bits; the totals are checked in the snippet below.
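A quick check of these totals (N⌈log2 K⌉ bits for the labels plus 24K bits for the code-book vectors):

    import math

    N = 240 * 180                                   # 43 200 pixels
    print(24 * N)                                   # 1 036 800 bits for the original image
    for K in (2, 3, 10):
        bits = N * math.ceil(math.log2(K)) + 24 * K
        print(K, bits)                              # 43 248, 86 472, 173 040 bits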
23. Mixtures of Gaussians: Motivation
Pure Gaussian distributions have limitations when it comes to modelling real-life data.
Example: "Old Faithful" eruption durations.
The data forms two dominant clumps.
A single Gaussian cannot model this data well.
A linear superposition of two Gaussians does much better.
25. Mixtures of Gaussians: Modelling
A linear combination of Gaussians can give rise to complex distributions p(x).
By using a sufficient number of Gaussians, and adjusting their means and covariances as well as the linear combination coefficients, we can model almost any continuous density to arbitrary accuracy.
26. Mixtures of Gaussians: Definition
A superposition of Gaussians of the form

p(x) := \sum_{k \in [K]} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k).

Each Gaussian density N(x | µ_k, Σ_k) is a component of the mixture, with its own mean and covariance.
The parameters π_k are mixing coefficients and satisfy 0 ≤ π_k ≤ 1 and \sum_k \pi_k = 1 (a small evaluation sketch follows below).
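A direct evaluation of this density, assuming scipy's multivariate normal pdf is available (the parameter names are illustrative):

    import numpy as np
    from scipy.stats import multivariate_normal   # assumption: scipy is available

    def mixture_density(x, pis, mus, Sigmas):
        # p(x) = sum_k pi_k N(x | mu_k, Sigma_k)
        return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=Sigma)
                   for pi, mu, Sigma in zip(pis, mus, Sigmas))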
29. Equivalent Definition: Latent Variable
We can introduce a latent binary vector z in which exactly one component equals 1 and the rest are 0, with p(z_k = 1) = π_k. This variable indicates the mixture component.
Given z, the conditional distribution is

p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k).

Inverting this using Bayes' rule,

\gamma(z_k) := p(z_k = 1 \mid x) = \frac{p(z_k = 1) \, p(x \mid z_k = 1)}{\sum_j p(z_j = 1) \, p(x \mid z_j = 1)} = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}

is the posterior probability, or responsibility, that component k takes for the observation x (computed in the sketch below).
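The responsibilities can be evaluated with the same ingredients as the density sketch above (again assuming scipy; names are illustrative):

    import numpy as np
    from scipy.stats import multivariate_normal

    def responsibilities(x, pis, mus, Sigmas):
        # gamma(z_k) = pi_k N(x | mu_k, Sigma_k) / sum_j pi_j N(x | mu_j, Sigma_j)
        weighted = np.array([pi * multivariate_normal.pdf(x, mean=mu, cov=Sigma)
                             for pi, mu, Sigma in zip(pis, mus, Sigmas)])
        return weighted / weighted.sum()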
32. Learning Mixtures
Suppose we have a data set of observations represented by an N × D matrix X := {x_1, ..., x_N}, and we want to model it as a mixture of K Gaussians.
We need to find the mixing coefficients π_k and the parameters of the component models, µ_k and Σ_k.
33. Learning Mixtures: The Means
Start with the log-likelihood function:

\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n \in [N]} \ln \sum_{k \in [K]} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k).

Setting the derivative wrt µ_k to zero, and assuming Σ_k is invertible, gives

\mu_k = \frac{1}{N_k} \sum_{n \in [N]} \gamma(z_{n,k}) \, x_n, \qquad \text{where } N_k := \sum_{n \in [N]} \gamma(z_{n,k}).
34. Learning Mixtures: The Means
Interpret N_k as the "effective number of points" assigned to cluster k.
Note that the mean µ_k for the kth Gaussian component is given by a weighted mean of all points in the data set.
The weighting factor for data point x_n is given by the posterior probability, or responsibility, of component k for generating x_n.
35. Learning Mixtures: The Covariances
Setting the derivative wrt Σ_k to zero, and assuming Σ_k is invertible, gives

\Sigma_k = \frac{1}{N_k} \sum_{n \in [N]} \gamma(z_{n,k}) \, (x_n - \mu_k)(x_n - \mu_k)^T,

which is the same as the single-Gaussian solution, but with each point weighted by the corresponding posterior probability.
36. Learning Mixtures: Mixing Coefficients
Setting the derivative wrt π_k to zero, while taking into account the constraint \sum_k \pi_k = 1 (Lagrange multipliers!), gives

\pi_k = \frac{N_k}{N}.

The mixing coefficient for the kth component is the average responsibility that the component takes for explaining the data set (the Lagrange-multiplier step is spelled out below).
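Spelling out the Lagrange-multiplier step (a standard derivation in the notation of the slides): maximise

\ln p(X \mid \pi, \mu, \Sigma) + \lambda \Bigl( \sum_k \pi_k - 1 \Bigr).

Setting the derivative wrt π_k to zero gives

\sum_{n \in [N]} \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda = 0.

Multiplying by π_k and summing over k gives N + λ = 0, so λ = -N; substituting back, π_k N = \sum_n \gamma(z_{n,k}) = N_k, hence π_k = N_k / N.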
37. Learning Mixtures: EM Algorithm
1 Initialize the means, covariances and mixing coefficients, then repeat:
2 E Step: Evaluate the responsibilities using the current parameters:

\gamma(z_{n,k}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}

3 M Step: Re-estimate the parameters using the current responsibilities:

\mu_k^{new} = \frac{1}{N_k} \sum_n \gamma(z_{n,k}) \, x_n

\Sigma_k^{new} = \frac{1}{N_k} \sum_n \gamma(z_{n,k}) \, (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T

\pi_k^{new} = \frac{N_k}{N}, \qquad \text{where } N_k := \sum_n \gamma(z_{n,k}).
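A compact numpy sketch of this loop; the random initialisation from the data, the fixed iteration count, and the use of scipy's multivariate normal pdf are assumptions of the sketch, not part of the slides.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        N, D = X.shape
        # initialise means, covariances and mixing coefficients
        mus = X[rng.choice(N, size=K, replace=False)]
        Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
        pis = np.full(K, 1.0 / K)
        for _ in range(n_iter):
            # E step: evaluate the responsibilities gamma(z_{n,k})
            dens = np.column_stack([multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
                                    for k in range(K)])         # shape (N, K)
            gamma = pis * dens
            gamma /= gamma.sum(axis=1, keepdims=True)
            # M step: re-estimate the parameters from the responsibilities
            Nk = gamma.sum(axis=0)                               # effective numbers of points
            mus = (gamma.T @ X) / Nk[:, None]
            for k in range(K):
                diff = X - mus[k]
                Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
            pis = Nk / N
        return pis, mus, Sigmas, gamma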
41. EM vs K-Means
K-means performs a hard assignment of data points to clusters, i.e. each data point is assigned to a unique cluster.
The EM algorithm makes a soft assignment based on posterior probabilities.
K-means can be derived as the limit of the EM algorithm applied to a particular instance of Gaussian mixtures (made explicit below).
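One way to make the limit explicit (a standard argument; the shared covariance εI is assumed here to be the "particular instance" referred to above): take every component to have covariance Σ_k = εI. Then

\gamma(z_{n,k}) = \frac{\pi_k \exp\bigl( -\| x_n - \mu_k \|^2 / 2\epsilon \bigr)}{\sum_j \pi_j \exp\bigl( -\| x_n - \mu_j \|^2 / 2\epsilon \bigr)},

and as ε → 0 the term with the smallest \| x_n - \mu_j \|^2 dominates, so γ(z_{n,k}) → r_{n,k}: the soft assignments become the hard assignments of K-means, and the EM mean update reduces to the K-means mean update.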