This is a presentation that I gave to my research group. It is about probabilistic extensions to Principal Components Analysis, as proposed by Tipping and Bishop.
Probabilistic PCA, EM, and more
1. Principal Components Analysis, Expectation Maximization, and more
Harsh Vardhan Sharma¹,²
¹ Statistical Speech Technology Group
Beckman Institute for Advanced Science and Technology
² Dept. of Electrical & Computer Engineering
University of Illinois at Urbana-Champaign
Group Meeting: December 01, 2009
2. Material for this presentation derived from:
Probabilistic Principal Component Analysis. Tipping and Bishop, Journal of the Royal Statistical Society, Series B (1999) 61:3, 611–622.
Mixtures of Principal Component Analyzers. Tipping and Bishop, Proceedings of the Fifth International Conference on Artificial Neural Networks (1997), 13–18.
3. Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
4. PCA
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
5. PCA
standard PCA in 1 slide
A well-established technique for dimensionality reduction
Most common derivation of PCA → linear projection maximizing the variance in the projected space:
1: Organize the observed data $\{y_i \in \mathbb{R}^p\}_{i=1}^N$ in a $p \times N$ matrix $X$ after subtracting the mean $\bar{y} = \frac{1}{N}\sum_{i=1}^N y_i$.
2: Obtain the $k$ principal axes $\{w_j \in \mathbb{R}^p\}_{j=1}^k$ :: the $k$ eigenvectors of the data-covariance matrix $S_y = \frac{1}{N}\sum_{i=1}^N (y_i - \bar{y})(y_i - \bar{y})^T$ corresponding to the $k$ largest eigenvalues ($k < p$).
3: The $k$ principal components of $y_i$ are $x_i = W^T (y_i - \bar{y})$, where $W = (w_1, \ldots, w_k)$. The components of $x_i$ are then uncorrelated and the projection-covariance matrix $S_x$ is diagonal with the $k$ largest eigenvalues of $S_y$.
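To make the three steps concrete, here is a minimal NumPy sketch of the recipe above (my illustration, not from the slides; function and variable names are mine):

```python
import numpy as np

def standard_pca(Y, k):
    """Standard PCA by eigendecomposition. Y: (N, p) data matrix; k: number of components."""
    y_bar = Y.mean(axis=0)                  # sample mean
    Yc = Y - y_bar                          # step 1: subtract the mean
    S_y = Yc.T @ Yc / Y.shape[0]            # p x p data-covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S_y)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest eigenvalues
    W = eigvecs[:, order]                   # step 2: k principal axes (p x k)
    X = Yc @ W                              # step 3: principal components (N x k)
    return W, X, eigvals[order]
```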
6. PCA
Things to think about
Assumptions behind standard PCA
1 Linearity
problem is that of changing the basis — we have measurements in a particular basis and want to see the data in a basis that best expresses it. We restrict ourselves to bases that are linear combinations of the measurement-basis.
2 Large variances = important structure
believe that data has high SNR ⇒ dynamics of interest assumed to exist along directions
with largest variance; lower-variance directions pertain to noise.
3 Principal Components are orthogonal
decorrelation-based dimensional reduction removes redundancy in the original
data-representation.
7. PCA
Things to think about
Limitations of standard PCA
1 Decorrelation not always the best approach
Useful only when first- and second-order statistics are sufficient for revealing all dependencies in the data (e.g., Gaussian-distributed data).
2 Linearity assumption not always justifiable
Not valid when the data's structure is captured by a nonlinear function of the dimensions in the measurement-basis.
3 Non-parametricity
No probabilistic model for observed data. (Advantages of probabilistic extension coming up!)
4 Calculation of Data Covariance Matrix
When p and N are very large, difficulties arise in terms of computational complexity and data scarcity.
8. PCA
Things to think about
Handling the decorrelation and linearity caveats
Example solutions: Independent Components Analysis (imposing a more general notion of statistical dependency), kernel PCA (nonlinearly transforming the data to a more appropriate naive basis).
9. PCA
Things to think about
Handling the non-parametricity caveat: motivation for probabilistic PCA
probabilistic perspective can provide a log-likelihood measure for
comparison with other density-estimation techniques.
Bayesian inference methods may be applied (e.g., for model
comparison).
pPCA can be utilized as a constrained Gaussian density model:
potential applications → classification, novelty detection.
multiple pPCA models can be combined as a probabilistic mixture.
standard PCA uses a naive way to capture covariance (squared distance from observed data); pPCA defines a proper covariance structure whose parameters can be estimated via EM.
10. PCA
Things to think about
Handling the computational caveat: motivation for EM-based PCA
computing the sample covariance itself is $O(Np^2)$.
data scarcity: often don't have enough data for the sample covariance to be full-rank.
computational complexity: direct diagonalization is $O(p^3)$.
standard PCA doesn't deal properly with missing data; EM algorithms can estimate ML values of missing data.
EM-based PCA: doesn't require computing the sample covariance, and has $O(kNp)$ complexity.
11. Basic Model
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
12. Basic Model
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA →
standard PCA
$$y_m - \mu = C x_m + \epsilon_m$$
where for $m = 1, \ldots, N$
$x_m \in \mathbb{R}^k \sim \mathcal{N}(0, Q)$ – (hidden) state vector
$y_m \in \mathbb{R}^p$ – output/observable vector
$C \in \mathbb{R}^{p \times k}$ – observation/measurement matrix
$\epsilon_m \in \mathbb{R}^p \sim \mathcal{N}(0, R)$ – zero-mean white Gaussian noise
So, we have $y_m \sim \mathcal{N}\left(\mu,\; W = CQC^T + R = CC^T + R\right)$.
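A short sampling sketch may help make the generative model concrete (my illustration; the dimensions and the choice R = 0.1·I are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, N = 10, 3, 500                     # arbitrary sizes for illustration

mu = rng.normal(size=p)                  # mean vector
C = rng.normal(size=(p, k))              # observation matrix, rank k
R = 0.1 * np.eye(p)                      # noise covariance (diagonal here)

X = rng.normal(size=(N, k))              # latent states x_m ~ N(0, I), i.e. Q = I
E = rng.multivariate_normal(np.zeros(p), R, size=N)  # noise eps_m ~ N(0, R)
Y = mu + X @ C.T + E                     # observations y_m = mu + C x_m + eps_m

# Marginally, y_m ~ N(mu, W) with W = C C^T + R:
W = C @ C.T + R
```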
13. Basic Model
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA →
standard PCA
$$y_m - \mu = C x_m + \epsilon_m$$
The restriction to a zero-mean noise source is not a loss of generality.
All of the structure in $Q$ can be moved into $C$, so we can use $Q = I_{k \times k}$.
$R$ in general cannot be restricted, since the $y_m$ are observed and cannot be whitened/rescaled.
Assumed: $C$ is of rank $k$; $Q$ and $R$ are always full rank.
14. Inference, Learning
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
15. Inference, Learning
Latent Variable Models and Probability Computations
Case 1 :: we know what the hidden states are; we just want to estimate them (can write down $C$ a priori based on problem-physics).
estimating the states given observations and a model → inference
Case 2 :: we have observation data; the observation process is mostly unknown; no explicit model for the "causes".
learning a few parameters that model the data well (in the ML sense) → learning
16. Inference, Learning
Latent Variable Models and Probability Computations
Inference
$y_m \sim \mathcal{N}\left(\mu, W = CC^T + R\right)$ gives us
$$P(x_m \mid y_m) = \frac{P(y_m \mid x_m) \cdot P(x_m)}{P(y_m)} = \frac{\mathcal{N}(\mu + Cx_m, R)\big|_{y_m} \cdot \mathcal{N}(0, I)\big|_{x_m}}{\mathcal{N}(\mu, W)\big|_{y_m}}$$
Therefore,
$$x_m \mid y_m \sim \mathcal{N}\left(\beta (y_m - \mu),\; I - \beta C\right)$$
where $\beta = C^T W^{-1} = C^T \left(CC^T + R\right)^{-1}$.
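The posterior above translates directly into a few lines of NumPy (a sketch, with names of my choosing):

```python
import numpy as np

def latent_posterior(Y, mu, C, R):
    """Posterior x_m | y_m ~ N(beta (y_m - mu), I - beta C) for the linear Gaussian model."""
    W = C @ C.T + R                        # marginal covariance of y (p x p)
    beta = C.T @ np.linalg.inv(W)          # beta = C^T W^{-1} (k x p)
    means = (Y - mu) @ beta.T              # posterior means, one row per y_m (N x k)
    cov = np.eye(C.shape[1]) - beta @ C    # posterior covariance, shared by all m (k x k)
    return means, cov
```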
17. Inference, Learning
Latent Variable Models and Probability Computations
Learning, via Expectation Maximization
Given a likelihood function $L(\theta; Y = \{y_m\}_m, X = \{x_m\}_m)$, where $\theta$ is the parameter vector, $Y$ is the observed data and $X$ represents the unobserved latent variables or missing values, the maximum likelihood estimate (MLE) of $\theta$ is obtained iteratively as follows:
Expectation step:
$$\tilde{Q}\left(\theta \mid \theta^{(u)}\right) = E_{X \mid Y, \theta^{(u)}}\left[\log L(\theta; Y, X)\right]$$
Maximization step:
$$\theta^{(u+1)} = \arg\max_{\theta} \tilde{Q}\left(\theta \mid \theta^{(u)}\right)$$
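The two steps fit a simple iterative skeleton; the sketch below assumes user-supplied e_step/m_step callables (hypothetical names, not from the paper):

```python
def em(theta0, e_step, m_step, n_iter=100, tol=1e-6):
    """Generic EM loop. e_step returns the expected sufficient statistics and the
    current log-likelihood; m_step maximizes the expected complete-data log-likelihood."""
    theta, prev_ll = theta0, -float("inf")
    for _ in range(n_iter):
        stats, ll = e_step(theta)          # E-step: build Q(theta | theta_u)
        theta = m_step(stats)              # M-step: theta_{u+1} = argmax Q
        if abs(ll - prev_ll) < tol:        # log-likelihood increases monotonically
            break
        prev_ll = ll
    return theta
```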
18. Inference, Learning
Latent Variable Models and Probability Computations
Learning, via Expectation Maximization, for linear Gaussian models
Use the solution to the inference problem for estimating the unknown latent variables / missing values $X$, given $Y$ and $\theta^{(u)}$. Then use this fictitious "complete" data to solve for $\theta^{(u+1)}$.
Expectation step: Obtain the conditional latent sufficient statistics $\langle x_m \rangle^{(u)}$, $\langle x_m x_m^T \rangle^{(u)}$ from
$$x_m \mid y_m \sim \mathcal{N}\left(\beta^{(u)} (y_m - \mu),\; I - \beta^{(u)} C^{(u)}\right)$$
where $\beta^{(u)} = C^{(u)T} W^{(u)-1} = C^{(u)T} \left(C^{(u)} C^{(u)T} + R^{(u)}\right)^{-1}$.
Maximization step: Choose $C, R$ to maximize the joint likelihood of $X, Y$.
19. pPCA
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
20. pPCA
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA →
standard PCA
$$y_m - \mu = C x_m + \epsilon_m$$
The restriction to a zero-mean noise source is not a loss of generality.
All of the structure in $Q$ can be moved into $C$, so we can use $Q = I_{k \times k}$.
$R$ in general cannot be restricted, since the $y_m$ are observed and cannot be whitened/rescaled.
Assumed: $C$ is of rank $k$; $Q$ and $R$ are always full rank.
So, we have $y_m \sim \mathcal{N}\left(\mu,\; W = CQC^T + R = CC^T + R\right)$.
21. pPCA
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA →
standard PCA
$k < p$: looking for a more parsimonious representation of the observed data.
:: linear Gaussian model → Factor Analysis ::
$R$ needs to be restricted: otherwise the learning procedure could explain all the structure in the data as noise (i.e., obtain maximal likelihood by choosing $C = 0$ and $R = W = $ data-sample covariance).
since $y_m \sim \mathcal{N}\left(\mu, W = CC^T + R\right)$, we can do no better than having the model-covariance equal the data-sample covariance.
22. pPCA
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA →
standard PCA
Factor Analysis ≡ restricting $R$ to be diagonal
$x_m \equiv \{x_{mi}\}_{i=1}^k$ – factors explaining the correlations between the observation variables $y_m \equiv \{y_{mj}\}_{j=1}^p$
$\{y_{mj}\}_{j=1}^p$ are conditionally independent given $\{x_{mi}\}_{i=1}^k$
$\epsilon_j$ – variability unique to a particular $y_{mj}$
$R = \mathrm{diag}(r_{jj})$ – the "uniquenesses"
different from standard PCA, which effectively treats covariance and variance identically
23. pPCA
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA →
standard PCA
Probabilistic PCA ≡ constraining $R$ to $\sigma^2 I$
Noise: $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$
$y_m \mid x_m \sim \mathcal{N}\left(\mu + Cx_m,\; \sigma^2 I\right)$
$y_m \sim \mathcal{N}\left(\mu,\; W = CC^T + \sigma^2 I\right)$
$x_m \mid y_m \sim \mathcal{N}\left(\beta (y_m - \mu),\; I - \beta C\right)$ where $\beta = C^T W^{-1} = C^T \left(CC^T + \sigma^2 I\right)^{-1}$
$x_m \mid y_m \sim \mathcal{N}\left(\kappa (y_m - \mu),\; \sigma^2 M^{-1}\right)$ where $\kappa = M^{-1} C^T = \left(C^T C + \sigma^2 I\right)^{-1} C^T$
24. pPCA
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA →
standard PCA
$$\beta = C^T W^{-1} = C^T \left(CC^T + \sigma^2 I_{p \times p}\right)^{-1}$$
$$\kappa = M^{-1} C^T = \left(C^T C + \sigma^2 I_{k \times k}\right)^{-1} C^T$$
It can be shown (by applying the Woodbury matrix identity to $M$) that
1 $I - \beta C = \sigma^2 M^{-1}$
2 $\beta = \kappa$
Then, by letting $\sigma^2 \to 0$, we obtain $x_m \mid y_m \to \delta\left(x_m - (C^T C)^{-1} C^T (y_m - \mu)\right)$, which is standard PCA.
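Both identities are easy to confirm numerically (a quick check on random matrices; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, sigma2 = 8, 2, 0.5
C = rng.normal(size=(p, k))

W = C @ C.T + sigma2 * np.eye(p)          # p x p
M = C.T @ C + sigma2 * np.eye(k)          # k x k
beta = C.T @ np.linalg.inv(W)             # beta = C^T W^{-1}
kappa = np.linalg.inv(M) @ C.T            # kappa = M^{-1} C^T

assert np.allclose(beta, kappa)                                      # identity 2
assert np.allclose(np.eye(k) - beta @ C, sigma2 * np.linalg.inv(M))  # identity 1
```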
25. pPCA
closed-form ML learning
log-likelihood for the pPCA model,
$$\mathcal{L}(\theta; Y) = -\frac{N}{2}\left\{ p \log(2\pi) + \log |W| + \mathrm{trace}\left(W^{-1} S_y\right) \right\}$$
where
$W = CC^T + \sigma^2 I$
$S_y = \frac{1}{N}\sum_{m=1}^N (y_m - \mu)(y_m - \mu)^T$
ML estimates of $\mu$ and $S_y$ are the sample mean and sample covariance matrix respectively:
$$\hat{\mu} = \frac{1}{N}\sum_{m=1}^N y_m, \qquad \hat{S}_y = \frac{1}{N}\sum_{m=1}^N (y_m - \hat{\mu})(y_m - \hat{\mu})^T$$
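In code, the log-likelihood is a few lines once $W$ and $S_y$ are in hand (a sketch; slogdet/solve are used for numerical stability):

```python
import numpy as np

def ppca_log_likelihood(Y, C, sigma2):
    """pPCA log-likelihood with the ML plug-in mu_hat. Y: (N, p)."""
    N, p = Y.shape
    Yc = Y - Y.mean(axis=0)                      # subtract mu_hat
    S_y = Yc.T @ Yc / N                          # sample covariance
    W = C @ C.T + sigma2 * np.eye(p)             # model covariance
    _, logdet = np.linalg.slogdet(W)             # log |W|
    return -0.5 * N * (p * np.log(2 * np.pi) + logdet
                       + np.trace(np.linalg.solve(W, S_y)))
```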
26. pPCA
closed-form ML learning
ML estimates of $C$ and $\sigma^2$ (i.e., of $C$ and $R$):
$$\hat{C} = U_k \left(\Lambda_k - \sigma^2 I\right)^{1/2} V$$
maps the latent space (containing $X$) to the principal subspace of $Y$
columns of $U_k \in \mathbb{R}^{p \times k}$ – principal eigenvectors of $S_y$
$\Lambda_k \in \mathbb{R}^{k \times k}$ – diagonal matrix of corresponding eigenvalues of $S_y$
$V \in \mathbb{R}^{k \times k}$ – arbitrary rotation matrix, can be set to $I_{k \times k}$
$$\hat{\sigma}^2 = \frac{1}{p - k} \sum_{r = k + 1}^{p} \lambda_r$$
variance lost in the projection process, averaged over the number of dimensions projected out/away
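The closed-form solution is a few lines on top of an eigendecomposition (a sketch, taking V = I as the slide permits):

```python
import numpy as np

def ppca_closed_form(Y, k):
    """Closed-form ML estimates of C and sigma^2 for pPCA."""
    N, p = Y.shape
    Yc = Y - Y.mean(axis=0)
    S_y = Yc.T @ Yc / N
    eigvals, eigvecs = np.linalg.eigh(S_y)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]           # descending order
    sigma2 = eigvals[k:].mean()                                  # lost variance, averaged
    C = eigvecs[:, :k] @ np.diag(np.sqrt(eigvals[:k] - sigma2))  # U_k (L_k - s2 I)^{1/2}, V = I
    return C, sigma2
```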
27. EM for pPCA
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
28. EM for pPCA
EM-based ML learning for linear Gaussian models
Expectation step:
$$x_m \mid y_m \sim \mathcal{N}\left(\beta^{(u)} (y_m - \mu),\; I - \beta^{(u)} C^{(u)}\right)$$
where $\beta^{(u)} = C^{(u)T} W^{(u)-1} = C^{(u)T} \left(C^{(u)} C^{(u)T} + R^{(u)}\right)^{-1}$.
Maximization step: Choose $C, R$ to maximize the joint likelihood of $X, Y$.
29. EM for pPCA
EM-based ML learning for probabilistic PCA
Expectation step:
$$x_m \mid y_m \sim \mathcal{N}\left(\beta^{(u)} (y_m - \mu),\; I - \beta^{(u)} C^{(u)}\right)$$
where $\beta^{(u)} = C^{(u)T} W^{(u)-1} = C^{(u)T} \left(C^{(u)} C^{(u)T} + \sigma^{2(u)} I\right)^{-1}$.
Maximization step: Choose $C, \sigma^2$ to maximize the joint likelihood of $X, Y$.
30. EM for pPCA
EM-based ML learning for probabilistic PCA
Expectation step:
$$x_m \mid y_m \sim \mathcal{N}\left(\kappa^{(u)} (y_m - \mu),\; \sigma^{2(u)} M^{(u)-1}\right)$$
where $\kappa^{(u)} = M^{(u)-1} C^{(u)T} = \left(C^{(u)T} C^{(u)} + \sigma^{2(u)} I\right)^{-1} C^{(u)T}$.
Maximization step: Choose $C, \sigma^2$ to maximize the joint likelihood of $X, Y$.
31. EM for pPCA
EM-based ML learning for probabilistic PCA
Expectation step: Compute for $m = 1, \ldots, N$
$$\langle x_m \rangle = M^{(u)-1} C^{(u)T} \left(y_m - \hat{\mu}\right)$$
$$\langle x_m x_m^T \rangle = \sigma^{2(u)} M^{(u)-1} + \langle x_m \rangle \langle x_m \rangle^T$$
Maximization step: Set
$$C^{(u+1)} = \left[\sum_{m=1}^N \left(y_m - \hat{\mu}\right) \langle x_m \rangle^T\right] \left[\sum_{m=1}^N \langle x_m x_m^T \rangle\right]^{-1}$$
$$\sigma^{2(u+1)} = \frac{1}{Np} \sum_{m=1}^N \left\{ \left\|y_m - \hat{\mu}\right\|^2 - 2 \langle x_m \rangle^T C^{(u+1)T} \left(y_m - \hat{\mu}\right) + \mathrm{trace}\left(\langle x_m x_m^T \rangle C^{(u+1)T} C^{(u+1)}\right) \right\}$$
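Put together, the two steps give the following EM loop (a sketch of the updates above; the random initialization is my own choice, since the paper doesn't discuss one):

```python
import numpy as np

def ppca_em(Y, k, n_iter=200, seed=0):
    """EM for pPCA. Y: (N, p). Returns C and sigma^2."""
    rng = np.random.default_rng(seed)
    N, p = Y.shape
    Yc = Y - Y.mean(axis=0)                          # subtract mu_hat once
    C, sigma2 = rng.normal(size=(p, k)), 1.0         # arbitrary initialization
    for _ in range(n_iter):
        # E-step: posterior moments <x_m> and sum_m <x_m x_m^T>
        Minv = np.linalg.inv(C.T @ C + sigma2 * np.eye(k))
        X = Yc @ (Minv @ C.T).T                      # rows are <x_m>
        Sxx = N * sigma2 * Minv + X.T @ X            # sum_m <x_m x_m^T>
        # M-step
        C_new = (Yc.T @ X) @ np.linalg.inv(Sxx)
        sigma2 = (np.sum(Yc**2)
                  - 2 * np.sum((Yc @ C_new) * X)
                  + np.trace(Sxx @ C_new.T @ C_new)) / (N * p)
        C = C_new
    return C, sigma2
```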
32. EM for pPCA
EM-based ML learning for probabilistic PCA
$$C^{(u+1)} = \hat{S}_y C^{(u)} \left(\sigma^{2(u)} I + M^{(u)-1} C^{(u)T} \hat{S}_y C^{(u)}\right)^{-1}$$
$$\sigma^{2(u+1)} = \frac{1}{p} \mathrm{trace}\left(\hat{S}_y - \hat{S}_y C^{(u)} M^{(u)-1} C^{(u+1)T}\right)$$
Convergence: the only stable local extremum is the global maximum, at which the true principal subspace is found. The paper doesn't discuss any initialization scheme(s).
Complexity: require terms of the form $SC$ and $\mathrm{trace}(S)$
Computing $SC$ as $\sum_m y_m \left(y_m^T C\right)$ is $O(kNp)$ and more efficient than $\left(\sum_m y_m y_m^T\right) C$, which is equivalent to finding $S$ explicitly ($O(Np^2)$).
Very efficient for $k \ll p$.
Require $\mathrm{trace}(S)$, not $S$ ⇒ computing only the variance along each coordinate is sufficient.
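The complexity claim is easy to see in code: the fast route never materializes the p × p covariance (a small demonstration with arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, k = 1000, 200, 5
Y = rng.normal(size=(N, p))
Yc = Y - Y.mean(axis=0)
C = rng.normal(size=(p, k))

SC_fast = Yc.T @ (Yc @ C) / N     # sum_m y_m (y_m^T C) / N: O(kNp), no p x p matrix
S = Yc.T @ Yc / N                 # forming S explicitly: O(Np^2)
SC_slow = S @ C

assert np.allclose(SC_fast, SC_slow)
# trace(S) needs only the per-coordinate variances:
assert np.isclose(np.trace(S), Yc.var(axis=0).sum())
```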
33. Mixture of pPCAs
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
34. Mixture of pPCAs
Mixture of pPCAs :: the model
Likelihood of observed data
$$\mathcal{L}\left(\theta = \left\{\mu_r, C_r, \sigma_r^2\right\}_{r=1}^M;\; Y\right) = \sum_{m=1}^N \log\left( \sum_{r=1}^M \pi_r \cdot p\left(y_m \mid r\right) \right)$$
where, for $m = 1, \ldots, N$ and $r = 1, \ldots, M$
$y_m \mid r \sim \mathcal{N}\left(\mu_r,\; W_r = C_r C_r^T + \sigma_r^2 I\right)$ :: a single pPCA model
$\{\pi_r\}_{r=1}^M$, $\pi_r \geq 0$, $\sum_{r=1}^M \pi_r = 1$ :: mixture weights
$M$ independent latent variables $x_{mr}$ for each $y_m$.
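A sketch of this observed-data log-likelihood (my function names; SciPy's multivariate normal supplies p(y_m | r)):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mppca_log_likelihood(Y, pis, mus, Cs, sigma2s):
    """Log-likelihood of a mixture of pPCA models.
    pis: (M,) weights; mus: list of (p,); Cs: list of (p, k); sigma2s: (M,)."""
    p = Y.shape[1]
    log_probs = np.stack([
        np.log(pi) + multivariate_normal.logpdf(Y, mean=mu, cov=C @ C.T + s2 * np.eye(p))
        for pi, mu, C, s2 in zip(pis, mus, Cs, sigma2s)
    ], axis=1)                                      # (N, M): log(pi_r * p(y_m | r))
    mx = log_probs.max(axis=1, keepdims=True)       # log-sum-exp over components
    return np.sum(mx[:, 0] + np.log(np.exp(log_probs - mx).sum(axis=1)))
```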
35. Mixture of pPCAs
Mixture of pPCAs :: EM-based ML learning
Stage 1: New estimates of the component-specific $\pi$, $\mu$
Expectation step: component's responsibility for generating an observation
$$R_{mr}^{(u+1)} = P^{(u)}\left(r \mid y_m\right) = \frac{p^{(u)}\left(y_m \mid r\right) \cdot \pi_r^{(u)}}{\sum_{r'=1}^M p^{(u)}\left(y_m \mid r'\right) \cdot \pi_{r'}^{(u)}}$$
Maximization step:
$$\pi_r^{(u+1)} = \frac{1}{N} \sum_{m=1}^N R_{mr}^{(u+1)}, \qquad \hat{\mu}_r^{(u+1)} = \frac{\sum_{m=1}^N R_{mr}^{(u+1)} y_m}{\sum_{m=1}^N R_{mr}^{(u+1)}}$$
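Stage 1 in code (a sketch; the densities p(y_m | r) come from SciPy as above):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mppca_stage1(Y, pis, mus, Cs, sigma2s):
    """Responsibilities R_mr, then updated mixture weights and means."""
    p = Y.shape[1]
    weighted = np.stack([
        pi * multivariate_normal.pdf(Y, mean=mu, cov=C @ C.T + s2 * np.eye(p))
        for pi, mu, C, s2 in zip(pis, mus, Cs, sigma2s)
    ], axis=1)                                          # (N, M): pi_r * p(y_m | r)
    R = weighted / weighted.sum(axis=1, keepdims=True)  # E-step: R_mr = P(r | y_m)
    pis_new = R.mean(axis=0)                            # pi_r = (1/N) sum_m R_mr
    mus_new = (R.T @ Y) / R.sum(axis=0)[:, None]        # responsibility-weighted means
    return R, pis_new, mus_new
```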
36. Mixture of pPCAs
Mixture of pPCAs :: EM-based ML learning
Stage 2: New estimates of the component-specific $C$, $\sigma^2$
Expectation step: Compute for $m = 1, \ldots, N$ and $r = 1, \ldots, M$
$$\langle x_{mr} \rangle = M_r^{(u)-1} C_r^{(u)T} \left(y_m - \hat{\mu}_r^{(u+1)}\right)$$
$$\langle x_{mr} x_{mr}^T \rangle = \sigma_r^{2(u)} M_r^{(u)-1} + \langle x_{mr} \rangle \langle x_{mr} \rangle^T$$
Maximization step: Set for $r = 1, \ldots, M$
$$C_r^{(u+1)} = \left[\sum_{m=1}^N R_{mr}^{(u+1)} \left(y_m - \hat{\mu}_r^{(u+1)}\right) \langle x_{mr} \rangle^T\right] \left[\sum_{m=1}^N R_{mr}^{(u+1)} \langle x_{mr} x_{mr}^T \rangle\right]^{-1}$$
$$\sigma_r^{2(u+1)} = \frac{1}{\pi_r^{(u+1)} N p} \sum_{m=1}^N R_{mr}^{(u+1)} \left\{ \left\|y_m - \hat{\mu}_r^{(u+1)}\right\|^2 - 2 \langle x_{mr} \rangle^T C_r^{(u+1)T} \left(y_m - \hat{\mu}_r^{(u+1)}\right) + \mathrm{trace}\left(\langle x_{mr} x_{mr}^T \rangle C_r^{(u+1)T} C_r^{(u+1)}\right) \right\}$$
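And stage 2 (a sketch of the responsibility-weighted updates; it reuses R and mus_new from the stage-1 sketch):

```python
import numpy as np

def mppca_stage2(Y, R, mus_new, Cs, sigma2s):
    """Per-component E-step moments and M-step updates of C_r and sigma_r^2."""
    p = Y.shape[1]
    Cs_out, sigma2s_out = [], []
    for r, (C, s2) in enumerate(zip(Cs, sigma2s)):
        k = C.shape[1]
        Yc = Y - mus_new[r]                                  # y_m - mu_r
        w = R[:, r]                                          # responsibilities R_mr
        Minv = np.linalg.inv(C.T @ C + s2 * np.eye(k))       # M_r^{-1}
        X = Yc @ (Minv @ C.T).T                              # rows are <x_mr>
        Sxx = s2 * w.sum() * Minv + (w[:, None] * X).T @ X   # sum_m R_mr <x x^T>
        C_new = ((w[:, None] * Yc).T @ X) @ np.linalg.inv(Sxx)
        s2_new = (np.sum(w * np.sum(Yc**2, axis=1))
                  - 2 * np.sum(w * np.sum((Yc @ C_new) * X, axis=1))
                  + np.trace(Sxx @ C_new.T @ C_new)) / (w.sum() * p)
        Cs_out.append(C_new)
        sigma2s_out.append(s2_new)
    return Cs_out, sigma2s_out
```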
37. Mixture of pPCAs
Mixture of pPCAs :: EM-based ML learning
$$C_r^{(u+1)} = \hat{S}_{y_r}^{(u+1)} C_r^{(u)} \left( \sigma_r^{2(u)} I + M_r^{(u)-1} C_r^{(u)T} \hat{S}_{y_r}^{(u+1)} C_r^{(u)} \right)^{-1}$$
$$\sigma_r^{2(u+1)} = \frac{1}{p} \mathrm{trace}\left( \hat{S}_{y_r}^{(u+1)} - \hat{S}_{y_r}^{(u+1)} C_r^{(u)} M_r^{(u)-1} C_r^{(u+1)T} \right)$$
where
$$\hat{S}_{y_r}^{(u+1)} = \frac{1}{\pi_r^{(u+1)} N} \sum_{m=1}^N R_{mr}^{(u+1)} \left(y_m - \hat{\mu}_r^{(u+1)}\right) \left(y_m - \hat{\mu}_r^{(u+1)}\right)^T$$
$$M_r^{(u)} = \sigma_r^{2(u)} I + C_r^{(u)T} C_r^{(u)}$$
38. Mixture of pPCAs
Mixture of pPCAs :: EM-based ML learning
$$C_r^{(u+1)} = \hat{S}_{y_r}^{(u+1)} C_r^{(u)} \left( \sigma_r^{2(u)} I + M_r^{(u)-1} C_r^{(u)T} \hat{S}_{y_r}^{(u+1)} C_r^{(u)} \right)^{-1}$$
$$\sigma_r^{2(u+1)} = \frac{1}{p} \mathrm{trace}\left( \hat{S}_{y_r}^{(u+1)} - \hat{S}_{y_r}^{(u+1)} C_r^{(u)} M_r^{(u)-1} C_r^{(u+1)T} \right)$$
Complexity: require terms of the form $SC$ and $\mathrm{trace}(S)$
Computing $SC$ as $\sum_m y_m \left(y_m^T C\right)$ is $O(kNp)$ and more efficient than $\left(\sum_m y_m y_m^T\right) C$, which is equivalent to finding $S$ explicitly ($O(Np^2)$).
Very efficient for $k \ll p$.
Require $\mathrm{trace}(S)$, not $S$ ⇒ computing only the variance along each coordinate is sufficient.