This is a presentation that I gave to my research group. It is about probabilistic extensions to Principal Components Analysis, as proposed by Tipping and Bishop.
Probabilistic PCA, EM, and more
1. Principal Components Analysis, Expectation Maximization, and more
Harsh Vardhan Sharma¹,²
¹ Statistical Speech Technology Group
Beckman Institute for Advanced Science and Technology
² Dept. of Electrical & Computer Engineering
University of Illinois at Urbana-Champaign
Group Meeting: December 01, 2009
2. Material for this presentation derived from:
Probabilistic Principal Component Analysis. Tipping and Bishop, Journal of the Royal Statistical Society, Series B (1999) 61:3, 611–622.
Mixtures of Principal Component Analyzers. Tipping and Bishop, Proceedings of the Fifth International Conference on Artificial Neural Networks (1997), 13–18.
3. Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
4. PCA
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
5. PCA
standard PCA in 1 slide
A well-established technique for dimensionality reduction
Most common derivation of PCA → linear projection maximizing the variance in the projected space:
1: Organize the observed data $\{y_i \in \mathbb{R}^p\}_{i=1}^N$ in a $p \times N$ matrix $X$ after subtracting the mean $\bar{y} = \frac{1}{N}\sum_{i=1}^N y_i$.
2: Obtain the $k$ principal axes $\{w_j \in \mathbb{R}^p\}_{j=1}^k$ :: the $k$ eigenvectors of the data-covariance matrix $S_y = \frac{1}{N}\sum_{i=1}^N (y_i - \bar{y})(y_i - \bar{y})^T$ corresponding to the $k$ largest eigenvalues ($k < p$).
3: The $k$ principal components of $y_i$ are $x_i = W^T (y_i - \bar{y})$, where $W = (w_1, \ldots, w_k)$. The components of $x_i$ are then uncorrelated and the projection-covariance matrix $S_x$ is diagonal with the $k$ largest eigenvalues of $S_y$.
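To make the three steps concrete, here is a minimal NumPy sketch of the recipe above (my illustration, not from the slides; function and variable names are mine):

```python
import numpy as np

def standard_pca(Y, k):
    """Standard PCA by eigendecomposition. Y: (N, p) data matrix; k: number of components."""
    y_bar = Y.mean(axis=0)                  # sample mean
    Yc = Y - y_bar                          # step 1: subtract the mean
    S_y = Yc.T @ Yc / Y.shape[0]            # p x p data-covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S_y)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest eigenvalues
    W = eigvecs[:, order]                   # step 2: k principal axes (p x k)
    X = Yc @ W                              # step 3: principal components (N x k)
    return W, X, eigvals[order]
```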
6. PCA
Things to think about
Assumptions behind standard PCA
1 Linearity
problem is that of changing the basis — we have measurements in a particular basis and want to see the data in a basis that best expresses it. We restrict ourselves to bases that are linear combinations of the measurement-basis.
2 Large variances = important structure
believe that data has high SNR ⇒ dynamics of interest assumed to exist along directions
with largest variance; lower-variance directions pertain to noise.
3 Principal Components are orthogonal
decorrelation-based dimensional reduction removes redundancy in the original
data-representation.
7. PCA
Things to think about
Limitations of standard PCA
1 Decorrelation not always the best approach
Useful only when first- and second-order statistics are sufficient for revealing all dependencies in the data (e.g., Gaussian-distributed data).
2 Linearity assumption not always justifiable
Not valid when the data's structure is captured by a nonlinear function of the dimensions in the measurement-basis.
3 Non-parametricity
No probabilistic model for observed data. (Advantages of probabilistic extension coming up!)
4 Calculation of Data Covariance Matrix
When p and N are very large, difficulties arise in terms of computational complexity and data scarcity.
8. PCA
Things to think about
Handling the decorrelation and linearity caveats
Example solutions: Independent Components Analysis (imposing a more general notion of statistical dependency), kernel PCA (nonlinearly transforming the data to a more appropriate naive basis).
9. PCA
Things to think about
Handling the non-parametricity caveat: motivation for probabilistic PCA
probabilistic perspective can provide a log-likelihood measure for
comparison with other density-estimation techniques.
Bayesian inference methods may be applied (e.g., for model
comparison).
pPCA can be utilized as a constrained Gaussian density model:
potential applications → classification, novelty detection.
multiple pPCA models can be combined as a probabilistic mixture.
standard PCA uses a naive way to capture covariance (squared distance from observed data); pPCA defines a proper covariance structure whose parameters can be estimated via EM.
10. PCA
Things to think about
Handling the computational caveat: motivation for EM-based PCA
computing the sample covariance itself is $O(Np^2)$.
data scarcity: often don't have enough data for the sample covariance to be full-rank.
computational complexity: direct diagonalization is $O(p^3)$.
standard PCA doesn't deal properly with missing data; EM algorithms can estimate ML values of missing data.
EM-based PCA: doesn't require computing the sample covariance, and has $O(kNp)$ complexity.
11. Basic Model
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
12. Basic Model
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA →
standard PCA
$$y_m - \mu = C x_m + \epsilon_m$$
where for $m = 1, \ldots, N$
$x_m \in \mathbb{R}^k \sim \mathcal{N}(0, Q)$ – (hidden) state vector
$y_m \in \mathbb{R}^p$ – output/observable vector
$C \in \mathbb{R}^{p \times k}$ – observation/measurement matrix
$\epsilon_m \in \mathbb{R}^p \sim \mathcal{N}(0, R)$ – zero-mean white Gaussian noise
So, we have $y_m \sim \mathcal{N}\left(\mu,\; W = CQC^T + R = CC^T + R\right)$.
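A short sampling sketch may help make the generative model concrete (my illustration; the dimensions and the choice R = 0.1·I are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, N = 10, 3, 500                     # arbitrary sizes for illustration

mu = rng.normal(size=p)                  # mean vector
C = rng.normal(size=(p, k))              # observation matrix, rank k
R = 0.1 * np.eye(p)                      # noise covariance (diagonal here)

X = rng.normal(size=(N, k))              # latent states x_m ~ N(0, I), i.e. Q = I
E = rng.multivariate_normal(np.zeros(p), R, size=N)  # noise eps_m ~ N(0, R)
Y = mu + X @ C.T + E                     # observations y_m = mu + C x_m + eps_m

# Marginally, y_m ~ N(mu, W) with W = C C^T + R:
W = C @ C.T + R
```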
13. Basic Model
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA →
standard PCA
$$y_m - \mu = C x_m + \epsilon_m$$
The restriction to a zero-mean noise source is not a loss of generality.
All of the structure in $Q$ can be moved into $C$, so we can use $Q = I_{k \times k}$.
$R$ in general cannot be restricted, since the $y_m$ are observed and cannot be whitened/rescaled.
Assumed: $C$ is of rank $k$; $Q$ and $R$ are always full rank.
14. Inference, Learning
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
15. Inference, Learning
Latent Variable Models and Probability Computations
Case 1 :: we know what the hidden states are; we just want to estimate them (can write down $C$ a priori based on problem-physics).
estimating the states given observations and a model → inference
Case 2 :: we have observation data; the observation process is mostly unknown; no explicit model for the "causes".
learning a few parameters that model the data well (in the ML sense) → learning
16. Inference, Learning
Latent Variable Models and Probability Computations
Inference
$y_m \sim \mathcal{N}\left(\mu, W = CC^T + R\right)$ gives us
$$P(x_m \mid y_m) = \frac{P(y_m \mid x_m) \cdot P(x_m)}{P(y_m)} = \frac{\mathcal{N}(\mu + Cx_m, R)\big|_{y_m} \cdot \mathcal{N}(0, I)\big|_{x_m}}{\mathcal{N}(\mu, W)\big|_{y_m}}$$
Therefore,
$$x_m \mid y_m \sim \mathcal{N}\left(\beta (y_m - \mu),\; I - \beta C\right)$$
where $\beta = C^T W^{-1} = C^T \left(CC^T + R\right)^{-1}$.
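The posterior above translates directly into a few lines of NumPy (a sketch, with names of my choosing):

```python
import numpy as np

def latent_posterior(Y, mu, C, R):
    """Posterior x_m | y_m ~ N(beta (y_m - mu), I - beta C) for the linear Gaussian model."""
    W = C @ C.T + R                        # marginal covariance of y (p x p)
    beta = C.T @ np.linalg.inv(W)          # beta = C^T W^{-1} (k x p)
    means = (Y - mu) @ beta.T              # posterior means, one row per y_m (N x k)
    cov = np.eye(C.shape[1]) - beta @ C    # posterior covariance, shared by all m (k x k)
    return means, cov
```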
17. Inference, Learning
Latent Variable Models and Probability Computations
Learning, via Expectation Maximization
Given a likelihood function $L(\theta; Y = \{y_m\}_m, X = \{x_m\}_m)$, where $\theta$ is the parameter vector, $Y$ is the observed data and $X$ represents the unobserved latent variables or missing values, the maximum likelihood estimate (MLE) of $\theta$ is obtained iteratively as follows:
Expectation step:
$$\tilde{Q}\left(\theta \mid \theta^{(u)}\right) = E_{X \mid Y, \theta^{(u)}}\left[\log L(\theta; Y, X)\right]$$
Maximization step:
$$\theta^{(u+1)} = \arg\max_{\theta} \tilde{Q}\left(\theta \mid \theta^{(u)}\right)$$
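The two steps fit a simple iterative skeleton; the sketch below assumes user-supplied e_step/m_step callables (hypothetical names, not from the paper):

```python
def em(theta0, e_step, m_step, n_iter=100, tol=1e-6):
    """Generic EM loop. e_step returns the expected sufficient statistics and the
    current log-likelihood; m_step maximizes the expected complete-data log-likelihood."""
    theta, prev_ll = theta0, -float("inf")
    for _ in range(n_iter):
        stats, ll = e_step(theta)          # E-step: build Q(theta | theta_u)
        theta = m_step(stats)              # M-step: theta_{u+1} = argmax Q
        if abs(ll - prev_ll) < tol:        # log-likelihood increases monotonically
            break
        prev_ll = ll
    return theta
```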
18. Inference, Learning
Latent Variable Models and Probability Computations
Learning, via Expectation Maximization, for linear Gaussian models
Use the solution to the inference problem for estimating the unknown latent variables / missing values $X$, given $Y$ and $\theta^{(u)}$. Then use this fictitious "complete" data to solve for $\theta^{(u+1)}$.
Expectation step: Obtain the conditional latent sufficient statistics $\langle x_m \rangle^{(u)}$, $\langle x_m x_m^T \rangle^{(u)}$ from
$$x_m \mid y_m \sim \mathcal{N}\left(\beta^{(u)} (y_m - \mu),\; I - \beta^{(u)} C^{(u)}\right)$$
where $\beta^{(u)} = C^{(u)T} W^{(u)-1} = C^{(u)T} \left(C^{(u)} C^{(u)T} + R^{(u)}\right)^{-1}$.
Maximization step: Choose $C, R$ to maximize the joint likelihood of $X, Y$.
19. pPCA
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
20. pPCA
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA →
standard PCA
$$y_m - \mu = C x_m + \epsilon_m$$
The restriction to a zero-mean noise source is not a loss of generality.
All of the structure in $Q$ can be moved into $C$, so we can use $Q = I_{k \times k}$.
$R$ in general cannot be restricted, since the $y_m$ are observed and cannot be whitened/rescaled.
Assumed: $C$ is of rank $k$; $Q$ and $R$ are always full rank.
So, we have $y_m \sim \mathcal{N}\left(\mu,\; W = CQC^T + R = CC^T + R\right)$.
21. pPCA
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA →
standard PCA
$k < p$: looking for a more parsimonious representation of the observed data.
:: linear Gaussian model → Factor Analysis ::
$R$ needs to be restricted: otherwise the learning procedure could explain all the structure in the data as noise (i.e., obtain maximal likelihood by choosing $C = 0$ and $R = W = $ data-sample covariance).
since $y_m \sim \mathcal{N}\left(\mu, W = CC^T + R\right)$, we can do no better than having the model-covariance equal the data-sample covariance.
22. pPCA
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA →
standard PCA
Factor Analysis ≡ restricting $R$ to be diagonal
$x_m \equiv \{x_{mi}\}_{i=1}^k$ – factors explaining the correlations between the observation variables $y_m \equiv \{y_{mj}\}_{j=1}^p$
$\{y_{mj}\}_{j=1}^p$ are conditionally independent given $\{x_{mi}\}_{i=1}^k$
$\epsilon_j$ – variability unique to a particular $y_{mj}$
$R = \mathrm{diag}(r_{jj})$ – the "uniquenesses"
different from standard PCA, which effectively treats covariance and variance identically
23. pPCA
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA →
standard PCA
Probabilistic PCA ≡ constraining $R$ to $\sigma^2 I$
Noise: $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$
$y_m \mid x_m \sim \mathcal{N}\left(\mu + Cx_m,\; \sigma^2 I\right)$
$y_m \sim \mathcal{N}\left(\mu,\; W = CC^T + \sigma^2 I\right)$
$x_m \mid y_m \sim \mathcal{N}\left(\beta (y_m - \mu),\; I - \beta C\right)$ where $\beta = C^T W^{-1} = C^T \left(CC^T + \sigma^2 I\right)^{-1}$
$x_m \mid y_m \sim \mathcal{N}\left(\kappa (y_m - \mu),\; \sigma^2 M^{-1}\right)$ where $\kappa = M^{-1} C^T = \left(C^T C + \sigma^2 I\right)^{-1} C^T$
24. pPCA
PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA →
standard PCA
$$\beta = C^T W^{-1} = C^T \left(CC^T + \sigma^2 I_{p \times p}\right)^{-1}$$
$$\kappa = M^{-1} C^T = \left(C^T C + \sigma^2 I_{k \times k}\right)^{-1} C^T$$
It can be shown (by applying the Woodbury matrix identity to $M$) that
1 $I - \beta C = \sigma^2 M^{-1}$
2 $\beta = \kappa$
Then, by letting $\sigma^2 \to 0$, we obtain $x_m \mid y_m \to \delta\left(x_m - (C^T C)^{-1} C^T (y_m - \mu)\right)$, which is standard PCA.
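Both identities are easy to confirm numerically (a quick check on random matrices; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, sigma2 = 8, 2, 0.5
C = rng.normal(size=(p, k))

W = C @ C.T + sigma2 * np.eye(p)          # p x p
M = C.T @ C + sigma2 * np.eye(k)          # k x k
beta = C.T @ np.linalg.inv(W)             # beta = C^T W^{-1}
kappa = np.linalg.inv(M) @ C.T            # kappa = M^{-1} C^T

assert np.allclose(beta, kappa)                                      # identity 2
assert np.allclose(np.eye(k) - beta @ C, sigma2 * np.linalg.inv(M))  # identity 1
```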
25. pPCA
closed-form ML learning
log-likelihood for the pPCA model,
$$\mathcal{L}(\theta; Y) = -\frac{N}{2}\left\{ p \log(2\pi) + \log |W| + \mathrm{trace}\left(W^{-1} S_y\right) \right\}$$
where
$W = CC^T + \sigma^2 I$
$S_y = \frac{1}{N}\sum_{m=1}^N (y_m - \mu)(y_m - \mu)^T$
ML estimates of $\mu$ and $S_y$ are the sample mean and sample covariance matrix respectively:
$$\hat{\mu} = \frac{1}{N}\sum_{m=1}^N y_m, \qquad \hat{S}_y = \frac{1}{N}\sum_{m=1}^N (y_m - \hat{\mu})(y_m - \hat{\mu})^T$$
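In code, the log-likelihood is a few lines once $W$ and $S_y$ are in hand (a sketch; slogdet/solve are used for numerical stability):

```python
import numpy as np

def ppca_log_likelihood(Y, C, sigma2):
    """pPCA log-likelihood with the ML plug-in mu_hat. Y: (N, p)."""
    N, p = Y.shape
    Yc = Y - Y.mean(axis=0)                      # subtract mu_hat
    S_y = Yc.T @ Yc / N                          # sample covariance
    W = C @ C.T + sigma2 * np.eye(p)             # model covariance
    _, logdet = np.linalg.slogdet(W)             # log |W|
    return -0.5 * N * (p * np.log(2 * np.pi) + logdet
                       + np.trace(np.linalg.solve(W, S_y)))
```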
26. pPCA
closed-form ML learning
ML estimates of $C$ and $\sigma^2$ (i.e., of $C$ and $R$):
$$\hat{C} = U_k \left(\Lambda_k - \sigma^2 I\right)^{1/2} V$$
maps the latent space (containing $X$) to the principal subspace of $Y$
columns of $U_k \in \mathbb{R}^{p \times k}$ – principal eigenvectors of $S_y$
$\Lambda_k \in \mathbb{R}^{k \times k}$ – diagonal matrix of corresponding eigenvalues of $S_y$
$V \in \mathbb{R}^{k \times k}$ – arbitrary rotation matrix, can be set to $I_{k \times k}$
$$\hat{\sigma}^2 = \frac{1}{p - k} \sum_{r = k + 1}^{p} \lambda_r$$
variance lost in the projection process, averaged over the number of dimensions projected out/away
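The closed-form solution is a few lines on top of an eigendecomposition (a sketch, taking V = I as the slide permits):

```python
import numpy as np

def ppca_closed_form(Y, k):
    """Closed-form ML estimates of C and sigma^2 for pPCA."""
    N, p = Y.shape
    Yc = Y - Y.mean(axis=0)
    S_y = Yc.T @ Yc / N
    eigvals, eigvecs = np.linalg.eigh(S_y)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]           # descending order
    sigma2 = eigvals[k:].mean()                                  # lost variance, averaged
    C = eigvecs[:, :k] @ np.diag(np.sqrt(eigvals[:k] - sigma2))  # U_k (L_k - s2 I)^{1/2}, V = I
    return C, sigma2
```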
27. EM for pPCA
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
28. EM for pPCA
EM-based ML learning for linear Gaussian models
Expectation step:
$$x_m \mid y_m \sim \mathcal{N}\left(\beta^{(u)} (y_m - \mu),\; I - \beta^{(u)} C^{(u)}\right)$$
where $\beta^{(u)} = C^{(u)T} W^{(u)-1} = C^{(u)T} \left(C^{(u)} C^{(u)T} + R^{(u)}\right)^{-1}$.
Maximization step: Choose $C, R$ to maximize the joint likelihood of $X, Y$.
29. EM for pPCA
EM-based ML learning for probabilistic PCA
Expectation step:
$$x_m \mid y_m \sim \mathcal{N}\left(\beta^{(u)} (y_m - \mu),\; I - \beta^{(u)} C^{(u)}\right)$$
where $\beta^{(u)} = C^{(u)T} W^{(u)-1} = C^{(u)T} \left(C^{(u)} C^{(u)T} + \sigma^{2(u)} I\right)^{-1}$.
Maximization step: Choose $C, \sigma^2$ to maximize the joint likelihood of $X, Y$.
30. EM for pPCA
EM-based ML learning for probabilistic PCA
Expectation step:
$$x_m \mid y_m \sim \mathcal{N}\left(\kappa^{(u)} (y_m - \mu),\; \sigma^{2(u)} M^{(u)-1}\right)$$
where $\kappa^{(u)} = M^{(u)-1} C^{(u)T} = \left(C^{(u)T} C^{(u)} + \sigma^{2(u)} I\right)^{-1} C^{(u)T}$.
Maximization step: Choose $C, \sigma^2$ to maximize the joint likelihood of $X, Y$.
31. EM for pPCA
EM-based ML learning for probabilistic PCA
Expectation step: Compute for $m = 1, \ldots, N$
$$\langle x_m \rangle = M^{(u)-1} C^{(u)T} \left(y_m - \hat{\mu}\right)$$
$$\langle x_m x_m^T \rangle = \sigma^{2(u)} M^{(u)-1} + \langle x_m \rangle \langle x_m \rangle^T$$
Maximization step: Set
$$C^{(u+1)} = \left[\sum_{m=1}^N \left(y_m - \hat{\mu}\right) \langle x_m \rangle^T\right] \left[\sum_{m=1}^N \langle x_m x_m^T \rangle\right]^{-1}$$
$$\sigma^{2(u+1)} = \frac{1}{Np} \sum_{m=1}^N \left\{ \left\|y_m - \hat{\mu}\right\|^2 - 2 \langle x_m \rangle^T C^{(u+1)T} \left(y_m - \hat{\mu}\right) + \mathrm{trace}\left(\langle x_m x_m^T \rangle C^{(u+1)T} C^{(u+1)}\right) \right\}$$
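Put together, the two steps give the following EM loop (a sketch of the updates above; the random initialization is my own choice, since the paper doesn't discuss one):

```python
import numpy as np

def ppca_em(Y, k, n_iter=200, seed=0):
    """EM for pPCA. Y: (N, p). Returns C and sigma^2."""
    rng = np.random.default_rng(seed)
    N, p = Y.shape
    Yc = Y - Y.mean(axis=0)                          # subtract mu_hat once
    C, sigma2 = rng.normal(size=(p, k)), 1.0         # arbitrary initialization
    for _ in range(n_iter):
        # E-step: posterior moments <x_m> and sum_m <x_m x_m^T>
        Minv = np.linalg.inv(C.T @ C + sigma2 * np.eye(k))
        X = Yc @ (Minv @ C.T).T                      # rows are <x_m>
        Sxx = N * sigma2 * Minv + X.T @ X            # sum_m <x_m x_m^T>
        # M-step
        C_new = (Yc.T @ X) @ np.linalg.inv(Sxx)
        sigma2 = (np.sum(Yc**2)
                  - 2 * np.sum((Yc @ C_new) * X)
                  + np.trace(Sxx @ C_new.T @ C_new)) / (N * p)
        C = C_new
    return C, sigma2
```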
32. EM for pPCA
EM-based ML learning for probabilistic PCA
$$C^{(u+1)} = \hat{S}_y C^{(u)} \left(\sigma^{2(u)} I + M^{(u)-1} C^{(u)T} \hat{S}_y C^{(u)}\right)^{-1}$$
$$\sigma^{2(u+1)} = \frac{1}{p} \mathrm{trace}\left(\hat{S}_y - \hat{S}_y C^{(u)} M^{(u)-1} C^{(u+1)T}\right)$$
Convergence: the only stable local extremum is the global maximum, at which the true principal subspace is found. The paper doesn't discuss any initialization scheme(s).
Complexity: require terms of the form $SC$ and $\mathrm{trace}(S)$
Computing $SC$ as $\sum_m y_m \left(y_m^T C\right)$ is $O(kNp)$ and more efficient than $\left(\sum_m y_m y_m^T\right) C$, which is equivalent to finding $S$ explicitly ($O(Np^2)$).
Very efficient for $k \ll p$.
Require $\mathrm{trace}(S)$, not $S$ ⇒ computing only the variance along each coordinate is sufficient.
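The complexity claim is easy to see in code: the fast route never materializes the p × p covariance (a small demonstration with arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, k = 1000, 200, 5
Y = rng.normal(size=(N, p))
Yc = Y - Y.mean(axis=0)
C = rng.normal(size=(p, k))

SC_fast = Yc.T @ (Yc @ C) / N     # sum_m y_m (y_m^T C) / N: O(kNp), no p x p matrix
S = Yc.T @ Yc / N                 # forming S explicitly: O(Np^2)
SC_slow = S @ C

assert np.allclose(SC_fast, SC_slow)
# trace(S) needs only the per-coordinate variances:
assert np.isclose(np.trace(S), Yc.var(axis=0).sum())
```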
33. Mixture of pPCAs
Outline
1 Principal Components Analysis
2 Basic Model / Model Basics
3 A brief digression - Inference and Learning
4 Probabilistic PCA
5 Expectation Maximization for Probabilistic PCA
6 Mixture of Principal Component Analyzers
34. Mixture of pPCAs
Mixture of pPCAs :: the model
Likelihood of observed data
$$\mathcal{L}\left(\theta = \left\{\mu_r, C_r, \sigma_r^2\right\}_{r=1}^M;\; Y\right) = \sum_{m=1}^N \log\left( \sum_{r=1}^M \pi_r \cdot p\left(y_m \mid r\right) \right)$$
where, for $m = 1, \ldots, N$ and $r = 1, \ldots, M$
$y_m \mid r \sim \mathcal{N}\left(\mu_r,\; W_r = C_r C_r^T + \sigma_r^2 I\right)$ :: a single pPCA model
$\{\pi_r\}_{r=1}^M$, $\pi_r \geq 0$, $\sum_{r=1}^M \pi_r = 1$ :: mixture weights
$M$ independent latent variables $x_{mr}$ for each $y_m$.
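A sketch of this observed-data log-likelihood (my function names; SciPy's multivariate normal supplies p(y_m | r)):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mppca_log_likelihood(Y, pis, mus, Cs, sigma2s):
    """Log-likelihood of a mixture of pPCA models.
    pis: (M,) weights; mus: list of (p,); Cs: list of (p, k); sigma2s: (M,)."""
    p = Y.shape[1]
    log_probs = np.stack([
        np.log(pi) + multivariate_normal.logpdf(Y, mean=mu, cov=C @ C.T + s2 * np.eye(p))
        for pi, mu, C, s2 in zip(pis, mus, Cs, sigma2s)
    ], axis=1)                                      # (N, M): log(pi_r * p(y_m | r))
    mx = log_probs.max(axis=1, keepdims=True)       # log-sum-exp over components
    return np.sum(mx[:, 0] + np.log(np.exp(log_probs - mx).sum(axis=1)))
```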
35. Mixture of pPCAs
Mixture of pPCAs :: EM-based ML learning
Stage 1: New estimates of the component-specific $\pi$, $\mu$
Expectation step: component's responsibility for generating an observation
$$R_{mr}^{(u+1)} = P^{(u)}\left(r \mid y_m\right) = \frac{p^{(u)}\left(y_m \mid r\right) \cdot \pi_r^{(u)}}{\sum_{r'=1}^M p^{(u)}\left(y_m \mid r'\right) \cdot \pi_{r'}^{(u)}}$$
Maximization step:
$$\pi_r^{(u+1)} = \frac{1}{N} \sum_{m=1}^N R_{mr}^{(u+1)}, \qquad \hat{\mu}_r^{(u+1)} = \frac{\sum_{m=1}^N R_{mr}^{(u+1)} y_m}{\sum_{m=1}^N R_{mr}^{(u+1)}}$$
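Stage 1 in code (a sketch; the densities p(y_m | r) come from SciPy as above):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mppca_stage1(Y, pis, mus, Cs, sigma2s):
    """Responsibilities R_mr, then updated mixture weights and means."""
    p = Y.shape[1]
    weighted = np.stack([
        pi * multivariate_normal.pdf(Y, mean=mu, cov=C @ C.T + s2 * np.eye(p))
        for pi, mu, C, s2 in zip(pis, mus, Cs, sigma2s)
    ], axis=1)                                          # (N, M): pi_r * p(y_m | r)
    R = weighted / weighted.sum(axis=1, keepdims=True)  # E-step: R_mr = P(r | y_m)
    pis_new = R.mean(axis=0)                            # pi_r = (1/N) sum_m R_mr
    mus_new = (R.T @ Y) / R.sum(axis=0)[:, None]        # responsibility-weighted means
    return R, pis_new, mus_new
```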
36. Mixture of pPCAs
Mixture of pPCAs :: EM-based ML learning
Stage 2: New estimates of the component-specific $C$, $\sigma^2$
Expectation step: Compute for $m = 1, \ldots, N$ and $r = 1, \ldots, M$
$$\langle x_{mr} \rangle = M_r^{(u)-1} C_r^{(u)T} \left(y_m - \hat{\mu}_r^{(u+1)}\right)$$
$$\langle x_{mr} x_{mr}^T \rangle = \sigma_r^{2(u)} M_r^{(u)-1} + \langle x_{mr} \rangle \langle x_{mr} \rangle^T$$
Maximization step: Set for $r = 1, \ldots, M$
$$C_r^{(u+1)} = \left[\sum_{m=1}^N R_{mr}^{(u+1)} \left(y_m - \hat{\mu}_r^{(u+1)}\right) \langle x_{mr} \rangle^T\right] \left[\sum_{m=1}^N R_{mr}^{(u+1)} \langle x_{mr} x_{mr}^T \rangle\right]^{-1}$$
$$\sigma_r^{2(u+1)} = \frac{1}{\pi_r^{(u+1)} N p} \sum_{m=1}^N R_{mr}^{(u+1)} \left\{ \left\|y_m - \hat{\mu}_r^{(u+1)}\right\|^2 - 2 \langle x_{mr} \rangle^T C_r^{(u+1)T} \left(y_m - \hat{\mu}_r^{(u+1)}\right) + \mathrm{trace}\left(\langle x_{mr} x_{mr}^T \rangle C_r^{(u+1)T} C_r^{(u+1)}\right) \right\}$$
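And stage 2 (a sketch of the responsibility-weighted updates; it reuses R and mus_new from the stage-1 sketch):

```python
import numpy as np

def mppca_stage2(Y, R, mus_new, Cs, sigma2s):
    """Per-component E-step moments and M-step updates of C_r and sigma_r^2."""
    p = Y.shape[1]
    Cs_out, sigma2s_out = [], []
    for r, (C, s2) in enumerate(zip(Cs, sigma2s)):
        k = C.shape[1]
        Yc = Y - mus_new[r]                                  # y_m - mu_r
        w = R[:, r]                                          # responsibilities R_mr
        Minv = np.linalg.inv(C.T @ C + s2 * np.eye(k))       # M_r^{-1}
        X = Yc @ (Minv @ C.T).T                              # rows are <x_mr>
        Sxx = s2 * w.sum() * Minv + (w[:, None] * X).T @ X   # sum_m R_mr <x x^T>
        C_new = ((w[:, None] * Yc).T @ X) @ np.linalg.inv(Sxx)
        s2_new = (np.sum(w * np.sum(Yc**2, axis=1))
                  - 2 * np.sum(w * np.sum((Yc @ C_new) * X, axis=1))
                  + np.trace(Sxx @ C_new.T @ C_new)) / (w.sum() * p)
        Cs_out.append(C_new)
        sigma2s_out.append(s2_new)
    return Cs_out, sigma2s_out
```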
37. Mixture of pPCAs
Mixture of pPCAs :: EM-based ML learning
$$C_r^{(u+1)} = \hat{S}_{y_r}^{(u+1)} C_r^{(u)} \left( \sigma_r^{2(u)} I + M_r^{(u)-1} C_r^{(u)T} \hat{S}_{y_r}^{(u+1)} C_r^{(u)} \right)^{-1}$$
$$\sigma_r^{2(u+1)} = \frac{1}{p} \mathrm{trace}\left( \hat{S}_{y_r}^{(u+1)} - \hat{S}_{y_r}^{(u+1)} C_r^{(u)} M_r^{(u)-1} C_r^{(u+1)T} \right)$$
where
$$\hat{S}_{y_r}^{(u+1)} = \frac{1}{\pi_r^{(u+1)} N} \sum_{m=1}^N R_{mr}^{(u+1)} \left(y_m - \hat{\mu}_r^{(u+1)}\right) \left(y_m - \hat{\mu}_r^{(u+1)}\right)^T$$
$$M_r^{(u)} = \sigma_r^{2(u)} I + C_r^{(u)T} C_r^{(u)}$$
38. Mixture of pPCAs
Mixture of pPCAs :: EM-based ML learning
$$C_r^{(u+1)} = \hat{S}_{y_r}^{(u+1)} C_r^{(u)} \left( \sigma_r^{2(u)} I + M_r^{(u)-1} C_r^{(u)T} \hat{S}_{y_r}^{(u+1)} C_r^{(u)} \right)^{-1}$$
$$\sigma_r^{2(u+1)} = \frac{1}{p} \mathrm{trace}\left( \hat{S}_{y_r}^{(u+1)} - \hat{S}_{y_r}^{(u+1)} C_r^{(u)} M_r^{(u)-1} C_r^{(u+1)T} \right)$$
Complexity: require terms of the form $SC$ and $\mathrm{trace}(S)$
Computing $SC$ as $\sum_m y_m \left(y_m^T C\right)$ is $O(kNp)$ and more efficient than $\left(\sum_m y_m y_m^T\right) C$, which is equivalent to finding $S$ explicitly ($O(Np^2)$).
Very efficient for $k \ll p$.
Require $\mathrm{trace}(S)$, not $S$ ⇒ computing only the variance along each coordinate is sufficient.