Principal Components Analysis, Expectation Maximization, and more

Harsh Vardhan Sharma (1,2)
(1) Statistical Speech Technology Group,
    Beckman Institute for Advanced Science and Technology
(2) Dept. of Electrical & Computer Engineering
University of Illinois at Urbana-Champaign

Group Meeting: December 01, 2009
Material for this presentation derived from:

Probabilistic Principal Component Analysis.
Tipping and Bishop, Journal of the Royal Statistical Society, Series B (1999) 61:3, 611-622.

Mixtures of Principal Component Analyzers.
Tipping and Bishop, Proceedings of the Fifth International Conference on Artificial Neural Networks (1997), 13-18.
Outline

1    Principal Components Analysis

2    Basic Model / Model Basics

3    A brief digression - Inference and Learning

4    Probabilistic PCA

5    Expectation Maximization for Probabilistic PCA

6    Mixture of Principal Component Analyzers



PCA


Outline

1    Principal Components Analysis

2    Basic Model / Model Basics

3    A brief digression - Inference and Learning

4    Probabilistic PCA

5    Expectation Maximization for Probabilistic PCA

6    Mixture of Principal Component Analyzers



PCA


standard PCA in 1 slide


     A well-established technique for dimensionality reduction

     Most common derivation of PCA → linear projection maximizing the
     variance in the projected space:

1: Organize the observed data $\{y_i \in \mathbb{R}^p\}_{i=1}^N$ into a $p \times N$ matrix $X$ after subtracting the mean
   $\bar{y} = \frac{1}{N}\sum_{i=1}^N y_i$.

2: Obtain the $k$ principal axes $\{w_j \in \mathbb{R}^p\}_{j=1}^k$ :: the $k$ eigenvectors of the data-covariance matrix
   $S_y = \frac{1}{N}\sum_{i=1}^N (y_i - \bar{y})(y_i - \bar{y})^T$ corresponding to the $k$ largest eigenvalues ($k < p$).

3: The $k$ principal components of $y_i$ are $x_i = W^T (y_i - \bar{y})$, where $W = (w_1, \ldots, w_k)$. The
   components of $x_i$ are then uncorrelated and the projection-covariance matrix $S_x$ is diagonal
   with the $k$ largest eigenvalues of $S_y$.




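The three steps above can be made concrete; the following is a minimal numpy sketch (illustrative, not from the slides), assuming the data are the columns of a p × N array Y and k < p.

```python
import numpy as np

def standard_pca(Y, k):
    """Standard PCA via eigendecomposition of the sample covariance.

    Y : (p, N) array, one observation per column; k < p.
    Returns the principal axes W (p, k) and components X (k, N).
    """
    p, N = Y.shape
    y_bar = Y.mean(axis=1, keepdims=True)      # sample mean
    Yc = Y - y_bar                             # mean-centred data
    Sy = (Yc @ Yc.T) / N                       # p x p sample covariance
    eigvals, eigvecs = np.linalg.eigh(Sy)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest
    W = eigvecs[:, order]                      # principal axes w_1, ..., w_k
    X = W.T @ Yc                               # k principal components per sample
    return W, X

# toy usage: 5-dimensional data, 2 principal components
rng = np.random.default_rng(0)
Y = rng.standard_normal((5, 200))
W, X = standard_pca(Y, k=2)
print(np.cov(X))   # approximately diagonal, with the largest eigenvalues of Sy
```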
PCA


Things to think about

Assumptions behind standard PCA

  1   Linearity
      the problem is that of changing the basis: we have measurements in a particular basis and want to
      see the data in a basis that best expresses it. We restrict ourselves to bases that are
      linear combinations of the measurement basis.

  2   Large variances = important structure
      we believe the data has high SNR ⇒ the dynamics of interest are assumed to lie along the directions
      with largest variance; lower-variance directions pertain to noise.

  3   Principal Components are orthogonal
      decorrelation-based dimensionality reduction removes redundancy in the original
      data representation.




PCA


Things to think about
Limitations of standard PCA

  1   Decorrelation not always the best approach
      Useful only when first- and second-order statistics are sufficient for revealing all
      dependencies in the data (e.g., Gaussian-distributed data).

  2   Linearity assumption not always justifiable
      Not valid when the data structure is captured by a nonlinear function of the dimensions in the
      measurement basis.

  3   Non-parametricity
      No probabilistic model for the observed data. (Advantages of the probabilistic extension coming
      up!)

  4   Calculation of the Data Covariance Matrix
      When p and N are very large, difficulties arise in terms of computational complexity and
      data scarcity.

PCA


Things to think about
Handling the decorrelation and linearity caveats




Example solutions: Independent Components Analysis (imposing a more general notion of
statistical dependency), kernel PCA (nonlinearly transforming the data to a more appropriate
naive basis).
PCA


Things to think about
Handling the non-parametricity caveat: motivation for probabilistic PCA

    probabilistic perspective can provide a log-likelihood measure for
    comparison with other density-estimation techniques.

    Bayesian inference methods may be applied (e.g., for model
    comparison).

    pPCA can be utilized as a constrained Gaussian density model:
    potential applications → classification, novelty detection.

    multiple pPCA models can be combined as a probabilistic mixture.

    standard PCA uses a naive way to access covariance (squared distance from
    observed data): pPCA defines a proper covariance structure whose
    parameters can be estimated via EM.

PCA


Things to think about

Handling the computational caveat: motivation for EM-based PCA

    computing the sample covariance itself is $O(Np^2)$.

    data scarcity: often don’t have enough data for the sample covariance to
    be full-rank.

    computational complexity: direct diagonalization is $O(p^3)$.

    standard PCA doesn’t deal properly with missing data. EM
    algorithms can estimate ML values of missing data.

    EM-based PCA: doesn’t require computing the sample covariance,
    $O(kNp)$ complexity.


Basic Model


Outline

1    Principal Components Analysis

2    Basic Model / Model Basics

3    A brief digression - Inference and Learning

4    Probabilistic PCA

5    Expectation Maximization for Probabilistic PCA

6    Mixture of Principal Component Analyzers



Basic Model


PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA →
standard PCA



$y_m - \mu = C x_m + \epsilon_m$

where, for $m = 1, \ldots, N$:

$x_m \in \mathbb{R}^k \sim \mathcal{N}(0, Q)$ – (hidden) state vector

$y_m \in \mathbb{R}^p$ – output/observable vector

$C \in \mathbb{R}^{p \times k}$ – observation/measurement matrix

$\epsilon_m \in \mathbb{R}^p \sim \mathcal{N}(0, R)$ – zero-mean white Gaussian noise


So, we have $y_m \sim \mathcal{N}(\mu,\; W = C Q C^T + R = C C^T + R)$.

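As a quick sanity check on the marginal distribution above, the sketch below (illustrative, not from the slides) draws samples from the generative model with Q = I and R = σ²I (the pPCA special case) and compares the model covariance CCᵀ + σ²I with the sample covariance of the simulated observations.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, N = 10, 3, 2000
mu = rng.standard_normal(p)                   # observation mean
C = rng.standard_normal((p, k))               # observation matrix (rank k)
sigma2 = 0.1                                  # isotropic noise variance, R = sigma^2 I

X = rng.standard_normal((k, N))               # latent states x_m ~ N(0, I)
E = np.sqrt(sigma2) * rng.standard_normal((p, N))   # noise eps_m ~ N(0, sigma^2 I)
Y = mu[:, None] + C @ X + E                   # observations y_m

W = C @ C.T + sigma2 * np.eye(p)              # model covariance C C^T + sigma^2 I
Sy = np.cov(Y, bias=True)                     # sample covariance of the y_m
print(np.abs(W - Sy).max())                   # small for large N
```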
Basic Model


PCA as a limiting case of linear Gaussian models

linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA →
standard PCA



$y_m - \mu = C x_m + \epsilon_m$

     The restriction to a zero-mean noise source is not a loss of generality.
     All of the structure in Q can be moved into C, so we can use $Q = I_{k \times k}$.
     R in general cannot be restricted, since the $y_m$ are observed and cannot
     be whitened/rescaled.
     Assumed: C is of rank k; Q and R are always full rank.




Inference, Learning


Outline

1    Principal Components Analysis

2    Basic Model / Model Basics

3    A brief digression - Inference and Learning

4    Probabilistic PCA

5    Expectation Maximization for Probabilistic PCA

6    Mixture of Principal Component Analyzers



Inference, Learning


Latent Variable Models and Probability Computations


    Case 1 :: we know what the hidden states are and just want to estimate
    them (we can write down C a priori based on the problem’s physics).

    estimating the states given observations and a model → inference


    Case 2 :: we have observation data; the observation process is mostly
    unknown; there is no explicit model for the “causes”.

    learning a few parameters that model the data well (in the ML sense)
    → learning




Inference, Learning


Latent Variable Models and Probability Computations
Inference


$y_m \sim \mathcal{N}(\mu,\; W = C C^T + R)$

gives us

$P(x_m \mid y_m) = \frac{P(y_m \mid x_m)\, P(x_m)}{P(y_m)} = \frac{\mathcal{N}(\mu + C x_m, R)\big|_{y_m} \cdot \mathcal{N}(0, I)\big|_{x_m}}{\mathcal{N}(\mu, W)\big|_{y_m}}$

Therefore,

$x_m \mid y_m \sim \mathcal{N}\big(\beta (y_m - \mu),\; I - \beta C\big)$

where $\beta = C^T W^{-1} = C^T (C C^T + R)^{-1}$.

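A direct numpy transcription of this posterior (a sketch with illustrative names, not from the slides):

```python
import numpy as np

def latent_posterior(y, mu, C, R):
    """Posterior of x given y for the linear Gaussian model y = mu + C x + eps.

    Returns the posterior mean beta @ (y - mu) and covariance I - beta @ C,
    where beta = C^T (C C^T + R)^{-1}.
    """
    k = C.shape[1]
    W = C @ C.T + R                       # marginal covariance of y
    beta = C.T @ np.linalg.inv(W)         # k x p gain matrix
    mean = beta @ (y - mu)
    cov = np.eye(k) - beta @ C
    return mean, cov

# example: p = 4 observed dimensions, k = 2 latent dimensions
rng = np.random.default_rng(0)
C = rng.standard_normal((4, 2))
mean, cov = latent_posterior(rng.standard_normal(4), np.zeros(4), C, 0.1 * np.eye(4))
```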
Inference, Learning


Latent Variable Models and Probability Computations
Learning, via Expectation Maximization

Given a likelihood function $L(\theta;\, Y = \{y_m\}_m,\, X = \{x_m\}_m)$, where $\theta$ is the
parameter vector, $Y$ is the observed data and $X$ represents the unobserved
latent variables or missing values, the maximum likelihood estimate (MLE)
of $\theta$ is obtained iteratively as follows:

Expectation step:

$\tilde{Q}\big(\theta \mid \theta^{(u)}\big) = \mathbb{E}_{X \mid Y, \theta^{(u)}}\big[\log L(\theta;\, Y, X)\big]$

Maximization step:

$\theta^{(u+1)} = \arg\max_{\theta} \tilde{Q}\big(\theta \mid \theta^{(u)}\big)$


Inference, Learning


Latent Variable Models and Probability Computations

Learning, via Expectation Maximization, for linear Gaussian models

Use the solution to the inference problem for estimating the unknown
latent variables / missing values $X$, given $Y$ and $\theta^{(u)}$. Then use this fictitious
“complete” data to solve for $\theta^{(u+1)}$.

Expectation step: Obtain the conditional latent sufficient statistics
$\langle x_m \rangle^{(u)}$, $\langle x_m x_m^T \rangle^{(u)}$ from

$x_m \mid y_m \sim \mathcal{N}\big(\beta^{(u)} (y_m - \mu),\; I - \beta^{(u)} C^{(u)}\big)$

where $\beta^{(u)} = C^{(u)T} W^{(u)-1} = C^{(u)T} \big(C^{(u)} C^{(u)T} + R^{(u)}\big)^{-1}$.

Maximization step: Choose C, R to maximize joint likelihood of X, Y.


pPCA


Outline

1    Principal Components Analysis

2    Basic Model / Model Basics

3    A brief digression - Inference and Learning

4    Probabilistic PCA

5    Expectation Maximization for Probabilistic PCA

6    Mixture of Principal Component Analyzers



pPCA


PCA as a limiting case of linear Gaussian models

linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA →
standard PCA



$y_m - \mu = C x_m + \epsilon_m$

     The restriction to a zero-mean noise source is not a loss of generality.
     All of the structure in Q can be moved into C, so we can use $Q = I_{k \times k}$.
     R in general cannot be restricted, since the $y_m$ are observed and cannot
     be whitened/rescaled.
     Assumed: C is of rank k; Q and R are always full rank.

So, we have $y_m \sim \mathcal{N}(\mu,\; W = C Q C^T + R = C C^T + R)$.


pPCA


PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA →
standard PCA


     $k < p$: looking for a more parsimonious representation of the observed
     data.

                   :: linear Gaussian model → Factor Analysis ::


     R needs to be restricted: otherwise the learning procedure could explain all the
     structure in the data as noise (i.e., obtain maximal likelihood by
     choosing C = 0 and R = W = data-sample covariance).
            since $y_m \sim \mathcal{N}(\mu,\; W = C C^T + R)$, we can do no better than having the
            model covariance equal the data-sample covariance.



pPCA


PCA as a limiting case of linear Gaussian models

linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA →
standard PCA


Factor Analysis ≡ restricting R to be diagonal

     $x_m \equiv \{x_{mi}\}_{i=1}^k$ – factors explaining the correlations between the
     observation variables $y_m \equiv \{y_{mj}\}_{j=1}^p$
            $\{y_{mj}\}_{j=1}^p$ conditionally independent given $\{x_{mi}\}_{i=1}^k$
            $\epsilon_j$ – variability unique to a particular $y_{mj}$
     $R = \mathrm{diag}(r_{jj})$ – “uniquenesses”
     different from standard PCA, which effectively treats covariance and
     variance identically


pPCA


PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA →
standard PCA


Probabilistic PCA ≡ constraining R to $\sigma^2 I$

     Noise: $\epsilon_m \sim \mathcal{N}(0, \sigma^2 I)$

     $y_m \mid x_m \sim \mathcal{N}(\mu + C x_m,\; \sigma^2 I)$
     $y_m \sim \mathcal{N}(\mu,\; W = C C^T + \sigma^2 I)$
     $x_m \mid y_m \sim \mathcal{N}\big(\beta (y_m - \mu),\; I - \beta C\big)$ where
     $\beta = C^T W^{-1} = C^T (C C^T + \sigma^2 I)^{-1}$
     $x_m \mid y_m \sim \mathcal{N}\big(\kappa (y_m - \mu),\; \sigma^2 M^{-1}\big)$ where
     $\kappa = M^{-1} C^T = (C^T C + \sigma^2 I)^{-1} C^T$, with $M = C^T C + \sigma^2 I$


pPCA


PCA as a limiting case of linear Gaussian models
linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA →
standard PCA




$\beta = C^T W^{-1} = C^T \big(C C^T + \sigma^2 I_{p \times p}\big)^{-1}$

$\kappa = M^{-1} C^T = \big(C^T C + \sigma^2 I_{k \times k}\big)^{-1} C^T$

It can be shown (by applying the Woodbury matrix identity to M) that
  1   $I - \beta C = \sigma^2 M^{-1}$
  2   $\beta = \kappa$
Then, by letting $\sigma^2 \to 0$, we obtain
$x_m \mid y_m \to \delta\big(x_m - (C^T C)^{-1} C^T (y_m - \mu)\big)$, which is standard PCA.

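Both identities are easy to confirm numerically; a small sketch (illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
p, k, sigma2 = 8, 3, 0.25
C = rng.standard_normal((p, k))

W = C @ C.T + sigma2 * np.eye(p)          # p x p marginal covariance
M = C.T @ C + sigma2 * np.eye(k)          # k x k
beta = C.T @ np.linalg.inv(W)
kappa = np.linalg.inv(M) @ C.T

print(np.allclose(beta, kappa))                                       # beta == kappa
print(np.allclose(np.eye(k) - beta @ C, sigma2 * np.linalg.inv(M)))   # I - beta C == sigma^2 M^{-1}
```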
pPCA


closed-form ML learning


    log-likelihood for the pPCA model:
    $L(\theta; Y) = -\frac{N}{2}\Big[\, p \log(2\pi) + \log|W| + \mathrm{trace}\big(W^{-1} S_y\big) \Big]$, where

           $W = C C^T + \sigma^2 I$
           $S_y = \frac{1}{N} \sum_{m=1}^{N} (y_m - \mu)(y_m - \mu)^T$

    ML estimates of $\mu$, $S_y$ are the sample mean and sample covariance
    matrix respectively
           $\hat{\mu} = \frac{1}{N} \sum_{m=1}^{N} y_m$
           $\hat{S}_y = \frac{1}{N} \sum_{m=1}^{N} (y_m - \hat{\mu})(y_m - \hat{\mu})^T$



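For model comparison or for monitoring EM convergence, this log-likelihood can be evaluated directly; a sketch (illustrative names, not from the slides), using a log-determinant for numerical stability:

```python
import numpy as np

def ppca_log_likelihood(Y, mu, C, sigma2):
    """L(theta; Y) = -(N/2) [ p log(2 pi) + log|W| + trace(W^{-1} S_y) ]."""
    p, N = Y.shape
    W = C @ C.T + sigma2 * np.eye(p)          # model covariance
    Yc = Y - mu[:, None]
    Sy = (Yc @ Yc.T) / N                      # sample covariance about mu
    _, logdetW = np.linalg.slogdet(W)
    return -0.5 * N * (p * np.log(2 * np.pi) + logdetW
                       + np.trace(np.linalg.solve(W, Sy)))
```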
pPCA


closed-form ML learning

                      ML estimates of C and $\sigma^2$, i.e., of C and R

    $\hat{C} = U_k \big(\Lambda_k - \sigma^2 I\big)^{1/2} V$
           maps the latent space (containing X) to the principal subspace of Y
           columns of $U_k \in \mathbb{R}^{p \times k}$ – principal eigenvectors of $S_y$
           $\Lambda_k \in \mathbb{R}^{k \times k}$ – diagonal matrix of the corresponding eigenvalues of $S_y$
           $V \in \mathbb{R}^{k \times k}$ – arbitrary rotation matrix, can be set to $I_{k \times k}$

    $\hat{\sigma}^2 = \frac{1}{p-k} \sum_{r=k+1}^{p} \lambda_r$
           variance lost in the projection process, averaged over the number of
           dimensions projected out/away

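The closed-form estimates translate directly into code; a sketch (illustrative, with the arbitrary rotation V set to I as noted above):

```python
import numpy as np

def ppca_closed_form(Y, k):
    """Closed-form ML estimates for pPCA, with the arbitrary rotation V = I.

    Y : (p, N) data matrix; returns (mu_hat, C_hat, sigma2_hat).
    """
    p, N = Y.shape
    mu_hat = Y.mean(axis=1, keepdims=True)
    Yc = Y - mu_hat
    Sy = (Yc @ Yc.T) / N                             # sample covariance
    eigvals, eigvecs = np.linalg.eigh(Sy)            # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    sigma2_hat = eigvals[k:].mean()                  # average discarded variance
    U_k = eigvecs[:, :k]                             # principal eigenvectors
    L_k = np.diag(eigvals[:k])                       # top-k eigenvalues
    C_hat = U_k @ np.sqrt(L_k - sigma2_hat * np.eye(k))
    return mu_hat.ravel(), C_hat, sigma2_hat
```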
EM for pPCA


Outline

1    Principal Components Analysis

2    Basic Model / Model Basics

3    A brief digression - Inference and Learning

4    Probabilistic PCA

5    Expectation Maximization for Probabilistic PCA

6    Mixture of Principal Component Analyzers



EM for pPCA


EM-based ML learning for linear Gaussian models



Expectation step:

$x_m \mid y_m \sim \mathcal{N}\big(\beta^{(u)} (y_m - \mu),\; I - \beta^{(u)} C^{(u)}\big)$

where $\beta^{(u)} = C^{(u)T} W^{(u)-1} = C^{(u)T} \big(C^{(u)} C^{(u)T} + R^{(u)}\big)^{-1}$.


Maximization step: Choose C, R to maximize joint likelihood of X, Y.




EM for pPCA


EM-based ML learning for probabilistic PCA



Expectation step:

$x_m \mid y_m \sim \mathcal{N}\big(\beta^{(u)} (y_m - \mu),\; I - \beta^{(u)} C^{(u)}\big)$

where $\beta^{(u)} = C^{(u)T} W^{(u)-1} = C^{(u)T} \big(C^{(u)} C^{(u)T} + \sigma^{2(u)} I\big)^{-1}$.


Maximization step: Choose C, σ 2 to maximize joint likelihood of X, Y.




EM for pPCA


EM-based ML learning for probabilistic PCA



Expectation step:
$x_m \mid y_m \sim \mathcal{N}\big(\kappa^{(u)} (y_m - \mu),\; \sigma^{2(u)} M^{(u)-1}\big)$

where $\kappa^{(u)} = M^{(u)-1} C^{(u)T} = \big(C^{(u)T} C^{(u)} + \sigma^{2(u)} I\big)^{-1} C^{(u)T}$.


Maximization step: Choose C, σ 2 to maximize joint likelihood of X, Y.




EM for pPCA


EM-based ML learning for probabilistic PCA

Expectation step: Compute, for $m = 1, \ldots, N$,

$\langle x_m \rangle = M^{(u)-1} C^{(u)T} (y_m - \hat{\mu})$

$\langle x_m x_m^T \rangle = \sigma^{2(u)} M^{(u)-1} + \langle x_m \rangle \langle x_m \rangle^T$


Maximization step: Set

$C^{(u+1)} = \Big[ \sum_{m=1}^{N} (y_m - \hat{\mu}) \langle x_m \rangle^T \Big] \Big[ \sum_{m=1}^{N} \langle x_m x_m^T \rangle \Big]^{-1}$

$\sigma^{2(u+1)} = \frac{1}{Np} \sum_{m=1}^{N} \Big[ \|y_m - \hat{\mu}\|^2 - 2 \langle x_m \rangle^T C^{(u+1)T} (y_m - \hat{\mu}) + \mathrm{trace}\big( \langle x_m x_m^T \rangle\, C^{(u+1)T} C^{(u+1)} \big) \Big]$




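One such iteration, with the per-sample statistics accumulated over the data, might look like the following numpy sketch (illustrative, assuming Y is p × N and µ̂ has already been set to the sample mean):

```python
import numpy as np

def ppca_em_step(Y, mu, C, sigma2):
    """One EM iteration for pPCA using the per-sample statistics above.

    Y : (p, N); C : (p, k); returns updated (C, sigma2).
    """
    p, N = Y.shape
    k = C.shape[1]
    Yc = Y - mu[:, None]

    # E-step: <x_m> and sum_m <x_m x_m^T>
    M = C.T @ C + sigma2 * np.eye(k)
    Minv = np.linalg.inv(M)
    Xmean = Minv @ C.T @ Yc                            # columns are <x_m>
    sum_xx = N * sigma2 * Minv + Xmean @ Xmean.T       # sum over m of <x_m x_m^T>

    # M-step
    C_new = (Yc @ Xmean.T) @ np.linalg.inv(sum_xx)
    resid = (np.sum(Yc ** 2)
             - 2.0 * np.sum(Xmean * (C_new.T @ Yc))
             + np.trace(sum_xx @ C_new.T @ C_new))
    sigma2_new = resid / (N * p)
    return C_new, sigma2_new
```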
EM for pPCA


EM-based ML learning for probabilistic PCA

$C^{(u+1)} = \hat{S}_y C^{(u)} \big( \sigma^{2(u)} I + M^{(u)-1} C^{(u)T} \hat{S}_y C^{(u)} \big)^{-1}$

$\sigma^{2(u+1)} = \frac{1}{p}\, \mathrm{trace}\Big( \hat{S}_y - \hat{S}_y C^{(u)} M^{(u)-1} C^{(u+1)T} \Big)$

    Convergence: the only stable local extremum is the global maximum, at
    which the true principal subspace is found. The paper doesn’t discuss any
    initialization scheme(s).
    Complexity: requires terms of the form $SC$ and $\mathrm{trace}(S)$
           Computing $SC$ as $\sum_m y_m (y_m^T C)$ is $O(kNp)$ and more efficient than
           $\big(\sum_m y_m y_m^T\big) C$, which is equivalent to finding S explicitly ($O(Np^2)$).
           Very efficient for $k \ll p$.
           Requires $\mathrm{trace}(S)$, not $S$ ⇒ computing only the variance along each
           coordinate is sufficient.
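A sketch of the same update in this "covariance-free" form (illustrative, not from the slides), never building the p × p matrix S:

```python
import numpy as np

def ppca_em_step_covfree(Y, mu, C, sigma2):
    """One EM iteration in matrix form, without forming the p x p covariance S.

    S C is accumulated as (1/N) * Yc (Yc^T C), which costs O(kNp).
    """
    p, N = Y.shape
    k = C.shape[1]
    Yc = Y - mu[:, None]

    SC = Yc @ (Yc.T @ C) / N                 # = S @ C without building S
    M = C.T @ C + sigma2 * np.eye(k)
    Minv = np.linalg.inv(M)

    C_new = SC @ np.linalg.inv(sigma2 * np.eye(k) + Minv @ C.T @ SC)
    trace_S = np.sum(Yc ** 2) / N            # trace(S): per-coordinate variances only
    sigma2_new = (trace_S - np.trace(SC @ Minv @ C_new.T)) / p
    return C_new, sigma2_new
```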
Mixture of pPCAs


Outline

1    Principal Components Analysis

2    Basic Model / Model Basics

3    A brief digression - Inference and Learning

4    Probabilistic PCA

5    Expectation Maximization for Probabilistic PCA

6    Mixture of Principal Component Analyzers



Mixture of pPCAs


Mixture of pPCAs :: the model

Likelihood of observed data:

$L\big(\theta = \{\mu_r\}, \{C_r\}, \{\sigma_r^2\}_{r=1}^{M};\; Y\big) = \sum_{m=1}^{N} \log \sum_{r=1}^{M} \pi_r \cdot p(y_m \mid r)$

where, for $m = 1, \ldots, N$ and $r = 1, \ldots, M$:

$y_m \mid r \sim \mathcal{N}\big(\mu_r,\; W_r = C_r C_r^T + \sigma_r^2 I\big)$ :: a single pPCA model

$\{\pi_r\}$, $\pi_r \geq 0$, $\sum_{r=1}^{M} \pi_r = 1$ :: mixture weights

$M$ independent latent variables $x_{mr}$ for each $y_m$.


Mixture of pPCAs


Mixture of pPCAs :: EM-based ML learning

Stage 1: New estimates of component-specific π, µ

Expectation step: component’s responsibility for generating observation

$R_{mr}^{(u+1)} = P^{(u)}(r \mid y_m) = \frac{p^{(u)}(y_m \mid r) \cdot \pi_r^{(u)}}{\sum_{r'=1}^{M} p^{(u)}(y_m \mid r') \cdot \pi_{r'}^{(u)}}$

Maximization step:

$\pi_r^{(u+1)} = \frac{1}{N} \sum_{m=1}^{N} R_{mr}^{(u+1)}$

$\hat{\mu}_r^{(u+1)} = \frac{\sum_{m=1}^{N} R_{mr}^{(u+1)} \, y_m}{\sum_{m=1}^{N} R_{mr}^{(u+1)}}$


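Stage 1 is a standard mixture E-step plus weight/mean updates; a sketch (illustrative names, using scipy only for the Gaussian density; in practice one would work with log-densities for numerical robustness):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_stage1(Y, pis, mus, Cs, sigma2s):
    """Responsibilities R_mr and updated mixture weights / means.

    Y : (p, N); pis : (M,); mus, Cs, sigma2s : per-component parameters.
    """
    p, N = Y.shape
    M = len(pis)
    dens = np.zeros((N, M))
    for r in range(M):
        Wr = Cs[r] @ Cs[r].T + sigma2s[r] * np.eye(p)     # component covariance
        dens[:, r] = pis[r] * multivariate_normal.pdf(Y.T, mean=mus[r], cov=Wr)
    R = dens / dens.sum(axis=1, keepdims=True)            # responsibilities R_mr

    pis_new = R.mean(axis=0)                              # new mixture weights
    mus_new = [(R[:, r] @ Y.T) / R[:, r].sum() for r in range(M)]   # new means
    return R, pis_new, mus_new
```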
Mixture of pPCAs


Mixture of pPCAs :: EM-based ML learning
Stage 2: New estimates of component-specific C, σ 2
Expectation step: Compute, for $m = 1, \ldots, N$ and $r = 1, \ldots, M$,

$\langle x_{mr} \rangle = M_r^{(u)-1} C_r^{(u)T} \big( y_m - \hat{\mu}_r^{(u+1)} \big)$

$\langle x_{mr} x_{mr}^T \rangle = \sigma_r^{2(u)} M_r^{(u)-1} + \langle x_{mr} \rangle \langle x_{mr} \rangle^T$


Maximization step: Set, for $r = 1, \ldots, M$,

$C_r^{(u+1)} = \Big[ \sum_{m=1}^{N} R_{mr}^{(u+1)} \big( y_m - \hat{\mu}_r^{(u+1)} \big) \langle x_{mr} \rangle^T \Big] \Big[ \sum_{m=1}^{N} R_{mr}^{(u+1)} \langle x_{mr} x_{mr}^T \rangle \Big]^{-1}$

$\sigma_r^{2(u+1)} = \frac{1}{\pi_r^{(u+1)} N p} \sum_{m=1}^{N} R_{mr}^{(u+1)} \Big[ \big\| y_m - \hat{\mu}_r^{(u+1)} \big\|^2 - 2 \langle x_{mr} \rangle^T C_r^{(u+1)T} \big( y_m - \hat{\mu}_r^{(u+1)} \big) + \mathrm{trace}\big( \langle x_{mr} x_{mr}^T \rangle\, C_r^{(u+1)T} C_r^{(u+1)} \big) \Big]$



Mixture of pPCAs


Mixture of pPCAs :: EM-based ML learning


$C_r^{(u+1)} = \hat{S}_{yr}^{(u+1)} C_r^{(u)} \big( \sigma_r^{2(u)} I + M_r^{(u)-1} C_r^{(u)T} \hat{S}_{yr}^{(u+1)} C_r^{(u)} \big)^{-1}$

$\sigma_r^{2(u+1)} = \frac{1}{p}\, \mathrm{trace}\Big( \hat{S}_{yr}^{(u+1)} - \hat{S}_{yr}^{(u+1)} C_r^{(u)} M_r^{(u)-1} C_r^{(u+1)T} \Big)$

where

$\hat{S}_{yr}^{(u+1)} = \frac{1}{\pi_r^{(u+1)} N} \sum_{m=1}^{N} R_{mr}^{(u+1)} \big( y_m - \hat{\mu}_r^{(u+1)} \big) \big( y_m - \hat{\mu}_r^{(u+1)} \big)^T$

$M_r^{(u)} = \sigma_r^{2(u)} I + C_r^{(u)T} C_r^{(u)}$


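A compact sketch of this stage (illustrative; it follows the Tipping & Bishop normalization above, in which S̃_yr absorbs the 1/(π_r N) factor so the σ²_r update divides only by p):

```python
import numpy as np

def mixture_stage2(Y, R, mus_new, pis_new, Cs, sigma2s):
    """Per-component updates of C_r and sigma_r^2 from the weighted covariance S_yr.

    R : (N, M) responsibilities from stage 1; Cs, sigma2s hold the previous estimates.
    """
    p, N = Y.shape
    M = R.shape[1]
    Cs_new, sigma2s_new = [], []
    for r in range(M):
        k = Cs[r].shape[1]
        Yc = Y - mus_new[r][:, None]
        Syr = (Yc * R[:, r]) @ Yc.T / (pis_new[r] * N)     # responsibility-weighted covariance
        Mr_inv = np.linalg.inv(Cs[r].T @ Cs[r] + sigma2s[r] * np.eye(k))
        SC = Syr @ Cs[r]
        C_new = SC @ np.linalg.inv(sigma2s[r] * np.eye(k) + Mr_inv @ Cs[r].T @ SC)
        sigma2_new = np.trace(Syr - SC @ Mr_inv @ C_new.T) / p
        Cs_new.append(C_new)
        sigma2s_new.append(sigma2_new)
    return Cs_new, sigma2s_new
```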
Mixture of pPCAs


Mixture of pPCAs :: EM-based ML learning


$C_r^{(u+1)} = \hat{S}_{yr}^{(u+1)} C_r^{(u)} \big( \sigma_r^{2(u)} I + M_r^{(u)-1} C_r^{(u)T} \hat{S}_{yr}^{(u+1)} C_r^{(u)} \big)^{-1}$

$\sigma_r^{2(u+1)} = \frac{1}{p}\, \mathrm{trace}\Big( \hat{S}_{yr}^{(u+1)} - \hat{S}_{yr}^{(u+1)} C_r^{(u)} M_r^{(u)-1} C_r^{(u+1)T} \Big)$


Complexity: requires terms of the form $SC$ and $\mathrm{trace}(S)$
    Computing $SC$ as $\sum_m y_m (y_m^T C)$ is $O(kNp)$ and more efficient than
    $\big(\sum_m y_m y_m^T\big) C$, which is equivalent to finding S explicitly ($O(Np^2)$).
    Very efficient for $k \ll p$.

    Requires $\mathrm{trace}(S)$, not $S$ ⇒ computing only the variance along each
    coordinate is sufficient.

Thank You

Weitere ähnliche Inhalte

Was ist angesagt?

Bayseian decision theory
Bayseian decision theoryBayseian decision theory
Bayseian decision theory
sia16
 
Multi-Layer Perceptrons
Multi-Layer PerceptronsMulti-Layer Perceptrons
Multi-Layer Perceptrons
ESCOM
 
Intro to Model Selection
Intro to Model SelectionIntro to Model Selection
Intro to Model Selection
chenhm
 
Feature selection concepts and methods
Feature selection concepts and methodsFeature selection concepts and methods
Feature selection concepts and methods
Reza Ramezani
 

Was ist angesagt? (20)

Perceptron (neural network)
Perceptron (neural network)Perceptron (neural network)
Perceptron (neural network)
 
Principal Component Analysis (PCA) and LDA PPT Slides
Principal Component Analysis (PCA) and LDA PPT SlidesPrincipal Component Analysis (PCA) and LDA PPT Slides
Principal Component Analysis (PCA) and LDA PPT Slides
 
Perceptron & Neural Networks
Perceptron & Neural NetworksPerceptron & Neural Networks
Perceptron & Neural Networks
 
Bayseian decision theory
Bayseian decision theoryBayseian decision theory
Bayseian decision theory
 
Linear models for classification
Linear models for classificationLinear models for classification
Linear models for classification
 
Multi-Layer Perceptrons
Multi-Layer PerceptronsMulti-Layer Perceptrons
Multi-Layer Perceptrons
 
04 Multi-layer Feedforward Networks
04 Multi-layer Feedforward Networks04 Multi-layer Feedforward Networks
04 Multi-layer Feedforward Networks
 
Intro to Model Selection
Intro to Model SelectionIntro to Model Selection
Intro to Model Selection
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
 
Classification Algorithm.
Classification Algorithm.Classification Algorithm.
Classification Algorithm.
 
Lasso and ridge regression
Lasso and ridge regressionLasso and ridge regression
Lasso and ridge regression
 
Adaptive Resonance Theory
Adaptive Resonance TheoryAdaptive Resonance Theory
Adaptive Resonance Theory
 
Hierarchical clustering.pptx
Hierarchical clustering.pptxHierarchical clustering.pptx
Hierarchical clustering.pptx
 
Nonnegative Matrix Factorization
Nonnegative Matrix FactorizationNonnegative Matrix Factorization
Nonnegative Matrix Factorization
 
Independent Component Analysis
Independent Component AnalysisIndependent Component Analysis
Independent Component Analysis
 
Data clustering
Data clustering Data clustering
Data clustering
 
Feature selection concepts and methods
Feature selection concepts and methodsFeature selection concepts and methods
Feature selection concepts and methods
 
Adaptive Resonance Theory
Adaptive Resonance TheoryAdaptive Resonance Theory
Adaptive Resonance Theory
 
2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revised2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revised
 
Lecture5 - C4.5
Lecture5 - C4.5Lecture5 - C4.5
Lecture5 - C4.5
 

Andere mochten auch

. An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic .... An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic ...
butest
 
Part 2: Unsupervised Learning Machine Learning Techniques
Part 2: Unsupervised Learning Machine Learning Techniques Part 2: Unsupervised Learning Machine Learning Techniques
Part 2: Unsupervised Learning Machine Learning Techniques
butest
 
Steps for Principal Component Analysis (pca) using ERDAS software
Steps for Principal Component Analysis (pca) using ERDAS softwareSteps for Principal Component Analysis (pca) using ERDAS software
Steps for Principal Component Analysis (pca) using ERDAS software
Swetha A
 
머피's 머신러닝: Latent Linear Model
머피's 머신러닝: Latent Linear Model머피's 머신러닝: Latent Linear Model
머피's 머신러닝: Latent Linear Model
Jungkyu Lee
 
머피's 머신러닝: Latent Linear Model
머피's 머신러닝: Latent Linear Model머피's 머신러닝: Latent Linear Model
머피's 머신러닝: Latent Linear Model
Jungkyu Lee
 
Principal component analysis and matrix factorizations for learning (part 2) ...
Principal component analysis and matrix factorizations for learning (part 2) ...Principal component analysis and matrix factorizations for learning (part 2) ...
Principal component analysis and matrix factorizations for learning (part 2) ...
zukun
 
Kernel Entropy Component Analysis in Remote Sensing Data Clustering.pdf
Kernel Entropy Component Analysis in Remote Sensing Data Clustering.pdfKernel Entropy Component Analysis in Remote Sensing Data Clustering.pdf
Kernel Entropy Component Analysis in Remote Sensing Data Clustering.pdf
grssieee
 
fauvel_igarss.pdf
fauvel_igarss.pdffauvel_igarss.pdf
fauvel_igarss.pdf
grssieee
 
KPCA_Survey_Report
KPCA_Survey_ReportKPCA_Survey_Report
KPCA_Survey_Report
Randy Salm
 

Andere mochten auch (20)

Principal Component Analysis
Principal Component AnalysisPrincipal Component Analysis
Principal Component Analysis
 
. An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic .... An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic ...
 
確率的主成分分析
確率的主成分分析確率的主成分分析
確率的主成分分析
 
Part 2: Unsupervised Learning Machine Learning Techniques
Part 2: Unsupervised Learning Machine Learning Techniques Part 2: Unsupervised Learning Machine Learning Techniques
Part 2: Unsupervised Learning Machine Learning Techniques
 
Steps for Principal Component Analysis (pca) using ERDAS software
Steps for Principal Component Analysis (pca) using ERDAS softwareSteps for Principal Component Analysis (pca) using ERDAS software
Steps for Principal Component Analysis (pca) using ERDAS software
 
Principal component analysis
Principal component analysisPrincipal component analysis
Principal component analysis
 
Patient Protection ACA PPCA
Patient Protection ACA PPCAPatient Protection ACA PPCA
Patient Protection ACA PPCA
 
PPCA of Go-FOGO
PPCA of Go-FOGOPPCA of Go-FOGO
PPCA of Go-FOGO
 
Intervalos de confianza en r commander
Intervalos de confianza en r commanderIntervalos de confianza en r commander
Intervalos de confianza en r commander
 
PCA for the uninitiated
PCA for the uninitiatedPCA for the uninitiated
PCA for the uninitiated
 
머피's 머신러닝: Latent Linear Model
머피's 머신러닝: Latent Linear Model머피's 머신러닝: Latent Linear Model
머피's 머신러닝: Latent Linear Model
 
머피's 머신러닝: Latent Linear Model
머피's 머신러닝: Latent Linear Model머피's 머신러닝: Latent Linear Model
머피's 머신러닝: Latent Linear Model
 
A Gentle Introduction to the EM Algorithm
A Gentle Introduction to the EM AlgorithmA Gentle Introduction to the EM Algorithm
A Gentle Introduction to the EM Algorithm
 
Principal component analysis and matrix factorizations for learning (part 2) ...
Principal component analysis and matrix factorizations for learning (part 2) ...Principal component analysis and matrix factorizations for learning (part 2) ...
Principal component analysis and matrix factorizations for learning (part 2) ...
 
Nonlinear component analysis as a kernel eigenvalue problem
Nonlinear component analysis as a kernel eigenvalue problemNonlinear component analysis as a kernel eigenvalue problem
Nonlinear component analysis as a kernel eigenvalue problem
 
Kernel Entropy Component Analysis in Remote Sensing Data Clustering.pdf
Kernel Entropy Component Analysis in Remote Sensing Data Clustering.pdfKernel Entropy Component Analysis in Remote Sensing Data Clustering.pdf
Kernel Entropy Component Analysis in Remote Sensing Data Clustering.pdf
 
fauvel_igarss.pdf
fauvel_igarss.pdffauvel_igarss.pdf
fauvel_igarss.pdf
 
Different kind of distance and Statistical Distance
Different kind of distance and Statistical DistanceDifferent kind of distance and Statistical Distance
Different kind of distance and Statistical Distance
 
Principal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty DetectionPrincipal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty Detection
 
KPCA_Survey_Report
KPCA_Survey_ReportKPCA_Survey_Report
KPCA_Survey_Report
 

Ähnlich wie Probabilistic PCA, EM, and more

Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
Machine learning (11)
Machine learning (11)Machine learning (11)
Machine learning (11)
NYversity
 
Kernal based speaker specific feature extraction and its applications in iTau...
Kernal based speaker specific feature extraction and its applications in iTau...Kernal based speaker specific feature extraction and its applications in iTau...
Kernal based speaker specific feature extraction and its applications in iTau...
TELKOMNIKA JOURNAL
 

Ähnlich wie Probabilistic PCA, EM, and more (20)

Estimating Space-Time Covariance from Finite Sample Sets
Estimating Space-Time Covariance from Finite Sample SetsEstimating Space-Time Covariance from Finite Sample Sets
Estimating Space-Time Covariance from Finite Sample Sets
 
Neural Networks: Principal Component Analysis (PCA)
Neural Networks: Principal Component Analysis (PCA)Neural Networks: Principal Component Analysis (PCA)
Neural Networks: Principal Component Analysis (PCA)
 
DataEngConf: Feature Extraction: Modern Questions and Challenges at Google
DataEngConf: Feature Extraction: Modern Questions and Challenges at GoogleDataEngConf: Feature Extraction: Modern Questions and Challenges at Google
DataEngConf: Feature Extraction: Modern Questions and Challenges at Google
 
overviewPCA
overviewPCAoverviewPCA
overviewPCA
 
2009 asilomar
2009 asilomar2009 asilomar
2009 asilomar
 
On Selection of Periodic Kernels Parameters in Time Series Prediction
On Selection of Periodic Kernels Parameters in Time Series Prediction On Selection of Periodic Kernels Parameters in Time Series Prediction
On Selection of Periodic Kernels Parameters in Time Series Prediction
 
ON SELECTION OF PERIODIC KERNELS PARAMETERS IN TIME SERIES PREDICTION
ON SELECTION OF PERIODIC KERNELS PARAMETERS IN TIME SERIES PREDICTIONON SELECTION OF PERIODIC KERNELS PARAMETERS IN TIME SERIES PREDICTION
ON SELECTION OF PERIODIC KERNELS PARAMETERS IN TIME SERIES PREDICTION
 
DimensionalityReduction.pptx
DimensionalityReduction.pptxDimensionalityReduction.pptx
DimensionalityReduction.pptx
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
Lecture: Interatomic Potentials Enabled by Machine Learning
Lecture: Interatomic Potentials Enabled by Machine LearningLecture: Interatomic Potentials Enabled by Machine Learning
Lecture: Interatomic Potentials Enabled by Machine Learning
 
A Hybrid Method of CART and Artificial Neural Network for Short Term Load For...
A Hybrid Method of CART and Artificial Neural Network for Short Term Load For...A Hybrid Method of CART and Artificial Neural Network for Short Term Load For...
A Hybrid Method of CART and Artificial Neural Network for Short Term Load For...
 
Pca seminar final report
Pca seminar final reportPca seminar final report
Pca seminar final report
 
190 195
190 195190 195
190 195
 
Forecasting S&P 500 Index Using Backpropagation Neural Network Based on Princ...
Forecasting S&P 500 Index Using Backpropagation Neural Network Based on Princ...Forecasting S&P 500 Index Using Backpropagation Neural Network Based on Princ...
Forecasting S&P 500 Index Using Backpropagation Neural Network Based on Princ...
 
ABC short course: final chapters
ABC short course: final chaptersABC short course: final chapters
ABC short course: final chapters
 
Statistical Mechanics Methods for Discovering Knowledge from Production-Scale...
Statistical Mechanics Methods for Discovering Knowledge from Production-Scale...Statistical Mechanics Methods for Discovering Knowledge from Production-Scale...
Statistical Mechanics Methods for Discovering Knowledge from Production-Scale...
 
Machine learning (11)
Machine learning (11)Machine learning (11)
Machine learning (11)
 
On selection of periodic kernels parameters in time series prediction
On selection of periodic kernels parameters in time series predictionOn selection of periodic kernels parameters in time series prediction
On selection of periodic kernels parameters in time series prediction
 
Kernal based speaker specific feature extraction and its applications in iTau...
Kernal based speaker specific feature extraction and its applications in iTau...Kernal based speaker specific feature extraction and its applications in iTau...
Kernal based speaker specific feature extraction and its applications in iTau...
 
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING : FINDING ALL THE POTENTIAL MI...
 

Kürzlich hochgeladen

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Kürzlich hochgeladen (20)

"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Probabilistic PCA, EM, and more

  • 1. Principal Components Analysis, Expectation Maximization, and more Harsh Vardhan Sharma1,2 1 Statistical Speech Technology Group Beckman Institute for Advanced Science and Technology 2 Dept. of Electrical & Computer Engineering University of Illinois at Urbana-Champaign Group Meeting: December 01, 2009
  • 2. Material for this presentation derived from: Probabilistic Principal Component Analysis. Tipping and Bishop, Journal of the Royal Statistical Society (1999) 61:3, 611-622 Mixtures of Principal Component Analyzers. Tipping and Bishop, Proceedings of Fifth International Conference on Artificial Neural Networks (1997), 13-18
  • 3. Outline 1 Principal Components Analysis 2 Basic Model / Model Basics 3 A brief digression - Inference and Learning 4 Probabilistic PCA 5 Expectation Maximization for Probabilistic PCA 6 Mixture of Principal Component Analyzers hsharma (SST@BI,ECE:UIUC) PCA,EM,etc. 01/12/2009 3 / 39
  • 4. PCA Outline 1 Principal Components Analysis 2 Basic Model / Model Basics 3 A brief digression - Inference and Learning 4 Probabilistic PCA 5 Expectation Maximization for Probabilistic PCA 6 Mixture of Principal Component Analyzers hsharma (SST@BI,ECE:UIUC) PCA,EM,etc. 01/12/2009 4 / 39
  • 5. PCA standard PCA in 1 slide A well-established technique for dimensionality reduction Most common derivation of PCA → linear projection maximizing the variance in the projected space: 1: Organize observed data {yi ∈ Rp }N in a p × N matrix X after subtracting mean i=1 1 N y = ¯ N i=1 yi . k 2: Obtain the k principal axes wj ∈ Rp j=1 :: k eigenvectors of data-covariance matrix 1 N T (Sy = N i=1 yi − y ¯ yi − y ¯ ) corresponding to k largest eigenvalues (k < p). 3: The k principal components of yi are xi = WT yi − y , where W = (w1 , . . . , wk ). The ¯ components of xi are then uncorrelated and the projection-covariance matrix Sx is diagonal with the k largest eigenvalues of Sy . hsharma (SST@BI,ECE:UIUC) PCA,EM,etc. 01/12/2009 5 / 39
  • 6. PCA Things to think about Assumptions behind standard PCA 1 Linearity problem is that of changing the basis — have measurements in a particular basis; want to see data in a basis that best expresses it. We restrict ourselves to look at bases that are linear combinations of the measurement-basis. 2 Large variances = important structure believe that data has high SNR ⇒ dynamics of interest assumed to exist along directions with largest variance; lower-variance directions pertain to noise. 3 Principal Components are orthogonal decorrelation-based dimensional reduction removes redundancy in the original data-representation. hsharma (SST@BI,ECE:UIUC) PCA,EM,etc. 01/12/2009 6 / 39
  • 7. PCA Things to think about Limitations of standard PCA 1 Decorrelation not always the best approach Useful only when first and second order statistics are sufficient statistics for revealing all dependencies in data (for e.g., Gaussian distributed data). 2 Linearity assumption not always justifiable Not valid when data-structure captured by a nonlinear function of dimensions in the measurement-basis. 3 Non-parametricity No probabilistic model for observed data. (Advantages of probabilistic extension coming up!) 4 Calculation of Data Covariance Matrix When p and N are very large, difficulties arise in terms of computational complexity and data scarcity hsharma (SST@BI,ECE:UIUC) PCA,EM,etc. 01/12/2009 7 / 39
  • 8. PCA Things to think about Handling the decorrelation and linearity caveats Example solutions : Independent Components Analysis (imposing more general notion of statistical dependency), kernel PCA (nonlinearly transforming data to a more appropriate naive-basis) hsharma (SST@BI,ECE:UIUC) PCA,EM,etc. 01/12/2009 8 / 39
  • 9. PCA Things to think about Handling the non-parametricity caveat: motivation for probabilistic PCA probabilistic perspective can provide a log-likelihood measure for comparison with other density-estimation techniques. Bayesian inference methods may be applied (e.g., for model comparison). pPCA can be utilized as a constrained Gaussian density model: potential applications → classification, novelty detection. multiple pPCA models can be combined as a probabilistic mixture. standard PCA uses a naive way to access covariance (distance2 from observed data): pPCA defines a proper covariance structure whose parameters can be estimated via EM. hsharma (SST@BI,ECE:UIUC) PCA,EM,etc. 01/12/2009 9 / 39
  • 10. PCA Things to think about Handling the computational caveat: motivation for EM-based PCA computing the sample covariance itself is O Np 2 . data scarcity: often don’t have enough data for sample covariance to be full-rank. computational complexity: direct diagonalization is O p 3 . standard PCA doesn’t deal properly with missing data. EM algorithms can estimate ML values of missing data. EM-based PCA: doesn’t require computing sample covariance, O (kNp) complexity. hsharma (SST@BI,ECE:UIUC) PCA,EM,etc. 01/12/2009 10 / 39
  • 11. Basic Model Outline 1 Principal Components Analysis 2 Basic Model / Model Basics 3 A brief digression - Inference and Learning 4 Probabilistic PCA 5 Expectation Maximization for Probabilistic PCA 6 Mixture of Principal Component Analyzers hsharma (SST@BI,ECE:UIUC) PCA,EM,etc. 01/12/2009 11 / 39
  • 12. Basic Model PCA as a limiting case of linear Gaussian models linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA → standard PCA ym − µ = Cxm + m = Cxm + where for m = 1, . . . , N xm ∈ Rk ∼ N 0, Q – (hidden) state vector y m ∈ Rp – output/observable vector C ∈ Rp×k – observation/measurement matrix ∈ Rp ∼ N 0, R – zero-mean white Gaussian noise So, we have ym ∼ N µ, W = CQCT + R = CCT + R . hsharma (SST@BI,ECE:UIUC) PCA,EM,etc. 01/12/2009 12 / 39
  • 13. Basic Model PCA as a limiting case of linear Gaussian models linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA → standard PCA ym − µ = Cxm + The restriction to zero-mean noise source is not a loss of generality. All of the structure in Q can be moved into C and can use Q = Ik×k . R in general cannot be restricted, since ym are observed and cannot be whitened/rescaled. Assumed: C is of rank k; Q and R are always full rank. hsharma (SST@BI,ECE:UIUC) PCA,EM,etc. 01/12/2009 13 / 39
  • 14. Inference, Learning Outline 1 Principal Components Analysis 2 Basic Model / Model Basics 3 A brief digression - Inference and Learning 4 Probabilistic PCA 5 Expectation Maximization for Probabilistic PCA 6 Mixture of Principal Component Analyzers hsharma (SST@BI,ECE:UIUC) PCA,EM,etc. 01/12/2009 14 / 39
  • 15. Inference, Learning Latent Variable Models and Probability Computations Case 1 :: know what the hidden states are. just want to estimate them. (can write down a priori C based on problem-physics) estimating the states given observations and a model → inference Case 2 :: have observation data. observation process mostly unknown. no explicit model for the “causes” learning a few parameters that model the data well (in the ML sense) → learning hsharma (SST@BI,ECE:UIUC) PCA,EM,etc. 01/12/2009 15 / 39
  • 16. Inference, Learning Latent Variable Models and Probability Computations Inference ym ∼ N µ, W = CCT + R gives us P (ym |xm ) · P (xm ) N (µ + Cxm , R) |ym · N 0, I |xm P (xm |ym ) = = P (ym ) N (µ, W) |ym Therefore, xm |ym ∼ N (β (ym − µ) , I − βC) −1 where β = CT W−1 = CT CCT + R . hsharma (SST@BI,ECE:UIUC) PCA,EM,etc. 01/12/2009 16 / 39
  • 17. Inference, Learning Latent Variable Models and Probability Computations Learning, via Expectation Maximization Given a likelihood function L θ; Y = {ym }m , X = {xm }m , where θ is the parameter vector, Y is the observed data and X represents the unobserved latent variables or missing values, the maximum likelihood estimate (MLE) of θ is obtained iteratively as follows: Expectation step: ˜ Q θ|θ(u) = EX|Y,θ(u) log L θ; Y, X Maximization step: ˜ θ(u+1) = arg max Q θ|θ(u) hsharma (SST@BI,ECE:UIUC) PCA,EM,etc. 01/12/2009 17 / 39
  • 18. Inference, Learning Latent Variable Models and Probability Computations Learning, via Expectation Maximization, for linear Gaussian models Use the solution to the inference problem for estimating the unknown latent variables / missing values X, given Y, θ(u) . Then use this fictitious “complete” data to solve for θ(u+1) . Expectation step: Obtain conditional latent sufficient statistics T (u) from xm (u) , xm xm xm |ym ∼ N β (u) (ym − µ) , I − β (u) C(u) T −1 T T −1 where β (u) = C(u) W(u) = C(u) C(u) C(u) + R(u) . Maximization step: Choose C, R to maximize joint likelihood of X, Y. hsharma (SST@BI,ECE:UIUC) PCA,EM,etc. 01/12/2009 18 / 39
  • 19. pPCA Outline 1 Principal Components Analysis 2 Basic Model / Model Basics 3 A brief digression - Inference and Learning 4 Probabilistic PCA 5 Expectation Maximization for Probabilistic PCA 6 Mixture of Principal Component Analyzers hsharma (SST@BI,ECE:UIUC) PCA,EM,etc. 01/12/2009 19 / 39
  • 20. pPCA PCA as a limiting case of linear Gaussian models linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA → standard PCA ym − µ = Cxm + The restriction to zero-mean noise source is not a loss of generality. All of the structure in Q can be moved into C and can use Q = Ik×k . R in general cannot be restricted, since ym are observed and cannot be whitened/rescaled. Assumed: C is of rank k; Q and R are always full rank. So, we have ym ∼ N µ, W = CQCT + R = CCT + R . hsharma (SST@BI,ECE:UIUC) PCA,EM,etc. 01/12/2009 20 / 39
  • 21. pPCA PCA as a limiting case of linear Gaussian models linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA → standard PCA k < p: looking for more parsimonious representation of the observed data. :: linear Gaussian model → Factor Analysis :: R needs to be restricted: the learning procedure could explain all the structure in the data as noise (i.e., obtain maximal likelihood by choosing C = 0 and R = W = data-sample covariance). since ym ∼ N µ, W = CCT + R we can do no better than having the model-covariance equal the data-sample covariance. hsharma (SST@BI,ECE:UIUC) PCA,EM,etc. 01/12/2009 21 / 39
pPCA

PCA as a limiting case of linear Gaussian models

linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA → standard PCA

Factor Analysis ≡ restricting R to be diagonal

x_m ≡ {x_mi}_{i=1}^k – factors explaining the correlations between the
p observation variables y_m ≡ {y_mj}_{j=1}^p

{y_mj}_{j=1}^p are conditionally independent given {x_mi}_{i=1}^k

ε_j – variability unique to a particular y_mj; R = diag(r_jj) – "uniquenesses"

different from standard PCA, which effectively treats covariance and variance identically

hsharma (SST@BI,ECE:UIUC)    PCA,EM,etc.    01/12/2009    22 / 39
pPCA

PCA as a limiting case of linear Gaussian models

linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA → standard PCA

Probabilistic PCA ≡ constraining R to σ² I

Noise: ε ~ N(0, σ² I)

    y_m | x_m ~ N( µ + C x_m, σ² I )
    y_m ~ N( µ, W = C C^T + σ² I )

    x_m | y_m ~ N( β (y_m − µ), I − β C )
        where β = C^T W^{-1} = C^T (C C^T + σ² I)^{-1}

    x_m | y_m ~ N( κ (y_m − µ), σ² M^{-1} )
        where κ = M^{-1} C^T = (C^T C + σ² I)^{-1} C^T and M = C^T C + σ² I

hsharma (SST@BI,ECE:UIUC)    PCA,EM,etc.    01/12/2009    23 / 39
pPCA

PCA as a limiting case of linear Gaussian models

linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA → standard PCA

    β = C^T W^{-1} = C^T (C C^T + σ² I_{p×p})^{-1}
    κ = M^{-1} C^T = (C^T C + σ² I_{k×k})^{-1} C^T

It can be shown (by applying the Woodbury matrix identity to M) that
    1. I − β C = σ² M^{-1}
    2. β = κ

Then, by letting σ² → 0, we obtain
    x_m | y_m → δ( x_m − (C^T C)^{-1} C^T (y_m − µ) ),
which is standard PCA.

hsharma (SST@BI,ECE:UIUC)    PCA,EM,etc.    01/12/2009    24 / 39
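Both identities and the σ² → 0 limit are easy to verify numerically. The following sketch, with an arbitrary C and a small illustrative σ², is only such a sanity check:

    import numpy as np

    rng = np.random.default_rng(1)
    p, k, sigma2 = 6, 2, 1e-2                     # illustrative sizes and a small noise variance
    C = rng.standard_normal((p, k))

    W = C @ C.T + sigma2 * np.eye(p)              # p x p
    M = C.T @ C + sigma2 * np.eye(k)              # k x k
    beta = C.T @ np.linalg.inv(W)
    kappa = np.linalg.inv(M) @ C.T

    print(np.allclose(beta, kappa))                                      # identity 2: beta = kappa
    print(np.allclose(np.eye(k) - beta @ C, sigma2 * np.linalg.inv(M)))  # identity 1

    # As sigma^2 -> 0, kappa approaches the least-squares projection of standard PCA
    proj = np.linalg.inv(C.T @ C) @ C.T
    print(np.max(np.abs(kappa - proj)))           # small for small sigma^2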
pPCA

closed-form ML learning

log-likelihood for the pPCA model:
    L(θ; Y) = −(N/2) [ p log(2π) + log |W| + trace(W^{-1} S_y) ]
where
    W = C C^T + σ² I
    S_y = (1/N) Σ_{m=1}^N (y_m − µ)(y_m − µ)^T

ML estimates of µ and S_y are the sample mean and sample covariance matrix respectively:
    µ̂ = (1/N) Σ_{m=1}^N y_m
    Ŝ_y = (1/N) Σ_{m=1}^N (y_m − µ̂)(y_m − µ̂)^T

hsharma (SST@BI,ECE:UIUC)    PCA,EM,etc.    01/12/2009    25 / 39
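A direct NumPy transcription of this log-likelihood might look as follows; a sketch only, with the helper name and the use of the sample mean for µ as assumptions of the illustration:

    import numpy as np

    def ppca_log_likelihood(Y, C, sigma2):
        """Evaluate the pPCA log-likelihood L(theta; Y) for an N x p data matrix Y."""
        N, p = Y.shape
        mu = Y.mean(axis=0)                      # mu taken as the sample mean
        Yc = Y - mu
        Sy = (Yc.T @ Yc) / N                     # sample covariance S_y
        W = C @ C.T + sigma2 * np.eye(p)         # model covariance W = C C^T + sigma^2 I
        _, logdetW = np.linalg.slogdet(W)
        return -0.5 * N * (p * np.log(2.0 * np.pi) + logdetW
                           + np.trace(np.linalg.solve(W, Sy)))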
pPCA

closed-form ML learning

ML estimates of C and σ² (equivalently, Ĉ and R̂ = σ̂² I):

    Ĉ = U_k (Λ_k − σ² I)^{1/2} V
maps the latent space (containing X) to the principal subspace of Y
    columns of U_k ∈ R^{p×k} – principal eigenvectors of S_y
    Λ_k ∈ R^{k×k} – diagonal matrix of corresponding eigenvalues of S_y
    V ∈ R^{k×k} – arbitrary rotation matrix, can be set to I_{k×k}

    σ̂² = (1/(p − k)) Σ_{r=k+1}^p λ_r
variance lost in the projection process, averaged over the number of
dimensions projected out/away

hsharma (SST@BI,ECE:UIUC)    PCA,EM,etc.    01/12/2009    26 / 39
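Putting the closed-form estimates together, a minimal eigendecomposition-based fit could look like the sketch below, with the rotation V set to the identity; the function and variable names are assumptions of this illustration:

    import numpy as np

    def ppca_closed_form(Y, k):
        """Closed-form ML estimates (mu, C, sigma^2) of a pPCA model with k latent dimensions."""
        N, p = Y.shape
        mu = Y.mean(axis=0)
        Yc = Y - mu
        Sy = (Yc.T @ Yc) / N
        eigvals, eigvecs = np.linalg.eigh(Sy)            # ascending eigenvalues
        order = np.argsort(eigvals)[::-1]                # re-order: largest first
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        sigma2 = eigvals[k:].mean()                      # average of the discarded eigenvalues
        C = eigvecs[:, :k] @ np.diag(np.sqrt(eigvals[:k] - sigma2))   # V = I
        return mu, C, sigma2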
EM for pPCA

Outline

1    Principal Components Analysis
2    Basic Model / Model Basics
3    A brief digression - Inference and Learning
4    Probabilistic PCA
5    Expectation Maximization for Probabilistic PCA
6    Mixture of Principal Component Analyzers

hsharma (SST@BI,ECE:UIUC)    PCA,EM,etc.    01/12/2009    27 / 39
EM for pPCA

EM-based ML learning for linear Gaussian models

Expectation step:
    x_m | y_m ~ N( β^(u) (y_m − µ), I − β^(u) C^(u) )
where β^(u) = (C^(u))^T (W^(u))^{-1} = (C^(u))^T (C^(u) (C^(u))^T + R^(u))^{-1}.

Maximization step: choose C, R to maximize the joint likelihood of X, Y.

hsharma (SST@BI,ECE:UIUC)    PCA,EM,etc.    01/12/2009    28 / 39
EM for pPCA

EM-based ML learning for probabilistic PCA

Expectation step:
    x_m | y_m ~ N( β^(u) (y_m − µ), I − β^(u) C^(u) )
where β^(u) = (C^(u))^T (W^(u))^{-1} = (C^(u))^T (C^(u) (C^(u))^T + (σ²)^(u) I)^{-1}.

Maximization step: choose C, σ² to maximize the joint likelihood of X, Y.

hsharma (SST@BI,ECE:UIUC)    PCA,EM,etc.    01/12/2009    29 / 39
EM for pPCA

EM-based ML learning for probabilistic PCA

Expectation step:
    x_m | y_m ~ N( κ^(u) (y_m − µ), (σ²)^(u) (M^(u))^{-1} )
where κ^(u) = (M^(u))^{-1} (C^(u))^T = ((C^(u))^T C^(u) + (σ²)^(u) I)^{-1} (C^(u))^T.

Maximization step: choose C, σ² to maximize the joint likelihood of X, Y.

hsharma (SST@BI,ECE:UIUC)    PCA,EM,etc.    01/12/2009    30 / 39
EM for pPCA

EM-based ML learning for probabilistic PCA

Expectation step: compute, for m = 1, . . . , N,
    ⟨x_m⟩ = (M^(u))^{-1} (C^(u))^T (y_m − µ̂)
    ⟨x_m x_m^T⟩ = (σ²)^(u) (M^(u))^{-1} + ⟨x_m⟩⟨x_m⟩^T

Maximization step: set
    C^(u+1) = [ Σ_{m=1}^N (y_m − µ̂) ⟨x_m⟩^T ] [ Σ_{m=1}^N ⟨x_m x_m^T⟩ ]^{-1}

    (σ²)^(u+1) = (1/(Np)) Σ_{m=1}^N { ||y_m − µ̂||²
                 − 2 ⟨x_m⟩^T (C^(u+1))^T (y_m − µ̂)
                 + trace( ⟨x_m x_m^T⟩ (C^(u+1))^T C^(u+1) ) }

hsharma (SST@BI,ECE:UIUC)    PCA,EM,etc.    01/12/2009    31 / 39
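A compact NumPy rendering of these two steps, iterated a fixed number of times, might look like the following sketch; the random initialization and the iteration count are choices of the illustration, since the slide does not specify them:

    import numpy as np

    def ppca_em(Y, k, n_iter=200, seed=0):
        """EM for probabilistic PCA; returns (mu, C, sigma^2) for an N x p data matrix Y."""
        rng = np.random.default_rng(seed)
        N, p = Y.shape
        mu = Y.mean(axis=0)
        Yc = Y - mu                                   # centered data, N x p
        C = rng.standard_normal((p, k))               # arbitrary initialization
        sigma2 = 1.0
        for _ in range(n_iter):
            # E-step: posterior moments <x_m> (rows of X) and sum_m <x_m x_m^T>
            Minv = np.linalg.inv(C.T @ C + sigma2 * np.eye(k))
            X = Yc @ C @ Minv
            SumXX = N * sigma2 * Minv + X.T @ X
            # M-step: update C and sigma^2
            C_new = (Yc.T @ X) @ np.linalg.inv(SumXX)
            sigma2 = (np.sum(Yc ** 2)
                      - 2.0 * np.sum((Yc @ C_new) * X)
                      + np.trace(SumXX @ C_new.T @ C_new)) / (N * p)
            C = C_new
        return mu, C, sigma2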
EM for pPCA

EM-based ML learning for probabilistic PCA

    C^(u+1) = Ŝ_y C^(u) [ (σ²)^(u) I + (M^(u))^{-1} (C^(u))^T Ŝ_y C^(u) ]^{-1}
    (σ²)^(u+1) = (1/p) trace( Ŝ_y − Ŝ_y C^(u) (M^(u))^{-1} (C^(u+1))^T )

Convergence: the only stable local extremum is the global maximum, at which the
true principal subspace is found. The paper doesn't discuss any initialization scheme(s).

Complexity: only terms of the form S C and trace(S) are required.
Computing S C as Σ_m y_m (y_m^T C) is O(kNp) and more efficient than
(Σ_m y_m y_m^T) C, which is equivalent to finding S explicitly (O(Np²)).
Very efficient for k << p.
Since trace(S) is needed rather than S itself, computing only the variance
along each coordinate suffices.

hsharma (SST@BI,ECE:UIUC)    PCA,EM,etc.    01/12/2009    32 / 39
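The complexity remark can be made concrete: neither S nor the O(Np²) product is ever formed. A possible sketch, where Yc denotes the centered N × p data matrix (helper names are my own):

    import numpy as np

    def SC_and_traceS(Yc, C):
        """Compute S C and trace(S) without forming S = Yc^T Yc / N (Yc: centered N x p data)."""
        N = Yc.shape[0]
        SC = Yc.T @ (Yc @ C) / N       # multiply by C first: O(kNp) instead of O(N p^2)
        traceS = np.sum(Yc ** 2) / N   # trace(S) = sum of per-coordinate variances
        return SC, traceS

    def compact_C_update(Yc, C, sigma2):
        """One compact-form update of C, using S C computed implicitly."""
        k = C.shape[1]
        SC, _ = SC_and_traceS(Yc, C)
        Minv = np.linalg.inv(C.T @ C + sigma2 * np.eye(k))
        return SC @ np.linalg.inv(sigma2 * np.eye(k) + Minv @ C.T @ SC)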
Mixture of pPCAs

Outline

1    Principal Components Analysis
2    Basic Model / Model Basics
3    A brief digression - Inference and Learning
4    Probabilistic PCA
5    Expectation Maximization for Probabilistic PCA
6    Mixture of Principal Component Analyzers

hsharma (SST@BI,ECE:UIUC)    PCA,EM,etc.    01/12/2009    33 / 39
Mixture of pPCAs

Mixture of pPCAs :: the model

Log-likelihood of the observed data:
    L( θ = { µ_r, C_r, σ_r² }_{r=1}^M ; Y ) = Σ_{m=1}^N log [ Σ_{r=1}^M π_r · p(y_m | r) ]
where, for m = 1, . . . , N and r = 1, . . . , M,
    y_m | r ~ N( µ_r, W_r = C_r C_r^T + σ_r² I )    :: a single pPCA model
    { π_r }, with π_r ≥ 0 and Σ_{r=1}^M π_r = 1     :: mixture weights

There are M independent latent variables x_mr for each y_m.

hsharma (SST@BI,ECE:UIUC)    PCA,EM,etc.    01/12/2009    34 / 39
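For reference, the observed-data log-likelihood of the mixture could be evaluated as in the sketch below; the helper name and the use of SciPy's Gaussian density are assumptions of the illustration:

    import numpy as np
    from scipy.stats import multivariate_normal

    def mixture_ppca_log_likelihood(Y, pis, mus, Cs, sigma2s):
        """Observed-data log-likelihood; pis, mus, Cs, sigma2s are per-component parameters."""
        N, p = Y.shape
        total = 0.0
        for y in Y:
            mix = sum(pi * multivariate_normal.pdf(y, mean=mu, cov=C @ C.T + s2 * np.eye(p))
                      for pi, mu, C, s2 in zip(pis, mus, Cs, sigma2s))
            total += np.log(mix)
        return total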
Mixture of pPCAs

Mixture of pPCAs :: EM-based ML learning

Stage 1: new estimates of the component-specific π, µ

Expectation step: each component's responsibility for generating an observation
    R_mr^(u+1) = P^(u)(r | y_m) = [ p^(u)(y_m | r) · π_r^(u) ] / [ Σ_{r'=1}^M p^(u)(y_m | r') · π_{r'}^(u) ]

Maximization step:
    π_r^(u+1) = (1/N) Σ_{m=1}^N R_mr^(u+1)
    µ̂_r^(u+1) = [ Σ_{m=1}^N R_mr^(u+1) y_m ] / [ Σ_{m=1}^N R_mr^(u+1) ]

hsharma (SST@BI,ECE:UIUC)    PCA,EM,etc.    01/12/2009    35 / 39
Mixture of pPCAs

Mixture of pPCAs :: EM-based ML learning

Stage 2: new estimates of the component-specific C, σ²

Expectation step: compute, for m = 1, . . . , N and r = 1, . . . , M,
    ⟨x_mr⟩ = (M_r^(u))^{-1} (C_r^(u))^T (y_m − µ̂_r^(u+1))
    ⟨x_mr x_mr^T⟩ = (σ_r²)^(u) (M_r^(u))^{-1} + ⟨x_mr⟩⟨x_mr⟩^T

Maximization step: set, for r = 1, . . . , M,
    C_r^(u+1) = [ Σ_{m=1}^N R_mr^(u+1) (y_m − µ̂_r^(u+1)) ⟨x_mr⟩^T ] [ Σ_{m=1}^N R_mr^(u+1) ⟨x_mr x_mr^T⟩ ]^{-1}

    (σ_r²)^(u+1) = 1/(π_r^(u+1) N p) Σ_{m=1}^N R_mr^(u+1) × { ||y_m − µ̂_r^(u+1)||²
                   − 2 ⟨x_mr⟩^T (C_r^(u+1))^T (y_m − µ̂_r^(u+1))
                   + trace( ⟨x_mr x_mr^T⟩ (C_r^(u+1))^T C_r^(u+1) ) }

hsharma (SST@BI,ECE:UIUC)    PCA,EM,etc.    01/12/2009    36 / 39
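One full EM sweep over both stages (responsibilities and π_r, µ_r, then C_r and σ_r²) might be rendered as in the following sketch; the naming and structure are mine, and numerical safeguards (e.g. against vanishing responsibilities) are omitted:

    import numpy as np
    from scipy.stats import multivariate_normal

    def mixture_ppca_em_step(Y, pis, mus, Cs, sigma2s):
        """One EM iteration for a mixture of pPCA models; returns the updated parameters."""
        N, p = Y.shape
        n_comp = len(pis)
        k = Cs[0].shape[1]                              # same latent dimension for all components

        # Stage 1, E-step: responsibilities R[m, r]
        dens = np.column_stack([
            pis[r] * multivariate_normal.pdf(
                Y, mean=mus[r], cov=Cs[r] @ Cs[r].T + sigma2s[r] * np.eye(p))
            for r in range(n_comp)])
        R = dens / dens.sum(axis=1, keepdims=True)

        # Stage 1, M-step: mixture weights and component means
        new_pis = R.mean(axis=0)
        new_mus = [(R[:, r] @ Y) / R[:, r].sum() for r in range(n_comp)]

        # Stage 2: component-specific C_r and sigma_r^2
        new_Cs, new_sigma2s = [], []
        for r in range(n_comp):
            Yc = Y - new_mus[r]
            w = R[:, r]
            Minv = np.linalg.inv(Cs[r].T @ Cs[r] + sigma2s[r] * np.eye(k))
            X = Yc @ Cs[r] @ Minv                        # rows are <x_mr>
            SumXX = w.sum() * sigma2s[r] * Minv + (X * w[:, None]).T @ X
            C_new = ((Yc * w[:, None]).T @ X) @ np.linalg.inv(SumXX)
            s2 = (w @ np.sum(Yc ** 2, axis=1)
                  - 2.0 * np.sum(w[:, None] * (Yc @ C_new) * X)
                  + np.trace(SumXX @ C_new.T @ C_new)) / (new_pis[r] * N * p)
            new_Cs.append(C_new)
            new_sigma2s.append(s2)
        return new_pis, new_mus, new_Cs, new_sigma2s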
Mixture of pPCAs

Mixture of pPCAs :: EM-based ML learning

    C_r^(u+1) = Ŝ_yr^(u+1) C_r^(u) [ (σ_r²)^(u) I + (M_r^(u))^{-1} (C_r^(u))^T Ŝ_yr^(u+1) C_r^(u) ]^{-1}

    (σ_r²)^(u+1) = (1/p) trace( Ŝ_yr^(u+1) − Ŝ_yr^(u+1) C_r^(u) (M_r^(u))^{-1} (C_r^(u+1))^T )

where
    Ŝ_yr^(u+1) = 1/(π_r^(u+1) N) Σ_{m=1}^N R_mr^(u+1) (y_m − µ̂_r^(u+1)) (y_m − µ̂_r^(u+1))^T
    M_r^(u) = (σ_r²)^(u) I + (C_r^(u))^T C_r^(u)

hsharma (SST@BI,ECE:UIUC)    PCA,EM,etc.    01/12/2009    37 / 39
Mixture of pPCAs

Mixture of pPCAs :: EM-based ML learning

    C_r^(u+1) = Ŝ_yr^(u+1) C_r^(u) [ (σ_r²)^(u) I + (M_r^(u))^{-1} (C_r^(u))^T Ŝ_yr^(u+1) C_r^(u) ]^{-1}

    (σ_r²)^(u+1) = (1/p) trace( Ŝ_yr^(u+1) − Ŝ_yr^(u+1) C_r^(u) (M_r^(u))^{-1} (C_r^(u+1))^T )

Complexity: only terms of the form S C and trace(S) are required.
Computing S C as Σ_m y_m (y_m^T C) is O(kNp) and more efficient than
(Σ_m y_m y_m^T) C, which is equivalent to finding S explicitly (O(Np²)).
Very efficient for k << p.
Since trace(S) is needed rather than S itself, computing only the variance
along each coordinate suffices.

hsharma (SST@BI,ECE:UIUC)    PCA,EM,etc.    01/12/2009    38 / 39