Learning with Maximum Likelihood

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University

www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Copyright © 2001, 2004, Andrew W. Moore                                     Sep 6th, 2001




             Maximum Likelihood learning of
               Gaussians for Data Mining
  •     Why we should care
  •     Learning Univariate Gaussians
  •     Learning Multivariate Gaussians
  •     What’s a biased estimator?
  •     Bayesian Learning of Gaussians




Why we should care
• Maximum Likelihood Estimation is a very
  very very very fundamental part of data
  analysis.
• “MLE for Gaussians” is training wheels for
  our future techniques
• Learning Gaussians is more useful than you
  might guess…








      Learning Gaussians from Data
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, σ²)
• But you don't know µ
                                  (you do know σ²)

                          Sneer

  MLE: For which µ is x1, x2, … xR most likely?

  MAP: Which µ maximizes p(µ | x1, x2, … xR, σ²)?

Despite this, we'll spend 95% of our time on MLE. Why? Wait and see…
MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~(i.i.d) N(µ,σ2)
• But you don’t know µ (you do know σ2)
• MLE: For which µ is x1, x2, … xR most likely?

$$\mu^{\,mle} = \arg\max_{\mu}\; p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)$$








                           Algebra Euphoria

$$\mu^{\,mle} = \arg\max_{\mu}\; p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)$$

$$= \arg\max_{\mu}\; \prod_{i=1}^{R} p(x_i \mid \mu, \sigma^2) \qquad \text{(by i.i.d.)}$$

$$= \arg\max_{\mu}\; \sum_{i=1}^{R} \log p(x_i \mid \mu, \sigma^2) \qquad \text{(monotonicity of log)}$$

$$= \arg\max_{\mu}\; \sum_{i=1}^{R} \left( \log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i-\mu)^2}{2\sigma^2} \right) \qquad \text{(plug in formula for Gaussian)}$$

$$= \arg\min_{\mu}\; \sum_{i=1}^{R} (x_i-\mu)^2 \qquad \text{(after simplification)}$$
     Intermission: A General Scalar
             MLE strategy
Task: Find MLE θ assuming known form for p(Data| θ,stuff)
1. Write LL = log P(Data| θ,stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Set ∂LL/∂θ=0 for a maximum, creating an equation in
    terms of θ
4. Solve it*
5. Check that you’ve found a maximum rather than a
    minimum or saddle-point, and be careful if θ is
    constrained

                          *This is a perfect example of something that works perfectly in
                             all textbook examples and usually involves surprising pain if
                                                          you need it for something new.
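The recipe is mechanical enough to check on a toy case. Below is a minimal sketch, not part of the original slides, that follows steps 1–4 symbolically with sympy for the known-σ² univariate Gaussian; the three data values are made up for illustration.

```python
import sympy as sp

# Step 1: write LL = log P(Data | theta) for three illustrative datapoints
# (known sigma^2 = 1; mu is the parameter we are estimating).
mu = sp.symbols('mu', real=True)
data = [2.0, 3.5, 5.0]                                  # made-up sample
LL = sum(sp.log(1 / sp.sqrt(2 * sp.pi) * sp.exp(-(x - mu) ** 2 / 2)) for x in data)

# Step 2: work out dLL/dmu.
dLL = sp.diff(LL, mu)

# Step 3: set dLL/dmu = 0, giving one equation in mu, and solve it.
mu_hat = sp.solve(sp.Eq(dLL, 0), mu)[0]

# Step 5: the second derivative is a negative constant, so this is a maximum.
print(mu_hat)                                           # -> 3.5, the sample mean
print(sp.simplify(sp.diff(LL, mu, 2)))                  # -> -3
```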
                                        The MLE µ

$$\mu^{\,mle} = \arg\max_{\mu}\; p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = \arg\min_{\mu}\; \sum_{i=1}^{R} (x_i-\mu)^2$$

$$= \mu \ \text{ s.t. } \ 0 = \frac{\partial LL}{\partial\mu} = \frac{\partial}{\partial\mu}\sum_{i=1}^{R}(x_i-\mu)^2 = -\sum_{i=1}^{R} 2\,(x_i-\mu)$$

$$\text{Thus } \ \mu = \frac{1}{R}\sum_{i=1}^{R} x_i$$

Lawks-a-lawdy!
$$\mu^{\,mle} = \frac{1}{R}\sum_{i=1}^{R} x_i$$

  • The best estimate of the mean of a
    distribution is the mean of the sample!
                                                               At first sight:
                                                 This kind of pedantic, algebra-filled and
                                                ultimately unsurprising fact is exactly the
                                                     reason people throw down their
                                                “Statistics” book and pick up their “Agent
                                                  Based Evolutionary Data Mining Using
                                                    The Neuro-Fuzz Transform” book.





                  A General MLE strategy
Suppose θ = (θ1, θ2, …, θn)T is a vector of parameters.
Task: Find MLE θ assuming known form for p(Data| θ,stuff)
1. Write LL = log P(Data| θ,stuff)
2. Work out ∂LL/∂θ using high-school calculus:

$$\frac{\partial LL}{\partial \boldsymbol{\theta}} = \begin{pmatrix} \partial LL/\partial\theta_1 \\ \partial LL/\partial\theta_2 \\ \vdots \\ \partial LL/\partial\theta_n \end{pmatrix}$$

3. Solve the set of simultaneous equations

$$\frac{\partial LL}{\partial\theta_1} = 0, \qquad \frac{\partial LL}{\partial\theta_2} = 0, \qquad \ldots, \qquad \frac{\partial LL}{\partial\theta_n} = 0$$

4. Check that you're at a maximum

   If you can't solve them, what should you do?
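The slides leave that question hanging; one standard answer is to maximize LL numerically instead of analytically. Here is a minimal sketch, not from the slides, that hands the negative log-likelihood of a univariate Gaussian to scipy.optimize.minimize; the toy data, starting point, and reparameterization of σ are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=4.0, scale=2.0, size=200)          # toy sample

def neg_log_likelihood(theta):
    """theta = (mu, log_sigma); optimizing log(sigma) keeps sigma positive."""
    mu, log_sigma = theta
    sigma2 = np.exp(2 * log_sigma)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)                              # lands near the sample mean and std
```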
           MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, σ²)
• But you don't know µ or σ²
• MLE: For which θ = (µ, σ²) is x1, x2, … xR most likely?

$$\log p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = -R\left(\log\sqrt{2\pi} + \tfrac{1}{2}\log\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{R}(x_i-\mu)^2$$

$$\frac{\partial LL}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{R}(x_i-\mu)$$

$$\frac{\partial LL}{\partial\sigma^2} = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{R}(x_i-\mu)^2$$

Setting both partial derivatives to zero:

$$0 = \frac{1}{\sigma^2}\sum_{i=1}^{R}(x_i-\mu) \ \Rightarrow\ \mu = \frac{1}{R}\sum_{i=1}^{R} x_i$$

$$0 = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{R}(x_i-\mu)^2 \ \Rightarrow\ \text{what?}$$

MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~(i.i.d) N(µ,σ2)
• But you don’t know µ or σ2
• MLE: For which θ =(µ,σ2) is x1, x2,…xR most likely?

$$\mu^{\,mle} = \frac{1}{R}\sum_{i=1}^{R} x_i$$

$$\sigma^2_{mle} = \frac{1}{R}\sum_{i=1}^{R}\left(x_i - \mu^{\,mle}\right)^2$$
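As a quick sanity check (not from the slides), both closed forms are one line each in numpy; the sample values are placeholders. Note that np.var uses this same 1/R convention by default.

```python
import numpy as np

x = np.array([2.1, 3.9, 4.4, 5.0, 6.3])        # any sample x_1 ... x_R

mu_mle = x.mean()                               # (1/R) * sum of x_i
sigma2_mle = np.mean((x - mu_mle) ** 2)         # (1/R) * sum of squared deviations

assert np.isclose(sigma2_mle, x.var())          # np.var divides by R by default
print(mu_mle, sigma2_mle)
```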








                      Unbiased Estimators
• An estimator of a parameter is unbiased if the
  expected value of the estimate is the same as the
  true value of the parameters.
• If x1, x2, … xR ~(i.i.d) N(µ,σ2) then
$$E\!\left[\mu^{\,mle}\right] = E\!\left[\frac{1}{R}\sum_{i=1}^{R} x_i\right] = \mu$$

                                                    µmle is unbiased




Biased Estimators
• An estimator of a parameter is biased if the
  expected value of the estimate is different from
  the true value of the parameters.
• If x1, x2, … xR ~(i.i.d) N(µ,σ2) then
$$E\!\left[\sigma^2_{mle}\right] = E\!\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i-\mu^{\,mle}\right)^2\right] = E\!\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i - \frac{1}{R}\sum_{j=1}^{R}x_j\right)^{\!2}\right] \ne \sigma^2$$

                                                   σ²mle is biased








                         MLE Variance Bias
• If x1, x2, … xR ~(i.i.d) N(µ,σ2) then

$$E\!\left[\sigma^2_{mle}\right] = E\!\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i - \frac{1}{R}\sum_{j=1}^{R}x_j\right)^{\!2}\right] = \left(1 - \frac{1}{R}\right)\sigma^2 \ne \sigma^2$$

              Intuition check: consider the case of R = 1.
              Why should our guts expect that σ²mle would be an underestimate of the true σ²?
              How could you prove that?
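Short of the algebra, simulation is one way to convince yourself. The following sketch, not from the slides, averages σ²mle over many small samples and compares the result to the (1 − 1/R)σ² prediction; the sample size and true parameters are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
R, true_sigma2, trials = 5, 4.0, 200_000

samples = rng.normal(loc=0.0, scale=np.sqrt(true_sigma2), size=(trials, R))
sigma2_mle = samples.var(axis=1)               # 1/R convention, one estimate per sample

print(sigma2_mle.mean())                       # empirically close to (1 - 1/R) * sigma^2
print((1 - 1 / R) * true_sigma2)               # = 3.2 for these settings
```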




      Unbiased estimate of Variance
• If x1, x2, … xR ~ (i.i.d.) N(µ, σ²) then

$$E\!\left[\sigma^2_{mle}\right] = \left(1 - \frac{1}{R}\right)\sigma^2 \ne \sigma^2$$

So define

$$\sigma^2_{unbiased} = \frac{\sigma^2_{mle}}{1-\frac{1}{R}} = \frac{1}{R-1}\sum_{i=1}^{R}\left(x_i - \mu^{\,mle}\right)^2$$

So

$$E\!\left[\sigma^2_{unbiased}\right] = \sigma^2$$
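For what it's worth, numpy exposes both conventions through the ddof argument; a one-line sketch (not from the slides, data are placeholders):

```python
import numpy as np

x = np.array([2.1, 3.9, 4.4, 5.0, 6.3])

sigma2_mle = x.var()              # divides by R      (the biased MLE)
sigma2_unbiased = x.var(ddof=1)   # divides by R - 1  (the unbiased estimate)
print(sigma2_mle, sigma2_unbiased)
```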




Unbiaseditude discussion
• Which is best?
$$\sigma^2_{mle} = \frac{1}{R}\sum_{i=1}^{R}\left(x_i - \mu^{\,mle}\right)^2 \qquad\qquad \sigma^2_{unbiased} = \frac{1}{R-1}\sum_{i=1}^{R}\left(x_i - \mu^{\,mle}\right)^2$$

                                   Answer:
                                   • It depends on the task
                                   • And it doesn't make much difference once R gets large





Don't get too excited about being unbiased
• Assume x1, x2, … xR ~ (i.i.d.) N(µ, σ²)
• Suppose we had these estimators for the mean

$$\mu^{\,suboptimal} = \frac{1}{R+7}\sum_{i=1}^{R} x_i \qquad\qquad \mu^{\,crap} = x_1$$

                                                 Are either of these unbiased?
                                                 Will either of them asymptote to the
                                                 correct value as R gets large?
                                                 Which is more useful?



  MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don't know µ or Σ
• MLE: For which θ = (µ, Σ) is x1, x2, … xR most likely?

$$\mu^{\,mle} = \frac{1}{R}\sum_{k=1}^{R}\mathbf{x}_k \qquad\qquad \Sigma^{mle} = \frac{1}{R}\sum_{k=1}^{R}\left(\mathbf{x}_k-\mu^{\,mle}\right)\left(\mathbf{x}_k-\mu^{\,mle}\right)^T$$

Component by component, where 1 ≤ i ≤ m and 1 ≤ j ≤ m, xki is the value of the ith component of xk (the ith attribute of the kth record), µimle is the ith component of µmle, and σijmle is the (i,j)th component of Σmle:

$$\mu_i^{\,mle} = \frac{1}{R}\sum_{k=1}^{R} x_{ki} \qquad\qquad \sigma_{ij}^{\,mle} = \frac{1}{R}\sum_{k=1}^{R}\left(x_{ki}-\mu_i^{\,mle}\right)\left(x_{kj}-\mu_j^{\,mle}\right)$$
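A minimal numpy sketch of these two formulas (not from the slides; the data are random placeholders). np.cov reproduces the same matrix when told to divide by R rather than R − 1.

```python
import numpy as np

X = np.random.default_rng(2).normal(size=(100, 3))    # R = 100 records, m = 3 attributes

mu_mle = X.mean(axis=0)                                # (1/R) * sum of x_k
centered = X - mu_mle
Sigma_mle = (centered.T @ centered) / len(X)           # (1/R) * sum of outer products

assert np.allclose(Sigma_mle, np.cov(X, rowvar=False, bias=True))
print(mu_mle)
print(Sigma_mle)
```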




  MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don't know µ or Σ
• MLE: For which θ = (µ, Σ) is x1, x2, … xR most likely?

Q: How would you prove this?
A: Just plug through the MLE recipe.

Note how Σmle is forced to be symmetric and non-negative definite.
Note the unbiased case below.
How many datapoints would you need before the Gaussian has a chance of being non-degenerate?

$$\Sigma^{unbiased} = \frac{\Sigma^{mle}}{1-\frac{1}{R}} = \frac{1}{R-1}\sum_{k=1}^{R}\left(\mathbf{x}_k-\mu^{\,mle}\right)\left(\mathbf{x}_k-\mu^{\,mle}\right)^T$$
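On that last question: Σmle is a sum of R rank-one outer products of centered vectors, so its rank is at most min(R − 1, m); with R ≤ m datapoints it is singular and the fitted Gaussian is degenerate. A quick numerical illustration, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 5
for R in (3, 5, 6, 20):                           # fewer, equal, and more points than dimensions
    X = rng.normal(size=(R, m))
    centered = X - X.mean(axis=0)
    Sigma_mle = (centered.T @ centered) / R
    print(R, np.linalg.matrix_rank(Sigma_mle))    # rank is at most min(R - 1, m)
```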




Confidence intervals
 We need to talk

 We need to discuss how accurate we expect µmle and Σmle to be as
 a function of R
 And we need to consider how to estimate these accuracies from
 data…
 •Analytically *
 •Non-parametrically (using randomization and bootstrapping) *
 But we won’t. Not yet.
                   *Will be discussed in future Andrew lectures…just before
                   we need this technology.





                              Structural error
 Actually, we need to talk about something else too…
 What if we do all this analysis when the true distribution is in fact
 not Gaussian?
 How can we tell? *
 How can we survive? *
                   *Will be discussed in future Andrew lectures…just before
                   we need this technology.




Gaussian MLE in action
Using R=392 cars from the
“MPG” UCI dataset supplied
by Ross Quinlan
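The plots from this slide are not reproduced here. A rough sketch, not from the slides, of redoing the univariate fit on one attribute of that dataset; the file path and parsing are assumptions (the UCI "auto-mpg" data file is whitespace-separated with mpg as the first field, and the slide's R = 392 comes from dropping the 6 records with missing horsepower).

```python
import numpy as np

# Hypothetical local copy of the UCI auto-mpg data file; the first
# whitespace-separated field on each line is the mpg value.
with open("auto-mpg.data") as f:
    mpg = np.array([float(line.split()[0]) for line in f if line.strip()])

mu_mle, sigma2_mle = mpg.mean(), mpg.var()    # the univariate MLE formulas from Slide 21
print(len(mpg), mu_mle, sigma2_mle)
```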








          Data-starved Gaussian MLE
Using three subsets of MPG.
Each subset has 6
randomly-chosen cars.




Bivariate MLE in action








                                    Multivariate MLE




                                   Covariance matrices are not exciting to look at




Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don't know µ or Σ
• MAP: Which (µ,Σ) maximizes p(µ,Σ | x1, x2, … xR)?

Step 1: Put a prior on (µ,Σ)

Step 1a: Put a prior on Σ:
            (ν0−m−1) Σ ~ IW(ν0, (ν0−m−1) Σ0)
This thing is called the Inverse-Wishart distribution: a PDF over SPD matrices!
            Σ0: (roughly) my best guess of Σ.  E[Σ] = Σ0.
            ν0 small: "I am not sure about my guess of Σ0."
            ν0 large: "I'm pretty sure about my guess of Σ0."

Step 1b: Put a prior on µ | Σ:
            µ | Σ ~ N(µ0, Σ / κ0)
Together, "Σ" and "µ | Σ" define a joint distribution on (µ,Σ).
            µ0: my best guess of µ.  E[µ] = µ0.
            κ0 small: "I am not sure about my guess of µ0."
            κ0 large: "I'm pretty sure about my guess of µ0."

Notice how we are forced to express our ignorance of µ proportionally to Σ.




  Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don't know µ or Σ
• MAP: Which (µ,Σ) maximizes p(µ,Σ | x1, x2, … xR)?

Why do we use this form of prior?
Actually, we don't have to. But it is computationally and algebraically convenient…
…it's a conjugate prior.




  Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• MAP: Which (µ,Σ) maximizes p(µ,Σ | x1, x2, … xR)?

Step 1: Prior: (ν0−m−1) Σ ~ IW(ν0, (ν0−m−1) Σ0),  µ | Σ ~ N(µ0, Σ / κ0)

Step 2:

$$\bar{\mathbf{x}} = \frac{1}{R}\sum_{k=1}^{R}\mathbf{x}_k \qquad \mu_R = \frac{\kappa_0\,\mu_0 + R\,\bar{\mathbf{x}}}{\kappa_0 + R} \qquad \nu_R = \nu_0 + R \qquad \kappa_R = \kappa_0 + R$$

$$(\nu_R + m - 1)\,\Sigma_R = (\nu_0 + m - 1)\,\Sigma_0 + \sum_{k=1}^{R}\left(\mathbf{x}_k-\bar{\mathbf{x}}\right)\left(\mathbf{x}_k-\bar{\mathbf{x}}\right)^T + \frac{\left(\bar{\mathbf{x}}-\mu_0\right)\left(\bar{\mathbf{x}}-\mu_0\right)^T}{1/\kappa_0 + 1/R}$$

Step 3: Posterior: (νR+m−1) Σ ~ IW(νR, (νR+m−1) ΣR),  µ | Σ ~ N(µR, Σ / κR)

Result: µmap = µR,  E[Σ | x1, x2, … xR] = ΣR

• Look carefully at what these formulae are doing. It's all very sensible.
• Conjugate priors mean prior form and posterior form are the same and characterized by "sufficient statistics" of the data.
• The marginal distribution on µ is a student-t.
• One point of view: it's pretty academic if R > 30.
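A minimal sketch of the Step 2 updates, not from the slides; it simply transcribes the formulas above into numpy, and the example data, prior guesses, and function name are illustrative assumptions.

```python
import numpy as np

def niw_map_update(X, mu0, kappa0, nu0, Sigma0):
    """Posterior parameters (mu_R, Sigma_R, kappa_R, nu_R) for a Gaussian with the
    conjugate Normal x Inverse-Wishart prior, following the slide's update rules.
    X is an (R, m) array of datapoints."""
    R, m = X.shape
    xbar = X.mean(axis=0)
    kappaR = kappa0 + R
    nuR = nu0 + R
    muR = (kappa0 * mu0 + R * xbar) / kappaR
    S = (X - xbar).T @ (X - xbar)                       # scatter about the sample mean
    d = (xbar - mu0).reshape(-1, 1)
    SigmaR = ((nu0 + m - 1) * Sigma0 + S
              + (d @ d.T) / (1.0 / kappa0 + 1.0 / R)) / (nuR + m - 1)
    return muR, SigmaR, kappaR, nuR

# Tiny illustration with made-up data and a weak prior guess:
X = np.random.default_rng(4).normal(size=(50, 2))
mu0, Sigma0 = np.zeros(2), np.eye(2)
print(niw_map_update(X, mu0, kappa0=1.0, nu0=5.0, Sigma0=Sigma0)[0])   # mu_map
```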




                                Where we're at

                              Categorical          Real-valued         Mixed Real /
                              inputs only          inputs only         Cat okay

 Inputs -> Classifier         Joint BC,                                Dec Tree
 (predict category)           Naïve BC

 Inputs -> Density            Joint DE,            Gauss DE
 Estimator (probability)      Naïve DE

 Inputs -> Regressor
 (predict real no.)




What you should know
• The Recipe for MLE
• Why do we sometimes prefer MLE to MAP?
• Understand MLE estimation of Gaussian
  parameters
• Understand “biased estimator” versus
  “unbiased estimator”
• Appreciate the outline behind Bayesian
  estimation of Gaussian parameters





                              Useful exercise
• We’d already done some MLE in this class
  without even telling you!
• Suppose categorical arity-n inputs x1, x2, …
  xR~(i.i.d.) from a multinomial
                 M(p1, p2, … pn)
      where
                P(xk=j|p)=pj
• What is the MLE p=(p1, p2, … pn)?
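A quick counting sketch, not from the slides, with made-up categorical data; the candidate answer is pj = (# records with xk = j) / R, and the final assert is a spot check that it beats one perturbed alternative.

```python
import numpy as np

# Made-up categorical data: R records, each taking a value j in {0, ..., n-1}.
x = np.array([0, 2, 1, 1, 2, 2, 0, 2, 1, 2])
n = 3

counts = np.bincount(x, minlength=n)
p_mle = counts / len(x)                    # candidate answer: relative frequencies
print(p_mle)                               # [0.2 0.3 0.5]

def log_lik(p):
    return np.sum(np.log(p[x]))

p_other = np.array([0.3, 0.3, 0.4])        # any other valid p
assert log_lik(p_mle) > log_lik(p_other)
```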


Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 

Maximum Likelihood Estimation

  • 1. Learning with Maximum Likelihood Andrew W. Moore Note to other teachers and users of these slides. Andrew would be delighted Professor if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them School of Computer Science to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in Carnegie Mellon University your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: www.cs.cmu.edu/~awm http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully awm@cs.cmu.edu received. 412-268-7599 Copyright © 2001, 2004, Andrew W. Moore Sep 6th, 2001 Maximum Likelihood learning of Gaussians for Data Mining • Why we should care • Learning Univariate Gaussians • Learning Multivariate Gaussians • What’s a biased estimator? • Bayesian Learning of Gaussians Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 2 1
  • 2. Why we should care • Maximum Likelihood Estimation is a very very very very fundamental part of data analysis. • “MLE for Gaussians” is training wheels for our future techniques • Learning Gaussians is more useful than you might guess… Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 3 Learning Gaussians from Data • Suppose you have x1, x2, … xR ~ (i.i.d) N(µ,σ2) • But you don’t know µ (you do know σ2) MLE: For which µ is x1, x2, … xR most likely? MAP: Which µ maximizes p(µ|x1, x2, … xR , σ2)? Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 4 2
  • 3. Learning Gaussians from Data • Suppose you have x1, x2, … xR ~(i.i.d) N(µ,σ2) • But you don’t know µ (you do know σ2) Sneer MLE: For which µ is x1, x2, … xR most likely? MAP: Which µ maximizes p(µ|x1, x2, … xR , σ2)? Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 5 Learning Gaussians from Data • Suppose you have x1, x2, … xR ~(i.i.d) N(µ,σ2) • But you don’t know µ (you do know σ2) Sneer MLE: For which µ is x1, x2, … xR most likely? MAP: Which µ maximizes p(µ|x1, x2, … xR , σ2)? Despite this, we’ll spend 95% of our time on MLE. Why? Wait and see… Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 6 3
  • 4. MLE for univariate Gaussian
    Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, σ²), but you don't know µ (you do know σ²).
    MLE: For which µ is x1, x2, … xR most likely?
      µ_mle = argmax_µ p(x1, x2, … xR | µ, σ²)
    Algebra Euphoria (the steps, to be filled in on the next slide):
      µ_mle = argmax_µ p(x1, x2, … xR | µ, σ²)
            = …   (by i.i.d.)
            = …   (monotonicity of log)
            = …   (plug in formula for Gaussian)
            = …   (after simplification)
  • 5. Algebra Euphoria (filled in)
      µ_mle = argmax_µ p(x1, x2, … xR | µ, σ²)
            = argmax_µ ∏_{i=1..R} p(xi | µ, σ²)                              (by i.i.d.)
            = argmax_µ ∑_{i=1..R} log p(xi | µ, σ²)                          (monotonicity of log)
            = argmax_µ ∑_{i=1..R} [ log(1/(√(2π) σ)) − (xi − µ)² / (2σ²) ]   (plug in formula for Gaussian)
            = argmin_µ ∑_{i=1..R} (xi − µ)²                                  (after simplification)
    Intermission: A General Scalar MLE strategy
    Task: Find MLE θ assuming known form for p(Data | θ, stuff)
    1. Write LL = log P(Data | θ, stuff)
    2. Work out ∂LL/∂θ using high-school calculus
    3. Set ∂LL/∂θ = 0 for a maximum, creating an equation in terms of θ
    4. Solve it*
    5. Check that you've found a maximum rather than a minimum or saddle-point, and be careful if θ is constrained
    *This is a perfect example of something that works perfectly in all textbook examples and usually involves surprising pain if you need it for something new.
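    To make the scalar recipe concrete, here is a minimal Python/NumPy sketch (my own illustration, assuming SciPy is installed; the data, the σ² value, and the function names are invented for the example). It minimizes the negative Gaussian log-likelihood over µ numerically and confirms that the optimum agrees with the sample mean derived on the following slides.

        import numpy as np
        from scipy.optimize import minimize_scalar

        rng = np.random.default_rng(0)
        sigma2 = 4.0                                   # known variance, as in the slides
        x = rng.normal(loc=3.0, scale=np.sqrt(sigma2), size=50)   # x1..xR ~ N(mu, sigma^2)

        def neg_log_likelihood(mu):
            # -LL = (R/2) log(2*pi*sigma^2) + sum_i (x_i - mu)^2 / (2*sigma^2)
            return 0.5 * len(x) * np.log(2 * np.pi * sigma2) + 0.5 * np.sum((x - mu) ** 2) / sigma2

        result = minimize_scalar(neg_log_likelihood)
        print(result.x, x.mean())                      # the two agree up to solver tolerance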
  • 6. The MLE µ
      µ_mle = argmax_µ p(x1, x2, … xR | µ, σ²) = argmin_µ ∑_{i=1..R} (xi − µ)²
            = the µ such that 0 = ∂LL/∂µ = … (what?)
    Filled in:
      0 = ∂LL/∂µ = ∂/∂µ ∑_{i=1..R} (xi − µ)² = −∑_{i=1..R} 2(xi − µ)
    Thus µ = (1/R) ∑_{i=1..R} xi
  • 7. Lawks-a-lawdy!
      µ_mle = (1/R) ∑_{i=1..R} xi
    The best estimate of the mean of a distribution is the mean of the sample!
    At first sight: this kind of pedantic, algebra-filled and ultimately unsurprising fact is exactly the reason people throw down their "Statistics" book and pick up their "Agent Based Evolutionary Data Mining Using The Neuro-Fuzz Transform" book.
    A General MLE strategy
    Suppose θ = (θ1, θ2, …, θn)^T is a vector of parameters.
    Task: Find MLE θ assuming known form for p(Data | θ, stuff)
    1. Write LL = log P(Data | θ, stuff)
    2. Work out ∂LL/∂θ using high-school calculus:
       ∂LL/∂θ = (∂LL/∂θ1, ∂LL/∂θ2, …, ∂LL/∂θn)^T
  • 8. A General MLE strategy (continued)
    3. Solve the set of simultaneous equations
       ∂LL/∂θ1 = 0,  ∂LL/∂θ2 = 0,  …,  ∂LL/∂θn = 0
    4. Check that you're at a maximum
  • 9. A General MLE strategy (continued)
    If you can't solve the simultaneous equations ∂LL/∂θi = 0 analytically, what should you do?
    (Step 4 still applies: check that you're at a maximum.)
    MLE for univariate Gaussian
    Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, σ²), but you don't know µ or σ².
    MLE: For which θ = (µ, σ²) is x1, x2, … xR most likely?
      log p(x1, x2, … xR | µ, σ²) = −(R/2)(log 2π + log σ²) − (1/(2σ²)) ∑_{i=1..R} (xi − µ)²
      ∂LL/∂µ  = (1/σ²) ∑_{i=1..R} (xi − µ)
      ∂LL/∂σ² = −R/(2σ²) + (1/(2σ⁴)) ∑_{i=1..R} (xi − µ)²
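    For the "if you can't solve them" case, the standard fallback is a numerical optimizer. The sketch below (my own illustration, assuming SciPy; the data are synthetic) minimizes the negative log-likelihood over θ = (µ, log σ²), parameterizing by log σ² so the variance stays positive during the search.

        import numpy as np
        from scipy.optimize import minimize

        rng = np.random.default_rng(1)
        x = rng.normal(loc=-2.0, scale=1.5, size=200)      # x1..xR ~ N(mu, sigma^2), both unknown

        def neg_ll(theta):
            mu, log_sigma2 = theta                         # optimize log(sigma^2) so sigma^2 > 0
            sigma2 = np.exp(log_sigma2)
            return 0.5 * len(x) * np.log(2 * np.pi * sigma2) + 0.5 * np.sum((x - mu) ** 2) / sigma2

        result = minimize(neg_ll, x0=np.array([0.0, 0.0])) # quasi-Newton search over (mu, log sigma^2)
        mu_hat, sigma2_hat = result.x[0], np.exp(result.x[1])
        print(mu_hat, sigma2_hat)                          # close to x.mean() and x.var(ddof=0)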
  • 10. MLE for univariate Gaussian (continued)
    Setting the derivatives to zero:
      0 = (1/σ²) ∑_{i=1..R} (xi − µ)                        ⇒ µ = (1/R) ∑_{i=1..R} xi
      0 = −R/(2σ²) + (1/(2σ⁴)) ∑_{i=1..R} (xi − µ)²         ⇒ what?
  • 11. MLE for univariate Gaussian (solution)
      µ_mle  = (1/R) ∑_{i=1..R} xi
      σ²_mle = (1/R) ∑_{i=1..R} (xi − µ_mle)²
    Unbiased Estimators
    An estimator of a parameter is unbiased if the expected value of the estimate is the same as the true value of the parameter.
    If x1, x2, … xR ~ (i.i.d.) N(µ, σ²) then
      E[µ_mle] = E[(1/R) ∑_{i=1..R} xi] = µ
    so µ_mle is unbiased.
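    A direct NumPy translation of the two closed-form estimates above (a sketch on synthetic data; the constants and names are mine, not the slides'):

        import numpy as np

        rng = np.random.default_rng(2)
        x = rng.normal(loc=5.0, scale=2.0, size=100)   # x1..xR ~ N(mu, sigma^2)
        R = len(x)

        mu_mle = x.sum() / R                           # (1/R) * sum_i x_i
        sigma2_mle = ((x - mu_mle) ** 2).sum() / R     # (1/R) * sum_i (x_i - mu_mle)^2

        print(mu_mle, sigma2_mle)                      # same as x.mean() and x.var(ddof=0)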
  • 12. Biased Estimators
    An estimator of a parameter is biased if the expected value of the estimate is different from the true value of the parameter.
    If x1, x2, … xR ~ (i.i.d.) N(µ, σ²) then
      E[σ²_mle] = E[(1/R) ∑_{i=1..R} (xi − µ_mle)²] = E[(1/R) ∑_{i=1..R} (xi − (1/R) ∑_{j=1..R} xj)²] ≠ σ²
    so σ²_mle is biased.
    MLE Variance Bias
    If x1, x2, … xR ~ (i.i.d.) N(µ, σ²) then
      E[σ²_mle] = (1 − 1/R) σ² ≠ σ²
    Intuition check: consider the case R = 1. Why should our guts expect that σ²_mle would be an underestimate of the true σ²? How could you prove that?
  • 13. Unbiased estimate of Variance
    Since E[σ²_mle] = (1 − 1/R) σ², define
      σ²_unbiased = σ²_mle / (1 − 1/R)
    so that E[σ²_unbiased] = σ². Equivalently,
      σ²_unbiased = (1/(R − 1)) ∑_{i=1..R} (xi − µ_mle)²
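    A quick Monte Carlo check of the (1 − 1/R) shrinkage factor (my own sketch, assuming NumPy; the constants are arbitrary):

        import numpy as np

        rng = np.random.default_rng(3)
        R, sigma2, trials = 5, 4.0, 200_000

        x = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, R))
        sigma2_mle = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)   # divide by R
        sigma2_unbiased = sigma2_mle * R / (R - 1)                             # divide by R - 1

        print(sigma2_mle.mean(), (1 - 1 / R) * sigma2)   # both ~3.2: the MLE is biased low
        print(sigma2_unbiased.mean(), sigma2)            # both ~4.0: the corrected estimate is unbiased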
  • 14. Unbiaseditude discussion
    Which is best?
      σ²_mle      = (1/R) ∑_{i=1..R} (xi − µ_mle)²
      σ²_unbiased = (1/(R − 1)) ∑_{i=1..R} (xi − µ_mle)²
    Answer: it depends on the task, and it doesn't make much difference once R gets large.
    Don't get too excited about being unbiased
    Assume x1, x2, … xR ~ (i.i.d.) N(µ, σ²). Suppose we had these estimators for the mean:
      µ_suboptimal = (1/(R + 7)) ∑_{i=1..R} xi
      µ_crap       = x1
    Are either of these unbiased? Will either of them asymptote to the correct value as R gets large? Which is more useful?
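    The slide leaves its three questions open; the simulation below (my own illustration, not part of the slides) lets you check the answers empirically: µ_crap is unbiased but its error never shrinks, while µ_suboptimal is biased yet converges to µ as R grows.

        import numpy as np

        rng = np.random.default_rng(4)
        mu, sigma = 10.0, 3.0

        for R in (5, 50, 500):
            x = rng.normal(mu, sigma, size=(20_000, R))
            mu_suboptimal = x.sum(axis=1) / (R + 7)    # biased (expectation R*mu/(R+7)), but consistent
            mu_crap = x[:, 0]                          # unbiased, but ignores all data after x1
            print(R, mu_suboptimal.mean(), mu_crap.mean(),
                  mu_suboptimal.std(), mu_crap.std())  # only mu_suboptimal's spread shrinks with R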
  • 15. MLE for m-dimensional Gaussian
    Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ), but you don't know µ or Σ.
    MLE: For which θ = (µ, Σ) is x1, x2, … xR most likely?
      µ_mle = (1/R) ∑_{k=1..R} xk
      Σ_mle = (1/R) ∑_{k=1..R} (xk − µ_mle)(xk − µ_mle)^T
    Componentwise, for 1 ≤ i ≤ m:
      µi_mle = (1/R) ∑_{k=1..R} xki
    where xki is the value of the ith component of xk (the ith attribute of the kth record) and µi_mle is the ith component of µ_mle.
  • 16. MLE for m-dimensional Gaussian (continued)
    Componentwise, for 1 ≤ i ≤ m and 1 ≤ j ≤ m:
      σij_mle = (1/R) ∑_{k=1..R} (xki − µi_mle)(xkj − µj_mle)
    where σij_mle is the (i,j)th component of Σ_mle.
    Q: How would you prove this? A: Just plug through the MLE recipe.
    Note how Σ_mle is forced to be symmetric non-negative definite.
    Note the unbiased case: how many datapoints would you need before the Gaussian has a chance of being non-degenerate?
      Σ_unbiased = Σ_mle / (1 − 1/R) = (1/(R − 1)) ∑_{k=1..R} (xk − µ_mle)(xk − µ_mle)^T
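    A short NumPy sketch of the m-dimensional estimates above (synthetic data and my own variable names):

        import numpy as np

        rng = np.random.default_rng(5)
        mu_true = np.array([1.0, -2.0])
        Sigma_true = np.array([[2.0, 0.8],
                               [0.8, 1.0]])
        X = rng.multivariate_normal(mu_true, Sigma_true, size=300)   # rows are x1..xR, m = 2
        R = X.shape[0]

        mu_mle = X.mean(axis=0)
        centered = X - mu_mle
        Sigma_mle = centered.T @ centered / R            # sum of outer products, divided by R
        Sigma_unbiased = centered.T @ centered / (R - 1)

        print(mu_mle)
        print(Sigma_mle)                                 # matches np.cov(X.T, bias=True)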
  • 17. Confidence intervals
    We need to talk. We need to discuss how accurate we expect µ_mle and Σ_mle to be as a function of R, and we need to consider how to estimate these accuracies from data:
    • Analytically*
    • Non-parametrically (using randomization and bootstrapping)*
    But we won't. Not yet.
    Structural error
    Actually, we need to talk about something else too: what if we do all this analysis when the true distribution is in fact not Gaussian? How can we tell?* How can we survive?*
    *Will be discussed in future Andrew lectures… just before we need this technology.
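    The slides defer the details to later lectures; purely as a preview of the non-parametric option mentioned above, here is a minimal bootstrap sketch (my own illustration, with invented data) that resamples the dataset to gauge how much µ_mle would move under repeated sampling:

        import numpy as np

        rng = np.random.default_rng(6)
        x = rng.normal(loc=0.0, scale=1.0, size=40)            # the observed sample

        boot_means = np.array([
            rng.choice(x, size=len(x), replace=True).mean()    # refit the MLE on a resampled dataset
            for _ in range(5000)
        ])
        print(x.mean(), boot_means.std())                      # spread of mu_mle under resampling
        print(np.percentile(boot_means, [2.5, 97.5]))          # a rough 95% interval for mu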
  • 18. Gaussian MLE in action
    Using R = 392 cars from the "MPG" UCI dataset supplied by Ross Quinlan.
    Data-starved Gaussian MLE
    Using three subsets of MPG. Each subset has 6 randomly-chosen cars.
  • 19. Bivariate MLE in action
    Multivariate MLE
    Covariance matrices are not exciting to look at.
  • 20. Being Bayesian: MAP estimates for Gaussians
    Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ), but you don't know µ or Σ.
    MAP: Which (µ, Σ) maximizes p(µ, Σ | x1, x2, … xR)?
    Step 1: Put a prior on (µ, Σ)
    Step 1a: Put a prior on Σ:
      (ν0 − m − 1) Σ ~ IW(ν0, (ν0 − m − 1) Σ0)
    This thing is called the Inverse-Wishart distribution. A PDF over SPD matrices!
  • 21. Being Bayesian: MAP estimates for Gaussians (continued)
    Interpreting the prior on Σ:
      Σ0: (roughly) my best guess of Σ, with E[Σ] = Σ0
      ν0 small: "I am not sure about my guess of Σ0"
      ν0 large: "I'm pretty sure about my guess of Σ0"
    Step 1b: Put a prior on µ | Σ:
      µ | Σ ~ N(µ0, Σ / κ0)
    Together, "Σ" and "µ | Σ" define a joint distribution on (µ, Σ).
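    To get a feel for what ν0, κ0, Σ0 and µ0 encode, one can draw (µ, Σ) pairs from this prior. The sketch below is my own illustration (arbitrary hyperparameter values), using scipy.stats.invwishart with the scale chosen so that E[Σ] = Σ0, matching the slide:

        import numpy as np
        from scipy.stats import invwishart

        rng = np.random.default_rng(7)
        m = 2
        Sigma0 = np.eye(m)                   # best guess of Sigma
        mu0 = np.zeros(m)                    # best guess of mu
        nu0, kappa0 = 10.0, 5.0              # larger values = more confidence in the guesses

        # Sigma ~ IW(nu0, (nu0 - m - 1) * Sigma0), so E[Sigma] = Sigma0
        Sigmas = invwishart(df=nu0, scale=(nu0 - m - 1) * Sigma0).rvs(size=3, random_state=42)
        for Sigma in Sigmas:
            mu = rng.multivariate_normal(mu0, Sigma / kappa0)   # mu | Sigma ~ N(mu0, Sigma / kappa0)
            print(mu, np.diag(Sigma))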
  • 22. Being Bayesian: MAP estimates for Gaussians (continued)
    Interpreting the prior on µ | Σ:
      µ0: my best guess of µ, with E[µ] = µ0
      κ0 small: "I am not sure about my guess of µ0"
      κ0 large: "I'm pretty sure about my guess of µ0"
    Notice how we are forced to express our ignorance of µ proportionally to Σ.
    Why do we use this form of prior?
  • 23. Being Bayesian: MAP estimates for Gaussians (continued)
    Why do we use this form of prior? Actually, we don't have to. But it is computationally and algebraically convenient… it's a conjugate prior.
    Step 1: Prior: (ν0 − m − 1) Σ ~ IW(ν0, (ν0 − m − 1) Σ0),  µ | Σ ~ N(µ0, Σ / κ0)
    Step 2: Update with the data:
      x̄  = (1/R) ∑_{k=1..R} xk
      µR = (κ0 µ0 + R x̄) / (κ0 + R)
      νR = ν0 + R
      κR = κ0 + R
      (νR + m − 1) ΣR = (ν0 + m − 1) Σ0 + ∑_{k=1..R} (xk − x̄)(xk − x̄)^T + (x̄ − µ0)(x̄ − µ0)^T / (1/κ0 + 1/R)
    Step 3: Posterior: (νR + m − 1) Σ ~ IW(νR, (νR + m − 1) ΣR),  µ | Σ ~ N(µR, Σ / κR)
    Result: µ_map = µR, and E[Σ | x1, x2, … xR] = ΣR
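    A direct translation of Step 2 into NumPy (a sketch; the data and hyperparameter values are invented for illustration):

        import numpy as np

        def niw_map_update(X, mu0, Sigma0, nu0, kappa0):
            """Posterior parameters for the conjugate Gaussian prior above (Step 2 of the slide)."""
            R, m = X.shape
            xbar = X.mean(axis=0)
            S = (X - xbar).T @ (X - xbar)                    # sum_k (x_k - xbar)(x_k - xbar)^T
            mu_R = (kappa0 * mu0 + R * xbar) / (kappa0 + R)
            nu_R = nu0 + R
            kappa_R = kappa0 + R
            d = (xbar - mu0).reshape(-1, 1)
            Sigma_R = ((nu0 + m - 1) * Sigma0 + S + (d @ d.T) / (1 / kappa0 + 1 / R)) / (nu_R + m - 1)
            return mu_R, Sigma_R, nu_R, kappa_R              # mu_map = mu_R, E[Sigma | data] = Sigma_R

        rng = np.random.default_rng(8)
        X = rng.multivariate_normal([1.0, -1.0], np.eye(2), size=50)
        print(niw_map_update(X, mu0=np.zeros(2), Sigma0=np.eye(2), nu0=5.0, kappa0=1.0))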
  • 24. Being Bayesian: MAP estimates for Gaussians (notes on the formulae)
    • Look carefully at what these formulae are doing. It's all very sensible.
    • Conjugate priors mean the prior form and posterior form are the same, characterized by "sufficient statistics" of the data.
    • The marginal distribution on µ is a Student-t.
    • One point of view: it's pretty academic if R > 30.
    Where we're at
    (inputs: categorical only / real-valued only / mixed real & categorical)
      Predict category → Classifier:        Joint BC, Naïve BC, Dec Tree
      Probability      → Density Estimator: Joint DE, Naïve DE, Gauss DE
      Predict real no. → Regressor:
  • 25. What you should know
    • The Recipe for MLE
    • Why do we sometimes prefer MLE to MAP?
    • Understand MLE estimation of Gaussian parameters
    • Understand "biased estimator" versus "unbiased estimator"
    • Appreciate the outline behind Bayesian estimation of Gaussian parameters
    Useful exercise
    We'd already done some MLE in this class without even telling you!
    Suppose categorical arity-n inputs x1, x2, … xR ~ (i.i.d.) from a multinomial M(p1, p2, … pn) where P(xk = j | p) = pj.
    What is the MLE p = (p1, p2, … pn)?
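    Working through the recipe (maximize ∑_{k} log p_{xk} subject to ∑_{j} pj = 1, e.g. with a Lagrange multiplier) gives the empirical frequencies, pj_mle = (# of records with xk = j) / R. The sketch below (my own example data) checks that answer numerically against a direct optimization over the simplex:

        import numpy as np
        from scipy.optimize import minimize

        rng = np.random.default_rng(9)
        n = 4
        p_true = np.array([0.1, 0.2, 0.3, 0.4])
        x = rng.choice(n, size=500, p=p_true)          # x_k takes values in {0, ..., n-1}

        counts = np.bincount(x, minlength=n)
        p_freq = counts / counts.sum()                 # empirical frequencies: the claimed MLE

        def neg_ll(z):
            p = np.exp(z) / np.exp(z).sum()            # softmax keeps p on the simplex
            return -np.sum(counts * np.log(p))

        result = minimize(neg_ll, x0=np.zeros(n))
        p_numeric = np.exp(result.x) / np.exp(result.x).sum()
        print(p_freq, p_numeric)                       # the two agree: p_j^mle = count_j / R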