Learning with Maximum Likelihood

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University

www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Copyright © 2001, 2004, Andrew W. Moore                                     Sep 6th, 2001




             Maximum Likelihood learning of
               Gaussians for Data Mining
  •     Why we should care
  •     Learning Univariate Gaussians
  •     Learning Multivariate Gaussians
  •     What’s a biased estimator?
  •     Bayesian Learning of Gaussians




Why we should care
• Maximum Likelihood Estimation is a very
  very very very fundamental part of data
  analysis.
• “MLE for Gaussians” is training wheels for
  our future techniques
• Learning Gaussians is more useful than you
  might guess…








      Learning Gaussians from Data
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, σ²)
• But you don't know µ
                                  (you do know σ²)

                          Sneer

  MLE: For which µ is x1, x2, … xR most likely?

  MAP: Which µ maximizes p(µ | x1, x2, … xR, σ²)?

Despite this, we'll spend 95% of our time on MLE. Why? Wait and see…
MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~(i.i.d) N(µ,σ2)
• But you don’t know µ (you do know σ2)
• MLE: For which µ is x1, x2, … xR most likely?

$$\mu^{\,mle} = \arg\max_{\mu}\; p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)$$








                           Algebra Euphoria

$$\mu^{\,mle} = \arg\max_{\mu}\; p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2)$$

$$= \arg\max_{\mu}\; \prod_{i=1}^{R} p(x_i \mid \mu, \sigma^2) \qquad \text{(by i.i.d.)}$$

$$= \arg\max_{\mu}\; \sum_{i=1}^{R} \log p(x_i \mid \mu, \sigma^2) \qquad \text{(monotonicity of log)}$$

$$= \arg\max_{\mu}\; \sum_{i=1}^{R} \left( \log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i-\mu)^2}{2\sigma^2} \right) \qquad \text{(plug in formula for Gaussian)}$$

$$= \arg\min_{\mu}\; \sum_{i=1}^{R} (x_i-\mu)^2 \qquad \text{(after simplification)}$$
     Intermission: A General Scalar
             MLE strategy
Task: Find MLE θ assuming known form for p(Data| θ,stuff)
1. Write LL = log P(Data| θ,stuff)
2. Work out ∂LL/∂θ using high-school calculus
3. Set ∂LL/∂θ=0 for a maximum, creating an equation in
    terms of θ
4. Solve it*
5. Check that you’ve found a maximum rather than a
    minimum or saddle-point, and be careful if θ is
    constrained

                          *This is a perfect example of something that works perfectly in
                             all textbook examples and usually involves surprising pain if
                                                          you need it for something new.
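The recipe is mechanical enough to check on a toy case. Below is a minimal sketch, not part of the original slides, that follows steps 1–4 symbolically with sympy for the known-σ² univariate Gaussian; the three data values are made up for illustration.

```python
import sympy as sp

# Step 1: write LL = log P(Data | theta) for three illustrative datapoints
# (known sigma^2 = 1; mu is the parameter we are estimating).
mu = sp.symbols('mu', real=True)
data = [2.0, 3.5, 5.0]                                  # made-up sample
LL = sum(sp.log(1 / sp.sqrt(2 * sp.pi) * sp.exp(-(x - mu) ** 2 / 2)) for x in data)

# Step 2: work out dLL/dmu.
dLL = sp.diff(LL, mu)

# Step 3: set dLL/dmu = 0, giving one equation in mu, and solve it.
mu_hat = sp.solve(sp.Eq(dLL, 0), mu)[0]

# Step 5: the second derivative is a negative constant, so this is a maximum.
print(mu_hat)                                           # -> 3.5, the sample mean
print(sp.simplify(sp.diff(LL, mu, 2)))                  # -> -3
```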
                                        The MLE µ

$$\mu^{\,mle} = \arg\max_{\mu}\; p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = \arg\min_{\mu}\; \sum_{i=1}^{R} (x_i-\mu)^2$$

$$= \mu \ \text{ s.t. } \ 0 = \frac{\partial LL}{\partial\mu} = \frac{\partial}{\partial\mu}\sum_{i=1}^{R}(x_i-\mu)^2 = -\sum_{i=1}^{R} 2\,(x_i-\mu)$$

$$\text{Thus } \ \mu = \frac{1}{R}\sum_{i=1}^{R} x_i$$

Lawks-a-lawdy!
$$\mu^{\,mle} = \frac{1}{R}\sum_{i=1}^{R} x_i$$

  • The best estimate of the mean of a
    distribution is the mean of the sample!
                                                               At first sight:
                                                 This kind of pedantic, algebra-filled and
                                                ultimately unsurprising fact is exactly the
                                                     reason people throw down their
                                                “Statistics” book and pick up their “Agent
                                                  Based Evolutionary Data Mining Using
                                                    The Neuro-Fuzz Transform” book.





                  A General MLE strategy
Suppose θ = (θ1, θ2, …, θn)T is a vector of parameters.
Task: Find MLE θ assuming known form for p(Data| θ,stuff)
1. Write LL = log P(Data| θ,stuff)
2. Work out ∂LL/∂θ using high-school calculus:

$$\frac{\partial LL}{\partial \boldsymbol{\theta}} = \begin{pmatrix} \partial LL/\partial\theta_1 \\ \partial LL/\partial\theta_2 \\ \vdots \\ \partial LL/\partial\theta_n \end{pmatrix}$$

3. Solve the set of simultaneous equations

$$\frac{\partial LL}{\partial\theta_1} = 0, \qquad \frac{\partial LL}{\partial\theta_2} = 0, \qquad \ldots, \qquad \frac{\partial LL}{\partial\theta_n} = 0$$

4. Check that you're at a maximum

   If you can't solve them, what should you do?
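The slides leave that question hanging; one standard answer is to maximize LL numerically instead of analytically. Here is a minimal sketch, not from the slides, that hands the negative log-likelihood of a univariate Gaussian to scipy.optimize.minimize; the toy data, starting point, and reparameterization of σ are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=4.0, scale=2.0, size=200)          # toy sample

def neg_log_likelihood(theta):
    """theta = (mu, log_sigma); optimizing log(sigma) keeps sigma positive."""
    mu, log_sigma = theta
    sigma2 = np.exp(2 * log_sigma)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)                              # lands near the sample mean and std
```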
           MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, σ²)
• But you don't know µ or σ²
• MLE: For which θ = (µ, σ²) is x1, x2, … xR most likely?

$$\log p(x_1, x_2, \ldots, x_R \mid \mu, \sigma^2) = -R\left(\log\sqrt{2\pi} + \tfrac{1}{2}\log\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{R}(x_i-\mu)^2$$

$$\frac{\partial LL}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{R}(x_i-\mu)$$

$$\frac{\partial LL}{\partial\sigma^2} = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{R}(x_i-\mu)^2$$

Setting both partial derivatives to zero:

$$0 = \frac{1}{\sigma^2}\sum_{i=1}^{R}(x_i-\mu) \ \Rightarrow\ \mu = \frac{1}{R}\sum_{i=1}^{R} x_i$$

$$0 = -\frac{R}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{R}(x_i-\mu)^2 \ \Rightarrow\ \text{what?}$$

MLE for univariate Gaussian
• Suppose you have x1, x2, … xR ~(i.i.d) N(µ,σ2)
• But you don’t know µ or σ2
• MLE: For which θ =(µ,σ2) is x1, x2,…xR most likely?

$$\mu^{\,mle} = \frac{1}{R}\sum_{i=1}^{R} x_i$$

$$\sigma^2_{mle} = \frac{1}{R}\sum_{i=1}^{R}\left(x_i - \mu^{\,mle}\right)^2$$
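As a quick sanity check (not from the slides), both closed forms are one line each in numpy; the sample values are placeholders. Note that np.var uses this same 1/R convention by default.

```python
import numpy as np

x = np.array([2.1, 3.9, 4.4, 5.0, 6.3])        # any sample x_1 ... x_R

mu_mle = x.mean()                               # (1/R) * sum of x_i
sigma2_mle = np.mean((x - mu_mle) ** 2)         # (1/R) * sum of squared deviations

assert np.isclose(sigma2_mle, x.var())          # np.var divides by R by default
print(mu_mle, sigma2_mle)
```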








                      Unbiased Estimators
• An estimator of a parameter is unbiased if the
  expected value of the estimate is the same as the
  true value of the parameters.
• If x1, x2, … xR ~(i.i.d) N(µ,σ2) then
$$E\!\left[\mu^{\,mle}\right] = E\!\left[\frac{1}{R}\sum_{i=1}^{R} x_i\right] = \mu$$

                                                    µmle is unbiased




Biased Estimators
• An estimator of a parameter is biased if the
  expected value of the estimate is different from
  the true value of the parameters.
• If x1, x2, … xR ~(i.i.d) N(µ,σ2) then
$$E\!\left[\sigma^2_{mle}\right] = E\!\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i-\mu^{\,mle}\right)^2\right] = E\!\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i - \frac{1}{R}\sum_{j=1}^{R}x_j\right)^{\!2}\right] \ne \sigma^2$$

                                                   σ²mle is biased








                         MLE Variance Bias
• If x1, x2, … xR ~(i.i.d) N(µ,σ2) then

$$E\!\left[\sigma^2_{mle}\right] = E\!\left[\frac{1}{R}\sum_{i=1}^{R}\left(x_i - \frac{1}{R}\sum_{j=1}^{R}x_j\right)^{\!2}\right] = \left(1 - \frac{1}{R}\right)\sigma^2 \ne \sigma^2$$

              Intuition check: consider the case of R = 1.
              Why should our guts expect that σ²mle would be an underestimate of the true σ²?
              How could you prove that?
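Short of the algebra, simulation is one way to convince yourself. The following sketch, not from the slides, averages σ²mle over many small samples and compares the result to the (1 − 1/R)σ² prediction; the sample size and true parameters are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
R, true_sigma2, trials = 5, 4.0, 200_000

samples = rng.normal(loc=0.0, scale=np.sqrt(true_sigma2), size=(trials, R))
sigma2_mle = samples.var(axis=1)               # 1/R convention, one estimate per sample

print(sigma2_mle.mean())                       # empirically close to (1 - 1/R) * sigma^2
print((1 - 1 / R) * true_sigma2)               # = 3.2 for these settings
```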




      Unbiased estimate of Variance
• If x1, x2, … xR ~ (i.i.d.) N(µ, σ²) then

$$E\!\left[\sigma^2_{mle}\right] = \left(1 - \frac{1}{R}\right)\sigma^2 \ne \sigma^2$$

So define

$$\sigma^2_{unbiased} = \frac{\sigma^2_{mle}}{1-\frac{1}{R}} = \frac{1}{R-1}\sum_{i=1}^{R}\left(x_i - \mu^{\,mle}\right)^2$$

So

$$E\!\left[\sigma^2_{unbiased}\right] = \sigma^2$$
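For what it's worth, numpy exposes both conventions through the ddof argument; a one-line sketch (not from the slides, data are placeholders):

```python
import numpy as np

x = np.array([2.1, 3.9, 4.4, 5.0, 6.3])

sigma2_mle = x.var()              # divides by R      (the biased MLE)
sigma2_unbiased = x.var(ddof=1)   # divides by R - 1  (the unbiased estimate)
print(sigma2_mle, sigma2_unbiased)
```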




Unbiaseditude discussion
• Which is best?
$$\sigma^2_{mle} = \frac{1}{R}\sum_{i=1}^{R}\left(x_i - \mu^{\,mle}\right)^2 \qquad\qquad \sigma^2_{unbiased} = \frac{1}{R-1}\sum_{i=1}^{R}\left(x_i - \mu^{\,mle}\right)^2$$

                                   Answer:
                                   • It depends on the task
                                   • And it doesn't make much difference once R gets large





Don't get too excited about being unbiased
• Assume x1, x2, … xR ~ (i.i.d.) N(µ, σ²)
• Suppose we had these estimators for the mean

$$\mu^{\,suboptimal} = \frac{1}{R+7}\sum_{i=1}^{R} x_i \qquad\qquad \mu^{\,crap} = x_1$$

                                                 Are either of these unbiased?
                                                 Will either of them asymptote to the
                                                 correct value as R gets large?
                                                 Which is more useful?



  MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don't know µ or Σ
• MLE: For which θ = (µ, Σ) is x1, x2, … xR most likely?

$$\mu^{\,mle} = \frac{1}{R}\sum_{k=1}^{R}\mathbf{x}_k \qquad\qquad \Sigma^{mle} = \frac{1}{R}\sum_{k=1}^{R}\left(\mathbf{x}_k-\mu^{\,mle}\right)\left(\mathbf{x}_k-\mu^{\,mle}\right)^T$$

Component by component, where 1 ≤ i ≤ m and 1 ≤ j ≤ m, xki is the value of the ith component of xk (the ith attribute of the kth record), µimle is the ith component of µmle, and σijmle is the (i,j)th component of Σmle:

$$\mu_i^{\,mle} = \frac{1}{R}\sum_{k=1}^{R} x_{ki} \qquad\qquad \sigma_{ij}^{\,mle} = \frac{1}{R}\sum_{k=1}^{R}\left(x_{ki}-\mu_i^{\,mle}\right)\left(x_{kj}-\mu_j^{\,mle}\right)$$
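A minimal numpy sketch of these two formulas (not from the slides; the data are random placeholders). np.cov reproduces the same matrix when told to divide by R rather than R − 1.

```python
import numpy as np

X = np.random.default_rng(2).normal(size=(100, 3))    # R = 100 records, m = 3 attributes

mu_mle = X.mean(axis=0)                                # (1/R) * sum of x_k
centered = X - mu_mle
Sigma_mle = (centered.T @ centered) / len(X)           # (1/R) * sum of outer products

assert np.allclose(Sigma_mle, np.cov(X, rowvar=False, bias=True))
print(mu_mle)
print(Sigma_mle)
```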




  MLE for m-dimensional Gaussian
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don't know µ or Σ
• MLE: For which θ = (µ, Σ) is x1, x2, … xR most likely?

Q: How would you prove this?
A: Just plug through the MLE recipe.

Note how Σmle is forced to be symmetric and non-negative definite.
Note the unbiased case below.
How many datapoints would you need before the Gaussian has a chance of being non-degenerate?

$$\Sigma^{unbiased} = \frac{\Sigma^{mle}}{1-\frac{1}{R}} = \frac{1}{R-1}\sum_{k=1}^{R}\left(\mathbf{x}_k-\mu^{\,mle}\right)\left(\mathbf{x}_k-\mu^{\,mle}\right)^T$$
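On that last question: Σmle is a sum of R rank-one outer products of centered vectors, so its rank is at most min(R − 1, m); with R ≤ m datapoints it is singular and the fitted Gaussian is degenerate. A quick numerical illustration, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 5
for R in (3, 5, 6, 20):                           # fewer, equal, and more points than dimensions
    X = rng.normal(size=(R, m))
    centered = X - X.mean(axis=0)
    Sigma_mle = (centered.T @ centered) / R
    print(R, np.linalg.matrix_rank(Sigma_mle))    # rank is at most min(R - 1, m)
```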




Confidence intervals
 We need to talk

 We need to discuss how accurate we expect µmle and Σmle to be as
 a function of R
 And we need to consider how to estimate these accuracies from
 data…
 •Analytically *
 •Non-parametrically (using randomization and bootstrapping) *
 But we won’t. Not yet.
                   *Will be discussed in future Andrew lectures…just before
                   we need this technology.





                              Structural error
 Actually, we need to talk about something else too…
 What if we do all this analysis when the true distribution is in fact
 not Gaussian?
 How can we tell? *
 How can we survive? *
                   *Will be discussed in future Andrew lectures…just before
                   we need this technology.




Gaussian MLE in action
Using R=392 cars from the
“MPG” UCI dataset supplied
by Ross Quinlan
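The plots from this slide are not reproduced here. A rough sketch, not from the slides, of redoing the univariate fit on one attribute of that dataset; the file path and parsing are assumptions (the UCI "auto-mpg" data file is whitespace-separated with mpg as the first field, and the slide's R = 392 comes from dropping the 6 records with missing horsepower).

```python
import numpy as np

# Hypothetical local copy of the UCI auto-mpg data file; the first
# whitespace-separated field on each line is the mpg value.
with open("auto-mpg.data") as f:
    mpg = np.array([float(line.split()[0]) for line in f if line.strip()])

mu_mle, sigma2_mle = mpg.mean(), mpg.var()    # the univariate MLE formulas from Slide 21
print(len(mpg), mu_mle, sigma2_mle)
```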








          Data-starved Gaussian MLE
Using three subsets of MPG.
Each subset has 6
randomly-chosen cars.




Bivariate MLE in action








                                    Multivariate MLE




                                   Covariance matrices are not exciting to look at




Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don't know µ or Σ
• MAP: Which (µ,Σ) maximizes p(µ,Σ | x1, x2, … xR)?

Step 1: Put a prior on (µ,Σ)

Step 1a: Put a prior on Σ:
            (ν0−m−1) Σ ~ IW(ν0, (ν0−m−1) Σ0)
This thing is called the Inverse-Wishart distribution: a PDF over SPD matrices!
            Σ0: (roughly) my best guess of Σ.  E[Σ] = Σ0.
            ν0 small: "I am not sure about my guess of Σ0."
            ν0 large: "I'm pretty sure about my guess of Σ0."

Step 1b: Put a prior on µ | Σ:
            µ | Σ ~ N(µ0, Σ / κ0)
Together, "Σ" and "µ | Σ" define a joint distribution on (µ,Σ).
            µ0: my best guess of µ.  E[µ] = µ0.
            κ0 small: "I am not sure about my guess of µ0."
            κ0 large: "I'm pretty sure about my guess of µ0."

Notice how we are forced to express our ignorance of µ proportionally to Σ.




  Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• But you don't know µ or Σ
• MAP: Which (µ,Σ) maximizes p(µ,Σ | x1, x2, … xR)?

Why do we use this form of prior?
Actually, we don't have to. But it is computationally and algebraically convenient…
…it's a conjugate prior.




  Being Bayesian: MAP estimates for Gaussians
• Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ)
• MAP: Which (µ,Σ) maximizes p(µ,Σ | x1, x2, … xR)?

Step 1: Prior: (ν0−m−1) Σ ~ IW(ν0, (ν0−m−1) Σ0),  µ | Σ ~ N(µ0, Σ / κ0)

Step 2:

$$\bar{\mathbf{x}} = \frac{1}{R}\sum_{k=1}^{R}\mathbf{x}_k \qquad \mu_R = \frac{\kappa_0\,\mu_0 + R\,\bar{\mathbf{x}}}{\kappa_0 + R} \qquad \nu_R = \nu_0 + R \qquad \kappa_R = \kappa_0 + R$$

$$(\nu_R + m - 1)\,\Sigma_R = (\nu_0 + m - 1)\,\Sigma_0 + \sum_{k=1}^{R}\left(\mathbf{x}_k-\bar{\mathbf{x}}\right)\left(\mathbf{x}_k-\bar{\mathbf{x}}\right)^T + \frac{\left(\bar{\mathbf{x}}-\mu_0\right)\left(\bar{\mathbf{x}}-\mu_0\right)^T}{1/\kappa_0 + 1/R}$$

Step 3: Posterior: (νR+m−1) Σ ~ IW(νR, (νR+m−1) ΣR),  µ | Σ ~ N(µR, Σ / κR)

Result: µmap = µR,  E[Σ | x1, x2, … xR] = ΣR

• Look carefully at what these formulae are doing. It's all very sensible.
• Conjugate priors mean prior form and posterior form are the same and characterized by "sufficient statistics" of the data.
• The marginal distribution on µ is a student-t.
• One point of view: it's pretty academic if R > 30.
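A minimal sketch of the Step 2 updates, not from the slides; it simply transcribes the formulas above into numpy, and the example data, prior guesses, and function name are illustrative assumptions.

```python
import numpy as np

def niw_map_update(X, mu0, kappa0, nu0, Sigma0):
    """Posterior parameters (mu_R, Sigma_R, kappa_R, nu_R) for a Gaussian with the
    conjugate Normal x Inverse-Wishart prior, following the slide's update rules.
    X is an (R, m) array of datapoints."""
    R, m = X.shape
    xbar = X.mean(axis=0)
    kappaR = kappa0 + R
    nuR = nu0 + R
    muR = (kappa0 * mu0 + R * xbar) / kappaR
    S = (X - xbar).T @ (X - xbar)                       # scatter about the sample mean
    d = (xbar - mu0).reshape(-1, 1)
    SigmaR = ((nu0 + m - 1) * Sigma0 + S
              + (d @ d.T) / (1.0 / kappa0 + 1.0 / R)) / (nuR + m - 1)
    return muR, SigmaR, kappaR, nuR

# Tiny illustration with made-up data and a weak prior guess:
X = np.random.default_rng(4).normal(size=(50, 2))
mu0, Sigma0 = np.zeros(2), np.eye(2)
print(niw_map_update(X, mu0, kappa0=1.0, nu0=5.0, Sigma0=Sigma0)[0])   # mu_map
```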




                                Where we're at

                              Categorical          Real-valued         Mixed Real /
                              inputs only          inputs only         Cat okay

 Inputs -> Classifier         Joint BC,                                Dec Tree
 (predict category)           Naïve BC

 Inputs -> Density            Joint DE,            Gauss DE
 Estimator (probability)      Naïve DE

 Inputs -> Regressor
 (predict real no.)




What you should know
• The Recipe for MLE
• Why do we sometimes prefer MLE to MAP?
• Understand MLE estimation of Gaussian
  parameters
• Understand “biased estimator” versus
  “unbiased estimator”
• Appreciate the outline behind Bayesian
  estimation of Gaussian parameters





                              Useful exercise
• We’d already done some MLE in this class
  without even telling you!
• Suppose categorical arity-n inputs x1, x2, …
  xR~(i.i.d.) from a multinomial
                 M(p1, p2, … pn)
      where
                P(xk=j|p)=pj
• What is the MLE p=(p1, p2, … pn)?
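A quick counting sketch, not from the slides, with made-up categorical data; the candidate answer is pj = (# records with xk = j) / R, and the final assert is a spot check that it beats one perturbed alternative.

```python
import numpy as np

# Made-up categorical data: R records, each taking a value j in {0, ..., n-1}.
x = np.array([0, 2, 1, 1, 2, 2, 0, 2, 1, 2])
n = 3

counts = np.bincount(x, minlength=n)
p_mle = counts / len(x)                    # candidate answer: relative frequencies
print(p_mle)                               # [0.2 0.3 0.5]

def log_lik(p):
    return np.sum(np.log(p[x]))

p_other = np.array([0.3, 0.3, 0.4])        # any other valid p
assert log_lik(p_mle) > log_lik(p_other)
```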


Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 

Maximum Likelihood Estimation

  • 1. Learning with Maximum Likelihood Andrew W. Moore Note to other teachers and users of these slides. Andrew would be delighted Professor if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them School of Computer Science to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in Carnegie Mellon University your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: www.cs.cmu.edu/~awm http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully awm@cs.cmu.edu received. 412-268-7599 Copyright © 2001, 2004, Andrew W. Moore Sep 6th, 2001 Maximum Likelihood learning of Gaussians for Data Mining • Why we should care • Learning Univariate Gaussians • Learning Multivariate Gaussians • What’s a biased estimator? • Bayesian Learning of Gaussians Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 2 1
  • 2. Why we should care • Maximum Likelihood Estimation is a very very very very fundamental part of data analysis. • “MLE for Gaussians” is training wheels for our future techniques • Learning Gaussians is more useful than you might guess… Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 3 Learning Gaussians from Data • Suppose you have x1, x2, … xR ~ (i.i.d) N(µ,σ2) • But you don’t know µ (you do know σ2) MLE: For which µ is x1, x2, … xR most likely? MAP: Which µ maximizes p(µ|x1, x2, … xR , σ2)? Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 4 2
  • 3. Learning Gaussians from Data • Suppose you have x1, x2, … xR ~(i.i.d) N(µ,σ2) • But you don’t know µ (you do know σ2) Sneer MLE: For which µ is x1, x2, … xR most likely? MAP: Which µ maximizes p(µ|x1, x2, … xR , σ2)? Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 5 Learning Gaussians from Data • Suppose you have x1, x2, … xR ~(i.i.d) N(µ,σ2) • But you don’t know µ (you do know σ2) Sneer MLE: For which µ is x1, x2, … xR most likely? MAP: Which µ maximizes p(µ|x1, x2, … xR , σ2)? Despite this, we’ll spend 95% of our time on MLE. Why? Wait and see… Copyright © 2001, 2004, Andrew W. Moore Maximum Likelihood: Slide 6 3
  • 4. MLE for univariate Gaussian
    Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, σ²), but you don't know µ (you do know σ²).
    MLE: For which µ is x1, x2, … xR most likely?
      µ_mle = argmax_µ p(x1, x2, … xR | µ, σ²)
    Algebra Euphoria (the steps, to be filled in on the next slide):
      µ_mle = argmax_µ p(x1, x2, … xR | µ, σ²)
            = …   (by i.i.d.)
            = …   (monotonicity of log)
            = …   (plug in formula for Gaussian)
            = …   (after simplification)
  • 5. Algebra Euphoria (filled in)
      µ_mle = argmax_µ p(x1, x2, … xR | µ, σ²)
            = argmax_µ ∏_{i=1..R} p(xi | µ, σ²)                              (by i.i.d.)
            = argmax_µ ∑_{i=1..R} log p(xi | µ, σ²)                          (monotonicity of log)
            = argmax_µ ∑_{i=1..R} [ log(1/(√(2π) σ)) − (xi − µ)² / (2σ²) ]   (plug in formula for Gaussian)
            = argmin_µ ∑_{i=1..R} (xi − µ)²                                  (after simplification)
    Intermission: A General Scalar MLE strategy
    Task: Find MLE θ assuming known form for p(Data | θ, stuff)
    1. Write LL = log P(Data | θ, stuff)
    2. Work out ∂LL/∂θ using high-school calculus
    3. Set ∂LL/∂θ = 0 for a maximum, creating an equation in terms of θ
    4. Solve it*
    5. Check that you've found a maximum rather than a minimum or saddle-point, and be careful if θ is constrained
    *This is a perfect example of something that works perfectly in all textbook examples and usually involves surprising pain if you need it for something new.
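    To make the scalar recipe concrete, here is a minimal Python/NumPy sketch (my own illustration, assuming SciPy is installed; the data, the σ² value, and the function names are invented for the example). It minimizes the negative Gaussian log-likelihood over µ numerically and confirms that the optimum agrees with the sample mean derived on the following slides.

        import numpy as np
        from scipy.optimize import minimize_scalar

        rng = np.random.default_rng(0)
        sigma2 = 4.0                                   # known variance, as in the slides
        x = rng.normal(loc=3.0, scale=np.sqrt(sigma2), size=50)   # x1..xR ~ N(mu, sigma^2)

        def neg_log_likelihood(mu):
            # -LL = (R/2) log(2*pi*sigma^2) + sum_i (x_i - mu)^2 / (2*sigma^2)
            return 0.5 * len(x) * np.log(2 * np.pi * sigma2) + 0.5 * np.sum((x - mu) ** 2) / sigma2

        result = minimize_scalar(neg_log_likelihood)
        print(result.x, x.mean())                      # the two agree up to solver tolerance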
  • 6. The MLE µ
      µ_mle = argmax_µ p(x1, x2, … xR | µ, σ²) = argmin_µ ∑_{i=1..R} (xi − µ)²
            = the µ such that 0 = ∂LL/∂µ = … (what?)
    Filled in:
      0 = ∂LL/∂µ = ∂/∂µ ∑_{i=1..R} (xi − µ)² = −∑_{i=1..R} 2(xi − µ)
    Thus µ = (1/R) ∑_{i=1..R} xi
  • 7. Lawks-a-lawdy!
      µ_mle = (1/R) ∑_{i=1..R} xi
    The best estimate of the mean of a distribution is the mean of the sample!
    At first sight: this kind of pedantic, algebra-filled and ultimately unsurprising fact is exactly the reason people throw down their "Statistics" book and pick up their "Agent Based Evolutionary Data Mining Using The Neuro-Fuzz Transform" book.
    A General MLE strategy
    Suppose θ = (θ1, θ2, …, θn)^T is a vector of parameters.
    Task: Find MLE θ assuming known form for p(Data | θ, stuff)
    1. Write LL = log P(Data | θ, stuff)
    2. Work out ∂LL/∂θ using high-school calculus:
       ∂LL/∂θ = (∂LL/∂θ1, ∂LL/∂θ2, …, ∂LL/∂θn)^T
  • 8. A General MLE strategy (continued)
    3. Solve the set of simultaneous equations
       ∂LL/∂θ1 = 0,  ∂LL/∂θ2 = 0,  …,  ∂LL/∂θn = 0
    4. Check that you're at a maximum
  • 9. A General MLE strategy (continued)
    If you can't solve the simultaneous equations ∂LL/∂θi = 0 analytically, what should you do?
    (Step 4 still applies: check that you're at a maximum.)
    MLE for univariate Gaussian
    Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, σ²), but you don't know µ or σ².
    MLE: For which θ = (µ, σ²) is x1, x2, … xR most likely?
      log p(x1, x2, … xR | µ, σ²) = −(R/2)(log 2π + log σ²) − (1/(2σ²)) ∑_{i=1..R} (xi − µ)²
      ∂LL/∂µ  = (1/σ²) ∑_{i=1..R} (xi − µ)
      ∂LL/∂σ² = −R/(2σ²) + (1/(2σ⁴)) ∑_{i=1..R} (xi − µ)²
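    For the "if you can't solve them" case, the standard fallback is a numerical optimizer. The sketch below (my own illustration, assuming SciPy; the data are synthetic) minimizes the negative log-likelihood over θ = (µ, log σ²), parameterizing by log σ² so the variance stays positive during the search.

        import numpy as np
        from scipy.optimize import minimize

        rng = np.random.default_rng(1)
        x = rng.normal(loc=-2.0, scale=1.5, size=200)      # x1..xR ~ N(mu, sigma^2), both unknown

        def neg_ll(theta):
            mu, log_sigma2 = theta                         # optimize log(sigma^2) so sigma^2 > 0
            sigma2 = np.exp(log_sigma2)
            return 0.5 * len(x) * np.log(2 * np.pi * sigma2) + 0.5 * np.sum((x - mu) ** 2) / sigma2

        result = minimize(neg_ll, x0=np.array([0.0, 0.0])) # quasi-Newton search over (mu, log sigma^2)
        mu_hat, sigma2_hat = result.x[0], np.exp(result.x[1])
        print(mu_hat, sigma2_hat)                          # close to x.mean() and x.var(ddof=0)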
  • 10. MLE for univariate Gaussian (continued)
    Setting the derivatives to zero:
      0 = (1/σ²) ∑_{i=1..R} (xi − µ)                        ⇒ µ = (1/R) ∑_{i=1..R} xi
      0 = −R/(2σ²) + (1/(2σ⁴)) ∑_{i=1..R} (xi − µ)²         ⇒ what?
  • 11. MLE for univariate Gaussian (solution)
      µ_mle  = (1/R) ∑_{i=1..R} xi
      σ²_mle = (1/R) ∑_{i=1..R} (xi − µ_mle)²
    Unbiased Estimators
    An estimator of a parameter is unbiased if the expected value of the estimate is the same as the true value of the parameter.
    If x1, x2, … xR ~ (i.i.d.) N(µ, σ²) then
      E[µ_mle] = E[(1/R) ∑_{i=1..R} xi] = µ
    so µ_mle is unbiased.
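    A direct NumPy translation of the two closed-form estimates above (a sketch on synthetic data; the constants and names are mine, not the slides'):

        import numpy as np

        rng = np.random.default_rng(2)
        x = rng.normal(loc=5.0, scale=2.0, size=100)   # x1..xR ~ N(mu, sigma^2)
        R = len(x)

        mu_mle = x.sum() / R                           # (1/R) * sum_i x_i
        sigma2_mle = ((x - mu_mle) ** 2).sum() / R     # (1/R) * sum_i (x_i - mu_mle)^2

        print(mu_mle, sigma2_mle)                      # same as x.mean() and x.var(ddof=0)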
  • 12. Biased Estimators
    An estimator of a parameter is biased if the expected value of the estimate is different from the true value of the parameter.
    If x1, x2, … xR ~ (i.i.d.) N(µ, σ²) then
      E[σ²_mle] = E[(1/R) ∑_{i=1..R} (xi − µ_mle)²] = E[(1/R) ∑_{i=1..R} (xi − (1/R) ∑_{j=1..R} xj)²] ≠ σ²
    so σ²_mle is biased.
    MLE Variance Bias
    If x1, x2, … xR ~ (i.i.d.) N(µ, σ²) then
      E[σ²_mle] = (1 − 1/R) σ² ≠ σ²
    Intuition check: consider the case R = 1. Why should our guts expect that σ²_mle would be an underestimate of the true σ²? How could you prove that?
  • 13. Unbiased estimate of Variance
    Since E[σ²_mle] = (1 − 1/R) σ², define
      σ²_unbiased = σ²_mle / (1 − 1/R)
    so that E[σ²_unbiased] = σ². Equivalently,
      σ²_unbiased = (1/(R − 1)) ∑_{i=1..R} (xi − µ_mle)²
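    A quick Monte Carlo check of the (1 − 1/R) shrinkage factor (my own sketch, assuming NumPy; the constants are arbitrary):

        import numpy as np

        rng = np.random.default_rng(3)
        R, sigma2, trials = 5, 4.0, 200_000

        x = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, R))
        sigma2_mle = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)   # divide by R
        sigma2_unbiased = sigma2_mle * R / (R - 1)                             # divide by R - 1

        print(sigma2_mle.mean(), (1 - 1 / R) * sigma2)   # both ~3.2: the MLE is biased low
        print(sigma2_unbiased.mean(), sigma2)            # both ~4.0: the corrected estimate is unbiased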
  • 14. Unbiaseditude discussion
    Which is best?
      σ²_mle      = (1/R) ∑_{i=1..R} (xi − µ_mle)²
      σ²_unbiased = (1/(R − 1)) ∑_{i=1..R} (xi − µ_mle)²
    Answer: it depends on the task, and it doesn't make much difference once R gets large.
    Don't get too excited about being unbiased
    Assume x1, x2, … xR ~ (i.i.d.) N(µ, σ²). Suppose we had these estimators for the mean:
      µ_suboptimal = (1/(R + 7)) ∑_{i=1..R} xi
      µ_crap       = x1
    Are either of these unbiased? Will either of them asymptote to the correct value as R gets large? Which is more useful?
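    The slide leaves its three questions open; the simulation below (my own illustration, not part of the slides) lets you check the answers empirically: µ_crap is unbiased but its error never shrinks, while µ_suboptimal is biased yet converges to µ as R grows.

        import numpy as np

        rng = np.random.default_rng(4)
        mu, sigma = 10.0, 3.0

        for R in (5, 50, 500):
            x = rng.normal(mu, sigma, size=(20_000, R))
            mu_suboptimal = x.sum(axis=1) / (R + 7)    # biased (expectation R*mu/(R+7)), but consistent
            mu_crap = x[:, 0]                          # unbiased, but ignores all data after x1
            print(R, mu_suboptimal.mean(), mu_crap.mean(),
                  mu_suboptimal.std(), mu_crap.std())  # only mu_suboptimal's spread shrinks with R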
  • 15. MLE for m-dimensional Gaussian
    Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ), but you don't know µ or Σ.
    MLE: For which θ = (µ, Σ) is x1, x2, … xR most likely?
      µ_mle = (1/R) ∑_{k=1..R} xk
      Σ_mle = (1/R) ∑_{k=1..R} (xk − µ_mle)(xk − µ_mle)^T
    Componentwise, for 1 ≤ i ≤ m:
      µi_mle = (1/R) ∑_{k=1..R} xki
    where xki is the value of the ith component of xk (the ith attribute of the kth record) and µi_mle is the ith component of µ_mle.
  • 16. MLE for m-dimensional Gaussian (continued)
    Componentwise, for 1 ≤ i ≤ m and 1 ≤ j ≤ m:
      σij_mle = (1/R) ∑_{k=1..R} (xki − µi_mle)(xkj − µj_mle)
    where σij_mle is the (i,j)th component of Σ_mle.
    Q: How would you prove this? A: Just plug through the MLE recipe.
    Note how Σ_mle is forced to be symmetric non-negative definite.
    Note the unbiased case: how many datapoints would you need before the Gaussian has a chance of being non-degenerate?
      Σ_unbiased = Σ_mle / (1 − 1/R) = (1/(R − 1)) ∑_{k=1..R} (xk − µ_mle)(xk − µ_mle)^T
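    A short NumPy sketch of the m-dimensional estimates above (synthetic data and my own variable names):

        import numpy as np

        rng = np.random.default_rng(5)
        mu_true = np.array([1.0, -2.0])
        Sigma_true = np.array([[2.0, 0.8],
                               [0.8, 1.0]])
        X = rng.multivariate_normal(mu_true, Sigma_true, size=300)   # rows are x1..xR, m = 2
        R = X.shape[0]

        mu_mle = X.mean(axis=0)
        centered = X - mu_mle
        Sigma_mle = centered.T @ centered / R            # sum of outer products, divided by R
        Sigma_unbiased = centered.T @ centered / (R - 1)

        print(mu_mle)
        print(Sigma_mle)                                 # matches np.cov(X.T, bias=True)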
  • 17. Confidence intervals
    We need to talk. We need to discuss how accurate we expect µ_mle and Σ_mle to be as a function of R, and we need to consider how to estimate these accuracies from data:
    • Analytically*
    • Non-parametrically (using randomization and bootstrapping)*
    But we won't. Not yet.
    Structural error
    Actually, we need to talk about something else too: what if we do all this analysis when the true distribution is in fact not Gaussian? How can we tell?* How can we survive?*
    *Will be discussed in future Andrew lectures… just before we need this technology.
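    The slides defer the details to later lectures; purely as a preview of the non-parametric option mentioned above, here is a minimal bootstrap sketch (my own illustration, with invented data) that resamples the dataset to gauge how much µ_mle would move under repeated sampling:

        import numpy as np

        rng = np.random.default_rng(6)
        x = rng.normal(loc=0.0, scale=1.0, size=40)            # the observed sample

        boot_means = np.array([
            rng.choice(x, size=len(x), replace=True).mean()    # refit the MLE on a resampled dataset
            for _ in range(5000)
        ])
        print(x.mean(), boot_means.std())                      # spread of mu_mle under resampling
        print(np.percentile(boot_means, [2.5, 97.5]))          # a rough 95% interval for mu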
  • 18. Gaussian MLE in action
    Using R = 392 cars from the "MPG" UCI dataset supplied by Ross Quinlan.
    Data-starved Gaussian MLE
    Using three subsets of MPG. Each subset has 6 randomly-chosen cars.
  • 19. Bivariate MLE in action
    Multivariate MLE
    Covariance matrices are not exciting to look at.
  • 20. Being Bayesian: MAP estimates for Gaussians
    Suppose you have x1, x2, … xR ~ (i.i.d.) N(µ, Σ), but you don't know µ or Σ.
    MAP: Which (µ, Σ) maximizes p(µ, Σ | x1, x2, … xR)?
    Step 1: Put a prior on (µ, Σ)
    Step 1a: Put a prior on Σ:
      (ν0 − m − 1) Σ ~ IW(ν0, (ν0 − m − 1) Σ0)
    This thing is called the Inverse-Wishart distribution. A PDF over SPD matrices!
  • 21. Being Bayesian: MAP estimates for Gaussians (continued)
    Interpreting the prior on Σ:
      Σ0: (roughly) my best guess of Σ, with E[Σ] = Σ0
      ν0 small: "I am not sure about my guess of Σ0"
      ν0 large: "I'm pretty sure about my guess of Σ0"
    Step 1b: Put a prior on µ | Σ:
      µ | Σ ~ N(µ0, Σ / κ0)
    Together, "Σ" and "µ | Σ" define a joint distribution on (µ, Σ).
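    To get a feel for what ν0, κ0, Σ0 and µ0 encode, one can draw (µ, Σ) pairs from this prior. The sketch below is my own illustration (arbitrary hyperparameter values), using scipy.stats.invwishart with the scale chosen so that E[Σ] = Σ0, matching the slide:

        import numpy as np
        from scipy.stats import invwishart

        rng = np.random.default_rng(7)
        m = 2
        Sigma0 = np.eye(m)                   # best guess of Sigma
        mu0 = np.zeros(m)                    # best guess of mu
        nu0, kappa0 = 10.0, 5.0              # larger values = more confidence in the guesses

        # Sigma ~ IW(nu0, (nu0 - m - 1) * Sigma0), so E[Sigma] = Sigma0
        Sigmas = invwishart(df=nu0, scale=(nu0 - m - 1) * Sigma0).rvs(size=3, random_state=42)
        for Sigma in Sigmas:
            mu = rng.multivariate_normal(mu0, Sigma / kappa0)   # mu | Sigma ~ N(mu0, Sigma / kappa0)
            print(mu, np.diag(Sigma))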
  • 22. Being Bayesian: MAP estimates for Gaussians (continued)
    Interpreting the prior on µ | Σ:
      µ0: my best guess of µ, with E[µ] = µ0
      κ0 small: "I am not sure about my guess of µ0"
      κ0 large: "I'm pretty sure about my guess of µ0"
    Notice how we are forced to express our ignorance of µ proportionally to Σ.
    Why do we use this form of prior?
  • 23. Being Bayesian: MAP estimates for Gaussians (continued)
    Why do we use this form of prior? Actually, we don't have to. But it is computationally and algebraically convenient… it's a conjugate prior.
    Step 1: Prior: (ν0 − m − 1) Σ ~ IW(ν0, (ν0 − m − 1) Σ0),  µ | Σ ~ N(µ0, Σ / κ0)
    Step 2: Update with the data:
      x̄  = (1/R) ∑_{k=1..R} xk
      µR = (κ0 µ0 + R x̄) / (κ0 + R)
      νR = ν0 + R
      κR = κ0 + R
      (νR + m − 1) ΣR = (ν0 + m − 1) Σ0 + ∑_{k=1..R} (xk − x̄)(xk − x̄)^T + (x̄ − µ0)(x̄ − µ0)^T / (1/κ0 + 1/R)
    Step 3: Posterior: (νR + m − 1) Σ ~ IW(νR, (νR + m − 1) ΣR),  µ | Σ ~ N(µR, Σ / κR)
    Result: µ_map = µR, and E[Σ | x1, x2, … xR] = ΣR
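    A direct translation of Step 2 into NumPy (a sketch; the data and hyperparameter values are invented for illustration):

        import numpy as np

        def niw_map_update(X, mu0, Sigma0, nu0, kappa0):
            """Posterior parameters for the conjugate Gaussian prior above (Step 2 of the slide)."""
            R, m = X.shape
            xbar = X.mean(axis=0)
            S = (X - xbar).T @ (X - xbar)                    # sum_k (x_k - xbar)(x_k - xbar)^T
            mu_R = (kappa0 * mu0 + R * xbar) / (kappa0 + R)
            nu_R = nu0 + R
            kappa_R = kappa0 + R
            d = (xbar - mu0).reshape(-1, 1)
            Sigma_R = ((nu0 + m - 1) * Sigma0 + S + (d @ d.T) / (1 / kappa0 + 1 / R)) / (nu_R + m - 1)
            return mu_R, Sigma_R, nu_R, kappa_R              # mu_map = mu_R, E[Sigma | data] = Sigma_R

        rng = np.random.default_rng(8)
        X = rng.multivariate_normal([1.0, -1.0], np.eye(2), size=50)
        print(niw_map_update(X, mu0=np.zeros(2), Sigma0=np.eye(2), nu0=5.0, kappa0=1.0))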
  • 24. Being Bayesian: MAP estimates for Gaussians (notes on the formulae)
    • Look carefully at what these formulae are doing. It's all very sensible.
    • Conjugate priors mean the prior form and posterior form are the same, characterized by "sufficient statistics" of the data.
    • The marginal distribution on µ is a Student-t.
    • One point of view: it's pretty academic if R > 30.
    Where we're at
    (inputs: categorical only / real-valued only / mixed real & categorical)
      Predict category → Classifier:        Joint BC, Naïve BC, Dec Tree
      Probability      → Density Estimator: Joint DE, Naïve DE, Gauss DE
      Predict real no. → Regressor:
  • 25. What you should know
    • The Recipe for MLE
    • Why do we sometimes prefer MLE to MAP?
    • Understand MLE estimation of Gaussian parameters
    • Understand "biased estimator" versus "unbiased estimator"
    • Appreciate the outline behind Bayesian estimation of Gaussian parameters
    Useful exercise
    We'd already done some MLE in this class without even telling you!
    Suppose categorical arity-n inputs x1, x2, … xR ~ (i.i.d.) from a multinomial M(p1, p2, … pn) where P(xk = j | p) = pj.
    What is the MLE p = (p1, p2, … pn)?
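    Working through the recipe (maximize ∑_{k} log p_{xk} subject to ∑_{j} pj = 1, e.g. with a Lagrange multiplier) gives the empirical frequencies, pj_mle = (# of records with xk = j) / R. The sketch below (my own example data) checks that answer numerically against a direct optimization over the simplex:

        import numpy as np
        from scipy.optimize import minimize

        rng = np.random.default_rng(9)
        n = 4
        p_true = np.array([0.1, 0.2, 0.3, 0.4])
        x = rng.choice(n, size=500, p=p_true)          # x_k takes values in {0, ..., n-1}

        counts = np.bincount(x, minlength=n)
        p_freq = counts / counts.sum()                 # empirical frequencies: the claimed MLE

        def neg_ll(z):
            p = np.exp(z) / np.exp(z).sum()            # softmax keeps p on the simplex
            return -np.sum(counts * np.log(p))

        result = minimize(neg_ll, x0=np.zeros(n))
        p_numeric = np.exp(result.x) / np.exp(result.x).sum()
        print(p_freq, p_numeric)                       # the two agree: p_j^mle = count_j / R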