Maximum Likelihood
Much estimation theory is presented in a rather ad hoc fashion. Minimising squared errors seems a good idea, but why not minimise the absolute error or the cube of the absolute error?
The answer is that there is an underlying approach which justifies a particular
minimisation strategy conditional on certain assumptions.
This is the maximum likelihood principle.
The idea is to assume a particular model with unknown parameters; we can then define the probability of observing a given event conditional on a particular set of parameters. Having observed a set of outcomes in the real world, it is then possible to choose the set of parameters which is most likely to have produced the observed results.
This is maximum likelihood. In most cases it is both consistent and efficient, and it provides a standard against which to compare other estimation techniques.
An example
Suppose we sample a set of goods for quality and find 5 defective items in a sample of 10. What is our estimate of the proportion of bad items in the whole population?
Intuitively, of course, it is 50%. Formally, in a sample of size n the probability of finding B bad items is
$$P = \frac{n!}{B!\,(n-B)!}\,\pi^{B}(1-\pi)^{n-B}$$

where $\pi$ is the proportion of bad items in the population.
If the true proportion is 0.1, P = 0.0015; if it is 0.2, P = 0.0264, and so on. We could search for the most likely value, or we can solve the problem analytically:
$$\frac{\partial P}{\partial \hat\pi} = \frac{n!}{B!\,(n-B)!}\left[B\,\hat\pi^{B-1}(1-\hat\pi)^{n-B} - (n-B)\,\hat\pi^{B}(1-\hat\pi)^{n-B-1}\right] = 0$$

$$B\,\hat\pi^{B-1}(1-\hat\pi)^{n-B} = (n-B)\,\hat\pi^{B}(1-\hat\pi)^{n-B-1}$$

$$B(1-\hat\pi) = (n-B)\,\hat\pi \qquad\Rightarrow\qquad \hat\pi = B/n = 5/10 = 0.5$$
So the maximum likelihood estimate of the population proportion of bad
items is 0.5.
This basic procedure can be applied in many cases: once we can define the probability density function for a particular event, we have a general estimation strategy.
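As a quick check, here is a minimal sketch in Python (assuming NumPy is available; the variable names are illustrative) that recovers the same answer both by searching the likelihood over a grid, as suggested above, and from the analytic formula B/n:

```python
import numpy as np
from math import comb

n, B = 10, 5  # sample size and number of defective items found

def likelihood(pi):
    """Probability of observing B bad items in a sample of n when the true proportion is pi."""
    return comb(n, B) * pi**B * (1 - pi)**(n - B)

grid = np.linspace(0.01, 0.99, 99)                      # search for the most likely value
pi_search = grid[np.argmax([likelihood(p) for p in grid])]

pi_hat = B / n                                          # the analytic ML estimate
print(pi_search, pi_hat)                                # both give 0.5
```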
A general Statement
Consider a sample (X1...Xn) which is drawn from a probability distribution P(X|A), where A is a vector of parameters. If the Xs are independent, each with probability density function P(Xi|A), the joint probability of the whole set is
$$P(X_1 \ldots X_n \mid A) = \prod_{i=1}^{n} P(X_i \mid A)$$
This may be maximised with respect to A to give the maximum likelihood estimates.
It is often convenient to work with the log of the likelihood function:

$$\log(L(A)) = \sum_{i=1}^{n} \log(P(X_i \mid A))$$

The advantage of this approach is that it is extremely general, but if the model is misspecified the estimates may be particularly sensitive to this misspecification.
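To make the general recipe concrete, here is a minimal sketch (assuming NumPy and SciPy, with simulated data chosen purely for illustration) that maximises the log-likelihood of an i.i.d. normal sample numerically:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)     # simulated sample (true mean 2.0, sd 1.5)

def neg_loglik(params):
    """Negative log-likelihood of an i.i.d. normal sample."""
    mu, log_sigma = params                       # optimise over log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

res = minimize(neg_loglik, x0=[0.0, 0.0])
print(res.x[0], np.exp(res.x[1]))                # close to 2.0 and 1.5
```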
The Likelihood function for the general non-linear model
If Y is a vector of n endogenous variables and

$$Y = f(X,\theta) + e, \qquad e \sim N(0,\Omega)$$
Then the likelihood function for one period is
$$L(\theta,\Omega) = \frac{1}{(2\pi)^{0.5}\,|\Omega|^{0.5}}\,\exp\!\left[-0.5\,(Y - f(X,\theta))'\,\Omega^{-1}\,(Y - f(X,\theta))\right]$$
and dropping some constants and taking logs
$$\log(L(\theta,\Omega)) = -\log|\Omega| - (Y - f(X,\theta))'\,\Omega^{-1}\,(Y - f(X,\theta))$$
If the covariance structure is constant and has zero off-diagonal elements, this reduces to single-equation OLS.
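A small illustration of this point, as a hedged sketch (NumPy and SciPy assumed; the data are made up and f(X, θ) is taken to be linear for simplicity): maximising the normal likelihood with a constant scalar variance gives the same coefficients as least squares.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)   # made-up data for y = f(X, theta) + e

def neg_loglik(params):
    """Normal log-likelihood (constants dropped) with a constant scalar variance."""
    a, b, log_sigma = params
    sigma2 = np.exp(2 * log_sigma)
    e = y - (a + b * x)                  # f(X, theta) taken to be linear for simplicity
    return 0.5 * (len(y) * np.log(sigma2) + np.sum(e**2) / sigma2)

theta_ml = minimize(neg_loglik, x0=[0.0, 0.0, 0.0]).x[:2]
theta_ols = np.linalg.lstsq(np.column_stack([np.ones_like(x), x]), y, rcond=None)[0]
print(theta_ml, theta_ols)               # the coefficient estimates coincide
```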
Two important matrices
The efficient score matrix
$$S(\theta) = \frac{\partial \log(L(\theta))}{\partial \theta}$$

This is made up of the first derivatives at each point in time. It is a measure of the dispersion of the maximum likelihood estimate.
The information matrix (Hessian)
This is defined as
$$I(\theta) = -E\!\left[\frac{\partial^{2} \log(L(\theta))}{\partial\theta\,\partial\theta'}\right]$$
This is a measure of how `pointy' the likelihood function is.
The variance of the parameters is given either by the inverse Hessian or by the inverse of the outer product of the score matrix:

$$\mathrm{Var}(\hat\theta_{ML}) = [I(\theta)]^{-1} = [S(\theta)'S(\theta)]^{-1}$$
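The following sketch shows one way these two ingredients might be approximated numerically; `loglik_obs` is a hypothetical user-supplied function returning the vector of per-period log-likelihood contributions, and the finite-difference step is an arbitrary choice.

```python
import numpy as np

def numerical_score(loglik_obs, theta, eps=1e-5):
    """T x k matrix of per-period score contributions by central differences."""
    theta = np.asarray(theta, dtype=float)
    base = np.asarray(loglik_obs(theta))
    S = np.empty((base.size, theta.size))
    for j in range(theta.size):
        up, dn = theta.copy(), theta.copy()
        up[j] += eps
        dn[j] -= eps
        S[:, j] = (np.asarray(loglik_obs(up)) - np.asarray(loglik_obs(dn))) / (2 * eps)
    return S

def opg_covariance(loglik_obs, theta_hat):
    """Outer-product-of-gradients estimate of Var(theta_hat): [S'S]^{-1}."""
    S = numerical_score(loglik_obs, theta_hat)
    return np.linalg.inv(S.T @ S)
```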
The Cramer-Rao Lower Bound
This is an important theorem which establishes the superiority of the ML
estimate over all others. The Cramer-Rao lower bound is the smallest
theoretical variance which can be achieved. ML gives this so any other
estimation technique can at best only equal it.
If $\theta^{*}$ is another estimate of $\theta$, then

$$\mathrm{Var}(\theta^{*}) \ge [I(\theta)]^{-1}$$

This is the Cramer-Rao inequality.
Concentrating the Likelihood function
Suppose we split the parameter vector into two sub-vectors

$$L(\theta) = L(\theta_1, \theta_2)$$

Now suppose we knew $\theta_1$; then sometimes we can derive a formula for the ML estimate of $\theta_2$, e.g.

$$\theta_2 = g(\theta_1)$$

then we could write the LF as

$$L(\theta_1, \theta_2) = L(\theta_1, g(\theta_1)) = L^{*}(\theta_1)$$

This is the concentrated likelihood function.
This process is often very useful in practical estimation as it reduces the number of parameters which need to be estimated.
An example of concentrating the LF
The likelihood function for a standard single variable normal non-linear
model is
$$L(\theta) = -T\log(\sigma^{2}) - \sum e^{2}/\sigma^{2}$$

We can concentrate this with respect to the variance as follows. The first-order condition for a maximum with respect to the variance is

$$\frac{\partial L}{\partial \sigma^{2}} = -\frac{T}{\sigma^{2}} + \frac{\sum e^{2}}{(\sigma^{2})^{2}} = 0$$

which implies that

$$\hat\sigma^{2} = \sum e^{2}/T$$

so the concentrated log likelihood becomes

$$L^{*}(\theta) = -T\log\!\left(\sum e^{2}/T\right) - T$$
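As an illustration (not the notes' own code), a sketch assuming NumPy and SciPy: maximising the concentrated likelihood for a hypothetical model y = b·x + e gives the same b as least squares.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 100)
y = 2.0 * x + rng.normal(scale=0.3, size=100)   # hypothetical model y = b*x + e

def concentrated_loglik(b):
    """Concentrated log-likelihood L*(b) = -T log(sum(e^2)/T) - T."""
    e = y - b * x
    T = len(y)
    return -T * np.log(np.sum(e**2) / T) - T

# Maximising L*(b) is the same as minimising the sum of squared errors
res = minimize_scalar(lambda b: -concentrated_loglik(b), bounds=(-10, 10), method="bounded")
b_ols = np.sum(x * y) / np.sum(x**2)            # least squares through the origin, for comparison
print(res.x, b_ols)
```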
Prediction error decomposition
We assumed that the observations were independent in the statements
above. This will not generally be true especially in the presence of
lagged dependent variables. However the prediction error
decomposition allows us to extend standard ML procedures to
dynamic models.
From the basic definition of conditional probability
$$\Pr(a, b) = \Pr(a \mid b)\,\Pr(b)$$

This may be applied directly to the likelihood function:
$$\log(L(Y_1, Y_2, \ldots, Y_{T-1}, Y_T)) = \log(L(Y_T \mid Y_1, Y_2, \ldots, Y_{T-1})) + \log(L(Y_1, Y_2, \ldots, Y_{T-1}))$$

The first term is the conditional probability of Y given all past values. We can then condition the second term and so on to give

$$\log L = \sum_{i=0}^{T-2} \log(L(Y_{T-i} \mid Y_1, \ldots, Y_{T-i-1})) + \log(L(Y_1))$$

That is, a series of one-step-ahead prediction errors conditional on actual lagged Y.
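A sketch of the idea for a hypothetical AR(1) model (NumPy and SciPy assumed; the data are simulated purely for illustration), building the likelihood from one-step-ahead prediction errors:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
T = 300
y = np.zeros(T)
for t in range(1, T):                   # simulate a hypothetical AR(1) process
    y[t] = 0.7 * y[t - 1] + rng.normal(scale=1.0)

def neg_loglik(params):
    """Conditional log-likelihood built from one-step-ahead prediction errors."""
    rho, log_sigma = params
    sigma2 = np.exp(2 * log_sigma)
    e = y[1:] - rho * y[:-1]            # prediction errors given actual lagged y
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + e**2 / sigma2)

res = minimize(neg_loglik, x0=[0.0, 0.0])
print(res.x[0], np.exp(res.x[1]))       # close to 0.7 and 1.0
```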
Testing hypotheses
If a restriction on a model is acceptable, this means that the reduction in the likelihood value caused by imposing the restriction is not `significant'.
This gives us a very general basis for constructing hypothesis tests, but to implement the tests we need some definite metric to judge the tests against, i.e. what is significant.
[Figure: the likelihood function plotted against the parameter, showing the unrestricted value Lu and the restricted value LR.]
Consider how the likelihood function changes as we move around the parameter space. We can evaluate this by taking a Taylor series expansion around the ML point
$$\log(L(\theta)) = \log(L(\hat\theta)) + (\theta - \hat\theta)'\frac{\partial \log(L(\hat\theta))}{\partial\theta} + 0.5\,(\theta - \hat\theta)'\frac{\partial^{2}\log(L(\hat\theta))}{\partial\theta\,\partial\theta'}(\theta - \hat\theta) + O(1)$$

and of course

$$\frac{\partial \log(L(\hat\theta))}{\partial\theta} = S(\hat\theta) = 0 \qquad\text{and}\qquad \frac{\partial^{2}\log(L(\hat\theta))}{\partial\theta\,\partial\theta'} = -I(\hat\theta)$$

So

$$\log(L(\theta)) = \log(L(\hat\theta)) - 0.5\,(\theta - \hat\theta)'\,I(\hat\theta)\,(\theta - \hat\theta)$$

$$\log(L(\hat\theta)) - \log(L(\hat\theta_r)) = 0.5\,(\hat\theta - \hat\theta_r)'\,I(\hat\theta)\,(\hat\theta - \hat\theta_r)$$

It is possible to demonstrate that

$$(\hat\theta - \hat\theta_r)'\,I(\hat\theta)\,(\hat\theta - \hat\theta_r) \sim \chi^{2}(m)$$

where m is the number of restrictions, and so

$$2\left[\log(L(\hat\theta)) - \log(L(\hat\theta_r))\right] \sim \chi^{2}(m)$$

This gives us a measure for judging the significance of likelihood-based tests.
Three test procedures.
To construct the basic test we need an estimate of the likelihood value at
the unrestricted point and the restricted point and we compare these two.
There are three ways of deriving this.
The likelihood ratio test
We simply estimate the model twice, once unrestricted and once restricted, and compare the two.
The Wald test
This estimates only the unrestricted point and uses an estimate of the second derivative to `guess' at the restricted point. Standard `t' tests are a form of Wald test.
The Lagrange multiplier test
This estimates only the restricted model and again uses an estimate of the second derivatives to guess at the unrestricted point.
If the likelihood function were quadratic then LR = LM = W. In general, however, W > LR > LM.
[Figure: the likelihood function with the unrestricted value Lu and the restricted value LR, illustrating the three test principles.]
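As an illustration of the likelihood ratio test in the normal linear model, a sketch with simulated data (NumPy and SciPy assumed), using the concentrated likelihood so that LR = T·log(SSR_r/SSR_u):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
T = 200
x1, x2 = rng.normal(size=T), rng.normal(size=T)
y = 1.0 + 0.5 * x1 + rng.normal(size=T)          # the restriction beta2 = 0 is true here

def ssr(X):
    """Sum of squared OLS residuals for the regressor matrix X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return e @ e

X_u = np.column_stack([np.ones(T), x1, x2])       # unrestricted model
X_r = np.column_stack([np.ones(T), x1])           # restricted model (beta2 = 0)

LR = T * np.log(ssr(X_r) / ssr(X_u))              # 2*(logL_u - logL_r) with the variance concentrated out
p_value = chi2.sf(LR, df=1)                       # one restriction
print(LR, p_value)
```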
A special form of the LM test
The LM test can be calculated in a particularly convenient way under
certain circumstances.
The general form of the LM test is
$$LM = S(\theta)'\,[I(\theta)]^{-1}\,S(\theta) \sim \chi^{2}(m)$$

Now suppose

$$Y_t = f(X_t, \theta_1, \theta_2) + e_t$$

where we assume that the subset of parameters $\theta_1$ is fixed according to a set of restrictions g = 0 (G is the derivative of this restriction).
Now

$$S(\theta_1) = \sigma^{-2}G'e, \qquad I(\theta_1) = \sigma^{-2}E(G'G)$$

and so the LM test becomes

$$LM = \sigma^{-2}e'G\,\left[\sigma^{-2}E(G'G)\right]^{-1}\sigma^{-2}G'e$$

If $E(G'G) = G'G$ and $\hat\sigma^{2} = e'e/T$, this becomes

$$LM = \frac{e'G(G'G)^{-1}G'e}{e'e/T}$$

which may be interpreted as TR² from a regression of e on G.
This is used in many tests for serial correlation, heteroskedasticity, functional form, etc.
Here e is the vector of actual errors from the restricted model and G is the derivative of the restrictions in the model.
An Example: Serial correlation
Suppose
$$Y = X\beta + u, \qquad u_t = \rho u_{t-1} + e_t$$

The restriction that $\rho = 0$ may be tested with an LM test as follows: estimate the model without serial correlation, save the residuals $\hat u$, then estimate the model

$$\hat u_t = X_t\gamma + \sum_{i=1}^{m}\rho_i \hat u_{t-i}$$

Then TR² from this regression is an LM(m) test for serial correlation.
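A sketch of this procedure on simulated data (NumPy and SciPy assumed; the lag order m and the data-generating process are made up for illustration):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
T, m = 300, 2                                    # sample size and lag order of the test
x = rng.normal(size=T)
u = np.zeros(T)
for t in range(1, T):                            # hypothetical AR(1) errors
    u[t] = 0.4 * u[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + u

def ols_resid(X, y):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return y - X @ b

# Step 1: estimate the model without serial correlation and save the residuals
X = np.column_stack([np.ones(T), x])
u_hat = ols_resid(X, y)

# Step 2: regress the residuals on X and m lags of themselves
lags = np.column_stack([np.r_[np.zeros(i), u_hat[:-i]] for i in range(1, m + 1)])
Z = np.column_stack([X, lags])
e = ols_resid(Z, u_hat)
R2 = 1 - (e @ e) / (u_hat @ u_hat)

LM = T * R2                                      # chi-squared(m) under the null of no serial correlation
print(LM, chi2.sf(LM, df=m))
```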
Quasi Maximum Likelihood
ML rests on the assumption that the errors follow a particular distribution (OLS is only ML if the errors are normal, etc.). What happens if we make the wrong assumption?
White (1982), Econometrica, 50(1), p. 1, demonstrates that, under very broad assumptions about the misspecification of the error process, ML is still a consistent estimator. The estimation is then referred to as Quasi Maximum Likelihood.
But the covariance matrix is no longer the standard ML one; instead it is given by
$$C(\hat\theta) = I(\hat\theta)^{-1}\left[S(\hat\theta)'S(\hat\theta)\right]I(\hat\theta)^{-1}$$
Generally we may construct valid Wald and LM tests by using this
corrected covariance matrix but the LR test is invalid as it works directly
from the value of the likelihood function.
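A minimal sketch of this corrected (sandwich) covariance, assuming the user supplies minus the Hessian and the T x k matrix of score contributions evaluated at the QML estimates:

```python
import numpy as np

def qml_sandwich_cov(neg_hessian, scores):
    """Sandwich (QML) covariance I^{-1} (S'S) I^{-1}.

    neg_hessian : k x k, minus the Hessian of the log-likelihood at the estimates
    scores      : T x k matrix of per-period score contributions at the estimates
    """
    I_inv = np.linalg.inv(neg_hessian)
    return I_inv @ (scores.T @ scores) @ I_inv
```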
Numerical optimisation
In simple cases (e.g. OLS) we can calculate the maximum likelihood
estimates analytically. But in many cases we cannot, then we resort to
numerical optimisation of the likelihood function.
This amounts to hill climbing in parameter space.
There are many algorithms, and many computer programmes implement these for you.
It is useful to understand the broad steps of the procedure.
1. Set an arbitrary initial set of parameters.
2. Determine a direction of movement.
3. Determine a step length to move.
4. Examine some termination criteria and either stop or go back to 2.
[Figure: hill climbing on the likelihood function, moving from θ1 to θ2 towards the maximum Lu.]
Important classes of maximisation techniques.
Gradient methods. These base the direction of movement on the first
derivatives of the LF with respect to the parameters. Often the step length
is also determined by (an approximation to) the second derivatives. So
$$\theta_{i+1} = \theta_i - \left[\frac{\partial^{2}L}{\partial\theta\,\partial\theta'}\right]^{-1}\frac{\partial L}{\partial\theta}$$
These include Newton, quasi-Newton, scoring, steepest descent, Davidon-Fletcher-Powell, BHHH, etc.
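A sketch of a Newton-type gradient method using numerical first and second derivatives (the step size `eps` and the convergence tolerance are arbitrary choices; production code would use analytic derivatives or a BHHH-style approximation):

```python
import numpy as np

def newton_maximise(loglik, theta0, tol=1e-8, max_iter=100, eps=1e-4):
    """Newton-type hill climbing: theta_{i+1} = theta_i - H^{-1} g, with numerical g and H."""
    theta = np.asarray(theta0, dtype=float)
    k = theta.size
    for _ in range(max_iter):
        g = np.empty(k)
        H = np.empty((k, k))
        for i in range(k):
            ei = np.zeros(k)
            ei[i] = eps
            g[i] = (loglik(theta + ei) - loglik(theta - ei)) / (2 * eps)
            for j in range(k):
                ej = np.zeros(k)
                ej[j] = eps
                H[i, j] = (loglik(theta + ei + ej) - loglik(theta + ei - ej)
                           - loglik(theta - ei + ej) + loglik(theta - ei - ej)) / (4 * eps**2)
        step = np.linalg.solve(H, g)    # H is negative definite near a maximum,
        theta = theta - step            # so -H^{-1} g points uphill
        if np.max(np.abs(step)) < tol:  # termination criterion
            break
    return theta
```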
Derivative-free techniques. These do not use derivatives, and so they are less efficient but more robust to extreme non-linearities, e.g. Powell's method or the non-linear simplex.
These techniques can all be sensitive to starting values and `tuning'
parameters.
Some special LFs
Qualitative response models.
These are where we have only partial information (insects and poison) in
one form or another.
We assume an underlying continuous model,
$$Y_t = X_t\beta + u_t$$
but we only observe certain limited information, e.g. z = 1 or 0 related to Y:

$$z = 1 \text{ if } Y > 0, \qquad z = 0 \text{ if } Y < 0$$
Then we can group the data into two groups and form a likelihood function with the following form:

$$L = \prod_{z=0} F(-X_t\beta)\;\prod_{z=1}\left(1 - F(-X_t\beta)\right)$$
where F is a particular distribution function, e.g. the standard normal cumulative distribution function (the probit model) or perhaps the logistic function (the logit model).
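A sketch of the resulting likelihood for the probit case (simulated data; NumPy and SciPy assumed):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(6)
T = 500
X = np.column_stack([np.ones(T), rng.normal(size=T)])
beta_true = np.array([0.5, 1.0])
y_star = X @ beta_true + rng.normal(size=T)   # latent continuous model Y = X*beta + u
z = (y_star > 0).astype(float)                # observed binary indicator

def neg_loglik(beta):
    """Probit likelihood: P(z=0) = F(-X'beta), P(z=1) = 1 - F(-X'beta)."""
    p0 = np.clip(norm.cdf(-X @ beta), 1e-12, 1 - 1e-12)
    return -np.sum(np.where(z == 0, np.log(p0), np.log(1 - p0)))

res = minimize(neg_loglik, x0=np.zeros(2))
print(res.x)                                   # close to beta_true
```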
ARCH and GARCH
These are an important class of models which have time-varying variances.
Suppose
$$Y_t = X_t\beta + e_t, \qquad e_t \sim N(0, h_t), \qquad h_t = \alpha_0 + \alpha_1 e_{t-1}^{2} + \alpha_2 h_{t-1}$$
then the likelihood function for this model is
$$\log(L(\beta,\alpha)) = \sum_{t=1}^{T}\left(-\log|h_t| - e_t^{2}/h_t\right)$$

which is a specialisation of the general normal LF with a time-varying variance.
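A sketch of this likelihood for a GARCH(1,1) error process (the initialisation of h and the suggested starting values are arbitrary choices; in the full model the errors e would themselves depend on β and be estimated jointly):

```python
import numpy as np

def garch_neg_loglik(params, e):
    """Negative log-likelihood (constants dropped) for GARCH(1,1) errors e_t ~ N(0, h_t)."""
    a0, a1, a2 = params
    T = e.size
    h = np.empty(T)
    h[0] = np.var(e)                        # arbitrary initialisation of the variance recursion
    for t in range(1, T):
        h[t] = a0 + a1 * e[t - 1]**2 + a2 * h[t - 1]
    return 0.5 * np.sum(np.log(h) + e**2 / h)

# Usage sketch: minimise over (a0, a1, a2) with positivity constraints, e.g.
#   from scipy.optimize import minimize
#   minimize(garch_neg_loglik, x0=[0.1, 0.1, 0.8], args=(e,),
#            bounds=[(1e-6, None), (0.0, 1.0), (0.0, 1.0)])
```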
An alternative approach
Method of moments
A widely used technique in estimation is the Generalised Method of Moments (GMM). This is an extension of the standard method of moments.
The idea here is that if we have random drawings from an unknown
probability distribution then the sample statistics we calculate will
converge in probability to some constant. This constant will be a function
of the unknown parameters of the distribution. If we want to estimate k of
these parameters,
$$\theta_1, \ldots, \theta_k$$
we compute k statistics (or moments) whose probability limits are known
functions of the parameters
$$m_1, \ldots, m_k$$
These k moments are set equal to the function which generates the
moments and the function is inverted.
$$\hat\theta = f^{-1}(m)$$
A simple example
Suppose the first moment (the mean) is generated by the following distribution: $\mu_1 = f(x \mid \theta_1)$. The observed moment from a sample of n observations is

$$m_1 = (1/n)\sum_{i=1}^{n} x_i$$

Hence we set

$$m_1 = f(x \mid \theta_1)$$

and

$$\hat\theta_1 = f^{-1}(m_1)$$
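A concrete illustration, under the assumption that the data are exponential so that the population mean is 1/λ; inverting the first moment then gives the estimator directly:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=2.0, size=1000)   # exponential data: population mean = 1/lambda = 2.0

m1 = x.mean()                               # observed first moment
lam_hat = 1.0 / m1                          # invert mean = 1/lambda
print(lam_hat)                              # close to the true lambda = 0.5
```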
Method of Moments Estimation (MM)
This is a direct extension of the method of moments into a much more
useful setting.
The idea here is that we have a model which implies certain things about
the distribution or covariances of the variables and the errors. So we
know what some moments of the distribution should be. We then invert
the model to give us estimates of the unknown parameters of the model
which match the theoretical moments for a given sample.
So suppose we have a model
$$Y = f(X, \theta)$$
where $\theta$ is a vector of k parameters, and we have k conditions (or moments) which should be met by the model:
$$E\left(g(Y, X \mid \theta)\right) = 0$$
Then we approximate E(g) with a sample measure and invert g:
$$\hat\theta = g^{-1}(Y, X, 0)$$
Examples
OLS
In OLS estimation we make the assumption that the regressors (Xs) are
orthogonal to the errors. Thus
$$E(Xe) = 0$$
The sample analogue for each xi is
$$(1/n)\sum_{t=1}^{n} x_{it}e_t = 0$$
and so
$$0 = (1/n)\sum_{t=1}^{n} x_{it}e_t = (1/n)\sum_{t=1}^{n} x_{it}\left(y_t - x_t'\beta\right)$$
and so the method of moments estimator in this case is the value of $\beta$ which simultaneously solves these i equations. This will be identical to the OLS estimate.
Maximum likelihood as an MM estimator
In maximum likelihood we have a general likelihood function.
$$\ln(L(\theta)) = \sum \ln\left(f(y, x \mid \theta)\right)$$
and this will be maximised when the following k first order conditions are
met.
$$E\left(\partial \ln(f(y, x \mid \theta))/\partial\theta\right) = 0$$
This gives rise to the following k sample conditions
$$(1/n)\sum_{i=1}^{n} \partial \ln(f(y, x \mid \theta))/\partial\theta = 0$$
Simultaneously solving these equations for $\theta$ gives the MM equivalent of maximum likelihood.
Generalised Method of Moments (GMM)
In the previous conditions there are as many moments as unknown parameters, so the parameters are uniquely and exactly determined. If there were fewer moment conditions we would not be able to solve them for a unique set of parameters (the model would be under-identified). If there are more moment conditions than parameters then all the conditions cannot be met at the same time; the model is over-identified and we have GMM estimation.
Basically, if we cannot satisfy all the conditions at the same time we have to trade them off against each other. So we need to make them all as close to zero as possible at the same time. We need a criterion function to minimise.
Suppose we have k parameters but L moment conditions L>k.
$$E(m_j(\theta)) = 0, \qquad (1/n)\sum_{t=1}^{n} m_j(\theta) = 0, \qquad j = 1, \ldots, L$$
Then we need to make all L moments as small as possible
simultaneously. One way is a weighted least squares criterion.
$$\min_{\theta}\; q = m(\theta)'\,A\,m(\theta)$$
That is, the weighted squared sum of the moments.
This gives a consistent estimator for any positive definite matrix A (not a function of $\theta$).
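A sketch of the criterion as a function that could be passed to a numerical optimiser; `moment_fn` is a hypothetical user-supplied function returning the n x L matrix of moment contributions, and the two-step re-weighting in the comments anticipates the discussion of the optimal A below:

```python
import numpy as np

def gmm_objective(theta, moment_fn, data, A):
    """Weighted GMM criterion q(theta) = m_bar' A m_bar for the L sample moments m_bar."""
    m_bar = moment_fn(theta, data).mean(axis=0)   # average of the n x L moment contributions
    return m_bar @ A @ m_bar

# Usage sketch (moment_fn, data, theta0 and L are hypothetical):
#   from scipy.optimize import minimize
#   A = np.eye(L)                                 # any positive definite A gives consistency
#   step1 = minimize(gmm_objective, theta0, args=(moment_fn, data, A))
#   S = np.cov(moment_fn(step1.x, data), rowvar=False)
#   step2 = minimize(gmm_objective, step1.x,      # re-weight with the inverse moment covariance
#                    args=(moment_fn, data, np.linalg.inv(S)))
```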
The optimal A
If any positive definite weighting matrix gives a consistent estimator, they clearly cannot all be equally efficient, so what is the optimal choice of A?
Hansen (1982) established the basic properties of the optimal A and how to construct the covariance of the parameter estimates.
The optimal A is simply the inverse of the covariance matrix of the moment conditions (just as in GLS).
Thus

$$\text{optimal } A = W^{-1}, \qquad W = \operatorname{asy.var}(m)$$
The parameters which solve this criterion function then have the following properties:

$$(\hat\theta_{gmm} - \theta) \sim N(0, V_{gmm})$$

where
$$V_{gmm} = (1/n)\left(G'\Phi^{-1}G\right)^{-1}$$
where G is the matrix of derivatives of the moments with respect to the
parameters and
$$\Phi = \operatorname{var}\!\left(n^{1/2}\, m(\theta)\right)$$

is the variance of the moments, evaluated at the true parameter value.
Conclusion
• Both ML and GMM are very flexible estimation strategies.
• They are equivalent ways of approaching the same problem in many instances.