OLS
Motivation
Economics is (essentially) an observational science
Theory provides guidance regarding the relationships between variables
–Example: Monetary policy and macroeconomic conditions
What?: Properties of OLS
Why?: Most commonly used estimation technique
How?: From simple to more complex
Outline
1. Simple (bivariate) linear regression
2. General framework for regression analysis
3. OLS estimator and its properties
4. CLS (OLS estimation subject to linear constraints)
5. Inference (Tests for linear constraints)
6. Prediction
OLS
Correlation Coefficient
Intended to measure direction and closeness of linear association
Observations: $\{y_t, x_t\}_{t=1}^T$
Data expressed in deviations from the (sample) mean:
$$\tilde{z}_t = z_t - \bar{z}, \qquad \bar{z} = T^{-1}\sum_{t=1}^T z_t, \qquad z = y, x$$
$\mathrm{Cov}(y,x) = E(yx) - E(y)E(x)$ is estimated by
$$s_{xy} = T^{-1}\sum_{t=1}^T \tilde{x}_t \tilde{y}_t$$
which depends on the units in which $x$ and $y$ are measured
The correlation coefficient is a measure of linear association independent of units:
$$r = T^{-1}\sum_{t=1}^T \frac{\tilde{x}_t}{s_x}\,\frac{\tilde{y}_t}{s_y} = \frac{s_{xy}}{s_x s_y}, \qquad s_z = \sqrt{T^{-1}\sum_{t=1}^T \tilde{z}_t^2}, \quad z = y, x$$
Limits: $-1 \le r \le 1$ (applying the Cauchy-Schwarz inequality)
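As a quick illustration of these definitions, here is a minimal Python sketch (the slides contain no code, so the language and the function name `correlation` are our own) computing $s_{xy}$, $s_x$, $s_y$, and $r$ exactly as defined above:

```python
import numpy as np

def correlation(y, x):
    """Sample correlation r = s_xy / (s_x * s_y), with T^{-1} normalization."""
    T = len(y)
    y_dev = y - y.mean()                    # deviations from the sample mean
    x_dev = x - x.mean()
    s_xy = (x_dev * y_dev).sum() / T        # sample covariance
    s_x = np.sqrt((x_dev ** 2).sum() / T)   # sample standard deviations
    s_y = np.sqrt((y_dev ** 2).sum() / T)
    return s_xy / (s_x * s_y)
```

`np.corrcoef(y, x)[0, 1]` returns the same number, since the $T^{-1}$ factors cancel in the ratio.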
OLS
Caution
Fallacy: "Post hoc, ergo propter hoc" (after this, therefore because of this)
Correlation is not causation
Numerical and statistical significance may mean nothing
Nonsense (spurious) correlation
Yule (1926):
– Death rate vs. proportion of marriages in the Church of England (1866-1911)
– $r = 0.95$
– Ironic: to achieve immortality, close the church!
A few more recent examples
OLS
Simple linear regression model
Economics as a remedy for nonsense (correlation does not indicate the direction of dependence)
Take a stance:
$$y_t = \beta_1 + \beta_2 x_t + u_t$$
– Linear
– Dependent / independent
– Systematic / unpredictable
$T$ observations, 2 unknowns
Infinitely many possible solutions:
– Fit a line by eye
– Choose two pairs of observations and join them
– Minimize the distance between $y$ and its predictable component
$$\min \sum |u_t| \rightarrow \text{LAD} \qquad\qquad \min \sum u_t^2 \rightarrow \text{OLS}$$
OLS
Simple linear regression model
Define the sum of squares of the residuals (SSR) function as:
$$S_T(\beta) = \sum_{t=1}^T (y_t - \beta_1 - \beta_2 x_t)^2$$
Estimator: formula for estimating the unknown parameters
Estimate: numerical value obtained when sample data are substituted into the formula
The OLS estimator ($b$) minimizes $S_T(\beta)$. FONC:
$$\left.\frac{\partial S_T(\beta)}{\partial \beta_1}\right|_b = -2\sum\left(y_t - b_1 - b_2 x_t\right) = 0$$
$$\left.\frac{\partial S_T(\beta)}{\partial \beta_2}\right|_b = -2\sum x_t\left(y_t - b_1 - b_2 x_t\right) = 0$$
Two equations, two unknowns:
$$b_1 = \bar{y} - b_2\bar{x}, \qquad b_2 = \frac{s_{xy}}{s_x^2} = r\,\frac{s_y}{s_x} = \frac{\sum_{t=1}^T \tilde{x}_t\tilde{y}_t}{\sum_{t=1}^T \tilde{x}_t^2}$$
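A hedged sketch of these closed-form solutions in Python (the function name `ols_bivariate` is ours, not from the text):

```python
import numpy as np

def ols_bivariate(y, x):
    """Closed-form OLS for y_t = b1 + b2*x_t + u_t from the FONC solutions."""
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    b2 = (x_dev * y_dev).sum() / (x_dev ** 2).sum()  # slope: s_xy / s_x^2
    b1 = y.mean() - b2 * x.mean()                    # intercept: ybar - b2*xbar
    return b1, b2
```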
OLS
Simple linear regression model
Properties:
– $b_1, b_2$ minimize the SSR
– The OLS line passes through the mean point $(\bar{x}, \bar{y})$
– The residuals $y_t - b_1 - b_2 x_t$ are uncorrelated (in the sample) with $x_t$
Figure 9: SSR
OLS
General Framework
Observational data $\{w_1, w_2, \ldots, w_T\}$
Partition $w_t = (y_t, x_t)$ where $y_t \in \mathbb{R}$, $x_t \in \mathbb{R}^k$
Joint density: $f(y_t, x_t; \theta)$, with $\theta$ a vector of unknown parameters
Conditional decomposition: $f(y_t, x_t; \theta) = f(y_t \mid x_t; \theta_1)\,f(x_t; \theta_2)$, where $f(x_t; \theta_2) = \int_{-\infty}^{\infty} f(y_t, x_t; \theta)\,dy$
Regression analysis: statistical inference on $\theta_1$
Ignore $f(x_t; \theta_2)$ provided $\theta_1$ and $\theta_2$ are "variation free"
$y$: 'dependent' or 'endogenous' variable. $x$: vector of 'independent' or 'exogenous' variables
Conditional mean: $m(x_t; \theta_3)$. Conditional variance: $g(x_t; \theta_4)$
$$m(x_t; \theta_3) = E(y_t \mid x_t; \theta_3) = \int_{-\infty}^{\infty} y\,f(y \mid x_t; \theta_1)\,dy$$
$$g(x_t; \theta_4) = \int_{-\infty}^{\infty} y^2 f(y \mid x_t; \theta_1)\,dy - \left[m(x_t; \theta_3)\right]^2$$
$u_t$: difference between $y_t$ and the conditional mean:
$$y_t = m(x_t; \theta_3) + u_t \tag{1}$$
OLS
General Framework
Proposition 1 (Properties of $u_t$)
1. $E(u_t \mid x_t) = 0$
2. $E(u_t) = 0$
3. $E[h(x_t)u_t] = 0$ for any function $h(\cdot)$
4. $E(x_t u_t) = 0$
Proof. 1. By definition of $u_t$ and linearity of conditional expectations,
$$E(u_t \mid x_t) = E[y_t - m(x_t) \mid x_t] = E[y_t \mid x_t] - E[m(x_t) \mid x_t] = m(x_t) - m(x_t) = 0$$
2. By the law of iterated expectations and the first result,
$$E(u_t) = E[E(u_t \mid x_t)] = E(0) = 0$$
3. By essentially the same argument,
$$E[h(x_t)u_t] = E\left[E[h(x_t)u_t \mid x_t]\right] = E\left[h(x_t)E[u_t \mid x_t]\right] = E[h(x_t)\cdot 0] = 0$$
4. Follows from the third result, setting $h(x_t) = x_t$.
OLS
General Framework
(1) plus the first result of Proposition 1 give the regression framework:
$$y_t = m(x_t; \theta_3) + u_t, \qquad E(u_t \mid x_t) = 0$$
Important: a framework, not a model: it holds true by definition.
$m(\cdot)$ and $g(\cdot)$ can take any shape
If $m(\cdot)$ is linear: Linear Regression Model (LRM).
$$m(x_t; \theta_3) = x_t'\beta$$
$$\underset{T\times 1}{Y} = \begin{bmatrix} y_1 \\ \vdots \\ y_T \end{bmatrix}, \qquad \underset{T\times k}{X} = \begin{bmatrix} x_1' \\ \vdots \\ x_T' \end{bmatrix} = \begin{bmatrix} x_{1,1} & \cdots & x_{1,k} \\ \vdots & \ddots & \vdots \\ x_{T,1} & \cdots & x_{T,k} \end{bmatrix}, \qquad \underset{T\times 1}{u} = \begin{bmatrix} u_1 \\ \vdots \\ u_T \end{bmatrix}$$
OLS
Regression models
Definition 1 The Linear Regression Model (LRM) is:
1. $y_t = x_t'\beta + u_t$ or $Y = X\beta + u$
2. $E(u_t \mid x_t) = 0$
3. $\mathrm{rank}(X) = k$ or $\det(X'X) \ne 0$
4. $E(u_t u_s) = 0 \ \forall t \ne s$
Definition 2 The Homoskedastic Linear Regression Model (HLRM) is the LRM plus
5. $E(u_t^2 \mid x_t) = \sigma^2$ or $E(uu' \mid X) = \sigma^2 I_T$
Definition 3 The Normal Linear Regression Model (NLRM) is the LRM plus
6. $u_t \sim N(0, \sigma^2)$
OLS
Definition of the OLS Estimator
Define the sum of squares of the residuals (SSR) function as:
$$S_T(\beta) = (Y - X\beta)'(Y - X\beta) = Y'Y - 2Y'X\beta + \beta'X'X\beta$$
The OLS estimator ($b$) minimizes $S_T(\beta)$. FONC:
$$\left.\frac{\partial S_T(\beta)}{\partial\beta}\right|_b = -2X'Y + 2X'Xb = 0$$
which yields the normal equations $X'Y = X'Xb$.
Proposition 2 $b = (X'X)^{-1}(X'Y)$ is the $\arg\min S_T(\beta)$
Proof. Using the normal equations: $b = (X'X)^{-1}(X'Y)$. SOSC:
$$\left.\frac{\partial^2 S_T(\beta)}{\partial\beta\,\partial\beta'}\right|_b = 2X'X$$
then $b$ is a minimum as $X'X$ is a positive definite matrix.
Important implications:
– $b$ is a linear function of $Y$
– $b$ is a random variable (a function of $X$ and $Y$)
– $X'X$ must be of full rank
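In code, the normal equations translate directly. A minimal sketch (solving $X'Xb = X'Y$ rather than forming $(X'X)^{-1}$ explicitly is a standard numerical choice, not something the slides prescribe):

```python
import numpy as np

def ols(Y, X):
    """OLS estimator b from the normal equations X'X b = X'Y."""
    b = np.linalg.solve(X.T @ X, X.T @ Y)  # avoids explicit matrix inversion
    resid = Y - X @ b                      # least squares residuals u_hat
    return b, resid
```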
OLS
Interpretation
Define the least squares residuals
$$\hat{u} = Y - Xb \tag{2}$$
$$\hat{\sigma}^2 = T^{-1}\hat{u}'\hat{u}$$
$$Y = Xb + \hat{u} = PY + MY, \quad \text{where } P = X(X'X)^{-1}X' \text{ and } M = I - P$$
Proposition 3 Let $A$ be an $n \times r$ matrix of rank $r$. A matrix of the form $P = A(A'A)^{-1}A'$ is called a projection matrix and has the following properties:
i) $P = P' = P^2$ (hence $P$ is symmetric and idempotent)
ii) $\mathrm{rank}(P) = r$
iii) the characteristic roots (eigenvalues) of $P$ consist of $r$ ones and $n-r$ zeros
iv) if $Z = Ac$ for some vector $c$, then $PZ = Z$ (hence the word projection)
v) $M = I - P$ is also idempotent with rank $n-r$; its eigenvalues consist of $n-r$ ones and $r$ zeros, and if $Z = Ac$, then $MZ = 0$
vi) $P$ can be written as $G'G$, where $GG' = I$, or as $v_1v_1' + v_2v_2' + \cdots + v_rv_r'$ where $v_i$ is a vector and $r = \mathrm{rank}(P)$
OLS
Interpretation
$$Y = Xb + \hat{u} = PY + MY$$
Figure 10: Orthogonal decomposition of $Y$ into $PY$ (in $\mathrm{Col}(X)$) and $MY$ (orthogonal to $\mathrm{Col}(X)$)
OLS
The Mean of b
Proposition 4 In the LRM, $E[b - \beta \mid X] = 0$ and $Eb = \beta$
Proof. By the previous results,
$$b = (X'X)^{-1}X'Y = (X'X)^{-1}X'(X\beta + u) = \beta + (X'X)^{-1}X'u$$
Then
$$E[b - \beta \mid X] = E\left[(X'X)^{-1}X'u \mid X\right] = (X'X)^{-1}X'E(u \mid X) = 0$$
Applying the law of iterated expectations, $Eb = E\left[E(b \mid X)\right] = \beta$
OLS
The Variance of b
Proposition 5 In the HLRM, $V(b \mid X) = \sigma^2(X'X)^{-1}$ and $V(b) = \sigma^2 E\left[(X'X)^{-1}\right]$
Proof. Since $b - \beta = (X'X)^{-1}X'u$,
$$V(b \mid X) = E\left[(b-\beta)(b-\beta)' \mid X\right] = E\left[(X'X)^{-1}X'uu'X(X'X)^{-1} \mid X\right] = (X'X)^{-1}X'E[uu' \mid X]\,X(X'X)^{-1} = \sigma^2(X'X)^{-1}$$
Thus, $V(b) = E\left[V(b \mid X)\right] + V\left[E(b \mid X)\right] = \sigma^2 E\left[(X'X)^{-1}\right]$
Important features of $V(b \mid X) = \sigma^2(X'X)^{-1}$:
– grows proportionally with $\sigma^2$
– decreases with the sample size
– decreases with the volatility of $X$
OLS
The Mean and Variance of $\hat{\sigma}^2$
Proposition 6 In the LRM, $\hat{\sigma}^2$ is biased.
Proof. We know that $\hat{u} = MY$. It is trivial to verify that $\hat{u} = Mu$. Then, $\hat{\sigma}^2 = T^{-1}\hat{u}'\hat{u} = T^{-1}u'Mu$. This implies that
$$E(\hat{\sigma}^2 \mid X) = T^{-1}E[u'Mu \mid X] = T^{-1}E[\mathrm{tr}(u'Mu) \mid X] = T^{-1}E[\mathrm{tr}(Muu') \mid X] = T^{-1}\sigma^2\,\mathrm{tr}(M) = \sigma^2(T-k)\,T^{-1}$$
Applying the law of iterated expectations we obtain $E\hat{\sigma}^2 = \sigma^2(T-k)\,T^{-1}$
Unbiased estimator: $\tilde{\sigma}^2 = (T-k)^{-1}\hat{u}'\hat{u}$.
Proposition 7 In the NLRM, $V(\hat{\sigma}^2) = T^{-2}\,2(T-k)\,\sigma^4$
Important:
– with the exception of Proposition 7, normality is not required
– $\hat{\sigma}^2$ is biased, but it is the MLE under normality and is consistent
– the variances of $b$ and $\hat{\sigma}^2$ depend on $\sigma^2$; $\hat{V}(b) = \tilde{\sigma}^2(X'X)^{-1}$
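A small simulation can illustrate Proposition 6. The sketch below (all parameter values are illustrative choices of ours) checks that the average of $\hat{\sigma}^2$ across replications is close to $\sigma^2(T-k)/T$, not $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, k, sigma2, reps = 50, 3, 2.0, 10_000
X = rng.normal(size=(T, k))                  # regressors held fixed across draws
draws = np.empty(reps)
for j in range(reps):
    u = rng.normal(scale=np.sqrt(sigma2), size=T)
    Y = X @ np.ones(k) + u                   # true beta = (1,...,1), arbitrary
    b = np.linalg.solve(X.T @ X, X.T @ Y)
    uhat = Y - X @ b
    draws[j] = uhat @ uhat / T               # biased estimator sigma_hat^2
print(draws.mean(), sigma2 * (T - k) / T)    # the two numbers should be close
```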
OLS
$b$ is BLUE
Theorem 1 (Gauss-Markov) $b$ is BLUE
Proof. Let $A = (X'X)^{-1}X'$, so $b = AY$. Consider any other linear estimator $b^* = (A + C)Y$. Then,
$$E(b^* \mid X) = (X'X)^{-1}X'X\beta + CX\beta = (I + CX)\beta$$
For $b^*$ to be unbiased we require $CX = 0$; then:
$$V(b^* \mid X) = E\left[(A + C)uu'(A + C)' \mid X\right]$$
As $(A + C)(A + C)' = (X'X)^{-1} + CC'$, we obtain
$$V(b^* \mid X) = V(b \mid X) + \sigma^2 CC'$$
As $CC'$ is p.s.d. we have $V(b^* \mid X) \ge V(b \mid X)$
Despite its popularity, Gauss-Markov is not very powerful:
– it restricts the quest to linear and unbiased estimators
– there may be a "nonlinear" or biased estimator that does better (lower MSE)
– OLS is not BLUE when homoskedasticity is relaxed
OLS
Asymptotics I
Unbiasedness is not that useful in practice (frequentist perspective)
It is also not common in more general contexts
Asymptotic theory: properties of estimators when the sample size is infinitely large
Cornerstones: LLN (consistency) and CLT (inference)
Definition 4 (Convergence in probability) A sequence of real or vector valued random variables $\{x_t\}$ is said to converge to $x$ in probability if
$$\lim_{T\to\infty} \Pr(\|x_T - x\| > \varepsilon) = 0 \text{ for any } \varepsilon > 0$$
We write $x_T \xrightarrow{p} x$ or $\operatorname{plim} x_T = x$.
Definition 5 (Convergence in mean square) $\{x_t\}$ converges to $x$ in mean square if
$$\lim_{T\to\infty} E(x_T - x)^2 = 0$$
We write $x_T \xrightarrow{M} x$.
Definition 6 (Almost sure convergence) $\{x_t\}$ converges to $x$ almost surely if
$$\Pr\left[\lim_{T\to\infty} x_T = x\right] = 1$$
We write $x_T \xrightarrow{a.s.} x$.
Definition 7 The estimator $\hat{\theta}_T$ of $\theta_0$ is said to be a weakly consistent estimator if $\hat{\theta}_T \xrightarrow{p} \theta_0$.
Definition 8 The estimator $\hat{\theta}_T$ of $\theta_0$ is said to be a strongly consistent estimator if $\hat{\theta}_T \xrightarrow{a.s.} \theta_0$.
OLS
Laws of Large Numbers and Consistency of b
Theorem 2 (WLLN1, Chebyshev) Let $E(x_t) = \mu_t$, $V(x_t) = \sigma_t^2$, $\mathrm{Cov}(x_i, x_j) = 0\ \forall i \ne j$. If $\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^T \sigma_t^2 \le M < \infty$, then
$$\bar{x}_T - \bar{\mu}_T \xrightarrow{p} 0$$
Theorem 3 (SLLN1, Kolmogorov) Let $\{x_t\}$ be independent with finite variance $V(x_t) = \sigma_t^2 < \infty$. If $\sum_{t=1}^\infty \frac{\sigma_t^2}{t^2} < \infty$, then
$$\bar{x}_T - \bar{\mu}_T \xrightarrow{a.s.} 0$$
Assume that $T^{-1}X'X \to Q$ (invertible and nonstochastic). Then
$$b - \beta = (X'X)^{-1}X'u = \left(T^{-1}X'X\right)^{-1}\left(T^{-1}X'u\right) \xrightarrow{p} 0$$
$b$ is consistent: $b \xrightarrow{p} \beta$
OLS
Analysis of Variance (ANOVA)
$$Y = \hat{Y} + \hat{u}$$
$$Y - \bar{Y} = \hat{Y} - \bar{Y} + \hat{u}$$
$$(Y - \bar{Y})'(Y - \bar{Y}) = (\hat{Y} - \bar{Y})'(\hat{Y} - \bar{Y}) + 2(\hat{Y} - \bar{Y})'\hat{u} + \hat{u}'\hat{u}$$
but $\hat{Y}'\hat{u} = Y'PMY = 0$ and $\bar{Y}'\hat{u} = \bar{Y}\iota'\hat{u} = 0$. Thus
$$(Y - \bar{Y})'(Y - \bar{Y}) = (\hat{Y} - \bar{Y})'(\hat{Y} - \bar{Y}) + \hat{u}'\hat{u}$$
This is called the ANOVA formula, often written as
$$TSS = ESS + SSR$$
$$R^2 = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS} = 1 - \frac{Y'MY}{Y'LY}$$
$L = I_T - T^{-1}\iota\iota'$. If the regressors include a constant, $0 \le R^2 \le 1$.
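A minimal sketch of the $R^2$ computation (it assumes $X$ contains a column of ones, which the bound $0 \le R^2 \le 1$ requires):

```python
import numpy as np

def r_squared(Y, X):
    """R^2 = 1 - SSR/TSS; assumes X includes a constant column."""
    b = np.linalg.solve(X.T @ X, X.T @ Y)
    ssr = ((Y - X @ b) ** 2).sum()        # SSR = u_hat' u_hat
    tss = ((Y - Y.mean()) ** 2).sum()     # TSS around the sample mean
    return 1 - ssr / tss
```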
OLS
Analysis of Variance (ANOVA)
Measures the percentage of the variance of $Y$ accounted for by the variation of $\hat{Y}$
Not a "measure" of "goodness" of fit
It doesn't explain anything
It is not even clear whether $R^2$ has an interpretation in terms of forecast performance
Model 1: $y_t = \beta x_t + u_t$. Model 2: $y_t - x_t = \gamma x_t + u_t$ with $\gamma = \beta - 1$
Mathematically identical; they yield the same implications and forecasts
Yet the reported $R^2$ will differ greatly
Suppose $\beta \simeq 1$. Second model: $R^2 \simeq 0$; the first model's $R^2$ can be arbitrarily close to one
$R^2$ increases as regressors are added. Theil proposed:
$$\bar{R}^2 = 1 - \frac{SSR}{TSS}\cdot\frac{T}{T-k} = 1 - \frac{\tilde{\sigma}^2}{\hat{\sigma}_y^2}$$
Not used that much today, as better model evaluation criteria have been developed
OLS
OLS Estimator of a Subset of $\beta$
Partition $X = \begin{bmatrix} X_1 & X_2 \end{bmatrix}$, $\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}$
Then $X'Xb = X'Y$ can be written as:
$$X_1'X_1 b_1 + X_1'X_2 b_2 = X_1'Y \tag{3a}$$
$$X_2'X_1 b_1 + X_2'X_2 b_2 = X_2'Y \tag{3b}$$
Solving for $b_2$ and reinserting in (3a) we obtain
$$b_1 = (X_1'M_2X_1)^{-1}X_1'M_2Y$$
$$b_2 = (X_2'M_1X_2)^{-1}X_2'M_1Y$$
where $M_i = I - P_i = I - X_i(X_i'X_i)^{-1}X_i'$ (for $i = 1, 2$).
Theorem 4 (Frisch-Waugh-Lovell) $b_2$ and $\hat{u}$ can be computed using the following algorithm:
1. Regress $Y$ on $X_1$; obtain residuals $\tilde{Y}$
2. Regress $X_2$ on $X_1$; obtain residuals $\tilde{X}_2$
3. Regress $\tilde{Y}$ on $\tilde{X}_2$; obtain $b_2$ and residuals $\hat{u}$
FWL was originally used to speed computation
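The FWL algorithm is easy to verify numerically. A sketch (random data; all names are ours) comparing the $b_2$ block of the full regression with the three-step construction:

```python
import numpy as np

def ols_b(Y, X):
    return np.linalg.solve(X.T @ X, X.T @ Y)

def fwl_b2(Y, X1, X2):
    """b2 via FWL: partial X1 out of Y and X2, then regress residuals."""
    P1 = X1 @ np.linalg.solve(X1.T @ X1, X1.T)   # projection onto Col(X1)
    return ols_b(Y - P1 @ Y, X2 - P1 @ X2)

rng = np.random.default_rng(1)
X1 = np.column_stack([np.ones(100), rng.normal(size=100)])
X2 = rng.normal(size=(100, 2))
Y = rng.normal(size=100)
full = ols_b(Y, np.column_stack([X1, X2]))
print(np.allclose(full[2:], fwl_b2(Y, X1, X2)))  # True: the two coincide
```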
OLS
Application of FWL: Demeaning
Partition $X = \begin{bmatrix} X_1 & X_2 \end{bmatrix}$ where $X_1 = \iota$ and $X_2$ is the matrix of observed regressors
$$\tilde{X}_2 = M_1X_2 = X_2 - \iota(\iota'\iota)^{-1}\iota'X_2 = X_2 - \bar{X}_2$$
$$\tilde{Y} = M_1Y = Y - \iota(\iota'\iota)^{-1}\iota'Y = Y - \bar{Y}$$
FWL states that $b_2$ is the OLS estimate from the regression of $\tilde{Y}$ on $\tilde{X}_2$:
$$b_2 = \left(\sum_{t=1}^T \tilde{x}_{2t}\tilde{x}_{2t}'\right)^{-1}\left(\sum_{t=1}^T \tilde{x}_{2t}\tilde{y}_t\right)$$
Thus the OLS estimator for the slope coefficients is a regression with demeaned data.
OLS
Constrained Least Squares (CLS)
Assume the following constraint must hold:
$$Q'\beta = c \tag{4}$$
$Q$ ($k \times q$ matrix of known constants), $c$ ($q$-vector of known constants), $q < k$, $\mathrm{rank}(Q) = q$.
The CLS estimator ($b^*$) is the value of $\beta$ that minimizes the SSR subject to (4).
$$L(\beta, \lambda) = (Y - X\beta)'(Y - X\beta) + 2\lambda'(Q'\beta - c)$$
$\lambda$ is a $q$-vector of Lagrange multipliers. FONC:
$$\left.\frac{\partial L}{\partial\beta}\right|_{b^*,\lambda^*} = -2X'Y + 2X'Xb^* + 2Q\lambda^* = 0$$
$$\left.\frac{\partial L}{\partial\lambda}\right|_{b^*,\lambda^*} = Q'b^* - c = 0$$
$$b^* = b - (X'X)^{-1}Q\left[Q'(X'X)^{-1}Q\right]^{-1}\left(Q'b - c\right) \tag{5}$$
$$\sigma^{*2} = T^{-1}\left(Y - Xb^*\right)'\left(Y - Xb^*\right)$$
$b^*$ is BLUE
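A direct implementation of (5); a sketch assuming $Q$ is $k \times q$ and $c$ is a $q$-vector (the function name `cls` is ours):

```python
import numpy as np

def cls(Y, X, Q, c):
    """Constrained LS: b* = b - (X'X)^{-1} Q [Q'(X'X)^{-1} Q]^{-1} (Q'b - c)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ (X.T @ Y)                  # unconstrained OLS estimate
    correction = XtX_inv @ Q @ np.linalg.solve(Q.T @ XtX_inv @ Q, Q.T @ b - c)
    return b - correction
```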
OLS
Inference
Up to now, the properties of the estimators did not depend on the distribution of $u$
Consider the NLRM with $u_t \sim N(0, \sigma^2)$. Then:
$$y_t \mid x_t \sim N\left(x_t'\beta,\ \sigma^2\right)$$
On the other hand, as $b = (X'X)^{-1}X'Y$, then:
$$b \mid X \sim N\left(\beta,\ \sigma^2(X'X)^{-1}\right)$$
However, as $b \xrightarrow{p} \beta$, it also converges in distribution to a degenerate distribution
Thus, we require something more to conduct inference
Next, we discuss finite (exact) and large sample distributions of estimators to test hypotheses
Components:
– Null hypothesis $H_0$
– Alternative hypothesis $H_1$
– Test statistic (one tail, two tails)
– Rejection region
– Conclusion
OLS
Inference with Linear Constraints (normality)
$$H_0: Q'\beta = c \qquad H_1: Q'\beta \ne c$$
The t Test
$q = 1$. Assume $u$ is normal; under the null hypothesis:
$$Q'b \sim N\left[c,\ \sigma^2 Q'(X'X)^{-1}Q\right]$$
$$\frac{Q'b - c}{\left[\sigma^2 Q'(X'X)^{-1}Q\right]^{1/2}} \sim N(0, 1) \tag{6}$$
This test statistic is used when $\sigma$ is known. If not, recall
$$\frac{\hat{u}'\hat{u}}{\sigma^2} \sim \chi^2_{T-k} \tag{7}$$
As (6) and (7) are independent, hence:
$$t_T = \frac{Q'b - c}{\left[\tilde{\sigma}^2 Q'(X'X)^{-1}Q\right]^{1/2}} \sim S_{T-k}$$
(6) holds (asymptotically) even when normality of $u$ is not present.
If $H_0: \beta_1 = 0$, define $Q = \begin{pmatrix} 1 & 0 & \cdots & 0 \end{pmatrix}'$, $c = 0$:
$$t_T = \frac{b_1}{\sqrt{\hat{V}_{1,1}}}$$
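A sketch of the $t_T$ statistic for a single coefficient (it uses scipy for the Student's t p-value; the signature and defaults are ours):

```python
import numpy as np
from scipy import stats

def t_test(Y, X, j=0, value=0.0):
    """t statistic for H0: beta_j = value, using sigma~^2 = u'u/(T-k)."""
    T, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ (X.T @ Y)
    uhat = Y - X @ b
    s2 = uhat @ uhat / (T - k)                   # unbiased variance estimator
    t = (b[j] - value) / np.sqrt(s2 * XtX_inv[j, j])
    p = 2 * (1 - stats.t.cdf(abs(t), df=T - k))  # two-sided p-value
    return t, p
```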
OLS
Inference with Linear Constraints (normality)
Confidence interval:
$$\Pr\left(b_i - z_{\alpha/2}\sqrt{\hat{V}_{i,i}} < \beta_i < b_i + z_{\alpha/2}\sqrt{\hat{V}_{i,i}}\right) = 1 - \alpha$$
Tail probability, or probability value (p-value), function:
$$p_T = p(t_T) = \Pr(|Z| \ge |t_T|) = 2\left(1 - \Phi(|t_T|)\right)$$
Reject the null when the p-value is less than or equal to $\alpha$
Confidence interval for $\sigma$:
$$\Pr\left[\frac{(T-k)\,\tilde{\sigma}^2}{\chi^2_{T-k,1-\alpha/2}} < \sigma^2 < \frac{(T-k)\,\tilde{\sigma}^2}{\chi^2_{T-k,\alpha/2}}\right] = 1 - \alpha \tag{8}$$
OLS
The F Test (normality)
$q > 1$. Under the null:
$$\frac{S_T^* - S_T(b)}{\sigma^2} \sim \chi^2_q$$
where $S_T^*$ is the SSR under the constraint. When $\sigma^2$ is not known, replace $\sigma^2$ with $\tilde{\sigma}^2$ and obtain
$$\frac{S_T^* - S_T(b)}{q\,\tilde{\sigma}^2} = \frac{T-k}{q}\cdot\frac{\left(Q'b - c\right)'\left[Q'(X'X)^{-1}Q\right]^{-1}\left(Q'b - c\right)}{\hat{u}'\hat{u}} \sim F_{q,T-k} \tag{9}$$
As with t tests, reject the null when the value computed exceeds the critical value
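A sketch of the F statistic in (9) for $q$ joint constraints ($Q$ is $k \times q$; all names are ours):

```python
import numpy as np
from scipy import stats

def f_test(Y, X, Q, c):
    """F statistic (9) for H0: Q'beta = c with q > 1 constraints."""
    T, k = X.shape
    q = Q.shape[1]
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ (X.T @ Y)
    uhat = Y - X @ b
    d = Q.T @ b - c                                   # constraint violation
    num = d @ np.linalg.solve(Q.T @ XtX_inv @ Q, d) / q
    den = (uhat @ uhat) / (T - k)                     # = sigma~^2
    F = num / den
    return F, 1 - stats.f.cdf(F, q, T - k)            # statistic and p-value
```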
OLS
Asymptotics II
How to conduct inference when $u$ is not necessarily normal?
Figure 11: Convergence in distribution
OLS
CLT
Definition 9 (Convergence in distribution) $\{x_t\}$ is said to converge to $x$ in distribution if the distribution function $F_T$ of $x_T$ converges to the distribution $F$ of $x$ at every continuity point of $F$. We write $x_T \xrightarrow{D} x$ and we call $F$ the limiting distribution of $\{x_t\}$. If $\{x_t\}$ and $\{y_t\}$ have the same limiting distribution, we write $x_T \overset{LD}{=} y_T$.
Theorem 5 (CLT1, Lindeberg-Lévy) Let $\{x_t\}$ be i.i.d. with $Ex_t = \mu$ and $Vx_t = \sigma^2$. Then
$$Z_T = \frac{\bar{x}_T - \mu}{\left[V\bar{x}_T\right]^{1/2}} = \frac{\sqrt{T}\left(\bar{x}_T - \mu\right)}{\sigma} \xrightarrow{D} N(0, 1)$$
Assume that $T^{-1}X'X \to Q$ (invertible and nonstochastic) and that $T^{-1/2}X'u \xrightarrow{D} N\left(0, \sigma^2 Q\right)$:
$$\sqrt{T}\left(b - \beta\right) = \left(T^{-1}X'X\right)^{-1}\,T^{-1/2}X'u \xrightarrow{D} N\left(0, \sigma^2 Q^{-1}\right)$$
Thus, under the HLRM, the asymptotic distribution does not depend on the distribution of $u$
Normal vs. t test / $\chi^2$ vs. F test
OLS
Tests for Structural Breaks
Suppose we have a two-regime regression:
$$Y_1 = X_1\beta_1 + u_1$$
$$Y_2 = X_2\beta_2 + u_2$$
$$E\begin{bmatrix} u_1 \\ u_2 \end{bmatrix}\begin{bmatrix} u_1' & u_2' \end{bmatrix} = \begin{bmatrix} \sigma_1^2 I_{T_1} & 0 \\ 0 & \sigma_2^2 I_{T_2} \end{bmatrix}$$
$$H_0: \beta_1 = \beta_2$$
Assume $\sigma_1 = \sigma_2$. Define
$$Y = X\beta + u$$
$$Y = \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix}, \quad X = \begin{bmatrix} X_1 & 0 \\ 0 & X_2 \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}, \quad u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}$$
Applying (9) we obtain:
$$\frac{T_1 + T_2 - 2k}{k}\cdot\frac{\left(b_1 - b_2\right)'\left[(X_1'X_1)^{-1} + (X_2'X_2)^{-1}\right]^{-1}\left(b_1 - b_2\right)}{Y'\left[I - X(X'X)^{-1}X'\right]Y} \sim F_{k,\,T_1+T_2-2k} \tag{10}$$
where $b_1 = (X_1'X_1)^{-1}X_1'Y_1$ and $b_2 = (X_2'X_2)^{-1}X_2'Y_2$.
OLS
Tests for Structural Breaks
The same result can be derived as follows. Define the SSR under the alternative (structural change)
$$S_T(b) = Y'\left[I - X(X'X)^{-1}X'\right]Y$$
and the SSR under the null hypothesis (one regime, with the regressors stacked into $\bar{X} = [X_1'\ \ X_2']'$)
$$S_T^* = Y'\left[I - \bar{X}(\bar{X}'\bar{X})^{-1}\bar{X}'\right]Y$$
$$\frac{T_1 + T_2 - 2k}{k}\cdot\frac{S_T^* - S_T(b)}{S_T(b)} \sim F_{k,\,T_1+T_2-2k} \tag{11}$$
An unbiased estimate of $\sigma^2$ is
$$\tilde{\sigma}^2 = \frac{S_T(b)}{T_1 + T_2 - 2k}$$
Chow tests are popular, but modern practice is skeptical. Recent theoretical and empirical applications treat the period of a possible break as an endogenous latent variable.
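The SSR form (11) gives a compact implementation of the Chow test. A sketch (it assumes equal error variances across regimes, as the derivation does; names are ours):

```python
import numpy as np
from scipy import stats

def chow_test(Y1, X1, Y2, X2):
    """Chow F test of H0: beta_1 = beta_2 via formula (11)."""
    def ssr(Y, X):
        b = np.linalg.solve(X.T @ X, X.T @ Y)
        u = Y - X @ b
        return u @ u
    T1, k = X1.shape
    T2 = X2.shape[0]
    s_null = ssr(np.concatenate([Y1, Y2]), np.vstack([X1, X2]))  # one regime
    s_alt = ssr(Y1, X1) + ssr(Y2, X2)                            # break allowed
    F = ((s_null - s_alt) / k) / (s_alt / (T1 + T2 - 2 * k))
    return F, 1 - stats.f.cdf(F, k, T1 + T2 - 2 * k)
```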
OLS
Prediction
Out-of-sample prediction of $y_p$ (for $p > T$) is not easy. In that period: $y_p = x_p'\beta + u_p$
Types of uncertainty:
– Unpredictable component
– Parameter uncertainty
– Uncertainty about $x_p$
– Specification uncertainty
Types of forecasts:
– Point forecast
– Interval forecast
– Density forecast
Active area of research
OLS
Prediction
If the HLRM holds, the predictor that minimizes the MSE is $\hat{y}_p = x_p'b_T$
Given $x_p$, the mean squared prediction error is
$$E\left[\left(\hat{y}_p - y_p\right)^2 \mid x_p\right] = \sigma^2\left[1 + x_p'(X'X)^{-1}x_p\right]$$
To construct an estimator of the variance of the forecast error, substitute $\tilde{\sigma}^2$ for $\sigma^2$
You may think that a confidence interval forecast could be formulated as:
$$\Pr\left(\hat{y}_p - z_{\alpha/2}\sqrt{\hat{V}_{\hat{y}_p}} < y_p < \hat{y}_p + z_{\alpha/2}\sqrt{\hat{V}_{\hat{y}_p}}\right) = 1 - \alpha$$
WRONG. Notice that
$$\frac{y_p - \hat{y}_p}{\sqrt{\sigma^2\left[1 + x_p'(X'X)^{-1}x_p\right]}} = \frac{u_p + x_p'\left(\beta - b\right)}{\sqrt{\sigma^2\left[1 + x_p'(X'X)^{-1}x_p\right]}}$$
This relation does not have a discernible limiting distribution (unless $u$ is normal). We didn't need to impose normality for all the previous results (at least asymptotically).
We assumed that the econometrician knew $x_p$. If $x$ is stochastic and not known at $T$, the MSE could be seriously underestimated.
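A sketch of the point forecast and its estimated standard error, substituting $\tilde{\sigma}^2$ for $\sigma^2$ as described above (interval validity still requires normal $u$):

```python
import numpy as np

def forecast_with_se(Y, X, x_p):
    """Point forecast x_p'b and estimated forecast-error std. deviation."""
    T, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ (X.T @ Y)
    uhat = Y - X @ b
    s2 = uhat @ uhat / (T - k)                   # sigma~^2
    var_fe = s2 * (1 + x_p @ XtX_inv @ x_p)      # estimated forecast MSE
    return x_p @ b, np.sqrt(var_fe)
```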
OLS
Measures of predictive accuracy of forecasting models
$$RMSE = \sqrt{\frac{1}{P}\sum_{p=1}^P \left(y_p - \hat{y}_p\right)^2}$$
$$MAE = \frac{1}{P}\sum_{p=1}^P \left|y_p - \hat{y}_p\right|$$
Theil U statistic:
$$U = \sqrt{\frac{\sum_{p=1}^P \left(y_p - \hat{y}_p\right)^2}{\sum_{p=1}^P y_p^2}}$$
$$U = \sqrt{\frac{\sum_{p=1}^P \left(\Delta y_p - \Delta\hat{y}_p\right)^2}{\sum_{p=1}^P \left(\Delta y_p\right)^2}}$$
$\Delta y_p = y_p - y_{p-1}$ and $\Delta\hat{y}_p = \hat{y}_p - y_{p-1}$, or, in percentage changes,
$$\Delta y_p = \frac{y_p - y_{p-1}}{y_{p-1}} \quad\text{and}\quad \Delta\hat{y}_p = \frac{\hat{y}_p - y_{p-1}}{y_{p-1}}$$
These measures reflect the model's ability to track turning points in the data
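These measures are one-liners in code. A sketch (`y` and `y_hat` are aligned arrays of realized values and forecasts; names are ours):

```python
import numpy as np

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def theil_u_changes(y, y_hat):
    """Theil U on changes: compares forecast changes with realized changes."""
    dy = y[1:] - y[:-1]            # realized changes
    dy_hat = y_hat[1:] - y[:-1]    # forecast changes from the last realized value
    return np.sqrt(np.sum((dy - dy_hat) ** 2) / np.sum(dy ** 2))
```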
OLS
Evaluation
When comparing two models, is one model really better than the other?
Diebold-Mariano: framework for comparing models
$$d_p = g\left(\hat{u}_{i,p}\right) - g\left(\hat{u}_{j,p}\right), \qquad DM = \frac{\bar{d}}{\sqrt{\hat{V}\left(\bar{d}\right)}} \xrightarrow{D} N(0, 1)$$
Harvey, Leybourne, and Newbold (HLN): correct size distortions and use Student's t
$$HLN = DM\left[\frac{P + 1 - 2h + h(h-1)/P}{P}\right]^{1/2}$$
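A minimal Diebold-Mariano sketch. For simplicity it assumes the loss differential is serially uncorrelated (reasonable for 1-step forecasts); multi-step applications need a long-run variance estimate for $\hat{V}(\bar{d})$ and the HLN correction above:

```python
import numpy as np

def diebold_mariano(u_i, u_j, g=np.square):
    """DM statistic for equal predictive accuracy of two forecast error series."""
    d = g(u_i) - g(u_j)                          # loss differential d_p
    P = len(d)
    return d.mean() / np.sqrt(d.var(ddof=1) / P) # compare with N(0, 1)
```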
OLS
Finite Samples
Statistical properties of most methods are known only asymptotically
"Exact" finite sample theory can rarely be used to interpret estimates or test statistics
Are the theoretical properties reasonably good approximations for the problem at hand?
How to proceed in these cases?
Monte Carlo experiments and the bootstrap
OLS
Monte Carlo Experiments
Often used to analyze finite sample properties of estimators or test statistics
Quantities are approximated by generating many pseudo-random realizations of a stochastic process and averaging them
– Model and the estimators or tests associated with the model. Objective: assess small sample properties.
– DGP: a special case of the model. Specify "true" values of parameters, laws of motion of variables, and distributions of random variables.
– Experiment: replications or samples ($J$), generating artificial samples of data according to the DGP and calculating the estimates or test statistics of interest
– After $J$ replications, we have an equal number of estimates, which are subjected to statistical analysis
– Experiments may be performed by changing the sample size, values of parameters, etc. Response surfaces.
Monte Carlo experiments are random. It is essential to perform enough replications so that the results are sufficiently accurate. Critical values, etc.
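A minimal Monte Carlo sketch in this spirit (all settings are illustrative choices of ours): the empirical size of the t test for $H_0: \beta_2 = 0$ when the null is true should be close to the nominal 5%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
T, J, alpha = 30, 5000, 0.05
crit = stats.t.ppf(1 - alpha / 2, df=T - 2)      # two-sided critical value
rejections = 0
for j in range(J):
    x = rng.normal(size=T)
    y = 1.0 + rng.normal(size=T)                 # DGP: beta_1 = 1, beta_2 = 0
    X = np.column_stack([np.ones(T), x])
    b = np.linalg.solve(X.T @ X, X.T @ y)
    u = y - X @ b
    s2 = u @ u / (T - 2)
    t = b[1] / np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    rejections += abs(t) > crit
print(rejections / J)                            # should be near 0.05
```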
OLS
Bootstrap Resampling
The bootstrap views the observed sample as a population
The distribution function for this population is the EDF of the sample, and parameter estimates based on the observed sample are treated as the actual model parameters
Conceptually: examine properties of estimators or test statistics in repeated samples drawn from a tangible data-sampling process that mimics the actual DGP
The bootstrap does not deliver the exact finite sample properties of estimators and test statistics under the actual DGP, but it provides an approximation that improves as the size of the observed sample increases
Reasons for acceptance in recent years:
– avoids most of the strong distributional assumptions required in Monte Carlo
– like Monte Carlo, it may be used to solve intractable estimation and inference problems by computation rather than reliance on asymptotic approximations, which may be very complicated in nonstandard problems
– bootstrap approximations are often equivalent to first-order asymptotic results, and may dominate them in some cases
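A pairs-bootstrap sketch for OLS standard errors, resampling $(y_t, x_t)$ pairs from the EDF of the observed sample (function name and defaults are ours):

```python
import numpy as np

def pairs_bootstrap_se(Y, X, B=1000, seed=0):
    """Bootstrap standard errors for OLS by resampling observation pairs."""
    rng = np.random.default_rng(seed)
    T = len(Y)
    draws = np.empty((B, X.shape[1]))
    for i in range(B):
        idx = rng.integers(0, T, size=T)         # T rows, with replacement
        Xb, Yb = X[idx], Y[idx]
        draws[i] = np.linalg.solve(Xb.T @ Xb, Xb.T @ Yb)
    return draws.std(axis=0, ddof=1)             # spread of bootstrap estimates
```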