This 10-hour class is intended to give students the basis to solve statistical problems empirically. Talk 1 serves as an introduction to the statistical software R and presents how to compute basic measures such as the mean, variance, correlation and Gini index. Talk 2 shows how the central limit theorem and the law of large numbers work empirically. Talk 3 presents point estimates, confidence intervals and hypothesis tests for the most important parameters. Talk 4 introduces the linear regression model and Talk 5 the bootstrap world. Talk 5 also presents an easy example of a Markov chain.
All the talks are supported by scripts written in the R language.
1. Statistics Lab
Rodolfo Metulini
IMT Institute for Advanced Studies, Lucca, Italy
Lesson 4 - The linear Regression Model: Theory and
Application - 23.01.2015
2. Introduction
In the past lessons we analyzed one variable at a time.
For many reasons, it is often useful to analyze two or more variables
together.
The questions we want to answer are:
What are the relations and the causal effects between two or
more variables?
What are the determinants of the changes in a variable?
How can we forecast or predict a variable for an unknown n or t?
In symbols, the idea can be represented as follows:
Y = f (X1, X2, ...)
Y is the response, which is a function of (it depends on) one or more
variables.
3. Objectives
All in all, the regression model is the instrument used to:
measure the strength of the relation between two or more
variables: Y / X,
and to assess the causal direction (X −→ Y or
vice versa?),
forecast the value of the variable Y in response to some
changes in the others X1, X2, ... (called explanatories),
or for some cases that are not considered in the sample.
4. Simple linear regression model
The regression model is a stochastic model, which differs from a
deterministic one.
Given two sets of values (two variables) from a random sample of
length n: x = {x1, x2, ..., xi, ..., xn}; y = {y1, y2, ..., yi, ..., yn}:
Deterministic formula:
yi = β1 + β2xi, ∀i = 1, ..., n
Stochastic formula:
yi = β1 + β2xi + εi, ∀i = 1, ..., n
where εi is the stochastic component.
β2 defines the slope of the relation between X and Y (see graph in
chart 1).
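As an illustration (ours, not the slides'), the stochastic formula can be simulated in R; the sample size, the β values and the error standard deviation below are arbitrary choices:

```r
# Simulate the stochastic model y_i = beta1 + beta2*x_i + eps_i
set.seed(1)                         # for reproducibility
n     <- 100
beta1 <- 2                          # true intercept (arbitrary choice)
beta2 <- 0.5                        # true slope (arbitrary choice)
x     <- runif(n, 0, 10)
eps   <- rnorm(n, mean = 0, sd = 1) # stochastic component eps_i
y     <- beta1 + beta2 * x + eps
plot(x, y, main = "Simulated simple linear model")
abline(a = beta1, b = beta2, col = "red")  # the true regression line
```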
5. Simple linear regression model - 2
We need to find β̂ = {β̂1, β̂2} as estimators of β1 and β2.
After β is estimated, we can draw the estimated regression line,
which corresponds to the estimated regression model, as
follows:
ŷi = β̂1 + β̂2xi
Here, ε̂i = yi − ŷi,
where ŷi is the i-th element of the estimated y vector, and yi is the
i-th element of the real y vector (see graph in chart 2).
6. Empirical Steps in the Regression Analysis
1. Study of the relations (scatter plot, correlations) between two
or more variables.
2. Estimation of the parameters of the model β̂ = {β̂1, β̂2}.
3. Hypothesis tests on the estimated β̂2 to verify the causal
effect of X on Y.
4. Robustness checks on the estimated model.
5. Use of the model to analyse the causal effect and/or to make
forecasts/predictions.
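The five steps can be sketched in R; the built-in `cars` data set (speed and stopping distance of 50 cars) is our illustrative choice, not the slides':

```r
data(cars)                                # built-in example data
plot(cars$speed, cars$dist)               # 1. scatter plot ...
cor(cars$speed, cars$dist)                #    ... and correlation
fit <- lm(dist ~ speed, data = cars)      # 2. estimate beta1, beta2
summary(fit)                              # 3. hypothesis test on the slope
plot(fitted(fit), resid(fit))             # 4. residuals vs fitted values
predict(fit, newdata = data.frame(speed = 21))  # 5. prediction
```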
7. Why linear?
It is simple to estimate, to analyse and to interpret.
It fits many empirical cases, in which the
relation between two phenomena is linear (NOT REALLY
SURE OF IT!)1.
1
The real, complex world is not linear in its relations: logit, probit, mixed
models and generalized additive models (GAM) are only some examples of the more
advanced non-linear models you will study in econometrics classes.
8. Model Hypothesis
In order for the OLS estimation of the model to be unbiased, certain
hypotheses must hold:
Zero mean: E(εi) = 0, ∀i −→ E(yi) = β1 + β2xi
Homoscedasticity: V(εi) = σ²i = σ², ∀i
Null covariance: Cov(εi, εj) = 0, ∀i ≠ j
Null covariance between residuals and explanatories:
Cov(xi, εi) = 0, ∀i
Normality assumption: εi ∼ N(0, σ²)
9. Model Hypothesis - 2
From the hypotheses above, it follows that:
V(yi) = σ², ∀i. Y is stochastic only through the ε component.
Cov(yi, yj) = 0, ∀i ≠ j, since the residuals are uncorrelated.
yi ∼ N(β1 + β2xi, σ²), since the residuals are also normal in
shape.
10. Ordinary Least Squares (OLS) Estimation
The OLS is the estimation method used to estimate the vector β.
The method comes from the idea of minimizing the values of the
residuals.
Since ei (= ε̂i) = yi − ŷi, we are interested in minimizing the component
ei = yi − β̂1 − β̂2xi.
N.B. εi = yi − β1 − β2xi, while ei = yi − β̂1 − β̂2xi.
The method consists in minimizing the sum of the squared
differences:
Σi (yi − ŷi)² = Σi ei² = min,
which is equal to solving the following two-equation system, derived
using derivatives.
11. Ordinary Least Squares (OLS) Estimation - 2
δ/δβ1 Σi ei² = 0 (1)
δ/δβ2 Σi ei² = 0 (2)
After some maths, we end up with these estimators for the vector
β̂:
β̂1 = ȳ − β̂2x̄ (3)
β̂2 = Σi (yi − ȳ)(xi − x̄) / Σi (xi − x̄)² (4)
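Formulas (3) and (4) can be checked in R against the built-in `lm()` estimator; the simulated data below are our illustrative choice:

```r
set.seed(2)
x <- runif(50)
y <- 1 + 2 * x + rnorm(50, sd = 0.3)   # true beta1 = 1, beta2 = 2
# Formula (4): slope
b2_hat <- sum((y - mean(y)) * (x - mean(x))) / sum((x - mean(x))^2)
# Formula (3): intercept
b1_hat <- mean(y) - b2_hat * mean(x)
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b1_hat, b2_hat))  # TRUE
```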
12. OLS estimators
OLS β̂1 and β̂2 are stochastic estimators (they are part of a
distribution: they belong to the sample space of all the
possible estimators defined with different samples).
β̂2 measures the estimated variation in Y determined by a
unitary variation in X (δY/δX).
The OLS estimators are both unbiased (E(β̂1) = β1 and
E(β̂2) = β2),
and they are BLUE (best linear unbiased estimators: correct and with the
lowest variance; furthermore, they are constructed on the full sample).
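The unbiasedness of the slope estimator can be illustrated with a small simulation (ours, not the slides'): averaging the estimated slope over many samples gives a value close to the true β2:

```r
set.seed(3)
true_b2 <- 2
# Re-estimate the slope on 2000 independent samples of size 30
beta2_hats <- replicate(2000, {
  x <- runif(30)
  y <- 1 + true_b2 * x + rnorm(30)
  sum((y - mean(y)) * (x - mean(x))) / sum((x - mean(x))^2)
})
mean(beta2_hats)   # close to the true value 2
```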
13. Linear dependency index (R²)
The R² index is the most used measure to evaluate the linear
fit of the model.
R² is confined to the interval [0, 1], where values near 1
mean that the explanatories properly describe the changes in
Y (the model is well defined).
How R² is constructed:
SQT = SQR + SQE, or
Σi (yi − ȳ)² = Σi (ŷi − ȳ)² + Σi (yi − ŷi)², or
total variation = model variation + residual variation.
The R² is defined as SQR/SQT, or 1 − SQE/SQT. Or, equivalently:
R² = Σi (ŷi − ȳ)² / Σi (yi − ȳ)²
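The decomposition and the R² formula can be verified in R against the value reported by `summary(lm)`; the data are simulated for illustration:

```r
set.seed(4)
x <- runif(40)
y <- 1 + 2 * x + rnorm(40, sd = 0.5)
fit  <- lm(y ~ x)
yhat <- fitted(fit)
SQT <- sum((y - mean(y))^2)      # total variation
SQR <- sum((yhat - mean(y))^2)   # model variation
SQE <- sum((y - yhat)^2)         # residual variation
R2  <- SQR / SQT
all.equal(SQT, SQR + SQE)               # TRUE
all.equal(R2, summary(fit)$r.squared)   # TRUE
```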
14. Hypothesis testing on β2
The hypothesis test for the slope parameter is really similar to the
tests for the mean parameter. The estimated slope parameter β̂2 is
stochastic. It is distributed as a normal variable when the sample is
large:
β̂2 ∼ N(β2, σ²/SSx), where SSx = Σi (xi − x̄)².
We can make use of the hypothesis testing approach to investigate
the causal relation between Y and X:
H0: β2 = 0; H1: β2 ≠ 0,
where the alternative hypothesis means a causal relation. The test
is:
z = (β̂2 − β2) / √(σ²/SSx) ∼ N(0, 1).
Since σ² is, generally, unknown, we estimate it from the residuals
(σ̂² = Σi ei² / (n − 2)), and we use a t-test with n − 2 degrees of
freedom (in case n is small).
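In R, `summary(lm)` reports this test directly (using a t distribution with n − 2 degrees of freedom); the `cars` data set is our illustrative choice:

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)
coefs <- summary(fit)$coefficients
coefs["speed", ]   # Estimate, Std. Error, t value, Pr(>|t|)
# The t statistic is the estimate divided by its standard error:
t_manual <- coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]
all.equal(t_manual, coefs["speed", "t value"])  # TRUE
```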
15. Prediction within the regression model
The question we want to answer is the following: what is the
expected value of y (say yn+1) for a certain observation that is not
in the sample?
Suppose we have, for that observation, the value of the variable X
(say xn+1).
We make use of the estimated β̂ to estimate ŷn+1 as:
ŷn+1 = β̂1 + β̂2xn+1
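In R this is done with `predict()`; the `cars` data and the new value x = 23 are our illustrative choices:

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)
y_new <- predict(fit, newdata = data.frame(speed = 23))  # x_{n+1} = 23
# The same number by hand from the estimated coefficients:
b <- coef(fit)
all.equal(unname(y_new), unname(b[1] + b[2] * 23))  # TRUE
```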
16. Model Checking
Several methods are used to test the robustness of the model,
most of them based on the stochastic part of the model (the
estimated residuals).
Graphical (at-eye) checks: plot the residuals versus the
fitted values (residual hypotheses)
QQ-plot and Shapiro-Wilk test for normality
Durbin-Watson test for residual correlation
Breusch-Pagan test for residual heteroscedasticity.
Moreover, the leverage is used to evaluate the contribution of each
observation in determining the estimated coefficients β̂.
The stepwise procedure is used to choose among different model
specifications, in other words, to remove the explanatories which
are not significant.
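A sketch of these checks in R on an illustrative model; the Durbin-Watson and Breusch-Pagan tests require the `lmtest` package (assumed installed), so they are left as comments:

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)
plot(fitted(fit), resid(fit))            # residuals vs fitted (at-eye check)
qqnorm(resid(fit)); qqline(resid(fit))   # QQ-plot for normality
shapiro.test(resid(fit))                 # Shapiro-Wilk normality test
# lmtest::dwtest(fit)                    # Durbin-Watson: serial correlation
# lmtest::bptest(fit)                    # Breusch-Pagan: heteroscedasticity
hatvalues(fit)                           # leverage of each observation
```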
17. Model Checking using estimated residuals - Linearity
An example of departure from the linearity assumption: in this
case we could draw a curve (not a horizontal line) to interpolate the
points.
Figure: residuals (Y) versus estimated (X) values
18. Model Checking using estimated residuals -
Homoscedasticity
An example of departure from the homoscedasticity assumption: in
this picture the estimated residuals increase as the predicted
values increase.
Figure: residuals (Y) versus estimated (X) values
19. Model Checking using estimated residuals - Normality
An example of departure from the normality assumption. Here the
QQ points do not lie within the QQ-line bounds.
Figure: residuals (Y) versus estimated (X) values
20. Model Checking using estimated residuals - Serial
correlation
An example of departure from the assumption of no serial
correlation of the residuals: the residual at i depends on its value at
i − 1.
21. Homeworks
1. Using the cement data (n = 13), determine the β1 and β2
coefficients manually, using the OLS formulas on slide 11, for the
model y = β1 + β2x1.
2. Using the cement data, estimate the R² index of the model
y = β1 + β2x1, using the formula on slide 13.