This document discusses panel data analysis. Some key points:
- Panel data combines cross-sectional and time series data to observe multiple subjects over time in balanced and unbalanced panels.
- Panel data is useful for reducing noise, studying dynamic changes, and addressing issues with limited data availability.
- Choosing between fixed effects and random effects models depends on tests like the Hausman test and whether the unobserved effects are correlated with regressors.
- Panel data regression techniques like pooled mean group allow for heterogeneity across subjects while assuming some parameters are the same.
2. Why panel data
The main interest is the group rather than the individual units
within it, which means that very little information is lost by
taking the panel perspective.
The use of panel rather than time series data not only
increases the total number of observations and their
variations but also reduces noise coming from the
individual time series.
(Noise is a signal whose samples are a sequence of unrelated
random numbers, i.e. random variables with zero mean
(Σx/n ≈ 0) and finite variance.)
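As a minimal sketch of this definition (pure Python; the sample size and the normal distribution are illustrative choices, not from the text):

```python
import random
import statistics

random.seed(0)

# White noise: a sequence of unrelated random draws with
# zero mean and finite variance.
noise = [random.gauss(0.0, 1.0) for _ in range(10_000)]

sample_mean = statistics.mean(noise)     # close to 0
sample_var = statistics.variance(noise)  # close to 1, i.e. finite
```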
3. Why panel data
Best suited where data availability is an issue, particularly for
developing countries, where the available time spans for variables
are short and often insufficient for fitting time series
regressions.
There is heterogeneity (differences) among units in the panel.
Panel estimation techniques take this heterogeneity into
account by allowing for subject-specific variables.
Panel data is suited to studying dynamic changes because of its
repeated cross-sectional observations.
4. Panel data
Panel data combines cross-sectional and time series data and looks at multiple
subjects and how they change over the course of time, in balanced and
unbalanced panels.
In microeconomic panels, the individuals are not always interviewed the same
number of times, leading to an unbalanced panel
And in an unbalanced panel, the number of time series observations is different
across individuals
In a balanced panel, each individual has the same number of observations
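The balanced/unbalanced distinction can be sketched in a few lines of Python (the firm names and years below are made up for illustration):

```python
# Each tuple is (unit, year) — one observation of the panel.
observations = [
    ("firm_A", 2019), ("firm_A", 2020), ("firm_A", 2021),
    ("firm_B", 2019), ("firm_B", 2020), ("firm_B", 2021),
    ("firm_C", 2019), ("firm_C", 2020),   # firm_C is missing 2021
]

def is_balanced(obs):
    """A panel is balanced when every unit has the same number
    of time observations."""
    counts = {}
    for unit, _ in obs:
        counts[unit] = counts.get(unit, 0) + 1
    return len(set(counts.values())) == 1

# This panel is unbalanced because firm_C has only two observations.
```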
5. Introduction
Describe what panel data is and the reasons for
using it in this format
Assess the importance of fixed and random
effects
Examine the Hausman test, which determines if
fixed or random effects should be used.
Evaluate some panel data models
6. Panel Data
These are Models that Combine Cross-section
and Time-Series Data
In panel data the same cross-sectional unit
(industry, firm, country) is surveyed over time, so
we have data which is pooled over space as well
as time.
7. Reasons for using Panel Data
1. Panel data can take explicit account of individual-specific
heterogeneity (“individual” here means related to the micro
unit)
2. By combining data in two dimensions, panel data gives more
data variation, less collinearity and more degrees of freedom.
3. Panel data is better suited than cross-sectional data for
studying the dynamics of change.
For example, it is well suited to understanding transition
behaviour, such as company bankruptcy or merger.
8. 4. Panel data is better at detecting and measuring effects that
cannot be observed in either cross-section or time-series data.
5. Panel data enables the study of more complex behavioural models
– for example the effects of technological change, or economic
cycles.
6. Panel data can minimise the effects of aggregation bias, from
aggregating firms into broad groups.
9. Panel data analysis
Involves using a combination of cross section (N) and time series (T)
observations for analysis
N = number of groups (firms, countries, industries, individuals, etc.)
T = number of time periods (e.g. years)
Panel data variant
Long panel: large N large T
Short panel: Large N small T
Heterogeneous panel: large or small N, but with a long time dimension (N < T)
Static panel: only exogenous variables as regressors
Dynamic panel: inclusion of lagged dependent variables among the regressors
10. Heterogenous Dynamic Panel Data Modelling
Steps to estimation
1- Specify the model
2-Descriptive statistics
3- Correlation analysis
4- Perform unit root test
5- Optimal lags selection
6- Cointegration test
7- Perform Hausman test
8- Estimate the model
9- Causality test
10- Diagnostic
11. 1- Specify the model
Let us look at the specified models in more detail.
The ARDL model was introduced by Pesaran et al. (2001) in order to
incorporate I(0) and I(1) variables in the same estimation.
If all your variables are stationary, I(0), then OLS is appropriate; if
all are non-stationary but stationary in first differences, I(1), then it
is advisable to use a VECM (Johansen approach), as it is a simpler model.
12. We cannot estimate a conventional OLS regression on the variables if
any or all of them are I(1), because such variables do not have the stable,
stationary behaviour OLS requires. Since most of them change over time,
OLS will mistakenly show high t-values and significant results.
In reality these are inflated by the common time component; in
econometrics this is called a spurious regression, where the R-squared of
the model becomes higher than the Durbin-Watson statistic.
So we move to a new set of models which can work with I(1) variables.
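The spurious-regression point can be illustrated by regressing one simulated random walk on another, completely unrelated, one (a pure-Python sketch; the sample size and seed are arbitrary choices):

```python
import random

random.seed(1)

def random_walk(n):
    """An I(1) series: the cumulative sum of white noise."""
    level, path = 0.0, []
    for _ in range(n):
        level += random.gauss(0.0, 1.0)
        path.append(level)
    return path

def ols_r2_dw(y, x):
    """Simple OLS of y on x; return R-squared and the Durbin-Watson statistic."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    beta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    alpha = my - beta * mx
    resid = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]
    ssr = sum(e * e for e in resid)
    sst = sum((yi - my) ** 2 for yi in y)
    r2 = 1.0 - ssr / sst
    dw = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, n)) / ssr
    return r2, dw

y = random_walk(500)   # two completely unrelated I(1) series
x = random_walk(500)
r2, dw = ols_r2_dw(y, x)
# Typically r2 looks "significant" while dw is far below 2 — the classic
# R-squared > Durbin-Watson warning sign of a spurious regression.
```

Differencing the series, or moving to a cointegration/ECM framework as the text suggests, is the standard remedy.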
13. Panel ARDL
Let us start with ARDL, where the variables are a mix of
I(1) and I(0). The model can be written (in plain notation) as

Y_it = Σ_{j=1..p} δ_ij Y_i,t−j + β_i X_it + φ_i + e_it

where φ_i is the group-specific effect and e_it is the error term.
14. How to estimate the ARDL model
In order to run ARDL, some preconditions need to be
checked:
• The dependent variable must be non-stationary in order for the
model to behave well.
• None of the variables should be I(2) under normal conditions
(ADF test).
• None of the variables should be I(2) in the presence of a
structural break (Zivot-Andrews test).
15. Check Optimal Lag order
First we need to check the lag order to see which lag (past
period) to use in the ADF test for each variable in the model.
This is done using the vector autoregression lag-order
selection criteria, which can be applied quickly in Stata;
in EViews you do it after estimating a VAR model and
checking the lag length criteria.
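A lag-order selection of this kind can be sketched in pure Python: fit AR(p) models over a common estimation sample and pick the lag that minimises the AIC (the AIC formula used and the simulated AR(1) data are illustrative assumptions, not the slide's exact criterion):

```python
import math
import random

def ols_ssr(y, X):
    """Solve the normal equations X'X b = X'y by Gaussian elimination
    with partial pivoting; return the sum of squared residuals."""
    k = len(X[0])
    n = len(y)
    A = [[sum(X[t][i] * X[t][j] for t in range(n)) for j in range(k)]
         for i in range(k)]
    b = [sum(X[t][i] * y[t] for t in range(n)) for i in range(k)]
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            for c in range(i, k):
                A[r][c] -= f * A[i][c]
            b[r] -= f * b[i]
    coef = [0.0] * k
    for i in range(k - 1, -1, -1):
        coef[i] = (b[i] - sum(A[i][j] * coef[j]
                              for j in range(i + 1, k))) / A[i][i]
    resid = [y[t] - sum(X[t][j] * coef[j] for j in range(k))
             for t in range(n)]
    return sum(e * e for e in resid)

def best_ar_lag(series, pmax=4):
    """Pick the AR lag order minimising AIC over a common sample."""
    best = None
    for p in range(1, pmax + 1):
        # Use the same observations (t = pmax .. n-1) for every p
        # so the AIC values are comparable across lag orders.
        y = series[pmax:]
        X = [[1.0] + [series[t - j] for j in range(1, p + 1)]
             for t in range(pmax, len(series))]
        ssr = ols_ssr(y, X)
        n = len(y)
        aic = n * math.log(ssr / n) + 2 * (p + 1)
        if best is None or aic < best[0]:
            best = (aic, p)
    return best[1]

random.seed(2)
s, prev = [], 0.0
for _ in range(300):
    prev = 0.6 * prev + random.gauss(0.0, 1.0)   # simulated AR(1) process
    s.append(prev)
lag = best_ar_lag(s)
```

Fixing the estimation sample across candidate lag orders is the detail that makes the information criteria comparable.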
16. In statistics and econometrics, a distributed lag model is a
model for time series data in which a regression equation is
used to predict current values of a dependent variable from
both the current and lagged values of an explanatory variable.
17. 2- Descriptive statistics
Show and explain the characteristics of each variable in the model;
where possible, relate them to each group to allow comparative statistics.
3- Correlation analysis
Show that the regressors are not perfect or exact linear
combinations of one another.
18. Autocorrelation
Although autocorrelation here differs from that in the usual OLS
models, a version of the Durbin-Watson test can be used
in the usual way (a statistic below 2 suggests positive
serial correlation).
To remedy autocorrelation we can use the usual methods,
such as the error correction model.
'Dynamic models' are also often used, which basically
involve adding a lagged dependent variable.
Recently, adjusting the standard errors has become a popular
remedy; the most common method is the 'Newey-West'
adjusted standard errors.
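The Durbin-Watson statistic mentioned above is straightforward to compute from a residual series (the two example residual series below are made up to show the extremes):

```python
def durbin_watson(resid):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2).
    Values well below 2 suggest positive serial correlation;
    values near 2 suggest none; values near 4 suggest negative."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den

# Strongly positively autocorrelated residuals -> DW near 0
trending = [0.1 * t for t in range(100)]
# Perfectly alternating residuals -> DW near 4
alternating = [(-1) ** t for t in range(100)]
```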
19. 4- Unit root test
Ascertain that no variable is integrated of order 2 (i.e. stationary only
after second differencing); perform first- and second-generation unit
root tests.
5- Determine optimal lag selection
Using the unrestricted model and an information criterion, decide the
choice of lags for each unit/group per variable, then choose the most
common lag for each variable to represent the lags for the model.
20. 6- Cointegration test
Perform the Pedroni (1999, 2004) and Westerlund (2007)
cointegration tests; under the assumption of long-run
homogeneity this step can be skipped.
Cointegration is ascertained from the statistical significance of
the long-run coefficients; essentially, cointegration (the long-run
relationship) presents itself as the joint significance of the level
equation.
7- Hausman test
Reject the null hypothesis if the probability value is less than
0.05.
Hausman test:
Tests for the statistical significance of the difference between the
coefficient estimates obtained by FE and by RE, under the null
hypothesis that the RE estimates are consistent and efficient, while the
FE estimates are consistent but inefficient.
H(0): the random effects model is appropriate.
H(1): the fixed effects model is appropriate.
When the probability value is higher than 0.05 the difference is not
significant: we cannot reject the null, which means the random effects
model is appropriate.
The test has a Wald test form and is usually reported as a Chi-squared
statistic with k−1 degrees of freedom (k is the number of regressors).
If W < critical value, then random effects is the preferred estimator.
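For a single regressor, the Wald form of the test reduces to a scalar formula, sketched below (the coefficient and standard-error values are made up for illustration):

```python
# Hausman statistic for one regressor:
# H = (b_FE - b_RE)^2 / (se_FE^2 - se_RE^2), compared with a
# chi-squared critical value with 1 degree of freedom.

def hausman_single(b_fe, se_fe, b_re, se_re):
    # Under the null, RE is efficient, so its variance should be smaller.
    var_diff = se_fe ** 2 - se_re ** 2
    if var_diff <= 0:
        raise ValueError("variance difference not positive; test not applicable")
    return (b_fe - b_re) ** 2 / var_diff

CHI2_CRIT_1DF_5PCT = 3.841   # chi-squared critical value, 1 df, 5% level

# Illustrative (made-up) estimates:
H = hausman_single(b_fe=0.92, se_fe=0.15, b_re=0.60, se_re=0.08)
prefer = "fixed effects" if H > CHI2_CRIT_1DF_5PCT else "random effects"
```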
22. 8- Estimate the model
If the test favours the PMG (pooled mean group) estimator, observe the
statistical significance of the long-run coefficients, the size of the
group-specific error-adjustment coefficients, and the short-run coefficients.
Finally, interpret the results accordingly.
23. 9- Causality test
Perform Granger causality tests and Wald or weak exogeneity tests
(this step is optional).
Causality can also be determined using the significance of:
the error correction term (for joint causality)
the long-run coefficients
the short-run coefficients
the ECT together with the long- and short-run coefficients (for strong causality)
10- Diagnostic tests
The diagnostics should be group-specific rather than panel-wide, and the
results can be compared across groups.
24. Basis of panel ARDL - fixed and random effects
Mean group (MG) estimator
1- Proposed by Pesaran and Smith (1995)
2- Less informative
3- Averages the data
4- Estimates N separate regressions
5- Examines the distribution of the estimated coefficients across groups
6- Produces consistent estimates of the average of the parameters
7- Parameters are fully independent across groups
8- Does not recognise that certain parameters may be the same
across groups.
25. Dynamic fixed effects (DFE)
1- Intercepts differ across groups
2- Slope coefficients and error variances are identical across groups
3- Allows the dynamic specification (number of lags included) to
differ across groups
Pooled mean group (PMG) estimator
1- Proposed by Pesaran, Shin and Smith (1999)
2- An intermediate estimator between the mean group and DFE estimators
3- Involves both pooling and averaging
4- Allows the intercepts, short-run coefficients and error variances to
differ across groups
5- Long-run coefficients are restricted to be the same across groups
6- Generates consistent estimates of the mean of the short-run
coefficients by taking the simple average across the individual units
26. MG and PMG Estimators
The MG estimator provides consistent estimates of the mean of the long-run
coefficients, but these will be inefficient if slope homogeneity holds.
PMG estimators are consistent and efficient under the assumption of
long-run slope homogeneity.
27. Homogeneity
As can be seen, it is important to check whether the data set is homogeneous
before a statistical technique is applied.
For homogeneity, all outside processes that could potentially affect the data
must remain constant over the complete time period of the sample.
How to do it?
Calculate the median.
Subtract the median from each value in the dataset.
Count how many times the data make a run above or below the median (i.e.
persistent runs of positive or negative values).
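The three steps above amount to a runs-about-the-median count, which can be sketched as follows (the two example series are made up; many runs suggest noise-like data, few long runs suggest persistence):

```python
import statistics

def runs_about_median(data):
    """Count runs of values above/below the sample median.
    Values equal to the median are dropped, a common convention
    in the runs test."""
    med = statistics.median(data)
    signs = [v > med for v in data if v != med]
    runs = 1 if signs else 0
    for prev, cur in zip(signs, signs[1:]):
        if cur != prev:
            runs += 1
    return runs

# Alternating series -> many short runs; trending series -> few long runs.
alternating = [1, 5, 2, 6, 1, 7, 2, 8]
trending = [1, 2, 3, 4, 10, 11, 12, 13]
```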
28. MG or PMG?
Perform the Hausman test.
H0: the MG and PMG estimates are not significantly different; PMG is more
efficient.
H1: the null is not true.
Decision: use PMG if the p-value is higher than 0.05 (P > 0.05; H0 cannot
be rejected),
or
use MG if the p-value is lower than 0.05 (P < 0.05; reject H0).
29. MG or dynamic fixed effects (DFE)?
Perform the Hausman test.
H0: the MG and DFE estimates are not significantly different; DFE is more
efficient.
H1: the null is not true.
Decision:
use DFE if the p-value is higher than 0.05 (P > 0.05; H0 cannot be
rejected);
use MG if the p-value is lower than 0.05 (P < 0.05; reject H0).
30. DFE or PMG?
Perform the Hausman test.
H0: the DFE and PMG estimates are not significantly different; PMG is more
efficient.
H1: the null is not true.
Decision:
use PMG if the p-value is higher than 0.05 (P > 0.05; H0 cannot be rejected);
use DFE if the p-value is lower than 0.05 (P < 0.05; reject H0).
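The decision rule shared by these three comparisons can be captured in one small helper (a sketch; the 0.05 threshold follows the slides):

```python
# Under H0 the efficient estimator (PMG or DFE) is preferred;
# rejecting H0 (p < 0.05) falls back to the consistent-but-less-
# efficient alternative (usually MG).

def choose_estimator(p_value, efficient, fallback, alpha=0.05):
    """Return the estimator implied by a Hausman-test p-value."""
    return efficient if p_value > alpha else fallback
```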
31. PANEL DATA REGRESSION MODELS
Panel data regression models are based on panel data, which are observations
on the same cross-sectional, or individual, units over several time periods.
A balanced panel has the same number of time observations for each cross-
sectional unit.
Panel data have several advantages over purely cross-sectional or purely time
series data.
These include:
(a) Increase in the sample size
(b) Study of dynamic changes in cross-sectional units over time
(c) Study of more complicated behavioral models,
including study of time-invariant variables
(Damodar Gujarati, Econometrics by Example, second edition)
32. PANEL DATA REGRESSION MODELS
However, panel models pose several estimation
and inference problems, such as
heteroscedasticity, autocorrelation, and cross-
correlation in cross-sectional units at the same
point in time.
The fixed effects model (FEM) and the random
effects model (REM), also known as the error
components model (ECM), are commonly used
methods to deal with one or more of these
problems.
33. FIXED EFFECTS MODEL (FEM)
In FEM, the intercept in the regression model is allowed to differ among
individuals to reflect the unique feature of individual units.
This is done by using dummy variables, provided we take care of the
dummy variable trap.
The FEM using dummy variables is known as the least-squares dummy
variable model (LSDV).
FEM is appropriate in situations where the individual-specific intercept
may be correlated with one or more regressors, but consumes a lot of
degrees of freedom when N (the number of cross-sectional units) is very
large.
34. WITHIN-GROUP (WG) ESTIMATOR
An alternative to LSDV is to use the within-group (WG) estimator.
Here we subtract the (group) mean values of the regressand and
regressor (independent) from their individual values and run the
regression on the mean-corrected variables.
Although it is economical in terms of the degrees of freedom, the mean-
corrected variables wipe out time-invariant variables (such as gender and
race) from the model.
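The within-group estimator can be sketched in a few lines (the panel values below are made up; a single regressor keeps the algebra to one slope):

```python
# panel: unit -> list of (y, x) observations (illustrative numbers)
panel = {
    "A": [(10.0, 1.0), (12.0, 2.0), (14.0, 3.0)],
    "B": [(20.0, 4.0), (22.0, 5.0), (24.0, 6.0)],
}

def within_group_slope(panel):
    """Demean y and x within each group, then run OLS (no intercept)
    on the mean-corrected data. Group-specific intercepts — and any
    time-invariant regressors — are wiped out by the demeaning."""
    num = den = 0.0
    for obs in panel.values():
        ybar = sum(y for y, _ in obs) / len(obs)
        xbar = sum(x for _, x in obs) / len(obs)
        for y, x in obs:
            num += (x - xbar) * (y - ybar)
            den += (x - xbar) ** 2
    return num / den
```

Note that the large level difference between groups A and B never enters the estimate; only the within-group variation does.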
35. RANDOM EFFECTS MODEL (REM)
In REM we assume that the intercept value of an individual unit
is a random drawing from a much larger population with a
constant mean.
The individual intercept is then expressed as a deviation from
the constant mean value.
REM is more economical than FEM in terms of the number of
parameters estimated.
REM is appropriate in situations where the (random) intercept
of each cross-sectional unit is uncorrelated with the regressors.
36. Random Effects and fixed effects Estimation
The fixed effects model assumes that each group (firm) has a non-
stochastic group-specific component to y. Including dummy variables is
a way of controlling for unobservable effects on y.
But these unobservable effects may be stochastic (i.e. random). The
Random Effects Model attempts to deal with this:
Here the unobservable component,
vi , is treated as a component of the random error term.
vi is the element of the error which varies between groups but not
within groups.
εit is the element of the error which varies over group and time.
y_it = a_0 + a_1 x_it + v_i + ε_it    (6)
37. Choosing between Fixed Effects (FE) and Random Effects (RE)
1. With large T and small N there is likely to be little difference, so FE
is preferable as it is easier to compute
2. With large N and small T, estimates can differ significantly. If the
cross-sectional groups are a random sample of the population, RE is
preferable; if not, FE is preferable.
3. If the error component, vi , is correlated with x then RE is biased
(inconsistent), but FE is not.
4. For large N and small T and if the assumptions behind RE hold
then RE is more efficient than FE.
38. FIXED EFFECTS OR RANDOM EFFECTS
If it is assumed that εi and the regressors are uncorrelated, REM
may be appropriate; if they are correlated, FEM may be
appropriate.
In the former case we also have to estimate fewer parameters.
The Hausman test can be used to decide between FEM and REM:
The null hypothesis underlying the Hausman test is that the FEM and
REM estimates do not differ substantially.
The test statistic has an asymptotic chi-square distribution.
If the computed chi-square value exceeds the critical
chi-square value at the chosen level of significance,
we conclude that REM is not appropriate because the random
error term is probably correlated with one or more regressors.
In this case, FEM is preferred to REM.