PS 699 Section March 18, 2010
Megan Reif
Graduate Student Instructor, Political Science
Professor Rob Franzese
University of Michigan
Regression Diagnostics for Extreme
Values (also known as extreme value
diagnostics, influence diagnostics,
leverage diagnostics, case diagnostics)
Review of (often iterative) Modeling Process

1. THEORY FORMULATION & MODEL SPECIFICATION
   • Data-generating process
   • Assumptions
2. DATA (Measure, Sample, Collect, Clean)
   • Exploratory Data Analysis (EDA) of the empirical distribution (center, spread, skewness, tail length, outliers)
   • Uni- & bivariate; numeric & graphic; formal & informal
   • Include your prior info about the population distribution & variance
   • EDA helps identify obvious violations of CLRM and address trade-offs between corrections. But don't start dropping observations at this stage!
3. MODEL ESTIMATION & INFERENCE
4. POST-ESTIMATION ANALYSIS / CRITICAL ASSESSMENT OF ASSUMPTIONS
   • Numeric & graphic, formal & informal diagnostics: influence, normality, collinearity, non-sphericality
   • Treat outliers as INFO, not NUISANCE: explain them, don't hide them.
2(c) Megan Reif
I. Pre-Modeling Exploratory Data Analysis (EDA)
(Review/Checklist)
• Not to be confused with data-mining – Arrive at data with your theory in
hand
• Because multivariate analysis builds on uni- and bivariate analysis, begin
with univariate analysis, followed by bivariate, before proceeding.
• These notes assume knowledge of how to produce descriptive statistics, but
provide basic commands and output as a sort of checklist.
• Don't forget to start by using Stata's "describe", "summarize",
"codebook", and "inspect" commands to understand (a) how the
variables are labeled and coded, (b) basic distributions, and (c) how much
missing data there are for each variable.
• To think about possible effect of missing data on your model, use “list if”
command
list yvar xvar1 xvar2 xvar3 if yvar==.
list yvar xvar1 xvar2 xvar3 if xvar1==.
and so on
• Recode and label your variables for easier interpretation before
proceeding, particularly the uniqueid variable (such as country-year,
individual 1-n, etc.) for easy labeling of points (choose a short name).
I.A Exploratory Data Analysis
(EDA):Univariate & Bivariate Analysis
1. Summarize Basic Univariate and Bivariate Distributions for
Theoretical Model Variables for data structure:
1. Location (Mean, Median)
2. Spread (Range, Variance, Quartiles)
3. Genuine Skewness vs. Outliers
The most efficient way to obtain this information is to use
Stata’s “tabstat” command and the statistics you desire
for your model variables and then inspect:
• Histograms (do not forget to explore using different bin
sizes and between 5-20 bins, since histogram distributions
are sensitive to bin size)
• Boxplots
• Matrix Scatterplots
Univariate Outliers
• Distinguish between GENUINE skewness in the population distribution &
subsequently the empirical distribution, as opposed to unusual behavior (outliers)
in one of the tails. Your theory about the population may guide you on this.
• Do not leave univariate outliers out of your model or model them explicitly based
on descriptive statistics until you have done post-estimation diagnostics to
determine whether they are also MULTIVARIATE outliers (or correct them if they
are due to obvious typos or missing data or non-response codes like “999”).
• A UNIVARIATE outlier is a data point which is distant from the main body of the
data (say, the middle 50%). One way to measure this distance is the inter-quartile
range (IQR, the range of the middle 50% of the data). A data point x_o is an outlier if

      x_o < Q_L - 1.5·IQR   or   x_o > Q_U + 1.5·IQR

and a far outlier if

      x_o < Q_L - 3.0·IQR   or   x_o > Q_U + 3.0·IQR

– OBSERVE whether the middle 50 percent of the data ALSO manifest skewness.
– If the IQR is skewed, a transformation such as a log or square may be called for; IF
NOT, focus on the outliers.
– Use a Box Plot to check the location of the median in relation to the quartiles.
In Stata, a Box-Plot will show outliers
(1.5IQR criteria) as points if they are
present in the data.
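These notes do everything in Stata; purely as an illustrative sketch of the 1.5·IQR (and 3.0·IQR) fence rule above, here is a Python version. It uses the Tanzania revenue values that appear later in these notes plus one artificially injected extreme value (99999), so the fences have something to catch; the function name and the injected value are illustrative only.

```python
def iqr_outliers(xs, k=1.5):
    """Flag points beyond k*IQR outside the quartiles (k=1.5: outlier, k=3.0: far outlier)."""
    s = sorted(xs)
    n = len(s)
    def quantile(q):
        # linear-interpolation quantile on the sorted data
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])
    q_l, q_u = quantile(0.25), quantile(0.75)
    iqr = q_u - q_l
    return [x for x in xs if x < q_l - k * iqr or x > q_u + k * iqr]

# Tanzania revenue values from later in these notes, plus one injected
# extreme value (99999).
rev = [2549, 2746, 2906, 2972, 3243, 3409, 3426, 3470, 3497, 3544,
       3603, 3756, 3928, 4112, 4169, 4482, 4498, 4506, 5424, 5433, 99999]
print(iqr_outliers(rev))          # outliers (1.5*IQR fences)
print(iqr_outliers(rev, k=3.0))   # far outliers (3.0*IQR fences)
```

Without the injected point, none of the revenue values breach the fences, which matches the box plots shown later.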
Tanzania Revenue Data
tabstat rev rexp dexp t, s(mean median sd var count min max iqr)
stats | rev rexp dexp t
---------+----------------------------------------
mean | 3728.381 4030.048 1693.619 80
p50 | 3544 3891 1549 80
sd | 817.1005 821.3014 894.879 6.204837
variance | 667653.2 674535.9 800808.3 38.5
N | 21 21 21 21
min | 2549 2899 586 70
max | 5433 5627 3589 90
iqr | 926 1127 1379 10
--------------------------------------------------
EXAMPLE Data: Tanzania.dta (Mukherjee et al.)
REV: Gov Recurrent Revenue   REXP: Gov Recurrent Expenditure
DEXP: Gov Development Expenditure   Year (T) 1970-1990
Decade: 0=1970s, 1=1980s, 2=1990
BIVARIATE NOTE: You can add “by(groupvariable)” after the comma to look at
descriptives for subgroups of interest.
tabstat rev rexp dexp t, s(mean median sd var count min max iqr)
by(decade)
Summary statistics: mean, p50, sd, variance, N, min, max, iqr
by categories of: decade (decade)
decade | rev rexp dexp t
---------+----------------------------------------
0 | 4133.7 4057.3 2151.6 74.5
| 3962.5 3850 1994 74.5
| 814.8448 789.6686 774.8313 3.02765
| 663972 623576.5 600363.6 9.166667
| 10 10 10 10
| 3243 3122 1228 70
| 5433 5571 3589 79
| 1072 937 927 5
---------+----------------------------------------
1 | 3303.1 3950 1346.4 84.5
| 3221 3812.5 993 84.5
| 657.0976 914.5912 822.1245 3.02765
| 431777.2 836477.1 675888.7 9.166667
| 10 10 10 10
| 2549 2899 588 80
| 4506 5627 3096 89
| 857 1392 1037 5
---------+----------------------------------------
2 | 3928 4558 586 90
| 3928 4558 586 90
| . . . .
| . . . .
| 1 1 1 1
| 3928 4558 586 90
| 3928 4558 586 90
| 0 0 0 0
---------+----------------------------------------
Total | 3728.381 4030.048 1693.619 80
| 3544 3891 1549 80
| 817.1005 821.3014 894.879 6.204837
| 667653.2 674535.9 800808.3 38.5
| 21 21 21 21
| 2549 2899 586 70
| 5433 5627 3589 90
| 926 1127 1379 10
--------------------------------------------------
Univariate Box Plots & Histograms
graph box rev
• Notice that the inter-
quartile range manifests
skewness, in addition to the
maximum being much
further from the middle
50% of the observations
• Note how different the
histogram for Revenue
appears for 4, 6, 8, and 10
bins (21 observations)
• See histogram help file to
ensure you properly display
histograms for continuous
vs. discrete variables.
[Figures: box plot of rev; histograms of rev with 4, 6, 8, and 10 bins ("Histogram of Tanzania Annual Revenue, Different Bin Sizes")]
graph box rev if decade ==0 |
decade==1, over(decade)
histogram rev, by(decade)
• Box Plot of Revenue by decade
(1970s and 1980s)
• Note that the IQR is less
skewed for the 1970s than the
1980s
• Since there are no dots in the
boxplot we know there are no
formal univariate outliers.
• We also know from other
financial data that skewness
may be something to correct
for with a log transformation.
[Figures: box plot of rev by decade; histograms of rev by decade ("Graphs by decade")]
Bivariate Box Plots & Histograms: Inspecting by Subgroups or
Categorical Transformations of Continuous Variables
Scatterplot Matrices and Cross-Tabulations
• Use these before ever running a regression, to see group differences and
reveal potential violations of the CLRM.
[Figure: scatterplot of two groups. Group 1 and Group 2 may have the same
relationship to y on average, but something else is going on.]
The four panels form “Anscombe’s Quartet”—a famous demonstration by statistician Francis Anscombe in 1973. By
creating the four plots he was able to check the assumptions of his linear regression model, and found them wanting
for three of the four data sets (all but the top left). As Epstein et al. write, “Anscombe’s point, of course, was to
underscore the importance of graphing data before analyzing it” (24).
F.J. Anscombe, 1973. "Graphs in Statistical Analysis," American Statistician 27(1): 17-21, cited in Lee Epstein, Andrew D. Martin, and Matthew M. Schneider,
2006. "On the Effective Communication of the Results of Empirical Studies, Part I." Paper presented at the Vanderbilt Law Review Symposium on Empirical
Legal Scholarship, February 17.
Remember that looking at
correlations alone will conceal
curvilinear relationships,
heteroskedasticity, outliers, and
distributional shape. For example,
THE DATA IN THE FOUR PLOTS HAVE
THE SAME:
1) means for both y and x variables
2) slope and intercept estimates in a
regression of y on x.
3) R2 and F values (statistics we will
come to later).
Bi-Variate Correlations/Regressions: The NEED TO GRAPH
data: Same Statistics, Different Relationships
Scatterplot Matrices
graph matrix rev rexp dexp t, half
• Allows you to look at bivariate
relationships between your
model variables, think about
possible colinearity between
explanatory variables, non-
linearity in relationships, etc.
• Notice time trend of all three
financial variables—consider
autocorrelation
• Extreme Points: We may want to
inspect the scatterplots for rev –
dexp and rexp – dexp for
observations that seem to be
unusual given our theory that
development expenditure would
be a function of revenue (the
observations have high
development expenditure but
low revenue)
[Figure: scatterplot matrix of rev, rexp, dexp, and t]
A Closer Look: Scatterplot with Labels
scatter dexp rev, mlabel(t)
• Note that in 1990, revenue
was middling but
development expenditures
were low. What might cause
this?
[Figures: scatter of dexp against rev with year labels; scatter of rev against t with revenue-value labels]
scatter rev t, mlabel(rev)
• Scatter of revenue over time suggests a trend
and possible autocorrelation. It is also curious
that 1979 and 1980 have almost identical (and
high) levels of revenue. Possible data error or
real stagnation in revenue? There was a war
between Uganda and Tanzania in 1979. Note
how inspecting the data can lead to case-
specific information that may require modeling
adjustments (e.g., war dummies). And we didn’t
know a thing about Tanzania!
Cross-Tabulations (Contingency Tables)
• Recode continuous variables into categories (see notes from March 11), which
enables you to summarize continuous variables by categories (below) and inspect
test statistics for inter-group differences in means and variances (next slide)
gen revcat=rev
recode revcat 2549/3500=1 3501/4500=2 4501/max=3
label define revcat 1 "low" 2 "med" 3 "high"
label values revcat revcat
tab revcat decade, sum(dexp)
• We want to see if the mean and sd of development expenditure vary by revenue
level and decade, for example, in order to see if one decade is responsible for all of
the high-revenue observations, etc. – remember how important sub-group size is
when using interaction terms. Cross-tabs are an important tool for exploring whether the
same small subgroup is driving the key results of estimation. Remember the 13
educated women in the dummy model (Feb 25 notes).
| decade
revcat | 0 1 2 | Total
-----------+---------------------------------+----------
low | 1497.25 934.16667 . | 1159.4
| 188.20977 275.87274 . | 372.3422
| 4 6 0 | 10
-----------+---------------------------------+----------
med | 2205.75 1587.6667 586 | 1771.5
| 439.05989 850.62408 0 | 782.53526
| 4 3 1 | 8
-----------+---------------------------------+----------
high | 3352 3096 . | 3266.6667
| 335.16861 0 . | 279.31046
| 2 1 0 | 3
-----------+---------------------------------+----------
Total | 2151.6 1346.4 586 | 1693.619
| 774.83134 822.12451 0 | 894.87896
| 10 10 1 | 21
• Inspect test statistics for inter-group
differences in means and variances
• Categories of low, medium, and high
revenue levels are not statistically
significantly disproportionately
distributed in any one decade (so one
period alone will probably not be driving
statistically significant results for revenue
effects), with the caveat that our
categories need to be meaningful –
perhaps coded at natural breaks in the
data, quartiles, etc. However, outliers
that do not fall in subgroups will not
show up with this method. It is still
useful to consider possible clusters of
data that will influence our model.
tab revcat decade, column row chi2
lrchi2 V exact gamma taub
decade
revcat | 0 1 2 | Total
-----------+---------------------------------+----------
low | 4 6 0 | 10
| 40.00 60.00 0.00 | 100.00
| 40.00 60.00 0.00 | 47.62
-----------+---------------------------------+----------
med | 4 3 1 | 8
| 50.00 37.50 12.50 | 100.00
| 40.00 30.00 100.00 | 38.10
-----------+---------------------------------+----------
high | 2 1 0 | 3
| 66.67 33.33 0.00 | 100.00
| 20.00 10.00 0.00 | 14.29
-----------+---------------------------------+----------
Total | 10 10 1 | 21
| 47.62 47.62 4.76 | 100.00
| 100.00 100.00 100.00 | 100.00
Pearson chi2(4) = 2.6075 Pr = 0.625
likelihood-ratio chi2(4) = 2.8982 Pr = 0.575
Cramér's V = 0.2492
gamma = -0.2000 ASE = 0.327
Kendall's tau-b = -0.1183 ASE = 0.197
Fisher's exact = 0.645
Cross-Tabulations (Contingency Tables)
II. Post-Estimation Diagnostics: OLS
Estimator is a (Sensitive) Mean
• The sample mean is a least squares estimator of
the location of the center of the data, but the
mean is not a resistant estimator in that it is
sensitive to the presence of outliers in the
sample. That is, changing a small part of the data
can change the value of the estimator
substantially, leading us astray.
• This is particularly problematic if we are unsure
about the actual shape of the population
distribution from which our data are drawn.
II. Post-Estimation Diagnostics
Extreme Points (start here, since extreme points will
affect formal testing procedures). Also called case
diagnostics or case-deletion diagnostics.
• In multivariate analysis, extreme data points create
more complex problems than in univariate analysis.
– A UNIVARIATE outlier is simply a value of x
distant from the mean x̄ (unconditionally unusual, but
may not be a REGRESSION outlier).
– An outlier in simple bivariate regression is an
observation whose dependent variable value is
UNUSUAL GIVEN the value of the independent
variable (conditionally unusual).
II. Bivariate Regression Extreme Points
• An observation with an atypical or anomalous X value has
LEVERAGE. If it is otherwise in trend, it affects model summary
statistics (e.g. R2, standard errors), but has little effect on the
regression coefficient estimates.
• An INFLUENCE point has an unusual Y value (AND maybe an
extreme x value). It is characterized by having a noticeable
impact on the estimated regression coefficients (i.e., if
removing it from the sample would markedly change the
slope and direction of the regression line).
• A RESIDUAL OUTLIER has large VERTICAL distance of a data
point from the regression line. IMPORTANT NOTE: An outlier
in X or Y is NOT necessarily associated with a large residual,
and vice versa.
II.A.1.a Extreme Observations in Y
[Figures: scatterplots illustrating observations extreme in Y]
II.A.1.b Extreme Observations in X
NOTE: These examples reveal that it is most typically observations extreme in BOTH
x AND y that have influence (second graph on these two slides) but it is not always the case.
Summary Table: Model Effects for Outliers, Leverage, Influence

Type of Extreme Value          | Y direction | X direction | LEVERAGE                     | INFLUENCE                    | Effect on intercept / coefficients / uncertainty?*
Outlier in y (y_i far from ȳ)  | Unusual     | In trend    | No                           | No                           | Yes / No / Yes
                               | Unusual     | Unusual     | Yes                          | Yes                          | Yes / Large / Yes
Outlier in x (x_i far from x̄)  | In trend    | Unusual     | Yes                          | No                           | No / No / Yes (tends to reduce uncertainty)
                               | Unusual     | Unusual     | Yes                          | Yes                          | Yes / Large / Yes
Outlier in residual            | Yes         | Possibly    | Possible but not necessarily | Possible but not necessarily | No / No / Yes

*Note that influence can refer to several things: (1) effect on the y-intercept; (2) on a particular
coefficient; (3) on all coefficients; (4) on estimated standard errors. Thus we have a variety of
procedures to evaluate influence.
1. OUTLIERS are not necessarily influential
2. BUT they can be, depending on leverage
3. Yet high LEVERAGE points are not always influential
4. And INFLUENTIAL points are not necessarily outliers

PLOT | OUTLIER | LEVERAGE | INFLUENCE
  1  |   Yes   |    No    |    No
  2  |   Yes   |   Yes    |    Yes
  3  |   No    |   Yes    |    No
  4  |   No    |   Yes    |    Yes
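The distinction between leverage and influence can be checked numerically. As a minimal Python sketch (made-up data, not the Tanzania dataset): a point extreme in x but sitting on the trend barely moves the OLS slope, while the same extreme x paired with an off-trend y drags the slope around, i.e., is influential.

```python
def ols(xs, ys):
    """Simple-regression OLS: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return b, my - b * mx

base_x = [1, 2, 3, 4, 5]
base_y = [1, 2, 3, 4, 5]               # perfect y = x trend, slope 1

# High-leverage point that is IN trend: slope unchanged (leverage, no influence)
b_in, _ = ols(base_x + [20], base_y + [20])

# High-leverage point that is OFF trend: slope changes drastically (influence)
b_off, _ = ols(base_x + [20], base_y + [0])

print(b_in, round(b_off, 3))   # slope stays at 1.0 vs. slope flips sign
```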
II.A Multivariate Extreme Points
• Influence in multivariate regression results from a
particular combination of values on all variables
in the regression, not necessarily just from
unusual values on one or two of the variables,
but the concepts from the bivariate case apply.
• When there are two or more explanatory
variables X, scatterplots may not reveal
multivariate outliers, which are separated from
the centroid of all the Xs but do not appear in
bivariate plots of any two of them.
Residual Analysis: A Caution
• Recall that residuals e are just an estimate of an unobservable
vector with given distributional properties. Assessing the
appropriateness of the model for a given problem may entail
the use of the residuals in the absence of ε, but since e
is by construction orthogonal to the regressors (Cov(X,e)=0)
with E(e)=0, one cannot use the residuals to test these
assumptions of the CLRM model.

  Sample:     e  (residuals; estimated)
  Population: ε  (error / disturbance term / stochastic component; the unobserved parameter we try to estimate)

The difference between these means that you are never totally confident that e is a good estimate
of ε: if you meet all assumptions of the CLRM, then e is an unbiased, efficient, and consistent
estimate of ε.
II.A.1 The “Hat” Matrix (Least Squares
Projection Matrix / Fitted Value Maker)
• DeNardo calls it P (because it is the projection matrix for the
predictor space / least squares projection matrix; see
http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_projection_matrix.htm for a lovely geometric
explanation); Rob calls it N ("fitted value maker"), Cook calls it V, and
Belsley calls it H. I use H since most of the books on diagnostics seem to
use H.
• The hat matrix is

      H = X(X'X)^(-1) X'

• Since b = (X'X)^(-1) X'y and, by definition, the vector of
fitted values is ŷ = Xb, it follows that

      ŷ = Hy

• The individual diagonal elements h_1, h_2, ..., h_i, ..., h_n of H
can thus be related to the distance between each row of
explanatory-variable values x_i and the row vector of
explanatory-variable means x̄, where x_i is the ith row of the matrix X.
II.A.1 Hat Matrix, cont.

      H = X(X'X)^(-1) X'             (the matrix)
      h_ii = x_i'(X'X)^(-1) x_i      (diagonal elements)

which equal ∂ŷ_i/∂y_i: the effect of the ith element
on its own predicted value. The off-diagonal elements are
h_ij = x_i'(X'X)^(-1) x_j.

In scalar form, the hat (leverage) for the ith observation
(note the adjustment for the number of observations: as n
grows larger, the individual leverage of any one observation
diminishes) is

      h_i = 1/n + (x_i - x̄)² / Σ_j (x_j - x̄)²

h serves as a measure of leverage of the ith data point,
because its numerator is the squared distance of the ith
data point from its mean in the X direction, while its
denominator is a measure of the overall variability of the
data points along the X-axis. It is therefore the distance
of the data point in relation to the overall variation in
the X direction.
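A quick Python check of the scalar leverage formula above (toy data; simple regression with an intercept, so the hat values should sum to the number of parameters, k + 1 = 2, and the extreme x value should collect nearly all the leverage):

```python
def hat_values(xs):
    """h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2 (simple regression)."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return [1 / n + (x - xbar) ** 2 / sxx for x in xs]

h = hat_values([1, 2, 3, 4, 100])   # one wildly extreme x value
print([round(v, 3) for v in h])
print(round(sum(h), 6))             # sums to k + 1 = 2
```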
II.A.1 Hat Matrix, cont.
• Because H is a projection matrix,

      1/n ≤ h_ii ≤ 1

  (for proof see Belsley et al., 1980, Appendix 2A).
• It is possible to express the fitted values in
  terms of the observed values (scalar form):

      ŷ_j = h_1j·y_1 + h_2j·y_2 + h_3j·y_3 + ... + h_jj·y_j + ... + h_nj·y_n = Σ_i h_ij·y_i

• h_ij therefore captures the extent to which y_i is
  close to the fitted values. If it is large, then the
  ith observation had a substantial impact on
  the jth fitted value. The hat value summarizes
  the potential influence of y_i on ALL the fitted
  values.
II.A.1.a Hat and the Residuals
• Since e = y - ŷ, then

      e = (I - H)y

  where I is the identity matrix. Substituting Xβ + ε for y
  (and using (I - H)X = 0, since HX = X),

      e = (I - H)(Xβ + ε) = (I - H)ε

  or, in scalar form,

      e_i = ε_i - Σ_j h_ij·ε_j,   for i = 1, 2, ..., n

• The relationship between the residual and the true
  stochastic component therefore depends on H. If the h_ij s are
  sufficiently small, e is a reasonable estimate of ε.
• Note the interesting situation in which a better
  "fit", if based on extreme values, may signal an
  underestimate of the randomness in the world.
II.A.1.a Hat and the Residuals, cont.
• The variance of e is also related to H (see DeNardo):

      Var(e_i) = σ²(1 - h_ii)

• For high-leverage cases, in which h approaches its
  upper bound of one, the residual value will tend to zero.
• This means that the residuals will not be a
  reliable means of detecting influential points, so
  we need to transform them, leading us to the
  subject of studentized (jackknifed) residuals:
II.A.1.a Hat / Studentized Residuals
PURPOSE: Detection of Multivariate Outliers
• Adjust residuals to make them conspicuous so they are reliable for detecting
  leverage and influential points.
• DeNardo's "internally studentized residual" is called the "standardized" or
  "normalized" residual in other contexts; it can disguise outliers.
• The "externally" studentized residual uses the standard error of the regression
  (residual sum of squares/(n-k) = e'e/(n-k)) calculated after deleting the ith
  observation, which allows solving for h, the measure of leverage:

      r*_i = e_i / ( s_(i) · sqrt(1 - h_i) )

  where s_(i) is the standard error of the estimate/regression calculated after
  deleting the ith observation.
• These residuals are distributed as Student's t, with n-k d.f., so "a test" of each
  outlier can be made, with each studentized residual representing a t-value for its
  observation.
• This is an application of the jackknife method, whereby observations are omitted
  and estimation iterated to arrive at the studentized residuals (just one of many
  applications of the jackknife). Also called the "jackknife residual".
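The formula above can be brute-forced in Python by literally deleting each case and refitting (a sketch on made-up data; the names are illustrative, and the Stata equivalent is the `predict estu, rstudent` command used later in these notes):

```python
import math

def ols_fit(xs, ys):
    """Simple-regression OLS: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def externally_studentized(xs, ys):
    """r*_i = e_i / (s_(i) * sqrt(1 - h_i)), with s_(i) the regression
    standard error re-estimated after deleting observation i."""
    n = len(xs)
    a, b = ols_fit(xs, ys)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    e = [y - (a + b * x) for x, y in zip(xs, ys)]
    h = [1 / n + (x - mx) ** 2 / sxx for x in xs]
    out = []
    for i in range(n):
        xd, yd = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        ad, bd = ols_fit(xd, yd)
        rss = sum((y - (ad + bd * x)) ** 2 for x, y in zip(xd, yd))
        s_i = math.sqrt(rss / (len(xd) - 2))   # n-1 obs, 2 parameters
        out.append(e[i] / (s_i * math.sqrt(1 - h[i])))
    return out

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 12.0]   # last y is far off trend
r = externally_studentized(xs, ys)
print([round(v, 2) for v in r])   # the final case stands out sharply
```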
II.A.1.a. continued
Steps for Assessing Studentized Residuals
1. Studentized residuals correspond to the t-statistic we would obtain by including
in the regression a dummy predictor coded 1 for that observation and 0 for all
others. One can then test the null hypothesis that the coefficient δ equals zero (H0:
δ=0) in:

      E(y_i) = β_0 + β_1·x_i1 + β_2·x_i2 + ... + β_(k-1)·x_i,k-1 + δ·I_i

This tests whether case i causes a shift in the regression intercept.
2. We set a significance level α for our overall Type I error risk: the probability
of rejecting the null when it is in fact true. According to the Bonferroni
inequality [Pr(a set of events occurring) cannot exceed the sum of the individual
probabilities of the events], the probability that at least one of the cases is a
statistically significant outlier (when the null hypothesis is actually true) cannot
exceed nα, so...
3. We want to run n tests (one for each case), testing each residual at the α/n level
(let's call this α*). Suppose we set α =.05 and we have 21 observations. To test
whether ANY case in a sample of n=21 is a significant outlier at level α, we
check whether the maximum studentized residual max|r_i| is significant at α* =
.05/21=.0024 (given a t-distribution with df = n-K-1; 21-2-1 =19). Most t-tables
do not cover such small significance levels, so a computer is required.
Tanzania Revenue Data
regress rexp rev (Expenditure as function of Revenue)
Source | SS df MS Number of obs = 21
-------------+------------------------------ F( 1, 19) = 55.16
Model | 10034268 1 10034268 Prob > F = 0.0000
Residual | 3456450.93 19 181918.47 R-squared = 0.7438
-------------+------------------------------ Adj R-squared = 0.7303
Total | 13490719 20 674535.948 Root MSE = 426.52
------------------------------------------------------------------------------
rexp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
rev | .8668668 .1167207 7.43 0.000 .6225675 1.111166
_cons | 798.038 445.0211 1.79 0.089 -133.4019 1729.478
------------------------------------------------------------------------------
predict resid, resid (creates variable with ORDINARY RESIDUALS)
predict estu, rstudent (STUDENTIZED RESIDUALS)
EXAMPLE Data: Tanzania.dta (Mukherjee et al.)
REV: Gov Recurrent Revenue   REXP: Gov Recurrent Expenditure
DEXP: Gov Development Expenditure   Year (T) 1970-1990
4. Identify the largest and smallest residuals. As a rule of thumb, we should pay attention to
residuals with absolute values greater than 2, be worried about those with values greater
than 2.5, and most concerned about those exceeding 3. There are a variety of ways to
identify/inspect these residuals. See
http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm for more
options. The fastest in a small dataset is to list the observations with a studentized residual
exceeding + or -2. We see here that 1980 is an outlier. We can use Stata to carry out the
Bonferroni Outlier Test as follows:
list if abs(estu)>2

      | rev  | rexp | dexp | t  | resid    | estu     | decade | revcat |
  11. | 4506 | 5627 | 3096 | 80 | 922.8602 | 2.590934 | 1      | high   |
The maximum studentized residual of 2.59 is our t-value and n=21. For 1980 to be a
significant outlier (cause a significant shift in the intercept) at α =.05,
t=2.59 must be significant at .05/21:

display .05/21
.00238095

display 2*ttail(19, 2.59)
.01796427

The obtained P-value (P=.01796) is NOT below α/n=.00238, so 1980 is NOT a significant outlier at
α =.05.
II.A.1.a. continued-Assessing Studentized Residuals
Bonferroni Outlier Test (Test for outlier influence on y-intercept)
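The same Bonferroni check can be sketched in plain Python without Stata's `ttail`: the two-sided t tail probability is obtained here by numerically integrating the t density (the integration bound and step count are arbitrary choices, but comfortably accurate for these values):

```python
import math

def t_density(x, df):
    """Student's t probability density function."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, upper=60.0, steps=20000):
    """2 * P(T > |t|) via Simpson's rule on [|t|, upper]."""
    a = abs(t)
    h = (upper - a) / steps
    s = t_density(a, df) + t_density(upper, df)
    for i in range(1, steps):
        s += t_density(a + i * h, df) * (4 if i % 2 else 2)
    return 2 * s * h / 3

p = two_sided_p(2.59, 19)      # mirrors: display 2*ttail(19, 2.59)
alpha_star = 0.05 / 21         # Bonferroni-adjusted level
print(round(p, 5), round(alpha_star, 5), p < alpha_star)
```

The computed p is approximately .018, above the Bonferroni-adjusted level of .00238, matching the conclusion above that 1980 is not a significant outlier.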
II.A.1.b Hat Matrix and Leverage: Outlier influence on fitted
values (recall that fit is overly dependent on these outliers)
• Note that if h_ii = 1, then ŷ_i = y_i; that is, e_i = 0, and the ith case would be fit exactly.
• This means that, if no observations are exact replicates, one parameter is dedicated
  to one data point, which would make it impossible to obtain the determinant needed
  to invert X'X and obtain OLS estimates.
• This rarely occurs, so the value of h_ii will rarely reach its upper bound of 1.
• The MAGNITUDE of h_ii depends on the relationship

      1/n ≤ h_ii ≤ 1/c

  where c is the number of times the ith row x_i' of X is replicated (generally, then, h
  will range from 1/n to 1; in survey data, however, it is possible to have duplicate responses
  for multiple respondents, so you can check this in Stata with the
  "duplicates" command).
• The higher the value of h_ii, the higher the leverage of the ith data point.
• The average hat value is E(h) = (k+1)/n, where k is the number of regressors. We
  therefore proceed by looking at the maximum hat value. A hat value has leverage if it
  is more than twice the mean hat value.
• Huber (1981) suggests another rule of thumb for interpreting h_ii, though it might
  overlook more than one large hat value:

      max(h_i) ≤ .2          little to worry about
      .2 < max(h_i) ≤ .5     risky
      .5 < max(h_i)          too much leverage
[Figure: scatter of leverage (h) against rev with year labels; 1978 and 1979 stand out well above the rest]
predict h, hat
summarize h
Variable | Obs Mean Std. Dev. Min Max
-------------+-------------------------------------------------
h | 21 .0952381 .0638661 .0476762 .2652265
display 2/21
.0952381
list if h>2*.0952381

      | rev  | rexp | dexp | t  | resid     | estu      | h        |
   9. | 5424 | 5058 | 3115 | 78 | -441.9235 | -1.222458 | .2629347 |
  10. | 5433 | 5571 | 3589 | 79 |  63.27473 |  .1685843 | .2652265 |
scatter h rev, mlabel(t)
• Use predict command to create the
hat values for each observation.
• Summarize or calculate to get the
mean.
• List observations whose h values
exceed 2 times E(h). We see that 1978
and 1979 have leverage.
• We can graph the hat values against
the values of the independent
variable(s). The leverage points are
well above 0.2 and more than twice
their mean. Recall that we identified
from EDA that something might be
different for 1978 and 1979. This
means that too much of the sample’s
information about the X-Y
relationship may come from a single
case.
II.A.1.b Hat Matrix and Leverage: Outlier influence of X
values on fitted values, continued
II.A.1.c The DFBETA Statistic (depends on X and Y values;
measures how much a case i influences the coefficients; not a
formal test statistic with a hypothesis test)

The regression coefficient on X_k is b_k. Let b_k(i) represent
the same coefficient when the ith case is deleted. Deleting the
ith case therefore changes the coefficient on X_k by b_k - b_k(i).
We can express this change in standard errors:

      DFBETAS_ik = ( b_k - b_k(i) ) / ( s_e(i) / sqrt(RSS_k) )

where s_e(i) represents the residual standard deviation with the
ith case deleted, and RSS_k is the residual sum of squares from
the auxiliary regression of X_k on all the other X variables
(without deleting the ith case). The denominator therefore
modifies the usual estimate of the standard error of the
coefficient b_k if the ith case is deleted. DFBETA can also be
expressed in terms of the hat statistic (see DeNardo).

• Interpreting the direction of influence with DFBETAS:

      If DFBETAS_ik > 0, case i increases the magnitude of b_k
      If DFBETAS_ik < 0, case i decreases the magnitude of b_k

• The size of influence: DFBETAS tells us "By how many standard errors
  does the coefficient change if we drop case i?"
• A DFBETAS of +1.34, for example, means that if case i were deleted,
  the coefficient for regressor k would be 1.34 standard errors lower.
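A brute-force Python sketch of DFBETAS for the slope in a simple regression (made-up data, names illustrative; with a single regressor, the auxiliary-regression RSS_k reduces to Sxx, the sum of squared x deviations):

```python
import math

def ols(xs, ys):
    """Simple-regression OLS: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def dfbetas_slope(xs, ys):
    """DFBETAS_i = (b - b_(i)) / (s_(i) / sqrt(Sxx)) by explicit case deletion."""
    n = len(xs)
    _, b = ols(xs, ys)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    out = []
    for i in range(n):
        xd, yd = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        ad, bd = ols(xd, yd)
        rss = sum((y - (ad + bd * x)) ** 2 for x, y in zip(xd, yd))
        s_i = math.sqrt(rss / (len(xd) - 2))
        out.append((b - bd) / (s_i / math.sqrt(sxx)))
    return out

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 2.1, 2.9, 4.2, 5.0, 9.5]   # last case pulls the slope upward
d = dfbetas_slope(xs, ys)
print([round(v, 2) for v in d])       # large positive DFBETAS for the last case
```

The Stata `dfbeta` command on the next slide does the equivalent computation for every regressor at once.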
[Figure: histogram of DFBETA for rev with reference lines at ±.4364]
dfbeta
_dfbeta_1: dfbeta(rev)
list _dfbeta_1

    | _dfbeta_1 |
1. | .1004588 |
2. | .0401034 |
3. | .0582458 |
4. | -.0044781 |
5. | .1422126 |
6. | -.1596971 |
7. | -.1744603 |
8. | -.1502057 |
9. | -.6607218 |
10. | .0917439 |
11. | .5789033 |
12. | .1527179 |
13. | -.059624 |
14. | -.0800557 |
15. | -.0528694 |
16. | -.0164145 |
17. | .1149248 |
18. | .0945607 |
19. | -.064976 |
20. | -.0342036 |
21. | .0475227 |
display 2/sqrt(21)
.43643578
histogram _dfbeta_1, bin(10) frequency xline(-.4364 .4364) xlabel(#10)
(bin=10, start=-.66072184, width=.12396252)
• Stata’s dfbeta command creates the
DFBETA statistic for each of the
regressors in the model, then list for all
of our observations. A rule of thumb for
large datasets where listing and
inspecting all of the DFBETA values
would be difficult is to inspect all
DFBETAs in excess of 2/sqrt(n)
• Since DFBETAs are obtained by case-
wise deletion, they do not account for
situations where a number of
observations may cluster together,
jointly pulling the regression line in a
direction, but not individually showing
up as influential. You should not rely
solely on DFBETA, then, to test for
influence. A histogram of DFBETA can
reveal groups of influential cases (the
one displayed at left uses reference
lines for + or – 2/sqrt(n) = .4364). Two
observations fall outside the safe range.
39
II.A.1.c The DFBETA Statistic
(c) Megan Reif
scatter _dfbeta_1 t, ylabel(-1(.5)1) yline(.4364 -.4364) mlabel(t)
list t _dfbeta_1 rev rexp if t==78 | t==80
+------------------------------+
| t _dfbeta_1 rev rexp |
|------------------------------|
9. | 78 -.6607218 5424 5058 |
11. | 80 .5789033 4506 5627 |
• Now that we know there are two potential
observations to worry about, it is useful to
use another plot to identify which they
are (this is most useful for multivariate
regression – it is rather obvious for the
single regressor case).
• We see that 1978 and 1980 are influential.
• Note that 1978 and 1979 had leverage,
but only 1978 is also influential. 1980 is
Influential but did not have leverage
(review Slide 23).
• 1978 decreases the coefficient on revenue
by -.66 standard errors and 1980 increases
it by .58 standard errors.
40
II.A.1.c The DFBETA Statistic
[Scatter plot of _dfbeta_1 against t (years 70-90), points labeled by year, with reference lines at ±.4364; 78 and 80 fall outside them]
(c) Megan Reif
II.A.1.d Influence of a Case on Model as a Whole (Cook’s
Distance and DFFITS Statistics)
• Returning to the Hat statistic, if
we want to know the effect of
case i on the predicted values, we
can use the DFFITS statistic,
which does not depend on the
coordinate system used to form
the regression model.
• Rule of thumb cutoff values for
small to medium sized data sets
are to inspect observations with
DFFITS that exceed the following
values (and to run the regression
without those observations to
see by how much the coefficient
estimates change):
41
DFFIT_i = ŷ_i − ŷ_(i) = x_i[b − b(i)] = h_i e_i / (1 − h_i)

To scale the measure, one can divide by the standard deviation of the fit, where s(i) is our estimate of the standard deviation with observation i deleted:

DFFITS_i = x_i[b − b(i)] / (s(i)·sqrt(h_i))

Since r_i* = e_i / (s(i)·sqrt(1 − h_i)), then DFFITS can be written as

DFFITS_i = sqrt(h_i / (1 − h_i)) · r_i*

This is intuitive in that the first term increases the greater the hat statistic (and therefore the leverage) for case i, and the second term increases the larger the studentized residual (outlier).

Then we want to know what the scaled changes in fit for the model are for the values other than the ith row:

DFFITS_j = x_j[b − b(i)] / (s(i)·sqrt(h_j)) = h_ij e_i / [s(i)·sqrt(h_j)·(1 − h_i)]

The absolute value of this change in fit for the remaining cases will be less than the absolute value for the change attributed to the fitted value ŷ_i when the ith value is deleted. DFFITS_i is the number of standard errors that the fitted value for case i changes if the ith observation is deleted from the data.

Small to medium datasets: |DFFITS_i| > 1
Large datasets: |DFFITS_i| > 2·sqrt((k+1)/n)
(c) Megan Reif
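The derivation above can be verified numerically. This NumPy sketch (simulated data of my own, not the Tanzania series) checks that the closed form r*·sqrt(h/(1−h)) agrees with the deletion definition (ŷ_i − ŷ_(i)) / (s(i)·sqrt(h_i)):

```python
# Sketch: two equivalent ways to compute DFFITS.
import numpy as np

rng = np.random.default_rng(1)
n, k = 21, 1                                  # k regressors + intercept
x = rng.normal(size=n)
y = 3 + 0.5 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)  # hat values
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b

dffits_closed = np.empty(n)
dffits_deleted = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    bi = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    ri = y[keep] - X[keep] @ bi
    s_i = np.sqrt(ri @ ri / (n - 1 - (k + 1)))              # s(i)
    rstar = e[i] / (s_i * np.sqrt(1 - h[i]))                # externally studentized residual
    dffits_closed[i] = rstar * np.sqrt(h[i] / (1 - h[i]))
    dffits_deleted[i] = (X[i] @ b - X[i] @ bi) / (s_i * np.sqrt(h[i]))

print(np.allclose(dffits_closed, dffits_deleted))
```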
DFFITS and Hat vs. DFFIT PLOT
(from Tanzania Model)
predict dffit, dfits
list t rev rexp dffit
| t rev rexp dffit |
|------------------------------|
1. | 70 3243 3304 -.1932092 |
2. | 71 3497 3569 -.143909 |
3. | 72 3426 3480 -.1642726 |
4. | 73 3756 3809 -.1293692 |
5. | 74 3409 3122 -.3824875 |
|------------------------------|
6. | 75 4169 3891 -.3301978 |
7. | 76 4482 4352 -.2539932 |
8. | 77 4498 4417 -.216292 |
9. | 78 5424 5058 -.7301378 |
10. | 79 5433 5571 .1012859 |
|------------------------------|
11. | 80 4506 5627 .8291757 |
12. | 81 4112 4932 .3522714 |
13. | 82 3603 4594 .3838607 |
14. | 83 3470 4261 .2597122 |
15. | 84 2972 3476 .0768232 |
|------------------------------|
16. | 85 2746 3202 .0211414 |
17. | 86 2623 2929 -.1417075 |
18. | 87 2549 2899 -.1141464 |
19. | 88 2906 3431 .0905055 |
20. | 89 3544 4149 .1518263 |
|------------------------------|
21. | 90 3928 4558 .1956947 |
+------------------------------+
scatter h dffit, mlabel(t)
• No observation has a DFFITS statistic larger than 1 in this
small dataset. The largest is .8291757, for 1980.
• Note that as a function of hat and the studentized
residuals, DFFITS is a kind of measure of
OUTLIERNESS*LEVERAGE
• A graphical alternative to the influence measures is to
plot hat against the studentized residuals to look
for observations for which both are big (only 1979
approaches this criterion, but is well under the DFFITS
cutoff):
42
DFFITS_i = r_i* · sqrt(h_i / (1 − h_i))
[Scatter plot of leverage (h) against DFFITS, points labeled by year 70-90]
(c) Megan Reif
• Cook’s D is similar to the DFFITS statistic, but DFFITS gives relatively
more weight to leverage points, since it shows the effect on an
observation’s fitted value when that particular observation is dropped.
• Cook’s Distance “tests the hypothesis” that the true slope coefficients are
equal in the aggregate to the slope coefficients estimated with
observation i deleted (Ho: β = b(i)). It is more a rule of thumb that produces
a measure of distance independent of how the variables are measured
than a formal F-test: a point is influential if Di exceeds the median of the F
distribution with k parameters [(Fk, n-k)(.5)]
• Observations with larger D values than the rest of the data are
those that have unusual leverage.
• While there are numerical rules for assessing Cook’s D authors
differ in their advice.
• Some argue that it is best to graph Cook’s D values to see whether
any one or two points have a much bigger Di than the others.
43
II.A.1.d Influence of a Case on Model as a Whole (Cook’s Distance)
Since r_i* = e_i / (s(i)·sqrt(1 − h_i)), Cook's D,

D_i = h_i e_i² / [k s² (1 − h_i)²],

can be rewritten as

D_i = (r_i*² / k) · h_i / (1 − h_i)
(c) Megan Reif
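A quick numerical check of the rewritten form, on made-up data (note: this sketch divides by p = k + 1 parameters, a common textbook convention, where the slides write k; the identity holds either way as long as the same divisor is used on both sides):

```python
# Sketch: Cook's D from the closed form vs. the definitional deletion form
# (b - b(i))' X'X (b - b(i)) / (p * s^2).
import numpy as np

rng = np.random.default_rng(2)
n, p = 21, 2                       # p = k + 1 (intercept + one regressor)
x = rng.normal(size=n)
y = 3 + 0.5 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

XtX = X.T @ X
b = np.linalg.solve(XtX, X.T @ y)
e = y - X @ b
s2 = e @ e / (n - p)
h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(XtX), X)

d_closed = e**2 * h / (p * s2 * (1 - h)**2)   # closed form

d_def = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    bi = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    diff = b - bi
    d_def[i] = diff @ XtX @ diff / (p * s2)   # definitional form

print(np.allclose(d_closed, d_def))
```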
Cook’s D, Continued
predict cooksd, cooksd
• We can then look up the median value of the F-distribution with k+1
numerator and n-k-1 denominator degrees of freedom:
display invFtail(2,19, .5)
.71906057
For the Tanzania data, no observation has a Cook's D this large:
list t rev rexp if cooksd>.71906057
• Some authors suggest looking at the five most influential, which can be
done in Stata by (NOTE: last term is a lowercase “L” for last observation.).
list t rev rexp cooksd dffit _dfbeta_1 in -5/l
| t rev rexp cooksd dffit _dfbeta_1 |
|-----------------------------------------------------|
17. | 81 4112 4932 .0589684 .3522714 .1527179 |
18. | 82 3603 4594 .0670656 .3838607 -.059624 |
19. | 74 3409 3122 .067792 -.3824875 .1422126 |
20. | 78 5424 5058 .2597905 -.7301378 -.6607218 |
21. | 80 4506 5627 .2642971 .8291757 .5789033 |
+-----------------------------------------------------+
44
(c) Megan Reif
Proportional Plots for Influence Statistics
• It is useful to graph Cook’s D and DFFITS
with Residual vs. Fitted Plots, with
symbols proportional to the size of
Cook’s D. First we have to predict the
fitted values:
predict yhat
(option xb assumed; fitted values)
• Then weight the symbols by the value of
the influence statistic of interest:
graph twoway scatter resid yhat[aweight =
cooksd], msymbol(Oh) yline(0)
saving(Dprop) NOTE: Prop Plot with weights
disallows labeling, so I create two versions, one with
labels, one with proportions, and use ppt to overlay.
graph twoway scatter resid yhat[aweight =
cooksd], mlabel(t) yline(0)
saving(Dlabe
• We can also plot the studentized residuals vs. HAT
(leverage, not the fitted values), with symbols proportional
to Cook’s D, to look at outlierness, leverage, and
influence at the same time. Same command as
above except the variables are: estu h (or whatever you
have named your studentized residuals and hat)
45
[Two proportional-symbol plots: residuals vs. fitted values, and studentized residuals vs. leverage, symbol size weighted by Cook's D]
(c) Megan Reif
• Recall that by increasing the variance of one or more Xs, a high-
leverage observation will decrease the standard error of the
coefficient(s), even if it does not influence the magnitude. Though
this may be considered beneficial, it may also exaggerate our
confidence in our estimate, especially if we don’t know if the high-
leverage outlier is representative of the population distribution, or
due entirely to stochastic factors or error (sampling, coding, etc. –
that is, a true outlier).
• Using the COVRATIO statistic, we can examine the impact of
deleting each observation in turn on the size of the joint-confidence
region (in n-space) for β, since the size of this region is equivalent to
the length of the confidence interval for an individual coefficient,
which is proportional to its standard error. The squared length of
the CI is therefore proportional to the sampling variance for b. The
squared size of a joint confidence region is proportional to the
variance for a set of coefficients (“generalized variance”) (Fox 1991,
31; See Belsey et al. for the derivation, pp 22-24).
46
II.A.1.d Influence of a Case on Precision of the Estimates (COVRATIO)
COVRATIO_i = 1 / { [ (n − k − 2 + r_i*²) / (n − k − 1) ]^(k+1) · (1 − h_i) }
(c) Megan Reif
COVRATIO
• Look for values that differ substantially from 1.
• A small COVRATIO (below 1) means that the
generalized variance of the model would be SMALLER
without the ith observation (i is reducing precision of
estimates)
• A big COVRATIO (above 1) means the generalized
variance would be LARGER without the ith case, but if it is
a high-leverage point, it may be making us overly
confident in the precision of our estimated coefficients.
• Belsey et al. suggest that a COVRATIO should be
examined when:
47
|COVRATIO_i − 1| ≥ 3(k + 1) / n
(c) Megan Reif
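COVRATIO is literally a ratio of the determinants of the two coefficient-covariance estimates (with and without case i). This NumPy sketch (made-up data) checks the closed form against that definition and applies the Belsley screen:

```python
# Sketch: COVRATIO as det(s(i)^2 (X(i)'X(i))^-1) / det(s^2 (X'X)^-1),
# cross-checked against the closed form used in the slides.
import numpy as np

rng = np.random.default_rng(3)
n, p = 21, 2                       # p = k + 1 parameters
x = rng.normal(size=n)
y = 3 + 0.5 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
s2 = e @ e / (n - p)
h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)
base_det = np.linalg.det(s2 * np.linalg.inv(X.T @ X))

cov_def = np.empty(n)
cov_closed = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    Xi, yi = X[keep], y[keep]
    bi = np.linalg.lstsq(Xi, yi, rcond=None)[0]
    ri = yi - Xi @ bi
    s2_i = ri @ ri / (n - 1 - p)                       # s(i)^2
    cov_def[i] = np.linalg.det(s2_i * np.linalg.inv(Xi.T @ Xi)) / base_det
    t2 = e[i]**2 / (s2_i * (1 - h[i]))                 # r_i*^2
    cov_closed[i] = 1.0 / (((n - p - 1 + t2) / (n - p))**p * (1 - h[i]))

flag = np.abs(cov_def - 1) >= 3 * p / n                # Belsley screen
print(np.allclose(cov_def, cov_closed), flag.sum())
```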
COVRATIO example
48
[Two scatter plots, points labeled by year: leverage vs. COVRATIO, and DFFITS vs. COVRATIO]
predict covratio, covratio
list t covratio rev rexp if abs(covratio-1)>(3*3)/21
+-----------------------------+
| t covratio rev rexp |
|-----------------------------|
4. | 79 1.511605 5433 5571 |
+-----------------------------+
• We see that the COVRATIO for 1979 is large and therefore
has perhaps exaggerated our certainty.
• Plotting COVRATIO against hat reveals
that 1979 has leverage, but plotted
against DFFITs, we see it is not greater
than one. 1979 does not affect the
magnitude of our coefficient estimates,
but it may affect our hypothesis testing
and conclusions.
(c) Megan Reif
A Summary of Tests / Statistics for Extreme Values (note sample size dependence)

Studentized Residual — r_i* = e_i / (s(i)·sqrt(1 − h_i))
Use: outliers’ effect on intercept.
1. Critical values (higher than usual t-test) recommended for exploratory diagnosis.
2. Rules of thumb: |r_i*| > 2 pay attention; |r_i*| > 2.5 cause for worry; |r_i*| > 3 cause for greatest concern.

Hat Statistic (h) — h_i = 1/n + (x_i − x̄)² / Σ_j (x_j − x̄)²  (bivariate), or h_i = x_i (X'X)⁻¹ x_i'  (matrix form)
Use: leverage. Bounded by 1/n to 1 (assumes no replicates — check this in survey data). Higher value = higher leverage (depends on X values).
Rule of thumb: inspect h_i > 2(k+1)/n, or: max(h_i) ≤ .2 little to worry about; .2 < max(h_i) ≤ .5 risky; max(h_i) > .5 too much leverage.

DFBETA — DFBETAS_ik = (b_k − b_k(i)) / [ s_e(i) / sqrt(RSS_k) ]
Use: influence of a case on a particular coefficient. Calculate for each regressor. Rule of thumb: under 2/√n the point has no influence; over, the point is influential (depends on both X AND Y values). The value of DFBETAS is the number of s.e.s by which case i increases or decreases the coefficient for regressor k: if DFBETAS_ik > 0, case i increases the magnitude of b_k; if DFBETAS_ik < 0, it decreases it.

Cook’s Distance — D_i = (r_i*² / k) · h_i / (1 − h_i)
Use: influence of a case on the model. Measure of aggregate impact of the ith case on the group of regression coefficients as well as the group of fitted values (sometimes called forecasting effect). A point is influential if D_i exceeds the median of the F distribution with k parameters [(Fk, n-k)(.5)].

DFFITS — DFFITS_i = r_i* · sqrt(h_i / (1 − h_i))
Use: influence of a case on the model. The number of s.e.s by which the fitted value ŷ_i changes if the ith observation is deleted.
Small/med datasets: |DFFITS_i| > 1. Large datasets: |DFFITS_i| > 2·sqrt((k+1)/n).

COVRATIO — COVRATIO_i = 1 / { [ (n − k − 2 + r_i*²) / (n − k − 1) ]^(k+1) · (1 − h_i) }
Use: influence of a case on model standard errors. Measures how the precision of parameter estimates (generalized variance) changes with removal of the ith observation. Inspect if: |COVRATIO_i − 1| ≥ 3(k+1)/n.

Note: n is the sample size; k is the number of regressors; the subscript (i) (i.e., with parentheses) indicates an estimate from the sample omitting observation i. In each case
you should use the absolute value of the calculated statistic.
49
(c) Megan Reif
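For the hat statistic in the table, the bivariate closed form and the matrix form agree, and the stated bounds can be checked directly (NumPy sketch on toy data; recall the hat values also sum to the number of parameters):

```python
# Sketch: hat (leverage) values two ways, plus the 1/n-to-1 bounds.
import numpy as np

rng = np.random.default_rng(4)
n = 21
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])        # intercept + one regressor

# matrix form: h_i = x_i (X'X)^-1 x_i'
h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)

# bivariate closed form: h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
dev = x - x.mean()
h_biv = 1 / n + dev**2 / (dev**2).sum()

print(np.allclose(h, h_biv), h.min(), h.max(), h.sum())
```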
III. Plots to Identify Extreme Values
• EXAMPLE: Model from Mukherjee et al. of crude birth rate as a
function of:
– GNP per capita (logged, per Feb 18 Notes and general practice for such variables)
– IM: Infant mortality
– URBAN: percent of population urban
– HDI: human development index (From WB Human Dev Report 1993)
regress birthr lngnp hdi infmor urbanpop
Source | SS df MS Number of obs = 110
-------------+------------------------------ F( 4, 105) = 129.19
Model | 16552.2585 4 4138.06462 Prob > F = 0.0000
Residual | 3363.19755 105 32.0304528 R-squared = 0.8311
-------------+------------------------------ Adj R-squared = 0.8247
Total | 19915.456 109 182.710606 Root MSE = 5.6595
------------------------------------------------------------------------------
birthr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lngnp | -.2138487 .7960166 -0.27 0.789 -1.792203 1.364505
hdi | -24.50566 7.495152 -3.27 0.001 -39.36716 -9.644157
infmor | .111157 .0396176 2.81 0.006 .0326026 .1897115
urbanpop | .0111358 .0396627 0.28 0.779 -.0675081 .0897797
_cons | 39.56958 6.599771 6.00 0.000 26.48346 52.65571
------------------------------------------------------------------------------ 50
BIRTHrt = β0 + β1·lnGNPC + β2·HDI + β3·IM + β4·URBAN + ε
(c) Megan Reif
III.Plots to Identify Extreme Values:
A. Leverage vs. Normalized Squared Residual Plots
51
lvr2plot, mcolor(green) msize(vsmall) mlabel(cid) mlabcolor(black)
• This plot squares the NORMALIZED residuals (as standard deviation of each
residual from mean residual) to make them more conspicuous in the plot
(these are not the same as the externally studentized residuals).
• Remember that we are worried about observations with HIGH LEVERAGE but
LOW RESIDUALS, which indicates potential influence.
• What we would like to see: A ball of points evenly spread around the
intersection of the two means with no points discernibly far out in any
direction, and no leverage point above 0.2 with a low residual (to the left of
the mean normalized squared residual line).
• The vertical line represents the average squared normalized residual and the
horizontal line represents the average hat (leverage) value.
• Points with high leverage and low residuals will lie below the mean of the
squared residual (X), and above the mean of hat, which we should worry
about if hat is above .2, and really worry if it is above .5.
(c) Megan Reif
Leverage vs. Normalized Squared Residual Plots
52
Outlier but low leverage
High Leverage, High Residual
(might be reducing our standard
errors, but not above risky .2 level;
may want to look at COVRATIO)
Examine
further if
above
0.2 in this
region
Based on this plot, the potential for points with high influence on our
coefficients is low. There are no points that meet the high leverage, low
residual criteria, individually or as a group.
(c) Megan Reif
III.Plots to Identify Extreme Values:
B. Partial Regression Leverage Plots (also known as partial-regression leverage plots, adjusted
partial residuals plots, adjusted variable plots, individual coefficient plots, and added variable plots)
53
avplots, rlopts(lcolor(red)) mlabel(cid) mlabsize(tiny) msize(small)
• Each plot graphs the residuals from regressing y on all the X variables EXCEPT one, x_k (these residuals,
e(y|X_{-k}), shown on the y-axis), …plotted against… the ordinary residuals from regressing the
EXCLUDED x_k on the remaining independent variables (e(x_k|X_{-k}), on the x-axis).
• Helps to uncover observations exerting a disproportionate influence on the regression model by
showing how each coefficient has been influenced by particular observations.
• The regression slopes we see in the plots are the same as the original multiple regression coefficients for
the regression y=Xkb.
• What we would like to see: Scatter of points even around the line in each plot – the “noise” or size of
the cloud and spacing around the line need not concern us, but points very far from the rest should be
examined.
• Cause for concern: Recall the bivariate examples from the first part of the notes – you are looking for
values extreme in X (horizontal axis) with unusual/out-of-trend y-values. Pay most attention to the
theoretical variable(s) of interest and whether your conclusions and/or statistical significance would
change without the observation.
• Utility of the graph: DFBETA will give a much more precise assessment of the change in magnitude of
the coefficient in the absence of an influence point, but the graph can identify clusters of points that
might be jointly influential.
• Cautions: Pay attention to the SCALE of the axes reported in your computer output—a point may look
like an outlier but in reality be part of a cloud of points on which we are “zoomed in” rather close. If you
have doubts about the reliability of “eyeballing” the plot, you can re-run the regression leaving out the
influential point and comparing the change in slope, but be sure to use commands that will retain the
original scale of the output so you can compare the changes (see slide 54-55). Some books recommend
this plot for deciding to include or discard variables. BETTER TO BASE THIS DECISION ON THEORY
and the techniques discussed previously.
(c) Megan Reif
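The claim that each added-variable plot's slope equals the corresponding multiple-regression coefficient (the Frisch-Waugh-Lovell result) can be checked with a small NumPy sketch (simulated data of my own, not the birth-rate dataset):

```python
# Sketch: the added-variable plot's residual-on-residual slope equals
# the coefficient from the full multiple regression.
import numpy as np

rng = np.random.default_rng(5)
n = 110
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)          # correlated regressors
y = 2 - 1.5 * x1 + 0.8 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
b_full = np.linalg.lstsq(X, y, rcond=None)[0]

# residuals of y and of x2 after regressing each on the remaining columns
Z = np.column_stack([np.ones(n), x1])
ey = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
ex = x2 - Z @ np.linalg.lstsq(Z, x2, rcond=None)[0]

slope = (ex @ ey) / (ex @ ex)               # slope of the added-variable plot
print(np.isclose(slope, b_full[2]))
```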
[Four added-variable plots for the full model, points labeled by country code:
e(birthr|X) vs. e(lngnp|X): coef = -.21384874, se = .79601659, t = -.27
e(birthr|X) vs. e(hdi|X): coef = -24.505659, se = 7.4951522, t = -3.27
e(birthr|X) vs. e(infmor|X): coef = .11115703, se = .03961763, t = 2.81
e(birthr|X) vs. e(urbanpop|X): coef = .01113578, se = .03966274, t = .28]
Partial Regression Leverage Plots
54
(c) Megan Reif
[Added-variable plots repeated for comparison: the full-model panels above, alongside panels from the regression re-run without an influential observation.
Full model: lngnp coef = -.21384874 (se .79601659, t -.27); hdi coef = -24.505659 (se 7.4951522, t -3.27); infmor coef = .11115703 (se .03961763, t 2.81); urbanpop coef = .01113578 (se .03966274, t .28).
Re-estimated (Oman absent from the point labels): lngnp coef = -.91084738 (se .7842531, t -1.16); hdi coef = -22.281675 (se 7.1637796, t -3.11); infmor coef = .12391559 (se .0378934, t 3.27); urbanpop coef = .05291099 (se .03965302, t 1.33).
Re-estimated (Senegal absent from the point labels): lngnp coef = -.24870177 (se .80570718, t -.31); hdi coef = -23.625004 (se 7.9461429, t -2.97); infmor coef = .11481757 (se .04116964, t 2.79); urbanpop coef = .00935766 (se .04016079, t .23).]
55
[Four added-variable plots from the regression re-run without Senegal:
e(birthr|X) vs. e(lngnp|X): coef = -.24870177, se = .80570718, t = -.31
e(birthr|X) vs. e(hdi|X): coef = -23.625004, se = 7.9461429, t = -2.97
e(birthr|X) vs. e(infmor|X): coef = .11481757, se = .04116964, t = 2.79
e(birthr|X) vs. e(urbanpop|X): coef = .00935766, se = .04016079, t = .23]
Note that Senegal looked like a
possible outlier but it was of the
good sort and it wasn’t particularly
extreme relative to the scale of
values shown. The coefficient
changes little and the SE increases
slightly without it (indicating it was
contributing to the fit somewhat).
(c) Megan Reif
56
[Four added-variable plots from the regression re-run without Oman:
e(birthr|X) vs. e(lngnp|X): coef = -.91084738, se = .7842531, t = -1.16
e(birthr|X) vs. e(hdi|X): coef = -22.281675, se = 7.1637796, t = -3.11
e(birthr|X) vs. e(infmor|X): coef = .12391559, se = .0378934, t = 3.27
e(birthr|X) vs. e(urbanpop|X): coef = .05291099, se = .03965302, t = 1.33]
(c) Megan Reif
III.Plots to Identify Extreme Values:
C. Star Plots for outliers, leverage, and model generalized influence
display invFtail(5,105, .5)
.87591656
(use above command to get cut-off for Cook’s
D), then use with the other rules of
thumb to choose observations to display:
graph7 estu h cooksd if abs(estu) > 2 & h >
.2 & cooksd > .87591656, star
graph7 estu h cooksd, star label(cid)
select(88, 108)
NOTES: This is an old but working Stata 7 command; search
“graph7” for the help file. The variable (and thus the direction)
associated with each line depends on the order listed in the command.
57
• The scaling of a star chart is a function of all the stars. Selecting just a few to be displayed still
maintains the scaling based on all the observations and variables.
• In our example model, no observations meet all three criteria for influence, so instead I will tell
Stata to select some observations that include Senegal and Oman to show what the plot looks
like (do this by selecting observations 88-108)
• What we want to see: Dot OR (Line in outlier direction &/OR Line in leverage direction) and no
or tiny line in influence direction.
• Look for longer lines in influence direction (pointing Lower LFT), leverage (lower RT).
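As a cross-check on the Stata cut-off above, the same F-distribution median can be computed outside Stata. This is an illustrative Python sketch using scipy, not part of the original notes:

```python
# Replicate Stata's "display invFtail(5, 105, .5)": the median of an
# F(k+1, n-k-1) distribution, a common rough cut-off for Cook's D.
# Here k = 4 regressors plus the constant gives 5, and n = 110 gives 105.
from scipy.stats import f

cooks_cutoff = f.isf(0.5, 5, 105)  # inverse upper-tail, like invFtail
print(round(cooks_cutoff, 8))      # ~0.87591656, matching the slide
```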
(c) Megan Reif
III.C. Star Plots for DFBETAS (Individual Coefficient Influence)
display 5/sqrt(110)
.47673129
(use the above command to get the cut-off for DFBETAS)
graph7 _dfbeta_1 _dfbeta_2 _dfbeta_3 _dfbeta_4 if abs(_dfbeta_1) > .4767 | abs(_dfbeta_2) > .4767 | abs(_dfbeta_3) > .4767 | abs(_dfbeta_4) > .4767, star label(cid)
NOTE: we create new variables in the next commands so that absolute values are graphed (star plots require non-negative values), which means the star plot cannot tell us whether a point increases or decreases the coefficient.
gen dflngnp=abs(_dfbeta_1)
gen dfhdi=abs(_dfbeta_2)
gen dfinform=abs(_dfbeta_3)
gen dfurban=abs(_dfbeta_4)
graph7 dflngnp dfhdi dfinform dfurban, star label(cid) select(88, 108)
58
• The scaling of a star chart is a function of all the stars. Selecting just a few to be displayed still
maintains the scaling based on all the observations and variables.
• In our example model, only OMAN meets ANY of the criteria for influence, so let’s select some
observations to show what the plot looks like (a good reminder to use the statistics and rules of
thumb in addition to eyeballing). Only OMAN is influential on all the coefficients at a level
above the cut-off point for DFBETAS. OMAN is an oddity—lots of oil, relatively small Omani
population, high birth rates, and a great deal of social development spending, raising HDI
despite a largely rural population. How would you model this without deleting Oman?
• What we want to see: Dot, tiny lines in ALL directions.
• Look for longer lines in ANY direction.
(c) Megan Reif
A Summary of COMMON DIAGNOSTIC PLOTS to identify potential extreme values

1. Leverage (h, y-axis) vs. Squared Normalized Residual (x-axis) Plot: lvr2plot
• Preferred appearance: scatter evenly spread around the intersection of the two means; no points to the left of the mean normalized squared residual line (upper-left quadrant).
• Use: potential influence on (1) ALL coefficients and (2) standard errors.
• Interpretation: the vertical line marks the average squared normalized residual and the horizontal line the average hat value. (1) IDENTIFY POINTS in the red area: high leverage AND low residual, when leverage is greater than 0.2. (2) POINTS in the upper-right quadrant have high leverage (>0.2) and a high residual; they are not influential on b but may diminish SEs and overstate certainty.

2. Partial Regression Leverage Plots (also called Added-Variable Plots): avplots
• Preferred appearance: scatter (loose or tight) of points evenly around the line in each plot.
• Use: potential influence on EACH coefficient.
• Interpretation: the y-axis plots residuals from regressing y on all regressors EXCEPT xj; the x-axis plots residuals from regressing the excluded xj on the remaining regressors. Look for points extreme in x with unusual residual values. CAUTIONS: (a) verify points identified through "eyeballing" with DFBETAS; (b) pay attention to the scale of the plots, since stretched or compacted displays mislead.

3. Star Plots
(a) Outliers, leverage, & model influence (Cook's D): gr7 estu h cooksd, star
(b) Coefficient influence (DFBETAS): gr7 dfx1 dfx2 dfxn, star
• Preferred appearance: (a) a dot, OR a line in the outlier direction and/or the leverage direction, with no or only a tiny line in the influence direction; (b) a dot (lines in one or more directions indicate possible influence on one or more coefficients).
• Use: (a) multivariate outliers, leverage, and/or influence points; (b) potential influence on EACH coefficient.
• Interpretation: (a) look for longer lines in the influence direction (pointing lower left) and the leverage direction (lower right); (b) look for a longer line in any direction; each line corresponds to one coefficient. NOTES: 1. graph7 is an old but still-working Stata 7 command; search "graph7" for its help file. 2. The variable associated with each line depends on the order listed in the command.
(c) Megan Reif 59
Cautions about Extreme Value Procedures
• One weakness of DFFITS and similar statistics is that they will not always detect cases where there are two similar outliers: either point would register as influential on its own, but when both are included, each masks the other.
• A cluster of outliers may indicate that the model was wrongly applied to a set of points. Partial regression plots and other methods may be better for finding such clusters than individual diagnostic statistics such as DFBETA. Both types of postestimation should be conducted.
• A single outlier may indicate a typing error or a special missing-data code (such as 999) that was ignored, or it may suggest that the model does not account for important variation in the data. Only delete or change an observation if it is an obvious error, like a person being 10 feet tall or a negative geographical distance.
• These procedures should not be abused to remove points to effect a desired change in a coefficient or its standard error! "An observation should only be removed if it is shown to be uncorrectably in error. Often no action is warranted, and when it is, the action should be more subtle than deletion….the benefits obtained from information on influential points far outweigh any potential danger" (Belsley et al., 16).
• Think about non-linear or other specifications that might model the outliers directly. Outliers
may present a research opportunity—do the outliers have anything in common?
• Often the most that can be done is to report the results both with and without the outlier
(maybe with one of the results in an appendix). The exception to this is the case of extreme
x-values. It is possible to reduce the range over which your predictions will be valid (e.g., only
OECD countries, only EU, only low-income, etc.)--it is ok to say your height and weight
relationship is only usable for those between 5’5” and 6’5” for example, or that your model
only applies to advanced industrialized democracies.
60(c) Megan Reif
RESOURCES
• UCLA Stata Regression Diagnostic Steps (good examples of data with problems)
– http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm
– http://www.ats.ucla.edu/stat/stata/examples/alsm/alsm9.htm
– http://www.ats.ucla.edu/stat/stata/examples/ara/arastata11.htm
• Belsley, D. A., E. Kuh, and R. E. Welsch. (1980). Regression Diagnostics: Identifying
Influential Data and Sources of Collinearity. New York: Wiley.
• Cook, D. R. and S. Weisberg (1982). Residuals and Influence in Regression. New
York, NY, Chapman and Hall.
• Fox, J. (1991). Regression Diagnostics. Newbury Park: Sage Publications.
• Hamilton, L. C. (1992). Regression with Graphics: A Second Course in Applied
Statistics. Pacific Grove, CA: Brooks/Cole Publishing Company.
– Also has excellent chapter on pre-estimation graphical inspection of data
– Includes section on post-estimation diagnostics for logit
• For regression diagnostics for survey data (weighting for surveys requires adjusted methods), see Li, J. and R. Valliant, "Influence Analysis in Linear Regression with Sampling Weights," 3330-3337, and Valliant, R., J. Li, et al. (2009). "Regression Diagnostics for Survey Data." Stata Conference, Washington, DC: Stata Users Group.
• Temple, J. (2000). "Growth Regressions and What the Textbooks Don't Tell You."
Bulletin of Economic Research 52(3): 181-205. The paper discusses three
econometric problems that are rarely given adequate discussion in textbooks:
model uncertainty, parameter heterogeneity, and outliers.
61(c) Megan Reif
PS 699 Section March 25, 2010
Megan Reif
Graduate Student Instructor, Political Science
Professor Rob Franzese
University of Michigan
Regression Diagnostics
1. Diagnostics for Assessing (assessable) CLRM
Assumptions
2. Diagnostics for Assessing Data Problems (e.g.,
Multicollinearity)
Step One: Histogram and Box-Plot of the Ordinary Residuals (the former is useful in detecting multi-modal distribution of residuals,
which suggests omitted qualitative variable that divides data into groups)
Step Two: Graphical methods – tests exist for error normality, but visual methods are generally preferred
Step Three: Q-Q Plot of Residuals vs. Normal Distribution, and Normal Probability Plot
Background: What is a Q-Q Plot – Quantile-Quantile Plot?
– Q-Q Plot is a scatterplot that graphs quantiles of one variable against quantiles of the second variable
– The quantiles are the data values in ascending order: the first coordinate pairs the lowest x1 value with the lowest x2 value, the second coordinate pairs the next-lowest values of x1 and x2, and so on (we graph a set of points with coordinates (X1i, X2i), where X1i is the ith-from-lowest value of X1 and X2i is the ith-from-lowest value of X2).
– What we can learn from a Q-Q Plot of Two Variables:
1. If the distributions of the two variables are similar in center, spread, and shape, then the points
will lie on the 45-degree diagonal line from the origin.
2. If the distributions have the same SPREAD and SHAPE but different center (mean, median…), the
points will follow a straight line parallel to the 45-degree diagonal but not crossing the origin.
3. If distributions have different spreads/variances and centers, but similar in shape, the points will
follow a straight line NOT parallel to the diagonal.
4. If the points do not follow a straight line, the distributions are different shapes entirely.
Two uses for Q-Q Plots:
1. Compare two empirical distributions (useful to assess whether subsets of the data, such as
different time periods or groups, share the same distribution or come from different
populations).
2. Compare an empirical distribution against a theoretical distribution (such as the Normal).
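The quantile-pairing construction described above can be sketched in a few lines of Python (illustrative, not part of the original notes; the sample data are made up):

```python
# Empirical Q-Q plot coordinates: pair the i-th smallest value of one
# sample with the i-th smallest value of the other. If both samples come
# from the same distribution, the pairs fall near the 45-degree line.
import numpy as np

def qq_pairs(x1, x2):
    """Return (sorted x1, sorted x2) as Q-Q plot coordinates.
    Assumes equal sample sizes; interpolate quantiles otherwise."""
    return np.sort(np.asarray(x1)), np.sort(np.asarray(x2))

rng = np.random.default_rng(0)
a = rng.normal(0, 1, 500)   # same center, spread, and shape...
b = rng.normal(0, 1, 500)   # ...so points should hug the diagonal
qx, qy = qq_pairs(a, b)
print(np.corrcoef(qx, qy)[0, 1] > 0.99)
```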
I. Normal Distribution of Disturbances, ε,
Can only be Evaluated using Estimate e.
A. Residual Quantile-Normal Plot (also known as probit plot,
normal-quantile comparison plot of residuals)
1. Quantile-Normal Plot (qnorm): emphasizes the tails of the distribution
2. Normal Probability Plot (pnorm): puts the focus on the center of the distribution
• If the empirical distribution is identical to a normal distribution, we expect all points to lie on a diagonal line.
I.A. Q-Q Plot of Residuals vs. Normal Distribution
Quantile-Normal Plot Interpretation Basics
Source: Hamilton, Regression with Graphics, p. 16
Quantile-Quantile Plot Diagnostic Patterns
(Description of Point Pattern: Possible Interpretation)
• Points on the 45-degree diagonal line from the origin: distributions similar in center, spread, and shape.
• Points on a straight line parallel to the 45-degree diagonal: same SPREAD and SHAPE but different center (mean, median…); you will never see this for e, which has mean zero!
• Points follow a straight line NOT parallel to the diagonal: different spreads/variances and centers, but similar shape.
• Points do not follow a straight line: distributions have different shapes.
• Vertically steep (closer to parallel to the y-axis) at top and bottom: heavy tails, outliers at low and high data values.
• Horizontal (closer to parallel to the x-axis) at top and bottom: light tails, fewer outliers.
• Two or more less-steep areas (horizontal, parallel to the x-axis), indicating higher-than-normal density, separated by a gap or steep climb (an area of lower density): distribution is bi- or multi-modal (subgroups, different populations).
• All but a few points fall on a line; some points are vertically separated from the rest of the data: outliers in the data.
• Left end of the pattern below the line, right end above it: long tails at both ends of the distribution.
• Left end of the pattern above the line, right end below it: short tails at both ends of the distribution.
• Curved pattern with slope increasing from left to right: distribution skewed to the right.
• Curved pattern with slope decreasing from left to right: distribution skewed to the left.
• Granularity, i.e., a staircase pattern (plateaus and gaps): data values have been rounded or are discrete.
• CONTINUING EXAMPLE (from March 18 Notes): Model from
Mukherjee et al. of crude birth rate as a function of:
– GNP per capita (logged, per the Feb 18 Notes and general practice for such variables)
– IM: Infant mortality
– URBAN: percent of population urban
– HDI: human development index (from the World Bank Human Development Report 1993)
regress birthr lngnp hdi infmor urbanpop
Source | SS df MS Number of obs = 110
-------------+------------------------------ F( 4, 105) = 129.19
Model | 16552.2585 4 4138.06462 Prob > F = 0.0000
Residual | 3363.19755 105 32.0304528 R-squared = 0.8311
-------------+------------------------------ Adj R-squared = 0.8247
Total | 19915.456 109 182.710606 Root MSE = 5.6595
------------------------------------------------------------------------------
birthr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lngnp | -.2138487 .7960166 -0.27 0.789 -1.792203 1.364505
hdi | -24.50566 7.495152 -3.27 0.001 -39.36716 -9.644157
infmor | .111157 .0396176 2.81 0.006 .0326026 .1897115
urbanpop | .0111358 .0396627 0.28 0.779 -.0675081 .0897797
_cons | 39.56958 6.599771 6.00 0.000 26.48346 52.65571
------------------------------------------------------------------------------
6
BIRTHrti = β0 + β1 lnGNPCi + β2 HDIi + β3 IMi + β4 URBANi + εi
qnorm estu, grid mcolor(green) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny)
mlabcolor(gs4) mlabangle(forty_five) yline(-.1535893, lpattern(longdash) lcolor(cranberry))
caption(Red Dashed Line Shows Median of Studentized Residuals, size(vsmall)) legend(on)
--------------------------------------------
Quantile-
Normal
Plot
pnorm estu, grid mcolor(green) msize(small) msymbol(circle) mlabel(cid)
mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on)
What does this granularity suggest?
Normal
Probability
Plot
What Non-Normal Residuals Do to Your OLS
Estimates and What to Do
• If errors not normally distributed:
– Efficiency decreases and inference based on t- and F distributions are
not justified, especially as sample size decreases
– Heavy-tailed error distributions (more outliers) will result in greater
sample-to-sample variation (less generalizability)
– Normality is not required in order to obtain unbiased estimates of the
regression coefficients.
• If you have not already transformed skewed variables, doing so may
help, as non-normal distribution of e may be caused by skewed X
and/or Y distributions.
• Model re-specification may be required if there is evidence of granularity or
multi-modality
• Robust methods provide alternatives to OLS for dealing with non-
normal errors.
(Ordinary) Residual vs. Fitted Plot
CLRM:
•Heteroskedasticity (leads to inefficiency and biased
standard error estimates)
•Residual Non-Normality (compounds inefficiency
and undermines the rationale for t- and F-tests, casting
doubt on p-values reported in output)
SPECIFICATION:
•Non-linearity in X-Y relationship(s)
rvfplot, mcolor(green) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny)
mlabcolor(gs4) mlabangle(forty_five) legend(on)
Heteroskedasticity
Variance for smaller
fitted values larger than
for medium fitted
values?
Absolute Value of Residual v. Fitted (easier to see heteroskedasticity)
predict yhat
predict resid, resid
gen absresid=abs(resid)
graph twoway scatter absresid yhat, mcolor(green) msize(small) msymbol(circle)
mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on)
Note: Fox Recommends Using Studentized Residuals vs. Fitted Values
(in example there is little difference)
graph twoway scatter estu yhat, mcolor(green) msize(small) msymbol(circle) mlabel(cid)
mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on)
Residual v. Predictor Plot
• Heteroskedasticity: e varies with the values of one or more Xs.
rvpplot lngnp, mcolor(green) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny)
mlabcolor(gs4) mlabangle(forty_five) legend(on) name(lngnp)
rvpplot hdi, mcolor(red) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny)
mlabcolor(gs4) mlabangle(forty_five) legend(on) name(hdi, replace)
rvpplot infmor, mcolor(blue) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny)
mlabcolor(gs4) mlabangle(forty_five) legend(on) name(infmor, replace)
rvpplot urbanpop, mcolor(orange) msize(small) msymbol(circle) mlabel(cid)
mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on) name(urbanpop,
replace)
graph combine lngnp hdi infmor urbanpop
Residual v. Predictor Plot
Component-Plus-Residual Plot
• The component-plus-residual plot is also known as a partial residual plot or adjusted partial residual plot (it is distinct from the partial-regression leverage, or added-variable, plot discussed earlier).
• This plot shows the expectation of the dependent variable
given a single independent variable, holding all else constant,
PLUS the residual for that observation from the FULL model.
• Looks at one of the explained parts of Y, plus the unexplained
part (e), plotted against an independent variable.
• CLRM: Heteroskedasticity
• Functional form / non-linearity
cprplot lngnp, mcolor(green) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny)
mlabcolor(gs4) mlabangle(forty_five) legend(on) name(lngnpcp, replace)
cprplot hdi, mcolor(red) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny)
mlabcolor(gs4) mlabangle(forty_five) legend(on) name(hdicp, replace)
cprplot infmor, mcolor(blue) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny)
mlabcolor(gs4) mlabangle(forty_five) legend(on) name(infcp, replace)
cprplot urbanpop, mcolor(orange) msize(small) msymbol(circle) mlabel(cid)
mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on) name(urbcp, replace)
graph combine lngnpcp hdicp infcp urbcp
CPR Plot
I. Autocorrelation
• Durbin-Watson Test Statistic
• Correlograms
• Semi-Variograms
• Time Plot
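The Durbin-Watson statistic listed above is simple to compute by hand. An illustrative Python check against statsmodels' implementation, using toy residuals rather than the example model:

```python
# Durbin-Watson: d = sum((e_t - e_{t-1})^2) / sum(e_t^2).
# d near 2 suggests no first-order autocorrelation; d toward 0 suggests
# positive autocorrelation, and d toward 4 negative autocorrelation.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

e = np.array([1.0, -0.5, 0.3, -0.2, 0.4, -0.6, 0.1, -0.3])  # toy residuals
d_manual = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(np.isclose(durbin_watson(e), d_manual))   # True
```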
II. Multicollinearity
Variance Inflation Factor
High collinearity increases standard errors and reduces significance on important variables.
VIF = 1/(1 - R2j), where R2j is from the regression of variable j on the other independent variables. If variable j is completely uncorrelated with the other variables, then R2j will be zero and the VIF will be one. If the fit is perfect, R2j will be large. A larger VIF means more collinearity.
Summary of COMMON DIAGNOSTIC PLOTS to assess CLRM assumptions & Data Problems (post-estimation)

1. Quantile-Normal Plot (ordinary residuals vs. normal): qnorm estu, & Normal Probability Plot (studentized residuals vs. standard normal): pnorm estu
• Preferred appearance: if the empirical distribution of the residuals is identical to a normal distribution, expect all points to lie on the 45-degree diagonal line through the origin.
• Use: normally distributed stochastic component.
• Interpretation: Q-Normal plot: inspect the tails; Normal Probability plot: inspect the middle. 1. Look for multi-modality and granularity (possible misspecification). 2. Right or left skewness (bowed up, bowed down). 3. Heavy tails. 4. Vertical differences in values (outliers).

2. Ordinary or Studentized Residual vs. Fitted Values: rvfplot, & |Residual| vs. Fitted: graph twoway scatter absresid yhat
• Preferred appearance: no discernible pattern; an even band with constant variance above and below zero, at high and low values of y.
• Use: CLRM: heteroskedasticity (e varies with y); residual normality. SPECIFICATION: non-linearity in X-Y relationship(s).
• Interpretation: the sum total of what the regression has explained. 1. Look for systematic variation in the distance of residuals from their mean of zero. 2. The Q-N plot is better for assessing normality. 3. This plot helps assess whether error variance increases or decreases at smaller or larger values of y. 4. Clusters of residuals above or below zero.

3. (Ordinary) Residual vs. Predictor Plot (each X): rvpplot x1varname
• Preferred appearance: no discernible pattern; an even band with constant variance above and below zero, at high and low values of each X.
• Use: CLRM: heteroskedasticity (e varies with values of one or more Xs). SPECIFICATION: non-linearity.
• Interpretation: 1. Look for systematic variation in the distance of residuals from their mean. 2. Whether error variance increases or decreases at smaller or larger values of each X. 3. Clusters of residuals above or below zero.

4. Component-Plus-Residual Plot: cprplot x1varname
  • 3. I. Pre-Modeling Exploratory Data Analysis (EDA) (Review/Checklist) • Not to be confused with data-mining: arrive at your data with your theory in hand. • Because multivariate analysis builds on uni- and bivariate analysis, begin with univariate analysis, followed by bivariate analysis, before proceeding. • These notes assume knowledge of the production of descriptive statistics, but provide basic commands and output as a sort of checklist. • Don’t forget to start by using Stata’s “describe”, “summarize”, “codebook”, and “inspect” commands to understand (a) how the variables are labeled and coded, (b) basic distributions, and (c) how much missing data there are for each variable. • To think about the possible effect of missing data on your model, use the “list if” command: list yvar xvar1 xvar2 xvar3 if yvar==. list yvar xvar1 xvar2 xvar3 if xvar1==. and so on. • Recode and label your variables for easier interpretation before proceeding, particularly the unique ID variable (such as country-year, individual 1-n, etc.) for easy labeling of points (choose a short name). 3(c) Megan Reif
• 4. I.A Exploratory Data Analysis (EDA): Univariate & Bivariate Analysis
1. Summarize basic univariate and bivariate distributions for your theoretical model variables to learn the data structure:
   1. Location (mean, median)
   2. Spread (range, variance, quartiles)
   3. Genuine skewness vs. outliers
The most efficient way to obtain this information is to use Stata's “tabstat” command with the statistics you desire for your model variables, and then inspect:
• Histograms (do not forget to explore different bin sizes, between 5 and 20 bins, since histogram shapes are sensitive to bin size)
• Boxplots
• Matrix scatterplots
4 (c) Megan Reif
• 5. Univariate Outliers
• Distinguish between GENUINE skewness in the population distribution (and subsequently the empirical distribution) and unusual behavior (outliers) in one of the tails. Your theory about the population may guide you on this.
• Do not leave univariate outliers out of your model or model them explicitly based on descriptive statistics until you have done post-estimation diagnostics to determine whether they are also MULTIVARIATE outliers (but do correct them if they are due to obvious typos or missing-data/non-response codes like “999”).
• A UNIVARIATE outlier is a data point distant from the main body of the data (say, the middle 50%). One way to measure this distance is the inter-quartile range, IQR (the range of the middle 50% of the data). A data point x_o is an outlier if
   x_o < Q_L − 1.5·IQR or x_o > Q_U + 1.5·IQR
and a far outlier if
   x_o < Q_L − 3.0·IQR or x_o > Q_U + 3.0·IQR
– OBSERVE whether the middle 50 percent of the data ALSO manifests skewness.
– If the IQR is skewed, a transformation such as a log or square may be called for; if NOT, focus on the outliers.
– Use a box plot to check the location of the median in relation to the quartiles. In Stata, a box plot will show outliers (1.5·IQR criterion) as points if they are present in the data.
5 (c) Megan Reif
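The 1.5·IQR fence rule can be sketched in a few lines. This is a Python illustration with made-up numbers (not part of the course's Stata materials); the function name and data are hypothetical:

```python
import numpy as np

def iqr_outliers(x, factor=1.5):
    """Flag points below Q_L - factor*IQR or above Q_U + factor*IQR."""
    q_l, q_u = np.percentile(x, [25, 75])
    iqr = q_u - q_l
    return (x < q_l - factor * iqr) | (x > q_u + factor * iqr)

# Made-up revenue-like series with one planted extreme value.
x = np.array([2549, 2746, 2906, 3243, 3409, 3544,
              3928, 4169, 4506, 5433, 12000.0])
print(iqr_outliers(x))             # 1.5*IQR rule: flags only the 12000
print(iqr_outliers(x, factor=3))   # 3*IQR "far outlier" rule
```

Doubling the factor to 3 reproduces the "far outlier" criterion from the slide.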
• 6. Tanzania Revenue Data
EXAMPLE Data: Tanzania.dta (Mukherjee et al.)
REV: Gov Recurrent Revenue; REXP: Gov Recurrent Expenditure; DEXP: Gov Development Expenditure; Year (T) 1970–1990; Decade 0=1970s, 1=1980s, 2=1990

tabstat rev rexp dexp t, s(mean median sd var count min max iqr)

   stats |       rev      rexp      dexp         t
---------+----------------------------------------
    mean |  3728.381  4030.048  1693.619        80
     p50 |      3544      3891      1549        80
      sd |  817.1005  821.3014   894.879  6.204837
variance |  667653.2  674535.9  800808.3      38.5
       N |        21        21        21        21
     min |      2549      2899       586        70
     max |      5433      5627      3589        90
     iqr |       926      1127      1379        10
--------------------------------------------------
6 (c) Megan Reif
• 7. BIVARIATE NOTE: You can add “by(groupvariable)” after the comma to look at descriptives for subgroups of interest.

tabstat rev rexp dexp t, s(mean median sd var count min max iqr) by(decade)

Summary statistics: mean, p50, sd, variance, N, min, max, iqr
  by categories of: decade (decade)

  decade |       rev      rexp      dexp         t
---------+----------------------------------------
       0 |    4133.7    4057.3    2151.6      74.5
         |    3962.5      3850      1994      74.5
         |  814.8448  789.6686  774.8313   3.02765
         |    663972  623576.5  600363.6  9.166667
         |        10        10        10        10
         |      3243      3122      1228        70
         |      5433      5571      3589        79
         |      1072       937       927         5
---------+----------------------------------------
       1 |    3303.1      3950    1346.4      84.5
         |      3221    3812.5       993      84.5
         |  657.0976  914.5912  822.1245   3.02765
         |  431777.2  836477.1  675888.7  9.166667
         |        10        10        10        10
         |      2549      2899       588        80
         |      4506      5627      3096        89
         |       857      1392      1037         5
---------+----------------------------------------
       2 |      3928      4558       586        90
         |      3928      4558       586        90
         |         .         .         .         .
         |         .         .         .         .
         |         1         1         1         1
         |      3928      4558       586        90
         |      3928      4558       586        90
         |         0         0         0         0
---------+----------------------------------------
   Total |  3728.381  4030.048  1693.619        80
         |      3544      3891      1549        80
         |  817.1005  821.3014   894.879  6.204837
         |  667653.2  674535.9  800808.3      38.5
         |        21        21        21        21
         |      2549      2899       586        70
         |      5433      5627      3589        90
         |       926      1127      1379        10
--------------------------------------------------
7 (c) Megan Reif
• 8. Univariate Box Plots & Histograms
graph box rev
• Notice that the inter-quartile range manifests skewness, in addition to the maximum being much further from the middle 50% of the observations.
• Note how different the histogram for revenue appears with 4, 6, 8, and 10 bins (21 observations).
• See the histogram help file to ensure you properly display histograms for continuous vs. discrete variables.
[Figures: box plot of rev; histograms of Tanzania annual revenue with different bin sizes]
8 (c) Megan Reif
• 9. Bivariate Box Plots & Histograms: Inspecting by Subgroups or Categorical Transformations of Continuous Variables
graph box rev if decade==0 | decade==1, over(decade)
histogram rev, by(decade)
• Box plot of revenue by decade (1970s and 1980s).
• Note that the IQR is less skewed for the 1970s than for the 1980s.
• Since there are no dots in the boxplot, we know there are no formal univariate outliers.
• We also know from other financial data that skewness may be something to correct for with a log transformation.
[Figures: box plots and histograms of rev by decade]
9 (c) Megan Reif
• 10. Scatterplot Matrices and Cross-Tabulations
• Use these before ever running a regression to see group differences and reveal potential violations of the CLRM.
[Figure: two groups that may have the same relationship to y on average, but something else is going on.]
10 (c) Megan Reif
• 11. Bi-Variate Correlations/Regressions and the NEED TO GRAPH Data: Same Statistics, Different Relationships
The four panels form “Anscombe’s Quartet”—a famous demonstration by statistician Francis Anscombe in 1973. By creating the four plots he was able to check the assumptions of his linear regression model, and found them wanting for three of the four data sets (all but the top left). As Epstein et al. write, “Anscombe’s point, of course, was to underscore the importance of graphing data before analyzing it” (24).
Remember that looking at correlations alone will conceal curvilinear relationships, heteroskedasticity, outliers, and distributional shape. For example, THE DATA IN THE FOUR PLOTS HAVE THE SAME:
1) means for both y and x variables
2) slope and intercept estimates in a regression of y on x
3) R2 and F values (statistics we will come to later)
F.J. Anscombe, 1973. “Graphs in Statistical Analysis,” American Statistician 27:17, 19-20, cited in Lee Epstein, Andrew D. Martin, and Matthew M. Schneider, 2006. “On the Effective Communication of the Results of Empirical Studies, Part I.” Paper presented at the Vanderbilt Law Review Symposium on Empirical Legal Scholarship, February 17.
11 (c) Megan Reif
• 12. Scatterplot Matrices
graph matrix rev rexp dexp t, half
• Allows you to look at bivariate relationships between your model variables, think about possible collinearity between explanatory variables, non-linearity in relationships, etc.
• Notice the time trend of all three financial variables—consider autocorrelation.
• Extreme points: we may want to inspect the scatterplots for rev–dexp and rexp–dexp for observations that seem unusual given our theory that development expenditure would be a function of revenue (the observations have high development expenditure but low revenue).
[Figure: scatterplot matrix of rev, rexp, dexp, t]
12 (c) Megan Reif
• 13. A Closer Look: Scatterplot with Labels
scatter dexp rev, mlabel(t)
• Note that in 1990, revenue was middling but development expenditures were low. What might cause this?
scatter rev t, mlabel(rev)
• A scatter of revenue over time suggests a trend and possible autocorrelation. It is also curious that 1978 and 1979 have almost identical (and high) levels of revenue. Possible data error, or real stagnation in revenue? There was a war between Uganda and Tanzania in 1979. Note how inspecting the data can lead to case-specific information that may require modeling adjustments (e.g., war dummies). And we didn’t know a thing about Tanzania!
[Figures: labeled scatterplots of dexp vs. rev and rev vs. t]
13 (c) Megan Reif
• 14. Cross-Tabulations (Contingency Tables)
• Recode continuous variables into categories (see notes from March 11), which enables you to summarize continuous variables by categories (below) and inspect test statistics for inter-group differences in means and variances (next slide):
gen revcat=rev
recode revcat 2549/3500=1 3501/4500=2 4501/max=3
label define revcat 1 "low" 2 "med" 3 "high"
label values revcat revcat
tab revcat decade, sum(dexp)
• We want to see if the mean and sd of development expenditure vary by revenue level and decade, for example, in order to see if one decade is responsible for all of the high-revenue observations, etc.—remember how important sub-group size is when using interaction terms. Cross-tabs are an important tool for exploring whether the same small subgroup is driving the key results of estimation. Remember the 13 educated women in the dummy model (Feb 25 notes).

           |              decade
    revcat |         0          1          2 |     Total
-----------+---------------------------------+----------
       low |   1497.25  934.16667          . |    1159.4
           | 188.20977  275.87274          . |  372.3422
           |         4          6          0 |        10
-----------+---------------------------------+----------
       med |   2205.75  1587.6667        586 |    1771.5
           | 439.05989  850.62408          0 | 782.53526
           |         4          3          1 |         8
-----------+---------------------------------+----------
      high |      3352       3096          . | 3266.6667
           | 335.16861          0          . | 279.31046
           |         2          1          0 |         3
-----------+---------------------------------+----------
     Total |    2151.6     1346.4        586 |  1693.619
           | 774.83134  822.12451          0 | 894.87896
           |        10         10          1 |        21
14 (c) Megan Reif
• 15. Cross-Tabulations (Contingency Tables)
• Inspect test statistics for inter-group differences in means and variances.
• Categories of low, medium, and high revenue levels are not statistically significantly disproportionately distributed in any one decade—one period alone will probably not be driving statistically significant results for revenue effects—with the caveat that our categories need to be meaningful, perhaps coded at natural breaks in the data, quartiles, etc. However, outliers that do not fall in subgroups will not show up with this method. It is still useful to consider possible clusters of data that will influence our model.

tab revcat decade, column row chi2 lrchi2 V exact gamma taub

           |              decade
    revcat |         0          1          2 |     Total
-----------+---------------------------------+----------
       low |         4          6          0 |        10
           |     40.00      60.00       0.00 |    100.00
           |     40.00      60.00       0.00 |     47.62
-----------+---------------------------------+----------
       med |         4          3          1 |         8
           |     50.00      37.50      12.50 |    100.00
           |     40.00      30.00     100.00 |     38.10
-----------+---------------------------------+----------
      high |         2          1          0 |         3
           |     66.67      33.33       0.00 |    100.00
           |     20.00      10.00       0.00 |     14.29
-----------+---------------------------------+----------
     Total |        10         10          1 |        21
           |     47.62      47.62       4.76 |    100.00
           |    100.00     100.00     100.00 |    100.00

          Pearson chi2(4) =   2.6075   Pr = 0.625
 likelihood-ratio chi2(4) =   2.8982   Pr = 0.575
               Cramér's V =   0.2492
                    gamma =  -0.2000  ASE = 0.327
          Kendall's tau-b =  -0.1183  ASE = 0.197
           Fisher's exact =            0.645
15 (c) Megan Reif
  • 16. II. Post-Estimation Diagnostics: OLS Estimator is a (Sensitive) Mean • The sample mean is a least squares estimator of the location of the center of the data, but the mean is not a resistant estimator in that it is sensitive to the presence of outliers in the sample. That is, changing a small part of the data can change the value of the estimator substantially, leading us astray. • This is particularly problematic if we are unsure about the actual shape of the population distribution from which our data are drawn. 16(c) Megan Reif
• 17. II. Post-Estimation Diagnostics: Extreme Points (start here, since extreme points will affect formal testing procedures). Also called case diagnostics or case-deletion diagnostics.
• In multivariate analysis, extreme data points create more complex problems than in univariate analysis.
– A UNIVARIATE outlier is simply a value of x far from the rest of the X distribution (unconditionally unusual, but it may not be a REGRESSION outlier).
– An outlier in simple bivariate regression is an observation whose dependent-variable value is UNUSUAL GIVEN the value of the independent variable (conditionally unusual).
17 (c) Megan Reif
• 18. II. Bivariate Regression Extreme Points
• An observation with an atypical or anomalous X value has LEVERAGE. It affects model summary statistics (e.g., R2, standard errors), but has little effect on the regression coefficient estimates.
• An INFLUENCE point has an unusual Y value (AND perhaps an extreme X value). It is characterized by having a noticeable impact on the estimated regression coefficients (i.e., removing it from the sample would markedly change the slope and direction of the regression line).
• A RESIDUAL OUTLIER has a large VERTICAL distance from the regression line.
IMPORTANT NOTE: An outlier in X or Y is NOT necessarily associated with a large residual, and vice versa.
18 (c) Megan Reif
  • 19. II.A.1.a Extreme Observations in Y 19(c) Megan Reif
  • 20. II.A.1.a Extreme Observations in Y 20(c) Megan Reif
  • 21. II.A.1.b Extreme Observations in X NOTE: These examples reveal that it is most typically observations extreme in BOTH x AND y that have influence (second graph on these two slides) but it is not always the case. 21(c) Megan Reif
• 22. Summary Table: Model Effects for Outliers, Leverage, Influence

Type of Extreme Value           | Y direction | X direction | Leverage | Influence | Effect on intercept / coefficients / uncertainty?*
--------------------------------+-------------+-------------+----------+-----------+---------------------------------------------------
Outlier in y (y_i far from ȳ)   | Unusual     | In trend    | No       | No        | Yes / No / Yes
                                | Unusual     | Unusual     | Yes      | Yes       | Yes / Large / Yes
Outlier in x (x_i far from x̄)   | In trend    | Unusual     | Yes      | No        | No / No / Yes (tends to reduce uncertainty)
                                | Unusual     | Unusual     | Yes      | Yes       | Yes / Large / Yes
Outlier in residual             | Yes         | Possibly    | Possible but not necessarily | Possible but not necessarily | No / No / Yes

*Note that influence can refer to several things: (1) effect on the y-intercept; (2) on a particular coefficient; (3) on all coefficients; (4) on the estimated standard errors. Thus we have a variety of procedures to evaluate influence.
22 (c) Megan Reif
• 23.
1. OUTLIERS are not necessarily influential
2. BUT they can be, depending on leverage
3. Yet high LEVERAGE points are not always influential
4. And INFLUENTIAL points are not necessarily outliers

PLOT | OUTLIER | LEVERAGE | INFLUENCE
  1  |   Yes   |    No    |    No
  2  |   Yes   |   Yes    |   Yes
  3  |   No    |   Yes    |    No
  4  |   No    |   Yes    |   Yes
23 (c) Megan Reif
• 24. II.A Multivariate Extreme Points
• Influence in multivariate regression results from a particular combination of values on all variables in the regression, not necessarily just from unusual values on one or two of the variables, but the concepts from the bivariate case apply.
• When there are two or more explanatory variables X, scatterplots may not reveal multivariate outliers, which are separated from the centroid of all the Xs but do not appear in the bivariate relations of any two of them.
24 (c) Megan Reif
• 25. Residual Analysis: A Caution
• Recall that the residuals e are just an estimate of an unobservable vector with given distributional properties. Assessing the appropriateness of the model for a given problem may entail using the residuals in the absence of ε, but since e is by definition orthogonal to (uncorrelated with) the regressors, with Cov(X,e)=0 and E(e)=0, one cannot use the residuals to test these assumptions of the CLRM.

Sample: e — Residuals — Estimated
Population: ε — Error / Disturbance Term / Stochastic Component — Unobserved parameter we try to estimate

The difference between these means that you are never totally confident that e is a good estimate of ε. If you meet all the assumptions of the CLRM, then e is an unbiased, efficient, and consistent estimate of ε.
25 (c) Megan Reif
• 26. II.A.1 The “Hat” Matrix (Least Squares Projection Matrix / Fitted-Value Maker)
• DeNardo calls it P (because it is the projection matrix for the predictor space; see http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_projection_matrix.htm for a lovely geometric explanation); Rob calls it N (“fitted-value maker”); Cook calls it V; and Belsley calls it H. I use H, since most of the books on diagnostics seem to use H.
• The hat matrix is H = X(X'X)^(-1)X'.
• Since b = (X'X)^(-1)X'y, and by definition the vector of fitted values is ŷ = Xb, it follows that ŷ = Hy.
• The individual diagonal elements h_1, h_2, ..., h_i, ..., h_n of H can thus be related to the distance between each row of explanatory-variable values x_i and the vector of explanatory-variable means x̄, where x_i is the ith row of the matrix X.
26 (c) Megan Reif
• 28. II.A.1 Hat Matrix, cont.
• H = X(X'X)^(-1)X' (the matrix); its diagonal elements are h_i = x_i'(X'X)^(-1)x_i, which capture the effect of the ith observation on its own predicted value, and its off-diagonal elements are h_ij = x_i'(X'X)^(-1)x_j.
• In scalar form, the hat (leverage) for the ith observation in simple regression (note the adjustment for the number of observations: as n grows larger, the individual leverage of any one observation diminishes) is
   h_i = 1/n + (x_i − x̄)² / Σ_j (x_j − x̄)²
• h serves as a measure of leverage of the ith data point because its numerator is the squared distance of the ith data point from its mean in the X direction, while its denominator is a measure of the overall variability of the data points along the X axis. It therefore measures the distance of the data point in relation to the overall variation in the X direction.
28 (c) Megan Reif
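The matrix and scalar leverage formulas on this slide can be checked against each other numerically. A Python sketch with a made-up x vector (not from the Tanzania data); the high-leverage point at x=10 is planted for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])    # made-up data; 10 is far from the mean
X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix H = X(X'X)^(-1)X'
h = np.diag(H)

# Scalar form for simple regression: h_i = 1/n + (x_i - xbar)^2 / sum((x_j - xbar)^2)
n = len(x)
h_scalar = 1 / n + (x - x.mean()) ** 2 / ((x - x.mean()) ** 2).sum()

print(np.allclose(h, h_scalar))  # True: the two formulas agree
print(h.sum())                   # trace of a projection matrix = number of parameters
```

The trace check (sum of the h_i equals k+1, here 2) foreshadows the average hat value E(h) = (k+1)/n used later.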
• 29. II.A.1 Hat Matrix, cont.
• Because H is a projection matrix, 1/n ≤ h_i ≤ 1 (for proof see Belsley et al., 1980, Appendix 2A).
• It is possible to express the fitted values in terms of the observed values (scalar form):
   ŷ_j = h_1j·y_1 + h_2j·y_2 + h_3j·y_3 + ... + h_jj·y_j + ... + h_nj·y_n = Σ_i h_ij·y_i
• h_ij therefore captures the contribution of y_i to the jth fitted value. If it is large, then the ith observation has a substantial impact on the jth fitted value. The hat value summarizes the potential influence of y_i on ALL the fitted values.
29 (c) Megan Reif
• 30. II.A.1.a Hat and the Residuals
• Since e = y − ŷ = (I − H)y, where I is the identity matrix, substituting y = Xβ + ε gives
   e = (I − H)(Xβ + ε) = (I − H)ε   (because (I − H)X = 0)
or, in scalar form, e_i = ε_i − Σ_j h_ij·ε_j for i = 1, 2, ..., n.
• The relationship between the residual and the true stochastic component therefore depends on H. If the h_ij's are sufficiently small, e is a reasonable estimate of ε.
• Note the interesting situation in which a better “fit”, if based on extreme values, may signal an underestimate of the randomness in the world.
30 (c) Megan Reif
• 31. II.A.1.a Hat and the Residuals
• The variance of e is also related to H (see DeNardo): Var(e_i) = σ²(1 − h_i).
• For high-leverage cases, in which h approaches its upper bound of one, the residual value will tend to zero (see graph above).
• This means that the residuals will not be a reliable means of detecting influential points, so we need to transform them… leading us to the subject of studentized (jackknifed) residuals.
31 (c) Megan Reif
• 32. II.A.1.a Hat / Studentized Residuals
PURPOSE: Detection of multivariate outliers.
• Adjust the residuals to make them conspicuous so they are reliable for detecting leverage and influential points.
• DeNardo's “internally studentized residual” is called a “standardized” or “normalized” residual in other contexts—it can disguise outliers.
• The “externally” studentized residual uses the standard error of the regression (residual sum of squares/(n−k) = e'e/(n−k)) computed after deleting the ith observation, which allows solving for h, the measure of leverage:
   r*_i = e_i / (s(i)·√(1 − h_i))
where s(i) is the standard error of the estimate/regression calculated after deleting the ith observation.
• These residuals are distributed as Student's t with n−k d.f., so “a test” of each outlier can be made, with each studentized residual representing a t-value for its observation.
• This is an application of the jackknife method, whereby observations are omitted and the estimation is iterated to arrive at the studentized residuals (just one of many applications of the jackknife). Also called the “jackknife residual”.
32 (c) Megan Reif
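The jackknife logic can be made concrete by computing r*_i the long way: literally deleting each case, refitting, and rescaling the residual. A Python sketch on made-up data (the planted outlier at case 7 is mine, not from the Tanzania example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(10.0)
y = 2 + 0.5 * x + rng.normal(0, 1, size=10)
y[7] += 8.0                                  # plant an outlier at case 7
X = np.column_stack([np.ones_like(x), x])

def rstudent(X, y):
    """Externally studentized residuals r*_i = e_i / (s(i) * sqrt(1 - h_i))."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    r = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]  # refit without case i
        resid_i = y[keep] - X[keep] @ b_i
        s_i = np.sqrt(resid_i @ resid_i / (n - 1 - p))          # s(i)
        r[i] = e[i] / (s_i * np.sqrt(1 - h[i]))
    return r

r = rstudent(X, y)
print(int(np.argmax(np.abs(r))))  # the planted outlier (case 7) stands out
```

Stata's `predict estu, rstudent` produces the same quantity without the explicit loop.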
• 33. II.A.1.a continued: Steps for Assessing Studentized Residuals
1. Studentized residuals correspond to the t-statistic we would obtain by including in the regression a dummy predictor I_i coded 1 for that observation and 0 for all others. One can then test the null hypothesis that the coefficient δ equals zero (Ho: δ=0) in:
   E(y_i) = β_0 + β_1·x_i1 + β_2·x_i2 + ... + β_k·x_ik + δ·I_i
This tests whether case i causes a shift in the regression intercept.
2. We set a significance level α for our overall Type I error risk, the probability of rejecting the null when it is in fact true. According to the Bonferroni inequality [Pr(a set of events occurring) cannot exceed the sum of the individual probabilities of the events], the probability that at least one of the cases is a statistically significant outlier (when the null hypothesis is actually true) cannot exceed nα, so….
3. We want to run n tests (one for each case), each at the α/n level (call this α*). Suppose we set α = .05 and we have 21 observations. To test whether ANY case in a sample of n=21 is a significant outlier at level α, we check whether the maximum studentized residual max|r_i| is significant at α* = .05/21 = .0024 (given a t-distribution with df = n−K−1; 21−2−1 = 19). Most t-tables do not go this low, so a computer is required.
33 (c) Megan Reif
• 34. Tanzania Revenue Data
EXAMPLE Data: Tanzania.dta (Mukherjee et al.)
REV: Gov Recurrent Revenue; REXP: Gov Recurrent Expenditure; DEXP: Gov Development Expenditure; Year (T) 1970–1990

regress rexp rev   (Expenditure as a function of Revenue)

      Source |       SS       df       MS              Number of obs =      21
-------------+------------------------------           F(  1,    19) =   55.16
       Model |    10034268     1    10034268           Prob > F      =  0.0000
    Residual |  3456450.93    19   181918.47           R-squared     =  0.7438
-------------+------------------------------           Adj R-squared =  0.7303
       Total |    13490719    20  674535.948           Root MSE      =  426.52

------------------------------------------------------------------------------
        rexp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         rev |   .8668668   .1167207     7.43   0.000     .6225675    1.111166
       _cons |    798.038   445.0211     1.79   0.089    -133.4019    1729.478
------------------------------------------------------------------------------

predict resid, resid     (creates a variable with the ORDINARY RESIDUALS)
predict estu, rstudent   (STUDENTIZED RESIDUALS)
34 (c) Megan Reif
• 35. II.A.1.a continued: Assessing Studentized Residuals — Bonferroni Outlier Test (test for outlier influence on the y-intercept)
4. Identify the largest and smallest residuals. As a rule of thumb, we should pay attention to residuals with absolute values greater than 2, be worried about those with values greater than 2.5, and be most concerned about those exceeding 3. There are a variety of ways to identify/inspect these residuals; see http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm for more options. The fastest in a small dataset is to list the observations with a studentized residual exceeding +2 or −2. We see here that 1980 is an outlier.

list if abs(estu)>2
     |  rev   rexp   dexp    t      resid       estu   decade   revcat |
     |-----------------------------------------------------------------|
 11. | 4506   5627   3096   80   922.8602   2.590934        1     high |

We can use Stata to carry out the Bonferroni outlier test as follows. The maximum studentized residual of 2.59 is our t-value, and n=21. For 1980 to be a significant outlier (i.e., cause a significant shift in the intercept) at α = .05, t = 2.59 must be significant at .05/21:

display .05/21
.00238095
display 2*ttail(19, 2.59)
.01796427

The obtained P-value (P = .01796) is NOT below α/n = .00238, so 1980 is NOT a significant outlier at α = .05.
35 (c) Megan Reif
• 36. II.A.1.b Hat Matrix and Leverage: Outlier Influence on Fitted Values (recall that fit is overly dependent on these outliers)
• Note that if h_i = 1, then ŷ_i = y_i; that is, e_i = 0, and the ith case would be fit exactly.
• This means that, if no observations are exact replicates, one parameter is dedicated to one data point, which would make it impossible to invert X'X and obtain OLS estimates.
• This rarely occurs, so the value of h_i will rarely reach its upper bound of 1.
• The MAGNITUDE of h_i depends on the relationship 1/n ≤ h_i ≤ 1/c, where c is the number of times the ith row of X is replicated (generally, then, h will range from 1/n to 1; but in survey data it is possible to have duplicate responses across respondents, which you can check in Stata with the “duplicates” command).
• The higher the value of h_i, the higher the leverage of the ith data point.
• The average hat value is E(h) = (k+1)/n, where k is the number of regressors. We therefore proceed by looking at the maximum hat value. A hat value has leverage if it is more than twice the mean hat value.
• Huber (1981) suggests another rule of thumb for interpreting h_i, though it might overlook more than one large hat value:
   max(h_i) ≤ .2 — little to worry about
   .2 < max(h_i) ≤ .5 — risky
   max(h_i) > .5 — too much leverage
36 (c) Megan Reif
• 37. II.A.1.b Hat Matrix and Leverage: Outlier Influence of X Values on Fitted Values, continued
• Use the predict command to create the hat values for each observation.
• Summarize (or calculate) to get the mean.
• List the observations whose h values exceed 2·E(h). We see that 1978 and 1979 have leverage.
• We can graph the hat values against the values of the independent variable(s).

predict h, hat
summarize h
    Variable |  Obs       Mean   Std. Dev.        Min        Max
-------------+--------------------------------------------------
           h |   21   .0952381   .0638661   .0476762   .2652265
display 2/21
.0952381
list if h>2*.0952381
  9. | rev = 5424 | rexp = 5058 | dexp = 3115 | t = 78 | resid = -441.9235 | estu = -1.222458 | h = .2629347
 10. | rev = 5433 | rexp = 5571 | dexp = 3589 | t = 79 | resid =  63.27473 | estu =  .1685843 | h = .2652265
scatter h rev, mlabel(t)

• The leverage points are well above 0.2 and more than twice their mean. Recall that we identified from EDA that something might be different for 1978 and 1979. This means that too much of the sample's information about the X–Y relationship may come from a single case.
37 (c) Megan Reif
• 38. II.A.1.c The DFBETA Statistic (depends on X and Y values; measures how much a case i influences the coefficients; not a formal test statistic with a hypothesis test)
• The regression coefficient on X_k is b_k. Let b_k(i) represent the same coefficient when the ith case is deleted. Deleting the ith case therefore changes the coefficient on X_k by b_k − b_k(i). We can express this change in standard errors:
   DFBETAS_ik = (b_k − b_k(i)) / (s_e(i)/√RSS_k)
where s_e(i) represents the residual standard deviation with the ith case deleted, and RSS_k is the residual sum of squares from the auxiliary regression of X_k on all the other X variables (without deleting the ith case). The denominator therefore modifies the usual estimate of the standard error of the coefficient b_k by deleting the ith case. DFBETA can also be expressed in terms of the hat statistic (see DeNardo).
• Interpreting the direction of influence with DFBETAS:
   If DFBETAS_ik > 0, case i increases the magnitude of b_k.
   If DFBETAS_ik < 0, case i decreases the magnitude of b_k.
• The size of influence: DFBETAS tells us “by how many standard errors does the coefficient change if we drop case i?”
• A DFBETA of +1.34, for example, means that if case i were deleted, the coefficient for regressor k would be 1.34 standard errors lower.
38 (c) Megan Reif
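DFBETAS can be computed by brute-force case deletion, which is what the definition describes (Stata's dfbeta command gets the same numbers without refitting). A Python sketch on made-up data, using the equivalent standard-error form with the full-data (X'X)^(-1) diagonal; the planted influential point is mine:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(12.0)
y = 1 + 2 * x + rng.normal(0, 1, size=12)
y[11] += 10.0                               # plant an influential point at large x
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape
b = np.linalg.lstsq(X, y, rcond=None)[0]
XtX_inv = np.linalg.inv(X.T @ X)

def dfbetas(X, y, b, k):
    """DFBETAS_ik = (b_k - b_k(i)) / (s(i) * sqrt[(X'X)^(-1)_kk])."""
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]  # drop case i, refit
        resid = y[keep] - X[keep] @ b_i
        s_i = np.sqrt(resid @ resid / (n - 1 - p))              # s(i)
        out[i] = (b[k] - b_i[k]) / (s_i * np.sqrt(XtX_inv[k, k]))
    return out

d = dfbetas(X, y, b, k=1)                   # DFBETAS for the slope
flagged = np.abs(d) > 2 / np.sqrt(n)        # rule-of-thumb cutoff 2/sqrt(n)
print(int(np.argmax(np.abs(d))))            # the planted case dominates
```

Only the planted case should exceed the 2/√n cutoff here, which mirrors how the Tanzania example flags 1978 and 1980.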
• 39. II.A.1.c The DFBETA Statistic
• Stata's dfbeta command creates the DFBETA statistic for each of the regressors in the model; then list them for all of our observations. A rule of thumb for large datasets, where listing and inspecting all of the DFBETA values would be difficult, is to inspect all DFBETAs in excess of 2/sqrt(n).
• Since DFBETAs are obtained by case-wise deletion, they do not account for situations where a number of observations cluster together, jointly pulling the regression line in a direction, but not individually showing up as influential. You should not rely solely on DFBETA, then, to test for influence. A histogram of DFBETA can reveal groups of influential cases (the one displayed uses reference lines at ±2/sqrt(n) = ±.4364). Two observations fall outside the safe range.

dfbeta
_dfbeta_1: dfbeta(rev)
list _dfbeta_1
     | _dfbeta_1 |
  1. |  .1004588 |
  2. |  .0401034 |
  3. |  .0582458 |
  4. | -.0044781 |
  5. |  .1422126 |
  6. | -.1596971 |
  7. | -.1744603 |
  8. | -.1502057 |
  9. | -.6607218 |
 10. |  .0917439 |
 11. |  .5789033 |
 12. |  .1527179 |
 13. |  -.059624 |
 14. | -.0800557 |
 15. | -.0528694 |
 16. | -.0164145 |
 17. |  .1149248 |
 18. |  .0945607 |
 19. |  -.064976 |
 20. | -.0342036 |
 21. |  .0475227 |
display 2/sqrt(21)
.43643578
histogram _dfbeta_1, bin(10) frequency xline(-.4364 .4364) xlabel(#10)
(bin=10, start=-.66072184, width=.12396252)
39 (c) Megan Reif
• 40. II.A.1.c The DFBETA Statistic
scatter _dfbeta_1 t, ylabel(-1(.5)1) yline(.4364 -.4364) mlabel(t)
list t _dfbeta_1 rev rexp if t==78 | t==80
     |  t   _dfbeta_1    rev   rexp |
     |------------------------------|
  9. | 78   -.6607218   5424   5058 |
 11. | 80    .5789033   4506   5627 |
• Now that we know there are two potential observations to worry about, it is useful to use another plot to identify which they are (this is most useful in multivariate regression—it is rather obvious in the single-regressor case).
• We see that 1978 and 1980 are influential.
• Note that 1978 and 1979 had leverage, but only 1978 is also influential; 1980 is influential but did not have leverage (review slide 23).
• 1978 decreases the coefficient on revenue by .66 standard errors and 1980 increases it by .58 standard errors.
40 (c) Megan Reif
• 41. II.A.1.d Influence of a Case on the Model as a Whole (Cook's Distance and DFFITS Statistics)
• Returning to the hat statistic: if we want to know the effect of case i on the predicted values, we can use the DFFITS statistic, which does not depend on the coordinate system used to form the regression model.
• The change in fit when case i is deleted is
   DFFIT_i = ŷ_i − ŷ_i(i) = x_i[b − b(i)] = h_i·e_i/(1 − h_i)
To scale the measure, one can divide by the standard deviation of the fit, s(i)·√h_i, where s²(i) is our estimate of the variance with observation i deleted:
   DFFITS_i = √(h_i/(1 − h_i)) · e_i/(s(i)·√(1 − h_i)) = r*_i·√(h_i/(1 − h_i))
This is intuitive in that the first term increases the greater the hat statistic (and therefore the leverage) for case i, and the second term increases the larger the studentized residual (outlier).
• Then we want to know the scaled change in fit for the values other than the ith case:
   [x_j(b − b(i))]/(s(i)·√h_j) = h_ij·e_i/((1 − h_i)·s(i)·√h_j)
The absolute value of this change in fit for the remaining cases will be less than the absolute value of the change attributed to the ith fitted value itself when the ith observation is deleted.
• DFFITS_i is the number of standard errors that the fitted value for case i changes if the ith observation is deleted from the data.
• Rule-of-thumb cutoff values for small to medium sized datasets are to inspect observations whose DFFITS exceed the following (and to run the regression without those observations to see by how much the coefficient estimates change):
   Small to medium datasets: |DFFITS_i| > 1
   Large datasets: |DFFITS_i| > 2·√((k+1)/n)
41 (c) Megan Reif
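The closed-form DFFITS (leverage times studentized residual) can be checked against a literal delete-and-refit computation. A Python sketch on made-up data; the deletion identity for s(i) is a standard algebraic shortcut, not something from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.arange(15.0)
y = 3 + x + rng.normal(0, 1, size=15)
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
rss = e @ e

# s(i)^2 via the deletion identity: s(i)^2 = (RSS - e_i^2/(1-h_i)) / (n - p - 1)
s_i = np.sqrt((rss - e**2 / (1 - h)) / (n - p - 1))
rstar = e / (s_i * np.sqrt(1 - h))          # externally studentized residuals
dffits = rstar * np.sqrt(h / (1 - h))       # DFFITS_i = r*_i * sqrt(h_i/(1-h_i))

# Cross-check one case the long way: delete case 0, refit, compare fitted values.
keep = np.arange(n) != 0
b_full = np.linalg.lstsq(X, y, rcond=None)[0]
b0 = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
direct = (X[0] @ b_full - X[0] @ b0) / (s_i[0] * np.sqrt(h[0]))
print(np.isclose(dffits[0], direct))        # True: closed form matches deletion
```

Stata's `predict dffit, dfits` reports the same statistic for every observation at once.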
• 42. DFFITS and Hat vs. DFFITS Plot (from the Tanzania Model)

predict dffit, dfits
list t rev rexp dffit

      +------------------------------+
      |  t    rev   rexp       dffit |
      |------------------------------|
  1.  | 70   3243   3304   -.1932092 |
  2.  | 71   3497   3569    -.143909 |
  3.  | 72   3426   3480   -.1642726 |
  4.  | 73   3756   3809   -.1293692 |
  5.  | 74   3409   3122   -.3824875 |
      |------------------------------|
  6.  | 75   4169   3891   -.3301978 |
  7.  | 76   4482   4352   -.2539932 |
  8.  | 77   4498   4417    -.216292 |
  9.  | 78   5424   5058   -.7301378 |
 10.  | 79   5433   5571    .1012859 |
      |------------------------------|
 11.  | 80   4506   5627    .8291757 |
 12.  | 81   4112   4932    .3522714 |
 13.  | 82   3603   4594    .3838607 |
 14.  | 83   3470   4261    .2597122 |
 15.  | 84   2972   3476    .0768232 |
      |------------------------------|
 16.  | 85   2746   3202    .0211414 |
 17.  | 86   2623   2929   -.1417075 |
 18.  | 87   2549   2899   -.1141464 |
 19.  | 88   2906   3431    .0905055 |
 20.  | 89   3544   4149    .1518263 |
      |------------------------------|
 21.  | 90   3928   4558    .1956947 |
      +------------------------------+

scatter h dffit, mlabel(t)

• No observation has a DFFITS statistic larger than 1 in this small dataset. The largest is .8291757, for 1980.
• Note that as a function of hat and the studentized residuals, DFFITS_i = r*_i √[h_i/(1 - h_i)] is a kind of measure of OUTLIERNESS*LEVERAGE.
• A graphical alternative to the influence measures is to plot hat against the studentized residuals and look for observations for which both are big (only 1979 approaches this criterion, but it is well under the DFFITS cutoff).

[Figure: leverage vs. Dfits, years 1970-1990 labeled]

(c) Megan Reif
• 43. II.A.1.d Influence of a Case on the Model as a Whole (Cook's Distance)
• Cook's D is similar to the DFFITS statistic, but DFFITS gives relatively more weight to leverage points, since it shows the effect on an observation's own fitted value when that observation is dropped.
• Cook's Distance "tests the hypothesis" that the true slope coefficients are equal in the aggregate to the slope coefficients estimated with observation i deleted (H0: β = b(i)). It is more a rule of thumb that produces a measure of distance independent of how the variables are measured than a formal F-test: a point is influential if D_i exceeds the median of the F distribution with k + 1 and n - k - 1 degrees of freedom [F(k+1, n-k-1)(.5)].
• Since the internally studentized residual is r_i = e_i / [s√(1 - h_i)], Cook's D can be rewritten as

  D_i = [e_i² / ((k + 1)s²)] · h_i / (1 - h_i)²  =  [r_i² / (k + 1)] · h_i / (1 - h_i)

• Observations with larger D values than the rest of the data are those that have unusual leverage.
• While there are numerical rules for assessing Cook's D, authors differ in their advice. Some argue that it is best to graph the Cook's D values to see whether any one or two points have a much bigger D_i than the others.

(c) Megan Reif
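The two forms of Cook's D (the definitional "aggregate shift in all fitted values" and the studentized-residual shortcut) can be verified against each other. A numpy sketch with simulated data, writing p for the number of estimated parameters (the k + 1 of the slide):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1 + 0.5 * x + rng.normal(size=n)
p = X.shape[1]                                   # parameters incl. intercept

b = np.linalg.lstsq(X, y, rcond=None)[0]
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
e = y - X @ b
s2 = e @ e / (n - p)

# Definitional form: aggregate shift in all fitted values when case i is dropped
D_def = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    D_def[i] = np.sum((X @ (b - b_i)) ** 2) / (p * s2)

# Shortcut form: internally studentized residual times the leverage ratio
r = e / np.sqrt(s2 * (1 - h))
D_short = r ** 2 / p * h / (1 - h)

print(np.allclose(D_def, D_short))  # True
```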
• 44. Cook's D, Continued

predict cooksd, cooksd

• We can then look up the median value of the F-distribution with k + 1 numerator and n - k - 1 denominator degrees of freedom:

display invFtail(2,19, .5)
.71906057

list t rev rexp if cooksd>.71906057

For the Tanzania data, no observations are this large.

• Some authors suggest looking at the five most influential observations, which can be done in Stata by sorting on cooksd and listing the last five (NOTE: the last term of the list command is a lowercase "L", for last observation):

sort cooksd
list t rev rexp cooksd dffit _dfbeta_1 in -5/l

      +-----------------------------------------------------+
      |  t    rev   rexp     cooksd       dffit   _dfbeta_1 |
      |-----------------------------------------------------|
 17.  | 81   4112   4932   .0589684    .3522714    .1527179 |
 18.  | 82   3603   4594   .0670656    .3838607    -.059624 |
 19.  | 74   3409   3122    .067792   -.3824875    .1422126 |
 20.  | 78   5424   5058   .2597905   -.7301378   -.6607218 |
 21.  | 80   4506   5627   .2642971    .8291757    .5789033 |
      +-----------------------------------------------------+

(c) Megan Reif
• 45. Proportional Plots for Influence Statistics
• It is useful to graph Cook's D and DFFITS with residual vs. fitted plots, with symbols proportional to the size of Cook's D. First we have to predict the fitted values:

predict yhat
(option xb assumed; fitted values)

• Then weight the symbols by the value of the influence statistic of interest:

graph twoway scatter resid yhat [aweight = cooksd], msymbol(Oh) yline(0) saving(Dprop)

NOTE: The proportional plot with weights disallows labeling, so I create two versions, one with labels and one with proportions, and use ppt to overlay them:

graph twoway scatter resid yhat [aweight = cooksd], mlabel(t) yline(0) saving(Dlabe

• We can also plot the studentized residuals vs. HAT (leverage, not the fitted values), with symbols proportional to Cook's D, to look at outlierness, leverage, and influence at the same time. Same command as above, except the variables are estu h (or whatever you have named your studentized residuals and hat).

[Figures: residuals vs. fitted values, and studentized residuals vs. leverage, with markers weighted by Cook's D]

(c) Megan Reif
• 46. II.A.1.d Influence of a Case on the Precision of the Estimates (COVRATIO)
• Recall that by increasing the variance of one or more Xs, a high-leverage observation will decrease the standard error of the coefficient(s), even if it does not influence their magnitude. Though this may be considered beneficial, it may also exaggerate our confidence in our estimate, especially if we don't know whether the high-leverage outlier is representative of the population distribution, or due entirely to stochastic factors or error (sampling, coding, etc.; that is, a true outlier).
• Using the COVRATIO statistic, we can examine the impact of deleting each observation in turn on the size of the joint confidence region (in n-space) for β. The size of this region is analogous to the length of the confidence interval for an individual coefficient, which is proportional to its standard error: the squared length of a CI is proportional to the sampling variance of b, and the squared size of a joint confidence region is proportional to the variance of a set of coefficients (the "generalized variance") (Fox 1991, 31; see Belsley et al. for the derivation, pp. 22-24).

  COVRATIO_i = 1 / { [(n - k - 2 + r*_i²) / (n - k - 1)]^(k+1) · (1 - h_i) }

(c) Megan Reif
• 47. COVRATIO
• Look for values that differ substantially from 1.
• A small COVRATIO (below 1) means that the generalized variance of the model would be SMALLER without the ith observation (i is reducing the precision of the estimates).
• A big COVRATIO (above 1) means the generalized variance would be LARGER without the ith case, but if it is a high-leverage point, it may be making us overly confident in the precision of our estimated coefficients.
• Belsley et al. suggest that a COVRATIO should be examined when:

  |COVRATIO_i - 1| ≥ 3(k + 1) / n

(c) Megan Reif
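COVRATIO is, by definition, the ratio of the generalized variance (determinant of the estimated coefficient covariance matrix) without case i to that with case i; the closed form on the previous slide follows from that definition. A numpy sketch on simulated data confirming the two agree:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 24, 2                      # k regressors plus an intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
p = k + 1

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
s2 = e @ e / (n - p)

cov_det = np.empty(n)
cov_closed = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    Xi, yi = X[keep], y[keep]
    b_i = np.linalg.lstsq(Xi, yi, rcond=None)[0]
    ei = yi - Xi @ b_i
    s2_i = ei @ ei / (n - 1 - p)
    # ratio of generalized variances, without vs. with case i
    cov_det[i] = np.linalg.det(s2_i * np.linalg.inv(Xi.T @ Xi)) / \
                 np.linalg.det(s2 * np.linalg.inv(X.T @ X))
    rstar2 = e[i] ** 2 / (s2_i * (1 - h[i]))     # squared ext. studentized residual
    cov_closed[i] = 1.0 / (((n - k - 2 + rstar2) / (n - k - 1)) ** (k + 1)
                           * (1 - h[i]))

print(np.allclose(cov_det, cov_closed))  # True
```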
• 48. COVRATIO Example

predict covratio, covratio
list t covratio rev rexp if abs(covratio-1)>(3*3)/21

      +-----------------------------+
      |  t   covratio    rev   rexp |
      |-----------------------------|
  4.  | 79   1.511605   5433   5571 |
      +-----------------------------+

• We see that the COVRATIO for 1979 is large and has therefore perhaps exaggerated our certainty.
• Plotting COVRATIO against hat reveals that 1979 has leverage, but plotted against DFFITS, we see it is not greater than one. 1979 does not affect the magnitude of our coefficient estimates, but it may affect our hypothesis testing and conclusions.

[Figures: leverage vs. COVRATIO and Dfits vs. COVRATIO, years labeled; 1979 stands apart on the COVRATIO axis]

(c) Megan Reif
• 49. A Summary of Tests / Statistics for Extreme Values (note sample-size dependence)

Studentized Residual
  r*_i = e_i / [s_(i)√(1 - h_i)]
  Use: outliers' effect on the fit. Critical values (higher than the usual t-test) are recommended for exploratory diagnosis. Rules of thumb: |r*_i| > 2 pay attention; |r*_i| > 2.5 cause for worry; |r*_i| > 3 cause for greatest concern.

Hat Statistic (h)
  h_i = 1/n + (x_i - x̄)² / Σ_j(x_j - x̄)²  (bivariate case); in general h_i = x_i(X'X)⁻¹x_i'
  Use: leverage. Bounded by 1/n and 1 (assumes no replicates; check this in survey data). A higher value means higher leverage (depends on X values only). Average h = (k + 1)/n; inspect cases where h_i > 2(k + 1)/n. max(h) ≤ .2 little to worry about; .2 < max(h) ≤ .5 risky; max(h) > .5 too much leverage.

DFBETAS
  DFBETAS_ik = (b_k - b_k(i)) / [s_(i)√((X'X)⁻¹_kk)]
  Use: influence of a case on a particular coefficient; calculate for each regressor. Rule of thumb: under 2/√n the point has no influence; over, the point is influential (depends on both X AND Y values). The value of DFBETAS is the number of s.e.s by which case i increases or decreases the coefficient for regressor k: if DFBETAS_ik > 0, case i increases b_k; if DFBETAS_ik < 0, case i decreases b_k.

Cook's Distance
  D_i = [r_i² / (k + 1)] · h_i / (1 - h_i)
  Use: influence of a case on the model; a measure of the aggregate impact of the ith case on the group of regression coefficients as well as the group of fitted values (sometimes called the forecasting effect). A point is influential if D_i exceeds the median of the F distribution with k + 1 and n - k - 1 degrees of freedom [F(k+1, n-k-1)(.5)].

DFFITS
  DFFITS_i = r*_i √[h_i / (1 - h_i)]
  Use: influence of a case on the model; the number of s.e.s by which the fitted value ŷ_i changes if the ith observation is deleted. Small/medium datasets: |DFFITS_i| > 1; large datasets: |DFFITS_i| > 2√[(k + 1)/n].

COVRATIO
  COVRATIO_i = 1 / { [(n - k - 2 + r*_i²) / (n - k - 1)]^(k+1) · (1 - h_i) }
  Use: influence of a case on the model standard errors; measures how the precision of the parameter estimates (generalized variance) changes with removal of the ith observation. Inspect if |COVRATIO_i - 1| ≥ 3(k + 1)/n.

Note: n is the sample size; k is the number of regressors; the subscript (i) (i.e., with parentheses) indicates an estimate from the sample omitting observation i. In each case you should use the absolute value of the calculated statistic.

(c) Megan Reif
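None of the deletion statistics summarized above actually requires refitting the model n times: each follows in one pass from the residuals, the hat values, and the identity s²_(i) = [(n - p)s² - e_i²/(1 - h_i)]/(n - p - 1). A numpy sketch (the function name is invented for illustration):

```python
import numpy as np

def influence_stats(X, y):
    """Closed-form leave-one-out diagnostics in one pass; no refitting."""
    n, p = X.shape                    # p = k regressors + intercept
    k = p - 1
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    s2 = e @ e / (n - p)
    s2_i = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)   # deleted variance
    rstar = e / np.sqrt(s2_i * (1 - h))                    # externally studentized
    r = e / np.sqrt(s2 * (1 - h))                          # internally studentized
    return {"h": h,
            "rstar": rstar,
            "dffits": rstar * np.sqrt(h / (1 - h)),
            "cooksd": r**2 / p * h / (1 - h),
            "covratio": 1 / (((n - k - 2 + rstar**2) / (n - k - 1))**p * (1 - h))}

rng = np.random.default_rng(6)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=20)
st = influence_stats(X, y)
print(sorted(st))
```

Any single statistic can be spot-checked against a brute-force refit that drops one row, as in the earlier sketches.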
• 50. III. Plots to Identify Extreme Values
• EXAMPLE: Model from Mukherjee et al. of crude birth rate as a function of:
  – GNP per capita (logged, per Feb 18 notes and general practice for such variables)
  – IM: infant mortality
  – URBAN: percent of population urban
  – HDI: human development index (from the WB Human Development Report 1993)

  BIRTHrt_i = β0 + β1 lnGNPC_i + β2 HDI_i + β3 IM_i + β4 URBAN_i + ε_i

regress birthr lngnp hdi infmor urbanpop

      Source |       SS       df       MS              Number of obs =     110
-------------+------------------------------           F(  4,   105) =  129.19
       Model |  16552.2585     4  4138.06462           Prob > F      =  0.0000
    Residual |  3363.19755   105  32.0304528           R-squared     =  0.8311
-------------+------------------------------           Adj R-squared =  0.8247
       Total |   19915.456   109  182.710606           Root MSE      =  5.6595

------------------------------------------------------------------------------
      birthr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       lngnp |  -.2138487   .7960166    -0.27   0.789    -1.792203    1.364505
         hdi |  -24.50566   7.495152    -3.27   0.001    -39.36716   -9.644157
      infmor |    .111157   .0396176     2.81   0.006     .0326026    .1897115
    urbanpop |   .0111358   .0396627     0.28   0.779    -.0675081    .0897797
       _cons |   39.56958   6.599771     6.00   0.000     26.48346    52.65571
------------------------------------------------------------------------------

(c) Megan Reif
• 51. III. Plots to Identify Extreme Values: A. Leverage vs. Normalized Squared Residual Plots

lvr2plot, mcolor(green) msize(vsmall) mlabel(cid) mlabcolor(black)

• This plot squares the NORMALIZED residuals (each residual scaled so that the squared values sum to one) to make them more conspicuous in the plot (these are not the same as the externally studentized residuals).
• Remember that we are worried about observations with HIGH LEVERAGE but LOW RESIDUALS, which indicates potential influence.
• What we would like to see: a ball of points evenly spread around the intersection of the two means, with no points discernibly far out in any direction, and no leverage point above 0.2 with a low residual (to the left of the mean normalized squared residual line).
• The vertical line represents the average squared normalized residual and the horizontal line represents the average hat (leverage) value.
• Points with high leverage and low residuals will lie to the left of the mean of the squared residual (x-axis) and above the mean of hat, which we should worry about if hat is above .2, and really worry about if it is above .5.

(c) Megan Reif
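The flagging logic of the plot (high leverage, low residual) is easy to reproduce numerically. A numpy sketch with invented data; the x-axis definition, e_i²/RSS, is an assumption about how the normalized squared residual is computed, so check your package's documentation:

```python
import numpy as np

def lvr2_flags(X, y, h_cut=0.2):
    """Flag high-leverage / low-residual cases in the spirit of lvr2plot.
    Normalized squared residual taken as e_i^2 / RSS (an assumption)."""
    n, p = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    nr2 = e**2 / (e @ e)                        # squares sum to one
    worry = (h > h_cut) & (nr2 < nr2.mean())    # upper-left region of the plot
    return h, nr2, worry

rng = np.random.default_rng(4)
x = rng.normal(size=40)
x[0] = 8.0                       # plant a high-leverage point that stays on trend
X = np.column_stack([np.ones(40), x])
y = 1 + 0.3 * x + rng.normal(scale=0.5, size=40)
h, nr2, worry = lvr2_flags(X, y)
print(h[0], worry[0])
```

The planted point's leverage is far above the 0.2 threshold even though its residual is unremarkable, which is exactly the configuration the slide warns about.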
• 52. Leverage vs. Normalized Squared Residual Plots

[Figure: lvr2plot with annotated regions]
  – Lower right: outlier but low leverage.
  – Upper right: high leverage, high residual (might be reducing our standard errors, but not above the risky .2 level; may want to look at COVRATIO).
  – Upper left: examine further if leverage is above 0.2 in this region.

• Based on this plot, the potential for points with high influence on our coefficients is low. There are no points that meet the high-leverage, low-residual criteria, individually or as a group.

(c) Megan Reif
• 53. III. Plots to Identify Extreme Values: B. Added-Variable Plots (also known as partial regression leverage plots, adjusted partial residual plots, adjusted variable plots, and individual coefficient plots)

avplots, rlopts(lcolor(red)) mlabel(cid) mlabsize(tiny) msize(small)

• Each plot graphs the residuals from regressing y on all of the regressors EXCEPT x_k (shown on the y-axis, e[y | X_k-1]) against the ordinary residuals from regressing the EXCLUDED x_k on the remaining independent variables (shown on the x-axis, e[x_k | X_k-1]).
• Helps to uncover observations exerting a disproportionate influence on the regression model by showing how each coefficient has been influenced by particular observations.
• The regression slopes we see in the plots are the same as the original multiple regression coefficients for the regression y = Xb.
• What we would like to see: a scatter of points even around the line in each plot; the "noise" or size of the cloud and the spacing around the line need not concern us, but points very far from the rest should be examined.
• Cause for concern: recall the bivariate examples from the first part of the notes; you are looking for values extreme in X (horizontal axis) with unusual/out-of-trend y-values. Pay most attention to the theoretical variable(s) of interest and whether your conclusions and/or statistical significance would change without the observation.
• Utility of the graph: DFBETA will give a much more precise assessment of the change in magnitude of the coefficient in the absence of an influence point, but the graph can identify clusters of points that might be jointly influential.
• Cautions: pay attention to the SCALE of the axes reported in your computer output; a point may look like an outlier but in reality be part of a cloud of points on which we are "zoomed in" rather close. If you have doubts about the reliability of "eyeballing" the plot, you can re-run the regression leaving out the influential point and compare the change in slope, but be sure to use commands that will retain the original scale of the output so you can compare the changes (see slides 54-55). Some books recommend this plot for deciding whether to include or discard variables; it is BETTER TO BASE THIS DECISION ON THEORY and the techniques discussed previously.

(c) Megan Reif
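The claim that the slope through the added-variable cloud equals the multiple regression coefficient is the Frisch-Waugh-Lovell theorem, and it can be demonstrated directly. A numpy sketch with simulated data (column index and data invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

kx = 1                                   # column whose added-variable plot we build
others = [c for c in range(X.shape[1]) if c != kx]
Xo = X[:, others]

# residuals of y on the other regressors (the plot's vertical axis)
ey = y - Xo @ np.linalg.lstsq(Xo, y, rcond=None)[0]
# residuals of the excluded regressor on the others (horizontal axis)
ex = X[:, kx] - Xo @ np.linalg.lstsq(Xo, X[:, kx], rcond=None)[0]

slope_av = (ex @ ey) / (ex @ ex)         # slope through the added-variable cloud
b_full = np.linalg.lstsq(X, y, rcond=None)[0][kx]
print(np.isclose(slope_av, b_full))  # True (Frisch-Waugh-Lovell)
```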
• 54. Partial Regression Leverage Plots

[Figure: added-variable plots for the four regressors, with country-code labels:
  – e(birthr | X) vs. e(lngnp | X): coef = -.21384874, se = .79601659, t = -.27
  – e(birthr | X) vs. e(hdi | X): coef = -24.505659, se = 7.4951522, t = -3.27
  – e(birthr | X) vs. e(infmor | X): coef = .11115703, se = .03961763, t = 2.81
  – e(birthr | X) vs. e(urbanpop | X): coef = .01113578, se = .03966274, t = .28]

(c) Megan Reif
• 55. Partial Regression Leverage Plots, Compared Across Model Runs

[Figures: added-variable plots for the original model and for re-runs dropping single observations (judging by the plotted labels, one re-run omits OMN and one omits SEN):
  – Original model: lngnp coef = -.21384874 (se = .79601659, t = -.27); hdi coef = -24.505659 (se = 7.4951522, t = -3.27); infmor coef = .11115703 (se = .03961763, t = 2.81); urbanpop coef = .01113578 (se = .03966274, t = .28)
  – Excluding OMN: lngnp coef = -.91084738 (se = .7842531, t = -1.16); hdi coef = -22.281675 (se = 7.1637796, t = -3.11); infmor coef = .12391559 (se = .0378934, t = 3.27); urbanpop coef = .05291099 (se = .03965302, t = 1.33)
  – Excluding SEN: lngnp coef = -.24870177 (se = .80570718, t = -.31); hdi coef = -23.625004 (se = 7.9461429, t = -2.97); infmor coef = .11481757 (se = .04116964, t = 2.79); urbanpop coef = .00935766 (se = .04016079, t = .23)]

• Note that Senegal looked like a possible outlier, but it was of the good sort and it wasn't particularly extreme relative to the scale of values shown. The coefficient changes little and the SE increases slightly without it (indicating it was contributing to the fit somewhat).

(c) Megan Reif
• 56. [Figure: added-variable plots for the model excluding OMN, country-code labels shown: lngnp coef = -.91084738 (se = .7842531, t = -1.16); hdi coef = -22.281675 (se = 7.1637796, t = -3.11); infmor coef = .12391559 (se = .0378934, t = 3.27); urbanpop coef = .05291099 (se = .03965302, t = 1.33)]

(c) Megan Reif
• 57. III. Plots to Identify Extreme Values: C. Star Plots for Outliers, Leverage, and Generalized Model Influence

display invFtail(5,105, .5)
.87591656

(Use the above command to get the cutoff for Cook's D, then use it with the other rules of thumb to choose observations to display.)

graph7 estu h cooksd if abs(estu) > 2 & h > .2 & cooksd > .87591656, star
graph7 estu h cooksd, star label(cid) select(88, 108)

NOTES: This is an old but working Stata 7 command; search "graph7" for the help file. The variable, and thus the direction associated with each line, depends on the order listed in the command.

• The scaling of a star chart is a function of all the stars. Selecting just a few to be displayed still maintains the scaling based on all the observations and variables.
• In our example model, no observations meet all three criteria for influence, so instead I tell Stata to select some observations that include Senegal and Oman to show what the plot looks like (done by selecting observations 88-108).
• What we want to see: a dot, OR a line in the outlier direction and/or a line in the leverage direction, with no or only a tiny line in the influence direction.
• Look for longer lines in the influence direction (pointing lower left) or the leverage direction (lower right).

(c) Megan Reif
• 58. III.C. Star Plots for DFBETAS (Individual Coefficient Influence)

display 5/sqrt(110)
.47673129

(Use the above command to get the cutoff for DFBETA.)

graph7 _dfbeta_1 _dfbeta_2 _dfbeta_3 _dfbeta_4 if abs(_dfbeta_1) > .4767 | abs(_dfbeta_2) > .4767 | abs(_dfbeta_3) > .4767 | abs(_dfbeta_4) > .4767, star label(cid)

NOTE: we have to create new variables to ensure graphing of absolute values, so we do not know from the star plot whether a point increases or decreases a coefficient:

gen dflngnp=abs(_dfbeta_1)
gen dfhdi=abs(_dfbeta_2)
gen dfinform=abs(_dfbeta_3)
gen dfurban=abs(_dfbeta_4)
graph7 dflngnp dfhdi dfinform dfurban, star label(cid) select(88, 108)

• The scaling of a star chart is a function of all the stars. Selecting just a few to be displayed still maintains the scaling based on all the observations and variables.
• In our example model, only OMAN meets ANY of the criteria for influence, so let's select some observations to show what the plot looks like (a good reminder to use the statistics and rules of thumb in addition to eyeballing). Only OMAN is influential on all the coefficients at a level above the cutoff point for DFBETAS. OMAN is an oddity: lots of oil, a relatively small Omani population, high birth rates, and a great deal of social development spending, raising HDI despite a largely rural population. How would you model this without deleting Oman?
• What we want to see: a dot, or tiny lines in ALL directions.
• Look for longer lines in ANY direction.

(c) Megan Reif
  • 59. A Summary of COMMON DIAGNOSTIC PLOTS to identify potential extreme values
1. Leverage (h, y-axis) v. Squared Normalized Residual (x-axis): lvr2plot
Preferred appearance: Scatter evenly spread around the intersection of the two means; no points to the left of the mean normalized squared residual line (upper LEFT quadrant).
Use: Potential influence on (1) ALL coefficients and (2) standard errors.
Interpretation: The vertical line represents the average squared normalized residual and the horizontal line the average hat value. (1) IDENTIFY POINTS in the RED AREA: high leverage AND low residual, when leverage is greater than 0.2. (2) POINTS in the upper RIGHT quadrant are high leverage (>0.2) AND high residual; not influential on b, but they may diminish SEs and overstate certainty.
2. Partial Regression Leverage Plots (also called Added-Variable Plots): avplots
Preferred appearance: Scatter (loose or tight) of points spread evenly around the line in each plot.
Use: Potential influence on EACH coefficient.
Interpretation: Residuals from regressing y on all the X's EXCEPT one (y-axis) v. residuals from regressing the EXCLUDED xi on the remaining Xk-1 variables (x-axis). Look for points extreme in X with unusual residual values. CAUTIONS: (a) Verify points identified through "eyeballing" with DFBETAS. (b) Pay attention to the scale of the plots; stretched or compacted displays mislead.
3. Star Plots
(a) Outliers, Leverage, & Model Influence (Cook's D): gr7 estu h cooksd, star
(b) Coefficient Influence (DFBETAs): gr7 dfx1 dfx2 dfxn, star
Preferred appearance: (a) A dot, OR a line in the outlier direction and/or the leverage direction, with no or only a tiny line in the influence direction. (b) A dot (lines in one or more directions indicate possible influence on one or more coefficients).
Use: (a) Multivariate outliers, leverage, and/or influence points. (b) Potential influence on EACH coefficient.
Interpretation: (a) Look for longer lines in the outlier direction (pointing lower left) or the leverage direction (lower right). (b) Look for a longer line in any direction for DFBETA, one line per coefficient.
NOTES: 1. graph7 is the retained old Stata 7 graph command; search "graph7" for its help file. 2. The variable associated with each line depends on the order listed in the command.
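The two axes of lvr2plot are easy to reproduce by hand, which makes the quadrant logic above concrete. A small numpy sketch (again a Python stand-in for the Stata command, with made-up data):

```python
import numpy as np

def lvr2_axes(X, y):
    """The two axes of Stata's lvr2plot: hat values (leverage) on y,
    normalized squared residuals on x. X must include a constant column."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat (projection) matrix
    h = np.diag(H)                            # leverage of each observation
    e = y - H @ y                             # ordinary residuals
    r2 = e**2 / np.sum(e**2)                  # normalized squared residuals
    return h, r2

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=30)
h, r2 = lvr2_axes(X, y)
flagged = np.where(h > 0.2)[0]                # the slides' leverage rule of thumb
```

Hat values always sum to the number of parameters k, so the mean-leverage reference line in lvr2plot sits at k/n.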
  • 60. Cautions about Extreme Value Procedures
• One weakness of DFFITS and similar single-case statistics is that they will not always detect two similar outliers: deleting either point alone changes little because the other remains, but deleted together they are influential (masking).
• A cluster of outliers may indicate that the model was wrongly applied to that set of points. Partial regression plots and other graphical methods may be better for finding such clusters than individual diagnostic statistics such as DFBETA. Both types of postestimation should be conducted.
• A single outlier may indicate a typing error, a special missing-data code (such as 999) that was ignored, or suggest that the model does not account for important variation in the data. Only delete or change an observation if it is an obvious error, like a person being 10 feet tall, or a negative geographical distance.
• These procedures should not be abused to remove points in order to effect a desired change in a coefficient or its standard error! "An observation should only be removed if it is shown to be uncorrectably in error. Often no action is warranted, and when it is, the action should be more subtle than deletion….the benefits obtained from information on influential points far outweigh any potential danger" (Belsley et al., 16).
• Think about non-linear or other specifications that might model the outliers directly. Outliers may present a research opportunity: do the outliers have anything in common?
• Often the most that can be done is to report the results both with and without the outlier (perhaps with one set of results in an appendix). The exception is the case of extreme x-values: it is possible to reduce the range over which your predictions will be valid (e.g., only OECD countries, only the EU, only low-income countries, etc.). It is fine to say your height-weight relationship is only usable for those between 5'5" and 6'5", for example, or that your model only applies to advanced industrialized democracies.
  • 61. RESOURCES
• UCLA Stata Regression Diagnostic Steps (good examples of data with problems):
– http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm
– http://www.ats.ucla.edu/stat/stata/examples/alsm/alsm9.htm
– http://www.ats.ucla.edu/stat/stata/examples/ara/arastata11.htm
• Belsley, D. A., E. Kuh, and R. E. Welsch (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.
• Cook, R. D. and S. Weisberg (1982). Residuals and Influence in Regression. New York: Chapman and Hall.
• Fox, J. (1991). Regression Diagnostics. Newbury Park: Sage Publications.
• Hamilton, L. C. (1992). Regression with Graphics: A Second Course in Applied Statistics. Pacific Grove, CA: Brooks/Cole Publishing Company.
– Also has an excellent chapter on pre-estimation graphical inspection of data
– Includes a section on post-estimation diagnostics for logit
• For regression diagnostics for survey data (weighting for surveys requires adjusted methods), see Li, J. and R. Valliant, "Influence Analysis in Linear Regression with Sampling Weights," 3330-3337, and Valliant, R., J. Li, et al. (2009). "Regression Diagnostics for Survey Data." Stata Conference, Washington, DC, Stata Users Group.
• Temple, J. (2000). "Growth Regressions and What the Textbooks Don't Tell You." Bulletin of Economic Research 52(3): 181-205. The paper discusses three econometric problems that are rarely given adequate discussion in textbooks: model uncertainty, parameter heterogeneity, and outliers.
  • 62. PS 699 Section March 25, 2010
Megan Reif, Graduate Student Instructor, Political Science
Professor Rob Franzese
University of Michigan
Regression Diagnostics
1. Diagnostics for Assessing (assessable) CLRM Assumptions
2. Diagnostics for Assessing Data Problems (e.g., Multicollinearity)
  • 63. I. Normal Distribution of Disturbances, ε, Can Only Be Evaluated Using the Estimate e
Step One: Histogram and box-plot of the ordinary residuals (the former is useful for detecting a multi-modal distribution of residuals, which suggests an omitted qualitative variable that divides the data into groups).
Step Two: Graphical methods. Tests exist for error normality, but visual methods are generally preferred.
Step Three: Q-Q plot of residuals vs. the normal distribution, and normal probability plot.
Background: What is a Q-Q (Quantile-Quantile) Plot?
– A Q-Q plot is a scatterplot that graphs the quantiles of one variable against the quantiles of a second variable.
– The quantiles are the data values in ascending order: the first coordinate pairs the lowest x1 value with the lowest x2 value, the second coordinate the next-lowest values of x1 and x2, and so on. (We graph a set of points with coordinates (X1i, X2i), where X1i is the ith-from-lowest value of X1 and X2i is the ith-from-lowest value of X2.)
– What we can learn from a Q-Q plot of two variables:
1. If the distributions of the two variables are similar in center, spread, and shape, the points will lie on the 45-degree diagonal line from the origin.
2. If the distributions have the same SPREAD and SHAPE but a different center (mean, median, …), the points will follow a straight line parallel to the 45-degree diagonal but not through the origin.
3. If the distributions have different spreads/variances and centers, but are similar in shape, the points will follow a straight line NOT parallel to the diagonal.
4. If the points do not follow a straight line, the distributions have entirely different shapes.
Two uses for Q-Q plots:
1. Compare two empirical distributions (useful to assess whether subsets of the data, such as different time periods or groups, share the same distribution or come from different populations).
2. Compare an empirical distribution against a theoretical distribution (such as the normal).
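The construction described above is just a pairing of sorted values, which is easy to see in code. A tiny numpy sketch (illustrative, not the Stata implementation):

```python
import numpy as np

def qq_points(x1, x2):
    """Empirical Q-Q plot coordinates: the i-th smallest value of x1
    paired with the i-th smallest value of x2 (equal sample sizes assumed)."""
    return np.sort(x1), np.sort(x2)

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = rng.permutation(3 + 2 * x1)   # same shape, but shifted center and spread
q1, q2 = qq_points(x1, x2)
# Same shape, different center and spread: the points fall on the straight
# line q2 = 3 + 2*q1, which is NOT the 45-degree diagonal -- case 3 above.
```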
  • 64. I.A. Q-Q Plot of Residuals vs. Normal Distribution
A. Residual Quantile-Normal Plot (also known as a probit plot or normal-quantile comparison plot of residuals)
1. Quantile-Normal Plot (qnorm): emphasizes the tails of the distribution.
2. Normal Probability Plot (pnorm): puts the focus on the center of the distribution.
• If the empirical distribution is identical to a normal distribution, we expect all points to lie on a diagonal line.
  • 65. Quantile-Normal Plot Interpretation Basics Source: Hamilton, Regression with Graphics, p. 16
  • 66. Quantile-Quantile Plot Diagnostic Patterns (description of point pattern: possible interpretation)
– Points on the 45-degree diagonal line from the origin: distributions similar in center, spread, and shape.
– Points on a straight line parallel to the 45-degree diagonal: same SPREAD and SHAPE but different center (mean, median, …); we never see e with a non-zero mean!
– Points follow a straight line NOT parallel to the diagonal: different spreads/variances and centers, but similar in shape.
– Points do not follow a straight line: distributions have different shapes.
– Vertically steep (closer to parallel to the y-axis) at top and bottom: heavy tails, outliers at low and high data values.
– Horizontal (closer to parallel to the x-axis) at top and bottom: light tails, fewer outliers.
– Two or more less-steep areas (horizontal, parallel to the x-axis) indicating higher-than-normal density, separated by a gap or steep climb (area of lower density): distribution is bi- or multi-modal (subgroups, different populations).
– All but a few points fall on a line, with some points vertically separated from the rest of the data: outliers in the data.
– Left end of pattern below the line; right end above the line: long tails at both ends of the data distribution.
– Left end of pattern above the line; right end below the line: short tails at both ends of the distribution.
– Curved pattern with slope increasing from left to right: data distribution is skewed to the right.
– Curved pattern with slope decreasing from left to right: data distribution is skewed to the left.
– Granularity, i.e. a staircase pattern (plateaus and gaps): data values have been rounded or are discrete.
  • 67. CONTINUING EXAMPLE (from March 18 Notes): Model from Mukherjee et al. of crude birth rate as a function of:
– GNP per capita (logged, per Feb 18 Notes and general practice for such variables)
– IM: infant mortality
– URBAN: percent of population urban
– HDI: human development index (from the WB Human Development Report 1993)

BIRTHrt = β0 + β1·lnGNPC + β2·HDI + β3·IM + β4·URBAN + ε

regress birthr lngnp hdi infmor urbanpop

      Source |       SS       df       MS              Number of obs =     110
-------------+------------------------------          F(  4,   105)  =  129.19
       Model |  16552.2585     4  4138.06462          Prob > F       =  0.0000
    Residual |  3363.19755   105  32.0304528          R-squared      =  0.8311
-------------+------------------------------          Adj R-squared  =  0.8247
       Total |   19915.456   109  182.710606          Root MSE       =  5.6595

------------------------------------------------------------------------------
      birthr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       lngnp |  -.2138487   .7960166    -0.27   0.789    -1.792203    1.364505
         hdi |  -24.50566   7.495152    -3.27   0.001    -39.36716   -9.644157
      infmor |   .111157    .0396176     2.81   0.006     .0326026    .1897115
    urbanpop |   .0111358   .0396627     0.28   0.779    -.0675081    .0897797
       _cons |   39.56958   6.599771     6.00   0.000     26.48346    52.65571
------------------------------------------------------------------------------
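The numbers in the Coef., Std. Err., and t columns come from the usual least-squares formulas; a compact numpy version, run here on synthetic data because the World Bank dataset itself is not reproduced in these notes:

```python
import numpy as np

def ols(X, y):
    """OLS coefficients, standard errors, and t statistics, mirroring
    the first three columns of Stata's regress output."""
    n, k = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    s2 = e @ e / (n - k)                               # (Root MSE)^2
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return b, se, b / se

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
y = 2 + 3 * x + rng.normal(scale=0.1, size=n)
X = np.column_stack([np.ones(n), x])
b, se, t = ols(X, y)                                   # b should be near (2, 3)
```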
  • 68. Quantile-Normal Plot
qnorm estu, grid mcolor(green) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) yline(-.1535893, lpattern(longdash) lcolor(cranberry)) caption(Red Dashed Line Shows Median of Studentized Residuals, size(vsmall)) legend(on)
  • 69. Normal Probability Plot
pnorm estu, grid mcolor(green) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on)
What does this granularity suggest?
  • 70. What Non-Normal Residuals Do to Your OLS Estimates and What to Do
• If the errors are not normally distributed:
– Efficiency decreases, and inference based on the t- and F-distributions is not justified, especially as sample size decreases.
– Heavy-tailed error distributions (more outliers) will result in great sample-to-sample variation (less generalizability).
– Normality is not required in order to obtain unbiased estimates of the regression coefficients.
• If you have not already transformed skewed variables, doing so may help, since a non-normal distribution of e may be caused by skewed X and/or Y distributions.
• Model re-specification may be required if there is evidence of granularity or multi-modality.
• Robust methods provide alternatives to OLS for dealing with non-normal errors.
  • 71. (Ordinary) Residual vs. Fitted Plot
CLRM:
• Heteroskedasticity (leads to inefficiency and biased standard error estimates)
• Residual non-normality (compounds inefficiency and undermines the rationale for t- and F-tests, casting doubt on the p-values reported in the output)
SPECIFICATION:
• Non-linearity in X-Y relationship(s)
  • 72. rvfplot, mcolor(green) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on)
Heteroskedasticity? The variance for smaller fitted values looks larger than for medium fitted values.
  • 73. Absolute Value of Residual v. Fitted (easier to see heteroskedasticity)
predict yhat
predict resid, resid
gen absresid=abs(resid)
graph twoway scatter absresid yhat, mcolor(green) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on)
  • 74. Note: Fox Recommends Using Studentized Residuals vs. Fitted Values (in example there is little difference) graph twoway scatter estu yhat, mcolor(green) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on)
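The estu variable used throughout these plots is the externally studentized residual (predict estu, rstudent in Stata): each residual divided by a sigma estimated with that observation deleted. A numpy sketch of the standard closed-form version, on illustrative data:

```python
import numpy as np

def rstudent(X, y):
    """Externally studentized residuals: e_i / (s_(i) * sqrt(1 - h_i)),
    where s_(i) is the residual s.d. with observation i deleted."""
    n, k = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    sse = e @ e
    s2_del = (sse - e**2 / (1 - h)) / (n - k - 1)   # exact leave-one-out sigma^2
    return e / np.sqrt(s2_del * (1 - h))

rng = np.random.default_rng(4)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
estu = rstudent(X, y)
```

The closed form avoids n separate refits, but it is identical to actually deleting each observation and recomputing sigma.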
  • 75. Residual v. Predictor Plot
• Heteroskedasticity: e varies with values of one or more Xs.
rvpplot lngnp, mcolor(green) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on) name(lngnp, replace)
rvpplot hdi, mcolor(red) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on) name(hdi, replace)
rvpplot infmor, mcolor(blue) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on) name(infmor, replace)
rvpplot urbanpop, mcolor(orange) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on) name(urbanpop, replace)
graph combine lngnp hdi infmor urbanpop
  • 77. Component-Plus-Residual Plot
• The component-plus-residual plot is also known as a partial residual plot. (The names partial-regression leverage plot, adjusted partial residual plot, and added-variable plot belong to avplot, covered earlier.)
• This plot shows the component of the fitted dependent variable attributable to a single independent variable (its coefficient times that variable), PLUS the residual for each observation from the FULL model, plotted against that independent variable.
• That is, it looks at one of the explained parts of Y, plus the unexplained part (e), plotted against an independent variable.
• CLRM: Heteroskedasticity
• Functional form / non-linearity
cprplot lngnp, mcolor(green) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on) name(lngnpcp, replace)
cprplot hdi, mcolor(red) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on) name(hdicp, replace)
cprplot infmor, mcolor(blue) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on) name(infcp, replace)
cprplot urbanpop, mcolor(orange) msize(small) msymbol(circle) mlabel(cid) mlabsize(tiny) mlabcolor(gs4) mlabangle(forty_five) legend(on) name(urbcp, replace)
graph combine lngnpcp hdicp infcp urbcp
  • 79. I. Autocorrelation
• Durbin-Watson test statistic
• Correlograms
• Semi-variograms
• Time plot
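Of the tools listed above, the Durbin-Watson statistic is the simplest to compute by hand: d = Σ(e_t − e_{t−1})² / Σe_t², with d near 2 indicating no first-order autocorrelation. A quick numpy illustration on made-up residual series:

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson d: near 2 = no first-order autocorrelation,
    near 0 = strong positive, near 4 = strong negative."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e)**2) / np.sum(e**2)

rng = np.random.default_rng(5)
white = rng.normal(size=5000)               # independent residuals: d near 2
walk = np.cumsum(rng.normal(size=5000))     # strong positive autocorrelation: d near 0
alternating = np.array([1.0, -1.0] * 100)   # strong negative autocorrelation: d near 4
```

Because d ≈ 2(1 − ρ̂), where ρ̂ is the first-order residual autocorrelation, comparing d against 2 is a fast complement to the correlogram.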
  • 80. I. Multicollinearity: Variance Inflation Factor
High collinearity increases standard errors and reduces significance on important variables.
VIF = 1/(1-R2j), where R2j is from the regression of variable j on the other independent variables.
If variable j is completely uncorrelated with the other variables, then R2j will be zero and the VIF will be one. If the fit is perfect, R2j will approach one and the VIF will be large. A larger VIF means more collinearity.
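The formula above translates directly into code: regress each regressor on the others, take R²_j, and invert 1 − R²_j (in Stata this is the postestimation command vif). A numpy sketch on illustrative data:

```python
import numpy as np

def vif(X):
    """Variance inflation factors: VIF_j = 1/(1 - R2_j), where R2_j is from
    regressing column j of X on the remaining columns plus a constant."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        xj = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        fit = Z @ np.linalg.lstsq(Z, xj, rcond=None)[0]
        r2 = 1 - np.sum((xj - fit)**2) / np.sum((xj - xj.mean())**2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(6)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + rng.normal(scale=0.01, size=n)   # nearly collinear with x1
v = vif(np.column_stack([x1, x2, x3]))     # v[0] and v[2] will be huge
```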
  • 81. Summary of COMMON DIAGNOSTIC PLOTS to assess CLRM assumptions & data problems (post-estimation)
1. Quantile-Normal Plot (studentized residuals v. normal): qnorm estu; and Normal Probability Plot (studentized residuals v. standard normal): pnorm estu
Preferred appearance: If the empirical distribution of the residuals is identical to a normal distribution, expect all points to lie on the 45-degree diagonal line through the origin.
Use: Normally distributed stochastic component. Quantile-normal plot: inspect the tails. Normal probability plot: inspect the middle.
Interpretation: 1. Look for multi-modality and granularity (possible misspecification). 2. Right or left skewness (bowed up, bowed down). 3. Heavy tails. 4. Vertical differences in values (outliers).
2. Ordinary or Studentized Residual v. Fitted Values: rvfplot; and |Residual| v. Fitted: graph twoway scatter absresid yhat
Preferred appearance: No discernible pattern; an even band with constant variance above and below zero, at both high and low fitted values of y. Shows the sum total of what the regression has explained.
Use: CLRM: heteroskedasticity (e varies with fitted values); residual normality. SPECIFICATION: non-linearity in X-Y relationship(s).
Interpretation: 1. Look for systematic variation in the distance of residuals from their mean of zero. 2. The Q-N plot is better for assessing normality. 3. This plot helps assess whether the error variance increases or decreases at smaller or larger fitted values of y. 4. Clusters of residuals above or below zero.
3. (Ordinary) Residual v. Predictor Plot (each X): rvpplot x1varname
Preferred appearance: No discernible pattern; an even band with constant variance above and below zero, at both high and low values of each X.
Use: CLRM: heteroskedasticity (e varies with values of one or more Xs). SPECIFICATION: non-linearity.
Interpretation: 1. Look for systematic variation in the distance of residuals from their mean. 2. Whether the error variance increases or decreases at smaller or larger values of each X. 3. Clusters of residuals above or below zero.
4. Component-Plus-Residual Plot: cprplot x1varname