2. 2
Deterministic Models
Consider the two variables x and y. A
deterministic relationship is one in which
the value of y (the dependent variable) is
completely determined by some formula or
mathematical function of x, such as
y = f(x), y = 3 + 2x, or y = 5e^(-2x),
where x is the independent variable.
3. 3
Probabilistic Models
A description of the relation between two
variables x and y that are not deterministically
related can be given by specifying a
probabilistic model.
The general form of an additive probabilistic
model allows y to be larger or smaller than
f(x) by a random amount, e.
The model equation is of the form
y = deterministic function of x + random deviation
  = f(x) + e
5. 5
Simple Linear Regression Model
The simple linear regression model
assumes that there is a line with vertical (y)
intercept α and slope β, called the true or
population regression line.
When a value of the independent variable x
is fixed and an observation on the
dependent variable y is made,
y = α + βx + e
Without the random deviation e, all observed
(x, y) points would fall exactly on the population regression
line. The inclusion of e in the model equation allows points
to deviate from the line by random amounts.
6. 6
Simple Linear Regression Model
[Figure: the population regression line with slope β and
vertical intercept α, showing observations at x = x1 and
x = x2 with positive random deviations e1 and e2.]
7. 7
Basic Assumptions of the Simple
Linear Regression Model
1. The distribution of e at any particular x
value has mean value 0 (µe = 0).
2. The standard deviation of e (which
describes the spread of its distribution) is
the same for any particular value of x. This
standard deviation is denoted by σ.
3. The distribution of e at any particular x
value is normal.
4. The random deviations e1, e2, …, en
associated with different observations are
independent of one another.
8. 8
More About the Simple Linear
Regression Model
For any fixed x value, y itself has a normal
distribution, with
(mean y value for fixed x)
  = (height of the population regression line above x)
  = α + βx
and
(standard deviation of y for fixed x) = σ.
9. 9
Interpretation of Terms
1. The slope β of the population regression
line is the mean (average) change in y
associated with a 1-unit increase in x.
2. The vertical intercept α is the height of
the population line when x = 0.
3. The size of σ determines the extent to
which the (x, y) observations deviate from
the population line.
[Figure: two scatterplots about the population line, one
with small σ (points tightly clustered) and one with
large σ (points widely scattered).]
11. 11
Estimates for the Regression Line
The point estimates of β, the slope, and α,
the y intercept of the population regression
line, are the slope and y intercept,
respectively, of the least squares line.
That is,
b = S_xy/S_xx = point estimate of β
a = ȳ - b·x̄ = point estimate of α
where
S_xy = Σxy - (Σx)(Σy)/n and S_xx = Σx² - (Σx)²/n
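As a quick illustration (not part of the original slides), the point estimates above can be computed directly from the raw sums. This is a minimal Python sketch; the function name `least_squares` is chosen here for illustration.

```python
# Minimal sketch of the least squares point estimates b and a,
# computed from the summary sums S_xy and S_xx defined above.
def least_squares(xs, ys):
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    b = s_xy / s_xx                 # point estimate of the slope beta
    a = sum_y / n - b * sum_x / n   # point estimate of the intercept alpha
    return a, b
```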
12. 12
Interpretation of y = a + bx
Let x* denote a specific value of the
predictor variable x. The quantity a + bx* has
two interpretations:
1. a + bx* is a point estimate of the
mean y value when x = x*.
2. a + bx* is a point prediction of an
individual y value to be observed
when x = x*.
13. 13
Example
The following data were collected in a
study of age and fatness in humans.
* Mazess, R.B., Peppler, W.W., and Gibbons, M. (1984) Total body composition by dual-photon (¹⁵³Gd) absorptiometry. American Journal of Clinical Nutrition, 40, 834-839.
One of the questions was, “What is the
relationship between age and fatness?”
Age 23 23 27 27 39 41 45 49 50
% Fat 9.5 27.9 7.8 17.8 31.4 25.9 27.4 25.2 31.1
Age 53 53 54 56 57 58 58 60 61
% Fat 34.7 42 29.1 32.5 30.3 33 33.8 41.1 34.5
15. 15
Example
n = 18, Σx = 834, Σy = 515,
Σx² = 41612, Σxy = 25489.2

S_xx = Σx² - (Σx)²/n
     = 41612 - (834)²/18 = 2970

S_xy = Σxy - (Σx)(Σy)/n
     = 25489.2 - (834)(515)/18 = 1627.53
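These arithmetic steps can be checked with a few lines of Python (a sketch using the summary sums from the slide):

```python
# Verify S_xx and S_xy for the age/%fat data from the summary sums.
n = 18
sum_x, sum_y = 834, 515
sum_x2, sum_xy = 41612, 25489.2

s_xx = sum_x2 - sum_x ** 2 / n      # 41612 - 834^2/18
s_xy = sum_xy - sum_x * sum_y / n   # 25489.2 - (834)(515)/18

print(round(s_xx, 2), round(s_xy, 2))   # 2970.0 1627.53
```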
17. 17
Example
The least squares line is
ŷ = 3.22 + 0.548x
A point estimate for the %Fat for a
human who is 45 years old is
a + bx* = 3.22 + 0.548(45) = 27.9%
If 45 is put into the equation for x, we have both
an estimated %Fat for an individual 45-year-old human
and an estimated average %Fat for 45-year-old
humans.
The two interpretations are quite different.
18. 18
Example
A plot of the data points along with the least
squares regression line created with Minitab is
given below.
[Minitab regression plot of % Fat y versus Age (x):
% Fat y = 3.22086 + 0.547991 Age (x)
S = 5.75361, R-Sq = 62.7%, R-Sq(adj) = 60.4%]
19. 19
Terminology
The predicted or fitted values result from
substituting each sample x value into the
equation for the least squares line. This gives
ŷ₁ = a + bx₁ (1st predicted value)
ŷ₂ = a + bx₂ (2nd predicted value)
...
ŷₙ = a + bxₙ (nth predicted value)
The residuals for the least squares line are the
values:
y₁ - ŷ₁, y₂ - ŷ₂, ..., yₙ - ŷₙ
20. 20
Definition formulae
The total sum of squares, denoted by SSTo,
is defined as
SSTo = (y₁ - ȳ)² + (y₂ - ȳ)² + ... + (yₙ - ȳ)²
     = Σ(y - ȳ)²
The residual sum of squares, denoted by
SSResid, is defined as
SSResid = (y₁ - ŷ₁)² + (y₂ - ŷ₂)² + ... + (yₙ - ŷₙ)²
        = Σ(y - ŷ)²
21. 21
Calculation Formulae Recalled
SSTo and SSResid are generally found as
part of the standard output from most
statistical packages or can be obtained using
the following computational formulas:
SSTo = Σ(y - ȳ)² = Σy² - (Σy)²/n
SSResid = Σ(y - ŷ)² = Σy² - aΣy - bΣxy
22. 22
Coefficient of Determination
The coefficient of determination,
denoted by r², gives the proportion of
variation in y that can be attributed to an
approximate linear relationship between x
and y.
The coefficient of determination can be
computed as
r² = 1 - SSResid/SSTo
23. 23
Estimated Standard Deviation, se
The statistic for estimating the variance σ²
is
s_e² = SSResid/(n - 2)
where
SSResid = Σ(y - ŷ)² = Σy² - aΣy - bΣxy
The subscript e in s_e² is a reminder that we are
estimating the variance of the "errors" or residuals.
24. 24
Estimated Standard Deviation, se
The estimate of σ is the estimated
standard deviation
s_e = √(s_e²)
The number of degrees of freedom associated
with estimating σ² or σ in simple linear regression
is n - 2.
26. 26
Example continued
n = 18, Σy = 515.0, Σy² = 16156.3,
Σxy = 25489.2, a = 3.2209, b = 0.54799

SSTo = Σ(y - ȳ)² = Σy² - (Σy)²/n
     = 16156.3 - (515.0)²/18 = 1421.5

r² = 1 - SSResid/SSTo = 1 - 529.66/1421.5
   = 1 - 0.373 = 0.627
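The SSTo and r² arithmetic can be sketched in Python (SSResid = 529.66 is taken from the Minitab output shown on the following slides):

```python
# Verify SSTo and r^2 for the age/%fat example.
n = 18
sum_y, sum_y2 = 515.0, 16156.3
ss_resid = 529.66                 # SSResid from the Minitab output

ss_to = sum_y2 - sum_y ** 2 / n   # total sum of squares, about 1421.6
r2 = 1 - ss_resid / ss_to         # coefficient of determination

print(round(r2, 3))   # 0.627
```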
27. 27
Example continued
With r² = 0.627 or 62.7%, we can say that
62.7% of the observed variation in %Fat
can be attributed to the probabilistic linear
relationship with human age.
The magnitude of a typical sample
deviation from the least squares line is
about 5.75(%), which is reasonably large
compared to the y values themselves.
This would suggest that the model is only
useful for providing gross "ballpark"
estimates of %Fat for humans based on age.
28. 28
Properties of the Sampling
Distribution of b
When the four basic assumptions of the
simple linear regression model are satisfied,
the following conditions are met:
1. The mean value of b is β. Specifically,
μ_b = β, and hence b is an unbiased
statistic for estimating β.
2. The standard deviation of the statistic b is
σ_b = σ/√(S_xx)
3. The statistic b has a normal distribution (a
consequence of the error e being normally
distributed).
29. 29
Estimated Standard Deviation of b
The estimated standard deviation of the
statistic b is
s_b = s_e/√(S_xx)
When the four basic assumptions of the
simple linear regression model are satisfied,
the probability distribution of the
standardized variable
t = (b - β)/s_b
is the t distribution with df = n - 2.
30. 30
Confidence interval for β
When the four basic assumptions of the
simple linear regression model are
satisfied, a confidence interval for β,
the slope of the population regression
line, has the form
b ± (t critical value)⋅sb
where the t critical value is based on
df = n - 2.
31. 31
Example continued
Recall
n = 18, Σx = 834, Σy = 515,
Σx² = 41612, Σxy = 25489.2, Σy² = 16156.3
b = 0.54799, a = 3.2209, s_e = 5.754
s_b = s_e/√(S_xx) = 5.754/√2970 = 0.1056
A 95% confidence interval estimate for β is
b ± t·s_b = 0.5480 ± (2.12)(0.1056) = 0.5480 ± 0.2238
32. 32
Example continued
A 95% confidence interval estimate for β is
b ± t·s_b = 0.5480 ± 2.12(0.1056)
          = 0.5480 ± 0.2238
          = (0.324, 0.772)
Based on sample data, we are 95% confident that the
true mean increase in %Fat associated with a year of
age is between 0.324% and 0.772%.
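The interval endpoints follow from simple arithmetic; a Python sketch (the t critical value 2.12 for df = 16 is taken from the slide):

```python
# 95% CI for the slope beta: b ± t*·s_b
b, s_b = 0.5480, 0.1056
t_crit = 2.12                       # t critical value for df = 16 (from tables)

half_width = t_crit * s_b
lo, hi = b - half_width, b + half_width
print(round(lo, 3), round(hi, 3))   # 0.324 0.772
```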
33. 33
Example continued
Minitab output looks like:

Regression Analysis: % Fat y versus Age (x)
The regression equation is
% Fat y = 3.22 + 0.548 Age (x)
Predictor Coef SE Coef T P
Constant 3.221 5.076 0.63 0.535
Age (x) 0.5480 0.1056 5.19 0.000
S = 5.754 R-Sq = 62.7% R-Sq(adj) = 60.4%
Analysis of Variance
Source DF SS MS F P
Regression 1 891.87 891.87 26.94 0.000
Residual Error 16 529.66 33.10
Total 17 1421.54

Annotations: the Coef column gives the estimated
y intercept a and the estimated slope b; S is s_e
(so S² = s_e²); the Residual Error row has
df = n - 2 and SS = SSResid; the Total row has
SS = SSTo.
34. 34
Hypothesis Tests Concerning β
Null hypothesis: H0: β = hypothesized value
Test statistic:
t = (b - hypothesized value)/s_b
The test is based on df = n - 2.
35. 35
Hypothesis Tests Concerning β
Alternate hypothesis and finding the P-value:
1. Ha: β > hypothesized value
P-value = Area under the t curve with
n - 2 degrees of freedom to the
right of the calculated t
2. Ha: β < hypothesized value
P-value = Area under the t curve with
n - 2 degrees of freedom to the left
of the calculated t
36. 36
Hypothesis Tests Concerning β
3. Ha: β ≠ hypothesized value
a) If t is positive, P-value = 2 (Area
under the t curve with n - 2 degrees
of freedom to the right of the
calculated t)
b) If t is negative, P-value = 2 (Area
under the t curve with n - 2 degrees
of freedom to the left of the
calculated t)
37. 37
Hypothesis Tests Concerning β
Assumptions:
1. The distribution of e at any particular x
value has mean value 0 (µe = 0)
2. The standard deviation of e is σ, which
does not depend on x
3. The distribution of e at any particular x
value is normal
4. The random deviations e1, e2, … , en
associated with different observations are
independent of one another
38. 38
Hypothesis Tests Concerning β
Quite often the test is performed with the
hypotheses
H0: β = 0 vs. Ha: β ≠ 0
This particular form of the test is called the
model utility test for simple linear
regression.
The test statistic simplifies to
t = b/s_b
and is called the t ratio.
The null hypothesis specifies that there is no useful
linear relationship between x and y, whereas the
alternative hypothesis specifies that there is a useful
linear relationship between x and y.
39. 39
Example
Consider the following data on percentage
unemployment and suicide rates.
* Smith, D. (1977) Patterns in Human Geography, Canada: Douglas David and Charles Ltd., 158.
City | Percentage Unemployed | Suicide Rate
New York 3.0 72
Los Angeles 4.7 224
Chicago 3.0 82
Philadelphia 3.2 92
Detroit 3.8 104
Boston 2.5 71
San Francisco 4.8 235
Washington 2.7 81
Pittsburgh 4.4 86
St. Louis 3.1 102
Cleveland 3.5 104
41. 41
Example
City | Percentage Unemployed (x) | Suicide Rate (y) | x² | xy | y²
New York 3.0 72 9.00 216.0 5184
Los Angeles 4.7 224 22.09 1052.8 50176
Chicago 3.0 82 9.00 246.0 6724
Philadelphia 3.2 92 10.24 294.4 8464
Detroit 3.8 104 14.44 395.2 10816
Boston 2.5 71 6.25 177.5 5041
San Francisco 4.8 235 23.04 1128.0 55225
Washington 2.7 81 7.29 218.7 6561
Pittsburgh 4.4 86 19.36 378.4 7396
St. Louis 3.1 102 9.61 316.2 10404
Cleveland 3.5 104 12.25 364.0 10816
Totals 38.7 1253 142.57 4787.2 176807
42. 42
Example
Some basic summary statistics:
n = 11, Σx = 38.7, Σx² = 142.57,
Σy = 1253, Σy² = 176807, Σxy = 4787.2

S_xy = Σxy - (Σx)(Σy)/n
     = 4787.2 - (38.7)(1253)/11
     = 378.92

S_xx = Σx² - (Σx)²/n
     = 142.57 - (38.7)²/11
     = 6.4164
43. 43
Example
Continuing with the calculations
b = S_xy/S_xx = 378.92/6.4164 = 59.06
a = ȳ - b·x̄ = 1253/11 - 59.06(38.7/11) = -93.86
ŷ = -93.86 + 59.06x
44. 44
Example
Continuing with the calculations
SSResid = Σ(y - ŷ)² = Σy² - aΣy - bΣxy
        = 176807 - (-93.857)(1253) - 59.055(4787.2)
        = 11701.9

SSTo = S_yy = Σ(y - ȳ)² = Σy² - (Σy)²/n
     = 176807 - (1253)²/11
     = 34078.9
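Again these sums of squares can be verified in a few lines of Python (values from the slide; small differences arise from rounding a and b):

```python
# Verify SSResid and SSTo for the unemployment/suicide data.
n = 11
sum_y, sum_y2, sum_xy = 1253, 176807, 4787.2
a, b = -93.857, 59.055

ss_resid = sum_y2 - a * sum_y - b * sum_xy   # about 11701.9
ss_to = sum_y2 - sum_y ** 2 / n              # about 34078.9

print(round(ss_resid, 1), round(ss_to, 1))
```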
46. 46
Example - Model Utility Test
1. β = the true average change in suicide
rate associated with an increase in the
unemployment rate of 1 percentage
point
2. H0: β = 0
3. Ha: β ≠ 0
4. α has not been preselected. We shall
interpret the observed level of
significance (P-value)
5. Test statistic:
t = (b - hypothesized value)/s_b = (b - 0)/s_b = b/s_b
47. 47
Example - Model Utility Test
6. Assumptions: The following plot (Minitab) of
the data shows a linear pattern and the
variability of points does not appear to be
changing with x. Assuming that the distribution
of errors (residuals) at any given x value is
approximately normal, the assumptions of the
simple linear regression model are
appropriate.
48. 48
Example - Model Utility Test
7. Calculation:
s_b = s_e/√(S_xx) = 36.06/√6.4164 = 14.24
t = b/s_b = 59.06/14.24 = 4.15
8. P-value: The table of tail areas for t-
distributions only has t values ≤ 4, so we can
see that the corresponding tail area is < 0.002.
Since this is a two-tail test the P-value < 0.004.
(Actual calculation gives a P-value = 0.002)
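A sketch of the calculation step in Python:

```python
import math

# s_b and the t ratio for the model utility test.
s_e, s_xx = 36.06, 6.4164
b = 59.06

s_b = s_e / math.sqrt(s_xx)   # estimated standard deviation of b
t = b / s_b                   # t ratio

print(round(s_b, 2), round(t, 2))   # 14.24 4.15
```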
49. 49
Example - Model Utility Test
9. Conclusion:
Even though no specific significance
level was chosen for the test, with the
P-value being so small (< 0.004) one
would generally reject the null
hypothesis that β = 0 and conclude that
there is a useful linear relationship
between the % unemployed and the
suicide rate.
50. 50
Example - Minitab Output
Regression Analysis: Suicide Rate (y) versus Percentage Unemployed (x)
The regression equation is
Suicide Rate (y) = - 93.9 + 59.1 Percentage Unemployed (x)
Predictor Coef SE Coef T P
Constant -93.86 51.25 -1.83 0.100
Percenta 59.05 14.24 4.15 0.002
S = 36.06 R-Sq = 65.7% R-Sq(adj) = 61.8%
T value for Model Utility Test
H0: β = 0 Ha: β ≠ 0
P-value
51. 51
Example – Reality Check!
Although the model utility test indicates that the model
is useful, we should be hesitant to use the model
principally as an estimation tool.
Notice that s = 36.06, whereas the actual range of
suicide rates is 235 – 71 = 164. This means the typical
error in estimating the suicide rate would be
approximately 22% of the range. With 9 of the
11 data points having suicide rates at or below 104,
this would constitute a very large amount of error in
the estimation.
The statistical message is clear: We have established a
strong positive linear relationship between percentage
unemployed and the suicide rate. It would just not be
particularly meaningful or useful to provide actual
numerical estimates for suicide rates.
52. 52
Residual Analysis
The simple linear regression model equation
is y = α + βx + e where e represents the
random deviation of an observed y value
from the population regression line α + βx .
Key assumptions about e
1. At any particular x value, the distribution
of e is a normal distribution
2. At any particular x value, the standard
deviation of e is σ, which is constant
over all values of x.
53. 53
Residual Analysis
To check on these assumptions, one would
examine the deviations e1, e2, …, en.
Generally, the deviations are not known, so
we check on the assumptions by looking at
the residuals, which are the deviations from
the estimated line, a + bx.
The residuals are given by
y₁ - ŷ₁ = y₁ - (a + bx₁)
y₂ - ŷ₂ = y₂ - (a + bx₂)
⋮
yₙ - ŷₙ = yₙ - (a + bxₙ)
54. 54
Standardized Residuals
Recall: A quantity is standardized by
subtracting its mean value and then dividing
by its true (or estimated) standard deviation.
For the residuals, the true mean is zero (0)
if the assumptions are true.
The estimated standard deviation of a residual
depends on the x value. The estimated standard
deviation of the ith residual, yᵢ - ŷᵢ, is given by
s_(yᵢ-ŷᵢ) = s_e·√(1 - 1/n - (xᵢ - x̄)²/S_xx)
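Since the formula is tedious by hand, here is a minimal Python sketch that computes the standardized residuals directly (the function name is chosen for illustration):

```python
import math

def standardized_residuals(xs, ys, a, b, s_e):
    """Each residual divided by its estimated standard deviation,
    which depends on x through (x - x_bar)^2 / S_xx."""
    n = len(xs)
    x_bar = sum(xs) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    result = []
    for x, y in zip(xs, ys):
        residual = y - (a + b * x)
        sd = s_e * math.sqrt(1 - 1 / n - (x - x_bar) ** 2 / s_xx)
        result.append(residual / sd)
    return result
```

Applied to the unemployment/suicide data with a = -93.857, b = 59.055, and s_e = 36.06, this reproduces the standardized residual of about -2.50 for Pittsburgh shown on the next slides.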
55. 55
Standardized Residuals
As you can see from the formula for the
estimated standard deviation, the calculation
of the standardized residuals is a bit of a
computational nightmare.
Fortunately, most statistical software
packages are set up to perform these
calculations and do so quite proficiently.
56. 56
Standardized Residuals - Example
Consider the data on percentage unemployment
and suicide rates.
City | Percentage Unemployed | Suicide Rate | ŷ | Residual (y - ŷ) | Standardized Residual
New York 3.0 72 83.31 -11.31 -0.34
Los Angeles 4.7 224 183.70 40.30 1.34
Chicago 3.0 82 83.31 -1.31 -0.04
Philadelphia 3.2 92 95.12 -3.12 -0.09
Detroit 3.8 104 130.55 -26.55 -0.78
Boston 2.5 71 53.78 17.22 0.55
San Francisco 4.8 235 189.61 45.39 1.56
Washington 2.7 81 65.59 15.41 0.48
Pittsburgh 4.4 86 165.99 -79.98 -2.50
St. Louis 3.1 102 89.21 12.79 0.38
Cleveland 3.5 104 112.84 -8.84 -0.26
Notice that the standardized residual for Pittsburgh
is -2.50, somewhat large for this size data set.
58. 58
Normal Plots
[Two Minitab normal probability plots of the residuals
(response is Suicide): one of the raw residuals and one
of the standardized residuals.]
Notice that both of the normal plots look similar. If
a software package is available to do the
calculations and plots, it is preferable to look at the
normal plot of the standardized residuals.
In both cases, the points look reasonably linear
with the possible exception of Pittsburgh, so the
assumption that the errors are normally distributed
seems to be supported by the sample data.
59. 59
More Comments
The fact that Pittsburgh has a large
standardized residual makes it worthwhile
to look at that city carefully to make sure the
figures were reported correctly. One might
also look to see if there are some reasons
that Pittsburgh should be looked at
separately because some other
characteristic distinguishes it from all of the
other cities.
Pittsburgh does have a large effect on the
model.
61. 61
Visual Interpretation of
Standardized Residuals
[Plot: standardized residuals versus x (response is y).]
This plot suggests that a curvilinear regression model
is needed.
62. 62
Visual Interpretation of
Standardized Residuals
[Plot: standardized residuals versus x (response is y).]
This plot suggests a non-constant variance. The
assumptions of the model are not correct.
63. 63
Visual Interpretation of
Standardized Residuals
[Plot: standardized residuals versus x (response is y).]
This plot shows a data point with a large standardized
residual.
64. 64
Visual Interpretation of
Standardized Residuals
[Plot: standardized residuals versus x (response is y).]
This plot shows a potentially influential observation.
65. 65
Example - % Unemployment vs. Suicide Rate
This plot of the residuals (errors) indicates some
possible problems with this linear model. You can see
a pattern to the points.
[Plot annotations: a generally decreasing pattern to the
points; one unusually large residual, clearly an
influential point; two points that are quite influential
since they are far away from the others in terms of
the % unemployed.]
66. 66
Properties of the Sampling Distribution
of a + bx for a Fixed x Value
Let x* denote a particular value of the
independent variable x. When the four basic
assumptions of the simple linear regression
model are satisfied, the sampling
distribution of the statistic a + bx* has the
following properties:
1. The mean value of a + bx* is α + βx*,
so a + bx* is an unbiased statistic for
estimating the average y value when
x = x*
67. 67
Properties of the Sampling Distribution
of a + bx for a Fixed x Value
2. The standard deviation of the statistic
a + bx*, denoted by σ_(a+bx*), is given by
σ_(a+bx*) = σ·√(1/n + (x* - x̄)²/S_xx)
3. The distribution of the statistic a + bx* is
normal.
68. 68
Additional Information about the Sampling
Distribution of a + bx for a Fixed x Value
The estimated standard deviation of
the statistic a + bx*, denoted by s_(a+bx*),
is given by
s_(a+bx*) = s_e·√(1/n + (x* - x̄)²/S_xx)
When the four basic assumptions of the
simple linear regression model are satisfied,
the probability distribution of the standardized
variable
t = (a + bx* - (α + βx*))/s_(a+bx*)
is the t distribution with df = n - 2.
69. 69
Confidence Interval for a Mean y Value
When the four basic assumptions of the
simple linear regression model are met, a
confidence interval for α + βx*, the
average y value when x has the value x*, is
a + bx* ± (t critical value)·s_(a+bx*)
where the t critical value is based on
df = n - 2.
Many authors give the following equivalent form
for the confidence interval:
a + bx* ± (t critical value)·s_e·√(1/n + (x* - x̄)²/S_xx)
70. 70
Prediction Interval for a Single y Value
When the four basic assumptions of the simple
linear regression model are met, a prediction
interval for y*, a single y observation made
when x has the value x*, has the form
a + bx* ± (t critical value)·√(s_e² + s_(a+bx*)²)
where the t critical value is based on df = n - 2.
Many authors give the following equivalent form
for the prediction interval:
a + bx* ± (t critical value)·s_e·√(1 + 1/n + (x* - x̄)²/S_xx)
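Both intervals can be sketched in Python from the summary quantities (the function and argument names are illustrative):

```python
import math

def mean_ci_and_pi(x_star, a, b, s_e, n, x_bar, s_xx, t_crit):
    """CI for the mean y value and PI for a single y at x = x_star."""
    fit = a + b * x_star
    se_fit = s_e * math.sqrt(1 / n + (x_star - x_bar) ** 2 / s_xx)
    ci = (fit - t_crit * se_fit, fit + t_crit * se_fit)
    se_pred = math.sqrt(s_e ** 2 + se_fit ** 2)   # wider: adds s_e^2
    pi = (fit - t_crit * se_pred, fit + t_crit * se_pred)
    return ci, pi
```

With the temperature/mortality example on the following slides (a = -21.7947, b = 2.35769, s_e = 7.54466, n = 16, x̄ ≈ 44.59, S_xx ≈ 467.65, and t critical value 2.145 for df = 14), this approximately reproduces Minitab's 95% CI (67.20, 77.82) and 95% PI (55.48, 89.54) at x* = 40.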
71. 71
Example - Mean Annual Temperature vs. Mortality
Data were collected in certain regions of
Great Britain, Norway and Sweden to study
the relationship between the mean annual
temperature and the mortality rate for a
specific type of breast cancer in women.
* Lea, A.J. (1965) New observations on distribution of neoplasms of female breast in certain European countries. British Medical Journal, 1, 488-490.
Mean Annual Temperature (°F) 51.3 49.9 50.0 49.2 48.5 47.8 47.3 45.1
Mortality Index 102.5 104.5 100.4 95.9 87.0 95.0 88.6 89.2
Mean Annual Temperature (°F) 46.3 42.1 44.2 43.5 42.3 40.2 31.8 34.0
Mortality Index 78.9 84.6 81.7 72.2 65.1 68.1 67.3 52.5
72. 72
Example - Mean Annual Temperature vs. Mortality
Regression Analysis: Mortality index versus Mean annual temperature
The regression equation is
Mortality index = - 21.8 + 2.36 Mean annual temperature
Predictor Coef SE Coef T P
Constant -21.79 15.67 -1.39 0.186
Mean ann 2.3577 0.3489 6.76 0.000
S = 7.545 R-Sq = 76.5% R-Sq(adj) = 74.9%
Analysis of Variance
Source DF SS MS F P
Regression 1 2599.5 2599.5 45.67 0.000
Residual Error 14 796.9 56.9
Total 15 3396.4
Unusual Observations
Obs Mean ann Mortalit Fit SE Fit Residual St Resid
15 31.8 67.30 53.18 4.85 14.12 2.44RX
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
73. 73
Example - Mean Annual Temperature vs. Mortality
[Minitab regression plot of Mortality index versus Mean
annual temperature:
Mortality in = -21.7947 + 2.35769 Mean annual
S = 7.54466, R-Sq = 76.5%, R-Sq(adj) = 74.9%]
The point has a large standardized residual and is
influential because of the low Mean Annual Temperature.
74. 74
Example - Mean Annual Temperature vs. Mortality
Predicted Values for New Observations
New Obs Fit SE Fit 95.0% CI 95.0% PI
1 53.18 4.85 ( 42.79, 63.57) ( 33.95, 72.41) X
2 60.72 3.84 ( 52.48, 68.96) ( 42.57, 78.88)
3 72.51 2.48 ( 67.20, 77.82) ( 55.48, 89.54)
4 83.34 1.89 ( 79.30, 87.39) ( 66.66, 100.02)
5 96.09 2.67 ( 90.37, 101.81) ( 78.93, 113.25)
6 99.16 3.01 ( 92.71, 105.60) ( 81.74, 116.57)
X denotes a row with X values away from the center
Values of Predictors for New Observations
New Obs Mean ann
1 31.8
2 35.0
3 40.0
4 44.6
5 50.0
6 51.3
These are the x* values for which the
above fits, standard errors of the fits,
95% confidence intervals for Mean y
values and prediction intervals for y
values given above.
75. 75
Example - Mean Annual Temperature vs. Mortality
[Minitab regression plot with 95% CI and 95% PI bands:
Mortality in = -21.7947 + 2.35769 Mean annual
S = 7.54466, R-Sq = 76.5%, R-Sq(adj) = 74.9%]
95% prediction interval for a single y value at x = 45: (67.62, 100.98)
95% confidence interval for the mean y value at x = 40: (67.20, 77.82)
76. 76
A Test for Independence in a
Bivariate Normal Population
Null hypothesis: H0: ρ = 0
Assumption: r is the correlation coefficient for a
random sample from a bivariate normal
population.
Test statistic:
t = r/√((1 - r²)/(n - 2))
The t critical value is based on df = n - 2.
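A minimal Python sketch of this test statistic, illustrated with the r and n values from the later height vs. joint length example:

```python
import math

def corr_t_stat(r, n):
    """t statistic for H0: rho = 0; compare to t with df = n - 2."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

# Values from the height vs. joint length slides: r = 0.74908, n = 17.
t = corr_t_stat(0.74908, 17)
print(round(t, 3))   # 4.379
```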
77. 77
A Test for Independence in a
Bivariate Normal Population
Alternate hypothesis: Ha: ρ > 0 (Positive
dependence): P-value is the area under the
appropriate t curve to the right of the computed t.
Alternate hypothesis: Ha: ρ < 0 (Negative
dependence): P-value is the area under the
appropriate t curve to the left of the computed t.
Alternate hypothesis: Ha: ρ ≠ 0 (Dependence):
P-value is
i. twice the area under the appropriate t curve to the left of
the computed t value if t < 0 and
ii. twice the area under the appropriate t curve to the right of
the computed t value if t > 0
78. 78
Example
Recall the data from
the study of %Fat vs.
Age for humans.
There are 18 data
points and a quick
calculation of the
Pearson correlation
coefficient gives
r = 0.79209.
We will test to see if
there is a dependence
at the 0.05
significance level.
Age (x) | % Fat (y) | x² | xy
23 9.5 529 218.5
23 27.9 529 641.7
27 7.8 729 210.6
27 17.8 729 480.6
39 31.4 1521 1224.6
41 25.9 1681 1061.9
45 27.4 2025 1233
49 25.2 2401 1234.8
50 31.1 2500 1555
53 34.7 2809 1839.1
53 42 2809 2226
54 29.1 2916 1571.4
56 32.5 3136 1820
57 30.3 3249 1727.1
58 33 3364 1914
58 33.8 3364 1960.4
60 41.1 3600 2466
61 34.5 3721 2104.5
79. 79
Example
1. ρ = the correlation between % fat and
age in the population from which the
sample was selected
2. H0: ρ = 0
3. Ha: ρ ≠ 0
4. α = 0.05
5. Test statistic:
t = r/√((1 - r²)/(n - 2)), df = n - 2
80. 80
Example
6. Looking at the two normal plots, we can see
it is not reasonable to assume that either the
distribution of age or the distribution of % fat
is normal. (Notice, the data points deviate
from a linear pattern quite substantially.)
Since neither is normal, we shall not continue
with the test.
[Minitab normal probability plots with Anderson-Darling
normality tests:
Age (x): N = 18, average 46.33, StDev 13.22, A² = 0.980, P-value = 0.011
% Fat y: N = 18, average 28.61, StDev 9.14, A² = 0.796, P-value = 0.032]
81. 81
Another Example
Height vs. Joint Length
The professor in an elementary statistics
class wanted to explain correlation, so he
needed some bivariate data. He asked his
class (presumably a random or
representative sample of late adolescent
humans) to measure the length of the
metacarpal bone on the index finger of the
right hand (in cm) and height (in inches). The
data are provided on the next slide.
82. 82
Example - Height vs. Joint Length
There are 17 data points and a quick
calculation of the Pearson correlation
coefficient gives r = 0.74908.
We will test to see if the true population
correlation coefficient is positive at the 0.05
level of significance.
Joint length 3.5 3.4 3.4 2.7 3.5 3.5 4.2 4.0 3.0
Height 64 68.5 69 64 68 73 72 75 70
Joint length 3.4 2.9 3.5 3.5 2.8 4.0 3.8 3.3
Height 68.5 65 67 70 65 75 70 66
83. 83
Example - Height vs. Joint Length
1. ρ = the true correlation between height
and right index finger metacarpal bone
length in the population from which the
sample was selected
2. H0: ρ = 0
3. Ha: ρ > 0
4. α = 0.05
5. Test statistic:
t = r/√((1 - r²)/(n - 2)), df = n - 2
84. 84
Example - Height vs. Joint Length
6. Looking at the two normal plots, we can see it is
reasonable to assume that the distribution of joint
length and the distribution of height are both normal.
(Notice, the data points follow a reasonably linear
pattern.) This appears to confirm the assumption that
the sample is from a bivariate normal distribution. We
will assume that the class was a random sample of
young adults.
[Minitab normal probability plots with Anderson-Darling
normality tests:
Height: N = 17, average 68.82, StDev 3.50, A² = 0.294, P-value = 0.557
Joint: N = 17, average 3.44, StDev 0.42, A² = 0.524, P-value = 0.156]
85. 85
Example - Height vs. Joint Length
7. Calculation:
t = r/√((1 - r²)/(n - 2))
  = 0.74908/√((1 - (0.74908)²)/(17 - 2))
  = 4.379
8. P-value: Looking on the table of tail areas for t
curves under 15 degrees of freedom, 4.379 is off
the bottom of the table, so P-value < 0.001. Minitab
reports the P-value to be 0.001.
9. Conclusion: The P-value is smaller than α = 0.05, so
we can reject H0. We can conclude that the true
population correlation coefficient is greater than 0.
I.e., the metacarpal bone is longer for taller people.