SlideShare ist ein Scribd-Unternehmen logo
1 von 85
1
Chapter 13
Simple Linear Regression
&
Correlation
Inferential Methods
2
Consider the two variables x and y. A
deterministic relationship is one in which
the value of y (the dependent variable) is
described by some formula or mathematical
notation such as y = f(x), y = 3 + 2 x or
y = 5e-2x
where x is the dependent variable.
Deterministic Models
3
A description of the relation between two
variables x and y that are not deterministically
related can be given by specifying a
probabilistic model.
The general form of an additive probabilistic
model allows y to be larger or smaller than
f(x) by a random amount, e.
The model equation is of the form
Probabilistic Models
Y = deterministic function of x + random deviation
= f(x) + e
4
Probabilistic Models
Deviations from the deterministic part of a
probabilistic model
e=-1.5
5
Simple Linear Regression Model
The simple linear regression model
assumes that there is a line with vertical or
y intercept a and slope b, called the true or
population regression line.
When a value of the independent variable x
is fixed and an observation on the
dependent variable y is made,
y = α + βx + e
Without the random deviation e, all observed points
(x, y) points would fall exactly on the population regression
line. The inclusion of e in the model equation allows points
to deviate from the line by random amounts.
6
Simple Linear Regression Model
0
0
x = x1 x = x2
e2
Observation when x = x1
(positive deviation)
e2
Observation when x = x2
(positive deviation)
α = vertical intercept
Population regression line
(Slope β)
7
Basic Assumptions of the Simple
Linear Regression Model
1. The distribution of e at any particular x
value has mean value 0 (µe = 0).
2. The standard deviation of e (which
describes the spread of its distribution) is
the same for any particular value of x. This
standard deviation is denoted by σ.
3. The distribution of e at any particular x
value is normal.
4. The random deviations e1, e2, …, en
associated with different observations are
independent of one another.
8
More About the Simple Linear
Regression Model
and
(standard deviation of y for fixed x) = σ.
For any fixed x value, y itself has a normal
distribution.
mean y value height of the population
x
for fixed x regression line above x
   
= = α + β   
   
9
Interpretation of Terms
1. The slope β of the population regression
line is the mean (average) change in y
associated with a 1-unit increase in x.
2. The vertical intercept α is the height of
the population line when x = 0.
3. The size of σ determines the extent to
which the (x, y) observations deviate from
the population line.
Small σ Large σ
10
Illustration of Assumptions
11
Estimates for the Regression Line
The point estimates of β, the slope, and α,
the y intercept of the population regression
line, are the slope and y intercept,
respectively, of the least squares line.
That is,
xy
xx
S
b point estimate of
S
= β =
a point estimate of y bx= α = −
where
( ) ( ) ( )
2
2
xy xx
x y x
S xy and S x
n n
= − = −
∑ ∑ ∑
∑ ∑
where
( ) ( ) ( )
2
2
xy xx
x y x
S xy and S x
n n
= − = −
∑ ∑ ∑
∑ ∑
12
Interpretation of y = a + bx
Let x* denote a specific value of the
predictor variable x. The a + bx* has two
interpetations:
1. a + bx* is a point estimate of the
mean y value when x = x*.
2. a + bx* is a point prediction of an
individual y value to be observed
when x = x*.
13
Example
The following data was collected in a
study of age and fatness in humans.
* Mazess, R.B., Peppler, W.W., and Gibbons, M. (1984) Total body composition by dual-
photon (153
Gd) absorptiometry. American Journal of Clinical Nutrition, 40, 834-839
One of the questions was, “What is the
relationship between age and fatness?”
Age 23 23 27 27 39 41 45 49 50
% Fat 9.5 27.9 7.8 17.8 31.4 25.9 27.4 25.2 31.1
Age 53 53 54 56 57 58 58 60 61
% Fat 34.7 42 29.1 32.5 30.3 33 33.8 41.1 34.5
14
Example
Age (x) % Fat y x2
xy
23 9.5 529 218.5
23 27.9 529 641.7
27 7.8 729 210.6
27 17.8 729 480.6
39 31.4 1521 1224.6
41 25.9 1681 1061.9
45 27.4 2025 1233
49 25.2 2401 1234.8
50 31.1 2500 1555
53 34.7 2809 1839.1
53 42 2809 2226
54 29.1 2916 1571.4
56 32.5 3136 1820
57 30.3 3249 1727.1
58 33 3364 1914
58 33.8 3364 1960.4
60 41.1 3600 2466
61 34.5 3721 2104.5
834 515 41612 25489.2
2
n 18
X 834
y 515
X 41612
XY 25489.2
=
=
=
=
=
∑
∑
∑
∑
15
Example
2
n 18, x 834, y 515
x 41612, xy 25489.2
= = =
= =
∑ ∑
∑ ∑
( )
2
2
xx
2
x
S x
n
834
41612 2970
18
= −
= − =
∑∑
( ) ( )
( ) ( )
xy
x y
S xy
n
834 515
25489.2 1627.53
18
= −
= − =
∑ ∑∑
16
Example
xy
xx
S 1627.53
b 0.54799
S 2970
= = =
515 834
a y bx 0.54799 3.2209
18 18
= − = − =
ˆy 3.22 0.548x= +
17
Example
A point estimate for the %Fat for a
human who is 45 years old is
If 45 is put into the equation for x, we have both
an estimated %Fat for a 45 year old human or
an estimated average %Fat for 45 year old
humans
The two interpretations are quite different.
ˆy 3.22 0.548x= +
a + bx=3.22+0.548(45)=27.9%
a + bx=3.22+0.548(45)=27.9%
18
Example
A plot of the data
points along with
the least squares
regression line
created with
Minitab is given
to the right.
6050403020
40
30
20
10
Age (x)
%Faty
S = 5.75361 R-Sq = 62.7 % R-Sq(adj) = 60.4 %
% Fat y = 3.22086 + 0.547991 Age (x)
Regression Plot
19
Terminology
The predicted or fitted values result from
substituting each sample x value into the
equation for the least squares line. This gives
=1st predicted value
=2nd predicted value
=nth predicted value
1 1
2 2
n n
ˆy a bx
ˆy a bx
...
ˆy a bx
= +
= +
= +
The predicted or fitted values result from
substituting each sample x value into the
equation for the least squares line. This gives
=1st predicted value
=2nd predicted value
=nth predicted value
1 1
2 2
n n
ˆy a bx
ˆy a bx
...
ˆy a bx
= +
= +
= +
The residuals for the least squares line are the
values: 1 1 2 2 n n
y y ,y y , ...,y yˆ ˆ ˆ− − −
The residuals for the least squares line are the
values: 1 1 2 2 n n
y y ,y y , ...,y yˆ ˆ ˆ− − −
20
Definition formulae
The total sum of squares, denoted by SSTo,
is defined as
The residual sum of squares, denoted by
SSResid, is defined as
2 2 2
1 2 n
2
SSTo (y y) (y y) (y y)
(y y)∑
= − + − + + −
= −
L
2 2 2
1 1 2 2 n n
2
SSResid (y y ) (y y ) (y y )ˆ ˆ ˆ
(y y)ˆ∑
= − + − + + −
= −
L
21
Calculation Formulae Recalled
SSTo and SSResid are generally found as
part of the standard output from most
statistical packages or can be obtained using
the following computational formulas:
( )
( )
2
2 2 y
SSTo y y y
n
∑
∑ ∑= − = −
2 2
SSResid (y y) y a y b xyˆ∑ ∑ ∑ ∑= − = − −
22
Coefficient of Determination
The coefficient of determination,
denoted by r2
, gives the proportion of
variation in y that can be attributed to an
approximate linear relationship between x
and y.
The coefficient of determination, r2, can be
computed as 2 SSResid
r 1
SSTo
= −
The coefficient of determination, r2, can be
computed as 2 SSResid
r 1
SSTo
= −
23
Estimated Standard Deviation, se
The statistic for estimating the variance σ2
is
where
2
e
SSResid
s
n 2
=
−
2 2
ˆSSResid (y y) y a y b xy= − = − −∑ ∑ ∑ ∑
2
eThe subscript e in s is a reminder that we are
estimating the variance of the "errors" or residuals.
24
Estimated Standard Deviation, se
The estimate of σ is the estimated
standard deviation
The number of degrees of freedom associated
with estimating σ2
or σ in simple linear regression
is n - 2.
2
e es s=
25
Example continued
SSResid
529.66=
2
e
SSResid
s
n 2
529.66
18 2
33.104
=
−
=
−
=
2
e es s
33.104
5.754
=
=
=
Age (x) % Fat (y) y2
Predicted
Value
Residual
23 9.5 90.3 15.82 -6.32 40.00
23 27.9 778.4 15.82 12.08 145.81
27 7.8 60.8 18.02 -10.22 104.38
27 17.8 316.8 18.02 -0.22 0.05
39 31.4 986.0 24.59 6.81 46.34
41 25.9 670.8 25.69 0.21 0.04
45 27.4 750.8 27.88 -0.48 0.23
49 25.2 635.0 30.07 -4.87 23.74
50 31.1 967.2 30.62 0.48 0.23
53 34.7 1204.1 32.26 2.44 5.93
53 42.0 1764.0 32.26 9.74 94.78
54 29.1 846.8 32.81 -3.71 13.78
56 32.5 1056.3 33.91 -1.41 1.98
57 30.3 918.1 34.46 -4.16 17.27
58 33.0 1089.0 35.00 -2.00 4.02
58 33.8 1142.4 35.00 -1.20 1.45
60 41.1 1689.2 36.10 5.00 25.00
61 34.5 1190.3 36.65 -2.15 4.62
834 515.0 16156.3 529.66
ˆy y−ˆy ( )
2
ˆy y−
26
Example continued
2
n 18, y 515.0, y 16156.3
xy 25489.2 ,a 3.2209, b 0.54799
= = =
= = =
∑ ∑
∑
( )
( )
2
2 2
2
y
SSTot= y-y y
n
(515.0)
16156.3 1421.5
18
= −
= − =
∑∑ ∑
2 SSResid 529.66
r 1 1 1 0.373 0.627
SSTo 1421.5
= − = − = − =
27
Example continued
With r2
= 0.627 or 62.7%, we can say that
62.7% of the observed variation in %Fat
can be attributed to the probabilistic linear
relationship with human age.
The magnitude of a typical sample
deviation from the least squares line is
about 5.75(%) which is reasonably large
compared to the y values themselves.
This would suggest that the model is only
useful in the sense of provide gross
“ballpark” estimates for %Fat for humans
based on age.
28
Properties of the Sampling
Distribution of b
1. The mean value of b is β. Specifically,
µb=β and hence b is an unbiased
statistic for estimating β
When the four basic assumptions of the
simple linear regression model are satisfied,
the following conditions are met:
b
xxS
σ
σ =
2. The standard deviation
of the statistic b is
3. The statistic b has a normal distribution (a
consequence of the error e being normally
distributed)
29
Estimated Standard Deviation of b
The estimated standard deviation of the
statistic b is e
b
xx
s
S
σ =
When then four basic assumptions of the
simple linear regression model are satisfied,
the probability distribution of the
standardized variable
is the t distribution with df = n - 2
b
b
t
s
− β
=
30
Confidence interval for β
When then four basic assumptions of the
simple linear regression model are
satisfied, a confidence interval for β,
the slope of the population regression
line, has the form
b ± (t critical value)⋅sb
where the t critical value is based on
df = n - 2.
31
Example continued
Recall
2 2
n 18, x 834, y 515
x 41612, xy 25489.2, y 16156.3
= = =
= = =
∑ ∑
∑ ∑ ∑
b 0.54799, a 3.2209= =
e
b
xx
s 5.754
s 0.1056
S 2970
= = =
A 95% confidence interval estimate for β is
bb t s 0.5480 (2.12) (0.1056) 0.5480 0.2238± = ± = ±g g
es 5.754=
32
Example continued
Based on sample data, we are 95% confident that the
true mean increase in %Fat associated with a year of
age is between 0.324% and 0.772%.
A 95% confidence interval estimate for β is
bb ts 0.5480 2.12(0.1056)
0.5480 0.2238
(0.324,0.772)
± = ±
= ±
33
The regression equation is
% Fat y = 3.22 + 0.548 Age (x)
Predictor Coef SE Coef T P
Constant 3.221 5.076 0.63 0.535
Age (x) 0.5480 0.1056 5.19 0.000
S = 5.754 R-Sq = 62.7% R-Sq(adj) = 60.4%
Analysis of Variance
Source DF SS MS F P
Regression 1 891.87 891.87 26.94 0.000
Residual Error 16 529.66 33.10
Total 17 1421.54
Example continued
Minitab output looks like
Regression line
2
es
residual df = n -2
SSResidSSTo
Estimated slope b
Regression Analysis: % Fat y versus Age (x)
Estimated y intercept a
34
Hypothesis Tests Concerning β
Null hypothesis: H0: β = hypothesized value
Test statistic:
The test is based on df = n - 2
b
b hypothesized value
t
s
−
=
Test statistic:
The test is based on df = n - 2
b
b hypothesized value
t
s
−
=
35
Hypothesis Tests Concerning β
Alternate hypothesis and finding the P-value:
1. Ha: β > hypothesized value
P-value = Area under the t curve with
n - 2 degrees of freedom to the
right of the calculated t
2. Ha: β < hypothesized value
P-value = Area under the t curve with
n - 2 degrees of freedom to the left
of the calculated t
36
Hypothesis Tests Concerning β
3. Ha: β ≠ hypothesized value
a) If t is positive, P-value = 2 (Area
under the t curve with n - 2 degrees
of freedom to the right of the
calculated t)
b) If t is negative, P-value = 2 (Area
under the t curve with n - 2 degrees
of freedom to the left of the
calculated t)
37
Hypothesis Tests Concerning β
Assumptions:
1. The distribution of e at any particular x
value has mean value 0 (µe = 0)
2. The standard deviation of e is σ, which
does not depend on x
3. The distribution of e at any particular x
value is normal
4. The random deviations e1, e2, … , en
associated with different observations are
independent of one another
38
Hypothesis Tests Concerning β
Quite often the test is performed with the
hypotheses
H0: β = 0 vs. Ha: β ≠ 0
This particular form of the test is called the
model utility test for simple linear
regression.
The test statistic simplifies to and is called the t ratio.
b
b
t
s
=
The null hypothesis specifies that there is no useful
linear relationship between x and y, whereas the
alternative hypothesis specifies that there is a useful
linear relationship between x and y.
39
Example
Consider the following data on percentage
unemployment and suicide rates.
* Smith, D. (1977) Patterns in Human Geography, Canada: Douglas David and Charles Ltd., 158.
City
Percentage
Unemployed
Suicide
Rate
New York 3.0 72
Los Angeles 4.7 224
Chicago 3.0 82
Philadelphia 3.2 92
Detroit 3.8 104
Boston 2.5 71
San Francisco 4.8 235
Washington 2.7 81
Pittsburgh 4.4 86
St. Louis 3.1 102
Cleveland 3.5 104
40
Example
The plot of the data points produced by
Minitab follows
41
Example
City
Percentage
Unemployed
(x)
Suicide
Rate
(y)
x
2
xy y
2
New York 3.0 72 9.00 216.0 05184
Los Angeles 4.7 224 22.09 1052.8 50176
Chicago 3.0 82 9.00 246.0 06724
Philadelphia 3.2 92 10.24 294.4 08464
Detroit 3.8 104 14.44 395.2 10816
Boston 2.5 71 6.25 177.5 05041
San Francisco 4.8 235 23.04 1128.0 55225
Washington 2.7 81 7.29 218.7 06561
Pittsburgh 4.4 86 19.36 378.4 07396
St. Louis 3.1 102 9.61 316.2 10404
Cleveland 3.5 104 12.25 364.0 10816
38.7 1253 142.57 4787.2 176807
42
Example
Some basic summary statistics
2
2
n 11, x 38.7, x 142.57
y 1253, y 176807, xy 4787.2
= = =
= = =
∑ ∑
∑ ∑ ∑
( ) ( )
xy
x y
S xy
n
(38.7)(1253)
4787.2
11
378.92
= −
= −
=
∑ ∑
∑
( )
2
2
xx
2
x
S x
n
38.7
142.57
11
6.4164
= −
= −
=
∑
∑
43
Example
Continuing with the calculations
xy
xx
S 378.92
b 59.06
S 6.4164
= = =
1253 38.7
a y bx 59.06 93.86
11 11
= − = − = −
ˆy 93.86 59.06x= − +
44
Example
Continuing with the calculations
2 2
SSResid
ˆ(y y) y a y b xy
176807 ( 93.857)(1253) 59.055(4787.2)
11701.9
= − = − −
= − − −
=
∑ ∑ ∑ ∑
( )
2
2 2
yy
2
y
SSTo S (y y) y
n
1253
176807
11
34078.9
= = − = −
= −
=
∑∑ ∑
45
Example
2 SSResid 11701.9
r 1 1
SSto 34078.9
1 0.343 0.657
= − = −
= − =
e
SSResid 11701.9
s 36.06
n-2 9
= = =
46
Example - Model Utility Test
1. β = the true average change in suicide
rate associated with an increase in the
unemployment rate of 1 percentage
point
2. H0: β = 0
3. Ha: β ≠ 0
4. α has not been preselected. We shall
interpret the observed level of
significance (P-value)
5. Test statistic:
b b b
b hypothesized value b 0 b
t
s s s
− −
= = =
47
Example - Model Utility Test
6. Assumptions: The following plot (Minitab) of
the data shows a linear pattern and the
variability of points does not appear to be
changing with x. Assuming that the distribution
of errors (residuals) at any given x value is
approximately normal, the assumptions of the
simple linear regression model are
appropriate.
48
Example - Model Utility Test
8. P-value: The table of tail areas for t-
distributions only has t values ≤ 4, so we can
see that the corresponding tail area is < 0.002.
Since this is a two-tail test the P-value < 0.004.
(Actual calculation gives a P-value = 0.002)
7. Calculation:
e
b
xx
s 36.06
s 14.24
S 6.4164
= = =
b
b 59.06
t 4.15
s 14.24
= = =
49
Example - Model Utility Test
8. Conclusion:
Even though no specific significance
level was chosen for the test, with the
P-value being so small (< 0.004) one
would generally reject the null
hypothesis that β = 0 and conclude that
there is a useful linear relationship
between the % unemployed and the
suicide rate.
50
Example - Minitab Output
Regression Analysis: Suicide Rate (y) versus Percentage Unemployed (x)
The regression equation is
Suicide Rate (y) = - 93.9 + 59.1 Percentage Unemployed (x)
Predictor Coef SE Coef T P
Constant -93.86 51.25 -1.83 0.100
Percenta 59.05 14.24 4.15 0.002
S = 36.06 R-Sq = 65.7% R-Sq(adj) = 61.8%
T value for Model Utility Test
H0: β = 0 Ha: β ≠ 0
P-value
51
Example – Reality Check!
Although the medel utility test indicates that the model
is useful, we should be a bit reticent to use the model
principally as a estimation tool.
Notice that s = 36.06, where the actual range of
suicide rates is 235 – 71 = 164. This means to typical
error in estimating the suicide rate would be
approximately 22% of the range in error. With 9 of the
11 data points having suicide rates at or below 104,
this would constitute a very large amount of error in
the estimation.
The statistics is very clear: We have established a
strong positive linear relationship between percentage
employed and the suicide rate. I would just not be
particularly meaningful or useful to provide actual
numerical estimates for suicide rates.
52
Residual Analysis
The simple linear regression model equation
is y = α + βx + e where e represents the
random deviation of an observed y value
from the population regression line α + βx .
Key assumptions about e
1. At any particular x value, the distribution
of e is a normal distribution
2. At any particular x value, the standard
deviation of e is σ, which is constant
over all values of x.
53
Residual Analysis
To check on these assumptions, one would
examine the deviations e1, e2, …, en.
Generally, the deviations are not known, so
we check on the assumptions by looking at
the residuals which are the deviations from
the estimated line, a + bx.
The residuals are given by
1 1 1 1
2 2 2 2
n n n n
ˆy y y (a bx )
ˆy y y (a bx )
ˆy y y (a bx )
− = − +
− = − +
− = − +
M
54
Standardized Residuals
Recall: A quantity is standardized by
subtracting its mean value and then dividing
by its true (or estimated) standard deviation.
For the residuals, the true mean is zero (0)
if the assumptions are true.
( )
i i
2
ˆy y e
xx
x x1
s s 1
n S
−
−
= − −
The estimated standard deviation of a residual
depends on the x value. The estimated standard
deviation of the ith
residual, , is given byi iˆy y−
55
Standardized Residuals
As you can see from the formula for the
estimated standard deviation the calculation
of the standardized residuals is a bit of a
calculational nightmare.
Fortunately, most statistical software
packages are set up to perform these
calculations and do so quite proficiently.
56
Standardized Residuals - Example
Consider the data on percentage unemployment
and suicide rates
Notice that the standardized residual for Pittsburgh
is -2.50, somewhat large for this size data set.
City
Percentage
Unemployed
Suicide
Rate
Residual Standardized
Residual
New York 3.0 72 83.31 -11.31 -0.34
Los Angeles 4.7 224 183.70 40.30 1.34
Chicago 3.0 82 83.31 -1.31 -0.04
Philadelphia 3.2 92 95.12 -3.12 -0.09
Detroit 3.8 104 130.55 -26.55 -0.78
Boston 2.5 71 53.78 17.22 0.55
San Francisco 4.8 235 189.61 45.39 1.56
Washington 2.7 81 65.59 15.41 0.48
Pittsburgh 4.4 86 165.99 -79.98 -2.50
St. Louis 3.1 102 89.21 12.79 0.38
Cleveland 3.5 104 112.84 -8.84 -0.26
ˆy ˆy - y
57
Example
Pittsburgh
This point has
an unusually
high residual
58
Normal Plots
500-50
2
1
0
-1
-2
NormalScore
Residual
Normal Probability Plot of the Residuals
(response is Suicide)
2.01.51.00.50.0-0.5-1.0-1.5-2.0-2.5
2
1
0
-1
-2
NormalScore
Standardized Residual
Normal Probability Plot of the Residuals
(response is Suicide)
Notice that both of the normal plots look similar. If
a software package is available to do the
calculation and plots, it is preferable to look at the
normal plot of the standardized residuals.
In both cases, the points look reasonable linear
with the possible exception of Pittsburgh, so the
assumption that the errors are normally distributed
seems to be supported by the sample data.
59
More Comments
The fact that Pittsburgh has a large
standardized residual makes it worthwhile
to look at that city carefully to make sure the
figures were reported correctly. One might
also look to see if there are some reasons
that Pittsburgh should be looked at
separately because some other
characteristic distinguishes it from all of the
other cities.
Pittsburgh does have a large effect on
model.
60
2
1
0
-1
-2
x
StandardizedResidual
Standardized Residuals Versus x
(response is y)
This plot is an example of a satisfactory plot that
indicates that the model assumptions are reasonable.
Visual Interpretation of
Standardized Residuals
61
This plot suggests that a curvilinear regression model
is needed.
2
1
0
-1
-2
x
StandardizedResidual
Standardized Residuals Versus x
(response is y)
Visual Interpretation of
Standardized Residuals
62
This plot suggests a non-constant variance. The
assumptions of the model are not correct.
2
1
0
-1
-2
-3
x
StandardizedResidual
Standardized Residuals Versus x
(response is y)3
Visual Interpretation of
Standardized Residuals
63
This plot shows a data point with a large standardized
residual.
2
1
0
-1
-2
-3
x
StandardizedResidual
Standardized Residuals Versus x
(response is y)
Visual Interpretation of
Standardized Residuals
64
This plot shows a potentially influential observation.
2
1
0
-1
-2
x
StandardizedResidual
Standardized Residuals Versus x
(response is y)
Visual Interpretation of
Standardized Residuals
65
Example - % Unemployment vs. Suicide Rate
This plot of the residuals (errors) indicates some
possible problems with this linear model. You can see
a pattern to the points.
Generally
decreasing
pattern to these
points.
Unusually large
residual –
clearly an
influential point
These two points are quite
influential since they are far
away from the others in
terms of the % unemployed
66
Properties of the Sampling Distribution
of a + bx for a Fixed x Value
Let x* denote a particular value of the
independent variable x. When the four basic
assumptions of the simple linear regression
model are satisfied, the sampling
distribution of the statistic a + bx* has the
following properties:
1. The mean value of a + bx* is α + βx*,
so a + bx* is an unbiased statistic for
estimating the average y value when
x = x*
67
Properties of the Sampling Distribution
of a + bx for a Fixed x Value
3. The distribution of the statistic a + bx* is
normal.
2. The standard deviation of the statistic
a + bx* denoted by σa+bx*, is given by
( )
2
a bx*
xx
x * x1
n S
+
−
σ = σ +
68
Addition Information about the Sampling
Distribution of a + bx for a Fixed x Value
The estimated standard deviation of
the statistic a + bx*, denoted by sa+bx*,
is given by ( )2
a bx* e
xx
x * x1
s s
n S
+
−
= +
When the four basic assumptions of the
simple linear regression model are satisfied,
the probability distribution of the standardized
variable
is the t distribution with df = n - 2.
a bx*
a bx * ( x*)
t
s +
+ − α +β
=
69
Confidence Interval for a Mean y Value
When the four basic assumptions of the
simple linear regression model are met, a
confidence interval for a + bx*, the
average y value when x has the value x*, is
a + bx* ± (t critical value)sa+bx*
Where the t critical value is based on
df = n -2.
Many authors give the following equivalent form
for the confidence interval.
2
e
xx
1 (x * x)
a bx * (t critical value)s
n S
−
+ ± +
70
Confidence Interval for a Single y Value
When the four basic assumptions of the simple
linear regression model are met, a prediction
interval for y*, a single y observation made
when x has the value x*, has the form
Where the t critical value is based on df = n -2.
2 2
e a bx*a bx * (t critical value) s s ++ ± +
Many authors give the following equivalent form
for the prediction interval.
2
e
xx
1 (x * x)
a bx * (t critical value)s 1
n S
−
+ ± + +
71
Example - Mean Annual Temperature vs. Mortality
Data was collected in certain regions of
Great Britain, Norway and Sweden to study
the relationship between the mean annual
temperature and the mortality rate for a
specific type of breast cancer in women.
* Lea, A.J. (1965) New Observations on distribution of neoplasms of female breast in
certain European countries. British Medical Journal, 1, 488-490
Mean Annual
Temperature (F°)
51.3 49.9 50.0 49.2 48.5 47.8 47.3 45.1
Mortality Index 102.5 104.5 100.4 95.9 87.0 95.0 88.6 89.2
Mean Annual
Temperature (F°)
46.3 42.1 44.2 43.5 42.3 40.2 31.8 34.0
Mortality Index 78.9 84.6 81.7 72.2 65.1 68.1 67.3 52.5
72
Example - Mean Annual Temperature vs. Mortality
Regression Analysis: Mortality index versus Mean annual temperature
The regression equation is
Mortality index = - 21.8 + 2.36 Mean annual temperature
Predictor Coef SE Coef T P
Constant -21.79 15.67 -1.39 0.186
Mean ann 2.3577 0.3489 6.76 0.000
S = 7.545 R-Sq = 76.5% R-Sq(adj) = 74.9%
Analysis of Variance
Source DF SS MS F P
Regression 1 2599.5 2599.5 45.67 0.000
Residual Error 14 796.9 56.9
Total 15 3396.4
Unusual Observations
Obs Mean ann Mortalit Fit SE Fit Residual St Resid
15 31.8 67.30 53.18 4.85 14.12 2.44RX
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
73
Example - Mean Annual Temperature vs. Mortality
504030
100
90
80
70
60
50
Mean annual
Mortalityin
S = 7.54466 R-Sq = 76.5 % R-Sq(adj) = 74.9 %
Mortality in = -21.7947 + 2.35769 Mean annual
Regression Plot
The point has a large standardized residual and is
influential because of the low Mean Annual Temperature.
74
Example - Mean Annual Temperature vs. Mortality
Predicted Values for New Observations
New Obs Fit SE Fit 95.0% CI 95.0% PI
1 53.18 4.85 ( 42.79, 63.57) ( 33.95, 72.41) X
2 60.72 3.84 ( 52.48, 68.96) ( 42.57, 78.88)
3 72.51 2.48 ( 67.20, 77.82) ( 55.48, 89.54)
4 83.34 1.89 ( 79.30, 87.39) ( 66.66, 100.02)
5 96.09 2.67 ( 90.37, 101.81) ( 78.93, 113.25)
6 99.16 3.01 ( 92.71, 105.60) ( 81.74, 116.57)
X denotes a row with X values away from the center
Values of Predictors for New Observations
New Obs Mean ann
1 31.8
2 35.0
3 40.0
4 44.6
5 50.0
6 51.3
These are the x* values for which the
above fits, standard errors of the fits,
95% confidence intervals for Mean y
values and prediction intervals for y
values given above.
75
504030
120
110
100
90
80
70
60
50
40
30
Mean annual
Mortalityin
S = 7.54466 R-Sq = 76.5 % R-Sq(adj) = 74.9 %
Mortality in = -21.7947 + 2.35769 Mean annual
95% PI
95% CI
Regression
Regression Plot
Example - Mean Annual Temperature vs. Mortality
95% prediction interval for single y value at x = 45. (67.62,100.98)
95% confidence interval for Mean y value at x = 40. (67.20, 77.82)
76
A Test for Independence in a
Bivariate Normal Population
Null hypothesis: H0: ρ = 0
Assumption: r is the correlation coefficient for a
random sample from a bivariate normal
population.
Test statistic:
The t critical value is based on df = n - 2
2
r
t
1 r
n 2
=
−
−
77
A Test for Independence in a
Bivariate Normal Population
Alternate hypothesis: H0: ρ > 0 (Positive
dependence): P-value is the area under the
appropriate t curve to the right of the computed t.
Alternate hypothesis: H0: ρ < 0 (Negative
dependence): P-value is the area under the
appropriate t curve to the right of the computed t.
Alternate hypothesis: H0: ρ ≠ 0 (Dependence):
P-value is
i. twice the area under the appropriate t curve to the left of
the computed t value if t < 0 and
ii. twice the area under the appropriate t curve to the right of
the computed t value if t > 0
78
Example
Recall the data from
the study of %Fat vs.
Age for humans.
There are 18 data
points and a quick
calculation of the
Pierson correlation
coefficient gives
r = 0.79209.
We will test to see if
there is a dependence
at the 0.05
significance level.
Age (x) % Fat y x2
xy
23 9.5 529 218.5
23 27.9 529 641.7
27 7.8 729 210.6
27 17.8 729 480.6
39 31.4 1521 1224.6
41 25.9 1681 1061.9
45 27.4 2025 1233
49 25.2 2401 1234.8
50 31.1 2500 1555
53 34.7 2809 1839.1
53 42 2809 2226
54 29.1 2916 1571.4
56 32.5 3136 1820
57 30.3 3249 1727.1
58 33 3364 1914
58 33.8 3364 1960.4
60 41.1 3600 2466
61 34.5 3721 2104.5
79
Example
1. ρ = the correlation between % fat and
age in the population from which the
sample was selected
2. H0: ρ = 0
3. Ha: ρ ≠ 0
4. α = 0.05
5. Test statistic:
2
r
t , df n 2
1 r
n 2
= = −
−
−
80
Example
6. Looking at the two normal plots, we can see
it is not reasonable to assume that either the
distribution of age nor the distribution of % fat
are normal. (Notice, the data points deviate
from a linear pattern quite substantially.
Since neither is normal, we shall not continue
with the test.
P-Value: 0.011
A-Squared: 0.980
Anderson-Darling Normality Test
N: 18
StDev: 13.2176
Average: 46.3333
6252423222
.999
.99
.95
.80
.50
.20
.05
.01
.001
Probability
Age (x)
Normal Probability Plot
P-Value: 0.032
A-Squared: 0.796
Anderson-Darling Normality Test
N: 18
StDev: 9.14439
Average: 28.6111
40302010
.999
.99
.95
.80
.50
.20
.05
.01
.001
Probability
% Fat y
Normal Probability Plot
81
Another Example
Height vs. Joint Length
The professor in an elementary statistics
class wanted to explain correlation so he
needed some bivariate data. He asked his
class (presumably a random or
representative sample of late adolescent
humans) to measure the length of the
metacarpal bone on the index finger of the
right hand (in cm) and height (in ft). The
data are provided on the next slide.
82
Example - Height vs. Joint Length
There are 17 data points and a quick
calculation of the Pierson correlation
coefficient gives r = 0.74908.
We will test to see if the true population
correlation coefficient is positive at the 0.05
level of significance.
Joint length 3.5 3.4 3.4 2.7 3.5 3.5 4.2 4.0 3.0
Height 64 68.5 69 64 68 73 72 75 70
Joint length 3.4 2.9 3.5 3.5 2.8 4.0 3.8 3.3
Height 68.5 65 67 70 65 75 70 66
83
1. ρ = the true correlation between height
and right index finger metacarpal joint in
the population from which the sample
was selected
2. H0: ρ = 0
3. Ha: ρ > 0
4. α = 0.05
Example - Height vs. Joint Length
5. Test statistic:
2
r
t , df n 2
1 r
n 2
= = −
−
−
84
6. Looking at the two normal plots, we can see it is
reasonable to assume that the distribution of age and
the distribution of % fat are both normal. (Notice, the
data points follow a reasonably linear pattern. This
appears to confirm the assumption that the sample is
from a bivariate normal distribution. We will assume
that the class was a random sample of young adults.
Example - Height vs. Joint Length
P-Value: 0.557
A-Squared: 0.294
Anderson-Darling Normality Test
N: 17
StDev: 3.49974
Average: 68.8235
757065
.999
.99
.95
.80
.50
.20
.05
.01
.001
Probability
Height
Normal Probability Plot
P-Value: 0.156
A-Squared: 0.524
Anderson-Darling Normality Test
N: 17
StDev: 0.419734
Average: 3.43529
4.03.53.0
.999
.99
.95
.80
.50
.20
.05
.01
.001
Probability
Joint
Normal Probability Plot
85
Example - Height vs. Joint Length
8. P-value: Looking on the table of tail areas for t
curves under 15 degrees of freedom, 4.379 is off
the bottom of the table, so P-value < 0.001. Minitab
reports the P-value to be 0.001.
9. Conclusion: The P-value is smaller than α = 0.05, so
we can reject H0. We can conclude that the true
population correlation coefficient is greater then 0.
I.e., the metacarpal bone is longer for taller people.
7. Calculation:
2 2
r 0.74908
t 4.379
1 r 1 (0.74908)
n 2 17 2
= = =
− −
− −

Weitere ähnliche Inhalte

Was ist angesagt?

Approach to anova questions
Approach to anova questionsApproach to anova questions
Approach to anova questionsGeorgeGidudu
 
A. Spanos Probability/Statistics Lecture Notes 5: Post-data severity evaluation
A. Spanos Probability/Statistics Lecture Notes 5: Post-data severity evaluationA. Spanos Probability/Statistics Lecture Notes 5: Post-data severity evaluation
A. Spanos Probability/Statistics Lecture Notes 5: Post-data severity evaluationjemille6
 
6334 Day 3 slides: Spanos-lecture-2
6334 Day 3 slides: Spanos-lecture-26334 Day 3 slides: Spanos-lecture-2
6334 Day 3 slides: Spanos-lecture-2jemille6
 
Linear Regression Ordinary Least Squares Distributed Calculation Example
Linear Regression Ordinary Least Squares Distributed Calculation ExampleLinear Regression Ordinary Least Squares Distributed Calculation Example
Linear Regression Ordinary Least Squares Distributed Calculation ExampleMarjan Sterjev
 
Error analysis statistics
Error analysis   statisticsError analysis   statistics
Error analysis statisticsTarun Gehlot
 
Hypergeometric distribution
Hypergeometric distributionHypergeometric distribution
Hypergeometric distributionmohammad nouman
 
4 2 continuous probability distributionn
4 2 continuous probability    distributionn4 2 continuous probability    distributionn
4 2 continuous probability distributionnLama K Banna
 
Spanos lecture+3-6334-estimation
Spanos lecture+3-6334-estimationSpanos lecture+3-6334-estimation
Spanos lecture+3-6334-estimationjemille6
 
Solutions. Design and Analysis of Experiments. Montgomery
Solutions. Design and Analysis of Experiments. MontgomerySolutions. Design and Analysis of Experiments. Montgomery
Solutions. Design and Analysis of Experiments. MontgomeryByron CZ
 
Spurious correlation (updated)
Spurious correlation (updated)Spurious correlation (updated)
Spurious correlation (updated)jemille6
 
06 ch ken black solution
06 ch ken black solution06 ch ken black solution
06 ch ken black solutionKrunal Shah
 
C2 st lecture 13 revision for test b handout
C2 st lecture 13   revision for test b handoutC2 st lecture 13   revision for test b handout
C2 st lecture 13 revision for test b handoutfatima d
 

Was ist angesagt? (17)

Probability Distribution
Probability DistributionProbability Distribution
Probability Distribution
 
Approach to anova questions
Approach to anova questionsApproach to anova questions
Approach to anova questions
 
Chapter05
Chapter05Chapter05
Chapter05
 
A. Spanos Probability/Statistics Lecture Notes 5: Post-data severity evaluation
A. Spanos Probability/Statistics Lecture Notes 5: Post-data severity evaluationA. Spanos Probability/Statistics Lecture Notes 5: Post-data severity evaluation
A. Spanos Probability/Statistics Lecture Notes 5: Post-data severity evaluation
 
Chapter14
Chapter14Chapter14
Chapter14
 
6334 Day 3 slides: Spanos-lecture-2
6334 Day 3 slides: Spanos-lecture-26334 Day 3 slides: Spanos-lecture-2
6334 Day 3 slides: Spanos-lecture-2
 
Linear Regression Ordinary Least Squares Distributed Calculation Example
Linear Regression Ordinary Least Squares Distributed Calculation ExampleLinear Regression Ordinary Least Squares Distributed Calculation Example
Linear Regression Ordinary Least Squares Distributed Calculation Example
 
Error analysis statistics
Error analysis   statisticsError analysis   statistics
Error analysis statistics
 
Practice Test 2 Solutions
Practice Test 2  SolutionsPractice Test 2  Solutions
Practice Test 2 Solutions
 
Hypergeometric distribution
Hypergeometric distributionHypergeometric distribution
Hypergeometric distribution
 
Chapter11
Chapter11Chapter11
Chapter11
 
4 2 continuous probability distributionn
4 2 continuous probability    distributionn4 2 continuous probability    distributionn
4 2 continuous probability distributionn
 
Spanos lecture+3-6334-estimation
Spanos lecture+3-6334-estimationSpanos lecture+3-6334-estimation
Spanos lecture+3-6334-estimation
 
Solutions. Design and Analysis of Experiments. Montgomery
Solutions. Design and Analysis of Experiments. MontgomerySolutions. Design and Analysis of Experiments. Montgomery
Solutions. Design and Analysis of Experiments. Montgomery
 
Spurious correlation (updated)
Spurious correlation (updated)Spurious correlation (updated)
Spurious correlation (updated)
 
06 ch ken black solution
06 ch ken black solution06 ch ken black solution
06 ch ken black solution
 
C2 st lecture 13 revision for test b handout
C2 st lecture 13   revision for test b handoutC2 st lecture 13   revision for test b handout
C2 st lecture 13 revision for test b handout
 

Andere mochten auch

Andere mochten auch (20)

Chapter10
Chapter10Chapter10
Chapter10
 
Chapter6
Chapter6Chapter6
Chapter6
 
Chapter2
Chapter2Chapter2
Chapter2
 
Chapter3
Chapter3Chapter3
Chapter3
 
Chapter9
Chapter9Chapter9
Chapter9
 
Chapter1
Chapter1Chapter1
Chapter1
 
Simple linear regression
Simple linear regressionSimple linear regression
Simple linear regression
 
Simple Linear Regression (simplified)
Simple Linear Regression (simplified)Simple Linear Regression (simplified)
Simple Linear Regression (simplified)
 
Linear regression
Linear regressionLinear regression
Linear regression
 
Linear regression
Linear regressionLinear regression
Linear regression
 
Simple linear regression (final)
Simple linear regression (final)Simple linear regression (final)
Simple linear regression (final)
 
Simple lin regress_inference
Simple lin regress_inferenceSimple lin regress_inference
Simple lin regress_inference
 
C2.1 intro
C2.1 introC2.1 intro
C2.1 intro
 
Simple Linear Regression
Simple Linear RegressionSimple Linear Regression
Simple Linear Regression
 
Ch14
Ch14Ch14
Ch14
 
Simple linear regression and correlation
Simple linear regression and correlationSimple linear regression and correlation
Simple linear regression and correlation
 
Basics of Regression analysis
 Basics of Regression analysis Basics of Regression analysis
Basics of Regression analysis
 
Simple linear regression analysis
Simple linear  regression analysisSimple linear  regression analysis
Simple linear regression analysis
 
Chap12 simple regression
Chap12 simple regressionChap12 simple regression
Chap12 simple regression
 
Simple Linier Regression
Simple Linier RegressionSimple Linier Regression
Simple Linier Regression
 

Ähnlich wie Chapter13

Regression Analysis by Muthama JM
Regression Analysis by Muthama JM Regression Analysis by Muthama JM
Regression Analysis by Muthama JM Japheth Muthama
 
Regression analysis by Muthama JM
Regression analysis by Muthama JMRegression analysis by Muthama JM
Regression analysis by Muthama JMJapheth Muthama
 
Regression analysis presentation
Regression analysis presentationRegression analysis presentation
Regression analysis presentationMuhammadFaisal733
 
Biostats coorelation vs rREGRESSION.DIFFERENCE BETWEEN CORRELATION AND REGRES...
Biostats coorelation vs rREGRESSION.DIFFERENCE BETWEEN CORRELATION AND REGRES...Biostats coorelation vs rREGRESSION.DIFFERENCE BETWEEN CORRELATION AND REGRES...
Biostats coorelation vs rREGRESSION.DIFFERENCE BETWEEN CORRELATION AND REGRES...Payaamvohra1
 
Presentation on Regression Analysis
Presentation on Regression AnalysisPresentation on Regression Analysis
Presentation on Regression AnalysisRekha Rani
 
Correlation by Neeraj Bhandari ( Surkhet.Nepal )
Correlation by Neeraj Bhandari ( Surkhet.Nepal )Correlation by Neeraj Bhandari ( Surkhet.Nepal )
Correlation by Neeraj Bhandari ( Surkhet.Nepal )Neeraj Bhandari
 
Econometrics homework help
Econometrics homework helpEconometrics homework help
Econometrics homework helpMark Austin
 
Correlation and Regrretion
Correlation and RegrretionCorrelation and Regrretion
Correlation and RegrretionParth Desani
 
Corr-and-Regress (1).ppt
Corr-and-Regress (1).pptCorr-and-Regress (1).ppt
Corr-and-Regress (1).pptMuhammadAftab89
 
Cr-and-Regress.ppt
Cr-and-Regress.pptCr-and-Regress.ppt
Cr-and-Regress.pptRidaIrfan10
 
Corr-and-Regress.ppt
Corr-and-Regress.pptCorr-and-Regress.ppt
Corr-and-Regress.pptkrunal soni
 
Corr-and-Regress.ppt
Corr-and-Regress.pptCorr-and-Regress.ppt
Corr-and-Regress.pptMoinPasha12
 
Correlation & Regression for Statistics Social Science
Correlation & Regression for Statistics Social ScienceCorrelation & Regression for Statistics Social Science
Correlation & Regression for Statistics Social Sciencessuser71ac73
 

Ähnlich wie Chapter13 (20)

Regression Analysis by Muthama JM
Regression Analysis by Muthama JM Regression Analysis by Muthama JM
Regression Analysis by Muthama JM
 
Regression analysis by Muthama JM
Regression analysis by Muthama JMRegression analysis by Muthama JM
Regression analysis by Muthama JM
 
Regression analysis presentation
Regression analysis presentationRegression analysis presentation
Regression analysis presentation
 
Biostats coorelation vs rREGRESSION.DIFFERENCE BETWEEN CORRELATION AND REGRES...
Biostats coorelation vs rREGRESSION.DIFFERENCE BETWEEN CORRELATION AND REGRES...Biostats coorelation vs rREGRESSION.DIFFERENCE BETWEEN CORRELATION AND REGRES...
Biostats coorelation vs rREGRESSION.DIFFERENCE BETWEEN CORRELATION AND REGRES...
 
Chapter13
Chapter13Chapter13
Chapter13
 
Regression
RegressionRegression
Regression
 
Presentation on Regression Analysis
Presentation on Regression AnalysisPresentation on Regression Analysis
Presentation on Regression Analysis
 
Correlation by Neeraj Bhandari ( Surkhet.Nepal )
Correlation by Neeraj Bhandari ( Surkhet.Nepal )Correlation by Neeraj Bhandari ( Surkhet.Nepal )
Correlation by Neeraj Bhandari ( Surkhet.Nepal )
 
ML Module 3.pdf
ML Module 3.pdfML Module 3.pdf
ML Module 3.pdf
 
Econometrics homework help
Econometrics homework helpEconometrics homework help
Econometrics homework help
 
Chap5 correlation
Chap5 correlationChap5 correlation
Chap5 correlation
 
Correlation and Regrretion
Correlation and RegrretionCorrelation and Regrretion
Correlation and Regrretion
 
Corr And Regress
Corr And RegressCorr And Regress
Corr And Regress
 
Corr-and-Regress (1).ppt
Corr-and-Regress (1).pptCorr-and-Regress (1).ppt
Corr-and-Regress (1).ppt
 
Corr-and-Regress.ppt
Corr-and-Regress.pptCorr-and-Regress.ppt
Corr-and-Regress.ppt
 
Cr-and-Regress.ppt
Cr-and-Regress.pptCr-and-Regress.ppt
Cr-and-Regress.ppt
 
Corr-and-Regress.ppt
Corr-and-Regress.pptCorr-and-Regress.ppt
Corr-and-Regress.ppt
 
Corr-and-Regress.ppt
Corr-and-Regress.pptCorr-and-Regress.ppt
Corr-and-Regress.ppt
 
Correlation & Regression for Statistics Social Science
Correlation & Regression for Statistics Social ScienceCorrelation & Regression for Statistics Social Science
Correlation & Regression for Statistics Social Science
 
Corr-and-Regress.ppt
Corr-and-Regress.pptCorr-and-Regress.ppt
Corr-and-Regress.ppt
 

Mehr von Richard Ferreria (16)

Adding grades to your google site v2 (dropbox)
Adding grades to your google site v2 (dropbox)Adding grades to your google site v2 (dropbox)
Adding grades to your google site v2 (dropbox)
 
Stats chapter 14
Stats chapter 14Stats chapter 14
Stats chapter 14
 
Stats chapter 15
Stats chapter 15Stats chapter 15
Stats chapter 15
 
Stats chapter 13
Stats chapter 13Stats chapter 13
Stats chapter 13
 
Stats chapter 12
Stats chapter 12Stats chapter 12
Stats chapter 12
 
Stats chapter 11
Stats chapter 11Stats chapter 11
Stats chapter 11
 
Stats chapter 11
Stats chapter 11Stats chapter 11
Stats chapter 11
 
Stats chapter 10
Stats chapter 10Stats chapter 10
Stats chapter 10
 
Stats chapter 9
Stats chapter 9Stats chapter 9
Stats chapter 9
 
Stats chapter 8
Stats chapter 8Stats chapter 8
Stats chapter 8
 
Stats chapter 8
Stats chapter 8Stats chapter 8
Stats chapter 8
 
Stats chapter 7
Stats chapter 7Stats chapter 7
Stats chapter 7
 
Stats chapter 6
Stats chapter 6Stats chapter 6
Stats chapter 6
 
Podcasting and audio editing
Podcasting and audio editingPodcasting and audio editing
Podcasting and audio editing
 
Adding grades to your google site
Adding grades to your google siteAdding grades to your google site
Adding grades to your google site
 
Stats chapter 5
Stats chapter 5Stats chapter 5
Stats chapter 5
 

Kürzlich hochgeladen

HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
Culture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptxCulture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptxPoojaSen20
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)cama23
 

Kürzlich hochgeladen (20)

FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
Culture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptxCulture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptx
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)
 

Chapter13

  • 1. 1 Chapter 13 Simple Linear Regression & Correlation Inferential Methods
  • 2. 2 Consider the two variables x and y. A deterministic relationship is one in which the value of y (the dependent variable) is described by some formula or mathematical notation such as y = f(x), y = 3 + 2 x or y = 5e-2x where x is the dependent variable. Deterministic Models
  • 3. 3 A description of the relation between two variables x and y that are not deterministically related can be given by specifying a probabilistic model. The general form of an additive probabilistic model allows y to be larger or smaller than f(x) by a random amount, e. The model equation is of the form Probabilistic Models Y = deterministic function of x + random deviation = f(x) + e
  • 4. 4 Probabilistic Models Deviations from the deterministic part of a probabilistic model e=-1.5
  • 5. 5 Simple Linear Regression Model The simple linear regression model assumes that there is a line with vertical or y intercept a and slope b, called the true or population regression line. When a value of the independent variable x is fixed and an observation on the dependent variable y is made, y = α + βx + e Without the random deviation e, all observed points (x, y) points would fall exactly on the population regression line. The inclusion of e in the model equation allows points to deviate from the line by random amounts.
  • 6. 6 Simple Linear Regression Model 0 0 x = x1 x = x2 e2 Observation when x = x1 (positive deviation) e2 Observation when x = x2 (positive deviation) α = vertical intercept Population regression line (Slope β)
  • 7. 7 Basic Assumptions of the Simple Linear Regression Model 1. The distribution of e at any particular x value has mean value 0 (µe = 0). 2. The standard deviation of e (which describes the spread of its distribution) is the same for any particular value of x. This standard deviation is denoted by σ. 3. The distribution of e at any particular x value is normal. 4. The random deviations e1, e2, …, en associated with different observations are independent of one another.
  • 8. 8 More About the Simple Linear Regression Model and (standard deviation of y for fixed x) = σ. For any fixed x value, y itself has a normal distribution. mean y value height of the population x for fixed x regression line above x     = = α + β       
  • 9. 9 Interpretation of Terms 1. The slope β of the population regression line is the mean (average) change in y associated with a 1-unit increase in x. 2. The vertical intercept α is the height of the population line when x = 0. 3. The size of σ determines the extent to which the (x, y) observations deviate from the population line. Small σ Large σ
  • 11. 11 Estimates for the Regression Line The point estimates of β, the slope, and α, the y intercept of the population regression line, are the slope and y intercept, respectively, of the least squares line. That is, xy xx S b point estimate of S = β = a point estimate of y bx= α = − where ( ) ( ) ( ) 2 2 xy xx x y x S xy and S x n n = − = − ∑ ∑ ∑ ∑ ∑ where ( ) ( ) ( ) 2 2 xy xx x y x S xy and S x n n = − = − ∑ ∑ ∑ ∑ ∑
  • 12. 12 Interpretation of y = a + bx Let x* denote a specific value of the predictor variable x. The a + bx* has two interpetations: 1. a + bx* is a point estimate of the mean y value when x = x*. 2. a + bx* is a point prediction of an individual y value to be observed when x = x*.
  • 13. 13 Example The following data was collected in a study of age and fatness in humans. * Mazess, R.B., Peppler, W.W., and Gibbons, M. (1984) Total body composition by dual- photon (153 Gd) absorptiometry. American Journal of Clinical Nutrition, 40, 834-839 One of the questions was, “What is the relationship between age and fatness?” Age 23 23 27 27 39 41 45 49 50 % Fat 9.5 27.9 7.8 17.8 31.4 25.9 27.4 25.2 31.1 Age 53 53 54 56 57 58 58 60 61 % Fat 34.7 42 29.1 32.5 30.3 33 33.8 41.1 34.5
  • 14. 14 Example Age (x) % Fat y x2 xy 23 9.5 529 218.5 23 27.9 529 641.7 27 7.8 729 210.6 27 17.8 729 480.6 39 31.4 1521 1224.6 41 25.9 1681 1061.9 45 27.4 2025 1233 49 25.2 2401 1234.8 50 31.1 2500 1555 53 34.7 2809 1839.1 53 42 2809 2226 54 29.1 2916 1571.4 56 32.5 3136 1820 57 30.3 3249 1727.1 58 33 3364 1914 58 33.8 3364 1960.4 60 41.1 3600 2466 61 34.5 3721 2104.5 834 515 41612 25489.2 2 n 18 X 834 y 515 X 41612 XY 25489.2 = = = = = ∑ ∑ ∑ ∑
  • 15. 15 Example 2 n 18, x 834, y 515 x 41612, xy 25489.2 = = = = = ∑ ∑ ∑ ∑ ( ) 2 2 xx 2 x S x n 834 41612 2970 18 = − = − = ∑∑ ( ) ( ) ( ) ( ) xy x y S xy n 834 515 25489.2 1627.53 18 = − = − = ∑ ∑∑
  • 16. 16 Example xy xx S 1627.53 b 0.54799 S 2970 = = = 515 834 a y bx 0.54799 3.2209 18 18 = − = − = ˆy 3.22 0.548x= +
  • 17. 17 Example A point estimate for the %Fat for a human who is 45 years old is If 45 is put into the equation for x, we have both an estimated %Fat for a 45 year old human or an estimated average %Fat for 45 year old humans The two interpretations are quite different. ˆy 3.22 0.548x= + a + bx=3.22+0.548(45)=27.9% a + bx=3.22+0.548(45)=27.9%
  • 18. 18 Example A plot of the data points along with the least squares regression line created with Minitab is given to the right. 6050403020 40 30 20 10 Age (x) %Faty S = 5.75361 R-Sq = 62.7 % R-Sq(adj) = 60.4 % % Fat y = 3.22086 + 0.547991 Age (x) Regression Plot
  • 19. 19 Terminology The predicted or fitted values result from substituting each sample x value into the equation for the least squares line. This gives =1st predicted value =2nd predicted value =nth predicted value 1 1 2 2 n n ˆy a bx ˆy a bx ... ˆy a bx = + = + = + The predicted or fitted values result from substituting each sample x value into the equation for the least squares line. This gives =1st predicted value =2nd predicted value =nth predicted value 1 1 2 2 n n ˆy a bx ˆy a bx ... ˆy a bx = + = + = + The residuals for the least squares line are the values: 1 1 2 2 n n y y ,y y , ...,y yˆ ˆ ˆ− − − The residuals for the least squares line are the values: 1 1 2 2 n n y y ,y y , ...,y yˆ ˆ ˆ− − −
  • 20. 20 Definition formulae The total sum of squares, denoted by SSTo, is defined as The residual sum of squares, denoted by SSResid, is defined as 2 2 2 1 2 n 2 SSTo (y y) (y y) (y y) (y y)∑ = − + − + + − = − L 2 2 2 1 1 2 2 n n 2 SSResid (y y ) (y y ) (y y )ˆ ˆ ˆ (y y)ˆ∑ = − + − + + − = − L
  • 21. 21 Calculation Formulae Recalled SSTo and SSResid are generally found as part of the standard output from most statistical packages or can be obtained using the following computational formulas: ( ) ( ) 2 2 2 y SSTo y y y n ∑ ∑ ∑= − = − 2 2 SSResid (y y) y a y b xyˆ∑ ∑ ∑ ∑= − = − −
  • 22. 22 Coefficient of Determination The coefficient of determination, denoted by r2 , gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y. The coefficient of determination, r2, can be computed as 2 SSResid r 1 SSTo = − The coefficient of determination, r2, can be computed as 2 SSResid r 1 SSTo = −
  • 23. 23 Estimated Standard Deviation, se The statistic for estimating the variance σ2 is where 2 e SSResid s n 2 = − 2 2 ˆSSResid (y y) y a y b xy= − = − −∑ ∑ ∑ ∑ 2 eThe subscript e in s is a reminder that we are estimating the variance of the "errors" or residuals.
  • 24. 24 Estimated Standard Deviation, se The estimate of σ is the estimated standard deviation The number of degrees of freedom associated with estimating σ2 or σ in simple linear regression is n - 2. 2 e es s=
  • 25. 25 Example continued SSResid 529.66= 2 e SSResid s n 2 529.66 18 2 33.104 = − = − = 2 e es s 33.104 5.754 = = = Age (x) % Fat (y) y2 Predicted Value Residual 23 9.5 90.3 15.82 -6.32 40.00 23 27.9 778.4 15.82 12.08 145.81 27 7.8 60.8 18.02 -10.22 104.38 27 17.8 316.8 18.02 -0.22 0.05 39 31.4 986.0 24.59 6.81 46.34 41 25.9 670.8 25.69 0.21 0.04 45 27.4 750.8 27.88 -0.48 0.23 49 25.2 635.0 30.07 -4.87 23.74 50 31.1 967.2 30.62 0.48 0.23 53 34.7 1204.1 32.26 2.44 5.93 53 42.0 1764.0 32.26 9.74 94.78 54 29.1 846.8 32.81 -3.71 13.78 56 32.5 1056.3 33.91 -1.41 1.98 57 30.3 918.1 34.46 -4.16 17.27 58 33.0 1089.0 35.00 -2.00 4.02 58 33.8 1142.4 35.00 -1.20 1.45 60 41.1 1689.2 36.10 5.00 25.00 61 34.5 1190.3 36.65 -2.15 4.62 834 515.0 16156.3 529.66 ˆy y−ˆy ( ) 2 ˆy y−
  • 26. 26 Example continued 2 n 18, y 515.0, y 16156.3 xy 25489.2 ,a 3.2209, b 0.54799 = = = = = = ∑ ∑ ∑ ( ) ( ) 2 2 2 2 y SSTot= y-y y n (515.0) 16156.3 1421.5 18 = − = − = ∑∑ ∑ 2 SSResid 529.66 r 1 1 1 0.373 0.627 SSTo 1421.5 = − = − = − =
  • 27. 27 Example continued With r2 = 0.627 or 62.7%, we can say that 62.7% of the observed variation in %Fat can be attributed to the probabilistic linear relationship with human age. The magnitude of a typical sample deviation from the least squares line is about 5.75(%) which is reasonably large compared to the y values themselves. This would suggest that the model is only useful in the sense of provide gross “ballpark” estimates for %Fat for humans based on age.
  • 28. 28 Properties of the Sampling Distribution of b 1. The mean value of b is β. Specifically, µb=β and hence b is an unbiased statistic for estimating β When the four basic assumptions of the simple linear regression model are satisfied, the following conditions are met: b xxS σ σ = 2. The standard deviation of the statistic b is 3. The statistic b has a normal distribution (a consequence of the error e being normally distributed)
  • 29. 29 Estimated Standard Deviation of b The estimated standard deviation of the statistic b is e b xx s S σ = When then four basic assumptions of the simple linear regression model are satisfied, the probability distribution of the standardized variable is the t distribution with df = n - 2 b b t s − β =
  • 30. 30 Confidence interval for β When then four basic assumptions of the simple linear regression model are satisfied, a confidence interval for β, the slope of the population regression line, has the form b ± (t critical value)⋅sb where the t critical value is based on df = n - 2.
  • 31. 31 Example continued Recall 2 2 n 18, x 834, y 515 x 41612, xy 25489.2, y 16156.3 = = = = = = ∑ ∑ ∑ ∑ ∑ b 0.54799, a 3.2209= = e b xx s 5.754 s 0.1056 S 2970 = = = A 95% confidence interval estimate for β is bb t s 0.5480 (2.12) (0.1056) 0.5480 0.2238± = ± = ±g g es 5.754=
  • 32. 32 Example continued Based on sample data, we are 95% confident that the true mean increase in %Fat associated with a year of age is between 0.324% and 0.772%. A 95% confidence interval estimate for β is bb ts 0.5480 2.12(0.1056) 0.5480 0.2238 (0.324,0.772) ± = ± = ±
  • 33. 33 The regression equation is % Fat y = 3.22 + 0.548 Age (x) Predictor Coef SE Coef T P Constant 3.221 5.076 0.63 0.535 Age (x) 0.5480 0.1056 5.19 0.000 S = 5.754 R-Sq = 62.7% R-Sq(adj) = 60.4% Analysis of Variance Source DF SS MS F P Regression 1 891.87 891.87 26.94 0.000 Residual Error 16 529.66 33.10 Total 17 1421.54 Example continued Minitab output looks like Regression line 2 es residual df = n -2 SSResidSSTo Estimated slope b Regression Analysis: % Fat y versus Age (x) Estimated y intercept a
  • 34. 34 Hypothesis Tests Concerning β Null hypothesis: H0: β = hypothesized value Test statistic: The test is based on df = n - 2 b b hypothesized value t s − = Test statistic: The test is based on df = n - 2 b b hypothesized value t s − =
  • 35. 35 Hypothesis Tests Concerning β Alternate hypothesis and finding the P-value: 1. Ha: β > hypothesized value P-value = Area under the t curve with n - 2 degrees of freedom to the right of the calculated t 2. Ha: β < hypothesized value P-value = Area under the t curve with n - 2 degrees of freedom to the left of the calculated t
  • 36. 36 Hypothesis Tests Concerning β 3. Ha: β ≠ hypothesized value a) If t is positive, P-value = 2 (Area under the t curve with n - 2 degrees of freedom to the right of the calculated t) b) If t is negative, P-value = 2 (Area under the t curve with n - 2 degrees of freedom to the left of the calculated t)
  • 37. 37 Hypothesis Tests Concerning β Assumptions: 1. The distribution of e at any particular x value has mean value 0 (µe = 0) 2. The standard deviation of e is σ, which does not depend on x 3. The distribution of e at any particular x value is normal 4. The random deviations e1, e2, … , en associated with different observations are independent of one another
  • 38. 38 Hypothesis Tests Concerning β Quite often the test is performed with the hypotheses H0: β = 0 vs. Ha: β ≠ 0 This particular form of the test is called the model utility test for simple linear regression. The test statistic simplifies to and is called the t ratio. b b t s = The null hypothesis specifies that there is no useful linear relationship between x and y, whereas the alternative hypothesis specifies that there is a useful linear relationship between x and y.
  • 39. 39 Example Consider the following data on percentage unemployment and suicide rates. * Smith, D. (1977) Patterns in Human Geography, Canada: Douglas David and Charles Ltd., 158. City Percentage Unemployed Suicide Rate New York 3.0 72 Los Angeles 4.7 224 Chicago 3.0 82 Philadelphia 3.2 92 Detroit 3.8 104 Boston 2.5 71 San Francisco 4.8 235 Washington 2.7 81 Pittsburgh 4.4 86 St. Louis 3.1 102 Cleveland 3.5 104
  • 40. 40 Example The plot of the data points produced by Minitab follows
  • 41. 41 Example City Percentage Unemployed (x) Suicide Rate (y) x 2 xy y 2 New York 3.0 72 9.00 216.0 05184 Los Angeles 4.7 224 22.09 1052.8 50176 Chicago 3.0 82 9.00 246.0 06724 Philadelphia 3.2 92 10.24 294.4 08464 Detroit 3.8 104 14.44 395.2 10816 Boston 2.5 71 6.25 177.5 05041 San Francisco 4.8 235 23.04 1128.0 55225 Washington 2.7 81 7.29 218.7 06561 Pittsburgh 4.4 86 19.36 378.4 07396 St. Louis 3.1 102 9.61 316.2 10404 Cleveland 3.5 104 12.25 364.0 10816 38.7 1253 142.57 4787.2 176807
  • 42. 42 Example Some basic summary statistics 2 2 n 11, x 38.7, x 142.57 y 1253, y 176807, xy 4787.2 = = = = = = ∑ ∑ ∑ ∑ ∑ ( ) ( ) xy x y S xy n (38.7)(1253) 4787.2 11 378.92 = − = − = ∑ ∑ ∑ ( ) 2 2 xx 2 x S x n 38.7 142.57 11 6.4164 = − = − = ∑ ∑
  • 43. 43 Example Continuing with the calculations xy xx S 378.92 b 59.06 S 6.4164 = = = 1253 38.7 a y bx 59.06 93.86 11 11 = − = − = − ˆy 93.86 59.06x= − +
  • 44. 44 Example Continuing with the calculations 2 2 SSResid ˆ(y y) y a y b xy 176807 ( 93.857)(1253) 59.055(4787.2) 11701.9 = − = − − = − − − = ∑ ∑ ∑ ∑ ( ) 2 2 2 yy 2 y SSTo S (y y) y n 1253 176807 11 34078.9 = = − = − = − = ∑∑ ∑
  • 45. 45 Example 2 SSResid 11701.9 r 1 1 SSto 34078.9 1 0.343 0.657 = − = − = − = e SSResid 11701.9 s 36.06 n-2 9 = = =
  • 46. 46 Example - Model Utility Test 1. β = the true average change in suicide rate associated with an increase in the unemployment rate of 1 percentage point 2. H0: β = 0 3. Ha: β ≠ 0 4. α has not been preselected. We shall interpret the observed level of significance (P-value) 5. Test statistic: b b b b hypothesized value b 0 b t s s s − − = = =
  • 47. 47 Example - Model Utility Test 6. Assumptions: The following plot (Minitab) of the data shows a linear pattern and the variability of points does not appear to be changing with x. Assuming that the distribution of errors (residuals) at any given x value is approximately normal, the assumptions of the simple linear regression model are appropriate.
  • 48. 48 Example - Model Utility Test 8. P-value: The table of tail areas for t- distributions only has t values ≤ 4, so we can see that the corresponding tail area is < 0.002. Since this is a two-tail test the P-value < 0.004. (Actual calculation gives a P-value = 0.002) 7. Calculation: e b xx s 36.06 s 14.24 S 6.4164 = = = b b 59.06 t 4.15 s 14.24 = = =
  • 49. 49 Example - Model Utility Test 8. Conclusion: Even though no specific significance level was chosen for the test, with the P-value being so small (< 0.004) one would generally reject the null hypothesis that β = 0 and conclude that there is a useful linear relationship between the % unemployed and the suicide rate.
  • 50. 50 Example - Minitab Output Regression Analysis: Suicide Rate (y) versus Percentage Unemployed (x) The regression equation is Suicide Rate (y) = - 93.9 + 59.1 Percentage Unemployed (x) Predictor Coef SE Coef T P Constant -93.86 51.25 -1.83 0.100 Percenta 59.05 14.24 4.15 0.002 S = 36.06 R-Sq = 65.7% R-Sq(adj) = 61.8% T value for Model Utility Test H0: β = 0 Ha: β ≠ 0 P-value
  • 51. 51 Example – Reality Check! Although the medel utility test indicates that the model is useful, we should be a bit reticent to use the model principally as a estimation tool. Notice that s = 36.06, where the actual range of suicide rates is 235 – 71 = 164. This means to typical error in estimating the suicide rate would be approximately 22% of the range in error. With 9 of the 11 data points having suicide rates at or below 104, this would constitute a very large amount of error in the estimation. The statistics is very clear: We have established a strong positive linear relationship between percentage employed and the suicide rate. I would just not be particularly meaningful or useful to provide actual numerical estimates for suicide rates.
  • 52. 52 Residual Analysis The simple linear regression model equation is y = α + βx + e where e represents the random deviation of an observed y value from the population regression line α + βx . Key assumptions about e 1. At any particular x value, the distribution of e is a normal distribution 2. At any particular x value, the standard deviation of e is σ, which is constant over all values of x.
  • 53. 53 Residual Analysis To check on these assumptions, one would examine the deviations e1, e2, …, en. Generally, the deviations are not known, so we check on the assumptions by looking at the residuals which are the deviations from the estimated line, a + bx. The residuals are given by 1 1 1 1 2 2 2 2 n n n n ˆy y y (a bx ) ˆy y y (a bx ) ˆy y y (a bx ) − = − + − = − + − = − + M
  • 54. 54 Standardized Residuals Recall: A quantity is standardized by subtracting its mean value and then dividing by its true (or estimated) standard deviation. For the residuals, the true mean is zero (0) if the assumptions are true. ( ) i i 2 ˆy y e xx x x1 s s 1 n S − − = − − The estimated standard deviation of a residual depends on the x value. The estimated standard deviation of the ith residual, , is given byi iˆy y−
  • 55. 55 Standardized Residuals As you can see from the formula for the estimated standard deviation the calculation of the standardized residuals is a bit of a calculational nightmare. Fortunately, most statistical software packages are set up to perform these calculations and do so quite proficiently.
  • 56. 56 Standardized Residuals - Example Consider the data on percentage unemployment and suicide rates Notice that the standardized residual for Pittsburgh is -2.50, somewhat large for this size data set. City Percentage Unemployed Suicide Rate Residual Standardized Residual New York 3.0 72 83.31 -11.31 -0.34 Los Angeles 4.7 224 183.70 40.30 1.34 Chicago 3.0 82 83.31 -1.31 -0.04 Philadelphia 3.2 92 95.12 -3.12 -0.09 Detroit 3.8 104 130.55 -26.55 -0.78 Boston 2.5 71 53.78 17.22 0.55 San Francisco 4.8 235 189.61 45.39 1.56 Washington 2.7 81 65.59 15.41 0.48 Pittsburgh 4.4 86 165.99 -79.98 -2.50 St. Louis 3.1 102 89.21 12.79 0.38 Cleveland 3.5 104 112.84 -8.84 -0.26 ˆy ˆy - y
  • 57. 57 Example Pittsburgh This point has an unusually high residual
  • 58. 58 Normal Plots 500-50 2 1 0 -1 -2 NormalScore Residual Normal Probability Plot of the Residuals (response is Suicide) 2.01.51.00.50.0-0.5-1.0-1.5-2.0-2.5 2 1 0 -1 -2 NormalScore Standardized Residual Normal Probability Plot of the Residuals (response is Suicide) Notice that both of the normal plots look similar. If a software package is available to do the calculation and plots, it is preferable to look at the normal plot of the standardized residuals. In both cases, the points look reasonable linear with the possible exception of Pittsburgh, so the assumption that the errors are normally distributed seems to be supported by the sample data.
  • 59. 59 More Comments The fact that Pittsburgh has a large standardized residual makes it worthwhile to look at that city carefully to make sure the figures were reported correctly. One might also look to see if there are some reasons that Pittsburgh should be looked at separately because some other characteristic distinguishes it from all of the other cities. Pittsburgh does have a large effect on model.
  • 60. 60 2 1 0 -1 -2 x StandardizedResidual Standardized Residuals Versus x (response is y) This plot is an example of a satisfactory plot that indicates that the model assumptions are reasonable. Visual Interpretation of Standardized Residuals
  • 61. 61 This plot suggests that a curvilinear regression model is needed. 2 1 0 -1 -2 x StandardizedResidual Standardized Residuals Versus x (response is y) Visual Interpretation of Standardized Residuals
  • 62. 62 This plot suggests a non-constant variance. The assumptions of the model are not correct. 2 1 0 -1 -2 -3 x StandardizedResidual Standardized Residuals Versus x (response is y)3 Visual Interpretation of Standardized Residuals
  • 63. 63 This plot shows a data point with a large standardized residual. 2 1 0 -1 -2 -3 x StandardizedResidual Standardized Residuals Versus x (response is y) Visual Interpretation of Standardized Residuals
  • 64. 64 This plot shows a potentially influential observation. 2 1 0 -1 -2 x StandardizedResidual Standardized Residuals Versus x (response is y) Visual Interpretation of Standardized Residuals
  • 65. 65 Example - % Unemployment vs. Suicide Rate This plot of the residuals (errors) indicates some possible problems with this linear model. You can see a pattern to the points. Generally decreasing pattern to these points. Unusually large residual – clearly an influential point These two points are quite influential since they are far away from the others in terms of the % unemployed
  • 66. 66 Properties of the Sampling Distribution of a + bx for a Fixed x Value Let x* denote a particular value of the independent variable x. When the four basic assumptions of the simple linear regression model are satisfied, the sampling distribution of the statistic a + bx* has the following properties: 1. The mean value of a + bx* is α + βx*, so a + bx* is an unbiased statistic for estimating the average y value when x = x*
  • 67. 67 Properties of the Sampling Distribution of a + bx for a Fixed x Value 3. The distribution of the statistic a + bx* is normal. 2. The standard deviation of the statistic a + bx* denoted by σa+bx*, is given by ( ) 2 a bx* xx x * x1 n S + − σ = σ +
  • 68. 68 Addition Information about the Sampling Distribution of a + bx for a Fixed x Value The estimated standard deviation of the statistic a + bx*, denoted by sa+bx*, is given by ( )2 a bx* e xx x * x1 s s n S + − = + When the four basic assumptions of the simple linear regression model are satisfied, the probability distribution of the standardized variable is the t distribution with df = n - 2. a bx* a bx * ( x*) t s + + − α +β =
  • 69. 69 Confidence Interval for a Mean y Value When the four basic assumptions of the simple linear regression model are met, a confidence interval for a + bx*, the average y value when x has the value x*, is a + bx* ± (t critical value)sa+bx* Where the t critical value is based on df = n -2. Many authors give the following equivalent form for the confidence interval. 2 e xx 1 (x * x) a bx * (t critical value)s n S − + ± +
  • 70. 70 Confidence Interval for a Single y Value When the four basic assumptions of the simple linear regression model are met, a prediction interval for y*, a single y observation made when x has the value x*, has the form Where the t critical value is based on df = n -2. 2 2 e a bx*a bx * (t critical value) s s ++ ± + Many authors give the following equivalent form for the prediction interval. 2 e xx 1 (x * x) a bx * (t critical value)s 1 n S − + ± + +
  • 71. 71 Example - Mean Annual Temperature vs. Mortality Data was collected in certain regions of Great Britain, Norway and Sweden to study the relationship between the mean annual temperature and the mortality rate for a specific type of breast cancer in women. * Lea, A.J. (1965) New Observations on distribution of neoplasms of female breast in certain European countries. British Medical Journal, 1, 488-490 Mean Annual Temperature (F°) 51.3 49.9 50.0 49.2 48.5 47.8 47.3 45.1 Mortality Index 102.5 104.5 100.4 95.9 87.0 95.0 88.6 89.2 Mean Annual Temperature (F°) 46.3 42.1 44.2 43.5 42.3 40.2 31.8 34.0 Mortality Index 78.9 84.6 81.7 72.2 65.1 68.1 67.3 52.5
  • 72. 72 Example - Mean Annual Temperature vs. Mortality Regression Analysis: Mortality index versus Mean annual temperature The regression equation is Mortality index = - 21.8 + 2.36 Mean annual temperature Predictor Coef SE Coef T P Constant -21.79 15.67 -1.39 0.186 Mean ann 2.3577 0.3489 6.76 0.000 S = 7.545 R-Sq = 76.5% R-Sq(adj) = 74.9% Analysis of Variance Source DF SS MS F P Regression 1 2599.5 2599.5 45.67 0.000 Residual Error 14 796.9 56.9 Total 15 3396.4 Unusual Observations Obs Mean ann Mortalit Fit SE Fit Residual St Resid 15 31.8 67.30 53.18 4.85 14.12 2.44RX R denotes an observation with a large standardized residual X denotes an observation whose X value gives it large influence.
  • 73. 73 Example - Mean Annual Temperature vs. Mortality 504030 100 90 80 70 60 50 Mean annual Mortalityin S = 7.54466 R-Sq = 76.5 % R-Sq(adj) = 74.9 % Mortality in = -21.7947 + 2.35769 Mean annual Regression Plot The point has a large standardized residual and is influential because of the low Mean Annual Temperature.
  • 74. 74 Example - Mean Annual Temperature vs. Mortality Predicted Values for New Observations New Obs Fit SE Fit 95.0% CI 95.0% PI 1 53.18 4.85 ( 42.79, 63.57) ( 33.95, 72.41) X 2 60.72 3.84 ( 52.48, 68.96) ( 42.57, 78.88) 3 72.51 2.48 ( 67.20, 77.82) ( 55.48, 89.54) 4 83.34 1.89 ( 79.30, 87.39) ( 66.66, 100.02) 5 96.09 2.67 ( 90.37, 101.81) ( 78.93, 113.25) 6 99.16 3.01 ( 92.71, 105.60) ( 81.74, 116.57) X denotes a row with X values away from the center Values of Predictors for New Observations New Obs Mean ann 1 31.8 2 35.0 3 40.0 4 44.6 5 50.0 6 51.3 These are the x* values for which the above fits, standard errors of the fits, 95% confidence intervals for Mean y values and prediction intervals for y values given above.
  • 75. 75 504030 120 110 100 90 80 70 60 50 40 30 Mean annual Mortalityin S = 7.54466 R-Sq = 76.5 % R-Sq(adj) = 74.9 % Mortality in = -21.7947 + 2.35769 Mean annual 95% PI 95% CI Regression Regression Plot Example - Mean Annual Temperature vs. Mortality 95% prediction interval for single y value at x = 45. (67.62,100.98) 95% confidence interval for Mean y value at x = 40. (67.20, 77.82)
  • 76. 76 A Test for Independence in a Bivariate Normal Population Null hypothesis: H0: ρ = 0 Assumption: r is the correlation coefficient for a random sample from a bivariate normal population. Test statistic: The t critical value is based on df = n - 2 2 r t 1 r n 2 = − −
  • 77. 77 A Test for Independence in a Bivariate Normal Population Alternate hypothesis: H0: ρ > 0 (Positive dependence): P-value is the area under the appropriate t curve to the right of the computed t. Alternate hypothesis: H0: ρ < 0 (Negative dependence): P-value is the area under the appropriate t curve to the right of the computed t. Alternate hypothesis: H0: ρ ≠ 0 (Dependence): P-value is i. twice the area under the appropriate t curve to the left of the computed t value if t < 0 and ii. twice the area under the appropriate t curve to the right of the computed t value if t > 0
  • 78. 78 Example Recall the data from the study of %Fat vs. Age for humans. There are 18 data points and a quick calculation of the Pierson correlation coefficient gives r = 0.79209. We will test to see if there is a dependence at the 0.05 significance level. Age (x) % Fat y x2 xy 23 9.5 529 218.5 23 27.9 529 641.7 27 7.8 729 210.6 27 17.8 729 480.6 39 31.4 1521 1224.6 41 25.9 1681 1061.9 45 27.4 2025 1233 49 25.2 2401 1234.8 50 31.1 2500 1555 53 34.7 2809 1839.1 53 42 2809 2226 54 29.1 2916 1571.4 56 32.5 3136 1820 57 30.3 3249 1727.1 58 33 3364 1914 58 33.8 3364 1960.4 60 41.1 3600 2466 61 34.5 3721 2104.5
  • 79. 79 Example 1. ρ = the correlation between % fat and age in the population from which the sample was selected 2. H0: ρ = 0 3. Ha: ρ ≠ 0 4. α = 0.05 5. Test statistic: 2 r t , df n 2 1 r n 2 = = − − −
  • 80. 80 Example 6. Looking at the two normal plots, we can see it is not reasonable to assume that either the distribution of age nor the distribution of % fat are normal. (Notice, the data points deviate from a linear pattern quite substantially. Since neither is normal, we shall not continue with the test. P-Value: 0.011 A-Squared: 0.980 Anderson-Darling Normality Test N: 18 StDev: 13.2176 Average: 46.3333 6252423222 .999 .99 .95 .80 .50 .20 .05 .01 .001 Probability Age (x) Normal Probability Plot P-Value: 0.032 A-Squared: 0.796 Anderson-Darling Normality Test N: 18 StDev: 9.14439 Average: 28.6111 40302010 .999 .99 .95 .80 .50 .20 .05 .01 .001 Probability % Fat y Normal Probability Plot
  • 81. 81 Another Example Height vs. Joint Length The professor in an elementary statistics class wanted to explain correlation so he needed some bivariate data. He asked his class (presumably a random or representative sample of late adolescent humans) to measure the length of the metacarpal bone on the index finger of the right hand (in cm) and height (in ft). The data are provided on the next slide.
  • 82. 82 Example - Height vs. Joint Length There are 17 data points and a quick calculation of the Pierson correlation coefficient gives r = 0.74908. We will test to see if the true population correlation coefficient is positive at the 0.05 level of significance. Joint length 3.5 3.4 3.4 2.7 3.5 3.5 4.2 4.0 3.0 Height 64 68.5 69 64 68 73 72 75 70 Joint length 3.4 2.9 3.5 3.5 2.8 4.0 3.8 3.3 Height 68.5 65 67 70 65 75 70 66
  • 83. 83 1. ρ = the true correlation between height and right index finger metacarpal joint in the population from which the sample was selected 2. H0: ρ = 0 3. Ha: ρ > 0 4. α = 0.05 Example - Height vs. Joint Length 5. Test statistic: 2 r t , df n 2 1 r n 2 = = − − −
  • 84. 84 6. Looking at the two normal plots, we can see it is reasonable to assume that the distribution of age and the distribution of % fat are both normal. (Notice, the data points follow a reasonably linear pattern. This appears to confirm the assumption that the sample is from a bivariate normal distribution. We will assume that the class was a random sample of young adults. Example - Height vs. Joint Length P-Value: 0.557 A-Squared: 0.294 Anderson-Darling Normality Test N: 17 StDev: 3.49974 Average: 68.8235 757065 .999 .99 .95 .80 .50 .20 .05 .01 .001 Probability Height Normal Probability Plot P-Value: 0.156 A-Squared: 0.524 Anderson-Darling Normality Test N: 17 StDev: 0.419734 Average: 3.43529 4.03.53.0 .999 .99 .95 .80 .50 .20 .05 .01 .001 Probability Joint Normal Probability Plot
  • 85. 85 Example - Height vs. Joint Length 8. P-value: Looking on the table of tail areas for t curves under 15 degrees of freedom, 4.379 is off the bottom of the table, so P-value < 0.001. Minitab reports the P-value to be 0.001. 9. Conclusion: The P-value is smaller than α = 0.05, so we can reject H0. We can conclude that the true population correlation coefficient is greater then 0. I.e., the metacarpal bone is longer for taller people. 7. Calculation: 2 2 r 0.74908 t 4.379 1 r 1 (0.74908) n 2 17 2 = = = − − − −

Hinweis der Redaktion

  1. &amp;lt;number&amp;gt;
  2. &amp;lt;number&amp;gt;
  3. &amp;lt;number&amp;gt;
  4. &amp;lt;number&amp;gt;