The document discusses simple linear regression and correlation methods. It defines deterministic and probabilistic models for describing the relationship between two variables. A simple linear regression model assumes a population regression line with intercept $\alpha$ and slope $\beta$, where observations may deviate from the line by some random error e. Key assumptions of the model are that e has a normal distribution with mean 0 and constant variance across values of x, and that errors are independent. The slope b of the least squares line estimates $\beta$, the average change in y per unit change in x.
5. Simple Linear Regression Model The simple linear regression model assumes that there is a line with vertical or y intercept $\alpha$ and slope $\beta$, called the true or population regression line. When a value of the independent variable x is fixed and an observation on the dependent variable y is made, $y = \alpha + \beta x + e$. Without the random deviation e, all observed (x, y) points would fall exactly on the population regression line. The inclusion of e in the model equation allows points to deviate from the line by random amounts.
6. Simple Linear Regression Model [Figure: the population regression line with vertical intercept $\alpha$ and slope $\beta$, showing observations made at $x = x_1$ and $x = x_2$ together with their deviations from the line.]
8. More About the Simple Linear Regression Model For any fixed x value, (mean y value for fixed x) = $\alpha + \beta x$ and (standard deviation of y for fixed x) = $\sigma$. For any fixed x value, y itself has a normal distribution.
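11. Estimates for the Regression Line The point estimates of $\beta$, the slope, and $\alpha$, the y intercept of the population regression line, are the slope and y intercept, respectively, of the least squares line. That is, $b = S_{xy}/S_{xx}$ and $a = \bar{y} - b\bar{x}$, where $S_{xy} = \sum (x - \bar{x})(y - \bar{y})$ and $S_{xx} = \sum (x - \bar{x})^2$.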
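The least squares estimates on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `least_squares_line` and the small data set are hypothetical and are not the %Fat data used in the slides.

```python
from statistics import mean


def least_squares_line(x, y):
    """Return (a, b), the y intercept and slope of the least squares line."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    # S_xy = sum of (x - x_bar)(y - y_bar); S_xx = sum of (x - x_bar)^2
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("all x values are identical, so the slope is undefined")
    b = s_xy / s_xx          # estimated slope
    a = y_bar - b * x_bar    # estimated y intercept
    return a, b


# Hypothetical illustrative data (an assumption, not taken from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = least_squares_line(x, y)
print(f"y-hat = {a:.3f} + {b:.3f} x")
```

A predicted value at a new x is then simply `a + b * x`.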
13. Example The following data were collected in a study of age and fatness in humans.* One of the questions was, “What is the relationship between age and fatness?” * Mazess, R.B., Peppler, W.W., and Gibbons, M. (1984) Total body composition by dual-photon (153Gd) absorptiometry. American Journal of Clinical Nutrition, 40, 834-839.
17. Example Using the least squares line $\hat{y} = 3.22 + 0.548x$ for these data, a point estimate for the %Fat for a human who is 45 years old is $\hat{y} = 3.22 + 0.548(45) = 27.88\%$. If 45 is put into the equation for x, we have either an estimated %Fat for a single 45-year-old human or an estimated average %Fat for all 45-year-old humans. The two interpretations are quite different.
18. Example A plot of the data points along with the least squares regression line created with Minitab is given to the right.
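20. Definition formulae The total sum of squares, denoted by SSTo, is defined as $SSTo = \sum (y - \bar{y})^2$. The residual sum of squares, denoted by SSResid, is defined as $SSResid = \sum (y - \hat{y})^2$, where $\hat{y} = a + bx$ is the value predicted by the least squares line.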
21. Calculation Formulae SSTo and SSResid are generally found as part of the standard output from most statistical packages, or can be obtained using the following computational formulas: $SSTo = \sum y^2 - \frac{(\sum y)^2}{n}$ and $SSResid = \sum y^2 - a\sum y - b\sum xy$.
24. Estimated Standard Deviation, $s_e$ The estimate of $\sigma$ is the estimated standard deviation $s_e = \sqrt{SSResid/(n - 2)}$. The number of degrees of freedom associated with estimating $\sigma^2$ or $\sigma$ in simple linear regression is n - 2.
27. Example continued With $r^2 = 0.627$ or 62.7%, we can say that 62.7% of the observed variation in %Fat can be attributed to the probabilistic linear relationship with human age. The magnitude of a typical sample deviation from the least squares line is about 5.75(%), which is reasonably large compared to the y values themselves. This suggests that the model is only useful in the sense of providing gross “ballpark” estimates for %Fat for humans based on age.
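As a quick check, both summary values quoted for this example can be recomputed from the sums of squares in the Minitab output for the %Fat data (SSResid = 529.66, SSTo = 1421.54, n = 18). The variable names below are illustrative only.

```python
import math

# Values read from the Minitab output for the %Fat vs. Age example.
ss_resid = 529.66   # residual sum of squares (Residual Error row)
ss_to = 1421.54     # total sum of squares (Total row)
n = 18              # number of observations, so residual df = n - 2 = 16

# Coefficient of determination: proportion of variation in y explained by the line.
r_squared = 1 - ss_resid / ss_to

# Estimated standard deviation about the least squares line.
s_e = math.sqrt(ss_resid / (n - 2))

print(f"r^2 = {r_squared:.3f}")
print(f"s_e = {s_e:.3f}")
```

The results agree with the R-Sq and S values reported in the output after rounding.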
29. Estimated Standard Deviation of b The estimated standard deviation of the statistic b is $s_b = s_e/\sqrt{S_{xx}}$, where $S_{xx} = \sum (x - \bar{x})^2$. When the four basic assumptions of the simple linear regression model are satisfied, the probability distribution of the standardized variable $t = (b - \beta)/s_b$ is the t distribution with df = n - 2.
30. Confidence interval for $\beta$ When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for $\beta$, the slope of the population regression line, has the form $b \pm (\text{t critical value}) \cdot s_b$, where the t critical value is based on df = n - 2.
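32. Example continued A 95% confidence interval estimate for $\beta$ is $b \pm (\text{t critical value}) \cdot s_b = 0.548 \pm (2.12)(0.1056) = 0.548 \pm 0.224 = (0.324, 0.772)$. Based on the sample data, we are 95% confident that the true mean increase in %Fat associated with a year of age is between 0.324% and 0.772%.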
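The interval for the slope can be reproduced with a short sketch. The estimates b = 0.5480 and s_b = 0.1056 come from the Minitab output for the %Fat example; the critical value 2.120 is the standard t-table entry for a 95% interval with 16 degrees of freedom and is hard-coded here as an assumption rather than computed. The function name is hypothetical.

```python
def slope_confidence_interval(b, s_b, t_crit):
    """Return (lower, upper) for the interval b +/- t_crit * s_b."""
    margin = t_crit * s_b
    return b - margin, b + margin


b = 0.5480       # estimated slope from the Minitab output
s_b = 0.1056     # estimated standard deviation of b (SE Coef)
t_crit = 2.120   # t-table value for 95% confidence, df = 16 (assumed, not computed)

lower, upper = slope_confidence_interval(b, s_b, t_crit)
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```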
33. Example continued Minitab output looks like:

Regression Analysis: % Fat y versus Age (x)

The regression equation is
% Fat y = 3.22 + 0.548 Age (x)

Predictor      Coef   SE Coef      T      P
Constant      3.221     5.076   0.63  0.535
Age (x)      0.5480    0.1056   5.19  0.000

S = 5.754   R-Sq = 62.7%   R-Sq(adj) = 60.4%

Analysis of Variance
Source           DF       SS      MS      F      P
Regression        1   891.87  891.87  26.94  0.000
Residual Error   16   529.66   33.10
Total            17  1421.54

In this output, the Coef column gives the estimated y intercept a (3.221) and the estimated slope b (0.5480), the regression equation is the least squares regression line, the Residual Error row gives the residual df = n - 2 (16) and SSResid (529.66), and the Total row gives SSTo (1421.54).
38. Hypothesis Tests Concerning $\beta$ Quite often the test is performed with the hypotheses $H_0: \beta = 0$ vs. $H_a: \beta \neq 0$. This particular form of the test is called the model utility test for simple linear regression. The null hypothesis specifies that there is no useful linear relationship between x and y, whereas the alternative hypothesis specifies that there is a useful linear relationship between x and y. The test statistic simplifies to $t = b/s_b$ and is called the t ratio.
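A minimal sketch of the model utility test, applied to the %Fat vs. Age example: b and s_b are taken from the Minitab output for that example, and the two-sided 0.05 critical value for df = 16 (2.120) is a t-table value hard-coded as an assumption. The helper name `t_ratio` is illustrative.

```python
def t_ratio(b, s_b):
    """Test statistic for H0: beta = 0, i.e. (b - 0) / s_b."""
    return b / s_b


b = 0.5480       # estimated slope for the %Fat example
s_b = 0.1056     # estimated standard deviation of b
t_crit = 2.120   # two-sided 0.05 critical value, df = 16 (assumed table value)

t = t_ratio(b, s_b)
reject_h0 = abs(t) > t_crit  # two-sided test: reject when |t| exceeds the critical value

print(f"t ratio = {t:.2f}, reject H0: {reject_h0}")
```

The t ratio matches the T value reported for the slope in the Minitab output, and the null hypothesis of no useful linear relationship is rejected at the 0.05 level.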
39. Example Consider the following data on percentage unemployment and suicide rates.* * Smith, D. (1977) Patterns in Human Geography, Canada: Douglas David and Charles Ltd., 158.
50. Example - Minitab Output

Regression Analysis: Suicide Rate (y) versus Percentage Unemployed (x)

The regression equation is
Suicide Rate (y) = - 93.9 + 59.1 Percentage Unemployed (x)

Predictor      Coef   SE Coef      T      P
Constant     -93.86     51.25  -1.83  0.100
Percenta      59.05     14.24   4.15  0.002

S = 36.06   R-Sq = 65.7%   R-Sq(adj) = 61.8%

The T value in the Percenta row (4.15) is the t ratio for the model utility test of $H_0: \beta = 0$ vs. $H_a: \beta \neq 0$, and the P column gives its P-value (0.002).
52. Residual Analysis To check the assumptions of the model, one would examine the deviations $e_1, e_2, \ldots, e_n$. Generally, the deviations are not known, so we check the assumptions by looking at the residuals, which are the deviations from the estimated line, a + bx. The residuals are given by $y_i - \hat{y}_i = y_i - (a + bx_i)$ for $i = 1, 2, \ldots, n$.
53. Standardized Residuals Recall: A quantity is standardized by subtracting its mean value and then dividing by its true (or estimated) standard deviation. For the residuals, the true mean is zero (0) if the assumptions are true. The estimated standard deviation of a residual depends on the x value. The estimated standard deviation of the i-th residual, $y_i - \hat{y}_i$, is given by $s_e\sqrt{1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{S_{xx}}}$, where $S_{xx} = \sum (x - \bar{x})^2$.
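The calculation of standardized residuals can be sketched as follows, assuming the usual textbook formula in which the i-th residual is divided by $s_e\sqrt{1 - 1/n - (x_i - \bar{x})^2/S_{xx}}$. The function name and the data are hypothetical; the slides rely on a statistical package for this step.

```python
import math
from statistics import mean


def standardized_residuals(x, y):
    """Return (residuals, standardized residuals) for the least squares line."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need equal-length x and y with at least 3 points")
    x_bar, y_bar = mean(x), mean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    # Residual = observed y minus the value predicted by the fitted line.
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s_e = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    standardized = []
    for xi, ri in zip(x, residuals):
        # The estimated standard deviation of a residual depends on x_i.
        # (This sketch does not guard against a perfect fit, where s_e is 0.)
        sd_i = s_e * math.sqrt(1 - 1 / n - (xi - x_bar) ** 2 / s_xx)
        standardized.append(ri / sd_i)
    return residuals, standardized


# Hypothetical illustrative data (an assumption, not from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
residuals, standardized = standardized_residuals(x, y)
for xi, ri, zi in zip(x, residuals, standardized):
    print(f"x = {xi:.1f}  residual = {ri:+.3f}  standardized = {zi:+.3f}")
```

Standardized residuals with magnitude well above 2 are the ones that usually deserve a closer look.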
54. Standardized Residuals As you can see from the formula for the estimated standard deviation, the calculation of the standardized residuals is a bit of a computational nightmare. Fortunately, most statistical software packages are set up to perform these calculations and do so quite proficiently.
55. Standardized Residuals - Example Consider the data on percentage unemployment and suicide rates. Notice that the standardized residual for Pittsburgh is -2.50, which is somewhat large for a data set of this size.
57. Normal Plots Notice that both of the normal plots look similar. If a software package is available to do the calculations and plots, it is preferable to look at the normal plot of the standardized residuals. In both cases, the points look reasonably linear with the possible exception of Pittsburgh, so the assumption that the errors are normally distributed seems to be supported by the sample data.
58. More Comments The fact that Pittsburgh has a large standardized residual makes it worthwhile to look at that city carefully to make sure the figures were reported correctly. One might also look to see whether there are reasons that Pittsburgh should be treated separately because some other characteristic distinguishes it from all of the other cities. Pittsburgh does have a large effect on the model.
59. Visual Interpretation of Standardized Residuals This plot is an example of a satisfactory plot that indicates that the model assumptions are reasonable.
60. Visual Interpretation of Standardized Residuals This plot suggests that a curvilinear regression model is needed.
61. Visual Interpretation of Standardized Residuals This plot suggests a non-constant variance. The assumptions of the model are not correct.
62. Visual Interpretation of Standardized Residuals This plot shows a data point with a large standardized residual.
63. Visual Interpretation of Standardized Residuals This plot shows a potentially influential observation.
64. Example - % Unemployment vs. Suicide Rate This plot of the residuals (errors) indicates some possible problems with this linear model. You can see a pattern to the points: some of them show a generally decreasing pattern, and one point has an unusually large residual. Two points are quite influential since they are far away from the others in terms of the % unemployed.
67. Additional Information about the Sampling Distribution of a + bx for a Fixed x Value The estimated standard deviation of the statistic a + bx*, denoted by $s_{a+bx^*}$, is given by $s_{a+bx^*} = s_e\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$. When the four basic assumptions of the simple linear regression model are satisfied, the probability distribution of the standardized variable $t = \frac{a + bx^* - (\alpha + \beta x^*)}{s_{a+bx^*}}$ is the t distribution with df = n - 2.
68. Confidence Interval for a Mean y Value When the four basic assumptions of the simple linear regression model are met, a confidence interval for $\alpha + \beta x^*$, the average y value when x has the value x*, is $a + bx^* \pm (\text{t critical value}) \cdot s_{a+bx^*}$, where the t critical value is based on df = n - 2. Many authors give the following equivalent form for the confidence interval: $a + bx^* \pm (\text{t critical value}) \cdot s_e\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$.
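69. Prediction Interval for a Single y Value When the four basic assumptions of the simple linear regression model are met, a prediction interval for y*, a single y observation made when x has the value x*, has the form $a + bx^* \pm (\text{t critical value}) \cdot \sqrt{s_e^2 + s_{a+bx^*}^2}$, where the t critical value is based on df = n - 2. Many authors give the following equivalent form for the prediction interval: $a + bx^* \pm (\text{t critical value}) \cdot s_e\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$.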
70. Example - Mean Annual Temperature vs. Mortality Data were collected in certain regions of Great Britain, Norway and Sweden to study the relationship between the mean annual temperature and the mortality rate for a specific type of breast cancer in women.* * Lea, A.J. (1965) New Observations on distribution of neoplasms of female breast in certain European countries. British Medical Journal, 1, 488-490.
71. Example - Mean Annual Temperature vs. Mortality

Regression Analysis: Mortality index versus Mean annual temperature

The regression equation is
Mortality index = - 21.8 + 2.36 Mean annual temperature

Predictor      Coef   SE Coef      T      P
Constant     -21.79     15.67  -1.39  0.186
Mean ann     2.3577    0.3489   6.76  0.000

S = 7.545   R-Sq = 76.5%   R-Sq(adj) = 74.9%

Analysis of Variance
Source           DF      SS      MS      F      P
Regression        1  2599.5  2599.5  45.67  0.000
Residual Error   14   796.9    56.9
Total            15  3396.4

Unusual Observations
Obs  Mean ann  Mortalit    Fit  SE Fit  Residual  St Resid
 15      31.8     67.30  53.18    4.85     14.12    2.44RX

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.
72. Example - Mean Annual Temperature vs. Mortality The point has a large standardized residual and is influential because of the low Mean Annual Temperature.
73. Example - Mean Annual Temperature vs. Mortality

Predicted Values for New Observations
New Obs    Fit  SE Fit         95.0% CI           95.0% PI
      1  53.18    4.85  (42.79,  63.57)  (33.95,  72.41) X
      2  60.72    3.84  (52.48,  68.96)  (42.57,  78.88)
      3  72.51    2.48  (67.20,  77.82)  (55.48,  89.54)
      4  83.34    1.89  (79.30,  87.39)  (66.66, 100.02)
      5  96.09    2.67  (90.37, 101.81)  (78.93, 113.25)
      6  99.16    3.01  (92.71, 105.60)  (81.74, 116.57)

X denotes a row with X values away from the center.

Values of Predictors for New Observations
New Obs  Mean ann
      1      31.8
      2      35.0
      3      40.0
      4      44.6
      5      50.0
      6      51.3

The values of the predictor are the x* values for which the fits, standard errors of the fits, 95% confidence intervals for mean y values, and 95% prediction intervals for single y values are given in the first table.
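The confidence and prediction intervals reported by Minitab can be approximately reproduced from the fit, its standard error, and $s_e$. The sketch below uses the row for x* = 40.0 (Fit = 72.51, SE Fit = 2.48) and S = 7.545 from the regression output for this example; the critical value 2.145 is the t-table entry for 95% confidence with df = 14, hard-coded as an assumption. The function name is hypothetical, and small differences from Minitab come from rounding of the inputs.

```python
import math


def mean_and_prediction_intervals(fit, se_fit, s_e, t_crit):
    """Return (confidence interval for the mean y, prediction interval for one y)."""
    ci_margin = t_crit * se_fit
    # A single observation adds the error variance s_e^2 to the variance of the fit.
    pi_margin = t_crit * math.sqrt(s_e ** 2 + se_fit ** 2)
    conf_int = (fit - ci_margin, fit + ci_margin)
    pred_int = (fit - pi_margin, fit + pi_margin)
    return conf_int, pred_int


conf_int, pred_int = mean_and_prediction_intervals(
    fit=72.51,     # a + b x* at x* = 40.0 (from the Minitab table)
    se_fit=2.48,   # estimated standard deviation of a + b x*
    s_e=7.545,     # S from the regression output
    t_crit=2.145,  # 95% t critical value, df = 14 (assumed table value)
)
print(f"95% CI: ({conf_int[0]:.2f}, {conf_int[1]:.2f})")
print(f"95% PI: ({pred_int[0]:.2f}, {pred_int[1]:.2f})")
```

The prediction interval is always wider than the confidence interval at the same x*, because it accounts for the variability of an individual observation as well as the uncertainty in the fitted line.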
74. Example - Mean Annual Temperature vs. Mortality 95% prediction interval for a single y value at x = 45: (67.62, 100.98). 95% confidence interval for the mean y value at x = 40: (67.20, 77.82).
75. A Test for Independence in a Bivariate Normal Population Null hypothesis: $H_0: \rho = 0$. Assumption: r is the correlation coefficient for a random sample from a bivariate normal population. Test statistic: $t = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}}$. The t critical value is based on df = n - 2.
77. Example Recall the data from the study of %Fat vs. Age for humans. There are 18 data points and a quick calculation of the Pearson correlation coefficient gives r = 0.79209. We will test to see if there is a dependence at the 0.05 significance level.
80. Another Example Height vs. Joint Length The professor in an elementary statistics class wanted to explain correlation so he needed some bivariate data. He asked his class (presumably a random or representative sample of late adolescent humans) to measure the length of the metacarpal bone on the index finger of the right hand (in cm) and height (in ft). The data are provided on the next slide.
81. Example - Height vs. Joint Length There are 17 data points and a quick calculation of the Pearson correlation coefficient gives r = 0.74908. We will test to see if the true population correlation coefficient is positive at the 0.05 level of significance.
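The test statistic for both correlation examples can be sketched as follows, assuming the standard statistic $t = r/\sqrt{(1 - r^2)/(n - 2)}$ with df = n - 2. The values of r and n are those quoted in the slides; the critical values (2.120 for a two-sided test with df = 16, and 1.753 for an upper-tailed test with df = 15, both at the 0.05 level) are t-table entries hard-coded as assumptions. The function name is hypothetical.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for testing H0: rho = 0 from a sample correlation r."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    if n < 3:
        raise ValueError("need at least 3 observations")
    return r / math.sqrt((1 - r ** 2) / (n - 2))


# %Fat vs. Age: two-sided test for dependence (Ha: rho != 0).
t_fat = correlation_t_statistic(r=0.79209, n=18)
reject_fat = abs(t_fat) > 2.120     # assumed t-table value, df = 16

# Height vs. joint length: upper-tailed test (Ha: rho > 0).
t_height = correlation_t_statistic(r=0.74908, n=17)
reject_height = t_height > 1.753    # assumed t-table value, df = 15

print(f"%Fat vs. Age:     t = {t_fat:.2f}, reject H0: {reject_fat}")
print(f"Height vs. joint: t = {t_height:.2f}, reject H0: {reject_height}")
```

For the %Fat data, the statistic agrees with the t ratio for the slope in the Minitab regression output, as expected in simple linear regression, where the test of $\rho = 0$ and the model utility test are equivalent.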