Concepts of Correlation, Simple Linear Regression, and Multiple Linear Regression, their analysis using SPSS, and how to check the validity of the assumptions in regression.
2. Correlation
Correlation analysis is used to study the strength of
the relationship between two or more quantitative
variables. Correlation shows the degree of linear
dependence between two variables.
Correlation doesn’t imply causation.
If variables are not related by a cause-and-effect
relationship but still show correlation, such
correlation is called spurious or nonsense
correlation.
3. Correlation
Correlation can be positive, negative, or zero,
depending on how the two variables change together.
If the two variables change in the same direction,
the correlation is positive.
If the two variables change in opposite directions,
the correlation is negative.
If a change in one variable does not affect the
other variable, the correlation is zero.
4. Correlation
Coefficient
The correlation coefficient (r) measures the extent
of linear correlation between two variables.
There are several types of correlation coefficients,
but the most popular is Karl Pearson's correlation
coefficient.
5. Testing
Correlation
Coefficient
Null Hypothesis H0: 𝜌 = 0
[There is no significant linear correlation between two variables]
Alternative Hypothesis H1: 𝜌≠ 0
[There is significant linear correlation between two variables]
Test statistic: 𝑡 = 𝑟√(𝑛 − 2) / √(1 − 𝑟²)
The test statistic t follows Student's t distribution with 𝒏 − 𝟐
degrees of freedom.
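As an illustration outside SPSS, this test statistic can be computed directly from r and n; the values used below (r = 0.448, n = 100) are the ones that appear in the case study that follows.

```python
import math

def corr_t_stat(r, n):
    """t statistic for H0: rho = 0, compared with a t distribution on n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Case study values: r = 0.448, n = 100
t = corr_t_stat(0.448, 100)
print(round(t, 2))
```

Comparing t ≈ 4.96 with the t distribution on 98 degrees of freedom gives a p-value well below 0.05, matching the SPSS output reported later.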
6. Case Study
The body temperatures (in °F) of 100 adults were measured along with
their gender, age, and heart rate. Data: body_temp.xlsx .
Obtain correlation coefficient between body temperature and heart rate.
Also check its significance.
7. Null & Alternative
Hypothesis
Null Hypothesis H0: 𝜌 = 0
[There is no significant linear correlation between body
temperature and heart rate]
Alternative Hypothesis H1: 𝜌≠ 0
[There is significant linear correlation between body temperature
and heart rate]
10. Test Statistic t
and p-value
The correlation coefficient (r) between heart rate
and temperature is 0.448.
Here the p-value = 0.000 < 0.05, so the null hypothesis is rejected.
Thus, there is a significant linear correlation between heart rate
and temperature.
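Pearson's r itself can be sketched in plain Python. The readings below are hypothetical stand-ins, since the actual analysis uses body_temp.xlsx in SPSS.

```python
import math

def pearson_r(x, y):
    """Karl Pearson's correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical readings (the real analysis uses body_temp.xlsx)
heart_rate = [65, 70, 72, 75, 78, 80, 82, 85]
temperature = [97.8, 98.0, 98.1, 98.4, 98.3, 98.6, 98.5, 98.9]
r = pearson_r(heart_rate, temperature)
```

On a perfectly linear pair, the function returns exactly ±1; on the synthetic readings above it returns a strong positive r.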
11. Regression
Regression analysis is a set of statistical processes
for estimating the relationships between
a dependent variable (often called the 'outcome' or
'response' variable) and one or more independent
variables (often called 'predictors', 'covariates',
'explanatory variables' or 'features’).
12. Regression
Analysis
Regression analysis helps you understand how the
dependent variable changes when one of the
independent variables varies, and allows you to
determine mathematically which of those
variables really has an impact.
Regression analysis includes several variations,
such as linear, multiple linear, and nonlinear. The
most common models are simple linear and
multiple linear.
13. Types of Regression
Dependent variable | Independent variable(s) | Type of Regression | Relationship between variables
One (Scale) | One (Scale) | Simple Linear | Linear
One (Scale) | Two or more (Continuous / Categorical) | Multiple Linear | Linear
One (Categorical – binary) | Two or more (Continuous / Categorical) | Logistic | Need not be linear
One (Categorical) | Two or more (Continuous / Categorical) | Multinomial Logistic | Need not be linear
14. Simple
Regression
The simple linear regression model is used to predict one
response (dependent) variable based on one predictor
(independent) variable.
The linear regression model can be stated as follows
𝑦𝑖 = 𝛽0 + 𝛽1𝑥𝑖 + 𝑒𝑖 , 𝑖 = 1, 2, · · · , n.
where
• 𝑦𝑖 is the value of the response variable,
• 𝑥𝑖 is the value of the predictor variable,
• 𝛽0, 𝛽1 are the parameters (regression coefficients),
• 𝑒𝑖 is the random error term with E(𝑒𝑖) = 0 and V(𝑒𝑖) = 𝜎².
15. Graphical representation
[Figure: scatter plot of Y against X with the fitted line 𝑦 = 𝛽0 + 𝛽1𝑥 (intercept 𝛽0, slope 𝛽1); for a given 𝑋𝑖, the random error 𝜀𝑖 is the vertical gap between the observed value of Y and the predicted value of Y.]
16. Assumptions of
Simple
Regression
The four important assumptions for a simple linear
regression model are :
• The regression model is Linear in parameter.
• The errors are Independently distributed.
• The errors are Normally distributed.
• The errors have equal variances, i.e. V(𝑒𝑖) = 𝜎² (Homoscedasticity).
17. Method
The best line of fit can be obtained by the method of
least squares. It calculates the best line of fit for the
observed data by minimizing the sum of squares of the
vertical deviations from each data point to the line,
i.e., Σ(𝑦𝑖 − ŷ𝑖)².
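A minimal sketch of this least-squares computation in plain Python (the slides themselves use SPSS); the toy data are made up so the fit can be checked by eye.

```python
def least_squares_fit(x, y):
    """Minimize the sum of squared vertical deviations; returns (b0, b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx      # slope
    b0 = my - b1 * mx   # intercept
    return b0, b1

# Toy data lying exactly on y = 1 + 2x, so the fit should recover those values
b0, b1 = least_squares_fit([1, 2, 3, 4], [3, 5, 7, 9])
```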
18. Measures of Variations
Total variation is made up of two parts: SST = SSR + SSE
• SST = Σ(𝑌𝑖 − Ȳ)² : Total Sum of Squares (Total Variation) – measures the variation of the 𝑌𝑖 values around their mean Ȳ.
• SSR = Σ(Ŷ𝑖 − Ȳ)² : Regression Sum of Squares (Explained Variation) – variation attributable to the relationship between X and Y.
• SSE = Σ(𝑌𝑖 − Ŷ𝑖)² : Error Sum of Squares (Unexplained Variation) – variation in Y attributable to factors other than X.
where Ȳ = mean value of the dependent variable, 𝑌𝑖 = observed value of the dependent variable, and Ŷ𝑖 = predicted value of Y for the given 𝑋𝑖.
19. Measures of Variations
[Figure: for a data point (𝑋𝑖, 𝑌𝑖), the vertical distances to the mean Ȳ and to the fitted line Ŷ illustrate SST = Σ(𝑌𝑖 − Ȳ)², SSE = Σ(𝑌𝑖 − Ŷ𝑖)², and SSR = Σ(Ŷ𝑖 − Ȳ)².]
20. Coefficient of Determination
The coefficient of determination is the portion of the total variation in the
dependent variable that is explained by variation in the independent variable.
It is denoted R² and given by
R² = SSR / SST = Regression sum of squares / Total sum of squares
Note: 0 ≤ R² ≤ 1.
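The decomposition SST = SSR + SSE and the ratio R² = SSR/SST can be verified numerically. A sketch in plain Python with made-up data:

```python
def variation_decomposition(x, y):
    """Fit the least-squares line and return (sst, ssr, sse, r2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    b0 = my - b1 * mx
    yhat = [b0 + b1 * a for a in x]
    sst = sum((b - my) ** 2 for b in y)            # total variation
    ssr = sum((h - my) ** 2 for h in yhat)         # explained variation
    sse = sum((b - h) ** 2 for b, h in zip(y, yhat))  # unexplained variation
    return sst, ssr, sse, ssr / sst

sst, ssr, sse, r2 = variation_decomposition([1, 2, 3, 4, 5], [2, 3, 5, 4, 6])
```

For a least-squares fit with an intercept, SST = SSR + SSE holds exactly, so R² always lands between 0 and 1.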
21. Adjusted R Square
The adjusted R-squared is a modified version of R-squared that adjusts for
predictors that are not significant in a regression model.
R-squared increases every time you add an independent variable to the
model. Adjusted R-squared increases only when the new term improves the
model fit more than expected by chance alone; it actually decreases when
the term doesn't improve the model fit by a sufficient amount.
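The usual formula, Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), can be sketched directly; R² = 0.20, n = 100, p = 1 below match the figures reported for Case Study 1 later in the deck.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Case Study 1 figures from the slides: R^2 = 0.20, n = 100, p = 1
adj = adjusted_r2(0.20, 100, 1)
```

Note the penalty: holding R² fixed at 0.20, adding more predictors only lowers the adjusted value.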
22. Multiple
Regression
The multiple linear regression model is used to predict a
response (dependent) variable based on two or more
predictor (independent) variables.
The multiple linear regression model can be stated as follows
𝑦𝑖 = 𝛽0 + 𝛽1𝑥𝑖1 + 𝛽2𝑥𝑖2 + ⋯ + 𝛽𝑝𝑥𝑖𝑝 + 𝑒𝑖 , 𝑖 = 1, 2, · · · , n.
where
• 𝑦𝑖 is the 𝑖th value of the response variable,
• 𝑥𝑖𝑗 is the 𝑖th observation of the 𝑗th predictor variable,
• 𝛽0, 𝛽1, 𝛽2, …, 𝛽𝑝 are the parameters (regression coefficients),
• 𝑒𝑖 is the random error term with E(𝑒𝑖) = 0 and V(𝑒𝑖) = 𝜎².
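A sketch of fitting such a model by least squares in Python with NumPy, using synthetic, noise-free data (not the datasets analysed in SPSS later), so the recovered coefficients can be checked exactly.

```python
import numpy as np

# Synthetic data: y = 1 + 2*x1 + 3*x2 with no noise,
# so least squares should recover the coefficients exactly
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 30)
x2 = rng.uniform(0, 10, 30)
y = 1 + 2 * x1 + 3 * x2

X = np.column_stack([np.ones_like(x1), x1, x2])  # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # beta = [b0, b1, b2]
```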
23. Case Study 1
The body temperatures (in °F) of 100 adults were measured along with
their gender, age, and heart rate. The data is stored in the body_temp.xlsx file.
Build a linear regression model for body temperature using heart rate as a
predictor.
26. Multiple R = Correlation Coefficient = 0.45
R Square = Coefficient of Determination = 0.20
R Square = 0.20 shows that 20% of the variation in temperature is explained by heart rate.
Model Summary
27. p-value = 0 < 0.05.
So, there is enough evidence that the fitted regression model is significant.
The regression model predicts the dependent variable, Temperature,
significantly well.
ANOVA
28. H0: 𝛽1=0 [Regression coefficient for Heart Rate is
not significant]
H1: 𝛽1≠ 0 [Regression coefficient for Heart Rate is
significant]
The p-value of the regression coefficient of Heart Rate = 0
< 0.05, so H0 is rejected.
Thus, the regression coefficient of Heart Rate is
significant.
Regression Coefficients
Regression Model:
Temperature = 92.391 + 0.081 Heart Rate
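The fitted equation can be used directly for prediction; for example, in Python (a heart rate of 73 is an arbitrary illustrative value, not from the slides):

```python
def predict_temperature(heart_rate):
    """Fitted model from the slides: Temperature = 92.391 + 0.081 * Heart Rate."""
    return 92.391 + 0.081 * heart_rate

temp = predict_temperature(73)  # predicted body temperature in degrees F
```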
29. Checking
Assumptions
• The regression model is Linear in parameter.
• The errors are Independently distributed.
• The errors are Normally distributed.
• The errors have equal variances, i.e. V(𝑒𝑖) = 𝜎² (Homoscedasticity).
33. Assumption - Errors are Independently distributed
The value of Durbin-Watson is 1.804, which is close to 2.
So, the assumption that errors are independently distributed is met.
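The Durbin-Watson statistic that SPSS reports can be sketched by hand: values near 2 indicate independent errors, near 0 positive autocorrelation, near 4 negative autocorrelation. The residuals below are made up for illustration.

```python
def durbin_watson(residuals):
    """DW = sum of squared successive differences / sum of squared residuals."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Alternating residuals (strong negative autocorrelation) push DW toward 4
dw = durbin_watson([1, -1, 1, -1, 1, -1])
```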
36. Homoscedasticity Assumption
The residual plot shows no obvious pattern: points are
equally distributed above and below zero on the
vertical axis, and to the left and right of zero
on the horizontal axis.
So the homoscedasticity assumption is met.
37. Case Study 2
The data were collected on a simple random sample of 20
patients with hypertension. The dataset is in arterialBp.csv.
The variables are
Y = mean arterial blood pressure (mm Hg)
X1 = age (years), X2 = weight (kgs)
X3 = body surface area (sq. m)
X4 = duration of hypertension (years)
X5 = basal pulse (beats /min), X6 = measure of stress
Fit an appropriate regression equation.
41. Multiple R = Correlation Coefficient = 0.997
R Square = Coefficient of Determination = 0.995
R Square = 0.995 shows that 99.5% of the variation in blood pressure is explained by
age, weight, BSA, duration of hypertension, pulse, and stress.
Model Summary
42. p-value = 0 < 0.05.
So, there is enough evidence that the fitted regression model is significant.
The regression model predicts the dependent variable, blood pressure,
significantly well.
ANOVA
44. Multiple R = Correlation Coefficient = 0.997
R Square = Coefficient of Determination = 0.993
R Square = 0.993 shows that 99.3% of the variation in blood pressure is explained by
age, weight, and BSA.
Model Summary
45. p-value = 0 < 0.05.
So, there is enough evidence that the fitted regression model is significant.
The regression model predicts the dependent variable, blood pressure,
significantly well.
ANOVA
47. Checking
Assumptions
• The regression model is Linear in parameter.
• The errors are Independently distributed.
• The errors are Normally distributed.
• The errors have equal variances, i.e. V(𝑒𝑖) = 𝜎² (Homoscedasticity).
• There is no Multicollinearity
(No significant correlation between independent variables)
53. Homoscedasticity Assumption
The residual plot shows no obvious pattern: points are
equally distributed above and below zero on the
vertical axis, and to the left and right of zero
on the horizontal axis.
So the homoscedasticity assumption is met.
55. Assumption - Errors are Independently distributed
The value of Durbin-Watson is 1.537, which is close to 2.
So, the assumption that errors are independently distributed is met.
57. Multicollinearity Assumption
The Variance Inflation Factor (VIF) for all variables lies between 1 and 10, so there is no
multicollinearity, i.e., the independent variables do not have significant correlation among
them.
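VIF can be computed outside SPSS by regressing each predictor on the others: VIF = 1/(1 − R²) of that auxiliary regression. A sketch with NumPy on synthetic data, where x3 is deliberately built to be nearly collinear with x1.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing X[:, j] on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(y)), others])   # add intercept
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1 / (1 - r2)

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)                      # independent of x1 -> low VIF
x3 = 2 * x1 + rng.normal(scale=0.01, size=50) # nearly collinear with x1 -> high VIF
X = np.column_stack([x1, x2, x3])
```

Here x2 should get a VIF near 1, while x1 and x3 get very large VIFs, flagging the collinearity.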
58. THANK YOU
Dr Parag Shah | M.Sc., M.Phil., Ph.D. ( Statistics)
www.paragstatistics.wordpress.com