# Linear Regression

12 Jan 2022


• 1. Machine Learning I: Regression Analysis, Part II
• 2. What is Linear regression? Linear regression is an approach for modeling the relationship between one dependent variable and one or more independent variables. Rupak Roy
• 3. Terminology Dependent Variable (Y): Predicted Variable / Target Variable, a variable (often denoted by Y) whose values depend on those of another. Independent Variable (X): Predictor Variable, a variable (often denoted by X) whose variation does not depend on that of another.
• 4. Linear regression A linear relationship between 2 variables is essentially a straight-line relationship: Y = mx + c, where m = slope and c = intercept. The slope is the average rate of change of Y when X changes; the intercept is the value of Y when X = 0.
• 5. Linear regression Slope & intercept On a plot, the steepest line is the one with the largest slope (m), and the intercept lies on the Y axis. Therefore the value of c is the Y value when X = 0.
• 6. Interpreting the slope If x = 72, 77, 81 with slope (m) = 85 and intercept = 20, then y = mx + c gives 85(72)+20, 85(77)+20, 85(81)+20. For a unit change in X, Y changes by a constant amount (the slope).
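The slide's numbers can be checked with a short sketch (Python here for illustration; the `predict` helper is hypothetical, using the slide's m = 85, c = 20):

```python
# Hypothetical helper using the slide's values: slope m = 85, intercept c = 20.
def predict(x, m=85, c=20):
    return m * x + c

xs = [72, 77, 81]
print([predict(x) for x in xs])       # [6140, 6565, 6905]
# A one-unit change in x always changes y by exactly the slope:
print(predict(73) - predict(72))      # 85
```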
• 7. Interpreting the intercept If the slope m = 0, then Y = mx + c = 0(72) + 20 = 20: Y is constant, so there is no relationship between X and Y, because no matter how much X changes, Y doesn't change. If the intercept c = 0, then Y = mx + c = 85(72) + 0: the line passes through the origin, and Y is directly proportional to X, so there is a relationship between X and Y.
• 8. Linear regression Since we are observing a straight-line relationship, it is usually written as Y = Bo + B1 * X + E, where Bo = c (intercept), B1 = m (slope, the beta coefficient), and E = error. The standard error of the estimate is a measure of the accuracy of predictions. Put simply: E = Actual − Predicted (values). These errors are also known as residuals.
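The residual definition E = Actual − Predicted can be sketched directly (a minimal Python illustration with made-up numbers, not the deck's data):

```python
# Illustrative fit Y = b0 + b1*X with toy numbers (b0 = 20, b1 = 85 assumed).
b0, b1 = 20, 85
x = [1, 2, 3]
actual = [110, 185, 280]
predicted = [b0 + b1 * xi for xi in x]                   # [105, 190, 275]
residuals = [a - p for a, p in zip(actual, predicted)]   # E = actual - predicted
print(residuals)  # [5, -5, 5]
```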
• 9. Linear regression Therefore the best regression line is the one that minimizes the error, i.e. the SSE. Algorithms to minimize the error are: OLS (Ordinary Least Squares) and Gradient Descent.
• 10. OLS Ordinary Least Squares, or linear least squares, is a method for estimating the unknown parameters in a linear regression model, with the goal of minimizing the sum of the squares of the differences between the observed responses (values of the variable being predicted) in the given dataset and those predicted by a linear function of a set of explanatory variables.
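For a single predictor, the OLS estimates have a closed form: b1 = cov(x, y) / var(x) and b0 = mean(y) − b1 · mean(x). A minimal sketch (Python, toy data assumed; `ols_fit` is a hypothetical helper, not from the deck):

```python
# Closed-form OLS for one predictor: b1 = cov(x, y) / var(x), b0 = ybar - b1*xbar.
def ols_fit(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    return b0, b1

x = [1, 2, 3, 4]
y = [3, 5, 7, 9]          # exactly y = 1 + 2x
print(ols_fit(x, y))      # (1.0, 2.0)
```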
• 11. Goals of OLS Understanding the goal of OLS: to minimize the SSE. What should a good fit minimize? • The error on the first & last data points. • The sum of the errors over all data points. • The sum of the squared errors over all data points. The best choice is the sum of squared errors over all data points, i.e. SSE.
• 12. Goals of OLS  Which fit has the largest SSE? As you add more data, SSE will certainly go up, but that doesn't necessarily mean your fit is doing a worse job.  For a fixed dataset, the larger the SSE, the worse the fit.  So SSE isn't a perfect metric for evaluating regression.
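A quick sketch of the SSE comparison (Python, invented toy fits; the better fit has the smaller SSE):

```python
# SSE = sum of squared errors; between two fits of the same data,
# the one with the lower SSE is the better fit.
def sse(y_actual, y_pred):
    return sum((a - p) ** 2 for a, p in zip(y_actual, y_pred))

y = [3, 5, 7, 9]
fit_good = [3.1, 4.9, 7.2, 8.8]
fit_bad = [2, 6, 6, 11]
print(sse(y, fit_good))  # ~0.1
print(sse(y, fit_bad))   # 7
```

Note the slide's caveat: SSE also grows simply because more data points are added, which is why it is not comparable across datasets of different sizes.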
• 13. OLS What Ordinary Least Squares (OLS) does is identify the best possible line, that is, the slope and intercept of the best possible line. How can we get the one line that captures the best relationship between the X and Y variables? One way is to choose the line that is as close as possible to as many points as possible; for points lying on the line the distance is zero, so overall the line with the minimum total distance is the best possible line.
• 14. Alternative to SSE  After SSE, what is the best way to evaluate the regression? R-square: it gives the proportion of the variation in Y that is explained by X. The higher the R-square, the better the model. But the R-square value increases whenever we add a new variable, whether relevant or not. To address this issue we use the Adjusted R-square, whose score goes up only if significant variables are added to the model.
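The two scores can be sketched from their formulas: R² = 1 − SSE/SST, and adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors (Python, toy numbers assumed):

```python
# R^2 = 1 - SSE/SST; adjusted R^2 penalizes extra predictors.
def r_squared(y, y_pred):
    my = sum(y) / len(y)
    sse = sum((a - p) ** 2 for a, p in zip(y, y_pred))
    sst = sum((a - my) ** 2 for a in y)
    return 1 - sse / sst

def adjusted_r_squared(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y = [3, 5, 7, 9]
y_pred = [3.5, 4.5, 7.5, 8.5]
r2 = r_squared(y, y_pred)
print(round(r2, 3))                                 # 0.95
print(round(adjusted_r_squared(r2, n=4, p=1), 3))   # 0.925
```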
• 15. Interpreting the Linear Regression Equation So how do we interpret the linear equation? Example: Death rate (Y) = intercept (c) + slope (m) * X, i.e. Y = -455.3 + 122 * Fever (X).  A positive sign on the Fever coefficient (122) implies a positive relationship between fever & death rate: for a unit change in Fever (X), the average increase in death rate is 122.  A negative sign on the coefficient (-122), as in y = -455.3 + (-122) * Fever, implies a negative relationship between fever & death rate: for a unit change in Fever (X), the average decrease in death rate is 122.
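The "unit change" reading of the coefficient can be verified numerically with the slide's equation Y = -455.3 + 122 * Fever (a Python sketch; `death_rate` is a hypothetical helper name):

```python
# The slide's equation: Y = -455.3 + 122 * Fever.
def death_rate(fever, intercept=-455.3, coef=122):
    return intercept + coef * fever

# A one-unit rise in Fever changes the prediction by exactly the coefficient:
print(round(death_rate(5) - death_rate(4), 6))  # 122.0
```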
• 16. Interpreting Linear Regression Output P-value  the probability that the observed effect occurred by chance. The standard P-value threshold for selecting a variable is 0.05. If the P-value is more than 0.05, there is no statistically significant relationship between X and Y, i.e. between the independent variable and the target variable. Therefore the factor Dia has no effect on the dependent Y variable (death rate): whatever we are seeing in Dia for death rate is due to randomness.
• 17. Multiple Linear Regression  In real-life situations, there will be multiple factors impacting the target variable.  However, the fewer the factor variables, the better the result; to achieve that we can do Feature Engineering or create Derived Variables.  Example: Radio + TV together can have a better impact than TV or Radio separately.
• 18. Multiple Linear Regression Output The actual regression equation: Y = Bo + B1*X1 + B2*X2 + B3*X3, or Y = c + m1*(x1) + m2*(x2) + m3*(x3). Therefore, Y = Bo + B1*cancer + B2*Hepa + B3*smoke, i.e. Y = -455 + 254 * cancer + (-199) * hepa + 85 * smoke. The variable Cancer is not significant since its P-value is greater than 0.05, so whatever we are seeing in the variable cancer for this particular death rate is due to randomness. Therefore the equation becomes Y = -455 + (-199) * Hepa + 85 * smoke
• 19. Interpreting the Multiple Linear Regression output So, for 45 units of smoke and 58 units of Hepa, the predicted death rate (Y) is Y = -455 + (-199) * Hepa + 85 * smoke = -455 + (-199) * 58 + 85 * 45 = -455 + (-11542) + 3825 = -8172. The total of -8172 indicates that with 45 units of smoke and 58 units of Hepa, the death rate decreases by 8172 units.
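The slide's arithmetic can be reproduced directly from its final equation (a Python sketch; `predicted_death_rate` is a hypothetical function name):

```python
# The slide's equation after dropping the insignificant Cancer term:
# Y = -455 + (-199) * Hepa + 85 * Smoke
def predicted_death_rate(hepa, smoke):
    return -455 + (-199) * hepa + 85 * smoke

print(predicted_death_rate(58, 45))  # -8172
```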
• 20. Multiple Linear Regression Output Here we have also observed the variable Hepa (1=b) = -199. The Hepa variable has 2 levels, i.e. Hepatitis B and Hepatitis C; b = 1 and c = 0 got assigned automatically. So if the Hepa variable goes up from 0 to 1, the person's death rate goes down by 199. However, if a variable like Hepa has multiple levels, it is better to create a derived variable for each level.
• 21. Assumptions of Linear Regression 1. Check for Linearity: plot the residuals against each of the independent variables. If the data is linearly related, we should see no patterns in the plot. If we see any pattern, the relationship is non-linear.
• 22. Assumptions of Linear Regression 2. The residuals should have constant variance, i.e. Homoscedasticity. Plot the residuals against the predicted Y. We should not see any patterns; if we do see a pattern, the variation is not random, meaning there are factors influencing the regression model that we are not capturing. To get rid of heteroscedasticity, apply a log, sqrt, etc. transformation to the Dependent Variable (DV) / Target Variable.
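The variance-stabilizing effect of a log transform can be sketched numerically (Python; toy strictly-positive values assumed, not the deck's data):

```python
import math

# A target whose spread grows with its scale (multiplicative structure):
dv = [1, 10, 100, 1000]
# After a log transform the gaps become constant, i.e. the scale is stabilized.
log_dv = [math.log(x) for x in dv]
print(round(log_dv[1] - log_dv[0], 6), round(log_dv[3] - log_dv[2], 6))  # 2.302585 2.302585
```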
• 23. Assumptions of Linear Regression plot(model\$dv, model\$residuals); abline(0,0) If the residuals show a clear straight-line pattern, the model is biased.
• 24. Assumptions of Linear Regression 3. The residuals are normally distributed. Check using: plot(model\$residuals), qqnorm(model\$residuals), or hist(model\$residuals). If they are not normally distributed, we can apply a log, sqrt, or other transformation to the target variable.
• 25. Assumptions of Linear Regression 4. The independent variables should not be correlated with each other. This issue is known as Multicollinearity. >cor(data[, c("IV1","IV2","IV3","IV4")]) From the car package, use vif(model), where the score should be <= 10. It is also important to check each independent variable's correlation with the dependent variable, which can be done with a scatterplot. Check pairwise (IV vs DV) by graph: >qplot(data\$DV, data\$IV) #quick plot (ggplot2) Or, for a correlation score, use: >cor(data\$DependentVariable, data\$IndependentVariable) The correlation score ranges from -1 to 1: the sign gives the direction of the relationship, and values near -1 or 1 indicate strong correlation. Solution: keep only one of a correlated pair and eliminate the other. It is like having 2 copies of the same variable, which makes it difficult to find the true relationship of a predictor with the response variable, and with correlated predictors present the standard errors tend to increase.
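The correlation check the slide runs with `cor()` is the Pearson correlation coefficient; a minimal sketch from its formula (Python, toy predictors assumed):

```python
# Pearson r = cov(x, y) / (sd(x) * sd(y)); |r| near 1 between two
# predictors signals multicollinearity: keep one and drop the other.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

iv1 = [1, 2, 3, 4, 5]
iv2 = [2, 4, 6, 8, 10]              # iv2 = 2 * iv1: perfectly correlated
print(round(pearson_r(iv1, iv2), 6))  # 1.0
```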
• 26. Multiple Linear Regression  To understand the difference between actual vs predicted values: >plot(dataset\$DV, col="blue", type="l") >lines(model\$fitted.values, col="red", type="l") The closer the two lines, the better the accuracy; the difference between them is known as the residuals.
• 27. Interpreting/Calculating the residuals How do we calculate the residuals? >dataset\$residuals_cal <- dataset\$dv - model4\$fitted.values >summary(dataset\$residuals_cal) >summary(model4\$residuals) Both will be the same. Now, to cross-check our calculated residuals against the residuals generated by the model: >dataset\$residuals_model <- model4\$residuals
• 28. Multiple Linear Regression  Now let's get some hands-on experience with a case study.