2. What is Linear regression?
Linear regression is an approach for modeling the relationship
between one dependent variable and one or more
independent variables.
3. Terminology
Dependent Variable (Y): the predicted / target variable; a
variable (often denoted by Y) whose value depends on that of
another.
Independent Variable (X): the predictor variable; a variable (often
denoted by X) whose variation does not depend on that of
another.
4. Linear regression
A linear relationship between 2 variables is essentially a
straight-line relationship: Y = mx + c
Where
m = slope
c = intercept
The slope is the average rate of change of Y when X changes.
The intercept is the value of Y when X = 0.
5. Linear regression
Slope & intercept
Which line has the largest slope (m)?
It is the steepest line, the one closest to the Y-axis.
The intercept lies on the Y-axis; therefore the value of c is
the Y value when X = 0.
6. Interpreting intercept
If X = 72, 77, 81, slope (m) = 85, intercept (c) = 20,
then y = mx + c:
= 85(72) + 20 = 6140
= 85(77) + 20 = 6565
= 85(81) + 20 = 6905
For a unit change in X, Y changes by
a constant amount: the slope, here 85.
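A quick check of this arithmetic in R (values taken from the slide above):

x <- c(72, 77, 81)
slope <- 85
intercept <- 20
y <- slope * x + intercept
y                    # 6140 6565 6905
diff(y) / diff(x)    # 85 85: Y changes by the slope for each unit change in X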
7. Interpreting the intercept
If the slope m = 0,
then Y = mx + c = 0(72) + 20 = 20.
(Y is constant, so there is no relationship
between X and Y, because no matter
how much X changes, Y doesn't
change.)
If the intercept c = 0,
then Y = mx + c
= 85(72) + 0 = 6120,
and the line passes through the origin.
(Y is directly proportional to X, so
there exists a relationship between X and Y.)
8. Linear regression
Since we are observing a straight-line relationship, the
relationship is usually written as
Y = B0 + B1 * X + E
Where
B0 = c, the intercept
B1 = m, the slope (beta coefficient)
E = the error
The standard error of the estimate is a measure of the
accuracy of predictions.
Or simply: E = Actual – Predicted (values).
E is also known as the residual.
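A minimal sketch of fitting this equation in R; the data here is simulated purely for illustration:

set.seed(42)
x <- runif(50, 0, 100)
y <- 20 + 85 * x + rnorm(50, sd = 10)   # true B0 = 20, B1 = 85, plus noise E
model <- lm(y ~ x)
coef(model)               # estimated intercept (B0) and slope (B1)
head(residuals(model))    # E = actual - predicted for the first few points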
9. Linear regression
Therefore the best regression line is the one that minimizes the
ERROR, i.e. the SSE (sum of squared errors).
Algorithms to minimize the error include
OLS (Ordinary Least Squares)
Gradient Descent
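OLS is covered on the next slide; for completeness, here is a minimal sketch of gradient descent for simple linear regression (the learning rate, iteration count, and simulated data are illustrative choices, not from the slides):

set.seed(1)
x <- runif(100)
y <- 3 + 2 * x + rnorm(100, sd = 0.1)
b0 <- 0; b1 <- 0
lr <- 0.1                        # learning rate
for (i in 1:5000) {
  err <- y - (b0 + b1 * x)       # residuals at the current estimates
  b0  <- b0 + lr * mean(err)     # step opposite the gradient of the error
  b1  <- b1 + lr * mean(err * x)
}
c(b0, b1)                        # close to the true values 3 and 2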
10. OLS
Ordinary Least Squares (OLS), or linear least squares, is a
method for estimating the unknown parameters in a linear
regression model. Its goal is to minimize the sum of the squares
of the differences between the observed responses (values of
the variable being predicted) in the given dataset and those
predicted by a linear function of a set of explanatory
variables.
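For simple linear regression the OLS estimates have a closed form; a sketch computing them by hand and checking against lm(), on simulated data:

set.seed(7)
x <- rnorm(30)
y <- 5 + 1.5 * x + rnorm(30)
b1 <- cov(x, y) / var(x)         # OLS slope estimate
b0 <- mean(y) - b1 * mean(x)     # OLS intercept estimate
c(b0, b1)
coef(lm(y ~ x))                  # lm() reproduces the same estimates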
11. Goals of OLS
Understanding the goal of OLS: to minimize the SSE.
What should a good fit minimize?
• The error on the first & last data points.
• The summation of the error over all data points.
• The summation of (error)^2 over all data points.
The best is the summation of (error)^2 over all data points,
i.e. the SSE, sum of squared errors (see the sketch below).
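A quick sketch of why the plain summation fails as a criterion: positive and negative errors cancel, while squaring keeps every error positive (toy numbers assumed):

errors <- c(3, -3, 5, -5)   # residuals that cancel each other out
sum(errors)     # 0: the plain sum wrongly suggests a perfect fit
sum(errors^2)   # 68: the SSE correctly reports the total misfit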
12. Goals of OLS
Which fit has the largest SSE?
As you add more data, the SSE will certainly go up. But that
doesn't necessarily mean your fit is doing a worse job.
Within one dataset, the larger the SSE, the worse the fit; but
because SSE grows with the number of points, it isn't a perfect
metric to evaluate a regression.
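A sketch of this effect on simulated data: the same noise level produces a far larger SSE simply because there are more points, not because the fit is worse:

set.seed(3)
sse <- function(n) {
  x <- runif(n)
  y <- 2 + 4 * x + rnorm(n, sd = 0.5)
  sum(residuals(lm(y ~ x))^2)
}
sse(20)     # small SSE with few points
sse(2000)   # far larger SSE, even though the fit quality is the same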
13. OLS
What Ordinary Least Squares (OLS) does is try to identify the
best possible line, along with the slope and the intercept of that
line.
Now, how can we get one line
that captures the relationship
between the X and Y variables best?
One way of identifying the best
possible line is to take the line that is as close as possible to as many
points as possible; for the points the line passes through,
the distance will be zero. Overall, the line with the minimum
total distance is the best possible line.
14. Alternative to SSE
After SSE, what is a better way to evaluate a regression?
R-squared: it gives the proportion of the variation in
Y that is explained by X. The higher the R-squared, the better
the model.
But the R-squared value increases whenever we add a new
variable, whether relevant or not. In order to address this issue
we use the Adjusted R-squared.
Adjusted R-squared penalizes extra variables, so its score
will go up only if significant variables are
added to the model (see the sketch below).
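A sketch of this behaviour on simulated data, adding a pure-noise predictor:

set.seed(11)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
junk <- rnorm(100)               # an irrelevant predictor
m1 <- lm(y ~ x)
m2 <- lm(y ~ x + junk)
summary(m1)$r.squared;     summary(m2)$r.squared       # R-squared rises anyway
summary(m1)$adj.r.squared; summary(m2)$adj.r.squared   # adjusted typically falls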
15. Interpreting Linear Regression Equation
So how do we interpret
the linear equation?
Example:
Death rate (Y) = intercept (c) + slope (m) * X
Y = -455.3 + 122 * Fever(X)
A positive sign on the Fever coefficient (122) implies a positive
relationship between fever & death rate. This means for a unit
change in Fever (X), the average increase in death rate will be
122.
A negative sign on the Fever coefficient (-122),
y = -455.3 + (-122) * Fever, implies a negative relationship
between fever & death rate. This means for a unit change in
Fever (X), the average decrease in death rate will be 122.
16. Interpreting Linear Regression Output
P-Value
It is the probability of seeing the observed effect purely by chance.
The standard P-value cutoff
to select a variable is 0.05. If the P-value is more than 0.05,
then there is no cause-effect relationship between X and Y, i.e.
between the independent variable and the target variable. Therefore the
factor Dia has no effect on the dependent Y variable (death
rate), so whatever we are seeing in Dia for death rate is due to
randomness.
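In R, the P-values live in the Pr(>|t|) column of the coefficient table; a sketch on simulated data where y truly depends only on x1:

set.seed(5)
x1 <- rnorm(100); x2 <- rnorm(100)
y <- 2 * x1 + rnorm(100)        # x2 has no real effect on y
model <- lm(y ~ x1 + x2)
summary(model)$coefficients     # x2's Pr(>|t|) will typically be > 0.05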
17. Multiple Linear Regression
In real-life situations, there will be multiple factors impacting
the target variable.
However, the fewer the factor variables, the better the result; for
that we can do Feature Engineering, i.e. create Derived Variables.
Example: Radio + TV together can have a better impact than
TV or Radio separately (see the sketch below).
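A sketch of building such a derived variable in R; the TV, Radio, and Sales columns are hypothetical, simulated here only for illustration:

set.seed(17)
ads <- data.frame(TV = runif(50, 0, 100), Radio = runif(50, 0, 50))
ads$Sales <- 5 + 0.2 * ads$TV + 0.1 * ads$Radio +
  0.05 * ads$TV * ads$Radio + rnorm(50)
ads$TV_Radio <- ads$TV * ads$Radio          # derived variable: joint effect
model <- lm(Sales ~ TV + Radio + TV_Radio, data = ads)
summary(model)$coefficients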
18. Multiple Linear Regression Output
The actual regression equation:
Y = B0 + B1*X1 + B2*X2 + B3*X3
or Y = c + m1*(x1) + m2*(x2) + m3*(x3)
Therefore, Y = B0 + B1 * cancer + B2 * hepa + B3 * smoke
Y = -455 + 254 * cancer + (-199) * hepa + 85 * smoke
The variable cancer is not significant, since its P-value is > 0.05.
So whatever we are seeing in the variable cancer for this
particular death rate is due to randomness. Therefore the
equation will be Y = -455 + (-199) * hepa + 85 * smoke
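A sketch of this drop-and-refit workflow in R; the health data frame and its columns are hypothetical stand-ins, simulated to mimic the slide's coefficients:

set.seed(9)
health <- data.frame(cancer = rnorm(200), hepa = rnorm(200), smoke = rnorm(200))
health$death_rate <- -455 - 199 * health$hepa + 85 * health$smoke +
  rnorm(200, sd = 50)
full <- lm(death_rate ~ cancer + hepa + smoke, data = health)
summary(full)$coefficients   # cancer's Pr(>|t|) will typically be > 0.05
reduced <- lm(death_rate ~ hepa + smoke, data = health)
coef(reduced)                # close to the slide's -455, -199 and 85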
19. Interpreting the Multiple Linear Regression Output
So, for 45 units of smoke and 58 units of Hepa,
the change in the death rate (Y) will be
Y = -455 + (-199) * Hepa + 85 * smoke
= -455 + (-199) * 58 + 85 * 45 = -455 + (-11542) + 3825
= -8172
So the total output of -8172 indicates that with 45 units of
smoke and 58 units of Hepa, the death rate decreases by
8172 units.
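The same arithmetic checked in R (coefficients taken from the slide):

b0 <- -455; b_hepa <- -199; b_smoke <- 85
b0 + b_hepa * 58 + b_smoke * 45   # -8172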
20. Multiple Linear Regression Output
Here we have also
observed the variable
Hepa (1 = b) = -199
The Hepa variable has 2 levels, i.e. Hepatitis B and Hepatitis C;
b = 1 and
c = 0 got assigned automatically.
So it means that if the Hepa variable goes up from 0 to 1, then the
death rate will go down by 199.
However, if a variable like Hepa has multiple levels,
then it is better to create derived (dummy) variables for each
(see the sketch below).
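A sketch of how R codes such a factor automatically; note that R treats the first level as the reference by default, so the 0/1 assignment may differ from the slide's:

hepa <- factor(c("b", "c", "b", "b", "c"))
model.matrix(~ hepa)   # one 0/1 column per non-reference level
# explicit derived (dummy) variables, one per level:
data.frame(hepa_b = as.integer(hepa == "b"),
           hepa_c = as.integer(hepa == "c"))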
21. Assumptions of Linear Regression
1. Check for linearity: plot the residuals against each of the
independent variables. If the data is linearly related, we
should see no pattern in the plot.
If we see any pattern (right graph), then the relationship
is non-linear.
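A sketch of this check in R, on simulated data (the model and variable names are illustrative):

set.seed(21)
dat <- data.frame(x1 = runif(100), x2 = runif(100))
dat$y <- 1 + 3 * dat$x1 + 2 * dat$x2 + rnorm(100, sd = 0.2)
model <- lm(y ~ x1 + x2, data = dat)
plot(dat$x1, residuals(model)); abline(h = 0)   # should look patternless
plot(dat$x2, residuals(model)); abline(h = 0)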
22. Assumptions of Linear Regression
2. The residuals should have constant variance, i.e.
Homoscedasticity.
Plot the residuals against the predicted Y.
Homoscedasticity vs. Heteroscedasticity:
we should not see any patterns. If we do see a pattern, the variation
is not random; there are some factors
influencing the regression model that we are not capturing in our
model. In order to get rid of heteroscedasticity, use a log, sqrt, etc.
transformation on the dependent variable (DV) / target variable.
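A sketch of the check and the fix, on data simulated with deliberately increasing variance:

set.seed(22)
x <- runif(200, 1, 10)
y <- exp(0.5 * x + rnorm(200, sd = 0.3))               # spread grows with x
m_raw <- lm(y ~ x)
plot(fitted(m_raw), residuals(m_raw)); abline(h = 0)   # funnel: heteroscedastic
m_log <- lm(log(y) ~ x)                                # log-transform the DV
plot(fitted(m_log), residuals(m_log)); abline(h = 0)   # roughly constant spread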
23. Assumptions of Linear Regression
plot(dataset$dv, model$residuals); abline(0, 0)
If the residuals show a straight-line pattern, then the model is biased.
24. Assumptions of Linear Regression
3. The residuals are normally distributed.
Check using:
plot(model$residuals)
qqnorm(model$residuals); qqline(model$residuals)
or
hist(model$residuals)
If they are not normally distributed, then
we can use log, sqrt, and other
variable transformations on the target
variable.
25. Assumptions of Linear Regression
4. The independent variables should not be correlated with
each other. This issue is also known as Multicollinearity.
>cor(data[, c("IV1", "IV2", "IV3", "IV4")])   # correlation matrix of the IVs
From the car package, use vif(model); the score should be <= 10.
It is also important to check the correlation of each independent
variable with the dependent variable,
which can also be done using a scatterplot.
Check pair-wise (IV vs DV), i.e. by using graphs:
>qplot(data$DV, data$IV)    # quick plot (ggplot2)
>qqplot(data$DV, data$IV)   # quantile-quantile plot
Or, for a correlation score, use:
>cor(data$DependentVariable, data$IndependentVariable)
The correlation score ranges from -1 to 1,
where a negative score indicates an inverse relationship.
Solution: keep only one and eliminate the other. Having two nearly identical
variables makes it difficult to find out the true relationship of the
predictors with the response variable, and in the presence of correlated
predictors the standard errors tend to increase (see the sketch below).
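A runnable sketch of these checks on simulated data (the car package must be installed; column names are illustrative):

set.seed(25)
data <- data.frame(IV1 = rnorm(100), IV3 = rnorm(100))
data$IV2 <- data$IV1 + rnorm(100, sd = 0.1)   # nearly a copy of IV1
data$DV  <- data$IV1 + data$IV3 + rnorm(100)
cor(data[, c("IV1", "IV2", "IV3")])           # IV1 and IV2 correlate strongly
model <- lm(DV ~ IV1 + IV2 + IV3, data = data)
car::vif(model)                               # VIF for IV1 and IV2 far above 10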
26. Multiple Linear Regression
To understand the difference between
actual vs predicted values:
>plot(dataset$DV, col = "blue", type = "l")
>lines(model$fitted.values, col = "red", type = "l")
The closer they are, the better the
accuracy, and the difference is
known as the
residuals.
27. Interpreting/Calculating the residuals
How do we calculate the residuals?
>dataset$residuals_cal <- dataset$dv - model4$fitted.values
>summary(dataset$residuals_cal)
>summary(model4$residuals)
Both of them will be the same.
Now, to cross-check our
calculated residuals against the
residuals generated by the model:
>dataset$residuals_model <- model4$residuals
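A quick way to confirm the two columns match (all.equal tolerates tiny floating-point differences; dataset and model4 are the objects assumed above):

>all.equal(dataset$residuals_cal, dataset$residuals_model)   # TRUE if they match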