Machine Learning - I
Regression Analysis Part II
Rupak Roy
What is Linear regression?
Linear regression is an approach for modeling the relationship
between one dependent variable and one or more
independent variables.
Terminology
Dependent Variable (Y): the predicted or target variable; a
variable (often denoted by Y) whose value depends on that of
another.
Independent Variable (X): the predictor variable; a variable (often
denoted by X) whose variation does not depend on that of
another.
Linear regression
A linear relationship between two variables is essentially a
straight-line relationship: Y = mx + c
where
m = slope
c = intercept
The slope is the average rate of change in Y for a one-unit change in X.
The intercept is the value of Y when X = 0.
Linear regression
Slope & intercept
Which line has the largest slope (m)?
It is the steepest line, the one closest to the Y axis.
The intercept lies on the Y axis, so the value of c is
the Y value when X = 0.
Interpreting intercept
If x = 72, 77, 81, slope (m) = 85 and intercept (c) = 20,
then y = mx + c:
y = 85(72) + 20 = 6140
y = 85(77) + 20 = 6565
y = 85(81) + 20 = 6905
For a unit change in X, Y changes by a constant amount (the slope, 85).
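A quick check of this arithmetic in R (a minimal sketch; the object names m, c0, x and y are illustrative, not from the slides):
m <- 85
c0 <- 20                # intercept; called c0 to avoid clashing with R's built-in c()
x <- c(72, 77, 81)
y <- m * x + c0
y                       # 6140 6565 6905: each unit increase in x adds 85 to y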
Interpreting the intercept
If the slope m = 0,
then Y = mx + c = 0(72) + 20 = 20
(Y is constant, so there is no relationship between X and Y,
because no matter how much X changes, Y does not change).
If the intercept c = 0,
then Y = mx + c
= 85(72) + 0 = 6120
and the line passes through the origin
(Y is directly proportional to X, so there is a relationship
between X and Y).
Linear regression
Since we are modeling a straight-line relationship, it is usually
written as
Y = B0 + B1 * X + E
where
B0 = c, the intercept
B1 = m, the slope (beta coefficient)
E = error
The standard error of the estimate is a measure of the accuracy of
the predictions.
Put simply, E = Actual - Predicted (values).
It is also known as the residual.
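In R, lm() estimates B0 and B1 and keeps the residuals E; a minimal sketch on simulated data (the values and names are illustrative, not from the slides):
set.seed(1)
x <- 1:50
y <- 20 + 85 * x + rnorm(50, sd = 30)   # simulate data with B0 = 20, B1 = 85 plus noise
model <- lm(y ~ x)                      # least-squares fit
coef(model)                             # estimated intercept (B0) and slope (B1)
head(residuals(model))                  # E = actual - predicted, for the first few points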
Linear regression
Therefore the best regression line is the one that minimizes the
error, i.e. the SSE.
Algorithms to minimize the error are (short sketches of both follow
below):
 OLS (Ordinary Least Squares)
 Gradient Descent
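A minimal gradient-descent sketch for the one-variable case (the simulated data, learning rate and iteration count are assumptions, not from the slides):
set.seed(2)
x <- rnorm(100)                         # predictor already on unit scale, so no feature scaling needed
y <- 4 + 3 * x + rnorm(100, sd = 0.5)   # true intercept 4, slope 3
b0 <- 0; b1 <- 0
lr <- 0.05                              # learning rate
for (i in 1:2000) {
  err <- y - (b0 + b1 * x)              # residuals at the current estimates
  b0  <- b0 + lr * mean(err)            # step against the gradient of the mean squared error
  b1  <- b1 + lr * mean(err * x)
}
c(b0, b1)                               # converges toward coef(lm(y ~ x))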
OLS
Ordinary Least Squares, or linear least squares, is a method for
estimating the unknown parameters in a linear regression model.
The goal is to minimize the sum of the squares of the differences
between the observed responses (values of the variable being
predicted) in the given dataset and those predicted by a linear
function of the explanatory variables.
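For a single predictor the OLS estimates have a closed form; a minimal sketch on simulated data (names and values are illustrative):
set.seed(3)
x <- runif(30, 0, 10)
y <- 5 + 2 * x + rnorm(30)
b1 <- cov(x, y) / var(x)          # the slope that minimizes the sum of squared errors
b0 <- mean(y) - b1 * mean(x)      # the intercept
c(b0, b1)
coef(lm(y ~ x))                   # lm() returns the same estimates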
Goals of OLS
The goal of OLS is to minimize the SSE.
What should a good fit minimize?
• Error on the first & last data points.
• Summation of the error over all data points.
• Summation of the squared error over all data points.
The best is the summation of squared errors over all data points,
i.e. the SSE (sum of squared errors).
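The SSE itself is easy to compute for a fitted model (assuming model is an lm fit such as the sketches above):
sse <- sum(residuals(model)^2)    # sum of squared errors
sse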
Goals of OLS
 Which fit has the largest SSE?
 For a fixed dataset, the larger the SSE, the worse the fit.
As you add more data points, the SSE will certainly go up, but that
does not necessarily mean your fit is doing a worse job.
 So SSE on its own is not a perfect way to evaluate a regression.
OLS
What Ordinary Least Squares (OLS) does is try to identify the best
possible line, that is, the slope and the intercept of the best
possible line.
Now, how can we get one line that captures the best relationship
between the X and Y variables?
One way of identifying the best possible line is to choose the line
that is as close as possible to as many points as possible; for the
points the line passes through, the distance is zero, so overall the
line with the minimum total distance is the best possible line.
Alternative to SSE
 After SSE, what is a better way to evaluate the regression?
R-squared: it gives the proportion of the variation in Y that is
explained by X. The higher the R-squared, the better the model.
But the R-squared value increases whenever we add a new variable,
whether relevant or not. In order to address this issue we use the
adjusted R-squared.
Adjusted R-squared adjusts for the number of variables, and its
score will go up only if significant variables are added to the model.
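Both scores are reported by summary() for a fitted lm model (assuming a model object as in the sketches above):
s <- summary(model)
s$r.squared          # R-squared
s$adj.r.squared      # adjusted R-squared, penalized for the number of predictors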
Interpreting Linear Regression Equation
So how do we interpret the linear equation?
Example:
Death rate (Y) = intercept (c) + slope (m) * X
Y = -455.3 + 122 * Fever (X)
 A positive sign on the fever coefficient (122) implies a positive
relationship between fever and death rate: for a unit change in
Fever (X), the average increase in death rate will be 122.
 A negative sign on the fever coefficient (-122), i.e.
y = -455.3 + (-122) * Fever, implies a negative relationship between
fever and death rate: for a unit change in Fever (X), the average
decrease in death rate will be 122.
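Plugging values into the slide's equation in R makes the constant change of 122 per unit of fever visible (the fever values 1:5 are illustrative):
fever <- 1:5
death_rate <- -455.3 + 122 * fever
death_rate
diff(death_rate)     # each unit increase in fever adds 122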
Interpreting Linear Regression Output
P-Value
 The p-value is the probability that the observed effect occurred
by chance.
The standard p-value cutoff to select a variable is 0.05. If the
p-value is more than 0.05, there is no statistically significant
cause-and-effect relationship between X and Y, i.e. the target
variable and the independent variable. Therefore the factor Dia has
no significant effect on the dependent Y variable (death rate), so
whatever we are seeing in Dia for death rate is due to randomness.
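In R, the p-value of each coefficient appears in the model summary (assuming a fitted lm object model):
summary(model)$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)
# Variables whose Pr(>|t|) is greater than 0.05 are treated as not significant.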
Multiple Linear Regression
 In real-life situations there will be multiple factors impacting
the target variable.
 However, the fewer the factor variables, the better the result;
to achieve that we can do feature engineering, i.e. create derived
variables.
 Example: Radio + TV together can have a better impact than TV or
Radio separately.
Multiple Linear Regression Output
The actual regression equation:
Y = B0 + B1*X1 + B2*X2 + B3*X3
or Y = c + m1*(x1) + m2*(x2) + m3*(x3)
Therefore, Y = B0 + B1 * cancer + B2 * Hepa + B3 * smoke
Y = -455 + 254 * cancer + (-199) * hepa + 85 * smoke
The variable cancer is not significant, since its p-value is greater
than 0.05, so whatever we are seeing in the variable cancer for this
particular death rate is due to randomness. Therefore the equation
becomes Y = -455 + (-199) * Hepa + 85 * smoke.
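A runnable sketch of such a multiple regression; the data frame and column names are illustrative and the data is simulated with the slide's coefficients:
set.seed(4)
df <- data.frame(cancer = rnorm(100), hepa = rnorm(100), smoke = rnorm(100))
df$death_rate <- -455 + 254 * df$cancer - 199 * df$hepa + 85 * df$smoke + rnorm(100, sd = 50)
model <- lm(death_rate ~ cancer + hepa + smoke, data = df)
summary(model)        # estimates of B0, B1, B2, B3 with their p-values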
Interpreting the
Multiple Linear Regression output
So, for 45 units of smoke and 58 units of Hepa(0),
the change in death rate (Y) will be
Y = -455 + (-199) * Hepa + 85 * smoke
  = -455 + (-199) * 58 + 85 * 45 = -455 + (-11542) + 3825
  = -8172
So the total output of -8172 indicates that with 45 units of smoke
and 58 units of Hepa(0) there will be a decrease in death rate of
8172 units.
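The arithmetic can be checked directly in R:
-455 + (-199) * 58 + 85 * 45      # -8172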
Multiple Linear Regression Output
Here we have also observed the variable Hepa (1 = b) = -199.
The Hepa variable has 2 levels, i.e. Hepatitis B and Hepatitis C:
b = 1
c = 0 were assigned automatically.
So if the Hepa variable goes up from 0 to 1, the death rate goes
down by 199.
However, if there are multiple levels in a variable like Hepa, it is
better to create derived (dummy) variables for each, as sketched below.
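A sketch of how R turns a two-level factor like Hepa into a 0/1 dummy variable (the level names and the relevel() call are illustrative, chosen to reproduce the slide's coding b = 1, c = 0):
hepa <- factor(c("b", "c", "b", "b", "c"))
hepa <- relevel(hepa, ref = "c")   # make "c" the baseline so that b = 1 and c = 0
model.matrix(~ hepa)               # the 0/1 dummy column lm() would build automatically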
Assumptions of Linear Regression
1. Check for linearity: plot the residuals against each of the
independent variables. If the data is linearly related, we should
see no pattern in the plot.
If we see a pattern (right-hand graph), the relationship is
non-linear.
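A minimal sketch of this check, assuming the fitted model and data frame df from the multiple-regression sketch above:
plot(df$smoke, residuals(model))   # residuals against one independent variable
abline(h = 0)                      # ideally the points scatter randomly around this line
# repeat for each independent variable (cancer, hepa, ...)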
Assumptions of Linear Regression
2. The residuals should have constant variance, i.e.
homoscedasticity.
Plot the residuals against the predicted Y.
(The two reference plots show homoscedasticity and heteroscedasticity
respectively.)
We should not see any pattern. If we do see a pattern, the variation
is not purely random: there are some factors influencing the
regression model that we are not capturing in our model. To get rid
of heteroscedasticity, apply a log, sqrt or similar transformation
to the dependent variable (DV)/target variable.
Assumptions of Linear Regression
plot(dataset$dv, model$residuals)   # residuals against the dependent variable
abline(0, 0)                        # horizontal reference line at zero
If the residuals form a clear straight-line pattern, the model is biased.
Assumptions of Linear Regression
3. The residuals are normally distributed.
Check using:
plot(model$residuals)
qqnorm(model$residuals)   # normal Q-Q plot of the residuals
qqline(model$residuals)
or
hist(model$residuals)
If the residuals are not normally distributed, we can apply a log,
sqrt or other transformation to the target variable.
Assumptions of Linear Regression
4. The independent variables should not be correlated with each
other. This issue is also known as multicollinearity.
>cor(data[, c("IV1", "IV2", "IV3", "IV4")]) # correlation matrix of the independent variables
From the car package use vif(model), where the score should be <= 10.
It is also useful to check how each independent variable relates to
the dependent variable.
This can be done with a scatterplot, checking pairwise (IV vs DV),
i.e. by using a graph:
>qplot(data$DV, data$IV)   # quick plot (ggplot2)
>qqplot(data$DV, data$IV)  # quantile-quantile plot of the two variables
Or, for a correlation score, use:
>cor(data$DependentVariable, data$IndependentVariable)
The correlation score ranges from -1 to 1, where a negative score
indicates an inverse relationship.
Solution: keep only one of the correlated variables and eliminate the
other. Keeping both is like having two copies of the same variable,
which makes it difficult to find the true relationship of the
predictors with the response variable, and in the presence of
correlated predictors the standard errors tend to increase.
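A sketch of these checks, assuming the model and df from the multiple-regression sketch above and that the car package is installed:
library(car)                              # provides vif()
vif(model)                                # variance inflation factors; values above ~10 signal collinearity
cor(df[, c("cancer", "hepa", "smoke")])   # pairwise correlations between the independent variables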
Multiple Linear Regression
 To understand the difference between actual vs predicted values:
>plot(dataset$DV, col = "blue", type = "l")
>lines(model$fitted.values, col = "red", type = "l")
The closer the two lines, the better the accuracy; the difference
between them is known as the residuals.
Interpreting/Calculating the residuals
How do we calculate the residuals?
>dataset$residuals_cal <- dataset$dv - model4$fitted.values
>summary(bf_non_outlier$residuals_cal)
>summary(model4$residuals)
Both of them will be the same.
Now, to cross-check our calculated residual values against the
residuals generated by the model:
>dataset$residuals_model <- model4$residuals
Multiple Linear Regression
 Now let's get some hands-on experience with a case study.