Mirror mirror on the wall, help me
predict and know it all…
Ankur Sansanwal
https://www.linkedin.com/in/ankursansanwal
Why are we here
Let's establish the baseline
• What does this represent?
– I eat this for breakfast
– There is always a first time
Let's establish the baseline
• How familiar are you with R?
– Jedi Master
– I publish R libraries
– It’s a dance studio down the road
Let's get ready
• Please install R (shell) and, if you want a GUI, RStudio
– You can download RStudio from https://www.rstudio.com/ and R from
https://cran.r-project.org/bin/windows/base/
• Copy the wine.csv & wine_test.csv from the thumb drive
– This data comes from Liquid Assets
(www.liquidasset.com/winedata.html)
• This talk is inspired by The Analytics Edge course work at edx.org
Statistics refresher before the fun part
• Independent & dependent variable
• One-variable linear regression
• SSE
• SST
• R²
What is
Dependent variable → the variable that you are trying to predict
Independent variables → the variables that you believe influence the dependent variable
One Variable Linear Regression
[Figure: Price (dependent var.) plotted against avg. growing temp (independent var.), with two lines labelled 1 and 2]
• y = 0.5 (avg. growing temp) – 1.25
How does price change based on temp?
Is this line perfect?
• Baseline model → y = 7
What is the price of wine when temp is 16 or 18 degrees?
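To make the two lines on this slide concrete, here is a minimal R sketch that evaluates both at 16 and 18 degrees. The function names `baseline` and `regression` are illustrative and not from the talk.

```r
# The two candidate lines from the slide, written as functions of temperature.
baseline   <- function(temp) rep(7, length(temp))   # y = 7, ignores temperature
regression <- function(temp) 0.5 * temp - 1.25      # y = 0.5 * temp - 1.25

temps <- c(16, 18)
print(data.frame(temp       = temps,
                 baseline   = baseline(temps),
                 regression = regression(temps)))
```

The baseline gives the same answer for every temperature, while the regression line moves the predicted price by 0.5 for each extra degree.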
The best model (line) should have minimal errors…
• How do you calculate error?
[Figure: Price (dependent var.) plotted against avg. growing temp (independent var.)]
• Error = Actual value – Predicted value
= 8 – 7 = 1
One of the measures of quality of the model is Sum of
Squared Errors (SSE)
[Figure: Price (dependent var.) plotted against avg. growing temp (independent var.)]
• $SSE = e_1^2 + e_2^2 + e_3^2 + e_4^2 + \dots + e_n^2$
• Let's call the SSE of the baseline model SST
• SST = 10.15
• SSE = 6.5
The smaller the errors vs. the baseline, the better the model
[Figure: Price (dependent var.) plotted against avg. growing temp (independent var.), with lines labelled 1 and 2]
• $R^2 = 1 - SSE / SST = 1 - 6.5 / 10.15 \approx 0.36$
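The chain from errors to SSE, SST and R² can be sketched in a few lines of R. The four data points below are made up for illustration (they are not the wine data), and the baseline is taken to be the mean of the dependent variable, which is the usual definition of SST.

```r
# Hypothetical observations, NOT taken from wine.csv.
temp  <- c(15, 16, 17, 18)
price <- c(6.5, 6.5, 7.5, 8.0)

predicted <- 0.5 * temp - 1.25         # the one-variable regression line
errors    <- price - predicted         # error = actual value - predicted value

sse <- sum(errors^2)                   # sum of squared errors of the model
sst <- sum((price - mean(price))^2)    # SSE of the baseline (mean) model
r2  <- 1 - sse / sst

print(c(SSE = sse, SST = sst, R2 = r2))
```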
But why is the value of R² always between 0 & 1?
• $R^2 = 1 - SSE / SST$
• $0 \le SSE \le SST$ (on the data the model was fit to) → Why is this the case?
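One way to see it, assuming the line is fit by ordinary least squares with an intercept, $SST > 0$, and R² is measured on the same data used for fitting: the baseline is itself one of the candidate lines (intercept $\bar{y}$, slope 0), so the best line cannot do worse than it.

```latex
\begin{aligned}
SSE &= \min_{b_0,\, b_1} \sum_{i=1}^{n} \left(y_i - b_0 - b_1 x_i\right)^2
      \;\le\; \sum_{i=1}^{n} \left(y_i - \bar{y} - 0 \cdot x_i\right)^2 = SST, \\
SSE &\ge 0 \quad \text{(it is a sum of squares)}, \\
\Rightarrow\quad 0 &\le \frac{SSE}{SST} \le 1
\quad\Rightarrow\quad 0 \le R^2 = 1 - \frac{SSE}{SST} \le 1 .
\end{aligned}
```

On data the model has not seen, the first inequality is no longer guaranteed, so R² computed on test data can fall below 0.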
Quiz
• What would be the R² of a perfect model?
Remember
• Good models for easy problems will have R² close to 1
• Good models for hard problems can still have R² close to 0
The real deal
• Build a multi-variable regression model in R
Step 1 – Have a look at the data (wine.csv)
Step 2 – Ask: what's the business question we are trying to
solve?
“Predict the price of wine”
Quiz
• What are the independent & dependent variables in our data set?
Step 3 – Check your working directory
Step 4 – Your working directory should be set to the
location of the wine.csv file
Step 5 – Load the CSV file
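Steps 3 to 5 might look like the sketch below. The slides show these steps only as screenshots, so the exact calls are an assumption. To keep the sketch runnable without the thumb drive, it first writes a tiny stand-in wine.csv with made-up values into a temporary folder; the column names are also assumed.

```r
# Hypothetical stand-in for wine.csv (made-up values, assumed column names).
demo_dir <- tempdir()
demo <- data.frame(Year = c(1952, 1953), Price = c(7.5, 8.0), AGST = c(17.1, 16.7))
write.csv(demo, file.path(demo_dir, "wine.csv"), row.names = FALSE)

old_wd <- getwd()              # Step 3: check the current working directory
print(old_wd)

setwd(demo_dir)                # Step 4: point R at the folder that holds wine.csv
wine <- read.csv("wine.csv")   # Step 5: load the CSV into a data frame
str(wine)                      # number of observations and variables

setwd(old_wd)                  # restore the original working directory
```

With the real file, replace `demo_dir` with the folder you copied wine.csv into.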
Step 6 – Let's create a one-variable regression model
[Screenshot callouts: name of the model, dependent var., independent var., name of the data set]
Step 7 – Understand the output of the model
[Screenshot callouts:]
– Model used
– Error terms
– R²
– Adjusted R²: adjusts R² for the number of indep. var. relative to the no. of
data pts.
Step 8 – Calculate SSE for Model 1
As the name says, the sum of squared errors
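A minimal sketch of Steps 6 to 8, assuming the data frame has columns named `Price` and `AGST`. The data here is simulated so the block runs on its own; with the real file you would use the `wine` data frame loaded from wine.csv instead.

```r
# Simulated stand-in for the wine data (assumed column names, made-up values).
set.seed(1)
wine <- data.frame(AGST = runif(25, 15, 18))
wine$Price <- 0.5 * wine$AGST - 1.25 + rnorm(25, sd = 0.4)

# Step 6: name of the model <- lm(dependent ~ independent, data = data set)
model1 <- lm(Price ~ AGST, data = wine)

# Step 7: residuals, coefficients, Multiple R-squared and Adjusted R-squared
print(summary(model1))

# Step 8: SSE is the sum of the squared residuals (error terms)
sse1 <- sum(residuals(model1)^2)
print(sse1)
```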
Step 9 – Add another independent variable to the model
[Screenshot callout: the new variable]
Step 10 – Compare the two models → Which is better?
Step 11 – But let's calculate SSE for Model 2 as well…
To know which of the two models is better, compare their SSE
Step 12 – Let's go all in…
Step 13 – So which of the models is the best?
                        Model 1   Model 2               Model 3
Independent variables   AGST      AGST + Harvest Rain   AGST + Harvest Rain + Winter Rain + Age + FrancePop
R²                      0.43      0.70                  0.82
SSE                     5.73      2.97                  1.73
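The comparison in Steps 9 to 13 can be reproduced in outline as follows. The data is simulated and the column names (`HarvestRain`, `WinterRain`, and so on) are assumptions, so the numbers printed will not match the table above; the point is the pattern of adding variables and collecting R² and SSE per model.

```r
# Simulated stand-in for the wine data (assumed column names, made-up values).
set.seed(42)
n <- 25
wine <- data.frame(
  AGST        = runif(n, 15, 18),
  HarvestRain = runif(n, 40, 300),
  WinterRain  = runif(n, 400, 800),
  Age         = 1:n
)
wine$FrancePop <- 60000 - 500 * wine$Age + rnorm(n, sd = 200)
wine$Price <- -3 + 0.6 * wine$AGST - 0.004 * wine$HarvestRain +
  0.001 * wine$WinterRain + 0.02 * wine$Age + rnorm(n, sd = 0.3)

# Each model adds independent variables to the previous one.
model1 <- lm(Price ~ AGST, data = wine)
model2 <- lm(Price ~ AGST + HarvestRain, data = wine)
model3 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop, data = wine)

# Collect R-squared and SSE for each model in one table.
models <- list(model1 = model1, model2 = model2, model3 = model3)
comparison <- data.frame(
  R2  = sapply(models, function(m) summary(m)$r.squared),
  SSE = sapply(models, function(m) sum(residuals(m)^2))
)
print(round(comparison, 2))
```

On training data, adding variables can only lower SSE and raise R², which is why R² alone does not tell you whether the extra variables are worth keeping.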
Step 14 – What does the output tell us about the independent
variables
[Screenshot callouts on the coefficient table:]
– Estimate: the coefficient
– Std. Error: how much the coefficient is likely to vary from the estimated value
– t value: Estimate / Std. Error
– Pr(>|t|): how plausible it is to see such a t value if the coefficient were actually 0
• If a coefficient is close to 0 (not significantly different from 0), consider removing it → it means that the independent variable does
not change our prediction for the dependent variable
• The larger the abs. value of the t value, the more likely the coefficient is to be significant
• The closer the value of Pr(>|t|) is to 1, the less significant the independent variable
Step 15 – Let's improve the model by excluding FrancePop
(least significant var.)
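A sketch of how Steps 14 and 15 could be done in R, again on simulated data with assumed column names. It pulls out the coefficient table that `summary()` prints and then refits the model without `FrancePop` so the adjusted R² of the two fits can be compared.

```r
# Simulated stand-in for the wine data (assumed column names, made-up values).
set.seed(7)
n <- 25
wine <- data.frame(AGST = runif(n, 15, 18), Age = 1:n)
wine$FrancePop <- 60000 - 500 * wine$Age + rnorm(n, sd = 200)
wine$Price <- -3 + 0.6 * wine$AGST + 0.02 * wine$Age + rnorm(n, sd = 0.3)

full    <- lm(Price ~ AGST + Age + FrancePop, data = wine)
reduced <- lm(Price ~ AGST + Age, data = wine)   # FrancePop excluded

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(full)$coefficients
print(coefs)

# Adjusted R-squared before and after dropping the variable
print(c(full    = summary(full)$adj.r.squared,
        reduced = summary(reduced)$adj.r.squared))
```

Whether the adjusted R² goes up or down after dropping a variable depends on the data, so check the printed values rather than assuming a direction.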
Refresher – What is correlation
+1 → highly correlated
0 → no correlation
-1 → highly correlated
Step 16 – Identify multicollinearity, i.e. the situation when two
independent var. are highly correlated
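One way to spot multicollinearity is to look at pairwise correlations with `cor()`. The sketch below uses simulated data in which `FrancePop` is deliberately built from `Age`, so the two end up strongly (negatively) correlated; the column names and values are assumptions.

```r
# Simulated stand-in for the wine data (assumed column names, made-up values).
set.seed(3)
n <- 25
wine <- data.frame(AGST = runif(n, 15, 18), Age = 1:n)
wine$FrancePop <- 60000 - 500 * wine$Age + rnorm(n, sd = 200)

# Correlation between one pair of independent variables
age_pop <- cor(wine$Age, wine$FrancePop)
print(age_pop)

# Correlation between every pair of columns
cor_matrix <- cor(wine)
print(round(cor_matrix, 2))
```

A value near +1 or -1 between two independent variables is a hint that one of them may be redundant in the model.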
Are we there yet!
• Use our model to predict the price of wine
Before we do that
• The data we build our model on is called training data
• The data we test our model on is called test data
Step 17 – Load test data
Step 18 – Predict (finally)
Pretty close!
Step 19 – Last but not least, let's calculate R² to quantify
how good our prediction is…
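Steps 17 to 19 can be sketched as below. The training and test sets are simulated by a helper, `make_wine`, which is hypothetical and stands in for reading wine.csv and wine_test.csv. For the test-set R², this sketch takes the baseline to be the mean price of the training data, which is one common convention and an assumption here.

```r
# Hypothetical helper that simulates wine-like data (assumed column names).
set.seed(11)
make_wine <- function(n) {
  d <- data.frame(AGST = runif(n, 15, 18), HarvestRain = runif(n, 40, 300))
  d$Price <- -2 + 0.6 * d$AGST - 0.004 * d$HarvestRain + rnorm(n, sd = 0.3)
  d
}
wine      <- make_wine(25)   # training data (stand-in for wine.csv)
wine_test <- make_wine(5)    # test data (stand-in for wine_test.csv)

model <- lm(Price ~ AGST + HarvestRain, data = wine)

# Step 18: predict prices for rows the model has never seen
predicted <- predict(model, newdata = wine_test)
print(cbind(actual = wine_test$Price, predicted = predicted))

# Step 19: R-squared on the test set
sse <- sum((wine_test$Price - predicted)^2)
sst <- sum((wine_test$Price - mean(wine$Price))^2)   # baseline = training mean
test_r2 <- 1 - sse / sst
print(test_r2)
```

Unlike R² on the training data, this value is not guaranteed to be at least 0: a model that predicts worse than the baseline on new data gets a negative test-set R².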
Any intelligent fool can make things bigger, more complex, and more
violent. It takes a touch of genius -- and a lot of courage -- to move in
the opposite direction.
Things that have worked for me
• “See” the data before you model
– I use the Sublime Text editor
• Define the business question – WRITE it down
• Be friends with an ETL Jedi (data transformation expert)
• It's OK if it's simple!

Editor's notes

  1. The goal of a linear regression is to create a predictive line through the data. What does this mean? Create a line such that most of the points fall on it or close to it. Line 1 is a simple line. Can we have a better line?
  2. On the training data, SSE can never be larger than SST, because at worst the coefficient of the independent var. can be 0, which gives back the baseline. In other words, SSE ≤ SST.
  3. setwd("c:\\xxx\\xxx\\xxx\\xx\\x")
  4. What does this tell you? There are 2 observations of 7 variables in wine.
  5. Adjusted R² adjusts R² to account for the number of independent variables used relative to the no. of data points. Remember: Multiple R² will always increase if you add more independent var., but adjusted R² will decrease if you add an independent var. that doesn’t help the model. Hence it’s a good way to
  6. Why should we remove coefficients close to 0? Because it means that the indep. var. is not helping predict the dependent var.
  7. So the adjusted R² increased from the previous model after removing FrancePop, which is a good thing. Note that Age is now significant, but was not earlier. Why? Please embrace multicollinearity.
  8. What if you had removed both FrancePop and Age? Compute the value of R²: does it decrease?