3. 3
Let’s establish the baseline
• What does this represent?
I eat this for breakfast
There is always a first time
4. 4
Let’s establish the baseline
• How familiar are you with R?
Jedi Master
I publish R libraries
It’s a dance studio down the road
5. 5
Let’s get ready
• Please install RStudio (GUI) or R (shell)
– You can download RStudio from https://www.rstudio.com/ and R from https://cran.r-project.org/bin/windows/base/
• Copy the wine.csv & wine_test.csv from the thumb drive
– This data comes from Liquid Assets (www.liquidasset.com/winedata.html)
• This talk is inspired by The Analytics Edge coursework at edx.org
6. 6
Statistics refresher before the fun part
Independent & Dependent variable
One Variable Linear Regression
SSE
SST
R²
7. 7
What is
Dependent variable: the variable that you are trying to predict
Independent variable: a variable that you believe influences the dependent variable
8. 8
One Variable Linear Regression
• y = 0.5 × (avg. growing temp) – 1.25
How does price change based on temp?
Is this line perfect?
• Baseline model: y = 7
What is the price of wine when temp is 16 or 18 degrees?
[Figure: price (dependent var.) vs. avg. growing temp (independent var.), with lines 1 and 2]
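The question on this slide can be answered by plugging the temperatures into the two lines. A minimal sketch in R, using only the slope, intercept, and baseline stated on the slide; the function name predict_price is a hypothetical helper introduced for illustration.

```r
# Regression line from the slide: price = 0.5 * temp - 1.25
# (predict_price is a hypothetical helper name, not part of the talk)
predict_price <- function(temp) 0.5 * temp - 1.25

# Baseline model from the slide: always predict 7
baseline_price <- 7

temps <- c(16, 18)
line_prices <- predict_price(temps)

# Side-by-side comparison of the two models
print(data.frame(temp = temps, line = line_prices, baseline = baseline_price))
```

The line predicts 6.75 at 16 degrees and 7.75 at 18 degrees, while the baseline predicts 7 regardless of temperature.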
9. 9
• How do you calculate error?
The best model (line) should have minimal errors…
[Figure: price (dependent var.) vs. avg. growing temp (independent var.), with line 2]
• Error = actual value – predicted value
= 8 – 7 = 1
10. 10
One of the measures of quality of the model is the Sum of Squared Errors (SSE)
[Figure: price (dependent var.) vs. avg. growing temp (independent var.)]
• SSE = (e₁)² + (e₂)² + (e₃)² + (e₄)² + … + (eₙ)²
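The SSE formula can be sketched in a few lines of R. The actual prices below are made up for illustration (only the first one, 8, echoes the earlier error example), and the prediction of 7 is the baseline value used on the previous slides.

```r
# Hypothetical actual prices, invented for illustration
actual <- c(8, 6.5, 7.5, 6)

# Baseline prediction of 7 for every observation
predicted <- rep(7, length(actual))

# Error = actual value - predicted value
errors <- actual - predicted

# SSE = sum of the squared errors
sse <- sum(errors^2)
print(errors)
print(sse)
```

With these numbers the errors are 1, -0.5, 0.5 and -1, so SSE is 2.5. Squaring keeps positive and negative errors from cancelling out.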
11. 11
• Let’s call the SSE of the baseline model SST
• SST = 10.15
• SSE = 6.5
The smaller the errors vs. the baseline, the better the model
[Figure: price (dependent var.) vs. avg. growing temp (independent var.), with lines 1 and 2]
• R² = 1 – SSE / SST
= 1 – 6.5 / 10.15
≈ 0.36
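The arithmetic on this slide is easy to check in R. This sketch uses only the SSE and SST values stated above; the variable names are arbitrary.

```r
# Values stated on the slide
sse <- 6.5    # sum of squared errors of the regression line
sst <- 10.15  # sum of squared errors of the baseline

# R^2 = 1 - SSE / SST
r2 <- 1 - sse / sst
print(round(r2, 2))
```

With these inputs the expression evaluates to roughly 0.36.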
12. 12
But why is the value of R² always between 0 and 1?
• R² = 1 – SSE / SST
• 0 <= SSE <= SST. Why is this the case?
22. 22
Step 6 – Let’s create a one-variable regression model
Name of the model
Dependent var
Independent var
Name of the data set
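The four labels above map onto one call to lm(). A minimal sketch, assuming the data set has columns named Price and AGST; the data frame below is a made-up stand-in for wine.csv so the example runs on its own, and its values are not the real data.

```r
# Hypothetical stand-in for wine.csv (values invented for illustration)
wine <- data.frame(
  AGST  = c(15.0, 15.6, 16.1, 16.4, 16.9, 17.1, 17.4, 17.8),
  Price = c(6.2, 6.4, 7.1, 6.8, 7.4, 7.9, 7.6, 8.3)
)

# name of the model <- lm(dependent var ~ independent var, data = data set)
model1 <- lm(Price ~ AGST, data = wine)

# Intercept and slope of the fitted line
print(coef(model1))
```

With the real file, the data frame would instead come from something like read.csv("wine.csv").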
23. 23
Step 7 – Understand the output of the model
Error terms
Model used
R²
Adjusts R² for the number of indep. var. relative to the no. of data pts.
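The pieces labelled above can also be pulled out of the summary object individually. A sketch on made-up data (a hypothetical stand-in for wine.csv); the component names call, residuals, r.squared and adj.r.squared are those of R's summary.lm object.

```r
# Hypothetical stand-in for wine.csv (values invented for illustration)
wine <- data.frame(
  AGST  = c(15.0, 15.6, 16.1, 16.4, 16.9, 17.1, 17.4, 17.8),
  Price = c(6.2, 6.4, 7.1, 6.8, 7.4, 7.9, 7.6, 8.3)
)
model1 <- lm(Price ~ AGST, data = wine)
s <- summary(model1)

print(s$call)                 # model used
print(quantile(s$residuals))  # error terms (min, quartiles, max)
print(s$r.squared)            # multiple R^2
print(s$adj.r.squared)        # adjusted R^2

# Adjusted R^2 computed by hand: n data points, p independent variables
n <- nrow(wine)
p <- 1
adj_by_hand <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
```

The hand computation shows why adjusted R² is never larger than R²: the factor (n - 1) / (n - p - 1) is at least 1 and grows with every added variable.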
24. 24
Step 8 – Calculate SSE for Model1
As the name suggests, SSE is the sum of squared errors
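One way to compute SSE from a fitted model is to square and sum its residuals. A sketch on made-up data (a hypothetical stand-in for wine.csv), which also rebuilds R² from SSE and SST as a consistency check.

```r
# Hypothetical stand-in for wine.csv (values invented for illustration)
wine <- data.frame(
  AGST  = c(15.0, 15.6, 16.1, 16.4, 16.9, 17.1, 17.4, 17.8),
  Price = c(6.2, 6.4, 7.1, 6.8, 7.4, 7.9, 7.6, 8.3)
)
model1 <- lm(Price ~ AGST, data = wine)

# SSE: sum of the squared residuals (actual - fitted)
SSE <- sum(model1$residuals^2)

# SST: SSE of the baseline that always predicts the mean price
SST <- sum((wine$Price - mean(wine$Price))^2)

print(SSE)
print(1 - SSE / SST)  # should agree with summary(model1)$r.squared
```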
25. 25
Step 9 – Add another independent variable to the model
The new variable
26. 26
Step 10 – Compare the two models. Which is better?
27. 27
Step 11 – But let’s calculate SSE for Model 2 as well…
To know which of the two models is better, compare their SSE
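Steps 9 to 11 can be sketched together: add a second independent variable with + in the formula, then compare the SSE of the two models. The data frame is a made-up stand-in for wine.csv, HarvestRain is assumed to be the column name of the new variable, and sse is a hypothetical helper.

```r
# Hypothetical stand-in for wine.csv (values invented for illustration)
wine <- data.frame(
  AGST        = c(15.0, 15.6, 16.1, 16.4, 16.9, 17.1, 17.4, 17.8),
  HarvestRain = c(160, 80, 130, 110, 190, 120, 90, 70),
  Price       = c(6.2, 6.4, 7.1, 6.8, 7.4, 7.9, 7.6, 8.3)
)

model1 <- lm(Price ~ AGST, data = wine)                # one variable
model2 <- lm(Price ~ AGST + HarvestRain, data = wine)  # new variable added

# Hypothetical helper: sum of squared residuals of a fitted model
sse <- function(model) sum(residuals(model)^2)

print(c(model1 = sse(model1), model2 = sse(model2)))
print(c(model1 = summary(model1)$r.squared,
        model2 = summary(model2)$r.squared))
```

On the training data, the larger model can never have a higher SSE than the smaller one it contains, so a lower SSE alone does not prove the extra variable is useful.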
29. 29
Step 13 – So which of the models is the best?
Model     Independent variables                                  R²     SSE
Model 1   AGST                                                   0.43   5.73
Model 2   AGST + Harvest Rain                                    0.70   2.97
Model 3   AGST + Harvest Rain + Winter Rain + Age + FrancePop    0.82   1.73
30. 30
Step 14 – What does the output tell us about the independent variables?
• Estimate: the coefficient
• Std. Error: how much the coefficient is likely to vary from the estimated value
• t value: Estimate / Std. Error
• Pr(>|t|): how plausible it is that the coefficient is actually 0
• If a coefficient is close to 0 (not significantly different from 0), consider removing the variable. It means that the independent variable does not change our prediction for the dependent variable
• The larger the absolute value of the t value, the more likely the coefficient is to be significant
• The closer the value of Pr(>|t|) is to 1, the less significant the independent variable
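The coefficient table discussed here can be extracted as a matrix, which makes the relationship between its columns easy to check. A sketch on made-up data (a hypothetical stand-in for wine.csv); the column names are the ones R prints in summary() output.

```r
# Hypothetical stand-in for wine.csv (values invented for illustration)
wine <- data.frame(
  AGST        = c(15.0, 15.6, 16.1, 16.4, 16.9, 17.1, 17.4, 17.8),
  HarvestRain = c(160, 80, 130, 110, 190, 120, 90, 70),
  Price       = c(6.2, 6.4, 7.1, 6.8, 7.4, 7.9, 7.6, 8.3)
)
model2 <- lm(Price ~ AGST + HarvestRain, data = wine)

# Matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
tab <- coef(summary(model2))
print(tab)

# The t value is the estimate divided by its standard error
t_by_hand <- tab[, "Estimate"] / tab[, "Std. Error"]
print(t_by_hand)
```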
31. 31
Step 15 – Let’s improve the model by excluding FrancePop (least significant var.)
32. 32
Refresher – What is correlation?
+1: highly correlated (positive)
0: no correlation
-1: highly correlated (negative)
33. 33
Step 16 – Identify multicollinearity, i.e. a situation where two independent var. are highly correlated
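One common way to spot highly correlated independent variables is R's cor() function. A sketch on made-up data: the columns Age and FrancePop are named after the variables in the talk, but the values are invented so that the two move in opposite directions.

```r
# Hypothetical values, invented for illustration
wine <- data.frame(
  Age         = c(8, 7, 6, 5, 4, 3, 2, 1),
  FrancePop   = c(43000, 43400, 43900, 44300, 44800, 45200, 45700, 46100),
  HarvestRain = c(160, 80, 130, 110, 190, 120, 90, 70)
)

# Correlation between one pair of variables
age_pop <- cor(wine$Age, wine$FrancePop)
print(age_pop)

# Correlation matrix for all pairs at once
cor_matrix <- cor(wine)
print(round(cor_matrix, 2))
```

A value close to +1 or -1 between two independent variables is the warning sign; here Age and FrancePop are almost perfectly negatively correlated by construction.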
34. 34
Are we there yet!
Use our model to predict price of wine
35. 35
Before we do that
• The data we use to build our model is called training data
• The data we use to test our model is called test data
38. 38
Step 19 – Last but not least, let’s calculate R² to quantify how good our prediction is…
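Predicting on test data and scoring the result might look like the sketch below. Both data frames are made-up stand-ins for wine.csv and wine_test.csv, and using the training mean as the baseline for SST is an assumption about how the test R² is defined here.

```r
# Hypothetical training data (stand-in for wine.csv)
wine <- data.frame(
  AGST  = c(15.0, 15.6, 16.1, 16.4, 16.9, 17.1, 17.4, 17.8),
  Price = c(6.2, 6.4, 7.1, 6.8, 7.4, 7.9, 7.6, 8.3)
)
# Hypothetical test data (stand-in for wine_test.csv)
wine_test <- data.frame(AGST = c(16.0, 17.5), Price = c(6.9, 7.8))

model <- lm(Price ~ AGST, data = wine)

# Predict prices for rows the model has not seen
pred <- predict(model, newdata = wine_test)

# Test-set R^2; baseline is the mean price of the TRAINING data (assumption)
SSE <- sum((wine_test$Price - pred)^2)
SST <- sum((wine_test$Price - mean(wine$Price))^2)
R2_test <- 1 - SSE / SST
print(pred)
print(R2_test)
```

Unlike R² on the training data, the test-set value can fall below 0 when the model predicts worse than the baseline.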
39. 39
Any intelligent fool can make things bigger, more complex, and more
violent. It takes a touch of genius -- and a lot of courage -- to move in
the opposite direction.
40. 40
Things that have worked for me
• “See” the data before you model
– I use the Sublime Text editor
• Define the business question – WRITE it down
• Befriend an ETL Jedi (data transformation expert)
• It’s OK if it’s simple!
Editor’s notes
The goal of linear regression is to create a predictive line through the data. What does this mean? Create a line such that most of the points fall on it or close to it.
Simple line 1
Can we have a better line?
SSE can never be larger than SST on the training data, because at worst the coefficient of the independent var. can be 0, which reproduces the baseline. In other words, SSE <= SST.
setwd("c:\\xxx\\xxx\\xxx\\xx\\x")
What does this tell you?
There are 2 observations of 7 variables in wine.
Adjusted R² adjusts R² to account for the number of independent variables used relative to the no. of data points.
Remember, multiple R² will always increase if you add more independent var. But adjusted R² will decrease if you add an independent var. that doesn’t help the model. Hence it’s a good way to
Why should we remove variables whose coefficients are close to 0? Because it means that the indep. var. is not helping predict the dependent var.
So the adjusted R² increased from the previous model after removing FrancePop, which is a good thing.
Note that Age is now significant, but was not earlier. Why? Please embrace multicollinearity.
What if you had removed both FrancePop and Age? Compute the value of R². Does it decrease?