Predictive analytics v0.1 AS

Mirror mirror on the wall, help me
predict and know it all…
Ankur Sansanwal
https://www.linkedin.com/in/ankursansanwal

3
Let's establish the baseline
• What does this represent?
I eat this for breakfast
There is always a first time

4
Let's establish the baseline
• How familiar are you with R?
Jedi Master
I publish R libraries
It’s a dance studio down the road

5
Let's get ready
• Please install RStudio (GUI) or R (shell)
– You can download RStudio from https://www.rstudio.com/ and R from
https://cran.r-project.org/bin/windows/base/
• Copy wine.csv & wine_test.csv from the thumb drive
– This data comes from Liquid Assets
(www.liquidasset.com/winedata.html)
• This talk is inspired by the Analytics Edge coursework at edx.org

6
Statistics refresher before the fun part
• Independent & dependent variable
• One-variable linear regression
• SSE
• SST
• R²

7
What is
Dependent variable → Variable that you are trying to predict
Independent variable → Variables that you believe influence the dependent variable

8
One Variable Linear Regression
• y = 0.5 × (avg. growing temp) – 1.25
How does price change based on temp?
Is this line perfect?
• Baseline model → y = 7
What is the price of wine when temp is 16 or 18 degrees?
[Figure: Price (dependent var.) plotted against avg. growing temp (independent var.), with two lines labelled 1 and 2]
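The question on this slide can be worked out directly from the two equations. A minimal sketch, assuming the helper names `predict_price` and `baseline_price` (they are made up for this illustration and are not from the talk):

```r
# Regression line from the slide: price = 0.5 * temp - 1.25
predict_price <- function(temp) 0.5 * temp - 1.25

# Baseline model from the slide: always predicts 7, whatever the temperature
baseline_price <- function(temp) rep(7, length(temp))

temps     <- c(16, 18)
line_pred <- predict_price(temps)
base_pred <- baseline_price(temps)

print(data.frame(temp = temps, line = line_pred, baseline = base_pred))
```

The regression line gives a different price for each temperature, while the baseline answers 7 both times.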

9
The best model (line) should have minimal errors…
• How do you calculate error?
• Error = Actual value – Predicted value
= 8 – 7 = 1
[Figure: plot against avg. growing temp (independent var.)]

10
One of the measures of quality of the model is the Sum of
Squared Errors (SSE)
• $\mathrm{SSE} = e_1^2 + e_2^2 + e_3^2 + e_4^2 + \dots + e_n^2$
[Figure: plot against avg. growing temp (independent var.)]
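The error and SSE definitions translate almost word for word into R. A small sketch, assuming made-up prices in `actual` (only the first point, 8 vs. 7, comes from the slides) and the baseline prediction of 7:

```r
# Made-up prices for illustration only; just the first value (8) is from the slides
actual    <- c(8, 6.5, 7.2, 7.9)
# Baseline model: predicts 7 for every data point
predicted <- rep(7, length(actual))

# Error = actual value - predicted value, one error per data point
errors <- actual - predicted

# SSE = sum of the squared errors
sse <- sum(errors^2)

print(errors)
print(sse)
```

Squaring keeps positive and negative errors from cancelling each other out.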

11
The smaller the errors vs. the baseline, the better the model
• Let's call the SSE of the baseline model SST
• SST = 10.15
• SSE = 6.5
• $R^2 = 1 - \mathrm{SSE} / \mathrm{SST}$
= 1 – 6.5 / 10.15
≈ 0.36
[Figure: plot against avg. growing temp (independent var.), with two lines labelled 1 and 2]
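The same calculation as a tiny R function. The function name `r_squared` is an assumption for this sketch; the SSE and SST values are the ones quoted on the slide:

```r
# R-squared compares the model's errors (SSE) with the baseline's errors (SST)
r_squared <- function(sse, sst) 1 - sse / sst

# Values quoted on the slide
r2 <- r_squared(sse = 6.5, sst = 10.15)
print(round(r2, 2))

# Two edge cases worth trying
print(r_squared(sse = 0, sst = 10.15))      # a model with no errors at all
print(r_squared(sse = 10.15, sst = 10.15))  # a model no better than the baseline
```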

12
But why is the value of R² (on the data used to build the model) always between 0 & 1?
• $R^2 = 1 - \mathrm{SSE} / \mathrm{SST}$
• 0 ≤ SSE ≤ SST → Why is this the case?

13
Quiz
• What would be the R² of a perfect model?

14
Remember
• Good models for easy problems will have R² close to 1
• Good models for hard problems can still have R² close to 0

15
The real deal
• Build a multi-variable regression model in R

16
Step 1 – Have a look at the data (wine.csv)

17
Step 2 – Ask: what's the business question we are trying to
solve?
“Predict the price of wine?”

18
Quiz
• What are the independent & dependent variables in our data set?

19
Step 3 – Check your working directory

20
Step 4 – Your working directory should be set to the
location of the wine.csv file
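Steps 3 and 4 could look roughly like this in the R console. This is a sketch, not the talk's exact code: `data_dir` is a hypothetical placeholder path, and reading the file with `read.csv` is an assumption about what comes next.

```r
# Step 3: where is R currently looking for files?
print(getwd())

# Step 4: point R at the folder holding wine.csv.
# Hypothetical placeholder; replace with your own folder.
data_dir <- "path/to/folder/with/wine-data"
if (dir.exists(data_dir)) setwd(data_dir)

# Assumed next step: load the file and take a first look at it
if (file.exists("wine.csv")) {
  wine <- read.csv("wine.csv")
  str(wine)             # column names and types
  print(summary(wine))  # min / mean / max per column
} else {
  message("wine.csv not found in ", getwd())
}
```

In RStudio the same thing can be done from the Session menu (Set Working Directory).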

22
Step 6 – Let's create a one-variable regression model
Name of the model
Dependent var
Independent var
Name of the data set
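The four callouts map onto one `lm()` call. A minimal sketch, assuming the names `model1`, `wine` and `Price`; AGST appears later in the deck. The data here is simulated as a stand-in so the snippet runs without wine.csv; it is not the real wine data.

```r
set.seed(42)  # simulated stand-in for wine.csv; NOT the real data
wine <- data.frame(AGST = runif(25, min = 15, max = 18))
wine$Price <- 0.5 * wine$AGST - 1.25 + rnorm(25, sd = 0.4)

# name of the model <- lm(dependent var ~ independent var, data = name of the data set)
model1 <- lm(Price ~ AGST, data = wine)

# Intercept and slope of the fitted line
print(coef(model1))
```

With the real file loaded into `wine`, only the `lm()` line is needed.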

23
Step 7 – Understand the output of the model
Error terms
Model used
R²
Adjusted R² → adjusts R² for the number of indep. var. relative to the no. of
data pts.

24
Step 8 – Calculate SSE for Model1
As the name says: the sum of squared errors
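Steps 7 and 8 in one go: inspect the model summary, then compute SSE from the residuals. This is a sketch on simulated stand-in data (not the real wine.csv); the names `model1`, `wine` and `Price` are assumptions.

```r
set.seed(42)  # simulated stand-in for wine.csv; NOT the real data
wine <- data.frame(AGST = runif(25, min = 15, max = 18))
wine$Price <- 0.5 * wine$AGST - 1.25 + rnorm(25, sd = 0.4)
model1 <- lm(Price ~ AGST, data = wine)

# Step 7: the summary shows the model used, error terms (residuals),
# the coefficients, R-squared and adjusted R-squared
s <- summary(model1)
print(s)
print(s$r.squared)
print(s$adj.r.squared)

# Step 8: SSE = sum of the squared residuals (errors)
sse1 <- sum(residuals(model1)^2)
print(sse1)

# Cross-check: R-squared equals 1 - SSE / SST, with SST from the mean-price baseline
sst <- sum((wine$Price - mean(wine$Price))^2)
print(1 - sse1 / sst)
```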

25
Step 9 – Add another independent variable to the model
The new variable

26
Step 10 – Compare the two models → Which is better?

27
Step 11 – But let's calculate SSE for Model 2 as well…
To know which of the two models is better, compare their SSE
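Steps 9 to 11 side by side, as a sketch on simulated stand-in data. The column names `Price` and `HarvestRain` are assumptions, and the coefficients used to simulate prices are arbitrary.

```r
set.seed(42)  # simulated stand-in for wine.csv; NOT the real data
n <- 25
wine <- data.frame(AGST = runif(n, 15, 18), HarvestRain = runif(n, 40, 300))
wine$Price <- 0.5 * wine$AGST - 0.004 * wine$HarvestRain + rnorm(n, sd = 0.3)

model1 <- lm(Price ~ AGST, data = wine)
model2 <- lm(Price ~ AGST + HarvestRain, data = wine)  # the new variable

sse1 <- sum(residuals(model1)^2)
sse2 <- sum(residuals(model2)^2)

print(c(SSE_model1 = sse1, SSE_model2 = sse2))
print(c(R2_model1 = summary(model1)$r.squared,
        R2_model2 = summary(model2)$r.squared))
```

On the data the model was built on, adding a variable never makes SSE larger, so a lower SSE alone does not prove the extra variable is worth keeping.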

28
Step 12 – Let's go all in…

29
Step 13 – So which of the models is the best?
                        Model 1   Model 2               Model 3
Independent variables   AGST      AGST + Harvest Rain   AGST + Harvest Rain + Winter Rain + Age + FrancePop
R²                      0.43      0.70                  0.82
SSE                     5.73      2.97                  1.73
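A comparison table like this one can be built in a few lines. The sketch below runs on simulated stand-in data, so its numbers will not match the table; the column names (`Price`, `HarvestRain`, `WinterRain`) and the simulation coefficients are assumptions.

```r
set.seed(42)  # simulated stand-in for wine.csv; NOT the real data
n <- 25
wine <- data.frame(
  AGST = runif(n, 15, 18), HarvestRain = runif(n, 40, 300),
  WinterRain = runif(n, 400, 800), Age = n:1
)
wine$FrancePop <- 55000 - 400 * wine$Age + rnorm(n, sd = 200)
wine$Price <- 0.5 * wine$AGST - 0.004 * wine$HarvestRain +
  0.001 * wine$WinterRain + 0.02 * wine$Age + rnorm(n, sd = 0.3)

# One formula per model, from simplest to "all in"
formulas <- list(
  "Model 1" = Price ~ AGST,
  "Model 2" = Price ~ AGST + HarvestRain,
  "Model 3" = Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop
)

# Fit each model and collect R-squared and SSE (one row per model)
comparison <- t(sapply(formulas, function(f) {
  m <- lm(f, data = wine)
  c(R2 = summary(m)$r.squared, SSE = sum(residuals(m)^2))
}))
print(round(comparison, 2))
```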

30
Step 14 – What does the output tell us about the independent
variables?
Coefficient
• If a coefficient is not significantly different from 0, consider removing the variable → It means that the independent variable does
not change our prediction for the dependent variable
• The larger the abs. value of the t value, the more likely the coefficient is to be significant
• The closer the value of Pr(>|t|) is to 1, the less significant the independent variable
Std. Error → how much the coef. is likely to vary from the est. value
t value → Estimate / Std. Error
Pr(>|t|) → probability of seeing a t value at least this extreme if the coefficient were actually 0

31
Step 15 – Let's improve the model by excluding FrancePop
(the least significant var.)
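The coefficient table from Step 14 and the refit from Step 15 can be sketched as follows. The data is simulated as a stand-in (not the real wine.csv), so which variable looks least significant here may differ from the talk; the names `model3`, `model4` and the column names are assumptions.

```r
set.seed(42)  # simulated stand-in for wine.csv; NOT the real data
n <- 25
wine <- data.frame(
  AGST = runif(n, 15, 18), HarvestRain = runif(n, 40, 300),
  WinterRain = runif(n, 400, 800), Age = n:1
)
wine$FrancePop <- 55000 - 400 * wine$Age + rnorm(n, sd = 200)
wine$Price <- 0.5 * wine$AGST - 0.004 * wine$HarvestRain +
  0.001 * wine$WinterRain + 0.02 * wine$Age + rnorm(n, sd = 0.3)

model3 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop, data = wine)

# Step 14: one row per variable; columns are
# Estimate, Std. Error, t value and Pr(>|t|)
coefs <- coef(summary(model3))
print(round(coefs, 4))

# Step 15: refit without FrancePop and compare adjusted R-squared
model4 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
print(c(adjR2_with    = summary(model3)$adj.r.squared,
        adjR2_without = summary(model4)$adj.r.squared))
```

Remove one variable at a time and refit: significance of the remaining variables can change once a correlated variable is gone.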

32
Refresher – What is correlation?
+1 → highly correlated (positive)
0 → no correlation
-1 → highly correlated (negative)

33
Step 16 – Identify multicollinearity, i.e. the situation when two
independent var. are highly correlated
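Correlations can be checked with `cor()`. In this sketch the stand-in data is simulated so that FrancePop closely tracks Age; that relationship is an assumption made for illustration and is not taken from the real file.

```r
set.seed(42)  # simulated stand-in for wine.csv; NOT the real data
n <- 25
wine <- data.frame(AGST = runif(n, 15, 18), Age = n:1)
# Built on purpose to move almost in lockstep with Age
wine$FrancePop <- 55000 - 400 * wine$Age + rnorm(n, sd = 200)

# Correlation between two variables: a value between -1 and +1
age_pop_cor <- cor(wine$Age, wine$FrancePop)
print(age_pop_cor)

# Correlation between every pair of columns at once
cor_matrix <- cor(wine)
print(round(cor_matrix, 2))
```

A pair with correlation near +1 or -1 is a candidate for multicollinearity; keeping only one of the two variables is a common remedy.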

34
Are we there yet!
• Use our model to predict price of wine

35
Before we do that
• Data we build our model on is called training data
• Data we test our model on is called test data

37
Step 18 – Predict (finally)
Pretty close!

38
Step 19 – Last but not least, let's calculate R² to quantify
how good our prediction is…
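Steps 18 and 19 could be sketched as below. Both data sets are simulated stand-ins for wine.csv and wine_test.csv (not the real files); `make_wine` is a hypothetical helper, and the choice of variables in `model` is an assumption.

```r
set.seed(42)

# Hypothetical helper that simulates stand-in data; NOT the real wine data
make_wine <- function(n) {
  d <- data.frame(AGST = runif(n, 15, 18), HarvestRain = runif(n, 40, 300))
  d$Price <- 0.5 * d$AGST - 0.004 * d$HarvestRain + rnorm(n, sd = 0.3)
  d
}
wine      <- make_wine(25)  # training data
wine_test <- make_wine(5)   # test data

model <- lm(Price ~ AGST + HarvestRain, data = wine)

# Step 18: predict prices for data the model has never seen
pred <- predict(model, newdata = wine_test)
print(cbind(actual = wine_test$Price, predicted = pred))

# Step 19: test-set R-squared; the baseline is the mean price of the TRAINING data
sse_test <- sum((wine_test$Price - pred)^2)
sst_test <- sum((wine_test$Price - mean(wine$Price))^2)
r2_test  <- 1 - sse_test / sst_test
print(r2_test)
```

Unlike R² on the training data, the test-set value can be negative if the model predicts worse than the baseline.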

39
Any intelligent fool can make things bigger, more complex, and more
violent. It takes a touch of genius -- and a lot of courage -- to move in
the opposite direction.

40
Things that have worked for me
• “See” the data before you model
– I use the Sublime Text editor
• Define the business question – WRITE it down
• Befriend an ETL Jedi (data transformation expert)
• It's OK if it's simple!
