The document summarizes the author's approach to a Kaggle competition, starting with data preparation. Key steps included identifying outliers using the median absolute deviation (MAD), using a Kolmogorov-Smirnov test to identify differently distributed subsets of the training data, and removing data that did not need to be modeled. The author then created various features from the data and experimented with different models before combining them into an ensemble. The ensemble combined predictions from random forests, gradient boosted models, support vector machines, and generalized linear models through a stacking regression. This process resulted in a final ranking of 49th out of 532 in the competition.
23. What happened here?
No need to model? Just assume all Chicargo data is 0?
The Chicargo data generated by remote_API is mostly 0s, so there is no need to model it.
24. Separate Outliers using Median Absolute Deviation (MAD)
MAD is robust and can handle skewed data, which helps to identify outliers. We
separated out data points that are more than 3 MADs from the median.
Code can be found at:
baseFunctions_cleanData.R
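The MAD rule above can be sketched in R as follows. This is a minimal illustration, not the competition code; the function name `flag_outliers` and the cutoff of 3 simply follow the slide's description.

```r
# Flag points more than 3 (scaled) MADs from the median.
# Base R's mad() scales by 1.4826 by default, making it
# comparable to a standard deviation for normal data.
flag_outliers <- function(x, cutoff = 3) {
  m <- median(x, na.rm = TRUE)
  deviation <- abs(x - m) / mad(x, na.rm = TRUE)
  deviation > cutoff
}

# Example on a skewed vector with one extreme value
set.seed(1)
x <- c(rexp(100), 50)
which(flag_outliers(x))  # the extreme value is flagged (possibly along with other tail points)
```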
26. [Diagram: breakdown of the training data]
❖ 59% of the data is Chicargo data generated by remote_API, mostly 0s: no need to model, just estimate using the median.
❖ 27% of the training data is of a different distribution (KS test): ignore.
❖ 4% of the data is identified as outliers by MAD.
❖ 10% of the training data is used for modeling.
Key Advantage: Rapid prototyping!
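The KS check can be sketched in R with the built-in two-sample test. The samples below are synthetic stand-ins (not the competition data) for one subset of training data versus the rest.

```r
set.seed(1)
# Two synthetic samples standing in for one subset of the
# training data vs. the rest of it
a <- rexp(200, rate = 1)   # mostly small values
b <- rnorm(200, mean = 5)  # a clearly different distribution

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests
# the two subsets come from different distributions
ks.test(a, b)
```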
27. When you can focus on a small but representative subset of the data, you
can run many, many experiments very quickly (I ran several hundred)
28. Now we have the raw ingredients prepared,
it is time to make the dishes
29. Experiment with Different Models
❖ Random Forest
❖ Generalized Boosted Regression Models (GBM)
❖ Support Vector Machines (SVM)
❖ Bootstrap aggregated (bagged) linear models
How to use? Ask Google & RTFM
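Trying each of these model families in R might look like the sketch below. The package choices, the data frame `trainData` with a response column `y`, and all parameters are illustrative assumptions, not the competition settings.

```r
library(randomForest)  # Random Forest
library(gbm)           # Generalized Boosted Regression Models
library(e1071)         # Support Vector Machines

# Assumes trainData is a data frame with a numeric response column y
fit_rf  <- randomForest(y ~ ., data = trainData)
fit_gbm <- gbm(y ~ ., data = trainData,
               distribution = "gaussian", n.trees = 500)
fit_svm <- svm(y ~ ., data = trainData)

# Bagged linear models: fit lm() on bootstrap resamples of the
# training data and average the predictions
bagged_lm_predict <- function(data, newdata, bags = 25) {
  preds <- replicate(bags, {
    boot <- data[sample(nrow(data), replace = TRUE), ]
    predict(lm(y ~ ., data = boot), newdata)
  })
  rowMeans(preds)
}
```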
33. Instead, the magic happens when we combine data and
when we create new data - aka feature creation
34. Creating Simple Features: City
trainData$city[trainData$longitude=="-77"]  <- "richmond"
trainData$city[trainData$longitude=="-72"]  <- "new_haven"
trainData$city[trainData$longitude=="-87"]  <- "chicargo"
trainData$city[trainData$longitude=="-122"] <- "oakland"
Code can be found at:
1. dataExplore_map.R
36. Creating Complex Features: Predicted View
The task is to predict views, votes, and
comments, but logically, won’t the
number of votes and comments be
correlated with the number of views?
Code can be found at:
baseFunctions_model.R
37. Creating Complex Features: Predicted View
Store the predicted value of views as a new column
and use it as a new feature to predict votes & comments…
very risky business, but powerful if you know what you are doing
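One way to sketch this in R, with the leakage risk the slide warns about reduced by generating the predictions out-of-fold: each row's predicted view count comes from a model that never saw that row. The column names (`num_views`, `latitude`, `longitude`) and the simple `lm()` are illustrative assumptions, not the author's actual pipeline.

```r
set.seed(1)
k <- 5
# Assign each row of trainData to one of k folds
folds <- sample(rep(1:k, length.out = nrow(trainData)))

trainData$predicted_view <- NA
for (i in 1:k) {
  # Fit on all folds except i, predict on fold i
  fit <- lm(num_views ~ latitude + longitude,
            data = trainData[folds != i, ])
  trainData$predicted_view[folds == i] <-
    predict(fit, trainData[folds == i, ])
}
# predicted_view can now be used as an input feature
# when modeling votes and comments
```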
43. Full List of Features Used
Code can be found at:
baseFunctions_model.R
+ Num Views as Y variable
+ Num Comments as Y variable
+ Num Votes as Y variable
Fed into the models to predict
views, votes, and comments respectively
44. Only 1 original feature was used; I created the other 13 features
Code can be found at:
baseFunctions_model.R
Fed into the models to predict
views, votes, and comments respectively
Original Feature: 1; Created Features: 13
45. An ensemble of good enough models can be surprisingly strong
48. Each model is good for a different scenario
❖ GBM is rock solid, good for all scenarios.
❖ SVM is a counterweight; don’t trust anything it says.
❖ GLM is amazing for predicting comments, not so much for the others.
❖ Random Forest is moderate and provides a balanced view.
49. Ensemble (Stacking using regression)
testDataAns  rfAns  gbmAns  svmAns  glmBagAns
2.3          2.0    2.5     2.4     1.8
2.0          1.8    2.2     1.7     1.6
1.3          1.3    1.7     1.2     1.0
1.5          1.4    1.9     1.6     1.2
…            …      …       …      …
glm(testDataAns~rfAns+gbmAns+svmAns+glmBagAns)
We are interested in the coefficients
50. Ensemble (Stacking using regression)
❖ Sort and column-bind the predictions from the 4 models
❖ Run a regression (linear or logistic) and obtain the coefficients
❖ Scale the ensemble ratios back so they sum to 1 (100%)
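These steps can be sketched in R as follows. The data frame `preds` (holding the true answers alongside each model's predictions, using the column names from the table above) is an assumption for illustration, not the author's `getEnsembleRatio.r` code.

```r
# Stacking step: regress the true answers on each model's
# predictions (gaussian glm, i.e. linear regression)
stack <- glm(testDataAns ~ rfAns + gbmAns + svmAns + glmBagAns,
             data = preds)

w <- coef(stack)[-1]          # drop the intercept
ensembleRatio <- w / sum(w)   # scale the ratios back to 1 (100%)

# Final ensemble prediction on new data is then the
# ratio-weighted sum of the four models' predictions, e.g.:
# final <- newPreds$rfAns  * ensembleRatio["rfAns"] +
#          newPreds$gbmAns * ensembleRatio["gbmAns"] + ...
```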
51. Obtaining the ensemble ratio for each model
Code can be found at: getEnsembleRatio.r
(inside the 3. testMod_generateEnsembleRatio folder)
52. Ensemble is not perfect…
❖ Simple to implement? Kind of. But very tedious to
update: you will need to rerun every single model every time
you make any changes to the data (as the ensemble
ratios may change).
❖ Easy to overfit test data (will require another set of
validation data or cross validation).
❖ Very hard to explain to business users what is going on.
54. [Recap diagram: breakdown of the training data]
❖ 59% of the data is Chicargo data generated by remote_API, mostly 0s: no need to model, just estimate using the median.
❖ KS test: some data is too different from the rest and is ignored.
❖ 4% of the data is identified as outliers by MAD.
❖ 10% of the training data is used for modeling.
Key Advantage: Rapid prototyping!