Demographics and Weblog Targeting
1. Demographics and Weblog
Hackathon – Case Study
5.3% of Motley Fool visitors are subscribers.
Design a classification model for insight into
which variables are important for strategies to
increase the subscription rate
Learn by Doing
copyright All Rights Reserved Doug Chang
dougc at stanfordalumni dot org
3. Data Mining Hackathon
4. Funded by Rapleaf
• With Motley Fool’s data
• App note for Rapleaf/Motley Fool
• Template for other hackathons
• Did not use AWS; ran R on individual PCs
• Logistics: Rapleaf funded prizes and food for 2
weekends for ~20–50 people. The venue was free
5. Getting more subscribers
6. Headline Data, Weblog
8. Cleaning Data
• training.csv (201,000), headlines.tsv (811 MB), entry.tsv (100k), demographics.tsv
• Feature Engineering
• Github:
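As a sketch of the cleaning step: the demographics are sparse (a point made later in the deck), so they are best left-joined onto the training labels so no training rows are lost. The talk used R on individual PCs; this pandas version uses tiny synthetic stand-ins, and the column names (`user_id`, `subscriber`, `age`) are invented for illustration only.

```python
import pandas as pd

# Tiny synthetic stand-ins for training.csv and demographics.tsv;
# only the filenames come from the slide -- columns are hypothetical
training = pd.DataFrame({"user_id": [1, 2, 3], "subscriber": [0, 1, 0]})
demographics = pd.DataFrame({"user_id": [1, 3], "age": [34, 52]})

# Left join keeps every training row even when demographics are missing
merged = training.merge(demographics, on="user_id", how="left")
print(merged["age"].isna().sum())  # rows with no demographic match
```

The NaN count after the join is a quick measure of how sparse the demographic coverage actually is.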
9. Ensemble Methods
• Bagging, boosting, random forests
• Overfitting
• Stability (small changes in the data can cause large changes in predictions)
• Previously, none of these worked at scale
• Small-scale results are available in R; large-scale versions exist only in proprietary implementations (Google, Amazon, etc.)
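The three ensemble families above can be compared head-to-head at small scale. The hackathon used R; this scikit-learn sketch on synthetic data is only an assumed equivalent, not the talk's actual code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# Synthetic binary problem standing in for the subscriber data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "bagging": BaggingClassifier(random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}
# Cross-validated ROC AUC, the same metric the deck uses on later slides
aucs = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
        for name, m in models.items()}
for name, auc in aucs.items():
    print(f"{name}: {auc:.3f}")
```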
10. ROC Curves
Binary Classifier Only!
11. Paid Subscriber ROC curve, ~61%
12. Boosted Regression Trees Performance
• Training-data ROC score = 0.745
• CV ROC score = 0.737; SE = 0.002
• 5.5% below the winning score, without any data preprocessing
• Random is 50%, i.e. 0.50; at 0.737 we are 0.737 − 0.50 = 0.237, or 23.7 points better than random
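The better-than-random arithmetic is just the AUC minus the 0.50 baseline of a random classifier. A minimal scikit-learn sketch with made-up labels and scores (only the baseline arithmetic mirrors the slide):

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores, purely illustrative
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]

auc = roc_auc_score(y_true, y_score)
margin = auc - 0.50  # distance above a random classifier (AUC = 0.50)
print(round(auc, 3), round(margin, 3))
```

On the deck's numbers the same subtraction gives 0.737 − 0.50 = 0.237.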
13. Contribution of predictor variables
14. Predictive Importance
• Friedman: the number of times a variable is selected for splitting, weighted by the squared-error improvement to the model; a measure of sparsity in the data
• Fit plots remove the average effects of the other model variables
• 1 pageV 74.0567852
• 2 loc 11.0801383
• 3 income 4.1565597
• 4 age 3.1426519
• 5 residlen 3.0813927
• 6 home 2.3308287
• 7 marital 0.6560258
• 8 sex 0.6476549
• 9 prop 0.3817017
• 10 child 0.2632598
• 11 own 0.2030012
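The ranking above (pageV dominating with ~74 of the importance) uses Friedman's split-based measure. The talk used R's boosted regression trees; the same measure is exposed as `feature_importances_` in scikit-learn. This sketch uses synthetic data where the first feature is constructed to dominate, and borrows the slide's variable names purely as labels:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
# First column carries almost all the signal, mimicking pageV's dominance
y = (X[:, 0] + 0.1 * X[:, 1] > 0).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)
for name, imp in zip(["pageV", "loc", "income"], model.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

The importances are normalized to sum to 1, so each value reads directly as a share of the model's total split improvement.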
15. Behavioral vs. Demographics
• Demographics are sparse
• Behavioral weblogs are the best source; most sites aren't using this information correctly. There is no single correct answer: trial and error on the features. The features are more important than the algorithm
• Linear vs. nonlinear
16. Fitted Values (Crappy)
17. Fitted Values Better
18. Predictor Variable Interaction
• Adjusting variable interactions
19. Variable Interactions
20. Plot Interactions age, loc
21. Trees vs. other methods
• Multiple levels are visible in the data, which suits trees. Do other variables match this pattern? Simplify the model or add more features; iterate to a better model
• No math required; an analyst can do this
22. Number of Trees
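The number of trees is one of the three BRT tuning parameters named in the conclusion. The talk tuned it visually; a programmatic sketch (scikit-learn, an assumed stand-in for R's gbm tooling) scores a held-out set after each boosting stage and picks the best stage:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                   random_state=0).fit(X_tr, y_tr)

# Held-out AUC after each boosting stage; the curve flattens (or dips)
# once extra trees stop helping
aucs = [roc_auc_score(y_te, p[:, 1]) for p in model.staged_predict_proba(X_te)]
best_n = int(np.argmax(aucs)) + 1
print(best_n, round(max(aucs), 3))
```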
23. Data Set Number of Trees
24. Hackathon Results
25. Weblogs only: 68.15%, 18% better than random
26. Demographics add 1%
27. AWS Advantages
• Running multiple instances with different
algorithms and parameters using R
• Add tutorial, install Screen, R GUI bugs
• http://amazonlabs.pbworks.com/w/page/28036646/FrontPage
28. Conclusion
• Data mining at scale requires more development in visualization, MR algorithms, and MR data preprocessing
• Tuning uses visualization. We tune 3 parameters: tc (tree complexity), lr (learning rate), and the number of trees; 2 of the 3 weren't covered here
• This isn't reproducible in Hadoop/Mahout or any open-source code I know of
• Other use cases: predicting which item will sell (eBay), search-engine ranking
• Be careful with MR paradigms: Hadoop MR != Couchbase MR