Demographics and Weblog
 Hackathon – Case Study
 5.3% of Motley Fool visitors are subscribers.
 Design a classification model for insight into
which variables are important for strategies to
        increase the subscription rate
                Learn by Doing


             copyright All Rights Reserved Doug Chang
                 dougc at stanfordalumni dot org
http://www.meetup.com/HandsOnProgrammingEvents/




Data Mining Hackathon




Funded by Rapleaf
•   With Motley Fool’s data
•   App note for Rapleaf/Motley Fool
•   Template for other hackathons
•   Did not use AWS; ran R on individual PCs
•   Logistics: Rapleaf funded prizes and food for 2
    weekends for ~20-50 people. Venue was free



Getting more subscribers




Headline Data, Weblog




Demographics




Cleaning Data
• training.csv (201,000), headlines.tsv (811MB),
  entry.tsv (100k), demographics.tsv
• Feature Engineering
• Github:




Ensemble Methods
• Bagging, Boosting, randomForests
• Overfitting
• Stability (small changes in the data can cause large
  changes in predictions)
• Previously, none of these worked at scale
• Small-scale results use R; large-scale versions exist in
  proprietary implementations (Google, Amazon, etc.)

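The bagging and stability bullets above can be illustrated with a minimal sketch. This is not the hackathon code (the deck says R was used); it is a pure-Python toy that assumes a single numeric feature and a binary label. The helper names (`fit_stump`, `fit_bagged`, `predict_bagged`) and the sample data are hypothetical. Each stump is trained on a bootstrap resample, and the ensemble predicts by majority vote, which is what damps the effect of small changes in the training data.

```python
import random


def predict_stump(stump, x):
    """Predict 0/1 from a (threshold, polarity) pair."""
    threshold, polarity = stump
    # polarity 1: predict 1 when x > threshold; polarity -1: the reverse.
    return 1 if (x > threshold) == (polarity == 1) else 0


def fit_stump(xs, ys):
    """Pick the threshold and polarity with the fewest training errors."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            errors = sum(
                predict_stump((threshold, polarity), x) != y
                for x, y in zip(xs, ys)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, polarity)
    return best[1], best[2]


def fit_bagged(xs, ys, n_models=25, seed=0):
    """Train one stump per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict_bagged(models, x):
    """Majority vote over all stumps."""
    votes = sum(predict_stump(m, x) for m in models)
    return 1 if 2 * votes > len(models) else 0


# Hypothetical toy data: the label flips from 0 to 1 between x = 4 and x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
models = fit_bagged(xs, ys)
```

Individual stumps can disagree near the boundary because each sees a different resample; the vote is what is reported.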
ROC Curves

• Binary classifier only!




Paid Subscriber ROC curve, ~61%




Boosted Regression Trees Performance
• Training data ROC score = 0.745
• CV ROC score = 0.737; se = 0.002
• 5.5% below the winning score,
  without doing any data processing
• Random is 50%, or 0.50. At 0.737 we are 0.737 - 0.50 = 0.237
  better than random, i.e. 23.7 percentage points
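As a rough illustration of what the ROC score measures, here is a minimal sketch in Python (not the R code used at the hackathon). It computes the area under the ROC curve as the probability that a randomly chosen positive is scored above a randomly chosen negative, counting ties as one half, and then the gap over the 0.50 random baseline. The function names and the toy labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def gap_over_random(auc):
    """Difference between an ROC score and the 0.50 random baseline."""
    return auc - 0.5


# Hypothetical toy example: 3 subscribers (1) and 3 non-subscribers (0).
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.4]
toy_auc = roc_auc(labels, scores)

# The cross-validated score reported on the slide.
cv_gap = gap_over_random(0.737)
```

The pairwise form is quadratic in the number of examples, so it is only meant for reading, not for a 201,000-row training file.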



Contribution of predictor variables




Predictive Importance
• Friedman's relative influence: the number of times a variable is selected for
  splitting, weighted by the squared improvement to the model from each split.
  Also a measure of sparsity in the data
• Fit plots remove averages of model variables
• Relative influence by rank (values sum to 100):
     rank  variable  relative influence
      1    pageV     74.0567852
      2    loc       11.0801383
      3    income     4.1565597
      4    age        3.1426519
      5    residlen   3.0813927
      6    home       2.3308287
      7    marital    0.6560258
      8    sex        0.6476549
      9    prop       0.3817017
     10    child      0.2632598
     11    own        0.2030012
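A short Python sketch of how such a ranking can be read. The eleven values are copied from the slide; the `normalize` helper and its raw inputs are hypothetical and only show the scaling step (each variable's share of total improvement, times 100), not how the improvements themselves are accumulated inside the boosted trees.

```python
def normalize(raw_improvements):
    """Scale raw per-variable improvements so they sum to 100."""
    total = sum(raw_improvements.values())
    return {name: 100.0 * value / total for name, value in raw_improvements.items()}


# Relative influence values as listed on the slide.
relative_influence = {
    "pageV": 74.0567852,
    "loc": 11.0801383,
    "income": 4.1565597,
    "age": 3.1426519,
    "residlen": 3.0813927,
    "home": 2.3308287,
    "marital": 0.6560258,
    "sex": 0.6476549,
    "prop": 0.3817017,
    "child": 0.2632598,
    "own": 0.2030012,
}

total = sum(relative_influence.values())

# Rank variables and accumulate their share of the total.
ranked = sorted(relative_influence.items(), key=lambda item: -item[1])
cumulative = []
running = 0.0
for name, value in ranked:
    running += value
    cumulative.append((name, running))

top_two_share = cumulative[1][1]  # pageV + loc
```

Read this way, the top two variables (pageV and loc) carry roughly 85 of the 100 points, and the bottom five together carry about 2.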


Behavioral vs. Demographics
• Demographics are sparse
• Behavioral weblogs are the best source. Most
  sites aren’t using this information correctly.
  There is no single correct answer: use trial and
  error on features. The features are more
  important than the algorithm
• Linear vs. Nonlinear


Fitted Values (Crappy)




Fitted Values Better




Predictor Variable Interaction
• Adjusting variable
  interactions




Variable Interactions




Plot Interactions age, loc




Trees vs. other methods
• The plots show multiple levels, which is good for trees. Do
  other variables match this? Simplify the model or
  add more features. Iterate to a better model
• No math required; an analyst can do this




Number of Trees




Data Set Number of Trees




Hackathon Results




Weblogs only: 68.15%, about 18 percentage points better than random




Demographics add 1%




AWS Advantages
• Running multiple instances with different
  algorithms and parameters using R
• Add tutorial, install Screen, R GUI bugs
• http://amazonlabs.pbworks.com/w/page/28036646/FrontPage
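One way to picture "multiple instances with different parameters" is a parameter grid split across machines. The sketch below is in Python rather than R and makes several assumptions: the grid values are invented, the key names `tc`, `lr`, and `n_trees` simply mirror the three tuning parameters named in this deck, and the round-robin split stands in for whatever job assignment is actually used. No AWS API is called.

```python
from itertools import product


def build_grid(tree_complexities, learning_rates, tree_counts):
    """Every combination of the three tuning parameters, one dict per run."""
    return [
        {"tc": tc, "lr": lr, "n_trees": n}
        for tc, lr, n in product(tree_complexities, learning_rates, tree_counts)
    ]


def split_round_robin(jobs, n_workers):
    """Deal jobs out to workers like cards, so batch sizes differ by at most one."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return [jobs[i::n_workers] for i in range(n_workers)]


# Hypothetical grid: 3 x 2 x 2 = 12 runs spread over 4 instances.
grid = build_grid([1, 3, 5], [0.01, 0.005], [1000, 2000])
batches = split_round_robin(grid, 4)
```

Each batch could then be written to a file and handed to one instance, which fits each parameter set independently of the others.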




Conclusion
• Data mining at scale requires more development
  in visualization, MapReduce (MR) algorithms, and MR data
  preprocessing.
• Tuning uses visualization. Tune 3 parameters: tc,
  lr, and #trees. Didn’t cover 2 of the 3.
• This isn’t reproducible in Hadoop/Mahout or any
  open source code I know of
• Other use cases, e.g. predicting which item will
  sell (eBay), search engine ranking.
• Careful with MR paradigms: Hadoop MR !=
  Couchbase MR