Demographics and Weblog Targeting
1. Demographics and Weblog
Hackathon – Case Study
5.3% of Motley Fool visitors are subscribers.
Design a classification model for insight into
which variables are important for strategies to
increase the subscription rate
Learn by Doing
copyright All Rights Reserved Doug Chang
dougc at stanfordalumni dot org
3. Data Mining Hackathon
4. Funded by Rapleaf
• With Motley Fool’s data
• App note for Rapleaf/Motley Fool
• Template for other hackathons
• Did not use AWS; ran R on individual PCs
• Logistics: Rapleaf funded prizes and food for 2
weekends for ~20–50 people. The venue was free
5. Getting more subscribers
6. Headline Data, Weblog
8. Cleaning Data
• training.csv (201,000), headlines.tsv (811 MB), entry.tsv (100k), demographics.tsv
• Feature Engineering
• Github:
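As a sketch of the cleaning step: the demographics are sparse (a point made later in the deck), so they are best left-joined onto the training labels so no training rows are lost. The talk used R on individual PCs; this pandas version uses tiny synthetic stand-ins, and the column names (`user_id`, `subscriber`, `age`) are invented for illustration only.

```python
import pandas as pd

# Tiny synthetic stand-ins for training.csv and demographics.tsv;
# only the filenames come from the slide -- columns are hypothetical
training = pd.DataFrame({"user_id": [1, 2, 3], "subscriber": [0, 1, 0]})
demographics = pd.DataFrame({"user_id": [1, 3], "age": [34, 52]})

# Left join keeps every training row even when demographics are missing
merged = training.merge(demographics, on="user_id", how="left")
print(merged["age"].isna().sum())  # rows with no demographic match
```

The NaN count after the join is a quick measure of how sparse the demographic coverage actually is.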
9. Ensemble Methods
• Bagging, boosting, random forests
• Overfitting
• Stability (small changes in the data can cause large changes in predictions)
• Previously, none of these worked at scale
• Small-scale results are available in R; large-scale versions exist only in proprietary implementations (Google, Amazon, etc.)
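The three ensemble families above can be compared head-to-head at small scale. The hackathon used R; this scikit-learn sketch on synthetic data is only an assumed equivalent, not the talk's actual code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# Synthetic binary problem standing in for the subscriber data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "bagging": BaggingClassifier(random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}
# Cross-validated ROC AUC, the same metric the deck uses on later slides
aucs = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
        for name, m in models.items()}
for name, auc in aucs.items():
    print(f"{name}: {auc:.3f}")
```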
10. ROC Curves
Binary Classifier Only!
11. Paid Subscriber ROC curve, ~61%
12. Boosted Regression Trees Performance
• Training-data ROC score = 0.745
• CV ROC score = 0.737; SE = 0.002
• 5.5% below the winning score, without any data preprocessing
• Random is 50%, i.e. 0.50; at 0.737 we are 0.737 − 0.50 = 0.237, or 23.7 points better than random
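The better-than-random arithmetic is just the AUC minus the 0.50 baseline of a random classifier. A minimal scikit-learn sketch with made-up labels and scores (only the baseline arithmetic mirrors the slide):

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores, purely illustrative
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]

auc = roc_auc_score(y_true, y_score)
margin = auc - 0.50  # distance above a random classifier (AUC = 0.50)
print(round(auc, 3), round(margin, 3))
```

On the deck's numbers the same subtraction gives 0.737 − 0.50 = 0.237.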
13. Contribution of predictor variables
14. Predictive Importance
• Friedman: the number of times a variable is selected for splitting, weighted by the squared-error improvement to the model; a measure of sparsity in the data
• Fit plots remove the average effects of the other model variables
• 1 pageV 74.0567852
• 2 loc 11.0801383
• 3 income 4.1565597
• 4 age 3.1426519
• 5 residlen 3.0813927
• 6 home 2.3308287
• 7 marital 0.6560258
• 8 sex 0.6476549
• 9 prop 0.3817017
• 10 child 0.2632598
• 11 own 0.2030012
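The ranking above (pageV dominating with ~74 of the importance) uses Friedman's split-based measure. The talk used R's boosted regression trees; the same measure is exposed as `feature_importances_` in scikit-learn. This sketch uses synthetic data where the first feature is constructed to dominate, and borrows the slide's variable names purely as labels:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
# First column carries almost all the signal, mimicking pageV's dominance
y = (X[:, 0] + 0.1 * X[:, 1] > 0).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)
for name, imp in zip(["pageV", "loc", "income"], model.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

The importances are normalized to sum to 1, so each value reads directly as a share of the model's total split improvement.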
15. Behavioral vs. Demographics
• Demographics are sparse
• Behavioral weblogs are the best source; most sites aren't using this information correctly. There is no single correct answer: trial and error on the features. The features are more important than the algorithm
• Linear vs. nonlinear
16. Fitted Values (Crappy)
17. Fitted Values Better
18. Predictor Variable Interaction
• Adjusting variable interactions
19. Variable Interactions
20. Plot Interactions age, loc
21. Trees vs. other methods
• Multiple levels are visible in the data, which suits trees. Do other variables match this pattern? Simplify the model or add more features; iterate to a better model
• No math required; an analyst can do this
22. Number of Trees
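The number of trees is one of the three BRT tuning parameters named in the conclusion. The talk tuned it visually; a programmatic sketch (scikit-learn, an assumed stand-in for R's gbm tooling) scores a held-out set after each boosting stage and picks the best stage:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                   random_state=0).fit(X_tr, y_tr)

# Held-out AUC after each boosting stage; the curve flattens (or dips)
# once extra trees stop helping
aucs = [roc_auc_score(y_te, p[:, 1]) for p in model.staged_predict_proba(X_te)]
best_n = int(np.argmax(aucs)) + 1
print(best_n, round(max(aucs), 3))
```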
23. Data Set Number of Trees
24. Hackathon Results
25. Weblogs only: 68.15%, 18% better than random
26. Demographics add 1%
27. AWS Advantages
• Running multiple instances with different
algorithms and parameters using R
• Add tutorial, install Screen, R GUI bugs
• http://amazonlabs.pbworks.com/w/page/28036646/FrontPage
28. Conclusion
• Data mining at scale requires more development in visualization, MR algorithms, and MR data preprocessing
• Tuning uses visualization. We tune 3 parameters: tc (tree complexity), lr (learning rate), and the number of trees; 2 of the 3 weren't covered here
• This isn't reproducible in Hadoop/Mahout or any open-source code I know of
• Other use cases: predicting which item will sell (eBay), search-engine ranking
• Be careful with MR paradigms: Hadoop MR != Couchbase MR