SlideShare ist ein Scribd-Unternehmen logo
1 von 17
4/23/13
Movie Night: Data Science
- Even On Our Night Off
May 27, 2014
Anqi and Irene – (H2O)
• Anqi is the in-house R expert and is responsible for K-means and PCA
• Irene is the pencil and paper stats nerd and technical writer
• Part of a data science team that’s 75% women, and on a technical team that’s
23% women (well above average).
Sergei- (Collective) VP, Data Sciences at Collective, where he is
responsible for the architecture, development and scaling of data-driven
technology products for digital advertising.
What is H2O?
• Same statistics - new volumes of data
• On a distributed cluster models on a terabyte of data can finish in
minutes.
• Provide an interface to give more people the power of data science.
• Also hook H2O into R and Scala
Overview
Walk through the practical problem of what movie to go see
together.
Examine work flow from data to prediction, and let the best
model inform our choice
Extend to production setting applications with a customer use
case
Movie Lens Data
Data is the 100,000 observation MovieLens data set
Demographic Features:
State Age Occupation Gender
Factor Integer Factor Factor
Levels: 62 Range
(7,73)
Levels: 21 Levels: 2
Largest class:
California
Mean: 32.9 Largest Class:
Student
M:F is about
3:1
Movie Classes
Movies are classified by types, types are not exclusive.
Dependent Variable
Users rated movies on a Likert scale of 1 to 5.
We converted this to a binomial indicator:
Ratings >= 4: recoded to 1, indicating liked movie
Ratings < 4: recoded to 0, indicating disliked the movie
Super Models
Both models are predicting the same dependent variable as a function
of the same set of features.
First model with tree based GBM - start simple and let the model get
as complex as it needs to with depth
Alternative model with regularized GLM - start with complexity
and let model generalize with regularization
WWIM
Using Gradient Boosted Classification on two classes
GBM is nonparametric, great when there’s no theoretical
model.
Accounts for complex interaction
Control overfitting with learning rate
WWAM: Alternative – Logistic GLM
Logistic binomial regression
End model has interpretability
Control for overfitting introducing penalty into objective
function - aids in feature selection and generalizability
Ridge regression- all L2 Penalty
Rubber; Meet Road
Comparison of error rates on holdout set
GBM Model GLM Model
Error on Dislike (0) 28% 30%
Error on Like (1) 18% 50%
Overall 22% 40%
GBM Predictions GLM Predictions
Like: 300, Her, Need For Speed
Dislike: Frozen, Pebody
Like: 300, Her, Capt. America
Dislike: Frozen, Divergent
Lights Out - Some Closing
Points
We didn't address a serious problem here - but this is the
general process used in a production environment.
To give you a sense for the real world implementation, we’ve
asked one of our users to share his use case with you.
Stories change people, while statistics gives
them something to argue about
- Bernie Siegel
Ad Server
(publisher)
Ad Server
(advertiser)
AgencyBrowser
BrandsPublishers
Content
Inventory
Ads
Audience
Audience Modeling
1. Build the Audience Cloud of stable cookies.
2. Define target audience using Cookie level data.
3. Assemble 1,000s of features on every cookie.
4. Build a predictive model using machine learning.
5. Score every cookie in the Audience Cloud.
6. Create a targetable segment with the top X users.
7. Adjust X daily to optimize delivery & performance.
8. Rebuild models weekly (daily if warranted).
Audience Cloud
(200M+ Stable Cookies)
Target Audience
(100K Cookies)
1M Cookies
3M Cookies
bit.ly/MLatScale
Preprint of paper submitted to KDD’14
Audience Extension: audiences (age 25-40, buys toys, watches TNT)
Audience Optimization: actions (clicks, online purchases)
Modeling Platform
MODEL BUILDING
Computing predictive models
on
Current Future
DATA SIZES
Size of data
ALGORITHM
Complexity and performance
GBMglmnet
1 million
1,000
1 billion
100,000
SCORING
Predicting outcomes
Batch
Real
Time
+ H2O

Weitere ähnliche Inhalte

Ähnlich wie MLconf NYC 0xdata

Partial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather ConditionsPartial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather ConditionsIRJET Journal
 
Experiment Management for the Enterprise
Experiment Management for the EnterpriseExperiment Management for the Enterprise
Experiment Management for the EnterpriseSigOpt
 
Building Continuous Learning Systems
Building Continuous Learning SystemsBuilding Continuous Learning Systems
Building Continuous Learning SystemsAnuj Gupta
 
Predicting Defects Using Change Genealogies (ISSE 2013)
Predicting Defects Using Change Genealogies (ISSE 2013)Predicting Defects Using Change Genealogies (ISSE 2013)
Predicting Defects Using Change Genealogies (ISSE 2013)Kim Herzig
 
Data Science for Business Managers - An intro to ROI for predictive analytics
Data Science for Business Managers - An intro to ROI for predictive analyticsData Science for Business Managers - An intro to ROI for predictive analytics
Data Science for Business Managers - An intro to ROI for predictive analyticsAkin Osman Kazakci
 
[DSC Croatia 22] The Ethos of Process in ML - Leanne Fitzpatrick
[DSC Croatia 22] The Ethos of Process in ML - Leanne Fitzpatrick[DSC Croatia 22] The Ethos of Process in ML - Leanne Fitzpatrick
[DSC Croatia 22] The Ethos of Process in ML - Leanne FitzpatrickDataScienceConferenc1
 
Which Algorithms Really Matter
Which Algorithms Really MatterWhich Algorithms Really Matter
Which Algorithms Really MatterTed Dunning
 
Age and Gender Detection-converted.pdf
Age and Gender Detection-converted.pdfAge and Gender Detection-converted.pdf
Age and Gender Detection-converted.pdfMohammedMuzammil83
 
IRJET- Analysis of Vehicle Number Plate Recognition
IRJET- Analysis of Vehicle Number Plate RecognitionIRJET- Analysis of Vehicle Number Plate Recognition
IRJET- Analysis of Vehicle Number Plate RecognitionIRJET Journal
 
Using synthetic data for computer vision model training
Using synthetic data for computer vision model trainingUsing synthetic data for computer vision model training
Using synthetic data for computer vision model trainingUnity Technologies
 
BigML Education - Deepnets
BigML Education - DeepnetsBigML Education - Deepnets
BigML Education - DeepnetsBigML, Inc
 
DATI, AI E ROBOTICA @POLITO
DATI, AI E ROBOTICA @POLITODATI, AI E ROBOTICA @POLITO
DATI, AI E ROBOTICA @POLITOMarcoMellia
 
Freenome's Biological Machine Learning Platform
Freenome's Biological Machine Learning PlatformFreenome's Biological Machine Learning Platform
Freenome's Biological Machine Learning PlatformBrandon White
 
Tutorial on Deep Generative Models
 Tutorial on Deep Generative Models Tutorial on Deep Generative Models
Tutorial on Deep Generative ModelsMLReview
 
DN18 | Demystifying the Buzz in Machine Learning! (This Time for Real) | Dat ...
DN18 | Demystifying the Buzz in Machine Learning! (This Time for Real) | Dat ...DN18 | Demystifying the Buzz in Machine Learning! (This Time for Real) | Dat ...
DN18 | Demystifying the Buzz in Machine Learning! (This Time for Real) | Dat ...Dataconomy Media
 
IRJET- Generating 3D Models Using 3D Generative Adversarial Network
IRJET- Generating 3D Models Using 3D Generative Adversarial NetworkIRJET- Generating 3D Models Using 3D Generative Adversarial Network
IRJET- Generating 3D Models Using 3D Generative Adversarial NetworkIRJET Journal
 
Learning to Learn Model Behavior ( Capital One: data intelligence conference )
Learning to Learn Model Behavior ( Capital One: data intelligence conference )Learning to Learn Model Behavior ( Capital One: data intelligence conference )
Learning to Learn Model Behavior ( Capital One: data intelligence conference )Pramit Choudhary
 

Ähnlich wie MLconf NYC 0xdata (20)

Partial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather ConditionsPartial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather Conditions
 
Experiment Management for the Enterprise
Experiment Management for the EnterpriseExperiment Management for the Enterprise
Experiment Management for the Enterprise
 
Building Continuous Learning Systems
Building Continuous Learning SystemsBuilding Continuous Learning Systems
Building Continuous Learning Systems
 
Predicting Defects Using Change Genealogies (ISSE 2013)
Predicting Defects Using Change Genealogies (ISSE 2013)Predicting Defects Using Change Genealogies (ISSE 2013)
Predicting Defects Using Change Genealogies (ISSE 2013)
 
Data Science for Business Managers - An intro to ROI for predictive analytics
Data Science for Business Managers - An intro to ROI for predictive analyticsData Science for Business Managers - An intro to ROI for predictive analytics
Data Science for Business Managers - An intro to ROI for predictive analytics
 
[DSC Croatia 22] The Ethos of Process in ML - Leanne Fitzpatrick
[DSC Croatia 22] The Ethos of Process in ML - Leanne Fitzpatrick[DSC Croatia 22] The Ethos of Process in ML - Leanne Fitzpatrick
[DSC Croatia 22] The Ethos of Process in ML - Leanne Fitzpatrick
 
Which Algorithms Really Matter
Which Algorithms Really MatterWhich Algorithms Really Matter
Which Algorithms Really Matter
 
Age and Gender Detection-converted.pdf
Age and Gender Detection-converted.pdfAge and Gender Detection-converted.pdf
Age and Gender Detection-converted.pdf
 
IRJET- Analysis of Vehicle Number Plate Recognition
IRJET- Analysis of Vehicle Number Plate RecognitionIRJET- Analysis of Vehicle Number Plate Recognition
IRJET- Analysis of Vehicle Number Plate Recognition
 
Using synthetic data for computer vision model training
Using synthetic data for computer vision model trainingUsing synthetic data for computer vision model training
Using synthetic data for computer vision model training
 
BigML Education - Deepnets
BigML Education - DeepnetsBigML Education - Deepnets
BigML Education - Deepnets
 
Age and Gender Detection.docx
Age and Gender Detection.docxAge and Gender Detection.docx
Age and Gender Detection.docx
 
DATI, AI E ROBOTICA @POLITO
DATI, AI E ROBOTICA @POLITODATI, AI E ROBOTICA @POLITO
DATI, AI E ROBOTICA @POLITO
 
Freenome's Biological Machine Learning Platform
Freenome's Biological Machine Learning PlatformFreenome's Biological Machine Learning Platform
Freenome's Biological Machine Learning Platform
 
Tutorial on Deep Generative Models
 Tutorial on Deep Generative Models Tutorial on Deep Generative Models
Tutorial on Deep Generative Models
 
useR 2014 jskim
useR 2014 jskimuseR 2014 jskim
useR 2014 jskim
 
DN18 | Demystifying the Buzz in Machine Learning! (This Time for Real) | Dat ...
DN18 | Demystifying the Buzz in Machine Learning! (This Time for Real) | Dat ...DN18 | Demystifying the Buzz in Machine Learning! (This Time for Real) | Dat ...
DN18 | Demystifying the Buzz in Machine Learning! (This Time for Real) | Dat ...
 
IRJET- Generating 3D Models Using 3D Generative Adversarial Network
IRJET- Generating 3D Models Using 3D Generative Adversarial NetworkIRJET- Generating 3D Models Using 3D Generative Adversarial Network
IRJET- Generating 3D Models Using 3D Generative Adversarial Network
 
Learning to Learn Model Behavior ( Capital One: data intelligence conference )
Learning to Learn Model Behavior ( Capital One: data intelligence conference )Learning to Learn Model Behavior ( Capital One: data intelligence conference )
Learning to Learn Model Behavior ( Capital One: data intelligence conference )
 
PowerPoint Presentation
PowerPoint PresentationPowerPoint Presentation
PowerPoint Presentation
 

Mehr von MLconf

Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...MLconf
 
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingTed Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingMLconf
 
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...MLconf
 
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushIgor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushMLconf
 
Josh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious ExperienceJosh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious ExperienceMLconf
 
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...MLconf
 
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...MLconf
 
Meghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the CheapMeghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the CheapMLconf
 
Noam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data CollectionNoam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data CollectionMLconf
 
June Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of MLJune Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of MLMLconf
 
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksSneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksMLconf
 
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...MLconf
 
Vito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldVito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldMLconf
 
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...MLconf
 
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...MLconf
 
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...MLconf
 
Neel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeNeel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeMLconf
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...MLconf
 
Soumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareSoumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareMLconf
 
Roy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesRoy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesMLconf
 

Mehr von MLconf (20)

Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
Jamila Smith-Loud - Understanding Human Impact: Social and Equity Assessments...
 
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language UnderstandingTed Willke - The Brain’s Guide to Dealing with Context in Language Understanding
Ted Willke - The Brain’s Guide to Dealing with Context in Language Understanding
 
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
Justin Armstrong - Applying Computer Vision to Reduce Contamination in the Re...
 
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold RushIgor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
Igor Markov - Quantum Computing: a Treasure Hunt, not a Gold Rush
 
Josh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious ExperienceJosh Wills - Data Labeling as Religious Experience
Josh Wills - Data Labeling as Religious Experience
 
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
Vinay Prabhu - Project GaitNet: Ushering in the ImageNet moment for human Gai...
 
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
Jekaterina Novikova - Machine Learning Methods in Detecting Alzheimer’s Disea...
 
Meghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the CheapMeghana Ravikumar - Optimized Image Classification on the Cheap
Meghana Ravikumar - Optimized Image Classification on the Cheap
 
Noam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data CollectionNoam Finkelstein - The Importance of Modeling Data Collection
Noam Finkelstein - The Importance of Modeling Data Collection
 
June Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of MLJune Andrews - The Uncanny Valley of ML
June Andrews - The Uncanny Valley of ML
 
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection TasksSneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
Sneha Rajana - Deep Learning Architectures for Semantic Relation Detection Tasks
 
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
Anoop Deoras - Building an Incrementally Trained, Local Taste Aware, Global D...
 
Vito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI WorldVito Ostuni - The Voice: New Challenges in a Zero UI World
Vito Ostuni - The Voice: New Challenges in a Zero UI World
 
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
Anna choromanska - Data-driven Challenges in AI: Scale, Information Selection...
 
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
Janani Kalyanam - Machine Learning to Detect Illegal Online Sales of Prescrip...
 
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
Esperanza Lopez Aguilera - Using a Bayesian Neural Network in the Detection o...
 
Neel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to codeNeel Sundaresan - Teaching a machine to code
Neel Sundaresan - Teaching a machine to code
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
 
Soumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better SoftwareSoumith Chintala - Increasing the Impact of AI Through Better Software
Soumith Chintala - Increasing the Impact of AI Through Better Software
 
Roy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime ChangesRoy Lowrance - Predicting Bond Prices: Regime Changes
Roy Lowrance - Predicting Bond Prices: Regime Changes
 

Kürzlich hochgeladen

[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 

Kürzlich hochgeladen (20)

[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 

MLconf NYC 0xdata

  • 1. 4/23/13 Movie Night: Data Science - Even On Our Night Off May 27, 2014
  • 2. Anqi and Irene – (H2O) • Anqi is the in-house R expert and is responsible for K-means and PCA • Irene is the pencil and paper stats nerd and technical writer • Part of a data science team that’s 75% women, and on a technical team that’s 23% women (well above average). Sergei- (Collective) VP, Data Sciences at Collective, where he is responsible for the architecture, development and scaling of data-driven technology products for digital advertising.
  • 3. What is H2O? • Same statistics - new volumes of data • On a distributed cluster models on a terabyte of data can finish in minutes. • Provide an interface to give more people the power of data science. • Also hook H2O into R and Scala
  • 4. Overview Walk through the practical problem of what movie to go see together. Examine work flow from data to prediction, and let the best model inform our choice Extend to production setting applications with a customer use case
  • 5. Movie Lens Data Data is the 100,000 observation MovieLens data set Demographic Features: State Age Occupation Gender Factor Integer Factor Factor Levels: 62 Range (7,73) Levels: 21 Levels: 2 Largest class: California Mean: 32.9 Largest Class: Student M:F is about 3:1
  • 6. Movie Classes Movies are classified by types, types are not exclusive.
  • 7. Dependent Variable Users rated movies on a Likert scale of 1 to 5. We converted this to a binomial indicator: Ratings >= 4: recoded to 1, indicating liked movie Ratings < 4: recoded to 0, indicating disliked the movie
  • 8. Super Models Both models are predicting the same dependent variable as a function of the same set of features. First model with tree based GBM - start simple and let the model get as complex as it needs to with depth Alternative model with regularized GLM - start with complexity and let model generalize with regularization
  • 9. WWIM Using Gradient Boosted Classification on two classes GBM is nonparametric, great when there’s no theoretical model. Accounts for complex interaction Control overfitting with learning rate
  • 10. WWAM: Alternative – Logistic GLM Logistic binomial regression End model has interpretability Control for overfitting introducing penalty into objective function - aids in feature selection and generalizability Ridge regression- all L2 Penalty
  • 11. Rubber; Meet Road Comparison of error rates on holdout set GBM Model GLM Model Error on Dislike (0) 28% 30% Error on Like (1) 18% 50% Overall 22% 40%
  • 12. GBM Predictions GLM Predictions Like: 300, Her, Need For Speed Dislike: Frozen, Pebody Like: 300, Her, Capt. America Dislike: Frozen, Divergent
  • 13. Lights Out - Some Closing Points We didn't address a serious problem here - but this is the general process used in a production environment. To give you a sense for the real world implementation, we’ve asked one of our users to share his use case with you.
  • 14. Stories change people, while statistics gives them something to argue about - Bernie Siegel
  • 16. Audience Modeling 1. Build the Audience Cloud of stable cookies. 2. Define target audience using Cookie level data. 3. Assemble 1,000s of features on every cookie. 4. Build a predictive model using machine learning. 5. Score every cookie in the Audience Cloud. 6. Create a targetable segment with the top X users. 7. Adjust X daily to optimize delivery & performance. 8. Rebuild models weekly (daily if warranted). Audience Cloud (200M+ Stable Cookies) Target Audience (100K Cookies) 1M Cookies 3M Cookies bit.ly/MLatScale Preprint of paper submitted to KDD’14 Audience Extension: audiences (age 25-40, buys toys, watches TNT) Audience Optimization: actions (clicks, online purchases)
  • 17. Modeling Platform MODEL BUILDING Computing predictive models on Current Future DATA SIZES Size of data ALGORITHM Complexity and performance GBMglmnet 1 million 1,000 1 billion 100,000 SCORING Predicting outcomes Batch Real Time + H2O