Winning Data Science 
Competitions 
Some (hopefully) useful pointers 
Owen Zhang 
Data Scientist
A plug for myself 
Current 
● Chief Product Officer 
Previous 
● VP, Science
Agenda 
● Structure of a Data Science Competition 
● Philosophical considerations 
● Sources of competitive advantage 
● Some technical tips 
● Apply what we learn outside of competitions 
Philosophy, Strategy, Technique
Structure of a Data Science Competition 
Build model using Training Data to 
predict outcomes on Private LB Data 
Training → Public LB (validation) → Private LB (holdout) 
The public LB gives quick but often misleading feedback. 
Data Science Competitions remind us that the purpose of a 
predictive model is to predict on data that we have NOT seen.
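
As a concrete illustration, here is a minimal Python/sklearn sketch (sklearn comes up later in the talk) of mirroring this structure locally so the holdout is never used for feedback; the synthetic data and split sizes are assumptions for illustration:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data

    # 60% training, 20% local "public LB", 20% local "private LB"
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.4, random_state=0)
    X_public, X_private, y_public, y_private = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=0)

    # Iterate on (X_train, X_public); score (X_private, y_private) exactly once.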
A little “philosophy” 
● There are many ways to overfit 
● Beware of “multiple comparison fallacy” 
○ There is a cost to “peeking at the answer” 
○ Usually the first idea (if it works) is the best 
“Think” more, “try” less
Sources of Competitive Advantage 
● Discipline (once bitten twice shy) 
○ Proper validation framework 
● Effort 
● (Some) Domain knowledge 
● Feature engineering 
● The “right” model structure 
● Machine/statistical learning packages 
● Coding/data manipulation efficiency 
● Luck
Technical Tricks -- GBM 
● My confession: I (over)use GBM 
○ When in doubt, use GBM 
● GBM automatically approximates 
○ Non-linear transformations 
○ Subtle and deep interactions 
● GBM handles missing values gracefully 
● GBM is invariant to monotonic transformation of 
features
Technical Tricks -- GBM needs TLC too 
● Tuning parameters 
○ Learning rate + number of trees 
■ Usually small learning rate + many trees work 
well. I target 1000 trees and tune learning rate 
○ Number of obs in leaf 
■ How many obs do you need to get a good mean estimate? 
○ Interaction depth 
■ Don’t be afraid to use 10+, this is (roughly) the 
number of leaf nodes
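
A hedged sketch of these three knobs, using sklearn's GradientBoostingRegressor as a stand-in for whatever GBM implementation you prefer (the parameter names are sklearn's; note that sklearn's max_depth is tree depth, while R gbm's interaction.depth counts splits, so the "10+" advice maps only roughly):

    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor

    X, y = make_regression(n_samples=2000, n_features=20, random_state=0)  # stand-in data

    gbm = GradientBoostingRegressor(
        n_estimators=1000,    # fix many trees...
        learning_rate=0.01,   # ...then tune a small learning rate around them
        min_samples_leaf=20,  # enough obs per leaf for a stable mean estimate
        max_depth=7,          # illustrative; maps only roughly to interaction depth
        random_state=0,
    )
    gbm.fit(X, y)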
Technical Tricks -- when GBM needs help 
● High cardinality features 
○ Convert to numerical features with preprocessing -- out-of-fold average, counts, etc. 
○ Use Ridge regression (or similar) and 
■ use out-of-fold prediction as input to GBM 
■ or blend 
○ Be brave, use N-way interactions 
■ I used a 7-way interaction in the Amazon competition. 
● GBM with out-of-fold treatment of high-cardinality features performs very well (sketched below)
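
A minimal sketch of the out-of-fold average for one high-cardinality column, hand-rolled in pandas (the column names and data are made up; the exact treatment varies by competition):

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import KFold

    rng = np.random.default_rng(0)
    df = pd.DataFrame({"cat": rng.choice(list("abcdefgh"), 5000),
                       "y":   rng.random(5000)})  # stand-in data

    df["cat_oof_mean"] = np.nan
    for tr_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
        means = df.iloc[tr_idx].groupby("cat")["y"].mean()  # encode on the fold's train part only
        df.loc[df.index[val_idx], "cat_oof_mean"] = df.iloc[val_idx]["cat"].map(means).values
    df["cat_oof_mean"] = df["cat_oof_mean"].fillna(df["y"].mean())  # unseen categories: global mean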
Technical Tricks -- feature engineering in GBM 
● GBM only APPROXIMATES interactions and non-linear transformations 
● Strong interactions benefit from being explicitly 
defined 
○ Especially ratios/sums/differences among 
features 
● GBM cannot capture complex features such as 
“average sales in the previous period for this type of 
product”
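
For example, a toy sketch of making a ratio and a difference explicit before GBM sees the data (the column names are hypothetical):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"price": [10.0, 20.0, 15.0],
                       "cost":  [8.0, 12.0, 16.0]})  # made-up columns

    # Explicit interactions that GBM would otherwise only approximate via many splits:
    df["margin"]      = df["price"] - df["cost"]
    df["price_ratio"] = df["price"] / df["cost"].replace(0, np.nan)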
Technical Tricks -- Glmnet 
● From a methodology perspective, the opposite of 
GBM 
● Captures (log/logistic) linear relationships 
● Works with a very small # of rows (a few hundred or even fewer) 
● Complements GBM very well in a blend 
● Needs a lot more work 
○ missing values, outliers, transformations (log?), 
interactions 
● The sparsity assumption -- L1 vs L2
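
glmnet's elastic-net penalty has an sklearn counterpart; a sketch under the assumption of numeric features (l1_ratio is the L1/L2 mix: 1.0 is pure lasso, near 0 is ridge-like), with the extra work the bullets above call for:

    from sklearn.datasets import make_regression
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import ElasticNetCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_regression(n_samples=300, n_features=50, noise=5.0, random_state=0)

    # Unlike GBM, the linear model needs imputation and scaling before fitting.
    model = make_pipeline(
        SimpleImputer(strategy="median"),
        StandardScaler(),
        ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5),  # CV over the sparsity assumption
    )
    model.fit(X, y)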
Technical Tricks -- Text mining 
● tau package in R 
● Python’s sklearn 
● L2 penalty a must 
● N-grams work well. 
● Don’t forget the “trivial features”: length of text, 
number of words, etc. 
● Many “text-mining” competitions on Kaggle are actually dominated by structured fields -- KDD2014
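
A sketch of the sklearn route (the tau route in R is analogous): word n-grams into an L2-penalized linear model, with the trivial length features stacked alongside; the toy texts and labels are made up:

    import numpy as np
    from scipy.sparse import csr_matrix, hstack
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    texts  = ["cheap meds now", "meeting at noon", "win money fast", "lunch tomorrow?"]
    labels = [1, 0, 1, 0]  # toy labels

    X_text = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)  # unigrams + bigrams

    # The "trivial features": text length and word count.
    trivial = csr_matrix(np.array([[len(t), len(t.split())] for t in texts], dtype=float))
    X = hstack([X_text, trivial])

    clf = LogisticRegression(penalty="l2", C=1.0).fit(X, labels)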
Technical Tricks -- Blending 
● All models are wrong, but some are useful (George 
Box) 
○ The hope is that they are wrong in different ways 
● When in doubt, use average blender 
● Beware of temptation to overfit public leaderboard 
○ Use public LB + training CV 
● The strongest individual model does not necessarily 
make the best blend 
○ Sometimes intentionally built weak models are good blending 
candidates -- Liberty Mutual Competition
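
The average blender really is just the average; a sketch with two hypothetical prediction vectors on the same rows:

    import numpy as np
    from scipy.stats import rankdata

    pred_gbm    = np.array([0.20, 0.70, 0.40])  # made-up model outputs
    pred_glmnet = np.array([0.30, 0.60, 0.50])

    blend = (pred_gbm + pred_glmnet) / 2        # when in doubt, average

    # For rank-based metrics (e.g. AUC), averaging ranks can be safer than
    # averaging raw scores that live on different scales:
    blend_rank = (rankdata(pred_gbm) + rankdata(pred_glmnet)) / 2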
Technical Tricks -- blending continued 
● Try to build “diverse” models 
○ Different tools -- GBM, Glmnet, RF, SVM, etc. 
○ Different model specifications -- Linear, lognormal, Poisson, 2-stage, etc. 
○ Different subsets of features 
○ Subsampled observations 
○ Weighted/unweighted 
○ … 
● But, do not try “blindly” -- think more, try less
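
One cheap way to generate that diversity, sketched with sklearn stand-ins for the tools named above (a GBM, a glmnet-style linear model, a random forest), blended on out-of-fold predictions so the comparison stays honest:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data

    models = [
        GradientBoostingClassifier(random_state=0),
        RandomForestClassifier(n_estimators=300, random_state=0),
        LogisticRegression(max_iter=1000),
    ]
    # Out-of-fold probabilities for each model, then a simple average blend.
    oof = [cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
           for m in models]
    blend = np.mean(oof, axis=0)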
Apply what we learn outside of competitions 
● Competitions give us really good models, but we also need to 
○ Select the right problem and structure it correctly 
○ Find good (at least useful) data 
○ Make sure models are used the right way 
Competitions help us 
● Understand how much “signal” exists in the data 
● Identify flaws in data or data creation process 
● Build generalizable models 
● Broaden our technical horizon 
● …
Acknowledgement 
● My fellow competitors and data scientists at large 
○ You have taught me (almost) everything 
● Xavier Conort -- my colleague at DataRobot 
○ Thanks for collaboration and inspiration for some 
material 
● Kaggle 
○ Thanks for all the fun we have competing!
