© DataRobot, Inc. All rights reserved.
Kaggle and Data Science
Japan, 2018
Sergey Yurgenson
Director, Advanced Data Science Services
Kaggle Grandmaster
Kaggle
● Kaggle is a platform for data science competitions
● It was founded by Anthony Goldbloom in Australia in 2010 and later moved to San Francisco
● In March 2017 it was acquired by Google
● Many other start-ups are now trying to replicate the idea, but Kaggle remains the best-known name in the data science community
● To date, Kaggle has hosted more than 280 competitions and has more than 1 million members from more than 190 countries
Kaggle competitions
● Most Kaggle competitions are predictive modeling competitions
● Participants are given training data to train their models and test data with unknown targets
● Participants compute predictions for the test data and submit them to the Kaggle platform
● Prediction accuracy is evaluated using a predefined objective metric, and the result is reported back to participants
● The model performance of all participants is publicly visible, so participants can compare the quality of their models with those of others
● Many competitions offer monetary prizes for top finishers
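The train, predict, submit, and score loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy data, the straight-line model, the `id,prediction` submission layout, and RMSE as the metric are all hypothetical stand-ins, not the format of any particular competition.

```python
import csv
import io
import math

# Hypothetical toy data: (feature, target) pairs for training, features only for test.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
test_features = [5.0, 6.0]
hidden_test_targets = [10.0, 12.0]  # in a real competition only the organizer sees these


def fit_line(rows):
    """Least-squares fit of y = a * x + b on (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    var = sum((x - mean_x) ** 2 for x, _ in rows)
    a = cov / var
    return a, mean_y - a * mean_x


def make_submission(ids, predictions):
    """Serialize predictions as CSV text with an id column and a prediction column."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "prediction"])
    for i, p in zip(ids, predictions):
        writer.writerow([i, f"{p:.4f}"])
    return buf.getvalue()


def rmse(y_true, y_pred):
    """Root mean squared error, standing in for the predefined metric."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Participant side: train, predict, build the submission.
a, b = fit_line(train)
predictions = [a * x + b for x in test_features]
submission = make_submission(range(len(predictions)), predictions)

# Platform side: score the submission against the hidden targets.
score = rmse(hidden_test_targets, predictions)
```

The participant never touches `hidden_test_targets`; only the score comes back, which is what makes models from different participants comparable.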
Kaggle ranking
● Based on competition performance, Kaggle ranks members using points and awards titles for top finishes in competitions
● For example, to earn the Master title a member needs one gold medal and two silver medals. In competitions with 1000 participants, that means finishing once in roughly the top 10 places and twice in the top 50.
Kaggle and Data Science
Why do you dislike Kaggle?
● Kaggle competitions do not have much in common with real data science
○ The problems are already well formulated with metrics predefined. In an industry setting there is
ambiguity, and knowing what to solve is one of the key steps towards a solution.
○ Data in most cases is already provided and is relatively clean.
○ The goal is more leaderboard driven rather than understanding driven. Winning a competition
versus why an approach works is a top priority. Results may not be trustworthy.
○ There are chances of overfitting to test data with repeated submissions.
○ In most cases the solution is an ensemble of algorithms and not “productionizable”.
https://www.quora.com/Why-do-you-dislike-Kaggle
True or False?
● “The problems are already well formulated with metrics predefined. In an
industry setting there is ambiguity, and knowing what to solve is one of the
key steps towards a solution.”
https://www.quora.com/Why-do-you-dislike-Kaggle
Problem is well formulated
Mostly true, however...
● The need for criteria is an inherent property of any competition.
● In the real world, not all data scientists are free to select and reformulate the problem. Many problems are already defined, with specific success criteria assigned.
● We learn many subjects and skills by solving predefined problems and doing predefined exercises. We learn math and physics by solving textbook problems that are already formulated. By solving problems we also learn how to formulate them and which approach suits a particular data science situation.
● We also have to admit that evaluating the business value of solving a problem is completely out of scope for Kaggle competitions, while business value analysis and problem prioritization are an important part of many real-life data science projects.
True or False?
● “Data in most cases is already provided and is relatively clean.”
https://www.quora.com/Why-do-you-dislike-Kaggle
Data is clean
Half true
● In many competitions the datasets
○ Are very big
○ Span multiple tables
○ Contain duplicated and mislabeled records
○ Combine structured and unstructured data
● Some competitions encourage searching for additional data sources
● Data leaks are common
● Feature names and meanings are often not provided, which makes the problem even more difficult than in the real world
● Data may be intentionally distorted to comply with data privacy laws
Data is clean
● Complex data structure
● Big datasets
● No meaningful feature names
Data is clean
● Kaggle competitions teach unique data manipulation skills:
○ Handling data under hardware limitations: efficient code, smart sampling, clever encoding…
○ Using EDA to uncover the meaning of data without relying on labels or other provided information
○ Discovering data leaks through data analysis
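One possible reading of "smart sampling" under hardware limits is reservoir sampling, which draws a uniform random sample from a stream in a single pass while keeping only k rows in memory. The slides do not name a specific technique, so this choice, along with the function name and parameters, is an assumption made for illustration.

```python
import random


def reservoir_sample(rows, k, seed=0):
    """Return k rows drawn uniformly at random from an iterable of unknown length.

    Only k rows are held in memory at any time (Algorithm R).
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Row i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Hypothetical usage: the iterator stands in for rows read lazily from a file
# that is too large to load at once.
stream = iter(range(100_000))
sample = reservoir_sample(stream, k=1000)
```

Exploratory analysis and quick model prototypes can then run on `sample`, with the full data reserved for the final fit.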
True or False?
● The goal is more leaderboard driven rather than understanding driven. Winning
a competition versus why an approach works is a top priority. Results may not
be trustworthy.
https://www.quora.com/Why-do-you-dislike-Kaggle
No understanding
True, but maybe not that important
● The criticism assumes that a model we cannot understand is less valuable than one we can understand
○ A model is not necessarily used for knowledge discovery
○ In real life we often use and rely on things we do not completely understand
○ If something we do not understand cannot be trusted, how do we ever trust other people?
○ Even a complex machine learning model may be a simplification of an even more complex real system
No understanding
● The criticism ignores recent research on model interpretability
○ Feature importance
○ Reason codes
○ Partial dependence plots
○ Surrogate models
○ Neuron activation visualization
○ ...
● These methods allow us to analyze and understand the behavior of models as complicated as GBMs and neural networks
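As one concrete instance of the feature importance item, permutation importance treats the model as a black box: shuffle one feature column, and measure how much the error grows. The sketch below is a minimal illustration; `black_box`, the synthetic data, and the use of mean squared error are all hypothetical choices, not something taken from the slides.

```python
import random


def mse(model, X, y):
    """Mean squared error of a callable model on rows X with targets y."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Average increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def black_box(row):
    """Hypothetical model: strong use of feature 0, weak use of feature 1, none of feature 2."""
    return 3.0 * row[0] + 0.5 * row[1]


data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random(), data_rng.random()] for _ in range(200)]
y = [black_box(row) for row in X]
importances = permutation_importance(black_box, X, y)
```

The procedure never inspects the model's internals, so the same function would work unchanged for a GBM or a neural network wrapped in a callable.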
True or False?
● There are chances of overfitting to test data with repeated submissions.
https://www.quora.com/Why-do-you-dislike-Kaggle
Overfitting
False
● The claim reflects a complete misunderstanding of how Kaggle works
○ Test data in a Kaggle competition is split into two parts: public and private
○ During the competition, models are evaluated only on the public part of the test set
○ Final results are based only on the private part of the test set
○ Thus the final model evaluation is based on completely new data
● One of the first lessons every competition participant learns, and learns fast
○ Do not overfit the leaderboard.
○ Create a training/validation partition that reflects the test data as closely as possible, including seasonality effects and data drift
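The public/private mechanism can be illustrated with a small simulation. Here 200 submissions are pure random guesses, and the one with the best public score is selected; its private score is then computed on rows it was never selected on. The split fraction, the accuracy metric, and every name in the sketch are assumptions for illustration, not Kaggle's actual implementation.

```python
import random


def split_public_private(n_rows, public_fraction=0.3, seed=0):
    """Randomly partition test row indices into a public and a private part."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = int(n_rows * public_fraction)
    return sorted(indices[:cut]), sorted(indices[cut:])


def accuracy(y_true, y_pred, indices):
    """Fraction of correct predictions on the given subset of rows."""
    return sum(y_true[i] == y_pred[i] for i in indices) / len(indices)


rng = random.Random(1)
n = 2000
y_true = [rng.randint(0, 1) for _ in range(n)]
public_idx, private_idx = split_public_private(n)

# 200 hypothetical submissions with no real signal at all.
submissions = [[rng.randint(0, 1) for _ in range(n)] for _ in range(200)]

# "Overfit the leaderboard": keep whichever submission looks best on the public rows.
best = max(submissions, key=lambda s: accuracy(y_true, s, public_idx))
public_score = accuracy(y_true, best, public_idx)
private_score = accuracy(y_true, best, private_idx)
```

Selecting on the public rows typically pushes `public_score` a few points above 0.5, while `private_score` stays close to chance, because the private rows played no part in the selection.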
True or False?
● In most cases the solution is an ensemble of algorithms and not
“productionizable”.
https://www.quora.com/Why-do-you-dislike-Kaggle
Difficult to put in production
Half true, half false
● Yes, in most cases the top models are complicated ensembles
● They are difficult to put into production if each model is deployed separately, one by one
● It is easy with an appropriately designed platform that can handle many models and blenders
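The idea of handling an ensemble as a single deployable unit can be sketched as a small wrapper that exposes one `predict` call over several models. This is a hypothetical illustration of a weighted-average blender; the `Blender` class and the three toy models are invented for the example and do not describe how any real platform is built.

```python
class Blender:
    """Wrap several fitted models behind a single predict() call."""

    def __init__(self, models, weights=None):
        self.models = list(models)
        if weights is None:
            weights = [1.0] * len(self.models)
        self.weights = list(weights)
        if len(self.weights) != len(self.models):
            raise ValueError("need exactly one weight per model")

    def predict(self, row):
        """Weighted average of the individual model predictions."""
        total = sum(self.weights)
        return sum(w * m(row) for w, m in zip(self.weights, self.models)) / total


# Three hypothetical fitted models, each a callable taking one feature row.
models = [
    lambda row: 2.0 * row[0],
    lambda row: row[0] + 1.0,
    lambda row: 3.0,
]

simple_blend = Blender(models)
weighted_blend = Blender(models, weights=[2.0, 1.0, 1.0])

p_simple = simple_blend.predict([2.0])      # (4 + 3 + 3) / 3
p_weighted = weighted_blend.predict([2.0])  # (2*4 + 3 + 3) / 4
```

Callers see only `predict`, so swapping a single model for a blend of many does not change the serving code.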
True or False?
● Sometimes, a 0.01 difference in AUC can be the difference between 1st place
and 294th place (out of 626). Those marginal gains take significant time and
effort that may not be worthwhile in the face of other projects and priorities
https://www.quora.com/How-similar-are-Kaggle-competitions-to-what-data-scientists-do
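To make a 0.01 AUC difference concrete: AUC is the fraction of (positive, negative) pairs that the model ranks in the right order, so 0.01 corresponds to 1% of those pairs. The sketch below computes AUC directly from that definition; the two score lists are hypothetical and chosen so that exactly one pair out of 100 is mis-ordered.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# 10 positives and 10 negatives give 100 pairs in total.
y = [1] * 10 + [0] * 10

# Hypothetical model A ranks every positive above every negative.
model_a = [float(s) for s in range(11, 21)] + [float(s) for s in range(1, 11)]

# Hypothetical model B scores one negative (11.5) above one positive (11.0).
model_b = list(model_a)
model_b[19] = 11.5

auc_a = roc_auc(y, model_a)
auc_b = roc_auc(y, model_b)
```

Here `auc_a` is 1.0 and `auc_b` is 0.99: a single swapped pair out of 100 accounts for the whole 0.01 gap.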
Marginal gain is not valuable
Not always true
● We often advise clients ourselves on the balance between time spent and model performance
● However, in the investment world a 0.01 AUC difference means a difference of millions of dollars in gain or loss
● The competitive aspect of a data science problem with small margins drives innovation
○ New preprocessing steps
○ New feature engineering ideas
○ Continuous testing of new algorithms and implementations (GBM - XGBoost - LightGBM - CatBoost)
Kaggle and Data Science
● “Kaggle competitions cover a decent amount of what a data scientist does.
The two big missing pieces are:
○ 1. taking a business problem and specifying it as a data science problem
(which includes pulling the data and structuring it so that it addresses that
business problem).
○ 2. putting models into production.”
Anthony Goldbloom
Kaggle and Data Science
● Kaggle is a competition
● “Real” Data Science is ... also a competition
Kaggle to “real life” Data Science
● DataRobot - created by top Kagglers
○ Owen Zhang, Product Advisor (Highest: 1st)
○ Xavier Conort, Chief Data Scientist (Highest: 1st)
○ Sergey Yurgenson, Director, AI Services (Highest: 1st)
○ Jeremy Achin, CEO & Co-Founder (Highest: 20th)
○ Tom de Godoy, CTO & Co-Founder (Highest: 20th)
○ Amanda Schierz, Data Scientist (Highest: 24th)
DataRobot automatically replicates the steps seasoned data scientists take. This allows non-technical business users to create accurate predictive models and data scientists to add to their existing tool set.
Kaggle and Data Science

Weitere ähnliche Inhalte

Was ist angesagt?

最近のKaggleに学ぶテーブルデータの特徴量エンジニアリング
最近のKaggleに学ぶテーブルデータの特徴量エンジニアリング最近のKaggleに学ぶテーブルデータの特徴量エンジニアリング
最近のKaggleに学ぶテーブルデータの特徴量エンジニアリングmlm_kansai
 
5分でわかるかもしれないglmnet
5分でわかるかもしれないglmnet5分でわかるかもしれないglmnet
5分でわかるかもしれないglmnetNagi Teramo
 
データサイエンティストの仕事とデータ分析コンテスト
データサイエンティストの仕事とデータ分析コンテストデータサイエンティストの仕事とデータ分析コンテスト
データサイエンティストの仕事とデータ分析コンテストKen'ichi Matsui
 
[DL輪読会]Revisiting Deep Learning Models for Tabular Data (NeurIPS 2021) 表形式デー...
[DL輪読会]Revisiting Deep Learning Models for Tabular Data  (NeurIPS 2021) 表形式デー...[DL輪読会]Revisiting Deep Learning Models for Tabular Data  (NeurIPS 2021) 表形式デー...
[DL輪読会]Revisiting Deep Learning Models for Tabular Data (NeurIPS 2021) 表形式デー...Deep Learning JP
 
Gradient Tree Boosting はいいぞ
Gradient Tree Boosting はいいぞGradient Tree Boosting はいいぞ
Gradient Tree Boosting はいいぞ7X RUSK
 
幾何と機械学習: A Short Intro
幾何と機械学習: A Short Intro幾何と機械学習: A Short Intro
幾何と機械学習: A Short IntroIchigaku Takigawa
 
Factorization machines with r
Factorization machines with rFactorization machines with r
Factorization machines with rShota Yasui
 
自然言語処理による議論マイニング
自然言語処理による議論マイニング自然言語処理による議論マイニング
自然言語処理による議論マイニングNaoaki Okazaki
 
「NVIDIA プロファイラを用いたPyTorch学習最適化手法のご紹介(修正前 typoあり)」
「NVIDIA プロファイラを用いたPyTorch学習最適化手法のご紹介(修正前 typoあり)」「NVIDIA プロファイラを用いたPyTorch学習最適化手法のご紹介(修正前 typoあり)」
「NVIDIA プロファイラを用いたPyTorch学習最適化手法のご紹介(修正前 typoあり)」ManaMurakami1
 
優れた研究論文の書き方
優れた研究論文の書き方優れた研究論文の書き方
優れた研究論文の書き方Masanori Kado
 
表形式データで高性能な予測モデルを構築する「DNNとXGBoostのアンサンブル学習」
表形式データで高性能な予測モデルを構築する「DNNとXGBoostのアンサンブル学習」表形式データで高性能な予測モデルを構築する「DNNとXGBoostのアンサンブル学習」
表形式データで高性能な予測モデルを構築する「DNNとXGBoostのアンサンブル学習」西岡 賢一郎
 
Devsumi 2018summer
Devsumi 2018summerDevsumi 2018summer
Devsumi 2018summerHarada Kei
 
StanとRでベイズ統計モデリング読書会 Chapter 7(7.6-7.9) 回帰分析の悩みどころ ~統計の力で歌うまになりたい~
StanとRでベイズ統計モデリング読書会 Chapter 7(7.6-7.9) 回帰分析の悩みどころ ~統計の力で歌うまになりたい~StanとRでベイズ統計モデリング読書会 Chapter 7(7.6-7.9) 回帰分析の悩みどころ ~統計の力で歌うまになりたい~
StanとRでベイズ統計モデリング読書会 Chapter 7(7.6-7.9) 回帰分析の悩みどころ ~統計の力で歌うまになりたい~nocchi_airport
 
機械学習モデルの判断根拠の説明
機械学習モデルの判断根拠の説明機械学習モデルの判断根拠の説明
機械学習モデルの判断根拠の説明Satoshi Hara
 
backbone としての timm 入門
backbone としての timm 入門backbone としての timm 入門
backbone としての timm 入門Takuji Tahara
 
不均衡データのクラス分類
不均衡データのクラス分類不均衡データのクラス分類
不均衡データのクラス分類Shintaro Fukushima
 
トップカンファレンスへの論文採択に向けて(AI研究分野版)/ Toward paper acceptance at top conferences (AI...
トップカンファレンスへの論文採択に向けて(AI研究分野版)/ Toward paper acceptance at top conferences (AI...トップカンファレンスへの論文採択に向けて(AI研究分野版)/ Toward paper acceptance at top conferences (AI...
トップカンファレンスへの論文採択に向けて(AI研究分野版)/ Toward paper acceptance at top conferences (AI...JunSuzuki21
 
強化学習入門
強化学習入門強化学習入門
強化学習入門Shunta Saito
 
バンディット問題について
バンディット問題についてバンディット問題について
バンディット問題についてjkomiyama
 

Was ist angesagt? (20)

最近のKaggleに学ぶテーブルデータの特徴量エンジニアリング
最近のKaggleに学ぶテーブルデータの特徴量エンジニアリング最近のKaggleに学ぶテーブルデータの特徴量エンジニアリング
最近のKaggleに学ぶテーブルデータの特徴量エンジニアリング
 
5分でわかるかもしれないglmnet
5分でわかるかもしれないglmnet5分でわかるかもしれないglmnet
5分でわかるかもしれないglmnet
 
データサイエンティストの仕事とデータ分析コンテスト
データサイエンティストの仕事とデータ分析コンテストデータサイエンティストの仕事とデータ分析コンテスト
データサイエンティストの仕事とデータ分析コンテスト
 
[DL輪読会]Revisiting Deep Learning Models for Tabular Data (NeurIPS 2021) 表形式デー...
[DL輪読会]Revisiting Deep Learning Models for Tabular Data  (NeurIPS 2021) 表形式デー...[DL輪読会]Revisiting Deep Learning Models for Tabular Data  (NeurIPS 2021) 表形式デー...
[DL輪読会]Revisiting Deep Learning Models for Tabular Data (NeurIPS 2021) 表形式デー...
 
Gradient Tree Boosting はいいぞ
Gradient Tree Boosting はいいぞGradient Tree Boosting はいいぞ
Gradient Tree Boosting はいいぞ
 
幾何と機械学習: A Short Intro
幾何と機械学習: A Short Intro幾何と機械学習: A Short Intro
幾何と機械学習: A Short Intro
 
Factorization machines with r
Factorization machines with rFactorization machines with r
Factorization machines with r
 
自然言語処理による議論マイニング
自然言語処理による議論マイニング自然言語処理による議論マイニング
自然言語処理による議論マイニング
 
「NVIDIA プロファイラを用いたPyTorch学習最適化手法のご紹介(修正前 typoあり)」
「NVIDIA プロファイラを用いたPyTorch学習最適化手法のご紹介(修正前 typoあり)」「NVIDIA プロファイラを用いたPyTorch学習最適化手法のご紹介(修正前 typoあり)」
「NVIDIA プロファイラを用いたPyTorch学習最適化手法のご紹介(修正前 typoあり)」
 
Spark MLlibではじめるスケーラブルな機械学習
Spark MLlibではじめるスケーラブルな機械学習Spark MLlibではじめるスケーラブルな機械学習
Spark MLlibではじめるスケーラブルな機械学習
 
優れた研究論文の書き方
優れた研究論文の書き方優れた研究論文の書き方
優れた研究論文の書き方
 
表形式データで高性能な予測モデルを構築する「DNNとXGBoostのアンサンブル学習」
表形式データで高性能な予測モデルを構築する「DNNとXGBoostのアンサンブル学習」表形式データで高性能な予測モデルを構築する「DNNとXGBoostのアンサンブル学習」
表形式データで高性能な予測モデルを構築する「DNNとXGBoostのアンサンブル学習」
 
Devsumi 2018summer
Devsumi 2018summerDevsumi 2018summer
Devsumi 2018summer
 
StanとRでベイズ統計モデリング読書会 Chapter 7(7.6-7.9) 回帰分析の悩みどころ ~統計の力で歌うまになりたい~
StanとRでベイズ統計モデリング読書会 Chapter 7(7.6-7.9) 回帰分析の悩みどころ ~統計の力で歌うまになりたい~StanとRでベイズ統計モデリング読書会 Chapter 7(7.6-7.9) 回帰分析の悩みどころ ~統計の力で歌うまになりたい~
StanとRでベイズ統計モデリング読書会 Chapter 7(7.6-7.9) 回帰分析の悩みどころ ~統計の力で歌うまになりたい~
 
機械学習モデルの判断根拠の説明
機械学習モデルの判断根拠の説明機械学習モデルの判断根拠の説明
機械学習モデルの判断根拠の説明
 
backbone としての timm 入門
backbone としての timm 入門backbone としての timm 入門
backbone としての timm 入門
 
不均衡データのクラス分類
不均衡データのクラス分類不均衡データのクラス分類
不均衡データのクラス分類
 
トップカンファレンスへの論文採択に向けて(AI研究分野版)/ Toward paper acceptance at top conferences (AI...
トップカンファレンスへの論文採択に向けて(AI研究分野版)/ Toward paper acceptance at top conferences (AI...トップカンファレンスへの論文採択に向けて(AI研究分野版)/ Toward paper acceptance at top conferences (AI...
トップカンファレンスへの論文採択に向けて(AI研究分野版)/ Toward paper acceptance at top conferences (AI...
 
強化学習入門
強化学習入門強化学習入門
強化学習入門
 
バンディット問題について
バンディット問題についてバンディット問題について
バンディット問題について
 

Ähnlich wie Kaggle and data science

Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitionsOwen Zhang
 
Winning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangWinning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangVivian S. Zhang
 
Making better use of Data and AI in Industry 4.0
Making better use of Data and AI in Industry 4.0Making better use of Data and AI in Industry 4.0
Making better use of Data and AI in Industry 4.0Albert Y. C. Chen
 
Model selection and tuning at scale
Model selection and tuning at scaleModel selection and tuning at scale
Model selection and tuning at scaleOwen Zhang
 
vodQA Pune (2019) - Design patterns in test automation
vodQA Pune (2019) - Design patterns in test automationvodQA Pune (2019) - Design patterns in test automation
vodQA Pune (2019) - Design patterns in test automationvodQA
 
Kaggle Days Milan - March 2019
Kaggle Days Milan - March 2019Kaggle Days Milan - March 2019
Kaggle Days Milan - March 2019Alberto Danese
 
How to train your product owner
How to train your product ownerHow to train your product owner
How to train your product ownerDavid Murgatroyd
 
"What we learned from 5 years of building a data science software that actual...
"What we learned from 5 years of building a data science software that actual..."What we learned from 5 years of building a data science software that actual...
"What we learned from 5 years of building a data science software that actual...Dataconomy Media
 
Limits of Machine Learning
Limits of Machine LearningLimits of Machine Learning
Limits of Machine LearningAlexey Grigorev
 
Operationalizing Machine Learning in the Enterprise
Operationalizing Machine Learning in the EnterpriseOperationalizing Machine Learning in the Enterprise
Operationalizing Machine Learning in the Enterprisemark madsen
 
Profit from AI & Machine Learning: The Best Practices for People & Process
Profit from AI & Machine Learning: The Best Practices for People & ProcessProfit from AI & Machine Learning: The Best Practices for People & Process
Profit from AI & Machine Learning: The Best Practices for People & ProcessTony Baer
 
Always Be Deploying. How to make R great for machine learning in (not only) E...
Always Be Deploying. How to make R great for machine learning in (not only) E...Always Be Deploying. How to make R great for machine learning in (not only) E...
Always Be Deploying. How to make R great for machine learning in (not only) E...Wit Jakuczun
 
Data Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina Stoy
Data Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina StoyData Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina Stoy
Data Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina StoyLazarinaStoyanova
 
CD in Machine Learning Systems
CD in Machine Learning SystemsCD in Machine Learning Systems
CD in Machine Learning SystemsThoughtworks
 

Ähnlich wie Kaggle and data science (20)

Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitions
 
Winning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangWinning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen Zhang
 
Making better use of Data and AI in Industry 4.0
Making better use of Data and AI in Industry 4.0Making better use of Data and AI in Industry 4.0
Making better use of Data and AI in Industry 4.0
 
Model selection and tuning at scale
Model selection and tuning at scaleModel selection and tuning at scale
Model selection and tuning at scale
 
vodQA Pune (2019) - Design patterns in test automation
vodQA Pune (2019) - Design patterns in test automationvodQA Pune (2019) - Design patterns in test automation
vodQA Pune (2019) - Design patterns in test automation
 
Kaggle Days Milan - March 2019
Kaggle Days Milan - March 2019Kaggle Days Milan - March 2019
Kaggle Days Milan - March 2019
 
How to train your product owner
How to train your product ownerHow to train your product owner
How to train your product owner
 
Demystifying Xgboost
Demystifying XgboostDemystifying Xgboost
Demystifying Xgboost
 
"What we learned from 5 years of building a data science software that actual...
"What we learned from 5 years of building a data science software that actual..."What we learned from 5 years of building a data science software that actual...
"What we learned from 5 years of building a data science software that actual...
 
Limits of Machine Learning
Limits of Machine LearningLimits of Machine Learning
Limits of Machine Learning
 
Operationalizing Machine Learning in the Enterprise
Operationalizing Machine Learning in the EnterpriseOperationalizing Machine Learning in the Enterprise
Operationalizing Machine Learning in the Enterprise
 
Golang for data analytics
Golang for data analyticsGolang for data analytics
Golang for data analytics
 
Golang for data analytics
Golang for data analyticsGolang for data analytics
Golang for data analytics
 
Managing machine learning
Managing machine learningManaging machine learning
Managing machine learning
 
A Tester's Life
A Tester's LifeA Tester's Life
A Tester's Life
 
Demystifying ML/AI
Demystifying ML/AIDemystifying ML/AI
Demystifying ML/AI
 
Profit from AI & Machine Learning: The Best Practices for People & Process
Profit from AI & Machine Learning: The Best Practices for People & ProcessProfit from AI & Machine Learning: The Best Practices for People & Process
Profit from AI & Machine Learning: The Best Practices for People & Process
 
Always Be Deploying. How to make R great for machine learning in (not only) E...
Always Be Deploying. How to make R great for machine learning in (not only) E...Always Be Deploying. How to make R great for machine learning in (not only) E...
Always Be Deploying. How to make R great for machine learning in (not only) E...
 
Data Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina Stoy
Data Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina StoyData Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina Stoy
Data Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina Stoy
 
CD in Machine Learning Systems
CD in Machine Learning SystemsCD in Machine Learning Systems
CD in Machine Learning Systems
 

Mehr von Akira Shibata

大規模言語モデル開発を支える分散学習技術 - 東京工業大学横田理央研究室の藤井一喜さん
大規模言語モデル開発を支える分散学習技術 - 東京工業大学横田理央研究室の藤井一喜さん大規模言語モデル開発を支える分散学習技術 - 東京工業大学横田理央研究室の藤井一喜さん
大規模言語モデル開発を支える分散学習技術 - 東京工業大学横田理央研究室の藤井一喜さんAkira Shibata
 
W&B monthly meetup#7 Intro.pdf
W&B monthly meetup#7 Intro.pdfW&B monthly meetup#7 Intro.pdf
W&B monthly meetup#7 Intro.pdfAkira Shibata
 
20230705 - Optuna Integration (to share).pdf
20230705 - Optuna Integration (to share).pdf20230705 - Optuna Integration (to share).pdf
20230705 - Optuna Integration (to share).pdfAkira Shibata
 
W&B Seminar #5(to share).pdf
W&B Seminar #5(to share).pdfW&B Seminar #5(to share).pdf
W&B Seminar #5(to share).pdfAkira Shibata
 
makoto shing (stability ai) - image model fine-tuning - wandb_event_230525.pdf
makoto shing (stability ai) - image model fine-tuning - wandb_event_230525.pdfmakoto shing (stability ai) - image model fine-tuning - wandb_event_230525.pdf
makoto shing (stability ai) - image model fine-tuning - wandb_event_230525.pdfAkira Shibata
 
LLM Webinar - シバタアキラ to share.pdf
LLM Webinar - シバタアキラ to share.pdfLLM Webinar - シバタアキラ to share.pdf
LLM Webinar - シバタアキラ to share.pdfAkira Shibata
 
Akira shibata at developer summit 2016
Akira shibata at developer summit 2016Akira shibata at developer summit 2016
Akira shibata at developer summit 2016Akira Shibata
 
PyData.Tokyo Hackathon#2 TensorFlow
PyData.Tokyo Hackathon#2 TensorFlowPyData.Tokyo Hackathon#2 TensorFlow
PyData.Tokyo Hackathon#2 TensorFlowAkira Shibata
 
20150421 日経ビッグデータカンファレンス
20150421 日経ビッグデータカンファレンス20150421 日経ビッグデータカンファレンス
20150421 日経ビッグデータカンファレンスAkira Shibata
 
人工知能をビジネスに活かす
人工知能をビジネスに活かす人工知能をビジネスに活かす
人工知能をビジネスに活かすAkira Shibata
 
LHCにおける素粒子ビッグデータの解析とROOTライブラリ(Big Data Analysis at LHC and ROOT)
LHCにおける素粒子ビッグデータの解析とROOTライブラリ(Big Data Analysis at LHC and ROOT)LHCにおける素粒子ビッグデータの解析とROOTライブラリ(Big Data Analysis at LHC and ROOT)
LHCにおける素粒子ビッグデータの解析とROOTライブラリ(Big Data Analysis at LHC and ROOT)Akira Shibata
 
PyData Tokyo Tutorial & Hackathon #1
PyData Tokyo Tutorial & Hackathon #1PyData Tokyo Tutorial & Hackathon #1
PyData Tokyo Tutorial & Hackathon #1Akira Shibata
 
PyData NYC by Akira Shibata
PyData NYC by Akira ShibataPyData NYC by Akira Shibata
PyData NYC by Akira ShibataAkira Shibata
 
20141127 py datatokyomeetup2
20141127 py datatokyomeetup220141127 py datatokyomeetup2
20141127 py datatokyomeetup2Akira Shibata
 
The LHC Explained by CNN
The LHC Explained by CNNThe LHC Explained by CNN
The LHC Explained by CNNAkira Shibata
 
Analysis Software Development
Analysis Software DevelopmentAnalysis Software Development
Analysis Software DevelopmentAkira Shibata
 

Mehr von Akira Shibata (20)

大規模言語モデル開発を支える分散学習技術 - 東京工業大学横田理央研究室の藤井一喜さん
大規模言語モデル開発を支える分散学習技術 - 東京工業大学横田理央研究室の藤井一喜さん大規模言語モデル開発を支える分散学習技術 - 東京工業大学横田理央研究室の藤井一喜さん
大規模言語モデル開発を支える分散学習技術 - 東京工業大学横田理央研究室の藤井一喜さん
 
W&B monthly meetup#7 Intro.pdf
W&B monthly meetup#7 Intro.pdfW&B monthly meetup#7 Intro.pdf
W&B monthly meetup#7 Intro.pdf
 
20230705 - Optuna Integration (to share).pdf
20230705 - Optuna Integration (to share).pdf20230705 - Optuna Integration (to share).pdf
20230705 - Optuna Integration (to share).pdf
 
W&B Seminar #5(to share).pdf
W&B Seminar #5(to share).pdfW&B Seminar #5(to share).pdf
W&B Seminar #5(to share).pdf
 
makoto shing (stability ai) - image model fine-tuning - wandb_event_230525.pdf
makoto shing (stability ai) - image model fine-tuning - wandb_event_230525.pdfmakoto shing (stability ai) - image model fine-tuning - wandb_event_230525.pdf
makoto shing (stability ai) - image model fine-tuning - wandb_event_230525.pdf
 
LLM Webinar - シバタアキラ to share.pdf
LLM Webinar - シバタアキラ to share.pdfLLM Webinar - シバタアキラ to share.pdf
LLM Webinar - シバタアキラ to share.pdf
 
W&B Seminar #4.pdf
W&B Seminar #4.pdfW&B Seminar #4.pdf
W&B Seminar #4.pdf
 
Data x
Data xData x
Data x
 
Akira shibata at developer summit 2016
Akira shibata at developer summit 2016Akira shibata at developer summit 2016
Akira shibata at developer summit 2016
 
PyData.Tokyo Hackathon#2 TensorFlow
PyData.Tokyo Hackathon#2 TensorFlowPyData.Tokyo Hackathon#2 TensorFlow
PyData.Tokyo Hackathon#2 TensorFlow
 
20150421 日経ビッグデータカンファレンス
20150421 日経ビッグデータカンファレンス20150421 日経ビッグデータカンファレンス
20150421 日経ビッグデータカンファレンス
 
人工知能をビジネスに活かす
人工知能をビジネスに活かす人工知能をビジネスに活かす
人工知能をビジネスに活かす
 
LHCにおける素粒子ビッグデータの解析とROOTライブラリ(Big Data Analysis at LHC and ROOT)
LHCにおける素粒子ビッグデータの解析とROOTライブラリ(Big Data Analysis at LHC and ROOT)LHCにおける素粒子ビッグデータの解析とROOTライブラリ(Big Data Analysis at LHC and ROOT)
LHCにおける素粒子ビッグデータの解析とROOTライブラリ(Big Data Analysis at LHC and ROOT)
 
PyData Tokyo Tutorial & Hackathon #1
PyData Tokyo Tutorial & Hackathon #1PyData Tokyo Tutorial & Hackathon #1
PyData Tokyo Tutorial & Hackathon #1
 
20150128 cross2015
20150128 cross201520150128 cross2015
20150128 cross2015
 
PyData NYC by Akira Shibata
PyData NYC by Akira ShibataPyData NYC by Akira Shibata
PyData NYC by Akira Shibata
 
20141127 py datatokyomeetup2
20141127 py datatokyomeetup220141127 py datatokyomeetup2
20141127 py datatokyomeetup2
 
The LHC Explained by CNN
The LHC Explained by CNNThe LHC Explained by CNN
The LHC Explained by CNN
 
LHC for Students
LHC for StudentsLHC for Students
LHC for Students
 
Analysis Software Development
Analysis Software DevelopmentAnalysis Software Development
Analysis Software Development
 

Kürzlich hochgeladen

➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...gajnagarg
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraGovindSinghDasila
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...gajnagarg
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...amitlee9823
 
Kaggle and data science

  • 1. © DataRobot, Inc. All rights reserved. Kaggle and Data Science Japan, 2018
  • 2. Sergey Yurgenson, Director, Advanced Data Science Services, Kaggle Grandmaster
  • 3. Kaggle
      ● Kaggle is a platform for data science competitions.
      ● It was created by Anthony Goldbloom in 2010 in Australia and then moved to San Francisco.
      ● In March 2017 it was acquired by Google.
      ● Many other start-ups are now trying to replicate the idea, but Kaggle is still the best-known name in the data science community.
      ● As of this talk, Kaggle has hosted more than 280 competitions and has more than 1 million members from more than 190 countries.
  • 4. Kaggle competitions
      ● Most Kaggle competitions are predictive modeling competitions.
      ● Participants are given training data to train their models and test data with unknown targets.
      ● Participants compute predictions for the test data and submit them to the Kaggle platform.
      ● Prediction accuracy is evaluated with a predefined objective metric, and the result is reported back to participants.
      ● The model performance of all participants is public, so participants can compare the quality of their models with those of others.
      ● Many competitions have monetary prizes for top finishers.
  • 5. Kaggle competitions
  • 6. Kaggle ranking
      ● Based on competition performance, Kaggle ranks members using points and awards titles for top finishes in competitions.
      ● For example, to earn the title of Master, a member needs one gold medal and two silver medals. For competitions with 1000 participants, that means finishing once in the top 10 places and twice in the top 50.
  • 7. Kaggle ranking
  • 8. Kaggle and Data Science
  • 9. Why do you dislike Kaggle?
      ● Kaggle competitions do not have much in common with real Data Science
          ○ The problems are already well formulated with metrics predefined. In an industry setting there is ambiguity, and knowing what to solve is one of the key steps towards a solution.
          ○ Data in most cases is already provided and is relatively clean.
          ○ The goal is more leaderboard driven than understanding driven. Winning a competition, rather than understanding why an approach works, is the top priority. Results may not be trustworthy.
          ○ There are chances of overfitting to test data with repeated submissions.
          ○ In most cases the solution is an ensemble of algorithms and not “productionizable”.
      https://www.quora.com/Why-do-you-dislike-Kaggle
  • 10. True or False?
      ● “The problems are already well formulated with metrics predefined. In an industry setting there is ambiguity, and knowing what to solve is one of the key steps towards a solution.”
      https://www.quora.com/Why-do-you-dislike-Kaggle
  • 11. Problem is well formulated
      Mostly true, however...
      ● The need for criteria is an inherent property of any competition.
      ● In the real world not all data scientists are free to select and reformulate the problem. Many problems come already defined, with specific success criteria assigned.
      ● We learn many subjects and skills by solving predefined problems and doing predefined exercises. We learn math and physics by solving problems from textbooks, where the problems are already formulated. By solving problems we also learn how to formulate them and which approach suits a particular data science situation.
      ● We also have to admit that evaluating the business value of solving a problem is completely out of scope for Kaggle competitions, while business value analysis and problem prioritization are an important part of many real-life data science projects.
  • 12. True or False?
      ● “Data in most cases is already provided and is relatively clean.”
      https://www.quora.com/Why-do-you-dislike-Kaggle
  • 13. Data is clean
      Half true
      ● In many competitions the datasets
          ○ Are very big
          ○ Have multiple tables
          ○ Contain duplicated and mislabeled records
          ○ Combine structured and unstructured data
      ● Some competitions encourage the search for additional sources of data
      ● There are many data leaks
      ● Often feature names and meanings are not provided, making the problem even more difficult than in the real world
      ● Data may be intentionally distorted to conform to data privacy laws
  • 14. Data is clean
      ● Complex data structure
      ● Big datasets
      ● No meaningful feature names
  • 15. Data is clean
      ● Kaggle competitions teach unique data manipulation skills:
          ○ Dealing with data under hardware limitations: efficient code, smart sampling, clever encoding…
          ○ Using EDA to uncover the meaning of data without relying on labels or other provided information
          ○ Discovering data leaks through data analysis
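As one possible illustration of "smart sampling" under hardware limitations, the sketch below uses reservoir sampling to draw a fixed-size uniform sample from a stream that is never held in memory as a whole. The slides do not name a specific technique, so the choice of reservoir sampling, the function name `reservoir_sample`, and the synthetic `rows` generator are all assumptions made for this example.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Return k items drawn uniformly at random from an iterable of unknown length.

    Only k items are kept in memory at any time (classic "Algorithm R").
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Hypothetical stand-in for a dataset too large to load at once.
rows = (f"row-{i}" for i in range(100_000))
subset = reservoir_sample(rows, k=5)
print(subset)
```

The same function works on a file object or a database cursor, since it only needs to iterate over the input once.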
  • 16. True or False?
      ● The goal is more leaderboard driven than understanding driven. Winning a competition, rather than understanding why an approach works, is the top priority. Results may not be trustworthy.
      https://www.quora.com/Why-do-you-dislike-Kaggle
  • 17. No understanding
      True, but maybe not that important
      ● The objection assumes that a model we cannot understand is less valuable than a model we can understand
          ○ A model is not necessarily used for knowledge discovery
          ○ In real life we often use and rely on things we do not completely understand
          ○ If something we do not understand cannot be trusted, then how do we ever trust other people?
          ○ Even a complex machine learning model may be a simplification of an even more complex real system
  • 18. No understanding
      ● The objection ignores all the new research on model interpretability
          ○ Feature importance
          ○ Reason codes
          ○ Partial dependence plots
          ○ Surrogate models
          ○ Neuron activation visualization
          ○ ...
      ● These methods allow us to analyze and understand the behaviour of models as complicated as GBMs and neural networks
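To make the "feature importance" item concrete, here is a minimal sketch of permutation importance, one common way to measure how much a black-box model relies on each feature: shuffle one column at a time and record how much the error grows. The slides do not say which importance method is meant, so this choice is an assumption, and `black_box`, `mse`, and `permutation_importance` are hypothetical names written for this example rather than calls from any library.

```python
import random


def mse(model, X, y):
    """Mean squared error of a model (any callable row -> prediction)."""
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_features, seed=0):
    """Increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(n_features):
        column = [row[j] for row in X]
        rng.shuffle(column)
        # Rebuild the dataset with only column j permuted.
        X_perm = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
        importances.append(mse(model, X_perm, y) - baseline)
    return importances


def black_box(row):
    """Hypothetical model: leans on feature 0, slightly on feature 1, ignores feature 2."""
    return 3.0 * row[0] + 0.5 * row[1]


rng = random.Random(42)
X = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(500)]
y = [black_box(row) for row in X]

scores = permutation_importance(black_box, X, y, n_features=3)
for j, score in enumerate(scores):
    print(f"feature {j}: importance {score:.3f}")
```

The procedure only calls the model, never inspects it, so the same sketch would apply unchanged to a GBM or a neural network wrapped in a function.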
  • 19. No understanding?
  • 20. True or False?
      ● There are chances of overfitting to test data with repeated submissions.
      https://www.quora.com/Why-do-you-dislike-Kaggle
  • 21. Overfitting
      False
      ● A complete misunderstanding of how Kaggle works
          ○ Test data in a Kaggle competition is split into two parts: public and private
          ○ During the competition models are evaluated only on the public part of the test set
          ○ Final results are based only on the private part of the test set
          ○ Thus the final model evaluation is based on completely new data
      ● One of the first lessons all competition participants learn very fast
          ○ Do not overfit the leaderboard.
          ○ Create a training/validation partition that reflects the test data as closely as possible, including seasonality effects and data drift
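The public/private mechanism can be sketched with a small simulation. It assumes a binary target, a 30/70 public/private split, and 300 purely random submissions; all of these numbers are illustrative and not taken from any real competition. Picking the submission with the best public score rewards luck on the public rows, and that luck does not carry over to the private rows.

```python
import random

rng = random.Random(0)
n_test = 3000
truth = [rng.random() < 0.5 for _ in range(n_test)]

# Hypothetical 30/70 split of the test rows into public and private parts.
indices = list(range(n_test))
rng.shuffle(indices)
public_idx = indices[:900]
private_idx = indices[900:]


def accuracy(predictions, idx):
    """Share of correct predictions on the given subset of test rows."""
    return sum(predictions[i] == truth[i] for i in idx) / len(idx)


# "Overfit the leaderboard": submit random guesses and keep the best public score.
best_public, best_submission = -1.0, None
for _ in range(300):
    submission = [rng.random() < 0.5 for _ in range(n_test)]
    score = accuracy(submission, public_idx)
    if score > best_public:
        best_public, best_submission = score, submission

private_score = accuracy(best_submission, private_idx)
print(f"best public accuracy: {best_public:.3f}")
print(f"private accuracy of that submission: {private_score:.3f}")
```

The selected submission looks better than chance on the public part, while its private accuracy stays close to 0.5, which is why the final ranking on the private part is not inflated by repeated submissions.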
  • 22. True or False?
      ● In most cases the solution is an ensemble of algorithms and not “productionizable”.
      https://www.quora.com/Why-do-you-dislike-Kaggle
  • 23. Difficult to put in production
      Half true, half false
      ● Yes, in most cases top models are complicated ensembles
      ● They are difficult to put in production if one does it one by one, for each model separately
      ● It is easy if one uses an appropriately developed platform that can handle many models and blenders
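A rough sketch of why a blender need not complicate deployment: if every model is exposed as a callable, a weighted-average blender can hide the whole ensemble behind a single `predict` call. The `Blender` class and the three stand-in models below are hypothetical and written only for this illustration; they are not the API of any particular platform.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], float]


class Blender:
    """Weighted average of several models behind one predict() call."""

    def __init__(self, models: Sequence[Model], weights: Sequence[float]):
        if len(models) != len(weights):
            raise ValueError("need exactly one weight per model")
        total = sum(weights)
        if total <= 0:
            raise ValueError("weights must sum to a positive number")
        self.models: List[Model] = list(models)
        # Normalize so the blended output stays on the same scale as the inputs.
        self.weights: List[float] = [w / total for w in weights]

    def predict(self, row: Sequence[float]) -> float:
        return sum(w * model(row) for model, w in zip(self.models, self.weights))


# Hypothetical stand-ins for three trained models.
def linear(row):
    return 0.2 * row[0] + 0.1


def stump(row):
    return 0.9 if row[1] > 0.5 else 0.3


def constant(row):
    return 0.5


ensemble = Blender([linear, stump, constant], weights=[2, 1, 1])
prediction = ensemble.predict([1.0, 0.7])
print(f"blended prediction: {prediction:.3f}")
```

Callers see one object with one method, so swapping or adding a model changes the constructor arguments and nothing else.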
  • 24. True or False?
      ● Sometimes, a 0.01 difference in AUC can be the difference between 1st place and 294th place (out of 626). Those marginal gains take significant time and effort that may not be worthwhile in the face of other projects and priorities.
      https://www.quora.com/How-similar-are-Kaggle-competitions-to-what-data-scientists-do
  • 25. Marginal gain is not valuable
      Not always true
      ● Often we ourselves advise clients on the balance between time spent and model performance
      ● However, in the investment world a 0.01 AUC difference means a difference of millions of dollars in gain or loss
      ● The competitive aspect of data science problems with small margins drives innovation
          ○ New preprocessing steps
          ○ New feature engineering ideas
          ○ Continuous testing of new algorithms and implementations (GBM - XGBoost - LightGBM - CatBoost)
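For readers unfamiliar with the metric, AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that pairwise definition; `roc_auc` is a name chosen for this example, and the four labels and scores are made-up data.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked in the right order."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ordered pair
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(f"AUC: {auc:.2f}")
```

Under this definition, a 0.01 change in AUC corresponds to one percent of all positive/negative pairs switching order. The quadratic loop is kept for clarity; a rank-based formula would be the usual choice on large datasets.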
  • 26. Kaggle and Data Science
      ● “Kaggle competitions cover a decent amount of what a data scientist does. The two big missing pieces are:
          ○ 1. taking a business problem and specifying it as a data science problem (which includes pulling the data and structuring it so that it addresses that business problem).
          ○ 2. putting models into production.”
      Anthony Goldbloom
  • 27. Kaggle and Data Science
      ● Kaggle is a competition
      ● “Real” Data Science is ... also a competition
  • 28. Kaggle to “real life” Data Science
      ● DataRobot - created by top Kagglers
          ○ Owen Zhang, Product Advisor. Highest: #1
          ○ Xavier Conort, Chief Data Scientist. Highest: 1st
          ○ Sergey Yurgenson, Director - AI Services. Highest: 1st
          ○ Jeremy Achin, CEO & Co-Founder. Highest: 20th
          ○ Tom de Godoy, CTO & Co-Founder. Highest: 20th
          ○ Amanda Schierz, Data Scientist. Highest: 24
      DataRobot automatically replicates the steps seasoned data scientists take. This allows non-technical business users to create accurate predictive models and data scientists to add to their existing tool set.
  • 29. Kaggle and Data Science