An Intro to Kaggle
By Lex Toumbourou
Senior Consultant at Thoughtworks
Part 1: Kaggle Overview
What is Kaggle?
● Founded in 2010 in Australia
● Acquired by Google in 2017
● Host of data science competitions
● Largest data science community, with 536,000 registered users
Why Kaggle?
● Good resource for turning theoretical skills into practical skills
● Learn from other data scientists
● Gain reputation
Getting started with competitions
● What problems are you interested
in solving?
● What computational budget do you
have?
● Is the competition a good match for
your level?
Competition evaluation and rules
● What is the goal of the
competition?
● How is it evaluated?
○ Accuracy
○ Log Loss
○ Root mean squared error
○ Area under the ROC curve
○ F1 score
○ (many more)
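The metrics listed above are all available in scikit-learn. The following is a minimal sketch on made-up labels and predictions (the arrays are hypothetical, not competition data), just to show which function matches which metric:

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    log_loss,
    mean_squared_error,
    roc_auc_score,
)

# Hypothetical binary-classification labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])
y_pred = (y_prob >= 0.5).astype(int)  # hard labels at a 0.5 threshold

accuracy = accuracy_score(y_true, y_pred)  # needs hard labels
logloss = log_loss(y_true, y_prob)         # needs probabilities
auc = roc_auc_score(y_true, y_prob)        # needs scores or probabilities
f1 = f1_score(y_true, y_pred)              # needs hard labels

# Hypothetical regression targets for root mean squared error.
r_true = np.array([3.0, 5.0, 2.5])
r_pred = np.array([2.5, 5.0, 4.0])
rmse = np.sqrt(mean_squared_error(r_true, r_pred))

print(accuracy, logloss, auc, f1, rmse)
```

Note that some metrics score hard labels and others score probabilities, so the competition metric decides what the submission should contain.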
Datasets
● 3 main files:
○ Train.csv
○ Test.csv
○ Sample submissions.csv
● Important to read data documentation
● Kaggle CLI useful for downloading datasets on headless computers:
kaggle competitions download -c house-prices-advanced-regression-techniques
Loading dataset (useful Pandas one-liner)
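The code on this slide did not survive extraction. A plausible version, assuming the data is a CSV with an id column and a date column, is a single `pd.read_csv` call. The column names and the inline CSV text below are hypothetical stand-ins so the sketch runs without a download:

```python
import io

import pandas as pd

# Stand-in for train.csv; the column names are hypothetical.
csv_text = "Id,SaleDate,SalePrice\n1,2010-02-01,208500\n2,2010-05-15,181500\n"

# With a real file this would be pd.read_csv("train.csv", ...).
# index_col uses the id column as the index; parse_dates converts
# the date column to datetime64 while loading.
df = pd.read_csv(io.StringIO(csv_text), index_col="Id", parse_dates=["SaleDate"])

print(df.dtypes)
```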
Leaderboard
● Split into public and private
leaderboard.
● Be careful not to overfit to the public portion of the test set.
● Equal scores = the earliest submission wins.
Submissions
● Predictions provided as a CSV with row id
and prediction value(s)
● Some predictions are scored for the public leaderboard, the rest for the private one.
● Usually limited to 5 submissions per day.
● At competition conclusion, pick 2
submissions to use on private
leaderboard.
Generating submission one-liner
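The code on this slide did not survive extraction. A minimal sketch, assuming a submission format of one id column plus one prediction column (the `Id` and `SalePrice` names and the values are hypothetical; the real names come from the competition's sample submission file):

```python
import io

import pandas as pd

# Hypothetical test-set ids and model predictions.
test_ids = [1461, 1462, 1463]
preds = [169000.0, 187500.5, 183000.25]

submission = pd.DataFrame({"Id": test_ids, "SalePrice": preds})

# With a real competition: submission.to_csv("submission.csv", index=False)
# An in-memory buffer is used here so the sketch leaves no file behind.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)  # index=False avoids an extra column
csv_text = buffer.getvalue()

print(csv_text)
```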
Kernels
● Computers provided by Kaggle, including GPUs.
● Allows for sharing results with others.
● Scripts allow you to submit predictions directly after running code.
Discussion forums
● Lots of useful insights.
● Competition winners will usually have read the forums in full.
Part 2: Getting Started
Tools
● Usually Python or R
● Jupyter Notebooks (interactive
development)
● Numpy (linear algebra)
● Pandas (structured data)
● Matplotlib
● Scikit-learn (models and ML tools)
● PyTorch or Tensorflow/Keras (neural
networks)
Model selection
● Dependent on problem
● Tree-based (RandomForests, XGBoost,
LightGBM) - good starting point for
structured data
● Linear Models (SVM, Logistic Reg) - still
useful for certain problems.
● Neural Networks (CNN, RNN) - image,
text and speech data, sometimes
structured
Choosing a validation method
● Train / val split
● Cross-validation
● Out-of-bag error
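The three validation methods above can be sketched with scikit-learn as follows. The synthetic dataset and the choice of a random forest are assumptions made for illustration; out-of-bag scoring only applies to bagged models such as random forests.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for a competition training set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 1. Train / val split: hold out 20% of the rows for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
model.fit(X_train, y_train)
holdout_acc = model.score(X_val, y_val)

# 2. Cross-validation: 5 folds, each used once as the validation fold.
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5
)

# 3. Out-of-bag: each tree is scored on the rows left out of its bootstrap
#    sample, so no separate validation set is needed.
oob_acc = model.oob_score_

print(holdout_acc, cv_scores.mean(), oob_acc)
```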
Fast iteration
● Run experiments on a subset of
your data.
● Good validation strategy.
● Save complex model stacking
and ensembling until after you’ve
maximized feature engineering.
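Running experiments on a subset can be as simple as sampling rows before training. A small sketch on a hypothetical DataFrame, assuming a 10% random sample is representative enough for quick experiments:

```python
import numpy as np
import pandas as pd

# Hypothetical training data with 1,000 rows.
df = pd.DataFrame({"feature": np.arange(1000), "target": np.arange(1000) % 2})

# Draw 10% of the rows; a fixed random_state keeps experiments comparable.
subset = df.sample(frac=0.1, random_state=0)

print(len(subset))
```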
Preparing data
● Model dependent
● Careful feature preparation and
engineering usually quite
important.
● 4 main column types: continuous, ordinal, categorical and date
Image by Tobias Fischer
Continuous (aka numeric) features
● Scaling recommended (non-tree models)
sklearn.preprocessing.MinMaxScaler
sklearn.preprocessing.StandardScaler
● Outlier cleaning (non-tree models)
Winsorization: clip values beyond the 1st and 99th percentiles
log(x)
● Data imputation (fill in missing values)
df['SomeValue_isna'] = df.SomeValue.isna()
df['SomeValue'] = df.SomeValue.fillna(df.SomeValue.median())
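Putting the steps on this slide together, a minimal sketch on a made-up column might look like the following. The data is hypothetical, winsorization is done here by clipping at the 1st and 99th percentiles, and `np.log1p` is swapped in for a plain `log(x)` so that zeros do not break the transform.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric column with one missing value and one outlier.
df = pd.DataFrame({"SomeValue": [1.0, 2.0, np.nan, 4.0, 1000.0]})

# Imputation: record which rows were missing BEFORE filling them,
# then fill with the median and assign the result back.
df["SomeValue_isna"] = df["SomeValue"].isna()
df["SomeValue"] = df["SomeValue"].fillna(df["SomeValue"].median())

# Winsorization: clip values to the 1st and 99th percentiles.
low, high = df["SomeValue"].quantile([0.01, 0.99])
df["SomeValue_wins"] = df["SomeValue"].clip(lower=low, upper=high)

# Log transform: compresses the long right tail.
df["SomeValue_log"] = np.log1p(df["SomeValue"])

# Scaling: zero mean and unit variance, mainly useful for non-tree models.
scaler = StandardScaler()
df["SomeValue_scaled"] = scaler.fit_transform(df[["SomeValue_wins"]]).ravel()

print(df)
```

In a competition the scaler and the percentile bounds would normally be fitted on the training data and then reused on the test data.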
Categorical features
● Ensure order of ordinal columns
df['Rating'] = df.Rating.cat.set_categories([1, 2, 3], ordered=True)
● One-hot encode non-ordinal columns
dummies = pd.get_dummies(df[cat_columns], dummy_na=True)
df = pd.concat([df, dummies], axis=1)
https://datascience.stackexchange.com/questions/30215/what-is-one-hot-encoding-in-tensorflow
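A self-contained sketch of both encodings, on a hypothetical DataFrame with one ordinal column (`Rating`) and one unordered column (`Colour`). Dropping the original text column after one-hot encoding is an extra step added here, since most models cannot consume it directly.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Rating is ordinal, Colour has no natural order.
df = pd.DataFrame({
    "Rating": [1, 3, 2, 3],
    "Colour": ["red", "blue", np.nan, "red"],
})

# Ordinal: declare the category order explicitly so 1 < 2 < 3 is preserved.
df["Rating"] = pd.Categorical(df["Rating"], categories=[1, 2, 3], ordered=True)
rating_codes = df["Rating"].cat.codes  # integer codes that follow the order

# Non-ordinal: one column per category, plus one for missing values.
cat_columns = ["Colour"]
dummies = pd.get_dummies(df[cat_columns], dummy_na=True)
df = pd.concat([df.drop(columns=cat_columns), dummies], axis=1)

print(df)
```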
Date time features
● Lots of information in a single date:
○ Day of week
○ Day of month
○ Is it a weekend?
○ Is it a public holiday?
● Lots of handy methods in the dt attribute of a Pandas column, which can be added as new columns
Image by Charisse Kenion
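The date parts listed on this slide can be derived through the `dt` accessor. A sketch on a hypothetical `SaleDate` column follows; the holiday list is made up for illustration, and a real project would use a proper calendar for the relevant country.

```python
import pandas as pd

# Hypothetical date column.
df = pd.DataFrame({
    "SaleDate": pd.to_datetime(["2019-01-04", "2019-01-05", "2019-01-26"])
})

df["SaleDate_dayofweek"] = df["SaleDate"].dt.dayofweek  # Monday=0 ... Sunday=6
df["SaleDate_day"] = df["SaleDate"].dt.day              # day of month
df["SaleDate_is_weekend"] = df["SaleDate"].dt.dayofweek >= 5

# Hypothetical list of public holidays.
public_holidays = pd.to_datetime(["2019-01-01", "2019-01-26"])
df["SaleDate_is_holiday"] = df["SaleDate"].dt.normalize().isin(public_holidays)

print(df)
```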
Feature engineering
● Combining columns (adding values together, multiplying, dividing, etc.)
● Adding additional data sources*
○ Things nearby to house
○ Weather on the day
○ Etc etc
* Ensure competition allows it
● Discover Feature Engineering -
great article
Image by Chester Alvarez
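As an illustration of both ideas on this slide, the sketch below combines columns and joins an external data source. Every column name and value is hypothetical, and external data should only be used when the competition rules allow it.

```python
import pandas as pd

# Hypothetical housing data.
df = pd.DataFrame({
    "SaleDate": ["2019-01-04", "2019-01-05"],
    "FirstFloorArea": [800, 1200],
    "SecondFloorArea": [400, 0],
    "Bedrooms": [3, 2],
})

# Combining columns: sums and ratios often carry more signal than raw values.
df["TotalArea"] = df["FirstFloorArea"] + df["SecondFloorArea"]
df["AreaPerBedroom"] = df["TotalArea"] / df["Bedrooms"]

# Additional data source: hypothetical weather on the day of sale.
weather = pd.DataFrame({
    "SaleDate": ["2019-01-04", "2019-01-05"],
    "MaxTemp": [31.5, 24.0],
})
df = df.merge(weather, on="SaleDate", how="left")  # left join keeps every sale

print(df)
```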
Hyperparameter (aka settings) tuning
● Hyperparam = parameter that isn’t learned by the model.
● Manually (try some values
and see what happens)
● Automated
○ RandomizedSearchCV
(sklearn)
○ GridSearchCV (sklearn)
○ Hyperopt
○ Spearmint
○ Lots more...
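Of the automated options above, `RandomizedSearchCV` is probably the quickest to try. A minimal sketch follows, in which the synthetic data, the random forest and the search ranges are all assumptions chosen to keep the run short:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for a competition training set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical search space: distributions are sampled, lists are chosen from.
param_distributions = {
    "n_estimators": randint(20, 100),
    "max_depth": [3, 5, None],
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,         # number of random settings to try
    cv=3,             # each setting is scored with 3-fold cross-validation
    random_state=0,
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

`GridSearchCV` has the same interface but takes a `param_grid` of explicit value lists and tries every combination.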
Stacking / ensembling (aka combining models)
● Most winning solutions are a combination of models.
● Averaging predictions of multiple
models
● “Meta models”: a model trained on
predictions of multiple models.
http://www.chioka.in/stacking-blending-and-stacked-generalization/
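Both approaches can be sketched with scikit-learn. The base models and synthetic data below are assumptions; `StackingClassifier` is one ready-made way to train a meta model on out-of-fold predictions, not necessarily what winning teams use.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a competition training set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Averaging: fit two different models and average their probabilities.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
avg_prob = (
    forest.predict_proba(X_val)[:, 1] + logreg.predict_proba(X_val)[:, 1]
) / 2
avg_pred = (avg_prob >= 0.5).astype(int)

# Meta model: a logistic regression trained on the base models'
# cross-validated predictions.
stack = StackingClassifier(
    estimators=[
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
stack_acc = stack.score(X_val, y_val)

print((avg_pred == y_val).mean(), stack_acc)
```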
Fin.
