SlideShare ist ein Scribd-Unternehmen logo
1 von 42
Downloaden Sie, um offline zu lesen
Debugging Machine Learning.
Mostly for profit but with a bit of fun too!
Michał Łopuszyński
PyData Warsaw, 19.10.2017
About me
In my previous professional life, I was a theoretical physicist.
I got a PhD in solid state physics
•
For the last 5 years I work as a Data Scientist in ICM, University of Warsaw•
Agenda
4 failure modes of ML systems•
9 simple hints what to do•
Hey, my
ML system
does not
work at all...
Hint #1
Check your code
AKA it is engineering, stupid!
Write tests
Do not strive for 100% coverage, partial coverage is infinitely better
then none!
•
In Python, doctests are your friends•
"Hidden" benefits of tests•
Better code structure•
Executable documentation•
Test the fragile parts first•
Test your code with Monte Carlo / synthetic data
Try to generate a trivial case for your
ML system
•
This tests the whole pipeline (transforming/training) and allows for exploration of
your system performance under unusual conditions
•
If you have generative model,
prepare Monte Carlo data from
assumed distributions
•
Perturb the original data, by
generating the output with known
and learnable prescription
•
Code style
Single responsibility principle•
Do not repeat yourself (DRY!)•
Have and apply coding standard (PEP8)•
Consider using a linter•
Short functions and methods•
https://xkcd.com/844/
pycodestyle checker (formerly pep8)•
yapf formater•
pylint, pyflakes, pychecker
Naming
Minimal requirement - be consistent!•
Interesting software development books, offering chapters on naming•
Freely available chapter on names
Hint #2
Check your data
Data quality audits are difficult
Happy families are all alike; every unhappy family
is unhappy in its own way.
Leo Tolstoy
Like families, tidy datasets are all alike but every
messy dataset is messy in its own way.
Hadley Wickham
H. Wickham, Tidy Data, JSS 59 (2014), doi: 10.18637/jss.v059.i10
Images credit - Wikipedia
Data quality
Beware, your data providers usually
overestimate the data quality
•
Think of outliers, missing values (and how the are represented)•
Understand your data•
Do exploratory data analysis•
Visualize, visualize, visualize•
Talk to the domain expert•
Is your data correct, complete, coherent, stationary (seasonality!), deduplicated,
up-to-date, representative (unbiased)
•
OK, my ML system
works, but I think it
should perform
better...
Hint #3
Examine your features
Features
Features make a difference!•
Be creative with your features•
Try meaningful transformations,
combinations (products/ratios), decorrelation...
•
Think of good representations for non-tabular data
(text, time-series)
•
Make conscious decision about missing values•
Understand what features are important for your model•
Use ML models offering feature ranking•
Use feature selection methods•
ID F1 F2 F3
Hint #4
Examine your data points
Data points
Find difficult data points! (DDP)•
DDP = notoriously misclassified (or high error) cases
in your cross-validation loop for large variety of models
•
ID
P1
P2
P3
P4
P5
Examine DDPs, understand them!•
In the easiest case, remove DDPs from the dataset
(think outliers, mislabeled examples)
•
Influence functions
Influence functions
Best Paper Award
ICML 2017
Data points
Get more data!• ID
P1
P2
P3
P4
P5
Trick 1. Extend your set with artificial data
E.g., data augmentation in image processing,
SMOTE algorithm for imbalanced datasets
•
Good performance booster, rarely applicable•
Trick 2. Generate automatically noisy labeled
data set by heuristics, e.g. distant supervision in NLP
(requires unlabeled data!)
•
Trick 3. Semisupervised learning methods
self-training and co-training (requires unlabeled data!)
•
Hint #5
Examine your model
Why my model predicts what it predicts? (philosophical slide)
How do you answer why questions?•
Inspiring homework: watch Richard Feynman, Fun to imagine on magnets (youtube)•
Model introspection
You can answer thy why question, only for very simple models
(e.g., linear model, basic decision trees)
•
Sometimes, it is instructive to run such a simple model on your dataset, even
though it does not provide top-level performance
•
You can boost your simple model by feeding it with more advanced
(non-linearly transformed) features
•
Complex model introspection
LIME algorithm = Local Interpretable Model-agnostic Explanations
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016)
Complex model introspection – practical issues
Lime authors provided open source Python implementation as lime package•
http://eli5.readthedocs.io/en/latest/tutorials•
Another option eli5 package, aimed more generally at model introspection
and explanation (includes lime implementation)
•
Visualizing models
Display model in
a data space
Look at collection of
models at once
Explore the process of
model fitting
The ASA Data Science Journal 8, 203 (2015), doi: 10.1002/sam.11271
So my ML system
works on test data,
but you tell me it
fails in production?
Hint #6
Watch out for overfitting
Overfitting
If you torture the data long enough,
it will confess.
Roald Coase
Hint #7
Watch out for data leakage
Data leakage
Some time ago, I used to thing data leaks are trivial to avoid•
They are not! (Look at number of Kaggle competitions flawed by Data Leakage)•
You may introduce them yourself
E.g. meaningful identifiers, past & future separation in time series
•
You may receive them in the data from your provider•
Good paper•
Hint #8
Watch out for covariate shift
What is covariate shift?
Training data
y
X
What is covariate shift?
Model
Training data
X
y
What is covariate shift?
Model
Noiseless reality
Training data
y
X
What is covariate shift?
Model
Noiseless reality
Training data
Production data
(Test data)
y
X
Covariate shift
Unlike overfitting and data leakage, it is easier to detect•
Method: Try to build classifier differentiating train from production (test).
If you succeed, you very likely have a problem
•
Basic remedy – reweighting data points. Give production-like data higher impact
on your model
•
The quality of my
super ML system
deteriorates with
time, really?
Really really?
ML system in production
2009: Hooray we can predict flu
epidemics from Google query data!
2014: Hmm... Can we?
ML system in production
NIPS 2015 paper
Hint #9
Remember monitoring & maintenance
AKA it is engineering again, stupid!
5
Thank you!
Questions?
@lopusz

Weitere ähnliche Inhalte

Was ist angesagt?

Data Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLData Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLPaco Nathan
 
MLconf seattle 2015 presentation
MLconf seattle 2015 presentationMLconf seattle 2015 presentation
MLconf seattle 2015 presentationehtshamelahi
 
Production machine learning_infrastructure
Production machine learning_infrastructureProduction machine learning_infrastructure
Production machine learning_infrastructurejoshwills
 
Square's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanSquare's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanHakka Labs
 
Recommendations for Building Machine Learning Software
Recommendations for Building Machine Learning SoftwareRecommendations for Building Machine Learning Software
Recommendations for Building Machine Learning SoftwareJustin Basilico
 
Top 10 Data Science Practioner Pitfalls - Mark Landry
Top 10 Data Science Practioner Pitfalls - Mark LandryTop 10 Data Science Practioner Pitfalls - Mark Landry
Top 10 Data Science Practioner Pitfalls - Mark LandrySri Ambati
 
A Fast Decision Rule Engine for Anomaly Detection
A Fast Decision Rule Engine for Anomaly DetectionA Fast Decision Rule Engine for Anomaly Detection
A Fast Decision Rule Engine for Anomaly DetectionDatabricks
 
Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...
Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...
Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...Sri Ambati
 
H2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandryH2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandrySri Ambati
 
Personalized Page Generation for Browsing Recommendations
Personalized Page Generation for Browsing RecommendationsPersonalized Page Generation for Browsing Recommendations
Personalized Page Generation for Browsing RecommendationsJustin Basilico
 
Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...
Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...
Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...Sri Ambati
 
DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...
DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...
DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...DevOpsDays Riga
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsSri Ambati
 
[2017/2018] RESEARCH in software engineering
[2017/2018] RESEARCH in software engineering[2017/2018] RESEARCH in software engineering
[2017/2018] RESEARCH in software engineeringIvano Malavolta
 
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Xavier Amatriain
 
Well, That Escalated Quickly: Anomaly Detection with Elastic Machine Learning
Well, That Escalated Quickly: Anomaly Detection with Elastic Machine LearningWell, That Escalated Quickly: Anomaly Detection with Elastic Machine Learning
Well, That Escalated Quickly: Anomaly Detection with Elastic Machine LearningDevFest DC
 
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...Aseda Owusua Addai-Deseh
 
Production and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsProduction and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsTuri, Inc.
 
MLConf - Emmys, Oscars & Machine Learning Algorithms at Netflix
MLConf - Emmys, Oscars & Machine Learning Algorithms at NetflixMLConf - Emmys, Oscars & Machine Learning Algorithms at Netflix
MLConf - Emmys, Oscars & Machine Learning Algorithms at NetflixXavier Amatriain
 

Was ist angesagt? (20)

Data Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLData Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area ML
 
MLconf seattle 2015 presentation
MLconf seattle 2015 presentationMLconf seattle 2015 presentation
MLconf seattle 2015 presentation
 
Production machine learning_infrastructure
Production machine learning_infrastructureProduction machine learning_infrastructure
Production machine learning_infrastructure
 
Square's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanSquare's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong Yan
 
Recommendations for Building Machine Learning Software
Recommendations for Building Machine Learning SoftwareRecommendations for Building Machine Learning Software
Recommendations for Building Machine Learning Software
 
Top 10 Data Science Practioner Pitfalls - Mark Landry
Top 10 Data Science Practioner Pitfalls - Mark LandryTop 10 Data Science Practioner Pitfalls - Mark Landry
Top 10 Data Science Practioner Pitfalls - Mark Landry
 
A Fast Decision Rule Engine for Anomaly Detection
A Fast Decision Rule Engine for Anomaly DetectionA Fast Decision Rule Engine for Anomaly Detection
A Fast Decision Rule Engine for Anomaly Detection
 
Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...
Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...
Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product...
 
H2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark LandryH2O World - Top 10 Data Science Pitfalls - Mark Landry
H2O World - Top 10 Data Science Pitfalls - Mark Landry
 
Personalized Page Generation for Browsing Recommendations
Personalized Page Generation for Browsing RecommendationsPersonalized Page Generation for Browsing Recommendations
Personalized Page Generation for Browsing Recommendations
 
Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...
Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...
Robust approach to machine learning models comparison - Dmitry Larko, Sr. Dat...
 
DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...
DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...
DevOpsDaysRiga 2017 ignite: Mikhail Iljin - DevOps meets Data Science - how t...
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner Pitfalls
 
[2017/2018] RESEARCH in software engineering
[2017/2018] RESEARCH in software engineering[2017/2018] RESEARCH in software engineering
[2017/2018] RESEARCH in software engineering
 
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
 
Well, That Escalated Quickly: Anomaly Detection with Elastic Machine Learning
Well, That Escalated Quickly: Anomaly Detection with Elastic Machine LearningWell, That Escalated Quickly: Anomaly Detection with Elastic Machine Learning
Well, That Escalated Quickly: Anomaly Detection with Elastic Machine Learning
 
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
Day 2 (Lecture 5): A Practitioner's Perspective on Building Machine Product i...
 
Managing machine learning
Managing machine learningManaging machine learning
Managing machine learning
 
Production and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsProduction and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning Models
 
MLConf - Emmys, Oscars & Machine Learning Algorithms at Netflix
MLConf - Emmys, Oscars & Machine Learning Algorithms at NetflixMLConf - Emmys, Oscars & Machine Learning Algorithms at Netflix
MLConf - Emmys, Oscars & Machine Learning Algorithms at Netflix
 

Ähnlich wie Debugging machine-learning

Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018HJ van Veen
 
Machine learning 101 dkom 2017
Machine learning 101 dkom 2017Machine learning 101 dkom 2017
Machine learning 101 dkom 2017fredverheul
 
Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning CCG
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and HadoopDonald Miner
 
10 Limitations of Large Language Models and Mitigation Options
10 Limitations of Large Language Models and Mitigation Options10 Limitations of Large Language Models and Mitigation Options
10 Limitations of Large Language Models and Mitigation OptionsMihai Criveti
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data ScienceDonald Miner
 
Ideas spracklen-final
Ideas spracklen-finalIdeas spracklen-final
Ideas spracklen-finalsupportlogic
 
Choosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your needChoosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your needGibDevs
 
Reading Notes : the practice of programming
Reading Notes : the practice of programmingReading Notes : the practice of programming
Reading Notes : the practice of programmingJuggernaut Liu
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceNiko Vuokko
 
MachineLearningSparkML.pptx
MachineLearningSparkML.pptxMachineLearningSparkML.pptx
MachineLearningSparkML.pptxsnigdhaagrawal11
 
MachineLearningSparkML.pptx
MachineLearningSparkML.pptxMachineLearningSparkML.pptx
MachineLearningSparkML.pptxAbderrahmanABID2
 
MachineLearningSparkML AI and expert Systems
MachineLearningSparkML AI and expert SystemsMachineLearningSparkML AI and expert Systems
MachineLearningSparkML AI and expert Systemsshreenathji26
 
MachineLearningSparkML.pptx
MachineLearningSparkML.pptxMachineLearningSparkML.pptx
MachineLearningSparkML.pptxharikaramisetty3
 
How to build a data science project in a corporate setting, by Soraya Christi...
How to build a data science project in a corporate setting, by Soraya Christi...How to build a data science project in a corporate setting, by Soraya Christi...
How to build a data science project in a corporate setting, by Soraya Christi...WiMLDSMontreal
 
Karen Lopez 10 Physical Data Modeling Blunders
Karen Lopez 10 Physical Data Modeling BlundersKaren Lopez 10 Physical Data Modeling Blunders
Karen Lopez 10 Physical Data Modeling BlundersKaren Lopez
 
Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...
Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...
Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...panagenda
 
Making Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms ReliableMaking Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms ReliableJustin Basilico
 

Ähnlich wie Debugging machine-learning (20)

Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
 
Machine learning 101 dkom 2017
Machine learning 101 dkom 2017Machine learning 101 dkom 2017
Machine learning 101 dkom 2017
 
Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and Hadoop
 
10 Limitations of Large Language Models and Mitigation Options
10 Limitations of Large Language Models and Mitigation Options10 Limitations of Large Language Models and Mitigation Options
10 Limitations of Large Language Models and Mitigation Options
 
Hadoop for Data Science
Hadoop for Data ScienceHadoop for Data Science
Hadoop for Data Science
 
Ideas spracklen-final
Ideas spracklen-finalIdeas spracklen-final
Ideas spracklen-final
 
Choosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your needChoosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your need
 
Architecting for Data Science
Architecting for Data ScienceArchitecting for Data Science
Architecting for Data Science
 
Reading Notes : the practice of programming
Reading Notes : the practice of programmingReading Notes : the practice of programming
Reading Notes : the practice of programming
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
MachineLearningSparkML.pptx
MachineLearningSparkML.pptxMachineLearningSparkML.pptx
MachineLearningSparkML.pptx
 
MachineLearningSparkML.pptx
MachineLearningSparkML.pptxMachineLearningSparkML.pptx
MachineLearningSparkML.pptx
 
MachineLearningSparkML AI and expert Systems
MachineLearningSparkML AI and expert SystemsMachineLearningSparkML AI and expert Systems
MachineLearningSparkML AI and expert Systems
 
MachineLearningSparkML.pptx
MachineLearningSparkML.pptxMachineLearningSparkML.pptx
MachineLearningSparkML.pptx
 
MachineLearningSparkML.pptx
MachineLearningSparkML.pptxMachineLearningSparkML.pptx
MachineLearningSparkML.pptx
 
How to build a data science project in a corporate setting, by Soraya Christi...
How to build a data science project in a corporate setting, by Soraya Christi...How to build a data science project in a corporate setting, by Soraya Christi...
How to build a data science project in a corporate setting, by Soraya Christi...
 
Karen Lopez 10 Physical Data Modeling Blunders
Karen Lopez 10 Physical Data Modeling BlundersKaren Lopez 10 Physical Data Modeling Blunders
Karen Lopez 10 Physical Data Modeling Blunders
 
Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...
Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...
Way #5 Don’t end up in a ditch because you weren’t aware of roadblocks in you...
 
Making Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms ReliableMaking Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms Reliable
 

Kürzlich hochgeladen

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...amitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...amitlee9823
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 

Kürzlich hochgeladen (20)

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 

Debugging machine-learning

  • 1. Debugging Machine Learning. Mostly for profit but with a bit of fun too! Michał Łopuszyński PyData Warsaw, 19.10.2017
  • 2. About me In my previous professional life, I was a theoretical physicist. I got a PhD in solid state physics • For the last 5 years I work as a Data Scientist in ICM, University of Warsaw•
  • 3. Agenda 4 failure modes of ML systems• 9 simple hints what to do•
  • 4. Hey, my ML system does not work at all...
  • 5. Hint #1 Check your code AKA it is engineering, stupid!
  • 6. Write tests Do not strive for 100% coverage, partial coverage is infinitely better then none! • In Python, doctests are your friends• "Hidden" benefits of tests• Better code structure• Executable documentation• Test the fragile parts first•
  • 7. Test your code with Monte Carlo / synthetic data Try to generate a trivial case for your ML system • This tests the whole pipeline (transforming/training) and allows for exploration of your system performance under unusual conditions • If you have generative model, prepare Monte Carlo data from assumed distributions • Perturb the original data, by generating the output with known and learnable prescription •
  • 8. Code style Single responsibility principle• Do not repeat yourself (DRY!)• Have and apply coding standard (PEP8)• Consider using a linter• Short functions and methods• https://xkcd.com/844/ pycodestyle checker (formerly pep8)• yapf formater• pylint, pyflakes, pychecker
  • 9. Naming Minimal requirement - be consistent!• Interesting software development books, offering chapters on naming• Freely available chapter on names
  • 11. Data quality audits are difficult Happy families are all alike; every unhappy family is unhappy in its own way. Leo Tolstoy Like families, tidy datasets are all alike but every messy dataset is messy in its own way. Hadley Wickham H. Wickham, Tidy Data, JSS 59 (2014), doi: 10.18637/jss.v059.i10 Images credit - Wikipedia
  • 12. Data quality Beware, your data providers usually overestimate the data quality • Think of outliers, missing values (and how the are represented)• Understand your data• Do exploratory data analysis• Visualize, visualize, visualize• Talk to the domain expert• Is your data correct, complete, coherent, stationary (seasonality!), deduplicated, up-to-date, representative (unbiased) •
  • 13. OK, my ML system works, but I think it should perform better...
  • 15. Features Features make a difference!• Be creative with your features• Try meaningful transformations, combinations (products/ratios), decorrelation... • Think of good representations for non-tabular data (text, time-series) • Make conscious decision about missing values• Understand what features are important for your model• Use ML models offering feature ranking• Use feature selection methods• ID F1 F2 F3
  • 16. Hint #4 Examine your data points
  • 17. Data points Find difficult data points! (DDP)• DDP = notoriously misclassified (or high error) cases in your cross-validation loop for large variety of models • ID P1 P2 P3 P4 P5 Examine DDPs, understand them!• In the easiest case, remove DDPs from the dataset (think outliers, mislabeled examples) •
  • 20. Data points Get more data!• ID P1 P2 P3 P4 P5 Trick 1. Extend your set with artificial data E.g., data augmentation in image processing, SMOTE algorithm for imbalanced datasets • Good performance booster, rarely applicable• Trick 2. Generate automatically noisy labeled data set by heuristics, e.g. distant supervision in NLP (requires unlabeled data!) • Trick 3. Semisupervised learning methods self-training and co-training (requires unlabeled data!) •
  • 22. Why my model predicts what it predicts? (philosophical slide) How do you answer why questions?• Inspiring homework: watch Richard Feynman, Fun to imagine on magnets (youtube)•
  • 23. Model introspection You can answer thy why question, only for very simple models (e.g., linear model, basic decision trees) • Sometimes, it is instructive to run such a simple model on your dataset, even though it does not provide top-level performance • You can boost your simple model by feeding it with more advanced (non-linearly transformed) features •
  • 24. Complex model introspection LIME algorithm = Local Interpretable Model-agnostic Explanations ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016)
  • 25. Complex model introspection – practical issues Lime authors provided open source Python implementation as lime package• http://eli5.readthedocs.io/en/latest/tutorials• Another option eli5 package, aimed more generally at model introspection and explanation (includes lime implementation) •
  • 26. Visualizing models Display model in a data space Look at collection of models at once Explore the process of model fitting The ASA Data Science Journal 8, 203 (2015), doi: 10.1002/sam.11271
  • 27. So my ML system works on test data, but you tell me it fails in production?
  • 28. Hint #6 Watch out for overfitting
  • 29. Overfitting If you torture the data long enough, it will confess. Roald Coase
  • 30. Hint #7 Watch out for data leakage
  • 31. Data leakage Some time ago, I used to thing data leaks are trivial to avoid• They are not! (Look at number of Kaggle competitions flawed by Data Leakage)• You may introduce them yourself E.g. meaningful identifiers, past & future separation in time series • You may receive them in the data from your provider• Good paper•
  • 32. Hint #8 Watch out for covariate shift
  • 33. What is covariate shift? Training data y X
  • 34. What is covariate shift? Model Training data X y
  • 35. What is covariate shift? Model Noiseless reality Training data y X
  • 36. What is covariate shift? Model Noiseless reality Training data Production data (Test data) y X
  • 37. Covariate shift Unlike overfitting and data leakage, it is easier to detect• Method: Try to build classifier differentiating train from production (test). If you succeed, you very likely have a problem • Basic remedy – reweighting data points. Give production-like data higher impact on your model •
  • 38. The quality of my super ML system deteriorates with time, really? Really really?
  • 39. ML system in production 2009: Hooray we can predict flu epidemics from Google query data! 2014: Hmm... Can we?
  • 40. ML system in production NIPS 2015 paper
  • 41. Hint #9 Remember monitoring & maintenance AKA it is engineering again, stupid! 5