SlideShare ist ein Scribd-Unternehmen logo
1 von 37
Downloaden Sie, um offline zu lesen
Thrifty Machine Learning
Cultivating cost-conscious modeling habits
PyData Global
2020
Overview
Machine
Learning on a
Budget
The
Mother of
Invention
2 3
An
Embarrassment
of Riches
1
Dr.
Rebecca
Bilbro
Machine Learning Engineer @ Unisys
Applied Text Analysis with Python
Scikit-Yellowbrick
@rebeccabilbro
An Embarrassment of Riches
Open
Source
Software
Hardware / CloudMarket
cpus gpus tpus
Arxiv
&Reddit
&Kaggle
&Medium
&Twitter
&Google
&StackOverFlow
“A fool and [their]
money are soon
parted.”
“By sowing frugality we
reap liberty, a golden
harvest.”
- Agesilaus
Machine Learning
on a Budget
The Supervised Learning Problem
11
Labeled Training Data
Define a set of target classes &
build a training dataset that
has been annotated with those
class labels.
Feature Transformation(s)
Take raw data and convert into
vector form ahead of model
training.
Classifier Algorithm
Train a model to recognize
target classes using labeled
training data. Tune parameters
to reduce false positives
and/or false negatives.
This part is slow and boring.
“Mais il faut cultiver
notre jardin.”
- Voltaire, Candide
Getting Labeled Data
● Started with Amazon Mechanical Turk.
● Now there are many commercial providers of data labeling
and data annotation services.
● It can be quite expensive.
○ 100,000 samples for $25,000 - $75,000
● It’s just people, actually…
○ Semuels, Alana. “The Internet Is Enabling a New Kind of Poorly Paid
Hell.” The Atlantic, January 23, 2018.
● Doesn’t usually work for domain-specific data.
● Quality tends to vary.
Which Model is Best?
Bayesian Decision Tree Dense Feedforward
Which Model is Best?
Bayesian Decision Tree Dense Feedforward
for this dataset/problem space
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection as ms
classifiers = [
KNeighborsClassifier(5),
SVC(kernel="linear", C=0.025),
RandomForestClassifier(max_depth=5),
AdaBoostClassifier(),
GaussianNB(),
]
kfold = ms.KFold(len(X), n_folds=12)
max([
ms.cross_val_score(model, X, y, cv=kfold).mean
for model in classifiers
])
How to Select a Model
● Start with a simple model, or better yet, try several in
parallel!
● Filter out the weak performers, and only tune the best.
● Set an initial baseline.
● Use these preliminary steps to prepare for hyperparameter
tuning.
Hyperparameter Tuning
● Grid search
● Randomized search
● Bayesian optimization
● Evolutionary optimization
● Population-based training
● Gradient-based optimization
● “Auto ML” (see above, but pay $$$)
● Search is difficult, particularly in
high dimensional space.
● Even with clever optimization
techniques, there is no
guarantee of a solution.
● As the search space gets larger,
the amount of time increases
exponentially.
Unfortunately...
Reduce feature space
Evaluate train/test ratio
Understand parameter ranges
Thoughtful Tuning
● Only tune the best performing models.
● Try to reduce your feature space.
● Understand the parameter ranges you’re searching.
● Move towards complexity purposefully
○ Understand error from variance vs. error from bias.
○ The model underfits.
○ The error doesn’t converge.
● Move towards complexity gradually
○ While both train and test scores are increasing (or error decreasing).
Development Environments
Development Environments
Prototype Locally First
● Consider: are these conveniences really necessary/useful
at the prototyping phase?
○ Probably not
● Don’t default to using cloud-hosted, Spark-running
notebooks for everything!
● Configure Python to run locally (one-time cost).
● VSCode, PyCharm, etc, support Jupyter notebooks now.
● Downsampling your data is cheap!
“Those who cannot
remember the past are
condemned to repeat it.”
- George Santayana
Serialize Everything
● The model
● Engineered features
● Feature vectors/embeddings
● Stopwords
● Lexicons
● Scores
● Diagnostic plots
● Training times
And any other artifacts or
metadata!
When is an ML Model “done”?
● When you have achieved an accuracy measure above your
threshold.
● When your error bounds are within your pre-defined target
range.
● When your cross-validation demonstrates a convergence in
training and test data.
● When the sprint is over.
● When the project is due
Plan for this, not this
The Mother of
Invention…
When we shift our collective
mindset toward model
thriftiness rather than the
relentless pursuit of a tiny bit
more F1, there’s no telling what
new things we might discover…
Thank you!
@rebeccabilbro
Slide template by Slidesgo
Icons by Flaticon
Images & infographics by Freepik
Hardware
Serialization
Objectivity
Simplicity
“Done”
Annotation

Weitere ähnliche Inhalte

Was ist angesagt?

Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)Hayim Makabee
 
General Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsGeneral Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsMark Peng
 
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learntKaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learntEugene Yan Ziyou
 
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15MLconf
 
VSSML16 LR1. Summary Day 1
VSSML16 LR1. Summary Day 1VSSML16 LR1. Summary Day 1
VSSML16 LR1. Summary Day 1BigML, Inc
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature EngineeringHJ van Veen
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature EngineeringSri Ambati
 
GA.-.Presentation
GA.-.PresentationGA.-.Presentation
GA.-.Presentationoldmanpat
 
Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21Gülden Bilgütay
 
Introduction to Machine Learning in Python using Scikit-Learn
Introduction to Machine Learning in Python using Scikit-LearnIntroduction to Machine Learning in Python using Scikit-Learn
Introduction to Machine Learning in Python using Scikit-LearnAmol Agrawal
 
Featurizing log data before XGBoost
Featurizing log data before XGBoostFeaturizing log data before XGBoost
Featurizing log data before XGBoostDataRobot
 
Feature Engineering
Feature Engineering Feature Engineering
Feature Engineering odsc
 
Brief introduction to Machine Learning
Brief introduction to Machine LearningBrief introduction to Machine Learning
Brief introduction to Machine LearningCodeForFrankfurt
 
Model remodeling with modern deep learning frameworks
Model remodeling with modern deep learning frameworksModel remodeling with modern deep learning frameworks
Model remodeling with modern deep learning frameworksrosentep
 
VSSML16 L2. Ensembles and Logistic Regression
VSSML16 L2. Ensembles and Logistic RegressionVSSML16 L2. Ensembles and Logistic Regression
VSSML16 L2. Ensembles and Logistic RegressionBigML, Inc
 
H2O World - Ensembles with Erin LeDell
H2O World - Ensembles with Erin LeDellH2O World - Ensembles with Erin LeDell
H2O World - Ensembles with Erin LeDellSri Ambati
 
notes as .ppt
notes as .pptnotes as .ppt
notes as .pptbutest
 
Automatic machine learning (AutoML) 101
Automatic machine learning (AutoML) 101Automatic machine learning (AutoML) 101
Automatic machine learning (AutoML) 101QuantUniversity
 
Start machine learning in 5 simple steps
Start machine learning in 5 simple stepsStart machine learning in 5 simple steps
Start machine learning in 5 simple stepsRenjith M P
 

Was ist angesagt? (20)

Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)
 
General Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsGeneral Tips for participating Kaggle Competitions
General Tips for participating Kaggle Competitions
 
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learntKaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
 
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15
 
VSSML16 LR1. Summary Day 1
VSSML16 LR1. Summary Day 1VSSML16 LR1. Summary Day 1
VSSML16 LR1. Summary Day 1
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
GA.-.Presentation
GA.-.PresentationGA.-.Presentation
GA.-.Presentation
 
Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21
 
Introduction to Machine Learning in Python using Scikit-Learn
Introduction to Machine Learning in Python using Scikit-LearnIntroduction to Machine Learning in Python using Scikit-Learn
Introduction to Machine Learning in Python using Scikit-Learn
 
Featurizing log data before XGBoost
Featurizing log data before XGBoostFeaturizing log data before XGBoost
Featurizing log data before XGBoost
 
Feature Engineering
Feature Engineering Feature Engineering
Feature Engineering
 
AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)
 
Brief introduction to Machine Learning
Brief introduction to Machine LearningBrief introduction to Machine Learning
Brief introduction to Machine Learning
 
Model remodeling with modern deep learning frameworks
Model remodeling with modern deep learning frameworksModel remodeling with modern deep learning frameworks
Model remodeling with modern deep learning frameworks
 
VSSML16 L2. Ensembles and Logistic Regression
VSSML16 L2. Ensembles and Logistic RegressionVSSML16 L2. Ensembles and Logistic Regression
VSSML16 L2. Ensembles and Logistic Regression
 
H2O World - Ensembles with Erin LeDell
H2O World - Ensembles with Erin LeDellH2O World - Ensembles with Erin LeDell
H2O World - Ensembles with Erin LeDell
 
notes as .ppt
notes as .pptnotes as .ppt
notes as .ppt
 
Automatic machine learning (AutoML) 101
Automatic machine learning (AutoML) 101Automatic machine learning (AutoML) 101
Automatic machine learning (AutoML) 101
 
Start machine learning in 5 simple steps
Start machine learning in 5 simple stepsStart machine learning in 5 simple steps
Start machine learning in 5 simple steps
 

Ähnlich wie PyData Global: Thrifty Machine Learning

Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitionsOwen Zhang
 
AI hype or reality
AI  hype or realityAI  hype or reality
AI hype or realityAwantik Das
 
Bootstrapping of PySpark Models for Factorial A/B Tests
Bootstrapping of PySpark Models for Factorial A/B TestsBootstrapping of PySpark Models for Factorial A/B Tests
Bootstrapping of PySpark Models for Factorial A/B TestsDatabricks
 
Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...
Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...
Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...Alok Singh
 
2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptxgdgsurrey
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkIvo Andreev
 
Winning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangWinning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangVivian S. Zhang
 
Using GANs to improve generalization in a semi-supervised setting - trying it...
Using GANs to improve generalization in a semi-supervised setting - trying it...Using GANs to improve generalization in a semi-supervised setting - trying it...
Using GANs to improve generalization in a semi-supervised setting - trying it...PyData
 
Semi-supervised learning with GANs
Semi-supervised learning with GANsSemi-supervised learning with GANs
Semi-supervised learning with GANsterek47
 
Jay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AIJay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AIAI Frontiers
 
Fine tuning large LMs
Fine tuning large LMsFine tuning large LMs
Fine tuning large LMsSylvainGugger
 
Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Giridhar Addepalli
 
Adtech x Scala x Performance tuning
Adtech x Scala x Performance tuningAdtech x Scala x Performance tuning
Adtech x Scala x Performance tuningYosuke Mizutani
 
Using Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsUsing Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsScott Clark
 
Using Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsUsing Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsSigOpt
 
ODSC West 2022 – Kitbashing in ML
ODSC West 2022 – Kitbashing in MLODSC West 2022 – Kitbashing in ML
ODSC West 2022 – Kitbashing in MLBryan Bischof
 
Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning CCG
 

Ähnlich wie PyData Global: Thrifty Machine Learning (20)

Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitions
 
AI hype or reality
AI  hype or realityAI  hype or reality
AI hype or reality
 
Bootstrapping of PySpark Models for Factorial A/B Tests
Bootstrapping of PySpark Models for Factorial A/B TestsBootstrapping of PySpark Models for Factorial A/B Tests
Bootstrapping of PySpark Models for Factorial A/B Tests
 
Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...
Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...
Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...
 
2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
 
Winning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen ZhangWinning data science competitions, presented by Owen Zhang
Winning data science competitions, presented by Owen Zhang
 
Using GANs to improve generalization in a semi-supervised setting - trying it...
Using GANs to improve generalization in a semi-supervised setting - trying it...Using GANs to improve generalization in a semi-supervised setting - trying it...
Using GANs to improve generalization in a semi-supervised setting - trying it...
 
Semi-supervised learning with GANs
Semi-supervised learning with GANsSemi-supervised learning with GANs
Semi-supervised learning with GANs
 
Managing machine learning
Managing machine learningManaging machine learning
Managing machine learning
 
Jay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AIJay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AI
 
Fine tuning large LMs
Fine tuning large LMsFine tuning large LMs
Fine tuning large LMs
 
Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01
 
Adtech x Scala x Performance tuning
Adtech x Scala x Performance tuningAdtech x Scala x Performance tuning
Adtech x Scala x Performance tuning
 
Bagging.pptx
Bagging.pptxBagging.pptx
Bagging.pptx
 
Using Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsUsing Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning Models
 
Using Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsUsing Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning Models
 
ODSC West 2022 – Kitbashing in ML
ODSC West 2022 – Kitbashing in MLODSC West 2022 – Kitbashing in ML
ODSC West 2022 – Kitbashing in ML
 
C3 w1
C3 w1C3 w1
C3 w1
 
Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning
 

Mehr von Rebecca Bilbro

Data Structures for Data Privacy: Lessons Learned in Production
Data Structures for Data Privacy: Lessons Learned in ProductionData Structures for Data Privacy: Lessons Learned in Production
Data Structures for Data Privacy: Lessons Learned in ProductionRebecca Bilbro
 
Conflict-Free Replicated Data Types (PyCon 2022)
Conflict-Free Replicated Data Types (PyCon 2022)Conflict-Free Replicated Data Types (PyCon 2022)
Conflict-Free Replicated Data Types (PyCon 2022)Rebecca Bilbro
 
Anti-Entropy Replication for Cost-Effective Eventual Consistency
Anti-Entropy Replication for Cost-Effective Eventual ConsistencyAnti-Entropy Replication for Cost-Effective Eventual Consistency
Anti-Entropy Replication for Cost-Effective Eventual ConsistencyRebecca Bilbro
 
The Promise and Peril of Very Big Models
The Promise and Peril of Very Big ModelsThe Promise and Peril of Very Big Models
The Promise and Peril of Very Big ModelsRebecca Bilbro
 
Beyond Off the-Shelf Consensus
Beyond Off the-Shelf ConsensusBeyond Off the-Shelf Consensus
Beyond Off the-Shelf ConsensusRebecca Bilbro
 
Visual diagnostics at scale
Visual diagnostics at scaleVisual diagnostics at scale
Visual diagnostics at scaleRebecca Bilbro
 
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019Rebecca Bilbro
 
A Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsA Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsRebecca Bilbro
 
Learning machine learning with Yellowbrick
Learning machine learning with YellowbrickLearning machine learning with Yellowbrick
Learning machine learning with YellowbrickRebecca Bilbro
 
Data Intelligence 2017 - Building a Gigaword Corpus
Data Intelligence 2017 - Building a Gigaword CorpusData Intelligence 2017 - Building a Gigaword Corpus
Data Intelligence 2017 - Building a Gigaword CorpusRebecca Bilbro
 
Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)Rebecca Bilbro
 
NLP for Everyday People
NLP for Everyday PeopleNLP for Everyday People
NLP for Everyday PeopleRebecca Bilbro
 
Commerce Data Usability Project
Commerce Data Usability ProjectCommerce Data Usability Project
Commerce Data Usability ProjectRebecca Bilbro
 

Mehr von Rebecca Bilbro (15)

Data Structures for Data Privacy: Lessons Learned in Production
Data Structures for Data Privacy: Lessons Learned in ProductionData Structures for Data Privacy: Lessons Learned in Production
Data Structures for Data Privacy: Lessons Learned in Production
 
Conflict-Free Replicated Data Types (PyCon 2022)
Conflict-Free Replicated Data Types (PyCon 2022)Conflict-Free Replicated Data Types (PyCon 2022)
Conflict-Free Replicated Data Types (PyCon 2022)
 
Anti-Entropy Replication for Cost-Effective Eventual Consistency
Anti-Entropy Replication for Cost-Effective Eventual ConsistencyAnti-Entropy Replication for Cost-Effective Eventual Consistency
Anti-Entropy Replication for Cost-Effective Eventual Consistency
 
The Promise and Peril of Very Big Models
The Promise and Peril of Very Big ModelsThe Promise and Peril of Very Big Models
The Promise and Peril of Very Big Models
 
Beyond Off the-Shelf Consensus
Beyond Off the-Shelf ConsensusBeyond Off the-Shelf Consensus
Beyond Off the-Shelf Consensus
 
Visual diagnostics at scale
Visual diagnostics at scaleVisual diagnostics at scale
Visual diagnostics at scale
 
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019
 
A Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsA Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and Distributions
 
Words in space
Words in spaceWords in space
Words in space
 
Camlis
CamlisCamlis
Camlis
 
Learning machine learning with Yellowbrick
Learning machine learning with YellowbrickLearning machine learning with Yellowbrick
Learning machine learning with Yellowbrick
 
Data Intelligence 2017 - Building a Gigaword Corpus
Data Intelligence 2017 - Building a Gigaword CorpusData Intelligence 2017 - Building a Gigaword Corpus
Data Intelligence 2017 - Building a Gigaword Corpus
 
Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)
 
NLP for Everyday People
NLP for Everyday PeopleNLP for Everyday People
NLP for Everyday People
 
Commerce Data Usability Project
Commerce Data Usability ProjectCommerce Data Usability Project
Commerce Data Usability Project
 

Kürzlich hochgeladen

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 

Kürzlich hochgeladen (20)

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 

PyData Global: Thrifty Machine Learning

  • 1. Thrifty Machine Learning Cultivating cost-conscious modeling habits PyData Global 2020
  • 2. Overview Machine Learning on a Budget The Mother of Invention 2 3 An Embarrassment of Riches 1
  • 3. Dr. Rebecca Bilbro Machine Learning Engineer @ Unisys Applied Text Analysis with Python Scikit-Yellowbrick @rebeccabilbro
  • 8. “A fool and [their] money are soon parted.”
  • 9. “By sowing frugality we reap liberty, a golden harvest.” - Agesilaus
  • 11. The Supervised Learning Problem 11 Labeled Training Data Define a set of target classes & build a training dataset that has been annotated with those class labels. Feature Transformation(s) Take raw data and convert into vector form ahead of model training. Classifier Algorithm Train a model to recognize target classes using labeled training data. Tune parameters to reduce false positives and/or false negatives. This part is slow and boring.
  • 12.
  • 13. “Mais il faut cultiver notre jardin.” - Voltaire, Candide
  • 14. Getting Labeled Data ● Started with Amazon Mechanical Turk. ● Now there are many commercial providers of data labeling and data annotation services. ● It can be quite expensive. ○ 100,000 samples for $25,000 - $75,000 ● It’s just people, actually… ○ Semuels, Alana. “The Internet Is Enabling a New Kind of Poorly Paid Hell.” The Atlantic, January 23, 2018. ● Doesn’t usually work for domain-specific data. ● Quality tends to vary.
  • 15. Which Model is Best? Bayesian Decision Tree Dense Feedforward
  • 16. Which Model is Best? Bayesian Decision Tree Dense Feedforward for this dataset/problem space
  • 17. from sklearn.svm import SVC from sklearn.naive_bayes import GaussianNB from sklearn.ensemble import AdaBoostClassifier from sklearn.neighbors import KNeighborsClassifier from sklearn.ensemble import RandomForestClassifier from sklearn import model_selection as ms classifiers = [ KNeighborsClassifier(5), SVC(kernel="linear", C=0.025), RandomForestClassifier(max_depth=5), AdaBoostClassifier(), GaussianNB(), ] kfold = ms.KFold(len(X), n_folds=12) max([ ms.cross_val_score(model, X, y, cv=kfold).mean for model in classifiers ])
  • 18. How to Select a Model ● Start with a simple model, or better yet, try several in parallel! ● Filter out the weak performers, and only tune the best. ● Set an initial baseline. ● Use these preliminary steps to prepare for hyperparameter tuning.
  • 19. Hyperparameter Tuning ● Grid search ● Randomized search ● Bayesian optimization ● Evolutionary optimization ● Population-based training ● Gradient-based optimization ● “Auto ML” (see above, but pay $$$)
  • 20. ● Search is difficult, particularly in high dimensional space. ● Even with clever optimization techniques, there is no guarantee of a solution. ● As the search space gets larger, the amount of time increases exponentially. Unfortunately...
  • 24. Thoughtful Tuning ● Only tune the best performing models. ● Try to reduce your feature space. ● Understand the parameter ranges you’re searching. ● Move towards complexity purposefully ○ Understand error from variance vs. error from bias. ○ The model underfits. ○ The error doesn’t converge. ● Move towards complexity gradually ○ While both train and test scores are increasing (or error decreasing).
  • 27. Prototype Locally First ● Consider: are these conveniences really necessary/useful at the prototyping phase? ○ Probably not ● Don’t default to using cloud-hosted, Spark-running notebooks for everything! ● Configure Python to run locally (one-time cost). ● VSCode, PyCharm, etc, support Jupyter notebooks now. ● Downsampling your data is cheap!
  • 28. “Those who cannot remember the past are condemned to repeat it.” - George Santayana
  • 29.
  • 30. Serialize Everything ● The model ● Engineered features ● Feature vectors/embeddings ● Stopwords ● Lexicons ● Scores ● Diagnostic plots ● Training times And any other artifacts or metadata!
  • 31. When is an ML Model “done”? ● When you have achieved an accuracy measure above your threshold. ● When your error bounds are within your pre-defined target range. ● When your cross-validation demonstrates a convergence in training and test data. ● When the sprint is over. ● When the project is due
  • 32. Plan for this, not this
  • 34. When we shift our collective mindset toward model thriftiness rather than the relentless pursuit of a tiny bit more F1, there’s no telling what new things we might discover…
  • 36. Slide template by Slidesgo Icons by Flaticon Images & infographics by Freepik