Automated Machine Learning
Machine Learning Engineer
Core Modeling Team
Teach sometimes
AI, Machine Learning, Summer/Winter ML Schools
Compete sometimes
Currently hold an Expert rank, top 2% worldwide
Why This Talk?
Seriously misunderstood creature, AutoML is
Image copyright © Warner Bros. Source: syfy.com
AutoML provides methods and processes to make ML available
for non-ML experts, to improve efficiency of ML and to accelerate
research on ML.
[www.automl.org]
Automated machine learning (AutoML) is the
process of automating end-to-end the process of
applying machine learning to real-world problems
Top 3 Questions Peers Ask Me
#1: Will all data scientists lose their jobs soon?
Image copyright © 20th Century Fox. Source: youtube.com
#2: AutoML is about a neural network
generating neural networks, right?
NIPS 2016 conference. Source: blog.ought.com
#3: DS/ML requires serious human expertise.
How can automation ever be “better”?
Image copyright © USA Network. Source: wallpaperplay.com
Three Levels of Scope
1. Academic AutoML
Advance human knowledge in fundamental AutoML methods
Get publications, citations, degrees, inspire R&D
2. Libraries and Open-Source AutoML Software
Refine academic ideas to technical feasibility, gain product engineering experience
Find peers, validate ideas with early adopters, build a community of practitioners
3. Commercial AutoML Product
Build a profitable business by solving real-world problems and delivering value at scale
(from small businesses and NGOs to largest corporations and governments)
focus of this talk
Some Background
🦄 Unicorn startup from Boston, MA
🗓 Developing AutoML products since 2012
💵 $430M of investments (Series E)
🏢 Hundreds of enterprise customers (including ⅓ of Fortune 50)
🔮 1.3 billion ML models built so far
👨‍💻 1000 employees @ ~50 locations around the globe
“DataRobot sets the standard for augmented data science and machine learning”
– Gartner Magic Quadrant for DS and ML Platforms, 2019
“DataRobot leads the pack with a broad set of robust capabilities”
– Forrester New Wave, Automation-Focused ML Solutions, Q2 2019
Recap: DS Value Generation
[Diagram: Business User, Problem, Raw Data, Data Science, Automation, Optimization, Actionable Insights, Bottom Line Improvement & Executive Decision Support]
Problem Framing
Data Prep & Annotation
Data Ingestion & Management
Partitioning
EDA & Quality Assessment
Feature Engineering
Modeling
Model Tuning
Evaluation & Selection
Software Construction
Deployment
Consumption
Model Maintenance
Risk & Compliance
Needs domain knowledge to do right
Hates doing
Enjoys doing and wants to keep doing it
Often lacks skills or methodology to do right
Persona: Data Scientist
In large organizations, a lot of “throwing over the wall” happens here
~85% of DS projects never make it to production [bit.ly/30PGOZM]
Recall The Earlier Definitions:
1. “Accessible for non-ML experts”
2. “End-to-end automation”
[Diagram: Problem Framing, Data Prep & Annotation, Ingestion, Partitioning, EDA & Quality Assessment, Feature Engineering, Modeling, Model Tuning, Evaluation & Selection, Software Construction, Deployment, Consumption, Model Maintenance, Risk & Compliance]
Vast majority of ML research focused here
Vast majority of AutoML research and emerging products focused here
Actually needed to deliver value in the real world
Sculley et al. (Google)
“Hidden Technical Debt in Machine Learning Systems” [NIPS 2015]
Ideal Goal
[Diagram: Business User, Raw Data, Definition of Business Objective, AutoML, Automatically Deployed Application with Monitoring and Continual Learning]
● Lots of capable and motivated people in non-DS teams that know the domain and can deliver value
● Data scientists focus on strategic projects, mentor “citizen data scientists”, and help with problem setup
Good AutoML:
1. Empowers non-experts but does not alienate experts.
2. Augments user’s domain knowledge with automation and fast iteration.
3. Provides guardrails and trust.
Enables more people to get more results with better quality.
Source: MovieFigures via youtube.com
Interesting Use Case: Model Factory
AutoML
● Models specific to data subsets (e.g. propensity per SKU)
● Models specific to time ranges (e.g. +1 day, +1 month forecast)
● Short-lived models with rapid refresh cycle (e.g. fraud, malware)
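A minimal sketch of the model factory idea, assuming one small least-squares model per data subset. The function names, the row layout, and the toy SKU data are hypothetical and only illustrate the pattern; a real factory would plug an AutoML run into the inner loop.

```python
from collections import defaultdict

import numpy as np


def train_model_factory(rows, key_fn, feature_fn, target_fn):
    """Fit one small least-squares model per data subset (e.g. per SKU)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)

    models = {}
    for key, subset in groups.items():
        # Design matrix: an intercept column followed by the features.
        X = np.array([[1.0, *feature_fn(r)] for r in subset])
        y = np.array([target_fn(r) for r in subset])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        models[key] = coef
    return models


def predict(models, key, features):
    """Score one row with the model that belongs to its subset."""
    coef = models[key]
    return float(coef[0] + np.dot(coef[1:], features))


# Hypothetical data: (sku, price, units_sold)
rows = [
    ("A", 1.0, 10.0), ("A", 2.0, 8.0), ("A", 3.0, 6.0),
    ("B", 1.0, 3.0), ("B", 2.0, 4.0), ("B", 3.0, 5.0),
]
models = train_model_factory(
    rows,
    key_fn=lambda r: r[0],
    feature_fn=lambda r: [r[1]],
    target_fn=lambda r: r[2],
)
```

Keeping the models in a dictionary keyed by subset makes the refresh cycle cheap: a single key can be retrained without touching the others.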
Interesting Challenges of Building an AutoML Product
Problem Framing
● Automatic detection of the modeling problem from data layout
(regression, binary, multiclass, multilabel, ranking, recommendation, ...)
● Are there datetime features in the data? Maybe it’s a time series forecasting
problem? Maybe there are multiple series along the same axis?
● Maybe there’s no target at all? (E.g., user is interested in anomaly detection)
● If there’s a target, can we figure out its distribution and recommend a reliable
optimization metric?
● Are there any prior constraints? (E.g., prediction range, monotonicity, weights)
● Does the data have valid tabular shape? Are there various data sources to merge?
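The first questions above can be sketched as a small heuristic that guesses the problem type from the target values. This is an illustration only: `infer_problem_type`, its return labels, and the `max_classes` threshold are assumptions, and it ignores time series, ranking, and recommendation.

```python
def infer_problem_type(target, max_classes=20):
    """Guess the modeling problem from raw target values.

    `target` is a list of labels, or None when the dataset has no target.
    The threshold is an illustrative assumption, not a tuned value.
    """
    if target is None:
        return "anomaly detection"

    values = [v for v in target if v is not None]
    if not values:
        return "anomaly detection"

    # A target made of label collections suggests multilabel.
    if all(isinstance(v, (list, tuple, set, frozenset)) for v in values):
        return "multilabel"

    distinct = set(values)
    if len(distinct) < 2:
        return "invalid (constant target)"
    if len(distinct) == 2:
        return "binary"

    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if numeric:
        has_fractions = any(not float(v).is_integer() for v in values)
        if has_fractions or len(distinct) > max_classes:
            return "regression"
    return "multiclass" if len(distinct) <= max_classes else "unknown"
```

The integer case shows why framing is hard to automate: a target with values 1 to 5 could be a rating to regress on or five classes, so a product would probably surface the guess and let the user override it.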
Data Preparation and Annotation
ⓘ Deep Feature Synthesis: automatic generation of features from snowflake-schema relational data
J. Kanter, K. Veeramachaneni. Deep feature synthesis: Towards automating data science endeavors. DSAA 2015.
“featuretools” Python package: https://github.com/Featuretools/featuretools
ⓘ Snorkel: rapid training data creation with weak supervision
https://github.com/snorkel-team/snorkel
https://arxiv.org/abs/1711.10160
● Is the target defined everywhere? Do we need weak supervision or active learning?
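To make the weak supervision idea concrete, here is a toy sketch in plain Python. It does not use the Snorkel API, and it swaps Snorkel's learned label model for a simple majority vote; the labeling functions and label names are made up for illustration.

```python
from collections import Counter

ABSTAIN = None


def lf_contains_refund(text):
    return "complaint" if "refund" in text.lower() else ABSTAIN


def lf_contains_thanks(text):
    return "praise" if "thank" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    return "complaint" if text.count("!") >= 3 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_refund, lf_contains_thanks, lf_many_exclamations]


def weak_label(text, labeling_functions=LABELING_FUNCTIONS):
    """Combine noisy heuristic votes by majority; None if no vote or a tie."""
    votes = Counter()
    for labeling_function in labeling_functions:
        label = labeling_function(text)
        if label is not ABSTAIN:
            votes[label] += 1
    if not votes:
        return None
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # Tie: leave the row unlabeled.
    return ranked[0][0]
```

Rows that come back as `None` are the ones where the heuristics disagree or stay silent, which makes them natural candidates for active learning.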
Partitioning
● Automatically recommend a problem-aware validation schema
● Are there group relationships between rows? Need different validation
● Is datetime an important dimension in the dataset? Need different validation
● Seasonal time series detected? Validation needs to account for the seasonal cycles
● Do we need to oversample/undersample/stratify/augment?
● Do not reuse the same validation set for multiple purposes (HPO, ES, model ranking)
● The entire modeling pipeline must be robust enough to never peek into the holdout
until the final model deployment
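A minimal sketch of a problem-aware validation recommendation, assuming scikit-learn is available. The function `recommend_splitter` and its flags are hypothetical; the splitter classes are standard scikit-learn ones. Seasonality, resampling, and the separate holdout are left out.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, StratifiedKFold, TimeSeriesSplit


def recommend_splitter(n_splits=5, groups=None, is_time_ordered=False, is_classification=False):
    """Pick a cross-validation scheme that matches the structure of the data."""
    if is_time_ordered:
        # Train on the past, validate on the future; rows must be sorted by time.
        return TimeSeriesSplit(n_splits=n_splits)
    if groups is not None:
        # Keep all rows of a group (e.g. one customer) in the same fold.
        return GroupKFold(n_splits=n_splits)
    if is_classification:
        # Preserve class proportions in every fold.
        return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    return KFold(n_splits=n_splits, shuffle=True, random_state=0)


# Hypothetical data: 12 rows belonging to 4 customers.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
groups = np.repeat([0, 1, 2, 3], 3)

splitter = recommend_splitter(n_splits=4, groups=groups)
for train_idx, valid_idx in splitter.split(X, y, groups=groups):
    # No customer appears on both sides of the split.
    assert set(groups[train_idx]).isdisjoint(groups[valid_idx])
```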
EDA & Quality Assessment
● Automatic data type / column intent detection
ⓘ Exercise: think how you would distinguish between numerics, ordinals, categoricals, text, datetime
● Are there features without meaningful information?
(IDs, constants, duplicates, extreme cardinality or sparsity, noise)
● Are there features that are a potential source of leakage?
ⓘ Watch my earlier talk :P https://github.com/YuriyGuts/odsc-target-leakage-workshop
● Is the format of the data consistent over time? (typical issue for long-lived systems)
● Are there outliers that are dangerous for the chosen optimization objective?
● Can be super insightful to view the data over time, over space, over target label
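One of the checks above, features without meaningful information, can be sketched with pandas. The function name, reason strings, and thresholds are assumptions. The ID rule is deliberately crude and would also flag free-text columns, so a real system would need to separate text from identifiers first.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def flag_uninformative_columns(df, id_uniqueness=0.99, max_missing=0.95):
    """Return {column: reason} for columns unlikely to carry signal."""
    flags = {}
    kept = []  # columns seen so far that were not flagged
    for name in df.columns:
        col = df[name]
        n_unique = col.nunique(dropna=True)
        label_like = not (is_numeric_dtype(col) or is_datetime64_any_dtype(col))
        if n_unique <= 1:
            flags[name] = "constant"
        elif col.isna().mean() > max_missing:
            flags[name] = "too sparse"
        elif label_like and n_unique / len(df) >= id_uniqueness:
            flags[name] = "likely ID"
        else:
            twin = next((other for other in kept if df[other].equals(col)), None)
            if twin is None:
                kept.append(name)
            else:
                flags[name] = f"duplicate of {twin}"
    return flags


# Hypothetical dataset with one ID, one constant, and one duplicated column.
df = pd.DataFrame({
    "row_id": ["a1", "b2", "c3", "d4"],
    "country": ["UA", "UA", "UA", "UA"],
    "age": [31, 45, 31, 52],
    "age_copy": [31, 45, 31, 52],
})
flags = flag_uninformative_columns(df)
```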
Feature Engineering
● Needs to be model-aware! Linear, tree-based, neural, FM, classic time series require
different preprocessing and benefit from different feature engineering techniques
● Needs to be datatype-aware
ⓘ For example, correctly distinguishing between a text feature and a categorical feature pays off here.
By the way, language matters for text. We should auto-detect it too and derive features accordingly.
● Needs to be leakage-free (no peeking into test set, very careful peeking at the target)
● Needs to work at prediction time when the model is deployed, using the same raw
data format but with no ground truth available
● Resources are finite! Latency and scalability are just as important as accuracy
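"Very careful peeking at the target" can be illustrated with out-of-fold target encoding. This is a sketch under simple assumptions (mean encoding, random folds, no smoothing), and both function names are hypothetical. The second function shows the mapping that would be used at prediction time, when no ground truth is available.

```python
import numpy as np
import pandas as pd


def target_encode_oof(categories, target, n_folds=5, seed=0):
    """Out-of-fold mean target encoding.

    Each row is encoded with target statistics computed on the other folds
    only, so a row never sees its own label.
    """
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)
    fold_of_row = np.random.default_rng(seed).permutation(len(target)) % n_folds

    encoded = pd.Series(np.nan, index=target.index)
    for fold in range(n_folds):
        in_fold = fold_of_row == fold
        means = target[~in_fold].groupby(categories[~in_fold]).mean()
        fallback = target[~in_fold].mean()  # categories unseen outside this fold
        encoded[in_fold] = categories[in_fold].map(means).fillna(fallback).to_numpy()
    return encoded


def fit_target_encoder(categories, target):
    """Mapping and global prior to apply at prediction time."""
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)
    return target.groupby(categories).mean().to_dict(), float(target.mean())
```

In the example `target_encode_oof(["x", "y", "y", "y"], [1, 0, 0, 0], n_folds=2)`, the single "x" row is encoded from the other fold only, so its own label of 1 cannot leak into its feature value.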
Modeling
● Accuracy is a must. Every percent pays off. Auto-ensembling can help too.
ⓘ Steward Health Care: www.datarobot.com/casestudy/reducing-costs-with-datarobot-at-steward-health-care/
More accurate predictions: –1% in nurse hours saves $2,000,000/year; –0.1% of patient stay saves $10,000,000/year
● No Free Lunch Theorem is very relevant, especially with prior business constraints.
● Not enough to just have a “list of models”: need to construct pipelines dynamically.
ⓘ Zoubin Ghahramani. Keynote at ICML 2018 AutoML workshop.
● Training from scratch / exhaustive search vs. transfer learning / metalearning.
● Efficient data usage, CPU/GPU and RAM usage, training time, and prediction latency
are just as important as accuracy. Model search can also be constrained by time.
● Every model must be serializable, transferable, reproducible, autonomous.
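A rough sketch of a model search constrained by time, assuming scikit-learn. The candidate list, the function `search_models`, and the synthetic data are illustrative. A real product would construct pipelines dynamically rather than walk a fixed list, and would also track memory and prediction latency.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def search_models(candidates, X, y, time_budget_seconds, cv=3):
    """Evaluate candidate pipelines in order until the time budget runs out."""
    leaderboard = []
    deadline = time.monotonic() + time_budget_seconds
    for name, build in candidates:
        if leaderboard and time.monotonic() > deadline:
            break  # Budget exhausted: keep what has been evaluated so far.
        started = time.monotonic()
        score = cross_val_score(build(), X, y, cv=cv).mean()
        leaderboard.append((name, float(score), time.monotonic() - started))
    return sorted(leaderboard, key=lambda entry: entry[1], reverse=True)


# Cheap candidates first, so a small budget still yields a usable model.
CANDIDATES = [
    ("logistic regression",
     lambda: make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("decision tree",
     lambda: DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("random forest",
     lambda: RandomForestClassifier(n_estimators=50, random_state=0)),
]

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
leaderboard = search_models(CANDIDATES, X, y, time_budget_seconds=30)
```

Each leaderboard entry holds the name, the cross-validated score, and the training time, so accuracy and cost can be weighed together when picking a model.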
Model Tuning
● Automated hyperparameter optimization (both for preprocessing and models)
ⓘ An extensively studied problem in AutoML research.
See www.automl.org/book/ for current approaches and libraries.
tl;dr: scikit-optimize, hyperopt, BOHB.
● Automated feature reduction / redundancy detection
● Models need to have well-calibrated probability outputs
ⓘ Guo et al. On Calibration of Modern Neural Networks, ICML 2017 arxiv.org/abs/1706.04599
● Pipeline optimization (also: Neural Architecture Search)
ⓘ Also a subject of extensive academic interest
See www.automl.org/book/ for current approaches
Pipeline optimization AutoML powered by genetic programming: TPOT https://github.com/EpistasisLab/tpot
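Hyperparameter optimization over both preprocessing and model can be sketched with scikit-learn alone. Note the swap: this uses plain random search, not the Bayesian or bandit-based methods behind scikit-optimize, hyperopt, or BOHB. The pipeline steps, search space, and synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("model", LogisticRegression(max_iter=1000)),
])

# Step-prefixed names let one search cover preprocessing and model settings.
search_space = {
    "select__k": [5, 10, 20],
    "model__C": [0.01, 0.1, 1.0, 10.0],
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=search_space,
    n_iter=8,
    cv=3,
    random_state=0,
)
search.fit(X, y)
best_params = search.best_params_
```

Because the feature selector sits inside the pipeline, it is refit on each training fold, which keeps the tuning itself free of leakage from the validation folds.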
github.com/pprett/aml-class-19
Genetic Pipeline Optimization
Evaluation and Selection
● Fair model comparison and ranking on out-of-sample data
● Analysis of data efficiency (learning curves), resource usage, prediction throughput
● Analysis of model stability out-of-sample
Typical issue: how well a time series model handles different forecasting horizons
● Recommending the best model, considering accuracy, transparency, and speed
● Making use of the data: retraining the best model on more data if needed
ⓘ Quiz: what to do with hyperparameters?
● Fair “apples-to-apples” comparison with externally developed models
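The data-efficiency analysis mentioned above can be sketched with scikit-learn's `learning_curve`. The estimator, the synthetic data, and the `marginal_gain` summary are illustrative assumptions, not a recommended decision rule.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

train_sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=4,
    shuffle=True,
    random_state=0,
)

mean_valid = valid_scores.mean(axis=1)
# Gain from the last increase in training data; a small gain suggests that
# retraining on more rows is unlikely to change the picture much.
marginal_gain = float(mean_valid[-1] - mean_valid[-2])
```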
Risk and Compliance
● Explaining feature importance, feature interactions, partial dependence
● Explaining the kinds and ranges of tuned hyperparameters and optimal values
● Explaining individual predictions in terms of original features
● Feature sensitivity analysis (effect of perturbations on predictions)
● “What-if” simulations and analysis (e.g. for ethical evaluation)
● Access to preprocessed/final modeling data for external reproducibility
● Auto-documenting the methodology, results, and insights!
● All of the above should be available for every model!
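As one example of model-agnostic explanation, feature importance can be estimated by permutation: shuffle one feature on held-out data and measure how much the score drops. The sketch below assumes scikit-learn and uses synthetic data in which only the first two features drive the target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: features 0 and 1 matter, features 2 to 4 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Importance is measured on held-out rows, not on the training data.
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
```

Since the procedure only needs predictions, the same code applies to any fitted estimator, which is what makes it practical to offer for every model.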
Software Construction & Deployment
● Model needs to use the same dependencies it used during training.
OSS scientific packages also have bugs and breaking changes!
● Edge computing may require the model to be exportable and available offline
ⓘ Exercise: think how you would make a full model pipeline available for scoring on iOS, Android, Raspberry Pi, ...
● Application needs to be generated according to the initial business problem setup
(e.g. do we need to explain, predict, or prescribe/optimize). Needs to expose API/UI.
● IT policies and compliance have the same relevance here as for any other enterprise
software. OSS and security audit. Legacy software compatibility
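The first point, matching training-time dependencies, can be sketched with the standard library. The helper names and the package list are hypothetical; the idea is to store a version manifest next to the serialized model and compare it with the scoring environment before loading.

```python
import json
from importlib import metadata


def capture_versions(packages):
    """Record installed versions of the packages a model was trained with."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # Not installed in this environment.
    return versions


def find_mismatches(manifest):
    """Compare a training-time manifest with the current environment."""
    current = capture_versions(manifest)
    return {
        name: {"trained_with": wanted, "installed": current[name]}
        for name, wanted in manifest.items()
        if current[name] != wanted
    }


# At training time: store the manifest next to the serialized model.
manifest_json = json.dumps(capture_versions(["numpy", "scikit-learn"]))

# At load time: refuse to score, or at least warn, if anything differs.
mismatches = find_mismatches(json.loads(manifest_json))
```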
Cloud-native, Docker, Kubernetes...
CentOS 6
Model Maintenance
● Need to distinguish service health vs. input data health vs. model health
● Automated feature drift / response drift detection
The world never stops changing
● Feedback loop detection
And we never stop changing the world
● Continuous learning
● Challenger models / auto-fallback to a more robust model
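Feature drift detection can be sketched with the Population Stability Index (PSI), one common choice among several. The function below is an illustration with assumed defaults (10 quantile bins). The often-quoted thresholds of 0.1 and 0.25 are rules of thumb, not guarantees.

```python
import numpy as np


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between a training-time sample and a production sample of one feature."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)

    # Bin edges come from reference quantiles; the outer bins are open-ended.
    edges = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1))[1:-1]
    ref_share = np.bincount(np.searchsorted(edges, reference), minlength=n_bins) / len(reference)
    cur_share = np.bincount(np.searchsorted(edges, current), minlength=n_bins) / len(current)

    # Avoid log(0) and division by zero for empty bins.
    ref_share = np.clip(ref_share, eps, None)
    cur_share = np.clip(cur_share, eps, None)
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))


rng = np.random.default_rng(0)
training_sample = rng.normal(size=5000)
stable_sample = rng.normal(size=5000)
shifted_sample = rng.normal(loc=1.0, size=5000)

psi_stable = population_stability_index(training_sample, stable_sample)
psi_shifted = population_stability_index(training_sample, shifted_sample)
```

Running the same check on the model's predictions instead of an input feature gives a simple form of response drift detection.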
References
1. Rich Caruana (Microsoft Research). Open Research Problems in AutoML
https://sites.google.com/site/automlwsicml15/
2. AutoML: Methods, Systems, Challenges
http://automl.org/book/
3. Peter Prettenhofer: AutoML Class @ UCU Data Science School 2019
https://github.com/pprett/aml-class-19
?
yuriy.guts @ gmail.com
linkedin.com/in/yuriyguts

 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 

Kürzlich hochgeladen (20)

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 

Automated Machine Learning

  • 2. Machine Learning Engineer, Core Modeling Team. Teach sometimes: AI, Machine Learning, Summer/Winter ML Schools. Compete sometimes: currently hold an Expert rank, top 2% worldwide
  • 5. Seriously misunderstood creature, AutoML is. (Image copyright © Warner Bros. Source: syfy.com)
  • 6. AutoML provides methods and processes to make ML available for non-ML experts, to improve efficiency of ML and to accelerate research on ML. [www.automl.org]
  • 7. Automated machine learning (AutoML) is the process of automating end-to-end the process of applying machine learning to real-world problems
  • 9. #1: Will all data scientists lose their jobs soon? Image copyright © 20th Century Fox. Source: youtube.com
  • 10. #2: AutoML is about a neural network generating neural networks, right? NIPS 2016 conference. Source: blog.ought.com
  • 11. #3: DS/ML requires serious human expertise. How can automation ever be “better”? Image copyright © USA Network. Source: wallpaperplay.com
  • 12. Three Levels of Scope. (1) Academic AutoML: advance human knowledge in fundamental AutoML methods; get publications, citations, degrees, inspire R&D. (2) Libraries and Open-Source AutoML Software: refine academic ideas to technical feasibility, gain product engineering experience; find peers, validate ideas with early adopters, build a community of practitioners. (3) Commercial AutoML Product: build a profitable business by solving real-world problems and delivering value at scale (from small businesses and NGOs to largest corporations and governments) [focus of this talk]
  • 13. Some Background 🦄 Unicorn startup from Boston, MA 🗓 Developing AutoML products since 2012 💵 $430M of investments (Series E) 🏢 Hundreds of enterprise customers (including ⅓ of Fortune 50) 🔮 1.3 billion ML models built so far 👨‍💻 1000 employees @ ~50 locations around the globe “DataRobot sets the standard for augmented data science and machine learning” – Gartner Magic Quadrant for DS and ML Platforms, 2019 “DataRobot leads the pack with a broad set of robust capabilities” – Forrester New Wave, Automation-Focused ML Solutions, Q2 2019
  • 14. Recap: DS Value Generation. Diagram labels: Business User; Problem; Data Science; Automation; Optimization; Actionable Insights; Bottom Line Improvement & Executive Decision Support; Raw Data
  • 15. Diagram labels: Business User; Problem; Data Science; Automation; Optimization; Actionable Insights. Stages: Problem Framing; Data Prep & Annotation; Data Ingestion & Management; Partitioning; EDA & Quality Assessment; Feature Engineering; Modeling; Model Tuning; Evaluation & Selection; Software Construction; Deployment; Consumption; Model Maintenance; Risk & Compliance
  • 16. Persona: Data Scientist. Stages: Problem Framing; Data Prep & Annotation; Data Ingestion & Management; Partitioning; EDA & Quality Assessment; Feature Engineering; Modeling; Model Tuning; Evaluation & Selection; Software Construction; Deployment; Consumption; Model Maintenance; Risk & Compliance. Annotations: Needs domain knowledge to do right; Hates doing; Enjoys doing and wants to keep doing it; Often lacks skills or methodology to do right. In large organizations, a lot of “throwing over the wall” happens here. ~85% of DS projects never make it to production [bit.ly/30PGOZM]
  • 17. Recall The Earlier Definitions: 1. “Accessible for non-ML experts” 2. “End-to-end automation”
  • 18. Stages: Problem Framing; Data Prep & Annotation; Ingestion; Partitioning; EDA & Quality Assessment; Feature Engineering; Modeling; Model Tuning; Evaluation & Selection; Software Construction; Deployment; Consumption; Model Maintenance; Risk & Compliance. Annotations: Vast majority of ML research focused here; Vast majority of AutoML research and emerging products focused here; Actually needed to deliver value in the real world
  • 19. Sculley et al. (Google) “Hidden Technical Debt in Machine Learning Systems” [NIPS 2015]
  • 20. Ideal Goal. Diagram labels: Business User; AutoML; Raw Data; Definition of Business Objective; Automatically Deployed Application with Monitoring and Continual Learning ● Lots of capable and motivated people in non-DS teams that know the domain and can deliver value ● Data scientists focus on strategic projects, mentor “citizen data scientists”, and help with problem setup
  • 21. Good AutoML: 1. Empowers non-experts but does not alienate experts. 2. Augments user’s domain knowledge with automation and fast iteration. 3. Provides guardrails and trust. Enables more people to get more results with better quality. Source: MovieFigures via youtube.com
  • 22. Interesting Use Case: Model Factory AutoML ● Models specific to data subsets (e.g. propensity per SKU) ● Models specific to time ranges (e.g. +1 day, +1 month forecast) ● Short-lived models with rapid refresh cycle (e.g. fraud, malware)
  • 24. Diagram labels: Business User; Problem; Data Science; Automation; Optimization; Actionable Insights. Stages: Problem Framing; Data Prep & Annotation; Data Ingestion & Management; Partitioning; EDA & Quality Assessment; Feature Engineering; Modeling; Model Tuning; Evaluation & Selection; Software Construction; Deployment; Consumption; Model Maintenance; Risk & Compliance
  • 25. Problem Framing ● Automatic detection of the modeling problem from data layout (regression, binary, multiclass, multilabel, ranking, recommendation, ...) ● Are there datetime features in the data? Maybe it’s a time series forecasting problem? Maybe there are multiple series along the same axis? ● Maybe there’s no target at all? (E.g., user is interested in anomaly detection) ● If there’s a target, can we figure out its distribution and recommend a reliable optimization metric? ● Are there any prior constraints? (E.g., prediction range, monotonicity, weights)
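
The first few questions on this slide can be sketched as a small heuristic over the raw target column. This is only an illustration, not how any particular AutoML product does it: the function name `infer_problem_type`, the returned labels, and the `max_classes` threshold are all assumptions made for the example.

```python
from collections import Counter


def infer_problem_type(target_values, max_classes=20):
    """Guess the modeling problem from the raw target column.

    `max_classes` is an illustrative threshold, not a recommended value.
    """
    observed = [v for v in target_values if v is not None]
    if not observed:
        # No target at all: the user may want anomaly detection or clustering.
        return "unsupervised"
    distinct = Counter(observed)
    if len(distinct) == 1:
        # A single target value cannot be modeled meaningfully.
        return "constant_target"
    if len(distinct) == 2:
        return "binary"
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in observed
    )
    if all_numeric and len(distinct) > max_classes:
        return "regression"
    return "multiclass"


print(infer_problem_type([0, 1, 1, 0]))
print(infer_problem_type([x * 0.5 for x in range(100)]))
```

A real system would also look at datetime columns, row grouping, and the target distribution before settling on a problem type and metric.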
  • 26. Data Preparation and Annotation ● Does the data have valid tabular shape? Are there various data sources to merge? ⓘ Deep Feature Synthesis: automatic generation of features from snowflake-schema relational data. J. Kanter, K. Veeramachaneni. Deep feature synthesis: Towards automating data science endeavors. DSAA 2015. “featuretools” Python package: https://github.com/Featuretools/featuretools ● Is the target defined everywhere? Do we need weak supervision or active learning? ⓘ Snorkel: rapid training data creation with weak supervision. https://github.com/snorkel-team/snorkel https://arxiv.org/abs/1711.10160
  • 27. Partitioning ● Automatically recommend a problem-aware validation schema ● Are there group relationships between rows? Need different validation ● Is datetime an important dimension in the dataset? Need different validation ● Seasonal time series detected? Validation needs to account for the seasonal cycles ● Do we need to oversample/undersample/stratify/augment? ● Do not reuse the same validation set for multiple purposes (HPO, ES, model ranking) ● The entire modeling pipeline must be robust enough to never peek into the holdout until the final model deployment
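
Two of the validation schemes mentioned on this slide, group-aware and time-aware partitioning, can be sketched in a few lines. The function names, the `customer` and `day` columns, and the hash-based group assignment are assumptions chosen for the example, not the method of any specific tool.

```python
import hashlib


def group_holdout_split(rows, group_key, holdout_fraction=0.2):
    """Send every row of a group to the same side of the split."""
    train, holdout = [], []
    for row in rows:
        digest = hashlib.sha256(str(row[group_key]).encode("utf-8")).hexdigest()
        # Map the first 32 bits of the hash to a stable value in [0, 1].
        bucket = int(digest[:8], 16) / 0xFFFFFFFF
        (holdout if bucket < holdout_fraction else train).append(row)
    return train, holdout


def time_ordered_split(rows, time_key, holdout_fraction=0.2):
    """Keep the most recent rows as holdout so training never sees the future."""
    ordered = sorted(rows, key=lambda row: row[time_key])
    n_holdout = max(1, round(len(ordered) * holdout_fraction))
    cut = len(ordered) - n_holdout
    return ordered[:cut], ordered[cut:]


# Hypothetical data: 7 customers, 70 daily rows.
rows = [{"customer": f"c{i % 7}", "day": i} for i in range(70)]
train, holdout = group_holdout_split(rows, "customer")
past, future = time_ordered_split(rows, "day")
```

Hashing the group id keeps the assignment stable across reruns, but with few groups the realized holdout share can be far from `holdout_fraction`.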
  • 28. EDA & Quality Assessment ● Automatic data type / column intent detection ⓘ Exercise: think how you would distinguish between numerics, ordinals, categoricals, text, datetime ● Are there features without meaningful information? (IDs, constants, duplicates, extreme cardinality or sparsity, noise) ● Are there features that are a potential source of leakage? ⓘ Watch my earlier talk :P https://github.com/YuriyGuts/odsc-target-leakage-workshop ● Is the format of the data consistent over time? (typical issue for long-lived systems) ● Are there outliers that are dangerous for the chosen optimization objective? ● Can be super insightful to view the data over time, over space, over target label
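
One possible starting point for the exercise on this slide is a rule-based detector over raw string values. The labels, the thresholds, and the order of the checks are assumptions for illustration; ordinals in particular are not handled, since they usually need extra context.

```python
from datetime import datetime


def _is_number(value):
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def _is_iso_date(value):
    try:
        datetime.fromisoformat(str(value))
        return True
    except ValueError:
        return False


def detect_column_intent(values, text_min_avg_words=3):
    """Classify a raw column with simple heuristics (thresholds are illustrative)."""
    observed = [v for v in values if v not in (None, "")]
    if not observed:
        return "empty"
    distinct = set(observed)
    if len(distinct) == 1:
        return "constant"  # carries no information for modeling
    if all(_is_number(v) for v in observed):
        return "numeric"
    if all(_is_iso_date(v) for v in observed):
        return "datetime"
    avg_words = sum(len(str(v).split()) for v in observed) / len(observed)
    if avg_words >= text_min_avg_words:
        return "text"
    if len(distinct) == len(observed):
        return "id_like"  # every value unique: probably an identifier
    return "categorical"


print(detect_column_intent(["red", "blue", "red", "blue"]))
```

Numeric-looking columns with few distinct values (zip codes, ratings) are where a detector like this is most likely to be wrong, which is part of what makes the exercise interesting.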
  • 29. Feature Engineering ● Needs to be model-aware! Linear, tree-based, neural, FM, classic time series require different preprocessing and benefit from different feature engineering techniques ● Needs to be datatype-aware ⓘ For example, correctly distinguishing between a text feature and a categorical feature pays off here. By the way, language matters for text. We should auto-detect it too and derive features accordingly. ● Needs to be leakage-free (no peeking into test set, very careful peeking at the target) ● Needs to work at prediction time when the model is deployed, using the same raw data format but with no ground truth available ● Resources are finite! Latency and scalability are just as important as accuracy
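
"Very careful peeking at the target" can be illustrated with out-of-fold target encoding, a technique the slide does not name but which fits the requirement: each row's encoded value is computed only from the targets of rows in other folds. The function name and the index-based fold assignment are assumptions, and the sketch presumes the rows are already shuffled.

```python
from collections import defaultdict


def out_of_fold_target_encode(categories, targets, n_folds=5):
    """Encode each category by the mean target of rows in the *other* folds."""
    n = len(categories)
    if n_folds < 2 or n < n_folds:
        raise ValueError("need at least 2 folds and at least one row per fold")
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        other_sum, other_n = 0.0, 0
        # Collect statistics from every row that is NOT in the current fold.
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                other_sum += targets[i]
                other_n += 1
        fallback = other_sum / other_n  # mean target of the other folds
        # Encode rows of the current fold without using their own targets.
        for i in range(fold, n, n_folds):
            c = categories[i]
            encoded[i] = sums[c] / counts[c] if counts[c] else fallback
    return encoded


print(out_of_fold_target_encode(["a", "a", "b", "b"], [1, 0, 1, 1], n_folds=2))
```

At prediction time there is no target, so a deployed pipeline would instead apply category means learned once from the full training set.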
  • 31. Modeling ● Accuracy is a must. Every percent pays off. Auto-ensembling can help too. ⓘ Steward Health Care: www.datarobot.com/casestudy/reducing-costs-with-datarobot-at-steward-health-care/ More accurate predictions: –1% in nurse hours saves $2,000,000/year; –0.1% of patient stay saves $10,000,000/year ● No Free Lunch Theorem is very relevant, especially with prior business constraints. ● Not enough to just have a “list of models”: need to construct pipelines dynamically. ⓘ Zoubin Ghahramani. Keynote at ICML 2018 AutoML workshop. ● Training from scratch / exhaustive search vs. transfer learning / metalearning. ● Efficient data usage, CPU/GPU and RAM usage, training time, and prediction latency are just as important as accuracy. Model search can also be constrained by time. ● Every model must be serializable, transferable, reproducible, autonomous.
  • 32. Model Tuning ● Automated hyperparameter optimization (both for preprocessing and models) ⓘ An extensively studied problem in AutoML research. See www.automl.org/book/ for current approaches and libraries. tl;dr: scikit-optimize, hyperopt, BOHB. ● Automated feature reduction / redundancy detection ● Models need to have well-calibrated probability outputs ⓘ Guo et al. On Calibration of Modern Neural Networks, ICML 2017, arxiv.org/abs/1706.04599 ● Pipeline optimization (also: Neural Architecture Search) ⓘ Also a subject of extensive academic interest. See www.automl.org/book/ for current approaches. Pipeline optimization AutoML powered by genetic programming: TPOT, https://github.com/EpistasisLab/tpot
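
The calibration requirement on this slide can be checked with an expected calibration error (ECE) style measure. The sketch below is a simple binary variant that bins positive-class probabilities and compares each bin's mean prediction with its observed positive rate; the function name and the equal-width binning are assumptions for the example.

```python
def expected_calibration_error(probabilities, labels, n_bins=10):
    """Weighted average gap between predicted probability and observed positive rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probabilities, labels):
        # Equal-width bins over [0, 1]; p == 1.0 goes into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(probabilities)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(mean_p - positive_rate)
    return ece


# A model that says 0.75 for rows that are all positive is under-confident.
print(expected_calibration_error([0.75, 0.75, 0.75, 0.75], [1, 1, 1, 1]))
```

A value near zero means predicted probabilities can be read as frequencies; the measure is sensitive to the number of bins, so comparisons should use the same `n_bins`.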
  • 34. Evaluation and Selection ● Fair model comparison and ranking on out-of-sample data ● Analysis of data efficiency (learning curves), resource usage, prediction throughput ● Analysis of model stability out-of-sample. Typical issue: how well a time series model handles different forecasting horizons ● Recommending the best model, considering accuracy, transparency, and speed ● Making use of the data: retraining the best model on more data if needed ⓘ Quiz: what to do with hyperparameters? ● Fair “apples-to-apples” comparison with externally developed models
  • 35. Risk and Compliance ● Explaining feature importance, feature interactions, partial dependence ● Explaining the kinds and ranges of tuned hyperparameters and optimal values ● Explaining individual predictions in terms of original features ● Feature sensitivity analysis (effect of perturbations on predictions) ● “What-if” simulations and analysis (e.g. for ethical evaluation) ● Access to preprocessed/final modeling data for external reproducibility ● Auto-documenting the methodology, results, and insights! ● All of the above should be available for every model!
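
Feature sensitivity analysis, as listed on this slide, can be sketched as perturbing one feature at a time and recording how the prediction moves. `toy_model`, its coefficients, and the feature names are hypothetical stand-ins for a real trained model.

```python
def toy_model(row):
    """Hypothetical scoring function standing in for a trained model."""
    return 3.0 * row["income"] - 1.0 * row["debt"]


def feature_sensitivity(predict, row, deltas):
    """Return, per feature, the prediction change from perturbing that feature alone."""
    baseline = predict(row)
    effects = {}
    for name, delta in deltas.items():
        perturbed = dict(row)  # copy so the original row stays untouched
        perturbed[name] = row[name] + delta
        effects[name] = predict(perturbed) - baseline
    return effects


effects = feature_sensitivity(
    toy_model,
    {"income": 2.0, "debt": 1.0},
    {"income": 1.0, "debt": 1.0},
)
print(effects)
```

Because features are perturbed independently, this ignores interactions and correlated inputs; it is a local "what-if" probe rather than a full explanation method.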
  • 36. Software Construction & Deployment ● Model needs to use the same dependencies it used during training. OSS scientific packages also have bugs and breaking changes! ● Edge computing may require the model to be exportable and available offline ⓘ Exercise: think how you would make a full model pipeline available for scoring on iOS, Android, Raspberry Pi, ... ● Application needs to be generated according to the initial business problem setup (e.g. do we need to explain, predict, or prescribe/optimize). Needs to expose API/UI. ● IT policies and compliance have the same relevance here as for any other enterprise software. OSS and security audit. Legacy software compatibility
  • 38. Model Maintenance ● Need to distinguish service health vs. input data health vs. model health ● Automated feature drift / response drift detection (the world never stops changing) ● Feedback loop detection (and we never stop changing the world) ● Continuous learning ● Challenger models / auto-fallback to a more robust model
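
Feature drift detection can be illustrated with the Population Stability Index (PSI), one common drift score that the slide itself does not name. The sketch compares the binned distribution of a feature at training time with the distribution seen in production; the function name, the equal-width bins, and the `eps` floor for empty bins are assumptions for the example.

```python
import math


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a training-time sample and a production sample of one feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant features

    def histogram(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the training range fall into the edge bins.
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training = list(range(100))
same = population_stability_index(training, list(range(100)))
shifted = population_stability_index(training, [v + 50 for v in range(100)])
print(same, shifted)
```

Identical samples score zero and larger values indicate a bigger shift; the same calculation applied to model outputs gives a basic response drift check.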
  • 39. References 1. Rich Caruana (Microsoft Research). Open Research Problems in AutoML https://sites.google.com/site/automlwsicml15/ 2. AutoML: Methods, Systems, Challenges http://automl.org/book/ 3. Peter Prettenhofer: AutoML Class @ UCU Data Science School 2019 https://github.com/pprett/aml-class-19