Automated Machine Learning (AutoML) has received significant interest recently. We believe that the right automation can bring substantial value and dramatically shorten time-to-value for data science teams. Databricks is automating the Data Science and Machine Learning process through a combination of product offerings, partnerships, and custom solutions. This talk focuses on how Databricks can help automate hyperparameter tuning.
For both traditional Machine Learning and modern Deep Learning, tuning hyperparameters can dramatically increase model performance and improve training times. However, tuning can be a complex and expensive process. In this talk, we'll start with a brief survey of the most popular techniques for hyperparameter tuning (e.g., grid search, random search, and Bayesian optimization). We will then discuss open source tools that implement each of these techniques, helping to automate the search over hyperparameters.
Finally, we will discuss and demo improvements we built for these tools in Databricks, including integration with MLflow:
• Apache Spark MLlib (PySpark) integration with MLflow for automatic tracking of tuning
• Hyperopt integration with Apache Spark to distribute tuning, and with MLflow for automatic tracking
Recording and notebooks will be provided after the webinar so that you can practice at your own pace.
Presenters
Joseph Bradley, Software Engineer, Databricks
Joseph Bradley is a Software Engineer and Apache Spark PMC member working on Machine Learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.
Yifan Cao, Senior Product Manager, Databricks
Yifan Cao is a Senior Product Manager at Databricks. His product area spans ML/DL algorithms and Databricks Runtime for Machine Learning. Prior to Databricks, Yifan worked on two Machine Learning products, applying NLP to find metadata and applying machine learning to predict equipment failures. He helped build those products from the ground up to multi-million dollars in ARR. Yifan started his career as a researcher in quantum computing. Yifan received his B.S. from UC Berkeley and his Master's from MIT.
2. Logistics
• We can’t hear you…
• Recording will be available…
• Slides will be available…
• Code samples and notebooks will be available…
• Queue up Questions…
• Bookmark databricks.com/blog
3. About our speakers
Yifan Cao, Sr. Product Manager, Machine Learning at Databricks
• Product Area: ML/DL algorithms and Databricks Runtime for Machine Learning
• Built and grew two ML products to multi-million dollars in annual revenue
• B.S. Engineering from UC Berkeley; MBA from MIT
Joseph Bradley, Software Engineer, Machine Learning at Databricks
• Apache Spark PMC member
• Postdoc at UC Berkeley
• Ph.D. in Machine Learning from Carnegie Mellon
4. About Databricks
• VISION: Accelerate innovation by unifying data science, engineering and business
• WHO WE ARE: Original creators of Apache Spark; 2000+ global companies use our platform across the big data & machine learning lifecycle
• SOLUTION: Unified Analytics Platform
14. Databricks and DataRobot Integration
Enable data scientists and citizen data scientists to accelerate and scale the development and delivery of predictive models.
[Diagram: ETL & ML in Databricks; model test & selection; run and deploy ML models at scale]
Watch it now > https://dbricks.co/datarobot
15. AutoML on Databricks (3/3)
[Diagram: Databricks offerings on a spectrum from USER CONTROL to AUTOMATION: Integrations and MLlib (user control), AutoML libraries such as Hyperopt (automation + control), Partnerships (automation); today's content highlighted]
16. A simple analogy
[Diagram: Manual Transmission (user control), Semi-Autonomous (automation + control), Automatic Transmission (automation); today's content highlighted]
17. Use Case #1: Hyperparameter Tuning
[Pipeline diagram: Raw Data, ETL, Feature Engineering, Model Exploration, Hyperparameter Tuning, Cross Validation, Model Scoring, Alerting & Monitoring; hyperparameter tuning highlighted]
Scenarios:
● Automated hyperparameter search to select models after cross validation
● Automated hyperparameter search to optimize models in production
Our Offerings:
● Distributed Hyperopt + Automated MLflow Tracking
18. Use Case #2: Model Search
[Pipeline diagram: the same ML pipeline as above]
Scenarios:
● Automated model search by exploring different combinations of feature sets, algorithms, and hyperparameters
● Automated model search by extending a baseline model to 1000+ custom models
Our Offerings:
● MLlib + Automated MLflow Tracking
● Distributed Hyperopt + Automated MLflow Tracking, with conditional hyperparameter tuning
19. Use Case #3: End-to-end ML Pipeline
[Pipeline diagram: the same ML pipeline as above]
Scenarios:
● Automated end-to-end machine learning model generation pipelines incorporating customer-specified logic
Our Offerings:
● Leverage existing Databricks internal tools & frameworks on top of Databricks Runtime ML
21. Hyperparameters
Hyperparameters:
• Express high-level concepts, such as statistical assumptions (e.g., regularization)
• Are fixed before training or are hard to learn from data (e.g., neural net architecture)
• Affect the objective, test-time performance, and computational cost (e.g., # iterations or epochs)
22. Tuning hyperparameters
E.g.: fitting a polynomial
Common goals:
• More flexible modeling process
• Reduced generalization error
• Faster training
• Plug & play ML
23. Challenges in tuning
• Curse of dimensionality
• Non-convex optimization
• Computational cost
• Unintuitive hyperparameters
27. A practical definition of tuning
[Diagram: an ML model combines featurization, model family selection, and hyperparameter tuning]
Parameters: configs which your ML library learns from data
Hyperparameters: configs which your ML library does not learn from data
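A concrete sketch in PySpark (train_df is an assumed DataFrame of labeled feature vectors): regParam below is a hyperparameter, fixed before training, while the coefficients are parameters learned from the data.

```python
from pyspark.ml.classification import LogisticRegression

# Hyperparameters: set *before* training, not learned from data.
lr = LogisticRegression(regParam=0.01, maxIter=100)

# Parameters: learned *from* data during fit().
model = lr.fit(train_df)   # train_df: assumed (features, label) DataFrame
print(model.coefficients)  # the learned parameters
```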
30. Manual search
Select hyperparameter settings to try based on human intuition.
2 hyperparameters:
• [0, ..., 5]
• {A, B, ..., F}
Expert knowledge tells us to try: (2,C), (2,D), (2,E), (3,C), (3,D), (3,E)
[Figure: 6×6 grid over {A,...,F} × {0,...,5} with these six points marked]
31. Grid Search
Try points on a grid defined by ranges and step sizes.
X-axis: {A,...,F}
Y-axis: 0-5, step = 1
[Figure: all 36 grid points marked]
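In Spark MLlib, such a grid is declared with ParamGridBuilder. A minimal sketch (the two 6-value ranges are illustrative stand-ins for the axes above):

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder

lr = LogisticRegression()

# Cartesian product: 6 x 6 = 36 hyperparameter settings, as in the figure.
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.0, 0.01, 0.1, 1.0, 10.0, 100.0])
        .addGrid(lr.maxIter, [10, 20, 30, 40, 50, 60])
        .build())
```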
32. Random Search
Sample from distributions over ranges.
X-axis: Uniform({A,...,F})
Y-axis: Uniform([0, 5])
[Figure: randomly sampled points over the same grid]
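In Hyperopt (introduced later in this talk), these two sampling distributions would be written as follows, a sketch mirroring the axes above:

```python
from hyperopt import hp

space = {
    "x": hp.choice("x", ["A", "B", "C", "D", "E", "F"]),  # Uniform({A,...,F})
    "y": hp.uniform("y", 0, 5),                           # Uniform([0, 5])
}
```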
33. Population Based Algorithms
Start with random search, then iterate:
• Use the previous “generation” to inform the next generation
• E.g., sample from best performers & then perturb them
[Figure: successive generations of sampled points over the same grid]
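A toy sketch of the idea in plain Python (score is a hypothetical objective, not any library's API):

```python
import random

def score(x, y):
    # Hypothetical objective: higher is better.
    return -((x - 2.0) ** 2 + (y - 3.0) ** 2)

def clip(v):
    # Keep perturbed values inside the [0, 5] search range.
    return min(max(v, 0.0), 5.0)

# Generation 0: pure random search.
population = [(random.uniform(0, 5), random.uniform(0, 5)) for _ in range(20)]

for _ in range(10):
    # Keep the best performers of this generation...
    best = sorted(population, key=lambda p: score(*p), reverse=True)[:5]
    # ...and build the next generation by perturbing them.
    population = [(clip(x + random.gauss(0, 0.5)), clip(y + random.gauss(0, 0.5)))
                  for x, y in best for _ in range(4)]

print(max(population, key=lambda p: score(*p)))
```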
36. Bayesian Optimization
Model the loss function: hyperparameters ⇒ loss
Iteratively search the space, trading off between exploration and exploitation
[Figure: the same hyperparameter grid]
37-38. Bayesian Optimization (continued)
• Get samples: test new points in hyperparameter space
• Update model of the space: hyperparameters ⇒ loss
[Figure: sampled points and the updated model over the grid]
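A minimal sketch of that loop using a Gaussian-process surrogate and the expected-improvement acquisition function (scikit-learn and a toy 1-D objective; real tools differ in the surrogate model they use):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def loss(x):
    # Hypothetical expensive objective to minimize.
    return (x - 3.3) ** 2 + np.sin(5 * x)

candidates = np.linspace(0, 5, 200).reshape(-1, 1)

# Start from a few random samples.
X = np.random.uniform(0, 5, size=(3, 1))
y = loss(X).ravel()

for _ in range(15):
    # Model of the space: hyperparameters => loss.
    gp = GaussianProcessRegressor().fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)

    # Expected improvement trades off exploitation (low predicted loss)
    # against exploration (high predictive uncertainty).
    best = y.min()
    z = (best - mu) / (sigma + 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    # Get a sample: test the most promising new point, then refit.
    x_next = candidates[np.argmax(ei)]
    X = np.vstack([X, [x_next]])
    y = np.append(y, loss(x_next[0]))

print(X[np.argmin(y)], y.min())
```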
39. Comparing tuning methods

Method           | Iterative/adaptive? | # evaluations for P params | Model of param space
Grid search      | No                  | O(c^P)                     | none
Random search    | No                  | O(k)                       | none
Population-based | Yes                 | O(k)                       | implicit
Bayesian         | Yes                 | O(k)                       | explicit
42. MLflow Overview
• Tracking: record and query experiments (code, data, config, results)
• Projects: packaging format for reproducible runs on any platform
• Models: general model format that supports diverse deployment tools
mlflow.org | github.com/mlflow | twitter.com/MLflow | databricks.com/mlflow
43. Organizing with MLflow
[Diagram: an Experiment contains a main run with child runs: ML Model 1, 2, 3 fit on training data and compared on validation data, leading to a final ML model evaluated on test data]
Tip: Tune the full pipeline, not 1 model.
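A minimal sketch of that structure with the MLflow tracking API (run names and the hyperparameter are illustrative):

```python
import mlflow

# One main run for the tuning job; one child run per candidate model.
with mlflow.start_run(run_name="tuning-main"):
    for reg_param in [0.01, 0.1, 1.0]:
        with mlflow.start_run(run_name=f"reg={reg_param}", nested=True):
            mlflow.log_param("regParam", reg_param)
            # ... fit and evaluate one candidate model here ...
```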
44. Instrumenting tuning with MLflow
MLflow concepts for tracking runs:
• Params: hyperparameters
• Metrics: training & validation, loss & objective, multiple objectives
• Tags: provenance, simple metadata
• Artifacts: serialized model, large metadata
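Within a run, those four concepts map onto four tracking calls. A sketch with placeholder values:

```python
import mlflow

with mlflow.start_run():
    mlflow.log_param("regParam", 0.1)      # Params: hyperparameters
    mlflow.log_metric("train_loss", 0.42)  # Metrics: training & validation
    mlflow.log_metric("val_loss", 0.57)
    mlflow.set_tag("data_version", "v2")   # Tags: provenance, simple metadata
    mlflow.log_artifact("model.pkl")       # Artifacts: path to a local file,
                                           # e.g. a serialized model
```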
45. Analyzing how tuning performs
Questions to answer
• Am I tuning the right hyperparameters?
• Am I exploring the right parts of the search space?
• Do I need to do another round of tuning?
Examining results
• Simple case: visualize param vs metric
• Challenges: multiple params and metrics, iterative experimentation
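One way to examine results is mlflow.search_runs, which returns one pandas DataFrame row per run (a sketch; the experiment ID placeholder and the params./metrics. column names depend on what you logged):

```python
import mlflow

runs = mlflow.search_runs(experiment_ids=["<your-experiment-id>"])

# Params are stored as strings, so cast before plotting param vs. metric.
runs["params.regParam"] = runs["params.regParam"].astype(float)
runs.plot.scatter(x="params.regParam", y="metrics.val_loss")
```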
46. Auto-tracking MLlib with MLflow
In Databricks:
• CrossValidator & TrainValidationSplit
• 1 run per setting of hyperparameters
• Avg metrics for CV folds
[Diagram: same experiment structure as before, a main run with child runs per model, against training/validation/test data]
(demo)
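A sketch of the MLlib tuning code this instruments; in Databricks Runtime 5.4+, fitting the CrossValidator logs the runs to MLflow with no extra tracking code (train_df assumed as before):

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression()
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])
        .build())

# One MLflow child run per hyperparameter setting; metrics are
# averaged across the CV folds.
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)
cv_model = cv.fit(train_df)
```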
48. Hyperopt
Hyperparameter tuning in Python ML workflows
● Usable with any Python ML library
● Tuning algorithms:
○ Random search
○ Bayesian (Tree of Parzen Estimators)
● Open source (3-clause BSD license)
https://github.com/hyperopt/hyperopt
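A minimal Hyperopt example (the objective is a toy stand-in for training and evaluating a model):

```python
from hyperopt import Trials, fmin, hp, tpe

def objective(params):
    # Train & evaluate a model here; return the validation loss.
    return (params["y"] - 3.0) ** 2  # toy loss

space = {"y": hp.uniform("y", 0, 5)}

best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,  # Bayesian: Tree of Parzen Estimators
            max_evals=50,
            trials=Trials())
print(best)
```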
49. Hyperopt on Apache Spark
Distribute tuning across Spark clusters, with automated MLflow tracking in Databricks:
● Each Spark task trains & evaluates 1 model (hyperparameter setting)
○ Applicable to single-machine ML workloads
● Via new SparkTrials plugin
● Contributing to open source Hyperopt: github.com/hyperopt/hyperopt/pull/509
Available now in Databricks Runtime 5.4 ML
(demo)
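With SparkTrials, distributing the example above is essentially a one-argument change (a sketch; the parallelism value is illustrative):

```python
from hyperopt import SparkTrials, fmin, tpe

# Each Spark task trains and evaluates one hyperparameter setting.
spark_trials = SparkTrials(parallelism=8)

best = fmin(fn=objective,  # same single-machine objective as before
            space=space,
            algo=tpe.suggest,
            max_evals=50,
            trials=spark_trials)
```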
50. Related Content
Blog:
● Hyperparameter Tuning with MLflow, Apache Spark MLlib and Hyperopt
Webinar:
● How to Automate Machine Learning and Scale Delivery
Tutorials:
● Hyperparameter Tuning Documentation
● MLflow integrations with H2O.ai, GPyOpt, and Hyperopt
Notebooks:
● MLlib + Automated MLflow Tracking
● Distributed Hyperopt + Automated MLflow Tracking
● Basic Introduction to DataRobot via API
Videos:
● Automating Predictive Modeling at Zynga with PySpark and Pandas UDFs
● Best Practices for Hyperparameter Tuning with MLflow
● Advanced Hyperparameter Optimization for Deep Learning with MLflow
51. Getting started

Offering                                         | Availability
MLflow (Managed MLflow)                          | Generally Available in Databricks
MLlib + automated MLflow tracking                | Public preview in Databricks Runtime 5.4 & 5.4 ML
Distributed Hyperopt + automated MLflow tracking | Public preview in Databricks Runtime 5.4 ML

https://docs.databricks.com/spark/latest/mllib/index.html#hyperparameter-tuning
https://docs.azuredatabricks.net/spark/latest/mllib/index.html#hyperparameter-tuning
https://mlflow.org/