DataScienceLab, May 13, 2017
Hyper-parameter tuning for machine learning with Bayesian Optimization
Maksym Bevza (Research Engineer at Grammarly)
Every machine learning algorithm needs tuning. We often rely on Grid Search, Randomized Search, or our intuition to pick hyper-parameters. Bayesian Optimization steers Randomized Search toward the most promising regions of the search space, so that we get the same (or a better) result in fewer iterations.
All materials: http://datascience.in.ua/report2017
7. Other complex problems
● And many more
○ Recommender systems
○ Natural language understanding
○ Robotics
○ Grammatical error correction
○ ...
8. Growth of the number of parameters
● The number of parameters grows tremendously
○ Number of layers
○ Convolution kernel size
○ Number of neurons
○ Dropout drop rate
○ Learning rate
○ Batch size
● Preprocessing params
9. Tuning parameters is magic
● Complex systems are hard to analyse
● Impact of parameters on success is obscure
10. Tuning parameters is crucial
● Success of an ML algorithm depends on
○ Data
○ Good algorithm/architecture
○ Good parameter settings
12. Goals
● Introduce Bayesian Optimization to the audience
● Share personal experience
○ Results on digit recognition problem
○ Toolkits for Bayesian Optimization
13. Overview
● Tuning ML hyper-parameters
● Bayesian Optimization
● Available software
● Experiments in research field
● My experiments
16. Grid Search
1. Define a search space
2. Try all 4*3=12 configurations
Search space for SVM Classifier:
{
'C': [1, 10, 100, 1000],
'gamma': [1e-2, 1e-3, 1e-4],
'kernel': ['rbf']
}
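A minimal sketch of how such a grid plugs into scikit-learn's GridSearchCV (the digits dataset and the CV settings are illustrative assumptions, not from the talk):

from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_digits(return_X_y=True)  # illustrative dataset

param_grid = {
    'C': [1, 10, 100, 1000],
    'gamma': [1e-2, 1e-3, 1e-4],
    'kernel': ['rbf'],
}

# Exhaustively evaluates all 4*3*1 = 12 configurations with 3-fold CV
search = GridSearchCV(svm.SVC(), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)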
17. Random Search
1. Define the search space
2. Sample the search space and run the ML algorithm
Search space for SVM Classifier:
{
'C': scipy.stats.expon(scale=100),
'gamma': scipy.stats.expon(scale=.1),
'kernel': ['rbf']
}
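The same idea with scikit-learn's RandomizedSearchCV, a minimal sketch (dataset, n_iter, and CV settings are illustrative assumptions):

from scipy import stats
from sklearn import datasets, svm
from sklearn.model_selection import RandomizedSearchCV

X, y = datasets.load_digits(return_X_y=True)  # illustrative dataset

param_distributions = {
    'C': stats.expon(scale=100),
    'gamma': stats.expon(scale=.1),
    'kernel': ['rbf'],
}

# Draws n_iter samples from the distributions instead of trying a full grid
search = RandomizedSearchCV(svm.SVC(), param_distributions,
                            n_iter=20, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)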
18. Grid Search: pros & cons
● Fully automatic
● Parallelizable
● The number of experiments grows exponentially with the number of parameters
● Waste of time on unimportant parameters
● Some points in search space are not reachable
● Does not learn from previous iterations
19. Random Search: pros & cons
● Fully automatic
● Parallelizable
● The number of iterations is set upfront
● No time waste on unimportant parameters
● All points in the search space are reachable
● Does not learn from previous iterations
● Does not take into account evaluation cost
20. Grid Search vs Random Search
● f(x, y) = g(x) + h(y)
● h(y) is much smaller than g(x), i.e. y is an unimportant parameter
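The point behind this classic illustration (due to Bergstra & Bengio, 2012): a 3x3 grid tries only 3 distinct values of the important parameter x, while 9 random points try 9. A toy sketch with a made-up f:

import numpy as np

rng = np.random.default_rng(0)
g = lambda x: np.sin(5 * x)   # important direction
h = lambda y: 0.01 * y        # nearly irrelevant direction
f = lambda x, y: g(x) + h(y)  # made-up stand-in for a real objective

# 3x3 grid: only 3 distinct values of x are ever explored
grid = [(x, y) for x in (0.0, 0.5, 1.0) for y in (0.0, 0.5, 1.0)]
# 9 random points: 9 distinct values of x
rand = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(9)]

print('grid best:  ', min(f(x, y) for x, y in grid))
print('random best:', min(f(x, y) for x, y in rand))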
21. Grad Student Descent
● A researcher fiddles with the parameters until it works
The method name is due to Ryan Adams
22. Grad Student Descent: pros & cons
● Learns from previous iterations
● Takes into account evaluation cost
● Parallelizable
● Benefits from understanding semantics of hyper-parameters
● Search is biased
● Requires a lot of manual work
23. Comparison of all methods
                                 Grid Search   Random Search   Grad Student Descent
Fully automatic                  Yes           Yes             No
Learns from previous iterations  No            No              Yes
Takes into account eval. cost    No            No              Yes
Parallelizable                   Yes           Yes             Yes
Reasonable search time           No            Yes             Yes
Handles unimportant parameters   No            Yes             Yes
Search is NOT biased             Yes           Yes             No
Good software                    Yes           Yes             N/A
24. Bayesian Optimization: the goal
● Fully automatic
● Learns from previous iterations
● Takes into account evaluation cost
● Search is not biased
● Parallelizable
● Available software is non-free and not stable
26. Background
● Let’s treat our ML algorithm as a function f : X -> Y
● X is our search space of hyper-parameters
● Y is the set of scores that we want to optimize
● Let’s consider the other parameters fixed (e.g. the dataset)
27. Background: Examples
● X - a search space
{
'C': [1, 1000],
'gamma': [0.0001, 0.1],
'kernel': ['rbf'],
}
28. Background: Examples
● We can optimize towards any score (even non-differentiable)
○ Validation error rate
○ AUC
○ Recall at fixed FPR (sketched below)
○ Many more
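For instance, "recall at fixed FPR" is easy to express as a black-box score. A minimal sketch (the helper name and the 1% cap are illustrative, not from the talk):

from sklearn.metrics import roc_curve

def recall_at_fpr(y_true, y_score, max_fpr=0.01):
    # Best recall (TPR) achievable while keeping FPR under the cap
    fpr, tpr, _ = roc_curve(y_true, y_score)
    mask = fpr <= max_fpr
    return tpr[mask].max() if mask.any() else 0.0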
29. Intuition
● Our ML algorithm f gets similar scores for similar settings
● We can leverage this to try settings that are more promising
● For custom scores this condition should hold as well
30. An example
● Let’s consider a one-dimensional function f : R -> R
● Suppose we want to minimize f
Image from https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf
31. An example
● Build all possible functions
● Less smooth functions are less probable
Image from https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf
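The slides skip the formal machinery here; the standard way to encode "smoother functions are more probable" is a Gaussian-process surrogate over f. A minimal sketch with scikit-learn (the observed points are made up):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Made-up evaluations of the expensive function f
X_obs = np.array([[0.1], [0.4], [0.9]])
y_obs = np.array([1.2, 0.3, 0.8])

# The RBF kernel encodes the smoothness prior
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2)).fit(X_obs, y_obs)

# Posterior mean and uncertainty over candidates guide the next evaluation
X_cand = np.linspace(0, 1, 101).reshape(-1, 1)
mean, std = gp.predict(X_cand, return_std=True)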
46. Time limits vs evaluation limits
● Hyper-parameters often impact the evaluation time
● Number of hidden layers, neurons per layer (Deep Learning)
● Number and depth of trees (Random Forest)
● Number of estimators (Gradient Boosting)
47. Time limits vs evaluation limits
● In practice we deal with time limits
● E.g. what’s the best set-up we can get in 7 days?
● Try cheap evaluations first
● Given a rough characterization of f, try expensive evaluations
49. How to account for the cost of evaluation?
● Let’s estimate two functions at a time:
○ The function f itself
○ The cost of evaluation (duration) of f
● We can use BO to estimate both functions
50. Strategy for choosing the next point, accounting for cost
● We chose the point with the highest Expected Improvement
● Pick the point with the highest EI/second instead
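For reference, the standard formulas (following Snoek et al., 2012; not spelled out on these slides). For minimization, with GP posterior mean \mu(x), standard deviation \sigma(x), and best observed value f_{\min}:

EI(x) = \mathbb{E}\big[\max(f_{\min} - f(x),\, 0)\big]
      = (f_{\min} - \mu(x))\,\Phi(z) + \sigma(x)\,\varphi(z),
\qquad z = \frac{f_{\min} - \mu(x)}{\sigma(x)}

The cost-aware rule instead picks \arg\max_x EI(x)/c(x), where c(x) is the model’s predicted evaluation duration.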
54. Comparison of all methods
                                 Grid Search   Random Search   Grad Student Descent   Bayesian Optimization
Fully automatic                  Yes           Yes             No                     Yes
Learns from previous iterations  No            No              Yes                    Yes
Takes into account eval. cost    No            No              Yes                    Yes
Parallelizable                   Yes           Yes             Yes                    Tricky
Reasonable search time           No            Yes             Yes                    Yes
Handles unimportant parameters   No            Yes             Yes                    Yes
Search is NOT biased             Yes           Yes             No                     Yes
Good software                    Yes           Yes             N/A                    No
58. Available software
● The toolkits built by researchers are not well supported
○ Spearmint
○ SMAC
○ HyperOpt
○ BayesOpt
● Non-Bayesian alternatives (see the sketch below)
○ TPE (Tree-structured Parzen Estimator)
○ PSO (Particle Swarm Optimization)
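As a concrete taste of one of these toolkits, a minimal HyperOpt/TPE sketch (the search space mirrors the earlier SVM example; the objective is a made-up stand-in for a real training run):

from hyperopt import fmin, tpe, hp

space = {
    'C': hp.loguniform('C', 0, 7),            # roughly 1 .. 1000 (e^0 .. e^7)
    'gamma': hp.loguniform('gamma', -9, -2),  # roughly 1e-4 .. 0.1
}

def objective(params):
    # Replace with the cross-validated error of a model trained with params
    return (params['C'] - 10) ** 2 * 1e-4 + (params['gamma'] - 0.01) ** 2

best = fmin(objective, space, algo=tpe.suggest, max_evals=50)
print(best)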
59. SigOpt
● SigOpt provides Bayesian Optimization as a service
● Claims state-of-the-art Bayesian Optimization
● Their customers
○ Prudential
○ Huawei
○ MIT
○ Hotwire
○ ...
63. SigOpt API
from sigopt import Connection

# Assumes a client set up beforehand with your API token, e.g.:
# conn = Connection(client_token=SIGOPT_API_TOKEN)
# experiment = conn.experiments().create(...)

for _ in range(30):
    # Ask the service for the next hyper-parameter assignment to try
    suggestion = conn.experiments(experiment.id).suggestions().create()
    # Train and evaluate the model with the suggested hyper-parameters
    value = evaluate_model(suggestion.assignments)
    # Report the observed score back so the optimizer can learn from it
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id,
        value=value,
    )
67. Extensive analysis by Clark et al. (2016)
● Extensive analysis of BO and other search methods
● Different types of functions
○ Oscillatory
○ Discrete values
○ Boring
○ ...
77. Parameters of the model
● 6 parameters tuned
○ Number of filters per layer (1)
○ Number of convolution layers (1)
○ Dense layer sizes (2)
○ Batch size (1)
○ Learning rate (1)
78. Spearmint
● Features
○ Parameter types: INT, FLOAT, ENUM
○ Evaluation data stored in MongoDB
○ Works with noisy functions
● License: non-commercial use only
85. Gotchas
● Spearmint tries the boundaries first
○ Be cautious when setting up your search space
● Use logarithmic scales when it makes sense (see the sketch below)
● Recommendations on the iteration limit
○ 10-20 iterations per parameter