HYPERPARAMETER OPTIMIZATION WITH
APPROXIMATE GRADIENT
Fabian Pedregosa
Chaire Havas-Dauphine
Paris-Dauphine / École Normale
Supérieure
HYPERPARAMETERS
Most machine learning models depend on at least one
hyperparameter to control for model complexity. Examples
include:
Amount of regularization.
Kernel parameters.
Architecture of a neural network.
Model parameters
Estimated using some (regularized) goodness of fit on the data.
Hyperparameters
Cannot be estimated using the same criteria as model parameters (overfitting).
HYPERPARAMETER SELECTION
Criteria for hyperparameter selection:
Optimize loss on unseen data: cross-validation.
Minimize risk estimator: SURE, AIC/BIC, etc.
Example: least squares with $\ell_2$ regularization.
$\text{loss} = \sum_{i=1}^{n} (b_i - \langle a_i, X(\lambda) \rangle)^2$
Costly evaluation function, non-convex.
Common methods: grid search, random search, SMBO.
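The cost of the evaluation function can be made concrete with a minimal sketch. It assumes the regularization strength is parametrized as exp(λ) and that the training objective is 0.5‖A x − b‖² + 0.5 exp(λ)‖x‖²; both choices, the synthetic data, and the names `ridge_solution` and `validation_loss` are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def ridge_solution(A_train, b_train, lam):
    """X(lam): minimizer of 0.5*||A x - b||^2 + 0.5*exp(lam)*||x||^2 (assumed form)."""
    p = A_train.shape[1]
    return np.linalg.solve(A_train.T @ A_train + np.exp(lam) * np.eye(p),
                           A_train.T @ b_train)

def validation_loss(A_val, b_val, x):
    """Sum of squared residuals on held-out data, as in the slide's loss."""
    return float(np.sum((b_val - A_val @ x) ** 2))

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
x_true = rng.normal(size=5)
A_train, A_val = rng.normal(size=(30, 5)), rng.normal(size=(20, 5))
b_train = A_train @ x_true + 0.5 * rng.normal(size=30)
b_val = A_val @ x_true + 0.5 * rng.normal(size=20)

# Grid search: every grid point requires a full model fit.
grid = np.linspace(-4.0, 6.0, 11)
losses = [validation_loss(A_val, b_val, ridge_solution(A_train, b_train, lam))
          for lam in grid]
best_lam = grid[int(np.argmin(losses))]
```

Each evaluation of the loss hides a complete inner solve, which is why grid search scales poorly as the number of hyperparameters grows.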
GRADIENT-BASED HYPERPARAMETER OPTIMIZATION
Compute gradients with respect to hyperparameters [Larsen 1996, 1998, Bengio 2000].
Hyperparameter optimization as nested or bi-level optimization:
$\arg\min_{\lambda \in \mathcal{D}} f(\lambda) \triangleq g(X(\lambda), \lambda)$  (loss on test set)
$\text{s.t. } X(\lambda) \in \arg\min_{x \in \mathbb{R}^p} h(x, \lambda)$  (loss on train set)
where $X(\lambda)$ are the model parameters.
GOAL: COMPUTE ∇f (λ)
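By chain rule,
$\nabla f = \frac{\partial g}{\partial \lambda} + \frac{\partial g}{\partial X} \cdot \frac{\partial X}{\partial \lambda}$
where $\partial g / \partial \lambda$ and $\partial g / \partial X$ are known and $\partial X / \partial \lambda$ is unknown.
Two main approaches: implicit differentiation and iterative differentiation [Domke et al. 2012, Maclaurin et al. 2015].
Implicit differentiation [Larsen 1996, Bengio 2000]: formulate inner optimization as implicit equation.
$X(\lambda) \in \arg\min_x h(x, \lambda) \iff \nabla_1 h(X(\lambda), \lambda) = 0$  (implicit equation for $X$)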
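The step from the implicit equation to an expression for the unknown Jacobian can be written out as follows. This is a sketch of the standard implicit-function argument, assuming $h$ is twice continuously differentiable and $\nabla^2_1 h$ is invertible at $(X(\lambda), \lambda)$. Here $\nabla^2_{1,2} h$ denotes the derivative of $\nabla_1 h$ with respect to $\lambda$, and all derivatives of $h$ and $g$ are evaluated at $(X(\lambda), \lambda)$.

```latex
\begin{align*}
0 &= \frac{d}{d\lambda}\,\nabla_1 h(X(\lambda), \lambda)
   = \nabla^2_1 h \,\frac{\partial X}{\partial \lambda} + \nabla^2_{1,2} h \\
\Longrightarrow\quad
\frac{\partial X}{\partial \lambda} &= -\left(\nabla^2_1 h\right)^{-1} \nabla^2_{1,2} h \\
\Longrightarrow\quad
\nabla f &= \nabla_2 g + \left(\frac{\partial X}{\partial \lambda}\right)^{T} \nabla_1 g
          = \nabla_2 g - \left(\nabla^2_{1,2} h\right)^{T} \left(\nabla^2_1 h\right)^{-1} \nabla_1 g
\end{align*}
```

The last equality uses the symmetry of the Hessian $\nabla^2_1 h$.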
GRADIENT-BASED HYPERPARAMETER OPTIMIZATION
$\nabla f = \nabla_2 g - (\nabla^2_{1,2} h)^T (\nabla^2_1 h)^{-1} \nabla_1 g$
Possible to compute gradient w.r.t. hyperparameters, given
Solution to the inner optimization $X(\lambda)$
Solution to linear system $(\nabla^2_1 h)^{-1} \nabla_1 g$
$\implies$ computationally expensive.
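A small numerical check of the implicit-differentiation gradient. It is a sketch under stated assumptions: a ridge inner problem h(x, λ) = 0.5‖A_tr x − b_tr‖² + 0.5 exp(λ)‖x‖², an outer loss g(x, λ) = 0.5‖A_val x − b_val‖² (with a factor 0.5 added for convenience), a scalar hyperparameter, and synthetic data. None of these specifics come from the slides.

```python
import numpy as np

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
A_tr, A_val = rng.normal(size=(30, 5)), rng.normal(size=(20, 5))
x_true = rng.normal(size=5)
b_tr = A_tr @ x_true + 0.5 * rng.normal(size=30)
b_val = A_val @ x_true + 0.5 * rng.normal(size=20)

def inner_solution(lam):
    """Exact X(lam) = argmin_x h(x, lam) for the assumed ridge problem."""
    return np.linalg.solve(A_tr.T @ A_tr + np.exp(lam) * np.eye(5), A_tr.T @ b_tr)

def f(lam):
    """Outer objective f(lam) = g(X(lam), lam)."""
    r = A_val @ inner_solution(lam) - b_val
    return 0.5 * float(r @ r)

def hypergradient(lam):
    """Gradient of f via implicit differentiation."""
    x = inner_solution(lam)                              # solution to the inner optimization
    hess_11 = A_tr.T @ A_tr + np.exp(lam) * np.eye(5)    # Hessian of h in x
    hess_12 = np.exp(lam) * x                            # derivative of grad_x h in lam
    grad_1_g = A_val.T @ (A_val @ x - b_val)             # gradient of g in x
    grad_2_g = 0.0                                       # g does not depend on lam directly
    q = np.linalg.solve(hess_11, grad_1_g)               # solution to the linear system
    return grad_2_g - float(hess_12 @ q)

# Compare against a central finite difference of f.
lam0, eps = 0.3, 1e-5
finite_diff = (f(lam0 + eps) - f(lam0 - eps)) / (2 * eps)
```

Both ingredients listed on the slide appear explicitly: the exact inner solution and an exact linear solve, each costing a full factorization here.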
HOAG: HYPERPARAMETER OPTIMIZATION WITH APPROXIMATE
GRADIENT
Replace $X(\lambda)$ by an approximate solution of the inner optimization.
Approximately solve linear system.
Update $\lambda$ using $p_k \approx \nabla f$.
Tradeoff
Loose approximation
Cheap iterations, might diverge.
Precise approximation
Costly iterations, convergence to stationary point.
HOAG: at iteration $k = 1, 2, \ldots$ perform the following:
i) Solve the inner optimization problem up to tolerance $\varepsilon_k$, i.e. find $x_k \in \mathbb{R}^p$ such that
$\|X(\lambda_k) - x_k\| \leq \varepsilon_k$.
ii) Solve the linear system up to tolerance $\varepsilon_k$. That is, find $q_k$ such that
$\|\nabla^2_1 h(x_k, \lambda_k) q_k - \nabla_1 g(x_k, \lambda_k)\| \leq \varepsilon_k$.
iii) Compute approximate gradient $p_k$ as
$p_k = \nabla_2 g(x_k, \lambda_k) - \nabla^2_{1,2} h(x_k, \lambda_k)^T q_k$.
iv) Update hyperparameters:
$\lambda_{k+1} = P_{\mathcal{D}}\left(\lambda_k - \frac{1}{L} p_k\right)$.
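The four steps can be sketched for a ridge inner problem. This is an illustrative sketch, not the author's implementation. It assumes h(x, λ) = 0.5‖A_tr x − b_tr‖² + 0.5 exp(λ)‖x‖² and g(x, λ) = 0.5‖A_val x − b_val‖², uses conjugate gradient for both inexact solves, replaces 1/L with a fixed hypothetical `step`, and projects onto a hypothetical interval `bounds`. The names `cg` and `hoag_ridge` and the synthetic data are likewise made up for the example.

```python
import numpy as np

def cg(matvec, rhs, x0, tol, max_iter=1000):
    """Conjugate gradient, stopped once the residual norm is at most tol."""
    x = x0.copy()
    r = rhs - matvec(x)
    d = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol:
            break
        Ad = matvec(d)
        alpha = rs / (d @ Ad)
        x += alpha * d
        r -= alpha * Ad
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new
    return x

def hoag_ridge(A_tr, b_tr, A_val, b_val, lam0=0.0, step=0.05, n_iter=50,
               bounds=(-8.0, 8.0)):
    dim = A_tr.shape[1]
    lam, x, q = lam0, np.zeros(dim), np.zeros(dim)   # x and q are warm-started
    history = []
    for k in range(1, n_iter + 1):
        eps_k = 0.1 * 0.9 ** k                       # exponentially decreasing tolerance
        mu = np.exp(lam)

        def hess(v):
            """Hessian of h in x, applied to v."""
            return A_tr.T @ (A_tr @ v) + mu * v

        # i) inexact inner solve: the Hessian's smallest eigenvalue is >= mu,
        #    so residual <= eps_k * mu implies ||X(lam) - x|| <= eps_k
        x = cg(hess, A_tr.T @ b_tr, x, tol=eps_k * mu)
        # ii) inexact linear system, residual <= eps_k
        grad_1_g = A_val.T @ (A_val @ x - b_val)
        q = cg(hess, grad_1_g, q, tol=eps_k)
        # iii) approximate gradient: grad_2 g = 0 and the cross derivative is mu * x
        p_k = -float((mu * x) @ q)
        # iv) projected gradient step on the hyperparameter
        lam = float(np.clip(lam - step * p_k, *bounds))
        history.append(0.5 * float(np.sum((A_val @ x - b_val) ** 2)))
    return lam, x, history

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
A_tr, A_val = rng.normal(size=(30, 5)), rng.normal(size=(20, 5))
x_true = rng.normal(size=5)
b_tr = A_tr @ x_true + 0.5 * rng.normal(size=30)
b_val = A_val @ x_true + 0.5 * rng.normal(size=20)
lam_hat, x_hat, history = hoag_ridge(A_tr, b_tr, A_val, b_val)
```

Warm-starting `x` and `q` from the previous iteration keeps the early, loose solves cheap, which is the tradeoff the method exploits. In practice the step size would need tuning or an adaptive rule, since the Lipschitz constant L is rarely known.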
ANALYSIS - GLOBAL CONVERGENCE
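Assumptions:
(A1). $\nabla g$ and $\nabla^2 h$ are Lipschitz.
(A2). $\nabla^2_1 h(X(\lambda), \lambda)$ non-singular.
(A3). Domain $\mathcal{D}$ is bounded.
Corollary: If $\sum_{i=1}^{\infty} \varepsilon_i < \infty$, then $\lambda_k$ converges to a stationary point $\lambda^*$:
$\langle \nabla f(\lambda^*), \alpha - \lambda^* \rangle \geq 0, \quad \forall \alpha \in \mathcal{D}$
$\implies$ if $\lambda^*$ is in the interior of $\mathcal{D}$ then $\nabla f(\lambda^*) = 0$.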
EXPERIMENTS
How to choose tolerance $\varepsilon_k$?
Different strategies for the tolerance decrease. Quadratic: $\varepsilon_k = 0.1 / k^2$, Cubic: $0.1 / k^3$, Exponential: $0.1 \times 0.9^k$.
Approximate-gradient strategies achieve much faster
decrease in early iterations.
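All three schedules are summable, so each satisfies the condition that the tolerances have a finite sum. A short sketch that tabulates them and their partial sums; the function names and the cutoff of 10000 terms are illustrative choices.

```python
import math

def quadratic(k):
    return 0.1 / k ** 2

def cubic(k):
    return 0.1 / k ** 3

def exponential(k):
    return 0.1 * 0.9 ** k

schedules = {"quadratic": quadratic, "cubic": cubic, "exponential": exponential}

# Partial sums over k = 1..10000. The full series converge to
# 0.1 * pi^2 / 6 (quadratic), 0.1 * zeta(3) (cubic) and 0.9 (exponential).
partial_sums = {name: sum(eps(k) for k in range(1, 10001))
                for name, eps in schedules.items()}
quadratic_limit = 0.1 * math.pi ** 2 / 6

# Tolerances at the first few iterations, to compare how fast each tightens.
first_five = {name: [eps(k) for k in range(1, 6)] for name, eps in schedules.items()}
```

The exponential schedule starts at a similar tolerance but stays looser than the polynomial ones during the first iterations (0.09 versus 0.1 at k = 1, then 0.081 versus 0.025 and 0.0125 at k = 2), and only becomes tighter much later.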
EXPERIMENTS I
Model: $\ell_2$-regularized logistic regression.
1 hyperparameter.
Datasets:
20news (18k × 130k)
real-sim (73k × 20k)
EXPERIMENTS II
Kernel ridge regression.
2 hyperparameters.
Parkinson dataset: 654 × 17
Multinomial logistic regression with one hyperparameter per feature [Maclaurin et al. 2015]
784 × 10 hyperparameters
MNIST dataset: 60k × 784
CONCLUSION
Hyperparameter optimization with inexact gradient:
can update hyperparameters before model parameters
have fully converged.
independent of inner optimization algorithm.
convergence guarantees under smoothness
assumptions.
Open questions.
Non-smooth inner optimization (e.g. sparse models)?
Stochastic / online approximation?
REFERENCES
[Y. Bengio, 2000] Bengio, Yoshua. "Gradient-based optimization of
hyperparameters." Neural computation 12.8 (2000): 1889-1900.
[J. Bergstra, Y. Bengio 2012] Bergstra, James, and Yoshua Bengio. "Random
search for hyper-parameter optimization." The Journal of Machine
Learning Research 13.1 (2012): 281-305.
[J. Snoek et al., 2015] Snoek, J. et al. Scalable Bayesian Optimization Using
Deep Neural Networks. (2015). at http://arxiv.org/abs/1502.05700
[K. Swersky et al., 2014] Swersky, K., Snoek, J. & Adams, R. Freeze-Thaw
Bayesian Optimization. arXiv Prepr. arXiv1406.3896 1–12 (2014). at
http://arxiv.org/abs/1406.3896
[F. Hutter et al., 2013] Hutter, F., Hoos, H. & Leyton-Brown, K. An
evaluation of sequential model-based optimization for expensive blackbox
functions.
REFERENCES 2
[M. Schmidt et al., 2013] Schmidt, M., Roux, N. & Bach, F. Minimizing finite
sums with the stochastic average gradient. arXiv Prepr. arXiv1309.2388
1–45 (2013). at http://arxiv.org/abs/1309.2388
[J. Domke et al., 2012] Domke, J. Generic Methods for Optimization-Based
Modeling. Proc. Fifteenth Int. Conf. Artif. Intell. Stat. XX, 318–326 (2012).
[M. P. Friedlander et al., 2012] Friedlander, M. P. & Schmidt, M. Hybrid
Deterministic-Stochastic Methods for Data Fitting. SIAM J. Sci. Comput.
34, A1380–A1405 (2012).
EXPERIMENTS - COST FUNCTION
EXPERIMENTS
Comparison with other hyperparameter optimization
methods
Random = Random search, SMBO = Sequential Model-Based
Optimization (Gaussian process), Iterdiff = reverse-mode
differentiation.
EXPERIMENTS
Comparison in terms of a validation loss.
Random = Random search, SMBO = Sequential Model-Based
Optimization (Gaussian process), Iterdiff = reverse-mode
differentiation.

 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 

Hyperparameter optimization with approximate gradient

  • 1. HYPERPARAMETER OPTIMIZATION WITH APPROXIMATE GRADIENT Fabian Pedregosa Chaire Havas-Dauphine Paris-Dauphine / École Normale Supérieure
  • 2. HYPERPARAMETERS Most machine learning models depend on at least one hyperparameter to control for model complexity. Examples include: amount of regularization; kernel parameters; architecture of a neural network. Model parameters: estimated using some (regularized) goodness of fit on the data. Hyperparameters: cannot be estimated using the same criteria as model parameters (overfitting).
  • 3. HYPERPARAMETER SELECTION Criteria for hyperparameter selection: Optimize loss on unseen data: cross-validation. Minimize a risk estimator: SURE, AIC/BIC, etc. Example: least squares with $\ell_2$ regularization, loss $= \sum_{i=1}^{n} (b_i - a_i^T X(\lambda))^2$. Costly evaluation function, non-convex. Common methods: grid search, random search, SMBO.
  • 4. GRADIENT-BASED HYPERPARAMETER OPTIMIZATION Compute gradients with respect to hyperparameters [Larsen 1996, 1998, Bengio 2000]. Hyperparameter optimization as nested or bi-level optimization: $\arg\min_{\lambda \in \mathcal{D}} f(\lambda) \triangleq g(X(\lambda), \lambda)$ (loss on test set) s.t. $X(\lambda) \in \arg\min_{x \in \mathbb{R}^p} h(x, \lambda)$, where $X(\lambda)$ are the model parameters and $h$ is the loss on the train set.
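  • 5. GOAL: COMPUTE $\nabla f(\lambda)$ By chain rule, $\nabla f = \frac{\partial g}{\partial \lambda} + \frac{\partial g}{\partial X} \cdot \frac{\partial X}{\partial \lambda}$, where $\partial g / \partial \lambda$ and $\partial g / \partial X$ are known and $\partial X / \partial \lambda$ is unknown. Two main approaches: implicit differentiation and iterative differentiation [Domke et al. 2012, Maclaurin et al. 2015]. Implicit differentiation [Larsen 1996, Bengio 2000]: formulate inner optimization as implicit equation. $X(\lambda) \in \arg\min_x h(x, \lambda) \iff \nabla_1 h(X(\lambda), \lambda) = 0$ (implicit equation for $X$).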
  • 6. GRADIENT-BASED HYPERPARAMETER OPTIMIZATION $\nabla f = \nabla_2 g - (\nabla^2_{1,2} h)^T (\nabla^2_1 h)^{-1} \nabla_1 g$. Possible to compute gradient w.r.t. hyperparameters, given: the solution $X(\lambda)$ to the inner optimization; the solution to the linear system $(\nabla^2_1 h)^{-1} \nabla_1 g$ $\implies$ computationally expensive.
  • 7. HOAG: HYPERPARAMETER OPTIMIZATION WITH APPROXIMATE GRADIENT Replace $X(\lambda)$ by an approximate solution of the inner optimization. Approximately solve linear system. Update $\lambda$ using $p_k \approx \nabla f$. Tradeoff: Loose approximation: cheap iterations, might diverge. Precise approximation: costly iterations, convergence to stationary point.
  • 8. HOAG At iteration $k = 1, 2, \ldots$ perform the following:
      i) Solve the inner optimization problem up to tolerance $\varepsilon_k$, i.e. find $x_k \in \mathbb{R}^p$ such that $\|X(\lambda_k) - x_k\| \le \varepsilon_k$.
      ii) Solve the linear system up to tolerance $\varepsilon_k$. That is, find $q_k$ such that $\|\nabla^2_1 h(x_k, \lambda_k)\, q_k - \nabla_1 g(x_k, \lambda_k)\| \le \varepsilon_k$.
      iii) Compute approximate gradient $p_k$ as $p_k = \nabla_2 g(x_k, \lambda_k) - \nabla^2_{1,2} h(x_k, \lambda_k)^T q_k$.
      iv) Update hyperparameters: $\lambda_{k+1} = P_{\mathcal{D}}\left(\lambda_k - \frac{1}{L} p_k\right)$, where $P_{\mathcal{D}}$ denotes projection onto the domain $\mathcal{D}$.
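  • 9. ANALYSIS - GLOBAL CONVERGENCE Assumptions: (A1) $\nabla g$ and $\nabla^2 h$ are Lipschitz. (A2) $\nabla^2_1 h(X(\lambda), \lambda)$ is non-singular. (A3) The domain $\mathcal{D}$ is bounded. Corollary: If $\sum_{i=1}^{\infty} \varepsilon_i < \infty$, then $\lambda_k$ converges to a stationary point $\lambda^*$: $\langle \nabla f(\lambda^*), \alpha - \lambda^* \rangle \ge 0, \ \forall \alpha \in \mathcal{D}$ $\implies$ if $\lambda^*$ is in the interior of $\mathcal{D}$ then $\nabla f(\lambda^*) = 0$.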
  • 10. EXPERIMENTS How to choose tolerance $\varepsilon_k$? Different strategies for the tolerance decrease. Quadratic: $\varepsilon_k = 0.1 / k^2$, Cubic: $0.1 / k^3$, Exponential: $0.1 \times 0.9^k$. Approximate-gradient strategies achieve much faster decrease in early iterations.
  • 11. EXPERIMENTS I Model: $\ell_2$-regularized logistic regression. 1 hyperparameter. Datasets: 20news (18k × 130k), real-sim (73k × 20k).
  • 12. EXPERIMENTS II Kernel ridge regression. 2 hyperparameters. Parkinson dataset: 654 × 17. Multinomial logistic regression with one hyperparameter per feature [Maclaurin et al. 2015]: 784 × 10 hyperparameters. MNIST dataset: 60k × 784.
  • 13. CONCLUSION Hyperparameter optimization with inexact gradient: can update hyperparameters before model parameters have fully converged. independent of inner optimization algorithm. convergence guarantees under smoothness assumptions. Open questions. Non-smooth inner optimization (e.g. sparse models)? Stochastic / online approximation?
  • 14. REFERENCES [Y. Bengio, 2000] Bengio, Yoshua. "Gradient-based optimization of hyperparameters." Neural computation 12.8 (2000): 1889-1900. [J. Bergstra, Y. Bengio 2012] Bergstra, James, and Yoshua Bengio. "Random search for hyper-parameter optimization." The Journal of Machine Learning Research 13.1 (2012): 281-305. [J. Snoek et al., 2015] Snoek, J. et al. Scalable Bayesian Optimization Using Deep Neural Networks. (2015). at http://arxiv.org/abs/1502.05700a [K. Swersky et al., 2014] Swersky, K., Snoek, J. & Adams, R. Freeze-Thaw Bayesian Optimization. arXiv Prepr. arXiv:1406.3896 1–12 (2014). at http://arxiv.org/abs/1406.3896 [F. Hutter et al., 2013] Hutter, F., Hoos, H. & Leyton-Brown, K. An evaluation of sequential model-based optimization for expensive blackbox functions.
  • 15. REFERENCES 2 [M. Schmidt et al., 2013] Schmidt, M., Roux, N. & Bach, F. Minimizing finite sums with the stochastic average gradient. arXiv Prepr. arXiv:1309.2388 1–45 (2013). at http://arxiv.org/abs/1309.2388 [J. Domke et al., 2012] Domke, J. Generic Methods for Optimization-Based Modeling. Proc. Fifteenth Int. Conf. Artif. Intell. Stat. XX, 318–326 (2012). [M. P. Friedlander et al., 2012] Friedlander, M. P. & Schmidt, M. Hybrid Deterministic-Stochastic Methods for Data Fitting. SIAM J. Sci. Comput. 34, A1380–A1405 (2012).
  • 16. EXPERIMENTS - COST FUNCTION
  • 17. EXPERIMENTS Comparison with other hyperparameter optimization methods. Random = Random search, SMBO = Sequential Model-Based Optimization (Gaussian process), Iterdiff = reverse-mode differentiation.
  • 18. EXPERIMENTS Comparison in terms of a validation loss. Random = Random search, SMBO = Sequential Model-Based Optimization (Gaussian process), Iterdiff = reverse-mode differentiation.
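
The four HOAG steps on slide 8 can be illustrated with a small NumPy sketch. It is not the author's implementation, and everything concrete in it is an assumption made for illustration: the synthetic ridge-regression problem, the parametrization of the regularization strength as $e^{\lambda}$, the fixed step constant `L = 5`, the interval domain, the hand-written conjugate-gradient solver, and all function names. The tolerance follows the exponential schedule $0.1 \times 0.9^k$ from slide 10. Because the inner objective is strongly convex with modulus at least $e^{\lambda}$, stopping the inner solve once $\|\nabla_1 h\| \le \varepsilon_k e^{\lambda}$ is enough to guarantee $\|X(\lambda_k) - x_k\| \le \varepsilon_k$.

```python
import numpy as np


def make_problem(seed=0, n=100, p=10):
    """Synthetic train/test regression data (an illustrative assumption)."""
    rng = np.random.default_rng(seed)
    x_true = rng.standard_normal(p)
    A_tr, A_te = rng.standard_normal((n, p)), rng.standard_normal((n, p))
    b_tr = A_tr @ x_true + 0.5 * rng.standard_normal(n)
    b_te = A_te @ x_true + 0.5 * rng.standard_normal(n)
    return A_tr, b_tr, A_te, b_te


def hess_h(A_tr, lam):
    """Hessian of h(x, lam) = ||A_tr x - b_tr||^2 / (2n) + exp(lam) ||x||^2 / 2."""
    n, p = A_tr.shape
    return A_tr.T @ A_tr / n + np.exp(lam) * np.eye(p)


def val_loss(A_te, b_te, x):
    """Outer objective g(x) = ||A_te x - b_te||^2 / (2n); it has no direct lam term."""
    return 0.5 * np.mean((A_te @ x - b_te) ** 2)


def grad_g(A_te, b_te, x):
    """Gradient of g with respect to the model parameters x."""
    return A_te.T @ (A_te @ x - b_te) / len(b_te)


def approx_gradient(lam, x, q):
    """p = grad_2 g - (grad^2_{1,2} h)^T q, with grad_2 g = 0 and grad^2_{1,2} h = exp(lam) x."""
    return -np.exp(lam) * (x @ q)


def cg(H, rhs, x0, tol):
    """Conjugate gradient for H x = rhs, warm-started at x0, stopped at residual <= tol."""
    x = x0.copy()
    r = rhs - H @ x
    d = r.copy()
    for _ in range(10 * len(rhs)):
        rr = r @ r
        if np.sqrt(rr) <= tol:
            break
        Hd = H @ d
        alpha = rr / (d @ Hd)
        x = x + alpha * d
        r = r - alpha * Hd
        d = r + (r @ r / rr) * d
    return x


def hoag(A_tr, b_tr, A_te, b_te, lam0=0.0, L=5.0, n_iter=30, bounds=(-8.0, 8.0)):
    """Sketch of the HOAG loop; L and bounds are hand-picked, not tuned."""
    n, p = A_tr.shape
    lam, x, q = lam0, np.zeros(p), np.zeros(p)
    for k in range(1, n_iter + 1):
        eps = 0.1 * 0.9 ** k  # exponential tolerance schedule
        H = hess_h(A_tr, lam)
        # (i) inexact inner solve: grad_1 h(x) = H x - A_tr^T b_tr / n
        x = cg(H, A_tr.T @ b_tr / n, x, eps * np.exp(lam))
        # (ii) inexact linear system: H q = grad_1 g(x)
        q = cg(H, grad_g(A_te, b_te, x), q, eps)
        # (iii) approximate hypergradient
        p_k = approx_gradient(lam, x, q)
        # (iv) projected gradient step on the hyperparameter
        lam = float(np.clip(lam - p_k / L, *bounds))
    return lam, x


if __name__ == "__main__":
    data = make_problem()
    lam_hat, x_hat = hoag(*data)
    print("selected lambda:", lam_hat)
    print("validation loss:", val_loss(data[2], data[3], x_hat))
```

Both solves are warm-started from the previous iterate, so early iterations with a loose tolerance cost only a few conjugate-gradient steps. A fixed `L` is the simplest choice for a sketch; in practice the step size would need to be estimated or adapted.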