Machine Learning Hyperparameter Optimization

Copyright © GREE, Inc. All Rights Reserved.
機械学習モデルのハイパパラメータ最適化

• 尾崎嘉彦
• グリー株式会社　エンジニア
• Webゲーム開発 -> 機械学習
• 産総研　特定集中研究専門員
• ブラックボックス最適化
• 微分フリー最適化
• ハイパパラメータ最適化
発表者の紹介

イントロダクション

機械学習におけるハイパパラメータ
モデル自身や学習に関わる手法が持つ，性能に影響を及ぼす調整可能なパラメータ
x
t
ln λ = −18
0 1
−1
0
1
x
t
ln λ = 0
0 1
−1
0
1
正則化項のはたらき (Bishop, 2006) Adam optimizer (Kingma and Ba 2015)

モデルの複雑化に伴いハイパパラメータ数も増加
手作業や簡単な手法では細かい調整が手に負えない状況
7x7conv,64,/2
pool,/2
3x3conv,64
3x3conv,64
3x3conv,64
3x3conv,64
3x3conv,64
3x3conv,64
3x3conv,128,/2
3x3conv,128
3x3conv,128
3x3conv,128
3x3conv,128
3x3conv,128
3x3conv,128
3x3conv,128
3x3conv,256,/2
3x3conv,256
3x3conv,256
3x3conv,256
3x3conv,256
3x3conv,256
3x3conv,256
3x3conv,256
3x3conv,256
3x3conv,256
3x3conv,256
3x3conv,256
3x3conv,512,/2
3x3conv,512
3x3conv,512
3x3conv,512
3x3conv,512
3x3conv,512
avgpool
fc1000
image
3x3conv,512
3x3conv,64
3x3conv,64
pool,/2
3x3conv,128
3x3conv,128
pool,/2
3x3conv,256
3x3conv,256
3x3conv,256
3x3conv,256
pool,/2
3x3conv,512
3x3conv,512
3x3conv,512
pool,/2
3x3conv,512
3x3conv,512
3x3conv,512
3x3conv,512
pool,/2
fc4096
fc4096
fc1000
image
output
size:112
output
size:224
output
size:56
output
size:28
output
size:14
output
size:7
output
size:1
VGG-1934-layerplain
7x7conv,64,/2
pool,/2
3x3conv,64
3x3conv,64
3x3conv,64
3x3conv,64
3x3conv,64
3x3conv,64
3x3conv,128,/2
3x3conv,128
3x3conv,128
3x3conv,128
3x3conv,128
3x3conv,128
3x3conv,128
3x3conv,128
3x3conv,256,/2
3x3conv,256
3x3conv,256
3x3conv,256
3x3conv,256
3x3conv,256
3x3conv,256
3x3conv,256
3x3conv,256
3x3conv,256
3x3conv,256
3x3conv,256
3x3conv,512,/2
3x3conv,512
3x3conv,512
3x3conv,512
3x3conv,512
3x3conv,512
avgpool
fc1000
image
34-layerresidual
Residual Network (He et al. 2016)

ハイパパラメータ最適化の研究の盛り上がり
深層学習等の実用において必要不可欠な道具へ発展
• 探索空間が広大
• 関数評価コストが高価
• 目的関数がノイジー
• 変数のタイプが多様
ベイズ最適化などを中心に研究が発展 (Hutter et al. 2015)
ハイパパラメータ調整の自動化は最適化問題としてチャレンジング

ハイパパラメータ最適化問題の定式化
性能指標（損失関数）を最小化するブラックボックス最適化と考えるのが標準的
Minimize f(λ)
subject to λ ∈ Λ.
自分たちが観測できるのは，ノイズを伴った目的関数値のみ
目的関数が数式の形で明示的には与えられない
fϵ(λ) = f(λ) + ϵ, ϵ
iid
∼ N(0, σ2
n)

ブラックボックス最適化
利点と欠点
• 目的関数値しか要らない
• モデルや損失関数に依存せず極めて汎用的
• 目的関数の素性が不明
• 勾配情報が利用不可（効率的な最適化手法を考えるのが難しい）
• 微分フリー最適化手法が必要
利点
欠点

ハイパパラメータ最適化問題の定式化
最適化対象として直接k-fold cross validation lossなどを考えるのが一般的
fϵ(λ) =
1
k
k
i=1
L(Aλ, Di
train, Di
valid)

ハイパパラメータの分類
連続は一番扱いやすく，条件的は一番扱いにくい

最適化手法

• Strong Anytime Performance
• 厳しい制約のもとで，良い性能が得られること
• Strong Final Performance
• 緩い制約のもとで，非常に良い設定が得られること
• Effective Use of Parallel Resources
• 効率的に並列化できること
• Scalability
• 非常に多くのパラメータ数でも問題なく扱うことができること
• Robustness & Flexibility
• 目的関数値の観測ノイズや非常にセンシティブなパラメータに対して，
頑健かつ柔軟であること
ハイパパラメータ最適化手法が満たすべき要件 (Falkner et al. 2018a)
全てを満たすのは難しいため，現実には目的に応じて取捨選択が必要

手法の分類
Dodge et al. (2017)
λk
{(λi
, f(λi
))}k−1
i=1
λk
{λi
}k−1
i=1
• ベイズ最適化など
• 目的関数値を活用して効率的に最適化
• 評価回数を少なく抑えられる傾向
• グリッドサーチやランダムサーチなど
• 目的関数値に対する依存性がないため，リソースの許す限り並列評価が可能
• CPU時間に対する課金が主流のクラウド計算資源と相性がよい
• ウォールクロックタイムを少なく抑えられる傾向

グリッドサーチ
ハイパパラメータ調整に言及していたNIPS2014の論文88本のうち84本が使用 (Simm 2015)

グリッドサーチ
利点と欠点
• 並列化しやすく，計算リソースに対してスケーラブル
• 低実効次元性（後述）に著しく脆弱
• 計算量がパラメータ数の指数オーダーのためノンスケーラブル
• 局所・大域的最適解を見つける能力が貧弱

実験計画法 (Design of Experiments)
最良の点を中心とするより狭い範囲を反復的にサンプリング (Staelin 2002)
黒：2-level DOE
白：3-level DOE
黒：2-level DOEの1反復目
白：左下黒を最良と仮定した2反復目

ランダムサーチ
グリッドサーチと並んで最もシンプルな手法

ランダムサーチ
利点と欠点
• 並列化しやすく，計算リソースに対してスケーラブル
• パラメータ数に対してスケーラブル
• 低実効次元性（後述）に頑健
• 局所・大域的最適解を見つける能力が貧弱
利点
欠点

低実効次元性 (Low Effective Dimensionality)
モデル性能にとって重要なパラメータは少数であるためグリッドサーチは非効率，
またデータセット毎にそれらは異なる (Bergstra et al. 2012)
Important parameter
Unimportantparameter
Important parameter
Unimportantparameter
f(λ1, λ2) = g(λ1) + h(λ2) ≈ g(λ1)

• Hutter et al. (2014)
• functional ANOVAによるアプローチで重要なハイパパラメータを特定
• Fawcett and Hoos (2016)
• 2つの設定間で最もパフォーマンスに貢献しているパラメータを調べるablation
analysis
• Biedenkapp et al. (2017)
• サロゲートを用いることでablation analysisを高速化
• van Rijn and Hutter (2017a, b)
• functional ANOVAを用いて大規模にデータセット間のハイパパラメータ重要性を分析
重要なハイパパラメータの特定
近年の研究動向

低食い違い量列 (Low Discrepancy Sequence)
一様ランダムの代わりにSobol列やLatin Hypercube Samplingの使用を提案，計算実験の
結果Sobol列が有望 (Bergstra et al. 2012)，Dodge et al. 2017はk-DPPの使用を提案
Uniform Sobol LHS

Nelder-Mead法 (Nelder and Mead 1965)
反復的に単体を変形し最適化，Rのoptim関数の標準手法として採用されている
1次元，2次元および3次元単体

λ⁰
λ2
λ¹
λic
λc
λoc
λr
λe
f(λ0
) ≤ f(λ1
) ≤ f(λ2
)

λ⁰
λ2
λ¹
λic
λc
λoc
λr
λe
Reﬂect: λr
= λc
+ δr
(λc
− λn
) where λc
=
n−1
i=0 λi
/n

λ⁰
λ2
λ¹
λic
λc
λoc
λr
λe
Expand: λe
= λc
+ δe
(λc
− λn
)

λ⁰
λ2
λ¹
λic
λc
λoc
λr
λe
Outside contract: λoc
= λc
+ δoc
(λc
− λn
)

λ⁰
λ2
λ¹
λic
λc
λoc
λr
λe
Inside contract: λic
= λc
+ δic
(λc
− λn
)

λ⁰
λ2
λ¹
λic
λ1s
λoc
λr
λe
λ2s
Shrink: λ0
+ γs
(λi
− λ0
) : i = 0, . . . , n}

λ0
λ1
λ2
f(λ0
) ≤ f(λ1
) ≤ f(λ2
)

λ0
λ1
λr
λ2
Reﬂect

λ0
λ1
λr
λe
λ2
f(λr
) < f(λ0
) Expand

λ0
λ1
λe
f(λr
) f(λe
) λ2

λ1
λ2
λr
λ0

λ1
λ2
λr
λ0
λoc
f(λ1
) ≤ f(λr
) < f(λ2
) Outside contract

λ2
λ1
λ0
f(λoc
) ≤ f(λ2
) λ2
λoc

λ2
λ1
λr
λ0
λe

λ0
λ2
λ1

λ1
λ0
λ2

λ1
λ0
λ2
λic
λr
f(λr
) ≥ f(λ2
) Inside contract

λ2
λ0
λ1
Reﬂect Contract λ2
Shrink

λ2
λ1
λ0

McCormick benchmark function

利点と欠点
収束性や失敗する例，改良した手法などはConn et al. (2009); Audet and Hare (2017)
利点
• 局所解を見つける能力に優れる
• 部分的な並列化しかできない
• 悪質な局所解に陥る可能性がある
欠点

• 標準的な選択
係数の選択
0 < γs
< 1, −1 < δic
< 0 < δoc
< δr
< δe
γs
= 1
2 , δic
= −1
2 , δoc
= 1
2 , δr
= 1 and δe
= 2
γs
= 1 −
1
n
, δic
= −
3
4
+
1
2n
, δoc
=
3
4
−
1
2n
, δr
= 1, δe
= 1 +
2
n
where n ≥ 2
• 適応的な係数 (Gao and Han 2012)

ベイズ最適化
現在最も注目されているハイパパラメータ最適化手法（この例は最大化問題）

ベイズ最適化
• ベイズ最適化
• サロゲートをベイズ的に構築するSMBOの総称
• 　　　　　　を考えるP(fϵ(λ) | λ)
• サロゲートの種類
• ガウス過程 (GP)
• 最も標準的，有名な実装はSpearmint (Snoek et al. 2012)
• ランダムフォレスト
• SMAC (Hutter et al. 2011)
• Tree Parzen Estimator (TPE) (Bergstra et al. 2011)
• 実装はHyperopt
• 　　　　　　　　　　　を考える
• DNN (Snoek et al. 2015)
P(λ | fϵ(λ)), P(fϵ(λ))
• Sequential Model-based Optimization (SMBO)
• 反復的に関数評価とサロゲート（目的関数のモデル）の更新を繰り返す手法の総称
• ベイズ最適化や信頼領域法 (Ghanbari and Scheinberg 2017)

• ガウス分布
• スカラ，ベクトル上の分布
• ガウス過程
• 関数上の分布
ベイズ最適化
ガウス過程回帰に基づく方法
−1 −0.5 0 0.5 1
−3
−1.5
0
1.5
3
ガウス過程からのサンプル (Bishop, 2006)

• 目的関数が平均関数mと共分散関数kにより特徴づけされるGPに従うと仮定
• 事前平均関数としては　　　　　　とするのが標準的
ベイズ最適化
ガウス過程回帰に基づく方法
fϵ(λ) ∼ GP(m(λ), k(λ, λ′
))
m(λ) = 0

• カーネルはモデルの形を特徴づける
• 2点間の近さを抽象化したようなもの
• 適切なカーネルを選べばカテゴリ的・条件的パラメータも扱える
ベイズ最適化
共分散関数（カーネル）
Exponentiated Quadratic
Matérn 5/2
Kernels / Covariance functions (PyMC3)

• ARD squared exponential kernel
• ARD Matérn 5/2 kernel
• カーネルのハイパパラメータはデータから動的に決める
• 経験ベイズ (Bishop 2006)
• Markov Chain Monte Carlo (MCMC) (Snoek et al. 2012)
共分散関数（カーネル）の選択 (Snoek et al. 2012)
kse(λ, λ′
) = θ0
exp(−
1
2
r2
(λ, λ′
)),
r2
(λ, λ′
) =
D
d=1
(λd − λ′
d)2
/(θd
)2
k52(λ, λ′
) = θ0
(1 + 5r2(λ, λ′) +
5
3
r2
(x, λ′
)) exp(− 5r2(λ, λ′))
ベイズ最適化

ベイズ最適化
PRML 6章，カーネルのハイパパラメータの影響 (Bishop 2006)
(1.00, 4.00, 0.00, 0.00)
−1 −0.5 0 0.5 1
−3
−1.5
0
1.5
3
(9.00, 4.00, 0.00, 0.00)
−1 −0.5 0 0.5 1
−9
−4.5
0
4.5
9
(1.00, 64.00, 0.00, 0.00)
−1 −0.5 0 0.5 1
−3
−1.5
0
1.5
3
(1.00, 0.25, 0.00, 0.00)
−1 −0.5 0 0.5 1
−3
−1.5
0
1.5
3
(1.00, 4.00, 10.00, 0.00)
−1 −0.5 0 0.5 1
−9
−4.5
0
4.5
9
(1.00, 4.00, 0.00, 5.00)
−1 −0.5 0 0.5 1
−4
−2
0
2
4
k(λ, λ′
) = θ0
exp −
θ1
2
∥λ − λ′
∥2
+ θ2
+ θ3
λ⊤
λ′

ベイズ最適化
mとkを決めれば，過去の観測から未観測点の関数値を予測できる
ガウス分布の性質とSchurの公式から導出される (Rasmussen and Williams 2005; Bishop 2006)
データがないとまともに予測できないので，ランダムサーチなどでデータを集めて初期化しておく
P(fϵ(λt+1
) | λ1
, λ2
, . . . , λt+1
) = N(µt(λt+1
), σ2
t (λt+1
) + σ2
n),
µt(λt+1
) = k⊤
[K + σ2
nI]−1
[f(λ1
) f(λ2
) · · · f(λt
)]⊤
,
σ2
t (λt+1
) = k(λt+1
, λt+1
) − k⊤
[K + σ2
nI]−1
k
where
k = [k(λt+1
, λ1
) k(λt+1
, λ2
) · · · k(λt+1
, λt
)]⊤
,
K =
⎡
⎢
⎣
k(λ1
, λ1
) · · · k(λ1
, λt
)
...
...
...
k(λt
, λ1
) · · · k(λt
, λt
)
⎤
⎥
⎦ .

ベイズ最適化
観測点の近くでは分散小，離れると分散大（予測が不確かになる）
Brochu et al. (2010)

ベイズ最適化
次に評価する点の選び方
• 獲得関数と呼ばれる指標を最大化する点を次に評価する点として選ぶ
• 獲得関数は探索と知識利用のトレードオフを担う
• サロゲートの分散が大きい点を評価（探索）
• サロゲートの平均が小さい点を評価（知識利用）
aUCB(λ) = −µ(λ) + ξσ(λ)
• 例：GP-Upper Confidence Bound (GP-UCB) (Srinivas 2012) 
解きたいのは損失最小化問題なので-µ(λ)
• Probability of Improvement (PI)， Expected Improvement (EI)， Predictive
Entropy Search (PES) など色々あり，探索性能に大きく影響

ベイズ最適化
利点と欠点
利点
欠点
• 探索と知識利用のトレードオフを考慮した大域的な探索が可能
• 観測ノイズを考慮した探索が可能
• 共分散関数と獲得関数に対してセンシティブ
• 獲得関数の最適化が非凸大域的最適化
• ガウス過程回帰の場合，観測データ数の3乗オーダーの計算量
• 並列化が難しい

サロゲートの計算量削減
[K + σ2
nI]−1
• ガウス過程回帰のボトルネック：
• 近似計算 (Quiñonero-Candela et al. 2007; Titsias 2009)
• 計算量が相対的に少ないサロゲート
• ランダムフォレスト (Hutter et al. 2011)
• DNN (Snoek et al. 2015)

• Shah and Ghahramani (2015)
• Parallel Predictive Entropy Search
• Gonzalez et al. (2016)
• Local Penalization
• Kathuria et al. (2016)
• DPP sampling
• Kandasamy et al. (2018)
• 非同期並列Thompson sampling
• この他にも沢山
• Bergstra et al. (2011); Snoek et al. (2012); Contal et al.
(2013); Desautels et al. (2014); Daxberger and Low (2017);
Wang et al. (2017, 2018a); Rubin (2018)
ベイズ最適化の並列化

ベイズ最適化
（再掲）この例は最大化問題

その他の手法
適用事例報告がある主なもの
• CMA-ES
• Watanabe and Le Roux (2014); Loshchilov and Hutter (2016)
• Particle Swarm Optimization (PSO)
• Meissner et al. (2006); Lin et al. (2009); Lorenzo et al. (2017); Ye
(2017)
• Genetic Algorithm (GA)
• Leung et al. (2003); Young et al. (2015)
• Differential Evolution (DE)
• Fu et al. (2016a,b)
• 強化学習
• Hansen (2016); Bello et al. (2017); Dong et al. (2018)
• 勾配法 (※ブラックボックス最適化でない，連続パラメータのみ)
• Maclaurin et al. (2015); Luketina et al. (2016); Pedregosa (2016);
Franceschi (2017a,b,c, 2018a,b)

補助的なテクニック

• Domhan et al. (2015)
• 11種類の基底関数の重み付き線形和で学習曲線をモデル化
• ベイジアンネットワークを使用 (Klein et al. 2016)
• 過去のデータを活用 (Chandrashekaran and Lane 2017)
早期終了
エポック数に対する学習曲線を予測し，良い性能を達成する見込みのない学習を停止
fcomb =
k
i=1 wi
fi
(λ | θi) + ϵ, ϵ ∼ N(0, σ2
),
k
i=1 wi
= 1, ∀wi
, wi
≥ 0

• 異なる解像度でハイパパラメータ最適化後，functional ANOVAにより重要なパラメータを分析
• 多くの重要なパラメータとその値は解像度に依らず同じ (e.g. 学習率，バッチサイズ)
• 解像度の影響を受けるものは直後にmax-poolingを伴う畳込み層の数など（poolingすると
解像度が減るため）-> 高解像度化した際の適切な初期値は低解像度の場合から推測する
• 32×32で750回評価，64×64で500回評価，128×128で250回評価を行いハイパパラメータ最
適化しても精度は落ちず，128×128で1500回評価するよりも早く終わる
Increasing Image Sizes (IIS) (Hinz et al. 2018)
低解像度の画像を用いてハイパパラメータを最適化を始め，徐々に解像度を上げていく

• Successive Halving (Jamieson and Talwalkar 2015)
• 複数のハイパパラメータ設定候補を評価
• 下位候補を棄却，リソースを上位候補に多く割当て直して評価を継続
• 課題
• 候補数をnリソースをBとしたとき，nとB/nの適切なトレードオフは非自明
Hyperband (Li et al. 2016)
リソース (e.g. 学習時間，教師データ数) を適応的に割り当てる

Hyperband (Li et al. 2016)
提案手法：グリッドサーチのようにnとB/nのトレードオフを複数試す
ランダムサーチやベイズ最適化と組み合わせる (Bertrand et al. 2017; Falkner et al. 2018; Wang et al. 2018)

• 仮説：近いデータセットに対するハイパパラメータ最適化結果は似ている
• e.g. 学習データが増えたので，モデルを再学習する場合
• メタ特徴量
• ハンドメイド
• シンプルな特徴量（e.g. データ数，次元数，クラス数）
• 統計学や情報理論に基づく特徴（e.g. 分布の歪度）
• ランドマーク特徴（決定木などシンプルな機械学習モデルの性能）
• 深層学習 (Kim et al. 2017a,b)
• 近いデータセットのハイパパラメータ最適化結果で手法を初期化しウォームスタート
• PSO (Gomes et al. 2012)
• GA (Reif et al. 2012)
• ベイズ最適化 (Bardenet et al. 2013; Yogatama and Mann 2014; Feurer et al.
2014,2015,2018; Kim et al. 2017a,b)
メタ学習とウォームスタート

• Sampling (Arnold and Beyer 2006)
• 設定をn回評価し，平均値を取る
• Threshold Selection Equipped with Re-evaluation 
(Markon et al. 2001; Beielstein and Markon 2002; Jin and Branke 2005; Goh and Tan 2007; Gießen and Kötzing 2016)
• 目的関数値が最良値をしきい値以上改善した場合にsampling
• Value Suppression (Wang et al. 2018b)
• best-k設定が一定期間更新されないときにbest-k設定をsamplingし，関数値を修正
ノイズ対策

計算実験

CNNのハイパパラメータ最適化 (Ozaki et al. 2017)
以下を5つの手法でハイパパラメータ最適化する
Name Description Range
x1 Learning rate (= 0.1x1
) [1, 4]
x2 Momentum (= 1 − 0.1x2
) [0.5, 2]
x3 L2 weight decay [0.001, 0.01]
x∗
4 FC1 units [256, 1024]
Integer parameters are marked with ∗
.
データセット：MNIST
ネットワーク：LeNet，Batch-Normalized Maxout Network in Network
タスク：文字認識（10クラス分類）
) [0.5, 2]
) [0.5, 2]
x4 Dropout 1 [0.4, 0.6]
x5 Dropout 2 [0.4, 0.6]
x6 Conv 1 initialization deviation [0.01, 0.05]
x9 MMLP 1-1 initialization deviation [0.01, 0.05]
Batch-Normalized Mahout Network in Network
(Chang and Chen 2015)
MMLP (Maxout Multi Layer Perceptron)
LeNet (LeCun et al. 1998)
MNIST (LeCun and Cortes, 2010)

文字認識 (LeNet) 結果
Mean loss of all executions for each method per iteration (LeNet)

文字認識 (LeNet) 結果
Method mean loss min loss
Random search 0.005411 (±0.001413) 0.002781
Bayesian optimization 0.004217 (±0.002242) 0.000089
CMA-ES 0.000926 (±0.001420) 0.000047
Coordinate-search method 0.000052 (±0.000094) 0.000002
Nelder-Mead method 0.000029 (±0.000029) 0.000004
Method mean accuracy (%) accuracy with min loss (%)
Random search 98.98 (±0.08) 99.06
CMA-ES 99.20 (±0.08) 99.30

文字認識 (Batch-Normalized Mahout Network in Network) 結果
Mean loss of all executions for each method per iteration
(Batch-Normalized Maxout Network in Network)

文字認識 (Batch-Normalized Mahout Network in Network) 結果
Random search 0.045438 (±0.002142) 0.042694
CMA-ES 0.045248 (±0.002537) 0.042250
Random search 99.56 (±0.02) 99.58
CMA-ES 99.49 (±0.14) 99.59

データセット：Adience benchmark
ネットワーク：Gil and Tal (2015)
タスク：
(1)性別推定（2クラス分類）
(2)年齢層推定（8クラス分類）
) [1, 4]
) [0.5, 2]
x4 Dropout 1 [0.4, 0.6]
x5 Dropout 2 [0.4, 0.6]
x∗
6 FC 1 units [512, 1024]
x∗
7 FC 2 units [256, 512]
x11 FC 1 initialization deviation [0.001, 0.01]
x14 Conv 1 bias [0, 1]
x17 FC 1 bias [0, 1]
x18 FC 2 bias [0, 1]
x∗
19 Normalization 1 localsize (= 2x19 + 3) [0, 2]
x∗
20 Normalization 2 localsize (= 2x20 + 3) [0, 2]
x21 Normalization 1 alpha [0.0001, 0.0002]
x22 Normalization 2 alpha [0.0001, 0.0002]
x23 Normalization 1 beta [0.5, 0.95]
x24 Normalization 2 beta [0.5, 0.95]
Integer parameters are marked with ∗
.
Adience benchmark (Eran et al. 2014)

性別推定結果
Mean loss of all executions for each method per iteration
(gender classification CNN)

性別推定結果
Random search 0.001732 (±0.000540) 0.000984
CMA-ES 0.001804 (±0.000480) 0.001249
Random search 87.93 (±0.24) 88.21
CMA-ES 88.20 (±0.38) 88.55

年齢層推定結果
Mean loss of all executions for each method per iteration (age
classification CNN)

年齢層推定結果
Random search 0.035694 (±0.006958) 0.026563
CMA-ES 0.031244 (±0.010834) 0.016952
Random search 57.18 (±0.96) 57.90
CMA-ES 57.17 (±0.80) 58.19

局所探索法が良い結果を出せた理由はなにか
仮説：目的関数が多くの良質な局所解を持つ？ ->肯定的な結果（NMは異なる局所解に収束も，良い性能）
Parallel coordinates plot of the optimized hyperparameters of the gender classification CNN
• Olof (2018)による追試
• NMはCNNに対して確かに上手くいく，RNNに対しては微妙
• 平均的にはCNN/RNNいずれもTPEが良かった (ベイズ最適化でもGPの方は全然ダメだった)
• 実験を通して最良の結果を見つけたのはCNN/RNNいずれについてもNM
• CNNに共通するロス関数の性質がRNNでは成り立たないと指摘
• Snoek et al. (2012)らの実験ではGPを用いたベイズ最適化が，TPEより優れていたと報告

計算実験
様々な課題
• 基本的にどの論文も提案手法が一番という結論を主張する
• 提案手法は念入りにチューニングしてあるものと考える
• 再現性の問題
• 手法の実装（ソースコード公開），ランダム性及びチューニング
• 十分な計算リソースが手元にない
• モデルの評価結果を記録した表形式のデータセット (Klein et al. 2018)
• 実験設定がまちまち
• HPOLib (Eggensperger et al. 2013)
• 手法比較の方法
• 基準（e.g. 精度，AUC）と順位付けの手法 (Dewancker et al. 2016)
• 検証データへの過学習
• 実用においてはデータセットをtraining / validation / testの3つに分割して
おきチューニング後の性能がtestにおいて乖離し過ぎていないか確認

結論

結論
これから熱くなると予想するトピック
• 脱グリッドサーチ
• ランダムサーチをはじめとする他の手法を使用
• 状況に応じて利点と欠点を考慮
• 自分と近い実験設定の論文を参考
• 研究トピック
• 最適化手法
• 関連手法 (e.g. 重要なパラメータの特定，学習曲線予測)
• 再現性の担保やベンチマークの整備
• 応用 (AutoML e.g. CASH problem，モデルアーキテクチャ探索) 
Combined Algorithm Selection and Hyperparameter Optimization (CASH)

付録

Coordinate Search法
Maximal positive basisを活用した探索 (Conn et al., 2009; Audet and Hare, 2017)
D⊕
D⊕ = {±ei
: i = 1, 2, . . . , n}

λ0
∈ Λ(⊂ Rn
) δ0
∈ R with δ > 0 ϵ ∈ [0, ∞)
λ0

Pk
= {λk
+ δk
d : d ∈ D⊕} f(λ) < f(λk
) λ ∈ Pk
λ0
λ

λk+1
= λ δk+1
= δk
λ0
λ1

λ0
λ1
Pk
= {λk
+ δk
d : d ∈ D⊕} f(λ) < f(λk
) λ ∈ Pk

λk+1
= λ δk+1
= δk
λ0
λ1
λ2

λ0
λ1
λ2
λ3
Pk
= {λk
+ δk
: d ∈ D⊕} f(λ) < f(λk
) λ ∈ Pk

λk+1
= λk
δk+1
= δk
/2
λ0
λ1
λ2
λ3
=λ4

λ0
λ1
λ2
λ3
=λ4
=λ5
λk+1
= λk
δk+1
= δk
/2

δk+1
≤ ϵ
λ0
λ1
λ2
λ3
=λ4
=λ5
λ6

McCormick benchmark function

Pros and Cons
• 局所解を見つける能力
• 並列化は部分的にのみ可能
• 座標軸に沿い反復的に探索を行うため次元数に対して低スケーラブル
• 大域的な探索を行わないため，悪質な局所解に陥るリスク
収束性や失敗する例，改良した手法などはConn et al. (2009); Audet and Hare (2017)

探索空間の正規化
• ハイパパラメータ間のスケールが違いすぎると探索が非効率化
• 探索空間を予め単位超立方体に正規化して防止
• 実用上は無効値となる場合，適当に大きな損失値を返す

• 初期点の決め方
• 悪質な局所解に陥る問題に対して有効な方法
初期化の戦略
• 探索範囲の中心で初期化
• 数回のランダムサーチを行い，最も良かった点で初期化
• 異なる初期点からのマルチスタート

探索の戦略 (Audet and Hare 2017)
• Opportunistic polling
• 良いものが見つかった時点で採用
• 固定された順番
• 完全にランダム
• 直前に改善した方向からスタート
• Complete polling（スケールしない）
• 反復の度に全ての候補を評価して最良の値を選択

• Weighted Hamming distance kernel (Hutter et al. 2011)
ベイズ最適化
カテゴリ的パラメータを扱うためのカーネル
kmixed(λ, λ′
) = exp(rcont + rcat),
rcont(λ, λ′
) =
l∈Λcont
(−θl(λl − λ′
l)2
),
rcat(λ, λ′
) =
l∈Λcat
−θl(1 − δ(λl, λ′
l)).
where δ is the Kronecker delta function

• Conditional kernel (Lévesque et al. 2017)
• 条件的パラメータのための別のカーネル (Swersky et al. 2014)
ベイズ最適化
条件パラメータを扱うためのカーネル
kc(λ, λ′
) =
k(λ, λ′
) if λc = λ′
c ∀c ∈ C
0 otherwise
where C is the set of indices of active conditional hyperparameters

ベイズ最適化
具体的なガウス過程回帰の計算
µ1(λ2
) = k(λ2
, λ1
)f(λ1
)
µ2(λ3
) = k(λ3
, λ1
) k(λ3
, λ2
)
1 k(λ1
, λ2
)
k(λ2
, λ1
) 1
−1
f(λ1
)
f(λ2
)
=
1
1 − k(λ1, λ2)2
k(λ3
, λ1
) k(λ3
, λ2
)
1 −k(λ1
, λ2
)
−k(λ2
, λ1
) 1
f(λ1
)
f(λ2
)
=
1
1 − k(λ1, λ2)2
k(λ3
, λ1
) − k(λ2
, λ1
)k(λ3
, λ2
) k(λ3
, λ2
) − k(λ2
, λ1
)k(λ3
, λ1
)
f(λ1
)
f(λ2
)
=
1
1 − k(λ1, λ2)2
(k(λ3
, λ1
) − k(λ2
, λ1
)k(λ3
, λ2
))f(λ1
) + (k(λ3
, λ2
) − k(λ2
, λ1
)k(λ3
, λ1
))f(λ2
)
λ1
λ2
λ3
k(λ, λ′
) = exp −1
2 ∥λ − λ′
∥2
k(λ3
, λ1
) k(λ2
, λ1
) k(λ3
, λ2
)
f(λ1
) f(λ3
)

• Probability of Improvement (PI) (Kushner 1964)
• Expected Improvement (EI) (Mockus et al. 1978)
• 改善量を加味，よく使われる
• Predictive Entropy Search (PES) (Henrández-
Lobato et al. 2014)
• 情報量を最大化
ベイズ最適化
獲得関数の補足
aPI = P(f(λ) ≤ f(λ∗
) − ξ)
= φ
f(λ∗
) − ξ − µ(λ)
σ(λ)
λ∗
Φ ξ
PIの可視化 (Brochu et al. 2010)
※この図は最大化問題のため左式とは少し異なる

ベイズ最適化
獲得関数の最大化手法
• 獲得関数最大化自体が非凸大域的最適化
• 最適化手法
• Brochu (2010)
• DIRECT (Jones et al. 1993)
• Bergstra (2011)
• Estimation of Distribution (EDA) (Larraanaga and
Lozano 2011)
• Covariance Matrix Adaptation Evolution Strategy (CMA-
ES) (Hansen 2006)

• 多腕バンディット
• 複数の候補から最も良いものを逐次的に探す
• スロットマシンの累積報酬最大化問題
• ハイパパラメータ最適化は連続 / 無限腕バンディットや最適腕識別として考えられる
• ベイズ最適化は平均ケースを考えている
• バンディットは最悪ケースのリグレット最小化を考えるのが一般的
• 関連研究
• Srinivas et al. (2010, 2012); Bull (2011); Kandasamy et al. (2015,
2017)など
ベイズ最適化と多腕バンディットの繋がり

参考文献

Christopher M. Bishop. Pattern recognition and machine learning. Information science and statistics. Springer, New York, 2006. ISBN
978-0-387-31073-2.

Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], December 2014. URL http://arxiv.org/abs/
1412.6980. arXiv:1412.6980.

He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition.
2016.

Frank Hutter, Jörg Lücke, and Lars Schmidt-Thieme. Beyond Manual Tuning of Hyperparameters. KI - Künstliche Intelligenz, 29(4):329–337,
November 2015. ISSN 0933-1875, 1610-1987. doi: 10.1007/s13218-015-0381-0. URL http://link.springer.com/10.1007/s13218-015-0381-0.

Stefan Falkner, Aaron Klein, and Frank Hutter. Practical hyperparameter optimization for

deep learning, 2018a. URL https://openreview.net/forum?id=HJMudFkDf.

Jesse Dodge, Kevin Jamieson, and Noah A. Smith. Open Loop Hyperparameter Optimization and Determinantal Point Processes. arXiv:1706.01566
[cs, stat], June 2017. URL http://arxiv.org/abs/1706.01566. arXiv: 1706.01566.

Jaak Simm. Survey of hyperparameter optimization in NIPS2014, 2015. URL https://github.com/jaak-s/nips2014-survey.

Carl Staelin. Parameter selection for support vector machines. 2002. URL http://www.hpl.hp.com/techreports/2002/HPL-2002-354R1.html.

James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13:281–305, February 2012. ISSN
1532-4435. URL http://dl.acm.org/citation.cfm?id=2188385.2188395.

Frank Hutter, Holger Hoos, and Kevin Leyton-Brown. An eﬃcient approach for assessing hyperparameter importance. In Proceedings of the 31st
International Conference on International Conference on Machine Learning - Volume 32, ICML’14, pages I—754–I—762. JMLR.org, 2014. URL
http://dl.acm.org/citation.cfm?id=3044805.3044891.
参考文献

Chris Fawcett and Holger H. Hoos. Analysing differences between algorithm configurations through ablation. Journal of Heuristics, 22(4):431–458,
Aug 2016. ISSN 1572-9397. doi:10.1007/s10732-014-9275-9. URL https://doi.org/10.1007/s10732-014-9275-9.

Andre Biedenkapp, Marius Lindauer, Katharina Eggensperger, Frank Hutter, ChrisFawcett, and Holger Hoos. Efficient parameter importance analysis
via ablation with surrogates, 2017. URL https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14750.

Jan N van Rijn and Frank Hutter. An empirical study of hyperparameter importance across datasets. In AutoML@PKDD/ECML, 2017a.

Jan N van Rijn and Frank Hutter. Hyperparameter importance across datasets. arXiv preprint arXiv:1710.04725, 2017b.

J. A. Nelder and R. Mead. A Simplex Method for Function Minimization. The Computer Journal, 7(4):308–313, January 1965. ISSN 0010-4620,
1460-2067. doi: 10.1093/comjnl/7.4.308. URL https://academic.oup.com/comjnl/article-lookup/doi/10.1093/comjnl/7.4.308.

Andrew R. Conn, Katya Scheinberg, and Luis N. Vicente. Introduction to Derivative-Free Optimization. Society for Industrial and Applied Mathematics,
January 2009. ISBN 978-0-89871-668-9 978-0-89871-876-8. doi: 10.1137/1.9780898718768. URL http://epubs.siam.org/doi/book/
10.1137/1.9780898718768.

Charles Audet and Warren Hare. Derivative-Free and Blackbox Optimization. Springer Series in Operations Research and Financial Engineering.
Springer International Publishing, Cham, 2017. ISBN 978-3-319-68912-8 978-3-319-68913-5. doi: 10.1007/978-3-319-68913-5. URL http://
link.springer.com/10.1007/978-3-319-68913-5.

Fuchang Gao and Lixing Han. Implementing the Nelder-Mead simplex algorithm with adaptive parameters. Computational Optimization and
Applications, 51(1):259–277, January 2012. ISSN 0926-6003, 1573-2894. doi: 10.1007/s10589-010-9329-3. URL http://link.springer.com/10.1007/
s10589-010-9329-3.

Hiva Ghanbari and Katya Scheinberg. Black-Box Optimization in Machine Learning with Trust Region Based Derivative Free Algorithm. arXiv:
1703.06925 [cs], March 2017. URL http://arxiv.org/abs/1703.06925. arXiv: 1703.06925.

Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural
information processing systems, pages 2951–2959, 2012.
参考文献

Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential Model-Based Optimization for General Algorithm Configuration. In Carlos A. Coello Coello, editor, Learning and
Intelligent Optimization, pages 507–523, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg. ISBN 978-3-642-25566-3.

James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyperparameter optimization. In Proceedings of the 24th International Conference on Neural
Information Processing Systems, NIPS’11, pages 2546–2554, USA, 2011. Curran Associates Inc. ISBN 978-1-61839-599-3. URL http://dl.acm.org/citation.cfm?id=2986459.2986743.

Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Prabhat Prabhat, and Ryan P. Adams. Scalable bayesian
optimization using deep neural networks. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pages 2171–
2180. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045349.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005. ISBN
026218253X.32

Eric Brochu, Vlad M. Cora, and Nando de Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical
Reinforcement Learning. arXiv:1012.2599 [cs], December 2010. URL http://arxiv.org/abs/1012.2599. arXiv: 1012.2599.

Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias W. Seeger. Information-theoretic regret bounds for gaussian process optimization in the bandit setting. IEEE
Transactions on Information Theory, 58:3250–3265, 2012.

J. Quiñonero-Candela, CE. Rasmussen, and CKI. Williams. Approximation Methods for Gaussian Process Regression, pages 203–223. Neural Information Processing. MIT Press,
Cambridge, MA, USA, September 2007.

Michalis Titsias. Variational learning of inducing variables in sparse gaussian processes. In

David van Dyk and Max Welling, editors, Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning
Research, pages 567–574, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, 16–18 Apr 2009. PMLR. URL http://proceedings.mlr.press/v5/titsias09a.html.

Amar Shah and Zoubin Ghahramani. Parallel predictive entropy search for batch global optimization of expensive objective functions. In Proceedings of the 28th International
Conference on Neural Information Processing Systems - Volume 2, NIPS’15, pages 3330–3338, Cambridge, MA, USA, 2015. MIT Press. URL http://dl.acm.org/citation.cfm?
id=2969442.2969611.

Javier Gonzalez, Zhenwen Dai, Philipp Hennig, and Neil Lawrence. Batch bayesian optimization via local penalization. In Arthur Gretton and Christian C. Robert, editors, Proceedings
of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 648–657, Cadiz, Spain, 09–11 May 2016.
PMLR. URL http://proceedings.mlr.press/v51/gonzalez16a.html.
参考文献

Tarun Kathuria, Amit Deshpande, and Pushmeet Kohli. Batched Gaussian Process Bandit Optimization via Determinantal Point Processes. arXiv:1611.04088 [cs],
November 2016. URL http://arxiv.org/abs/1611.04088. arXiv: 1611.04088.

Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, and Barnabas Poczos. Parallelised bayesian optimisation via thompson sampling. In Amos Storkey
and Fernando

Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning
Research, pages 133–142, Playa Blanca, Lanzarote, Canary Islands, 09–11 Apr 2018. PMLR. URL http://proceedings.mlr.press/v84/kandasamy18a.html.

Emile Contal, David Buffoni, Alexandre Robicquet, and Nicolas Vayatis. Parallel gaussian process optimization with upper confidence bound and pure exploration. In
Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases - Volume 8188, ECML PKDD 2013, pages 225–240, New York,
NY, USA, 2013. Springer-Verlag New York, Inc. ISBN 978-3-642-40987-5. doi: 10.1007/978-3-642-40988-2_15. URL http://dx.doi.org/10.1007/978-3-642-40988-2_15.

Thomas Desautels, Andreas Krause, and Joel W. Burdick. Parallelizing Exploration-Exploitation Tradeoffs in Gaussian Process Bandit Optimization. Journal of Machine
Learning Research, 15:4053–4103, 2014. URL http://jmlr.org/papers/v15/desautels14a.html.

Erik A. Daxberger and Bryan Kian Hsiang Low. Distributed batch Gaussian process optimization. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th
International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 951–960, International Convention Centre, Sydney,
Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/daxberger17a.html.

Zi Wang, Chengtao Li, Stefanie Jegelka, and Pushmeet Kohli. Batched high-dimensional Bayesian optimization via structural kernel learning. In Doina Precup and Yee
Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3656–
3664, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/wang17h.html.

Zi Wang, Clement Gehring, Pushmeet Kohli, and Stefanie Jegelka. Batched large-scalebayesian optimization in high-dimensional spaces. In Amos Storkey and
Fernando Perez-Cruz, editors, Proceedings of the Twenty-First nternational Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine
Learning Research, pages 745–754, Playa Blanca, Lanzarote, Canary Islands, 09–11 Apr 2018b. PMLR. URL http://proceedings.mlr.press/v84/wang18c.html.

Ran Rubin. New Heuristics for Parallel and Scalable Bayesian Optimization. arXiv:1807.00373 [cs, stat], July 2018. URL http://arxiv.org/abs/1807.00373. arXiv:
1807.00373.

Watanabe, Shinji, and Jonathan Le Roux. Black box optimization for automatic speech recognition. 2014.

Loshchilov, Ilya, and Frank Hutter. CMA-ES for Hyperparameter Optimization of Deep Neural Networks. 2016.
参考文献

Michael Meissner, Michael Schmuker, and Gisbert Schneider. Optimized Particle Swarm Optimization (OPSO) and its application to artiﬁcial neural network
training. BMC Bioinformatics, 7(1):125, March 2006. ISSN 1471-2105. doi: 10.1186/1471-2105-7-125. URL https://doi.org/10.1186/1471-2105-7-125.

Shih-Wei Lin, Shih-Chieh Chen, Wen-Jie Wu, and Chih-Hsien Chen. Parameter determination and feature selection for back-propagation network by
particle swarm optimization. Knowledge and Information Systems, 21(2):249–266, November 2009. ISSN 0219-3116. doi: 10.1007/s10115-009-0242-y.
URL https://doi.org/10.1007/s10115-009-0242-y.

Pablo Ribalta Lorenzo, Jakub Nalepa, Luciano Sanchez Ramos, and José Ranilla Pastor. Hyper-parameter selection in deep neural networks using parallel
particle swarm optimization. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, pages 1864–1871. ACM, 2017.

Fei Ye. Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-
dimensional data. PLOS ONE, 12 (12):1–36, 2017. doi: 10.1371/journal.pone.0188746. URL https://doi.org/10.1371/journal.pone.0188746.

F. H. F. Leung, H. K. Lam, S. H. Ling, and P. K. S. Tam. Tuning of the structure and parameters of a neural network using an improved genetic algorithm.
Neural Networks, IEEE Transactions on, 14(1):79–88, February 2003. doi: 10.1109/tnn.2002.804317. URL http://dx.doi.org/10.1109/tnn.2002.804317.

Steven R Young, Derek C Rose, Thomas P Karnowski, Seung-Hwan Lim, and Robert M Patton. Optimizing deep learning hyper-parameters through an
evolutionary algorithm. In Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments, page 4. ACM, 2015.

Wei Fu, Tim Menzies, and Xipeng Shen. Tuning for software analytics: Is it really necessary? Information and Software Technology, 76:135 – 146, 2016a.
ISSN 0950-5849. doi: https://doi.org/10.1016/j.infsof.2016.04.017. URL http://www.sciencedirect.com/science/article/pii/S0950584916300738.

Wei Fu, Vivek Nair, and Tim Menzies. Why is Diﬀerential Evolution Better than Grid Search for Tuning Defect Predictors? arXiv:1609.02613 [cs, stat],
September 2016b. URL http://arxiv.org/abs/1609.02613. arXiv: 1609.02613.

Samantha Hansen. Using deep q-learning to control optimization hyperparameters. arXiv preprint arXiv:1602.04062, 2016.

Irwan Bello, Barret Zoph, Vijay Vasudevan, and Quoc V Le. Neural optimizer search with reinforcement learning. In International Conference on Machine
Learning, pages 459–468, 2017.
参考文献

Xingping Dong, Jianbing Shen, Wenguan Wang, Yu Liu, Ling Shao, and Fatih Porikli. Hyperparameter optimization for tracking with continuous deep q-learning. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pages 518–527, 2018.

Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32Nd International
Conference on International Conference on Machine Learning - Volume 37, ICML’15, pages 2113–2122. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045343.

Jelena Luketina, Mathias Berglund, Klaus Greﬀ, and Tapani Raiko. Scalable gradientbased tuning of continuous regularization hyperparameters. In Proceedings of the 33rd
International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 2952–2960. JMLR.org, 2016. URL http://dl.acm.org/citation.cfm?
id=3045390.3045701.

Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In Proceedings of the 33rd International Conference on International Conference on Machine Learning -
Volume 48, ICML’16, pages 737–746. JMLR.org, 2016. URL http://dl.acm.org/citation.cfm?id=3045390.3045469.

Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. On hyperparameter optimization in learning systems. In Proceedings of the 5th International Conference
on Learning Representations (Workshop Track), 2017a.

Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. A Bridge Between Hyperparameter Optimization and Larning-to-learn. arXiv:1712.06283 [cs, stat],
December 2017b. URL http://arxiv.org/abs/1712.06283. arXiv: 1712.06283.

Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. In Doina Precup and Yee Whye Teh,
editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1165–1173, International
ConventionCentre, Sydney, Australia, 06–11 Aug 2017c. PMLR. URL http://proceedings.mlr. press/v70/franceschi17a.html.

Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In Jennifer Dy
and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1563–1572,
Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018a. PMLR. URL http://proceedings.mlr.press/v80/franceschi18a.html.

Luca Franceschi, Riccardo Grazzi, Massimiliano Pontil, Saverio Salzo, and Paolo Frasconi. Far-ho: A bilevel programming package for hyperparameter optimization and
metalearning. CoRR, abs/1806.04941, 2018b. URL http://arxiv.org/abs/1806.04941.

Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In
Proceedings of the 24th International Conference on Artiﬁcial Intelligence, IJCAI’15, pages 3460–3468. AAAI Press, 2015. ISBN 978-1-57735-738-4. URL http://dl.acm.org/
citation.cfm?id=2832581.2832731.
参考文献

Aaron Klein, Stefan Falkner, Jost Tobias Springenberg, and Frank Hutter. Learning curve prediction with bayesian neural networks. 2016.

Akshay Chandrashekaran and Ian R. Lane. Speeding up Hyper-parameter Optimization by Extrapolation of Learning Curves Using Previous Builds.
In Michelangelo Ceci, Jaakko Hollmén, Ljupčo Todorovski, Celine Vens, and Sašo Džeroski, editors, Machine Learning and Knowledge Discovery in
Databases, pages 477–492, Cham, 2017. Springer International Publishing. ISBN 978-3-319-71249-9.

Tobias Hinz, Nicolás Navarro-Guerrero, Sven Magg, and Stefan Wermter. Speeding up the hyperparameter optimization of deep convolutional neural
networks. International Journal of Computational Intelligence and Applications, page 1850008, 2018.

Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A Novel Bandit-Based Approach to
Hyperparameter Optimization. Journal of Machine Learning Research, 18(185):1–52, 2018. URL http://jmlr.org/papers/v18/16-558.html.

Hadrien Bertrand, Roberto Ardon, Matthieu Perrot, and Isabelle Bloch. Hyperparameter optimization of deep neural networks : Combining
hyperband with bayesian model selection. 2017.

Stefan Falkner, Aaron Klein, and Frank Hutter. Bohb: Robust and eﬃcient hyperparameter optimization at scale. In International Conference on
Machine Learning, pages 1436–1445, 2018b.

Jiazhuo Wang, Jason Xu, and Xuejun Wang. Combination of Hyperband and Bayesian Optimization for Hyperparameter Optimization in Deep
Learning. arXiv:1801.01596 [cs], January 2018a. URL http://arxiv.org/abs/1801.01596. arXiv: 1801.01596.

Jungtaek Kim, Saehoon Kim, and Seungjin Choi. Learning to Warm-Start Bayesian Hyperparameter Optimization. ArXiv e-prints, October 2017.

Jungtaek Kim, Saehoon Kim, and Seungjin Choi. Learning to transfer initializations for bayesian hyperparameter optimization. arXiv preprint arXiv:
1710.06219, 2017.

T Gomes, P Miranda, R Prudêncio, C Soares, and A Carvalho. Combining meta-learning and optimization algorithms for parameter selection. In 5 th
PLANNING TO LEARN WORKSHOP WS28 AT ECAI 2012, page 6. 2012.
参考文献

Matthias Reif, Faisal Shafait, and Andreas Dengel. Meta-learning for evolutionary parameter optimization of classifiers. Machine learning, 87(3):357–
380, 2012.

Rémi Bardenet, Mátyás Brendel, Balázs Kégl, and Michele Sebag. Collaborative hyperparameter tuning. In International Conference on Machine
Learning, pages 199–207, 2013.

Dani Yogatama and Gideon Mann. Efficient transfer learning method for automatic hyperparameter tuning. In Artificial Intelligence and Statistics,
pages 1077–1085, 2014.

Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Using meta-learning to initialize bayesian optimization of hyperparameters. In
Proceedings of the 2014 International Conference on Meta-learning and Algorithm Selection-Volume 1201, pages 3–10. 2014.

Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Initializing bayesian hyperparameter optimization via meta-learning. In AAAI, pages
1128–1135, 2015.

Matthias Feurer, Benjamin Letham, and Eytan Bakshy. Scalable meta-learning for bayesian optimization. arXiv preprint arXiv:1802.02219, 2018.

Dirk V Arnold and H-G Beyer. A general noise model and its effects on evolution strategy performance. IEEE Transactions on Evolutionary
Computation, 10(4):380–391, 2006.

Sandor Markon, Dirk V Arnold, Thomas Back, Thomas Beielstein, and H-G Beyer. Thresholding-a selection operator for noisy es. In Evolutionary
Computation, 2001. Proceedings of the 2001 Congress on, volume 1, pages 465–472. IEEE, 2001.

Thomas Beielstein and Sandor Markon. Threshold selection, hypothesis tests, and doe methods. In Evolutionary Computation, 2002. CEC’02.
Proceedings of the 2002 Congress on, volume 1, pages 777–782. IEEE, 2002.

Yaochu Jin and Jürgen Branke. Evolutionary optimization in uncertain environments-a survey. IEEE Transactions on evolutionary computation, 9(3):
303–317, 2005.
参考文献

Chi Keong Goh and Kay Chen Tan. An investigation on noisy environments in evolutionary multiobjective optimization. IEEE Transactions on
Evolutionary Computation, 11(3):354–381, 2007.

Christian Gießen and Timo Kötzing. Robustness of populations in stochastic environments. Algorithmica, 75(3):462–489, 2016.

Hong Wang, Hong Qian, and Yang Yu. Noisy derivative-free optimization with value suppression. 2018b.

Yoshihiko Ozaki, Masaki Yano, and Masaki Onishi. Effective hyperparameter optimization using Nelder-Mead method in deep learning. IPSJ
Transactions on Computer Vision and Applications, 9(1), December 2017. ISSN 1882-6695. doi: 10.1186/s41074-017-0030-7. URL https://
ipsjcva.springeropen.com/articles/10.1186/s41074-017-0030-7.

LeCun Y, Cortes C MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/. 2010.

LeCun Y, Bottou L, Bengio Y, Patrick H Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324, 1998.

Chang JR, Chen YS Batch-Normalized Maxout Network in Network. In: Proceedings of the 33rd International Conference on Machine Learning.
2015. https://arxiv.org/abs/1511.02583.

Eran E, Roee E, Tal E Age and gender estimation of unfiltered faces. IEEE Trans Inf Forensic Secur 9(12):2170–2179, 2014.

Gil L, Tal H Age and gender classification using convolutional neural networks. Computer Vision and Pattern Recognition Workshops (CVPRW).
2015. http://ieeexplore.ieee.org/document/7301352.

Skogby Steinholtz Olof. A comparative study of black-box optimization algorithms for tuning of hyper-parameters in deep neural networks, 2018.
参考文献

Aaron Klein, Eric Christiansen, Kevin Murphy, and Frank Hutter. Towards reproducible neural architecture and hyperparameter search. 2018.

Katharina Eggensperger, Matthias Feurer, Frank Hutter, James Bergstra, Jasper Snoek, Holger Hoos, and Kevin Leyton-Brown. Towards an empirical
foundation for assessing bayesian optimization of hyperparameters. In NIPS workshop on Bayesian Optimization in Theory and Practice, volume 10,
page 3, 2013.

Ian Dewancker, Michael McCourt, Scott Clark, Patrick Hayes, Alexandra Johnson, and George Ke. A strategy for ranking optimization methods using
multiple criteria. In Workshop on Automatic Machine Learning, pages 11–20, 2016.

Julien-Charles Lévesque, Audrey Durand, Christian Gagné, and Robert Sabourin. Bayesian optimization for conditional hyperparameter spaces. In
Proc. of the International Joint Conference on Neural Networks (IJCNN). IEEE, 05 2017.

Kevin Swersky, David Duvenaud, Jasper Snoek, Frank Hutter, and Michael A Osborne. Raiders of the lost architecture: Kernels for bayesian
optimization in conditional parameter spaces. arXiv preprint arXiv:1409.4011, 2014a.

Harold J. Kushner. A New Method of Locating the Maximum Point of an Arbitrary Multipeak Curve in the Presence of Noise. Journal of Basic
Engineering, 86(1):97+, 1964. ISSN 00219223. doi: 10.1115/1.3653121. URL http://dx.doi.org/10.1115/1.3653121.

Jonas Mockus, Vytautas Tiesis, and Antanas Zilinskas. The application of bayesian methods for seeking the extremum. Towards Global Optimization,
1978.

José Miguel Henrández-Lobato, Matthew W. Hoﬀman, and Zoubin Ghahramani. Predictive entropy search for eﬃcient global optimization of black-
box functions. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS’14, pages 918–926,
Cambridge, MA, USA, 2014. MIT Press. URL http://dl.acm.org/citation.cfm?id=2968826.2968929.

D. R. Jones, C. D. Perttunen, and B. E. Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and
Applications, 79(1):157–181, October 1993. ISSN 1573-2878. doi: 10.1007/BF00941892. URL https://doi.org/10.1007/BF00941892.

Pedro Larraanaga and Jose A. Lozano. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer Academic Publishers,
Norwell, MA, USA, 2001. ISBN 0792374665.
参考文献

Nikolaus Hansen. The CMA Evolution Strategy: A Comparing Review. In Jose A. Lozano, Pedro Larrañaga, Iñaki Inza, and Endika Bengoetxea,
editors, Towards a New Evolutionary Computation: Advances in the Estimation of Distribution Algorithms, pages 75–102. Springer Berlin Heidelberg,
Berlin, Heidelberg, 2006. ISBN 978-3-540-32494-2. doi: 10.1007/3-540-32494-1_4. URL https://doi.org/10.1007/3-540-32494-1_4.

Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and
experimental design. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, pages 1015–
1022, USA, 2010. Omnipress. ISBN 978-1-60558-907-7. URL http://dl.acm.org/citation.cfm?id=3104322.3104451.

Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias W. Seeger. Information-theoretic regret bounds for gaussian process
optimization in the bandit setting. IEEE Transactions on Information Theory, 58:3250–3265, 2012.

Adam D. Bull. Convergence rates of eﬃcient global optimization algorithms. J. Mach. Learn. Res., 12:2879–2904, November 2011. ISSN 1532-4435.
URL http://dl.acm.org/citation.cfm?id=1953048.2078198.

Kirthevasan Kandasamy, Jeﬀ Schneider, and Barnabás Póczos. High dimensional bayesian optimisation and bandits via additive models. In
Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pages 295–304.
JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045151.

Kirthevasan Kandasamy. Tuning hyper-parameters without grad students: Scaling up bandit optimisation. 2017.
参考文献

Machine Learning Hyperparameter Optimization

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Machine Learning Hyperparameter Optimization

Ähnlich wie Machine Learning Hyperparameter Optimization (20)

Mehr von gree_tech

Mehr von gree_tech (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Machine Learning Hyperparameter Optimization