Chap 8. Optimization for training deep models
(Goodfellow et al. (2016), Deep Learning, MIT Press)
Presenter : Young-Geun Choi
Department of Statistics
Seoul National University
Apr 1, 2016
8.1 How learning differs from pure optimization
The cost function is estimated by the empirical risk – it is (1) indirectly evaluated and
(2) separable.
(1) Cost function: J(θ) = E_{(x,y)∼p̂_data} L(f(x; θ), y), which (tries to) estimate the true risk
J*(θ) = E_{(x,y)∼p_data} L(f(x; θ), y).
(2) With N the number of training examples,
E_{(x,y)∼p̂_data} L(f(x; θ), y) = (1/N) ∑_{i=1}^{N} L(f(x^(i); θ), y^(i)).
8.1.3 Batch and minibatch algorithms
We may use gradient descent, θ_new = θ_old − ϵ ▽θ J*(θ_old), for optimization.
Is using all samples good for estimating J*(θ) or ▽θ J*(θ)?
The standard error of the mean is approximately σ/√N (CLT):
10,000 times more examples give only 100 times more accuracy.
There may also be redundancy in the training set (many samples can be very similar).
→ Use subsets (minibatches) of size m to estimate ▽θ J*(θ)
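As a rough numerical illustration of the σ/√N behaviour (not from the slides; the per-example "gradients" are simulated as i.i.d. Gaussian noise with σ = 1):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0   # standard deviation of a single per-example "gradient"
reps = 200    # independent mean estimates per sample size

for n in (100, 1_000_000):   # 10,000x more examples
    means = np.array([rng.normal(0.0, sigma, size=n).mean() for _ in range(reps)])
    print(f"n = {n:>9,}: empirical SE of the mean = {means.std():.5f} "
          f"(theory sigma/sqrt(n) = {sigma / np.sqrt(n):.5f})")
```

The estimate with n = 1,000,000 is only about 100 times more accurate than with n = 100, which is why small minibatches are attractive.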
Caution (confusing terminologies)
Batch or deterministic gradient methods : processing with all training
examples
e.g. “Batch gradient descent” : use full training set
e.g. “minibatch (stochastic) gradient descent” : stochastic gradient descent
(Sec 8.3-)
Batch size : the size of a minibatch
8.1.3 Batch and minibatch algorithms
Batch sizes (minibatch sizes) are generally driven by:
Minimum batch size: below this there is no reduction in the time to
process a minibatch, due to the multicore architecture
Typically all examples in a (mini)batch are processed in parallel, so the
amount of memory required scales with the batch size
When using a GPU, powers of two m = 2^p work well (e.g. 32, 256, 4096)
Very large dataset – rather than randomly selecting examples each time, shuffle once
in advance and then select them consecutively
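A minimal sketch of that "shuffle once, then take consecutive slices" pipeline, assuming the data fit in memory as NumPy arrays (function and variable names are illustrative):

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Shuffle once, then yield consecutive slices as minibatches."""
    perm = rng.permutation(len(X))
    X, y = X[perm], y[perm]
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

# Toy usage: 1,000 examples, batch size m = 256 (a power of two, GPU-friendly).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 20)), rng.integers(0, 2, size=1000)
for xb, yb in minibatches(X, y, batch_size=256, rng=rng):
    pass  # compute the minibatch gradient on (xb, yb) here
print("last (smaller) batch:", xb.shape)
```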
8.2 Challenges in neural network optimization
Deep neural network cost function
Composition of many functions
Many parameters
Non-identifiability
8.2.1 Ill-conditioning
The Hessian matrix can be ill-conditioned (very large maximum eigenvalue, very
small minimum eigenvalue)
Even a very small step can then increase the cost function
8.2.2 Local minima
Any deep model essentially has an extremely large number of local minima
due to weight-space symmetry.
Experts suspect that most local minima have a low cost function value – it is
not important to find the global minimum
8.2 Challenges in neural network optimization
8.2.3 Plateaus, saddle points and other flat regions
Zero-gradient points are more likely to be saddle points in higher dimensions
(more parameters).
Can optimization algorithms (introduced later) escape saddle points? – check
https://github.com/cs231n/cs231n.github.io/blob/master/neural-networks-3.md#sgd
Flat regions (zero gradient and zero Hessian) are also problematic
8.2.4 Cliffs, exploding gradients
Very deep neural networks and recurrent neural networks often contain sharp
nonlinearities.
A very high derivative at some point can catapult the parameters very far,
possibly spoiling most of the optimization work that had been done.
8.2 Challenges in neural network optimization
8.2.5 Long-term dependencies
In recurrent neural networks (Chap 10), the same matrix is multiplied repeatedly;
vanishing and exploding gradient problems can occur
8.2.6 Inexact gradients
We use sampled minibatches – this can lead to a noisy or even biased gradient estimate
8.2.7 Poor correspondence between local and global structure
Can we make a non-local move? Rather than attempting this, find good initial points.
8.2 Challenges in neural network optimization
8.2.8 Theoretical limits of optimization
Some theory assumes units with discrete outputs (but practical NN units output smoothly
increasing values, so local search is feasible)
Some theory identifies problem classes that are intractable (but it is difficult to tell whether
a given problem is in such a class)
Some theory shows that finding a solution for a network of a given size is intractable (but a
larger network can find a solution)
Developing more realistic bounds on the performance of optimization
algorithms therefore remains an important goal for machine learning research.
8.3 Basic algorithms: 8.3.1 Stochastic gradient descent
(SGD)
Algorithm SGD update at training iteration k
Require: Learning rate ϵ = ϵk
Require: Initial parameter θ
1: while stopping criterion not met do
2:   Sample a minibatch {x^(1), . . . , x^(m)} and the corresponding y^(i)
3:   Compute gradient estimate: g ← (1/m) ▽θ ∑_i L(f(x^(i); θ), y^(i))
4:   Apply update: θ ← θ − ϵg
5: end while
The learning rate and the initial point are hyperparameters.
The learning rate ϵk depends on the iteration index k.
SGD convergence is guaranteed when ∑_{k=1}^{∞} ϵk = ∞ and ∑_{k=1}^{∞} ϵk² < ∞
In practice, decay linearly until iteration τ:
ϵk = (1 − k/τ) ϵ0 + (k/τ) ϵτ
After iteration τ, leave ϵk constant. Usually ϵτ is 1% of ϵ0.
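A minimal runnable sketch of the SGD update with the linear decay schedule above; the least-squares toy problem and all names are illustrative, not from the slides:

```python
import numpy as np

def lr_schedule(k, eps0, tau):
    """eps_k = (1 - k/tau)*eps0 + (k/tau)*eps_tau for k <= tau, constant afterwards."""
    eps_tau = 0.01 * eps0              # eps_tau is about 1% of eps0
    a = min(k / tau, 1.0)
    return (1 - a) * eps0 + a * eps_tau

def sgd(grad_fn, theta, X, y, eps0=0.1, tau=1000, batch_size=32, n_steps=2000, seed=0):
    """SGD: sample a minibatch, estimate the gradient, step with eps_k."""
    rng = np.random.default_rng(seed)
    for k in range(n_steps):
        idx = rng.integers(0, len(X), size=batch_size)   # sample a minibatch
        g = grad_fn(theta, X[idx], y[idx])               # g ~ (1/m) sum_i grad L_i
        theta = theta - lr_schedule(k, eps0, tau) * g    # theta <- theta - eps_k * g
    return theta

# Toy problem: least squares, grad of (1/2m)||X theta - y||^2 is X^T (X theta - y) / m.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
theta_true = np.arange(1.0, 6.0)
y = X @ theta_true + 0.1 * rng.normal(size=500)
grad = lambda th, xb, yb: xb.T @ (xb @ th - yb) / len(xb)
print(sgd(grad, np.zeros(5), X, y).round(2))   # should be close to [1 2 3 4 5]
```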
8.3.1 Stochastic gradient descent (SGD)
τ may be set to the number of iterations required to make a few hundred
passes through the training set.
How to set ϵ0? Monitor learning curves.
Check the plots in https://github.com/cs231n/cs231n.github.io/blob/master/neural-networks-3.md#sgd
The optimal ϵ0 (in terms of total training time and the final cost value) is higher
than the learning rate that yields the best performance after the first 100 or so iterations.
SGD computation time (per update) does not grow with the number of
training examples.
Excess error (J(θ) − minθ J(θ)) of SGD after k iterations:
▶ O(1/√k) for convex problems
▶ O(1/k) for strongly convex problems
▶ Cramér-Rao bound: generalization error cannot decrease faster than O(1/k)
▶ Algorithms introduced later are good in practice, but their gains lie in the constant
factors hidden inside the O(1/k) bound.
8.3.2 Momentum
(Stochastic) gradient descent vs. momentum
8.3.2 Momentum
Algorithm SGD with momentum
Require: Learning rate ϵ = ϵk, momentum parameter α
Require: Initial parameter θ, initial velocity v
1: while stopping criterion not met do
2:   Sample a minibatch {x^(1), . . . , x^(m)} and the corresponding y^(i)
3:   Compute gradient estimate: g ← (1/m) ▽θ ∑_i L(f(x^(i); θ), y^(i))
4:   Compute velocity update: v ← αv − ϵg
5:   Apply update: θ ← θ + v
6: end while
The decay of the velocity from v to αv is an analogy to ‘viscous drag’.
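A sketch of this velocity update on top of the SGD loop, with the same assumed grad_fn interface as the earlier SGD sketch:

```python
import numpy as np

def sgd_momentum(grad_fn, theta, X, y, eps=0.05, alpha=0.9,
                 batch_size=32, n_steps=2000, seed=0):
    """SGD with momentum: v <- alpha*v - eps*g ; theta <- theta + v."""
    rng = np.random.default_rng(seed)
    v = np.zeros_like(theta)                     # initial velocity
    for _ in range(n_steps):
        idx = rng.integers(0, len(X), size=batch_size)
        g = grad_fn(theta, X[idx], y[idx])       # minibatch gradient estimate
        v = alpha * v - eps * g                  # velocity decays ("viscous drag"), pushed by -g
        theta = theta + v
    return theta
```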
8.3.3 Nesterov Momentum
Algorithm SGD with Nesterov momentum
Require: Learning rate ϵ = ϵk, momentum parameter α
Require: Initial parameter θ, initial velocity v
1: while stopping criterion not met do
2:   Sample a minibatch {x^(1), . . . , x^(m)} and the corresponding y^(i)
3:   Compute gradient estimate: g ← (1/m) ▽θ ∑_i L(f(x^(i); θ + αv), y^(i))
4:   Compute velocity update: v ← αv − ϵg
5:   Apply update: θ ← θ + v
6: end while
Excess error is O(1/k²) for convex problems with batch gradients.
With SGD, it does not improve the rate of convergence.
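The only change from the momentum sketch is where the gradient is evaluated: at the look-ahead point θ + αv (again an illustrative sketch with the same assumed interface):

```python
import numpy as np

def sgd_nesterov(grad_fn, theta, X, y, eps=0.05, alpha=0.9,
                 batch_size=32, n_steps=2000, seed=0):
    """Nesterov momentum: gradient is taken at the look-ahead point theta + alpha*v."""
    rng = np.random.default_rng(seed)
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        idx = rng.integers(0, len(X), size=batch_size)
        g = grad_fn(theta + alpha * v, X[idx], y[idx])   # look-ahead gradient
        v = alpha * v - eps * g
        theta = theta + v
    return theta
```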
8.4 Parameter initialization strategies
Nonconvex cost function: initial point strongly affects training.
No rule of thumb.
Some helpful comments (weights):
Break symmetry – no two weights w_ij should be identical
Consider a Gaussian distribution or a uniform U(−1/√J, 1/√J) (J : input layer size)
Large initial values may lead to overfitting
Biases: usually fine to set to zero
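A minimal sketch of this initialization; the layer sizes in the usage line are made up for illustration:

```python
import numpy as np

def init_layer(n_in, n_out, rng, scheme="uniform"):
    """Small random weights break symmetry; biases start at zero."""
    if scheme == "uniform":                          # U(-1/sqrt(J), 1/sqrt(J)), J = fan-in
        bound = 1.0 / np.sqrt(n_in)
        W = rng.uniform(-bound, bound, size=(n_out, n_in))
    else:                                            # Gaussian alternative with a comparable scale
        W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))
    b = np.zeros(n_out)
    return W, b

rng = np.random.default_rng(0)
W, b = init_layer(n_in=784, n_out=256, rng=rng)
print(W.shape, float(W.min().round(3)), float(W.max().round(3)), b.sum())
```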
8.5 Algorithms with adaptive learning rates: 8.5.1 AdaGrad
Learning rate: the most difficult hyperparameter to set, with a high impact on model performance
Can we determine it in an adaptive way?
Algorithm The AdaGrad algorithm
Require: Global learning rate ϵ
Require: Initial parameter θ
Require: Small constant δ, perhaps 10^−7, for numerical stability
1: Initialize gradient accumulation variable r = 0
2: while stopping criterion not met do
3:   Sample a minibatch {x^(1), . . . , x^(m)} and the corresponding y^(i)
4:   Compute gradient estimate: g ← (1/m) ▽θ ∑_i L(f(x^(i); θ), y^(i))
5:   Accumulate squared gradient: r ← r + g ⊙ g
6:   Compute update: ∆θ ← − (ϵ / (δ + √r)) ⊙ g (operations elementwise)
7:   Apply update: θ ← θ + ∆θ
8: end while
Accumulation of squared gradients from the beginning of training can result
in a premature and excessive decrease in the effective learning rate.
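A sketch of the AdaGrad update, with the same assumed grad_fn interface as the earlier SGD sketches:

```python
import numpy as np

def adagrad(grad_fn, theta, X, y, eps=0.01, delta=1e-7,
            batch_size=32, n_steps=2000, seed=0):
    """AdaGrad: per-parameter steps shrink with the accumulated squared gradient."""
    rng = np.random.default_rng(seed)
    r = np.zeros_like(theta)                            # running sum of g*g
    for _ in range(n_steps):
        idx = rng.integers(0, len(X), size=batch_size)
        g = grad_fn(theta, X[idx], y[idx])
        r = r + g * g                                   # accumulate squared gradient
        theta = theta - eps / (delta + np.sqrt(r)) * g  # elementwise update
    return theta
```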
8.5.2 RMSProp
Algorithm The RMSProp algorithm
Require: Global learning rate ϵ, decay rate ρ
Require: Initial parameter θ
Require: Small constant δ, usually 10^−6, for numerical stability
1: Initialize gradient accumulation variable r = 0
2: while stopping criterion not met do
3:   Sample a minibatch {x^(1), . . . , x^(m)} and the corresponding y^(i)
4:   Compute gradient estimate: g ← (1/m) ▽θ ∑_i L(f(x^(i); θ), y^(i))
5:   Accumulate squared gradient: r ← ρr + (1 − ρ)g ⊙ g
6:   Compute update: ∆θ ← − (ϵ / (δ + √r)) ⊙ g (operations elementwise)
7:   Apply update: θ ← θ + ∆θ
8: end while
In the nonconvex setting, the learning trajectory may pass through many different structures
and eventually arrive at a region that is a locally convex bowl; the exponentially decaying
average keeps the learning rate from having already shrunk too much by then.
Works well in practice.
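The same loop as AdaGrad, with the exponentially weighted accumulator instead of a full sum (again a sketch with the assumed grad_fn interface):

```python
import numpy as np

def rmsprop(grad_fn, theta, X, y, eps=0.001, rho=0.9, delta=1e-6,
            batch_size=32, n_steps=2000, seed=0):
    """RMSProp: exponentially decaying average of squared gradients."""
    rng = np.random.default_rng(seed)
    r = np.zeros_like(theta)
    for _ in range(n_steps):
        idx = rng.integers(0, len(X), size=batch_size)
        g = grad_fn(theta, X[idx], y[idx])
        r = rho * r + (1 - rho) * g * g                 # old history decays away
        theta = theta - eps / (delta + np.sqrt(r)) * g
    return theta
```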
8.5.2 RMSProp
Algorithm The RMSProp algorithm with Nesterov momentum
Require: Global learning rate ϵ, decay rate ρ, momentum parameter α
Require: Initial parameter θ, initial velocity v
1: Initialize gradient accumulation variable r = 0
2: while stopping criterion not met do
3:   Sample a minibatch {x^(1), . . . , x^(m)} and the corresponding y^(i)
4:   Compute gradient estimate: g ← (1/m) ▽θ ∑_i L(f(x^(i); θ + αv), y^(i))
5:   Accumulate squared gradient: r ← ρr + (1 − ρ)g ⊙ g
6:   Compute velocity update: v ← αv − (ϵ/√r) ⊙ g (operations elementwise)
7:   Apply update: θ ← θ + v
8: end while
8.5.3 Adam
Algorithm The Adam algorithm
Require: Global learning rate (step size) ϵ
Require: Exponential decay rates for moment estimates, ρ1 and ρ2 in [0, 1) (suggested defaults:
0.9 and 0.999, resp.)
Require: Small constant δ for numerical stabilization (suggested 10^−8)
Require: Initial parameter θ
1: Initialize 1st and 2nd moment variables s = 0, r = 0
2: Initialize timestep t = 0
3: while stopping criterion not met do
4:   Sample a minibatch {x^(1), . . . , x^(m)} and the corresponding y^(i)
5:   Compute gradient estimate: g ← (1/m) ▽θ ∑_i L(f(x^(i); θ), y^(i))
6:   t ← t + 1
7:   Update biased first moment estimate: s ← ρ1 s + (1 − ρ1)g
8:   Update biased second moment estimate: r ← ρ2 r + (1 − ρ2)g ⊙ g
9:   Correct bias in first moment: ŝ ← s / (1 − ρ1^t)
10:  Correct bias in second moment: r̂ ← r / (1 − ρ2^t)
11:  Compute update: ∆θ = −ϵ ŝ / (√r̂ + δ) (operations elementwise)
12:  Apply update: θ ← θ + ∆θ
13: end while
Momentum + RMSProp + Some more insights
Fairly robust to hyperparameter choices, although the learning rate sometimes needs to be changed from the suggested default.
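A sketch of the Adam update with the suggested defaults, using the same assumed grad_fn interface as the earlier sketches:

```python
import numpy as np

def adam(grad_fn, theta, X, y, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8,
         batch_size=32, n_steps=2000, seed=0):
    """Adam: bias-corrected first and second moment estimates of the gradient."""
    rng = np.random.default_rng(seed)
    s = np.zeros_like(theta)        # first moment (running mean of g)
    r = np.zeros_like(theta)        # second moment (running mean of g*g)
    for t in range(1, n_steps + 1):
        idx = rng.integers(0, len(X), size=batch_size)
        g = grad_fn(theta, X[idx], y[idx])
        s = rho1 * s + (1 - rho1) * g
        r = rho2 * r + (1 - rho2) * g * g
        s_hat = s / (1 - rho1 ** t)                    # bias corrections
        r_hat = r / (1 - rho2 ** t)
        theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
    return theta
```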
8.5.4 Choosing the right optimization algorithm
Currently popular: SGD (with or without momentum), RMSProp (with or without
momentum), AdaDelta, and Adam
The choice (at this point) seems to depend largely on the user's familiarity with
the algorithm (for ease of hyperparameter tuning)
8.6 Approximate second-order methods – Omitted here (practically not used)
Newton’s method
Conjugate gradients
BFGS
8.7 Optimization strategies and meta-algorithms: 8.7.1
Batch normalization
Notation change: the j-th unit in the l-th layer revisited
h_j^(l) = ϕ( ∑_{k=1}^{J^(l−1)} w_{k,j}^(l) h_k^(l−1) + b_j^(l) ),  j = 1, . . . , J^(l)
Vector notation
h^(l) = ϕ( W^(l) h^(l−1) + b^(l) )
Motivation
Each layer is affected by the preceding layers. For deeper networks,
▶ small changes to the network parameters amplify;
▶ simultaneously updating all layers compounds the problem;
▶ gradients can vanish (small ϕ′(t)) even with the ReLU function and a good initial point.
Empirically, training converges faster if the input (h^(l−1)) distribution is fixed.
▶ If we have l = F2(F1(u, θ1), θ2), learning θ2 can be viewed (with x = F1(u, θ1)) as
l = F2(x, θ2),  θ2_new = θ2_old − (α/m) ∑_{i=1}^{m} ∂F2(x_i, θ2_old)/∂θ2
▶ But simply normalizing each input (of a layer) may change what the layer can represent.
8.7.1 Batch normalization
h^(l) = ϕ( W^(l) h^(l−1) + b^(l) )  is rewritten as  ϕ(Wx + b).
Let's say we have inputs {x1, . . . , xm} (xi : activations of the previous layer for the
i-th example).
Algorithm Batch normalizing transform – when learning W and b, learn γ and β too.
Require: B = {x1, . . . , xm}; parameters to be learned: γ, β
Ensure: yi = BN_{γ,β}(xi)
1: Mini-batch mean: µ_B ← (1/m) ∑_{i=1}^{m} xi
2: Mini-batch variance: σ²_B ← (1/m) ∑_{i=1}^{m} (xi − µ_B)²
3: Normalize: x̂i ← (xi − µ_B) / √(σ²_B + ϵ)
4: Scale and shift: yi ← γ x̂i + β ≡ BN_{γ,β}(xi)
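A sketch of this training-time transform for a minibatch of activations; inference-time statistics are a separate concern not covered on these slides, and the shapes and names below are illustrative:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch-normalizing transform for a minibatch x of shape (m, n_features)."""
    mu = x.mean(axis=0)                         # minibatch mean per feature
    var = x.var(axis=0)                         # minibatch variance per feature
    x_hat = (x - mu) / np.sqrt(var + eps)       # normalize
    return gamma * x_hat + beta                 # scale and shift: BN_{gamma,beta}(x)

rng = np.random.default_rng(0)
x = 3.0 + 2.0 * rng.normal(size=(64, 10))       # activations with mean ~3, std ~2
gamma, beta = np.ones(10), np.zeros(10)         # learned jointly with W (b can be dropped)
y = batch_norm(x, gamma, beta)
print(y.mean(axis=0).round(3)[:3], y.std(axis=0).round(3)[:3])   # ~0 and ~1
```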
8.7.1 Batch normalization
For the pre-activations in ϕ(W xi + b):
x̂i ← (xi − µ_B) / √(σ²_B + ϵ),   yi ← γ x̂i + β ≡ BN_{γ,β}(xi)
x’s have heterogeneous mean and variance over minibatches which heavily
depends on previous layers
On the other hand, y’s have same mean and variance over minibatches which
determined sole by γ, β which is much easier to learn with gradient descent.
In practice, batch-normalize the whole Wx + b rather than x itself (b then
should be omitted)
14 times fewer training steps, first place in ImageNet (2014)
8.7 Optimization strategies and meta-algorithms
8.7.2 Coordinate descent
Optimize one coordinate (or block of coordinates) at a time
Known to work well for some convex problems
8.7.3 Polyak averaging
(Moving) average over the points visited by gradient descent
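A small sketch of Polyak averaging over a stored trajectory of iterates; the exponentially weighted variant is the "moving" version, and the names are illustrative:

```python
import numpy as np

def polyak_average(theta_history, decay=None):
    """Plain average of visited points, or an exponential moving average if decay is given."""
    if decay is None:
        return np.mean(theta_history, axis=0)          # average over all visited points
    avg = np.array(theta_history[0], dtype=float)
    for theta in theta_history[1:]:
        avg = decay * avg + (1 - decay) * theta        # running (moving) average
    return avg

# Toy usage: noisy iterates oscillating around the optimum at 1.0.
rng = np.random.default_rng(0)
iterates = 1.0 + 0.1 * rng.normal(size=(100, 3))
print(polyak_average(iterates).round(3), polyak_average(iterates, decay=0.9).round(3))
```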
8.7.4 Supervised pretraining
8.7 Optimization strategies and meta-algorithms
8.7.5 Designing models to aid optimization
In practice, it is more important to choose a model family that is easy to
optimize than to use a powerful optimization algorithm.
LSTM, ReLU, maxout units have all moved toward easier optimization.
8.7.6 Continuation methods and curriculum learning