Competition Winning Learning Rates
Leslie N. Smith
Naval Center for Applied Research in Artificial Intelligence
US Naval Research Laboratory, Washington, DC 20375
leslie.smith@nrl.navy.mil; Phone: (202) 767-9532
MLConf 2018
November 14, 2018
UNCLASSIFIED
Introduction
• My story of enlightenment about learning rates (LR)
• My first steps: Cyclical learning rates (CLR)
– What are cyclical learning rates? Why do they matter?
• A new LR schedule and Super-Convergence
– Fast training of networks with large learning rates
• Competition winning learning rates
– Stanford’s DAWNBench competition
– Kaggle’s iMaterialist Challenge
• Enlightenment is a never-ending story
– Is weight decay more important than LR?
2
Outline
Deep Learning Basics Background
– Uses a neural network composed of many “hidden” layers l; each layer contains trainable weights W_l, biases b_l, and a non-linear function σ
– An image x is the input; the network output y_L is compared to the label y, which defines the loss
[Figure: network diagram mapping the input image x through weighted layers to output classes (Speed limit 20, Yield, Pedestrian crossing); back-propagation flows from the loss toward the input, where gradients can vanish]
y_l = F(y_{l-1}) = σ(W_l y_{l-1} + b_l)
y_L = σ(W_L σ(W_{L-1} σ(… W_1 x + b_1 …)))
Loss: ||y_L − y||
3
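As a rough illustration of the forward pass and loss above (a minimal sketch only; the layer sizes, ReLU non-linearity, and one-hot label are assumptions for illustration, not the network on the slide):

```python
import numpy as np

def forward(x, weights, biases):
    """Minimal MLP forward pass: y_l = sigma(W_l @ y_{l-1} + b_l) at each layer."""
    y = x
    for W, b in zip(weights, biases):
        y = np.maximum(0.0, W @ y + b)  # sigma taken to be ReLU here
    return y

# Toy example: a flattened image mapped to 3 class scores
rng = np.random.default_rng(0)
sizes = [784, 128, 64, 3]
weights = [rng.normal(0.0, 0.1, (n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

x = rng.normal(size=784)
y_L = forward(x, weights, biases)
label = np.array([1.0, 0.0, 0.0])        # one-hot label y
loss = np.linalg.norm(y_L - label)       # ||y_L - y||
```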
4
What are learning rates?
• The learning rate is the step size in Stochastic Gradient Descent’s (SGD) weight update
• It has long been known that the learning rate (LR) is the most important hyper-parameter to tune
– Too large: the training diverges
– Too small: trains slowly to a sub-optimal solution
• How to find an optimal LR?
– Grid or random search
– Time consuming and inaccurate
Cyclical learning rates (CLR)
w_{t+1} = w_t − ε ∇L(θ, x)
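A minimal sketch of this update on a toy problem (illustration only; the quadratic objective and learning rates are made up to show the too-large / too-small behaviour):

```python
import numpy as np

def sgd_step(w, grad, lr):
    """One SGD update: w_{t+1} = w_t - lr * grad."""
    return w - lr * grad

# Toy objective L(w) = 0.5 * ||w||^2, whose gradient is simply w
w = np.array([1.0, -2.0])
for _ in range(50):
    w = sgd_step(w, grad=w, lr=0.1)   # lr > 2.0 would diverge; lr = 1e-4 would crawl
print(w)  # close to the optimum at the origin
```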
5
What is CLR? And who cares?
• Cyclical learning rates (CLR)
– Learning rate schedule that varies between
min and max values
• LR range test: one stepsize (half cycle) of linearly increasing LR
– A quick and easy way to find an optimal learning rate
– The peak of the accuracy-versus-LR curve defines max_lr
– The optimal LR is a bit less than max_lr
– min_lr ≈ max_lr / 3
Cyclical learning rates (CLR)
[Figure: triangular CLR schedule oscillating between min_lr and max_lr]
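A minimal sketch of the triangular CLR schedule and of how an LR range test uses it (a simplified illustration; the step size and LR bounds are placeholder values):

```python
import math

def triangular_clr(iteration, step_size, min_lr, max_lr):
    """Triangular cyclical learning rate: linearly ramps between min_lr and max_lr."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return min_lr + (max_lr - min_lr) * max(0.0, 1.0 - x)

# LR range test: run a single increasing half-cycle (one stepsize) while tracking accuracy;
# max_lr is where accuracy peaks, and min_lr is then set to roughly max_lr / 3.
range_test_lrs = [triangular_clr(i, step_size=1000, min_lr=1e-4, max_lr=0.3)
                  for i in range(1000)]
```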
8
Super-convergence
• What if there’s no peak?
– Implies the ability to train at very large
learning rates
• Super-convergence
– Start with a small LR
– Grow to a large learning rate maximum
• 1cycle learning rate schedule
– One CLR cycle, ending with a smaller LR
than the min
Super-convergence
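A minimal sketch of a 1cycle-style schedule (a simplified illustration of the policy, not the fastai implementation; the 45%/45%/10% phase split and the final-LR fraction are assumptions):

```python
def one_cycle_lr(step, total_steps, max_lr, min_lr=None, final_lr=None):
    """1cycle policy: ramp min_lr -> max_lr, back down to min_lr, then anneal below min_lr."""
    min_lr = max_lr / 10 if min_lr is None else min_lr
    final_lr = min_lr / 100 if final_lr is None else final_lr
    up = int(0.45 * total_steps)            # increasing half of the single CLR cycle
    if step < up:
        return min_lr + (max_lr - min_lr) * step / up
    if step < 2 * up:                       # decreasing half of the cycle
        return max_lr - (max_lr - min_lr) * (step - up) / up
    tail = max(total_steps - 2 * up, 1)     # final phase: end with an LR smaller than the min
    return min_lr - (min_lr - final_lr) * (step - 2 * up) / tail

schedule = [one_cycle_lr(s, total_steps=10_000, max_lr=3.0) for s in range(10_000)]
```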
9
ImageNet Super-convergence
Introduction
• My story of enlightenment about learning rates (LR)
• My first steps: Cyclical learning rates (CLR)
– What are cyclical learning rates? Why do they matter?
• A new LR schedule and Super-Convergence
– Fast training of networks with large learning rates
• Competition winning learning rates
– Stanford’s DAWNBench competition
– Kaggle’s iMaterialist Challenge
• Enlightenment is a never-ending story
– Is weight decay more important than LR?
10
Outline
11
Competition Winning Learning Rates
• DAWNBench challenge
– “Howard explains that in order to create an algorithm for solving CIFAR,
Fast.AI’s group turned to a relatively unknown technique known as “super
convergence.” This wasn’t developed by a well-funded tech company or
published in a big journal, but was created and self-published by a single
engineer named Leslie Smith working at the Naval Research Laboratory.”
The Verge, 5/7/2018 article about the fast.ai team’s 1st place win
12
Competition Winning Learning Rates
• Kaggle iMaterialist
Challenge (Fashion)
– “For training I used Adam
initially but I switched to the
1cycle policy with SGD very
early on. You can read more
about this training regime in
a paper by Leslie Smith and
you can find details on how to
use it by Sylvain Gugger, the
author of the implementation
in the fastai library here.”
1st place winner, Radek
Osmulski
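For context, the 1cycle policy referenced in the quote is exposed in the fastai library roughly as follows (a hedged sketch against the fastai v1 API of that era; the dataset, architecture, and hyper-parameters are placeholders, not the winner’s actual pipeline):

```python
from fastai.vision import *  # fastai v1-style star import

# Placeholder data and model; the actual competition setup was more involved.
data = ImageDataBunch.from_folder('data/', ds_tfms=get_transforms(), size=224, bs=64)
data = data.normalize(imagenet_stats)
learn = cnn_learner(data, models.resnet50, metrics=accuracy)

learn.lr_find()                        # LR range test to choose max_lr
learn.fit_one_cycle(10, max_lr=1e-2)   # train with the 1cycle policy
```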
13
Relevant publications
• Smith, Leslie N. "Cyclical learning rates for training neural
networks." In Applications of Computer Vision (WACV),
2017 IEEE Winter Conference on, pp. 464-472. IEEE, 2017.
• Smith, Leslie N., and Nicholay Topin. "Super-Convergence:
Very Fast Training of Residual Networks Using Large
Learning Rates." arXiv preprint arXiv:1708.07120 (2017).
• Smith, Leslie N. "A disciplined approach to neural network
hyper-parameters: Part 1--learning rate, batch size,
momentum, and weight decay." arXiv preprint
arXiv:1803.09820 (2018).
• There’s a large batch of literature on Large Batch Training
Competition Winning Learning Rates
Introduction
• My story of enlightenment about learning rates (LR)
• My first steps: Cyclical learning rates (CLR)
– What are cyclical learning rates? Why do they matter?
• A new LR schedule and Super-Convergence
– Fast training of networks with large learning rates
• Competition winning learning rates
– Stanford’s DAWNBench competition
– Kaggle’s iMaterialist Challenge
• Enlightenment is a never-ending story
– Is weight decay more important than LR?
14
Outline
15
What is Weight Decay?
• L2 regularization
• The “effective weight
decay” is a combination of
the WD coefficient, λ, and
the learning rate, ε
• Which term does the
learning rate schedule
impact more?
Weight decay (WD)
L(θ, x) = ||f(θ, x) − y||² + ½ λ ||w||²
Effective WD
w_{t+1} = w_t − ε ∇L(θ, x) − ε λ w_t
w_{t+1} = (1 − ε λ) w_t − ε ∇L(θ, x)
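A minimal sketch of the coupled update above (illustration only; the zero gradient is chosen just to isolate the decay term):

```python
import numpy as np

def sgd_wd_step(w, grad, lr, wd):
    """SGD with L2 weight decay: w <- (1 - lr*wd) * w - lr * grad."""
    return (1.0 - lr * wd) * w - lr * grad

w = np.array([1.0, -0.5])
for _ in range(10):
    w = sgd_wd_step(w, grad=np.zeros_like(w), lr=0.1, wd=1e-2)
# The weights have shrunk by (1 - 0.1 * 0.01)**10: the same wd coefficient decays
# weights faster when the learning rate is larger, which is the "effective WD".
```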
16
What is the optimal WD?
• Decreasing weight decay has a much greater effect than
decreasing the learning rate
• The maximum weight decay, maxWD, can be found in a
similar way as maxLR (i.e., WD range test)
• Use a large WD early in training and let it decay
Weight decay (WD)
17
Dynamic weight decay
• Hyper-parameter relationship:
(LR × WD) / (TBS × (1 − α)) ≈ constant
where LR = learning rate, WD = weight decay coefficient, TBS = total batch size, and α = momentum
– Large TBS and smaller values of LR and α permit a larger maxWD
• Large batch training can be improved by a large WD if WD decays during training
Weight decay (WD)
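A worked example of keeping (LR × WD) / (TBS × (1 − α)) constant (hypothetical numbers, purely illustrative):

```python
def rescale_wd(wd, lr, tbs, momentum, lr_new, tbs_new, momentum_new):
    """Pick a new WD so that (LR * WD) / (TBS * (1 - momentum)) stays constant."""
    const = (lr * wd) / (tbs * (1.0 - momentum))
    return const * tbs_new * (1.0 - momentum_new) / lr_new

# Growing the total batch size 4x (256 -> 1024) with LR and momentum unchanged
# suggests roughly a 4x larger weight decay coefficient (1e-4 -> 4e-4).
wd_new = rescale_wd(wd=1e-4, lr=0.1, tbs=256, momentum=0.9,
                    lr_new=0.1, tbs_new=1024, momentum_new=0.9)
```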
18
Conclusions (for now)
• Takeaways
– Enlightenment is a never-ending story
– A new diet plan for your network
• Decaying weight decay in large batch training; set WD large in the
early phase of training and zero near the end
– Hyper-parameters are tightly coupled and must be tuned
together
Competition Winning Learning Rates
(LR × WD) / (TBS × (1 − α)) ≈ constant
Competition Winning Learning Rates
Questions and comments?
The End
19