Recently, fast.ai, an AI education and research organization, topped a Kaggle competition, beating researchers from Google, Amazon, and Intel AI Research. One of the ingredients behind this success was the effective use of Cyclical Learning Rates (CLR) in their models. CLR is not yet a widely studied technique, and community adoption has been slow, partly because it did not come out of one of the high-profile AI labs. Yet CLR has shown some exceptional results in the training of neural nets.
Introduction to cyclical learning rates for training neural nets
1. 2018-11-03
Introduction to Cyclical Learning Rates for Training Neural Nets
Sayak Paul
Project Instructor @ DataCamp
(GDG DevFest, Kolkata, India)
2. Overview of the talk
• Why are learning rates used?
• Some existing approaches for choosing the right learning rate
• What are the shortcomings of these approaches?
• The need for a systematic approach to setting the learning rate – Cyclical Learning Rates (CLR)
• What is CLR?
• Some amazing results shown by CLR
• Conclusion
3. Why are learning rates used?
The learning rate is an important hyperparameter: it controls how much the weights of a network are adjusted with respect to the loss gradient.
Source: Andrew Ng’s lecture notes from Coursera
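To make the role of the learning rate concrete, here is a minimal sketch of a single gradient-descent step in plain Python; the names sgd_step and grad_loss are illustrative, not from any particular library:

```python
import numpy as np

# One gradient-descent update: the learning rate scales how far we move
# along the negative gradient of the loss.
def sgd_step(weights, grad_loss, lr=0.01):
    return weights - lr * grad_loss(weights)

# Toy example: minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = sgd_step(w, lambda w: 2 * w, lr=0.1)
print(w)  # approaches [0, 0]
```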
4. Some of the existing approaches for choosing the right LR
• Trying out different learning rates for a problem.
• Grid-searching/Random-searching.
• Adaptive Learning Rates / Learning Rate Schedules.
[Figures: a step decay schedule; grid vs. random layouts of hyperparameters]
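As an example of a learning rate schedule, below is a hedged sketch of step decay in plain Python; the parameter names (initial_lr, drop, epochs_per_drop) are illustrative:

```python
import math

# Step decay: cut the learning rate by a factor of `drop`
# every `epochs_per_drop` epochs.
def step_decay(epoch, initial_lr=0.1, drop=0.5, epochs_per_drop=10):
    return initial_lr * (drop ** math.floor(epoch / epochs_per_drop))

for epoch in [0, 9, 10, 25]:
    print(epoch, step_decay(epoch))
# 0 -> 0.1, 9 -> 0.1, 10 -> 0.05, 25 -> 0.025
```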
5. Problems with these approaches
• Computationally expensive.
• Give no early indication of whether the results will improve at all.
6. Cyclical Learning Rates*
• Proposed by Leslie N. Smith in his 2015 paper entitled “Cyclical Learning Rates for Training Neural Networks”.
• The idea is to simply keep increasing the learning rate from a very small value until the loss stops decreasing.
* Cyclical Learning Rates for Training Neural Networks – Leslie N. Smith
7. How is Cyclical Learning Rate (CLR) systematic?
• The main idea behind CLR is to vary the learning rate cyclically between minimum and maximum bounds.
• An LR range test (LR_Range_Test()) is conducted to fix the min and max values of the learning rate.
8. LR_Range_Test()
• A cycle consists of one step of linearly increasing the learning rate from min_lr to max_lr, followed by one step of decreasing it back to min_lr.
• Also called the Triangular Learning Rate Policy (a sketch follows below).
[Figure: triangular learning rate policy oscillating between min_lr and max_lr]
Source: Cyclical Learning Rates for Training Neural Networks – Leslie N. Smith
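Here is a small sketch, in plain Python, of the triangular policy following the formula in Smith's paper; min_lr, max_lr, and step_size are the policy's parameters:

```python
import math

# Triangular policy: the learning rate ramps linearly from min_lr up to
# max_lr over `step_size` iterations, then back down, and the cycle repeats.
def triangular_lr(iteration, min_lr=1e-4, max_lr=1e-2, step_size=2000):
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return min_lr + (max_lr - min_lr) * max(0.0, 1 - x)

# At the start of a cycle the LR is min_lr, at the midpoint it is max_lr.
print(triangular_lr(0))     # 1e-4
print(triangular_lr(2000))  # 1e-2
print(triangular_lr(4000))  # back to 1e-4
```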
9. Choosing max_lr and min_lr
• Run the model for several epochs while letting the learning rate increase linearly (use triangular
learning rate policy) between low and high learning rate values.
• Next, plot the accuracy versus learning rate curve.
• Note the learning rate value when the accuracy starts to increase and when the accuracy slows,
becomes ragged, or starts to fall. These two learning rates are good choices for defining the range
of the learning rates.
Source: Cyclical Learning Rates for Training Neural Networks – Leslie N. Smith
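A hedged sketch of this range test is below; evaluate_accuracy is a hypothetical placeholder that stands in for a short training run at a fixed learning rate (simulated here so the snippet runs on its own):

```python
import numpy as np

# Sweep the learning rate from `low` to `high` and record the accuracy
# reached at each value; plot or inspect the curve to pick min_lr/max_lr.
def lr_range_test(low=1e-5, high=1.0, num_steps=50):
    lrs = np.geomspace(low, high, num_steps)  # np.linspace for a linear sweep
    accs = [evaluate_accuracy(lr) for lr in lrs]
    return lrs, accs

def evaluate_accuracy(lr):
    # Simulated response: accuracy rises with lr, then collapses once the
    # lr is too large. Replace with a few real training batches in practice.
    return max(0.0, 0.9 - abs(np.log10(lr) + 2.5) * 0.2)

lrs, accs = lr_range_test()
# min_lr: where accuracy starts to rise; max_lr: where it stalls or falls.
best = lrs[int(np.argmax(accs))]
print(f"accuracy peaks near lr={best:.2e}")
```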
11. Some amazing results shown by CLR
Kaggle iMaterialist Challenge (Fashion) Leaderboard
12. Some amazing results shown by CLR (contd.)
DAWNBench Challenge Leaderboard and Leader’s specs
13. Limitations of CLR
• Limited applicability so far.
• The reported results mostly cover CIFAR-10 and ResNet-style architectures.
• But it definitely provides a more systematic way of choosing the learning rate than the earlier approaches.