Learning to Learn by Gradient Descent by Gradient Descent

Katy Lee

NIPS 2016

Background
• To say that a program learns, we need:
1. a task
2. training experience
3. a performance measure
• A computer program is said to learn if its
performance at the task improves with experience.
Mitchell [Mitchell, 1993]

Background
• To say that an algorithm learns to learn, we need:
1. a family of tasks
2. training experience for each of these tasks
3. a family of performance measures
• An algorithm is said to learn to learn if its
performance at each task improves with
experience and with the number of tasks.
Thrun, Sebastian, and Lorien Pratt, eds. Learning to Learn. Springer Science & Business Media, 2012.

Background
• Frequently, tasks in machine learning can be
expressed as the problem of optimizing an
objective function $f(\theta)$ defined over some domain $\theta \in \Theta$
• The goal is to find the minimizer $\theta^* = \arg\min_{\theta \in \Theta} f(\theta)$
• The standard approach for differentiable functions
is some form of gradient descent, resulting in a
sequence of updates $\theta_{t+1} = \theta_t - \alpha_t \nabla f(\theta_t)$
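The gradient descent update described on this slide can be sketched in a few lines of plain Python. This is only an illustration: the quadratic objective, the `TARGET` vector, the fixed step size `alpha`, and all function names are assumptions made for the example and do not come from the slides.

```python
def gradient_descent(grad_f, theta0, alpha=0.1, steps=100):
    """Apply theta_{t+1} = theta_t - alpha * grad_f(theta_t) and return all iterates."""
    theta = list(theta0)
    trajectory = [list(theta)]
    for _ in range(steps):
        g = grad_f(theta)
        theta = [th - alpha * gi for th, gi in zip(theta, g)]
        trajectory.append(list(theta))
    return trajectory


# Hypothetical objective: f(theta) = sum_i (theta_i - TARGET_i)^2
TARGET = [3.0, -1.0]


def f(theta):
    return sum((th - t) ** 2 for th, t in zip(theta, TARGET))


def grad_f(theta):
    return [2.0 * (th - t) for th, t in zip(theta, TARGET)]


trajectory = gradient_descent(grad_f, [0.0, 0.0])
final_theta = trajectory[-1]
print(final_theta, f(final_theta))
```

Here the step size is a hand-picked constant, which is exactly the kind of hand-designed choice the talk goes on to question.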

Motivation
• Most modern work is based on
designing update rules for specific classes of
problems; such rules may perform poorly on other
classes of problems

Motivation
• In this work we take a different tack and instead
propose to replace hand-designed update rules
with a learned update rule

Outline
• Related work
• Main idea
• Evaluation
• Conclusion

Related Work
• C. Daniel, J. Taylor, and S. Nowozin. Learning step
size controllers for robust neural network training. In
Association for the Advancement of Artificial
Intelligence, 2016.

Outline
• Related work or naïve methods
• Main idea
• Evaluation
• Conclusion

• In this work, the authors propose to replace hand-designed
update rules with a learned update rule,
called the optimizer m (an LSTM), with its
own parameters φ
• This results in updates to the optimizee f of the form
$\theta_{t+1} = \theta_t + g_t(\nabla f(\theta_t), \phi)$
where $g_t$ is the output of the LSTM
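To make the form of this update concrete, here is a minimal sketch in which the update rule is just a function returning g_t together with a new internal state. It does not implement an LSTM: the two hand-designed rules, the dictionary layout of `phi`, and the names `sgd_rule`, `momentum_rule`, and `run` are all assumptions chosen for illustration. The point is only that any rule of the form theta + g_t fits the same interface, so a learned model could be dropped in where the hand-designed ones are.

```python
def sgd_rule(grad, state, phi):
    """Hand-designed rule: g_t = -alpha * grad. It keeps no state."""
    return [-phi["alpha"] * g for g in grad], state


def momentum_rule(grad, state, phi):
    """Stateful rule: the velocity plays the role an LSTM hidden state would play."""
    velocity = [phi["mu"] * v - phi["alpha"] * g for v, g in zip(state, grad)]
    return velocity, velocity


def run(update_rule, grad_f, theta, state, phi, steps):
    """Repeat theta_{t+1} = theta_t + g_t, where g_t comes from update_rule."""
    for _ in range(steps):
        g_t, state = update_rule(grad_f(theta), state, phi)
        theta = [th + g for th, g in zip(theta, g_t)]
    return theta


def grad_f(theta):
    # Hypothetical optimizee: f(theta) = theta^2, so the gradient is 2 * theta.
    return [2.0 * th for th in theta]


theta_sgd = run(sgd_rule, grad_f, [5.0], None, {"alpha": 0.1}, steps=50)
theta_mom = run(momentum_rule, grad_f, [5.0], [0.0], {"alpha": 0.1, "mu": 0.5}, steps=50)
print(theta_sgd, theta_mom)
```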

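How to train the optimizer
• To train the optimizer, we use an objective that
depends on the whole trajectory over a time horizon T:
$L(\phi) = \mathbb{E}_f\left[\sum_{t=1}^{T} w_t f(\theta_t)\right]$, where $\theta_{t+1} = \theta_t + g_t$ and $[g_t, h_{t+1}] = m(\nabla_t, h_t, \phi)$
• θ: the optimizee parameters
• φ: the optimizer parameters
• f: the function in question
• m: the LSTM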
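The idea of an objective that depends on the whole trajectory can be sketched with a deliberately tiny optimizer. In this sketch the optimizer is not an LSTM: its only parameter `phi` is the logarithm of a step size, the tasks are random one-dimensional quadratics, all weights w_t are 1, and the gradient with respect to `phi` is estimated by finite differences rather than by backpropagating through the unrolled updates. All names and constants are assumptions made for the example.

```python
import math
import random

T = 20  # time horizon of each unrolled trajectory


def sample_tasks(n, seed=0):
    """Each task is f(theta) = a * (theta - b)^2 with random curvature a and minimum b."""
    rng = random.Random(seed)
    return [(rng.uniform(0.5, 2.0), rng.uniform(-1.0, 1.0)) for _ in range(n)]


def meta_loss(phi, tasks):
    """Average over tasks of sum_{t=1..T} f(theta_t), i.e. the objective with w_t = 1."""
    alpha = math.exp(phi)
    total = 0.0
    for a, b in tasks:
        theta = 0.0
        for _ in range(T):
            grad = 2.0 * a * (theta - b)
            g_t = -alpha * grad           # output of the toy optimizer
            theta = theta + g_t           # theta_{t+1} = theta_t + g_t
            total += a * (theta - b) ** 2
    return total / len(tasks)


def train_optimizer(tasks, phi, meta_lr=0.05, meta_steps=200, eps=1e-4):
    """Gradient descent on phi, using a central finite-difference gradient estimate."""
    for _ in range(meta_steps):
        d_phi = (meta_loss(phi + eps, tasks) - meta_loss(phi - eps, tasks)) / (2 * eps)
        phi -= meta_lr * d_phi
    return phi


tasks = sample_tasks(8)
phi_init = math.log(0.01)
phi_trained = train_optimizer(tasks, phi_init)
loss_before = meta_loss(phi_init, tasks)
loss_after = meta_loss(phi_trained, tasks)
print(math.exp(phi_init), math.exp(phi_trained), loss_before, loss_after)
```

The outer loop is itself gradient descent, applied to the parameters of the rule that performs the inner gradient-based updates, which is the sense of the talk's title.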

Intuition of Trajectory
[Figure: the old trajectory compared with the trajectory obtained with the new φ]

Challenge
• a single LSTM acting on all optimizee parameters at once would need too many parameters
• solution?

Information Sharing Between Coordinates
• Global averaging cells (GAC): designate a subset of
the cells in each LSTM layer for communication.
Their outgoing activations are averaged at each
step across all coordinates.
• This allows the LSTMs of different coordinates to
communicate with each other
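The averaging step can be sketched as follows, assuming the per-coordinate activations are stored as a list of lists. This is one reading of the slide, not the authors' implementation: the data layout, the function name `apply_global_averaging`, and the choice of which cell indices are shared are all hypothetical.

```python
def apply_global_averaging(hidden, shared_cells):
    """hidden[i][k] is the activation of cell k in the LSTM copy for coordinate i.

    Cells listed in shared_cells are replaced by their mean across all coordinates;
    every other cell stays private to its coordinate. The input is not modified.
    """
    n = len(hidden)
    out = [list(h) for h in hidden]
    for k in shared_cells:
        mean_k = sum(h[k] for h in hidden) / n
        for row in out:
            row[k] = mean_k
    return out


# Three coordinates, three cells each; cell 0 is designated for communication.
hidden = [[1.0, 10.0, 0.5], [3.0, 20.0, 0.7], [5.0, 30.0, 0.9]]
shared = apply_global_averaging(hidden, shared_cells=[0])
print(shared)
```

After this step every coordinate sees the same value in the shared cell, while the remaining cells still differ per coordinate.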

Experiment 3: systematically
changing NN architecture
LSTM optimizer trained on a one-hidden-layer, 20-unit NN

Experiment 4: convnet on
CIFAR-10
LSTM-sub: trained only on the held-out dataset

Experiment 5: Neural Style, optimizer
trained on only one style and 1800 content
images from ImageNet

Conclusion
• So far, learning algorithms have been hand-crafted, but this
work shows how to train a NN with a NN
• generalizes well to different architectures but not to
different activation functions
• execution time?
• sometimes, when you have been confused for a long time, try
emailing the authors (all of them). A typo can kill you.

1. Learning to learn by gradient descent by gradient descent citation: 9 -> 38 Katy, 2016/11/25@DataLab NIPS 2016

11. Learning to learn with RNN

17. Coordinatewise LSTM Optimizer gt

20. Experiment 1

21. Experiment 2: change structure
