Hugo Larochelle
Work done at Twitter

Google Brain

Joint work with Sachin Ravi
• Deep learning successes have required a lot of labeled training data
‣ collecting and labeling such data requires significant human labor
‣ is that really how we’ll solve AI?
• Alternative solution: exploit other sources of data that are imperfect but plentiful
‣ unlabeled data (unsupervised learning)
‣ multimodal data (multimodal learning)
‣ multidomain data (transfer learning, domain adaptation)
• Let’s directly attack the problem of few-shot learning
‣ we want to design a learning algorithm A that outputs good parameters $\theta$ of a model M, when fed a small dataset $D_{\text{train}} = \{(X_t, Y_t)\}_{t=1}^{T}$
• Idea: let’s learn that algorithm A, end-to-end
‣ this is known as meta-learning or learning to learn
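• Learning algorithm A
‣ input: training set $D_{\text{train}} = \{(X_t, Y_t)\}$
‣ output: parameters $\theta$ of model M (the learner)
‣ objective: good performance on test set $D_{\text{test}} = (X, Y)$
• Meta-learning algorithm
‣ input: meta-training set $\mathcal{D}_{\text{meta-train}} = \{(D_{\text{train}}^{(n)}, D_{\text{test}}^{(n)})\}_{n=1}^{N}$
‣ output: parameters $\Theta$ of algorithm A (the meta-learner)
‣ objective: good performance on meta-test set $\mathcal{D}_{\text{meta-test}} = (D_{\text{train}}, D_{\text{test}})$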
captures fundamental knowledge shared among all the tasks.
We first begin by detailing the meta-learning formulation we use. In the typical machine learning
setting, we are interested in a dataset $D$ and usually split $D$ so that we optimize parameters on a
training set $D_{\text{train}}$ and evaluate its generalization on the test set $D_{\text{test}}$. In meta-learning,
we are dealing with meta-sets $\mathcal{D}$ containing multiple regular datasets, where each $D \in \mathcal{D}$ has a split
of $D_{\text{train}}$ and $D_{\text{test}}$.
We consider the k-shot, N-class classification task, where for each dataset $D$, the training set
consists of $k$ labelled examples for each of $N$ classes, meaning that $D_{\text{train}}$ consists of $k \cdot N$ examples,
and $D_{\text{test}}$ has a set number of examples for evaluation.
In meta-learning, we thus have different meta-sets for meta-training, meta-validation, and
meta-testing ($\mathcal{D}_{\text{meta-train}}$, $\mathcal{D}_{\text{meta-validation}}$, and $\mathcal{D}_{\text{meta-test}}$, respectively). On $\mathcal{D}_{\text{meta-train}}$, we are
interested in training a learning procedure (the meta-learning model) that can take as input one of
its training sets $D_{\text{train}}$ and produce a model that achieves high average classification performance on
its corresponding test set $D_{\text{test}}$. Using $\mathcal{D}_{\text{meta-validation}}$ we can perform hyper-parameter selection
of the meta-learning model and evaluate its generalization performance on $\mathcal{D}_{\text{meta-test}}$.
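As a rough illustration of this setup, the sketch below builds class-disjoint meta-sets and draws one k-shot, N-class dataset D = (Dtrain, Dtest). The helper names (`split_classes`, `sample_episode`), the toy string "images", and the choice of 15 evaluation examples per class are assumptions made for the example, not details taken from the text above.

```python
import random


def split_classes(classes, n_train, n_val, n_test, seed=0):
    """Partition class labels into disjoint meta-train / meta-validation / meta-test pools."""
    rng = random.Random(seed)
    shuffled = list(classes)
    rng.shuffle(shuffled)
    assert n_train + n_val + n_test <= len(shuffled)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:n_train + n_val + n_test],
    )


def sample_episode(examples_by_class, class_pool, n_way, k_shot, n_eval, rng):
    """Draw one dataset D = (D_train, D_test) for a k-shot, N-class task."""
    episode_classes = rng.sample(class_pool, n_way)
    d_train, d_test = [], []
    for label, cls in enumerate(episode_classes):
        # Sample without replacement so D_train and D_test never share an example.
        picked = rng.sample(examples_by_class[cls], k_shot + n_eval)
        d_train += [(x, label) for x in picked[:k_shot]]
        d_test += [(x, label) for x in picked[k_shot:]]
    return d_train, d_test


# Toy data (hypothetical): 100 classes with 30 placeholder examples each.
examples_by_class = {c: [f"img_{c}_{j}" for j in range(30)] for c in range(100)}
meta_train, meta_val, meta_test = split_classes(range(100), 64, 16, 20)

rng = random.Random(1)
d_train, d_test = sample_episode(
    examples_by_class, meta_train, n_way=5, k_shot=1, n_eval=15, rng=rng
)
print(len(d_train), len(d_test))  # k * N training examples, N * n_eval test examples
```

Because the three class pools never overlap, an episode drawn from the meta-test pool only contains classes that the meta-learner has not seen during meta-training.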
For this formulation to correspond to the few-shot learning setting, each training set in datasets
$D \in \mathcal{D}$ will contain few labeled examples (we consider $k = 1$ or $k = 5$), that must
Figure 1: Computational graph for the forward pass of the meta-learner. The dashed line divides
examples from the training set $D_{\text{train}}$ and test set $D_{\text{test}}$. Each $(X_i, Y_i)$ is the $i$th batch from the
training set whereas $(X, Y)$ is all the elements from the test set. The dashed arrows indicate that we
do not back-propagate through that step when training the meta-learner. We refer to the learner as
$M$, where $M(X; \theta)$ is the output of learner $M$ using parameters $\theta$ for inputs $X$. We also use $\nabla_t$ as
a shorthand for $\nabla_{\theta_{t-1}} \mathcal{L}_t$.
to have training conditions match those of test time. During evaluation of the meta-learning, for
each dataset $D = (D_{\text{train}}, D_{\text{test}}) \in \mathcal{D}_{\text{meta-test}}$, a good meta-learner model will, given a series of
learner gradients and losses on the training set $D_{\text{train}}$, suggest a series of updates for the learner
model that trains it towards good performance on the test set $D_{\text{test}}$.
Figure 1: Example of meta-learning setup. The top represents the meta-training set $\mathcal{D}_{\text{meta-train}}$,
where inside each gray box is a separate dataset that consists of the training set $D_{\text{train}}$ (left side of
dashed line) and the test set $D_{\text{test}}$ (right side of dashed line). In this illustration, we are considering
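• How to parametrize learning algorithms?
‣ we take inspiration from the gradient descent algorithm:
$\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} \mathcal{L}_t$
‣ we parametrize this update similarly to LSTM state updates:
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
- state $c_t$ is model M’s parameter space
- state update $\tilde{c}_t$ is the negative gradient $-\nabla_{\theta_{t-1}} \mathcal{L}_t$
- $f_t$ and $i_t$ are LSTM gates:
$i_t = \sigma\left(W_I \cdot \left[\nabla_{\theta_{t-1}} \mathcal{L}_t, \mathcal{L}_t, \theta_{t-1}, i_{t-1}\right] + b_I\right)$
$f_t = \sigma\left(W_F \cdot \left[\nabla_{\theta_{t-1}} \mathcal{L}_t, \mathcal{L}_t, \theta_{t-1}, f_{t-1}\right] + b_F\right)$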
Consider a single dataset $D \in \mathcal{D}_{\text{meta-train}}$. Suppose we have a learner neural net model with
parameters $\theta$ that we want to train on $D_{\text{train}}$. The standard optimization algorithms used to train
neural networks are some variant of gradient descent, which uses updates of the form

$$\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} \mathcal{L}_t,$$

where $\theta_{t-1}$ are the parameters of the learner after $t-1$ updates, $\alpha_t$ is the learning rate at time $t$,
$\mathcal{L}_t$ is the loss optimized by the learner for its $t$th update, $\nabla_{\theta_{t-1}} \mathcal{L}_t$ is the gradient of that loss
with respect to parameters $\theta_{t-1}$, and $\theta_t$ is the updated parameters of the learner.
Under review as a conference paper at ICLR 2017
Our key observation that we leverage here is that this update resembles the update for the cell state
in an LSTM

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad (2)$$

if $f_t = 1$, $c_{t-1} = \theta_{t-1}$, $i_t = \alpha_t$, and $\tilde{c}_t = -\nabla_{\theta_{t-1}} \mathcal{L}_t$.
Thus, we propose training a meta-learner LSTM to learn an update rule for training a neural
network. We set the cell state of the LSTM to be the parameters of the learner, or $c_t = \theta_t$, and the
candidate cell state $\tilde{c}_t = -\nabla_{\theta_{t-1}} \mathcal{L}_t$, given how valuable information about the gradient is for
optimization. We define parametric forms for $i_t$ and $f_t$ so that the meta-learner can determine optimal
values through the course of the updates.
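The correspondence between the two updates can be checked numerically. The following is a minimal sketch on a toy quadratic loss; the function names and the loss are assumptions chosen for illustration, and the gates are plain scalars here rather than learned quantities.

```python
def gd_step(theta, grad, lr):
    """Plain gradient descent: theta_t = theta_{t-1} - lr * grad."""
    return [p - lr * g for p, g in zip(theta, grad)]


def lstm_cell_update(c_prev, c_tilde, f, i):
    """LSTM cell-state update: c_t = f * c_{t-1} + i * c_tilde (element-wise)."""
    return [f * c + i * ct for c, ct in zip(c_prev, c_tilde)]


def grad_quadratic(theta):
    """Gradient of the toy loss L(theta) = sum_j theta_j ** 2."""
    return [2.0 * p for p in theta]


theta = [1.0, -2.0, 0.5]
lr = 0.1
g = grad_quadratic(theta)

via_gd = gd_step(theta, g, lr)
# Same step written as a cell update: f = 1, i = lr, candidate = negative gradient.
via_cell = lstm_cell_update(theta, [-x for x in g], f=1.0, i=lr)

print(via_gd)
print(via_cell)
```

Both calls produce the same parameter vector, which is the sense in which gradient descent is a special case of the cell-state update with a fixed forget gate and a fixed input gate.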
Let us start with $i_t$, which corresponds to the learning rate for the updates. We let

$$i_t = \sigma\left(W_I \cdot \left[\nabla_{\theta_{t-1}} \mathcal{L}_t, \mathcal{L}_t, \theta_{t-1}, i_{t-1}\right] + b_I\right),$$

meaning that the learning rate is a function of the current parameter value $\theta_{t-1}$, the current gradient
$\nabla_{\theta_{t-1}} \mathcal{L}_t$, the current loss $\mathcal{L}_t$, and the previous learning rate $i_{t-1}$. With this information, the
meta-learner should be able to finely control the learning rate so as to train the learner quickly while
avoiding divergence.
As for $f_t$, it seems possible that the optimal choice isn’t the constant 1. Intuitively, what would
justify shrinking the parameters of the learner and forgetting part of its previous value would be
if the learner is currently in a bad local optimum and needs a large change to escape. This would
correspond to a situation where the loss is high but the gradient is close to zero. Thus, one proposal
for the forget gate is to have it be a function of that information, as well as the previous value of the
forget gate:

$$f_t = \sigma\left(W_F \cdot \left[\nabla_{\theta_{t-1}} \mathcal{L}_t, \mathcal{L}_t, \theta_{t-1}, f_{t-1}\right] + b_F\right).$$
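To make the gated update concrete, here is a minimal sketch of one meta-learner step applied coordinate-wise, with the same gate weights shared across all learner parameters. The names (`gate`, `meta_learner_step`), the sigmoid gates, the hand-picked weights, and the toy quadratic loss are assumptions for illustration; a real meta-learner would learn the weights and would typically preprocess its inputs.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate(weights, bias, grad, loss, theta, prev_gate):
    """sigma(W . [grad, loss, theta, prev_gate] + b) for a single coordinate."""
    w_g, w_l, w_t, w_p = weights
    return sigmoid(w_g * grad + w_l * loss + w_t * theta + w_p * prev_gate + bias)


def meta_learner_step(theta, grads, loss, i_prev, f_prev, params):
    """One update theta_t = f_t * theta_{t-1} - i_t * grad, coordinate by coordinate."""
    W_I, b_I, W_F, b_F = params
    new_theta, new_i, new_f = [], [], []
    for th, g, ip, fp in zip(theta, grads, i_prev, f_prev):
        i_t = gate(W_I, b_I, g, loss, th, ip)  # plays the role of the learning rate
        f_t = gate(W_F, b_F, g, loss, th, fp)  # how much of the old parameter to keep
        new_theta.append(f_t * th - i_t * g)
        new_i.append(i_t)
        new_f.append(f_t)
    return new_theta, new_i, new_f


# Hand-picked (not learned) gate parameters: zero weights, a large forget-gate bias
# so f_t is close to 1, and an input-gate bias chosen so that i_t = 0.1.
zero = (0.0, 0.0, 0.0, 0.0)
params = (zero, math.log(0.1 / 0.9), zero, 20.0)

theta = [1.0, -2.0, 0.5]
grads = [2.0 * p for p in theta]       # gradient of the toy loss sum_j theta_j ** 2
loss = sum(p * p for p in theta)

new_theta, i_t, f_t = meta_learner_step(
    theta, grads, loss, [0.0] * 3, [1.0] * 3, params
)
print(new_theta)
```

With these settings the step is numerically close to gradient descent with learning rate 0.1. Non-zero gate weights would let the step size and the amount of shrinkage depend on the gradient, the loss, and the parameter value.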
• We use our meta-learning LSTM to model parameter dynamics during training
‣ LSTM parameters are shared across M’s parameters (i.e. treated like a large minibatch)
‣ learns c0, which is like learning M’s initialization
• It is trained to produce parameters that have low loss on the corresponding test set
‣ possible thanks to backprop (though we ignore gradients through the inputs of the LSTM)
• Inputs to meta-learning LSTM are the loss, the parameter and its loss gradient
‣ we use the preprocessing proposed by Andrychowicz et al. (2016)
• Model M uses batch normalization
‣ we are careful to avoid “leakage” between meta-train / meta-validation / meta-test sets
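The input preprocessing mentioned above addresses the fact that gradients and losses can span many orders of magnitude. A minimal sketch, assuming the two-component log-magnitude/sign form described by Andrychowicz et al. (2016) with p = 10; the function name is illustrative only.

```python
import math


def preprocess(x, p=10.0):
    """Map a scalar input (a gradient coordinate or a loss) to two bounded features.

    Large magnitudes become (log|x| / p, sign(x)); magnitudes below exp(-p)
    become (-1, exp(p) * x), so values near zero do not produce a huge negative log.
    """
    if abs(x) >= math.exp(-p):
        return (math.log(abs(x)) / p, math.copysign(1.0, x))
    return (-1.0, math.exp(p) * x)


for value in (1.0, -1.0, 250.0, 1e-6, 0.0):
    print(value, preprocess(value))
```

Each scalar input therefore contributes two features to the meta-learner LSTM, both of moderate size regardless of the raw scale.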
• Early work on learning an update rule
‣ Learning a synaptic learning rule (1990)

Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier
‣ On the search for new learning rules for ANNs (1995)

Samy Bengio, Yoshua Bengio, and Jocelyn Cloutier
• Early work on recurrent networks modifying their weights
‣ Learning to control fast-weight memories: An alternative to dynamic recurrent networks (1992)

Jürgen Schmidhuber
‣ A neural network that embeds its own meta-levels (1993)

Jürgen Schmidhuber
[see related work section of Learning to learn by gradient descent by gradient descent (2016)]
• Training a recurrent neural network to optimize
‣ outputs the update, so can decide to do something other than gradient descent
• Learning to learn by gradient descent by gradient descent (2016)

Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas
• Learning to learn using gradient descent (2001)

Sepp Hochreiter, A. Steven Younger, and Peter R. Conwell
Figure 2: Computational graph used for computing the gradient of the optimizer.
2.1 Coordinatewise LSTM optimizer
One challenge in applying RNNs in our setting is that we want to be able to optimize at least tens of
thousands of parameters. Optimizing at this scale with a fully connected RNN is not feasible as it
• Training a “pattern matcher” to optimize

each episode’s test set performance
‣ no notion of learning an update

• Matching networks for one shot learning (2016)

Oriol Vinyals, Charles Blundell, Timothy P. Lillicrap, Koray Kavukcuoglu, and Daan Wierstra
Figure 1: Matching Networks architecture
train it by showing only a few examples per class, switching the task from minibatch to minibatch,
much like how it will be tested when presented with a few examples of a new task.
Besides our contributions in defining a model and training criterion amenable for one-shot learning,
we contribute by the definition of tasks that can be used to benchmark other approaches o
• Training a “prototype extractor” to optimize

each episode’s test set performance
‣ no notion of learning an update

• Prototypical Networks for Few-shot Learning (2017)

Jake Snell, Kevin Swersky and Richard Zemel
Figure 1: Prototypical networks in the few-shot and zero-shot scenarios. (a) Few-shot: prototypes
$c_k$ are computed as the mean of embedded support examples. Zero-shot: prototypes $c_k$ are produced
by embedding class meta-data.
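The few-shot prototype computation can be sketched in a few lines. This example assumes the support examples are already embedded (the embedding network is omitted) and classifies a query by nearest prototype under squared Euclidean distance; the function names and the toy two-dimensional embeddings are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def class_prototypes(support):
    """Prototype c_k = mean of the embedded support examples of class k."""
    by_class = {}
    for embedding, label in support:
        by_class.setdefault(label, []).append(embedding)
    return {label: mean_vector(vectors) for label, vectors in by_class.items()}


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query, prototypes):
    """Assign the query embedding to the class with the closest prototype."""
    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))


# Toy 2-way, 2-shot episode with made-up embeddings.
support = [
    ([0.0, 0.0], "cat"),
    ([2.0, 0.0], "cat"),
    ([10.0, 10.0], "dog"),
    ([12.0, 10.0], "dog"),
]
prototypes = class_prototypes(support)
prediction = classify([1.5, 0.5], prototypes)
print(prototypes, prediction)
```

Nothing in this procedure updates the learner's weights within an episode, which matches the "no notion of learning an update" remark on the slide.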
• Training an “initialization+fine-tuning” procedure

that’s based on a known update (e.g. Adam)
‣ much simpler than a meta-LSTM,

yet works quite well!
• Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (2017)

Chelsea Finn, Pieter Abbeel and Sergey Levine
the task loss.
ion of this work is a simple model-
orithm for meta-learning that trains
such that a small number of gradi-
to fast learning on a new task. We
hm on different model types, includ-
d convolutional networks, and in sev-
ncluding few-shot regression, image
forcement learning. Our evaluation
earning algorithm compares favor-
one-shot learning methods designed
sed classification, while using fewer
Figure 1. Diagram of our model-agnostic meta-learning algorithm (MAML), which optimizes for a
representation $\theta$ that can quickly adapt to new tasks.
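The idea of learning an initialization that adapts quickly can be illustrated with a deliberately tiny example. The sketch below assumes scalar parameters, hypothetical tasks with loss (theta - a)^2, one plain gradient descent step as the inner update, and an exact meta-gradient obtained by the chain rule. It is a toy illustration of the principle, not the setup used in the MAML paper.

```python
def grad(theta, a):
    """Gradient of the per-task loss (theta - a) ** 2."""
    return 2.0 * (theta - a)


def adapt(theta, a, inner_lr):
    """Fine-tuning: one gradient descent step on task a, starting from theta."""
    return theta - inner_lr * grad(theta, a)


def meta_gradient(theta, a, inner_lr):
    """d/d theta of the post-adaptation loss, differentiating through the inner step.

    For this quadratic loss, d adapt(theta) / d theta = 1 - 2 * inner_lr.
    """
    theta_adapted = adapt(theta, a, inner_lr)
    return grad(theta_adapted, a) * (1.0 - 2.0 * inner_lr)


def meta_train(task_targets, theta0=0.0, inner_lr=0.1, meta_lr=0.1, steps=200):
    """Gradient descent on the average post-adaptation loss over all tasks."""
    theta = theta0
    for _ in range(steps):
        g = sum(meta_gradient(theta, a, inner_lr) for a in task_targets)
        theta -= meta_lr * g / len(task_targets)
    return theta


task_targets = [-1.0, 0.0, 4.0]   # hypothetical tasks, one target value each
theta_init = meta_train(task_targets)
print(theta_init)
```

For these symmetric quadratic tasks the learned initialization settles at the mean of the task targets, the point from which a single fine-tuning step does best on average.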
• Training a neural Turing machine 

to learn
‣ no notion of gradient on learner
• One-shot learning with memory-augmented neural networks (2016)

Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy P. Lillicrap
One-shot learning with Memory-Augmented Neural Networks
a) Task setup (b) Network strategy
Omniglot images (or x-values for regression), xt, are presented with time-offset labels (or function values),
from simply mapping the class labels to the output. From episode to episode, the classes to be presented
• Training a convolutional network to learn
• Meta-Learning with Temporal Convolutions (2017)

Nikhil Mishra, Mostafa Rohaninejad, Xi Chen and Pieter Abbeel
• How does its performance compare to existing approaches that are specialized to a particular
task domain, or have elements of high-level strategy already built-in?
4.1 Few-Shot Image Classification
In the few-shot classification setting, we wish to classify data points into N classes, when we
only have a small number (K) of labeled examples per class. A meta-learner is readily applicable,
because it learns how to compare input points, rather than memorize a specific mapping from points to
classes. Figure 2 illustrates how few-shot image classification fits into the meta-learning formalization
presented in Section 2.1 and our introduction of the TCML in Section 2.2.
Figure 2: An episode of few-shot image classification using a TCML. Given an image $i_t$, the input
to the TCML is a feature vector $x_t$ (produced by an embedding function applied to $i_t$), and the label
$y_{t-1}$ of the previous image $i_{t-1}$. The embedding function is learned jointly with the TCML, which is
trained to classify each image $i_t$ based on the images $i_0, \ldots, i_{t-1}$ seen at previous timesteps within
the same episode. Qualitatively, in order to make the correct prediction at time $t = 3$, the TCML
• Mini-ImageNet
‣ random subset of 100 classes (64 training, 16 validation, 20 testing)
‣ random sets Dtrain are generated by randomly picking 5 classes from class subset
‣ model M is a small 4-layer CNN, meta-learner LSTM has 2 layers

Model                               1-shot            5-shot
Baseline-finetune                   28.86 ± 0.54%     49.79 ± 0.79%
Baseline-nearest-neighbor           41.08 ± 0.70%     51.04 ± 0.65%
Matching Network                    43.40 ± 0.78%     51.09 ± 0.71%
Matching Network FCE                43.56 ± 0.84%     55.31 ± 0.73%
Meta-Learner LSTM (OURS)            43.44 ± 0.77%     60.60 ± 0.71%
MAML (Finn et al.)                  48.70 ± 1.84%     63.10 ± 0.92%
Prototypical Nets (Snell et al.)    49.42 ± 0.78%     68.20 ± 0.66%
TCML (Mishra et al.)                56.48 ± 0.99%     61.22 ± 0.98%
• How to scale up to a variable number of classes / examples
‣ we need an “ImageNet transposed”
• How best to characterize / parametrize learning algorithms (i.e. meta-models)
‣ inspiration from other optimization algorithms? other learning algorithms?
• How to apply beyond supervised learning
‣ unsupervised learning, semi-supervised learning, active learning, domain adaptation?
• … meta-meta-learning ?

More Related Content

What's hot

Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningEng Teong Cheah
Pre trained language model
Pre trained language modelPre trained language model
Pre trained language modelJiWenKim
Machine Learning
Machine LearningMachine Learning
Machine LearningShrey Malik
Machine Learning Algorithms
Machine Learning AlgorithmsMachine Learning Algorithms
Machine Learning AlgorithmsDezyreAcademy
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Tutori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Tutori...Machine Learning Algorithms | Machine Learning Tutorial | Data Science Tutori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Tutori...Edureka!
Methods of Optimization in Machine Learning
Methods of Optimization in Machine LearningMethods of Optimization in Machine Learning
Methods of Optimization in Machine LearningKnoldus Inc.
On First-Order Meta-Learning Algorithms
On First-Order Meta-Learning AlgorithmsOn First-Order Meta-Learning Algorithms
On First-Order Meta-Learning AlgorithmsYoonho Lee
decision tree regression
decision tree regressiondecision tree regression
decision tree regressionAkhilesh Joshi
Supervised Machine Learning Techniques common algorithms and its application
Supervised Machine Learning Techniques common algorithms and its applicationSupervised Machine Learning Techniques common algorithms and its application
Supervised Machine Learning Techniques common algorithms and its applicationTara ram Goyal
Gradient descent method
Gradient descent methodGradient descent method
Gradient descent methodSanghyuk Chun
A Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptxA Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptxSaiPragnaKancheti
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRUananth
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...Simplilearn

What's hot (20)

Meta learning tutorial
Meta learning tutorialMeta learning tutorial
Meta learning tutorial
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
Fuzzy logic
Fuzzy logicFuzzy logic
Fuzzy logic
Pre trained language model
Pre trained language modelPre trained language model
Pre trained language model
Machine Learning
Machine LearningMachine Learning
Machine Learning
Machine Learning
Machine LearningMachine Learning
Machine Learning
Machine Learning Algorithms
Machine Learning AlgorithmsMachine Learning Algorithms
Machine Learning Algorithms
AI Algorithms
AI AlgorithmsAI Algorithms
AI Algorithms
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Tutori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Tutori...Machine Learning Algorithms | Machine Learning Tutorial | Data Science Tutori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Tutori...
Methods of Optimization in Machine Learning
Methods of Optimization in Machine LearningMethods of Optimization in Machine Learning
Methods of Optimization in Machine Learning
On First-Order Meta-Learning Algorithms
On First-Order Meta-Learning AlgorithmsOn First-Order Meta-Learning Algorithms
On First-Order Meta-Learning Algorithms
decision tree regression
decision tree regressiondecision tree regression
decision tree regression
Supervised Machine Learning Techniques common algorithms and its application
Supervised Machine Learning Techniques common algorithms and its applicationSupervised Machine Learning Techniques common algorithms and its application
Supervised Machine Learning Techniques common algorithms and its application
Gradient descent method
Gradient descent methodGradient descent method
Gradient descent method
A Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptxA Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptx
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRU
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...


An improved teaching learning
An improved teaching learningAn improved teaching learning
An improved teaching learningcsandit
Higgs Boson Machine Learning Challenge - Kaggle
Higgs Boson Machine Learning Challenge - KaggleHiggs Boson Machine Learning Challenge - Kaggle
Higgs Boson Machine Learning Challenge - KaggleSajith Edirisinghe
Higgs bosob machine learning challange
Higgs bosob machine learning challangeHiggs bosob machine learning challange
Higgs bosob machine learning challangeTharindu Ranasinghe
Introduction to Reinforcement Learning for Molecular Design
Introduction to Reinforcement Learning for Molecular Design Introduction to Reinforcement Learning for Molecular Design
Introduction to Reinforcement Learning for Molecular Design Dan Elton
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsDinusha Dilanka
Sample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdfSample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdfAaryanArora10
Why start using uplift models for more efficient marketing campaigns
Why start using uplift models for more efficient marketing campaignsWhy start using uplift models for more efficient marketing campaigns
Why start using uplift models for more efficient marketing campaignsData Con LA
Supervised Learning.pdf
Supervised Learning.pdfSupervised Learning.pdf
Supervised Learning.pdfgadissaassefa
The Validity of CNN to Time-Series Forecasting Problem
The Validity of CNN to Time-Series Forecasting ProblemThe Validity of CNN to Time-Series Forecasting Problem
The Validity of CNN to Time-Series Forecasting ProblemMasaharu Kinoshita
[update] Introductory Parts of the Book "Dive into Deep Learning"
[update] Introductory Parts of the Book "Dive into Deep Learning"[update] Introductory Parts of the Book "Dive into Deep Learning"
[update] Introductory Parts of the Book "Dive into Deep Learning"Young-Min kang
Chap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsChap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsYoung-Geun Choi
A Novel Methodology to Implement Optimization Algorithms in Machine Learning
A Novel Methodology to Implement Optimization Algorithms in Machine LearningA Novel Methodology to Implement Optimization Algorithms in Machine Learning
A Novel Methodology to Implement Optimization Algorithms in Machine LearningVenkata Karthik Gullapalli
Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdfBeyaNasr1
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIDeep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIJack Clark
Bsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structureBsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structureRai University
Learning to Learn by Gradient Descent by Gradient Descent
Learning to Learn by Gradient Descent by Gradient DescentLearning to Learn by Gradient Descent by Gradient Descent
Learning to Learn by Gradient Descent by Gradient DescentKaty Lee


An improved teaching learning
An improved teaching learningAn improved teaching learning
An improved teaching learning
Higgs Boson Machine Learning Challenge - Kaggle
Higgs Boson Machine Learning Challenge - KaggleHiggs Boson Machine Learning Challenge - Kaggle
Higgs Boson Machine Learning Challenge - Kaggle
Higgs bosob machine learning challange
Higgs bosob machine learning challangeHiggs bosob machine learning challange
Higgs bosob machine learning challange
Introduction to Reinforcement Learning for Molecular Design
Introduction to Reinforcement Learning for Molecular Design Introduction to Reinforcement Learning for Molecular Design
Introduction to Reinforcement Learning for Molecular Design
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning Algorithms
Sample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdfSample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdf
Why start using uplift models for more efficient marketing campaigns
Why start using uplift models for more efficient marketing campaignsWhy start using uplift models for more efficient marketing campaigns
Why start using uplift models for more efficient marketing campaigns
Supervised Learning.pdf
Supervised Learning.pdfSupervised Learning.pdf
Supervised Learning.pdf
Predicting the relevance of search results for e-commerce systems
Predicting the relevance of search results for e-commerce systemsPredicting the relevance of search results for e-commerce systems
Predicting the relevance of search results for e-commerce systems
The Validity of CNN to Time-Series Forecasting Problem
The Validity of CNN to Time-Series Forecasting ProblemThe Validity of CNN to Time-Series Forecasting Problem
The Validity of CNN to Time-Series Forecasting Problem
[update] Introductory Parts of the Book "Dive into Deep Learning"
[update] Introductory Parts of the Book "Dive into Deep Learning"[update] Introductory Parts of the Book "Dive into Deep Learning"
[update] Introductory Parts of the Book "Dive into Deep Learning"
Chap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsChap 8. Optimization for training deep models
Chap 8. Optimization for training deep models
A Novel Methodology to Implement Optimization Algorithms in Machine Learning
A Novel Methodology to Implement Optimization Algorithms in Machine LearningA Novel Methodology to Implement Optimization Algorithms in Machine Learning
A Novel Methodology to Implement Optimization Algorithms in Machine Learning
Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdf
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIDeep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Bsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structureBsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structure
Learning to Learn by Gradient Descent by Gradient Descent
Learning to Learn by Gradient Descent by Gradient DescentLearning to Learn by Gradient Descent by Gradient Descent
Learning to Learn by Gradient Descent by Gradient Descent

More from MLReview

Bayesian Non-parametric Models for Data Science using PyMC
 Bayesian Non-parametric Models for Data Science using PyMC Bayesian Non-parametric Models for Data Science using PyMC
Bayesian Non-parametric Models for Data Science using PyMCMLReview
Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...
  Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...  Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...
Machine Learning and Counterfactual Reasoning for "Personalized" Decision- ...MLReview
Tutorial on Deep Generative Models
 Tutorial on Deep Generative Models Tutorial on Deep Generative Models
Tutorial on Deep Generative ModelsMLReview
PixelGAN Autoencoders
  PixelGAN Autoencoders  PixelGAN Autoencoders
PixelGAN AutoencodersMLReview
Representing and comparing probabilities: Part 2
Representing and comparing probabilities: Part 2Representing and comparing probabilities: Part 2
Representing and comparing probabilities: Part 2MLReview
Representing and comparing probabilities
Representing and comparing probabilitiesRepresenting and comparing probabilities
Representing and comparing probabilitiesMLReview
Theoretical Neuroscience and Deep Learning Theory
Theoretical Neuroscience and Deep Learning TheoryTheoretical Neuroscience and Deep Learning Theory
Theoretical Neuroscience and Deep Learning TheoryMLReview
2017 Tutorial - Deep Learning for Dialogue Systems
2017 Tutorial - Deep Learning for Dialogue Systems2017 Tutorial - Deep Learning for Dialogue Systems
2017 Tutorial - Deep Learning for Dialogue SystemsMLReview
Deep Learning for Semantic Composition
Deep Learning for Semantic CompositionDeep Learning for Semantic Composition
Deep Learning for Semantic CompositionMLReview
Near human performance in question answering?
Near human performance in question answering?Near human performance in question answering?
Near human performance in question answering?MLReview
Tutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial NetworksTutorial on Theory and Application of Generative Adversarial Networks
Tutorial on Theory and Application of Generative Adversarial NetworksMLReview
Real-time Edge-aware Image Processing with the Bilateral Grid
Real-time Edge-aware Image Processing with the Bilateral GridReal-time Edge-aware Image Processing with the Bilateral Grid
Real-time Edge-aware Image Processing with the Bilateral GridMLReview
Yoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and WhitherYoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and WhitherMLReview

  • 1. OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
    Hugo Larochelle
    Work done at Twitter
    Google Brain
    Joint work with Sachin Ravi
  • 2. [Figure: example of meta-learning setup. The top represents the meta-training set; each gray box is a separate dataset with its own training set and test set.]
  • 3. A RESEARCH AGENDA
    • Deep learning successes have required a lot of labeled training data
      ‣ collecting and labeling such data requires significant human labor
      ‣ is that really how we’ll solve AI?
    • Alternative solution: exploit other sources of data that are imperfect but plentiful
      ‣ unlabeled data (unsupervised learning)
      ‣ multimodal data (multimodal learning)
      ‣ multidomain data (transfer learning, domain adaptation)
  • 5. A RESEARCH AGENDA
    • Let’s directly attack the problem of few-shot learning
      ‣ we want to design a learning algorithm A that outputs good parameters $\theta$ of a model M, when fed a small dataset $D_{\text{train}} = \{(X_t, Y_t)\}_{t=1}^{T}$
    • Idea: let’s learn that algorithm A, end-to-end
      ‣ this is known as meta-learning or learning to learn
  • 6. META-LEARNING
    • Learning algorithm A
      ‣ input: training set $D_{\text{train}} = \{(X_t, Y_t)\}$
      ‣ output: parameters $\theta$ of model M (the learner)
      ‣ objective: good performance on test set $D_{\text{test}} = (X, Y)$
    • Meta-learning algorithm
      ‣ input: meta-training set $\mathcal{D}_{\text{meta-train}} = \{(D_{\text{train}}^{(n)}, D_{\text{test}}^{(n)})\}_{n=1}^{N}$
      ‣ output: parameters $\Theta$ of algorithm A (the meta-learner)
      ‣ objective: good performance on meta-test set $\mathcal{D}_{\text{meta-test}} = (D_{\text{train}}, D_{\text{test}})$

    Paper excerpt shown on the slide ([…] marks text cropped in the source):

    […] captures fundamental knowledge shared among all the tasks.

    2 TASK DESCRIPTION

    We first begin by detailing the meta-learning formulation we use. In the typical machine learning setting, we are interested in a dataset $D$ and usually split $D$ so that we optimize parameters […] training set $D_{\text{train}}$ and evaluate its generalization on the test set $D_{\text{test}}$. In meta-learning, […] we are dealing with meta-sets $\mathcal{D}$ containing multiple regular datasets, where each $D \in \mathcal{D}$ […] of $D_{\text{train}}$ and $D_{\text{test}}$.

    We consider the k-shot, N-class classification task, where for each dataset $D$, the training set consists of k labelled examples for each of N classes, meaning that $D_{\text{train}}$ consists of $k \cdot N$ examples, and $D_{\text{test}}$ has a set number of examples for evaluation.

    In meta-learning, we thus have different meta-sets for meta-training, meta-validation, and meta-testing ($\mathcal{D}_{\text{meta-train}}$, $\mathcal{D}_{\text{meta-validation}}$, and $\mathcal{D}_{\text{meta-test}}$, respectively). On $\mathcal{D}_{\text{meta-train}}$, we are interested in training a learning procedure (the meta-learning model) that can take as input one of its training sets $D_{\text{train}}$ and produce a model that achieves high average classification performance on its corresponding test set $D_{\text{test}}$. Using $\mathcal{D}_{\text{meta-validation}}$ we can perform hyper-parameter selection of the meta-learning model and evaluate its generalization performance on $\mathcal{D}_{\text{meta-test}}$.

    For this formulation to correspond to the few-shot learning setting, each training set […] $D \in \mathcal{D}$ will contain few labeled examples (we consider k = 1 or k = 5), that must […]

    Figure 1: Computational graph for the forward pass of the meta-learner. The dashed line divides examples from the training set $D_{\text{train}}$ and test set $D_{\text{test}}$. Each $(X_i, Y_i)$ is the i-th batch from the training set whereas $(X, Y)$ is all the elements from the test set. The dashed arrows indicate that we do not back-propagate through that step when training the meta-learner. We refer to the learner as M, where $M(X; \theta)$ is the output of learner M using parameters $\theta$ for inputs $X$. We also use $\nabla_t$ as a shorthand for $\nabla_{\theta_{t-1}} \mathcal{L}_t$.

    […] to have training conditions match those of test time. During evaluation of the meta-learning, for each dataset $D = (D_{\text{train}}, D_{\text{test}}) \in \mathcal{D}_{\text{meta-test}}$, a good meta-learner model will, given a series of learner gradients and losses on the training set $D_{\text{train}}$, suggest a series of updates for the learner model that trains it towards good performance on the test set $D_{\text{test}}$.
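The k-shot, N-class setup described above can be sketched as a small sampling routine. This is only an illustration under stated assumptions: the function name `sample_episode`, the dictionary layout `examples_by_class`, and the default of 15 test examples per class are hypothetical and are not taken from the slides.

```python
import random


def sample_episode(examples_by_class, n_classes=5, k_shot=1,
                   n_test_per_class=15, rng=None):
    """Build one dataset D = (D_train, D_test) for a k-shot, N-class task.

    examples_by_class maps a class name to a list of examples (hypothetical layout).
    n_test_per_class is an assumed value; the excerpt only says D_test has
    "a set number of examples".
    """
    rng = rng or random.Random()
    # Pick N classes, then relabel them 0..N-1 inside this episode.
    classes = rng.sample(sorted(examples_by_class), n_classes)
    d_train, d_test = [], []
    for label, cls in enumerate(classes):
        picked = rng.sample(examples_by_class[cls], k_shot + n_test_per_class)
        d_train += [(x, label) for x in picked[:k_shot]]   # k examples per class
        d_test += [(x, label) for x in picked[k_shot:]]    # disjoint from D_train
    return d_train, d_test


# Toy meta-set: 20 classes with 30 placeholder examples each.
toy = {f"class_{c}": [f"img_{c}_{j}" for j in range(30)] for c in range(20)}
d_train, d_test = sample_episode(toy, n_classes=5, k_shot=5,
                                 n_test_per_class=15, rng=random.Random(0))
print(len(d_train), len(d_test))
```

With k = 5 and N = 5 the training set holds k · N = 25 examples, which matches the counting in the task description.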
  • 7. META-LEARNING
    Figure 1: Example of meta-learning setup. The top represents the meta-training set $\mathcal{D}_{\text{meta-train}}$, […] inside each gray box is a separate dataset that consists of the training set $D_{\text{train}}$ (left side of dashed line) and the test set $D_{\text{test}}$ (right side of dashed line). In this illustration, we are considering […]
  • 9. A META-LEARNING MODEL
    • How to parametrize learning algorithms?
      ‣ we take inspiration from the gradient descent algorithm:
        $$\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} \mathcal{L}_t$$
      ‣ we parametrize this update similarly to LSTM state updates:
        $$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
        - state $c_t$ holds model M’s parameters
        - state update $\tilde{c}_t$ is the negative gradient
        - $f_t$ and $i_t$ are LSTM gates:
          $$i_t = \sigma\left(W_I \cdot \left[\nabla_{\theta_{t-1}} \mathcal{L}_t, \mathcal{L}_t, \theta_{t-1}, i_{t-1}\right] + b_I\right)$$
          $$f_t = \sigma\left(W_F \cdot \left[\nabla_{\theta_{t-1}} \mathcal{L}_t, \mathcal{L}_t, \theta_{t-1}, f_{t-1}\right] + b_F\right)$$

    Paper excerpt shown on the slide (under review as a conference paper at ICLR 2017; […] marks text cropped in the source):

    MODEL DESCRIPTION

    Consider a single dataset $D \in \mathcal{D}_{\text{meta-train}}$. Suppose we have a learner neural net model with parameters $\theta$ that we want to train on $D_{\text{train}}$. The standard optimization algorithms used to train deep neural networks are some variant of gradient descent, which uses updates of the form

    $$\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} \mathcal{L}_t,$$

    where $\theta_{t-1}$ are the parameters of the learner after $t-1$ updates, $\alpha_t$ is the learning rate at time $t$, $\mathcal{L}_t$ is the loss optimized by the learner for its t-th update, $\nabla_{\theta_{t-1}} \mathcal{L}_t$ is the gradient of that loss with respect to parameters $\theta_{t-1}$, and $\theta_t$ is the updated parameters of the learner.

    Our key observation that we leverage here is that this update resembles the update for the cell state in an LSTM

    $$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad (2)$$

    if $f_t = 1$, $c_{t-1} = \theta_{t-1}$, $i_t = \alpha_t$, and $\tilde{c}_t = -\nabla_{\theta_{t-1}} \mathcal{L}_t$.

    Thus, we propose training a meta-learner LSTM to learn an update rule for training a neural network. We set the cell state of the LSTM to be the parameters of the learner, or $c_t = \theta_t$, and the candidate cell state $\tilde{c}_t = -\nabla_{\theta_{t-1}} \mathcal{L}_t$, given how valuable information about the gradient is for optimization. We define parametric forms for $i_t$ and $f_t$ so that the meta-learner can determine optimal values through the course of the updates.

    Let us start with $i_t$, which corresponds to the learning rate for the updates. We let

    $$i_t = \sigma\left(W_I \cdot \left[\nabla_{\theta_{t-1}} \mathcal{L}_t, \mathcal{L}_t, \theta_{t-1}, i_{t-1}\right] + b_I\right),$$

    meaning that the learning rate is a function of the current parameter value $\theta_{t-1}$, the current gradient $\nabla_{\theta_{t-1}} \mathcal{L}_t$, the current loss $\mathcal{L}_t$, and the previous learning rate $i_{t-1}$. With this information, the meta-learner should be able to finely control the learning rate so as to train the learner quickly while avoiding divergence.

    As for $f_t$, it seems possible that the optimal choice isn’t the constant 1. Intuitively, what would justify shrinking the parameters of the learner and forgetting part of its previous value would be if the learner is currently in a bad local optimum and needs a large change to escape. This would correspond to a situation where the loss is high but the gradient is close to zero. Thus, one proposal for the forget gate is to have it be a function of that information, as well as the previous value of the forget gate:

    $$f_t = \sigma\left(W_F \cdot \left[\nabla_{\theta_{t-1}} \mathcal{L}_t, \mathcal{L}_t, \theta_{t-1}, f_{t-1}\right] + b_F\right).$$
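To make the correspondence between the cell-state update and gradient descent concrete, here is a minimal NumPy sketch. It assumes sigmoid gates and one shared weight vector per gate applied independently to every learner parameter; the helper names (`gate`, `meta_update`) and the toy quadratic loss are hypothetical, and the gradient preprocessing and LSTM hidden layers used in the actual model are left out.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gate(W, b, grad, loss, theta, prev_gate):
    """One gate value per learner parameter; W (4,) and b are shared across coordinates."""
    loss_col = np.full_like(theta, loss)
    features = np.stack([grad, loss_col, theta, prev_gate], axis=1)  # shape (n_params, 4)
    return sigmoid(features @ W + b)


def meta_update(theta, grad, loss, i_prev, f_prev, W_I, b_I, W_F, b_F):
    """c_t = f_t * c_{t-1} + i_t * c~_t with c_{t-1} = theta and c~_t = -grad."""
    i_t = gate(W_I, b_I, grad, loss, theta, i_prev)
    f_t = gate(W_F, b_F, grad, loss, theta, f_prev)
    theta_new = f_t * theta + i_t * (-grad)
    return theta_new, i_t, f_t


# Toy learner: L(theta) = 0.5 * ||theta||^2, so the gradient equals theta.
theta = np.array([1.0, -2.0, 0.5])
loss = 0.5 * np.sum(theta ** 2)
grad = theta.copy()

# Special case from the text: f_t = 1 and i_t = alpha recover plain gradient descent.
alpha = 0.1
W_zero = np.zeros(4)                   # gates ignore their inputs in this special case
b_I = np.log(alpha / (1.0 - alpha))    # sigmoid(b_I) == alpha
b_F = 50.0                             # sigmoid(b_F) is 1.0 to machine precision
i_prev = np.full_like(theta, alpha)
f_prev = np.ones_like(theta)

theta_new, i_t, f_t = meta_update(theta, grad, loss, i_prev, f_prev,
                                  W_zero, b_I, W_zero, b_F)
print(theta_new, theta - alpha * grad)
```

With non-zero `W_I` and `W_F`, each coordinate gets its own learning rate and decay based on its gradient, the loss, and its current value, which is the extra freedom the meta-learner is trained to exploit.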
  • 10. META-LEARNING UPDATES
    [Figure: diagram of the learner (M) and the meta-learner (LSTM) processing $D_{\text{train}}^{(n)}$ and $D_{\text{test}}^{(n)}$]
  • 11. TO SUM UP
    • We use our meta-learning LSTM to model parameter dynamics during training
      ‣ LSTM parameters are shared across M’s parameters (i.e. treated like a large minibatch)
      ‣ learns $c_0$, which is like learning M’s initialization
    • It is trained to produce parameters that have low loss on the corresponding test set
      ‣ possible thanks to backprop (though we ignore gradients through the inputs of the LSTM)
    • Inputs to meta-learning LSTM are the loss, the parameter and its loss gradient
      ‣ we use the preprocessing proposed by Andrychowicz et al. (2016)
    • Model M uses batch normalization
      ‣ we are careful to avoid “leakage” between meta-train / meta-validation / meta-test sets
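The preprocessing mentioned above is not spelled out on the slides. The sketch below follows the rule as recalled from the appendix of Andrychowicz et al. (2016): a scalar x is mapped to (log|x| / p, sign(x)) when |x| ≥ e^(-p), and to (-1, e^p · x) otherwise, with p = 10. The function name `preprocess` is hypothetical, and whether the meta-learner LSTM uses exactly these settings is an assumption.

```python
import numpy as np


def preprocess(x, p=10.0):
    """Map each scalar (gradient or loss) to two features of comparable scale.

    Rule as recalled from Andrychowicz et al. (2016); p = 10 is the value
    suggested there and is treated here as an assumption.
    """
    x = np.asarray(x, dtype=float)
    large = np.abs(x) >= np.exp(-p)
    safe_abs = np.where(large, np.abs(x), 1.0)              # avoids log(0) warnings
    first = np.where(large, np.log(safe_abs) / p, -1.0)     # log-magnitude, or -1 if tiny
    second = np.where(large, np.sign(x), np.exp(p) * x)     # sign, or rescaled value if tiny
    return np.stack([first, second], axis=-1)


features = preprocess([1.0, 0.0, -np.exp(-5.0), 1e-6])
print(features)
```

The point of the two-feature encoding is that gradients and losses can differ by many orders of magnitude across coordinates and training steps, which is hard for a recurrent network to handle as raw inputs.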
  • 12. RELATED WORK: META-LEARNING
    • Early work on learning an update rule
      ‣ Learning a synaptic learning rule (1990)
        Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier
      ‣ On the search for new learning rules for ANNs (1995)
        Samy Bengio, Yoshua Bengio, and Jocelyn Cloutier
    • Early work on recurrent networks modifying their weights
      ‣ Learning to control fast-weight memories: An alternative to dynamic recurrent networks (1992)
        Jürgen Schmidhuber
      ‣ A neural network that embeds its own meta-levels (1993)
        Jürgen Schmidhuber
    [see related work section of Learning to learn by gradient descent by gradient descent (2016)]
  • 13. RELATED WORK: META-LEARNING
    • Training a recurrent neural network to optimize
      ‣ outputs the update, so it can decide to do something other than gradient descent
    • Learning to learn by gradient descent by gradient descent (2016)
      Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas
    • Learning to learn using gradient descent (2001)
      Sepp Hochreiter, A. Steven Younger, and Peter R. Conwell

    Paper excerpt shown on the slide ([…] marks text cropped in the source):

    Figure 2: Computational graph used for computing the gradient of the optimizer.

    2.1 Coordinatewise LSTM optimizer

    One challenge in applying RNNs in our setting is that we want to be able to optimize at least tens of thousands of parameters. Optimizing at this scale with a fully connected RNN is not feasible as it […]
  • 14. RELATED WORK: FEW-SHOT LEARNING
    • Training a “pattern matcher” to optimize each episode’s test set performance
      ‣ no notion of learning an update rule
    • Matching networks for one shot learning (2016)
      Oriol Vinyals, Charles Blundell, Timothy P. Lillicrap, Koray Kavukcuoglu, and Daan Wierstra

    Paper excerpt shown on the slide ([…] marks text cropped in the source):

    Figure 1: Matching Networks architecture

    […] train it by showing only a few examples per class, switching the task from minibatch to minibatch […] much like how it will be tested when presented with a few examples of a new task. Besides our contributions in defining a model and training criterion amenable for one-shot learning, we contribute by the definition of tasks that can be used to benchmark other approaches […]
  • 15. RELATED WORK: FEW-SHOT LEARNING
    • Training a “prototype extractor” to optimize each episode’s test set performance
      ‣ no notion of learning an update rule
    • Prototypical Networks for Few-shot Learning (2016)
      Jake Snell, Kevin Swersky and Richard Zemel

    Paper excerpt shown on the slide ([…] marks text cropped in the source):

    Figure 1: Prototypical networks in the few-shot and zero-shot […] $c_k$ are computed as the mean of embedded support examples […] prototypes $c_k$ are produced by embedding class meta-data […]
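The figure caption above says that each prototype is the mean of the embedded support examples of its class. A minimal sketch of that step follows, assuming the embeddings are already computed and that a query is assigned to the nearest prototype under squared Euclidean distance; the distance choice and the function names are assumptions for illustration, not something stated on the slide.

```python
import numpy as np


def prototypes(embedded_support, support_labels, n_classes):
    """Prototype c_k = mean of the embedded support examples with label k."""
    return np.stack([embedded_support[support_labels == k].mean(axis=0)
                     for k in range(n_classes)])


def classify(embedded_queries, protos):
    """Assign each query to its nearest prototype (squared Euclidean distance, assumed)."""
    diffs = embedded_queries[:, None, :] - protos[None, :, :]   # (n_query, n_classes, dim)
    distances = (diffs ** 2).sum(axis=-1)
    return distances.argmin(axis=1)


# Toy 2-class episode with 2-D "embeddings" (placeholder values).
support = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
labels = np.array([0, 0, 1, 1])
queries = np.array([[1.0, 1.0], [9.0, 9.0]])

protos = prototypes(support, labels, n_classes=2)
predictions = classify(queries, protos)
print(protos, predictions)
```

Nothing here updates the learner's parameters per episode, which is the sense in which the slide says there is no notion of learning an update rule.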
  • 16. RELATED WORK: FEW-SHOT LEARNING
    • Training an “initialization + fine-tuning” procedure that’s based on a known update (e.g. ADAM)
      ‣ much simpler than a meta-LSTM, yet works quite well!
    • Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (2017)
      Chelsea Finn, Pieter Abbeel and Sergey Levine

    Paper excerpt shown on the slide ([…] marks text cropped in the source):

    […] the task loss. […] contribution of this work is a simple model- […] algorithm for meta-learning that trains […] such that a small number of gradient […] to fast learning on a new task. We […] on different model types, including […] convolutional networks, and in several […] including few-shot regression, image […] reinforcement learning. Our evaluation […] learning algorithm compares favorably […] one-shot learning methods designed […] classification, while using fewer […]

    Figure 1. Diagram of our model-agnostic meta-learning algorithm (MAML), which optimizes for a representation $\theta$ that can quickly adapt to new tasks.
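The “initialization + fine-tuning” idea can be illustrated on a deliberately tiny problem. The sketch below assumes scalar tasks with loss L_a(θ) = 0.5 (θ − a)², a single fine-tuning step of plain gradient descent (standing in for the “known update”; the slide mentions ADAM as an example), and the same task data for adaptation and evaluation. The task values, step sizes, and function names are all hypothetical.

```python
def inner_step(theta, a, alpha):
    """One fine-tuning step of gradient descent on L_a(theta) = 0.5 * (theta - a)**2."""
    return theta - alpha * (theta - a)


def meta_gradient(theta, a, alpha):
    """Gradient of the post-adaptation loss L_a(inner_step(theta)) w.r.t. the initialization.

    Chain rule: (theta' - a) * d(theta')/d(theta) = (theta' - a) * (1 - alpha).
    """
    theta_adapted = inner_step(theta, a, alpha)
    return (theta_adapted - a) * (1.0 - alpha)


def train_initialization(tasks, alpha=0.1, beta=0.5, steps=200, theta0=5.0):
    """Gradient descent on the average post-adaptation loss over all tasks."""
    theta = theta0
    for _ in range(steps):
        g = sum(meta_gradient(theta, a, alpha) for a in tasks) / len(tasks)
        theta -= beta * g
    return theta


tasks = [-1.0, 0.0, 4.0]          # per-task optima (toy values)
theta_star = train_initialization(tasks)
print(theta_star)                 # settles at the mean of the task optima for this toy loss
```

In this toy case the learned initialization sits where one gradient step helps all tasks as much as possible on average; for real networks the same gradient is obtained by back-propagating through the fine-tuning step rather than in closed form.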
  • 17. RELATED WORK: FEW-SHOT LEARNING
    • Training a neural Turing machine to learn
      ‣ no notion of gradient on learner
    • One-shot learning with memory-augmented neural networks (2016)
      Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy P. Lillicrap

    Paper excerpt shown on the slide ([…] marks text cropped in the source):

    [Figure panels: (a) Task setup, (b) Network strategy] Omniglot images (or x-values for regression), $x_t$, are presented with time-offset labels (or function values), […] from simply mapping the class labels to the output. From episode to episode, the classes to be presented […]
  • 18. RELATED WORK: FEW-SHOT LEARNING
    • Training a convolutional network to learn
    • Meta-Learning with Temporal Convolutions (2017)
      Nikhil Mishra, Mostafa Rohaninejad, Xi Chen and Pieter Abbeel

    Paper excerpt shown on the slide ([…] marks text cropped in the source):

    • How does its performance compare to existing approaches that are specialized to a particular task domain, or have elements of high-level strategy already built-in?

    4.1 Few-Shot Image Classification

    In the few-shot classification setting, we wish to classify data points into N classes, when we only have a small number (K) of labeled examples per class. A meta-learner is readily applicable, because it learns how to compare input points, rather than memorize a specific mapping from points to classes. Figure 2 illustrates how few-shot image classification fits into the meta-learning formalization presented in Section 2.1 and our introduction of the TCML in Section 2.2.

    Figure 2: An episode of few-shot image classification using a TCML. Given an image $i_t$, the input to the TCML is a feature vector $x_t$ (produced by an embedding function $x_t = \phi(i_t)$), and the label $y_{t-1}$ of the previous image $i_{t-1}$. The embedding function $\phi$ is learned jointly with the TCML, which is trained to classify each image $i_t$ based on the images $i_0, \ldots, i_{t-1}$ seen at previous timesteps within the same episode. Qualitatively, in order to make the correct prediction at time t = 3, the TCML […]
  • 19. EXPERIMENT
    • Mini-ImageNet
      ‣ random subset of 100 classes (64 training, 16 validation, 20 testing)
      ‣ random sets $D_{\text{train}}$ are generated by randomly picking 5 classes from class subset
      ‣ model M is a small 4-layer CNN, meta-learner LSTM has 2 layers

    Model                        5-class, 1-shot    5-class, 5-shot
    Baseline-finetune            28.86 ± 0.54%      49.79 ± 0.79%
    Baseline-nearest-neighbor    41.08 ± 0.70%      51.04 ± 0.65%
    Matching Network             43.40 ± 0.78%      51.09 ± 0.71%
    Matching Network FCE         43.56 ± 0.84%      55.31 ± 0.73%
    Meta-Learner LSTM (OURS)     43.44 ± 0.77%      60.60 ± 0.71%
  • 20. EXPERIMENT
    • Mini-ImageNet
      ‣ random subset of 100 classes (64 training, 16 validation, 20 testing)
      ‣ random sets $D_{\text{train}}$ are generated by randomly picking 5 classes from class subset
      ‣ model M is a small 4-layer CNN, meta-learner LSTM has 2 layers

    Model                                      5-class, 1-shot    5-class, 5-shot
    Baseline-finetune                          28.86 ± 0.54%      49.79 ± 0.79%
    Baseline-nearest-neighbor                  41.08 ± 0.70%      51.04 ± 0.65%
    Matching Network                           43.40 ± 0.78%      51.09 ± 0.71%
    Matching Network FCE                       43.56 ± 0.84%      55.31 ± 0.73%
    Meta-Learner LSTM (OURS)                   43.44 ± 0.77%      60.60 ± 0.71%
    MAML (Finn et al.)                         48.70 ± 1.84%      63.10 ± 0.92%
    Prototypical Nets (Snell et al.)           49.42 ± 0.78%      68.20 ± 0.66%
    TCML (Mishra et al.) (updated)             56.48 ± 0.99%      61.22 ± 0.98%
  • 21. DISCUSSION
    • How to scale up to a variable number of classes / examples
      ‣ we need an “ImageNet transposed”
    • How best to characterize / parametrize learning algorithms (i.e. meta-models)
      ‣ inspiration from other optimization algorithms? other learning algorithms?
    • How to apply beyond supervised learning
      ‣ unsupervised learning, semi-supervised learning, active learning, domain adaptation?
    • … meta-meta-learning?