1. OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
Hugo Larochelle
Work done at Twitter
Google Brain
Joint work with Sachin Ravi
3. A RESEARCH AGENDA
• Deep learning successes have required a lot of labeled training data
‣ collecting and labeling such data requires significant human labor
‣ is that really how we’ll solve AI?
• Alternative solution: exploit other sources of data that are imperfect but plentiful
‣ unlabeled data (unsupervised learning)
‣ multimodal data (multimodal learning)
‣ multidomain data (transfer learning, domain adaptation)
5. A RESEARCH AGENDA
• Let’s attack directly the problem of few-shot learning
‣ we want to design a learning algorithm A that outputs good parameters 𝜽 of a model M, when fed a small dataset $D_{\text{train}} = \{(X_t, Y_t)\}_{t=1}^{T}$
• Idea: let’s learn that algorithm A, end-to-end
‣ this is known as meta-learning or learning to learn
6. META-LEARNING
• Learning algorithm A
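‣ input: training set $D_{\text{train}} = \{(X_t, Y_t)\}_{t=1}^{T}$
‣ output: parameters 𝜽 of model M (the learner)
‣ objective: good performance on test set $D_{\text{test}} = (X, Y)$
• Meta-learning algorithm
‣ input: meta-training set $\mathcal{D}_{\text{meta-train}} = \{(D_{\text{train}}^{(n)}, D_{\text{test}}^{(n)})\}_{n=1}^{N}$
‣ output: parameters 𝝝 of algorithm A (the meta-learner)
‣ objective: good performance on meta-test set $\mathcal{D}_{\text{meta-test}} = (D_{\text{train}}, D_{\text{test}})$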
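The two levels above can be sketched in code. This is a minimal illustration of the interfaces only, not the method from the talk: the `Episode` type, the `nearest_mean_algorithm` stand-in for A, and the toy data are hypothetical names introduced here. A maps a training set to a learner, and the meta-level objective averages that learner's test performance over episodes.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Example = Tuple[float, int]        # (input x, label y); 1-D inputs keep the sketch small
Learner = Callable[[float], int]   # model M with its parameters already chosen


@dataclass
class Episode:
    """One element of a meta-set: a (D_train, D_test) pair."""
    train: List[Example]
    test: List[Example]


def nearest_mean_algorithm(train: List[Example]) -> Learner:
    """Hypothetical stand-in for learning algorithm A: D_train -> learner.

    The "parameters" it outputs are the per-class means of the inputs.
    """
    sums, counts = {}, {}
    for x, y in train:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    means = {y: sums[y] / counts[y] for y in sums}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))


def meta_objective(algorithm: Callable[[List[Example]], Learner],
                   meta_set: List[Episode]) -> float:
    """Average test accuracy of the learners that `algorithm` produces."""
    accuracies = []
    for episode in meta_set:
        learner = algorithm(episode.train)                # run A on D_train
        correct = sum(learner(x) == y for x, y in episode.test)
        accuracies.append(correct / len(episode.test))    # score on D_test
    return sum(accuracies) / len(accuracies)


meta_test = [
    Episode(train=[(0.0, 0), (1.0, 1)], test=[(0.1, 0), (0.9, 1)]),
    Episode(train=[(5.0, 0), (-5.0, 1)], test=[(4.0, 0), (-4.0, 1)]),
]
print(meta_objective(nearest_mean_algorithm, meta_test))
```

A meta-learning algorithm would tune the parameters 𝝝 of `algorithm` to raise this quantity on the meta-training set; the stand-in used here has nothing to tune.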
captures fundamental knowledge shared among all the tasks.
2 TASK DESCRIPTION
We first begin by detailing the meta-learning formulation we use. In the typical machine learning setting, we are interested in a dataset $D$ and usually split $D$ so that we optimize parameters on a training set $D_{\text{train}}$ and evaluate its generalization on the test set $D_{\text{test}}$. In meta-learning, we are dealing with meta-sets $\mathcal{D}$ containing multiple regular datasets, where each $D \in \mathcal{D}$ consists of $D_{\text{train}}$ and $D_{\text{test}}$.
We consider the k-shot, N-class classification task, where for each dataset $D$, the training set consists of $k$ labelled examples for each of $N$ classes, meaning that $D_{\text{train}}$ consists of $k \cdot N$ examples, and $D_{\text{test}}$ has a set number of examples for evaluation.
In meta-learning, we thus have different meta-sets for meta-training, meta-validation, and meta-testing ($\mathcal{D}_{\text{meta-train}}$, $\mathcal{D}_{\text{meta-validation}}$, and $\mathcal{D}_{\text{meta-test}}$, respectively). On $\mathcal{D}_{\text{meta-train}}$, we are interested in training a learning procedure (the meta-learning model) that can take as input one of its training sets $D_{\text{train}}$ and produce a model that achieves high average classification performance on its corresponding test set $D_{\text{test}}$. Using $\mathcal{D}_{\text{meta-validation}}$ we can perform hyper-parameter selection of the meta-learning model and evaluate its generalization performance on $\mathcal{D}_{\text{meta-test}}$.
For this formulation to correspond to the few-shot learning setting, each training set in $D \in \mathcal{D}$ will contain few labeled examples (we consider $k = 1$ or $k = 5$), that must […]
Figure 1: Computational graph for the forward pass of the meta-learner. The dashed line divides examples from the training set $D_{\text{train}}$ and test set $D_{\text{test}}$. Each $(X_i, Y_i)$ is the $i$-th batch from the training set whereas $(X, Y)$ is all the elements from the test set. The dashed arrows indicate that we do not back-propagate through that step when training the meta-learner. We refer to the learner as $M$, where $M(X; \theta)$ is the output of learner $M$ using parameters $\theta$ for inputs $X$. We also use $\nabla_t$ as a shorthand for $\nabla_{\theta_{t-1}} \mathcal{L}_t$.
[…] to have training conditions match those of test time. During evaluation of the meta-learning, for each dataset $D = (D_{\text{train}}, D_{\text{test}}) \in \mathcal{D}_{\text{meta-test}}$, a good meta-learner model will, given a series of learner gradients and losses on the training set $D_{\text{train}}$, suggest a series of updates for the learner model that trains it towards good performance on the test set $D_{\text{test}}$.
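The k-shot, N-class episodes described in this excerpt can be drawn as follows. This is a minimal sketch under stated assumptions: `sample_episode`, the `n_test_per_class` parameter, and the placeholder example ids are hypothetical and only illustrate that each dataset pairs k · N training examples with a fixed number of disjoint test examples.

```python
import random
from typing import Dict, List, Tuple

Labeled = Tuple[str, int]   # (example id, episode-local label)


def sample_episode(examples_by_class: Dict[str, List[str]],
                   n_classes: int, k_shot: int, n_test_per_class: int,
                   rng: random.Random) -> Tuple[List[Labeled], List[Labeled]]:
    """Draw one (D_train, D_test) pair for a k-shot, N-class task."""
    classes = rng.sample(sorted(examples_by_class), n_classes)
    d_train: List[Labeled] = []
    d_test: List[Labeled] = []
    for label, cls in enumerate(classes):   # labels are local to the episode
        picked = rng.sample(examples_by_class[cls], k_shot + n_test_per_class)
        d_train += [(x, label) for x in picked[:k_shot]]
        d_test += [(x, label) for x in picked[k_shot:]]
    return d_train, d_test


# Toy pool: 8 classes with 20 placeholder example ids each (hypothetical data).
pool = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(8)}
d_train, d_test = sample_episode(pool, n_classes=5, k_shot=1,
                                 n_test_per_class=3, rng=random.Random(0))
print(len(d_train), len(d_test))   # k * N training examples, N * 3 test examples
```

Building meta-training, meta-validation, and meta-test episodes from disjoint class pools is what keeps the three meta-sets separate.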
7. META-LEARNING
Figure 1: Example of meta-learning setup. The top represents the meta-training set $\mathcal{D}_{\text{meta-train}}$, where inside each gray box is a separate dataset that consists of the training set $D_{\text{train}}$ (left side of dashed line) and the test set $D_{\text{test}}$ (right side of dashed line). In this illustration, we are considering […]
9. A META-LEARNING MODEL
• How to parametrize learning algorithms?
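‣ we take inspiration from the gradient descent algorithm: $\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} \mathcal{L}_t$
‣ we parametrize this update similarly to LSTM state updates: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
- state $c_t$ is model M’s parameter space
- state update $\tilde{c}_t$ is the negative gradient
- $f_t$ and $i_t$ are LSTM gates: $i_t = \sigma(W_I \cdot [\nabla_{\theta_{t-1}} \mathcal{L}_t, \mathcal{L}_t, \theta_{t-1}, i_{t-1}] + b_I)$ and $f_t = \sigma(W_F \cdot [\nabla_{\theta_{t-1}} \mathcal{L}_t, \mathcal{L}_t, \theta_{t-1}, f_{t-1}] + b_F)$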
MODEL DESCRIPTION
Consider a single dataset $D \in \mathcal{D}_{\text{meta-train}}$. Suppose we have a learner neural net model with parameters $\theta$ that we want to train on $D_{\text{train}}$. The standard optimization algorithms used to train neural networks are some variant of gradient descent, which uses updates of the form
$$\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} \mathcal{L}_t,$$
where $\theta_{t-1}$ are the parameters of the learner after $t-1$ updates, $\alpha_t$ is the learning rate at time $t$, $\mathcal{L}_t$ is the loss optimized by the learner for its $t$-th update, $\nabla_{\theta_{t-1}} \mathcal{L}_t$ is the gradient of that loss with respect to parameters $\theta_{t-1}$, and $\theta_t$ is the updated parameters of the learner.
Under review as a conference paper at ICLR 2017
Our key observation that we leverage here is that this update resembles the update for the cell state in an LSTM
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad (2)$$
if $f_t = 1$, $c_{t-1} = \theta_{t-1}$, $i_t = \alpha_t$, and $\tilde{c}_t = -\nabla_{\theta_{t-1}} \mathcal{L}_t$.
Thus, we propose training a meta-learner LSTM to learn an update rule for training a neural network. We set the cell state of the LSTM to be the parameters of the learner, or $c_t = \theta_t$, and the candidate cell state $\tilde{c}_t = -\nabla_{\theta_{t-1}} \mathcal{L}_t$, given how valuable information about the gradient is for optimization. We define parametric forms for $i_t$ and $f_t$ so that the meta-learner can determine optimal values through the course of the updates.
Let us start with $i_t$, which corresponds to the learning rate for the updates. We let
$$i_t = \sigma\left(W_I \cdot \left[\nabla_{\theta_{t-1}} \mathcal{L}_t, \mathcal{L}_t, \theta_{t-1}, i_{t-1}\right] + b_I\right),$$
meaning that the learning rate is a function of the current parameter value $\theta_{t-1}$, the current gradient $\nabla_{\theta_{t-1}} \mathcal{L}_t$, the current loss $\mathcal{L}_t$, and the previous learning rate $i_{t-1}$. With this information, the meta-learner should be able to finely control the learning rate so as to train the learner quickly while avoiding divergence.
As for $f_t$, it seems possible that the optimal choice isn’t the constant 1. Intuitively, what would justify shrinking the parameters of the learner and forgetting part of its previous value would be if the learner is currently in a bad local optimum and needs a large change to escape. This would correspond to a situation where the loss is high but the gradient is close to zero. Thus, one proposal for the forget gate is to have it be a function of that information, as well as the previous value of the forget gate:
$$f_t = \sigma\left(W_F \cdot \left[\nabla_{\theta_{t-1}} \mathcal{L}_t, \mathcal{L}_t, \theta_{t-1}, f_{t-1}\right] + b_F\right).$$
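The gated update described above can be sketched per coordinate in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the gates are squashed with a sigmoid as is conventional for LSTM gates, and the weights are set by hand rather than meta-learned. With zero weights, a large forget bias, and an input bias of logit(0.1), the step reduces to ordinary gradient descent with learning rate 0.1.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gate(w: Sequence[float], b: float, grad: float, loss: float,
         theta_prev: float, gate_prev: float) -> float:
    """sigmoid(w . [grad, loss, theta_prev, gate_prev] + b) for one coordinate."""
    z = w[0] * grad + w[1] * loss + w[2] * theta_prev + w[3] * gate_prev + b
    return sigmoid(z)


def meta_learner_step(theta_prev: List[float], grads: List[float], loss: float,
                      i_prev: List[float], f_prev: List[float],
                      w_i: Sequence[float], b_i: float,
                      w_f: Sequence[float], b_f: float
                      ) -> Tuple[List[float], List[float], List[float]]:
    """One update c_t = f_t * c_{t-1} + i_t * (-grad), applied per coordinate.

    The same gate weights are reused for every coordinate of the learner.
    """
    theta, i_t, f_t = [], [], []
    for th, g, ip, fp in zip(theta_prev, grads, i_prev, f_prev):
        i = gate(w_i, b_i, g, loss, th, ip)   # plays the role of a learning rate
        f = gate(w_f, b_f, g, loss, th, fp)   # how much of the old value to keep
        theta.append(f * th + i * (-g))
        i_t.append(i)
        f_t.append(f)
    return theta, i_t, f_t


# Toy learner loss 0.5 * sum(theta^2), whose gradient is theta itself.
theta_prev = [1.0, -2.0]
grads = list(theta_prev)
loss = 0.5 * sum(t * t for t in theta_prev)
zero = [0.0, 0.0, 0.0, 0.0]
theta, i_t, f_t = meta_learner_step(theta_prev, grads, loss,
                                    i_prev=[0.0, 0.0], f_prev=[1.0, 1.0],
                                    w_i=zero, b_i=math.log(0.1 / 0.9),
                                    w_f=zero, b_f=20.0)
print(theta)   # close to one gradient descent step with learning rate 0.1
```

Non-zero gate weights let the step size and the shrinkage react to the gradient, loss, and parameter value, which is the extra freedom the meta-learner has over a fixed learning rate.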
11. TO SUM UP
• We use our meta-learning LSTM to model parameter dynamics during training
‣ LSTM parameters are shared across M’s parameters (i.e. treated like a large minibatch)
‣ learns c0, which is like learning M’s initialization
• It is trained to produce parameters that have low loss on the corresponding test set
‣ possible thanks to backprop (though we ignore gradients through the inputs of the LSTM)
• Inputs to meta-learning LSTM are the loss, the parameter and its loss gradient
‣ we use the preprocessing proposed by Andrychowicz et al. (2016)
• Model M uses batch normalization
‣ we are careful to avoid “leakage” between meta-train / meta-validation / meta-test sets
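The input preprocessing mentioned above can be sketched as follows. The sketch assumes the two-branch log/sign form with threshold parameter p = 10 described by Andrychowicz et al. (2016); the function name `preprocess` is hypothetical, and the exact constants should be checked against that paper. The point is to present gradients and losses of very different magnitudes to the LSTM on a comparable scale.

```python
import math
from typing import Tuple


def preprocess(x: float, p: float = 10.0) -> Tuple[float, float]:
    """Map a scalar (a gradient coordinate or a loss) to two bounded features.

    Large magnitudes become (log|x| / p, sign(x)); values smaller than
    exp(-p) in magnitude become (-1, exp(p) * x) so the log never blows up.
    """
    if abs(x) >= math.exp(-p):
        return math.log(abs(x)) / p, math.copysign(1.0, x)
    return -1.0, math.exp(p) * x


for value in (100.0, -0.001, 0.0):
    print(value, preprocess(value))
```

Each scalar input to the meta-learner LSTM then becomes a pair of features, one carrying the magnitude and one carrying the sign.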
12. RELATED WORK: META-LEARNING
• Early work on learning an update rule
‣ Learning a synaptic learning rule (1990)
Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier
‣ On the search for new learning rules for ANNs (1995)
Samy Bengio, Yoshua Bengio, and Jocelyn Cloutier
• Early work on recurrent networks modifying their weights
‣ Learning to control fast-weight memories: An alternative to dynamic recurrent networks (1992)
Jürgen Schmidhuber
‣ A neural network that embeds its own meta-levels (1993)
Jürgen Schmidhuber
[see related work section of Learning to learn by gradient descent by gradient descent (2016)]
13. RELATED WORK: META-LEARNING
• Training a recurrent neural network to optimize
‣ outputs the update, so it can decide to do something other than gradient descent
• Learning to learn by gradient descent by gradient descent (2016)
Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas
• Learning to learn using gradient descent (2001)
Sepp Hochreiter, A. Steven Younger, and Peter R. Conwell
Figure 2: Computational graph used for computing the gradient of the optimizer.
2.1 Coordinatewise LSTM optimizer
One challenge in applying RNNs in our setting is that we want to be able to optimize at least tens of
thousands of parameters. Optimizing at this scale with a fully connected RNN is not feasible as it
14. RELATED WORK: FEW-SHOT LEARNING
• Training a “pattern matcher” to optimize each episode’s test set performance
‣ no notion of learning an update rule
• Matching networks for one shot learning (2016)
Oriol Vinyals, Charles Blundell, Timothy P. Lillicrap, Koray Kavukcuoglu, and Daan Wierstra
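A "pattern matcher" of this kind can be sketched as attention over the labelled examples of an episode. This is a simplified illustration, not the architecture from the cited paper: it assumes inputs are already embedded as vectors, uses a softmax over cosine similarities as the attention, and the names `cosine` and `match_predict` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def match_predict(query: Vector, support: List[Tuple[Vector, int]],
                  n_classes: int) -> List[float]:
    """Class probabilities as an attention-weighted vote of the support labels."""
    sims = [cosine(query, x) for x, _ in support]
    top = max(sims)
    exps = [math.exp(s - top) for s in sims]   # numerically stable softmax
    total = sum(exps)
    probs = [0.0] * n_classes
    for weight, (_, y) in zip(exps, support):
        probs[y] += weight / total
    return probs


# One labelled example per class (a 1-shot, 3-class episode with toy embeddings).
support = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([-1.0, 0.0], 2)]
probs = match_predict([0.9, 0.1], support, n_classes=3)
print(max(range(3), key=probs.__getitem__))   # index of the most probable class
```

No parameters of a learner are updated when a new episode arrives; only the embedding would be trained across episodes, which is the contrast with learning an update rule.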
Figure 1: Matching Networks architecture
[…] train it by showing only a few examples per class, switching the task from minibatch to minibatch, much like how it will be tested when presented with a few examples of a new task.
Besides our contributions in defining a model and training criterion amenable for one-shot learning, we contribute by the definition of tasks that can be used to benchmark other approaches on […]
15. RELATED WORK: FEW-SHOT LEARNING
• Training a “prototype extractor” to optimize each episode’s test set performance
‣ no notion of learning an update rule
• Prototypical Networks for Few-shot Learning (2016)
Jake Snell, Kevin Swersky and Richard Zemel
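The "prototype extractor" idea can be sketched as follows: each class prototype is the mean of that class's embedded support examples, and a query is assigned to the nearest prototype. This is a minimal sketch under assumptions: inputs are treated as already embedded, squared Euclidean distance is used as the comparison, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def class_prototypes(support: List[Tuple[Vector, int]]) -> Dict[int, List[float]]:
    """Mean embedded support example for each class."""
    grouped: Dict[int, List[Vector]] = {}
    for x, y in support:
        grouped.setdefault(y, []).append(x)
    return {y: [sum(col) / len(xs) for col in zip(*xs)]
            for y, xs in grouped.items()}


def nearest_prototype(query: Vector, protos: Dict[int, List[float]]) -> int:
    """Label of the prototype closest to the query."""
    def sq_dist(u: Vector, v: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(protos, key=lambda y: sq_dist(query, protos[y]))


# A 2-shot, 2-class episode with toy 2-D embeddings.
support = [([0.0, 0.0], 0), ([2.0, 0.0], 0), ([10.0, 10.0], 1), ([12.0, 10.0], 1)]
protos = class_prototypes(support)
print(protos)
print(nearest_prototype([2.0, 1.0], protos), nearest_prototype([9.0, 9.0], protos))
```

As with the pattern matcher, adapting to a new episode only requires recomputing the prototypes, not running an update rule on a learner.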
Figure 1: Prototypical networks in the few-shot and zero-shot […] $c_k$ are computed as the mean of embedded support examples […] prototypes $c_k$ are produced by embedding class meta-data […]
16. RELATED WORK: FEW-SHOT LEARNING
• Training an “initialization + fine-tuning” procedure that’s based on a known update (e.g. ADAM)
‣ much simpler than a meta-LSTM, yet works quite well!
• Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (2017)
Chelsea Finn, Pieter Abbeel and Sergey Levine
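The "initialization + fine-tuning" idea can be illustrated on a toy problem where every gradient is available in closed form. This is a sketch under assumptions, not the algorithm from the cited paper: each task is a scalar quadratic loss 0.5 * (theta - target)^2, fine-tuning is a single plain gradient step rather than ADAM, the same loss is used before and after adaptation, and all function names are hypothetical.

```python
from typing import List


def adapt(theta: float, target: float, inner_lr: float) -> float:
    """Fine-tuning: one gradient step on the task loss 0.5 * (theta - target)^2."""
    return theta - inner_lr * (theta - target)


def meta_gradient(theta: float, target: float, inner_lr: float) -> float:
    """Gradient, with respect to the initialization, of the loss after adaptation."""
    adapted = adapt(theta, target, inner_lr)
    # Chain rule through the fine-tuning step: d(adapted)/d(theta) = 1 - inner_lr.
    return (adapted - target) * (1.0 - inner_lr)


def meta_train(targets: List[float], inner_lr: float = 0.1,
               outer_lr: float = 0.5, steps: int = 200) -> float:
    """Learn an initialization that performs well after one fine-tuning step."""
    theta = 0.0
    for _ in range(steps):
        grad = sum(meta_gradient(theta, t, inner_lr) for t in targets) / len(targets)
        theta -= outer_lr * grad
    return theta


theta0 = meta_train([1.0, 2.0, 3.0])
print(theta0)                              # learned initialization
print(adapt(theta0, 3.0, inner_lr=0.1))    # after fine-tuning on one task
```

Only the starting point is learned here; the update rule itself stays fixed, which is what makes this family simpler than a meta-learner LSTM.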
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
[…] the task loss. […] contribution of this work is a simple model- […] algorithm for meta-learning that trains […] such that a small number of gradient […] to fast learning on a new task. We […] algorithm on different model types, including […] and convolutional networks, and in several […] including few-shot regression, image […] reinforcement learning. Our evaluation […] learning algorithm compares favorably […] one-shot learning methods designed […] supervised classification, while using fewer […]
Figure 1. Diagram of our model-agnostic meta-learning algorithm (MAML), which optimizes for a representation $\theta$ that can quickly adapt to new tasks.
17. RELATED WORK: FEW-SHOT LEARNING
• Training a neural Turing machine to learn
‣ no notion of gradient on learner
• One-shot learning with memory-augmented neural networks (2016)
Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy P. Lillicrap
One-shot learning with Memory-Augmented Neural Networks
(a) Task setup (b) Network strategy
[…] Omniglot images (or x-values for regression), $x_t$, are presented with time-offset labels (or function values), […] from simply mapping the class labels to the output. From episode to episode, the classes to be presented […]
18. RELATED WORK: FEW-SHOT LEARNING
• Training a convolutional network to learn
• Meta-Learning with Temporal Convolutions (2017)
Nikhil Mishra, Mostafa Rohaninejad, Xi Chen and Pieter Abbeel
• How does its performance compare to existing approaches that are specialized to a particular
task domain, or have elements of high-level strategy already built-in?
4.1 Few-Shot Image Classification
In the few-shot classification setting, we wish to classify data points into N classes, when we
only have a small number (K) of labeled examples per class. A meta-learner is readily applicable,
because it learns how to compare input points, rather than memorize a specific mapping from points to
classes. Figure 2 illustrates how few-shot image classification fits into the meta-learning formalization
presented in Section 2.1 and our introduction of the TCML in Section 2.2.
Figure 2: An episode of few-shot image classification using a TCML. Given an image $i_t$, the input to the TCML is a feature vector $x_t$ (produced by an embedding function $x_t = \phi(i_t)$), and the label $y_{t-1}$ of the previous image $i_{t-1}$. The embedding function is learned jointly with the TCML, which is trained to classify each image $i_t$ based on the images $i_0, \ldots, i_{t-1}$ seen at previous timesteps within the same episode. Qualitatively, in order to make the correct prediction at time $t = 3$, the TCML […]
20. EXPERIMENT
• Mini-ImageNet
‣ random subset of 100 classes (64 training, 16 validation, 20 testing)
‣ random sets Dtrain are generated by randomly picking 5 classes from class subset
‣ model M is a small 4-layer CNN, meta-learner LSTM has 2 layers
Model                              5-class, 1-shot    5-class, 5-shot
Baseline-finetune                  28.86 ± 0.54%      49.79 ± 0.79%
Baseline-nearest-neighbor          41.08 ± 0.70%      51.04 ± 0.65%
Matching Network                   43.40 ± 0.78%      51.09 ± 0.71%
Matching Network FCE               43.56 ± 0.84%      55.31 ± 0.73%
Meta-Learner LSTM (OURS)           43.44 ± 0.77%      60.60 ± 0.71%
MAML (Finn et al.)                 48.70 ± 1.84%      63.10 ± 0.92%
Prototypical Nets (Snell et al.)   49.42 ± 0.78%      68.20 ± 0.66%
TCML (Mishra et al.)               56.48 ± 0.99%      61.22 ± 0.98%
(updated)
21. DISCUSSION
• How to scale up to a variable number of classes / examples
‣ we need an “ImageNet transposed”
• How best to characterize / parametrize learning algorithms (i.e. meta-models)
‣ inspiration from other optimization algorithms? other learning algorithms?
• How to apply beyond supervised learning
‣ unsupervised learning, semi-supervised learning, active learning, domain adaptation?
• … meta-meta-learning?