25 Oct 2021
Meta-learning, or learning how to learn, is our innate ability to learn new, ever more complex tasks very efficiently by building on prior experience. It is a very exciting direction for machine learning (and AI in general). In this tutorial, I introduce the main concepts and state of the art.

Joaquin Vanschoren


- 1. Meta-Learning: A Tutorial. Joaquin Vanschoren (TU Eindhoven). Cover art: MC Escher, Ustwo Games
- 2-5. Intro: humans can easily learn from a single example, thanks to years of learning (and eons of evolution). Example: shown Canna Indica ‘Picasso’ once (train), we can recognize it later (test).
- 6-7. Can a computer learn from a single example? E.g. ResNet50 (23 million trainable parameters) trained on ImageNet, given one image of Canna Indica ‘Picasso’: that won’t work :) Humans also don’t start from scratch.
- 8-11. Transfer learning? Pretrain e.g. ResNet50 (23 million trainable parameters) on a source task, ImageNet (14 million images), then finetune on the target task (Canna Indica ‘Picasso’). But a single source task (e.g. ImageNet) may not generalize well to the test task.
- 12-13. Meta-learning: learn over a series (or distribution) of many different tasks/episodes. Inductive bias: learn assumptions that you can transfer to new tasks; prepare yourself to learn new things faster (task 1, task 2, … task 1000 → new task: Canna, Trumpet vine, …). Useful in many real-life situations: rare events, test-time constraints, data collection costs, privacy issues, …
- 14. Inspired by human learning: we don’t transfer from a single source task, we learn across many, many tasks (year 1, year 2, year 3, year 4). We have a ‘drive’ to explore new, challenging, but doable, fun tasks.
- 15. Human-like learning: humans learn across tasks, with less trial-and-error, less data, and less compute; new tasks (of our own choosing) should be related to experience (doable, fun, interesting?). Key aspects of fast learning: compositionality, causality, learning to learn. Lake et al. 2017; Kemp et al. 2007
- 16-20. Inductive bias (in language), Lake et al. 2017. Training: dax, zup, lug, wif, lug blicket wif, wif blicket dax (each mapped to colors in the figure). Test: dax blicket zup? lug kiki wif? wif kiki dax? wif blicket dax kiki lug? Which assumptions do we make? Common mistakes: one-to-one bias (assume that every word is one color); concatenation bias (assume that order is always left-to-right).
- 21-32. What if there is no training data? We can still solve problems by making assumptions (Lake et al. 2017; Lake et al. 2019). Item pool and test queries: zup? zup zup? blicket wif zup? dax zup? zup tufa? zup wif zup? zup wif blicket? Commonly used assumptions: one-to-one bias (every word is one color, and not a function or something else); concatenation bias (order is always left-to-right); mutual exclusivity (if an object has a name, it doesn’t need another name). These assumptions (inductive biases) are necessary for learning quickly: humans assume that words have consistent meanings and follow input/output constraints.
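The three biases above can be made concrete in a few lines. A minimal sketch (not from the tutorial's materials): a toy interpreter that maps words to colors under the one-to-one, consistency, and mutual-exclusivity assumptions; the words and colors are hypothetical stand-ins for the Lake et al. stimuli.

```python
# Toy learner applying the slide's inductive biases when interpreting words.
def assign_meanings(known, words, colors):
    """Map each word to a color under one-to-one + mutual-exclusivity biases.

    known:  dict of words whose meaning was already learned
    words:  words to interpret (novel ones included)
    colors: candidate colors available in the item pool
    """
    mapping = dict(known)
    used = set(mapping.values())
    for w in words:
        if w in mapping:
            continue  # consistent meanings: reuse the learned mapping
        # mutual exclusivity: a novel word gets a color that has no name yet
        novel = [c for c in colors if c not in used]
        if novel:
            mapping[w] = novel[0]
            used.add(novel[0])
    return mapping

known = {"dax": "red"}
print(assign_meanings(known, ["dax", "zup"], ["red", "blue"]))
# {'dax': 'red', 'zup': 'blue'}
```

With "dax" already meaning red, the novel word "zup" is assigned the unnamed color blue, exactly the mutual-exclusivity inference the slide describes.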
- 33. Meta-learning inductive biases: capture useful assumptions from the data that often cannot be easily expressed by hand, by training models (with state) across many word-color tasks. Lake et al. 2017; Lake et al. 2019
- 34. Meta-learning goal: learn minimal inductive biases from prior tasks instead of constructing manual ones; they should still generalize well (otherwise you meta-overfit). Inductive bias: any assumptions added to the training data to learn more effectively. E.g.: instead of general model architectures, learn better architectures (and hyperparameters); instead of starting from random weights, learn good initial weights; instead of a standard loss/reward function, learn a better loss/reward function (prior).
- 35-36. What can we learn to learn? 3 pillars (Clune 2019): 1. architectures / pipelines (hyper-parameters) [part 1]; 2. learning algorithms (model parameters) [part 2]; 3. learning environments (task generation) [part 3]. Experience from tasks 1–3 yields an inductive bias for new tasks (of our own choosing).
- 37-38. Terminology. Task: a distribution of samples q(x), outputs y, and a loss ℒ(x, y); for reinforcement learning: q(x1) plus transitions q(xt+1 | xt, yt), with ℒ = (negative) reward. Model: fɸ(x) = y’. Learner: model parameters ɸ and hyperparameters λ; training finds ɸ* = argminɸ ℒ(fɸ,λ(x), y), equivalently ɸ* = argmaxɸ log p(ɸ | T). Meta-learning algorithm g with meta-parameters 𝛳 (which encode some inductive bias).
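The terminology can be grounded in a toy base-learner. A minimal sketch, assuming a hypothetical 1-D linear model fɸ(x) = ɸ·x with squared loss and the learning rate as hyperparameter λ (none of this is from the tutorial's own code):

```python
import numpy as np

def sample_task(rng, slope, n=50):
    """A task: samples x from q(x) and deterministic outputs y."""
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, slope * x

def learn(x, y, lr=0.1, steps=100, phi=0.0):
    """phi* = argmin_phi L(f_phi(x), y), here by plain gradient descent."""
    for _ in range(steps):
        grad = np.mean(2.0 * (phi * x - y) * x)  # dL/dphi for squared loss
        phi -= lr * grad                         # SGD step; lr plays lambda's role
    return phi

rng = np.random.default_rng(0)
x, y = sample_task(rng, slope=3.0)
phi = learn(x, y)  # converges close to the true slope 3.0
```

Meta-learning then operates one level up: instead of only fitting ɸ per task, it tunes quantities like the initialization of `phi` or `lr` across many such tasks.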
- 39. Learning hyperparameters (Hutter et al. 2019). Closely related to Automated Machine Learning (AutoML), but here we meta-learn how to design architectures/pipelines and tune hyperparameters λ across tasks 1..N; human data scientists also learn from experience. Adding bias (priors, meta-knowledge, human priors) turns AutoML into self-learning AutoML. The search space can be huge!
- 40. Meta-learning for AutoML: how? (Vanschoren 2018; here hyperparameters = architecture + hyperparameters.) 1. Learning hyperparameter priors (from a complex to a simple hyperparameter space); 2. Warm starting: what works on similar tasks? Start with good candidates instead of starting randomly; 3. Meta-models: learn how to build models/components from metadata (tasks, λ, scores).
- 41. Observation: current AutoML strongly depends on learned priors (moving from a complex to a simple hyperparameter space).
- 42-44. Manual architecture priors: most successful pipelines have a similar structure; can we meta-learn a prior over successful structures? (+ smaller search space; - you can’t learn entirely new architectures.) Examples with ensembling/stacking: auto-sklearn (Feurer et al. 2015), Auto-WEKA (Thornton et al. 2013), hyperopt-sklearn (Komer et al. 2014), AutoGluon-Tabular (Erickson et al. 2020). Figure source: Feurer et al. 2015
- 45. Manual architecture priors: successful deep networks often have repeated motifs (cells), e.g. Inception v4. Figure source: Szegedy et al. 2016
- 46-47. Compositionality: learn hierarchical building blocks to simplify the task, e.g. Google NASNet (Zoph et al. 2018): learn parameterized building blocks (cells) and stack cells together in a macro-architecture. + smaller search space; + cells can be learned on a small dataset and transferred to a larger dataset; - strong domain priors, doesn’t generalize well. Can we meta-learn hierarchies/components that generalize better? (Cell search space prior.) Figure source: Elsken et al., 2019
- 48. If you constrain the search space enough, you can get SOTA results with random search! The cell search space is a strong prior. Li & Talwalkar 2019; Yu et al. 2019; Real et al. 2019. Figure source: Li & Talwalkar, 2019
- 49. Manual priors: weight sharing. Weight-agnostic neural networks (Gaier & Ha, 2019): ALL weights are shared; only evolve the architecture? Minimal description length; Baldwin effect? Figure source: Gaier & Ha, 2019
- 50. Learning hyperparameter priors (from a complex to a simple hyperparameter space; the learner observes λ and scores).
- 51. Learn hyperparameter importance with functional ANOVA 1: select hyperparameters that cause variance in the evaluations (example: ResNets for image classification); useful to speed up black-box optimization techniques. 1 van Rijn & Hutter 2018. Figure source: van Rijn & Hutter, 2018
- 52. Learn defaults + hyperparameter importance. Tunability 1,2,3: learn good defaults, and measure importance as the improvement achievable via tuning (tuning risk). 1 Probst et al. 2018, 2 Weerts et al. 2018, 3 van Rijn et al. 2018
- 53. Bayesian optimization (interlude), Mockus 1974. Start with a few (random) hyperparameter configurations; fit a surrogate model to predict the performance of other configurations, using probabilistic regression: mean μ and standard deviation σ (blue band). Use an acquisition function to trade off exploration and exploitation, e.g. Expected Improvement (EI), and sample the best configuration under that function.
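The Expected Improvement acquisition mentioned above has a closed form given the surrogate's predictive mean and standard deviation. A self-contained sketch (assuming maximization; the standard textbook formula, not code from the tutorial):

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best):
    """EI of a candidate with surrogate prediction (mu, sigma) vs. incumbent."""
    if sigma == 0.0:
        return max(mu - best, 0.0)  # no uncertainty: improvement is deterministic
    z = (mu - best) / sigma
    # exploitation term + exploration term
    return (mu - best) * norm_cdf(z) + sigma * norm_pdf(z)
```

Note that a candidate whose mean merely matches the incumbent (`mu == best`) still gets EI = σ·pdf(0) ≈ 0.4σ, which is what drives the search to explore uncertain regions.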
- 54. Bayesian optimization, continued. Repeat until some stopping criterion: a fixed budget, convergence, or an EI threshold. Theoretical guarantees; also works for non-convex, noisy data; used in AlphaGo. Srinivas et al. 2010, Freitas et al. 2012, Kawaguchi et al. 2016. Figure source: Shahriari 2016
- 55. Learn basis expansions for hyperparameters (Perrone et al. 2018). Hyperparameters can interact in very non-linear ways; use a neural net to learn a suitable basis expansion φz(λ) for all tasks, trained on lots of data (e.g. from OpenML). On top of φz(λ), Bayesian linear models can replace Gaussian-process surrogates; this transfers information about the configuration space.
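The "Bayesian linear model on a learned basis" idea can be sketched with a closed-form posterior. To stay self-contained, the learned neural basis φz(λ) is replaced here by a hypothetical fixed polynomial basis; only the structure (fit a Bayesian linear model on features of λ, get predictive mean and variance for acquisition) mirrors the slide.

```python
import numpy as np

def basis(lam):
    """Stand-in for a learned phi_z(lambda); here a simple polynomial basis."""
    return np.array([1.0, lam, lam ** 2])

def posterior(lams, scores, noise=0.1, prior_var=10.0):
    """Closed-form Bayesian linear regression posterior over basis weights."""
    Phi = np.stack([basis(l) for l in lams])
    A = Phi.T @ Phi / noise + np.eye(Phi.shape[1]) / prior_var
    cov = np.linalg.inv(A)
    mean = cov @ Phi.T @ np.array(scores) / noise
    return mean, cov

def predict(lam, mean, cov, noise=0.1):
    """Predictive mean and variance for a new configuration lambda."""
    p = basis(lam)
    return p @ mean, p @ cov @ p + noise

# Toy observations: scores behave like lambda**2
lams, scores = [0.0, 0.5, 1.0], [0.0, 0.25, 1.0]
mean, cov = posterior(lams, scores)
mu, var = predict(0.75, mean, cov)  # mu near 0.5625, var includes uncertainty
```

The predictive variance is exactly what an acquisition function such as EI consumes, so swapping the Gaussian process for this model keeps the rest of the Bayesian optimization loop unchanged.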
- 56. Surrogate model transfer. If task j is similar to the new task, its surrogate model Sj will likely transfer well. Sum up the predictions of all Sj, weighted by task similarity (as in active testing) 1, or build a combined Gaussian process weighted by current performance on the new task 2: S = Σ wj Sj. 1 Wistuba et al. 2018, 2 Feurer et al. 2018
- 57. Surrogate model transfer (Manolache & Vanschoren 2019). Store a surrogate model Sij for every pair of prior task i and algorithm j: simpler surrogates, better transfer. Learn a weighted ensemble -> significant speed-up in optimization on the new task.
- 58. Warm starting: what works on similar tasks? Start with good candidates instead of starting randomly (metadata: tasks, λ, scores).
- 59. How to measure task similarity? Hand-designed (statistical) meta-features that describe (tabular) datasets 1; Task2Vec: task embeddings for image data 2; optimal transport: similarity based on comparing probability distributions 3; metadata embeddings based on textual dataset descriptions 4; Dataset2Vec: compares batches of datasets 5; distribution-based invariant deep networks 6. 1 Vanschoren 2018, 2 Achille et al. 2019, 3 Alvarez-Melis et al. 2020, 4 Drori et al. 2019, 5 Jomaa et al. 2020, 6 de Bie et al. 2020. Figure source: Alvarez-Melis et al. 2020
- 60. Warm-starting with kNN (Feurer et al. 2015). Find the k most similar tasks and warm-start the search with their best configurations λ1..k; auto-sklearn does this for Bayesian optimization (SMAC), seeding it with the best λi on similar tasks. Meta-learning yields better models, faster; winner of the AutoML Challenges. Figure source: Feurer et al., 2015
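The kNN warm-start recipe is short enough to sketch end-to-end. A toy illustration, assuming hypothetical hand-designed meta-features and a made-up store of prior results (this is the general idea, not auto-sklearn's actual code or meta-feature set):

```python
import numpy as np

def meta_features(X, y):
    """Toy meta-features: #instances, #features, #classes."""
    return np.array([X.shape[0], X.shape[1], len(set(y))], dtype=float)

def warm_start(new_mf, prior, k=2):
    """Return the best configurations of the k most similar prior tasks.

    prior: list of (meta_feature_vector, best_configuration) pairs.
    log1p compresses scale differences between meta-features.
    """
    dists = [np.linalg.norm(np.log1p(new_mf) - np.log1p(mf)) for mf, _ in prior]
    order = np.argsort(dists)[:k]
    return [prior[i][1] for i in order]

prior = [
    (np.array([1000.0, 20.0, 2.0]), {"C": 1.0}),       # similar prior task
    (np.array([950.0, 18.0, 2.0]), {"C": 0.5}),        # similar prior task
    (np.array([10.0, 5000.0, 10.0]), {"max_depth": 3}) # very different task
]
X, y = np.zeros((980, 19)), [0] * 490 + [1] * 490
print(warm_start(meta_features(X, y), prior, k=2))
# [{'C': 1.0}, {'C': 0.5}]
```

The returned configurations become the initial design of the Bayesian optimizer, which then refines them on the new task.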
- 61. Probabilistic matrix factorization (Fusi et al. 2017). Collaborative filtering: configurations λi are ‘rated’ by tasks tj (a performance matrix Pi,j). Learn a latent representation for tasks T and configurations λ; use meta-features to warm-start on a new task; returns probabilistic predictions p(P | λ) for Bayesian optimization.
- 62. DARTS: differentiable NAS (Liu et al. 2018). Fixed (one-shot) structure; learn which operators to use (e.g. convolution, max pooling, zero). Give all operators a weight αi; optimize the αi and the model weights ωj with bilevel optimization, approximating ωj*(αi) by adapting ωj after every training step (interleaved optimization of αi and ωj with SGD); the final operator per edge is the argmax over αi. Figure source: Liu et al., 2018
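DARTS's continuous relaxation boils down to a softmax-weighted mixture of candidate operations per edge. A minimal sketch of just that mixture (the operations are toy stand-ins; real DARTS optimizes α and ω with interleaved gradient steps, omitted here):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

# Candidate operations on an edge; 'conv' is faked as a scaling so the
# example stays self-contained.
OPS = {
    "conv": lambda x: 2.0 * x,
    "max_pool": lambda x: np.maximum(x, 0.0),
    "zero": lambda x: np.zeros_like(x),
}

def mixed_op(x, alpha):
    """DARTS-style mixed operation: softmax(alpha)-weighted sum of all ops."""
    w = softmax(alpha)
    return sum(wi * op(x) for wi, op in zip(w, OPS.values()))

x = np.array([-1.0, 1.0])
alpha = np.zeros(3)          # uniform mixture at initialization
out = mixed_op(x, alpha)     # the average of the three ops: (-2/3, 1.0)
# After search, the discrete architecture keeps argmax(alpha) per edge.
```

Because `mixed_op` is differentiable in both `alpha` and any operation parameters, both can be trained with SGD, which is exactly what makes the bilevel scheme on the slide tractable.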
- 63. Warm-started DARTS (Grobelnik and Vanschoren, 2021): warm-start DARTS with architectures that worked well on similar problems; slightly better performance, but much faster (5x).
- 64. Meta-models: learn how to build models/components (metadata: tasks, λ, scores).
- 65. Algorithm selection models: learn a direct mapping between meta-features mj and performance Pi,j. Zero-shot meta-models predict the best λ given meta-features 1; ranking models return a ranking λ1..k 2; others predict which algorithms/configurations to consider or tune 3, or predict performance/runtime for a given configuration and task 4. Can be integrated in larger AutoML systems: warm-starting, guiding the search, … 1 Brazdil et al. 2009, Lemke et al. 2015; 2 Sun and Pfahringer 2013, Pinto et al. 2017; 3 Sanders and Giraud-Carrier 2017; 4 Yang et al. 2018
- 66. Learning model components. Learn nonlinearities: RL-based search over a space of likely useful activation functions 1; e.g. Swish can outperform ReLU: Swish(x) = x / (1 + e^(-βx)). Learn optimizers: RL-based search over a space of likely useful update rules 2; e.g. PowerSign can outperform Adam and RMSprop: PowerSign(g, m) = e^(sign(g)·sign(m)) · g, with g the gradient and m its moving average. Also: learn acquisition functions for Bayesian optimization 3. 1 Ramachandran et al. 2017, 2 Bello et al. 2017, 3 Volpp et al. 2020. Figure sources: Ramachandran et al., 2017 (top), Bello et al. 2017 (bottom)
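Both discovered components are one-liners, which makes them easy to verify. A sketch implementing the two formulas on the slide (scalar versions, stdlib only):

```python
import math

def swish(x, beta=1.0):
    """Swish activation (Ramachandran et al. 2017): x * sigmoid(beta * x)."""
    return x / (1.0 + math.exp(-beta * x))

def powersign_update(g, m, base=math.e):
    """PowerSign update direction (Bello et al. 2017).

    Scales the gradient g up when its sign agrees with the moving average m
    of past gradients (e^1), and down when it disagrees (e^-1).
    """
    return base ** (math.copysign(1.0, g) * math.copysign(1.0, m)) * g
```

Swish behaves like the identity for large positive inputs and gates smoothly through zero (Swish(0) = 0), which is part of why it can replace ReLU without other changes.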
- 67. Monte Carlo Tree Search + reinforcement learning: MOSAIC [Rakotoarison et al. 2019], AlphaD3M [Drori et al. 2019]. Self-play with game actions that insert, delete, or replace components in a pipeline; Monte Carlo Tree Search builds pipelines given action probabilities, with a grammar to avoid invalid pipelines; a neural network (LSTM) predicts pipeline performance (and can be pre-trained on prior datasets). Figure source: Drori et al., 2019
- 68. Meta-reinforcement learning for NAS (Gomez & Vanschoren, 2019). Results on increasingly difficult tasks (omniglot, vgg_flower, dtd): initially slower than DQN, but faster after a few tasks; policy entropy shows learning/re-learning.
- 69. MetaNAS: MAML + neural architecture search (Elsken et al., 2020). Combines gradient-based meta-learning (Reptile) with NAS: during meta-training it optimizes the meta-architecture (DARTS weights) along with the meta-parameters (initial weights) 𝛳; during meta-testing the architecture can be adapted to the novel task through gradient descent. Figure source: Elsken et al., 2020
- 70. What can we learn to learn? 3 pillars (Clune 2019, reminder): 1. architectures / pipelines (hyper-parameters) [part 1]; 2. learning algorithms (model parameters) [part 2]; 3. learning environments (task generation) [part 3].
- 71-72. Terminology (reminder). Task: a distribution of samples q(x), outputs y, and a loss ℒ(x, y); for reinforcement learning: q(x1) plus transitions q(xt+1 | xt, yt), with ℒ = (negative) reward. Model: fɸ(x) = y’. Learner: model parameters ɸ and hyperparameters λ; training finds ɸ* = argminɸ ℒ(fɸ,λ(x), y), equivalently ɸ* = argmaxɸ log p(ɸ | T). Meta-learning algorithm g with meta-parameters 𝛳 (which encode some inductive bias).
- 73. Strategy 1: bilevel optimization. Parameterize some aspect of the learner as meta-parameters 𝛳 and meta-learn 𝛳 across tasks: ɸi* = argmaxɸi log p(ɸi | Ti, 𝛳). Bilevel optimization: 𝛳* = argmax𝛳 Σi log p(𝛳 | Ti), so that ɸi* = argmaxɸ log p(ɸ | Ti, 𝛳*). 𝛳 (a prior) could encode an initialization of ɸ, the hyperparameters λ, the optimizer, … The learned 𝛳* should allow learning Tnew from a small amount of data, p(ɸnew | Tnew, 𝛳*), allowing fine-tuning on Tnew, yet generalize to a large number of tasks.
- 74. Meta-learning with bilevel optimization. Every task sampled from a distribution p(T) is a meta-training example. Inner loop (training): adapt task-specific parameters ɸ’i on each task’s training data; outer loop (meta-training): update 𝛳 by minimizing the meta-loss over the per-task errors, 𝛳* = argmax𝛳 log p(𝛳 | T) = argmin𝛳 ℒmeta(ℒT1, ℒT2, …), using a meta-optimizer Fmeta. At meta-test time, adapt 𝛳* to Tasknew (finetune ɸ’new) and measure the meta-test error on ETnew.
- 75. Strategy 2: black-box models. A black-box meta-model g predicts ɸ directly from the training data: ɸi = g(Di,train), with 𝛳 hidden inside g (e.g. a hypernetwork whose input embedding is learned across tasks, or a stateful model with memory such as an RNN). Meta-training: sample task i, update 𝛳 via ∇ of ℒ(ɸi, Di,test) summed over tasks. Amortized models: no training at meta-test time, just a feed-forward pass (learning fast by learning slow).
- 76. Example: few-shot classification (e.g. 2-way, 3-shot). Meta-train on many small train tasks, each with a Dtrain (support) and Dtest (query) set; meta-test on held-out tasks. Bilevel: the meta-learner produces 𝛳* (e.g. an initial ɸ, an optimizer, a metric, …); the learner finetunes ɸ on Dtrain and is evaluated via ℒ(ɸ, Dtest); meta-training minimizes ℒmeta(ℒT1, ℒT2, …). Vinyals et al. 2016; Triantafillou et al. 2019
- 77. Example: few-shot classification, black box: the meta-learner maps Dtrain directly to model parameters ɸ*; meta-training updates 𝛳 by minimizing ℒmeta(ℒT1, ℒT2, …) over the ℒ(ɸ, Dtest) of the train tasks; at meta-test time the same meta-learner produces ɸ for the new task in one pass.
- 78-79. Example: meta-reinforcement learning, bilevel (Yu et al. 2019). Tasks differ in initial state (randomized object and goal positions) or are otherwise related. Inner loop (training): learn a policy 𝝅ɸ from episodes; outer loop (meta-training): learn 𝛳* (e.g. a fixed objective, a prior modulating the effective policy 𝝅ɸ, or which ɸ to learn) via a meta-policy gradient on Rmeta(RT1, RT2, …). At meta-test, finetune ɸ on test-task episodes and evaluate R(𝝅ɸ, Dtest).
- 80-81. Example: meta-reinforcement learning, black box: a fast RNN is the RL agent, and its hidden state plays the role of ɸ, adapting within a task across training episodes (inner loop); the meta-RL algorithm updates 𝛳 on Rmeta(RT1, RT2, …) across train tasks (outer loop, meta-training). At meta-test, a few episodes on the new task suffice to adapt, evaluated by R(𝝅ɸ, Dtest). Initial states: randomized object and goal positions, or otherwise related tasks.
- 82-84. Taxonomy of meta-learning methods (Hospedales et al. 2020; Huisman et al. 2020). Like base-learners, meta-learners consist of a representation, an objective, and an optimizer. Meta-parameters 𝛳: parameter initialization ɸ0 (learned vs. random), optimizer (learned update rule vs. normal SGD/backpropagation), black-box model, embedding (metric learning), hyperparameters λ, architecture, loss ℒ / reward R, … Meta-objective ℒmeta: many vs. few shots, multi- vs. single-task, fast adaptation vs. final performance, offline vs. online. Meta-optimizer Fmeta: gradient descent, reinforcement learning, evolution. We won’t exhaustively discuss all existing combinations; let’s focus on understanding a few key strategies.
- 85. Gradient-based methods: learning ɸinit. Here 𝛳 (the prior) is the model initialization ɸinit: learn a representation suitable for many tasks (e.g. a pretrained CNN) that maximizes rapid learning. Each task i from p(T) yields task-adapted parameters ɸi = u(𝛳, Di,train) via an update algorithm u; meta-training solves the bilevel problem 𝛳* = argmin𝛳 Σi ℒi(ɸi, Di,test) with ɸi = u(𝛳, Di,train); at meta-test, finetune 𝛳* on Tnew in a few steps: ɸ’new = u(𝛳*, Dnew,train).
- 86-88. Model-agnostic meta-learning (MAML), Finn et al. 2017. Meta-training: given the current initialization θ and model f, on each of i tasks perform k SGD steps to find ϕi*, then evaluate ∇θℒi(fϕi*); update task-specific parameters ϕi = θ − α∇θℒi(fϕi*); update θ to minimize the sum of per-task test losses, θ ← θ − β∇θ Σi ℒi(fϕi) = θ − β∇θ Σi ℒi(f(θ − α∇θℒi(fϕi*))); repeat. The meta-gradient (the derivative of the test-set loss) is a second-order gradient: it computes how changes in θ affect the gradient at the new θ, and is backpropagated through the inner updates. Meta-testing: given training data Dtrain of the new task and pre-trained parameters θ*, finetune: ϕ = θ* − α∇θℒ(fθ*). α, β: learning rates.
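For a 1-D toy model the second-order meta-gradient can be written out by hand, which makes the MAML update concrete. A minimal sketch, assuming hypothetical regression tasks y = a·x with f_ϕ(x) = ϕ·x and squared loss, where one inner step gives ϕ = θ − α·g(θ) and hence dϕ/dθ = 1 − 2α·mean(x²) (none of this is the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.1, 0.05  # inner and outer learning rates
theta = 0.0              # meta-learned initialization

def grad(phi, x, y):
    """Gradient of the squared loss of f_phi(x) = phi * x."""
    return np.mean(2.0 * (phi * x - y) * x)

for _ in range(500):
    a = rng.uniform(2.0, 4.0)                    # sample a task: y = a * x
    xtr, xte = rng.uniform(-1, 1, 10), rng.uniform(-1, 1, 10)
    ytr, yte = a * xtr, a * xte
    phi = theta - alpha * grad(theta, xtr, ytr)  # inner step (adaptation)
    # second-order term: how the adapted phi changes with theta
    dphi_dtheta = 1.0 - 2.0 * alpha * np.mean(xtr ** 2)
    # chain rule: backpropagate the test-set gradient through the inner update
    theta -= beta * dphi_dtheta * grad(phi, xte, yte)

# theta ends near 3.0, the center of the task distribution [2, 4], so a
# single inner step adapts well to any new task from that distribution.
```

In real MAML the `dphi_dtheta` factor is a Hessian-vector product computed by automatic differentiation rather than by hand; the structure of the update is the same.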
- 89. Model-agnostic meta-learning (MAML)
• Example for reinforcement learning
• Goal: reach a certain velocity in a certain direction
• [Figure: learning curves comparing oracle, MAML, and pretrained] Source: Finn et al. 2017 57
- 91. Other gradient-based techniques
• Changing the update rule yields different variants; all use second-order gradients and meta-learn a transformation of the gradient for better adaptation:
• MAML 1,6: ϕi = θ − α ∇θ ℒi(fθ)
• MetaSGD 2: ϕi = θ − α diag(w) ∇θ ℒi(fθ) (w: meta-learned weight per parameter)
• T-net 3: ϕi = θ − α ∇θ ℒi(fθ, w)
• Meta-curvature 4: ϕi = θ − α B(θ, w) ∇θ ℒi(fθ)
• WarpGrad 5: ϕi = θ − α P(θ, ϕi) ∇θ ℒi(fθ)
• Outer update as in MAML: θ ← θ − β ∇θ Σi ℒi(fϕi)
• Online MAML (Follow the Meta Leader) 7: minimizes regret; robust, but computation costs grow over time
1 Finn et al. 2017, 2 Li et al. 2017, 3 Lee et al. 2018, 4 Park et al. 2019, 5 Flennerhag et al. 2019, 6 Antoniou et al. 2019, 7 Finn et al. 2019
Figure source: Flennerhag et al. 2019 (MAML vs. WarpGrad) 58
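The variants above differ only in how the task-loss gradient is transformed before the inner step. A minimal numpy sketch; the gradient, weights, and preconditioning matrix are all illustrative values:

```python
import numpy as np

theta = np.zeros(3)                 # current initialization
g = np.array([0.5, -1.0, 0.2])      # toy gradient of a task loss at theta
alpha = 0.1

# MAML: a plain SGD step
phi_maml = theta - alpha * g

# MetaSGD: a meta-learned learning rate per parameter
w = np.array([1.0, 0.1, 2.0])
phi_metasgd = theta - alpha * (w * g)     # elementwise == theta - alpha*diag(w) @ g

# Meta-curvature / WarpGrad style: a meta-learned matrix preconditions
# (rotates and rescales) the gradient before the inner step
B = np.array([[1.0, 0.2, 0.0],
              [0.2, 1.0, 0.0],
              [0.0, 0.0, 0.5]])
phi_warp = theta - alpha * (B @ g)
```

In the real methods, w and B are themselves meta-trained with second-order gradients across tasks; here they are fixed to show the shape of each update.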
- 92. Scalability
• Backpropagating derivatives of ϕi w.r.t. θ is compute- and memory-intensive for large models
• First-order approximations of MAML:
• First-order MAML (FOMAML) 1 uses only the last inner gradient update: θ ← θ − β Σi ∇ϕ ℒi(fϕ*i)
• Reptile 2: iterate over tasks, run SGD to obtain ϕ*i (θ1, θ2, θ3, θ4 = ϕ*i), then update θ in the direction of ϕ*i − θ: θ ← θ + β (ϕ*i − θ)
• Implicit MAML 3 instead computes the meta-gradient implicitly, without backpropagating through the inner optimization path
1 Finn et al. 2017, 2 Nichol et al. 2018, 3 Rajeswaran et al. 2019 59
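Reptile is simple enough to sketch in full: adapt to a sampled task with plain SGD, then nudge the initialization towards the adapted weights. The task distribution and all hyperparameters below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

def sample_task():
    """Toy regression task y = X @ a, with task weights a near a shared centre."""
    a = np.ones(d) + 0.3 * rng.normal(size=d)
    X = rng.normal(size=(20, d))
    return X, X @ a

def sgd(w, X, y, alpha=0.05, k=5):
    """k inner SGD steps on the task's mean squared error."""
    for _ in range(k):
        w = w - alpha * 2 * X.T @ (X @ w - y) / len(y)
    return w

theta = rng.normal(size=d)                # random initialization
beta = 0.1
for _ in range(500):
    X, y = sample_task()
    phi = sgd(theta, X, y)                # phi*: adapt to the sampled task
    theta = theta + beta * (phi - theta)  # Reptile: nudge theta towards phi*
```

No second-order terms appear anywhere, which is the whole point: the meta-update is just a weighted average of adapted solutions, yet θ still converges near the task centre.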
- 93. Generalizability
• MAML is more resilient to overfitting than many other meta-learning techniques 1
• Its effectiveness seems mainly due to feature reuse (finding a good θ) 2: fine-tuning only the last layer works equally well
• On few-shot learning benchmarks, a good embedding outperforms most meta-learning 3: learn a representation on the entire meta-dataset (merged into a single task), then train a linear classifier on the embedded few-shot Dtrain and predict Dtest
• For meta-RL, also learn how to explore (how to sample new environments): E-MAML adds an exploration term to the meta-objective, allowing longer-term goals 4
1 Finn et al. 2018, 2 Raghu et al. 2020, 3 Tian et al. 2020, 4 Stadie et al. 2019
Figure source: Tian et al. 2020 60
- 94. Bayesian meta-learning: can meta-learning reason about uncertainty in the task distribution?
• Deterministic view: sample Ti = {Dtrain, Dtest} from 𝒯meta-train, compute ϕ ← gθ((x, y)train), predict ytest ← fϕ(xtest), and update θ ← θ − ∇θ ℒ(ytest); at meta-test time, sample Tj = {Dtrain, Dtest} from 𝒯meta-test, finetune ϕj ← gθ((x, y)train), and predict ytest ← fϕj(xtest)
• Bayesian view: infer a posterior p(ϕi | (x, y)train, θ) over task-specific parameters, maintain p(θ) ← p(θ | (x, y)train) over the meta-parameters, and predict the full distribution p(ytest | xtest)
Grant et al. 2018 61
- 95. Fully Bayesian meta-learning
• Use approximation methods to represent uncertainty over ϕi:
• Sampling + variational inference: PLATIPUS 1, BMAML 2, ABML 3
• Laplace approximation: LLAMA 4
• Variational approximation of the posterior: Neural Statistician 5, Neural Processes 6
• What if our tasks are not IID? Impose additional structure, e.g. with a task-specific latent variable z 7 (mini-ImageNet with filters for non-homogeneous tasks)
1 Finn et al. 2019, 2 Kim et al. 2018, 3 Ravi et al. 2019, 4 Grant et al. 2018, 5 Edwards et al. 2017, 6 Garnelo et al. 2018, 7 Jerfel et al. 2019
Figure source: Jerfel et al. 2019 62
- 96. Meta-learning optimizers
• Learn meta-parameters of the update rule across tasks, by gradient descent on the meta-gradient 1
• Our brains probably don't do backprop; instead:
• Fast weights 2: networks that continuously modify the weights of another network
• Gradient-based 3,4: parameterize the update rule with a neural network gθ, e.g. ϕ'i = ϕi − η gθ(∇ℒ)
• Represent the update rule as an LSTM 4,5,6 or a hierarchical RNN 7
1 Bengio et al. 1995, 2 Schmidhuber 1992, 3 Runarsson and Jonsson 2000, 4 Hochreiter 2001, 5 Andrychowicz et al. 2016, 6 Ravi and Larochelle 2017, 7 Wichrowska et al. 2017 63
- 97. Meta-learning optimizers
• Meta-learned (RNN) optimizers 'rediscover' momentum, gradient clipping, learning rate schedules, learning rate adaptation, … 1
• RL-based optimizers: represent updates as a policy, learned with guided policy search 2
• Combined with MAML: learn per-parameter learning rates 3,4, or a preconditioning matrix to 'warp' the loss surface 5,6
• Black-box optimizers: meta-learned with an RNN 7, or with user-defined priors 8
• Speed up backpropagation by meta-learning sparsity and weight sharing 9
1 Maheswaranathan et al. 2020, 2 Ke and Malik 2016, 3 Li et al. 2017, 4 Antoniou et al. 2019, 5 Park et al. 2019, 6 Flennerhag et al. 2019, 7 Chen et al. 2017, 8 Souza et al. 2020, 9 Kirsch et al. 2020
Figure source: Maheswaranathan et al. 2020 (learning rate schedules, learning rate adaptation) 64
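The 'rediscovering momentum' idea can be caricatured at toy scale: parameterize the update rule by a learning rate and a momentum coefficient, then meta-train both by gradient descent (here via finite differences) on the final loss over a batch of quadratic tasks. This is a hedged illustrative sketch, not any of the cited methods:

```python
import numpy as np

def inner_run(w, steps=20):
    """Run the parameterized update rule (learning rate w[0], momentum w[1])
    on a fixed batch of quadratic tasks; return the mean final loss."""
    rng = np.random.default_rng(42)           # fixed tasks: deterministic meta-loss
    total = 0.0
    for _ in range(10):
        curv = rng.uniform(0.5, 5.0, size=3)  # per-coordinate curvatures
        x, v = rng.normal(size=3), np.zeros(3)
        for _ in range(steps):
            g = curv * x                      # gradient of 0.5 * sum(curv * x**2)
            v = w[1] * v + g                  # momentum buffer
            x = x - w[0] * v                  # parameterized update rule
        total += 0.5 * np.sum(curv * x ** 2)
    return total / 10

# Meta-train the update rule by finite-difference gradient descent on the meta-loss
w = np.array([0.05, 0.0])                     # start: small learning rate, no momentum
eps, meta_lr = 1e-5, 0.001
for _ in range(150):
    meta_grad = np.zeros_like(w)
    for j in range(len(w)):
        wp, wm = w.copy(), w.copy()
        wp[j] += eps
        wm[j] -= eps
        meta_grad[j] = (inner_run(wp) - inner_run(wm)) / (2 * eps)
    w = w - meta_lr * np.clip(meta_grad, -5, 5)  # small, clipped meta-steps
```

Meta-training drives the momentum coefficient above zero because momentum reduces the final loss on this task distribution; the RNN optimizers above learn far richer update rules, but by the same outer-loop principle.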
- 98. Metric learning
• Learn an embedding network gθ that transforms data {Dtrain, Dtest} across all tasks to a representation gθ(x) that allows easy similarity comparison between a task's Dtrain examples and a query point
• Can be seen as a simple black-box meta-model
• Often non-parametric (independent of a task learner ϕ)
Vinyals et al. 2016, Sung et al. 2018 65
- 99. Prototypical networks
• Use an embedding function fθ to encode each data point
• Define a prototype vc for every class c as the mean embedding of that class's training examples: vc = (1 / |D(c)train|) Σ(xi,yi)∈D(c)train fθ(xi)
• The class distribution for an input x is a softmax over (negative) distances to the prototypes: p(y = c | xtest) = softmax(−d(fθ(xtest), vc))
• The distance function d can be any differentiable distance, e.g. squared Euclidean
• Loss function to learn the embedding: ℒ(θ) = −log pθ(y = c | x)
Snell et al. 2017. Figure source: Snell et al. 2017 66
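The prototype computation and classification step can be sketched directly. The embedding function fθ is assumed given; here the 'embeddings' are hand-made 2D points:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def prototype_predict(embeddings, labels, query, n_classes):
    """Classify a query embedding by a softmax over negative squared
    Euclidean distances to the per-class prototypes (class mean embeddings)."""
    protos = np.stack([embeddings[labels == c].mean(axis=0)
                       for c in range(n_classes)])
    d2 = ((protos - query) ** 2).sum(axis=1)   # squared distance to each prototype
    return softmax(-d2)

# Toy 2-way 3-shot episode: class 0 embeds near (0, 0), class 1 near (4, 4)
emb = np.array([[0.1, 0.0], [0.0, 0.2], [-0.1, 0.1],
                [4.0, 3.9], [4.1, 4.0], [3.9, 4.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
p = prototype_predict(emb, labels, np.array([0.2, 0.1]), n_classes=2)
# the query sits next to the class-0 prototype, so p[0] is close to 1
```

In the full method the cross-entropy loss on these episode predictions is backpropagated through fθ; only the embedding is learned, the classifier itself has no parameters.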
- 100. Metric learning
• Quite a few other techniques exist:
• Siamese neural networks 1
• Graph neural networks 2, also applicable to semi-supervised and active learning
• Attentive Recurrent Comparators 3: compare inputs not as a whole but by parts (e.g. image patches)
• MetaOptNet 4: learns embeddings such that linear models can distinguish between classes
• Overall: fast at test time, although pair-wise comparisons limit task size; mostly limited to few-shot supervised tasks; fails when test tasks are more distant, since there is no way to adapt
1 Koch et al. 2015, 2 Garcia et al. 2017, 3 Shyam et al. 2017, 4 Lee et al. 2019 67
- 101. Black-box model for meta-RL
• A meta-RL algorithm trains a (fast) RL agent represented by an RNN: the inner loop trains the agent ϕ on train episodes and (few) test episodes of each training task, the outer meta-training loop maximizes the meta-reward Rmeta(RT1, RT2, …) over tasks by updating θ, and the meta-trained RNN is then evaluated on test tasks via R(πϕ, Dtest)
• The RNN serves as dynamic task-embedding storage: previous inputs are stored in its hidden state
• Maximize the expected reward in each trial
• Very expressive; performs very well on short tasks, but longer horizons are an open challenge
Wang et al. 2016, Duan et al. 2018. Image source: Duan et al. 2018 68
- 102. Other black-box models
• Memory-augmented NNs 1: use neural Turing machines (short-term + long-term memory)
• Meta Networks 2: a meta-learner that returns 'fast weights' for itself and for the base network solving the task
• Simple Neural Attentive meta-learner (SNAIL) 3: aims to overcome the memory limitations of RNNs with a series of 1D convolutions
1 Santoro et al. 2016, 2 Munkhdalai et al. 2017, 3 Mishra et al. 2018
Image sources: Munkhdalai et al. 2017 (Meta Networks), Mishra et al. 2018 (SNAIL) 69
- 103. What can we learn to learn? Learning on Task 1, Task 2, Task 3, … (and on new tasks of our own choosing) yields experience that becomes an inductive bias for learning Tasknew. Three pillars (Clune 2019):
1. architectures / pipelines (hyperparameters) (part 1)
2. learning algorithms (model parameters) (part 2)
3. learning environments (task generation) (part 3) 70
- 104. Training task acquisition
• Ultimately, meta-learning translates constraints on the learner into constraints on the data: the biases we don't put in manually have to be learnable from data
• Can we automatically create new tasks to inform and challenge our meta-learners?
• Paired Open-Ended Trailblazer (POET): evolves a parameterized environment θE for an agent θA
• Select agents that can solve the challenges AND evolve environments so that they remain solvable; the best-performing agents move on, which creates multiple overlapping curricula
Wang et al. 2019, Wang et al. 2020. Figure source: Wang et al. 2019 71
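A POET-style loop can be caricatured in a few lines: agents and environments are scalars, environments mutate towards harder variants, and a minimal criterion keeps new environments neither trivial nor unsolvable. This is an illustrative toy under those assumptions, not the published algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy POET-style loop: an 'environment' is a scalar difficulty, an 'agent'
# a scalar skill; an agent performs well when its skill matches the difficulty.

def score(agent, env):
    return -abs(agent - env)   # higher is better

pairs = [(0.0, 0.1)]           # (agent, environment) pairs
for generation in range(100):
    new_pairs = []
    for agent, env in pairs:
        agent = agent + 0.5 * (env - agent)   # optimize the agent on its environment
        # mutate the environment, applying a minimal criterion:
        # keep children that are neither already solved nor far out of reach
        child = env + abs(rng.normal(0.0, 0.2))
        if -0.5 < score(agent, child) < -0.05:
            new_pairs.append((agent, child))
        new_pairs.append((agent, env))
    # transfer step: every environment adopts the best-performing agent
    agents = [a for a, _ in new_pairs]
    pairs = [(max(agents, key=lambda a: score(a, e)), e) for _, e in new_pairs][:10]
```

Even this caricature shows the ratchet: because each surviving environment must stay barely solvable by some agent, difficulty and skill co-escalate over generations, forming an implicit curriculum.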
- 106. Does POET scale? Increasingly difficult 3D terrain, 18 degrees of freedom. Zhou & Vanschoren 2021 72
- 108. Generative Teaching Networks (GTNs)
• Based on an existing dataset, generate synthetic training data for more efficient training
• Like dataset distillation, but uses meta-learning to update the generator model
• While POET has limited expressivity (limited to θE), GTNs could produce all sorts of training datasets and environments
• Care is needed not to generate noise: the data needs some grounding in reality
Such et al. 2019 73
- 109. Thank you Image source: Pesah et al. 2018 74