1 / 28
Meta-learning and the ELBO
eddy.l
January 02, 2019
2 / 28
Why unsupervised/generative?
Intelligence existed before labels did
Unsupervised = cake, supervised = icing, RL = cherry (LeCun)
A human brain has $10^{15}$ connections and lives for $10^9$ seconds. (Hinton)
"What I cannot create, I do not understand." (Feynman)
3 / 28
A latent variable model
[Graphical model: observations $x_1, \dots, x_4$, each with its own latent variable $z_1, \dots, z_4$.]
4 / 28
A latent variable model
plate notation
[Plate notation: latent $z$ generating observed $x$, repeated $N$ times.]
5 / 28
Learning a generative model
Assume z ∼ p(z) and x ∼ p(x|z; θ). A good generative model is one that
generates real-looking data, so let’s find
$$\theta^* = \arg\max_\theta \; p(x_{\text{real}}; \theta) \tag{1}$$
6 / 28
Learning a generative model
First attempt
Maybe we can find $\theta^*$ using gradient ascent on something that increases with $p(x; \theta)$?
$$\begin{aligned}
\theta^* &= \arg\max_\theta \; p(x; \theta) = \arg\max_\theta \; \log p(x; \theta) && \text{(2)} \\
&= \arg\max_\theta \; \log \int p(z)\, p(x|z; \theta)\, dz && \text{(3)} \\
&= \arg\max_\theta \; \log \mathbb{E}_{p(z)} \left[ p(x|z; \theta) \right] && \text{(4)}
\end{aligned}$$
Even evaluating $p(x; \theta)$ requires an integral, and because the log sits outside the expectation we cannot Monte Carlo approximate it without bias.
7 / 28
Learning a generative model
The ELBO
Let's give up on optimizing $p(x; \theta)$ directly. Let $q(z|x)$ be any distribution we can sample from and whose density we can evaluate. By Jensen's inequality (omitting parameters),
$$\begin{aligned}
\log p(x) &= \log \mathbb{E}_{p(z)} \left[ p(x|z) \right] && \text{(5)} \\
&= \log \mathbb{E}_{q(z|x)} \left[ \frac{p(z)\, p(x|z)}{q(z|x)} \right] = \log \mathbb{E}_{q(z|x)} \left[ \frac{p(x, z)}{q(z|x)} \right] && \text{(6)} \\
&\geq \mathbb{E}_{q(z|x)} \left[ \log \frac{p(x, z)}{q(z|x)} \right] && \text{(7)}
\end{aligned}$$
with equality when $q(z|x) = p(z|x)$ everywhere. This quantity is called the ELBO (Evidence Lower BOund). Since we assumed we can sample $z \sim q(z|x)$, we can form Monte Carlo estimates of the ELBO.
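A quick sanity check (a sketch, not from the slides): on a conjugate Gaussian toy model where $\log p(x)$ is known in closed form, the log-of-average estimate from the previous slide is biased low, while the Monte Carlo ELBO with $q$ set to the exact posterior recovers $\log p(x)$ exactly. The model and all names below are assumptions chosen for illustration.

import numpy as np
from scipy.stats import norm

# Assumed toy model: p(z) = N(0, 1), p(x|z) = N(z, 1),
# so p(x) = N(0, 2) and the exact posterior is p(z|x) = N(x/2, 1/2).
rng = np.random.default_rng(0)
x = 1.3
log_px = norm(0.0, np.sqrt(2.0)).logpdf(x)            # exact log-evidence

# Naive estimate: log of a Monte Carlo average over the prior.
# By Jensen, E[log(average)] < log p(x) for finite K, so this is biased low.
K, reps = 10, 2000
naive = np.mean([np.log(np.mean(norm(rng.normal(0.0, 1.0, K), 1.0).pdf(x)))
                 for _ in range(reps)])

# ELBO estimate with q(z|x) equal to the true posterior: the bound is tight,
# and log p(x, z) - log q(z|x) = log p(x) for every sample z.
q_mu, q_sd = x / 2.0, np.sqrt(0.5)
z = rng.normal(q_mu, q_sd, size=K)
elbo = np.mean(norm(0.0, 1.0).logpdf(z) + norm(z, 1.0).logpdf(x)
               - norm(q_mu, q_sd).logpdf(z))

print(log_px, naive, elbo)    # elbo matches log_px; naive falls below it on average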
8 / 28
Variational Autoencoders
[Plate notation: latent $z$ generating observed $x$, repeated $N$ times.]
Networks: $p(x|z; \theta)$ (decoder) and $q(z|x; \phi)$ (encoder).
Maximize $\text{ELBO} = \log \dfrac{p(z)\, p(x|z; \theta)}{q(z|x; \phi)}$, where $z \sim q(z|x; \phi)$.
“Auto-encoding variational bayes” by Kingma and Welling,
“Stochastic backpropagation and approximate inference in
deep generative models” by Rezende, Mohamed, and
Wierstra
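A concrete sketch of this objective in PyTorch, assuming a Gaussian encoder $q(z|x; \phi)$, a standard normal prior, and a Bernoulli decoder $p(x|z; \theta)$; the layer sizes and names are placeholders, not taken from either paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def elbo(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterized z ~ q(z|x)
        logits = self.dec(z)
        log_px_z = -F.binary_cross_entropy_with_logits(
            logits, x, reduction='none').sum(-1)                  # log p(x|z) (Bernoulli)
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1)  # KL(q(z|x) || N(0, I))
        return (log_px_z - kl).mean()    # maximize this; train on its negative as the loss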
9 / 28
Variational Autoencoders
[Plate notation: latent $z$ generating observed $x$, repeated $N$ times.]
Networks: $p(x|z; \theta)$ (decoder) and $q(z|x; \phi)$ (encoder).
Maximize $\text{ELBO} = \log \dfrac{p(z)\, p(x|z; \theta)}{q(z|x; \phi)}$, where $z \sim q(z|x; \phi)$.
What does this loss function mean?
10 / 28
Interpretations of ELBO
1. Lower bound of evidence
We have already shown that the ELBO is a lower bound on the evidence:
$$\log p(x) \geq \mathbb{E}_{q(z|x)} \left[ \log \frac{p(x, z)}{q(z|x)} \right] = \text{ELBO} \tag{8}$$
Thus, we can view optimizing the ELBO as approximately optimizing $p(x)$:
$$\arg\max_\theta \; p(x) \approx \arg\max_\theta \; \text{ELBO} \tag{9}$$
11 / 28
Interpretations of ELBO
2. Distance to posterior
Let's take a closer look at the gap between $\log p(x)$ and the ELBO:
$$\begin{aligned}
\log p(x) &= \log \frac{p(x, z)}{p(z|x)} = \mathbb{E}_{q(z|x)} \left[ \log \frac{p(x, z)}{p(z|x)} \right] && \text{(10)} \\
&= \mathbb{E}_{q(z|x)} \left[ \log \frac{p(x, z)}{q(z|x)} \right] + D_{\mathrm{KL}}\left( q(z|x) \,\|\, p(z|x) \right) && \text{(11)} \\
&= \text{ELBO} + D_{\mathrm{KL}}\left( q(z|x) \,\|\, p(z|x) \right) && \text{(12)}
\end{aligned}$$
From the point of view of the inference network, maximizing the ELBO is equivalent to minimizing the KL divergence to the posterior:
$$\arg\min_\phi \; D_{\mathrm{KL}}\left( q(z|x) \,\|\, p(z|x) \right) = \arg\max_\phi \; \text{ELBO} \tag{13}$$
12 / 28
Interpretations of ELBO
3. Autoencoder
$$\text{ELBO} = \mathbb{E}_{q(z|x)} \left[ \log p(x, z) - \log q(z|x) \right]$$
Averaged over the dataset,
$$\text{ELBO} = \frac{1}{N} \sum_{n=1}^{N} \Big( \mathbb{E}_{q(z_n|x_n)} \left[ \log p(x_n|z_n) \right] - D_{\mathrm{KL}}\left( q(z_n|x_n) \,\|\, p(z) \right) \Big)$$
This can be used (negated) as a loss function whenever the KL divergence term can be computed analytically.
For each datapoint $x_n$, we make the model reconstruct $x_n$ while keeping each embedding $z_n$ close to the prior $p(z)$.
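For reference (a standard closed form, not shown on the slide): with a diagonal Gaussian encoder $q(z_n|x_n) = \mathcal{N}(\mu_n, \mathrm{diag}(\sigma_n^2))$ and prior $p(z) = \mathcal{N}(0, I)$,
$$D_{\mathrm{KL}}\left( q(z_n|x_n) \,\|\, p(z) \right) = \frac{1}{2} \sum_{d=1}^{D} \left( \mu_{n,d}^2 + \sigma_{n,d}^2 - 1 - \log \sigma_{n,d}^2 \right),$$
which is exactly the kl term in the VAE sketch above.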
13 / 28
Interpretations of ELBO
3. Autoencoder
Write $q(z_n)$ as shorthand for $q(z_n|x_n)$, and let $q(z) = \frac{1}{N} \sum_{n=1}^{N} q(z_n|x_n)$ denote the aggregate posterior. Then
$$\begin{aligned}
\frac{1}{N} \sum_{n=1}^{N} D_{\mathrm{KL}}\left( q(z_n) \,\|\, p(z) \right)
&= \frac{1}{N} \sum_{n=1}^{N} \int q(z_n) \left( \log q(z_n) - \log q(z) + \log q(z) - \log p(z) \right) dz \\
&= \frac{1}{N} \sum_{n=1}^{N} D_{\mathrm{KL}}\left( q(z_n) \,\|\, q(z) \right) + \int \frac{\sum_{n=1}^{N} q(z_n)}{N} \left( \log q(z) - \log p(z) \right) dz \\
&= \frac{1}{N} \sum_{n=1}^{N} D_{\mathrm{KL}}\left( q(z_n) \,\|\, q(z) \right) + D_{\mathrm{KL}}\left( q(z) \,\|\, p(z) \right)
\end{aligned}$$
so
$$\text{ELBO} = \frac{1}{N} \sum_{n=1}^{N} \Big( \mathbb{E}_{q(z_n)} \left[ \log p(x_n|z_n) \right] - D_{\mathrm{KL}}\left( q(z_n) \,\|\, q(z) \right) \Big) - D_{\mathrm{KL}}\left( q(z) \,\|\, p(z) \right)$$
14 / 28
Interpretations of ELBO
4. Free energy
$$\begin{aligned}
\text{ELBO} &= \mathbb{E}_{q(z)} \left[ \log p(x, z) - \log q(z) \right] \\
&= -\mathbb{E}_{q(z)} \left[ -\log p(x, z) \right] + H(q(z))
\end{aligned}$$
This is a negative energy term plus the entropy of the distribution over states: the states are values of $z$, and the energy of state $z$ is $E(z) = -\log p(x, z)$. The ELBO therefore has the form of a negative Helmholtz free energy.
15 / 28
Interpretations of ELBO
4. Free energy
$$\text{ELBO} = -\mathbb{E}_{q(z)} \left[ -\log p(x, z) \right] + H(q(z))$$
We know that the distribution over states that minimizes a free energy is the Boltzmann distribution:
$$q(z) \propto \exp(-E(z)) = p(x, z) \propto p(z|x). \tag{14}$$
So the distribution over $z$ that maximizes the ELBO (equivalently, minimizes the free energy) is $p(z|x)$, the true posterior.
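To make the link to interpretation 2 explicit (a one-line check using the rearrangement in equations (10)–(12)): writing the free energy as $F[q] = \mathbb{E}_{q(z)}[-\log p(x, z)] - H(q(z)) = -\text{ELBO}$,
$$F[q] = \mathbb{E}_{q(z)} \left[ \log \frac{q(z)}{p(x, z)} \right] = -\log p(x) + D_{\mathrm{KL}}\left( q(z) \,\|\, p(z|x) \right) \geq -\log p(x),$$
with equality exactly when $q(z) = p(z|x)$, which is the Boltzmann-distribution statement above.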
16 / 28
Interpretations of ELBO
5. Minimum Description Length (the "bits-back" argument)
Suppose we want to describe a datapoint $x$ using as few bits as possible. We can first describe $z$, and then describe $x$ given $z$. Shannon's source coding theorem says that this scheme costs at least
$$\mathbb{E}_{x \sim \text{data},\; z \sim q(z|x)} \left[ -\log p(z) - \log p(x|z) \right]$$
bits on average.
17 / 28
Interpretations of ELBO
5. Minimum Description Length (the "bits-back" argument)
Suppose we want to describe a datapoint $x$ using as few bits as possible. We can first describe $z$, and then describe $x$ given $z$. Shannon's source coding theorem says that this scheme costs at least
$$\mathbb{E}_{x \sim \text{data},\; z \sim q(z|x)} \left[ -\log p(z) - \log p(x|z) \right]$$
bits on average.
The "bits-back" argument is that once $x$ has been decoded, the receiver can recompute $q(z|x)$, so the bits used to choose $z$ from $q(z|x)$ can be recovered and subtracted from the cost of describing $z$. The expected description length is then
$$\mathbb{E}_{x \sim \text{data},\; z \sim q(z|x)} \left[ -\log p(z) + \log q(z|x) - \log p(x|z) \right] = -\text{ELBO},$$
so minimizing the description length is equivalent to maximizing the ELBO.
18 / 28
Interpretations of ELBO
1. lower bound on log p(x)
2. learning to output p(z|x)
3. autoencoder with regularization
4. free energy
5. communication cost
19 / 28
2 latent variables?
[Plate notation: latents $z_2 \to z_1 \to x$ (observed), repeated $N$ times.]
20 / 28
2 latent variables?
[Plate notation: latents $z_2 \to z_1 \to x$ (observed), repeated $N$ times.]
Networks: $p(x|z_1; \theta)$, $p(z_1|z_2; \theta)$, $q(z_2|z_1; \phi)$, $q(z_1|x; \phi)$.
Maximize $\text{ELBO} = \log \dfrac{p(z_2)\, p(z_1|z_2; \theta)\, p(x|z_1; \theta)}{q(z_2|z_1; \phi)\, q(z_1|x; \phi)}$,
where $z_1 \sim q(z_1|x; \phi)$ and $z_2 \sim q(z_2|z_1; \phi)$.
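A single-sample estimate of this two-latent ELBO follows the factorization directly. The sketch below assumes torch.distributions-style objects (with .rsample() and .log_prob()) returned by hypothetical network wrappers; none of the names come from the slides.

def two_latent_elbo(x, q_z1_given_x, q_z2_given_z1, p_z2, p_z1_given_z2, p_x_given_z1):
    # Bottom-up sampling from the inference networks (reparameterized).
    qz1 = q_z1_given_x(x)
    z1 = qz1.rsample()                      # z1 ~ q(z1|x; phi)
    qz2 = q_z2_given_z1(z1)
    z2 = qz2.rsample()                      # z2 ~ q(z2|z1; phi)
    # log p(x, z1, z2; theta) under the generative factorization.
    log_p = (p_z2.log_prob(z2)
             + p_z1_given_z2(z2).log_prob(z1)
             + p_x_given_z1(z1).log_prob(x))
    # log q(z1, z2 | x; phi) under the inference factorization.
    log_q = qz1.log_prob(z1) + qz2.log_prob(z2)
    return log_p - log_q                    # unbiased single-sample estimate of the ELBO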
21 / 28
2 latent variables?
A better way
[Plate notation: the same latents $z_2$, $z_1$ and observation $x$, repeated $N$ times, with a modified inference structure.]
“Ladder variational autoencoders” by Sønderby et al.
22 / 28
2 latent variables?
[Plate notation: latents $z_2 \to z_1 \to x$ (observed), repeated $N$ times.]
Networks: $p(x|z_1; \theta)$, $p(z_1|z_2; \theta)$, $q(z_2|x; \phi)$, $q(z_1|x; \phi)$.
Maximize $\text{ELBO} = \log \dfrac{p(z_2)\, p(z_1|z_2; \theta)\, p(x|z_1; \theta)}{q(z_2|x; \phi)\, q(z_1|x, z_2; \phi, \theta)}$,
where $z_2 \sim q(z_2|x; \phi)$ and $z_1 \sim q(z_1|x, z_2; \phi, \theta)$.
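In the ladder VAE, the combined posterior $q(z_1|x, z_2; \phi, \theta)$ is obtained by a precision-weighted merge of the bottom-up Gaussian $q(z_1|x; \phi) = \mathcal{N}(\mu_q, \sigma_q^2)$ and the top-down conditional $p(z_1|z_2; \theta) = \mathcal{N}(\mu_p, \sigma_p^2)$. A sketch of that merge (names assumed, per latent dimension):

import numpy as np

def precision_weighted_merge(mu_q, sigma_q, mu_p, sigma_p):
    # Combine bottom-up q(z1|x) and top-down p(z1|z2) into q(z1|x, z2).
    prec_q, prec_p = 1.0 / np.asarray(sigma_q)**2, 1.0 / np.asarray(sigma_p)**2
    var = 1.0 / (prec_q + prec_p)                           # combined variance
    mu = var * (prec_q * np.asarray(mu_q) + prec_p * np.asarray(mu_p))  # combined mean
    return mu, np.sqrt(var)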
23 / 28
Neural Statistician
[Plate notation: dataset-level latent $c$, per-datapoint latent $z$ and observation $x$; $N$ datapoints per dataset, $T$ datasets.]
“Towards a neural statistician” by Edwards and Storkey
24 / 28
Conditional VAE
[Plate notation: observed pair $(x, y)$ with latent $z$, repeated $N$ times.]
“Learning structured output representation using deep conditional generative models”
by Sohn, Lee, and Yan
25 / 28
Semi-supervised VAE
[Plate notation: observation $x$ with latent $z$ and partially observed label $y$, repeated $N$ times.]
“Semi-Supervised Learning with Deep Generative Models” by Kingma et al.
26 / 28
Few-shot classification
[Plate notation: class-level latent $z_c$ generating labeled pairs $(x, y)$, with nested plates of size $N$, $M$, and $T$.]
“Siamese neural networks for one-shot image recognition” by Koch, “Matching
networks for one shot learning” by Vinyals et al., “Prototypical networks for few-shot
learning” by Snell, Swersky, and Zemel
27 / 28
Few-shot classification
special case: triplet loss
[Plate notation: the same model with the plate sizes fixed to 3 and 2 (the triplet setting), repeated over $T$.]
“Deep Metric Learning Using Triplet Network” by Hoffer and Ailon
28 / 28
Few-shot classification
Triplet loss
$$\begin{aligned}
\mathcal{L} &\sim d(a, p) - d(a, n) && \text{(15)} \\
&= -\log p(a \mid \mathcal{N}(p, \alpha I)) + \log p(a \mid \mathcal{N}(n, \alpha I)) && \text{(16)}
\end{aligned}$$
Prototypical Network logits:
$$\begin{aligned}
&\left( -d^2(a, c_1),\; -d^2(a, c_2),\; \cdots,\; -d^2(a, c_n) \right) && \text{(17)} \\
&= \left( \log p(a \mid \mathcal{N}(c_1, \alpha I)),\; \cdots,\; \log p(a \mid \mathcal{N}(c_n, \alpha I)) \right) && \text{(18)}
\end{aligned}$$
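The identification used here is just the isotropic Gaussian log-density written out (a standard identity, not derived on the slide):
$$\log \mathcal{N}(a;\, c,\, \alpha I) = -\frac{1}{2\alpha} \lVert a - c \rVert^2 - \frac{D}{2} \log(2\pi\alpha),$$
so with $\alpha = \tfrac{1}{2}$ the logits $-d^2(a, c_k)$ equal these log-densities up to a constant that is shared across classes and cancels in the softmax; the triplet loss likewise compares two such log-likelihoods.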
29 / 28
References I
[1] Harrison Edwards and Amos Storkey. “Towards a neural statistician”. In:
arXiv preprint arXiv:1606.02185 (2016).
[2] Elad Hoffer and Nir Ailon. “Deep Metric Learning Using Triplet Network”.
In: Lecture Notes in Computer Science (2015), 84–92. issn: 1611-3349.
doi: 10.1007/978-3-319-24261-3_7. url:
http://dx.doi.org/10.1007/978-3-319-24261-3_7.
[3] Diederik P Kingma and Max Welling. “Auto-encoding variational bayes”.
In: arXiv preprint arXiv:1312.6114 (2013).
[4] Diederik P. Kingma et al. Semi-Supervised Learning with Deep Generative
Models. 2014. arXiv: 1406.5298 [cs.LG].
[5] Gregory Koch. “Siamese neural networks for one-shot image recognition”.
In: 2015.
30 / 28
References II
[6] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra.
“Stochastic backpropagation and approximate inference in deep generative
models”. In: arXiv preprint arXiv:1401.4082 (2014).
[7] Jake Snell, Kevin Swersky, and Richard Zemel. “Prototypical networks for
few-shot learning”. In: Advances in Neural Information Processing Systems.
2017, pp. 4077–4087.
[8] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. “Learning structured output
representation using deep conditional generative models”. In: Advances in
Neural Information Processing Systems. 2015, pp. 3483–3491.
[9] Casper Kaae Sønderby et al. “Ladder variational autoencoders”. In:
Advances in neural information processing systems. 2016, pp. 3738–3746.
[10] Oriol Vinyals et al. “Matching networks for one shot learning”. In:
Advances in Neural Information Processing Systems. 2016, pp. 3630–3638.
31 / 28
Thank You