2. Table of contents
1. Conditional Neural Processes [Garnelo+ (ICML2018)]
2. Neural Processes [Garnelo+ (ICML2018WS)]
3. Attentive Neural Processes [Kim+ (ICLR2019)]
4. Meta-Learning surrogate models for sequential decision making [Galashov+ (ICLR2019WS)]
5. On the Connection between Neural Processes and Gaussian Processes with Deep Kernels [Rudner+ (NeurIPS2018WS)]
6. Conclusion
K. Matsui (RIKEN AIP) Neural Processes Family 1 / 60
3. Section 1: Conditional Neural Processes [Garnelo+ (ICML2018)]
4. Motivation
Neural Networks vs. Gaussian Processes
• Neural Networks (NNs)
  • Strong function approximation ability
  • Each new function is learned from scratch
  • Cannot account for uncertainty over functions
• Gaussian Processes (GPs)
  • Can use prior knowledge to quickly estimate the shape of a new function
  • Can model uncertainty over functions
  • Computationally expensive
  • Hard to design the prior distribution
Aim
Combine the benefits of NNs and GPs
5. Conditional Neural Processes (CNPs)
• A conditional distribution over functions, trained to model the empirical conditional distributions of functions
• Permutation invariant in the training/test data
• Scalable: running time complexity of O(n + m) for n observations and m targets
6. Stochastic Processes i
• observations: $O = \{(x_i, y_i)\}_{i=0}^{n-1} \subset X \times Y$
• targets: $T = \{x_i\}_{i=n}^{n+m-1}$
• generative model (stochastic process):
  • $y_i = f(x_i)$, $f : X \to Y$ (noiseless case)
  • $f \sim P$ (prior process)
  • $P \Rightarrow P(f(T) \mid O, T)$ (predictive distribution)
Task
Predict the output values $f(x)$ for all $x \in T$ given $O$
Example 1 (Gaussian Processes)
$P = \mathcal{GP}(\mu(x), k(x, x'))$
$\Rightarrow$ predictive distribution: $f(x) \sim \mathcal{N}(\mu_n(x), \sigma_n^2(x))$
$\mu_n(x) = \mu(x) + k(x)^\top (K + \sigma^2 I)^{-1} (y - m)$
$\sigma_n^2(x) = k(x, x) - k(x)^\top (K + \sigma^2 I)^{-1} k(x)$
where $k(x)$ is the vector of kernel values between $x$ and the observed inputs, $K$ is the kernel matrix of the observed inputs, and $m$ is the prior mean at the observed inputs
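The GP predictive equations of Example 1 can be sketched in NumPy as follows. This is a minimal illustration, assuming a squared-exponential kernel with unit length scale, a constant prior mean, and a small noise variance; the names `rbf_kernel` and `gp_posterior` are hypothetical and do not come from the slides.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between 1D input arrays a and b (assumed kernel)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_tgt, noise_var=1e-6, prior_mean=0.0):
    """Posterior mean mu_n(x) and variance sigma_n^2(x) at each target input."""
    K = rbf_kernel(x_obs, x_obs) + noise_var * np.eye(len(x_obs))  # K + sigma^2 I
    k_star = rbf_kernel(x_obs, x_tgt)                              # columns are k(x)
    alpha = np.linalg.solve(K, y_obs - prior_mean)                 # (K + sigma^2 I)^{-1} (y - m)
    mean = prior_mean + k_star.T @ alpha
    v = np.linalg.solve(K, k_star)
    var = 1.0 - np.sum(k_star * v, axis=0)                         # k(x, x) = 1 for this kernel
    return mean, var

x_obs = np.array([-1.0, 0.0, 1.5])
y_obs = np.sin(x_obs)
mean, var = gp_posterior(x_obs, y_obs, np.array([0.0, 3.0]))
```

The two linear solves against the n × n matrix are the step that makes exact GP regression scale cubically with the number of observations.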
7. Stochastic Processes ii
[Figure: 1D Gaussian process regression]
Difficulties of ordinary SP approaches
1. It is difficult to design appropriate priors
2. GPs (the typical example) do not scale with the number of data points
→ $O((n + m)^3)$ computational cost is required
8. Conditional Neural Processes i
Conditional stochastic process $Q_\theta(f(\cdot) \mid O, T)$
Predictive ability of NNs + uncertainty modeling of SPs
Assumption 1
1. (permutation invariance)
$Q_\theta(f(T) \mid O, T) = Q_\theta(f(T') \mid O, T') = Q_\theta(f(T) \mid O', T)$
• $O', T'$: permutations of $O, T$, respectively
2. (factorizability)
$Q_\theta(f(T) \mid O, T) = \prod_{x \in T} Q_\theta(f(x) \mid O, x)$
9. Conditional Neural Processes ii Architecture
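[Figure: CNP architecture in three stages: Observe (encoder h applied to each pair (x_i, y_i)), Aggregate (a combines r_1, r_2, r_3 into r), Predict (decoder g applied to each target x_4, x_5, x_6 together with r)]
$r_i = h_\theta(x_i, y_i) \quad \forall (x_i, y_i) \in O$
$r = r_1 \oplus r_2 \oplus \dots \oplus r_{n-1} \oplus r_n$
$\phi_i = g_\theta(x_i, r) \quad \forall x_i \in T$
• $h_\theta, g_\theta$: neural networks
• $\oplus$: commutative aggregation operation, here the mean: $r_1 \oplus r_2 \oplus \dots \oplus r_{n-1} \oplus r_n = \frac{1}{n} \sum_{i=1}^{n} r_i$
$Q_\theta(f(x_i) \mid O, x_i) = Q(f(x_i) \mid \phi_i)$
• $\phi_i = (\mu_i, \sigma_i^2)$: parameters of the Gaussian $\mathcal{N}(\mu_i, \sigma_i^2)$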
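The observe / aggregate / predict pipeline can be sketched as below. This is an untrained toy with random weights, assuming small tanh MLPs for h and g and a decoder that outputs the mean and log-variance; the layer sizes and the helper names `mlp`, `forward`, and `cnp_predict` are illustrative assumptions, not the architecture from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """Random (untrained) weights standing in for h_theta / g_theta."""
    return [(rng.normal(0.0, 0.5, (i, o)), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b

h_params = mlp([2, 16, 8])       # encoder h: (x_i, y_i) -> r_i
g_params = mlp([1 + 8, 16, 2])   # decoder g: (x_i, r) -> (mu_i, log sigma_i^2)

def cnp_predict(x_obs, y_obs, x_tgt):
    r_i = forward(h_params, np.stack([x_obs, y_obs], axis=1))  # one r_i per observation
    r = r_i.mean(axis=0)                                       # mean aggregation
    inp = np.concatenate([x_tgt[:, None], np.tile(r, (len(x_tgt), 1))], axis=1)
    out = forward(g_params, inp)
    return out[:, 0], np.exp(out[:, 1])                        # mu_i, sigma_i^2 > 0

x_obs = np.array([-1.0, 0.3, 2.0])
y_obs = np.sin(x_obs)
x_tgt = np.linspace(-2.0, 2.0, 4)
mu, var = cnp_predict(x_obs, y_obs, x_tgt)
```

Because the aggregation is a mean, reordering the observations leaves the prediction unchanged, and the cost is one encoder pass per observation plus one decoder pass per target.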
10. Conditional Neural Processes ii Architecture
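• The structure is very close to that of a VAE
→ h and a correspond to the VAE encoder and obtain the latent representation r from the input data
• Difference from VAE, no. 1: the latent representation is learned from the output y in addition to the input x
• Difference from VAE, no. 2: the latent representation r is not a random variable; it is determined by the sum of the per-datapoint representations r_1, ..., r_n
• Difference no. 2, the use of latent representations computed independently for each data point, is the cause of the "completions that are not coherent across the whole image" explained later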
11. Conditional Neural Processes iii Training
Optimization Problem
Minimization of the negative conditional log probability
$\theta^* = \arg\min_\theta \mathcal{L}(\theta)$
$\mathcal{L}(\theta) = -\mathbb{E}_{f \sim P}\left[ \mathbb{E}_N \left[ \log Q_\theta\left( \{y_i\}_{i=0}^{n-1} \mid O_N, \{x_i\}_{i=0}^{n-1} \right) \right] \right]$
• $f \sim P$: prior process
• $N \sim \mathrm{Unif}(0, n - 1)$
• $O_N = \{(x_i, y_i)\}_{i=0}^{N} \subset O$
Practical implementation: gradient descent
1. Sample f and N
2. Compute a Monte Carlo estimate of the gradient of $\mathcal{L}(\theta)$
3. Take a gradient descent step with the estimated gradient
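One Monte Carlo sample of the objective can be sketched as follows, assuming a factorized Gaussian predictive distribution. The `predict` callback is a hypothetical stand-in for a CNP, and `toy_predict` is only a placeholder model; the gradient step itself would be handled by an autodiff framework and is not shown.

```python
import numpy as np

def gaussian_nll(y, mu, var):
    """Negative log density of y under independent N(mu_i, var_i), summed over points."""
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (y - mu) ** 2 / var)

def cnp_loss_sample(x, y, predict, rng):
    """One sample of L(theta) for one function: draw N, condition on O_N, score all n points."""
    n = len(x)
    N = rng.integers(0, n)                         # N ~ Unif(0, n - 1)
    mu, var = predict(x[: N + 1], y[: N + 1], x)   # O_N = first N + 1 pairs
    return gaussian_nll(y, mu, var)

def toy_predict(x_obs, y_obs, x_tgt):
    # Placeholder model: predicts the context mean with unit variance.
    return np.full(len(x_tgt), y_obs.mean()), np.ones(len(x_tgt))

rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 10)
y = np.sin(x)                                      # stands in for one sampled f ~ P
loss = np.mean([cnp_loss_sample(x, y, toy_predict, rng) for _ in range(16)])
```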
12. Function Regression i Setting
Dataset
1. random samples from a GP with a fixed kernel and fixed parameters
2. random samples from a GP that switches between two kernels
network architectures
• $h_\theta$: 3-layer MLP with 128-dim output $r_i$, i = 1, ..., 128
• $r = \frac{1}{128} \sum_{i=1}^{128} r_i$: aggregation
• $g_\theta$: 5-layer MLP, $g_\theta(x_i, r) = (\mu_i, \sigma_i^2)$ (mean & variance of a Gaussian)
• Adam (optimizer)
13. Function Regression ii Results
14. Image Completion i Setting
Dataset
1. MNIST ($f : [0, 1]^2 \to [0, 1]$)
Complete the entire image from a small number of observations
2. CelebA ($f : [0, 1]^2 \to [0, 1]^3$)
Complete the entire image from a small number of observations
network architectures
• the same model architecture as for 1D function regression, except for
  • input layer: 2D pixel coordinates normalized to $[0, 1]^2$
  • output layer: color intensity of the corresponding pixel
15. Image Completion ii Results
• 1 (non-informative) observation point
→ prediction corresponds to the average over all digits
17. Image Completion iii Latent Variable Model
Original CNPs
• The model returns factored outputs (sample-wise independent modeling)
→ with limited data points, the best prediction is the average over all possible predictions
• It cannot sample different coherent images of all the possible digits conditioned on the observations
← GPs can do this thanks to the kernel function
• By adding latent variables, CNPs can recover this property
The latent variable model of CNPs is the same as the Neural Processes described later
18. Image Completion iv Latent Variable Model
$z \sim \mathcal{N}(\mu, \sigma^2)$
$r = (\mu, \sigma^2) = h_\theta(X, Y)$
$\phi_i = (\mu_i, \sigma_i^2) = g_\theta(x_i, z)$
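A small sketch of why a shared latent variable yields coherent samples: one z is drawn per sample and reused for every target point. The decoder below is a hypothetical toy (fixed sinusoidal features weighted by z), assumed only for illustration; in the model, $\mu$ and $\sigma^2$ would come from $h_\theta(X, Y)$.

```python
import numpy as np

def sample_z(mu, var, rng):
    """Draw z ~ N(mu, diag(var)) as z = mu + sigma * eps."""
    return mu + np.sqrt(var) * rng.standard_normal(mu.shape)

def toy_decoder(x_tgt, z):
    """Hypothetical stand-in for g_theta(x_i, z); returns the predictive mean only."""
    feats = np.stack([np.sin(x_tgt), np.cos(x_tgt), np.ones_like(x_tgt)], axis=1)
    return feats @ z

rng = np.random.default_rng(0)
mu, var = np.zeros(3), np.ones(3)   # placeholder for r = (mu, sigma^2)
x_tgt = np.linspace(-3.0, 3.0, 50)
# Each row is one function sample; all 50 targets in a row share the same z.
samples = np.stack([toy_decoder(x_tgt, sample_z(mu, var, rng)) for _ in range(5)])
```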
19. Classification i Settings
Dataset
• Omniglot
• 1,623 classes of characters from 50 different alphabets
• suitable for few-shot learning
• N-way classification task
• N classes are randomly chosen at each training step
network architectures
• encoder h: includes convolutional layers
• aggregation r: class-wise aggregation followed by concatenation
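Class-wise aggregation followed by concatenation might look like the sketch below, assuming the per-example representations $r_i$ have already been computed by the encoder and that integer labels 0, ..., N-1 are available. The function name `classwise_aggregate` is an assumption for illustration.

```python
import numpy as np

def classwise_aggregate(r_i, labels, num_classes):
    """Average the representations within each class, then concatenate the class means."""
    class_means = [r_i[labels == c].mean(axis=0) for c in range(num_classes)]
    return np.concatenate(class_means)

r_i = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # three encoded examples
labels = np.array([0, 0, 1])                          # two examples of class 0, one of class 1
r = classwise_aggregate(r_i, labels, num_classes=2)
```

The resulting vector has length N times the representation size, so the decoder sees one slot per class regardless of how many shots each class has.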
20. Classification ii Results
[Figure: CNP classification pipeline (Observe, Aggregate, Predict) with encoder h, aggregator a, decoder g, and classes A to E]
          5-way Acc           20-way Acc          Runtime
          1-shot    5-shot    1-shot    5-shot
MANN      82.8%     94.9%     -         -         O(nm)
MN        98.1%     98.9%     93.8%     98.5%     O(nm)
CNP       95.3%     98.5%     89.9%     96.8%     O(n + m)
21. Section 2: Neural Processes [Garnelo+ (ICML2018WS)]
22. Generative Model i
Assumption 2
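1. (Exchangeability)
The distribution does not change when the order of the inputs x and outputs y is permuted
$\rho_{x_{1:n}}(y_{1:n}) = \rho_{\pi(x_{1:n})}(\pi(y_{1:n}))$
2. (Consistency)
The distribution for a sequence $D_m = \{(x_i, y_i)\}_{i=1}^{m}$ equals the distribution obtained from a longer sequence containing it by marginalizing out everything other than $D_m$
$\rho_{x_{1:m}}(y_{1:m}) = \int \rho_{x_{1:n}}(y_{1:n}) \, dy_{m+1:n}$
3. (Decomposability)
Independence (factorization) assumption on the observation model
$p(y_{1:n} \mid f, x_{1:n}) = \prod_{i=1}^{n} \mathcal{N}(y_i \mid f(x_i), \sigma^2)$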
23. Generative Model ii
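Marginal distribution of the observations when f is a sample from a stochastic process
$\rho_{x_{1:n}}(y_{1:n}) = \int p(y_{1:n} \mid f, x_{1:n}) \, p(f) \, df = \int \prod_{i=1}^{n} \mathcal{N}(y_i \mid f(x_i), \sigma^2) \, p(f) \, df$
When f is modeled by a NN with a latent variable, $g(x, z)$, the generative model is
$p(z, y_{1:n} \mid x_{1:n}) = \prod_{i=1}^{n} \mathcal{N}(y_i \mid g(x_i, z), \sigma^2) \, p(z)$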
24. Evidence Lower-Bound (ELBO)
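Letting $q(z \mid x_{1:n}, y_{1:n})$ be the variational posterior of the latent variable z, the ELBO is
$\log p(y_{1:n} \mid x_{1:n}) \ge \mathbb{E}_{q(z \mid x_{1:n}, y_{1:n})} \left[ \sum_{i=1}^{n} \log p(y_i \mid z, x_i) + \log \frac{p(z)}{q(z \mid x_{1:n}, y_{1:n})} \right]$
In particular, at prediction time the data are split into observed data and test data:
$\log p(y_{m+1:n} \mid x_{1:m}, x_{m+1:n}, y_{1:m})$
$\ge \mathbb{E}_{q(z \mid x_{m+1:n}, y_{m+1:n})} \left[ \sum_{i=m+1}^{n} \log p(y_i \mid z, x_i) + \log \frac{p(z \mid x_{1:m}, y_{1:m})}{q(z \mid x_{m+1:n}, y_{m+1:n})} \right]$
$\approx \mathbb{E}_{q(z \mid x_{m+1:n}, y_{m+1:n})} \left[ \sum_{i=m+1}^{n} \log p(y_i \mid z, x_i) + \log \frac{q(z \mid x_{1:m}, y_{1:m})}{q(z \mid x_{m+1:n}, y_{m+1:n})} \right]$
The reason p is approximated by q is to avoid the $O(m^3)$ cost of computing the conditional distribution given the observed data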
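The approximate bound can be evaluated with a single sample of z as sketched below, assuming diagonal Gaussian distributions for both q terms and a Gaussian likelihood. The expectation of the log-ratio term is minus the KL divergence between the two q distributions, which has a closed form here; the function names and the numbers in the usage lines are illustrative assumptions.

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag var_q) || N(mu_p, diag var_p) ) in closed form."""
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def np_elbo(y_tgt, mu_y, var_y, q_tgt, q_ctx):
    """Single-sample ELBO: target log-likelihood minus KL(q(z | targets) || q(z | context)).
    mu_y, var_y are decoder outputs for one z drawn from q(z | targets)."""
    log_lik = -0.5 * np.sum(np.log(2 * np.pi * var_y) + (y_tgt - mu_y) ** 2 / var_y)
    return log_lik - kl_diag_gauss(*q_tgt, *q_ctx)

# Toy numbers: q_tgt and q_ctx are (mean, variance) pairs of the latent distribution.
q_tgt = (np.array([0.2, -0.1]), np.array([0.5, 0.5]))
q_ctx = (np.array([0.0, 0.0]), np.array([1.0, 1.0]))
y_tgt = np.array([0.1, 0.4, -0.3])
elbo = np_elbo(y_tgt, mu_y=np.zeros(3), var_y=np.ones(3), q_tgt=q_tgt, q_ctx=q_ctx)
```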
25. Architectures
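[Figure: NP architecture. Context pairs (x_1, y_1), (x_2, y_2), (x_3, y_3) are encoded by h_θ into r_1, r_2, r_3 and aggregated by a into r; z is sampled from r; the decoder g_θ maps the targets x_4, x_5, x_6 together with z to the predictions ŷ_4, ŷ_5, ŷ_6]
$z \sim \mathcal{N}(\mu(r), \sigma^2(r) I)$
$g_\theta(x_i) = P(y \mid z, x_i)$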
26. Comparing Architectures : VAE, CNPs & NPs
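[Figure: architectures of the three models side by side]
• VAE: input X, encoder $q_\phi(z \mid X)$, latent $z \sim \mathcal{N}(0, I)$, decoder $p_\theta(X \mid z)$, output $\hat{X}$
• CNPs: inputs X, Y, encoder $h_\theta(x_i, y_i)$, aggregation $r = \sum_{i=1}^{n} r_i$, decoder $g_\theta(Y \mid \hat{X}, r)$, output $\hat{Y}$
• NPs: inputs X, Y, encoder $h_\theta(x_i, y_i)$, aggregation $r = \sum_{i=1}^{n} r_i$, latent $z \sim \mathcal{N}(\mu(r), \sigma^2(r) I)$, decoder $g_\theta(Y \mid \hat{X}, z)$, output $\hat{Y}$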
27. Black-Box Optimization with Thompson Sampling
Neural process    Gaussian process    Random Search
0.26              0.14                1.00
28. Section 3: Attentive Neural Processes [Kim+ (ICLR2019)]
29. Recall : Neural Processes
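[Figure: NEURAL PROCESS, encoder and decoder. Deterministic path: each context pair (x_i, y_i) passes through MLP_θ to give r_i, and the r_i are averaged (m = mean) into r_C. Latent path: each context pair passes through MLP_Ψ to give s_i, and the s_i are averaged into s_C, from which z is sampled. Decoder: an MLP maps (x_*, r_C, z) to y_*]
• A version that incorporates both the latent representation r and the latent variable z into the model
• Trained with the ELBO as the objective function
$\log p(y_T \mid x_T, x_C, y_C) \ge \mathbb{E}_{q(z \mid s_T)} [\log p(y_T \mid x_T, r_C, z)] - D_{\mathrm{KL}}(q(z \mid s_T) \,\|\, q(z \mid s_C))$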
30. Motivation i
The original NP tends to underfit the context set
38. Experiment 1 : 1D Function regression on synthetic GP data
[Figure 3 of Kim+ (ICLR2019): qualitative and quantitative results of different attention mechanisms for 1D GP function regression with random kernel hyperparameters. Left: moving average of context reconstruction error (top) and target negative log likelihood (NLL) given contexts (bottom), plotted against training iteration (left) and wall clock time (right). Right: predictive mean and variance of different attention mechanisms given the same context]
d = hidden layer size of MLPs = dimensionality of r = dimensionality of z (the bottleneck size)
• (A)NPs are trained on data generated from a Gaussian Process with a squared-exponential kernel and small likelihood noise
• (A)NPs need not be trained on GP data or data generated from a known stochastic process; this is just an illustrative example
• 2 GP settings: kernel hyperparameters fixed throughout training, or varying randomly
• context points n, target points m, iteration
• ANPs: self-attention, cross-attention
Context: $\frac{1}{|C|} \sum_{i \in C} \mathbb{E}_{q(z \mid s_C)} [\log p(y_i \mid x_i, r(x_C, y_C, x_i), z)]$
Target: $\frac{1}{|T|} \sum_{i \in T} \mathbb{E}_{q(z \mid s_C)} [\log p(y_i \mid x_i, r(x_C, y_C, x_i), z)]$
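The target-specific representation $r(x_C, y_C, x_i)$ in the formulas above can be illustrated with a cross-attention aggregator. The sketch below uses single-head scaled dot-product attention as a simplified stand-in, with target inputs as queries, context inputs as keys, and the per-context representations $r_i$ as values; these choices and the name `cross_attention` are assumptions made for illustration rather than the exact ANP configuration.

```python
import numpy as np

def cross_attention(x_tgt, x_ctx, r_ctx):
    """Scaled dot-product attention: each target gets its own weighted mean of the r_i."""
    d_k = x_ctx.shape[1]
    logits = x_tgt @ x_ctx.T / np.sqrt(d_k)       # (m, n) query-key scores
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability for the softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ r_ctx, weights               # (m, d_r), (m, n)

rng = np.random.default_rng(0)
x_ctx = rng.normal(size=(4, 1))   # n = 4 context inputs
r_ctx = rng.normal(size=(4, 3))   # their encoded representations r_i
x_tgt = rng.normal(size=(6, 1))   # m = 6 target inputs
r_tgt, weights = cross_attention(x_tgt, x_ctx, r_ctx)
```

With uniform weights this reduces to the mean aggregation of the original NP; attention lets each target weight the context points differently, at a cost of O(nm) instead of O(n + m).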
39. Experiment 1 : 1D Function regression on synthetic GP data
[Figure 1 of Kim+ (ICLR2019): comparison of predictions given by NP and Attentive NP for 1D function regression (left) and 2D image regression (right); the ANP predictions are noticeably more accurate than those of NP at the context points]
• inaccurate predictive means
• overestimated variances at the input locations
→ Multihead Attention
40. Experiment 1 : 1D Function regression on synthetic GP data
41. Experiment 2 : 2D Function regression on image data
• $p(y_T \mid x_T, r_C, z)$
42. Experiment 2 : 2D Function regression on image data
43. Section 4: Meta-Learning surrogate models for sequential decision making [Galashov+ (ICLR2019WS)]
45. BO from Meta-Learning Viewpoints
Key Observation
We can sample functions similar to the target function from a prior distribution (e.g. GP)
Algorithm 1 Bayesian Optimisation
Input:
  $f^*$ - Target function of interest (= $T^*$).
  $D_0 = \{(x_0, y_0)\}$ - Observed evaluations of $f^*$.
  $N$ - Maximum number of function iterations.
  $M_\theta$ - Model pre-trained on evaluations of similar functions $f_1, \dots, f_n \sim p(T)$.
for n = 1, ..., N do
  // Model-adaptation
  Optimise $\theta$ to improve M's prediction on $D_{n-1}$.
  Thompson sampling: Draw $\hat{g}_n \sim M$, find $x_n = \arg\min_{x \in X} \mathbb{E}[\hat{g}(y \mid x)]$
  Evaluate target function and save result.
  $D_n \leftarrow D_{n-1} \cup \{(x_n, f^*(x_n))\}$
end for
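The loop of Algorithm 1 can be sketched as below. Note the substitution: instead of a pre-trained neural process, this sketch uses a GP surrogate with a squared-exponential kernel on a finite candidate grid, so the model-adaptation step (optimising θ) has no counterpart here. The function names, kernel length scale, and test function are all assumptions for illustration.

```python
import numpy as np

def rbf(a, b, ls=0.3):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def thompson_step(x_obs, y_obs, cand, rng, noise=1e-4):
    """Draw one function g_hat from the surrogate posterior and return its minimiser."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_obs, cand)
    mean = Ks.T @ np.linalg.solve(K, y_obs)
    cov = rbf(cand, cand) - Ks.T @ np.linalg.solve(K, Ks)
    cov = 0.5 * (cov + cov.T) + 1e-8 * np.eye(len(cand))   # symmetrise, add jitter
    g_hat = rng.multivariate_normal(mean, cov, check_valid="ignore")
    return cand[np.argmin(g_hat)]

def bayes_opt(f_star, x0, cand, n_iter, rng):
    x_obs, y_obs = np.array([x0]), np.array([f_star(x0)])  # D_0
    for _ in range(n_iter):
        x_n = thompson_step(x_obs, y_obs, cand, rng)
        x_obs = np.append(x_obs, x_n)                      # D_n = D_{n-1} plus new pair
        y_obs = np.append(y_obs, f_star(x_n))
    return x_obs, y_obs

rng = np.random.default_rng(0)
f_star = lambda x: (x - 0.6) ** 2      # hypothetical target function
cand = np.linspace(0.0, 1.0, 50)
x_obs, y_obs = bayes_opt(f_star, 0.0, cand, n_iter=15, rng=rng)
```

Sampling a whole function and minimising it, rather than minimising the posterior mean, is what lets the procedure trade off exploration against exploitation.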
46. NPs as Meta-Learning Model
Use neural processes as the model M because of
1. statistical efficiency
Accurate predictions of function values based on a small number of evaluations
2. calibrated uncertainties
Balance exploration and exploitation
3. O(n + m) computational complexity
4. non-parametric modeling
→ No need to set hyperparameters such as the learning rate and update frequency, as in MAML
47. Experiments : Bayesian Optimization via NPs
Adversarial task search for RL agents [Ruderman+ (2018)]
• Search problem over adversarially designed 3D mazes
  • trivially solvable by human players
  • but RL agents fail catastrophically
• Notation
  • $f_A$: mapping from task parameters to the performance r of a given agent A
  • parameters of the task
    • M: maze layout
    • $p_s, p_g$: start and goal positions
Problem setup
1. Position search: $(p_s^*, p_g^*) = \arg\min_{p_s, p_g} f_A(M, p_s, p_g)$
2. Full maze search: $(M^*, p_s^*, p_g^*) = \arg\min_{M, p_s, p_g} f_A(M, p_s, p_g)$
48. Experiments : Bayesian Optimization via NPs
(a) Position search results (b) Full maze search results
Figure 2: Bayesian Optimisation results. Left: Position search Right: Full maze search. We report
the minimum up to iteration t (scaled in [0,1]) as a function of the number of iterations. Bold
lines show the mean performance over 4 unseen agents on a set of held-out mazes. We also show
20% of the standard deviation. Baselines: GP: Gaussian Process (with a linear and Matern 3/2
product kernel [Bonilla et al., 2008]), BBB: Bayes by Backprop [Blundell et al., 2015], AlphaDiv:
AlphaDivergence [Hernández-Lobato et al., 2016], DKL: Deep Kernel Learning [Wilson et al., 2016].
49. Section 5: On the Connection between Neural Processes and Gaussian Processes with Deep Kernels [Rudner+ (NeurIPS2018WS)]
50. Contributions
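Shows a theoretical relationship between Neural processes (NPs) and Gaussian processes (GPs). In particular, shows that under certain conditions NPs are mathematically equivalent to GPs that use deep kernels as the kernel function.
• The theory of GPs may become applicable to NPs
• Suggests a training approach in which a deep kernel GP is trained once to obtain a covariance function that is applicable to different prediction tasks
Approach
• The same ELBO arises for the deep kernel GP and the NP
• As generative models, the two coincide when the NP decoder is written as the inner product of the deep kernel NN and the latent variable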
51. Gaussian Processes with Deep Kernels i
Notation
• $x_{1:n}, y_{1:n}$: observed points
• $f : \mathbb{R}^p \to \mathbb{R}$: true function
• GP model: $p(\mathbf{f} \mid x_{1:n}) = \mathcal{N}(\mathbf{m}, K)$, $p(y_{1:n} \mid \mathbf{f}) = \mathcal{N}(\mathbf{f}, \tau^{-1} I)$
where $\mathbf{f} = (f(x_1), \dots, f(x_n))$, $\mathbf{m} = (m(x_1), \dots, m(x_n))$, $K_{ij} = k(x_i, x_j)$
52. Gaussian Processes with Deep Kernels ii
Definition 1 (deep kernel [Tsuda+ (2002)])
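$k(x_i, x_j) := \frac{1}{d} \sum_{l, l' = 1}^{d} \sigma(w_l^\top x_i + b_l) \, \Sigma_{l l'} \, \sigma(w_{l'}^\top x_j + b_{l'})$
• $\sigma(w_l^\top x_i + b_l)$ is a one-layer NN; w, b are model parameters and $\sigma(\cdot)$ is the activation function
• $\Sigma = (\Sigma_{l l'})_{l, l' = 1}^{d}$ is a positive semidefinite matrix
Matrix notation
With $\phi_i := \phi(x_i, W, b) = \sqrt{\frac{1}{d}} \, \sigma(W^\top x_i + b) \in \mathbb{R}^d$ and $\Phi = [\phi_1, \dots, \phi_n]^\top \in \mathbb{R}^{n \times d}$, we have $k(X, X) = \Phi \Sigma \Phi^\top$.
Below, the GP mean function is assumed to take the form
$m(X) = \Phi \mu, \quad \mu \in \mathbb{R}^d$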
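The matrix form of the deep kernel can be sketched directly in NumPy. The choice of tanh as the activation, the random parameters, and the names `features` and `deep_kernel` are assumptions for illustration; Σ is built as $A A^\top$ so that it is positive semidefinite by construction.

```python
import numpy as np

def features(X, W, b):
    """Rows are phi(x_i) = sqrt(1/d) * sigma(W^T x_i + b); result Phi has shape (n, d)."""
    d = W.shape[1]
    return np.tanh(X @ W + b) / np.sqrt(d)

def deep_kernel(X, W, b, Sigma):
    """Gram matrix k(X, X) = Phi Sigma Phi^T."""
    Phi = features(X, W, b)
    return Phi @ Sigma @ Phi.T

rng = np.random.default_rng(0)
n, p, d = 5, 2, 16
X = rng.normal(size=(n, p))
W, b = rng.normal(size=(p, d)), rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T                     # positive semidefinite
K = deep_kernel(X, W, b, Sigma)
```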
53. Gaussian Processes with Deep Kernels iii
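Evidence (marginal likelihood) obtained by integrating out the latent function
$p(y \mid X) = \int p(y, \mathbf{f} \mid X) \, d\mathbf{f} = \int p(y \mid \mathbf{f}) \, p(\mathbf{f} \mid X) \, d\mathbf{f} = \mathcal{N}(\Phi \mu, \Phi \Sigma \Phi^\top + \tau^{-1} I_n)$
To relate this to the generative model of NPs, introduce a latent variable
$z \sim \mathcal{N}(\mu, \Sigma)$
Then the evidence above is also obtained by marginalizing over z
$p(y \mid X) = \int p(y \mid X, z) \, p(z) \, dz = \int \mathcal{N}(\Phi z, \tau^{-1} I_n) \, \mathcal{N}(\mu, \Sigma) \, dz = \mathcal{N}(\Phi \mu, \Phi \Sigma \Phi^\top + \tau^{-1} I_n)$
In particular, when $z \sim \mathcal{N}(0, I_d)$, $p(y \mid X) = \mathcal{N}(0, \Phi \Phi^\top + \tau^{-1} I_n)$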
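The marginalization over z can be checked numerically: sampling z and then y from the latent-variable model should reproduce the closed-form mean and covariance of the evidence. The sketch below is a Monte Carlo sanity check with arbitrary random Φ, µ, Σ (all assumed for illustration), not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, tau = 3, 4, 10.0
Phi = rng.normal(size=(n, d)) / np.sqrt(d)
mu = rng.normal(size=d)
A = rng.normal(size=(d, d)) / np.sqrt(d)
Sigma = A @ A.T

# Generative view: z ~ N(mu, Sigma), then y = Phi z + noise with precision tau.
S = 200_000
z = mu + rng.standard_normal((S, d)) @ A.T
y = z @ Phi.T + rng.standard_normal((S, n)) / np.sqrt(tau)

# Closed-form evidence: N(Phi mu, Phi Sigma Phi^T + I / tau).
mean_closed = Phi @ mu
cov_closed = Phi @ Sigma @ Phi.T + np.eye(n) / tau

mean_err = np.abs(y.mean(axis=0) - mean_closed).max()
rel_cov_err = np.abs(np.cov(y, rowvar=False) - cov_closed).max() / np.abs(cov_closed).max()
```

Both errors shrink at the usual Monte Carlo rate as the number of samples S grows.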
54. Gaussian Processes with Deep Kernels iv
Computing the evidence $p(y \mid X)$ requires inverting the covariance matrix $\Phi \Sigma \Phi^\top + \tau^{-1} I_n$ ($O(n^3)$ computational cost)
→ Reduce the cost by replacing it with the evidence lower bound (ELBO)
$\log p(Y \mid X) \ge \mathbb{E}_{q(z \mid X)}[\log p(Y \mid z, X)] - \mathrm{KL}(q(z \mid X) \,\|\, p(z))$
55. Agreement of the ELBOs at Prediction Time i
Write the evidence lower bound of deep kernel GPs with the observed data and the test data separated explicitly (C = 1 : m and T = m + 1 : n denote the observed data and the test data, respectively)
$\log p(Y_T \mid X_T, X_C, Y_C) \ge \mathbb{E}_{q(z \mid X_T, Y_T)} [\log p(Y_T \mid z, X_T)] - \mathrm{KL}(q(z \mid X_T, Y_T) \,\|\, p(z \mid X_C, Y_C))$
Here $p(z \mid X_C, Y_C)$ is a "data-driven" prior set from the observed data $X_C, Y_C$
$p(z \mid X_C, Y_C) = \mathcal{N}(\mu(X_C, Y_C), \Sigma(X_C, Y_C))$
As was done for NPs, this is approximated by the variational posterior
$p(z \mid X_C, Y_C) \approx q(z \mid X_C, Y_C)$
56. Agreement of the ELBOs at Prediction Time ii
Under the above, the ELBO of the deep kernel GP is
$\mathbb{E}_{q(z \mid X_T, Y_T)} [\log p(Y_T \mid z, X_T)] - \mathrm{KL}(q(z \mid X_T, Y_T) \,\|\, q(z \mid X_C, Y_C))$
On the other hand, the ELBO of NPs is
$\log p(y_{m+1:n} \mid x_{1:m}, x_{m+1:n}, y_{1:m})$
$\ge \mathbb{E}_{q(z \mid x_{m+1:n}, y_{m+1:n})} \left[ \sum_{i=m+1}^{n} \log p(y_i \mid z, x_i) + \log \frac{p(z \mid x_{1:m}, y_{1:m})}{q(z \mid x_{m+1:n}, y_{m+1:n})} \right]$
$\approx \mathbb{E}_{q(z \mid x_{m+1:n}, y_{m+1:n})} \left[ \sum_{i=m+1}^{n} \log p(y_i \mid z, x_i) + \log \frac{q(z \mid x_{1:m}, y_{1:m})}{q(z \mid x_{m+1:n}, y_{m+1:n})} \right]$
so the ELBOs also coincide if the two generative models are the same
57. Generative Models
Generative model of NPs:
$p(Y \mid z, X) \, p(z) = \mathcal{N}(Y; g_\theta(z, X), \tau^{-1} I) \, \mathcal{N}(z; \mu, \Sigma)$
Generative model of deep kernel GPs with a latent variable:
$p(Y \mid z, X) \, p(z) = \mathcal{N}(Y; \Phi z, \tau^{-1} I) \, \mathcal{N}(z; \mu, \Sigma)$
Comparing the two, they coincide if we take $g_\theta(z, X) = \Phi z$. More generally, with an L-layer deep NN $\Phi_\Theta(\cdot)$ with parameters $\Theta = \{W^\ell, b^\ell\}_{\ell=1}^{L}$, the two coincide when using an affine decoder of the form
$g_\theta(z, X) = \Phi_\Theta(X) z$
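An affine decoder of this form can be sketched as a deep feature map followed by a product with z. The layer sizes, tanh activations, the $1/\sqrt{d}$ scaling carried over from the one-layer definition, and the names `deep_features` and `affine_decoder` are assumptions for illustration.

```python
import numpy as np

def deep_features(X, Theta):
    """L-layer network Phi_Theta(X); returns an (n, d) feature matrix."""
    H = X
    for W, b in Theta:
        H = np.tanh(H @ W + b)
    return H / np.sqrt(H.shape[1])

def affine_decoder(z, X, Theta):
    """g_theta(z, X) = Phi_Theta(X) z: the predictive mean, linear in z."""
    return deep_features(X, Theta) @ z

rng = np.random.default_rng(0)
sizes = [2, 8, 8, 4]                                   # p = 2 inputs, d = 4 features, L = 3
Theta = [(rng.normal(size=(i, o)), rng.normal(size=o))
         for i, o in zip(sizes[:-1], sizes[1:])]
X = rng.normal(size=(6, 2))
z1, z2 = rng.normal(size=4), rng.normal(size=4)
y_mean = affine_decoder(z1, X, Theta)
```

All nonlinearity sits in the feature map; because the decoder is linear in z, a Gaussian z yields a Gaussian distribution over function values, which is what makes the GP correspondence go through.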
58. Section 6: Conclusion
59. Summary
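• The NPs family directly models the conditional distribution used to predict the output y
• The prediction cost, which is $O((m + n)^3)$ for GP regression, is reduced to $O(m + n)$
• Applications to BO have already been considered (on some problems they outperform GP-based BO)
• ANPs, which use attention to derive the latent representation and latent variable, return regression results closer to those of GPs
• NPs can be regarded as an operation equivalent to using a deep kernel in GP regression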
61. References
[1] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1126–1135. JMLR.org, 2017.
[2] Alexandre Galashov, Jonathan Schwarz, Hyunjik Kim, Marta Garnelo, David Saxton, Pushmeet Kohli, SM Eslami, and Yee Whye Teh. Meta-learning surrogate models for sequential decision making. arXiv preprint arXiv:1903.11907, 2019.
[3] Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo Rezende, and SM Ali Eslami. Conditional neural processes. In International Conference on Machine Learning, pages 1690–1699, 2018.
[4] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, SM Eslami, and Yee Whye Teh. Neural processes. arXiv preprint arXiv:1807.01622, 2018.
[5] Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum, Oriol Vinyals, and Yee Whye Teh. Attentive neural processes. arXiv preprint arXiv:1901.05761, 2019.
[6] Tim GJ Rudner, Vincent Fortuin, Yee Whye Teh, and Yarin Gal. On the connection between neural processes and Gaussian processes with deep kernels. In Workshop on Bayesian Deep Learning, NeurIPS, 2018.