This presentation introduces decision making with approximate Bayesian methods. It reviews Bayesian decision theory and variational inference, and then describes Loss Calibrated Variational Inference.
Loss Calibrated Variational Inference
Tomasz Kuśmierczyk
Joseph Sakaya
October 17, 2019
Tomasz Kuśmierczyk Joseph Sakaya Loss Calibration
Outline of the Talk
Recap of Lecture 11 - Variational Inference
Reparameterization gradients
Bayesian decision theory
Loss calibrated variational inference: framework
Loss calibration: discrete case
Loss calibration: continuous case
Conclusion
Recap: Lecture 11
Motivation
MCMC approximates posteriors by sampling
Computationally expensive
Asymptotic convergence
Diagnostics can be tricky
Variational inference approximates the posterior with a parametric family of distributions q(θ; λ)
Converts inference to an optimization problem
Scales very well
Does not, in general, converge to the true posterior
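Minimize the KL divergence between a proxy q(θ; λ) and p(θ|D):
$$\mathrm{KL}\big(q(\theta;\lambda)\,\|\,p(\theta\mid D)\big) = \mathbb{E}_{q(\theta;\lambda)}\left[\log \frac{q(\theta;\lambda)}{p(\theta\mid D)}\right]$$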
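Because the KL divergence is an expectation under q, it can be estimated by sampling from q. The sketch below is only an illustration: it assumes both q and the target are univariate Gaussians (so that a closed form exists for comparison), and the names `kl_closed_form` and `kl_monte_carlo` are hypothetical helpers, not part of the talk.

```python
import math
import random


def log_normal_pdf(x, mean, std):
    """Log density of a univariate Gaussian N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2


def kl_closed_form(mean_q, std_q, mean_p, std_p):
    """KL(q || p) for two univariate Gaussians."""
    return (math.log(std_p / std_q)
            + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2 * std_p ** 2)
            - 0.5)


def kl_monte_carlo(mean_q, std_q, mean_p, std_p, num_samples=100_000, seed=0):
    """Estimate KL(q || p) = E_q[log q - log p] with samples drawn from q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        theta = rng.gauss(mean_q, std_q)
        total += log_normal_pdf(theta, mean_q, std_q) - log_normal_pdf(theta, mean_p, std_p)
    return total / num_samples


# Assumed toy setting: q = N(0, 1) and a stand-in "posterior" p = N(1, 2^2).
exact = kl_closed_form(0.0, 1.0, 1.0, 2.0)
approx = kl_monte_carlo(0.0, 1.0, 1.0, 2.0)
print(f"closed form: {exact:.4f}, Monte Carlo: {approx:.4f}")
```

In practice p(θ|D) cannot be evaluated, which is why the following slides replace this quantity with the ELBO.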
Variational Inference
Evidence Lower Bound
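Consider the identity:
$$\log p(D) = \mathrm{KL}\big(q(\theta;\lambda)\,\|\,p(\theta\mid D)\big) + \underbrace{\mathbb{E}_{q(\theta;\lambda)}\left[\log p(D,\theta) - \log q(\theta;\lambda)\right]}_{\text{ELBO } \mathcal{L}(\lambda)}$$
Minimizing the KL divergence is the same as maximizing L(λ), since log p(D) is constant w.r.t. λ.
Therefore,
$$\lambda^{*} = \arg\max_{\lambda} \mathcal{L}(\lambda) \equiv \arg\min_{\lambda} \mathrm{KL}\big(q(\theta;\lambda)\,\|\,p(\theta\mid D)\big).$$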
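The decomposition of log p(D) into a KL term and the ELBO follows in a few steps from the product rule p(D, θ) = p(θ|D) p(D). A short derivation, using only the symbols already defined on this slide:

```latex
\begin{align*}
\log p(D)
  &= \mathbb{E}_{q(\theta;\lambda)}\left[\log p(D)\right] \\
  &= \mathbb{E}_{q(\theta;\lambda)}\left[\log \frac{p(D,\theta)}{p(\theta\mid D)}\right] \\
  &= \mathbb{E}_{q(\theta;\lambda)}\left[\log \frac{p(D,\theta)}{q(\theta;\lambda)}\right]
   + \mathbb{E}_{q(\theta;\lambda)}\left[\log \frac{q(\theta;\lambda)}{p(\theta\mid D)}\right] \\
  &= \mathcal{L}(\lambda) + \mathrm{KL}\big(q(\theta;\lambda)\,\|\,p(\theta\mid D)\big)
\end{align*}
```

Since the KL divergence is non-negative, L(λ) ≤ log p(D), which is why L(λ) is called a lower bound on the evidence.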
Reparameterization gradients
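Objective to maximize:
$$\mathcal{L}(\lambda) = \mathbb{E}_{q(\theta;\lambda)}\left[\log p(D,\theta) - \log q(\theta;\lambda)\right]$$
$$\lambda^{*} = \arg\max_{\lambda} \mathcal{L}(\lambda)$$
Optimization via gradient-based methods. The gradient ∇_λ L(λ) is hard to estimate directly because the expectation is taken over q(θ; λ), which itself depends on λ.
Use the reparameterization trick to transform E_{q(θ;λ)}[...] into an expectation over a fixed base distribution, E_{q(ε)}[...]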
Reparameterization gradients
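Draw S samples from the base distribution:
$$\epsilon_s \sim q(\epsilon)$$
Transform θ_s = f(ε_s, λ) and evaluate the Monte Carlo estimate of the ELBO:
$$\mathcal{L}(\lambda) \approx \frac{1}{S}\sum_{s=1}^{S}\left[\log p(D,\theta_s) - \log q(\theta_s;\lambda)\right]$$
The Monte Carlo estimate of the gradient now becomes:
$$\nabla_{\lambda}\mathcal{L}(\lambda) \approx \frac{1}{S}\sum_{s=1}^{S}\nabla_{\lambda}\left[\log p(D,\theta_s) - \log q(\theta_s;\lambda)\right]$$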
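The recipe above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the setup used in the talk: log p(D, θ) is replaced by a hypothetical unnormalized Gaussian log density, q(θ; λ) = N(μ, σ²) with λ = (μ, log σ), and the transform is f(ε, λ) = μ + σ·ε. The derivatives are written out by hand (the entropy of q contributes a gradient of 1 with respect to log σ); an automatic differentiation library would normally do this step.

```python
import math
import random

# Hypothetical target: log p(D, theta) = -(theta - 2)^2 / (2 * 0.5^2) + const.
TARGET_MEAN, TARGET_STD = 2.0, 0.5


def dlog_joint(theta):
    """Derivative of the assumed log p(D, theta) with respect to theta."""
    return -(theta - TARGET_MEAN) / TARGET_STD ** 2


def elbo_gradient(mu, log_sigma, num_samples, rng):
    """Reparameterization estimate of the ELBO gradient w.r.t. (mu, log_sigma)."""
    sigma = math.exp(log_sigma)
    grad_mu, grad_log_sigma = 0.0, 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)       # eps_s ~ q(eps), the base distribution
        theta = mu + sigma * eps        # theta_s = f(eps_s, lambda)
        slope = dlog_joint(theta)
        grad_mu += slope                # d theta / d mu = 1
        grad_log_sigma += slope * sigma * eps   # d theta / d log_sigma = sigma * eps
    # Entropy of N(mu, sigma^2) is log_sigma + const, so it adds 1 to the second gradient.
    return grad_mu / num_samples, grad_log_sigma / num_samples + 1.0


def fit(num_steps=2000, step_size=0.01, num_samples=10, seed=0):
    """Stochastic gradient ascent on the ELBO."""
    rng = random.Random(seed)
    mu, log_sigma = 0.0, 0.0
    for _ in range(num_steps):
        grad_mu, grad_log_sigma = elbo_gradient(mu, log_sigma, num_samples, rng)
        mu += step_size * grad_mu
        log_sigma += step_size * grad_log_sigma
    return mu, math.exp(log_sigma)


mu_fit, sigma_fit = fit()
print(f"fitted q: mean {mu_fit:.2f}, std {sigma_fit:.2f}")
```

Because the assumed target is itself Gaussian, the fitted q should land close to its mean and standard deviation, which makes the example easy to sanity-check.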
Bayesian Decision Theory
Decision making under uncertainty characterized by the posterior p(θ|D)
Make optimal decisions h, given p(θ|D) and a utility u(h, θ) defined over the parameters θ
An optimal decision maximises the posterior gain (expected utility):
$$G(h) = \int_{\Theta} u(h,\theta)\,p(\theta\mid D)\,d\theta$$
Or alternatively, minimizes the risk (expected loss):
$$R(h) = \int_{\Theta} \ell(h,\theta)\,p(\theta\mid D)\,d\theta$$
Bayesian Decision Theory - Example
When ℓ(h, θ) = (h − θ)², what is the optimal decision h?
How about when ℓ(h, θ) = |h − θ|?
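For the squared loss the risk is minimized by the posterior mean, and for the absolute loss by the posterior median. A minimal numerical check, assuming a hypothetical skewed posterior represented by log-normal samples and a simple grid search over candidate decisions (both are illustration choices, not from the talk):

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical skewed "posterior", represented by samples.
posterior_samples = [rng.lognormvariate(0.0, 0.75) for _ in range(2000)]


def squared_loss(h, theta):
    return (h - theta) ** 2


def absolute_loss(h, theta):
    return abs(h - theta)


def empirical_risk(h, loss):
    """Monte Carlo estimate of R(h), the posterior expected loss of decision h."""
    return sum(loss(h, theta) for theta in posterior_samples) / len(posterior_samples)


# Candidate decisions on a grid from 0.00 to 4.00.
candidates = [i * 0.01 for i in range(401)]
h_squared = min(candidates, key=lambda h: empirical_risk(h, squared_loss))
h_absolute = min(candidates, key=lambda h: empirical_risk(h, absolute_loss))

posterior_mean = statistics.mean(posterior_samples)
posterior_median = statistics.median(posterior_samples)
print(f"squared loss:  h = {h_squared:.2f}, posterior mean   = {posterior_mean:.2f}")
print(f"absolute loss: h = {h_absolute:.2f}, posterior median = {posterior_median:.2f}")
```

The two decisions differ because the assumed posterior is skewed; for a symmetric posterior they would coincide.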
Bayesian Decision Theory
The optimal decision maximizes the gain
$$G(h) = \int_{\Theta} u(h,\theta)\,p(\theta\mid D)\,d\theta$$
$$h^{*}_{p} = \arg\max_{h\in\mathcal{H}} G(h)$$
However, p(θ|D) is intractable
Bayesian Decision Theory
Approximate p(θ|D) with q(θ; λ)
$$G_q(h) = \int_{\Theta} u(h,\theta)\,q(\theta;\lambda)\,d\theta$$
$$h^{*}_{q} = \arg\max_{h\in\mathcal{H}} G_q(h)$$
Million dollar question: Is h*_q = h*_p?
Bayesian Decision Theory
Example: Nuclear power plant
Collect temperature data D from sensor.
Infer a posterior distribution p(θ|D) over θ.
Utility matrix u(h, θ):
        θ < T_crit    θ ≥ T_crit
on      10^10         100
off     10^5          10^10
Bayesian Decision Theory
Example: Nuclear power plant
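In each of the cases what is the optimal decision?
$$G(h=\text{on}) = \int_{0}^{T_{crit}} 10^{10}\,p(\theta\mid D)\,d\theta + \int_{T_{crit}}^{500} 100\,p(\theta\mid D)\,d\theta$$
$$G(h=\text{off}) = \int_{0}^{T_{crit}} 10^{5}\,p(\theta\mid D)\,d\theta + \int_{T_{crit}}^{500} 10^{10}\,p(\theta\mid D)\,d\theta$$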
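Because the utility is piecewise constant in θ, each gain integral reduces to a weighted sum of two posterior probabilities. The sketch below assumes the utility entries read as 10^10, 100, 10^5 and 10^10 (the superscripts are a reading of the slide and may differ from the original), and summarizes the posterior by a single hypothetical number, the probability that θ ≥ T_crit.

```python
# Assumed reading of the utility matrix: (u for theta < T_crit, u for theta >= T_crit).
UTILITY = {
    "on": (1e10, 1e2),
    "off": (1e5, 1e10),
}


def expected_gain(decision, prob_critical):
    """G(h) when the posterior puts mass prob_critical on theta >= T_crit."""
    u_safe, u_critical = UTILITY[decision]
    return u_safe * (1.0 - prob_critical) + u_critical * prob_critical


def best_decision(prob_critical):
    """Decision with the largest expected gain."""
    return max(UTILITY, key=lambda h: expected_gain(h, prob_critical))


for prob in (0.1, 0.9):
    print(prob, best_decision(prob))
```

The point of the example is that the decision depends only on how much posterior mass lies above T_crit, so an approximation that misplaces that mass can flip the decision.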
Bayesian Decision Theory
Two Types of Decisions
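Decision over parameters
$$G(h) = \int_{\Theta} u(h,\theta)\,p(\theta\mid D)\,d\theta$$
$$h^{*} = \arg\max_{h\in\mathcal{H}} G(h)$$
Decision over model outputs
$$G(h\mid x) = \int_{\Theta}\int_{\mathcal{Y}} u(h,y)\,p(y\mid\theta,x)\,dy\;p(\theta\mid D)\,d\theta$$
$$h^{*} = \arg\max_{h\in\mathcal{H}} G(h\mid x)$$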
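The gain over model outputs is a nested integral, which can be approximated by sampling θ from the posterior and then y from the likelihood. A minimal sketch under assumed toy choices (none of them from the talk): a Gaussian stand-in for p(θ|D), a Gaussian likelihood p(y|θ, x) with the dependence on x folded into θ, and the utility u(h, y) = −(h − y)².

```python
import random

rng = random.Random(0)


def sample_predictive(num_samples):
    """Draw y by first sampling theta ~ p(theta | D), then y ~ p(y | theta, x)."""
    draws = []
    for _ in range(num_samples):
        theta = rng.gauss(1.0, 0.5)         # hypothetical posterior N(1, 0.5^2)
        draws.append(rng.gauss(theta, 1.0))  # hypothetical likelihood N(theta, 1)
    return draws


def utility(h, y):
    """Hypothetical utility: negative squared error between decision and outcome."""
    return -(h - y) ** 2


predictive_samples = sample_predictive(5000)


def gain(h):
    """Monte Carlo estimate of G(h | x)."""
    return sum(utility(h, y) for y in predictive_samples) / len(predictive_samples)


candidates = [-1.0 + 0.05 * i for i in range(81)]   # grid from -1 to 3
h_star = max(candidates, key=gain)
print(f"h* = {h_star:.2f}")
```

With this utility the optimal decision is the posterior predictive mean, which is 1.0 for the assumed distributions, so the grid search should return a value close to it.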
Lessons learnt
If you have access to the full posterior, you have nothing to worry about. The posterior is necessary and sufficient information for making accurate decisions.
If you are approximating a multi-modal posterior with a unimodal variational distribution, the decision making task should be part of the inference.
Do not take anything for granted, especially because the inference is black-box.
Loss Calibrated Lower Bound
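Lower bound the gain:
$$\begin{aligned}
\log G(h) &= \log \int p(\theta\mid D)\,u(h,\theta)\,d\theta \\
&= \log \int q(\theta)\,\frac{p(\theta\mid D)\,u(h,\theta)}{q(\theta)}\,d\theta \\
&\ge \int q(\theta)\,\log\frac{p(\theta\mid D)\,u(h,\theta)}{q(\theta)}\,d\theta \quad \text{(via Jensen's inequality)} \\
&= -\mathrm{KL}(q\,\|\,p) + \int q(\theta)\,\log u(h,\theta)\,d\theta \\
&= \mathrm{ELBO}(\lambda) - \log p(D) + \int q(\theta)\,\log u(h,\theta)\,d\theta
\end{aligned}$$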
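The bound can be checked exactly when θ takes only a few values, since every integral becomes a finite sum. The numbers below are hypothetical and chosen only for illustration; the utility must be strictly positive so that its logarithm exists.

```python
import math

posterior = [0.6, 0.3, 0.1]   # hypothetical p(theta | D) over three values of theta
utility = [2.0, 1.0, 0.5]     # hypothetical u(h, theta) for one fixed decision h

gain = sum(p * u for p, u in zip(posterior, utility))
log_gain = math.log(gain)


def lower_bound(q):
    """-KL(q || p) + E_q[log u(h, theta)] for a discrete distribution q."""
    kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, posterior))
    expected_log_utility = sum(qi * math.log(ui) for qi, ui in zip(q, utility))
    return -kl + expected_log_utility


# An arbitrary approximation q gives a strict lower bound.
bound_arbitrary = lower_bound([0.5, 0.25, 0.25])

# The bound is tight when q is proportional to p(theta | D) * u(h, theta).
q_tight = [p * u / gain for p, u in zip(posterior, utility)]
bound_tight = lower_bound(q_tight)

print(f"log G(h) = {log_gain:.4f}")
print(f"bound with arbitrary q = {bound_arbitrary:.4f}")
print(f"bound with q ∝ p·u     = {bound_tight:.4f}")
```

The tight case shows what maximizing the bound asks of q: it is pulled away from the posterior towards regions where the utility of the chosen decision is high.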
LCVI: objective
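New objective:
$$\mathcal{L}(\lambda,h) = \underbrace{\mathbb{E}_{q(\theta;\lambda)}\left[\log p(D,\theta) - \log q(\theta;\lambda)\right]}_{\mathrm{ELBO}(\lambda)\text{: evidence lower bound}} + \underbrace{\mathbb{E}_{q(\theta;\lambda)}\left[\log u(h,\theta)\right]}_{U(\lambda,h)\text{: utility-dependent penalty term}}$$
Optimization using EM:
M-step: $h^{*}_{q} = \arg\max_{h\in\mathcal{H}} G_q(h)$
E-step: $\lambda^{*} = \arg\max_{\lambda} \mathcal{L}(\lambda, h^{*}_{q})$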
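The alternating scheme can be sketched on a one-dimensional toy problem. Everything specific here is an assumption made for illustration: log p(D, θ) is a Gaussian log density up to a constant, q(θ; λ) = N(μ, σ²), and the utility is u(h, θ) = exp(−a(h − θ)²). For this utility and a Gaussian q, G_q(h) is maximized at h = μ, so the M-step has a closed form. The E-step is simplified to a single reparameterized gradient step per round rather than a full maximization.

```python
import math
import random

POST_MEAN, POST_STD = 1.0, 1.0   # hypothetical Gaussian stand-in for log p(D, theta)
SHARPNESS = 0.5                  # hypothetical "a" in u(h, theta) = exp(-a (h - theta)^2)


def dlog_joint(theta):
    """Derivative of the assumed log p(D, theta) with respect to theta."""
    return -(theta - POST_MEAN) / POST_STD ** 2


def dlog_utility(h, theta):
    """Derivative of log u(h, theta) = -a (h - theta)^2 with respect to theta."""
    return 2.0 * SHARPNESS * (h - theta)


def lcvi(num_rounds=3000, step_size=0.01, num_samples=10, seed=0):
    rng = random.Random(seed)
    mu, log_sigma = 0.0, 0.0
    h = mu
    for _ in range(num_rounds):
        # M-step: closed form for this toy utility and Gaussian q.
        h = mu
        # E-step (simplified): one stochastic gradient step on L(lambda, h), h fixed.
        sigma = math.exp(log_sigma)
        grad_mu, grad_log_sigma = 0.0, 0.0
        for _ in range(num_samples):
            eps = rng.gauss(0.0, 1.0)
            theta = mu + sigma * eps
            slope = dlog_joint(theta) + dlog_utility(h, theta)
            grad_mu += slope
            grad_log_sigma += slope * sigma * eps
        mu += step_size * grad_mu / num_samples
        # The entropy of q adds 1 to the gradient with respect to log_sigma.
        log_sigma += step_size * (grad_log_sigma / num_samples + 1.0)
    return mu, math.exp(log_sigma), h


mu_fit, sigma_fit, h_fit = lcvi()
print(f"q mean {mu_fit:.2f}, q std {sigma_fit:.2f}, decision {h_fit:.2f}")
```

In this toy case the stationary point can be worked out by hand: the mean matches the target mean, while the variance shrinks to 1 / (1/s² + 2a), i.e. q concentrates around the chosen decision compared with plain VI.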
Discrete case: Diabetes
D = {(x, y)}
x - patient covariates
Y = {Healthy, Moderate, Severe}
utility matrix u(h, y) (rows: decision h, columns: true label y):
           y = He   y = Mod   y = Sev
h = He     2.0      1.0       0.0
h = Mod    1.2      2.0       1.3
h = Sev    1.1      1.4       2.0
it is bad to say 'Healthy' when 'Severe'
Discrete case: model & Automatic VI
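Likelihood (softmax):
$$p(y=c_j\mid\theta,x) = \frac{e^{x\cdot\theta_j}}{\sum_{k} e^{x\cdot\theta_k}}$$
Some priors: p(θ_Se), p(θ_Mod), p(θ_He)
Mean-field approximation family:
$$q(\theta_{Se},\theta_{Mod},\theta_{He}) = \mathcal{N}(\theta_{Se}\mid\mu_{Se},\sigma^2_{Se})\,\mathcal{N}(\theta_{Mod}\mid\mu_{Mod},\sigma^2_{Mod})\,\mathcal{N}(\theta_{He}\mid\mu_{He},\sigma^2_{He})$$
Reparametrization:
$$\theta_{Se} = \mu_{Se} + \sigma_{Se}\cdot\epsilon_{Se},\qquad \theta_{Mod} = \mu_{Mod} + \sigma_{Mod}\cdot\epsilon_{Mod},\qquad \theta_{He} = \mu_{He} + \sigma_{He}\cdot\epsilon_{He}$$
Maximize L_VI(λ) := ELBO(λ) w.r.t. approximation parameters λ = {μ_Se, ..., σ_He}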
Recap: LCVI objective in predictive setting
$$\mathcal{L}(\lambda,h) = \mathrm{ELBO}(\lambda) + \int q(\theta)\,\log\int u(h,y)\,p(y\mid\theta,D)\,dy\,d\theta$$
Discrete case: LCVI objective
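sum over possible outputs:
$$\mathcal{L}(\lambda,h) = \mathrm{ELBO}(\lambda) + \mathbb{E}_{q(\theta;\lambda)}\left[\log\sum_{y\in\mathcal{Y}} u(h,y)\,p(y\mid\theta,D)\right]$$
expectation using MC:
$$\approx \mathrm{ELBO}(\lambda) + \frac{1}{M}\sum_{\theta\sim q(\theta;\lambda)}\log\sum_{y\in\mathcal{Y}} u(h,y)\,p(y\mid\theta,D)$$
reparameterization:
$$\approx \mathrm{ELBO}(\lambda) + \frac{1}{M}\sum_{\epsilon\sim q(\epsilon)}\log\sum_{y\in\mathcal{Y}} u(h,y)\,p\big(y\mid f_{\theta}(\epsilon,\lambda),D\big)$$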
Discrete case: LCVI Optimization
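utility matrix u(h, y) (rows: decision h, columns: true label y):
           y = He   y = Mod   y = Sev
h = He     2.0      1.0       0.0
h = Mod    1.2      2.0       1.3
h = Sev    1.1      1.4       2.0
M-step: choose h that maximizes L(λ, h) (λ fixed)
E-step: use ∇_λ L(λ, h) to update λ (h fixed)
for example if h = He:
$$\mathcal{L}(\lambda,\mathrm{He}) \approx \mathrm{ELBO} + \frac{1}{M}\sum_{\epsilon\sim q_0}\log\left(2.0\cdot\frac{e^{x\cdot\theta_{He}}}{\sum_{k} e^{x\cdot\theta_k}} + 1.0\cdot\frac{e^{x\cdot\theta_{Mod}}}{\sum_{k} e^{x\cdot\theta_k}} + 0.0\cdot\frac{e^{x\cdot\theta_{Se}}}{\sum_{k} e^{x\cdot\theta_k}}\right)$$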
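The utility term for the discrete case can be sketched as follows for a single input x. The utility matrix is the one from the slide; the covariate vector, the variational means and standard deviations, and the helper names `utility_term` and `m_step` are hypothetical. The ELBO is left out because, with λ fixed, only the utility term depends on h.

```python
import math
import random

CLASSES = ["He", "Mod", "Sev"]
# Rows: decision h; columns: true label y in the order He, Mod, Sev.
UTILITY = {
    "He": [2.0, 1.0, 0.0],
    "Mod": [1.2, 2.0, 1.3],
    "Sev": [1.1, 1.4, 2.0],
}


def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def utility_term(h, x, mu, sigma, eps_draws):
    """MC estimate of E_q[log sum_y u(h, y) p(y | theta, x)] via reparameterization."""
    total = 0.0
    for eps in eps_draws:
        # theta_c = mu_c + sigma_c * eps_c for every class c (mean-field Gaussian q).
        theta = [[m + s * e for m, s, e in zip(mu_c, sigma_c, eps_c)]
                 for mu_c, sigma_c, eps_c in zip(mu, sigma, eps)]
        logits = [sum(xi * ti for xi, ti in zip(x, theta_c)) for theta_c in theta]
        probs = softmax(logits)
        total += math.log(sum(u * p for u, p in zip(UTILITY[h], probs)))
    return total / len(eps_draws)


def m_step(x, mu, sigma, eps_draws):
    """Choose the decision h that maximizes the utility term (lambda fixed)."""
    return max(CLASSES, key=lambda h: utility_term(h, x, mu, sigma, eps_draws))


rng = random.Random(0)
x = [1.0, 1.0]                                   # hypothetical covariates
sigma = [[0.1, 0.1] for _ in CLASSES]            # hypothetical std devs
eps_draws = [[[rng.gauss(0.0, 1.0) for _ in x] for _ in CLASSES] for _ in range(200)]

mu_healthy = [[2.5, 2.5], [0.0, 0.0], [0.0, 0.0]]   # q favours class He
mu_severe = [[0.0, 0.0], [0.0, 0.0], [2.5, 2.5]]    # q favours class Sev
print(m_step(x, mu_healthy, sigma, eps_draws), m_step(x, mu_severe, sigma, eps_draws))
```

Reusing the same ε draws for every candidate h keeps the comparison between decisions free of extra sampling noise.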
VI vs. LCVI: Test Data Confusion Matrices
VI (rows: true label, columns: predicted label):
        He     Mod    Sev
He      0.86   0.11   0.03
Mod     0.00   1.00   0.00
Sev     0.00   0.00   1.00
LCVI (rows: true label, columns: predicted label):
        He     Mod    Sev
He      0.99   0.01   0.00
Mod     0.00   1.00   0.00
Sev     0.00   0.00   1.00
Continuous case with double reparametrization
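MC approximation of both integrals:
$$\mathcal{L}(\lambda,h) \approx \mathrm{ELBO} + \frac{1}{M}\sum_{\theta\sim q_{\lambda}(\theta)}\log\frac{1}{N}\sum_{y\sim p(y\mid\theta,x)} u(h,y)$$
reparametrization:
$$\approx \mathrm{ELBO} + \frac{1}{M}\sum_{\epsilon\sim q_0}\log\frac{1}{N}\sum_{y\sim p(y\mid f_{\theta}(\epsilon,\lambda),x)} u(h,y)$$
$$\approx \mathrm{ELBO} + \frac{1}{M}\sum_{\epsilon\sim q_0}\log\frac{1}{N}\sum_{\delta\sim p_0} u\big(h,\,g_y(\delta, f_{\theta}(\epsilon,\lambda))\big)$$
gradient-based optimization w.r.t. h and λ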
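The doubly reparameterized estimator of the utility term can be sketched as below. The concrete choices are assumptions for illustration only: f_θ(ε, λ) = μ + σ·ε, a Gaussian likelihood so that g_y(δ, θ) = θ + s·δ, and the utility u(h, y) = exp(−(h − y)²). For these choices the utility term has a closed form, which gives a reference value. Only the estimator is shown; taking gradients through it would normally be left to an automatic differentiation library.

```python
import math
import random


def utility(h, y):
    """Hypothetical utility with values in (0, 1]."""
    return math.exp(-(h - y) ** 2)


def f_theta(eps, mu, sigma):
    """Reparameterized draw from q(theta; lambda) = N(mu, sigma^2)."""
    return mu + sigma * eps


def g_y(delta, theta, noise_std):
    """Reparameterized draw from the assumed likelihood N(theta, noise_std^2)."""
    return theta + noise_std * delta


def utility_term(h, mu, sigma, noise_std, num_outer, num_inner, rng):
    """(1/M) sum over eps of log (1/N) sum over delta of u(h, g_y(delta, f_theta(eps, lambda)))."""
    total = 0.0
    for _ in range(num_outer):
        theta = f_theta(rng.gauss(0.0, 1.0), mu, sigma)
        inner = 0.0
        for _ in range(num_inner):
            inner += utility(h, g_y(rng.gauss(0.0, 1.0), theta, noise_std))
        total += math.log(inner / num_inner)
    return total / num_outer


def utility_term_exact(h, mu, sigma, noise_std):
    """Closed form of the utility term for the Gaussian toy model above."""
    scale = 1.0 + 2.0 * noise_std ** 2
    return -0.5 * math.log(scale) - ((h - mu) ** 2 + sigma ** 2) / scale


rng = random.Random(0)
estimate = utility_term(h=1.5, mu=1.0, sigma=0.5, noise_std=1.0,
                        num_outer=2000, num_inner=200, rng=rng)
exact = utility_term_exact(h=1.5, mu=1.0, sigma=0.5, noise_std=1.0)
print(f"estimate {estimate:.3f}, closed form {exact:.3f}")
```

The logarithm of an inner sample average is a slightly biased estimate of the logarithm of the inner integral; the bias shrinks as N grows, which is one reason the cost scales with M × N.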
Posterior predictive distribution shift
[Figure: posterior predictive probability density over Value (x-axis from 2.5 to 10.0) for LCVI and VI, with the data and the decisions h_LCVI and h_VI marked; user no: 791, artist: Muse]
LCVI (blue) vs. VI (red/green)
[Figure: Empirical Risk of LCVI and VI; panel labels: q = 0.2, q = 0.5, q = 0.8, squared, tilted]
Conclusion
Bad posterior approximations result in sub-optimal decisions / predictions
Learn better approximations (better for the concrete task)
Learn how to make better decisions from bad posteriors
References
Adam D. Cobb, Stephen J. Roberts, and Yarin Gal. Loss-Calibrated Approximate Inference in Bayesian Neural Networks. In Theory of Deep Learning workshop, ICML, 2018.
Tomasz Kuśmierczyk, Joseph Sakaya, and Arto Klami. Variational Bayesian Decision-making for Continuous Utilities. In Thirty-third Conference on Neural Information Processing Systems, NeurIPS, 2019.
Simon Lacoste-Julien, Ferenc Huszár, and Zoubin Ghahramani. Approximate inference for the loss-calibrated Bayesian. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, AISTATS, 2011.
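Continuous case (detailed): Monte Carlo
$$\mathcal{L} = \mathrm{ELBO} + \mathbb{E}_{q_{\lambda}(\theta)}\left[\log\int u(h,y)\,p(y\mid\theta,x)\,dy\right]$$
Approximate expectation using MC:
$$\approx \mathrm{ELBO} + \frac{1}{M}\sum_{\theta\sim q_{\lambda}(\theta)}\log\int u(h,y)\,p(y\mid\theta,x)\,dy$$
Approximate integral using MC:
$$\approx \mathrm{ELBO} + \frac{1}{M}\sum_{\theta\sim q_{\lambda}(\theta)}\log\frac{1}{N}\sum_{y\sim p(y\mid\theta,x)} u(h,y)$$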
Continuous case (detailed): double reparametrization
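$$\mathcal{L} \approx \mathrm{ELBO} + \frac{1}{M}\sum_{\theta\sim q_{\lambda}(\theta)}\log\frac{1}{N}\sum_{y\sim p(y\mid\theta,x)} u(h,y)$$
To obtain a Monte Carlo estimate of ∇_λ U(λ, h), reparameterize both sampling steps:
$$\approx \mathrm{ELBO} + \frac{1}{M}\sum_{\epsilon\sim q_0}\log\frac{1}{N}\sum_{y\sim p(y\mid f_{\theta}(\epsilon,\lambda),x)} u(h,y)$$
$$\approx \mathrm{ELBO} + \frac{1}{M}\sum_{\epsilon\sim q_0}\log\frac{1}{N}\sum_{\delta\sim p_0} u\big(h,\,g_y(\delta, f_{\theta}(\epsilon,\lambda))\big)$$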
Continuous case (detailed): double reparametrization
$$\approx \mathrm{ELBO} + \frac{1}{M}\sum_{\epsilon\sim q_0}\log\frac{1}{N}\sum_{\delta\sim p_0} u\big(h,\,g_y(\delta, f_{\theta}(\epsilon,\lambda))\big)$$
p(y|·) needs to be reparameterizable:
until recently only for Gaussians, but see: Michael Figurnov, Shakir Mohamed, Andriy Mnih. Implicit Reparameterization Gradients, arXiv, May 2018.
we need M × N samples
computation graph is O(M × N)