Loss Calibrated Variational Inference
Tomasz Kuśmierczyk
Joseph Sakaya
October 17, 2019
Outline of the Talk
Recap of Lecture 11 - Variational Inference
Reparameterization gradients
Bayesian decision theory
Loss calibrated variational inference: framework
Loss calibration: discrete case
Loss calibration: continuous case
Conclusion
Recap: Lecture 11
Motivation
MCMC approximates posteriors by sampling
Computationally expensive
Asymptotic convergence
Diagnostics can be tricky
Variational inference approximates the posterior with a parametric family of distributions q(θ; λ)
Converts inference to an optimization problem
Scales very well
Does not converge to the true posterior
Minimize KL divergence between a proxy q(θ; λ) and
p(θ|D)
$$\mathrm{KL}\big(q(\theta;\lambda)\,\|\,p(\theta\mid D)\big) = \mathbb{E}_{q(\theta;\lambda)}\left[\log \frac{q(\theta;\lambda)}{p(\theta\mid D)}\right]$$
Variational Inference
Evidence Lower Bound
Consider the identity:
$$\log p(D) = \mathrm{KL}\big(q(\theta;\lambda)\,\|\,p(\theta\mid D)\big) + \underbrace{\mathbb{E}_{q(\theta;\lambda)}\big[\log p(D,\theta) - \log q(\theta;\lambda)\big]}_{\text{ELBO } \mathcal{L}(\lambda)}$$
Minimizing the KL is the same as maximizing $\mathcal{L}(\lambda)$, since $\log p(D)$ is constant w.r.t. $\lambda$. Therefore,
$$\lambda^{*} = \arg\max_{\lambda} \mathcal{L}(\lambda) \equiv \arg\min_{\lambda} \mathrm{KL}\big(q(\theta;\lambda)\,\|\,p(\theta\mid D)\big).$$
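For completeness, a short standard derivation of this identity (not spelled out on the slide):
$$\log p(D) = \mathbb{E}_{q(\theta;\lambda)}\left[\log \frac{p(D,\theta)}{p(\theta\mid D)}\right] = \mathbb{E}_{q(\theta;\lambda)}\left[\log \frac{p(D,\theta)}{q(\theta;\lambda)}\right] + \mathbb{E}_{q(\theta;\lambda)}\left[\log \frac{q(\theta;\lambda)}{p(\theta\mid D)}\right] = \mathcal{L}(\lambda) + \mathrm{KL}\big(q(\theta;\lambda)\,\|\,p(\theta\mid D)\big).$$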
Reparameterization gradients
Objective to maximize:
$$\mathcal{L}(\lambda) = \mathbb{E}_{q(\theta;\lambda)}\big[\log p(D,\theta) - \log q(\theta;\lambda)\big], \qquad \lambda^{*} = \arg\max_{\lambda} \mathcal{L}(\lambda)$$
Optimization via gradient ascent: the gradient $\nabla_{\lambda}\mathcal{L}(\lambda)$ is taken w.r.t. the same $\lambda$ that parameterizes the distribution $q(\theta;\lambda)$ over which we take the expectation.
Use the reparameterization trick to transform $\mathbb{E}_{q(\theta;\lambda)}[\dots]$ into an expectation over a base distribution, $\mathbb{E}_{q(\epsilon)}[\dots]$.
Reparameterization gradients
Draw $S$ samples from the base distribution: $\epsilon_{s} \sim q(\epsilon)$.
Transform $\theta_{s} = f(\epsilon_{s}, \lambda)$ and evaluate the Monte Carlo estimate of the ELBO:
$$\mathcal{L}(\lambda) \approx \frac{1}{S}\sum_{s=1}^{S}\big[\log p(D,\theta_{s}) - \log q(\theta_{s};\lambda)\big].$$
The Monte Carlo estimate of the gradient then becomes:
$$\nabla_{\lambda}\mathcal{L}(\lambda) \approx \frac{1}{S}\sum_{s=1}^{S}\nabla_{\lambda}\big[\log p(D,\theta_{s}) - \log q(\theta_{s};\lambda)\big].$$
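As a concrete illustration (not part of the talk), here is a minimal PyTorch sketch of this reparameterization-gradient estimator for a toy model with a Gaussian prior, Gaussian likelihood, and Gaussian $q(\theta;\lambda)$; the model, data, and hyperparameters are my own choices:

```python
import torch

# Toy model: theta ~ N(0, 1) prior, observations D_i ~ N(theta, 1).
torch.manual_seed(0)
D = torch.randn(20) + 1.5

# Variational parameters lambda = (mu, log_sigma) of q(theta; lambda) = N(mu, sigma^2).
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

S = 64                                  # Monte Carlo samples per step
for step in range(500):
    opt.zero_grad()
    sigma = log_sigma.exp()
    eps = torch.randn(S, 1)             # eps_s ~ q(eps) = N(0, 1)
    theta = mu + sigma * eps            # theta_s = f(eps_s, lambda)

    log_lik = torch.distributions.Normal(theta, 1.0).log_prob(D).sum(-1)
    log_prior = torch.distributions.Normal(0.0, 1.0).log_prob(theta).sum(-1)
    log_q = torch.distributions.Normal(mu, sigma).log_prob(theta).sum(-1)

    elbo = (log_lik + log_prior - log_q).mean()   # MC estimate of L(lambda)
    (-elbo).backward()                  # reparameterization gradient via autodiff
    opt.step()

print(float(mu), float(log_sigma.exp()))  # approaches the exact Gaussian posterior
```

Because $\theta_{s}$ is a deterministic, differentiable function of $\lambda$, automatic differentiation of the Monte Carlo objective yields exactly the gradient estimate above.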
Bayesian Decision Theory
Decision making under uncertainty characterized by the
posterior p(θ|D)
Make optimal decisions h, given p(θ|D) and utility u(h, θ)
defined over the parameters θ
An optimal decision maximizes the posterior gain (expected utility):
$$G(h) = \int_{\Theta} u(h,\theta)\, p(\theta\mid D)\, d\theta$$
or, alternatively, minimizes the risk (expected loss):
$$R(h) = \int_{\Theta} \ell(h,\theta)\, p(\theta\mid D)\, d\theta$$
Bayesian Decision Theory - Example
When $\ell(h,\theta) = (h-\theta)^{2}$, what is the optimal decision $h$?
How about when $\ell(h,\theta) = |h-\theta|$?
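For reference (the slide only poses the question), the standard answers are the posterior mean and the posterior median:
$$\frac{d}{dh}\int_{\Theta}(h-\theta)^{2}\,p(\theta\mid D)\,d\theta = 2\big(h - \mathbb{E}[\theta\mid D]\big) = 0 \;\;\Rightarrow\;\; h^{*} = \mathbb{E}[\theta\mid D],$$
$$\frac{d}{dh}\int_{\Theta}|h-\theta|\,p(\theta\mid D)\,d\theta = P(\theta < h\mid D) - P(\theta > h\mid D) = 0 \;\;\Rightarrow\;\; h^{*} = \mathrm{median}(\theta\mid D).$$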
Bayesian Decision Theory
The optimal decision maximizes the gain:
$$G(h) = \int_{\Theta} u(h,\theta)\, p(\theta\mid D)\, d\theta, \qquad h^{*}_{p} = \arg\max_{h\in\mathcal{H}} G(h)$$
However, $p(\theta\mid D)$ is intractable.
Bayesian Decision Theory
Approximate $p(\theta\mid D)$ with $q(\theta;\lambda)$:
$$G_{q}(h) = \int u(h,\theta)\, q(\theta;\lambda)\, d\theta, \qquad h^{*}_{q} = \arg\max_{h\in\mathcal{H}} G_{q}(h)$$
Million dollar question: is $h^{*}_{q} = h^{*}_{p}$?
Bayesian Decision Theory
Example: Nuclear power plant
Collect temperature data D from sensor.
Infer a posterior distribution p(θ|D) over θ.
Utility matrix u(h, θ):

            θ < Tcrit    θ ≥ Tcrit
  h = on      10^10        10^0
  h = off     10^5         10^10
Bayesian Decision Theory
Example: Nuclear power plant
In each of the cases, what is the optimal decision?
$$G(h=\text{on}) = \int_{0}^{T_{\text{crit}}} 10^{10}\, p(\theta\mid D)\, d\theta + \int_{T_{\text{crit}}}^{500} 10^{0}\, p(\theta\mid D)\, d\theta$$
$$G(h=\text{off}) = \int_{0}^{T_{\text{crit}}} 10^{5}\, p(\theta\mid D)\, d\theta + \int_{T_{\text{crit}}}^{500} 10^{10}\, p(\theta\mid D)\, d\theta$$
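To make the comparison concrete, here is a small sketch with made-up numbers (a Gaussian posterior over the temperature and $T_{\text{crit}} = 500$; only the posterior mass on either side of the threshold matters). The utilities follow the matrix above, read as powers of ten:

```python
import math

def normal_cdf(x, mu, sigma):
    """P(theta <= x) under a Gaussian posterior N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

T_CRIT = 500.0
U = {("on", "safe"): 1e10, ("on", "unsafe"): 1e0,
     ("off", "safe"): 1e5, ("off", "unsafe"): 1e10}

def gains(mu, sigma):
    p_safe = normal_cdf(T_CRIT, mu, sigma)      # P(theta < T_crit | D)
    p_unsafe = 1.0 - p_safe                     # P(theta >= T_crit | D)
    g = {h: U[(h, "safe")] * p_safe + U[(h, "unsafe")] * p_unsafe
         for h in ("on", "off")}
    return p_unsafe, g, max(g, key=g.get)

# Two hypothetical posteriors: one concentrated below T_crit, one mostly above it.
for mu, sigma in [(460.0, 10.0), (505.0, 30.0)]:
    p_unsafe, g, best = gains(mu, sigma)
    print(f"mu={mu}: P(unsafe)={p_unsafe:.3f}, "
          f"G(on)={g['on']:.3e}, G(off)={g['off']:.3e} -> choose '{best}'")
```

How the posterior mass is distributed around $T_{\text{crit}}$ is exactly what the approximation $q$ must get right, which is the point of the posterior-fit figures that follow.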
Bayesian Decision Theory
Unimodal posteriors – VI
Bayesian Decision Theory
Multimodal posteriors – VI
Bayesian Decision Theory
Multimodal posteriors – Expectation Propagation
Bayesian Decision Theory
Multimodal posteriors – ideal fit
Bayesian Decision Theory
Two Types of Decisions
Decision over parameters:
$$G(h) = \int_{\Theta} u(h,\theta)\, p(\theta\mid D)\, d\theta, \qquad h^{*} = \arg\max_{h\in\mathcal{H}} G(h)$$
Decision over model outputs:
$$G(h\mid x) = \int_{\Theta} \int_{\mathcal{Y}} u(h,y)\, p(y\mid\theta,x)\, dy\; p(\theta\mid D)\, d\theta, \qquad h^{*} = \arg\max_{h\in\mathcal{H}} G(h\mid x)$$
Lessons learnt
If you have access to the full posterior, you have nothing to worry about: the posterior is necessary and sufficient information for making accurate decisions.
If you are approximating a multimodal posterior with a unimodal variational distribution, the decision-making task should be part of the inference.
Do not take anything for granted, especially when the inference is a black box.
Loss Calibrated Lower Bound
Lower-bound the gain:
$$\log G(h) = \log \int p(\theta\mid D)\, u(\theta,h)\, d\theta = \log \int \frac{q(\theta)}{q(\theta)}\, p(\theta\mid D)\, u(\theta,h)\, d\theta$$
$$\ge \int q(\theta)\, \log\!\left[\frac{p(\theta\mid D)}{q(\theta)}\, u(\theta,h)\right] d\theta \qquad \text{(via Jensen's inequality)}$$
$$= -\mathrm{KL}(q, p) + \int q(\theta)\, \log u(\theta,h)\, d\theta = \mathrm{ELBO}(\lambda) - \log p(D) + \int q(\theta)\, \log u(\theta,h)\, d\theta$$
LCVI: objective
New objective:
$$\mathcal{L}(\lambda, h) = \underbrace{\mathbb{E}_{q(\theta;\lambda)}\big[\log p(D,\theta) - \log q(\theta;\lambda)\big]}_{\mathrm{ELBO}(\lambda)\text{: evidence lower bound}} + \underbrace{\mathbb{E}_{q(\theta;\lambda)}\big[\log u(h,\theta)\big]}_{U(\lambda,h)\text{: utility-dependent term}}$$
Optimization using EM:
M-step: $h^{*}_{q} = \arg\max_{h\in\mathcal{H}} G_{q}(h)$
E-step: $\lambda^{*} = \arg\max_{\lambda} \mathcal{L}(\lambda, h^{*}_{q})$
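A schematic of this alternating scheme (only a sketch; `objective` and `grad_step` stand in for whatever model-specific estimators are used, and are not names from the talk):

```python
def lcvi(decisions, lam, objective, grad_step, num_iters=100):
    """decisions: finite set of candidate h; objective(lam, h) estimates L(lam, h);
    grad_step(lam, h) performs one gradient-ascent update of lam with h fixed."""
    h = None
    for _ in range(num_iters):
        # M-step: pick the decision maximizing L(lambda, h) for the current lambda.
        h = max(decisions, key=lambda d: objective(lam, d))
        # E-step: improve lambda with h held fixed.
        lam = grad_step(lam, h)
    return lam, h
```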
Discrete case: Diabetes
$D = \{(x, y)\}$: $x$ – patient covariates, $y \in \mathcal{Y} = \{\text{Healthy}, \text{Moderate}, \text{Severe}\}$
Utility matrix $u(h, y)$ (rows: decision $h$, columns: true outcome $y$):

              y = He   y = Mod   y = Sev
  h = He       2.0      1.0       0.0
  h = Mod      1.2      2.0       1.3
  h = Sev      1.1      1.4       2.0

It is bad to say 'Healthy' when the true state is 'Severe'.
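With this matrix, the optimal decision for a patient is the row with the highest expected utility under the predictive distribution; a tiny sketch with hypothetical class probabilities:

```python
import torch

# Utility matrix u(h, y) from the slide (rows: decision h, columns: true y),
# in the order He, Mod, Sev.
U = torch.tensor([[2.0, 1.0, 0.0],
                  [1.2, 2.0, 1.3],
                  [1.1, 1.4, 2.0]])

p_y = torch.tensor([0.55, 0.05, 0.40])     # hypothetical p(y | x) for one patient

expected_utility = U @ p_y                 # E_y[u(h, y)] for each decision h
labels = ["Healthy", "Moderate", "Severe"]
print(dict(zip(labels, expected_utility.tolist())))
print("optimal decision:", labels[int(expected_utility.argmax())])
# 'Healthy' is the most probable label here, yet the utility-optimal decision
# is 'Severe', because missing a severe case is heavily penalized.
```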
Discrete case: model & Automatic VI
Likelihood (softmax): $p(y = c_{j}\mid \theta, x) = \dfrac{e^{x\cdot\theta_{j}}}{\sum_{k} e^{x\cdot\theta_{k}}}$
Some priors: $p(\theta_{Se})$, $p(\theta_{Mod})$, $p(\theta_{He})$
Mean-field approximation family:
$$q(\theta_{Se},\theta_{Mod},\theta_{He}) = \mathcal{N}(\theta_{Se}\mid\mu_{Se},\sigma^{2}_{Se})\,\mathcal{N}(\theta_{Mod}\mid\mu_{Mod},\sigma^{2}_{Mod})\,\mathcal{N}(\theta_{He}\mid\mu_{He},\sigma^{2}_{He})$$
Reparametrization:
$$\theta_{Se} = \mu_{Se} + \sigma_{Se}\,\epsilon_{Se}, \qquad \theta_{Mod} = \mu_{Mod} + \sigma_{Mod}\,\epsilon_{Mod}, \qquad \theta_{He} = \mu_{He} + \sigma_{He}\,\epsilon_{He}$$
Maximize $\mathcal{L}_{VI}(\lambda) := \mathrm{ELBO}(\lambda)$ w.r.t. the approximation parameters $\lambda = \{\mu_{Se}, \dots, \sigma_{He}\}$
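A minimal PyTorch sketch of this setup (synthetic data, unit-scale Gaussian priors, and my own sizes; none of these details are specified in the talk):

```python
import torch

torch.manual_seed(0)
N_DATA, D_COV, N_CLASSES = 200, 5, 3                      # classes: He, Mod, Sev

# Synthetic stand-in for the diabetes covariates and labels.
X = torch.randn(N_DATA, D_COV)
true_theta = torch.randn(N_CLASSES, D_COV)
y = torch.distributions.Categorical(logits=X @ true_theta.T).sample()

# Mean-field Gaussian q(theta; lambda) over the per-class weight vectors.
mu = torch.zeros(N_CLASSES, D_COV, requires_grad=True)
log_sigma = torch.zeros(N_CLASSES, D_COV, requires_grad=True)

def sample_theta(m_samples):
    """Reparameterized draws theta = mu + sigma * eps, eps ~ N(0, I)."""
    eps = torch.randn(m_samples, N_CLASSES, D_COV)
    return mu + log_sigma.exp() * eps

def elbo(m_samples=16):
    theta = sample_theta(m_samples)                        # (M, C, D)
    logits = torch.einsum('nd,mcd->mnc', X, theta)         # (M, N, C)
    log_lik = torch.distributions.Categorical(logits=logits).log_prob(y).sum(-1)
    log_prior = torch.distributions.Normal(0.0, 1.0).log_prob(theta).sum((-1, -2))
    log_q = torch.distributions.Normal(mu, log_sigma.exp()).log_prob(theta).sum((-1, -2))
    return (log_lik + log_prior - log_q).mean()

opt = torch.optim.Adam([mu, log_sigma], lr=0.05)
for _ in range(300):                                       # plain VI: maximize the ELBO
    opt.zero_grad()
    (-elbo()).backward()
    opt.step()
```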
Recap: LCVI objective in predictive setting
$$\mathcal{L}(\lambda, h) = \mathrm{ELBO}(\lambda) + \int q(\theta)\, \log\!\left[\int u(h, y)\, p(y\mid\theta, D)\, dy\right] d\theta$$
Discrete case: LCVI objective
Sum over the possible outputs:
$$\mathcal{L}(\lambda, h) = \mathrm{ELBO}(\lambda) + \mathbb{E}_{q(\theta;\lambda)}\!\left[\log \sum_{y\in\mathcal{Y}} u(h,y)\, p(y\mid\theta, D)\right]$$
Expectation using MC:
$$\approx \mathrm{ELBO}(\lambda) + \frac{1}{M}\sum_{\theta\sim q(\theta;\lambda)} \log \sum_{y\in\mathcal{Y}} u(h,y)\, p(y\mid\theta, D)$$
Reparameterization:
$$\approx \mathrm{ELBO}(\lambda) + \frac{1}{M}\sum_{\epsilon\sim q(\epsilon)} \log \sum_{y\in\mathcal{Y}} u(h,y)\, p\big(y\mid f_{\theta}(\epsilon,\lambda), D\big)$$
Discrete case: LCVI Optimization
Utility matrix $u(h, y)$ (rows: decision $h$, columns: true outcome $y$):

              y = He   y = Mod   y = Sev
  h = He       2.0      1.0       0.0
  h = Mod      1.2      2.0       1.3
  h = Sev      1.1      1.4       2.0

M-step: choose $h$ that maximizes $\mathcal{L}(\lambda, h)$ ($\lambda$ fixed)
E-step: use $\nabla_{\lambda}\mathcal{L}(\lambda, h)$ to update $\lambda$ ($h$ fixed)
For example, if $h = \text{He}$:
$$\mathcal{L}(\lambda, \text{He}) \approx \mathrm{ELBO} + \frac{1}{M}\sum_{\epsilon\sim q_{0}} \log\!\left[2.0\cdot\frac{e^{x\cdot\theta_{He}}}{\sum_{k}e^{x\cdot\theta_{k}}} + 1.0\cdot\frac{e^{x\cdot\theta_{Mod}}}{\sum_{k}e^{x\cdot\theta_{k}}} + 0.0\cdot\frac{e^{x\cdot\theta_{Se}}}{\sum_{k}e^{x\cdot\theta_{k}}}\right]$$
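Continuing the sketch from the model slide above (it reuses `X`, `N_DATA`, `sample_theta`, `elbo`, and `opt` defined there), the utility term and the alternating optimization might look as follows; making one decision per data point is my reading of the experiment, consistent with the confusion matrices on the next slide:

```python
import torch

# Utility matrix u(h, y): rows = decision h, columns = true y (He, Mod, Sev).
U = torch.tensor([[2.0, 1.0, 0.0],
                  [1.2, 2.0, 1.3],
                  [1.1, 1.4, 2.0]])

def expected_utility(m_samples=16):
    """sum_y u(h, y) p(y | theta, x_n) for every theta sample, point n and decision h."""
    theta = sample_theta(m_samples)                               # (M, C, D)
    probs = torch.softmax(torch.einsum('nd,mcd->mnc', X, theta), dim=-1)
    return torch.einsum('mnc,hc->mnh', probs, U)                  # (M, N, H)

def utility_term(h, m_samples=16):
    """MC estimate of U(lambda, h) = (1/M) sum_eps sum_n log sum_y u(h_n, y) p(y | theta, x_n)."""
    eu = expected_utility(m_samples)                              # (M, N, H)
    picked = eu[:, torch.arange(len(h)), h]                       # (M, N)
    return picked.log().sum(-1).mean()

h = torch.zeros(N_DATA, dtype=torch.long)                         # one decision per data point
for _ in range(100):
    # M-step: with lambda fixed, pick each h_n to maximize the Monte Carlo
    # estimate of the utility term in L(lambda, h).
    with torch.no_grad():
        h = expected_utility().log().mean(0).argmax(-1)
    # E-step: gradient ascent on ELBO(lambda) + U(lambda, h) with h fixed.
    opt.zero_grad()
    (-(elbo() + utility_term(h))).backward()
    opt.step()
```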
VI vs. LCVI: Test Data Confusion Matrices
VI (rows: true label; columns: predicted label):

        He     Mod    Sev
  He    0.86   0.11   0.03
  Mod   0.00   1.00   0.00
  Sev   0.00   0.00   1.00

LCVI (rows: true label; columns: predicted label):

        He     Mod    Sev
  He    0.99   0.01   0.00
  Mod   0.00   1.00   0.00
  Sev   0.00   0.00   1.00
Continuous case with double reparametrization
MC approximation of both integrals:
$$\mathcal{L}(\lambda, h) \approx \mathrm{ELBO} + \frac{1}{M}\sum_{\theta\sim q_{\lambda}(\theta)} \left[\log \frac{1}{N}\sum_{y\sim p(y\mid\theta,x)} u(h,y)\right]$$
Reparametrization:
$$\approx \mathrm{ELBO} + \frac{1}{M}\sum_{\epsilon\sim q_{0}} \left[\log \frac{1}{N}\sum_{y\sim p(y\mid f_{\theta}(\epsilon,\lambda),\,x)} u(h,y)\right]$$
$$\approx \mathrm{ELBO} + \frac{1}{M}\sum_{\epsilon\sim q_{0}} \left[\log \frac{1}{N}\sum_{\delta\sim p_{0}} u\big(h,\, g_{y}(\delta, f_{\theta}(\epsilon,\lambda))\big)\right]$$
Gradient-based optimization w.r.t. $h$ and $\lambda$.
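A compact sketch of this double reparameterization for a toy 1-D regression with Gaussian predictive noise (all modeling choices are illustrative and mine, not those of the talk; the utility is chosen positive so that its logarithm is defined):

```python
import torch

torch.manual_seed(0)

# Toy regression: y ~ N(theta * x, 1), standard normal prior on theta.
x_train = torch.linspace(-2, 2, 50)
y_train = 1.3 * x_train + torch.randn(50)
x_new = torch.tensor(1.0)                         # the input we must decide for

# q(theta; lambda) = N(mu, sigma^2) and a continuous decision h for x_new.
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
h = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma, h], lr=0.05)

def utility(h, y):
    return torch.exp(-(h - y) ** 2)               # positive, Gaussian-shaped utility

M, N = 32, 32                                     # samples for theta and for y
for _ in range(1000):
    opt.zero_grad()
    sigma = log_sigma.exp()
    eps = torch.randn(M, 1)
    theta = mu + sigma * eps                      # theta = f_theta(eps, lambda)
    # ELBO of the regression model.
    log_lik = torch.distributions.Normal(theta * x_train, 1.0).log_prob(y_train).sum(-1)
    log_prior = torch.distributions.Normal(0.0, 1.0).log_prob(theta).sum(-1)
    log_q = torch.distributions.Normal(mu, sigma).log_prob(theta).sum(-1)
    elbo = (log_lik + log_prior - log_q).mean()
    # Utility term with y reparameterized: y = g_y(delta, theta) = theta * x_new + delta.
    delta = torch.randn(M, N)
    y_pred = theta * x_new + delta
    util = torch.log(utility(h, y_pred).mean(-1)).mean()
    (-(elbo + util)).backward()                   # joint gradient step in lambda and h
    opt.step()

print(float(h))                                   # loss-calibrated decision for x_new
```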
Posterior predictive distribution shift
[Figure: posterior predictive densities (probability density vs. value) for user no. 791, artist Muse, showing the observed data, the VI and LCVI predictive distributions, and the corresponding optimal decisions $h_{VI}$ and $h_{LCVI}$.]
LCVI (blue) vs. VI (red/green)
[Figure: empirical risk of LCVI (blue) vs. VI (red/green) for the squared loss and for tilted losses with q = 0.2, 0.5, and 0.8.]
Conclusion
Bad posterior approximations result in sub-optimal
decisions / predictions
Learn better approximations (better for the concrete task)
Learn how to make better decisions from bad posteriors
References
Adam D Cobb, Stephen J Roberts, and Yarin Gal.
Loss-Calibrated Approximate Inference in Bayesian Neural
Networks.
In Theory of Deep Learning workshop, ICML, 2018.
Tomasz Kuśmierczyk, Joseph Sakaya, and Arto Klami.
Variational Bayesian Decision-making for Continuous
Utilities.
In Thirty-third Conference on Neural Information
Processing Systems, NeurIPS, 2019.
Simon Lacoste-Julien, Ferenc Huszár, and Zoubin
Ghahramani.
Approximate inference for the loss-calibrated Bayesian.
In Proceedings of the 14th International Conference on
Artificial Intelligence and Statistics, AISTATS, 2011.
Continuous case (detailed): Monte Carlo
$$\mathcal{L} = \mathrm{ELBO} + \mathbb{E}_{q_{\lambda}(\theta)}\!\left[\log \int u(h,y)\, p(y\mid\theta,x)\, dy\right]$$
Approximate the expectation using MC:
$$\approx \mathrm{ELBO} + \frac{1}{M}\sum_{\theta\sim q_{\lambda}(\theta)} \log \int u(h,y)\, p(y\mid\theta,x)\, dy$$
Approximate the integral using MC:
$$\approx \mathrm{ELBO} + \frac{1}{M}\sum_{\theta\sim q_{\lambda}(\theta)} \left[\log \frac{1}{N}\sum_{y\sim p(y\mid\theta,x)} u(h,y)\right]$$
Continuous case (detailed): double reparametrization
$$\mathcal{L} \approx \mathrm{ELBO} + \frac{1}{M}\sum_{\theta\sim q_{\lambda}(\theta)} \left[\log \frac{1}{N}\sum_{y\sim p(y\mid\theta,x)} u(h,y)\right]$$
To obtain Monte Carlo estimates of $\nabla_{\lambda} U(\lambda, h)$, reparameterize both $\theta$ and $y$:
$$\approx \mathrm{ELBO} + \frac{1}{M}\sum_{\epsilon\sim q_{0}} \left[\log \frac{1}{N}\sum_{y\sim p(y\mid f_{\theta}(\epsilon,\lambda),\,x)} u(h,y)\right]$$
$$\approx \mathrm{ELBO} + \frac{1}{M}\sum_{\epsilon\sim q_{0}} \left[\log \frac{1}{N}\sum_{\delta\sim p_{0}} u\big(h,\, g_{y}(\delta, f_{\theta}(\epsilon,\lambda))\big)\right]$$
Continuous case (detailed): double reparametrization
$$\approx \mathrm{ELBO} + \frac{1}{M}\sum_{\epsilon\sim q_{0}} \left[\log \frac{1}{N}\sum_{\delta\sim p_{0}} u\big(h,\, g_{y}(\delta, f_{\theta}(\epsilon,\lambda))\big)\right]$$
$p(y\mid\cdot)$ needs to be reparameterizable:
until recently this was possible only for Gaussians, but see:
Michael Figurnov, Shakir Mohamed, and Andriy Mnih. Implicit Reparameterization Gradients, arXiv, May 2018.
We need $M \times N$ samples; the computation graph is $O(M \times N)$.
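For reference, the key identity behind implicit reparameterization (for a univariate $y$ with CDF $F(y;\lambda)$, as derived in the Figurnov et al. paper cited above) is
$$\nabla_{\lambda}\, y = -\frac{\nabla_{\lambda} F(y;\lambda)}{p(y;\lambda)},$$
which lets gradients flow through samples of non-Gaussian $p(y\mid\cdot)$ without an explicit standardizing transform.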