3. Bayesian networks
• Directed graph
• Nodes represent variables
• Links show dependencies
• Conditional distribution at each node
• Defines a joint distribution:
P(C, L, S, I) = P(L) P(C) P(S|C) P(I|L,S)
[Figure: Bayesian network with nodes L (lighting color), C (object class), S (surface color), I (image color), and local factors P(L), P(C), P(S|C), P(I|L,S).]
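To make the factorization concrete, here is a minimal Python sketch; the binary variables and CPT values are invented for illustration and are not from the slides:

```python
# Minimal sketch (hypothetical CPT values): the joint distribution of the
# lighting/surface network factorizes over the graph, so evaluating it is
# just a product of local conditional probabilities.

P_L = [0.7, 0.3]                    # P(L): lighting color prior
P_C = [0.6, 0.4]                    # P(C): object class prior
P_S_given_C = [[0.9, 0.1],          # P(S|C): rows indexed by C, cols by S
               [0.2, 0.8]]
P_I_given_LS = [[[0.95, 0.05],      # P(I|L,S): indexed [L][S][I]
                 [0.30, 0.70]],
                [[0.60, 0.40],
                 [0.10, 0.90]]]

def joint(c, l, s, i):
    """P(C=c, L=l, S=s, I=i) = P(L) P(C) P(S|C) P(I|L,S)."""
    return P_L[l] * P_C[c] * P_S_given_C[c][s] * P_I_given_LS[l][s][i]

# Sanity check: the factorized joint sums to 1 over all configurations.
total = sum(joint(c, l, s, i)
            for c in (0, 1) for l in (0, 1) for s in (0, 1) for i in (0, 1))
assert abs(total - 1.0) < 1e-12
```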
4. Bayesian inference
• Observed variables D and hidden variables H.
• Hidden variables include parameters and latent variables.
• Learning/inference involves finding:
• P(H1, H2, …| D), or
• P(H, Θ | D, M) – explicitly for a generative model.
[Figure: the same network – L (lighting color), C (object class), and S (surface color) are hidden; I (image color) is observed.]
5. Bayesian inference vs. ML/MAP
• Consider learning one parameter θ
P(θ|D) = P(D|θ) P(θ) / P(D) ∝ P(D|θ) P(θ)
• How should we represent this posterior distribution?
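As a toy illustration of representing this posterior (a coin-flip model chosen for simplicity; it is not the slides' example), the unnormalized product P(D|θ) P(θ) can be tabulated on a grid and normalized directly when θ is one-dimensional:

```python
import numpy as np

# Sketch: represent the posterior over one parameter theta on a grid.
# Assumed toy model: a coin with unknown heads-probability theta,
# data D = 7 heads in 10 flips, uniform prior on theta.
theta = np.linspace(0.001, 0.999, 999)
log_post = 7 * np.log(theta) + 3 * np.log(1 - theta)  # ln P(D|theta) + ln P(theta)

posterior = np.exp(log_post - log_post.max())         # unnormalized P(D|theta) P(theta)
posterior /= posterior.sum()                          # normalize: divide by P(D)

theta_map = theta[np.argmax(posterior)]               # MAP point estimate
post_mean = np.sum(theta * posterior)                 # posterior mean
print(theta_map, post_mean)   # ~0.700 vs ~0.667: the point estimate ignores the spread
```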
6. Bayesian inference vs. ML/MAP
• Consider learning one parameter θ
[Figure: curve of P(D|θ) P(θ) against θ, with θMAP marked at its maximum.]
7. Bayesian inference vs. ML/MAP
• Consider learning one parameter θ
[Figure: the same curve P(D|θ) P(θ); θMAP sits at a point of high probability density, while most of the probability mass lies elsewhere.]
8. Bayesian inference vs. ML/MAP
• Consider learning one parameter θ
[Figure: samples drawn from P(D|θ) P(θ), contrasted with the point estimate θML.]
9. Bayesian inference vs. ML/MAP
• Consider learning one parameter θ
[Figure: a variational approximation Q(θ) fitted to P(D|θ) P(θ), contrasted with the point estimate θML.]
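Continuing the toy grid snippet from slide 5 (still an assumed illustration, not the slides' own method), a Gaussian Q(θ) can be fitted by brute-force minimization of KL(Q||P) over a coarse parameter grid:

```python
# Fit Q(theta) = N(m, s^2) to the grid posterior from the earlier snippet
# by brute-force search for the (m, s) minimizing KL(Q||P).
def kl_qp(m, s):
    q = np.exp(-0.5 * ((theta - m) / s) ** 2)
    q /= q.sum()                                   # discretized, normalized Q
    return np.sum(q * np.log(q / posterior))       # KL(Q||P) on the grid

means = np.linspace(0.4, 0.9, 51)
stds = np.linspace(0.05, 0.30, 26)
m_best, s_best = min(((m, s) for m in means for s in stds),
                     key=lambda ms: kl_qp(*ms))
print(m_best, s_best)   # close to the posterior's mean (~0.67) and spread (~0.13)
```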
10. Variational Inference (in three easy steps…)
1. Choose a family of variational distributions Q(H).
2. Use Kullback-Leibler divergence KL(Q||P) as a measure of ‘distance’ between P(H|D) and Q(H).
3. Find the Q which minimizes the divergence.
11. Choose Variational Distribution
• P(H|D) ≈ Q(H).
• If P is so complex, how do we choose Q?
• Any Q is better than an ML or MAP point estimate.
• Choose Q so it can get close to P and is tractable – e.g., factorized, with conjugate forms.
12. Kullback-Leibler Divergence
• Derived from the variational free energy of Feynman and Bogoliubov.
• The relative entropy between two probability distributions:
KL(Q||P) = Σ_x Q(x) ln [Q(x) / P(x)]
• KL(Q||P) ≥ 0 for any Q (by Jensen’s inequality).
• KL(Q||P) = 0 iff P = Q.
• Not a true distance measure – it is not symmetric.
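A direct transcription of this definition for discrete distributions (a sketch; it assumes distributions given as outcome-to-probability dicts):

```python
import math

def kl_divergence(Q, P):
    """KL(Q||P) = sum_x Q(x) ln [Q(x) / P(x)] for discrete distributions.
    Terms with Q(x) = 0 contribute zero; P(x) = 0 where Q(x) > 0 makes
    the divergence infinite."""
    total = 0.0
    for x, qx in Q.items():
        if qx == 0.0:
            continue
        px = P.get(x, 0.0)
        if px == 0.0:
            return math.inf
        total += qx * math.log(qx / px)
    return total

# KL is asymmetric: KL(Q||P) != KL(P||Q) in general.
Q = {"a": 0.5, "b": 0.5}
P = {"a": 0.9, "b": 0.1}
print(kl_divergence(Q, P), kl_divergence(P, Q))  # 0.511 vs 0.368 (nats)
```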
14. Kullback-Leibler Divergence
KL(Q||P) = Σ_H Q(H) ln [Q(H) / P(H|D)]
         = Σ_H Q(H) ln [Q(H) P(D) / P(H,D)]                    (Bayes rule)
         = Σ_H Q(H) ln [Q(H) / P(H,D)] + Σ_H Q(H) ln P(D)      (log property)
         = Σ_H Q(H) ln [Q(H) / P(H,D)] + ln P(D)               (sum over H)
Compare the reverse divergence: KL(P||Q) = Σ_H P(H|D) ln [P(H|D) / Q(H)].
15. Kullback-Leibler Divergence
DEFINE: L(Q) ≡ Σ_H Q(H) ln P(H,D) − Σ_H Q(H) ln Q(H)
• L is the difference between the expectation of the log joint ln P(H,D) under Q and the entropy of Q.
• Since KL(Q||P) = Σ_H Q(H) ln [Q(H) / P(H,D)] + ln P(D), we have KL(Q||P) = ln P(D) − L(Q), so maximizing L(Q) is equivalent to minimizing the KL divergence.
• We could not do the same trick for KL(P||Q); thus we approximate with a function that has its mass where the target distribution is most probable (an exclusive approximation).
16. Summarize
ln P(D) = L(Q) + KL(Q||P)
(fixed)    (maximise)  (minimise)
where L(Q) = Σ_H Q(H) ln [P(H,D) / Q(H)]
• Holds for arbitrary Q(H).
• We choose a family of Q distributions where L(Q) is tractable to compute.
• Still difficult in general to calculate.
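This identity is easy to check numerically on a toy discrete model (the joint table below is an assumption for illustration): varying Q trades L(Q) against KL(Q||P), but their sum stays fixed at ln P(D).

```python
import numpy as np

# Toy check of ln P(D) = L(Q) + KL(Q||P) for one discrete hidden variable H.
joint = np.array([0.10, 0.25, 0.05])        # P(H=h, D=d) for the observed d (assumed)
evidence = joint.sum()                      # P(D=d)
posterior = joint / evidence                # P(H|D=d)

def L(Q):
    return np.sum(Q * np.log(joint)) - np.sum(Q * np.log(Q))

def KL(Q):
    return np.sum(Q * np.log(Q / posterior))

for Q in (np.array([1/3, 1/3, 1/3]), np.array([0.2, 0.7, 0.1]), posterior):
    print(L(Q) + KL(Q), np.log(evidence))   # equal for every Q; KL = 0 at Q = posterior
```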
22. Factorised Approximation
• Assume Q factorises:
Q(H) = Π_i Q_i(H_i)
• The optimal solution for one factor, given the others, is
ln Q_i*(H_i) = ⟨ln P(H,D)⟩_{j≠i} + const.
or equivalently
Q_i*(H_i) = (1/Z) exp( Σ_{H_{j≠i}} Π_{j≠i} Q_j(H_j) ln P(H,D) )
• Given the form of Q, find the best Q in the KL sense.
• Choose conjugate priors P(H) to give the form of Q.
• Update each Q_i(H_i) iteratively, as in the sketch below.
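As a minimal sketch of this update rule (two discrete hidden variables; the joint table is invented for illustration), each factor is recomputed from the expected log joint under the other factor:

```python
import numpy as np

# Mean-field sketch: Q(H1,H2) = Q1(H1) Q2(H2) for a toy joint P(H1,H2,D=d)
# stored as a table over (H1, H2); the numbers are made up.
logP = np.log(np.array([[0.30, 0.05],
                        [0.10, 0.55]]))     # ln P(H1, H2, D=d)

Q1 = np.array([0.5, 0.5])
Q2 = np.array([0.5, 0.5])

def normalize_exp(log_q):
    q = np.exp(log_q - log_q.max())         # subtract max for numerical stability
    return q / q.sum()                      # the 1/Z normalization

for _ in range(50):
    # ln Q1*(h1) = sum_h2 Q2(h2) ln P(h1, h2, D) + const.
    Q1 = normalize_exp(logP @ Q2)
    # ln Q2*(h2) = sum_h1 Q1(h1) ln P(h1, h2, D) + const.
    Q2 = normalize_exp(logP.T @ Q1)

print(Q1, Q2)   # mean-field approximation to P(H1, H2 | D)
```

Iterating these coordinate updates is exactly the scheme the derivation on the next slide justifies: each update can only increase L(Q).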
23. Derivation
Idea: use the factoring of Q to isolate Q_j and maximize L with respect to Q_j.
L(Q) = Σ_H Q(H) ln P(H,D) − Σ_H Q(H) ln Q(H)
     = Σ_H Π_i Q_i(H_i) ln P(H,D) − Σ_H Π_i Q_i(H_i) ln Π_j Q_j(H_j)      (substitution)
     = Σ_H Π_i Q_i(H_i) ln P(H,D) − Σ_i Σ_{H_i} Q_i(H_i) ln Q_i(H_i)      (log property)
     = Σ_{H_j} Q_j(H_j) Σ_{H_{i≠j}} Π_{i≠j} Q_i(H_i) ln P(H,D)
       − Σ_{H_j} Q_j(H_j) ln Q_j(H_j) − Σ_{i≠j} Σ_{H_i} Q_i(H_i) ln Q_i(H_i)   (factor out one term Q_j; the last term is not a function of Q_j)
With Q_j*(H_j) = (1/Z) exp( Σ_{H_{i≠j}} Π_{i≠j} Q_i(H_i) ln P(H,D) ), this gives, up to terms independent of Q_j,
L(Q) = −KL(Q_j || Q_j*) + log Z
so L is maximized with respect to Q_j by setting Q_j = Q_j*.
24. Example: Univariate Gaussian
• Normal likelihood with unknown mean µ and precision γ.
• Goal: find P(µ, γ | x).
• Conjugate priors: Gaussian on µ, Gamma on γ.
• Factorized variational distribution: Q(µ, γ) = Q(µ) Q(γ).
• Each Q factor takes the same form as the corresponding prior.
• Inference involves updating the hidden parameters of these Q factors, as sketched below.
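Here is a runnable sketch of these updates, assuming the priors quoted on the example slide below (P(µ) = N(0, 1000), P(γ) = Gamma(0.001, 0.001)) and made-up data; the update equations follow the standard mean-field treatment of this model rather than anything written out on the slides:

```python
import numpy as np

# VB sketch for x_i ~ N(mu, 1/gamma) with independent priors
# mu ~ N(mu0, v0) and gamma ~ Gamma(a0, b0), and Q(mu, gamma) = Q(mu) Q(gamma).
# Q(mu) = N(m, v) and Q(gamma) = Gamma(a, b) keep the priors' forms.
x = np.array([1.3, 0.7, 1.1, 0.9])          # four data samples (made up)
N, sx, sxx = len(x), x.sum(), (x ** 2).sum()

mu0, v0 = 0.0, 1000.0                       # prior P(mu) = N(0, 1000)
a0, b0 = 1e-3, 1e-3                         # prior P(gamma) = Gamma(.001, .001)

m, v = 0.0, 1.0                             # initialize Q(mu)
a, b = a0 + N / 2.0, 1.0                    # shape a is fixed by its update

for _ in range(100):
    e_gamma = a / b                         # <gamma> under Q(gamma)
    # Update Q(mu): precision and mean combine prior and expected likelihood.
    v = 1.0 / (1.0 / v0 + N * e_gamma)
    m = v * (mu0 / v0 + e_gamma * sx)
    # Update Q(gamma): b uses <sum_i (x_i - mu)^2> under Q(mu).
    b = b0 + 0.5 * (sxx - 2.0 * m * sx + N * (m ** 2 + v))

print(m, np.sqrt(v))    # posterior mean of mu and its uncertainty
print(a / b)            # posterior mean of the precision gamma
```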
26. Example: Univariate Gaussian
• An estimate of the log evidence can be found by calculating L(Q) = ⟨ln P(x, µ, γ)⟩ − ⟨ln Q(µ, γ)⟩,
• where ⟨·⟩ denotes an expectation with respect to Q(·).
27. Example
[Figure: four data samples drawn from a Gaussian (thick line) are used to find the posterior; dashed lines show distributions sampled from the variational posterior. Variational and true posteriors for the Gaussian given the four samples, with priors P(µ) = N(0, 1000) and P(γ) = Gamma(0.001, 0.001).]
28. VB with Image Segmentation
[Figure: an image with two marked pixel locations and the RGB histograms at those locations.]
• “VB at the pixel level will give better results.”
• A feature vector (x, y, Vx, Vy, r, g, b) will have issues with data association.
• VB with a GMM will be complex – doing this in real time will be execrable.
31. This Brings Up Variational Message Passing (VMP) – Efficient Computation
[Figure: the Bayesian network again, with nodes L (lighting color), C (object class), S (surface color), I (image color) and factors P(L), P(C), P(S|C), P(I|L,S).]
Editor's Notes
Illustration ML vs. Bayesian – for Bayesian methods, mention sampling
WRITE CONCLUSION SLIDE!!
Maximum likelihood/MAP
Finds point estimates of hidden variables
Vulnerable to over-fitting
Variational inference
Finds posterior distributions over hidden variables
Allows direct model comparison
Guaranteed to increase the lower bound – unless already at a maximum.