3. Bayesian networks
• Directed graph
• Nodes represent variables
• Links show dependencies
• Conditional distribution at each node
• Defines a joint distribution:
P(C, L, S, I) = P(L) P(C) P(S|C) P(I|L,S)
[Figure: Bayesian network with nodes L (lighting color), C (object class), S (surface color), I (image color), and local factors P(L), P(C), P(S|C), P(I|L,S).]
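To make the factorization concrete, here is a minimal Python sketch; the binary variables and CPT values are invented for illustration and are not from the slides:

```python
# Minimal sketch (hypothetical CPT values): the joint distribution of the
# lighting/surface network factorizes over the graph, so evaluating it is
# just a product of local conditional probabilities.

P_L = [0.7, 0.3]                    # P(L): lighting color prior
P_C = [0.6, 0.4]                    # P(C): object class prior
P_S_given_C = [[0.9, 0.1],          # P(S|C): rows indexed by C, cols by S
               [0.2, 0.8]]
P_I_given_LS = [[[0.95, 0.05],      # P(I|L,S): indexed [L][S][I]
                 [0.30, 0.70]],
                [[0.60, 0.40],
                 [0.10, 0.90]]]

def joint(c, l, s, i):
    """P(C=c, L=l, S=s, I=i) = P(L) P(C) P(S|C) P(I|L,S)."""
    return P_L[l] * P_C[c] * P_S_given_C[c][s] * P_I_given_LS[l][s][i]

# Sanity check: the factorized joint sums to 1 over all configurations.
total = sum(joint(c, l, s, i)
            for c in (0, 1) for l in (0, 1) for s in (0, 1) for i in (0, 1))
assert abs(total - 1.0) < 1e-12
```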
4. Bayesian inference
• Observed variables D and hidden variables H.
• Hidden variables include parameters and latent variables.
• Learning/inference involves finding:
• P(H1, H2, …| D), or
• P(H, Θ | D, M) – explicitly for a generative model.
[Figure: the same network – L (lighting color), C (object class), and S (surface color) are hidden; I (image color) is observed.]
5. Bayesian inference vs. ML/MAP
• Consider learning one parameter θ
P(θ|D) = P(D|θ) P(θ) / P(D) ∝ P(D|θ) P(θ)
• How should we represent this posterior distribution?
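As a toy illustration of representing this posterior (a coin-flip model chosen for simplicity; it is not the slides' example), the unnormalized product P(D|θ) P(θ) can be tabulated on a grid and normalized directly when θ is one-dimensional:

```python
import numpy as np

# Sketch: represent the posterior over one parameter theta on a grid.
# Assumed toy model: a coin with unknown heads-probability theta,
# data D = 7 heads in 10 flips, uniform prior on theta.
theta = np.linspace(0.001, 0.999, 999)
log_post = 7 * np.log(theta) + 3 * np.log(1 - theta)  # ln P(D|theta) + ln P(theta)

posterior = np.exp(log_post - log_post.max())         # unnormalized P(D|theta) P(theta)
posterior /= posterior.sum()                          # normalize: divide by P(D)

theta_map = theta[np.argmax(posterior)]               # MAP point estimate
post_mean = np.sum(theta * posterior)                 # posterior mean
print(theta_map, post_mean)   # ~0.700 vs ~0.667: the point estimate ignores the spread
```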
6. Bayesian inference vs. ML/MAP
• Consider learning one parameter θ
[Figure: curve of P(D|θ) P(θ) against θ, with θMAP marked at its maximum.]
7. Bayesian inference vs. ML/MAP
• Consider learning one parameter θ
[Figure: the same curve P(D|θ) P(θ); θMAP sits at a point of high probability density, while most of the probability mass lies elsewhere.]
8. Bayesian inference vs. ML/MAP
• Consider learning one parameter θ
[Figure: samples drawn from P(D|θ) P(θ), contrasted with the point estimate θML.]
9. Bayesian inference vs. ML/MAP
• Consider learning one parameter θ
[Figure: a variational approximation Q(θ) fitted to P(D|θ) P(θ), contrasted with the point estimate θML.]
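Continuing the toy grid snippet from slide 5 (still an assumed illustration, not the slides' own method), a Gaussian Q(θ) can be fitted by brute-force minimization of KL(Q||P) over a coarse parameter grid:

```python
# Fit Q(theta) = N(m, s^2) to the grid posterior from the earlier snippet
# by brute-force search for the (m, s) minimizing KL(Q||P).
def kl_qp(m, s):
    q = np.exp(-0.5 * ((theta - m) / s) ** 2)
    q /= q.sum()                                   # discretized, normalized Q
    return np.sum(q * np.log(q / posterior))       # KL(Q||P) on the grid

means = np.linspace(0.4, 0.9, 51)
stds = np.linspace(0.05, 0.30, 26)
m_best, s_best = min(((m, s) for m in means for s in stds),
                     key=lambda ms: kl_qp(*ms))
print(m_best, s_best)   # close to the posterior's mean (~0.67) and spread (~0.13)
```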
10. Variational Inference (in three easy steps…)
1. Choose a family of variational distributions Q(H).
2. Use Kullback-Leibler divergence KL(Q||P) as a measure of ‘distance’ between P(H|D) and Q(H).
3. Find the Q which minimizes the divergence.
11. Choose Variational Distribution
• P(H|D) ≈ Q(H).
• If P is so complex, how do we choose Q?
• Any Q is better than an ML or MAP point estimate.
• Choose Q so it can get close to P and is tractable – e.g., factorized, with conjugate forms.
12. Kullback-Leibler Divergence
• Derived from the variational free energy of Feynman and Bogoliubov.
• The relative entropy between two probability distributions:
KL(Q||P) = Σ_x Q(x) ln [Q(x) / P(x)]
• KL(Q||P) ≥ 0 for any Q (by Jensen’s inequality).
• KL(Q||P) = 0 iff P = Q.
• Not a true distance measure – it is not symmetric.
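A direct transcription of this definition for discrete distributions (a sketch; it assumes distributions given as outcome-to-probability dicts):

```python
import math

def kl_divergence(Q, P):
    """KL(Q||P) = sum_x Q(x) ln [Q(x) / P(x)] for discrete distributions.
    Terms with Q(x) = 0 contribute zero; P(x) = 0 where Q(x) > 0 makes
    the divergence infinite."""
    total = 0.0
    for x, qx in Q.items():
        if qx == 0.0:
            continue
        px = P.get(x, 0.0)
        if px == 0.0:
            return math.inf
        total += qx * math.log(qx / px)
    return total

# KL is asymmetric: KL(Q||P) != KL(P||Q) in general.
Q = {"a": 0.5, "b": 0.5}
P = {"a": 0.9, "b": 0.1}
print(kl_divergence(Q, P), kl_divergence(P, Q))  # 0.511 vs 0.368 (nats)
```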
14. Kullback-Leibler Divergence
KL(Q||P) = Σ_H Q(H) ln [Q(H) / P(H|D)]
         = Σ_H Q(H) ln [Q(H) P(D) / P(H,D)]                    (Bayes rule)
         = Σ_H Q(H) ln [Q(H) / P(H,D)] + Σ_H Q(H) ln P(D)      (log property)
         = Σ_H Q(H) ln [Q(H) / P(H,D)] + ln P(D)               (sum over H)
Compare the reverse divergence: KL(P||Q) = Σ_H P(H|D) ln [P(H|D) / Q(H)].
15. Kullback-Leibler Divergence
DEFINE: L(Q) ≡ Σ_H Q(H) ln P(H,D) − Σ_H Q(H) ln Q(H)
• L is the difference between the expectation of the log joint ln P(H,D) under Q and the entropy of Q.
• Since KL(Q||P) = Σ_H Q(H) ln [Q(H) / P(H,D)] + ln P(D), we have KL(Q||P) = ln P(D) − L(Q), so maximizing L(Q) is equivalent to minimizing the KL divergence.
• We could not do the same trick for KL(P||Q); thus we approximate with a function that has its mass where the target distribution is most probable (an exclusive approximation).
16. Summarize
ln P(D) = L(Q) + KL(Q||P)
(fixed)    (maximise)  (minimise)
where L(Q) = Σ_H Q(H) ln [P(H,D) / Q(H)]
• Holds for arbitrary Q(H).
• We choose a family of Q distributions where L(Q) is tractable to compute.
• Still difficult in general to calculate.
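This identity is easy to check numerically on a toy discrete model (the joint table below is an assumption for illustration): varying Q trades L(Q) against KL(Q||P), but their sum stays fixed at ln P(D).

```python
import numpy as np

# Toy check of ln P(D) = L(Q) + KL(Q||P) for one discrete hidden variable H.
joint = np.array([0.10, 0.25, 0.05])        # P(H=h, D=d) for the observed d (assumed)
evidence = joint.sum()                      # P(D=d)
posterior = joint / evidence                # P(H|D=d)

def L(Q):
    return np.sum(Q * np.log(joint)) - np.sum(Q * np.log(Q))

def KL(Q):
    return np.sum(Q * np.log(Q / posterior))

for Q in (np.array([1/3, 1/3, 1/3]), np.array([0.2, 0.7, 0.1]), posterior):
    print(L(Q) + KL(Q), np.log(evidence))   # equal for every Q; KL = 0 at Q = posterior
```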
22. Factorised Approximation
• Assume Q factorises:
Q(H) = Π_i Q_i(H_i)
• The optimal solution for one factor, given the others, is
ln Q_i*(H_i) = ⟨ln P(H,D)⟩_{j≠i} + const.
or equivalently
Q_i*(H_i) = (1/Z) exp( Σ_{H_{j≠i}} Π_{j≠i} Q_j(H_j) ln P(H,D) )
• Given the form of Q, find the best Q in the KL sense.
• Choose conjugate priors P(H) to give the form of Q.
• Update each Q_i(H_i) iteratively, as in the sketch below.
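As a minimal sketch of this update rule (two discrete hidden variables; the joint table is invented for illustration), each factor is recomputed from the expected log joint under the other factor:

```python
import numpy as np

# Mean-field sketch: Q(H1,H2) = Q1(H1) Q2(H2) for a toy joint P(H1,H2,D=d)
# stored as a table over (H1, H2); the numbers are made up.
logP = np.log(np.array([[0.30, 0.05],
                        [0.10, 0.55]]))     # ln P(H1, H2, D=d)

Q1 = np.array([0.5, 0.5])
Q2 = np.array([0.5, 0.5])

def normalize_exp(log_q):
    q = np.exp(log_q - log_q.max())         # subtract max for numerical stability
    return q / q.sum()                      # the 1/Z normalization

for _ in range(50):
    # ln Q1*(h1) = sum_h2 Q2(h2) ln P(h1, h2, D) + const.
    Q1 = normalize_exp(logP @ Q2)
    # ln Q2*(h2) = sum_h1 Q1(h1) ln P(h1, h2, D) + const.
    Q2 = normalize_exp(logP.T @ Q1)

print(Q1, Q2)   # mean-field approximation to P(H1, H2 | D)
```

Iterating these coordinate updates is exactly the scheme the derivation on the next slide justifies: each update can only increase L(Q).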
23. Derivation
Idea: use the factoring of Q to isolate Q_j and maximize L with respect to Q_j.
L(Q) = Σ_H Q(H) ln P(H,D) − Σ_H Q(H) ln Q(H)
     = Σ_H Π_i Q_i(H_i) ln P(H,D) − Σ_H Π_i Q_i(H_i) ln Π_j Q_j(H_j)      (substitution)
     = Σ_H Π_i Q_i(H_i) ln P(H,D) − Σ_i Σ_{H_i} Q_i(H_i) ln Q_i(H_i)      (log property)
     = Σ_{H_j} Q_j(H_j) Σ_{H_{i≠j}} Π_{i≠j} Q_i(H_i) ln P(H,D)
       − Σ_{H_j} Q_j(H_j) ln Q_j(H_j) − Σ_{i≠j} Σ_{H_i} Q_i(H_i) ln Q_i(H_i)   (factor out one term Q_j; the last term is not a function of Q_j)
With Q_j*(H_j) = (1/Z) exp( Σ_{H_{i≠j}} Π_{i≠j} Q_i(H_i) ln P(H,D) ), this gives, up to terms independent of Q_j,
L(Q) = −KL(Q_j || Q_j*) + log Z
so L is maximized with respect to Q_j by setting Q_j = Q_j*.
24. Example: Univariate Gaussian
• Normal likelihood with unknown mean µ and precision γ.
• Goal: find P(µ, γ | x).
• Conjugate priors: Gaussian on µ, Gamma on γ.
• Factorized variational distribution: Q(µ, γ) = Q(µ) Q(γ).
• Each Q factor takes the same form as the corresponding prior.
• Inference involves updating the hidden parameters of these Q factors, as sketched below.
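Here is a runnable sketch of these updates, assuming the priors quoted on the example slide below (P(µ) = N(0, 1000), P(γ) = Gamma(0.001, 0.001)) and made-up data; the update equations follow the standard mean-field treatment of this model rather than anything written out on the slides:

```python
import numpy as np

# VB sketch for x_i ~ N(mu, 1/gamma) with independent priors
# mu ~ N(mu0, v0) and gamma ~ Gamma(a0, b0), and Q(mu, gamma) = Q(mu) Q(gamma).
# Q(mu) = N(m, v) and Q(gamma) = Gamma(a, b) keep the priors' forms.
x = np.array([1.3, 0.7, 1.1, 0.9])          # four data samples (made up)
N, sx, sxx = len(x), x.sum(), (x ** 2).sum()

mu0, v0 = 0.0, 1000.0                       # prior P(mu) = N(0, 1000)
a0, b0 = 1e-3, 1e-3                         # prior P(gamma) = Gamma(.001, .001)

m, v = 0.0, 1.0                             # initialize Q(mu)
a, b = a0 + N / 2.0, 1.0                    # shape a is fixed by its update

for _ in range(100):
    e_gamma = a / b                         # <gamma> under Q(gamma)
    # Update Q(mu): precision and mean combine prior and expected likelihood.
    v = 1.0 / (1.0 / v0 + N * e_gamma)
    m = v * (mu0 / v0 + e_gamma * sx)
    # Update Q(gamma): b uses <sum_i (x_i - mu)^2> under Q(mu).
    b = b0 + 0.5 * (sxx - 2.0 * m * sx + N * (m ** 2 + v))

print(m, np.sqrt(v))    # posterior mean of mu and its uncertainty
print(a / b)            # posterior mean of the precision gamma
```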
26. Example: Univariate Gaussian
• An estimate of the log evidence can be found by calculating L(Q) = ⟨ln P(x, µ, γ)⟩ − ⟨ln Q(µ, γ)⟩,
• where ⟨·⟩ denotes an expectation with respect to Q(·).
27. Example
[Figure: four data samples drawn from a Gaussian (thick line) are used to find the posterior; dashed lines show distributions sampled from the variational posterior. Variational and true posteriors for the Gaussian given the four samples, with priors P(µ) = N(0, 1000) and P(γ) = Gamma(0.001, 0.001).]
28. VB with Image Segmentation
[Figure: an image with two marked pixel locations and the RGB histograms at those locations.]
• “VB at the pixel level will give better results.”
• A feature vector (x, y, Vx, Vy, r, g, b) will have issues with data association.
• VB with a GMM will be complex – doing this in real time will be execrable.
31. This Brings Up Variational Message Passing (VMP) – Efficient Computation
[Figure: the Bayesian network again, with nodes L (lighting color), C (object class), S (surface color), I (image color) and factors P(L), P(C), P(S|C), P(I|L,S).]
Editor's Notes
Illustration ML vs. Bayesian – for Bayesian methods, mention sampling
WRITE CONCLUSION SLIDE!!
Maximum likelihood/MAP
Finds point estimates of hidden variables
Vulnerable to over-fitting
Variational inference
Finds posterior distributions over hidden variables
Allows direct model comparison
Guaranteed to increase the lower bound – unless already at a maximum.