Outline: Belief Propagation | Variational Inference | Sampling | Break
Part 3: Probabilistic Inference in
Graphical Models
Sebastian Nowozin and Christoph H. Lampert
Colorado Springs, 25th June 2011
Belief Propagation
“Message-passing” algorithm
Exact and optimal for tree-structured graphs
Approximate for cyclic graphs
Example: Inference on Chains
Chain factor graph: Y_i -F- Y_j -G- Y_k -H- Y_l

Z = \sum_{y \in \mathcal{Y}} \exp(-E(y))
  = \sum_{y_i \in \mathcal{Y}_i} \sum_{y_j \in \mathcal{Y}_j} \sum_{y_k \in \mathcal{Y}_k} \sum_{y_l \in \mathcal{Y}_l} \exp(-E(y_i, y_j, y_k, y_l))
  = \sum_{y_i \in \mathcal{Y}_i} \sum_{y_j \in \mathcal{Y}_j} \sum_{y_k \in \mathcal{Y}_k} \sum_{y_l \in \mathcal{Y}_l} \exp(-(E_F(y_i, y_j) + E_G(y_j, y_k) + E_H(y_k, y_l)))
Example: Inference on Chains (cont.)

The exponential of the sum factorizes, and each sum can be pushed past the factors that do not depend on its variable:

Z = \sum_{y_i} \sum_{y_j} \sum_{y_k} \sum_{y_l} \exp(-(E_F(y_i, y_j) + E_G(y_j, y_k) + E_H(y_k, y_l)))
  = \sum_{y_i} \sum_{y_j} \sum_{y_k} \sum_{y_l} \exp(-E_F(y_i, y_j)) \exp(-E_G(y_j, y_k)) \exp(-E_H(y_k, y_l))
  = \sum_{y_i} \sum_{y_j} \exp(-E_F(y_i, y_j)) \sum_{y_k} \exp(-E_G(y_j, y_k)) \sum_{y_l} \exp(-E_H(y_k, y_l))
Example: Inference on Chains (cont.)

Define the message r_{H \to Y_k} \in \mathbb{R}^{\mathcal{Y}_k} as the innermost sum:

Z = \sum_{y_i \in \mathcal{Y}_i} \sum_{y_j \in \mathcal{Y}_j} \exp(-E_F(y_i, y_j)) \sum_{y_k \in \mathcal{Y}_k} \exp(-E_G(y_j, y_k)) \underbrace{\sum_{y_l \in \mathcal{Y}_l} \exp(-E_H(y_k, y_l))}_{r_{H \to Y_k}(y_k)}
  = \sum_{y_i \in \mathcal{Y}_i} \sum_{y_j \in \mathcal{Y}_j} \exp(-E_F(y_i, y_j)) \sum_{y_k \in \mathcal{Y}_k} \exp(-E_G(y_j, y_k)) \, r_{H \to Y_k}(y_k)
Example: Inference on Chains (cont.)

Likewise, define r_{G \to Y_j} \in \mathbb{R}^{\mathcal{Y}_j}:

Z = \sum_{y_i \in \mathcal{Y}_i} \sum_{y_j \in \mathcal{Y}_j} \exp(-E_F(y_i, y_j)) \underbrace{\sum_{y_k \in \mathcal{Y}_k} \exp(-E_G(y_j, y_k)) \, r_{H \to Y_k}(y_k)}_{r_{G \to Y_j}(y_j)}
  = \sum_{y_i \in \mathcal{Y}_i} \sum_{y_j \in \mathcal{Y}_j} \exp(-E_F(y_i, y_j)) \, r_{G \to Y_j}(y_j)
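The chain elimination above is easy to check numerically. The following sketch (not from the slides) uses random toy energy tables for F, G, H with four states per variable and compares brute-force enumeration against right-to-left message passing:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4  # number of states per variable (toy choice)

# Hypothetical energy tables for the chain Y_i - F - Y_j - G - Y_k - H - Y_l
E_F, E_G, E_H = rng.normal(size=(3, K, K))

# Brute force: sum exp(-E) over all K^4 joint states
Z_brute = sum(
    np.exp(-(E_F[i, j] + E_G[j, k] + E_H[k, l]))
    for i in range(K) for j in range(K)
    for k in range(K) for l in range(K)
)

# Message passing: eliminate variables right to left
r_H = np.exp(-E_H).sum(axis=1)      # r_{H->Y_k}(y_k) = sum_{y_l} exp(-E_H(y_k, y_l))
r_G = np.exp(-E_G) @ r_H            # r_{G->Y_j}(y_j)
Z_msg = (np.exp(-E_F) @ r_G).sum()  # remaining sums over y_i, y_j

assert np.isclose(Z_brute, Z_msg)
```

Both routes compute the same Z, but message passing costs O(K^2) per factor instead of O(K^4) overall.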
Example: Inference on Trees

Tree factor graph: the chain Y_i -F- Y_j -G- Y_k -H- Y_l with an additional factor I connecting Y_k to Y_m.

Z = \sum_{y \in \mathcal{Y}} \exp(-E(y))
  = \sum_{y_i \in \mathcal{Y}_i} \sum_{y_j \in \mathcal{Y}_j} \sum_{y_k \in \mathcal{Y}_k} \sum_{y_l \in \mathcal{Y}_l} \sum_{y_m \in \mathcal{Y}_m} \exp(-(E_F(y_i, y_j) + \cdots + E_I(y_k, y_m)))
Example: Inference on Trees (cont.)

Z = \sum_{y_i} \sum_{y_j} \exp(-E_F(y_i, y_j)) \sum_{y_k} \exp(-E_G(y_j, y_k)) \cdot \underbrace{\sum_{y_l} \exp(-E_H(y_k, y_l))}_{r_{H \to Y_k}(y_k)} \cdot \underbrace{\sum_{y_m} \exp(-E_I(y_k, y_m))}_{r_{I \to Y_k}(y_k)}
Example: Inference on Trees (cont.)

Both messages arriving at Y_k are collected:

Z = \sum_{y_i} \sum_{y_j} \exp(-E_F(y_i, y_j)) \sum_{y_k} \exp(-E_G(y_j, y_k)) \cdot \left( r_{H \to Y_k}(y_k) \cdot r_{I \to Y_k}(y_k) \right)
Example: Inference on Trees (cont.)

The product of incoming messages defines the variable-to-factor message q_{Y_k \to G}:

Z = \sum_{y_i} \sum_{y_j} \exp(-E_F(y_i, y_j)) \sum_{y_k} \exp(-E_G(y_j, y_k)) \cdot \underbrace{\left( r_{H \to Y_k}(y_k) \cdot r_{I \to Y_k}(y_k) \right)}_{q_{Y_k \to G}(y_k)}
Example: Inference on Trees (cont.)

Z = \sum_{y_i} \sum_{y_j} \exp(-E_F(y_i, y_j)) \sum_{y_k} \exp(-E_G(y_j, y_k)) \, q_{Y_k \to G}(y_k)
Factor Graph Sum-Product Algorithm

A “message” is a pair of vectors on each factor graph edge (i, F) \in \mathcal{E}:
1. r_{F \to Y_i} \in \mathbb{R}^{\mathcal{Y}_i}: the factor-to-variable message
2. q_{Y_i \to F} \in \mathbb{R}^{\mathcal{Y}_i}: the variable-to-factor message

The algorithm iteratively updates the messages. After convergence, Z and the factor marginals \mu_F can be obtained from the messages.
Sum-Product: Variable-to-Factor Message

Let M(i) = \{F \in \mathcal{F} : (i, F) \in \mathcal{E}\} denote the set of factors adjacent to variable i.

Variable-to-factor message:

q_{Y_i \to F}(y_i) = \prod_{F' \in M(i) \setminus \{F\}} r_{F' \to Y_i}(y_i)
Sum-Product: Factor-to-Variable Message

Factor-to-variable message:

r_{F \to Y_i}(y_i) = \sum_{y_F \in \mathcal{Y}_F, \, [y_F]_i = y_i} \exp(-E_F(y_F)) \prod_{j \in N(F) \setminus \{i\}} q_{Y_j \to F}([y_F]_j)

where N(F) denotes the variables adjacent to factor F.
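The two message updates can be exercised on a small example. The sketch below is illustrative only (the tree, the random energy tables, and the sweep count are all assumptions): it runs synchronous sum-product updates on a five-variable tree of pairwise factors, then evaluates Z at a root and checks it against brute-force enumeration:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
K = 3  # states per variable

# Hypothetical tree over 5 variables: F=(0,1), G=(1,2), H=(2,3), I=(2,4)
factors = {"F": (0, 1), "G": (1, 2), "H": (2, 3), "I": (2, 4)}
E = {f: rng.normal(size=(K, K)) for f in factors}
n_vars = 5

def M(i):
    """Factors adjacent to variable i."""
    return [f for f, scope in factors.items() if i in scope]

# Messages, initialized to all-ones
r = {(f, i): np.ones(K) for f, scope in factors.items() for i in scope}
q = {(i, f): np.ones(K) for f, scope in factors.items() for i in scope}

for _ in range(10):  # enough synchronous sweeps for this small tree
    for i in range(n_vars):
        for f in M(i):
            # q_{Y_i -> F}: product of incoming r from all other factors
            others = [r[(g, i)] for g in M(i) if g != f]
            q[(i, f)] = np.prod(others, axis=0) if others else np.ones(K)
    for f, (a, b) in factors.items():
        phi = np.exp(-E[f])           # phi[y_a, y_b]
        r[(f, a)] = phi @ q[(b, f)]   # sum over y_b
        r[(f, b)] = phi.T @ q[(a, f)]

# Partition function, evaluated at root variable 2
Z_bp = np.sum(np.prod([r[(f, 2)] for f in M(2)], axis=0))

# Brute-force check
Z_bf = sum(
    np.exp(-sum(E[f][y[a], y[b]] for f, (a, b) in factors.items()))
    for y in itertools.product(range(K), repeat=n_vars)
)
assert np.isclose(Z_bp, Z_bf)
```

On a tree, these repeated synchronous updates reach the exact fixed point once every leaf-to-root path has been traversed.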
Message Scheduling

q_{Y_i \to F}(y_i) = \prod_{F' \in M(i) \setminus \{F\}} r_{F' \to Y_i}(y_i)

r_{F \to Y_i}(y_i) = \sum_{y_F \in \mathcal{Y}_F, \, [y_F]_i = y_i} \exp(-E_F(y_F)) \prod_{j \in N(F) \setminus \{i\}} q_{Y_j \to F}([y_F]_j)

Problem: the message updates depend on each other.
No dependencies if the product is empty (it evaluates to 1).
For tree-structured graphs we can resolve all dependencies.
Message Scheduling: Trees

[Figure: the same tree factor graph shown twice, with message updates numbered 1.-10. for the leaf-to-root pass and again for the root-to-leaf pass.]
1. Select one variable node as tree root
2. Compute leaf-to-root messages
3. Compute root-to-leaf messages
Inference Results: Z and Marginals

Partition function, evaluated at the root variable r:

Z = \sum_{y_r \in \mathcal{Y}_r} \prod_{F \in M(r)} r_{F \to Y_r}(y_r)

Marginal distributions, for each factor:

\mu_F(y_F) = p(Y_F = y_F) = \frac{1}{Z} \exp(-E_F(y_F)) \prod_{i \in N(F)} q_{Y_i \to F}(y_i)
Max-Product/Max-Sum Algorithm

Belief propagation for MAP inference:

y^* = \operatorname{argmax}_{y \in \mathcal{Y}} p(Y = y \,|\, x, w)

Exact for trees. For cyclic graphs: not as well understood as the sum-product algorithm.

q_{Y_i \to F}(y_i) = \sum_{F' \in M(i) \setminus \{F\}} r_{F' \to Y_i}(y_i)

r_{F \to Y_i}(y_i) = \max_{y_F \in \mathcal{Y}_F, \, [y_F]_i = y_i} \left( -E_F(y_F) + \sum_{j \in N(F) \setminus \{i\}} q_{Y_j \to F}([y_F]_j) \right)
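As an illustration, here is max-sum specialized to a chain with unary and pairwise energies (a toy setup with random energy tables, not code from the slides). The forward pass keeps running max-scores and backpointers; backtracking recovers the exact MAP configuration, verified against enumeration:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
K, n = 4, 5  # states per variable, chain length

# Hypothetical chain energies: unary E_i(y_i) and pairwise E_{i,i+1}(y_i, y_{i+1})
unary = rng.normal(size=(n, K))
pair = rng.normal(size=(n - 1, K, K))

def energy(y):
    return sum(unary[i, y[i]] for i in range(n)) + \
           sum(pair[i, y[i], y[i + 1]] for i in range(n - 1))

# Forward pass: m[s] = max over prefixes ending in state s of (-energy)
m = -unary[0].copy()
back = []
for i in range(1, n):
    scores = m[:, None] - pair[i - 1] - unary[i][None, :]  # (prev, cur)
    back.append(scores.argmax(axis=0))
    m = scores.max(axis=0)

# Backtrack the argmax configuration
y = [int(m.argmax())]
for bp in reversed(back):
    y.append(int(bp[y[-1]]))
y = y[::-1]

# Exact MAP by enumeration
best = min(energy(c) for c in itertools.product(range(K), repeat=n))
assert np.isclose(energy(y), best)
```

This is the familiar Viterbi recursion: max-sum on a chain with the messages stored as the running score vector m.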
Sum-Product/Max-Sum Comparison

Sum-Product:

q_{Y_i \to F}(y_i) = \prod_{F' \in M(i) \setminus \{F\}} r_{F' \to Y_i}(y_i)

r_{F \to Y_i}(y_i) = \sum_{y_F \in \mathcal{Y}_F, \, [y_F]_i = y_i} \exp(-E_F(y_F)) \prod_{j \in N(F) \setminus \{i\}} q_{Y_j \to F}([y_F]_j)

Max-Sum:

q_{Y_i \to F}(y_i) = \sum_{F' \in M(i) \setminus \{F\}} r_{F' \to Y_i}(y_i)

r_{F \to Y_i}(y_i) = \max_{y_F \in \mathcal{Y}_F, \, [y_F]_i = y_i} \left( -E_F(y_F) + \sum_{j \in N(F) \setminus \{i\}} q_{Y_j \to F}([y_F]_j) \right)
Example: Pictorial Structures

[Figure: tree-structured factor graph over body-part variables Y_top, Y_head, Y_torso, Y_rarm, Y_larm, Y_rhnd, Y_lhnd, Y_rleg, Y_lleg, Y_rfoot, Y_lfoot, with unary factors such as F_top^{(1)} connecting the image X to each part and pairwise factors such as F_{top,head}^{(2)} connecting adjacent parts.]

Tree-structured model for articulated pose (Felzenszwalb and Huttenlocher, 2000), (Fischler and Elschlager, 1973).
Body-part variables; states: discretized tuple (x, y, s, θ), with (x, y) position, s scale, and θ rotation.
Example: Pictorial Structures (cont)

r_{F \to Y_i}(y_i) = \max_{y_j \in \mathcal{Y}_j} \left( -E_F(y_i, y_j) + q_{Y_j \to F}(y_j) \right) \qquad (1)

Because \mathcal{Y}_i is large (≈ 500k states), enumerating \mathcal{Y}_i \times \mathcal{Y}_j is too expensive.
(Felzenszwalb and Huttenlocher, 2000) use a special form for E_F(y_i, y_j) (a generalized distance transform) so that (1) is computable in O(|\mathcal{Y}_i|).
Loopy Belief Propagation

Key difference: no schedule that removes the dependencies.
But: the message computation is still well defined.
Therefore, classic loopy belief propagation (Pearl, 1988):
1. Initialize message vectors to 1 (sum-product) or 0 (max-sum)
2. Update messages, hoping for convergence
3. Upon convergence, treat the beliefs \mu_F as approximate marginals

Different message schedules (synchronous/asynchronous, static/dynamic).
Improvements: generalized BP (Yedidia et al., 2001), convergent BP (Heskes, 2006).
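A minimal loopy sum-product sketch, assuming a binary four-variable cycle with weak random couplings and a damped synchronous-style schedule (all of these are illustrative choices; loopy BP carries no general convergence guarantee):

```python
import numpy as np

rng = np.random.default_rng(3)
K = 2
# Hypothetical cyclic model: a 4-cycle of binary variables with weak couplings
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
E = {e: 0.1 * rng.normal(size=(K, K)) for e in edges}

# One factor-to-variable message per direction of each pairwise factor;
# m[(j, i)] is the message arriving at variable i through factor {i, j}.
m = {(j, i): np.ones(K) / K for a, b in edges for j, i in [(a, b), (b, a)]}

def incoming(i, excl=None):
    """Product of messages into variable i, excluding neighbor excl."""
    vec = np.ones(K)
    for a, b in edges:
        if i in (a, b):
            j = b if i == a else a
            if j != excl:
                vec = vec * m[(j, i)]
    return vec

delta = np.inf
for it in range(500):
    delta = 0.0
    for e in edges:
        a, b = e
        phi = np.exp(-E[e])  # phi[y_a, y_b]
        for (i, j), mat in (((a, b), phi.T), ((b, a), phi)):
            new = mat @ incoming(i, excl=j)    # sum-product update through factor e
            new /= new.sum()                   # normalize for numerical stability
            new = 0.5 * m[(i, j)] + 0.5 * new  # damping
            delta = max(delta, np.abs(new - m[(i, j)]).max())
            m[(i, j)] = new
    if delta < 1e-10:
        break

# Approximate variable marginals (beliefs) from the converged messages
beliefs = {i: incoming(i) / incoming(i).sum() for i in range(4)}
assert delta < 1e-6  # converged on this weakly coupled cycle
```

With strong couplings or larger cycles the same loop may oscillate, which is exactly why damping and alternative schedules matter in practice.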
Synchronous Iteration

[Figure: a 3×3 grid-structured factor graph shown before and after one synchronous iteration; all messages are updated in parallel.]
Variational Inference
Mean field methods
Mean field methods (Jordan et al., 1999), (Xing et al., 2003):
For the distribution p(y|x, w), inference is intractable.
Approximate it with a distribution q(y) from a tractable family \mathcal{Q}.

[Figure: p lies outside the set \mathcal{Q}; q is a member of \mathcal{Q} close to p.]
Mean field methods (cont)

q^* = \operatorname{argmin}_{q \in \mathcal{Q}} D_{KL}(q(y) \,\|\, p(y|x, w))

D_{KL}(q(y) \,\|\, p(y|x, w))
= \sum_{y \in \mathcal{Y}} q(y) \log \frac{q(y)}{p(y|x, w)}
= \sum_{y \in \mathcal{Y}} q(y) \log q(y) - \sum_{y \in \mathcal{Y}} q(y) \log p(y|x, w)
= -H(q) + \sum_{F \in \mathcal{F}} \sum_{y_F \in \mathcal{Y}_F} \mu_{F,y_F}(q) E_F(y_F; x_F, w) + \log Z(x, w),

where H(q) is the entropy of q and \mu_{F,y_F} are the marginals of q. (The form of \mu depends on \mathcal{Q}.)
Mean field methods (cont)

q^* = \operatorname{argmin}_{q \in \mathcal{Q}} D_{KL}(q(y) \,\|\, p(y|x, w))

When \mathcal{Q} is rich: q^* is close to p, and the marginals of q^* approximate the marginals of p.

Gibbs inequality:

D_{KL}(q(y) \,\|\, p(y|x, w)) \ge 0

Therefore, we have a lower bound

\log Z(x, w) \ge H(q) - \sum_{F \in \mathcal{F}} \sum_{y_F \in \mathcal{Y}_F} \mu_{F,y_F}(q) E_F(y_F; x_F, w).
Naive Mean Field

Set \mathcal{Q}: all distributions of the form

q(y) = \prod_{i \in V} q_i(y_i).

[Figure: a 3×3 grid of independent per-variable distributions q_e, q_f, q_g, q_h, q_i, q_j, q_k, q_l, q_m.]
Naive Mean Field (cont)

For q(y) = \prod_{i \in V} q_i(y_i), the marginals \mu_{F,y_F} take the form

\mu_{F,y_F}(q) = \prod_{i \in N(F)} q_i(y_i).

The entropy H(q) decomposes:

H(q) = \sum_{i \in V} H_i(q_i) = -\sum_{i \in V} \sum_{y_i \in \mathcal{Y}_i} q_i(y_i) \log q_i(y_i).
Naive Mean Field (cont)

Putting it together,

\operatorname{argmin}_{q \in \mathcal{Q}} D_{KL}(q(y) \,\|\, p(y|x, w))
= \operatorname{argmax}_{q \in \mathcal{Q}} \; H(q) - \sum_{F \in \mathcal{F}} \sum_{y_F \in \mathcal{Y}_F} \mu_{F,y_F}(q) E_F(y_F; x_F, w) - \log Z(x, w)
= \operatorname{argmax}_{q \in \mathcal{Q}} \; H(q) - \sum_{F \in \mathcal{F}} \sum_{y_F \in \mathcal{Y}_F} \mu_{F,y_F}(q) E_F(y_F; x_F, w)
= \operatorname{argmax}_{q \in \mathcal{Q}} \; -\sum_{i \in V} \sum_{y_i \in \mathcal{Y}_i} q_i(y_i) \log q_i(y_i) - \sum_{F \in \mathcal{F}} \sum_{y_F \in \mathcal{Y}_F} \Big( \prod_{i \in N(F)} q_i([y_F]_i) \Big) E_F(y_F; x_F, w).

Optimizing over \mathcal{Q} is optimizing over q_i \in \Delta_i, the probability simplices.
Naive Mean Field (cont)

\operatorname{argmax}_{q \in \mathcal{Q}} \; -\sum_{i \in V} \sum_{y_i \in \mathcal{Y}_i} q_i(y_i) \log q_i(y_i) - \sum_{F \in \mathcal{F}} \sum_{y_F \in \mathcal{Y}_F} \Big( \prod_{i \in N(F)} q_i([y_F]_i) \Big) E_F(y_F; x_F, w).

Non-concave maximization problem → hard (for general E_F and pairwise or higher-order factors).
Block coordinate ascent: closed-form update for each q_i.

[Figure: grid model in which one block q_i is updated while the neighboring distributions \hat{q}_f, \hat{q}_h, \hat{q}_j, \hat{q}_l and factor marginals \hat{\mu}_F are held fixed.]
Naive Mean Field (cont)

Closed-form update for q_i:

q_i^*(y_i) = \exp\Big( 1 - \sum_{F \in \mathcal{F},\, i \in N(F)} \; \sum_{y_F \in \mathcal{Y}_F,\, [y_F]_i = y_i} \Big( \prod_{j \in N(F) \setminus \{i\}} \hat{q}_j([y_F]_j) \Big) E_F(y_F; x_F, w) + \lambda \Big)

\lambda = -\log \sum_{y_i \in \mathcal{Y}_i} \exp\Big( 1 - \sum_{F \in \mathcal{F},\, i \in N(F)} \; \sum_{y_F \in \mathcal{Y}_F,\, [y_F]_i = y_i} \Big( \prod_{j \in N(F) \setminus \{i\}} \hat{q}_j([y_F]_j) \Big) E_F(y_F; x_F, w) \Big)

Looks scary, but is very easy to implement.
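Indeed, for a pairwise model the update is a few lines: q_i is set proportional to exp(-expected energy) under the neighbors' current distributions. The sketch below uses a toy chain model with random unary and pairwise energies (illustrative assumptions, not code from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(4)
K, n = 3, 4
# Hypothetical chain MRF: unary and pairwise energies on a chain 0-1-2-3
unary = rng.normal(size=(n, K))
pair = {(i, i + 1): rng.normal(size=(K, K)) for i in range(n - 1)}

# Naive mean field q(y) = prod_i q_i(y_i), block coordinate ascent
q = np.full((n, K), 1.0 / K)
for sweep in range(100):
    for i in range(n):
        # expected energy of each candidate state y_i under the current q_j
        s = unary[i].copy()
        if (i - 1, i) in pair:
            s += q[i - 1] @ pair[(i - 1, i)]   # sum_{y_{i-1}} q(y_{i-1}) E(y_{i-1}, y_i)
        if (i, i + 1) in pair:
            s += pair[(i, i + 1)] @ q[i + 1]   # sum_{y_{i+1}} E(y_i, y_{i+1}) q(y_{i+1})
        # closed-form update: q_i(y_i) proportional to exp(-expected energy)
        q[i] = np.exp(-(s - s.min()))          # shift for numerical stability
        q[i] /= q[i].sum()

assert np.allclose(q.sum(axis=1), 1.0) and np.all(q >= 0)
```

The constants (the "1" and the multiplier λ in the slide's formula) disappear into the normalization, which is why the implementation reduces to exponentiate-and-normalize.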
Structured Mean Field
Naive mean field approximation can be poor
Structured mean field (Saul and Jordan, 1995) uses factorial
approximations with larger tractable subgraphs
Block coordinate ascent: optimizing an entire subgraph using exact
probabilistic inference on trees
Sampling
Probabilistic inference and related tasks require the computation of

\mathbb{E}_{y \sim p(y|x,w)} [h(x, y)],

where h : \mathcal{X} \times \mathcal{Y} \to \mathbb{R} is an arbitrary but well-behaved function.

Inference: for h_{F,z_F}(x, y) = I(y_F = z_F),

\mathbb{E}_{y \sim p(y|x,w)} [h_{F,z_F}(x, y)] = p(y_F = z_F | x, w),

the marginal probability of factor F taking state z_F.

Parameter estimation: for the feature map h(x, y) = \phi(x, y), \phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d,

\mathbb{E}_{y \sim p(y|x,w)} [\phi(x, y)]

gives the “expected sufficient statistics under the model distribution”.
Monte Carlo

Sample approximation from samples y^{(1)}, y^{(2)}, \ldots:

\mathbb{E}_{y \sim p(y|x,w)}[h(x, y)] \approx \frac{1}{S} \sum_{s=1}^{S} h(x, y^{(s)}).

When the expectation exists, the law of large numbers guarantees convergence for S \to \infty.
For S independent samples, the approximation error is O(1/\sqrt{S}), independent of the dimension d.

Problem: producing exact samples y^{(s)} from p(y|x) is hard.
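A toy illustration of the Monte Carlo estimate (the distribution p and the function h are made up, and in this small discrete case we can sample exactly):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical discrete distribution over Y = {0, 1, 2}, standing in for
# p(y|x, w), and a test function h
p = np.array([0.2, 0.3, 0.5])
h = np.array([1.0, -2.0, 4.0])

exact = p @ h  # E_p[h] = 0.2*1 - 0.3*2 + 0.5*4 = 1.6

S = 200_000
samples = rng.choice(3, size=S, p=p)
estimate = h[samples].mean()

# O(1/sqrt(S)) error: here the standard error is about 0.006
assert abs(estimate - exact) < 0.05
```

The same estimator applies unchanged when the samples come from an MCMC chain; only the independence assumption behind the O(1/√S) rate weakens.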
Markov Chain Monte Carlo (MCMC)

Idea: construct a Markov chain with p(y|x) as its stationary distribution.

Here: \mathcal{Y} = \{A, B, C\}
Here: p(y) = (0.1905, 0.3571, 0.4524)

[Figure: a three-state Markov chain on A, B, C with transition probabilities 0.1, 0.25, 0.4, and 0.5 on its edges.]
Markov Chains

Definition (Finite Markov chain). Given a finite set \mathcal{Y} and a matrix P \in \mathbb{R}^{\mathcal{Y} \times \mathcal{Y}}, a random process (Z_1, Z_2, \ldots) with Z_t taking values in \mathcal{Y} is a Markov chain with transition matrix P if

p(Z_{t+1} = y^{(j)} | Z_1 = y^{(1)}, Z_2 = y^{(2)}, \ldots, Z_t = y^{(t)})
= p(Z_{t+1} = y^{(j)} | Z_t = y^{(t)})
= P_{y^{(t)}, y^{(j)}}
MCMC Simulation

Steps
1. Construct a Markov chain with stationary distribution p(y|x, w)
2. Start at y^{(0)}
3. Perform a random walk according to the Markov chain
4. After a sufficient number of steps, output the visited states y^{(t)} as samples from p(y|x, w)

Justified by the ergodic theorem.
In practice: discard a fixed number of initial samples (the “burn-in” phase) to forget the starting point.
In practice: afterwards, use between 100 and 100,000 samples.
Gibbs sampler

How to construct a suitable Markov chain?
A Metropolis-Hastings chain is almost always possible.
Special case: the Gibbs sampler (Geman and Geman, 1984):
1. Select a variable y_i,
2. Sample y_i \sim p(y_i | y_{V \setminus \{i\}}, x).
Gibbs Sampler

p(y_i | y^{(t)}_{V \setminus \{i\}}, x, w) = \frac{p(y_i, y^{(t)}_{V \setminus \{i\}} | x, w)}{\sum_{y'_i \in \mathcal{Y}_i} p(y'_i, y^{(t)}_{V \setminus \{i\}} | x, w)} = \frac{\tilde{p}(y_i, y^{(t)}_{V \setminus \{i\}} | x, w)}{\sum_{y'_i \in \mathcal{Y}_i} \tilde{p}(y'_i, y^{(t)}_{V \setminus \{i\}} | x, w)}

where \tilde{p} denotes the unnormalized distribution.
Only the factors adjacent to y_i contribute:

p(y_i | y^{(t)}_{V \setminus \{i\}}, x, w) = \frac{\prod_{F \in M(i)} \exp(-E_F(y_i, y^{(t)}_{F \setminus \{i\}}, x_F, w))}{\sum_{y'_i \in \mathcal{Y}_i} \prod_{F \in M(i)} \exp(-E_F(y'_i, y^{(t)}_{F \setminus \{i\}}, x_F, w))}
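This local conditional is all a Gibbs sampler needs. Below is a sketch for a small toy chain MRF (random pairwise energies, binary states; all choices illustrative), with the sampled marginal checked against exact enumeration:

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)
K, n = 2, 4
# Hypothetical chain MRF with pairwise energies E_i(y_i, y_{i+1})
pair = [rng.normal(size=(K, K)) for _ in range(n - 1)]

def unnorm(y):
    """Unnormalized probability p~(y) = exp(-E(y))."""
    return np.exp(-sum(pair[i][y[i], y[i + 1]] for i in range(n - 1)))

# Exact marginal p(y_0 = 1) by enumeration (only 2^4 states here)
states = list(itertools.product(range(K), repeat=n))
Z = sum(unnorm(y) for y in states)
exact = sum(unnorm(y) for y in states if y[0] == 1) / Z

# Gibbs sampling: resample each variable from its local conditional
y = [0] * n
sweeps, burn_in, hits = 20000, 1000, 0
for t in range(sweeps):
    for i in range(n):
        s = np.zeros(K)                  # energies of each candidate state y_i
        if i > 0:
            s += pair[i - 1][y[i - 1], :]
        if i < n - 1:
            s += pair[i][:, y[i + 1]]
        cond = np.exp(-s)
        cond /= cond.sum()               # only adjacent factors matter
        y[i] = int(rng.choice(K, p=cond))
    if t >= burn_in:
        hits += (y[0] == 1)

estimate = hits / (sweeps - burn_in)
assert abs(estimate - exact) < 0.05
```

Note that the conditional never requires Z: the normalization is over the |Y_i| candidate states only, which is what makes Gibbs sampling cheap per step.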
Example: Gibbs sampler

[Figure: cumulative mean of the Gibbs estimate of p(foreground) over 3000 Gibbs sweeps, settling near 0.77.]
Example: Gibbs sampler (cont.)

[Figure: Gibbs sample means, and the average over 50 repetitions, as a function of the number of Gibbs sweeps (up to 3000).]

p(y_i = “foreground”) ≈ 0.770 ± 0.011
Example: Gibbs sampler
Estimated one unit standard deviation
0.25
0.2
Standard deviation
0.15
0.1
0.05
0
0 500 1000 1500 2000 2500 3000
Gibbs sweep
√
Standard deviation O(1/ S)
Sebastian Nowozin and Christoph H. Lampert
Part 3: Probabilistic Inference in Graphical Models
65. Belief Propagation Variational Inference Sampling Break
Sampling
Gibbs Sampler Transitions
Sebastian Nowozin and Christoph H. Lampert
Part 3: Probabilistic Inference in Graphical Models
66. Belief Propagation Variational Inference Sampling Break
Sampling
Gibbs Sampler Transitions
Sebastian Nowozin and Christoph H. Lampert
Part 3: Probabilistic Inference in Graphical Models
Block Gibbs Sampler
Extension to larger groups of variables:
1. Select a block y_I
2. Sample y_I \sim p(y_I | y_{V \setminus I}, x)
→ Tractable if sampling from the blocks is tractable.
Summary: Sampling
Two families of Monte Carlo methods
1. Markov Chain Monte Carlo (MCMC)
2. Importance Sampling
Properties
Often simple to implement, general, parallelizable
(Cannot compute partition function Z )
Can fail without any warning signs
References
(Häggström, 2000), introduction to Markov chains
(Liu, 2001), excellent Monte Carlo introduction
Coffee Break
Continuing at 10:30am
Slides available at http://www.nowozin.net/sebastian/cvpr2011tutorial/