PAWL - GPU meeting @ Warwick
Parallel Adaptive Wang–Landau Algorithm
Pierre E. Jacob
CEREMADE - Université Paris Dauphine, funded by AXA Research
GPU in Computational Statistics
January 25th, 2012
joint work with Luke Bornn (UBC), Arnaud Doucet (Oxford), Pierre Del Moral (INRIA & Université de Bordeaux), Robin J. Ryder (Dauphine)
Outline
1 Wang–Landau algorithm
2 Improvements
Automatic Binning
Adaptive proposals
Parallel Interacting Chains
3 Example: variable selection
4 Conclusion
Wang–Landau
Context
unnormalized target density π on a state space X
A kind of adaptive MCMC algorithm
It iteratively generates a sequence X_t.
The stationary distribution is not π itself.
At each iteration a different stationary distribution is targeted.
Wang–Landau
Partition the space
The state space X is cut into d bins:
X = ∪_{i=1}^{d} X_i   and   ∀ i ≠ j, X_i ∩ X_j = ∅
Goal
The generated sequence spends a desired proportion φ_i of time in each bin X_i,
within each bin X_i the sequence is asymptotically distributed according to the restriction of π to X_i.
Wang–Landau
Stationary distribution
Define the mass of π over X_i by:
ψ_i = ∫_{X_i} π(x) dx
The stationary distribution of the WL algorithm is:
π̃(x) ∝ π(x) × φ_{J(x)} / ψ_{J(x)}
where J(x) is the index such that x ∈ X_{J(x)}.
Wang–Landau
Example with a bimodal, univariate target density: π and two π̃ corresponding to different partitions. Here φ_i = 1/d.
Figure: log density against X, three panels: original density with partition lines, biased by X, biased by log density.
Wang–Landau
Plugging estimates
In practice we cannot compute ψ_i analytically. Instead we plug in estimates θ_t(i) of ψ_i/φ_i at iteration t, and define the distribution π_{θ_t} by:
π_{θ_t}(x) ∝ π(x) × 1 / θ_t(J(x))
Metropolis–Hastings
The algorithm performs a Metropolis–Hastings step targeting π_{θ_t} at iteration t, generating a new point X_t, updating θ_t, and so on.
Wang–Landau
Estimate of the bias
The update of the estimated bias θ_t(i) is done according to:
θ_t(i) ← θ_{t-1}(i) [1 + γ_t (1_{X_i}(X_t) − φ_i)]
with d the number of bins and γ_t a decreasing sequence or “step size”, e.g. γ_t = 1/t.
If 1_{X_i}(X_t) = 1 then θ_t(i) increases; otherwise θ_t(i) decreases.
Wang–Landau
The algorithm itself
1: First, ∀ i ∈ {1, . . . , d} set θ_0(i) ← 1.
2: Choose a decreasing sequence {γ_t}, typically γ_t = 1/t.
3: Sample X_0 from an initial distribution π_0.
4: for t = 1 to T do
5:   Sample X_t from P_{t-1}(X_{t-1}, ·), a MH kernel with invariant distribution π_{θ_{t-1}}(x).
6:   Update the bias: θ_t(i) ← θ_{t-1}(i)[1 + γ_t (1_{X_i}(X_t) − φ_i)].
7: end for
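As a concrete illustration, here is a minimal R sketch of this loop for a univariate target with a Gaussian random-walk proposal. The functions log_pi (log of the unnormalized target) and bin_index (mapping a state to its bin in 1..d) are hypothetical placeholders; this is a bare-bones sketch, not the PAWL package implementation.

# Minimal Wang-Landau sketch: univariate target, Gaussian random-walk proposal.
wang_landau <- function(log_pi, bin_index, d, n_iter, x0,
                        phi = rep(1 / d, d), sd_prop = 1) {
  theta <- rep(1, d)                        # step 1: theta_0(i) <- 1
  x <- x0                                   # step 3: initial state
  chain <- numeric(n_iter)
  for (t in 1:n_iter) {
    gamma_t <- 1 / t                        # step 2: deterministic schedule
    # step 5: MH move targeting pi_theta(x) = pi(x) / theta(J(x))
    y <- x + rnorm(1, sd = sd_prop)
    log_ratio <- (log_pi(y) - log(theta[bin_index(y)])) -
                 (log_pi(x) - log(theta[bin_index(x)]))
    if (log(runif(1)) < log_ratio) x <- y
    # step 6: theta_t(i) <- theta_{t-1}(i)[1 + gamma_t (1_{X_i}(X_t) - phi_i)]
    occupied <- as.numeric(seq_len(d) == bin_index(x))
    theta <- theta * (1 + gamma_t * (occupied - phi))
    chain[t] <- x
  }
  list(chain = chain, log_theta = log(theta))
}

The sketch omits practical concerns handled by PAWL, such as proposals falling outside the binned range and the stochastic schedule discussed below.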
Wang–Landau
Result
In the end we get:
a sequence X_t asymptotically following π̃,
as well as estimates θ_t(i) of ψ_i/φ_i.
Wang–Landau
Usual improvement: Flat Histogram
Wait for the FH criterion to occur before decreasing γ_t:
(FH)   max_{i=1...d} | ν_t(i)/t − φ_i | < c
where ν_t(i) = Σ_{k=1}^{t} 1_{X_i}(X_k) and c > 0.
WL with stochastic schedule
Let κ_t be the number of times FH has been reached by iteration t. Use γ_{κ_t} at iteration t instead of γ_t. If FH is reached, reset ν_t(i) to 0.
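A small R sketch of this criterion and of the stochastic-schedule bookkeeping; the variable names are illustrative, not the PAWL package API.

# Flat Histogram check: are all bin occupation frequencies within c of phi?
# nu: occupation counts since the last reset; t: iterations since that reset.
flat_histogram <- function(nu, t, phi, c = 0.1) {
  max(abs(nu / t - phi)) < c
}

# Inside the WL loop, whenever flat_histogram(...) is TRUE one would do:
#   kappa <- kappa + 1    # move to the next, smaller step size gamma[kappa]
#   nu[] <- 0             # reset the occupation counts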
Wang–Landau
Theoretical Understanding of WL with deterministic schedule
The schedule γ_t decreases at each iteration, hence θ_t converges, hence P_t(·, ·) converges . . . ≈ “diminishing adaptation”.
Theoretical Understanding of WL with stochastic schedule
Flat Histogram is reached in finite time for any γ, φ, c if one uses the following update:
log θ_t(i) ← log θ_{t-1}(i) + γ (1_{X_i}(X_t) − φ_i)
instead of
θ_t(i) ← θ_{t-1}(i) [1 + γ (1_{X_i}(X_t) − φ_i)]
Automatic Binning
Maintain some kind of uniformity within bins. If a bin is non-uniform, split it.
Figure: histograms of the log density within a bin, (a) before the split and (b) after the split.
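One way such a split could be implemented, sketched in R: cut a bin at its midpoint whenever the points currently inside it concentrate too heavily on one side. The imbalance rule, the 0.9 threshold and the names are illustrative assumptions, not the exact PAWL criterion.

# bins: sorted vector of d+1 bin boundaries (here along log pi(x));
# values_in_bin: values of the recent states that fall in bin i.
maybe_split_bin <- function(bins, i, values_in_bin, imbalance = 0.9) {
  mid <- (bins[i] + bins[i + 1]) / 2
  frac_left <- mean(values_in_bin < mid)
  if (max(frac_left, 1 - frac_left) > imbalance) {
    bins <- sort(c(bins, mid))   # insert the midpoint as a new boundary
  }
  bins
}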
Adaptive proposals
Target a specific acceptance rate:
σ_{t+1} = σ_t + ρ_t (2 × 1_{(A > 0.234)} − 1)
Or use the empirical covariance of the already-generated chain:
Σ_t = δ × Cov(X_1, . . . , X_t)
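Both rules take a few lines of R. Below, acc_rate is a recent empirical acceptance rate, chain is the t-by-p matrix of past states, and the default delta = 2.38^2/p is one common choice; these names and defaults are assumptions for the sketch.

# Adaptation of the proposal scale towards a 0.234 acceptance rate.
adapt_scale <- function(sigma, acc_rate, rho_t) {
  sigma + rho_t * (2 * (acc_rate > 0.234) - 1)
}

# Proposal covariance proportional to the empirical covariance of the chain so far.
adapt_cov <- function(chain, delta = 2.38^2 / ncol(chain)) {
  delta * cov(chain)
}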
Parallel Interacting Chains
N chains (X_t^(1), . . . , X_t^(N)) instead of one,
targeting the same biased distribution π_{θ_t} at iteration t,
sharing the same estimated bias θ_t at iteration t.
The update of the estimated bias becomes:
log θ_t(i) ← log θ_{t-1}(i) + γ_{κ_t} ( (1/N) Σ_{j=1}^{N} 1_{X_i}(X_t^(j)) − φ_i )
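A vectorised R sketch of this interacting update, assuming x is the length-N vector of current chain states and bin_index is the same hypothetical helper as before.

# Interacting update of the shared log bias from N chains at once.
update_log_theta <- function(log_theta, x, bin_index, d, phi, gamma_kappa) {
  bins <- sapply(x, bin_index)                      # bin of each of the N chains
  props <- tabulate(bins, nbins = d) / length(x)    # occupation proportions
  log_theta + gamma_kappa * (props - phi)
}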
Parallel Interacting Chains
How “parallel” is PAWL?
The algorithm’s additional cost compared to independent parallel MCMC chains lies in:
getting the proportions (1/N) Σ_{j=1}^{N} 1_{X_i}(X_t^(j)),
updating (θ_t(1), . . . , θ_t(d)).
Parallel Interacting Chains
Example: Normal distribution
Figure: histogram of the binned coordinate.
Parallel Interacting Chains
Reaching Flat Histogram
Figure: number of times the FH criterion has been reached (#FH) against iterations, for N = 1, N = 10 and N = 100.
Parallel Interacting Chains
Stabilization of the log penalties
Figure: log θ_t against t, for N = 1.
Figure: log θ_t against t, for N = 10.
Figure: log θ_t against t, for N = 100.
Parallel Interacting Chains
Multiple effects of parallel chains
log θ_t(i) ← log θ_{t-1}(i) + γ_{κ_t} ( (1/N) Σ_{j=1}^{N} 1_{X_i}(X_t^(j)) − φ_i )
FH is reached more often when N increases, hence γ_{κ_t} decreases more quickly;
log θ_t tends to vary much less when N increases, even for a fixed value of γ.
Variable selection
Settings
Pollution data as in McDonald & Schwing (1973). For 60 metropolitan areas:
15 possible explanatory variables (including precipitation, population per household, . . . ), denoted by X,
the response variable Y is the age-adjusted mortality rate.
This leads to 2^15 = 32,768 possible models to explain the data.
Variable selection
Introduce
γ ∈ {0, 1}^p the “variable selector”,
q_γ the number of variables included in model γ,
g some large value (g-prior, see Zellner 1986, Marin & Robert 2007).
Posterior distribution
π(γ | y, X) ∝ (g + 1)^{−(q_γ+1)/2} [ y^T y − g/(g+1) × y^T X_γ (X_γ^T X_γ)^{−1} X_γ^T y ]^{−n/2}
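For illustration, the unnormalised log posterior of a model γ can be computed directly from this formula. The sketch below assumes y is the response vector, X the n-by-p design matrix, gamma a logical inclusion vector, and g = n as a default value; it is only a sketch of the formula, not the code used in the talk.

# Unnormalised log posterior of model gamma under Zellner's g-prior.
log_post_gamma <- function(gamma, y, X, g = length(y)) {
  n <- length(y)
  q <- sum(gamma)
  quad <- sum(y^2)                          # y'y for the empty model
  if (q > 0) {
    Xg <- X[, gamma, drop = FALSE]
    fit <- Xg %*% solve(crossprod(Xg), crossprod(Xg, y))
    quad <- sum(y^2) - (g / (g + 1)) * sum(y * fit)
  }
  -(q + 1) / 2 * log(g + 1) - (n / 2) * log(quad)
}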
Variable selection
Most naive MH algorithm
The proposal flips a randomly chosen variable on/off at each iteration.
Binning
Along values of log π(x), found with a preliminary exploration, in 20 bins.
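A sketch of this flip move in R, reusing the hypothetical log_post_gamma above; in the Wang–Landau version the acceptance ratio would additionally include the bin-penalty ratio θ_t(J(·)), as in the earlier algorithm sketch.

# One Metropolis-Hastings step with the naive flip proposal on gamma.
flip_mh_step <- function(gamma, y, X) {
  proposal <- gamma
  j <- sample(length(gamma), 1)     # pick one variable uniformly at random
  proposal[j] <- !proposal[j]       # flip it on or off
  log_ratio <- log_post_gamma(proposal, y, X) - log_post_gamma(gamma, y, X)
  if (log(runif(1)) < log_ratio) proposal else gamma
}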
Variable selection
Figure: log(θ) against iteration, for N = 1, N = 10 and N = 100. Each run took 2 minutes (+/− 5 seconds). Dotted lines show the real ψ.
Variable selection
Figure: model saturation q_γ/p (mean and 95% interval) along iterations, for N = 100; panels: Wang–Landau, and Metropolis–Hastings at temperatures 1, 10 and 100.
Conclusion
Automatic binning but. . .
We still have to define a range of plausible (or “interesting”) values.
Parallel Chains
It seems reasonable to use more than N = 1 chain, with or without GPUs. There is no theoretical validation of this yet. What is the optimal N for a given computational effort?
Need for a stochastic schedule?
It seems that using a large N makes the use, and hence the choice, of γ_t irrelevant.
Would you like to know more?
Article: An Adaptive Interacting Wang–Landau Algorithm for Automatic Density Exploration, with L. Bornn, P. Del Moral, A. Doucet.
Article: The Wang–Landau algorithm reaches the Flat Histogram criterion in finite time, with R. Ryder.
Software: PAWL, available on CRAN:
install.packages("PAWL")
References:
F. Wang, D. Landau, Physical Review E, 64(5):56101
Y. Atchadé, J. Liu, Statistica Sinica, 20:209-233