PAWL - GPU meeting @ Warwick
Parallel Adaptive Wang–Landau Algorithm
Pierre E. Jacob
CEREMADE - Université Paris Dauphine, funded by AXA Research
GPU in Computational Statistics
January 25th, 2012
joint work with Luke Bornn (UBC), Arnaud Doucet (Oxford), Pierre Del Moral (INRIA & Université de Bordeaux), Robin J. Ryder (Dauphine)
Outline
1 Wang–Landau algorithm
2 Improvements
Automatic Binning
Adaptive proposals
Parallel Interacting Chains
3 Example: variable selection
4 Conclusion
Wang–Landau
Context
unnormalized target density π on a state space X
A kind of adaptive MCMC algorithm
It iteratively generates a sequence X_t.
The stationary distribution is not π itself.
At each iteration a different stationary distribution is targeted.
Wang–Landau
Partition the space
The state space X is cut into d bins:
X = ∪_{i=1}^{d} X_i   and   ∀ i ≠ j, X_i ∩ X_j = ∅
Goal
The generated sequence spends a desired proportion φ_i of time in each bin X_i,
within each bin X_i the sequence is asymptotically distributed according to the restriction of π to X_i.
Wang–Landau
Stationary distribution
Define the mass of π over X_i by:
ψ_i = ∫_{X_i} π(x) dx
The stationary distribution of the WL algorithm is:
π̃(x) ∝ π(x) × φ_{J(x)} / ψ_{J(x)}
where J(x) is the index such that x ∈ X_{J(x)}.
Wang–Landau
Example with a bimodal, univariate target density: π and two π̃ corresponding to different partitions. Here φ_i = 1/d.
Figure: log density against X, three panels: original density with partition lines, biased by X, biased by log density.
Wang–Landau
Plugging estimates
In practice we cannot compute ψ_i analytically. Instead we plug in estimates θ_t(i) of ψ_i/φ_i at iteration t, and define the distribution π_{θ_t} by:
π_{θ_t}(x) ∝ π(x) × 1 / θ_t(J(x))
Metropolis–Hastings
The algorithm performs a Metropolis–Hastings step targeting π_{θ_t} at iteration t, generating a new point X_t, updating θ_t, and so on.
Wang–Landau
Estimate of the bias
The update of the estimated bias θ_t(i) is done according to:
θ_t(i) ← θ_{t-1}(i) [1 + γ_t (1_{X_i}(X_t) − φ_i)]
with d the number of bins and γ_t a decreasing sequence or “step size”, e.g. γ_t = 1/t.
If 1_{X_i}(X_t) = 1 then θ_t(i) increases; otherwise θ_t(i) decreases.
Wang–Landau
The algorithm itself
1: First, ∀ i ∈ {1, . . . , d} set θ_0(i) ← 1.
2: Choose a decreasing sequence {γ_t}, typically γ_t = 1/t.
3: Sample X_0 from an initial distribution π_0.
4: for t = 1 to T do
5:   Sample X_t from P_{t-1}(X_{t-1}, ·), a MH kernel with invariant distribution π_{θ_{t-1}}(x).
6:   Update the bias: θ_t(i) ← θ_{t-1}(i)[1 + γ_t (1_{X_i}(X_t) − φ_i)].
7: end for
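As a concrete illustration, here is a minimal R sketch of this loop for a univariate target with a Gaussian random-walk proposal. The functions log_pi (log of the unnormalized target) and bin_index (mapping a state to its bin in 1..d) are hypothetical placeholders; this is a bare-bones sketch, not the PAWL package implementation.

# Minimal Wang-Landau sketch: univariate target, Gaussian random-walk proposal.
wang_landau <- function(log_pi, bin_index, d, n_iter, x0,
                        phi = rep(1 / d, d), sd_prop = 1) {
  theta <- rep(1, d)                        # step 1: theta_0(i) <- 1
  x <- x0                                   # step 3: initial state
  chain <- numeric(n_iter)
  for (t in 1:n_iter) {
    gamma_t <- 1 / t                        # step 2: deterministic schedule
    # step 5: MH move targeting pi_theta(x) = pi(x) / theta(J(x))
    y <- x + rnorm(1, sd = sd_prop)
    log_ratio <- (log_pi(y) - log(theta[bin_index(y)])) -
                 (log_pi(x) - log(theta[bin_index(x)]))
    if (log(runif(1)) < log_ratio) x <- y
    # step 6: theta_t(i) <- theta_{t-1}(i)[1 + gamma_t (1_{X_i}(X_t) - phi_i)]
    occupied <- as.numeric(seq_len(d) == bin_index(x))
    theta <- theta * (1 + gamma_t * (occupied - phi))
    chain[t] <- x
  }
  list(chain = chain, log_theta = log(theta))
}

The sketch omits practical concerns handled by PAWL, such as proposals falling outside the binned range and the stochastic schedule discussed below.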
Wang–Landau
Result
In the end we get:
a sequence X_t asymptotically following π̃,
as well as estimates θ_t(i) of ψ_i/φ_i.
Wang–Landau
Usual improvement: Flat Histogram
Wait for the FH criterion to occur before decreasing γ_t:
(FH)   max_{i=1...d} | ν_t(i)/t − φ_i | < c
where ν_t(i) = Σ_{k=1}^{t} 1_{X_i}(X_k) and c > 0.
WL with stochastic schedule
Let κ_t be the number of times FH has been reached by iteration t. Use γ_{κ_t} at iteration t instead of γ_t. If FH is reached, reset ν_t(i) to 0.
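A small R sketch of this criterion and of the stochastic-schedule bookkeeping; the variable names are illustrative, not the PAWL package API.

# Flat Histogram check: are all bin occupation frequencies within c of phi?
# nu: occupation counts since the last reset; t: iterations since that reset.
flat_histogram <- function(nu, t, phi, c = 0.1) {
  max(abs(nu / t - phi)) < c
}

# Inside the WL loop, whenever flat_histogram(...) is TRUE one would do:
#   kappa <- kappa + 1    # move to the next, smaller step size gamma[kappa]
#   nu[] <- 0             # reset the occupation counts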
Wang–Landau
Theoretical Understanding of WL with deterministic schedule
The schedule γ_t decreases at each iteration, hence θ_t converges, hence P_t(·, ·) converges . . . ≈ “diminishing adaptation”.
Theoretical Understanding of WL with stochastic schedule
Flat Histogram is reached in finite time for any γ, φ, c if one uses the following update:
log θ_t(i) ← log θ_{t-1}(i) + γ (1_{X_i}(X_t) − φ_i)
instead of
θ_t(i) ← θ_{t-1}(i) [1 + γ (1_{X_i}(X_t) − φ_i)]
Automatic Binning
Maintain some kind of uniformity within bins. If a bin is non-uniform, split it.
Figure: histograms of the log density within a bin, (a) before the split and (b) after the split.
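One way such a split could be implemented, sketched in R: cut a bin at its midpoint whenever the points currently inside it concentrate too heavily on one side. The imbalance rule, the 0.9 threshold and the names are illustrative assumptions, not the exact PAWL criterion.

# bins: sorted vector of d+1 bin boundaries (here along log pi(x));
# values_in_bin: values of the recent states that fall in bin i.
maybe_split_bin <- function(bins, i, values_in_bin, imbalance = 0.9) {
  mid <- (bins[i] + bins[i + 1]) / 2
  frac_left <- mean(values_in_bin < mid)
  if (max(frac_left, 1 - frac_left) > imbalance) {
    bins <- sort(c(bins, mid))   # insert the midpoint as a new boundary
  }
  bins
}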
Adaptive proposals
Target a specific acceptance rate:
σ_{t+1} = σ_t + ρ_t (2 × 1_{(A > 0.234)} − 1)
Or use the empirical covariance of the already-generated chain:
Σ_t = δ × Cov(X_1, . . . , X_t)
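Both rules take a few lines of R. Below, acc_rate is a recent empirical acceptance rate, chain is the t-by-p matrix of past states, and the default delta = 2.38^2/p is one common choice; these names and defaults are assumptions for the sketch.

# Adaptation of the proposal scale towards a 0.234 acceptance rate.
adapt_scale <- function(sigma, acc_rate, rho_t) {
  sigma + rho_t * (2 * (acc_rate > 0.234) - 1)
}

# Proposal covariance proportional to the empirical covariance of the chain so far.
adapt_cov <- function(chain, delta = 2.38^2 / ncol(chain)) {
  delta * cov(chain)
}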
Parallel Interacting Chains
N chains (X_t^(1), . . . , X_t^(N)) instead of one,
targeting the same biased distribution π_{θ_t} at iteration t,
sharing the same estimated bias θ_t at iteration t.
The update of the estimated bias becomes:
log θ_t(i) ← log θ_{t-1}(i) + γ_{κ_t} ( (1/N) Σ_{j=1}^{N} 1_{X_i}(X_t^(j)) − φ_i )
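A vectorised R sketch of this interacting update, assuming x is the length-N vector of current chain states and bin_index is the same hypothetical helper as before.

# Interacting update of the shared log bias from N chains at once.
update_log_theta <- function(log_theta, x, bin_index, d, phi, gamma_kappa) {
  bins <- sapply(x, bin_index)                      # bin of each of the N chains
  props <- tabulate(bins, nbins = d) / length(x)    # occupation proportions
  log_theta + gamma_kappa * (props - phi)
}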
Parallel Interacting Chains
How “parallel” is PAWL?
The algorithm’s additional cost compared to independent parallel MCMC chains lies in:
getting the proportions (1/N) Σ_{j=1}^{N} 1_{X_i}(X_t^(j)),
updating (θ_t(1), . . . , θ_t(d)).
Parallel Interacting Chains
Example: Normal distribution
Figure: histogram of the binned coordinate.
Parallel Interacting Chains
Reaching Flat Histogram
Figure: number of times the FH criterion has been reached (#FH) against iterations, for N = 1, N = 10 and N = 100.
Parallel Interacting Chains
Stabilization of the log penalties
Figure: log θ_t against t, for N = 1.
Figure: log θ_t against t, for N = 10.
Figure: log θ_t against t, for N = 100.
Parallel Interacting Chains
Multiple effects of parallel chains
log θ_t(i) ← log θ_{t-1}(i) + γ_{κ_t} ( (1/N) Σ_{j=1}^{N} 1_{X_i}(X_t^(j)) − φ_i )
FH is reached more often when N increases, hence γ_{κ_t} decreases more quickly;
log θ_t tends to vary much less when N increases, even for a fixed value of γ.
Variable selection
Settings
Pollution data as in McDonald & Schwing (1973). For 60 metropolitan areas:
15 possible explanatory variables (including precipitation, population per household, . . . ), denoted by X,
the response variable Y is the age-adjusted mortality rate.
This leads to 2^15 = 32,768 possible models to explain the data.
Variable selection
Introduce
γ ∈ {0, 1}^p the “variable selector”,
q_γ the number of variables included in model γ,
g some large value (g-prior, see Zellner 1986, Marin & Robert 2007).
Posterior distribution
π(γ | y, X) ∝ (g + 1)^{−(q_γ+1)/2} [ y^T y − g/(g+1) × y^T X_γ (X_γ^T X_γ)^{−1} X_γ^T y ]^{−n/2}
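For illustration, the unnormalised log posterior of a model γ can be computed directly from this formula. The sketch below assumes y is the response vector, X the n-by-p design matrix, gamma a logical inclusion vector, and g = n as a default value; it is only a sketch of the formula, not the code used in the talk.

# Unnormalised log posterior of model gamma under Zellner's g-prior.
log_post_gamma <- function(gamma, y, X, g = length(y)) {
  n <- length(y)
  q <- sum(gamma)
  quad <- sum(y^2)                          # y'y for the empty model
  if (q > 0) {
    Xg <- X[, gamma, drop = FALSE]
    fit <- Xg %*% solve(crossprod(Xg), crossprod(Xg, y))
    quad <- sum(y^2) - (g / (g + 1)) * sum(y * fit)
  }
  -(q + 1) / 2 * log(g + 1) - (n / 2) * log(quad)
}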
Variable selection
Most naive MH algorithm
The proposal flips a randomly chosen variable on/off at each iteration.
Binning
Along values of log π(x), found with a preliminary exploration, in 20 bins.
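A sketch of this flip move in R, reusing the hypothetical log_post_gamma above; in the Wang–Landau version the acceptance ratio would additionally include the bin-penalty ratio θ_t(J(·)), as in the earlier algorithm sketch.

# One Metropolis-Hastings step with the naive flip proposal on gamma.
flip_mh_step <- function(gamma, y, X) {
  proposal <- gamma
  j <- sample(length(gamma), 1)     # pick one variable uniformly at random
  proposal[j] <- !proposal[j]       # flip it on or off
  log_ratio <- log_post_gamma(proposal, y, X) - log_post_gamma(gamma, y, X)
  if (log(runif(1)) < log_ratio) proposal else gamma
}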
Variable selection
Figure: log(θ) against iteration, for N = 1, N = 10 and N = 100. Each run took 2 minutes (+/− 5 seconds). Dotted lines show the real ψ.
Variable selection
Figure: model saturation q_γ/p (mean and 95% interval) along iterations, for N = 100; panels: Wang–Landau, and Metropolis–Hastings at temperatures 1, 10 and 100.
Conclusion
Automatic binning but. . .
We still have to define a range of plausible (or “interesting”) values.
Parallel Chains
It seems reasonable to use more than N = 1 chain, with or without GPUs. There is no theoretical validation of this yet. What is the optimal N for a given computational effort?
Need for a stochastic schedule?
It seems that using a large N makes the use, and hence the choice, of γ_t irrelevant.
Would you like to know more?
Article: An Adaptive Interacting Wang–Landau Algorithm for Automatic Density Exploration, with L. Bornn, P. Del Moral, A. Doucet.
Article: The Wang–Landau algorithm reaches the Flat Histogram criterion in finite time, with R. Ryder.
Software: PAWL, available on CRAN:
install.packages("PAWL")
References:
F. Wang, D. Landau, Physical Review E, 64(5):56101
Y. Atchadé, J. Liu, Statistica Sinica, 20:209-233