     Parallel Adaptive Wang–Landau Algorithm

                                 Pierre E. Jacob

      CEREMADE - Université Paris Dauphine, funded by AXA Research


                  GPU in Computational Statistics
                       January 25th, 2012


          joint work with Luke Bornn (UBC), Arnaud Doucet (Oxford),
Pierre Del Moral (INRIA & Université de Bordeaux), Robin J. Ryder (Dauphine)




Outline


  1   Wang–Landau algorithm

  2   Improvements
        Automatic Binning
        Adaptive proposals
        Parallel Interacting Chains

  3   Example: variable selection

  4   Conclusion



Wang–Landau


 Context
     unnormalized target density π on a state space X

 A kind of adaptive MCMC algorithm
      It iteratively generates a sequence Xt .
      The stationary distribution is not π itself.
      At each iteration a different stationary distribution is targeted.




Wang–Landau

 Partition the space
 The state space X is cut into d bins:

     X = ∪_{i=1}^{d} Xi    and    ∀ i ≠ j, Xi ∩ Xj = ∅


 Goal
        The generated sequence spends a desired proportion φi of
        time in each bin Xi ,
        within each bin Xi the sequence is asymptotically distributed
        according to the restriction of π to Xi .


Wang–Landau


 Stationary distribution
 Define the mass of π over Xi by:

     ψi = ∫_{Xi} π(x) dx

 The stationary distribution of the WL algorithm is:

     π̃(x) ∝ π(x) × φ_{J(x)} / ψ_{J(x)}

 where J(x) is the index such that x ∈ X_{J(x)}.
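
 For concreteness, a minimal R sketch (not the PAWL package code) of evaluating
 log π̃ for a univariate target; “bins”, “phi” and “psi” are assumed names for
 the bin boundaries, the desired proportions and the bin masses.

     ## evaluate log pi-tilde(x) = log pi(x) + log(phi_J(x)) - log(psi_J(x))
     log_pi_tilde <- function(x, logtarget, bins, phi, psi) {
       j <- findInterval(x, bins)    # J(x): index of the bin containing x
       logtarget(x) + log(phi[j]) - log(psi[j])
     }

 In practice the ψi are unknown, which is why the algorithm plugs in the
 estimates θt introduced below.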



Wang–Landau

 Example with a bimodal, univariate target density: π and two π̃
 corresponding to different partitions. Here φi = 1/d.


 [Figure: log density against X, three panels: the original density with
 partition lines, the density biased by X, and the density biased by log
 density.]




Wang–Landau


 Plugging estimates
 In practice we cannot compute ψi analytically. Instead we plug in
 estimates θt (i) of ψi /φi at iteration t, and define the distribution
 πθt by:
     πθt(x) ∝ π(x) × 1 / θt(J(x))

 Metropolis–Hastings
 The algorithm performs a Metropolis–Hastings step targeting πθt at
 iteration t, generating a new point Xt , then updating θt . . .



Wang–Landau


 Estimate of the bias
 The update of the estimated bias θt(i) is done according to:

     θt(i) ← θt−1(i) [1 + γt (1_{Xi}(Xt) − φi)]

 with d the number of bins and γt a decreasing sequence or “step
 size”, e.g. γt = 1/t.
     If 1_{Xi}(Xt) = 1 then θt(i) increases;
     otherwise θt(i) decreases.




Wang–Landau


 The algorithm itself
  1:   First, ∀i ∈ {1, . . . , d} set θ0 (i) ← 1.
  2:   Choose a decreasing sequence {γt }, typically γt = 1/t.
  3:   Sample X0 from an initial distribution π0 .
  4:   for t = 1 to T do
  5:      Sample Xt from Pt−1 (Xt−1 , ·), a MH kernel with invariant
          distribution πθt−1 (x).
  6:      Update the bias: θt(i) ← θt−1(i) [1 + γt (1_{Xi}(Xt) − φi)].
  7:   end for
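
 As an illustration, a minimal R sketch of this loop for a univariate bimodal
 target with a random-walk proposal; the target, bin boundaries, proposal scale
 and number of iterations are illustrative choices, not the PAWL defaults.

     logtarget <- function(x) log(0.5 * dnorm(x, -2, 1) + 0.5 * dnorm(x, 8, 1))
     bins <- c(-Inf, 0, 5)                  # boundaries of d = 3 bins along x
     d <- length(bins)
     phi <- rep(1 / d, d)                   # desired proportions phi_i
     niter <- 50000
     theta <- rep(1, d)                     # step 1: theta_0(i) <- 1
     x <- 0                                 # step 3: X_0
     j <- findInterval(x, bins)             # bin index J(X_0)
     for (t in 1:niter) {
       gamma_t <- 1 / t                     # step 2: deterministic schedule
       ## step 5: MH move targeting pi_theta(x) = pi(x) / theta(J(x))
       xprop <- x + rnorm(1, sd = 2)
       jprop <- findInterval(xprop, bins)
       logratio <- (logtarget(xprop) - log(theta[jprop])) -
                   (logtarget(x) - log(theta[j]))
       if (log(runif(1)) < logratio) { x <- xprop; j <- jprop }
       ## step 6: theta_t(i) <- theta_{t-1}(i) [1 + gamma_t (1_{X_i}(X_t) - phi_i)]
       theta <- theta * (1 + gamma_t * ((seq_len(d) == j) - phi))
     }

 With N interacting chains, only the indicator in the last update is replaced
 by the proportion of chains falling in each bin, as described later.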




Wang–Landau




 Result
 In the end we get:
     a sequence Xt asymptotically following π̃,
     as well as estimates θt(i) of ψi/φi.




Wang–Landau


 Usual improvement: Flat Histogram
 Wait for the FH criterion to occur before decreasing γt .

     (FH)     max_{i=1...d} | νt(i)/t − φi | < c

 where νt(i) = Σ_{k=1}^{t} 1_{Xi}(Xk) and c > 0.

 WL with stochastic schedule
 Let κt be the number of times FH has been reached by iteration t. Use
 γκt at iteration t instead of γt . When FH is reached, reset νt(i) to 0.
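
 A minimal R sketch of the FH check (assumed names, not the PAWL interface);
 “nu” holds the bin counts accumulated since the last reset, “steps” the number
 of iterations since that reset, and “tol” plays the role of c.

     flat_histogram <- function(nu, steps, phi, tol) {
       max(abs(nu / steps - phi)) < tol    # the (FH) criterion
     }

 When the criterion is met, κt is incremented, the counts νt(i) are reset to
 zero, and γκt replaces γt in the bias update.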



Wang–Landau

 Theoretical Understanding of WL with deterministic schedule
 The schedule γt decreases at each iteration, hence θt converges,
 hence Pt (·, ·) converges . . . ≈ “diminishing adaptation”.

 Theoretical Understanding of WL with stochastic schedule
 Flat Histogram is reached in finite time for any γ, φ, c if one uses
 the following update:

     log θt(i) ← log θt−1(i) + γ (1_{Xi}(Xt) − φi)

 instead of

     θt(i) ← θt−1(i) [1 + γ (1_{Xi}(Xt) − φi)]


Automatic Binning

  Maintain some kind of uniformity within bins. If non-uniform, split
  the bin.
  [Figure: histograms of frequency against log density within a bin,
  (a) before the split and (b) after the split.]


Adaptive proposals



  Target a specific acceptance rate:

      σt+1 = σt + ρt (2 · 1(A > 0.234) − 1)

  Or use the empirical covariance of the already-generated chain:

      Σt = δ × Cov(X1 , . . . , Xt)
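
  A minimal R sketch of these two rules; “accept_rate”, “rho”, “chain” and the
  default δ = 2.38²/dim are assumed names and values for illustration (only the
  0.234 target comes from the slide).

      adapt_scale <- function(sigma, accept_rate, rho) {
        ## sigma_{t+1} = sigma_t + rho_t (2 * 1(A > 0.234) - 1)
        sigma + rho * (2 * (accept_rate > 0.234) - 1)
      }
      adapt_cov <- function(chain, delta = 2.38^2 / ncol(chain)) {
        ## Sigma_t = delta * Cov(X_1, ..., X_t); chain is a t x dim matrix
        delta * cov(chain)
      }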




Parallel Interacting Chains


  N chains (Xt^(1), . . . , Xt^(N)) instead of one,
       targeting the same biased distribution πθt at iteration t,
       sharing the same estimated bias θt at iteration t.

  The update of the estimated bias becomes:

      log θt(i) ← log θt−1(i) + γκt ( (1/N) Σ_{j=1}^{N} 1_{Xi}(Xt^(j)) − φi )
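
  A minimal R sketch of this interacting update; “xs” (the N current states,
  here univariate), “bins”, “phi”, “gamma_kt” and “log_theta” are assumed names.

      update_log_theta <- function(log_theta, xs, bins, phi, gamma_kt) {
        d <- length(phi)
        js <- findInterval(xs, bins)                   # bin index of each chain
        props <- tabulate(js, nbins = d) / length(xs)  # (1/N) sum_j 1_{X_i}(X_t^(j))
        log_theta + gamma_kt * (props - phi)           # one shared update for all chains
      }

  Each chain then performs its own MH move against the same πθt, which is the
  part that parallelizes naturally, e.g. on a GPU.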




Parallel Interacting Chains




  How “parallel” is PAWL?
  The algorithm’s additional cost compared to independent parallel
  MCMC chains lies in:
      getting the proportions (1/N) Σ_{j=1}^{N} 1_{Xi}(Xt^(j)),
      updating (θt(1), . . . , θt(d)).




Parallel Interacting Chains

  Example: Normal distribution
  [Figure: histogram of the binned coordinate (density against the
  binned coordinate).]




Parallel Interacting Chains

  Reaching Flat Histogram


  [Figure: number of times the FH criterion has been reached (#FH)
  against iterations, for N = 1, N = 10 and N = 100.]


Parallel Interacting Chains

  Stabilization of the log penalties


                              Figure: log θt against t, for N = 1

Parallel Interacting Chains

  Stabilization of the log penalties


                             Figure: log θt against t, for N = 10

Parallel Interacting Chains

  Stabilization of the log penalties


                            Figure: log θt against t, for N = 100

Parallel Interacting Chains


  Multiple effects of parallel chains

      log θt(i) ← log θt−1(i) + γκt ( (1/N) Σ_{j=1}^{N} 1_{Xi}(Xt^(j)) − φi )

      FH is reached more often when N increases, hence γκt
      decreases more quickly;
      log θt tends to vary much less when N increases, even for a
      fixed value of γ.




Variable selection



  Settings
  Pollution data as in McDonald & Schwing (1973). For 60
  metropolitan areas:
      15 possible explanatory variables (including precipitation,
      population per household, . . . ) (denoted by X ),
      the response variable Y is the age-adjusted mortality rate.
  This leads to 2^15 = 32,768 possible models to explain the data.




Variable selection

  Introduce
       γ ∈ {0, 1}^p the “variable selector”,
       qγ represents the number of variables in model “γ”,
       g some large value (g-prior, see Zellner 1986, Marin & Robert
       2007).

  Posterior distribution

      π(γ | y, X) ∝ (g + 1)^{−(qγ+1)/2} ×
                    [ y^T y − (g/(g+1)) y^T Xγ (Xγ^T Xγ)^{−1} Xγ^T y ]^{−n/2}
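
   For illustration, a minimal R sketch of this log posterior, up to an additive
   constant; “gamma” (a logical vector of length p), “X” (the n × p design
   matrix), “y” and “g” are assumed names, not the package interface.

       log_post <- function(gamma, y, X, g) {
         n <- length(y)
         q <- sum(gamma)                      # q_gamma: number of included variables
         bracket <- sum(y^2)                  # y'y
         if (q > 0) {
           Xg <- X[, gamma, drop = FALSE]
           proj <- Xg %*% solve(crossprod(Xg), crossprod(Xg, y))  # X_g (X_g'X_g)^{-1} X_g'y
           bracket <- bracket - (g / (g + 1)) * sum(y * proj)
         }
         -(q + 1) / 2 * log(g + 1) - n / 2 * log(bracket)
       }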


Variable selection



  Most naive MH algorithm
  The proposal flips one variable on/off at random at each iteration
  (see the sketch below).

  Binning
  Along values of log π(x), over a range found by a preliminary
  exploration, split into 20 bins.
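
  A minimal R sketch of the flip proposal; “gamma” is a logical inclusion vector
  of length p (an assumed representation).

      flip_proposal <- function(gamma) {
        k <- sample(length(gamma), 1)   # pick one of the p variables at random
        gamma[k] <- !gamma[k]           # switch it on / off
        gamma
      }

  The proposal is symmetric, so no Hastings correction is needed in the
  acceptance ratio.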




Variable selection

   [Figure: Log(θ) against iteration, one panel per N ∈ {1, 10, 100}.]



  Figure: Each run took 2 minutes (+/- 5 seconds). Dotted lines show the
  real ψ.



Variable selection
   [Figure: Model Saturation against Iteration, four panels: Wang–Landau,
   and Metropolis–Hastings with Temp = 1, 10 and 100.]




    Figure: qγ /p (mean and 95% interval) along iterations, for N = 100.

Conclusion

  Automatic binning, but. . .
  We still have to define a range of plausible (or “interesting”)
  values.

  Parallel Chains
  It seems reasonable to use more than N = 1 chain, with or without
  GPUs. No theoretical validation of this yet. What is the optimal N
  for a given computational effort?

  Need for a stochastic schedule?
  It seems that using a large N makes the use, and hence the choice,
  of γt irrelevant.

Would you like to know more?


      Article: An Adaptive Interacting Wang-Landau Algorithm for
      Automatic Density Exploration, with L. Bornn, P. Del Moral, A.
      Doucet.
      Article: The Wang-Landau algorithm reaches the Flat
      Histogram criterion in finite time, with R. Ryder.
      Software: PAWL, available on CRAN:
                         install.packages("PAWL")
  References:
      F. Wang, D. Landau, Physical Review E, 64(5):56101
      Y. Atchadé, J. Liu, Statistica Sinica, 20:209-233

