Approximate Bayesian Computation:
        how Bayesian can it be?

             Christian P. Robert
ISBA LECTURES ON BAYESIAN FOUNDATIONS
          ISBA 2012, Kyoto, Japan

        Université Paris-Dauphine, IuF, & CREST
       http://www.ceremade.dauphine.fr/~xian


                  June 21, 2012
This talk is dedicated to the memory of our dear friend
and fellow Bayesian, George Casella, 1951–2012
Outline
Introduction

Approximate Bayesian computation

ABC as an inference machine
Approximate Bayesian computation



   Introduction
       Monte Carlo basics
       simulation-based methods in Econometrics

   Approximate Bayesian computation

   ABC as an inference machine
General issue




   Given a density π known up to a normalizing constant, e.g. a
   posterior density, and an integrable function h, how can one use

      $$I_h = \int h(x)\,\pi(x)\,\mu(dx)
            = \frac{\int h(x)\,\tilde\pi(x)\,\mu(dx)}{\int \tilde\pi(x)\,\mu(dx)}$$

   when $\int h(x)\,\tilde\pi(x)\,\mu(dx)$ is intractable?
Monte Carlo basics

   Generate an iid sample x1 , . . . , xN from π and estimate Ih by

      $$\hat I^{MC}_N(h) = N^{-1} \sum_{i=1}^N h(x_i)\,.$$

   since [LLN] $\hat I^{MC}_N(h) \stackrel{\text{as}}{\longrightarrow} I_h$

   Furthermore, if $I_{h^2} = \int h^2(x)\,\pi(x)\,\mu(dx) < \infty$,

      [CLT]  $\sqrt{N}\,\bigl(\hat I^{MC}_N(h) - I_h\bigr) \leadsto \mathcal N\bigl(0,\, I\{(h - I_h)^2\}\bigr)\,.$

   Caveat
   Often impossible or inefficient to simulate directly from π
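   A minimal R illustration of the two displays above, with π taken to be the
   standard normal and h(x) = x² (so that Ih = 1); the target and sample size
   are arbitrary choices for this sketch:

    # Monte Carlo sketch: pi = N(0,1), h(x) = x^2, so I_h = 1
    N <- 1e5
    x <- rnorm(N)                       # iid sample x_1, ..., x_N from pi
    mean(x^2)                           # LLN: converges to I_h
    1.96*sd(x^2)/sqrt(N)                # CLT-based 95% error half-width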
Importance Sampling



   For a proposal distribution Q with density q,

      $$I_h = \int h(x)\,\{\pi/q\}(x)\,q(x)\,\mu(dx)\,.$$

   Principle
   Generate an iid sample x1 , . . . , xN ∼ Q and estimate Ih by

      $$\hat I^{IS}_{Q,N}(h) = N^{-1} \sum_{i=1}^N h(x_i)\,\{\pi/q\}(x_i)\,.$$
Importance Sampling (convergence)



   Then
   [LLN] $\hat I^{IS}_{Q,N}(h) \stackrel{\text{as}}{\longrightarrow} I_h$   and if $Q\bigl((h\pi/q)^2\bigr) < \infty$,

      [CLT]  $\sqrt{N}\,\bigl(\hat I^{IS}_{Q,N}(h) - I_h\bigr) \leadsto \mathcal N\bigl(0,\, Q\{(h\pi/q - I_h)^2\}\bigr)\,.$

   Caveat
   If the normalizing constant is unknown, impossible to use $\hat I^{IS}_{Q,N}(h)$

   Generic problem in Bayesian Statistics: π(θ|x) ∝ f (x|θ)π(θ).
Self-normalised importance Sampling


   Self-normalized version

      $$\hat I^{SNIS}_{Q,N}(h) = \left\{\sum_{i=1}^N \{\pi/q\}(x_i)\right\}^{-1} \sum_{i=1}^N h(x_i)\,\{\pi/q\}(x_i)\,.$$

   [LLN] $\hat I^{SNIS}_{Q,N}(h) \stackrel{\text{as}}{\longrightarrow} I_h$
   and if $I\bigl((1+h^2)(\pi/q)\bigr) < \infty$,

      [CLT]  $\sqrt{N}\,\bigl(\hat I^{SNIS}_{Q,N}(h) - I_h\bigr) \leadsto \mathcal N\bigl(0,\, \pi\{(\pi/q)(h - I_h)^2\}\bigr)\,.$

   Caveat
   If π cannot be computed, impossible to use $\hat I^{SNIS}_{Q,N}(h)$
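   A minimal R sketch of the self-normalised estimator when π is only known up
   to a constant; the unnormalised N(0,1) target, the Cauchy proposal and
   h(x) = x² are illustrative choices (true value Ih = 1):

    # self-normalised IS: tilde-pi unnormalised target, Q = Cauchy proposal
    h <- function(x) x^2
    tilde.pi <- function(x) exp(-x^2/2)      # unnormalised N(0,1) density
    x <- rcauchy(1e5)                        # x_1, ..., x_N ~ Q
    w <- tilde.pi(x)/dcauchy(x)              # ratios pi/q, up to the missing constant
    sum(w*h(x))/sum(w)                       # SNIS estimate of I_h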
Econom’ections






   Similar exploration of simulation-based and approximation
   techniques in Econometrics
       Simulated method of moments
       Method of simulated moments
       Simulated pseudo-maximum-likelihood
       Indirect inference
                                        [Gouriéroux & Monfort, 1996]
Simulated method of moments

  Given observations $y^o_{1:n}$ from a model

      $$y_t = r(y_{1:(t-1)}, \epsilon_t, \theta)\,, \qquad \epsilon_t \sim g(\cdot)$$

    1. simulate $\epsilon_{1:n}$, derive

      $$y_t(\theta) = r(y_{1:(t-1)}, \epsilon_t, \theta)$$

    2. and estimate θ by

      $$\arg\min_\theta \sum_{t=1}^n \bigl(y^o_t - y_t(\theta)\bigr)^2$$
Simulated method of moments

  Given observations $y^o_{1:n}$ from a model

      $$y_t = r(y_{1:(t-1)}, \epsilon_t, \theta)\,, \qquad \epsilon_t \sim g(\cdot)$$

    1. simulate $\epsilon_{1:n}$, derive

      $$y_t(\theta) = r(y_{1:(t-1)}, \epsilon_t, \theta)$$

    2. and estimate θ by

      $$\arg\min_\theta \left(\sum_{t=1}^n y^o_t - \sum_{t=1}^n y_t(\theta)\right)^2$$
Method of simulated moments

  Given a statistic vector K (y ) with

      $$\mathbb E_\theta[K(Y_t)\,|\,y_{1:(t-1)}] = k(y_{1:(t-1)};\theta)$$

  find an unbiased estimator of $k(y_{1:(t-1)};\theta)$,

      $$\tilde k(\epsilon_t, y_{1:(t-1)};\theta)$$

  Estimate θ by

      $$\arg\min_\theta\;\left\|\sum_{t=1}^n \left\{K(y_t) - \sum_{s=1}^S \tilde k(\epsilon^s_t, y_{1:(t-1)};\theta)\big/S\right\}\right\|$$

                                                         [Pakes & Pollard, 1989]
Indirect inference




   Minimise [in θ] a distance between estimators $\hat\beta$ based on a
   pseudo-model for genuine observations and for observations
   simulated under the true model and the parameter θ.

                              [Gouriéroux, Monfort, & Renault, 1993;
                              Smith, 1993; Gallant & Tauchen, 1996]
Indirect inference (PML vs. PSE)


   Example of the pseudo-maximum-likelihood (PML)

      $$\hat\beta(y) = \arg\max_\beta \sum_t \log f(y_t\,|\,\beta, y_{1:(t-1)})$$

   leading to

      $$\arg\min_\theta\;\bigl\|\hat\beta(y^o) - \hat\beta(y^1(\theta), \ldots, y^S(\theta))\bigr\|^2$$

   when

      $$y^s(\theta) \sim f(y|\theta)\,, \qquad s = 1, \ldots, S$$
Indirect inference (PML vs. PSE)


   Example of the pseudo-score-estimator (PSE)

      $$\hat\beta(y) = \arg\min_\beta \sum_t \left(\frac{\partial \log f}{\partial \beta}(y_t\,|\,\beta, y_{1:(t-1)})\right)^2$$

   leading to

      $$\arg\min_\theta\;\bigl\|\hat\beta(y^o) - \hat\beta(y^1(\theta), \ldots, y^S(\theta))\bigr\|^2$$

   when

      $$y^s(\theta) \sim f(y|\theta)\,, \qquad s = 1, \ldots, S$$
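   A hedged R illustration of the indirect-inference recipe above (not from the
   talk): MA(1) data, an AR(3) pseudo-model as auxiliary estimator β̂, and S
   simulated paths; all settings are arbitrary choices for this sketch, and the
   simulated-side β̂ is taken as the average of the S auxiliary fits (one of the
   standard variants).

    # indirect inference sketch: true model MA(1), auxiliary model AR(3)
    set.seed(1)
    T <- 500; S <- 10
    yobs <- arima.sim(list(ma = 0.5), n = T)                 # "observed" data
    beta.hat <- function(y) ar(y, order.max = 3, aic = FALSE)$ar  # auxiliary estimator
    b.obs <- beta.hat(yobs)
    eps <- matrix(rnorm(S*(T+1)), S)                         # common random numbers across theta
    crit <- function(theta) {
      b.sim <- rowMeans(sapply(1:S, function(s) {
        y <- eps[s,-1] + theta*eps[s,-(T+1)]                 # simulate an MA(1) path y^s(theta)
        beta.hat(y) }))
      sum((b.obs - b.sim)^2)                                 # distance between auxiliary estimates
    }
    optimize(crit, c(-0.95, 0.95))$minimum                   # indirect-inference estimate of theta

   Keeping the same ε draws across values of θ makes the criterion smooth in θ,
   which is what makes the minimisation feasible.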
Consistent indirect inference



           ...in order to get a unique solution the dimension of
       the auxiliary parameter β must be larger than or equal to
       the dimension of the initial parameter θ. If the problem is
       just identified the different methods become easier...

   Consistency depending on the criterion and on the asymptotic
   identifiability of θ
                                   [Gouriéroux & Monfort, 1996, p. 66]
Choice of pseudo-model




   Pick the pseudo-model such that
    1. $\hat\beta(\theta)$ is not flat
       (i.e. sensitive to changes in θ)
    2. $\hat\beta(\theta)$ is not dispersed (i.e. robust against changes in $y^s(\theta)$)
                                           [Frigessi & Heggland, 2004]
Empirical likelihood

   Another approximation method (not yet related with simulation)

   Definition
   For a dataset y = (y1 , . . . , yn ) and a parameter of interest θ, pick
   constraints
                                   E[h(Y , θ)] = 0
   uniquely identifying θ and define the empirical likelihood as

      $$L_{\text{el}}(\theta|y) = \max_p \prod_{i=1}^n p_i$$

   for p in the set $\{p \in [0;1]^n,\ \sum_i p_i = 1,\ \sum_i p_i\, h(y_i, \theta) = 0\}$.

                                                                     [Owen, 1988]
Empirical likelihood



   Another approximation method (not yet related with simulation)

   Example
   When θ = Ef [Y ], empirical likelihood is the maximum of

                                p1 · · · pn

   under constraint
                         p1 y1 + . . . + pn yn = θ
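   A minimal R sketch of this example, solving the dual problem for the mean
   constraint; the function name el.mean and the test data are illustrative, not
   from the talk. The dual gives $p_i = 1/\{n(1 + \lambda(y_i - \theta))\}$ with λ
   solving $\sum_i (y_i - \theta)/(1 + \lambda(y_i - \theta)) = 0$.

    # empirical likelihood of a mean value theta, via the dual (Lagrange) problem
    el.mean <- function(theta, y) {
      z <- y - theta
      n <- length(y)
      if (min(z) >= 0 || max(z) <= 0) return(0)      # theta outside the hull of y: L_el = 0
      g <- function(lam) sum(z/(1 + lam*z))
      lam <- uniroot(g, c(-1/max(z) + 1e-8, -1/min(z) - 1e-8))$root
      prod(1/(n*(1 + lam*z)))                        # L_el(theta|y), maximal value n^{-n}
    }
    y <- rnorm(50, mean = 1)
    el.mean(mean(y), y)                              # equals n^{-n} at the sample mean
    el.mean(0, y)                                    # much smaller away from it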
ABCel


  Another approximation method (now related with simulation!)
  Importance sampling implementation
  Algorithm 1: Raw ABCel sampler
    Given observation y
    for i = 1 to M do
      Generate θi from the prior distribution π(·)
      Set the weight ωi = Lel (θi |y)
    end for
    Proceed with the pairs (θi , ωi ) as in regular importance sampling

                                 [Mengersen, Pudlo & Robert, 2012]
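  A hedged R sketch of the raw ABCel sampler for the mean example, reusing the
  el.mean function sketched above; the N(0, 10²) prior and the number of
  simulations M are illustrative choices, not from the talk.

    # raw ABCel sketch for a posterior on the mean
    y <- rnorm(50, mean = 1)
    M <- 1e4
    thetas <- rnorm(M, sd = 10)                      # theta_i ~ pi(.)
    omegas <- sapply(thetas, el.mean, y = y)         # omega_i = L_el(theta_i | y)
    omegas <- omegas/sum(omegas)
    sum(omegas*thetas)                               # e.g. weighted estimate of the posterior mean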
A?B?C?




    A stands for approximate
    [wrong likelihood]
    B stands for Bayesian
    C stands for computation
    [producing a parameter
    sample]
How much Bayesian?




      asymptotically so (meaningful?)
      approximation error unknown (w/o simulation)
      pragmatic Bayes (there is no other solution!)
Approximate Bayesian computation



   Introduction

   Approximate Bayesian computation
      Genetics of ABC
      ABC basics
      Advances and interpretations
      Alphabet (compu-)soup

   ABC as an inference machine
Genetic background of ABC




   ABC is a recent computational technique that only requires being
   able to sample from the likelihood f (·|θ)
   This technique stemmed from population genetics models, about
   15 years ago, and population geneticists still contribute
   significantly to methodological developments of ABC.
                              [Griffith & al., 1997; Tavaré & al., 1999]
Demo-genetic inference



   Each model is characterized by a set of parameters θ that cover
   historical (time divergence, admixture time, ...), demographic
   (population sizes, admixture rates, migration rates, ...) and genetic
   (mutation rate, ...) factors
   The goal is to estimate these parameters from a dataset of
   polymorphism (DNA sample) y observed at the present time

   Problem:
   most of the time, we cannot calculate the likelihood of the
   polymorphism data f (y|θ).
A genuine example of application
             !""#$%&'()*+,(-*.&(/+0$'"1)()&$/+2!,03!
             1/+*%*'"4*+56(""4&7()&$/.+.1#+4*.+8-9':*.+




                                                      94


   Pygmy populations: do they have a common origin? Were there many
   exchanges between Pygmy and non-Pygmy populations?
Alternative scenarios

         Different possible scenarios; choice among scenarios by ABC

                                                              Verdu et al. 2009
Intractable likelihood

   Missing (too missing!) data structure:

      $$f(y|\theta) = \int_{\mathcal G} f(y|G,\theta)\,f(G|\theta)\,dG$$

   cannot be computed in a manageable way...
   The genealogies are considered as nuisance parameters
       This modelling clearly differs from the phylogenetic perspective
       where the tree is the parameter of interest.
Intractable likelihood



So, what can we do when the
likelihood function f (y|θ) is
well-defined but impossible / too
costly to compute...?
    MCMC cannot be implemented!
    shall we give up Bayesian
    inference altogether?!
    or settle for an almost Bayesian
    inference...?
ABC methodology

  Bayesian setting: target is π(θ)f (x|θ)
  When the likelihood f (x|θ) is not in closed form, a likelihood-free rejection
  technique:
  Foundation
  For an observation y ∼ f (y|θ), under the prior π(θ), if one keeps
  jointly simulating
                        θ′ ∼ π(θ) ,  z ∼ f (z|θ′ ) ,
  until the auxiliary variable z is equal to the observed value, z = y,
  then the selected
                                 θ′ ∼ π(θ|y)

           [Rubin, 1984; Diggle & Gratton, 1984; Tavaré et al., 1997]
Why does it work?!



   The proof is trivial:

      $$f(\theta_i) \propto \sum_{z\in\mathcal D} \pi(\theta_i)\,f(z|\theta_i)\,\mathbb I_y(z)
                   \propto \pi(\theta_i)\,f(y|\theta_i)
                   = \pi(\theta_i|y)\,.$$

                                                             [Accept–Reject 101]
A as A...pproximative



   When y is a continuous random variable, strict equality z = y is
   replaced with a tolerance zone

                                 ρ(y, z) ≤ ε

   where ρ is a distance
   Output distributed from

      $$\pi(\theta)\, P_\theta\{\rho(y,z) < \epsilon\} \;\overset{\text{def}}{\propto}\; \pi(\theta \mid \rho(y,z) < \epsilon)$$

                                                [Pritchard et al., 1999]
ABC algorithm


  In most implementations, a further degree of A...pproximation:

  Algorithm 1 Likelihood-free rejection sampler
    for i = 1 to N do
      repeat
         generate θ′ from the prior distribution π(·)
         generate z from the likelihood f (·|θ′ )
      until ρ{η(z), η(y)} ≤ ε
      set θi = θ′
    end for

  where η(y) defines a (not necessarily sufficient) statistic
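  A generic R sketch of this rejection sampler; the function abc.reject and its
  arguments (rprior, rlik, eta) are hypothetical stand-ins for a user-supplied
  prior simulator, model simulator and summary statistic, and the toy usage at
  the end is an arbitrary normal-mean example.

    # likelihood-free rejection sampler, in the spirit of Algorithm 1
    abc.reject <- function(N, y, rprior, rlik, eta, eps,
                           rho = function(a, b) sqrt(sum((a - b)^2))) {
      out <- numeric(N)
      for (i in 1:N) {
        repeat {
          theta <- rprior()                          # theta' ~ pi(.)
          z <- rlik(theta)                           # z ~ f(.|theta')
          if (rho(eta(z), eta(y)) <= eps) break      # accept when summaries are close
        }
        out[i] <- theta
      }
      out
    }
    # toy usage: normal mean, N(0,10^2) prior, eta = sample mean
    y <- rnorm(20, mean = 1)
    post <- abc.reject(500, y, rprior = function() rnorm(1, sd = 10),
                       rlik = function(th) rnorm(20, mean = th), eta = mean, eps = 0.1)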
Output


  The likelihood-free algorithm samples from the marginal in z of:

      $$\pi_\epsilon(\theta, z|y) = \frac{\pi(\theta)\,f(z|\theta)\,\mathbb I_{A_{\epsilon,y}}(z)}
                                         {\int_{A_{\epsilon,y}\times\Theta} \pi(\theta)\,f(z|\theta)\,dz\,d\theta}\,,$$

  where $A_{\epsilon,y} = \{z \in \mathcal D \,|\, \rho(\eta(z), \eta(y)) < \epsilon\}$.
  The idea behind ABC is that the summary statistics coupled with a
  small tolerance should provide a good approximation of the
  posterior distribution:

      $$\pi_\epsilon(\theta|y) = \int \pi_\epsilon(\theta, z|y)\,dz \approx \pi(\theta|y)\,.$$

                                                                ...does it?!
Convergence of ABC (first attempt)

   What happens when ε → 0?
   If f (·|θ) is continuous in y , uniformly in θ [!], given an arbitrary
   δ > 0, there exists ε0 such that ε < ε0 implies

      $$\frac{\int \pi(\theta)\,f(z|\theta)\,\mathbb I_{A_{\epsilon,y}}(z)\,dz}
             {\int_{A_{\epsilon,y}\times\Theta} \pi(\theta)\,f(z|\theta)\,dz\,d\theta}
        \;\in\;
        \frac{\pi(\theta)\,f(y|\theta)\,(1 \mp \delta)\,\mu(B_\epsilon)}
             {\int_\Theta \pi(\theta)\,f(y|\theta)\,d\theta\,(1 \pm \delta)\,\mu(B_\epsilon)}$$

   and the µ(Bε ) terms cancel.

                     [Proof extends to other continuous-in-0 kernels K ]
Convergence of ABC (second attempt)


   What happens when ε → 0?
   For B ⊂ Θ, we have

      $$\int_B \frac{\int_{A_{\epsilon,y}} f(z|\theta)\,dz}
                    {\int_{A_{\epsilon,y}\times\Theta} \pi(\theta)\,f(z|\theta)\,dz\,d\theta}\;\pi(\theta)\,d\theta
        = \int_{A_{\epsilon,y}} \frac{\int_B f(z|\theta)\,\pi(\theta)\,d\theta}
                                      {\int_{A_{\epsilon,y}\times\Theta} \pi(\theta)\,f(z|\theta)\,dz\,d\theta}\,dz$$
      $$= \int_{A_{\epsilon,y}} \frac{\int_B f(z|\theta)\,\pi(\theta)\,d\theta}{m(z)}
          \;\frac{m(z)}{\int_{A_{\epsilon,y}\times\Theta} \pi(\theta)\,f(z|\theta)\,dz\,d\theta}\,dz
        = \int_{A_{\epsilon,y}} \pi(B|z)\,\frac{m(z)}{\int_{A_{\epsilon,y}\times\Theta} \pi(\theta)\,f(z|\theta)\,dz\,d\theta}\,dz$$

   which indicates convergence for a continuous π(B|z).
Convergence (do not attempt!)


   ...and the above does not apply to insufficient statistics:
   If η(y) is not a sufficient statistic, the best one can hope for is

                          π(θ|η(y)) ,   not π(θ|y)

   If η(y) is an ancillary statistic, the whole information contained in
   y is lost! The “best” one can hope for is

                              π(θ|η(y)) = π(θ)

                                                                Bummer!!!
Probit modelling on Pima Indian women

   Example (R benchmark)
   200 Pima Indian women with observed variables
          plasma glucose concentration in oral glucose tolerance test
          diastolic blood pressure
          diabetes pedigree function
          presence/absence of diabetes


   Probability of diabetes as function of variables

                           P(y = 1|x) = Φ(x1 β1 + x2 β2 + x3 β3 ) ,

   200 observations of Pima.tr and g -prior modelling:

      $$\beta \sim \mathcal N_3\bigl(0,\, n\,(X^{\mathrm T}X)^{-1}\bigr)$$

   importance function inspired from the MLE estimator distribution

      $$\beta \sim \mathcal N(\hat\beta, \hat\Sigma)$$
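   For reference, a hedged R sketch of this importance-sampling benchmark (not
   the talk's exact code): it assumes MASS::Pima.tr with covariates glu, bp and
   ped and no intercept, as in the formula above, and uses the mvtnorm package
   for the multivariate normal proposal.

    # g-prior probit benchmark via self-normalised importance sampling
    library(MASS); library(mvtnorm)
    X <- as.matrix(Pima.tr[, c("glu", "bp", "ped")])
    y <- as.numeric(Pima.tr$type == "Yes")
    n <- nrow(X)
    fit <- glm(y ~ X - 1, family = binomial(link = "probit"))
    bhat <- coef(fit); Sig <- vcov(fit)              # MLE and asymptotic covariance
    logprior <- function(b) dmvnorm(b, sigma = n*solve(crossprod(X)), log = TRUE)
    loglik <- function(b) sum(dbinom(y, 1, pnorm(X %*% b), log = TRUE))
    M <- 1e4
    betas <- rmvnorm(M, mean = bhat, sigma = Sig)    # importance proposal N(bhat, Sig)
    logw <- apply(betas, 1, function(b)
      loglik(b) + logprior(b) - dmvnorm(b, bhat, Sig, log = TRUE))
    w <- exp(logw - max(logw)); w <- w/sum(w)
    colSums(w*betas)                                 # IS estimate of the posterior means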
Pima Indian benchmark

   Figure: Comparison between density estimates of the marginals on β1
   (left), β2 (center) and β3 (right) from ABC rejection samples (red) and
   MCMC samples (black)
MA example



  MA(q) model

      $$x_t = \epsilon_t + \sum_{i=1}^q \vartheta_i\, \epsilon_{t-i}$$

  Simple prior: uniform over the inverse [real and complex] roots in

      $$\mathcal Q(u) = 1 - \sum_{i=1}^q \vartheta_i\, u^i$$

  under the identifiability conditions
MA example




  MA(q) model

      $$x_t = \epsilon_t + \sum_{i=1}^q \vartheta_i\, \epsilon_{t-i}$$

  Simple prior: uniform prior over the identifiability zone, i.e. the triangle
  for MA(2)
MA example (2)

  The ABC algorithm is thus made of
    1. picking a new value (ϑ1 , ϑ2 ) in the triangle
    2. generating an iid sequence $(\epsilon_t)_{-q < t \le T}$
    3. producing a simulated series $(x'_t)_{1\le t\le T}$
  Distance: basic distance between the series

      $$\rho\bigl((x'_t)_{1\le t\le T}, (x_t)_{1\le t\le T}\bigr) = \sum_{t=1}^T (x_t - x'_t)^2$$

  or distance between summary statistics like the q = 2
  autocorrelations

      $$\tau_j = \sum_{t=j+1}^T x_t\, x_{t-j}$$
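  A hedged R sketch of this ABC scheme with the autocorrelation summaries; T, N,
  the true values (0.6, 0.2) and the 0.1% quantile tolerance are illustrative
  choices, and the invertibility triangle is written for the +ϑ parameterisation
  of the model above.

    # ABC rejection for MA(2) with summaries (tau_1, tau_2)
    T <- 100; N <- 1e4
    x <- arima.sim(list(ma = c(0.6, 0.2)), n = T)          # "observed" series
    tau <- function(x) { n <- length(x)
      c(sum(x[-1]*x[-n]), sum(x[-(1:2)]*x[-((n-1):n)])) }  # tau_1, tau_2
    tau.obs <- tau(x)
    rtriangle <- function() {                              # uniform draw on the triangle
      repeat { th <- c(runif(1, -2, 2), runif(1, -1, 1))
        if (th[1] + th[2] > -1 && th[1] - th[2] < 1) return(th) } }
    thetas <- t(replicate(N, rtriangle()))
    dist <- apply(thetas, 1, function(th) {
      eps <- rnorm(T + 2)
      z <- eps[-(1:2)] + th[1]*eps[2:(T+1)] + th[2]*eps[1:T]   # simulated MA(2) series
      sqrt(sum((tau(z) - tau.obs)^2)) })
    abc <- thetas[dist <= quantile(dist, 0.001), ]         # keep the 0.1% closest draws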
Comparison of distance impact




   Evaluation of the tolerance on the ABC sample against both
   distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model
Comparison of distance impact

   [Figure: density estimates of the ABC samples of θ1 (left) and θ2 (right)]

   Evaluation of the tolerance on the ABC sample against both
   distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model
ABC (simul’) advances


                                                        how approximative is ABC?

   Simulating from the prior is often poor in efficiency
   Either modify the proposal distribution on θ to increase the density
   of x’s within the vicinity of y ...
        [Marjoram et al, 2003; Bortot et al., 2007; Sisson et al., 2007]

   ...or view the problem as a conditional density estimation problem
   and develop techniques that allow for a larger ε
                                               [Beaumont et al., 2002]

   .....or even include ε in the inferential framework [ABCµ ]
                                                    [Ratmann et al., 2009]
ABC-NP

   Better usage of [prior] simulations by regression adjustment: instead of
   throwing away θ′ such that ρ(η(z), η(y)) > ε, replace the θ’s with locally
   regressed transforms (use with BIC)

      $$\theta^* = \theta - \{\eta(z) - \eta(y)\}^{\mathrm T}\,\hat\beta$$

                                               [Csilléry et al., TEE, 2010]

   where $\hat\beta$ is obtained by [NP] weighted least squares regression on
   (η(z) − η(y)) with weights

      $$K_\delta\{\rho(\eta(z), \eta(y))\}$$

                                       [Beaumont et al., 2002, Genetics]
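   A hedged R sketch of this local regression adjustment (an Epanechnikov kernel
   is assumed for Kδ, as in Beaumont et al.); abc.adjust and its arguments thetas
   (simulated parameters), etas (their summaries), eta.obs and delta are
   illustrative names, not from the talk.

    # local-linear regression adjustment of ABC output
    abc.adjust <- function(thetas, etas, eta.obs, delta) {
      d <- sqrt(rowSums(sweep(etas, 2, eta.obs)^2))       # rho(eta(z), eta(y))
      w <- ifelse(d <= delta, 1 - (d/delta)^2, 0)         # Epanechnikov weights K_delta
      keep <- w > 0
      X <- sweep(etas[keep, , drop = FALSE], 2, eta.obs)  # eta(z) - eta(y)
      fit <- lm(thetas[keep] ~ X, weights = w[keep])      # weighted least squares
      theta.star <- thetas[keep] - X %*% coef(fit)[-1]    # theta* = theta - {eta(z)-eta(y)}' beta.hat
      list(theta = theta.star, weight = w[keep])
    }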
ABC-NP (regression)



   Also found in the subsequent literature, e.g. in Fearnhead–Prangle (2012):
   weight each simulation directly by

      $$K_\delta\{\rho(\eta(z(\theta)), \eta(y))\}$$

   or

      $$\frac{1}{S}\sum_{s=1}^S K_\delta\{\rho(\eta(z_s(\theta)), \eta(y))\}$$

                                        [consistent estimate of f (η|θ)]
   Curse of dimensionality: poor estimate when d = dim(η) is large...
ABC-NP (density estimation)



   Use of the kernel weights

      $$K_\delta\{\rho(\eta(z(\theta)), \eta(y))\}$$

   leads to the NP estimate of the posterior expectation

      $$\frac{\sum_i \theta_i\, K_\delta\{\rho(\eta(z(\theta_i)), \eta(y))\}}
             {\sum_i K_\delta\{\rho(\eta(z(\theta_i)), \eta(y))\}}$$

                                                           [Blum, JASA, 2010]
ABC-NP (density estimation)



   Use of the kernel weights

      $$K_\delta\{\rho(\eta(z(\theta)), \eta(y))\}$$

   leads to the NP estimate of the posterior conditional density

      $$\frac{\sum_i \tilde K_b(\theta_i - \theta)\, K_\delta\{\rho(\eta(z(\theta_i)), \eta(y))\}}
             {\sum_i K_\delta\{\rho(\eta(z(\theta_i)), \eta(y))\}}$$

                                                      [Blum, JASA, 2010]
ABC-NP (density estimations)



   Other versions incorporating regression adjustments

      $$\frac{\sum_i \tilde K_b(\theta^*_i - \theta)\, K_\delta\{\rho(\eta(z(\theta_i)), \eta(y))\}}
             {\sum_i K_\delta\{\rho(\eta(z(\theta_i)), \eta(y))\}}$$

   In all cases, the error is

      $$\mathbb E[\hat g(\theta|y)] - g(\theta|y) = c_b\, b^2 + c_\delta\, \delta^2 + O_P(b^2 + \delta^2) + O_P(1/n\delta^d)$$
      $$\operatorname{var}(\hat g(\theta|y)) = \frac{c}{n b \delta^d}\,(1 + o_P(1))$$

                                  [Blum, JASA, 2010; standard NP calculations]
ABC-NCH


  Incorporating non-linearities and heteroscedasticities:

      $$\theta^* = \hat m(\eta(y)) + \bigl[\theta - \hat m(\eta(z))\bigr]\,
                   \frac{\hat\sigma(\eta(y))}{\hat\sigma(\eta(z))}$$

  where
      $\hat m(\eta)$ is estimated by non-linear regression (e.g., a neural network)
      $\hat\sigma(\eta)$ is estimated by non-linear regression on the residuals

      $$\log\{\theta_i - \hat m(\eta_i)\}^2 = \log \sigma^2(\eta_i) + \xi_i$$

                                                [Blum & François, 2009]
ABC-NCH (2)



  Why neural network?
      fights curse of dimensionality
      selects relevant summary statistics
      provides automated dimension reduction
      offers a model choice capability
      improves upon multinomial logistic
                                             [Blum & François, 2009]
ABC-MCMC



  Markov chain (θ(t) ) created via the transition function

      $$\theta^{(t+1)} = \begin{cases}
        \theta' \sim K_\omega(\theta'|\theta^{(t)}) & \text{if } x \sim f(x|\theta') \text{ is such that } x = y \\[2pt]
        & \text{and } u \sim \mathcal U(0,1) \le \dfrac{\pi(\theta')\,K_\omega(\theta^{(t)}|\theta')}{\pi(\theta^{(t)})\,K_\omega(\theta'|\theta^{(t)})}\,, \\[4pt]
        \theta^{(t)} & \text{otherwise,}
      \end{cases}$$

  has the posterior π(θ|y ) as stationary distribution
                                                 [Marjoram et al, 2003]
ABC-MCMC (2)


  Algorithm 2 Likelihood-free MCMC sampler
    Use Algorithm 1 to get (θ(0) , z(0) )
    for t = 1 to N do
      Generate θ′ from Kω (·|θ(t−1) ),
      Generate z′ from the likelihood f (·|θ′ ),
      Generate u from U[0,1] ,
      if u ≤ $\dfrac{\pi(\theta')\,K_\omega(\theta^{(t-1)}|\theta')}{\pi(\theta^{(t-1)})\,K_\omega(\theta'|\theta^{(t-1)})}\,\mathbb I_{A_{\epsilon,y}}(z')$ then
         set (θ(t) , z(t) ) = (θ′ , z′ )
      else
         (θ(t) , z(t) ) = (θ(t−1) , z(t−1) ),
      end if
    end for
Why does it work?

   Acceptance probability does not involve calculating the likelihood, since

      $$\frac{\pi_\epsilon(\theta', z'|y)}{\pi_\epsilon(\theta^{(t-1)}, z^{(t-1)}|y)}
        \times \frac{q(\theta^{(t-1)}|\theta')\,f(z^{(t-1)}|\theta^{(t-1)})}{q(\theta'|\theta^{(t-1)})\,f(z'|\theta')}$$
      $$= \frac{\pi(\theta')\,f(z'|\theta')\,\mathbb I_{A_{\epsilon,y}}(z')}
               {\pi(\theta^{(t-1)})\,f(z^{(t-1)}|\theta^{(t-1)})\,\mathbb I_{A_{\epsilon,y}}(z^{(t-1)})}
         \times \frac{q(\theta^{(t-1)}|\theta')\,f(z^{(t-1)}|\theta^{(t-1)})}{q(\theta'|\theta^{(t-1)})\,f(z'|\theta')}$$

   where the $f(z'|\theta')$ and $f(z^{(t-1)}|\theta^{(t-1)})$ terms cancel and
   $\mathbb I_{A_{\epsilon,y}}(z^{(t-1)}) = 1$ by construction, leaving

      $$= \frac{\pi(\theta')\,q(\theta^{(t-1)}|\theta')}{\pi(\theta^{(t-1)})\,q(\theta'|\theta^{(t-1)})}\,
          \mathbb I_{A_{\epsilon,y}}(z')$$
A toy example




   Case of

      $$x \sim \tfrac{1}{2}\,\mathcal N(\theta, 1) + \tfrac{1}{2}\,\mathcal N(-\theta, 1)$$

   under the prior θ ∼ N (0, 10)

   ABC sampler
   thetas=rnorm(N,sd=10)                                # theta_i from the prior
   zed=sample(c(1,-1),N,rep=TRUE)*thetas+rnorm(N,sd=1)  # pseudo-data z_i
   eps=quantile(abs(zed-x),.01)                         # tolerance = 1% distance quantile
   abc=thetas[abs(zed-x)<eps]                           # accepted theta's
A toy example


   Case of

      $$x \sim \tfrac{1}{2}\,\mathcal N(\theta, 1) + \tfrac{1}{2}\,\mathcal N(-\theta, 1)$$

   under the prior θ ∼ N (0, 10)

   ABC-MCMC sampler
   metas=rep(0,N)
   metas[1]=rnorm(1,sd=10)                              # initial value from the prior
   zed[1]=x
   for (t in 2:N){
     metas[t]=rnorm(1,mean=metas[t-1],sd=5)             # random walk proposal
     zed[t]=rnorm(1,mean=(1-2*(runif(1)<.5))*metas[t],sd=1)  # pseudo-data given proposal
     if ((abs(zed[t]-x)>eps)||(runif(1)>dnorm(metas[t],sd=10)/dnorm(metas[t-1],sd=10))){
      metas[t]=metas[t-1]                               # reject: outside tolerance or prior ratio
      zed[t]=zed[t-1]}
     }
A toy example

   [Figure: density estimate over θ of the ABC output for x = 2]
A toy example

   [Figure: density estimate over θ of the ABC output for x = 10]
A toy example

   [Figure: density estimate over θ of the ABC output for x = 50]
ABC-PMC

  Use of a transition kernel as in population Monte Carlo with a
  manageable IS correction
  Generate a sample at iteration t by

      $$\hat\pi_t(\theta^{(t)}) \propto \sum_{j=1}^N \omega_j^{(t-1)}\, K_t(\theta^{(t)}|\theta_j^{(t-1)})$$

  modulo acceptance of the associated xt , and use an importance
  weight associated with an accepted simulation $\theta_i^{(t)}$

      $$\omega_i^{(t)} \propto \pi(\theta_i^{(t)})\,\big/\,\hat\pi_t(\theta_i^{(t)})\,.$$

  Still likelihood-free
                                                       [Beaumont et al., 2009]
Our ABC-PMC algorithm

  Given a decreasing sequence of approximation levels ε1 ≥ . . . ≥ εT ,

    1. At iteration t = 1,
         For i = 1, ..., N
             Simulate $\theta_i^{(1)} \sim \pi(\theta)$ and $x \sim f(x|\theta_i^{(1)})$ until $\rho(x, y) < \epsilon_1$
             Set $\omega_i^{(1)} = 1/N$
       Take τ 2 as twice the empirical variance of the $\theta_i^{(1)}$’s
    2. At iteration 2 ≤ t ≤ T ,
         For i = 1, ..., N, repeat
             Pick $\theta^\star_i$ from the $\theta_j^{(t-1)}$’s with probabilities $\omega_j^{(t-1)}$
             generate $\theta_i^{(t)}|\theta^\star_i \sim \mathcal N(\theta^\star_i, \sigma_t^2)$ and $x \sim f(x|\theta_i^{(t)})$
         until $\rho(x, y) < \epsilon_t$
         Set $\omega_i^{(t)} \propto \pi(\theta_i^{(t)})\big/\sum_{j=1}^N \omega_j^{(t-1)}\,\varphi\bigl(\sigma_t^{-1}\{\theta_i^{(t)} - \theta_j^{(t-1)}\}\bigr)$
       Take $\tau_{t+1}^2$ as twice the weighted empirical variance of the $\theta_i^{(t)}$’s
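  A hedged R sketch of this scheme on the two-component normal mixture toy
  example used earlier (x ∼ ½N(θ,1) + ½N(−θ,1), θ ∼ N(0,10²)); the particle
  number, tolerance schedule and observed x are arbitrary choices, and the
  kernel variance is taken as twice the (weighted) empirical variance as in the
  algorithm above.

    # ABC-PMC sketch on the mixture toy example
    x <- 2; N <- 1000; eps <- c(2, 1, .5, .25)            # decreasing tolerances
    rdata <- function(th) (1-2*(runif(1)<.5))*th + rnorm(1)
    theta <- sapply(1:N, function(i) {                    # t = 1: rejection from the prior
      repeat { th <- rnorm(1, sd=10); if (abs(rdata(th)-x) < eps[1]) return(th) } })
    w <- rep(1/N, N); tau2 <- 2*var(theta)
    for (t in 2:length(eps)) {                            # t >= 2: move resampled particles
      theta.new <- sapply(1:N, function(i) {
        repeat { th <- rnorm(1, mean=sample(theta, 1, prob=w), sd=sqrt(tau2))
          if (abs(rdata(th)-x) < eps[t]) return(th) } })
      w.new <- sapply(theta.new, function(th)             # weight: prior over kernel mixture
        dnorm(th, sd=10)/sum(w*dnorm(th, mean=theta, sd=sqrt(tau2))))
      theta <- theta.new; w <- w.new/sum(w.new)
      tau2 <- 2*sum(w*(theta - sum(w*theta))^2)           # twice the weighted variance
    }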
Sequential Monte Carlo


   SMC is a simulation technique to approximate a sequence of
   related probability distributions πn with π0 “easy” and πT as
   target.
   Iterated IS as PMC: particles moved from time n − 1 to time n via
   kernel Kn and use of a sequence of extended targets $\tilde\pi_n$

      $$\tilde\pi_n(z_{0:n}) = \pi_n(z_n)\,\prod_{j=0}^{n-1} L_j(z_{j+1}, z_j)$$

   where the Lj ’s are backward Markov kernels [check that πn (zn ) is a
   marginal]
                           [Del Moral, Doucet & Jasra, Series B, 2006]
Sequential Monte Carlo (2)

   Algorithm 3 SMC sampler
     sample $z_i^{(0)} \sim \gamma_0(x)$  (i = 1, . . . , N)
     compute weights $w_i^{(0)} = \pi_0(z_i^{(0)})/\gamma_0(z_i^{(0)})$
     for t = 1 to T do
       if ESS(w (t−1) ) < NT then
          resample the N particles z (t−1) and set the weights to 1
       end if
       generate $z_i^{(t)} \sim K_t(z_i^{(t-1)}, \cdot)$ and set the weights to

          $$w_i^{(t)} = w_i^{(t-1)}\,
            \frac{\pi_t(z_i^{(t)})\,L_{t-1}(z_i^{(t)}, z_i^{(t-1)})}
                 {\pi_{t-1}(z_i^{(t-1)})\,K_t(z_i^{(t-1)}, z_i^{(t)})}$$

     end for

                              [Del Moral, Doucet & Jasra, Series B, 2006]
ABC-SMC

                                      [Del Moral, Doucet & Jasra, 2009]

  True derivation of an SMC-ABC algorithm
  Use of a kernel Kn associated with target $\pi_{\epsilon_n}$ and derivation of the
  backward kernel

      $$L_{n-1}(z, z') = \frac{\pi_{\epsilon_n}(z')\,K_n(z', z)}{\pi_{\epsilon_n}(z)}$$

  Update of the weights

      $$w_{in} \propto w_{i(n-1)}\,
        \frac{\sum_{m=1}^M \mathbb I_{A_{\epsilon_n}}(x_{in}^m)}
             {\sum_{m=1}^M \mathbb I_{A_{\epsilon_{n-1}}}(x_{i(n-1)}^m)}$$

  when $x_{in}^m \sim K(x_{i(n-1)}, \cdot)$
ABC-SMCM




  Modification: Make M repeated simulations of the pseudo-data z
  given the parameter, rather than using a single [M = 1] simulation,
  leading to a weight proportional to the number of accepted zi ’s

      $$\omega(\theta) = \frac{1}{M}\sum_{i=1}^M \mathbb I_{\rho(\eta(y), \eta(z_i)) \le \epsilon}$$

  [limit in M means exact simulation from the (tempered) target]
Properties of ABC-SMC



  The ABC-SMC method properly uses a backward kernel L(z, z′ ) to
  simplify the importance weight and to remove the dependence on
  the unknown likelihood from this weight. The update of the importance
  weights reduces to the ratio of the proportions of surviving
  particles
  Major assumption: the forward kernel K is supposed to be invariant
  against the true target [tempered version of the true posterior]
  Adaptivity in the ABC-SMC algorithm is only found in the on-line
  construction of the thresholds εt , decreased slowly enough to keep a
  large number of accepted transitions
A mixture example (1)


   Toy model of Sisson et al. (2007): if

        θ ∼ U(−10, 10) ,   x|θ ∼ 0.5 N (θ, 1) + 0.5 N (θ, 1/100) ,

   then the posterior distribution associated with y = 0 is the normal
   mixture
                θ|y = 0 ∼ 0.5 N (0, 1) + 0.5 N (0, 1/100)
   restricted to [−10, 10].
   Furthermore, true target available as

   π(θ | |x| < ε) ∝ Φ(ε − θ) − Φ(−ε − θ) + Φ(10(ε − θ)) − Φ(−10(ε + θ)) .
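
   A small Python check of this closed form against a plain rejection-ABC sample for the same toy
   model (the tolerance, grid and sample sizes are arbitrary choices for the sketch):

     import numpy as np
     from scipy.stats import norm

     rng = np.random.default_rng(3)
     eps = 0.5

     # closed-form (unnormalised) target pi(theta | |x| < eps) on [-10, 10]
     def true_target(theta, eps):
         return (norm.cdf(eps - theta) - norm.cdf(-eps - theta)
                 + norm.cdf(10 * (eps - theta)) - norm.cdf(-10 * (eps + theta)))

     # rejection ABC: theta ~ U(-10, 10), x ~ .5 N(theta, 1) + .5 N(theta, 1/100), y = 0
     def abc_sample(n, eps):
         out = []
         while len(out) < n:
             theta = rng.uniform(-10, 10)
             comp = rng.random() < 0.5
             x = rng.normal(theta, 1.0 if comp else 0.1)
             if abs(x) < eps:                  # accept when pseudo-data close to y = 0
                 out.append(theta)
         return np.array(out)

     sample = abc_sample(2000, eps)
     grid = np.linspace(-3, 3, 200)
     dens = true_target(grid, eps)
     dens /= dens.sum() * (grid[1] - grid[0])  # normalise on the plotting grid
     # the histogram of sample should match dens up to Monte Carlo noise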
A mixture example (2)

   Recovery of the target, whether using a fixed standard deviation of
   τ = 0.15 or τ = 1/0.15, or a sequence of adaptive τt ’s.
       [Five density panels over θ ∈ (−3, 3), density scale 0 to 1, one per choice of τ]
Yet another ABC-PRC

  Another version of ABC-PRC called PRC-ABC with a proposal
  distribution Mt , S replications of the pseudo-data, and a PRC step:
   PRC-ABC Algorithm
    1. Select θ⋆ at random from the θ_i^{(t−1)}'s with probabilities ω_i^{(t−1)}
    2. Generate θ_i^{(t)} ∼ Kt, x_1, . . . , x_S iid ∼ f(x|θ_i^{(t−1)}), set

                  ω_i^{(t)} = Σ_s Kt{η(x_s) − η(y)} / Kt(θ_i^{(t)}) ,

       and accept with probability p(i) = ω_i^{(t)}/c_{t,N}
    3. Set ω_i^{(t)} ∝ ω_i^{(t)}/p(i) (1 ≤ i ≤ N)

                       [Peters, Sisson & Fan, Stat & Computing, 2012]
Yet another ABC-PRC




  Another version of ABC-PRC called PRC-ABC with a proposal
  distribution Mt , S replications of the pseudo-data, and a PRC step:
       Use of a NP estimate Mt(θ) = Σ_i ω_i^{(t−1)} ψ(θ|θ_i^{(t−1)}) (with a
       fixed variance?!)
      S = 1 recommended for “greatest gains in sampler
      performance”
      ct,N upper quantile of non-zero weights at time t
ABC inference machine



   Introduction

   Approximate Bayesian computation

   ABC as an inference machine
     Error inc.
     Exact BC and approximate targets
     summary statistic
      Series B discussion
How much Bayesian?


    maybe a convergent method
    of inference (meaningful?
    sufficient?)
    approximation error
    unknown (w/o simulation)
    pragmatic Bayes (there is no
    other solution!)
    many calibration issues
    (tolerance, distance,
    statistics)

                                   ...should Bayesians care?!
ABCµ



   Idea: infer about the error ε as well:
   Use of a joint density

                  f(θ, ε|y) ∝ ξ(ε|y, θ) × π_θ(θ) × π_ε(ε)

   where y is the data, and ξ(ε|y, θ) is the prior predictive density of
   ρ(η(z), η(y)) given θ and y when z ∼ f(z|θ)
   Warning! Replacement of ξ(ε|y, θ) with a non-parametric kernel
   approximation.
             [Ratmann, Andrieu, Wiuf and Richardson, 2009, PNAS]
ABCµ details


   Multidimensional distances ρk (k = 1, . . . , K) and errors
   εk = ρk(ηk(z), ηk(y)), with

        εk ∼ ξk(ε|y, θ) ≈ ξ̂k(ε|y, θ) = (1/Bhk) Σ_b K[{εk − ρk(ηk(z_b), ηk(y))}/hk]

   then used in replacing ξ(ε|y, θ) with mink ξ̂k(ε|y, θ)
   ABCµ involves acceptance probability

        π(θ′, ε′) q(θ′, θ) q(ε′, ε) mink ξ̂k(ε′|y, θ′) / [ π(θ, ε) q(θ, θ′) q(ε, ε′) mink ξ̂k(ε|y, θ) ]
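
   A rough Python sketch of this kernel estimate of the error density and of the min-over-k
   plug-in (the Gaussian kernel, the bandwidths and the toy summaries are assumptions of the
   sketch):

     import numpy as np

     rng = np.random.default_rng(4)

     def xi_hat(eps_k, rho_sims, h):
         """Kernel density estimate of xi_k(eps | y, theta) from B simulated
         distances rho_k(eta_k(z_b), eta_k(y))."""
         u = (eps_k - rho_sims) / h
         return np.mean(np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)) / h

     def min_xi_hat(eps, theta, y_obs, simulate, summaries, h, B=100):
         """min_k xi_hat_k(eps_k | y, theta) over the K summary-specific error densities."""
         vals = []
         for k, eta in enumerate(summaries):
             rho_sims = np.array([abs(eta(simulate(theta)) - eta(y_obs)) for _ in range(B)])
             vals.append(xi_hat(eps[k], rho_sims, h[k]))
         return min(vals)

     # toy usage: data of length 20, two summaries (mean and standard deviation)
     simulate = lambda t: rng.normal(t, 1.0, 20)
     summaries = [np.mean, np.std]
     y_obs = rng.normal(0.0, 1.0, 20)
     value = min_xi_hat(eps=np.array([0.1, 0.1]), theta=0.0, y_obs=y_obs,
                        simulate=simulate, summaries=summaries, h=np.array([0.2, 0.2]))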
ABCµ multiple errors




                        [© Ratmann et al., PNAS, 2009]
ABCµ for model choice




                         [© Ratmann et al., PNAS, 2009]
Questions about ABCµ


   For each model under comparison, marginal posterior on ε used to
   assess the fit of the model (HPD includes 0 or not).
       Is the data informative about ε? [Identifiability]
       How much does the prior π(ε) impact the comparison?
       How is using both ξ(ε|x0, θ) and π_ε(ε) compatible with a
       standard probability model? [remindful of Wilkinson's eABC]
       Where is the penalisation for complexity in the model
       comparison?
                                 [X, Mengersen & Chen, 2010, PNAS]
Wilkinson’s exact BC (not exactly!)

   ABC approximation error (i.e. non-zero tolerance) replaced with
   exact simulation from a controlled approximation to the target,
   convolution of the true posterior with a kernel function

        π_ε(θ, z|y) = π(θ) f(z|θ) K_ε(y − z) / ∫ π(θ) f(z|θ) K_ε(y − z) dz dθ ,

   with K_ε kernel parameterised by bandwidth ε.
                                                      [Wilkinson, 2008]

   Theorem
   The ABC algorithm based on the assumption of a randomised
   observation ỹ = y + ξ, ξ ∼ K_ε, and an acceptance probability of

                  K_ε(y − z)/M

   gives draws from the posterior distribution π(θ|y).
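
   A short Python illustration of this acceptance step, with a Gaussian K_ε and M taken as its
   maximum value (the model, prior and observed value are toy assumptions):

     import numpy as np

     rng = np.random.default_rng(5)
     eps = 0.3
     M = 1.0 / (eps * np.sqrt(2 * np.pi))    # bound on the Gaussian kernel K_eps

     def k_eps(u):
         """Gaussian kernel with bandwidth eps, bounded by M."""
         return np.exp(-0.5 * (u / eps) ** 2) / (eps * np.sqrt(2 * np.pi))

     def wilkinson_abc(n, y_obs, prior_draw, simulate):
         """Accept (theta, z) with probability K_eps(y - z)/M: exact posterior under
         the assumption that y was observed with additive K_eps noise."""
         samples = []
         while len(samples) < n:
             theta = prior_draw()
             z = simulate(theta)
             if rng.random() < k_eps(y_obs - z) / M:
                 samples.append(theta)
         return np.array(samples)

     # toy usage: theta ~ N(0, 2^2), y | theta ~ N(theta, 1), observed y = 1.2
     draws = wilkinson_abc(1000, y_obs=1.2,
                           prior_draw=lambda: rng.normal(0.0, 2.0),
                           simulate=lambda t: rng.normal(t, 1.0))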
How exact a BC?




       “Using ε to represent measurement error is
       straightforward, whereas using ε to model the model
       discrepancy is harder to conceptualize and not as
       commonly used”
                                         [Richard Wilkinson, 2008]
How exact a BC?

  Pros
         Pseudo-data from true model and observed data from noisy
         model
         Interesting perspective in that outcome is completely
         controlled
         Link with ABCµ and assuming y is observed with a
         measurement error with density K
         Relates to the theory of model approximation
                                               [Kennedy & O’Hagan, 2001]
  Cons
         Requires K to be bounded by M
         True approximation error never assessed
         Requires a modification of the standard ABC algorithm
ABC for HMMs



   Specific case of a hidden Markov model

                  X_{t+1} ∼ Q_θ(X_t, ·)
                  Y_{t+1} ∼ g_θ(·|x_{t+1})

   where only y⁰_{1:n} is observed.
                                   [Dean, Singh, Jasra, & Peters, 2011]
   Use of specific constraints, adapted to the Markov structure:

                  y_1 ∈ B(y⁰_1, ε) × · · · × y_n ∈ B(y⁰_n, ε)
ABC-MLE for HMMs


   ABC-MLE defined by

        θ̂_n = arg max_θ P_θ( Y_1 ∈ B(y⁰_1, ε), . . . , Y_n ∈ B(y⁰_n, ε) )

   Exact MLE for the likelihood [same basis as Wilkinson!]

                  p^ε_θ(y⁰_1, . . . , y⁰_n)

   corresponding to the perturbed process

                  (x_t, y_t + ε z_t)_{1≤t≤n} ,   z_t ∼ U(B(0, 1))

                                   [Dean, Singh, Jasra, & Peters, 2011]
ABC-MLE is biased



       ABC-MLE is asymptotically (in n) biased with target

                  l_ε(θ) = E_{θ*}[ log p^ε_θ(Y_1 | Y_{−∞:0}) ]

       but ABC-MLE converges to true value in the sense

                  l^n_ε(θ_n) → l_ε(θ)

       for all sequences (θ_n) converging to θ and ε_n
Noisy ABC-MLE



   Idea: modify instead the data from the start

                  (y⁰_1 + ζ_1, . . . , y⁰_n + ζ_n)

                                              [see Fearnhead-Prangle]
   noisy ABC-MLE estimate

        arg max_θ P_θ( Y_1 ∈ B(y⁰_1 + ζ_1, ε), . . . , Y_n ∈ B(y⁰_n + ζ_n, ε) )

                                   [Dean, Singh, Jasra, & Peters, 2011]
Consistent noisy ABC-MLE




       Degrading the data improves the estimation performance:
           Noisy ABC-MLE is asymptotically (in n) consistent
           under further assumptions, the noisy ABC-MLE is
           asymptotically normal
           increase in variance of order ε^{−2}
       likely degradation in precision or computing time due to the
       lack of summary statistic [curse of dimensionality]
SMC for ABC likelihood

   Algorithm 4 SMC ABC for HMMs
     Given θ
     for k = 1, . . . , n do
       generate proposals (x¹_k, y¹_k), . . . , (x^N_k, y^N_k) from the model
       weigh each proposal with ω^l_k = I_{B(y⁰_k + ζ_k, ε)}(y^l_k)
       renormalise the weights and sample the x^l_k's accordingly
     end for
     approximate the likelihood by

                  ∏_{k=1}^n ( Σ_{l=1}^N ω^l_k / N )

                                   [Jasra, Singh, Martin, & McCoy, 2010]
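
   A compact Python sketch of this likelihood approximation for a toy Gaussian random-walk HMM
   (the model, the noise ζ and all names are assumptions of the example):

     import numpy as np

     rng = np.random.default_rng(6)

     def smc_abc_loglik(theta, y_obs, eps, zeta, N=500):
         """SMC approximation of the ABC likelihood of an HMM: propagate N particles,
         weight by the indicator of the eps-ball around the (noisy) observation,
         resample, and multiply the per-step acceptance fractions."""
         n = len(y_obs)
         x = rng.normal(0.0, 1.0, N)                   # initial hidden states
         loglik = 0.0
         for k in range(n):
             x = x + rng.normal(0.0, theta, N)         # X_{k+1} ~ Q_theta(X_k, .)
             y = x + rng.normal(0.0, 1.0, N)           # Y_{k+1} ~ g_theta(.|x_{k+1})
             w = (np.abs(y - (y_obs[k] + zeta[k])) <= eps).astype(float)
             if w.sum() == 0:                          # all proposals rejected
                 return -np.inf
             loglik += np.log(w.mean())                # log of sum_l w_l / N
             x = rng.choice(x, size=N, p=w / w.sum())  # resample surviving states
         return loglik

     # toy usage
     y_obs = np.cumsum(rng.normal(0.0, 1.0, 20)) + rng.normal(0.0, 1.0, 20)
     zeta = rng.uniform(-0.1, 0.1, 20)
     ll = smc_abc_loglik(theta=1.0, y_obs=y_obs, eps=0.5, zeta=zeta)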
Which summary?



   Fundamental difficulty in the choice of the summary statistic when
   there is no non-trivial sufficient statistic [except when done by the
   experimenters in the field]
   Starting from a large collection of available summary statistics,
   Joyce and Marjoram (2008) consider their sequential inclusion into
   the ABC target, with a stopping rule based on a likelihood ratio
   test
       Not taking into account the sequential nature of the tests
       Depends on parameterisation
       Order of inclusion matters
Semi-automatic ABC


   Fearnhead and Prangle (2010) study ABC and the selection of the
   summary statistic in close proximity to Wilkinson's proposal
   ABC then considered from a purely inferential viewpoint and
   calibrated for estimation purposes
   Use of a randomised (or ‘noisy’) version of the summary statistics

                  η̃(y) = η(y) + τ ε

   Derivation of a well-calibrated version of ABC, i.e. an algorithm
   that gives proper predictions for the distribution associated with
   this randomised summary statistic [calibration constraint: ABC
   approximation with same posterior mean as the true randomised
   posterior]
Summary statistics




      Optimality of the posterior expectation E[θ|y] of the
      parameter of interest as summary statistics η(y)!
      Use of the standard quadratic loss function

                           (θ − θ0 )T A(θ − θ0 ) .
Details on Fearnhead and Prangle (FP) ABC


   Use of a summary statistic S(·), an importance proposal g(·), a
   kernel K(·) ≤ 1 and a bandwidth h > 0 such that

                  (θ, ysim) ∼ g(θ) f(ysim|θ)

   is accepted with probability (hence the bound)

                  K[{S(ysim) − sobs}/h]

   and the corresponding importance weight defined by

                  π(θ) / g(θ)

                                           [Fearnhead & Prangle, 2012]
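
   A minimal Python rendering of this accept/weight mechanism (the Gaussian proposal and kernel,
   the toy model and all names are assumptions of the sketch):

     import numpy as np

     rng = np.random.default_rng(7)

     def fp_abc(n, s_obs, h, prior_logpdf, proposal_draw, proposal_logpdf, simulate, summary):
         """Importance-sampling ABC: draw theta from g, simulate data, accept with
         probability K[{S(y_sim) - s_obs}/h] (Gaussian K <= 1 here), and attach the
         importance weight pi(theta)/g(theta)."""
         thetas, weights = [], []
         while len(thetas) < n:
             theta = proposal_draw()
             s_sim = summary(simulate(theta))
             accept_prob = np.exp(-0.5 * ((s_sim - s_obs) / h) ** 2)   # kernel bounded by 1
             if rng.random() < accept_prob:
                 thetas.append(theta)
                 # unnormalised log-densities: their constants cancel after self-normalisation
                 weights.append(np.exp(prior_logpdf(theta) - proposal_logpdf(theta)))
         w = np.array(weights)
         return np.array(thetas), w / w.sum()

     # toy usage: prior N(0, 2^2), proposal N(0, 3^2), data of length 10, summary = mean
     thetas, w = fp_abc(
         500, s_obs=0.4, h=0.2,
         prior_logpdf=lambda t: -0.5 * (t / 2.0) ** 2 - np.log(2.0),
         proposal_draw=lambda: rng.normal(0.0, 3.0),
         proposal_logpdf=lambda t: -0.5 * (t / 3.0) ** 2 - np.log(3.0),
         simulate=lambda t: rng.normal(t, 1.0, 10),
         summary=np.mean)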
Errors, errors, and errors

   Three levels of approximation
       π(θ|yobs ) by π(θ|sobs ) loss of information
                                                                       [ignored]
       π(θ|sobs) by

            πABC(θ|sobs) = ∫ π(s) K[{s − sobs}/h] π(θ|s) ds / ∫ π(s) K[{s − sobs}/h] ds

       noisy observations
       πABC(θ|sobs) by importance Monte Carlo based on N
       simulations, represented by var(a(θ)|sobs)/Nacc [expected
       number of acceptances]
                                                      [M. Twain/B. Disraeli]
Average acceptance asymptotics



    For the average acceptance probability/approximate likelihood

           p(θ|sobs) = ∫ f(ysim|θ) K[{S(ysim) − sobs}/h] dysim ,

    overall acceptance probability

           p(sobs) = ∫ p(θ|sobs) π(θ) dθ = π(sobs) h^d + o(h^d)

                                                           [FP, Lemma 1]
Optimal importance proposal



   Best choice of importance proposal in terms of effective sample size

                      g(θ|sobs) ∝ π(θ) p(θ|sobs)^{1/2}

                                   [Not particularly useful in practice]

       note that p(θ|sobs ) is an approximate likelihood
       reminiscent of parallel tempering
       could be approximately achieved by attrition of half of the
       data
Calibration of h


        “This result gives insight into how S(·) and h affect the Monte
        Carlo error. To minimize Monte Carlo error, we need h^d to be not
        too small. Thus ideally we want S(·) to be a low dimensional
        summary of the data that is sufficiently informative about θ that
        π(θ|sobs) is close, in some sense, to π(θ|yobs)” (FP, p.5)

        turns h into an absolute value while it should be
        context-dependent and user-calibrated
        only addresses one term in the approximation error and
        acceptance probability (“curse of dimensionality”)
        h large prevents πABC(θ|sobs) from being close to π(θ|sobs)
        d small prevents π(θ|sobs) from being close to π(θ|yobs) (“curse of
        [dis]information”)
Calibrating ABC


       “If πABC is calibrated, then this means that probability statements
       that are derived from it are appropriate, and in particular that we
       can use πABC to quantify uncertainty in estimates” (FP, p.5)


    Definition
    For 0 < q < 1 and subset A, event Eq(A) made of sobs such that
    PrABC(θ ∈ A|sobs) = q. Then ABC is calibrated if

                             Pr(θ ∈ A|Eq(A)) = q


        unclear meaning of conditioning on Eq(A)
Calibrated ABC




   Theorem (FP)
    Noisy ABC, where

                    sobs = S(yobs) + h ε ,    ε ∼ K(·)

    is calibrated
                                                       [Wilkinson, 2008]
                       no condition on h!!
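
    For concreteness, a small Python sketch of noisy ABC: the observed summary is jittered once by
    hε before any simulation, and any standard ABC routine is then run on the jittered value (the
    Gaussian K, the rejection routine and all names are assumptions of the sketch):

      import numpy as np

      rng = np.random.default_rng(8)

      def noisy_abc(s_obs, h, abc_run):
          """Noisy ABC: replace s_obs by s_obs + h*eps with eps ~ K (Gaussian here),
          then run a standard ABC routine on the jittered summary."""
          eps = rng.normal()             # one draw from the kernel K, up front
          s_noisy = s_obs + h * eps
          return abc_run(s_noisy)

      # toy usage with a plain rejection ABC on the jittered summary
      def abc_run(s_target, n=500, h=0.2):
          out = []
          while len(out) < n:
              theta = rng.normal(0.0, 2.0)                    # prior N(0, 4)
              s_sim = np.mean(rng.normal(theta, 1.0, 10))     # summary = sample mean
              if abs(s_sim - s_target) <= h:
                  out.append(theta)
          return np.array(out)

      draws = noisy_abc(s_obs=0.4, h=0.2, abc_run=abc_run)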
Calibrated ABC




   Consequence: when h = ∞

   Theorem (FP)
   The prior distribution is always calibrated

   is this a relevant property then?
More questions about calibrated ABC


      “Calibration is not universally accepted by Bayesians. It is even more
      questionable here as we care how statements we make relate to the
      real world, not to a mathematically defined posterior.” R. Wilkinson


      Same reluctance about the prior being calibrated
      Property depending on prior, likelihood, and summary
      Calibration is a frequentist property (almost a p-value!)
       More sensible to account for the simulator’s imperfections
       than using noisy-ABC against a meaningless ε-based measure
                                                            [Wilkinson, 2012]
Converging ABC



   Theorem (FP)
   For noisy ABC, the expected noisy-ABC log-likelihood,

     E{log[p(θ|sobs)]} = ∫∫ log[p(θ|S(yobs) + ε)] π(yobs|θ0) K(ε) dyobs dε ,

   has its maximum at θ = θ0.

   True for any choice of summary statistic? even ancillary statistics?!
                                    [Imposes at least identifiability...]
   Relevant in asymptotia and not for the data
Converging ABC



  Corollary
  For noisy ABC, the ABC posterior converges onto a point mass on
  the true parameter value as m → ∞.

   For standard ABC, not always the case (unless h goes to zero).
   Strength of regularity conditions (c1) and (c2) in Bernardo &
   Smith, 1994?
                 [out-of-reach constraints on likelihood and posterior]
  Again, there must be conditions imposed upon summary
  statistics...
Loss motivated statistic

   Under quadratic loss function,

    Theorem (FP)
     (i) The minimal posterior error E[L(θ, θ̂)|yobs] occurs when
         θ̂ = E(θ|yobs) (!)
    (ii) When h → 0, EABC(θ|sobs) converges to E(θ|yobs)
   (iii) If S(yobs) = E[θ|yobs] then for θ̂ = EABC[θ|sobs]

           E[L(θ, θ̂)|yobs] = trace(AΣ) + h² ∫ xᵀ A x K(x) dx + o(h²).

    measure-theoretic difficulties?
    dependence of sobs on h [inherent to noisy ABC] makes me
    uncomfortable
    Relevant for choice of K?
Optimal summary statistic

       “We take a different approach, and weaken the requirement for
       πABC to be a good approximation to π(θ|yobs ). We argue for πABC
       to be a good approximation solely in terms of the accuracy of
       certain estimates of the parameters.” (FP, p.5)

   From this result, FP
       derive their choice of summary statistic,

                                  S(y) = E(θ|y)

                                                               [almost sufficient]
       suggest

                  h = O(N^{−1/(2+d)}) and h = O(N^{−1/(4+d)})

       as optimal bandwidths for noisy and standard ABC.
Optimal summary statistic

       “We take a different approach, and weaken the requirement for
       πABC to be a good approximation to π(θ|yobs ). We argue for πABC
       to be a good approximation solely in terms of the accuracy of
       certain estimates of the parameters.” (FP, p.5)

   From this result, FP
       derive their choice of summary statistic,

                                  S(y) = E(θ|y)

                                           [wow! EABC [θ|S(yobs )] = E[θ|yobs ]]
       suggest

                  h = O(N^{−1/(2+d)}) and h = O(N^{−1/(4+d)})

       as optimal bandwidths for noisy and standard ABC.
Caveat



  Since E(θ|yobs ) is most usually unavailable, FP suggest
    (i) use a pilot run of ABC to determine a region of non-negligible
        posterior mass;
   (ii) simulate sets of parameter values and data;
  (iii) use the simulated sets of parameter values and data to
        estimate the summary statistic; and
   (iv) run ABC with this choice of summary statistic.
  where is the assessment of the first stage error?
Approximating the summary statistic




    As Beaumont et al. (2002) and Blum and François (2010), FP
    use a linear regression to approximate E(θ|yobs):

                  θ_i = β_0^{(i)} + β^{(i)} f(yobs) + ε_i

    with f made of standard transforms
                                                 [Further justifications?]
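
    A rough Python sketch of this construction of the summary statistic from a pilot run, using
    ordinary least squares on simulated (θ, y) pairs (the model, the transforms f and all names
    are assumptions of the sketch):

      import numpy as np

      rng = np.random.default_rng(9)

      def fit_fp_summary(n_pilot, prior_draw, simulate, transforms):
          """Regress simulated theta's on transforms of the simulated data to
          approximate E(theta | y); the fitted linear predictor is then used as
          the (one-dimensional) summary statistic S(y) in a second ABC run."""
          thetas = np.array([prior_draw() for _ in range(n_pilot)])
          X = np.array([[g(simulate(t)) for g in transforms] for t in thetas])
          X = np.column_stack([np.ones(n_pilot), X])          # intercept beta_0
          beta, *_ = np.linalg.lstsq(X, thetas, rcond=None)   # least-squares fit
          def summary(y):
              return beta[0] + np.dot(beta[1:], [g(y) for g in transforms])
          return summary

      # toy usage: theta ~ N(0, 2^2), y | theta of length 20, transforms = (mean, median)
      summary = fit_fp_summary(
          2000,
          prior_draw=lambda: rng.normal(0.0, 2.0),
          simulate=lambda t: rng.normal(t, 1.0, 20),
          transforms=[np.mean, np.median])
      s_obs = summary(rng.normal(0.5, 1.0, 20))   # summary of a toy observed data set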
[my]questions about semi-automatic ABC

      dependence on h and S(·) in the early stage
      reduction of Bayesian inference to point estimation
      approximation error in step (i) not accounted for
      not parameterisation invariant
      practice shows that proper approximation to genuine posterior
      distributions stems from using a (much) larger number of
      summary statistics than the dimension of the parameter
      the validity of the approximation to the optimal summary
      statistic depends on the quality of the pilot run
      important inferential issues like model choice are not covered
      by this approach.
                                                      [Robert, 2012]
More about semi-automatic ABC

                              [End of section derived from comments on Read Paper, just appeared]

      “The apparently arbitrary nature of the choice of summary statistics
      has always been perceived as the Achilles heel of ABC.” M.
      Beaumont

      “Curse of dimensionality” linked with the increase of the
      dimension of the summary statistic
      Connection with principal component analysis
                                                                            [Itan et al., 2010]
       Connection with partial least squares
                                                     [Wegmann et al., 2009]
      Beaumont et al. (2002) postprocessed output is used as input
      by FP to run a second ABC
Wood’s alternative

    Instead of a non-parametric kernel approximation to the likelihood

                  (1/R) Σ_r K{η(y_r) − η(yobs)}

    Wood (2010) suggests a normal approximation

                  η(y(θ)) ∼ N_d(µ_θ, Σ_θ)

    whose parameters can be approximated based on the R simulations
    (for each value of θ).
        Parametric versus non-parametric rate [Uh?!]
        Automatic weighting of components of η(·) through Σ_θ
        Dependence on normality assumption (pseudo-likelihood?)
                                  [Cornebise, Girolami & Kosmidis, 2012]
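
    A short Python sketch of this synthetic (normal) likelihood for a vector of summaries (the toy
    simulator and summaries are assumptions of the example):

      import numpy as np

      rng = np.random.default_rng(10)

      def synthetic_loglik(theta, eta_obs, simulate, summaries, R=200):
          """Wood (2010) synthetic likelihood: fit a d-variate normal to the R simulated
          summary vectors and evaluate its log-density at the observed summaries."""
          sims = np.array([[eta(simulate(theta)) for eta in summaries] for _ in range(R)])
          mu = sims.mean(axis=0)
          cov = np.cov(sims, rowvar=False) + 1e-8 * np.eye(len(summaries))  # regularise
          diff = np.asarray(eta_obs) - mu
          sign, logdet = np.linalg.slogdet(cov)
          quad = diff @ np.linalg.solve(cov, diff)
          return -0.5 * (quad + logdet + len(summaries) * np.log(2 * np.pi))

      # toy usage: data of length 50 from N(theta, 1), summaries = (mean, variance)
      simulate = lambda t: rng.normal(t, 1.0, 50)
      summaries = [np.mean, np.var]
      y_obs = rng.normal(0.3, 1.0, 50)
      eta_obs = [eta(y_obs) for eta in summaries]
      ll = synthetic_loglik(0.0, eta_obs, simulate, summaries)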
Reinterpretation and extensions

    Reinterpretation of ABC output as joint simulation from

                  π̄(x, y|θ) = f(x|θ) π̄_{Y|X}(y|x)

    where
                  π̄_{Y|X}(y|x) = K_ε(y − x)

    Reinterpretation of noisy ABC
    if ȳ|yobs ∼ π̄_{Y|X}(·|yobs), then marginally

                  ȳ ∼ π̄_{Y|θ}(·|θ0)

    which explains the consistency of Bayesian inference based on ȳ and π̄
                                         [Lee, Andrieu & Doucet, 2012]
ABC for Markov chains


    Rewriting the posterior as

                  π(θ)^{1−n} π(θ|x_1) ∏_{t=2}^n π(θ|x_{t−1}, x_t)

    where π(θ|x_{t−1}, x_t) ∝ f(x_t|x_{t−1}, θ) π(θ)
         Allows for a stepwise ABC, replacing each π(θ|x_{t−1}, x_t) by an
         ABC approximation (see the sketch below)
         Similarity with FP’s multiple sources of data (and also with
         Dean et al., 2011)

                                                   [White et al., 2010, 2012]
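
    A schematic Python fragment of such a stepwise ABC, where each transition factor
    π(θ|x_{t−1}, x_t) is approximated by rejection on the single transition (the AR(1)-type
    simulator, the prior and all names are assumptions of the sketch, and combining the per-step
    clouds into a posterior still requires the π(θ)^{1−n} correction above):

      import numpy as np

      rng = np.random.default_rng(11)

      def stepwise_abc(x_obs, eps, n_keep=200):
          """Stepwise ABC for a Markov chain: one rejection-ABC pass per observed
          transition (x_{t-1}, x_t), each producing a cloud approximating
          pi(theta | x_{t-1}, x_t)."""
          clouds = []
          for t in range(1, len(x_obs)):
              kept = []
              while len(kept) < n_keep:
                  theta = rng.uniform(-1.0, 1.0)                   # prior on the AR coefficient
                  x_sim = theta * x_obs[t - 1] + rng.normal()      # simulate one transition
                  if abs(x_sim - x_obs[t]) <= eps:
                      kept.append(theta)
              clouds.append(np.array(kept))
          return clouds

      # toy usage: an AR(1) path with true coefficient 0.7
      x = [0.0]
      for _ in range(30):
          x.append(0.7 * x[-1] + rng.normal())
      clouds = stepwise_abc(np.array(x), eps=0.5)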

A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 

Kürzlich hochgeladen (20)

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

Approximate Bayesian Computation: How Bayesian Can It Be

  • 1. Approximate Bayesian Computation: how Bayesian can it be? Christian P. Robert ISBA LECTURES ON BAYESIAN FOUNDATIONS ISBA 2012, Kyoto, Japan Universit´ Paris-Dauphine, IuF, & CREST e http://www.ceremade.dauphine.fr/~xian June 21, 2012
  • 2. This talk is dedicated to the memory of our dear friend and fellow Bayesian, George Casella, 1951–2012
  • 4. Approximate Bayesian computation Introduction Monte Carlo basics simulation-based methods in Econometrics Approximate Bayesian computation ABC as an inference machine
  • 5. General issue Given a density π known up to a normalizing constant, e.g. a posterior density, and an integrable function h, how can one use h(x)˜ (x)µ(dx) π Ih = h(x)π(x)µ(dx) = π (x)µ(dx) ˜ when h(x)˜ (x)µ(dx) is intractable? π
  • 6. Monte Carlo basics Generate an iid sample x1 , . . . , xN from π and estimate Ih by N Imc (h) = N −1 ˆ N h(xi ). i=1 ˆ as since [LLN] IMC (h) −→ Ih N Furthermore, if Ih2 = h2 (x)π(x)µ(dx) < ∞, √ L [CLT] ˆ N IMC (h) − Ih N N 0, I [h − Ih ]2 . Caveat Often impossible or inefficient to simulate directly from π
  • 7. Monte Carlo basics Generate an iid sample x1 , . . . , xN from π and estimate Ih by N Imc (h) = N −1 ˆ N h(xi ). i=1 ˆ as since [LLN] IMC (h) −→ Ih N Furthermore, if Ih2 = h2 (x)π(x)µ(dx) < ∞, √ L [CLT] ˆ N IMC (h) − Ih N N 0, I [h − Ih ]2 . Caveat Often impossible or inefficient to simulate directly from π
  • 8. Importance Sampling For Q proposal distribution with density q Ih = h(x){π/q}(x)q(x)µ(dx). Principle Generate an iid sample x1 , . . . , xN ∼ Q and estimate Ih by N IIS (h) = N −1 ˆ Q,N h(xi ){π/q}(xi ). i=1
  • 9. Importance Sampling For Q proposal distribution with density q Ih = h(x){π/q}(x)q(x)µ(dx). Principle Generate an iid sample x1 , . . . , xN ∼ Q and estimate Ih by N IIS (h) = N −1 ˆ Q,N h(xi ){π/q}(xi ). i=1
  • 10. Importance Sampling (convergence) Then ˆ as [LLN] IIS (h) −→ Ih Q,N and if Q((hπ/q)2 ) < ∞, √ L [CLT] ˆ N(IIS (h) − Ih ) Q,N N 0, Q{(hπ/q − Ih )2 } . Caveat ˆ If normalizing constant unknown, impossible to use IIS (h) Q,N Generic problem in Bayesian Statistics: π(θ|x) ∝ f (x|θ)π(θ).
  • 11. Importance Sampling (convergence) Then ˆ as [LLN] IIS (h) −→ Ih Q,N and if Q((hπ/q)2 ) < ∞, √ L [CLT] ˆ N(IIS (h) − Ih ) Q,N N 0, Q{(hπ/q − Ih )2 } . Caveat ˆ If normalizing constant unknown, impossible to use IIS (h) Q,N Generic problem in Bayesian Statistics: π(θ|x) ∝ f (x|θ)π(θ).
  • 12. Self-normalised importance Sampling Self normalized version N −1 N ˆ ISNIS (h) = {π/q}(xi ) h(xi ){π/q}(xi ). Q,N i=1 i=1 ˆ as [LLN] ISNIS (h) −→ Ih Q,N and if I((1+h2 )(π/q)) < ∞, √ L [CLT] ˆ N(ISNIS (h) − Ih ) Q,N N 0, π {(π/q)(h − Ih }2 ) . Caveat ˆ If π cannot be computed, impossible to use ISNIS (h) Q,N
  • 13. Self-normalised importance Sampling Self normalized version N −1 N ˆ ISNIS (h) = {π/q}(xi ) h(xi ){π/q}(xi ). Q,N i=1 i=1 ˆ as [LLN] ISNIS (h) −→ Ih Q,N and if I((1+h2 )(π/q)) < ∞, √ L [CLT] ˆ N(ISNIS (h) − Ih ) Q,N N 0, π {(π/q)(h − Ih }2 ) . Caveat ˆ If π cannot be computed, impossible to use ISNIS (h) Q,N
  • 14. Self-normalised importance Sampling Self normalized version N −1 N ˆ ISNIS (h) = {π/q}(xi ) h(xi ){π/q}(xi ). Q,N i=1 i=1 ˆ as [LLN] ISNIS (h) −→ Ih Q,N and if I((1+h2 )(π/q)) < ∞, √ L [CLT] ˆ N(ISNIS (h) − Ih ) Q,N N 0, π {(π/q)(h − Ih }2 ) . Caveat ˆ If π cannot be computed, impossible to use ISNIS (h) Q,N
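To make the self-normalised importance sampling estimator of slides 12–14 concrete, here is a minimal R sketch on an illustrative target known only up to a constant (the target, the Student-t proposal and the integrand h(x) = x² are assumptions made for the example, not taken from the slides):

set.seed(1)
N <- 1e5
log_pi_tilde <- function(x) -x^2 / 2 + log(1 + sin(3 * x)^2)  # unnormalised log target
x <- rt(N, df = 3)                                            # draws from proposal q
logw <- log_pi_tilde(x) - dt(x, df = 3, log = TRUE)           # log importance ratios pi_tilde/q
w <- exp(logw - max(logw))                                    # stabilised (unnormalised) weights
h <- function(x) x^2                                          # integrand of interest
snis <- sum(w * h(x)) / sum(w)                                 # self-normalised estimate of I_h
ess <- sum(w)^2 / sum(w^2)                                     # effective sample size
c(estimate = snis, ESS = ess)

The normalising constant of the target cancels in the ratio, which is exactly why the self-normalised version remains usable when π is only known up to a constant.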
  • 15. Econom’ections Model choice Similar exploration of simulation-based and approximation techniques in Econometrics Simulated method of moments Method of simulated moments Simulated pseudo-maximum-likelihood Indirect inference [Gouri´roux & Monfort, 1996] e
  • 16. Simulated method of moments o Given observations y1:n from a model yt = r (y1:(t−1) , t , θ) , t ∼ g (·) 1. simulate 1:n , derive yt (θ) = r (y1:(t−1) , t , θ) 2. and estimate θ by n arg min (yto − yt (θ))2 θ t=1
  • 17. Simulated method of moments o Given observations y1:n from a model yt = r (y1:(t−1) , t , θ) , t ∼ g (·) 1. simulate 1:n , derive yt (θ) = r (y1:(t−1) , t , θ) 2. and estimate θ by n n 2 arg min yto − yt (θ) θ t=1 t=1
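A minimal sketch of the simulated method of moments criterion on slide 16, assuming the illustrative model y_t = θ y_{t−1} + ε_t (the slide keeps r(·) generic); the shocks are drawn once and reused across values of θ so that the criterion is smooth in θ:

set.seed(2)
n <- 200
theta_true <- 0.6
simulate_path <- function(theta, eps) {
  y <- numeric(length(eps))
  for (t in 2:length(eps)) y[t] <- theta * y[t - 1] + eps[t]
  y
}
y_obs <- simulate_path(theta_true, rnorm(n))   # "observed" series
eps_sim <- rnorm(n)                            # common random numbers, fixed across theta
crit <- function(theta) sum((y_obs - simulate_path(theta, eps_sim))^2)
theta_hat <- optimize(crit, interval = c(-0.99, 0.99))$minimum
theta_hat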
  • 18. Method of simulated moments Given a statistic vector K (y ) with Eθ [K (Yt )|y1:(t−1) ] = k(y1:(t−1) ; θ) find an unbiased estimator of k(y1:(t−1) ; θ), ˜ k( t , y1:(t−1) ; θ) Estimate θ by n S arg min K (yt ) − ˜ t k( s , y1:(t−1) ; θ)/S θ t=1 s=1 [Pakes & Pollard, 1989]
  • 19. Indirect inference ˆ Minimise [in θ] a distance between estimators β based on a pseudo-model for genuine observations and for observations simulated under the true model and the parameter θ. [Gouri´roux, Monfort, & Renault, 1993; e Smith, 1993; Gallant & Tauchen, 1996]
  • 20. Indirect inference (PML vs. PSE) Example of the pseudo-maximum-likelihood (PML) ˆ β(y) = arg max log f (yt |β, y1:(t−1) ) β t leading to ˆ ˆ arg min ||β(yo ) − β(y1 (θ), . . . , yS (θ))||2 θ when ys (θ) ∼ f (y|θ) s = 1, . . . , S
  • 21. Indirect inference (PML vs. PSE) Example of the pseudo-score-estimator (PSE) 2 ˆ ∂ log f β(y) = arg min (yt |β, y1:(t−1) ) β t ∂β leading to ˆ ˆ arg min ||β(yo ) − β(y1 (θ), . . . , yS (θ))||2 θ when ys (θ) ∼ f (y|θ) s = 1, . . . , S
  • 22. Consistent indirect inference ...in order to get a unique solution the dimension of the auxiliary parameter β must be larger than or equal to the dimension of the initial parameter θ. If the problem is just identified the different methods become easier... Consistency depending on the criterion and on the asymptotic identifiability of θ [Gouri´roux, Monfort, 1996, p. 66] e
  • 23. Consistent indirect inference ...in order to get a unique solution the dimension of the auxiliary parameter β must be larger than or equal to the dimension of the initial parameter θ. If the problem is just identified the different methods become easier... Consistency depending on the criterion and on the asymptotic identifiability of θ [Gouri´roux, Monfort, 1996, p. 66] e
  • 24. Choice of pseudo-model Pick model such that 1. β̂(θ) not flat (i.e. sensitive to changes in θ) 2. β̂(θ) not dispersed (i.e. robust against changes in ys (θ)) [Frigessi & Heggland, 2004]
  • 25. Empirical likelihood Another approximation method (not yet related with simulation) Definition For dataset y = (y1 , . . . , yn ), and parameter of interest θ, pick constraints E[h(Y , θ)] = 0 uniquely identifying θ and define the empirical likelihood as n Lel (θ|y) = max pi p i=1 for p in the set {p ∈ [0; 1]n , pi = 1, i pi h(yi , θ) = 0}. [Owen, 1988]
  • 26. Empirical likelihood Another approximation method (not yet related with simulation) Example When θ = Ef [Y ], empirical likelihood is the maximum of p1 · · · pn under constraint p1 y1 + . . . + pn yn = θ
  • 27. ABCel Another approximation method (now related with simulation!) Importance sampling implementation Algorithm 1: Raw ABCel sampler Given observation y for i = 1 to M do Generate θ i from the prior distribution π(·) Set the weight ωi = Lel (θ i |y) end for Proceed with pairs (θi , ωi ) as in regular importance sampling [Mengersen, Pudlo & Robert, 2012]
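A minimal R sketch of the raw ABCel sampler in the mean-constraint case h(y, θ) = y − θ of slide 26, where the empirical-likelihood weight is obtained through the usual dual representation p_i = 1/(n(1 + λ(y_i − θ))); the data, the prior and all names below are illustrative assumptions:

set.seed(3)
y <- rnorm(50, mean = 1.5)                         # observed sample
n <- length(y)
log_el <- function(theta) {
  u <- y - theta
  if (min(u) >= 0 || max(u) <= 0) return(-Inf)     # theta outside data range: EL is zero
  g <- function(lam) sum(u / (1 + lam * u))        # dual first-order condition
  lo <- -1 / max(u) + 1e-8
  hi <- -1 / min(u) - 1e-8
  lam <- uniroot(g, c(lo, hi))$root
  sum(log(1 / (n * (1 + lam * u))))                # log of prod p_i
}
M <- 1e4
theta <- rnorm(M, 0, 5)                            # draws from an illustrative prior
logw <- vapply(theta, log_el, numeric(1))
w <- exp(logw - max(logw[is.finite(logw)]))
sum(w * theta) / sum(w)                            # weighted posterior mean, as in regular IS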
  • 28. A?B?C? A stands for approximate [wrong likelihood] B stands for Bayesian C stands for computation [producing a parameter sample]
  • 29. A?B?C? A stands for approximate [wrong likelihood] B stands for Bayesian C stands for computation [producing a parameter sample]
  • 30. A?B?C? A stands for approximate [wrong likelihood] B stands for Bayesian C stands for computation [producing a parameter sample]
  • 31. How much Bayesian? asymptotically so (meaningful?) approximation error unknown (w/o simulation) pragmatic Bayes (there is no other solution!)
  • 32. Approximate Bayesian computation Introduction Approximate Bayesian computation Genetics of ABC ABC basics Advances and interpretations Alphabet (compu-)soup ABC as an inference machine
  • 33. Genetic background of ABC ABC is a recent computational technique that only requires being able to sample from the likelihood f (·|θ) This technique stemmed from population genetics models, about 15 years ago, and population geneticists still contribute significantly to methodological developments of ABC. [Griffith & al., 1997; Tavar´ & al., 1999] e
  • 34. Demo-genetic inference Each model is characterized by a set of parameters θ that cover historical (time divergence, admixture time, ...), demographic (population sizes, admixture rates, migration rates, ...) and genetic (mutation rate, ...) factors The goal is to estimate these parameters from a dataset of polymorphism (DNA sample) y observed at the present time Problem: most of the time, we cannot calculate the likelihood of the polymorphism data f (y|θ).
  • 35. Demo-genetic inference Each model is characterized by a set of parameters θ that cover historical (time divergence, admixture time, ...), demographic (population sizes, admixture rates, migration rates, ...) and genetic (mutation rate, ...) factors The goal is to estimate these parameters from a dataset of polymorphism (DNA sample) y observed at the present time Problem: most of the time, we cannot calculate the likelihood of the polymorphism data f (y|θ).
  • 36. A genuine example of application [slide in French: Approximate Bayesian Computation (ABC), an example of application to the Pygmies] Pygmies populations: do they have a common origin? Was there a lot of exchanges between pygmies and non-pygmies populations?
  • 37. Alternative scenarios [slide in French: different possible scenarios, with the scenario chosen by ABC] Verdu et al. 2009
  • 38. Intractable likelihood Missing (too missing!) data structure: f (y|θ) = ∫_G f (y|G, θ) f (G|θ) dG cannot be computed in a manageable way... The genealogies are considered as nuisance parameters. This modelling clearly differs from the phylogenetic perspective, where the tree is the parameter of interest.
  • 39. Intractable likelihood Missing (too missing!) data structure: f (y|θ) = ∫_G f (y|G, θ) f (G|θ) dG cannot be computed in a manageable way... The genealogies are considered as nuisance parameters. This modelling clearly differs from the phylogenetic perspective, where the tree is the parameter of interest.
  • 40. Intractable likelihood So, what can we do when the likelihood function f (y|θ) is well-defined but impossible / too costly to compute...? MCMC cannot be implemented! Shall we give up Bayesian inference altogether?! Or settle for an almost Bayesian inference...?
  • 41. Intractable likelihood So, what can we do when the likelihood function f (y|θ) is well-defined but impossible / too costly to compute...? MCMC cannot be implemented! Shall we give up Bayesian inference altogether?! Or settle for an almost Bayesian inference...?
  • 42. ABC methodology Bayesian setting: target is π(θ)f (x|θ) When the likelihood f (x|θ) is not in closed form, likelihood-free rejection technique: Foundation For an observation y ∼ f (y|θ), under the prior π(θ), if one keeps jointly simulating θ′ ∼ π(θ), z ∼ f (z|θ′), until the auxiliary variable z is equal to the observed value, z = y, then the selected θ′ ∼ π(θ|y) [Rubin, 1984; Diggle & Gratton, 1984; Tavaré et al., 1997]
  • 43. ABC methodology Bayesian setting: target is π(θ)f (x|θ) When the likelihood f (x|θ) is not in closed form, likelihood-free rejection technique: Foundation For an observation y ∼ f (y|θ), under the prior π(θ), if one keeps jointly simulating θ′ ∼ π(θ), z ∼ f (z|θ′), until the auxiliary variable z is equal to the observed value, z = y, then the selected θ′ ∼ π(θ|y) [Rubin, 1984; Diggle & Gratton, 1984; Tavaré et al., 1997]
  • 44. ABC methodology Bayesian setting: target is π(θ)f (x|θ) When the likelihood f (x|θ) is not in closed form, likelihood-free rejection technique: Foundation For an observation y ∼ f (y|θ), under the prior π(θ), if one keeps jointly simulating θ′ ∼ π(θ), z ∼ f (z|θ′), until the auxiliary variable z is equal to the observed value, z = y, then the selected θ′ ∼ π(θ|y) [Rubin, 1984; Diggle & Gratton, 1984; Tavaré et al., 1997]
  • 45. Why does it work?! The proof is trivial: f (θi ) ∝ π(θi )f (z|θi )Iy (z) z∈D ∝ π(θi )f (y|θi ) = π(θi |y) . [Accept–Reject 101]
  • 46. A as A...pproximative When y is a continuous random variable, strict equality z = y is replaced with a tolerance zone ρ(y, z) ≤ ε, where ρ is a distance Output distributed from π(θ) Pθ {ρ(y, z) < ε}, which is by definition proportional to π(θ | ρ(y, z) < ε) [Pritchard et al., 1999]
  • 47. A as A...pproximative When y is a continuous random variable, strict equality z = y is replaced with a tolerance zone ρ(y, z) ≤ ε, where ρ is a distance Output distributed from π(θ) Pθ {ρ(y, z) < ε}, which is by definition proportional to π(θ | ρ(y, z) < ε) [Pritchard et al., 1999]
  • 48. ABC algorithm In most implementations, a further degree of A...pproximation: Algorithm 1 Likelihood-free rejection sampler: for i = 1 to N do; repeat: generate θ′ from the prior distribution π(·), generate z from the likelihood f (·|θ′); until ρ{η(z), η(y)} ≤ ε; set θi = θ′; end for — where η(y) defines a (not necessarily sufficient) statistic
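A minimal R sketch of Algorithm 1 on an illustrative normal model y_1, ..., y_n ∼ N(µ, 1), with prior µ ∼ N(0, 10²), summary η = sample mean, distance ρ = absolute difference and tolerance set as an empirical quantile of the simulated distances; the model and all settings are assumptions made for the example:

set.seed(4)
n <- 30
y_obs <- rnorm(n, mean = 2)
eta_obs <- mean(y_obs)
M <- 1e5
mu <- rnorm(M, 0, 10)                               # theta' ~ prior
eta_sim <- vapply(mu, function(m) mean(rnorm(n, m)), numeric(1))  # eta(z), z ~ f(.|theta')
rho <- abs(eta_sim - eta_obs)
eps <- quantile(rho, 0.001)                          # keep the 0.1% closest simulations
abc_sample <- mu[rho <= eps]
c(posterior_mean = mean(abc_sample), n_kept = length(abc_sample))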
  • 49. Output The likelihood-free algorithm samples from the marginal in z of: π(θ)f (z|θ)IA ,y (z) π (θ, z|y) = , A ,y ×Θ π(θ)f (z|θ)dzdθ where A ,y = {z ∈ D|ρ(η(z), η(y)) < }. The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution: π (θ|y) = π (θ, z|y)dz ≈ π(θ|y) . ...does it?!
  • 50. Output The likelihood-free algorithm samples from the marginal in z of: π(θ)f (z|θ)IA ,y (z) π (θ, z|y) = , A ,y ×Θ π(θ)f (z|θ)dzdθ where A ,y = {z ∈ D|ρ(η(z), η(y)) < }. The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution: π (θ|y) = π (θ, z|y)dz ≈ π(θ|y) . ...does it?!
  • 51. Output The likelihood-free algorithm samples from the marginal in z of: π(θ)f (z|θ)IA ,y (z) π (θ, z|y) = , A ,y ×Θ π(θ)f (z|θ)dzdθ where A ,y = {z ∈ D|ρ(η(z), η(y)) < }. The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution: π (θ|y) = π (θ, z|y)dz ≈ π(θ|y) . ...does it?!
  • 52. Convergence of ABC (first attempt) What happens when → 0? If f (·|θ) is continuous in y , uniformly in θ [!], given an arbitrary δ > 0, there exists 0 such that < 0 implies
  • 53. Convergence of ABC (first attempt) What happens when → 0? If f (·|θ) is continuous in y , uniformly in θ [!], given an arbitrary δ > 0, there exists 0 such that < 0 implies π(θ) f (z|θ)IA ,y (z) dz π(θ)f (y|θ)(1 δ)µ(B ) ∈ A ,y ×Θ π(θ)f (z|θ)dzdθ Θ π(θ)f (y|θ)dθ(1 ± δ)µ(B )
  • 54. Convergence of ABC (first attempt) What happens when → 0? If f (·|θ) is continuous in y , uniformly in θ [!], given an arbitrary δ > 0, there exists 0 such that < 0 implies π(θ) f (z|θ)IA ,y (z) dz π(θ)f (y|θ)(1 δ) XX ) µ(BX ∈ A ,y ×Θ π(θ)f (z|θ)dzdθ Θ π(θ)f (y|θ)dθ(1 ± δ) X µ(B ) X X
  • 55. Convergence of ABC (first attempt) What happens when → 0? If f (·|θ) is continuous in y , uniformly in θ [!], given an arbitrary δ 0, there exists 0 such that 0 implies π(θ) f (z|θ)IA ,y (z) dz π(θ)f (y|θ)(1 δ) XX ) µ(BX ∈ A ,y ×Θ π(θ)f (z|θ)dzdθ Θ π(θ)f (y|θ)dθ(1 ± δ) X µ(B ) X X [Proof extends to other continuous-in-0 kernels K ]
  • 56. Convergence of ABC (second attempt) What happens when → 0? For B ⊂ Θ, we have A f (z|θ)dz f (z|θ)π(θ)dθ ,y B π(θ)dθ = dz B A ,y ×Θ π(θ)f (z|θ)dzdθ A ,y A ,y ×Θ π(θ)f (z|θ)dzdθ B f (z|θ)π(θ)dθ m(z) = dz A ,y m(z) A ,y ×Θ π(θ)f (z|θ)dzdθ m(z) = π(B|z) dz A ,y A ,y ×Θ π(θ)f (z|θ)dzdθ which indicates convergence for a continuous π(B|z).
  • 57. Convergence of ABC (second attempt) What happens when → 0? For B ⊂ Θ, we have A f (z|θ)dz f (z|θ)π(θ)dθ ,y B π(θ)dθ = dz B A ,y ×Θ π(θ)f (z|θ)dzdθ A ,y A ,y ×Θ π(θ)f (z|θ)dzdθ B f (z|θ)π(θ)dθ m(z) = dz A ,y m(z) A ,y ×Θ π(θ)f (z|θ)dzdθ m(z) = π(B|z) dz A ,y A ,y ×Θ π(θ)f (z|θ)dzdθ which indicates convergence for a continuous π(B|z).
  • 58. Convergence (do not attempt!) ...and the above does not apply to insufficient statistics: If η(y) is not a sufficient statistic, the best one can hope for is π(θ|η(y)), not π(θ|y) If η(y) is an ancillary statistic, the whole information contained in y is lost!, and the “best” one can hope for is π(θ|η(y)) = π(θ) Bummer!!!
  • 59. Convergence (do not attempt!) ...and the above does not apply to insufficient statistics: If η(y) is not a sufficient statistic, the best one can hope for is π(θ|η(y)), not π(θ|y) If η(y) is an ancillary statistic, the whole information contained in y is lost!, and the “best” one can hope for is π(θ|η(y)) = π(θ) Bummer!!!
  • 60. Convergence (do not attempt!) ...and the above does not apply to insufficient statistics: If η(y) is not a sufficient statistic, the best one can hope for is π(θ|η(y)), not π(θ|y) If η(y) is an ancillary statistic, the whole information contained in y is lost!, and the “best” one can hope for is π(θ|η(y)) = π(θ) Bummer!!!
  • 61. Convergence (do not attempt!) ...and the above does not apply to insufficient statistics: If η(y) is not a sufficient statistic, the best one can hope for is π(θ|η(y)), not π(θ|y) If η(y) is an ancillary statistic, the whole information contained in y is lost!, and the “best” one can hope for is π(θ|η(y)) = π(θ) Bummer!!!
  • 62. Probit modelling on Pima Indian women Example (R benchmark) 200 Pima Indian women with observed variables plasma glucose concentration in oral glucose tolerance test diastolic blood pressure diabetes pedigree function presence/absence of diabetes Probability of diabetes as function of variables P(y = 1|x) = Φ(x1 β1 + x2 β2 + x3 β3 ) , 200 observations of Pima.tr and g -prior modelling: β ∼ N3 (0, n XT X)−1 importance function inspired from MLE estimator distribution ˆ ˆ β ∼ N (β, Σ)
  • 63. Probit modelling on Pima Indian women Example (R benchmark) 200 Pima Indian women with observed variables plasma glucose concentration in oral glucose tolerance test diastolic blood pressure diabetes pedigree function presence/absence of diabetes Probability of diabetes as function of variables P(y = 1|x) = Φ(x1 β1 + x2 β2 + x3 β3 ) , 200 observations of Pima.tr and g -prior modelling: β ∼ N3 (0, n XT X)−1 importance function inspired from MLE estimator distribution ˆ ˆ β ∼ N (β, Σ)
  • 64. Probit modelling on Pima Indian women Example (R benchmark) 200 Pima Indian women with observed variables plasma glucose concentration in oral glucose tolerance test diastolic blood pressure diabetes pedigree function presence/absence of diabetes Probability of diabetes as function of variables P(y = 1|x) = Φ(x1 β1 + x2 β2 + x3 β3 ) , 200 observations of Pima.tr and g -prior modelling: β ∼ N3 (0, n XT X)−1 importance function inspired from MLE estimator distribution ˆ ˆ β ∼ N (β, Σ)
  • 65. Probit modelling on Pima Indian women Example (R benchmark) 200 Pima Indian women with observed variables plasma glucose concentration in oral glucose tolerance test diastolic blood pressure diabetes pedigree function presence/absence of diabetes Probability of diabetes as function of variables P(y = 1|x) = Φ(x1 β1 + x2 β2 + x3 β3 ) , 200 observations of Pima.tr and g -prior modelling: β ∼ N3 (0, n XT X)−1 importance function inspired from MLE estimator distribution ˆ ˆ β ∼ N (β, Σ)
  • 66. Pima Indian benchmark Figure: Comparison between density estimates of the marginals on β1 (left), β2 (center) and β3 (right) from ABC rejection samples (red) and MCMC samples (black).
  • 67. MA example MA(q) model: x_t = ε_t + Σ_{i=1}^q ϑ_i ε_{t−i} Simple prior: uniform over the inverse [real and complex] roots of Q(u) = 1 − Σ_{i=1}^q ϑ_i u^i, under the identifiability conditions
  • 68. MA example MA(q) model: x_t = ε_t + Σ_{i=1}^q ϑ_i ε_{t−i} Simple prior: uniform prior over the identifiability zone, i.e. a triangle for MA(2)
  • 69. MA example (2) ABC algorithm thus made of 1. picking a new value (ϑ1, ϑ2) in the triangle 2. generating an iid sequence (ε_t)_{−q<t≤T} 3. producing a simulated series (x′_t)_{1≤t≤T} Distance: basic distance between the series ρ((x_t)_{1≤t≤T}, (x′_t)_{1≤t≤T}) = Σ_{t=1}^T (x_t − x′_t)², or distance between summary statistics like the q = 2 autocorrelations τ_j = Σ_{t=j+1}^T x_t x_{t−j}
  • 70. MA example (2) ABC algorithm thus made of 1. picking a new value (ϑ1, ϑ2) in the triangle 2. generating an iid sequence (ε_t)_{−q<t≤T} 3. producing a simulated series (x′_t)_{1≤t≤T} Distance: basic distance between the series ρ((x_t)_{1≤t≤T}, (x′_t)_{1≤t≤T}) = Σ_{t=1}^T (x_t − x′_t)², or distance between summary statistics like the q = 2 autocorrelations τ_j = Σ_{t=j+1}^T x_t x_{t−j}
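A minimal R sketch of the MA(2) ABC scheme just described, with a uniform prior over the invertibility triangle (one common parameterisation, assumed here) and the first two empirical autocovariances as summaries; all tuning constants are illustrative:

set.seed(5)
T_len <- 200
simulate_ma2 <- function(th1, th2, T_len) {
  eps <- rnorm(T_len + 2)
  eps[3:(T_len + 2)] + th1 * eps[2:(T_len + 1)] + th2 * eps[1:T_len]
}
acf2 <- function(x) c(sum(x[-1] * x[-length(x)]),                       # tau_1
                      sum(x[-(1:2)] * x[1:(length(x) - 2)]))            # tau_2
x_obs <- simulate_ma2(0.6, 0.2, T_len)
s_obs <- acf2(x_obs)
draw_prior <- function() {                       # rejection sampling on the triangle
  repeat {
    th <- c(runif(1, -2, 2), runif(1, -1, 1))
    if (th[2] > th[1] - 1 && th[2] > -th[1] - 1) return(th)
  }
}
M <- 2e4
theta <- t(replicate(M, draw_prior()))
rho <- apply(theta, 1, function(th)
  sqrt(sum((acf2(simulate_ma2(th[1], th[2], T_len)) - s_obs)^2)))
eps_tol <- quantile(rho, 0.01)
abc_post <- theta[rho <= eps_tol, ]              # kept (th1, th2) pairs
colMeans(abc_post)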
  • 71. Comparison of distance impact Evaluation of the tolerance on the ABC sample against both distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model
  • 72. Comparison of distance impact [figure: ABC posterior densities of θ1 and θ2] Evaluation of the tolerance on the ABC sample against both distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model
  • 73. Comparison of distance impact [figure: ABC posterior densities of θ1 and θ2] Evaluation of the tolerance on the ABC sample against both distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model
  • 74. ABC (simul’) advances how approximative is ABC? Simulating from the prior is often poor in efficiency Either modify the proposal distribution on θ to increase the density of x’s within the vicinity of y ... [Marjoram et al, 2003; Bortot et al., 2007, Sisson et al., 2007] ...or by viewing the problem as a conditional density estimation and by developing techniques to allow for larger [Beaumont et al., 2002] .....or even by including in the inferential framework [ABCµ ] [Ratmann et al., 2009]
  • 75. ABC (simul’) advances how approximative is ABC? Simulating from the prior is often poor in efficiency Either modify the proposal distribution on θ to increase the density of x’s within the vicinity of y ... [Marjoram et al, 2003; Bortot et al., 2007, Sisson et al., 2007] ...or by viewing the problem as a conditional density estimation and by developing techniques to allow for larger [Beaumont et al., 2002] .....or even by including in the inferential framework [ABCµ ] [Ratmann et al., 2009]
  • 76. ABC (simul’) advances how approximative is ABC? Simulating from the prior is often poor in efficiency Either modify the proposal distribution on θ to increase the density of x’s within the vicinity of y ... [Marjoram et al, 2003; Bortot et al., 2007, Sisson et al., 2007] ...or by viewing the problem as a conditional density estimation and by developing techniques to allow for larger [Beaumont et al., 2002] .....or even by including in the inferential framework [ABCµ ] [Ratmann et al., 2009]
  • 77. ABC (simul’) advances how approximative is ABC? Simulating from the prior is often poor in efficiency Either modify the proposal distribution on θ to increase the density of x’s within the vicinity of y ... [Marjoram et al, 2003; Bortot et al., 2007, Sisson et al., 2007] ...or by viewing the problem as a conditional density estimation and by developing techniques to allow for larger [Beaumont et al., 2002] .....or even by including in the inferential framework [ABCµ ] [Ratmann et al., 2009]
  • 78. ABC-NP Better usage of [prior] simulations by adjustement: instead of throwing away θ such that ρ(η(z), η(y)) , replace θ’s with locally regressed transforms (use with BIC) θ∗ = θ − {η(z) − η(y)}T β ˆ [Csill´ry et al., TEE, 2010] e ˆ where β is obtained by [NP] weighted least square regression on (η(z) − η(y)) with weights Kδ {ρ(η(z), η(y))} [Beaumont et al., 2002, Genetics]
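A minimal R sketch of the ABC-NP local-linear adjustment of Beaumont et al. (2002) on an illustrative normal-mean example: accepted θ's are corrected by the slope of a weighted regression of θ on η(z) − η(y), with Epanechnikov kernel weights (data, prior and bandwidth are assumptions made for the example):

set.seed(6)
n <- 30; y_obs <- rnorm(n, 2); eta_obs <- mean(y_obs)
M <- 1e5
theta <- rnorm(M, 0, 10)                                     # prior draws
eta_sim <- vapply(theta, function(m) mean(rnorm(n, m)), numeric(1))
d <- eta_sim - eta_obs
delta <- quantile(abs(d), 0.02)                              # kernel bandwidth
w <- pmax(1 - (d / delta)^2, 0)                              # Epanechnikov weights K_delta
keep <- w > 0
fit <- lm(theta[keep] ~ d[keep], weights = w[keep])          # weighted least squares
theta_adj <- theta[keep] - coef(fit)[2] * d[keep]            # theta* = theta - beta_hat {eta(z)-eta(y)}
c(raw = weighted.mean(theta[keep], w[keep]),
  adjusted = weighted.mean(theta_adj, w[keep]))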
  • 79. ABC-NP (regression) Also found in the subsequent literature, e.g. in Fearnhead-Prangle (2012) : weight directly simulation by Kδ {ρ(η(z(θ)), η(y))} or S 1 Kδ {ρ(η(zs (θ)), η(y))} S s=1 [consistent estimate of f (η|θ)] Curse of dimensionality: poor estimate when d = dim(η) is large...
  • 80. ABC-NP (regression) Also found in the subsequent literature, e.g. in Fearnhead-Prangle (2012) : weight directly simulation by Kδ {ρ(η(z(θ)), η(y))} or S 1 Kδ {ρ(η(zs (θ)), η(y))} S s=1 [consistent estimate of f (η|θ)] Curse of dimensionality: poor estimate when d = dim(η) is large...
  • 81. ABC-NP (density estimation) Use of the kernel weights Kδ {ρ(η(z(θ)), η(y))} leads to the NP estimate of the posterior expectation i θi Kδ {ρ(η(z(θi )), η(y))} i Kδ {ρ(η(z(θi )), η(y))} [Blum, JASA, 2010]
  • 82. ABC-NP (density estimation) Use of the kernel weights Kδ {ρ(η(z(θ)), η(y))} leads to the NP estimate of the posterior conditional density ˜ Kb (θi − θ)Kδ {ρ(η(z(θi )), η(y))} i i Kδ {ρ(η(z(θi )), η(y))} [Blum, JASA, 2010]
  • 83. ABC-NP (density estimations) Other versions incorporating regression adjustments ˜ Kb (θi∗ − θ)Kδ {ρ(η(z(θi )), η(y))} i i Kδ {ρ(η(z(θi )), η(y))} In all cases, error E[ˆ (θ|y)] − g (θ|y) = cb 2 + cδ 2 + OP (b 2 + δ 2 ) + OP (1/nδ d ) g c var(ˆ (θ|y)) = g (1 + oP (1)) nbδ d
  • 84. ABC-NP (density estimations) Other versions incorporating regression adjustments ˜ Kb (θi∗ − θ)Kδ {ρ(η(z(θi )), η(y))} i i Kδ {ρ(η(z(θi )), η(y))} In all cases, error E[ˆ (θ|y)] − g (θ|y) = cb 2 + cδ 2 + OP (b 2 + δ 2 ) + OP (1/nδ d ) g c var(ˆ (θ|y)) = g (1 + oP (1)) nbδ d [Blum, JASA, 2010]
  • 85. ABC-NP (density estimations) Other versions incorporating regression adjustments ˜ Kb (θi∗ − θ)Kδ {ρ(η(z(θi )), η(y))} i i Kδ {ρ(η(z(θi )), η(y))} In all cases, error E[ˆ (θ|y)] − g (θ|y) = cb 2 + cδ 2 + OP (b 2 + δ 2 ) + OP (1/nδ d ) g c var(ˆ (θ|y)) = g (1 + oP (1)) nbδ d [standard NP calculations]
  • 86. ABC-NCH Incorporating non-linearities and heteroscedasticities: θ* = m̂(η(y)) + [θ − m̂(η(z))] σ̂(η(y)) / σ̂(η(z)), where m̂(η) is estimated by non-linear regression (e.g., a neural network) and σ̂(η) is estimated by non-linear regression on the residuals log{θ_i − m̂(η_i)}² = log σ²(η_i) + ξ_i [Blum & François, 2009]
  • 87. ABC-NCH Incorporating non-linearities and heteroscedasticities: θ* = m̂(η(y)) + [θ − m̂(η(z))] σ̂(η(y)) / σ̂(η(z)), where m̂(η) is estimated by non-linear regression (e.g., a neural network) and σ̂(η) is estimated by non-linear regression on the residuals log{θ_i − m̂(η_i)}² = log σ²(η_i) + ξ_i [Blum & François, 2009]
  • 88. ABC-NCH (2) Why neural network? fights curse of dimensionality selects relevant summary statistics provides automated dimension reduction offers a model choice capability improves upon multinomial logistic [Blum Fran¸ois, 2009] c
  • 89. ABC-NCH (2) Why neural network? fights curse of dimensionality selects relevant summary statistics provides automated dimension reduction offers a model choice capability improves upon multinomial logistic [Blum Fran¸ois, 2009] c
  • 90. ABC-MCMC Markov chain (θ(t) ) created via the transition function  θ ∼ Kω (θ |θ(t) ) if x ∼ f (x|θ ) is such that x = y   π(θ )Kω (θ(t) |θ ) θ(t+1) = and u ∼ U(0, 1) ≤ π(θ(t) )Kω (θ |θ(t) ) ,   (t) θ otherwise, has the posterior π(θ|y ) as stationary distribution [Marjoram et al, 2003]
  • 91. ABC-MCMC Markov chain (θ(t) ) created via the transition function  θ ∼ Kω (θ |θ(t) ) if x ∼ f (x|θ ) is such that x = y   π(θ )Kω (θ(t) |θ ) θ(t+1) = and u ∼ U(0, 1) ≤ π(θ(t) )Kω (θ |θ(t) ) ,   (t) θ otherwise, has the posterior π(θ|y ) as stationary distribution [Marjoram et al, 2003]
  • 92. ABC-MCMC (2) Algorithm 2 Likelihood-free MCMC sampler Use Algorithm 1 to get (θ(0) , z(0) ) for t = 1 to N do Generate θ from Kω ·|θ(t−1) , Generate z from the likelihood f (·|θ ), Generate u from U[0,1] , π(θ )Kω (θ(t−1) |θ ) if u ≤ I π(θ(t−1) Kω (θ |θ(t−1) ) A ,y (z ) then set (θ (t) , z(t) ) = (θ , z ) else (θ(t) , z(t) )) = (θ(t−1) , z(t−1) ), end if end for
  • 93. Why does it work? Acceptance probability does not involve calculating the likelihood and π (θ , z |y) q(θ (t−1) |θ )f (z(t−1) |θ (t−1) ) × π (θ (t−1) , z(t−1) |y) q(θ |θ (t−1) )f (z |θ ) π(θ ) XX|θ IA ,y (z ) ) f (z X X = π(θ (t−1) ) f (z(t−1) |θ (t−1) ) IA ,y (z(t−1) ) q(θ (t−1) |θ ) f (z(t−1) |θ (t−1) ) × q(θ |θ (t−1) ) XX|θ ) f (z XX
  • 94. Why does it work? Acceptance probability does not involve calculating the likelihood and π (θ , z |y) q(θ (t−1) |θ )f (z(t−1) |θ (t−1) ) × π (θ (t−1) , z(t−1) |y) q(θ |θ (t−1) )f (z |θ ) π(θ ) XX|θ IA ,y (z ) ) f (z X X = hh ( π(θ (t−1) ) ((hhhhh) IA ,y (z(t−1) ) f (z(t−1) |θ (t−1) ((( h (( hh ( q(θ (t−1) |θ ) ((hhhhh) f (z(t−1) |θ (t−1) (( ((( h × q(θ |θ (t−1) ) XX|θ ) f (z XX
  • 95. Why does it work? Acceptance probability does not involve calculating the likelihood and π (θ , z |y) q(θ (t−1) |θ )f (z(t−1) |θ (t−1) ) × π (θ (t−1) , z(t−1) |y) q(θ |θ (t−1) )f (z |θ ) π(θ ) XX|θ IA ,y (z ) ) f (z X X = hh ( π(θ (t−1) ) ((hhhhh) X(t−1) ) f (z(t−1) |θ (t−1) IAXX (( X ((( h ,y (zX X hh ( q(θ (t−1) |θ ) ((hhhhh) f (z(t−1) |θ (t−1) (( ((( h × q(θ |θ (t−1) ) X|θ X ) f (z XX π(θ )q(θ (t−1) |θ ) = IA ,y (z ) π(θ (t−1) q(θ |θ (t−1) )
  • 96. A toy example Case of x ∼ 0.5 N (θ, 1) + 0.5 N (−θ, 1) under prior θ ∼ N (0, 10) ABC sampler: thetas=rnorm(N,sd=10); zed=sample(c(1,-1),N,rep=TRUE)*thetas+rnorm(N,sd=1); eps=quantile(abs(zed-x),.01); abc=thetas[abs(zed-x)<eps]
  • 97. A toy example Case of x ∼ 0.5 N (θ, 1) + 0.5 N (−θ, 1) under prior θ ∼ N (0, 10) ABC sampler: thetas=rnorm(N,sd=10); zed=sample(c(1,-1),N,rep=TRUE)*thetas+rnorm(N,sd=1); eps=quantile(abs(zed-x),.01); abc=thetas[abs(zed-x)<eps]
  • 98. A toy example Case of x ∼ 0.5 N (θ, 1) + 0.5 N (−θ, 1) under prior θ ∼ N (0, 10) ABC-MCMC sampler: metas=rep(0,N); metas[1]=rnorm(1,sd=10); zed=rep(0,N); zed[1]=x; for (t in 2:N){ metas[t]=rnorm(1,mean=metas[t-1],sd=5); zed[t]=rnorm(1,mean=(1-2*(runif(1)>.5))*metas[t],sd=1); if ((abs(zed[t]-x)>eps)||(runif(1)>dnorm(metas[t],sd=10)/dnorm(metas[t-1],sd=10))){ metas[t]=metas[t-1]; zed[t]=zed[t-1]} }
  • 100. A toy example — x = 2 [figure: ABC density approximation over θ]
  • 101. A toy example x = 10
  • 102. A toy example — x = 10 [figure: ABC density approximation over θ]
  • 103. A toy example x = 50
  • 104. A toy example — x = 50 [figure: ABC density approximation over θ]
  • 105. ABC-PMC Use of a transition kernel as in population Monte Carlo with manageable IS correction Generate a sample at iteration t by N (t−1) (t−1) πt (θ(t) ) ∝ ˆ ωj Kt (θ(t) |θj ) j=1 modulo acceptance of the associated xt , and use an importance (t) weight associated with an accepted simulation θi (t) (t) (t) ωi ∝ π(θi ) πt (θi ) . ˆ c Still likelihood free [Beaumont et al., 2009]
  • 106. Our ABC-PMC algorithm Given a decreasing sequence of approximation levels 1 ≥ ... ≥ T, 1. At iteration t = 1, For i = 1, ..., N (1) (1) Simulate θi ∼ π(θ) and x ∼ f (x|θi ) until (x, y ) 1 (1) Set ωi = 1/N (1) Take τ 2 as twice the empirical variance of the θi ’s 2. At iteration 2 ≤ t ≤ T , For i = 1, ..., N, repeat (t−1) (t−1) Pick θi from the θj ’s with probabilities ωj (t) 2 (t) generate θi |θi ∼ N (θi , σt ) and x ∼ f (x|θi ) until (x, y ) t (t) (t) N (t−1) −1 (t) (t−1) Set ωi ∝ π(θi )/ j=1 ωj ϕ σt θi − θj ) 2 (t) Take τt+1 as twice the weighted empirical variance of the θi ’s
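A minimal R sketch of the ABC-PMC algorithm on the same illustrative normal-mean setting used above, with a tolerance schedule fixed in advance (an assumption; the slide allows any decreasing sequence) and Gaussian transition kernels whose variance is twice the (weighted) empirical variance of the previous population:

set.seed(7)
n <- 30; y_obs <- rnorm(n, 2); eta_obs <- mean(y_obs)
prior_d <- function(th) dnorm(th, 0, 10)
sim_dist <- function(th) abs(mean(rnorm(n, th)) - eta_obs)
N <- 1000
eps_seq <- c(2, 1, 0.5, 0.25, 0.1)
theta <- numeric(N); w <- rep(1 / N, N)
for (i in 1:N) repeat {                                     # iteration 1: rejection from the prior
  th <- rnorm(1, 0, 10)
  if (sim_dist(th) <= eps_seq[1]) { theta[i] <- th; break }
}
tau2 <- 2 * var(theta)
for (t in 2:length(eps_seq)) {
  theta_new <- numeric(N); w_new <- numeric(N)
  for (i in 1:N) {
    repeat {
      th0 <- sample(theta, 1, prob = w)                     # pick from previous population
      th  <- rnorm(1, th0, sqrt(tau2))                      # move with kernel K_t
      if (sim_dist(th) <= eps_seq[t]) break
    }
    theta_new[i] <- th
    w_new[i] <- prior_d(th) / sum(w * dnorm(th, theta, sqrt(tau2)))  # IS correction
  }
  w <- w_new / sum(w_new)
  theta <- theta_new
  tau2 <- 2 * sum(w * (theta - sum(w * theta))^2)           # twice the weighted variance
}
sum(w * theta)                                              # ABC-PMC posterior mean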
  • 107. Sequential Monte Carlo SMC is a simulation technique to approximate a sequence of related probability distributions πn with π0 “easy” and πT as target. Iterated IS as PMC: particles moved from time n to time n via kernel Kn and use of a sequence of extended targets πn˜ n πn (z0:n ) = πn (zn ) ˜ Lj (zj+1 , zj ) j=0 where the Lj ’s are backward Markov kernels [check that πn (zn ) is a marginal] [Del Moral, Doucet Jasra, Series B, 2006]
  • 108. Sequential Monte Carlo (2) Algorithm 3 SMC sampler (0) sample zi ∼ γ0 (x) (i = 1, . . . , N) (0) (0) (0) compute weights wi = π0 (zi )/γ0 (zi ) for t = 1 to N do if ESS(w (t−1) ) NT then resample N particles z (t−1) and set weights to 1 end if (t−1) (t−1) generate zi ∼ Kt (zi , ·) and set weights to (t) (t) (t−1) (t) (t−1) πt (zi ))Lt−1 (zi ), zi )) wi = wi−1 (t−1) (t−1) (t) πt−1 (zi ))Kt (zi ), zi )) end for [Del Moral, Doucet Jasra, Series B, 2006]
  • 109. ABC-SMC [Del Moral, Doucet Jasra, 2009] True derivation of an SMC-ABC algorithm Use of a kernel Kn associated with target π n and derivation of the backward kernel π n (z )Kn (z , z) Ln−1 (z, z ) = πn (z) Update of the weights M m m=1 IA n (xin ) win ∝ wi(n−1) M m (xi(n−1) ) m=1 IA n−1 m when xin ∼ K (xi(n−1) , ·)
  • 110. ABC-SMCM Modification: Makes M repeated simulations of the pseudo-data z given the parameter, rather than using a single [M = 1] simulation, leading to weight that is proportional to the number of accepted zi s M 1 ω(θ) = Iρ(η(y),η(zi )) M i=1 [limit in M means exact simulation from (tempered) target]
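A minimal sketch of the ABC-SMCM weight on the same illustrative normal-mean setting: the pseudo-data are replicated M times for a given θ and the weight is the proportion of accepted replicates (the tolerance and M are illustrative choices):

set.seed(8)
n <- 30; y_obs <- rnorm(n, 2); eta_obs <- mean(y_obs); eps <- 0.2
smcm_weight <- function(theta, M = 25) {
  rho <- replicate(M, abs(mean(rnorm(n, theta)) - eta_obs))
  mean(rho < eps)                     # omega(theta) = (1/M) sum_i I{rho(eta(y), eta(z_i)) < eps}
}
smcm_weight(2); smcm_weight(0)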
  • 111. Properties of ABC-SMC The ABC-SMC method properly uses a backward kernel L(z, z ) to simplify the importance weight and to remove the dependence on the unknown likelihood from this weight. Update of importance weights is reduced to the ratio of the proportions of surviving particles Major assumption: the forward kernel K is supposed to be invariant against the true target [tempered version of the true posterior] Adaptivity in ABC-SMC algorithm only found in on-line construction of the thresholds t , slowly enough to keep a large number of accepted transitions
  • 112. Properties of ABC-SMC The ABC-SMC method properly uses a backward kernel L(z, z ) to simplify the importance weight and to remove the dependence on the unknown likelihood from this weight. Update of importance weights is reduced to the ratio of the proportions of surviving particles Major assumption: the forward kernel K is supposed to be invariant against the true target [tempered version of the true posterior] Adaptivity in ABC-SMC algorithm only found in on-line construction of the thresholds t , slowly enough to keep a large number of accepted transitions
  • 113. A mixture example (1) Toy model of Sisson et al. (2007): if θ ∼ U(−10, 10) , x|θ ∼ 0.5 N (θ, 1) + 0.5 N (θ, 1/100) , then the posterior distribution associated with y = 0 is the normal mixture θ|y = 0 ∼ 0.5 N (0, 1) + 0.5 N (0, 1/100) restricted to [−10, 10]. Furthermore, true target available as π(θ||x| ) ∝ Φ( −θ)−Φ(− −θ)+Φ(10( −θ))−Φ(−10( +θ)) .
  • 114. A mixture example (2) Recovery of the target, whether using a fixed standard deviation of τ = 0.15 or τ = 1/0.15, or a sequence of adaptive τt’s. [figure: five density panels over θ]
  • 115. Yet another ABC-PRC Another version of ABC-PRC called PRC-ABC with a proposal distribution Mt , S replications of the pseudo-data, and a PRC step: PRC-ABC Algorithm (t−1) (t−1) 1. Select θ at random from the θi ’s with probabilities ωi (t) iid (t−1) 2. Generate θi ∼ Kt , x1 , . . . , xS ∼ f (x|θi ), set (t) (t) ωi = Kt {η(xs ) − η(y)} Kt (θi ) , s (t) and accept with probability p (i) = ωi /ct,N (t) (t) 3. Set ωi ∝ ωi /p (i) (1 ≤ i ≤ N) [Peters, Sisson Fan, Stat Computing, 2012]
  • 116. Yet another ABC-PRC Another version of ABC-PRC called PRC-ABC with a proposal distribution Mt , S replications of the pseudo-data, and a PRC step: (t−1) (t−1) Use of a NP estimate Mt (θ) = ωi ψ(θ|θi ) (with a fixed variance?!) S = 1 recommended for “greatest gains in sampler performance” ct,N upper quantile of non-zero weights at time t
  • 117. ABC inference machine Introduction Approximate Bayesian computation ABC as an inference machine Error inc. Exact BC and approximate targets summary statistic Series B discussion4
  • 118. How much Bayesian? maybe a convergent method of inference (meaningful? sufficient?) approximation error unknown (w/o simulation) pragmatic Bayes (there is no other solution!) many calibration issues (tolerance, distance, statistics) ...should Bayesians care?!
  • 119. How much Bayesian? maybe a convergent method of inference (meaningful? sufficient?) approximation error unknown (w/o simulation) pragmatic Bayes (there is no other solution!) many calibration issues (tolerance, distance, statistics) ...should Bayesians care?!
  • 120. ABCµ Idea Infer about the error as well: Use of a joint density f (θ, |y) ∝ ξ( |y, θ) × πθ (θ) × π ( ) where y is the data, and ξ( |y, θ) is the prior predictive density of ρ(η(z), η(y)) given θ and y when z ∼ f (z|θ) Warning! Replacement of ξ( |y, θ) with a non-parametric kernel approximation. [Ratmann, Andrieu, Wiuf and Richardson, 2009, PNAS]
  • 121. ABCµ Idea Infer about the error as well: Use of a joint density f (θ, |y) ∝ ξ( |y, θ) × πθ (θ) × π ( ) where y is the data, and ξ( |y, θ) is the prior predictive density of ρ(η(z), η(y)) given θ and y when z ∼ f (z|θ) Warning! Replacement of ξ( |y, θ) with a non-parametric kernel approximation. [Ratmann, Andrieu, Wiuf and Richardson, 2009, PNAS]
  • 122. ABCµ Idea Infer about the error as well: Use of a joint density f (θ, |y) ∝ ξ( |y, θ) × πθ (θ) × π ( ) where y is the data, and ξ( |y, θ) is the prior predictive density of ρ(η(z), η(y)) given θ and y when z ∼ f (z|θ) Warning! Replacement of ξ( |y, θ) with a non-parametric kernel approximation. [Ratmann, Andrieu, Wiuf and Richardson, 2009, PNAS]
  • 123. ABCµ details Multidimensional distances ρk (k = 1, . . . , K ) and errors k = ρk (ηk (z), ηk (y)), with ˆ 1 k ∼ ξk ( |y, θ) ≈ ξk ( |y, θ) = K [{ k −ρk (ηk (zb ), ηk (y))}/hk ] Bhk b ˆ then used in replacing ξ( |y, θ) with mink ξk ( |y, θ) ABCµ involves acceptance probability ˆ π(θ , ) q(θ , θ)q( , ) mink ξk ( |y, θ ) ˆ π(θ, ) q(θ, θ )q( , ) mink ξk ( |y, θ)
  • 124. ABCµ details Multidimensional distances ρk (k = 1, . . . , K ) and errors k = ρk (ηk (z), ηk (y)), with ˆ 1 k ∼ ξk ( |y, θ) ≈ ξk ( |y, θ) = K [{ k −ρk (ηk (zb ), ηk (y))}/hk ] Bhk b ˆ then used in replacing ξ( |y, θ) with mink ξk ( |y, θ) ABCµ involves acceptance probability ˆ π(θ , ) q(θ , θ)q( , ) mink ξk ( |y, θ ) ˆ π(θ, ) q(θ, θ )q( , ) mink ξk ( |y, θ)
  • 125. ABCµ multiple errors [© Ratmann et al., PNAS, 2009]
  • 126. ABCµ for model choice [© Ratmann et al., PNAS, 2009]
  • 127. Questions about ABCµ For each model under comparison, marginal posterior on used to assess the fit of the model (HPD includes 0 or not). Is the data informative about ? [Identifiability] How much does the prior π( ) impact the comparison? How is using both ξ( |x0 , θ) and π ( ) compatible with a standard probability model? [remindful of Wilkinson’s eABC ] Where is the penalisation for complexity in the model comparison? [X, Mengersen Chen, 2010, PNAS]
  • 128. Questions about ABCµ For each model under comparison, marginal posterior on used to assess the fit of the model (HPD includes 0 or not). Is the data informative about ? [Identifiability] How much does the prior π( ) impact the comparison? How is using both ξ( |x0 , θ) and π ( ) compatible with a standard probability model? [remindful of Wilkinson’s eABC ] Where is the penalisation for complexity in the model comparison? [X, Mengersen Chen, 2010, PNAS]
  • 129. Wilkinson’s exact BC (not exactly!) ABC approximation error (i.e. non-zero tolerance) replaced with exact simulation from a controlled approximation to the target, convolution of the true posterior with a kernel function: π_ε(θ, z|y) = π(θ) f (z|θ) K_ε(y − z) / ∫∫ π(θ) f (z|θ) K_ε(y − z) dz dθ, with K_ε a kernel parameterised by bandwidth ε. [Wilkinson, 2008] Theorem The ABC algorithm based on the assumption of a randomised observation ỹ = y + ξ, ξ ∼ K_ε, and an acceptance probability of K_ε(y − z)/M gives draws from the posterior distribution π(θ|ỹ).
  • 130. Wilkinson’s exact BC (not exactly!) ABC approximation error (i.e. non-zero tolerance) replaced with exact simulation from a controlled approximation to the target, convolution of the true posterior with a kernel function: π_ε(θ, z|y) = π(θ) f (z|θ) K_ε(y − z) / ∫∫ π(θ) f (z|θ) K_ε(y − z) dz dθ, with K_ε a kernel parameterised by bandwidth ε. [Wilkinson, 2008] Theorem The ABC algorithm based on the assumption of a randomised observation ỹ = y + ξ, ξ ∼ K_ε, and an acceptance probability of K_ε(y − z)/M gives draws from the posterior distribution π(θ|ỹ).
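A minimal R sketch of this kernel-acceptance scheme on the illustrative normal-mean setting, with a Gaussian kernel K_ε and bound M = K_ε(0), so that the output targets the posterior under the noisy-observation model ỹ = y + ξ, ξ ∼ K_ε (all settings are assumptions for the example):

set.seed(9)
n <- 30; y_obs <- rnorm(n, 2); eta_obs <- mean(y_obs)
eps <- 0.1                                           # kernel bandwidth
M_draws <- 2e5
theta <- rnorm(M_draws, 0, 10)
eta_sim <- vapply(theta, function(m) mean(rnorm(n, m)), numeric(1))
acc_prob <- dnorm(eta_obs - eta_sim, 0, eps) / dnorm(0, 0, eps)   # K_eps(y - z) / K_eps(0)
keep <- runif(M_draws) < acc_prob
mean(theta[keep])                                    # draws from the convolved posterior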
  • 131. How exact a BC? “Using ε to represent measurement error is straightforward, whereas using ε to model the model discrepancy is harder to conceptualize and not as commonly used” [Richard Wilkinson, 2008]
  • 132. How exact a BC? Pros Pseudo-data from true model and observed data from noisy model Interesting perspective in that outcome is completely controlled Link with ABCµ and assuming y is observed with a measurement error with density K Relates to the theory of model approximation [Kennedy O’Hagan, 2001] Cons Requires K to be bounded by M True approximation error never assessed Requires a modification of the standard ABC algorithm
  • 133. ABC for HMMs Specific case of a hidden Markov model Xt+1 ∼ Qθ (Xt , ·) Yt+1 ∼ gθ (·|xt ) 0 where only y1:n is observed. [Dean, Singh, Jasra, Peters, 2011] Use of specific constraints, adapted to the Markov structure: 0 0 y1 ∈ B(y1 , ) × · · · × yn ∈ B(yn , )
  • 134. ABC for HMMs Specific case of a hidden Markov model Xt+1 ∼ Qθ (Xt , ·) Yt+1 ∼ gθ (·|xt ) 0 where only y1:n is observed. [Dean, Singh, Jasra, Peters, 2011] Use of specific constraints, adapted to the Markov structure: 0 0 y1 ∈ B(y1 , ) × · · · × yn ∈ B(yn , )
  • 135. ABC-MLE for HMMs ABC-MLE defined by ˆ 0 0 θn = arg max Pθ Y1 ∈ B(y1 , ), . . . , Yn ∈ B(yn , ) θ Exact MLE for the likelihood same basis as Wilkinson! 0 pθ (y1 , . . . , yn ) corresponding to the perturbed process (xt , yt + zt )1≤t≤n zt ∼ U(B(0, 1) [Dean, Singh, Jasra, Peters, 2011]
  • 136. ABC-MLE for HMMs ABC-MLE defined by ˆ 0 0 θn = arg max Pθ Y1 ∈ B(y1 , ), . . . , Yn ∈ B(yn , ) θ Exact MLE for the likelihood same basis as Wilkinson! 0 pθ (y1 , . . . , yn ) corresponding to the perturbed process (xt , yt + zt )1≤t≤n zt ∼ U(B(0, 1) [Dean, Singh, Jasra, Peters, 2011]
  • 137. ABC-MLE is biased ABC-MLE is asymptotically (in n) biased with target l (θ) = Eθ∗ [log pθ (Y1 |Y−∞:0 )] but ABC-MLE converges to true value in the sense l n (θn ) → l (θ) for all sequences (θn ) converging to θ and n
  • 138. ABC-MLE is biased ABC-MLE is asymptotically (in n) biased with target l (θ) = Eθ∗ [log pθ (Y1 |Y−∞:0 )] but ABC-MLE converges to true value in the sense l n (θn ) → l (θ) for all sequences (θn ) converging to θ and n
  • 139. Noisy ABC-MLE Idea: Modify instead the data from the start 0 (y1 + ζ1 , . . . , yn + ζn ) [ see Fearnhead-Prangle ] noisy ABC-MLE estimate 0 0 arg max Pθ Y1 ∈ B(y1 + ζ1 , ), . . . , Yn ∈ B(yn + ζn , ) θ [Dean, Singh, Jasra, Peters, 2011]
  • 140. Consistent noisy ABC-MLE Degrading the data improves the estimation performances: Noisy ABC-MLE is asymptotically (in n) consistent under further assumptions, the noisy ABC-MLE is asymptotically normal increase in variance of order −2 likely degradation in precision or computing time due to the lack of summary statistic [curse of dimensionality]
  • 141. SMC for ABC likelihood Algorithm 4 SMC ABC for HMMs Given θ for k = 1, . . . , n do 1 1 N N generate proposals (xk , yk ), . . . , (xk , yk ) from the model weigh each proposal with ωk l =I l B(yk + ζk , ) (yk ) 0 l renormalise the weights and sample the xk ’s accordingly end for approximate the likelihood by n N l ωk N k=1 l=1 [Jasra, Singh, Martin, McCoy, 2010]
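A minimal R sketch of the SMC approximation of the ABC likelihood (Algorithm 4) on an illustrative linear-Gaussian hidden Markov model x_{t+1} = φ x_t + w_t, y_t = x_t + v_t; the ball constraint B(y_t^0, ε) is one-dimensional here, so the indicator weight reduces to |y_sim − y_obs| ≤ ε (model and constants are assumptions for the example):

set.seed(10)
T_len <- 50; phi <- 0.8; eps <- 0.5; N <- 500
x <- numeric(T_len); x[1] <- rnorm(1)
for (t in 2:T_len) x[t] <- phi * x[t - 1] + rnorm(1)
y_obs <- x + rnorm(T_len)                             # observed record y_{1:T}
abc_loglik <- function(phi, eps) {
  xp <- rnorm(N)                                      # particles at time 1
  loglik <- 0
  for (t in 1:T_len) {
    if (t > 1) xp <- phi * xp + rnorm(N)              # propagate through Q_theta
    yp <- xp + rnorm(N)                               # pseudo-observations
    w <- as.numeric(abs(yp - y_obs[t]) <= eps)        # indicator weights
    if (sum(w) == 0) return(-Inf)                     # all particles rejected
    loglik <- loglik + log(mean(w))                   # factor (1/N) sum_l omega_k^l
    xp <- sample(xp, N, replace = TRUE, prob = w)     # resample surviving states
  }
  loglik
}
abc_loglik(0.8, eps); abc_loglik(0.2, eps)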
  • 142. Which summary? Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic [except when done by the experimenters in the field] Starting from a large collection of available summary statistics, Joyce and Marjoram (2008) consider their sequential inclusion into the ABC target, with a stopping rule based on a likelihood ratio test Caveats: the sequential nature of the tests is not taken into account; the procedure depends on the parameterisation; the order of inclusion matters
  • 143. Which summary? Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic [except when done by the experimenters in the field] Starting from a large collection of available summary statistics, Joyce and Marjoram (2008) consider their sequential inclusion into the ABC target, with a stopping rule based on a likelihood ratio test Caveats: the sequential nature of the tests is not taken into account; the procedure depends on the parameterisation; the order of inclusion matters
  • 144. Which summary? Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic [except when done by the experimenters in the field] Starting from a large collection of available summary statistics, Joyce and Marjoram (2008) consider their sequential inclusion into the ABC target, with a stopping rule based on a likelihood ratio test Caveats: the sequential nature of the tests is not taken into account; the procedure depends on the parameterisation; the order of inclusion matters
  • 145. Semi-automatic ABC Fearnhead and Prangle (2010) study ABC and the selection of the summary statistic in close proximity to Wilkinson’s proposal ABC then considered from a purely inferential viewpoint and calibrated for estimation purposes Use of a randomised (or ‘noisy’) version of the summary statistics η (y) = η(y) + τ ˜ Derivation of a well-calibrated version of ABC, i.e. an algorithm that gives proper predictions for the distribution associated with this randomised summary statistic [calibration constraint: ABC approximation with same posterior mean as the true randomised posterior]
  • 146. Semi-automatic ABC Fearnhead and Prangle (2010) study ABC and the selection of the summary statistic in close proximity to Wilkinson’s proposal ABC then considered from a purely inferential viewpoint and calibrated for estimation purposes Use of a randomised (or ‘noisy’) version of the summary statistics η (y) = η(y) + τ ˜ Derivation of a well-calibrated version of ABC, i.e. an algorithm that gives proper predictions for the distribution associated with this randomised summary statistic [calibration constraint: ABC approximation with same posterior mean as the true randomised posterior]
  • 147. Summary statistics Optimality of the posterior expectation E[θ|y] of the parameter of interest as summary statistics η(y)! Use of the standard quadratic loss function (θ − θ0 )T A(θ − θ0 ) .
  • 148. Summary statistics Optimality of the posterior expectation E[θ|y] of the parameter of interest as summary statistics η(y)! Use of the standard quadratic loss function (θ − θ0 )T A(θ − θ0 ) .
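A minimal R sketch of the semi-automatic construction behind this optimality result: a pilot set of (θ, z) pairs is used to regress θ on features of the simulated data, and the fitted value η(y) ≈ E[θ|y] then serves as the (one-dimensional) summary in a second, standard ABC run; the feature set (sample mean and variance) and all tuning choices are illustrative assumptions:

set.seed(11)
n <- 30; y_obs <- rnorm(n, 2)
features <- function(y) c(mean(y), var(y))
M <- 2e4
theta <- rnorm(M, 0, 10)
Z <- t(vapply(theta, function(m) features(rnorm(n, m)), numeric(2)))
pilot <- lm(theta ~ Z)                                  # linear estimate of E[theta | y]
eta <- function(y) sum(coef(pilot) * c(1, features(y))) # constructed summary statistic
eta_obs <- eta(y_obs)
theta2 <- rnorm(M, 0, 10)                               # second stage: plain rejection ABC
eta_sim <- vapply(theta2, function(m) eta(rnorm(n, m)), numeric(1))
keep <- abs(eta_sim - eta_obs) <= quantile(abs(eta_sim - eta_obs), 0.005)
mean(theta2[keep])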
  • 149. Details on Fearnhead and Prangle (FP) ABC Use of a summary statistic S(·), an importance proposal g (·), a kernel K (·) ≤ 1 and a bandwidth h 0 such that (θ, ysim ) ∼ g (θ)f (ysim |θ) is accepted with probability (hence the bound) K [{S(ysim ) − sobs }/h] and the corresponding importance weight defined by π(θ) g (θ) [Fearnhead Prangle, 2012]
• 150. Errors, errors, and errors
Three levels of approximation:
  π(θ|y_obs) by π(θ|s_obs): loss of information [ignored]
  π(θ|s_obs) by
    π_ABC(θ|s_obs) = ∫ π(s) K[{s − s_obs}/h] π(θ|s) ds / ∫ π(s) K[{s − s_obs}/h] ds :
  noisy observations
  π_ABC(θ|s_obs) by importance Monte Carlo based on N simulations, with error represented by var(a(θ)|s_obs)/N_acc [N_acc: expected number of acceptances]
[M. Twain / B. Disraeli]
• 151. Average acceptance asymptotics
For the average acceptance probability/approximate likelihood
  p(θ|s_obs) = ∫ f(y_sim|θ) K[{S(y_sim) − s_obs}/h] dy_sim ,
the overall acceptance probability satisfies
  p(s_obs) = ∫ p(θ|s_obs) π(θ) dθ = π(s_obs) h^d + o(h^d)
[FP, Lemma 1]
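A sketch of where the h^d rate comes from, assuming the prior predictive density π(s) of the summary statistic is continuous at s_obs and that K is normalised so that ∫ K(u) du = 1 (FP's scaling):

\[
p(s_{\mathrm{obs}})
=\int\!\!\int \pi(\theta)\,f(y\mid\theta)\,K[\{S(y)-s_{\mathrm{obs}}\}/h]\,\mathrm{d}y\,\mathrm{d}\theta
=\int \pi(s)\,K[\{s-s_{\mathrm{obs}}\}/h]\,\mathrm{d}s
=h^d\!\int \pi(s_{\mathrm{obs}}+hu)\,K(u)\,\mathrm{d}u
=\pi(s_{\mathrm{obs}})\,h^d+o(h^d).
\]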
• 152. Optimal importance proposal
Best choice of importance proposal in terms of effective sample size:
  g(θ|s_obs) ∝ π(θ) p(θ|s_obs)^{1/2}
[Not particularly useful in practice]
  note that p(θ|s_obs) is an approximate likelihood
  reminiscent of parallel tempering
  could be approximately achieved by attrition of half of the data
• 154. Calibration of h
“This result gives insight into how S(·) and h affect the Monte Carlo error. To minimize Monte Carlo error, we need h^d to be not too small. Thus ideally we want S(·) to be a low dimensional summary of the data that is sufficiently informative about θ that π(θ|s_obs) is close, in some sense, to π(θ|y_obs)” (FP, p.5)
  turns h into an absolute value while it should be context-dependent and user-calibrated
  only addresses one term in the approximation error and acceptance probability (“curse of dimensionality”)
  h large prevents π_ABC(θ|s_obs) from being close to π(θ|s_obs)
  d small prevents π(θ|s_obs) from being close to π(θ|y_obs) (“curse of [dis]information”)
• 155. Calibrating ABC
“If π_ABC is calibrated, then this means that probability statements that are derived from it are appropriate, and in particular that we can use π_ABC to quantify uncertainty in estimates” (FP, p.5)
Definition: For 0 < q < 1 and a subset A, let E_q(A) be the event made of those s_obs such that Pr_ABC(θ ∈ A|s_obs) = q. Then ABC is calibrated if
  Pr(θ ∈ A|E_q(A)) = q
  unclear meaning of conditioning on E_q(A)
• 157. Calibrated ABC
Theorem (FP): Noisy ABC, where s_obs = S(y_obs) + hε, ε ∼ K(·), is calibrated
[Wilkinson, 2008]
  no condition on h!!
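A minimal Python sketch of the noisy-ABC twist on rejection sampling: the only change relative to standard ABC is that the observed summary is perturbed once, with noise drawn from the same kernel and scale as the acceptance rule. The Gaussian toy model and the uniform-ball kernel are assumptions of this sketch.

import numpy as np

rng = np.random.default_rng(3)

def noisy_abc(y_obs, h, n_iter, summarise, simulate, prior_draw):
    # perturb once: s_obs = S(y_obs) + h * eps, eps ~ K (uniform on the unit ball here)
    s_obs = summarise(y_obs) + h * rng.uniform(-1, 1, size=1)
    draws = []
    for _ in range(n_iter):
        theta = prior_draw()
        s_sim = summarise(simulate(theta))
        if np.all(np.abs(s_sim - s_obs) < h):        # acceptance kernel matches the perturbation
            draws.append(theta)
    return np.array(draws)

# hypothetical toy model: theta ~ N(0, 1), y | theta ~ N(theta, 1), summary = sample mean
y_obs = rng.normal(0.7, 1.0, size=30)
post = noisy_abc(y_obs, h=0.2, n_iter=20000,
                 summarise=lambda y: np.array([y.mean()]),
                 simulate=lambda t: rng.normal(t, 1.0, size=30),
                 prior_draw=lambda: rng.normal())
print(post.mean(), post.std())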
• 158. Calibrated ABC
Consequence: when h = ∞,
Theorem (FP): The prior distribution is always calibrated
  is this a relevant property then?
• 159. More questions about calibrated ABC
“Calibration is not universally accepted by Bayesians. It is even more questionable here as we care how statements we make relate to the real world, not to a mathematically defined posterior.” R. Wilkinson
  Same reluctance about the prior being calibrated
  Property depending on prior, likelihood, and summary
  Calibration is a frequentist property (almost a p-value!)
  More sensible to account for the simulator’s imperfections than to use noisy-ABC against a meaningless base measure
[Wilkinson, 2012]
• 160. Converging ABC
Theorem (FP): For noisy ABC, the expected noisy-ABC log-likelihood,
  E{log[p(θ|s_obs)]} = ∫∫ log[p(θ|S(y_obs) + hε)] π(y_obs|θ0) K(ε) dy_obs dε ,
has its maximum at θ = θ0.
  True for any choice of summary statistic? Even ancillary statistics?! [Imposes at least identifiability...]
  Relevant in asymptotia and not for the data
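The reasoning behind the theorem, sketched under the assumption that p(θ|·) is (proportional to) a genuine density p_h(·|θ) for the perturbed summary S(y_obs) + hε: the difference of expected log-likelihoods is minus a Kullback–Leibler divergence,

\[
\mathbb{E}_{\theta_0}\{\log p_h(s_{\mathrm{obs}}\mid\theta)\}
-\mathbb{E}_{\theta_0}\{\log p_h(s_{\mathrm{obs}}\mid\theta_0)\}
=-\,\mathrm{KL}\big(p_h(\cdot\mid\theta_0)\,\|\,p_h(\cdot\mid\theta)\big)\le 0,
\]

so θ0 is always a maximiser; it is the unique maximiser only if θ ↦ p_h(·|θ) is identifiable, hence the conditions needed on the summary statistic.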
• 161. Converging ABC
Corollary: For noisy ABC, the ABC posterior converges onto a point mass at the true parameter value as m → ∞.
For standard ABC, not always the case (unless h goes to zero).
  Strength of regularity conditions (c1) and (c2) in Bernardo & Smith (1994)? [out-of-reach constraints on likelihood and posterior]
  Again, there must be conditions imposed upon the summary statistics...
• 162. Loss motivated statistic
Under the quadratic loss function,
Theorem (FP):
(i) The minimal posterior error E[L(θ, θ̂)|y_obs] occurs when θ̂ = E[θ|y_obs] (!)
(ii) When h → 0, E_ABC[θ|s_obs] converges to E[θ|y_obs]
(iii) If S(y_obs) = E[θ|y_obs], then for θ̂ = E_ABC[θ|s_obs],
  E[L(θ, θ̂)|y_obs] = trace(AΣ) + h² ∫ xᵀAx K(x) dx + o(h²).
  measure-theoretic difficulties?
  dependence of s_obs on h makes me uncomfortable: inherent to noisy ABC
  relevant for choice of K?
• 163. Optimal summary statistic
“We take a different approach, and weaken the requirement for π_ABC to be a good approximation to π(θ|y_obs). We argue for π_ABC to be a good approximation solely in terms of the accuracy of certain estimates of the parameters.” (FP, p.5)
From this result, FP derive their choice of summary statistic,
  S(y) = E[θ|y]
[almost sufficient]
and suggest h = O(N^{−1/(2+d)}) and h = O(N^{−1/(4+d)}) as optimal bandwidths for noisy and standard ABC.
• 164. Optimal summary statistic
“We take a different approach, and weaken the requirement for π_ABC to be a good approximation to π(θ|y_obs). We argue for π_ABC to be a good approximation solely in terms of the accuracy of certain estimates of the parameters.” (FP, p.5)
From this result, FP derive their choice of summary statistic,
  S(y) = E[θ|y]
[wow! E_ABC[θ|S(y_obs)] = E[θ|y_obs]]
and suggest h = O(N^{−1/(2+d)}) and h = O(N^{−1/(4+d)}) as optimal bandwidths for noisy and standard ABC.
• 165. Caveat
Since E[θ|y_obs] is usually unavailable, FP suggest
(i) use a pilot run of ABC to determine a region of non-negligible posterior mass;
(ii) simulate sets of parameter values and data;
(iii) use the simulated sets of parameter values and data to estimate the summary statistic; and
(iv) run ABC with this choice of summary statistic.
  Where is the assessment of the first-stage error?
• 167. Approximating the summary statistic
As in Beaumont et al. (2002) and Blum and François (2010), FP use a linear regression to approximate E[θ|y_obs]:
  θ_i = β₀ + βᵀ f(y_obs^{(i)}) + ε_i
with f made of standard transforms
[Further justifications?]
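A minimal Python sketch of this construction of the summary statistic (steps (ii)–(iii) of the previous slide): simulate (θ_i, y_i) pairs, regress θ on transforms f(y) by least squares, and use the fitted regression S(y) = β̂₀ + β̂ᵀ f(y) as summary. The toy N(θ, 1) model and the choice f(y) = (ȳ, s²_y) are placeholders, not FP's transforms.

import numpy as np

rng = np.random.default_rng(4)

def f(y):                                   # hypothetical standard transforms of the data
    return np.array([y.mean(), y.var()])

# step (ii): simulate parameter values and data (toy model: y | theta ~ N(theta, 1))
theta_sim = rng.normal(0, 1, size=5000)
X = np.array([f(rng.normal(t, 1.0, size=30)) for t in theta_sim])

# step (iii): least-squares fit of E[theta | y] ~ beta0 + beta' f(y)
design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(design, theta_sim, rcond=None)

def S(y):                                   # estimated summary statistic S(y) ~ E[theta | y]
    return beta[0] + f(y) @ beta[1:]

# step (iv): this S(.) is then plugged into a standard ABC run on the observed data
y_obs = rng.normal(0.5, 1.0, size=30)
print(S(y_obs))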
• 168. [My] questions about semi-automatic ABC
  dependence on h and S(·) in the early stage
  reduction of Bayesian inference to point estimation
  approximation error in step (i) not accounted for
  not parameterisation invariant
  practice shows that proper approximation to genuine posterior distributions stems from using a (much) larger number of summary statistics than the dimension of the parameter
  the validity of the approximation to the optimal summary statistic depends on the quality of the pilot run
  important inferential issues like model choice are not covered by this approach
[Robert, 2012]
• 169. More about semi-automatic ABC
[End of section derived from comments on the Read Paper, just appeared]
“The apparently arbitrary nature of the choice of summary statistics has always been perceived as the Achilles heel of ABC.” M. Beaumont
  “Curse of dimensionality” linked with the increase of the dimension of the summary statistic
  Connection with principal component analysis [Itan et al., 2010]
  Connection with partial least squares [Wegmann et al., 2009]
  Beaumont et al. (2002) postprocessed output is used as input by FP to run a second ABC
• 171. Wood’s alternative
Instead of a non-parametric kernel approximation to the likelihood,
  (1/R) Σ_r K{η(y_r) − η(y_obs)} ,
Wood (2010) suggests a normal approximation
  η(y(θ)) ∼ N_d(µ_θ, Σ_θ)
whose parameters can be approximated based on the R simulations (for each value of θ).
  Parametric versus non-parametric rate [Uh?!]
  Automatic weighting of components of η(·) through Σ_θ
  Dependence on normality assumption (pseudo-likelihood?)
[Cornebise, Girolami & Kosmidis, 2012]
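A minimal Python sketch of this synthetic likelihood: for a given θ, simulate R replicates, estimate the mean and covariance of the summaries, and evaluate a Gaussian log-density at the observed summaries. The toy simulator and the three summaries below are assumptions of the sketch.

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)

def summaries(y):                            # hypothetical d-dimensional summary eta(y)
    return np.array([y.mean(), y.std(), np.mean(y**3)])

def synthetic_loglik(theta, eta_obs, simulate, R=500):
    # Normal approximation eta(y(theta)) ~ N_d(mu_theta, Sigma_theta),
    # with mu_theta and Sigma_theta estimated from R simulations.
    sims = np.array([summaries(simulate(theta)) for _ in range(R)])
    mu = sims.mean(axis=0)
    Sigma = np.cov(sims, rowvar=False) + 1e-8 * np.eye(sims.shape[1])  # small ridge for stability
    return multivariate_normal.logpdf(eta_obs, mean=mu, cov=Sigma)

# toy simulator: y | theta ~ N(theta, 1); observed data generated with theta = 1
simulate = lambda t: rng.normal(t, 1.0, size=100)
eta_obs = summaries(simulate(1.0))
for t in (0.0, 0.5, 1.0, 1.5):
    print(t, synthetic_loglik(t, eta_obs, simulate))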
• 173. Reinterpretation and extensions
Reinterpretation of ABC output as joint simulation from
  π̄(x, y|θ) = f(x|θ) π̄_{Y|X}(y|x)   where   π̄_{Y|X}(y|x) = K(y − x)
Reinterpretation of noisy ABC: if ȳ|y_obs ∼ π̄_{Y|X}(·|y_obs), then marginally
  ȳ ∼ π̄_{Y|θ}(·|θ0)
Explains the consistency of Bayesian inference based on ȳ and π̄
[Lee, Andrieu & Doucet, 2012]
• 175. ABC for Markov chains
Rewriting the posterior as
  π(θ)^{1−n} π(θ|x_1) ∏_{t=2}^{n} π(θ|x_{t−1}, x_t)   where   π(θ|x_{t−1}, x_t) ∝ f(x_t|x_{t−1}, θ) π(θ)
allows for a stepwise ABC, replacing each π(θ|x_{t−1}, x_t) by an ABC approximation
  Similarity with FP’s multiple sources of data (and also with Dean et al., 2011)
[White et al., 2010, 2012]
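A minimal Python sketch of the stepwise idea: each factor π(θ|x_{t−1}, x_t) is approximated by one-transition rejection ABC, the accepted draws are summarised by a Gaussian refit, and the factors are recombined on a grid together with the π(θ)^{1−n} correction (π(θ|x_1) is taken equal to the prior, as for a fixed initial state). The AR(1) toy chain, uniform prior and Gaussian refit are all assumptions of this sketch, not White et al.'s implementation.

import numpy as np

rng = np.random.default_rng(6)

def transition_abc(x_prev, x_next, eps, n_iter, prior_draw, step):
    # Rejection ABC for a single factor pi(theta | x_{t-1}, x_t):
    # draw theta from the prior, propagate one step, accept if close to x_t.
    draws = [th for th in (prior_draw() for _ in range(n_iter))
             if abs(step(x_prev, th) - x_next) < eps]
    return np.array(draws)

def stepwise_abc_logpost(x, grid, eps, n_iter, prior_draw, prior_logpdf, step):
    # Recombine the per-transition ABC outputs on a grid of theta values:
    #   log pi(theta | x_1:n) ~= (2 - n) log pi(theta) + sum_t log pi_ABC(theta | x_{t-1}, x_t),
    # each ABC factor replaced by a Gaussian fit to its accepted draws.
    n = len(x)
    logpost = (2 - n) * prior_logpdf(grid)
    for t in range(1, n):
        d = transition_abc(x[t - 1], x[t], eps, n_iter, prior_draw, step)
        if len(d) < 2:
            continue                                   # skip factors with too few acceptances (crude)
        logpost += -0.5 * ((grid - d.mean()) / d.std()) ** 2 - np.log(d.std())
    return logpost - logpost.max()

# toy AR(1) chain x_t = theta * x_{t-1} + noise, theta ~ U(-1, 1) (all hypothetical)
theta_true = 0.6
x = [0.0]
for _ in range(40):
    x.append(theta_true * x[-1] + rng.normal(scale=0.5))
grid = np.linspace(-0.99, 0.99, 200)
lp = stepwise_abc_logpost(np.array(x), grid, eps=0.1, n_iter=2000,
                          prior_draw=lambda: rng.uniform(-1, 1),
                          prior_logpdf=lambda g: np.zeros_like(g),   # uniform prior, constant log-density
                          step=lambda xp, th: th * xp + rng.normal(scale=0.5))
print(grid[np.argmax(lp)])                 # crude posterior mode estimate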