TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
CISEA 2019: ABC consistency and convergence
1. ABC: convergence and misspecification
Christian P. Robert
Universit´e Paris-Dauphine PSL, Paris & University of Warwick,
Coventry
Joint works with D. Frazier, J.-M. Marin, G. Martin, and J. Rousseau
4. A motivating if pedestrian example
paired and orphan socks
A drawer contains an unknown number of socks, some of which
can be paired and some of which are orphans (single). One takes
at random 11 socks without replacement from this drawer: no pair
can be found among those. What can we infer about the total
number of socks in the drawer?
sounds like an impossible task
one observation x = 11 and two unknowns, nsocks and npairs
writing the likelihood is a challenge [exercise]
5. A motivating if pedestrian example
paired and orphan socks
A drawer contains an unknown number of socks, some of which
can be paired and some of which are orphans (single). One takes
at random 11 socks without replacement from this drawer: no pair
can be found among those. What can we infer about the total
number of socks in the drawer?
sounds like an impossible task
one observation x = 11 and two unknowns, nsocks and npairs
writing the likelihood is a challenge [exercise]
6. Feller’s shoes
A closet contains n pairs of shoes. If 2r shoes are chosen
at random (with 2r < n), what is the probability that
there will be (a) no complete pair, (b) exactly one
complete pair, (c) exactly two complete pairs among
them?
[Feller, 1970, Chapter II, Exercise 26]
7. Feller’s shoes
A closet contains n pairs of shoes. If 2r shoes are chosen
at random (with 2r < n), what is the probability that
there will be (a) no complete pair, (b) exactly one
complete pair, (c) exactly two complete pairs among
them?
[Feller, 1970, Chapter II, Exercise 26]
Resolution as
pj =
n
j
22r−2j n − j
2r − 2j
2n
2r
being probability of obtaining js pairs among those 2r shoes, or for
an odd number t of shoes
pj = 2t−2j n
j
n − j
t − 2j
2n
t
8. Feller’s shoes
A closet contains n pairs of shoes. If 2r shoes are chosen
at random (with 2r < n), what is the probability that
there will be (a) no complete pair, (b) exactly one
complete pair, (c) exactly two complete pairs among
them?
[Feller, 1970, Chapter II, Exercise 26]
If one draws 11 socks out of m socks made of f orphans and g
pairs, with f + 2g = m, number k of socks from the orphan group
is hypergeometric H(11, m, f ) and probability to observe 11
orphan socks total is
11
k=0
f
k
2g
11−k
m
11
×
211−k g
11−k
2g
11−k
9. A prioris on socks
Given parameters nsocks and npairs, set of socks
S = s1, s1, . . . , snpairs , snpairs , snpairs+1, . . . , snsocks
and 11 socks picked at random from S give X unique socks.
Rassmus’ reasoning
If you are a family of 3-4 persons then a guesstimate would be that
you have something like 15 pairs of socks in store. It is also
possible that you have much more than 30 socks. So as a prior for
nsocks I’m going to use a negative binomial with mean 30 and
standard deviation 15.
On npairs/2nsocks I’m going to put a Beta prior distribution that puts
most of the probability over the range 0.75 to 1.0,
[Rassmus B˚a˚ath’s Research Blog, Oct 20th, 2014]
10. A prioris on socks
Given parameters nsocks and npairs, set of socks
S = s1, s1, . . . , snpairs , snpairs , snpairs+1, . . . , snsocks
and 11 socks picked at random from S give X unique socks.
Rassmus’ reasoning
If you are a family of 3-4 persons then a guesstimate would be that
you have something like 15 pairs of socks in store. It is also
possible that you have much more than 30 socks. So as a prior for
nsocks I’m going to use a negative binomial with mean 30 and
standard deviation 15.
On npairs/2nsocks I’m going to put a Beta prior distribution that puts
most of the probability over the range 0.75 to 1.0,
[Rassmus B˚a˚ath’s Research Blog, Oct 20th, 2014]
11. Simulating the experiment
Given a prior distribution on nsocks and npairs,
nsocks ∼ Neg(30, 15) npairs|nsocks ∼ nsocks/2Be(15, 2)
possible to
1. generate new values
of nsocks and npairs,
2. generate a new
observation of X,
number of unique
socks out of 11.
3. accept the pair
(nsocks, npairs) if the
realisation of X is
equal to 11
12. Simulating the experiment
Given a prior distribution on nsocks and npairs,
nsocks ∼ Neg(30, 15) npairs|nsocks ∼ nsocks/2Be(15, 2)
possible to
1. generate new values
of nsocks and npairs,
2. generate a new
observation of X,
number of unique
socks out of 11.
3. accept the pair
(nsocks, npairs) if the
realisation of X is
equal to 11
13. Meaning
ns
Density
0 10 20 30 40 50 60
0.000.010.020.030.040.050.06
The outcome of this simulation method returns a distribution on
the pair (nsocks, npairs) that is the conditional distribution of the
pair given the observation X = 11
Proof: Generations from π(nsocks, npairs) are accepted with probability
P {X = 11|(nsocks, npairs)}
14. Meaning
ns
Density
0 10 20 30 40 50 60
0.000.010.020.030.040.050.06
The outcome of this simulation method returns a distribution on
the pair (nsocks, npairs) that is the conditional distribution of the
pair given the observation X = 11
Proof: Hence accepted values distributed from
π(nsocks, npairs) × P {X = 11|(nsocks, npairs)} ∝ π(nsocks, npairs|X = 11)
15. Additional example
Take a Normal sample
x1, . . . , xn ∼ N(µ, σ2
)
summarised into (insufficient)
^µn = med(x1, . . . , xn)
^σn = mad(x1, . . . , xn) = med|xi − med(x1, . . . , xn)|
Under a conjugate prior π(µ, σ2), posterior close to intractable.
but simulation of (^µn, ^σn) straightforward
16. Additional example
Take a Normal sample
x1, . . . , xn ∼ N(µ, σ2
)
summarised into (insufficient)
^µn = med(x1, . . . , xn)
^σn = mad(x1, . . . , xn) = med|xi − med(x1, . . . , xn)|
Under a conjugate prior π(µ, σ2), posterior close to intractable.
but simulation of (^µn, ^σn) straightforward
18. Approximate Bayesian computation
Motivating examples
Approximate Bayesian computation
ABC basics
Automated summary selection
ABC for model choice
Asymptotics of ABC
ABC under misspecification
19. Untractable likelihoods
Cases when the likelihood function
f (y|θ) is unavailable and when the
completion step
f (y|θ) =
Z
f (y, z|θ) dz
is impossible or too costly because of
the dimension of z
c MCMC cannot be implemented
20. The ABC method
Bayesian setting: target is π(θ)f (x|θ)
When likelihood f (x|θ) not in closed form, likelihood-free rejection
technique:
ABC algorithm
For an observation y ∼ f (y|θ), under the prior π(θ), keep jointly
simulating
θ ∼ π(θ) , z ∼ f (z|θ ) ,
until the auxiliary variable z is equal to the observed value, z = y.
[Tavar´e et al., 1997]
c Exact simulation from π(θ|y)
21. The ABC method
Bayesian setting: target is π(θ)f (x|θ)
When likelihood f (x|θ) not in closed form, likelihood-free rejection
technique:
ABC algorithm
For an observation y ∼ f (y|θ), under the prior π(θ), keep jointly
simulating
θ ∼ π(θ) , z ∼ f (z|θ ) ,
until the auxiliary variable z is equal to the observed value, z = y.
[Tavar´e et al., 1997]
c Exact simulation from π(θ|y)
22. A as A...pproximative
When y is a continuous random variable, equality z = y is
replaced with a tolerance condition,
ρ(y, z)
where ρ is a distance
Output distributed from
π(θ) Pθ{ρ(y, z) < } ∝ π(θ|ρ(y, z) < )
[Pritchard et al., 1999]
23. A as A...pproximative
When y is a continuous random variable, equality z = y is
replaced with a tolerance condition,
ρ(y, z)
where ρ is a distance
Output distributed from
π(θ) Pθ{ρ(y, z) < } ∝ π(θ|ρ(y, z) < )
[Pritchard et al., 1999]
24. ABC algorithm
Algorithm 1 Likelihood-free rejection sampler 2
for i = 1 to N do
repeat
generate θ from the prior distribution π(·)
generate z from the likelihood f (·|θ )
until ρ{η(z), η(y)}
set θi = θ
end for
where η(y) defines a (not necessarily sufficient) statistic
25. Output
The likelihood-free algorithm samples from the marginal in z of:
π (θ, z|y) =
π(θ)f (z|θ)IA ,y (z)
A ,y×Θ π(θ)f (z|θ)dzdθ
,
where A ,y = {z ∈ D|ρ(η(z), η(y)) < }.
The idea behind ABC is that the summary statistics coupled with a
small tolerance should provide a good approximation of a posterior
distribution:
π (θ|y) = π (θ, z|y)dz ≈ π(θ | η(y)
summary
) .
26. Output
The likelihood-free algorithm samples from the marginal in z of:
π (θ, z|y) =
π(θ)f (z|θ)IA ,y (z)
A ,y×Θ π(θ)f (z|θ)dzdθ
,
where A ,y = {z ∈ D|ρ(η(z), η(y)) < }.
The idea behind ABC is that the summary statistics coupled with a
small tolerance should provide a good approximation of a posterior
distribution:
π (θ|y) = π (θ, z|y)dz ≈ π(θ | η(y)
summary
) .
27. Dogger Bank re-enactment
Battle of Dogger Bank on Jan 24, 1915, between British and
German fleets : how likely was the British victory?
[MacKay, Price, and Wood, 2016]
28. Dogger Bank re-enactment
Battle of Dogger Bank on Jan 24, 1915, between British and
German fleets : how likely was the British victory?
[MacKay, Price, and Wood, 2016]
29. Dogger Bank re-enactment
Battle of Dogger Bank on Jan 24, 1915, between British and
German fleets : ABC simulation of posterior distribution
[MacKay, Price, and Wood, 2016]
30. twenty years of ABC advances
Simulating from the prior is often poor in efficiency
Either modify the proposal distribution on θ to increase the density
of x’s within the vicinity of y...
[Beaumont et al., 2009; Mengersen et al., 2013; Clart´e et al., 2019]
...or by viewing the problem as a conditional density estimation
and by developing techniques to allow for larger
[Beaumont et al., 2002; Blum & Fran¸cois, 2009]
.....or even by including in the inferential framework [ABCµ]
[Ratmann et al., 2009]
31. twenty years of ABC advances
Simulating from the prior is often poor in efficiency
Either modify the proposal distribution on θ to increase the density
of x’s within the vicinity of y...
[Beaumont et al., 2009; Mengersen et al., 2013; Clart´e et al., 2019]
...or by viewing the problem as a conditional density estimation
and by developing techniques to allow for larger
[Beaumont et al., 2002; Blum & Fran¸cois, 2009]
.....or even by including in the inferential framework [ABCµ]
[Ratmann et al., 2009]
32. twenty years of ABC advances
Simulating from the prior is often poor in efficiency
Either modify the proposal distribution on θ to increase the density
of x’s within the vicinity of y...
[Beaumont et al., 2009; Mengersen et al., 2013; Clart´e et al., 2019]
...or by viewing the problem as a conditional density estimation
and by developing techniques to allow for larger
[Beaumont et al., 2002; Blum & Fran¸cois, 2009]
.....or even by including in the inferential framework [ABCµ]
[Ratmann et al., 2009]
33. twenty years of ABC advances
Simulating from the prior is often poor in efficiency
Either modify the proposal distribution on θ to increase the density
of x’s within the vicinity of y...
[Beaumont et al., 2009; Mengersen et al., 2013; Clart´e et al., 2019]
...or by viewing the problem as a conditional density estimation
and by developing techniques to allow for larger
[Beaumont et al., 2002; Blum & Fran¸cois, 2009]
.....or even by including in the inferential framework [ABCµ]
[Ratmann et al., 2009]
34. Noisily exact ABC
Idea: Modify the data from the start
˜y = y0 + ζ1
with the same scale as ABC and run ABC on ˜y
c ABC produces an exact simulation from π(θ|˜y) = π(θ|˜y)
[Dean et al., 2011; Fearnhead and Prangle, 2012]
35. Noisily exact ABC
Idea: Modify the data from the start
˜y = y0 + ζ1
with the same scale as ABC and run ABC on ˜y
c ABC produces an exact simulation from π(θ|˜y) = π(θ|˜y)
[Dean et al., 2011; Fearnhead and Prangle, 2012]
36. consistent noisy ABC
Degrading the data improves the estimation performances (!):
Noisy ABC-MLE is asymptotically (in n) consistent
under further assumptions, noisy ABC-MLE asymptotically
normal
increase in variance of order −2
likely degradation in precision or computing time due to the
lack of summary statistic [curse of dimensionality]
37. Semi-automatic ABC
Fearnhead and Prangle (2012) study noisy ABC and selection of
the summary statistic from purely inferential viewpoint and
calibrated for estimation purposes
Derivation of a well-calibrated version of ABC, i.e. an algorithm
that gives proper predictions for the distribution associated with
this randomised summary statistic [calibration constraint: ABC
approximation with same posterior mean as the true randomised
posterior]
Optimality of the posterior expectation E[θ|y] of the parameter of
interest as summary statistics η(y)!
38. Semi-automatic ABC
Fearnhead and Prangle (2012) study noisy ABC and selection of
the summary statistic from purely inferential viewpoint and
calibrated for estimation purposes
Derivation of a well-calibrated version of ABC, i.e. an algorithm
that gives proper predictions for the distribution associated with
this randomised summary statistic [calibration constraint: ABC
approximation with same posterior mean as the true randomised
posterior]
Optimality of the posterior expectation E[θ|y] of the parameter of
interest as summary statistics η(y)!
39. Fully automatic ABC
Implementation of ABC still requires input of collection of
summaries
Towards automation
statistical projection techniques (LDA, PCA, NP-GLS, &tc.)
variable selection
machine learning approaches [Raynal & al., 2017]
bypassing summaries altogether
40. ABC with Wasserstein distance
Use as distance between simulated and observed samples the
Wasserstein distance:
Wp(y1:n, z1:n)p
= inf
σ∈Sn
1
n
n
i=1
ρ(yi , zσ(i))p
generalises Kolmogorov–Smirnov
covers well- and mis-specified cases
only depends on data space distance ρ(·, ·)
covers iid and dynamic models (curve matching)
computional feasible (linear in dimension, cubic in sample size)
Hilbert curve approximation in higher dimensions
[Bernton et al., 2019]
41. consistent inference with Wasserstein distance
As ε → 0 [and n fixed]
If either
1. f
(n)
θ is n-exchangeable and D(y1:n, z1:n) = 0 if and only if
z1:n = yσ(1:n) for some σ ∈ Sn, or
2. D(y1:n, z1:n) = 0 if and only if z1:n = y1:n.
then, at y1:n fixed, ABC posterior converges strongly to posterior
as ε → 0.
[Bernton et al., 2019]
42. consistent inference with Wasserstein distance
As n → ∞ [at ε fixed]
WABC distribution with a fixed ε does not converge in n to a
Dirac mass
[Bernton et al., 2019]
43. consistent inference with Wasserstein distance
As εn → 0 and n → ∞
Under range of assumptions, if fn(εn) → 0, and
P(W(^µn, µ ) εn) → 1
then WABC posterior with threshold εn + ε satisfies
πεn+ε
{θ ∈ H : W(µ , µθ) > ε + 4εn/3 + f −1
n (εL
n/R)} |y1:n
P
δ
[Bernton et al., 2019]
44. A bivariate Gaussian illustration
100 observations from bivariate Normal with variance 1 and
covariance 0.55
Compare WABC with ABC based on (g) raw Euclidean and (o)
Euclidean distance between sample means on 106 model
simulations.
45. ABC for model choice
Motivating examples
Approximate Bayesian computation
ABC for model choice
Asymptotics of ABC
ABC under misspecification
46. Bayesian model choice
Several models M1, M2, . . . are considered simultaneously for a
dataset y and the model index M is part of the inference.
Use of a prior distribution. π(M = m), plus a prior distribution on
the parameter conditional on the value m of the model index,
πm(θm)
Goal is to derive the posterior distribution of M, challenging
computational target when models are complex.
47. Generic ABC for model choice
Algorithm 2 Likelihood-free model choice sampler (ABC-MC)
for t = 1 to T do
repeat
Generate m from the prior π(M = m)
Generate θm from the prior πm(θm)
Generate z from the model fm(z|θm)
until ρ{η(z), η(y)} <
Set m(t) = m and θ(t)
= θm
end for
[Cornuet et al., DIYABC, 2009]
49. Limiting behaviour of B12 (under sufficiency)
If η(y) sufficient statistic for both models,
fi (y|θi ) = gi (y)f η
i (η(y)|θi )
Thus
B12(y) =
Θ1
π(θ1)g1(y)f η
1 (η(y)|θ1) dθ1
Θ2
π(θ2)g2(y)f η
2 (η(y)|θ2) dθ2
=
g1(y) π1(θ1)f η
1 (η(y)|θ1) dθ1
g2(y) π2(θ2)f η
2 (η(y)|θ2) dθ2
=
g1(y)
g2(y)
Bη
12(y) .
[Didelot, Everitt, Johansen & Lawson, 2011]
c No discrepancy only when cross-model sufficiency
c Inability to evaluate loss brought by summary statistics
50. Limiting behaviour of B12 (under sufficiency)
If η(y) sufficient statistic for both models,
fi (y|θi ) = gi (y)f η
i (η(y)|θi )
Thus
B12(y) =
Θ1
π(θ1)g1(y)f η
1 (η(y)|θ1) dθ1
Θ2
π(θ2)g2(y)f η
2 (η(y)|θ2) dθ2
=
g1(y) π1(θ1)f η
1 (η(y)|θ1) dθ1
g2(y) π2(θ2)f η
2 (η(y)|θ2) dθ2
=
g1(y)
g2(y)
Bη
12(y) .
[Didelot, Everitt, Johansen & Lawson, 2011]
c No discrepancy only when cross-model sufficiency
c Inability to evaluate loss brought by summary statistics
51. A stylised problem
Central question to the validation of ABC for model choice:
When is a Bayes factor based on an insufficient statistic
T(y) consistent?
Note/warnin: c drawn on T(y) through BT
12(y) necessarily differs
from c drawn on y through B12(y)
[Marin, Pillai, X, & Rousseau, JRSS B, 2013]
52. A stylised problem
Central question to the validation of ABC for model choice:
When is a Bayes factor based on an insufficient statistic
T(y) consistent?
Note/warnin: c drawn on T(y) through BT
12(y) necessarily differs
from c drawn on y through B12(y)
[Marin, Pillai, X, & Rousseau, JRSS B, 2013]
53. A benchmark if toy example
Comparison suggested by referee of PNAS paper [thanks!]:
[X, Cornuet, Marin, & Pillai, Aug. 2011]
Model M1: y ∼ N(θ1, 1) opposed
to model M2: y ∼ L(θ2, 1/
√
2), Laplace distribution with mean θ2
and scale parameter 1/
√
2 (variance one).
Four possible statistics
1. sample mean y (sufficient for M1 if not M2);
2. sample median med(y) (insufficient);
3. sample variance var(y) (ancillary);
4. median absolute deviation mad(y) = med(|y − med(y)|);
54. A benchmark if toy example
Comparison suggested by referee of PNAS paper [thanks!]:
[X, Cornuet, Marin, & Pillai, Aug. 2011]
Model M1: y ∼ N(θ1, 1) opposed
to model M2: y ∼ L(θ2, 1/
√
2), Laplace distribution with mean θ2
and scale parameter 1/
√
2 (variance one).
q
q
q
q
q
q
q
q
q
q
q
Gauss Laplace
0.00.10.20.30.40.50.60.7
n=100
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
Gauss Laplace
0.00.20.40.60.81.0
n=100
55. Framework
Starting from sample
y = (y1, . . . , yn)
the observed sample, not necessarily iid with true distribution
y ∼ Pn
Summary statistics
T(y) = Tn
= (T1(y), T2(y), · · · , Td (y)) ∈ Rd
with true distribution Tn
∼ Gn.
56. Framework
c Comparison of
– under M1, y ∼ F1,n(·|θ1) where θ1 ∈ Θ1 ⊂ Rp1
– under M2, y ∼ F2,n(·|θ2) where θ2 ∈ Θ2 ⊂ Rp2
turned into
– under M1, T(y) ∼ G1,n(·|θ1), and θ1|T(y) ∼ π1(·|Tn
)
– under M2, T(y) ∼ G2,n(·|θ2), and θ2|T(y) ∼ π2(·|Tn
)
57. Assumptions
A collection of asymptotic “standard” assumptions:
(A1) is a standard central limit theorem under the true model with
asymptotic mean µ0
(A2) controls the large deviations of the estimator Tn
from the
model mean µ(θ)
(A3) is the standard prior mass condition found in Bayesian
asymptotics (di effective dimension of the parameter)
(A4) restricts the behaviour of the model density against the true
density
[Think CLT/BvM!]
58. Asymptotic marginals
Asymptotically, under (A1)–(A4)
mi (t) =
Θi
gi (t|θi ) πi (θi ) dθi
is such that
(i) if inf{|µi (θi ) − µ0|; θi ∈ Θi } = 0,
Cl vd−di
n mi (Tn
) Cuvd−di
n
and
(ii) if inf{|µi (θi ) − µ0|; θi ∈ Θi } > 0
mi (Tn
) = oPn [vd−τi
n + vd−αi
n ].
59. Between-model consistency
Consequence of above is that asymptotic behaviour of the Bayes
factor is driven by the asymptotic mean value µ(θ) of Tn
under
both models. And only by this mean value!
60. Between-model consistency
Consequence of above is that asymptotic behaviour of the Bayes
factor is driven by the asymptotic mean value µ(θ) of Tn
under
both models. And only by this mean value!
Indeed, if
inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} = inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0
then
Cl v
−(d1−d2)
n m1(Tn
) m2(Tn
) Cuv
−(d1−d2)
n ,
where Cl , Cu = OPn (1), irrespective of the true model.
c Only depends on the difference d1 − d2: no consistency
61. Between-model consistency
Consequence of above is that asymptotic behaviour of the Bayes
factor is driven by the asymptotic mean value µ(θ) of Tn
under
both models. And only by this mean value!
Else, if
inf{|µ0 − µ2(θ2)|; θ2 ∈ Θ2} > inf{|µ0 − µ1(θ1)|; θ1 ∈ Θ1} = 0
then
m1(Tn
)
m2(Tn
)
Cu min v
−(d1−α2)
n , v
−(d1−τ2)
n
62. Checking for adequate statistics
Run a practical check of the relevance (or non-relevance) of Tn
null hypothesis that both models are compatible with the statistic
Tn
H0 : inf{|µ2(θ2) − µ0|; θ2 ∈ Θ2} = 0
against
H1 : inf{|µ2(θ2) − µ0|; θ2 ∈ Θ2} > 0
testing procedure provides estimates of mean of Tn
under each
model and checks for equality
63. Checking in practice
Under each model Mi , generate ABC sample θi,l , l = 1, · · · , L
For each θi,l , generate yi,l ∼ Fi,n(·|ψi,l ), derive Tn
(yi,l ) and
compute
^µi =
1
L
L
l=1
Tn
(yi,l ), i = 1, 2 .
Conditionally on Tn
(y),
√
L { ^µi − Eπ
[µi (θi )|Tn
(y)]} N(0, Vi ),
Test for a common mean
H0 : ^µ1 ∼ N(µ0, V1) , ^µ2 ∼ N(µ0, V2)
against the alternative of different means
H1 : ^µi ∼ N(µi , Vi ), with µ1 = µ2 .
64. Toy example: Laplace versus Gauss
qqqqqqqqqqqqqqq
qqqqqqqqqq
q
qq
q
q
Gauss Laplace Gauss Laplace
010203040
Normalised χ2 without and with mad
65. [some] asymptotics of ABC
Motivating examples
Approximate Bayesian computation
ABC for model choice
Asymptotics of ABC
consistency of ABC posteriors
asymptotic posterior shape
asymptotic behaviour of EABC [θ]
ABC under misspecification
66. Asymptotic setup
asymptotic: y = y(n) ∼ Pn
θ and = n, n → +∞
parametric: θ ∈ Rk, k fixed
concentration of summary statistics η(zn):
∃b : θ → b(θ) η(zn
) − b(θ) = oP
θ
(1), ∀θ
Objects of interest:
posterior concentration and asymptotic shape of π (·|η(y(n)))
(normality?)
convergence of the posterior mean ^θ = EABC[θ|η(y(n))]
asymptotic acceptance rate
[Frazier et al., 2018]
67. Asymptotic setup
asymptotic: y = y(n) ∼ Pn
θ and = n, n → +∞
parametric: θ ∈ Rk, k fixed
concentration of summary statistics η(zn):
∃b : θ → b(θ) η(zn
) − b(θ) = oP
θ
(1), ∀θ
Objects of interest:
posterior concentration and asymptotic shape of π (·|η(y(n)))
(normality?)
convergence of the posterior mean ^θ = EABC[θ|η(y(n))]
asymptotic acceptance rate
[Frazier et al., 2018]
68. consistency of ABC posteriors
ABC algorithm Bayesian consistent at θ0 if for any δ > 0,
Π ( θ − θ0 > δ| η(y) − η(z) ε) → 0
as n → +∞, ε → 0
Bayesian consistency implies that sets containing θ0 have posterior
probability tending to one as n → +∞, with implication being the
existence of a specific rate of concentration
69. consistency of ABC posteriors
ABC algorithm Bayesian consistent at θ0 if for any δ > 0,
Π ( θ − θ0 > δ| η(y) − η(z) ε) → 0
as n → +∞, ε → 0
Concentration around true value and Bayesian consistency
impose less stringent conditions on the convergence speed of
tolerance n to zero, when compared with asymptotic
normality of ABC posterior
asymptotic normality of ABC posterior mean does not require
asymptotic normality of ABC posterior
70. consistency of ABC posteriors
Concentration of summary η(z): there exists b(θ) such that
η(z) − b(θ) = oP
θ
(1)
Consistency:
Π n ( θ − θ0 δ|η(y)) = 1 + op(1)
Convergence rate: there exists δn = o(1) such that
Π n ( θ − θ0 δn|η(y)) = 1 + op(1)
71. consistency of ABC posteriors
Consistency:
Π n ( θ − θ0 δ|η(y)) = 1 + op(1)
Convergence rate: there exists δn = o(1) such that
Π n ( θ − θ0 δn|η(y)) = 1 + op(1)
Point estimator consistency
^θ = EABC [θ|η(y(n)
)], EABC [θ|η(y(n)
)] − θ0 = op(1)
vn(EABC [θ|η(y(n)
)] − θ0) ⇒ N(0, v)
72. Rate of convergence
Π (·| η(y) − η(z) ε) concentrates at rate λn → 0 if
lim sup
ε→0
lim sup
n→+∞
Π ( θ − θ0 > λnM| η(y)η(z) ε) → 0
in P0-probability when M goes to infinity.
Posterior rate of concentration related to rate at which information
accumulates about true parameter vector
73. Rate of convergence
Π (·| η(y) − η(z) ε) concentrates at rate λn → 0 if
lim sup
ε→0
lim sup
n→+∞
Π ( θ − θ0 > λnM| η(y)η(z) ε) → 0
in P0-probability when M goes to infinity.
Posterior rate of concentration related to rate at which information
accumulates about true parameter vector
74. Related results
Studies on the large sample properties of ABC, with focus the
asymptotic properties of ABC point estimators
[Creel et al., 2015; Jasra, 2015; Li & Fearnhead, 2018a,b]
assumption of CLT for summary statistic plus regularity
assumptions on convergence of its density of the summary
statistics to normal limit (incl. existence of Edgeworth
expansion with exponential tail control)
convergence rate of posterior mean if εT = o(1/v
3/5
T )
acceptance probability chosen as arbitrary density
η(y) − η(z)
75. Related results
Studies on the large sample properties of ABC, with focus the
asymptotic properties of ABC point estimators
[Creel et al., 2015; Jasra, 2015; Li & Fearnhead, 2018a,b]
stronger conditions on η than [our own] weak convergence of
vT {η(z) − b(θ)}, with non-uniform polynomial deviations
excludes non compact parameter space
when εT v−1
T , only deviation bounds and weak convergence
are required (weaker than convergence of densities)
when εT = o(1/vT ) no requirement on convergence rate
76. Related results
Studies on the large sample properties of ABC, with focus the
asymptotic properties of ABC point estimators
[Creel et al., 2015; Jasra, 2015; Li & Fearnhead, 2018a,b]
oo characterisation of asymptotic behavior of ABC posterior for
all εT = o(1) with posterior concentration
oo asymptotic normality and unbiasedness of posterior mean
remain achievable even when limT vT εT = ∞, provided
εT = o(1/v
1/2
T )
77. Related results
Studies on the large sample properties of ABC, with focus the
asymptotic properties of ABC point estimators
[Creel et al., 2015; Jasra, 2015; Li & Fearnhead, 2018a,b]
Li and Fearnhead (2018a) point out that if kη > kθ 1 and if
εT = o(1/v
3/5
T ), posterior mean asymptotically normal, unbiased,
but not asymptotically efficient
if vT εT → ∞ and vT ε2
T = o(1), for kη > kθ 1,
EΠε {vT (θ − θ0)} = { θb(θ0) θb(θ0)}−1
θb(θ0) vT {η(y) − b(θ0)} + op(1).
78. Convergence when n σn [= vT ]
Under (main) assumptions
(A1) ∃σn → 0
Pθ σ−1
n η(z) − b(θ) > u c(θ)h(u), lim
u→+∞
h(u) = 0
(A2)
Π( b(θ) − b(θ0) u) uD
, u ≈ 0
posterior consistency
posterior concentration rate λn that depends on the deviation
control of d2{η(z), b(θ)}
posterior concentration rate for b(θ) bounded from below by O( n)
79. Summary statistic and (in)consistency
Consider the moving average MA(2) model
yt = et + θ1et−1 + θ2et−2, et ∼i.i.d. N(0, 1)
and
−2 θ1 2, θ1 + θ2 −1, θ1 − θ2 1.
summary statistics equal to sample autocovariances
ηj (y) = T−1
T
t=1+j
yt yt−j j = 0, 1
with
η0(y)
P
→ E[y2
t ] = 1 + (θ01)2
+ (θ02)2
and η1(y)
P
→ E[yt yt−1] = θ01(1 + θ02)
For ABC target pε (θ|η(y)) to degenerate at θ0
0 = b(θ0) − b (θ) =
1 + (θ01)2
+ (θ02)2
θ01(1 + θ02)
−
1 + (θ1)2
+ (θ2)2
θ1(1 + θ2)
must have unique solution θ = θ0
Take θ01 = .6, θ02 = .2: equation has two solutions
θ1 = .6, θ2 = .2 and θ1 ≈ .5453, θ2 ≈ .3204
80. Summary statistic and (in)consistency
Consider the moving average MA(2) model
yt = et + θ1et−1 + θ2et−2, et ∼i.i.d. N(0, 1)
and
−2 θ1 2, θ1 + θ2 −1, θ1 − θ2 1.
summary statistics equal to sample autocovariances
ηj (y) = T−1
T
t=1+j
yt yt−j j = 0, 1
with
η0(y)
P
→ E[y2
t ] = 1 + (θ01)2
+ (θ02)2
and η1(y)
P
→ E[yt yt−1] = θ01(1 + θ02)
For ABC target pε (θ|η(y)) to degenerate at θ0
0 = b(θ0) − b (θ) =
1 + (θ01)2
+ (θ02)2
θ01(1 + θ02)
−
1 + (θ1)2
+ (θ2)2
θ1(1 + θ2)
must have unique solution θ = θ0
Take θ01 = .6, θ02 = .2: equation has two solutions
θ1 = .6, θ2 = .2 and θ1 ≈ .5453, θ2 ≈ .3204
81. Concentration for the MA(2) model
True value θ0 = (0.6, 0.2)
Summaries first three autocorrelations
Tolerance proportional to εT = 1/T0.4
Rejection of normality of these posteriors
82. Asymptotic shape of posterior distribution
Shape of
Π ( · | η(y), η(z) εn)
depending on relation between εn and rate σn at which η(yn)
satisfy CLT
Three different regimes:
1. σn = o( n) −→ Uniform limit
2. σn n −→ perturbated Gaussian limit
3. σn n −→ Gaussian limit
83. New assumptions
(B1) Concentration of summary η: Σn(θ) ∈ Rk1×k1 is o(1)
Σn(θ)−1
{η(z)−b(θ)} ⇒ Nk1 (0, Id), (Σn(θ)Σn(θ0)−1
)n = Co
(B2) b(θ) is C1 and
θ − θ0 b(θ) − b(θ0)
(B3) Dominated convergence and
lim
n
Pθ(Σn(θ)−1{η(z) − b(θ)} ∈ u + B(0, un))
j un(j)
= ϕ(u)
84. main result
Set Σn(θ) = σnD(θ) for θ ≈ θ0 and
Zo = Σn(θ0)−1(η(y) − b(θ0)), then under (B1) and (B2)
when nσ−1
n → +∞
Π n [ −1
n (θ − θ0) ∈ A|y] ⇒ UB0
(A), B0 = {x ∈ Rk
; b (θ0)T
x 1}
85. main result
Set Σn(θ) = σnD(θ) for θ ≈ θ0 and
Zo = Σn(θ0)−1(η(y) − b(θ0)), then under (B1) and (B2)
when nσ−1
n → c
Π n [Σn(θ0)−1
(θ − θ0) − Zo
∈ A|y] ⇒ Qc(A), Qc = N
86. main result
Set Σn(θ) = σnD(θ) for θ ≈ θ0 and
Zo = Σn(θ0)−1(η(y) − b(θ0)), then under (B1) and (B2)
when nσ−1
n → 0 and (B3) holds, set
Vn = [b (θ0)]n
Σn(θ0)b (θ0)
then
Π n [V −1
n (θ − θ0) − ˜Zo
∈ A|y] ⇒ Φ(A),
87. Illustration in the MA(2) setting
Sample sizes of T = 500, 1000
Asymptotic normality rejected for εT = 1/T0.4 and for θ1,
T = 500 and εT = 1/T0.55
88. asymptotic behaviour of EABC [θ]
When p = dim(η(y)) = d = dim(θ) and n = o(n−3/10)
EABC [dT (θ − θ0)|yo
] ⇒ N(0, ( bo
)T
Σ−1
bo −1
[Li & Fearnhead (2018a)]
In fact, if β+1
n
√
n = o(1), with β H¨older-smoothness of π
EABC [(θ−θ0)|yo
] =
( bo)−1Zo
√
n
+
k
j=1
hj (θ0) 2j
n +op(1), 2k = β
[Fearnhead & Prangle, 2012]
89. asymptotic behaviour of EABC [θ]
When p = dim(η(y)) = d = dim(θ) and n = o(n−3/10)
EABC [dT (θ − θ0)|yo
] ⇒ N(0, ( bo
)T
Σ−1
bo −1
[Li & Fearnhead (2018a)]
Iterating for fixed p mildly interesting: if
˜η(y) = EABC [θ|yo
]
then
EABC [θ|˜η(y)] = θ0 +
( bo)−1Zo
√
n
+
π (θ0)
π(θ0)
2
n + o()
[Fearnhead & Prangle, 2012]
90. Practical implications
In practice, tolerance determined by quantile (nearest neighbours):
Select all θi associated with the α = δ/N smallest distances
d2{η(zi ), η(y)} for some δ
Then
(i) if εT v−1
T or εT = o(v−1
T ), acceptance rate associated with
the threshold εT is
αT = pr ( η(z) − η(y) εT ) (vT εT )kη
× v−kθ
T v−kθ
T
(ii) if εT v−1
T ,
αT = pr ( η(z) − η(y) εT ) εkθ
T v−kθ
T
91. Practical implications
In practice, tolerance determined by quantile (nearest neighbours):
Select all θi associated with the α = δ/N smallest distances
d2{η(zi ), η(y)} for some δ
Then
(i) if εT v−1
T or εT = o(v−1
T ), acceptance rate associated with
the threshold εT is
αT = pr ( η(z) − η(y) εT ) (vT εT )kη
× v−kθ
T v−kθ
T
(ii) if εT v−1
T ,
αT = pr ( η(z) − η(y) εT ) εkθ
T v−kθ
T
92. Curse of dimensionality
For reasonable statistical behavior, rate of decline of αT the faster
the larger the dimension of θ, kθ, but unaffected by dimension of
η, kη
Theoretical justification for dimension reduction methods that
process parameter components individually and independently of
other components
[Fearnhead & Prangle, 2012; Martin & al., 2016]
importance sampling approach of Li & Fearnhead (2018a) yields
acceptance rates αT = O(1), when εT = O(1/vT )
93. Curse of dimensionality
For reasonable statistical behavior, rate of decline of αT the faster
the larger the dimension of θ, kθ, but unaffected by dimension of
η, kη
Theoretical justification for dimension reduction methods that
process parameter components individually and independently of
other components
[Fearnhead & Prangle, 2012; Martin & al., 2016]
importance sampling approach of Li & Fearnhead (2018a) yields
acceptance rates αT = O(1), when εT = O(1/vT )
94. Monte Carlo error
Link the choice of εT to Monte Carlo error associated with NT
draws in ABC Algorithm
Conditions (on εT ) under which
^αT = αT {1 + op(1)}
where ^αT = NT
i=1 1l [d{η(y), η(z)} εT ] /NT proportion of
accepted draws from NT simulated draws of θ
Either
(i) εT = o(v−1
T ) and (vT εT )−kη ε−kθ
T MNT
or
(ii) εT v−1
T and ε−kθ
T MNT
for M large enough;
95. conclusion on ABC consistency
asymptotic description of ABC: different regimes depending
on n & σn
no point in choosing n arbitrarily small: just n = o(σn)
no asymptotic gain in iterative ABC
results under weak(er) conditions by not studying g(η(z)|θ)
97. Illustration
Assumed data generating process (DGP) is z ∼ N(θ, 1) but actual
DGP is y ∼ N(θ, ˜σ2)
Use of summaries
sample mean η1(y) = 1
n
n
i=1 yi
centered summary η2(y) = 1
n−1
n
i=1(yi − η1(y))2 − 1
Three ABC:
ABC-AR: accept/reject approach with
K (d{η(z), η(y)}) = 1l [d{η(z), η(y)} ] and
d{x, y} = x − y
ABC-K: smooth rejection approach, with K (d{η(z), η(y)})
univariate Gaussian kernel
ABC-Reg: post-processing ABC approach with weighted linear
regression adjustment
98. Illustration
Posterior means for ABC-AR, ABC-K and ABC-Reg as
misspecification σ2 increases (N = 50, 000 simulated data sets)
αn = n−5/9 quantile for ABC-AR
ABC-K and ABC-Reg bandwidth of n−5/9
99. Framework
Data y with true distribution P0
Model P := {θ ∈ Θ ⊂ Rkθ : Pθ}
Summary statistic η(y) = (η1(y), ..., ηkη (y))
Misspecification
inf
θ∈Θ
D(P0||Pθ) = inf
θ∈Θ
− log
dP0(y)
dPθ(y)
dP0(y) > 0,
with
θ∗
= arg inf
θ∈Θ
D(P0||Pθ)
[Muller, 2013]
ABC misspecification
for b0 (resp. b(θ)) limit of η(y) (resp. η(z))
inf
θ∈Θ
d{b0, b(θ)} > 0
ABC pseudo-true value
θ∗
= arg inf
θ∈Θ
d{b0, b(θ)}.
100. Compulsory tolerance
Under standard identification conditions on b(·) ∈ Rkη , there exists
∗ such that
∗
= inf
θ∈Θ
d{b0, b(θ)} > 0
c For tolerances n = o(1), once n < ∗ no draw of θ to be
selected and posterior Π [A|η(y)] ill-conditioned
But appropriately chosen tolerance sequence ( n)n allows
ABC-based posterior to concentrate on distance-dependent
pseudo-true value θ∗
101. Compulsory tolerance
Under standard identification conditions on b(·) ∈ Rkη , there exists
∗ such that
∗
= inf
θ∈Θ
d{b0, b(θ)} > 0
c For tolerances n = o(1), once n < ∗ no draw of θ to be
selected and posterior Π [A|η(y)] ill-conditioned
But appropriately chosen tolerance sequence ( n)n allows
ABC-based posterior to concentrate on distance-dependent
pseudo-true value θ∗
102. ABC posterior concentration under misspecification
Assumptions
(A0) There exist continuous b : Θ → B and decreasing ρn(·) such
that ρn(u) → 0 at infinity and
Pθ [d{η(θ), b(θ)} > u] c(θ)ρn(u),
Θ
c(θ)dΠ(θ) < +∞
with existence of either
(i) Polynomial deviations: positive sequence vn → +∞ and
u0, κ > 0 such that ρn(u) = v−κ
n u−κ
, for u u0.
(ii) Exponential deviations: hθ(·) > 0 such that
Pθ[d{η(z), b(θ)} > u] c(θ)e−hθ(uvn)
and there exist c, C > 0
such that
Θ
c(θ)e−hθ(uvn)
dΠ(θ) Ce−c(uvn)τ
103. ABC posterior concentration under misspecification
Assumptions
(A1) There exist D > 0 and M0, δ0 > 0 such that, for all
δ0 δ > 0 and M M0, there exists
Sδ ⊂ {θ ∈ Θ : d{b(θ), b0} − ∗ δ} for which
In case (i), D < κ and
Sδ
1 −
c(θ)
M
dΠ(θ) δD
.
In case (ii),
Sδ
1 − c(θ)e−hθ(M)
dΠ(θ) δD
.
104. ABC posterior concentration under misspecification
Under (A0) and (A1), for n ↓ ∗ with
n
∗
+ Mv−1
n + v−1
0,n
if Mn sequence going to infinity and
δn Mn{( n − ∗) + ρn( n − ∗)}, then
Π [d{b(θ), b0} ∗
+ δn|η(y)] = oP0 (1),
provided
ρn( n − ∗
) ( n − ∗
)−D/κ
in case (i)
ρn( n − ∗
) | log( n − ∗
)|1/τ
in case (ii)
Further
Π [d{θ, θ∗
} > δ|η(y)] = oP0 (1)
[Bernton & al., 2019; Frazier & al., 2018]
105. ABC posterior concentration under misspecification
Under (A0) and (A1), for n ↓ ∗ with
n
∗
+ Mv−1
n + v−1
0,n
if Mn sequence going to infinity and
δn Mn{( n − ∗) + ρn( n − ∗)}, then
Π [d{b(θ), b0} ∗
+ δn|η(y)] = oP0 (1),
provided
ρn( n − ∗
) ( n − ∗
)−D/κ
in case (i)
ρn( n − ∗
) | log( n − ∗
)|1/τ
in case (ii)
Further
Π [d{θ, θ∗
} > δ|η(y)] = oP0 (1)
[Bernton & al., 2019; Frazier & al., 2018]
106. ABC asymptotic posterior
Under further assumptions
existence of positive definite matrix Σn(θ0) such that for all
θ − θ0 δ and all 0 < u δvn
Pθ [ Σn(θ0){η(z) − b(θ)} > u] c0u−κ
parametric CLT: existence of sequence of positive definite
matrices Σn(θ) such that for all θ in a neighbourhood of θ0
Σn(θ) ∗ vn, with vn → +∞ and
Σn(θ){η(z) − b(θ)} ⇒ N(0, Ikη ),
classical concentration assumptions
If limn vn( n − ∗) = 0, then for Z0
n = Σn(θ∗){η(y) − b0} and
Φkη (·) standard Normal
lim
n→+∞
Π Σn(θ∗
){b(θ) − b(θ∗
)} − Z0
n ∈ B|η(y) = Φkη (B)
107. Accept/Reject ABC revisited
Given link between n and αn in Frazier & al. (2019), posterior
concentration achieved through sequence of αn converging to zero
ABC algorithm based on appropriate αn always yields (under
appropriate regularity) approach that concentrates asymptotically,
regardless of misspecification.
(i) if ( n − ∗) v−1
n or ( n − ∗) = o(v−1
n ), then
αn = Pr ( η(z) − η(y) n) (vn{ n − ∗
})kη
× v−kθ
n v−kθ
n
(ii) if ( n − ∗) v−1
n , then
αn = Pr ( η(z) − η(y) n) { n − ∗
}kθ
v−kθ
n
108. Regression adjustment
In case of scalar θ, ABC-Reg
runs ABC-AR, with tolerance n, and obtains selected draws
and summaries {θi , η(zi )}
uses linear regression to predict accepted values of θ from
η(z) through
θi
= α + β {η(y) − η(zi
)} + νi
defines adjusted parameter draw as
˜θi
= θi
+ ^β {η(y) − η(zi
)}
109. ABC-Reg pseudo-consistency
If
n
∗
+ Mv−1
n + v−1
0,n ,
with
(i) ∗ = infθ∈Θ d{b(θ), b0} > 0
(ii) θ∗ = arg infθ∈Θ d{b(θ), b0} exists.
(iii) For some β0 with β0 > 0, ^β − β0 = oPθ
(1).
then
Π [|˜θ − ˜θ∗
| > δ|η(y)] = oP0 (1),
[Frazier & al., 2019]
The End
110. ABC-Reg pseudo-consistency
If
n
∗
+ Mv−1
n + v−1
0,n ,
with
(i) ∗ = infθ∈Θ d{b(θ), b0} > 0
(ii) θ∗ = arg infθ∈Θ d{b(θ), b0} exists.
(iii) For some β0 with β0 > 0, ^β − β0 = oPθ
(1).
then
Π [|˜θ − ˜θ∗
| > δ|η(y)] = oP0 (1),
[Frazier & al., 2019]
The End