This document presents Arthur Dempster's approach to statistical inference, in the framework of the Dempster–Shafer theory of belief functions, applied to count data. The central computational task is to sample uniformly from the set Rx of auxiliary variables u that are consistent with the observed data, i.e. for which the feasible set of parameter values θ is non-empty. Rx can be represented using inequalities between ratios of components of u, related to minimum path weights in a graph. This representation yields conditional distributions for blocks of components of u, enabling a Gibbs sampler that iteratively samples from the uniform distribution on Rx.
Monte Carlo Methods for Not Quite Bayesian Inference
1. Monte Carlo methods for some not-quite-but-almost Bayesian problems
Pierre E. Jacob
Department of Statistics, Harvard University
joint work with
Ruobin Gong, Paul T. Edlefsen, Arthur P. Dempster
John O’Leary, Yves F. Atchadé, Niloy Biswas, Paul Vanetti
and others
Pierre E. Jacob Monte Carlo for not quite Bayes
2. Introduction
A lot of questions in statistics give rise to non-trivial
computational problems.
Among these, some are numerical integration problems,
equivalently problems of sampling from probability distributions.
Besag, Markov chain Monte Carlo for statistical inference, 2001.
Computational challenges arise in deviations from standard
Bayesian inference, motivated here by three questions:
quantifying ignorance,
model misspecification,
robustness to some perturbations of the data.
3. Outline
1 Dempster–Shafer analysis of count data
2 Unbiased MCMC and diagnostics of convergence
3 Modular Bayesian inference
4 Bagging posterior distributions
5. Inference with count data
Notation : [N] := {1, . . . , N}. Simplex ∆.
Observations : xn ∈ [K] := {1, . . . , K}, x = (x1, . . . , xN ).
Index sets : Ik = {n ∈ [N] : xn = k}.
Counts : Nk = |Ik|.
Model: xn ∼ Categorical(θ), i.i.d., with θ = (θk)k∈[K],
i.e. P(xn = k) = θk for all n, k.
Goal: estimate θ, predict, etc.
Maximum likelihood estimator: θ̂k = Nk/N.
Bayesian inference combines likelihood with prior on θ into a
posterior distribution, assigning a probability ∈ [0, 1] to any
measurable subset of the simplex ∆.
6. Sampling from a Categorical distribution
[Figure: simplex with vertices 1, 2, 3, partitioned into subsimplices ∆1(θ), ∆2(θ), ∆3(θ) meeting at θ.]
Subsimplex ∆k(θ), for θ ∈ ∆:
{z ∈ ∆ : ∀ℓ ∈ [K], zℓ/zk ≥ θℓ/θk}.
Sampling mechanism, for θ ∈ ∆:
- draw un uniform on ∆,
- define xn such that un ∈ ∆xn (θ),
denoted also xn = m(un, θ).
Then P(xn = k) = θk,
because Vol(∆k(θ)) = θk.
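The sampling mechanism can be sketched numerically: un ∈ ∆xn(θ) is equivalent to xn = argminℓ un,ℓ/θℓ, and uniform draws on ∆ are Dirichlet(1, . . . , 1). A minimal check in Python (the function name m follows the slides; everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def m(u, theta):
    # u lies in the subsimplex Delta_k(theta) iff k minimizes u_l / theta_l
    return int(np.argmin(u / theta))

theta = np.array([0.5, 0.3, 0.2])
N = 100_000
# uniform draws on the simplex = Dirichlet(1, ..., 1)
us = rng.dirichlet(np.ones(3), size=N)
xs = np.array([m(u, theta) for u in us])
freqs = np.bincount(xs, minlength=3) / N
print(freqs)  # close to theta, since Vol(Delta_k(theta)) = theta_k
```

The empirical frequencies approach θ as N grows, consistent with Vol(∆k(θ)) = θk.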
7. Arthur Dempster’s approach to inference
Observations x = (xn)n∈[N] are fixed.
If we draw u1, . . . , uN uniformly on ∆, there might exist θ ∈ ∆ such that
∀n ∈ [N] xn = m(un, θ),
or such a θ might not exist.
Arthur P. Dempster. New methods for reasoning towards posterior
distributions based on sample data. Annals of Mathematical Statistics, 1966.
Arthur P. Dempster. Statistical inference from a Dempster–Shafer
perspective. Past, Present, and Future of Statistical Science, 2014.
8. Draws in the simplex
Counts: (2, 3, 1). Let’s draw N = 6 uniform samples on ∆.
[Figure: six points drawn uniformly on the simplex.]
9. Draws in the simplex
Each un is associated with an observed xn ∈ {1, 2, 3}.
[Figure: the same points, labeled by category.]
10. Draws in the simplex
If there exists a feasible θ, it cannot be just anywhere.
[Figure: draws on the simplex.]
11. Draws in the simplex
The uns of each category add constraints on θ.
[Figure: constraints induced by the draws of each category.]
12. Draws in the simplex
Overall the constraints define a polytope for θ, or an empty set.
[Figure: the resulting polytope.]
13. Draws in the simplex
Here, there is a polytope of θ such that ∀n ∈ [N] xn = m(un, θ).
[Figure: the polytope of feasible θ.]
14. Draws in the simplex
Any θ in the polytope separates the uns appropriately.
[Figure: a θ in the polytope and the induced partition of the draws.]
15. Draws in the simplex
Let’s try again with fresh uniform samples on ∆.
[Figure: fresh draws on the simplex.]
16. Draws in the simplex
Here there is no θ ∈ ∆ such that ∀n ∈ [N] xn = m(un, θ).
[Figure: the fresh draws, for which no feasible θ exists.]
17. Lower and upper probabilities
Consider the set
Rx = {(u1, . . . , uN) ∈ ∆N : ∃θ ∈ ∆, ∀n ∈ [N], xn = m(un, θ)},
and denote by νx the uniform distribution on Rx.
For u ∈ Rx, there is a set F(u) = {θ ∈ ∆ : ∀n, xn = m(un, θ)}.
For a set Σ ⊂ ∆ of interest, define
(lower probability) P(Σ) = ∫ 1(F(u) ⊆ Σ) νx(du),
(upper probability) ¯P(Σ) = ∫ 1(F(u) ∩ Σ ≠ ∅) νx(du).
18. Summary and Monte Carlo problem
Arthur Dempster’s approach, later called Dempster–Shafer
theory of belief functions, is based on a distribution of
feasible sets,
F(u) = {θ ∈ ∆ : ∀n ∈ [N], xn = m(un, θ)},
where u ∼ νx, the uniform distribution on Rx.
How do we obtain samples from this distribution?
Rejection sampling? The rejection rate is around 99%, already for data (2, 3, 1).
Hit-and-run algorithm?
Our proposed strategy is a Gibbs sampler. Starting from
some u ∈ Rx, we will iteratively refresh some components
un of u given others.
19. Gibbs sampler: initialization
We can obtain some u in Rx as follows.
Choose an arbitrary θ ∈ ∆.
For all n ∈ [N] sample un uniformly in ∆k(θ) where xn = k.
[Figure: θ and the draws un in the subsimplices ∆k(θ).]
To sample components un given
others, we will express Rx,
{u : ∃θ ∀n xn = m(un, θ)}
in terms of relations that the
components un must satisfy with
respect to one another.
20. Equivalent representation
For any θ ∈ ∆,
∀n ∈ [N], xn = m(un, θ)
⇔ ∀k ∈ [K], ∀n ∈ Ik, un ∈ ∆k(θ)
⇔ ∀k ∈ [K], ∀n ∈ Ik, ∀ℓ ∈ [K], un,ℓ/un,k ≥ θℓ/θk,
because ∆k(θ) = {z ∈ ∆ : ∀ℓ ∈ [K], zℓ/zk ≥ θℓ/θk}.
This is equivalent to
∀k, ℓ ∈ [K], min over n ∈ Ik of un,ℓ/un,k ≥ θℓ/θk.
Therefore, denoting ηk→ℓ = min over n ∈ Ik of un,ℓ/un,k, we can write
Rx = {u ∈ ∆N : ∃θ ∈ ∆, ∀k, ℓ ∈ [K], θℓ/θk ≤ ηk→ℓ}.
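The quantities ηk→ℓ = min over n ∈ Ik of un,ℓ/un,k can be computed in one pass; a sketch in Python (the helper name eta_matrix is illustrative, not from the accompanying package):

```python
import numpy as np

def eta_matrix(us, xs, K):
    # eta[k, l] = eta_{k -> l} = min over n in I_k of u_{n,l} / u_{n,k}
    eta = np.full((K, K), np.inf)   # +inf when category k is empty (no constraint)
    for k in range(K):
        uk = us[xs == k]            # draws whose observation is category k
        if len(uk) > 0:
            eta[k] = np.min(uk / uk[:, [k]], axis=0)
    return eta

# small check on data generated from the sampling mechanism of earlier slides
rng = np.random.default_rng(0)
theta = np.array([0.5, 0.3, 0.2])
us = rng.dirichlet(np.ones(3), size=500)
xs = np.argmin(us / theta, axis=1)  # x_n = m(u_n, theta)
eta = eta_matrix(us, xs, 3)
```

By construction eta[k, ℓ] ≥ θℓ/θk here, with the inequalities tightening as N grows.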
21. Linear constraints
Counts: (9, 8, 3), u ∈ Rx.
The values ηk→ℓ = min over n ∈ Ik of un,ℓ/un,k define linear constraints on θ.
[Figure: draws on the simplex and the lines θ3/θ1 = η1→3 and θ2/θ1 = η1→2.]
22. Some inequalities
Next, assume u ∈ Rx, write ηk→ℓ = min over n ∈ Ik of un,ℓ/un,k, and
consider some implications.
There exists θ ∈ ∆ such that θℓ/θk ≤ ηk→ℓ for all k, ℓ ∈ [K].
Then, for all k, ℓ,
θℓ/θk ≤ ηk→ℓ and θk/θℓ ≤ ηℓ→k, thus ηk→ℓ ηℓ→k ≥ 1.
23. More inequalities
We can continue, if K ≥ 3: for all k, ℓ, j,
1/ηℓ→k ≤ θℓ/θk = (θℓ/θj)(θj/θk) ≤ ηj→ℓ ηk→j,
thus ηk→j ηj→ℓ ηℓ→k ≥ 1.
And if K ≥ 4, for all k, ℓ, j, m,
ηk→j ηj→ℓ ηℓ→m ηm→k ≥ 1.
Generally,
∀L ∈ [K], ∀j1, . . . , jL ∈ [K], ηj1→j2 ηj2→j3 · · · ηjL→j1 ≥ 1.
24. Main result
So far: if ∃θ ∈ ∆ such that θℓ/θk ≤ ηk→ℓ for all k, ℓ ∈ [K], then
∀L ∈ [K], ∀j1, . . . , jL ∈ [K], ηj1→j2 ηj2→j3 · · · ηjL→j1 ≥ 1.
The reverse implication holds too.
This means
Rx = {u : ∃θ ∈ ∆, ∀k, ℓ ∈ [K], θℓ/θk ≤ ηk→ℓ}
= {u : ∀L ∈ [K], ∀j1, . . . , jL ∈ [K], ηj1→j2 ηj2→j3 · · · ηjL→j1 ≥ 1},
i.e. Rx is represented by relations between the components (un).
This helps in computing conditional distributions under νx,
leading to a Gibbs sampler.
25. Some remarks on these inequalities
∀L ∈ [K] ∀j1, . . . , jL ∈ [K] ηj1→j2 ηj2→j3 . . . ηjL→j1 ≥ 1.
We can restrict attention to pairwise distinct indices j1, . . . , jL,
since the other cases can be deduced from those.
Example: η1→2η2→4η4→3η3→2η2→1 ≥ 1,
follows from η1→2η2→1 ≥ 1 and η2→4η4→3η3→2 ≥ 1.
The indices j1 → j2 → · · · → jL → j1 form a cycle.
26. Graphs
Fully connected graph with weight log ηk→ℓ on edge (k, ℓ).
[Figure: complete graph on vertices 1, 2, 3, with weights such as log(η1→2) and log(η2→1) on the edges.]
Value of a path = sum of the weights along the path.
Negative cycle = path from a vertex to itself with negative value.
27. Graphs
∀L, ∀j1, . . . , jL: ηj1→j2 · · · ηjL→j1 ≥ 1
⇔ ∀L, ∀j1, . . . , jL: log(ηj1→j2) + · · · + log(ηjL→j1) ≥ 0
⇔ there are no negative cycles in the graph.
[Figure: complete graph on vertices 1, 2, 3 with edge weights log(ηk→ℓ).]
28. Summary (wake up)
We want to sample uniformly on the set Rx,
Rx = {u : ∃θ ∈ ∆, ∀k, ℓ ∈ [K], θℓ/θk ≤ ηk→ℓ}.
We have claimed that this set can also be written
{u : ∀L ∈ [K] ∀j1, . . . , jL ∈ [K] ηj1→j2 ηj2→j3 . . . ηjL→j1 ≥ 1}.
The inequalities hold if and only if
there are no negative cycles
in a fully connected graph with K vertices
and weight log ηk→ℓ on edge (k, ℓ), for all k, ℓ ∈ [K].
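The membership test u ∈ Rx thus reduces to negative-cycle detection. A sketch with a hand-rolled Bellman–Ford relaxation (the function name in_Rx is illustrative; the slides use igraph's implementation in R):

```python
import numpy as np

def in_Rx(eta):
    # u is in R_x iff the complete graph with weight log(eta[k, l]) on
    # edge k -> l has no negative cycle. Bellman-Ford from a virtual
    # source connected to every vertex with weight 0.
    K = eta.shape[0]
    w = np.log(eta)        # +inf entries (empty categories) are harmless
    dist = np.zeros(K)
    for _ in range(K):
        updated = False
        for k in range(K):
            for l in range(K):
                if k != l and dist[k] + w[k, l] < dist[l] - 1e-12:
                    dist[l] = dist[k] + w[k, l]
                    updated = True
        if not updated:
            return True    # relaxation converged: no negative cycle
    return False           # still relaxing after K passes: negative cycle
```

For instance eta = [[1, 0.5], [0.5, 1]] violates η1→2 η2→1 ≥ 1 and is rejected.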
29. Proof
Proof of claim: “inequalities” ⇒ “∃θ : θℓ/θk ≤ ηk→ℓ for all k, ℓ”.
Let min(k → ℓ) := minimum value of a path from k to ℓ in the graph.
It is finite for all k, ℓ, because of the absence of negative cycles in the graph.
Define θ via θk ∝ exp(min(K → k)).
Then θ ∈ ∆. Furthermore, for all k, ℓ,
min(K → ℓ) ≤ min(K → k) + log(ηk→ℓ),
therefore θℓ/θk ≤ ηk→ℓ.
30. Conditional distributions
We can obtain the conditional distribution of (un)n∈Ik given
(un)n∉Ik with respect to νx:
the un for n ∈ Ik are i.i.d. uniform in ∆k(θ⋆),
where θ⋆ℓ ∝ exp(−min(ℓ → k)) for all ℓ,
with min(ℓ → k) := minimum value of a path from ℓ to k.
Shortest paths can be computed in polynomial time.
31. Conditional distributions
Counts: (9, 8, 3). What is the conditional distribution of
(un)n∈Ik given (un)n∉Ik under νx?
[Figure: draws on the simplex.]
35. Gibbs sampler
Initial u(0) ∈ Rx.
At each iteration t ≥ 1, for each category k ∈ [K],
1 Compute θ⋆ such that, for n ∈ Ik,
un given the other components is uniform on ∆k(θ⋆).
2 Draw u(t)n uniformly on ∆k(θ⋆) for n ∈ Ik.
3 Update η(t)k→ℓ for ℓ ∈ [K].
In step 1, θ⋆ is obtained by computing shortest paths in the graph
with weights log η(t)k→ℓ on edges (k, ℓ).
Computed e.g. with the Bellman–Ford algorithm, implemented in
Csárdi & Nepusz, igraph package, 2006.
Alternatively, we can compute θ⋆ by solving a linear program,
Berkelaar, Eikland & Notebaert, lpsolve package, 2004.
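A full sweep can be sketched as follows, recomputing the η matrix and all shortest paths at each update (Floyd–Warshall here, for brevity, where the slides mention Bellman–Ford). To sample uniformly in ∆k(θ⋆), note from its definition that ∆k(θ⋆) is the simplex with vertices θ⋆ and {eℓ : ℓ ≠ k}, so Dirichlet weights can be pushed through those vertices. All function names are illustrative; the authors' dempsterpolytope package (in R) differs:

```python
import numpy as np

def eta_matrix(us, xs, K):
    # eta[k, l] = min over n with x_n = k of u_{n,l} / u_{n,k}
    eta = np.full((K, K), np.inf)
    for k in range(K):
        uk = us[xs == k]
        if len(uk) > 0:
            eta[k] = np.min(uk / uk[:, [k]], axis=0)
    return eta

def min_paths(logw):
    # Floyd-Warshall: d[a, b] = minimum path value from a to b
    d = logw.copy()
    np.fill_diagonal(d, 0.0)
    for j in range(d.shape[0]):
        d = np.minimum(d, d[:, [j]] + d[[j], :])
    return d

def sample_in_Delta_k(theta, k, rng):
    # Delta_k(theta) is the simplex with vertices theta and {e_l : l != k}
    K = len(theta)
    verts = np.eye(K)
    verts[k] = theta
    return rng.dirichlet(np.ones(K)) @ verts

def gibbs_sweep(us, xs, K, rng):
    for k in range(K):
        eta = eta_matrix(us, xs, K)
        eta[k, :] = np.inf          # drop edges out of k: its draws are refreshed
        d = min_paths(np.log(eta))
        theta_star = np.exp(-d[:, k])       # theta*_l prop. to exp(-min(l -> k))
        theta_star /= theta_star.sum()
        for n in np.where(xs == k)[0]:
            us[n] = sample_in_Delta_k(theta_star, k, rng)
    return us

# one sweep on a small data set generated from the mechanism (illustrative)
rng = np.random.default_rng(0)
theta0 = np.array([0.5, 0.3, 0.2])
us = rng.dirichlet(np.ones(3), size=30)
xs = np.argmin(us / theta0, axis=1)
us = gibbs_sweep(us, xs, 3, rng)
```

Recomputing η from scratch at each step is wasteful but keeps the sketch short; the slides update η incrementally.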
36. Gibbs sampler
Counts: (9, 8, 3), 100 polytopes generated by the sampler.
[Figure: 100 polytopes overlaid on the simplex.]
37. Cost per iteration
Cost in seconds for 100 full sweeps.
[Figure: elapsed time (seconds) against K ∈ {4, 8, 12, 16}, for N ∈ {256, 512, 1024, 2048}.]
https://github.com/pierrejacob/dempsterpolytope
38. Cost per iteration
Cost in seconds for 100 full sweeps.
[Figure: elapsed time (seconds) against N ∈ {256, 512, 1024, 2048}, for K ∈ {4, 8, 12, 16}.]
https://github.com/pierrejacob/dempsterpolytope
39. How many iterations for convergence?
Let ν(t) be the distribution of u(t) after t iterations.
TV(ν(t), νx) = supA |ν(t)(A) − νx(A)|.
[Figure: TV upper bounds against iteration, for K ∈ {5, 10, 20}.]
40. How many iterations for convergence?
Let ν(t) be the distribution of u(t) after t iterations.
TV(ν(t), νx) = supA |ν(t)(A) − νx(A)|.
[Figure: TV upper bounds against iteration, for N ∈ {50, 100, 150, 200}.]
41. Summary
A Gibbs sampler can be used to approximate lower and upper
probabilities in the Dempster–Shafer framework.
Is perfect sampling possible here?
Extensions for hierarchical counts, hidden Markov models?
Jacob, Gong, Edlefsen & Dempster, A Gibbs sampler for a class of
random convex polytopes. On arXiv and researchers.one.
https://github.com/pierrejacob/dempsterpolytope
42. Outline
1 Dempster–Shafer analysis of count data
2 Unbiased MCMC and diagnostics of convergence
3 Modular Bayesian inference
4 Bagging posterior distributions
43. Coupled chains
Glynn & Rhee, Exact estimation for MC equilibrium expectations, 2014.
Generate two chains (Xt) and (Yt), going to π, as follows:
sample X0 and Y0 from π0 (independently, or not),
sample Xt|Xt−1 ∼ P(Xt−1, ·) for t = 1, . . . , L,
for t ≥ L + 1, sample
(Xt, Yt−L)|(Xt−1, Yt−L−1) ∼ ¯P ((Xt−1, Yt−L−1), ·).
¯P must be such that
Xt+1|Xt ∼ P(Xt, ·) and Yt|Yt−1 ∼ P(Yt−1, ·)
(thus Xt and Yt have the same distribution for all t ≥ 0),
there exists a random time τ such that Xt = Yt−L for t ≥ τ
(the chains meet and remain “faithful”).
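A minimal sketch of such lagged coupled chains, for the example on the next slide (π = N(0, 1), random-walk Metropolis–Hastings, π0 = N(10, 3²)): the proposals are drawn from a maximal coupling, and a common uniform decides both acceptances. Names and structure are illustrative, not the authors' implementation:

```python
import numpy as np

def log_pi(x):
    return -0.5 * x * x          # target pi = N(0, 1), up to a constant

def rwmh(x, sigma, rng):
    # one standard random-walk Metropolis-Hastings step
    xp = rng.normal(x, sigma)
    return xp if np.log(rng.uniform()) < log_pi(xp) - log_pi(x) else x

def max_coupling(mu1, mu2, sigma, rng):
    # maximal coupling of N(mu1, sigma^2) and N(mu2, sigma^2)
    logq = lambda z, mu: -0.5 * ((z - mu) / sigma) ** 2
    x = rng.normal(mu1, sigma)
    if np.log(rng.uniform()) + logq(x, mu1) <= logq(x, mu2):
        return x, x
    while True:
        y = rng.normal(mu2, sigma)
        if np.log(rng.uniform()) + logq(y, mu2) > logq(y, mu1):
            return x, y

def coupled_step(x, y, sigma, rng):
    # propose from a maximal coupling, accept with a common uniform
    xp, yp = max_coupling(x, y, sigma, rng)
    logu = np.log(rng.uniform())
    x = xp if logu < log_pi(xp) - log_pi(x) else x
    y = yp if logu < log_pi(yp) - log_pi(y) else y
    return x, y

def meeting_time(L, sigma, rng):
    x = rng.normal(10.0, 3.0)    # X_0 ~ pi_0 = N(10, 3^2)
    y = rng.normal(10.0, 3.0)    # Y_0 ~ pi_0
    for _ in range(L):           # advance X alone for L steps
        x = rwmh(x, sigma, rng)
    t = L
    while x != y:                # the chains are faithful once equal
        x, y = coupled_step(x, y, sigma, rng)
        t += 1
    return t                     # tau: first t with X_t = Y_{t-L}

rng = np.random.default_rng(1)
taus = [meeting_time(L=1, sigma=0.5, rng=rng) for _ in range(20)]
```

Once the chains are equal, the maximal coupling returns identical proposals and the common uniform keeps them together, so the loop terminates at the meeting time.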
44. Coupled chains
[Figure: trajectories of the coupled chains against iteration.]
π = N(0, 1), RWMH with Normal proposal with std 0.5, π0 = N(10, 3²).
45. Unbiased estimators
Under some conditions, the estimator
(m − k + 1)⁻¹ Σ from t = k to m of h(Xt)
+ (m − k + 1)⁻¹ Σ from t = k + L to τ − 1 of min(m − k + 1, ⌈(t − k)/L⌉) (h(Xt) − h(Yt−L))
has expectation ∫ h(x) π(dx), finite cost and finite variance.
“MCMC estimator + bias correction terms”
Its efficiency can be close to that of MCMC estimators,
if k, m are chosen appropriately (and L also).
Jacob, O’Leary & Atchadé, Unbiased MCMC with couplings, 2019.
46. Finite-time bias of MCMC
Total variation distance between Xt ∼ πt and π = limt→∞ πt:
‖πt − π‖TV ≤ E[max(0, ⌈(τ − L − t)/L⌉)].
[Figure: histogram of τ − lag, for lag = 1; TV upper bounds against iteration.]
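Given realized meeting times, the bound can be evaluated empirically; a small sketch (the name tv_upper_bound and the meeting times below are illustrative):

```python
import numpy as np

def tv_upper_bound(taus, L, t):
    # empirical version of E[max(0, ceil((tau - L - t) / L))]
    taus = np.asarray(taus, dtype=float)
    return float(np.mean(np.maximum(0.0, np.ceil((taus - L - t) / L))))

# hypothetical meeting times, e.g. collected from lag-L coupled chains
taus = [120, 85, 210, 150]
bounds = [tv_upper_bound(taus, L=1, t=t) for t in range(0, 250, 50)]
```

The resulting bounds are non-increasing in t and reach zero once t exceeds all observed τ − L.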
47. Finite-time bias of MCMC
Total variation distance between Xt ∼ πt and π = limt→∞ πt:
‖πt − π‖TV ≤ E[max(0, ⌈(τ − L − t)/L⌉)].
[Figure: histogram of τ − lag, for lag = 50; TV upper bounds against iteration.]
48. Finite-time bias of MCMC
Total variation distance between Xt ∼ πt and π = limt→∞ πt:
‖πt − π‖TV ≤ E[max(0, ⌈(τ − L − t)/L⌉)].
[Figure: histogram of τ − lag, for lag = 100; TV upper bounds against iteration.]
49. Finite-time bias of MCMC
Upper bounds can also be obtained for e.g. 1-Wasserstein.
And perhaps lower bounds?
Applicable in e.g. high-dimensional and/or discrete spaces.
Biswas, Jacob & Vanetti, Estimating Convergence of Markov chains
with L-Lag Couplings, 2019.
50. Finite-time bias of MCMC
Example: Gibbs sampler for Dempster’s analysis of counts.
[Figure: TV upper bounds against iteration, for N ∈ {50, 100, 150, 200}.]
This quantifies bias of MCMC estimators, not variance.
51. Outline
1 Dempster–Shafer analysis of count data
2 Unbiased MCMC and diagnostics of convergence
3 Modular Bayesian inference
4 Bagging posterior distributions
52. Models made of modules
First module:
parameter θ1, data Y1
prior: p1(θ1)
likelihood: p1(Y1|θ1)
Second module:
parameter θ2, data Y2
prior: p2(θ2|θ1)
likelihood: p2 (Y2|θ1, θ2)
We are interested in the estimation of θ1, θ2 or both.
53. Joint model approach
Parameter (θ1, θ2), with prior
p(θ1, θ2) = p1(θ1)p2(θ2|θ1).
Data (Y1, Y2), likelihood
p(Y1, Y2|θ1, θ2) = p1(Y1|θ1)p2(Y2|θ1, θ2).
Posterior distribution
π (θ1, θ2|Y1, Y2) ∝ p1 (θ1) p1(Y1|θ1)p2 (θ2|θ1) p2 (Y2|θ1, θ2).
54. Joint model approach
In the joint model approach, all data are used to
simultaneously infer all parameters. . .
. . . so that uncertainty about θ1 is propagated to the
estimation of θ2. . .
. . . but misspecification of the 2nd module can damage the
estimation of θ1.
What about allowing uncertainty propagation, but
preventing feedback of some modules on others?
55. Cut distribution
One might want to propagate uncertainty without allowing
“feedback” of second module on first module.
Cut distribution:
πcut(θ1, θ2; Y1, Y2) = p1(θ1|Y1) p2(θ2|θ1, Y2).
Different from the posterior distribution under joint model,
under which the first marginal is π(θ1|Y1, Y2).
56. Example: epidemiological study
Model of virus prevalence
∀i = 1, . . . , I Zi ∼ Binomial(Ni, ϕi),
Zi is number of women infected with high-risk HPV in a
sample of size Ni in country i.
Beta(1,1) prior on each ϕi, independently.
Impact of prevalence onto cervical cancer occurrence
∀i = 1, . . . , I Yi ∼ Poisson(λiTi), log(λi) = θ2,1 + θ2,2ϕi,
Yi is number of cancer cases arising from Ti woman-years of
follow-up in country i.
N(0, 10³) priors on θ2,1, θ2,2, independently.
Plummer, Cuts in Bayesian graphical models, 2014.
57. Monte Carlo with joint model approach
Joint model posterior has density
π (θ1, θ2|Y1, Y2) ∝ p1 (θ1) p1 (Y1|θ1)p2 (θ2|θ1) p2 (Y2|θ1, θ2).
The computational complexity typically grows
super-linearly with the number of modules.
Difficulties stack up. . .
intractability, multimodality, ridges, etc.
58. Monte Carlo with cut distribution
The cut distribution is defined as
πcut(θ1, θ2; Y1, Y2) = p1(θ1|Y1) p2(θ2|θ1, Y2) ∝ π(θ1, θ2|Y1, Y2) / p2(Y2|θ1).
The denominator is the feedback of the 2nd module on θ1:
p2(Y2|θ1) = ∫ p2(Y2|θ1, θ2) p2(dθ2|θ1).
The feedback term is typically intractable.
59. Monte Carlo with cut distribution
WinBUGS’ approach via the cut function: alternate between
sampling θ1 from K1(θ1 → dθ1), targeting p1(dθ1|Y1);
sampling θ2 from K2,θ1(θ2 → dθ2), targeting p2(dθ2|θ1, Y2).
This does not leave the cut distribution invariant!
Iterating the kernel K2,θ1 enough times mitigates the issue.
Plummer, Cuts in Bayesian graphical models, 2014.
60. Monte Carlo with cut distribution
In a perfect world, we could sample i.i.d.
θ1(i) from p1(θ1|Y1),
θ2(i) given θ1(i) from p2(θ2|θ1(i), Y2),
then (θ1(i), θ2(i)) would be i.i.d. from the cut distribution.
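This perfect-world recipe can be sketched on a hypothetical conjugate two-module model (chosen purely for illustration, not Plummer's model): a Beta posterior in module 1 and a conjugate Normal conditional in module 2, so both stages admit i.i.d. sampling:

```python
import numpy as np

rng = np.random.default_rng(7)

# toy two-module model (hypothetical, chosen for conjugacy):
#   module 1: Y1 | theta1 ~ Binomial(N1, theta1), theta1 ~ Beta(1, 1)
#   module 2: Y2 | theta1, theta2 ~ N(theta1 + theta2, 1), theta2 ~ N(0, 1)
Y1, N1 = 30, 100
Y2 = 1.2

B = 10_000
# stage 1: theta1 ~ p1(theta1 | Y1) = Beta(1 + Y1, 1 + N1 - Y1)
theta1 = rng.beta(1 + Y1, 1 + N1 - Y1, size=B)
# stage 2: theta2 | theta1 ~ p2(theta2 | theta1, Y2) = N((Y2 - theta1)/2, 1/2)
theta2 = rng.normal((Y2 - theta1) / 2.0, np.sqrt(0.5))
# the pairs (theta1, theta2) are i.i.d. draws from the cut distribution
```

Note that stage 1 never sees Y2: that is exactly the absence of feedback that distinguishes the cut distribution from the joint posterior.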
61. Monte Carlo with cut distribution
In an MCMC world, we can sample
θ1(i) approximately from p1(θ1|Y1) using MCMC,
θ2(i) given θ1(i) approximately from p2(θ2|θ1(i), Y2) using MCMC,
then the resulting samples approximate the cut distribution,
in the limit of the numbers of iterations at both stages.
62. Monte Carlo with cut distribution
In an unbiased MCMC world, we can approximate expectations
∫ h(x) π(dx) without bias, in finite compute time.
We can obtain an unbiased approximation of p1(θ1|Y1), and for
each θ1, an unbiased approximation of p2(θ2|θ1, Y2).
Thus, by the tower property, we can unbiasedly estimate
∫∫ h(θ1, θ2) p2(dθ2|θ1, Y2) p1(dθ1|Y1).
Jacob, O’Leary & Atchadé, Unbiased MCMC with couplings, 2019.
63. Example: epidemiological study
[Figure: approximation of the marginals of the cut distribution of (θ2,1, θ2,2), the parameters of the Poisson regression module in the epidemiological model of Plummer (2014).]
Jacob, Holmes, Murray, Robert & Nicholson, Better together?
Statistical learning in models made of modules.
64. Outline
1 Dempster–Shafer analysis of count data
2 Unbiased MCMC and diagnostics of convergence
3 Modular Bayesian inference
4 Bagging posterior distributions
65. Bagging posterior distributions
We can stabilize the posterior distribution by using a bootstrap and aggregation scheme, in the spirit of bagging (Breiman, 1996b). In a nutshell, denote by D′ a bootstrap sample or subsample of the data D. The posterior of the random parameters θ given the data D has c.d.f. F(·|D), and we can stabilize this using
FBayesBag(·|D) = E[F(·|D′)],
where E is with respect to the bootstrap- or subsampling scheme. We call it the BayesBag estimator. It can be approximated by averaging over B posterior computations for bootstrap samples or subsamples, which might be a rather demanding task (although say B = 10 would already stabilize to a certain extent).
Bühlmann, Discussion of Big Bayes Stories and BayesBag, 2014.
66. Bagging posterior distributions
For b = 1, . . . , B:
Sample data set D(b) by bootstrapping from D.
Obtain an MCMC approximation π̂(b) of the posterior given D(b).
Finally obtain B⁻¹ Σb∈[B] π̂(b).
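A sketch of this recipe in a toy conjugate setting where the posterior given any data set is available in closed form (the N(μ, 1) model, the data, and all names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# toy setting (hypothetical): N(mu, 1) likelihood with a flat prior, so the
# posterior given a data set D is N(mean(D), 1/len(D)) in closed form
D = rng.normal(0.7, 1.0, size=50)

B, M = 200, 1_000
draws = []
for _ in range(B):
    Db = rng.choice(D, size=len(D), replace=True)     # bootstrap data set D(b)
    # M draws from the posterior given D(b); in general this is an MCMC run
    draws.append(rng.normal(Db.mean(), 1 / np.sqrt(len(D)), size=M))
mix = np.concatenate(draws)   # pooled draws approximate the BayesBag mixture
```

The pooled draws are a mixture over bootstrap posteriors; their spread exceeds that of any single posterior, which is the intended stabilization.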
Converges to the “BayesBag” distribution as both B and the number of
MCMC samples go to infinity.
If we can obtain an unbiased approximation of the posterior given any
data set D, the resulting approximation of “BayesBag” is
consistent as B → ∞ only.
Exactly the same reasoning as for the cut distribution.
Example at https://statisfaction.wordpress.com/2019/
10/02/bayesbag-and-how-to-approximate-it/
67. Discussion
Some existing alternatives to standard Bayesian inference
are well motivated, but raise computational questions.
There are on-going efforts toward scalable Monte Carlo
methods, e.g. using coupled Markov chains or regeneration
techniques, in addition to sustained search for new MCMC
algorithms.
Quantification of variance is commonly done, quantification
of bias is also possible.
What makes a computational method convenient? It does
not seem to be entirely about asymptotic efficiency when the
method is optimally tuned.
Thank you for listening!
Funding provided by the National Science Foundation,
grants DMS-1712872 and DMS-1844695.
68. References
Practical couplings in the literature. . .
Propp & Wilson, Exact sampling with coupled Markov chains
and applications to statistical mechanics, Random Structures &
Algorithms, 1996.
Johnson, Studying convergence of Markov chain Monte Carlo
algorithms using coupled sample paths, JASA, 1996.
Neal, Circularly-coupled Markov chain sampling, UoT tech
report, 1999.
Glynn & Rhee, Exact estimation for Markov chain equilibrium
expectations, Journal of Applied Probability, 2014.
Agapiou, Roberts & Vollmer, Unbiased Monte Carlo: posterior
estimation for intractable/infinite-dimensional models, Bernoulli,
2018.
69. References
Finite-time bias of MCMC. . .
Brooks & Roberts, Assessing convergence of Markov chain
Monte Carlo algorithms, STCO, 1998.
Cowles & Rosenthal, A simulation approach to convergence rates
for Markov chain Monte Carlo algorithms, STCO, 1998.
Johnson, Studying convergence of Markov chain Monte Carlo
algorithms using coupled sample paths, JASA, 1996.
Gorham, Duncan, Vollmer & Mackey, Measuring Sample Quality
with Diffusions, AAP, 2019.
70. References
Own work. . .
with John O’Leary, Yves F. Atchadé
Unbiased Markov chain Monte Carlo with couplings, 2019.
with Fredrik Lindsten, Thomas Schön
Smoothing with Couplings of Conditional Particle Filters, 2019.
with Jeremy Heng
Unbiased Hamiltonian Monte Carlo with couplings, 2019.
with Lawrence Middleton, George Deligiannidis, Arnaud
Doucet
Unbiased Markov chain Monte Carlo for intractable target
distributions, 2019.
Unbiased Smoothing using Particle Independent
Metropolis-Hastings, 2019.
71. References
with Maxime Rischard, Natesh Pillai
Unbiased estimation of log normalizing constants with
applications to Bayesian cross-validation.
with Niloy Biswas, Paul Vanetti
Estimating Convergence of Markov chains with L-Lag Couplings,
2019.
with Chris Holmes, Lawrence Murray, Christian Robert,
George Nicholson
Better together? Statistical learning in models made of modules.