Computational methods for Bayesian model choice

1. On some computational methods for Bayesian model choice On some computational methods for Bayesian model choice Christian P. Robert CREST-INSEE and Universit´ Paris Dauphine e http://www.ceremade.dauphine.fr/~xian c Cours OFPR, CREST, Malakoﬀ 2-12 mars 2009

2. On some computational methods for Bayesian model choice Outline Introduction 1 Importance sampling solutions 2 Cross-model solutions 3 Nested sampling 4 ABC model choice 5

3. On some computational methods for Bayesian model choice Introduction Bayes tests Construction of Bayes tests Deﬁnition (Test) Given an hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ0 of a statistical model, a test is a statistical procedure that takes its values in {0, 1}. Example (Normal mean) For x ∼ N (θ, 1), decide whether or not θ ≤ 0.

4. On some computational methods for Bayesian model choice Introduction Bayes tests Construction of Bayes tests Deﬁnition (Test) Given an hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ0 of a statistical model, a test is a statistical procedure that takes its values in {0, 1}. Example (Normal mean) For x ∼ N (θ, 1), decide whether or not θ ≤ 0.

5. On some computational methods for Bayesian model choice Introduction Bayes tests The 0 − 1 loss Neyman–Pearson loss for testing hypotheses Test of H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ0 . Then D = {0, 1} The 0 − 1 loss 1 − d if θ ∈ Θ0 L(θ, d) = d otherwise,

6. On some computational methods for Bayesian model choice Introduction Bayes tests The 0 − 1 loss Neyman–Pearson loss for testing hypotheses Test of H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ0 . Then D = {0, 1} The 0 − 1 loss 1 − d if θ ∈ Θ0 L(θ, d) = d otherwise,

7. On some computational methods for Bayesian model choice Introduction Bayes tests Type–one and type–two errors Associated with the risk R(θ, δ) = Eθ [L(θ, δ(x))] if θ ∈ Θ0 , Pθ (δ(x) = 0) = Pθ (δ(x) = 1) otherwise, Theorem (Bayes test) The Bayes estimator associated with π and with the 0 − 1 loss is if π(θ ∈ Θ0 |x) > π(θ ∈ Θ0 |x), 1 δ π (x) = 0 otherwise,

8. On some computational methods for Bayesian model choice Introduction Bayes tests Type–one and type–two errors Associated with the risk R(θ, δ) = Eθ [L(θ, δ(x))] if θ ∈ Θ0 , Pθ (δ(x) = 0) = Pθ (δ(x) = 1) otherwise, Theorem (Bayes test) The Bayes estimator associated with π and with the 0 − 1 loss is if π(θ ∈ Θ0 |x) > π(θ ∈ Θ0 |x), 1 δ π (x) = 0 otherwise,

9. On some computational methods for Bayesian model choice Introduction Bayes factor Bayes factor Deﬁnition (Bayes factors) For testing hypotheses H0 : θ ∈ Θ0 vs. Ha : θ ∈ Θ0 , under prior π(Θ0 )π0 (θ) + π(Θc )π1 (θ) , 0 central quantity f (x|θ)π0 (θ)dθ π(Θ0 |x) π(Θ0 ) Θ0 B01 = = π(Θc |x) π(Θc ) 0 0 f (x|θ)π1 (θ)dθ Θc 0 [Jeﬀreys, 1939]

10. On some computational methods for Bayesian model choice Introduction Bayes factor Self-contained concept Outside decision-theoretic environment: eliminates impact of π(Θ0 ) but depends on the choice of (π0 , π1 ) Bayesian/marginal equivalent to the likelihood ratio Jeﬀreys’ scale of evidence: π if log10 (B10 ) between 0 and 0.5, evidence against H0 weak, π if log10 (B10 ) 0.5 and 1, evidence substantial, π if log10 (B10 ) 1 and 2, evidence strong and π if log10 (B10 ) above 2, evidence decisive Requires the computation of the marginal/evidence under both hypotheses/models

11. On some computational methods for Bayesian model choice Introduction Bayes factor Hot hand Example (Binomial homogeneity) Consider H0 : yi ∼ B(ni , p) (i = 1, . . . , G) vs. H1 : yi ∼ B(ni , pi ). Conjugate priors pi ∼ Be(α = ξ/ω, β = (1 − ξ)/ω), with a uniform prior on E[pi |ξ, ω] = ξ and on p (ω is ﬁxed) 1G 1 pyi (1 − pi )ni −yi pα−1 (1 − pi )β−1 d pi B10 = i i 0 i=1 0 ×Γ(1/ω)/[Γ(ξ/ω)Γ((1 − ξ)/ω)]dξ 1 P P i (ni −yi ) i yi (1 − p) 0p dp For instance, log10 (B10 ) = −0.79 for ω = 0.005 and G = 138 slightly favours H0 . [Kass & Raftery, 1995]

14. On some computational methods for Bayesian model choice Introduction Model choice Model choice and model comparison Choice between models Several models available for the same observation Mi : x ∼ fi (x|θi ), i∈I where I can be ﬁnite or inﬁnite Replace hypotheses with models but keep marginal likelihoods and Bayes factors

15. On some computational methods for Bayesian model choice Introduction Model choice Model choice and model comparison Choice between models Several models available for the same observation Mi : x ∼ fi (x|θi ), i∈I where I can be ﬁnite or inﬁnite Replace hypotheses with models but keep marginal likelihoods and Bayes factors

16. On some computational methods for Bayesian model choice Introduction Model choice Bayesian model choice Probabilise the entire model/parameter space allocate probabilities pi to all models Mi deﬁne priors πi (θi ) for each parameter space Θi compute pi fi (x|θi )πi (θi )dθi Θi π(Mi |x) = pj fj (x|θj )πj (θj )dθj Θj j take largest π(Mi |x) to determine “best” model, or use averaged predictive π(Mj |x) fj (x |θj )πj (θj |x)dθj Θj j

20. On some computational methods for Bayesian model choice Introduction Evidence Evidence All these problems end up with a similar quantity, the evidence Zk = πk (θk )Lk (θk ) dθk , Θk aka the marginal likelihood.

21. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Importance sampling Paradox Simulation from f (the true density) is not necessarily optimal Alternative to direct sampling from f is importance sampling, based on the alternative representation f (x) Ef [h(X)] = h(x) g(x) dx . g(x) X which allows us to use other distributions than f

22. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Importance sampling Paradox Simulation from f (the true density) is not necessarily optimal Alternative to direct sampling from f is importance sampling, based on the alternative representation f (x) Ef [h(X)] = h(x) g(x) dx . g(x) X which allows us to use other distributions than f

23. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Importance sampling algorithm Evaluation of Ef [h(X)] = h(x) f (x) dx X by Generate a sample X1 , . . . , Xn from a distribution g 1 Use the approximation 2 m f (Xj ) 1 h(Xj ) m g(Xj ) j=1

24. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Bayes factor approximation When approximating the Bayes factor f0 (x|θ0 )π0 (θ0 )dθ0 Θ0 B01 = f1 (x|θ1 )π1 (θ1 )dθ1 Θ1 use of importance functions and and 0 1 n−1 n0 i i i i=1 f0 (x|θ0 )π0 (θ0 )/ 0 (θ0 ) 0 B01 = n−1 n1 i i i i=1 f1 (x|θ1 )π1 (θ1 )/ 1 (θ1 ) 1

26. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Bridge sampling variance The bridge sampling estimator does poorly if 2 π1 (θ) − π2 (θ) var(B12 ) 1 =E 2 n π2 (θ) B12 is large, i.e. if π1 and π2 have little overlap...

27. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Bridge sampling variance The bridge sampling estimator does poorly if 2 π1 (θ) − π2 (θ) var(B12 ) 1 =E 2 n π2 (θ) B12 is large, i.e. if π1 and π2 have little overlap...

29. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance An infamous example When 1 α(θ) = π1 (θ)˜2 (θ) ˜ π harmonic mean approximation to B12 n1 1 1/˜ (θ1i |x) π n1 i=1 1 θji ∼ πj (θ|x) B12 = n2 1 1/˜2 (θ2i |x) π n2 i=1 [Newton & Raftery, 1994] Infamous: Most often leads to an inﬁnite variance!!! [Radford Neal’s blog, 2008]

30. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance An infamous example When 1 α(θ) = π1 (θ)˜2 (θ) ˜ π harmonic mean approximation to B12 n1 1 1/˜ (θ1i |x) π n1 i=1 1 θji ∼ πj (θ|x) B12 = n2 1 1/˜2 (θ2i |x) π n2 i=1 [Newton & Raftery, 1994] Infamous: Most often leads to an inﬁnite variance!!! [Radford Neal’s blog, 2008]

31. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance “The Worst Monte Carlo Method Ever” “The good news is that the Law of Large Numbers guarantees that this estimator is consistent ie, it will very likely be very close to the correct answer if you use a suﬃciently large number of points from the posterior distribution. The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that itws easy for people to not realize this, and to naively accept estimates that are nowhere close to the correct value of the marginal likelihood.” [Radford Neal’s blog, Aug. 23, 2008]

32. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance “The Worst Monte Carlo Method Ever” “The good news is that the Law of Large Numbers guarantees that this estimator is consistent ie, it will very likely be very close to the correct answer if you use a suﬃciently large number of points from the posterior distribution. The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that itws easy for people to not realize this, and to naively accept estimates that are nowhere close to the correct value of the marginal likelihood.” [Radford Neal’s blog, Aug. 23, 2008]

34. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Optimal bridge sampling (2) Reason: π1 (θ)π2 (θ)[n1 π1 (θ) + n2 π2 (θ)]α(θ)2 dθ Var(B12 ) 1 ≈ −1 2 2 n1 n2 B12 π1 (θ)π2 (θ)α(θ) dθ (by the δ method) Dependence on the unknown normalising constants solved iteratively

35. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Optimal bridge sampling (2) Reason: π1 (θ)π2 (θ)[n1 π1 (θ) + n2 π2 (θ)]α(θ)2 dθ Var(B12 ) 1 ≈ −1 2 2 n1 n2 B12 π1 (θ)π2 (θ)α(θ) dθ (by the δ method) Dependence on the unknown normalising constants solved iteratively

36. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Ratio importance sampling Another identity: Eϕ [˜1 (θ)/ϕ(θ)] π B12 = Eϕ [˜2 (θ)/ϕ(θ)] π for any density ϕ with suﬃciently large support [Torrie & Valleau, 1977] Use of a single sample θ1 , . . . , θn from ϕ i=1 π1 (θi )/ϕ(θi ) ˜ B12 = i=1 π2 (θi )/ϕ(θi ) ˜

37. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Ratio importance sampling Another identity: Eϕ [˜1 (θ)/ϕ(θ)] π B12 = Eϕ [˜2 (θ)/ϕ(θ)] π for any density ϕ with suﬃciently large support [Torrie & Valleau, 1977] Use of a single sample θ1 , . . . , θn from ϕ i=1 π1 (θi )/ϕ(θi ) ˜ B12 = i=1 π2 (θi )/ϕ(θi ) ˜

38. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Ratio importance sampling (2) Approximate variance: 2 (π1 (θ) − π2 (θ))2 var(B12 ) 1 = Eϕ 2 ϕ(θ)2 n B12 Optimal choice: | π1 (θ) − π2 (θ) | ϕ∗ (θ) = | π1 (η) − π2 (η) | dη [Chen, Shao & Ibrahim, 2000]

39. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Ratio importance sampling (2) Approximate variance: 2 (π1 (θ) − π2 (θ))2 var(B12 ) 1 = Eϕ 2 ϕ(θ)2 n B12 Optimal choice: | π1 (θ) − π2 (θ) | ϕ∗ (θ) = | π1 (η) − π2 (η) | dη [Chen, Shao & Ibrahim, 2000]

40. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Improving upon bridge sampler Theorem 5.5.3: The asymptotic variance of the optimal ratio importance sampling estimator is smaller than the asymptotic variance of the optimal bridge sampling estimator [Chen, Shao, & Ibrahim, 2000] Does not require the normalising constant | π1 (η) − π2 (η) | dη but a simulation from ϕ∗ (θ) ∝| π1 (θ) − π2 (θ) | .

41. On some computational methods for Bayesian model choice Importance sampling solutions Regular importance Improving upon bridge sampler Theorem 5.5.3: The asymptotic variance of the optimal ratio importance sampling estimator is smaller than the asymptotic variance of the optimal bridge sampling estimator [Chen, Shao, & Ibrahim, 2000] Does not require the normalising constant | π1 (η) − π2 (η) | dη but a simulation from ϕ∗ (θ) ∝| π1 (θ) − π2 (θ) | .

42. On some computational methods for Bayesian model choice Importance sampling solutions Varying dimensions Generalisation to point null situations When π1 (θ1 )dθ1 ˜ Θ1 B12 = π2 (θ2 )dθ2 ˜ Θ2 and Θ2 = Θ1 × Ψ, we get θ2 = (θ1 , ψ) and π1 (θ1 )ω(ψ|θ1 ) ˜ B12 = Eπ2 π2 (θ1 , ψ) ˜ holds for any conditional density ω(ψ|θ1 ).

43. On some computational methods for Bayesian model choice Importance sampling solutions Varying dimensions X-dimen’al bridge sampling Generalisation of the previous identity: For any α, Eπ [˜1 (θ1 )ω(ψ|θ1 )α(θ1 , ψ)] π B12 = 2 Eπ1 ×ω [˜2 (θ1 , ψ)α(θ1 , ψ)] π and, for any density ϕ, Eϕ [˜1 (θ1 )ω(ψ|θ1 )/ϕ(θ1 , ψ)] π B12 = Eϕ [˜2 (θ1 , ψ)/ϕ(θ1 , ψ)] π [Chen, Shao, & Ibrahim, 2000] Optimal choice: ω(ψ|θ1 ) = π2 (ψ|θ1 ) [Theorem 5.8.2]

44. On some computational methods for Bayesian model choice Importance sampling solutions Varying dimensions X-dimen’al bridge sampling Generalisation of the previous identity: For any α, Eπ [˜1 (θ1 )ω(ψ|θ1 )α(θ1 , ψ)] π B12 = 2 Eπ1 ×ω [˜2 (θ1 , ψ)α(θ1 , ψ)] π and, for any density ϕ, Eϕ [˜1 (θ1 )ω(ψ|θ1 )/ϕ(θ1 , ψ)] π B12 = Eϕ [˜2 (θ1 , ψ)/ϕ(θ1 , ψ)] π [Chen, Shao, & Ibrahim, 2000] Optimal choice: ω(ψ|θ1 ) = π2 (ψ|θ1 ) [Theorem 5.8.2]

45. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Approximating Zk from a posterior sample Use of the [harmonic mean] identity ϕ(θk ) ϕ(θk ) πk (θk )Lk (θk ) 1 Eπk x= dθk = πk (θk )Lk (θk ) πk (θk )Lk (θk ) Zk Zk no matter what the proposal ϕ(·) is. [Gelfand & Dey, 1994; Bartolucci et al., 2006] Direct exploitation of the MCMC output RB-RJ

46. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Approximating Zk from a posterior sample Use of the [harmonic mean] identity ϕ(θk ) ϕ(θk ) πk (θk )Lk (θk ) 1 Eπk x= dθk = πk (θk )Lk (θk ) πk (θk )Lk (θk ) Zk Zk no matter what the proposal ϕ(·) is. [Gelfand & Dey, 1994; Bartolucci et al., 2006] Direct exploitation of the MCMC output RB-RJ

47. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Comparison with regular importance sampling Harmonic mean: Constraint opposed to usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than πk (θk )Lk (θk ) for the approximation T (t) ϕ(θk ) 1 Z1k = 1 (t) (t) T πk (θk )Lk (θk ) t=1 to have a ﬁnite variance. E.g., use ﬁnite support kernels (like Epanechnikov’s kernel) for ϕ

48. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Comparison with regular importance sampling Harmonic mean: Constraint opposed to usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than πk (θk )Lk (θk ) for the approximation T (t) ϕ(θk ) 1 Z1k = 1 (t) (t) T πk (θk )Lk (θk ) t=1 to have a ﬁnite variance. E.g., use ﬁnite support kernels (like Epanechnikov’s kernel) for ϕ

49. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Comparison with regular importance sampling (cont’d) Compare Z1k with a standard importance sampling approximation T (t) (t) πk (θk )Lk (θk ) 1 = Z2k (t) T ϕ(θk ) t=1 (t) where the θk ’s are generated from the density ϕ(·) (with fatter tails like t’s)

50. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Approximating Zk using a mixture representation Bridge sampling redux Design a speciﬁc mixture for simulation [importance sampling] purposes, with density ϕk (θk ) ∝ ω1 πk (θk )Lk (θk ) + ϕ(θk ) , where ϕ(·) is arbitrary (but normalised) Note: ω1 is not a probability weight

51. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Approximating Zk using a mixture representation Bridge sampling redux Design a speciﬁc mixture for simulation [importance sampling] purposes, with density ϕk (θk ) ∝ ω1 πk (θk )Lk (θk ) + ϕ(θk ) , where ϕ(·) is arbitrary (but normalised) Note: ω1 is not a probability weight

52. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Approximating Z using a mixture representation (cont’d) Corresponding MCMC (=Gibbs) sampler At iteration t Take δ (t) = 1 with probability 1 (t−1) (t−1) (t−1) (t−1) (t−1) ω1 πk (θk )Lk (θk ) ω1 πk (θk )Lk (θk ) + ϕ(θk ) and δ (t) = 2 otherwise; (t) (t−1) If δ (t) = 1, generate θk ∼ MCMC(θk , θk ) where 2 MCMC(θk , θk ) denotes an arbitrary MCMC kernel associated with the posterior πk (θk |x) ∝ πk (θk )Lk (θk ); (t) If δ (t) = 2, generate θk ∼ ϕ(θk ) independently 3

55. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Evidence approximation by mixtures Rao-Blackwellised estimate T ˆ1 (t) (t) (t) (t) (t) ξ= ω1 πk (θk )Lk (θk ) ω1 πk (θk )Lk (θk ) + ϕ(θk ) , T t=1 converges to ω1 Zk /{ω1 Zk + 1} ˆ Deduce Zˆ from ω1 Z3k /{ω1 Z3k + 1} = ξ ie ˆ ˆ 3k (t) (t) (t) (t) (t) T t=1 ω1 πk (θk )Lk (θk ) ω1 π(θk )Lk (θk ) + ϕ(θk ) ˆ Z3k = (t) (t) (t) (t) T t=1 ϕ(θk ) ω1 πk (θk )Lk (θk ) + ϕ(θk ) [Bridge sampler]

56. On some computational methods for Bayesian model choice Importance sampling solutions Harmonic means Evidence approximation by mixtures Rao-Blackwellised estimate T ˆ1 (t) (t) (t) (t) (t) ξ= ω1 πk (θk )Lk (θk ) ω1 πk (θk )Lk (θk ) + ϕ(θk ) , T t=1 converges to ω1 Zk /{ω1 Zk + 1} ˆ Deduce Zˆ from ω1 Z3k /{ω1 Z3k + 1} = ξ ie ˆ ˆ 3k (t) (t) (t) (t) (t) T t=1 ω1 πk (θk )Lk (θk ) ω1 π(θk )Lk (θk ) + ϕ(θk ) ˆ Z3k = (t) (t) (t) (t) T t=1 ϕ(θk ) ω1 πk (θk )Lk (θk ) + ϕ(θk ) [Bridge sampler]

57. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Chib’s representation Direct application of Bayes’ theorem: given x ∼ fk (x|θk ) and θk ∼ πk (θk ), fk (x|θk ) πk (θk ) Zk = mk (x) = πk (θk |x) Use of an approximation to the posterior ∗ ∗ fk (x|θk ) πk (θk ) Zk = mk (x) = . ˆ∗ πk (θk |x)

58. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Chib’s representation Direct application of Bayes’ theorem: given x ∼ fk (x|θk ) and θk ∼ πk (θk ), fk (x|θk ) πk (θk ) Zk = mk (x) = πk (θk |x) Use of an approximation to the posterior ∗ ∗ fk (x|θk ) πk (θk ) Zk = mk (x) = . ˆ∗ πk (θk |x)

59. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Case of latent variables For missing variable z as in mixture models, natural Rao-Blackwell estimate T 1 (t) ∗ ∗ πk (θk |x) = πk (θk |x, zk ) , T t=1 (t) where the zk ’s are Gibbs sampled latent variables

60. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Label switching A mixture model [special case of missing variable model] is invariant under permutations of the indices of the components. E.g., mixtures 0.3N (0, 1) + 0.7N (2.3, 1) and 0.7N (2.3, 1) + 0.3N (0, 1) are exactly the same! c The component parameters θi are not identiﬁable marginally since they are exchangeable

61. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Label switching A mixture model [special case of missing variable model] is invariant under permutations of the indices of the components. E.g., mixtures 0.3N (0, 1) + 0.7N (2.3, 1) and 0.7N (2.3, 1) + 0.3N (0, 1) are exactly the same! c The component parameters θi are not identiﬁable marginally since they are exchangeable

62. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Connected diﬃculties Number of modes of the likelihood of order O(k!): 1 c Maximization and even [MCMC] exploration of the posterior surface harder Under exchangeable priors on (θ, p) [prior invariant under 2 permutation of the indices], all posterior marginals are identical: c Posterior expectation of θ1 equal to posterior expectation of θ2

63. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Connected diﬃculties Number of modes of the likelihood of order O(k!): 1 c Maximization and even [MCMC] exploration of the posterior surface harder Under exchangeable priors on (θ, p) [prior invariant under 2 permutation of the indices], all posterior marginals are identical: c Posterior expectation of θ1 equal to posterior expectation of θ2

64. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution License Since Gibbs output does not produce exchangeability, the Gibbs sampler has not explored the whole parameter space: it lacks energy to switch simultaneously enough component allocations at once 0.2 0.3 0.4 0.5 −1 0 1 2 3 µi pi −1 0 1 2 3 0 100 200 300 400 500 µ n i 0.4 0.6 0.8 1.0 0.2 0.3 0.4 0.5 σi pi 0 100 200 300 400 500 0.2 0.3 0.4 0.5 n pi 0.4 0.6 0.8 1.0 −1 0 1 2 3 σi µi 0 100 200 300 400 500 0.4 0.6 0.8 1.0 σ n i

65. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Label switching paradox We should observe the exchangeability of the components [label switching] to conclude about convergence of the Gibbs sampler. If we observe it, then we do not know how to estimate the parameters. If we do not, then we are uncertain about the convergence!!!

68. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Compensation for label switching (t) For mixture models, zk usually fails to visit all conﬁgurations in a balanced way, despite the symmetry predicted by the theory 1 πk (θk |x) = πk (σ(θk )|x) = πk (σ(θk )|x) k! σ∈S for all σ’s in Sk , set of all permutations of {1, . . . , k}. Consequences on numerical approximation, biased by an order k! Recover the theoretical symmetry by using T 1 (t) ∗ ∗ πk (θk |x) = πk (σ(θk )|x, zk ) . T k! σ∈Sk t=1 [Berkhof, Mechelen, & Gelman, 2003]

69. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Compensation for label switching (t) For mixture models, zk usually fails to visit all conﬁgurations in a balanced way, despite the symmetry predicted by the theory 1 πk (θk |x) = πk (σ(θk )|x) = πk (σ(θk )|x) k! σ∈S for all σ’s in Sk , set of all permutations of {1, . . . , k}. Consequences on numerical approximation, biased by an order k! Recover the theoretical symmetry by using T 1 (t) ∗ ∗ πk (θk |x) = πk (σ(θk )|x, zk ) . T k! σ∈Sk t=1 [Berkhof, Mechelen, & Gelman, 2003]

70. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Galaxy dataset n = 82 galaxies as a mixture of k normal distributions with both mean and variance unknown. [Roeder, 1992] Average density 0.8 0.6 Relative Frequency 0.4 0.2 0.0 −2 −1 0 1 2 3 data

71. On some computational methods for Bayesian model choice Importance sampling solutions Chib’s solution Galaxy dataset (k) ∗ Using only the original estimate, with θk as the MAP estimator, log(mk (x)) = −105.1396 ˆ for k = 3 (based on 103 simulations), while introducing the permutations leads to log(mk (x)) = −103.3479 ˆ Note that −105.1396 + log(3!) = −103.3479 k 2 3 4 5 6 7 8 mk (x) -115.68 -103.35 -102.66 -101.93 -102.88 -105.48 -108.44 Estimations of the marginal likelihoods by the symmetrised Chib’s approximation (based on 105 Gibbs iterations and, for k > 5, 100 permutations selected at random in Sk ). [Lee, Marin, Mengersen & Robert, 2008]

74. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Bayesian variable selection Regression setting: one dependent random variable y and a set {x1 , . . . , xk } of k explanatory variables. Question: Are all xi ’s involved in the regression? Assumption: every subset {i1 , . . . , iq } of q (0 ≤ q ≤ k) explanatory variables, {1n , xi1 , . . . , xiq }, is a proper set of explanatory variables for the regression of y [intercept included in every corresponding model] Computational issue 2k models in competition...

78. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Model notations 1 X = 1n x 1 · · · xk is the matrix containing 1n and all the k potential predictor variables Each model Mγ associated with binary indicator vector 2 γ ∈ Γ = {0, 1}k where γi = 1 means that the variable xi is included in the model Mγ qγ = 1T γ number of variables included in the model Mγ 3 n t1 (γ) and t0 (γ) indices of variables included in the model and 4 indices of variables not included in the model

79. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Model indicators For β ∈ Rk+1 and X, we deﬁne βγ as the subvector βγ = β0 , (βi )i∈t1 (γ) and Xγ as the submatrix of X where only the column 1n and the columns in t1 (γ) have been left.

80. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Models in competition The model Mγ is thus deﬁned as y|γ, βγ , σ 2 , X ∼ Nn Xγ βγ , σ 2 In where βγ ∈ Rqγ +1 and σ 2 ∈ R∗ are the unknown parameters. + Warning σ 2 is common to all models and thus uses the same prior for all models

81. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Models in competition The model Mγ is thus deﬁned as y|γ, βγ , σ 2 , X ∼ Nn Xγ βγ , σ 2 In where βγ ∈ Rqγ +1 and σ 2 ∈ R∗ are the unknown parameters. + Warning σ 2 is common to all models and thus uses the same prior for all models

82. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Informative G-prior Many (2k ) models in competition: we cannot expect a practitioner to specify a prior on every Mγ in a completely subjective and autonomous manner. Shortcut: We derive all priors from a single global prior associated with the so-called full model that corresponds to γ = (1, . . . , 1).

83. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Prior deﬁnitions For the full model, Zellner’s G-prior: (i) β|σ 2 , X ∼ Nk+1 (β, cσ 2 (X T X)−1 ) and σ 2 ∼ π(σ 2 |X) = σ −2 ˜ For each model Mγ , the prior distribution of βγ conditional (ii) on σ 2 is ﬁxed as −1 ˜ βγ |γ, σ 2 ∼ Nqγ +1 βγ , cσ 2 Xγ Xγ T , −1 ˜ T˜ T Xγ β and same prior on σ 2 . where βγ = Xγ Xγ

84. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Prior completion The joint prior for model Mγ is the improper prior 1 T −(qγ +1)/2−1 ˜ π(βγ , σ 2 |γ) ∝ σ2 exp − βγ − βγ 2(cσ 2 ) ˜ T (Xγ Xγ ) βγ − βγ .

85. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Prior competition (2) Inﬁnitely many ways of deﬁning a prior on the model index γ: choice of uniform prior π(γ|X) = 2−k . Posterior distribution of γ central to variable selection since it is proportional to marginal density of y on Mγ (or evidence of Mγ ) π(γ|y, X) ∝ f (y|γ, X)π(γ|X) ∝ f (y|γ, X) f (y|γ, β, σ 2 , X)π(β|γ, σ 2 , X) dβ π(σ 2 |X) dσ 2 . =

86. On some computational methods for Bayesian model choice Cross-model solutions Variable selection f (y|γ, σ 2 , X) = f (y|γ, β, σ 2 )π(β|γ, σ 2 ) dβ 1T −n/2 = (c + 1)−(qγ +1)/2 (2π)−n/2 σ 2 exp − yy 2σ 2 1 −1 ˜T T ˜ cy T Xγ Xγ Xγ T T Xγ y − βγ Xγ Xγ βγ + , 2σ 2 (c + 1) this posterior density satisﬁes −1 cT π(γ|y, X) ∝ (c + 1)−(qγ +1)/2 y T y − T T y Xγ Xγ Xγ Xγ y c+1 −n/2 1 ˜T T ˜ − β X Xγ βγ . c+1 γ γ

87. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Pine processionary caterpillars t1 (γ) π(γ|y, X) 0,1,2,4,5 0.2316 0,1,2,4,5,9 0.0374 0,1,9 0.0344 0,1,2,4,5,10 0.0328 0,1,4,5 0.0306 0,1,2,9 0.0250 0,1,2,4,5,7 0.0241 0,1,2,4,5,8 0.0238 0,1,2,4,5,6 0.0237 0,1,2,3,4,5 0.0232 0,1,6,9 0.0146 0,1,2,3,9 0.0145 0,9 0.0143 0,1,2,6,9 0.0135 0,1,4,5,9 0.0128 0,1,3,9 0.0117 0,1,2,8 0.0115

88. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Pine processionary caterpillars (cont’d) Interpretation Model Mγ with the highest posterior probability is t1 (γ) = (1, 2, 4, 5), which corresponds to the variables - altitude, - slope, - height of the tree sampled in the center of the area, and - diameter of the tree sampled in the center of the area. Corresponds to the ﬁve variables identiﬁed in the R regression output

89. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Pine processionary caterpillars (cont’d) Interpretation Model Mγ with the highest posterior probability is t1 (γ) = (1, 2, 4, 5), which corresponds to the variables - altitude, - slope, - height of the tree sampled in the center of the area, and - diameter of the tree sampled in the center of the area. Corresponds to the ﬁve variables identiﬁed in the R regression output

90. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Noninformative extension For Zellner noninformative prior with π(c) = 1/c, we have ∞ c−1 (c + 1)−(qγ +1)/2 y T y− π(γ|y, X) ∝ c=1 −n/2 −1 cT T T y Xγ Xγ Xγ Xγ y . c+1

91. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Pine processionary caterpillars t1 (γ) π(γ|y, X) 0,1,2,4,5 0.0929 0,1,2,4,5,9 0.0325 0,1,2,4,5,10 0.0295 0,1,2,4,5,7 0.0231 0,1,2,4,5,8 0.0228 0,1,2,4,5,6 0.0228 0,1,2,3,4,5 0.0224 0,1,2,3,4,5,9 0.0167 0,1,2,4,5,6,9 0.0167 0,1,2,4,5,8,9 0.0137 0,1,4,5 0.0110 0,1,2,4,5,9,10 0.0100 0,1,2,3,9 0.0097 0,1,2,9 0.0093 0,1,2,4,5,7,9 0.0092 0,1,2,6,9 0.0092

92. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Stochastic search for the most likely model When k gets large, impossible to compute the posterior probabilities of the 2k models. Need of a tailored algorithm that samples from π(γ|y, X) and selects the most likely models. Can be done by Gibbs sampling, given the availability of the full conditional posterior probabilities of the γi ’s. If γ−i = (γ1 , . . . , γi−1 , γi+1 , . . . , γk ) (1 ≤ i ≤ k) π(γi |y, γ−i , X) ∝ π(γ|y, X) (to be evaluated in both γi = 0 and γi = 1)

93. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Stochastic search for the most likely model When k gets large, impossible to compute the posterior probabilities of the 2k models. Need of a tailored algorithm that samples from π(γ|y, X) and selects the most likely models. Can be done by Gibbs sampling, given the availability of the full conditional posterior probabilities of the γi ’s. If γ−i = (γ1 , . . . , γi−1 , γi+1 , . . . , γk ) (1 ≤ i ≤ k) π(γi |y, γ−i , X) ∝ π(γ|y, X) (to be evaluated in both γi = 0 and γi = 1)

94. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Gibbs sampling for variable selection Initialization: Draw γ 0 from the uniform distribution on Γ (t−1) (t−1) Iteration t: Given (γ1 , . . . , γk ), generate (t) (t−1) (t−1) 1. γ1 according to π(γ1 |y, γ2 , . . . , γk , X) (t) 2. γ2 according to (t) (t−1) (t−1) π(γ2 |y, γ1 , γ3 , . . . , γk , X) . . . (t) (t) (t) p. γk according to π(γk |y, γ1 , . . . , γk−1 , X)

95. On some computational methods for Bayesian model choice Cross-model solutions Variable selection MCMC interpretation After T 1 MCMC iterations, output used to approximate the posterior probabilities π(γ|y, X) by empirical averages T 1 π(γ|y, X) = Iγ (t) =γ . T − T0 + 1 t=T0 where the T0 ﬁrst values are eliminated as burnin. And approximation of the probability to include i-th variable, T 1 P π (γi = 1|y, X) = Iγ (t) =1 . T − T0 + 1 i t=T0

96. On some computational methods for Bayesian model choice Cross-model solutions Variable selection MCMC interpretation After T 1 MCMC iterations, output used to approximate the posterior probabilities π(γ|y, X) by empirical averages T 1 π(γ|y, X) = Iγ (t) =γ . T − T0 + 1 t=T0 where the T0 ﬁrst values are eliminated as burnin. And approximation of the probability to include i-th variable, T 1 P π (γi = 1|y, X) = Iγ (t) =1 . T − T0 + 1 i t=T0

97. On some computational methods for Bayesian model choice Cross-model solutions Variable selection Pine processionary caterpillars P π (γi = 1|y, X) P π (γi = 1|y, X) γi γ1 0.8624 0.8844 γ2 0.7060 0.7716 γ3 0.1482 0.2978 γ4 0.6671 0.7261 γ5 0.6515 0.7006 γ6 0.1678 0.3115 γ7 0.1371 0.2880 γ8 0.1555 0.2876 γ9 0.4039 0.5168 γ10 0.1151 0.2609 ˜ Probabilities of inclusion with both informative (β = 011 , c = 100) and noninformative Zellner’s priors

98. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Reversible jump Idea: Set up a proper measure–theoretic framework for designing moves between models Mk [Green, 1995] Create a reversible kernel K on H = k {k} × Θk such that K(x, dy)π(x)dx = K(y, dx)π(y)dy A B B A for the invariant density π [x is of the form (k, θ(k) )]

99. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Reversible jump Idea: Set up a proper measure–theoretic framework for designing moves between models Mk [Green, 1995] Create a reversible kernel K on H = k {k} × Θk such that K(x, dy)π(x)dx = K(y, dx)π(y)dy A B B A for the invariant density π [x is of the form (k, θ(k) )]

100. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Local moves For a move between two models, M1 and M2 , the Markov chain being in state θ1 ∈ M1 , denote by K1→2 (θ1 , dθ) and K2→1 (θ2 , dθ) the corresponding kernels, under the detailed balance condition π(dθ1 ) K1→2 (θ1 , dθ) = π(dθ2 ) K2→1 (θ2 , dθ) , and take, wlog, dim(M2 ) > dim(M1 ). Proposal expressed as θ2 = Ψ1→2 (θ1 , v1→2 ) where v1→2 is a random variable of dimension dim(M2 ) − dim(M1 ), generated as v1→2 ∼ ϕ1→2 (v1→2 ) .

101. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Local moves For a move between two models, M1 and M2 , the Markov chain being in state θ1 ∈ M1 , denote by K1→2 (θ1 , dθ) and K2→1 (θ2 , dθ) the corresponding kernels, under the detailed balance condition π(dθ1 ) K1→2 (θ1 , dθ) = π(dθ2 ) K2→1 (θ2 , dθ) , and take, wlog, dim(M2 ) > dim(M1 ). Proposal expressed as θ2 = Ψ1→2 (θ1 , v1→2 ) where v1→2 is a random variable of dimension dim(M2 ) − dim(M1 ), generated as v1→2 ∼ ϕ1→2 (v1→2 ) .

102. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Local moves (2) In this case, q1→2 (θ1 , dθ2 ) has density −1 ∂Ψ1→2 (θ1 , v1→2 ) ϕ1→2 (v1→2 ) , ∂(θ1 , v1→2 ) by the Jacobian rule. Reverse importance link If probability 1→2 of choosing move to M2 while in M1 , acceptance probability reduces to π(M2 , θ2 ) 2→1 ∂Ψ1→2 (θ1 , v1→2 ) α(θ1 , v1→2 ) = 1∧ . π(M1 , θ1 ) 1→2 ϕ1→2 (v1→2 ) ∂(θ1 , v1→2 ) c Diﬃcult calibration

105. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Interpretation The representation puts us back in a ﬁxed dimension setting: M1 × V1→2 and M2 in one-to-one relation. reversibility imposes that θ1 is derived as (θ1 , v1→2 ) = Ψ−1 (θ2 ) 1→2 appears like a regular Metropolis–Hastings move from the couple (θ1 , v1→2 ) to θ2 when stationary distributions are π(M1 , θ1 ) × ϕ1→2 (v1→2 ) and π(M2 , θ2 ), and when proposal distribution is deterministic (??)

106. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Interpretation The representation puts us back in a ﬁxed dimension setting: M1 × V1→2 and M2 in one-to-one relation. reversibility imposes that θ1 is derived as (θ1 , v1→2 ) = Ψ−1 (θ2 ) 1→2 appears like a regular Metropolis–Hastings move from the couple (θ1 , v1→2 ) to θ2 when stationary distributions are π(M1 , θ1 ) × ϕ1→2 (v1→2 ) and π(M2 , θ2 ), and when proposal distribution is deterministic (??)

107. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Pseudo-deterministic reasoning Consider the proposals θ2 ∼ N (Ψ1→2 (θ1 , v1→2 ), ε) Ψ1→2 (θ1 , v1→2 ) ∼ N (θ2 , ε) and Reciprocal proposal has density exp −(θ2 − Ψ1→2 (θ1 , v1→2 ))2 /2ε ∂Ψ1→2 (θ1 , v1→2 ) √ × ∂(θ1 , v1→2 ) 2πε by the Jacobian rule. Thus Metropolis–Hastings acceptance probability is π(M2 , θ2 ) ∂Ψ1→2 (θ1 , v1→2 ) 1∧ π(M1 , θ1 ) ϕ1→2 (v1→2 ) ∂(θ1 , v1→2 ) Does not depend on ε: Let ε go to 0

108. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Pseudo-deterministic reasoning Consider the proposals θ2 ∼ N (Ψ1→2 (θ1 , v1→2 ), ε) Ψ1→2 (θ1 , v1→2 ) ∼ N (θ2 , ε) and Reciprocal proposal has density exp −(θ2 − Ψ1→2 (θ1 , v1→2 ))2 /2ε ∂Ψ1→2 (θ1 , v1→2 ) √ × ∂(θ1 , v1→2 ) 2πε by the Jacobian rule. Thus Metropolis–Hastings acceptance probability is π(M2 , θ2 ) ∂Ψ1→2 (θ1 , v1→2 ) 1∧ π(M1 , θ1 ) ϕ1→2 (v1→2 ) ∂(θ1 , v1→2 ) Does not depend on ε: Let ε go to 0

109. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Generic reversible jump acceptance probability If several models are considered simultaneously, with probability 1→2 of choosing move to M2 while in M1 , as in ∞ XZ K(x, B) = ρm (x, y)qm (x, dy) + ω(x)IB (x) m=1 acceptance probability of θ2 = Ψ1→2 (θ1 , v1→2 ) is π(M2 , θ2 ) 2→1 ∂Ψ1→2 (θ1 , v1→2 ) α(θ1 , v1→2 ) = 1 ∧ π(M1 , θ1 ) 1→2 ϕ1→2 (v1→2 ) ∂(θ1 , v1→2 ) while acceptance probability of θ1 with (θ1 , v1→2 ) = Ψ−1 (θ2 ) is 1→2 −1 π(M1 , θ1 ) 1→2 ϕ1→2 (v1→2 ) ∂Ψ1→2 (θ1 , v1→2 ) α(θ1 , v1→2 ) = 1 ∧ π(M2 , θ2 ) 2→1 ∂(θ1 , v1→2 )

112. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Green’s sampler Algorithm Iteration t (t ≥ 1): if x(t) = (m, θ(m) ), Select model Mn with probability πmn 1 Generate umn ∼ ϕmn (u) and set 2 (θ(n) , vnm ) = Ψm→n (θ(m) , umn ) Take x(t+1) = (n, θ(n) ) with probability 3 π(n, θ(n) ) πnm ϕnm (vnm ) ∂Ψm→n (θ(m) , umn ) min ,1 π(m, θ(m) ) πmn ϕmn (umn ) ∂(θ(m) , umn ) and take x(t+1) = x(t) otherwise.

113. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Mixture of normal distributions   k   2 pjk N (µjk , σjk ) Mk = (pjk , µjk , σjk );   j=1 Restrict moves from Mk to adjacent models, like Mk+1 and Mk−1 , with probabilities πk(k+1) and πk(k−1) .

114. On some computational methods for Bayesian model choice Cross-model solutions Reversible jump Mixture of normal distributions   k   2 pjk N (µjk , σjk ) Mk = (pjk , µjk , σjk );   j=1 Restrict moves from Mk to adjacent models, like Mk+1 and Mk−1 , with probabilities πk(k+1) and πk(k−1) .

Computational methods for Bayesian model choice

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (14)

Andere mochten auch

Andere mochten auch (20)

Mehr von Christian Robert

Mehr von Christian Robert (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Computational methods for Bayesian model choice