On some computational methods for Bayesian model choice




             On some computational methods for Bayesian
                          model choice

                                            Christian P. Robert

                                Université Paris Dauphine & CREST-INSEE
                                http://www.ceremade.dauphine.fr/~xian


                  4th Warwick–Oxford Statistics Seminar
                        Warwick, November 3, 2009
          Joint works with J.-M. Marin, M. Beaumont, N. Chopin,
                       J.-M. Cornuet, & D. Wraith
On some computational methods for Bayesian model choice




Outline


      1    Introduction

      2    Importance sampling solutions compared

      3    Nested sampling

      4    ABC model choice
On some computational methods for Bayesian model choice
  Introduction
     Model choice



Model choice as model comparison



      Choice between models
      Several models available for the same observation

                                    Mi : x ∼ fi (x|θi ),   i∈I

      where I can be finite or infinite
      Replace hypotheses with models
On some computational methods for Bayesian model choice
  Introduction
     Model choice



Bayesian model choice

      Probabilise the entire model/parameter space
                 allocate probabilities pi to all models Mi
                 define priors πi (θi ) for each parameter space Θi
                 compute

                 \pi(M_i \mid x) = \frac{p_i \int_{\Theta_i} f_i(x\mid\theta_i)\,\pi_i(\theta_i)\,d\theta_i}
                                        {\sum_j p_j \int_{\Theta_j} f_j(x\mid\theta_j)\,\pi_j(\theta_j)\,d\theta_j}



                  take the largest π(Mi |x) to determine the “best” model
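
      A quick numerical illustration (not in the original slides): once the evidences
      ∫ fi (x|θi )πi (θi )dθi have been approximated by any of the methods discussed
      below, the posterior model probabilities follow by a single normalisation. The
      prior weights and evidence values here are made-up numbers.

      # posterior model probabilities from prior weights p_i and evidences Z_i (toy numbers)
      prior_weights <- c(M1 = 0.5, M2 = 0.5)
      evidences     <- c(M1 = 2.3e-4, M2 = 7.1e-5)   # Z_i = integral of f_i(x|theta_i) pi_i(theta_i)
      unnorm        <- prior_weights * evidences
      unnorm / sum(unnorm)                            # pi(M_i | x)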
On some computational methods for Bayesian model choice
  Introduction
     Bayes factor



Bayes factor

      Definition (Bayes factors)
      For testing hypotheses H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0 , under the prior

                                       \pi(\Theta_0)\,\pi_0(\theta) + \pi(\Theta_0^c)\,\pi_1(\theta)\,,

      the central quantity is

                       B_{01} = \frac{\pi(\Theta_0 \mid x)}{\pi(\Theta_0^c \mid x)} \Big/ \frac{\pi(\Theta_0)}{\pi(\Theta_0^c)}
                              = \frac{\int_{\Theta_0} f(x\mid\theta)\,\pi_0(\theta)\,d\theta}
                                     {\int_{\Theta_0^c} f(x\mid\theta)\,\pi_1(\theta)\,d\theta}


                                                                            [Jeffreys, 1939]
On some computational methods for Bayesian model choice
  Introduction
     Evidence



Evidence



      Problems using a similar quantity, the evidence

                                      Z_k = \int_{\Theta_k} \pi_k(\theta_k)\,L_k(\theta_k)\,d\theta_k ,

      aka the marginal likelihood.
                                                                              [Jeffreys, 1939]
On some computational methods for Bayesian model choice
  Importance sampling solutions compared




A comparison of importance sampling solutions

      1    Introduction

      2    Importance sampling solutions compared
             Regular importance
             Bridge sampling
             Harmonic means
             Mixtures to bridge
             Chib’s solution
             The Savage–Dickey ratio

      3    Nested sampling

      4    ABC model choice
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Regular importance



Bayes factor approximation

      When approximating the Bayes factor

                                     B_{01} = \frac{\int_{\Theta_0} f_0(x\mid\theta_0)\,\pi_0(\theta_0)\,d\theta_0}
                                                   {\int_{\Theta_1} f_1(x\mid\theta_1)\,\pi_1(\theta_1)\,d\theta_1}

      use of importance functions \varpi_0 and \varpi_1 and

                           \widehat{B}_{01} = \frac{n_0^{-1}\sum_{i=1}^{n_0} f_0(x\mid\theta_0^i)\,\pi_0(\theta_0^i)\big/\varpi_0(\theta_0^i)}
                                                   {n_1^{-1}\sum_{i=1}^{n_1} f_1(x\mid\theta_1^i)\,\pi_1(\theta_1^i)\big/\varpi_1(\theta_1^i)}
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Regular importance



Diabetes in Pima Indian women

      Example (R benchmark)
      “A population of women who were at least 21 years old, of Pima
      Indian heritage and living near Phoenix (AZ), was tested for
      diabetes according to WHO criteria. The data were collected by
      the US National Institute of Diabetes and Digestive and Kidney
      Diseases.”
      200 Pima Indian women with observed variables
              plasma glucose concentration in oral glucose tolerance test
              diastolic blood pressure
              diabetes pedigree function
              presence/absence of diabetes
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Regular importance



Probit modelling on Pima Indian women


      Probability of diabetes as a function of the above variables

                            P(y = 1|x) = Φ(x1 β1 + x2 β2 + x3 β3 ) ,
      Test of H0 : β3 = 0 for 200 observations of Pima.tr based on a
      g-prior modelling:

                                           \beta \sim \mathcal{N}_3\big(0,\, n\,(X^{\mathsf T}X)^{-1}\big)
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Regular importance



MCMC 101 for probit models

      Use of either a random walk proposal

                                                   \tilde\beta = \beta + \varepsilon

      in a Metropolis-Hastings algorithm (since the likelihood is
      available)
      or of a Gibbs sampler that takes advantage of the missing/latent
      variable
                      z \mid y, x, \beta \sim \mathcal{N}(x^{\mathsf T}\beta,\,1)\; \mathbb{I}^{\,y}_{z\ge 0}\, \mathbb{I}^{\,1-y}_{z\le 0}

      (since β|y, X, z is distributed as a multivariate normal)
                                                  [Gibbs three times faster]
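
      To make the latent-variable route concrete, here is a minimal sketch (not the
      talk's own code) of such a Gibbs sampler under the g-prior β ∼ N3 (0, n(XᵀX)⁻¹)
      introduced above; the gibbsprobit() routine called in later slides is assumed to
      behave along these lines.

      # minimal Gibbs sampler for the probit model under the g-prior (sketch)
      library(mvtnorm)
      gibbsprobit_sketch <- function(Niter, y, X) {
        n <- length(y); p <- ncol(X)
        Sigma <- solve(t(X) %*% X) * n / (n + 1)     # posterior covariance of beta given z
        beta  <- rep(0, p)
        store <- matrix(NA, Niter, p)
        for (it in 1:Niter) {
          m <- drop(X %*% beta)
          u <- runif(n)
          # inverse-cdf draws from N(m, 1) truncated to z >= 0 (y = 1) or z <= 0 (y = 0)
          z <- ifelse(y == 1,
                      m + qnorm(pnorm(-m) + u * (1 - pnorm(-m))),
                      m + qnorm(u * pnorm(-m)))
          beta <- drop(rmvnorm(1, drop(Sigma %*% t(X) %*% z), Sigma))
          store[it, ] <- beta
        }
        store
      }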
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Regular importance



Importance sampling for the Pima Indian dataset


      Use of an importance function inspired by the asymptotic distribution of
      the MLE,
                                           \beta \sim \mathcal{N}(\hat\beta,\, \hat\Sigma)

      R Importance sampling code
      model1=summary(glm(y~-1+X1,family=binomial(link="probit")))
      is1=rmvnorm(Niter,mean=model1$coeff[,1],sigma=2*model1$cov.unscaled)
      is2=rmvnorm(Niter,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)
      bfis=mean(exp(probitlpost(is1,y,X1)-dmvlnorm(is1,mean=model1$coeff[,1],
           sigma=2*model1$cov.unscaled))) / mean(exp(probitlpost(is2,y,X2)-
           dmvlnorm(is2,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)))
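
      The code assumes rmvnorm from the mvtnorm package, plus two helpers that are not
      shown in the talk: probitlpost(), returning the log posterior of the probit
      coefficients, and dmvlnorm(), returning the log multivariate normal density. A
      possible sketch of the former, under the g-prior above (a hypothetical
      reconstruction, not the original helper):

      # log posterior of the probit coefficients under the g-prior beta ~ N(0, n (X'X)^{-1})
      probitlpost <- function(beta, y, X) {
        if (is.vector(beta)) beta <- matrix(beta, nrow = 1)
        n <- length(y)
        apply(beta, 1, function(b) {
          p <- pnorm(X %*% b)
          sum(y * log(p) + (1 - y) * log(1 - p)) +
            mvtnorm::dmvnorm(b, sigma = n * solve(t(X) %*% X), log = TRUE)
        })
      }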
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Regular importance



Diabetes in Pima Indian women
      Comparison of the variation of the Bayes factor approximations
      based on 100 replicas of 20,000 simulations from the prior and from
      the above MLE importance sampler
      [Boxplots of the Bayes factor approximations (values roughly between 2 and 5):
      Monte Carlo (prior sampling) vs. importance sampling]
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Bridge sampling



Bridge sampling


      Special case:
      If
                                           \pi_1(\theta_1\mid x) \propto \tilde\pi_1(\theta_1\mid x)
                                           \pi_2(\theta_2\mid x) \propto \tilde\pi_2(\theta_2\mid x)
      live on the same space (Θ1 = Θ2 ), then

                           B_{12} \approx \frac{1}{n}\sum_{i=1}^{n} \frac{\tilde\pi_1(\theta_i\mid x)}{\tilde\pi_2(\theta_i\mid x)}
                           \qquad \theta_i \sim \pi_2(\theta\mid x)

                           [Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]
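
      A toy numerical check of this special case (not in the original slides), with one
      observation x ∼ N(θ, 1) and two competing priors on θ, so that both posteriors
      live on the same space and B12 = m1 (x)/m2 (x) is known in closed form:

      # special-case estimator: average of unnormalised posterior ratios under draws from pi2
      x <- 1.2
      tilde_pi1 <- function(th) dnorm(x, th) * dnorm(th, 0, 1)    # prior 1: N(0, 1)
      tilde_pi2 <- function(th) dnorm(x, th) * dnorm(th, 0, 10)   # prior 2: N(0, 100)
      theta2 <- rnorm(1e5, x * 100 / 101, sqrt(100 / 101))        # exact draws from pi2(theta | x)
      mean(tilde_pi1(theta2) / tilde_pi2(theta2))                 # estimate of B12
      dnorm(x, 0, sqrt(2)) / dnorm(x, 0, sqrt(101))               # exact B12 for comparison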
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Bridge sampling



Bridge sampling variance



      The bridge sampling estimator does poorly if
                             \frac{\mathrm{var}(B_{12})}{B_{12}^2} \approx \frac{1}{n}\,
                             \mathbb{E}\left[\left(\frac{\pi_1(\theta) - \pi_2(\theta)}{\pi_2(\theta)}\right)^{2}\right]

      is large, i.e. if π1 and π2 have little overlap...
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Bridge sampling



(Further) bridge sampling

      General identity:

                   B_{12} = \frac{\int \tilde\pi_1(\theta\mid x)\,\alpha(\theta)\,\pi_2(\theta\mid x)\,d\theta}
                                 {\int \tilde\pi_2(\theta\mid x)\,\alpha(\theta)\,\pi_1(\theta\mid x)\,d\theta}
                   \qquad \forall\, \alpha(\cdot)

                          \approx \frac{\dfrac{1}{n_2}\sum_{i=1}^{n_2} \tilde\pi_1(\theta_{2i}\mid x)\,\alpha(\theta_{2i})}
                                       {\dfrac{1}{n_1}\sum_{i=1}^{n_1} \tilde\pi_2(\theta_{1i}\mid x)\,\alpha(\theta_{1i})}
                   \qquad \theta_{ji} \sim \pi_j(\theta\mid x)
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Bridge sampling



Optimal bridge sampling

      The optimal choice of auxiliary function is
                                    \alpha^{\star} = \frac{n_1 + n_2}{n_1\,\pi_1(\theta\mid x) + n_2\,\pi_2(\theta\mid x)}

      leading to

                           B_{12} \approx \frac{\dfrac{1}{n_2}\sum_{i=1}^{n_2} \dfrac{\tilde\pi_1(\theta_{2i}\mid x)}{n_1\pi_1(\theta_{2i}\mid x) + n_2\pi_2(\theta_{2i}\mid x)}}
                                               {\dfrac{1}{n_1}\sum_{i=1}^{n_1} \dfrac{\tilde\pi_2(\theta_{1i}\mid x)}{n_1\pi_1(\theta_{1i}\mid x) + n_2\pi_2(\theta_{1i}\mid x)}}

                                                                                    Back later!
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Bridge sampling



Optimal bridge sampling (2)


      Reason:

        \frac{\mathrm{Var}(B_{12})}{B_{12}^2} \approx \frac{1}{n_1 n_2}
        \left[\frac{\int \pi_1(\theta)\pi_2(\theta)\,[n_1\pi_1(\theta) + n_2\pi_2(\theta)]\,\alpha(\theta)^2\,d\theta}
                   {\left(\int \pi_1(\theta)\pi_2(\theta)\,\alpha(\theta)\,d\theta\right)^{2}} - 1\right]

      (by the δ method)
      Drawback: Dependence on the unknown normalising constants
      solved iteratively
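
      That iteration is the usual fixed-point scheme of Meng & Wong (1996); the sketch
      below (not the talk's code) assumes tilde_pi1 and tilde_pi2 evaluate the two
      unnormalised posteriors and th1, th2 hold samples from π1 (·|x) and π2 (·|x).

      # fixed-point iteration for the optimal bridge estimate of B12 (sketch)
      bridge_opt <- function(tilde_pi1, tilde_pi2, th1, th2, niter = 50) {
        n1 <- length(th1); n2 <- length(th2)
        s1 <- n1 / (n1 + n2); s2 <- n2 / (n1 + n2)
        l1 <- tilde_pi1(th1) / tilde_pi2(th1)   # ratios at draws th1 ~ pi1(.|x)
        l2 <- tilde_pi1(th2) / tilde_pi2(th2)   # ratios at draws th2 ~ pi2(.|x)
        r  <- 1                                 # initial guess for B12
        for (it in 1:niter)                     # plug the current estimate into alpha*
          r <- mean(l2 / (s1 * l2 + s2 * r)) / mean(1 / (s1 * l1 + s2 * r))
        r
      }
      # e.g. with the toy posteriors of the earlier sketch:
      # th1 <- rnorm(1e4, 1.2 / 2, sqrt(1 / 2)); th2 <- rnorm(1e4, 1.2 * 100 / 101, sqrt(100 / 101))
      # bridge_opt(tilde_pi1, tilde_pi2, th1, th2)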
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Bridge sampling



Extension to varying dimensions

      When dim(Θ1 ) ≠ dim(Θ2 ), e.g. θ2 = (θ1 , ψ), introduction of a
      pseudo-posterior density, ω(ψ|θ1 , x), augmenting π1 (θ1 |x) into the
      joint distribution
                              π1 (θ1 |x) × ω(ψ|θ1 , x)
      on Θ2 so that

               B_{12} = \frac{\int \tilde\pi_1(\theta_1\mid x)\,\alpha(\theta_1,\psi)\,\pi_2(\theta_1,\psi\mid x)\,d\theta_1\,\omega(\psi\mid\theta_1,x)\,d\psi}
                             {\int \tilde\pi_2(\theta_1,\psi\mid x)\,\alpha(\theta_1,\psi)\,\pi_1(\theta_1\mid x)\,d\theta_1\,\omega(\psi\mid\theta_1,x)\,d\psi}

                      = \mathbb{E}^{\pi_2}\!\left[\frac{\tilde\pi_1(\theta_1)\,\omega(\psi\mid\theta_1)}{\tilde\pi_2(\theta_1,\psi)}\right]
                      = \frac{\mathbb{E}^{\varphi}\big[\tilde\pi_1(\theta_1)\,\omega(\psi\mid\theta_1)\big/\varphi(\theta_1,\psi)\big]}
                             {\mathbb{E}^{\varphi}\big[\tilde\pi_2(\theta_1,\psi)\big/\varphi(\theta_1,\psi)\big]}

        for any conditional density ω(ψ|θ1 ) and any joint density ϕ.
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Bridge sampling



Illustration for the Pima Indian dataset

      Use of the MLE-induced conditional of β3 given (β1 , β2 ) as a
      pseudo-posterior, and of a mixture of both MLE approximations on β3 ,
      in the bridge sampling estimate
      R bridge sampling code
      cova=model2$cov.unscaled
      expecta=model2$coeff[,1]
      covw=cova[3,3]-t(cova[1:2,3])%*%ginv(cova[1:2,1:2])%*%cova[1:2,3]

      probit1=hmprobit(Niter,y,X1)
      probit2=hmprobit(Niter,y,X2)
      pseudo=rnorm(Niter,meanw(probit1),sqrt(covw))
      probit1p=cbind(probit1,pseudo)

      bfbs=mean(exp(probitlpost(probit2[,1:2],y,X1)+dnorm(probit2[,3],meanw(probit2[,1:2]),
           sqrt(covw),log=T))/ (dmvnorm(probit2,expecta,cova)+dnorm(probit2[,3],expecta[3],
           cova[3,3])))/ mean(exp(probitlpost(probit1p,y,X2))/(dmvnorm(probit1p,expecta,cova)+
           dnorm(pseudo,expecta[3],cova[3,3])))
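
      Besides hmprobit() (a Metropolis–Hastings sampler for the probit posterior, not
      shown here), the code relies on a helper meanw() returning the MLE-induced
      conditional mean of β3 given (β1 , β2 ). A hypothetical sketch, reusing the cova
      and expecta objects defined above:

      # conditional mean of beta3 given (beta1, beta2) under the approximation N(expecta, cova)
      meanw <- function(beta12) {
        beta12 <- matrix(beta12, ncol = 2)
        gain <- MASS::ginv(cova[1:2, 1:2]) %*% cova[1:2, 3]          # regression coefficients
        drop(expecta[3] + sweep(beta12, 2, expecta[1:2]) %*% gain)   # one value per simulation
      }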
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Bridge sampling



Diabetes in Pima Indian women (cont’d)
      Comparison of the variation of the Bayes factor approximations
      based on 100 × 20,000 simulations from the prior (MC), the above
      bridge sampler and the above importance sampler
      [Boxplots of the Bayes factor approximations (values roughly between 2 and 5):
      MC (prior sampling), bridge sampling, and importance sampling]
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Harmonic means



The original harmonic mean estimator



      When θk^(t) ∼ πk (θ|x),

                                                \frac{1}{T}\sum_{t=1}^{T} \frac{1}{L_k(\theta_k^{(t)}\mid x)}

      is an unbiased estimator of 1/mk (x)
                                                                      [Newton & Raftery, 1994]

      Highly dangerous: Most often leads to an infinite variance!!!
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Harmonic means



“The Worst Monte Carlo Method Ever”

      “The good news is that the Law of Large Numbers guarantees that
      this estimator is consistent ie, it will very likely be very close to the
      correct answer if you use a sufficiently large number of points from
      the posterior distribution.
      The bad news is that the number of points required for this
      estimator to get close to the right answer will often be greater
      than the number of atoms in the observable universe. The even
      worse news is that it’s easy for people to not realize this, and to
      naïvely accept estimates that are nowhere close to the correct
      value of the marginal likelihood.”
                                       [Radford Neal’s blog, Aug. 23, 2008]
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Harmonic means



Approximating Zk from a posterior sample



      Use of the [harmonic mean] identity

             \mathbb{E}^{\pi_k}\!\left[\left.\frac{\varphi(\theta_k)}{\pi_k(\theta_k)\,L_k(\theta_k)}\,\right|\, x\right]
             = \int \frac{\varphi(\theta_k)}{\pi_k(\theta_k)\,L_k(\theta_k)}\,\frac{\pi_k(\theta_k)\,L_k(\theta_k)}{Z_k}\,d\theta_k
             = \frac{1}{Z_k}

      no matter what the proposal ϕ(·) is.
                          [Gelfand & Dey, 1994; Bartolucci et al., 2006]
      Direct exploitation of the MCMC output
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Harmonic means



Comparison with regular importance sampling


      Harmonic mean: the constraint is the opposite of the usual importance
      sampling requirement: ϕ(θ) must have lighter (rather than fatter) tails
      than πk (θk )Lk (θk ) for the approximation

                                 \widehat{Z}_{1k} = 1 \Big/ \frac{1}{T}\sum_{t=1}^{T} \frac{\varphi(\theta_k^{(t)})}{\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)})}

      to have a finite variance.
      E.g., use finite support kernels (like Epanechnikov’s kernel) for ϕ
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Harmonic means



Comparison with regular importance sampling (cont’d)



      Compare Z1k with a standard importance sampling approximation
                                     \widehat{Z}_{2k} = \frac{1}{T}\sum_{t=1}^{T} \frac{\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)})}{\varphi(\theta_k^{(t)})}

      where the θk^(t) 's are generated from the density ϕ(·) (with fatter
      tails, like a t distribution's)
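
      A toy check (not from the talk) on a conjugate normal model, where the exact
      evidence is available, contrasting the raw harmonic mean, the Gelfand and Dey
      version with a lighter-tailed ϕ, and the standard importance sampling estimate
      Z2k with a fatter-tailed proposal:

      # x_1..x_n ~ N(theta, 1), theta ~ N(0, 1): the evidence is a known multivariate normal density
      set.seed(42)
      n <- 20; x <- rnorm(n, mean = 1)
      lik  <- function(th) sapply(th, function(t) prod(dnorm(x, t)))
      Zex  <- mvtnorm::dmvnorm(x, sigma = diag(n) + 1)                  # exact evidence
      m    <- n * mean(x) / (n + 1); s <- sqrt(1 / (n + 1))             # posterior mean and sd
      Tn   <- 1e5
      post <- rnorm(Tn, m, s)                                           # exact posterior draws
      Z1   <- 1 / mean(1 / lik(post))                                   # harmonic mean (unstable)
      phi  <- function(th) dnorm(th, m, s / 2)                          # lighter-tailed phi
      Zgd  <- 1 / mean(phi(post) / (lik(post) * dnorm(post)))           # Gelfand & Dey
      prop <- rnorm(Tn, m, 2 * s)                                       # fatter-tailed proposal
      Z2   <- mean(lik(prop) * dnorm(prop) / dnorm(prop, m, 2 * s))     # regular importance sampling
      c(exact = Zex, harmonic = Z1, gelfand_dey = Zgd, importance = Z2)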
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Harmonic means



HPD indicator as ϕ
      Use the convex hull of MCMC simulations corresponding to the
      10% HPD region (easily derived!) and ϕ as indicator:
                                 \varphi(\theta) = \frac{10}{T}\sum_{t\in\mathrm{HPD}} \mathbb{I}_{\,d(\theta,\theta^{(t)})\le\epsilon}
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Harmonic means



Diabetes in Pima Indian women (cont’d)
      Comparison of the variation of the Bayes factor approximations
      based on 100 replicas of 20,000 simulations from the above
      harmonic mean sampler and importance samplers
      [Boxplots of the Bayes factor approximations (values roughly between 3.102
      and 3.116): harmonic mean vs. importance sampling]
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Mixtures to bridge



Approximating Zk using a mixture representation



                                                                           Bridge sampling redux

      Design a specific mixture for simulation [importance sampling]
      purposes, with density

                               ϕk (θk ) ∝ ω1 πk (θk )Lk (θk ) + ϕ(θk ) ,

      where ϕ(·) is arbitrary (but normalised)
      Note: ω1 is not a probability weight
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Mixtures to bridge



Approximating Z using a mixture representation (cont’d)

      Corresponding MCMC (=Gibbs) sampler
      At iteration t
           1   Take δ^(t) = 1 with probability

               \omega_1\pi_k(\theta_k^{(t-1)})L_k(\theta_k^{(t-1)}) \Big/
               \left\{\omega_1\pi_k(\theta_k^{(t-1)})L_k(\theta_k^{(t-1)}) + \varphi(\theta_k^{(t-1)})\right\}

               and δ^(t) = 2 otherwise;
           2   If δ^(t) = 1, generate θk^(t) ∼ MCMC(θk^(t−1) , θk^(t) ), where
               MCMC(θk , θk' ) denotes an arbitrary MCMC kernel associated
               with the posterior πk (θk |x) ∝ πk (θk )Lk (θk );
           3   If δ^(t) = 2, generate θk^(t) ∼ ϕ(θk ) independently
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Mixtures to bridge



Evidence approximation by mixtures
       Rao-Blackwellised estimate

            \hat\xi = \frac{1}{T}\sum_{t=1}^{T} \omega_1\pi_k(\theta_k^{(t)})L_k(\theta_k^{(t)}) \Big/
            \left\{\omega_1\pi_k(\theta_k^{(t)})L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)})\right\} ,

       converges to ω1 Zk /{ω1 Zk + 1}
       Deduce Ẑ3k from ω1 Ẑ3k /{ω1 Ẑ3k + 1} = ξ̂, i.e.

            \widehat{Z}_{3k} = \frac{\sum_{t=1}^{T} \omega_1\pi_k(\theta_k^{(t)})L_k(\theta_k^{(t)}) \big/
                                     \left\{\omega_1\pi_k(\theta_k^{(t)})L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)})\right\}}
                                    {\sum_{t=1}^{T} \varphi(\theta_k^{(t)}) \big/
                                     \left\{\omega_1\pi_k(\theta_k^{(t)})L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)})\right\}}

                                                                                    [Bridge sampler]
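
      A compact sketch (not the talk's code) of the whole scheme on a toy target, one
      observation x ∼ N(θ, 1) with θ ∼ N(0, 1), for which the evidence dnorm(x, 0, √2)
      is exact; the final line solves ω1 Z/(ω1 Z + 1) = ξ̂ for Z as above.

      # mixture MCMC sampler plus Rao-Blackwellised evidence estimate (toy sketch)
      set.seed(1)
      x      <- 1
      unpost <- function(th) dnorm(x, th) * dnorm(th)      # pi(theta) L(theta), unnormalised
      phi    <- function(th) dnorm(th, x / 2, 1)           # arbitrary normalised density
      w1     <- 1
      Tn <- 1e5; th <- 0; draws <- numeric(Tn)
      for (it in 1:Tn) {
        p1 <- w1 * unpost(th) / (w1 * unpost(th) + phi(th))
        if (runif(1) < p1) {                               # delta = 1: MCMC step targeting pi(.|x)
          prop <- th + rnorm(1)
          if (runif(1) < unpost(prop) / unpost(th)) th <- prop
        } else th <- rnorm(1, x / 2, 1)                    # delta = 2: independent draw from phi
        draws[it] <- th
      }
      num <- w1 * unpost(draws) / (w1 * unpost(draws) + phi(draws))
      den <- phi(draws)        / (w1 * unpost(draws) + phi(draws))
      Z3  <- sum(num) / sum(den) / w1                      # solve w1*Z/(w1*Z + 1) = xi-hat for Z
      c(exact = dnorm(x, 0, sqrt(2)), estimate = Z3)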
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Chib’s solution



Chib’s representation


       Direct application of Bayes’ theorem: given x ∼ fk (x|θk ) and
       θk ∼ πk (θk ),

                           Z_k = m_k(x) = \frac{f_k(x\mid\theta_k)\,\pi_k(\theta_k)}{\pi_k(\theta_k\mid x)}

       Use of an approximation to the posterior

                           Z_k = m_k(x) = \frac{f_k(x\mid\theta_k^{*})\,\pi_k(\theta_k^{*})}{\hat\pi_k(\theta_k^{*}\mid x)} .
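
      A one-line sanity check of the identity on the toy model used in the earlier
      sketches (one observation x ∼ N(θ, 1), θ ∼ N(0, 1)), where the posterior is
      available exactly, so the representation holds at any θ*:

      # Chib's identity at theta* = posterior mean; exact evidence is dnorm(x, 0, sqrt(2))
      x <- 1; theta_star <- x / 2
      Z_chib <- dnorm(x, theta_star) * dnorm(theta_star) / dnorm(theta_star, x / 2, sqrt(1 / 2))
      c(exact = dnorm(x, 0, sqrt(2)), chib = Z_chib)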
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Chib’s solution



Case of latent variables



      For missing variable z as in mixture models, natural Rao-Blackwell
      estimate
                          \hat\pi_k(\theta_k^{*}\mid x) = \frac{1}{T}\sum_{t=1}^{T} \pi_k(\theta_k^{*}\mid x,\, z_k^{(t)}) ,

       where the zk^(t) 's are Gibbs sampled latent variables
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Chib’s solution



Label switching


      A mixture model [special case of missing variable model] is
      invariant under permutations of the indices of the components.
      E.g., mixtures
                          0.3N (0, 1) + 0.7N (2.3, 1)
      and
                                       0.7N (2.3, 1) + 0.3N (0, 1)
      are exactly the same!
       c The component parameters θi are not identifiable
      marginally since they are exchangeable
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Chib’s solution



Connected difficulties


          1   Number of modes of the likelihood of order O(k!):
               c Maximization and even [MCMC] exploration of the
              posterior surface harder
          2   Under exchangeable priors on (θ, p) [prior invariant under
              permutation of the indices], all posterior marginals are
              identical:
               c Posterior expectation of θ1 equal to posterior expectation
              of θ2
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Chib’s solution



License


      Since Gibbs output does not produce exchangeability, the Gibbs
      sampler has not explored the whole parameter space: it lacks the
      energy to switch enough component allocations at once

      [Gibbs output for the mixture: traces of µi , pi and σi against iteration
      n (0–500), together with pairwise scatter plots of (µi , pi ), (pi , σi ) and
      (σi , µi ), showing no label switching]
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Chib’s solution



Label switching paradox




      We should observe the exchangeability of the components [label
      switching] to conclude about convergence of the Gibbs sampler.
      If we observe it, then we do not know how to estimate the
      parameters.
      If we do not, then we are uncertain about the convergence!!!
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Chib’s solution



Compensation for label switching
       For mixture models, zk^(t) usually fails to visit all configurations in a
       balanced way, despite the symmetry predicted by the theory

                        \pi_k(\theta_k\mid x) = \pi_k(\sigma(\theta_k)\mid x) = \frac{1}{k!}\sum_{\sigma\in\mathfrak{S}_k} \pi_k(\sigma(\theta_k)\mid x)

       for all σ’s in Sk , set of all permutations of {1, . . . , k}.
       Consequences on numerical approximation, biased by an order k!
       Recover the theoretical symmetry by using

                          \hat\pi_k(\theta_k^{*}\mid x) = \frac{1}{T\,k!}\sum_{\sigma\in\mathfrak{S}_k}\sum_{t=1}^{T} \pi_k(\sigma(\theta_k^{*})\mid x,\, z_k^{(t)}) .

                                                    [Berkhof, Mechelen, & Gelman, 2003]
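
      A sketch of that symmetrisation (with hypothetical helpers, not the talk's code):
      theta_star is a matrix whose columns are the k component parameter vectors,
      z_samples is a list of Gibbs-sampled allocations, and post_dens(theta, z) is
      assumed to return πk (θ|x, z).

      # average the conditional posterior density over all k! relabellings
      perms <- function(k) {                       # all permutations of 1:k, one per row
        if (k == 1) return(matrix(1, 1, 1))
        p <- perms(k - 1)
        do.call(rbind, lapply(1:k, function(i)
          cbind(i, matrix((1:k)[-i][p], nrow(p), k - 1))))
      }
      chib_symmetrised <- function(theta_star, z_samples, post_dens) {
        sig <- perms(ncol(theta_star))
        mean(sapply(seq_len(nrow(sig)), function(s)
          mean(sapply(z_samples, function(z) post_dens(theta_star[, sig[s, ]], z)))))
      }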
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Chib’s solution



Galaxy dataset

      n = 82 galaxies as a mixture of k normal distributions with both
      mean and variance unknown.
                                                          [Roeder, 1992]
      [Histogram of the galaxy data (relative frequency) with the average
      estimated mixture density overlaid]
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Chib’s solution



Galaxy dataset (k)
       Using only the original estimate, with θk* taken as the MAP estimator,

                                        log(m̂k (x)) = −105.1396

       for k = 3 (based on 10³ simulations), while introducing the
       permutations leads to

                                        log(m̂k (x)) = −103.3479

       Note that
                                  −105.1396 + log(3!) = −103.3479

         k               2          3          4          5          6          7          8
         log mk (x)   −115.68    −103.35    −102.66    −101.93    −102.88    −105.48    −108.44

       Estimates of the marginal likelihoods by the symmetrised Chib’s
       approximation (based on 10⁵ Gibbs iterations and, for k > 5, 100
       permutations selected at random in Sk ).
                                [Lee, Marin, Mengersen & Robert, 2008]
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Chib’s solution



Case of the probit model

      For the completion by z,
                                      \hat\pi(\theta\mid x) = \frac{1}{T}\sum_{t} \pi(\theta\mid x,\, z^{(t)})

      is a simple average of normal densities
       R Chib’s approximation code
      gibbs1=gibbsprobit(Niter,y,X1)
      gibbs2=gibbsprobit(Niter,y,X2)
      bfchi=mean(exp(dmvlnorm(t(t(gibbs2$mu)-model2$coeff[,1]),mean=rep(0,3),
              sigma=gibbs2$Sigma2)-probitlpost(model2$coeff[,1],y,X2)))/
            mean(exp(dmvlnorm(t(t(gibbs1$mu)-model1$coeff[,1]),mean=rep(0,2),
              sigma=gibbs1$Sigma2)-probitlpost(model1$coeff[,1],y,X1)))
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     Chib’s solution



Diabetes in Pima Indian women (cont’d)
      Comparison of the variation of the Bayes factor approximations
      based on 100 replicas of 20,000 simulations from the above Chib’s
      and importance samplers

      [Boxplots of the Bayes factor approximations (values roughly between 0.0240
      and 0.0255): Chib’s method vs. importance sampling]
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     The Savage–Dickey ratio



The Savage–Dickey ratio
      Special representation of the Bayes factor used for simulation
      Given a test H0 : θ = θ0 in a model f (x|θ, ψ) with a nuisance
      parameter ψ, under priors π0 (ψ) and π1 (θ, ψ) such that

                                              π1 (ψ|θ0 ) = π0 (ψ)

      then
                                               B_{01} = \frac{\pi_1(\theta_0\mid x)}{\pi_1(\theta_0)} ,

       with the obvious notations

                  \pi_1(\theta) = \int \pi_1(\theta,\psi)\,d\psi , \qquad \pi_1(\theta\mid x) = \int \pi_1(\theta,\psi\mid x)\,d\psi ,

                                           [Dickey, 1971; Verdinelli & Wasserman, 1995]
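
      A toy numerical check (not in the original slides): x1 ∼ N(θ, 1) and x2 ∼ N(ψ, 1)
      with independent N(0, 1) priors, testing H0 : θ = 0, so that π1 (ψ|θ0 ) = π0 (ψ)
      holds automatically and the exact Bayes factor is available for comparison.

      # Savage-Dickey: posterior over prior density of theta at theta0 = 0
      x1 <- 0.8
      B01_exact <- dnorm(x1, 0, 1) / dnorm(x1, 0, sqrt(2))          # marginal of x2 cancels
      theta_post <- rnorm(1e5, x1 / 2, sqrt(1 / 2))                 # posterior draws of theta under M1
      dens <- density(theta_post)
      B01_sd <- approx(dens$x, dens$y, xout = 0)$y / dnorm(0)       # estimate of pi1(0|x) / pi1(0)
      c(exact = B01_exact, savage_dickey = B01_sd)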
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     The Savage–Dickey ratio



Measure-theoretic difficulty

      The representation depends on the choice of versions of conditional
      densities:
      B01 = ∫ π0(ψ) f(x|θ0, ψ) dψ / ∫ π1(θ, ψ) f(x|θ, ψ) dψ dθ
                                                        [by definition]
          = [ ∫ π1(ψ|θ0) f(x|θ0, ψ) dψ  π1(θ0) ] / [ ∫ π1(θ, ψ) f(x|θ, ψ) dψ dθ  π1(θ0) ]
                                                        [specific version of π1(ψ|θ0)]
          = ∫ π1(θ0, ψ) f(x|θ0, ψ) dψ / [ m1(x) π1(θ0) ]
                                                        [specific version of π1(θ0, ψ)]
          = π1(θ0|x) / π1(θ0)

                        Thus Dickey's (1971) condition is not a condition
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     The Savage–Dickey ratio



Similar measure-theoretic difficulty


      Verdinelli-Wasserman extension:
                     B01 = [ π1(θ0|x) / π1(θ0) ] E^{π1(ψ|x,θ0)}[ π0(ψ) / π1(ψ|θ0) ]

      depends on similar choices of versions
      Monte Carlo implementation relies on continuous versions of all
      densities without making mention of it
                                          [Chen, Shao & Ibrahim, 2000]
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     The Savage–Dickey ratio



Computational implementation
      Starting from the (new) prior

                               π̃1(θ, ψ) = π1(θ) π0(ψ)

      define the associated posterior

                     π̃1(θ, ψ|x) = π0(ψ) π1(θ) f(x|θ, ψ) / m̃1(x)

      and impose

                     π̃1(θ0|x) / π1(θ0) = ∫ π0(ψ) f(x|θ0, ψ) dψ / m̃1(x)

      to hold.
      Then
                     B01 = [ π̃1(θ0|x) / π1(θ0) ] × [ m̃1(x) / m1(x) ]
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     The Savage–Dickey ratio



First ratio

      If (θ^(1), ψ^(1)), . . . , (θ^(T), ψ^(T)) ∼ π̃1(θ, ψ|x), then

                     (1/T) Σ_t π̃1(θ0|x, ψ^(t))

      converges to π̃1(θ0|x) (if the right version is used in θ0).
      When π̃1(θ0|x, ψ) is unavailable, replace it with

                     (1/T) Σ_{t=1}^T π̃1(θ0|x, z^(t), ψ^(t))
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     The Savage–Dickey ratio



Bridge revival (1)

      Since m̃1(x)/m1(x) is unknown, apparent failure!
      Use of the identity

         E^{π̃1(θ,ψ|x)}[ π1(θ, ψ) f(x|θ, ψ) / {π0(ψ) π1(θ) f(x|θ, ψ)} ]
             = E^{π̃1(θ,ψ|x)}[ π1(ψ|θ) / π0(ψ) ] = m1(x) / m̃1(x)

      to (biasedly) estimate m̃1(x)/m1(x) by

                     T / Σ_{t=1}^T [ π1(ψ^(t)|θ^(t)) / π0(ψ^(t)) ]

      based on the same sample from π̃1.
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     The Savage–Dickey ratio



Bridge revival (2)
      Alternative identity

         E^{π1(θ,ψ|x)}[ π0(ψ) π1(θ) f(x|θ, ψ) / {π1(θ, ψ) f(x|θ, ψ)} ]
             = E^{π1(θ,ψ|x)}[ π0(ψ) / π1(ψ|θ) ] = m̃1(x) / m1(x)

      suggests using a second sample (θ̄^(1), ψ̄^(1), z^(1)), . . . ,
      (θ̄^(T), ψ̄^(T), z^(T)) ∼ π1(θ, ψ|x) and

                     (1/T) Σ_{t=1}^T π0(ψ̄^(t)) / π1(ψ̄^(t)|θ̄^(t))

      Resulting estimate:

         B01 = [ (1/T) Σ_t π̃1(θ0|x, z^(t), ψ^(t)) / π1(θ0) ] × [ (1/T) Σ_{t=1}^T π0(ψ̄^(t)) / π1(ψ̄^(t)|θ̄^(t)) ]
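      Schematically, in R (hypothetical input vectors holding the required density evaluations along the two MCMC samples):

      bridge.sd <- function(post.tilde,    # pi~1(theta0 | x, z^(t), psi^(t)) over the pi~1 sample
                            prior.theta0,  # the scalar pi1(theta0)
                            prior0.psi,    # pi0(psi.bar^(t)) over the pi1 sample
                            cond1.psi) {   # pi1(psi.bar^(t) | theta.bar^(t)) over the same sample
        mean(post.tilde) / prior.theta0 * mean(prior0.psi / cond1.psi)
      }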
On some computational methods for Bayesian model choice
  Importance sampling solutions compared
     The Savage–Dickey ratio



Diabetes in Pima Indian women (cont’d)
      Comparison of the variation of the Bayes factor approximations
      based on 100 replicas of 20,000 simulations each, for the importance,
      Chib's, Savage–Dickey and bridge samplers above

      [Boxplots of the Bayes factor approximations for the IS, Chib, Savage–Dickey and bridge samplers; values range roughly from 2.8 to 3.4.]
On some computational methods for Bayesian model choice
  Nested sampling
     Purpose



Nested sampling: Goal


      Skilling’s (2007) technique using the one-dimensional
      representation:
                     Z = E^π[L(θ)] = ∫_0^1 ϕ(x) dx

      with
                     ϕ^{-1}(l) = P^π(L(θ) > l).

      Note: ϕ(·) is intractable in most cases.
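      A quick Monte Carlo check of the identity Z = E^π[L(θ)] on a toy Gaussian model (illustration only):

      set.seed(1)
      theta <- rnorm(1e5)                         # prior N(0, 1)
      Z.mc  <- mean(dnorm(2, mean = theta))       # E_pi[L(theta)] with L(theta) = N(2; theta, 1)
      c(Z.mc, dnorm(2, mean = 0, sd = sqrt(2)))   # exact evidence for the observation y = 2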
On some computational methods for Bayesian model choice
  Nested sampling
     Implementation



Nested sampling: First approximation

      Approximate Z by a Riemann sum:

                     Z = Σ_{i=1}^j (x_{i−1} − x_i) ϕ(x_i)

      where the x_i 's are either:
              deterministic:   x_i = e^{−i/N}
              or random:

                     x_0 = 1,    x_{i+1} = t_i x_i ,    t_i ∼ Be(N, 1)

              so that E[log x_i] = −i/N .
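      A small R illustration of the two sequences (hypothetical values of N and j):

      N <- 100; j <- 500
      x.det  <- exp(-(1:j) / N)                # deterministic sequence
      x.rand <- cumprod(rbeta(j, N, 1))        # random sequence: x_i = t_1 ... t_i
      summary(log(x.rand) + (1:j) / N)         # E[log x_i] = -i/N, so these values are centred at 0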
On some computational methods for Bayesian model choice
  Nested sampling
     Implementation



Extraneous white noise
      Take

         Z = ∫ e^{−θ} dθ = ∫ (1/δ) e^{−(1−δ)θ} · δ e^{−δθ} dθ = E_δ[ (1/δ) e^{−(1−δ)θ} ]

         Ẑ = (1/N) Σ_{i=1}^N δ^{−1} e^{−(1−δ)θ_i} (x_{i−1} − x_i) ,    θ_i ∼ E(δ) I(θ_i ≤ θ_{i−1})

      Comparison of variances and MSEs:

        N     deterministic   random
        50      4.64           10.5
                4.65           10.5
        100     2.47            4.9
                2.48            5.02
        500     .549            1.01
                .550            1.14
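      A quick Monte Carlo check of the identity behind this example (illustration only):

      delta <- 0.1
      theta <- rexp(1e5, rate = delta)                 # theta ~ E(delta)
      mean(exp(-(1 - delta) * theta)) / delta          # estimates Z = 1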
On some computational methods for Bayesian model choice
  Nested sampling
     Implementation



Nested sampling: Second approximation

      Replace (intractable) ϕ(xi ) by ϕi , obtained by

      Nested sampling
      Start with N values θ1 , . . . , θN sampled from π
      At iteration i,
          1   Take ϕi = L(θk ), where θk is the point with smallest
              likelihood in the pool of θi ’s
          2   Replace θk with a sample from the prior constrained to
              L(θ) > ϕi : the current N points are sampled from prior
              constrained to L(θ) > ϕi .
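      A minimal R sketch of the scheme on a toy model, prior N(0, 1) and likelihood N(y; θ, 1), with the constrained-prior step handled by naive rejection (illustration only, not Skilling's implementation):

      y <- 2; N <- 100; J <- 500                 # observation, pool size, iterations
      theta <- rnorm(N)                          # N points from the prior
      L <- dnorm(y, mean = theta)                # their likelihood values
      Z.hat <- 0; x.old <- 1
      for (i in 1:J) {
        k <- which.min(L)                        # point with the smallest likelihood
        phi.i <- L[k]
        x.new <- exp(-i / N)                     # deterministic x_i
        Z.hat <- Z.hat + (x.old - x.new) * phi.i
        x.old <- x.new
        repeat {                                 # naive constrained-prior simulation by rejection
          prop <- rnorm(1)
          if (dnorm(y, mean = prop) > phi.i) break
        }
        theta[k] <- prop; L[k] <- dnorm(y, mean = prop)
      }
      Z.hat                                      # compare with the exact dnorm(y, 0, sqrt(2))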
On some computational methods for Bayesian model choice
  Nested sampling
     Implementation



Nested sampling: Third approximation


      Iterate the above steps until a given stopping iteration j is
      reached: e.g.,
              observe very small changes in the approximation Z;
              reach the maximal value of L(θ) when the likelihood is
              bounded and its maximum is known;
              truncate the integral Z at level ε, i.e. replace

                     ∫_0^1 ϕ(x) dx     with     ∫_ε^1 ϕ(x) dx
On some computational methods for Bayesian model choice
  Nested sampling
     Error rates



Approximation error


      Error = Ẑ − Z
            = Σ_{i=1}^j (x_{i−1} − x_i) ϕ_i − ∫_0^1 ϕ(x) dx
            = − ∫_0^ε ϕ(x) dx                                               (Truncation Error)
            + [ Σ_{i=1}^j (x_{i−1} − x_i) ϕ(x_i) − ∫_ε^1 ϕ(x) dx ]           (Quadrature Error)
            + Σ_{i=1}^j (x_{i−1} − x_i) {ϕ_i − ϕ(x_i)}                       (Stochastic Error)


                                                  [Dominated by Monte Carlo!]
On some computational methods for Bayesian model choice
  Nested sampling
     Error rates



A CLT for the Stochastic Error

      The (dominating) stochastic error is O_P(N^{−1/2}):

                     N^{1/2} {Stochastic Error} →_D N(0, V)

      with
                     V = − ∫∫_{s,t ∈ [ε,1]} s ϕ′(s) t ϕ′(t) log(s ∨ t) ds dt.

                                                  [Proof based on Donsker's theorem]


      The number of simulated points equals the number of iterations j,
      and is a multiple of N: if one stops at the first iteration j such that
      e^{−j/N} < ε, then j = N ⌈− log ε⌉.
On some computational methods for Bayesian model choice
  Nested sampling
     Impact of dimension



Curse of dimension


      For a simple Gaussian-Gaussian model of dimension dim(θ) = d,
      the following 3 quantities are O(d):
          1   asymptotic variance of the NS estimator;
          2   number of iterations (necessary to reach a given truncation
              error);
          3   cost of one simulated sample.
      Therefore, CPU time necessary for achieving error level e is

                                                     O(d^3 / e^2)
On some computational methods for Bayesian model choice
  Nested sampling
     Constraints



Sampling from constr’d priors

      Exact simulation from the constrained prior is intractable in most
      cases!

      Skilling (2007) proposes to use MCMC, but:
              this introduces a bias (stopping rule).
              if the MCMC stationary distribution is the unconstrained prior, it becomes
              more and more difficult to sample points such that L(θ) > l as l
              increases.

      If exact simulation from the constrained prior is implementable, then a
      slice sampler can be devised at the same cost!
                                                       [Thanks, Gareth!]
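      For concreteness, a sketch of one random-walk Metropolis move leaving the prior truncated to {L(θ) > l} invariant (lik and lprior are hypothetical user-supplied likelihood and log-prior functions):

      constrained.move <- function(theta, l, lik, lprior, scale = 0.1) {
        prop <- theta + scale * rnorm(length(theta))    # symmetric random-walk proposal
        if (lik(prop) > l &&
            log(runif(1)) < lprior(prop) - lprior(theta)) prop else theta
      }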
On some computational methods for Bayesian model choice
  Nested sampling
     Importance variant



An IS variant of nested sampling

      Consider an instrumental prior π̃ and likelihood L̃, and the weight function

                     w(θ) = π(θ) L(θ) / { π̃(θ) L̃(θ) }

      and weighted NS estimator

                     Z = Σ_{i=1}^j (x_{i−1} − x_i) ϕ_i w(θ_i).

      Then choose (π̃, L̃) so that sampling from π̃ constrained to
      L̃(θ) > l is easy; e.g. N(c, I_d) constrained to ‖c − θ‖ < r.
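      Why the ball constraint is convenient: under N(c, I_d) the squared radius ‖θ − c‖² is χ²_d and the direction is uniform on the sphere, so exact constrained simulation is immediate (sketch, hypothetical function name):

      rconstr <- function(centre, r) {
        d   <- length(centre)
        u   <- runif(1, 0, pchisq(r^2, df = d))          # uniform on (0, P(chi^2_d < r^2))
        rad <- sqrt(qchisq(u, df = d))                   # radius from the truncated chi distribution
        dir <- rnorm(d); dir <- dir / sqrt(sum(dir^2))   # uniform direction on the sphere
        centre + rad * dir
      }
      # e.g. rconstr(rep(0, 10), r = 1) returns a N(0, I_10) draw inside the unit ball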
On some computational methods for Bayesian model choice
  Nested sampling
     A mixture comparison



Benchmark: Target distribution




      Posterior distribution on (µ, σ) associated with the mixture

                                     pN (0, 1) + (1 − p)N (µ, σ) ,

      when p is known
On some computational methods for Bayesian model choice
  Nested sampling
     A mixture comparison



Experiment


           n observations with
           µ = 2 and σ = 3/2,
           Use of a uniform prior
           both on (−2, 6) for µ
           and on (.001, 16) for
           log σ 2 .
           occurrences of posterior
           bursts for µ = xi
           computation of the
           various estimates of Z
On some computational methods for Bayesian model choice
  Nested sampling
     A mixture comparison



Experiment (cont’d)




   MCMC sample for n = 16 observations from the mixture.
   Nested sampling sequence with M = 1000 starting points.
On some computational methods for Bayesian model choice
  Nested sampling
     A mixture comparison



Experiment (cont’d)




   MCMC sample for n = 50 observations from the mixture.
   Nested sampling sequence with M = 1000 starting points.
On some computational methods for Bayesian model choice
  Nested sampling
     A mixture comparison



Comparison


      Monte Carlo and MCMC (= Gibbs) outputs based on T = 10^4
      simulations and numerical integration based on a 850 × 950 grid in
      the (µ, σ) parameter space.
      Nested sampling approximation based on a starting sample of
      M = 1000 points followed by at least 10^3 further simulations from
      the constr'd prior and a stopping rule at 95% of the observed
      maximum likelihood.
      Constr'd prior simulation based on 50 values simulated by random
      walk, accepting only steps leading to a lik'hood higher than the
      bound.
On some computational methods for Bayesian model choice
  Nested sampling
     A mixture comparison



Comparison (cont’d)
      [Boxplots of the four approximations V1–V4 of Z; values range roughly from 0.85 to 1.15.]

      Graph based on a sample of 10 observations for µ = 2 and
      σ = 3/2 (150 replicas).
On some computational methods for Bayesian model choice
  Nested sampling
     A mixture comparison



Comparison (cont’d)



      [Boxplots of the four approximations V1–V4 of Z; values range roughly from 0.90 to 1.10.]




      Graph based on a sample of 50 observations for µ = 2 and
      σ = 3/2 (150 replicas).
On some computational methods for Bayesian model choice
  Nested sampling
     A mixture comparison



Comparison (cont’d)



      [Boxplots of the four approximations V1–V4 of Z; values range roughly from 0.85 to 1.15.]




      Graph based on a sample of 100 observations for µ = 2 and
      σ = 3/2 (150 replicas).
On some computational methods for Bayesian model choice
  Nested sampling
     A mixture comparison



Comparison (cont’d)



      Nested sampling gets less reliable as the sample size increases
      The most reliable approach is the mixture Z3, although the harmonic
      solution Z1 is close to Chib's solution [taken as gold standard]
      The Monte Carlo method Z2 also produces poor approximations to Z
      (The kernel φ used in Z2 is a t non-parametric kernel estimate with
      standard bandwidth estimation.)
On some computational methods for Bayesian model choice
  ABC model choice
     ABC method



Approximate Bayesian Computation

      Bayesian setting: target is π(θ)f (x|θ)
      When likelihood f (x|θ) not in closed form, likelihood-free rejection
      technique:
      ABC algorithm
      For an observation y ∼ f (y|θ), under the prior π(θ), keep jointly
      simulating
                           θ′ ∼ π(θ) ,   x ∼ f(x|θ′) ,
      until the auxiliary variable x is equal to the observed value, x = y.

                                                          [Pritchard et al., 1999]
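      A toy illustration (not from the talk): with y ∼ Bin(10, p) and p ∼ U(0, 1), exact-match ABC recovers the Beta posterior:

      y <- 7
      p <- runif(1e5)                          # draws from the prior
      x <- rbinom(1e5, size = 10, prob = p)    # pseudo-data
      post <- p[x == y]                        # kept p's are draws from p | y
      c(mean(post), 8 / 12)                    # compare with the exact Beta(8, 4) posterior mean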
On some computational methods for Bayesian model choice
  ABC model choice
     ABC method



Population genetics example

      Tree of ancestors in a sample of genes
On some computational methods for Bayesian model choice
  ABC model choice
     ABC method



A as approximative


      When y is a continuous random variable, equality x = y is replaced
      with a tolerance condition,

                                 ρ(x, y) ≤ ε

      where ρ is a distance between summary statistics
      Output distributed from

                     π(θ) Pθ{ρ(x, y) < ε} ∝ π(θ | ρ(x, y) < ε)
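      A minimal R sketch on a toy normal model, with the sample mean as summary statistic (names and values illustrative only):

      abc.rej <- function(y, M = 1e5, eps = 0.1) {
        n  <- length(y)
        th <- rnorm(M, 0, sqrt(10))                    # theta' from the prior N(0, 10) (variance 10)
        xb <- rnorm(M, mean = th, sd = 1 / sqrt(n))    # simulated summary: mean of n draws from N(theta', 1)
        th[abs(xb - mean(y)) <= eps]                   # keep theta' whose summary is within eps of the data's
      }
      # e.g. hist(abc.rej(rnorm(30, mean = 1)))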
On some computational methods for Bayesian model choice
  ABC model choice
     ABC method



ABC improvements


      Simulating from the prior is often poor in efficiency
      Either modify the proposal distribution on θ to increase the density
      of x's within the vicinity of y...
           [Marjoram et al., 2003; Bortot et al., 2007; Sisson et al., 2007]

      ...or view the problem as one of conditional density estimation
      and develop techniques that allow for a larger ε
                                                  [Beaumont et al., 2002]
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docxPoojaSen20
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docxPoojaSen20
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfChris Hunter
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxnegromaestrong
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.MateoGardella
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterMateoGardella
 

Kürzlich hochgeladen (20)

Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
 

4th joint Warwick Oxford Statistics Seminar

  • 9. Bayes factor. Definition (Bayes factors): for testing the hypotheses H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0, under the prior π(Θ0)π0(θ) + π(Θ0^c)π1(θ), the central quantity is
        B01 = [π(Θ0|x)/π(Θ0^c|x)] / [π(Θ0)/π(Θ0^c)] = ∫_{Θ0} f(x|θ)π0(θ)dθ / ∫_{Θ0^c} f(x|θ)π1(θ)dθ.   [Jeffreys, 1939]
  • 10. Evidence. Problems using a similar quantity, the evidence
        Zk = ∫_{Θk} πk(θk) Lk(θk) dθk,
        aka the marginal likelihood.   [Jeffreys, 1939]
  • 11. On some computational methods for Bayesian model choice Importance sampling solutions compared A comparison of importance sampling solutions 1 Introduction 2 Importance sampling solutions compared Regular importance Bridge sampling Harmonic means Mixtures to bridge Chib’s solution The Savage–Dickey ratio 3 Nested sampling 4 ABC model choice
  • 12. Regular importance: Bayes factor approximation. When approximating the Bayes factor
        B01 = ∫_{Θ0} f0(x|θ0)π0(θ0)dθ0 / ∫_{Θ1} f1(x|θ1)π1(θ1)dθ1,
        use of importance functions ϖ0 and ϖ1 and of
        B̂01 = [n0^{-1} Σ_{i=1}^{n0} f0(x|θ0^i)π0(θ0^i)/ϖ0(θ0^i)] / [n1^{-1} Σ_{i=1}^{n1} f1(x|θ1^i)π1(θ1^i)/ϖ1(θ1^i)],   θk^i ∼ ϖk.
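As a concrete illustration of this estimator, here is a minimal R sketch (a toy example of mine, not code from the talk) that approximates B01 by importance sampling for M0: x_i ∼ N(0,1) against M1: x_i ∼ N(θ,1) with θ ∼ N(0,1), a case where the exact Bayes factor is available for checking; the proposal is a normal centred at the MLE with inflated variance, in the spirit of the probit example that follows.

        ## toy importance-sampling approximation of B01 (assumed names, not from the slides)
        set.seed(0)
        n <- 10; x <- rnorm(n, 0.5); S <- sum(x)
        loglik <- function(th) -n/2*log(2*pi) - 0.5*(sum(x^2) - 2*th*S + n*th^2)
        prop_mean <- S/n; prop_sd <- sqrt(2/n)          # proposal ~ N(MLE, 2 * var(MLE))
        th <- rnorm(1e5, prop_mean, prop_sd)
        logw <- loglik(th) + dnorm(th, 0, 1, log = TRUE) - dnorm(th, prop_mean, prop_sd, log = TRUE)
        m1_hat <- mean(exp(logw))                        # importance estimate of m1(x)
        B01_hat <- exp(sum(dnorm(x, 0, 1, log = TRUE))) / m1_hat
        ## exact answer for this conjugate toy model
        logm1 <- -n/2*log(2*pi) - 0.5*log(n + 1) - 0.5*(sum(x^2) - S^2/(n + 1))
        c(estimate = B01_hat, exact = exp(sum(dnorm(x, 0, 1, log = TRUE)) - logm1))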
  • 13. On some computational methods for Bayesian model choice Importance sampling solutions compared Regular importance Diabetes in Pima Indian women Example (R benchmark) “A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix (AZ), was tested for diabetes according to WHO criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases.” 200 Pima Indian women with observed variables plasma glucose concentration in oral glucose tolerance test diastolic blood pressure diabetes pedigree function presence/absence of diabetes
  • 14–15. Probit modelling on Pima Indian women. Probability of diabetes as a function of the above variables,
        P(y = 1|x) = Φ(x1 β1 + x2 β2 + x3 β3),
        test of H0 : β3 = 0 for the 200 observations of Pima.tr, based on a g-prior modelling: β ∼ N3(0, n (X^T X)^{-1}).
  • 16–18. MCMC 101 for probit models. Use of either a random walk proposal β' = β + ε in a Metropolis–Hastings algorithm (since the likelihood is available), or of a Gibbs sampler that takes advantage of the missing/latent variable
        z|y, x, β ∼ N(x^T β, 1) I_{z≥0}^y I_{z≤0}^{1−y}
        (since β|y, X, z is distributed as a standard normal).   [Gibbs three times faster]
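Below is a minimal, self-contained R sketch of this latent-variable Gibbs sampler. It is not the authors' gibbsprobit/hmprobit code: it assumes a design matrix X and binary response y as in the slides' code, uses the g-prior β ∼ N(0, n(X'X)^{-1}) introduced above, and samples z by inversion of the truncated normal cdf.

        gibbs_probit <- function(y, X, niter = 1e4) {
          n <- nrow(X); p <- ncol(X)
          Sigma <- solve((1 + 1/n) * crossprod(X))   # posterior covariance of beta given z
          R <- chol(Sigma)
          beta <- rep(0, p)
          out <- matrix(NA_real_, niter, p)
          for (t in 1:niter) {
            mu <- drop(X %*% beta)
            u <- runif(n)
            p0 <- pnorm(0, mu, 1)                    # P(z < 0 | mu)
            z <- ifelse(y == 1,
                        qnorm(p0 + u * (1 - p0), mu, 1),   # N(mu,1) truncated to z > 0
                        qnorm(u * p0, mu, 1))              # N(mu,1) truncated to z < 0
            bhat <- Sigma %*% crossprod(X, z)              # posterior mean of beta given z
            beta <- drop(bhat + t(R) %*% rnorm(p))
            out[t, ] <- beta
          }
          out
        }
        ## usage sketch, with y and X1 as in the slides' code: beta_draws <- gibbs_probit(y, X1, 2e4)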
  • 19–20. Importance sampling for the Pima Indian dataset. Use of the importance function inspired from the MLE estimate distribution, β ∼ N(β̂, Σ̂) (variance inflated by a factor 2 in the code).
        R importance sampling code:
        model1=summary(glm(y~-1+X1,family=binomial(link="probit")))
        model2=summary(glm(y~-1+X2,family=binomial(link="probit")))
        is1=rmvnorm(Niter,mean=model1$coeff[,1],sigma=2*model1$cov.unscaled)
        is2=rmvnorm(Niter,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)
        bfis=mean(exp(probitlpost(is1,y,X1)-dmvlnorm(is1,mean=model1$coeff[,1],sigma=2*model1$cov.unscaled)))/
             mean(exp(probitlpost(is2,y,X2)-dmvlnorm(is2,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)))
  • 21. Diabetes in Pima Indian women. Comparison of the variation of the Bayes factor approximations based on 100 replicas of 20,000 simulations from the prior and from the above MLE importance sampler. [Boxplots of the Monte Carlo and importance sampling approximations; values roughly between 2 and 5.]
  • 22. Bridge sampling. Special case: if π1(θ1|x) ∝ π̃1(θ1|x) and π2(θ2|x) ∝ π̃2(θ2|x) live on the same space (Θ1 = Θ2), then
        B12 ≈ (1/n) Σ_{i=1}^n π̃1(θi|x)/π̃2(θi|x),   θi ∼ π2(θ|x).
        [Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]
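A quick numerical check of this identity (my own toy example, not from the talk): two unnormalised densities on the same space whose normalising constants are known, so the bridge estimate can be compared with the exact ratio.

        set.seed(1)
        q1 <- function(th) exp(-th^2 / 2)                        # Z1 = sqrt(2*pi)
        q2 <- function(th) exp(-(th - 1)^2 / (2 * 1.5^2))        # Z2 = 1.5 * sqrt(2*pi)
        theta <- rnorm(1e5, 1, 1.5)                              # draws from pi2
        c(bridge = mean(q1(theta) / q2(theta)), truth = 1 / 1.5) # B12 = Z1/Z2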
  • 23–24. Bridge sampling variance. The bridge sampling estimator does poorly if
        var(B̂12)/B12² ≈ (1/n) E[ {(π1(θ) − π2(θ))/π2(θ)}² ]
        is large, i.e. if π1 and π2 have little overlap...
  • 25. (Further) bridge sampling. General identity:
        B12 = ∫ π̃1(θ|x) α(θ) π2(θ|x) dθ / ∫ π̃2(θ|x) α(θ) π1(θ|x) dθ   ∀ α(·)
            ≈ [n2^{-1} Σ_{i=1}^{n2} π̃1(θ2i|x) α(θ2i)] / [n1^{-1} Σ_{i=1}^{n1} π̃2(θ1i|x) α(θ1i)],   θji ∼ πj(θ|x).
  • 26. Optimal bridge sampling. The optimal choice of auxiliary function is
        α ∝ (n1 + n2) / {n1 π1(θ|x) + n2 π2(θ|x)},
        leading to
        B̂12 ≈ [n2^{-1} Σ_{i=1}^{n2} π̃1(θ2i|x)/{n1 π1(θ2i|x) + n2 π2(θ2i|x)}] / [n1^{-1} Σ_{i=1}^{n1} π̃2(θ1i|x)/{n1 π1(θ1i|x) + n2 π2(θ1i|x)}].
        Back later!
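Since the optimal α involves the normalised posteriors, the estimate is obtained by iteration, as in Meng & Wong's scheme; here is a small self-contained R sketch on the same Gaussian toy problem as above (names and tuning choices are mine, not the talk's).

        bridge_optimal <- function(q1, q2, theta1, theta2, niter = 50) {
          n1 <- length(theta1); n2 <- length(theta2)
          l1 <- q1(theta1) / q2(theta1)          # ratios at draws from pi1
          l2 <- q1(theta2) / q2(theta2)          # ratios at draws from pi2
          r <- 1                                 # starting value for B12
          for (it in 1:niter)
            r <- mean(l2 / (n1 * l2 + n2 * r)) / mean(1 / (n1 * l1 + n2 * r))
          r
        }
        q1 <- function(th) exp(-th^2 / 2)                        # Z1 = sqrt(2*pi)
        q2 <- function(th) exp(-(th - 1)^2 / (2 * 1.5^2))        # Z2 = 1.5 * sqrt(2*pi)
        set.seed(2)
        c(estimate = bridge_optimal(q1, q2, rnorm(1e4, 0, 1), rnorm(1e4, 1, 1.5)),
          truth = 1 / 1.5)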
  • 27–28. Optimal bridge sampling (2). Reason:
        Var(B̂12)/B12² ≈ (1/(n1 n2)) [ ∫ π1(θ)π2(θ){n1 π1(θ) + n2 π2(θ)} α(θ)² dθ / {∫ π1(θ)π2(θ)α(θ) dθ}² − 1 ]
        (by the δ method). Drawback: dependence on the unknown normalising constants, solved iteratively.
  • 29–30. Extension to varying dimensions. When dim(Θ1) ≠ dim(Θ2), e.g. θ2 = (θ1, ψ), introduction of a pseudo-posterior density ω(ψ|θ1, x), augmenting π1(θ1|x) into the joint distribution π1(θ1|x) × ω(ψ|θ1, x) on Θ2, so that
        B12 = ∫ π̃1(θ1|x) ω(ψ|θ1, x) α(θ1, ψ) π2(θ1, ψ|x) dθ1 dψ / ∫ π̃2(θ1, ψ|x) α(θ1, ψ) π1(θ1|x) ω(ψ|θ1, x) dθ1 dψ
            = E^{π2}[ π̃1(θ1) ω(ψ|θ1) / π̃2(θ1, ψ) ] = Eϕ[ π̃1(θ1) ω(ψ|θ1) / ϕ(θ1, ψ) ] / Eϕ[ π̃2(θ1, ψ) / ϕ(θ1, ψ) ]
        for any conditional density ω(ψ|θ1) and any joint density ϕ.
  • 31–32. Illustration for the Pima Indian dataset. Use of the MLE induced conditional of β3 given (β1, β2) as a pseudo-posterior, and a mixture of both MLE approximations on β3 in the bridge sampling estimate.
        R bridge sampling code:
        cova=model2$cov.unscaled
        expecta=model2$coeff[,1]
        covw=cova[3,3]-t(cova[1:2,3])%*%ginv(cova[1:2,1:2])%*%cova[1:2,3]
        probit1=hmprobit(Niter,y,X1)
        probit2=hmprobit(Niter,y,X2)
        pseudo=rnorm(Niter,meanw(probit1),sqrt(covw))
        probit1p=cbind(probit1,pseudo)
        bfbs=mean(exp(probitlpost(probit2[,1:2],y,X1)+dnorm(probit2[,3],meanw(probit2[,1:2]),sqrt(covw),log=T))/
                  (dmvnorm(probit2,expecta,cova)+dnorm(probit2[,3],expecta[3],cova[3,3])))/
             mean(exp(probitlpost(probit1p,y,X2))/(dmvnorm(probit1p,expecta,cova)+dnorm(pseudo,expecta[3],cova[3,3])))
  • 33. Diabetes in Pima Indian women (cont'd). Comparison of the variation of the Bayes factor approximations based on 100 × 20,000 simulations from the prior (MC), the above bridge sampler and the above importance sampler. [Boxplots for MC, Bridge and IS; values roughly between 2 and 5.]
  • 34–35. The original harmonic mean estimator. When θkt ∼ πk(θ|x),
        (1/T) Σ_{t=1}^T 1/L(θkt|x)
        is an unbiased estimator of 1/mk(x).   [Newton & Raftery, 1994]
        Highly dangerous: most often leads to an infinite variance!!!
  • 36–37. “The Worst Monte Carlo Method Ever”: “The good news is that the Law of Large Numbers guarantees that this estimator is consistent ie, it will very likely be very close to the correct answer if you use a sufficiently large number of points from the posterior distribution. The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that it’s easy for people to not realize this, and to naïvely accept estimates that are nowhere close to the correct value of the marginal likelihood.”   [Radford Neal’s blog, Aug. 23, 2008]
  • 38–39. Approximating Zk from a posterior sample. Use of the [harmonic mean] identity
        E^{πk}[ ϕ(θk) / {πk(θk) Lk(θk)} | x ] = ∫ [ϕ(θk)/{πk(θk)Lk(θk)}] [πk(θk)Lk(θk)/Zk] dθk = 1/Zk,
        no matter what the proposal ϕ(·) is.   [Gelfand & Dey, 1994; Bartolucci et al., 2006]
        Direct exploitation of the MCMC output.
  • 40–41. Comparison with regular importance sampling. Harmonic mean: constraint opposed to the usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than πk(θk)Lk(θk) for the approximation
        Ẑ1k = 1 / [ (1/T) Σ_{t=1}^T ϕ(θk^{(t)}) / {πk(θk^{(t)}) Lk(θk^{(t)})} ]
        to have a finite variance. E.g., use finite-support kernels (like Epanechnikov's kernel) for ϕ.
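A self-contained R illustration of this finite-variance requirement on a conjugate toy model (x_i ∼ N(θ,1), θ ∼ N(0,1), my own example), where the evidence is known in closed form; the proposal ϕ is a normal with half the posterior standard deviation, hence lighter tails than the posterior.

        set.seed(2)
        n <- 20; x <- rnorm(n, mean = 1); S <- sum(x)
        post_var <- 1 / (n + 1); post_mean <- S * post_var
        theta <- rnorm(1e5, post_mean, sqrt(post_var))        # (ideal) posterior sample
        logq <- function(th)                                  # log{ prior x likelihood }
          dnorm(th, 0, 1, log = TRUE) - n/2*log(2*pi) - 0.5*(sum(x^2) - 2*th*S + n*th^2)
        logphi <- dnorm(theta, post_mean, sqrt(post_var)/2, log = TRUE)  # light-tailed phi
        logZ_hat <- -log(mean(exp(logphi - logq(theta))))
        logZ_exact <- -n/2*log(2*pi) - 0.5*log(n + 1) - 0.5*(sum(x^2) - S^2/(n + 1))
        c(estimate = logZ_hat, exact = logZ_exact)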
  • 42. Comparison with regular importance sampling (cont'd). Compare Ẑ1k with a standard importance sampling approximation
        Ẑ2k = (1/T) Σ_{t=1}^T πk(θk^{(t)}) Lk(θk^{(t)}) / ϕ(θk^{(t)}),
        where the θk^{(t)}'s are generated from the density ϕ(·) (with fatter tails, like t's).
  • 43. HPD indicator as ϕ. Use the convex hull of the MCMC simulations corresponding to the 10% HPD region (easily derived!) and ϕ as its indicator:
        ϕ(θ) ∝ (10/T) Σ_{t∈HPD} I{ d(θ, θ^{(t)}) ≤ ε }.
  • 44. Diabetes in Pima Indian women (cont'd). Comparison of the variation of the Bayes factor approximations based on 100 replicas of 20,000 simulations, for the above harmonic mean sampler and importance samplers. [Boxplots; approximations concentrated around 3.10 to 3.12.]
  • 45–46. Approximating Zk using a mixture representation. Bridge sampling redux: design a specific mixture for simulation [importance sampling] purposes, with density
        ϕk(θk) ∝ ω1 πk(θk) Lk(θk) + ϕ(θk),
        where ϕ(·) is arbitrary (but normalised). Note: ω1 is not a probability weight.
  • 47–49. Approximating Z using a mixture representation (cont'd). Corresponding MCMC (= Gibbs) sampler. At iteration t:
        1. take δ^{(t)} = 1 with probability ω1 πk(θk^{(t−1)}) Lk(θk^{(t−1)}) / {ω1 πk(θk^{(t−1)}) Lk(θk^{(t−1)}) + ϕ(θk^{(t−1)})}, and δ^{(t)} = 2 otherwise;
        2. if δ^{(t)} = 1, generate θk^{(t)} ∼ MCMC(θk^{(t−1)}, θk), where MCMC(θk, θk') denotes an arbitrary MCMC kernel associated with the posterior πk(θk|x) ∝ πk(θk)Lk(θk);
        3. if δ^{(t)} = 2, generate θk^{(t)} ∼ ϕ(θk) independently.
  • 50–51. Evidence approximation by mixtures. Rao-Blackwellised estimate
        ξ̂ = (1/T) Σ_{t=1}^T ω1 πk(θk^{(t)}) Lk(θk^{(t)}) / {ω1 πk(θk^{(t)}) Lk(θk^{(t)}) + ϕ(θk^{(t)})},
        which converges to ω1 Zk / {ω1 Zk + 1}. Deduce Ẑ3k from ω1 Ẑ3k / {ω1 Ẑ3k + 1} = ξ̂, i.e.
        ω1 Ẑ3k = Σ_{t=1}^T [ω1 πk(θk^{(t)}) Lk(θk^{(t)}) / {ω1 πk(θk^{(t)}) Lk(θk^{(t)}) + ϕ(θk^{(t)})}] / Σ_{t=1}^T [ϕ(θk^{(t)}) / {ω1 πk(θk^{(t)}) Lk(θk^{(t)}) + ϕ(θk^{(t)})}].
        [Bridge sampler]
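To make the construction concrete, here is a self-contained R sketch on a toy model with a single observation x ∼ N(θ,1) and θ ∼ N(0,1), so that Z is known exactly. The "MCMC step" below uses an exact posterior draw, which is one valid choice of kernel, and ω1 and ϕ are arbitrary choices of mine.

        set.seed(3)
        x <- 1
        q     <- function(th) dnorm(th, 0, 1) * dnorm(x, th, 1)   # prior x likelihood
        phi   <- function(th) dnorm(th, x/2, 2)                   # arbitrary normalised density
        rphi  <- function() rnorm(1, x/2, 2)
        rpost <- function() rnorm(1, x/2, sqrt(1/2))              # exact posterior draw (MCMC kernel)
        w1 <- 5                                                   # omega1: not a probability weight
        niter <- 1e5; theta <- numeric(niter); th <- rpost()
        for (t in 1:niter) {
          p1 <- w1 * q(th) / (w1 * q(th) + phi(th))               # probability that delta = 1
          th <- if (runif(1) < p1) rpost() else rphi()
          theta[t] <- th
        }
        xi <- mean(w1 * q(theta) / (w1 * q(theta) + phi(theta)))  # Rao-Blackwellised xi-hat
        Z3 <- xi / (w1 * (1 - xi))                                # solves w1*Z/(w1*Z+1) = xi
        c(estimate = Z3, exact = dnorm(x, 0, sqrt(2)))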
  • 52–53. Chib's representation. Direct application of Bayes' theorem: given x ∼ fk(x|θk) and θk ∼ πk(θk),
        Zk = mk(x) = fk(x|θk) πk(θk) / πk(θk|x).
        Use of an approximation to the posterior:
        Ẑk = m̂k(x) = fk(x|θk*) πk(θk*) / π̂k(θk*|x).
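A small R check of this representation on the same conjugate toy (single x ∼ N(θ,1), θ ∼ N(0,1), my own example): the identity is exact when the true posterior density is used, and the last lines show the practical plug-in where π(θ*|x) is replaced by a density estimate from posterior draws (in latent-variable models this would be the Rao-Blackwell average of the next slide).

        x <- 1
        q    <- function(th) dnorm(th, 0, 1) * dnorm(x, th, 1)   # prior x likelihood
        post <- function(th) dnorm(th, x/2, sqrt(1/2))           # exact posterior density
        theta_star <- x/2                                        # high posterior density point
        c(identity = q(theta_star) / post(theta_star), exact = dnorm(x, 0, sqrt(2)))
        ## plug-in version: estimate the posterior density at theta_star from draws
        set.seed(4)
        draws <- rnorm(1e5, x/2, sqrt(1/2))
        d <- density(draws)
        q(theta_star) / approx(d$x, d$y, xout = theta_star)$y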
  • 54. Case of latent variables. For a missing variable z, as in mixture models, natural Rao-Blackwell estimate
        π̂k(θk*|x) = (1/T) Σ_{t=1}^T πk(θk*|x, zk^{(t)}),
        where the zk^{(t)}'s are Gibbs-sampled latent variables.
  • 55–56. Label switching. A mixture model [special case of missing variable model] is invariant under permutations of the indices of its components. E.g., the mixtures 0.3 N(0,1) + 0.7 N(2.3,1) and 0.7 N(2.3,1) + 0.3 N(0,1) are exactly the same! Hence the component parameters θi are not identifiable marginally, since they are exchangeable.
  • 57–58. Connected difficulties. 1. The likelihood has a number of modes of order O(k!): maximization, and even [MCMC] exploration of the posterior surface, becomes harder. 2. Under exchangeable priors on (θ, p) [priors invariant under permutation of the indices], all posterior marginals are identical: the posterior expectation of θ1 equals the posterior expectation of θ2.
  • 59. License. Since the Gibbs output does not produce exchangeability, the Gibbs sampler has not explored the whole parameter space: it lacks the energy to switch simultaneously enough component allocations at once. [Trace and scatter plots of (µi, σi, pi) over 500 Gibbs iterations, showing no label switching.]
  • 60–62. Label switching paradox. We should observe the exchangeability of the components [label switching] to conclude about convergence of the Gibbs sampler. If we observe it, then we do not know how to estimate the parameters. If we do not, then we are uncertain about the convergence!!!
  • 63–64. Compensation for label switching. For mixture models, zk^{(t)} usually fails to visit all configurations in a balanced way, despite the symmetry predicted by the theory,
        πk(θk|x) = πk(σ(θk)|x) = (1/k!) Σ_{σ∈Sk} πk(σ(θk)|x)
        for all σ's in Sk, the set of all permutations of {1, ..., k}. Consequences on the numerical approximation, biased by an order k!. Recover the theoretical symmetry by using
        π̃k(θk*|x) = (1/(T k!)) Σ_{σ∈Sk} Σ_{t=1}^T πk(σ(θk*)|x, zk^{(t)}).
        [Berkhof, Mechelen, & Gelman, 2003]
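A hedged R sketch of this symmetrisation: cond_dens(theta, z) and relabel(theta, s) are placeholders of mine (not functions from the talk), standing for the conditional posterior density πk(θ|x, z) given one Gibbs-sampled allocation vector z and for the application of a permutation s to the component labels of θ; zs is a list of such allocation vectors.

        perms <- function(k) {                       # all k! permutations of 1:k
          if (k == 1) return(matrix(1, 1, 1))
          p <- perms(k - 1)
          do.call(rbind, lapply(1:k, function(i) cbind(i, ifelse(p >= i, p + 1, p))))
        }
        sym_chib_dens <- function(theta_star, zs, cond_dens, relabel, k) {
          S <- perms(k)
          vals <- sapply(zs, function(z)
            mean(apply(S, 1, function(s) cond_dens(relabel(theta_star, s), z))))
          mean(vals)                                 # (1/(T k!)) double sum over sigma and t
        }
        ## the symmetrised log evidence then follows Chib's formula:
        ## log Z = log L(theta_star) + log prior(theta_star) - log sym_chib_dens(...)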
  • 65. Galaxy dataset. n = 82 galaxies, modelled as a mixture of k normal distributions with both mean and variance unknown.   [Roeder, 1992]   [Histogram of the data with the average density estimate overlaid.]
  • 66–68. Galaxy dataset (k). Using only the original estimate, with θk* as the MAP estimator, log(m̂k(x)) = −105.1396 for k = 3 (based on 10³ simulations), while introducing the permutations leads to log(m̂k(x)) = −103.3479. Note that −105.1396 + log(3!) = −103.3479.
        k        2        3        4        5        6        7        8
        mk(x)  −115.68  −103.35  −102.66  −101.93  −102.88  −105.48  −108.44
        Estimates of the marginal likelihoods by the symmetrised Chib's approximation (based on 10⁵ Gibbs iterations and, for k > 5, 100 permutations selected at random in Sk).   [Lee, Marin, Mengersen & Robert, 2008]
  • 69–70. Case of the probit model. For the completion by z,
        π̂(θ|x) = (1/T) Σ_t π(θ|x, z^{(t)})
        is a simple average of normal densities.
        R code (Chib's solution):
        gibbs1=gibbsprobit(Niter,y,X1)
        gibbs2=gibbsprobit(Niter,y,X2)
        bfchi=mean(exp(dmvlnorm(t(t(gibbs2$mu)-model2$coeff[,1]),mean=rep(0,3),sigma=gibbs2$Sigma2)-probitlpost(model2$coeff[,1],y,X2)))/
              mean(exp(dmvlnorm(t(t(gibbs1$mu)-model1$coeff[,1]),mean=rep(0,2),sigma=gibbs1$Sigma2)-probitlpost(model1$coeff[,1],y,X1)))
  • 71. Diabetes in Pima Indian women (cont'd). Comparison of the variation of the Bayes factor approximations based on 100 replicas of 20,000 simulations, for the above Chib's and importance samplers. [Boxplots; values concentrated around 0.0240 to 0.0255.]
  • 72. The Savage–Dickey ratio. Special representation of the Bayes factor used for simulation. Given a test H0 : θ = θ0 in a model f(x|θ, ψ) with a nuisance parameter ψ, under priors π0(ψ) and π1(θ, ψ) such that π1(ψ|θ0) = π0(ψ), then
        B01 = π1(θ0|x) / π1(θ0),
        with the obvious notations π1(θ) = ∫ π1(θ, ψ) dψ and π1(θ|x) = ∫ π1(θ, ψ|x) dψ.   [Dickey, 1971; Verdinelli & Wasserman, 1995]
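In the special case without a nuisance parameter the ratio reduces to π(θ0|x)/π(θ0); the following self-contained R sketch (a toy of mine) checks this on x_i ∼ N(θ,1) with θ ∼ N(0,1), estimating the posterior density at θ0 from simulations, as one would do in practice.

        set.seed(4)
        n <- 10; x <- rnorm(n, 0.3); S <- sum(x)
        post_mean <- S/(n + 1); post_sd <- sqrt(1/(n + 1))
        ## exact Bayes factor: B01 = f(x|0) / m1(x)
        logf0 <- sum(dnorm(x, 0, 1, log = TRUE))
        logm1 <- -n/2*log(2*pi) - 0.5*log(n + 1) - 0.5*(sum(x^2) - S^2/(n + 1))
        B01_exact <- exp(logf0 - logm1)
        ## Savage-Dickey with a posterior sample and a kernel density estimate at theta0 = 0
        draws <- rnorm(1e5, post_mean, post_sd)
        d <- density(draws)
        B01_sd <- approx(d$x, d$y, xout = 0)$y / dnorm(0, 0, 1)
        c(exact = B01_exact, savage_dickey = B01_sd)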
  • 73. Measure-theoretic difficulty. The representation depends on the choice of versions of conditional densities:
        B01 = ∫ π0(ψ) f(x|θ0, ψ) dψ / ∫ π1(θ, ψ) f(x|θ, ψ) dψ dθ   [by definition]
            = ∫ π1(ψ|θ0) f(x|θ0, ψ) dψ / ∫ π1(θ, ψ) f(x|θ, ψ) dψ dθ   [specific version of π1(ψ|θ0)]
            = ∫ π1(θ0, ψ) f(x|θ0, ψ) dψ / {m1(x) π1(θ0)}   [specific version of π1(θ0, ψ)]
            = π1(θ0|x) / π1(θ0).
        Hence Dickey's (1971) condition is not a condition.
  • 74–75. Similar measure-theoretic difficulty. Verdinelli–Wasserman extension:
        B01 = {π1(θ0|x) / π1(θ0)} E^{π1(ψ|x,θ0)}[ π0(ψ) / π1(ψ|θ0) ]
        depends on similar choices of versions. The Monte Carlo implementation relies on continuous versions of all densities without making mention of it.   [Chen, Shao & Ibrahim, 2000]
  • 76–77. Computational implementation. Starting from the (new) prior
        π̃1(θ, ψ) = π1(θ) π0(ψ),
        define the associated posterior
        π̃1(θ, ψ|x) = π0(ψ) π1(θ) f(x|θ, ψ) / m̃1(x)
        and impose the version
        π̃1(θ0|x) = π1(θ0) ∫ π0(ψ) f(x|θ0, ψ) dψ / m̃1(x)
        to hold. Then
        B01 = {π̃1(θ0|x) / π1(θ0)} × {m̃1(x) / m1(x)}.
  • 78–79. First ratio. If (θ^{(1)}, ψ^{(1)}), ..., (θ^{(T)}, ψ^{(T)}) ∼ π̃(θ, ψ|x), then
        (1/T) Σ_t π̃1(θ0|x, ψ^{(t)})
        converges to π̃1(θ0|x) (if the right version is used in θ0). When π̃1(θ0|x, ψ) is unavailable, replace it with
        (1/T) Σ_{t=1}^T π̃1(θ0|x, z^{(t)}, ψ^{(t)}).
  • 80–81. Bridge revival (1). Since m1(x)/m̃1(x) is unknown, apparent failure! Use of the identity
        E^{π̃1(θ,ψ|x)}[ π1(θ, ψ) f(x|θ, ψ) / {π0(ψ) π1(θ) f(x|θ, ψ)} ] = E^{π̃1(θ,ψ|x)}[ π1(ψ|θ) / π0(ψ) ] = m1(x)/m̃1(x)
        to (biasedly) estimate m1(x)/m̃1(x) by
        (1/T) Σ_{t=1}^T π1(ψ^{(t)}|θ^{(t)}) / π0(ψ^{(t)}),
        based on the same sample from π̃1.
  • 82–83. Bridge revival (2). Alternative identity:
        E^{π1(θ,ψ|x)}[ π0(ψ) π1(θ) f(x|θ, ψ) / {π1(θ, ψ) f(x|θ, ψ)} ] = E^{π1(θ,ψ|x)}[ π0(ψ) / π1(ψ|θ) ] = m̃1(x)/m1(x),
        which suggests using a second sample (θ̄^{(1)}, ψ̄^{(1)}, z̄^{(1)}), ..., (θ̄^{(T)}, ψ̄^{(T)}, z̄^{(T)}) ∼ π1(θ, ψ|x) and
        (1/T) Σ_{t=1}^T π0(ψ̄^{(t)}) / π1(ψ̄^{(t)}|θ̄^{(t)}).
        Resulting estimate:
        B̂01 = [ (1/T) Σ_t π̃1(θ0|x, z^{(t)}, ψ^{(t)}) / π1(θ0) ] × [ (1/T) Σ_t π0(ψ̄^{(t)}) / π1(ψ̄^{(t)}|θ̄^{(t)}) ].
  • 84. Diabetes in Pima Indian women (cont'd). Comparison of the variation of the Bayes factor approximations based on 100 replicas of 20,000 simulations, for the above importance, Chib's, Savage–Dickey's and bridge samplers. [Boxplots for IS, Chib, Savage–Dickey and Bridge; values roughly between 2.8 and 3.4.]
  • 85. Nested sampling: Goal. Skilling's (2007) technique using the one-dimensional representation
        Z = E^π[L(θ)] = ∫_0^1 ϕ(x) dx
        with ϕ^{-1}(l) = P^π(L(θ) > l). Note: ϕ(·) is intractable in most cases.
  • 86. Nested sampling: First approximation. Approximate Z by a Riemann sum,
        Ẑ = Σ_{i=1}^j (x_{i−1} − x_i) ϕ(x_i),
        where the x_i's are either deterministic, x_i = e^{−i/N}, or random, x_0 = 1, x_{i+1} = t_i x_i, t_i ∼ Be(N, 1), so that E[log x_i] = −i/N.
  • 87–89. Extraneous white noise. Take
        Z = ∫ e^{−θ} dθ = ∫ δ^{-1} e^{−(1−δ)θ} · δ e^{−δθ} dθ = E_δ[ δ^{-1} e^{−(1−δ)θ} ],
        Ẑ = Σ_{i=1}^N δ^{-1} e^{−(1−δ)θ_i} (x_{i−1} − x_i),   θ_i ∼ E(δ) I(θ_i ≤ θ_{i−1}).
        Comparison of variances and MSEs:
        N      deterministic     random
        50     4.64, 10.5        4.65, 10.5
        100    2.47, 4.9         2.48, 5.02
        500    .549, 1.01        .550, 1.14
  • 90–92. Nested sampling: Second approximation. Replace the (intractable) ϕ(x_i) by ϕ_i, obtained by nested sampling: start with N values θ1, ..., θN sampled from π; at iteration i,
        1. take ϕ_i = L(θk), where θk is the point with smallest likelihood in the current pool of θ's;
        2. replace θk with a sample from the prior constrained to L(θ) > ϕ_i: the current N points are then sampled from the prior constrained to L(θ) > ϕ_i.
  • 93. Nested sampling: Third approximation. Iterate the above steps until a given stopping iteration j is reached: e.g., observe very small changes in the approximation Ẑ; reach the maximal value of L(θ) when the likelihood is bounded and its maximum is known; or truncate the integral Z at level ε, i.e. replace ∫_0^1 ϕ(x) dx with ∫_ε^1 ϕ(x) dx.
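A self-contained R sketch of the whole scheme on a toy problem where the constrained prior can be simulated exactly (θ ∼ U(−5,5) and L(θ) the N(0,1) density, so Z ≈ 0.1); the deterministic schedule x_i = e^{−i/N} and the truncation-based stopping follow the slides, while N and the stopping level are arbitrary choices of mine.

        set.seed(5)
        N <- 500                                   # number of live points
        L <- function(th) dnorm(th)
        theta <- runif(N, -5, 5)                   # live points drawn from the prior
        Lvals <- L(theta)
        Z <- 0; x_prev <- 1
        for (i in 1:(15 * N)) {                    # run until e^{-i/N} is negligible
          k  <- which.min(Lvals)                   # point with smallest likelihood
          xi <- exp(-i / N)                        # deterministic prior-volume schedule
          Z  <- Z + (x_prev - xi) * Lvals[k]
          x_prev <- xi
          ## exact draw from the prior constrained to L(theta) > Lvals[k]:
          ## here {L > l} = (-r, r) with r = sqrt(-2 * log(l * sqrt(2*pi))) <= 5
          r <- sqrt(-2 * log(Lvals[k] * sqrt(2 * pi)))
          theta[k] <- runif(1, -r, r)
          Lvals[k] <- L(theta[k])
        }
        c(estimate = Z, truth = (pnorm(5) - pnorm(-5)) / 10)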
  • 94. Approximation error.
        Error = Ẑ − Z = Σ_{i=1}^j (x_{i−1} − x_i) ϕ_i − ∫_0^1 ϕ(x) dx
              = − ∫_0^ε ϕ(x) dx   (truncation error)
              + [ Σ_{i=1}^j (x_{i−1} − x_i) ϕ(x_i) − ∫_ε^1 ϕ(x) dx ]   (quadrature error)
              + Σ_{i=1}^j (x_{i−1} − x_i) {ϕ_i − ϕ(x_i)}   (stochastic error).
        [Dominated by the Monte Carlo term!]
  • 95–96. A CLT for the stochastic error. The (dominating) stochastic error is OP(N^{−1/2}):
        N^{1/2} {stochastic error} →D N(0, V),   V = − ∫∫_{s,t∈[ε,1]} s ϕ'(s) t ϕ'(t) log(s ∨ t) ds dt.
        [Proof based on Donsker's theorem.]
        The number of simulated points equals the number of iterations j, and is a multiple of N: if one stops at the first iteration j such that e^{−j/N} < ε, then j = N⌈− log ε⌉.
  • 97–100. Curse of dimension. For a simple Gaussian-Gaussian model of dimension dim(θ) = d, the following three quantities are O(d): (1) the asymptotic variance of the NS estimator; (2) the number of iterations (necessary to reach a given truncation error); (3) the cost of one simulated sample. Therefore, the CPU time necessary for achieving error level e is O(d³/e²).
  • 101–103. Sampling from constrained priors. Exact simulation from the constrained prior is intractable in most cases! Skilling (2007) proposes to use MCMC, but: this introduces a bias (stopping rule); and if the MCMC stationary distribution is the unconstrained prior, it becomes more and more difficult to sample points such that L(θ) > l as l increases. If it is implementable, then a slice sampler can be devised at the same cost! [Thanks, Gareth!]
• 104. On some computational methods for Bayesian model choice Nested sampling Importance variant An IS variant of nested sampling Consider an instrumental prior π̃ and likelihood L̃, the weight function w(θ) = π(θ)L(θ) / π̃(θ)L̃(θ), and the weighted NS estimator Z = Σ_{i=1}^{j} (x_{i−1} − x_i) ϕ̃_i w(θ_i). Then choose (π̃, L̃) so that sampling from π̃ constrained to L̃(θ) > l is easy; e.g. N(c, I_d) constrained to ‖c − θ‖ < r.
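A sketch under the assumptions of the slide: with instrumental prior N(c, I_d) and an instrumental likelihood whose level set {L̃(θ) > l} is the ball ‖θ − c‖ < r, the constrained-prior step reduces to exact rejection inside a ball, and the NS sum replaces ϕ_i by ϕ̃_i w(θ_i). Function names are illustrative.

```python
import numpy as np

def importance_weight(theta, log_prior, log_lik, log_prior_tilde, log_lik_tilde):
    """w(theta) = pi(theta) L(theta) / (pi~(theta) L~(theta))."""
    return np.exp(log_prior(theta) + log_lik(theta)
                  - log_prior_tilde(theta) - log_lik_tilde(theta))

def sample_instrumental_in_ball(c, r, rng=None):
    """Exact draw from the instrumental prior N(c, I_d) restricted to ||theta - c|| < r.

    With this choice of (pi~, L~), the constrained simulation needed by
    nested sampling is a simple rejection step inside a ball, hence exact.
    """
    rng = rng or np.random.default_rng()
    c = np.asarray(c, dtype=float)
    while True:
        theta = c + rng.standard_normal(c.shape)
        if np.linalg.norm(theta - c) < r:
            return theta
```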
• 106. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Benchmark: Target distribution Posterior distribution on (µ, σ) associated with the mixture pN(0, 1) + (1 − p)N(µ, σ), when p is known
• 107. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Experiment n observations with µ = 2 and σ = 3/2; use of a uniform prior both on (−2, 6) for µ and on (.001, 16) for log σ²; occurrences of posterior bursts at µ = x_i; computation of the various estimates of Z
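A sketch of the benchmark target, assuming the prior is read as stated on the slide (µ uniform on (−2, 6), log σ² uniform on (.001, 16)); the mixture weight p = 0.5 and the random seed are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

p = 0.5                                   # mixture weight, assumed known (illustrative value)
rng = np.random.default_rng(0)

def simulate_mixture(n, mu=2.0, sigma=1.5):
    """Draw n observations from p N(0,1) + (1-p) N(mu, sigma)."""
    comp = rng.uniform(size=n) < p
    return np.where(comp, rng.standard_normal(n), mu + sigma * rng.standard_normal(n))

def log_posterior(mu, sigma, y):
    """Unnormalised log-posterior on (mu, sigma): mixture likelihood times the
    uniform priors, read here as mu in (-2, 6) and log sigma^2 in (.001, 16)."""
    if not (-2 < mu < 6 and 0.001 < np.log(sigma ** 2) < 16):
        return -np.inf
    lik = p * norm.pdf(y) + (1 - p) * norm.pdf(y, loc=mu, scale=sigma)
    return np.sum(np.log(lik))
```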
• 108. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Experiment (cont'd) MCMC sample for n = 16 observations from the mixture; nested sampling sequence with M = 1000 starting points.
• 109. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Experiment (cont'd) MCMC sample for n = 50 observations from the mixture; nested sampling sequence with M = 1000 starting points.
• 110. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Comparison Monte Carlo and MCMC (= Gibbs) outputs based on T = 10⁴ simulations and numerical integration based on a 850 × 950 grid in the (µ, σ) parameter space. Nested sampling approximation based on a starting sample of M = 1000 points followed by at least 10³ further simulations from the constr'd prior and a stopping rule at 95% of the observed maximum likelihood. Constr'd prior simulation based on 50 values simulated by random walk, accepting only steps leading to a lik'hood higher than the bound.
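For reference, a hedged sketch of the grid-based value of Z over an 850 × 950 grid in (µ, log σ²). This only illustrates the Riemann-sum idea behind the numerical-integration benchmark, not the exact computation behind the reported results, and it is slow as written (shrink the grid for a quick check).

```python
import numpy as np
from scipy.stats import norm

def grid_evidence(y, p=0.5, n_mu=850, n_ls=950):
    """Grid-based reference value for Z = int f(y|theta) pi(theta) dtheta.

    Uniform priors: mu ~ U(-2, 6), log sigma^2 ~ U(.001, 16), so the prior
    density on the grid is 1 / (8 * 15.999). A Riemann sum in (mu, log sigma^2);
    a sketch only, not the exact computation used in the talk.
    """
    mus = np.linspace(-2, 6, n_mu)
    lsig2 = np.linspace(0.001, 16, n_ls)
    sigmas = np.exp(lsig2 / 2)
    prior = 1.0 / (8.0 * (16 - 0.001))
    cell = (mus[1] - mus[0]) * (lsig2[1] - lsig2[0])
    loglik = np.array([[np.sum(np.log(p * norm.pdf(y)
                                      + (1 - p) * norm.pdf(y, m, s)))
                        for s in sigmas] for m in mus])
    mx = loglik.max()
    return np.exp(mx) * np.exp(loglik - mx).sum() * cell * prior
```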
• 111. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Comparison (cont'd) [Boxplots of the estimates V1–V4] Graph based on a sample of 10 observations for µ = 2 and σ = 3/2 (150 replicas).
• 112. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Comparison (cont'd) [Boxplots of the estimates V1–V4] Graph based on a sample of 50 observations for µ = 2 and σ = 3/2 (150 replicas).
• 113. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Comparison (cont'd) [Boxplots of the estimates V1–V4] Graph based on a sample of 100 observations for µ = 2 and σ = 3/2 (150 replicas).
• 114. On some computational methods for Bayesian model choice Nested sampling A mixture comparison Comparison (cont'd) Nested sampling gets less reliable as the sample size increases. The most reliable approach is the mixture Z3, although the harmonic solution Z1 is close to Chib's solution [taken as gold standard]. The Monte Carlo method Z2 also produces poor approximations to Z (the kernel φ used in Z2 is a t non-parametric kernel estimate with standard bandwidth estimation).
• 115. On some computational methods for Bayesian model choice ABC model choice ABC method Approximate Bayesian Computation Bayesian setting: target is π(θ)f(x|θ) When likelihood f(x|θ) not in closed form, likelihood-free rejection technique: ABC algorithm For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating θ′ ∼ π(θ), x ∼ f(x|θ′), until the auxiliary variable x is equal to the observed value, x = y. [Pritchard et al., 1999]
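A minimal sketch of the exact-match rejection sampler above, only workable when x is discrete with few possible values; prior_sample and simulate are user-supplied stand-ins for π and f(·|θ), not part of the original algorithm statement.

```python
import numpy as np

def abc_exact(y_obs, prior_sample, simulate, n_accept=1000, rng=None):
    """Likelihood-free rejection sampler (exact-match version).

    Keep jointly simulating theta' ~ pi and x ~ f(.|theta') until x equals
    the observed y; accepted theta' are draws from the posterior.
    """
    rng = rng or np.random.default_rng()
    accepted = []
    while len(accepted) < n_accept:
        theta = prior_sample(rng)
        x = simulate(theta, rng)
        if np.array_equal(x, y_obs):
            accepted.append(theta)
    return np.array(accepted)
```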
  • 118. On some computational methods for Bayesian model choice ABC model choice ABC method Population genetics example Tree of ancestors in a sample of genes
• 119. On some computational methods for Bayesian model choice ABC model choice ABC method A as approximative When y is a continuous random variable, equality x = y is replaced with a tolerance condition, ρ(x, y) ≤ ε where ρ is a distance between summary statistics Output distributed from π(θ) Pθ{ρ(x, y) < ε} ∝ π(θ | ρ(x, y) < ε)
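A sketch of the tolerance version, using a Euclidean distance between summary statistics as one arbitrary choice of ρ; summary, prior_sample and simulate are illustrative stand-ins.

```python
import numpy as np

def abc_tolerance(y_obs, prior_sample, simulate, summary, eps, n_sim=100_000, rng=None):
    """Tolerance-based ABC: accept theta' when ||s(x) - s(y)|| <= eps.

    summary() maps a dataset to its summary statistics; accepted draws are
    approximately distributed from pi(theta | rho(x, y) <= eps).
    """
    rng = rng or np.random.default_rng()
    s_obs = np.asarray(summary(y_obs))
    kept = []
    for _ in range(n_sim):
        theta = prior_sample(rng)
        s_sim = np.asarray(summary(simulate(theta, rng)))
        if np.linalg.norm(s_sim - s_obs) <= eps:
            kept.append((theta, s_sim))
    return kept
```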
• 121. On some computational methods for Bayesian model choice ABC model choice ABC method ABC improvements Simulating from the prior is often poor in efficiency Either modify the proposal distribution on θ to increase the density of x's within the vicinity of y... [Marjoram et al., 2003; Bortot et al., 2007; Sisson et al., 2007] ...or view the problem as conditional density estimation and develop techniques to allow for larger ε [Beaumont et al., 2002]
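As one concrete instance of the conditional-density-estimation route, here is a hedged sketch of a local-linear regression adjustment in the spirit of Beaumont et al. (2002); the Epanechnikov-type weights and the particular least-squares formulation are illustrative choices, not necessarily those of the original paper.

```python
import numpy as np

def regression_adjust(thetas, summaries, s_obs, eps):
    """Local-linear regression adjustment in the spirit of Beaumont et al. (2002).

    Draws theta_i with simulated summaries s_i are shifted by a weighted
    linear fit of theta on (s - s_obs), so the adjusted draws behave as if
    simulated at s_obs:  theta_i* = theta_i - (s_i - s_obs) beta_hat.
    """
    thetas = np.asarray(thetas, dtype=float).reshape(len(thetas), -1)
    S = np.asarray(summaries, dtype=float) - np.asarray(s_obs, dtype=float)
    d = np.linalg.norm(S, axis=1)
    w = np.where(d < eps, 1.0 - (d / eps) ** 2, 0.0)     # Epanechnikov-type weights
    sw = np.sqrt(w)[:, None]
    X = np.column_stack([np.ones(len(S)), S])            # intercept + centred summaries
    beta = np.linalg.lstsq(X * sw, thetas * sw, rcond=None)[0]
    adjusted = thetas - S @ beta[1:]                     # remove the linear trend in s
    return adjusted, w
```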