Computational Bayesian Statistics
Christian P. Robert, Université Paris Dauphine
Frontiers of Statistical Decision Making and Bayesian Analysis,
In honour of Jim Berger,
17th of March, 2010
1 Computational basics
    The Bayesian toolbox
    Improper prior distribution
    Testing
    Monte Carlo integration
    Bayes factor approximations in perspective
The Bayesian toolbox
Bayes theorem = inversion of probabilities
If A and E are events such that P(E) ≠ 0, P(A|E) and P(E|A) are related by
$$P(A|E) = \frac{P(E|A)P(A)}{P(E|A)P(A) + P(E|A^c)P(A^c)} = \frac{P(E|A)P(A)}{P(E)}$$
New perspective
Uncertainty on the parameters θ of a model is modeled through a probability distribution π on Θ, called the prior distribution
Inference is based on the distribution of θ conditional on x, π(θ|x), called the posterior distribution
$$\pi(\theta|x) = \frac{f(x|\theta)\,\pi(\theta)}{\int f(x|\theta)\,\pi(\theta)\,d\theta}\,.$$
Bayesian model
A Bayesian statistical model is made of
1 a likelihood f(x|θ), and of
2 a prior distribution on the parameters, π(θ).
Posterior distribution, center of Bayesian inference
$$\pi(\theta|x) \propto f(x|\theta)\,\pi(\theta)$$
Operates conditional upon the observations
Integrates simultaneously prior information/knowledge and the information brought by x
Avoids averaging over the unobserved values of x
Coherent updating of the information available on θ, independent of the order in which i.i.d. observations are collected
Provides a complete inferential scope and a unique motor of inference
Improper prior distribution
Extension from a prior distribution to a prior σ-finite measure π such that
$$\int_\Theta \pi(\theta)\,d\theta = +\infty$$
Formal extension: π cannot be interpreted as a probability any longer
Validation
Extension of the posterior distribution π(θ|x) associated with an improper prior π given by Bayes's formula
$$\pi(\theta|x) = \frac{f(x|\theta)\,\pi(\theta)}{\int_\Theta f(x|\theta)\,\pi(\theta)\,d\theta}\,,$$
when
$$\int_\Theta f(x|\theta)\,\pi(\theta)\,d\theta < \infty$$
Testing hypotheses
Deciding about the validity of assumptions or restrictions on the parameter θ from the data, represented as
$$H_0: \theta \in \Theta_0 \quad\text{versus}\quad H_1: \theta \notin \Theta_0$$
Binary outcome of the decision process: accept [coded by 1] or reject [coded by 0],
$$\mathcal{D} = \{0, 1\}$$
Bayesian solution formally very close to a likelihood ratio test statistic, but numerical values often strongly differ from classical solutions
Bayes factor
Bayesian testing procedure depends on P^π(θ ∈ Θ_0 | x) or alternatively on the Bayes factor
$$B_{10}^\pi = \frac{P^\pi(\theta\in\Theta_1|x)\big/P^\pi(\theta\in\Theta_0|x)}{P^\pi(\theta\in\Theta_1)\big/P^\pi(\theta\in\Theta_0)} = \frac{\int f(x|\theta_1)\,\pi_1(\theta_1)\,d\theta_1}{\int f(x|\theta_0)\,\pi_0(\theta_0)\,d\theta_0} = \frac{m_1(x)}{m_0(x)}$$
[Akin to a likelihood ratio]
Banning improper priors
Impossibility of using improper priors for testing!
Reason: when using the representation
$$\pi(\theta) = P^\pi(\theta\in\Theta_1)\times\pi_1(\theta) + P^\pi(\theta\in\Theta_0)\times\pi_0(\theta)\,,$$
π_1 and π_0 must be normalised
Fundamental integration issue
Generic problem of evaluating an integral
$$\mathfrak{I} = \mathbb{E}_f[h(X)] = \int_{\mathcal{X}} h(x)\,f(x)\,dx$$
where X is uni- or multidimensional, f is a closed-form, partly closed-form, or implicit density, and h is a function
Monte Carlo Principle
Use a sample (x_1, ..., x_m) from the density f to approximate the integral I by the empirical average
$$\overline{h}_m = \frac{1}{m}\sum_{j=1}^{m} h(x_j)$$
Convergence of the average,
$$\overline{h}_m \longrightarrow \mathbb{E}_f[h(X)]\,,$$
by the Strong Law of Large Numbers
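A minimal R sketch of this principle (not part of the original slides; the choices f = N(0,1), h(x) = x², and m = 10⁵ are purely illustrative):

# Monte Carlo approximation of E_f[h(X)] for h(x) = x^2 under f = N(0,1),
# whose exact value is 1
m <- 1e5
x <- rnorm(m)          # sample x_1, ..., x_m from f
hbar <- mean(x^2)      # empirical average approximating E_f[h(X)]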
e.g., Bayes factor approximation
For the normal case
$$x_1, \ldots, x_n \sim \mathcal{N}(\mu+\xi, \sigma^2)\,,\qquad y_1, \ldots, y_n \sim \mathcal{N}(\mu-\xi, \sigma^2)\,,$$
and H_0: ξ = 0, under the prior
$$\pi(\mu, \sigma^2) = 1/\sigma^2 \quad\text{and}\quad \xi \sim \mathcal{N}(0, 1)\,,$$
$$B_{01}^\pi = \frac{\left[(\bar x - \bar y)^2 + S^2\right]^{-n+1/2}}{\displaystyle\int \left[(\bar x - \bar y - 2\xi)^2 + S^2\right]^{-n+1/2} e^{-\xi^2/2}\,d\xi\big/\sqrt{2\pi}}$$
Example (CMBdata)
Simulate ξ_1, ..., ξ_{1000} ~ N(0, 1) and approximate B_{01}^π with
$$\hat B_{01}^\pi = \frac{\left[(\bar x - \bar y)^2 + S^2\right]^{-n+1/2}}{\frac{1}{1000}\sum_{i=1}^{1000}\left[(\bar x - \bar y - 2\xi_i)^2 + S^2\right]^{-n+1/2}} = 89.9$$
when x̄ = 0.0888, ȳ = 0.1078, S² = 0.00875
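A hedged R sketch of this approximation (the per-group sample size n is not stated on the slide, so the value below is an arbitrary stand-in and does not reproduce the 89.9):

xbar <- 0.0888; ybar <- 0.1078; S2 <- 0.00875
n <- 100                                   # assumption: n not given above
xi <- rnorm(1000)                          # xi_1, ..., xi_1000 ~ N(0, 1)
num <- ((xbar - ybar)^2 + S2)^(-n + 1/2)
den <- mean(((xbar - ybar - 2 * xi)^2 + S2)^(-n + 1/2))
B01 <- num / den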
Precision evaluation
Estimate the variance with
$$v_m = \frac{1}{m}\,\frac{1}{m-1}\sum_{j=1}^{m}\left[h(x_j)-\overline{h}_m\right]^2\,,$$
and for m large,
$$\left(\overline{h}_m - \mathbb{E}_f[h(X)]\right)\big/\sqrt{v_m} \approx \mathcal{N}(0, 1)\,.$$
Note: allows the construction of a convergence test and of confidence bounds on the approximation of E_f[h(X)]
[Figure: histogram of Monte Carlo approximations of the Bayes factor]
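An R sketch of these confidence bounds (again with illustrative choices f = N(0,1) and h(x) = x²):

m <- 1e4
x <- rnorm(m)                             # sample from f
h <- x^2                                  # h(x) = x^2, so E_f[h(X)] = 1
hbar <- mean(h)
vm <- var(h) / m                          # v_m = sample variance of h(x_j) over m
ci <- hbar + c(-1, 1) * 1.96 * sqrt(vm)   # approximate 95% bounds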
Example (Cauchy-normal)
For estimating a normal mean, a robust prior is a Cauchy prior:
$$x \sim \mathcal{N}(\theta, 1)\,,\qquad \theta \sim \mathcal{C}(0, 1)\,.$$
Under squared error loss, the posterior mean is
$$\delta^\pi(x) = \int_{-\infty}^{\infty} \frac{\theta}{1+\theta^2}\,e^{-(x-\theta)^2/2}\,d\theta \bigg/ \int_{-\infty}^{\infty} \frac{1}{1+\theta^2}\,e^{-(x-\theta)^2/2}\,d\theta$$
Example (Cauchy-normal (2))
The form of δ^π suggests simulating iid variables θ_1, ..., θ_m ~ N(x, 1) and calculating
$$\hat\delta^\pi_m(x) = \sum_{i=1}^{m} \frac{\theta_i}{1+\theta_i^2}\bigg/\sum_{i=1}^{m} \frac{1}{1+\theta_i^2}\,.$$
The LLN implies
$$\hat\delta^\pi_m(x) \longrightarrow \delta^\pi(x)\quad\text{as}\quad m \longrightarrow \infty\,.$$
[Figure: convergence of the estimate over 1000 iterations]
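An R sketch of this estimator (the observation x = 10 is an illustrative value, matching the range of the slide's figure):

# Monte Carlo estimate of the Cauchy-normal posterior mean
delta_hat <- function(x, m = 1000) {
  theta <- rnorm(m, mean = x)               # theta_i ~ N(x, 1)
  sum(theta / (1 + theta^2)) / sum(1 / (1 + theta^2))
}
delta_hat(10)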
Importance sampling
Simulation from f (the true density) is not necessarily optimal.
An alternative to direct sampling from f is importance sampling, based on the alternative representation
$$\mathbb{E}_f[h(X)] = \int_{\mathcal{X}} h(x)\,\frac{f(x)}{g(x)}\,g(x)\,dx = \mathbb{E}_g\!\left[h(X)\,\frac{f(X)}{g(X)}\right],$$
which allows us to use distributions other than f
Importance sampling (cont'd)
Importance sampling algorithm for the evaluation of
$$\mathbb{E}_f[h(X)] = \int_{\mathcal{X}} h(x)\,f(x)\,dx\,:$$
1 Generate a sample x_1, ..., x_m from a distribution g
2 Use the approximation
$$\frac{1}{m}\sum_{j=1}^{m} \frac{f(x_j)}{g(x_j)}\,h(x_j)$$
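A minimal R sketch of this algorithm (illustrative choices: target f = N(0,1), h(x) = x², instrumental g = Student t with 3 degrees of freedom, whose tails are heavier than those of f):

m <- 1e5
x <- rt(m, df = 3)                # step 1: sample from g
w <- dnorm(x) / dt(x, df = 3)     # importance weights f(x_j)/g(x_j)
is_est <- mean(w * x^2)           # step 2: approximates E_f[X^2] = 1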
Justification
Convergence of the estimator:
$$\frac{1}{m}\sum_{j=1}^{m} \frac{f(x_j)}{g(x_j)}\,h(x_j) \longrightarrow \mathbb{E}_f[h(X)]$$
1 converges for any choice of the distribution g, as long as supp(g) ⊃ supp(f)
2 instrumental distribution g chosen among distributions easy to simulate
3 the same sample (generated from g) can be used repeatedly, not only for different functions h, but also for different densities f
Choice of importance function
g can be any density, but some choices are better than others:
1 Finite variance only when
$$\mathbb{E}_f\!\left[h^2(X)\,\frac{f(X)}{g(X)}\right] = \int_{\mathcal{X}} h^2(x)\,\frac{f^2(x)}{g(x)}\,dx < \infty\,.$$
2 Instrumental distributions with tails lighter than those of f (that is, with sup f/g = ∞) are not appropriate, because the weights f(x_j)/g(x_j) vary widely, giving too much importance to a few values x_j.
3 If sup f/g = M < ∞, the accept-reject algorithm can be used as well to simulate from f directly.
4 IS suffers from the curse of dimensionality
Example (Cauchy target)
Case of a Cauchy C(0, 1) target when the importance function is Gaussian N(0, 1).
The density ratio
$$\varrho(x) = \frac{p^\star(x)}{p_0(x)} = \sqrt{2\pi}\,\frac{\exp(x^2/2)}{\pi\,(1+x^2)}$$
is very badly behaved: e.g.,
$$\int_{-\infty}^{\infty} \varrho(x)^2\,p_0(x)\,dx = \infty\,.$$
Poor performance of the associated importance sampling estimator
[Figure: erratic trajectory of the IS estimator over 10,000 iterations]
Practical alternative
$$\sum_{j=1}^{m} h(x_j)\,\frac{f(x_j)}{g(x_j)}\bigg/\sum_{j=1}^{m} \frac{f(x_j)}{g(x_j)}\,,$$
where f and g are known up to constants.
1 Also converges to I by the Strong Law of Large Numbers.
2 Biased, but the bias is quite small: may beat the unbiased estimator in squared error loss.
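An R sketch of this self-normalised version (same illustrative target and instrumental densities as before, used here through unnormalised expressions only):

m <- 1e5
x <- rt(m, df = 3)                  # sample from g
fu <- exp(-x^2 / 2)                 # unnormalised N(0,1) target
gu <- (1 + x^2 / 3)^(-2)            # unnormalised t_3 density
w <- fu / gu
sn_est <- sum(w * x^2) / sum(w)     # self-normalised estimate of E_f[X^2]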
Computational alternatives for evidence
Bayesian model choice and hypothesis testing rely on a similar quantity, the evidence
$$Z_k = \int_{\Theta_k} \pi_k(\theta_k)\,L_k(\theta_k)\,d\theta_k\,,\qquad k = 1, 2\,,$$
aka the marginal likelihood.
[Jeffreys, 1939]
Importance sampling
When approximating the Bayes factor
$$B_{01} = \frac{\int_{\Theta_0} f_0(x|\theta_0)\,\pi_0(\theta_0)\,d\theta_0}{\int_{\Theta_1} f_1(x|\theta_1)\,\pi_1(\theta_1)\,d\theta_1}\,,$$
simultaneous use of importance functions ϖ_0 and ϖ_1 and
$$\hat B_{01} = \frac{n_0^{-1}\sum_{i=1}^{n_0} f_0(x|\theta_0^i)\,\pi_0(\theta_0^i)\big/\varpi_0(\theta_0^i)}{n_1^{-1}\sum_{i=1}^{n_1} f_1(x|\theta_1^i)\,\pi_1(\theta_1^i)\big/\varpi_1(\theta_1^i)}$$
Diabetes in Pima Indian women
Example (R benchmark)
"A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix (AZ), was tested for diabetes according to WHO criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases."
200 Pima Indian women with observed variables:
  plasma glucose concentration in oral glucose tolerance test
  diastolic blood pressure
  diabetes pedigree function
  presence/absence of diabetes
Probit modelling on Pima Indian women
Probability of diabetes as a function of the above variables,
$$\mathbb{P}(y = 1|x) = \Phi(x_1\beta_1 + x_2\beta_2 + x_3\beta_3)\,.$$
Test of H_0: β_3 = 0 for the 200 observations of Pima.tr, based on a g-prior modelling:
$$\beta \sim \mathcal{N}_3\!\left(0,\ n\,(X^TX)^{-1}\right)$$
Importance sampling for the Pima Indian dataset
Use of an importance function inspired by the MLE distribution,
$$\beta \sim \mathcal{N}(\hat\beta, \hat\Sigma)$$
R importance sampling code [rmvnorm requires the mvtnorm package; probitlpost and dmvlnorm are helper functions from the accompanying codebase; the definition of model2 is restored here by analogy with model1]:
library(mvtnorm)
model1=summary(glm(y~-1+X1,family=binomial(link="probit")))
model2=summary(glm(y~-1+X2,family=binomial(link="probit")))
is1=rmvnorm(Niter,mean=model1$coeff[,1],sigma=2*model1$cov.unscaled)
is2=rmvnorm(Niter,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)
bfis=mean(exp(probitlpost(is1,y,X1)-dmvlnorm(is1,mean=model1$coeff[,1],
  sigma=2*model1$cov.unscaled)))/mean(exp(probitlpost(is2,y,X2)-
  dmvlnorm(is2,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)))
Diabetes in Pima Indian women (cont'd)
Comparison of the variation of the Bayes factor approximations based on 100 replicas of 20,000 simulations each, from the prior and from the above MLE importance sampler
[Figure: boxplots of the basic Monte Carlo and importance sampling approximations]
Optimal bridge sampling
The optimal choice of the auxiliary function α is
$$\alpha^\star = \frac{n_1 + n_2}{n_1\,\pi_1(\theta|x) + n_2\,\pi_2(\theta|x)}\,,$$
leading to
$$B_{12} \approx \frac{\displaystyle\frac{1}{n_1}\sum_{i=1}^{n_1} \frac{\tilde\pi_2(\theta_{1i}|x)}{n_1\,\pi_1(\theta_{1i}|x) + n_2\,\pi_2(\theta_{1i}|x)}}{\displaystyle\frac{1}{n_2}\sum_{i=1}^{n_2} \frac{\tilde\pi_1(\theta_{2i}|x)}{n_1\,\pi_1(\theta_{2i}|x) + n_2\,\pi_2(\theta_{2i}|x)}}$$
Back later!
Extension to varying dimensions
When dim(Θ_1) ≠ dim(Θ_2), e.g. θ_2 = (θ_1, ψ), introduction of a pseudo-posterior density ω(ψ|θ_1, x), augmenting π_1(θ_1|x) into the joint distribution
$$\pi_1(\theta_1|x) \times \omega(\psi|\theta_1, x)$$
on Θ_2, so that
$$\tilde B_{12} = \frac{\int \tilde\pi_1(\theta_1|x)\,\omega(\psi|\theta_1, x)\;\alpha(\theta_1, \psi)\;\pi_2(\theta_1, \psi|x)\,d\theta_1\,d\psi}{\int \tilde\pi_2(\theta_1, \psi|x)\;\alpha(\theta_1, \psi)\;\pi_1(\theta_1|x)\,\omega(\psi|\theta_1, x)\,d\theta_1\,d\psi} = \mathbb{E}_{\pi_2}\!\left[\frac{\tilde\pi_1(\theta_1)\,\omega(\psi|\theta_1)}{\tilde\pi_2(\theta_1, \psi)}\right] = \frac{\mathbb{E}_\varphi\!\left[\tilde\pi_1(\theta_1)\,\omega(\psi|\theta_1)\big/\varphi(\theta_1, \psi)\right]}{\mathbb{E}_\varphi\!\left[\tilde\pi_2(\theta_1, \psi)\big/\varphi(\theta_1, \psi)\right]}$$
for any conditional density ω(ψ|θ_1) and any joint density ϕ.
Illustration for the Pima Indian dataset
Use of the MLE-induced conditional of β_3 given (β_1, β_2) as a pseudo-posterior, and a mixture of both MLE approximations on β_3 in the bridge sampling estimate
R bridge sampling code [ginv requires MASS and dmvnorm requires mvtnorm; hmprobit, meanw, and probitlpost are helper functions from the accompanying codebase]:
library(MASS)
library(mvtnorm)
cova=model2$cov.unscaled
expecta=model2$coeff[,1]
covw=cova[3,3]-t(cova[1:2,3])%*%ginv(cova[1:2,1:2])%*%cova[1:2,3]
probit1=hmprobit(Niter,y,X1)
probit2=hmprobit(Niter,y,X2)
pseudo=rnorm(Niter,meanw(probit1),sqrt(covw))
probit1p=cbind(probit1,pseudo)
bfbs=mean(exp(probitlpost(probit2[,1:2],y,X1)+dnorm(probit2[,3],meanw(probit2[,1:2]),
  sqrt(covw),log=T))/(dmvnorm(probit2,expecta,cova)+dnorm(probit2[,3],expecta[3],
  cova[3,3])))/mean(exp(probitlpost(probit1p,y,X2))/(dmvnorm(probit1p,expecta,cova)+
  dnorm(pseudo,expecta[3],cova[3,3])))
Diabetes in Pima Indian women (cont'd)
Comparison of the variation of the Bayes factor approximations based on 100 × 20,000 simulations from the prior (MC), the above bridge sampler, and the above importance sampler
The original harmonic mean estimator
When θ_k^{(t)} ~ π_k(θ|x),
$$\frac{1}{T}\sum_{t=1}^{T} \frac{1}{L(\theta_k^{(t)}|x)}$$
is an unbiased estimator of 1/m_k(x)
[Newton & Raftery, 1994]
Highly dangerous: most often leads to an infinite variance!!!
Approximating Z_k from a posterior sample
Use of the [harmonic mean] identity
$$\mathbb{E}^{\pi_k}\!\left[\frac{\varphi(\theta_k)}{\pi_k(\theta_k)\,L_k(\theta_k)}\,\bigg|\,x\right] = \int \frac{\varphi(\theta_k)}{\pi_k(\theta_k)\,L_k(\theta_k)}\,\frac{\pi_k(\theta_k)\,L_k(\theta_k)}{Z_k}\,d\theta_k = \frac{1}{Z_k}\,,$$
which holds no matter what the proposal ϕ(·) is.
[Gelfand & Dey, 1994; Bartolucci et al., 2006]
Direct exploitation of the Monte Carlo output
Comparison with regular importance sampling
Harmonic mean: constraint opposed to the usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than π_k(θ_k) L_k(θ_k) for the approximation
$$\hat Z_k = 1\bigg/\frac{1}{T}\sum_{t=1}^{T} \frac{\varphi(\theta_k^{(t)})}{\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)})}$$
to have a finite variance.
E.g., use finite-support kernels (like Epanechnikov's kernel) for ϕ
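A toy R sketch of this estimator (assumed example: x ~ N(θ, 1) with θ ~ N(0, 1), where the evidence Z is known in closed form; a Gaussian ϕ fitted to the posterior draws is used for simplicity, although a finite-support kernel would be safer, as noted above):

x <- 1; T <- 1e5
theta <- rnorm(T, x / 2, sqrt(1 / 2))          # exact posterior draws
phi <- dnorm(theta, mean(theta), sd(theta))    # fitted proposal density
Zhat <- 1 / mean(phi / (dnorm(theta) * dnorm(x, theta)))
c(Zhat, dnorm(x, 0, sqrt(2)))                  # estimate vs exact evidence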
HPD indicator as ϕ
Use the convex hull of the Monte Carlo simulations corresponding to the 10% HPD region (easily derived!) and ϕ as indicator:
$$\varphi(\theta) = \frac{10}{T}\sum_{t\in\mathrm{HPD}} \mathbb{I}_{d(\theta,\theta^{(t)})\leq\epsilon}$$
Diabetes in Pima Indian women (cont'd)
Comparison of the variation of the Bayes factor approximations based on 100 replicas of 20,000 simulations each, from the above harmonic mean sampler and importance samplers
[Figure: boxplots of the harmonic mean and importance sampling approximations, all lying within (3.102, 3.116)]
Chib's representation
Direct application of Bayes' theorem: given x ~ f_k(x|θ_k) and θ_k ~ π_k(θ_k),
$$Z_k = m_k(x) = \frac{f_k(x|\theta_k)\,\pi_k(\theta_k)}{\pi_k(\theta_k|x)}$$
Use of an approximation to the posterior:
$$\hat Z_k = \hat m_k(x) = \frac{f_k(x|\theta_k^*)\,\pi_k(\theta_k^*)}{\hat\pi_k(\theta_k^*|x)}\,.$$
Case of latent variables
For missing variables z, as in mixture models, natural Rao-Blackwell estimate
$$\hat\pi_k(\theta_k^*|x) = \frac{1}{T}\sum_{t=1}^{T} \pi_k(\theta_k^*|x, z_k^{(t)})\,,$$
where the z_k^{(t)}'s are latent variables sampled from π(z|x) (often by Gibbs sampling)
Case of the probit model
For the completion by z, hidden normal variable,
$$\hat\pi(\theta|x) = \frac{1}{T}\sum_{t} \pi(\theta|x, z^{(t)})$$
is a simple average of normal densities
R code for Chib's approximation [gibbsprobit, dmvlnorm, and probitlpost are helper functions from the accompanying codebase]:
gibbs1=gibbsprobit(Niter,y,X1)
gibbs2=gibbsprobit(Niter,y,X2)
bfchi=mean(exp(dmvlnorm(t(t(gibbs2$mu)-model2$coeff[,1]),mean=rep(0,3),
  sigma=gibbs2$Sigma2)-probitlpost(model2$coeff[,1],y,X2)))/
  mean(exp(dmvlnorm(t(t(gibbs1$mu)-model1$coeff[,1]),mean=rep(0,2),
  sigma=gibbs1$Sigma2)-probitlpost(model1$coeff[,1],y,X1)))
Diabetes in Pima Indian women (cont'd)
Comparison of the variation of the Bayes factor approximations based on 100 replicas of 20,000 simulations each, from the above Chib's and importance samplers
2 Regression and variable selection
    Regression
    Zellner's informative G-prior
    Zellner's noninformative G-prior
    Markov Chain Monte Carlo Methods
    Variable selection
Regressors and response
Variable of primary interest, y, called the response or outcome variable [assumed here to be continuous]
E.g., number of Pine processionary caterpillar colonies
Covariates x = (x_1, ..., x_k) called explanatory variables [may be discrete, continuous, or both]
Distribution of y given x typically studied in the context of a set of units or experimental subjects, i = 1, ..., n, for instance patients in a hospital ward, on which both y_i and x_{i1}, ..., x_{ik} are measured.
Regressors and response (cont'd)
Dataset made of the conjunction of the vector of outcomes
$$y = (y_1, \ldots, y_n)$$
and of the n × (k+1) matrix of explanatory variables
$$X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ 1 & x_{31} & x_{32} & \cdots & x_{3k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix}$$
Linear models
Ordinary normal linear regression model such that
$$y\,|\,\beta, \sigma^2, X \sim \mathcal{N}_n(X\beta, \sigma^2 I_n)$$
and thus
$$\mathbb{E}[y_i|\beta, X] = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_k x_{ik}\,,\qquad \mathbb{V}(y_i|\sigma^2, X) = \sigma^2$$
Pine processionary caterpillars
Pine processionary caterpillar colony size influenced by
x1 altitude
x2 slope (in degrees)
x3 number of pines in the area
x4 height of the central tree
x5 diameter of the central tree
x6 index of the settlement density
x7 orientation of the area (from 1 [southbound] to 2)
x8 height of the dominant tree
x9 number of vegetation strata
x10 mix settlement index (from 1 if not mixed to 2 if mixed)
[Figure: scatterplots of colony size against x1, x2, x3]
Ibex horn growth
Growth of the Ibex horn size S against age A:
$$\log(S) = \alpha + \beta\,\log\!\left(\frac{A}{1+A}\right) + \epsilon$$
[Figure: log(Size) plotted against log(A/(1+A))]
Likelihood function & estimator
The likelihood of the ordinary normal linear model is
$$\ell\left(\beta, \sigma^2|y, X\right) = \left(2\pi\sigma^2\right)^{-n/2} \exp\!\left[-\frac{1}{2\sigma^2}(y - X\beta)^T(y - X\beta)\right]$$
The MLE of β is the solution of the least squares minimisation problem
$$\min_\beta\,(y - X\beta)^T(y - X\beta) = \min_\beta\,\sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_{i1} - \ldots - \beta_k x_{ik}\right)^2\,,$$
namely
$$\hat\beta = (X^TX)^{-1}X^Ty$$
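An R sketch of this least squares solution on simulated data (all values illustrative):

n <- 50; k <- 2
X <- cbind(1, matrix(rnorm(n * k), n, k))   # n x (k+1) design matrix
beta_true <- c(1, 2, -1)
y <- X %*% beta_true + rnorm(n)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X^T X)^{-1} X^T y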
Zellner's informative G-prior
Constraint: allow the experimenter to introduce information about the location parameter of the regression while bypassing the most difficult aspects of the prior specification, namely the derivation of the prior correlation structure.
Zellner's prior corresponds to
$$\beta\,|\,\sigma^2, X \sim \mathcal{N}_{k+1}\!\left(\tilde\beta,\ c\sigma^2(X^TX)^{-1}\right),\qquad \sigma^2 \sim \pi(\sigma^2|X) \propto \sigma^{-2}\,.$$
[Special conjugate]
Prior selection
Experimental prior determination restricted to the choices of β̃ and of the constant c.
Note: c can be interpreted as a measure of the amount of information available in the prior relative to the sample. For instance, setting 1/c = 0.5 gives the prior the same weight as 50% of the sample.
There still is a lasting influence of the factor c
Posterior structure
With this prior model, the posterior simplifies into
$$\pi(\beta, \sigma^2|y, X) \propto f(y|\beta, \sigma^2, X)\,\pi(\beta, \sigma^2|X)$$
$$\propto \left(\sigma^2\right)^{-(n/2+1)} \exp\!\left[-\frac{1}{2\sigma^2}(y - X\hat\beta)^T(y - X\hat\beta) - \frac{1}{2\sigma^2}(\beta - \hat\beta)^TX^TX(\beta - \hat\beta)\right]$$
$$\times \left(\sigma^2\right)^{-(k+1)/2} \exp\!\left[-\frac{1}{2c\sigma^2}(\beta - \tilde\beta)^TX^TX(\beta - \tilde\beta)\right],$$
because X^TX is used in both prior and likelihood
[G-prior trick]
Posterior structure (cont'd)
Therefore,
$$\beta\,|\,\sigma^2, y, X \sim \mathcal{N}_{k+1}\!\left(\frac{c}{c+1}\left(\tilde\beta/c + \hat\beta\right),\ \frac{\sigma^2 c}{c+1}(X^TX)^{-1}\right)$$
$$\sigma^2\,|\,y, X \sim \mathcal{IG}\!\left(\frac{n}{2},\ \frac{s^2}{2} + \frac{1}{2(c+1)}(\tilde\beta - \hat\beta)^TX^TX(\tilde\beta - \hat\beta)\right)$$
and
$$\beta\,|\,y, X \sim \mathcal{T}_{k+1}\!\left(n,\ \frac{c}{c+1}\left(\frac{\tilde\beta}{c} + \hat\beta\right),\ \frac{c\left(s^2 + (\tilde\beta - \hat\beta)^TX^TX(\tilde\beta - \hat\beta)/(c+1)\right)}{n(c+1)}\,(X^TX)^{-1}\right).$$
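Since both conditionals above are standard, exact simulation from this posterior is direct. A hedged R sketch (assumptions: y, X, β̃, and c supplied by the user; s² is the residual sum of squares):

library(mvtnorm)
post_draw <- function(y, X, beta_tilde, c) {
  XtX <- crossprod(X)
  beta_hat <- solve(XtX, crossprod(X, y))
  n <- length(y)
  s2 <- sum((y - X %*% beta_hat)^2)
  d <- drop(t(beta_tilde - beta_hat) %*% XtX %*% (beta_tilde - beta_hat))
  sig2 <- 1 / rgamma(1, n / 2, rate = s2 / 2 + d / (2 * (c + 1)))  # IG draw
  mu <- as.vector(c / (c + 1) * (beta_tilde / c + beta_hat))
  beta <- rmvnorm(1, mu, sig2 * c / (c + 1) * solve(XtX))
  list(beta = beta, sig2 = sig2)
}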
Credible regions
Highest posterior density (HPD) regions on subvectors of the parameter β derived from the marginal posterior distribution of β.
For a single parameter,
$$\beta_i\,|\,y, X \sim \mathcal{T}_1\!\left(n,\ \frac{c}{c+1}\left(\frac{\tilde\beta_i}{c} + \hat\beta_i\right),\ \frac{c\left(s^2 + (\tilde\beta - \hat\beta)^TX^TX(\tilde\beta - \hat\beta)/(c+1)\right)}{n(c+1)}\,\omega_{(i,i)}\right),$$
where ω_{(i,i)} is the (i, i)-th element of the matrix (X^TX)^{-1}.
T marginal
Marginal distribution of y is a multivariate t distribution.
Proof. Since β|σ², X ~ N_{k+1}(β̃, cσ²(X^TX)^{-1}),
$$X\beta\,|\,\sigma^2, X \sim \mathcal{N}_n\!\left(X\tilde\beta,\ c\sigma^2 X(X^TX)^{-1}X^T\right),$$
which implies that
$$y\,|\,\sigma^2, X \sim \mathcal{N}_n\!\left(X\tilde\beta,\ \sigma^2(I_n + cX(X^TX)^{-1}X^T)\right).$$
Integrating in σ² yields
$$f(y|X) = (c+1)^{-(k+1)/2}\,\pi^{-n/2}\,\Gamma(n/2)\left[y^Ty - \frac{c}{c+1}\,y^TX(X^TX)^{-1}X^Ty - \frac{1}{c+1}\,\tilde\beta^TX^TX\tilde\beta\right]^{-n/2}.$$
Point null hypothesis
If a null hypothesis is H_0: Rβ = r, the model under H_0 can be rewritten as
$$y\,|\,\beta^0, \sigma^2, X_0 \overset{H_0}{\sim} \mathcal{N}_n\!\left(X_0\beta^0, \sigma^2 I_n\right)$$
where β^0 is (k + 1 − q)-dimensional.
Point null marginal
Under the prior
$$\beta^0\,|\,X_0, \sigma^2 \sim \mathcal{N}_{k+1-q}\!\left(\tilde\beta^0,\ c_0\sigma^2(X_0^TX_0)^{-1}\right),$$
the marginal distribution of y under H_0 is
$$f(y|X_0, H_0) = (c_0+1)^{-(k+1-q)/2}\,\pi^{-n/2}\,\Gamma(n/2)\left[y^Ty - \frac{c_0}{c_0+1}\,y^TX_0(X_0^TX_0)^{-1}X_0^Ty - \frac{1}{c_0+1}\,\tilde\beta_0^TX_0^TX_0\tilde\beta_0\right]^{-n/2}.$$
Bayes factor
Therefore the Bayes factor is available in closed form:
$$B_{10}^\pi = \frac{f(y|X, H_1)}{f(y|X_0, H_0)} = \frac{(c_0+1)^{(k+1-q)/2}}{(c+1)^{(k+1)/2}}\left[\frac{y^Ty - \frac{c_0}{c_0+1}\,y^TX_0(X_0^TX_0)^{-1}X_0^Ty - \frac{1}{c_0+1}\,\tilde\beta_0^TX_0^TX_0\tilde\beta_0}{y^Ty - \frac{c}{c+1}\,y^TX(X^TX)^{-1}X^Ty - \frac{1}{c+1}\,\tilde\beta^TX^TX\tilde\beta}\right]^{n/2}$$
Means using the same σ² in both models
Problem: still depends on the choice of (c_0, c)
Zellner's noninformative G-prior
Difference with the informative G-prior setup is that we now consider c as unknown (relief!)
Solution: use the same G-prior distribution with β̃ = 0_{k+1}, conditional on c, and introduce a diffuse prior on c,
$$\pi(c) = c^{-1}\,\mathbb{I}_{\mathbb{N}^*}(c)\,.$$
Posterior distribution
Corresponding marginal posterior on the parameters of interest
$$\pi(\beta, \sigma^2|y, X) = \int \pi(\beta, \sigma^2|y, X, c)\,\pi(c|y, X)\,dc \propto \sum_{c=1}^{\infty} \pi(\beta, \sigma^2|y, X, c)\,f(y|X, c)\,c^{-1}\,,$$
where
$$f(y|X, c) \propto (c+1)^{-(k+1)/2}\left[y^Ty - \frac{c}{c+1}\,y^TX(X^TX)^{-1}X^Ty\right]^{-n/2}.$$
Posterior means
The Bayes estimates of β and σ² are given by
$$\mathbb{E}^\pi[\beta|y, X] = \mathbb{E}^\pi\!\left[\mathbb{E}^\pi(\beta|y, X, c)\,|\,y, X\right] = \mathbb{E}^\pi\!\left[\frac{c}{c+1}\,\hat\beta\,\Big|\,y, X\right] = \frac{\sum_{c=1}^{\infty} \frac{c}{c+1}\,f(y|X, c)\,c^{-1}}{\sum_{c=1}^{\infty} f(y|X, c)\,c^{-1}}\ \hat\beta$$
and
$$\mathbb{E}^\pi[\sigma^2|y, X] = \frac{\sum_{c=1}^{\infty} \frac{s^2 + \hat\beta^TX^TX\hat\beta/(c+1)}{n-2}\,f(y|X, c)\,c^{-1}}{\sum_{c=1}^{\infty} f(y|X, c)\,c^{-1}}\,.$$
Marginal distribution
Important point: the marginal distribution of the dataset is available in closed form
$$f(y|X) \propto \sum_{c=1}^{\infty} c^{-1}(c+1)^{-(k+1)/2}\left[y^Ty - \frac{c}{c+1}\,y^TX(X^TX)^{-1}X^Ty\right]^{-n/2}$$
T-shape means the normalising constant can be computed too.
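An R sketch evaluating this series by truncation (the bound cmax is an arbitrary assumption; the result is a log marginal up to the omitted normalising constant):

log_marginal <- function(y, X, cmax = 1e5) {
  n <- length(y); k <- ncol(X) - 1
  P <- X %*% solve(crossprod(X), t(X))       # X (X^T X)^{-1} X^T
  yy <- sum(y^2); yPy <- drop(t(y) %*% P %*% y)
  cs <- 1:cmax
  terms <- -log(cs) - (k + 1) / 2 * log(cs + 1) -
    (n / 2) * log(yy - cs / (cs + 1) * yPy)
  max(terms) + log(sum(exp(terms - max(terms))))   # stable log-sum-exp
}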
Point null hypothesis
For the null hypothesis H_0: Rβ = r, the model under H_0 can be rewritten as
$$y\,|\,\beta^0, \sigma^2, X_0 \overset{H_0}{\sim} \mathcal{N}_n\!\left(X_0\beta^0, \sigma^2 I_n\right)$$
where β^0 is (k + 1 − q)-dimensional.
Point null marginal
Under the prior
$$\beta^0\,|\,X_0, \sigma^2, c \sim \mathcal{N}_{k+1-q}\!\left(0_{k+1-q},\ c\sigma^2(X_0^TX_0)^{-1}\right)$$
and π(c) = 1/c, the marginal distribution of y under H_0 is
$$f(y|X_0, H_0) \propto \sum_{c=1}^{\infty} c^{-1}(c+1)^{-(k+1-q)/2}\left[y^Ty - \frac{c}{c+1}\,y^TX_0(X_0^TX_0)^{-1}X_0^Ty\right]^{-n/2}.$$
The Bayes factor B_{10}^π = f(y|X)/f(y|X_0, H_0) can then be computed
Markov Chain Monte Carlo Methods
Complexity of most models encountered in Bayesian modelling
Standard simulation methods are not a good enough solution
New technique at the core of Bayesian computing, based on Markov chains
Markov chains
A process (θ^{(t)})_{t∈N} is a homogeneous Markov chain if the distribution of θ^{(t)} given the past (θ^{(0)}, ..., θ^{(t−1)})
1 only depends on θ^{(t−1)}
2 is the same for all t ∈ N*.
Algorithms based on Markov chains
Idea: simulate from a posterior density π(·|x) [or any density] by producing a Markov chain
$$(\theta^{(t)})_{t\in\mathbb{N}}$$
whose stationary distribution is π(·|x)
Translation: for t large enough, θ^{(t)} is approximately distributed from π(θ|x), no matter what the starting value θ^{(0)} is [Ergodicity].
Convergence
If an algorithm that generates such a chain can be constructed, the ergodic theorem guarantees that, in almost all settings, the average
$$\frac{1}{T}\sum_{t=1}^{T} g(\theta^{(t)})$$
converges to E^π[g(θ)|x], for (almost) any starting value
More convergence
If the produced Markov chains are irreducible [can reach any region in a finite number of steps], then they are both positive recurrent with stationary distribution π(·|x) and ergodic [asymptotically independent from the starting value θ^{(0)}]
While, for t large enough, θ^{(t)} is approximately distributed from π(θ|x) and can thus be used like the output from a more standard simulation algorithm, one must take care of the correlations between the θ^{(t)}'s
Demarginalising
Takes advantage of hierarchical structures: if
$$\pi(\theta|x) = \int \pi_1(\theta|x, \lambda)\,\pi_2(\lambda|x)\,d\lambda\,,$$
simulating from π(θ|x) comes from simulating from the joint distribution
$$\pi_1(\theta|x, \lambda)\,\pi_2(\lambda|x)$$
Two-stage Gibbs sampler
Usually π_2(λ|x) is not available/simulable
More often, both conditional posterior distributions,
$$\pi_1(\theta|x, \lambda)\quad\text{and}\quad\pi_2(\lambda|x, \theta)\,,$$
can be simulated.
Idea: create a Markov chain based on those conditionals
Example of latent variables!
Two-stage Gibbs sampler (cont'd)
Initialization: start with an arbitrary value λ^{(0)}
Iteration t: given λ^{(t−1)}, generate
1 θ^{(t)} according to π_1(θ|x, λ^{(t−1)})
2 λ^{(t)} according to π_2(λ|x, θ^{(t)})
[J.W. Gibbs (1839-1903)]
π(θ, λ|x) is a stationary distribution for this transition
Implementation
1 Derive an efficient decomposition of the joint distribution into simulable conditionals (mixing behavior, acf(), blocking, &tc.)
2 Find when to stop the algorithm (mode chasing, missing mass, shortcuts, &tc.)
Simple Example: iid N(µ, σ²) Observations
When y_1, ..., y_n ~iid N(µ, σ²) with both µ and σ unknown, the posterior in (µ, σ²) is conjugate outside a standard family
But...
$$\mu\,|\,y, \sigma^2 \sim \mathcal{N}\!\left(\frac{1}{n}\sum_{i=1}^{n} y_i,\ \frac{\sigma^2}{n}\right)\qquad
\sigma^2\,|\,y, \mu \sim \mathcal{IG}\!\left(\frac{n}{2}-1,\ \frac{1}{2}\sum_{i=1}^{n}(y_i - \mu)^2\right)$$
assuming constant (improper) priors on both µ and σ²
Hence we may use the Gibbs sampler for simulating from the posterior of (µ, σ²)
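A hedged R sketch of the two-stage Gibbs sampler for this model, using the conditionals above (toy data; the IG parameters follow the reconstruction above, an assumption of this transcript):

y <- rnorm(30, mean = 2, sd = 3)          # toy data
n <- length(y); T <- 1e4
sig2 <- var(y)                            # arbitrary starting value
out <- matrix(NA, T, 2)
for (t in 1:T) {
  mu <- rnorm(1, mean(y), sqrt(sig2 / n))                       # mu | y, sig2
  sig2 <- 1 / rgamma(1, n / 2 - 1, rate = sum((y - mu)^2) / 2)  # sig2 | y, mu
  out[t, ] <- c(mu, sig2)
}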
Gibbs output analysis
Example (Cauchy posterior)
$$\pi(\mu|\mathcal{D}) \propto \frac{e^{-\mu^2/20}}{(1+(x_1-\mu)^2)(1+(x_2-\mu)^2)}$$
is the marginal of
$$\pi(\mu, \omega|\mathcal{D}) \propto e^{-\mu^2/20} \times \prod_{i=1}^{2} e^{-\omega_i[1+(x_i-\mu)^2]}\,.$$
Corresponding conditionals:
$$(\omega_1, \omega_2)\,|\,\mu \sim \mathcal{E}xp\!\left(1+(x_1-\mu)^2\right) \otimes \mathcal{E}xp\!\left(1+(x_2-\mu)^2\right)$$
$$\mu\,|\,\omega \sim \mathcal{N}\!\left(\sum_i \omega_i x_i\Big/\Big(\sum_i \omega_i + 1/20\Big),\ 1\Big/\Big(2\sum_i \omega_i + 1/10\Big)\right)$$
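An R sketch of this Gibbs sampler (the observations x_1, x_2 are not given on the slide, so arbitrary values are used):

x <- c(3, 5)                               # assumed observations
T <- 1e4; mu <- 0; draws <- numeric(T)
for (t in 1:T) {
  omega <- rexp(2, rate = 1 + (x - mu)^2)  # (omega_1, omega_2) | mu
  mu <- rnorm(1, sum(omega * x) / (sum(omega) + 1 / 20),
              sqrt(1 / (2 * sum(omega) + 1 / 10)))   # mu | omega
  draws[t] <- mu
}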
Gibbs output analysis (cont'd)
[Figure: trace of the µ chain over the last iterations, and histogram with density estimate of the µ sample]
Generalisation
Consider several groups of parameters, θ, λ_1, ..., λ_p, such that
$$\pi(\theta|x) = \int \cdots \int \pi(\theta, \lambda_1, \ldots, \lambda_p|x)\,d\lambda_1 \cdots d\lambda_p\,,$$
or simply divide θ into (θ_1, ..., θ_p)
The general Gibbs sampler
For a joint distribution π(θ) with full conditionals π_1, ..., π_p,
Given (θ_1^{(t)}, ..., θ_p^{(t)}), simulate
1. θ_1^{(t+1)} ~ π_1(θ_1|θ_2^{(t)}, ..., θ_p^{(t)}),
2. θ_2^{(t+1)} ~ π_2(θ_2|θ_1^{(t+1)}, θ_3^{(t)}, ..., θ_p^{(t)}),
...
p. θ_p^{(t+1)} ~ π_p(θ_p|θ_1^{(t+1)}, ..., θ_{p−1}^{(t+1)}).
Then θ^{(t)} → θ ~ π
Variable selection
Back to regression: one dependent random variable y and a set {x_1, ..., x_k} of k explanatory variables.
Question: are all the x_i's involved in the regression?
Assumption: every subset {i_1, ..., i_q} of q (0 ≤ q ≤ k) explanatory variables, {1_n, x_{i_1}, ..., x_{i_q}}, is a proper set of explanatory variables for the regression of y [intercept included in every corresponding model]
Computational issue: 2^k models in competition...
Model notations
1 X = [1_n x_1 ... x_k] is the matrix containing 1_n and all the k potential predictor variables
2 Each model M_γ is associated with a binary indicator vector γ ∈ Γ = {0,1}^k, where γ_i = 1 means that the variable x_i is included in the model M_γ
3 q_γ = 1^Tγ is the number of variables included in the model M_γ
4 t_1(γ) and t_0(γ) are the indices of variables included in, and excluded from, the model
Model indicators
For β ∈ R^{k+1} and X, we define β_γ as the subvector
$$\beta_\gamma = \left(\beta_0, (\beta_i)_{i\in t_1(\gamma)}\right)$$
and X_γ as the submatrix of X where only the column 1_n and the columns in t_1(γ) have been left.
Models in competition
The model M_γ is thus defined as
$$y\,|\,\gamma, \beta_\gamma, \sigma^2, X \sim \mathcal{N}_n\!\left(X_\gamma\beta_\gamma, \sigma^2 I_n\right)$$
where β_γ ∈ R^{q_γ+1} and σ² ∈ R*_+ are the unknown parameters.
Warning: σ² is common to all models and thus uses the same prior for all models
Informative G-prior
Many (2^k) models in competition: we cannot expect a practitioner to specify a prior on every M_γ in a completely subjective and autonomous manner.
Shortcut: derive all priors from a single global prior associated with the so-called full model that corresponds to γ = (1, ..., 1).
Prior definitions
(i) For the full model, Zellner's G-prior:
$$\beta\,|\,\sigma^2, X \sim \mathcal{N}_{k+1}\!\left(\tilde\beta, c\sigma^2(X^TX)^{-1}\right)\quad\text{and}\quad \sigma^2 \sim \pi(\sigma^2|X) \propto \sigma^{-2}$$
(ii) For each model M_γ, the prior distribution of β_γ conditional on σ² is fixed as
$$\beta_\gamma\,|\,\gamma, \sigma^2 \sim \mathcal{N}_{q_\gamma+1}\!\left(\tilde\beta_\gamma,\ c\sigma^2\left(X_\gamma^TX_\gamma\right)^{-1}\right),$$
where $\tilde\beta_\gamma = \left(X_\gamma^TX_\gamma\right)^{-1}X_\gamma^TX\tilde\beta$, and the same prior on σ².
Prior completion
The joint prior for model M_γ is the improper prior
$$\pi(\beta_\gamma, \sigma^2|\gamma) \propto \left(\sigma^2\right)^{-(q_\gamma+1)/2-1} \exp\!\left[-\frac{1}{2c\sigma^2}\left(\beta_\gamma - \tilde\beta_\gamma\right)^T\left(X_\gamma^TX_\gamma\right)\left(\beta_\gamma - \tilde\beta_\gamma\right)\right].$$
Prior completion (2)
Infinitely many ways of defining a prior on the model index γ: choice of the uniform prior π(γ|X) = 2^{−k}.
Posterior distribution of γ central to variable selection since proportional to the marginal density of y on M_γ (or evidence of M_γ):
$$\pi(\gamma|y, X) \propto f(y|\gamma, X)\,\pi(\gamma|X) \propto f(y|\gamma, X) = \int\!\left[\int f(y|\gamma, \beta, \sigma^2, X)\,\pi(\beta|\gamma, \sigma^2, X)\,d\beta\right]\pi(\sigma^2|X)\,d\sigma^2\,.$$
Since
$$f(y|\gamma, \sigma^2, X) = \int f(y|\gamma, \beta, \sigma^2)\,\pi(\beta|\gamma, \sigma^2)\,d\beta = (c+1)^{-(q_\gamma+1)/2}(2\pi)^{-n/2}\left(\sigma^2\right)^{-n/2}\exp\!\left[-\frac{y^Ty}{2\sigma^2} + \frac{c\,y^TX_\gamma\left(X_\gamma^TX_\gamma\right)^{-1}X_\gamma^Ty - \tilde\beta_\gamma^TX_\gamma^TX_\gamma\tilde\beta_\gamma}{2\sigma^2(c+1)}\right],$$
this posterior density satisfies
$$\pi(\gamma|y, X) \propto (c+1)^{-(q_\gamma+1)/2}\left[y^Ty - \frac{c}{c+1}\,y^TX_\gamma\left(X_\gamma^TX_\gamma\right)^{-1}X_\gamma^Ty - \frac{1}{c+1}\,\tilde\beta_\gamma^TX_\gamma^TX_\gamma\tilde\beta_\gamma\right]^{-n/2}.$$
Pine processionary caterpillars (cont'd)
Interpretation: the model M_γ with the highest posterior probability is t_1(γ) = (1, 2, 4, 5), which corresponds to the variables
- altitude,
- slope,
- height of the tree sampled in the center of the area, and
- diameter of the tree sampled in the center of the area.
Corresponds to the five variables identified in the R regression output
Noninformative extension
For Zellner's noninformative prior with π(c) = 1/c, we have
$$\pi(\gamma|y, X) \propto \sum_{c=1}^{\infty} c^{-1}(c+1)^{-(q_\gamma+1)/2}\left[y^Ty - \frac{c}{c+1}\,y^TX_\gamma\left(X_\gamma^TX_\gamma\right)^{-1}X_\gamma^Ty\right]^{-n/2}.$$
Stochastic search for the most likely model
When k gets large, it is impossible to compute the posterior probabilities of the 2^k models.
Need for a tailored algorithm that samples from π(γ|y, X) and selects the most likely models.
Can be done by Gibbs sampling, given the availability of the full conditional posterior probabilities of the γ_i's.
If γ_{−i} = (γ_1, ..., γ_{i−1}, γ_{i+1}, ..., γ_k) (1 ≤ i ≤ k),
$$\pi(\gamma_i|y, \gamma_{-i}, X) \propto \pi(\gamma|y, X)$$
(to be evaluated in both γ_i = 0 and γ_i = 1)
Gibbs sampling for variable selection
Initialization: draw γ^0 from the uniform distribution on Γ
Iteration t: given (γ_1^{(t−1)}, ..., γ_k^{(t−1)}), generate
1. γ_1^{(t)} according to π(γ_1|y, γ_2^{(t−1)}, ..., γ_k^{(t−1)}, X)
2. γ_2^{(t)} according to π(γ_2|y, γ_1^{(t)}, γ_3^{(t−1)}, ..., γ_k^{(t−1)}, X)
...
k. γ_k^{(t)} according to π(γ_k|y, γ_1^{(t)}, ..., γ_{k−1}^{(t)}, X)
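A hedged R sketch of this Gibbs scan (assumptions: informative G-prior with β̃ = 0 and fixed c, so that the model posterior reduces to the closed form given earlier; X carries the intercept in its first column):

log_post <- function(g, y, X, c = 100) {        # log pi(gamma | y, X) + const
  Xg <- X[, c(TRUE, as.logical(g)), drop = FALSE]
  Pg <- Xg %*% solve(crossprod(Xg), t(Xg))
  -(sum(g) + 1) / 2 * log(c + 1) -
    (length(y) / 2) * log(sum(y^2) - c / (c + 1) * drop(t(y) %*% Pg %*% y))
}
gibbs_vs <- function(y, X, T = 1e3) {
  k <- ncol(X) - 1
  g <- rbinom(k, 1, 0.5)                        # gamma^0 uniform on Gamma
  out <- matrix(NA, T, k)
  for (t in 1:T) {
    for (i in 1:k) {                            # full conditional of gamma_i
      lp <- sapply(0:1, function(v) { g[i] <- v; log_post(g, y, X) })
      g[i] <- rbinom(1, 1, 1 / (1 + exp(lp[1] - lp[2])))
    }
    out[t, ] <- g
  }
  out
}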
MCMC interpretation
After T ≫ 1 MCMC iterations, the output is used to approximate the posterior probabilities π(γ|y, X) by empirical averages
$$\hat\pi(\gamma|y, X) = \frac{1}{T - T_0 + 1}\sum_{t=T_0}^{T} \mathbb{I}_{\gamma^{(t)}=\gamma}\,,$$
where the first T_0 values are eliminated as burn-in.
And approximation of the probability to include the i-th variable,
$$\hat P^\pi(\gamma_i = 1|y, X) = \frac{1}{T - T_0 + 1}\sum_{t=T_0}^{T} \mathbb{I}_{\gamma_i^{(t)}=1}\,.$$
Pine processionary caterpillars (cont'd)
        P^π(γ_i = 1|y, X)   P^π(γ_i = 1|y, X)
γ_i     [informative]       [noninformative]
γ_1     0.8624              0.8844
γ_2     0.7060              0.7716
γ_3     0.1482              0.2978
γ_4     0.6671              0.7261
γ_5     0.6515              0.7006
γ_6     0.1678              0.3115
γ_7     0.1371              0.2880
γ_8     0.1555              0.2876
γ_9     0.4039              0.5168
γ_10    0.1151              0.2609
Probabilities of inclusion with both informative (β̃ = 0_{11}, c = 100) and noninformative Zellner's priors
3 Generalized linear models
    Generalisation of linear models
    Metropolis–Hastings algorithms
    The Probit Model
    The logit model
Generalisation of the linear dependence
Broader class of models to cover various dependence structures.
Class of generalised linear models (GLM) where
$$y\,|\,x, \beta \sim f(y|x^T\beta)\,,$$
i.e., the dependence of y on x is partly linear
Specifications of GLMs
Definition (GLM)
A GLM is a conditional model specified by two functions:
1 the density f of y given x, parameterised by its expectation parameter µ = µ(x) [and possibly its dispersion parameter ϕ = ϕ(x)]
2 the link g between the mean µ and the explanatory variables, written customarily as g(µ) = x^Tβ or, equivalently, E[y|x, β] = g^{−1}(x^Tβ).
For identifiability reasons, g needs to be bijective.
Likelihood
Obvious representation of the likelihood
$$\ell(\beta, \varphi|y, X) = \prod_{i=1}^{n} f\!\left(y_i|x_i^T\beta, \varphi\right)$$
with parameters β ∈ R^k and ϕ > 0.
Examples
Case of binary and binomial data, when
$$y_i\,|\,x_i \sim \mathcal{B}(n_i, p(x_i))$$
with known n_i
Logit [or logistic regression] model: link is the logit transform on the probability of success,
$$g(p_i) = \log(p_i/(1-p_i))\,,$$
with likelihood
$$\prod_{i=1}^{n} \binom{n_i}{y_i}\left(\frac{\exp(x_i^T\beta)}{1+\exp(x_i^T\beta)}\right)^{y_i}\left(\frac{1}{1+\exp(x_i^T\beta)}\right)^{n_i-y_i} \propto \exp\left\{\sum_{i=1}^{n} y_i x_i^T\beta\right\}\bigg/\prod_{i=1}^{n}\left(1+\exp(x_i^T\beta)\right)^{n_i}$$
Canonical link
Special link function g that appears in the natural exponential family representation of the density:
$$g^\star(\mu) = \theta \quad\text{if}\quad f(y|\mu) \propto \exp\{T(y)\cdot\theta - \Psi(\theta)\}$$
Example: the logit link is canonical for the binomial model, since
$$f(y_i|p_i) = \binom{n_i}{y_i}\exp\left\{y_i \log\frac{p_i}{1-p_i} + n_i\log(1-p_i)\right\},$$
and thus
$$\theta_i = \log p_i/(1-p_i)$$
Examples (2)
Customary to use the canonical link, but only customary...
Probit model: probit link function given by
$$g(\mu_i) = \Phi^{-1}(\mu_i)\,,$$
where Φ is the standard normal cdf.
Likelihood
$$\ell(\beta|y, X) \propto \prod_{i=1}^{n} \Phi(x_i^T\beta)^{y_i}\left(1-\Phi(x_i^T\beta)\right)^{n_i-y_i}.$$
Full processing
Log-linear models
Standard approach to describe associations between several categorical variables, i.e., variables with finite support
Sufficient statistic: the contingency table, made of the cross-classified counts for the different categorical variables.
Example (Titanic survivors)
                            Child             Adult
Survivor   Class       Male   Female     Male   Female
No         1st           0       0        118      4
           2nd           0       0        154     13
           3rd          35      17        387     89
           Crew          0       0        670      3
Yes        1st           5       1         57    140
           2nd          11      13         14     80
           3rd          13      14         75     76
           Crew          0       0        192     20
Poisson regression model
1 Each count y_i is Poisson with mean µ_i = µ(x_i)
2 Link function connecting R_+ with R, e.g. the logarithm, g(µ_i) = log(µ_i)
Corresponding likelihood
$$\ell(\beta|y, X) = \prod_{i=1}^{n} \frac{1}{y_i!}\exp\left\{y_i\,x_i^T\beta - \exp(x_i^T\beta)\right\}.$$
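The Poisson log-likelihood above codes directly in R (a sketch, with β, y, and X supplied by the user):

pois_loglik <- function(beta, y, X) {
  eta <- X %*% beta
  sum(y * eta - exp(eta) - lgamma(y + 1))   # sum of y_i x_i'beta - exp(.) - log(y_i!)
}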
Metropolis–Hastings algorithms
Convergence assessment
Posterior inference in GLMs is harder than for linear models
Working with a GLM requires specific numerical or simulation tools [e.g., GLIM in classical analyses]
Opportunity to introduce a universal MCMC method: the Metropolis–Hastings algorithm
Generic MCMC sampler
Metropolis–Hastings algorithms are generic/off-the-shelf MCMC algorithms
  Only require the likelihood up to a constant [difference with the Gibbs sampler]
  Can be tuned with a wide range of possibilities [difference with the Gibbs sampler & blocking]
  Natural extensions of standard simulation algorithms: based on the choice of a proposal distribution [difference in Markov proposal q(x, y) and acceptance]
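A minimal R sketch of a random-walk Metropolis–Hastings sampler, anticipating the algorithm introduced above (the target is the Cauchy-normal posterior from the earlier example, known only up to a constant; x = 2 and the proposal scale are illustrative choices):

mh <- function(log_target, theta0, T = 1e4, scale = 1) {
  theta <- theta0; out <- numeric(T)
  for (t in 1:T) {
    prop <- theta + scale * rnorm(1)          # symmetric random-walk proposal
    if (log(runif(1)) < log_target(prop) - log_target(theta))
      theta <- prop                           # Metropolis acceptance step
    out[t] <- theta
  }
  out
}
x_obs <- 2
draws <- mh(function(th) dnorm(x_obs, th, log = TRUE) + dcauchy(th, log = TRUE), 0)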