Bayesian Core: Chapter 2

Bayesian Core: A Practical Approach to Computational Bayesian Statistics

Jean-Michel Marin & Christian P. Robert

QUT, Brisbane, August 4–8 2008



Outline

1 The normal model
2 Regression and variable selection
3 Generalized linear models
4 Capture-recapture experiments
5 Mixture models
6 Dynamic models
7 Image analysis

1 The normal model
    Normal problems
    The Bayesian toolbox
    Prior selection
    Bayesian estimation
    Confidence regions
    Testing
    Monte Carlo integration
    Prediction

The normal model
Normal problems

Normal model
Sample $x_1, \ldots, x_n$ from a normal $\mathcal{N}(\mu, \sigma^2)$ distribution

[Figure: normal sample; vertical axis: Density]

Inference on (µ, σ) based on this sample

Estimation of [transforms of] (µ, σ)
Confidence region [interval] on (µ, σ)
Test on (µ, σ) and comparison with other samples

Datasets

Larcenies = normaldata

Relative changes in reported larcenies between 1991 and 1995 (relative to 1991) for the 90 most populous US counties (Source: FBI)

[Figure: distribution of normaldata, horizontal range −0.4 to 0.6]

Cosmological background = CMBdata

Spectral representation of the “cosmological microwave background” (CMB), i.e. electromagnetic radiation from photons back to 300,000 years after the Big Bang, expressed as difference in apparent temperature from the mean temperature

Normal estimation

[Figure: normal estimation for CMBdata; horizontal axis: CMB, from −0.1 to 0.6]

Bayes theorem = Inversion of probabilities

Who’s Bayes?
Reverend Thomas Bayes (ca. 1702–1761)
Presbyterian minister in Tunbridge Wells (Kent) from 1731, son of Joshua Bayes, nonconformist minister. Election to the Royal Society based on a tract of 1736 where he defended the views and philosophy of Newton.
Sole probability paper, “Essay Towards Solving a Problem in the Doctrine of Chances”, published posthumously in 1763 by Price and containing the seeds of Bayes’ Theorem.

New perspective

Uncertainty on the parameters θ of a model modeled through a probability distribution π on Θ, called prior distribution
Inference based on the distribution of θ conditional on x, π(θ|x), called posterior distribution
\[ \pi(\theta|x) = \frac{f(x|\theta)\,\pi(\theta)}{\int f(x|\theta)\,\pi(\theta)\,d\theta}. \]

Bayesian model

A Bayesian statistical model is made of
1 a likelihood $f(x|\theta)$,
and of
2 a prior distribution on the parameters, $\pi(\theta)$.

Justifications

Semantic drift from unknown θ to random θ
Actualization of information/knowledge on θ by extracting information/knowledge on θ contained in the observation x
Allows incorporation of imperfect/imprecise information in the decision process
Unique mathematical way to condition upon the observations (conditional perspective)

Example (Normal illustration ($\sigma^2 = 1$))
Assume
\[ \pi(\theta) = \exp\{-\theta\}\,\mathbb{I}_{\theta>0} \]
Then
\[ \pi(\theta|x_1,\ldots,x_n) \propto \exp\{-\theta\}\,\exp\{-n(\theta-\bar{x})^2/2\}\,\mathbb{I}_{\theta>0} \]
\[ \propto \exp\{-n\theta^2/2 + \theta(n\bar{x}-1)\}\,\mathbb{I}_{\theta>0} \]
\[ \propto \exp\{-n(\theta-(\bar{x}-1/n))^2/2\}\,\mathbb{I}_{\theta>0} \]

Example (Normal illustration (2))
Truncated normal distribution
\[ \mathcal{N}^{+}\big((\bar{x}-1/n),\, 1/n\big) \]

[Figure: truncated normal posterior density on (0, 0.5), mu = −0.08935864]
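A minimal numerical check of the truncated normal posterior above, assuming the exponential prior and unit variance of the example. The function names are hypothetical. The first function uses the standard closed-form mean of a normal truncated to θ > 0, the second integrates the unnormalised posterior on a grid.

```python
import math
from statistics import NormalDist

def truncated_posterior_mean(xbar, n):
    """Closed-form mean of N+(xbar - 1/n, 1/n), the normal restricted to theta > 0."""
    m = xbar - 1.0 / n               # location of the untruncated normal
    s = math.sqrt(1.0 / n)           # its standard deviation
    std = NormalDist()
    a = (0.0 - m) / s                # standardised truncation point
    return m + s * std.pdf(a) / (1.0 - std.cdf(a))

def grid_posterior_mean(xbar, n, steps=100_000):
    """Same mean from the unnormalised posterior exp(-theta - n (theta - xbar)^2 / 2)."""
    m, s = xbar - 1.0 / n, math.sqrt(1.0 / n)
    upper = max(m, 0.0) + 12.0 * s   # mass beyond this point is negligible
    h = upper / steps
    num = den = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h        # midpoint rule on (0, upper)
        w = math.exp(-theta - n * (theta - xbar) ** 2 / 2.0)
        num += theta * w
        den += w
    return num / den
```

Both routes should agree closely, which confirms that completing the square only changes the normalising constant.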

Prior and posterior distributions

Given f(x|θ) and π(θ), several distributions of interest:
1 the joint distribution of (θ, x),
\[ \varphi(\theta,x) = f(x|\theta)\,\pi(\theta)\,; \]
2 the marginal distribution of x,
\[ m(x) = \int \varphi(\theta,x)\,d\theta = \int f(x|\theta)\,\pi(\theta)\,d\theta\,; \]

Posterior distribution central to Bayesian inference

\[ \pi(\theta|x) \propto f(x|\theta)\,\pi(\theta) \]

Operates conditional upon the observations
Integrates simultaneously prior information/knowledge and information brought by x
Avoids averaging over the unobserved values of x
Coherent updating of the information available on θ, independent of the order in which i.i.d. observations are collected
Provides a complete inferential scope and a unique motor of inference

Example (Normal-normal case)
Consider $x|\theta \sim \mathcal{N}(\theta, 1)$ and $\theta \sim \mathcal{N}(a, 10)$.
\[ \pi(\theta|x) \propto f(x|\theta)\,\pi(\theta) \propto \exp\left\{ -\frac{(x-\theta)^2}{2} - \frac{(\theta-a)^2}{20} \right\} \]
\[ \propto \exp\left\{ -\frac{11\theta^2}{20} + \theta\,(x + a/10) \right\} \]
\[ \propto \exp\left\{ -\frac{11}{20}\,\big\{\theta - ((10x+a)/11)\big\}^2 \right\} \]
and
\[ \theta|x \sim \mathcal{N}\big((10x+a)/11,\; 10/11\big) \]
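The normal-normal update can be sketched as a precision-weighted average. This is an illustrative helper (the name `normal_update` is not from the slides) that reproduces the N((10x + a)/11, 10/11) posterior. It also illustrates that the order of i.i.d. observations does not matter.

```python
def normal_update(prior_mean, prior_var, x, obs_var=1.0):
    """Posterior (mean, variance) of theta when x | theta ~ N(theta, obs_var)."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)   # precisions add
    post_mean = post_var * (prior_mean / prior_var + x / obs_var)
    return post_mean, post_var

def sequential_update(prior_mean, prior_var, observations):
    """Feed observations one at a time, reusing each posterior as the next prior."""
    mean, var = prior_mean, prior_var
    for x in observations:
        mean, var = normal_update(mean, var, x)
    return mean, var
```

With a = 2 and x = 1.5, `normal_update(2.0, 10.0, 1.5)` returns the mean 17/11 and the variance 10/11, up to floating-point rounding.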

The normal model
Prior selection

Prior selection

The prior distribution is the key to Bayesian inference

But...
In practice, it seldom occurs that the available prior information is precise enough to lead to an exact determination of the prior distribution
There is no such thing as the prior distribution!

Strategies for prior determination

Ungrounded prior distributions produce unjustified posterior inference.
[Anonymous, ca. 2006]

Use a partition of Θ in sets (e.g., intervals), determine the probability of each set, and approach π by a histogram
Select significant elements of Θ, evaluate their respective likelihoods and deduce a likelihood curve proportional to π
Use the marginal distribution of x,
\[ m(x) = \int_{\Theta} f(x|\theta)\,\pi(\theta)\,d\theta \]
Empirical and hierarchical Bayes techniques

Conjugate priors

Specific parametric family with analytical properties
Conjugate prior
A family F of probability distributions on Θ is conjugate for a likelihood function f(x|θ) if, for every π ∈ F, the posterior distribution π(θ|x) also belongs to F.

Only of interest when F is parameterised: switching from prior to posterior distribution is reduced to an updating of the corresponding parameters.

Justifications

Limited/finite information conveyed by x
Preservation of the structure of π(θ)
Exchangeability motivations
Device of virtual past observations
Linearity of some estimators
But mostly... tractability and simplicity
First approximations to adequate priors, backed up by robustness analysis

Exponential families

Sampling models of interest
Exponential family
The family of distributions
\[ f(x|\theta) = C(\theta)\,h(x)\,\exp\{R(\theta)\cdot T(x)\} \]
is called an exponential family of dimension k. When $\Theta \subset \mathbb{R}^k$, $\mathcal{X} \subset \mathbb{R}^k$ and
\[ f(x|\theta) = h(x)\,\exp\{\theta\cdot x - \Psi(\theta)\}, \]
the family is said to be natural.

Analytical properties of exponential families

Sufficient statistics (Pitman–Koopman Lemma)
Common enough structure (normal, Poisson, etc.)
Analyticity ($\mathbb{E}[x] = \nabla\Psi(\theta)$, ...)
Allow for conjugate priors
\[ \pi(\theta|\mu,\lambda) = K(\mu,\lambda)\, e^{\theta\cdot\mu - \lambda\Psi(\theta)}, \qquad \lambda > 0 \]

Standard exponential families

f(x|θ)                        π(θ)                      π(θ|x)
Normal N(θ, σ²)               Normal N(µ, τ²)           N(ρ(σ²µ + τ²x), ρσ²τ²), with ρ⁻¹ = σ² + τ²
Poisson P(θ)                  Gamma G(α, β)             G(α + x, β + 1)
Gamma G(ν, θ)                 Gamma G(α, β)             G(α + ν, β + x)
Binomial B(n, θ)              Beta Be(α, β)             Be(α + x, β + n − x)

More...

f(x|θ)                        π(θ)                      π(θ|x)
Negative Binomial Neg(m, θ)   Beta Be(α, β)             Be(α + m, β + x)
Multinomial M_k(θ1, ..., θk)  Dirichlet D(α1, ..., αk)  D(α1 + x1, ..., αk + xk)
Normal N(µ, 1/θ)              Gamma Ga(α, β)            G(α + 0.5, β + (µ − x)²/2)
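Two rows of the conjugate table can be sketched as plain parameter updates. The function names are illustrative, and the grid integration is only a sanity check that the Beta update matches a brute-force posterior mean.

```python
def beta_binomial_update(alpha, beta, x, n):
    """Be(alpha, beta) prior, x successes out of n: posterior Be(alpha + x, beta + n - x)."""
    return alpha + x, beta + n - x

def gamma_poisson_update(alpha, beta, x):
    """G(alpha, beta) prior, one Poisson count x: posterior G(alpha + x, beta + 1)."""
    return alpha + x, beta + 1

def grid_beta_binomial_mean(alpha, beta, x, n, steps=100_000):
    """Posterior mean of theta from the unnormalised product likelihood * prior."""
    num = den = 0.0
    for i in range(steps):
        theta = (i + 0.5) / steps    # midpoint rule on (0, 1)
        w = theta ** (alpha - 1 + x) * (1.0 - theta) ** (beta - 1 + n - x)
        num += theta * w
        den += w
    return num / den
```

The posterior mean of Be(a, b) is a/(a + b), so the grid value can be compared with (α + x)/(α + β + n).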

Linearity of the posterior mean

If
\[ \theta \sim \pi_{\lambda,\mu}(\theta) \propto e^{\theta\cdot\mu - \lambda\Psi(\theta)} \]
with $\mu \in \mathcal{X}$, then
\[ \mathbb{E}^{\pi}[\nabla\Psi(\theta)] = \frac{\mu}{\lambda}, \]
where $\nabla\Psi(\theta) = (\partial\Psi(\theta)/\partial\theta_1, \ldots, \partial\Psi(\theta)/\partial\theta_p)$
Therefore, if $x_1, \ldots, x_n$ are i.i.d. $f(x|\theta)$,
\[ \mathbb{E}^{\pi}[\nabla\Psi(\theta)|x_1,\ldots,x_n] = \frac{\mu + n\bar{x}}{\lambda + n}. \]

Example (Normal-normal)
In the normal $\mathcal{N}(\theta, \sigma^2)$ case, the conjugate prior is also normal, $\mathcal{N}(\mu, \tau^2)$, and
\[ \mathbb{E}^{\pi}[\nabla\Psi(\theta)|x] = \mathbb{E}^{\pi}[\theta|x] = \rho\,(\sigma^2\mu + \tau^2 x) \]
where
\[ \rho^{-1} = \sigma^2 + \tau^2 \]

Example (Full normal)
In the normal $\mathcal{N}(\mu, \sigma^2)$ case, when both µ and σ are unknown, there still is a conjugate prior on $\theta = (\mu, \sigma^2)$, of the form
\[ (\sigma^2)^{-\lambda_\sigma} \exp\left\{ -\left[ \lambda_\mu(\mu-\xi)^2 + \alpha \right] / 2\sigma^2 \right\} \]
since
\[ \pi(\mu,\sigma^2|x_1,\ldots,x_n) \propto (\sigma^2)^{-\lambda_\sigma} \exp\left\{ -\left[ \lambda_\mu(\mu-\xi)^2 + \alpha \right] / 2\sigma^2 \right\} \times (\sigma^2)^{-n/2} \exp\left\{ -\left[ n(\mu-\bar{x})^2 + s_x^2 \right] / 2\sigma^2 \right\} \]
\[ \propto (\sigma^2)^{-\lambda_\sigma - n/2} \exp\left\{ -\left[ (\lambda_\mu+n)(\mu-\xi_x)^2 + \alpha + s_x^2 + \frac{n\lambda_\mu}{n+\lambda_\mu}(\bar{x}-\xi)^2 \right] / 2\sigma^2 \right\} \]
where completing the square in µ gives $\xi_x = (\lambda_\mu\xi + n\bar{x})/(\lambda_\mu + n)$.

[Figure: conjugate prior on (µ, σ) for parameters (0,1,1,1); horizontal axis: µ, vertical axis: σ]

Improper prior distribution

Extension from a prior distribution to a prior σ-finite measure π such that
\[ \int_{\Theta} \pi(\theta)\,d\theta = +\infty \]
Formal extension: π cannot be interpreted as a probability any longer

Justifications

1 Often only way to derive a prior in noninformative/automatic settings
2 Performances of associated estimators usually good
3 Often occur as limits of proper distributions
4 More robust answer against possible misspecifications of the prior
5 Improper priors (infinitely!) preferable to vague proper priors such as a $\mathcal{N}(0, 100^2)$ distribution [e.g., BUGS]

Example (Normal+improper)
If $x \sim \mathcal{N}(\theta, 1)$ and $\pi(\theta) = \varpi$, constant, the pseudo marginal distribution is
\[ m(x) = \varpi \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp\left\{-(x-\theta)^2/2\right\} d\theta = \varpi \]
and the posterior distribution of θ is
\[ \pi(\theta|x) = \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{(x-\theta)^2}{2} \right\}, \]
i.e., corresponds to $\mathcal{N}(x, 1)$.
[independent of $\varpi$]

Meaningless as probability distribution

The mistake is to think of them [the non-informative priors] as representing ignorance
[Lindley, 1990]

Example
Consider a $\theta \sim \mathcal{N}(0, \tau^2)$ prior. Then
\[ P^{\pi}(\theta \in [a,b]) \longrightarrow 0 \]
when τ → ∞ for any (a, b)
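The vanishing prior mass is easy to see numerically. A small sketch, with a hypothetical helper name, using the normal cdf from the Python standard library:

```python
from statistics import NormalDist

def prior_mass(a, b, tau):
    """P(theta in [a, b]) under a N(0, tau^2) prior."""
    prior = NormalDist(mu=0.0, sigma=tau)
    return prior.cdf(b) - prior.cdf(a)

# Mass of the fixed interval [-1, 1] as the prior gets flatter
masses = [prior_mass(-1.0, 1.0, tau) for tau in (1.0, 10.0, 100.0, 1000.0)]
```

The values in `masses` decrease towards 0, so a "flat" limit puts no mass on any bounded interval.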

Noninformative prior distributions

What if all we know is that we know “nothing”?!

In the absence of prior information, prior distributions solely derived from the sample distribution f(x|θ)

Noninformative priors cannot be expected to represent exactly total ignorance about the problem at hand, but should rather be taken as reference or default priors, upon which everyone could fall back when the prior information is missing.
[Kass and Wasserman, 1996]

Laplace’s prior

Principle of Insufficient Reason (Laplace)
\[ \Theta = \{\theta_1, \ldots, \theta_p\}, \qquad \pi(\theta_i) = 1/p \]
Extension to continuous spaces
\[ \pi(\theta) \propto 1 \]
[Lebesgue measure]

Who’s Laplace?
Pierre Simon de Laplace (1749–1827)
French mathematician and astronomer born in Beaumont en Auge (Normandie) who formalised mathematical astronomy in Mécanique Céleste. Survived the French Revolution, the Napoleonic Empire (as a comte!), and the Bourbon Restoration (as a marquis!!).
In Essai Philosophique sur les Probabilités, Laplace set out a mathematical system of inductive reasoning based on probability, precursor to Bayesian Statistics.

Laplace’s problem

Lack of reparameterization invariance/coherence
\[ \pi(\theta) \propto 1, \qquad \psi = e^{\theta}, \qquad \pi(\psi) = \frac{1}{\psi} \neq 1 \quad (!!) \]
Problems of properness
\[ x \sim \mathcal{N}(\mu, \sigma^2), \qquad \pi(\mu,\sigma) = 1 \]
\[ \pi(\mu,\sigma|x) \propto e^{-(x-\mu)^2/2\sigma^2}\,\sigma^{-1} \;\Rightarrow\; \pi(\sigma|x) \propto 1 \quad (!!!) \]

Jeffreys’ prior

Based on Fisher information
\[ I^{F}(\theta) = \mathbb{E}_{\theta}\left[ \frac{\partial \log \ell}{\partial\theta^{t}}\, \frac{\partial \log \ell}{\partial\theta} \right] \]
[Portrait: Ron Fisher (1890–1962)]
the Jeffreys prior distribution is
\[ \pi^{J}(\theta) \propto |I^{F}(\theta)|^{1/2} \]
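As an illustration of the definition, the Fisher information of a binomial B(n, θ) model can be computed as the exact expectation of the squared score. Its square root is then proportional to θ^{-1/2}(1 − θ)^{-1/2}. This is a sketch for the scalar case only, and the function names are hypothetical.

```python
import math

def binomial_fisher_information(n, theta):
    """E_theta[(d/dtheta log f(x|theta))^2] for x ~ B(n, theta), by exact summation."""
    total = 0.0
    for x in range(n + 1):
        prob = math.comb(n, x) * theta ** x * (1.0 - theta) ** (n - x)
        score = x / theta - (n - x) / (1.0 - theta)   # derivative of the log-likelihood
        total += prob * score ** 2
    return total

def jeffreys_unnormalised(n, theta):
    """Unnormalised Jeffreys prior density |I(theta)|^(1/2) in the scalar case."""
    return math.sqrt(binomial_fisher_information(n, theta))
```

The summation agrees with the known closed form n/(θ(1 − θ)), so the resulting prior has the Be(1/2, 1/2) shape.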

Who’s Jeffreys?

Sir Harold Jeffreys (1891–1989)
English mathematician, statistician, geophysicist, and astronomer. Founder of English Geophysics & originator of the theory that the Earth’s core is liquid.
Formalised Bayesian methods for the analysis of geophysical data and ended up writing Theory of Probability

Pros & Cons

Relates to information theory
Agrees with most invariant priors
Parameterization invariant
Suffers from dimensionality curse

The normal model
Bayesian estimation

Evaluating estimators

Purpose of most inferential studies: to provide the
statistician/client with a decision d ∈ D

Requires an evaluation criterion/loss function for decisions and estimators
\[ L(\theta, d) \]
There exists an axiomatic derivation of the existence of a loss function.
[DeGroot, 1970]

Loss functions

Decision procedure $\delta^{\pi}$ usually called estimator (while its value $\delta^{\pi}(x)$ is called estimate of θ)

Impossible to uniformly minimize (in d) the loss function
\[ L(\theta, d) \]
when θ is unknown

Bayesian estimation

Principle: Integrate over the space Θ to get the posterior expected loss
\[ \mathbb{E}^{\pi}[L(\theta,d)|x] = \int_{\Theta} L(\theta,d)\,\pi(\theta|x)\,d\theta, \]
and minimise in d

Bayes estimates

Bayes estimator
A Bayes estimate associated with a prior distribution π and a loss function L is
\[ \arg\min_{d}\; \mathbb{E}^{\pi}[L(\theta,d)|x]. \]

The quadratic loss

Historically, first loss function (Legendre, Gauss, Laplace)
\[ L(\theta, d) = (\theta - d)^2 \]
The Bayes estimate $\delta^{\pi}(x)$ associated with the prior π and with the quadratic loss is the posterior expectation
\[ \delta^{\pi}(x) = \mathbb{E}^{\pi}[\theta|x] = \frac{\int_{\Theta} \theta\, f(x|\theta)\,\pi(\theta)\,d\theta}{\int_{\Theta} f(x|\theta)\,\pi(\theta)\,d\theta}. \]

The absolute error loss

Alternatives to the quadratic loss:
\[ L(\theta, d) = |\theta - d|, \]
or
\[ L_{k_1,k_2}(\theta, d) = \begin{cases} k_2(\theta - d) & \text{if } \theta > d, \\ k_1(d - \theta) & \text{otherwise.} \end{cases} \]
Associated Bayes estimate is the $(k_2/(k_1+k_2))$ fractile of π(θ|x)
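The fractile result can be checked on a posterior sample. The sketch below assumes draws from the posterior are available (here simulated from a standard normal as a stand-in). The helper names are hypothetical. It compares the empirical fractile with nearby decisions under the asymmetric loss.

```python
import random

def expected_loss(sample, d, k1, k2):
    """Empirical posterior expected loss of decision d under L_{k1,k2}."""
    return sum(k2 * (t - d) if t > d else k1 * (d - t) for t in sample) / len(sample)

def fractile_estimate(sample, k1, k2):
    """Empirical k2/(k1+k2) fractile of the posterior sample."""
    ordered = sorted(sample)
    idx = min(len(ordered) - 1, int(len(ordered) * k2 / (k1 + k2)))
    return ordered[idx]

rng = random.Random(0)
posterior_draws = [rng.gauss(0.0, 1.0) for _ in range(2001)]  # stand-in posterior
estimate = fractile_estimate(posterior_draws, k1=1.0, k2=3.0)
```

With k1 = 1 and k2 = 3, underestimation costs three times as much as overestimation, so the estimate sits at the 75% fractile, above the posterior median.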

MAP estimator

With no loss function, consider using the maximum a posteriori (MAP) estimator
\[ \arg\max_{\theta}\; \ell(\theta|x)\,\pi(\theta) \]
Penalized likelihood estimator
Further appeal in restricted parameter spaces

Example (Binomial probability)
Consider $x|\theta \sim \mathcal{B}(n, \theta)$.
Possible priors:
\[ \pi^{J}(\theta) = \frac{1}{B(1/2,1/2)}\,\theta^{-1/2}(1-\theta)^{-1/2}, \qquad \pi_1(\theta) = 1 \quad\text{and}\quad \pi_2(\theta) = \theta^{-1}(1-\theta)^{-1}. \]
Corresponding MAP estimators:
\[ \delta^{\pi^{J}}(x) = \max\left\{ \frac{x - 1/2}{n-1},\, 0 \right\}, \]
\[ \delta^{\pi_1}(x) = x/n, \]
\[ \delta^{\pi_2}(x) = \max\left\{ \frac{x-1}{n-2},\, 0 \right\}. \]

Not always appropriate

Example (Fixed MAP)
Consider
\[ f(x|\theta) = \frac{1}{\pi}\left[ 1 + (x-\theta)^2 \right]^{-1}, \]
and $\pi(\theta) = \frac{1}{2} e^{-|\theta|}$. Then the MAP estimate of θ is always
\[ \delta^{\pi}(x) = 0 \]

The normal model
Confidence regions

Credible regions
Natural confidence region: Highest posterior density (HPD) region
\[ C^{\pi}_{\alpha} = \{\theta;\ \pi(\theta|x) > k_{\alpha}\} \]

Optimality
The HPD regions give the highest probabilities of containing θ for a given volume

[Figure: posterior density with its HPD region; horizontal axis: µ]

Example
If the posterior distribution of θ is $\mathcal{N}(\mu(x), \omega^{-2})$ with $\omega^2 = \tau^{-2} + \sigma^{-2}$ and $\mu(x) = \tau^2 x/(\tau^2 + \sigma^2)$, then
\[ C^{\pi}_{\alpha} = \left[ \mu(x) - k_{\alpha}\,\omega^{-1},\; \mu(x) + k_{\alpha}\,\omega^{-1} \right], \]
where $k_{\alpha}$ is the α/2-quantile of $\mathcal{N}(0, 1)$.
If τ goes to +∞,
\[ C^{\pi}_{\alpha} = [x - k_{\alpha}\sigma,\; x + k_{\alpha}\sigma], \]
the “usual” (classical) confidence interval
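A short sketch of this interval, under two assumptions made here: the prior is N(0, τ²), as the form of µ(x) implies, and k_α is read as the upper α/2 point, that is, the (1 − α/2) quantile of N(0, 1). The function name is hypothetical.

```python
from statistics import NormalDist

def normal_hpd_interval(x, sigma, tau, alpha=0.05):
    """HPD interval for theta when x ~ N(theta, sigma^2) and theta ~ N(0, tau^2)."""
    omega2 = tau ** -2 + sigma ** -2               # posterior precision
    mu_x = tau ** 2 * x / (tau ** 2 + sigma ** 2)  # posterior mean
    k = NormalDist().inv_cdf(1.0 - alpha / 2.0)    # upper alpha/2 point (assumed convention)
    half_width = k / omega2 ** 0.5
    return mu_x - half_width, mu_x + half_width
```

For very large τ the output approaches [x − 1.96σ, x + 1.96σ] at α = 0.05, the classical interval.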

Full normal
Under [almost!] Jeffreys prior
\[ \pi(\mu, \sigma^2) = 1/\sigma^2, \]
posterior distribution of (µ, σ)
\[ \mu \,|\, \sigma, \bar{x}, s_x^2 \sim \mathcal{N}\left( \bar{x},\, \frac{\sigma^2}{n} \right), \]
\[ \sigma^2 \,|\, \bar{x}, s_x^2 \sim \mathcal{IG}\left( \frac{n-1}{2},\, \frac{s_x^2}{2} \right). \]
Then, with ω = 1/σ²,
\[ \pi(\mu|\bar{x}, s_x^2) \propto \int \omega^{1/2} \exp\left\{ -\omega\,\frac{n(\bar{x}-\mu)^2}{2} \right\} \omega^{(n-3)/2} \exp\{-\omega s_x^2/2\}\, d\omega \]
\[ \propto \left[ s_x^2 + n(\bar{x}-\mu)^2 \right]^{-n/2} \]
[$T_{n-1}$ distribution]

Normal credible interval
Derived credible interval on µ
\[ \left[ \bar{x} - t_{\alpha/2,n-1}\, s_x \big/ \sqrt{n(n-1)},\; \bar{x} + t_{\alpha/2,n-1}\, s_x \big/ \sqrt{n(n-1)} \right] \]

normaldata
Corresponding 95% confidence region for µ
\[ [-0.070,\, -0.013] \]
Since 0 does not belong to this interval, reporting a significant decrease in the number of larcenies between 1991 and 1995 is acceptable

The normal model
Testing

Testing hypotheses

Deciding about validity of assumptions or restrictions on the parameter θ from the data, represented as
\[ H_0: \theta \in \Theta_0 \quad\text{versus}\quad H_1: \theta \notin \Theta_0 \]
Binary outcome of the decision process: accept [coded by 1] or reject [coded by 0]
\[ \mathcal{D} = \{0, 1\} \]
Bayesian solution formally very close to a likelihood ratio test statistic, but numerical values often strongly differ from classical solutions

The 0 − 1 loss

Rudimentary loss function
\[ L(\theta, d) = \begin{cases} 1 - d & \text{if } \theta \in \Theta_0, \\ d & \text{otherwise,} \end{cases} \]
Associated Bayes estimate
\[ \delta^{\pi}(x) = \begin{cases} 1 & \text{if } P^{\pi}(\theta \in \Theta_0|x) > \frac{1}{2}, \\ 0 & \text{otherwise.} \end{cases} \]
Intuitive structure

Extension

Weighted 0 − 1 (or $a_0 - a_1$) loss
\[ L(\theta, d) = \begin{cases} 0 & \text{if } d = \mathbb{I}_{\Theta_0}(\theta), \\ a_0 & \text{if } \theta \in \Theta_0 \text{ and } d = 0, \\ a_1 & \text{if } \theta \notin \Theta_0 \text{ and } d = 1, \end{cases} \]
Associated Bayes estimator
\[ \delta^{\pi}(x) = \begin{cases} 1 & \text{if } P^{\pi}(\theta \in \Theta_0|x) > \dfrac{a_1}{a_0 + a_1}, \\ 0 & \text{otherwise.} \end{cases} \]

For $x \sim \mathcal{N}(\theta, \sigma^2)$ and $\theta \sim \mathcal{N}(\mu, \tau^2)$, π(θ|x) is $\mathcal{N}(\mu(x), \omega^2)$ with
\[ \mu(x) = \frac{\sigma^2\mu + \tau^2 x}{\sigma^2 + \tau^2} \quad\text{and}\quad \omega^2 = \frac{\sigma^2\tau^2}{\sigma^2 + \tau^2}. \]
To test $H_0: \theta < 0$, we compute
\[ P^{\pi}(\theta < 0|x) = P^{\pi}\left( \frac{\theta - \mu(x)}{\omega} < \frac{-\mu(x)}{\omega} \right) = \Phi(-\mu(x)/\omega). \]

Example (Normal-normal (2))
If $z_{a_0,a_1}$ is the $a_1/(a_0 + a_1)$ quantile, i.e.,
\[ \Phi(z_{a_0,a_1}) = a_1/(a_0 + a_1), \]
$H_0$ is accepted when
\[ -\mu(x) > z_{a_0,a_1}\,\omega, \]
the upper acceptance bound then being
\[ x \le -\frac{\sigma^2}{\tau^2}\,\mu - \left(1 + \frac{\sigma^2}{\tau^2}\right)\omega\, z_{a_0,a_1}. \]
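The acceptance rule and its bound on x can be sketched together. The function names are illustrative. The decision compares P(θ < 0 | x) = Φ(−µ(x)/ω) with the threshold a1/(a0 + a1) of the weighted 0−1 loss.

```python
from statistics import NormalDist

def posterior_prob_negative(x, sigma, mu, tau):
    """P(theta < 0 | x) for x ~ N(theta, sigma^2) and theta ~ N(mu, tau^2)."""
    mu_x = (sigma ** 2 * mu + tau ** 2 * x) / (sigma ** 2 + tau ** 2)
    omega = (sigma ** 2 * tau ** 2 / (sigma ** 2 + tau ** 2)) ** 0.5
    return NormalDist().cdf(-mu_x / omega)

def accept_h0(x, sigma, mu, tau, a0, a1):
    """Bayes decision under the a0-a1 loss: True means H0: theta < 0 is accepted."""
    return posterior_prob_negative(x, sigma, mu, tau) > a1 / (a0 + a1)

def acceptance_bound(sigma, mu, tau, a0, a1):
    """Largest x (excluded) for which H0 is accepted."""
    omega = (sigma ** 2 * tau ** 2 / (sigma ** 2 + tau ** 2)) ** 0.5
    z = NormalDist().inv_cdf(a1 / (a0 + a1))
    return -(sigma ** 2 / tau ** 2) * mu - (1.0 + sigma ** 2 / tau ** 2) * omega * z
```

Evaluating `accept_h0` just below and just above `acceptance_bound(...)` shows the decision flipping at the bound.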

Bayes factor

Bayesian testing procedure depends on $P^{\pi}(\theta \in \Theta_0|x)$ or alternatively on the Bayes factor
\[ B^{\pi}_{10} = \frac{P^{\pi}(\theta \in \Theta_1|x) \,/\, P^{\pi}(\theta \in \Theta_0|x)}{P^{\pi}(\theta \in \Theta_1) \,/\, P^{\pi}(\theta \in \Theta_0)} \]
in the absence of loss function parameters $a_0$ and $a_1$

Associated reparameterisations

Corresponding models $M_1$ vs. $M_0$ compared via
\[ B^{\pi}_{10} = \frac{P^{\pi}(M_1|x)}{P^{\pi}(M_0|x)} \bigg/ \frac{P^{\pi}(M_1)}{P^{\pi}(M_0)} \]
If we rewrite the prior as
\[ \pi(\theta) = \Pr(\theta \in \Theta_1) \times \pi_1(\theta) + \Pr(\theta \in \Theta_0) \times \pi_0(\theta) \]
then
\[ B^{\pi}_{10} = \int f(x|\theta_1)\,\pi_1(\theta_1)\,d\theta_1 \bigg/ \int f(x|\theta_0)\,\pi_0(\theta_0)\,d\theta_0 = m_1(x)/m_0(x) \]
[Akin to likelihood ratio]

Jeffreys’ scale

1 if $\log_{10}(B^{\pi}_{10})$ varies between 0 and 0.5, the evidence against $H_0$ is poor,
2 if it is between 0.5 and 1, it is substantial,
3 if it is between 1 and 2, it is strong, and
4 if it is above 2, it is decisive.
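The scale translates directly into a lookup. In this sketch the function name, the handling of exact boundary values, and the label returned for a negative log Bayes factor are choices made here, not part of the scale as stated.

```python
import math

def jeffreys_scale(bayes_factor_10):
    """Verbal grade of the evidence against H0 from the Bayes factor B10."""
    log_bf = math.log10(bayes_factor_10)
    if log_bf < 0.0:
        return "favours H0"      # outside the stated scale: label assumed here
    if log_bf <= 0.5:
        return "poor"
    if log_bf <= 1.0:
        return "substantial"
    if log_bf <= 2.0:
        return "strong"
    return "decisive"
```

For instance, a Bayes factor of 50 has a base-10 logarithm of about 1.7 and is graded "strong".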

Point null difficulties

If π absolutely continuous,
\[ P^{\pi}(\theta = \theta_0) = 0 \ldots \]
How can we test $H_0: \theta = \theta_0$?!

New prior for new hypothesis

Testing a point null hypothesis requires a modification of the prior distribution so that
\[ \pi(\Theta_0) > 0 \quad\text{and}\quad \pi(\Theta_1) > 0 \]
(hidden information) or
\[ \pi(\theta) = P^{\pi}(\theta \in \Theta_0) \times \pi_0(\theta) + P^{\pi}(\theta \in \Theta_1) \times \pi_1(\theta) \]
[E.g., when $\Theta_0 = \{\theta_0\}$, $\pi_0$ is Dirac mass at $\theta_0$]

For $x \sim \mathcal{N}(\theta, \sigma^2)$ and $\theta \sim \mathcal{N}(0, \tau^2)$, to test $H_0: \theta = 0$ requires a modification of the prior, with
\[ \pi_1(\theta) \propto e^{-\theta^2/2\tau^2}\, \mathbb{I}_{\theta \neq 0} \]
and $\pi_0(\theta)$ the Dirac mass in 0
Then
\[ \frac{m_1(x)}{f(x|0)} = \frac{\sigma}{\sqrt{\sigma^2+\tau^2}}\; \frac{e^{-x^2/2(\sigma^2+\tau^2)}}{e^{-x^2/2\sigma^2}} = \sqrt{\frac{\sigma^2}{\sigma^2+\tau^2}}\, \exp\left\{ \frac{\tau^2 x^2}{2\sigma^2(\sigma^2+\tau^2)} \right\}, \]

Example (cont’d)
and
\[ \pi(\theta = 0|x) = \left[ 1 + \frac{1-\rho}{\rho} \sqrt{\frac{\sigma^2}{\sigma^2+\tau^2}}\, \exp\left\{ \frac{\tau^2 x^2}{2\sigma^2(\sigma^2+\tau^2)} \right\} \right]^{-1}. \]
For z = x/σ and ρ = 1/2:

z                       0      0.68   1.28   1.96
π(θ = 0|z, τ = σ)       0.586  0.557  0.484  0.351
π(θ = 0|z, τ = 3.3σ)    0.768  0.729  0.612  0.366
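The table entries can be recomputed from the formula, written in terms of z = x/σ and the ratio r = τ²/σ². This is a sketch with a hypothetical function name. Note that the second row is reproduced here with r = 10 (τ ≈ 3.16σ), which is an assumption about how that row was computed.

```python
import math

def point_null_posterior(z, r, rho=0.5):
    """pi(theta = 0 | x) with z = x / sigma, r = tau^2 / sigma^2, prior weight rho on H0."""
    ratio = math.sqrt(1.0 / (1.0 + r)) * math.exp(r * z * z / (2.0 * (1.0 + r)))
    return 1.0 / (1.0 + (1.0 - rho) / rho * ratio)

# First row (tau = sigma, r = 1) and second row (r = 10, assumed) of the table
row_1 = [point_null_posterior(z, 1.0) for z in (0.0, 0.68, 1.28, 1.96)]
row_2 = [point_null_posterior(z, 10.0) for z in (0.0, 0.68, 1.28, 1.96)]
```

Even at z = 1.96 the posterior probability of the null stays above one third in both rows.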

Banning improper priors

Impossibility of using improper priors for testing!
Reason: When using the representation
\[ \pi(\theta) = P^{\pi}(\theta \in \Theta_1) \times \pi_1(\theta) + P^{\pi}(\theta \in \Theta_0) \times \pi_0(\theta) \]
$\pi_1$ and $\pi_0$ must be normalised

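Example (Normal point null)
When $x \sim \mathcal{N}(\theta, 1)$ and $H_0: \theta = 0$, for the improper prior π(θ) = 1, the prior is transformed as
\[ \pi(\theta) = \frac{1}{2}\,\mathbb{I}_0(\theta) + \frac{1}{2}\cdot \mathbb{I}_{\theta \neq 0}, \]
and
\[ \pi(\theta = 0|x) = \frac{e^{-x^2/2}}{e^{-x^2/2} + \int_{-\infty}^{+\infty} e^{-(x-\theta)^2/2}\, d\theta} = \frac{1}{1 + \sqrt{2\pi}\, e^{x^2/2}}. \]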

Example (Normal point null (2))
Consequence: probability of $H_0$ is bounded from above by
\[ \pi(\theta = 0|x) \le 1/(1 + \sqrt{2\pi}) = 0.285 \]

x             0.0    1.0    1.65   1.96   2.58
π(θ = 0|x)    0.285  0.195  0.089  0.055  0.014

Regular tests: Agreement with the classical p-value (but...)
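The bound and most table entries follow from evaluating 1/(1 + √(2π) e^{x²/2}) directly. This is a sketch with a hypothetical function name. Evaluated as written, the formula gives about 0.093 at x = 1.65, slightly above the tabulated 0.089. The other entries agree to three decimals.

```python
import math

def improper_point_null_posterior(x):
    """pi(theta = 0 | x) under the transformed improper prior, for x ~ N(theta, 1)."""
    return 1.0 / (1.0 + math.sqrt(2.0 * math.pi) * math.exp(x * x / 2.0))

table = {x: improper_point_null_posterior(x) for x in (0.0, 1.0, 1.65, 1.96, 2.58)}
```

The maximum is reached at x = 0, which gives the upper bound 1/(1 + √(2π)).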

Example (Normal one-sided)
For $x \sim \mathcal{N}(\theta, 1)$, π(θ) = 1, and $H_0: \theta \le 0$ to test versus $H_1: \theta > 0$
\[ \pi(\theta \le 0|x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{0} e^{-(x-\theta)^2/2}\, d\theta = \Phi(-x). \]
The generalized Bayes answer is also the p-value

normaldata
If $\pi(\mu, \sigma^2) = 1/\sigma^2$,
\[ \pi(\mu \ge 0|x) = 0.0021 \]
since $\mu|x \sim \mathcal{T}_{89}(-0.0144, 0.000206)$.

Jeffreys–Lindley paradox

Limiting arguments not valid in testing settings: Under a conjugate prior
\[ \pi(\theta = 0|x) = \left[ 1 + \frac{1-\rho}{\rho} \sqrt{\frac{\sigma^2}{\sigma^2+\tau^2}}\, \exp\left\{ \frac{\tau^2 x^2}{2\sigma^2(\sigma^2+\tau^2)} \right\} \right]^{-1}, \]
which converges to 1 when τ goes to +∞, for every x
Difference with the “noninformative” answer
\[ \left[ 1 + \sqrt{2\pi}\, \exp(x^2/2) \right]^{-1} \]
[Invalid answer]

Normalisation difficulties

If $g_0$ and $g_1$ are σ-finite measures on the subspaces $\Theta_0$ and $\Theta_1$, the choice of the normalizing constants influences the Bayes factor:
If $g_i$ replaced by $c_i g_i$ (i = 0, 1), Bayes factor multiplied by $c_0/c_1$

Example
If the Jeffreys prior is uniform and $g_0 = c_0$, $g_1 = c_1$,
\[ \pi(\theta \in \Theta_0|x) = \frac{\rho\, c_0 \int_{\Theta_0} f(x|\theta)\, d\theta}{\rho\, c_0 \int_{\Theta_0} f(x|\theta)\, d\theta + (1-\rho)\, c_1 \int_{\Theta_1} f(x|\theta)\, d\theta} \]
\[ = \frac{\rho \int_{\Theta_0} f(x|\theta)\, d\theta}{\rho \int_{\Theta_0} f(x|\theta)\, d\theta + (1-\rho)\,[c_1/c_0] \int_{\Theta_1} f(x|\theta)\, d\theta} \]

Generic problem of evaluating an integral
\[ \mathfrak{I} = \mathbb{E}_f[h(X)] = \int_{\mathcal{X}} h(x)\, f(x)\, dx \]
where $\mathcal{X}$ is uni- or multidimensional, f is a closed form, partly closed form, or implicit density, and h is a function

Monte Carlo Principle

Use a sample $(x_1, \ldots, x_m)$ from the density f to approximate the integral $\mathfrak{I}$ by the empirical average
\[ \overline{h}_m = \frac{1}{m} \sum_{j=1}^{m} h(x_j) \]
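A minimal sketch of the Monte Carlo principle, assuming a sampler for f is available. The function and argument names are illustrative. The example approximates E[X²] = 1 for X ~ N(0, 1).

```python
import random

def monte_carlo_mean(h, sampler, m, seed=0):
    """Empirical average of h over m draws x_j = sampler(rng), approximating E_f[h(X)]."""
    rng = random.Random(seed)        # fixed seed so the run is reproducible
    return sum(h(sampler(rng)) for _ in range(m)) / m

# E[X^2] under the standard normal density; the exact value is 1
estimate = monte_carlo_mean(lambda x: x * x, lambda rng: rng.gauss(0.0, 1.0), m=100_000)
```

The error shrinks at the usual 1/√m rate, so 100,000 draws give roughly two correct decimals here.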
