Viva extended final
1. ATHENS UNIVERSITY OF ECONOMICS AND BUSINESS
DEPARTMENT OF STATISTICS
Efficient
Bayesian Marginal Likelihood
estimation in
Generalised Linear Latent Variable Models
thesis submitted by
Silia Vitoratou
advisors
Ioannis Ntzoufras
Irini Moustaki
Athens, 2013
3. Chapter 1
Key ideas and origins of the latent variable models (LVM).
“...co-relation must be the consequence of the variations
of the two organs being partly due to common causes ...“
Francis Galton, 1888.
• Suppose we want to draw inferences about concepts that cannot be measured directly
(such as emotions, attitudes, perceptions, proficiency, etc.).
• We assume that they can be measured indirectly through other observed
items.
• The key idea is that all dependencies among the p manifest variables (observed
items) are attributed to k latent (unobserved) ones.
• In principle, k << p. Hence, at the same time, the LVM methodology is a
multivariate analysis technique that aims to reduce the dimensionality with
as little loss of information as possible.
3
4. Chapter 1
A unified approach: Generalised linear latent variable
models (GLLVM).
The generalized linear latent variable model (GLLVM; Bartholomew & Knott, 1999; Skrondal
and Rabe-Hesketh, 2004) assumes that the linear predictors of the response variables are
linear combinations of the latent ones, and it consists of three components:
(a) the multivariate random component: where each observed item Yj, (j = 1, ..., p)
has a distribution from the exponential family (Bernoulli, Multinomial, Normal,
Gamma),
(b) the systematic component: where the latent variables Zℓ, ℓ = 1, ..., k, produce the
linear predictor ηj for each Yj
(c) the link function, which connects the previous two components.
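In standard GLLVM notation (assumed here), the three components can be written as:

```latex
% (a) random component: an exponential-family distribution for each item
Y_j \mid \mathbf{z} \sim \mathrm{ExpFam}, \qquad j = 1, \dots, p
% (b) systematic component: linear predictor built from the latent variables
\eta_j = \alpha_{j0} + \sum_{\ell=1}^{k} \alpha_{j\ell}\, z_\ell
% (c) link function v_j connecting the two components
\eta_j = v_j\!\bigl(\mathbb{E}[\,Y_j \mid \mathbf{z}\,]\bigr)
```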
4
5. Chapter 1
A unified approach: Generalised linear latent variable
models (GLLVM).
Special case: the generalized linear latent trait model with binary items (Moustaki
& Knott, 2000).
The conditionals are in this case Bernoulli(\pi_j(\mathbf{z})), where
\pi_j(\mathbf{z}) = P(Y_j = 1 \mid \mathbf{z}) is the conditional probability of a positive
response to the observed item. The logistic model is used for the response probabilities:
\operatorname{logit} \pi_j(\mathbf{z}) = \alpha_{j0} + \sum_{\ell=1}^{k} \alpha_{j\ell} z_\ell .
• The item parameters \alpha_{j0} and \alpha_{j\ell} are often referred to as the difficulty and
the discrimination parameters (respectively) of the item j.
All examples considered in this thesis refer to multivariate IRT (2-PL) models.
The current findings apply directly, or can be extended, to any type of GLLVM.
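As an illustration of the model just described, a minimal sketch (assuming numpy; the parameter values are purely illustrative and not taken from the thesis) simulating binary responses from a 2-PL latent trait model:

```python
import numpy as np

def simulate_2pl(N, alpha0, alpha, rng):
    """Simulate binary responses from a 2-PL latent trait model:
    logit P(Y_j = 1 | z) = alpha0_j + alpha_j' z, with z ~ N(0, I_k)."""
    p, k = alpha.shape
    z = rng.standard_normal((N, k))          # latent variables
    eta = alpha0 + z @ alpha.T               # linear predictor, shape (N, p)
    prob = 1.0 / (1.0 + np.exp(-eta))        # logistic response probabilities
    return (rng.uniform(size=(N, p)) < prob).astype(int)

rng = np.random.default_rng(0)
alpha0 = np.array([-1.0, -0.5, 0.0, 0.5, 1.0, 0.0])   # illustrative difficulties
alpha = rng.normal(0.8, 0.3, size=(6, 2))             # illustrative discriminations
y = simulate_2pl(600, alpha0, alpha, rng)
print(y.shape)
```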
5
6. Chapter 1
A unified approach: Generalised linear latent variable
models (GLLVM).
As only the p items can be observed, any inference must be based on their joint
distribution. All data dependencies are attributed to the existence of the latent
variables. Hence, the observed variables are assumed independent given the latent ones
(local independence assumption):
f(\mathbf{y}) = \int \prod_{j=1}^{p} g(y_j \mid \mathbf{z})\, \varphi(\mathbf{z})\, d\mathbf{z},
where \varphi(\mathbf{z}) is the prior distribution for the latent variables. A fully
Bayesian approach requires that the item parameter vector is also stochastic, associated
with a prior distribution.
6
7. Chapter 2
The fully Bayesian analogue: GLLTM with binary items
A) Priors
All model parameters are assumed a-priori independent. The priors follow
Ntzoufras et al. (2000) and Fouskakis et al. (2009). For a unique solution we use the
Cholesky decomposition on B.
7
8. Chapter 2
The fully Bayesian analogue: GLLTM with binary items
B) Sampling from the posterior
• A Metropolis-within-Gibbs algorithm, initially presented for IRT models by Patz and
Junker (1999), was used here for the multivariate case (k > 1).
• Each item is updated in one block, and so are the latent variables of each person.
C) Model evaluation
• In this thesis, the Bayes Factor (BF; Jeffreys, 1961; Kass and Raftery, 1995) was used for
model comparison.
• The BF is defined as the ratio of the posterior odds of two competing models (say m1
and m2) to their corresponding prior odds. Provided that the models have equal prior
probabilities, it is given by
BF_{12} = f(\mathbf{y} \mid m_1) \,/\, f(\mathbf{y} \mid m_2),
that is, the ratio of the two models’ marginal or integrated likelihoods (hereafter
Bayesian marginal likelihood; BML).
8
9. Chapter 2
Estimating the Bayesian marginal likelihood
The BML (also known as the prior predictive distribution) is defined as the
expected model likelihood over the prior of the model parameters:
f(\mathbf{y}) = \int f(\mathbf{y} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta},
which quite often is a high-dimensional integral, not available in closed form.
Monte Carlo integration is often used to estimate it, for instance via the
arithmetic mean:
\hat f(\mathbf{y}) = \tfrac{1}{R} \sum_{r=1}^{R} f(\mathbf{y} \mid \boldsymbol{\theta}^{(r)}),
\qquad \boldsymbol{\theta}^{(r)} \sim p(\boldsymbol{\theta}).
This simple estimator does not really work adequately, and a plethora of
Markov chain Monte Carlo (MCMC) techniques are employed instead in the
literature.
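A toy sketch of the arithmetic-mean estimator, on a conjugate normal model where the BML is available analytically via the candidate's formula (the model and all values are illustrative, not from the thesis); it works adequately here only because the prior and the posterior overlap substantially:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0.3, 1.0, size=20)            # toy data: y_i ~ N(theta, 1), theta ~ N(0, 1)
n, ybar = y.size, y.mean()

def log_norm_pdf(x, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mu) ** 2 / var

# Analytic log BML via the candidate's formula evaluated at theta = 0:
# log m(y) = log f(y|0) + log p(0) - log p(0|y), posterior N(n*ybar/(n+1), 1/(n+1))
post_mu, post_var = n * ybar / (n + 1), 1.0 / (n + 1)
log_m_true = (log_norm_pdf(y, 0.0, 1.0).sum() + log_norm_pdf(0.0, 0.0, 1.0)
              - log_norm_pdf(0.0, post_mu, post_var))

# Arithmetic-mean estimator: average the likelihood over draws from the prior
theta = rng.standard_normal(100_000)
loglik = log_norm_pdf(y[:, None], theta[None, :], 1.0).sum(axis=0)
log_m_hat = np.log(np.mean(np.exp(loglik - loglik.max()))) + loglik.max()
print(log_m_true, log_m_hat)
```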
9
10. Chapter 2
Estimating the Bayesian marginal likelihood
The point-based estimators (PBE) employ the candidate’s formula (Besag, 1989)
at a point of high density:
• Laplace-Metropolis (LM; Lewis & Raftery, 1997)
• Gaussian copula (GC; Nott et al, 2008)
• Chib & Jeliazkov (CJ; Chib & Jeliazkov, 2001)
The bridge sampling estimators (BSE) employ a bridge function, based on the
form of which several BML identities can be derived (even pre-existing ones):
• Harmonic mean (HM; Newton & Raftery, 1994)
• Reciprocal mean (RM; Gelfand & Dey, 1994)
• Bridge harmonic (BH; Meng & Wong, 1996)
• Bridge geometric (BG; Meng & Wong, 1996)
The path sampling estimators (PSE) employ a continuous and differentiable
path to link two unnormalised densities and compute the ratio of the
corresponding normalising constants:
• Power posteriors (PPT; Friel & Pettitt, 2008; Lartillot & Philippe, 2006)
• Stepping-stone (PPS; Xie et al., 2011)
• Generalised stepping-stone (IPS; Fan et al., 2011)
10
11. Chapter 3
The behavior of joint and marginal Monte Carlo estimators in multi-parameter latent variable models
Monte Carlo integration: the case of GLLVM
From the early literature, the methods applied for the parameter estimation of
model settings with latent variables relied either on the
joint likelihood (Lord and Novick, 1968; Lord, 1980)
or on the
marginal likelihood (Bock and Aitkin, 1981; Moustaki and Knott, 2000).
Under the conditional independence assumptions of the GLLVMs, there are two
equivalent formulations of the BML, which lead to different MC estimators: the
joint BML and the marginal BML.
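In standard notation (assumed here), the two formulations read:

```latex
\text{joint BML:}\quad
f(\mathbf{y}) = \iint f(\mathbf{y}\mid\mathbf{z},\boldsymbol{\theta})\,
                p(\mathbf{z})\, p(\boldsymbol{\theta})\, d\mathbf{z}\, d\boldsymbol{\theta},
\qquad
\text{marginal BML:}\quad
f(\mathbf{y}) = \int \Bigl[\int f(\mathbf{y}\mid\mathbf{z},\boldsymbol{\theta})\,
                p(\mathbf{z})\, d\mathbf{z}\Bigr] p(\boldsymbol{\theta})\, d\boldsymbol{\theta}.
```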
11
12. Chapter 3
The behavior of joint and marginal Monte Carlo estimators in multi-parameter latent variable models
Monte Carlo integration: the case of GLLVM
A motivating example
A simulated data set with p = 6 items, N = 600 cases, and k = 2 factors was considered.
Three popular BSE were computed under both approaches (R = 50,000 posterior
observations, after a burn-in period of 10,000 and a thinning interval of 10).
• BH: largest error difference, but rather close estimates.
• BG: largest difference in the estimates, without a large error difference.
The differences are due to Monte Carlo integration under independence assumptions.
12
13. Chapter 3
The behavior of joint and marginal Monte Carlo estimators in multi-parameter latent variable models
Monte Carlo integration: the case of GLLVM
The joint version of BH comes with a much higher MCE than the RM...
...but it is the joint version of RM that fails to converge to the true value. Why?
13
14. Chapter 3
The behavior of joint and marginal Monte Carlo estimators in multi-parameter latent variable models
Monte Carlo integration under independence
• Consider any integral of the form I = \int G(\mathbf{x})\, h(\mathbf{x})\, d\mathbf{x}.
• The corresponding MC estimator is \hat I = \tfrac{1}{R}\sum_{r=1}^{R} G(\mathbf{x}^{(r)}),
assuming a random sample of points \mathbf{x}^{(r)} drawn from h.
• The corresponding Monte Carlo error (MCE) is the standard error of \hat I.
• Assume independence, that is, G and h both factorise over the N components of
\mathbf{x}; hence the integral factorises into a product of N univariate integrals.
14
15. Chapter 3
The behavior of joint and marginal Monte Carlo estimators in multi-parameter latent variable models
Monte Carlo integration under independence
The two estimators are associated with different MCEs. Based on the early results of
Goodman (1962) on the variance of a product of N independent variables, the variances
of the estimators are given term by term. In finite settings, the difference can be
substantial.
15
16. Chapter 3
The behavior of joint and marginal Monte Carlo estimators in multi-parameter latent variable models
Monte Carlo integration under independence
In particular, the difference in the variances naturally depends on R. Note, however,
that it also depends on
• the dimensionality (N), since more positive terms are added, and
• the means and variances of the N variables involved.
At the same time, the difference in the means is given by the total covariation index
(TCI), a multivariate extension of the covariance:
• Under independence the index should be zero (the reverse statement does not hold).
• In the sample, the covariances, no matter how small, are non-zero, leading to a
non-zero TCI.
• It also depends on the number of variables (N), their means, and their variation
through the covariances.
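The variance gap can be seen numerically in a toy sketch (illustrative values, not from the thesis): for two independent variables, the "joint" estimator averages the products while the "marginal" estimator multiplies the separate averages; both are unbiased for the product of the means, but their variances differ:

```python
import numpy as np

rng = np.random.default_rng(2)
R, reps = 200, 2000
joint_est, marg_est = [], []
for _ in range(reps):
    x = rng.normal(1.0, 2.0, size=R)          # X with E[X] = 1, Var[X] = 4
    w = rng.normal(1.0, 2.0, size=R)          # W independent of X, same moments
    joint_est.append(np.mean(x * w))          # "joint": average of products
    marg_est.append(np.mean(x) * np.mean(w))  # "marginal": product of averages
joint_est, marg_est = np.array(joint_est), np.array(marg_est)
print(joint_est.var(), marg_est.var())        # the joint variance is clearly larger
```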
16
17. Chapter 3
The behavior of joint and marginal Monte Carlo estimators in multi-parameter latent variable models
Monte Carlo integration: the case of GLLVM
A motivating example, revisited
Different variables are being averaged, leading to different variance components.
The total covariance cancels out for the BH.
17
18. Chapter 3
The behavior of joint and marginal Monte Carlo estimators in multi-parameter latent variable models
Monte Carlo integration & independence
Refer to Chapter 3 of the current thesis for:
• more results on the error difference,
• properties of the TCI,
• the extension to conditional independence,
• and more illustrative examples.
18
19. Chapter 4
Bayesian marginal likelihood estimation using the Metropolis kernel in multi-parameter latent variable models
Basic idea
Based on the work of Chib & Jeliazkov (2001), it is shown in Chapter 4 that the
Metropolis kernel can be used to marginalise out any subset of the parameter vector
when this would not otherwise be feasible.
• Consider the kernel of the Metropolis–Hastings algorithm, which denotes the
transition probability of sampling \boldsymbol{\theta}', given that \boldsymbol{\theta}
has already been generated: the transition probability is the product of the
acceptance probability and the proposal density.
• Then, the latent vector can be marginalised out directly from the Metropolis kernel as
follows:
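In standard notation (assumed here), the Metropolis–Hastings kernel decomposes as:

```latex
k(\boldsymbol{\theta}' \mid \boldsymbol{\theta}, \mathbf{y})
  = \underbrace{\alpha(\boldsymbol{\theta}, \boldsymbol{\theta}' \mid \mathbf{y})}_{\text{acceptance probability}}
    \;\underbrace{q(\boldsymbol{\theta}' \mid \boldsymbol{\theta})}_{\text{proposal density}},
\qquad
\alpha(\boldsymbol{\theta}, \boldsymbol{\theta}' \mid \mathbf{y})
  = \min\!\left\{1,\;
    \frac{f(\mathbf{y}\mid\boldsymbol{\theta}')\, p(\boldsymbol{\theta}')\, q(\boldsymbol{\theta}\mid\boldsymbol{\theta}')}
         {f(\mathbf{y}\mid\boldsymbol{\theta})\, p(\boldsymbol{\theta})\, q(\boldsymbol{\theta}'\mid\boldsymbol{\theta})}\right\}.
```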
19
20. Chapter 4
Bayesian marginal likelihood estimation using the Metropolis kernel in multi-parameter latent variable models
Chib & Jeliazkov estimator
Let us suppose that the parameter space is divided into p blocks of parameters. Then,
using the law of total probability, the posterior ordinate at a specific point can be
decomposed into p conditional ordinates.
• If these are analytically available, the candidate’s formula (Besag, 1989) computes
the BML directly.
• If the full conditionals are known, Chib (1995) uses the output from the Gibbs
sampler to estimate them.
• Otherwise, Chib and Jeliazkov (2001) show that each posterior ordinate can be
estimated from the MH output, which requires p sequential MCMC runs.
20
21. Chapter 4
Bayesian marginal likelihood estimation using the Metropolis kernel in multi-parameter latent variable models
Chib & Jeliazkov estimator for models with latent vectors
The number of latent variables can be in the hundreds, if not thousands; hence the
method is time consuming. Chib & Jeliazkov suggest using the last ordinate to
marginalise out the latent vector, provided that it is analytically tractable (often it
is not).
In Chapter 4 of the thesis, it is shown that the latent vector can be marginalised out
directly from the MH kernel; hence the dimension of the latent vector is not an issue.
This observation, however, leads to another result. Assuming local independence, prior
independence, and a Metropolis-within-Gibbs algorithm, as in the case of the GLLVM,
the Chib & Jeliazkov identity is drastically simplified, so that the number of blocks
is not an issue either.
• The latent vector is marginalised out as previously.
• Moreover, even if there are p blocks of model parameters, only the full MCMC run is
required.
• The identity can also be used under data augmentation schemes that produce
independence.
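A toy sketch of the Chib & Jeliazkov single-block identity on a conjugate normal model (not the GLLTM; all settings illustrative), where the estimated log BML can be checked against the analytic value:

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(0.5, 1.0, size=30)          # toy data: y_i ~ N(theta, 1), theta ~ N(0, 1)
n, ybar = y.size, y.mean()

def log_norm(x, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mu) ** 2 / var

def log_kernel(th):                        # log f(y|th) + log p(th)
    return log_norm(y, th, 1.0).sum() + log_norm(th, 0.0, 1.0)

tau = 0.5                                  # random-walk proposal sd (illustrative)
def accept_prob(fr, to):                   # MH acceptance probability (symmetric q)
    return np.exp(min(0.0, log_kernel(to) - log_kernel(fr)))

draws, th = [], 0.0                        # Metropolis sampling from the posterior
for _ in range(20_000):
    prop = th + tau * rng.standard_normal()
    if rng.uniform() < accept_prob(th, prop):
        th = prop
    draws.append(th)
draws = np.array(draws[5_000:])            # discard burn-in

# CJ identity: pi(th*|y) = E_post[alpha(th, th*) q(th*|th)] / E_q[alpha(th*, th)]
th_star = draws.mean()                     # point of high posterior density
num = np.mean([accept_prob(t, th_star) * np.exp(log_norm(th_star, t, tau ** 2))
               for t in draws])
den = np.mean([accept_prob(th_star, t)
               for t in th_star + tau * rng.standard_normal(5_000)])
log_m_cj = log_kernel(th_star) - (np.log(num) - np.log(den))

# analytic check via the candidate's formula (posterior is conjugate normal)
log_m_true = log_kernel(th_star) - log_norm(th_star, n * ybar / (n + 1), 1.0 / (n + 1))
print(log_m_cj, log_m_true)
```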
21
22. Chapter 4
Bayesian marginal likelihood estimation using the Metropolis kernel in multi-parameter latent variable models
Independence Chib & Jeliazkov estimator
Three simulated data sets, under different scenarios. The CJI is compared with ML
estimators (R_total over 30 batches, with 1,000, 2,000, or 3,000 iterations per batch;
results reported from the 1st batch onwards).
22
23. Chapter 6
Implementation in simulated and real life datasets
Some results
•p =6 items,
•N=600 individuals,
•k=1 factor
kmodel = ktrue
23
24. Chapter 6
Implementation in simulated and real life datasets
Some results
•p =6 items,
•N=600 individuals,
•k=2 factors
kmodel = ktrue
24
25. Chapter 6
Implementation in simulated and real life datasets
Some results
•p =8 items,
•N=700 individuals,
•k=3 factors
kmodel = ktrue
25
26. Chapter 6
Implementation in simulated and real life datasets
Some results
•p =6 items,
•N=600 individuals,
•k=1 factor
kmodel <ktrue
26
27. Chapter 6
Implementation in simulated and real life datasets
Some results
•p =6 items,
•N=600 individuals,
•k=2 factors
kmodel >ktrue
27
28. Chapter 6
Implementation in simulated and real life datasets
Concluding comments
Refer to Chapter 4 of the current thesis (or see Vitoratou et al., 2013) for more details
on the implementation of the CJI. More comparisons are presented in Chapter 6 of the
thesis, on simulated and real data sets. Some comments:
• The harmonic mean failed in all cases.
• The BSE were successful in all examples.
o The BG estimator was consistently associated with the smallest error.
o The RM was also well behaved in all cases.
o The BH was associated with more error than the former two BSE.
• The PBE are well behaved:
o LM is very quick and efficient – but might fail if the posterior is not symmetrical.
o Similarly for the GC.
o The CJI is well behaved but time consuming. Since it is distribution-free, it can be
used as a benchmark method to get an idea of the BML.
28
29. Chapter 5
Thermodynamic assessment of probability distribution divergencies and Bayesian model comparison
Thermodynamics and Bayes
Ideas initially implemented in thermodynamics are currently explored in Bayesian
model evaluation.
Assume two unnormalised densities (q1 and q0) for which we are interested in the
ratio of their normalising constants (λ). For that purpose we use a continuous and
differentiable function, the geometric path, which links the endpoint densities through
a temperature parameter t ∈ [0, 1]:
q_t(\boldsymbol{\theta}) = q_1(\boldsymbol{\theta})^{t}\, q_0(\boldsymbol{\theta})^{1-t}
(in physical terms, a Boltzmann–Gibbs distribution with partition function z(t)).
Then the ratio λ can be computed via the thermodynamic integration identity (TI),
where log λ is the Bayes free energy:
\log \lambda = \int_0^1 E_{p_t}\!\left[\tfrac{d}{dt}\log q_t(\boldsymbol{\theta})\right] dt .
29
30. Chapter 5
Thermodynamic assessment of probability distribution divergencies and Bayesian model comparison
Thermodynamics and BML: Power Posteriors
The first application of the TI to the problem of estimating the BML is the power
posteriors (PP) method (Friel and Pettitt, 2008; Lartillot and Philippe, 2006). Let
q_0(\boldsymbol{\theta}) = p(\boldsymbol{\theta}) and
q_1(\boldsymbol{\theta}) = f(\mathbf{y} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta});
then the prior–posterior path yields the power posterior
p_t(\boldsymbol{\theta} \mid \mathbf{y}) \propto f(\mathbf{y} \mid \boldsymbol{\theta})^{t}\, p(\boldsymbol{\theta}),
leading via thermodynamic integration to the Bayesian marginal likelihood:
\log f(\mathbf{y}) = \int_0^1 E_{p_t}\!\left[\log f(\mathbf{y} \mid \boldsymbol{\theta})\right] dt .
For ts close to 0 we sample from densities close to the prior, where the variability is
typically high.
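A toy sketch of the power posterior method on a conjugate normal model (illustrative, not one of the thesis examples), where the power posterior can be sampled exactly and the log BML is known analytically:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(0.2, 1.0, size=25)          # toy data: y_i ~ N(theta, 1), theta ~ N(0, 1)
n, ybar = y.size, y.mean()

def loglik(th):                            # sum_i log N(y_i; th, 1) for a vector of th
    return (-0.5 * np.log(2 * np.pi) - 0.5 * (y[:, None] - th[None, :]) ** 2).sum(axis=0)

ts = np.linspace(0.0, 1.0, 40) ** 4        # schedule with more points near the prior
means = []
for t in ts:
    # the power posterior is conjugate here: N(t*n*ybar/(t*n + 1), 1/(t*n + 1))
    mu, var = t * n * ybar / (t * n + 1), 1.0 / (t * n + 1)
    th = mu + np.sqrt(var) * rng.standard_normal(4_000)
    means.append(loglik(th).mean())        # E_t[log f(y | theta)]
means = np.array(means)
log_m_ti = np.sum(0.5 * (means[1:] + means[:-1]) * np.diff(ts))  # trapezoidal rule

# analytic log BML via the candidate's formula at theta = 0
mu1, var1 = n * ybar / (n + 1), 1.0 / (n + 1)
log_m_true = (loglik(np.array([0.0]))[0] - 0.5 * np.log(2 * np.pi)
              - (-0.5 * np.log(2 * np.pi * var1) - 0.5 * mu1 ** 2 / var1))
print(log_m_ti, log_m_true)
```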
30
31. Chapter 5
Thermodynamic assessment of probability distribution divergencies and Bayesian model comparison
Thermodynamics and BML: Importance Posteriors
Lefebvre et al. (2010) considered options other than the prior for the zero endpoint,
keeping the unnormalised posterior at the unit endpoint. Any proper density g(·) will
do. An appealing option is to use an importance (envelope) function, that is, a density
as close as possible to the posterior. This defines the importance–posterior path and
the importance posterior. For ts close to 0 we sample from densities close to the
importance function, solving the problem of high variability.
31
32. Chapter 5
Thermodynamic assessment of probability distribution divergencies and Bayesian model comparison
An alternative approach: stepping-stone identities
Xie et al. (2011), using the prior and the posterior as endpoint densities, considered a
different approach to computing the BML, also related to thermodynamics (Neal, 1993).
First, the interval [0, 1] is partitioned into n points and the free energy is computed as
a telescoping sum of stepping-stone ratios:
\log \lambda = \sum_{i=0}^{n-1} \log \bigl( z(t_{i+1}) / z(t_i) \bigr).
• Under the power posteriors path, Xie et al. (2011) showed how the BML is obtained.
• Under the importance posteriors path, Fan et al. (2011) showed the corresponding
result.
However, the stepping-stone identity (SI) is even more general and can be used under
different paths, as an alternative to the TI:
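The corresponding stepping-stone sketch on the same kind of toy conjugate normal model (illustrative settings, not from the thesis): each stone estimates one ratio z(t_{i+1})/z(t_i) by importance sampling from the power posterior at t_i:

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(0.2, 1.0, size=25)          # toy data: y_i ~ N(theta, 1), theta ~ N(0, 1)
n, ybar = y.size, y.mean()

def loglik(th):
    return (-0.5 * np.log(2 * np.pi) - 0.5 * (y[:, None] - th[None, :]) ** 2).sum(axis=0)

ts = np.linspace(0.0, 1.0, 30) ** 3        # stones concentrated near the prior
log_m_ss = 0.0
for t0, t1 in zip(ts[:-1], ts[1:]):
    # sample the power posterior at t0 exactly (conjugate in this toy model)
    mu, var = t0 * n * ybar / (t0 * n + 1), 1.0 / (t0 * n + 1)
    ll = loglik(mu + np.sqrt(var) * rng.standard_normal(4_000))
    w = (t1 - t0) * ll                     # log importance weights for one stone
    log_m_ss += np.log(np.mean(np.exp(w - w.max()))) + w.max()

# analytic log BML via the candidate's formula at theta = 0
mu1, var1 = n * ybar / (n + 1), 1.0 / (n + 1)
log_m_true = (loglik(np.array([0.0]))[0] - 0.5 * np.log(2 * np.pi)
              - (-0.5 * np.log(2 * np.pi * var1) - 0.5 * mu1 ** 2 / var1))
print(log_m_ss, log_m_true)
```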
32
33. Chapter 5
Thermodynamic assessment of probability distribution divergencies and Bayesian model comparison
Path sampling identities for the BML- revisited
Hence, there are two general identities to compute a ratio of normalising constants
within the path sampling framework, namely the TI and the SI. Different paths lead to
different identities for the BML:
• Prior–posterior path: power posteriors (PPT; Friel and Pettitt, 2008; Lartillot and
Philippe, 2006) under the TI; stepping-stone (PPS; Xie et al., 2011) under the SI.
• Importance–posterior path: importance posteriors (IPT; inspired by Lefebvre et al.,
2010) under the TI; generalised stepping-stone (IPS; Fan et al., 2011) under the SI.
Other paths can be used, under both approaches, to derive identities for the BML or
any other ratio of normalising constants.
Hereafter, the identities will be named by the path employed, with a subscript denoting
the method implemented, e.g. IPS.
33
34. Chapter 5
Thermodynamic assessment of probability distribution divergencies and Bayesian model comparison
Thermodynamics & direct BF identities: Model switching
Lartillot and Philippe (2006) considered as endpoint densities the unnormalised
posteriors of two competing models, leading to the model switching path and, via
thermodynamic integration, to the Bayes factor (a bidirectional melting–annealing
sampling scheme). It is also easy to derive the SI counterpart expression:
34
35. Chapter 5
Thermodynamic assessment of probability distribution divergencies and Bayesian model comparison
Thermodynamics & direct BF identities: Quadrivials
Based on the idea of Lartillot and Philippe (2006), we may proceed with compound
paths, which consist of
• a hyper, geometric path, which links two competing models, and
• a nested, geometric path for each endpoint function Q_i, i = 0, 1.
The two intersecting paths form a quadrivial, which can be used with either the TI or
the SI approach. If the ratio of interest is the BF, the two BMLs should be derived at
the endpoints of [0, 1]. The PP and IP paths are natural choices for the nested part of
the identity. For the latter:
35
36. Chapter 5
Thermodynamic assessment of probability distribution divergencies and Bayesian model comparison
Sources of error in path sampling estimators
a) The integral over [0, 1] in the TI is typically approximated via numerical approaches,
such as the trapezoidal or Simpson’s rule (Neal, 1993; Gelman and Meng, 1998), which
require an n-point discretisation of [0, 1].
Note that the temperature schedule is also required for the SI method (it defines the
stepping-stone ratios). The discretisation introduces error into the TI and SI
estimators, referred to as the discretisation error. It can be reduced by (a) increasing
the number of points n and/or (b) assigning more points closer to the endpoint
associated with the higher variability.
b) At each point t_i, a separate MCMC run is performed with the corresponding p_{t_i}
as target distribution; hence, Monte Carlo error also occurs at each run.
c) A third source of error is the path-related error.
We may gain insight into (a) and (c) by considering the measures of entropy related to
the TI.
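A small sketch (function name illustrative) of a power-law temperature schedule that assigns more points near t = 0, where the prior–posterior path is typically most variable:

```python
import numpy as np

def power_schedule(n, c=5.0):
    """Temperature schedule t_i = (i/n)^c; c > 1 concentrates points near t = 0,
    where the power-posterior integrand is most variable."""
    return (np.arange(n + 1) / n) ** c

ts = power_schedule(10, c=5.0)
print(np.round(ts, 4))   # more than half of the points fall below t = 0.05
```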
36
37. Chapter 5
Thermodynamic assessment of probability distribution divergencies and Bayesian model comparison
Performance: Pine data-a simple regression example
Measurements were taken on 42 specimens. A linear regression model was fitted for the
specimens’ maximum compressive strength (y), using their density (x) as the independent
variable.
The objective in this example is to illustrate how each method-and-path combination
responds to prior uncertainty. To do so, we use three different prior schemes.
The ratios of the corresponding BMLs under the three priors were estimated over n1 = 50
and n2 = 100 evenly spaced temperatures. At each temperature, a Gibbs algorithm was
implemented and 30,000 posterior observations were generated, after discarding 5,000 as
a burn-in period.
37
38. Chapter 5
Thermodynamic assessment of probability distribution divergencies and Bayesian model comparison
Performance: Pine data-a simple regression example
Implementing a uniform temperature schedule, the results reflect differences in the
path-related error and in the discretisation error; all quadrivials come with a smaller
batch-mean error.
Note: PP works just fine under a geometric temperature schedule that samples more
points from the prior.
38
39. Chapter 5
Thermodynamic assessment of probability distribution divergencies and Bayesian model comparison
Thermodynamic integration & distribution divergencies
Based on the prior–posterior path, Friel and Pettitt (2008) and Lefebvre et al. (2010)
showed that the PP method is connected with the Kullback–Leibler divergence
(KL; Kullback & Leibler, 1951) and the associated relative, differential, and cross
entropies. Here we present their findings in a general form, that is, for any geometric
path: according to the TI, the difference of the expectations at the two endpoints
equals the symmetrised KL divergence (J-divergence).
39
40. Chapter 5
Thermodynamic assessment of probability distribution divergencies and Bayesian model comparison
Thermodynamic integration & distribution divergencies
Graphical representation of the TI
What about the intermediate points?
40
41. Chapter 5
Thermodynamic assessment of probability distribution divergencies and Bayesian model comparison
Thermodynamic integration & distribution divergencies
TI minus the free energy at each point: instead of integrating the mean energy over the
entire interval [0, 1], there is an optimal temperature where the mean energy equals the
free energy.
41
42. Chapter 5
Thermodynamic assessment of probability distribution divergencies and Bayesian model comparison
Thermodynamic integration & distribution divergencies
Graphical representation of the NTI
Functional KL: the difference in the KL-distance of the sampling distribution p_t from
p_1 and p_0. The ratio of interest occurs at the point t* where the sampling
distribution is equidistant from the endpoint densities.
42
43. Chapter 5
Thermodynamic assessment of probability distribution divergencies and Bayesian model comparison
Thermodynamic integration & distribution divergencies
The normalised thermodynamic integral
Hence:
•According to the PPT method, the BML occurs at the point where the sampling
distribution is equidistant from the prior and the posterior.
•According to the QMST method, the BF occurs at the point where the sampling
distribution is equidistant from the two posteriors.
The sampling distribution p_t is the Boltzmann–Gibbs distribution pertaining to the
Hamiltonian (energy function). Therefore,
•according to the NTI, when geometric paths are employed, the free energy
occurs at the point where the Boltzmann-Gibbs distribution is equidistant from the
distributions at the endpoint states.
43
44. Chapter 5
Thermodynamic assessment of probability distribution divergencies and Bayesian model comparison
Thermodynamic integration & distribution divergencies
Graphical representation of the NTI
What do the areas stand for?
44
45. Chapter 5
Thermodynamic assessment of probability distribution divergencies and Bayesian model comparison
Thermodynamic integration & distribution divergencies
The normalised thermodynamic integral and probability distribution divergencies
A key observation here is that the sampling distribution embodies the Chernoff coefficient
(Chernoff, 1952) :
Based on that, the NTI can be written as:
meaning that
and therefore, the areas correspond to the Chernoff t-divergence. At t=t*, we obtain
the so-called Chernoff information:
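In the standard notation (assumed here), the Chernoff coefficient of the endpoint densities is

```latex
c_t(p_0, p_1) = \int p_1(\boldsymbol{\theta})^{t}\, p_0(\boldsymbol{\theta})^{1-t}\, d\boldsymbol{\theta},
\qquad 0 < t < 1,
```

with the Chernoff t-divergence given by −log c_t(p_0, p_1), and the Chernoff information obtained at the maximising temperature t*.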
45
46. Chapter 5
Thermodynamic assessment of probability distribution divergencies and Bayesian model comparison
Thermodynamic integration & distribution divergencies
Using the output from path sampling, the Chernoff divergence can be computed easily
(see Chapter 5 of the thesis for a step-by-step algorithm). Along with the Chernoff
estimate, a number of other f-divergences can be directly estimated, namely
• the Bhattacharyya distance (Bhattacharyya, 1943) at t = 0.5,
• the Hellinger distance (Bhattacharyya, 1943; Hellinger, 1909),
• the Rényi t-divergence (Rényi, 1961), and
• the Tsallis t-relative entropy (Tsallis, 2001).
These measures of entropy are commonly used in
• information theory, pattern recognition, cryptography, machine learning,
• hypothesis testing
• and recently, in non-equilibrium thermodynamics.
46
47. Chapter 5
Thermodynamic assessment of probability distribution divergencies and Bayesian model comparison
Thermodynamic integration & distribution divergencies
Measures of entropy and the NTI
47
48. Chapter 5
Thermodynamic assessment of probability distribution divergencies and Bayesian model comparison
Path selection, temperature schedule and error.
These results also provide insight into the error of the path sampling estimators. To
begin with, Lefebvre et al. (2010) showed that the total variance is associated with the
J-divergence of the endpoint densities, and therefore with the choice of the path.
Graphically:
• the J-distance coincides with the slope of the secant defined at the endpoint
densities; the shape of the curve is a graphical representation of the total variance;
• the slope of the tangent at a particular point t_i coincides with the local variance;
local variances are higher at the points where the curve is steeper;
• the graphical representation of two competing paths provides information about the
estimators’ variances: paths with smaller cliffs are easier to take!
48
49. Chapter 5
Thermodynamic assessment of probability distribution divergencies and Bayesian model comparison
Path selection, temperature schedule and error.
Numerical approximation of the TI: assign more t_i's at the points where the curve is
steeper (higher local variances). This yields different levels of accuracy towards the
two endpoints; the discretisation error depends primarily on the path.
49
50. Future work
Currently developing an R library for BML estimation in the GLLTM, with Danny Arends.
Expand the results (and the R library) to account for other types of data.
Further study of the TCI (Chapter 3).
Use the ideas in Chapter 4 to construct a better Metropolis algorithm for GLLVMs.
Proceed further with the ideas presented in Chapter 5, with regard to the quadrivials,
the temperature schedule, and the optimal t*. Explore applications to information criteria.
50
51. Bibliography
Bartholomew, D. and Knott, M. (1999). Latent variable models and factor analysis. Kendall’s Library of Statistics, 7. Wiley.
Bhattacharyya, A. (1943). On a measure of divergence between two statistical populations defined by their probability distributions. Bulletin of
the Calcutta Mathematical Society, 35:99–109.
Besag, J. (1989). A candidate’s formula: A curious result in Bayesian prediction. Biometrika, 76:183.
Bock, R. and Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika,
46:443–459.
Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical
Statistics, 23(4).
Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90:1313–1321.
Chib, S. and Jeliazkov, I. (2001). Marginal likelihood from the Metropolis-Hastings output. Journal of the American Statistical Association,
96:270–281.
Fan, Y., Wu, R., Chen, M., Kuo, L., and Lewis, P. (2011). Choosing among partition models in Bayesian phylogenetics. Molecular Biology and
Evolution, 28(2):523–532.
Fouskakis, D., Ntzoufras, I., and Draper, D. (2009). Bayesian variable selection using cost-adjusted BIC, with application to cost-effective
measurement of quality of healthcare. Annals of Applied Statistics, 3:663–690.
Friel, N. and Pettitt, N. (2008). Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society Series B (Statistical
Methodology), 70(3):589–607.
Gelfand, A. E. and Dey, D. K. (1994). Bayesian Model Choice: Asymptotics and exact calculations. Journal of the Royal Statistical Society. Series
B (Methodological), 56(3):501–514.
Gelman, A. and Meng, X. (1998). Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Statistical
Science, 13(2):163–185.
Goodman, L. A. (1962). The variance of the product of K random variables. Journal of the American Statistical Association, 57:54–60.
Hellinger, E. (1909). Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. Journal für die reine und
angewandte Mathematik, 136:210–271.
Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A,
Mathematical and Physical Sciences, 186(1007):453–461.
Kass, R. and Raftery, A. (1995). Bayes factors. Journal of the American Statistical Association, 90:773–795.
Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22:49–86.
Lewis, S. and Raftery, A. (1997). Estimating Bayes factors via posterior simulation with the Laplace Metropolis estimator. Journal of the
American Statistical Association, 92:648–655.
Lartillot, N. and Philippe, H. (2006). Computing Bayes factors using Thermodynamic Integration. Systematic Biology, 55:195–207.
Lefebvre, G., Steele, R., and Vandal, A. C. (2010). A path sampling identity for computing the Kullback-Leibler and J divergences.
Computational Statistics and Data Analysis, 54(7):1719–1731.
Lord, F. M. (1980). Applications of Item Response Theory to practical testing problems.Erlbaum Associates, Hillsdale, NJ.
Lord, F. M. and Novick, M. R. (1968). Statistical theories of mental test scores. Addison-Wesley, Oxford, UK
51
52. Meng, X.-L. and Wong, W.-H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica
Sinica, 6:831–860.
Moustaki, I. and Knott, M. (2000). Generalized Latent Trait Models. Psychometrika, 65:391–411.
Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods.Technical Report CRG-TR-93-1, University of Toronto.
Newton, M. and Raftery, A. (1994). Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society, Series B, 56:3–48.
Nott, D., Kohn, R., and Fielding, M. (2008). Approximating the marginal likelihood using copula. arXiv:0810.5474v1. Available at
http://arxiv.org/abs/0810.5474v1
Ntzoufras, I., Dellaportas, P., and Forster, J. (2000). Bayesian variable and link determination for Generalised Linear Models. Journal of Statistical Planning
and Inference,111(1-2):165–180.
Patz, R. J. and Junker, B. W. (1999b). A straightforward approach to Markov chain Monte Carlo methods for item response models. Journal of Educational
and Behavioral Statistics, 24(2):146–178.
Rabe-Hesketh, S., Skrondal, A., and Pickles, A. (2005). Maximum likelihood estimation of limited and discrete dependent variable models
with nested random effects. Journal of Econometrics, 128:301–323.
Raftery, A. and Banfield, J. (1991). Stopping the Gibbs sampler, the use of morphology, and other issues in spatial statistics. Annals of the Institute of
Statistical Mathematics, 43(430):32–43.
Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Paedagogiske Institut, Copenhagen.
Rényi, A. (1961). On measures of entropy and information. In Proceedings of the 4th Berkeley Symposium on Mathematics, Statistics and Probability, pages
547–561.
Tsallis, C. (2001). In Nonextensive Statistical Mechanics and Its Applications, edited by S. Abe and Y. Okamoto. Springer-Verlag, Heidelberg. See also the
comprehensive list of references at http://tsallis.cat.cbpf.br/biblio.htm.
Vitoratou, S., Ntzoufras, I., and Moustaki, I. (2013). Marginal likelihood estimation from the Metropolis output: tips and tricks for efficient implementation in
generalized linear latent variable models. To appear in: Journal of Statistical Computation and Simulation.
Xie, W., Lewis, P., Fan, Y., Kuo, L., and Chen, M. (2011). Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Systematic
Biology, 60(2):150–160.
This thesis is dedicated to
52