Final degree project by Rubén Salgado, supervised by Carlos Maté: a Bayesian regression system applied to interval-valued data.
Bayesian Regression for Interval Data
Submission of the project by the student is authorized:
Rubén Salgado Fernández

THE PROJECT DIRECTOR
Carlos Maté Jiménez
Signed:        Date: 12/06/2007

APPROVAL OF THE PROJECTS COORDINATOR
Claudia Meseguer Velasco
Signed:        Date: 12/06/2007
UNIVERSIDAD PONTIFICIA DE COMILLAS
ESCUELA TÉCNICA SUPERIOR DE INGENIERÍA (ICAI)
INGENIERO EN ORGANIZACIÓN INDUSTRIAL

FINAL DEGREE PROJECT

Bayesian Regression System for Interval-Valued Data.
Application to the Spanish Continuous Stock Market

AUTHOR: Salgado Fernández, Rubén
MADRID, June 2007
Acknowledgements

Firstly, I would like to thank my director, Carlos Maté Jiménez, PhD, for giving me the chance to carry out this project. With him I have learnt not only about Statistics and research, but also how to enjoy them.

Special thanks to my parents. Their love, and all they have taught me in this life, have made me the person I am now.

Thanks to my brothers, my sister and the rest of my family for their support and for the time I stole from them.

Thanks to Charo for putting up with my bad mood in the bad moments, for supporting me and for giving me the inspiration to go ahead.

Madrid, June 2007
Resumen

In recent years, Bayesian methods have spread and have been used successfully in many varied fields such as marketing, medicine, engineering, econometrics and financial markets. The main characteristic that makes Bayesian data analysis (ANBAD, after its Spanish acronym) stand out against other alternatives is that it takes into account not only the objective information coming from the data of the event under study, but also the knowledge prior to it. The benefits obtained from this approach are many, since the greater the knowledge of the situation, the more reliable and accurate the decisions that can be taken. But it has not always been all advantages. Until a few years ago, ANBAD presented a series of difficulties that limited its development by researchers. Although the Bayesian methodology has existed as such for quite some time, it did not begin to be employed in a generalized way until the 1990s. This expansion has been favoured largely by advances in computing and by the improvement and refinement of different calculation methods, such as Markov chain Monte Carlo methods.

In particular, this methodology has proved extraordinarily useful when applied to regression models, which are widely adopted. In practice there are many situations in which the relationship between two quantitative variables needs to be analysed. The two fundamental objectives of this analysis are, on the one hand, to determine whether those variables are associated and in what sense the association occurs (that is, whether the values of one of the variables tend to increase, or decrease, as the values of the other increase); and, on the other, to study whether the values of one variable can be used to predict the value of the other. A regression model tries to provide information about one or more events through their relationship with the behaviour of others. The Bayesian methodology makes it possible to incorporate the researcher's knowledge into the analysis, making the results more precise, since they are not restricted to the data of one particular sample.

On the other hand, it is beginning to be accepted that, in the field of statistics, the twenty-first century will be the century of the "statistics of knowledge", in contrast to the previous one, which was that of the "statistics of data". The basic concept for building that statistics is the symbolic datum, and statistical methods have been developed for some types of symbolic data.

At present, the demands of the market and, in general, of the world keep growing. This implies an ever greater desire to predict the occurrence of an event, or to control the behaviour of certain quantities with the smallest possible error, in order to offer better products and to obtain greater profits, scientific advances and better results.

Against this background, this project tries to respond to those needs by providing extensive documentation on several of the most widely used and most advanced techniques of today, namely Bayesian data analysis, regression models and symbolic data, and by proposing different regression techniques. Likewise, a tool will be developed that allows all the acquired knowledge to be put into practice. This application will be aimed at the Spanish stock market and will let the user operate it in a simple and friendly way. For the development of this tool, one of the newest languages with the greatest future projection will be employed: R.

It is, therefore, a project that combines the newest techniques with the greatest projection both in theory, Bayesian regression applied to interval-valued data, and in practice, the use of the R language.
Abstract

In recent years, Bayesian methods have spread and been used successfully in many varied fields such as marketing, medicine, engineering, econometrics and financial markets. The main characteristic that makes Bayesian data analysis (BADAN) stand out against other alternatives is that it takes into account not only the objective information coming from the analysed event, but also the knowledge prior to it. The benefits obtained from this approach are numerous, because the more knowledge of the situation one has, the more reliable and accurate the decisions that can be taken. However, although the Bayesian methodology was established a long time ago, it was not applied in a general way until the 1990s because of its computational difficulties. Its expansion has been favoured mainly by the advances in that field and by the improvement of different computational methods, such as Markov chain Monte Carlo methods.

In particular, this Bayesian methodology has proved extraordinarily useful in its application to regression models, which are widely adopted. There are many real-life situations in which it is necessary to analyse the relationship between two quantitative variables. The two main objectives of this analysis are, on the one hand, to determine whether such variables are associated and in what sense that association comes about (that is, whether the values of one of the variables tend to rise, or to fall, as the values of the other increase); and, on the other hand, to study whether the values of one variable can be used to predict the value of the other. A regression model offers information about one or more events through their relationship with the behaviour of others. With the Bayesian methodology it is possible to add the researcher's knowledge to the analysis, thus making the results more accurate, since they are not restricted to the data of one particular sample.

On the other hand, it is increasingly accepted in the field of statistics that the twenty-first century will be the century of the "statistics of knowledge", in contrast to the last one, which was that of the "statistics of data". The basic concept on which to build such statistics is symbolic data, and statistical methods have been developed for some types of symbolic data.

Nowadays, the requirements of the market and the demands of the world in general keep growing. This implies a continuous increase in the desire to predict the occurrence of an event, or to control the behaviour of certain quantities with the minimum error, with the aim of offering better products and obtaining greater profits, scientific advances and better outcomes.

Within this frame, this project tries to respond to such needs by offering extensive documentation on several of the most widely applied and leading techniques of today, such as Bayesian data analysis, regression models and symbolic data, and by suggesting different regression techniques. Likewise, a tool has been developed that allows the reader to put all the acquired knowledge into practice. This application is aimed at the Spanish Continuous Stock Market and lets the user operate it easily. As for the development of this tool, one of the most innovative languages with the greatest future projection has been used: R.

The project therefore combines the most innovative techniques with the greatest projection, both in theoretical matters, such as Bayesian regression applied to interval-valued data, and in practical matters, such as the use of the R language.
List of Figures

1.1 Project Work Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Univariate Normal Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
6.1 Interval time series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.1 Classical Regression with single values in training set . . . . . . . . . . . . . . . . 73
7.2 Classical Regression with single values in testing set . . . . . . . . . . . . . . . . . 74
7.3 Classical Regression with interval-valued data . . . . . . . . . . . . . . . . . . . . 75
7.4 Centre Method (2000) in training set . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.5 Centre Method (2000) in testing set . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.6 Centre and Radius Method in training set . . . . . . . . . . . . . . . . . . . . . . . 77
7.7 Centre and Radius Method in testing set . . . . . . . . . . . . . . . . . . . . . . . . 78
7.8 Bayesian Centre and Radius Method in testing set . . . . . . . . . . . . . . . . . . 80
7.9 Classical Regression with single values in training set . . . . . . . . . . . . . . . . 81
7.10 Classical Regression with single values in testing set . . . . . . . . . . . . . . . . . 81
7.11 Centre Method (2000) in training set . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7.12 Centre Method (2000) in testing set . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.13 Centre and Radius Method in training set . . . . . . . . . . . . . . . . . . . . . . . 85
7.14 Centre and Radius Method in testing set . . . . . . . . . . . . . . . . . . . . . . . . 85
7.15 Bayesian Centre and Radius Method in testing set . . . . . . . . . . . . . . . . . . . 87
9.1 BARESIMDA MDI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
10.1 Interface between BARESIMDA and R . . . . . . . . . . . . . . . . . . . . . . . . 104
10.2 Interface between BARESIMDA and Excel . . . . . . . . . . . . . . . . . . . . . . 105
10.3 Logical Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Chapter 1
Introduction
1.1 Project Motivation
Statistics is primarily concerned with the analysis of data, either to assist in arriving at an improved understanding of some underlying mechanism, or as a means for making informed rational decisions. Both of these aspects generally involve some degree of uncertainty. The statistician's task is then to explain such uncertainty, and to reduce it to the extent that this is possible. Problems of this type occur throughout the physical, social and other sciences. One way of looking at statistics stems from the perception that, ultimately, probability is the only appropriate way to describe and systematically deal with uncertainty, as if it were the language for the logic of uncertainty. Thus, inference statements are precisely framed as probability statements on the possible values of the unknown quantities of interest (parameters or future observations), conditional on the observed, available data. The scientific discipline based on this understanding is called Bayesian Statistics. Moreover, the increasingly needed and sophisticated models, often hierarchical models, used to describe available data are typically too complex for conventional statistics to handle, but can be tackled within Bayesian Statistics. In principle, Bayesian Statistics is designed to handle all situations where uncertainty is found. Since some uncertainty is present in most aspects of life, it may be argued that Bayesian Statistics should be appreciated and used by everyone; it is the logic of contemporary society and science. According to [Rupp04], whether to apply the Bayesian methodology is no longer discussed; the question is when this has to be done.
Bayesian methods have matured and improved in several ways during the last fifteen years. They are becoming increasingly attractive to researchers, and successful applications of Bayesian data analysis have appeared in many different fields, including actuarial science, biometrics, finance, market research, marketing, medicine, engineering and social science. It is not only that the Bayesian approach produces appropriate answers to many current important problems; there is also an evident need for it, given the inapplicability of conventional statistics to many of them. The main characteristic offered by Bayesian data analysis is thus the possibility of incorporating the researcher's knowledge about the problem to be handled: the more precise the prior knowledge, the better and more reliable the results obtained. But Bayesian Statistics was held back until the mid-1990s by its computational complexity. Since then, it has expanded greatly, favoured by the development and improvement of computational methods in this field such as Markov chain Monte Carlo.
This methodology has proved extremely useful in its application to regression models, which are widely accepted. Let us remember that the general purpose of regression analysis is to learn more about the relationship between several independent or predictor variables and a dependent or criterion variable. The Bayesian methodology lets the researcher incorporate her or his knowledge into the analysis, improving the results since they do not depend on the sampled data alone.

On the other hand, datasets are increasingly so large that they must be summarized in some fashion, so that the resulting summary dataset is of a more manageable size while still retaining as much of the knowledge inherent in the entire dataset as possible. One consequence of this situation is that data may no longer be formatted as single values, as is the case for classical data, but rather may be represented by lists, intervals, distributions, and the like. These summarized data are examples of symbolic data. This kind of data also lets us better represent the knowledge and beliefs we hold in mind, which are hard to capture with classical statistics. According to [Bill02], this responds to the current need to move from a statistics of data in the past century to a statistics of knowledge in the twenty-first century.

Market and demand requirements are increasing continuously over time. This implies a need for better and more accurate methods to forecast new situations and to control different quantities with the minimum error, in order to supply better products and to obtain higher incomes, scientific advances and better results.
Facing this outlook, this project is intended to respond to those requirements by providing wide and exhaustive documentation about some of the currently most used and advanced techniques, including Bayesian data analysis, regression models and symbolic data. Different examples related to the Spanish Continuous Stock Market are explained throughout this text, making clear the advantages of employing the described methods. Likewise, a software tool with a user-friendly graphical interface has been developed to practise and check all the acquired knowledge.

Therefore, this is a project that combines the most recent techniques with major future implications in theoretical issues, such as Bayesian regression applied to interval-valued data, with a technological part dealing with the problem of interconnecting two software programs: one used to show the graphical user interface and the other one employed to make the computations.
Regarding a more personal motivation, when accepting this project, several factors were taken into consideration by the author:

• A great challenge: it is an ambitious project with a high technical complexity related to both its theoretical basis and its technological basis. This represents a very good letter of introduction for entering the labour world.

• Good timing: this project was designed to be finished before June 2007, which means being able to finish the degree in June and enter the labour world in September.

• Some very interesting issues: on the one hand, it deals with the ever-present need for forecasting and modelling observations and situations in order to get the best possible results. On the other hand, it focuses on the stock market, which matches my personal hobbies.

• A new programming language: the possibility of learning in depth a new and relatively recent programming language, such as R, was an extra motivating factor.

• The project director: Carlos Maté is considered a demanding and very competent director by the students of the university.

• A research scholarship: the possibility of being in the Industrial Organization department of the university, learning from people such as the director mentioned above and other highly recognized professors, was a great factor.
1.2 Objectives

This project pursues the following aims:

• To provide wide and rigorous documentation about the following issues: Bayesian data analysis, regression models and symbolic data. From this starting point, documentation about Bayesian regression will be developed, as well as the software tool designed.

• To build a software tool to fit Bayesian regression models to interval-valued data, finding the most efficient way to design the graphical user interface. This must be as user-friendly as possible.

• To find the most efficient way to offer that system to future clients, based on the tests carried out with the application.

• To design a survey to measure the quality of the tool and users' satisfaction.

• Possibly, to write an article for a scientific journal.
1.3 Methodology

As the title of the project indicates, the ultimate purpose is the development of an application aimed at stock markets and based on a Bayesian regression system; therefore, some previous knowledge is required.

The first stage is familiarization with Bayesian data analysis, regression models applied within the Bayesian methodology, and symbolic data.

Within this phase, Bayesian data analysis will be studied first, trying to synthesize and capture its most important elements. Special dedication will be given to posterior simulation and computational algorithms. Then regression models will be treated, quickly reviewing the classical approach before delving into the different Bayesian regression models, applying a great part of what was explained about the Bayesian methodology. Finally, this first stage will be completed with the application to symbolic data, paying special attention to interval-valued data.

The second stage concerns the development of the software application, employing an incremental methodology of programming and testing iterative prototypes. This methodology has been considered the most suitable for this project since it will let us introduce successive models into the application.

The following figure shows the structure of the work packages into which the project is divided:
Figure 1.1: Project Work Packages
Chapter 2
Bayesian Data Analysis
2.1 What is Bayesian Data Analysis?

Statistics can be defined as the discipline that provides us with a methodology to collect, organize, summarize and analyze a set of data.

Data analysis can be divided into two kinds: exploratory data analysis and confirmatory data analysis. The former is used to represent, describe and analyze a set of data through simple methods in the first stages of statistical analysis. The latter is applied to make inferences from data, based on probability models.

In the same way, confirmatory data analysis is divided into two branches depending on the adopted approach. The first one, known as frequentist, makes inferences from the data resulting from a sampling through classical methods. The second branch, known as Bayesian, goes further in the analysis and adds to those data the prior knowledge which the researcher has about the treated problem. Since it is not worthwhile to explain the frequentist approach in full here, an extended revision of different classical methods related to it can be found in [Mont02].

[Diagram: data analysis divides into exploratory and confirmatory; the confirmatory branch divides into frequentist and Bayesian.]
As far as Bayesian analysis is concerned, and according to [Gelm04], the process can be divided into the following three steps:

• To set up a full probability model, through a joint probability distribution for all observable and unobservable quantities in the problem.

• To condition on the observed data, obtaining the posterior distribution.

• Finally, to evaluate the fit of the model and the implications of the resulting posterior distribution.
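As a concrete illustration of these three steps, consider a minimal sketch (not code from this project, whose tool is written in R; all numbers below are made up) that infers the mean µ of a normal model with known σ on a discrete grid:

```python
import numpy as np

# Illustrative sketch of the three steps for the mean mu of a normal model
# with known sigma, using a crude grid; all numbers here are made up.
rng = np.random.default_rng(0)
sigma = 1.0
y = rng.normal(0.5, sigma, size=20)      # observed data

mu = np.linspace(-3, 3, 1201)            # grid over the unknown mean

# Step 1: full probability model -- prior f(mu) and sampling model f(y|mu).
prior = np.exp(-0.5 * mu**2)             # standard normal prior, up to a constant
loglike = np.array([-0.5 * np.sum((y - m) ** 2) / sigma**2 for m in mu])

# Step 2: condition on the observed data -- posterior proportional to
# prior times likelihood (normalised over the grid).
post = prior * np.exp(loglike - loglike.max())
post /= post.sum()

# Step 3: evaluate the implications, e.g. the posterior mean of mu.
post_mean = (mu * post).sum()

# Conjugate closed form for comparison: precision-weighted average of the
# prior mean 0 (precision 1) and the sample (precision n/sigma^2).
closed = (y.sum() / sigma**2) / (1.0 + len(y) / sigma**2)
print(abs(post_mean - closed) < 1e-3)    # the grid agrees with the closed form
```

The grid is only a didactic device; the closed-form conjugate result it reproduces is the subject of Section 2.2.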
The joint probability distribution f(θ, y) (where θ may equally be a vector of parameters; the same expressions hold in that case) is obtained by means of

    f(θ, y) = f(y|θ) f(θ)                                                    (2.1)

where y is the set of sampled data. So this distribution is the product of two densities, referred to as the sampling distribution f(y|θ) and the prior distribution f(θ).

The sampling distribution, as its name suggests, is the probability model that the researcher assigns to the statistic (or set of statistics) to be studied after the data have been observed. Here an important problem stands out in relation to the parametric approach: the probability model that the researcher chooses might not be adequate. The nonparametric approach overcomes this inconvenience, as will be seen later.

When y is considered fixed, so that f(y|θ) is a function of θ, the sampling distribution is called the likelihood function and obeys the likelihood principle, which states that, for a given sample of data, any two probability models f(y|θ) with the same likelihood function yield the same inference for θ.
The prior distribution does not depend upon the data. Accordingly, it contains the information and the knowledge that the researcher has about the situation or problem to be solved. When there is no significant previous population from which the researcher can draw knowledge, that is, when the researcher has no prior information about the problem, a non-informative prior distribution must be used in the analysis in order to let the data speak for themselves. Hence, it is assumed that the prior knowledge will have very little importance in the results. But most non-informative priors are "improper", in that they do not integrate to 1, and this fact can cause problems. In these cases it is necessary to be sure that the posterior distribution is proper. Another possibility is to use an informative prior distribution but with an insignificant weight (around zero) associated with it.
Though the prior distribution can take any form, it is common to choose particular classes of priors that make computation and interpretation easier: the conjugate priors. A conjugate prior distribution is one which, when combined with the likelihood function, gives a posterior that falls in the same class of distributions as the prior. Furthermore, according to [Koop03], a natural conjugate prior has the additional property that it has the same form as the likelihood. But it is not always possible to find this kind of distribution, and the researcher then has to handle many distributions in order to express his prior knowledge about the problem. This is another handicap that the nonparametric approach reduces.
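Conjugacy can be sketched with the textbook Beta-binomial pair (a hypothetical example, not one used later in this document): a Beta prior combined with a binomial likelihood yields a Beta posterior whose parameters are updated simply by the observed counts.

```python
import numpy as np

# Beta prior + binomial likelihood -> Beta posterior (conjugate family).
# All numbers are illustrative.
a0, b0 = 2.0, 2.0          # prior Beta(a0, b0), mildly informative around 0.5
y, n = 7, 10               # data: y successes out of n trials

# Conjugate update: the posterior stays in the Beta family.
a1, b1 = a0 + y, b0 + (n - y)
post_mean_closed = a1 / (a1 + b1)

# Cross-check with an explicit prior-times-likelihood computation on a grid.
theta = np.linspace(0.001, 0.999, 999)
prior = theta ** (a0 - 1) * (1 - theta) ** (b0 - 1)
like = theta ** y * (1 - theta) ** (n - y)
post = prior * like
post /= post.sum()
post_mean_grid = (theta * post).sum()

print(round(post_mean_closed, 3))   # 9/14, i.e. 0.643
print(round(post_mean_grid, 3))     # the grid computation agrees
```

The appeal of conjugacy is exactly what the two lines of the update show: no integration is needed, only bookkeeping on the prior parameters.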
In relation to the prior, what distribution should be chosen? There are three different points of view, corresponding to different styles of Bayesians:

• Classical Bayesians consider that the prior is a necessary evil, and that priors that interject the least information possible should be chosen.

• Modern parametric Bayesians consider that the prior is a useful convenience, and that priors with desirable properties such as conjugacy should be chosen. They remark that, given a distributional choice, prior hyper-parameters that interject the least information possible should be chosen.

• Subjective Bayesians give essential importance to the prior, in the sense that they consider it a summary of old beliefs. So prior distributions based on previous knowledge (either the results of earlier studies or non-scientific opinion) should be chosen.
Returning to the Bayesian data analysis process, simply conditioning on the observed data y and applying Bayes' Theorem, the posterior distribution, namely f(θ|y), yields

    f(θ|y) = f(θ, y) / f(y) = f(θ) f(y|θ) / f(y)                             (2.2)

where

    f(y) = ∫ f(θ) f(y|θ) dθ                                                  (2.3)

(the integral running over the whole parameter space) is known as the prior predictive distribution, since it is not conditional upon a previous observation of the process and is applied to an observable quantity.

An equivalent form of the posterior distribution displayed above omits the prior predictive distribution, since it does not involve θ and the interest lies in learning about θ. So, with fixed y, it can be said that the posterior distribution is proportional to the joint probability distribution f(θ, y).
Once the posterior distribution is calculated, some kind of summary measure will be required to estimate the uncertainty about the parameter θ. This is due to the fact that the posterior distribution is a high-dimensional object whose direct use is not practical for a problem. The measure that summarizes the posterior distribution can be the posterior mean, mode, median or variance, among others; the choice will depend on the requirements of the problem. So the posterior distribution has great importance, since it lets the researcher manage the uncertainty about θ and provides information about it, taking into account both the prior knowledge and the data collected by sampling.

According to [Maté06], it is not difficult to deduce that posterior inference will agree with the non-Bayesian one as long as the estimate which the researcher gives to the parameter θ is the same as the one resulting from the sampling.
Once the data y have been observed, a new unknown observable quantity ỹ from the same process can be predicted through the posterior predictive distribution, namely f(ỹ|y):

    f(ỹ|y) = ∫ f(ỹ, θ|y) dθ = ∫ f(ỹ|θ, y) f(θ|y) dθ = ∫ f(ỹ|θ) f(θ|y) dθ    (2.4)

To sum up, the basic idea is to update the prior distribution f(θ) through Bayes' theorem by observing the data y, in order to get a posterior distribution f(θ|y). Then a summary measure or a prediction for new data can be obtained from f(θ|y). Table 2.1 reflects what has been said.
Distribution   Expression                 Information Required               Result
Likelihood     f(y|θ)                     Data                               Distribution f(y|θ)
Prior          f(θ)                       Researcher's knowledge             Parameter distribution f(θ)
Joint          f(y|θ) f(θ)                Likelihood and prior distributions Distribution f(θ, y)
Posterior      f(θ) f(y|θ) / f(y)         Prior and joint distributions      Distribution f(θ|y)
Predictive     ∫ f(ỹ|θ) f(θ|y) dθ         New data and posterior             Distribution f(ỹ|y)

Table 2.1: Distributions in Bayesian Data Analysis
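The predictive step in the last row of the table, equation (2.4), can be sketched by simulation: draw θ from the posterior, then ỹ from the sampling model given that θ. The posterior N(10, 0.5) and σ = 2 below are made-up values for illustration, not figures from this project.

```python
import numpy as np

# Two-stage simulation of the posterior predictive distribution; the
# posterior N(10, 0.5) for theta and sigma = 2 are made-up values.
rng = np.random.default_rng(1)
sigma = 2.0
mu_post, v_post = 10.0, 0.5

# Draw theta ~ f(theta|y), then y_tilde ~ f(y_tilde|theta).
theta_draws = rng.normal(mu_post, np.sqrt(v_post), size=200_000)
y_tilde = rng.normal(theta_draws, sigma)

# Predictive uncertainty combines parameter and sampling uncertainty:
# Var(y_tilde) = v_post + sigma^2 = 4.5 here.
print(round(y_tilde.mean(), 1))   # approximately 10.0
print(round(y_tilde.var(), 1))    # approximately 4.5
```

The two-stage draw is exactly the integral in (2.4) evaluated by Monte Carlo, and shows why predictive intervals are always wider than posterior intervals for θ.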
2.2 Bayesian Analysis for Normal and other distributions
2.2.1 Univariate Normal distribution
The basic model to be discussed concerns an observable variable y, normally distributed with mean µ and unknown variance σ²:

    y | µ, σ² ~ N(µ, σ²)                                                     (2.5)

As can be seen in Appendix A, the likelihood function for a single observation is

    f(y | µ, σ²) ∝ (σ²)^(−1/2) exp( −(y − µ)² / (2σ²) )                      (2.6)

This means that the likelihood function is proportional to a Normal distribution, omitting those terms that are constant.
Now let us consider that we have n independent observations y1, y2, ..., yn. According to the previous section, the parameters to be estimated are

    θ = (θ1, θ2) = (µ, σ²)                                                   (2.7)

A full probability model must be set up through a joint probability distribution:

    f(θ, (y1, y2, ..., yn)) = f(θ, y) = f(y|θ) f(θ)                          (2.8)

The likelihood function for a sample of n iid observations in this case is

    f(y|θ) = f(y|µ, σ²) ∝ (σ²)^(−n/2) exp( −(1/(2σ²)) Σ_{i=1}^{n} (yi − µ)² )    (2.9)
As recommended previously, a conjugate prior will be chosen; in fact, it will be a natural conjugate prior. According to [Gelm04], this likelihood function suggests a conjugate prior distribution of the form

    f(θ) = f(µ, σ²) = f(µ|σ²) f(σ²)                                          (2.10)

where the marginal distribution of σ² is the Scaled Inverse-χ² and the conditional distribution of µ given σ² is Normal (details about these distributions in Appendix A):

    µ | σ² ~ N(µ0, σ² V0)                                                    (2.11)
    σ² ~ Inv-χ²(ν0, s0²)                                                     (2.12)

So the joint prior distribution is

    f(θ) = f(µ, σ²) = f(µ|σ²) f(σ²) ∝ N-Inv-χ²(µ0, s0² V0; ν0, s0²)          (2.13)

Its four parameters can be identified as the location and scale of µ and the degrees of freedom and scale of σ², respectively.
As a natural conjugate prior was employed, the posterior joint distribution will have the same form as the prior. So, conditioning on the data and according to Bayes' Theorem, we have

    f(θ|y) = f(µ, σ²|y) ∝ f(y|µ, σ²) f(µ, σ²) ∝ N-Inv-χ²(µ1, s1² V1; ν1, s1²)    (2.14)

where it can be shown that

    µ1 = (V0⁻¹ + n)⁻¹ (V0⁻¹ µ0 + n ȳ)                                        (2.15)
    V1 = (V0⁻¹ + n)⁻¹                                                        (2.16)
    ν1 = ν0 + n                                                              (2.17)
    ν1 s1² = ν0 s0² + (n − 1) s² + (V0⁻¹ n / (V0⁻¹ + n)) (ȳ − µ0)²           (2.18)
All these formulae show that Bayesian inference combines prior and sample information.

The first expression means that the posterior mean µ1 is a weighted mean of the prior mean µ0 and the empirical mean ȳ, divided by the sum of their respective weights, these being represented by V0⁻¹ and the sample size n.

The second expression represents the weight that the posterior mean carries, and it can be seen as a compromise between the sample size and the significance given to the prior mean.

The third expression indicates that the degrees of freedom of the posterior variance are the sum of the prior degrees of freedom and the sample size. That is, the prior degrees of freedom can be understood as a fictitious sample size on which the expert's prior information is based.

The last expression explains the posterior sum of square errors as a combination of the prior and empirical sums of square errors, plus a term that measures the conflict between prior and sample information. A more detailed explanation of this last step can be found in [Gelm04], [Koop03] or [Cong06].
From this, the conditional and marginal posterior distributions are:
µ|σ², y ~ N(µ₁, σ²V₁)   (2.19)
σ²|y ~ Inv-χ²(ν₁, s₁²)   (2.20)
If we integrate out σ², the marginal for µ will be a t-distribution (see Appendix A for details):
µ|y ~ t_ν₁(µ₁, s₁²V₁)   (2.21)
Let us see an application to the Spanish stock market. Suppose that the monthly close values of the Ibex 35 are normally distributed. If we take the values at which the Spanish index closed during the first two weeks of January 2006, it can be shown that the mean was 10893.29 and the standard deviation was 61.66. The non-Bayesian approach would therefore infer a Normal distribution with this mean and standard deviation. Now suppose we had asked an analyst about the evolution of the Ibex 35 in January, and he had firmly stated that it would decrease slightly, that the mean close value at the end of the month would be around 10870 and that, hence, the standard deviation would be higher, around 100. Then, according to the previous formulas, the posterior parameters would be
µ₁ = (100 + 10)⁻¹(100 × 10870 + 10 × 10893.29) = 10872.12
V₁ = (100 + 10)⁻¹ = 0.0091
ν₁ = 100 + 10 = 110
s₁ = √[(100 × 100² + 9 × 61.66² + (1000/110)(10893.29 − 10870)²)/110] = 97.19
This means that there is a difference of almost 20 points between the Bayesian and the non-Bayesian estimates of the mean close value for January. Once January had passed, we could compare both results and note that the Bayesian estimates were closer to the actual mean close value and standard deviation, which turned out to be 10871.2 and 112.44. Figure 2.1 shows how the Bayesian estimate (blue) lies closer to the actual mean close value (cyan) than the frequentist estimate (red).
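As an illustration, the update formulas (2.15)-(2.18) and the Ibex 35 numbers above can be checked with a short script. This is only a sketch; the function name and the flat treatment of the inputs are our own choices:

```python
import math

def normal_posterior(mu0, V0_inv, nu0, s0_sq, ybar, s_sq, n):
    """Conjugate N-Inv-chi^2 update for a Normal mean and variance.

    mu0, V0_inv : prior mean and its weight V0^-1
    nu0, s0_sq  : prior degrees of freedom and scale of sigma^2
    ybar, s_sq  : sample mean and sample variance of the n observations
    """
    mu1 = (V0_inv * mu0 + n * ybar) / (V0_inv + n)            # (2.15)
    V1 = 1.0 / (V0_inv + n)                                   # (2.16)
    nu1 = nu0 + n                                             # (2.17)
    nu1_s1_sq = (nu0 * s0_sq + (n - 1) * s_sq                 # (2.18)
                 + (V0_inv * n / (V0_inv + n)) * (ybar - mu0) ** 2)
    return mu1, V1, nu1, nu1_s1_sq / nu1

# Ibex 35 example: analyst prior mu0 = 10870 with weight 100, prior scale
# s0 = 100 with nu0 = 100; n = 10 close values with mean 10893.29, s.d. 61.66
mu1, V1, nu1, s1_sq = normal_posterior(10870, 100, 100, 100 ** 2,
                                       10893.29, 61.66 ** 2, 10)
s1 = math.sqrt(s1_sq)
```

With these inputs the script reproduces µ₁ = 10872.12, V₁ ≈ 0.0091 and ν₁ = 110.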
2.2.2 Multivariate Normal distribution
Now, let us consider that we have an observable vector y of d components with the multivariate
Normal distribution:
y ~ N(µ, Σ)   (2.22)
where the first parameter is the mean column vector and the second one is the variance-covariance
matrix.
Extending what was said above to the multivariate case, we have:
[Figure 2.1: Univariate Normal Example — density estimates of the January mean close value under the frequentist approach (red) and the Bayesian approach (blue), together with the real mean close value in January (cyan).]
f(y|µ, Σ) ∝ |Σ|^(−1/2) exp(−½(y − µ)′Σ⁻¹(y − µ))   (2.23)
And for n iid observations:
f(y₁, y₂, . . . , yₙ|µ, Σ) ∝ |Σ|^(−n/2) exp(−½ ∑ᵢ₌₁ⁿ (yᵢ − µ)′Σ⁻¹(yᵢ − µ))   (2.24)
A multivariate generalization of the Scaled Inverse-χ² is the Inverse-Wishart distribution (see details in Appendix A), so the joint prior distribution is
f(θ) = f(µ, Σ) ∝ N-Inv-Wishart(µ₀, Λ₀/k₀, ν₀, Λ₀)   (2.25)
due to the fact that
µ|Σ ~ N(µ₀, Σ/k₀)   (2.26)
Σ ~ Inv-Wishart(ν₀, Λ₀⁻¹)   (2.27)
Expression:
  Univariate:   y ~ N(µ, σ²)
  Multivariate: y ~ N(µ, Σ)
Parameters to estimate:
  Univariate:   µ, σ²
  Multivariate: µ, Σ
Prior distributions:
  Univariate:   µ|σ² ~ N(µ₀, σ₀²/k₀);  σ² ~ Inv-χ²(ν₀, σ₀²);  µ, σ² ~ N-Inv-χ²(µ₀, σ₀²/k₀, ν₀, σ₀²)
  Multivariate: µ|Σ ~ N(µ₀, Σ/k₀);  Σ ~ Inv-Wishart(ν₀, Λ₀⁻¹);  µ, Σ ~ N-Inv-Wishart(µ₀, Λ₀/k₀, ν₀, Λ₀)
Posterior distributions:
  Univariate:   µ|σ², y ~ N(µ₁, σ₁²/k₁);  σ²|y ~ Inv-χ²(ν₁, σ₁²);  µ, σ²|y ~ N-Inv-χ²(µ₁, σ₁²/k₁, ν₁, σ₁²)
  Multivariate: µ|Σ, y ~ N(µ₁, Σ/k₁);  Σ|y ~ Inv-Wishart(ν₁, Λ₁⁻¹);  µ, Σ|y ~ N-Inv-Wishart(µ₁, Λ₁/k₁, ν₁, Λ₁)

Table 2.2: Comparison between Univariate and Multivariate Normal
The posterior results are analogous to those given for the univariate case, but applying these distributions. Interested readers can find more information in [Gelm04] or [Cong06].
A summary is shown in Table 2.2 in order to capture the most important ideas.
2.2.3 Other distributions
As has just been done with the Normal distribution, a Bayesian analysis could be carried out for other distributions. For instance, the exponential distribution is commonly used in reliability analysis. Because this project will deal with the Normal distribution for the likelihood, the analysis with other distributions will not be explained in detail. Table 2.3 shows the conjugate prior and posterior distributions
for other likelihood distributions. More details can be found in [Cong06], [Gelm04], or [Rossi06].
Likelihood     Parameter   Conjugate Prior   Hyperparameters   Posterior Hyperparameters
Bin(y|n, θ)    θ           Beta              α, β              α + y, β + n − y
P(y|θ)         θ           Gamma             α, β              α + nȳ, β + n
Exp(y|θ)       θ           Gamma             α, β              α + 1, β + y
Geo(y|θ)       θ           Beta              α, β              α + 1, β + y

Table 2.3: Conjugate distributions for other likelihood distributions
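As a quick sketch of how the table is used in practice, the Beta-Binomial and Gamma-Poisson rows can be coded directly; the numerical inputs below are invented purely for illustration:

```python
def beta_binomial_update(alpha, beta, y, n):
    """Beta(alpha, beta) prior + Bin(y|n, theta) likelihood -> Beta posterior."""
    return alpha + y, beta + n - y

def gamma_poisson_update(alpha, beta, ys):
    """Gamma(alpha, beta) prior + Poisson likelihood -> Gamma posterior."""
    return alpha + sum(ys), beta + len(ys)

# A Beta(2, 2) prior and y = 7 successes observed in n = 10 trials
a1, b1 = beta_binomial_update(2, 2, 7, 10)
post_mean = a1 / (a1 + b1)          # posterior mean of theta

# A Gamma(1, 1) prior and Poisson counts 2, 3, 4
a2, b2 = gamma_poisson_update(1, 1, [2, 3, 4])
```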
2.3 Hierarchical Models
Hierarchical data arise when observations are structured in groups or otherwise related among themselves. When this occurs, standard techniques either assume that these groups belong to entirely different populations or ignore the aggregate information entirely.
Hierarchical models provide a way of pooling the information for the disparate groups without
assuming that they belong to precisely the same population.
Suppose we have collected data about some random variable Y from m different populations with
n observations for each population.
Let yᵢⱼ represent observation j from population i. Now suppose yᵢⱼ ~ f(θᵢ), where θᵢ is a vector of parameters for population i. Furthermore, θᵢ ~ f(Θ), where Θ may also be a vector. Up to this point, we have only rewritten what was said previously.
Now let us extend the model and assume that the parameters Θ that govern the distribution of the θ's are themselves random variables, and assign a prior distribution to these variables as well:
Θ ~ f(ψ)   (2.28)
where the distribution f(ψ) is called the hyperprior. The vector parameter ψ of the hyperprior may be "known" and represent our prior beliefs about Θ; in theory, we can also assign a probability distribution to these quantities as well, and proceed to another layer of hierarchy.
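The two-level structure just described can be sketched with a small generative simulation; the hyperparameter values below are invented purely for illustration:

```python
import random

random.seed(0)

# Each population mean theta_i is drawn from the hyperdistribution
# f(theta | Theta), and each observation y_ij from f(y | theta_i).
m, n = 5, 20                        # populations, observations per population
Theta_mean, Theta_sd = 0.0, 2.0     # hyperparameters, assumed known here
sigma = 1.0                         # within-population standard deviation

theta = [random.gauss(Theta_mean, Theta_sd) for _ in range(m)]
y = [[random.gauss(theta[i], sigma) for _ in range(n)] for i in range(m)]
```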
According to [Gelm04], the idea of exchangeability will be used to create a joint probability
distribution model for all the parameters θ. A formal definition to explain what exchangeability
consists of is:
”The parameters θ1 , θ2 , . . . , θn are exchangeable in their joint distribution if f (θ1 , θ2 , . . . , θn ) is
invariant to permutations in the index 1, 2, . . . , n”.
This means that if no information other than the data is available to distinguish any of the θi from
any of the others, and no ordering of the parameters can be made, one must assume symmetry among
the parameters in the prior distribution. So we can treat the parameters for each sub-population as
exchangeable units. This can be formulated by:
f(θ₁, θ₂, . . . , θₙ|Θ) = ∏ᵢ₌₁ⁿ f(θᵢ|Θ)   (2.29)
The joint prior distribution is now:
f(θ₁, θ₂, . . . , θₙ, Θ) = f(θ₁, θ₂, . . . , θₙ|Θ)f(Θ)   (2.30)
And conditioning on the data, it yields:
f(θ₁, θ₂, . . . , θₙ, Θ|y) ∝ f(θ₁, θ₂, . . . , θₙ, Θ)f(y|θ₁, θ₂, . . . , θₙ)   (2.31)
Perhaps the most important point in practice is that non-hierarchical models are usually inappro-
priate for hierarchical data, while non-hierarchical data can be modelled following the hierarchical
structure and assigning concrete values to the hyperprior parameters.
These kinds of models will be used in Bayesian regression models with autocorrelated errors, as will be seen in the following chapters.
For more details about Bayesian hierarchical models, the reader is referred to [Cong06], [Gelm04] and [Rossi06].
2.4 Nonparametric Bayesian
To overcome the limitations that have been mentioned throughout this chapter, the nonparametric approach relaxes the restrictions of the parametric one. This kind of analysis can be performed through the so-called Dirichlet process, which allows us to express in a simple way the prior distribution of F, or of the distribution family of F, where F is the distribution function of the variable under study. This process has a parameter, a measure α, which once normalized yields a distribution function.
According to [Mate06], a Dirichlet process for F(t) requires knowing:
• A prior proposal for F(t), denoted F₀(t), which corresponds to the distribution function expressing the prior knowledge the engineer has, given by
F₀(t) = α(t)/M   (2.32)
• A measure of the confidence in the prior proposal, denoted by M, whose values can vary between 0 and ∞, depending on whether there is total confidence in the data or in the prior proposal, respectively.
It can be shown that the posterior estimate of F(t), F̂ₙ(t), after sampling n data points, is given by
F̂ₙ(t) = pₙF₀(t) + (1 − pₙ)Fₙ(t)   (2.33)
where Fₙ(t) is the empirical distribution function and pₙ = M/(M + n).
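Equation (2.33) is simple enough to compute directly. The sketch below assumes a Uniform(0, 1) prior proposal F₀ and a handful of invented data points:

```python
def dp_posterior_cdf(t, data, F0, M):
    """Posterior mean CDF under a Dirichlet process prior with base F0 and
    confidence M: the mixture p_n*F0(t) + (1 - p_n)*F_n(t), p_n = M/(M + n)."""
    n = len(data)
    p_n = M / (M + n)
    F_emp = sum(1 for x in data if x <= t) / n   # empirical CDF F_n(t)
    return p_n * F0(t) + (1 - p_n) * F_emp

F0 = lambda t: min(max(t, 0.0), 1.0)         # prior proposal: Uniform(0, 1) CDF
data = [0.2, 0.4, 0.9, 0.95]                 # n = 4 observations, p_n = 0.5 for M = 4
val = dp_posterior_cdf(0.9, data, F0, M=4)   # 0.5*0.9 + 0.5*0.75 = 0.825
```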
More detailed information about the nonparametric approach and how Dirichlet processes are used can be found in [Mull04] or [Gosh03].
With this approach, not only is the parametric limitation concerning the probability model of the variable under study avoided, since no such hypothesis is required, but it also allows us to give a quantified importance to the prior knowledge provided by the engineer, depending on the confidence in the certainty of this knowledge.
Chapter 3
Posterior Simulation
3.1 Introduction
A practical problem with Bayesian inference is the difficulty of summarizing realistically complex posterior distributions. In most practical problems, posterior densities will not take the form of any well-known and well-understood density, so summary statistics, such as the posterior mean and variance of the parameters of interest, will not be analytically available. It is at this point where the importance of Bayesian computation arises, and computational tools are required to gain meaningful inference from the posterior distribution. Its importance is such that the computing revolution of the last 20 years has led to a blossoming of Bayesian methods in many fields such as Econometrics, Ecology or Health.
In this regard, the most important simulation methods are the Markov chain Monte Carlo (MCMC) methods. MCMC methods date from the original work of [Metr53], who were interested in methods for the efficient simulation of the energy levels of atoms in a crystalline structure. The original idea was subsequently generalized by [Hast70], but its true potential was not fully realized within the statistical literature until [Gelf90] demonstrated its application to the estimation of integrals commonly occurring in the context of Bayesian statistical inference.
As [Berg05] points out, the underlying principle is simple: if one wishes to sample randomly from a specific probability distribution, then design a Markov chain whose long-time equilibrium is that distribution, write a computer program to simulate the Markov chain, run it for a time long enough to be confident that approximate equilibrium has been attained, and then record the state of the Markov chain as an approximate draw from equilibrium.
The technique has been developed strongly in different fields and with rather different emphases
in the computer science community concerned with the study of random algorithms (where the em-
phasis is on whether the resulting algorithm scales well with increasing size of the problem), in the
spatial statistics community (where one is interested in understanding what kinds of patterns arise
from complex stochastic models), and also in the applied statistics community (where it is applied
largely in Bayesian contexts, enabling researchers to formulate statistical models which would other-
wise be resistant to effective statistical analyses).
The development of the theoretical work also benefits the development of statistical applications.
The MCMC simulation techniques have been applied to develop practical statistical inferences for
almost all problems in (bio) statistics, for example, the problems in longitudinal data analysis, im-
age analysis, genetics, contagious disease epidemics, random spatial pattern, and financial statistical
models such as GARCH and stochastic volatility.
The simplicity of the underlying principle of MCMC is a major reason for its success. However
a substantial complication arises as the underlying target problem becomes more complex; namely,
how long should one run the Markov chain so as to ensure that it is close to equilibrium? According to [Gelm04], n = 100 independent samples should be enough for reasonable posterior summaries, but in some cases more samples are needed to ensure greater accuracy.
3.2 Markov chains
The essential theory required in developing Monte Carlo methods based on Markov chains is pre-
sented here. The most fundamental result is that certain Markov chains converge to a unique invariant
distribution, and can be used to estimate expectations with respect to this distribution. But in order to
reach this conclusion, some concepts need to be defined first.
A Markov chain is a series of random variables, X₀, . . . , Xₙ, also called a stochastic process, in which only the value of Xₙ₋₁ influences the distribution of Xₙ. Formally:
P(Xₙ = xₙ|X₀ = x₀, . . . , Xₙ₋₁ = xₙ₋₁) = P(Xₙ = xₙ|Xₙ₋₁ = xₙ₋₁)   (3.1)
where the Xn−1 have a common range called the state space of the Markov chain.
The common language used to refer to the different situations in which a Markov chain can be found is the following. If Xₙ = i, it is said that the chain is in state i at step n, or that it has the value i at step n. This language gives the chain a certain dynamic character, which is reflected in the main tool for studying it: the transition probabilities P(Xₙ₊₁ = j|Xₙ = i), collected in the transition matrix P = (Pᵢⱼ), where Pᵢⱼ = P(Xₙ₊₁ = j|Xₙ = i) is the probability of moving from state i to state j.
Since in most interesting applications Markov chains are homogeneous, the transition matrix can be defined from the initial probabilities, P(X₁ = j|X₀ = i). In this respect, a Markov chain Xₜ is homogeneous if P(Xₙ₊₁ = j|Xₙ = i) = P(X₁ = j|X₀ = i) for all n, i, j.
Furthermore, using the Chapman-Kolmogorov equation, it can be shown that, given the transition matrices P and Pₙ (for step n) of a homogeneous Markov chain, then Pₙ = Pⁿ.
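The identity Pₙ = Pⁿ can be checked numerically for a small two-state chain; the transition probabilities below are invented for illustration:

```python
def mat_mul(A, B):
    """Multiply two square matrices stored as lists of rows."""
    k = len(A)
    return [[sum(A[i][l] * B[l][j] for l in range(k)) for j in range(k)]
            for i in range(k)]

def n_step(P, n):
    """n-step transition matrix P_n = P^n (Chapman-Kolmogorov)."""
    R = P
    for _ in range(n - 1):
        R = mat_mul(R, P)
    return R

# Two-state homogeneous chain; each row sums to one
P = [[0.9, 0.1],
     [0.5, 0.5]]
P2 = n_step(P, 2)     # e.g. P2[0][0] = 0.9*0.9 + 0.1*0.5 = 0.86
P50 = n_step(P, 50)   # rows approach the stationary distribution (5/6, 1/6)
```

For large n the rows of Pⁿ become identical, illustrating the convergence to a stationary distribution discussed in the rest of this section.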
Next, we will see the concepts of invariant or stationary distribution, ergodicity and irreducibility, which are indispensable to reach the main result. It will be assumed that Xₜ is a homogeneous Markov chain.
A vector π is an invariant distribution of the chain Xₜ if it satisfies:
a) πⱼ ≥ 0 for all j, with ∑ⱼ πⱼ = 1;
b) π = πP.
That is, a stationary distribution over the states of a Markov chain is one that persists forever once
it is reached.
The concept of ergodic state requires making other definitions clear such as recurrence and aperi-
odicity:
• The state i is recurrent if P(Xₙ = i for some n ≥ 1|X₀ = i) = 1. Otherwise, it is transient. Moreover, i is positive recurrent if the expected (average) return time is finite, and null recurrent if it is not.
• The period of a state i, denoted by dᵢ, is defined as dᵢ = gcd{n : [Pₙ]ᵢᵢ > 0}. The state i is aperiodic if dᵢ = 1, and periodic if dᵢ > 1.
A state is ergodic if it is positive recurrent and aperiodic. The last concept to define is irreducibility. A set of states C ⊆ S, where S is the set of all possible states, is irreducible if for all i, j ∈ C:
• i and j have the same period;
• i is transient if and only if j is transient;
• i is null recurrent if and only if j is null recurrent.
Now, with all these concepts in mind, we can determine whether a Markov chain has a stationary distribution by means of the following lemma:
Lemma 3.2.1. Let Xₜ be a homogeneous and irreducible Markov chain. The chain has exactly one stationary distribution if, and only if, all the states are positive recurrent. In that case, its entries are given by πᵢ = µᵢ⁻¹, where µᵢ denotes the expected return time of state i.
The relation with the long-time behaviour is given by this other lemma:
Lemma 3.2.2. Let Xₜ be a homogeneous, irreducible and aperiodic Markov chain. Then
[Pₙ]ᵢⱼ → 1/µⱼ for all i, j ∈ S as n → ∞   (3.2)
3.3 Monte Carlo Integration
Monte Carlo integration estimates the integral E[g(θ)] by obtaining samples θₜ, t = 1, . . . , n, from the posterior distribution p(θ|y) and averaging:
E[g(θ)] ≈ (1/n) ∑ₜ₌₁ⁿ g(θₜ)   (3.3)
where g(θ) is the function of interest to estimate. Note that if the samples θₜ, t = 1, . . . , n, have p(θ|y) as their stationary distribution, the θₜ form a Markov chain.
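A minimal sketch of (3.3), using independent draws from a known "posterior" so that the answer can be checked; here the target is N(0, 1) and g(θ) = θ², whose true expectation is 1:

```python
import random

random.seed(1)

# Monte Carlo estimate of E[g(theta)] from draws of the target distribution
n = 200_000
draws = [random.gauss(0.0, 1.0) for _ in range(n)]
estimate = sum(t * t for t in draws) / n   # should be close to Var(theta) = 1
```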
3.4 Gibbs sampler
In many models, it is not easy to draw directly from the posterior distribution p(θ|y). However, if the parameter θ is partitioned into several blocks as θ = (θ₁, . . . , θₚ), then the full conditional posterior distributions, p(θ₁|y, θ₂, . . . , θₚ), . . . , p(θₚ|y, θ₁, . . . , θₚ₋₁), may be simple to draw from. For instance, in the Normal linear regression model it is convenient to set p = 2, with θ₁ = β and θ₂ = σ², and the full conditional distributions would be p(β|y, σ²) and p(σ²|y, β), which are very useful in the Normal independent model that will be explained later.
The Gibbs sampler is defined by iteratively sampling from each of these p conditional distributions:
1. Set a starting value, θ⁽⁰⁾ = (θ₂⁽⁰⁾, . . . , θₚ⁽⁰⁾).
2. Take random draws:
- θ₁⁽¹⁾ from p(θ₁|y, θ₂⁽⁰⁾, . . . , θₚ⁽⁰⁾)
- θ₂⁽¹⁾ from p(θ₂|y, θ₁⁽¹⁾, θ₃⁽⁰⁾, . . . , θₚ⁽⁰⁾)
...
- θₚ⁽¹⁾ from p(θₚ|y, θ₁⁽¹⁾, . . . , θₚ₋₁⁽¹⁾)
3. Repeat step 2 as necessary.
4. Discard the first draws, which are affected by the starting value θ⁽⁰⁾ = (θ₂⁽⁰⁾, . . . , θₚ⁽⁰⁾), and average the rest of the draws applying Monte Carlo integration.
For instance, in the Normal regression model we would have:
1. Set a starting value, θ₂⁽⁰⁾ = (σ²)⁽⁰⁾.
2. Take random draws:
- θ₁⁽¹⁾ = β⁽¹⁾ from p(β|y, σ² = (σ²)⁽⁰⁾)
- θ₂⁽¹⁾ = (σ²)⁽¹⁾ from p(σ²|y, β = β⁽¹⁾)
3. Repeat step 2 as necessary.
4. Discard the first draws and average the rest applying Monte Carlo integration.
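The two-block scheme can be sketched for the simpler unknown-mean, unknown-variance Normal model. The improper prior f(µ, σ²) ∝ 1/σ² is our own choice for illustration; under it the full conditionals are µ|σ², y ~ N(ȳ, σ²/n) and σ²|µ, y ~ Inv-Gamma(n/2, ∑(yᵢ − µ)²/2):

```python
import random

random.seed(2)

y = [random.gauss(5.0, 2.0) for _ in range(500)]   # synthetic data
n = len(y)
ybar = sum(y) / n

mu_draws, sig2_draws = [], []
sigma2 = 1.0                          # starting value (sigma^2)^(0)
for _ in range(3000):
    # draw mu from its full conditional N(ybar, sigma2/n)
    mu = random.gauss(ybar, (sigma2 / n) ** 0.5)
    # draw sigma2 from Inv-Gamma(n/2, ss/2) via a Gamma(n/2, 1) draw
    ss = sum((yi - mu) ** 2 for yi in y)
    sigma2 = (ss / 2) / random.gammavariate(n / 2, 1.0)
    mu_draws.append(mu)
    sig2_draws.append(sigma2)

burn = 500                            # discard the burn-in draws
mu_hat = sum(mu_draws[burn:]) / (3000 - burn)
sig2_hat = sum(sig2_draws[burn:]) / (3000 - burn)
```

With these settings the posterior means land near the generating values µ = 5 and σ² = 4.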
The values dropped because they are affected by the starting point are called the burn-in. More generally, any set of values discarded in an MCMC simulation is called the burn-in; the size of the burn-in period is the subject of current research in MCMC methods.
As the state of each draw depends on the state of the previous one, the sequence is a Markov chain. More detailed information can be found in [Chen00], [Mart01] or [Rossi06].
3.5 Metropolis-Hastings sampler and its special cases
3.5.1 Metropolis-Hastings sampler
The Metropolis-Hastings method is adequate for simulating models that are not conditionally conjugate. Furthermore, it can be combined with the Gibbs sampler to simulate posterior distributions where some of the conditional posterior distributions are easy to sample from and others are not. Like the algorithms explained above, it is based on formulating a Markov chain, but using a proposal distribution, q(·|θₜ), which depends on the current state θₜ, to generate a new proposed sample θ*. This proposal is accepted as the next state with probability given by
α(θₜ, θ*) = min{1, [p(θ*|y)q(θₜ|θ*)] / [p(θₜ|y)q(θ*|θₜ)]}   (3.4)
If the point θ* is not accepted, the chain does not move and θₜ₊₁ = θₜ. According to [Mart01], the steps to follow are:
1. Initialize the chain to θ₀ and set t = 0.
2. Generate a candidate point θ* from q(·|θₜ).
3. Generate U from a Uniform(0, 1) distribution.
4. If U ≤ α(θₜ, θ*), set θₜ₊₁ = θ*; otherwise set θₜ₊₁ = θₜ.
5. Set t = t + 1 and repeat steps 2 through 5.
6. Take the average of the draws g(θ₁), . . . , g(θₙ).
Note that it is not only advisable but essential that the proposal distribution q(·|θₜ) be easy to sample from.
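The steps above can be sketched for a toy unnormalized target; both the target (a N(1, 0.5²) density) and the Normal proposal are our invented choices, and the proposal's normalizing constant is omitted since it cancels in (3.4):

```python
import math
import random

random.seed(3)

def p_unnorm(theta):
    """Unnormalized target density: a N(1, 0.5^2) 'posterior'."""
    return math.exp(-0.5 * ((theta - 1.0) / 0.5) ** 2)

def q_dens(x, given):
    """Proposal density q(x|given) up to a constant: N(given, 1)."""
    return math.exp(-0.5 * (x - given) ** 2)

theta = 0.0                          # initialize the chain
draws = []
for _ in range(20000):
    prop = random.gauss(theta, 1.0)  # candidate from q(.|theta_t)
    ratio = ((p_unnorm(prop) * q_dens(theta, prop))
             / (p_unnorm(theta) * q_dens(prop, theta)))
    if random.random() <= min(1.0, ratio):
        theta = prop                 # accept; otherwise the chain stays put
    draws.append(theta)

post = draws[2000:]                  # drop the burn-in
post_mean = sum(post) / len(post)    # should approach the target mean, 1
```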
There are some special cases of this method; the most important are briefly explained below. In addition, according to [Gelm04], it can be shown that the Gibbs sampler is another special case of the Metropolis-Hastings algorithm in which the proposed point is always accepted.
3.5.2 Metropolis sampler
This method is a particular case of the Metropolis-Hastings sampler where the proposal distribution
has to be symmetric. That is,
q(θ*|θₜ) = q(θₜ|θ*)   (3.5)
for all θ* and θₜ. Then the probability of accepting the new point is
α(θₜ, θ*) = min{1, p(θ*|y)/p(θₜ|y)}   (3.6)
The same procedure seen for the Metropolis-Hastings sampler is then followed.
3.5.3 Random-walk sampler
This special case refers to a proposal distribution of the form
q(θ*|θₜ) = q(|θₜ − θ*|)   (3.7)
The candidate point is θ* = θₜ + z, where z is the increment random variable drawn from q. Then the probability of accepting the new point is
α(θₜ, θ*) = min{1, p(θ*|y)/p(θₜ|y)}   (3.8)
The same procedure seen for the Metropolis-Hastings sampler is then followed.
3.5.4 Independence sampler
The last variation has a proposal distribution such that
q(θ*|θₜ) = q(θ*)   (3.9)
so it does not depend on θₜ. Then the probability of accepting the new point is
α(θₜ, θ*) = min{1, [p(θ*|y)q(θₜ)] / [p(θₜ|y)q(θ*)]} = min{1, w(θ*)/w(θₜ)}   (3.10)
where
w(θ) = p(θ|y)/q(θ)   (3.11)
It is important to remark that, for this method to work well, the proposal distribution q should be very similar to the posterior distribution p(θ|y).
The same procedure seen for the Metropolis-Hastings sampler is then followed.
3.6 Importance sampling
Importance sampling is a variance reduction technique that can be used in the Monte Carlo method.
The idea behind this method is that certain values of the input random variables in a simulation have
more impact on the parameter being estimated than others. So instead of taking a simple average,
importance sampling takes a weighted average.
Let q(θ) be a density from which it is easy to obtain random draws θ⁽ˢ⁾, s = 1, . . . , S. Then q(θ) is called the importance function, and the importance sampling estimator can be defined as follows: the function
ĝ_S = ∑ₛ w(θ⁽ˢ⁾)g(θ⁽ˢ⁾) / ∑ₛ w(θ⁽ˢ⁾), where w(θ⁽ˢ⁾) = p(θ = θ⁽ˢ⁾|y)/q(θ = θ⁽ˢ⁾) and the sums run over s = 1, . . . , S,
converges to E[g(θ)|y] as S → ∞.
In fact, w(θ⁽ˢ⁾) can also be computed as w(θ⁽ˢ⁾) = p*(θ|y)/q*(θ|y), where the starred densities are proportional to the original ones.
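A sketch of the self-normalized estimator, with an invented target (a N(1, 1) posterior known only up to a constant) and a wider N(0, 2²) importance function; the true value of E[θ|y] is 1:

```python
import math
import random

random.seed(4)

def p_unnorm(theta):
    """Unnormalized target p(theta|y): a N(1, 1) density."""
    return math.exp(-0.5 * (theta - 1.0) ** 2)

def q_dens(theta):
    """Importance function q: the N(0, 2^2) density."""
    return math.exp(-0.5 * (theta / 2.0) ** 2) / (2.0 * math.sqrt(2 * math.pi))

S = 100_000
draws = [random.gauss(0.0, 2.0) for _ in range(S)]
weights = [p_unnorm(t) / q_dens(t) for t in draws]

# self-normalized importance sampling estimate of E[theta | y]
g_hat = sum(w * t for w, t in zip(weights, draws)) / sum(weights)
```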
For more information and details about Markov chain Monte Carlo methods and their application,
the reader is referred to [Chen00], [Gilk95], [Berg05] and [Kend05].