5. Unrealistic Assumptions
We only measure “unphased” data:
[Figure: unordered genotypes at seven sites: T/A, G/C, A/A, G/G, C/T, C/C, A/A]
We first need to infer the phase:
--T--------G--------A----G--------C---C---A----
--A--------C--------A----G--------T---C---A----
6. Unrealistic Assumptions
We only measure “unphased” data:
[Figure: unordered genotypes at seven sites: T/A, G/C, A/A, G/G, C/T, C/C, A/A]
We first need to infer the phase:
--T--------G--------A----G--------C---C---A----
--A--------C--------A----G--------T---C---A----
--T--------G--------A----G--------T---C---A----
--A--------C--------A----G--------C---C---A----
7. Unrealistic Assumptions
We only measure “unphased” data:
[Figure: unordered genotypes at seven sites: T/A, G/C, A/A, G/G, C/T, C/C, A/A]
We first need to infer the phase:
--T--------G--------A----G--------C---C---A----
--A--------C--------A----G--------T---C---A----
--T--------G--------A----G--------T---C---A----
--A--------C--------A----G--------C---C---A----
--T--------C--------A----G--------T---C---A----
--A--------G--------A----G--------C---C---A----
8. Unrealistic Assumptions
We only measure “unphased” data:
[Figure: unordered genotypes at seven sites: T/A, G/C, A/A, G/G, C/T, C/C, A/A; a “?” marks the unknown phase]
We first need to infer the phase:
--T--------G--------A----G--------C---C---A----
--A--------C--------A----G--------T---C---A----
--A--------G--------A----G--------C---C---A----
--T--------C--------A----G--------T---C---A----
--T--------C--------A----G--------T---C---A----
--A--------G--------A----G--------C---C---A----
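The phasing problem on these slides can be sketched in a few lines. The snippet below is a minimal illustration, not the method used in the talk: it enumerates every haplotype pair consistent with an unphased genotype. The function name `enumerate_phasings` and the tuple encoding of genotypes are assumptions made for this example; the genotype is the seven-site one from the slides.

```python
from itertools import product

def enumerate_phasings(genotype):
    """Return all haplotype pairs consistent with an unphased genotype.

    genotype: list of 2-tuples of alleles, one tuple per site
    (hypothetical encoding chosen for this sketch).
    """
    het_sites = [i for i, (a, b) in enumerate(genotype) if a != b]
    phasings = []
    # Keep the first heterozygous site fixed so that each unordered
    # pair of haplotypes is produced exactly once.
    for flips in product([False, True], repeat=max(len(het_sites) - 1, 0)):
        flip_at = dict(zip(het_sites[1:], flips))
        hap1, hap2 = [], []
        for i, (a, b) in enumerate(genotype):
            if flip_at.get(i, False):
                a, b = b, a
            hap1.append(a)
            hap2.append(b)
        phasings.append(("".join(hap1), "".join(hap2)))
    return phasings

# Genotype from the slides: heterozygous at sites 1, 2 and 5.
genotype = [("T", "A"), ("G", "C"), ("A", "A"), ("G", "G"),
            ("C", "T"), ("C", "C"), ("A", "A")]
phasings = enumerate_phasings(genotype)
for hap1, hap2 in phasings:
    print(hap1, hap2)
```

With k heterozygous sites there are 2^(k-1) candidate phasings, which is why phase cannot simply be enumerated for long regions and has to be inferred statistically.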
53. A Reasonable Local Model
Copyright © 2007 by the Genetics Society of America
DOI: 10.1534/genetics.107.071126
On Recombination-Induced Multiple and Simultaneous Coalescent Events
Joanna L. Davies, František Simančík, Rune Lyngsø, Thomas Mailund and Jotun Hein
Department of Statistics, University of Oxford, Oxford, OX1 3TG, United Kingdom
Manuscript received January 18, 2007
Accepted for publication October 2, 2007
ABSTRACT
Coalescent theory deals with the dynamics of how sampled genetic material has spread through a
population from a single ancestor over many generations and is ubiquitous in contemporary molecular
population genetics. Inherent in most applications is a continuous-time approximation that is derived
under the assumption that sample size is small relative to the actual population size. In effect, this
precludes multiple and simultaneous coalescent events that take place in the history of large samples. If
sequences do not recombine, the number of sequences ancestral to a large sample is reduced sufficiently
after relatively few generations such that use of the continuous-time approximation is justified. However,
in tracing the history of large chromosomal segments, a large recombination rate per generation will
consistently maintain a large number of ancestors. This can create a major disparity between discrete-time
and continuous-time models and we analyze its importance, illustrated with model parameters typical of
the human genome. The presence of gene conversion exacerbates the disparity and could seriously
undermine applications of coalescent theory to complete genomes. However, we show that multiple and
simultaneous coalescent events influence global quantities, such as total number of ancestors, but have
negligible effect on local quantities, such as linkage disequilibrium. Reassuringly, most applications of the
coalescent model with recombination (including association mapping) focus on local quantities.
Kingman (1982) models the ancestry of a sample of sequences with a continuous-time Markov process referred to as the Kingman coalescent. Lineages collide or coalesce after random exponential waiting [...]
[...]ulation size, the probability of such events occurring becomes nonnegligible and consequently in these instances the rate of coalescence is underestimated by Hudson’s continuous-time model. Hudson’s model [...]
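To get a feel for when the continuous-time approximation in the abstract breaks down, here is a rough sketch under a plain Wright-Fisher model without recombination. It computes, for one generation, the probability of no coalescence, of exactly one pairwise coalescence, and of anything more (multiple or simultaneous events). The function name and the value of 20,000 gene copies are assumptions for illustration, not figures taken from the paper.

```python
import math

def one_generation_probs(n, m):
    """Wright-Fisher probabilities for n lineages picking parents among m gene copies.

    Returns (p_none, p_single, p_multi): no coalescence, exactly one
    pairwise coalescence, and multiple or simultaneous coalescences.
    Requires 2 <= n <= m.
    """
    # log of prod_{i=0}^{n-2} (1 - i/m): n-1 blocks receive distinct parents
    log_distinct = sum(math.log1p(-i / m) for i in range(n - 1))
    p_none = math.exp(log_distinct + math.log1p(-(n - 1) / m))
    # choose the coalescing pair, then n-1 distinct parents for the blocks
    p_single = math.comb(n, 2) / m * math.exp(log_distinct)
    p_multi = max(0.0, 1.0 - p_none - p_single)
    return p_none, p_single, p_multi

m = 20_000  # assumed number of gene copies (illustrative only)
small = one_generation_probs(10, m)
large = one_generation_probs(1000, m)
print("n=10:  ", small)
print("n=1000:", large)
```

For a small sample the third probability is negligible, which is the regime the continuous-time coalescent assumes; for a sample that is large relative to the population it dominates.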
54. A Reasonable Local Model
• The “back in time” approach (in general)
means we ignore selection
• Implicit assumption that the disease is
selectively neutral
• Which may or may not be reasonable...
• Might be okay for late onset diseases...
60. The ARG as a
Statistical Model
lhd(θ) = P(D | θ) = ∫ P(D | G, θ) P(G | θ) dG
(D: data, θ: model parameters, G: ARG)
61. The ARG as a
Statistical Model
lhd(θ) = ∫ P(D | G, θ) P(G | θ) dG
(D: data, θ: model parameters, G: ARG)
Integration by magic
62. The ARG as a
Statistical Model
lhd(θ) = ∫ P(D | G, θ) P(G | θ) dG
(D: data, θ: model parameters, G: ARG)
Integration by statistical sampling (not magic)
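"Integration by statistical sampling" means averaging the probability of the data over genealogies drawn from the model. The toy below is a stand-in, not an ARG sampler: the hidden genealogy is just the coalescence time T of two sequences (T ~ Exponential(1) in coalescent units), and the data are the number of differences k between them, assumed Poisson with mean θT. The marginal likelihood then has a closed form, θ^k / (1 + θ)^(k+1), so the Monte Carlo estimate can be checked. All names here are hypothetical.

```python
import math
import random

def poisson_pmf(k, lam):
    """P(K = k) for K ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def mc_likelihood(k, theta, n_samples=100_000, seed=0):
    """Estimate P(k | theta) by averaging P(k | T, theta) over T ~ Exp(1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        t = rng.expovariate(1.0)          # draw the hidden genealogy
        total += poisson_pmf(k, theta * t)  # probability of the data given it
    return total / n_samples

def exact_likelihood(k, theta):
    """Closed form of the same integral for this toy model."""
    return theta ** k / (1.0 + theta) ** (k + 1)

estimate = mc_likelihood(2, 1.0)
exact = exact_likelihood(2, 1.0)
print(estimate, exact)
```

For real ARGs there is no closed form to compare against, and most sampled genealogies give the data probability zero, which is what motivates the conditional sampling schemes on the following slides.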
63. ARG Methods
• Sampling ARGs from the coalescence
process
• Sampling ARGs conditional on the data
(importance sampling)
• Sampling parsimonious ARGs conditional
on the data
64. ARG Methods
• Sampling ARGs from the coalescence
process
• This is a no-go -- you would never sample an
ARG that can explain the data
• Sampling ARGs conditional on the data
(importance sampling)
• Sampling parsimonious ARGs conditional
on the data
65. ARG Methods
• Sampling ARGs from the coalescence
process
• Sampling ARGs conditional on the data
(importance sampling)
• Larribe, Lessard and Schork 2002 -- scales to
tens of individuals and tens of markers
• Sampling parsimonious ARGs conditional
on the data
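The idea behind importance sampling can be shown on a toy model; this is only a sketch of the principle and not the scheme of Larribe, Lessard and Schork. The hidden genealogy is the coalescence time T of two sequences with prior Exponential(1), the data are k differences assumed Poisson with mean θT, and the proposal Q is a Gamma distribution shaped by the data. Each draw is weighted by P(data | T) P(T) / Q(T). The proposal choice and all names are assumptions for this example.

```python
import math
import random

def log_poisson(k, lam):
    return -lam + k * math.log(lam) - math.lgamma(k + 1)

def log_gamma_pdf(t, shape, scale):
    return ((shape - 1) * math.log(t) - t / scale
            - math.lgamma(shape) - shape * math.log(scale))

def is_likelihood(k, theta, n_samples=20_000, seed=0):
    """Importance-sampling estimate of P(k | theta) in the toy model."""
    rng = random.Random(seed)
    shape, scale = k + 1, 1.0 / theta   # proposal Q, guided by the data
    total = 0.0
    for _ in range(n_samples):
        t = rng.gammavariate(shape, scale)
        log_prior = -t                  # log density of Exp(1)
        log_w = (log_poisson(k, theta * t) + log_prior
                 - log_gamma_pdf(t, shape, scale))
        total += math.exp(log_w)
    return total / n_samples

k, theta = 5, 2.0
is_estimate = is_likelihood(k, theta)
is_exact = theta ** k / (1.0 + theta) ** (k + 1)  # closed form for the toy model
print(is_estimate, is_exact)
```

Because the proposal concentrates on genealogies that can explain the data, every sample contributes a non-zero weight, unlike sampling from the prior.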
66. ARG Methods
• Sampling parsimonious ARGs conditional on
the data
• Lyngsø, Song & Hein 2005 (calculates parsimonious
ARGs -- a 2008 paper in press for sampling)
• Minichiello & Durbin 2006 (samples parsimonious
ARGs and scores local genealogies)
• Both preferentially select mutations and
coalescence events over recombinations
• Scales to thousands of individuals and hundreds of
markers
95. Using “Perfect Phylogenies”
Use the four-gamete test to find regions that
can be explained by a tree with no recurrent mutations
Mailund, Besenbacher & Schierup 2006
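A minimal sketch of the four-gamete test, assuming biallelic markers coded as '0'/'1' strings, one string per chromosome. Two sites are compatible with a tree without recurrent mutation if at most three of the four possible allele combinations occur. The greedy region-growing function is an illustrative assumption and not necessarily how the cited method chooses its regions.

```python
def four_gamete_compatible(haplotypes, i, j):
    """True if sites i and j show at most three of the four possible gametes."""
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) < 4

def compatible_region(haplotypes, start):
    """Grow a region to the right of `start` while all site pairs stay compatible.

    Returns a half-open interval (start, end) of site indices.
    """
    n_sites = len(haplotypes[0])
    end = start + 1
    while end < n_sites and all(
        four_gamete_compatible(haplotypes, i, end) for i in range(start, end)
    ):
        end += 1
    return start, end

# Hypothetical data: sites 0 and 3 show all four gametes.
haps = ["0000", "1000", "1100", "0001", "1001"]
region = compatible_region(haps, 0)
print(region)
```

Every pair of sites inside the returned interval passes the test, so the region admits a perfect phylogeny.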
100. Using “Perfect Phylogenies”
Much faster (and much cruder)
Catches the essential tree structure
Mailund, Besenbacher & Schierup 2006
101. Scoring the Clustering
Red=cases
Green=controls
Are the case chromosomes significantly
over-represented in some clusters?
103. Wild-types
Mutation
Mutants
We can place “mutations” on the tree edges
and partition chromosomes into “mutants”
and “wild-types” and test for different
distributions of cases and controls
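One way to test a single mutant/wild-type split is a one-sided hypergeometric (Fisher exact) test on the 2×2 table of cases and controls. The slides do not say which test is used, so this choice and the function name are assumptions for illustration.

```python
from math import comb

def enrichment_p_value(mutant_cases, mutant_controls, wild_cases, wild_controls):
    """One-sided p-value for cases being over-represented among mutants."""
    n_mutants = mutant_cases + mutant_controls
    n_cases = mutant_cases + wild_cases
    n_total = n_mutants + wild_cases + wild_controls
    upper = min(n_mutants, n_cases)
    # Probability of seeing at least `mutant_cases` cases among the mutants
    # if case/control labels were assigned independently of the tree.
    tail = sum(
        comb(n_cases, x) * comb(n_total - n_cases, n_mutants - x)
        for x in range(mutant_cases, upper + 1)
    )
    return tail / comb(n_total, n_mutants)

# Hypothetical counts: 8 mutants (7 cases, 1 control), 12 wild-types (3 cases, 9 controls).
p = enrichment_p_value(7, 1, 3, 9)
print(p)
```

A small p-value for an edge suggests that a mutation placed on that edge separates cases from controls better than chance would.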
104. Wild-types
Mutation
Mutants
Use average or maximum to score the tree
Average is kosher Bayesian stats; maximum
needs to be corrected for over-fitting.
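A sketch of the two tree scores, with one possible correction for the maximum. Each edge is represented by the set of leaves below it (the "mutants"), and the per-edge score is a Pearson chi-square statistic used as a stand-in for whatever score the method actually uses. Correcting the maximum by permuting case/control labels is an assumption made here; the slides only say that a correction is needed.

```python
import random

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def edge_scores(edges, is_case):
    """Score every edge; an edge is the set of leaf indices below it."""
    n_cases = sum(is_case)
    n_controls = len(is_case) - n_cases
    scores = []
    for mutants in edges:
        a = sum(is_case[i] for i in mutants)  # mutant cases
        b = len(mutants) - a                  # mutant controls
        scores.append(chi2_2x2(a, b, n_cases - a, n_controls - b))
    return scores

def max_score_p_value(edges, is_case, n_perm=1000, seed=0):
    """Permutation p-value for the maximum edge score."""
    rng = random.Random(seed)
    observed = max(edge_scores(edges, is_case))
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any link between labels and tree
        if max(edge_scores(edges, labels)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical tree on 8 chromosomes; leaves 0-3 are cases.
edges = [{0, 1, 2, 3}, {0, 1}, {2, 3}, {4, 5}, {6, 7}]
is_case = [1, 1, 1, 1, 0, 0, 0, 0]
scores = edge_scores(edges, is_case)
mean_score = sum(scores) / len(scores)
max_score = max(scores)
p_max = max_score_p_value(edges, is_case)
print(mean_score, max_score, p_max)
```

The maximum picks the best of many edges, so its raw value is optimistic; comparing it with the maxima obtained under shuffled labels accounts for that selection.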