Talk given to the UCI Genetic Epidemiology Research Group (GERI, http://www.geri.uci.edu/) on May 16, 2014. Recent results on power to detect associations in growing populations + need for better statistical tests.
Simulating Genes in Genome-wide Association Studies
1. Simulating Genes in GWAS
Kevin R. Thornton
Ecology and Evolutionary Biology
UC Irvine
slides will be available at
http://www.slideshare.net/molpopgen
http://www.molpopgen.org
3. Several genomic regions have been implicated in linkage studies and, recently, replicated evidence implicating specific genes has been reported. Increasing evidence suggests an overlap in genetic susceptibility with schizophrenia, a psychotic disorder with many similarities to BD. In particular association findings have been reported with [...]
expanded reference group analysis (Supplementary Table 9), it is of interest that the closest gene to the signal at rs1526805 (P = 2.2 × 10^-7) is KCNC2 which encodes the Shaw-related voltage-gated potassium channel. Ion channelopathies are well-recognized as causes of episodic central nervous system disease, including seizures, ataxias [...]
[Figure: genome-wide scans of −log10(P) (0 to 15) against chromosome (1 to 22, X), one panel per disease: bipolar disorder, coronary artery disease, Crohn’s disease, hypertension, rheumatoid arthritis, type 1 diabetes, type 2 diabetes.]
Figure 4 | Genome-wide scan for seven diseases. For each of seven diseases −log10 of the trend test P value for quality-control-positive SNPs, excluding those in each disease that were excluded for having poor clustering after visual inspection, are plotted against position on each chromosome. Chromosomes are shown in alternating colours for clarity, with P values < 1 × 10^-5 highlighted in green. All panels are truncated at −log10(P value) = 15, although some markers (for example, in the MHC in T1D and RA) exceed this significance threshold.
doi:10.1038/nature05911
Burton et al.
7. Unsurprisingly, since the GWAS method is primarily powered
common alleles, risk allele frequencies were well above 5%
all TASPs (reported index TASs with an association p valu
5.0 × 10^-8 and all HapMap phase II CEU SNPs in LD [r2 > 0
[Figure: odds ratio (ticks at 1, 2, 5, 10, 20, 30) against reported risk allele frequency (0 to 100%). Labeled points: OCA2, eye color; MC1R, hair color; LOXL1, exfoliation glaucoma.]
Fig. 1. Published odds ratios for discrete traits by reported risk allele frequencies. Labeled SNP-trait associations are those with the highest ORs. Note that the y axis is on the log scale.
www.pnas.org/cgi/doi/10.1073/pnas.0903103106
Hindorff et al.
8. tion explained by rare variants, because natural selection should
mize the frequency of deleterious variants in the population [24].
efore, for any phenotype, many causal variants will be rare, and
proportion of population-level genetic variance in complex
notypes attributable to variants across the allele frequency
trum will depend upon the strength of selection in our evolu-
ry past. The problem is that this is something that we do not
that the power of detection is proportional to pa^2, but it is clear
for each complex trait, variance is contributed from the entire a
frequency spectrum. This highlights the scarcity of low-frequ
variants identified by GWAS for quantitative traits and com
disease in humans. Detecting these variants will require a comb
tion of greater sample size, better genotyping, and impro
phenotyping.
[Figure: (A) absolute effect (SD units) against minor allele frequency (<0.001 to 0.5); (B) odds ratio against risk allele frequency (<0.001 to 1).]
Figure I. For quantitative traits (A), the absolute effect is plotted against the minor allele frequency, whereas for complex common diseases (B), the odds ratio is plotted against the risk allele frequency. Each of the 38 quantitative traits and 43 disease traits are represented by different colors. Abbreviation: SD, standard deviation.
TRENDS in Genetics
http://dx.doi.org/10.1016/j.tig.2014.02.003
Robinson et al.
9. [Figure: odds ratio (1 to 10) by annotation set: non-synonymous sites, promoters (1kb), promoters (5kb), 5’UTRs, 3’UTRs, miRTS, intronic regions, intergenic regions, intergenic TFBSs, CpG islands, PReMod sites, ORegAnno elements, EAR regions, MCSs, HARs, PSGs. Panel title: Enrichment/depletion analysis after adjusting for ’hitchhiking’ effects from non-synonymous sites.]
Fig. 2. Odds ratios for TAS block enrichment/depletion analysis after adjusting for “hitchhiking” effects from nonsynonymous sites. Four annotation sets (Splice sites, Validated enhancers, EvoFold elements, and noncoding RNAs) are not represented here because no TAS blocks mapped to these annotation sets. The blue circle represents the point estimate of the odds ratio (OR) and the red lines represent the 95% CI. Possible “hitchhiking” effects from nonsynonymous sites are reduced by discarding any TASP/control SNP in r2 > 0.6 with a nonsynonymous SNP. For an explanation of the annotation sets on the x axis, we refer the reader to Table S4. Note that the y axis is on the log scale. Nonsynonymous OR computation is not adjusted for “hitchhiking” effects.
www.pnas.org/cgi/doi/10.1073/pnas.0903103106
Hindorff et al.
10. Observation | Interpretation
Missing H | Lots
Uniform frequencies of “hits” | Common associations exist
Rare hits have larger OR | Rare alleles may have larger effects
Larger OR in genes | Genes matter
11. Observation | Interpretation
Rare hits have larger OR | Rare alleles may have larger effects. Disease is harmful with respect to fitness (in the evolutionary sense).
Larger OR in genes | Genes matter
13. [Figure: (a) frequency of observations (0 to 0.4) against risk allele frequency; (b) causal variant frequency (0.005 to 0.020), with curves for odds ratios from 2 to > 9.]
Figure 3 | Inconsistency between genome-wide association study results and rare variant expectations. a | The frequency distribution of risk allele frequencies (shown in light red) for 414 common variant associations with 17 diseases is only slightly skewed towards lower-frequency variants. By contrast, simulations, in this case assuming up to nine rare causal variants inducing the common variant association with SNPs at the same frequency as observed on common genotyping platforms (light green bars), result in a marked left-skew with a peak for common variants whose frequency is less than 10%. (The skew is even stronger if only a single causal variant is responsible.) The observed data are thus not immediately consistent with the rare variant model. b | Part of the problem with synthetic associations is that they would explain too much heritability if they were pervasively responsible for common variant effects. This is due to the relationship between allele frequency, maximum possible linkage disequilibrium (LD) and the amount of variance explained (ref. 19). The plot shows the expected odds ratio due to a rare variant of the indicated frequency (from 0.5% to 2%) if it increases the odds ratio at a common SNP (with which it is in maximum possible LD) by 1.1-fold. Intermediate effect sizes (2 < odds ratio < 5) require combined causal variant frequencies in excess of 1%. As the number of rare variants increases, the likelihood that they are in high LD with the common variant also drops, further reducing the probability that they can explain observed common variant association.
The multiplicative model
$G = \prod_i (1 + e_i)$
Risch & colleagues, Pritchard, countless others
14. The multiplicative model
$G = \prod_i (1 + e_i)$
[Figure: contour plot (contour levels 0.2 to 1.4) against the number of causative mutations on the paternal allele (0 to 10) and on the maternal allele (0 to 10).]
Risch & colleagues, Pritchard, countless others
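A minimal sketch of the multiplicative model on this slide, assuming each causative mutation i contributes its own effect e_i. The function name and the example effect sizes are hypothetical and chosen only for illustration.

```python
from math import prod


def multiplicative_genetic_value(effects):
    """Return G = prod_i (1 + e_i) for a sequence of per-mutation effects."""
    return prod(1.0 + e for e in effects)


# Hypothetical diploid carrying three causative mutations of effect 0.1 each.
# The result does not depend on which parental allele each mutation sits on.
two_paternal_one_maternal = multiplicative_genetic_value([0.1, 0.1] + [0.1])
three_paternal = multiplicative_genetic_value([0.1, 0.1, 0.1] + [])
print(two_paternal_one_maternal, three_paternal)
```

Under this model only the total set of mutations matters: moving a mutation from one parental allele to the other leaves G unchanged.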
15. WWHD?
(What would Haldane do?)
Genotype | AA | Aa | aa
Mating frequency | p^2 | 2pq | q^2
Fitness | 1 | 1 - sh | 1 - 2s
$\hat{q} = \frac{u}{sh}$
$\hat{q} \approx \sqrt{\frac{u}{s}}$ as $h \to 0$
DOI: 10.1017/S0305004100015644
Haldane
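As a rough numerical illustration, the two equilibrium expressions on this slide can be evaluated directly. This sketch only plugs numbers into the slide's formulas; the function names and the values of u, s and h are hypothetical.

```python
from math import sqrt


def q_hat_partial_dominance(u, s, h):
    """Equilibrium frequency q = u / (s * h), as written on the slide (h > 0)."""
    return u / (s * h)


def q_hat_recessive(u, s):
    """Limiting approximation q ~ sqrt(u / s) as h -> 0, as on the slide."""
    return sqrt(u / s)


# Hypothetical values: mutation rate per gamete and selection coefficient.
u, s = 1e-4, 0.01
print(q_hat_partial_dominance(u, s, h=0.5))
print(q_hat_recessive(u, s))
```

With these made-up numbers the recessive case sits at a higher equilibrium frequency than the partially dominant case, because selection acts on fewer of the carriers.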
16. Mutation at rate u (per gamete per generation)
[Diagram: an “A” allele and several “a” alleles, with X marks on the “a” alleles.]
“a” allele is heterogeneous in its molecular origin
Trans-heterozygotes are at risk.
Phenotype has (weak) effect on individual fitness
doi:10.1371/journal.pgen.1003258
Thornton et al.
17. Effect sizes ∼ Exp(λ)
[Figure: density (0.0 to 7.5) of effect sizes (0.0 to 0.9).]
$h_i$ = effect of haplotype.
Additive over causative mutations
doi:10.1371/journal.pgen.1003258
Thornton et al.
18. $G_{i,j} = \sqrt{h_i \times h_j}$
(geometric mean)
[Figure: contour plot (contour levels 0.05 to 0.4) against the number of causative mutations on the paternal allele (0 to 10) and on the maternal allele (0 to 10).]
$P_{i,j} = G_{i,j} + N(0, \sigma)$
$w = e^{-\frac{(P_{i,j})^2}{2\sigma_S^2}}$
doi:10.1371/journal.pgen.1003258
Thornton et al.
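The pieces of the gene-based model on the last two slides can be put together in a short sketch. It assumes that λ is the mean of the exponential distribution of effect sizes, that the second argument of N(0, ·) is a standard deviation, and that the fitness exponent is negative (Gaussian stabilizing selection). All function names and parameter values are hypothetical.

```python
import math
import random


def haplotype_effect(n_mutations, lam, rng):
    """Sum of n_mutations effect sizes, each exponential with mean lam."""
    return sum(rng.expovariate(1.0 / lam) for _ in range(n_mutations))


def genetic_value(h_i, h_j):
    """Geometric mean of the two haplotype effects."""
    return math.sqrt(h_i * h_j)


def phenotype(g, sigma, rng):
    """Genetic value plus Gaussian noise with standard deviation sigma."""
    return g + rng.gauss(0.0, sigma)


def fitness(p, sigma_s):
    """Gaussian fitness function w = exp(-p^2 / (2 sigma_s^2))."""
    return math.exp(-(p ** 2) / (2.0 * sigma_s ** 2))


rng = random.Random(42)
h_pat = haplotype_effect(2, lam=0.1, rng=rng)  # two mutations, paternal
h_mat = haplotype_effect(0, lam=0.1, rng=rng)  # mutation-free maternal allele
g = genetic_value(h_pat, h_mat)
print(g, fitness(phenotype(g, sigma=0.075, rng=rng), sigma_s=1.0))
```

Because the genetic value is a geometric mean, a diploid with one mutation-free haplotype has G = 0 however many mutations the other haplotype carries, which is the sense in which only trans-heterozygotes are at risk.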
19. Aside: simulation tools
• C++ library for rapid forward simulation
• Available from https://github.com/molpopgen/fwdpp
• Preprint on arXiv at http://arxiv.org/abs/1401.3786
21. [Figure: mean run time (hours) and mean peak memory use (megabytes) against the proportion of new mutations that are deleterious (0.1, 0.5, 1), in panels for 2Nsh = 1, 2Nsh = 10 and 2Nsh = 100. Simulations compared: fwdpp (gamete-based), fwdpp (individual-based), SLiM.]
http://arxiv.org/abs/1401.3786
Thornton
22. Selection is weak
[Figure: relative fitness (0.70 to 1.00) against mean effect size (λ, 0.0 to 0.5). Series: population mean fitness, average fitness of a case, average minimum fitness.]
doi:10.1371/journal.pgen.1003258
Thornton et al.
25. GWAS have poor power
[Figure: power (0.0 to 0.8) against mean effect size (λ, 0.0 to 0.5). Series: GWAS; GWAS, no recombination; resequencing; resequencing, no recombination.]
doi:10.1371/journal.pgen.1003258
Thornton et al.
26. Compare model to data…
[Figure: Gibson, Figure 3 (Inconsistency between genome-wide association study results and rare variant expectations), as shown on slide 13: (a) frequency of observations against risk allele frequency; (b) causal variant frequency, with curves for odds ratios from 2 to > 9.]
doi:10.1038/nrg3118
Gibson
doi:10.1371/journal.pbio.1000579
Wray et al.
27. …reveals a pretty good fit
doi:10.1371/journal.pbio.1000579
Wray et al.
[Figure: mean number of markers (0 to 10) against MAF of most significant marker in cases (0 to 0.5); n = 36.899, λ = 0.05.]
(Based on simulating imperfect SNP chips)
28. “Burden” tests do badly…
[Figure: two panels of power (0.0 to 1.0) against mean effect size (λ, 0.0 to 0.5). One legend: GWAS; GWAS, no recombination; resequencing; resequencing, no recombination. Other legend: 50, 100, 200 and 250 markers, each with and without recombination. Panel labels: Madsen and Browning (2009); Li and Leal (2008).]
doi:10.1371/journal.pgen.1003258
Thornton et al.
29. …because the model is wrong.
[Figure: mean number of causative mutations per diploid (0 to 8) against mean effect size (λ, 0.0 to 0.5). Series: controls, cases, controls (rares), cases (rares).]
doi:10.1371/journal.pgen.1003258
Thornton et al.
30. SKAT does ok
[Figure: power (0.0 to 1.0) against mean effect size (λ, 0.0 to 0.5). Series: resequencing, default weights and optimal p-values; GWAS, default weights and optimal p-values; resequencing, Madsen-Browning weights and optimal p-values; GWAS, Madsen-Browning weights and optimal p-values.]
doi:10.1371/journal.pgen.1003258
Thornton et al.
31. Manhattan plots
[Figure: two simulated Manhattan plots of −log10(p) (0 to 15) against position (0 to 100 kbp). Marker classes: common; common, causative; rare; rare, causative.]
[Figure: Burton et al., Figure 2: (a) −log10(P) (0 to 15) against chromosome (1 to 22, X); (b) observed test statistic against expected chi-squared value.]
Figure 2 | Genome-wide picture of geographic variation. a, P values for the 11-d.f. test for difference in SNP allele frequencies between geographical regions, within the 9 collections. SNPs have been excluded using the project quality control filters described in Methods. Green dots indicate SNPs with a P value < 1 × 10^-5. b, Quantile-quantile plots of these test statistics. SNPs at which the test statistic exceeds 100 are represented by triangles at the top of the plot, and the shaded region is the 95% concentration band (see Methods). Also shown in blue is the quantile-quantile plot resulting from removal of all SNPs in the 13 most differentiated regions (Table 1).
NATURE|Vol 447|7 June 2007
doi:10.1371/journal.pgen.1003258
Thornton et al.
doi:10.1038/nature05911
Burton et al.
32. A new association test
[Figure: Burton et al., Figure 2 (Genome-wide picture of geographic variation), as on slide 31: (a) −log10(P) against chromosome; (b) quantile-quantile plot of observed test statistic against expected chi-squared value.]
$\mathrm{ESM}_K = \sum_{i=1}^{K} \left( -\log_{10}(p_i) + \log_{10}\frac{i}{K} \right)$
doi:10.1371/journal.pgen.1003258
Thornton et al.
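A minimal sketch of how the statistic on this slide could be computed from a list of single-marker p-values. It assumes p_i is the i-th smallest p-value and that each term is the observed −log10(p_i) minus its expectation −log10(i/K); that reading, the function name and the example p-values are assumptions made for illustration, not the published implementation.

```python
import math


def esm_statistic(p_values, k):
    """Sum over the k smallest p-values of -log10(p_i) + log10(i / k)."""
    if not 1 <= k <= len(p_values):
        raise ValueError("k must be between 1 and the number of markers")
    ordered = sorted(p_values)[:k]  # p_1 <= p_2 <= ... <= p_k
    return sum(
        -math.log10(p) + math.log10(i / k)
        for i, p in enumerate(ordered, start=1)
    )


# Hypothetical p-values for three markers in one region.
print(esm_statistic([0.1, 0.001, 0.01], k=3))
# Evenly spaced p-values (p_i = i / k) show no excess, so every term is zero.
print(esm_statistic([1 / 3, 2 / 3, 1.0], k=3))
```

Under this reading the statistic is large when the smallest p-values in a region are collectively smaller than evenly spaced p-values would be, even if no single marker is genome-wide significant.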
33. ESM is a more powerful test
[Figure: power (0.0 to 1.0) against mean effect size (λ, 0.0 to 0.5). Series: GWAS; GWAS, no recombination; resequencing; resequencing, no recombination.]
(Caveat: requires permutation to get p-values)
doi:10.1371/journal.pgen.1003258
Thornton et al.
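The caveat on this slide can be illustrated with a generic label-permutation sketch. It is not the talk's pipeline: the statistic here is a toy stand-in (difference in mean allele count between cases and controls), and every name and number below is hypothetical.

```python
import random


def permutation_p_value(statistic, genotypes, labels, n_perm=1000, seed=0):
    """One-sided permutation p-value from shuffling case/control labels."""
    rng = random.Random(seed)
    observed = statistic(genotypes, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any genotype/phenotype association
        if statistic(genotypes, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def mean_difference(genotypes, labels):
    """Toy statistic: mean allele count in cases minus mean in controls."""
    cases = [g for g, lab in zip(genotypes, labels) if lab == 1]
    controls = [g for g, lab in zip(genotypes, labels) if lab == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


# Hypothetical data: allele counts at one marker, 1 = case and 0 = control.
genotypes = [2, 2, 2, 2, 0, 0, 0, 0]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(permutation_p_value(mean_difference, genotypes, labels))
```

Any region-level statistic can be passed in place of the toy one; the cost is that the whole statistic is recomputed once per permutation.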
34. Running ESM on real data
• We think we can implement ESM using a mix of the
PLINK toolkit plus some custom programs.
• We need data to test it out on.
• There are very few modern GWAS available for
reanalysis.
• Lack of data sharing hurts the field.
35. Rare alleles and missing heritability
• Current tests are underpowered
• Heterogeneity means that GWAS “hits” tag few
causative mutations
• Causative mutations that are tagged tend to be
(relatively) common. These “common” mutations
have effect sizes much smaller than the typical
causative mutation that segregates
38. H^2 insensitive to growth
[Figure: mean broad-sense heritability (0.01 to 0.04) against average effect size of new mutation (0.0 to 0.5). Models: constant, growth.]
Unpublished
39. Consistent with recent
findings from other groups
But despite these substantial shifts in the overall frequency spectrum, the impact on genetic load (namely, the mean number of deleterious variants per individual and thus the average fitness) is much more subtle. In the semidominant case, the individual burden is essentially unaffected by these demographic events (Fig. 1c,d). With growth, the increased number of segregating sites is balanced exactly by a decrease in the mean frequency (with the converse being true for the bottleneck model) so that the number of variants per individual stays constant. This kind of balance is predicted by classic mutation-selection balance models (ref. 18) and can be shown to hold for general changes in population size, provided that selection is strong and deleterious alleles are at least partially dominant (Supplementary Note).
The behavior of the recessive model is more complicated (Fig. 1e,f). In the bottle-
[Figure: panels a to f. (a) Bottleneck and (b) growth: population size against time (generations). (c,d) Semidominant and (e,f) recessive: number per MB against time since beginning of bottleneck or growth (generations). Curves: number of segregating sites; number of rare segregating sites; number of rare deleterious alleles per individual; load (number of deleterious alleles per individual, or number of homozygous sites per individual).]
Figure 1 Time course of load and other key aspects of variation through a bottleneck and exponential growth. (a,b) The bottleneck (a) and exponential growth (b). (c-f) The expected number of variants and alleles per MB assuming semidominant mutations (c,d) or recessive mutations (e,f) with s = 1% and a mutation rate per site per generation of 10^-8.
Simons et al.
doi:10.1038/ng.2896
40. Power is affected
[Figure: two panels. Frequency in population (0.00 to 0.08) against effect size of segregating causative mutation (0.000 to 0.100), for the constant and growth models. Power (0.0 to 0.8) against mean effect size of causative mutation (0.0 to 0.5), for the statistics ESM50, Logit and SKAT under the constant and growth models.]
Unpublished
41. Excellent fit to empirical data
[Figure: number of markers (0 to 14) against frequency of most-associated marker (0.0 to 1.0).]
Unpublished
42. Implications
• Power to detect regions with modest effects on risk
(4-5% contribution to broad-sense heritability) is
very low in growing populations
• The explanatory power of simple models is
probably far from exhausted
43. Implications
• Much more likely to detect loci
with mutations of modest
effect
• Underlying distribution of
mean effect size across loci is
completely unknown in any
system
[Figure: power (0.0 to 0.8) against mean effect size of causative mutation (0.0 to 0.5), for the statistics ESM50, Logit and SKAT under the constant and growth models.]
Unpublished
44. Future work
• Multilocus models with epistasis
• Machine learning approaches: do they work?
• Develop new simulation tools
• Make simulation output available
• Implement ESM test for analyzing real GWAS data
45. Other work in the lab
• Copy number variation in Drosophila: doi: 10.1093/molbev/msu124
• Detecting TE insertions using paired-end data in
Drosophila: doi: 10.1093/molbev/mst129
• Modeling experimental evolution: doi: 10.1093/molbev/msu048
• Structural variation and variation in gene
expression