A short and naive introduction to epistasis in association studies
1. A short and naive introduction to epistasis in
association studies
Nathalie Villa-Vialaneix
nathalie.villa-vialaneix@inra.fr
http://www.nathalievilla.org
EpiFun
June 1st, 2018 - Paris
Nathalie Villa-Vialaneix | Epistasis and GWAS 1/23
3. What is this presentation about?
[Figure: standard GWAS, comparing "Disease" and "Healthy" groups]
What we are interested in:
epistasis: the interaction between two (or more) SNPs influences the
phenotype even though no single SNP does on its own
how to detect SNP/SNP and gene/gene interactions?
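To make the definition of epistasis concrete, here is a minimal sketch of a "pure" interaction: a penetrance table in which the risk depends on the pair of alleles while each locus, taken alone, shows no effect. The penetrance values (0.1 and 0.5), the allele frequency of 0.5 and the independence of the two loci are all made-up assumptions for illustration.

```python
# Hypothetical penetrance table P(Y = 1 | x1, x2) for two binary allelic
# indicators (x1 = I{allele A}, x2 = I{allele B}); XOR-like pattern.
PENETRANCE = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.1}

# Assumption: both alleles have frequency 0.5 and the loci are independent.
FREQ = 0.5


def marginal_risk(locus, value):
    """P(Y = 1 | x_locus = value), averaging over the other locus."""
    risk = 0.0
    for other in (0, 1):
        pair = (value, other) if locus == 0 else (other, value)
        p_other = FREQ if other == 1 else 1 - FREQ
        risk += p_other * PENETRANCE[pair]
    return risk


for locus in (0, 1):
    # The two marginal risks are equal: a single-SNP analysis sees nothing.
    print("locus", locus + 1, marginal_risk(locus, 0), marginal_risk(locus, 1))
```

A "one locus at a time" analysis compares exactly these marginal risks, which is why it cannot pick up this kind of signal.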
4. Everything is easier with a picture
6. Disclaimer
naive (but hopefully comprehensive) presentation
aims to give an overview rather than precise directions
might contain errors, overclaims, missing references, badly understood
concepts... to keep you awake
Two main reviews used to make these slides:
[Neil et al., 2015, Stanislas, 2017, Emily, 2018].
Material:
References at the end of the slides
these slides on my website:
http://www.nathalievilla.org/seminars2018.html
most articles available online at
http://nextcloud.nathalievilla.org/index.php/s/VLlheqpwhwD8eeZ
(ask me to be granted write rights)
7. Evidence for epistasis
1 missing heritability: in GWAS, the detected variants explain only a small
part of the genetic variance of the phenotype (with a “one locus at a time”
strategy)
2 small effect size of most SNPs
3 possible explanation from an evolutionary perspective: epistasis yields
robust systems that are resistant to variations
9. (a bit) More formal definition(s)...
no consensus on the definition...!
biology/statistics [Neil et al., 2015]
biological point of view: (originally) effect of an allele at a given locus
is hidden by the effect of another allele at a second locus – (more
recently) effect of an allele at a given locus depends on the presence
or absence of a genetic variant at another locus
statistical point of view [Fisher, 1918]: departure from additive effects of
genetic variants with respect to their global contribution to the
phenotype
11. (a bit) More formal definition(s)...
no consensus on the definition...!
[Emily, 2018]: in the case of a phenotype Y ∈ {0, 1} (cases and controls) and
two loci with variants {A, a} and {B, b} respectively, epistasis can be
defined:
at the allele ((A, B), (A, b), (a, B), (a, b)) or genotype ((AA, BB),
(Aa, BB), (Aa, Bb), ...) level
from a statistical point of view (departure from linearity measured by odds
ratios between cases and controls) or a biological one (measures of
association between the two loci, assumed to be equal in cases and controls
under the null hypothesis)
12. Back to pictures
[Figure, three panels: "G hides effect of B" (original definition);
"interaction" (extension); "independent effects" (lack of epistasis?)]
[Cordell, 2002]: the statistical definition is less ambiguous, even though it is
often hard to interpret from a biological point of view
13. Challenges for epistasis detection
statistical: “small n, large p” problems (at least at genome scale)
computational: complexity linear in n but exponential in the order of the
interactions (C(p, k) candidate sets of k SNPs among p)
biological: gap between statistical and biological (functional)
interpretations
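A quick count makes the computational challenge tangible. The sketch below only evaluates the binomial coefficient C(p, k); the value p = 500,000 is a hypothetical array size chosen for illustration.

```python
from math import comb


def n_candidate_sets(p, k):
    """Number of distinct sets of k SNPs among p, i.e. C(p, k)."""
    return comb(p, k)


p = 500_000  # hypothetical number of genotyped SNPs
for k in (2, 3):
    # Each candidate set requires (at least) one test over the n individuals.
    print(f"order {k}: {n_candidate_sets(p, k):,} candidate sets")
```

Going from pairs to triplets multiplies the number of candidate sets by roughly p / 3, which is why exhaustive searches rarely go beyond pairwise interactions.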
14. 1 A tentative definition
2 SNP-SNP approaches
3 SNPset-SNPset approaches
4 GWAS
16. Background
Purpose
Given two loci X1 and X2 (allelic or genotype level), how to detect their
epistatic effect on Y (cases/controls)?
1 regression based methods (mostly linear)
2 comparison of correlation in cases / controls (or odds-ratio
differences)
3 information theory based methods
Other approaches based on ROC analysis for instance (not discussed)
18. Regression based methods
1 {stat, allele} PLINK [Purcell et al., 2007]: logistic regression
$$\operatorname{logit} P(Y = 1 \mid (x_1, x_2)) = \underbrace{\alpha + \beta\, \mathbb{I}_{\{x_1 = A\}} + \gamma\, \mathbb{I}_{\{x_2 = B\}}}_{\text{additive effect}} + \underbrace{\delta\, \mathbb{I}_{\{(x_1, x_2) = (A, B)\}}}_{\text{departure from additivity}}$$
and test of δ = 0 (genotypic version in [Cordell, 2002])
2 {stat, geno} BOOST [Wan et al., 2010]: Poisson GLM (same approach with
a count model and boolean computations)
Limitations: computational optimization of the maximum likelihood (can be
numerically unstable or difficult), only linear interactions
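With two binary indicators, the logistic model above has four parameters for four cells, so it is saturated and δ has a closed form: it is the interaction log odds ratio. The sketch below uses that shortcut together with the usual large-sample variance (sum of inverse cell counts). It is an illustration of the test of δ = 0, not a description of how PLINK computes it. The function name `interaction_wald_test` and the counts are hypothetical, and empty cells would need a continuity correction that is not handled here.

```python
import math


def interaction_wald_test(counts):
    """Wald test of delta = 0 in the saturated allelic logistic model.

    counts[(a, b)] = (n_cases, n_controls), with a = I{x1 = A}, b = I{x2 = B}.
    Returns (delta, Wald statistic, p-value from a chi-square with 1 df).
    """
    def log_odds(cell):
        cases, controls = counts[cell]
        return math.log(cases / controls)

    # In the saturated model, delta is the interaction log odds ratio.
    delta = (log_odds((1, 1)) - log_odds((1, 0))
             - log_odds((0, 1)) + log_odds((0, 0)))
    # Large-sample variance of delta: sum of the inverse of the 8 counts.
    variance = sum(1.0 / n for cell in counts.values() for n in cell)
    wald = delta ** 2 / variance
    # P(chi2 with 1 df > w) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return delta, wald, p_value


# Made-up counts: odds are 0.5, 1, 1 and 3 in the four cells.
example = {(0, 0): (100, 200), (1, 0): (150, 150),
           (0, 1): (120, 120), (1, 1): (300, 100)}
print(interaction_wald_test(example))
```

For larger models (genotype coding, covariates) there is no such closed form and the likelihood has to be maximized numerically, which is where the instability mentioned on the slide comes from.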
20. Wald-like test methods
Principle: test H0 “no association” with a statistic W that measures the
“association” between X1 and X2 for the outcome Y, where (usually) W ∼ χ²
under H0
Example (the simplest): [Zhao et al., 2006] {bio, allele}
$$W = \frac{(r_1 - r_0)^2}{\operatorname{Var}(r_1) + \operatorname{Var}(r_0)}$$
where $r_k = \operatorname{Cor}(\mathbb{I}_{\{X_1 = A\}}, \mathbb{I}_{\{X_2 = B\}} \mid Y = k)$.
Other approaches are based on odds ratios [Emily, 2002] {bio, geno}.
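The statistic W above can be sketched in a few lines. One strong assumption is made here: Var(r) is approximated by the classical large-sample formula (1 − r²)² / n for a Pearson correlation, which is not necessarily the variance estimator used in [Zhao et al., 2006]. The function names and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_contrast(x1, x2, y):
    """W = (r1 - r0)^2 / (Var(r1) + Var(r0)) and its chi-square(1) p-value."""
    stats = {}
    for k in (0, 1):
        a = [u for u, label in zip(x1, y) if label == k]
        b = [v for v, label in zip(x2, y) if label == k]
        r = pearson(a, b)
        # ASSUMPTION: large-sample approximation of Var(r).
        stats[k] = (r, (1 - r ** 2) ** 2 / len(a))
    (r0, v0), (r1, v1) = stats[0], stats[1]
    w = (r1 - r0) ** 2 / (v1 + v0)
    return w, math.erfc(math.sqrt(w / 2.0))


# Toy data: indicators are uncorrelated in controls, correlated in cases.
controls = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25
cases = [(1, 1)] * 40 + [(0, 0)] * 40 + [(1, 0)] * 10 + [(0, 1)] * 10
x1 = [a for a, _ in controls + cases]
x2 = [b for _, b in controls + cases]
y = [0] * len(controls) + [1] * len(cases)
print(correlation_contrast(x1, x2, y))
```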
22. Entropy based methods
Methods based on information theory [Shannon, 1948] (powerful to catch
nonlinear interactions)
Mutual information
$$I(X_1, X_2) = \sum_{x_1 \in \{AA, Aa, aa\}} \sum_{x_2 \in \{BB, Bb, bb\}} p_{12} \log \frac{p_{12}}{p_1 p_2}$$
with $p_{12} = P(X_1 = x_1, X_2 = x_2)$ and $p_j = P(X_j = x_j)$.
Example [Fan et al., 2011]: IG = I(X1, X2 | Y = 1) − I(X1, X2), with resampling
methods to test significance
Limitation: lack of known distribution under H0
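The mutual information and the information gain IG defined above can be estimated with plug-in frequencies, and the missing null distribution can be replaced by a permutation of the case/control labels. The sketch below assumes genotypes coded 0/1/2 and a plain label permutation as the resampling scheme. The scheme actually used in [Fan et al., 2011] may differ, and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def mutual_information(x1, x2):
    """Plug-in estimate of I(X1, X2) (natural logarithm)."""
    n = len(x1)
    joint, m1, m2 = Counter(zip(x1, x2)), Counter(x1), Counter(x2)
    return sum((c / n) * math.log((c / n) / ((m1[a] / n) * (m2[b] / n)))
               for (a, b), c in joint.items())


def information_gain(x1, x2, y):
    """IG = I(X1, X2 | Y = 1) - I(X1, X2)."""
    cases1 = [a for a, label in zip(x1, y) if label == 1]
    cases2 = [b for b, label in zip(x2, y) if label == 1]
    return mutual_information(cases1, cases2) - mutual_information(x1, x2)


def permutation_p_value(x1, x2, y, n_perm=200, seed=0):
    """One-sided permutation p-value for IG (labels are shuffled)."""
    rng = random.Random(seed)
    observed = information_gain(x1, x2, y)
    labels = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if information_gain(x1, x2, labels) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: 10 individuals per genotype pair; cases are those with x1 == x2.
pairs = [(a, b) for a in (0, 1, 2) for b in (0, 1, 2)] * 10
g1 = [a for a, _ in pairs]
g2 = [b for _, b in pairs]
status = [1 if a == b else 0 for a, b in pairs]
print(permutation_p_value(g1, g2, status))
```

In the toy data the two loci are independent in the whole sample but perfectly dependent among cases, so IG equals log 3, the largest value reachable with three genotypes per locus.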
24. Background
Purpose
Given two sets of SNPs (genes, haplotypes, ...) $X_1 = (X_{11}, \ldots, X_{1 m_1})$ and
$X_2 = (X_{21}, \ldots, X_{2 m_2})$ (allelic or genotype level), how to detect a global
epistatic effect on Y (cases/controls)?
⇒ “summary” of SNP-level analyses.
1 combination of tests (multiple testing or global test)
2 multidimensional analysis (regression models, tests, entropy based
methods at the set level)
3 kernel based methods
27. Combining tests
1 Multiple testing: test all interactions (X1j, X2k) and obtain m1 m2
p-values (of non-independent tests), then apply a multiple testing procedure
(Simes to control the intersection of null hypotheses and a “number of
effective tests” to account for correlations): GATES [Li et al., 2011]
(other approaches combining p-values have been proposed)
2 Global distribution of test statistics: with $W_{jk}$ the test statistic of the
logistic regression for the pair (X1j, X2k),
$W = [W_{11}, \ldots, W_{m_1 m_2}] \sim \mathcal{N}(0, \Sigma)$;
derive a p-value from $\mathcal{N}(0, \Sigma)$, with an estimation of Σ: minP
[Emily, 2016]
Limitations: only linear interactions; computational issues (both methods);
hyper-parameter hard to set (effective number of tests; GATES)
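The Simes step of the first approach is easy to sketch. The version below is the plain Simes combination, which takes the m1 m2 pairwise p-values and returns one p-value for the intersection of the null hypotheses. GATES additionally replaces the number of tests and the ranks by "effective" numbers estimated from the correlation between SNPs, and that part is deliberately not implemented here. The function name is hypothetical.

```python
def simes_p_value(p_values):
    """Plain Simes combination: min over i of m * p_(i) / i.

    p_(1) <= ... <= p_(m) are the ordered p-values of the m pairwise tests.
    """
    ordered = sorted(p_values)
    m = len(ordered)
    combined = min(m * p / rank for rank, p in enumerate(ordered, start=1))
    return min(1.0, combined)


# Hypothetical p-values for the 2 x 2 SNP pairs of two small genes.
print(simes_p_value([0.01, 0.04, 0.03, 0.5]))
```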
29. Multidimensional methods I
1 dimension reduction: summarize each SNP set with a few numerical
values (PCA, CCA...) and perform logistic regression with a test of
the interaction on the summaries:
$$\operatorname{logit} P(Y = 1 \mid (x_1, x_2)) = \alpha + \beta\, PC_1(x_1) + \gamma\, PC_1(x_2) + \delta\, PC_1(x_1)\, PC_1(x_2)$$
and test “δ = 0” [Li et al., 2009, Stanislas et al., 2017]
2 tests: summarize the correlations of SNP sets in cases and controls
(CCA) and compare these two quantities with a test:
$$\frac{z_1 - z_0}{\sqrt{\operatorname{Var}(z_1 - z_0)}} \underset{H_0}{\sim} \mathcal{N}(0, 1)$$
for $z_k$ an adequate transformation of
$\operatorname{Cor}(CCA_1(X_1 \mid Y = k), CCA_1(X_2 \mid Y = k))$ [Peng et al., 2010]
extensions to PLS, KCCA, ...
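The comparison of correlations in the second approach can be illustrated as follows. This sketch assumes that the "adequate transformation" is Fisher's z-transformation and that Var(z1 − z0) is the classical 1/(n1 − 3) + 1/(n0 − 3). That variance holds for ordinary Pearson correlations and is only a rough approximation for canonical correlations, for which resampling is a safer choice. The correlations are taken as inputs (computing the CCA itself is out of scope) and the function name is hypothetical.

```python
import math


def compare_correlations(r1, n1, r0, n0):
    """Test of equal correlation in cases (r1, n1) and controls (r0, n0).

    Returns the standardized difference and its two-sided normal p-value.
    """
    z1, z0 = math.atanh(r1), math.atanh(r0)  # Fisher z-transformation
    # ASSUMPTION: classical variance of a Fisher-transformed correlation.
    variance = 1.0 / (n1 - 3) + 1.0 / (n0 - 3)
    u = (z1 - z0) / math.sqrt(variance)
    p_value = math.erfc(abs(u) / math.sqrt(2.0))
    return u, p_value


# Hypothetical correlations between the first canonical variates.
print(compare_correlations(r1=0.6, n1=203, r0=0.2, n0=203))
```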
31. Multidimensional methods II
Here the purpose is a bit different: only one SNP set X = (X1, . . . , Xm). Is
this SNP set associated with the phenotype? (similar to what is done in
genomic selection)
3 Kernel methods: SKAT [Wu et al., 2010]
what is a kernel?
K is a measure of association between individuals described by their SNP
sets (x1, . . . , xn): K(xi, xj) measures a “resemblance” between i and j.
RKHS: under mild conditions, K defines a
unique Hilbert space, H, and a unique mapping
of the individuals into H, Φ, such that:
$$K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle_{\mathcal{H}}$$
32. Multidimensional methods II
(the only purpose of the previous slide was to finish off people not
paying close enough attention to my talk)
34. Multidimensional methods II (again)
3 Kernel methods: SKAT [Wu et al., 2010]
fixed effect model in RKHS:
$$\operatorname{logit} P(Y_i = 1 \mid X_i) = \alpha + h(X_i) \quad \text{with } h \in \mathcal{H} \text{ to be estimated}$$
is equivalent to a mixed effect model
$$\operatorname{logit} P(Y_i = 1 \mid X_i) = \alpha + h_i \quad \text{with } (h_1, \ldots, h_n) \sim \mathcal{N}(0, \tau K), \ \tau \text{ to be estimated}$$
and tests of “h(X) = 0” (i.e., τ = 0) can be performed using the kernel K
Idea: h is able to capture high order interactions between SNPs within the
set X.
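A kernel-based test can be sketched without any optimization: compute K, then a quadratic score statistic Q = (y − ȳ)ᵀ K (y − ȳ). The sketch below makes two assumptions that are not on the slide: it uses the identity-by-state (IBS) kernel on genotypes coded 0/1/2, and it replaces the analytic null distribution of Q by a permutation of the labels. There are no covariates, and all function names and data are hypothetical.

```python
import random


def ibs_kernel(genotypes):
    """K[i][j] = average allele sharing between individuals i and j."""
    m = len(genotypes[0])
    return [[sum(2 - abs(a - b) for a, b in zip(xi, xj)) / (2 * m)
             for xj in genotypes] for xi in genotypes]


def score_statistic(kernel, y):
    """Q = (y - mean(y))' K (y - mean(y)), model without covariates."""
    n = len(y)
    mean = sum(y) / n
    res = [v - mean for v in y]
    return sum(res[i] * kernel[i][j] * res[j]
               for i in range(n) for j in range(n))


def kernel_permutation_test(genotypes, y, n_perm=500, seed=0):
    """Observed Q and its one-sided permutation p-value."""
    rng = random.Random(seed)
    kernel = ibs_kernel(genotypes)
    observed = score_statistic(kernel, y)
    labels = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if score_statistic(kernel, labels) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy SNP set of 3 SNPs: cases all carry (2, 2, 2), controls (0, 0, 0).
snp_set = [(2, 2, 2)] * 10 + [(0, 0, 0)] * 10
phenotype = [1] * 10 + [0] * 10
print(kernel_permutation_test(snp_set, phenotype))
```

Q is large when individuals who look alike according to K also share the same phenotype. Changing the kernel changes which kind of resemblance, and hence which kind of effect, the test is sensitive to.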
37. Background
Purpose
How to detect epistatic effects genome-wide?
Basics: combine information between SNP-SNP effects or
SNPset-SNPset effects... but combinatorial issues, especially to catch
high order interactions
1 exhaustive approaches
2 filtering
3 machine learning
39. Exhaustive approaches
1 exhaustive testing: PLINK (which multiple testing correction?) or
(penalized) regression [Wu et al., 2009] (Lasso, but not really
genome-wide)
Limitation: mostly restricted to linear effects and pairwise interactions
2 Multifactor Dimensionality Reduction (MDR) (nonparametric, model free,
can deal with high order interactions) [Ritchie et al., 2001]
Limitations: can fail to detect pure epistasis, strongly depends on several
hyperparameters, overfits
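The core step of MDR can be sketched for a single pair of SNPs: each genotype combination is labelled "high risk" when its case/control ratio exceeds a threshold, which collapses the two SNPs into one binary variable that is then used as a classifier. The sketch assumes the threshold is the overall case/control ratio and reports the plain training accuracy. The full procedure in [Ritchie et al., 2001] also loops over all SNP combinations and uses cross-validation, which is omitted here. The function name is hypothetical.

```python
from collections import defaultdict


def mdr_pair(x1, x2, y, threshold=None):
    """High-risk genotype cells and training accuracy for one SNP pair."""
    n_cases = sum(y)
    n_controls = len(y) - n_cases
    if threshold is None:
        # ASSUMPTION: use the overall case/control ratio as threshold.
        threshold = n_cases / n_controls
    cells = defaultdict(lambda: [0, 0])  # cell -> [cases, controls]
    for a, b, label in zip(x1, x2, y):
        cells[(a, b)][0 if label == 1 else 1] += 1
    # A cell is high risk when cases / controls > threshold.
    high_risk = {cell for cell, (cases, controls) in cells.items()
                 if cases > threshold * controls}
    predictions = [1 if (a, b) in high_risk else 0 for a, b in zip(x1, x2)]
    accuracy = sum(p == label for p, label in zip(predictions, y)) / len(y)
    return high_risk, accuracy


# Toy XOR data: 10 individuals per allelic combination.
combos = [(0, 0), (0, 1), (1, 0), (1, 1)] * 10
a1 = [a for a, _ in combos]
a2 = [b for _, b in combos]
labels = [a ^ b for a, b in combos]
print(mdr_pair(a1, a2, labels))
```

Because the accuracy is computed on the data used to define the high-risk cells, it is optimistic, which illustrates the overfitting issue listed above.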
42. Filtering
Idea: filter SNPs or SNP pairs before exhaustive search
1 filtering on marginal effects (prevents the detection of pure epistasis)
[Marchini et al., 2005]
2 Relief: a genetic distance between individuals is used to compute a
measure of the importance of each SNP, according to the differences in this
SNP between neighbors that have the same / a different Y
[Robnik-Šikonja and Kononenko, 2003]
3 biofilter: combines information coming from 13 datasets that identify whether
SNP sets are related to the same pathway, to proteins that interact
(PPI), ... [Pendergrass et al., 2013]
Limitation: strong bias toward the most documented genes/pathways
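The Relief idea can be sketched with its most basic form: for every individual, find the nearest neighbor with the same phenotype (hit) and the nearest one with a different phenotype (miss), then reward SNPs that differ on the miss and penalize SNPs that differ on the hit. This is the original, single-neighbor Relief with a Hamming distance and all individuals used once. The ReliefF variants analysed in the cited paper average over several neighbors. Function names and data are hypothetical.

```python
def hamming(u, v):
    """Number of SNPs at which two individuals differ."""
    return sum(a != b for a, b in zip(u, v))


def relief_weights(genotypes, y):
    """Basic Relief importance weights, one per SNP."""
    n, m = len(genotypes), len(genotypes[0])
    weights = [0.0] * m
    for i in range(n):
        others = [j for j in range(n) if j != i]
        hits = [j for j in others if y[j] == y[i]]
        misses = [j for j in others if y[j] != y[i]]
        def dist(j):
            return hamming(genotypes[i], genotypes[j])
        nearest_hit = min(hits, key=dist)
        nearest_miss = min(misses, key=dist)
        for s in range(m):
            # Differing on the nearest miss is evidence of relevance...
            weights[s] += (genotypes[i][s] != genotypes[nearest_miss][s]) / n
            # ...differing on the nearest hit is evidence against it.
            weights[s] -= (genotypes[i][s] != genotypes[nearest_hit][s]) / n
    return weights


# Toy data: Y = SNP1 XOR SNP2 (pure epistasis), SNP3 is noise.
individuals = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
outcome = [a ^ b for a, b, _ in individuals]
print(relief_weights(individuals, outcome))
```

In this toy example neither SNP1 nor SNP2 has a marginal effect, so a marginal filter would discard both, whereas Relief still ranks them above the noise SNP.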
44. ML approaches
Idea: fit a machine learning (ML) model that predicts Y given all SNPs and try
to extract information about interactions: random forests (with conditional
variable importance [Bureau et al., 2004, Strobl et al., 2008]), Bayesian
networks (BEAM, [Zhang and Liu, 2007]), ... (I guess: evolutionary algorithms,
deep NN, ant colony, ...)
Limitations: n might be too small to make a nonparametric estimation
affordable (from a statistical perspective)
45. no conclusion because this is just the beginning of the discussion...
(and I was dead tired finishing my slides at 4 am this morning)
46. References
Bureau, A., Dupuis, J., Falls, K., Lunetta, K. L., Hayward, B., Keith, T. P., and van Eerdewegh, P. (2004).
Identifying SNPs predictive of phenotype using random forests.
Genetic Epidemiology, 28(2):171–182.
Cordell, H. J. (2002).
Epistasis: what it means, what it doesn’t mean, and statistical methods to detect it in humans.
Human Molecular Genetics, 11(20):2463–2468.
Emily, M. (2002).
IndOR: a new statistical procedure to test for SNP-SNP epistasis in genome-wide association studies.
Statistics in Medicine, 31(21):2359–2373.
Emily, M. (2016).
AGGrEGATOr: a gene-based gene-gene interaction test for case-control association studies.
Statistical Applications in Genetics and Molecular Biology, 15(2):151–171.
Emily, M. (2018).
A survey of statistical methods for gene-gene interaction in case-control genome-wide association studies.
Journal de la Société Française de Statistique, 159(1):27–67.
Fan, R., Zhong, M., Wang, S., Andrew, A., Karagas, M., Chen, H., Amos, C., Xiong, M., and Moore, J. (2011).
Entropy-based information gain approaches to detect and to characterize gene-gene and gene-environment
interactions/correlations of complex diseases.
Genetic Epidemiology, 35:706–721.
Fisher, R. (1918).
The correlation between relatives on the supposition of Mendelian inheritance.
Transactions of the Royal Society of Edinburgh, 52(9):399–433.
Li, J., Tang, R., Biernacka, J. M., and de Andrade, M. (2009).
Identification of gene-gene interaction using principal components.
BMC Proceedings, 3(Suppl 7):S78.
Li, M.-X., Gui, H.-S., Kwan, J. S. H., and Sham, P. C. (2011).
GATES: a rapid and powerful gene-based association test using extended Simes procedure.
The American Journal of Human Genetics, 88(3):283–293.
Marchini, J., Donnelly, P., and Cardon, L. R. (2005).
Genome-wide strategies for detecting multiple loci that influence complex diseases.
Nature Genetics, 37:413–417.
Neil, C., Sinoquet, C., Dina, C., and Rocheleau, G. (2015).
A survey about methods dedicated to epistasis detection.
Frontiers in Genetics.
Pendergrass, S. A., Frase, A., Wallace, J., Wolfe, D., Katiyar, N., Moore, C., and Ritchie, M. D. (2013).
Genomic analyses with biofilter 2.0: knowledge driven filtering, annotation, and model development.
BioData Mining, 6:25.
Peng, Q., Zhao, J., and Xue, F. (2010).
A gene-based method for detecting gene-gene co-association in a case-control association study.
European Journal of Human Genetics, 18:582–587.
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., Maller, J., and Sklar, P. (2007).
PLINK: a tool set for whole-genome association and population-based linkage analyses.
The American Journal of Human Genetics, 81(3):559–575.
Ritchie, M. D., Hahn, L. W., Roodi, N., Bailey, L. R., Dupont, W. D., Parl, F. F., and Moore, J. H. (2001).
Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast
cancer.
The American Journal of Human Genetics, 69(1):138–147.
Robnik-Šikonja, M. and Kononenko, I. (2003).
Theoretical and empirical analysis of ReliefF and RReliefF.
Machine Learning, 53(1-2):23–69.
Shannon, C. E. (1948).
A mathematical theory of communication.
Bell System Technical Journal, 27:379–423 and 623–656.
Stanislas, V. (2017).
Approches statistiques pour la detection d’épistasie dans les études d’associations pangénomiques.
Thèse de doctorat, Université Paris Saclay, Paris, France.
Stanislas, V., Dalmasso, C., and Ambroise, C. (2017).
Eigen-epistasis for detecting gene-gene interactions.
BMC Bioinformatics, 18:54.
Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., and Zeileis, A. (2008).
Conditional variable importance for random forests.
BMC Bioinformatics, 9:307.
Wan, X., Yang, C., Yang, Q., Xue, H., Fan, X., Tang, N. L., and Yu, W. (2010).
BOOST: a fast approach to detecting gene-gene interactions in genome-wide case-control studies.
The American Journal of Human Genetics, 87(3):325–340.
Wu, M. C., Kraft, P., Epstein, M. P., Taylor, D. M., Chanock, S. J., Hunter, D. J., and Lin, X. (2010).
Powerful SNP-set analysis for case-control genome-wide association studies.
American Journal of Human Genetics, 86(6):929–942.
Wu, T. T., Chen, Y. F., Hastie, T., Sobel, E., and Lange, K. (2009).
Genome-wide association analysis by lasso penalized logistic regression.
Bioinformatics, 25(6):714–721.
Zhang, Y. and Liu, J. S. (2007).
Bayesian inference of epistatic interactions in case-control studies.
Nature Genetics, 39:1167–1173.
Zhao, J., Jin, L., and Xiong, M. (2006).
Test for interaction between two unlinked loci.
The American Journal of Human Genetics, 79(5):831–845.