SlideShare ist ein Scribd-Unternehmen logo
1 von 94
Downloaden Sie, um offline zu lesen
Methods for High Dimensional Interactions
Sahir Rai Bhatnagar, PhD Candidate – McGill Biostatistics
Joint work with Yi Yang, Mathieu Blanchette and Celia Greenwood
Ludmer Center – May 19, 2016
Underlying objective of this talk
1
Motivation
one predictor variable at a time
Predictor Variable Phenotype
one predictor variable at a time
Predictor Variable Phenotype
Test 1
Test 2
Test 3
Test 4
Test 5
2
a network based view
Predictor Variable Phenotype
a network based view
Predictor Variable Phenotype
a network based view
Predictor Variable Phenotype
Test 1
3
system level changes due to environment
Predictor Variable PhenotypeEnvironment
A
B
system level changes due to environment
Predictor Variable PhenotypeEnvironment
A
B
Test 1
4
Motivating Dataset: Newborn epigenetic adaptations to gesta-
tional diabetes exposure (Luigi Bouchard, Sherbrooke)
Environment
Gestational
Diabetes
Large Data
Child’s epigenome
(p ≈ 450k)
Phenotype
Obesity measures
5
Differential Correlation between environments
(a) Gestational diabetes affected pregnancy (b) Controls
6
Gene Expression: COPD patients
(a) Gene Exp.: Never Smokers (b) Gene Exp.: Current Smokers
(c) Correlations: Never Smokers (d) Correlations: Current Smokers
7
Imaging Data: Topological properties and Age
8
Correlations differ between Age groups
9
NIH MRI brain study
Environment
Age
Large Data
Cortical Thickness
(p ≈ 80k)
Phenotype
Intelligence
10
Differential Networking
11
formal statement of initial problem
• n: number of subjects
12
formal statement of initial problem
• n: number of subjects
• p: number of predictor variables
12
formal statement of initial problem
• n: number of subjects
• p: number of predictor variables
• Xn×p: high dimensional data set (p >> n)
12
formal statement of initial problem
• n: number of subjects
• p: number of predictor variables
• Xn×p: high dimensional data set (p >> n)
• Yn×1: phenotype
12
formal statement of initial problem
• n: number of subjects
• p: number of predictor variables
• Xn×p: high dimensional data set (p >> n)
• Yn×1: phenotype
• En×1: environmental factor that has widespread effect on X and can
modify the relation between X and Y
12
formal statement of initial problem
• n: number of subjects
• p: number of predictor variables
• Xn×p: high dimensional data set (p >> n)
• Yn×1: phenotype
• En×1: environmental factor that has widespread effect on X and can
modify the relation between X and Y
Objective
• Which elements of X that are associated with Y , depend on E?
12
conceptual model
Environment
ff(Maternal
care, Age, Diet)
E = 0
E = 1
conceptual model
Environment
ff(Maternal
care, Age, Diet)
E = 0
E = 1
Large Data (p >> n)
Gene Expression t
DNA Methylation
t Brain Imaging
Gene Expression t
DNA Methylation
t Brain Imaging
conceptual model
Environment
ff(Maternal
care, Age, Diet)
E = 0
E = 1
Large Data (p >> n)
Gene Expression t
DNA Methylation
t Brain Imaging
Gene Expression t
DNA Methylation
t Brain Imaging
Phenotype (Behavioral
development, IQ
scores, Death)
conceptual model
Environment
ff(Maternal
care, Age, Diet)
E = 0
E = 1
Large Data (p >> n)
Gene Expression t
DNA Methylation
t Brain Imaging
Gene Expression t
DNA Methylation
t Brain Imaging
Phenotype (Behavioral
development, IQ
scores, Death)
epidemiological study
conceptual model
Environment
ff(Maternal
care, Age, Diet)
E = 0
E = 1
Large Data (p >> n)
Gene Expression t
DNA Methylation
t Brain Imaging
Gene Expression t
DNA Methylation
t Brain Imaging
Phenotype (Behavioral
development, IQ
scores, Death)
(epi)genetic/imaging associations
conceptual model
Environment
ff(Maternal
care, Age, Diet)
E = 0
E = 1
Large Data (p >> n)
Gene Expression t
DNA Methylation
t Brain Imaging
Gene Expression t
DNA Methylation
t Brain Imaging
Phenotype (Behavioral
development, IQ
scores, Death)
(epi)genetic/imaging associations
(epi)genetic/imaging associations
conceptual model
Environment
ff(Maternal
care, Age, Diet)
E = 0
E = 1
Large Data (p >> n)
Gene Expression t
DNA Methylation
t Brain Imaging
Gene Expression t
DNA Methylation
t Brain Imaging
Phenotype (Behavioral
development, IQ
scores, Death)
13
Is this mediation analysis?
14
Is this mediation analysis?
• No
14
Is this mediation analysis?
• No
• We are not making any causal claims i.e. direction of the arrows
14
Is this mediation analysis?
• No
• We are not making any causal claims i.e. direction of the arrows
• There are many untestable assumptions required for such analysis
→ not well understood for HD data
14
Methods
analysis strategies
marginal correlations (univariate p-value)
multiple testing adjustment
Single-Marker or Single Variable Tests
analysis strategies
marginal correlations (univariate p-value)
multiple testing adjustment
Single-Marker or Single Variable Tests
LASSO (convex penalty with one tuning parameter)
MCP, SCAD, Dantzig selector (non-convex penalty with two tuning parameters)
Group level penalization (group LASSO, SCAD and MCP)
Multivariate Regression Approaches Including Penalization Methods
analysis strategies
marginal correlations (univariate p-value)
multiple testing adjustment
Single-Marker or Single Variable Tests
LASSO (convex penalty with one tuning parameter)
MCP, SCAD, Dantzig selector (non-convex penalty with two tuning parameters)
Group level penalization (group LASSO, SCAD and MCP)
Multivariate Regression Approaches Including Penalization Methods
cluster features based on euclidean distance, correlation, connectivity
regression with group level summary (PCA, average)
Clustering Together with Regression
15
ECLUST - our proposed method: 3 phases
Original Data
ECLUST - our proposed method: 3 phases
Original Data
E = 0
1) Gene Similarity
E = 1
ECLUST - our proposed method: 3 phases
Original Data
E = 0
1) Gene Similarity
E = 1
ECLUST - our proposed method: 3 phases
Original Data
E = 0
1) Gene Similarity
E = 1
2) Cluster
Representation
ECLUST - our proposed method: 3 phases
Original Data
E = 0
1) Gene Similarity
E = 1
2) Cluster
Representation
n × 1 n × 1
ECLUST - our proposed method: 3 phases
Original Data
E = 0
1) Gene Similarity
E = 1
2) Cluster
Representation
n × 1 n × 1
3) Penalized
Regression
Yn×1∼ + ×E
16
the objective of statistical
methods is the reduction of data.
A quantity of data . . . is to be
replaced by relatively few quantities
which shall adequately represent
. . . the relevant information
contained in the original data.
- Sir R. A. Fisher, 1922
16
Underlying model
Y = β0 + β1U + β2U · E + ε (1)
17
Underlying model
Y = β0 + β1U + β2U · E + ε (1)
X ∼ F(α0 + α1U, ΣE ) (2)
17
Underlying model
Y = β0 + β1U + β2U · E + ε (1)
X ∼ F(α0 + α1U, ΣE ) (2)
• U: unobserved latent variable
• X: observed data which is a function of U
• ΣE : environment sensitive correlation matrix
17
ECLUST - our proposed method: 3 phases
Original Data
E = 0
1) Gene Similarity
E = 1
2) Cluster
Representation
n × 1 n × 1
3) Penalized
Regression
Yn×1∼ + ×E
18
advantages and disadvantages
General Approach Advantages Disadvantages
Single-Marker simple, easy to implement
multiple testing burden,
power, interpretability
Penalization
multivariate, variable
selection, sparsity, efficient
optimization algorithms
poor sensitivity with
correlated data, ignores
structure in design matrix,
interpretability
Environment Cluster with
Regression
multivariate, flexible
implementation,
group structure, takes
advantage of correlation,
interpretability
difficult to identify relevant
clusters, clustering is
unsupervised
19
Methods to detect gene clusters
Table 1: Methods to detect gene clusters
General Approach Formula
Correlation
pearson, spearman,
biweight midcorrelation
Correlation Scoring |ρE=1 − ρE=0|
Weighted Correlation
Scoring
c|ρE=1 − ρE=0|
Fisher’s Z
Transformation
|zij0−zij1|
√
1/(n0−3)+1/(n1−3)
20
Cluster Representation
Table 2: Methods to create cluster representations
General Approach Type
Unsupervised average
K principal components
Supervised partial least squares
21
Simulation Studies
Simulation Study 1
(a) Corr(XE=0) (b) Corr(XE=1)
(c) |Corr(XE=1) − Corr(XE=0)| (d) Corr(Xall)
22
Results: Jaccard Index and test set MSE
23
Simulation Study 2
24
TOM based on all subjects
(a) TOM(Xall)
25
TOM based on unexposed subjects
(a) TOM(XE=0)
26
TOM based on exposed subjects
(a) TOM(XE=1)
27
Difference of TOMs
(a) |TOM(XE=1) − TOM(XE=0)|
28
Results: Test set MSE
29
Strong Heredity Models
Model
g(µ) =β0 + β1X1 + · · · + βpXp + βE E
main effects
+ α1E (X1E) + · · · + αpE (XpE)
interactions
• g(·) is a known link function
• µ = E [Y |X, E, β, α]
• β = (β1, β2, . . . , βp, βE ) ∈ Rp+1
• α = (α1E , . . . , αpE ) ∈ Rp
30
Variable Selection
arg min
β0,β,α
1
2
Y − g(µ)
2
+ λ ( β 1 + α 1)
• Y − g(µ)
2
= i (yi − g(µi ))2
• β 1 = j |βj |
• α 1 = j |αj |
• λ ≥ 0: tuning parameter
31
Why Strong Heredity?
• Statistical Power: large main effects are more likely to lead to
detectable interactions than small ones
32
Why Strong Heredity?
• Statistical Power: large main effects are more likely to lead to
detectable interactions than small ones
• Interpretability: Assuming a model with interaction only is generally
not biologically plausible
32
Why Strong Heredity?
• Statistical Power: large main effects are more likely to lead to
detectable interactions than small ones
• Interpretability: Assuming a model with interaction only is generally
not biologically plausible
• Practical Sparsity: X1, E, X1 · E vs. X1, E, X2 · E
32
Model
g(µ) =β0 + β1X1 + · · · + βpXp + βE E
main effects
+ α1E (X1E) + · · · + αpE (XpE)
interactions
1Choi et al. 2010, JASA
2Chipman 1996, Canadian Journal of Statistics
33
Model
g(µ) =β0 + β1X1 + · · · + βpXp + βE E
main effects
+ α1E (X1E) + · · · + αpE (XpE)
interactions
Reparametrization1
: αjE = γjE βj βE .
1Choi et al. 2010, JASA
2Chipman 1996, Canadian Journal of Statistics
33
Model
g(µ) =β0 + β1X1 + · · · + βpXp + βE E
main effects
+ α1E (X1E) + · · · + αpE (XpE)
interactions
Reparametrization1
: αjE = γjE βj βE .
Strong heredity principle2
:
ˆαjE = 0 ⇒ ˆβj = 0 and ˆβE = 0
1Choi et al. 2010, JASA
2Chipman 1996, Canadian Journal of Statistics
33
Strong Heredity Model with Penalization
arg min
β0,β,γ
1
2
Y − g(µ)
2
+
λβ (w1β1 + · · · + wqβq + wE βE ) +
λγ (w1E γ1E + · · · + wqE γqE )
wj =
1
ˆβj
, wjE =
ˆβj
ˆβE
ˆαjE
34
Open source software
• Software implementation in R: http://sahirbhatnagar.com/eclust/
• Allows user specified interaction terms
• Automatically determines the optimal tuning parameters through
cross validation
• Can also be applied to genetic data (SNPs)
35
Feature Screening and
Non-linear associations
The most popular way of feature screening
How to fit statistical models when you have over 100,000 features?
36
The most popular way of feature screening
How to fit statistical models when you have over 100,000 features?
Marginal correlations, t-tests
• for each feature, calculate the correlation between X and Y
36
The most popular way of feature screening
How to fit statistical models when you have over 100,000 features?
Marginal correlations, t-tests
• for each feature, calculate the correlation between X and Y
• keep all features with correlation greater than some threshold
36
The most popular way of feature screening
How to fit statistical models when you have over 100,000 features?
Marginal correlations, t-tests
• for each feature, calculate the correlation between X and Y
• keep all features with correlation greater than some threshold
• However this procedure assumes a linear relationship between X and
Y
36
Non-linear feature screening: Kolmogorov-Smirnov Test
Mai & Zou (2012) proposed using the Kolmogorov-Smirnov (KS) test
statistic
ˆKj = sup
x
|ˆFj (x|Y = 1) − ˆFj (x|Y = 0)| (3)
Figure 8: Depiction of KS statistic
37
Non-linear Interaction Models
After feature screening, we can fit non-linear relationships between
X and Y
Yi = β0 + f (Xij ) + f (Xij , Ei ) + εi (4)
38
Conclusions
Conclusions and Contributions
• Large system-wide changes are observed in many environments
39
Conclusions and Contributions
• Large system-wide changes are observed in many environments
• This assumption can possibly be exploited to aid analysis of large
data
39
Conclusions and Contributions
• Large system-wide changes are observed in many environments
• This assumption can possibly be exploited to aid analysis of large
data
• We develop and implement a multivariate penalization procedure for
predicting a continuous or binary disease outcome while detecting
interactions between high dimensional data (p >> n) and an
environmental factor.
39
Conclusions and Contributions
• Large system-wide changes are observed in many environments
• This assumption can possibly be exploited to aid analysis of large
data
• We develop and implement a multivariate penalization procedure for
predicting a continuous or binary disease outcome while detecting
interactions between high dimensional data (p >> n) and an
environmental factor.
• Dimension reduction is achieved through leveraging the
environmental-class-conditional correlations
39
Conclusions and Contributions
• Large system-wide changes are observed in many environments
• This assumption can possibly be exploited to aid analysis of large
data
• We develop and implement a multivariate penalization procedure for
predicting a continuous or binary disease outcome while detecting
interactions between high dimensional data (p >> n) and an
environmental factor.
• Dimension reduction is achieved through leveraging the
environmental-class-conditional correlations
• Also, we develop and implement a strong heredity framework
within the penalized model
39
Conclusions and Contributions
• Large system-wide changes are observed in many environments
• This assumption can possibly be exploited to aid analysis of large
data
• We develop and implement a multivariate penalization procedure for
predicting a continuous or binary disease outcome while detecting
interactions between high dimensional data (p >> n) and an
environmental factor.
• Dimension reduction is achieved through leveraging the
environmental-class-conditional correlations
• Also, we develop and implement a strong heredity framework
within the penalized model
• R software: http://sahirbhatnagar.com/eclust/
39
Limitations
• There must be a high-dimensional signature of the exposure
40
Limitations
• There must be a high-dimensional signature of the exposure
• Clustering is unsupervised
40
Limitations
• There must be a high-dimensional signature of the exposure
• Clustering is unsupervised
• Two tuning parameters
40
What type of data is required to
use these methods
ECLUST method
1. environmental exposure (currently only binary)
2. a high dimensional dataset that can be affected by the exposure
3. a single phenotype (continuous or binary)
4. Must be a high-dimensional signature of the exposure
41
Strong Heredity and Non-linear Models
1. a single phenotype (continuous or binary)
2. environment variable (continuous or binary)
3. any number of predictor variables
42
Check out our Lab’s Software!
http://greenwoodlab.github.io/software/
43
acknowledgements
• Dr. Celia Greenwood
• Dr. Blanchette and Dr. Yang
• Dr. Luigi Bouchard, Andr´e Anne
Houde
• Dr. Steele, Dr. Kramer,
Dr. Abrahamowicz
• Maxime Turgeon, Kevin
McGregor, Lauren Mokry,
Marie Forest, Pablo Ginestet
• Greg Voisin, Vince Forgetta,
Kathleen Klein
• Mothers and children from the
study
44

Weitere ähnliche Inhalte

Ähnlich wie Methods for High Dimensional Interactions

Integration of biological annotations using hierarchical modeling
Integration of biological annotations using hierarchical modelingIntegration of biological annotations using hierarchical modeling
Integration of biological annotations using hierarchical modeling
USC
 
Thesis seminar
Thesis seminarThesis seminar
Thesis seminar
gvesom
 
Subgroup identification for precision medicine. a comparative review of 13 me...
Subgroup identification for precision medicine. a comparative review of 13 me...Subgroup identification for precision medicine. a comparative review of 13 me...
Subgroup identification for precision medicine. a comparative review of 13 me...
SuciAidaDahhar
 
Exact Data Reduction for Big Data by Jieping Ye
Exact Data Reduction for Big Data by Jieping YeExact Data Reduction for Big Data by Jieping Ye
Exact Data Reduction for Big Data by Jieping Ye
BigMine
 

Ähnlich wie Methods for High Dimensional Interactions (20)

Combining co-expression and co-location for gene network inference in porcine...
Combining co-expression and co-location for gene network inference in porcine...Combining co-expression and co-location for gene network inference in porcine...
Combining co-expression and co-location for gene network inference in porcine...
 
High Dimensional Biological Data Analysis and Visualization
High Dimensional Biological Data Analysis and VisualizationHigh Dimensional Biological Data Analysis and Visualization
High Dimensional Biological Data Analysis and Visualization
 
MUMS: Bayesian, Fiducial, and Frequentist Conference - Spatially Informed Var...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Spatially Informed Var...MUMS: Bayesian, Fiducial, and Frequentist Conference - Spatially Informed Var...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Spatially Informed Var...
 
Foundations of Statistics in Ecology and Evolution. 8. Bayesian Statistics
Foundations of Statistics in Ecology and Evolution. 8. Bayesian StatisticsFoundations of Statistics in Ecology and Evolution. 8. Bayesian Statistics
Foundations of Statistics in Ecology and Evolution. 8. Bayesian Statistics
 
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
 
DOE Project ANOVA Analysis Diet Type
DOE Project ANOVA Analysis Diet TypeDOE Project ANOVA Analysis Diet Type
DOE Project ANOVA Analysis Diet Type
 
Genomic selection in Livestock
Genomic  selection in LivestockGenomic  selection in Livestock
Genomic selection in Livestock
 
Integration of biological annotations using hierarchical modeling
Integration of biological annotations using hierarchical modelingIntegration of biological annotations using hierarchical modeling
Integration of biological annotations using hierarchical modeling
 
Multivariate Analysis and Visualization of Proteomic Data
Multivariate Analysis and Visualization of Proteomic DataMultivariate Analysis and Visualization of Proteomic Data
Multivariate Analysis and Visualization of Proteomic Data
 
a brief introduction to epistasis detection
a brief introduction to epistasis detectiona brief introduction to epistasis detection
a brief introduction to epistasis detection
 
Repurposing predictive tools for causal research
Repurposing predictive tools for causal researchRepurposing predictive tools for causal research
Repurposing predictive tools for causal research
 
Probit and logit model
Probit and logit modelProbit and logit model
Probit and logit model
 
Constraints and Global Optimization for Gene Prediction Overlap Resolution
Constraints and Global Optimization for Gene Prediction Overlap ResolutionConstraints and Global Optimization for Gene Prediction Overlap Resolution
Constraints and Global Optimization for Gene Prediction Overlap Resolution
 
Basen Network
Basen NetworkBasen Network
Basen Network
 
Split Criterions for Variable Selection Using Decision Trees
Split Criterions for Variable Selection Using Decision TreesSplit Criterions for Variable Selection Using Decision Trees
Split Criterions for Variable Selection Using Decision Trees
 
Thesis seminar
Thesis seminarThesis seminar
Thesis seminar
 
Subgroup identification for precision medicine. a comparative review of 13 me...
Subgroup identification for precision medicine. a comparative review of 13 me...Subgroup identification for precision medicine. a comparative review of 13 me...
Subgroup identification for precision medicine. a comparative review of 13 me...
 
Basic Concepts of Experimental Design & Standard Design ( Statistics )
Basic Concepts of Experimental Design & Standard Design ( Statistics )Basic Concepts of Experimental Design & Standard Design ( Statistics )
Basic Concepts of Experimental Design & Standard Design ( Statistics )
 
Matching Weights to Simultaneously Compare Three Treatment Groups: a Simulati...
Matching Weights to Simultaneously Compare Three Treatment Groups: a Simulati...Matching Weights to Simultaneously Compare Three Treatment Groups: a Simulati...
Matching Weights to Simultaneously Compare Three Treatment Groups: a Simulati...
 
Exact Data Reduction for Big Data by Jieping Ye
Exact Data Reduction for Big Data by Jieping YeExact Data Reduction for Big Data by Jieping Ye
Exact Data Reduction for Big Data by Jieping Ye
 

Mehr von sahirbhatnagar

Analysis of DNA methylation and Gene expression to predict childhood obesity
Analysis of DNA methylation and Gene expression to predict childhood obesityAnalysis of DNA methylation and Gene expression to predict childhood obesity
Analysis of DNA methylation and Gene expression to predict childhood obesity
sahirbhatnagar
 

Mehr von sahirbhatnagar (10)

An introduction to knitr and R Markdown
An introduction to knitr and R MarkdownAn introduction to knitr and R Markdown
An introduction to knitr and R Markdown
 
Atelier r-gerad
Atelier r-geradAtelier r-gerad
Atelier r-gerad
 
Reproducible Research: An Introduction to knitr
Reproducible Research: An Introduction to knitrReproducible Research: An Introduction to knitr
Reproducible Research: An Introduction to knitr
 
Analysis of DNA methylation and Gene expression to predict childhood obesity
Analysis of DNA methylation and Gene expression to predict childhood obesityAnalysis of DNA methylation and Gene expression to predict childhood obesity
Analysis of DNA methylation and Gene expression to predict childhood obesity
 
Estimation and Accuracy after Model Selection
Estimation and Accuracy after Model SelectionEstimation and Accuracy after Model Selection
Estimation and Accuracy after Model Selection
 
Absolute risk estimation in a case cohort study of prostate cancer
Absolute risk estimation in a case cohort study of prostate cancerAbsolute risk estimation in a case cohort study of prostate cancer
Absolute risk estimation in a case cohort study of prostate cancer
 
Factors influencing participation in cancer screening
Factors influencing participation in cancer screeningFactors influencing participation in cancer screening
Factors influencing participation in cancer screening
 
Introduction to LaTeX
Introduction to LaTeXIntroduction to LaTeX
Introduction to LaTeX
 
Methylation and Expression data integration
Methylation and Expression data integrationMethylation and Expression data integration
Methylation and Expression data integration
 
Reproducible Research
Reproducible ResearchReproducible Research
Reproducible Research
 

Kürzlich hochgeladen

Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
Areesha Ahmad
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
1301aanya
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
MohamedFarag457087
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
levieagacer
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
NazaninKarimi6
 

Kürzlich hochgeladen (20)

Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
 
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 
Stages in the normal growth curve
Stages in the normal growth curveStages in the normal growth curve
Stages in the normal growth curve
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
Introduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptxIntroduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptx
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspects
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
 

Methods for High Dimensional Interactions

  • 1. Methods for High Dimensional Interactions Sahir Rai Bhatnagar, PhD Candidate – McGill Biostatistics Joint work with Yi Yang, Mathieu Blanchette and Celia Greenwood Ludmer Center – May 19, 2016
  • 4. one predictor variable at a time Predictor Variable Phenotype
  • 5. one predictor variable at a time Predictor Variable Phenotype Test 1 Test 2 Test 3 Test 4 Test 5 2
  • 6. a network based view Predictor Variable Phenotype
  • 7. a network based view Predictor Variable Phenotype
  • 8. a network based view Predictor Variable Phenotype Test 1 3
  • 9. system level changes due to environment Predictor Variable PhenotypeEnvironment A B
  • 10. system level changes due to environment Predictor Variable PhenotypeEnvironment A B Test 1 4
  • 11. Motivating Dataset: Newborn epigenetic adaptations to gesta- tional diabetes exposure (Luigi Bouchard, Sherbrooke) Environment Gestational Diabetes Large Data Child’s epigenome (p ≈ 450k) Phenotype Obesity measures 5
  • 12. Differential Correlation between environments (a) Gestational diabetes affected pregnancy (b) Controls 6
  • 13. Gene Expression: COPD patients (a) Gene Exp.: Never Smokers (b) Gene Exp.: Current Smokers (c) Correlations: Never Smokers (d) Correlations: Current Smokers 7
  • 14. Imaging Data: Topological properties and Age 8
  • 16. NIH MRI brain study Environment Age Large Data Cortical Thickness (p ≈ 80k) Phenotype Intelligence 10
  • 18. formal statement of initial problem • n: number of subjects 12
  • 19. formal statement of initial problem • n: number of subjects • p: number of predictor variables 12
  • 20. formal statement of initial problem • n: number of subjects • p: number of predictor variables • Xn×p: high dimensional data set (p >> n) 12
  • 21. formal statement of initial problem • n: number of subjects • p: number of predictor variables • Xn×p: high dimensional data set (p >> n) • Yn×1: phenotype 12
  • 22. formal statement of initial problem • n: number of subjects • p: number of predictor variables • Xn×p: high dimensional data set (p >> n) • Yn×1: phenotype • En×1: environmental factor that has widespread effect on X and can modify the relation between X and Y 12
  • 23. formal statement of initial problem • n: number of subjects • p: number of predictor variables • Xn×p: high dimensional data set (p >> n) • Yn×1: phenotype • En×1: environmental factor that has widespread effect on X and can modify the relation between X and Y Objective • Which elements of X that are associated with Y , depend on E? 12
  • 25. conceptual model Environment ff(Maternal care, Age, Diet) E = 0 E = 1 Large Data (p >> n) Gene Expression t DNA Methylation t Brain Imaging Gene Expression t DNA Methylation t Brain Imaging
  • 26. conceptual model Environment ff(Maternal care, Age, Diet) E = 0 E = 1 Large Data (p >> n) Gene Expression t DNA Methylation t Brain Imaging Gene Expression t DNA Methylation t Brain Imaging Phenotype (Behavioral development, IQ scores, Death)
  • 27. conceptual model Environment ff(Maternal care, Age, Diet) E = 0 E = 1 Large Data (p >> n) Gene Expression t DNA Methylation t Brain Imaging Gene Expression t DNA Methylation t Brain Imaging Phenotype (Behavioral development, IQ scores, Death) epidemiological study
  • 28. conceptual model Environment ff(Maternal care, Age, Diet) E = 0 E = 1 Large Data (p >> n) Gene Expression t DNA Methylation t Brain Imaging Gene Expression t DNA Methylation t Brain Imaging Phenotype (Behavioral development, IQ scores, Death) (epi)genetic/imaging associations
  • 29. conceptual model Environment ff(Maternal care, Age, Diet) E = 0 E = 1 Large Data (p >> n) Gene Expression t DNA Methylation t Brain Imaging Gene Expression t DNA Methylation t Brain Imaging Phenotype (Behavioral development, IQ scores, Death) (epi)genetic/imaging associations (epi)genetic/imaging associations
  • 30. conceptual model Environment ff(Maternal care, Age, Diet) E = 0 E = 1 Large Data (p >> n) Gene Expression t DNA Methylation t Brain Imaging Gene Expression t DNA Methylation t Brain Imaging Phenotype (Behavioral development, IQ scores, Death) 13
  • 31. Is this mediation analysis? 14
  • 32. Is this mediation analysis? • No 14
  • 33. Is this mediation analysis? • No • We are not making any causal claims i.e. direction of the arrows 14
  • 34. Is this mediation analysis? • No • We are not making any causal claims i.e. direction of the arrows • There are many untestable assumptions required for such analysis → not well understood for HD data 14
  • 36. analysis strategies marginal correlations (univariate p-value) multiple testing adjustment Single-Marker or Single Variable Tests
  • 37. analysis strategies marginal correlations (univariate p-value) multiple testing adjustment Single-Marker or Single Variable Tests LASSO (convex penalty with one tuning parameter) MCP, SCAD, Dantzig selector (non-convex penalty with two tuning parameters) Group level penalization (group LASSO, SCAD and MCP) Multivariate Regression Approaches Including Penalization Methods
  • 38. analysis strategies marginal correlations (univariate p-value) multiple testing adjustment Single-Marker or Single Variable Tests LASSO (convex penalty with one tuning parameter) MCP, SCAD, Dantzig selector (non-convex penalty with two tuning parameters) Group level penalization (group LASSO, SCAD and MCP) Multivariate Regression Approaches Including Penalization Methods cluster features based on euclidean distance, correlation, connectivity regression with group level summary (PCA, average) Clustering Together with Regression 15
  • 39. ECLUST - our proposed method: 3 phases Original Data
  • 40. ECLUST - our proposed method: 3 phases Original Data E = 0 1) Gene Similarity E = 1
  • 41. ECLUST - our proposed method: 3 phases Original Data E = 0 1) Gene Similarity E = 1
  • 42. ECLUST - our proposed method: 3 phases Original Data E = 0 1) Gene Similarity E = 1 2) Cluster Representation
  • 43. ECLUST - our proposed method: 3 phases Original Data E = 0 1) Gene Similarity E = 1 2) Cluster Representation n × 1 n × 1
  • 44. ECLUST - our proposed method: 3 phases Original Data E = 0 1) Gene Similarity E = 1 2) Cluster Representation n × 1 n × 1 3) Penalized Regression Yn×1∼ + ×E 16
  • 45. the objective of statistical methods is the reduction of data. A quantity of data . . . is to be replaced by relatively few quantities which shall adequately represent . . . the relevant information contained in the original data. - Sir R. A. Fisher, 1922 16
  • 46. Underlying model Y = β0 + β1U + β2U · E + ε (1) 17
  • 47. Underlying model Y = β0 + β1U + β2U · E + ε (1) X ∼ F(α0 + α1U, ΣE ) (2) 17
  • 48. Underlying model Y = β0 + β1U + β2U · E + ε (1) X ∼ F(α0 + α1U, ΣE ) (2) • U: unobserved latent variable • X: observed data which is a function of U • ΣE : environment sensitive correlation matrix 17
  • 49. ECLUST - our proposed method: 3 phases Original Data E = 0 1) Gene Similarity E = 1 2) Cluster Representation n × 1 n × 1 3) Penalized Regression Yn×1∼ + ×E 18
  • 50. advantages and disadvantages General Approach Advantages Disadvantages Single-Marker simple, easy to implement multiple testing burden, power, interpretability Penalization multivariate, variable selection, sparsity, efficient optimization algorithms poor sensitivity with correlated data, ignores structure in design matrix, interpretability Environment Cluster with Regression multivariate, flexible implementation, group structure, takes advantage of correlation, interpretability difficult to identify relevant clusters, clustering is unsupervised 19
  • 51. Methods to detect gene clusters Table 1: Methods to detect gene clusters General Approach Formula Correlation pearson, spearman, biweight midcorrelation Correlation Scoring |ρE=1 − ρE=0| Weighted Correlation Scoring c|ρE=1 − ρE=0| Fisher’s Z Transformation |zij0−zij1| √ 1/(n0−3)+1/(n1−3) 20
  • 52. Cluster Representation Table 2: Methods to create cluster representations General Approach Type Unsupervised average K principal components Supervised partial least squares 21
  • 54. Simulation Study 1 (a) Corr(XE=0) (b) Corr(XE=1) (c) |Corr(XE=1) − Corr(XE=0)| (d) Corr(Xall) 22
  • 55. Results: Jaccard Index and test set MSE 23
  • 57. TOM based on all subjects (a) TOM(Xall) 25
  • 58. TOM based on unexposed subjects (a) TOM(XE=0) 26
  • 59. TOM based on exposed subjects (a) TOM(XE=1) 27
  • 60. Difference of TOMs (a) |TOM(XE=1) − TOM(XE=0)| 28
  • 63. Model g(µ) =β0 + β1X1 + · · · + βpXp + βE E main effects + α1E (X1E) + · · · + αpE (XpE) interactions • g(·) is a known link function • µ = E [Y |X, E, β, α] • β = (β1, β2, . . . , βp, βE ) ∈ Rp+1 • α = (α1E , . . . , αpE ) ∈ Rp 30
  • 64. Variable Selection arg min β0,β,α 1 2 Y − g(µ) 2 + λ ( β 1 + α 1) • Y − g(µ) 2 = i (yi − g(µi ))2 • β 1 = j |βj | • α 1 = j |αj | • λ ≥ 0: tuning parameter 31
  • 65. Why Strong Heredity? • Statistical Power: large main effects are more likely to lead to detectable interactions than small ones 32
  • 66. Why Strong Heredity? • Statistical Power: large main effects are more likely to lead to detectable interactions than small ones • Interpretability: Assuming a model with interaction only is generally not biologically plausible 32
  • 67. Why Strong Heredity? • Statistical Power: large main effects are more likely to lead to detectable interactions than small ones • Interpretability: Assuming a model with interaction only is generally not biologically plausible • Practical Sparsity: X1, E, X1 · E vs. X1, E, X2 · E 32
  • 68. Model g(µ) =β0 + β1X1 + · · · + βpXp + βE E main effects + α1E (X1E) + · · · + αpE (XpE) interactions 1Choi et al. 2010, JASA 2Chipman 1996, Canadian Journal of Statistics 33
  • 69. Model g(µ) =β0 + β1X1 + · · · + βpXp + βE E main effects + α1E (X1E) + · · · + αpE (XpE) interactions Reparametrization1 : αjE = γjE βj βE . 1Choi et al. 2010, JASA 2Chipman 1996, Canadian Journal of Statistics 33
  • 70. Model g(µ) =β0 + β1X1 + · · · + βpXp + βE E main effects + α1E (X1E) + · · · + αpE (XpE) interactions Reparametrization1 : αjE = γjE βj βE . Strong heredity principle2 : ˆαjE = 0 ⇒ ˆβj = 0 and ˆβE = 0 1Choi et al. 2010, JASA 2Chipman 1996, Canadian Journal of Statistics 33
  • 71. Strong Heredity Model with Penalization arg min β0,β,γ 1 2 Y − g(µ) 2 + λβ (w1β1 + · · · + wqβq + wE βE ) + λγ (w1E γ1E + · · · + wqE γqE ) wj = 1 ˆβj , wjE = ˆβj ˆβE ˆαjE 34
  • 72. Open source software • Software implementation in R: http://sahirbhatnagar.com/eclust/ • Allows user specified interaction terms • Automatically determines the optimal tuning parameters through cross validation • Can also be applied to genetic data (SNPs) 35
  • 74. The most popular way of feature screening How to fit statistical models when you have over 100,000 features? 36
  • 75. The most popular way of feature screening How to fit statistical models when you have over 100,000 features? Marginal correlations, t-tests • for each feature, calculate the correlation between X and Y 36
  • 76. The most popular way of feature screening How to fit statistical models when you have over 100,000 features? Marginal correlations, t-tests • for each feature, calculate the correlation between X and Y • keep all features with correlation greater than some threshold 36
  • 77. The most popular way of feature screening How to fit statistical models when you have over 100,000 features? Marginal correlations, t-tests • for each feature, calculate the correlation between X and Y • keep all features with correlation greater than some threshold • However this procedure assumes a linear relationship between X and Y 36
  • 78. Non-linear feature screening: Kolmogorov-Smirnov Test Mai & Zou (2012) proposed using the Kolmogorov-Smirnov (KS) test statistic ˆKj = sup x |ˆFj (x|Y = 1) − ˆFj (x|Y = 0)| (3) Figure 8: Depiction of KS statistic 37
  • 79. Non-linear Interaction Models After feature screening, we can fit non-linear relationships between X and Y Yi = β0 + f (Xij ) + f (Xij , Ei ) + εi (4) 38
  • 81. Conclusions and Contributions • Large system-wide changes are observed in many environments 39
  • 82. Conclusions and Contributions • Large system-wide changes are observed in many environments • This assumption can possibly be exploited to aid analysis of large data 39
  • 83. Conclusions and Contributions • Large system-wide changes are observed in many environments • This assumption can possibly be exploited to aid analysis of large data • We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high dimensional data (p >> n) and an environmental factor. 39
  • 84. Conclusions and Contributions • Large system-wide changes are observed in many environments • This assumption can possibly be exploited to aid analysis of large data • We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high dimensional data (p >> n) and an environmental factor. • Dimension reduction is achieved through leveraging the environmental-class-conditional correlations 39
  • 85. Conclusions and Contributions • Large system-wide changes are observed in many environments • This assumption can possibly be exploited to aid analysis of large data • We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high dimensional data (p >> n) and an environmental factor. • Dimension reduction is achieved through leveraging the environmental-class-conditional correlations • Also, we develop and implement a strong heredity framework within the penalized model 39
  • 86. Conclusions and Contributions • Large system-wide changes are observed in many environments • This assumption can possibly be exploited to aid analysis of large data • We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high dimensional data (p >> n) and an environmental factor. • Dimension reduction is achieved through leveraging the environmental-class-conditional correlations • Also, we develop and implement a strong heredity framework within the penalized model • R software: http://sahirbhatnagar.com/eclust/ 39
  • 87. Limitations • There must be a high-dimensional signature of the exposure 40
  • 88. Limitations • There must be a high-dimensional signature of the exposure • Clustering is unsupervised 40
  • 89. Limitations • There must be a high-dimensional signature of the exposure • Clustering is unsupervised • Two tuning parameters 40
  • 90. What type of data is required to use these methods
  • 91. ECLUST method 1. environmental exposure (currently only binary) 2. a high dimensional dataset that can be affected by the exposure 3. a single phenotype (continuous or binary) 4. Must be a high-dimensional signature of the exposure 41
  • 92. Strong Heredity and Non-linear Models 1. a single phenotype (continuous or binary) 2. environment variable (continuous or binary) 3. any number of predictor variables 42
  • 93. Check out our Lab’s Software! http://greenwoodlab.github.io/software/ 43
  • 94. acknowledgements • Dr. Celia Greenwood • Dr. Blanchette and Dr. Yang • Dr. Luigi Bouchard, Andr´e Anne Houde • Dr. Steele, Dr. Kramer, Dr. Abrahamowicz • Maxime Turgeon, Kevin McGregor, Lauren Mokry, Marie Forest, Pablo Ginestet • Greg Voisin, Vince Forgetta, Kathleen Klein • Mothers and children from the study 44