Methods for High Dimensional Interactions

Methods for High Dimensional Interactions
Sahir Rai Bhatnagar, PhD Candidate – McGill Biostatistics
Joint work with Yi Yang, Mathieu Blanchette and Celia Greenwood
Ludmer Center – May 19, 2016

Underlying objective of this talk
1

one predictor variable at a time
Predictor Variable Phenotype

one predictor variable at a time
Test 1
Test 2
Test 3
Test 4
Test 5
2

a network based view

a network based view
Test 1
3

system level changes due to environment
Predictor Variable PhenotypeEnvironment
A
B

system level changes due to environment
Predictor Variable PhenotypeEnvironment
A
B
Test 1
4

Motivating Dataset: Newborn epigenetic adaptations to gesta-
tional diabetes exposure (Luigi Bouchard, Sherbrooke)
Environment
Gestational
Diabetes
Large Data
Child’s epigenome
(p ≈ 450k)
Phenotype
Obesity measures
5

Diﬀerential Correlation between environments
(a) Gestational diabetes aﬀected pregnancy (b) Controls
6

Gene Expression: COPD patients
(a) Gene Exp.: Never Smokers (b) Gene Exp.: Current Smokers
(c) Correlations: Never Smokers (d) Correlations: Current Smokers
7

Imaging Data: Topological properties and Age
8

Correlations diﬀer between Age groups
9

NIH MRI brain study
Environment
Age
Large Data
Cortical Thickness
(p ≈ 80k)
Phenotype
Intelligence
10

formal statement of initial problem
• n: number of subjects
12

• p: number of predictor variables
12

• Xn×p: high dimensional data set (p >> n)
12

• Yn×1: phenotype
12

• En×1: environmental factor that has widespread eﬀect on X and can
modify the relation between X and Y
12

• En×1: environmental factor that has widespread eﬀect on X and can
modify the relation between X and Y
Objective
• Which elements of X that are associated with Y , depend on E?
12

conceptual model
Environment
ﬀ(Maternal
care, Age, Diet)
E = 0
E = 1

conceptual model
Environment
ﬀ(Maternal
care, Age, Diet)
E = 0
E = 1
Large Data (p >> n)
Gene Expression t
DNA Methylation
t Brain Imaging
Gene Expression t
DNA Methylation
t Brain Imaging

conceptual model
Environment
ﬀ(Maternal
care, Age, Diet)
E = 0
E = 1
Large Data (p >> n)
Gene Expression t
DNA Methylation
t Brain Imaging
Gene Expression t
DNA Methylation
t Brain Imaging
Phenotype (Behavioral
development, IQ
scores, Death)

conceptual model
Environment
ﬀ(Maternal
care, Age, Diet)
E = 0
E = 1
Large Data (p >> n)
Gene Expression t
DNA Methylation
t Brain Imaging
Gene Expression t
DNA Methylation
t Brain Imaging
development, IQ
scores, Death)
epidemiological study

conceptual model
Environment
ﬀ(Maternal
care, Age, Diet)
E = 0
E = 1
Large Data (p >> n)
Gene Expression t
DNA Methylation
t Brain Imaging
Gene Expression t
DNA Methylation
t Brain Imaging
development, IQ
scores, Death)
(epi)genetic/imaging associations

conceptual model
Environment
ﬀ(Maternal
care, Age, Diet)
E = 0
E = 1
Large Data (p >> n)
Gene Expression t
DNA Methylation
t Brain Imaging
Gene Expression t
DNA Methylation
t Brain Imaging
development, IQ
scores, Death)

conceptual model
Environment
ﬀ(Maternal
care, Age, Diet)
E = 0
E = 1
Large Data (p >> n)
Gene Expression t
DNA Methylation
t Brain Imaging
Gene Expression t
DNA Methylation
t Brain Imaging
development, IQ
scores, Death)
13

Is this mediation analysis?
14

• No
14

• No
• We are not making any causal claims i.e. direction of the arrows
14

• No
• We are not making any causal claims i.e. direction of the arrows
• There are many untestable assumptions required for such analysis
→ not well understood for HD data
14

analysis strategies
marginal correlations (univariate p-value)
multiple testing adjustment
Single-Marker or Single Variable Tests

analysis strategies
LASSO (convex penalty with one tuning parameter)
MCP, SCAD, Dantzig selector (non-convex penalty with two tuning parameters)
Group level penalization (group LASSO, SCAD and MCP)
Multivariate Regression Approaches Including Penalization Methods

analysis strategies
LASSO (convex penalty with one tuning parameter)
MCP, SCAD, Dantzig selector (non-convex penalty with two tuning parameters)
Group level penalization (group LASSO, SCAD and MCP)
Multivariate Regression Approaches Including Penalization Methods
cluster features based on euclidean distance, correlation, connectivity
regression with group level summary (PCA, average)
Clustering Together with Regression
15

ECLUST - our proposed method: 3 phases
Original Data

Original Data
E = 0
1) Gene Similarity
E = 1

Original Data
E = 0
1) Gene Similarity
E = 1
2) Cluster
Representation

Original Data
E = 0
1) Gene Similarity
E = 1
2) Cluster
Representation
n × 1 n × 1

Original Data
E = 0
1) Gene Similarity
E = 1
2) Cluster
Representation
n × 1 n × 1
3) Penalized
Regression
Yn×1∼ + ×E
16

the objective of statistical
methods is the reduction of data.
A quantity of data . . . is to be
replaced by relatively few quantities
which shall adequately represent
. . . the relevant information
contained in the original data.
- Sir R. A. Fisher, 1922
16

Underlying model
Y = β0 + β1U + β2U · E + ε (1)
17

Underlying model
Y = β0 + β1U + β2U · E + ε (1)
X ∼ F(α0 + α1U, ΣE ) (2)
17

Underlying model
Y = β0 + β1U + β2U · E + ε (1)
X ∼ F(α0 + α1U, ΣE ) (2)
• U: unobserved latent variable
• X: observed data which is a function of U
• ΣE : environment sensitive correlation matrix
17

Original Data
E = 0
1) Gene Similarity
E = 1
2) Cluster
Representation
n × 1 n × 1
3) Penalized
Regression
Yn×1∼ + ×E
18

advantages and disadvantages
General Approach Advantages Disadvantages
Single-Marker simple, easy to implement
multiple testing burden,
power, interpretability
Penalization
multivariate, variable
selection, sparsity, efficient
optimization algorithms
poor sensitivity with
correlated data, ignores
structure in design matrix,
interpretability
Environment Cluster with
Regression
multivariate, flexible
implementation,
group structure, takes
advantage of correlation,
interpretability
difficult to identify relevant
clusters, clustering is
unsupervised
19

Cluster Representation
Table 2: Methods to create cluster representations
General Approach Type
Unsupervised average
K principal components
Supervised partial least squares
21

Simulation Study 1
(a) Corr(XE=0) (b) Corr(XE=1)
(c) |Corr(XE=1) − Corr(XE=0)| (d) Corr(Xall)
22

Results: Jaccard Index and test set MSE
23

TOM based on all subjects
(a) TOM(Xall)
25

TOM based on unexposed subjects
(a) TOM(XE=0)
26

TOM based on exposed subjects
(a) TOM(XE=1)
27

Diﬀerence of TOMs
(a) |TOM(XE=1) − TOM(XE=0)|
28

Model
g(µ) =β0 + β1X1 + · · · + βpXp + βE E
main eﬀects
+ α1E (X1E) + · · · + αpE (XpE)
interactions
• g(·) is a known link function
• µ = E [Y |X, E, β, α]
• β = (β1, β2, . . . , βp, βE ) ∈ Rp+1
• α = (α1E , . . . , αpE ) ∈ Rp
30

Variable Selection
arg min
β0,β,α
1
2
Y − g(µ)
2
+ λ ( β 1 + α 1)
• Y − g(µ)
2
= i (yi − g(µi ))2
• β 1 = j |βj |
• α 1 = j |αj |
• λ ≥ 0: tuning parameter
31

Why Strong Heredity?
• Statistical Power: large main eﬀects are more likely to lead to
detectable interactions than small ones
32

• Interpretability: Assuming a model with interaction only is generally
not biologically plausible
32

• Interpretability: Assuming a model with interaction only is generally
not biologically plausible
• Practical Sparsity: X1, E, X1 · E vs. X1, E, X2 · E
32

Model
g(µ) =β0 + β1X1 + · · · + βpXp + βE E
main eﬀects
+ α1E (X1E) + · · · + αpE (XpE)
interactions
1Choi et al. 2010, JASA
2Chipman 1996, Canadian Journal of Statistics
33

Model
g(µ) =β0 + β1X1 + · · · + βpXp + βE E
main eﬀects
+ α1E (X1E) + · · · + αpE (XpE)
interactions
Reparametrization1
: αjE = γjE βj βE .
33

Model
g(µ) =β0 + β1X1 + · · · + βpXp + βE E
main eﬀects
+ α1E (X1E) + · · · + αpE (XpE)
interactions
Reparametrization1
: αjE = γjE βj βE .
Strong heredity principle2
:
ˆαjE = 0 ⇒ ˆβj = 0 and ˆβE = 0
33

Strong Heredity Model with Penalization
arg min
β0,β,γ
1
2
Y − g(µ)
2
+
λβ (w1β1 + · · · + wqβq + wE βE ) +
λγ (w1E γ1E + · · · + wqE γqE )
wj =
1
ˆβj
, wjE =
ˆβj
ˆβE
ˆαjE
34

Open source software
• Software implementation in R: http://sahirbhatnagar.com/eclust/
• Allows user speciﬁed interaction terms
• Automatically determines the optimal tuning parameters through
cross validation
• Can also be applied to genetic data (SNPs)
35

Feature Screening and
Non-linear associations

The most popular way of feature screening
How to ﬁt statistical models when you have over 100,000 features?
36

Marginal correlations, t-tests
• for each feature, calculate the correlation between X and Y
36

• keep all features with correlation greater than some threshold
36

• keep all features with correlation greater than some threshold
• However this procedure assumes a linear relationship between X and
Y
36

Non-linear feature screening: Kolmogorov-Smirnov Test
Mai & Zou (2012) proposed using the Kolmogorov-Smirnov (KS) test
statistic
ˆKj = sup
x
|ˆFj (x|Y = 1) − ˆFj (x|Y = 0)| (3)
Figure 8: Depiction of KS statistic
37

Non-linear Interaction Models
After feature screening, we can ﬁt non-linear relationships between
X and Y
Yi = β0 + f (Xij ) + f (Xij , Ei ) + εi (4)
38

Conclusions and Contributions
• Large system-wide changes are observed in many environments
39

• This assumption can possibly be exploited to aid analysis of large
data
39

data
• We develop and implement a multivariate penalization procedure for
predicting a continuous or binary disease outcome while detecting
interactions between high dimensional data (p >> n) and an
environmental factor.
39

data
• Dimension reduction is achieved through leveraging the
environmental-class-conditional correlations
39

data
• Also, we develop and implement a strong heredity framework
within the penalized model
39

data
• Also, we develop and implement a strong heredity framework
within the penalized model
• R software: http://sahirbhatnagar.com/eclust/
39

Limitations
• There must be a high-dimensional signature of the exposure
40

Limitations
• Clustering is unsupervised
40

Limitations
• Clustering is unsupervised
• Two tuning parameters
40

What type of data is required to
use these methods

ECLUST method
1. environmental exposure (currently only binary)
2. a high dimensional dataset that can be aﬀected by the exposure
3. a single phenotype (continuous or binary)
4. Must be a high-dimensional signature of the exposure
41

Strong Heredity and Non-linear Models
1. a single phenotype (continuous or binary)
2. environment variable (continuous or binary)
3. any number of predictor variables
42

Check out our Lab’s Software!
http://greenwoodlab.github.io/software/
43

acknowledgements
• Dr. Celia Greenwood
• Dr. Blanchette and Dr. Yang
• Dr. Luigi Bouchard, Andr´e Anne
Houde
• Dr. Steele, Dr. Kramer,
Dr. Abrahamowicz
• Maxime Turgeon, Kevin
McGregor, Lauren Mokry,
Marie Forest, Pablo Ginestet
• Greg Voisin, Vince Forgetta,
Kathleen Klein
• Mothers and children from the
study
44

Methods for High Dimensional Interactions

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Ähnlich wie Methods for High Dimensional Interactions

Ähnlich wie Methods for High Dimensional Interactions (20)

Mehr von sahirbhatnagar

Mehr von sahirbhatnagar (10)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Methods for High Dimensional Interactions