An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Sahir Rai Bhatnagar, PhD Candidate
Joint with Yi Yang, Mathieu Blanchette, Luigi Bouchard, Celia Greenwood
Biostatistics, McGill University
preprint available at
sahirbhatnagar.com

Simple Rule 11:
Simulated Data ≠ Real Data

one predictor variable at a time
[Figure: Predictor Variable and Phenotype, with a separate test per predictor (Test 1 to Test 5)]

a network based view
[Figure: network of predictor variables, Test 1]

system level changes due to environment
[Figure: Predictor Variable, Phenotype, Environment; panels A and B; Test 1]

Motivating Dataset: Newborn epigenetic adaptations to gestational diabetes exposure (Luigi Bouchard, USherbrooke)
Environment: Gestational Diabetes
Large Data: Child’s epigenome (p ≈ 450k)
Phenotype: Obesity measures

Differential Correlation between environments
[Figure panels: (a) Gestational diabetes affected pregnancy, (b) Controls]

NIH MRI brain study
Environment: Age
Large Data: Cortical Thickness (p ≈ 80k)
Phenotype: Intelligence

Goals of this study
Objective
(i) Whether clustering that incorporates known covariate or exposure information can improve prediction models
(ii) Whether the resulting clusters can provide an easier route to interpretation

ECLUST - our proposed method: 2 steps
Original Data, split by exposure (E = 0 and E = 1)
1a) Gene Similarity
1b) Cluster Representation (each cluster summarized by an n × 1 vector)
2) Penalized Regression: Y_{n×1} ∼ (cluster representatives) + (cluster representatives) × E

the objective of statistical
methods is the reduction of
data. A quantity of data . . . is to be
replaced by relatively few quantities
which shall adequately represent
. . . the relevant information
contained in the original data.
- Sir R. A. Fisher, 1922

Step 1a: Method to detect gene clusters
(i) Hierarchical clustering (average linkage) with TOM [1] scoring dissimilarity [2]: |TOM_{E=1} − TOM_{E=0}|
(ii) Number of clusters chosen using the dynamicTreeCut algorithm [3]
[1] Ravasz et al., Science (2002)
[2] Klein Oros et al., Frontiers in Genetics (2016)
[3] Langfelder and Zhang, Bioinformatics (2008)
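A minimal Python sketch of Step 1a, under stated assumptions. It is not the authors' R implementation. The function names `tom_similarity` and `exposure_clusters`, the soft-threshold `power=6`, and the simulated data are all hypothetical. The TOM here is the standard unsigned topological overlap computed from a soft-thresholded absolute correlation, and a fixed number of clusters (`fcluster` with `maxclust`) stands in for dynamicTreeCut, which has no SciPy equivalent.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def tom_similarity(X, power=6):
    """Unsigned topological overlap between the columns (genes) of X (n x p)."""
    # Adjacency: soft-thresholded absolute correlation (power is an assumption).
    adj = np.abs(np.corrcoef(X, rowvar=False)) ** power
    np.fill_diagonal(adj, 0.0)
    shared = adj @ adj                       # l_ij = sum_u a_iu * a_uj
    k = adj.sum(axis=1)                      # connectivity of each gene
    denom = np.minimum.outer(k, k) + 1.0 - adj
    tom = (shared + adj) / denom
    np.fill_diagonal(tom, 1.0)
    return tom


def exposure_clusters(X, E, n_clusters=3):
    """Cluster genes on |TOM(E=1) - TOM(E=0)| with average linkage."""
    diff = np.abs(tom_similarity(X[E == 1]) - tom_similarity(X[E == 0]))
    diff = (diff + diff.T) / 2.0             # guard against rounding asymmetry
    np.fill_diagonal(diff, 0.0)
    # Following the slide's wording, the difference is used as the dissimilarity.
    tree = linkage(squareform(diff, checks=False), method="average")
    # Assumption: fixed cluster count instead of dynamicTreeCut.
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return labels, diff


# Simulated example: the first 4 genes are co-regulated only when E = 1.
rng = np.random.default_rng(0)
n, p = 100, 12
E = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
latent = rng.normal(size=n)
X[E == 1, :4] += 2.0 * latent[E == 1][:, None]

labels, diff = exposure_clusters(X, E, n_clusters=3)
print(labels)
```

The returned `labels` vector assigns each gene to a cluster and is the input that a cluster-representation step would need.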

Step 1b: Cluster Representation
(i) Average [4]
(ii) 1st Principal Component [5]
[4] Hastie et al., Genome Biology (2001); Park et al., Biostatistics (2007)
[5] Kendall, A Course in Multivariate Analysis (1957)
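The two cluster summaries in Step 1b can be sketched in Python as follows. This is an illustration only: the function names `cluster_average` and `cluster_first_pc`, the toy data, and the example cluster labels are hypothetical. The first principal component is taken from an SVD of the column-centered block, so its sign is arbitrary.

```python
import numpy as np


def cluster_average(X, labels):
    """One n x 1 column per cluster: the mean of its member genes."""
    ids = np.unique(labels)
    return np.column_stack([X[:, labels == c].mean(axis=1) for c in ids])


def cluster_first_pc(X, labels):
    """One n x 1 column per cluster: scores on the first principal component."""
    cols = []
    for c in np.unique(labels):
        block = X[:, labels == c]
        centered = block - block.mean(axis=0)
        U, s, _ = np.linalg.svd(centered, full_matrices=False)
        cols.append(U[:, 0] * s[0])          # PC1 scores (sign is arbitrary)
    return np.column_stack(cols)


# Toy example: 20 subjects, 6 genes, two clusters of 3 genes each.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 6))
labels = np.array([1, 1, 1, 2, 2, 2])

R_avg = cluster_average(X, labels)
R_pc = cluster_first_pc(X, labels)
print(R_avg.shape, R_pc.shape)
```

Either matrix has one column per cluster, which reduces p gene-level predictors to a handful of cluster-level ones.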

Step 2: Variable Selection
(i) Linear effects: Lasso, Elastic Net [6]
(ii) Non-linear effects: MARS [7]
[6] Tibshirani, JRSSB (1996); Zou and Hastie, JRSSB (2005)
[7] Friedman, Annals of Statistics (1991)
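A minimal Python sketch of the linear case of Step 2, assuming the design matrix holds the cluster representatives, the exposure E, and their products with E. The helper names `design_with_interactions` and `lasso_cd`, the penalty `lam=0.1`, and the simulated data are hypothetical. The lasso is fit with a plain coordinate-descent loop as a stand-in for a production solver; Elastic Net and MARS are not shown.

```python
import numpy as np


def design_with_interactions(R, E):
    """Columns: cluster representatives, E, then representatives x E."""
    E = np.asarray(E, dtype=float)
    return np.column_stack([R, E, R * E[:, None]])


def soft_threshold(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)


def lasso_cd(Z, y, lam, n_iter=200):
    """Lasso on standardized columns: (1/2n)||y - Zb||^2 + lam * ||b||_1."""
    n, p = Z.shape
    Zs = (Z - Z.mean(axis=0)) / Z.std(axis=0)
    resid = y - y.mean()
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            resid += Zs[:, j] * beta[j]      # add back column j's contribution
            rho = Zs[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam)  # unit-variance columns
            resid -= Zs[:, j] * beta[j]
    return beta                              # coefficients on standardized scale


# Simulated example: 4 cluster representatives, binary exposure.
rng = np.random.default_rng(2)
n, k = 200, 4
R = rng.normal(size=(n, k))
E = rng.integers(0, 2, size=n)
# True signal: main effect of cluster 0, interaction of cluster 1 with E.
y = 1.5 * R[:, 0] + 2.0 * R[:, 1] * E + rng.normal(scale=0.5, size=n)

Z = design_with_interactions(R, E)
beta = lasso_cd(Z, y, lam=0.1)
print(np.round(beta, 2))
```

In this layout, columns 0 to 3 are main effects, column 4 is E, and columns 5 to 8 are the cluster × E interactions, so nonzero entries in the last block flag exposure-dependent clusters.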

Simulated TOM by Exposure Status
[Figure panels: (a) TOM(X_{E=1}), (b) TOM(X_{E=0})]

Difference of TOMs
[Figure panel: (a) |TOM(X_{E=1}) − TOM(X_{E=0})|]

TOM based on all subjects
[Figure panel: (a) TOM(X_{all})]

Gestational Diabetes: Prediction Performance

Gestational Diabetes: Interpretation of Clusters with IPA
• Canonical Pathways: 1,25-dihydroxyvitamin D3 Biosynthesis – vitamin D associated with obesity
• Diseases and Disorders: Hepatic System Disease – metabolism of glucose and lipids
• Physiological System Development and Function:
(i) Behavior and neurodevelopment – associated with obesity
(ii) Embryonic and organ development – GD associated with macrosomia

Discussion and Contributions
• Large system-wide changes are observed in many environments (DNA methylation, cortical thickness, gene expression)
• Environment dependent clustering can improve prediction performance in high dimensional settings (n << p)
• Clusters can be interpreted but require much more expert knowledge
• Leverages existing computationally fast algorithms and can run on a laptop computer (p ≈ 10k)
• Software implementation in R: sahirbhatnagar.com

Limitations
• There must be a high-dimensional signature of the exposure
• Covariance estimation
• Currently limited to binary environment
• Interpretation can be difficult

Acknowledgements
• Dr. Celia Greenwood
• Dr. Blanchette and Dr. Yang
• Dr. Luigi Bouchard, André Anne Houde
• Dr. Steele, Dr. Kramer, Dr. Abrahamowicz
• Maxime Turgeon, Kevin McGregor, Lauren Mokry, Dr. Forest
• Greg Voisin, Dr. Forgetta, Dr. Klein
• Mothers and children from the study
