An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures
Sahir Rai Bhatnagar, PhD Candidate
Joint with Yi Yang, Mathieu Blanchette, Luigi Bouchard, Celia Greenwood
Biostatistics, McGill University
Preprint available at sahirbhatnagar.com
14. NIH MRI brain study
Environment: Age
Large Data: Cortical Thickness (p ≈ 80k)
Phenotype: Intelligence
16. Goals of this study
Objectives
(i) Whether clustering that incorporates known covariate or exposure information can improve prediction models
(ii) Whether the resulting clusters can provide an easier route to interpretation
23. ECLUST - our proposed method: 2 steps
Original Data, split by exposure: E = 0 and E = 1
1a) Gene Similarity
1b) Cluster Representation (one n × 1 vector per cluster)
2) Penalized Regression: Y (n × 1) ∼ [cluster representatives] + [cluster representatives] × E
24. "the objective of statistical methods is the reduction of data. A quantity of data ... is to be replaced by relatively few quantities which shall adequately represent ... the relevant information contained in the original data."
- Sir R. A. Fisher, 1922
25. Step 1a: Method to detect gene clusters
(i) Hierarchical clustering (average linkage) with TOM¹ scoring dissimilarity²: |TOM_{E=1} − TOM_{E=0}|
(ii) Number of clusters chosen using the dynamicTreeCut algorithm³
¹ Ravasz et al., Science (2002)
² Klein Oros et al., Frontiers in Genetics (2016)
³ Langfelder and Zhang, Bioinformatics (2008)
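Step 1a can be sketched in Python (the talk's own software is in R). This is a minimal illustration under stated assumptions, not the authors' implementation: the adjacency is assumed to be the absolute Pearson correlation raised to a soft-threshold power (a WGCNA-style choice, with `power=6` as an assumed default), the function names `tom` and `eclust_step_1a` are hypothetical, and the tree is cut at a fixed number of clusters with SciPy's `fcluster` as a simple stand-in for dynamicTreeCut.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def tom(X, power=6):
    """Unsigned topological overlap matrix for the columns (genes) of X."""
    # Assumed adjacency: |Pearson correlation| ** power, with a zero diagonal.
    A = np.abs(np.corrcoef(X, rowvar=False)) ** power
    np.fill_diagonal(A, 0.0)
    shared = A @ A                      # l_ij = sum over u of a_iu * a_uj
    k = A.sum(axis=1)                   # connectivity of each gene
    T = (shared + A) / (np.minimum.outer(k, k) + 1.0 - A)
    np.fill_diagonal(T, 1.0)
    return T


def eclust_step_1a(X, E, n_clusters, power=6):
    """Cluster genes on |TOM_{E=1} - TOM_{E=0}| with average linkage."""
    E = np.asarray(E)
    diff = np.abs(tom(X[E == 1], power) - tom(X[E == 0], power))
    diff = (diff + diff.T) / 2.0        # remove floating-point asymmetry
    Z = linkage(squareform(diff, checks=False), method="average")
    # Fixed-size cut; the slides choose the number of clusters with dynamicTreeCut.
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return labels, diff


# Toy data: genes 0-4 are co-expressed only among exposed subjects (E = 1).
rng = np.random.default_rng(0)
E = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 10))
X[E == 1, :5] += 3.0 * rng.normal(size=(100, 1))
labels, diff = eclust_step_1a(X, E, n_clusters=3)
```

In the toy data, pairs among genes 0-4 get a large score because their overlap changes with the exposure, while pairs of unrelated genes score close to zero.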
26. Step 1b: Cluster Representation
(i) Average⁴
(ii) 1st Principal Component⁵
⁴ Hastie et al., Genome Biology (2001); Park et al., Biostatistics (2007)
⁵ Kendall, A Course in Multivariate Analysis (1957)
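Both cluster summaries reduce each cluster of genes to a single n × 1 vector. The following is a minimal NumPy sketch, assuming `X` is an n × p expression matrix and `labels` holds one cluster label per gene; the function names are hypothetical and not taken from the authors' R package.

```python
import numpy as np


def cluster_average(X, labels):
    """One column per cluster: the mean expression of its genes."""
    labels = np.asarray(labels)
    return np.column_stack(
        [X[:, labels == c].mean(axis=1) for c in np.unique(labels)]
    )


def cluster_first_pc(X, labels):
    """One column per cluster: scores on its first principal component."""
    labels = np.asarray(labels)
    reps = []
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        Xc = Xc - Xc.mean(axis=0)       # center each gene before the SVD
        U, s, _ = np.linalg.svd(Xc, full_matrices=False)
        reps.append(U[:, 0] * s[0])     # PC1 scores; the sign is arbitrary
    return np.column_stack(reps)


# Toy usage: 50 subjects, 6 genes in two clusters of three.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))
labels = np.array([1, 1, 1, 2, 2, 2])
avg = cluster_average(X, labels)        # shape (50, 2)
pc = cluster_first_pc(X, labels)        # shape (50, 2)
```

The average weights every gene in a cluster equally, whereas the first principal component weights genes by how much they contribute to the cluster's dominant direction of variation.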
27. Step 2: Variable Selection
(i) Linear effects: Lasso, Elastic Net⁶
(ii) Non-linear effects: MARS⁷
⁶ Tibshirani, JRSSB (1996); Zou and Hastie, JRSSB (2005)
⁷ Friedman, Annals of Statistics (1991)
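For the linear case, Step 2 amounts to a lasso fit on the cluster representatives, the exposure, and their products. The sketch below is illustrative only: `interaction_design` and `lasso_cd` are hypothetical names, the solver is a plain coordinate-descent lasso written for readability rather than the implementation used in the talk, and the penalty `lam=0.1` is an arbitrary choice that would normally be tuned by cross-validation. Elastic Net and MARS are not covered.

```python
import numpy as np


def interaction_design(reps, E):
    """Columns: cluster representatives, E, then representatives x E."""
    E = np.asarray(E, dtype=float).reshape(-1, 1)
    return np.hstack([reps, E, reps * E])


def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1 / 2n) * ||y - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]          # drop feature j from the fit
            rho = X[:, j] @ resid / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]          # add the updated feature back
    return beta


# Toy data: the response depends on cluster 0 only among exposed subjects.
rng = np.random.default_rng(0)
n, k = 200, 3
reps = rng.normal(size=(n, k))
E = rng.integers(0, 2, size=n)
y = 2.0 * reps[:, 0] * E + 0.1 * rng.normal(size=n)

X = interaction_design(reps, E)                 # shape (200, 2k + 1)
Xc, yc = X - X.mean(axis=0), y - y.mean()       # center instead of an intercept
beta = lasso_cd(Xc, yc, lam=0.1)
```

Column `k + 1` of the design matrix is the product of the first cluster representative with E, so a clearly non-zero coefficient there corresponds to a selected cluster-by-exposure interaction.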
36. Gestational Diabetes: Interpretation of Clusters with IPA
• Canonical Pathways: 1,25-dihydroxyvitamin D3 Biosynthesis – vitamin D associated with obesity
• Diseases and Disorders: Hepatic System Disease – metabolism of glucose and lipids
• Physiological System Development and Function:
(i) Behavior and neurodevelopment – associated with obesity
(ii) Embryonic and organ development – GD associated with macrosomia
44. Discussion and Contributions
• Large system-wide changes are observed in many environments (DNA methylation, cortical thickness, gene expression)
• Environment-dependent clustering can improve prediction performance in high-dimensional settings (n << p)
• Clusters can be interpreted, but interpretation requires much more expert knowledge
• Leverages existing computationally fast algorithms and can run on a laptop computer (p ≈ 10k)
• Software implementation in R: sahirbhatnagar.com
48. Limitations
• There must be a high-dimensional signature of the exposure
• Covariance estimation
• Currently limited to binary environment
• Interpretation can be difficult
49. Acknowledgements
• Dr. Celia Greenwood
• Dr. Blanchette and Dr. Yang
• Dr. Luigi Bouchard, André Anne Houde
• Dr. Steele, Dr. Kramer, Dr. Abrahamowicz
• Maxime Turgeon, Kevin McGregor, Lauren Mokry, Dr. Forest
• Greg Voisin, Dr. Forgetta, Dr. Klein
• Mothers and children from the study