Arjun Manrai - National Academies Talk - June 6, 2019
1. Physics, AI, and the
Environment
Arjun (Raj) Manrai, Ph.D.
National Academies of Sciences, Engineering, and Medicine
June 6, 2019
@arjunmanrai
Arjun_Manrai@hms.harvard.edu
Harvard Medical School
Computational Health Informatics Program,
Boston Children’s Hospital
2. Supervised machine learning performs well with:
(a) Lots of clean, labeled data [Ex: ImageNet]
(b) Efficient compute, time [Ex: GPUs]
(c) Algorithmic advancements [Ex: Dropout]
Do we have these for environmental health
and the exposome?
3. Exposures are complex: densely correlated,
time-varying, and often difficult to measure.
For each pair of E: Spearman ρ
(575 factors: 81,937 correlations)
Permuted data to produce “null ρ”
Sought replication in > 1 cohort
[Figure legend] Red: positive ρ; Blue: negative ρ; thickness: |ρ|
Patel & Manrai PSB 2015
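The pairwise Spearman ρ and permutation-based “null ρ” idea on this slide can be illustrated with a minimal R sketch. Everything here is an assumption made for illustration: the data are simulated (6 exposures rather than 575), the permutation scheme shuffles each exposure column independently, and the pooled 95th-percentile cutoff is one simple choice, not necessarily the procedure used in Patel & Manrai.

```r
set.seed(1)
n <- 200   # participants (hypothetical)
p <- 6     # exposure factors (hypothetical; the slide reports 575)

# Simulated exposures: a shared latent factor correlates the first three
latent <- rnorm(n)
E <- matrix(rnorm(n * p), nrow = n, ncol = p)
E[, 1:3] <- E[, 1:3] + latent
colnames(E) <- paste0("E", seq_len(p))

# Observed Spearman rho for every pair of exposures
rho <- cor(E, method = "spearman")
pairs <- which(upper.tri(rho), arr.ind = TRUE)
obs <- rho[upper.tri(rho)]

# "Null rho": shuffle each column independently to break pairwise dependence
n_perm <- 200
null_rho <- replicate(n_perm, {
  E_perm <- apply(E, 2, sample)
  r <- cor(E_perm, method = "spearman")
  r[upper.tri(r)]
})  # matrix: one row per pair, one column per permutation

# Flag pairs whose |rho| exceeds the 95th percentile of pooled |null rho|
threshold <- quantile(abs(null_rho), 0.95)
flagged <- data.frame(
  a = colnames(E)[pairs[, "row"]],
  b = colnames(E)[pairs[, "col"]],
  rho = round(obs, 2),
  exceeds_null = abs(obs) > threshold
)
print(flagged)
```

In this simulation only the pairs among E1, E2, and E3 share a true dependence, so those are the pairs expected to clear the null threshold.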
4. Exposures can have very different timescales
throughout life
Athersuch Bioanalysis 2012
Cumulative (cadmium, PCB)
Constant, but excreted (phenols, vitamins)
Intervention (drugs)
Seasonal (allergen)
In-utero
Not shown: Diurnal
5. AI has been successful ‘out of the box’ for many
problems in health, but it is unlikely to be so for
many in environmental health and the exposome.
Pervasive issues around measurement and
reproducibility are likely to limit application.
6. Reproducibility, old and new
“Non-reproducible single
occurrences are of no
significance to science.”
—Karl Popper
“It’s basically collecting lots of
variables and then playing with your
data until you find something that
counts as statistically significant but
is probably meaningless.”
—John Oliver
7. Let’s take a deeper look by starting with a very
powerful ML method for the exposome…linear
regression!
Suppose we want to associate lead
exposure and C-reactive protein (CRP), a
blood marker of inflammation
Regression gives us:
• Variance explained
• Interpretable coefficients
• Predictions
• Uncertainty
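The four outputs listed above can each be read off a fitted model in R. The sketch below uses simulated stand-in data, not NHANES; the variable names `lead` and `CRP` mirror the slide, while the effect size, noise level, and sample size are invented for illustration.

```r
set.seed(42)
n <- 500

# Simulated stand-ins for lead exposure and CRP (NOT real NHANES values)
sim <- data.frame(lead = rlnorm(n, meanlog = 0.3, sdlog = 0.5))
sim$CRP <- 1 + 0.4 * sim$lead + rnorm(n, sd = 1.5)

fit <- lm(CRP ~ lead, data = sim)
s <- summary(fit)

# Variance explained
r2 <- s$r.squared
print(r2)

# Interpretable coefficients: estimate, standard error, t value, p-value
print(coef(s))

# Uncertainty: 95% confidence intervals for the coefficients
ci <- confint(fit, level = 0.95)
print(ci)

# Predictions, with prediction intervals, at hypothetical lead levels
pred <- predict(fit, newdata = data.frame(lead = c(1, 2, 5)),
                interval = "prediction")
print(pred)
```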
8. Step 1: Get data
NHANES <- readRDS('NHANES_NAS.rds')
Step 2: Build an unadjusted model
lm(CRP ~ lead, data = NHANES) P-value = 0.12
Step 3: Adjust for sex and BMI, and then also race,
and SES, and then also smoking minus SES…
lm(CRP ~ lead + factor(sex) + bmi, data = NHANES)
lm(CRP ~ lead + factor(sex) + bmi + factor(race), data = NHANES)
…..
P-value = 0.053
“trending towards significance”
9. Step 4: Aha moment! Filter to a more precise group
library(dplyr)  # provides %>% and filter()
NHANES2 <- NHANES %>% filter(age > 25 & age < 35)
lm(CRP ~ lead + factor(sex) + bmi, data = NHANES2)
Step 5….
Step 6…
Step 7…
Bingo! p < 0.05
Publish.
Tenure.
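Why this workflow is a problem can be shown with a small simulation. The sketch below is an illustration under stated assumptions, not an analysis of NHANES: the data are simulated so that lead has no effect on CRP, and the six model specifications (unadjusted, adjusted, and four age subgroups) are hypothetical stand-ins for the steps above. It compares the false-positive rate of one pre-specified model with the rate when the analyst reports whichever specification gives p < 0.05.

```r
set.seed(2019)

# One simulated study in which the true effect of lead on CRP is zero
simulate_once <- function(n = 500) {
  d <- data.frame(
    lead = rnorm(n),
    sex  = factor(sample(c("F", "M"), n, replace = TRUE)),
    bmi  = rnorm(n, mean = 27, sd = 5),
    age  = runif(n, min = 18, max = 80)
  )
  d$CRP <- 0.05 * d$bmi + rnorm(n)  # CRP depends on BMI only

  # p-value of the lead coefficient for a given specification
  p_lead <- function(formula, data) {
    coef(summary(lm(formula, data = data)))["lead", "Pr(>|t|)"]
  }

  c(
    unadjusted = p_lead(CRP ~ lead, d),
    adjusted   = p_lead(CRP ~ lead + sex + bmi, d),
    age_25_35  = p_lead(CRP ~ lead + sex + bmi, subset(d, age > 25 & age < 35)),
    age_35_50  = p_lead(CRP ~ lead + sex + bmi, subset(d, age >= 35 & age < 50)),
    age_50_65  = p_lead(CRP ~ lead + sex + bmi, subset(d, age >= 50 & age < 65)),
    age_65_up  = p_lead(CRP ~ lead + sex + bmi, subset(d, age >= 65))
  )
}

p <- t(replicate(400, simulate_once()))  # 400 studies x 6 specifications

# False-positive rate for a single pre-specified model
fp_prespecified <- mean(p[, "adjusted"] < 0.05)

# False-positive rate when any specification reaching p < 0.05 is reported
fp_any <- mean(apply(p, 1, min) < 0.05)

print(c(prespecified = fp_prespecified, any_specification = fp_any))
```

The pre-specified rate should sit near the nominal 5%, while the "any specification" rate can only be equal or higher, since it takes the minimum p-value across all six fits.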
11. When applying machine learning, we have
even more analytic choices:
• Hyperparameters (e.g. learning rate)
• Model architecture (e.g. number of hidden layers)
• Declaration of improvement (e.g. delta AUC = 0.001)
• Splits (e.g. training/val/test splits)
• Many analysts (e.g. Kaggle competitions)
…What can we do?
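The same multiplicity shows up in model selection. As a minimal sketch under invented assumptions, suppose 500 candidate "models" (different hyperparameters, architectures, or analysts) all produce pure noise scores, so every true AUC is 0.5. Picking the best one on a validation set still yields an apparently improved AUC, which disappears on an untouched test set. The `auc` helper is a hypothetical name; it uses the standard rank-sum (Mann-Whitney) formula.

```r
set.seed(7)

# AUC via the rank-sum (Mann-Whitney) formula; label is 0/1
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

n_val    <- 200   # validation set size (hypothetical)
n_test   <- 200   # test set size (hypothetical)
n_models <- 500   # candidate models tried (hypothetical)

y_val  <- rbinom(n_val, 1, 0.5)
y_test <- rbinom(n_test, 1, 0.5)

# Every "model" outputs noise, so no model has real predictive signal
val_scores  <- matrix(rnorm(n_val * n_models), n_val, n_models)
test_scores <- matrix(rnorm(n_test * n_models), n_test, n_models)

# Select the model with the best validation AUC
val_auc <- apply(val_scores, 2, auc, label = y_val)
best <- which.max(val_auc)

best_val_auc  <- val_auc[best]                      # optimistic: selected on this set
best_test_auc <- auc(test_scores[, best], y_test)   # honest: untouched data

print(c(validation = best_val_auc, test = best_test_auc))
```

A small gain such as delta AUC = 0.001 is well inside the spread that selection alone produces here, which is one argument for holding out a test set that is evaluated once.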
12. When do we address multiplicity?
(Scale: Almost Always to Almost Never)
13. We can look to two fields:
(1) Genomics
(2) Physics
14. Comparison #1: Genomics
Major study design change in human genetics research:
Candidate gene studies to genome wide association studies.
15. Creation of a phenotype-exposure association map:
A 2-D view of 158 phenotype by 510 exposure associations
[Figure legend] Association size: > 0, < 0
510 E exposure and diet indicators × 158 clinical trait phenotypes
NHANES 1999-2000, 2001-2002, 2005-2006, …, 2011-2012 (8)
Median N: 150-5000 per survey
~67,281 E-P associations!
significant associations (FDR < 5%)
adjusted by age, age², sex, race, income
Manrai et al. 2019
[Figure axes] 158 phenotypes × 510 exposures
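An exposure-wide scan of this kind can be sketched in a few lines of R. This is a simplified illustration on simulated data, not the pipeline of the cited study: it scans one phenotype against 50 hypothetical exposures (the slide reports 510 exposures and 158 phenotypes), adjusts each model for age, age², sex, race, and income as on the slide, and applies the Benjamini-Hochberg procedure via `p.adjust` to control the FDR at 5%. The helper name `scan_one` and all effect sizes are assumptions.

```r
set.seed(11)
n <- 1000
n_exposures <- 50   # hypothetical; the slide reports 510

# Simulated covariates (NOT real NHANES values)
covars <- data.frame(
  age    = runif(n, 20, 80),
  sex    = factor(sample(c("F", "M"), n, replace = TRUE)),
  race   = factor(sample(paste0("group", 1:4), n, replace = TRUE)),
  income = rnorm(n)
)
exposures <- matrix(rnorm(n * n_exposures), n, n_exposures,
                    dimnames = list(NULL, paste0("E", seq_len(n_exposures))))

# Phenotype depends on age and on the first three exposures only
phenotype <- as.numeric(
  0.02 * covars$age + exposures[, 1:3] %*% c(0.3, 0.3, 0.3) + rnorm(n)
)

# Fit one adjusted model per exposure; keep the exposure's estimate and p-value
scan_one <- function(x) {
  fit <- lm(phenotype ~ x + age + I(age^2) + sex + race + income, data = covars)
  coef(summary(fit))["x", c("Estimate", "Pr(>|t|)")]
}

res <- as.data.frame(t(apply(exposures, 2, scan_one)))
colnames(res) <- c("estimate", "p")

# Benjamini-Hochberg adjustment across all exposures scanned
res$fdr <- p.adjust(res$p, method = "BH")

nominal_hits <- sum(res$p < 0.05)     # unadjusted threshold
fdr_hits     <- sum(res$fdr < 0.05)   # FDR < 5%
print(c(nominal = nominal_hits, fdr = fdr_hits))
print(rownames(res)[res$fdr < 0.05])
```

Reporting every test performed, together with its adjusted significance, is what distinguishes this design from reporting a single candidate association.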
16. Large cohorts will expose massive misspecification,
confounding, and correlation amongst exposures.
Manrai, Ioannidis, Patel. AJE 2019
20. Summary
The central challenges of applying AI in environmental
health are not uniquely AI challenges. They are:
(1) Data: environmental/exposure data are time-varying,
densely correlated, and often hard to measure
[potential solutions: new measurement platforms, consortia]
(2) Analytic choices, multiplicity
[potential solutions: pre-registration, -WAS, blinding]
(3) Often extreme missingness in data
[potential solutions: new imputation methods]
Useful comparisons in high-throughput genomics and
communal science in physics.