Arjun Manrai - National Academies Talk - June 6, 2019

Physics, AI, and the Environment
Arjun (Raj) Manrai, Ph.D.

National Academies of Science, Engineering, and Medicine
June 6, 2019
@arjunmanrai

Arjun_Manrai@hms.harvard.edu

Harvard Medical School

Computational Health Informatics Program,

Boston Children’s Hospital

Supervised machine learning performs well with:

(a) Lots of clean, labeled data [Ex: ImageNet]

(b) Efficient compute, time [Ex: GPUs]

(c) Algorithmic advancements [Ex: Dropout]
Do we have these for environmental health
and the exposome?

Exposures are complex: densely correlated, time-varying, and often difficult to measure.

For each pair of E: Spearman ρ (575 factors: 81,937 correlations)
• Permuted data to produce “null ρ”
• Sought replication in > 1 cohort

[Figure legend] Red: positive ρ; Blue: negative ρ; thickness: |ρ|

Patel & Manrai PSB 2015
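The pairwise-correlation-with-permutation idea above can be sketched in R. This is only an illustration on simulated data, not the analysis from Patel & Manrai: the matrix `E`, the four exposure names, and the number of permutations are all assumptions, and the replication step across cohorts is not shown.

```r
# Simulated stand-in for an exposure matrix (rows: participants, columns: exposures).
set.seed(1)
n <- 200
z <- rnorm(n)                          # shared latent source, so e1 and e2 correlate
E <- cbind(e1 = z + rnorm(n),
           e2 = z + rnorm(n),
           e3 = rnorm(n),
           e4 = rnorm(n))

# Observed Spearman rho for every pair of exposures.
pairs <- t(combn(colnames(E), 2))
rho_obs <- apply(pairs, 1, function(p) {
  cor(E[, p[1]], E[, p[2]], method = "spearman")
})

# "Null rho": shuffle one member of each pair to break any real association.
n_perm <- 500
rho_null <- replicate(n_perm, apply(pairs, 1, function(p) {
  cor(sample(E[, p[1]]), E[, p[2]], method = "spearman")
}))

# Two-sided permutation p-value per pair (with the usual +1 correction).
p_perm <- (rowSums(abs(rho_null) >= abs(rho_obs)) + 1) / (n_perm + 1)

result <- data.frame(a = pairs[, 1], b = pairs[, 2], rho = rho_obs, p = p_perm)
print(result)
```

With hundreds of real exposures the same loop produces tens of thousands of pairs, which is why a permutation null (rather than a single nominal threshold) is useful.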

Exposures can have very different timescales throughout life
• Cumulative (cadmium, PCB)
• Constant, but excreted (phenols, vitamins)
• Intervention (drugs)
• Seasonal (allergen)
• In-utero
Not shown: Diurnal
Athersuch Bioanalysis 2012

AI has been successful ‘out of the box’ for many
problems in health, but it is unlikely to be so for
many in environmental health and the exposome.
Pervasive issues around measurement and
reproducibility are likely to limit its application.

Reproducibility, old and new
“Non-reproducible single
occurrences are of no
significance to science.”
—Karl Popper
“It’s basically collecting lots of
variables and then playing with your
data until you find something that
counts as statistically significant but
is probably meaningless.”
—John Oliver

Let’s take a deeper look by starting with a very
powerful ML method for the exposome…linear
regression!
Suppose we want to associate lead
exposure and C-reactive protein (CRP), a
blood marker of inflammation.
Regression gives us:

• Variance explained

• Interpretable coefficients

• Predictions

• Uncertainty
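A minimal sketch of how each of these four outputs can be read off a fitted model in R. The data frame `sim` is simulated purely for illustration (the effect size 0.3 and the log-normal lead distribution are assumptions, not NHANES values); only the variable names `CRP` and `lead` follow the talk.

```r
# Simulated data standing in for a lead/CRP data set (not real NHANES values).
set.seed(2)
n <- 500
sim <- data.frame(lead = rlnorm(n, meanlog = 0, sdlog = 0.5))
sim$CRP <- 1 + 0.3 * sim$lead + rnorm(n)

fit <- lm(CRP ~ lead, data = sim)

# Variance explained
r2 <- summary(fit)$r.squared

# Interpretable coefficients: change in CRP per unit change in lead
beta <- coef(fit)

# Uncertainty: 95% confidence interval for the lead coefficient
ci <- confint(fit, "lead", level = 0.95)

# Predictions (with prediction intervals) at a few hypothetical lead levels
pred <- predict(fit,
                newdata = data.frame(lead = c(0.5, 1, 2)),
                interval = "prediction")

print(list(r2 = r2, beta = beta, ci = ci, pred = pred))
```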

Step 1: Get data
NHANES <- readRDS('NHANES_NAS.rds')

Step 2: Build an unadjusted model
lm(CRP ~ lead, data = NHANES)    # P-value = 0.12

Step 3: Adjust for sex and BMI, and then also race, and SES, and then also smoking minus SES…
lm(CRP ~ lead + factor(sex) + bmi, data = NHANES)
lm(CRP ~ lead + factor(sex) + bmi + factor(race), data = NHANES)
…..
P-value = 0.053
“trending towards significance”

Step 4: Aha moment! Filter to a more precise group
library(dplyr)  # provides %>% and filter()
NHANES2 <- NHANES %>% filter(age > 25 & age < 35)
lm(CRP ~ lead + factor(sex) + bmi, data = NHANES2)

Step 5…

Step 6…

Step 7…
Bingo! p < 0.05
Publish.

Tenure.

Formalize and scale with the Vibration of Effects
[Figure: panels labeled pcb, b-carotene, C-reactive protein, cotinine; annotation: Janus effect]
Patel et al. JCE 2015
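One way to make the "try every adjustment set" idea concrete is to fit the same exposure coefficient under every subset of candidate adjusters and look at how the estimate and p-value move. The sketch below does this on simulated data; the data frame `sim`, the four adjusters, and the simple sign-flip check are illustrative assumptions and not the exact procedure or metrics of Patel et al.

```r
# Simulated data; variable names mirror the talk, values are made up.
set.seed(3)
n <- 1000
sim <- data.frame(
  lead    = rnorm(n),
  sex     = rbinom(n, 1, 0.5),
  bmi     = rnorm(n, 27, 5),
  race    = sample(c("a", "b", "c"), n, replace = TRUE),
  smoking = rbinom(n, 1, 0.2)
)
sim$CRP <- 0.05 * sim$lead + 0.04 * sim$bmi + 0.3 * sim$smoking + rnorm(n)

adjusters <- c("factor(sex)", "bmi", "factor(race)", "factor(smoking)")

# Every subset of adjusters, including the empty set: 2^4 = 16 models.
grid <- expand.grid(rep(list(c(FALSE, TRUE)), length(adjusters)))
subsets <- lapply(seq_len(nrow(grid)), function(i) adjusters[unlist(grid[i, ])])

# Fit one model and keep only the lead coefficient and its p-value.
fit_one <- function(s) {
  f  <- reformulate(c("lead", s), response = "CRP")
  co <- summary(lm(f, data = sim))$coefficients["lead", ]
  data.frame(adjusters = if (length(s)) paste(s, collapse = " + ") else "(none)",
             estimate  = co[["Estimate"]],
             p         = co[["Pr(>|t|)"]])
}
voe <- do.call(rbind, lapply(subsets, fit_one))

# How much do the answers "vibrate" across specifications?
print(range(voe$estimate))
print(range(voe$p))

# Crude check in the spirit of the Janus effect: do estimates change sign?
sign_flip <- min(voe$estimate) < 0 && max(voe$estimate) > 0
print(sign_flip)
```

Reporting the whole distribution in `voe`, rather than the single most favorable row, is the point of the exercise.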

When applying machine learning, we have even more analytic choices:
• Hyperparameters (e.g. learning rate)

• Model architecture (e.g. number of hidden layers)

• Declaration of improvement (e.g. delta AUC = 0.001)

• Splits (e.g. training/val/test splits)

• Many analysts (e.g. Kaggle competitions)
…What can we do?

When do we address multiplicity?
[Scale from “Almost Always” to “Almost Never”]
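A short worked calculation shows why multiplicity matters. It assumes m independent tests, each run at level α, with no true effects; real analytic choices are usually correlated, so this is an idealized illustration rather than an exact description of any study.

```latex
\[
\Pr(\text{at least one false positive}) = 1 - (1 - \alpha)^{m},
\qquad
\alpha = 0.05,\; m = 20 \;\Rightarrow\; 1 - 0.95^{20} \approx 0.64 .
\]
```

Under these assumptions, twenty unreported analytic variations are enough to make a nominally significant result more likely than not.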

We can look to two fields:

(1) Genomics

(2) Physics

Comparison #1: Genomics
Major study design change in human genetics research:
Candidate gene studies to genome wide association studies.

Creation of a phenotype-exposure association map:
A 2-D view of 158 phenotype by 510 exposure associations

510 E exposure and diet indicators × 158 clinical trait phenotypes
NHANES 1999-2000, 2001-2002, 2005-2006, …, 2011-2012 (8)
Median N: 150-5000 per survey
~67,281 E-P associations!
Significant associations (FDR < 5%)
Adjusted by age, age², sex, race, income

[Figure: association map, 158 phenotypes × 510 exposures; color shows association size (> 0, < 0)]

Manrai et al. 2019
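The exposure-by-phenotype scan with FDR control can be sketched as a loop of adjusted regressions followed by a Benjamini-Hochberg correction. Everything here is simulated and scaled down (30 exposures, 5 phenotypes, one planted association), and the model adjusts only for age, age², and sex; the matrices `E` and `P`, the effect size, and the reduced covariate set are assumptions for illustration, not the published pipeline.

```r
# Simulated, scaled-down stand-in for an exposure x phenotype scan.
set.seed(5)
n <- 400
n_exposures  <- 30
n_phenotypes <- 5

covars <- data.frame(age = runif(n, 20, 80), sex = rbinom(n, 1, 0.5))
E <- matrix(rnorm(n * n_exposures), n, n_exposures,
            dimnames = list(NULL, paste0("E", seq_len(n_exposures))))
P <- matrix(rnorm(n * n_phenotypes), n, n_phenotypes,
            dimnames = list(NULL, paste0("P", seq_len(n_phenotypes))))
P[, "P1"] <- P[, "P1"] + 0.5 * E[, "E1"]   # plant one true association

# One row per exposure-phenotype pair.
grid <- expand.grid(exposure = colnames(E), phenotype = colnames(P),
                    stringsAsFactors = FALSE)

# Same adjusted model for every pair; keep the exposure p-value.
grid$p <- mapply(function(e, p) {
  d   <- data.frame(y = P[, p], x = E[, e], covars)
  fit <- lm(y ~ x + age + I(age^2) + factor(sex), data = d)
  summary(fit)$coefficients["x", "Pr(>|t|)"]
}, grid$exposure, grid$phenotype, USE.NAMES = FALSE)

# Benjamini-Hochberg adjustment across all pairs, then FDR < 5%.
grid$fdr <- p.adjust(grid$p, method = "BH")
hits <- grid[grid$fdr < 0.05, ]
print(hits)
```

Fixing the model specification up front and correcting across the full grid is what distinguishes this design from testing one hand-picked exposure at a time.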
Comparison #1: Genomics (continued)

Large cohorts will expose massive misspecification,
confounding, and correlation amongst exposures.
Manrai, Ioannidis, Patel. AJE 2019
Comparison #1: Genomics (continued)

Comparison #2: Physics
Evidentiary standards in particle physics and the power of large, team science.
Discovery of the Higgs boson

Scientific American
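For a sense of scale, particle physics conventionally asks for a five-sigma excess before claiming a discovery. The slides do not state the threshold, so treating five sigma as the standard meant here is an assumption; the helper `sigma_to_p` is a hypothetical name. The sketch converts sigma levels to one-sided normal tail probabilities.

```r
# Convert a significance level in "sigmas" to a one-sided normal tail probability.
sigma_to_p <- function(z) pnorm(z, lower.tail = FALSE)

p_5sigma <- sigma_to_p(5)      # about 2.87e-07
z_at_005 <- qnorm(0.95)        # one-sided z matching p = 0.05, about 1.64

print(p_5sigma)
print(z_at_005)
```

The gap between roughly 1.6 sigma and 5 sigma is the gap between the p < 0.05 habit and the evidentiary bar used for the Higgs announcement.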
Comparison #2: Physics (continued)

Nature
Comparison #2: Physics (continued)

Summary

The central challenges of applying AI in environmental
health are not uniquely AI challenges. They are:

(1) Data: environmental/exposure data are time-varying,
densely correlated, and often hard to measure
[potential solutions: new measurement platforms,
consortia]

(2) Analytic choices, multiplicity
[potential solutions: pre-registration, -WAS, blinding]

(3) Often extreme missingness in data [potential
solutions: new imputation methods]

Useful comparisons in high-throughput genomics and
communal science in physics.
