Studying the elusive environment in larger
scale with the exposome and EWAS
Chirag J Patel

Boston University

11/30/15
chirag@hms.harvard.edu

@chiragjp

www.chiragjpgroup.org

P = G + E

Type 2 Diabetes

Cancer

Alzheimer’s

Gene expression
Phenotype Genome
Variants
Environment
Infectious agents

Nutrients

Pollutants

Drugs

We are great at G investigation!
over 2000

Genome-wide Association Studies (GWAS)

https://www.ebi.ac.uk/gwas/
G

Nothing comparable exists to elucidate the influence of E!
We lack high-throughput methods
and data to discover new E in P…
E: ???

A similar paradigm for discovery should exist

for E!
Why?

$H^2 = \sigma^2_G / \sigma^2_P$
Heritability (H2) is the proportion of phenotypic variance in a population
that is attributed to genetic variance.
It indicates how much of the phenotypic differences between individuals
is attributed to G.
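
As a minimal sketch of the definition above, assuming the additive decomposition P = G + E with G and E uncorrelated (so that Var(P) = Var(G) + Var(E)), the helper below computes H2 and the complementary environmental share. The function name and the numbers are illustrative assumptions, not values from the talk.

```python
def variance_shares(var_g: float, var_e: float) -> tuple[float, float]:
    """Return (H2, E share) under the assumed model Var(P) = Var(G) + Var(E)."""
    if var_g < 0 or var_e < 0:
        raise ValueError("variances must be non-negative")
    var_p = var_g + var_e
    if var_p == 0:
        raise ValueError("phenotypic variance must be positive")
    h2 = var_g / var_p
    return h2, 1.0 - h2


# Hypothetical variances: one quarter of the phenotypic variance is genetic.
h2, e_share = variance_shares(var_g=1.0, var_e=3.0)
print(f"H2 = {h2:.2f}, E share = {e_share:.2f}")  # H2 = 0.25, E share = 0.75
```

Under this simplified model, whatever H2 does not explain is left to E, which is the motivation for the slides that follow.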

[Figure: bar chart of heritability, Var(G)/Var(Phenotype), on a 0 to 100 axis (source: SNPedia.com). Traits, in the order shown: eye color; hair curliness; type-1 diabetes; height; schizophrenia; epilepsy; Graves' disease; celiac disease; polycystic ovary syndrome; attention deficit hyperactivity disorder; bipolar disorder; obesity; Alzheimer's disease; anorexia nervosa; psoriasis; bone mineral density; age at menarche; nicotine dependence; sexual orientation; alcoholism; lupus; rheumatoid arthritis; Crohn's disease; migraine; thyroid cancer; autism; diastolic blood pressure; body mass index; depression; coronary artery disease; insomnia; age at menopause; heart disease; prostate cancer; QT interval; breast cancer; ovarian cancer; hangover; stroke; asthma; systolic blood pressure; hypertension; osteoarthritis; Parkinson's disease; longevity; type-2 diabetes; gallstone disease; testicular cancer; cervical cancer; sciatica; bladder cancer; colon cancer; lung cancer; leukemia; stomach cancer.]
G estimates for complex traits are low and variable:
massive opportunity for high-throughput E discovery
Type 2 Diabetes (25%)
Heart Disease (25-30%)
Autism (50%???)

$\sigma^2_E$: Exposome!

Meta-analysis of the heritability of human traits based on fifty years of twin studies
Tinca J C Polderman, Beben Benyamin, Christiaan A de Leeuw, Patrick F Sullivan, Arjen van Bochoven, Peter M Visscher & Danielle Posthuma
Nature Genetics, 2015

Abstract: Despite a century of research on complex traits in humans, the relative importance and specific nature of the influences of genes and environment on human traits remain controversial. We report a meta-analysis of twin correlations and reported variance components for 17,804 traits from 2,748 publications including 14,558,903 partly dependent twin pairs, virtually all published twin studies of complex traits. Estimates of heritability cluster strongly within functional domains, and across all traits the reported heritability is 49%. For a majority (69%) of traits, the observed twin correlations are consistent with a simple and parsimonious model where twin resemblance is solely due to additive genetic variation. The data are inconsistent with substantial influences from shared environment or non-additive genetic variation. This study provides the most comprehensive analysis of the causes of individual differences in human traits thus far and will guide future gene-mapping efforts. All the results can be visualized using the MaTCH webtool.

Insight into the nature of observed variation in human traits is important in medicine, psychology, social sciences and evolutionary biology. It has gained new relevance with both the ability to map genes for human traits and the availability of large, collaborative data sets to do so on an extensive and comprehensive scale. Individual differences in human traits have been studied for more than a century, yet the causes of variation in human traits remain uncertain and controversial. Specifically, the partitioning of observed variability into underlying genetic and environmental sources and the relative importance of additive and non-additive genetic variation are continually debated [1–5]. Recent results from large-scale genome-wide association studies (GWAS) show that many genetic variants contribute to the variation in complex traits and that effect sizes are typically small [6,7]. However, the sum of the variance explained by the detected variants is much smaller than the reported heritability of the trait [4,6–10]. This 'missing heritability' has led some investigators to conclude that non-additive variation must be important [4,11]. Although the presence of gene-gene interaction has been demonstrated empirically [5,12–17], little is known about its relative contribution to observed variation [18].

In this study, our aim is twofold. First, we analyze empirical estimates of the relative contributions of genes and environment for virtually all human traits investigated in the past 50 years. Second, we assess empirical evidence for the presence and relative importance of non-additive genetic influences on all human traits studied. We rely on classical twin studies, as the twin design has been used widely to disentangle the relative contributions of genes and environment, across a variety of human traits. The classical twin design is based on contrasting the trait resemblance of monozygotic and dizygotic twin pairs. Monozygotic twins are genetically identical, and dizygotic twins are genetically full siblings. We show that, for a majority of traits (69%), the observed statistics are consistent with a simple and parsimonious model where the observed variation is solely due to additive genetic variation. The data are inconsistent with a substantial influence from shared environment or non-additive genetic variation. We also show that estimates of heritability cluster strongly within functional domains, and across all traits the reported heritability is 49%. Our results are based on a meta-analysis of twin correlations and reported variance components for 17,804 traits from 2,748 publications including 14,558,903 partly dependent twin pairs, virtually all twin studies of complex traits published between 1958 and 2012. This study provides the most comprehensive analysis of the causes of individual differences in human traits thus far and will guide future gene-mapping efforts. All [...]
17,804 traits of the phenome
2,748 publications

14,558,903 twin pairs
Average H2 (genome): 0.49
Exposome may play an equal role.
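
Twin studies such as the meta-analysis summarized above estimate heritability by contrasting monozygotic (MZ) and dizygotic (DZ) twin correlations. A minimal sketch using Falconer's classical formula, H2 ≈ 2(r_MZ − r_DZ), follows. Using this particular estimator is an assumption made here for illustration (the meta-analysis reports its own variance components), and the correlations are made up.

```python
def falconer_h2(r_mz: float, r_dz: float) -> float:
    """Falconer's estimate of heritability from twin correlations."""
    return 2.0 * (r_mz - r_dz)


def shared_environment(r_mz: float, r_dz: float) -> float:
    """Falconer's estimate of the shared-environment share, 2 * r_DZ - r_MZ."""
    return 2.0 * r_dz - r_mz


# Hypothetical correlations: r_MZ is twice r_DZ, the pattern expected when
# twin resemblance is due to additive genetic variation alone.
r_mz, r_dz = 0.5, 0.25
print(falconer_h2(r_mz, r_dz))         # 0.5
print(shared_environment(r_mz, r_dz))  # 0.0
```

With these illustrative inputs the shared-environment term is zero, and the half of the variance not attributed to G is left to non-shared E and measurement error.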

Explaining the other 50%:
A new data-driven paradigm for robust discovery of E
via EWAS and the exposome
what to measure? how to measure?
[Figure, from a Science Perspectives article: the exposome. External environment: radiation, diet, pollution, infections, drugs, life-style, stress. Internal chemical environment: xenobiotics, inflammation, preexisting disease, lipid peroxidation, oxidative stress, gut flora. Chemical classes shown: reactive electrophiles, metals, endocrine disrupters, immune modulators, receptor-binding proteins.]
[...] critical entity for disease etiology (7). Recent discussion has focused on whether and how to implement this vision (8). Although fully characterizing human exposomes is daunting, strategies can be developed for getting “snapshots” of critical portions of a person’s exposome during different stages of life. At one extreme is a “bottom-up” strategy in which all chemicals in each external source of a subject’s exposome are measured at each time point. Although this approach would have the advantage of relating important exposures to the air, water, or diet, it would require enormous effort and would miss essential components of the internal chemical environment due to such factors as gender, obesity, inflammation, and stress. By contrast, a “top-down” strategy would measure all chemicals (or products of their downstream processing or effects, so-called read-outs or signatures) in a subject’s blood. This would require only a single blood specimen at each time point and would relate directly [...]

[...]ruptors and can be measured through serum [...]

[...]some (telomere) length in peripheral blood mononuclear cells responded to chronic psychological stress, possibly mediated by the production of reactive oxygen species (15).

Characterizing the exposome represents a technological challenge like that of the human genome project, which began when DNA sequencing was in its infancy (16). Analytical systems are needed to process small amounts of blood from thousands of subjects. Assays should be multiplexed for measuring many chemicals in each class of interest. Tandem mass spectrometry, gene and protein chips, and microfluidic systems offer the means to do this. Platforms for high-throughput assays should lead to economies of scale, again like those experienced by the human genome project. And because exposome technologies would provide feedback for therapeutic interventions and personalized medicine, they should motivate the development of commercial devices for screening important environmental exposures in blood samples.

With successful characterization of both [...]
Characterizing the exposome. The exposome represents
the combined exposures from all sources that reach the
internal chemical environment. Toxicologically important
classes of exposome chemicals are shown. Signatures and
biomarkers can detect these agents in blood or serum.
“A more comprehensive view of
environmental exposure is
needed ... to discover major
causes of diseases...”
how to analyze in relation to health?
Wild, 2005

Rappaport and Smith, 2010, 2011

Buck-Louis and Sundaram 2012

Miller and Jones, 2014

Patel CJ and Ioannidis JPA, 2014

Connecting Environmental Exposure with Disease:
Missing the “System” of Exposures?
2×2 table: exposure (E+ / E-) by outcome (diseased / non-diseased); cell counts: ?
Exposed to many things, but do not assess the multiplicity.
Fragmented literature of associations.
Challenge to discover E associated with disease.

Examples of exposome-driven discovery machinery

Gold standard for breadth of human exposure information:
National Health and Nutrition Examination Survey (NHANES) [1]
since the 1960s

now biennial (two-year cycles): 1999 onwards

10,000 participants per survey

The sample for the survey is selected to represent the U.S. population of all ages. To produce reliable statistics, NHANES over-samples persons 60 and older, African Americans, and Hispanics.

Since the United States has experienced dramatic growth in the number of older people during this century, the aging population has major implications for health care needs, public policy, and research priorities. NCHS is working with public health agencies to increase the knowledge of the health status of older Americans. NHANES has a primary role in this endeavor.

All participants visit the physician. Dietary interviews and body measurements are included for everyone. All but the very young have a blood sample taken and will have a dental screening. Depending upon the age of the participant, the rest of the examination includes tests and procedures to assess the various aspects of health listed above. In general, the older the individual, the more extensive the examination.

Survey Operations

Health interviews are conducted in respondents’ homes. Health measurements are performed in specially-designed and equipped mobile centers, which travel to locations throughout the country. The study team consists of a physician, medical and health technicians, as well as dietary and health interviewers. Many of the study staff are bilingual (English/Spanish).

An advanced computer system using high-end servers, desktop PCs, and wide-area networking collect and process all of the NHANES data, nearly eliminating the need for paper forms and manual coding operations. This system allows interviewers to use notebook computers with electronic pens. The staff at the mobile center can automatically transmit data into data bases through such devices as digital scales and stadiometers. Touch-sensitive computer screens let respondents enter their own responses to certain sensitive questions in complete privacy. Survey information is available to NCHS staff within 24 hours of collection, which enhances the capability of collecting quality data and increases the speed with which results are released to the public.

In each location, local health and government officials are notified of the upcoming survey. Households in the study area receive a letter from the NCHS Director to introduce the survey. Local media may feature stories about the survey.

NHANES is designed to facilitate and encourage participation. Transportation is provided to and from the mobile center if necessary. Participants receive compensation and a report of medical findings is given to each participant. All information collected in the survey is kept strictly confidential. Privacy is protected by public laws.

Uses of the Data

Information from NHANES is made available through an extensive series of publications and articles in scientific and technical journals. For data users and researchers throughout the world, survey data are available on the internet and on easy-to-use CD-ROMs.

Research organizations, universities, health care providers, and educators benefit from survey information. Primary data users are federal agencies that collaborated in the design and development of the survey. The National Institutes of Health, the Food and Drug Administration, and CDC are among the agencies that rely upon NHANES to provide data essential for the implementation and evaluation of program activities. The U.S. Department of Agriculture and NCHS cooperate in planning and reporting dietary and nutrition information from the survey. NHANES’ partnership with the U.S. Environmental Protection Agency allows continued study of the many important environmental influences on our health.

• Physical fitness and physical functioning
• Reproductive history and sexual behavior
• Respiratory disease (asthma, chronic bronchitis, emphysema)
• Sexually transmitted diseases
• Vision

[1] http://www.cdc.gov/nchs/nhanes.htm
>250 exposures (serum + urine)

GWAS chip

>85 quantitative clinical traits
(e.g., serum glucose, lipids, BMI)

Death index linkage (cause of
death)

Nutrients and vitamins: vitamin D, carotenes
Infectious agents: hepatitis, HIV, Staph. aureus
Plastics and consumables: phthalates, bisphenol A
Physical activity: steps
Pesticides and pollutants: atrazine; cadmium; hydrocarbons
Drugs: statins; aspirin

What E factors are associated with type 2 diabetes?

EWAS in Type 2 Diabetes:
Searching >250 exposures for associations with

FBG > 125 mg/dL
[Figure: plot of −log10(p-value) (y axis, 0 to 2) for each exposure, grouped by exposure class: acrylamide; allergen test; bacterial infection; cotinine; dialkyl; dioxins; furans (dibenzofuran); heavy metals; hydrocarbons; latex; nutrients (carotenoid, minerals, vitamins A, B, C, D, E); PCBs; perchlorate; pesticides (atrazine, chlorophenol, organochlorine, organophosphate, pyrethroid); phenols; phthalates; phytoestrogens; polybrominated ethers; polyfluorochemicals; viral infection; volatile compounds.]

Highlighted findings:
Heptachlor epoxide: OR = 3.2, 1.8
PCB170: OR = 4.5, 2.3
γ-tocopherol (vitamin E): OR = 1.8, 1.6
β-carotene: OR = 0.6, 0.6
FDR < 10%
Adjusted for age, sex, race, SES, BMI
PLOS ONE. 2010
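
The scan above tests many exposures at once and reports findings at FDR < 10%. One common way to control the false discovery rate is the Benjamini-Hochberg step-up procedure, sketched below. Whether the published analysis used exactly this estimator is an assumption, and the exposure names and p-values are hypothetical.

```python
def benjamini_hochberg(pvalues: dict[str, float]) -> dict[str, float]:
    """Return Benjamini-Hochberg adjusted p-values (q-values) per exposure."""
    m = len(pvalues)
    ranked = sorted(pvalues.items(), key=lambda kv: kv[1])  # ascending p
    qvalues: dict[str, float] = {}
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum of
    # p * m / rank over its own rank and all larger ranks (keeps q monotone).
    for rank in range(m, 0, -1):
        name, p = ranked[rank - 1]
        running_min = min(running_min, p * m / rank)
        qvalues[name] = running_min
    return qvalues


# Hypothetical p-values, e.g. from one regression model per exposure.
pvals = {
    "exposure_a": 0.001,
    "exposure_b": 0.02,
    "exposure_c": 0.04,
    "exposure_d": 0.60,
}
q = benjamini_hochberg(pvals)
hits = sorted(name for name, value in q.items() if value < 0.10)
print(hits)  # ['exposure_a', 'exposure_b', 'exposure_c']
```

The threshold is applied to the adjusted values, so the number of exposures tested directly affects which associations survive.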

What E factors are associated with

mortality and biological aging?

EWAS to search for

exposures and behaviors associated with all-cause mortality.
NHANES: 1999-2004
National Death Index linked mortality

246 behaviors and exposures (serum/urine/self-report)
NHANES: 1999-2001
N=330 to 6008 (26 to 655 deaths)

~5.5 years of followup
Cox proportional hazards

baseline exposure and time to death
False discovery rate < 5%
NHANES: 2003-2004
N=177 to 3258 (20-202 deaths)

~2.8 years of followup
p < 0.05
IJE, 2013
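
The two-stage design above can be sketched as a simple filter: keep factors with a discovery-cohort FDR below 5%, then require p < 0.05 in the replication cohort. The extra requirement that the hazard ratio point in the same direction in both cohorts is an assumption added here, as are the record layout and the example values.

```python
from dataclasses import dataclass


@dataclass
class Association:
    factor: str
    discovery_q: float     # FDR-adjusted p-value, discovery cohort
    discovery_hr: float    # adjusted hazard ratio, discovery cohort
    replication_p: float   # p-value, replication cohort
    replication_hr: float  # adjusted hazard ratio, replication cohort


def replicated(results, q_max=0.05, p_max=0.05):
    """Names of factors passing the discovery FDR and replication p thresholds."""
    keep = []
    for r in results:
        # Assumed extra check: both hazard ratios on the same side of 1.
        same_direction = (r.discovery_hr - 1.0) * (r.replication_hr - 1.0) > 0
        if r.discovery_q < q_max and r.replication_p < p_max and same_direction:
            keep.append(r.factor)
    return keep


# Hypothetical values for illustration only.
results = [
    Association("factor_a", 0.01, 1.6, 0.003, 1.5),
    Association("factor_b", 0.03, 0.7, 0.200, 0.8),  # fails replication p
    Association("factor_c", 0.20, 1.2, 0.010, 1.3),  # fails discovery FDR
    Association("factor_d", 0.02, 1.4, 0.040, 0.9),  # direction flips
]
print(replicated(results))  # ['factor_a']
```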

[Figure: volcano plot of adjusted hazard ratio (x axis, 0.4 to 2.8) versus −log10(p-value) (y axis, 0 to 8). Annotations in the plot: (11), (69).]
Labeled exposures and behaviors: 1 Physical Activity; 2 Does anyone smoke in home?; 3 Cadmium; 4 Cadmium, urine; 5 Past smoker; 6 Current smoker; 7 trans-lycopene
Labeled sociodemographic factors: 1 age (10 year increment); 2 SES_1; 3 male; 4 SES_0; 5 black; 6 SES_2; 7 SES_3; 8 education_hs; 9 other_eth; 10 mexican; 11 occupation_blue_semi; 12 education_less_hs; 13 occupation_never; 14 occupation_blue_high; 15 occupation_white_semi; 16 other_hispanic
All-cause mortality:

253 exposure/behavior associations in survival
age, sex, income, education, race/ethnicity, occupation [in red]
FDR < 5%
sociodemographics
replicated factor
IJE, 2013

EWAS (re)-identifies factors associated with all-cause mortality:

Volcano plot of 200 associations
Sociodemographic factors (age, sex, income, education, race/ethnicity, occupation) [in red]: age (10 years); income (quintile 1); income (quintile 2); income (quintile 3); male; black
Exposures and behaviors: does anyone smoke in the home?; serum and urine cadmium [1 SD]; past smoker?; current smoker?; serum lycopene [1 SD]; physical activity [low, moderate, high activity]*
*derived from METs per activity and categorized by Health.gov guidelines
R2 ~ 2%

452 associations in Telomere Length:
Polychlorinated biphenyls associated with longer telomeres?!
Manrai, Kohane (in review)
[Figure: volcano plot of effect size (x axis, −0.2 to 0.2; shorter telomeres to the left, longer telomeres to the right) versus −log10(p-value) (y axis, 0 to 4). Labeled points: PCBs; Trunk Fat; Alk. Phos; CRP; Cadmium; Cadmium (urine); cigs per day; retinyl stearate; VO2 Max; pulse rate.]
FDR < 5%
R2 ~ 1%
adjusted by age, age², race, poverty, education, occupation

median N = 3000 (range 300-7000)

Identification of seven loci affecting mean telomere length and their association with disease
Nature Genetics, 2013. doi:10.1038/ng.2528. Received 26 June 2012; accepted 19 December 2012; published online 27 March 2013.

Abstract: Interindividual variation in mean leukocyte telomere length (LTL) is associated with cancer and several age-associated diseases. We report here a genome-wide meta-analysis of 37,684 individuals with replication of selected variants in an additional 10,739 individuals. We identified seven loci, including five new loci, associated with mean LTL (P < 5 × 10^-8). Five of the loci contain candidate genes (TERC, TERT, NAF1, OBFC1 and RTEL1) that are known to be involved in telomere biology. Lead SNPs at two loci (TERC and TERT) associate with several cancers and other diseases, including idiopathic pulmonary fibrosis. Moreover, a genetic risk score analysis combining lead variants at all 7 loci in 22,233 coronary artery disease cases and 64,762 controls showed an association of the alleles associated with shorter LTL with increased risk of coronary artery disease (21% (95% confidence interval, 5–35%) per standard deviation in LTL, P = 0.014). Our findings support a causal role of telomere-length variation in some age-related diseases.

Telomeres are the protein-bound DNA repeat structures at the ends of chromosomes that are important in maintaining genomic stability [1]. They are critical in regulating cellular replicative capacity [2]. During somatic-cell replication, telomere length progressively shortens because of the inability of DNA polymerase to fully replicate the 3′ end of the DNA strand. Once a critically short telomere length is reached, the cell is triggered to enter replicative senescence, which subsequently leads to cell death [1,2]. Conversely, in germ cells and other stem cells that require renewal, telomere length is maintained by the enzyme telomerase, a ribonucleoprotein that contains the RNA template TERC and a reverse transcriptase TERT [3]. Both longer and shorter telomere length are associated with increased risk of certain cancers [4,5], and reactivation of telomerase, which bypasses cellular senescence, is a common requirement for oncogenic progression [6]. Therefore, telomere length is an important determinant of telomere function.

Mean telomere length exhibits considerable interindividual variability and has high heritability with estimates varying between 44% and 80% (refs. 7–9). Most of these studies have measured mean telomere length in blood leukocytes. However, there is evidence that, within an individual, mean LTL and telomere length in other tissues are highly correlated [10,11]. In cross-sectional population studies, mean LTL is longer in women than in men and is inversely associated with age (declining by between 20–40 bp per year) [9,12–14]. Shorter age-adjusted and sex-adjusted mean LTL has been found to be associated with risk of several age-related diseases, including coronary artery disease (CAD) [12–15], and has been advanced as a marker of biological aging [16]. However, the extent to which the association of shorter LTL with age-related disorders is causal in nature remains unclear. Identifying genetic variants that affect telomere length and testing their association with disease could clarify any causal role.

So far, common variants at two loci on chromosome 3q26 (TERC) [17–19] and chromosome 10q24.33 (OBFC1) [18], which explain <1% of the variance in telomere length, have shown a replicated association with mean LTL in genome-wide association studies (GWAS). To identify other genetic determinants of LTL, we conducted a large-scale GWAS meta-analysis of 37,684 individuals from 15 cohorts, followed by replication of selected variants in an additional 10,739 individuals from 6 more cohorts.

Details of the studies included in the GWAS meta-analysis and in the replication phase are provided in the Supplementary Note, and key characteristics are summarized in Supplementary Table 1. All subjects were of European descent, the majority of the cohorts were population based and three of the replication cohorts were additional subjects from studies used in the meta-analysis. The genotyping platforms and the imputation method (to HapMap 2 build 36) used by each GWAS cohort are summarized in Supplementary Table 2. We measured mean LTL in each cohort using a quantitative PCR method and expressed it as a ratio of telomere repeat length to copy number of a single-copy gene (T/S ratio; Online Methods and Supplementary Note).

Then we analyzed LTL, adjusted for age, sex and any study-specific covariates, for association with genotype using linear regression in each study and adjusted the results for genomic inflation control factors (Supplementary Table 2). We performed an inverse variance–weighted meta-analysis for 2,362,330 SNPs (Online Methods) with correction for the overall genomic inflation control factor (λ = 1.007; quantile-quantile plot for the meta-analysis is shown in Supplementary Fig. 1).

SNPs in seven loci exhibited association with mean LTL at genome-wide significance (P < 5 × 10^-8; Figs. 1, 2, Table 1 and Supplementary Fig. 2). The association of the lead SNP on chromosome 2p16.2 (rs11125529) was very close to the threshold for genome-wide significance, and the lead SNP in a locus on 16q23.3 (rs2967374) fell just short of this threshold (Table 1). We therefore sought replication of results for these two loci. We confirmed the association of rs11125529 [...]
Does PCB exposure influence expression of 24 (29) genes
implicated in telomere length GWAS?
[...] but not of rs2967374 (Table 1). The combined P value from the GWAS meta-analyses and replication cohorts for rs11125529 was 7.50 × 10^-10. There was no evidence of sex-dependent effects or additional independent signals at any of these loci (Online Methods and Supplementary Tables 3, 4).

Details of key genes in each locus associated with LTL and their location in relation to the lead SNP are provided in Supplementary Table 5. The most significantly associated locus we found was the previously reported TERC locus on 3q26 (Figs. 1, 2 and Table 1) [17]. Four additional loci, 5p15.33 (TERT), 4q32.2 (NAF1, nuclear assembly factor 1), 10q24.33 (OBFC1, oligonucleotide/oligosaccharide-binding fold containing 1) [18] and 20q13.3 (RTEL1, regulator of telomere elongation helicase 1), harbor genes that encode proteins with known function in telomere biology [3,20–23]. NAF1 protein is required for assembly of H/ACA box small nucleolar RNA, the RNA family to which TERC belongs [20]. Thus, the three most significantly associated loci (3q26, 5p15.33 and 4q32.2) harbor genes involved in the formation and activity of telomerase. We therefore examined whether the lead SNPs at these loci as well as the other identified loci associate with leukocyte telomerase activity in available data from 208 individuals. We did not find an association of any of the variants with telomerase activity (Supplementary Table 6). However, the study only had 80% power (α of 0.05) to detect a SNP effect that explained 3.7% of the variance in telomerase activity, and therefore smaller effects are likely to have been missed in this exploratory analysis.

We also found a significant association (P = 6.90 × 10^-11) at the previously reported OBFC1 locus [18]. OBFC1 is a component of the telomere-binding CST complex that also contains CTC1 and TEN1 (ref. 21). In yeast, this complex binds to the single-stranded guanine overhang at the telomere and functions to promote telomere replication. RTEL1 is a DNA helicase that has been shown to have important roles in setting telomere length, telomere maintenance and DNA repair in mice [22,23]. However, it should be noted that the [...]
Figure 1: Signal-intensity plot of genotype association with telomere length. Data are displayed as −log10(P values) against chromosomal location (chromosomes 1 to 22) for the 2,362,330 SNPs that were tested. The dotted line represents a genome-wide level of significance at P = 5 × 10^-8. Loci that showed an association at this level are plotted in red. Labeled loci: ACYP2, NAF1, TERT, OBFC1, ZNF208, RTEL1, TERC.
[Regional association panels a to c, partially visible: lead SNPs rs10936599, rs2736100 and rs7675998; legend shows r² and recombination rate.]

Samples exposed to PCBs associated with difference in genes

Expression differences for 24 GWAS implicated genes
Queried the Gene Expression Omnibus for PCBs

Affymetrix human arrays (GPL570)

7 gene expression experiments on humans

52 exposed; 14 unexposed
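
The exposed-versus-unexposed comparison on this slide could be sketched as below. This is a minimal illustration under stated assumptions, not the analysis behind the slide: it assumes log-scale expression values, uses the difference in group means as the effect, and swaps in a label-permutation test for whatever test was actually used. The function names and the toy numbers are hypothetical.

```python
import math
import random
from statistics import mean


def mean_difference(exposed, unexposed):
    """Difference in mean (log-scale) expression, exposed minus unexposed."""
    return mean(exposed) - mean(unexposed)


def permutation_pvalue(exposed, unexposed, n_perm=2000, seed=0):
    """Two-sided p-value from shuffling the exposed/unexposed labels."""
    rng = random.Random(seed)
    observed = abs(mean_difference(exposed, unexposed))
    pooled = list(exposed) + list(unexposed)
    n_exposed = len(exposed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled = abs(mean(pooled[:n_exposed]) - mean(pooled[n_exposed:]))
        if shuffled >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


def volcano_point(exposed, unexposed, n_perm=2000):
    """Return (difference, -log10 p), the two axes of a volcano plot."""
    p = permutation_pvalue(exposed, unexposed, n_perm)
    return mean_difference(exposed, unexposed), -math.log10(p)


# Toy values for one hypothetical probe (not data from the slide).
exposed = [5.0, 5.1, 5.2, 4.9, 5.0, 5.05]
unexposed = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95]
print(volcano_point(exposed, unexposed))
```

With 24 genes tested, the resulting p-values would still need a multiplicity adjustment before any probe is called significant.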
Differential gene expression and a functional analysis of PCB-exposed children: Understanding disease and disorder development
Sisir K. Dutta (a), Partha S. Mitra (a), Somiranjan Ghosh (a), Shizhu Zang (a), Dean Sonneborn (b), Irva Hertz-Picciotto (b), Tomas Trnovec (c), Lubica Palkovicova (c), Eva Sovcikova (c), Svetlana Ghimbovschi (d), Eric P. Hoffman (d)
(a) Molecular Genetics Laboratory, Howard University, Washington, DC, USA
(b) Department of Public Health Sciences, University of California Davis, Davis, CA, USA
(c) Slovak Medical University, Bratislava, Slovak Republic
(d) Center for Genetic Medicine, Children's National Medical Center, Washington, DC, USA
Environment International 40 (2012) 143–154
Received 20 December 2010; accepted 10 July 2011
The goal of the present study is to understand the probable molecular mechanism of toxicities and the associated pathways related to observed pathophysiology in high PCB-exposed populations. We have performed a microarray-based differential gene expression analysis of children (mean age 46.1 months) of …

[Figure: volcano plot of log(difference) (x-axis, −0.50 to 0.75) against −log10(p-value) (y-axis, 0 to 2); labeled probes: 1555203_s_at (SLC44A4), 1555203_s_at (MYNN), 224206_x_at (MYNN)]


Interdependencies of the exposome:
Correlation globes paint a complex view of exposure
Red: positive ρ

Blue: negative ρ

thickness: |ρ|
permuted data to produce

“null ρ”

sought replication in > 1
cohort
Pac Symp Biocomput 2015

JECH 2015
for each pair of E:

Spearman ρ

(575 factors: 81,937 correlations)
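
The recipe on this slide (a Spearman ρ for each pair of exposures, with permuted data supplying a "null ρ") might be sketched as follows. This is a small standard-library illustration, not the pipeline behind the globes: the function names, the number of permutations, and the 0.05 cut-off are assumptions, and the replication step in a second cohort is left out.

```python
import random
from itertools import combinations


def average_ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def permutation_p(x, y, n_perm=1000, seed=0):
    """Share of shuffled ('null') rhos at least as extreme as the observed rho."""
    rng = random.Random(seed)
    observed = abs(spearman_rho(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman_rho(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def correlation_edges(exposures, n_perm=1000, alpha=0.05):
    """Return (a, b, rho, p) for each pair of exposures that beats the null."""
    edges = []
    for a, b in combinations(sorted(exposures), 2):
        rho = spearman_rho(exposures[a], exposures[b])
        p = permutation_p(exposures[a], exposures[b], n_perm)
        if p < alpha:
            edges.append((a, b, rho, p))
    return edges
```

In a globe, the sign of `rho` would set the edge color (red or blue) and `abs(rho)` its thickness.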

Effective number of
variables:

500 (10% decrease)

Telomere Length All-cause mortality
http://bit.ly/globebrowse
Telomeres vs. all-cause mortality

Browse these and 82 other phenotype-exposome globes!
http://www.chiragjpgroup.org/exposome_correlation

What nodes have the most correlations / have the most connections?

(“hubs of the network”)

(What factors are correlated with others the most?)
income...
AJE, 2015
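
The "hubs" question above reduces to counting edges per node. A minimal sketch, assuming an edge list of (factor, factor, ρ) tuples and borrowing the |ρ| > 0.2 display cut-off quoted in the JAMA figure caption; the function name and the toy edges are hypothetical and are not results from the talk.

```python
from collections import Counter


def hub_ranking(edges, min_abs_rho=0.2):
    """Rank factors by how many correlations exceed the cut-off."""
    degree = Counter()
    for a, b, rho in edges:
        if abs(rho) > min_abs_rho:
            degree[a] += 1
            degree[b] += 1
    return degree.most_common()


# Toy edge list for illustration only.
toy_edges = [
    ("income", "cotinine", -0.30),
    ("income", "cadmium", -0.25),
    ("income", "vitamin C", 0.35),
    ("cotinine", "cadmium", 0.40),
    ("income", "lead", -0.10),  # below the cut-off, so not counted
]
print(hub_ranking(toy_edges))
```

Node size in the correlation globes is described as proportional to this edge count.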

[Figure: volcano plot of overall income/poverty ratio effects; effect size per 1 SD of income/poverty ratio (x-axis, −0.3 to 0.0) against −log10(p-value) (y-axis, 10 to 30); validated results are labeled: Pulse rate; Eosinophils number; Lymphocyte number; Monocyte; Segmented neutrophils number; Blood 2,5-Dimethylfuran; Cadmium; Lead; Cotinine; C-reactive protein; Floor, GFAAS; Protoporphyrin; Glycohemoglobin; Glucose, plasma; g-tocopherol; Hepatitis A Antibody; Homocysteine; Herpes I; Herpes II; Red cell distribution width; Alkaline phosphatase; Globulin; Glucose, serum; Gamma glutamyl transferase; Triglycerides; Blood Benzene; Blood 1,4-Dichlorobenzene; Blood Ethylbenzene; Blood Styrene; Blood Toluene; Blood m-/p-Xylene; White blood cell count; Mono-benzyl phthalate; 3-fluorene; 2-fluorene; 3-phenanthrene; 2-phenanthrene; 1-pyrene; Cadmium, urine; Albumin, urine; Lead, urine]
Lower income associated with 43 of 330 (>13%) exposures
and biomarkers in the US population
Higher income: lower levels of biomarkers
AJE, 2015
(Another 23 were associated with higher levels; 66 of 330 = 20% in total)

Studying the Elusive Environment in Large Scale
It is possible that more than 50% of complex disease risk is attributed to differences in an individual’s environment.1 Air pollution, smoking, and diet are documented environmental factors affecting health, yet these factors are but a fraction of the “exposome,” the totality of the exposure load occurring throughout a person’s lifetime.1 Investigating one or a handful of exposures at a time has led to a highly fragmented literature of epidemiologic associations. Much of that literature is not reproducible, and selective reporting may be a major reason for the lack of reproducibility. A new model is required to discover environmental exposures associated with disease while mitigating possibilities of selective reporting.
To remedy the lack of reproducibility and concerns of validity, multiple personal exposures can be assessed simultaneously in terms of their association with a condition or disease of interest; the strongest associations can then be tentatively validated in independent data sets (eg, as done in references 2 and 3).2,3 The main advantages of this process include the ability to search the list of exposures and adjust for multiplicity systematically and report all the probed associations instead of only the most significant results. The term “environment-wide association studies” (EWAS) has been used to describe this approach (an analogy to genome-wide association studies). For example, Wang et al4 screened more than 2000 chemicals in serum to discover endogenous exposures associated with risk for cardiovascular disease.
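
The screening process described above (test every exposure, adjust for multiplicity, report all probed associations, then check the strongest in independent data) could be sketched as below. This is a hedged illustration rather than the authors' code: it assumes per-exposure p-values have already been computed in a discovery and a replication data set, uses the Benjamini-Hochberg procedure as one common multiplicity adjustment, and all names and thresholds are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def ewas_report(discovery_p, replication_p, fdr=0.05, replication_alpha=0.05):
    """Report every probed exposure, flagging discovery and replication."""
    names = sorted(discovery_p)
    qvalues = benjamini_hochberg([discovery_p[name] for name in names])
    report = []
    for name, q in zip(names, qvalues):
        discovered = q < fdr
        # An exposure missing from the replication set cannot replicate.
        replicated = discovered and replication_p.get(name, 1.0) < replication_alpha
        report.append({
            "exposure": name,
            "p": discovery_p[name],
            "q": q,
            "discovered": discovered,
            "replicated": replicated,
        })
    return report
```

The report keeps one row per exposure whether or not it passed, which is the point of the approach: the null results are published alongside the hits.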
There are notable hurdles in analyzing “big” environmental data. These same problems affect epidemiology of 1-risk-factor-at-a-time, but in EWAS their prevalence becomes more clearly manifest at large scale. When study- …
… the EWAS vantage point, intervening on β-carotene (Figure, D) seems a futile exercise given its complex relationship with other nutrients and pollutants.
Given this complexity, how can studies of environmental risk move forward? First, EWAS analyses should be applied to multiple data sets, and consistency can be formally examined for all assessed correlations. Second, the temporal relationship between exposure and changes in health parameters may offer helpful hints about which of the signals are more than simple correlations. Third, standardized adjusted analyses, in which adjustments are performed systematically and in the same way across multiple data sets, may also help. This is in stark contrast with the current model, whereby most epidemiologic studies use single data sets without replication as well as non–time-dependent assessments, and reported adjustments are markedly different across reports and data sets, even those performed by the same team (different approaches increase validity but must be reconciled and assimilated).
However, eventually for most environmental correlates, there may be unsurpassable difficulty establishing potential causal inferences based on observational data alone. Factors that seem protective may sometimes be tested in randomized trials. The complexity of the multiple correlations also highlights the challenge that intervening to modify 1 putative risk factor also may inadvertently affect multiple other correlated factors. Even when a seemingly simple intervention is tested in randomized trials (affecting a single risk factor among the many correlations), the intervention is not really simple. In essence what is tested are multiple perturbations of factors correlated with the one targeted for interven- …
VIEWPOINT
Chirag J. Patel, PhD: Center for Biomedical Informatics, Harvard Medical School, Boston, Massachusetts.
John P. A. Ioannidis, MD, DSc: Stanford Prevention Research Center, Department of Health Research and Policy, Department of Medicine, Stanford University School of Medicine, Stanford, California, Department of Statistics, Stanford Humanities and Sciences, Stanford, California, and Meta-Research Innovation Center at Stanford (METRICS), Stanford, California.
Opinion
JAMA, 2014

JECH, 2014

Proc Symp Biocomp, 2015
How can we study the elusive environment in larger scale for
biomedical discovery?
High-throughput ascertainment of endogenous indicators of environmental exposure that may reflect the exposome increasingly attract attention, and their performance needs to be carefully evaluated. These include chemical detection of indicators of exposure through metabolomics, proteomics, and biosensors.7 Eventually, patterns of …
US federally funded gene expression experiment data be d… ited in public repositories such as the Gene Expression Omnibu… repository has been instrumental in development of technolo… measurement of gene expression, data standardization, and … of data for discovery. Just as with the Gene Expression Omnib…
Figure. Correlation Interdependency Globes for 4 Environmental Exposures (Cotinine, Mercury, Cadmium, Trans-β-Carotene) in National Healt… Nutrition Examination Survey (NHANES) Participants, 2003-2004
A, Serum cotinine (37 total correlations); B, Serum total mercury (42 total correlations); C, Serum cadmium (68 total correlations); D, Serum trans-β-carotene (68 total correlations).
Legend: negative correlation, positive correlation; node groups: infectious agents, pollutants, nutrients and vitamins, demographic attributes.
Each correlation interdependency globe includes 317 environmental exposures represented by the nodes around the periphery of the globe. Pairwise correlations are depicted by edges (lines) between the node of interest (arrowhead) and other nodes. Correlations with absolute values exceeding 0.2 are shown (stronge… The size of each node is proportional to the number of edges for a node, and thickness of each edge indicates the magnitude of the correlation.
•bioinformatics to connect exposome with phenome
•new ‘omics technologies to measure the exposome
•dense correlations

•reverse causality
•confounding
•(longitudinal) publicly available data

http://grants.nih.gov/grants/guide/rfa-files/RFA-ES-15-010.html
NIH National Institute of Environmental Health Sciences: $34M in FY 2015:

new technologies for ascertaining the exposome in children
E Laboratory (×4)
E Data Center
•Data repository

•Analytic ecosystem

•Data standards
Exposome Laboratory Network

with Paul Avillach, Michael McDuffie, Jeremy Easton-Marks,

Cartik Saravanamuthu and the BD2K PIC-SURE team
40K participants

>1000 indicators of exposure

Data and API available now

http://nhanes.hms.harvard.edu
BD2K Patient-Centered Information Commons
NHANES exposome browser

Example of fragmentation:
Is everything we eat associated with cancer?
Schoenfeld and Ioannidis, AJCN (2012)
50 random ingredients from
Boston Cooking School
Cookbook
Any associated with cancer?
FIGURE 1. Effect estimates reported in the literature by malignancy type (top) or ingredient (bottom). Only ingredients with ≥10 studie… outliers are not shown (effect estimates >10).
Of 50 ingredients, 40 were studied in relation to cancer risk
Weak statistical evidence:

non-replicated

inconsistent effects

non-standardized

e modelling
oblem is akin to – but less well
sed and more poorly understood than –
e testing. For example, consider the use
r regression to adjust the risk levels of
atments to the same background level
There can be many covariates, and
t of covariates can be in or out of the
With ten covariates, there are over 1000
models. Consider a maze as a metaphor
elling (Figure 3). The red line traces the
path out of the maze. The path through
ze looks simple, once it is known.
ways in the literature for dealing with model
selection, so we propose a new, composite
2. Publication bias
is general recognition that a paper
much better chance of acceptance if
hing new is found. This means that, for
ation, the claim in the paper has to
sed on a p-value less than 0.05. From
g’s point of view5
, this is quality by
tion. The journals are placing heavy
ce on a statistical test rather than
nation of the methods and steps that
o a conclusion. As to having a p-value
han 0.05, some might be tempted to
the system10
through multiple testing,
ple modelling or unfair treatment of
or some combination of the three that
to a small p-value. Researchers can be
creative in devising a plausible story to
statistical finding.
2 The data cleaning team creates a
modelling data set and a holdout set and
P < 0.05
Figure 3. The path through a complex process can appear quite simple once the path is defined. Which terms are
included in a multiple linear regression model? Each turn in a maze is analogous to including or not a specific
term in the evolving linear model. By keeping an eye on the p-value on the term selected to be at issue, one
can work towards a suitably small p-value. © ktsdesign – Fotolia
A maze of associations is one way to a fragmented
literature and Vibration of Effects
Young, 2011
univariate
sex
sex & age
sex & race
sex & race & age
JCE, 2015

Distribution of associations and p-values due to model choice:
Estimating the Vibration of Effects (or Risk) (e.g., mortality)
Variable of Interest
e.g., 1 SD of log(serum Vitamin D)
Adjusting Variable Set
n=13
All-subsets Cox regression
2^13 + 1 = 8,193 models
SES [3rd tertile]
education [>HS]
race [white]
body mass index [normal]
total cholesterol
any heart disease
family heart disease
any hypertension
any diabetes
any cancer
current/past smoker [no smoking]
drink 5/day
physical activity
Data Source
NHANES 1999-2004
417 variables of interest
time to death
N≧1000 (≧100 deaths)
effect sizes
p-values
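
The all-subsets step in this schematic can be sketched as an enumeration over the 13 adjusting variables. This is a minimal illustration only: `fit_model` is a hypothetical callable standing in for the Cox regression, `toy_fit` returns made-up numbers, and the sketch enumerates the 2^13 = 8,192 subsets (the slide's count of 8,193 includes one further model that the slide does not spell out).

```python
from itertools import combinations

# The 13 adjusting variables listed on the slide (labels shortened).
ADJUSTERS = [
    "SES", "education", "race", "body mass index", "total cholesterol",
    "any heart disease", "family heart disease", "any hypertension",
    "any diabetes", "any cancer", "smoking", "drink 5/day",
    "physical activity",
]


def adjustment_sets(adjusters):
    """Yield every subset of the adjusting variables, smallest first."""
    for k in range(len(adjusters) + 1):
        yield from combinations(adjusters, k)


def vibration_of_effects(fit_model, adjusters=ADJUSTERS):
    """Fit one model per adjustment set and collect effect size and p-value."""
    results = []
    for subset in adjustment_sets(adjusters):
        hazard_ratio, pvalue = fit_model(subset)
        results.append({
            "k": len(subset),        # number of adjusting variables
            "adjusters": subset,
            "hr": hazard_ratio,
            "p": pvalue,
        })
    return results


def toy_fit(subset):
    """Stand-in for a Cox regression; the numbers are invented."""
    k = len(subset)
    return 0.70 + 0.005 * k, 10 ** -(6 - 0.2 * k)


results = vibration_of_effects(toy_fit)
print(len(results))
```

Plotting `hr` against `-log10(p)` for every entry of `results` gives the cloud of points from which the vibration of effects is read.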
[Figure: vibration-of-effects schematic for Vitamin D (1SD(log)); −log10(p-value) on the y-axis, with the 1st, 50th, and 99th percentiles marked; points labeled by number of adjusting variables k show the median p-value/HR for each k; annotations A to E; RHR = 1.14, RPvalue = 4.68]
JCE, 2015
[Figure: two vibration-of-effects panels, hazard ratio (x-axis) against −log10(p-value) (y-axis), each with 1st/50th/99th percentile markers and points labeled by number of adjusting variables (0 to 13). First panel: hazard ratios 0.64 to 0.76, RHR = 1.14, RP = 4.68. Second panel, Thyroxine (1SD(log)): hazard ratios 0.75 to 0.90, RHR = 1.15, RP = 2.90.]

The Vibration of Effects: examples for Vitamin D and
Thyroxine in association with mortality risk
JCE, 2015
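
The RHR and RP summaries quoted for these examples might be computed as below. The slides do not define them, so this sketch assumes (in line with the 1st/50th/99th percentile markers on the plots) that RHR is the ratio of the 99th to the 1st percentile hazard ratio and RP is the difference in −log10(p-value) between the same percentiles. The Janus flag is likewise an assumed, simple reading of "both risk and protection". All function names are hypothetical.

```python
import math


def percentile(values, q):
    """Percentile by linear interpolation, q between 0 and 100."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    low, high = math.floor(position), math.ceil(position)
    return ordered[low] + (ordered[high] - ordered[low]) * (position - low)


def vibration_summary(hazard_ratios, pvalues):
    """Summarize how much effect sizes and p-values move across models."""
    hr_low = percentile(hazard_ratios, 1)
    hr_high = percentile(hazard_ratios, 99)
    neg_log_p = [-math.log10(p) for p in pvalues]
    return {
        "RHR": hr_high / hr_low,
        "RP": percentile(neg_log_p, 99) - percentile(neg_log_p, 1),
        # Assumed flag: the spread of hazard ratios straddles 1.
        "janus": hr_low < 1 < hr_high,
    }
```

Under these definitions an RHR of 1.14 would mean the 99th-percentile hazard ratio is 14% larger than the 1st-percentile one across the fitted models.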

[Figure: vibration of effects for Cadmium (1SD(log)); hazard ratio (x-axis, 1.3 to 1.6) against −log10(p-value) (y-axis), with 1st/50th/99th percentile markers and points labeled by number of adjusting variables (0 to 13); one panel highlights models with adjustment=current_past_smoking; RHR = 1.29, RP = 8.29]
The Vibration of Effects: shifts in the effect size distribution
due to select adjustments (e.g., adjusting cadmium levels with
smoking status)
JCE, 2015

JCE, 2015
Janus (two-faced) risk profile
Risk and significance depend on modeling scenario!
The Vibration of Effects: beware of the Janus effect

(both risk and protection?!)
“risk”“protection”
“significant”
Britannica.com

Our modeling scenarios can lead to a fragmented literature;
however, we can assess the distribution of effects with VoE
JCE, 2015
http://bit.ly/effectvibration

Can exposure enable re-classification of phenotypes?

P
We are many phenotypes simultaneously:

Can we better categorize these P?
Body Measures

Body Mass Index

Height
Blood pressure & fitness

Systolic BP

Diastolic BP

Pulse rate

VO2 Max
Metabolic

Glucose

LDL-Cholesterol

Triglycerides
Inflammation

C-reactive protein

white blood cell count
Kidney function

Creatinine

Sodium

Uric Acid
Liver function

Aspartate aminotransferase

Gamma glutamyltransferase
Aging

Telomere length

EWAS-derived phenotype-exposure association map:
A 2-D view of phenotype-exposure associations for re-classification
PCB170
Glucose
BMI
Height
Cholesterol
β-carotene
folate
http://bit.ly.com/pemap

Creation of a phenotype-exposure association map:
A 2-D view of 83 phenotype by 252 exposure associations
> 0
< 0
Association Size:
Clusters of exposures associated with clusters of phenotypes?
252 biomarkers of exposure × 83 clinical trait phenotypes

NHANES 1999-2000, 2001-2002, 2005-2006

~21K regressions: significant associations (FDR < 5%) replicated in 2003-2004

adjusted by age, age², sex, race, income, chronic disease

Hugues Aschard, JP Ioannidis
83 phenotypes
252 exposures

[Heatmap: exposures along one axis and phenotypes along the other; color key and histogram of association sizes from −0.4 (negative) to 0.4 (positive). Exposure labels run through dietary nutrients (Alpha-carotene to Zinc, salt use), serum nutrients and vitamins (a-Carotene to a-Tocopherol), phytoestrogens (Daidzein to Genistein), fitness and physical activity, smoking and cotinine, alcohol use, hydrocarbons (fluorenes, phenanthrenes, pyrene), phthalates, heavy metals in blood and urine, volatile compounds, dioxins, organochlorine pesticides, chlorophenols, organophosphate metabolites, PCBs, polyfluorochemicals, furans, and infectious agents (Measles, Toxoplasma, Hepatitis A and B, Herpes II). Phenotype labels run through kidney and electrolyte measures, liver enzymes and proteins, PSA, blood and iron indices, immunological cell counts, blood pressure and pulse, lipids, glucose and insulin measures, bone mineral density, and body measures.]

EWAS-derived phenotype-exposure association map:
A 2-D view of connections between P and E

[Same heatmap, annotated with clusters: nutrients; BMI, weight, BMD; metabolic; renal function; PCBs; metabolic; blood parameters; hydrocarbons]
EWAS-derived phenotype-exposure association map:
A 2-D view of connections between P and E

Toward a phenotype-exposure association map:
(Re)-categorizing phenotypes with E
7 6 5 4 3 2 1 0
Distance
liver:Albumin
kidney:Bicarbonate
immunological:Basophils percent
immunological:Lymphocyte percent
immunological:Eosinophils percent
kidney:Phosphorus
liver:Total protein
liver:Aspartate aminotransferase AST
liver:Alanine aminotransferase ALT
body measures:Head Circumference
body measures:Recumbent Length
liver:Lactate dehydrogenase LDH
cancer:Prostate specific antigen ratio
cancer:PSA, free
blood:Transferrin saturation
liver:Total bilirubin
heart:Direct HDL-Cholesterol
immunological:Monocyte percent
bone:Head BMD
body measures:Standing Height
body measures:Upper Leg Length
bone:Total BMD
bone:Lumbar Spine BMD
bone:Lumbar Pelvis BMD
heart:Triglycerides
heart:LDL-cholesterol
heart:Total Cholesterol
blood:MCHC
blood:TIBC, Frozen Serum
blood:Hematocrit
blood:Hemoglobin
kidney:Potassium
blood:Mean cell hemoglobin
blood:Mean cell volume
kidney:Uric acid
kidney:Blood urea nitrogen
kidney:Total calcium
kidney:Creatinine
blood:Ferritin
blood:Red blood cell count
body measures:Weight
blood:Segmented neutrophils percent
body measures:Total Lean excl BMC
body measures:Trunk Lean excl BMC
body measures:Body Mass Index
body measures:Waist Circumference
body measures:Triceps Skinfold
body measures:Maximal Calf Circumference
body measures:Thigh Circumference
liver:Gamma glutamyl transferase
blood pressure:60 sec. pulse:
metabolic:Insulin
body measures:Total Fat
body measures:Trunk Fat
body measures:Subscapular Skinfold
blood pressure:mean systolic
immunological:C-reactive protein
liver:Globulin
immunological:Monocyte number
immunological:Segmented neutrophils number
immunological:Lymphocyte number
immunological:White blood cell count
immunological:Basophils number
immunological:Eosinophils number
blood:Mean platelet volume
heart:Homocysteine
nutrition:Methylmalonic acid
kidney:Osmolality
kidney:Chloride
kidney:Sodium
kidney:Albumin, urine
blood pressure:60 sec HR
cancer:PSA. total
blood:Platelet count SI
blood:Protoporphyrin
blood:Red cell distribution width
bone:Bone alkaline phosphatase
liver:Alkaline phosphatase
blood pressure:mean diastolic
metabolic:C-peptide: SI
metabolic:Glycohemoglobin
metabolic:Glucose, plasma
metabolic:Glucose, serum
inflammation
adiposity
kidney function
metabolic traits
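
A dendrogram like the one listed above can come from clustering phenotypes on their exposure-association profiles. The sketch below is one way to do that, under assumptions the slides do not confirm: each phenotype is a vector of association sizes (one per exposure), distance is Euclidean, and clusters are merged by average linkage. The function names and the toy profiles are hypothetical.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two association profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def agglomerate(profiles):
    """Average-linkage clustering; returns merges as (left, right, distance)."""
    clusters = {name: [name] for name in profiles}
    merges = []

    def linkage(a, b):
        # Mean distance over all cross-cluster pairs of phenotypes.
        pairs = [(x, y) for x in clusters[a] for y in clusters[b]]
        return sum(euclidean(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(sorted(clusters), 2), key=lambda ab: linkage(*ab))
        merges.append((a, b, linkage(a, b)))
        clusters[a + "+" + b] = clusters.pop(a) + clusters.pop(b)
    return merges


# Toy profiles over two exposures (invented numbers).
toy_profiles = {
    "LDL-cholesterol": [0.30, 0.10],
    "Total Cholesterol": [0.31, 0.10],
    "Standing Height": [-0.20, 0.40],
}
for left, right, distance in agglomerate(toy_profiles):
    print(left, "|", right, "|", round(distance, 3))
```

Phenotypes that merge at a small distance share similar exposure associations, which is the sense in which exposures could re-categorize phenotypes.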

[Same dendrogram (distance axis 7 to 0), highlighting the “bad” cholesterol and “good” cholesterol clusters]

[Same dendrogram (distance axis 7 to 0), highlighting the height + BMD cluster]

Triglycerides
Total Cholesterol
LDL-cholesterol
Trunk Fat
Albumin, urine
Insulin
Total Fat
Head Circumference
Blood urea nitrogen
Albumin
Homocysteine
C-peptide: SI
C-reactive protein
Body Mass Index
Ferritin
Thigh Circumference
Total calcium
Total bilirubin
Mean cell volume
Uric acid
Protoporphyrin
Hemoglobin
Total protein
Waist Circumference
Hematocrit
Weight
Standing Height
1/Creatinine
Creatinine
Trunk Lean excl BMC
Methylmalonic acid
Triceps Skinfold
Lymphocyte number
Total Lean excl BMC
TIBC, Frozen Serum
Phosphorus
Lumber Pelvis BMD
Glycohemoglobin
Globulin
Chloride
Bicarbonate
60 sec. pulse:
Upper Leg Length
Total BMD
Potassium
Glucose, serum
Glucose, plasma
Lumbar Spine BMD
Platelet count SI
MCHC
Osmolality
Monocyte number
mean systolic
Lymphocyte percent
Recumbent Length
Eosinophils number
Monocyte percent
Head BMD
mean diastolic
60 sec HR
Basophils number
Sodium
PSA, free
Eosinophils percent
PSA, total
Basophils percent
[Figure: bar chart of R^2 × 100 (axis: 0 to 40) for each phenotype listed above]
1 to 66 exposures identified for 81 phenotypes

Additive effect of E factors:
Describe < 20% of variability in P (on average: 8%)
$\sigma^2_E$ ?
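As a rough illustration of what "R^2 × 100" for the additive effect of several E factors means, the following is a minimal sketch that fits an additive linear model by ordinary least squares and reports the share of phenotypic variance it describes. The data are synthetic, and the effect sizes, sample size, and function names (`solve`, `r_squared`) are assumptions made for this example; this is not the analysis pipeline behind the slide.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def r_squared(exposures, phenotype):
    """R^2 of an additive linear model: phenotype ~ intercept + sum of exposures."""
    rows = [[1.0] + list(e) for e in exposures]  # design matrix with intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, phenotype)) for i in range(k)]
    beta = solve(xtx, xty)  # least-squares coefficients from the normal equations
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(phenotype) / len(phenotype)
    ss_res = sum((y - f) ** 2 for y, f in zip(phenotype, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in phenotype)
    return 1.0 - ss_res / ss_tot

# Synthetic data (an assumption for illustration): three exposures with small
# additive effects on a phenotype that is mostly noise.
rng = random.Random(0)
exposures = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(2000)]
phenotype = [0.2 * e[0] + 0.15 * e[1] + 0.1 * e[2] + rng.gauss(0, 1) for e in exposures]
r2 = r_squared(exposures, phenotype)
print(f"R^2 * 100 = {100 * r2:.1f}")
```

With small per-exposure effects like these, the additive model describes only a modest fraction of the variance, which is the pattern the slide summarizes.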

Emerging technologies to ascertain the exposome will enable biomedical discovery

High-throughput E standards: mitigate the fragmented literature of associations
Confounding, reverse causality: how to handle in high dimensions?
e.g., EWASs in T2D, telomere length, and mortality
Facilitate G and E interaction investigations and more precise definitions of P
Possible to use high-throughput data modalities to discover the role of E (and G) in P.
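To make the high-throughput idea concrete, here is a minimal sketch of an EWAS-style scan: each exposure is tested against the phenotype one at a time, the result is expressed as −log10(p-value), and the p-values are corrected for multiple testing with the Benjamini-Hochberg procedure. Everything here is an assumption for illustration: the data are synthetic, the p-value uses a large-sample normal approximation, and no covariate adjustment is made, so the sketch does not address the confounding and reverse-causality issues raised above.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def association_pvalue(x, y):
    """Two-sided p-value for one exposure (large-sample normal approximation)."""
    n = len(x)
    r = pearson_r(x, y)
    z = r * math.sqrt((n - 2) / (1.0 - r * r))
    return math.erfc(abs(z) / math.sqrt(2.0))

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q

# Synthetic data (an assumption for illustration): 50 exposures measured in
# 1000 people; only exposure 0 truly influences the phenotype.
rng = random.Random(1)
n_people, n_exposures = 1000, 50
exposures = [[rng.gauss(0, 1) for _ in range(n_people)] for _ in range(n_exposures)]
phenotype = [0.3 * exposures[0][i] + rng.gauss(0, 1) for i in range(n_people)]

pvalues = [association_pvalue(x, phenotype) for x in exposures]
qvalues = benjamini_hochberg(pvalues)

# Report the three strongest signals, as they would appear at the top of a Manhattan plot.
for j in sorted(range(n_exposures), key=lambda k: pvalues[k])[:3]:
    print(f"exposure_{j}: -log10(p) = {-math.log10(pvalues[j]):.1f}, FDR q = {qvalues[j]:.3g}")
```

Testing every exposure with the same model and the same correction is what makes the results comparable across exposures, in contrast to a literature assembled one association at a time.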
[Figure: EWAS Manhattan plot. y-axis: −log10(p-value), 0 to 2. x-axis, exposure categories: acrylamide; allergen test; bacterial infection; cotinine; dialkyl; dioxins; furans (dibenzofuran); heavy metals; hydrocarbons; latex; nutrients (carotenoids, minerals, vitamins A, B, C, D, E); PCBs; perchlorate; pesticides (atrazine, chlorophenol, organochlorine, organophosphate, pyrethroid); phenols; phthalates; phytoestrogens; polybrominated ethers; polyfluorochemicals; viral infection; volatile compounds]
[Figure: exposure correlations. Panels: A, Serum cotinine; B, Serum total mercury. Panel annotations: 37, 42, 68, and 68 total correlations. Legend: infectious agents; pollutants; nutrients and vitamins; demographic attributes]
P = G + E

Thanks...

Harvard HMS: Isaac Kohane, Susanne Churchill, Stan Shaw, Nathan Palmer, Jenn Grandfield, Sunny Alvear, Michal Preminger
Harvard Chan: Hugues Aschard, Francesca Dominici
Stanford: John Ioannidis, Atul Butte (UCSF)
U Queensland: Jian Yang, Peter Visscher
Cochrane: Belinda Burford
CDC/NCHS: Ajay Yesupriya
Imperial: Ioanna Tzoulaki, Paul Elliott
Lund (Sweden): Jan Sundquist, Kristina Sundquist
Chirag Lakhani, Adam Brown, Nam Pho, Danielle Rasooly, Arjun Manrai
Stefano Monti, David Scherr
NIH Common Fund
Big Data to Knowledge

Chirag J Patel
chirag@hms.harvard.edu
@chiragjp
www.chiragjpgroup.org
