Functional Genomics Journal Club presentation on the following publication:
Sundaram, L., Gao, H., Padigepati, S. R., McRae, J. F., Li, Y., Kosmicki, J. A., … Farh, K. K.-H. (2018). Predicting the clinical impact of human mutation with deep neural networks. Nature Genetics, 50(8), 1161–1170. https://doi.org/10.1038/s41588-018-0167-z
1. PREDICTING THE CLINICAL IMPACT OF HUMAN MUTATION WITH DEEP NEURAL NETWORKS
JOURNAL CLUB PRESENTATION BY:
BRIAN M. SCHILDER, BIOINFORMATICIAN
RAJ LAB, DEPARTMENT OF NEUROSCIENCE
ICAHN SCHOOL OF MEDICINE AT MOUNT SINAI
01/15/2019
Sundaram et al. 2018
2. AUTHORS
• 1. Illumina Artificial Intelligence Laboratory, Illumina Inc, San Diego, CA, USA.
• 2. Department of Computer Science, Stanford University, Stanford, CA, USA.
• 3. National Science Foundation Center for Big Learning, University of Florida, Gainesville, FL, USA.
• 4. Analytic and Translational Genetics Unit (ATGU), Department of Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA.
• 5. Toyota Technological Institute at Chicago, Chicago, IL, USA.
Illumina-ti???
4. CHALLENGES IN GENETICS
• Difficult to predict the effect of rare variants
• Effects are often subtle and/or undocumented
• Poses a major obstacle in both population-wide and personalized medicine
• Limited by the lack of an adequate dataset
• Too small
• Limited diversity
• Doesn’t span the whole genome
• Literature & expert curator biases
• ~50% of ClinVar variants come from 200 genes
• Supervised ML can learn these biases
Assimes & Roberts (2016)
5. HUMAN DIVERSITY BOTTLENECK
• Relatively little evolutionary time and extremely long generation times have left the human population with very little genetic diversity
• Chimpanzees and other non-human primates have far more genetic diversity than humans (2X as many SNVs) despite a much smaller population size today
• Sumatra, ~75 KYA
• Decade-long winter (all the way to Vermont), 1000 years of lower temps
• Human population bottleneck attributed to this event (only 2-10k individuals)
• Though see Yost et al. (2018)
[Figures: Toba supervolcano cloud radius; bottleneck event (a large fraction of human ancestral diversity was lost). Image sources: https://www.quora.com/What-would-happen-if-the-Supervolcano-Toba-erupted-tomorrow, http://wallace.genetics.uga.edu/, https://en.wikipedia.org/wiki/Population_bottleneck]
6. HYPOTHESIS
• Leveraging the genomic diversity of closely related non-human primates (NHPs) can enhance prediction accuracy of pathogenicity in human variants
8. OBJECTIVE
Assess whether the frequency of NHP variants can serve as a reasonable proxy for pathogenicity of those equivalent variants in humans.
Singletons: The rarest of rare variants, variants seen exactly one time.
Balancing selection: A class of selective regimes that maintain polymorphism above what is expected under neutrality.
Identical-by-state: Two different alleles derived from different evolutionary histories with the same effect (i.e. resulting in the same amino acid change)
9. METHODS
• Databases:
• Humans (123,000+ individuals, 85K variants)
• Exome Aggregation Consortium (ExAC)
• Genome Aggregation Database (gnomAD)
• ClinVar
• NHPs (134 individuals, 300K variants)
• Great Ape Genome Project
• Single Nucleotide Polymorphism Database (dbSNP)
• Included only orthologous variants (identical-by-state)
• Each primate species contributes more variants than all of ClinVar (~42K)
• Key Comparisons:
1. Human vs. chimpanzee (~6 MY divergence)
2. Human vs. 6 primate species (≤ ~35 MY divergence)
3. Human vs. mouse, pig, goat, cow, chicken & zebrafish (≤ ~450 MY divergence)
Six Non-human Primate Species
• Great apes: Pan troglodytes (n=24+35), Pan paniscus (n=13), Gorilla gorilla (n=27), Pongo abelii (n=10)
• Old World monkey: Macaca mulatta (n=16)
• New World monkey: Callithrix jacchus (n=9)
Image sources: arkive.org, oregonzoo.org, minizoo.cz, nationalgeographic.com
10. RESULTS
• 27% of missense mutations that are benign in distant species are actually deleterious in humans
• This figure is only 9% if you use NHPs
• NHPs offer the benefit of a more diverse sample while still being very relevant to humans
[Figs. 1 and 2: a small but diverse NHP sample = large benefits]
MSR: missense/synonymous ratio
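The MSR is just a ratio of two variant counts, so it is easy to sketch. The counts and bin names below are made up for illustration and are not figures from Sundaram et al.; the function name is likewise hypothetical.

```python
def missense_synonymous_ratio(n_missense: int, n_synonymous: int) -> float:
    """Return the missense/synonymous ratio (MSR) for one set of variants."""
    if n_synonymous <= 0:
        raise ValueError("need at least one synonymous variant")
    return n_missense / n_synonymous


# Hypothetical (missense, synonymous) counts per allele-frequency bin.
counts = {
    "singleton": (2200, 1000),
    "rare": (1500, 1000),
    "common": (900, 1000),
}

msr_by_bin = {
    name: missense_synonymous_ratio(missense, synonymous)
    for name, (missense, synonymous) in counts.items()
}
```

A lower MSR among common variants than among rare ones is usually read as a sign that purifying selection has removed deleterious missense changes before they could reach high frequency.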
11. PART II: PREDICT
A deep learning network for variant pathogenicity classification
12. OBJECTIVE
Create a more accurate predictive model of human variant pathogenicity using NHP and human variants + deep learning
13. METHODS
• Training Dataset of Common/Benign Variants:
• 300K NHP variants
• 84K human variants
• PrimateAI
• A novel deep learning-based predictive model
• 1D Convolutional Neural Network (CNN)
• Automatic feature extraction
• Prediction function:
• How likely is a mutation to be a common/benign vs. rare/pathogenic variant?
• Input: multi-alignment (51 AAs x 99 vertebrates), secondary structure, solvent accessibility
• Hidden layers: hierarchical features
• Output: pathogenicity score (0-1)
Separate CNN models to predict:
• Secondary structure (SPIDER2; Heffernan et al., 2016): helix, beta sheet, or coil
• Solvent accessibility (DeepCNF; Wang et al., 2016): buried, intermediate, or exposed
[Figure panels: SPIDER2, DeepCNF]
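To make "1D CNN" concrete: a filter slides along the amino-acid axis and combines the per-position features it covers. The sketch below is a generic single-filter 1D convolution with ReLU in plain Python. It is not the PrimateAI architecture, and the toy dimensions, weights, and function name are all assumptions.

```python
from typing import List

Matrix = List[List[float]]  # shape: positions x channels


def conv1d_relu(x: Matrix, kernel: Matrix, bias: float = 0.0) -> List[float]:
    """Slide one filter (width x channels) along the sequence axis, then apply ReLU.

    Uses 'valid' padding, so the output has len(x) - width + 1 positions.
    """
    width = len(kernel)
    n_channels = len(kernel[0])
    if any(len(row) != n_channels for row in x):
        raise ValueError("every position must have as many channels as the kernel")
    out = []
    for start in range(len(x) - width + 1):
        total = bias
        for k in range(width):
            for c in range(n_channels):
                total += x[start + k][c] * kernel[k][c]
        out.append(max(0.0, total))  # ReLU non-linearity
    return out


# Toy input: 5 positions x 2 channels (hypothetical feature values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
# One filter of width 3 (hypothetical weights).
kernel = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
features = conv1d_relu(x, kernel)
```

In a real network many such filters are learned from data and stacked in layers, which is where the "hierarchical features" and "automatic feature extraction" on the slide come from.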
14. RESULTS
Figure panels:
• PrimateAI example output along all SCN2A AA positions
• PrimateAI outperforms existing tools on 10K withheld common primate variants
• PrimateAI can distinguish between DDD cases vs. sibling controls
• PrimateAI outperforms existing tools on DDD cases vs. sibling controls
DDD: Deciphering Developmental Disorders cohort with 4,293 cases and 2,517 sibling controls.
• PrimateAI had a 91% accuracy score (next best model: only 80%)
15. RESULTS II
• Needed to demonstrate that PrimateAI was not simply scoring based on genes with higher rates of de novo mutation
• Repeated with only de novo missense variants within 605 disease genes
• However, a greater diversity of disease genes is needed before generalizing to all Mendelian disorders
[Figure: PrimateAI AUC almost at max!]
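For readers less familiar with AUC: it can be read as the probability that a randomly chosen case variant receives a higher score than a randomly chosen control variant. Below is a minimal pairwise sketch with hypothetical scores. The function name and data are assumptions, not taken from the paper.

```python
from typing import Sequence


def roc_auc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """Probability that a random case outscores a random control (ties count 0.5)."""
    if not case_scores or not control_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical pathogenicity scores in [0, 1].
case_scores = [0.9, 0.8, 0.6, 0.4]
control_scores = [0.7, 0.3, 0.2, 0.1]
auc = roc_auc(case_scores, control_scores)
```

An AUC of 0.5 means the scores carry no information about case status, and 1.0 means every case outscores every control.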
16. RESULTS III
• Actually compared against 20 other tools (Supp. Fig. 9)
• PrimateAI outperformed them all in all test sets
18. NOVEL CANDIDATE GENE DISCOVERY
• Increases enrichment of de novo missense mutations in DDD patients from 1.5-fold to 2.2-fold
• Identified 14 additional candidate genes in intellectual disability
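The slide does not spell out how enrichment is computed. One common reading, assumed here, is observed de novo missense counts divided by the count expected from a background mutation model. The counts below are hypothetical and chosen only to reproduce the 1.5-fold and 2.2-fold ratios quoted above.

```python
def fold_enrichment(observed: int, expected: float) -> float:
    """Observed count divided by the count expected under a background model."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return observed / expected


# Hypothetical counts: all de novo missense variants in cases.
unfiltered = fold_enrichment(observed=3000, expected=2000.0)

# Hypothetical counts: only variants a classifier scores as likely pathogenic.
# Both observed and expected shrink, but expected shrinks faster.
filtered = fold_enrichment(observed=1100, expected=500.0)
```

On this reading, filtering on a good pathogenicity score removes proportionally more background variants than disease-causing ones, which is what raises the enrichment and gives more power for gene discovery.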
20. COMPARE WITH HUMAN EXPERT CURATION
• Curators tend to:
• Overly rely on straightforward metrics like Grantham score
• Underutilize secondary structure and solvent accessibility
Table 2. Comparison of the difference in Grantham score, protein surface-exposure, and amino acid sequence conservation between human expert annotated variants in ClinVar and de novo variants in DDD cases versus controls.
21. CONCLUSIONS
• Adding even a few primate species disproportionately improves pathogenicity prediction of human variants
• “134 individuals from six non-human primate species examined in this study contribute nearly four times as many common missense variants as the 123,136 humans from the ExAC study”
• Training PrimateAI on more distant species decreases performance
22. FUTURE DIRECTIONS
• More NHP species, more samples (high return for even a few)
• 27 NHP species now on UCSC Genome Browser
• Non-coding variants in conserved regions
23. REFERENCES
1. Sundaram, L., Gao, H., Padigepati, S. R., McRae, J. F., Li, Y., Kosmicki, J. A., … Farh, K. K.-H. (2018). Predicting the clinical impact of human mutation with deep neural networks. Nature Genetics, 50(8), 1161–1170. https://doi.org/10.1038/s41588-018-0167-z
2. https://www.smithsonianmag.com/smart-news/ancient-humans-weathered-toba-supervolcano-just-fine-180968479/
3. Yost, C. L., Jackson, L. J., Stone, J. R., & Cohen, A. S. (2018). Subdecadal phytolith and charcoal records from Lake Malawi, East Africa imply minimal effects on human evolution from the ∼74 ka Toba supereruption. Journal of Human Evolution, 116, 75–94. https://doi.org/10.1016/j.jhevol.2017.11.005
4. Assimes, T. L., & Roberts, R. (2016). Genetics: Implications for Prevention and Management of Coronary Artery Disease. Journal of the American College of Cardiology, 68(25), 2797–2818. https://doi.org/10.1016/j.jacc.2016.10.039
5. Landrum, M. J., Lee, J. M., Benson, M., Brown, G., Chao, C., Chitipiralla, S., … Maglott, D. R. (2016). ClinVar: Public archive of interpretations of clinically relevant variants. Nucleic Acids Research, 44(D1), D862–D868. https://doi.org/10.1093/nar/gkv1222
Editor's notes
Common chimp variants defined as occurring ≥2 times in a cohort of 24
99.8% of variants have been under purifying selection
Excluded major histocompatibility complex (MHC) regions
Accounted for other factors, including mutation rate, technical artifacts such as sequencing coverage, and factors impacting neutral genetic drift such as gene conversion.
As the number of unlabeled variants greatly exceeds the size of the labeled benign training dataset, we trained eight networks in parallel, each using a different set of unlabeled variants matched to the benign training dataset, to obtain a consensus prediction.
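The note says only that eight networks yield a "consensus prediction". A simple way to picture this, assumed here, is averaging the eight scores. The stub models below stand in for trained networks and are purely hypothetical.

```python
from statistics import mean
from typing import Callable, Sequence

Model = Callable[[Sequence[float]], float]


def make_stub_model(offset: float) -> Model:
    """Hypothetical stand-in for one trained network; returns a score in [0, 1]."""
    def model(features: Sequence[float]) -> float:
        raw = sum(features) / len(features) + offset
        return min(1.0, max(0.0, raw))
    return model


def consensus_score(models: Sequence[Model], features: Sequence[float]) -> float:
    """Average the scores of all ensemble members (assumed consensus rule)."""
    if not models:
        raise ValueError("need at least one model")
    return mean(m(features) for m in models)


# Eight stubs, mimicking eight networks trained on different unlabeled subsets.
offsets = (-0.04, -0.03, -0.02, -0.01, 0.01, 0.02, 0.03, 0.04)
models = [make_stub_model(offset) for offset in offsets]
score = consensus_score(models, [0.5, 0.5, 0.5])
```

Averaging reduces the influence of whichever unlabeled subset a single network happened to see during training.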
Three 51-length position frequency matrices are generated from multiple sequence alignments of 99 vertebrates, including one for 11 primates, one for 50 mammals excluding primates, and one for 38 vertebrates excluding primates and mammals.
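A position frequency matrix can be sketched as the per-column amino-acid frequencies of an alignment. The toy alignment below has 4 sequences and 5 columns rather than the 51 columns and three species groups described above. The sequences, the gap handling, and the function name are assumptions.

```python
from collections import Counter
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_frequency_matrix(aligned: List[str]) -> List[Dict[str, float]]:
    """Return one dict of amino-acid frequencies per alignment column.

    Gaps ('-') and any other non-standard symbols are ignored.
    """
    if not aligned:
        raise ValueError("alignment is empty")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    pfm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned if seq[col] in AMINO_ACIDS)
        total = sum(counts.values())
        pfm.append({aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS})
    return pfm


# Hypothetical toy alignment for one species group.
primates = ["MKTAY", "MKTAY", "MRTAY", "MK-AY"]
pfm = position_frequency_matrix(primates)
```

Building one such matrix per species group (primates, other mammals, other vertebrates) would give three parallel inputs of the same length, matching the layout described in the note.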
SPIDER2: Structural Property prediction with Integrated DEep neuRal network
DeepCNF: Deep Learning extension of Conditional Neural Fields (CNF)
Given that the DDD population largely consists of index cases of affected children without affected first degree relatives, it is essential to show that the classifier has not inflated its accuracy by favoring pathogenicity in genes with de novo dominant modes of inheritance. We restricted the analysis to 605 genes that were nominally significant for disease association in the DDD study, calculated from protein-truncating variation only
Simulations with ExAC show that discovery of common human variants (>0.1% allele frequency) plateaus quickly after only a few hundred individuals
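A back-of-the-envelope model shows why discovery plateaus. If the two alleles of each of n individuals are sampled independently, a variant at allele frequency f is seen at least once with probability 1 - (1 - f)^(2n). This binomial sketch is an assumption for illustration, not the ExAC simulation the authors ran.

```python
def detection_probability(allele_freq: float, n_individuals: int) -> float:
    """Chance of seeing a variant at least once among 2 * n sampled alleles."""
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele frequency must be in [0, 1]")
    if n_individuals < 0:
        raise ValueError("number of individuals must be non-negative")
    return 1.0 - (1.0 - allele_freq) ** (2 * n_individuals)


# Detection curve for a hypothetical variant at 1% allele frequency.
cohort_sizes = (10, 50, 100, 200, 500, 1000)
curve = {n: detection_probability(0.01, n) for n in cohort_sizes}
```

Under this model a 1% variant is almost certain to be found within a few hundred individuals, so adding more humans mostly uncovers rare variants. That is the argument for sampling additional species instead.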