A description of the gnomAD resource, the loss-of-function variants discovered in it, their applications to drug target discovery, and a case study in LRRK2.
Slides from Daniel MacArthur, myself, Eric Minikel, and Nicky Whiffin, with thanks to the countless others involved in generating and analyzing the resource.
2. The power of seven billion people
Given known mutation rates, it is almost certain that
every possible single base change compatible with
life exists in a living human
3. Opportunities and challenges of
genetic data aggregation
over three million exomes
and genomes sequenced
Challenges:
• difficulty of moving data
• inadequate consent and
data use permissions
• objections to data
sharing
• inconsistent processing
and variant calling
4. A history of genome data aggregation at
Broad
Exome Aggregation
Consortium (ExAC)
• 60,706 exomes
• began in 2012, first release Oct 2014
• preprint in Oct 2015
• published in May 2016
Genome Aggregation
Database (gnomAD)
• 125,748 exomes and 15,708 genomes
• began in 2016, first release Oct 2016
• final publication data release Oct 2018
• 7 preprints in Jan-March 2019
5. gnomAD 2.1 samples
• Data provided by 109 PIs for 141,456 individuals including 125,748
exomes & 15,708 whole genomes
• Primarily from case-control studies of complex adult-onset diseases (e.g.
type 2 diabetes, heart attack, neuropsychiatric conditions)
• Removed low-quality samples, related individuals, known severe
pediatric disease cases plus first-degree relatives
• Diverse range of ancestries (57% European, roughly 10,000 or more samples
apiece from South Asian, Latino, African/African-American, and East
Asian populations)
6. gnomAD’s impact
• 20.8 M pageviews of the ExAC and gnomAD browsers, by 230,000 users
from 166 countries
• Aided in the diagnosis of over 50,000 rare disease families
• 3,962 papers have cited the ExAC paper
7. The gnomAD preprints on bioRxiv
http://broad.io/gnomad_lof
http://broad.io/gnomad_drugs
http://broad.io/gnomad_lrrk2
http://broad.io/tx_annotation
http://broad.io/gnomad_mnv
http://broad.io/gnomad_uorfs
http://broad.io/gnomad_sv
8. Thank you
Production team
Eric Banks
Charlotte Tolonen
Christopher
Llanwarne
David Roazen
Diane Kaplan
Gordon Wade
Jeff Gentry
Jose Soto
Kathleen Tibbetts
Kristian Cibulskis
Laura Gauthier
Louis Bergelson
Miguel Covarrubias
Nikelle Petrillo
Ruchi Munshi
Sam Novod
Thibault Jeandet
Valentin Ruano-
Rubio
Yossi Farjoun
Analysis team
Konrad Karczewski
Laurent Francioli
Grace Tiao
Kristen Laricchia
Anne O'Donnell-
Luria
Ben Neale
Beryl Cummings
Eric Minikel
Irina Armean
James Ware
Kaitlin Samocha
Mark Daly
Nicola Whiffin
Qingbo Wang
Ryan Collins
Cotton Seed
Tim Poterba
Arcturus Wang
Chris Vittal
Structural Variation
team
Ryan Collins
Harrison Brand
Konrad Karczewski
Laurent Francioli
Nick Watts
Matthew Solomonson
Xuefang Zhao
Laura Gauthier
Harold Wang
Chelsea Lowther
Mark Walker
Christopher Whelan
Ted Brookings
Ted Sharpe
Jack Fu
Eric Banks
Michael Talkowski
Website team
Matthew
Solomonson
Nick Watts
Ben Weisburd
Konrad Karczewski
Ethics team
Andrea Saltzman
Molly Schleicher
Namrata Gupta
Stacey Donnelly
Broad
Genomics
Platform
Stacey Gabriel
Kristen Connolly
Steven Ferriera
Funding
NIGMS R01 GM104371
(PI: MacArthur)
NIDDK U54 DK105566
(PIs: MacArthur and
Neale)
NHGRI U24 HG010262
(PI: Philippakis)
NIMH R56 MH115957
(PI: Talkowski)
The vast majority of the
data storage, computing
resources, and human
effort used to generate this
call set were donated by
the Broad Institute
Coordination
Jessica Alföldi
9. Thank you
Principal Investigators
Daniel MacArthur
Aarno Palotie
Andres Metspalu
Anne Remes
Adolfo Correa
Andre Franke
Ann Pulver
Ben Glaser
Ben Neale
Bong-Jo Kim
Bruce Cohen
Carlos Pato
Carlos A Aguilar Salinas
Christina Hultman
Christine M. Albert
Christopher Haiman
Clicerio Gonzalez
Colin Palmer
Craig Hanis
Dan Roden
Dan Turner
Dana Dabelea
Daniel Chasman
Danish Saleheen
David Altshuler
David Goldstein
Dawood Darbar
Dermot McGovern
Diego Ardissino
Donald Bowden
Dost Ongur
Emelia J. Benjamin
Erkki Vartiainen
Erwin Bottinger
Gad Getz
George Kirov
Gil Atzmon
Harlan M. Krumholz
Harry Sokol
Heribert Schunkert
Hilkka Soininen
Hugh Watkins
Jaakko Kaprio
Jaana Suvisaari
James Meigs
James Ware
James Wilson
Jaspal Kooner
Jaume Marrugat
Jeanette Erdmann
Jeremiah Scharf
John Barnard
John Chambers
John D. Rioux
Jose Florez
Josée Dupuis
Judy Cho
Juliana Chan
Kari Mattila
Kyong Soo Park
Laurent Beaugerie
Leif Groop
Lorena Orozco
Lori Bonnycastle
Maija Wessman
Mark Daly
Mark McCarthy
Markku Laakso
Martti Färkkilä
Matthew Bown
Matthew Harms
Matti Holi
Michael Boehnke
Michael O'Donovan
Michael Owen
Mikko Hiltunen
Mikko Kallela
Mina Chung
Ming Tsuang
Moore Shoemaker
Nazneen Rahman
Nilesh Samani
Olle Melander
Pamela Sklar
Patrick T. Ellinor
Patrick Sullivan
Peter Nilsson
Ramnik Xavier
Ravindranath
Duggirala
Rinse Weersma
Roberto Elosua
Ronald Ma
Ruth Loos
Ruth McPherson
Samuli Ripatti
Sekar Kathiresan
Seppo Koskinen
Soo Heon Kwak
Stephen Glatt
Steve McCarroll
Steven A. Lubitz
Subra
Kugathasan
Tai Shyong
Tariq Ahmad
Teresa Tusie
Luna
Terho Lehtimäki
Tim Spector
Tõnu Esko
Tuomi Tiinamaija
Veikko Salomaa
Yik Ying Teo
Young Jin Kim
Jerome Rotter
Steven Rich
10. Variation across 141,456
individuals reveals the
spectrum of loss-of-function
intolerance of the human
genome
Konrad Karczewski
April 11, 2019
@konradjk
broad.io/gnomad_lof
11. Range of LoF impact
embryonic lethal
recessive disease
non-essential
complex disease
beneficial
haploinsufficient disease
12. Identifying true LoF variants is challenging
• LoFs are rare
• LoFs are enriched for artifacts
15. Staggering amounts of pLoFs
• gnomAD contains:
• 230M variants in 15,708
genomes
• 15M variants in 125,748
exomes
• Of these, we observe
515,326 predicted loss-of-
function (pLoF) variants
• Stop-gained
• Essential splice
• Frameshift indel
[Figure: number of pLoF variants observed (0 to 400,000) as a function of sample size (0 to 125,748 exomes)]
17. LOFTEE removes benign variation
• LoF filtering plugin to VEP, LOFTEE
• Variants retained by LOFTEE are:
• rarer, and thus
• more deleterious
• After filtering, we discover 443,769
high-confidence pLoFs in gnomAD
https://github.com/konradjk/loftee
[Figure: MAPS (0.00 to 0.15) for synonymous, missense, low-confidence pLoF and high-confidence pLoF variants; higher MAPS means rarer, more deleterious]
18. Detecting genes depleted for pLoFs
• Mutational model that predicts the number of SNVs in a given
functional class we would expect to see in each gene in a cohort
• Now incorporating methylation, improved coverage correction, LOFTEE
• Previously transformed into the probability of LoF intolerance (pLI)
• Applying to 125,748 gnomAD exomes
• Median of 17.3 pLoFs expected per gene
• Direct estimate of observed/expected ratio
Kaitlin Samocha
(Samocha et al. 2014;
Lek et al. 2016)
19. Most genes are depleted of LoF variation
              MED13L                                    FNDC3B
Phenotype     Severe Intellectual Disability            Unknown
              Observed  Expected  Obs/Exp (CI)          Observed  Expected  Obs/Exp (CI)
Synonymous    462       465       0.993 (0.92-1.07)     271       266       1.02 (0.92-1.13)
pLoF          0         102       0 (0-0.029)           0         68        0 (0-0.043)
• Many are extremely depleted
(<20% observed compared to
expected)
• Including most known (curated)
haploinsufficient genes
• Using upper bound of
confidence interval corrects
for small genes
[Figure: histograms of the number of genes by observed/expected ratio (0.0 to 1.5) and by LOEUF (0.0 to 2.0)]
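As a rough illustration of the observed/expected ratio and the upper bound of its confidence interval, the Python sketch below weights candidate ratios on a grid by the Poisson probability of the observed count and reads the bounds off the normalised cumulative weights. The function names, the 0 to 2 grid, the 0.001 step and the 5% tails are assumptions made for this example; they are not confirmed here as the procedure used for the gnomAD release.

```python
import math


def poisson_pmf(k: int, mu: float) -> float:
    """Poisson probability of observing k events when mu are expected."""
    if mu == 0.0:
        return 1.0 if k == 0 else 0.0
    # Work in log space so large counts do not overflow.
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))


def oe_interval(observed: int, expected: float, alpha: float = 0.05,
                max_ratio: float = 2.0, step: float = 0.001):
    """Return (lower, upper) bounds on the obs/exp ratio (assumed grid method).

    Each candidate ratio r is weighted by P(observed | r * expected); the
    bounds are where the normalised cumulative weight crosses alpha and
    1 - alpha.
    """
    n = int(round(max_ratio / step))
    ratios = [i * step for i in range(n + 1)]
    weights = [poisson_pmf(observed, r * expected) for r in ratios]
    total = sum(weights)

    lower, upper = 0.0, max_ratio
    found_lower = False
    cumulative = 0.0
    for r, w in zip(ratios, weights):
        cumulative += w / total
        if not found_lower and cumulative >= alpha:
            lower, found_lower = r, True
        if cumulative >= 1.0 - alpha:
            upper = r
            break
    return lower, upper


if __name__ == "__main__":
    # Counts taken from the MED13L and FNDC3B rows of the table above,
    # plus the "observed = 0, expected = 2" small-gene case.
    for label, obs, exp in [("MED13L pLoF", 0, 102.0),
                            ("FNDC3B pLoF", 0, 68.0),
                            ("MED13L synonymous", 462, 465.0),
                            ("small gene", 0, 2.0)]:
        low, high = oe_interval(obs, exp)
        print(f"{label}: obs/exp = {obs / exp:.3f}, interval = ({low:.3f}, {high:.3f})")
```

In this sketch a gene with 0 observed and 102 expected pLoFs gets an upper bound near 0.03, while a gene with 0 observed and only 2 expected gets an upper bound above 1, which is why the upper bound is the more conservative score for small genes.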
20. Resolving the spectrum of LoF intolerance
• Binning this spectrum into deciles
[Figure: percent of each gene list (haploinsufficient, autosomal recessive, olfactory genes) in each LOEUF decile; lower deciles are more depleted and more constrained, higher deciles are more tolerant and less constrained]
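Binning a per-gene score into deciles can be sketched in a few lines of Python. This is an illustration only: the function name is made up for the example, the gene names and LOEUF values are hypothetical placeholders rather than gnomAD data, and ties are simply broken by input order.

```python
def loeuf_deciles(scores: dict[str, float]) -> dict[str, int]:
    """Map each gene to a decile 0-9 by rank.

    Decile 0 holds the lowest LOEUF values (most constrained genes),
    decile 9 the highest (most tolerant genes).
    """
    ranked = sorted(scores, key=lambda gene: scores[gene])
    n = len(ranked)
    return {gene: (rank * 10) // n for rank, gene in enumerate(ranked)}


if __name__ == "__main__":
    # Hypothetical genes and scores, for illustration only.
    example = {f"GENE{i:02d}": 0.1 * i for i in range(20)}
    deciles = loeuf_deciles(example)
    print(deciles["GENE00"], deciles["GENE10"], deciles["GENE19"])
```

A gene list (for example, known haploinsufficient genes) can then be summarised by counting how many of its members fall into each decile.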
21. Resolving the spectrum of LoF intolerance
• Known haploinsufficient genes have ~10% of the expected pLoFs
[Figure: percent of each gene list (haploinsufficient, autosomal recessive, olfactory genes) in each LOEUF decile]
22. Resolving the spectrum of LoF intolerance
• Autosomal recessive genes are centered around 60% of expected
[Figure: percent of each gene list (haploinsufficient, autosomal recessive, olfactory genes) in each LOEUF decile]
Gene list from:
Blekhman et al., 2008
Berg et al., 2013
27. Constraint improves rare disease diagnosis
[Figure: rate ratio for de novo variants in ID/DD cases compared to controls, by LOEUF decile, for synonymous and pLoF variants]
• Patients with developmental
delay/intellectual disability are
15X more likely to have a de
novo LoF in a constrained gene
• 8,095 de novos in 5,305 cases
• 2,623 de novos in 2,179 controls
• Integrating expression data
improves this further
Jack
Kosmicki
Beryl
Cummings
broad.io/tx_annotation
28. Constraint informs common disease
etiologies
• Compared to genome-wide
background, SNPs near
constrained genes are
enriched in their contribution
to heritability of common traits
• In particular, traits that
previously¹ showed an
enrichment of ultra-rare
variants are also enriched
among constrained genes
[Figure: partitioning heritability enrichment (1.0 to 1.4) by LOEUF decile]
[Figure: trait enrichment p-value by trait category (activities, cardiovascular, cognitive, environment, hematological, metabolic, nutritional, ophthalmological, psychiatric, reproduction, respiratory, skeletal, social interactions, other); labelled traits: Schizophrenia; Qualifications: College or University degree; Duration to first press of snap-button in each round; Educational attainment; Bipolar]
¹ Ganna et al. 2018 AJHG
Andrea
Ganna
29. Data publicly released with no publication restrictions
gnomad.broadinstitute.org
Matt
Solomonson
Nick
Watts
Gene model with
transcripts
Pathogenic Clinvar
Variants
Dataset
selection box
Tissue
isoform
expression
Constraint
metrics
pext:
broad.io/tx_annotation
30. Now featuring: structural variant calls in the browser
gnomad.broadinstitute.org
Matt
Solomonson
Nick
Watts
31. Acknowledgments
• Laurent Francioli
• Grace Tiao
• Beryl Cummings
• Jack Kosmicki
• Andrea Ganna
• Qingbo Wang
• Kaitlin Samocha
Ben Neale
• Daniel Birnbaum
• Jessica Alföldi
Kristen Laricchia
• Matt Solomonson
Nick Watts
• Ryan Collins
Harrison Brand
• Raymond K. Walters
Kate Tashman
• Daniel Rhodes
Moriel Singer-Berk
Eleina England
Eleanor G. Seaby
• Hail team
Tim Poterba
Cotton Seed
Arcturus Wang
• Laura Gauthier
Yossi Farjoun
Eric Banks
• Analytic and
Translational Genetics
Unit
• Mark Daly
• Daniel MacArthur
broad.io/gnomad_lof
32. Evaluating potential drug
targets through human loss-of-
function genetic variation
Eric Vallabh Minikel
April 11, 2019
@cureffi
broad.io/gnomad_drugs
33. Why study LoF variants in drug discovery?
• LoF variants can be an in vivo, whole human, lifelong model of
inhibition of a target.
from Plenge 2013, PMID: 23868113
34. Why study LoF variants in drug discovery?
• LoF variants can be an in vivo, whole human, lifelong model of
inhibition of a target.
• With caveats:
• drug effect may not exactly mimic LoF
• developmental effects
• tissue-specific effects
• dosage
• difference in our ancestors' environment vs. our environment
35. How do drug targets compare to all genes
& specific gene lists in constraint?
38. How constrained are some well-known
drug targets?
• 19% of all drug targets (N=73, including 53 targets of inhibitors, antagonists,
etc.) have obs/exp < 13%, the average for haploinsufficient genes
39. How constrained are some well-known
drug targets?
• 19% of all drug targets (N=73, including 53 targets of inhibitors, antagonists,
etc.) have obs/exp < 13%, the average for haploinsufficient genes
• These include some chemotherapy targets but also aspirin, statins, and
antimuscarinics!
40. How constrained are some well-known
drug targets?
• Not all chemotherapy targets are so constrained
41. How constrained are some well-known
drug targets?
• Drug targets span the full spectrum – constraint alone should not rule a potential
target in or out
42. Can we find and phenotype LoF individuals
for a gene of interest?
• If you can find them, phenotyping of LoF individuals (het or hom)
can be deeply informative for safety and/or efficacy
• Examples: PCSK9, APOC3, CETP, LPA, HAO1...
• Questions for today:
• Is it always realistic to expect to find enough LoF heterozygotes
or homozygotes to permit your analysis of interest?
• What is the best strategy to go about finding them?
• How should you curate pLoF variants before starting to
recontact?
43. Cumulative allele frequency of LoF variants
• Cumulative allele frequency (CAF) = Σ(AF) for all LoF variants
• Define p = proportion of the haplotypes in the population that are LoF
• In an outbred population:
• LoF het frequency = 2p(1-p)
• LoF hom / compound het frequency = p²
• This analysis:
• Use gnomAD data to compute p for each gene
• Predict the hom/compound het frequency for each gene in the population
— assuming this genotype is not lethal (!)
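The arithmetic on this slide can be sketched in Python as follows. The function names are made up for the example and the three per-variant allele frequencies are hypothetical; the value p = 0.09% is the post-curation LRRK2 cumulative allele frequency quoted later in this talk. The sketch assumes an outbred population and, as noted above, that the two-hit genotype is not lethal.

```python
def cumulative_allele_frequency(allele_frequencies):
    """CAF = sum of the allele frequencies of all LoF variants in a gene."""
    return sum(allele_frequencies)


def lof_genotype_frequencies(p):
    """Expected genotype frequencies in an outbred population.

    p is the proportion of haplotypes carrying a LoF allele.
    Returns (heterozygote frequency, homozygote / compound het frequency).
    """
    het = 2 * p * (1 - p)
    hom = p ** 2
    return het, hom


if __name__ == "__main__":
    # Hypothetical per-variant allele frequencies, for illustration only.
    caf = cumulative_allele_frequency([0.0004, 0.0003, 0.0002])
    print(f"CAF = {caf:.4%}")

    # p = 0.09%: the curated LRRK2 CAF from the table later in the talk.
    het, hom = lof_genotype_frequencies(0.0009)
    print(f"hets: about 1 in {1 / het:,.0f}")
    print(f"homs / compound hets: about 1 in {1 / hom:,.0f}")
```

At p = 0.09% this gives heterozygotes at roughly 1 in 550 people but two-hit individuals at only about 1 in 1.2 million, which illustrates why analysis for most genes has to rely on heterozygotes.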
50. Which populations to sequence?
• For the near future, analysis for most genes will need to focus on
heterozygotes, regardless of population
• For finding homozygotes, best strategy is to sample diverse
bottlenecked populations and consanguineous individuals
51. How to curate?
• "the more interesting something looks, the less likely it is to be real"
• Solutions:
• LOFTEE (Karczewski 2019, broad.io/gnomad_lof)
• Expression-aware annotation (Cummings 2019,
broad.io/tx_annotation)
• Deep manual curation is still important
Non-random distribution of pLoFs across the coding sequence is
suspicious
• Next up: examples of curation of 3 genes with different error modes
52. HTT
• Cumulative pLoF allele frequency: 6.2%
• Mostly driven by several common variants in exon 1
• Highly suspicious given the lethal mouse knockout phenotype!
53. HTT
• common LoFs are all alignment artifacts at polyQ and polyP repeat regions
• after filtering & curation, cumulative pLoF allele frequency: 0.013%
55. MAPT
• almost all pLoFs are in exons not expressed in the brain!
• the remainder are various artifacts
• after filtering & curation, cumulative pLoF allele frequency: 0%
• Transcript-aware expression – see Cummings et al, broad.io/tx_annotation
57. PRNP
• N-terminal variants are true LoF. In N terminus, not constrained at all (obs/exp = 6/6.05 = 99%)
• C-terminal truncating variants cause disease through gain-of-function (literature variants added). The
gnomAD C-terminal frameshift turns out to be a dementia case!
58. Very different answers before/after curation
CAF
gene before after
HTT 6.2% 0.013%
LRRK2 0.23% 0.09%
MAPT 14% 0%
PRNP 0.0035% 0.0021%
SNCA 0.0012% 0%
SOD1 0.0060% 0.0038%
59. Very different answers before/after curation
gene    CAF before   CAF after   LoF het prevalence   GoF disease prevalence
HTT     6.2%         0.013%      1 in 3,800           1 in 2,400-4,400
LRRK2   0.23%        0.09%       1 in 500             1 in 3,300
MAPT    14%          0%          not observed         1 in 5,000-31,000
PRNP    0.0035%      0.0021%     1 in 18,000          1 in 50,000
SNCA    0.0012%      0%          not observed         1 in 360,000
SOD1    0.0060%      0.0038%     1 in 26,000          1 in 27,000-83,000
61. Very different answers before/after curation
• Even without recontact & phenotyping, curation can be highly informative
62. Very different answers before/after curation
• Even without recontact & phenotyping, curation can be highly informative
• But remember, even MAPT and SNCA might be great drug targets!
63. Suggested guidelines for evaluating drug
targets based on LoF
• It's complicated - no simple formula, evaluate each target on a
case-by-case basis
• Filter and curate
• Consider positional distribution
• Calculate cumulative allele frequency
• Experimentally validate loss-of-function
• Don't eliminate a gene from consideration just because you can't find
LoF individuals
• Read the pre-print: broad.io/gnomad_drugs
64. Acknowledgments
• Contact: eminikel@broadinstitute.org / danmac@broadinstitute.org
• Funding: NIH F31 AI22592
• Many thanks to East London Genes & Health
• Thanks to co-authors: Konrad, Beryl, Nicky, Jessica; Stuart Schreiber; Hilary Martin,
Richard Trembath, & David van Heel (ELGH); gnomAD consortium & production team
• FYI: Sonia & Eric's thesis defenses (primary prevention targeting PRNP) – April 16,
9:00a – 11:00a, Broad Auditorium
broad.io/gnomad_drugs
65. From LoF to phenotype: a pilot
study using LRRK2
Nicky Whiffin
@nickywhiffin
Research fellow, Imperial College London
Irina Armean, Aaron Kleinman
broad.io/gnomad_lrrk2
66. GoF LRRK2 variants cause Parkinson’s
• Gain of function missense variants in LRRK2 cause early-onset Parkinson’s
• LRRK2 is over-activated in general Parkinson’s
67. GoF LRRK2 variants cause Parkinson’s
• Gain of function missense variants in LRRK2 cause early-onset Parkinson’s
• LRRK2 is over-activated in general Parkinson’s
• Multiple pharma companies now pursuing LRRK2 inhibitors as
generalised Parkinson’s therapy
68. Early concerns for toxicity
• Early pre-clinical model organism studies: KO animals have
lung, liver and renal phenotypes
69. Early concerns for toxicity
• Early pre-clinical model organism studies: KO animals have
lung, liver and renal phenotypes
Is partial reduction of LRRK2 protein levels safe in humans?
70. Cohorts included
gnomAD v2.1
141,456 sequenced
individuals
Case-control and cohort
studies
23andMe
>4 million research-
consented individuals
Genotyped and imputed
76. ...and are genuinely LoF
lymphoblastoid cells from
individuals with heterozygous
LRRK2 LoF
CRISPR-edited embryonic
stem cells differentiated into
cardiomyocytes
Jamie Marshall
Homozygous
reference
Homozygous
reference
p.Cys1313Ter
p.Arg1483Ter
p.Arg1693Ter
77. A curated dataset of LRRK2 pLoF individuals
• 1,358 carriers of 111 pLoF variants
• Appear to be true LoF
But what effect do these have on human health?
79. Manual curation of gnomAD phenotype data
• 60 LRRK2 LoF carriers in gnomAD had available
phenotype data
• Genomic Psychiatric Cohort, Pakistan Risk of Myocardial Infarction
Study, Swedish Schizophrenia and Bipolar Studies, the FINRISK
study, the BioMe Biobank, the Estonian Biobank
• Very diverse sources including EHRs and questionnaires
Jessica Alföldi
80. Manual curation of gnomAD phenotype data
• 60 LRRK2 LoF carriers in gnomAD had available
phenotype data
• Genomic Psychiatric Cohort, Pakistan Risk of Myocardial Infarction
Study, Swedish Schizophrenia and Bipolar Studies, the FINRISK
study, the BioMe Biobank, the Estonian Biobank
• Very diverse sources including EHRs and questionnaires
• Manually assessed for lung, liver, kidney, CV, nervous
system, immune system phenotypes and cancer
• No enrichment for any adverse phenotypes
• No sign of syndromic phenotypes
Jessica Alföldi
83. Key message for LRRK2 drug development
• ~1 in 550 humans has a heterozygous pLoF variant in LRRK2
• ~50% reduction in LRRK2 protein
• likely across all tissues throughout life
• No discernable negative impact across >1100 carriers
• No effect on overall mortality
• No enrichment for any assessed phenotypes
• Suggests that partial LRRK2 inhibitors should be well-tolerated,
even with chronic administration
• Demonstrates the power of large-scale genetics to assess
tolerability for drug discovery
84. Acknowledgements
Irina Armean
Jamie Marshall
Eric Minikel
Konrad Karczewski
Beryl Cummings
Laurent Francioli
Kristen Laricchia
Qingbo Wang
James Ware
Jessica Alföldi
Daniel MacArthur
Aaron Kleinman
Anna Guan
Babak Alipanahi
Peter Morrison
the 23andMe Research
Team
Paul Cannon
Genome Aggregation Database
Production Group
Genome Aggregation Database
Consortium
Marco Baptista
Kalpana Merchant
Aki Havulinna
Bozenna Iliadou
Jung-Jin Lee
Grish Nadkarni
Cole Whiteman
Mark Daly
Tõnu Esko
Christina Hultman
Ruth Loos
Lili Milani
Aarno Palotie
Carlos Pato
Michele Pato
Danish Saleheen
Patrick Sullivan
Editor's Notes
Citations, diagnoses, impact
I'm going to talk about some cool things you can do with this dataset to understand the impact of loss-of-function variation on the human genome.
Imagine if we could put each of the 20K genes in the genome along a spectrum of sensitivity to functional disruption, that is, the clinical or phenotypic impact that a loss-of-function variant might have in that gene.
For instance, here over on the left are genes where we’ll never see LoF variants in living humans as these would be incompatible with human life. In the middle are variants and genes that we typically study in the clinical genetics space, from causal variants for dominant and recessive diseases to risk factors for complex disease. On the right, we have genes that are relatively tolerant of LoF variation, potentially even homozygous inactivation.
And unlike in model organisms, where we can effectively engineer such mutations, there are obvious technical and ethical barriers to doing so in humans. But when we sequence healthy individuals or individuals with common diseases, we find plenty of genes inactivated, in the form of naturally-occurring predicted loss-of-function variants (or pLoFs). However…
Because true LoFs are deleterious, a number of factors conspire to make them difficult to characterize. In particular...
This dataset contains a substantial amount of variation, including...
Here you can see the number of variants discovered in the exomes, broken down by functional class, as a function of sample size, which follow approximately a square root law. If we zoom into the predicted LoFs...
...we observe over half a million LoFs, following the same pattern of discovery. And I should clarify that when I'm talking about pLoFs today, I'm referring to stop-gained, essential splice, and frameshift variants.
So now that we've increased our sensitivity and discovered a bunch of rare putative LoFs, now we'd like to increase our specificity...
To this end, we've created a tool called LOFTEE, a plugin to VEP that filters out common error modes based on first principles, and importantly, does not use frequency. In spite of that, when we look at the mutability adjusted proportion of singletons, or MAPS, a metric of deleteriousness based on frequency, LOFTEE filters out variants that have a frequency spectrum consistent with missense variants, while variants that are retained are much more rare on average and thus more deleterious. After filtering...
1649 confident homozygous.
With a high-quality catalog of predicted loss-of-function variants, we can not only look at genes which have LoF variants in the general population, but also genes where we don’t see any LoFs.
A few years ago, Kaitlin Samocha built a mutational model to predict...
And we've now built on this model with a number of improvements to refine the model and increase specificity.
Previously, this constraint was summarized in a metric called pLI.
However, now that we're applying to a larger dataset, with our greater resolution, we can use the more interpretable observed/expected ratio, and build a confidence interval around this value, which can give us a conservative estimate of the observed to expected ratio.
Using this method, we can return to the question of where genes fall on this spectrum of LoF tolerance. Most genes have a depletion of pLoFs (that is, observed/expected less than 1), and many are extremely depleted, including most known HI genes.
Just a note for anyone who has used the pLI scores previously, we've now flipped the scale, so the genes over on the left side are high pLI genes. So these improvements solidify our ability to detect constraint, here are two very clear examples.
LoFs in MED13L were previously demonstrated to cause severe ID, facial features, and cardiac phenotypes. FNDC3B has no known human phenotype, but knocking it out in mice results in death at birth. So even without a known phenotype, if you find a rare disease patient with an LoF in this gene, you might be concerned.
Some of you may notice this tick on the left side, where observed/expected is zero. This can happen due to extreme constraint, or small genes (say, observed = 0, expected = 2). At larger sample sizes, this will even itself out and we could use the observed/expected ratio, but for now, we can use the upper bound of the confidence interval, which we term LOEUF, resulting in a much smoother distribution, which I’ll use from here on out.
So this metric is a conservative estimate that takes into account the gene size. As our sample sizes grow, LOEUF will converge to the o/e ratio, but for now this is a useful metric.
So we can bin this metric into deciles, which I'll show on every slide from here on out with the left ...
350
so the metric is well-calibrated, and importantly this means we now have improved LoF tolerance scores for all protein-coding genes in the genome
This fits with what we see in model systems, where genes that are early lethal in mouse are more likely to have an ortholog in the human constrained genes. Similarly, in CRISPR screens, genes that are essential for cell viability are also more likely to be constrained and the opposite for the confidently non-essential genes.
We next explored how constraint against SNVs correlates with patterns of structural variation. Ryan and Harrison called SVs in 14K individuals, identifying about 10K rare biallelic autosomal LoFs that disrupt gene function.
When they looked at the occurrence of SVs in each of the constraint deciles, they found that on average, the constrained genes had a strong depletion of structural variation.
Important to note that this is not a per-gene SV metric, as even this dataset of 15K has less than one rare LoF SV per gene.
For more information on this dataset, see the recently posted preprint from Ryan and crew
If we look at the burden of de novo LoFs in patients with developmental delay or intellectual disability, we observe a 15-fold increased rate in the top 10% most constrained genes in the genome, in cases compared to controls. So in other words, this lowest decile contains genes where a single LoF mutation will prevent you, by an estimable amount, from progressing through a healthy development during childhood.
Finally, we can investigate how these constraint metrics relate to common disease biology. In a partitioning heritability analysis of 600 traits from UKBiobank, we find that SNPs near constrained genes are enriched for heritability of common traits.
If we zoom in on which traits have the strongest enrichment for heritability among constrained genes, we find schizophrenia, bipolar disorder, and educational attainment, which is consistent with previous work that marked these traits as enriched for ultra-rare coding variants
Thanks to the efforts of Matt and Nick
Not depleted LOEUF = 0.64
By LoF I mean nonsense, frameshift or essential splice site variants
Manual curation
Variant quality metrics
Reads on IGV
LoF rescue either by co-localised variants or cryptic/alternative splice sites
‘GC’ still works as a strong splice donor site
Protein domains - Chi-square P=0.23
Thank Jamie by name
“Western blot” of protein levels
Kolmogorov-Smirnov P=0.085 and 0.46 respectively
Last known age not survival
~4 million individuals, over 1000 of which are known carriers giving reasonable power to detect an association