Optimal Scoring of Variants Altering Transcription Factor Binding
Fri 24 May
Daniele Merico, PhD
Director of Molecular Genetics, Deep Genomics Inc.
Visiting Scientist, The Hospital for Sick Children (Toronto, Canada)
Preliminary Note: Mutational Background for Promoters and Enhancers
Factors that impact somatic mutation in promoters and enhancers:
• Trinucleotide mutation probability → mutational mechanism
• Open chromatin configuration → accessibility for repair
• Transcription factor and nucleosome binding → accessibility for repair
• Transcriptional activity → accessibility for repair
Sabarinathan et al. Nucleotide excision repair is impaired by binding of transcription factors to DNA. Nature 2016
Perera et al. Differential DNA repair underlies mutation hotspots at active promoters in cancer genomes. Nature 2016
Polak et al. Reduced local mutation density at regulatory DNA of cancer genomes is linked to DNA repair. Nat Biotech 2014
Transcription factor binding
• Transcription factors (TFs) have key importance in regulating gene expression by binding to regulatory
genomic elements (TFs demonstrate sequence-based specificities towards these binding sites)
• Understanding the process of TF-DNA binding can help us understand the intricate process of gene
regulation, develop actionable hypotheses that can be used in drug development/therapy, etc.
• With the aid of technologies like ChIP-seq, SELEX and PBMs, many TF binding sites (TFBSs) have been
characterized
khanacademy.org
Modelling transcription factor binding
• Binding sites have been used to train computational models
• Position weight matrices or PWMs (simplest models)
• More complex machine learning (deep learning)
approaches are able to learn far more complex patterns in
binding sites
• How well TF binding models distinguish their binding
regions from random genomic regions has been well
characterized
Giraud et al. 2010
Alipanahi et al. 2015
Genetic variation and transcription factor binding
• Genetic variation falling within specificity
determinants of TF binding sites (TFBSs) can alter
binding by introducing novel binding sites or
diminishing existing binding sites
• Can result in a substantial impact on molecular
phenotypes through changes in gene expression
• PWMs and DL-based models have been used to
assess impact of variants on binding sites
• Have become an essential component of many
variant prioritization pipelines
Motivation
• How well do binding prediction models perform at predicting impact of variants?
• Variants from the Human Genome Mutation Database (HGMD), genome-wide association
studies (GWAS), and quantitative trait loci (QTL) studies have previously been used
• Little has been done to explore the ability of these models to assess the impact of genetic
variants on binding in a TF-specific manner.
• Not many curated datasets on variants impacting TFBSs
• Allele-specific ChIP-seq data
Allele-specific ChIP-seq data
• Gather heterozygous mutations
• ChIP-seq reads for a particular TF are mapped onto each of the
alleles of the diploid genome
• Compare the read counts between the two parental
chromosomes (using binomial test)
• Significant binomial test (Pbinomial < 0.01):
- Allele-specific binding variants (ASB)
- variant impacts binding
• Non-significant binomial test (Pbinomial > 0.5):
- Non-allele-specific binding variant (non-ASB)
- variant has little to no impact on binding
Chen et al. 2016
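The ASB call described above can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the study: it assumes an exact two-sided binomial test against an expected allelic ratio of 0.5, and the function names and the "ambiguous" label for variants falling between the two thresholds are hypothetical.

```python
from math import comb


def binomial_two_sided_p(ref_reads: int, alt_reads: int) -> float:
    """Exact two-sided binomial test against p = 0.5 (assumed null)."""
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0
    k = min(ref_reads, alt_reads)
    # P(X <= k) under Binomial(n, 0.5); the null is symmetric, so
    # doubling the smaller tail gives the two-sided p-value.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def call_asb(ref_reads: int, alt_reads: int,
             asb_p: float = 0.01, non_asb_p: float = 0.5) -> str:
    """Label a heterozygous variant using the thresholds from the slide."""
    p = binomial_two_sided_p(ref_reads, alt_reads)
    if p < asb_p:
        return "ASB"
    if p > non_asb_p:
        return "non-ASB"
    return "ambiguous"  # hypothetical label: neither threshold is met


print(call_asb(30, 5))   # strongly imbalanced read counts
print(call_asb(10, 10))  # perfectly balanced read counts
```

Variants with p-values between 0.01 and 0.5 are left out of both classes, which keeps the two sets well separated.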
A compendium of allele-specific binding events
• Assess performance of binding predictors at predicting variant impact
• Collect ASB data (read counts on heterozygous variants)
• Compile TFBS predictors, score ASB and non-ASB variants
• Mapped reads for heterozygous variants were obtained from individual
studies and were not uniformly processed
• To check the reliability of cross-study read counts: correlate log
ref/alt read ratios for overlapping ASB events between studies
• Mean Pearson r = 0.79
• Conclusion: although read counts come from different studies, they
remain in agreement.
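The cross-study consistency check can be illustrated as follows. This is a sketch under stated assumptions: the read counts are made up, the pseudocount of 1 is an assumption added to avoid division by zero, and the helper names are hypothetical.

```python
from math import log2, sqrt


def log_ratio(ref_reads, alt_reads, pseudocount=1.0):
    """log2(ref/alt) with a pseudocount (an assumption, not from the slides)."""
    return log2((ref_reads + pseudocount) / (alt_reads + pseudocount))


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


# Hypothetical (ref_reads, alt_reads) per variant in two studies.
study_a = {"v1": (30, 5), "v2": (4, 20), "v3": (12, 11), "v4": (50, 10)}
study_b = {"v1": (45, 9), "v2": (6, 25), "v3": (10, 12), "v4": (40, 12),
           "v5": (8, 8)}

# Only variants observed in both studies enter the correlation.
shared = sorted(set(study_a) & set(study_b))
ratios_a = [log_ratio(*study_a[v]) for v in shared]
ratios_b = [log_ratio(*study_b[v]) for v in shared]
r = pearson_r(ratios_a, ratios_b)
print(shared, round(r, 3))
```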
Properties of allele-specific binding data
Properties of allele-specific binding data:
ASB loss variants are under purifying selection
• Assess proportion of ASB/non-ASB variants that
are rare wrt ExAC, 1000G and ESP6500
• Loss ASB variants are under purifying selection
(larger proportion of rare variants)
Properties of allele-specific binding data:
Non-coding variant impact predictors do not differentiate ASB
from non-ASB
• Several other non-coding predictors that do not take TF-binding
motifs into account, and instead use metrics such as conservation,
are not able to distinguish ASB from non-ASB variants
• i.e. knowledge of TF binding specificity can help identify
impactful non-coding variants
Properties of allele-specific binding data
Take-home messages
• Compiled largest known ASB dataset
• Loss ASBs are under purifying selection and therefore of significant importance
• Current non-coding variant impact predictors are unable to distinguish ASB
variants
• ASB data is suitable to assess performance of TF-binding models at predicting
variant impact
Performance of transcription factor binding
variant impact predictions
Model collection
• Collected pre-trained and trained models for TFs with ASB data from six different
methods ranging from simple methods (PWMs) to deep learning approaches
1. PWMs
2. DeepBind
3. DeepSEA
4. DanQ
5. GERV
6. gkmSVM
Performance of transcription factor binding
variant impact predictions
Model collection
Method | Model type | No. models for TFs with ASB data | Source data
DeepBind | Pre-trained | 91 | ENCODE ChIP-seq
DeepSEA | Pre-trained | 91 | ENCODE/RE ChIP-seq
DanQ | Pre-trained | 91 | ENCODE/RE ChIP-seq
gkmSVM | Trained | 91 | ENCODE ChIP-seq (same data used to train DeepBind models)
GERV | Pre-trained | 60 | ENCODE ChIP-seq
PWM - JASPAR | Pre-trained | 56 | JASPAR PWMs
PWM - MEME-ChIP | Trained | 87 | Over-represented motifs discovered by MEME-ChIP using DeepBind training data
Performance of transcription factor binding
variant impact predictions
Variant-impact metric definition
Method | Metric | Description
DeepSEA/DanQ | Log FC | Chromatin feature probability log fold changes
DeepSEA/DanQ | Diff. | Chromatin feature probability differences
gkmSVM | deltaSVM | Change in the sum of k-mer weights for wildtype and variant sequences
GERV | GERV score | L2 norm of the difference between predicted ChIP-seq signal in a given window for the reference and the alternate allele
DeepBind/PWMs | Max delta raw | Difference between raw model scores for reference and alternate alleles with the maximum absolute value across multiple windows
DeepBind/PWMs | Delta max raw | Difference of the maximum reference and alternate raw model scores across multiple windows
DeepBind/PWMs | Max delta Pbind | Difference between probability-transformed scores for reference and alternate alleles with the maximum absolute value across multiple windows
DeepBind/PWMs | Pcomb | Signed likelihood of loss or gain, depending on which has the higher magnitude
DeepBind/PWMs | Psum | Sum of likelihood of loss and gain, signed by effect size
Defined in this study
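To make the difference between "Max delta raw" and "Delta max raw" concrete, here is a small sketch using a toy PWM. Everything in it is illustrative: the PWM weights, the sequence, the helper names, and the sign convention (alternate minus reference, so a negative value suggests loss of binding) are assumptions, not the scoring code used in the study.

```python
def toy_pwm(consensus):
    """Hypothetical PWM: +1 for the consensus base, -1 for any other base."""
    return [{b: (1.0 if b == c else -1.0) for b in "ACGT"} for c in consensus]


def window_scores(seq, var_pos, pwm):
    """Raw PWM scores for every window that overlaps the variant position."""
    width = len(pwm)
    first = max(0, var_pos - width + 1)
    last = min(var_pos, len(seq) - width)
    return [sum(pwm[i][seq[start + i]] for i in range(width))
            for start in range(first, last + 1)]


def max_delta_raw(ref_scores, alt_scores):
    """Per-window delta (alt - ref) with the largest absolute value."""
    deltas = [a - r for r, a in zip(ref_scores, alt_scores)]
    return max(deltas, key=abs)


def delta_max_raw(ref_scores, alt_scores):
    """Best alt window score minus best ref window score."""
    return max(alt_scores) - max(ref_scores)


pwm = toy_pwm("TGA")
ref_seq = "AATGACC"
var_pos = 3  # 0-based position of the variant (G in the reference)
alt_seq = ref_seq[:var_pos] + "C" + ref_seq[var_pos + 1:]

ref_scores = window_scores(ref_seq, var_pos, pwm)
alt_scores = window_scores(alt_seq, var_pos, pwm)
print(ref_scores, alt_scores)
print(max_delta_raw(ref_scores, alt_scores), delta_max_raw(ref_scores, alt_scores))
```

In this toy case both metrics agree because the same window is strongest for both alleles. They diverge when the variant changes a weak window substantially while leaving the strongest window untouched: max delta raw then reports the large change, whereas delta max raw stays near zero.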
Performance of transcription factor binding
variant impact predictions
Performance measure
• Loss ASB variants: Pbinomial < 0.01, ref_reads > alt_reads, and
at least 10 total reads
• Gain ASB variants: Pbinomial < 0.01, alt_reads > ref_reads, and
at least 10 total reads
• Non-ASB variants: Pbinomial > 0.50 and at least 10 total
reads
• Use models for TFs with ≥10 ASB/non-ASB variants
• Measure AUROC/AUPRC
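The labelling rules and the AUROC computation can be sketched as below. This is a minimal sketch with hypothetical scores: the function names are illustrative, the AUROC is computed directly as the probability that a positive outranks a negative (ties count as 0.5), and negating the score so that stronger predicted loss ranks higher is an assumed convention.

```python
def label_variant(ref_reads, alt_reads, p_binomial, min_reads=10):
    """Apply the slide's thresholds; returns None for excluded variants."""
    if ref_reads + alt_reads < min_reads:
        return None
    if p_binomial < 0.01:
        return "loss" if ref_reads > alt_reads else "gain"
    if p_binomial > 0.50:
        return "non-ASB"
    return None


def auroc(pos_scores, neg_scores):
    """Probability that a random positive scores above a random negative."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical model deltas (alt - ref); loss variants should be negative.
loss_deltas = [-3.1, -2.0, -0.2]
non_asb_deltas = [0.1, -0.5, 0.3, -0.1]

# Negate so that a larger value means a stronger predicted loss.
loss_auroc = auroc([-d for d in loss_deltas], [-d for d in non_asb_deltas])
print(round(loss_auroc, 4))
```

The pairwise form is quadratic in the number of variants, which is fine for a per-TF evaluation of this size; a rank-based implementation would be preferable for large sets.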
Performance of transcription factor binding
variant impact predictions
PWM metrics performance
• PWM metrics have similar AUROCs, with the exception of max delta raw
• All other metrics have significantly higher AUROCs than max delta raw (p<1.26e-04)
• JASPAR PWMs showed similar results (data not shown)
• Because max delta raw maximises over multiple windows of a sequence, the score is often inflated
Performance of transcription factor binding
variant impact predictions
DeepBind/DeepSEA/DanQ metrics performance
• DeepBind metrics have similar AUROCs
• DeepSEA metrics have similar AUROCs
• DanQ metrics have similar AUROCs
• → Choice of metric has no clear impact on performance
Performance of transcription factor binding
variant impact predictions
Comparison of ML vs. PWM-based methods
• For methods with multiple metrics, we picked one
representative metric
• PWMs → Delta max raw
• DeepBind → Max delta raw
• DeepSEA/DanQ → Log FC
• Compare performance
• gkmSVM/DeepBind/DeepSEA/DanQ all significantly
outperformed PWMs (p<3.11e-03)
[Figure: AUROC per method (y-axis from 0.4 to 0.8). Methods and metrics shown: GERV (GERV score), PWM (MEME, signif; Delta max raw), gkmSVM (deltaSVM), DeepBind (Max delta raw), DanQ (Log FC), DeepSEA (Log FC). Performance for 34 TFs.]
Performance of transcription factor binding
variant impact predictions
Comparison of ML-based methods
• DeepSEA performs slightly better than gkmSVM (p=0.022) and
DeepBind (p=0.026)
• DanQ performs significantly better than gkmSVM (p=0.044) and
borderline significantly better than DeepBind (p=0.057)
Performance of transcription factor binding
variant impact predictions
Take-home messages
• The choice of the scoring metric used in variant impact can often be critical to both interpretability
and performance, particularly for PWMs
• Deep learning-based methods significantly outperform other ML-based and PWM-based methods
• Amongst deep learning methods, no clear winner wrt significance, although DeepSEA/DanQ
generally have higher performance
What drives TF-specific performance?
• TFs show highly variable performance in assessing variant impact
• What are some of the factors that contribute to poor
performance?
• Do TFs that perform better at detecting their own binding sites
(Binding AUROC) perform better at assessing variant impact? No
• Models for some TFs with distinct binding specificities are
still unable to predict variant impact
• What else could potentially drive poor performance?
What drives TF-specific performance?
Alternative binding mechanisms explain performance differences
• A TF model can have less specificity at predicting
variant impact due to:
• Co-factors: TFs in larger complexes could have
different specificities
• Methylation: TFs that depend on methylation
for binding
• DNA shape: TFs that depend on shape of the
DNA
• PTMs: can regulate TF binding specificity (e.g.
in p53)
What drives TF-specific performance?
Take-home messages
• Predictions for certain TFs were consistently poor, and our investigation supports efforts to use
features beyond sequence, such as methylation, DNA shape, and post-translational modifications
• Cell type/cell line is also a confounding factor
Detecting TF-altering LoF variants in a genome
• Loss of binding does not necessarily imply phenotypic consequence
• How to assess performance of predictors wrt TFBS-altering variants that have a
phenotypic consequence?
• No large scale TF-specific datasets available
Detecting TF-altering LoF variants in a genome
• Manually curated 73 variants (11 gains
and 62 losses) with a phenotypic
consequence due to an altered TFBS
• 32 TFs and 36 phenotypes
• 35/73 (48%) of which have a
DeepSEA/DanQ/DeepBind ChIP-seq
binding model for the corresponding TF
• Scored variants against corresponding
DeepBind/DeepSEA/DanQ models
Detecting TF-altering LoF variants in a genome
• Also scored 10,000 randomly sampled
1000 Genomes variants with an AF > 5%
as a background set, which was used to define
an empirical p-value
• For a given score s of a curated variant,
p-value is computed using the number
of 1000G variants that have a score ≥ s
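The empirical p-value described above amounts to a simple fraction. The sketch below uses a made-up background of ten scores instead of 10,000 variants, and it assumes scores are oriented so that larger values mean a stronger predicted effect (for loss variants this could be the negated score); the function name is hypothetical.

```python
def empirical_p(score, background_scores):
    """Fraction of background variants scoring at least as high as `score`."""
    n_at_least = sum(1 for b in background_scores if b >= score)
    return n_at_least / len(background_scores)


# Hypothetical background scores (stand-in for common 1000 Genomes variants).
background = [0.1, 0.2, 0.05, 1.5, 0.3, 0.0, 0.7, 0.4, 2.5, 0.15]

print(empirical_p(1.0, background))  # two background scores are >= 1.0
print(empirical_p(3.0, background))  # no background score is >= 3.0
```

With this plain fraction a score above every background value yields a p-value of exactly 0. A common alternative, not stated in the slides, is to use (count + 1) / (N + 1) so that the smallest reportable p-value reflects the background size.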
Detecting TF-altering LoF variants in a genome
• 70% of variants had a p-value <0.05
• 67% of variants had a p-value <0.01
• 30% of variants had a p-value of <0.001
→ Predictors were able to identify the
majority of these variants accurately
Detecting TF-altering LoF variants in a genome
P-value-transformed values vs. model scores
• P-value transformation using a background set (e.g.
1000G) is common practice in assessing variant
impact
• Is it necessary?
• Across TFs, there is a strong linear relationship between
the raw score and the 1000G-transformed p-value (a)
• P-value transformation is not a necessity: one can
simply use a universal cut-off on the model's score
Understanding our ability to detect LoF TFBS
variants in a genome
• Need: representative set of variants that are unlikely to cause LoF
• Collected variants from four relatively healthy patients (PGP)
• Restricted to
• Haploinsufficient genes as defined by ExAC pLI scores (pLI > 0.90)
• Falling within 5kb of the TSS (core promoter + extended region)
• Rare: gnomAD AF < 1e-4
• Average total of 79 variants per sample
Understanding our ability to detect LoF TFBS
variants in a genome
• At a given cutoff, assess the % variants falling below (loss) or above (gain) that cutoff by at least one TF model
• For a given genome, at a cutoff of -1 “sweet spot” we are able to recover ~70% of curated variants with a
phenotype
• Maintain an average of ~15% false positive rate across four genomes (0.15 * 79 = ~12 variants)
• Similar for gains, although there are far fewer curated gain variants
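The cutoff analysis above can be sketched as follows for loss variants. The per-variant scores are invented for illustration and the function name is hypothetical; a variant counts as flagged when at least one TF model scores it below the cutoff.

```python
def fraction_flagged_loss(variant_scores, cutoff):
    """Share of variants with at least one TF-model score below `cutoff`.

    variant_scores maps a variant id to a non-empty list of scores,
    one per TF model (hypothetical input layout).
    """
    flagged = sum(1 for scores in variant_scores.values()
                  if min(scores) < cutoff)
    return flagged / len(variant_scores)


# Hypothetical DanQ-style Log FC scores per variant, one per TF model.
curated_loss = {"c1": [-2.3, 0.1], "c2": [-0.4, -0.2],
                "c3": [-1.5], "c4": [-3.0, -0.9]}
healthy_genome = {"h1": [0.0, -0.2], "h2": [-1.2, 0.3], "h3": [0.1],
                  "h4": [-0.5], "h5": [0.2, -0.1]}

cutoff = -1.0
recovered = fraction_flagged_loss(curated_loss, cutoff)
false_positive_rate = fraction_flagged_loss(healthy_genome, cutoff)
print(recovered, false_positive_rate)
```

Applied to the numbers on the slide, a false positive rate of about 15% over an average of 79 variants per genome corresponds to roughly 12 flagged variants per healthy individual.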
[Figure: two panels, Loss and Gain. x-axis: DanQ Log FC score cutoff (Loss: −6 to 0; Gain: 0 to 6). y-axis: proportion of variants with at least one model below the cutoff (Loss) or above the cutoff (Gain), from 0.00 to 1.00. Series: Curated variants gain (n=4), Curated variants loss (n=39), PGPC_0003 (n=79), PGPC_0004 (n=65), PGPC_0005 (n=113), PGPC_0007 (n=59).]
Summary and wrap-up
• ASB data presents a useful resource for benchmarking TF model variant-
impact predictions
• Models could be trained to maximise variant-impact performance instead
of binding performance
• Our compiled set of ASB data (~100,000 variants, 150,000 TF-variant
pairs) is the largest available and is freely available online in the
supplementary data of the biorxiv paper http://goo.gl/2wFQ9w
Summary and wrap-up
• PWMs do not perform well at predicting variant impact; DL methods
perform significantly better
• TFs do not perform uniformly at predicting variant impact!
• TFs with poor performance at assessing variant impact often rely on
additional mechanisms such as binding partners, methylation, DNA shape
and PTMs
• Incorporation of these mechanisms into training TF-binding models will
drastically increase TF-binding/variant-impact performance
Summary and wrap-up
• Analysis of genome for healthy individuals reveals that DL models based
purely on sequence specificity in their current state perform reasonably
well at identifying LoF variants caused by altered TFBSs, while minimising
false positive rates
Acknowledgements
Allele-specific transcription factor binding as a benchmark for assessing variant impact predictors
http://biorxiv.org/content/early/2018/02/01/253427
Omar Wagih, Daniele Merico, Andrew Delong, Brendan Frey
(Deep Genomics Inc.)

Introduction,importance and scope of horticulture.pptxIntroduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptx
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
 

CDAC 2018 Merico optimal scoring

  • 2. Preliminary Note: Mutational Background for Promoters and Enhancers. Factors that impact somatic mutation in promoters and enhancers: • Trinucleotide mutation probability → mutational mechanism • Open chromatin configuration → accessibility for repair • Transcription factor and nucleosome binding → accessibility for repair • Transcriptional activity → accessibility for repair Sabarinathan et al. Nucleotide excision repair is impaired by binding of transcription factors to DNA. Nature 2016 Perera et al. Differential DNA repair underlies mutation hotspots at active promoters in cancer genomes. Nature 2016 Polak et al. Reduced local mutation density at regulatory DNA of cancer genomes is linked to DNA repair. Nat Biotech 2014
  • 3. Transcription factor binding • Transcription factors (TFs) have key importance in regulating gene expression by binding to regulatory genomic elements (TFs demonstrate sequence-based specificities towards these binding sites) • Understanding the process of TF-DNA binding can help us understand the intricate process of gene regulation, develop actionable hypotheses that can be used in drug development/therapy, etc. • With the aid of technologies like ChIP-seq, SELEX and PBMs, many TF binding sites (TFBSs) have been characterized khanacademy.org
  • 4. Modelling transcription factor binding • Binding sites have been used to train computational models • Position weight matrices or PWMs (simplest models) • More complex machine learning (deep learning) approaches are able to learn far more complex patterns in binding sites • The performance by which TF binding models are able to distinguish their binding regions from random genomic regions has been well characterized Giraud et al. 2010 Alipanahi et al. 2015
  • 5. Genetic variation and transcription factor binding • Genetic variation falling within specificity determinants of TF binding sites (TFBSs) can alter binding by introducing novel binding sites or diminishing existing ones • This can result in a substantial impact on molecular phenotypes through changes in gene expression • PWMs and DL-based models have been used to assess the impact of variants on binding sites • These models have become an essential component of many variant prioritization pipelines
  • 6. Motivation • How well do binding prediction models perform at predicting the impact of variants? • Variants from the Human Genome Mutation Database (HGMD), genome-wide association studies (GWAS), and quantitative trait loci (QTL) studies have previously been used • Little has been done to explore the ability of these models to assess the impact of genetic variants on binding in a TF-specific manner • Not many curated datasets exist on variants impacting TFBSs • Allele-specific ChIP-seq data
  • 7. Allele-specific ChIP-seq data • Gather heterozygous mutations • ChIP-seq reads for a particular TF are mapped onto each allele of the diploid genome • Compare the read counts between the two parental chromosomes (using a binomial test) • Significant binomial test (Pbinomial < 0.01): allele-specific binding (ASB) variant, i.e. the variant impacts binding • Non-significant binomial test (Pbinomial > 0.5): non-allele-specific binding (non-ASB) variant, i.e. the variant has little to no impact on binding Chen et al. 2016
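The binomial-test labelling described on this slide can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: it assumes a null of equal binding to both alleles (p = 0.5), uses an exact two-sided test written with the standard library, and introduces the hypothetical names `binomial_two_sided_p` and `classify_variant`. Treating variants with 0.01 ≤ P ≤ 0.5 as "ambiguous" is an assumption; the slide only defines the two thresholds.

```python
from math import comb


def binomial_two_sided_p(ref_reads: int, alt_reads: int) -> float:
    """Exact two-sided binomial p-value under a null of p = 0.5.

    Sums the probability of every outcome that is no more likely
    than the observed split of reads between the two alleles.
    """
    n = ref_reads + alt_reads
    # comb(n, k) / 2**n is exact integer division to float, so it does
    # not underflow for large read counts.
    pmf = [comb(n, k) / 2**n for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(x for x in pmf if x <= observed * (1 + 1e-9)))


def classify_variant(ref_reads: int, alt_reads: int) -> str:
    """Label a heterozygous variant using the thresholds from the slide."""
    p_value = binomial_two_sided_p(ref_reads, alt_reads)
    if p_value < 0.01:
        return "ASB"
    if p_value > 0.5:
        return "non-ASB"
    return "ambiguous"  # assumption: neither label applies in between


print(classify_variant(30, 5))   # strongly skewed read counts
print(classify_variant(10, 10))  # balanced read counts
```

In practice a library routine such as `scipy.stats.binomtest` would normally replace the hand-written test; the standard-library version is used here only to keep the sketch self-contained.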
  • 8. A compendium of allele-specific binding events • Assess performance of binding predictors at predicting variant impact • Collect ASB data (read counts on heterozygous variants) • Compile TFBS predictors, score ASB and non-ASB
  • 9. A compendium of allele-specific binding events • Mapped reads for heterozygous variants were obtained from individual studies and not uniformly processed • To ensure reliability of cross-study read counts: correlate log(ref/alt) read ratios for overlapping ASB events between studies • Mean Pearson r = 0.79 • Conclusion: although read counts come from different studies, they remain in agreement
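The cross-study consistency check can be sketched as below. The read counts are made up for illustration, and the pseudocount of 1 (to avoid dividing by zero) and the base-2 logarithm are assumptions; the slide only states that log ref/alt reads were correlated with Pearson's r.

```python
from math import log2, sqrt


def log_ratio(ref_reads: int, alt_reads: int, pseudocount: float = 1.0) -> float:
    """log2 of ref/alt read counts; the pseudocount is an assumption."""
    return log2((ref_reads + pseudocount) / (alt_reads + pseudocount))


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical (ref, alt) counts for the same four variants in two studies.
study_a = [(40, 10), (12, 30), (25, 22), (8, 50)]
study_b = [(33, 9), (10, 21), (18, 20), (5, 31)]

ratios_a = [log_ratio(r, a) for r, a in study_a]
ratios_b = [log_ratio(r, a) for r, a in study_b]
print(pearson_r(ratios_a, ratios_b))
```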
  • 11. Properties of allele-specific binding data: ASB loss variants are under purifying selection • Assess the proportion of ASB/non-ASB variants that are rare with respect to ExAC, 1000G and ESP6500 • Loss ASB variants are under purifying selection (larger proportion of rare variants)
  • 12. Properties of allele-specific binding data: Non-coding variant impact predictors do not differentiate ASB from non-ASB • Several other non-coding predictors that do not take into account TF-binding motifs and instead utilise metrics such as conservation are not able to recapitulate the ASB/non-ASB distinction • i.e. knowledge of TF-specific binding specificity can help identify impactful non-coding variants
  • 13. Properties of allele-specific binding data: Take-home messages • Compiled largest known ASB dataset • Loss ASBs are under purifying selection and therefore of significant importance • Current non-coding variant impact predictors are unable to distinguish ASB variants • ASB data is suitable to assess performance of TF-binding models at predicting variant impact
  • 14. Performance of transcription factor binding variant impact predictions: Model collection • Collected pre-trained and trained models for TFs with ASB data from six different methods, ranging from simple methods (PWMs) to deep learning approaches 1. PWMs 2. DeepBind 3. DeepSEA 4. DanQ 5. GERV 6. gkmSVM
  • 15. Performance of transcription factor binding variant impact predictions: Model collection
      Method | Model type | No. of models for TFs with ASB data | Source data
      DeepBind | Pre-trained | 91 | ENCODE ChIP-seq
      DeepSEA | Pre-trained | 91 | ENCODE/RE ChIP-seq
      DanQ | Pre-trained | 91 | ENCODE/RE ChIP-seq
      gkmSVM | Trained | 91 | ENCODE ChIP-seq (same data used to train DeepBind models)
      GERV | Pre-trained | 60 | ENCODE ChIP-seq
      PWM - JASPAR | Pre-trained | 56 | JASPAR PWMs
      PWM - MEME ChIP | Trained | 87 | Over-represented motifs discovered by MEME-ChIP using DeepBind training data
  • 16. Performance of transcription factor binding variant impact predictions: Variant-impact metric definition
      Method | Metric | Description
      DeepSEA/DanQ | Log FC | Chromatin feature probability log fold changes
      DeepSEA/DanQ | Diff. | Chromatin feature probability differences
      gkmSVM | deltaSVM | Change in the sum of k-mer weights for wildtype and variant sequences
      GERV | GERV score | L2 norm of the difference between predicted ChIP-seq signal in a given window for the reference and the alternate allele
      DeepBind/PWMs | Max delta raw | Difference between raw model scores for reference and alternate alleles with the maximum absolute value across multiple windows
      DeepBind/PWMs | Delta max raw | Difference of the maximum reference and alternate raw model scores across multiple windows
      DeepBind/PWMs | Max delta Pbind | Difference between probability-transformed scores for reference and alternate alleles with the maximum absolute value across multiple windows
      DeepBind/PWMs | Pcomb | Signed likelihood of loss or gain, depending on which has the higher magnitude
      DeepBind/PWMs | Psum | Sum of the likelihoods of loss and gain, signed by effect size
      Legend entry on slide: Defined in this study
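The two raw-score metrics for DeepBind/PWMs can be illustrated with a toy PWM. This is a sketch under several assumptions that the slide does not state: only the forward strand is scored, the windows are all motif-length windows overlapping a single-nucleotide variant, and differences are taken as alternate minus reference so that negative values indicate loss of binding. The function names and the toy "ACG" matrix are hypothetical.

```python
def pwm_score(pwm, window):
    """Sum of per-position weights for one motif-length window."""
    return sum(column[base] for column, base in zip(pwm, window))


def window_scores(pwm, seq, var_idx):
    """Scores of every window of motif width that overlaps position var_idx."""
    width = len(pwm)
    first = max(0, var_idx - width + 1)
    last = min(var_idx, len(seq) - width)
    return [pwm_score(pwm, seq[s:s + width]) for s in range(first, last + 1)]


def max_delta_raw(pwm, ref_seq, alt_seq, var_idx):
    """Per-window (alt - ref) difference with the largest absolute value."""
    ref = window_scores(pwm, ref_seq, var_idx)
    alt = window_scores(pwm, alt_seq, var_idx)
    return max((a - r for r, a in zip(ref, alt)), key=abs)


def delta_max_raw(pwm, ref_seq, alt_seq, var_idx):
    """Best alt window score minus best ref window score."""
    ref = window_scores(pwm, ref_seq, var_idx)
    alt = window_scores(pwm, alt_seq, var_idx)
    return max(alt) - max(ref)


# Toy PWM preferring the motif "ACG": +1 for the preferred base, -1 otherwise.
toy_pwm = [{b: (1.0 if b == c else -1.0) for b in "ACGT"} for c in "ACG"]

ref_seq = "TTACGTT"
alt_seq = "TTATGTT"  # C>T at index 3 disrupts the motif
print(max_delta_raw(toy_pwm, ref_seq, alt_seq, 3))
print(delta_max_raw(toy_pwm, ref_seq, alt_seq, 3))
```

The two metrics agree in this example because a single window dominates. They can differ when a variant weakens one window while strengthening another.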
  • 17. Performance of transcription factor binding variant impact predictions: Performance measure • Loss ASB variants: Pbinomial < 0.01, ref_reads > alt_reads, and at least 10 total reads • Gain ASB variants: Pbinomial < 0.01, alt_reads > ref_reads, and at least 10 total reads • Non-ASB variants: Pbinomial > 0.50 and at least 10 total reads • Use models for TFs with ≥10 ASB/non-ASB variants • Measure AUROC/AUPRC
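AUROC for one TF model can be computed directly from its definition: the probability that a randomly chosen ASB variant receives a higher impact score than a randomly chosen non-ASB variant. The sketch below uses made-up scores and assumes scores have been oriented so that larger means stronger predicted impact (for loss variants this could mean negating an alt minus ref difference). The name `auroc` is hypothetical.

```python
def auroc(positive_scores, negative_scores) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical impact scores for one TF model.
asb_scores = [3.1, 2.4, 0.9, 1.8]      # loss ASB variants
non_asb_scores = [0.2, 1.1, 0.4, 0.9]  # non-ASB variants
print(auroc(asb_scores, non_asb_scores))
```

The pairwise loop is quadratic in the number of variants. For larger datasets a rank-based routine such as `sklearn.metrics.roc_auc_score` gives the same quantity more efficiently.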
  • 18. Performance of transcription factor binding variant impact predictions: PWM metrics performance • PWM metrics have similar AUROCs, with the exception of max delta raw • All other metrics have significantly higher AUROCs (p<1.26e-04) • JASPAR PWMs showed similar results (data not shown) • Due to maximising over multiple windows of a sequence, the score is often inflated
  • 19. Performance of transcription factor binding variant impact predictions: DeepBind/DeepSEA/DanQ metrics performance • DeepBind metrics have similar AUROCs • DeepSEA metrics have similar AUROCs • DanQ metrics have similar AUROCs • → Choice of metric has no clear impact on performance
  • 20. Performance of transcription factor binding variant impact predictions: Comparison of ML vs. PWM-based methods • For methods with multiple metrics, we picked one representative metric • PWMs → Delta max raw • DeepBind → Max delta raw • DeepSEA/DanQ → Log FC • Compare performance • gkmSVM/DeepBind/DeepSEA/DanQ all significantly outperformed PWMs (p<3.11e-03) [Figure: AUROC per method, performance for 34 TFs. Methods shown: GERV (GERV score), PWM (MEME, signif; Delta max raw), gkmSVM (deltaSVM), DeepBind (Max delta raw), DanQ (Log FC), DeepSEA (Log FC)]
  • 21. Performance of transcription factor binding variant impact predictions: Comparison of ML-based methods • DeepSEA performs slightly better than gkmSVM (p=0.022) and DeepBind (p=0.026) • DanQ performs significantly better than gkmSVM (p=0.044) and borderline significantly better than DeepBind (p=0.057)
  • 22. Performance of transcription factor binding variant impact predictions: Take-home messages • The choice of the scoring metric used in variant impact can often be critical to both interpretability and performance, particularly for PWMs • Deep learning-based methods significantly outperform other ML-based and PWM-based methods • Amongst deep learning methods, there is no clear winner with respect to significance, although DeepSEA/DanQ generally have higher performance
  • 23. What drives TF-specific performance? • TFs show highly variable performance in assessing variant impact • What are some of the factors that contribute to poor performance? • Do TFs that perform better at detecting their own binding sites (binding AUROC) perform better at assessing variant impact? No • Models for some TFs with distinct binding specificities are still unable to predict variant impact • What else could potentially drive poor performance?
  • 24. What drives TF-specific performance? Alternative binding mechanisms explain performance differences • A TF model can have less specificity at predicting variant impact due to: • Co-factors: TFs in larger complexes could have different specificities • Methylation: TFs that depend on methylation for binding • DNA shape: TFs that depend on the shape of the DNA • PTMs: can regulate TF binding specificity (e.g. in p53)
  • 25. What drives TF-specific performance? Take-home messages • Predictions for certain TFs were consistently poor, and our investigation supports efforts to use features beyond sequence, such as methylation, DNA shape, and post-translational modifications • Cell type/cell line is also a confounding factor
  • 26. Detecting TF-altering LoF variants in a genome • Loss of binding does not necessarily imply phenotypic consequence • How to assess performance of predictors with respect to TFBS-altering variants that have a phenotypic consequence? • No large-scale TF-specific datasets available
  • 27. Detecting TF-altering LoF variants in a genome • Manually curated 73 variants (11 gains and 62 losses) with a phenotypic consequence due to an altered TFBS, covering 32 TFs and 36 phenotypes • 35/73 (48%) of which have a DeepSEA/DanQ/DeepBind ChIP-seq binding model for the corresponding TF • Scored variants against corresponding DeepBind/DeepSEA/DanQ models
  • 28. Detecting TF-altering LoF variants in a genome • Also scored 10,000 randomly sampled 1000 Genomes variants with an AF > 5% as a background set, which was used to define an empirical p-value • For a given score s of a curated variant, the p-value is computed using the number of 1000G variants that have a score ≥ s
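The empirical p-value described on this slide amounts to the fraction of background variants scoring at least as high as the curated variant. A minimal sketch follows, with hypothetical names and made-up scores, assuming scores are oriented so that larger values mean a stronger predicted effect:

```python
def empirical_p_value(score: float, background_scores) -> float:
    """Fraction of background variants with a score >= the given score."""
    n_at_least = sum(1 for b in background_scores if b >= score)
    return n_at_least / len(background_scores)


# Hypothetical background of common-variant scores for one TF model.
background = [0.1, 0.3, 0.2, 1.5, 0.05, 0.4, 0.9, 0.0, 0.6, 2.2]
print(empirical_p_value(1.2, background))
```

With the 10,000 background variants mentioned on the slide, the smallest non-zero p-value this estimator can return is 1/10,000. A common variation, not stated on the slide, adds 1 to both numerator and denominator so that the estimate is never exactly zero.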
  • 29. Detecting TF-altering LoF variants in a genome • 70% of variants had a p-value <0.05 • 67% of variants had a p-value <0.01 • 30% of variants had a p-value <0.001 → Predictors were able to identify the majority of these variants accurately
  • 30. Detecting TF-altering LoF variants in a genome: P-value-transformed values vs. model scores • P-value transformation using a background set (e.g. 1000G) is common practice in assessing variant impact • Is it necessary? • Across the different TFs, there is a strong linear relationship between the raw score and the 1000G-transformed p-value (a) • P-value transformation is therefore not a necessity: one can simply use a universal cut-off on the model's score
  • 31. Understanding our ability to detect LoF TFBS variants in a genome • Need: representative set of variants that are unlikely to cause LoF • Collected variants from four relatively healthy patients (PGP) • Restricted to • Haploinsufficient genes as defined by ExAC pLI scores (pLI > 0.90) • Falling within 5kb of the TSS (core promoter + extended region) • Rare: gnomAD AF < 1e-4 • Average total of 79 variants per sample
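The three restrictions on this slide can be expressed as a simple filter. The record layout (`Variant` and its field names) is hypothetical, and treating "within 5kb" as an absolute distance of at most 5,000 bp on either side of the TSS is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene_pli: float       # ExAC pLI score of the nearest gene
    distance_to_tss: int  # signed distance to the TSS in base pairs
    gnomad_af: float      # gnomAD allele frequency


def passes_filters(v: Variant,
                   min_pli: float = 0.90,
                   max_tss_distance: int = 5000,
                   max_af: float = 1e-4) -> bool:
    """Keep rare promoter-proximal variants near haploinsufficient genes."""
    return (v.gene_pli > min_pli
            and abs(v.distance_to_tss) <= max_tss_distance
            and v.gnomad_af < max_af)


variants = [
    Variant(gene_pli=0.95, distance_to_tss=-1200, gnomad_af=5e-5),
    Variant(gene_pli=0.50, distance_to_tss=100, gnomad_af=1e-5),
    Variant(gene_pli=0.99, distance_to_tss=8000, gnomad_af=1e-5),
]
kept = [v for v in variants if passes_filters(v)]
print(len(kept))
```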
  • 32. Understanding our ability to detect LoF TFBS variants in a genome • At a given cutoff, assess the % of variants falling below (loss) or above (gain) that cutoff for at least one TF model • For a given genome, at the "sweet spot" cutoff of -1, we are able to recover ~70% of curated variants with a phenotype • Maintain an average false positive rate of ~15% across four genomes (0.15 * 79 = ~12 variants) • Similar for gains, although there are far fewer curated variants [Figure: two panels, Loss and Gain. x-axis: DanQ Log FC score cutoff; y-axis: proportion of variants with at least one model < cutoff (Loss) or > cutoff (Gain). Series: Curated variants loss (n=39), Curated variants gain (n=4), PGPC_0003 (n=79), PGPC_0004 (n=65), PGPC_0005 (n=113), PGPC_0007 (n=59). Annotations: 0.10, 0.05, 0.01]
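The cutoff analysis on this slide can be sketched as follows: a variant counts as flagged if at least one TF model scores it below the cutoff (loss) or above it (gain). The scores are made up, and the function name `fraction_flagged` is hypothetical; applying the same function to curated variants gives a recovery rate, and applying it to variants from a healthy genome gives an approximate false positive rate.

```python
def fraction_flagged(per_variant_scores, cutoff: float, direction: str = "loss") -> float:
    """Proportion of variants with at least one model score beyond the cutoff.

    per_variant_scores: one list of TF-model scores per variant.
    direction: "loss" flags scores below the cutoff, "gain" flags scores above.
    """
    def is_flagged(scores):
        if direction == "loss":
            return any(s < cutoff for s in scores)
        return any(s > cutoff for s in scores)

    flagged = sum(1 for scores in per_variant_scores if is_flagged(scores))
    return flagged / len(per_variant_scores)


# Hypothetical Log FC scores from two TF models per variant.
curated_loss = [[-2.0, 0.1], [-0.5, -1.5], [0.2, -0.3]]
healthy_genome = [[0.1, -0.2], [-1.2, 0.0], [0.3, 0.4], [-0.1, -0.6]]

print(fraction_flagged(curated_loss, cutoff=-1.0))    # recovery rate
print(fraction_flagged(healthy_genome, cutoff=-1.0))  # approx. false positive rate
```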
  • 33. Summary and wrap-up • ASB data presents a useful resource for benchmarking TF model variant-impact predictions • Models could be trained to maximise variant-impact performance instead of binding performance • Our compiled set of ASB data (~100,000 variants, 150,000 TF-variant pairs) is the largest available and is freely available online in the supplementary data of the bioRxiv paper http://goo.gl/2wFQ9w
  • 34. Summary and wrap-up • PWMs do not perform well at predicting variant impact; DL methods perform significantly better • TFs do not perform uniformly at predicting variant impact! • TFs with poor performance at assessing variant impact often rely on additional mechanisms such as binding partners, methylation, DNA shape and PTMs • Incorporation of these mechanisms into training TF-binding models will drastically increase TF-binding/variant-impact performance
  • 35. Summary and wrap-up • Analysis of genomes from healthy individuals reveals that DL models based purely on sequence specificity, in their current state, perform reasonably well at identifying LoF variants caused by altered TFBSs, while minimising false positive rates