November 5, 2019
How Well Can You Detect Difficult Variants? Benchmarking with Genome in a Bottle
www.slideshare.net/genomeinabottle
Why start Genome in a Bottle?
• A map of every individual’s genome will soon be possible, but how will we know if it is correct?
• Diagnostics and precision medicine require high levels of confidence
• Well-characterized, broadly disseminated genomes are needed to benchmark performance of sequencing
O’Rawe et al, Genome Medicine, 2013
https://doi.org/10.1186/gm432
Human Genome Sequencing needed a new class of Reference Materials with billions of reference values
By Russ London at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=9923576
GIAB has characterized 7 human genomes
• Pilot genome
– NA12878
• PGP Human Genomes
– Ashkenazi Jewish son
– Ashkenazi Jewish trio
– Chinese son
• Parents also characterized
National Institute of Standards & Technology
Report of Investigation
Reference Material 8391
Human DNA for Whole-Genome Variant Assessment
(Son of Eastern European Ashkenazim Jewish Ancestry)
This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists
of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess
performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human
genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell
Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak
of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer
(10 mM TRIS, 1 mM EDTA, pH 8.0).
This material is intended for assessing performance of human genome sequencing variant calling by obtaining
estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include
whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This
genomic DNA is intended to be analyzed in the same way as any other sample a lab would process and analyze
extracted DNA. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA
extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of
mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as
functional or clinical interpretation.
Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions
and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods
similar to those described in reference 1. An information value is considered to be a value that will be of interest and use to
the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe
and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available.
These data and genomic characterizations will be maintained over time as new data accrue and measurement and
informatics methods become available. The information values are given as a variant call file (vcf) that contains the
high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called
high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this
report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information
(NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is:
Open consent enables secondary reference samples to meet specific clinical needs
• >50 products now available based on broadly-consented, well-characterized GIAB PGP cell lines
• Genomic DNA + DNA spike-ins
• Clinical variants
• Somatic variants
• Difficult variants
• Clinical matrix (FFPE)
• Circulating tumor DNA
• Stem cells (iPSCs)
• Genome editing
• …
Design of our human genome reference values
• Reference Values: Benchmark Variant Calls and Benchmark Regions
• Benchmark Regions – regions in which the benchmark contains (almost) all the variants
• Query Variants – variants from any method being evaluated
• Variants outside benchmark regions are not assessed
• Matching variants assumed to be true positives
• Majority of variants unique to method should be false positives (FPs)
• Majority of variants unique to benchmark should be false negatives (FNs)
This does not directly give the accuracy of the reference values, but rather that they are fit for purpose.
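The comparison logic above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how GIAB or the GA4GH tools implement it: it assumes variants are already normalized tuples of (chrom, pos, ref, alt) and matches them exactly, whereas real comparison tools also reconcile different representations of the same variant. The function names `in_regions` and `classify` are hypothetical.

```python
from bisect import bisect_right


def in_regions(chrom, pos, regions):
    """Return True if a 1-based position lies inside a benchmark region.

    `regions` maps chromosome -> sorted, non-overlapping list of (start, end)
    intervals in BED convention (0-based start, exclusive end).
    """
    intervals = regions.get(chrom, [])
    zero_based = pos - 1
    # Index of the last interval whose start is <= the position.
    idx = bisect_right(intervals, (zero_based, float("inf"))) - 1
    return idx >= 0 and intervals[idx][0] <= zero_based < intervals[idx][1]


def classify(benchmark, query, regions):
    """Count TP, FP, and FN using exact matches inside benchmark regions.

    Variants outside the benchmark regions are dropped from both sets,
    so they are not assessed at all.
    """
    bench = {v for v in benchmark if in_regions(v[0], v[1], regions)}
    qry = {v for v in query if in_regions(v[0], v[1], regions)}
    return {
        "TP": len(bench & qry),   # matching variants
        "FP": len(qry - bench),   # unique to the method being evaluated
        "FN": len(bench - qry),   # unique to the benchmark
    }
```

Because exact tuple matching ignores representation differences, counts from a sketch like this would tend to overstate both FPs and FNs for indels and complex variants.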
GIAB Recently Published Resources for “Easier” Small Variants
Now using linked and long reads for difficult variants and regions
GIAB Public Data
• Linked Reads
– 10x Genomics
– Complete Genomics/BGI stLFR
– Hi-C
– Strand-seq (underway)
• Long Reads
– PacBio Continuous Long Reads
– PacBio Circular Consensus Seq
– Oxford Nanopore “ultralong”
– Promethion
GIAB Use Cases
• Develop structural variant benchmark
– bioRxiv 664623
• Diploid assembly of difficult regions like MHC
– On bioRxiv this week
• Expand small variant benchmark
– v4.0 draft available for testing
[Figure: structural variant size distributions in two panels, 50 to 1000 bp (Alu peak labeled) and 1kbp to 10kbp (LINE peak labeled)]
• Discovery: 498876 (296761 unique) calls >=50bp and 1157458 (521360 unique) calls >=20bp discovered in 30+ sequence-resolved callsets from 4 technologies for AJ Trio
• Compare SVs: 128715 sequence-resolved SV calls >=50bp after clustering sequence changes within 20% edit distance in trio
• Discovery Support: 30062 SVs with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support in trio
• Evaluate/genotype: 19748 SVs with consensus variant genotype from svviz in son
• Filter complex: 12745 SVs not within 1kb of another SV
• Regions: 9641 SVs inside 2.66 Gbp benchmark regions supported by diploid assembly
v0.6: tinyurl.com/GIABSV06
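To see how strongly each filtering stage narrows the candidate set, the counts above can be turned into retention fractions. The sketch below starts from the clustered calls (128715) because the earlier discovery counts are raw calls pooled across callsets and are not directly comparable. The counts are copied from the slide; the shortened stage labels and the `retention` helper are illustrative assumptions.

```python
# Counts copied from the slide; labels are shortened for display.
stages = [
    ("Clustered sequence-resolved SVs >=50bp", 128715),
    ("Discovery support", 30062),
    ("Evaluate/genotype", 19748),
    ("Filter complex", 12745),
    ("Inside benchmark regions", 9641),
]


def retention(stages):
    """Return rows of (label, count, fraction of previous stage, fraction of first stage)."""
    rows = []
    first = stages[0][1]
    previous = first
    for label, count in stages:
        rows.append((label, count, count / previous, count / first))
        previous = count
    return rows


for label, count, step, overall in retention(stages):
    print(f"{label:<42} {count:>7} {step:7.1%} {overall:7.1%}")
```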
Diploid assembly of MHC
Martin, et al., 2016, BioRxiv 085050.
Chin and Khalak, 2019, BioRxiv 705616
Alignments of assembly to reference
Two haplotigs (no gap) span through whole MHC region
Integrating assembly- and mapping-based calls gives best MHC benchmark
• MHC assembly-based bed includes 23187 variants in the MHC region, excluding:
– CYP21A2 and pseudogene
– Homopolymers >10bp
– SVs in assembly
– Very dense variants
• v4.0 mapping-based bed includes 13964 variants in the MHC region
• Only 11 differences between assembly and mapping based calls in both beds
– 2 genotyping errors in assembly-based
– 1 inaccurate complex allele and cluster of 8 missed variants in mapping-based
• Merged benchmark includes 23229 variants in the MHC region
• Covers most HLA genes and CYP21A2/TNXA/TNXB
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
     None              13899          13549         10          4     0.9993       0.9997     0.9995
These variants are fully phased through the MHC regions too!
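The precision, sensitivity, and F-measure in the summary row above can be recomputed from its four counts. The sketch below assumes that precision uses the query-side true positive count (True-pos-call) and sensitivity uses the benchmark-side count (True-pos-baseline); that pairing is an assumption about how the comparison tool reports its results, although it does reproduce the rounded values in the row. The function name `summarize` is hypothetical.

```python
def summarize(tp_baseline, tp_call, fp, fn):
    """Recompute summary metrics from comparison counts.

    Assumes precision is computed on the query representation (tp_call)
    and sensitivity on the benchmark representation (tp_baseline).
    """
    precision = tp_call / (tp_call + fp)
    sensitivity = tp_baseline / (tp_baseline + fn)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return precision, sensitivity, f_measure


precision, sensitivity, f_measure = summarize(13899, 13549, 10, 4)
print(f"{precision:.4f} {sensitivity:.4f} {f_measure:.4f}")
# prints "0.9993 0.9997 0.9995"
```

The two true positive counts differ because the same variation can be represented by different numbers of records in the benchmark and in the query.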
v4.0 benchmark uses 10x and CCS to include more bases, variants, and segmental duplications
                                      v4.0 GRCh37     v4.0 GRCh38
Base pairs                            2,504,027,936   2,509,269,277
Reference covered                     93.2%           91.03%
SNPs                                  3,323,773       3,314,941
Indels                                519,152         519,494
Base pairs in Segmental Duplications  64,300,499      73,819,342
[Figure: percent of reference covered (axis 80% to 95%) for GRCh37 v3.3.2, GRCh37 v4 draft, GRCh38 v3.3.2, and GRCh38 v4 draft]
v4.0 enables benchmarking in regions difficult for short reads
Example comparison of Illumina RTG VCF against benchmark sets
Subset                  v3.3.2 FNs  v4 draft FNs
All SNPs                8,594       30,229
Low mappability         6,708       25,295
Segmental duplications  1,429       14,008
v4.0 benchmark contains more variants in potentially medically-relevant regions
• v4.0 covers >90 % of the MHC region (CYP21A2 and all HLA genes except HLA-DRBx)
• Additional coding variants in other medically relevant genes: TSPEAR (31), LAMA5 (28), FCGBP (18), TPSAB1 (15), HSPG2 (13)
• From ACMG59, new variants in PMS2, RET, SCN5A, and TNNI3
“Medical Exome” (exons from OMIM, HGMD, ClinVar, UniProt)
                  Variants  Bases covered
Benchmark v3.3.2  8,209     12,821,160 (85.5 %)
Benchmark v4.0    9,527     13,748,850 (91.7 %)
Long range PCR + Sanger sequencing confirms new difficult variants in clinically tested exons
• Confirmed all 63 covered variants in CYP21A2, PMS2, TNXA, TNXB, C4A, C4B, DMBT1, STRC, and HSPG2
v4.0 covers most of PMS2
Now cover SMN1, but regions still excluded due to high CCS coverage
Some CR1 regions still excluded due to slightly high coverage
Should we make a targeted benchmark for difficult genes?
v4.0 still only covers ~22 % of “dark genes” for 100bp reads (Ebbert et al)
• Compare long read diploid assembly to mapping of short and long reads
• Manually curate and resolve discordant sites
• Which genes should we target?
• Exons and introns?
The road ahead...
2019
• Integration pipeline development for small and structural variants
• Manuscripts for small and structural variants
2020
• Difficult large variants
• Somatic sample development
• Germline samples from new ancestries
• Diploid assembly
2021+
• Somatic integration pipeline
• Somatic structural variation
• Large segmental duplications
• Centromere/telomere
• Diploid assembly benchmarking
• ...
Acknowledgment of many GIAB contributors
Government
Clinical Laboratories
Academic Laboratories
Bioinformatics developers
NGS technology developers
Reference samples
* Funders
For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google groups
GIAB slides, including 2019 Workshop slides: www.slideshare.net/genomeinabottle
Public, Unembargoed Data:
– http://www.nature.com/articles/sdata201625
– ftp://ftp-trace.ncbi.nlm.nih.gov/giab/
– github.com/genome-in-a-bottle
Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
– Web-based implementation at precision.fda.gov
– Best Practices at https://rdcu.be/bqpDT
Public workshops
– Next workshop planned for April 1-2, 2020 at Stanford University, CA, USA
Justin Zook: jzook@nist.gov
NIST postdoc opportunities available! Diploid assembly, cancer genomes, other ‘omics, …
Germline Variant Calling Benchmarking
Nathan Olson
nolson@nist.gov
AMP Reference Material
11/5/2019
Small Variant Benchmarking Highlights (TLDR)
Best practices for benchmarking germline variant calling
•https://rdcu.be/bVtIF
•Supplemental Table 2 summarizes best practices
Hap.py - best practices implementation
•Command line - https://github.com/Illumina/hap.py
•Graphical interface – https://precision.fda.gov/
HappyR – R package for hap.py results
•Github https://github.com/Illumina/happyR
Benchmarking Process
Best Practices Summary
• Benchmark Sets
• Stringency of variant comparison
• Variant comparison tools
• Manual Curation
• Metric Interpretation
• Stratifications
• Confidence Intervals
• Additional Benchmarking Approaches
Applying Best Practices
Benchmarking Demonstration
• Samples – GIAB AJ Trio
• Sequencing
– 2X150bp Illumina HiSeq
– 60X Coverage
• Variant Calling Pipeline*
– Mapping – BWA
– Variant Calling – GATK4
– Ref GRCh37
• Benchmarking with hap.py and GA4GH stratifications
* Run on precisionFDA, see https://rdcu.be/bVtIw for method details
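For readers running the benchmarking step locally rather than on precisionFDA, the sketch below assembles a hap.py command line in Python. It is an illustration only, not the exact invocation used for this demonstration: all file names are hypothetical placeholders, and the helper `build_happy_command` is not part of hap.py. The options shown (`-f` for the benchmark regions BED, `-r` for the reference FASTA, `-o` for the output prefix, `--stratification` for a stratification list) are hap.py command-line options; check the documentation for the installed version before relying on them.

```python
import shlex


def build_happy_command(truth_vcf, query_vcf, confident_bed, reference_fasta,
                        output_prefix, stratification_tsv=None):
    """Build a hap.py argument list; nothing is executed here."""
    cmd = [
        "hap.py", truth_vcf, query_vcf,
        "-f", confident_bed,      # benchmark (high-confidence) regions
        "-r", reference_fasta,    # reference genome, e.g. GRCh37
        "-o", output_prefix,      # prefix for all output files
    ]
    if stratification_tsv is not None:
        # Tab-separated list of stratification names and BED files.
        cmd += ["--stratification", stratification_tsv]
    return cmd


# Hypothetical file names, for illustration only.
command = build_happy_command(
    "benchmark.vcf.gz", "query.vcf.gz", "benchmark_regions.bed",
    "GRCh37.fa", "results/demo", "stratifications.tsv",
)
print(shlex.join(command))
```

Building the command as a list keeps it safe to pass to `subprocess.run` without shell quoting problems.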
GA4GH Benchmarking Tool
Check that discrepancies are errors in the query callset.
https://bit.ly/2JKUT4w
Stratified Performance Metrics
• Plot 1 minus the metric on a log10 scale for better separation. Here lower is better.
• Precision = TP/(TP + FP)
• Recall = TP/(TP + FN)
• Confidence intervals indicate uncertainty and account for differences in number of variants per stratification.
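The metrics above can be sketched for a single stratification as follows. The slides do not say which interval method is used, so the Wilson score interval here is an assumption chosen for illustration; the function names are hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def log10_one_minus(metric):
    """log10(1 - metric); lower is better, and a perfect score maps to -inf."""
    return math.log10(1 - metric) if metric < 1 else float("-inf")


def stratum_metrics(tp, fp, fn):
    """Precision and recall with intervals and log-scaled error rates."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "precision_ci": wilson_interval(tp, tp + fp),
        "recall_ci": wilson_interval(tp, tp + fn),
        "log10_1_minus_precision": log10_one_minus(precision),
        "log10_1_minus_recall": log10_one_minus(recall),
    }
```

With the same observed proportion, a stratification containing few variants gets a wider interval than one containing many, which is why the intervals matter when comparing strata of very different sizes.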
Stratification Scatter Plot
(Optional) Optimization – Identifying biases responsible for performing stratifications.
Take Home Messages
• Krusche et al. URL, is a great resource for germline small variant benchmarking.
• GA4GH benchmarking tool available at precisionFDA.gov and github URL
• Appropriate data visualizations (EDA) are critical to interpreting benchmarking results.
• Use manual curation to evaluate benchmarking results
• We are actively working on developing resources for benchmarking small variants against GRCh38 and Structural Variants
Acknowledgements
GA4GH Benchmarking Team
Genome In A Bottle Consortium
NIST GIAB Team
Justin Zook
Jennifer McDaniel
Justin Wagner
Questions - nolson@nist.gov

GIAB for AMP GeT-RM Forum

  • 1. November 5, 2019 How Well Can You Detect Difficult Variants? Benchmarking with Genome in a Bottle www.slideshare.net/genomeinabottle
  • 2. Why start Genome in a Bottle? • A map of every individual’s genome will soon be possible, but how will we know if it is correct? • Diagnostics and precision medicine require high levels of confidence • Well-characterized, broadly disseminated genomes are needed to benchmark performance of sequencing O’Rawe et al, Genome Medicine, 2013 https://doi.org/10.1186/gm432
  • 3. Human Genome Sequencing needed a new class of Reference Materials with billions of reference values By Russ London at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=9923576
  • 4. GIAB has characterized 7 human genomes • Pilot genome – NA12878 • PGP Human Genomes – Ashkenazi Jewish son – Ashkenazi Jewish trio – Chinese son • Parents also characterized National I nstituteof S tandards & Technology Report of I nvestigation Reference Material 8391 Human DNA for Whole-Genome Variant Assessment (Son of Eastern European Ashkenazim Jewish Ancestry) This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer (10 mM TRIS, 1 mM EDTA, pH 8.0). This material is intended for assessing performance of human genome sequencing variant calling by obtaining estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This genomic DNA is intended to be analyzed in the same way as any other sample a lab would process and analyze extracted DNA. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as functional or clinical interpretation. 
Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods similar to those described in reference 1. An information value is considered to be a value that will be of interest and use to the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available. These data and genomic characterizations will be maintained over time as new data accrue and measurement and informatics methods become available. The information values are given as a variant call file (vcf) that contains the high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information (NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is:
  • 5. Open consent enables secondary reference samples to meet specific clinical needs • >50 products now available based on broadly-consented, well-characterized GIAB PGP cell lines • Genomic DNA + DNA spike-ins • Clinical variants • Somatic variants • Difficult variants • Clinical matrix (FFPE) • Circulating tumor DNA • Stem cells (iPSCs) • Genome editing • …
  • 6. Design of our human genome reference values Benchmark Variant Calls
  • 7. Benchmark Regions – regions in which the benchmark contains (almost) all the variants Benchmark Variant Calls Design of our human genome reference values
  • 8. Reference Values Benchmark Variant Calls Design of our human genome reference values Benchmark Regions
  • 9. Variants from any method being evaluated Design of our human genome reference values Benchmark Regions Benchmark Variant Calls
  • 10. Benchmark Regions Variants outside benchmark regions are not assessed Majority of variants unique to method should be false positives (FPs) Majority of variants unique to benchmark should be false negatives (FNs) Matching variants assumed to be true positives Variants from any method being evaluated Benchmark Variant Calls Design of our human genome reference values
  • 11. Benchmark Variant Calls Query Variants Benchmark Regions Variants outside benchmark regions are not assessed Majority of variants unique to method should be false positives (FPs) Majority of variants unique to benchmark should be false negatives (FNs) Matching variants assumed to be true positives This does not directly give the accuracy of the reference values, but rather that they are fit for purpose. Design of our human genome reference values
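The comparison logic on slides 6 to 11 can be sketched in a few lines of Python. This is a minimal illustration, assuming exact matching on chromosome, position, and alleles; the `Variant` type, the `compare` function, and the toy coordinates are hypothetical and not part of any GIAB tool. Real comparison tools such as hap.py or vcfeval also reconcile different representations of the same variant, which this sketch does not attempt.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int  # 1-based position, as in a VCF
    ref: str
    alt: str


def in_regions(variant, regions):
    """True if the variant lies in any (chrom, start, end) region.

    Regions follow BED conventions: 0-based start, end-exclusive.
    """
    return any(
        chrom == variant.chrom and start < variant.pos <= end
        for chrom, start, end in regions
    )


def compare(benchmark, query, regions):
    """Classify calls inside the benchmark regions by exact matching."""
    bench = {v for v in benchmark if in_regions(v, regions)}
    qry = {v for v in query if in_regions(v, regions)}
    return {
        "TP": bench & qry,  # matching variants, assumed true positives
        "FN": bench - qry,  # unique to benchmark, mostly false negatives
        "FP": qry - bench,  # unique to query, mostly false positives
        "not_assessed": {v for v in query if not in_regions(v, regions)},
    }


# Toy data: one benchmark region and three calls per callset (all made up).
regions = [("chr1", 100, 200)]
benchmark = [
    Variant("chr1", 150, "A", "G"),
    Variant("chr1", 180, "C", "T"),
    Variant("chr1", 500, "G", "A"),  # outside the benchmark regions
]
query = [
    Variant("chr1", 150, "A", "G"),
    Variant("chr1", 160, "T", "C"),
    Variant("chr1", 900, "G", "C"),  # outside the benchmark regions
]
result = compare(benchmark, query, regions)
for label, variants in result.items():
    print(label, sorted(variants))
```

Calls outside the benchmark regions land in `not_assessed` and never count as errors, which is why the size of the benchmark regions matters as much as the calls themselves.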
  • 12. GIAB Recently Published Resources for “Easier” Small Variants
  • 13. Now using linked and long reads for difficult variants and regions GIAB Public Data • Linked Reads – 10x Genomics – Complete Genomics/BGI stLFR – Hi-C – Strand-seq (underway) • Long Reads – PacBio Continuous Long Reads – PacBio Circular Consensus Seq – Oxford Nanopore “ultralong” – PromethION GIAB Use Cases • Develop structural variant benchmark – bioRxiv 664623 • Diploid assembly of difficult regions like MHC – On bioRxiv this week • Expand small variant benchmark – v4.0 draft available for testing
  • 14. [Figure: SV size distributions in two panels, 50 to 1000 bp and 1 kbp to 10 kbp, with Alu and LINE labeled] • Discovery: 498876 (296761 unique) calls >=50bp and 1157458 (521360 unique) calls >=20bp discovered in 30+ sequence-resolved callsets from 4 technologies for AJ Trio • Compare SVs: 128715 sequence-resolved SV calls >=50bp after clustering sequence changes within 20% edit distance in trio • Discovery Support: 30062 SVs with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support in trio • Evaluate/genotype: 19748 SVs with consensus variant genotype from svviz in son • Filter complex: 12745 SVs not within 1kb of another SV • Regions: 9641 SVs inside 2.66 Gbp benchmark regions supported by diploid assembly • v0.6 tinyurl.com/GIABSV06
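The "Filter complex" step on slide 14 keeps only SVs that are not within 1 kb of another SV. A minimal Python sketch of that idea follows. The `SV` type, the function names, and the way distance is measured (gap between interval ends, with exactly 1000 bp counted as "within") are assumptions for illustration; the slide does not say how the GIAB pipeline measures distance.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int
    end: int


def gap(a, b):
    """Gap in bp between two intervals on one chromosome (0 if they overlap)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def filter_isolated(svs, min_gap=1000):
    """Keep SVs with no other SV on the same chromosome within min_gap bp.

    Quadratic in the number of SVs; fine for a sketch, not for a full callset.
    """
    ordered = sorted(svs)
    kept = []
    for i, sv in enumerate(ordered):
        has_neighbor = any(
            j != i and other.chrom == sv.chrom and gap(sv, other) <= min_gap
            for j, other in enumerate(ordered)
        )
        if not has_neighbor:
            kept.append(sv)
    return kept


# Toy coordinates (made up): the first two SVs are 400 bp apart.
svs = [
    SV("chr1", 1000, 1100),
    SV("chr1", 1500, 1600),
    SV("chr1", 5000, 5300),
    SV("chr2", 1200, 1300),
]
isolated = filter_isolated(svs)
print(isolated)
```

Both members of a close pair are dropped, which matches the slide's wording that the retained SVs are those with no neighbor nearby.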
  • 15. Diploid assembly of MHC Martin, et al., 2016 BioRxiv 085050. Chin and Khalak, 2019, BioRxiv 705616
  • 16. Alignments of assembly to reference • Two haplotigs (no gap) span the whole MHC region
  • 17. Integrating assembly- and mapping-based calls gives best MHC benchmark • MHC assembly-based bed includes 23187 variants in the MHC region, excluding: – CYP21A2 and pseudogene – Homopolymers >10bp – SVs in assembly – Very dense variants • v4.0 mapping-based bed includes 13964 variants in the MHC region • Only 11 differences between assembly and mapping based calls in both beds – 2 genotyping errors in assembly-based – 1 inaccurate complex allele and cluster of 8 missed variants in mapping-based • Merged benchmark includes 23229 variants in the MHC region • Covers most HLA genes and CYP21A2/TNXA/TNXB
    Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
    ----------------------------------------------------------------------------------------------------
    None                   13899          13549         10          4     0.9993       0.9997     0.9995
    These variants are fully phased through the MHC regions too!
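The precision, sensitivity, and F-measure reported on slide 17 can be reproduced from the four counts in the same comparison output. The sketch below assumes precision is computed from the query-side true positives (True-pos-call) and sensitivity from the benchmark-side true positives (True-pos-baseline); the function name is hypothetical, and the formulas are given here because they reproduce the slide's rounded values.

```python
def comparison_metrics(tp_baseline, tp_call, fp, fn):
    """Precision, sensitivity, and F-measure from comparison counts."""
    precision = tp_call / (tp_call + fp)
    sensitivity = tp_baseline / (tp_baseline + fn)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return precision, sensitivity, f_measure


# Counts from the slide 17 comparison output.
precision, sensitivity, f_measure = comparison_metrics(
    tp_baseline=13899, tp_call=13549, fp=10, fn=4
)
print(f"{precision:.4f} {sensitivity:.4f} {f_measure:.4f}")
# prints: 0.9993 0.9997 0.9995
```

The two true-positive counts differ because the same matched variants can be represented by a different number of records in the benchmark and in the query callset.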
  • 18. v4.0 benchmark uses 10x and CCS to include more bases, variants, and segmental duplications
                                            v4.0 GRCh37      v4.0 GRCh38
    Base pairs                              2,504,027,936    2,509,269,277
    Reference covered                       93.2%            91.03%
    SNPs                                    3,323,773        3,314,941
    Indels                                  519,152          519,494
    Base pairs in Segmental Duplications    64,300,499       73,819,342
    [Figure: bar chart of percent of reference covered for GRCh37 v3.3.2, GRCh37 v4 draft, GRCh38 v3.3.2, and GRCh38 v4 draft]
  • 19. v4.0 enables benchmarking in regions difficult for short reads • Example comparison of Illumina RTG VCF against benchmark sets
    Subset                    v3.3.2 FNs    v4 draft FNs
    All SNPs                  8,594         30,229
    Low mappability           6,708         25,295
    Segmental duplications    1,429         14,008
  • 20. v4.0 benchmark contains more variants in potentially medically-relevant regions • v4.0 covers >90 % of the MHC region (CYP21A2 and all HLA genes except HLA-DRBx) • Additional coding variants in other medically relevant genes: TSPEAR (31), LAMA5 (28), FCGBP (18), TPSAB1 (15), HSPG2 (13) • From ACMG59, new variants in PMS2, RET, SCN5A, and TNNI3
    “Medical Exome” (exons from OMIM, HGMD, ClinVar, UniProt)
                        Variants    Bases covered
    Benchmark v3.3.2    8,209       12,821,160 (85.5 %)
    Benchmark v4.0      9,527       13,748,850 (91.7 %)
  • 21. Long range PCR + Sanger sequencing confirms new difficult variants in clinically tested exons • Confirmed all 63 covered variants in CYP21A2, PMS2, TNXA, TNXB, C4A, C4B, DMBT1, STRC, and HSPG2
  • 23. Now cover SMN1, but regions still excluded due to high CCS coverage
  • 24. Some CR1 regions still excluded due to slightly high coverage
  • 25. Should we make a targeted benchmark for difficult genes? v4.0 still only covers ~22 % of “dark genes” for 100bp reads (Ebbert et al) • Compare long read diploid assembly to mapping of short and long reads • Manually curate and resolve discordant sites • Which genes should we target? • Exons and introns?
  • 26. The road ahead... • 2019 – Integration pipeline development for small and structural variants – Manuscripts for small and structural variants • 2020 – Difficult large variants – Somatic sample development – Germline samples from new ancestries – Diploid assembly • 2021+ – Somatic integration pipeline – Somatic structural variation – Large segmental duplications – Centromere/telomere – Diploid assembly benchmarking – ...
  • 27. Acknowledgment of many GIAB contributors • Government • Clinical Laboratories • Academic Laboratories • Bioinformatics developers • NGS technology developers • Reference samples (* = Funders)
  • 28. For More Information www.genomeinabottle.org - sign up for general GIAB and Analysis Team google groups GIAB slides, including 2019 Workshop slides: www.slideshare.net/genomeinabottle Public, Unembargoed Data: – http://www.nature.com/articles/sdata201625 – ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ – github.com/genome-in-a-bottle Global Alliance Benchmarking Team – https://github.com/ga4gh/benchmarking-tools – Web-based implementation at precision.fda.gov – Best Practices at https://rdcu.be/bqpDT Public workshops – Next workshop planned for April 1-2, 2020 at Stanford University, CA, USA Justin Zook: jzook@nist.gov NIST postdoc opportunities available! Diploid assembly, cancer genomes, other ‘omics, …
  • 29. Germline Variant Calling Benchmarking Nathan Olson nolson@nist.gov AMP Reference Material 11/5/2019
  • 30. Small Variant Benchmarking Highlights (TLDR) • Best practices for benchmarking germline variant calling – https://rdcu.be/bVtIF – Supplemental Table 2 summarizes best practices • hap.py: best practices implementation – Command line: https://github.com/Illumina/hap.py – Graphical interface: https://precision.fda.gov/ • happyR: R package for hap.py results – GitHub: https://github.com/Illumina/happyR
  • 35. Benchmarking Demonstration • Samples – GIAB AJ Trio • Sequencing • 2X150bp Illumina HiSeq • 60X Coverage • Variant Calling Pipeline* • Mapping – BWA • Variant Calling – GATK4 • Ref GRCh37 • Benchmarking with hap.py and GA4GH stratifications * Run on precisionFDA, see https://rdcu.be/bVtIw for method details
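The final step of the demonstration, comparing the query VCF to the benchmark with hap.py, could be assembled from Python roughly as below. This is a sketch under assumptions: every file name is a hypothetical placeholder, and only the commonly documented hap.py options are used (positional truth and query VCFs, `-f` for the benchmark regions, `-r` for the reference, `-o` for the output prefix, `--stratification` for the stratification list). The exact command used on precisionFDA is not given in the slides.

```python
import shlex


def happy_command(truth_vcf, query_vcf, benchmark_bed, reference_fasta,
                  out_prefix, stratification_tsv=None):
    """Build a hap.py argument list; paths are supplied by the caller."""
    cmd = [
        "hap.py", truth_vcf, query_vcf,
        "-f", benchmark_bed,    # benchmark (confident) regions
        "-r", reference_fasta,  # reference the VCFs were called against
        "-o", out_prefix,       # prefix for the output files
    ]
    if stratification_tsv:
        # TSV listing stratification names and their BED files
        cmd += ["--stratification", stratification_tsv]
    return cmd


# Hypothetical file names for the AJ son on GRCh37.
cmd = happy_command(
    truth_vcf="HG002_benchmark.vcf.gz",
    query_vcf="HG002_bwa_gatk4.vcf.gz",
    benchmark_bed="HG002_benchmark_regions.bed",
    reference_fasta="GRCh37.fa",
    out_prefix="happy_out/HG002",
    stratification_tsv="ga4gh_stratifications.tsv",
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it on a machine where hap.py is installed; building the list separately keeps the arguments easy to inspect and log first.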
  • 37. Check that discrepancies are errors in the query callset. https://bit.ly/2JKUT4w
  • 39. Stratified Performance Metrics • Plot 1 − metric on a log10 scale for better separation; here lower is better. • Precision = TP/(TP + FP) • Recall = TP/(TP + FN) • Confidence intervals indicate uncertainty and account for differences in number of variants per stratification.
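A small Python sketch of the two ideas on slide 39: a binomial confidence interval per stratification and the log10 transform of 1 minus the metric. The editor's notes mention binconf for the uncertainties; the sketch uses a Wilson score interval as one common choice, which is an assumption since the slides do not state which interval was plotted. The stratification names and counts are made up.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def log10_error(metric):
    """log10(1 - metric); lower is better. Undefined when metric == 1."""
    return math.log10(1 - metric)


# Hypothetical (TP, FN) counts per stratification.
strata = {"all": (9990, 10), "low_mappability": (950, 50)}
for name, (tp, fn) in strata.items():
    recall = tp / (tp + fn)
    low, high = wilson_interval(tp, tp + fn)
    print(f"{name}: recall={recall:.4f} "
          f"CI=({low:.4f}, {high:.4f}) log10(1-recall)={log10_error(recall):.2f}")
```

On the transformed scale a recall of 0.999 and a recall of 0.95 sit near -3 and -1.3, so strata that look identical on a 0 to 1 axis separate clearly, and the interval widens as the number of variants in a stratification shrinks.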
  • 42. Take Home Messages • Krusche et al. (URL) is a great resource for germline small variant benchmarking. • GA4GH benchmarking tool is available at precisionFDA.gov and GitHub (URL). • Appropriate data visualizations (EDA) are critical to interpreting benchmarking results. • Use manual curation to evaluate benchmarking results. • We are actively working on developing resources for benchmarking small variants against GRCh38 and structural variants.
  • 43. Acknowledgements GA4GH Benchmarking Team Genome In A Bottle Consortium NIST GIAB Team Justin Zook Jennifer McDaniel Justin Wagner Questions - nolson@nist.gov

Editor's notes

  1. https://www.biorxiv.org/content/biorxiv/early/2016/11/14/085050.full.pdf Fast Assembly / Fast Iteration
  2. false-negatives (FN) : variants present in the truth set, but missed in the query.
  3. 3_79181930 Add this from what lindsey sent on slack
  4. This is a good slide for 644: give a clinical anecdote Also numbers - attendance, publications, data, RM unit sales Reference sample distributors How much money from IAA? - sustained funding Quantify collaborators' input GIAB steering committee Examples of others contributing data, analyses How to describe emails
  5. Why benchmark and when: validating and optimizing a measurement process, e.g. a clinical lab validating an NGS pipeline; NGS and bioinformatic pipeline development; NGS process QC.
  6. Generate variant calls for a sample with benchmarks: start with DNA or publicly available datasets; the starting point depends on what you are benchmarking or optimizing. Compare query variant calls to the truth callset. Evaluate results. (Optional) Use results to optimize the measurement process.
  7. Subset of FPs and FNs. Multiple technologies. Relevant annotations.
  8. Zoom in to show what is in the table. Show and define metrics.
  9. Add metric definitions. Recreate plot using binconf for uncertainties. Plot on a 1 minus metric log10 scale for better separation. Here lower is better. Confidence intervals indicate uncertainty and account for differences in number of variants per stratification.
  10. Help identify poor performing stratifications. Plot 1-Metric on a log-scale for better separation of stratifications with metric values close to 1.
  11. Example IGV with hypothesis of error source. Analysis to do: IGV with PCR and PCR-free data, CCS, 10X, variant calls, benchmark, relevant stratifications.