GIAB ASHG 2019 Structural Variant poster

Discovery: 498876 (296761 unique) calls >=50bp and 1157458 (521360 unique) calls >=20bp discovered in 30+ sequence-resolved callsets from 4 technologies for AJ Trio
Compare SVs: 128715 sequence-resolved SV calls >=50bp after clustering sequence changes within 20% edit distance in trio
Discovery Support: 30062 SVs with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support in trio
Evaluate/genotype: 19748 SVs with consensus variant genotype from svviz in son
Filter complex: 12745 SVs not within 1kb of another SV
Regions: 9641 SVs inside 2.66 Gbp benchmark regions supported by diploid assembly
v0.6
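The "Filter complex" step keeps only SVs that are not within 1kb of another SV. A minimal Python sketch of such an isolation filter is given here; the `SV` tuple, the `filter_isolated` name, and the choice to measure distance as the gap between reference intervals (with "within 1kb" read as a gap below 1000 bp) are illustrative assumptions, not the GIAB implementation.

```python
from typing import Iterable, List, NamedTuple


class SV(NamedTuple):
    """Hypothetical minimal SV record (not a GIAB data structure)."""
    chrom: str
    start: int  # reference start coordinate
    end: int    # reference end coordinate (equal to start for a pure insertion)


def filter_isolated(svs: Iterable[SV], min_gap: int = 1000) -> List[SV]:
    """Keep SVs whose reference interval is at least min_gap bp from every other SV."""
    by_chrom = {}
    for sv in svs:
        by_chrom.setdefault(sv.chrom, []).append(sv)

    kept = []
    for chrom_svs in by_chrom.values():
        chrom_svs.sort(key=lambda s: (s.start, s.end))
        max_end_so_far = None  # largest end among SVs sorted before the current one
        for i, sv in enumerate(chrom_svs):
            # Closest SV on the left is the one with the largest end seen so far.
            near_left = (
                max_end_so_far is not None and sv.start - max_end_so_far < min_gap
            )
            # Closest SV on the right is the next one in start order.
            near_right = (
                i + 1 < len(chrom_svs)
                and chrom_svs[i + 1].start - sv.end < min_gap
            )
            if not (near_left or near_right):
                kept.append(sv)
            max_end_so_far = (
                sv.end if max_end_so_far is None else max(max_end_so_far, sv.end)
            )
    return kept


calls = [
    SV("1", 1000, 1100),
    SV("1", 1500, 1600),  # 400 bp from the previous call, so both are dropped
    SV("1", 5000, 5050),
    SV("2", 1200, 1200),
]
isolated = filter_isolated(calls)
```

Both members of a close pair are removed, which matches the intent of setting aside clustered or complex sites rather than choosing one representative.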
Introduction
A robust benchmark for human germline structural variants
Justin Zook,1 Lesley Chapman,1 Nancy Hansen,3 Fritz J. Sedlazeck,4 Aaron Wenger,5 Adam English,6 Chunlin Xiao,7 John Oliver,8 Joyce Lee,9 Alex Hastie,9 Ian Fiddes,10 Alvaro Barrio,10 Tobias Marschall,11 Mark Chaisson,12 John Farrell,13 Andrew Carroll,14 Paul C. Boutros,15,16 Iman Hajirasouliha,17 Christopher E. Mason,17 Sayed Mohammad Ebrahim Sahraeian,18 Marc Salit,2 and many other members of the Genome in a Bottle Consortium
(1) National Institute of Standards and Technology; (2) Joint Initiative for Metrology in Biology; (3) NHGRI/NIH; (4) Baylor College of Medicine; (5) Pacific Biosciences; (6) Spiral Genetics;
(7) NCBI/NIH; (8) Nabsys; (9) BioNano Genomics; (10) 10x Genomics; (11) Max Planck Institute; (12) University of Southern California; (13) Boston University Medical School; (14) Google; (15)
University of California, Los Angeles; (16) Ontario Institute for Cancer Research; (17) Weill Cornell Medicine; (18) Roche Sequencing Solutions
• NIST has hosted the Genome in a Bottle Consortium to develop
authoritatively-characterized, human genome Reference Materials
that are an enduring resource for benchmarking variant calls
Integrating data to form benchmark calls
Ongoing and Future GIAB Work
• Diploid assembly-based benchmarks
• Using long & linked reads in difficult-to-map regions
• Complex and clustered variants
• New collaborations to characterize difficult regions and
variants in these genomes are welcome! Email jzook@nist.gov
Crowd-sourced manual curation vs. benchmark set
Benchmark calls are strongly supported
Zook et al., Scientific Data, 2016.
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data
Our benchmark sets are useful in evaluating
multiple technologies
2012
• No human benchmark calls available
• GIAB Consortium formed
2014
• Small variant genotypes for ~77% of pilot genome NA12878
2015
• NIST releases first human genome Reference Material
2016
• 4 new genomes
• Small variants for ~90% of 7 genomes for GRCh37/38
2018
• Draft SV benchmark
• Difficult to map regions
2019+
• Characterizing difficult variants and regions
• Assembly benchmarks
• Cancer
Benchmark described in https://doi.org/10.1101/664623
• Goal: When comparing any callset
to our vcf within the bed, most
putative FPs and FNs should be
errors in the tested callset
• We benchmarked several callsets
from assembly-based and non-
assembly-based methods with short
and long reads.
• Upon manual curation, for most callsets the majority
of FPs and FNs were errors in
the tested callset
• Exception: FP insertions from pbsv,
suggesting we may miss ~5%
of true insertions
• Exception: One FP insertion from
Bionano was correctly called as larger than in the benchmark
github.com/nspies/svviz2
doi.org/10.1101/581264
[Figure panels: 50 to 1000 bp; 1kbp to 10kbp. Labeled features: Alu, LINE]
• Candidates examined by
11 curators on average
• 627/635 consensus
manual curations agreed
with v0.6 genotype in
benchmark regions
• Most “discordant” sites
related to inclusion of
20-49bp indels in
curation
github.com/spiralgenetics/truvari
Short reads
• Illumina
• Complete Genomics
Long reads
• PacBio (CLR and CCS)
• Oxford Nanopore
• Promethion
• “Ultralong”
Linked reads
• 10x Genomics
• 6kb Mate-pair
• HiC
• stLFR
Optical/electronic mapping
• BioNano
• Nabsys
Public GIAB Data
Short reads have limitations
for large insertions and SVs in
tandem repeats
Trio Mendelian genotype
violation rate
20/7973 = 0.3%
Only 2 violations likely to
be errors in HG002
(Excludes X/Y and sites
with no GT in a parent)
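The trio check above can be illustrated with a small Python sketch. It assumes diploid autosomal genotypes written as allele pairs (0 = reference, 1 = variant) and skips sites with a missing genotype, mirroring the stated exclusions; the function names and the input layout are hypothetical and are not taken from the GIAB pipeline.

```python
from itertools import product
from typing import Iterable, Optional, Tuple

Genotype = Optional[Tuple[int, int]]  # None marks a missing genotype


def is_mendelian_consistent(child, father, mother) -> bool:
    """True if the child can inherit one allele from each parent."""
    target = tuple(sorted(child))
    return any(
        tuple(sorted((f_allele, m_allele))) == target
        for f_allele, m_allele in product(father, mother)
    )


def violation_counts(
    sites: Iterable[Tuple[Genotype, Genotype, Genotype]]
) -> Tuple[int, int]:
    """Return (violations, assessed sites) over (child, father, mother) triples."""
    violations = assessed = 0
    for child, father, mother in sites:
        if child is None or father is None or mother is None:
            continue  # sites with no genotype are not assessed
        assessed += 1
        if not is_mendelian_consistent(child, father, mother):
            violations += 1
    return violations, assessed


example_sites = [
    ((0, 1), (0, 0), (1, 1)),  # consistent: ref from father, variant from mother
    ((1, 1), (0, 0), (0, 1)),  # violation: father carries no variant allele
    ((0, 1), None, (0, 1)),    # skipped: no genotype in a parent
]
violations, assessed = violation_counts(example_sites)

# The rate reported on the poster, rounded to one decimal place.
poster_rate = f"{20 / 7973:.1%}"
```

A violation flagged this way may reflect a genotyping error in any of the three samples, which is why the poster separately notes how many violations are likely errors in HG002.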
[Figure panels: Support from long reads; Support from short reads; Support from optical mapping]
SV discovery and genotyping methods have
different strengths and weaknesses
More methods discover
SVs that are deletions, not
in tandem repeats, and
smaller insertions
github.com/nhansen/SVanalyzer
Goal for our human genome reference values
• Benchmark variant calls (Reference Values)
• Variants from any method being evaluated
• Benchmark regions (Reference Values)
• Variants outside benchmark regions are not assessed
• Majority of variants unique to method should be false positives (FPs)
• Majority of variants unique to benchmark should be false negatives (FNs)
• Matching variants assumed to be true positives
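The comparison scheme described above can be sketched in a few lines of Python. This sketch assumes each variant is a hashable `(chrom, pos, svtype, size)` tuple and uses exact matching for brevity; dedicated SV comparison tools generally match calls with tolerances on position, size, and sequence, so this shows only the bookkeeping, not the matching logic. The function names and region format are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Variant = Tuple[str, int, str, int]  # (chrom, pos, svtype, size), illustrative only
Region = Tuple[str, int, int]        # (chrom, start, end), half-open like BED


def in_regions(variant: Variant, regions: List[Region]) -> bool:
    """True if the variant position falls inside any benchmark region."""
    chrom, pos = variant[0], variant[1]
    return any(c == chrom and start <= pos < end for c, start, end in regions)


def compare_to_benchmark(
    test_calls: Iterable[Variant],
    benchmark_calls: Iterable[Variant],
    regions: List[Region],
) -> Dict[str, float]:
    """Count TP/FP/FN inside benchmark regions and derive precision and recall."""
    # Variants outside the benchmark regions are not assessed.
    test = {v for v in test_calls if in_regions(v, regions)}
    truth = {v for v in benchmark_calls if in_regions(v, regions)}

    tp = test & truth  # matching variants, assumed true positives
    fp = test - truth  # unique to the evaluated method
    fn = truth - test  # unique to the benchmark

    return {
        "tp": len(tp),
        "fp": len(fp),
        "fn": len(fn),
        "precision": len(tp) / len(test) if test else 0.0,
        "recall": len(tp) / len(truth) if truth else 0.0,
    }


benchmark_regions = [("1", 0, 10000)]
benchmark_calls = [
    ("1", 100, "DEL", 300),
    ("1", 5000, "INS", 120),
    ("1", 20000, "DEL", 80),  # outside the regions, not assessed
]
test_calls = [
    ("1", 100, "DEL", 300),
    ("1", 7000, "INS", 60),
    ("2", 50, "DEL", 500),    # outside the regions, not assessed
]
summary = compare_to_benchmark(test_calls, benchmark_calls, benchmark_regions)
```

Restricting both callsets to the benchmark regions before counting is what makes the FP and FN labels meaningful: outside those regions the benchmark makes no claim either way.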
[Figure axis categories: Het, Hom]
Genome in a Bottle Consortium
SV Benchmarking tools:
[Figure legend: Not In Tandem Repeats (Solid) and In Tandem Repeats (Dashed); categories: DEL, INS]
