2. What’s Genome in a Bottle?
• Authoritative Characterization of Human Genomes
– enduring commitment to resource availability
• Samples
• Data
– widely available open resources
– all data made available without embargo
• Enable technology and tool-building with benchmark samples and methods for…
– development
– optimization
– demonstration
• Germline samples available now
• Developing capacity for somatic sample development
4. Now using linked and long reads
GIAB Public Data
• Linked Reads
– 10x Genomics
– Complete Genomics/BGI stLFR
• Long Reads
– PacBio continuous long reads
– PacBio circular consensus sequencing
– Oxford Nanopore “ultralong”
GIAB Use Cases
• Expand small variant benchmark
• Develop structural variant benchmark
• Diploid assembly of difficult regions like MHC
5. Linked Reads
• Short reads, but barcodes give long-range information (>100 kb)
• Most useful for:
– Phasing variants & reads
– Difficult-to-map regions
– De novo assembly
https://dx.doi.org/10.1038%2Fnbt.3432
6. PACBIO CIRCULAR CONSENSUS SEQUENCING (CCS)
• Start with double-stranded DNA
• Ligate adapters
• Anneal primer and bind DNA polymerase
• Sequence to produce subreads (passes), each with subread errors
• Generate consensus HiFi read
[Figure: read accuracy (Phred) versus number of passes]
Wenger, Peluso, et al. (2018). bioRxiv. doi:10.1101/519025
Read accuracy improves with more passes
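The accuracy axis on this slide is on the Phred scale, where Q = -10 * log10(error probability). The sketch below only illustrates that standard conversion; the function names are hypothetical and no values are read from the figure.

```python
import math


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality score to per-base accuracy (1 - error probability)."""
    return 1.0 - 10 ** (-q / 10.0)


def accuracy_to_phred(accuracy: float) -> float:
    """Convert per-base accuracy (strictly below 1.0) to a Phred quality score."""
    return -10.0 * math.log10(1.0 - accuracy)


if __name__ == "__main__":
    # Illustrative scores only, chosen to span the range of the Phred axis.
    for q in (10, 20, 30, 40):
        print(f"Q{q}: accuracy {phred_to_accuracy(q):.4%}")
```

On this scale every additional 10 Phred points corresponds to a tenfold lower error rate.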
7. 15X Coverage by reads > 100Kb
Oxford Nanopore Can Produce “Ultralong” Reads
9. Long+Linked Reads expand small variant benchmark set
Benchmark includes more bases, variants, and segmental duplications in v4α
Metric                                 v3.3.2          v4α             In v4α not in v3.3.2   In v3.3.2 not in v4α
Base pairs covered                     2,358,060,765   2,572,421,057   225,990,474            11,630,182
Percent of GRCh37 covered              87.84%          95.82%          8.42%                  0.43%
SNPs                                   3,046,933       3,432,698       385,765                25,219
Indels                                 465,670         537,035         71,365                 15,382
Base pairs in segmental duplications   13,722,546      116,687,703     103,466,431            501,274
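The "Percent of GRCh37 covered" row can be reproduced from the base-pair counts, assuming the denominator is the count of non-N reference bases given in the presenter notes (2684573005). That assumption and the names below are illustrative, not taken from the GIAB pipeline.

```python
# Assumed denominator: non-N bases in GRCh37, as quoted in the presenter notes.
NON_N_BASES_GRCH37 = 2_684_573_005

# Base pairs covered, copied from the table on this slide.
BASES_COVERED = {
    "v3.3.2": 2_358_060_765,
    "v4α": 2_572_421_057,
    "In v4α not in v3.3.2": 225_990_474,
    "In v3.3.2 not in v4α": 11_630_182,
}


def percent_covered(bases: int, genome_size: int = NON_N_BASES_GRCH37) -> float:
    """Return the percentage of the (non-N) reference covered by `bases`."""
    return 100.0 * bases / genome_size


if __name__ == "__main__":
    for name, bases in BASES_COVERED.items():
        print(f"{name}: {percent_covered(bases):.2f}%")
```

With that denominator, the four computed percentages round to the values shown in the table.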
10. Small variant performance metrics decrease vs. new benchmark
Comparison of Illumina GATK4 VCF against benchmark sets
• SNP FN rate increases by more than a factor of 10 (from 0.05% to 0.86% for all SNPs)
– almost entirely due to new benchmark variants in difficult-to-map regions (lowmap) and segmental duplications (segdups)
Subset                      v3.3.2 Recall   v4α Recall   v3.3.2 Precision   v4α Precision
All SNPs                    0.9995          0.9914       0.9981             0.9941
Lowmap 100 bp               0.9799          0.7911       0.9623             0.8582
Lowmap 250 bp no mismatch   0.9474          0.4916       0.8911             0.7171
Segdups                     0.9982          0.9103       0.9910             0.9014
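The false-negative (FN) rate follows from recall as FN rate = 1 - recall, so the change between benchmark versions can be checked directly from the recall columns. The sketch below uses the recall values from the table; the function names are hypothetical.

```python
# (v3.3.2 recall, v4α recall) per subset, copied from the table on this slide.
RECALL = {
    "All SNPs": (0.9995, 0.9914),
    "Lowmap 100 bp": (0.9799, 0.7911),
    "Lowmap 250 bp no mismatch": (0.9474, 0.4916),
    "Segdups": (0.9982, 0.9103),
}


def fn_rate(recall: float) -> float:
    """False-negative rate: fraction of benchmark variants missed by the query."""
    return 1.0 - recall


def fn_fold_increase(old_recall: float, new_recall: float) -> float:
    """Ratio of the FN rate against the new benchmark to that against the old one."""
    return fn_rate(new_recall) / fn_rate(old_recall)


if __name__ == "__main__":
    for subset, (old, new) in RECALL.items():
        print(f"{subset}: FN rate {fn_rate(old):.4f} -> {fn_rate(new):.4f} "
              f"({fn_fold_increase(old, new):.1f}x)")
```

Because these recall values are rounded to four decimal places, the computed ratios are approximate.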
13. [Figure: SV size distributions for 50 to 1000 bp and 1 kbp to 10 kbp, with Alu and LINE labels]
• Discovery: 498876 (296761 unique) calls >=50bp and 1157458 (521360 unique) calls >=20bp discovered in 30+ sequence-resolved callsets from 4 technologies for AJ Trio
• Compare SVs: 128715 sequence-resolved SV calls >=50bp after clustering sequence changes within 20% edit distance in trio
• Discovery Support: 30062 SVs with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support in trio
• Evaluate/genotype: 19748 SVs with consensus variant genotype from svviz in son
• Filter complex: 12745 SVs not within 1kb of another SV
• Regions: 11869 SVs inside 2.69 Gbp benchmark regions supported by diploid assembly
v0.6: tinyurl.com/GIABSV06
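The "Filter complex" step keeps only SVs that are not within 1kb of another SV. A minimal sketch of such an isolation filter is shown below, assuming each SV is a hypothetical (chrom, start, end) tuple; it is not the GIAB implementation, and the example coordinates are made up.

```python
from collections import defaultdict


def filter_isolated(svs, min_gap=1000):
    """Keep SVs whose gap to every other SV on the same chromosome exceeds min_gap.

    `svs` is an iterable of (chrom, start, end) tuples (an assumed representation).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in svs:
        by_chrom[chrom].append((start, end))

    kept = []
    for chrom, intervals in by_chrom.items():
        intervals.sort()  # sort by start, then end
        max_end_before = None  # furthest end among SVs that start earlier
        for i, (start, end) in enumerate(intervals):
            # Nearest SV on the left is the one with the largest end seen so far.
            left_ok = max_end_before is None or start - max_end_before > min_gap
            # Nearest SV on the right is the next one in start order.
            right_ok = (i + 1 == len(intervals)
                        or intervals[i + 1][0] - end > min_gap)
            if left_ok and right_ok:
                kept.append((chrom, start, end))
            max_end_before = end if max_end_before is None else max(max_end_before, end)
    return kept


if __name__ == "__main__":
    # Made-up coordinates for illustration only.
    calls = [("chr1", 1000, 1100), ("chr1", 1500, 1600),
             ("chr1", 5000, 5200), ("chr2", 1200, 1300)]
    print(filter_isolated(calls))
```

Sorting by start position lets each SV be compared with only its nearest neighbors rather than with every other call.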
14. Resolve MHC regions from HG002
https://github.com/NCBI-Hackathons/TheHumanPangenome/tree/master/MHC
Justin Wenger, Justin Zook, Mikko Rautiainen, Jason Chin, Tobias Marschall, Qian Zeng,
Erik Garrison, Shilpa Garg
Mar. 25-27, UCSC, The Human Pangenomics Hackathon
15. Goals
• Make the best haplotype-correct assemblies for the MHC regions of HG002 from all available data
• Fewest gaps
• Correct phasing for both SNPs and SVs
• Provide the best genomic sequences for future GIAB SNP and SV benchmarks for this complicated but medically important region
16. Preliminary MHC Diploid Assembly Results
[Figure: assembly graphs of the MHC region for each haplotype]
• Haplotype I: 2 contigs spanning the region
• Haplotype II: 3 contigs spanning the region
• A loop in the assembly graph
• Missing sequence?
17. The road ahead...
2019
• Integration pipeline development for small and structural variants
• Manuscripts for small and structural variants
2020
• Difficult large variants
• Somatic sample development
• Germline samples from new ancestries
• Diploid assembly
2021+
• Somatic integration pipeline
• Somatic structural variation
• Large segmental duplications
• Centromere/telomere
• ...
18. Acknowledgment of many GIAB contributors
Government
Clinical Laboratories Academic Laboratories
Bioinformatics developers
NGS technology developers
Reference samples
19. For More Information
www.genomeinabottle.org - sign up for the general GIAB and Analysis Team Google group
GIAB slides, including 2019 Workshop slides: www.slideshare.net/genomeinabottle
Public, Unembargoed Data:
– http://www.nature.com/articles/sdata201625
– ftp://ftp-trace.ncbi.nlm.nih.gov/giab/
– github.com/genome-in-a-bottle
Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
– Web-based implementation at precision.fda.gov
– Best Practices at https://doi.org/10.1101/270157
Public workshops
– Next workshop planned for April 1-2, 2020 at Stanford University, CA, USA
Justin Zook: jzook@nist.gov
NIST postdoc opportunities available!
Editor's notes
Non RefN bases in GRCh37 HG002: 2684573005
False negatives (FN): variants present in the truth set but missed in the query.
This is a good slide for 644:
give a clinical anecdote
Also numbers - attendance, publications, data, RM unit sales
Reference sample distributors
How much money from IAA?
- sustained funding
Quantify collaborators' input
GIAB steering committee
Examples of others contributing data, analyses
How to describe emails