GIAB and long reads for bio it world 190417

April 17, 2019
Long Read Sequencing and the Genome in a Bottle Consortium

What’s Genome in a Bottle?
• Authoritative Characterization of Human Genomes
– enduring commitment to resource availability
• Samples
• Data
– widely available open resources
– all data made available without embargo
• Enable technology and tool-building with benchmark samples and methods for…
– development
– optimization
– demonstration
• Germline samples available now
• Developing capacity for somatic sample development

GIAB Recently Published Resources for “Easier” Small Variants

Now using linked and long reads
GIAB Public Data
• Linked Reads
– 10x Genomics
– Complete Genomics/BGI stLFR
• Long Reads
– PacBio continuous long reads
– PacBio circular consensus sequencing
– Oxford Nanopore “ultralong”
GIAB Use Cases
• Expand small variant benchmark
• Develop structural variant benchmark
• Diploid assembly of difficult regions like MHC

Linked Reads
• Short reads, but barcodes give long-range information (>100 kb)
• Most useful for:
– Phasing variants & reads
– Difficult-to-map regions
– De novo assembly
https://dx.doi.org/10.1038/nbt.3432

PACBIO CIRCULAR CONSENSUS SEQUENCING (CCS)
[Figure: CCS workflow. Double-stranded DNA; ligate adapters; anneal primer and bind DNA polymerase; sequence to produce subreads (passes), which contain subread errors; generate consensus HiFi read.]
[Plot: accuracy (Phred, 0 to 50) versus number of passes (0 to 20).]
Wenger, Peluso, et al. (2018). bioRxiv. doi:10.1101/519025
Read accuracy improves with more passes
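The plot's accuracy axis is in Phred units, so a short conversion helps when reading it. The sketch below uses the standard Phred definition, Q = -10 log10(error probability); the function names are illustrative and are not taken from the slides or the cited preprint.

```python
import math


def phred_to_accuracy(q: float) -> float:
    """Per-base accuracy implied by Phred score q: 1 - 10^(-q/10)."""
    return 1.0 - 10.0 ** (-q / 10.0)


def error_rate_to_phred(p_error: float) -> float:
    """Phred score implied by a per-base error probability."""
    return -10.0 * math.log10(p_error)


# Tick values on the accuracy axis of the plot, converted to accuracy.
for q in (10, 20, 30, 40, 50):
    print(f"Q{q}: accuracy = {phred_to_accuracy(q):.5f}")
```

On this scale every 10 Phred units is a tenfold drop in error rate, so Q20 corresponds to 99% accuracy and Q30 to 99.9%.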

Oxford Nanopore Can Produce “Ultralong” Reads
15X coverage by reads >100 kb
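A figure such as "15X coverage by reads >100 kb" can be read as the total bases in reads above the length cutoff divided by the genome size. The sketch below shows that calculation, assuming read lengths are already available as a list; the function name and the toy numbers are hypothetical and are not GIAB data.

```python
def coverage_from_long_reads(read_lengths, genome_size, min_length=100_000):
    """Fold coverage contributed only by reads longer than min_length bases."""
    long_bases = sum(n for n in read_lengths if n > min_length)
    return long_bases / genome_size


# Toy example (hypothetical): five reads on a 1 Mb genome.
toy_reads = [50_000, 120_000, 250_000, 80_000, 130_000]
toy_genome_size = 1_000_000
print(coverage_from_long_reads(toy_reads, toy_genome_size))
```

Only the three reads above 100 kb count, so the toy example gives 0.5X.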

Expand small variant benchmark set to difficult-to-map regions
Justin Wenger, NIST

Long+Linked Reads expand small variant benchmark set
Benchmark includes more bases, variants, and segmental duplications in v4α
                                      v3.3.2         v4α            In v4α not in v3.3.2  In v3.3.2 not in v4α
Base pairs covered                    2,358,060,765  2,572,421,057  225,990,474           11,630,182
Percent of GRCh37 covered             87.84%         95.82%         8.42%                 0.43%
SNPs                                  3,046,933      3,432,698      385,765               25,219
Indels                                465,670        537,035        71,365                15,382
Base pairs in segmental duplications  13,722,546     116,687,703    103,466,431           501,274
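The two base-pair rows of the table above can be cross-checked with simple arithmetic: the new total should equal the old total plus bases gained minus bases lost. The sketch below does that check and computes the fold increase. The numbers come from the table, the helper names are illustrative, and the variant-count rows are not checked here.

```python
# (v3.3.2, v4 alpha, in v4 alpha not in v3.3.2, in v3.3.2 not in v4 alpha), from the table.
BASE_PAIR_ROWS = {
    "Base pairs covered": (2_358_060_765, 2_572_421_057, 225_990_474, 11_630_182),
    "Base pairs in segmental duplications": (13_722_546, 116_687_703, 103_466_431, 501_274),
}


def net_change_consistent(old, new, gained, lost):
    """True if new total == old total + gained - lost."""
    return old + gained - lost == new


def fold_increase(old, new):
    """Ratio of the new total to the old total."""
    return new / old


for name, (old, new, gained, lost) in BASE_PAIR_ROWS.items():
    ok = net_change_consistent(old, new, gained, lost)
    print(f"{name}: consistent={ok}, fold increase={fold_increase(old, new):.2f}")
```

Both rows balance exactly. Benchmark coverage of segmental duplications grows about 8.5-fold, against about 1.09-fold for the benchmark as a whole.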

Small variant performance metrics decrease vs. new benchmark
Comparison of Illumina GATK4 VCF against benchmark sets
• SNP FN rate increases by more than a factor of 10 (from 0.05% to 0.86% for all SNPs)
– almost entirely due to new benchmark variants in difficult-to-map regions (lowmap) and segmental duplications (segdups)
Subset                     v3.3.2 Recall  v4α Recall  v3.3.2 Precision  v4α Precision
All SNPs                   0.9995         0.9914      0.9981            0.9941
Lowmap 100 bp              0.9799         0.7911      0.9623            0.8582
Lowmap 250 bp no mismatch  0.9474         0.4916      0.8911            0.7171
Segdups                    0.9982         0.9103      0.9910            0.9014
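The false-negative (FN) rate is 1 minus recall, so the change in FN rate can be derived directly from the recall columns. The sketch below does this for each subset using the recall values in the table; the dictionary layout and function names are illustrative only.

```python
# (v3.3.2 recall, v4 alpha recall) per subset, copied from the table.
RECALL = {
    "All SNPs": (0.9995, 0.9914),
    "Lowmap 100 bp": (0.9799, 0.7911),
    "Lowmap 250 bp no mismatch": (0.9474, 0.4916),
    "Segdups": (0.9982, 0.9103),
}


def fn_rate(recall):
    """False-negative rate: fraction of benchmark variants that were missed."""
    return 1.0 - recall


def fn_fold_change(old_recall, new_recall):
    """How many times larger the FN rate is against the new benchmark."""
    return fn_rate(new_recall) / fn_rate(old_recall)


for subset, (old, new) in RECALL.items():
    print(f"{subset}: FN {fn_rate(old):.4f} -> {fn_rate(new):.4f} "
          f"({fn_fold_change(old, new):.1f}x)")
```

For all SNPs the FN rate goes from 0.0005 to 0.0086, roughly a 17-fold increase. The largest relative change is in segmental duplications, roughly 50-fold.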

Error in current benchmark excluded in new benchmark
[Figure: tracks labeled v4α, v3.3.2, Illumina, PacBio CCS, 10X, and ONT.]

Develop sequence-resolved structural variant benchmark set
GIAB Analysis Team

[Figure: two panels, 50 to 1000 bp (labeled Alu) and 1 kbp to 10 kbp (labeled LINE).]
• Discovery: 498,876 (296,761 unique) calls ≥50 bp and 1,157,458 (521,360 unique) calls ≥20 bp discovered in 30+ sequence-resolved callsets from 4 technologies for AJ Trio
• Compare SVs: 128,715 sequence-resolved SV calls ≥50 bp after clustering sequence changes within 20% edit distance in trio
• Discovery Support: 30,062 SVs with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support in trio
• Evaluate/genotype: 19,748 SVs with consensus variant genotype from svviz in son
• Filter complex: 12,745 SVs not within 1 kb of another SV
• Regions: 11,869 SVs inside 2.69 Gbp benchmark regions supported by diploid assembly
v0.6
tinyurl.com/GIABSV06
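The "Filter complex" step keeps only SVs that are not within 1 kb of another SV. A minimal sketch of such an isolation filter follows. It assumes SVs are given as (chromosome, start, end) tuples and that "within 1 kb" means a gap of at most 1000 bp; it is not the GIAB integration pipeline, and the function name and toy coordinates are hypothetical.

```python
from collections import defaultdict


def isolated_svs(svs, min_gap=1000):
    """Return SVs whose gap to every other SV on the same chromosome exceeds min_gap.

    svs is a list of (chrom, start, end) tuples. An SV that is dropped still
    counts as a neighbor, so every member of a close cluster is removed.
    """
    by_chrom = defaultdict(list)
    for sv in svs:
        by_chrom[sv[0]].append(sv)

    kept = []
    for items in by_chrom.values():
        items.sort(key=lambda sv: (sv[1], sv[2]))
        max_end_before = None  # largest end coordinate among earlier SVs
        for i, (chrom, start, end) in enumerate(items):
            left_ok = max_end_before is None or start - max_end_before > min_gap
            # Sorted by start, so the next SV is the nearest one to the right.
            right_ok = i + 1 == len(items) or items[i + 1][1] - end > min_gap
            if left_ok and right_ok:
                kept.append((chrom, start, end))
            max_end_before = end if max_end_before is None else max(max_end_before, end)
    return kept


# Toy input (hypothetical coordinates): the first two chr1 SVs are 400 bp apart.
toy = [("chr1", 1000, 1100), ("chr1", 1500, 1600), ("chr1", 5000, 5050), ("chr2", 1200, 1300)]
print(isolated_svs(toy))
```

In the toy input the two nearby chr1 calls are both removed, leaving the isolated chr1 call and the chr2 call.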

Resolve MHC regions from HG002
https://github.com/NCBI-Hackathons/TheHumanPangenome/tree/master/MHC
Justin Wenger, Justin Zook, Mikko Rautiainen, Jason Chin, Tobias Marschall, Qian Zeng, Erik Garrison, Shilpa Garg
Mar. 25-27, UCSC, The Human Pangenomics Hackathon

Goals
• Make the best haplotype-correct assemblies for the MHC regions of HG002 from all available data
• Fewest gaps
• Correct phasing for both SNPs and SVs
• Provide the best genomic sequences for future GIAB SNP and SV benchmarks for this complicated but medically important region

Preliminary MHC Diploid Assembly Results
[Figure: assembly graphs of the MHC region for each haplotype. Haplotype I: 2 contigs spanning the region. Haplotype II: 3 contigs spanning the region. Annotations: a loop in the assembly graph; missing sequence?]

The road ahead...
2019
• Integration pipeline development for small and structural variants
• Manuscripts for small and structural variants
2020
• Difficult large variants
• Somatic sample development
• Germline samples from new ancestries
• Diploid assembly
2021+
• Somatic integration pipeline
• Somatic structural variation
• Large segmental duplications
• Centromere/telomere
...

Acknowledgment of many GIAB contributors
Government
Clinical Laboratories
Academic Laboratories
Bioinformatics developers
NGS technology developers
Reference samples

For More Information
www.genomeinabottle.org - sign up for the general GIAB and Analysis Team Google groups
GIAB slides, including 2019 Workshop slides: www.slideshare.net/genomeinabottle
Public, Unembargoed Data:
– http://www.nature.com/articles/sdata201625
– ftp://ftp-trace.ncbi.nlm.nih.gov/giab/
– github.com/genome-in-a-bottle
Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
– Web-based implementation at precision.fda.gov
– Best Practices at https://doi.org/10.1101/270157
Public workshops
– Next workshop planned for April 1-2, 2020 at Stanford University, CA, USA
Justin Zook: jzook@nist.gov
NIST postdoc opportunities available!
