The document discusses the technical roadmap for germline genome benchmarks from the Genome in a Bottle (GIAB) Consortium. It summarizes GIAB's past and ongoing work developing small variant and structural variant benchmarks for reference samples. It outlines plans to expand assembly-based benchmarks to more medically relevant genes and regions using new long-read assemblies. It proposes collaborations to improve X/Y chromosome benchmarks and develop new benchmarking tools. A draft timeline is provided for upcoming GIAB deliverables through 2021 and beyond, including developing assembly-based benchmarks, uncertainty metrics for deep learning methods, and expanding to additional reference genomes. Feedback is sought on priorities and challenges in using GIAB data.
2. Our goals for today
• Check in about our progress and roadmap, since the 2020 GIAB Workshop was cancelled
• Today's focus is on the technical roadmap for germline benchmarks
4. GIAB's evolving benchmarks
2012–2014
• Published first small variant benchmark, covering ~77% of HG001
2015–2019
• Published small variant benchmarks for 80–90% of 7 samples
• GA4GH/GIAB published best practices for benchmarking small variants
2020
• Published benchmarks for structural variants and the MHC
• Released v4 small variant benchmarks using long reads to cover more challenging regions
5. Ongoing Work in Human Genome Benchmarks
Mapping-based
• v0.6 SV benchmark
• v4 small variant benchmark
• precisionFDA Challenge
Assembly-based
• MHC benchmark
• Medically relevant genes (small + structural variants)
• Human Pangenome Reference Consortium
• "Hard to Measure" (H2M) benchmark and T2T
Somatic
• Mosaic variants
• MDIC Somatic Reference Samples
• Tumor-normal cell lines
13. De novo assembly in 2020 and GIAB's role
Using GIAB to evaluate assemblies
• Developed a prototype assembly benchmarking pipeline using GIAB's variant benchmarks
• Collaborated with the Human Pangenome Reference Consortium to benchmark new assembly methods
Using assemblies to expand GIAB's benchmarks
• The Telomere-to-Telomere and Human Pangenome Reference Consortia have rapidly advanced assembly methods
• Trio-based hifiasm is highly concordant with the v4 benchmark (~1 difference per 300 kb)
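To put the ~1 difference per 300 kb concordance figure in perspective, a back-of-the-envelope estimate of how many assembly/benchmark differences that rate implies at a given benchmark size (the 2.5 Gb span used below is an illustrative assumption, not a number from the slide):

```python
# Rough implication of a ~1-difference-per-300-kb discordance rate.
def expected_differences(benchmark_bases: int, bases_per_difference: int = 300_000) -> int:
    """Expected discordant sites for a benchmark of the given size."""
    return round(benchmark_bases / bases_per_difference)

# At an (assumed) ~2.5 Gb benchmark span, that rate implies ~8,300 differences.
print(expected_differences(2_500_000_000))  # 8333
```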
14. Improving coverage of difficult medical genes
Progress using trio-hifiasm
• We now cover 264 of the 347 genes that were <90% covered by v4.2
New assembly benchmarks needed
• >10% of each of the remaining 83 genes is still excluded, including some quite important genes:
  • SMN1/2, in a highly repetitive region
  • Duplications in KIR/HLA-DRB genes
  • LPA, with very large repeat expansions
• Chromosomes X and Y are also still missing
Small variant benchmark statistics for genes <90% covered by v4.2 and covered by the assembly:

                    Total Bases    SNPs    INDELs   Low-map and segdup bases
  v4.2                9,431,825   15,850    2,317                  1,005,747
  Draft benchmark    11,938,832   24,212    4,978                  1,712,135

Structural variant benchmark statistics:

                    Total Bases   SVs >= 50 bp
  Draft benchmark    11,798,343            207
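A quick computation of the gains the draft assembly-based benchmark adds over v4.2 in these genes, using only the counts from the small variant table above:

```python
# Counts taken from the slide's small variant benchmark table
# (genes <90% covered by v4.2 and covered by the assembly).
v42   = {"bases": 9_431_825,  "snps": 15_850, "indels": 2_317}
draft = {"bases": 11_938_832, "snps": 24_212, "indels": 4_978}

gain = {key: draft[key] - v42[key] for key in v42}
pct  = {key: round(100 * gain[key] / v42[key], 1) for key in v42}

print(gain)  # {'bases': 2507007, 'snps': 8362, 'indels': 2661}
print(pct)   # bases +26.6%, SNPs +52.8%, INDELs +114.8%
```

So the draft benchmark adds ~2.5 Mb in these genes and more than doubles the number of benchmarked INDELs.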
15. Why Assembly-based Benchmarks?
Small + structural variants
• New assembly methods enable accurate phased benchmarks with all variant types
• Clear resolution of complex variants, including in tandem repeats
• Better resolves highly divergent regions and some segmental duplications
Duplications vs. GRCh37/38
• Variant calls from mapping are difficult to interpret in CNVs (duplications)
• Assembly can resolve the location of duplicated sequence
• Assembly can identify variants from each copy
Native assembly benchmarking
• Enables benchmarking of assemblies where there are no standards for variant representation
• A path towards using future pangenome tools
• Easier to make calls on GRCh37, GRCh38, CHM13...
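A toy illustration of why mapping-based calls are hard to interpret in duplications: reads from two paralogous copies pile up on a single reference copy, so a fixed difference between the copies looks like a heterozygous site (sequences, positions, and read counts below are invented for illustration):

```python
# Two paralogous copies of a duplicated segment; only one reference copy exists.
COPY_A = "ACGTACGT"   # identical to the reference copy
COPY_B = "ACGAACGT"   # paralog carrying a T->A difference at index 3

def pileup_at(pos, copies, reads_per_copy=10):
    """Count bases observed at `pos` when every copy's reads map to one locus."""
    counts = {}
    for copy in copies:
        base = copy[pos]
        counts[base] = counts.get(base, 0) + reads_per_copy
    return counts

# The collapsed pileup shows a 50/50 allele balance, mimicking a het call.
print(pileup_at(3, [COPY_A, COPY_B]))  # {'T': 10, 'A': 10}
```

An assembly that reconstructs each copy separately avoids this ambiguity, which is the point the slide makes.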
16. Assembly-based benchmark roadmap
Medically relevant genes
• First, local assembly for the MHC
• Next, pilot using whole-genome diploid assembly for difficult genes
• Phased benchmark VCF with small and structural variants for 264 genes
• Benchmarks for small variants and isolated large insertions/deletions
T2T X and Y for HG002 (with HPRC)
• First GIAB benchmarks for haploid chromosomes
• Use T2T assemblies of HG002 X and Y
• Combined small + structural variant benchmark
• Potential synthetic diploid X with HG002 + CHM13-X
Diploid assembly
• First use trio-hifiasm and define benchmark regions
• Exploring Strand-seq + hifiasm for parental assemblies
• Explore deep learning-based uncertainties and explanations
• Later, work with HPRC-T2T for HG002 to cover all difficult regions
17. Benchmarking tools needed
Tandem repeats
• New metrics needed
• New comparison tools needed
Complex structural variants
• Current tools can't compare different representations of complex SVs, so we exclude them from our benchmarks
Assemblies
• Current GIAB variant-based benchmarking is limited to small variants and regions where HG002 is similar to GRCh38
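The representation problem for tandem repeats can be shown in a few lines: the same edited haplotype can be written as different variant records, so naive record-by-record comparison calls equivalent answers discordant. A minimal sketch (the helper and the toy sequence are illustrative, not from any GIAB tool):

```python
def apply_variants(ref: str, variants):
    """Apply (pos, ref_allele, alt_allele) records (0-based, sorted) to ref."""
    out, cursor = [], 0
    for pos, ref_allele, alt in sorted(variants):
        assert ref[pos:pos + len(ref_allele)] == ref_allele  # sanity check
        out.append(ref[cursor:pos])
        out.append(alt)
        cursor = pos + len(ref_allele)
    out.append(ref[cursor:])
    return "".join(out)

# Deleting one CA unit of a CA repeat: two different records, one haplotype.
REF = "TCACACAG"
left_aligned  = [(1, "CA", "")]   # delete the first CA unit
right_aligned = [(3, "CA", "")]   # delete the second CA unit

hap1 = apply_variants(REF, left_aligned)
hap2 = apply_variants(REF, right_aligned)
assert hap1 == hap2 == "TCACAG"   # identical haplotype, different records
```

Comparison tools that operate on the resulting haplotype sequence, rather than on individual records, sidestep this ambiguity; that is the kind of tool the slide says is still needed for tandem repeats and complex SVs.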
18. Draft GIAB Deliverable Timeline
2020
• Q2: Develop initial assembly benchmarking pipeline
• Q3: v0.6 SV and MHC manuscripts published
• Q4: Submit v4 small variant benchmark and precisionFDA challenge manuscripts
• Q4: Evaluate assembly-based benchmarks for HG002 medical genes
2021
• Q1: Phased v4 benchmarks for 7 GIAB genomes
• Q1–Q2: Strand-seq data for GIAB trios and assemblies
• Q1: Medical gene benchmark paper
• Q2: Mosaic/somatic benchmark for HG002
• Q3: Ultralong ONT for GIAB trios
• Q4: RNA-seq data
Future / collaboration / resource-dependent
• T2T HG002 X/Y-based benchmarks
• Assembly-based benchmarks for all gene (& other?) regions
• New benchmarking tools (tandem repeats, complex SVs, assemblies)
• Work with HPRC-T2T to cover all difficult regions
19. GIAB will gain more support as part of a new intramural NIST program to use deep learning to accelerate genome characterization
20. Proposed GIAB Deep Learning Working Group
• NIST's intramural research program is supporting development of uncertainties and explanations for deep learning-based genomic Standard Reference Materials
  • The goal is to enable faster recharacterization of existing genomes and characterization of new genomes, such as cancer genomes
  • This is not to the exclusion of other genome characterization methods
• Can we use deep learning to assign reliable uncertainties to certified values (i.e., variant and reference calls)?
• Can we provide useful explanations of deep learning-based genotypes and uncertainties?
• If we are able to assign reliable uncertainties to variant calls, how would you use them?
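One concrete reading of "reliable uncertainties" is calibration: among calls assigned, say, 90% confidence, about 90% should actually be correct. A simple way to check this is to bin calls by predicted confidence and compare each bin's mean confidence to its observed accuracy (the function name and the toy data below are illustrative, not from any NIST pipeline):

```python
def calibration_gaps(confidences, correct, n_bins=5):
    """Per-bin |mean predicted confidence - observed accuracy| gaps."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[idx].append((conf, ok))
    return [
        abs(sum(c for c, _ in b) / len(b) - sum(ok for _, ok in b) / len(b))
        for b in bins if b
    ]

# A well-calibrated caller: 9 of 10 calls made at 0.9 confidence are correct,
# so the gap in that confidence bin is near zero.
gaps = calibration_gaps([0.9] * 10, [1] * 9 + [0])
assert all(g < 0.01 for g in gaps)
```

Large gaps would mean the reported uncertainties cannot be taken at face value, which bears directly on whether they could back certified values.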
21. Questions for Discussion
• Are we missing anything, or would you prioritize something higher?
• Should new benchmarking tools be a higher priority?
• How useful are phased variants?
• What is the relative priority of better characterizing HG002 vs. applying methods to all 7 genomes?
• What is the relative priority of expanding to new ancestries?
• What would make the data easier to use? What challenges have you had in using the data?
• Is it a priority to study batch-to-batch variability?
• Other things?