1. Genome in a Bottle: Tools for
Using NIST Reference Materials
Next Generation Diagnostics Summit Short Course
August 2014
Justin Zook, Marc Salit, and the Genome in a Bottle
Consortium
2. Learning Objectives
• How can Genome in a Bottle Reference
Materials help with validating NGS assays?
• Comparing your variant calls to high-
confidence calls
• Tools available for understanding potential
false positives and false negatives
• Examples of how labs are using our high-
confidence calls
3. NIST-hosted
Genome in a Bottle Consortium
• Infrastructure for performance
assessment of NGS
– support science-based regulatory
oversight
• No widely accepted set of metrics
to characterize the fidelity of
variant calls from NGS…
• Genome in a Bottle Consortium is
developing standards to address
this…
– human genomes as Reference
Materials (RMs)
• characterized and disseminated by NIST
– tools and methods to use these RMs
• common sequencing instruments
• bioinformatics workflows.
http://genomeinabottle.org
4. Whole genome sequencing technologies
disagree about hundreds of thousands of variants
[Figure: Venn diagram of SNP calls from Platform #1, Platform #2, and Platform #3.
Region values are # SNPs (% of SNPs detected by any platform):
3,198,316 (80.05%); 125,574 (3.14%); 230,311 (5.76%); 121,440 (3.04%);
208,038 (5.21%); 71,944 (1.80%); 39,604 (0.99%)]
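As a quick check on the headline claim, the seven region counts from the Venn diagram can be summed to recover the denominator behind the percentages. This is a minimal sketch: the slide text does not say which count belongs to which region, so the only assumption made is that the largest count (80.05%) is the three-platform overlap. The variable names are illustrative.

```python
# Region counts copied from the Venn diagram on this slide.
# The mapping of counts to regions is not stated in the text; the only
# assumption here is that the largest count is the three-platform overlap.
region_counts = [3_198_316, 125_574, 230_311, 121_440, 208_038, 71_944, 39_604]

total_snps = sum(region_counts)          # SNPs detected by any platform
all_agree = max(region_counts)           # assumed: called by all three platforms
not_unanimous = total_snps - all_agree   # SNPs missed by at least one platform

# Percentage of the union represented by each region, as shown on the slide.
percentages = [round(100 * n / total_snps, 2) for n in region_counts]

print(f"SNPs detected by any platform: {total_snps:,}")
print(f"SNPs not called by all three:  {not_unanimous:,}")
print(percentages)
```

Under that assumption, the counts sum to 3,995,227 SNPs, of which 796,911 are not called by all three platforms, which is where "hundreds of thousands" comes from.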
6. Measurement Process
Sample
gDNA isolation
Library Prep
Sequencing
Alignment/Mapping
Variant Calling
Confidence Estimates
Downstream Analysis
• gDNA reference
materials will be
developed to
characterize
performance of part of
the process
– materials will be
certified for their
variants against a
reference sequence,
with confidence
estimates
7. NIST Human Genome RMs in the
pipeline
• All 10 µg samples of DNA
isolated from multistage large
growth cell cultures
– all are intended to act as stable,
homogeneous references
suitable for use in regulated
applications
– all genomes also available from
Coriell repository
• Pilot Genome
– ~8400 tubes
• Ashkenazim Jewish Trio
– ~10000 son; ~2500 each parent
• Asian Trio
– ~10000 son; parents not yet
planned as NIST RM
8. Goals for Data to Accompany RM
• ~0 false positive AND false negative calls in
confident regions
• Include as much of the genome as possible in
the confident regions (i.e., don’t just take the
intersection)
• Avoid bias towards any particular platform
– take advantage of strengths of each platform
• Avoid bias towards any particular
bioinformatics algorithms
9. Integration Methods to Establish
Reference Variant Calls
Candidate variants
Concordant variants
Find characteristics of bias
Arbitrate using evidence of
bias
Confidence Level
Zook et al., Nature Biotechnology, 2014.
10. Assigning confidence to genotypes
High-confidence sites
• Sequencing/bioinformatics
methods agree or we
understand the biases
causing disagreement
• At least some methods have
no evidence of bias
• Inherited as expected
Less confident sites
• In a region known to be
difficult for current
technologies
• State reasons for lower
confidence
• If a site is near a low
confidence site, make it low
confidence
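The last rule above (a site near a low-confidence site becomes low confidence) can be sketched as follows. This is an illustration only, not the consortium's implementation: the `Site` type, the function name, and the 50 bp default window are all hypothetical, since the slide does not say how "near" is defined.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Site:
    chrom: str
    pos: int
    high_confidence: bool

def propagate_low_confidence(sites, window=50):
    """Mark sites within `window` bp of a low-confidence site as low confidence.

    `window` is a hypothetical parameter; the slide does not give a distance.
    """
    # Positions that are low confidence before propagation.
    low = [(s.chrom, s.pos) for s in sites if not s.high_confidence]
    result = []
    for s in sites:
        near_low = any(c == s.chrom and abs(p - s.pos) <= window for c, p in low)
        result.append(replace(s, high_confidence=s.high_confidence and not near_low))
    return result

sites = [
    Site("1", 100, True),   # within 50 bp of the low-confidence site at 120
    Site("1", 120, False),  # low confidence to begin with
    Site("1", 500, True),   # far from any low-confidence site
    Site("2", 110, True),   # different chromosome, so unaffected
]
flags = [s.high_confidence for s in propagate_low_confidence(sites)]
print(flags)
```

The quadratic scan is fine for a toy example; a genome-scale version would sort sites by position or use an interval index.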
12. Challenges with assessing
performance
• All variant types are not
equal
• All regions of the genome
are not equal
– Homopolymers, STRs,
duplications
– Can be similar or different
in different genomes
• Labeling difficult variants
as uncertain leads to
higher apparent accuracy
when assessing
performance
• Genotypes fall in 3+
categories (not
positive/negative)
– standard diagnostic
accuracy measures not
well posed
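Two of the points above can be illustrated with a toy genotype comparison: genotypes form more than two categories, and dropping uncertain sites raises apparent accuracy. This is a minimal sketch with made-up data; the function names, the `"no-call"` label, and the VCF-style genotype strings (`0/0`, `0/1`, `1/1`) are assumptions for illustration.

```python
from collections import Counter

def genotype_confusion(truth, calls):
    """Count (truth genotype, called genotype) pairs; missing calls become 'no-call'."""
    table = Counter()
    for site, truth_gt in truth.items():
        table[(truth_gt, calls.get(site, "no-call"))] += 1
    return table

def concordance(table, exclude_no_call=False):
    """Fraction of sites where the called genotype equals the truth genotype."""
    total = correct = 0
    for (truth_gt, called_gt), n in table.items():
        if exclude_no_call and called_gt == "no-call":
            continue
        total += n
        if truth_gt == called_gt:
            correct += n
    return correct / total

# Toy data: site 2 is a hom-alt site called as het, site 4 is left uncalled.
truth = {1: "0/1", 2: "1/1", 3: "0/0", 4: "0/1", 5: "0/1"}
calls = {1: "0/1", 2: "0/1", 3: "0/0", 5: "0/1"}

table = genotype_confusion(truth, calls)
with_no_calls = concordance(table)
without_no_calls = concordance(table, exclude_no_call=True)
print(with_no_calls, without_no_calls)
```

Site 2 shows why a positive/negative framing is not well posed: a variant was detected, but with the wrong genotype. Excluding the uncalled site moves concordance from 3 of 5 to 3 of 4 without any change in the calls themselves.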
13. Preliminary uses of high-confidence
NIST-GIAB genotypes for NA12878
• NIST has released
several versions of high-
confidence genotypes
for its pilot RM
• These data are
presently being used for
benchmarking
– prior to release of RMs
– SNPs & indels
• ~77% of the genome
14. NIST Plays a Role in the First FDA Authorization for
Next-Generation Sequencer
November 20, 2013
16. GCAT – Interactive Performance
Metrics
• NIST is working with GCAT
to use our high-
confidence variant calls
• Assess performance of
many combinations of
mappers and variant
callers
• Currently assesses only
exome sequencing
• www.bioplanet.com/gcat
20. Freebayes SNP calls changed very little in 2013
http://www.bioplanet.com/gcat/reports/1933-westleouzm/variant-calls/illumina-100bp-pe-exome-150x/bwamem-freebayes-0-9-10-131226/compare-1934-akckizzzfr-1931-laqgzjytqw-1935-xwckffckoa/snp/group-quality
21. Freebayes indel calls improved in 2013
http://www.bioplanet.com/gcat/reports/1933-westleouzm/variant-calls/illumina-100bp-pe-exome-150x/bwamem-freebayes-0-9-10-131226/compare-1934-akckizzzfr-1931-laqgzjytqw-1935-xwckffckoa/indel/group-quality
22. Analytical Validation of Clinical
Whole-Exome Sequencing (CWES) Test
Background
• Clinical laboratory – Division of Genomic Diagnostics, certified by regulatory
agencies (CAP).
• CWES test requires stringent validation per CAP criteria to establish
performance metrics of the test.
Utilizing NIST* data in validation of CWES Test
• Sequence and call variants of NA12878 at CHOP
• CHOP ROI: Agilent SureSelect V5+ (SSV5+) baits file
• Compare CHOP dataset to NIST data set for concordance
NIST Data Set Details:
• High-quality reference data set on NA12878 (Dec. 2013)
• NIST's highly confident Regions of Interest (ROI)
• Variants called in 219,222 regions on hg19 assembly
*NIST: National Institute of Standards and Technology
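The comparison step on this slide (restrict both call sets to the lab's ROI and the high-confidence regions, then check concordance) can be sketched as below. This is not the CHOP pipeline: the function names and toy data are hypothetical, variants are assumed to be `(chrom, pos, ref, alt)` tuples with 1-based positions, regions are assumed to be BED-style 0-based half-open intervals, and variants are matched only when they are represented identically.

```python
def in_regions(chrom, pos, regions):
    """True if 1-based `pos` falls in any 0-based, half-open (start, end) interval."""
    return any(start < pos <= end for start, end in regions.get(chrom, []))

def compare_calls(lab_calls, reference_calls, lab_roi, confident_regions):
    """Compare two call sets inside the intersection of the ROI and confident regions."""
    def keep(variant):
        chrom, pos = variant[0], variant[1]
        return in_regions(chrom, pos, lab_roi) and in_regions(chrom, pos, confident_regions)

    lab = {v for v in lab_calls if keep(v)}
    ref = {v for v in reference_calls if keep(v)}
    return {
        "concordant": len(lab & ref),
        "lab_only": len(lab - ref),        # candidate false positives
        "reference_only": len(ref - lab),  # candidate false negatives
    }

# Toy data (hypothetical coordinates and alleles).
lab_roi = {"chr1": [(0, 1000)]}
confident_regions = {"chr1": [(100, 500)]}
reference_calls = {("chr1", 150, "A", "G"), ("chr1", 300, "C", "T"), ("chr1", 800, "G", "A")}
lab_calls = {("chr1", 150, "A", "G"), ("chr1", 400, "T", "C"), ("chr1", 900, "G", "C")}

summary = compare_calls(lab_calls, reference_calls, lab_roi, confident_regions)
print(summary)
```

Calls outside the high-confidence regions (positions 800 and 900 here) are ignored instead of being counted as errors. A real comparison would also need to reconcile different representations of the same variant, which exact tuple matching does not do.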
30. Feedback from MoCha lab in NCI
• We built a targeted amplicon NGS assay for
detecting mutations in clinical tumor specimens
• To assess the assay’s specificity, we compared 84
runs of CEPH NA12878 data from our assay with
NIST’s consensus variant list (VCF v2.15)
• We observed a high overall concordance with a
few FP variants in homopolymeric regions unique
to our platform
• We concluded that NIST GIAB is a useful
reference standard to evaluate assay specificity
31. Using Genome in a Bottle calls to
benchmark clinical exome sequencing
at Mount Sinai School of Medicine
“We evaluate a set of
NA12878 technical replicates
against GIAB for each new
pipeline version.”
38. How Can I Get Involved?
• Use our integrated SNP/indel
genotypes for NA12878 and give
us feedback
– Cells and DNA currently available
from Coriell
– NIST RM available late 2014
• Sequencing/analyzing the new
Genome in a Bottle samples
• Help with Structural Variant calls
• Help with analyzing data from
long-read technologies
• Attend our biannual workshops
(January in CA, August in MD)
• Help develop methods to
measure performance using our
well-characterized genomes
http://genomeinabottle.org
Email:
Justin Zook - jzook@nist.gov
Marc Salit – salit@nist.gov
Slides on slideshare at:
http://www.slideshare.net/GenomeInABottle
Editor's Notes
One FN SNV is confirmed to be reference
One FP indel is confirmed to be a real indel
Three FP SNVs are confirmed to be real SNVs