2. NIST RM Development Plans
Genome(s) and milestones by quarter (Q4 2014 to Q4 2015):
• HG-001/NA12878
– Q4 2014: Release NIST RM8398; preliminary large deletions
– 2015: Refined structural variants
• HG-002 to HG-004 (Ashkenazim trio)
– Q4 2014: Illumina, Complete Genomics, Ion, BioNano, and SOLiD data
– Q1 2015: Preliminary SNPs/indels; 100x PacBio data; Illumina assembled long reads
– Q2 2015: Refined SNPs/indels; preliminary SVs
– Q3 2015: Refined structural variants
– Q4 2015: NIST RMs 8391/8392 release
• HG-005 (son in Asian trio)
– Q4 2014: Illumina, Complete Genomics, Ion, BioNano, and SOLiD data
– Q1 2015: Illumina assembled long reads
– Q2 2015: Preliminary SNPs/indels
– Q3 2015: Refined SNPs/indels; refined structural variants
– Q4 2015: NIST RM8393 release
3. Preliminary uses of high-confidence NIST-GIAB genotypes for NA12878
• NIST has released several versions of high-confidence genotypes for its pilot RM
• These data are presently being used for benchmarking
– prior to release of RMs
– SNPs & indels
• High-confidence calls cover ~77% of the genome
4. Data Release Plans
Individual Datasets
• Uploaded to GIAB FTP site as they are collected
• May include raw reads, aligned reads, and variant/reference calls
Integrated High-confidence Calls
• First develop SNP, indel, and homozygous reference calls
• Then develop SV and non-SV calls
• Released calls are versioned
• Preliminary callsets will be made available for critique
• Data jamboree?
5. Pilot RM (NA12878)
• HapMap/1000 Genomes sample
• Lots of public data and analyses
• Not consented for commercial redistribution
• Data from pedigree available and analyzed
• ~8000 units for NIST RM
• High-confidence calls released
– integrates multiple datasets and phased pedigree analysis
• Developing SV calls
• Planned release as NIST RM8398 in Q4 2014
6. Ashkenazim PGP trio
• Personal Genome Project trio (huAA53E0/hu8E87A9/hu6E4515)
• Mother/father/son at Coriell (GM24143/GM24149/GM24385)
• Consented for commercial redistribution
• Most short-read data will be available Q3 2014
• 100x PacBio WGS completed ~Q1 2015
• 10x Illumina assembled long reads for son ~Q1 2015
• Planned NIST RM release ~Q4 2015
– NIST RM 8391 will be only the son (~8000 units)
– NIST RM 8392 will contain all 3 family members (~2500 units)
7. Asian PGP trio
• Personal Genome Project trio (hu91BD69/hu38168C/huCA017E)
• Mother/father/son at Coriell (GM24695/GM24694/GM24631)
• Only the son planned for NIST RM but trio will be characterized
• Consented for commercial redistribution
• Most short-read data will be available Q3-Q4 2014
• 10x Illumina assembled long reads for son ~Q1 2015
• Planned NIST RM release ~Q4 2015
– NIST RM 8393 will be only the son (~11000 units)
8. New Platform-specific (-independent?) Integration Method
• Normalize and take union of calls
• Simple SNPs/indels
– Illumina/SOLiD: GATK HC force calls
– Ion: TVC force calls
– CG: use Ref file
– If all biased or low qual, uncertain
– Elseif all concordant, high-conf
– Elseif all unbiased are concordant, high-conf
– Else uncertain
• Complex Variants
– Use vcfeval or SMASH for sequential pairwise comparison
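The union step and the four-way decision rule for simple SNPs/indels can be sketched in a few lines of Python. This is only an illustrative reading of the slide, not the actual NIST integration code. The `Call` record, its `biased` and `low_qual` flags, and the function names are hypothetical. Sites are assumed to be normalized already, and each dataset is assumed to supply at most one call per site (the force-calling step would normally fill in the rest). "Unbiased" is read here as "not flagged as biased", which is one possible interpretation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# (chrom, pos, ref, alt); assumed to be normalized upstream.
Site = Tuple[str, int, str, str]


@dataclass(frozen=True)
class Call:
    genotype: str           # e.g. "0/1"
    biased: bool = False    # hypothetical flag: evidence of systematic bias
    low_qual: bool = False  # hypothetical flag: low-quality call


def union_of_sites(callsets: Dict[str, Dict[Site, Call]]) -> List[Site]:
    """Return the sorted union of sites called in any dataset."""
    sites = set()
    for calls in callsets.values():
        sites.update(calls.keys())
    return sorted(sites)


def classify(calls: List[Call]) -> str:
    """Apply the slide's rules in order and return a confidence label."""
    if all(c.biased or c.low_qual for c in calls):
        return "uncertain"                       # all biased or low qual
    if len({c.genotype for c in calls}) == 1:
        return "high-conf"                       # all concordant
    unbiased = {c.genotype for c in calls if not c.biased}
    if len(unbiased) == 1:
        return "high-conf"                       # all unbiased are concordant
    return "uncertain"


def integrate(callsets: Dict[str, Dict[Site, Call]]) -> Dict[Site, str]:
    """Label every site in the union using the calls available for it."""
    labels = {}
    for site in union_of_sites(callsets):
        calls = [cs[site] for cs in callsets.values() if site in cs]
        labels[site] = classify(calls)
    return labels


# Toy data: platform names are real, the sites and genotypes are made up.
s1, s2 = ("1", 100, "A", "G"), ("1", 200, "C", "T")
s3, s4 = ("1", 300, "G", "A"), ("1", 400, "T", "C")
example = {
    "illumina": {s1: Call("0/1"), s2: Call("1/1", biased=True), s4: Call("0/1")},
    "ion": {s1: Call("0/1"), s2: Call("0/1"), s4: Call("1/1")},
    "solid": {s1: Call("0/1", low_qual=True), s2: Call("0/1"),
              s3: Call("0/1", low_qual=True)},
}
labels = integrate(example)
for site, label in labels.items():
    print(site, label)
```

Checking the rules in a fixed order means a single biased, discordant dataset cannot veto a site on which the remaining datasets agree.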
9. Integration Method Plans
• Implement new integration methods on the cloud
– Easier for…
• distributed analysis
• scalability
• transparency
• others to reproduce results
• First, analyze NA12878 RM data with new methods to ensure they work well
• Then, apply to PGP trios