The Platinum Genomes project aims to create a comprehensive set of variant calls that can be used as a "truth" dataset. They are sequencing and analyzing a large pedigree to identify variant calls that are consistent with Mendelian inheritance. Variants that are 100% consistent across all samples are considered accurate. Over 3 million accurate SNPs and 240 thousand accurate indels were identified. Additional variants were incorporated from other callers to create a more complete catalog. The pedigree analysis can also identify potential deletions and help distinguish true variants from errors.
2. Platinum Genome project: Goals
Problem: No comprehensive truth set of variant calls for validation
Solution: Sequence and analyze large family pedigree
Use Mendelian inheritance to identify good / bad variant calls
– Including SNPs, indels & SVs
Aggressively incorporate variant calls
– Incorporate multiple algorithms and sequencing technologies
– Do not limit this just to what is currently easy to call
Make the data available publicly
– Both raw data and processed calls with accuracy assessment
Re-assess algorithms against better truth data
– Better and more comprehensive truth data will allow for rapid advances in software
3. Using inheritance to detect conflicts: trio analysis
[Diagram: MOM, DAD and CHILD, showing the mother's chromosomes and the father's chromosomes]
Child receives blue chromosome from mother and green chromosome from father: e.g. typical trio analysis
In a trio analysis like this only 50% of each parent's DNA is passed on to the child, so many of the variants will only be called in one parent
– Have no power to detect false positives in the parents
A trio analysis is also not very sensitive for detecting errors
– For example, if father is AC and mother is AC then the child can be AA, AC or CC and still be consistent with Mendelian inheritance
– Many errors occur at sites that are systematically het, but trio analysis assumes that these are correct
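The AC x AC example can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides; the function name `mendelian_consistent` is made up for illustration, and genotypes are assumed to be two-letter strings with unordered alleles.

```python
from itertools import product


def mendelian_consistent(mother, father, child):
    """Return True if the child genotype can be built from one allele of each parent."""
    possible = {"".join(sorted(m + f)) for m, f in product(mother, father)}
    return "".join(sorted(child)) in possible


# With two het parents every biallelic child genotype passes the check,
# so a genotyping error in the child cannot be detected at such a site.
consistent_children = [gt for gt in ("AA", "AC", "CC")
                       if mendelian_consistent("AC", "AC", gt)]
print(consistent_children)

# A conflict is only visible when a parent lacks the required allele.
print(mendelian_consistent("AA", "AA", "AC"))
```

The first call lists all three child genotypes as consistent, which is the reason a trio alone has little power at systematically het sites.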
4. Using inheritance to determine accuracy: larger pedigree
Possible GT patterns (MOM, DAD, then CHILDREN 1-7)
Pattern   MOM  DAD  1   2   3   4   5   6   7
1         AT   AA   AA  TA  AA  TA  AA  TA  AA
2         TA   AA   TA  AA  TA  AA  TA  AA  TA
3         AA   AT   AA  AT  AT  AA  AA  AA  AT
4         AA   TA   AT  AA  AA  AT  AT  AT  AA
OBSERVED GENOTYPES
Observed  AA   AT   AA  AT  AT  AA  AA  AA  AT
5-9. Using inheritance to determine accuracy: larger pedigree
Pattern   MOM  DAD  1   2   3   4   5   6   7   # Errors / Hamming Distance
1         AT   AA   AA  TA  AA  TA  AA  TA  AA  6
2         TA   AA   TA  AA  TA  AA  TA  AA  TA  5
3         AA   AT   AA  AT  AT  AA  AA  AA  AT  0
4         AA   TA   AT  AA  AA  AT  AT  AT  AA  7
OBSERVED GENOTYPES
Observed  AA   AT   AA  AT  AT  AA  AA  AA  AT
100% consistent therefore we predict that all genotypes are correct
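The per-pattern error counts can be reproduced by comparing genotypes sample by sample. The sketch below is an illustration rather than the authors' code: it assumes each pair of letters in a row is one sample's genotype (MOM, DAD, then children 1-7) and that allele order within a genotype does not matter. The names `PATTERNS`, `genotypes` and `hamming_distance` are hypothetical.

```python
# Genotype patterns transcribed from the slide: MOM, DAD, then children 1-7.
PATTERNS = {
    1: "AT AA AA TA AA TA AA TA AA",
    2: "TA AA TA AA TA AA TA AA TA",
    3: "AA AT AA AT AT AA AA AA AT",
    4: "AA TA AT AA AA AT AT AT AA",
}
OBSERVED = "AA AT AA AT AT AA AA AA AT"


def genotypes(row):
    """Split a row into per-sample genotypes, ignoring allele order (TA == AT)."""
    return ["".join(sorted(gt)) for gt in row.split()]


def hamming_distance(expected, observed):
    """Count the samples whose genotype differs between two rows."""
    return sum(e != o for e, o in zip(genotypes(expected), genotypes(observed)))


distances = {name: hamming_distance(row, OBSERVED) for name, row in PATTERNS.items()}
best = min(distances, key=distances.get)
print(distances)  # {1: 6, 2: 5, 3: 0, 4: 7}
print(best)       # 3
```

Pattern 3 has distance 0, so the observed genotypes are fully consistent with one inheritance pattern, while the next-best pattern would need five genotype errors.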
10. Platinum Genomes - CEPH/Utah Pedigree 1463
[Pedigree diagram]
Grandparents: 12889, 12890, 12891, 12892
Parents: 12877, 12878
Children: 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12886, 12887, 12888, 12893
Analysis of SNPs in the parents and 11 children
All 17 members sequenced to at least 50x depth (PCR-Free protocol)
– SNPs & indels called using BWA + GATK + VQSR
Each member of the trio highlighted in bold (12877, 12878 and 12882) is sequenced to 200x
An additional 200x technical replicate was done for NA12882
11. Analysis of the data
50x raw data was aligned and variants called using BWA + GATK + VQSR
– Accurate calls were supplemented with accurate variant calls made by Cortex using
the same sequence data and accurate CGI calls made across the same pedigree
First step is to define the inheritance of the parental chromosomes to the eleven
children everywhere in the genome
– Identified 709 crossover events between the parents and eleven children
Define accurate variants as those where the genotypes are 100% consistent
with the transmission of the parental haplotypes
– At any position of the genome there are only 16 possible combinations of genotypes (biallelic & diploid) across the pedigree that are consistent with the inheritance pattern
– Out of 3^13 (~1.6M) possible genotype combinations
Subsequent analysis mostly excludes all variants that are homozygous
alternative across the last two generations of this pedigree (~750k)
– Mostly will be accurate but for these “trivially consistent” sites we cannot differentiate
accurate from systematic errors or validate ploidy
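The "16 out of 3^13" figure follows from the four parental haplotypes each carrying either the reference or the alternative allele (2^4 = 16). The sketch below enumerates them for one locus. The inheritance vector `INHERITANCE` is made up for illustration (the real one comes from the crossover analysis), and the function name is hypothetical.

```python
from itertools import product

# Hypothetical inheritance vector for 11 children at one locus: for each child,
# which maternal haplotype (0 or 1) and which paternal haplotype (0 or 1) it received.
INHERITANCE = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0), (1, 1),
               (0, 1), (1, 0), (0, 0), (1, 0), (0, 1)]


def consistent_genotype_vectors(inheritance):
    """Enumerate pedigree genotype vectors (alt-allele counts) that fit the inheritance."""
    vectors = set()
    # Each of the four parental haplotypes carries the ref (0) or alt (1) allele.
    for m0, m1, d0, d1 in product((0, 1), repeat=4):
        mom, dad = (m0, m1), (d0, d1)
        children = tuple(mom[m] + dad[d] for m, d in inheritance)
        vectors.add((m0 + m1, d0 + d1) + children)
    return vectors


consistent = consistent_genotype_vectors(INHERITANCE)
total = 3 ** 13  # three genotypes (hom ref, het, hom alt) for each of 13 samples
print(len(consistent), total)  # 16 1594323
```

Because only 16 of roughly 1.6 million genotype vectors fit the inheritance, a set of calls is unlikely to match by accident. Two of the 16 are the all-hom-ref and all-hom-alt vectors, the latter being the "trivially consistent" case.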
12. Input all possible data and use the inheritance to separate good from bad
Variants are unlikely to accidentally match inheritance
[Flowchart: call sets A, B and C are compared against inheritance]
– NO CONFLICTS: score (plat./gold) and store in db w/score
– CONFLICTS: assess problem
– BIOLOGY: score (gold/silver) and store in db w/comments
– BAD: comment and store in db w/comments
14. Accurate SNP positions based on the pedigree analysis
[Bar chart: SNP counts (millions) by GATK site description* (All, Pass, Filtered); legend (Pedigree Analysis): Correct, Problematic. Value labels: 3,217,748 and 408,915.]
Chart annotation: Normally might exclude these from our analysis because the variant caller filtered some of the calls
Additional 754,014 SNPs are “trivially consistent”, i.e. all 13 samples are hom alt.
*Filtered means that at least one variant call was called but quality filtered
15. Hamming distance for the “accurate” SNPs to the 2nd best solution
[Histogram: percent of sites (y-axis) against Hamming distance 0-13 (x-axis)]
At these sites >85% of the positions would require at least four (very specific) genotype errors to have erroneously ended up with the observed predicted-accurate calls
16. Using other call sets for a more comprehensive catalogue
[Bar chart: counts (x1000) for the Cortex and CGI call sets; legend (Pedigree Analysis): Unique, Common. Value labels: 57,270 (1.6%) and 22,922 (0.6%).]
17. Concordance between “pedigree-accurate” GTs
Comparison*      # Sites     # Diff GTs   # Same GTs    GT Concordance
GATK & Cortex    2,053,136   5            26,690,763    99.99998%
GATK & CGI       3,146,399   19           40,903,168    99.99995%
Cortex & CGI     1,890,718   7            24,579,327    99.99997%
*Excluding sites where alleles did not match or all samples homozygous alternative
Includes 763,085 GT calls and 264,771 positions quality filtered by GATK
Attempting to validate a sample of the sites that are unique to a single call set
– Targeting ~300 per call set
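The concordance column can be recomputed from the counts in the SNP table as same / (same + diff). The sketch below is illustrative only. It also checks an inference that is not stated on the slide: that same + diff equals 13 genotypes (2 parents and 11 children) per site.

```python
# (sites, differing genotypes, identical genotypes) taken from the SNP table.
COMPARISONS = {
    "GATK & Cortex": (2_053_136, 5, 26_690_763),
    "GATK & CGI": (3_146_399, 19, 40_903_168),
    "Cortex & CGI": (1_890_718, 7, 24_579_327),
}
SAMPLES_PER_SITE = 13  # assumption: 2 parents + 11 children genotyped per site


def concordance_percent(n_diff, n_same):
    """Percentage of compared genotypes that are identical between two call sets."""
    return 100.0 * n_same / (n_same + n_diff)


for name, (n_sites, n_diff, n_same) in COMPARISONS.items():
    # Sanity check: every site contributes one genotype per sample.
    assert n_diff + n_same == n_sites * SAMPLES_PER_SITE
    print(f"{name}: {concordance_percent(n_diff, n_same):.5f}%")
```

Rounded to five decimal places, the three values match the table (99.99998%, 99.99995% and 99.99997%).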
19. Accurate GATK indel positions based on pedigree
[Bar chart: indel counts (thousands) by site description (All, Pass, Filtered); legend (Pedigree Analysis): Correct, Problematic. Value labels: 240,490 and 141,508.]
Additional 115,587 indels are “trivially consistent”, i.e. all 13 samples are hom alt.
20. Using other call sets for a more comprehensive catalogue
[Bar chart: counts (x1000) for the Cortex and CGI call sets; legend (Pedigree Analysis): Unique, Common. Value labels: 39,335 (10%) and 9,637 (2.4%).]
21. Concordance between overlapping “accurate” indels
Comparison*      # Sites   # Diff GTs   # Same GTs   GT Concordance
GATK & Cortex    96,228    43           1,250,921    99.997%
GATK & CGI       219,445   2,817        2,514,785    99.901%
Cortex & CGI     78,050    198          1,014,650    99.981%
*Excluding sites where alleles did not match or all samples homozygous alternative
Attempting to validate a sample of the sites that are unique to a single call set
– Targeting ~300 per call set
23. Conflict mode: Hemizygous deletions
Pattern   MOM  DAD  1   2   3   4   5   6   7   # Errors / Hamming Distance
1         AT   AA   AA  TA  AA  TA  AA  TA  AA  6
2         TA   AA   TA  AA  TA  AA  TA  AA  TA  7
3         AA   AT   AA  AT  AT  AA  AA  AA  AT  2
4         AA   TA   AT  AA  AA  AT  AT  AT  AA  7
OBSERVED GENOTYPES
Observed  AA   AT   AA  AT  TT  AA  AA  AA  TT
“Best” solution still indicates multiple errors
24. Conflict mode: Hemizygous deletions
Pattern   MOM  DAD  1   2   3   4   5   6   7   # Errors / Hamming Distance
1         A-   AT   AA  -T  AT  -A  AA  -A  AT  6
2         -A   TA   -T  AA  -A  AT  -T  AT  -A  5
3         -A   AT   -A  AT  -T  AA  -A  AA  -T  0
4         A-   TA   AT  -A  AA  -T  AT  -T  AA  7
OBSERVED GENOTYPES
Observed  AA   AT   AA  AT  TT  AA  AA  AA  TT
100% consistent therefore we predict that there is a deletion
Hamming distance will be less when including deletions so need to be careful
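The reason a deletion resolves the conflict is that a hemizygous sample has a single allele, which a diploid caller is expected to report as homozygous. The sketch below is an illustration under that assumption, not the authors' implementation. It uses only the zero-error pattern from the slide, with "-" marking the deleted haplotype, and the name `apparent_genotype` is hypothetical.

```python
# Zero-error pattern from the slide (MOM, DAD, children 1-7); "-" marks a deleted allele.
DELETION_PATTERN = "-A AT -A AT -T AA -A AA -T"
OBSERVED = "AA AT AA AT TT AA AA AA TT"


def apparent_genotype(gt):
    """Genotype a diploid caller would be expected to report for a true genotype."""
    alleles = [a for a in gt if a != "-"]
    if not alleles:
        return "--"  # homozygous deletion: nothing to call
    if len(alleles) == 1:
        alleles *= 2  # hemizygous: the single remaining allele looks homozygous
    return "".join(sorted(alleles))


apparent = [apparent_genotype(gt) for gt in DELETION_PATTERN.split()]
observed = ["".join(sorted(gt)) for gt in OBSERVED.split()]
distance = sum(a != o for a, o in zip(apparent, observed))
print(apparent)
print(distance)  # 0
```

Allowing "-" as a third haplotype state increases the number of candidate patterns, which is why the slide warns that Hamming distances shrink and need careful interpretation.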
25. Read depth of 5,180 SNPs predicted to overlap deletions
[Histogram: counts (0-5000) against read depth (0-100); depth regions labeled Hom Del, Haploid and Diploid; genotype labels A-, -B, AA, AB and BB]
Depth shown for positions where the genotypes indicate that the SNP overlaps a deletion. Large number of children allows us to more reliably separate errors from deletions.
26. Have many potential large deletions to validate…
5,180 SNPs are predicted to overlap a hemizygous deletion
These SNPs cluster into ~902 unique events
– Clusters show evidence for ~279 deletions >1kb segregating in this pedigree
– Largest event is >152kb with 274 SNPs supporting the call
Have begun validating these events beyond just visual inspection
– 132 overlap with previously reported events (1kGP)
– Working to define the breakpoints for wet lab validation
Incorporating other calling methods (Cortex, BreakDancer…)
Some SNPs also support the presence of duplications in a single parent
27. Summary
We have sequenced a large pedigree and used the inheritance information to
create a catalogue of ~4.45M accurate SNP calls
– Over 3.7M biallelic SNPs agree with transmission of parental chromosomes
– Over 750k homozygous alternative SNPs are trivially accurate across the pedigree
Have also called indels using four different methods to produce over 550k “accurate” indel calls across the pedigree
– Over 428k bi-allelic indels agree with transmission of parental chromosomes
– Over 110k homozygous alternative indels are trivially accurate across the pedigree
Concordance for the bi-allelic, pedigree-accurate calls is >99.9999% for SNPs
and 99.9% for indels between call sets
SVs are in progress (just deletions right now)
The SNP and indel results presented here can be used for comparison
– Incorporating homozygous reference calls across the pedigree for completeness
– May see immediate gains by testing new algorithms against a better truth set
28. Acknowledgements
Morten Kallberg – alignment & variant calling
Han-Yu Chuang – analysis of SNP calls
Phil Tedder – validation of de novo SNPs
Sean Humphray
Epameinondas Fritzilas
Wendy Wong
David Bentley
Elliott Margulies