Platinum Genomes: Towards a comprehensive truth data set

Michael A. Eberle
Morten Kallberg, Han-Yu Chuang

© 2010 Illumina, Inc. All rights reserved.
Illumina, illuminaDx, Solexa, Making Sense Out of Life, Oligator, Sentrix, GoldenGate, GoldenGate Indexing, DASL, BeadArray, Array of Arrays, Infinium, BeadXpress, VeraCode, IntelliHyb, iSelect, CSPro,
GenomeStudio, Genetic Energy, HiSeq, and HiScan are registered trademarks or trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.
Platinum Genome project: Goals

    Problem: No comprehensive truth set of variant calls exists for validation

    Solution: Sequence and analyze a large family pedigree
      Use Mendelian inheritance to identify good / bad variant calls
        – Including SNPs, indels & SVs

      Aggressively incorporate variant calls
        – Incorporate multiple algorithms and sequencing technologies
        – Do not limit this to what is currently easy to call

      Make the data publicly available
        – Both raw data and processed calls with accuracy assessment

      Re-assess algorithms against better truth data
        – Better and more comprehensive truth data will allow for rapid advances in software

Using inheritance to detect conflicts: trio analysis

     [Figure: chromosomes of MOM, DAD and CHILD. The child receives the blue chromosome from
     the mother and the green chromosome from the father, e.g. a typical trio analysis]

     In a trio analysis like this only 50% of each parent's DNA is passed on to
     the child, so many of the variants will only be called in one parent
       – Have no power to detect false positives in the parents

     A trio analysis is also not very sensitive for detecting errors
       – For example if father is AC and mother is AC then the child can be AA, AC or CC and
         still be consistent with Mendelian inheritance
       – Many errors occur at sites that are systematically het but trio analysis assumes that
         these are correct
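
The AC x AC example above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `mendelian_consistent` and the two-character genotype strings are assumptions made for the example, not part of the original analysis.

```python
from itertools import product


def mendelian_consistent(mother, father, child):
    """True if the child's genotype can be built from one maternal and one paternal allele."""
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return tuple(sorted(child)) in possible


# Both parents heterozygous: every child genotype passes the trio check,
# so a miscalled child genotype at such a site goes unnoticed.
for child in ("AA", "AC", "CC"):
    print(child, mendelian_consistent("AC", "AC", child))

# A conflict is only visible when the child carries an allele neither parent has.
print("AC", mendelian_consistent("AA", "AA", "AC"))
```

With two heterozygous parents all three child genotypes are accepted, which is why a single trio has little power at these sites.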

Using inheritance to determine accuracy: larger pedigree

                                 CHILDREN
       MOM   DAD    1     2     3     4     5     6     7     # Errors / Hamming Distance

     Possible GT patterns
       A T   A A   A A   T A   A A   T A   A A   T A   A A     6
       T A   A A   T A   A A   T A   A A   T A   A A   T A     5
       A A   A T   A A   A T   A T   A A   A A   A A   A T     0
       A A   T A   A T   A A   A A   A T   A T   A T   A A     7

     Observed genotypes
       A A   A T   A A   A T   A T   A A   A A   A A   A T

     100% consistent, therefore we predict that all genotypes are correct
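
The error counts on this slide can be reproduced with a small sketch. It assumes (this is not stated on the slide) that a genotype counts as an error whenever it differs from the pattern regardless of allele order, so "T A" and "A T" are the same call; the helper name `hamming` is hypothetical.

```python
def hamming(pattern, observed):
    """Number of pedigree members whose genotype differs (allele order ignored)."""
    return sum(sorted(p) != sorted(o) for p, o in zip(pattern, observed))


# Order: MOM, DAD, children 1-7 (genotypes copied from the slide).
PATTERNS = [
    "AT AA AA TA AA TA AA TA AA",
    "TA AA TA AA TA AA TA AA TA",
    "AA AT AA AT AT AA AA AA AT",
    "AA TA AT AA AA AT AT AT AA",
]
OBSERVED = "AA AT AA AT AT AA AA AA AT".split()

distances = [hamming(p.split(), OBSERVED) for p in PATTERNS]
print(distances)  # [6, 5, 0, 7]
```

The third pattern has distance 0, which is what "100% consistent" means here.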



Platinum Genomes - CEPH/Utah Pedigree 1463

     Grandparents:   12889   12890                   12891   12892
     Parents:            12877*                          12878*
     Children:       12879   12880   12881   12882*   12883   12884   12885   12886   12887   12888   12893

     * Shown in bold on the original slide

     Analysis of SNPs in the parents and 11 children

     All 17 members sequenced to at least 50x depth (PCR-Free protocol)
       – SNPs & indels called using BWA + GATK + VQSR

     Each member of the trio highlighted in bold (12877, 12878, 12882) is sequenced to 200x
     An additional 200x technical replicate was done for NA12882
Analysis of the data

     50x raw data was aligned and variants were called using BWA + GATK + VQSR
       – Accurate calls were supplemented with accurate variant calls made by Cortex using
         the same sequence data and accurate CGI calls made across the same pedigree
     First step is to define the inheritance of the parental chromosomes to the eleven
     children everywhere in the genome
       – Identified 709 crossover events between the parents and eleven children
     Define accurate variants as those where the genotypes are 100% consistent
     with the transmission of the parental haplotypes
       – At any position of the genome there are only 16 possible combinations of genotypes
         (biallelic & diploid) across the pedigree that are consistent with the inheritance pattern
       – Out of 3^13 (~1.6M) possible genotype combinations

     Subsequent analysis mostly excludes all variants that are homozygous
     alternative across the last two generations of this pedigree (~750k)
       – Most will be accurate, but for these “trivially consistent” sites we cannot differentiate
         accurate calls from systematic errors or validate ploidy
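
One way to read the two counts above (this derivation is an interpretation, not spelled out on the slide): once the transmission of the parental haplotypes is fixed, a biallelic site is fully determined by which of the two alleles sits on each of the four parental haplotypes, whereas without that constraint each of the 13 samples (2 parents + 11 children) could take any of three genotypes.

```latex
\[
\underbrace{2^{4}}_{\text{2 alleles on each of 4 parental haplotypes}} = 16,
\qquad
\underbrace{3^{13}}_{\text{3 genotypes for each of 13 samples}} = 1{,}594{,}323 \approx 1.6\,\mathrm{M}
\]
```

So only 16 of roughly 1.6 million conceivable genotype combinations match the inheritance pattern at a given position.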


Input all possible data and use the inheritance to separate good from bad:
Variants are unlikely to accidentally match inheritance

     [Flowchart]
     Set A, Set B, Set C
       -> Compare against inheritance
            NO CONFLICTS -> Score (plat./gold) -> db w/score
            CONFLICTS    -> Assess problem
                              BIOLOGY -> Score (gold/silver) -> db w/comments
                              BAD     -> Comment             -> db w/comments

Cataloging the accurate SNPs




Accurate SNP positions based on the pedigree analysis

     [Bar chart: counts (millions) of SNP positions by GATK site description*, split by
     pedigree analysis into Correct and Problematic]
       – All Pass: 3,217,748
       – Filtered: 408,915
           Normally might exclude these from our analysis because the variant caller
           filtered some of the calls
       – Additional 754,014 SNPs are “trivially consistent”, i.e. all 13 samples are hom alt.

     *Filtered means that at least one variant call was made but quality filtered
Hamming distance for the “accurate” SNPs to the 2nd best solution

     [Histogram: percent of sites (y-axis, 0 to 60) by Hamming distance to the 2nd best
     solution (x-axis, 0 to 13)]

     At these sites >85% of the positions would require at least four (very specific)
     genotype errors to have erroneously ended up with the observed predicted-accurate calls
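
A toy version of the "2nd best solution" idea is sketched below. Everything specific in it is an assumption for illustration: the 7-child inheritance vector `INHERITANCE` (chosen to match the 7-child example genotypes used earlier in the deck), the allele letters, and the helper names. It enumerates every genotype pattern that the inheritance allows and reports the best and second-best Hamming distances to the observed calls.

```python
from itertools import product

# Assumed toy inheritance: for each child, the index of the maternal and of the
# paternal haplotype that was transmitted.
INHERITANCE = [(1, 0), (0, 1), (1, 1), (0, 0), (1, 0), (0, 0), (1, 1)]


def pattern(haps, inheritance):
    """Genotypes of mom, dad and children for one allele assignment to the 4 parental haplotypes."""
    m0, m1, d0, d1 = haps
    mom, dad = (m0, m1), (d0, d1)
    members = [mom, dad] + [(mom[i], dad[j]) for i, j in inheritance]
    return tuple("".join(sorted(g)) for g in members)


def best_and_second_best(observed, inheritance, alleles="AT"):
    """Smallest and second-smallest Hamming distance to any inheritance-consistent pattern."""
    candidates = {pattern(h, inheritance) for h in product(alleles, repeat=4)}
    distances = sorted(sum(a != b for a, b in zip(c, observed)) for c in candidates)
    return distances[0], distances[1]


observed = tuple("AA AT AA AT AT AA AA AA AT".split())
print(best_and_second_best(observed, INHERITANCE))  # (0, 4)
```

In this toy case the observed calls fit one pattern exactly, and the nearest alternative is four genotype changes away, so four specific errors would be needed to land on a consistent but wrong answer.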
Using other call sets for a more comprehensive catalogue

     [Bar chart: counts (x1000) of SNPs per additional call set, split by pedigree
     analysis into Unique and Common]
       – Cortex: 22,922 (0.6%)
       – CGI: 57,270 (1.6%)

Concordance between “pedigree-accurate” GTs

      Comparison*        # Sites      # Diff GTs    # Same GTs     GT Concordance
      GATK & Cortex     2,053,136          5        26,690,763       99.99998%
      GATK & CGI        3,146,399         19        40,903,168       99.99995%
      Cortex & CGI      1,890,718          7        24,579,327       99.99997%

      *Excluding sites where alleles did not match or all samples homozygous alternative

     Includes 763,085 GT calls and 264,771 positions quality filtered by GATK
     Attempting to validate a sample of the sites that are unique to a single call set
       – Targeting ~300 per call set
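
The concordance column of the SNP table can be recomputed from the other columns. The sketch below assumes that concordance is defined as same / (same + different) genotypes and that each site contributes 13 genotypes (2 parents + 11 children); both assumptions are consistent with the numbers in the table but are not stated on the slide.

```python
SAMPLES = 13  # 2 parents + 11 children

# (sites, differing genotypes, identical genotypes), copied from the SNP table.
ROWS = {
    "GATK & Cortex": (2_053_136, 5, 26_690_763),
    "GATK & CGI": (3_146_399, 19, 40_903_168),
    "Cortex & CGI": (1_890_718, 7, 24_579_327),
}


def concordance(diff, same):
    """Percentage of compared genotypes that are identical."""
    return 100.0 * same / (same + diff)


results = {}
for name, (sites, diff, same) in ROWS.items():
    # Every site is compared across all 13 samples.
    assert sites * SAMPLES == diff + same
    results[name] = f"{concordance(diff, same):.5f}%"
    print(name, results[name])
```

The printed values round to the percentages given in the table.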

Indel analysis




Accurate GATK indel positions based on pedigree

     [Bar chart: counts (thousands) of indel positions by site description, split by
     pedigree analysis into Correct and Problematic]
       – All Pass: 240,490
       – Filtered: 141,508
       – Additional 115,587 indels are “trivially consistent”, i.e. all 13 samples are hom alt.

Using other call sets for a more comprehensive catalogue

     [Bar chart: counts (x1000) of indels per additional call set, split by pedigree
     analysis into Unique and Common]
       – Cortex: 39,335 (10%)
       – CGI: 9,637 (2.4%)

Concordance between overlapping “accurate” indels

      Comparison*        # Sites      # Diff GTs    # Same GTs     GT Concordance
      GATK & Cortex        96,228         43         1,250,921        99.997%
      GATK & CGI          219,445      2,817         2,514,785        99.901%
      Cortex & CGI         78,050        198         1,014,650        99.981%

      *Excluding sites where alleles did not match or all samples homozygous alternative

     Attempting to validate a sample of the sites that are unique to a single call set
       – Targeting ~300 per call set

CNVs




Conflict mode: Hemizygous deletions

       MOM   DAD    1     2     3     4     5     6     7     # Errors / Hamming Distance

     Possible GT patterns
       A T   A A   A A   T A   A A   T A   A A   T A   A A     6
       T A   A A   T A   A A   T A   A A   T A   A A   T A     7
       A A   A T   A A   A T   A T   A A   A A   A A   A T     2
       A A   T A   A T   A A   A A   A T   A T   A T   A A     7

     Observed genotypes
       A A   A T   A A   A T   T T   A A   A A   A A   T T

     “Best” solution still indicates multiple errors

Conflict mode: Hemizygous deletions

       MOM   DAD    1     2     3     4     5     6     7     # Errors / Hamming Distance

     Possible GT patterns (“-” marks a deleted allele)
       A -   A T   A A   - T   A T   - A   A A   - A   A T     6
       - A   T A   - T   A A   - A   A T   - T   A T   - A     5
       - A   A T   - A   A T   - T   A A   - A   A A   - T     0
       A -   T A   A T   - A   A A   - T   A T   - T   A A     7

     Observed genotypes
       A A   A T   A A   A T   T T   A A   A A   A A   T T

     100% consistent, therefore we predict that there is a deletion

     Hamming distance will be lower when deletions are included, so we need to be careful
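
Why the deletion pattern fits can be shown with a short sketch. It assumes that a caller which expects diploid genotypes reports a hemizygous site as homozygous for the one allele that is present; the function name `apparent_genotype` is made up for this example.

```python
def apparent_genotype(genotype):
    """Genotype a diploid-only caller would report for a true genotype that may contain '-'."""
    alleles = [a for a in genotype if a != "-"]
    if not alleles:
        return None  # both copies deleted: nothing to call
    if len(alleles) == 1:
        alleles = alleles * 2  # hemizygous site looks homozygous
    return "".join(sorted(alleles))


# Third pattern from the slide (MOM, DAD, children 1-7) and the observed calls.
DELETION_PATTERN = "-A AT -A AT -T AA -A AA -T".split()
OBSERVED = "AA AT AA AT TT AA AA AA TT".split()

apparent = [apparent_genotype(g) for g in DELETION_PATTERN]
errors = sum(a != o for a, o in zip(apparent, OBSERVED))
print(apparent)
print(errors)  # 0
```

The two "T T" calls that conflict with every deletion-free pattern are explained as "- T" once a deleted maternal haplotype is allowed.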

Read depth of 5,180 SNPs predicted to overlap deletions

     [Histogram: counts (y-axis, 0 to 5000) by read depth (x-axis, 0 to 100), with depth
     ranges labelled Hom Del, Haploid and Diploid and curves labelled A-, AA, AB, -B, BB]

     Depth shown for positions where the genotypes indicate that the SNP overlaps a
     deletion. The large number of children allows us to more reliably separate errors
     from deletions.

Have many potential large deletions to validate…


     5,180 SNPs are predicted to overlap a hemizygous deletion
     These SNPs cluster into ~902 unique events
       – Clusters show evidence for ~279 deletions >1kb segregating in this pedigree
       – Largest event is >152kb with 274 SNPs supporting the call

     Have begun validating these events beyond just visual inspection
       – 132 overlap with previously reported events (1kGP)
       – Working to define the breakpoints for wet lab validation

     Incorporating other calling methods (Cortex, BreakDancer…)
     Some SNPs also support the presence of duplications in a single parent
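
The slide does not say how the SNPs were grouped into events. As a purely hypothetical illustration, nearby SNPs on one chromosome could be merged whenever the gap between neighbours stays below a threshold; the function `cluster_positions`, the `max_gap` value and the positions below are all invented for the sketch.

```python
def cluster_positions(positions, max_gap):
    """Group sorted positions into clusters whose neighbouring members are at most max_gap apart."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)  # close enough: extend the current event
        else:
            clusters.append([pos])    # too far: start a new candidate event
    return clusters


# Made-up positions of SNPs predicted to overlap a hemizygous deletion.
snps = [10_100, 10_450, 11_900, 250_000, 250_300, 900_000]
events = cluster_positions(snps, max_gap=5_000)

for event in events:
    print(f"start={event[0]} end={event[-1]} span={event[-1] - event[0]} supporting_snps={len(event)}")
```

Each cluster gives a candidate event with a minimum span and a count of supporting SNPs, the two quantities quoted for the largest event above.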




Summary

     We have sequenced a large pedigree and used the inheritance information to
     create a catalogue of ~4.45M accurate SNP calls
       – Over 3.7M biallelic SNPs agree with transmission of parental chromosomes
       – Over 750k homozygous alternative SNPs are trivially consistent across the pedigree

     We have also called indels using four different methods to produce over 550k
     “accurate” indel calls across the pedigree
       – Over 428k biallelic indels agree with transmission of parental chromosomes
       – Over 110k homozygous alternative indels are trivially consistent across the pedigree
     Concordance for the biallelic, pedigree-accurate calls is >99.9999% for SNPs
     and 99.9% for indels between call sets
     SVs are in progress (just deletions right now)
     The SNP and indel results presented here can be used for comparison
       – Incorporating homozygous reference calls across the pedigree for completeness
       – May see immediate gains by testing new algorithms against a better truth set
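
The SNP totals in this summary can be cross-checked against the counts shown on the earlier SNP slides. The sketch assumes that the Cortex and CGI figures are additions that do not overlap each other or the GATK sites; the slides suggest this but do not state it, and the variable names are illustrative.

```python
# Counts copied from the SNP slides.
gatk_all_pass = 3_217_748      # pedigree-accurate, all calls passed GATK filters
gatk_filtered = 408_915        # pedigree-accurate, at least one call quality filtered
cortex_bar = 22_922            # Cortex bar label, assumed to be non-overlapping additions
cgi_bar = 57_270               # CGI bar label, assumed to be non-overlapping additions
trivially_consistent = 754_014 # all 13 samples hom alt

gatk_accurate = gatk_all_pass + gatk_filtered
consistent_snps = gatk_accurate + cortex_bar + cgi_bar
catalogue = consistent_snps + trivially_consistent

print(f"{consistent_snps:,}")  # 3,706,855
print(f"{catalogue:,}")        # 4,460,869
print(round(100 * cortex_bar / gatk_accurate, 1), round(100 * cgi_bar / gatk_accurate, 1))
```

Under that assumption the sums come out at about 3.7M inheritance-consistent SNPs and about 4.46M in total, in line with the rounded figures quoted above.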


Acknowledgements


 Morten Kallberg – alignment & variant calling
 Han-Yu Chuang – analysis of SNP calls
 Phil Tedder – validation of de novo SNPs


 Sean Humphray
 Epameinondas Fritzilas
 Wendy Wong
 David Bentley
 Elliott Margulies




28

Weitere ähnliche Inhalte

Mehr von GenomeInABottle

Benchmarking with GIAB 220907
Benchmarking with GIAB 220907Benchmarking with GIAB 220907
Benchmarking with GIAB 220907GenomeInABottle
 
Genome in a Bottle- reference materials to benchmark challenging variants and...
Genome in a Bottle- reference materials to benchmark challenging variants and...Genome in a Bottle- reference materials to benchmark challenging variants and...
Genome in a Bottle- reference materials to benchmark challenging variants and...GenomeInABottle
 
GIAB Technical Germline Benchmark roadmap discussion
GIAB Technical Germline Benchmark roadmap discussionGIAB Technical Germline Benchmark roadmap discussion
GIAB Technical Germline Benchmark roadmap discussionGenomeInABottle
 
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511GenomeInABottle
 
Giab agbt small_var_2020
Giab agbt small_var_2020Giab agbt small_var_2020
Giab agbt small_var_2020GenomeInABottle
 
GIAB for AMP GeT-RM Forum
GIAB for AMP GeT-RM ForumGIAB for AMP GeT-RM Forum
GIAB for AMP GeT-RM ForumGenomeInABottle
 
Ga4gh 2019 - Assuring data quality with benchmarking tools from GIAB and GA4GH
Ga4gh 2019 - Assuring data quality with benchmarking tools from GIAB and GA4GHGa4gh 2019 - Assuring data quality with benchmarking tools from GIAB and GA4GH
Ga4gh 2019 - Assuring data quality with benchmarking tools from GIAB and GA4GHGenomeInABottle
 
GIAB ASHG 2019 Structural Variant poster
GIAB ASHG 2019 Structural Variant posterGIAB ASHG 2019 Structural Variant poster
GIAB ASHG 2019 Structural Variant posterGenomeInABottle
 
GIAB GRC Workshop ASHG 2019 Billy Rowell Evaluation of v4 with CCS GATK
GIAB GRC Workshop ASHG 2019 Billy Rowell Evaluation of v4 with CCS GATKGIAB GRC Workshop ASHG 2019 Billy Rowell Evaluation of v4 with CCS GATK
GIAB GRC Workshop ASHG 2019 Billy Rowell Evaluation of v4 with CCS GATKGenomeInABottle
 
GIAB ASHG 2019 Small Variant poster
GIAB ASHG 2019 Small Variant posterGIAB ASHG 2019 Small Variant poster
GIAB ASHG 2019 Small Variant posterGenomeInABottle
 
GRC GIAB Workshop ASHG 2019 Small Variant Benchmark
GRC GIAB Workshop ASHG 2019 Small Variant BenchmarkGRC GIAB Workshop ASHG 2019 Small Variant Benchmark
GRC GIAB Workshop ASHG 2019 Small Variant BenchmarkGenomeInABottle
 
Jason Chin MHC diploid assembly
Jason Chin MHC diploid assemblyJason Chin MHC diploid assembly
Jason Chin MHC diploid assemblyGenomeInABottle
 
GIAB update for GRC GIAB workshop 191015
GIAB update for GRC GIAB workshop 191015GIAB update for GRC GIAB workshop 191015
GIAB update for GRC GIAB workshop 191015GenomeInABottle
 
Giab for jax long read 190917
Giab for jax long read 190917Giab for jax long read 190917
Giab for jax long read 190917GenomeInABottle
 
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...GenomeInABottle
 
GIAB and long reads for bio it world 190417
GIAB and long reads for bio it world 190417GIAB and long reads for bio it world 190417
GIAB and long reads for bio it world 190417GenomeInABottle
 
New methods diploid assembly with graphs
New methods   diploid assembly with graphsNew methods   diploid assembly with graphs
New methods diploid assembly with graphsGenomeInABottle
 
How giab fits in the rest of the world seqc2 tumor normal
How giab fits in the rest of the world   seqc2 tumor normalHow giab fits in the rest of the world   seqc2 tumor normal
How giab fits in the rest of the world seqc2 tumor normalGenomeInABottle
 
New data from giab genomes pacbio ccs
New data from giab genomes   pacbio ccsNew data from giab genomes   pacbio ccs
New data from giab genomes pacbio ccsGenomeInABottle
 
New data from giab genomes strand-seq
New data from giab genomes   strand-seqNew data from giab genomes   strand-seq
New data from giab genomes strand-seqGenomeInABottle
 

Mehr von GenomeInABottle (20)

Benchmarking with GIAB 220907
Benchmarking with GIAB 220907Benchmarking with GIAB 220907
Benchmarking with GIAB 220907
 
Genome in a Bottle- reference materials to benchmark challenging variants and...
Genome in a Bottle- reference materials to benchmark challenging variants and...Genome in a Bottle- reference materials to benchmark challenging variants and...
Genome in a Bottle- reference materials to benchmark challenging variants and...
 
GIAB Technical Germline Benchmark roadmap discussion
GIAB Technical Germline Benchmark roadmap discussionGIAB Technical Germline Benchmark roadmap discussion
GIAB Technical Germline Benchmark roadmap discussion
 
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
 
Giab agbt small_var_2020
Giab agbt small_var_2020Giab agbt small_var_2020
Giab agbt small_var_2020
 
GIAB for AMP GeT-RM Forum
GIAB for AMP GeT-RM ForumGIAB for AMP GeT-RM Forum
GIAB for AMP GeT-RM Forum
 
Ga4gh 2019 - Assuring data quality with benchmarking tools from GIAB and GA4GH
Ga4gh 2019 - Assuring data quality with benchmarking tools from GIAB and GA4GHGa4gh 2019 - Assuring data quality with benchmarking tools from GIAB and GA4GH
Ga4gh 2019 - Assuring data quality with benchmarking tools from GIAB and GA4GH
 
GIAB ASHG 2019 Structural Variant poster
GIAB ASHG 2019 Structural Variant posterGIAB ASHG 2019 Structural Variant poster
GIAB ASHG 2019 Structural Variant poster
 
GIAB GRC Workshop ASHG 2019 Billy Rowell Evaluation of v4 with CCS GATK
GIAB GRC Workshop ASHG 2019 Billy Rowell Evaluation of v4 with CCS GATKGIAB GRC Workshop ASHG 2019 Billy Rowell Evaluation of v4 with CCS GATK
GIAB GRC Workshop ASHG 2019 Billy Rowell Evaluation of v4 with CCS GATK
 
GIAB ASHG 2019 Small Variant poster
GIAB ASHG 2019 Small Variant posterGIAB ASHG 2019 Small Variant poster
GIAB ASHG 2019 Small Variant poster
 
GRC GIAB Workshop ASHG 2019 Small Variant Benchmark
GRC GIAB Workshop ASHG 2019 Small Variant BenchmarkGRC GIAB Workshop ASHG 2019 Small Variant Benchmark
GRC GIAB Workshop ASHG 2019 Small Variant Benchmark
 
Jason Chin MHC diploid assembly
Jason Chin MHC diploid assemblyJason Chin MHC diploid assembly
Jason Chin MHC diploid assembly
 
GIAB update for GRC GIAB workshop 191015
GIAB update for GRC GIAB workshop 191015GIAB update for GRC GIAB workshop 191015
GIAB update for GRC GIAB workshop 191015
 
Giab for jax long read 190917
Giab for jax long read 190917Giab for jax long read 190917
Giab for jax long read 190917
 
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
 
GIAB and long reads for bio it world 190417
GIAB and long reads for bio it world 190417GIAB and long reads for bio it world 190417
GIAB and long reads for bio it world 190417
 
New methods diploid assembly with graphs
New methods   diploid assembly with graphsNew methods   diploid assembly with graphs
New methods diploid assembly with graphs
 
How giab fits in the rest of the world seqc2 tumor normal
How giab fits in the rest of the world   seqc2 tumor normalHow giab fits in the rest of the world   seqc2 tumor normal
How giab fits in the rest of the world seqc2 tumor normal
 
New data from giab genomes pacbio ccs
New data from giab genomes   pacbio ccsNew data from giab genomes   pacbio ccs
New data from giab genomes pacbio ccs
 
New data from giab genomes strand-seq
New data from giab genomes   strand-seqNew data from giab genomes   strand-seq
New data from giab genomes strand-seq
 

Platinum Genomes: Towards a comprehensive truth data set

  • 1. Platinum Genomes: Towards a comprehensive truth data set Michael A. Eberle Morten Kallberg, Han-Yu Chuang © 2010 Illumina, Inc. All rights reserved. Illumina, illuminaDx, Solexa, Making Sense Out of Life, Oligator, Sentrix, GoldenGate, GoldenGate Indexing, DASL, BeadArray, Array of Arrays, Infinium, BeadXpress, VeraCode, IntelliHyb, iSelect, CSPro, GenomeStudio, Genetic Energy, HiSeq, and HiScan are registered trademarks or trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.
  • 2. Platinum Genome project: Goals Problem: No comprehensive truth set of variant calls for validation Solution: Sequence and analyze large family pedigree Use Mendelian inheritance to identify good / bad variant calls – Including SNPs, indels & SVs Aggressively incorporate variant calls – Incorporate multiple algorithms and sequencing technologies – Do not limit this just to what is currently easy to call Make the data available publicly – Both raw data and processed calls with accuracy assessment Re-assess algorithms against a better truth data – Better and more comprehensive truth data will allow for rapid advances in software 2
  • 3. Using inheritance to detect conflicts: trio analysis MOM DAD CHILD Child receives blue chromosome from mother and green chromosome from father: e.g. typical trio analysis Father’s chromosomes Mother’s chromosomes When we do a trio analysis like this only 50% of the parents DNA is passed on to the child so many of the variants will only be called in one parent – Have no power to detect false positives in the parents A trio analysis is also not very sensitive to detecting errors – For example if father is AC and mother is AC then the child can be AA, AC or CC and still be consistent with Mendelian inheritance – Many errors occur at sites that are systematically het but trio analysis assumes that these are correct 3
  • 4. Using inheritance to determine accuracy: larger pedigree CHILDREN MOM DAD 1 2 3 4 5 6 7 Possible GT Patterns A T A A A A T A A A T A A A T A A A T A A A T A A A T A A A T A A A T A A A A T A A A T A T A A A A A A A T A A T A A T A A A A A T A T A T A A OBSERVED GENOTYPES A A A T A A A T A T A A A A A A A T 4
  • 5. Using inheritance to determine accuracy: larger pedigree MOM DAD 1 2 3 4 5 6 7 A T A A A A T A A A T A A A T A A A 6 T A A A T A A A T A A A T A A A T A A A A T A A A T A T A A A A A A A T A A T A A T A A A A A T A T A T A A # Errors / Hamming Distance OBSERVED GENOTYPES A A A T A A A T A T A A A A A A A T 5
  • 6. Using inheritance to determine accuracy: larger pedigree MOM DAD 1 2 3 4 5 6 7 A T A A A A T A A A T A A A T A A A 6 T A A A T A A A T A A A T A A A T A 5 A A A T A A A T A T A A A A A A A T A A T A A T A A A A A T A T A T A A OBSERVED GENOTYPES A A A T A A A T A T A A A A A A A T 6
  • 7. Using inheritance to determine accuracy: larger pedigree MOM DAD 1 2 3 4 5 6 7 A T A A A A T A A A T A A A T A A A 6 T A A A T A A A T A A A T A A A T A 5 A A A T A A A T A T A A A A A A A T 0 A A T A A T A A A A A T A T A T A A OBSERVED GENOTYPES A A A T A A A T A T A A A A A A A T 7
  • 8. Using inheritance to determine accuracy: larger pedigree MOM DAD 1 2 3 4 5 6 7 A T A A A A T A A A T A A A T A A A 6 T A A A T A A A T A A A T A A A T A 5 A A A T A A A T A T A A A A A A A T 0 A A T A A T A A A A A T A T A T A A 7 OBSERVED GENOTYPES A A A T A A A T A T A A A A A A A T 8
  • 9. Using inheritance to determine accuracy: larger pedigree MOM DAD 1 2 3 4 5 6 7 A T A A A A T A A A T A A A T A A A 6 T A A A T A A A T A A A T A A A T A 5 A A A T A A A T A T A A A A A A A T 0 A A T A A T A A A A A T A T A T A A 7 OBSERVED GENOTYPES A A A T A A A T A T A A A A A A A T 100% consistent therefore we predict that all genotypes are correct 9
  • 10. Platinum Genomes - CEPH/Utah Pedigree 1463
      [Pedigree figure: grandparents 12889, 12890, 12891, 12892; parents 12877, 12878; children 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12886, 12887, 12888, 12893. Caption: Analysis of SNPs in the parents and 11 children.]
      All 17 members sequenced to at least 50x depth (PCR-Free protocol)
        – SNPs & indels called using BWA + GATK + VQSR
      Each member of the trio highlighted in bold in the figure (12877, 12878 and 12882) is sequenced to 200x
      An additional 200x technical replicate was done for NA12882
  • 11. Analysis of the data
      50x raw data was aligned and variants called using BWA + GATK + VQSR
        – Accurate GATK calls were supplemented with accurate variant calls made by Cortex using the same sequence data and accurate CGI calls made across the same pedigree
      First step is to define the inheritance of the parental chromosomes to the eleven children everywhere in the genome
        – Identified 709 crossover events between the parents and eleven children
      Define accurate variants as those where the genotypes are 100% consistent with the transmission of the parental haplotypes
        – At any position of the genome there are only 16 possible combinations of genotypes (biallelic & diploid) across the pedigree that are consistent with the inheritance pattern
        – This is out of 3^13 (~1.6M) possible genotype combinations
      Subsequent analysis mostly excludes variants that are homozygous alternative across the last two generations of this pedigree (~750k)
        – Most will be accurate, but for these “trivially consistent” sites we cannot differentiate accurate calls from systematic errors or validate ploidy
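The count of 16 follows from the four parental haplotypes each carrying one of two alleles (2^4), while 13 diploid biallelic samples allow 3^13 genotype combinations in total. The sketch below enumerates the consistent patterns for a made-up inheritance vector; `INHERITANCE` and the function name are assumptions for illustration and do not come from the slides.

```python
from itertools import product

# Hypothetical inheritance vector: for each of the 11 children, the index of the
# maternal haplotype (0/1) and paternal haplotype (0/1) that was transmitted.
INHERITANCE = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0), (1, 1),
               (0, 1), (1, 0), (0, 0), (1, 1), (0, 1)]

def consistent_patterns(inheritance):
    """Genotype vectors (alt-allele counts for mother, father, children) that fit the inheritance."""
    patterns = set()
    # Assign ref (0) or alt (1) to each of the four parental haplotypes.
    for m0, m1, f0, f1 in product((0, 1), repeat=4):
        mother, father = (m0, m1), (f0, f1)
        children = tuple(mother[i] + father[j] for i, j in inheritance)
        patterns.add((m0 + m1, f0 + f1) + children)
    return patterns

patterns = consistent_patterns(INHERITANCE)
print(len(patterns))  # 16 consistent genotype patterns
print(3 ** 13)        # 1594323 possible genotype combinations
```

Only 16 of roughly 1.6 million combinations fit, which is why a call set that matches the inheritance is unlikely to do so by accident.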
  • 12. Input all possible data and use the inheritance to separate good from bad: variants are unlikely to accidentally match inheritance
      [Flowchart: Set A, Set B and Set C → Compare Against Inheritance]
        – NO CONFLICTS → Score (plat./gold) → db w/score
        – CONFLICTS → Assess Problem
            BIOLOGY → Score (gold/silver) → db w/comments
            BAD → Comment → db w/comments
  • 14. Accurate SNP positions based on the pedigree analysis
      [Bar chart: counts (millions) by GATK site description* (All Pass, Filtered), with the pedigree analysis split into Correct and Problematic. Labeled counts: 3,217,748 and 408,915.]
      Annotation: normally might exclude these from our analysis because the variant caller filtered some of the calls
      Additional 754,014 SNPs are “trivially consistent”, i.e. all 13 samples are hom alt.
      *Filtered means that at least one variant call was made but quality filtered
  • 15. Hamming distance for the “accurate” SNPs to the 2nd best solution
      [Histogram: percent of sites (0–60) by Hamming distance (0–13).]
      At these sites >85% of the positions would require at least four (very specific) genotype errors to have erroneously ended up with the observed predicted-accurate calls
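One way to read the "2nd best solution" is as the nearest other genotype pattern that is also consistent with the inheritance. The sketch below computes that distance for every consistent pattern under a made-up inheritance vector; the vector, the names and the resulting numbers are illustrative assumptions, not the slide's data.

```python
from itertools import product

# Hypothetical inheritance vector (maternal, paternal haplotype index per child).
INHERITANCE = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0), (1, 1),
               (0, 1), (1, 0), (0, 0), (1, 1), (0, 1)]

def consistent_patterns(inheritance):
    """All genotype vectors (alt-allele counts) that fit the inheritance vector."""
    patterns = set()
    for m0, m1, f0, f1 in product((0, 1), repeat=4):
        mother, father = (m0, m1), (f0, f1)
        children = tuple(mother[i] + father[j] for i, j in inheritance)
        patterns.add((m0 + m1, f0 + f1) + children)
    return sorted(patterns)

def hamming(a, b):
    """Number of samples with differing genotypes."""
    return sum(x != y for x, y in zip(a, b))

patterns = consistent_patterns(INHERITANCE)
# Distance from each consistent pattern to its closest consistent alternative.
second_best = {p: min(hamming(p, q) for q in patterns if q != p) for p in patterns}

hom_ref = (0,) * 13
print(second_best[hom_ref])  # 6 for this made-up inheritance vector
```

A large distance means several specific genotype errors would have to coincide for a wrong call to look consistent.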
  • 16. Using other call sets for a more comprehensive catalogue
      [Bar chart: counts (x1000) of SNPs passing the pedigree analysis for the Cortex and CGI call sets, split into Unique and Common. Labeled unique counts: 57,270 (1.6%) and 22,922 (0.6%).]
  • 17. Concordance between “pedigree-accurate” GTs
      Comparison* | # Sites | # Diff GTs | # Same GTs | Concordance
      GATK & Cortex | 2,053,136 | 5 | 26,690,763 | 99.99998%
      GATK & CGI | 3,146,399 | 19 | 40,903,168 | 99.99995%
      Cortex & CGI | 1,890,718 | 7 | 24,579,327 | 99.99997%
      *Excluding sites where alleles did not match or all samples homozygous alternative
      Includes 763,085 GT calls and 264,771 positions quality filtered by GATK
      Attempting to validate a sample of the sites that are unique to a single call set
        – Targeting ~300 per call set
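The SNP concordance figures can be recomputed from the counts on this slide as same / (same + diff). The sketch below assumes that 13 samples (two parents plus eleven children) are compared per site, which is an inference from the totals and not something the slide states.

```python
# (sites, differing genotypes, matching genotypes) as reported on the slide.
COMPARISONS = {
    "GATK & Cortex": (2_053_136, 5, 26_690_763),
    "GATK & CGI": (3_146_399, 19, 40_903_168),
    "Cortex & CGI": (1_890_718, 7, 24_579_327),
}
SAMPLES = 13  # assumption: parents plus eleven children compared at each site

def concordance(diff, same):
    """Percentage of compared genotypes that agree between two call sets."""
    return 100.0 * same / (same + diff)

for name, (sites, diff, same) in COMPARISONS.items():
    totals_match = sites * SAMPLES == diff + same
    print(f"{name}: {concordance(diff, same):.5f}% (sites x 13 == GTs: {totals_match})")
```

For all three SNP comparisons the site count times 13 equals the number of genotypes compared, which supports the 13-sample reading.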
  • 19. Accurate GATK indel positions based on pedigree
      [Bar chart: counts (thousands) by site description (All Pass, Filtered), with the pedigree analysis split into Correct and Problematic. Labeled counts: 240,490 and 141,508.]
      Additional 115,587 indels are “trivially consistent”, i.e. all 13 samples are hom alt.
  • 20. Using other call sets for a more comprehensive catalogue
      [Bar chart: counts (x1000) of indels passing the pedigree analysis for the Cortex and CGI call sets, split into Unique and Common. Labeled unique counts: 39,335 (10%) and 9,637 (2.4%).]
  • 21. Concordance between overlapping “accurate” indels
      Comparison* | # Sites | # Diff GTs | # Same GTs | Concordance
      GATK & Cortex | 96,228 | 43 | 1,250,921 | 99.997%
      GATK & CGI | 219,445 | 2,817 | 2,514,785 | 99.901%
      Cortex & CGI | 78,050 | 198 | 1,014,650 | 99.981%
      *Excluding sites where alleles did not match or all samples homozygous alternative
      Attempting to validate a sample of the sites that are unique to a single call set
        – Targeting ~300 per call set
  • 23. Conflict mode: Hemizygous deletions
      [Figure: possible GT patterns for MOM, DAD and children 1–7, each with its # errors against the observed genotypes.]
        – AT AA AA TA AA TA AA TA AA: 6
        – TA AA TA AA TA AA TA AA TA: 7
        – AA AT AA AT AT AA AA AA AT: 2
        – AA TA AT AA AA AT AT AT AA: 7
      Observed genotypes: AA AT AA AT TT AA AA AA TT
      “Best” solution still indicates multiple errors
  • 24. Conflict mode: Hemizygous deletions
      [Figure: GT patterns that include a deleted allele ("-") for MOM, DAD and children 1–7, each with its # errors as shown on the slide.]
        – A- AT AA -T AT -A AA -A AT: 6
        – -A TA -T AA -A AT -T AT -A: 5
        – -A AT -A AT -T AA -A AA -T: 0
        – A- TA AT -A AA -T AT -T AA: 7
      Observed genotypes: AA AT AA AT TT AA AA AA TT
      100% consistent, therefore we predict that there is a deletion
      Hamming distance will be less when including deletions, so need to be careful
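The deletion model works because a diploid caller reports a hemizygous allele as homozygous: "-T" is observed as "TT". The sketch below applies that projection to the zero-error pattern from this slide; the function names are hypothetical, and homozygous deletions ("- -") are deliberately not modelled.

```python
def observed_call(genotype):
    """Project a true genotype onto what a diploid caller reports ('-' marks a deleted allele)."""
    present = [allele for allele in genotype if allele != "-"]
    if len(present) == 1:
        # Hemizygous: the single remaining allele looks homozygous.
        return (present[0], present[0])
    return tuple(sorted(present))

def to_pairs(alleles):
    """Split a flat allele string into per-sample pairs (MOM, DAD, children 1-7)."""
    calls = alleles.split()
    return [tuple(calls[i:i + 2]) for i in range(0, len(calls), 2)]

DELETION_PATTERN = "- A A T - A A T - T A A - A A A - T"
OBSERVED = "A A A T A A A T T T A A A A A A T T"

projected = [observed_call(g) for g in to_pairs(DELETION_PATTERN)]
observed = [tuple(sorted(g)) for g in to_pairs(OBSERVED)]
errors = sum(p != o for p, o in zip(projected, observed))
print(errors)  # 0
```

Because the extra "-" state gives the model more freedom, more observations become consistent, which is the caution the slide raises about smaller Hamming distances.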
  • 25. Read depth of 5,180 SNPs predicted to overlap deletions
      [Histogram: counts (0–5000) by depth (0–100), with regions marked Hom Del, Haploid and Diploid; series A-, AA, AB, -B, BB.]
      Depth shown for positions where the genotypes indicate that the SNP overlaps a deletion. Large number of children allows us to more reliably separate errors from deletions.
  • 26. Have many potential large deletions to validate…
      5,180 SNPs are predicted to overlap a hemizygous deletion
      These SNPs cluster into ~902 unique events
        – Clusters show evidence for ~279 deletions >1kb segregating in this pedigree
        – Largest event is >152kb with 274 SNPs supporting the call
      Have begun validating these events beyond just visual inspection
        – 132 overlap with previously reported events (1kGP)
        – Working to define the breakpoints for wet lab validation
      Incorporating other calling methods (Cortex, breakdancer…)
      Some SNPs also support the presence of duplications in a single parent
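The slides do not say how the SNPs were clustered into events. As a rough illustration only, the sketch below groups sorted positions whenever neighbouring SNPs lie within a gap threshold; the `max_gap` value and the positions are made up, and the real analysis would also need to require the same inheritance signature within a cluster.

```python
def cluster_positions(positions, max_gap):
    """Group positions into clusters whose neighbouring members are at most max_gap apart."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters

# Made-up SNP positions on one chromosome, for illustration only.
snps = [1_000, 1_400, 2_100, 50_000, 50_300, 120_000]
events = cluster_positions(snps, max_gap=5_000)

for event in events:
    # Report start, end and number of supporting SNPs for each candidate event.
    print(event[0], event[-1], len(event))
```

The span between the first and last SNP in a cluster gives only a lower bound on the deletion size, which fits the slide's note that breakpoints still need to be defined.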
  • 27. Summary
      We have sequenced a large pedigree and used the inheritance information to create a catalogue of ~4.45M accurate SNP calls
        – Over 3.7M biallelic SNPs agree with transmission of parental chromosomes
        – Over 750k homozygous alternative SNPs are trivially accurate across the pedigree
      We have also called indels using four different methods to produce over 550k “accurate” indel calls across the pedigree
        – Over 428k biallelic indels agree with transmission of parental chromosomes
        – Over 110k homozygous alternative indels are trivially accurate across the pedigree
      Concordance between call sets for the biallelic, pedigree-accurate calls is >99.9999% for SNPs and 99.9% for indels
      SVs are in progress (just deletions right now)
      The SNP and indel results presented here can be used for comparison
        – Incorporating homozygous reference calls across the pedigree for completeness
        – May see immediate gains by testing new algorithms against a better truth set
  • 28. Acknowledgements
      Morten Kallberg – alignment & variant calling
      Han-Yu Chuang – analysis of SNP calls
      Phil Tedder – validation of de novo SNPs
      Sean Humphray
      Epameinondas Fritzilas
      Wendy Wong
      David Bentley
      Elliott Margulies