SSAHA_pileup:
A Genome Variation Detection Pipeline for
     Various Sequencing Platforms




             Photo Credit: saynine on flickr.com


          Ben Blackburne
     Wellcome Trust Sanger Institute
Acknowledgments


●Zemin Ning
●Yong Gu
●Antony Cox
●Adam Spargo
●Hannes Ponstingl
Introduction
●New sequencing technologies
  – More data
  – Different kinds of data
     ●Solexa, 454
     ●capillary, too
  – Diploid genomes
  – SNPs, indels, VNTRs




                              Photo Credit: mknowles on flickr.com
SSAHA_pileup
●Sequence Search and Alignment by Hashing
 Algorithm
●SSAHA_SNP
  – Global positioning with SSAHA algorithm
  – Fast Smith-Waterman implementation (from
    Cross_Match)
  – Identification of best match
●SSAHA_pileup
  – Determines SNPs from set of best alignments
●Works on Solexa, 454, and capillary reads
The Toolchain
[Diagram: Reference Genome and Reads feed SSAHA_snp/SSAHA2, which produces Alignments; SSAHA_pileup turns the Alignments into variations; a "refinement" step is also shown]
SSAHA_SNP
●Reference genome is “hashed”
  – table made of all k-mer words
  – overlapping or not, at user's option
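A minimal sketch of the hashing step, assuming a plain Python dictionary stands in for the hash table. The function name `build_kmer_table` and its parameters are hypothetical and are not taken from SSAHA itself:

```python
from collections import defaultdict

def build_kmer_table(reference, k, overlapping=True):
    """Map each k-mer word of `reference` to the positions where it starts.

    overlapping=True  -> words start at every base (step 1)
    overlapping=False -> words start every k bases (step k)
    """
    step = 1 if overlapping else k
    table = defaultdict(list)
    for pos in range(0, len(reference) - k + 1, step):
        table[reference[pos:pos + k]].append(pos)
    return table

# Non-overlapping 4-mers of a short reference fragment
table = build_kmer_table("GGTCCCACAGAGCTGGAGAAAG", k=4, overlapping=False)
for word, positions in table.items():
    print(word, positions)
```

Non-overlapping words give a table roughly k times smaller, at the cost of fewer possible seed hits per read.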
SSAHA_SNP
●k-mer matches found for query in reference
[Diagram: k-mer hits for the query on chr n and chr m]
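One way to picture the k-mer lookup is as a vote: each query word found in the table votes for a (chromosome, offset) pair, and well-supported offsets become candidate locations. The sketch below is an illustration under that assumption; the helper names and the toy sequences are hypothetical, not SSAHA's actual code or data:

```python
from collections import Counter, defaultdict

def index_reference(chroms, k):
    """Hash every overlapping k-mer of each chromosome to its (name, position) hits."""
    table = defaultdict(list)
    for name, seq in chroms.items():
        for pos in range(len(seq) - k + 1):
            table[seq[pos:pos + k]].append((name, pos))
    return table

def candidate_locations(query, table, k):
    """Count k-mer hits per (chromosome, offset), where offset = ref_pos - query_pos."""
    votes = Counter()
    for qpos in range(len(query) - k + 1):
        for name, rpos in table.get(query[qpos:qpos + k], ()):
            votes[(name, rpos - qpos)] += 1
    return votes.most_common()

# Toy reference: the read fits chr_n (with one mismatch) and shares one word with chr_m
chroms = {
    "chr_n": "TTTTGGTCCCACAGAGCTGGAG",
    "chr_m": "AAAAAACCCACAGAAAAAAAA",
}
table = index_reference(chroms, k=5)
hits = candidate_locations("GGTCCCACGGAGCTGGAG", table, k=5)
for (name, offset), n in hits:
    print(name, offset, n)
```

Words that span the mismatching base find no hit, so a single SNP lowers the vote count without hiding the true location.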
SSAHA_SNP
Global Mapping
[Diagram: the read placed against chr n and chr m]
SSAHA_SNP
Local Mapping (Smith-Waterman)
  chr n   score: 126
  chr m   score: 113
SSAHA_SNP
Select best match
  chr n   score: 126
  chr m   score: 113
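The local mapping and best-match steps can be sketched with a textbook Smith-Waterman score. This is not the Cross_Match implementation: the scoring values (match +1, mismatch -2, linear gap -3) and the candidate windows are assumptions chosen for illustration, so the scores differ from those on the slide:

```python
def smith_waterman_score(read, ref, match=1, mismatch=-2, gap=-3):
    """Best local alignment score with a linear gap penalty (score only, no traceback)."""
    prev = [0] * (len(ref) + 1)
    best = 0
    for i in range(1, len(read) + 1):
        curr = [0]
        for j in range(1, len(ref) + 1):
            diag = prev[j - 1] + (match if read[i - 1] == ref[j - 1] else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best

def best_match(read, candidates):
    """Score the read against each candidate window and return the top-scoring name."""
    scored = {name: smith_waterman_score(read, seq) for name, seq in candidates.items()}
    return max(scored, key=scored.get), scored

# Hypothetical candidate windows found by global mapping
candidates = {
    "chr_n": "GGTCCCACAGAGCTGGAG",   # one mismatch against the read
    "chr_m": "GGTCCTACAGAGCTAGAG",   # three mismatches against the read
}
name, scored = best_match("GGTCCCACGGAGCTGGAG", candidates)
print(name, scored)
```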
SSAHA_SNP
●Read pair information
  – currently possible with
    extra step using SSAHA2
  – being integrated into
    SSAHA_SNP
  – Removes incorrectly
    mapped pairs




                              Photo Credit: Matthew Fang on flickr.com
SSAHA_pileup
[Diagram: the toolchain again (Reference Genome and Reads, SSAHA_snp/SSAHA2, Alignments, SSAHA_pileup, variations, refinement)]
SSAHA_pileup
Reference
...GGTCCCACAGAGCTGGAGAAAG...
   GGTCCCACGGAGCTGGAG
       CCACGGAGCTGGAGAAAGCCT
     TCCCACGGAGCTGGAGAAAGCCT
     TCCCACGGAGCTGGAGAAAGCCT
     TCCCACGGAGCTGGAGAAAGCCT
Aligned reads
Homozygous SNP
SSAHA_pileup
Reference
...GGTCCCACAGAGCTGGAGAAAG...
   GGTCCCACAGAGCTGGAG
       CCACAGAGCTGGAGAAAGCCT
     TCCCACGGAGCTGGAGAAAGCCT
     TCCCACGGAGCTGGAGAAAGCCT
     TCCCACGGAGCTGGAGAAAGCCT
Aligned reads
Heterozygous SNP
SSAHA_pileup
Reference
...GGTCCCACAGAGCTGGAGAAAG...
   GGTCCCACAGAGCTGGAG
       CCACAGAGCTGGAGAAAGCCT
     TCCCACggagCTGGAGAAAGCCT
     TCCCACggagcTGGAGAAAGCCT
     TCCCacggagcTGGAGAAAGCCT
Aligned reads
Heterozygous SNP?? (Probably not)
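The three pileups above can be mimicked with a small column classifier. This sketch assumes that lowercase read bases mark low-quality calls and should be ignored, and it uses a hypothetical 20% minimum allele fraction; neither detail is stated on the slides, and SSAHA_pileup's real calling rules may differ:

```python
from collections import Counter

def call_site(ref_base, column, min_fraction=0.2):
    """Classify one pileup column given the reference base and the read bases.

    Lowercase bases are treated as low quality and skipped (an assumption).
    """
    counts = Counter(b for b in column if b.isupper())
    total = sum(counts.values())
    if total == 0:
        return "no call"
    alleles = {b for b, n in counts.items() if n / total >= min_fraction}
    if alleles == {ref_base}:
        return "reference"
    if ref_base in alleles or len(alleles) > 1:
        return "heterozygous SNP"
    return "homozygous SNP"

# The variant column from each of the three example pileups (reference base A)
print(call_site("A", "GGGGG"))   # all reads show G
print(call_site("A", "AAGGG"))   # reads split between A and G
print(call_site("A", "AAggg"))   # the G calls are all low quality
```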
SSAHA_pileup
Reference
...GGTCCCACAGAGCTGGAGAAAG...
   GGTCCCAC-----TGGAG
       CCAC-----TGGAGAAAGCCT
     TCCCACGGAGCTGGAGAAAGCCT
     TCCCACGGAGCTGGAGAAAGCCT
     TCCCACGGAGCTGGAGAAAGCCT
Aligned reads
Heterozygous indel
How well does it work?
Datasets
●Venter: ABI capillary reads
  – Celera: 19,397,599     55% in pairs
  – JCVI: 12,541,352       98% in pairs
  – Total: 31,938,951    72% in pairs (90% mapped)
●Watson: 454 GS FLX reads
  – Baylor & Roche 74,198,831 (90.5% mapped)
  – single end reads with length 150 – 280 bps
●Chromosome X Illumina reads
  – 278,557,156 reads (71.6% mapped)
  – (paired with insert size 200bps)
How conservative should we be?
Or...
How liberal should we be?
How do we even know if we are winning?
dbSNP
(but not ideal)
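The dbSNP comparison amounts to asking what fraction of called SNPs are already known. A minimal sketch, assuming each SNP is keyed by a hypothetical (chromosome, position, alternate base) tuple:

```python
def fraction_in_dbsnp(called, known):
    """Fraction of called SNPs that also appear in the set of known SNPs."""
    called = set(called)
    if not called:
        return 0.0
    return len(called & set(known)) / len(called)

# Made-up example sites, for illustration only
called = [("chrX", 100, "G"), ("chrX", 200, "T")]
known = {("chrX", 100, "G"), ("chrX", 999, "A")}
print(fraction_in_dbsnp(called, known))
```

One reason such a measure is imperfect: a genuine SNP that is simply absent from the database counts against the caller in exactly the same way as a false call.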
Filtering
●Processes that cause bogus SNPs
  – Incorrect global mapping
  – Incorrect local alignment
  – Poor quality reads
  – Sequence amplification errors
Global Mapping Problems
●Reads from unmapped regions of the genome
  – Lead to absurdly high apparent coverage
[Diagram: reads shown against chr n and chr m]
Global Mapping Problems
●Reads from unmapped regions of the genome
  – Lead to absurdly high apparent coverage
[Diagram: reads stacked up on chr n]
SNPs
Solution:
Filter out SNPs called from abnormally high read depths
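A depth filter of this kind might look like the sketch below. The cutoff rule (a multiple of the median depth) and the `depth` field name are assumptions made for illustration; the slides do not say how "abnormally high" is defined:

```python
from statistics import median

def filter_high_depth(snps, max_ratio=2.0):
    """Keep SNPs whose read depth is at most max_ratio times the median depth."""
    if not snps:
        return []
    cutoff = max_ratio * median(s["depth"] for s in snps)
    return [s for s in snps if s["depth"] <= cutoff]

# Made-up calls: the last one sits in a pile of mis-mapped reads
snps = [
    {"pos": 1200, "depth": 8},
    {"pos": 5310, "depth": 10},
    {"pos": 7744, "depth": 12},
    {"pos": 9001, "depth": 95},
]
kept = filter_high_depth(snps)
print([s["pos"] for s in kept])
```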
Global Mapping Problems
●Incorrectly aligned reads
  chr n   score: 132
  chr m   score: 136
Solution:
Filter out SNPs where the 2nd-best score is too close to the best
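This filter can be sketched as a margin test between the two highest alignment scores of a read. The margin of 10 is a hypothetical threshold, not a value from the slides:

```python
def is_uniquely_mapped(scores, min_margin=10):
    """True if the best score beats the second-best by at least min_margin."""
    ranked = sorted(scores, reverse=True)
    if not ranked:
        return False
    if len(ranked) == 1:
        return True
    return ranked[0] - ranked[1] >= min_margin

print(is_uniquely_mapped([126, 113]))   # scores from the local mapping slide
print(is_uniquely_mapped([132, 136]))   # scores from the incorrectly aligned read
```

With this margin, the 126 versus 113 read is kept, while the 136 versus 132 read is treated as ambiguous and its SNPs would be dropped.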
Local Alignment Problems
●Misalignment
  – Uncaught incorrect global alignment
  – Variations in short repeats
Local Misalignment
Reference
...GGTCCCACAGAGCTGGAGAAAA...
   GGTCCCACT---CTAGTG
       CCACT---CTAGTGAAAA
     TCCCACT---CTAGTGAAAA
Aligned reads
Real SNPs?
Local Misalignment
Reference
..TAATAATAATAATAATAATAAGAAG..
   AATAATAAGAAGAAGAAGAAGAAG
   AATAATAAGAAGAAGAAGAAGAAG
   AATAATAAGAAGAAGAAGAAGAAG
Aligned reads
Real SNPs?
Solution:
Filter out short blocks of many SNPs
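A cluster filter along these lines could use a sliding window over sorted SNP positions. The window size (10 bp) and the allowed count (2 SNPs) are hypothetical parameters chosen for the example:

```python
def drop_snp_clusters(positions, window=10, max_snps=2):
    """Remove SNPs lying in any stretch shorter than `window` bp that holds more than max_snps SNPs."""
    positions = sorted(positions)
    clustered = set()
    left = 0
    for right, pos in enumerate(positions):
        # Shrink the window until it spans fewer than `window` bases
        while pos - positions[left] >= window:
            left += 1
        if right - left + 1 > max_snps:
            clustered.update(positions[left:right + 1])
    return [p for p in positions if p not in clustered]

# Made-up positions: four SNPs 2-3 bp apart (as in a misaligned repeat) plus two isolated ones
print(drop_snp_clusters([100, 500, 502, 505, 508, 900]))
```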
Venter SNP Calling (Capillary)

                    count       fraction in dbSNP
Homozygous SNPs     1 347 806   97.1%
Heterozygous SNPs   1 857 167   90.9%
Total SNPs          3 204 973   93.5%
Watson SNP Calling (454)

                    count       fraction in dbSNP
Homozygous SNPs     1 298 309   93.0%
Heterozygous SNPs   1 767 951   63.9%
Total SNPs          3 066 260   76.3%
X Chromosome SNPs (Solexa)

                    count       fraction in dbSNP
Homozygous SNPs     27 708      92.8%
Heterozygous SNPs   63 197      81.8%
Total SNPs          90 905      85.1%
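As a consistency check, the "Total SNPs" rows can be recomputed from the homozygous and heterozygous rows: counts add up, and the total dbSNP fraction is the count-weighted mean. The sketch below uses only numbers from the three tables; small differences in the last digit are expected because the per-row percentages are already rounded:

```python
# (count, percent in dbSNP) for the homozygous and heterozygous rows of each table
tables = {
    "Venter (capillary)": [(1_347_806, 97.1), (1_857_167, 90.9)],
    "Watson (454)":       [(1_298_309, 93.0), (1_767_951, 63.9)],
    "X chr (Solexa)":     [(27_708, 92.8), (63_197, 81.8)],
}

def combined(rows):
    """Return the total count and the count-weighted mean percentage."""
    total = sum(n for n, _ in rows)
    pct = sum(n * p for n, p in rows) / total
    return total, pct

for name, rows in tables.items():
    total, pct = combined(rows)
    print(f"{name}: {total} SNPs, {pct:.1f}% in dbSNP")
```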
Venter-Watson Overlap

Venter only          1 593 791
Venter and Watson    1 611 182
Watson only          1 455 078
X Chromosome Overlap

Solexa X reads only          40 625
Solexa and Venter only       19 978
Solexa and Watson only       12 590
All three                    17 712
Venter only                  26 502
Venter and Watson only        6 588
Watson only                  22 872
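Per-dataset totals on chromosome X can be recovered by summing the regions of the three-way overlap. The assignment of each number to a region is an assumption based on the layout of the diagram; it is supported by the Solexa regions summing to the 90 905 total SNPs reported for the Solexa X chromosome calls:

```python
# Region -> SNP count, keyed by the datasets that share the SNPs (assumed layout)
regions = {
    frozenset({"Solexa"}): 40_625,
    frozenset({"Solexa", "Venter"}): 19_978,
    frozenset({"Solexa", "Watson"}): 12_590,
    frozenset({"Solexa", "Venter", "Watson"}): 17_712,
    frozenset({"Venter"}): 26_502,
    frozenset({"Venter", "Watson"}): 6_588,
    frozenset({"Watson"}): 22_872,
}

def total_for(dataset):
    """Sum every region that includes the given dataset."""
    return sum(n for members, n in regions.items() if dataset in members)

for dataset in ("Solexa", "Venter", "Watson"):
    print(dataset, total_for(dataset))
```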
Conclusions
●SSAHA_pileup is effective across both new and
 old sequencing technologies
●Questions
  – When is a SNP not a SNP?
  – Homozygous/Heterozygous SNPs
●Length matters...?
  – But it's what you do with it that counts
Obtaining SSAHA_pileup
SSAHA_pileup: ftp://ftp.sanger.ac.uk/pub/zn1/ssaha_pileup/
SSAHA2: http://www.sanger.ac.uk/Software/analysis/SSAHA2/
These Slides: http://slideshare.net/bpb/
