Babelomics
NGS data Preprocessing
             Granada, June 2011


           Javier Santoyo-Lopez
                 jsantoyo@cipf.es
                http://bioinfo.cipf.es
               Genomics Department
   Centro de Investigacion Principe Felipe (CIPF)
                 (Valencia, Spain)
History of DNA Sequencing
                                      Adapted from Eric Green, NIH; Adapted from Messing & Llaca, PNAS (1998)



  Timeline figure (left axis: sequencing efficiency in bp/person/year)

    1870     Miescher: Discovers DNA
    1940     Avery: Proposes DNA as ‘Genetic Material’
    1953     Watson & Crick: Double Helix Structure of DNA
    1965     Holley: Sequences Yeast tRNA(Ala)
    1970     Wu: Sequences λ Cohesive End DNA
    1977     Sanger: Dideoxy Chain Termination
             Gilbert: Chemical Degradation
    1980     Messing: M13 Cloning
    1986     Hood et al.: Partial Automation
    1990     Cycle Sequencing
             Improved Sequencing Enzymes
             Improved Fluorescent Detection Schemes
    2002
    2008     Next Generation Sequencing
             Improved enzymes and chemistry
             Improved image processing

  Efficiency axis ticks (bp/person/year), from the 1965 level upwards:
    1; 15; 150; 1,500; 15,000; 25,000; 50,000; 200,000; 50,000,000; 100,000,000,000
Sequence Databases Trend
   EMBL database growth (March 2011)
From Michael Metzker, http://view.ncbi.nlm.nih.gov/pubmed/19997069
NGS platforms comparison




06/14/11
                        Source http://www.clcngs.com/2008/12/ngs-platforms-overview/
Next-gen sequencers
                                                        Adapted from John McPherson, OICR

  Figure: bases per machine run (1 Mb to 100 Gb) plotted against read length (10 bp to 1,000 bp)

  Platform                    Bases per machine run    Read length
  AB SOLiDv3                  200 Gb                   50 bp
  Illumina HiSeq              150 Gb                   100 bp
  454 GS FLX Titanium         0.4-1 Gb                 100-500 bp
  ABI capillary sequencer     0.04-0.08 Mb             450-800 bp
Many Gbs of Sequences and...
• Data management becomes a challenge.
  – Moving data across file systems takes time (several hundred GB)


• How is the data structured?
  – Different sequencers output different file formats, but
  – Some data formats are becoming widely accepted (e.g.
    FastQ format)


• Raw sequence data formats
  –   SFF
  –   Fasta, csfasta
  –   Qual file
  –   Fastq
Fasta & Fastq formats
• FastA format (everybody knows about it)
  – Header line starts with “>” followed by a sequence ID
  – Sequence (string of nt).


• FastQ format
  – First comes the sequence record (like Fasta, but the header line starts with “@”)
  – Then a line with “+” and the sequence ID (optional), followed by a line with
    the QVs (quality values) encoded as single-byte ASCII codes
    • Different quality encoding variants exist


• Nearly all downstream analysis tools take FastQ as input
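
The four-line FastQ layout described above can be read with a few lines of Python. This is only a minimal sketch: it assumes every record occupies exactly four lines (no wrapped sequences), and the function name `read_fastq` is hypothetical, not part of any tool named in these slides.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a four-line-per-record FastQ stream."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FastQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


# Two made-up records; the ID after "+" is optional.
example = io.StringIO(
    "@read1\nACGTN\n+\nIIII#\n"
    "@read2\nGGCC\n+read2\nFFFF\n"
)
records = list(read_fastq(example))
print(records)
```

Because the function is a generator, a real file opened with `open(path)` is streamed one record at a time, which matters when files reach tens of GB.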
Sequence to Variation Workflow
  Workflow diagram, steps in order:
    Raw Data → FastQ → FastX Toolkit → FastQ → BWA/BWASW → SAM → SAM Tools → BAM
             → Samtools mPileup → Pileup → BCF Tools → Raw VCF → BCF Tools (Filter) → VCF
  The diagram also shows GFF and IGV next to the filtered VCF.
Why Quality Control and
         Preprocessing?
   Sequencer output:
            Reads + quality
            Is the quality of my sequenced data OK?
                   If something is wrong, can I fix it?
   Problem:
            HUGE files... What do they look like?
                @HWUSI-EAS460:2:1:368:1089#0/1
                TACGTACGTACGTACGTACGTAGATCGGAAGAGCGG
                +HWUSI-EAS460:2:1:368:1089#0/1
                aa[a_a_a^a^a]VZ]R^P[]YNSUTZBBBBBBBBB

                @HWUSI-EAS460:2:1:368:528#0/1
                CTATTATAATATGACCGACCAGCTAGATCTACAGTC
                +HWUSI-EAS460:2:1:368:528#0/1
                abbbbaaaabba^aa`Y``aa`aaa``a`a__`[_

   The files are flat text files and they are big: tens of GB (please... don't
    use MS Word to view or edit them)
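
The quality line of the first record above can be turned back into numbers. The slide does not state the encoding; the lowercase letters and the run of trailing "B" characters are consistent with the Phred+64 offset used by Illumina pipelines of that period, so the offset of 64 below is an assumption. The function names are hypothetical.

```python
def decode_quality(qual: str, offset: int = 64) -> list:
    """Convert an ASCII-encoded quality string into integer quality values."""
    scores = [ord(ch) - offset for ch in qual]
    if scores and min(scores) < 0:
        raise ValueError("negative score: the offset is probably wrong")
    return scores


def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


qual = "aa[a_a_a^a^a]VZ]R^P[]YNSUTZBBBBBBBBB"  # first read shown on the slide
scores = decode_quality(qual, offset=64)
print(scores[:5])   # [33, 33, 27, 33, 31]
print(scores[-3:])  # [2, 2, 2]
```

For files in the Sanger variant (Phred+33), the same function would be called with `offset=33`.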
Sequence Quality
Per base Position

                                          Good data
                                         Consistent
                                          High quality along
                                          the read



  * The central red line is the median value
  * The yellow box represents the inter-quartile range (25-75%)
  * The upper and lower whiskers represent the 10% and 90% points
  * The blue line represents the mean quality
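
The values in the plot legend (median, inter-quartile range, 10% and 90% points, mean) can be computed per read position as sketched below. The sketch assumes Phred+33 qualities held in memory and uses a nearest-rank percentile; FastQC's exact method may differ, and all names here are hypothetical.

```python
import math
from statistics import mean, median


def percentile(sorted_vals, pct):
    """Nearest-rank percentile of an already sorted list (pct in 0..100)."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]


def per_base_quality(quals, offset=33):
    """Return one summary dict per read position."""
    longest = max(len(q) for q in quals)
    summary = []
    for pos in range(longest):
        # Reads shorter than this position do not contribute to the column.
        col = sorted(ord(q[pos]) - offset for q in quals if len(q) > pos)
        summary.append({
            "position": pos + 1,
            "mean": mean(col),
            "median": median(col),
            "q25": percentile(col, 25),
            "q75": percentile(col, 75),
            "p10": percentile(col, 10),
            "p90": percentile(col, 90),
        })
    return summary


quals = ["IIII", "IIH5", "II5", "I55+"]  # made-up quality strings
stats = per_base_quality(quals)
for row in stats:
    print(row)
```

A median that drops towards the last positions is the "quality decreases with length" pattern.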
Sequence Quality
Per base Position


                  Bad data
                 High variance
                 Quality decrease
                  with length
Per Sequence Quality
     Distribution


                  Good data
                 Most are high-quality
                  sequences
Per Sequence Quality
       Distribution


                        Bad data
                       Not uniform
                        distribution


Low Quality Reads
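
A per-sequence quality distribution is a histogram of the mean quality of each read. A minimal sketch, assuming Phred+33 and non-empty reads (function names are hypothetical):

```python
from collections import Counter


def mean_quality(qual, offset=33):
    """Average quality value of one read."""
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def quality_histogram(quals, offset=33):
    """Count reads per mean quality, rounded down to an integer."""
    return Counter(int(mean_quality(q, offset)) for q in quals)


quals = ["IIII", "IIII", "5555", "++++", "II55"]  # made-up quality strings
hist = quality_histogram(quals)
for q in sorted(hist):
    print(q, "#" * hist[q])
```

With good data most reads fall in the highest bins; a second peak at low values points to a subset of low-quality reads.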
Nucleotide Content
   per position


                  Good data
                 Smooth over
                  length
                 Organism
                  dependent (GC)
Nucleotide Content
   per position


                  Bad data
                 Sequence
                  position bias
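
Nucleotide content per position is the fraction of each base in every column of the reads. The sketch below also reports N, so the same table can be used for the per-base N content check; the function name is hypothetical.

```python
from collections import Counter


def base_content(seqs):
    """Fraction of A, C, G, T and N at each read position."""
    longest = max(len(s) for s in seqs)
    table = []
    for pos in range(longest):
        counts = Counter(s[pos].upper() for s in seqs if len(s) > pos)
        total = sum(counts.values())
        table.append({base: counts.get(base, 0) / total for base in "ACGTN"})
    return table


seqs = ["ACGT", "ACGA", "ACTT", "AGGN"]  # made-up reads
content = base_content(seqs)
for pos, row in enumerate(content, start=1):
    print(pos, row)
```

Roughly flat lines along the read are expected; a column that deviates strongly from the others indicates position bias.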
GC Distribution



                  Good data
                 Fits the expected
                  distribution
                 Organism
                  dependent
Per sequence
GC Distribution


                 Bad data
                It does not fit
                 the expected distribution
                Organism
                 dependent
                 Library
                 contamination?
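
The per-sequence GC distribution is a histogram of the GC percentage of each read. A minimal sketch, in which N calls are ignored and the bin width of 10% is an arbitrary choice (names are hypothetical):

```python
from collections import Counter


def gc_percent(seq):
    """GC percentage of one read, ignoring N calls."""
    seq = seq.upper()
    called = sum(seq.count(b) for b in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_distribution(seqs, bin_width=10):
    """Histogram of reads per GC bin (bin label = lower edge in percent)."""
    bins = Counter()
    for s in seqs:
        lower = int(gc_percent(s) // bin_width) * bin_width
        # min() keeps a 100% GC read inside the last bin.
        bins[min(lower, 100 - bin_width)] += 1
    return bins


seqs = ["ATATATAT", "ATGCATGC", "GGCCGGCC", "ATGCNNNN"]  # made-up reads
dist = gc_distribution(seqs)
for lower in sorted(dist):
    print(f"{lower:3d}%", "#" * dist[lower])
```

The expected shape depends on the organism; a second peak away from the expected GC content is one hint of library contamination.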
Per base
GC Distribution


                      Good data
                     No variation
                      across read
                      sequence
Per base
GC Distribution


                      Bad data
                     Variation
                      across
                      read
                      sequence
Per base
N content


            It's not
            good if
            Ns are
            concentrated
            at particular
            base positions
Duplicated Sequences


      A high number of duplicated
      sequences is not expected:
              PCR artifact?
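
A rough duplication level can be estimated by counting exact-match reads. This sketch keeps every read in memory and compares full sequences only; FastQC's own estimate may be computed differently, and the function name is hypothetical.

```python
from collections import Counter


def duplication_level(seqs):
    """Return (percent of reads that are extra copies, Counter of sequences)."""
    counts = Counter(seqs)
    extra_copies = sum(n - 1 for n in counts.values())
    return 100.0 * extra_copies / len(seqs), counts


seqs = ["ACGT", "ACGT", "ACGT", "TTGA", "CCAG"]  # made-up reads
percent, counts = duplication_level(seqs)
print(f"{percent:.1f}% duplicated reads")
print(counts.most_common(1))
```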
Length Distribution

   This is descriptive.
   Some sequencers
    output sequences of
    different length (e.g.
    454)
Overrepresented Sequences
Question:
       If you obtain the exact same sequence too
          many times, do you have a problem?


Answer:
       Sometimes!


Examples: PCR primers (Illumina)
      GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCT

      CGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC
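
One way to check whether a primer such as the first example above has leaked into the reads is to look for it, or for a prefix of it cut off by the read end, inside each read. The sketch below uses exact matching only (no mismatches allowed) and a minimum overlap of 8 bases, both of which are simplifying assumptions; the function names are hypothetical.

```python
ADAPTER = "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCT"  # first example on the slide


def find_adapter(read, adapter=ADAPTER, min_overlap=8):
    """Return the index where the adapter starts in the read, or -1."""
    idx = read.find(adapter)  # complete adapter inside the read
    if idx != -1:
        return idx
    # Adapter truncated by the end of the read: try prefixes, longest first.
    for length in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:length]):
            return len(read) - length
    return -1


def clip(read, qual, adapter=ADAPTER):
    """Remove the adapter and everything after it from read and quality."""
    idx = find_adapter(read, adapter)
    return (read, qual) if idx == -1 else (read[:idx], qual[:idx])


# First read from the "HUGE files" slide: its last 14 bases match the primer start.
read = "TACGTACGTACGTACGTACGTAGATCGGAAGAGCGG"
qual = "aa[a_a_a^a^a]VZ]R^P[]YNSUTZBBBBBBBBB"
print(find_adapter(read))  # 22
print(clip(read, qual))
```

Clipping the read and its quality string together keeps the two lines the same length, which FastQ requires.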
K-mer Content



               Helps to detect
                problems
               Adapters?
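
Counting k-mers over all reads is a simple way to spot short sequences that occur far more often than the rest. This sketch only counts totals; it does not test for enrichment at particular read positions as a dedicated QC tool might. The choice of k=5 and the function name are assumptions.

```python
from collections import Counter


def kmer_counts(seqs, k=5):
    """Count every k-mer over all reads, skipping k-mers that contain N."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


seqs = ["ACGTACGT", "TTACGTAA", "NNACGTAC"]  # made-up reads
counts = kmer_counts(seqs, k=5)
print(counts.most_common(3))
```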
Practical:
     FastQC and Fastx-toolkit

   Use FastQC to see your starting state.
   Use Fastx-toolkit to optimize different datasets
    and then visualize the result with FastQC to
    prove your success!

    Hints: Try trimming, clipping and quality filtering.



    Go to the tutorial and try the exercises...
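
To make the hints concrete, here is what trimming and quality filtering do to a read, written in plain Python. This is an illustrative sketch, not the Fastx-toolkit implementation: it assumes Phred+33 qualities, and the thresholds (quality 20, 80% of bases, minimum length 4) and all function names are hypothetical.

```python
def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Cut bases off the 3' end until one reaches min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def passes_filter(qual, min_q=20, min_percent=80, offset=33):
    """True if at least min_percent of the bases have quality >= min_q."""
    if not qual:
        return False
    good = sum(1 for ch in qual if ord(ch) - offset >= min_q)
    return 100.0 * good / len(qual) >= min_percent


def preprocess(records, min_len=4):
    """Trim each (id, seq, qual) record, then keep those that pass the filter."""
    kept = []
    for read_id, seq, qual in records:
        seq, qual = trim_low_quality_tail(seq, qual)
        if len(seq) >= min_len and passes_filter(qual):
            kept.append((read_id, seq, qual))
    return kept


records = [
    ("r1", "ACGTACGT", "IIIIII##"),  # bad tail: trimmed to 6 bases, kept
    ("r2", "ACGTACGT", "########"),  # bad everywhere: trimmed to nothing
    ("r3", "ACGTACGT", "I#I#I#II"),  # only 62.5% good bases: filtered out
]
kept = preprocess(records)
print(kept)
```

Running FastQC before and after such a step shows whether the per-base quality plot has improved.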

 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 

Kürzlich hochgeladen (20)

Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 

NGS Data Preprocessing

  • 1. Babelomics: NGS data Preprocessing. Granada, June 2011. Javier Santoyo-Lopez (jsantoyo@cipf.es), http://bioinfo.cipf.es, Genomics Department, Centro de Investigacion Principe Felipe (CIPF), Valencia, Spain
  • 2. History of DNA Sequencing (adapted from Eric Green, NIH, and from Messing & Llaca, PNAS (1998)). Timeline of milestones plotted against sequencing efficiency (bp/person/year):
      – 1870: Miescher discovers DNA
      – 1940: Avery proposes DNA as ‘Genetic Material’
      – 1953: Watson & Crick: double helix structure of DNA
      – 1965: Holley sequences yeast tRNA(Ala)
      – 1970: Wu sequences λ cohesive end DNA
      – 1977: Sanger: dideoxy chain termination; Gilbert: chemical degradation
      – 1980: Messing: M13 cloning
      – 1986: Hood et al.: partial automation
      – 1990: Cycle sequencing; improved sequencing enzymes; improved fluorescent detection schemes
      – 2002 and 2008: Next Generation Sequencing; improved enzymes and chemistry; improved image processing
      – Efficiency axis ticks (bp/person/year): 1; 15; 150; 1,500; 15,000; 25,000; 50,000; 200,000; 50,000,000; 100,000,000,000
  • 3. Sequence Databases Trend EMBL database growth (March 2011)
  • 4. From Michael Metzker, http://view.ncbi.nlm.nih.gov/pubmed/19997069
  • 5. NGS platforms comparison (06/14/11). Source: http://www.clcngs.com/2008/12/ngs-platforms-overview/
  • 6. Next-gen sequencers (adapted from John McPherson, OICR). Plot of bases per machine run (1 Mb to 100 Gb) against read length (10 bp to 1,000 bp):
      – AB SOLiDv3: 200 Gb, 50 bp reads
      – Illumina HiSeq: 150 Gb, 100 bp reads
      – 454 GS FLX Titanium: 0.4-1 Gb, 100-500 bp reads
      – ABI capillary sequencer: 0.04-0.08 Mb, 450-800 bp reads
  • 7. Many Gbs of Sequences and...
      • Data management becomes a challenge.
          – Moving data across file systems takes time (several hundred Gbs)
      • What structure does the data have?
          – Different sequencers output different files, but
          – Some data formats are becoming widely accepted (e.g. FastQ format)
      • Raw sequence data formats
          – SFF
          – Fasta, csfasta
          – Qual file
          – Fastq
  • 8. Fasta & Fastq formats
      • FastA format (everybody knows about it)
          – Header line starts with “>” followed by a sequence ID
          – Sequence (string of nt)
      • FastQ format
          – First comes the sequence record (like FastA, but the header line starts with “@”)
          – Then a line with “+” and, optionally, the sequence ID again; the following line holds the quality values (QVs), encoded as single-byte ASCII characters
      • Several quality encoding variants exist
      • Nearly all downstream analysis tools take FastQ as their input sequence format
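The four-line FastQ layout described on this slide can be read with a few lines of Python. This is only a minimal sketch, assuming every record occupies exactly four lines (no wrapped sequence lines); the function name `read_fastq` and the two sample records are made up for illustration.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a 4-line-per-record FastQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line)
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FastQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


# Hypothetical two-record input; the ID after "+" is optional.
example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nGGCA\n+read2\nHHHH\n"
)
records = list(read_fastq(example))
print(records)
```

Because the function is a generator, it never holds more than one record in memory, which matters for files of tens of Gbs.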
  • 9. Sequence to Variation Workflow (diagram). Elements shown: Raw Data, FastQ, BWA/BWASW, SAM, SAM Tools, BAM, Samtools mPileup, Pileup, BCF Tools, Filter, Raw VCF, VCF, GFF, IGV
  • 10. Sequence to Variation Workflow (same diagram, with a FastX Toolkit step added between the raw FastQ and the FastQ passed on to BWA/BWASW)
  • 11. Why Quality Control and Preprocessing?
      • Sequencer output:
          – Reads + quality
      • Is the quality of my sequenced data OK?
  • 12. Why Quality Control and Preprocessing?
      • Sequencer output:
          – Reads + quality
      • Is the quality of my sequenced data OK? If something is wrong, can I fix it?
  • 13. Why Quality Control and Preprocessing?
      • Sequencer output:
          – Reads + quality
      • Is the quality of my sequenced data OK? If something is wrong, can I fix it?
      • Problem:
          – HUGE files...
  • 14. Why Quality Control and Preprocessing?
      • Sequencer output:
          – Reads + quality
      • Is the quality of my sequenced data OK? If something is wrong, can I fix it?
      • Problem:
          – HUGE files... How do they look?
            @HWUSI-EAS460:2:1:368:1089#0/1
            TACGTACGTACGTACGTACGTAGATCGGAAGAGCGG
            +HWUSI-EAS460:2:1:368:1089#0/1
            aa[a_a_a^a^a]VZ]R^P[]YNSUTZBBBBBBBBB
            @HWUSI-EAS460:2:1:368:528#0/1
            CTATTATAATATGACCGACCAGCTAGATCTACAGTC
            +HWUSI-EAS460:2:1:368:528#0/1
            abbbbaaaabba^aa`Y``aa`aaa``a`a__`[_
      • Files are flat files and are big... tens of Gbs (please... don't use MS Word to view or edit them)
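To make the quality line of the first record above readable, each character can be converted back to an integer quality value. The sketch below assumes an ASCII offset of 64 (the encoding used by older Illumina pipelines), which is a guess based on the characters in the example and is not stated on the slide; `decode_quality` is an illustrative name.

```python
from typing import List


def decode_quality(qual: str, offset: int = 64) -> List[int]:
    """Convert an ASCII-encoded quality string to integer quality values."""
    return [ord(ch) - offset for ch in qual]


# Quality line of the first record shown on the slide.
qual_1 = "aa[a_a_a^a^a]VZ]R^P[]YNSUTZBBBBBBBBB"
scores = decode_quality(qual_1)

print(scores[:5])   # [33, 33, 27, 33, 31]
print(scores[-9:])  # [2, 2, 2, 2, 2, 2, 2, 2, 2]  (the trailing run of 'B')
```

With a different encoding variant only the `offset` argument changes (for example 33 for Sanger-style FastQ).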
  • 15. Sequence Quality Per base Position: Good data
      • Consistent
      • High quality along the read
      * The central red line is the median value
      * The yellow box represents the inter-quartile range (25-75%)
      * The upper and lower whiskers represent the 10% and 90% points
      * The blue line represents the mean quality
  • 16. Sequence Quality Per base Position: Bad data
      • High variance
      • Quality decreases with length
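The quantities drawn in the per-base quality plot (median, inter-quartile range, 10% and 90% points, mean) can be computed column by column over the quality strings. This is a rough sketch and not how any particular QC tool implements it; it assumes an ASCII offset of 33 and uses invented quality strings, and `per_base_summary` is a hypothetical name.

```python
import statistics
from typing import Dict, List


def per_base_summary(quals: List[str], offset: int = 33) -> List[Dict[str, float]]:
    """Summarise quality values position by position across all reads."""
    longest = max(len(q) for q in quals)
    summary = []
    for pos in range(longest):
        # Quality values of every read that is long enough to cover `pos`.
        column = [ord(q[pos]) - offset for q in quals if len(q) > pos]
        if len(column) > 1:
            quartiles = statistics.quantiles(column, n=4, method="inclusive")
            deciles = statistics.quantiles(column, n=10, method="inclusive")
        else:  # a single value is its own quantile
            quartiles = [column[0]] * 3
            deciles = [column[0]] * 9
        summary.append({
            "mean": statistics.mean(column),
            "median": statistics.median(column),
            "q25": quartiles[0],
            "q75": quartiles[2],
            "p10": deciles[0],
            "p90": deciles[-1],
        })
    return summary


# Invented quality strings whose last position degrades.
quals = ["IIII", "IIIG", "IIH5", "IIF#"]
summary = per_base_summary(quals)
for pos, stats in enumerate(summary, start=1):
    print(pos, stats)
```

In this toy input the first position has a median of 40 while the last drops to 29, the "quality decreases with length" pattern of the bad-data slide.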
  • 17. Per Sequence Quality Distribution: Good data
      • Most are high-quality sequences
  • 18. Per Sequence Quality Distribution: Bad data
      • Non-uniform distribution
      • Low-quality reads
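A per-sequence quality distribution is a histogram of the mean quality of each read. The following sketch assumes an ASCII offset of 33 and invented quality strings; `mean_quality_histogram` is an illustrative name.

```python
from collections import Counter
from typing import List


def mean_quality_histogram(quals: List[str], offset: int = 33) -> Counter:
    """Count reads by their mean quality, rounded down to an integer."""
    hist: Counter = Counter()
    for q in quals:
        if q:  # skip empty quality strings
            mean_q = sum(ord(ch) - offset for ch in q) / len(q)
            hist[int(mean_q)] += 1
    return hist


# Three good reads and one uniformly poor read (invented data).
quals = ["IIII", "IIIG", "IIH5", "####"]
hist = mean_quality_histogram(quals)
for q in sorted(hist):
    print(q, hist[q])
```

A second peak at low mean quality, like the single read at 2 here, is the kind of low-quality subset the bad-data slide points at.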
  • 19. Nucleotide Content per position: Good data
      • Smooth over length
      • Organism dependent (GC)
  • 20. Nucleotide Content per position: Bad data
      • Sequence position bias
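Nucleotide content per position is the fraction of each base observed at every read position. A small sketch with invented reads (`base_composition` is a hypothetical helper):

```python
from collections import Counter
from typing import Dict, List


def base_composition(reads: List[str]) -> List[Dict[str, float]]:
    """Return, for each read position, the fraction of each base seen there."""
    longest = max(len(r) for r in reads)
    counts = [Counter() for _ in range(longest)]
    for read in reads:
        for pos, base in enumerate(read.upper()):
            counts[pos][base] += 1
    return [
        {base: n / sum(c.values()) for base, n in c.items()}
        for c in counts
    ]


# Invented reads: every read starts with A, a strong position bias.
reads = ["ACGT", "ACGA", "ATGC", "AGGG"]
composition = base_composition(reads)
for pos, fractions in enumerate(composition, start=1):
    print(pos, fractions)
```

Positions where one base dominates (position 1 and 3 in this toy input) would show up as spikes instead of smooth lines in the plot.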
  • 21. GC Distribution: Good data
      • Fits the expected distribution
      • Organism dependent
  • 22. Per sequence GC Distribution: Bad data
      • Does not fit the expected distribution
      • Organism dependent
      • Library contamination?
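The per-sequence GC distribution can be sketched by computing the GC percentage of every read and binning the values. The 10% bin width, the decision to ignore N when computing the percentage, and the function names are all assumptions made for this illustration.

```python
from collections import Counter
from typing import List


def gc_percent(seq: str) -> float:
    """GC percentage of a read, ignoring anything that is not A, C, G or T."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        return 0.0
    return 100.0 * gc / (gc + at)


def gc_histogram(reads: List[str], bin_width: int = 10) -> Counter:
    """Count reads per GC bin; each bin is labelled by its lower edge."""
    hist: Counter = Counter()
    for read in reads:
        hist[int(gc_percent(read) // bin_width) * bin_width] += 1
    return hist


reads = ["GGCC", "ATAT", "ACGT", "AACG"]  # invented reads
gc_hist = gc_histogram(reads)
for lower_edge in sorted(gc_hist):
    print(lower_edge, gc_hist[lower_edge])
```

Comparing such a histogram with the GC content expected for the organism is what reveals the mismatch (and possible contamination) mentioned on the slide.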
  • 23. Per base GC Distribution: Good data
      • No variation across read sequence
  • 24. Per base GC Distribution: Bad data
      • Variation across read sequence
  • 25. Per base N content: it is not good if N calls are biased towards particular base positions
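A positional N bias can be checked by computing the fraction of N calls at each read position. A minimal sketch on invented reads (`n_fraction_per_position` is an illustrative name):

```python
from typing import List


def n_fraction_per_position(reads: List[str]) -> List[float]:
    """Fraction of reads that have an N at each position."""
    longest = max(len(r) for r in reads)
    n_counts = [0] * longest
    totals = [0] * longest
    for read in reads:
        for pos, base in enumerate(read):
            totals[pos] += 1
            if base in ("N", "n"):
                n_counts[pos] += 1
    return [n / t for n, t in zip(n_counts, totals)]


reads = ["ACGN", "ACNN", "ACGT", "ANGT"]  # invented reads
n_fractions = n_fraction_per_position(reads)
print(n_fractions)  # [0.0, 0.25, 0.25, 0.5]
```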
  • 26. Duplicated Sequences: a high number of duplicated sequences is not expected
      • PCR artifact?
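One simple way to put a number on duplication is the share of reads that are exact copies of an earlier read. This sketch uses exact string matching on invented reads; real QC tools may estimate duplication differently, and `duplication_rate` is a hypothetical helper.

```python
from collections import Counter
from typing import List


def duplication_rate(reads: List[str]) -> float:
    """Fraction of reads that are exact duplicates of another read."""
    if not reads:
        return 0.0
    distinct = len(Counter(reads))
    return (len(reads) - distinct) / len(reads)


reads = ["ACGT", "ACGT", "ACGT", "TTGA", "CCAG"]  # invented reads
rate = duplication_rate(reads)
print(rate)  # 0.4  (2 of the 5 reads are extra copies)
```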
  • 27. Distribution Length
      • This is descriptive.
      • Some sequencers output sequences of different length (e.g. 454)
  • 28. Overrepresented Sequences
      Question: If you obtain the exact same sequence too many times, do you have a problem?
      Answer: Sometimes!
      Examples: PCR primers (Illumina)
      • GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCT
      • CGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC
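A quick way to see whether a known primer is behind an overrepresented sequence is to count the reads that contain its leading bases. The sketch below uses the first primer listed on this slide and the two example reads from slide 14; the 12-base seed length and the name `fraction_with_primer` are arbitrary choices for illustration.

```python
from typing import List

# First Illumina PCR primer listed on the slide.
PRIMER = "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCT"


def fraction_with_primer(reads: List[str], primer: str, seed: int = 12) -> float:
    """Fraction of reads containing the first `seed` bases of the primer."""
    if not reads:
        return 0.0
    probe = primer[:seed]
    hits = sum(1 for read in reads if probe in read)
    return hits / len(reads)


# The two example reads shown on slide 14.
reads = [
    "TACGTACGTACGTACGTACGTAGATCGGAAGAGCGG",
    "CTATTATAATATGACCGACCAGCTAGATCTACAGTC",
]
primer_fraction = fraction_with_primer(reads, PRIMER)
print(primer_fraction)  # 0.5  (only the first read carries the primer start)
```

The first read ends in `GATCGGAAGAGCGG`, the start of the primer, so its 3' end is a candidate for clipping.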
  • 29. K-mer Content
      • Helps to detect problems
      • Adapters?
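K-mer content boils down to counting every substring of length k across all reads and looking for ones that occur far more often than the rest. A minimal sketch with invented reads and k = 4 (`kmer_counts` is an illustrative name):

```python
from collections import Counter
from typing import List


def kmer_counts(reads: List[str], k: int = 5) -> Counter:
    """Count every overlapping substring of length k across all reads."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


reads = ["ACGTACGT", "TTACGTAA"]  # invented reads
counts = kmer_counts(reads, k=4)
print(counts.most_common(3))
```

Here `ACGT` is the most frequent 4-mer (3 occurrences); in real data a k-mer that is strongly enriched can hint at adapter sequence.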
  • 30. Practical: FastQC and Fastx-toolkit
      • Use FastQC to see your starting state.
      • Use Fastx-toolkit to optimize different datasets and then visualize the result with FastQC to prove your success!
      Hints: Try trimming, clipping and quality filtering. Go to the tutorial and try the exercises...
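The three hinted operations (trimming, clipping, quality filtering) can be illustrated in plain Python before trying the real tools. This is a conceptual sketch only and not how Fastx-toolkit implements them: the function names, the exact-match clipping with an 8-base seed, the thresholds (quality 20 in at least 80% of bases) and the ASCII offset of 64 are all assumptions. The read and primer are taken from slides 14 and 28.

```python
from typing import Optional, Tuple


def trim(seq: str, qual: str, first: int = 1,
         last: Optional[int] = None) -> Tuple[str, str]:
    """Keep bases first..last (1-based, inclusive)."""
    return seq[first - 1:last], qual[first - 1:last]


def clip_adapter(seq: str, qual: str, adapter: str,
                 min_overlap: int = 8) -> Tuple[str, str]:
    """Cut the read where the adapter's leading bases first occur (exact match)."""
    idx = seq.find(adapter[:min_overlap])
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def passes_quality(qual: str, min_q: int = 20, min_percent: float = 80.0,
                   offset: int = 64) -> bool:
    """True if at least min_percent of the bases have quality >= min_q."""
    if not qual:
        return False
    good = sum(1 for ch in qual if ord(ch) - offset >= min_q)
    return 100.0 * good / len(qual) >= min_percent


seq = "TACGTACGTACGTACGTACGTAGATCGGAAGAGCGG"
qual = "aa[a_a_a^a^a]VZ]R^P[]YNSUTZBBBBBBBBB"
primer = "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCT"

clipped_seq, clipped_qual = clip_adapter(seq, qual, primer)
print(clipped_seq)                    # TACGTACGTACGTACGTACGTA
print(passes_quality(qual))           # False: the raw read fails the filter
print(passes_quality(clipped_qual))   # True: it passes once the primer is clipped
```

The example shows why the order of steps matters: the raw read is rejected by the quality filter, but after clipping the primer (and the low-quality tail with it) the remaining 22 bases pass.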