SlideShare ist ein Scribd-Unternehmen logo
1 von 12
Downloaden Sie, um offline zu lesen
Quality Control of NGS Data
Solutions
Surya Saha ss2489@cornell.edu
BTI PGRP Summer Internship Program 2014
Slides: Aureliano Bombarely ab782@cornell.edu
Goal:
Learn the use of read evaluation programs keeping
attention in relevant parameters such as qscore and
length distributions and reads duplications.
Data:
(Illumina data for two tomato ripening stages)
/home/bioinfo/Data/ch4_demo_dataset.tar.gz
Tools:
tar -zxvf (command line, untar and unzip the files)
head (command line, take a quick look of the files)
mv (command line, change the name of the files)
grep (command line, find/count patterns in files)
FASTX toolkit (command line, process fasta/fastq)
FastQC (gui, to calculate several stats for each file)
Evaluation
7/8/2014 BTI PGRP Summer Internship Program 2014 2
Exercise 1:
1. Untar and Unzip the file:
/home/bioinfo/Data/ch4_demo_dataset.tar.gz
2. Raw data will be found in two dirs: breaker and
immature_fruit. Print the first 10 lines for the files:
SRR404331_ch4.fq, SRR404333_ch4.fq, SRR404334_ch4.fq
and SRR404336_ch4.fq.
Question 1.1: Do these files have fastq format?
3. Change the extension of the .fq files to .fastq
4. Type ‘fastqc’ to start the FastQC program. Load the four
sequence files in the program.
Evaluation
7/8/2014 BTI PGRP Summer Internship Program 2014 3
Solution 1:
1. Untar and Unzip the file: ch4_demo_dataset.tar.gz
tar -zxvf ch4_demo_dataset.tar.gz
or in two steps:
gunzip ch4_demo_dataset.tar.gz
tar -xvf ch4_demo_dataset.tar
2. Raw data will be found in two dirs: breaker and immature_fruit.
Print the first 10 lines for the files: SRR404331_ch4.fq,
SRR404333_ch4.fq, SRR404334_ch4.fq and SRR404336_ch4.fq.
head SRR404331_ch4.fq; head SRR404333_ch4.fq; head SRR404334_ch4.fq; head
SRR404336_ch4.fq;
Question 1.1: Do these files have fastq format? Yes
Evaluation
7/8/2014 BTI PGRP Summer Internship Program 2014 4
Solution 1:
• Change the extension of the .fq files to .fastq
mv SRR404331_ch4.fq SRR404331_ch4.fastq
mv SRR404333_ch4.fq SRR404333_ch4.fastq
mv SRR404334_ch4.fq SRR404334_ch4.fastq
mv SRR404336_ch4.fq SRR404336_ch4.fastq
• Count number of sequences in each fastq file using
commands you learnt earlier
grep –c ‘^+$’ SRR404331_ch4.fastq
Evaluation
7/8/2014 BTI PGRP Summer Internship Program 2014 5
Solution 1:
• Convert the fastq files to fasta
fastq_to_fasta –I SRR404331_ch4.fastq –o SRR404331_ch4.fasta –Q33
-n
The –Q33 flag is to denote Sanger Phred 33 encoding. It expects Illumina Phred+64 by
default. See http://seqanswers.com/forums/showthread.php?t=7596
-n tells it not to remove any sequences from the file. It removes any reads containing N’s by
default
• Now count the number of sequences in fasta file and see
if the number of sequences has changed.
grep –c ‘^>’ SRR404331_ch4.fasta
It may be different if you did not use -n parameter. Not using -n will remove sequences with
any ‘N’ bases
Evaluation
7/8/2014 BTI PGRP Summer Internship Program 2014 6
Solutions 2:
Question 2.2: How many sequences there are per file in
FastQC?
SRR404331_ch4.fastq = 762,365
SRR404333_ch4.fastq = 744,048
SRR404334_ch4.fastq = 592,123
SRR404336_ch4.fastq = 880,982
Question 2.3: Which is the length range for these reads?
53 in most of the cases, except in SRR404336 where is 54
Question 2.4: Which is the qscore range for these reads? Which
one looks best quality-wise?
The range goes from 2 in some cases such as SRR404334 to a maximum of 44. SRR404331 has
consistent quality over entire length with least 3’ end drop-off.
Evaluation
7/8/2014 BTI PGRP Summer Internship Program 2014 7
Solutions 1:
Question 2.5: Do these datasets have read
overrepresentation?
In some cases such as SRR404331. A Blast search of the top overrepresented sequence
reveals that it is the gene ASR1 from tomato, so it is not a contamination and probably it has
some biological relevance.
Question 2.6: Looking into the kmer content, do you think that
the samples have an adaptor?
No, it is possible to see some structure at the 5’ extreme, but it starts at the position 7 and
doesn’t affect to all the sequences.
Evaluation
7/8/2014 BTI PGRP Summer Internship Program 2014 8
Goal:
Trim the low quality ends of the reads and remove
the short reads.
Data:
(Illumina data for two tomato ripening stages)
ch4_demo_dataset.tar.gz
Tools:
fastq-mcf (command line tool to process reads)
FastQC (gui, to calculate several stats for each file)
Preprocessing
7/8/2014 BTI PGRP Summer Internship Program 2014 9
Exercise 3:
• Download the file: adapters1.fa from
ftp://ftp.solgenomics.net/user_requests/aubombarely/courses/RNAseqCo
rpoica/adapters1.fa
• Run the read processing program over each of the
datasets using a min. qscore of 30 and a min. length of 40
bp.
fastq-mcf -q 30 -l 40 -o SRR404331_ch4.q30l40.fastq
/home/bioinfo/Downloads/adapters1.fa SRR404331_ch4.fastq
Remember to specify the right path for the adapter1.fa and the input file
Preprocessing
7/8/2014 BTI PGRP Summer Internship Program 2014 10
Exercise 2:
•Type ‘fastqc’ to start the FastQC program. Load the
four sequence files in the program. Compare the
results with the previous datasets.
Before fastq-mcf processing
880,982 reads
After fastq-mcf processing
840,759 reads
SRR404336
Preprocessing
7/8/2014 BTI PGRP Summer Internship Program 2014 11
Need Help??
7/8/2014 BTI PGRP Summer Internship Program 2014 12

Weitere ähnliche Inhalte

Was ist angesagt?

Imgc2011 bioinformatics tutorial
Imgc2011 bioinformatics tutorialImgc2011 bioinformatics tutorial
Imgc2011 bioinformatics tutorialDeanna Church
 
Aug2014 working group report characterization bioinformatics
Aug2014 working group report characterization bioinformaticsAug2014 working group report characterization bioinformatics
Aug2014 working group report characterization bioinformaticsGenomeInABottle
 
Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...GenomeInABottle
 
NBIS RNA-seq course
NBIS RNA-seq courseNBIS RNA-seq course
NBIS RNA-seq coursePhil Ewels
 
Nextflow Camp 2019: nf-core tutorial (Updated Feb 2020)
Nextflow Camp 2019: nf-core tutorial (Updated Feb 2020)Nextflow Camp 2019: nf-core tutorial (Updated Feb 2020)
Nextflow Camp 2019: nf-core tutorial (Updated Feb 2020)Phil Ewels
 
P4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadP4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadOpen-NFP
 
Lineage-driven Fault Injection, SIGMOD'15
Lineage-driven Fault Injection, SIGMOD'15Lineage-driven Fault Injection, SIGMOD'15
Lineage-driven Fault Injection, SIGMOD'15palvaro
 
Adding Transparency and Automation into the Galaxy Tool Installation Process
Adding Transparency and Automation into the Galaxy Tool Installation ProcessAdding Transparency and Automation into the Galaxy Tool Installation Process
Adding Transparency and Automation into the Galaxy Tool Installation ProcessEnis Afgan
 
Demand-Driven Structural Testing with Dynamic Instrumentation (ICSE 2005)
Demand-Driven Structural Testing with Dynamic Instrumentation (ICSE 2005)Demand-Driven Structural Testing with Dynamic Instrumentation (ICSE 2005)
Demand-Driven Structural Testing with Dynamic Instrumentation (ICSE 2005)James Clause
 
Progressive Provenance Capture Through Re-computation
Progressive Provenance Capture Through Re-computationProgressive Provenance Capture Through Re-computation
Progressive Provenance Capture Through Re-computationPaul Groth
 
Declarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrierDeclarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrierCrai Macdonald
 
ELIXIR Proteomics Community - Connection with nf-core
ELIXIR Proteomics Community - Connection with nf-coreELIXIR Proteomics Community - Connection with nf-core
ELIXIR Proteomics Community - Connection with nf-corePhil Ewels
 

Was ist angesagt? (15)

Imgc2011 bioinformatics tutorial
Imgc2011 bioinformatics tutorialImgc2011 bioinformatics tutorial
Imgc2011 bioinformatics tutorial
 
Crash course in R and BioConductor
Crash course in R and BioConductorCrash course in R and BioConductor
Crash course in R and BioConductor
 
Aug2014 working group report characterization bioinformatics
Aug2014 working group report characterization bioinformaticsAug2014 working group report characterization bioinformatics
Aug2014 working group report characterization bioinformatics
 
3rd presentation
3rd presentation3rd presentation
3rd presentation
 
Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...
 
NBIS RNA-seq course
NBIS RNA-seq courseNBIS RNA-seq course
NBIS RNA-seq course
 
Nextflow Camp 2019: nf-core tutorial (Updated Feb 2020)
Nextflow Camp 2019: nf-core tutorial (Updated Feb 2020)Nextflow Camp 2019: nf-core tutorial (Updated Feb 2020)
Nextflow Camp 2019: nf-core tutorial (Updated Feb 2020)
 
P4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadP4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC Offload
 
Lineage-driven Fault Injection, SIGMOD'15
Lineage-driven Fault Injection, SIGMOD'15Lineage-driven Fault Injection, SIGMOD'15
Lineage-driven Fault Injection, SIGMOD'15
 
Adding Transparency and Automation into the Galaxy Tool Installation Process
Adding Transparency and Automation into the Galaxy Tool Installation ProcessAdding Transparency and Automation into the Galaxy Tool Installation Process
Adding Transparency and Automation into the Galaxy Tool Installation Process
 
Demand-Driven Structural Testing with Dynamic Instrumentation (ICSE 2005)
Demand-Driven Structural Testing with Dynamic Instrumentation (ICSE 2005)Demand-Driven Structural Testing with Dynamic Instrumentation (ICSE 2005)
Demand-Driven Structural Testing with Dynamic Instrumentation (ICSE 2005)
 
XSEDE15_PhastaGateway
XSEDE15_PhastaGatewayXSEDE15_PhastaGateway
XSEDE15_PhastaGateway
 
Progressive Provenance Capture Through Re-computation
Progressive Provenance Capture Through Re-computationProgressive Provenance Capture Through Re-computation
Progressive Provenance Capture Through Re-computation
 
Declarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrierDeclarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrier
 
ELIXIR Proteomics Community - Connection with nf-core
ELIXIR Proteomics Community - Connection with nf-coreELIXIR Proteomics Community - Connection with nf-core
ELIXIR Proteomics Community - Connection with nf-core
 

Andere mochten auch

Sequencing 2016
Sequencing 2016Sequencing 2016
Sequencing 2016Surya Saha
 
Next-generation sequencing course, part 1: technologies
Next-generation sequencing course, part 1: technologiesNext-generation sequencing course, part 1: technologies
Next-generation sequencing course, part 1: technologiesJan Aerts
 
Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)Sebastian Schmeier
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Li Shen
 
ECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing TutorialECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing TutorialThomas Keane
 

Andere mochten auch (7)

Sequencing 2016
Sequencing 2016Sequencing 2016
Sequencing 2016
 
ChIP-seq - Data processing
ChIP-seq - Data processingChIP-seq - Data processing
ChIP-seq - Data processing
 
Next-generation sequencing course, part 1: technologies
Next-generation sequencing course, part 1: technologiesNext-generation sequencing course, part 1: technologies
Next-generation sequencing course, part 1: technologies
 
Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015
 
ECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing TutorialECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing Tutorial
 
Introduction to next generation sequencing
Introduction to next generation sequencingIntroduction to next generation sequencing
Introduction to next generation sequencing
 

Ähnlich wie Quality Control of NGS Data Solutions

Summary of Journal_ShenLu_Summer2013
Summary of Journal_ShenLu_Summer2013Summary of Journal_ShenLu_Summer2013
Summary of Journal_ShenLu_Summer2013Shen Lu
 
Reproducible, Automated and Portable Computational and Data Science Experimen...
Reproducible, Automated and Portable Computational and Data Science Experimen...Reproducible, Automated and Portable Computational and Data Science Experimen...
Reproducible, Automated and Portable Computational and Data Science Experimen...Ivo Jimenez
 
Quality Control of Sequencing Data
Quality Control of Sequencing Data Quality Control of Sequencing Data
Quality Control of Sequencing Data Surya Saha
 
fastp: the FASTQ pre-processor
fastp: the FASTQ pre-processorfastp: the FASTQ pre-processor
fastp: the FASTQ pre-processorHoffman Lab
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdfSteve Caron
 
Through the firewall with miniCRAN
Through the firewall with miniCRANThrough the firewall with miniCRAN
Through the firewall with miniCRANRevolution Analytics
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIGeorge Markomanolis
 
Performance Analysis and Monitoring with Perf4j
Performance Analysis and Monitoring with Perf4jPerformance Analysis and Monitoring with Perf4j
Performance Analysis and Monitoring with Perf4jCraig Dickson
 
OSBConf 2015 | Backups with rdiff backup and rsnapshot by christoph mitasch &...
OSBConf 2015 | Backups with rdiff backup and rsnapshot by christoph mitasch &...OSBConf 2015 | Backups with rdiff backup and rsnapshot by christoph mitasch &...
OSBConf 2015 | Backups with rdiff backup and rsnapshot by christoph mitasch &...NETWAYS
 
White Paper: Perforce Administration Optimization, Scalability, Availability ...
White Paper: Perforce Administration Optimization, Scalability, Availability ...White Paper: Perforce Administration Optimization, Scalability, Availability ...
White Paper: Perforce Administration Optimization, Scalability, Availability ...Perforce
 
Automatic test packet generation
Automatic test packet generationAutomatic test packet generation
Automatic test packet generationtusharjadhav2611
 
Instructions for using the phase wrapping and unwrapping code
Instructions for using the phase wrapping and unwrapping codeInstructions for using the phase wrapping and unwrapping code
Instructions for using the phase wrapping and unwrapping codeImperial College, London
 
Practical SPARQL Benchmarking Revisited
Practical SPARQL Benchmarking RevisitedPractical SPARQL Benchmarking Revisited
Practical SPARQL Benchmarking RevisitedRob Vesse
 
Managing and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in PythonManaging and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in PythonSimon Frid
 
Detection as Code, Automation, and Testing: The Key to Unlocking the Power of...
Detection as Code, Automation, and Testing: The Key to Unlocking the Power of...Detection as Code, Automation, and Testing: The Key to Unlocking the Power of...
Detection as Code, Automation, and Testing: The Key to Unlocking the Power of...MITRE ATT&CK
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing ToolsSysdig
 
What We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics PipelineWhat We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics PipelineWork-Bench
 
Tracer Evaluation
Tracer EvaluationTracer Evaluation
Tracer EvaluationQiao Han
 

Ähnlich wie Quality Control of NGS Data Solutions (20)

Summary of Journal_ShenLu_Summer2013
Summary of Journal_ShenLu_Summer2013Summary of Journal_ShenLu_Summer2013
Summary of Journal_ShenLu_Summer2013
 
Reproducible, Automated and Portable Computational and Data Science Experimen...
Reproducible, Automated and Portable Computational and Data Science Experimen...Reproducible, Automated and Portable Computational and Data Science Experimen...
Reproducible, Automated and Portable Computational and Data Science Experimen...
 
Quality Control of Sequencing Data
Quality Control of Sequencing Data Quality Control of Sequencing Data
Quality Control of Sequencing Data
 
fastp: the FASTQ pre-processor
fastp: the FASTQ pre-processorfastp: the FASTQ pre-processor
fastp: the FASTQ pre-processor
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
 
BioMake BOSC 2004
BioMake BOSC 2004BioMake BOSC 2004
BioMake BOSC 2004
 
Through the firewall with miniCRAN
Through the firewall with miniCRANThrough the firewall with miniCRAN
Through the firewall with miniCRAN
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen II
 
Performance Analysis and Monitoring with Perf4j
Performance Analysis and Monitoring with Perf4jPerformance Analysis and Monitoring with Perf4j
Performance Analysis and Monitoring with Perf4j
 
OSBConf 2015 | Backups with rdiff backup and rsnapshot by christoph mitasch &...
OSBConf 2015 | Backups with rdiff backup and rsnapshot by christoph mitasch &...OSBConf 2015 | Backups with rdiff backup and rsnapshot by christoph mitasch &...
OSBConf 2015 | Backups with rdiff backup and rsnapshot by christoph mitasch &...
 
White Paper: Perforce Administration Optimization, Scalability, Availability ...
White Paper: Perforce Administration Optimization, Scalability, Availability ...White Paper: Perforce Administration Optimization, Scalability, Availability ...
White Paper: Perforce Administration Optimization, Scalability, Availability ...
 
Automatic test packet generation
Automatic test packet generationAutomatic test packet generation
Automatic test packet generation
 
Instructions for using the phase wrapping and unwrapping code
Instructions for using the phase wrapping and unwrapping codeInstructions for using the phase wrapping and unwrapping code
Instructions for using the phase wrapping and unwrapping code
 
Practical SPARQL Benchmarking Revisited
Practical SPARQL Benchmarking RevisitedPractical SPARQL Benchmarking Revisited
Practical SPARQL Benchmarking Revisited
 
Managing and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in PythonManaging and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in Python
 
Optimizing and Profiling Golang Rest Api
Optimizing and Profiling Golang Rest ApiOptimizing and Profiling Golang Rest Api
Optimizing and Profiling Golang Rest Api
 
Detection as Code, Automation, and Testing: The Key to Unlocking the Power of...
Detection as Code, Automation, and Testing: The Key to Unlocking the Power of...Detection as Code, Automation, and Testing: The Key to Unlocking the Power of...
Detection as Code, Automation, and Testing: The Key to Unlocking the Power of...
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing Tools
 
What We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics PipelineWhat We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
 
Tracer Evaluation
Tracer EvaluationTracer Evaluation
Tracer Evaluation
 

Mehr von Surya Saha

An open access resource portal for arthropod vectors and agricultural pathosy...
An open access resource portal for arthropod vectors and agricultural pathosy...An open access resource portal for arthropod vectors and agricultural pathosy...
An open access resource portal for arthropod vectors and agricultural pathosy...Surya Saha
 
Functional annotation of invertebrate genomes
Functional annotation of invertebrate genomesFunctional annotation of invertebrate genomes
Functional annotation of invertebrate genomesSurya Saha
 
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...Surya Saha
 
Updates on Citrusgreening.org database from USDA NIFA project meeting
Updates on Citrusgreening.org database from USDA NIFA project meetingUpdates on Citrusgreening.org database from USDA NIFA project meeting
Updates on Citrusgreening.org database from USDA NIFA project meetingSurya Saha
 
Updates on the ACP v3 genome and annotation from USDA NIFA project meeting
Updates on the ACP v3 genome and annotation from USDA NIFA project meetingUpdates on the ACP v3 genome and annotation from USDA NIFA project meeting
Updates on the ACP v3 genome and annotation from USDA NIFA project meetingSurya Saha
 
AgriVectors: A Data and Systems Resource for Arthropod Vectors of Plant Diseases
AgriVectors: A Data and Systems Resource for Arthropod Vectors of Plant DiseasesAgriVectors: A Data and Systems Resource for Arthropod Vectors of Plant Diseases
AgriVectors: A Data and Systems Resource for Arthropod Vectors of Plant DiseasesSurya Saha
 
Visualization of insect vector-plant pathogen interactions in the citrus gree...
Visualization of insect vector-plant pathogen interactions in the citrus gree...Visualization of insect vector-plant pathogen interactions in the citrus gree...
Visualization of insect vector-plant pathogen interactions in the citrus gree...Surya Saha
 
Deciphering the genome of Diaphorina citri to develop solutions for the citru...
Deciphering the genome of Diaphorina citri to develop solutions for the citru...Deciphering the genome of Diaphorina citri to develop solutions for the citru...
Deciphering the genome of Diaphorina citri to develop solutions for the citru...Surya Saha
 
Sequencing 2017
Sequencing 2017Sequencing 2017
Sequencing 2017Surya Saha
 
Community resources for all y’all Omics
Community resources for all y’all OmicsCommunity resources for all y’all Omics
Community resources for all y’all OmicsSurya Saha
 
CitrusCyc: Metabolic Pathway Databases for the C. clementina and C. sinensis...
 CitrusCyc: Metabolic Pathway Databases for the C. clementina and C. sinensis... CitrusCyc: Metabolic Pathway Databases for the C. clementina and C. sinensis...
CitrusCyc: Metabolic Pathway Databases for the C. clementina and C. sinensis...Surya Saha
 
Using Long Reads, Optical Maps and Long-Range Scaffolding to improve the Diap...
Using Long Reads, Optical Maps and Long-Range Scaffolding to improve the Diap...Using Long Reads, Optical Maps and Long-Range Scaffolding to improve the Diap...
Using Long Reads, Optical Maps and Long-Range Scaffolding to improve the Diap...Surya Saha
 
Tomato Genome Build SL3.0
Tomato Genome Build SL3.0Tomato Genome Build SL3.0
Tomato Genome Build SL3.0Surya Saha
 
Sequencing and Bioinformatics PGRP Summer 2015
Sequencing and Bioinformatics PGRP Summer 2015Sequencing and Bioinformatics PGRP Summer 2015
Sequencing and Bioinformatics PGRP Summer 2015Surya Saha
 
Sequencing: The Next Generation 2015
Sequencing: The Next Generation 2015Sequencing: The Next Generation 2015
Sequencing: The Next Generation 2015Surya Saha
 
Tomato Genome SL2.50 and Beyond…
Tomato Genome SL2.50 and Beyond…Tomato Genome SL2.50 and Beyond…
Tomato Genome SL2.50 and Beyond…Surya Saha
 
Sequencing, Genome Assembly and the SGN Platform
Sequencing, Genome Assembly and the SGN PlatformSequencing, Genome Assembly and the SGN Platform
Sequencing, Genome Assembly and the SGN PlatformSurya Saha
 
ICAR Soybean Indore 2014
ICAR Soybean Indore 2014ICAR Soybean Indore 2014
ICAR Soybean Indore 2014Surya Saha
 
Sequencing: The Next Generation
Sequencing: The Next GenerationSequencing: The Next Generation
Sequencing: The Next GenerationSurya Saha
 

Mehr von Surya Saha (20)

An open access resource portal for arthropod vectors and agricultural pathosy...
An open access resource portal for arthropod vectors and agricultural pathosy...An open access resource portal for arthropod vectors and agricultural pathosy...
An open access resource portal for arthropod vectors and agricultural pathosy...
 
Functional annotation of invertebrate genomes
Functional annotation of invertebrate genomesFunctional annotation of invertebrate genomes
Functional annotation of invertebrate genomes
 
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...
 
Updates on Citrusgreening.org database from USDA NIFA project meeting
Updates on Citrusgreening.org database from USDA NIFA project meetingUpdates on Citrusgreening.org database from USDA NIFA project meeting
Updates on Citrusgreening.org database from USDA NIFA project meeting
 
Updates on the ACP v3 genome and annotation from USDA NIFA project meeting
Updates on the ACP v3 genome and annotation from USDA NIFA project meetingUpdates on the ACP v3 genome and annotation from USDA NIFA project meeting
Updates on the ACP v3 genome and annotation from USDA NIFA project meeting
 
AgriVectors: A Data and Systems Resource for Arthropod Vectors of Plant Diseases
AgriVectors: A Data and Systems Resource for Arthropod Vectors of Plant DiseasesAgriVectors: A Data and Systems Resource for Arthropod Vectors of Plant Diseases
AgriVectors: A Data and Systems Resource for Arthropod Vectors of Plant Diseases
 
Visualization of insect vector-plant pathogen interactions in the citrus gree...
Visualization of insect vector-plant pathogen interactions in the citrus gree...Visualization of insect vector-plant pathogen interactions in the citrus gree...
Visualization of insect vector-plant pathogen interactions in the citrus gree...
 
Deciphering the genome of Diaphorina citri to develop solutions for the citru...
Deciphering the genome of Diaphorina citri to develop solutions for the citru...Deciphering the genome of Diaphorina citri to develop solutions for the citru...
Deciphering the genome of Diaphorina citri to develop solutions for the citru...
 
Sequencing 2017
Sequencing 2017Sequencing 2017
Sequencing 2017
 
Community resources for all y’all Omics
Community resources for all y’all OmicsCommunity resources for all y’all Omics
Community resources for all y’all Omics
 
CitrusCyc: Metabolic Pathway Databases for the C. clementina and C. sinensis...
 CitrusCyc: Metabolic Pathway Databases for the C. clementina and C. sinensis... CitrusCyc: Metabolic Pathway Databases for the C. clementina and C. sinensis...
CitrusCyc: Metabolic Pathway Databases for the C. clementina and C. sinensis...
 
Using Long Reads, Optical Maps and Long-Range Scaffolding to improve the Diap...
Using Long Reads, Optical Maps and Long-Range Scaffolding to improve the Diap...Using Long Reads, Optical Maps and Long-Range Scaffolding to improve the Diap...
Using Long Reads, Optical Maps and Long-Range Scaffolding to improve the Diap...
 
Tomato Genome Build SL3.0
Tomato Genome Build SL3.0Tomato Genome Build SL3.0
Tomato Genome Build SL3.0
 
Sequencing and Bioinformatics PGRP Summer 2015
Sequencing and Bioinformatics PGRP Summer 2015Sequencing and Bioinformatics PGRP Summer 2015
Sequencing and Bioinformatics PGRP Summer 2015
 
Sequencing: The Next Generation 2015
Sequencing: The Next Generation 2015Sequencing: The Next Generation 2015
Sequencing: The Next Generation 2015
 
Tomato Genome SL2.50 and Beyond…
Tomato Genome SL2.50 and Beyond…Tomato Genome SL2.50 and Beyond…
Tomato Genome SL2.50 and Beyond…
 
Sequencing
SequencingSequencing
Sequencing
 
Sequencing, Genome Assembly and the SGN Platform
Sequencing, Genome Assembly and the SGN PlatformSequencing, Genome Assembly and the SGN Platform
Sequencing, Genome Assembly and the SGN Platform
 
ICAR Soybean Indore 2014
ICAR Soybean Indore 2014ICAR Soybean Indore 2014
ICAR Soybean Indore 2014
 
Sequencing: The Next Generation
Sequencing: The Next GenerationSequencing: The Next Generation
Sequencing: The Next Generation
 

Kürzlich hochgeladen

Français Patch Tuesday - Avril
Français Patch Tuesday - AvrilFrançais Patch Tuesday - Avril
Français Patch Tuesday - AvrilIvanti
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessWSO2
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Mark Simos
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...amber724300
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialJoão Esperancinha
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsYoss Cohen
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 

Kürzlich hochgeladen (20)

Français Patch Tuesday - Avril
Français Patch Tuesday - AvrilFrançais Patch Tuesday - Avril
Français Patch Tuesday - Avril
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorial
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 

Quality Control of NGS Data Solutions

  • 1. Quality Control of NGS Data Solutions Surya Saha ss2489@cornell.edu BTI PGRP Summer Internship Program 2014 Slides: Aureliano Bombarely ab782@cornell.edu
  • 2. Goal: Learn the use of read evaluation programs keeping attention in relevant parameters such as qscore and length distributions and reads duplications. Data: (Illumina data for two tomato ripening stages) /home/bioinfo/Data/ch4_demo_dataset.tar.gz Tools: tar -zxvf (command line, untar and unzip the files) head (command line, take a quick look of the files) mv (command line, change the name of the files) grep (command line, find/count patterns in files) FASTX toolkit (command line, process fasta/fastq) FastQC (gui, to calculate several stats for each file) Evaluation 7/8/2014 BTI PGRP Summer Internship Program 2014 2
  • 3. Exercise 1: 1. Untar and Unzip the file: /home/bioinfo/Data/ch4_demo_dataset.tar.gz 2. Raw data will be found in two dirs: breaker and immature_fruit. Print the first 10 lines for the files: SRR404331_ch4.fq, SRR404333_ch4.fq, SRR404334_ch4.fq and SRR404336_ch4.fq. Question 1.1: Do these files have fastq format? 3. Change the extension of the .fq files to .fastq 4. Type ‘fastqc’ to start the FastQC program. Load the four sequence files in the program. Evaluation 7/8/2014 BTI PGRP Summer Internship Program 2014 3
  • 4. Solution 1: 1. Untar and Unzip the file: ch4_demo_dataset.tar.gz tar -zxvf ch4_demo_dataset.tar.gz or in two steps: gunzip ch4_demo_dataset.tar.gz tar -xvf ch4_demo_dataset.tar 2. Raw data will be found in two dirs: breaker and immature_fruit. Print the first 10 lines for the files: SRR404331_ch4.fq, SRR404333_ch4.fq, SRR404334_ch4.fq and SRR404336_ch4.fq. head SRR404331_ch4.fq; head SRR404333_ch4.fq; head SRR404334_ch4.fq; head SRR404336_ch4.fq; Question 1.1: Do these files have fastq format? Yes Evaluation 7/8/2014 BTI PGRP Summer Internship Program 2014 4
  • 5. Solution 1: • Change the extension of the .fq files to .fastq mv SRR404331_ch4.fq SRR404331_ch4.fastq mv SRR404333_ch4.fq SRR404333_ch4.fastq mv SRR404334_ch4.fq SRR404334_ch4.fastq mv SRR404336_ch4.fq SRR404336_ch4.fastq • Count number of sequences in each fastq file using commands you learnt earlier grep –c ‘^+$’ SRR404331_ch4.fastq Evaluation 7/8/2014 BTI PGRP Summer Internship Program 2014 5
  • 6. Solution 1: • Convert the fastq files to fasta fastq_to_fasta –I SRR404331_ch4.fastq –o SRR404331_ch4.fasta –Q33 -n The –Q33 flag is to denote Sanger Phred 33 encoding. It expects Illumina Phred+64 by default. See http://seqanswers.com/forums/showthread.php?t=7596 -n tells it not to remove any sequences from the file. It removes any reads containing N’s by default • Now count the number of sequences in fasta file and see if the number of sequences has changed. grep –c ‘^>’ SRR404331_ch4.fasta It may be different if you did not use -n parameter. Not using -n will remove sequences with any ‘N’ bases Evaluation 7/8/2014 BTI PGRP Summer Internship Program 2014 6
  • 7. Solutions 2: Question 2.2: How many sequences there are per file in FastQC? SRR404331_ch4.fastq = 762,365 SRR404333_ch4.fastq = 744,048 SRR404334_ch4.fastq = 592,123 SRR404336_ch4.fastq = 880,982 Question 2.3: Which is the length range for these reads? 53 in most of the cases, except in SRR404336 where is 54 Question 2.4: Which is the qscore range for these reads? Which one looks best quality-wise? The range goes from 2 in some cases such as SRR404334 to a maximum of 44. SRR404331 has consistent quality over entire length with least 3’ end drop-off. Evaluation 7/8/2014 BTI PGRP Summer Internship Program 2014 7
  • 8. Solutions 1: Question 2.5: Do these datasets have read overrepresentation? In some cases such as SRR404331. A Blast search of the top overrepresented sequence reveals that it is the gene ASR1 from tomato, so it is not a contamination and probably it has some biological relevance. Question 2.6: Looking into the kmer content, do you think that the samples have an adaptor? No, it is possible to see some structure at the 5’ extreme, but it starts at the position 7 and doesn’t affect to all the sequences. Evaluation 7/8/2014 BTI PGRP Summer Internship Program 2014 8
  • 9. Goal: Trim the low quality ends of the reads and remove the short reads. Data: (Illumina data for two tomato ripening stages) ch4_demo_dataset.tar.gz Tools: fastq-mcf (command line tool to process reads) FastQC (gui, to calculate several stats for each file) Preprocessing 7/8/2014 BTI PGRP Summer Internship Program 2014 9
  • 10. Exercise 3: • Download the file: adapters1.fa from ftp://ftp.solgenomics.net/user_requests/aubombarely/courses/RNAseqCo rpoica/adapters1.fa • Run the read processing program over each of the datasets using a min. qscore of 30 and a min. length of 40 bp. fastq-mcf -q 30 -l 40 -o SRR404331_ch4.q30l40.fastq /home/bioinfo/Downloads/adapters1.fa SRR404331_ch4.fastq Remember to specify the right path for the adapter1.fa and the input file Preprocessing 7/8/2014 BTI PGRP Summer Internship Program 2014 10
  • 11. Exercise 2: •Type ‘fastqc’ to start the FastQC program. Load the four sequence files in the program. Compare the results with the previous datasets. Before fastq-mcf processing 880,982 reads After fastq-mcf processing 840,759 reads SRR404336 Preprocessing 7/8/2014 BTI PGRP Summer Internship Program 2014 11
  • 12. Need Help?? 7/8/2014 BTI PGRP Summer Internship Program 2014 12