
1/26/2014

NGS
Data Formats & QC Analysis

Karan Veer Singh
Scientist, NBAGR

Sequence Formats
• Most sequence formats are ASCII text containing the sequence ID, quality scores, annotation details, comments, and other descriptions of the sequence
• Formats are designed to hold sequence data and other information about the sequence

Why so many formats?
• Created based on the information required for each step of analysis
• Efficient data and time management

Types of sequence file formats
• Raw sequence files
• Co-ordinate files
• Parameter files
• Annotation files
• Metadata files

Each data format varies in the information it contains
Read output formats
• 454
• Solexa/Illumina
• SOLiD
454 output formats
• .sff (Standard Flowgram Format)
• .fna (FASTA nucleotide sequences)
• .qual (quality scores)
Illumina output formats
• .seq.txt and .prb.txt
• Illumina FASTQ (ASCII value - 64 is the Illumina score)
• Qseq (ASCII value - 64 is the Phred quality score)
• SCARF (Solexa Compact ASCII Read Format): Illumina single-line format
Illumina FastQ
• ASCII value for h = 104
• Quality of base A at position 1 = 104 - 64 = 40
• where 40 is the Phred score
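The offset arithmetic on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original deck: the function names `decode_phred` and `error_probability` are hypothetical, the offset of 64 applies to the older Illumina FASTQ encoding described here, and 33 is the offset of the standard (Sanger) FASTQ encoding. Note that Python's `ord("h")` returns 104.

```python
from typing import List


def decode_phred(quality_string: str, offset: int = 64) -> List[int]:
    """Convert an ASCII-encoded quality string to Phred scores.

    offset=64 matches the older Illumina FASTQ encoding,
    offset=33 matches the standard (Sanger) FASTQ encoding.
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probability(phred: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-phred / 10)


# The character 'h' under the offset-64 encoding.
print(decode_phred("h"))
# The character 'I' under the offset-33 encoding.
print(decode_phred("I", offset=33))
# Probability that a base with Phred score 20 is wrong.
print(error_probability(20))
```

Using the wrong offset shifts every score by 31, so it is worth confirming which encoding a file uses before filtering on quality.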

SOLiD output format(s)

CSFASTA
• Color-space sequence reads in a FASTA format
• These reads can be retained and analyzed in color space by software
• The Format Conversion Tool offers options for cleaning the CSFASTA files
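To make the color-space idea concrete, here is a small hedged sketch of how a CSFASTA read could be translated to bases. It assumes the usual SOLiD two-base encoding, in which each digit 0-3 encodes the transition between two adjacent bases and the read starts with a known primer base. The names `TRANSITIONS` and `decode_colorspace` are illustrative only and do not come from the deck or from any SOLiD tool.

```python
# For each current base, the string gives the next base for colors 0, 1, 2, 3.
# Color 0 keeps the base; 1 swaps A<->C and G<->T;
# 2 swaps A<->G and C<->T; 3 swaps A<->T and C<->G.
TRANSITIONS = {
    "A": "ACGT",
    "C": "CATG",
    "G": "GTAC",
    "T": "TGCA",
}


def decode_colorspace(read: str) -> str:
    """Translate a color-space read (primer base + digits) into bases.

    The leading primer base is used only as the starting point and is
    not part of the returned sequence. A missing color call ('.')
    raises ValueError, because every later base would be unknown.
    """
    current = read[0].upper()
    bases = []
    for color in read[1:]:
        current = TRANSITIONS[current][int(color)]
        bases.append(current)
    return "".join(bases)


print(decode_colorspace("T0123"))
```

Because each base depends on the previous one, a single wrong color call changes every base after it, which is one reason to keep such reads in color space during analysis.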
Read Length
• Sanger read lengths: ~800-2000 bp
• Generally we define short reads as anything below 200 bp
  - Illumina (100-250 bp)
  - SOLiD (75 bp max)
  - Ion Torrent (200-300 bp max, currently...)
  - Roche 454 (400-800 bp)
• Even with these platforms it is cheaper to produce short reads (e.g. 50 bp) than 100 or 200 bp reads
• Diminishing returns:
  - For some applications 50 bp is more than sufficient
  - Resequencing of smaller organisms
  - Bacterial de novo assembly
  - ChIP-Seq
  - Digital gene expression profiling
  - Bacterial RNA-seq

Common ("standard") formats for read alignments: Alignment/Assembly Format
• SAM
• BAM (= binary SAM)
• MAQ

Sequencers & Sequence Assembly Packages

Formats for Genome/Gene annotation
• BED format (genome-browser tracks)
• GFF format (gene/genome features)
• BioXSD (XML) (any annotation; under development)

If reads are to be deposited in a public repository:
• SRA (Sequence Read Archive, formerly Short Read Archive) at NCBI

Points to remember on Data Formats
• For base-call data: "standard" FASTQ (Sanger, Phred)
• For read alignments: SAM/BAM/MAQ format
• For annotation results: GFF or BED format

QC analysis

All platforms have errors
Platforms: Illumina, SOLiD/ABI-Life, Roche 454, Ion Torrent

1. Removal of low-quality bases / low-complexity regions
2. Removal of adaptor sequences
3. Homopolymer-associated base-call errors (3 or more identical DNA bases) cause a higher number of (artificial) frameshifts
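As a rough illustration of the homopolymer definition used here (3 or more identical bases in a row), the sketch below locates such runs in a read with a regular expression. The function name `find_homopolymers` and the example sequence are made up for this illustration. It only shows where error-prone stretches lie and does not correct any base calls.

```python
import re
from typing import List, Tuple

# A base followed by at least two more copies of itself: a run of 3 or more.
HOMOPOLYMER = re.compile(r"([ACGT])\1{2,}")


def find_homopolymers(sequence: str) -> List[Tuple[int, str, int]]:
    """Return (0-based start, base, run length) for every homopolymer run."""
    return [
        (match.start(), match.group(1), len(match.group(0)))
        for match in HOMOPOLYMER.finditer(sequence.upper())
    ]


for start, base, length in find_homopolymers("ACGTTTTAGGGC"):
    print(f"run of {length} x {base} starting at position {start}")
```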
Illumina artefacts
• Under-represented GC-rich regions
  - PCR
  - Sequencing
• GGC/GCC motif is associated with low quality and mismatches
• Low-quality reads: Phred score < 20
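The Phred 20 cut-off mentioned for low-quality reads could be applied along the following lines. This is a simplified sketch under stated assumptions: quality strings use the standard offset of 33 (pass `offset=64` for older Illumina files), trimming simply removes trailing bases below the threshold, and a read is kept when its mean score reaches the threshold. The helper names are hypothetical, and tools such as NGS QC Toolkit may use different criteria.

```python
from typing import List, Tuple


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string into Phred scores."""
    return [ord(ch) - offset for ch in quality]


def mean_quality(quality: str, offset: int = 33) -> float:
    """Average Phred score of a read (0.0 for an empty read)."""
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores) if scores else 0.0


def trim_3prime(sequence: str, quality: str,
                threshold: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Remove trailing bases whose Phred score is below the threshold."""
    end = len(quality)
    while end > 0 and ord(quality[end - 1]) - offset < threshold:
        end -= 1
    return sequence[:end], quality[:end]


def keep_read(quality: str, threshold: int = 20, offset: int = 33) -> bool:
    """Keep a read only if its mean Phred score reaches the threshold."""
    return mean_quality(quality, offset) >= threshold


# 'I' encodes Phred 40 and '#' encodes Phred 2 at offset 33.
print(trim_3prime("ACGTAC", "IIII##"))
print(keep_read("IIII##"), keep_read("II####"))
```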

Need for QC & Preprocessing
QC analysis of sequence data is extremely important for meaningful downstream analysis
• To analyze problems in quality scores/statistics of sequencing data
• To check whether further analysis with the sequence is possible
• To remove redundancy (filtering)
• To remove low-quality reads from analysis
• To remove adapter contamination

Highly efficient and fast processing tools are required to handle large volumes of data

Need for QC & Preprocessing
• The quality of data is very important for various downstream analyses, such as sequence assembly and single nucleotide polymorphism identification
• Most of the programs available for downstream analyses do not provide utilities for quality checking and filtering of NGS data before processing

NGS QC Toolkit & FastQC
• NGS QC Toolkit is for quality checking and filtering of high-quality reads
• The toolkit is a standalone, open-source application freely available at http://www.nipgr.res.in/ngsqctoolkit.html
• The application is implemented in the Perl programming language
• QC of sequencing data generated using Roche 454 and Illumina platforms
• Additional tools to aid QC (sequence format converter and trimming tools) and analysis (statistics tools)

FastQC can be used only for preliminary analysis
NGSQC toolkit Output

Comparison - QC tools

FastQC
• Basic statistics
• Quality: per base position
• Per sequence quality distribution
• Nucleotide content per position
• Per sequence GC distribution
• Per base GC distribution
• Per base N content
• Length distribution
• Overrepresented/duplicated sequences
• K-mer content
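Two of the measures in this list, per-sequence GC content and per-base N content, are simple enough to sketch directly. The code below is an illustrative approximation and not FastQC's implementation: the function names are hypothetical, GC content is taken as the percentage of G and C over the full read length, and N content is computed per position across all reads that reach that position.

```python
from typing import List


def gc_percent(sequence: str) -> float:
    """Percentage of G and C bases in one read (0.0 for an empty read)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def per_base_n_percent(reads: List[str]) -> List[float]:
    """Percentage of N calls at each position across a set of reads."""
    longest = max((len(read) for read in reads), default=0)
    result = []
    for position in range(longest):
        # Only reads long enough to have a base at this position count.
        column = [read[position] for read in reads if len(read) > position]
        n_calls = sum(1 for base in column if base in "Nn")
        result.append(100.0 * n_calls / len(column))
    return result


reads = ["ACGN", "NCGT"]
print([gc_percent(read) for read in reads])
print(per_base_n_percent(reads))
```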

FastQC (Box-Whisker plot)

Y axis- Quality Score
X axis- Base position


2. Quality: per base position

3. Per sequence quality distribution

4. Nucleotide content per position

5. Per sequence GC distribution

6. Per base GC distribution

7. Per base N content

8. Length distribution

9. K-mer content

Any k-mer showing more than a 3-fold overall enrichment or a 5-fold enrichment at any given base position will be reported by this module.

10. Overrepresented/duplicate sequences

The analysis of overrepresented sequences will spot an increase in any exactly duplicated sequences.
Too many duplicated sequences may indicate sequencing problems.
This module will issue a warning if any sequence is found to represent more than 0.1% of the total.
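The 0.1% rule can be sketched as an exact count over all reads. This is an assumption-laden illustration rather than FastQC's own procedure, which may sample or group reads differently. The function name `overrepresented` and the toy read set are invented for the example.

```python
from collections import Counter
from typing import Dict, List


def overrepresented(reads: List[str],
                    threshold_percent: float = 0.1) -> Dict[str, float]:
    """Return each exact sequence whose share of all reads exceeds the threshold.

    The values are percentages of the total number of reads.
    """
    total = len(reads)
    if total == 0:
        return {}
    counts = Counter(reads)
    return {
        sequence: 100.0 * count / total
        for sequence, count in counts.items()
        if 100.0 * count / total > threshold_percent
    }


# Toy data: 1000 reads, one sequence seen twice (0.2%), the rest unique (0.1% each).
toy_reads = ["GATC"] * 2 + [format(i, "010b") for i in range(998)]
print(overrepresented(toy_reads))
```

Sequences flagged this way are often worth comparing against known adapter or primer sequences before deciding whether to remove them.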

QC Report

Sequence Statistics
Total No. of Sequences      6970943
Avg. Sequence Length        54
Max Sequence Length         54
Min Sequence Length         54
Total Sequence Length       376430922
Total N bases               14254521
% N bases                   3.78676
No. of Sequences with Ns    278635
% Sequences with Ns         3.99709

Quality Statistics
Total HQ bases              334195496
% HQ bases                  88.78
Total HQ reads              6350256
% HQ reads                  91.0961

Alignment statistics
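The percentages in the sequence and quality statistics of the QC report follow directly from the raw counts, and the short sketch below recomputes them as a sanity check. The dictionary keys and the `percent` helper are names chosen for this illustration. The counts themselves are copied from the report.

```python
def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole`, expressed as a percentage."""
    return 100.0 * part / whole


# Raw counts as listed in the QC report.
report = {
    "total_sequences": 6970943,
    "sequence_length": 54,
    "total_n_bases": 14254521,
    "sequences_with_n": 278635,
    "total_hq_bases": 334195496,
    "total_hq_reads": 6350256,
}

# All reads have the same length, so total bases = reads x length.
total_bases = report["total_sequences"] * report["sequence_length"]

derived = {
    "total_bases": total_bases,
    "pct_n_bases": percent(report["total_n_bases"], total_bases),
    "pct_sequences_with_n": percent(report["sequences_with_n"],
                                    report["total_sequences"]),
    "pct_hq_bases": percent(report["total_hq_bases"], total_bases),
    "pct_hq_reads": percent(report["total_hq_reads"],
                            report["total_sequences"]),
}

for name, value in derived.items():
    print(name, round(value, 5))
```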

Weitere Àhnliche Inhalte

Was ist angesagt?

Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...VHIR Vall d’Hebron Institut de Recerca
 
Next Generation Sequencing (NGS)
Next Generation Sequencing (NGS)Next Generation Sequencing (NGS)
Next Generation Sequencing (NGS)LOGESWARAN KA
 
Next generation sequencing methods (final edit)
Next generation sequencing methods (final edit)Next generation sequencing methods (final edit)
Next generation sequencing methods (final edit)Mrinal Vashisth
 
An introduction to RNA-seq data analysis
An introduction to RNA-seq data analysisAn introduction to RNA-seq data analysis
An introduction to RNA-seq data analysisAGRF_Ltd
 
Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)Sebastian Schmeier
 
Next Gen Sequencing (NGS) Technology Overview
Next Gen Sequencing (NGS) Technology OverviewNext Gen Sequencing (NGS) Technology Overview
Next Gen Sequencing (NGS) Technology OverviewDominic Suciu
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analysesrjorton
 
Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...
Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...
Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...VHIR Vall d’Hebron Institut de Recerca
 
RNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingRNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingmikaelhuss
 
Next generation sequencing
Next  generation  sequencingNext  generation  sequencing
Next generation sequencingNidhi Singh
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomicsprateek kumar
 
Transcriptomics approaches
Transcriptomics approachesTranscriptomics approaches
Transcriptomics approachesCharupriyaChauhan1
 
Genomics(functional genomics)
Genomics(functional genomics)Genomics(functional genomics)
Genomics(functional genomics)IndrajaDoradla
 
Next Generation Sequencing
Next Generation SequencingNext Generation Sequencing
Next Generation SequencingShelomi Karoon
 
Bioinformatics tools for NGS data analysis
Bioinformatics tools for NGS data analysisBioinformatics tools for NGS data analysis
Bioinformatics tools for NGS data analysisDespoina Kalfakakou
 
Next generation sequencing
Next generation sequencingNext generation sequencing
Next generation sequencingUzma Jabeen
 

Was ist angesagt? (20)

Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
 
Next Generation Sequencing (NGS)
Next Generation Sequencing (NGS)Next Generation Sequencing (NGS)
Next Generation Sequencing (NGS)
 
NGS File formats
NGS File formatsNGS File formats
NGS File formats
 
Next generation sequencing methods (final edit)
Next generation sequencing methods (final edit)Next generation sequencing methods (final edit)
Next generation sequencing methods (final edit)
 
An introduction to RNA-seq data analysis
An introduction to RNA-seq data analysisAn introduction to RNA-seq data analysis
An introduction to RNA-seq data analysis
 
NGS: Mapping and de novo assembly
NGS: Mapping and de novo assemblyNGS: Mapping and de novo assembly
NGS: Mapping and de novo assembly
 
Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)
 
Next Gen Sequencing (NGS) Technology Overview
Next Gen Sequencing (NGS) Technology OverviewNext Gen Sequencing (NGS) Technology Overview
Next Gen Sequencing (NGS) Technology Overview
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analyses
 
Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...
Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...
Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...
 
Genome Assembly
Genome AssemblyGenome Assembly
Genome Assembly
 
RNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingRNA-seq quality control and pre-processing
RNA-seq quality control and pre-processing
 
Next generation sequencing
Next  generation  sequencingNext  generation  sequencing
Next generation sequencing
 
Biological networks - building and visualizing
Biological networks - building and visualizingBiological networks - building and visualizing
Biological networks - building and visualizing
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 
Transcriptomics approaches
Transcriptomics approachesTranscriptomics approaches
Transcriptomics approaches
 
Genomics(functional genomics)
Genomics(functional genomics)Genomics(functional genomics)
Genomics(functional genomics)
 
Next Generation Sequencing
Next Generation SequencingNext Generation Sequencing
Next Generation Sequencing
 
Bioinformatics tools for NGS data analysis
Bioinformatics tools for NGS data analysisBioinformatics tools for NGS data analysis
Bioinformatics tools for NGS data analysis
 
Next generation sequencing
Next generation sequencingNext generation sequencing
Next generation sequencing
 

Ähnlich wie NGS Data Formats & QC Analysis Summary

GRM 2011: ISMU pipeline for NGS data analysis and facilitating molecular bree...
GRM 2011: ISMU pipeline for NGS data analysis and facilitating molecular bree...GRM 2011: ISMU pipeline for NGS data analysis and facilitating molecular bree...
GRM 2011: ISMU pipeline for NGS data analysis and facilitating molecular bree...CGIAR Generation Challenge Programme
 
GRM 2013: Genome-Wide Selection Update -- RK Varshney and A Rathore
GRM 2013: Genome-Wide Selection Update -- RK Varshney and A RathoreGRM 2013: Genome-Wide Selection Update -- RK Varshney and A Rathore
GRM 2013: Genome-Wide Selection Update -- RK Varshney and A RathoreCGIAR Generation Challenge Programme
 
Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Maté Ongenaert
 
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016Prof. Wim Van Criekinge
 
Gwas.emes.comp
Gwas.emes.compGwas.emes.comp
Gwas.emes.compRichard Emes
 
Rahul_Ramani_Profile
Rahul_Ramani_ProfileRahul_Ramani_Profile
Rahul_Ramani_ProfileRahul Ramani
 
Demo how to efficiently evaluate nf-vi performance by leveraging opnfv testi...
Demo  how to efficiently evaluate nf-vi performance by leveraging opnfv testi...Demo  how to efficiently evaluate nf-vi performance by leveraging opnfv testi...
Demo how to efficiently evaluate nf-vi performance by leveraging opnfv testi...OPNFV
 
28 h 264-avc_by_dhchang
28   h 264-avc_by_dhchang28   h 264-avc_by_dhchang
28 h 264-avc_by_dhchangBadri Patro
 
2015. abhishek rathore. ismu 2.0 a multi algorithm pipeline for genomic selec...
2015. abhishek rathore. ismu 2.0 a multi algorithm pipeline for genomic selec...2015. abhishek rathore. ismu 2.0 a multi algorithm pipeline for genomic selec...
2015. abhishek rathore. ismu 2.0 a multi algorithm pipeline for genomic selec...FOODCROPS
 
Open64 compiler
Open64 compilerOpen64 compiler
Open64 compilerMaria Akther
 
High Performance Flow Matching Architecture for Openflow Data Plane
High Performance Flow Matching Architecture for Openflow Data PlaneHigh Performance Flow Matching Architecture for Openflow Data Plane
High Performance Flow Matching Architecture for Openflow Data PlaneMahesh Dananjaya
 
XSEDE15_PhastaGateway
XSEDE15_PhastaGatewayXSEDE15_PhastaGateway
XSEDE15_PhastaGatewayRaminder Singh
 
How good is your SPARQL endpoint? A QoS-Aware SPARQL Endpoint Monitoring and...
How good is your SPARQL endpoint?  A QoS-Aware SPARQL Endpoint Monitoring and...How good is your SPARQL endpoint?  A QoS-Aware SPARQL Endpoint Monitoring and...
How good is your SPARQL endpoint? A QoS-Aware SPARQL Endpoint Monitoring and...Ali Intizar
 
Improving Code Review Effectiveness Through Reviewer Recommendations
Improving Code Review Effectiveness Through Reviewer RecommendationsImproving Code Review Effectiveness Through Reviewer Recommendations
Improving Code Review Effectiveness Through Reviewer RecommendationsThe University of Adelaide
 
Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1Thomas Keane
 
picard_poster_12_16_15
picard_poster_12_16_15picard_poster_12_16_15
picard_poster_12_16_15David E. Kling
 

Ähnlich wie NGS Data Formats & QC Analysis Summary (20)

GRM 2011: ISMU pipeline for NGS data analysis and facilitating molecular bree...
GRM 2011: ISMU pipeline for NGS data analysis and facilitating molecular bree...GRM 2011: ISMU pipeline for NGS data analysis and facilitating molecular bree...
GRM 2011: ISMU pipeline for NGS data analysis and facilitating molecular bree...
 
GRM 2013: Genome-Wide Selection Update -- RK Varshney and A Rathore
GRM 2013: Genome-Wide Selection Update -- RK Varshney and A RathoreGRM 2013: Genome-Wide Selection Update -- RK Varshney and A Rathore
GRM 2013: Genome-Wide Selection Update -- RK Varshney and A Rathore
 
Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Workshop NGS data analysis - 2
Workshop NGS data analysis - 2
 
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
 
Gwas.emes.comp
Gwas.emes.compGwas.emes.comp
Gwas.emes.comp
 
NFV Testing
NFV TestingNFV Testing
NFV Testing
 
Đ Đ”ŃˆĐ”ĐœĐžŃ WANDL Đž NorthStar ĐŽĐ»Ń ĐŸĐżĐ”Ń€Đ°Ń‚ĐŸŃ€ĐŸĐČ
Đ Đ”ŃˆĐ”ĐœĐžŃ WANDL Đž NorthStar ĐŽĐ»Ń ĐŸĐżĐ”Ń€Đ°Ń‚ĐŸŃ€ĐŸĐČĐ Đ”ŃˆĐ”ĐœĐžŃ WANDL Đž NorthStar ĐŽĐ»Ń ĐŸĐżĐ”Ń€Đ°Ń‚ĐŸŃ€ĐŸĐČ
Đ Đ”ŃˆĐ”ĐœĐžŃ WANDL Đž NorthStar ĐŽĐ»Ń ĐŸĐżĐ”Ń€Đ°Ń‚ĐŸŃ€ĐŸĐČ
 
Spectra OE Webcast July 2010
Spectra OE Webcast July 2010Spectra OE Webcast July 2010
Spectra OE Webcast July 2010
 
Rahul_Ramani_Profile
Rahul_Ramani_ProfileRahul_Ramani_Profile
Rahul_Ramani_Profile
 
Demo how to efficiently evaluate nf-vi performance by leveraging opnfv testi...
Demo  how to efficiently evaluate nf-vi performance by leveraging opnfv testi...Demo  how to efficiently evaluate nf-vi performance by leveraging opnfv testi...
Demo how to efficiently evaluate nf-vi performance by leveraging opnfv testi...
 
28 h 264-avc_by_dhchang
28   h 264-avc_by_dhchang28   h 264-avc_by_dhchang
28 h 264-avc_by_dhchang
 
2015. abhishek rathore. ismu 2.0 a multi algorithm pipeline for genomic selec...
2015. abhishek rathore. ismu 2.0 a multi algorithm pipeline for genomic selec...2015. abhishek rathore. ismu 2.0 a multi algorithm pipeline for genomic selec...
2015. abhishek rathore. ismu 2.0 a multi algorithm pipeline for genomic selec...
 
Open64 compiler
Open64 compilerOpen64 compiler
Open64 compiler
 
High Performance Flow Matching Architecture for Openflow Data Plane
High Performance Flow Matching Architecture for Openflow Data PlaneHigh Performance Flow Matching Architecture for Openflow Data Plane
High Performance Flow Matching Architecture for Openflow Data Plane
 
XSEDE15_PhastaGateway
XSEDE15_PhastaGatewayXSEDE15_PhastaGateway
XSEDE15_PhastaGateway
 
How good is your SPARQL endpoint? A QoS-Aware SPARQL Endpoint Monitoring and...
How good is your SPARQL endpoint?  A QoS-Aware SPARQL Endpoint Monitoring and...How good is your SPARQL endpoint?  A QoS-Aware SPARQL Endpoint Monitoring and...
How good is your SPARQL endpoint? A QoS-Aware SPARQL Endpoint Monitoring and...
 
Improving Code Review Effectiveness Through Reviewer Recommendations
Improving Code Review Effectiveness Through Reviewer RecommendationsImproving Code Review Effectiveness Through Reviewer Recommendations
Improving Code Review Effectiveness Through Reviewer Recommendations
 
Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1
 
BioWeka
BioWekaBioWeka
BioWeka
 
picard_poster_12_16_15
picard_poster_12_16_15picard_poster_12_16_15
picard_poster_12_16_15
 

Mehr von Karan Veer Singh

Yak genetic resources of india
Yak genetic resources of indiaYak genetic resources of india
Yak genetic resources of indiaKaran Veer Singh
 
Microsatellites Markers
Microsatellites  MarkersMicrosatellites  Markers
Microsatellites MarkersKaran Veer Singh
 
Tick identification guide
Tick identification guideTick identification guide
Tick identification guideKaran Veer Singh
 
Social groups for awareness
Social groups for awarenessSocial groups for awareness
Social groups for awarenessKaran Veer Singh
 
Access and Benefit sharing from Genetic Resources
Access and Benefit sharing from Genetic ResourcesAccess and Benefit sharing from Genetic Resources
Access and Benefit sharing from Genetic ResourcesKaran Veer Singh
 
Indian acts governing different IPRs
Indian acts governing different IPRsIndian acts governing different IPRs
Indian acts governing different IPRsKaran Veer Singh
 
Ip protected invention in the field of biotechnology
Ip protected invention in the field of biotechnologyIp protected invention in the field of biotechnology
Ip protected invention in the field of biotechnologyKaran Veer Singh
 
Patent In Molecular Biology
Patent In Molecular BiologyPatent In Molecular Biology
Patent In Molecular BiologyKaran Veer Singh
 
Genome annotation 2013
Genome annotation 2013Genome annotation 2013
Genome annotation 2013Karan Veer Singh
 
MICROSATELITE Markers for LIVESTOCK Genetic DIVERSITY ANALYSES
MICROSATELITE Markers for LIVESTOCK Genetic DIVERSITY ANALYSESMICROSATELITE Markers for LIVESTOCK Genetic DIVERSITY ANALYSES
MICROSATELITE Markers for LIVESTOCK Genetic DIVERSITY ANALYSESKaran Veer Singh
 
Semen Banking for conservation of livestock biodiversity
Semen Banking for conservation of  livestock biodiversitySemen Banking for conservation of  livestock biodiversity
Semen Banking for conservation of livestock biodiversityKaran Veer Singh
 
DiGE....2-D gel electrophoresis
DiGE....2-D gel electrophoresisDiGE....2-D gel electrophoresis
DiGE....2-D gel electrophoresisKaran Veer Singh
 

Mehr von Karan Veer Singh (20)

Pcr primer design
Pcr primer designPcr primer design
Pcr primer design
 
Yak genetic resources of india
Yak genetic resources of indiaYak genetic resources of india
Yak genetic resources of india
 
DNA Barcoding
DNA BarcodingDNA Barcoding
DNA Barcoding
 
Microsatellites Markers
Microsatellites  MarkersMicrosatellites  Markers
Microsatellites Markers
 
Tick identification guide
Tick identification guideTick identification guide
Tick identification guide
 
Social groups for awareness
Social groups for awarenessSocial groups for awareness
Social groups for awareness
 
Access and Benefit sharing from Genetic Resources
Access and Benefit sharing from Genetic ResourcesAccess and Benefit sharing from Genetic Resources
Access and Benefit sharing from Genetic Resources
 
IPR
IPRIPR
IPR
 
Indian acts governing different IPRs
Indian acts governing different IPRsIndian acts governing different IPRs
Indian acts governing different IPRs
 
Ip protected invention in the field of biotechnology
Ip protected invention in the field of biotechnologyIp protected invention in the field of biotechnology
Ip protected invention in the field of biotechnology
 
Patent In Molecular Biology
Patent In Molecular BiologyPatent In Molecular Biology
Patent In Molecular Biology
 
Genome annotation 2013
Genome annotation 2013Genome annotation 2013
Genome annotation 2013
 
MICROSATELITE Markers for LIVESTOCK Genetic DIVERSITY ANALYSES
MICROSATELITE Markers for LIVESTOCK Genetic DIVERSITY ANALYSESMICROSATELITE Markers for LIVESTOCK Genetic DIVERSITY ANALYSES
MICROSATELITE Markers for LIVESTOCK Genetic DIVERSITY ANALYSES
 
Rna seq pipeline
Rna seq pipelineRna seq pipeline
Rna seq pipeline
 
Semen Banking for conservation of livestock biodiversity
Semen Banking for conservation of  livestock biodiversitySemen Banking for conservation of  livestock biodiversity
Semen Banking for conservation of livestock biodiversity
 
DiGE....2-D gel electrophoresis
DiGE....2-D gel electrophoresisDiGE....2-D gel electrophoresis
DiGE....2-D gel electrophoresis
 
Tecto3
Tecto3Tecto3
Tecto3
 
Paradigm
ParadigmParadigm
Paradigm
 
Electrophoresis
ElectrophoresisElectrophoresis
Electrophoresis
 
Electrophoresis
ElectrophoresisElectrophoresis
Electrophoresis
 

KĂŒrzlich hochgeladen

Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...Pooja Nehwal
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 

KĂŒrzlich hochgeladen (20)

Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact

NGS Data Formats & QC Analysis Summary

  • 1. NGS Data Formats & QC Analysis. Karan Veer Singh, Scientist, NBAGR. 1/26/2014
  • 2. Sequence Formats: Most sequence formats are ASCII text containing the sequence ID, quality scores, annotation details, comments, and other descriptions of the sequence (binary formats such as .sff and BAM are the exceptions). Formats are designed to hold sequence data and other information about the sequence.
  • 3. Why so many formats? Formats are created based on the information required for each step of analysis, and for efficient data and time management. Types of sequence file formats: raw sequence files; co-ordinate files; parameter files; annotation files; metadata files. Data formats vary in the information they contain.
  • 4. Read output formats: 454; Solexa/Illumina; SOLiD.
  • 5. 454 output formats: Standard flowgram format (.sff); .fna; .qual
  • 6. Illumina output formats: .seq.txt and .prb.txt; Illumina FASTQ (ASCII value - 64 is the Illumina score); Qseq (ASCII value - 64 is the Phred score); Phred quality scores; Illumina single-line format; SCARF (Solexa Compact ASCII Read Format).
  • 7. Illumina FASTQ: The ASCII value of the quality character h is 104 (103 is the character g). Quality of base A at position 1 = 104 - 64 = 40, where 40 is the Phred score.
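
The decoding rule on this slide (ASCII value minus 64) can be sketched in a few lines of Python. This is a minimal illustration, not part of the original deck; the function names are hypothetical. Note that in ASCII the character h has the value 104 and g has the value 103, so with the offset of 64 they decode to 40 and 39 respectively. The `offset` parameter is an assumption added so the same sketch also covers Sanger-style FASTQ, which uses an offset of 33.

```python
def phred_from_char(ch: str, offset: int = 64) -> int:
    """Decode one FASTQ quality character: score = ASCII value - offset."""
    return ord(ch) - offset


def decode_quality(qual: str, offset: int = 64) -> list[int]:
    """Decode a whole quality string into a list of integer scores."""
    return [phred_from_char(c, offset) for c in qual]


print(ord("h"), phred_from_char("h"))  # 104 40
print(ord("g"), phred_from_char("g"))  # 103 39
```

For example, `decode_quality("hgB")` yields one score per base of the read, in read order.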
  • 8. SOLiD output format(s): CSFASTA, color-space sequence reads in a FASTA format. These reads can be retained and analyzed in color-space by software. The Format Conversion Tool offers options for cleaning the CSFASTA files.
  • 9. Read Length: Sanger read lengths ~800-2000 bp. Generally we define short reads as anything below 200 bp: Illumina (100-250 bp); SOLiD (75 bp max); Ion Torrent (200-300 bp max, currently); Roche 454 (400-800 bp). Even with these platforms it is cheaper to produce short reads (e.g. 50 bp) than 100 or 200 bp reads. Diminishing returns: for some applications 50 bp is more than sufficient: resequencing of smaller organisms; bacterial de novo assembly; ChIP-Seq; digital gene expression profiling; bacterial RNA-seq.
  • 10. Alignment/Assembly Format. Common ("standard") formats for read alignments: SAM; BAM (= binary SAM); MAQ.
  • 11. Sequencers & Sequence Assembly Packages
  • 12. Formats for Genome/Gene annotation: BED format (genome-browser tracks); GFF format (gene/genome features); BioXSD (XML; any annotation; under development).
  • 13. If reads should be deposited in a public repository: SRA (Sequence Read Archive, formerly Short Read Archive) at NCBI.
  • 14. Points to remember on Data Formats: for base-call data, "standard" FASTQ (Sanger, Phred); for read alignments, SAM/BAM/MAQ format; for annotation results, e.g. GFF or BED format.
  • 16. All platforms have errors (Illumina; SOLiD/ABI-Life; Roche 454; Ion Torrent): 1. Removal of low-quality bases / low-complexity regions. 2. Removal of adaptor sequences. 3. Homopolymer-associated base-call errors (3 or more identical DNA bases) cause a higher number of (artificial) frameshifts.
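
The slide defines a homopolymer as 3 or more identical DNA bases. As a rough sketch of how such runs could be located in a read, the following uses a regular expression with a back-reference. The function name and the returned tuple layout are assumptions made for illustration; this is not how any particular platform's base caller handles homopolymers.

```python
import re


def find_homopolymers(seq: str, min_len: int = 3) -> list[tuple[int, str, int]]:
    """Return (start, base, length) for every run of >= min_len identical bases."""
    # ([ACGTN]) captures one base; \1{k,} requires at least k more copies of it.
    pattern = re.compile(r"([ACGTN])\1{%d,}" % (min_len - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)))
        for m in pattern.finditer(seq.upper())
    ]


print(find_homopolymers("ACGGGTTAAAAC"))  # [(2, 'G', 3), (7, 'A', 4)]
```

The run TT in the example is ignored because it is shorter than the 3-base threshold.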
  • 17. Illumina artefacts: under-represented GC-rich regions (PCR, sequencing); the GGC/GCC motif is associated with low quality and mismatches; low-quality reads (Phred score < 20).
  • 18. Need for QC & Preprocessing: QC analysis of sequence data is extremely important for meaningful downstream analysis: to analyze problems in the quality scores and statistics of sequencing data; to check whether further analysis of the sequence is possible; to remove redundancy (filtering); to remove low-quality reads from analysis; to remove adapter contamination. Highly efficient and fast processing tools are required to handle large volumes of data.
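
Two of the preprocessing steps listed here, removing adapter contamination and removing low-quality reads, can be sketched as follows. This is a simplified illustration under stated assumptions: the adapter is matched exactly (real tools tolerate mismatches and partial adapters), the quality offset is 64 as in the Illumina FASTQ slide, and the thresholds (score of at least 20 on at least 70% of bases) are example values chosen here, not values given in the deck. The function names and the adapter string in the usage line are hypothetical.

```python
def trim_adapter(seq: str, qual: str, adapter: str) -> tuple[str, str]:
    """Cut the read at the first exact occurrence of the adapter."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual  # adapter not present, read unchanged
    return seq[:pos], qual[:pos]


def is_high_quality(qual: str, min_score: int = 20,
                    min_fraction: float = 0.7, offset: int = 64) -> bool:
    """True if at least min_fraction of the bases score >= min_score."""
    if not qual:
        return False  # an empty read cannot pass the filter
    good = sum(1 for c in qual if ord(c) - offset >= min_score)
    return good / len(qual) >= min_fraction


seq, qual = trim_adapter("ACGTACGTAGATCGGAAGAGC", "h" * 21, "AGATCGGAAGAGC")
print(seq, is_high_quality(qual))  # ACGTACGT True
```

Trimming before filtering means the quality decision is made on the bases that will actually be kept.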
  • 19. Need for QC & Preprocessing: The quality of data is very important for various downstream analyses, such as sequence assembly and single nucleotide polymorphism identification. Most of the programs available for downstream analyses do not provide utilities for quality checking and filtering of NGS data before processing.
  • 20. NGS QC Toolkit & FastQC: NGS QC Toolkit is for quality checking and filtering of high-quality reads. The toolkit is a standalone, open-source application freely available at http://www.nipgr.res.in/ngsqctoolkit.html. The application is implemented in the Perl programming language. It performs QC of sequencing data generated using the Roche 454 and Illumina platforms. Additional tools aid QC (sequence format converter and trimming tools) and analysis (statistics tools). FastQC can be used only for preliminary analysis.
  • 25. Comparison - QC tools
  • 26. FastQC: basic statistics; quality per base position; per-sequence quality distribution; nucleotide content per position; per-sequence GC distribution; per-base GC distribution; per-base N content; length distribution; overrepresented/duplicated sequences; k-mer content.
  • 27. FastQC (box-whisker plot). Y axis: quality score. X axis: base position.
  • 28. 2. Quality - Per base position
  • 29. 2. Quality - Per base position
  • 31. 3. Per Sequence Quality Distribution
  • 36. 6. Per base GC distribution
  • 37. 6. Per base GC distribution
  • 38. 7. Per base N content
  • 40. 8. K-mer content: Any k-mer showing more than a 3-fold overall enrichment or a 5-fold enrichment at any given base position will be reported by this module.
  • 41. 9. Overrepresented/duplicate sequences: The analysis of overrepresented sequences will spot an increase in any exactly duplicated sequences. Too many duplicate regions in the sequence may point to sequencing problems. This module will issue a warning if any sequence is found to represent more than 0.1% of the total.
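
The 0.1% rule on this slide can be illustrated with a simple counter over exact read sequences. This is a sketch of the idea only, not FastQC's implementation: it holds every read in memory, which will not scale to large files, and the function name is hypothetical.

```python
from collections import Counter


def overrepresented(reads, threshold: float = 0.001) -> dict[str, float]:
    """Map each exact sequence making up more than `threshold` of all reads
    to its fraction of the total (0.001 corresponds to 0.1%)."""
    reads = list(reads)
    total = len(reads)
    if total == 0:
        return {}
    counts = Counter(reads)
    return {seq: n / total for seq, n in counts.items() if n / total > threshold}


# 30 copies of one read among 10,000 reads is 0.3% of the total, so it is flagged.
reads = ["ACGT"] * 30 + [f"READ{i}" for i in range(9970)]
print(list(overrepresented(reads)))  # ['ACGT']
```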
  • 42. QC Report. Sequence statistics: total no. of sequences 6970943; avg. sequence length 54; max sequence length 54; min sequence length 54; total sequence length 376430922; total N bases 14254521; % N bases 3.78676; no. of sequences with Ns 278635; % sequences with Ns 3.99709. Quality statistics: total HQ bases 334195496; % HQ bases 88.78; total HQ reads 6350256; % HQ reads 91.0961. Alignment statistics
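
The percentages in this report follow directly from the counts: 6970943 reads of 54 bp give 376430922 bases, and 14254521 / 376430922 is about 3.78676%. The sketch below shows how such a report could be computed from (sequence, quality) pairs. The slide does not say how "HQ" is defined, so the thresholds here (score of at least 20, on at least 70% of a read's bases, offset 64) are assumptions for illustration, as are the function and key names.

```python
def percent(part: int, total: int) -> float:
    """Percentage of part in total, 0.0 when total is zero."""
    return 100.0 * part / total if total else 0.0


def qc_report(records, min_score: int = 20,
              min_fraction: float = 0.7, offset: int = 64) -> dict:
    """Summarize an iterable of (sequence, quality_string) pairs."""
    records = list(records)
    lengths = [len(seq) for seq, _ in records]
    total_bases = sum(lengths)
    n_bases = sum(seq.upper().count("N") for seq, _ in records)
    reads_with_n = sum(1 for seq, _ in records if "N" in seq.upper())
    # Number of bases per read whose decoded score reaches min_score.
    hq_per_read = [sum(1 for c in qual if ord(c) - offset >= min_score)
                   for _, qual in records]
    hq_reads = sum(1 for hq, length in zip(hq_per_read, lengths)
                   if length and hq / length >= min_fraction)
    return {
        "total_sequences": len(records),
        "avg_length": total_bases / len(records) if records else 0.0,
        "max_length": max(lengths, default=0),
        "min_length": min(lengths, default=0),
        "total_bases": total_bases,
        "n_bases": n_bases,
        "pct_n_bases": percent(n_bases, total_bases),
        "reads_with_n": reads_with_n,
        "pct_reads_with_n": percent(reads_with_n, len(records)),
        "hq_bases": sum(hq_per_read),
        "pct_hq_bases": percent(sum(hq_per_read), total_bases),
        "hq_reads": hq_reads,
        "pct_hq_reads": percent(hq_reads, len(records)),
    }


report = qc_report([("ACGT", "hhhh"), ("ANGT", "hhBB"), ("NNGT", "BBBB")])
print(report["pct_n_bases"], report["hq_reads"])  # 25.0 1
```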