Data quality is very important for downstream analyses such as sequence assembly and single nucleotide polymorphism (SNP) identification. This presentation covers the parameters for NGS data quality checks and the data formats of the major sequencing platforms.
2. Sequence Formats
1/26/2014
• All sequence formats are ASCII text containing the sequence ID, quality scores, annotation details, comments, and other descriptions of the sequence
• Formats are designed to hold sequence data and other information about the sequence
3. Why so many formats?
• Created based on the information required for each step of analysis
• Efficient data and time management
Types of sequence file formats
• Raw sequence files
• Coordinate files
• Parameter files
• Annotation files
• Metadata files
• Each data format varies in the information it contains
8. SOLiD output format(s)
CSFASTA: color-space sequence reads in a FASTA format
• These reads can be retained and analyzed in color space by software
• The Format Conversion Tool offers options for cleaning the CSFASTA files
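To make the color-space idea concrete, here is a minimal Python sketch of decoding a CSFASTA read into bases. It assumes the standard two-base encoding (A=0, C=1, G=2, T=3, with each color equal to the XOR of two adjacent bases) and a read that starts with a known primer base. The function name `decode_colorspace` is hypothetical and does not come from any SOLiD tool.

```python
# Illustrative decoder for color-space reads (hypothetical helper, not a SOLiD API).
BASES = "ACGT"  # index of each base doubles as its 2-bit code


def decode_colorspace(read):
    """Decode a CSFASTA read such as 'T01230' into base space.

    The first character is the primer base; it anchors decoding
    but is not part of the returned sequence.
    """
    prev = BASES.index(read[0])
    decoded = []
    for i, color in enumerate(read[1:], start=1):
        if color not in "0123":
            # A missing color call ('.') makes every later base undecodable.
            decoded.append("N" * (len(read) - i))
            break
        prev ^= int(color)  # next base = previous base XOR color
        decoded.append(BASES[prev])
    return "".join(decoded)


print(decode_colorspace("T01230"))
```

Because each base depends on all the colors before it, a single color error corrupts the rest of the decoded read, which is one reason such reads are often kept and analyzed in color space.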
9. Read Length
• Sanger read lengths ~800-2000 bp
• Generally we define short reads as anything below 200 bp
  - Illumina (100-250 bp)
  - SOLiD (75 bp max)
  - Ion Torrent (200-300 bp max, currently...)
  - Roche 454 (400-800 bp)
• Even with these platforms it is cheaper to produce short reads (e.g. 50 bp) rather than 100 or 200 bp reads
• Diminishing returns:
  - For some applications 50 bp is more than sufficient
  - Resequencing of smaller organisms
  - Bacterial de novo assembly
  - ChIP-Seq
  - Digital Gene Expression profiling
  - Bacterial RNA-seq
12. Formats for Genome/Gene annotation
• BED format (genome-browser tracks)
• GFF format (gene/genome features)
• BioXSD (XML) (any annotation; under development)
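One practical difference between the two main formats is the coordinate system: BED uses 0-based, half-open intervals, while GFF uses 1-based, inclusive intervals. The sketch below converts one BED line into a GFF3-style line. The function name and the default `source` and `feature_type` values are assumptions chosen for illustration only.

```python
def bed_to_gff3(bed_line, source="example", feature_type="region"):
    """Convert one tab-separated BED line into a GFF3-style line.

    'source' and 'feature_type' are placeholder values (assumptions);
    BED does not carry this information.
    """
    fields = bed_line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else "."
    score = fields[4] if len(fields) > 4 else "."
    strand = fields[5] if len(fields) > 5 else "."
    attributes = "Name=" + name if name != "." else "."
    # BED start is 0-based; GFF start is 1-based. The end value is unchanged.
    return "\t".join([chrom, source, feature_type, str(start + 1),
                      str(end), score, strand, ".", attributes])


print(bed_to_gff3("chr1\t99\t200\tgeneA\t0\t+"))
```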
14. Points to remember on Data Formats
• For base-call data, "standard" FASTQ (Sanger, Phred)
• For read alignments, SAM/BAM/MAQ format
• For annotation results, GFF or BED format
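As an illustration of the "standard" Sanger FASTQ layout, the following sketch reads four-line records and decodes the quality string using the Phred+33 offset. It assumes unwrapped records (exactly four lines each); the function name `read_fastq` is hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, phred_scores) from a four-line-per-record FASTQ.

    Assumes Sanger/Phred+33 encoding and no line wrapping.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        scores = [ord(ch) - 33 for ch in qual]  # ASCII code minus 33 = Phred score
        yield header[1:], seq, scores


example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
print(records)
```

Older Illumina pipelines used a +64 offset instead, so the offset should be confirmed before scores are interpreted.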
16. All platforms have errors
Platforms: Illumina, SOLiD/ABI-Life, Roche 454, Ion Torrent
1. Removal of low-quality bases / low-complexity regions
2. Removal of adaptor sequences
3. Homopolymer-associated base-call errors (3 or more identical DNA bases) cause a higher number of (artificial) frameshifts
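Homopolymer stretches of the kind described above (3 or more identical bases) can be located with a simple regular expression. This is a minimal sketch for inspecting reads, not a correction method; the function name and the `min_len` parameter are hypothetical.

```python
import re


def homopolymer_runs(seq, min_len=3):
    """Return (start, base, length) for each run of >= min_len identical bases."""
    # ([ACGT]) captures one base; \1{n,} requires it to repeat at least n more times.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), len(m.group(0)))
            for m in pattern.finditer(seq.upper())]


print(homopolymer_runs("ACGTTTTAGGGC"))
```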
17. Illumina artefacts
• Underrepresented GC-rich regions
• PCR
• Sequencing
• GGC/GCC motif is associated with low quality and mismatches
• Low-quality reads: Phred score < 20
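The Phred threshold can be read as an error probability through the standard relation Q = -10 log10(P), so a score of 20 corresponds to a 1% chance that the base call is wrong. A small sketch of the conversion in both directions (function names are illustrative):

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score corresponding to error probability p."""
    return -10 * math.log10(p)


for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
```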
18. Need for QC & Preprocessing
QC analysis of sequence data is extremely important for meaningful downstream analysis
• To analyze problems in quality scores/statistics of sequencing data
• To check whether further analysis of the sequences is possible
• To remove redundancy (filtering)
• To remove low-quality reads from analysis
• To remove adapter contamination
Highly efficient and fast processing tools are required to handle large volumes of data
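Two of the steps listed above, adapter removal and low-quality read filtering, can be sketched in a few lines of Python. This is a simplified illustration: it only finds exact adapter matches, whereas real trimming tools also tolerate mismatches and partial adapters. The function names and the cutoffs (`min_q=20`, `min_fraction=0.7`) are assumptions, not values taken from the slides.

```python
def trim_adapter(seq, quals, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, quals
    return seq[:idx], quals[:idx]


def is_high_quality(quals, min_q=20, min_fraction=0.7):
    """True if at least min_fraction of bases have Phred score >= min_q.

    Both thresholds are illustrative assumptions.
    """
    if not quals:
        return False
    good = sum(1 for q in quals if q >= min_q)
    return good / len(quals) >= min_fraction


seq, quals = trim_adapter("ACGTAGATCGGAAG", [30] * 14, "AGATCGGAAG")
print(seq, is_high_quality(quals))
```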
19. Need for QC & Preprocessing
• The quality of data is very important for various downstream analyses, such as sequence assembly and single nucleotide polymorphism identification
• Most of the programs available for downstream analyses do not provide utilities for quality checking and filtering of NGS data before processing
20. NGS QC Toolkit & FastQC
• NGS QC Toolkit is for quality checking and filtering of high-quality reads
• This toolkit is a standalone, open-source application freely available at http://www.nipgr.res.in/ngsqctoolkit.html
• The application is implemented in the Perl programming language
• QC of sequencing data generated using Roche 454 and Illumina platforms
• Additional tools to aid QC (sequence format converter and trimming tools) and analysis (statistics tools)
FastQC can be used only for preliminary analysis
26. FastQC
• Basic statistics
• Quality per base position
• Per-sequence quality distribution
• Nucleotide content per position
• Per-sequence GC distribution
• Per-base GC distribution
• Per-base N content
• Length distribution
• Overrepresented/duplicated sequences
• K-mer content
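Most of the per-position modules listed above follow the same pattern: tally something at each read position and report it as a percentage. As one example, here is a minimal sketch of a per-base N content calculation. It is not FastQC's own code, and the function name is hypothetical.

```python
def per_base_n_percent(reads):
    """Percentage of N calls at each read position (reads may differ in length)."""
    length = max(len(r) for r in reads)
    n_counts = [0] * length
    totals = [0] * length
    for read in reads:
        for i, base in enumerate(read.upper()):
            totals[i] += 1          # reads covering this position
            if base == "N":
                n_counts[i] += 1    # uncalled bases at this position
    return [100.0 * n / t for n, t in zip(n_counts, totals)]


print(per_base_n_percent(["ACGN", "ANGT", "ACG"]))
```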
40. 8. K-mer content
Any k-mer showing more than a 3-fold overall enrichment or a 5-fold enrichment at any given base position will be reported by this module.
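The notion of "fold enrichment" can be illustrated as an observed/expected ratio. The sketch below is a simplified stand-in and is not FastQC's actual implementation: it covers only the overall enrichment (not the per-position one) and assumes the expected count of a k-mer is the number of k-mer windows multiplied by the product of the overall frequencies of its bases. The function name is hypothetical.

```python
from collections import Counter


def kmer_enrichment(reads, k=2):
    """Observed/expected ratio for every k-mer seen in the reads.

    Expected counts assume independent bases drawn at their overall
    frequencies (a simplifying assumption).
    """
    base_counts = Counter()
    kmer_counts = Counter()
    for read in reads:
        base_counts.update(read)
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1
    total_bases = sum(base_counts.values())
    total_kmers = sum(kmer_counts.values())
    ratios = {}
    for kmer, observed in kmer_counts.items():
        expected = total_kmers
        for base in kmer:
            expected *= base_counts[base] / total_bases
        ratios[kmer] = observed / expected
    return ratios


ratios = kmer_enrichment(["AAAA", "CCCC"], k=2)
flagged = [kmer for kmer, r in ratios.items() if r > 3]  # 3-fold cutoff from the slide
print(ratios, flagged)
```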
41. 9. Overrepresented/duplicate sequences
The analysis of overrepresented sequences will spot an increase in any exactly duplicated sequences
Too many duplicated sequences may be due to sequencing problems
This module will issue a warning if any sequence is found to represent more than 0.1% of the total.
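The 0.1% rule can be sketched as a simple count of exact duplicates. This is an illustration of the threshold only; it keeps every read in memory and is not how FastQC itself tracks sequences. The function name is hypothetical.

```python
from collections import Counter


def overrepresented(reads, threshold_pct=0.1):
    """Return (sequence, count, percent) for sequences above threshold_pct of all reads."""
    counts = Counter(reads)
    total = len(reads)
    flagged = []
    for seq, n in counts.most_common():
        pct = 100.0 * n / total
        if pct > threshold_pct:
            flagged.append((seq, n, pct))
    return flagged


# 2000 reads: one sequence seen 3 times (0.15%), the rest unique (0.05% each).
reads = ["ACGTACGT"] * 3 + ["read%d" % i for i in range(1997)]  # placeholder strings stand in for unique reads
print(overrepresented(reads))
```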
42. QC Report
• Sequence Statistics
Total No. of Sequences      6970943
Avg. Sequence Length        54
Max Sequence Length         54
Min Sequence Length         54
Total Sequence Length       376430922
Total N bases               14254521
% N bases                   3.78676
No. of Sequences with Ns    278635
% Sequences with Ns         3.99709
• Quality Statistics
Total HQ bases              334195496
% HQ bases                  88.78
Total HQ reads              6350256
% HQ reads                  91.0961
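The percentages in this report follow directly from the raw counts, which makes them easy to sanity-check. The short sketch below recomputes them from the figures above; the variable and function names are illustrative.

```python
# Counts copied from the QC report above.
total_reads = 6970943
read_length = 54
total_bases = 376430922
n_bases = 14254521
reads_with_n = 278635
hq_bases = 334195496
hq_reads = 6350256


def pct(part, whole):
    """Percentage of 'whole' represented by 'part'."""
    return 100.0 * part / whole


report = {
    "% N bases": pct(n_bases, total_bases),
    "% Sequences with Ns": pct(reads_with_n, total_reads),
    "% HQ bases": pct(hq_bases, total_bases),
    "% HQ reads": pct(hq_reads, total_reads),
}

# All reads are 54 bp, so total length should equal reads x length.
length_consistent = total_reads * read_length == total_bases

for label, value in report.items():
    print(label, round(value, 4))
print("length consistent:", length_consistent)
```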
Alignment statistics