Data formats and visualization in
next-generation sequencing analysis
Li Shen, Asst. Prof.
Neuro core
Sep 2013
Introduction to the Shenlab
Lab location: Icahn 10-20 office suite
Two focuses:
1. Next-generation sequencing analysis
2. Novel software development for NGS
http://neuroscience.mssm.edu/shen/index.html
DNA sequencing overview
Primer
Template sequence
DNA polymerase/ligase
A
C
G
T
5’ 3’
5’3’
1. How to “freeze” the procedure?
2. What kind of signal to generate?
3. How to capture the signals?
Sanger sequencing
Pyrosequencing
Solexa sequencing
SOLiD sequencing
Ion Torrent sequencing
SMRT sequencing
…and many others
Extending sequence
What is “next-generation” sequencing?
Massively parallel:
-- First-generation sequencers --
Sanger sequencer: 384 samples per single batch
-- Next-generation sequencers --
Illumina, SOLiD sequencers: billions of reads per single batch, a ~3 million fold increase in throughput!
What are “short” reads?
http://www.edgebio.com/blog_old/uploads/2011/06/1.png
http://en.wikipedia.org/wiki/File:DNA_Sequencing_gDNA_libraries.jpg
Limit of read length (figure axes: read position vs. quality score)
Illumina: 50-250bp
SOLiD: 35-50bp
454 pyro: 700bp
Sanger: 900bp
Illumina sequencing terminology
Chip, slide, flow cell…
HiSeq 2500
DNA fragment
Information flow of sequencing data
fastq
SAM/BAM
coverage
HISEQ2:197:D08GUACXX:8:2105:21056:104282 0 chr10 3000101 255 51M * 0 0 AAGGTCACCAAAGGCCCACCTTGTCTTTACCTTATTTGTTCTAAATTTTTT =@@DA:ADDHD;AA?:AAFHGIHHBDEFHIDGB9CFH<?F<DEEIGGHEII XA:i:0 MD:Z:51 NM:i:0
HISEQ2:197:D08GUACXX:6:1105:9303:81340 0 chr10 3000301 255 51M * 0 0 GTGTTATTTCACAAGGTGAAGATAGAGCTTGGTGGCTGCCAGAGAGATTAA BB@FFFFFHHHHHJJJFGIJIIJJJJJJIJJJHIJJJIIJJJJIGIGIJII XA:i:0 MD:Z:51 NM:i:0
HISEQ2:197:D08GUACXX:7:2102:2396:174630 16 chr10 3000373 255 51M * 0 0 CTGAATCTTCTCCTAAGTATCATCCTGAAGAACAAAATTCCTCTTTTGCTT JJIJJJJJJJJJJJJJJJIIJJJJJIJJJJJHJJJJJJHHHHHFFFFFCCC XA:i:0 MD:Z:51 NM:i:0
HISEQ2:197:D08GUACXX:8:2108:12162:127556 0 chr10 3000388 255 51M * 0 0 AGTATCATCCTGAAGAACAAAATTCCTCTTTTGCTTAAAATTCACTGGGGA
Image analysis
FASTQ
Raw sequence format
What is FASTQ?
• Text-based format for storing both biological
sequences and corresponding quality scores.
• FASTQ = FASTA + QUALITY
• A FASTQ file uses four lines per sequence.
Line 1: @SEQ_ID
Line 2: GATTTGGGGTTCAAAGCAGTATCGATCAAA
Line 3: +SEQ_ID (optional)
Line 4: !''*((((***+))%%%++)(%%%%).1**
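The four-line layout can be read with a few lines of Python. This is a minimal sketch, not a full parser: it assumes every record occupies exactly four lines (no wrapped sequences), and the names FastqRecord and read_fastq are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    seq_id: str    # line 1 without the leading "@"
    sequence: str  # line 2
    quality: str   # line 4, one symbol per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines (assumes unwrapped records)."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


example = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAA\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1**\n"
)
records = list(read_fastq(io.StringIO(example)))
print(records[0].seq_id, len(records[0].sequence))
```

For real data, an established library is usually a safer choice than a hand-written reader.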
Illumina sequence identifiers
@SOLEXA-DELL:6:1:8:1376#0/1
SOLEXA-DELL: instrument name
6: lane
1: tile
8: X-coordinate
1376: Y-coordinate
#0: index number
/1: member of a read pair (/1 or /2)
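The identifier fields can be pulled apart with a regular expression. The sketch below only handles the older layout shown on this slide (instrument:lane:tile:x:y#index/pair); newer Illumina headers carry more fields and would need a different pattern. IlluminaId and parse_illumina_id are illustrative names, not part of any library.

```python
import re
from typing import NamedTuple


class IlluminaId(NamedTuple):
    instrument: str
    lane: int
    tile: int
    x: int
    y: int
    index: str
    pair: int


# instrument:lane:tile:x:y#index/pair, with an optional leading "@"
_ID_PATTERN = re.compile(r"^@?([^:]+):(\d+):(\d+):(\d+):(\d+)#([^/]+)/([12])$")


def parse_illumina_id(header: str) -> IlluminaId:
    """Split an old-style Illumina read identifier into its named fields."""
    match = _ID_PATTERN.match(header.strip())
    if match is None:
        raise ValueError(f"not an old-style Illumina identifier: {header!r}")
    instrument, lane, tile, x, y, index, pair = match.groups()
    return IlluminaId(instrument, int(lane), int(tile), int(x), int(y), index, int(pair))


parsed = parse_illumina_id("@SOLEXA-DELL:6:1:8:1376#0/1")
print(parsed)
```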
Quality score calculation
@SEQ_ID
+SEQ_ID
!''*((((***+))%%%++)(%%%%).1** ?
A quality value Q is an integer representation of the probability p that the
corresponding base call is incorrect.
1. Sanger (Phred): Q = -10 log10(p)
2. Solexa: Q = -10 log10(p / (1 - p))
Figures from Wikipedia
Quality score interpretation
Phred quality score  Probability of incorrect base call  Base call accuracy
10                   1 in 10                             90%
20                   1 in 100                            99%
30                   1 in 1000                           99.9%
40                   1 in 10000                          99.99%
50                   1 in 100000                         99.999%
Materials from Wikipedia
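The table follows directly from the standard Phred definition Q = -10 log10(p), and the Solexa variant uses the odds p / (1 - p) instead of p. A small sketch of both conversions follows; the function names are chosen here for illustration.

```python
import math


def phred_quality(p: float) -> float:
    """Sanger/Phred score for an error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)


def solexa_quality(p: float) -> float:
    """Solexa score, based on the odds p / (1 - p) (0 < p < 1)."""
    return -10.0 * math.log10(p / (1.0 - p))


def error_probability(q: float) -> float:
    """Invert the Phred formula: probability that the base call is wrong."""
    return 10.0 ** (-q / 10.0)


# Rebuild the rows of the interpretation table.
for q in (10, 20, 30, 40, 50):
    p = error_probability(q)
    print(f"Q{q}: 1 in {round(1 / p)} incorrect, accuracy {100 * (1 - p):.3f}%")
```

The two scales nearly coincide for small p, since p / (1 - p) is then close to p; they differ noticeably only for low-quality bases.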
Quality score encoding
• Formula: score + offset = ASCII code of the symbol stored in the file.
• Two variants:
offset=64 (Illumina 1.0 up to, but not including, 1.8);
offset=33 (Sanger, Illumina 1.8+).
• A quality score is typically in the range [0, 40].
(33): !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
(64): @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh
Figures from Wikipedia
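The encoding rule (score + offset gives the ASCII code) is easy to apply in both directions. The sketch below assumes the caller already knows which offset the file uses; decode_quality and encode_quality are illustrative names.

```python
from typing import List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Turn a quality string into integer scores by subtracting the offset."""
    scores = [ord(symbol) - offset for symbol in quality]
    if any(score < 0 for score in scores):
        raise ValueError("negative score: the offset is probably wrong")
    return scores


def encode_quality(scores: List[int], offset: int = 33) -> str:
    """Turn integer scores back into a quality string by adding the offset."""
    return "".join(chr(score + offset) for score in scores)


sanger_scores = decode_quality("!''*", offset=33)
illumina_scores = decode_quality("@h", offset=64)
print(sanger_scores, illumina_scores)
```

Decoding an offset-33 string with offset 64 typically produces negative scores, which is why the sketch raises an error in that case.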
What can you do with FASTQ files?
• Quality control: quality score distribution, GC
content, k-mer enrichment, etc.
• Preprocessing: adapter removal, low-quality
reads filtering, etc.
GATTTGGGGTTCAAAGCAGTATCGATCAAA
!''*((((***+))%%%++)(%%%%).1**
[Figure panels: mean quality, GC content, k-mer enrichment, adapter content (miRNA), quality distributions]
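Two of the quality-control measures and both preprocessing steps can be sketched in a few functions. The assumptions here are loud ones: offset 33, an exact-match adapter search (real trimmers tolerate mismatches and partial adapters), and a made-up threshold of 20 for the mean quality filter.

```python
from typing import Tuple


def gc_content(sequence: str) -> float:
    """Fraction of bases that are G or C."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def mean_quality(quality: str, offset: int = 33) -> float:
    """Average score of a quality string (offset 33 assumed by default)."""
    if not quality:
        return 0.0
    return sum(ord(symbol) - offset for symbol in quality) / len(quality)


def trim_adapter(sequence: str, quality: str, adapter: str) -> Tuple[str, str]:
    """Cut the read at the first exact occurrence of the adapter, if any."""
    position = sequence.find(adapter)
    if position == -1:
        return sequence, quality
    return sequence[:position], quality[:position]


def keep_read(quality: str, min_mean: float = 20.0, offset: int = 33) -> bool:
    """Low-quality read filter with a hypothetical threshold."""
    return mean_quality(quality, offset) >= min_mean


seq = "GATTTGGGGTTCAAAGCAGTATCGATCAAA"
qual = "!''*((((***+))%%%++)(%%%%).1**"
print(gc_content(seq), mean_quality(qual), keep_read(qual))
```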
SAM/BAM
Alignment format
Short read alignment
• Many choices: BWA, Bowtie, MAQ, SOAP, STAR, TopHat, etc.
[Figure: FASTQ files plus an index of the genomic reference sequence go into an aligner (Bowtie, ELAND, BWA, SOAP, MAQ, SHRiMP); the alignments come out in a common alignment format: SAM]
The SAM format
[Figure: a short read aligned against the reference sequence, showing a mismatch and indels (insertion, deletion)]
1. seqid
2. chromosome
3. position
4. mapping quality
5. CIGAR: description of alignment operations
6. sequence
7. quality
The SAM specification
https://github.com/samtools/hts-specs
An example line:
MARILYN_0005:2:77:7570:3792#0 97 1 12017 1 76M = 12244 303 ACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGGAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCAT IHGIIIIIIIIIIIIGGDBDIIHIIEIGDG=GGDDGGGGEDE>CGDG<GBGGBGDEEGDFFEB>2;C<C;BDDBB8 AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:32C43 YT:Z:UU XS:A:+ NH:i:3 CC:Z:15 CP:i:102519078 HI:i:0
N = hundreds of millions
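A SAM line is tab-separated, so the seven fields highlighted earlier can be read by position, and the CIGAR string tells how far the read extends along the reference. The sketch below uses a short made-up record rather than the slide's example, ignores the optional tags, and uses illustrative names (SamRecord, parse_sam_line, reference_end).

```python
import re
from typing import List, NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str  # 1. seqid
    flag: int
    rname: str  # 2. chromosome
    pos: int    # 3. position (1-based, leftmost)
    mapq: int   # 4. mapping quality
    cigar: str  # 5. alignment operations
    seq: str    # 6. sequence
    qual: str   # 7. quality


_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
_CONSUMES_REFERENCE = frozenset("MDN=X")


def parse_sam_line(line: str) -> SamRecord:
    """Read the mandatory columns of one alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 columns")
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[9], fields[10])


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split e.g. '5M1I4M' into [(5, 'M'), (1, 'I'), (4, 'M')]."""
    if cigar == "*":
        return []
    ops = _CIGAR_OP.findall(cigar)
    if sum(len(count) + 1 for count, _ in ops) != len(cigar):
        raise ValueError(f"unparsable CIGAR string: {cigar!r}")
    return [(int(count), op) for count, op in ops]


def reference_end(record: SamRecord) -> int:
    """Last reference position covered (1-based, inclusive)."""
    span = sum(n for n, op in parse_cigar(record.cigar) if op in _CONSUMES_REFERENCE)
    return record.pos + span - 1


def is_reverse_strand(record: SamRecord) -> bool:
    """Bit 0x10 of the flag marks a read aligned to the reverse strand."""
    return bool(record.flag & 0x10)


# Hypothetical record: 10 bases with one inserted base relative to the reference.
line = "read1\t16\tchr10\t3000373\t255\t5M1I4M\t*\t0\t0\tCTGAATCTTC\tJJIJJJJJJJ"
record = parse_sam_line(line)
print(record.rname, record.pos, reference_end(record), is_reverse_strand(record))
```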
BAM: the binary version of SAM
• SAM files are large: 1M short reads => 200MB; 100M short reads => 20GB.
• So compression makes sense.
• BAM: Binary SAM; compressed in gzip-compatible blocks (BGZF) using the zlib library.
• Two parts: compressed data + index
• Index: random access (visualization,
analysis, etc.)
Layout of binary BAM file
Short read
alignment
Hundreds of millions of alignments
Gzip blocks
Time: O(n), n = #alignments
q = chr: X–Y
Chromosome:
A naïve approach
Chromosome:
...
...
One index per base-pair?
Wait, the human chr1 is about 250Mb long!
Gzip blocks:
A binning strategy
Chromosome bins:
E.g.: bin = 16Kb each,
~10,000 indices per
chromosome
Gzip blocks: ...
Long alignment
RNA splicing
Assume all alignments are sorted according to genomic coordinates:
...
...
Hierarchical binning and linear index
Binning:
Level 0: bin 0 (512Mb)
Level 1: bins 1 2 3 4 5 6 7 8 (64Mb each)
. . .
Level 5: 16Kb bins
Linear index:
16Kb tiling windows: file offset of the left-most alignment that overlaps the window
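The bin that holds an alignment can be computed with bit shifts. The function below is a Python transcription of the reg2bin calculation as I understand it from the SAM specification; it takes a 0-based, half-open interval and returns the smallest bin that fully contains it. linear_window is an illustrative helper for the 16Kb tiling windows.

```python
def reg2bin(beg: int, end: int) -> int:
    """Smallest bin that fully contains the 0-based half-open interval [beg, end)."""
    end -= 1  # work with the last covered position
    if beg >> 14 == end >> 14:  # level 5: 16Kb bins
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:  # level 4: 128Kb bins
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:  # level 3: 1Mb bins
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:  # level 2: 8Mb bins
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:  # level 1: 64Mb bins
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0                    # level 0: the single 512Mb bin


def linear_window(pos: int) -> int:
    """Index of the 16Kb tiling window that contains a 0-based position."""
    return pos >> 14


print(reg2bin(0, 16384), reg2bin(0, 16385), linear_window(16384))
```

An alignment that crosses a 16Kb boundary moves up to a larger bin, which is how long alignments such as spliced RNA reads are accommodated.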
A hypothetical example
0
1 2 3 4
a b c
d
e
f g
h
bin 0: f, g, h
bin 1: a
bin 2: b
bin 3: c, d
bin 4: e
q
1. bins(q): [0, 3];
2. Candidate alignments: f->h->c->d->g;
3. LinearIndex(3): start(h) => larger than end(f);
4. Remove f without reading;
5. Read h, c, d;
6. start(d) larger than boundary(q);
7. Stop: without reading g.
Done: saved TWO disk seeks!
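The walk-through can be replayed in code. The coordinates below are invented so that the eight alignments fall into the same bins as on the slide (a bin size of 100 and a query of [215, 260) are assumptions); list positions stand in for file offsets, and the loop only touches an alignment when a real reader would have to fetch it from disk.

```python
from typing import Dict, List, NamedTuple, Tuple

BIN_SIZE = 100  # hypothetical: bins 1..4 cover positions 0..399, bin 0 covers all


class Aln(NamedTuple):
    name: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


# Sorted by start, as the slide assumes; the list index plays the file offset.
ALIGNMENTS = sorted([
    Aln("a", 10, 40), Aln("b", 120, 160), Aln("c", 210, 240),
    Aln("d", 280, 295), Aln("e", 320, 360),
    Aln("f", 80, 130), Aln("g", 290, 330), Aln("h", 190, 230),
], key=lambda aln: aln.start)


def toy_bin(start: int, end: int) -> int:
    """Bins 1..4 hold alignments inside one window; bin 0 holds the rest."""
    first, last = start // BIN_SIZE, (end - 1) // BIN_SIZE
    return first + 1 if first == last else 0


def build_indices() -> Tuple[Dict[int, List[int]], Dict[int, int]]:
    bin_index: Dict[int, List[int]] = {}
    linear_index: Dict[int, int] = {}
    for offset, aln in enumerate(ALIGNMENTS):
        bin_index.setdefault(toy_bin(aln.start, aln.end), []).append(offset)
        for window in range(aln.start // BIN_SIZE, (aln.end - 1) // BIN_SIZE + 1):
            linear_index.setdefault(window, offset)  # left-most overlap wins
    return bin_index, linear_index


def query(q_start: int, q_end: int) -> Tuple[List[str], List[str], List[str]]:
    bin_index, linear_index = build_indices()
    windows = range(q_start // BIN_SIZE, (q_end - 1) // BIN_SIZE + 1)
    bins = {0} | {window + 1 for window in windows}
    candidates = sorted(off for b in bins for off in bin_index.get(b, []))
    min_offset = linear_index.get(q_start // BIN_SIZE, 0)
    hits, read, skipped = [], [], []
    for offset in candidates:
        aln = ALIGNMENTS[offset]  # name lookup is for reporting only
        if offset < min_offset:
            skipped.append(aln.name)  # ruled out by the linear index
            continue
        read.append(aln.name)
        if aln.start >= q_end:
            break  # sorted input: nothing later can overlap
        if aln.end > q_start:
            hits.append(aln.name)
    return hits, read, skipped


hits, read, skipped = query(215, 260)
print("hits:", hits, "read:", read, "skipped:", skipped)
```

With these coordinates f is skipped without being read and g is never reached, matching the two saved disk seeks.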
WIGGLE
Coverage format
From alignment to read depth
• Read depth: the number of times a base-pair
is covered by aligned short reads.
• Can be normalized: depth / library size * 1E6
= read depth per million aligned reads.
• Many tools to use: samtools depth, bedtools,
and so on.
1 2 3 4
Reference:
Alignments
Example:
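Read depth and its per-million normalization can be sketched as follows. Alignments are given as 0-based, half-open (start, end) intervals, which is an assumption of this sketch, and the library size in the usage example is made up.

```python
from typing import List, Sequence, Tuple


def read_depth(alignments: Sequence[Tuple[int, int]], ref_length: int) -> List[int]:
    """Per-base depth from (start, end) intervals, 0-based and half-open."""
    changes = [0] * (ref_length + 1)
    for start, end in alignments:
        start = max(0, start)
        end = min(ref_length, end)
        if start < end:
            changes[start] += 1  # coverage begins
            changes[end] -= 1    # coverage stops
    depth, running = [], 0
    for delta in changes[:ref_length]:
        running += delta
        depth.append(running)
    return depth


def per_million(depth: Sequence[int], library_size: int) -> List[float]:
    """depth / library size * 1E6 = read depth per million aligned reads."""
    return [d / library_size * 1e6 for d in depth]


depth = read_depth([(0, 4), (2, 6), (3, 8)], ref_length=10)
normalized = per_million(depth, library_size=2_000_000)
print(depth)
```

For real BAM files, samtools depth or bedtools do the same job at scale.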
Describing depth: the Wiggle format
• Line-oriented text file, two options: variable
step and fixed step.
variableStep chrom=chr1 span=2
100 1
variableStep chrom=chr1 span=3
1000 2
variableStep chrom=chr1 span=4
10000 3
Resulting track on chr1: value 1 at positions 100-101, value 2 at positions 1000-1002, value 3 at positions 10000-10003.
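The variableStep example can be expanded into explicit intervals with a short parser. This is a sketch that handles only variableStep declarations and data lines (plus comment and track lines, which it skips); parse_variable_step is an illustrative name, and the intervals it yields are 1-based and inclusive.

```python
from typing import Iterable, Iterator, Optional, Tuple


def parse_variable_step(lines: Iterable[str]) -> Iterator[Tuple[str, int, int, float]]:
    """Yield (chrom, start, end, value) with 1-based, inclusive coordinates."""
    chrom: Optional[str] = None
    span = 1
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith("track"):
            continue
        if line.startswith("variableStep"):
            params = dict(token.split("=", 1) for token in line.split()[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))  # span defaults to 1
            continue
        if chrom is None:
            raise ValueError("data line before any variableStep declaration")
        position, value = line.split()
        start = int(position)
        yield chrom, start, start + span - 1, float(value)


wiggle_text = """variableStep chrom=chr1 span=2
100 1
variableStep chrom=chr1 span=3
1000 2
variableStep chrom=chr1 span=4
10000 3
"""
intervals = list(parse_variable_step(wiggle_text.splitlines()))
print(intervals)
```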
Wiggle: fixed step
fixedStep chrom=chr1 start=100 step=100 span=3
1
2
3
Resulting track on chr1: value 1 at positions 100-102, value 2 at positions 200-202, value 3 at positions 300-302.
Reference:
w w w w …
fixedStep chrom=chr? start=??? step=w span=w
Dump your data here
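The template above (step=w, span=w) amounts to averaging the depth in consecutive windows of width w and writing one value per line. A minimal sketch, assuming a per-base depth list for one chromosome; write_fixed_step is an illustrative name, and dropping a trailing partial window is a simplification of this sketch, not a rule of the format.

```python
from typing import List, Sequence


def write_fixed_step(chrom: str, depth: Sequence[float], window: int,
                     start: int = 1) -> List[str]:
    """Return fixedStep wiggle lines holding the mean depth of each full window."""
    lines = [f"fixedStep chrom={chrom} start={start} step={window} span={window}"]
    full_length = len(depth) - len(depth) % window  # ignore a trailing partial window
    for left in range(0, full_length, window):
        chunk = depth[left:left + window]
        lines.append(f"{sum(chunk) / window:g}")
    return lines


wiggle_lines = write_fixed_step("chr1", [1, 1, 2, 2, 3, 3, 9], window=2)
print("\n".join(wiggle_lines))
```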
If you have very large wiggle files…
• Wiggle files can be huge: average per 10bp window => 300M
elements for human genome.
• Makes sense to compress and index.
Gzip blocks
Genome browser
UCSC genome browser
Pros: very comprehensive
Cons: data have to be transmitted via network
vs.
Locally installed browser
Pros: locally installed
Cons: less genome annotation
Genome browsers: lots of options
Wiki: 34 in total
and that is not all!
DEMO: GENOME BROWSER
Alignment, BAM, Wiggle, Peak calling, BED…
NGS.PLOT: GLOBAL VISUALIZATION FOR NEXT GENERATION SEQUENCING DATA
A genome is a huge collection of functional elements
• TSS: transcription start site
• TES: transcription end site
• Exon: mRNA components
• CpG island: has roles in gene regulation and evolution
• Enhancer: activates genes
• DNase hypersensitive site: where TFs bind
• And more…
Images from Google image search
Categorizing functional elements
TSS TES Enhancer CpG island Exon
GB view
TSS1
TSS2
TSS3
TSS4
TSS5
.
.
.
Chrom Start End
chr1 100 101
chr2 200 201
.
.
.
Avg. profile
Heatmap
H3K4me3
Step 1: choose a region of interest
Where to download?
Which database to use?
What kind of formats do they use?
0-based coordinates?
1-based coordinates?
Subset regions by function?
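The coordinate question matters because formats disagree: BED intervals are 0-based and half-open, while SAM positions and wiggle positions are 1-based. A small sketch of the conversion, with illustrative function names:

```python
from typing import Tuple


def bed_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """0-based half-open [start, end) -> 1-based inclusive [start + 1, end]."""
    return start + 1, end


def one_based_to_bed(start: int, end: int) -> Tuple[int, int]:
    """1-based inclusive [start, end] -> 0-based half-open [start - 1, end)."""
    return start - 1, end


def interval_length_bed(start: int, end: int) -> int:
    """Length of a BED interval; no +1 is needed in half-open coordinates."""
    return end - start


# A single-base record written BED-style as "chr1 100 101"
print(bed_to_one_based(100, 101), interval_length_bed(100, 101))
```

Mixing the two conventions shifts every feature by one base, which is easy to miss in a genome-wide plot.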
ngs.plot collects lots of genome annotations
Variable   Count  Description
Database   3      RefSeq, Ensembl and ENCODE
Genome     9      dm3, hg19, mm10, mm9, rn4, rn5, sacCer3, TAIR10, Zv9
Biotype    7      tss, tes, genebody, cgi, dhs, enhancer, exon
Gene type  4      protein_coding, lincRNA, miRNA, pseudogene
Cell line  9      Gm12878, H1hesc, Hepg2, Hmec, Hsmm, Huvec, K562, Nhek, Nhlf
Total functional elements: 15,944,952
[Figure: H3K27me3 average profile, annotated with SEM bar, smoothing, shade, flanking region, robust estimation]
Gene ranking algo.: total, hc, none, diff, prod, pca, max
Step 2: plot something at this region
ngs.plot: a global visualization tool for NGS data
• Written in R; an easy-to-use command-line program.
ngs.plot.r -G genome -R tss -C chipseq.bam -O output
Testing biological hypotheses with NGS data
Ian Maze
Allis lab
Rockefeller
Nucleosome
H3 Var A
H3 Var B
ChIP-seq
N
Understand questions
Transform -> analytics
bioinformatician
Time spent:
Super
Not bad
Normal…
Visualization the ngs.plot way
A B H3 RNA-seq
Config file:
A.bam -1 "A"
B.bam -1 "B"
H3.bam -1 "H3"
ngs.plot.r -G mm9 -R genebody -C config.txt -GO diff -O XXX diff
Export gene order list: go.txt
ngs.plot.r -E go.txt -G mm9 -R genebody -F rnaseq -C RNA.bam -GO none -O YYY
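The two-step workflow can be scripted. The sketch below only builds the config text and the two command lines; it does not run ngs.plot. Assumptions: the config columns are tab-separated (BAM file, gene list with -1 meaning all genes, quoted title), the flags are copied from the slide, XXX and YYY are the slide's placeholder output names, and the stray word after "-O XXX" is treated as a figure label and left out.

```python
import shlex
from typing import Iterable, List, Tuple


def config_text(rows: Iterable[Tuple[str, str, str]]) -> str:
    """One line per track: BAM file, gene list, quoted title (tab-separated)."""
    return "".join(f'{bam}\t{gene_list}\t"{title}"\n' for bam, gene_list, title in rows)


config = config_text([
    ("A.bam", "-1", "A"),
    ("B.bam", "-1", "B"),
    ("H3.bam", "-1", "H3"),
])

# Step 1: rank genes with the "diff" algorithm across the ChIP-seq tracks.
chip_command: List[str] = [
    "ngs.plot.r", "-G", "mm9", "-R", "genebody",
    "-C", "config.txt", "-GO", "diff", "-O", "XXX",
]

# Step 2: plot RNA-seq in the exported gene order, without re-ranking.
rna_command: List[str] = [
    "ngs.plot.r", "-E", "go.txt", "-G", "mm9", "-R", "genebody",
    "-F", "rnaseq", "-C", "RNA.bam", "-GO", "none", "-O", "YYY",
]

print(config, end="")
print(shlex.join(chip_command))
print(shlex.join(rna_command))
```

Keeping the commands as argument lists makes them easy to pass to subprocess.run once ngs.plot is installed and the file names are real.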
ngs.plot is also available on Galaxy!
URL: https://ineuron.mssm.edu/galaxy
DEMO: NGS.PLOT
Global visualization made easy…

Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Lucknow đź’‹ Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow đź’‹ Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow đź’‹ Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow đź’‹ Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Caco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionCaco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorption
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 

Next-generation sequencing format and visualization with ngs.plot

  • 1. Data formats and visualization in next-generation sequencing analysis Li Shen, Asst. Prof. Neuro core Sep 2013
  • 2. Introduction to the Shenlab Lab location: Icahn 10-20 office suite Two focuses: 1. Next-generation sequencing analysis 2. Novel software development for NGS http://neuroscience.mssm.edu/shen/index.html
  • 3. DNA sequencing overview Primer Template sequence DNA polymerase/ligase A C G T 5’ 3’ 5’3’ 1. How to “freeze” the procedure? 2. What kind of signal to generate? 3. How to capture the signals? Sanger sequencing Pyrosequencing Solexa sequencing SOLiD sequencing Ion Torrent sequencing SMRT sequencing …and many others Extending sequence
  • 4. What is “next-generation” sequencing? -- first-generation sequencers: – Sanger sequencer: 384 samples per single batch -- next-generation sequencers: -- Illumina, SOLiD sequencer: billions per single batch, ~3 million fold increase in throughput! Massively Parallel:
  • 5. What are “short” reads? http://www.edgebio.com/blog_old/uploads/2011/06/1.png http://en.wikipedia.org/wiki/File:DNA_Sequencing_gDNA_libraries.jpg Read position Qualityscore Illumina: 50-250bp SOLiD: 35-50bp 454 pyro: 700bp Sanger: 900bp Limit of read length
  • 6. Illumina sequencing terminology Chip, slide, flow cell… HiSeq 2500 DNA fragment
  • 7. Information flow of sequencing data fastq SAM/BAM coverage HISEQ2:197:D08GUACXX:8:2105:21056:104282 0 chr10 3000101 255 51M * 0 0 AAGGTCACCAAAGGCCCACCTTGTCTTTACCTTATTTGTTCTA AATTTTTT =@@DA:ADDHD;AA?:AAFHGIHHBDEFHIDGB9CFH<?F<DEEIG GHEII XA:i:0 MD:Z:51 NM:i:0 HISEQ2:197:D08GUACXX:6:1105:9303:81340 0 chr10 3000301 255 51M * 0 0 GTGTTATTTCACAAGGTGAAGATAGAGCTTGGTGGCTGCCAG AGAGATTAA BB@FFFFFHHHHHJJJFGIJIIJJJJJJIJJJHIJJJIIJJJJIGIGIJII XA:i:0 MD:Z:51 NM:i:0 HISEQ2:197:D08GUACXX:7:2102:2396:174630 16 chr10 3000373 255 51M * 0 0 CTGAATCTTCTCCTAAGTATCATCCTGAAGAACAAAATTCCTCT TTTGCTT JJIJJJJJJJJJJJJJJJIIJJJJJIJJJJJHJJJJJJHHHHHFFFFFCCC XA:i:0 MD:Z:51 NM:i:0 HISEQ2:197:D08GUACXX:8:2108:12162:127556 0 chr10 3000388 255 51M * 0 0 AGTATCATCCTGAAGAACAAAATTCCTCTTTTGCTTAAAATTCA CTGGGGA 7 Image analysis
  • 9. What is FASTQ? • Text-based format for storing both biological sequences and corresponding quality scores. • FASTQ = FASTA + QUALITY • A FASTQ file uses four lines per sequence. @SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAA +SEQ_ID(Optional) !''*((((***+))%%%++)(%%%%).1** 1 2 3 4
  • 10. Illumina sequence identifiers @SOLEXA-DELL:6:1:8:1376#0/1 Instrument name Lane Tile X-coordinate Y-coordinate Index number Paired read @SEQ_ID
  • 11. Quality score calculation +SEQ_ID !''*((((***+))%%%++)(%%%%).1** A quality value Q is an integer representation of the probability p that the corresponding base call is incorrect. Two variants: 1. Sanger 2. Solexa Figures from Wikipedia
  • 12. Quality score interpretation (Phred quality score: probability of incorrect base call, base call accuracy) 10: 1 in 10, 90%; 20: 1 in 100, 99%; 30: 1 in 1000, 99.9%; 40: 1 in 10000, 99.99%; 50: 1 in 100000, 99.999% Materials from Wikipedia
  • 13. Quality score encoding • Formula: score + offset => look for ascii symbol • Two variants: offset=64 (Illumina 1.0 up to, but not including, 1.8); offset=33 (Sanger, Illumina 1.8+). • A quality score is typically: [0, 40] (33): !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI (64): @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh Figures from Wikipedia
  • 14. What can you do with FASTQ files? • Quality control: quality score distribution, GC content, k-mer enrichment, etc. • Preprocessing: adapter removal, low-quality reads filtering, etc. GATTTGGGGTTCAAAGCAGTATCGATCAAA !''*((((***+))%%%++)(%%%%).1** Mean quality GC contentK-mer enrichment Adapter? (miRNA) Quality Quality …
  • 16. Short read alignment • Many choices: BWA, Bowtie, Maq, Soap, Star, Tophat, etc. FASTQ files Alignments Index Genomic reference sequence
  • 18. The SAM format 2. chromosome Short read Reference sequence 1. seqid 3. position ? 4. mapping quality mismatch Indel: insertion, deletion 5. CIGAR: description of alignment operations 6. sequence 7. quality
  • 19. The SAM specification https://github.com/samtools/hts-specs MARILYN_0005:2:77:7570:3792#0 97 1 12017 1 76M = 12244 303 ACTTCCAGCAACTGCTGGCCTGTGCCAGGG TGGAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCAT IHGIIIIIIIIIIIIGGDBDIIHIIEIGDG=GGDDGGGGEDE>CGDG<GBGGBGDEEGDFFEB>2;C<C;BDDBB8 AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:32C43 YT:Z:UU XS:A:+ NH:i:3 CC:Z:15 CP:i:102519078 HI:i:0 An example line: N = hundreds of millions
  • 20. BAM: the binary version of SAM • SAM files are large: 1M short reads => 200MB; 100M short reads => 20GB. • Makes sense for compression • BAM: Binary sAM; compress using gzip library. • Two parts: compressed data + index • Index: random access (visualization, analysis, etc.)
  • 21. Layout of binary BAM file Short read alignment Hundreds of millions of alignments Gzip blocks Time: O(n), n = #alignments q = chr: X–Y Chromosome:
  • 22. A naïve approach Chromosome: ... ... One index per base-pair? Wait, the human chr1 is as long as 200Mb! Gzip blocks:
  • 23. A binning strategy Chromosome bins: E.g.: bin = 16Kb each, ~10,000 indices per chromosome Gzip blocks: ... Long alignment RNA splicing Assume all alignments are sorted according to genomic coordinates: ... ...
  • 24. Hierarchical binning and linear index 0 1 2 3 4 5 6 7 8 512Mb 64Mb Level 0: Level 1: . . .Level 5: 16Kb . . . Linear Index: 16Kb tiling windows: file offset of the left-most alignment that overlaps the window Binning: . . .
  • 25. A hypothetical example 0 1 2 3 4 a b c d e f g h bin 0: f, g, h bin 1: a bin 2: b bin 3: c, d bin 4: e q 1. bins(q): [0, 3]; 2. Candidate alignments: f->h->c->d->g; 3. LinearIndex(3): start(h) => larger than end(f); 4. Remove f without reading; 5. Read h, c, d; 6. start(d) larger than boundary(q); 7. Stop: without reading g. Done: saved TWO disk seeks!
  • 27. From alignment to read depth • Read depth: the number of times a base-pair is covered by aligned short reads. • Can be normalized: depth / library size * 1E6 = read depth per million aligned reads. • Many tools to use: samtools depth, bedtools, and so on. 1 2 3 4 Reference: Alignments Example:
  • 28. Describing depth: the Wiggle format • Line-oriented text file, two options: variable step and fixed step. variableStep chrom=chr1 span=2 100 1 variableStep chrom=chr1 span=3 1000 2 variableStep chrom=chr1 span=4 10000 3 11 222 3333 chr1: 100 1000 10000
  • 29. Wiggle: fixed step fixedStep chrom=chr1 start=100 step=100 span=3 1 2 3 111 222 333 chr1: 100 200 300 Reference: w w w w … fixedStep chrom=chr? start=??? step=w span=w Dump your data here
  • 30. If you have very large wiggle files… • Wiggle files can be huge: average per 10bp window => 300M elements for human genome. • Makes sense to compress and index. Gzip blocks
  • 31. Genome browser: UCSC genome browser vs. IGV. UCSC genome browser: Pros: very comprehensive; Cons: data have to be transmitted via network. IGV: Pros: locally installed; Cons: less genome annotation
  • 32. Genome browsers: lots of options Wiki: 34 in total and that is not all!
  • 33. DEMO: GENOME BROWSER Alignment, BAM, Wiggle, Peak calling, BED…
  • 34. NGS.PLOT: GLOBAL VISUALIZATION FOR NEXT GENERATION SEQUENCING DATA
  • 35. A genome is a huge collection of functional elements • TSS: transcriptional start site • TES: transcriptional end site • Exon: mRNA components • CpG island: has roles in gene regulation and evolution • Enhancer: activate genes • DNase hypersensitive site: where TFs bind • And more… Images from Google image search
  • 36. Categorizing functional elements TSS TES Enhancer CpG island Exon GB view TSS1 TSS2 TSS3 TSS4 TSS5 . . . Chrom Start End chr1 100 101 chr2 200 201 . . . Avg. profile Heatmap H3K4me3
  • 37. Step 1: choose a region of interest Where to download? Which database to use? What kind of formats do they use? 0-based coordinates? 1-based coordinates? Subset regions by function?
  • 38. ngs.plot collects lots of genome annotations. Variable (count): description. Database (3): RefSeq, Ensembl and ENCODE; Genome (9): dm3, hg19, mm10, mm9, rn4, rn5, sacCer3, Tair10, Zv9; Biotype (7): tss, tes, genebody, cgi, dhs, enhancer, exon; Gene type (4): protein_coding, lincRNA, miRNA, pseudogene; Cell line (9): Gm12878, H1hesc, Hepg2, Hmec, Hsmm, Huvec, K562, Nhek, Nhlf. Total functional elements: 15,944,952
  • 39. Step 2: plot something at this region H3K27me3 SEM bar Smoothing Shade Flanking region Robust estimation Gene ranking algo.: total hc none diff prod pca max
  • 40. ngs.plot: a global visualization tool for NGS data • Written in R, easy-to-use command line program. ngs.plot.r -G genome -R tss -C chipseq.bam -O output
  • 41. Testing biological hypotheses with NGS data Ian Maze Allis lab Rockefeller Nucleosome H3 Var A H3 Var B ChIP-seq N Understand questions Transform -> analytics bioinformatician Time spent: Super Not bad Normal…
  • 42. Visualization the ngs.plot way A B H3 RNA-seq A.bam -1 “A” B.bam -1 “B” H3.bam -1 “H3” Config file: ngs.plot.r -G mm9 -R genebody -C config.txt -GO diff -O XXX diff Export gene order list: go.txt ngs.plot.r -E go.txt -G mm9 -R genebody -F rnaseq -C RNA.bam -GO none -O YYY
  • 43. ngs.plot is also available on Galaxy! URL: https://ineuron.mssm.edu/galaxy

Editor’s notes

  1. Good morning. How are you? Today we’ll talk about Data formats and visualization in next-generation sequencing analysis.
  2. I want to give you a brief introduction to my lab. My name is Li Shen. I run a small bioinformatics team within the department of neuroscience. We are located at the Icahn 10-20 office suite. Right now, we have two focuses. First, next-generation sequencing analysis: I have collaborations with many PIs within the department, such as Eric Nestler, Scott Russo and Yasmin Hurd, and pretty much anybody who has a sequencing project. Second, we are also highly interested in developing novel software to analyze sequencing data, and I’ll talk about one of those tools in today’s lecture.
  3. To give you a bit of background, I want you to get a feel for what sequencing data are and how they are generated. Sequencing is basically a process to determine the order of nucleotides in a DNA sequence. Although there are many sequencing technologies on the market, the basic idea is the same, and it can be summarized by this figure. Starting from a primer sequence, the DNA polymerase produces the complement of the template sequence, one nucleotide at a time. A DNA sequencer captures the activity of the DNA polymerase and records the nucleotide that is being added. Finally, a complete readout gives us the template sequence. Now, there are several questions that need to be answered. First, at each step, how do you freeze the sequencing procedure so that the system has enough time to take a snapshot of the nucleotide? Second, what kind of signal should be generated? Third, how do you capture those signals? There are many different answers to these three questions, and the combinations of those answers give us a large array of sequencing technologies, such as Sanger sequencing, pyrosequencing, Solexa sequencing, SOLiD sequencing, and many others. Most of these sequencing technologies have been commercialized and are backed by various companies. And these are some of the major players.
  4. So what do we mean by next-generation sequencing? What’s the technology behind this buzzword, or market hype? Well, the keyword is parallel. Next-generation sequencing is massively parallel. For example, the first-generation sequencers, represented by the automated Sanger sequencer, can only analyze fewer than 400 samples per batch. The next-generation Illumina and SOLiD sequencers, by contrast, can analyze billions of samples per batch. That is about a 3-million-fold increase in throughput, which generates a huge amount of data.
  5. However, these sequencers are not without limitations. One of the major limits is the read length. The sequencing quality always degrades as the read gets longer. At a certain point, the quality becomes so low that it is basically meaningless to continue sequencing. This figure shows you the typical read lengths of the different sequencers. The old Sanger sequencer can actually produce very long reads, up to 900 base pairs. The 454 pyrosequencers can also produce long reads, up to 700 base pairs. The Illumina and SOLiD sequencers are at the other end: they produce very short reads, typically between 35 and 250 base pairs. So how do you sequence an entire genome, which can be as long as 3 billion base pairs? What people do is randomly break the long DNA sequence into many smaller fragments and sequence those fragments. So you get a little piece of data from here and there. Later, a computer program has to be used to assemble those little pieces into the whole genome.
  6. This picture gives you a feel for the Illumina sequencing machine. This hand is holding a sequencing chip, and as you can see, it is actually fairly small. You can call it a chip, a slide, or a flow cell; they are basically the same thing. Before sequencing begins, you load your DNA samples into this small chip and then send it to the sequencer. This figure explains some of the concepts involving a flow cell. Each flow cell is separated into 8 lanes. All lanes are sequenced together, but you can load a different sample into each lane. A lane is further separated into two columns, and each column is divided into many tiles. A tile is like a small grid on the flow cell, and it is basically the smallest unit for imaging. On this image, you can see that there are a lot of little dots. Each dot represents a nucleotide that is being added to the extending DNA strand. Altogether, a lot of images are generated during sequencing, and each of them has to be analyzed to extract the information about the sequencing reads.
  7. This is a flowchart of how the data are transformed once the sequencing is done. After image analysis, the short reads obtained from the sequencing machine are stored in the so-called fastq format. These short reads must be aligned to a reference genome before they can be analyzed further, which produces alignment files in the sam/bam format. The alignment files can then be summarized to generate coverage, which can be displayed in a human-readable way such as this figure.
  8. Fastq is a text-based format for storing both biological sequences and their corresponding quality scores. If you are familiar with the fasta format, then fastq is basically fasta plus quality. A fastq file uses four lines to represent a sequence. The first line is a sequence id, which always starts with an “@” sign. The second line is the bases, all the A, C, G and T’s. The third line starts with a “+” sign, optionally followed by the same sequence id. The fourth line is the sequencing quality scores, which are encoded as ascii symbols. This quality line has to be the same length as the sequence line.
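The four-line layout described in this note can be read with a few lines of Python. This is only a minimal sketch: the names `FastqRecord` and `parse_fastq` are made up for illustration, and it assumes each sequence and quality string sits on a single line, which is the layout the slide describes.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    seq_id: str    # identifier without the leading "@"
    sequence: str  # the bases (A, C, G, T, N)
    quality: str   # ASCII-encoded quality string, same length as sequence


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ block (hypothetical helper)."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        try:
            sequence = next(it).rstrip("\r\n")
            plus = next(it).rstrip("\r\n")
            quality = next(it).rstrip("\r\n")
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        # Line 1 must start with "@", line 3 with "+".
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        # Line 4 must be exactly as long as line 2.
        if len(sequence) != len(quality):
            raise ValueError(f"length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)
```

Because the function accepts any iterable of lines, it can be fed an open file handle directly, for example `for rec in parse_fastq(open("reads.fastq"))`, where the file name is a placeholder.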
  9. In the case of Illumina sequencers, the sequence id is very systematic. This is an actual sequence id from Mount Sinai’s sequencing core. After the “@” sign, there is the instrument name, followed by a colon, then the lane number, colon, tile number, colon, and then the x and y coordinates of the dot on the tile image. Finally, after the pound sign, there are the index number and the paired read number. In this case, the sample is not multiplexed, so the index number is 0. If the sequencing was single-end, then the read number is always 1. If it was paired-end, then it can be 1 or 2.
  10. The trickiest part of a fastq file is probably the sequence quality encoding. A quality score is defined as an integer representation of the probability p that the corresponding base call is incorrect. There have been two variants of how the quality score is calculated. In the standard Sanger encoding, Q equals negative 10 times log10 of p, while in the Illumina encoding prior to version 1.3, Q equals negative 10 times log10 of p over (1 minus p). So the two versions are slightly different, but you can see that when p is very small, they are almost identical.
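The two formulas in this note are easy to compare numerically. The sketch below simply transcribes them; the function names are hypothetical and the scores are left unrounded so the difference is visible.

```python
import math


def phred_quality(p: float) -> float:
    """Sanger/Phred variant: Q = -10 * log10(p)."""
    return -10.0 * math.log10(p)


def solexa_quality(p: float) -> float:
    """Solexa variant (Illumina before 1.3): Q = -10 * log10(p / (1 - p))."""
    return -10.0 * math.log10(p / (1.0 - p))


if __name__ == "__main__":
    # The gap between the two variants shrinks as p gets smaller.
    for p in (0.5, 0.1, 0.01, 0.001):
        print(p, round(phred_quality(p), 3), round(solexa_quality(p), 3))
```

For error probabilities of 1 in 100 or lower the two scores differ by less than 0.05, which is why the distinction mostly matters for low-quality bases.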
  11. The quality score encoding actually leads to very intuitive interpretation. Using the Sanger encoding as an example, if the score equals 10, that means 1 out of 10 base calls is incorrect, or the base call accuracy is 90%. If the score is 20, 1 out of 100 base calls is incorrect, base call accuracy is 99%. If it is 30, base call accuracy is 99.9%, and so on.
  12. To represent the quality scores in a concise fashion, each score is recorded as an ascii symbol. The formula is to add an offset to the score and look up the symbol in the ascii table on the right side. And again, there are two variants. For Illumina scores, the offset is 64 before version 1.8, while for Sanger scores, the offset is 33. Since a quality score is typically between 0 and 40, with the 33 encoding it is represented as one of these symbols, while with the 64 encoding it is represented as one of these symbols. This leads to the following rule of thumb in practice. If somebody throws you a fastq file without letting you know where it comes from, you can just open the file and look at the quality scores. If they are mostly punctuation, digits, and uppercase letters, then they are 33 encoded. If they are mostly uppercase letters, brackets, and lowercase letters, then they are 64 encoded.
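The rule of thumb in this note can be turned into a small check. This is a heuristic sketch, not a standard routine: the function names are invented, and the two cut-offs (a character below ASCII 59 implies offset 33, a character above ASCII 74 implies offset 64) are assumptions based on the usual score ranges rather than anything stated on the slide.

```python
from typing import Iterable, List, Optional


def decode_quality(quality: str, offset: int) -> List[int]:
    """Turn an ASCII quality string back into integer scores."""
    return [ord(ch) - offset for ch in quality]


def guess_offset(quality_strings: Iterable[str]) -> Optional[int]:
    """Guess 33 or 64 from the observed characters; None if ambiguous.

    Raises ValueError if no quality characters are supplied.
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        raise ValueError("no quality characters supplied")
    if min(codes) < 59:   # assumed cut-off: too low for any offset-64 score
        return 33
    if max(codes) > 74:   # assumed cut-off: above 'J', unusual for offset 33
        return 64
    return None           # every character is valid under both encodings
```

In practice one would sample a few thousand reads before deciding, since a handful of high-quality reads can look the same under both encodings.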
  13. We’ve talked a lot about the format of fastq files. What can we do with them? Well, the first thing we often do is check the quality of the sequencing. We have a quality score for each nucleotide of each short read, so it’s very easy to get an average score for a read. Repeating the procedure for all reads in your library gives you an overall feel for the quality of the library. Another interesting thing to check is the GC content. It is known that on the old Illumina machines, the sequenced reads tend to be GC rich. You can also calculate the enrichment of different k-mers. Sometimes your library may become contaminated, and you’ll see spikes of enrichment for particular k-mers. After the quality check, you may also want to preprocess your fastq files. In the case of microRNA sequencing, this is a must-do because microRNAs are very short, about 20 bp, while your read length may be much longer than that. So you’ll see adapter sequences at the 3’ end of the short reads, and they must be clipped before alignment.
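The per-read checks mentioned in this note (mean quality, GC content, k-mer counts) each take only a line or two. The helpers below are an illustrative sketch with made-up names; they assume non-empty reads and an offset-33 quality string by default, and dedicated QC tools do far more than this.

```python
from collections import Counter


def mean_quality(quality: str, offset: int = 33) -> float:
    """Average quality score of one read (offset 33 assumed by default)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def gc_content(sequence: str) -> float:
    """Fraction of bases in the read that are G or C."""
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer in the read."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
```

Summing the `Counter` objects over all reads in a library gives the k-mer table in which contamination spikes would show up.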
  14. Fastq files are just the raw sequence reads, and they must be aligned to the reference genome to make any sense. This works by building an index on the reference sequences so that the alignment can be done efficiently. Luckily, you don’t have to do it yourself. Sequence alignment has been a very hot field in the past decade, and there are many choices when it comes to short read alignment. Some popular choices are BWA, Bowtie, Maq, SOAP, etc.
  15. Just a few years ago, each alignment program produced alignment files in its own format. If you are an application developer, this really sucks. It basically means you have to write your program like a Swiss Army knife so that it can read all these formats properly. Finally, a group of researchers, mainly from the Sanger Institute and the Broad Institute, developed a format called SAM, which is meant to be a generic format for sequence alignment. It soon became the standard.
  16. So, instead of giving you an elaboration on the SAM format, I’d like to flip the question and ask: if you were going to design an alignment format, what would you put there? First, each short read comes with a sequence id. Then you want to know which chromosome it has been aligned to and, of course, the starting position of the alignment. Because of sequencing errors, and especially the repetitive regions of the genome, the sequence alignment cannot be 100% accurate, so you want to associate each alignment with a mapping quality score. In the case of mismatches, insertions or deletions, you also need to describe them using a string called CIGAR. Finally, you can keep the raw sequence and quality string just in case some programs need them.
  17. The actual SAM format is just like what I described. It has 11 required fields that are separated by tabs. If you are interested in more details, you can go to its website and read the specification. An example line of a SAM file looks something like this. And you may have hundreds of millions of lines like this in your SAM file.
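To make the 11 required fields concrete, here is a minimal sketch that splits one alignment line on tabs. The field names follow the SAM specification (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL); the class and function names are made up for illustration, and header lines starting with “@” are assumed to have been skipped already.

```python
import re
from typing import List, NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str       # 1. read (sequence) id
    flag: int        # 2. bitwise flag
    rname: str       # 3. reference name, e.g. chromosome
    pos: int         # 4. 1-based leftmost position
    mapq: int        # 5. mapping quality
    cigar: str       # 6. alignment operations
    rnext: str       # 7. reference name of the mate
    pnext: int       # 8. position of the mate
    tlen: int        # 9. observed template length
    seq: str         # 10. read sequence
    qual: str        # 11. ASCII-encoded base qualities
    tags: List[str]  # optional TAG:TYPE:VALUE fields


def parse_sam_line(line: str) -> SamRecord:
    f = line.rstrip("\r\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs 11 tab-separated fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])


_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string such as '32M2D19M' into (length, op) pairs."""
    return [] if cigar == "*" else [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]


def is_reverse_strand(rec: SamRecord) -> bool:
    """Flag bit 0x10 marks a read aligned to the reverse strand."""
    return bool(rec.flag & 0x10)
```

For real analyses a library such as pysam or the samtools command line is the usual choice; the point here is only to show how little structure a SAM line has.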
  18. As I mentioned earlier, next-generation sequencers can produce a huge number of short reads these days, so SAM files can be very large. A SAM file with one million short reads is around 200 megabytes, and a file with 100 million reads is about 20 gigabytes. If you have a large project with many sequencing samples, data storage can become a problem. So it totally makes sense to convert the text-based SAM into a binary format for compression. The BAM format was developed as the binary counterpart of SAM, and it uses the standard gzip library for compression. It has two parts: one is the compressed data and the other is the index. Having an index on the BAM file is very useful because it allows random access to the short reads. For example, if you want to retrieve the aligned reads for a certain gene, you don’t want to go through the entire file. You just want that part of the file to be retrieved precisely. This kind of function can be very important for visualization and analysis.
  19. So how is random access implemented for BAM files? To answer that question, let’s first look at the layout of a BAM file. A BAM file can be considered a concatenation of gzip blocks. Each block is a piece of compressed data. Once it is uncompressed, you can retrieve the short reads that are contained within it. Let’s assume we want to query an interval located on this chromosome between x and y, which should correspond to these three blocks. Without knowing the genomic coordinates that each block corresponds to, there is no way for us to do this intelligently. We would have to read every block on the disk and examine them one by one before we could find the blocks we want.
  20. Now let’s consider a naïve approach to solve this problem. Why not create an index that points to those blocks for each base pair on the genome? Then we can determine which blocks we want to read. But wait! Human chromosome 1 is as long as 200 million base pairs. That means we would need an index vector with 200 million entries. That would simply be too cumbersome to store and read.
  21. Let’s take a step further. If we assume all alignments are sorted according to their genomic coordinates, then we can divide each chromosome into larger units called bins. If each bin is 16 kb, then there are only about 10 thousand index entries per chromosome, which are much easier to store and read. But there is a problem. So far, we have assumed that all alignments are very short, so that they are well contained within a bin. What about very long alignments? A long alignment can be the result of RNA splicing, where two exons that are far apart are stitched together after transcription. The messenger RNA is sequenced to produce a short read like this, and after this short read is aligned to the reference sequence, the start and end positions of the alignment become very distant from each other. In some cases, an RNA alignment can be as long as 100 kb. So our binning strategy works in most cases, but there are still some issues that need to be addressed.
  22. To deal with these issues, the BAM designers developed a strategy called hierarchical binning. Basically, there are several levels of bin size, and each bin at a higher level spans a whole number of bins at the level below. For example, at level 0, the bin is 512 Mb, while at level 1, the bin is 64 Mb, and so on until level 5, where each bin is 16 kb. This kind of hierarchical structure gives us a flexible way to contain an alignment of any size. For example, this long alignment crosses bins 3 and 4 at level 1, so it should go to bin 0 at level 0. However, the binning strategy can still be inefficient when there are long alignments. In addition, a linear index has been created, which is basically a set of 16 kb tiling windows over the entire chromosome, where each window contains the file offset of the left-most alignment that overlaps the window.
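The six-level scheme in this note (512 Mb down to 16 kb, each level eight times finer than the one above) is the one the SAM specification publishes as a short C routine called `reg2bin`. Below is a Python transcription of that routine, plus a one-line helper for the 16 kb linear-index window; the helper name `linear_window` is made up here. Coordinates are assumed to be 0-based and half-open, as in the specification.

```python
def reg2bin(beg: int, end: int) -> int:
    """Smallest bin that fully contains the 0-based half-open interval [beg, end)."""
    end -= 1  # work with the last covered base
    if beg >> 14 == end >> 14:   # fits in one 16 kb bin (level 5)
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:   # fits in one 128 kb bin (level 4)
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:   # fits in one 1 Mb bin (level 3)
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:   # fits in one 8 Mb bin (level 2)
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:   # fits in one 64 Mb bin (level 1)
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0                     # only the single 512 Mb bin (level 0) contains it


def linear_window(pos: int) -> int:
    """Index of the 16 kb tiling window that contains a 0-based position."""
    return pos >> 14
```

A short read that stays inside one 16 kb window lands in a level-5 bin, while a spliced alignment that straddles a window boundary is pushed up to the smallest coarser bin that contains it.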
  23. Now, let’s see how this strategy plays out. To make things easy to interpret, we’ll use a two-level binning with one bin on level 1 and four bins on level 2. Assume we have alignments a to h. Alignments a to e are short and are contained within bins 1 to 4, while f, g and h are long alignments that have to be put in bin 0. Now we have a query q, which is just a short interval within bin 3. We calculate the bins that overlap this query and we get 0 and 3. So the candidate alignments, sorted by start location, would be f, h, c, d and g. Now we apply the linear index for bin 3 and find that the start position of h is larger than the end position of f, so we can remove f without reading it. Then we read h, c and d. After d is read, we find that the start of d is already beyond the boundary of q, so we stop reading and drop g from consideration. So you can see that by using binning and the linear index, we have avoided two disk seeks.
  24. After the short reads have been aligned to the reference sequences, we can convert the alignment information into read depth, which basically tells you the number of times a base pair is covered by aligned short reads. Sometimes this depth is further normalized by the library size to get the read depth per million aligned reads. The purpose of doing this is to remove the effect of different library sizes so that two sequencing samples can be compared. There are many tools you can use to do this, such as samtools depth or bedtools. Here is an example of the read depth calculation. Assuming we have four short reads aligned to the reference, the depths at these four positions are 1, 2, 3 and 4.
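The depth calculation and the per-million normalization from this note can be sketched as follows. The function names are hypothetical, alignments are assumed to be given as 0-based half-open (start, end) pairs on a single chromosome, and gaps inside an alignment (for example spliced reads) are ignored for simplicity.

```python
from typing import Iterable, List, Tuple


def read_depth(alignments: Iterable[Tuple[int, int]],
               region_start: int, region_end: int) -> List[int]:
    """Per-base depth over [region_start, region_end)."""
    length = region_end - region_start
    diff = [0] * (length + 1)  # +1 where a read starts, -1 just past its end
    for start, end in alignments:
        start = max(start, region_start)
        end = min(end, region_end)
        if start < end:
            diff[start - region_start] += 1
            diff[end - region_start] -= 1
    depth, running = [], 0
    for change in diff[:length]:
        running += change
        depth.append(running)
    return depth


def per_million(depth: List[int], library_size: int) -> List[float]:
    """depth / library size * 1E6, as on the slide."""
    return [d / library_size * 1e6 for d in depth]
```

With four staggered reads such as `[(0, 4), (1, 5), (2, 6), (3, 7)]`, the depth climbs 1, 2, 3, 4 over the first four positions, matching the example on the slide, and then falls off again.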
  25. There is a format that is often used to describe read depth, which is called wiggle format. A wiggle file is a line oriented text file. There are two options to specify a wiggle file, they are variable step and fixed step. In variable step, you put down the chromosome name, the start position and the read depth. You can also specify the number of times that the depth should be repeated using parameter “span”. Here is an example wiggle file using variable step. It basically tells us that value 1 should be repeated 2 times at position 100 on chromosome 1; value 2 should be repeated 3 times at position 1000; and value 3 should be repeated 4 times at position 10,000.
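The meaning of “span” is easiest to see by expanding a variableStep block into one value per base. This is a minimal sketch with a made-up function name; it handles only variableStep declarations, skips track, browser and comment lines, and treats positions as 1-based, which is how wiggle coordinates are defined.

```python
from typing import Iterable, Iterator, Tuple


def expand_variable_step(lines: Iterable[str]) -> Iterator[Tuple[str, int, float]]:
    """Yield (chrom, position, value) for every base covered by the file."""
    chrom, span = None, 1
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        if line.startswith("variableStep"):
            # Declaration line: key=value pairs after the keyword.
            params = dict(tok.split("=", 1) for tok in line.split()[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            continue
        if chrom is None:
            raise ValueError("data line before any variableStep declaration")
        start_text, value_text = line.split()
        start, value = int(start_text), float(value_text)
        for pos in range(start, start + span):  # repeat the value span times
            yield chrom, pos, value
```

Feeding it the example from the slide yields value 1 at positions 100 and 101, value 2 at 1000 to 1002, and value 3 at 10000 to 10003.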
  26. In the fixed step option, you specify the chromosome, start position, step and span, and then just dump all the data below. In this example, you have 1 repeated 3 times at 100, then you jump to 200 and repeat 2 for 3 times, and then jump to 300 and repeat 3 for 3 times. The fixed step option is useful when you want to divide the reference sequences into tiling windows and then summarize each window. This is often used to represent the coverage information of a ChIP-seq sample.
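The tiling-window use of fixedStep, where step and span are both set to the window width w, can be sketched like this. The function names are invented for illustration; the sketch assumes a per-base depth list for one chromosome and a 1-based start coordinate, as wiggle requires.

```python
from typing import Iterator, List, Sequence


def window_means(depth: Sequence[float], window: int) -> List[float]:
    """Average depth in consecutive non-overlapping windows."""
    means = []
    for i in range(0, len(depth), window):
        chunk = depth[i:i + window]  # the last window may be shorter
        means.append(sum(chunk) / len(chunk))
    return means


def fixed_step_lines(chrom: str, start: int, window: int,
                     values: Sequence[float]) -> Iterator[str]:
    """Yield a fixedStep declaration followed by one value per window."""
    yield f"fixedStep chrom={chrom} start={start} step={window} span={window}"
    for value in values:
        yield f"{value:g}"  # compact number formatting
```

Writing `"\n".join(fixed_step_lines("chr1", 1, 10, window_means(depth, 10)))` to a file gives a wiggle track with one value per 10 bp window, where `depth` stands for whatever per-base list you computed.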
  27. If wiggle files are used to describe coverage information for the entire genome, then they can be huge. For example, if you want to calculate the average value for 10 bp tiling windows, your wiggle file will contain 300 million values for the human genome. So it makes sense to convert wiggle files to a binary format and then compress and index them. Jim Kent, who built the UCSC genome browser, also invented the bigWig format. In bigWig, the wiggle information is compressed into gzip blocks and then indexed using a data structure called an R-tree, in a way that is similar to BAM file indexing.
  28. Alright, now that we’ve talked about the coverage format, how can you visualize it? A genome browser can be a handy tool when it comes to visualizing sequencing data. Two popular choices are the UCSC genome browser and the IGV genome browser. The advantage of UCSC is that it is very comprehensive. But if you want to see your own data, you’ll have to upload them via the internet, which can be cumbersome if you have a large amount of data. On the other hand, the IGV genome browser is a locally installed application. It is written in Java, so it basically runs everywhere. The downside of IGV is that it contains less genome annotation.
  29. Genome browsers have been another hot area of research in the past few years. Somebody actually created a wiki page to list the genome browsers that he or she knows, and there are 34 in total. But that is not all. I initiated and built the STAR genome browser when I was still doing my postdoc at UCSD, and my colleagues later continued the work. The paper about STAR was recently submitted to Bioinformatics and should be accepted soon. If you are interested, you can try it out at home.
  30. I want to spend the rest of my lecture talking about ngs.plot, a tool that my group has been focusing on. It’s a very useful tool for global visualization of NGS data.
  31. We now know that a genome is like a huge collection of functional elements. There are TSSs and TESs, which are the start and end points of transcription. There are exons, which are the components of messenger RNA. And there are CpG islands, which are CG-rich and have roles in gene regulation and evolution. There are also enhancers and DNase hypersensitive sites, and many, many more.
  32. All these functional elements are scattered around the genome in a seemingly random way. But you can certainly organize them into different categories. For example, all the transcription start sites can be listed in a table like this. A striking feature of these functional elements is that elements of the same type often share very similar chromatin modification patterns, as this average profile and heatmap show. This is the histone mark H3K4me3, which is depleted right at the TSS but enriched on both sides. So a figure like this can often speak for itself and tell you a story about the protein of interest. However, it is not trivial to create this kind of figure.
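The idea behind an average profile and a ranked heatmap can be sketched in a few lines of Python. This is only an illustration of the concept, not how ngs.plot computes its figures: the toy matrix, the function names and the choice to rank rows by total signal are all assumptions.

```python
def average_profile(matrix):
    """Column-wise mean: one averaged value per position around the TSS."""
    n_rows = len(matrix)
    return [sum(column) / n_rows for column in zip(*matrix)]


def rank_rows_by_total(matrix):
    """Order heatmap rows from highest to lowest total signal."""
    return sorted(matrix, key=sum, reverse=True)


# Toy coverage matrix: rows are genes, columns are positions
# (upstream, TSS, downstream). The values are made up.
coverage = [
    [4, 0, 4],
    [2, 0, 2],
    [6, 0, 6],
]

profile = average_profile(coverage)
heatmap_rows = rank_rows_by_total(coverage)

print(profile)
print(heatmap_rows)
```

The average profile collapses all genes into one curve, while the heatmap keeps one row per gene, so the row order becomes a meaningful choice.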
  33. So how do you create those figures? Well, there are basically two steps. In step 1, you choose a region of interest, such as the TSS plus or minus 2 kb. Somebody may tell you: that’s easy, just go ahead and download the genomic coordinates from some website. However, these questions may pop into your mind. Where shall I download the annotation? Which databases shall I use? What kind of formats do those databases use? Are these coordinates 0-based or 1-based? What if I want to subset those regions by function? Even if you are a seasoned bioinformatician, if you have to repeat this procedure many times, that’s gonna make your head explode.
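The 0-based versus 1-based question is where many off-by-one errors come from, so here is a small Python sketch of step 1. It assumes the transcript coordinates are 1-based and inclusive and returns a BED-style 0-based, half-open interval; the function name tss_window and the example coordinates are hypothetical.

```python
def tss_window(chrom, tx_start, tx_end, strand, flank=2000):
    """Return a BED-style (chrom, start, end) window around the TSS.

    tx_start and tx_end are assumed to be 1-based, inclusive coordinates.
    The result is 0-based and half-open, as in the BED format.
    """
    # The TSS is the 5' end of the transcript, which depends on the strand.
    tss = tx_start if strand == "+" else tx_end
    start_1based = max(1, tss - flank)  # do not run off the chromosome start
    end_1based = tss + flank
    # 1-based inclusive -> 0-based half-open: subtract 1 from the start only.
    return (chrom, start_1based - 1, end_1based)


# Hypothetical transcript on each strand.
plus_window = tss_window("chr10", 3000101, 3005000, "+")
minus_window = tss_window("chr10", 3000101, 3005000, "-")

print(plus_window)
print(minus_window)
```

Note that the strand matters: for a minus-strand transcript the TSS is at the larger coordinate, so taking the start column blindly would center the window on the TES instead.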
  34. So when we were designing ngs.plot, we were thinking: why not let us do the dirty job and do this all at once? We can collect the genome annotations from different databases and convert them into a unified format. Then in the future, all you need to do is to tell the program: I want this genome, at that functional element, and everything is there. So this is how we did it. We developed a genome crawler that goes to the major databases like UCSC, Ensembl and ENCODE and automatically downloads the annotations for a genome, then transforms and organizes them into different categories. Our program can even analyze the relationships between different transcripts and perform exon classification. This table is a bit old already, but it gives you a brief summary. Our program collects information from 3 databases, for 9 genomes. It considers 7 biotypes, such as TSS, TES, gene body and enhancer. It classifies genes into protein-coding, lincRNA, microRNA and pseudogene. It even contains information about cell lines for enhancers and DHS. In total, there are nearly 16 million functional elements, all at your fingertips.
  35. Now, after having chosen a region, in step 2, you need to plot something at that region. One thing that made us really frustrated with other tools is that they have very limited visualization options. Often, you have to accept the figure as is. And some tools don’t even provide the raw data for you to regenerate the figure. So when we designed ngs.plot, we kept this in mind and provided a lot of functions for you to tune a figure. For the average profile on the left, you can do all these kinds of tuning, just as an example. And for the heatmap on the right, you can rank the genes using 7 different algorithms. This can be particularly useful if you want to discover so-called gene modules within a large group of genes. If you want to use these figures in your publication, I bet you’ll appreciate our work.
  36. ngs.plot is written in R and developed as a command line tool, and it is really easy to use. For example, to create a TSS plot, you only need to type a command like this…. It is an open source project and is hosted on Google Code. Since its release, it has been downloaded hundreds of times by people from all over the world.
  37. Now, I want to give you a real example to demonstrate the power of ngs.plot. This researcher from Rockefeller is interested in studying the two variants of histone H3 in neurons. He has the following hypotheses. First, H3 has only two variants, so A plus B equals H3. Second, A and B should be mutually exclusive, so when A is enriched, B should be depleted, and vice versa. Third, B is correlated with gene activation and A is the reverse. To test his hypotheses, he generated ChIP-seq data for variants A and B and for H3. But how can he test these hypotheses? He needs to find a bioinformatician who tries to understand his questions and then transforms them into analyses. Depending on the efficiency of communication, this may have to be repeated multiple times. Now, how long does this take? If this takes one day, … but with ngs.plot, it takes less than 30 min.
  38. So how do you do this? In ngs.plot, all you need to do is to create a config file which tells the program the combinations of BAM files and gene lists you want to draw. Then you provide this config file on the command line to ngs.plot and leave everything else for the program to figure out. Since we are interested in the difference between variants A and B, we ask the program to rank the genes using the diff algorithm. As you can see clearly, variants A and B are mutually exclusive: when A is enriched, B is depleted, and vice versa. Meanwhile, H3 looks kind of like A and B added together. So this basically validates the first two hypotheses. Next, you only need to export the gene order list into a text file, and tell ngs.plot to plot RNA-seq based on this gene order. As the RNA-seq plot shows, there is a strong association between variant B enrichment and gene expression, whereas if A is enriched, genes are silenced.
  39. It’s worth mentioning that ngs.plot is also available on Galaxy, which is a very popular bioinformatics platform. If you are within Mount Sinai, you can access it at:… Unfortunately, it is not accessible from outside. If you are a wet lab biologist, it’s very likely that you are command line averse, so this is for you.