Next-generation sequencing format and visualization with ngs.plot

Data formats and visualization in
next-generation sequencing analysis
Li Shen, Asst. Prof.
Neuro core
Sep 2013

Introduction to the Shenlab
Lab location: Icahn 10-20 office suite
Two focuses:
1. Next-generation sequencing analysis
2. Novel software development for NGS
http://neuroscience.mssm.edu/shen/index.html

DNA sequencing overview
[Figure: a primer annealed to the template sequence; DNA polymerase/ligase adds A, C, G, or T to the extending sequence, 5’ to 3’]
1. How to “freeze” the procedure?
2. What kind of signal to generate?
3. How to capture the signals?
Sanger sequencing
Pyrosequencing
Solexa sequencing
SOLiD sequencing
Ion Torrent sequencing
SMRT sequencing
…and many others

What is “next-generation” sequencing?
Massively parallel:
First-generation sequencers: Sanger sequencer, 384 samples per single batch
Next-generation sequencers: Illumina, SOLiD sequencer, billions per single batch, a ~3-million-fold increase in throughput!

What are “short” reads?
http://www.edgebio.com/blog_old/uploads/2011/06/1.png
http://en.wikipedia.org/wiki/File:DNA_Sequencing_gDNA_libraries.jpg
[Figure axes: read position vs. quality score]
Limit of read length:
Illumina: 50-250bp
SOLiD: 35-50bp
454 pyro: 700bp
Sanger: 900bp

Illumina sequencing terminology
Chip, slide, flow cell…
HiSeq 2500
DNA fragment

Information flow of sequencing data
Image analysis -> fastq -> SAM/BAM -> coverage
HISEQ2:197:D08GUACXX:8:2105:21056:104282 0 chr10 3000101 255 51M * 0 0 AAGGTCACCAAAGGCCCACCTTGTCTTTACCTTATTTGTTCTAAATTTTTT =@@DA:ADDHD;AA?:AAFHGIHHBDEFHIDGB9CFH<?F<DEEIGGHEII XA:i:0 MD:Z:51 NM:i:0
3000301 255 51M * 0 0 GTGTTATTTCACAAGGTGAAGATAGAGCTTGGTGGCTGCCAGAGAGATTAA BB@FFFFFHHHHHJJJFGIJIIJJJJJJIJJJHIJJJIIJJJJIGIGIJII XA:i:0 MD:Z:51 NM:i:0
3000373 255 51M * 0 0 CTGAATCTTCTCCTAAGTATCATCCTGAAGAACAAAATTCCTCTTTTGCTT JJIJJJJJJJJJJJJJJJIIJJJJJIJJJJJHJJJJJJHHHHHFFFFFCCC XA:i:0 MD:Z:51 NM:i:0
3000388 255 51M * 0 0 AGTATCATCCTGAAGAACAAAATTCCTCTTTTGCTTAAAATTCACTGGGGA

What is FASTQ?
• Text-based format for storing both biological
sequences and corresponding quality scores.
• FASTQ = FASTA + QUALITY
• A FASTQ file uses four lines per sequence.
Line 1: @SEQ_ID
Line 2: GATTTGGGGTTCAAAGCAGTATCGATCAAA
Line 3: +SEQ_ID (optional)
Line 4: !''*((((***+))%%%++)(%%%%).1**
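The four-line layout can be read with a few lines of Python. This is a minimal sketch, assuming well-formed records that use exactly four lines each (no wrapped sequences); the name `read_fastq` is illustrative and not taken from any library.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (seq_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (an empty line also ends the stream)
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


# The record from the slide; line 3 carries only "+" because the ID is optional there.
example = io.StringIO(
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAA\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1**\n"
)
records = list(read_fastq(example))
print(records)
```

A generator keeps memory use flat, which matters when a file holds hundreds of millions of reads.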

Illumina sequence identifiers
@SOLEXA-DELL:6:1:8:1376#0/1
SOLEXA-DELL: instrument name
6: lane
1: tile
8: X-coordinate
1376: Y-coordinate
#0: index number
/1: paired read
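The identifier layout above can be split into named fields with a regular expression. This is a sketch that assumes the older `instrument:lane:tile:x:y#index/pair` form shown on this slide; newer Illumina headers are laid out differently and would need a different pattern. The function name `parse_illumina_id` is hypothetical.

```python
import re
from typing import Dict

# Older Illumina/Solexa header: @instrument:lane:tile:x:y#index/pair
_OLD_ILLUMINA = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<pair>[12])$"
)


def parse_illumina_id(seq_id: str) -> Dict[str, str]:
    """Return the named parts of an old-style Illumina sequence identifier."""
    match = _OLD_ILLUMINA.match(seq_id)
    if match is None:
        raise ValueError(f"not an old-style Illumina identifier: {seq_id!r}")
    return match.groupdict()


fields = parse_illumina_id("@SOLEXA-DELL:6:1:8:1376#0/1")
print(fields)
```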

Quality score calculation
+SEQ_ID
!''*((((***+))%%%++)(%%%%).1** ?
A quality value Q is an integer representation of the probability p that the corresponding base call is incorrect.
1. Sanger: $Q_{\text{sanger}} = -10 \log_{10} p$
2. Solexa: $Q_{\text{solexa}} = -10 \log_{10} \frac{p}{1 - p}$
Figures from Wikipedia

Quality score interpretation
Phred Quality Score | Probability of incorrect base call | Base call accuracy
10 | 1 in 10 | 90%
20 | 1 in 100 | 99%
30 | 1 in 1000 | 99.9%
40 | 1 in 10000 | 99.99%
50 | 1 in 100000 | 99.999%
Materials from Wikipedia
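The rows of this table follow from the Phred definition Q = -10 log10(p). A minimal Python sketch of the conversion in both directions (the function names are illustrative):

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Recompute the table rows from the scores alone.
for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q{q}: 1 in {round(1 / p)} incorrect, accuracy {100 * (1 - p):.3f}%")
```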

Quality score encoding
• Formula: score + offset => look up the ASCII symbol
• Two variants:
offset=64 (Illumina 1.0 to before 1.8);
offset=33 (Sanger, Illumina 1.8+).
• A quality score is typically in [0, 40]
(33): !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
(64): @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh
Figures from Wikipedia
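The "score + offset" rule is a one-liner in each direction. The sketch below assumes plain ASCII quality strings; `decode_quality` and `encode_quality` are illustrative names. Encoding the scores 0 to 40 reproduces the two symbol rows above.

```python
from typing import Iterable, List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Turn a FASTQ quality string into integer scores (symbol code minus offset)."""
    scores = [ord(symbol) - offset for symbol in quality]
    if any(score < 0 for score in scores):
        raise ValueError("negative score: the offset is probably wrong for this file")
    return scores


def encode_quality(scores: Iterable[int], offset: int = 33) -> str:
    """Turn integer scores into a FASTQ quality string (score plus offset)."""
    return "".join(chr(score + offset) for score in scores)


print(encode_quality(range(41), offset=33))
print(encode_quality(range(41), offset=64))
print(decode_quality("!''*", offset=33))
```

Decoding an offset-33 file with offset 64 yields negative scores for its low-quality symbols, which is why the sketch raises an error in that case instead of returning them.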

What can you do with FASTQ files?
• Quality control: quality score distribution, GC
content, k-mer enrichment, etc.
• Preprocessing: adapter removal, low-quality
reads filtering, etc.
GATTTGGGGTTCAAAGCAGTATCGATCAAA
!''*((((***+))%%%++)(%%%%).1**
[Figure panels: mean quality, GC content, k-mer enrichment, adapter? (miRNA), quality …]
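Two of the quality-control measures named above, GC content and mean quality, can be computed per read as follows. This is a sketch assuming offset-33 quality strings; the filtering threshold of 20 is an arbitrary illustration, not a recommendation from the slides.

```python
def gc_content(sequence: str) -> float:
    """Fraction of bases in the read that are G or C."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def mean_quality(quality: str, offset: int = 33) -> float:
    """Average Phred score of a read, given its ASCII quality string."""
    if not quality:
        return 0.0
    return sum(ord(symbol) - offset for symbol in quality) / len(quality)


def passes_filter(quality: str, min_mean: float = 20.0, offset: int = 33) -> bool:
    """Keep a read only if its mean quality reaches min_mean (assumed threshold)."""
    return mean_quality(quality, offset) >= min_mean


seq = "GATTTGGGGTTCAAAGCAGTATCGATCAAA"
qual = "!''*((((***+))%%%++)(%%%%).1**"
print(gc_content(seq), mean_quality(qual), passes_filter(qual))
```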

Short read alignment
• Many choices: BWA, Bowtie, MAQ, SOAP, STAR, TopHat, etc.
[Figure: genomic reference sequence -> index; FASTQ files + index -> alignments]

Alignment format
[Figure: Bowtie, ELAND, BWA, SOAP, MAQ, and SHRiMP alignment outputs all converge on SAM]

The SAM format
[Figure: a short read aligned to the reference sequence, showing a mismatch and indels (insertion, deletion)]
1. seqid
2. chromosome
3. position
4. mapping quality
5. CIGAR: description of alignment operations
6. sequence
7. quality

The SAM specification
https://github.com/samtools/hts-specs
An example line:
MARILYN_0005:2:77:7570:3792#0 97 1 12017 1 76M = 12244 303 ACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGGAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCAT IHGIIIIIIIIIIIIGGDBDIIHIIEIGDG=GGDDGGGGEDE>CGDG<GBGGBGDEEGDFFEB>2;C<C;BDDBB8 AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:32C43 YT:Z:UU XS:A:+ NH:i:3 CC:Z:15 CP:i:102519078 HI:i:0
N = hundreds of millions
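A SAM alignment line is tab-delimited: eleven mandatory columns followed by optional `TAG:TYPE:VALUE` fields. The sketch below splits the example line into those columns using the field names from the SAM specification; `parse_sam_line` and `parse_cigar` are illustrative helpers, not part of samtools, and the tabs are restored here because the slide shows the line with spaces.

```python
import re
from typing import Dict, List, Tuple

SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> Dict[str, object]:
    """Split one alignment line into its mandatory fields plus a dict of tags."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record: Dict[str, object] = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    tags: Dict[str, object] = {}
    for item in columns[len(SAM_FIELDS):]:
        tag, value_type, value = item.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    record["TAGS"] = tags
    return record


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Break a CIGAR string such as '76M' into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


example = "\t".join((
    "MARILYN_0005:2:77:7570:3792#0 97 1 12017 1 76M = 12244 303 "
    "ACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGGAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCAT "
    "IHGIIIIIIIIIIIIGGDBDIIHIIEIGDG=GGDDGGGGEDE>CGDG<GBGGBGDEEGDFFEB>2;C<C;BDDBB8 "
    "AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:32C43 YT:Z:UU XS:A:+ "
    "NH:i:3 CC:Z:15 CP:i:102519078 HI:i:0"
).split())
record = parse_sam_line(example)
print(record["RNAME"], record["POS"], parse_cigar(record["CIGAR"]), record["TAGS"]["NM"])
```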

BAM: the binary version of SAM
• SAM files are large: 1M short reads => 200MB; 100M short reads => 20GB.
• Compression makes sense.
• BAM: Binary sAM; compressed using the gzip library.
• Two parts: compressed data + index
• Index: random access (visualization, analysis, etc.)

Layout of binary BAM file
[Figure: hundreds of millions of short read alignments stored in gzip blocks along the chromosome]
Query: q = chr:X–Y
Time: O(n), n = #alignments

A naïve approach
[Figure: each base-pair of the chromosome points into the gzip blocks]
One index per base-pair?
Wait, the human chr1 is about 250Mb long!

A binning strategy
Assume all alignments are sorted according to genomic coordinates.
Chromosome bins: e.g., bin = 16Kb each, ~10,000 indices per chromosome
[Figure: each chromosome bin points into the gzip blocks]
Alignments that cross bin boundaries: long alignment, RNA splicing

Hierarchical binning and linear index
Binning:
Level 0: bin 0, 512Mb
Level 1: bins 1 to 8, 64Mb each
. . .
Level 5: 16Kb each
Linear Index:
16Kb tiling windows: file offset of the left-most alignment that overlaps the window
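The bin that holds an alignment can be computed from its coordinates alone. The sketch below is a Python transcription of the `reg2bin` calculation described in the SAM specification for this six-level scheme (512Mb down to 16Kb); treat it as an illustration and check it against the specification before relying on it. `linear_window` is a hypothetical helper name for the 16Kb window number used by the linear index.

```python
def reg2bin(beg: int, end: int) -> int:
    """Smallest bin that fully contains the 0-based, half-open interval [beg, end)."""
    end -= 1  # switch to the last covered position
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)  # level 5: 16Kb bins, numbered from 4681
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)  # level 4: 128Kb bins, numbered from 585
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)   # level 3: 1Mb bins, numbered from 73
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)   # level 2: 8Mb bins, numbered from 9
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)   # level 1: 64Mb bins, numbered from 1
    return 0                                        # level 0: the single 512Mb bin


def linear_window(pos: int) -> int:
    """Index of the 16Kb tiling window that contains a 0-based position."""
    return pos >> 14


print(reg2bin(0, 100))          # fits inside the first 16Kb bin
print(reg2bin(16383, 16385))    # straddles a 16Kb boundary, so it moves up one level
print(linear_window(16384))
```

An alignment that crosses a bin boundary is stored in the smallest parent bin that contains it, so long or spliced alignments end up in the upper levels.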

A hypothetical example
[Figure: top-level bin 0 spans child bins 1 to 4; alignments a to h and the query region q are drawn along the chromosome]
bin 0: f, g, h
bin 1: a
bin 2: b
bin 3: c, d
bin 4: e
Query q:
1. bins(q): [0, 3];
2. Candidate alignments: f->h->c->d->g;
3. LinearIndex(3): start(h), which is larger than end(f);
4. Remove f without reading;
5. Read h, c, d;
6. start(d) is larger than boundary(q);
7. Stop without reading g.
Done: saved TWO disk seeks!

From alignment to read depth
• Read depth: the number of times a base-pair
is covered by aligned short reads.
• Can be normalized: depth / library size * 1E6
= read depth per million aligned reads.
• Many tools to use: samtools depth, bedtools,
and so on.
[Figure: example of alignments stacked over the reference, labeled 1 2 3 4]
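Read depth and its per-million normalization can be sketched in a few lines. The example assumes alignments are given as 0-based, half-open (start, end) intervals on a single chromosome; the intervals and the library size below are made-up toy values, and the function names are illustrative. In practice, tools such as `samtools depth` do this directly on BAM files.

```python
from typing import Iterable, List, Tuple


def read_depth(alignments: Iterable[Tuple[int, int]], length: int) -> List[int]:
    """Per-base depth over [0, length) from 0-based, half-open (start, end) intervals."""
    delta = [0] * (length + 1)  # +1 where an alignment starts, -1 just past its end
    for start, end in alignments:
        start, end = max(start, 0), min(end, length)
        if start < end:
            delta[start] += 1
            delta[end] -= 1
    depth, running = [], 0
    for change in delta[:length]:
        running += change
        depth.append(running)
    return depth


def per_million(depth: List[int], library_size: int) -> List[float]:
    """Normalize: depth / library size * 1E6 = depth per million aligned reads."""
    return [d / library_size * 1e6 for d in depth]


toy_alignments = [(0, 4), (2, 6), (3, 8)]   # hypothetical reads
depth = read_depth(toy_alignments, length=10)
print(depth)
print(per_million(depth, library_size=2_000_000))
```

The difference-array approach touches each alignment once, so the cost grows with the number of reads plus the region length, not their product.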

Describing depth: the Wiggle format
• Line-oriented text file, two options: variable
step and fixed step.
variableStep chrom=chr1 span=2
100 1
1000 2
10000 3
[Figure: chr1 with values 1, 2, 3 drawn at positions 100, 1000, 10000]

Wiggle: fixed step
fixedStep chrom=chr1 start=100 step=100 span=3
1
2
3
[Figure: chr1 with values 1, 2, 3 drawn at positions 100, 200, 300]
For a reference tiled into windows of width w:
fixedStep chrom=chr? start=??? step=w span=w
Dump your data here
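The "tile the reference into windows of width w and dump the data" recipe can be sketched as below. It assumes a per-base depth list for one chromosome, averages it over non-overlapping windows, and emits a fixedStep declaration with step = span = w. Wiggle positions are 1-based, hence the default `start=1`. The function name and the toy depth values are illustrative.

```python
from typing import List, Sequence


def to_fixed_step(chrom: str, depth: Sequence[float], window: int, start: int = 1) -> List[str]:
    """Average depth over non-overlapping windows and return fixedStep wiggle lines."""
    lines = [f"fixedStep chrom={chrom} start={start} step={window} span={window}"]
    # A trailing partial window is dropped in this sketch.
    for i in range(0, len(depth) - window + 1, window):
        chunk = depth[i:i + window]
        lines.append(f"{sum(chunk) / window:g}")
    return lines


toy_depth = [1, 1, 2, 3, 2, 2, 1, 1, 0, 0]   # hypothetical per-base depth
wiggle = to_fixed_step("chr1", toy_depth, window=5)
print("\n".join(wiggle))
```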

If you have very large wiggle files…
• Wiggle files can be huge: average per 10bp window => 300M
elements for human genome.
• Makes sense to compress and index.
[Figure: wiggle data packed into gzip blocks]

Genome browser
UCSC genome browser
Pros: very comprehensive
Cons: data have to be transmitted via network
vs.
A locally installed browser
Pros: locally installed
Cons: less genome annotation

Genome browsers: lots of options
Wiki: 34 in total
and that is not all!

DEMO: GENOME BROWSER
Alignment, BAM, Wiggle, Peak calling, BED…

NGS.PLOT: GLOBAL VISUALIZATION FOR NEXT-GENERATION SEQUENCING DATA

A genome is a huge collection of functional elements
• TSS: transcription start site
• TES: transcription end site
• Exon: mRNA components
• CpG island: has roles in gene regulation and evolution
• Enhancer: activates genes
• DNase hypersensitive site: where TFs bind
• And more…
Images from Google image search

Categorizing functional elements
TSS, TES, Enhancer, CpG island, Exon
[Figure: genome browser (GB) view of H3K4me3 at TSS1, TSS2, TSS3, TSS4, TSS5, ..., summarized as an average profile and a heatmap]
Chrom | Start | End
chr1 | 100 | 101
chr2 | 200 | 201
...

Step 1: choose a region of interest
Where to download?
Which database to use?
What kind of formats do they use?
0-based coordinates? 1-based coordinates?
Subset regions by function?
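The coordinate question matters because formats disagree: BED uses 0-based, half-open intervals, while SAM and GFF/GTF use 1-based, inclusive positions. A minimal sketch of the conversion, with illustrative function names:

```python
from typing import Tuple


def one_based_to_bed(start: int, end: int) -> Tuple[int, int]:
    """1-based inclusive [start, end] -> 0-based half-open [start - 1, end)."""
    return start - 1, end


def bed_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """0-based half-open [start, end) -> 1-based inclusive [start + 1, end]."""
    return start + 1, end


def tss_to_bed(chrom: str, tss_position: int) -> str:
    """BED line for a single-base TSS given as a 1-based position."""
    start, end = one_based_to_bed(tss_position, tss_position)
    return f"{chrom}\t{start}\t{end}"


print(tss_to_bed("chr1", 101))
print(bed_to_one_based(100, 101))
```

A single base at 1-based position 101 becomes the BED interval 100 to 101, so an interval's length is always end minus start in BED.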

ngs.plot collects lots of genome annotations
Variable | Count | Description
Database | 3 | RefSeq, Ensembl and ENCODE
Genome | 9 | dm3, hg19, mm10, mm9, rn4, rn5, sacCer3, Tair10, Zv9
Biotype | 7 | tss, tes, genebody, cgi, dhs, enhancer, exon
Gene type | 4 | protein_coding, lincRNA, miRNA, pseudogene
Cell line | 9 | Gm12878, H1hesc, Hepg2, Hmec, Hsmm, Huvec, K562, Nhek, Nhlf
Total functional elements: 15,944,952

Step 2: plot something at this region
[Figure: H3K27me3 average profile and heatmap, annotated with: SEM bar, smoothing, shade, flanking region, robust estimation]
Gene ranking algo.: total, hc, none, diff, prod, pca, max

ngs.plot: a global visualization tool for NGS data
• Written in R; an easy-to-use command-line program.
ngs.plot.r -G genome -R tss -C chipseq.bam -O output

Testing biological hypotheses with NGS data
Ian Maze, Allis lab, Rockefeller
[Figure: nucleosomes carrying H3 Var A or H3 Var B, profiled by ChIP-seq]
Bioinformatician: understand questions, transform -> analytics
Time spent: Super, Not bad, Normal…

Visualization the ngs.plot way
A B H3 RNA-seq
Config file:
A.bam -1 “A”
B.bam -1 “B”
H3.bam -1 “H3”
ngs.plot.r -G mm9 -R genebody -C config.txt -GO diff -O XXX diff
Export gene order list: go.txt
ngs.plot.r -E go.txt -G mm9 -R genebody -F rnaseq -C RNA.bam -GO none -O YYY
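The config file and the first command above can be generated from a script. The sketch below only builds the text and the argument list; it does not run ngs.plot. It assumes the config columns (BAM file, gene list or -1, title) are tab-delimited with the title in plain double quotes, which should be checked against the ngs.plot documentation, and it uses only the flags that appear on this slide. The helper names are hypothetical.

```python
import shlex
from typing import List, Sequence, Tuple


def config_text(rows: Sequence[Tuple[str, str, str]]) -> str:
    """One line per sample: BAM file, gene list (-1 as on the slide), quoted title."""
    # Assumption: columns are separated by tabs.
    return "".join(f'{bam}\t{genes}\t"{title}"\n' for bam, genes, title in rows)


def ngsplot_command(genome: str, region: str, config_path: str,
                    gene_order: str, output: str) -> List[str]:
    """Argument list mirroring the multi-sample call shown on the slide."""
    return ["ngs.plot.r", "-G", genome, "-R", region,
            "-C", config_path, "-GO", gene_order, "-O", output]


rows = [("A.bam", "-1", "A"), ("B.bam", "-1", "B"), ("H3.bam", "-1", "H3")]
print(config_text(rows), end="")
command = ngsplot_command("mm9", "genebody", "config.txt", "diff", "XXX")
print(shlex.join(command))
```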

ngs.plot is also available on Galaxy!
URL: https://ineuron.mssm.edu/galaxy

DEMO: NGS.PLOT
Global visualization made easy…
