20150601 bio sb_assembly_course

High-throughput sequencing technologies in
genome assembly
Hans Jansen

Dutch SME at Bioscience Park in Leiden, the Netherlands
• High throughput drug screens, and toxicity assays in zebrafish larvae
• Fish fertility (eel, pike perch, sole) to aid sustainable aquaculture
• Sequencing (genomes, transcriptomes)
• Bioinformatics
ZF-screens B.V.

Common carp (Cyprinus carpio)
High throughput screening model
Genome and transcriptomes
European and Japanese eel (Anguilla anguilla and Anguilla japonica)
Completing the life cycle in aquaculture
King cobra (Ophiophagus hannah)
Evolution and toxins
Some examples of genome projects

Chemical cleavage (Maxam and Gilbert)
Chain termination (Sanger, Nicklen, and Coulson)
Throughput: 5 samples, 1 Kb/day, micrograms
of ssDNA needed
1977 2000 2011
Massively parallel
signature sequencing
(Brenner)
SMRT (Pacific
Biosciences)
Throughput: 3×10^9 samples, 55 Gb/day,
single molecule of DNA needed
A brief history of DNA sequencing

February 1977: Maxam and Gilbert
Chemical cleavage: Modify nucleotides and cut at the modified position.
December 1977: Sanger, Nicklen, and Coulson
Chain termination: Use modified nucleotides to stop the
extension of a newly synthesized DNA strand.

Maxam and Gilbert sequencing was abandoned relatively soon: it was technically
complex and used some nasty chemicals and radioactivity.
The Sanger sequencing method was improved over the years and became the method
of choice for sequencing the first draft of a human genome.
• Thermostable polymerases alleviated the need for ssDNA template
• Fluorescent dye terminators to combine all four reactions in one.
• Automation of the separation of the DNA fragments.
Shotgun sequencing was already used by Sanger to sequence lambda DNA and proved
to be a powerful tool to sequence and assemble larger DNA molecules and even whole
genomes.

To make assembly easier partially overlapping BAC clones from the genome were first
selected and then sequenced and assembled by the shotgun method.
gDNA
BAC
This was a laborious method, and later a whole genome shotgun approach was used.

Genomic DNA
Break the DNA in < 1 Kb fragments
Polish the ends of the DNA and adenylate them
Ligate adapters to the ends of the DNA
Amplify paired end library
Bind ss-library to flowcell
Making a paired end library

Attach and cluster the library on a carrier

2 x 50 bp
Generate large fragments by shearing,
and label the ends with biotin (green dash).
Self ligate fragments in large volume,
and shear the circular fragments (black dash).
Isolate the biotinylated fragments, convert them to a
paired end library and sequence them (red arrows).
Problem: part of these fragments have unconvertible ends.
Problems: larger fragments will self ligate inefficiently.
Nicks in the DNA will enable digestion of circularized molecules
The above mentioned problems limit the library to ~10 kb insert size and they tend to have a low number of
unique fragments.
Obtaining scaffolding information: mate pairs

Generate large fragments by shearing, isolate
~39 kb fragments and clone them in an adapted fosmid
vector which contains insert-flanking EcoP15I
sites (purple dash).
Cut with EcoP15I which leaves a 26 bp
overhang, end repair fragments and self ligate.
PCR the diTag library from these fragments, and
sequence the 52 bp inserts.
Problem: These large fragments will ligate inefficiently in the
fosmid vector leading to low complexity libraries.
Obtaining scaffolding information: Fosmid diTags

Library  Insert       Reads           Gbp   Coverage  Span
PE200    <155 bp      2 × 76 nt       21.9  14.6×     -
PE280    230–305 bp   2 × 151 nt      11.0  7.3×      -
PE500    370–485 bp   2 × 50–151 nt   19.3  12.9×     1.2×
MP2K     1.6–2.4 Kbp  2 × 36 nt       5.4   -         4.5×
MP7K     4–6 Kbp      2 × 51 nt       2.3   -         0.6×
MP10K    6.5–10 Kbp   2 × 51 nt       5.3   -         7.7×
MP15K    9–13 Kbp     2 × 51 nt       3.8   -         8.8×
Total                                 69    34.8×     22.9×
King cobra sequence data
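The coverage values for the paired end libraries follow from dividing the sequenced bases by the genome size. A minimal sketch in Python, assuming a genome size of about 1.5 Gbp: this value is not stated on the slide, it is an assumption that matches the yield-to-coverage ratios, and `coverage` is a hypothetical helper name.

```python
# Sequenced yield per paired end library in Gbp, taken from the table above.
PE_YIELD_GBP = {"PE200": 21.9, "PE280": 11.0, "PE500": 19.3}

# ASSUMPTION: genome size of ~1.5 Gbp; not stated on the slide.
ASSUMED_GENOME_SIZE_GBP = 1.5


def coverage(yield_gbp, genome_size_gbp=ASSUMED_GENOME_SIZE_GBP):
    """Return sequence coverage: sequenced bases divided by genome size."""
    return yield_gbp / genome_size_gbp


for library, gbp in PE_YIELD_GBP.items():
    print(f"{library}: {coverage(gbp):.1f}x")
```

With a different genome size estimate the coverage scales inversely, which is why the figure is best read as approximate.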

Read merging
If the two reads of a paired end fragment overlap they can be merged into a single
longer read
• We used our own script since nothing was available at the time
• Now there are a number of tools: FLASH, SHERA, SeqPrep
• Paired end libraries need to be prepared with the read length in mind, and
size-selected as narrowly as possible.
~600 bp
~270 bp
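The idea behind read merging can be sketched in a few lines of Python. This is not the script used for the cobra project, nor the algorithm of FLASH, SHERA or SeqPrep; it is a simplified illustration that ignores base qualities and assumes the fragment is longer than either read. The names `merge_pair`, `min_overlap` and `max_mismatch_frac` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10, max_mismatch_frac=0.1):
    """Merge a read pair if the end of read1 overlaps the start of
    the reverse complement of read2.

    Returns the merged sequence, or None if no acceptable overlap is found.
    The longest acceptable overlap wins; on a mismatch the base of read1 is kept.
    """
    mate = reverse_complement(read2)
    for overlap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        tail, head = read1[-overlap:], mate[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * overlap:
            return read1 + mate[overlap:]
    return None


# Toy example: a made-up 40 bp fragment read from both ends with 25 bp reads.
fragment = "ACGTTGCATGCCATGGATCCGATTACAGGCTTAACGTCAG"
read1 = fragment[:25]
read2 = reverse_complement(fragment[-25:])
merged = merge_pair(read1, read2, max_mismatch_frac=0.0)
print(merged == fragment)
```

The sketch also shows why a narrow size selection matters: if many fragments are longer than the sum of the two read lengths, no overlap exists and the pair cannot be merged.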

[Figure: % of the assembly versus fragment size (bp, 10^2 to 10^6); series: + 500 bp, + 2 Kbp, + 7 Kbp, + 10 Kbp, + 15 Kbp]
Assembly (cobra)

Contigs
N50 3982 bp
largest 70 Kbp
number 1186408
Total length 1.45 Gbp
Scaffolds
N50 226 Kbp
largest 2.84 Mbp
number 716551
Total length 1.66 Gbp
number of genes 22183
King cobra sequence assembly
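The N50 values quoted above summarize how contiguous an assembly is: the N50 is the length L such that contigs (or scaffolds) of length L or longer together hold at least half of the total assembly length. A minimal Python sketch, with a made-up list of lengths for illustration (`n50` is a hypothetical helper name):

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths.

    Sequences are taken from longest to shortest until their summed length
    reaches half of the total; the length reached at that point is the N50.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Made-up contig lengths (bp), not data from the cobra assembly.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because scaffolding joins contigs into longer sequences, the scaffold N50 is normally much larger than the contig N50, as in the figures above.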

Genome Res. 2007 17: 240-248
This is a method to sequence (a small) part of a genome, and do this for
multiple siblings.
From the sequence data SNPs can be identified and used as markers to build a
genetic map of this genome.
Analysis of the spotted gar genome cut with SbfI in the parents and 94
individuals from their progeny produced 8406 markers in 29 linkage groups.
Generating a RAD-tag genetic map

From Baird, PLoS ONE 2008
This can be done with multiple samples
when using barcodes
After adding the barcodes all samples can
be pooled to reduce workload
Pools of short fragments from different
individuals.
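After sequencing a pool, the reads have to be assigned back to individuals by their barcode. A minimal Python sketch of this demultiplexing step, assuming the barcode is the first part of each read and must match exactly; the barcode sequences and sample names are made up, and `demultiplex` is a hypothetical helper name.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Assign each read to a sample by exact match of its leading barcode.

    reads    -- iterable of read sequences that start with a barcode
    barcodes -- dict mapping barcode sequence to sample name; no barcode
                should be a prefix of another one
    Returns a dict: sample name -> list of reads with the barcode trimmed.
    Reads without a known barcode are collected under "unassigned".
    """
    by_sample = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])
                break
        else:
            by_sample["unassigned"].append(read)
    return dict(by_sample)


# Made-up barcodes and reads for illustration.
barcodes = {"ACGT": "parent_1", "TGCA": "offspring_1"}
reads = ["ACGTGGATTC", "TGCAGGATTA", "ACGTGGATTA", "NNNNGGATTC"]
print(demultiplex(reads, barcodes))
```

Real pipelines usually tolerate a sequencing error in the barcode; this sketch deliberately does not.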

Amores A et al. Genetics 2011;188:799-808

Long DNA molecules, fluorescently labeled at specific sites, are linearized in
nanochannels and imaged. The fluorescent fingerprints of each molecule can
be assembled and linked to contigs and scaffolds.
Optical mapping: BioNano Genomics
Gabino Sanchez-Perez's lecture at 15.00 hrs will explain this in much
more detail and show some great examples how to use this technology.

Just a genome is usually not the goal of a de novo sequencing project.
Based on the general structure of a gene, gene predictions can be made.
[Figure: general structure of a gene. Labels: ATG, exon (×4), intron, splice donor site, branch site, pyrimidine-rich ("Pyrich") tract, splice acceptor site, "20-50 bases", STOP, polyadenylation signal]
RNAseq reads can help validate predictions
Annotation of the genome

Different flavors of RNAseq
• Stranded dUTP RNAseq: a simple modification of the standard prep gives
information on the strandedness of the transcript.
• RNAseq with minimal quantities of RNA: a great tool to look at small
numbers of (FACS sorted) cells
• CAGE: ideal for finding transcription start sites
• smallRNA: to explore the miRNA content of a sample
Transcriptome sequencing

Disadvantages of next generation sequencing:
• Complex sample preparation including PCR amplification.
• High run costs.
• Long run times.
• Short reads
Changes needed:
• Single molecule analysis
• Reading sequences at a high speed
• Highly parallel
• Long reads >10kb
• No errors
Long reads: what do we want?

Pacific Biosciences PacBio RS II
Available since 2010
Oxford Nanopore Technologies MinION
Available since 2014
Generating long reads

Pacific Biosciences PacBio RS II
It uses a zero mode waveguide
to measure fluorescence in a
very small volume.

Fragment gDNA, polish the ends, and add adenosine.
Ligate hairpin adapters.
Attach polymerase, load on SMRT cell and sequence.
DNA polymerase
Transparent bottom of
zero mode waveguide
Pacific Biosciences

Pacific Biosciences P6-C4
• Yield 0.5-1 Gbp/SMRT cell.
• Since no amplification is done you
sequence the DNA as it comes out of your
sample (nicks, base modifications).
• There is very little sequence bias and there are no
systematic errors
Christoph Konig’s lecture at 14.15 hrs will delve much deeper into this technology.

• Started to work on nanopore sensing in 2005
• Investments to date 180 million GBP (227 M€)
• ~200 employees
• Broad IP portfolio
• Announced products: MinION and PromethION systems
• Access program for MinION (MAP)
Oxford Nanopore Technologies

But MAP is much more. It is about being a community and a playground to test new
applications.
The last part of the development of this technology is done “in the field” in a fairly open
program.
Hundreds of MinIONs were sent around the globe to see how they would behave in real life.
MAP is visible as a web portal with information from ONT and a social-media-like system
with blogs, comments, likes, and a forum to ask for advice.
MinION access program

[Figure labels: tethering oligo, motor protein, brake protein, hairpin, abasic nucleotides]
Shear (optional)
DNA repair (optional), AmpureXP purification
end repair, AmpureXP purification
A tailing, AmpureXP purification
Ligation, His-tag purification,
Dilution in run buffer and ATP
A MuA transposase protocol is under development. This should further
simplify sample preparation (10 minutes).
Library preparation

Tethering oligo
Motor protein E5
Brake protein E3
hairpin
abasic nucleotides
Tether keeps DNA fragment on the membrane leading to a ~20K fold higher DNA
concentration close to the pore.
Motor protein unwinds DNA and ratchets it through the pore.
Abasic nucleotides in the hairpin are a recognition point.
Brake protein prevents the motor protein from zipping through the complement strand.
Sequencing

Stills taken from: https://www.nanoporetech.com/news/movies#movie-24-nanopore-dna-sequencing
Strand sequencing
ATP

GGCTCACTCCCATAAGC
GGCTC
GCTCA
CTCAC
TCACT
CACTC
ACTCC
CTCCC
Raw Data (ionic current, pA)
Events (with time domain)
Squiggle (events with time domain removed)
Sensing the DNA
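The list of 5-mers above is what you get by sliding a window of five bases along the strand one base at a time. A minimal Python sketch (`kmers` is a hypothetical helper name; k = 5 is taken from the example on the slide):

```python
def kmers(sequence, k=5):
    """Return all overlapping k-mers of a sequence, one per base step."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# The sequence shown on the slide; the first seven 5-mers match the list above.
for kmer in kmers("GGCTCACTCCCATAAGC")[:7]:
    print(kmer)
```

Each base away from the ends of the strand is part of k consecutive k-mers, so every base contributes to several successive current levels.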

Squiggle plot for a complete read
First the template part in blue, then the abasic nucleotides in the hairpin in red, and
finally the complement part in turquoise.
Alignment of template and complement squiggles gives a 2D read.
Squiggle plot

MinKNOW controls the run and shows channel states…..
Interactive interface

….. and amount of events vs read length.
Metrichor agent runs in the background to send sequence files to and from
the (cloud based) base caller.
MinKNOW can interact with other software.
minoTour analyses reads in a streaming mode and can control MinKNOW.
Interactive interface

template mean 8734 bp complement mean 8126 bp 2D mean 9930 bp
Read length is limited by the non-nicked fragment length rather than by the system.
My longest 2D read until now: 93.5 Kbp, template 120 Kb.
Read length distribution

There are actually 4 wells per detection
channel. QC at the beginning of the
run determines the quality of the
4 wells. Sequencing starts on the best
set of wells. Every 24 hrs the next best
set of wells is chosen.
Yield over time

ref TGATGTATATGCTCTCTTTTCTGACGTTAGTCTCCGACGGCAGGCTTCAA-TGACCC-A-GGCTGAGAAATTCCCGGACCCTTTTTGCTCAAGAGCGATG
|||||||||||||| |||||||||||| ||||||||||||||||||| |||||| | ||||||||||||||||||||| |||||| |||| | |
MinION TGATGTATATGCTC----TTCTGACGTTAGCCTCCGACGGCAGGCTTCAATTGACCCGATGGCTGAGAAATTCCCGGACCC--TTTGCTACAGAGTG-T-
ref TTAATTTGTTCAATCATTTGGTTAGGAAAGCGGATGTTGCGGGTTGTTGTTCTGCGGGTTCTGTTCTTCGTTGACATGAG---GTTGCCCCGTATTCAGT
|||||||||||||||||||||||||||||||||| ||| |||||| | |||| ||||||| ||| |||||| | || | || | | |
MinION TTAATTTGTTCAATCATTTGGTTAGGAAAGCGGA---TGC-GGTTGT--TCCTGC-GGTTCTG----TCG-TGACATCCGTTATTTGCGCTGT-TACGC
ref GTCGC-TGATTTGTATTGTCTGAAGTTGTTTTTACGTTAAGTTGATGCAGATCAATTAATACGATACCT--GCGTCATAATTGATTATTTGACGT--GGT
| || || |||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||| |||||||||||||||||||||||| |||
MinION ATGGCATGTTTTGTATTGTCTGAAGTTGTTTTTACGTTAAGTTAATGCAGATCAATTAATACGATACCTCGGCGTCATAATTGATTATTTGACGTGGGGT
Error rate lies around 15% for current chemistry (R7.3). A typical passing 2D R7.3 read now has
2.8% deletions, 2.7% insertions and 1.7% substitutions.
R8 and R9 nanopores are in the pipeline (improving on G/C rich reads and better S/N).
Errors
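Deletion, insertion and substitution rates can be counted directly from a gapped pairwise alignment like the one above. A minimal Python sketch, assuming rates are expressed per alignment column (other definitions exist, and the slide does not say which one was used); the function names are hypothetical, and the example uses the first 26 columns of the alignment shown.

```python
def count_alignment_columns(ref_row, read_row):
    """Classify every column of a gapped pairwise alignment.

    Both rows must have equal length, with '-' marking a gap.
    A gap in the read is a deletion; a gap in the reference is an insertion.
    """
    if len(ref_row) != len(read_row):
        raise ValueError("alignment rows must have equal length")
    counts = {"match": 0, "substitution": 0, "insertion": 0, "deletion": 0}
    for ref_base, read_base in zip(ref_row, read_row):
        if ref_base == "-" and read_base == "-":
            continue  # a column with two gaps carries no information
        if read_base == "-":
            counts["deletion"] += 1
        elif ref_base == "-":
            counts["insertion"] += 1
        elif ref_base == read_base:
            counts["match"] += 1
        else:
            counts["substitution"] += 1
    return counts


def error_rates(counts):
    """Return each error type as a fraction of all alignment columns."""
    columns = sum(counts.values())
    if columns == 0:
        raise ValueError("empty alignment")
    return {kind: n / columns for kind, n in counts.items() if kind != "match"}


ref = "TGATGTATATGCTCTCTTTTCTGACG"
read = "TGATGTATATGCTC----TTCTGACG"
counts = count_alignment_columns(ref, read)
print(counts)
print(error_rates(counts))
```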

Errors result from different parts of the system.
On the ASIC:
Events are missed by the translation from raw data to event data.
Solution: Sharpen up the raw data by playing with voltage and by new
nanopores with lower noise. Sequence faster.
In the base caller:
Bases outside the observed k-mer influence the current.
Solution: Higher k-mer models
Modified bases are currently not included in the k-mer model.
Solution: add modified k-mers to the model. Modified k-mers are
different from unmodified k-mers.
Errors

Throughput is defined by:
Number of channels. 512 on the MinION
Speed of translocation. 30 bp/sec
Occupancy of the pore. 90%
The time a Flow Cell can run. ~60 hrs.
Currently well over 1 Gb events.
On R7.3 this translates to ~400 Mb 2D data.
Throughput
In “fast mode” the MinION will read 500 bp/sec. Currently three MAP groups are
testing this. Throughput will increase to ~20 Gb in events.
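Multiplying the four factors gives a theoretical ceiling for a single run. A minimal Python sketch using the numbers from this slide (`max_bases` is a hypothetical helper name). The ceiling assumes every channel sequences at the stated occupancy for the whole run, which is an idealization; the yields quoted on the slide are lower, presumably because not every channel holds an active pore for the full run time.

```python
def max_bases(channels, bases_per_second, occupancy, run_hours):
    """Upper bound on bases read in one run if every channel behaved ideally."""
    return channels * bases_per_second * occupancy * run_hours * 3600


# Values from the slide: 512 channels, 90% occupancy, ~60 hr run.
standard = max_bases(channels=512, bases_per_second=30, occupancy=0.90, run_hours=60)
fast = max_bases(channels=512, bases_per_second=500, occupancy=0.90, run_hours=60)

print(f"standard mode ceiling: {standard / 1e9:.1f} Gb")
print(f"fast mode ceiling:     {fast / 1e9:.1f} Gb")
```

The ratio between the two modes is simply 500/30, so any gain beyond that has to come from more channels, higher occupancy or longer runs.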

Longest 2D read: 93.5 Kbp
Longest template read: 120 Kbp (231 Kbp)
Highest yield: 1.32 Gevents
[Figure: template and 2D yield over the past year. Base pairs sequenced (Mbp, 0 to 350) per run (runs 1 to 30) for template and 2D reads, covering the R6, R7 and R7.3 chemistries]

[Figure labels: repeat, unique sequence in, unique sequence out]
Long reads can help to resolve repeat areas in the assembly graph
And the resulting contigs will now look like this:
Untangle

Task                                          Software
1. Short read correction                      Quake (not for small genomes)
2. Short read assembly                        Velvet
3. MinION read alignment to Velvet contigs    LAST
4. Link filtering and contig tiling           Untangle script
5. Path detachment around repeats             Untangle script
6. Bubble popping                             Untangle script
7. Delete unconfirmed connections             Untangle script
8. Contig extraction                          Untangle script
Assembly and scaffolding strategy

Agrobacterium NCPPB 1771 assembly graph
25× transposon →
(1160 bp)
8× transposon →
(873 bp)
4× rRNA →
(6.4 Kb)
271 nodes, 311 connections
154 contigs
N50 = 198 Kb
Sum = 5.87 Mb

• Alignment: LAST with optimized settings
• Links: alignment filtering and contig tiling
• 7328 reads aligned to contigs
• 438 reads aligned to multiple contigs
• 585 links between contigs
• 13158 reads on R6 and R7 chemistry
• 73.8 Mb total yield (template and 2D)
• 5–85970 nt length, typical ~12 Kb
MinION sequencing and scaffolding
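A long read that aligns to more than one contig provides a link between those contigs. The sketch below illustrates that idea in Python; it is not the Untangle script, and it ignores orientation, distances and alignment quality. The input format (read id, contig id, start position of the alignment on the read) and the name `contig_links` are assumptions made for this example.

```python
from collections import Counter, defaultdict


def contig_links(alignments):
    """Count read-supported links between contigs.

    alignments -- iterable of (read_id, contig_id, read_start) tuples, one per
                  alignment of a long read to a contig
    Returns a Counter mapping an unordered contig pair to the number of reads
    in which the two contigs are hit by adjacent alignments.
    """
    hits_per_read = defaultdict(list)
    for read_id, contig_id, read_start in alignments:
        hits_per_read[read_id].append((read_start, contig_id))

    links = Counter()
    for hits in hits_per_read.values():
        ordered = [contig for _, contig in sorted(hits)]
        pairs_in_read = set()
        for left, right in zip(ordered, ordered[1:]):
            if left != right:
                pairs_in_read.add(tuple(sorted((left, right))))
        links.update(pairs_in_read)  # each read supports a pair at most once
    return links


# Made-up alignments for illustration.
alignments = [
    ("r1", "c1", 0), ("r1", "c2", 5000),
    ("r2", "c2", 100), ("r2", "c1", 7000),
    ("r3", "c3", 0),
    ("r4", "c2", 0), ("r4", "c3", 3000),
]
print(contig_links(alignments))
```

Counting how many independent reads support each link is one simple way to filter out spurious connections before scaffolding.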

Links between nodes are specific
Means link is confirmed by PCR

Final assembly graph after scaffolding
• 271 nodes + 312 connections → 49 nodes + 5 connections
• 154 contigs → ~8 contigs
• Complete chromosome 2 (1.2 Mb), pTi (190 Kb), cryptic megaplasmid (746 Kb)
• Slight residual fragmentation of chromosome 1

Reads are in HDF5 format and contain all data from the event data onwards.
A cloud based basecaller is provided by Oxford Nanopore Technologies.
The MAP community is actively developing software to use this type of data.
Some examples:
Jared Simpson’s pipeline to correct and assemble using only nanopore reads.
Live monitoring, alignments and feedback to the MinION.
Matt Loose’s minoTour.
Squiggle space aligners
Each base is measured 5 times in consecutive kmers so it makes sense to avoid
basecalling and work directly with the events (squiggle space)
Software

London Calling 2015
Highlights from Clive Brown’s talk
• Improvements to the basecaller.
• Read until (and barcoding).
• Fast mode on the MinION MkI (500 bp/sec instead of 30).
• New 3000 channel ASIC with “crumpet” chip design to separate ASIC and fluidics part.
• MinION MkII and PromethION will have this new ASIC.
• Library prep on beads to reduce amounts of DNA needed (lower ng to pg).
• Direct RNA sequencing.
• Simplified sample preparation and VolTRAX.
• Pricing will be “pay as you go”. The initial payment for hardware includes some hours of sequencing.
• MkI $270 and 3 hrs sequencing (~3 Gbp in fast mode).

London Calling 2015
Much emphasis on getting the library prep
simpler and faster to be able to leave the lab.
If the system leaves the lab many more
applications become possible.
VolTRAX

The technology underlying the MinION system is scalable, so
larger throughput can be made available relatively easily.
It will use the new ASIC design and will have 144000 channels.
Projected throughput: 6.4 Tbp/day.
This is too much data for cloud basecalling, so basecalling will be done locally.
Access Program will start later this year.
London Calling 2015
PromethION

Freek Vonk
Harald Kerkkamp
Asad Hyder
Michael Richardson
Christiaan Henkel
Paul Hooykaas
Ron Dirks
Guido van den Thillart
Herman Spaink
Pim Arntzen
Erwin Fakkert
Marten Boetzer
Walter Pirovano
Diana Uffink
R. Manjunatha Kini
Ken Kraaijeveld
Yavuz Ariyurek
Arnoud Schmitz
Yahya Anvar
Acknowledgments
Dan Turner
Oliver Hartwell
