2. 1953: Discovery of the structure of the DNA double helix
Nobel prize in Physiology or Medicine 1962
3. History of DNA sequencing
1953 Discovery of the structure of the DNA double helix
1972 Development of recombinant DNA technology, which permits isolation of defined fragments of DNA; prior to this, the
only accessible samples for sequencing were from bacteriophage or virus DNA.
1977 The first complete DNA genome to be sequenced is that of bacteriophage φX174
1977 Frederick Sanger publishes "DNA sequencing with chain-terminating inhibitors"
1984 Medical Research Council scientists decipher the complete DNA sequence of the Epstein-Barr virus, 170 kb.
1987 Applied Biosystems markets first automated sequencing machine, the model ABI 370.
1990 The U.S. National Institutes of Health (NIH) begins large-scale sequencing trials on Mycoplasma capricolum,
Escherichia coli, Caenorhabditis elegans, and Saccharomyces cerevisiae
1995 Craig Venter, Hamilton Smith, and colleagues at The Institute for Genomic Research (TIGR) publish the first complete
genome of a free-living organism, the bacterium Haemophilus influenzae. The circular chromosome contains 1,830,137
bases and its publication in the journal Science marks the first use of whole-genome shotgun sequencing, eliminating the
need for initial mapping efforts.
1996 Pål Nyrén and his student Mostafa Ronaghi at the Royal Institute of Technology in Stockholm publish their method of
pyrosequencing
1998 Phil Green and Brent Ewing of the University of Washington publish "phred" for sequencer data analysis.
2001 A draft sequence of the human genome is published
2004 454 Life Sciences markets a parallelized version of pyrosequencing. The first version of their machine reduced
sequencing costs 6-fold compared to automated Sanger sequencing, and was the second of a new generation of
sequencing technologies, after MPSS.
5. A breakthrough: fluorescent chain-terminating inhibitors
ABI PRISM 377
First generation DNA sequencer
• Manual preparation of acrylamide gels
• Manual loading of samples
• Contigs of 500-600 bp
• 2.4 million bp/year
(1000 years needed to sequence the human genome)
Automated DNA sequencer
• Capillary electrophoresis
• Costs reduced by 90%
• Human operation 15 min/day/machine
• 1 million bp/day
3730xl DNA Analyzer
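The throughput figures on this slide can be checked with a back-of-the-envelope calculation. This is a minimal sketch, assuming a human genome of roughly 3 billion bp (a round figure not stated on the slide), a single machine, and one-fold coverage:

```python
# Assumed round figure for the human genome size; not stated on the slide.
HUMAN_GENOME_BP = 3.0e9

def years_to_sequence(genome_bp, bp_per_year):
    """Years one machine needs to read genome_bp bases once (1X coverage)."""
    return genome_bp / bp_per_year

# First-generation sequencer: 2.4 million bp/year (from the slide).
slab_gel_years = years_to_sequence(HUMAN_GENOME_BP, 2.4e6)
# Automated capillary sequencer: 1 million bp/day (from the slide).
capillary_years = years_to_sequence(HUMAN_GENOME_BP, 1.0e6 * 365)

print(f"First-generation sequencer: ~{slab_gel_years:.0f} years")
print(f"Automated capillary sequencer: ~{capillary_years:.1f} years")
```

Under these assumptions the first-generation figure comes out at the same order of magnitude as the 1000 years quoted above.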
6. Next-generation sequencing (NGS):
newer methods for DNA sequencing
The potential of NGS technologies is akin to the early days of PCR, with one’s
imagination being the primary limitation of its use (Metzker ML, 2010, Nature Reviews Genetics)
NGS platforms produce an enormous volume of data cheaply, which expands the
realm of experimentation beyond just determining the order of bases:
gene-expression studies (RNA-seq)
identification of rare transcripts without prior knowledge of a particular gene
alternative splicing identification
large-scale comparative and evolutionary studies
re-sequencing of human genomes to enhance our understanding of how genetic
differences affect health and disease
7. NGS technologies overview
The variety of NGS features makes it likely that multiple platforms coexist
in the marketplace, with some having clear advantages for particular
applications over others
NGS platforms differ in template preparation, sequencing and imaging, and data
analysis
Commercially available technologies:
Roche/454
Illumina/Solexa
Helicos BioSciences
Life/APG – SOLiD system
Pacific Biosciences
Ion Torrent technology
Experimental
Nanopore sequencing
9. Roche/454 - Pyrosequencing
2. Pyrosequencing: a non-electrophoretic bioluminescence method that
measures the release of inorganic pyrophosphate by proportionally
converting it into visible light using a series of enzymatic reactions
(DNA)n + dNTP → (DNA)n+1 + PPi (catalysed by DNA polymerase)
Nucleotide incorporation generates light seen as a peak
in the Pyrogram trace
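The proportional relation between incorporation and light can be illustrated with an idealized pyrogram. This is a sketch only: the flow order "TACG", the example sequence, and the noise-free signal are assumptions for illustration, not details taken from the slide.

```python
def pyrogram(new_strand, flow_order="TACG", n_flows=8):
    """Return (nucleotide, signal) for each flow of an idealized run.

    new_strand is the strand being synthesized, 5'->3'. In each flow one
    dNTP is dispensed; the polymerase incorporates it as many times as it
    matches the next bases, and the light signal is taken to be exactly
    proportional to the number of incorporations (one PPi each).
    """
    signals = []
    pos = 0
    for i in range(n_flows):
        nt = flow_order[i % len(flow_order)]
        run = 0
        while pos < len(new_strand) and new_strand[pos] == nt:
            run += 1
            pos += 1
        signals.append((nt, run))
    return signals

def call_bases(signals):
    """Turn peak heights back into a sequence (homopolymer length = peak height)."""
    return "".join(nt * height for nt, height in signals)

trace = pyrogram("TTAGGGC")
print(trace)
print(call_bases(trace))
```

In a real instrument the peak height is measured with noise, which is why long homopolymers, whose peaks differ only slightly in relative height, are the hardest to call.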
Video http://www.youtube.com/watch?v=kYAGFrbGl6E
10. Roche/454 - Pyrosequencing
3. Imaging
Sequencing and de novo assembly of
the Mycoplasma genitalium genome
25 million bases in one four-hour run
96% coverage at 99.96% accuracy
100-fold increase in throughput over current
Sanger sequencing
Most errors result from a broadening of
signal distribution, particularly for long
homopolymers (seven or more bases), leading
to ambiguous base calls
Future directions:
increasing throughput by miniaturization
of the fibre-optic reactors
improvements to reduce cross-talk
between adjacent wells
13. Illumina/Solexa
1. Solid-phase amplification can produce 100-200 million spatially
separated clusters, providing free ends to which a universal sequencing
primer can be hybridized to initiate the NGS reaction
14. Illumina/Solexa
Sequencing by Cyclic Reversible Termination (CRT): CRT uses
reversible terminators in a cyclic method that comprises nucleotide
incorporation, fluorescence imaging and cleavage
1. A DNA polymerase, bound to the primed template, incorporates just one
fluorescently modified nucleotide
2. Unincorporated nucleotides are washed away and a four-color image is
acquired by total internal reflection fluorescence (TIRF) using two lasers
3. A cleavage step (TCEP, a reducing agent) removes the terminating group
and the fluorescent dye, restoring the 3’-OH group
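The three CRT steps can be sketched as a loop in which exactly one base is read per cycle. This is an idealized illustration: the dye colours assigned to each base and the example template are hypothetical, and no chemistry errors are modelled.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
# Hypothetical dye assignment, for illustration only.
DYE_OF = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
BASE_OF = {colour: base for base, colour in DYE_OF.items()}

def crt_cycles(template, n_cycles):
    """Simulate an idealized CRT run and return the colour imaged in each cycle.

    template is listed in the order the polymerase reads it, starting at
    the base next to the primer.
    """
    images = []
    for position in range(min(n_cycles, len(template))):
        # 1. Incorporation: exactly one terminated, dye-labeled nucleotide is added.
        incorporated = COMPLEMENT[template[position]]
        # 2. Wash and image: record the colour of the incorporated dye.
        images.append(DYE_OF[incorporated])
        # 3. Cleavage: terminator and dye are removed, so the next cycle can extend.
    return images

def call_bases(images):
    """Translate the per-cycle colours back into bases."""
    return "".join(BASE_OF[colour] for colour in images)

images = crt_cycles("AAACG", n_cycles=5)
print(images)
print(call_bases(images))
```

Because the terminator stops extension after one nucleotide, the homopolymer AAA in the template is read over three separate cycles rather than as one larger signal.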
16. Illumina/Solexa
Paired reads are very powerful in all areas of the analysis because they
provide very accurate read alignment and thus improve the accuracy and
coverage of consensus sequence and SNP calling
Video http://www.youtube.com/watch?v=77r5p8IBwJk
17. Illumina/Solexa
1861 publications...
Applications
DNA sequencing
Gene Regulation Analysis
Sequencing-based Transcriptome Analysis
SNPs and SVs discovery
Cytogenetic Analysis
ChIP-sequencing
Small RNA discovery analysis
A whole human genome sequence was determined in 8 weeks to an average depth
of ~40X, discovering ~4 million SNPs and ~400,000 SVs, many of them new (with an
error rate <1% for both over-calls and under-calls)
Considering the whole human genome sequencing as a clinical tool in the near
future: unravel the complexities of human variation in cancer and other diseases,
paving the way for the use of personal genome sequences in medicine and
healthcare
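The ~40X average depth quoted on this slide follows the usual relation depth = (number of reads × read length) / genome size. This is a minimal sketch of that arithmetic; the genome size and the 35 bp read length are assumed values for illustration and are not taken from the slide.

```python
import math

def mean_depth(n_reads, read_length_bp, genome_bp):
    """Average depth of coverage: total sequenced bases divided by genome size."""
    return n_reads * read_length_bp / genome_bp

def reads_needed(depth, read_length_bp, genome_bp):
    """Smallest number of reads giving at least the requested average depth."""
    return math.ceil(depth * genome_bp / read_length_bp)

GENOME_BP = 3_000_000_000   # assumed round figure for the human genome
READ_LENGTH_BP = 35         # assumed short-read length, for illustration only

n = reads_needed(40, READ_LENGTH_BP, GENOME_BP)
print(n, mean_depth(n, READ_LENGTH_BP, GENOME_BP))
```

Under these assumptions a 40X human genome needs a few billion short reads, which is why cluster densities in the hundreds of millions per run matter.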
18. Helicos BioSciences
The use of PCR is problematic for two reasons:
1. PCR introduces an uncontrolled bias in template representation because its
efficiencies vary as a function of template properties
2. PCR introduces errors (generating false-positive SNPs)
Single-molecule sequencing has been developed to circumvent these
problems
19. Helicos BioSciences
1. Template preparation: one-pass sequencing
The library preparation process is simple and fast and does not require the use of
PCR. It results in single-stranded poly(dA)-tailed templates
Poly(dT) oligonucleotides are covalently anchored to a glass cover slip at random
positions, and they are used both to capture the template strands and as primers for
sequencing
20. Helicos BioSciences
2. Sequencing
Each cycle consists of:
1. adding the polymerase and one
of the labeled nucleotides
2. rinsing, imaging of multiple
positions
3. cleavage of the dye labels
224 cycles were performed to
sequence the genome of the
M13 virus to an average depth
of >150X with 100% coverage
21. Helicos BioSciences
3. Imaging
The system showed higher error rates compared to the previous platforms, mostly
due to multiple incorporations in the presence of homopolymers
The two-pass sequencing improved the overall quality
23. Helicos BioSciences
ChIP-seq
Goren, A et al. (2010). Chromatin profiling by directly
sequencing small quantities of immunoprecipitated
DNA. Nat Methods 7, 47-49.
Methyl-seq
Pastor, WA et al. (2011). Genome-wide mapping of
5-hydroxymethylcytosine in embryonic stem cells.
Nature 473, 394-397.
Direct RNA sequencing
Ozsolak, F et al. (2010). Comprehensive
polyadenylation site maps in yeast and human
reveal pervasive alternative polyadenylation. Cell
143, 1018-1029.
cDNA-Based DGE, RNA-Seq and Small RNA
Sequencing
Ting, DT et al. (2011). Aberrant overexpression of
satellite repeats in pancreatic and other epithelial
cancers. Science 331, 593-596.
Lipson, D et al. (2009). Quantification of the yeast
transcriptome by single-molecule sequencing. Nat
Biotechnol 27, 652-658.
Video http://www.youtube.com/watch?v=TboL7wODBj4
24. Life/APG – SOLiD platform
Sequencing by ligation (SBL) uses another cyclic method that differs from
CRT in its use of DNA ligase and two-base-encoded probes
Life/APG has commercialized their SBL platform called support
oligonucleotide ligation detection (SOLiD)
25. Life/APG – SOLiD platform
SOLiD sequencing Chemistry
Two-base-encoded probes: an oligonucleotide
sequence in which two interrogation bases are
associated with a particular dye
(e.g. AA, CC, GG, TT are encoded with a blue dye)
there are 16 possible dinucleotide combinations, and each dye is
associated with 4 of them
The term "1,2-probes" indicates that the first and second
nucleotides are the interrogation bases. The
remaining bases consist of either degenerate or
universal bases
A phosphorothiolate linkage is present between the
fifth and sixth nucleotides of the probe sequence,
which is then cleaved with silver ions.
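Two-base encoding can be sketched as follows. The slide only states that AA, CC, GG and TT share one dye (blue); the code below assumes the commonly described scheme in which each base gets a 2-bit code and the colour of a dinucleotide is the XOR of the two codes. Colour 0 then corresponds to the blue group, and the numbering of the other three colours is arbitrary.

```python
from collections import defaultdict
from itertools import product

BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"

def encode(seq):
    """Colour-space encoding: one colour (0-3) per pair of adjacent bases."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]

def decode(first_base, colours):
    """Recover the bases from a known first base plus the colour calls."""
    bases = [first_base]
    for colour in colours:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ colour])
    return "".join(bases)

# Group the 16 dinucleotides by colour: each colour covers exactly 4.
groups = defaultdict(list)
for a, b in product("ACGT", repeat=2):
    groups[encode(a + b)[0]].append(a + b)

colours = encode("ATGGC")
print(dict(groups))
print(colours, decode("A", colours))
```

Decoding needs one known base (in practice the last base of the adapter), because a colour alone identifies a group of four dinucleotides rather than a single one.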
26. Life/APG – SOLiD platform
1. Emulsion-based sample preparation (emPCR)
2. Chemical crosslinking to an amino-coated glass surface
27. Life/APG – SOLiD platform
3. SBL protocol
Upon the annealing of a universal primer, a
library of 1,2-probes is added.
Ligation of complementary probes follows.
Four-color imaging
The ligated 1,2-probes are chemically
cleaved with silver ions to generate a 5’-PO4
group
The SOLiD cycle is repeated 9 more times (ten cycles in total)
28. Life/APG – SOLiD platform
3. SBL protocol
The extended primer is then stripped and four
more ligation rounds are performed, each with
ten ligation cycles
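The effect of the five ligation rounds can be made concrete by listing which template positions are interrogated. This is a sketch under two assumptions not spelled out on the slides: each cycle advances by five bases (the part of the probe left after cleavage between its fifth and sixth nucleotides), and each new round starts from a primer shifted back by one base. Position 1 is the first base after the original primer.

```python
from collections import Counter

STEP = 5     # bases retained per ligation cycle (assumed from the cleavage site)
ROUNDS = 5   # first round plus four more
CYCLES = 10  # ligation cycles per round

def interrogated_pairs(rounds=ROUNDS, cycles=CYCLES, step=STEP):
    """Yield (round, cycle, first_pos, second_pos) for every 1,2-probe ligation."""
    for r in range(rounds):          # round r uses a primer shifted back by r bases
        for k in range(cycles):
            first = 1 - r + k * step
            yield r, k, first, first + 1

coverage = Counter()
for _, _, first, second in interrogated_pairs():
    coverage[first] += 1
    coverage[second] += 1

print(sorted(coverage.items()))
```

Under these assumptions positions 1 to 46 are each interrogated twice, once as the first and once as the second base of a probe, which is the redundancy that two-base encoding relies on for error checking.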
29. Life/APG – SOLiD platform
ChIP-seq
Chromatin immunoprecipitation
sequencing (ChIP-Seq) on the SOLiD™
System Publication: Nature Methods,
(2009)
Chromosome length influences replication-induced topological stress
Publication: Nature (2011)
Methyl-seq
Increased methylation variation in
epigenetic domains across cancer types
Publication: Nature Genetics (2011)
Metagenomics
The carnivorous bladderwort (Utricularia,
Lentibulariaceae): a system inflates
Publication: Journal of Experimental
Botany (2010)
cDNA-Based DGE, RNA-Seq and Small RNA
Sequencing
Evolution of yeast noncoding RNAs
reveals an alternative mechanism for
widespread intron loss
Publication: Science (2010)
Video http://www.youtube.com/watch?v=nlvyF8bFDwM&feature=related
31. Pacific Biosciences
All the aforementioned methods use enzymatic activities and various
termination approaches, leading to short sequence reads (max. 350 bp)
Real-time DNA sequencing aims to exploit the high catalytic rate and the
high processivity of DNA polymerase, using the enzyme as a real-time
sequencing engine in order to obtain longer reads.
To fully harness the intrinsic speed, fidelity, and
processivity of the DNApol, several technical challenges must be met
simultaneously:
The speed at which each polymerase synthesizes DNA exhibits stochastic
fluctuation, so polymerases must be observed individually
A high nucleotide concentration is required, so the observation
volume must be reduced to allow single-molecule detection
DNApol has to work with 100% fluorescently labeled dNTPs
A surface chemistry is required that retains the activity of DNApol and inhibits
nonspecific adsorption of labeled dNTPs
32. Pacific Biosciences
Single Molecule Real Time (SMRT) DNA sequencing
The zero-mode waveguide (ZMW) design reduces the observation volume down to the zeptolitre
range (10^-21 l), reducing the number of stray fluorescently labeled molecules that enter the
detection layer in a given period
The residence time of phospholinked nucleotides in the active site is usually on the millisecond
scale, and corresponds to a recorded fluorescence pulse
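Why the observation volume matters can be seen from the mean number of labeled molecules it contains, N = concentration × volume × Avogadro's number. This is a rough sketch: the 10 µM nucleotide concentration and the 1 femtolitre comparison volume are assumed values for illustration, and only the 10^-21 l figure comes from the slide.

```python
AVOGADRO = 6.02214076e23  # molecules per mole

def mean_molecules(concentration_molar, volume_litres):
    """Mean number of molecules present in a volume at a given molar concentration."""
    return concentration_molar * volume_litres * AVOGADRO

CONCENTRATION = 10e-6   # 10 micromolar labeled dNTP (assumed)
ZMW_VOLUME = 1e-21      # zeptolitre range, as on the slide
LARGER_VOLUME = 1e-15   # 1 femtolitre, assumed comparison volume

in_zmw = mean_molecules(CONCENTRATION, ZMW_VOLUME)
in_larger = mean_molecules(CONCENTRATION, LARGER_VOLUME)
print(f"ZMW: {in_zmw:.4f} molecules on average")
print(f"1 fl: {in_larger:.0f} molecules on average")
```

With these numbers the zeptolitre volume is empty of free labeled nucleotides almost all of the time, so a nucleotide held in the polymerase active site stands out against the background.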
34. Pacific Biosciences
The initial read accuracy
was estimated at 83% at 1X.
Common errors were insertions,
deletions and mismatches.
At 15X coverage, the authors demonstrated
that the accuracy is >99%
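The jump from 83% at 1X to >99% at 15X is what one would expect from consensus calling. This is a simplified illustration, not the authors' method: it assumes each read calls a given base correctly with independent probability 0.83 and that the consensus is right whenever a strict majority of reads is right, ignoring the fact that real errors are mostly insertions and deletions.

```python
from math import comb

def consensus_accuracy(p_correct, depth):
    """Probability that more than half of `depth` independent reads are correct."""
    majority = depth // 2 + 1
    return sum(
        comb(depth, k) * p_correct**k * (1 - p_correct) ** (depth - k)
        for k in range(majority, depth + 1)
    )

for depth in (1, 5, 15):
    print(depth, round(consensus_accuracy(0.83, depth), 4))
```

Under this toy model the consensus accuracy at 15X is already above 99%, in line with the figure on the slide.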
In 2009, Pacific Biosciences
reported improvements to their
platform. E. coli was sequenced at
38X covering 99.3% of the genome,
with an accuracy of >99.999%
average read length: 964 bp
36. NGS technologies and personal genomes
Human genome studies aim to catalogue SNPs and SVs and their
association with phenotypic differences, with the eventual goal of personalized
genomics for medical purposes > Pharmacogenomics
Somatic mutations associated with acute myeloid leukemia have been identified
using Illumina/Solexa (Ley T.J. et al. 2008 Nature)
Elucidation of both allelic variants in a family with a recessive form of Charcot-Marie-Tooth disease using the SOLiD platform (Lupski J.R. et al. in press N. Engl. J. Med.)
The Cancer Genome Atlas aims at discovering SNPs and SVs associated with
major cancers (The Cancer Genome Atlas Research Network, 2011 Nature)
Beijing Genomics Institute (BGI) is working on the "1000 Plant & Animal
Reference Genomes Project" aiming at generating reference genomes for 1,000
economically and scientifically important plant/animal species. They use
Illumina/Solexa and SOLiD platforms
38. Sequencing services and the $1,000 genome
Illumina announced a personal genome sequencing service that
provides 30-fold base coverage for the price of $48,000.
Complete Genomics offers a similar service with 40-fold coverage
priced at $5,000. It is based on a business model that relies on
a huge customer volume. They use a newly optimized SBL protocol
based on combinatorial probe anchor ligation (cPAL). Reagents:
$4,400
The greatest challenge for current technology developers lies in
closing the gap between $10,000 and $1,000 for a single genome.
The timetable for the $1,000 draft genome is difficult to predict
Nanopore sequencing?
39. Nanopore sequencing
The system uses the Staphylococcus aureus toxin α-hemolysin, a robust
heptameric protein which normally forms holes in membranes.
DNA and RNA can be electrophoretically driven through a nanopore of
suitable diameter (Kasianowicz J.J. et al 1996 PNAS)
40. Nanopore sequencing – how does it work?
Hemolysin
When a small voltage (~100 mV) is imposed across
a nanopore in a membrane separating two
chambers containing aqueous electrolytes, the
ionic current through the pore can be measured
Molecules going through the nanopore cause
disruptions in the ionic current, and by measuring
these disruptions the molecules can be identified.
Ionic current
Lipid bilayer with high electrical resistance
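The idea of identifying molecules from disruptions in the ionic current can be sketched as simple threshold-based event detection. Everything numeric here is made up for illustration: the trace values, the open-pore current of 100 (arbitrary units) and the 80% threshold are assumptions, and real signal processing is considerably more involved.

```python
def find_blockades(trace, open_current, threshold=0.8):
    """Return (start, end, mean_current) for each run of samples
    whose current is below threshold * open_current."""
    events = []
    start = None
    for i, current in enumerate(trace):
        blocked = current < threshold * open_current
        if blocked and start is None:
            start = i                      # blockade begins
        elif not blocked and start is not None:
            segment = trace[start:i]       # blockade just ended
            events.append((start, i, sum(segment) / len(segment)))
            start = None
    if start is not None:                  # trace ended during a blockade
        segment = trace[start:]
        events.append((start, len(trace), sum(segment) / len(segment)))
    return events

# Synthetic current samples (arbitrary units): open pore near 100, two blockades.
trace = [100, 101, 99, 40, 42, 41, 100, 98, 60, 61, 100]
events = find_blockades(trace, open_current=100)
print(events)
```

Each event carries a duration and a mean residual current, and it is the residual current level that would be used to distinguish one molecule, or one base, from another.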
41. Nanopore – exonuclease sequencing
Exonuclease
DNA to be sequenced
Aminocyclodextrin adaptor
42. Nanopore – strand sequencing
DNA Polymerase
The DNA polymer passes through
the nanopore itself
The nanopore is engineered to
allow single-base resolution within
the strand
A DNA polymerase, coupled with an
α-hemolysin, synthesizes a new
strand of DNA using as a template
the polymer coming out of the pore
Video nanopore: http://www.youtube.com/watch?v=_rRrOT9gfpo&feature=related
43. Nanopore sequencing
Advantages
minimal sample preparation
no requirement for polymerase or ligase
potential of very long read-lengths ( > 10,000 – 50,000 nt )
it might well achieve the $1,000 per mammalian genome goal
the instrument is inexpensive
Challenges
to slow down DNA translocation from microseconds per base to milliseconds
to reduce stochastic motion of the DNA molecule in transit in order to increase
the signal-to-noise ratio
a stable support for the hemolysin heptamer
Erwin Chargaff's rules: 1) the amount of guanine equals the amount of cytosine, and likewise the amount of adenine equals the amount of thymine
2) base percentages differ among organisms
He met Crick and Watson in 1952
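Chargaff's first rule follows directly from base pairing, which a short sketch can show: counting bases over both strands of a duplex always gives A = T and G = C. The two sequences below are made-up examples, chosen only to show that overall composition (the second rule) can still differ.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def duplex_composition(strand):
    """Count bases over both strands of a duplex, given one strand 5'->3'."""
    other_strand = strand.translate(COMPLEMENT)[::-1]  # reverse complement
    return Counter(strand + other_strand)

def gc_fraction(counts):
    """Fraction of G and C among all counted bases."""
    return (counts["G"] + counts["C"]) / sum(counts.values())

at_rich = duplex_composition("AATTTGCA")   # made-up example sequence
gc_rich = duplex_composition("GGCGCATC")   # made-up example sequence
print(at_rich, gc_fraction(at_rich))
print(gc_rich, gc_fraction(gc_rich))
```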