The analysis of all transcripts within a cell is of essential importance. Molecular biology provides many approaches to clone RNA transcripts into cDNA, and large cDNA collections are in the public domain to serve the research community. Today, however, new high-speed sequencing methods allow a much deeper view into transcriptomes than was possible with classical cloning.
2. Classical View on the Utilization of Genomic Information
[Figure: flow of genomic information in the cell.
Nucleus: genomic DNA (storage of information) with promoter, “gene”, and transcript start site; transcription factors; transcription by RNA polymerase II.
Coding mRNA with cap (7-methylguanosine cap or m7G cap) and poly(A) tail (transport of information).
Cytoplasm: translation at ribosome; protein (tools to operate “functions”).]
Developed in the 1950s and 1960s.
3. The Classical View Has Been Challenged by New Developments
Discovery/Project | Importance | Year
Discovery of reverse transcriptases | DNA can be synthesized from RNA templates | 1969
Discovery of ligase and restriction endonucleases | Establishing DNA recombination, DNA cloning, and preparation of DNA libraries | 1960s and 70s
DNA sequencing | Chain-termination method (“Sanger Sequencing”) | 1975
Human Genome Project | Move to sequencing entire genomes | 1990 to 2003
Expressed sequence tags (ESTs) | First attempt at gene discovery and expression profiling | 1991
IMAGE Project | Program to create cDNA collections from key organisms | 1993 to 2007
ENCODE Project | Functional elements in human genome | Since 2003
4. Topics of the Presentation
Approaches to cDNA cloning
Special topics related to cDNA cloning
Large-scale cDNA cloning projects
Small RNA (sRNA) cloning
Tag-based approaches
Next-Generation Sequencing
Where do we go from here?
5. Approaches to cDNA cloning
Starting material: capped and polyadenylated mRNA (5’ cap, 3’ poly(A) tail).
1st strand cDNA synthesis: commonly oligo(dT) priming.
Prime 2nd strand cDNA synthesis: 5’-linker (adaptor) ligation or tailing reaction.
2nd strand synthesis (option to make PCR).
Digestion with cloning enzyme(s): methylation can protect against internal cleavage within cDNA.
Ligation into phage or plasmid vector (plasmid with cDNA insert may be excised from phage vector).
6. Special Topics Related to cDNA Cloning
Synthesis of very long cDNAs (>10,000 bp, not further discussed)
Full-length cDNA cloning (important to obtain functional cDNAs)
Normalization (key to gene discovery in large-scale projects)
Cloning vectors and applications (not further discussed)
Subtractive cloning (not further discussed)
Expression cloning (not further discussed)
Addressing splicing (left out of large-scale projects)
Ref.: Harbers M: The current status of cDNA cloning, Genomics. 2008 Mar;91(3):232-42.
7. Use of cDNA Libraries
Isolation of individual target genes in research laboratories
Transcriptome Analysis and Genome Projects
Large-scale random clone picking
End-sequencing to build transcript catalogs
Full-length sequencing of selected clones
Creation of sequence data bases
Creation of cDNA collections
Ref.: Carninci P et al.: Targeting a complex transcriptome: the construction of the mouse full-length cDNA encyclopedia.
Genome Res. 2003 Jun;13(6B):1273-89.
8. Benefits of Large-Scale cDNA Cloning Projects
Improved cDNA cloning technology
Resources created: sequence data and clone collections
SNP analysis: location in promoter or exon
Proteomics: functional studies on proteins
Gene regulation: promoter identification, expression profiling
Genomics: gene discovery, mapping, noncoding RNA, sense-antisense pairs
RNAi: knock down
Functional studies
Public sequence databases and clone collections are essential tools for research!
9. The mRNA Pool of a Cell
5 to 10 transcripts: up to 20% of mRNA
500 to 2,000 transcripts: 40 to 60% of mRNA
10,000 to 20,000 transcripts: <20% of mRNA
(Old numbers estimated from reassociation and hybridization studies)
Discovery of rarely expressed genes is a difficult task!
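The abundance classes on this slide explain why random clone picking finds rare transcripts so slowly. The following is a minimal sketch, assuming one illustrative value from each quoted range (10 transcripts at 20% of mRNA, 1,000 at 60%, 15,000 at 20%) and equal abundance within a class. The function name and these picks are assumptions for illustration, not values from the original studies.

```python
# Abundance classes: (label, number of distinct transcripts, share of total mRNA).
# Illustrative picks from within the ranges quoted on the slide (assumption).
CLASSES = [
    ("high", 10, 0.20),
    ("medium", 1_000, 0.60),
    ("rare", 15_000, 0.20),
]


def expected_distinct(n_clones, classes=CLASSES):
    """Expected number of distinct transcripts seen per class after random picking."""
    result = {}
    for label, n_transcripts, share in classes:
        # Chance that a single picked clone represents one given transcript.
        p = share / n_transcripts
        # Each transcript is seen at least once with probability 1 - (1 - p)^n.
        result[label] = n_transcripts * (1.0 - (1.0 - p) ** n_clones)
    return result


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(n, {k: round(v) for k, v in expected_distinct(n).items()})
```

Under these assumptions, 10,000 picked clones cover nearly all highly and moderately expressed transcripts but only about one in eight of the rare ones.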
10. Normalization of cDNA Libraries
During a normalization step, a cDNA pool is hybridized against an aliquot of the
original mRNA sample or against the same cDNA pool. Due to concentration-dependent
hybridization kinetics, the number of clones representing highly expressed genes will
be reduced, yielding a more equal distribution of different cDNAs in the library.
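The concentration dependence can be illustrated with a small calculation. This is a sketch only, assuming ideal second-order reassociation, C(t) = C0 / (1 + k * C0 * t), for each cDNA species hybridizing against itself; the rate constant, time, and starting abundances are arbitrary illustrative values, and the function names are hypothetical.

```python
def single_stranded_remaining(c0, k, t):
    """Ideal second-order reassociation: C(t) = C0 / (1 + k * C0 * t)."""
    return c0 / (1.0 + k * c0 * t)


def normalize(pool, k=1.0, t=100.0):
    """Relative abundances of the fraction that is still single-stranded."""
    remaining = {
        name: single_stranded_remaining(c0, k, t) for name, c0 in pool.items()
    }
    total = sum(remaining.values())
    return {name: value / total for name, value in remaining.items()}


# Hypothetical starting pool: relative abundance of three cDNA species.
pool = {"abundant": 0.90, "moderate": 0.09, "rare": 0.01}
normalized = normalize(pool)
```

With these numbers the ratio of the abundant to the rare species drops from 90:1 before hybridization to roughly 2:1 in the single-stranded fraction, which is the fraction carried forward into the library.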
[Figure: gel images of a pancreas cDNA library without and with normalization/subtraction, next to a λ/Hind III size marker (0.5 to 9.4 kbp); bands from highly expressed genes are marked.]
[Figure: number of non-redundant clones plotted against number of libraries (Lib. 1 to Lib. 4, prepared with no driver, Driver 1, or Driver 2). Combine normalization and subtraction for higher gene discovery.]
11. Full-Length cDNA Cloning
“Cap Trapper” method:
1st strand cDNA synthesis on capped mRNA from an oligo(dT) primer.
Chemical reaction: biotinylation of the cap.
RNase I digestion.
Recovery on beads of biotinylated cap/mRNA/cDNA hybrids, then release of the cDNA.
Key steps: biotinylation of cap structure and RNase I treatment.

“Oligo Capping” method:
Phosphatase treatment of the mRNA pool (capped and uncapped molecules).
Pyrophosphatase treatment of capped mRNA.
RNA ligase: ligation of an adaptor (RNA oligonucleotide) to the 5’ end.
cDNA synthesis from an oligo(dT) primer, followed by use of an adaptor primer.
Key steps: replacement of cap structure by RNA oligonucleotide.
12. Examples for Large-Scale cDNA Cloning Projects
These projects target the cloning and full-length sequencing of “one representative” cDNA clone for
each gene. This reduces cost, but it entirely ignores splicing events.
Project | Organisms | URL
IMAGE Consortium | Human, mouse, rat, zebrafish, fugu, Xenopus (X. laevis and X. tropicalis), cow, and primate | http://image.llnl.gov/
Mammalian Gene Collection (MGC) | Human, mouse, rat, cow, others | http://mgc.nci.nih.gov/
Tokyo University | Human | http://cdna.hgc.jp/
RIKEN FANTOM | Mouse | http://fantom3.gsc.riken.go.jp/
Rice full-length cDNA Consortium | Rice | http://cdna01.dna.affrc.go.jp/cDNA/
RIKEN Arabidopsis | Arabidopsis | http://www.brc.riken.jp/lab/epd/Eng/news/071015.shtml
ORF Consortium | Human (some mouse clones) | http://www.orfeomecollaboration.org
13. Pre-mRNA is Spliced into mRNA
Large-scale cloning projects do not cover splice variants.
But maybe 75% of all signal transducers are regulated by splicing!
14. Capturing Alternatively Spliced Exons in mRNA
Sense strand Antisense strand
Sample 1 Sample 2
Cut double-stranded regions
Capture single-stranded regions
Ref.: Watahiki A et al.: Libraries enriched for alternatively spliced exons reveal splicing patterns in melanocytes and melanomas.
Nature Methods 2004 Dec 1(3): 233-9.
15. The Discovery of small RNAs
Classical cloning protocols removed all cDNA fragments of less than
500 bp (to avoid linker contamination; size cutoff of cloning vectors).
Proteins of less than 100 amino acids were commonly not annotated.
However, small RNAs have important functions!
Small RNAs are non-coding RNAs (ncRNAs) often derived from maturation
processes in the cell that include digestion steps by RNases.
Most prominent example: microRNAs (miRNAs) have sequences that are reverse
complementary to mRNA transcripts. They are around 21-23 nucleotides long
after maturation and can alter the expression/translation of one or several
target genes through RNA interference.
And we are still finding many more new RNA species!
Ref.: Kawaji H, Hayashizaki Y. Exploration of small RNAs. PLoS Genet. 2008 Jan;4(1):e22.
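The reverse-complement relationship between a small RNA and its target can be shown with a toy search. This is a sketch under strong assumptions: the 22-nt sequence is made up, and perfect complementarity over the full length is required, which is a simplification and not a target-prediction method. All names are hypothetical.

```python
# Translation table for RNA base pairing (A-U, C-G).
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence (5'->3')."""
    return rna.translate(COMPLEMENT)[::-1]


def find_target_sites(mirna, mrna):
    """0-based start positions in mrna that are perfectly complementary to mirna."""
    site = reverse_complement(mirna)
    positions = []
    start = mrna.find(site)
    while start != -1:
        positions.append(start)
        start = mrna.find(site, start + 1)
    return positions


# Made-up 22-nt small RNA and a made-up transcript containing one matching site.
mirna = "ACGGUUCAGCUAAGGCUUACGA"
mrna = "GGGAAA" + reverse_complement(mirna) + "CCCUUU"
sites = find_target_sites(mirna, mrna)
```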
16. Small RNA (sRNA) Cloning
Short RNA with 5’ phosphate (P) and 3’ hydroxyl (OH).
Modify 3’ end: C-tailing or adaptor ligation.
Modify 5’ end: here by adaptor ligation.
1st strand cDNA synthesis.
2nd strand synthesis and PCR.
Sequence analysis: direct sequencing of DNA fragments (option to ligate into plasmid vector).
Key Steps:
Modification of 5’ and 3’ end of RNA for PCR amplification. Selection by size range. Commonly only sequenced.
No cloning needed as short cDNAs can be chemically synthesized.
17. Tag-Based Approaches
Gene discovery cannot be done by standard methods used in
expression profiling such as microarray or PCR.
Unsupervised approaches are needed for gene discovery that do
not require sequence information for probe design.
The first approach to gene discovery was sequencing of 3’ ends of cDNA
clones (EST sequencing). It requires one read per clone.
Gene identification does not require sequences of 500 to 800 bp,
but much shorter sequences of some 20 bp or less are sufficient.
Use long sequencing reads to cover many short fragments by one run.
New protocols to isolate short fragments from RNA.
Tag-based approaches in expression profiling and gene discovery.
Ref.: Harbers M and Carninci P: Tag-based approaches for transcriptome research and genome annotation.
Nature Methods 2005 Jul 2(7): 495-502.
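The claim that about 20 bp are sufficient for gene identification can be checked with simple arithmetic: a tag of length n has 4^n possible sequences. The sketch below compares this with the human haploid genome size of 3.3 x 10^9 bp quoted later in this presentation. It gives a lower bound only, since it assumes a random genome without repeats; the function names are hypothetical.

```python
def possible_tags(length):
    """Number of distinct DNA sequences of the given length."""
    return 4 ** length


def min_tag_length(genome_size):
    """Shortest tag length whose sequence space exceeds the genome size."""
    length = 1
    while possible_tags(length) <= genome_size:
        length += 1
    return length


HUMAN_GENOME_BP = 3_300_000_000  # haploid size as quoted in this presentation

shortest = min_tag_length(HUMAN_GENOME_BP)
space_20 = possible_tags(20)
```

Here `shortest` is 16 and `space_20` is about 1.1 x 10^12, a few hundred times the genome size, which leaves a margin for repeats and sequencing errors.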
18. Tag-Based Approaches
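[Figure: positions of tags along a capped and polyadenylated mRNA.
5’ end (cap selection): CAGE, 5’ SAGE.
Anchoring enzyme sites within the transcript: SAGE (5’ related), SAGE (3’ related), MPSS, DGE.
3’ end (remove poly(A)): 3’ SAGE.
Both ends: Paired-end Tags or PETs.
Entire transcript: RNA-Seq or other shotgun approaches.]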
19. Serial Analysis of Gene Expression (SAGE)
(Digital Gene Expression (DGE))
1st strand cDNA synthesis with biotinylated oligo(dT) primer (commonly starting from mRNA).
Preparation of double-stranded cDNA, binding to beads, and digestion with anchoring enzyme.
Adaptor ligation and digestion with Mme I (20 bp) or EcoP15I (27 bp).
Formation of “Di-Tags” (Di-Tags can be used for direct sequencing (DGE)).
Concatenation and cloning into plasmid vector (classic sequencing of concatemers).
Very well established and rich reference/annotation information.
Digital expression profiling by “tag counting”.
Ref.: Velculescu VE et al. Serial analysis of gene expression. Science. 1995 Oct 20;270(5235):368-9, 371.
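“Tag counting” can be sketched in a few lines. The example below assumes the anchoring enzyme is NlaIII with the recognition site CATG, which is a common choice in SAGE but is not stated on this slide, and it assumes a tag of 17 bases downstream of the 3’-most site. Both values, the function names, and the input sequences are assumptions for illustration.

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # assumed anchoring-enzyme site (NlaIII); not from the slide


def extract_tag(cdna, tag_length=17, site=ANCHOR_SITE):
    """Return the tag next to the 3'-most anchoring site, or None if unavailable."""
    pos = cdna.rfind(site)  # 3'-most site, the one nearest the poly(A) end
    if pos == -1:
        return None
    start = pos + len(site)
    tag = cdna[start:start + tag_length]
    # Discard transcripts whose last site is too close to the 3' end.
    return tag if len(tag) == tag_length else None


def count_tags(cdnas, **kwargs):
    """Digital expression profile: number of occurrences of each tag."""
    tags = (extract_tag(c, **kwargs) for c in cdnas)
    return Counter(t for t in tags if t is not None)


# Made-up cDNA (sense strand, poly(A) removed) with two anchoring sites.
example = "AAACATGTTTTCATG" + "ACGTACGTACGTACGTA" + "GGG"
profile = count_tags([example, example, "AAAA"])
```

Each distinct tag stands for one transcript, and its count is the expression measure; mapping tags back to genes relies on the reference/annotation information mentioned above.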
20. Cap Analysis Gene Expression (CAGE)
Commonly starting from 50 µg total RNA.
1st strand cDNA synthesis with random primer (NNNNNN), covering poly(A-) mRNA and long mRNA.
5’-end selection on beads by Cap Trapper (less bias due to chemical modification of cap).
Ligation of Adaptor I and 2nd strand synthesis.
Digestion with Mme I (20 bp) or EcoP15I (27 bp).
Isolation of CAGE tags.
3’-end ligation of Adaptor II.
Preferably used for direct sequencing (>4,000,000 tags per run).
Ref.: Kodzius R et al.: Cap analysis of gene expression: transcription start site mapping and expression profiling.
Nature Methods 2006 Mar 3(3): 211-222.
21. Cap Analysis Gene Expression (CAGE)
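[Figure: a gene locus on the genome with transcription factor binding sites (TF1 to TF3, responding to Signal 1 to Signal 3) upstream of the TSS, exons 1 to 5, and the capped and polyadenylated mRNA. Methods are placed along the locus: TF ChIP at the binding sites; CAGE tags and RACE at the TSS; tiling array/RNA-Seq and microarray over the transcribed region; SAGE toward the 3’ end.]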
CAGE tags experimentally link transcripts to their promoters.
CAGE tags integrate information based on genome annotations.
CAGE tags can be linked to whole genome tiling arrays and RNA-Seq data.
CAGE tags can be linked to Chromatin IP/ChIP-Seq data.
CAGE tags correlate with open chromatin.
CAGE tags provide primer information for cloning new transcripts.
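How mapped CAGE tags point to promoters can be sketched as a simple clustering of tag 5’ positions on one strand of one chromosome. This is a minimal illustration, assuming tags are already mapped and that positions within 20 bp belong to the same start site cluster; the gap value, function names, and positions are hypothetical and do not describe any published clustering pipeline.

```python
from collections import Counter


def _summarize(positions, counts):
    """Describe one cluster as (start, end, tag_count, dominant_position)."""
    total = sum(counts[p] for p in positions)
    dominant = max(positions, key=lambda p: counts[p])
    return (positions[0], positions[-1], total, dominant)


def cluster_tss(tag_starts, max_gap=20):
    """Group mapped 5' tag positions into transcription start site clusters."""
    counts = Counter(tag_starts)
    clusters = []
    current = []
    for pos in sorted(counts):
        # Start a new cluster when the distance to the previous position is too large.
        if current and pos - current[-1] > max_gap:
            clusters.append(_summarize(current, counts))
            current = []
        current.append(pos)
    if current:
        clusters.append(_summarize(current, counts))
    return clusters


# Made-up 5' positions of mapped tags.
starts = [100, 100, 100, 102, 105, 500, 501, 501]
clusters = cluster_tss(starts)
```

The tag count of a cluster serves as the expression estimate for that promoter, and the dominant position is a candidate transcription start site that can also guide primer design.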
22. Classical DNA Sequencing by Chain-Termination Method
[Figure: a primer annealed to the DNA template is extended by DNA polymerase in the presence of a dNTP/ddNTP mix, one reaction per nucleotide. The DNA fragments from the primer extension reactions are analyzed by gel electrophoresis or on a capillary sequencer.]
For over 30 years the most important method in molecular biology.
Challenged by emerging new sequencing technologies: Next-Generation Sequencing.
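The logic of the chain-termination method, one reaction per nucleotide followed by size separation, can be mimicked in a short simulation. This is an idealized sketch: every possible termination product is assumed to be present and perfectly resolved, and the function names are hypothetical.

```python
def termination_fragments(synthesized, ddntp):
    """Lengths of extension products that end with the given dideoxynucleotide."""
    return [i + 1 for i, base in enumerate(synthesized) if base == ddntp]


def read_sequence(synthesized):
    """Reconstruct the synthesized strand from the four termination reactions."""
    # One "lane" per nucleotide, holding the fragment lengths seen in that reaction.
    lanes = {base: termination_fragments(synthesized, base) for base in "ACGT"}
    # Sorting all fragments by length and noting their lane yields the sequence.
    by_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(by_length[length] for length in sorted(by_length))


strand = "TGGTTGCTGCCAATGT"  # example extension product
g_lane = termination_fragments(strand, "G")
recovered = read_sequence(strand)
```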
23. Next-Generation Sequencing
Driven by the goal of the “$1000 genome”, several companies are developing new sequencing
technologies based on “sequencing by synthesis” or “ligation-based sequencing”. Other approaches
may use hybridization methods or physical means in the future.
Platform | Output | Read length | Run time | Method
Roche 454 Sequencing | 100 Mb per run | 250 bp | 7 h per run | Emulsion PCR and pyrosequencing
Illumina (Solexa) | 1300 Mb per run | 32-40 bp | 4 days per run | Bridge PCR and sequencing-by-synthesis
ABI SOLiD | 3000 Mb per run | 35 bp | 5 days per run | Emulsion PCR and ligation-based sequencing
Helicos | 25 to 90 Mb per h | up to 55 bp | not given | Single-molecule detection
Ref.: Mardis ER. The impact of next-generation sequencing technology on genetics.
Trends Genet. 2008 Mar;24(3):133-41. Epub 2008 Feb 11.
von Bubnoff A. Next-generation sequencing: the race is on. Cell. 2008 Mar 7;132(5):721-3.
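For tag-based methods the number of reads per run matters more than the total output in Mb. The conversion below uses the figures from the platform table; the 36 bp read length for Illumina is an assumed midpoint of the quoted 32-40 bp range, and the function name is hypothetical.

```python
# (output in Mb per run, read length in bp), taken from the platform table.
PLATFORMS = {
    "Roche 454": (100, 250),
    "Illumina (Solexa)": (1300, 36),  # 36 bp: assumed midpoint of 32-40 bp
    "ABI SOLiD": (3000, 35),
}


def reads_per_run(megabases, read_length):
    """Approximate number of reads per run from total output and read length."""
    return megabases * 1_000_000 // read_length


estimates = {name: reads_per_run(mb, bp) for name, (mb, bp) in PLATFORMS.items()}
```

With these inputs the long-read platform yields about 0.4 million reads per run, while the short-read platforms yield tens of millions, which is why short reads suit counting applications such as SAGE and CAGE.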
24. Example for Ligation-Based Sequencing: ABI SOLiD System
DNA fragments having adaptor sequences: genomic DNA, tag sequencing.
Project-specific data analysis: mapping to genome, reference information.
Images are the courtesy of ABI and were kindly provided by ABI Japan.
27. Example for Sequencing-by-Synthesis: Illumina 1G System
DNA per run: 0.1 to 1 µg.
Addition of 2 adaptors.
Add to flow cell.
Preparation of clusters.
Images are the courtesy of Illumina and were kindly provided by Illumina Japan.
28. Example for Sequencing-by-Synthesis: Illumina 1G System
Cycle 1:
Addition of the sequencing reagent.
One-base extension reaction.
Removal of non-incorporated bases.
Detection of the fluorescence signal.
Removal of the fluorescence label.
Cycle 2: repetition of the above reactions.
Cycle 3, 4, 5, ...: repetition of the above reactions.
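The cycle structure above, one base added and detected per cycle, can be written as a loop. This is a toy model that assumes error-free incorporation and detection on a single template; the function name and the template are hypothetical, and it does not model the actual instrument chemistry.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_by_synthesis(template_3to5, cycles):
    """Bases called cycle by cycle for a template given in 3'->5' direction."""
    calls = []
    for cycle in range(min(cycles, len(template_3to5))):
        # One-base extension: only the complementary labeled base is incorporated.
        incorporated = COMPLEMENT[template_3to5[cycle]]
        # Detection: the fluorescence signal identifies the incorporated base.
        calls.append(incorporated)
        # The label is then removed so that the next cycle can extend the strand.
    return "".join(calls)


read = sequence_by_synthesis("TACGGT", cycles=4)
```

The read length equals the number of cycles, which is why the run time of such instruments grows with read length.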
29. Example for Sequencing-by-Synthesis: Illumina 1G System
40,000,000 clusters on a flow cell
[Figure: images of clusters on the flow cell; scale bars 20 µm and 100 µm.]
30. Where do we go from here?
Next-Generation Sequencing will push the genome sequencing field in
re-sequencing and de novo sequencing (“1000 Genomes Project”).
Metagenomics (Environmental Genomics, Ecogenomics, or
Community Genomics): Direct analysis of genetic materials obtained
from environmental samples.
Expression profiling: SAGE (DGE), CAGE, PET, RNA-Seq.
Analytical applications to identify functional regions/elements in
genomes: ChIP-Seq, open chromatin, SNPs, splicing, others to come.
Analytical applications in mutation screens.
Analytical applications for detection of infectious agents.
31. Transcriptome Analysis: The Dominance of noncoding RNA
Genome sequencing and annotation did not tell us about the real
extent of gene expression!
Tiling array experiments and deep sequencing by next-generation
sequencing methods indicate that >90% of the genome is expressed.
Maybe 40 to 50% of the mRNA is not polyadenylated, and we have not
analyzed it yet.
Most of the transcripts are potentially noncoding RNAs having
unknown (regulatory?) functions.
The definition of a “gene” may no longer hold with many different
transcripts derived from the same loci.
We do not understand the “hidden layers” regulating the utilization of
genomic information.
Ref.: Mattick, J.S. "Challenging the dogma: The hidden layer of non-protein-coding RNAs in complex organisms"
Bioessays. (2003) 25, 930-939.
32. Example for RNA-Seq in Yeast: Schizosaccharomyces pombe (fission yeast)
Illumina 1G sequencer; average read length 39.1 base, fragments from poly(A) mRNA
> 23 million reads (~60 genome lengths) from proliferating cells.
> 99 million reads (~190 genome lengths) from five different stages.
Covering ~94% nuclear and > 99% of mitochondrial genome.
Confirmed expression from intergenic regions by RT-PCR.
Control experiments using whole genome tiling arrays (25 mer/20 nt intervals)
confirmed identification of novel transcripts (26 out of 453 may encode short
proteins).
Recent publications on the use of RNA-Seq include S. pombe, S. cerevisiae, Arabidopsis,
mouse tissues, mouse stem cells, and HeLa S3.
Ref.: Wilhelm BT et al.: Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution.
Nature. 2008 Jun 26;453(7199):1239-43. Epub 2008 May 18.
Graveley BR. Molecular biology: power sequencing. Nature. 2008 Jun 26;453(7199):1197-8.
33. Examples for Genome Size (haploid)
Genome | Length in bp | Estimated gene number
Phi-X 174 | 5,386 | 10
Human mitochondrion | 16,569 | 37
E. coli | 4,639,221 | 4,377
Saccharomyces cerevisiae | 12,495,682 | 5,770
Caenorhabditis elegans | 100,258,171 | 19,427
Arabidopsis thaliana | 115,409,949 | ~28,000
Drosophila melanogaster | 122,653,977 | 13,379
Humans | 3.3 x 10^9 | ~20,500
Amphibians | 10^9 to 10^11 | ?
Values taken from: http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/G/GenomeSizes.html (as of July 2007)
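The table shows that genome size does not scale with gene number. A quick way to see this is to compute the average number of base pairs per gene from the values listed; the function name is hypothetical, approximate gene numbers (marked with ~ in the table) are used as given, and amphibians are left out because no gene number is listed.

```python
# (genome length in bp, estimated gene number), taken from the table.
GENOMES = {
    "Phi-X 174": (5_386, 10),
    "Human mitochondrion": (16_569, 37),
    "E. coli": (4_639_221, 4_377),
    "Saccharomyces cerevisiae": (12_495_682, 5_770),
    "Caenorhabditis elegans": (100_258_171, 19_427),
    "Arabidopsis thaliana": (115_409_949, 28_000),
    "Drosophila melanogaster": (122_653_977, 13_379),
    "Humans": (3_300_000_000, 20_500),
}


def bp_per_gene(length_bp, gene_number):
    """Average genome length available per annotated gene."""
    return length_bp / gene_number


density = {name: bp_per_gene(*values) for name, values in GENOMES.items()}
```

With these values E. coli has roughly 1,000 bp per gene and humans roughly 160,000 bp per gene, a difference of more than two orders of magnitude that leaves ample room for noncoding transcription.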
34. Where are our limitations?
Mammalian genome size and transcriptome complexity:
Enrichment of fragments e.g. using microarrays,
Normalization and longer reads required.
Thus far uneven representation requires use of more than one method.
Requirements for starting materials (target is to analyze single cells).
No unified cDNA library method: using different methods depending on RNA length.
Very large data files and lack of computational analysis tools.
What is transcriptional noise?
Research dominated by “detection” rather than “functional analysis”.
Ref.: Struhl K. Transcriptional noise and the fidelity of initiation by RNA polymerase II.
Nat Struct Mol Biol. 2007 Feb;14(2):103-5.
35. Present Strategies for Transcriptome Analysis
Interest has shifted to next-generation sequencing to profile transcriptional
activities.
We cannot predict ends of transcripts, and therefore tag-based approaches
to identify start sites and termination sites are needed.
Identification of transcription start sites in combination with other
information is driving “gene network studies” and “systems biology”.
RNA-Seq provides new means for the identification of splice sites and
expressed mutations.
We do not clone all those new transcripts, but there will be a need to get
resources for functional analysis of new transcripts.
We are more than ever falling short on the functional analysis of new transcripts.
Thus far we have not even analyzed all coding transcripts!
It is an exciting time to work on transcriptome analysis offering many challenges and rewards!
36. Contact:
Dr. Matthias Harbers
DNAFORM Inc.
Leading Venture Plaza-2, 75-1, Ono-cho
Tsurumi-ku, Yokohama City, Kanagawa, 230-0046
Japan
E-mail: matthias.harbers@dnaform.jp
Phone: +81-(0)45-510-0607
FAX: +81-(0)45-510-0608
URL: http://www.dnaform.jp