This document summarizes recent advances in transcriptome analysis technologies. It discusses limitations of microarray-based approaches and how next-generation sequencing-based RNA-seq provides more comprehensive transcriptome profiling. RNA-seq can detect thousands of new transcript variants and isoforms. It also describes direct RNA sequencing without cDNA conversion, revealing polyadenylation profiles with single-molecule resolution. Comprehensive polyadenylation maps in human and yeast showed previously unannotated sites and alternative polyadenylation, providing insights into regulatory mechanisms.
2. Is there a correlation between the size of the
genome and the morphological complexity?
- Only to a certain extent!
- There is no clear correlation between the size of a genome and the overall complexity of an organism
3. Is there a correlation between the number of
genes and morphological complexity ?
- Once again, only to a certain extent
- The complexity of an organism increases much more than the number of genes
4. It is not necessary to increase the variety of the pieces available in order to increase the complexity of a construction; what has to increase is the complexity of the project
- Transcription factors
- Operators
- Enhancers
- Promoters
- ncRNA (e.g. involved in alternative splicing and miRNA)
- Antisense transcripts
Intergenic DNA: 30%; introns: 24%; transposons: 45%; exons: 1%
6. The Transcriptome
RNAs involved in protein synthesis
Type | Abbr. | Function | Distribution
Messenger RNA | mRNA | Codes for protein | All organisms
Ribosomal RNA | rRNA | Translation | All organisms
Signal recognition particle RNA | 7SL RNA or SRP RNA | Membrane integration | All organisms
Transfer RNA | tRNA | Translation | All organisms
Transfer-messenger RNA | tmRNA | Rescuing stalled ribosomes | Bacteria
RNAs involved in post-transcriptional modification or DNA replication
Type | Abbr. | Function | Distribution
Small nuclear RNA | snRNA | Splicing and other functions | Eukaryotes and archaea
Small nucleolar RNA | snoRNA | Nucleotide modification of RNAs | Eukaryotes and archaea
SmY RNA | SmY | mRNA trans-splicing | Nematodes
Small Cajal body-specific RNA | scaRNA | Type of snoRNA; nucleotide modification of RNAs |
Guide RNA | gRNA | mRNA nucleotide modification | Kinetoplastid mitochondria
Ribonuclease P | RNase P | tRNA maturation | All organisms
Ribonuclease MRP | RNase MRP | rRNA maturation, DNA replication | Eukaryotes
Y RNA | | RNA processing, DNA replication | Animals
Telomerase RNA | | Telomere synthesis | Most eukaryotes
Regulatory RNAs
Type | Abbr. | Function | Distribution
Antisense RNA | aRNA | Transcriptional attenuation / mRNA degradation / mRNA stabilisation / translation block | All organisms
Cis-natural antisense transcript | | Gene regulation |
CRISPR RNA | crRNA | Resistance to parasites, probably by targeting their DNA | Bacteria and archaea
Long noncoding RNA | Long ncRNA | Various | Eukaryotes
MicroRNA | miRNA | Gene regulation | Most eukaryotes
Piwi-interacting RNA | piRNA | Transposon defense, maybe other functions | Most animals
Small interfering RNA | siRNA | Gene regulation | Most eukaryotes
Trans-acting siRNA | tasiRNA | Gene regulation | Land plants
Repeat associated siRNA | rasiRNA | Type of piRNA; transposon defense | Drosophila
7. Transcriptomics
- To catalogue all species of transcripts;
- To determine the transcriptional structure of genes, in terms of their start sites, 5’ and 3’ ends, splicing patterns and other post-transcriptional modifications;
- To quantify the changing expression levels of each transcript during development and under different conditions.
Various technologies have been developed to deduce and quantify the transcriptome, including hybridization-based approaches (microarrays) and sequence-based approaches (RNA-seq).
10. …and its limitations
- Reliance upon existing knowledge about genome sequence
- High background levels owing to cross-hybridization
- A limited dynamic range of detection due to both background and saturation of signals
- Comparing expression levels across different experiments is often difficult and can require complicated normalization methods
11. Sequence-based approaches
- Directly determine the cDNA sequence, hence defining the corresponding mRNA
1. Sanger sequencing of cDNA or EST libraries
  - Low-throughput, expensive, generally not quantitative
2. Tag-based methods were developed: SAGE, CAGE, MPSS
  - Still expensive because most are based on Sanger sequencing; many short tags cannot be uniquely mapped to the reference genome; isoforms are generally not distinguishable
3. RNA-seq, based on NGS technologies
  - By analyzing the transcriptome at unprecedented depth and accuracy, thousands of new transcript variants and isoforms have been shown to be expressed in mammalian tissues or organs
  - It has greatly accelerated our understanding of the complexity of gene expression, regulation and networks in mammalian cells
16. Challenges for RNA-seq
Library construction
- Larger RNA molecules must be fragmented into smaller pieces (200-500 bp) to be compatible with most deep-sequencing technologies
- RNA fragmentation has little bias over the transcript body, but is depleted for transcript ends compared with other methods
- cDNA fragmentation is usually strongly biased towards the identification of sequences from the 3’ ends of transcripts
17. Challenges for RNA-seq
Bioinformatic challenges
- Development of efficient methods to store, retrieve and process large amounts of data
- High-quality reads are selected and matched against a reference genome (read-mapping programs include ELAND, SOAP, MAQ and RMAP), or they are first assembled into contigs before aligning them to the genomic sequence to reveal transcription structure
1. Junction reads are difficult to map:
  - A junction library containing all known and predicted junction sequences has been created, and junction reads are mapped against it
2. Many reads match multiple locations in the genome (e.g. repetitive regions):
  - Multi-matched reads are assigned in proportion to the number of reads mapped to their neighbouring unique sequences
  - Roche 454 to obtain longer reads (250 bp)
  - Paired-end sequencing strategy (Solexa)
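The proportional assignment of multi-matched reads can be illustrated with a minimal Python sketch. The function name, the locus names and the even-split fallback for loci without unique reads are assumptions made for illustration; they are not taken from any of the mapping programs named on this slide.

```python
from collections import defaultdict


def allocate_multireads(unique_counts, multireads):
    """Share each multi-matched read among its candidate loci in
    proportion to the uniquely mapped read counts of those loci."""
    totals = defaultdict(float)
    for locus, n in unique_counts.items():
        totals[locus] += n
    for candidates in multireads:
        weights = [unique_counts.get(locus, 0) for locus in candidates]
        weight_sum = sum(weights)
        for locus, w in zip(candidates, weights):
            # Assumption: split evenly when no candidate has unique reads.
            share = w / weight_sum if weight_sum > 0 else 1 / len(candidates)
            totals[locus] += share
    return dict(totals)


# Hypothetical toy data: two loci and four reads that match both.
unique_counts = {"geneA": 30, "geneB": 10}
multireads = [("geneA", "geneB")] * 4
estimated = allocate_multireads(unique_counts, multireads)
```

With these toy counts each shared read contributes 0.75 to geneA and 0.25 to geneB, mirroring the 30:10 ratio of their unique reads.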
18. Challenges for RNA-seq
Defining transcription level
- RNA-seq can be used to determine expression levels more accurately than microarrays. In principle, it is possible to determine the absolute quantity of every molecule in a cell population, and to directly compare results between experiments.
- Gene expression level is deduced from read counts; the rule depends on the library type:
1. RNA fragmentation + cDNA synthesis (exon body-biased): the total number of reads that fall into the exons of a gene, normalized by the length of exons that can be uniquely mapped
2. cDNA fragmentation (3’ end-biased): read counts from a window near the 3’ end are used
- RNA-seq can capture transcriptome dynamics across different tissues or conditions without sophisticated normalization of data sets.
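As a rough illustration of the exon-body case, the sketch below normalizes the exonic read count of a gene by its uniquely mappable exon length. The additional scaling by the total number of mapped reads (reads per kilobase per million mapped reads) is an assumption added here so that values from different runs are on the same scale; the slide itself only mentions the length normalization. The function name and the numbers are hypothetical.

```python
def reads_per_kb_per_million(exon_reads, mappable_exon_bp, total_mapped_reads):
    """Exonic read count normalized by mappable exon length (in kb)
    and by sequencing depth (in millions of mapped reads)."""
    if mappable_exon_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and read total must be positive")
    per_kb = exon_reads / (mappable_exon_bp / 1_000)
    return per_kb / (total_mapped_reads / 1_000_000)


# Hypothetical gene: 500 exonic reads, 2,000 bp of uniquely mappable
# exon sequence, in a run with 10 million mapped reads.
value = reads_per_kb_per_million(500, 2_000, 10_000_000)
```

For a 3’ end-biased library the same function could be applied to the reads and length of a fixed window near the 3’ end instead of the whole exon body.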
19. Life/SOLiD
- mRNA-seq on a single mouse blastomere and oocyte
- They detected the expression of 75% (5270) more genes than microarray techniques
- They identified 1753 previously unknown splice junctions called by at least 5 reads
- 8-19% of the genes with multiple known transcript isoforms expressed at least two isoforms in the same blastomere or oocyte
- Dicer1-/- and Ago2-/- oocytes show 1696 and 1553 genes, respectively, to be upregulated compared to wild-type controls, with 619 genes in common
20. Mitinouri S. et al, Nat Protocol, 2007
[Figure: protocol schematic. Surviving labels: 5 min > 30 min; 64% genes; 3 min > 6 min; (80-130 bp)]
21. High accuracy of the sequencing technique and mapping algorithms
22. Comparison of mRNA-Seq and microarray assays
- Microarray analysis of 320 blastomeres found 6650 genes in common with RNA-seq. Overall RNA-seq detected 60% more genes compared with microarray.
- mRNA-Seq missed 5.7% of the transcripts (400 genes)
  - 327/400 genes had fluorescence intensity on the chip lower than 100
  - 9/11 genes tested by RT-PCR were found to be false positives
  - Cross-hybridization
  - Stochastically, some low-expressed genes in a single cell can be either on or off
- Very similar expression pattern compared to a NIH mouse array
- 380 genes detected by RNA-seq were chosen and tested by RT-PCR. 71% were clearly confirmed
23. New splice isoforms identified by mRNA-seq
1. Generation of a library containing all possible combinations of exon-exon junctions as 84-bp sequences, with 42 bp from each exon
2. Removal of all known exon junctions
3. Matching between RNA-seq reads and the new library
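The three steps can be sketched in Python as follows. This is only an illustrative toy, not the pipeline used in the study: the function names, the input layout (one list of exon sequences per gene, in transcript order), exact substring matching, and the handling of exons shorter than 42 bp are all assumptions.

```python
from itertools import combinations

FLANK = 42  # bases taken from each exon, giving 84-bp junction sequences


def build_junction_library(exons, known_junctions=frozenset(), flank=FLANK):
    """Steps 1 and 2: enumerate every upstream/downstream exon pair of a
    gene and keep those not already annotated as known junctions.
    Returns {(i, j): (sequence, boundary_offset)}."""
    library = {}
    for i, j in combinations(range(len(exons)), 2):
        if (i, j) in known_junctions:
            continue
        left = exons[i][-flank:]   # 3' end of the upstream exon
        right = exons[j][:flank]   # 5' end of the downstream exon
        # Assumption: exons shorter than the flank contribute all their bases.
        library[(i, j)] = (left + right, len(left))
    return library


def count_junction_reads(reads, library):
    """Step 3: count reads that match a junction sequence exactly and
    span the exon-exon boundary (first match position only)."""
    counts = {key: 0 for key in library}
    for read in reads:
        for key, (seq, boundary) in library.items():
            pos = seq.find(read)
            if pos != -1 and pos < boundary < pos + len(read):
                counts[key] += 1
    return counts


# Hypothetical three-exon gene whose adjacent junctions are already known.
exons = ["A" * 60, "C" * 50, "G" * 45]
library = build_junction_library(exons, known_junctions={(0, 1), (1, 2)})
reads = ["A" * 10 + "G" * 10, "A" * 20]
counts = count_junction_reads(reads, library)
```

Only the exon 1 to exon 3 skipping junction remains in the toy library, and only the first read crosses its boundary; the second read lies entirely within one exon and is not counted. A real analysis would use a read aligner that tolerates mismatches rather than exact string search.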
Results
- One blastomere: 6701 and 1753 new junctions with at least 2 or 5 reads, respectively
  - 8/8 confirmed by RT-PCR
- One mature oocyte: 9012 and 2070 new junctions
- 335 genes (19% of all known genes with at least two known isoforms) expressed at least two transcript isoforms in a single blastomere, at the same time
24. RNA-seq to dissect functional differences: Dicer1-/- vs WT
- Two separately processed single wild-type mature oocytes showed very similar transcriptome profiles. Same result for Dicer1-/- oocytes
- Differences between Ago2-/- and WT were clearly smaller than those between Dicer1-/- and WT
  >> this observation correlates with the fact that the Ago2-/- oocyte phenotype is similar to, but milder than, that of Dicer1-/-
25. RNA-seq to dissect functional differences: Dicer1-/- vs WT
- Single-exon resolution of RNA-seq with low or even no background: in Dicer1-/- oocytes, exon 23 is deleted by loxP-directed Cre recombination. Result confirmed by TaqMan assay.
- Abnormal upregulation was detected for three genes, Ccne1, Dppa5 and Klf2, and confirmed by RT-PCR. They may contribute to the compromised developmental potential of Dicer1-/- and Ago2-/- oocytes
26. RNA-seq to dissect functional differences: Dicer1-/- vs WT
Overall results
Genotype | Upregulated | Downregulated
Dicer1-/- | 1696 | 1571
Ago2-/- | 1553 | 1121
In common | 619 | 589
Core candidates to dissect the function of microRNAs and endogenous small interfering RNAs involved in oogenesis
27. Conclusions
- mRNA-seq on a single mouse blastomere > small amount of starting material
- 7% > 64% of full-length cDNAs captured
- They detected the expression of 75% (5270) more genes than microarray techniques and identified 1753 previously unknown splice junctions
- 8-19% of the genes with multiple known transcript isoforms expressed at least two isoforms in the same blastomere or oocyte
- Dicer1-/- and Ago2-/- oocytes show 1696 and 1553 genes, respectively, to be upregulated compared to wild-type controls, with 619 genes in common
Limitations
- Only poly(A)+ mRNAs are captured (e.g. histone mRNAs are not detected)
- For mRNAs longer than 3 kb, the 5’ end will not be characterized
- The assay uses double-stranded cDNAs and so cannot discriminate between sense and antisense transcripts
28. Helicos/tSMS: Direct RNA Sequencing (DRS)
- cDNA synthesis introduces multiple biases:
  - Erases RNA strand information
  - Spurious second-strand cDNA artefacts can be introduced, owing to the DNA-dependent DNA polymerase (DDDP) activities
  - Artefactual cDNAs due to template switching
  - The enzyme is error-prone and inefficient
Direct single-molecule RNA sequencing without prior conversion of RNA to cDNA >> it captures all RNAs
The sequencing was performed on poly(A)+ RNA from S. cerevisiae
30. DRS sequencing read-length statistics
Pilot experiment with oligoribonucleotides
- 48.5% of aligned reads have a sequence length of at least 20 nucleotides (nt)
- 38 nt is the longest read with no errors
- Errors: 4%
  - 2-3% missing base errors
  - 1-2% insertion rate
  - 0.1%-0.3% substitution errors
Poly(A)+ S. cerevisiae RNA (Clontech)
- Femtomole quantities of RNA needed
- 120 cycles in 3 days
- 41261 reads of > 20 nt, average of 28 nt
- 50 nt is the longest read
- 19501 reads (48.4%) aligned to the yeast genome using the BLAT algorithm
31. DRS sequencing read-length statistics
- Of the aligned reads, 91% were within 400 nt downstream of annotated yeast gene 3’ ORF ends
- Most of the reads were in close proximity to EST 3’ ends
- ~2% of the total reads were from ribosomal RNAs and small nucleolar RNAs, indicating that at least a fraction of those can be polyadenylated post-transcriptionally.
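A statistic such as "within 400 nt downstream of a 3’ ORF end" requires a strand-aware distance. The sketch below shows one way to compute it; the coordinate convention (start < end on the reference, regardless of strand), the tuple layouts, the function names and the toy coordinates are assumptions for illustration and do not reproduce the analysis behind the reported 91%.

```python
def downstream_distance(read_pos, orf_start, orf_end, strand):
    """Distance from the ORF 3' end to the read position, counted in the
    direction of transcription (negative means upstream of the 3' end)."""
    if strand == "+":
        return read_pos - orf_end
    if strand == "-":
        return orf_start - read_pos
    raise ValueError("strand must be '+' or '-'")


def fraction_near_3prime_end(reads, orfs, window=400):
    """Fraction of reads lying 0..window nt downstream of the 3' end of
    any ORF on the same chromosome and strand."""
    if not reads:
        return 0.0
    hits = 0
    for chrom, pos, strand in reads:
        for orf_chrom, start, end, orf_strand in orfs:
            if chrom != orf_chrom or strand != orf_strand:
                continue
            if 0 <= downstream_distance(pos, start, end, strand) <= window:
                hits += 1
                break  # count each read once
    return hits / len(reads)


# Hypothetical annotation and read positions.
orfs = [("chrI", 1000, 2000, "+"), ("chrI", 5000, 6000, "-")]
reads = [("chrI", 2150, "+"), ("chrI", 4900, "-"),
         ("chrI", 3000, "+"), ("chrI", 2100, "-")]
fraction = fraction_near_3prime_end(reads, orfs)
```

Because the reads are strand-specific, the fourth toy read is not counted even though it lies close to the first ORF: it is on the opposite strand.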
32. The emerging discoveries on the link between polyadenylation and disease states
(oculopharyngeal muscular dystrophy, thalassemias, thrombophilia, and IPEX syndrome )
underline the need to fully characterize genome-wide polyadenylation states. Here, we
report comprehensive maps of global polyadenylation events in human and yeast generated
using refinements to the Direct RNA Sequencing technology. This direct approach provides a
quantitative view of genome-wide polyadenylation states in a strand-specific manner and
requires only attomole RNA quantities. The polyadenylation profiles revealed an
abundance of unannotated polyadenylation sites, alternative polyadenylation patterns,
and regulatory element-associated poly(A)+ RNAs. We observed differences in sequence
composition surrounding canonical and noncanonical human polyadenylation sites,
suggesting novel noncoding RNA-specific polyadenylation mechanisms in humans.
Furthermore, we observed the correlation level between sense and antisense transcripts
to depend on gene expression levels, supporting the view that overlapping transcription
from opposite strands may play a regulatory role. Our data provide a comprehensive
view of the polyadenylation state and overlapping transcription.
33. Conclusions
- Only minute RNA quantities are required
- No biases due to cDNA synthesis, end repair, ligation and amplification procedures
- Potentially useful to study short RNA species
Future perspective
1. Generation of a complete catalogue of transcripts derived from genomes ranging from those of simple unicellular organisms to complex mammalian cells, normal or diseased tissues, single cells and formalin-fixed paraffin-embedded tissues
2. Generation of complex biological networks in a wide range of biological specimens
3. Use of these networks to fully understand the biological pathways that are active in various physiological conditions
Immediate application in clinical diagnostics: analyses of extracellular nucleic acids (e.g. fetal RNA) and cells (e.g. circulating tumor cells)