3. Why do we care about RNA?
RNA is important for translation and gene regulation
2/3 of the ribosome is RNA. Ribosomal function is preserved
even after amino-acid residues are deleted from the active site!
Current estimates indicate that the number of ncRNA genes is
comparable to the number of protein coding genes.
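[Figure: flow from DNA (mDNA, uDNA, rDNA, tDNA) through pre-mRNA and mRNA to nascent and localised protein, via transcription, splicing, translation and transport. RNA-containing machinery labelled in the figure: spliceosome, ribosome, tRNA, RNase P, RNase MRP, snoRNP, SRP, tmRNA, RISC (miRNA).]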
Paul Gardner RNA bioinformatics
4. RNA: why is this stuff interesting?
Under current plausible models, an RNA world was an essential
step towards modern protein- and DNA-based life.
Which came first, DNA or protein?
RNA has catalytic potential (like protein) and carries hereditary
information (like DNA).
Image by James W. Brown, www.mbio.ncsu.edu/JWB/soup.html
6. RNA: structure
[Figure: a tRNA shown at three levels. A: Primary Structure; B: Secondary Structure; C: Tertiary Structure. Labelled elements: Acceptor Stem, D Loop, Anticodon Loop, Variable Loop, TΨC Loop.]
Primary structure (A):
5’ GCGGAUUUAGCUCAGDDGGGAGAGCGCCAGACUGAAYA.CUGGAGGUCCUGUGT.CGAUCCACAGAAUUCGCACCA 3’
7. RNA: base-pairing
Canonical (Watson-Crick) base-pairs: C·G and A·U.
Non-canonical (wobble) base-pair: G·U.
Note: other non-canonical base-pairs do occur, but these are
“rare” and are generally re-defined as “tertiary” interactions.
Central dogma of structural biology: structure is important for
function.
Images lifted from: http://en.wikipedia.org/wiki/Base_pair
9. RNA: base-pairing
Conformation  C:G    U:A    U:G    G:A    C:A    U:C    A:A    C:C    G:G    U:U    Total
WC            49.8%  14.4%  0.01%  1.2%   0.1%   0.5%   -      -      -      -      66.1%
Wb            0.06%  0.06%  7.1%   -      0.2%   -      0.3%   0.5%   0.2%   0.9%   9.6%
Other         0.8%   5.8%   1.5%   9.4%   2.3%   0.6%   2.6%   0.5%   0.7%   0.3%   24.3%
Total         50.7%  20.3%  8.7%   10.6%  2.6%   1.0%   2.9%   1.0%   0.9%   1.3%   100.0%
(Columns: base-pair type. Rows: conformation; WC = Watson-Crick, Wb = wobble.)
Just 71.3% of rRNA contacts are canonical or G:U wobble!
Lee & Gutell (2004) Diversity of base-pair conformations and their occurrence in rRNA structure and RNA
structural motifs J Mol Biol.
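The 71.3% figure can be recovered from the table by adding the C:G and U:A entries of the WC row to the U:G entry of the Wb row. A minimal check, with the three values copied from the table and variable names chosen only for illustration:

```python
# Percentages copied from the table above (as quoted from Lee & Gutell 2004).
watson_crick = {"C:G": 49.8, "U:A": 14.4}  # canonical pairs, WC conformation
wobble_ug = 7.1                            # U:G pairs, wobble conformation

canonical_or_wobble = sum(watson_crick.values()) + wobble_ug
print(f"{canonical_or_wobble:.1f}%")  # 71.3%
```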
10. RNA stacking
Laurberg et al. (2008) Structural basis for translation termination on the 70S ribosome Nature. Image lifted from:
http://rna.ucsc.edu/pdbrestraints/index.html
11. RNA: number of structures
A_N is the number of possible sequences of length N.
A_N = 4^N
S_N is the number of possible secondary structures of length N.
S_0 = S_1 = 1
S_{N+1} = S_N + \sum_{j=1}^{N} S_{j-1} S_{N-j+1}
S_N \sim 1.8^N
Hofacker et al. (1998) Combinatorics of RNA Secondary Structures, Discrete Applied Mathematics.
12. How can we make a secondary structure prediction
algorithm?
Maximize the number of base-pairs in an
RNA sequence?
Nussinov et al. (1978) Algorithms for loop matching, SIAM J. Appl. Math.
13. Structure prediction: Nussinov
Nussinov et al. (1978) Algorithms for loop matching, SIAM J. Appl. Math.
Image from: Eddy SR (2004) How do RNA folding algorithms work? Nature Biotechnology.
14. Structure prediction: Nussinov
Maximize the number of base-pairs in an RNA sequence.
Seq = s_1 s_2 \cdots s_n
N_{i,j} = 0, \quad \forall\, j - i < 3.
N_{i,j} = \max \begin{cases}
N_{i+1,j-1} + \rho(i,j) & i, j \text{ pair} \\
N_{i+1,j} & i \text{ unpaired} \\
N_{i,j-1} & j \text{ unpaired} \\
\max_{i<k<j} [N_{i,k} + N_{k+1,j}] & \text{bifurcation}
\end{cases}
O(n^3) in CPU time, O(n^2) in memory.
\rho(i,j) = 1 if s_i and s_j are complementary, otherwise \rho(i,j) = 0.
N_{1,n} = BP_{max}.
Nussinov et al. (1978) Algorithms for loop matching, SIAM J. Appl. Math.
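The recursion above translates almost line for line into a dynamic-programming table filled in order of increasing span j − i. The following is a minimal sketch rather than a reference implementation: it uses 0-based indices, and it assumes that "complementary" means Watson-Crick pairs plus the G·U wobble, which the slide does not spell out.

```python
# Assumption: complementary = Watson-Crick pairs plus G-U wobble.
COMPLEMENTARY = {("A", "U"), ("U", "A"), ("G", "C"),
                 ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq, min_sep=3):
    """Return BP_max, the maximum number of nested base pairs in seq."""
    n = len(seq)
    if n == 0:
        return 0
    # N[i][j] stays 0 whenever j - i < min_sep (the base case).
    N = [[0] * n for _ in range(n)]
    for span in range(min_sep, n):
        for i in range(n - span):
            j = i + span
            rho = 1 if (seq[i], seq[j]) in COMPLEMENTARY else 0
            best = max(N[i + 1][j - 1] + rho,   # i and j pair
                       N[i + 1][j],             # i unpaired
                       N[i][j - 1])             # j unpaired
            for k in range(i + 1, j):           # bifurcation
                best = max(best, N[i][k] + N[k + 1][j])
            N[i][j] = best
    return N[0][n - 1]


print(nussinov("GGGAAACCC"))  # 3
```

The three nested loops over span, i and k give the O(n^3) running time, and the table N gives the O(n^2) memory use.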
15. Structure prediction: Nussinov
There are a few problems with this approach:
The solution to Nussinov is frequently not unique. For example,
the 77-nucleotide tRNA-His has 22 base-pairs in the
phylogenetic structure, yet there are 149,126 structures with the
maximal number of 26 base-pairs!
The method ignores stacking interactions.
Fontana (2002) Modelling ‘evo-devo’ with RNA. BioEssays.
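The non-uniqueness is easy to reproduce on a toy sequence. The sketch below extends base-pair maximization to also count how many distinct structures reach the maximum. It is an illustration only, not the procedure used in the cited work: it uses an unambiguous decomposition (the last base is either unpaired or paired with exactly one partner k) so that no structure is counted twice, and it assumes Watson-Crick plus G·U pairs with the same j − i ≥ 3 rule as the Nussinov recursion.

```python
from functools import lru_cache

# Assumption: Watson-Crick plus G-U wobble pairs count as complementary.
PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def count_optimal(seq, min_sep=3):
    """Return (max_pairs, number of structures that reach max_pairs)."""

    @lru_cache(maxsize=None)
    def solve(i, j):
        # Intervals too short to hold a pair have one structure: all unpaired.
        if j - i < min_sep:
            return 0, 1
        best, ways = solve(i, j - 1)                 # base j unpaired
        for k in range(i, j - min_sep + 1):          # base j paired with k
            if seq[k] + seq[j] in PAIRS:
                left_best, left_ways = solve(i, k - 1)
                in_best, in_ways = solve(k + 1, j - 1)
                score = left_best + in_best + 1
                if score > best:
                    best, ways = score, left_ways * in_ways
                elif score == best:
                    ways += left_ways * in_ways
        return best, ways

    return solve(0, len(seq) - 1)


print(count_optimal("GAAACAAAC"))  # (1, 2): G pairs with either C
```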
16. Structure prediction: Zuker
Nearest neighbour model
Modified Nussinov algorithm to find minimal free energy
(most stable) structures
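[Figure: hairpin with stem base-pairs A·U, C·G, U·A, G·C, stacks S1, S2, S3 between adjacent pairs, and a closing loop L (bases G, U, A, C).]
Free energy = L + S1 + S2 + S3
= 5.00 − 2.11 − 2.35 − 2.24
= −1.70 kcal/mol
\Delta G_{stack} = \Delta H_{37,stack} - T \Delta S_{37,stack}
\Delta G_{loop} = -T \Delta S_{37,loop}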
Tinoco et al. (1971) Estimation of secondary structure in RNA. Nature.
17. Structure prediction: Zuker
WX \ YZ   CG      GC      AU      UA      GU      UG
CG        -3.26   -2.36   -2.11   -2.08   -1.41   -2.11
GC        -3.42   -3.26   -2.35   -2.24   -1.53   -2.51
AU        -2.24   -2.08   -0.93   -1.10   -0.55   -1.36
UA        -2.35   -2.11   -1.33   -0.93   -1.00   -1.27
GU        -2.51   -2.11   -1.27   -1.36   +0.47   +1.29
UG        -1.53   -1.41   -1.00   -0.55   +0.30   +0.47
Energies (∆G in kcal/mol) of 5’-WY-3’ / 3’-XZ-5’ stacked base-pairs (row: pair WX, column: pair YZ).
Note that ∆G of 5’-WY-3’ / 3’-XZ-5’ stacks is the same as that of 5’-ZX-3’ / 3’-YW-5’ stacks.
Mathews et al. (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA
secondary structure. JMB.
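The stacking table can be used directly as a lookup to score a helix. The sketch below stores the table as given and sums the stacks along a stem, then adds a loop term. The stem used in the example, pairs U·A, G·C, A·U, C·G read along the 5’ strand, is one reading of the hairpin worked through earlier (5.00 − 2.11 − 2.35 − 2.24); that orientation and the helper names are assumptions, chosen because they reproduce the three stack values quoted there.

```python
PAIR_ORDER = ["CG", "GC", "AU", "UA", "GU", "UG"]

# Rows: first pair WX; columns: second pair YZ (values in kcal/mol, from the table).
ROWS = {
    "CG": [-3.26, -2.36, -2.11, -2.08, -1.41, -2.11],
    "GC": [-3.42, -3.26, -2.35, -2.24, -1.53, -2.51],
    "AU": [-2.24, -2.08, -0.93, -1.10, -0.55, -1.36],
    "UA": [-2.35, -2.11, -1.33, -0.93, -1.00, -1.27],
    "GU": [-2.51, -2.11, -1.27, -1.36, +0.47, +1.29],
    "UG": [-1.53, -1.41, -1.00, -0.55, +0.30, +0.47],
}
STACK = {(wx, yz): energy
         for wx, row in ROWS.items()
         for yz, energy in zip(PAIR_ORDER, row)}


def stem_energy(pairs):
    """Sum the stacking energies of consecutive base pairs in a helix."""
    return sum(STACK[(first, second)] for first, second in zip(pairs, pairs[1:]))


def hairpin_energy(pairs, loop_energy):
    """Free energy of a hairpin: loop term plus all stacks in the stem."""
    return loop_energy + stem_energy(pairs)


# Assumed reading of the example hairpin: 5'-UGAC-[loop]-GUCA-3'.
total = hairpin_energy(["UA", "GC", "AU", "CG"], loop_energy=5.00)
print(f"{total:.2f} kcal/mol")  # -1.70 kcal/mol
```

The symmetry noted under the table can be checked on the same dictionary: the entry for (WX, YZ) equals the entry for (ZY, XW).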
18. Suboptimal structures
“There is an embarrassing abundance of structures having a free
energy near that of the optimum.” (McCaskill 1990)
[Figure: free energy ∆G (kcal/mol) plotted against base-pair distance dBP(Si, Smfe) from the MFE structure, with three example secondary structures labelled Biological, Suboptimal and MFE.]
Wuchty et al. (1999) Complete suboptimal folding of RNA and the stability of secondary structures, Biopolymers.
19. Accuracy of MFE predictions
Non-independent benchmarks:
Walter et al. (1994) Mean sensitivity 63.6%
Mathews et al. (1999) Mean sensitivity 72.9%
Independent benchmarks:
Doshi et al. (2004) Mean sensitivity 41%
Dowell & Eddy (2004) Mean sensitivity 56%, mean PPV 48%
Gardner & Giegerich (2004) Mean sensitivity 56%, mean PPV 46%
Data-sets: tRNA, SSU rRNA, LSU rRNA, SRP, RNase P, tmRNA.
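Sensitivity and PPV are computed over base pairs: sensitivity is the fraction of reference pairs that were predicted, TP / (TP + FN), and PPV is the fraction of predicted pairs that are in the reference, TP / (TP + FP). The sketch below computes both from dot-bracket strings. It is an illustration under simple assumptions (exact-match pairs, no pseudoknots); the cited benchmarks may apply their own matching rules, and the function names are hypothetical.

```python
def base_pairs(dot_bracket):
    """Return the set of (i, j) pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for pos, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def sensitivity_ppv(reference, predicted):
    """Return (sensitivity, PPV) of predicted pairs against reference pairs."""
    ref, pred = base_pairs(reference), base_pairs(predicted)
    true_pos = len(ref & pred)
    sensitivity = true_pos / len(ref) if ref else 0.0
    ppv = true_pos / len(pred) if pred else 0.0
    return sensitivity, ppv


# Toy example: the prediction misses the innermost reference pair.
sens, ppv = sensitivity_ppv("(((...)))", "((.....))")
print(f"sensitivity={sens:.2f} PPV={ppv:.2f}")  # sensitivity=0.67 PPV=1.00
```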
20. Limitations of MFE predictions
Energy parameters: estimated at constant salt
concentrations and temperatures.
Energy model: models of loop energies are extrapolated from
relatively few experiments, no pseudoknots, ...
Cellular environment: contains proteins, RNAs, DNAs,
sugars, etc.
Post-transcriptional modifications: many functional RNAs
have been covalently modified.
Folding kinetics: RNAs fold along “pathways”, perhaps
becoming trapped in sub-optimal conformations.
Co-transcriptional folding: RNAs fold during transcription,
while the transcriptional apparatus occludes 3’ portions of the
sequence.
Transcription is jerky: transcriptional pausing can influence
folding.
21. Comparative sequence analysis
Input: a set of sequences with the same biological function
which are assumed to have approximately the same structure.
Output: the common structural elements, aligned sequences
and a phylogeny which best explains the observed data.
[Figure: phylogeny relating sequences 1 to 5.]
>1
GCAUCCAUGGCUGAAUGGUUAAAGCGCCCAACUCAUAAUUGGCGAACUCGCGGGUUCAAUUCCUGCUGGAUGCA
>2
GCAUUGGUGGUUCAGUGGUAGAAUUCUCGCCUGCCACGCGGGAGGCCCGGGUUCGAUUCCCGGCCAAUGCA
>3
UGGGCUAUGGUGUAAUUGGCAGCACGACUGAUUCUGGUUCAGUUAGUCUAGGUUCGAGUCCUGGUAGCCCAG
>4
GAAGAUCGUCGUCUCCGGUGAGGCGGCUGGACUUCAAAUCCAGUUGGGGCCGCCAGCGGUCCCGGGCAGGUUCGACUCCUGUGAUCUUCCG
>5
CUAAAUAUAUUUCAAUGGUUAGCAAAAUACGCUUGUGGUGCGUUAAAUCUAAGUUCGAUUCUUAGUAUUUACC
** *
1 GCAUCCAUGGCUGAAU-GGUU-AAAGCGCCCAACUCAUAAUUGGCGAA--
2 GCAUUGGUGGUUCAGU-GGU--AGAAUUCUCGCCUGCCACGCGG-GAG--
3 UGGGCUAUGGUGUAAUUGGC--AGCACGACUGAUUCUGGUUCAG-UUA--
4 GAAGAUCGUCGUCUCC-GGUG-AGGCGGCUGGACUUCAAAUCCA-GU-UG
5 CUAAAUAUAUUUCAAU-GGUUAGCAAAAUACGCUUGUGGUGCGU-UAA--
**** * **
1 ------------------CUCGCGGGUUCAAUUCCUGCUGGAUGC-A
2 ------------------G-CCCGGGUUCGAUUCCCGGCCAAUGC-A
3 ------------------G-UCUAGGUUCGAGUCCUGGUAGCCCA-G
4 GGGCCGCCAGCGGUCCCG--GGCAGGUUCGACUCCUGUGAUCUUCCG
5 ------------------A-UCUAAGUUCGAUUCUUAGUAUUUAC-C
[Figure: consensus secondary structure drawn with IUPAC ambiguity codes.]
22. Comparative sequence analysis
Evolution of RNA sequences
Base-pairs that covary have strong evolutionary support
[Figure: panel a shows an alignment with its shared structure; panels b and c show secondary-structure drawings of the ancestral, MIS and consensus sequences. A further diagram links the base-pairs G·U, A·U, G·C, U·G, C·G and U·A by substitutions labelled fast or slow.]
Panel a:
(((..(((....)))..)))
UACAAGAGUGCGCUUAAGUA
UGCAAAAGUCCGUUUAAGCA
UAUAACCUUUCGAGGAAAUA
CAUAAUAAUGCGUUGAAGUG
MIS:       YAUAANADUGCGUUGAAGUR
Ancestral: UACAAGAGUGCGUUUAAGUA
consensus: YRYAASMGUSCGYKKAAGYR
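Covariation can be read straight off an alignment once the paired columns are known. The sketch below takes the four sequences and the dot-bracket structure from panel a of this slide, collects the base-pair types seen at each paired column pair, and flags a pair as covarying when more than one type occurs and every type can still pair. Treating Watson-Crick and G·U pairs as valid is an assumption, and the helper names are illustrative.

```python
# Assumption: Watson-Crick plus G-U wobble pairs count as valid base pairs.
VALID_PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}

STRUCTURE = "(((..(((....)))..)))"
ALIGNMENT = [
    "UACAAGAGUGCGCUUAAGUA",
    "UGCAAAAGUCCGUUUAAGCA",
    "UAUAACCUUUCGAGGAAAUA",
    "CAUAAUAAUGCGUUGAAGUG",
]


def paired_columns(structure):
    """Return the list of (i, j) column pairs in a dot-bracket string."""
    stack, pairs = [], []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            pairs.append((stack.pop(), pos))
    return sorted(pairs)


def observed_pair_types(alignment, structure):
    """Map each paired column pair to the set of base-pair types observed."""
    return {(i, j): {seq[i] + seq[j] for seq in alignment}
            for i, j in paired_columns(structure)}


def covaries(types):
    """True if several pair types occur and all of them can base-pair."""
    return len(types) > 1 and types <= VALID_PAIRS


for (i, j), types in observed_pair_types(ALIGNMENT, STRUCTURE).items():
    print(i, j, sorted(types), "covaries" if covaries(types) else "no support")
```

In this small alignment every one of the six paired positions changes while remaining able to pair.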
23. Alignment Folding: RNAalifold
Generate an alignment (e.g. with ClustalW)
Find a consensus structure that is both energetically stable in
all sequences and has covariation support
[Figure: alignment with its RNAalifold consensus secondary structure, drawn with IUPAC ambiguity codes. Sequence line recovered from the figure:]
GCGGAAUUAGCUCAGUU_GGGAGAGCGCCAGACUGAAAAUCUGGAGGUCCCC_GGUUCGAAUCCCGGAAUCCGCA
Hofacker et al. (2002) Secondary Structure Prediction for Aligned RNA Sequences, J.Mol.Biol.
24. Alignment Folding: RNAalifold
RNAalifold: energy + covariation.
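\beta_{i,j} = \frac{1}{N} \sum_{\alpha}^{N} Z^{\alpha}_{i,j} - \mathrm{Cov}
C_{i,j} = \frac{2}{N(N-1)} \sum_{b^{\alpha}_i b^{\alpha}_j,\; b^{\beta}_i b^{\beta}_j} D_H(b^{\alpha}_i b^{\alpha}_j, b^{\beta}_i b^{\beta}_j)\, \Pi^{\alpha}_{ij} \Pi^{\beta}_{ij}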
Hofacker et al. (2002) Secondary Structure Prediction for Aligned RNA Sequences, J.Mol.Biol.
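The covariation term C_{i,j} can be sketched in a few lines. The version below makes several assumptions that the slide leaves implicit: the sum runs over unordered pairs of sequences α < β, D_H is the Hamming distance between the two-letter strings b_i b_j (0, 1 or 2), and Π equals 1 when a sequence can form a Watson-Crick or G·U pair at columns i and j and 0 otherwise. It is not the RNAalifold source code. The alignment is the four-sequence example from the comparative sequence analysis slide.

```python
from itertools import combinations

# Assumption: Pi = 1 for Watson-Crick and G-U wobble pairs, 0 otherwise.
VALID_PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}

ALIGNMENT = [
    "UACAAGAGUGCGCUUAAGUA",
    "UGCAAAAGUCCGUUUAAGCA",
    "UAUAACCUUUCGAGGAAAUA",
    "CAUAAUAAUGCGUUGAAGUG",
]


def hamming(pair_a, pair_b):
    """Number of positions (0, 1 or 2) at which two base pairs differ."""
    return sum(x != y for x, y in zip(pair_a, pair_b))


def covariation(alignment, i, j):
    """C_ij: mean Hamming distance between pairable base pairs at columns i, j."""
    n = len(alignment)
    total = 0
    for seq_a, seq_b in combinations(alignment, 2):
        pair_a = seq_a[i] + seq_a[j]
        pair_b = seq_b[i] + seq_b[j]
        if pair_a in VALID_PAIRS and pair_b in VALID_PAIRS:  # Pi^a * Pi^b = 1
            total += hamming(pair_a, pair_b)
    return 2 * total / (n * (n - 1))


print(round(covariation(ALIGNMENT, 5, 14), 3))  # 1.667, a strongly covarying pair
print(round(covariation(ALIGNMENT, 3, 4), 3))   # 0.0, columns that cannot pair
```

Averaging over all sequence pairs means a column pair scores highest when many sequences carry different, yet still pairable, bases at both positions.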
25. Covariation metrics
Lindgreen, Gardner & Krogh (2006) Measuring covariation in RNA alignments: physical realism improves
information measures. Bioinformatics.