2. Classification problems
What is this?
The second most abundant M. tuberculosis transcript
No hits to Rfam, no homologs have been identified, no known
function
0e+00
1e+05
2e+05
3e+05
4e+05
5e+05
6e+05
7e+05
Mycobacteria tuberculosis genome: RNA−seq and annotations
Readcounts
+ve strand
−ve strand
CDSs
Pfam domain
ncRNAs
terminators
4099500 4100000 4100500 4101000 4101500 4102000
genome coordinate
Rv3661
HAD
miscrna00343
Rv3662c
Rank Reads Gene
1 5,741K SSU rRNA
2 689K sRNA / RUF?
3 236K RNaseP RNA
4 197K LSU rRNA
5 119K AT P synthase
6 119K HSP X
7 115K tmRNA
8 103K Secreted protein
Arnvig et al.
(2011) PLoS
Pathog.
Paul Gardner Hunting RNA motifs
3. Classification problems
What about this?
The seventh most abundant N. gonorrhoeae transcript
No hits to Rfam, no homologs have been identified, no known
function
0
20000
40000
60000
80000
Neisseria gonorrhoeae genome: RNA−seq and annotations
Readcounts
+ve strand
−ve strand
CDSs
Pfam domain
ncRNAs
terminators
2137500 2138000 2138500 2139000
genome coordinate
NGO2161
Inositol_P
terminator12061
terminator12062 NGO2162
SEC−C
terminator12063
terminator12064
Rank Reads Gene
1 206K RNaseP RNA
2 178K tRNA -Ser
3 130K ZapA 3‘ UT R?
4 120K P hage protein
5‘ UT R?
5 97K tmRNA
6 93K BRO protein
7 81K sRNA ?
8 81K Hsp70 protein
9 76K tRNA -A sp
Isabella & Clark
(2011) BMC
Genomics.
Paul Gardner Hunting RNA motifs
4. Classification problems
And this?
The ninth most abundant C. difficile transcript
No hits to Rfam, no homologs have been identified, no known
function
0
1000
2000
3000
4000
Clostridium difficile genome: RNA−seq and annotations
Readcounts
+ve strand
−ve strand
CDSs
Pfam domain
ncRNAs
terminators
3014000 3014500 3015000 3015500 3016000 3016500 3017000
genome coordinate
toxB toxB toxB
terminator0881
terminator1760 trpS
Rank Reads Gene
1 12K RNaseP RNA
2 12K 6S RNA
3 12K SRP RNA
4 11K ldhA
5 10K tmRNA
6 6K spoVG
7 6K Gly reductase
8 5K slpA
9 4K sRNA / RUF?
10 4K hypothetical prot.
11 4K sRNA / RUF?
Deakin, Lawley
et al. (2012)
Unpublished.
Paul Gardner Hunting RNA motifs
5. Classification problems
And this?
The eleventh most abundant C. difficile transcript
No hits to Rfam, no homologs have been identified, no known
function
0
1000
2000
3000
4000
Clostridium difficile genome: RNA−seq and annotations
Readcounts
+ve strand
−ve strand
CDSs
Pfam domain
ncRNAs
terminators
1760600 1760800 1761000 1761200 1761400 1761600
genome coordinate
feoA2 terminator1637terminator1105 acetyltransferase
Rank Reads Gene
1 12K RNaseP RNA
2 12K 6S RNA
3 12K SRP RNA
4 11K ldhA
5 10K tmRNA
6 6K spoVG
7 6K Gly reductase
8 5K slpA
9 4K sRNA / RUF?
10 4K hypothetical prot.
11 4K sRNA / RUF?
Deakin, Lawley
et al. (2012)
Unpublished.
Paul Gardner Hunting RNA motifs
6. What is our next big challenge?
Past:
How many non-coding RNA genes are there?
Future:
Can we determine the functions, if any, for large sets of given
RNAs?
Paul Gardner Hunting RNA motifs
7. Will a “periodic table of RNA” be useful?
My aim is to build an analog to the Periodic Table for
classifying RNA families and motifs, enabling researchers to
predict function.
base basepair
R
A
U
A
G
A
U Y
A
C
A
U
U
5´
Y
G
A
A
R
5´
C
U
U C
G
G
5´
R
U
R R R
Y
5´
R
R
G
C
G
U
R A
R
A
G
C
Y
5´
R
Y
G
G
A
G
Y
R R
R
R
C R
R
G
A
R
R
5´
C
G
A
A
G
Y
Y
R
Y
Y RR
G
G
G
R
U
G
G
A
G
5´
C
C
R
A
Y
C
C
C
R
U C
C
G
A
A
C
U
Y
G
G
5´
A N Y A G N R A U N C G T loop U t ur n k t r n1 k t r n2 tw ist
R
C
Y
R
G
G
A
AC
U
G
A
RC
R
U
Y
AG
U
A
C
G
GG
A R R A5´
Y
Y
Y
A
G
U A
G Y R
A
G
G
A
A
R
R
R
5´ R
Y
G
R
Y
A
A
Y
C
RY
A Y
Y
A
G
R
GA
A
Y
C
5´
R
C
A
GG A
G
Y
5´
A
C
A
C U
G
R
Y
R
Y G Y R R R R
R
Y
C
A
R
U
Y
5´
R
A
G
C
R
C
G
R
A
G
Y AY
G
Y
Y
R
G
U
U
Y
5´
A
A
A
A
A
G
C
Y
R
Y
Y R
R
Y
G
G
Y
U
U
U
U
UU
Y U Y5´
R
R
A
R
R Y
Y
U
U
UU
U U Y5´
sar r ic1 sar r ic2 U A A G A N C sr C loop dom V t er m 1 t er m 2
R Y Y Y Y
G
C
G
A
G
C
A
G
A
C
G
C
A
R
A
A
C
R
C
C
C
R
R
Y
R
R
Y
G
G
G
Y
G
U
U
Y
U
G
C
G
U
C
U
G
C
U
C
G
C
R R R R5´
Y
U
Y
UC
U
C
A
A
C
AG
UG
Y
U
U
G
R
R
R
A
A
Y
5´ Y
Y
Y
Y
Y
A
U
GA
Y G
R
Y
Y
Y
YA
A
A Y
Y
Y
YY
R
R
G
R
R
Y
C U GA
U
Y
Y
Y
R
R
R
5´
G
G
G
U
C
U
C
U
C
U
G
Y
U
A
G
A
C
C
A
G
AU
CU
G
A
G
C
C
UG
GG
A
G
C
U
C
U
C
U
G
G
C
U
A
R
C
U
A
G
G
G
A
A
C
C
CA
C5´ UG
U
A
A
A
C
A
U
C
CU
Y G
A
C
U
G
G
A
A
G
C
UG
U
R
R
R
Y R Y
R
R
RR
G
C
U
U
U
C
A
G
U
C
G
G
A
U
G
U
U
U
G
C
5´ U CU
U
U
G
G
U
U
A
U
C
U
A
G
C
U
G
UA
U
G
AG
U
G
Y
Y R
C
RU
C
A
UA
A
A
G
C
U
A
G
A
U
A
C
C
G
A
AR
U5´ C
Y
Y
R
UC
C
C
U
G
A
G
A
C
C
C
U
A
A
C
Y
U
G
U
G
AG
Y
U
Y
YY
A
G
Y
UU
C
A
C
A
R
G
U
R
G
G
Y
U
C
U
Y
G
G
G
R
CY
R
G
G
5´
G
C
U
A
A
A
A
G
G
A
A
C
G
A
U
C
G
U
U
G
U
G
A
U
A
U
GC
G
U
U
RRU
U
YC
G
U
U
AC
A
U
A
U
C
A
C
A
G
U
G
A
U
U
U
U
C
C
U
U
U
A
U
A
R
CG
C5´ C Y GY
G
Y
Y
C
A
U
C
U
U
A
C
Y
G
RG
C
A
G
U
G
U
U
G
GA
U
G
Y
YY R
R
G
Y
C
UC
U
A
A
Y
A
C
U
G
YC
U
G
G
U
A
A
Y
G
A
U
G
R
C
RY
C G G5´ Y Y Y Y R R G
Y
A
C
A
U
R
C
U
U
C
U
U
U
A
U
A
U
C
C
C
A
U
AY
R
A
Y
R
R
R
CU
A
U
G
G
A
A
U
G
U
A
A
A
G
A
A
G
U
A
U
G
U
A
Y Y Y G G Y5´ Y R R YY
C
R
U
C
A
A
A
R
U
G
G
Y
U
G
U
G
A
R U
G
U
Y
R
U
CA
U
A
U
C
A
C
A
G
C
C
A
C
U
U
U
G
A
U
G
AG
Y U Y R R5´ Y A A RA
A
G
G
G
A
A
Y
R
G
U
U
G
C
U
G
U
G
A
U
R
U
A
Y
Y
Y
A Y
Y
Y
Y
U
YU
A
U
A
U
C
A
C
A
G
U
G
G
C
U
G
U
U
C
U
U
U
UU
G G U Y5´ Y
C
R
G
G
U
G
A
G
G
U
A
G
U
A
G
G
U
U
G
U
A
U
A
G
U
U
RR
R
R
Y
Y
Y
Y Y
G
G
A
GY
A
A
C
U
R
U
A
C
A
A
Y
C
U
R
C
U
A
C
U
U
Y
C
C
U
G
R
5´
G
G
C
U
G
G
U
C
C
G
A
R
RG
U
A
G
U
G
G
G
U
U
A
Y
R
U
Y
A
AY
Y
Y
Y
U
U
R
Y
Y Y YU
C
Y
C
C
CYC
Y
C
A
C
U RC
UR
YA
C
U
U
G
A
C
U
R
G
C
CU
U U5´ Y
Y
Y
C
U
G
Y
R
R
U
G
U
C
G
UA
R
Y
Y
Y
Y
Y
U
G
A
R
C
CRAY
Y
Y
Y
Y
Y
G
G
G
R
G
Y
Y
Y
Y
Y
R
G
G
YA
G C
C
C
YY
G
G
GA
A
R
C
A
A
R
Y
R
R
R
R
Y
R
C
C
C A CCU
R
R
R
Y
R
YRG
G
U
U
C
A
R
R
R
R
Y
A
C
G
G
C
A
Y
Y
R
Y
G
G
R
Y
YY
Y5´
Y
Y
R
C
G
R
C
C
A
UA
C
R
R
R
G
R
A
R
C
A
CC
Y
G
R
U
C
C
CA
U C
C
G
A A
CY
C
R
GA
A
G
U UA
A
GC
Y
Y
Y Y
GG
C Y
R R
G
U
A C U
R
G R YG
RG
R
AYC
CUG
GG
AA
RY
RGGU
G
Y
Y
G
Y
R
R
Y5´
G
RU
A
GYYY
AR
Y
G
G
Y AR
R R C
RY
Y
R
G
Y
U
Y A
A
Y
Y
R
R
RR
Y
R
RG
G UU
C
R
AR
U
C
C
Y
YY
YR
5´
R
R
AAR
Y
U
C
R
Y
R
R
R
R
GYYAC
R
R
YG
A
G
U
R
Y
Y R
YRCU
C Y
CYYY
Y G G G A A GGU
C U G A G
A
R
G
C
CAY
Y
R
C
C
CU
G
GGGYR
Y
Y
Y
Y
Y
Y
GR
R
R
R
G
R
R
R
R Y G R G Y Y
A
C
C
AG
A A A Y
R
R Y Y
Y
Y
R
RGY
U
U
GGAA
RRCUYRY
GGCY
RG Y R R Y U
A
G
U
C
A
A
U
R
Y
GRR
Y
R
R
Y
Y
Y
R
AAC
Y
C
R
A
UUCAG
A
C
U
A
UCU
Y
Y
5´
T R I T I R E SE C I S m ir -T A R m ir -30 m ir -9 lin-4 m ir -5 m ir -8 m ir -1 m ir -2 m ir -6 let -7 Y R N A 6S 5S t R N A R N aseP
AURRGRYA
G
G
YA
U
U
G
AA
CUGU
AU
U
G
U
G
CR
C
C
UU
GCAUARAGCUAAAGCACUAAAAAGGAGUAA5´
A
G
U
C
A
U
G
A
U
YG
C
U
A
U
U
C
Y
Y Y
A
A
A
U
A
G
UG
A
U
U
G
U
G
A
U
AG
C
G
A
U
G
C
G
G
Y
G
U
G
U
UG C
G
C
A
C
R
Y
C
G
Y
A
Y
C
G
CG
C U5´
AGAGGAARCR
G
G
G
G
C
CAY
G
C
A
GAAGC
G
U
UC
ACG
U
C
G
C
G
G
C
C
C
CU
GUC
A
G
A
U
U
C
RGU
R
A
A
U
C
U
GC
GAAUUCUGCU5´
G A U AC
A
U
A
G
G
A
A
C
C
U
C
C
U
C A
A
A
G
G
A
U
U
C
U
A
U
GG
A C AG
U
C
G
A
U
G
C
A
G
G
G
A
G
G
G A CR
R
C
U
C
C
C
U
G
C
A
U
C
G
G
CG
A U U U U5´ A
C
G
R
RG
U
R RA
R
UG
C
G
A U A A Y A YA
A
U
A
A
U
GAAA
U
U
C
C
U
CU
U U G A C
G
G
C
C
A
A
U
A
GC
GA
U
A
U
U
G
G
C
CA
U
U
U
U
U
U
U
5´ R
Y
C
U
U
U
A
G
C
G
GG
Y
U
R
RR
U
Y A R U CU
R
G
Y
Y
G
G
Y
G
U
U
U
C
G
C
C
G
R
C
Y YU
R
C
Y
Y
U
G
A
Y
R
Y
5´
RYYRYYCC
G
U
G
G
UG
A
U
U
U
G
RYC
GGCCGG
C
U
U
G
C
AG
C
C
A
C
GU
UAAAYAAUCGCUAAARAGGCCGRGGRRR5´
G
UCGRR
U
Y Y C A
C
UG
A U G AG U C Y
U R
ARGAC
G
A
AA
C
5´ Y Y R
A
U
Y
U
AAA
RA
A
A
C
A G CU
U UC
A AG
U G CCU U U Y U GC
A G
U
U
YYY
CARGAGCGC
A
A
G
A
U
RG
R U A5´
R
Y
G
GY
Y G
Y
U
U
G
C
C
A
U
A
C
G
C
C
C
YY
Y YY
C
G
G
C A
GG
U
A
U
G
G
A
A
R
C
A
C
C
C
YC
G Y A CG
A
C
U
G
GY
Y
C G
G
A
C
A
CY
GY
C
G
U
C
CC
G
C
C
A
G
A
U
C
5´ CA
C
A
U
C
A
G
A
U U U
C
C
U
G
G
U
G
UA
A CG
A
A
U
U
U
U
C
A
A
G
U
G
C
U U C
U
U
G
C
A
U
A
A
G
C
A
A
G
U
U
URA
U
C
C
C
G
C
Y
C
CY
YC
G
R
G
Y
C
G
G
G
A
UU
U5´ A U GG
A
G
A
C
A
UGGCR
U
AA
AG
C C AG
A R
A
G U R A
G
A
AC
R U A A C
Y
U
A
G
A
C
U
R
U
ACUUGAA
C
U
G
A
U
UYRC
A
U
C
U
CA
U U U U5´
G
C
R
C
Y
G
C
AA
AA
U
C
R
G
R
Y
G
C
C G G G A
U UG
G
YA
YCCCG
R
A
Y
R
R
R
R
Y
R
A R C G C
Y
GCGYU
U
U
U
U
U
5´
Y U R C G U G A C G A A G CG
C
G
C
G
CA
A
A
G
UG
G A C
AA
U
A
A
AG
C
C
UR
A G C
RU
Y
R
A
G
UAG
U
C
G
Y
CAG
A
C
G
C
C
G
G
U
U A A
G
C
C
G
G
C
G
U
UU
U U U5´ YR
Y
A
C
G
UR
Y
C
Y
G
U
U
R
UR
G
Y
C
C
G
G
U
U
G
C
U
U
UG
GU
C
G
G
U
G
A
C
C
G
G
R
R R R
R
A
G
C
C
C
R
C
UU G
G
U
G
G
G
Y
U
UU
U U5´
G
G
Y
C
R
G
C
Y
C
R
CC
C C CC
R
G
R
G
C
Y
G
R
C
C
G A C G G C C C C C G C
U CC
C
C
CCY
GGCGGGGGYCGUC
C
C
Y
Y
5´
U U G G C G A U R UU
U
U
U
G
GU
U G
G
A
A
U
G
UAGUGY
YY
UU
A
R C A C U AA
A CG
C U G
CC
A C AA
A
U
A
A
CCUG
U
CAGU
U
A
U
U
U
C
A
Y
C
A
A
A
AA
U A A A5´
RYYRYUG
C
C
C
UCY
G
G
G
CG
UUUCCUCCCUAGACUU
G
G
C
Y
Y
YY
R
R
G
G
C
CU
UUUUUUUYYY5´
SA M V sym R C P E B 3 F inP sr oB m sr SA M a H H 3 V m nt n3 livK D sr A C A E SA R isr K sr oD isr B 6C r spL suhB
UY
G
C
A
UCCGCYAA
Y
CGGUY
A G C C GU G UC
G C GG A
A G
G
U
U
Y Y
Y
A
A
C
CA
G C UR
Y
Y U Y Y G RA
ACRRAG
RRA
GGUG
A
G
C
G
5´
UG
A
A
A
GAC
G
C
G
C
A
U
U
U
GU
U A U C A U CA
UC
C CU
G
U Y
C
A
G
AG
A
U
G
Y
A
A
U
U
U
GG
CC
AC
AG
Y
RY
G
U
G
G
C
C
U
U
U
U
C
5´
* U
U
C
U
A
C
U
G
A
C
U
C
UU
U
U
A
AA
A
U
A
AU
U
A
U
U
C
A
U
U
G
G
AG
G U UU
A
A
UA
U
G
A
A
U
A
UA
A A G G A U G A G CA
U A
U
A
G
A
AG
C
GUUUG
C
UCYUU
GU
U
A
G
AU
C
R
G
U
U
A
G
U
A
G
G
AA
5´
G A U U UG
G
U
R
R
C
U
G
C
G
C
U
C
U
U C U
A
A
G
C
C
A
G
U
U
A
C
C
CG
G
U
U
C
A
A
A
R
A
U
U
G C C
A
G
C
U
U
Y
G
A
A
C
CU
UC
G
A
A
A
A
A
C
C
A
C
C
U
Y C
R
R
G
G
U
G
G
U
U
U
U
U
U
C
GU
5´
R R R R R R R R
C
U
C
R
U
AU
A
A
YYYCRRR
AA
U
A
UG GY
Y Y G R R A
GU
U U C UAC
C R R G Y R
C CG
U
AAA
YRYYYG
A
CU
A
Y
G
A
G
RR
R5´
C
G
G
C
A
U
C
C
C
C
A
U
U
A C C
U
A
U
G
G AC
A
CG
G
U
G
C
C
G
C A R G C U C U G G R A
G UU
C
GUYCCRGAGYYUG
Y
Y
G
G
A
A
R
G
G
U
U
U
U
C
C
G
U
G
U
C
C
A
G
5´
R
R
Y
G
G
A
R
G
CRR
U
GA
R
Y
R
Y
Y
Y
YU
Y
A
U
YU
G G GCA
C
Y
U
G
R
R
R
Y
R
YG
G
A
G
C
YAG
U R GU
G
C
A
ACCG
R
C
C
R
Y
R
R
R
5´
G
U
U
G
U
A
A
C
U
AU
G
U
U
G
C
A
R
YA
R A C G AG
A
A
C
C
G
AG
U
A
U
A
G
U
U
C
A
U
GG
G
R
U Y A
CA
UG
AA
UU G U UU
A
A
CU
RU
CC
U
C
U
GG
A
U U
C
CC
G
U
C
C
AU
G
R
C
A
GU
C
G
G
U
U
C
5´
CUUA
C
U
G
A
GA
G
C
A
C
AA
A
GU
UUC
C
C
G
U
GC
CA
A
C
A
G
G
G
A
G
U
G
U
UAU
A
AC
G
G
UU
UAUU
A
G
U
C
U
G
G
AG
ACG
G
C
A
G
A
C
U
AU
CCUCUUC
C
C
G
G
U
C
CC
CUA
U
G
C
C
G
G
GU
UUUUUUUAUGUC5´
UURGRYUYRCCUG
A
A
U
G
U
G
A
CU
A
U
C
A
C
U
U
CA
AACRRYGRGYAACCUCAGUAUCAUCRYRGAGYUA
A
A
C
C
C
U
C
G
C
C
G
C
CUG
A
C
G
G
Y
G
A
G
G
G
U
U
UU
CUUUUGGR5´
U G U A A A A A A C A U Y A U U UA
G
C
GUGAYU
U
U
C
U
A
U
C
A
ACA
GC U A A C
A
A
U
U
G
U
UA
U
U
A
C
UG
C
CUA
A
Y
G
Y
U
C
A
UA
A G G G U A AUU
U
U
A
A
A
A
A
AGG
G CG
A
U
A
A
AA
A
A
C
G
A
U
U
G G GGG
A
U
G
A
G
A
Y
A
U
G
AAC
G
C
UC
A A G C A5´
C C C A G A G G U A U U G A UU
G
G
U
G
A
U
R
R
C
A
Y
Y
U C U
R
U
G
Y
U
Y
A
U
UY
A
U
UR
C
A
C
C
A
A C C U G C G C RG
A
UGCGCAGGU
U
U
U
U
U
U
U
5´
AR
R
R
Y
Y
YYYAAURYCAACYUUUAGCGCACG
G
C
U
C
U
YY
A
A
G
A
G
C
CA
UUYCCCUA
G
R
C
C
A
A
A
C
A
G
GAAU
Y
G
U
U
U
G
G
Y
C
UU
UUUUU5´
G
G
G
C
A
R
G
A
U
A
U
G
U
G
A
A
GU
R
GC
Y
A
C
C
GC
AA
GC
YGR
U
A
CY
CUU
CAC
Y
Y Y C C
U
U
A U UC
G C
U
Y
GC
U
CAAC
GGR
A
U
C
Y
U
G
C
U
CU
G C G A G G C Y5´
GUGCRRYCYRAUUYYR
G
Y
Y
G
Y
G
C
C
Y
R
Y
R
A
R
AAC
AUCAYAA
R
A
U
A
CG
G
C
R
C
R
R
CC
ACRAUUUCCCUG
G
U
G
U
U
GG
C
G
C
A
GU
AUU
C
G
C
G
C
A
C
CC
CGGUCUACC5´
Y
U
U
Y
R
Y
U
R
R
U
U
U
Y
A
U
C
A
R
A
YC
U GU
U
U
G
A
U
R
R
A
A
G
Y
U
A
R
Y
G
A
R
R Y Y C A Y UA
A
C
R
G
C
U
Y
U
Y
GC
Y G
G
C
Y Y G
A
C
C
C
G
A
G
R
Y
Y
G
U
UU
U U U U5´
RACGUUCA
Y
C
C
Y
YY
R
G
G
R
CGCAYRA
Y
C
A
R
R
Y
C
A
Y
G
GAA
C
G
G
G
G
R
Y
Y
U
G
R
R
5´
sucA Sr aD sxy R N A I P ur ine SA M -C hl cdiG M P 2 A nt i-Q G adY r nk ldr P r fA O m r A -B R yeB t r aJ 2 Sr aH 23Sm et h D S-p ep
U U C G G C C Y CG
C
R
R
C
G
YU
U YU
Y
C
G
Y
Y
G
CC
C U C U G C A YG
C
C
G
U
C
G
C
C
G
A
CGCAY
U
C
C
Y
A
U
U
CG
A
A Y Y G U
G
C
G
A
U
C
C
U
G
U
C
G
C CY
U
C
C
U
GC
G
G
C
G
C
G
G
C
5´ CG
Y
R
G
C
G
C
U
U
G
U
UA
U U
U
R
Y
Y
G C U
G
U
G
U
A
G U GUC
G
U
C
Y
YR
A R Y Y R G R R Y Y Y
A
A
A
C
C
C
C
G
C
C
Y
UU
Y
G
G
C
G
G
G
G
U
U
U
UG
C U U U U U5´
** C
U
U
A
C
C
G
G
A
G
GY
R
U
A
UGGAC
C
C
UG
A UC
C C AC
Y C C U
C
U
C
C
C
C G
A
UG
G
A
G
AA
U
Y
Y
YU
U
U
C
C
G
G
U
A
A
GC
C Y G Y C U Y Y
R
C
U
G
Y
Y
U
U
A
C
C
G
G UG
Y
G
U
A
A
G
G
C
A
G
UG
A C G U Y U5´
G
G
R
A
G
R
Y
R
Y
CU
G
GU G R
Y
C
G
G
C
U
UC
A AA
CC
GR
Y G
RR
G
Y
R
Y
Y
Y
Y
G
G
Y
RG
G U
U
C
G
AY
U
C
C
Y
RY
Y
C
U
Y
C
C
5´ U
G
A
C
C
C
U
U
U
A R
C
C
R
A
G
G
G
U
C
AC
C U A G C C A A C U G A C GU
U
G
U
U
AG
U
G
A
A
Y
YY
A
U
G
U
U
C
A
C A
RA
U
A
R
GC
C
A
A
U
C
G
C
U
U
U
G
C
G
R
U
U
G
GC
U U U U U U U U U5´ C U U A A UR
A
A
CAA
G
A
A
A
A
C
YA
A
R C G
U
A
C
Y
U
U
C
C
Y C
C
U
G
AG
UU
C
A
G
G
C
U
G
G
A
A
UG
C
G
C A
CA
G C U RA
U U G U U G A U AA
G G G CU
ACUC
AUACCGACAA
GC
CAGU
G
A
A
G
C
G
AU
G
A
AU
G
U
C
GG
U
U
CC
A C5´
R
U
Y
Y
RC
U
G
A
Y
GA
G
U
C
C
C
A
A AU
A
G
G
A
CGA
A
A C G C
GCGU
CY
G
R
A
U
5´ CU
C
C
A
U
GU
A
U
C
U
UU
G
G
G
A
C
C
U
G
U
C
A
GC
UG
U
G
G
C
A
G U
CU
C
C
C U
UC
C
U
A
G
CC
A
U
G
G
AA
G A G C A U A U U C UU
G
U
U
U
AU
U
G
G
C
A
A
A
GC
U
G
U
CA
C
C
A
U
UU
RA
U
U
G
G
UA
U
C
A
G
A U U
C
U
GAC
U
U
G
C
A
C
A
AG
U
A
A
C
AU
U C5´ C Y G G U U GG
U
G
G
C
G
C
A
C
U
U
C
C
Y
Y
A
C
G
G
G
C
G
G
U
G
U R
U
Y
A
CG
Y R Y U R Y R R Y A G A R R R A Y A C C
A
G
C
C
C
G
C
Y
RR
R
A
G
C
G
G
G
C
UU
U U U U5´
G
U
C
A
U
A
C
U
A
C
G
G
UG
C
A
A
Y
GY
R
RA
A
A
G
U A
A
AC
G
A
U
G
A
C
C C Y
A
RG
A
A
C
U
C
Y
RG
G U A
A A
A
U
R
CR
UAUC
A
A
A
A
U
G
Y
A
A
A
A
U
U
G
U
Y U G A C C U G G GR
UY
Y
UCCGGGUYRG
Y
U
Y
U
U
U
U
5´
U R U G C U A A C U R R R A A YG
U
U
G
Y
A U
R
Y
A
A
CCC
U
U
G
R
Y
G
C
U
U
A
U Y
CC
U
U
U
R
Y
C
A
A
GC
A U A U U A Y AR
C
G
R
U
C
G
Y
YA
A A G G A G A A A U G5´
U C R A A A G A A C AU
G
A
A
A
U
G
G
A
G
G
AGAAAUU
AC
A
GC
A A U U UA
UC
AR
C U
G
A
A
A
UU
A
U
AG
G
U
GU
AG
AC
AC
A
UGUC
A
GC
R G UG
G
A
A
A
CAGUU
UC U A
UC
A A A A UU
A A AG
U
A
U
UUAG
A
G
AUUUU
C
C
U
C A
AA
U
U
U
C
AA
A U5´
ACAG
G
G
U
A
R
G
G
R
Y
Y
Y
Y
Y
UU
RU
R
R
R
R
R
Y
C
C
U
U
A
C
C
G
GR
UUUCU
C
A
A
R
U
Y
G
G
R
G
YA
AA
Y
C
C
G
R
U
U
G
RA
RUAUARAGGARG5´
CGYGUUA
U
A
U
G
CC
UU
U
A
U
U
G
UC
ACARUUYUUUUUYYG
Y
U
G
R
Y
C
A
U
U
G
GYAY
YA
U
U
R
A
U
U
Y
C
C
A
G
CR
AUAAAYG
A
C
A
A
G
C
C
C
G
A
A
C
RY
U
G
U
U
C
G
G
G
C
U
U
U
UU
UUURRUYA5´
Y Y Y AU
G
G
Y
G
G
Y
G
R
G
G
G
R
RCC
UU
Y
G GG Y
Y
G
C
C
G
GUU
C
C
YY
R
CCG
GU Y U RC
C
A
A
C
C
C
Y
Y
R
C
Y
R
C
C
AC
C Y5´
AUGGAYRU
G
C
G
C
A
GGA
A
G
C
G
CR
AAGACARACAGGGACACRYAGGRA
C
CCG
GA
UGGYGGRRYAGGAUGUCAGGRAACAGUCUGCA
A
A
G
C
C
C
C
G
C
YY
YG
G
C
G
G
G
G
U
U
U
U
5´
P s-R ho r nk ps M gsens t R N A S Q r r isr C H H 1 SN R 24 T r p ldr gr eA pr eQ 12 H A R 1F T er m L eu M icC C 4 R sm Y R ib osom e
Paul Gardner Hunting RNA motifs
8. What is a motif?
Escher, MC (1938) Sky and Water 1.
Wiktionary definitions for motif:
A recurring or dominant element; a
theme.
For the purposes of this work:
an RNA motif is a recurring RNA
sequence and/or secondary structure
found within larger structures that
can be modelled by either a
covariance model or a profile HMM.
Paul Gardner Hunting RNA motifs
9. What is a motif?
The kink-turn RNA motif
Found in rRNA, riboswitches, RNase P, snoRNAs, ...
An asymmetric internal loop
Causes a sharp kink between two flanking helical regions
Cruz & Westhof (2011) Sequence-based identification of 3D structural modules in RNA with RMDetect. Nature
Methods.
Paul Gardner Hunting RNA motifs
10. Can we build a seed alignment and CM resource of these
motifs?
For the hairpin motifs?
YES
For the internal loop
motifs?
MAYBE
For the sequence motifs?
Not certain yet
GNRA
Y
G
A
A
R
5´
k-turn-1
R
Y
G
G
A
G
Y
R R
R
R
C R
R
G
A
R
R
5´
k-turn-2
C
G
A
A
G
Y
Y
R
Y
Y R
R
G
G
G
R
U
G
G
A
G
5´
Doshi et al. (2004) Evaluation of the suitability of free-
energy minimization using nearest-neighbor energy pa-
rameters for RNA secondary structure prediction. BMC
Bioinformatics.
Paul Gardner Hunting RNA motifs
11. The RNA Motifs: RMfam
ANYA
R
A
U
A
G
A
U Y
A
C
A
U
U
5´
GNRA
Y
G
A
A
R
5´
T-loop
R
U
R R R
Y
5´
UNCG
C
U
U C
G
G
5´
U-turn
R
R
G
C
G
U
R A
R
A
G
C
Y
5´
k-turn-1
R
Y
G
G
A
G
Y
R R
R
R
C R
R
G
A
R
R
5´
k-turn-2
C
G
A
A
G
Y
Y
R
Y
Y R
R
G
G
G
R
U
G
G
A
G
5´
pK-turn
A G
G
R
R
R
Y
R
R
Y
R C Y
C
G
Y
Y
R
Y
Y
Y
C
5´
sarcin-ricin-1
R
A
C
C
Y
R
R
Y
G
A
A
C
U
G
A
A
R C
A
U
C
U
C
A
G
U
A
R
Y
R
G
A
G
G
A R A A5´
sarcin-ricin-2
Y
Y
A
G
U
A
G Y R
A
G
G
A
A
R
5´
tandem-GA
G C
R
R
U
G
A
R
R
R
R
A
Y
A
A
G
A
R
AR
YRY
Y
Y
Y
R
G
A
G
5´
twist_up
C
C
G
A
Y
C
C
C
A
U C
C
G
A
A
C
U
C
G
G
5´
UAA_GAN
R
Y
G
R
Y
A
A
Y
C
R
Y
A Y
Y
A
G
R
G
A
A
Y
C
5´
C-loop
G
A
C
A
C U
G
Y
R
Y R Y R R
R
R
C
A
R
U
Y
5´
Csr-Rsm
R
C
A
GG A
G
Y
5´
domain-V
R
A
G
C
R
C
G
R
A
G
Y A
Y
G
Y
Y
R
G
U
U
Y
5´
SRP_S_domain
R
GR
A
C Y
Y
R
Y
C
A
G
G
Y
G
R A
A
R
A
G
C
A
G
Y
R
A
A
Y
5´
vapC_target
A U A U
Y
G
R
R
C
R
C
C
R
G
Y
C
G
C R
Y
Y
G Y C5´
terminator1
A
A
A
A
A
G
C
Y
R
Y
Y R
R
Y
G
G
Y
U
U
U
U
U
U Y U Y5´
terminator2
R
R
A
R
R Y
Y
U
U
U
U U U Y5´
TRIT
R Y Y Y Y
G
C
G
A
G
C
A
G
A
C
G
C
A
R
A
A
C
R
C
C
C
R
R
Y
R
R
Y
G
G
G
Y
G
U
U
Y
U
G
C
G
U
C
U
G
C
U
C
G
C
R R R R5´
R R A R G G R G R R Y A U G R R5´
R R G G R R R Y A U G A R R5´
A A G G R R A U G R5´
Paul Gardner Hunting RNA motifs
12. Breaking Rules
Prefer motifs found in
multiple families &/or
multiple locations in the
same family
Aligning analogous
structures
Motifs are allowed to
overlap
Models are non-specific
High false-positive rates
Sequences are not edited,
spliced or otherwise
interfered with
Sequence sources are
diverse (PDB, EMBL &
publications)
Paul Gardner Hunting RNA motifs
13. An example entry
# STOCKHOLM 1.0
#=GF ID twist_up
#=GF AU Gardner PP
#=GF SE Gardner PP
#=GF SS Published; PMID:21976732
#=GF GA 19.00
#=GF DR URL;http://rnafrabase.cs.put.poznan.pl/?act=pdbdetails&id=1S72
#=GF RN [1]
#=GF RM 21976732
#=GF RT Clustering RNA structural motifs in ribosomal RNAs using
#=GF RT secondary structural alignment.
#=GF RA Zhong C, Zhang S
#=GF RL Nucleic Acids Res. 2012;40:1307-17.
2zkr_Y/27-46 CCGAUCUCGUC-UGAUCUCGG
#=GR 2zkr_Y/27-46 SS <<<<---<_____>--->>>>
2gv3_A/3-20 CCAGACUC---CCGAAUCUGG
#=GR 2gv3_A/3-20 SS (((((.(.---....))))))
3iz9_C/29-48 CGGAUCCCGUC-AGAACUCCG
#=GR 3iz9_C/29-48 SS ((((...(...-.)...))))
3bbo_B/31-50 CCAA-UCCAUCCCGAACUUGG
#=GR 3bbo_B/31-50 SS .(.....<.....>.....).
1nkw_9/33-53 CCCACCCCAUGCCGAACUGGG
#=GR 1nkw_9/33-53 SS (((...............)))
1c2x_C/31-51 CUGACCCCAUGCCGAACUCAG
#=GR 1c2x_C/31-51 SS ((((...<.....>...))))
1giy_B/33-53 CCGUUCCCAUUCCGAACACGG
1S72_9/29-50 CCGUACCCAUCCCGAACACGG
#=GR 1S72_9/29-50 SS ((((...(.....)...))))
#=GR 1S72_9/29-50 MT ...xxxxx.....xxxxx...
#=GC SS_cons ((((...(.....)...))))
//
Paul Gardner Hunting RNA motifs
14. An example entry
# STOCKHOLM 1.0
#=GF ID twist_up
#=GF AU Gardner PP
#=GF SE Gardner PP
#=GF SS Published; PMID:21976732
#=GF GA 19.00
#=GF DR URL;http://rnafrabase.cs.put.poznan.pl/?act=pdbdetails&id=1S72
#=GF RN [1]
#=GF RM 21976732
#=GF RT Clustering RNA structural motifs in ribosomal RNAs using
#=GF RT secondary structural alignment.
#=GF RA Zhong C, Zhang S
#=GF RL Nucleic Acids Res. 2012;40:1307-17.
2zkr_Y/27-46 CCGAUCUCGUC-UGAUCUCGG
#=GR 2zkr_Y/27-46 SS <<<<---<_____>--->>>>
2gv3_A/3-20 CCAGACUC---CCGAAUCUGG
#=GR 2gv3_A/3-20 SS (((((.(.---....))))))
3iz9_C/29-48 CGGAUCCCGUC-AGAACUCCG
#=GR 3iz9_C/29-48 SS ((((...(...-.)...))))
3bbo_B/31-50 CCAA-UCCAUCCCGAACUUGG
#=GR 3bbo_B/31-50 SS .(.....<.....>.....).
1nkw_9/33-53 CCCACCCCAUGCCGAACUGGG
#=GR 1nkw_9/33-53 SS (((...............)))
1c2x_C/31-51 CUGACCCCAUGCCGAACUCAG
#=GR 1c2x_C/31-51 SS ((((...<.....>...))))
1giy_B/33-53 CCGUUCCCAUUCCGAACACGG
1S72_9/29-50 CCGUACCCAUCCCGAACACGG
#=GR 1S72_9/29-50 SS ((((...(.....)...))))
#=GR 1S72_9/29-50 MT ...xxxxx.....xxxxx...
#=GC SS_cons ((((...(.....)...))))
//
Paul Gardner Hunting RNA motifs
15. An example annotated Rfam alignment
# STOCKHOLM 1.0
#=GF ID 5S_rRNA
#=GF AC RF00001
#...
#=GF WK 5S_ribosomal_RNA
#=GF SQ 712
#=GF MT.G GNRA
#=GF MT.s sarcin-ricin-2
#=GF MT.t twist_up
X62858.1/62-115 GCUGCG-U-UGUGGGUGUGUACUGCGGUUUU-UUG-CUGUGGGAA--GCCCACUUCA-CUG
X01588.1/64-115 UCACGU--UAGUGGG---GCCGUGGAUACCGUGAGGAUCC-GCAG-CCCCACU-AAG-CUG
#=GR X01588.1/64-115 MT.0 .......................GGGGGGGGGGGGGGGGG.....................
X05870.1/363-414 UCACGU--UGGUGGG---GCCGUGGAUACCGUGAGGAUCC-GCAG-CCCCACU-AAG-CUG
#=GR X05870.1/363-414 MT.0 .......................GGGGGGGGGGGGGGGGG.....................
L27170.1/63-116 CCAGCG--UCCGGCA--AGUACUGGAGUGCGCGAGCCUCUGGGAA-AUCCGGU-UCG-CCG
#=GR L27170.1/63-116 MT.0 ..ssss..sssssss..ssssssssssssssssssssssssssss.sssssss.sss.ss.
L27163.1/61-115 CCAGCGUUCCGGGGA---GUACUGGAGUGCGCGACCCUCUGGGAA--ACCGGGUUCG-CCG
#=GR L27163.1/61-115 SS ...(((((((((.........)))))))))(((((((...((......))))).))).)..
#=GR L27163.1/61-115 MT.0 .................................GGGGGGGGGGGG..GGGGGGG.......
#=GR L27163.1/61-115 MT.1 ..sssssssssssss...sssssssssssssssssssssssssss..ssssssssss.ss.
L27343.1/60-114 CCAGCGAACCAGCUA---GUACUAGAGUGGGAGACCCUCUGGGAG-CGCUGGU-UCG-CCG
#=GR L27343.1/60-114 MT.0 .......................GGGGGGGGGGGGGGGGG.....................
#=GR L27343.1/60-114 MT.1 ....sssssssssss...sssssssssssssssssssssssssss.sssssss.sss.s..
M33891.1/66-117 CCAGCG--CCGGAGA---GUACUGCGC-GGGCAACCGCGUGGGAG--GCGAGG-CCGCUCG
#=GR M33891.1/66-117 MT.0 .......................GGGG.GGGGGGGGGGGG.....................
X01484.1/62-114 GUCAGG--CCCAGUU--AGUACUGAGGUGGGCGACCACUUGGGAA-CACUGGG-U-G-CUG
#=GR X01484.1/62-114 MT.0 ........................GGGGGGGGGGGGGGGG.....................
#=GR X01484.1/62-114 MT.1 ...sss..sssssss..ssssssssssssssssssssssssssss.sssssss.s.s.ss.
#=GC SS_cons ::::::::::<-<<--------<<-----<<____>>----->>----->>->::::::::
#=GC RF cCaGcG..cccggga...GUAcUggGGUggGcgAcCcucUGGgAA.CaCccgg.uCG.CUG
//
Y
R
C
G
R
C
C
A
U
A
C
R
G
R
R
C
A
C
C
G
U
C
C
C
R U Y
C
G
A
C
C
G
R
A
G
U Y
A
A
GC
Y
G
G C Y R
G
U
A C U
R R Y
G GG
G
ACY
YUG
GG
AA
Y
RGGY
G
Y
Y
G
Y
R
5'
Paul Gardner Hunting RNA motifs
16. An example annotated Rfam alignment
# STOCKHOLM 1.0
#=GF ID 5S_rRNA
#=GF AC RF00001
#...
#=GF WK 5S_ribosomal_RNA
#=GF SQ 712
#=GF MT.G GNRA
#=GF MT.s sarcin-ricin-2
#=GF MT.t twist_up
X62858.1/62-115 GCUGCG-U-UGUGGGUGUGUACUGCGGUUUU-UUG-CUGUGGGAA--GCCCACUUCA-CUG
X01588.1/64-115 UCACGU--UAGUGGG---GCCGUGGAUACCGUGAGGAUCC-GCAG-CCCCACU-AAG-CUG
#=GR X01588.1/64-115 MT.0 .......................GGGGGGGGGGGGGGGGG.....................
X05870.1/363-414 UCACGU--UGGUGGG---GCCGUGGAUACCGUGAGGAUCC-GCAG-CCCCACU-AAG-CUG
#=GR X05870.1/363-414 MT.0 .......................GGGGGGGGGGGGGGGGG.....................
L27170.1/63-116 CCAGCG--UCCGGCA--AGUACUGGAGUGCGCGAGCCUCUGGGAA-AUCCGGU-UCG-CCG
#=GR L27170.1/63-116 MT.0 ..ssss..sssssss..ssssssssssssssssssssssssssss.sssssss.sss.ss.
L27163.1/61-115 CCAGCGUUCCGGGGA---GUACUGGAGUGCGCGACCCUCUGGGAA--ACCGGGUUCG-CCG
#=GR L27163.1/61-115 SS ...(((((((((.........)))))))))(((((((...((......))))).))).)..
#=GR L27163.1/61-115 MT.0 .................................GGGGGGGGGGGG..GGGGGGG.......
#=GR L27163.1/61-115 MT.1 ..sssssssssssss...sssssssssssssssssssssssssss..ssssssssss.ss.
L27343.1/60-114 CCAGCGAACCAGCUA---GUACUAGAGUGGGAGACCCUCUGGGAG-CGCUGGU-UCG-CCG
#=GR L27343.1/60-114 MT.0 .......................GGGGGGGGGGGGGGGGG.....................
#=GR L27343.1/60-114 MT.1 ....sssssssssss...sssssssssssssssssssssssssss.sssssss.sss.s..
M33891.1/66-117 CCAGCG--CCGGAGA---GUACUGCGC-GGGCAACCGCGUGGGAG--GCGAGG-CCGCUCG
#=GR M33891.1/66-117 MT.0 .......................GGGG.GGGGGGGGGGGG.....................
X01484.1/62-114 GUCAGG--CCCAGUU--AGUACUGAGGUGGGCGACCACUUGGGAA-CACUGGG-U-G-CUG
#=GR X01484.1/62-114 MT.0 ........................GGGGGGGGGGGGGGGG.....................
#=GR X01484.1/62-114 MT.1 ...sss..sssssss..ssssssssssssssssssssssssssss.sssssss.s.s.ss.
#=GC SS_cons ::::::::::<-<<--------<<-----<<____>>----->>----->>->::::::::
#=GC RF cCaGcG..cccggga...GUAcUggGGUggGcgAcCcucUGGgAA.CaCccgg.uCG.CUG
//
Y
R
C
G
R
C
C
A
U
A
C
R
G
R
R
C
A
C
C
G
U
C
C
C
R U Y
C
G
A
C
C
G
R
A
G
U Y
A
A
GC
Y
G
G C Y R
G
U
A C U
R R Y
G GG
G
ACY
YUG
GG
AA
Y
RGGY
G
Y
Y
G
Y
R
5'
Paul Gardner Hunting RNA motifs
17. Uses for a motif DB (I): Improving a RNA alignments and
structure prediction constraints
Y
R
C
G
R
C
C
A
U
A
C
R
G
R
R
C
A
C
C
G
U
C
C
C
R U Y
C
G
A
C
C
G
R
A
G
U Y
A
A
GC
Y
G
G C Y R
G
U
A C U
R R Y
G GG
G
ACY
YUG
GG
AA
Y
RGGY
G
Y
Y
G
Y
R
5'
Original Rfam
model
Sarcin-ricin
loop
GNRA
tetraloop
Refined Rfam
model
Y
R
C
G
G
C
C
A
U
A
C
R
G
R
R
C
A
C
C
G
U
C
C
C
R U
Y
C
G
A
C
C
G
A
A
G
U
Y
A
A
GC
Y
G
G C Y R G
U A C U
R R G G G
GACCYUGGGAA
YRGGY
G
Y
Y
G
Y
R
5'
Paul Gardner Hunting RNA motifs
18. Uses for a motif DB (II): RNA structural motifs
The Lysine Riboswitch
R
R
G
U
A
GA
G
G
Y
GCRRYY
A
AG
U
A
YUY
G
RR
R
R
Y C R R U
G
A
R R R Y
G
A A R G
G
R R R R U Y G C
C
G
A A
R
Y
R
R
R
Y Y
Y
Y
Y
G
Y
U G
G
Y
R
Y
R
Y
G A
A
R
R
R
Y
R
R
R
A
C
U
G
U C A Y R R
Y
YRUGG
A
G
GC
U
A
Y
Y
5´
G
A
G
Y
R
R
C R
G
A
5´
R
G
R
Y
R A
C
Y
5´
G
R
A
5´
U
R
5´
32% K-turn11% U-turn
13% T-loop 21% GNRA
Tetraloop
P1
P2
P3
P4
P5
Paul Gardner Hunting RNA motifs
19. Uses for a motif DB (II): RNA structural motifs
The Lysine Riboswitch
Pasteurella multocida
Bacillus sp.
Escherichia coli
Staphylococcus epidermidis
Lactococcus lactis subsp. lactis
Staphylococcus haemolyticus
Staphylococcus saprophyticus
Actinobacillus pleuropneumoniae
Haemophilus ducreyi
Bacillus halodurans
Vibrio splendidus
Bacillus subtilis
Serratia proteamaculans
Listeria monocytogenes
Vibrio sp.
Vibrio vulnificus
Lactobacillus plantarum
Mannheimia succiniciproducens
Bacillus cereus
Klebsiella pneumoniae
Lactococcus lactis subsp. cremoris
Listeria innocua
Haemophilus influenzae
Vibrio parahaemolyticus
Clostridium botulinum
Thermoanaerobacter tengcongensis
Lactobacillus brevis
Thermoanaerobacter sp.
Bacillus licheniformis
Pectobacterium atrosepticum
Mannheimia haemolytica
Yersinia frederiksenii
Staphylococcus aureus
Haemophilus somnus
Listeria welshimeri
Proteobacteria
Firmicutes
Pasteurell.
Vibrionales
Enterobac.
Clostridia
Lactobacil.
Bacillales
P2 stem
R
Y
G
G
A
G
Y
R R
R
R
C R
R
G
A
R
R
5´
Kink-turn
R
R
G
C
G
U
R A
R
A
G
C
Y
5´
U-turn
P4 stem
Y
G
A
A
R
5´
GNRA
tetrloop
R
U
R R R
Y
5´
T-loop
P2 P4
Paul Gardner Hunting RNA motifs
20. Uses for a motif DB (III): Functionally characterising novel
RNAs
Y
Y
U
R
R
U
Y
Y
Y
A G
R
A
RYC
RCAY
G
G
A Y
G Y
G R Y A G
R
Y
G
G
C
C
A
Y G G
A
Y
G
G
C
R
Y
R
U
R
R
C
G
G
Y
Y
C
G U Y
R
R
Y
R
A
R
Y
Y
Y
R
R5'
Y
R
GG A
R
5´
Csr-Rsm binding
motif
TwoAYGGAY sRNA: 216 homologues
found in Human gut metagenome
sequences, γ-Proteobacteria and
Clostridiales species
Weinberg Z et al. (2010)
”Comparative genomics reveals 104
candidate structured RNAs from
bacteria, archaea and their
metagenomes”. Genome Biol.
Paul Gardner Hunting RNA motifs
21. Csr-Rsm background
Toledo-Arana et al. (2007) Small noncoding RNAs controlling pathogenesis. Current Opinion in Microbiology.
Paul Gardner Hunting RNA motifs
22. Uses for a motif DB (IV): Identifying and ranking ncRNAs
of interest in transcriptome results
0 50 100 150 200
0.000.030.06
Distribution of motif scores on S. typhi ncRNAs
Sum of motif scores (bits)
Density Known RNAs
RNA−seq ncRNAs
Shuffled
0 50 100 150 200
0.00.40.8
CDF of motif scores on S. typhi ncRNAs
Sum of motif scores (bits)
CDF
K−S.test(Known RNA > Shuffled) => P=2.11e−37
K−S.test(RNA−seq RNA > Shuffled) => P=1.47e−09
Paul Gardner Hunting RNA motifs
23. Uses for a motif DB (IV): Identifying and ranking ncRNAs
of interest in transcriptome results
0 500 1000 1500 2000
0.00000.0010
Distribution of motif scores on S. typhi ncRNAs alignments
Sum of motif scores (bits)
Density Known RNAs
RNA−seq ncRNAs
Shuffled
0 500 1000 1500 2000
0.00.40.8
CDF of motif scores on S. typhi ncRNAs alignments
Sum of motif scores (bits)
CDF
K−S.test(Known RNA > Shuffled) => P=8.15e−06
K−S.test(RNA−seq RNA > Shuffled) => P=0.53
Paul Gardner Hunting RNA motifs
25. What about the highly expressed ncRNAs?
U U R R C C U U G U G U G Y R G G U C G A G U A C A A A G G A G C A C
G
G
A
A
G
C
Y
C
G
G
R
A
G
GC
C
A
A G
G
Y
C
G
A
C
C
G
G
A
A G A
G
A
A
G
G
Y
U
C
G
A
Y
C
U
C
C
C
G
A
Y
C
C
G
R
G C
Y
CC
C
A G C A C G G R
C C
G G
Y A C
C C A C G C G
G
A
G
C
R G
C
C
G
C
R A Y
A
A
RG
G
C
A
R
A
A
G
Y
G
U U
G
Y
G
G
G
C
C
U
G
C
G
U
R
A U U C G R A
R
Y
R
R
A
C GG Y
CG
A
C
G
R C C C U
U
U
G
G
GUG
G
GGY
U
G
C
RRCCRRAR
G
U
C
G
Y
R
R
C
G
C
C
G
A
G
G
C
C
A
C
C
C
A C
G
CA
A
C
C
CRY
Y
G
C
A
CGCUUGG
UAA
CC
G
RG
UCCGUGCU
R
G
C
G
G
G
C
G
G
Y
G
A
Y
C
R
U
Y
G
G
R
C
G
C
C
G
C
C
C
G
C
U
UYRYUR
5'
R
Y
G
G
A
G
Y
R R
R
R
C R
R
G
A
R
R
5´
Possible K-turn
R
R
A
R
R Y
Y
U
U
U
U U U Y5´
Possible Rho-independent
terminator
Possible Rho-independent
terminator
Possible Rho-independent
terminator
Possible Rho-independent
terminator
Possible Rho-independent
terminator
Possible Rho-independent
terminator
N_gon_RUF7
Y Y
G
C
U
G
RC
C
R
C
Y
G
C
Y
U
U
U
A
U C C
C
U
U
U
U
G
G
C
R
G
U
G
C
A
G
C
A A G U A A
G
A
C
G
U
UU
U
C
C C
G
C
A
AU
G
U
A
U
U G
R
C
C
A
U
U
C
A
U
A
C
U
U
C C
C
U
U
R
G
U
A
U
G
R
A
U
G
G
U
U
U
U
U
U
U
U
R
Y
R
R
R A A A U G C C G U C U G A A Y
G
UUCAGACGGCAUUURU
5´
C_dif_RUF9
U
C
A
G
A
A
G
G
C
U
U
G
Y
U
U
U
C
U
G
A
Y U U U U U U A U U C A A A U C A A A A A Y A R G U A G U U A A Y U U U U A Y C U U
A
U
U
C
U
A
C
U
U
U
C
C C C
A
A
A
G
U
A
A
G
A
A
U
A U Y A C Y Y U U R
G
A
U
R
R
G
C
A
G
C
Y
Y
A
U
C
U U U U U U U A A A5´
C_dif_RUF11
U
C
A
G
A
A
G
G
C
U
U
G
Y
U
U
U
C
U
G
A
Y U U U U U U A U U C A A A U C A A A A A Y A R G U A G U U A A Y U U U U A Y C U U
A
U
U
C
U
A
C
U
U
U
C
C C C
A
A
A
G
U
A
A
G
A
A
U
A U Y A C Y Y U U R
G
A
U
R
R
G
C
A
G
C
Y
Y
A
U
C
U U U U U U U U A A R5´
M_tub_RUF3
Possible Rho-independent
terminator
Low Seq. G+C
RNA RNA code RNA z M otifs complexity sRNA (gen.)
M .tub. RUF3 No RNA (prob=1.00) Terminator,k-turn No 59% (66%)
N.gon. RUF7 No RNA (prob=0.94) Terminator No 43% (52%)
C.dif. RUF9 ? (P =0.044) RNA (prob=1.00) Terminator No 30% (29%)
C.dif. RUF11 ? (P =0.079) RNA (prob=1.00) Terminator Yes (23/ 170nts) 26% (29%)
Paul Gardner Hunting RNA motifs
26. What about the highly expressed ncRNAs?
Are they antisense to any 5‘ UTRs
0 5 10 15 20 25 30
0.00.40.8
RNAup RNA−RNA energies between sRNAs and 5' UTRs
negenergy (−kcal/mol)
Freq. M.tub3 (native)
M.tub3 (shuffled)
N.gon7 (native)
N.gon7 (shuffled)
C.dif9 (native)
C.dif9 (shuffled)
C.dif11 (native)
C.dif11 (shuffled)
0 5 10 15 20 25 30
0.000.10
Difference between native and shuffled CDFs
negenergy (−kcal/mol)
∆CDF
K−S.test(M.tub3 RNA ~ Shuffled) => P=3.70e−07
K−S.test(N.gon7 RNA ~ Shuffled) => P=0.01
K−S.test(C.dif9 RNA ~ Shuffled) => P<2.2e−16
K−S.test(C.dif11 RNA ~ Shuffled) => P<2.2e−16
Paul Gardner Hunting RNA motifs
27. What about the highly expressed ncRNAs?
Are they antisense to any 3‘ UTRs
0 5 10 15 20
0.00.40.8
RNAup RNA−RNA energies between sRNAs and 3' UTRs
negenergy (−kcal/mol)
Freq. M.tub3 (native)
M.tub3 (shuffled)
N.gon7 (native)
N.gon7 (shuffled)
C.dif9 (native)
C.dif9 (shuffled)
C.dif11 (native)
C.dif11 (shuffled)
0 5 10 15 20
0.000.10
Difference between native and shuffled CDFs
negenergy (−kcal/mol)
∆CDF
K−S.test(M.tub3 RNA ~ Shuffled) => P=5.13e−09
K−S.test(N.gon7 RNA ~ Shuffled) => P=8.21e−06
K−S.test(C.dif9 RNA ~ Shuffled) => P<2.2e−16
K−S.test(C.dif11 RNA ~ Shuffled) => P<2.2e−16
Paul Gardner Hunting RNA motifs
28. Discussion points
A useful curation tool for building alignments and structures
Can provide testable function hypotheses, sometimes
Limited use outside of alignments based on the Salmonella
Typhi results
Even in alignments, false-positives & negatives are common
More motifs (Terminators, Shine-Dalgarno, ...)
Limit trivial motifs and prevent rediscovery of old motifs
Snapshots of the data are available from
https://github.com/ppgardne/RMfam
Can anyone here do better on MTB3, NGON7, CDIF9 &
CDIF11?
Paul Gardner Hunting RNA motifs
29. Thanks!
Lars Barquist, Alex
Bateman & Rob Knight
PPG is supported by a Rutherford Discovery Fellowship from Government funding, administered by the Royal
Society of New Zealand.
Paul Gardner Hunting RNA motifs
30. RNA motifs
ANYA
R
A
U
A
G
A
U Y
A
C
A
U
U
5´
GNRA
Y
G
A
A
R
5´
UNCG
C
U
U C
G
G
5´
T-loop
R
U
R R R
Y
5´
U-turn
R
R
G
C
G
U
R A
R
A
G
C
Y
5´
k-turn-1
R
Y
G
G
A
G
Y
R R
R
R
C R
R
G
A
R
R
5´
k-turn-2
C
G
A
A
G
Y
Y
R
Y
Y R
R
G
G
G
R
U
G
G
A
G
5´
sarcin-ricin-1
R
A
C
C
Y
R
R
Y
G
A
A
C
U
G
A
A
R C
A
U
C
U
C
A
G
U
A
R
Y
R
G
A
G
G
A R A A5´
sarcin-ricin-2
Y
Y
A
G
U
A
G Y R
A
G
G
A
A
R
5´
twist_up
C
C
G
A
Y
C
C
C
A
U C
C
G
A
A
C
U
C
G
G
5´
UAA_GAN
R
Y
G
R
Y
A
A
Y
C
R
Y
A Y
Y
A
G
R
G
A
A
Y
C
5´
Csr-Rsm
R
C
A
GG A
G
Y
5´
C-loop
G
A
C
A
C U
G
Y
R
Y R Y R R
R
R
C
A
R
U
Y
5´
domain-V
R
A
G
C
R
C
G
R
A
G
Y A
Y
G
Y
Y
R
G
U
U
Y
5´
terminator1
A
A
A
A
A
G
C
Y
R
Y
Y R
R
Y
G
G
Y
U
U
U
U
U
U Y U Y5´
terminator2
R
R
A
R
R Y
Y
U
U
U
U U U Y5´
TRIT
R Y Y Y Y
G
C
G
A
G
C
A
G
A
C
G
C
A
R
A
A
C
R
C
C
C
R
R
Y
R
R
Y
G
G
G
Y
G
U
U
Y
U
G
C
G
U
C
U
G
C
U
C
G
C
R R R R5´
R R A R G G R G R R Y A U G R R5´
R R G G R R R Y A U G A R R5´
A A G G R R A U G R5´
T−loopTerminator
Csr−Rsm
Domain−V
GNRA
tandem−GAk−turn
SRP_S
UNCG
U−turntwist_up
sarcin−ricin
q
q
q
q
q
q
q
q
q
q
q
q
tmRNA
cyano_tmRNA
tRNA−Sec
tRNA
mascRNA−menRNA
RNaseP_bact_a
suhBH
is_leader
L10_leader
L20_leader
Phe_leader
Thr_leader
rimP
Qrr
IRES_Aptho
CsrB
PrrB_RsmZ
TwoAYGGAY
U6
Intron_gpII
RNaseP_arch
RNaseP_bact_b
G
EM
M
_RNA_m
otif
alpha_tm
RNA
Intron_gpI
group−II−D1D4−7
group−II−D1D4−6group−II−D1D4−5
group−II−D1D4−3
group−II−D1D4−1
MOCO_RNA_motif
manA
Clostridiales−1
RtT
serC
T−box
SNO
RA73SAM
Archaea_SRP
Bacteria_small_SRP
Bacteria_large_SRP
Fungi_SRP
Metazoa_SRP
Plant_SRP
wcaG
U3
C4
HIV_FETermite−leu
5S_rRNA
SSU
_rR
N
A
_bacteria
SSU_rRNA_archaea
SSU_rRNA_eukarya
Cobalamin
FMN
Entero_5_CRE
group−II−D1D4−4
q
q
q
q
q
q
q
q
q
q
q
q
qqqqqq
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q q q q q q
q
q
q
q
q
q
q
q
q
q
q
Paul Gardner Hunting RNA motifs
31. RNA classification
Backofen et al. (2007) RNAs everywhere: genome-wide annotation of structured RNAs. Journal of Experimental
Zoology.
Paul Gardner Hunting RNA motifs