SlideShare ist ein Scribd-Unternehmen logo
1 von 5
R-PASS: A Fast Structure-based RNA Sequence Alignment Algorithm
Yanan Jianga
, Weijia Xub
, Lee Parnell Thompsonc
, Robin R. Gutella
, Daniel P. Mirankerc
a
Institute for Cellular and Molecular Biology b
Texas Advanced Computing Center c
Department of Computer Science
The University of Texas at Austin, Austin, USA
a
{yanan.jiang, robin.gutell}@utexas.edu b
xwj@tacc.utexas.edu c
{parnell, miranker}@cs.utexas.edu
Abstract—We present a fast pairwise RNA sequence
alignment method using structural information, named R-
PASS (RNA Pairwise Alignment of Structure and Sequence),
which shows good accuracy on sequences with low sequence
identity and significantly faster than alternative methods. The
method begins by representing RNA secondary structure as a
set of structure motifs. The motifs from two RNAs are then
used as input into a bipartite graph-matching algorithm, which
determines the structure matches. The matches are then used
as constraints in a constrained dynamic programming
sequence alignment procedure. The R-PASS method has an
O(nm) complexity. We compare our method with two other
structure-based alignment methods, LARA and ExpaLoc, and
with a sequence-based alignment method, MAFFT, across
three benchmarks and obtain favorable results in accuracy and
orders of magnitude faster in speed.
Keywords-RNA pairwise structural alignment; structure
motif; bipartite graph matching; constraint sequence alignment
I. INTRODUCTION
A trend in the sequence analysis is to process
increasingly larger scales of sequences in order to detect
sequence homologues, predict consensus secondary
structures[1], identify structure motifs[2] and infer
phylogenetic relationships[3]. Non-canonical base pairs and
structure motifs are found based on a MSA of 2,240 16S
rRNA sequences[2] and a set of statistical free energy
values are computed from more than 50,000 ncRNA
sequences[4]. Furthermore, genome wide sequence
alignments identifying ncRNAs have become increasingly
routine. RNAz[5] predicted over 30,000 structured RNA
elements in human genomes from 438,788 alignments of
non-coding regions. Scalable and fast computation method
is a key to make large scale analysis feasible.
Sequence-based RNA sequence alignment programs, e.g.
MAFFT[6], generate accurate alignments when the RNA
sequences are conserved. However, these programs are
unable to produce reliable alignments when sequence
identity drops below 50–60%[7]. Exploiting the
phenomenon of the coevolution of base-pairs and the
preservation of secondary structure are promising
approaches to improve RNA alignment accuracy[7].
Although many structure-based programs exist, most of
them have high complexity and are not applicable to long
RNAs. In this paper, we present a method, R-PASS (RNA
Pairwise Alignment of Structure and Sequence). We
evaluated our method compared with two state-of-art
structure-based alignment programs, LARA[8] and
ExpaLoc[9], and a popular sequence-based alignment
program MAFFT. Of the programs tested, R-PASS is the
fastest. The results also show improved accuracy upon
MAFFT and ExpaLoc and comparable accuracy with LARA.
II. RELATED WORK
Most structure-based alignment programs continue the
tradition of the Sankoff’s algorithm[10], where it
simultaneously folds and aligns a set of pseudo knot-free
RNAs using a dynamic programming approach (DP).
Although efforts have been made to reduce the time and
space complexity, this approach still requires O(n4
) time in
the pairwise alignment case. Thus most structure-based
programs are not practical for long RNAs[8].
R-PASS assumes the structure information is available
for both sequences. We compare our program to the two
most recent structure based alignment programs that target
the same problem, LARA[8] and ExpaLoc[9]. LARA adopts
a graph-based representation and models the alignment as
an integer linear program. ExpaLoc combines ExpaRNA[9]
and LocARNA[11], where ExpaRNA detects the longest
exact pattern match of two RNA structures and LocARNA
fills in the unaligned space between those patterns.
Our program differs with LARA in that the RNA
structures are matched at the structure motif level instead of
at the nucleotide level and an optimal alignment is found by
a DP algorithm. Unlike ExpaRNA which finds the exact
pattern matches, our matching algorithm is more flexible, so
more structure constraints can be used in alignment
construction. Also, the alignment building process in our
program is still sequence-based, and thus has a much lower
computation complexity than LocARNA.
To evaluate the effectiveness of using additional
structure information, we also compare our program with
MAFFT[6]. Its iterative refinement method L-INS-I is
evaluated to be one of the most accurate sequence-based
alignment programs that produce high quality alignments
with average pairwise sequence identity above 55%[12].
III. ALGORITHM
Given two RNA sequences with known secondary
structures, we parse the annotated base pairs into a set of
structure motifs. The feature vectors of the structure motifs
are computed and form the vertices of a bipartite graph.
The weight of an edge is based on the similarity of two
feature vectors. A set of edges which represent the
correspondence between structural motifs are obtained by a
bipartite graph matching algorithm. These correspondences
are then applied as anchor points to construct an optimal
sequence alignment using a DP algorithm.
A. Structure Matching
1) Infering motifs from base pairing annotation: The
secondary structure of a RNA sequences consists of a set of
2011 IEEE International Conference on Bioinformatics and Biomedicine
978-0-7695-4574-5/11 $26.00 © 2011 IEEE
DOI 10.1109/BIBM.2011.74
618
nested base pairs. The majority of the nucleotides in a RNA
structure form canonical base pairings in a regular helix
region. The remaining unpaired nucleotides form loops,
which can be further categorized to bulges, internal, hairpin
and multi-stem loops. Thus a RNA secondary structure can
be viewed as a set of those RNA motifs.
Stem is the continuous base-paired double strand region
and is composed of a 5’end half and a 3’ end half. Bulge is a
loop appearing only on one strand of a stem. Internal loop is
the unpaired regions occurring on both strands of a stem.
Hairpin loop is the loop linking the 5’ end and 3’ end of a
stem. Multi-stem loop is the intersection of three or more
stems. It is separated by single strand sequences. Free end is
the unpaired region at the 5’ end or 3’ end of a structure.
Given a RNA secondary structure annotated by its base
pairings, we first segment it into Stem, Bulge, Internal loop
and Hairpin loop motifs. Then we merge the basic motifs
into Compound Stem and Hairpin. A Compound Stem is the
discontinuous base-paired region merged by Stems, Bulges
and Internal loops. A Hairpin is formed by combining a
Hairpin loop and its flanking Stem/Compound Stem. The
Compound Stem, Stem, Hairpin, Multi-stem loop and Free
end form a complete RNA structure. To reduce computation,
we only consider Compound Stem, Stem and Hairpin motifs
in the following matching step. The other motifs are
spontaneously matched after matching of those three motifs.
2) Vector model of motifs: For each Stem/Compound
Stem/Hairpin motif we compute a feature vector which
includes the start (ps) and end position (pe) of the motif and
the number of nucleotides in it (l). Since a Stem/Compound
Stem consist of two halves, the end position of the first half
(pse) and the start position of the second half (pes) are also
included in its feature vector. The feature vector of a
Hairpin is the same as the feature vector of its stem
component. Therefore f{Stem, Compound Stem, Hairpin} = (ps, pse, pes,
pe, l). A RNA secondary structure S can thus be represented
as S = {fti}, i ‫ג‬ (1, k), where k is the number of motifs in S
and t ‫ג‬ {Stem, Compound Stem, Hairpin}.
The similarity of two structure motifs is measured by
how close their relative positions are in the corresponding
global sequences and how similar their lengths are. The
similarity is only computed between motifs of the same type,
i.e. Hairpin with Hairpin and Stem with Stem or Compound
Stem. We define the following heuristic similarity function
composed of position similarity (pos) and length similarity
(len), where sl is the global sequence length, and wl is the
weight of length similarity.
݀ሺ݂௔ǡ ݂௕ሻ
ൌ ൜
Ͳǡ ݂݅ ܽǡ ܾ ܽ‫݁ݎ‬ ݊‫ݐ݋‬ ‫݂݋‬ ‫ݐ‬ ݁ ‫݁݉ܽݏ‬ ‫݁݌ݕݐ‬
‫ݏ݋݌‬ሺ݂௔ǡ ݂௕ሻǤ ݈݁݊ሺ݂௔ǡ ݂௕ሻǡ ‫ݐ݋‬ ݁‫݁ݏ݅ݓݎ‬
݈݁݊ሺ݂௔ǡ ݂௕ሻ ൌ ͳ െ ‫ݓ‬௟Ǥฬ
݈௔ െ ݈௕
݈௔ ൅ ݈௕
ฬ
‫ݏ݋݌‬ሺ݂௔ǡ ݂௕ሻ
ൌ ͳ െ ݉ܽ‫ݔ‬ ൮
݉݅݊ ൬ฬ
‫݌‬௦௔
‫݈ݏ‬௔
െ
‫݌‬௦௕
‫݈ݏ‬௕
ฬ ǡ ฬ
‫݌‬௦௘௔
‫݈ݏ‬௔
െ
‫݌‬௦௘௕
‫݈ݏ‬௕
ฬ൰ ǡ
݉݅݊ ൬ฬ
‫݌‬௘௦௔
‫݈ݏ‬௔
െ
‫݌‬௘௦௕
‫݈ݏ‬௕
ฬ ǡ ฬ
‫݌‬௘௔
‫݈ݏ‬௔
െ
‫݌‬௘௕
‫݈ݏ‬௕
ฬ ൰
൲ ǡ ݂݅ ܽǡ ܾ ‫ג‬ ܵ‫݉݁ݐ‬
In the above equation, the score for different types of
the structural motif is always 0. For the comparison of the
same motif type, the score is the product of the relative
position similarity and the length similarity. The relative
position difference for a single strand is defined as the
minimum difference as measured by start and end position.
For a Stem/Compound Stem structure, the relative position
difference is first evaluated on each strand individually and
then the maximum score between the two strands is used.
For simplicity, only the weight on length difference (wl) is
used. The similarity score ranges between 0 and 1
inclusively.
3) Structure motif matching algorithm: The structure
motifs from two RNA structures are matched by a bipartite
graph matching (BGM). BGM is a powerful matching
technique that can achieve global and optimal matching
results in polynomial time. It has been applied to protein
structure alignment[13].
We denote a bipartite graph G as G = (V = L R, E). V
is a set of vertices which can be divided into two subsets, L
and R. Each subset individually represents structure motifs
from one sequence and contains their feature vectors as
vertices, i.e. L = S1, R = S2. E is a set of weighted edges that
link between vertices in L and vertices in R. The edge
weight is the similarity score between two feature vectors.
A threshold for edge weight t is used in a preprocessing step
of graph construction, so only edges with weight no less
than t are used for matching. The limitation t > 0 ensures
that the matches are only between motifs of the same type.
There is no edge linking vertices from the same subset.
Hence, E = {eLi,Rj = d(fLi, fRj) t} , i ‫ג‬ (1, k), j ‫ג‬ (1, h),
where k is the number of motifs in S1 and h is the number of
motifs in S2.
Based on this bipartite graph, the correspondence of
structure motifs are then found by a stable matching
algorithm[14]. The algorithm generates a set of matched
structure motifs, more specifically the feature vectors of
structure motifs, M = {fLi, fRi}, i ‫ג‬ (1, c), where c is the
number of matches. The stable matching ensures that no two
motifs are better matched together than with the motif they
are currently matched with. The complexity of this
algorithm is O(kh). Any graph matching algorithm could be
used as a substitute for the stable matching, such as a
maximum weighted matching algorithm. The stable
matching was chosen for its fast runtime.
4) Computing matching blocks: The matching structure
motifs are then converted into sequence blocks. The
boundaries of the matched motifs are obtained based on
their feature vectors. For a Hairpin motif, its sequence is
divided into three parts, 5’ end stem, Hairpin loop and 3’
end stem, and the Stem and Compound Stem motif are
segmented into the 5’ end stem and 3’ end stem. Thus the
matching segments from two sequences form a match block,
of which the boundaries are determined by the start and end
positions of the segments (Fig. 1).
The above generated sequence block set may contain
crossing blocks caused by mismatches in BGM. In cases
where the structures are unknown and require prediction,
619
overlapping blocks may also appear i
structures contain a set of overlapping cand
solve this problem, the Dijkstra’s shortest p
applied[15]. We construct a graph where th
blocks and two pseudo blocks representing
end of the sequences. The edge weight
between two blocks. It is set to be 0 for cro
the total number of position gaps betwee
both sequences for non-crossing blocks. O
positive weight are added to the graph. A
then computed between the start block an
The match blocks on the path are used in the
B. Constraint sequence alignment
Figure 1. Convert structure alignment to sequen
For sequence A = {ai} and B = {bj}, i ‫ג‬
where n is the length of sequence A and m
sequence B, a constraint alignment is then
1). The optimal path is built only through
the DP matrix using an affine gap penalty m
a matrix H, where H(i, j) is the best alig
(a1,…, ai, b1,…, bj) with ai aligned to a g
where V(i, j) is the best alignment score up
bj aligned to a gap, and a DP matrix D, wh
best alignment score up to ai and bj, the re
for the constrained DP is then calculated as:
‫ܪ‬ሺ݅ǡ ݆ሻ ൌ ݉ܽ‫ݔ‬ ሺ‫ܦ‬ሺ݅ െ ͳǡ ݆ሻ ൅ ‫݋‬ǡ ‫ܪ‬ሺ݅ െ ͳǡ ݆ሻ ൅ ݁ሻ
ܸሺ݅ǡ ݆ሻ ൌ ݉ܽ‫ݔ‬ ሺ‫ܦ‬ሺ݅ǡ ݆ െ ͳሻ ൅ ‫݋‬ǡ ܸሺ݅ǡ ݆ െ ͳሻ ൅ ݁ሻ
‫ܦ‬ሺ݅ǡ ݆ሻ
ൌ ቊ
݉ܽ‫ݔ‬ ቀ‫ܦ‬ሺ݅ െ ͳǡ ݆ െ ͳሻ ൅ ܵ൫ܽ௜ǡ ܾ௝൯ǡ ‫ܪ‬ሺ݅ǡ ݆ሻǡ ܸሺ݅ǡ ݆
െ ǡ ‫ݐ݋‬ ݁‫݅ݓݎ‬‫ݏ‬
In the above equations, o is the gap open
gap extension penalty and S(ai, bj) is the s
between nucleotides ai and bj. The com
alignment step is O(nm).
IV. RESULTS
The program generated pairwise alignm
by comparison to a gold standard reference
nucleotide level. A pairwise alignment A c
a set of nucleotide correspondence arrang
order, so A = {ai, bi}, i ‫ג‬ (1, n), where n is th
of the two sequences, and ai and bi a
nucleotides from two sequences. Let
alignment be A and the testing alignment b
if the predicted
didate motifs. To
path algorithm is
he vertices are the
the start and the
is the distance
ossing blocks and
en the blocks on
Only edges with
A shortest path is
nd the end block.
e next step.
nce alignment.
(1, n), j ‫ג‬ (1, m),
m is the length of
n computed (Fig.
match blocks in
model[16]. Given
gnment score of
gap; a matrix V,
p to ai and bj with
here D(i, j) is the
currence relation
:
݆ሻቁ ǡ ݂݅ ሺ݅ǡ ݆ሻ ݅݊ ܾ݈‫݇ܿ݋‬
‫ݏ‬݁
n penalty, e is the
substitution score
mplexity of the
ment is evaluated
e alignment at the
can be denoted as
ged in sequential
he smaller length
are the matched
the reference
be A’= {a’i, b’i},
we define a correct correspondence
The “=” here means two nucle
nucleotide index in the sequence.
is the percentage of correct correspo
of all correspondence in the referenc
A. Testing data
We created the pairwise sequen
with structure information from thre
2.1[7], Rfam 10.1[17] and CRW
Rfam are popular alignment bench
evaluations and provide consensus
various ncRNA families. The CRW
high quality alignments provides in
structure in various formats which s
We took three RNA fami
spliceosomal RNA (RF00004), nuc
and Bacterial RNase P class A
families from Bralibase data-set 1 a
U5 spliceosomal RNA, tRNA and
families from CRW Site: 5S rRNA
Besides the data-set2 from Bra
pairwise sequence alignments, all th
by breaking the multiple sequence
sequence alignments. The secon
sequence is either inferred fro
provided in the original dataset or
Site. The pseudoknots are excluded
A single test contains two RN
corresponded RNA secondary stru
bracket format[19]. Based on RN
datasets are grouped into small (
1000nt) and large (> 1000nt) RNA
RNA family are further divided in
40, 60 and 80 by sequence identity
indicates the pairwise sequence i
between 40 and 60. Due to the page
sets and all the results are available
B. Comparison with other program
We implemented the algorithm
sequence alignment, we use the RI
[20] with gap open penalty of -8 an
of -1. R-PASS is written in Java
MAFFT are implemented in C. E
small RNA datasets, while the ot
tested on all datasets. Each test co
input. Except for MAFFT-L-INS
sequence information, the structure
to all the other three programs. For
first executed and the output constr
LocARNA to obtain complete align
executed with default setting in
processor at 3.16G Hz.
1) Alignment accuracy: Usin
described above, we evaluated
generated by all four programs
reference alignments. As shown in
can achieve > 90% accuracy in
e as ai = a’i and bi = b’i .
eotides have the same
The alignment accuracy
ondence over the number
ce alignment A.
ce alignment testing data
ee benchmarks, Bralibase
Site[18]. Bralibase and
hmarks used by program
secondary structures for
W Site well known for its
ndividual RNA secondary
serve our purpose well.
ilies from Rfam: U2
clear RNase P (RF00009)
(RF00010); four RNA
and data-set 2: g2intron,
5S rRNA and two RNA
and 16S rRNA.
alibase which consists of
he other tests are created
alignments into pairwise
ndary structure of each
om consensus structure
retrieved from the CRW
from the structures.
NA sequences with their
uctures annotated in dot
NA sequence length, the
(< 200nt), middle (200-
A (Table 1). Tests in each
nto at least three subsets:
y. For example, subset 40
identity of each test is
e limit, the details of data
upon request.
ms
in the tool R-PASS. For
IBOSUM scoring matrix
nd gap extension penalty
a, LARA, ExpaLoc and
ExpaLoc is tested on the
ther three programs are
ontains two sequences as
S-I which only use the
e annotation is provided
r ExpaLoc, ExpaRNA is
raints are then piped into
nments. All programs are
n a desktop with Intel
ng the scoring method
the alignment quality
with the corresponding
Fig. 2, all four programs
small and large RNA
620
datasets with sequence identity above 60% and in middle
RNA datasets with sequence identity above 80%. In this
zone, using structure information does not improve
alignment quality significantly. The additional structure
information has a remarkable effect in the twilight zone,
where the sequence identity is below 60%. While MAFFT
performance drops sharply with decrease of sequence
identity, R-PASS and LARA can maintain high alignment
quality (Fig. 2).
ExpaLoc does not show superiority over MAFFT in the
twilight zone in the small RNA datasets. One possible
explanation is that as the sequence identity drops, the
number of exact pattern matches, i.e., the number of
structure constraints also decreases, thus the degree of
freedom expands as LocARNA aligns the sequences,
causing potential alignment errors.
In the twilight zone, R-PASS is comparable with LARA
in terms of alignment quality. R-PASS outperforms all other
programs in the tRNA dataset from Bralibase and RNase P
datasets from Rfam (Fig. 2). In the other datasets except
U5, the largest difference between R-PASS and LARA is
less than 3%.
Figure 2. Accuracy comparisons in small, middle and large RNA
datasets.
2) Running time: The total running time of each
program on datasets of the same sized RNA group is used.
ExpaLoc costs the most CPU time in the small RNA
datasets (data not shown here).
TABLE I. PROGRAM RUNNING TIME IN SECONDS.
Program Small Middle Large
R-PASS
17.8 17.6
27
LARA 178.24 (10)a
1701 (97) 30000 (1111)
MAFFT-L-NS-I 491.34 (27) 573 (33) 108 (4)
# tests 2806 2893 1046
Avg. length 96 314 1467
a. The number of times that R-PASS is faster than the program
As shown in Table 1, R-PASS is the fastest among all
programs. The advantage over LARA is more obvious as
the length of the RNA increases (Fig. 3). R-PASS is 10
times faster than LARA and 27 times faster than MAFFT in
small RNA datasets; and is 97 times faster than LARA and
33 times faster than MAFFT for middle RNA datasets. It is
4 times faster than the sequence-based alignment program
MAFFT and above 1,100 times faster than LARA in large
RNA datasets while maintaining comparable alignment
quality (Fig. 2). Therefore, our approach is more suitable for
large-scale RNA sequences.
Figure 3. Program running time versus sequence length.
3) R-PASS structure motif matching with LARA
subroutine: In R-PASS, individual sequence blocks
generated after structure motif match can be fed into any
alignment algorithm to produce a global alignment. Since
LARA finds the structure motif correspondence at the base
pair level, it performs better than R-PASS in some datasets
in the twilight zone. Thus integrating LARA to align the
stem regions may improve R-PASS performance in those
datasets. We compute the structure match blocks by R-
PASS and use LARA to align the blocks generated from
Stem motifs and Gotoh algorithm to align the blocks of loop
and free end motifs. Each local alignment is then linked in
sequential order to form a global alignment.
We tested the integrated R-PASS and LARA in the
small RNA datasets. This approach performs better than R-
PASS and is comparable with LARA. In some cases, e.g.
U5 (Fig. 4), the integrated approach improves the alignment
accuracy upon LARA.
20
40
60
80
100
0 20 40 60 80
Accuracy
PID subset
Small RNA: tRNA
R-PASS
LARA
ExpaLoc
MAFFT-L-
INS-I
30
40
50
60
70
80
90
100
20 30 40 50 60 70 80 90
Accuracy
PID subset
Middle RNA: Nuclear RNase P (RF00009)
R-PASS
LARA
MAFFT-
L-INS-I
80
85
90
95
100
20 40 60 80
Accuracy
PID subset
Large RNA: 16S rRNA
R-PASS
LARA
MAFFT-L-
INS-I
0
1
2
3
4
5
96 314 1467
log10(runningtimein
milisecondpertest)
Sequence length (nucleotide)
Program running time VS sequence length
R-PASS
LARA
MAFFT-
L-INS-I
621
Figure 4. Accuracy comparison on U5.
V. CONCLUSION AND FUTURE WORK
We have developed a new workflow for pairwise
alignment of RNA sequences with known structure
information, named R-PASS. We utilize the RNA structure
motif correspondence found in a bipartite graph framework
to constrain the sequence alignment problem and perform
the final alignment. The complexity of our algorithm is
O(nm), yet it can achieve alignment quality comparable with
the current state of the art programs. This is especially
apparent in the twilight zone. R-PASS is also significantly
faster than competing methods, often by orders of
magnitude, which makes our method well suited for use in
iterative algorithms and high-throughput RNA analysis.
We are currently working on various improvements to
the R-PASS framework described in this paper. The
alignment can be refined by matching the basic motifs in a
Compound Stem/Hairpin motif. Improvements can also be
extended to the nucleotide level where all the base pairings
in each sequence could be used as constraints.
The similarity function between structural motifs we use
is established by practical experience and the values are
determined based on preliminary experiments. This function
can be further enhanced by including additional information
such as base pair similarity and nucleotide similarity. The
additional information can improve the sensitivity and the
selectivity of the matching algorithm.
Our results show that incorporating structure
information significantly improves alignment accuracy upon
sequence-based alignment methods, especially for less
conserved sequences. While our current algorithm focuses
on aligning known structures, it can be adapted to align
unknown structures by structure prediction using RNA
folding algorithms or generating all potential structure
motifs and find correct matches by an advanced similarity
function.
The fast performance of our alignment program also
makes it promising to compute a multiple sequence
alignment over a large set of sequences. While R-PASS
focuses on pairwise alignment, the result can be extended to
produce multiple sequence alignments using a guide tree or
a progressive approach. Our method can also target
template-based alignment problems, where new sequences
are added into an existing alignment by matching the new
sequence with the consensus sequence of the alignment.
ACKNOWLEDGMENT
This research is supported by the National Institutes of
Health grant, GM085337.
REFERENCES
[1] Gutell, R. R., Weiser, B., Woese, C. R. and Noller, H. F. Comparative
anatomy of 16-S-like ribosomal RNA. Prog Nucleic Acid Res Mol Biol, 32
(1985), 155-216.
[2] Gutell, R. R., Larsen, N. and Woese, C. R. Lessons from an evolving
rRNA: 16S and 23S rRNA structures from a comparative perspective.
Microbiol Rev, 58, 1 (Mar 1994), 10-26.
[3] Phillips, A., Janies, D. and Wheeler, W. Multiple sequence alignment in
phylogenetic analysis. Mol Phylogenet Evol, 16, 3 (Sep 2000), 317-330.
[4] Wu, J. C., Gardner, D. P., Ozer, S., Gutell, R. R. and Ren, P.
Correlation of RNA secondary structure statistics with thermodynamic
stability and applications to folding. J Mol Biol, 391, 4 (Aug 28 2009), 769-
783.
[5] Washietl, S., Hofacker, I. L., Lukasser, M., Huttenhofer, A. and Stadler,
P. F. Mapping of conserved RNA secondary structures predicts thousands
of functional noncoding RNAs in the human genome. Nature
Biotechnology, 23, 11 (Nov 2005), 1383-1390.
[6] Katoh, K., Kuma, K., Miyata, T. and Toh, H. Improvement in the
accuracy of multiple sequence alignment program MAFFT. Genome
Inform, 16, 1 (2005), 22-33.
[7] Gardner, P. P., Wilm, A. and Washietl, S. A benchmark of multiple
sequence alignment programs upon structural RNAs. Nucleic Acids Res,
33, 8 (2005), 2433-2439.
[8] Bauer, M., Klau, G. W. and Reinert, K. Accurate multiple sequence-
structure alignment of RNA sequences using combinatorial optimization.
BMC Bioinformatics, 8(2007), 271.
[9] Heyne, S., Will, S., Beckstette, M. and Backofen, R. Lightweight
comparison of RNAs based on exact sequence-structure matches.
Bioinformatics, 25, 16 (Aug 15 2009), 2095-2102.
[10] Sankoff, D. Simultaneous solution of the RNA folding, alignment, and
proto-sequence problems. SIAM J Appl Mat, 45(1985), 810-825.
[11] Will, S., Reiche, K., Hofacker, I. L., Stadler, P. F. and Backofen, R.
Inferring noncoding RNA families and classes by means of genome-scale
structure-based clustering. PLoS Comput Biol, 3, 4 (Apr 13 2007), e65.
[12] Wilm, A., Mainz, I. and Steger, G. An enhanced RNA alignment
benchmark for sequence alignment programs. Algorithms Mol Biol,
1(2006), 19.
[13] Wang, Y., Makedon, F., Ford, J. and Huang, H. A bipartite graph
matching framework for finding correspondences between structural
elements in two proteins. Conf Proc IEEE Eng Med Biol Soc, 4(2004),
2972-2975.
[14] Shapley, D. G. a. L. S. College Admissions and the Stability of
Marriage. American Mathematical Monthly, 69(1962), 9-14.
[15] Dijkstra, E. W. A note on two problems in connexion with graphs.
Numerische Mathematik, 1(1959), 269-271.
[16] Gotoh, O. An improved algorithm for matching biological sequences.
J Mol Biol, 162, 3 (Dec 15 1982), 705-708.
[17] Gardner, P. P., Daub, J., Tate, J., Moore, B. L., Osuch, I. H., Griffiths-
Jones, S., Finn, R. D., Nawrocki, E. P., Kolbe, D. L., Eddy, S. R. and
Bateman, A. Rfam: Wikipedia, clans and the "decimal" release. Nucleic
Acids Res, 39, Database issue (Jan 2011), D141-145.
[18] Cannone, J. J., Subramanian, S., Schnare, M. N., Collett, J. R.,
D'Souza, L. M., Du, Y., Feng, B., Lin, N., Madabusi, L. V., Muller, K. M.,
Pande, N., Shang, Z., Yu, N. and Gutell, R. R. The comparative RNA web
(CRW) site: an online database of comparative sequence and structure
information for ribosomal, intron, and other RNAs. BMC Bioinformatics,
3(2002), 2.
[19] Gruber, A. R., Lorenz, R., Bernhart, S. H., Neubock, R. and Hofacker,
I. L. The Vienna RNA websuite. Nucleic Acids Res, 36, Web Server issue
(Jul 1 2008), W70-74.
[20] Klein, R. J. and Eddy, S. R. RSEARCH: finding homologs of single
structured RNA sequences. BMC Bioinformatics, 4(Sep 22 2003), 44.
80
85
90
95
100
20 40 60 80
Accuracy
PID subset
U5
LARA
subroutine
R-PASS
LARA
622

Weitere ähnliche Inhalte

Was ist angesagt?

Blast and fasta
Blast and fastaBlast and fasta
Blast and fastaALLIENU
 
PSO-based Training, Pruning, and Ensembling of Extreme Learning Machine RBF N...
PSO-based Training, Pruning, and Ensembling of Extreme Learning Machine RBF N...PSO-based Training, Pruning, and Ensembling of Extreme Learning Machine RBF N...
PSO-based Training, Pruning, and Ensembling of Extreme Learning Machine RBF N...ijceronline
 
Global and Local Sequence Alignment
Global and Local Sequence AlignmentGlobal and Local Sequence Alignment
Global and Local Sequence AlignmentAjayPatil210
 
Computational Analysis with ICM
Computational Analysis with ICMComputational Analysis with ICM
Computational Analysis with ICMVernon D Dutch Jr
 
RNA structure analysis
RNA structure analysis RNA structure analysis
RNA structure analysis Afra Fathima
 
Bioinformatica 20-10-2011-t3-scoring matrices
Bioinformatica 20-10-2011-t3-scoring matricesBioinformatica 20-10-2011-t3-scoring matrices
Bioinformatica 20-10-2011-t3-scoring matricesProf. Wim Van Criekinge
 
Automatically inferring structure correlated variable set for concurrent atom...
Automatically inferring structure correlated variable set for concurrent atom...Automatically inferring structure correlated variable set for concurrent atom...
Automatically inferring structure correlated variable set for concurrent atom...ijseajournal
 
Swaati algorithm of alignment ppt
Swaati algorithm of alignment pptSwaati algorithm of alignment ppt
Swaati algorithm of alignment pptSwati Kumari
 
Path loss exponent estimation
Path loss exponent estimationPath loss exponent estimation
Path loss exponent estimationNguyen Minh Thu
 
Dot matrix Analysis Tools (Bioinformatics)
Dot matrix Analysis Tools (Bioinformatics)Dot matrix Analysis Tools (Bioinformatics)
Dot matrix Analysis Tools (Bioinformatics)Safa Khalid
 
Energy Minimization Using Gromacs
Energy Minimization Using GromacsEnergy Minimization Using Gromacs
Energy Minimization Using GromacsRajendra K Labala
 
Simulation of Route Optimization with load balancing Using AntNet System
Simulation of Route Optimization with load balancing Using AntNet SystemSimulation of Route Optimization with load balancing Using AntNet System
Simulation of Route Optimization with load balancing Using AntNet SystemIOSR Journals
 
Network topologies
Network topologiesNetwork topologies
Network topologiesJem Mercado
 
Concept of node usage probability from complex networks and its applications ...
Concept of node usage probability from complex networks and its applications ...Concept of node usage probability from complex networks and its applications ...
Concept of node usage probability from complex networks and its applications ...redpel dot com
 
Mca & diplamo java titles
Mca & diplamo java titlesMca & diplamo java titles
Mca & diplamo java titlestema_solution
 

Was ist angesagt? (16)

Blast and fasta
Blast and fastaBlast and fasta
Blast and fasta
 
PSO-based Training, Pruning, and Ensembling of Extreme Learning Machine RBF N...
PSO-based Training, Pruning, and Ensembling of Extreme Learning Machine RBF N...PSO-based Training, Pruning, and Ensembling of Extreme Learning Machine RBF N...
PSO-based Training, Pruning, and Ensembling of Extreme Learning Machine RBF N...
 
Global and Local Sequence Alignment
Global and Local Sequence AlignmentGlobal and Local Sequence Alignment
Global and Local Sequence Alignment
 
Computational Analysis with ICM
Computational Analysis with ICMComputational Analysis with ICM
Computational Analysis with ICM
 
RNA structure analysis
RNA structure analysis RNA structure analysis
RNA structure analysis
 
Bioinformatica 20-10-2011-t3-scoring matrices
Bioinformatica 20-10-2011-t3-scoring matricesBioinformatica 20-10-2011-t3-scoring matrices
Bioinformatica 20-10-2011-t3-scoring matrices
 
Automatically inferring structure correlated variable set for concurrent atom...
Automatically inferring structure correlated variable set for concurrent atom...Automatically inferring structure correlated variable set for concurrent atom...
Automatically inferring structure correlated variable set for concurrent atom...
 
Swaati algorithm of alignment ppt
Swaati algorithm of alignment pptSwaati algorithm of alignment ppt
Swaati algorithm of alignment ppt
 
dot plot analysis
dot plot analysisdot plot analysis
dot plot analysis
 
Path loss exponent estimation
Path loss exponent estimationPath loss exponent estimation
Path loss exponent estimation
 
Dot matrix Analysis Tools (Bioinformatics)
Dot matrix Analysis Tools (Bioinformatics)Dot matrix Analysis Tools (Bioinformatics)
Dot matrix Analysis Tools (Bioinformatics)
 
Energy Minimization Using Gromacs
Energy Minimization Using GromacsEnergy Minimization Using Gromacs
Energy Minimization Using Gromacs
 
Simulation of Route Optimization with load balancing Using AntNet System
Simulation of Route Optimization with load balancing Using AntNet SystemSimulation of Route Optimization with load balancing Using AntNet System
Simulation of Route Optimization with load balancing Using AntNet System
 
Network topologies
Network topologiesNetwork topologies
Network topologies
 
Concept of node usage probability from complex networks and its applications ...
Concept of node usage probability from complex networks and its applications ...Concept of node usage probability from complex networks and its applications ...
Concept of node usage probability from complex networks and its applications ...
 
Mca & diplamo java titles
Mca & diplamo java titlesMca & diplamo java titles
Mca & diplamo java titles
 

Andere mochten auch

Trabajo de informatica 1 e
Trabajo de informatica 1 eTrabajo de informatica 1 e
Trabajo de informatica 1 eelvistaco
 
从运维系统的开发谈安全架构设计
从运维系统的开发谈安全架构设计从运维系统的开发谈安全架构设计
从运维系统的开发谈安全架构设计mysqlops
 
Tribus urbanas
Tribus urbanasTribus urbanas
Tribus urbanasMarianSixx
 
Pharma Analytppt
Pharma AnalytpptPharma Analytppt
Pharma Analytpptrmaclennan
 
Bolsa de valores.
Bolsa de valores.Bolsa de valores.
Bolsa de valores.ByronMat
 
怎样成为优秀软件模型设计者
怎样成为优秀软件模型设计者怎样成为优秀软件模型设计者
怎样成为优秀软件模型设计者mysqlops
 
Eixos estruturantes currículos pedaços
Eixos estruturantes currículos pedaçosEixos estruturantes currículos pedaços
Eixos estruturantes currículos pedaçosbbetocosta77
 
MJ_Tucker resume 05.22.16
MJ_Tucker resume 05.22.16MJ_Tucker resume 05.22.16
MJ_Tucker resume 05.22.16Mike Tucker
 
Progrmas para solucionar algoritmos
Progrmas para solucionar algoritmosProgrmas para solucionar algoritmos
Progrmas para solucionar algoritmosAlejo Padilla
 
Les grandes étapes du règlement d'une succession
Les grandes étapes du règlement d'une successionLes grandes étapes du règlement d'une succession
Les grandes étapes du règlement d'une successionGroupe Althémis
 
Facultad de filosofia letras y ciencias de la
Facultad de filosofia letras y ciencias de laFacultad de filosofia letras y ciencias de la
Facultad de filosofia letras y ciencias de laPaty_1989
 
Ciencia de la investigacion
Ciencia de la investigacionCiencia de la investigacion
Ciencia de la investigacionalexavane98
 
reciclaje tic 2
reciclaje tic 2reciclaje tic 2
reciclaje tic 2isaquiitho
 
What's New in Grizzly & Deploying OpenStack with Puppet
What's New in Grizzly & Deploying OpenStack with PuppetWhat's New in Grizzly & Deploying OpenStack with Puppet
What's New in Grizzly & Deploying OpenStack with PuppetMark Voelker
 

Andere mochten auch (20)

Trabajo de informatica 1 e
Trabajo de informatica 1 eTrabajo de informatica 1 e
Trabajo de informatica 1 e
 
Journal 5
Journal 5Journal 5
Journal 5
 
从运维系统的开发谈安全架构设计
从运维系统的开发谈安全架构设计从运维系统的开发谈安全架构设计
从运维系统的开发谈安全架构设计
 
Tribus urbanas
Tribus urbanasTribus urbanas
Tribus urbanas
 
Pharma Analytppt
Pharma AnalytpptPharma Analytppt
Pharma Analytppt
 
TIPOS DE APRENDIZAJE
TIPOS DE APRENDIZAJE TIPOS DE APRENDIZAJE
TIPOS DE APRENDIZAJE
 
Presentación virus
Presentación virusPresentación virus
Presentación virus
 
Bolsa de valores.
Bolsa de valores.Bolsa de valores.
Bolsa de valores.
 
Curso de historia de méxico
Curso de historia de méxicoCurso de historia de méxico
Curso de historia de méxico
 
resume IT security
resume IT securityresume IT security
resume IT security
 
怎样成为优秀软件模型设计者
怎样成为优秀软件模型设计者怎样成为优秀软件模型设计者
怎样成为优秀软件模型设计者
 
Eixos estruturantes currículos pedaços
Eixos estruturantes currículos pedaçosEixos estruturantes currículos pedaços
Eixos estruturantes currículos pedaços
 
MJ_Tucker resume 05.22.16
MJ_Tucker resume 05.22.16MJ_Tucker resume 05.22.16
MJ_Tucker resume 05.22.16
 
Progrmas para solucionar algoritmos
Progrmas para solucionar algoritmosProgrmas para solucionar algoritmos
Progrmas para solucionar algoritmos
 
Les grandes étapes du règlement d'une succession
Les grandes étapes du règlement d'une successionLes grandes étapes du règlement d'une succession
Les grandes étapes du règlement d'une succession
 
Facultad de filosofia letras y ciencias de la
Facultad de filosofia letras y ciencias de laFacultad de filosofia letras y ciencias de la
Facultad de filosofia letras y ciencias de la
 
Ciencia de la investigacion
Ciencia de la investigacionCiencia de la investigacion
Ciencia de la investigacion
 
reciclaje tic 2
reciclaje tic 2reciclaje tic 2
reciclaje tic 2
 
Cbo100053
Cbo100053Cbo100053
Cbo100053
 
What's New in Grizzly & Deploying OpenStack with Puppet
What's New in Grizzly & Deploying OpenStack with PuppetWhat's New in Grizzly & Deploying OpenStack with Puppet
What's New in Grizzly & Deploying OpenStack with Puppet
 

Ähnlich wie Gutell 116.rpass.bibm11.pp618-622.2011

Gutell 121.bibm12 alignment 06392676
Gutell 121.bibm12 alignment 06392676Gutell 121.bibm12 alignment 06392676
Gutell 121.bibm12 alignment 06392676Robin Gutell
 
Gutell 090.bmc.bioinformatics.2004.5.105
Gutell 090.bmc.bioinformatics.2004.5.105Gutell 090.bmc.bioinformatics.2004.5.105
Gutell 090.bmc.bioinformatics.2004.5.105Robin Gutell
 
Gutell 107.ssdbm.2009.200
Gutell 107.ssdbm.2009.200Gutell 107.ssdbm.2009.200
Gutell 107.ssdbm.2009.200Robin Gutell
 
AMINO ACID INTERACTION NETWORK PREDICTION USING MULTI-OBJECTIVE OPTIMIZATION
AMINO ACID INTERACTION NETWORK PREDICTION USING MULTI-OBJECTIVE OPTIMIZATIONAMINO ACID INTERACTION NETWORK PREDICTION USING MULTI-OBJECTIVE OPTIMIZATION
AMINO ACID INTERACTION NETWORK PREDICTION USING MULTI-OBJECTIVE OPTIMIZATIONcscpconf
 
Gutell 117.rcad_e_science_stockholm_pp15-22
Gutell 117.rcad_e_science_stockholm_pp15-22Gutell 117.rcad_e_science_stockholm_pp15-22
Gutell 117.rcad_e_science_stockholm_pp15-22Robin Gutell
 
Sequence homology search and multiple sequence alignment(1)
Sequence homology search and multiple sequence alignment(1)Sequence homology search and multiple sequence alignment(1)
Sequence homology search and multiple sequence alignment(1)AnkitTiwari354
 
A Frequency Domain Approach to Protein Sequence Similarity Analysis and Funct...
A Frequency Domain Approach to Protein Sequence Similarity Analysis and Funct...A Frequency Domain Approach to Protein Sequence Similarity Analysis and Funct...
A Frequency Domain Approach to Protein Sequence Similarity Analysis and Funct...sipij
 
HMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSIS
HMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSISHMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSIS
HMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSISijcseit
 
Protein 3 d structure prediction
Protein 3 d structure predictionProtein 3 d structure prediction
Protein 3 d structure predictionSamvartika Majumdar
 
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...journal ijrtem
 
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...IJRTEMJOURNAL
 
Gutell 114.jmb.2011.413.0473
Gutell 114.jmb.2011.413.0473Gutell 114.jmb.2011.413.0473
Gutell 114.jmb.2011.413.0473Robin Gutell
 
Gutell 108.jmb.2009.391.769
Gutell 108.jmb.2009.391.769Gutell 108.jmb.2009.391.769
Gutell 108.jmb.2009.391.769Robin Gutell
 
Secondary structure of rna and its predicting elements
Secondary structure of rna and its predicting elementsSecondary structure of rna and its predicting elements
Secondary structure of rna and its predicting elementsVinaKhan1
 

Ähnlich wie Gutell 116.rpass.bibm11.pp618-622.2011 (20)

Gutell 121.bibm12 alignment 06392676
Gutell 121.bibm12 alignment 06392676Gutell 121.bibm12 alignment 06392676
Gutell 121.bibm12 alignment 06392676
 
Gutell 090.bmc.bioinformatics.2004.5.105
Gutell 090.bmc.bioinformatics.2004.5.105Gutell 090.bmc.bioinformatics.2004.5.105
Gutell 090.bmc.bioinformatics.2004.5.105
 
Gutell 107.ssdbm.2009.200
Gutell 107.ssdbm.2009.200Gutell 107.ssdbm.2009.200
Gutell 107.ssdbm.2009.200
 
Structure alignment methods
Structure alignment methodsStructure alignment methods
Structure alignment methods
 
AMINO ACID INTERACTION NETWORK PREDICTION USING MULTI-OBJECTIVE OPTIMIZATION
AMINO ACID INTERACTION NETWORK PREDICTION USING MULTI-OBJECTIVE OPTIMIZATIONAMINO ACID INTERACTION NETWORK PREDICTION USING MULTI-OBJECTIVE OPTIMIZATION
AMINO ACID INTERACTION NETWORK PREDICTION USING MULTI-OBJECTIVE OPTIMIZATION
 
Gutell 117.rcad_e_science_stockholm_pp15-22
Gutell 117.rcad_e_science_stockholm_pp15-22Gutell 117.rcad_e_science_stockholm_pp15-22
Gutell 117.rcad_e_science_stockholm_pp15-22
 
Sequence homology search and multiple sequence alignment(1)
Sequence homology search and multiple sequence alignment(1)Sequence homology search and multiple sequence alignment(1)
Sequence homology search and multiple sequence alignment(1)
 
Sequence alignment
Sequence alignmentSequence alignment
Sequence alignment
 
Protein Threading
Protein ThreadingProtein Threading
Protein Threading
 
A Frequency Domain Approach to Protein Sequence Similarity Analysis and Funct...
A Frequency Domain Approach to Protein Sequence Similarity Analysis and Funct...A Frequency Domain Approach to Protein Sequence Similarity Analysis and Funct...
A Frequency Domain Approach to Protein Sequence Similarity Analysis and Funct...
 
HMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSIS
HMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSISHMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSIS
HMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSIS
 
Protein 3 d structure prediction
Protein 3 d structure predictionProtein 3 d structure prediction
Protein 3 d structure prediction
 
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
 
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
Performance Improvement of BLAST with Use of MSA Techniques to Search Ancesto...
 
Sequence alignment.pptx
Sequence alignment.pptxSequence alignment.pptx
Sequence alignment.pptx
 
Gutell 114.jmb.2011.413.0473
Gutell 114.jmb.2011.413.0473Gutell 114.jmb.2011.413.0473
Gutell 114.jmb.2011.413.0473
 
Bs4201462467
Bs4201462467Bs4201462467
Bs4201462467
 
Gutell 108.jmb.2009.391.769
Gutell 108.jmb.2009.391.769Gutell 108.jmb.2009.391.769
Gutell 108.jmb.2009.391.769
 
Sequence Analysis
Sequence AnalysisSequence Analysis
Sequence Analysis
 
Secondary structure of rna and its predicting elements
Secondary structure of rna and its predicting elementsSecondary structure of rna and its predicting elements
Secondary structure of rna and its predicting elements
 

Mehr von Robin Gutell

Gutell 124.rna 2013-woese-19-vii-xi
Gutell 124.rna 2013-woese-19-vii-xiGutell 124.rna 2013-woese-19-vii-xi
Gutell 124.rna 2013-woese-19-vii-xiRobin Gutell
 
Gutell 123.app environ micro_2013_79_1803
Gutell 123.app environ micro_2013_79_1803Gutell 123.app environ micro_2013_79_1803
Gutell 123.app environ micro_2013_79_1803Robin Gutell
 
Gutell 122.chapter comparative analy_russell_2013
Gutell 122.chapter comparative analy_russell_2013Gutell 122.chapter comparative analy_russell_2013
Gutell 122.chapter comparative analy_russell_2013Robin Gutell
 
Gutell 120.plos_one_2012_7_e38320_supplemental_data
Gutell 120.plos_one_2012_7_e38320_supplemental_dataGutell 120.plos_one_2012_7_e38320_supplemental_data
Gutell 120.plos_one_2012_7_e38320_supplemental_dataRobin Gutell
 
Gutell 119.plos_one_2017_7_e39383
Gutell 119.plos_one_2017_7_e39383Gutell 119.plos_one_2017_7_e39383
Gutell 119.plos_one_2017_7_e39383Robin Gutell
 
Gutell 118.plos_one_2012.7_e38203.supplementalfig
Gutell 118.plos_one_2012.7_e38203.supplementalfigGutell 118.plos_one_2012.7_e38203.supplementalfig
Gutell 118.plos_one_2012.7_e38203.supplementalfigRobin Gutell
 
Gutell 115.rna2dmap.bibm11.pp613-617.2011
Gutell 115.rna2dmap.bibm11.pp613-617.2011Gutell 115.rna2dmap.bibm11.pp613-617.2011
Gutell 115.rna2dmap.bibm11.pp613-617.2011Robin Gutell
 
Gutell 113.ploso.2011.06.e18768
Gutell 113.ploso.2011.06.e18768Gutell 113.ploso.2011.06.e18768
Gutell 113.ploso.2011.06.e18768Robin Gutell
 
Gutell 112.j.phys.chem.b.2010.114.13497
Gutell 112.j.phys.chem.b.2010.114.13497Gutell 112.j.phys.chem.b.2010.114.13497
Gutell 112.j.phys.chem.b.2010.114.13497Robin Gutell
 
Gutell 111.bmc.genomics.2010.11.485
Gutell 111.bmc.genomics.2010.11.485Gutell 111.bmc.genomics.2010.11.485
Gutell 111.bmc.genomics.2010.11.485Robin Gutell
 
Gutell 110.ant.v.leeuwenhoek.2010.98.195
Gutell 110.ant.v.leeuwenhoek.2010.98.195Gutell 110.ant.v.leeuwenhoek.2010.98.195
Gutell 110.ant.v.leeuwenhoek.2010.98.195Robin Gutell
 
Gutell 109.ejp.2009.44.277
Gutell 109.ejp.2009.44.277Gutell 109.ejp.2009.44.277
Gutell 109.ejp.2009.44.277Robin Gutell
 
Gutell 106.j.euk.microbio.2009.56.0142.2
Gutell 106.j.euk.microbio.2009.56.0142.2Gutell 106.j.euk.microbio.2009.56.0142.2
Gutell 106.j.euk.microbio.2009.56.0142.2Robin Gutell
 
Gutell 105.zoologica.scripta.2009.38.0043
Gutell 105.zoologica.scripta.2009.38.0043Gutell 105.zoologica.scripta.2009.38.0043
Gutell 105.zoologica.scripta.2009.38.0043Robin Gutell
 
Gutell 104.biology.direct.2008.03.016
Gutell 104.biology.direct.2008.03.016Gutell 104.biology.direct.2008.03.016
Gutell 104.biology.direct.2008.03.016Robin Gutell
 
Gutell 103.structure.2008.16.0535
Gutell 103.structure.2008.16.0535Gutell 103.structure.2008.16.0535
Gutell 103.structure.2008.16.0535Robin Gutell
 
Gutell 102.bioinformatics.2007.23.3289
Gutell 102.bioinformatics.2007.23.3289Gutell 102.bioinformatics.2007.23.3289
Gutell 102.bioinformatics.2007.23.3289Robin Gutell
 
Gutell 101.physica.a.2007.386.0564.good
Gutell 101.physica.a.2007.386.0564.goodGutell 101.physica.a.2007.386.0564.good
Gutell 101.physica.a.2007.386.0564.goodRobin Gutell
 
Gutell 100.imb.2006.15.533
Gutell 100.imb.2006.15.533Gutell 100.imb.2006.15.533
Gutell 100.imb.2006.15.533Robin Gutell
 
Gutell 099.nature.2006.443.0931
Gutell 099.nature.2006.443.0931Gutell 099.nature.2006.443.0931
Gutell 099.nature.2006.443.0931Robin Gutell
 

Mehr von Robin Gutell (20)

Gutell 124.rna 2013-woese-19-vii-xi
Gutell 124.rna 2013-woese-19-vii-xiGutell 124.rna 2013-woese-19-vii-xi
Gutell 124.rna 2013-woese-19-vii-xi
 
Gutell 123.app environ micro_2013_79_1803
Gutell 123.app environ micro_2013_79_1803Gutell 123.app environ micro_2013_79_1803
Gutell 123.app environ micro_2013_79_1803
 
Gutell 122.chapter comparative analy_russell_2013
Gutell 122.chapter comparative analy_russell_2013Gutell 122.chapter comparative analy_russell_2013
Gutell 122.chapter comparative analy_russell_2013
 
Gutell 120.plos_one_2012_7_e38320_supplemental_data
Gutell 120.plos_one_2012_7_e38320_supplemental_dataGutell 120.plos_one_2012_7_e38320_supplemental_data
Gutell 120.plos_one_2012_7_e38320_supplemental_data
 
Gutell 119.plos_one_2017_7_e39383
Gutell 119.plos_one_2017_7_e39383Gutell 119.plos_one_2017_7_e39383
Gutell 119.plos_one_2017_7_e39383
 
Gutell 118.plos_one_2012.7_e38203.supplementalfig
Gutell 118.plos_one_2012.7_e38203.supplementalfigGutell 118.plos_one_2012.7_e38203.supplementalfig
Gutell 118.plos_one_2012.7_e38203.supplementalfig
 
Gutell 115.rna2dmap.bibm11.pp613-617.2011
Gutell 115.rna2dmap.bibm11.pp613-617.2011Gutell 115.rna2dmap.bibm11.pp613-617.2011
Gutell 115.rna2dmap.bibm11.pp613-617.2011
 
Gutell 113.ploso.2011.06.e18768
Gutell 113.ploso.2011.06.e18768Gutell 113.ploso.2011.06.e18768
Gutell 113.ploso.2011.06.e18768
 
Gutell 112.j.phys.chem.b.2010.114.13497
Gutell 112.j.phys.chem.b.2010.114.13497Gutell 112.j.phys.chem.b.2010.114.13497
Gutell 112.j.phys.chem.b.2010.114.13497
 
Gutell 111.bmc.genomics.2010.11.485
Gutell 111.bmc.genomics.2010.11.485Gutell 111.bmc.genomics.2010.11.485
Gutell 111.bmc.genomics.2010.11.485
 
Gutell 110.ant.v.leeuwenhoek.2010.98.195
Gutell 110.ant.v.leeuwenhoek.2010.98.195Gutell 110.ant.v.leeuwenhoek.2010.98.195
Gutell 110.ant.v.leeuwenhoek.2010.98.195
 
Gutell 109.ejp.2009.44.277
Gutell 109.ejp.2009.44.277Gutell 109.ejp.2009.44.277
Gutell 109.ejp.2009.44.277
 
Gutell 106.j.euk.microbio.2009.56.0142.2
Gutell 106.j.euk.microbio.2009.56.0142.2Gutell 106.j.euk.microbio.2009.56.0142.2
Gutell 106.j.euk.microbio.2009.56.0142.2
 
Gutell 105.zoologica.scripta.2009.38.0043
Gutell 105.zoologica.scripta.2009.38.0043Gutell 105.zoologica.scripta.2009.38.0043
Gutell 105.zoologica.scripta.2009.38.0043
 
Gutell 104.biology.direct.2008.03.016
Gutell 104.biology.direct.2008.03.016Gutell 104.biology.direct.2008.03.016
Gutell 104.biology.direct.2008.03.016
 
Gutell 103.structure.2008.16.0535
Gutell 103.structure.2008.16.0535Gutell 103.structure.2008.16.0535
Gutell 103.structure.2008.16.0535
 
Gutell 102.bioinformatics.2007.23.3289
Gutell 102.bioinformatics.2007.23.3289Gutell 102.bioinformatics.2007.23.3289
Gutell 102.bioinformatics.2007.23.3289
 
Gutell 101.physica.a.2007.386.0564.good
Gutell 101.physica.a.2007.386.0564.goodGutell 101.physica.a.2007.386.0564.good
Gutell 101.physica.a.2007.386.0564.good
 
Gutell 100.imb.2006.15.533
Gutell 100.imb.2006.15.533Gutell 100.imb.2006.15.533
Gutell 100.imb.2006.15.533
 
Gutell 099.nature.2006.443.0931
Gutell 099.nature.2006.443.0931Gutell 099.nature.2006.443.0931
Gutell 099.nature.2006.443.0931
 

Kürzlich hochgeladen

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 

Kürzlich hochgeladen (20)

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

Gutell 116.rpass.bibm11.pp618-622.2011

  • 1. R-PASS: A Fast Structure-based RNA Sequence Alignment Algorithm Yanan Jianga , Weijia Xub , Lee Parnell Thompsonc , Robin R. Gutella , Daniel P. Mirankerc a Institute for Cellular and Molecular Biology b Texas Advanced Computing Center c Department of Computer Science The University of Texas at Austin, Austin, USA a {yanan.jiang, robin.gutell}@utexas.edu b xwj@tacc.utexas.edu c {parnell, miranker}@cs.utexas.edu Abstract—We present a fast pairwise RNA sequence alignment method using structural information, named R- PASS (RNA Pairwise Alignment of Structure and Sequence), which shows good accuracy on sequences with low sequence identity and significantly faster than alternative methods. The method begins by representing RNA secondary structure as a set of structure motifs. The motifs from two RNAs are then used as input into a bipartite graph-matching algorithm, which determines the structure matches. The matches are then used as constraints in a constrained dynamic programming sequence alignment procedure. The R-PASS method has an O(nm) complexity. We compare our method with two other structure-based alignment methods, LARA and ExpaLoc, and with a sequence-based alignment method, MAFFT, across three benchmarks and obtain favorable results in accuracy and orders of magnitude faster in speed. Keywords-RNA pairwise structural alignment; structure motif; bipartite graph matching; constraint sequence alignment I. INTRODUCTION A trend in the sequence analysis is to process increasingly larger scales of sequences in order to detect sequence homologues, predict consensus secondary structures[1], identify structure motifs[2] and infer phylogenetic relationships[3]. Non-canonical base pairs and structure motifs are found based on a MSA of 2,240 16S rRNA sequences[2] and a set of statistical free energy values are computed from more than 50,000 ncRNA sequences[4]. Furthermore, genome wide sequence alignments identifying ncRNAs have become increasingly routine. RNAz[5] predicted over 30,000 structured RNA elements in human genomes from 438,788 alignments of non-coding regions. Scalable and fast computation method is a key to make large scale analysis feasible. Sequence-based RNA sequence alignment programs, e.g. MAFFT[6], generate accurate alignments when the RNA sequences are conserved. However, these programs are unable to produce reliable alignments when sequence identity drops below 50–60%[7]. Exploiting the phenomenon of the coevolution of base-pairs and the preservation of secondary structure are promising approaches to improve RNA alignment accuracy[7]. Although many structure-based programs exist, most of them have high complexity and are not applicable to long RNAs. In this paper, we present a method, R-PASS (RNA Pairwise Alignment of Structure and Sequence). We evaluated our method compared with two state-of-art structure-based alignment programs, LARA[8] and ExpaLoc[9], and a popular sequence-based alignment program MAFFT. Of the programs tested, R-PASS is the fastest. The results also show improved accuracy upon MAFFT and ExpaLoc and comparable accuracy with LARA. II. RELATED WORK Most structure-based alignment programs continue the tradition of the Sankoff’s algorithm[10], where it simultaneously folds and aligns a set of pseudo knot-free RNAs using a dynamic programming approach (DP). Although efforts have been made to reduce the time and space complexity, this approach still requires O(n4 ) time in the pairwise alignment case. Thus most structure-based programs are not practical for long RNAs[8]. R-PASS assumes the structure information is available for both sequences. We compare our program to the two most recent structure based alignment programs that target the same problem, LARA[8] and ExpaLoc[9]. LARA adopts a graph-based representation and models the alignment as an integer linear program. ExpaLoc combines ExpaRNA[9] and LocARNA[11], where ExpaRNA detects the longest exact pattern match of two RNA structures and LocARNA fills in the unaligned space between those patterns. Our program differs with LARA in that the RNA structures are matched at the structure motif level instead of at the nucleotide level and an optimal alignment is found by a DP algorithm. Unlike ExpaRNA which finds the exact pattern matches, our matching algorithm is more flexible, so more structure constraints can be used in alignment construction. Also, the alignment building process in our program is still sequence-based, and thus has a much lower computation complexity than LocARNA. To evaluate the effectiveness of using additional structure information, we also compare our program with MAFFT[6]. Its iterative refinement method L-INS-I is evaluated to be one of the most accurate sequence-based alignment programs that produce high quality alignments with average pairwise sequence identity above 55%[12]. III. ALGORITHM Given two RNA sequences with known secondary structures, we parse the annotated base pairs into a set of structure motifs. The feature vectors of the structure motifs are computed and form the vertices of a bipartite graph. The weight of an edge is based on the similarity of two feature vectors. A set of edges which represent the correspondence between structural motifs are obtained by a bipartite graph matching algorithm. These correspondences are then applied as anchor points to construct an optimal sequence alignment using a DP algorithm. A. Structure Matching 1) Infering motifs from base pairing annotation: The secondary structure of a RNA sequences consists of a set of 2011 IEEE International Conference on Bioinformatics and Biomedicine 978-0-7695-4574-5/11 $26.00 © 2011 IEEE DOI 10.1109/BIBM.2011.74 618
  • 2. nested base pairs. The majority of the nucleotides in a RNA structure form canonical base pairings in a regular helix region. The remaining unpaired nucleotides form loops, which can be further categorized to bulges, internal, hairpin and multi-stem loops. Thus a RNA secondary structure can be viewed as a set of those RNA motifs. Stem is the continuous base-paired double strand region and is composed of a 5’end half and a 3’ end half. Bulge is a loop appearing only on one strand of a stem. Internal loop is the unpaired regions occurring on both strands of a stem. Hairpin loop is the loop linking the 5’ end and 3’ end of a stem. Multi-stem loop is the intersection of three or more stems. It is separated by single strand sequences. Free end is the unpaired region at the 5’ end or 3’ end of a structure. Given a RNA secondary structure annotated by its base pairings, we first segment it into Stem, Bulge, Internal loop and Hairpin loop motifs. Then we merge the basic motifs into Compound Stem and Hairpin. A Compound Stem is the discontinuous base-paired region merged by Stems, Bulges and Internal loops. A Hairpin is formed by combining a Hairpin loop and its flanking Stem/Compound Stem. The Compound Stem, Stem, Hairpin, Multi-stem loop and Free end form a complete RNA structure. To reduce computation, we only consider Compound Stem, Stem and Hairpin motifs in the following matching step. The other motifs are spontaneously matched after matching of those three motifs. 2) Vector model of motifs: For each Stem/Compound Stem/Hairpin motif we compute a feature vector which includes the start (ps) and end position (pe) of the motif and the number of nucleotides in it (l). Since a Stem/Compound Stem consist of two halves, the end position of the first half (pse) and the start position of the second half (pes) are also included in its feature vector. The feature vector of a Hairpin is the same as the feature vector of its stem component. Therefore f{Stem, Compound Stem, Hairpin} = (ps, pse, pes, pe, l). A RNA secondary structure S can thus be represented as S = {fti}, i ‫ג‬ (1, k), where k is the number of motifs in S and t ‫ג‬ {Stem, Compound Stem, Hairpin}. The similarity of two structure motifs is measured by how close their relative positions are in the corresponding global sequences and how similar their lengths are. The similarity is only computed between motifs of the same type, i.e. Hairpin with Hairpin and Stem with Stem or Compound Stem. We define the following heuristic similarity function composed of position similarity (pos) and length similarity (len), where sl is the global sequence length, and wl is the weight of length similarity. ݀ሺ݂௔ǡ ݂௕ሻ ൌ ൜ Ͳǡ ݂݅ ܽǡ ܾ ܽ‫݁ݎ‬ ݊‫ݐ݋‬ ‫݂݋‬ ‫ݐ‬ ݁ ‫݁݉ܽݏ‬ ‫݁݌ݕݐ‬ ‫ݏ݋݌‬ሺ݂௔ǡ ݂௕ሻǤ ݈݁݊ሺ݂௔ǡ ݂௕ሻǡ ‫ݐ݋‬ ݁‫݁ݏ݅ݓݎ‬ ݈݁݊ሺ݂௔ǡ ݂௕ሻ ൌ ͳ െ ‫ݓ‬௟Ǥฬ ݈௔ െ ݈௕ ݈௔ ൅ ݈௕ ฬ ‫ݏ݋݌‬ሺ݂௔ǡ ݂௕ሻ ൌ ͳ െ ݉ܽ‫ݔ‬ ൮ ݉݅݊ ൬ฬ ‫݌‬௦௔ ‫݈ݏ‬௔ െ ‫݌‬௦௕ ‫݈ݏ‬௕ ฬ ǡ ฬ ‫݌‬௦௘௔ ‫݈ݏ‬௔ െ ‫݌‬௦௘௕ ‫݈ݏ‬௕ ฬ൰ ǡ ݉݅݊ ൬ฬ ‫݌‬௘௦௔ ‫݈ݏ‬௔ െ ‫݌‬௘௦௕ ‫݈ݏ‬௕ ฬ ǡ ฬ ‫݌‬௘௔ ‫݈ݏ‬௔ െ ‫݌‬௘௕ ‫݈ݏ‬௕ ฬ ൰ ൲ ǡ ݂݅ ܽǡ ܾ ‫ג‬ ܵ‫݉݁ݐ‬ In the above equation, the score for different types of the structural motif is always 0. For the comparison of the same motif type, the score is the product of the relative position similarity and the length similarity. The relative position difference for a single strand is defined as the minimum difference as measured by start and end position. For a Stem/Compound Stem structure, the relative position difference is first evaluated on each strand individually and then the maximum score between the two strands is used. For simplicity, only the weight on length difference (wl) is used. The similarity score ranges between 0 and 1 inclusively. 3) Structure motif matching algorithm: The structure motifs from two RNA structures are matched by a bipartite graph matching (BGM). BGM is a powerful matching technique that can achieve global and optimal matching results in polynomial time. It has been applied to protein structure alignment[13]. We denote a bipartite graph G as G = (V = L R, E). V is a set of vertices which can be divided into two subsets, L and R. Each subset individually represents structure motifs from one sequence and contains their feature vectors as vertices, i.e. L = S1, R = S2. E is a set of weighted edges that link between vertices in L and vertices in R. The edge weight is the similarity score between two feature vectors. A threshold for edge weight t is used in a preprocessing step of graph construction, so only edges with weight no less than t are used for matching. The limitation t > 0 ensures that the matches are only between motifs of the same type. There is no edge linking vertices from the same subset. Hence, E = {eLi,Rj = d(fLi, fRj) t} , i ‫ג‬ (1, k), j ‫ג‬ (1, h), where k is the number of motifs in S1 and h is the number of motifs in S2. Based on this bipartite graph, the correspondence of structure motifs are then found by a stable matching algorithm[14]. The algorithm generates a set of matched structure motifs, more specifically the feature vectors of structure motifs, M = {fLi, fRi}, i ‫ג‬ (1, c), where c is the number of matches. The stable matching ensures that no two motifs are better matched together than with the motif they are currently matched with. The complexity of this algorithm is O(kh). Any graph matching algorithm could be used as a substitute for the stable matching, such as a maximum weighted matching algorithm. The stable matching was chosen for its fast runtime. 4) Computing matching blocks: The matching structure motifs are then converted into sequence blocks. The boundaries of the matched motifs are obtained based on their feature vectors. For a Hairpin motif, its sequence is divided into three parts, 5’ end stem, Hairpin loop and 3’ end stem, and the Stem and Compound Stem motif are segmented into the 5’ end stem and 3’ end stem. Thus the matching segments from two sequences form a match block, of which the boundaries are determined by the start and end positions of the segments (Fig. 1). The above generated sequence block set may contain crossing blocks caused by mismatches in BGM. In cases where the structures are unknown and require prediction, 619
  • 3. overlapping blocks may also appear i structures contain a set of overlapping cand solve this problem, the Dijkstra’s shortest p applied[15]. We construct a graph where th blocks and two pseudo blocks representing end of the sequences. The edge weight between two blocks. It is set to be 0 for cro the total number of position gaps betwee both sequences for non-crossing blocks. O positive weight are added to the graph. A then computed between the start block an The match blocks on the path are used in the B. Constraint sequence alignment Figure 1. Convert structure alignment to sequen For sequence A = {ai} and B = {bj}, i ‫ג‬ where n is the length of sequence A and m sequence B, a constraint alignment is then 1). The optimal path is built only through the DP matrix using an affine gap penalty m a matrix H, where H(i, j) is the best alig (a1,…, ai, b1,…, bj) with ai aligned to a g where V(i, j) is the best alignment score up bj aligned to a gap, and a DP matrix D, wh best alignment score up to ai and bj, the re for the constrained DP is then calculated as: ‫ܪ‬ሺ݅ǡ ݆ሻ ൌ ݉ܽ‫ݔ‬ ሺ‫ܦ‬ሺ݅ െ ͳǡ ݆ሻ ൅ ‫݋‬ǡ ‫ܪ‬ሺ݅ െ ͳǡ ݆ሻ ൅ ݁ሻ ܸሺ݅ǡ ݆ሻ ൌ ݉ܽ‫ݔ‬ ሺ‫ܦ‬ሺ݅ǡ ݆ െ ͳሻ ൅ ‫݋‬ǡ ܸሺ݅ǡ ݆ െ ͳሻ ൅ ݁ሻ ‫ܦ‬ሺ݅ǡ ݆ሻ ൌ ቊ ݉ܽ‫ݔ‬ ቀ‫ܦ‬ሺ݅ െ ͳǡ ݆ െ ͳሻ ൅ ܵ൫ܽ௜ǡ ܾ௝൯ǡ ‫ܪ‬ሺ݅ǡ ݆ሻǡ ܸሺ݅ǡ ݆ െ ǡ ‫ݐ݋‬ ݁‫݅ݓݎ‬‫ݏ‬ In the above equations, o is the gap open gap extension penalty and S(ai, bj) is the s between nucleotides ai and bj. The com alignment step is O(nm). IV. RESULTS The program generated pairwise alignm by comparison to a gold standard reference nucleotide level. A pairwise alignment A c a set of nucleotide correspondence arrang order, so A = {ai, bi}, i ‫ג‬ (1, n), where n is th of the two sequences, and ai and bi a nucleotides from two sequences. Let alignment be A and the testing alignment b if the predicted didate motifs. To path algorithm is he vertices are the the start and the is the distance ossing blocks and en the blocks on Only edges with A shortest path is nd the end block. e next step. nce alignment. (1, n), j ‫ג‬ (1, m), m is the length of n computed (Fig. match blocks in model[16]. Given gnment score of gap; a matrix V, p to ai and bj with here D(i, j) is the currence relation : ݆ሻቁ ǡ ݂݅ ሺ݅ǡ ݆ሻ ݅݊ ܾ݈‫݇ܿ݋‬ ‫ݏ‬݁ n penalty, e is the substitution score mplexity of the ment is evaluated e alignment at the can be denoted as ged in sequential he smaller length are the matched the reference be A’= {a’i, b’i}, we define a correct correspondence The “=” here means two nucle nucleotide index in the sequence. is the percentage of correct correspo of all correspondence in the referenc A. Testing data We created the pairwise sequen with structure information from thre 2.1[7], Rfam 10.1[17] and CRW Rfam are popular alignment bench evaluations and provide consensus various ncRNA families. The CRW high quality alignments provides in structure in various formats which s We took three RNA fami spliceosomal RNA (RF00004), nuc and Bacterial RNase P class A families from Bralibase data-set 1 a U5 spliceosomal RNA, tRNA and families from CRW Site: 5S rRNA Besides the data-set2 from Bra pairwise sequence alignments, all th by breaking the multiple sequence sequence alignments. The secon sequence is either inferred fro provided in the original dataset or Site. The pseudoknots are excluded A single test contains two RN corresponded RNA secondary stru bracket format[19]. Based on RN datasets are grouped into small ( 1000nt) and large (> 1000nt) RNA RNA family are further divided in 40, 60 and 80 by sequence identity indicates the pairwise sequence i between 40 and 60. Due to the page sets and all the results are available B. Comparison with other program We implemented the algorithm sequence alignment, we use the RI [20] with gap open penalty of -8 an of -1. R-PASS is written in Java MAFFT are implemented in C. E small RNA datasets, while the ot tested on all datasets. Each test co input. Except for MAFFT-L-INS sequence information, the structure to all the other three programs. For first executed and the output constr LocARNA to obtain complete align executed with default setting in processor at 3.16G Hz. 1) Alignment accuracy: Usin described above, we evaluated generated by all four programs reference alignments. As shown in can achieve > 90% accuracy in e as ai = a’i and bi = b’i . eotides have the same The alignment accuracy ondence over the number ce alignment A. ce alignment testing data ee benchmarks, Bralibase Site[18]. Bralibase and hmarks used by program secondary structures for W Site well known for its ndividual RNA secondary serve our purpose well. ilies from Rfam: U2 clear RNase P (RF00009) (RF00010); four RNA and data-set 2: g2intron, 5S rRNA and two RNA and 16S rRNA. alibase which consists of he other tests are created alignments into pairwise ndary structure of each om consensus structure retrieved from the CRW from the structures. NA sequences with their uctures annotated in dot NA sequence length, the (< 200nt), middle (200- A (Table 1). Tests in each nto at least three subsets: y. For example, subset 40 identity of each test is e limit, the details of data upon request. ms in the tool R-PASS. For IBOSUM scoring matrix nd gap extension penalty a, LARA, ExpaLoc and ExpaLoc is tested on the ther three programs are ontains two sequences as S-I which only use the e annotation is provided r ExpaLoc, ExpaRNA is raints are then piped into nments. All programs are n a desktop with Intel ng the scoring method the alignment quality with the corresponding Fig. 2, all four programs small and large RNA 620
  • 4. datasets with sequence identity above 60% and in middle RNA datasets with sequence identity above 80%. In this zone, using structure information does not improve alignment quality significantly. The additional structure information has a remarkable effect in the twilight zone, where the sequence identity is below 60%. While MAFFT performance drops sharply with decrease of sequence identity, R-PASS and LARA can maintain high alignment quality (Fig. 2). ExpaLoc does not show superiority over MAFFT in the twilight zone in the small RNA datasets. One possible explanation is that as the sequence identity drops, the number of exact pattern matches, i.e., the number of structure constraints also decreases, thus the degree of freedom expands as LocARNA aligns the sequences, causing potential alignment errors. In the twilight zone, R-PASS is comparable with LARA in terms of alignment quality. R-PASS outperforms all other programs in the tRNA dataset from Bralibase and RNase P datasets from Rfam (Fig. 2). In the other datasets except U5, the largest difference between R-PASS and LARA is less than 3%. Figure 2. Accuracy comparisons in small, middle and large RNA datasets. 2) Running time: The total running time of each program on datasets of the same sized RNA group is used. ExpaLoc costs the most CPU time in the small RNA datasets (data not shown here). TABLE I. PROGRAM RUNNING TIME IN SECONDS. Program Small Middle Large R-PASS 17.8 17.6 27 LARA 178.24 (10)a 1701 (97) 30000 (1111) MAFFT-L-NS-I 491.34 (27) 573 (33) 108 (4) # tests 2806 2893 1046 Avg. length 96 314 1467 a. The number of times that R-PASS is faster than the program As shown in Table 1, R-PASS is the fastest among all programs. The advantage over LARA is more obvious as the length of the RNA increases (Fig. 3). R-PASS is 10 times faster than LARA and 27 times faster than MAFFT in small RNA datasets; and is 97 times faster than LARA and 33 times faster than MAFFT for middle RNA datasets. It is 4 times faster than the sequence-based alignment program MAFFT and above 1,100 times faster than LARA in large RNA datasets while maintaining comparable alignment quality (Fig. 2). Therefore, our approach is more suitable for large-scale RNA sequences. Figure 3. Program running time versus sequence length. 3) R-PASS structure motif matching with LARA subroutine: In R-PASS, individual sequence blocks generated after structure motif match can be fed into any alignment algorithm to produce a global alignment. Since LARA finds the structure motif correspondence at the base pair level, it performs better than R-PASS in some datasets in the twilight zone. Thus integrating LARA to align the stem regions may improve R-PASS performance in those datasets. We compute the structure match blocks by R- PASS and use LARA to align the blocks generated from Stem motifs and Gotoh algorithm to align the blocks of loop and free end motifs. Each local alignment is then linked in sequential order to form a global alignment. We tested the integrated R-PASS and LARA in the small RNA datasets. This approach performs better than R- PASS and is comparable with LARA. In some cases, e.g. U5 (Fig. 4), the integrated approach improves the alignment accuracy upon LARA. 20 40 60 80 100 0 20 40 60 80 Accuracy PID subset Small RNA: tRNA R-PASS LARA ExpaLoc MAFFT-L- INS-I 30 40 50 60 70 80 90 100 20 30 40 50 60 70 80 90 Accuracy PID subset Middle RNA: Nuclear RNase P (RF00009) R-PASS LARA MAFFT- L-INS-I 80 85 90 95 100 20 40 60 80 Accuracy PID subset Large RNA: 16S rRNA R-PASS LARA MAFFT-L- INS-I 0 1 2 3 4 5 96 314 1467 log10(runningtimein milisecondpertest) Sequence length (nucleotide) Program running time VS sequence length R-PASS LARA MAFFT- L-INS-I 621
  • 5. Figure 4. Accuracy comparison on U5. V. CONCLUSION AND FUTURE WORK We have developed a new workflow for pairwise alignment of RNA sequences with known structure information, named R-PASS. We utilize the RNA structure motif correspondence found in a bipartite graph framework to constrain the sequence alignment problem and perform the final alignment. The complexity of our algorithm is O(nm), yet it can achieve alignment quality comparable with the current state of the art programs. This is especially apparent in the twilight zone. R-PASS is also significantly faster than competing methods, often by orders of magnitude, which makes our method well suited for use in iterative algorithms and high-throughput RNA analysis. We are currently working on various improvements to the R-PASS framework described in this paper. The alignment can be refined by matching the basic motifs in a Compound Stem/Hairpin motif. Improvements can also be extended to the nucleotide level where all the base pairings in each sequence could be used as constraints. The similarity function between structural motifs we use is established by practical experience and the values are determined based on preliminary experiments. This function can be further enhanced by including additional information such as base pair similarity and nucleotide similarity. The additional information can improve the sensitivity and the selectivity of the matching algorithm. Our results show that incorporating structure information significantly improves alignment accuracy upon sequence-based alignment methods, especially for less conserved sequences. While our current algorithm focuses on aligning known structures, it can be adapted to align unknown structures by structure prediction using RNA folding algorithms or generating all potential structure motifs and find correct matches by an advanced similarity function. The fast performance of our alignment program also makes it promising to compute a multiple sequence alignment over a large set of sequences. While R-PASS focuses on pairwise alignment, the result can be extended to produce multiple sequence alignments using a guide tree or a progressive approach. Our method can also target template-based alignment problems, where new sequences are added into an existing alignment by matching the new sequence with the consensus sequence of the alignment. ACKNOWLEDGMENT This research is supported by the National Institutes of Health grant, GM085337. REFERENCES [1] Gutell, R. R., Weiser, B., Woese, C. R. and Noller, H. F. Comparative anatomy of 16-S-like ribosomal RNA. Prog Nucleic Acid Res Mol Biol, 32 (1985), 155-216. [2] Gutell, R. R., Larsen, N. and Woese, C. R. Lessons from an evolving rRNA: 16S and 23S rRNA structures from a comparative perspective. Microbiol Rev, 58, 1 (Mar 1994), 10-26. [3] Phillips, A., Janies, D. and Wheeler, W. Multiple sequence alignment in phylogenetic analysis. Mol Phylogenet Evol, 16, 3 (Sep 2000), 317-330. [4] Wu, J. C., Gardner, D. P., Ozer, S., Gutell, R. R. and Ren, P. Correlation of RNA secondary structure statistics with thermodynamic stability and applications to folding. J Mol Biol, 391, 4 (Aug 28 2009), 769- 783. [5] Washietl, S., Hofacker, I. L., Lukasser, M., Huttenhofer, A. and Stadler, P. F. Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome. Nature Biotechnology, 23, 11 (Nov 2005), 1383-1390. [6] Katoh, K., Kuma, K., Miyata, T. and Toh, H. Improvement in the accuracy of multiple sequence alignment program MAFFT. Genome Inform, 16, 1 (2005), 22-33. [7] Gardner, P. P., Wilm, A. and Washietl, S. A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic Acids Res, 33, 8 (2005), 2433-2439. [8] Bauer, M., Klau, G. W. and Reinert, K. Accurate multiple sequence- structure alignment of RNA sequences using combinatorial optimization. BMC Bioinformatics, 8(2007), 271. [9] Heyne, S., Will, S., Beckstette, M. and Backofen, R. Lightweight comparison of RNAs based on exact sequence-structure matches. Bioinformatics, 25, 16 (Aug 15 2009), 2095-2102. [10] Sankoff, D. Simultaneous solution of the RNA folding, alignment, and proto-sequence problems. SIAM J Appl Mat, 45(1985), 810-825. [11] Will, S., Reiche, K., Hofacker, I. L., Stadler, P. F. and Backofen, R. Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput Biol, 3, 4 (Apr 13 2007), e65. [12] Wilm, A., Mainz, I. and Steger, G. An enhanced RNA alignment benchmark for sequence alignment programs. Algorithms Mol Biol, 1(2006), 19. [13] Wang, Y., Makedon, F., Ford, J. and Huang, H. A bipartite graph matching framework for finding correspondences between structural elements in two proteins. Conf Proc IEEE Eng Med Biol Soc, 4(2004), 2972-2975. [14] Shapley, D. G. a. L. S. College Admissions and the Stability of Marriage. American Mathematical Monthly, 69(1962), 9-14. [15] Dijkstra, E. W. A note on two problems in connexion with graphs. Numerische Mathematik, 1(1959), 269-271. [16] Gotoh, O. An improved algorithm for matching biological sequences. J Mol Biol, 162, 3 (Dec 15 1982), 705-708. [17] Gardner, P. P., Daub, J., Tate, J., Moore, B. L., Osuch, I. H., Griffiths- Jones, S., Finn, R. D., Nawrocki, E. P., Kolbe, D. L., Eddy, S. R. and Bateman, A. Rfam: Wikipedia, clans and the "decimal" release. Nucleic Acids Res, 39, Database issue (Jan 2011), D141-145. [18] Cannone, J. J., Subramanian, S., Schnare, M. N., Collett, J. R., D'Souza, L. M., Du, Y., Feng, B., Lin, N., Madabusi, L. V., Muller, K. M., Pande, N., Shang, Z., Yu, N. and Gutell, R. R. The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics, 3(2002), 2. [19] Gruber, A. R., Lorenz, R., Bernhart, S. H., Neubock, R. and Hofacker, I. L. The Vienna RNA websuite. Nucleic Acids Res, 36, Web Server issue (Jul 1 2008), W70-74. [20] Klein, R. J. and Eddy, S. R. RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinformatics, 4(Sep 22 2003), 44. 80 85 90 95 100 20 40 60 80 Accuracy PID subset U5 LARA subroutine R-PASS LARA 622