International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 5, Issue 2, February (2014), pp. 130-139 © IAEME
GENERIC APPROACH OF PATTERN MATCHING OF AMINO ACID
SEQUENCES USING MATCHING POLICY & PATTERN POLICY
A. K. Payra¹, S. Saha¹
¹Dept. of Computer Science & Engg, Dr. Sudhir Chandra Sur Degree Engineering College, DumDum, Kolkata
ABSTRACT
Pattern matching is widely used in applications such as image, audio and video processing and bioinformatics. Several pattern matching algorithms already exist, such as the Naive, Boyer-Moore (BM) and Knuth-Morris-Pratt (KMP) algorithms, along with hybrid approaches built on these methods. To improve complexity and to bring a new idea to pattern matching, this paper introduces the concepts of Matching Policy and Pattern Policy.
Keywords: ASCII, Pre-align, MP, PP, Heap, Success Ratio.
I. INTRODUCTION
Pattern matching has been studied extensively, and both its computation and its analysis are important. Pattern matching algorithms have been applied in many computer applications and industries, for example in information retrieval, information security, and searching for nucleotide or amino acid sequence patterns in biological sequence databases. The pattern matching problem can be defined as finding one, or more often all, of the occurrences of a given pattern $P = p_0 p_1 \ldots p_{m-1}$ of length $m$ in a text $T = t_0 t_1 \ldots t_{n-1}$ of length $n$, both built over a finite alphabet $\Sigma$ of size $\sigma$. The rest of the paper is organized as follows. Section II reviews several efficient algorithms used in practice. Section III describes the proposed algorithms in detail. Section IV discusses the experimental results together with the complexity analysis, and Section V concludes the paper.
II. PREVIOUS WORK
String pattern matching detects the positions of a substring in a given string. Although many string matching algorithms exist, we discuss only some of the major well-known ones here. Pattern matching algorithms can be categorized as single-pattern or multiple-pattern, based on their functionality. The naïve approach [15] simply tests all possible placements of the pattern P[1 . . m] relative to the text T[1 . . n]. Specifically, we try each shift s = 0, 1, . . ., n - m in succession and, for each shift s, compare T[s + 1 . . s + m] to P[1 . . m].
NAÏVE_STRING_MATCHER (T, P)
1. n ← length [T]
2. m ← length [P]
3. for s ← 0 to n - m do
4. if P[1 . . m] = T[s +1 . . s + m]
5. then return valid shift s
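The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: it uses 0-based indexing, collects every valid shift instead of stopping at the first one, and the function name `naive_string_matcher` is chosen here for convenience.

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every 0-based shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try each shift s = 0, 1, ..., n - m in turn.
    for s in range(n - m + 1):
        # Compare the window T[s .. s + m - 1] with the whole pattern.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


# Sample text and pattern taken from the worked example in this paper.
print(naive_string_matcher("PHLPPCSPPKQGK", "PKQ"))  # [8]
```

Every shift costs up to m character comparisons, so the worst case is O((n - m + 1) × m).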
The naïve string-matching procedure can be interpreted as sliding the pattern P[1 . . m] over the text T[1 . . n] and noting the shifts for which all of the characters in the pattern match the corresponding characters in the text. The naïve algorithm does not work well when many matching characters are followed by a mismatching character, a weakness that both the BM and KMP algorithms overcome. The Boyer-Moore (BM) algorithm [1] uses two heuristics, bad character and good suffix, to reduce the number of comparisons. The Quick Search (QS) algorithm [2] scans the characters of the window in any order and computes its shifts with the occurrence shift of the character. In the KMP algorithm [3], the prefix function (Π) for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid useless shifts of the pattern ‘p’; in other words, it avoids backtracking on the string ‘S’. With the string ‘S’, the pattern ‘p’ and the prefix function ‘Π’ as inputs, the occurrence of ‘p’ in ‘S’ is detected, and the number of shifts of ‘p’ after which the occurrence is found is returned. Like these algorithms, the approach in this paper aims to reduce complexity: the ASCII values of the pattern and the text are considered during comparison, and a Pattern Policy (PP) is designed to maximize the success rate of matching. The PP is embedded in the Matching Policy (MP) of the algorithm.
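To make the role of the prefix function concrete, the following is a minimal sketch of the standard textbook KMP procedure, not code from this paper. The names `prefix_function` and `kmp_search` are illustrative.

```python
def prefix_function(pattern: str) -> list[int]:
    """pi[q] = length of the longest proper prefix of pattern[:q + 1]
    that is also a suffix of it."""
    pi = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        # Fall back to shorter borders until the next character extends one.
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return all 0-based positions of pattern in text without
    ever moving backwards in the text."""
    if not pattern:
        return []
    pi = prefix_function(pattern)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and pattern[k] != ch:
            k = pi[k - 1]
        if pattern[k] == ch:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = pi[k - 1]
    return matches


print(prefix_function("ABABACA"))           # [0, 0, 1, 2, 3, 0, 1]
print(kmp_search("PHLPPCSPPKQGK", "PKQ"))   # [8]
```

The text index `i` only ever increases, which is what "avoiding backtracking on the string" means in practice.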
III. PRESENT WORK
Motivation: Many approaches [4, 5] to pattern matching over amino acid sequences have been discussed in the literature. A study of these papers shows that very few assessments have been made on the basis of the ASCII values of the pattern, with the aim of obtaining the maximum heuristic value by skipping comparisons. This observation prompted us to assess it.
Dataset: Data has been collected only from serine-phosphorylated peptides of length 13 (i.e., 13-mers centered at serine) from the Phospho.ELM database [6], which are experimentally determined to be substrates of different kinases.
PROPOSED METHOD
The basis of the algorithm is quite similar to the naïve method, but it is superior to it. Our algorithm consists of two major segments, given below.
Algorithm Matching Policy
// Consider an amino acid sequence or text T with length Tlen
// and a pattern P with length Plen.
// Stext is the probable matching sub-text of the text (T).
// i is an integer positional variable of the probable sub-text (Stext) match.
// S and S1 are the ASCII sums of the pattern (P) and the sub-text (Stext) respectively.
read T, P.
Tlen := length(T). Plen := length(P).
S := sum(ASCII(P)).
for i := 0 to (Tlen-Plen) step +1 do {
    j := i; r := 0;
    if (T[i] = P[0]) then {
        while (j ≤ Plen+i-1) do {
            Stext[r++] := T[j++];
        }
        S1 := sum(ASCII(Stext));
        if (S1 = S) then compare(P, Stext).
        else skip.
    }
}
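The Matching Policy above can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not the authors' implementation: the loop runs over the Tlen - Plen + 1 valid start positions, the ASCII sum of the sub-text is compared with the ASCII sum of the pattern, and the helper names `ascii_sum`, `probable_subtexts` and `matching_policy` are introduced here.

```python
def ascii_sum(s: str) -> int:
    """Sum of the ASCII codes of the characters of s."""
    return sum(ord(c) for c in s)


def probable_subtexts(text: str, pattern: str) -> list[tuple[int, str, int]]:
    """List (i, Stext, S1) for every position i where T[i] = P[0]."""
    t_len, p_len = len(text), len(pattern)
    rows = []
    if p_len == 0:
        return rows
    for i in range(t_len - p_len + 1):
        if text[i] == pattern[0]:
            stext = text[i:i + p_len]
            rows.append((i, stext, ascii_sum(stext)))
    return rows


def matching_policy(text: str, pattern: str) -> list[int]:
    """Return positions where the pattern occurs, comparing characters
    only when the ASCII sums of pattern and sub-text agree."""
    s = ascii_sum(pattern)
    matches = []
    for i, stext, s1 in probable_subtexts(text, pattern):
        if s1 != s:
            continue  # sums differ, so the strings cannot be equal
        if stext == pattern:  # character-wise check resolves sum collisions
            matches.append(i)
    return matches


print(probable_subtexts("PHLPPCSPPKQGK", "PKQ"))
print(matching_policy("PHLPPCSPPKQGK", "PKQ"))  # [8]
```

The sum test is only a filter: equal sums do not imply equal strings, which is why the final character-wise comparison is kept.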
Algorithm Pattern Policy
// Consider an amino acid sequence or text T with length Tlen.
// A0, …, An: the distinguishable amino acids contained in the protein sequence.
// On is an integer array storing the occurrences of each amino acid.
read T.
On := Occurrence(T(A0, A1, …, An)).
Heapsort(On).
Generate the pattern using the descending order of amino acid occurrences (On).
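A minimal Python sketch of the Pattern Policy is given below. It uses the standard-library `heapq` module to order residues by descending occurrence. The source does not state how residues with equal counts are ordered, so breaking ties by first position in the sequence is an assumption made here, as are the function name `pattern_policy` and the `length` parameter.

```python
import heapq
from collections import Counter


def pattern_policy(text: str, length: int) -> str:
    """Build a pattern from the `length` most frequent residues of text."""
    counts = Counter(text)
    # Assumption: ties are broken by the first position of the residue.
    first_seen = {}
    for idx, ch in enumerate(text):
        first_seen.setdefault(ch, idx)
    # Negated counts turn Python's min-heap into a max-heap on occurrence.
    heap = [(-count, first_seen[ch], ch) for ch, count in counts.items()]
    heapq.heapify(heap)
    picked = []
    while heap and len(picked) < length:
        picked.append(heapq.heappop(heap)[2])
    return "".join(picked)


print(pattern_policy("SSVPTPSPLGPLA", 3))  # PSL
```

For the sequence SSVPTPSPLGPLA the first three residues (P, S, L) are determined by their counts alone. The fourth is chosen among T, G, V and A, which all occur once, so it depends entirely on the tie-breaking rule; with the first-position rule assumed here it is V.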
PROPOSED METHOD WITH EXAMPLE
The proposed algorithms (Matching Policy and Pattern Policy) are illustrated with the example below, followed by several questions, with answers, that may arise in the mind of the reader while going through the algorithm.
Input: A pre-aligned phosphorylated dataset of 13-mers and a pattern of length 3 are considered here.
Text (T):
P H L P P C S P P K Q G K
Fig.1 Sample text
Pattern (P):
P K Q
Fig.2 Sample pattern
Matching Policy (MP)
The pattern is studied and features are extracted from it, such as its length (Plen), its ASCII sum (S) and its starting character (P[0]). Here, Plen = 3, S = 236 (P = 80, K = 75, Q = 81) and P[0] = 'P'.
The next step is to find the probable positions (i) of the pattern (P) in the text (T), that is, the positions where T[i] = P[0].
So, i = {0, 3, 4, 7, 8}, shown in gray in the original figure.
P H L P P C S P P K Q  G  K
0 1 2 3 4 5 6 7 8 9 10 11 12
Fig.3 Possible index of P in the text
The length (Slen) of the sub-text (Stext) taken from the text (T) is compared with the pattern (P) length; the two must be equal for an instance to be considered.
So, Slen = 3 and Stext = {PHL} for i = 0.
Table-I. Selected Sub-text
Instance 0 i=0 P H L
Instance 1 i=3 P P C
Instance 2 i=4 P C S
Instance 3 i=7 P P K
Instance 4 i=8 P K Q
Next, the ASCII sum (S1) of Stext is calculated.
For Stext = PHL, S1 = 80 + 72 + 76 = 228.
If S ≠ S1, it is not required to compare P and Stext; otherwise the individual characters are compared.
Table-II. ASCII table for Stext
Instance   i   Stext   S1    Character-wise comparison
0          0   PHL     228   X (228 ≠ 236)
1          3   PPC     227   X
2          4   PCS     230   X
3          7   PPK     235   X
4          8   PKQ     236   required
The individual characters of Stext and P are compared to find out whether the two strings are equal.
P H L P P C S P P K Q G K
Fig-4: Stext matching with sample Pattern
Here, P matches Stext. So, the pattern is present in the text, at position i = 8.
A crucial situation arises when the ASCII sums of Stext and the pattern are the same but the two strings are different. For example:
Table-III: ASCII value of Stext and pattern are same
Pattern (P)   Sub-text (Stext)
PKQ           PMO
              PQK
              PPL
              PKQ
Here we get the advantage of the character-wise comparison between Stext and P, as illustrated next:
…………………. P M O ……………….
Fig-5: Sequential matching
The comparison is abandoned at the first mismatching character, and the search continues at the next probable position in the text. This keeps the approach efficient and fast.
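The collision case can be checked directly. The short sketch below confirms that the sub-texts PMO, PQK and PPL share the ASCII sum 236 with the pattern PKQ, and reports where the character-wise comparison stops. The helper `first_mismatch` is a name introduced here for illustration.

```python
def ascii_sum(s: str) -> int:
    """Sum of the ASCII codes of the characters of s."""
    return sum(ord(c) for c in s)


def first_mismatch(stext: str, pattern: str):
    """Index of the first differing character, or None if the strings agree."""
    for k, (a, b) in enumerate(zip(stext, pattern)):
        if a != b:
            return k
    return None


pattern = "PKQ"
for stext in ["PMO", "PQK", "PPL", "PKQ"]:
    same_sum = ascii_sum(stext) == ascii_sum(pattern)
    print(stext, ascii_sum(stext), same_sum, first_mismatch(stext, pattern))
```

All three colliding sub-texts are rejected at index 1, after only two character comparisons.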
The next question is how long this continues. The steps above are repeated L times for text length Tlen and pattern length Plen:
L = Tlen - Plen + 1
For the sample text of Fig.1 (Tlen = 13) and the sample pattern (Plen = 3), L = 11.
P H L P P C S P P K P K
Fig-6: Terminating Condition. The pattern P K Q slides along the text; once fewer than Plen characters of the text remain, there is no chance of the pattern being present in the text.
In the matching policy, the pattern can be entered by the user or selected by different policies to find the best match. Here, the concept of a pattern policy is introduced, which is discussed below.
Pattern Policy (PP)
The pattern for a sequence is selected based on the occurrences (Oa) of each amino acid in that sequence. Heap sort is used here to arrange the Oa values in descending order and to resolve collisions. An example is given next:
How does heap sort work?
Fig.7: Heap sort algorithm
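Since the figure itself is not reproduced here, the following is a generic sketch of heap sort producing descending order, as the pattern policy requires. It is the standard textbook procedure built on a min-heap, not code from the paper, and the name `heap_sort_descending` is illustrative.

```python
def heap_sort_descending(values):
    """Return a new list with the values sorted from largest to smallest."""
    a = list(values)
    n = len(a)

    def sift_down(root: int, end: int) -> None:
        # Restore the min-heap property inside a[:end], starting at root.
        while True:
            child = 2 * root + 1
            if child >= end:
                return
            if child + 1 < end and a[child + 1] < a[child]:
                child += 1  # pick the smaller of the two children
            if a[root] <= a[child]:
                return
            a[root], a[child] = a[child], a[root]
            root = child

    # Phase 1: build a min-heap bottom-up.
    for root in range(n // 2 - 1, -1, -1):
        sift_down(root, n)
    # Phase 2: move the current minimum to the end and shrink the heap,
    # so the largest values end up at the front.
    for end in range(n - 1, 0, -1):
        a[0], a[end] = a[end], a[0]
        sift_down(0, end)
    return a


# Occurrence counts of T, G, V, A, P, S, L in the sequence SSVPTPSPLGPLA.
print(heap_sort_descending([1, 1, 1, 1, 4, 3, 2]))  # [4, 3, 2, 1, 1, 1, 1]
```

Both phases cost O(k log k) for k values, and heap sort is not stable, so the relative order of equal counts is not guaranteed by the sort itself.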
How does heap sort work in the pattern policy?
The selected pattern can vary in size, depending on the requirements. Here, one pattern (P) is considered for each sequence (T). Heap sort is applied to the Oa values obtained for the amino acids, and the pattern (P) is generated as shown in the table below.
T: SSVPTPSPLGPLA
Table IV: Pattern Selection
Sequence (T)    Occurrence (Oa)                              After sorting        Selected pattern (P)
SSVPTPSPLGPLA   T->1, G->1, V->1, A->1, P->4, S->3, L->2     P 4, S 3, L 2, T 1   PSLT
IV. RESULT & EXPLANATION
Patterns (P) of different lengths (l) are applied over the set of pre-aligned data (T). The matching policy performs better when the pattern policy is executed together with it. The results obtained are given below in tabular form.
l = 1, Pattern length 1
Table-V. Pattern length 1
Patterns            Match using MP   Success using MP   Match using MP+PP   Success using MP+PP
P|R|Q|E|A|T|S|V|K   229              1.0601             81                  3.375
Let the number of sequences be n and the number of possible patterns be m. Let the number of matches using MP be x1 and the number using the MP+PP method be x2.
Then the success (S') using the MP method will be:
In MP+PP, however, every sequence has only one pattern. So, for n sequences there can be at most n non-repetitive patterns.
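The expression for S' is missing in this copy. One reading that is consistent with the definitions of n, m, x1 and x2 above and with the values in Table-V is sketched below; it is an inference from the tabulated numbers, not the authors' stated formula.

```latex
S'_{MP} = \frac{x_1}{n \times m}, \qquad S'_{MP+PP} = \frac{x_2}{n}
```

Under this reading, and assuming n = 24 sequences and m = 9 patterns of length 1 (values inferred from the table, not given in the text), 229 / (24 × 9) ≈ 1.0602 and 81 / 24 = 3.375, which agree with the tabulated 1.0601 and 3.375.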
l = 2, Pattern length 2
Table-VI: Pattern length 2
Patterns                          Match using MP   Success using MP   Match using MP+PP   Success using MP+PP
PS|PA|ED|AS|ST|VE|SP| ….. ……….    69               .136               14                  .5833
l = 3, Pattern length 3
Table-VII. Pattern length 3
Patterns                          Match using MP   Success using MP   Match using MP+PP   Success using MP+PP
PSF|PAQ|EED|ASV|STG|VED|SPE| …    6                .0104              2                   .084
l = 4, Pattern length 4
Table-VIII: Pattern length 4
Patterns                     Match using MP   Success using MP   Match using MP+PP   Success using MP+PP
PPSF|PPAQ|AASV|SSEE|RRTF|…   0                .00                0                   .00
The results for the different pattern lengths show that MP and PP working together produce the best outcomes. The success rate is represented below in bar chart form.
To discuss the algorithm, we need to study its complexity, which is important for any pattern matching algorithm. The complexity analysis is given below:
Let the lengths of the sequence and the pattern be $L_S$ and $L_P$ respectively, and let the total number of sequences to be tested be $N$.
The complexity of finding the probable positions in a sequence is $O(L_S)$. Let $P_p$ be the average number of probable positions at which a pattern may appear in a sequence. Then the complexity is:
• Searching, for $N$ sequences: $\le O(N \times L_S)$.
• Comparison, where each compared sub-text has the same length as the pattern: $\le O(P_p \times L_P^2 \times N)$.
• Pattern Policy, for $N$ sequences: $\le O(N \times L_S \times \log L_S)$.
• Here, $0 \le L_P \le L_S$.
This approach is thus simple, with low time complexity, and robust in usability. The success ratio obtained with this algorithm is represented graphically in Fig.8.
Fig.8: Success Ratio of the pattern matching
V. CONCLUSION
These algorithms bring a new concept with a high success ratio and low time complexity. Because the concept is simple, it can work on its own or be combined with other existing approaches to improve performance. If the matching policy is designed to work bi-directionally, these algorithms will be even faster.
REFERENCES
[1] R. S. Boyer, J. S. Moore, “A fast string searching algorithm”, Communications of ACM,
20(10): 762-772,1977.
[2] D. M. Sunday, “A very fast substring search algorithm”. Communications of the ACM,
33(8):132-142, 1990.
[3] Tang Va-ling. KMP algorithm in the calculation of next array. Computer Technology and
Development [J] .2009, 19 (6):98-101.
[4] Cai, Nie, Huang. A Fast Hybrid Pattern Matching Algorithm for Biological Sequences. IEEE, 2009.
[5] Lu, Bao, Feng. Hybrid pattern-matching algorithm based on BM KMP algorithm. IEEE, 2010.
[6] http://bio.classcloud.org/f-motif/
[7] Average running time of the Boyer-Moore-Horspool Algorithm, BAEZA-YATES, R.A.,
RÉGNIER, M., Theoretical Computer Science 92(1) , 1992, pp. 19-31.
[8] A. Yao. The complexity of pattern matching for a random string. SIAM Journal on
Computing, 8(3):368-387, 1979.
[9] B. Watson. A new regular grammar pattern matching algorithm. In Proc 4th Annual
European Symposium, LNCS 1136, pages 364-377, 1996.
[10] E. Ukkonen. Finding approximate patterns in strings. Journal of Algorithms, 6(1-3):132-137,
1985.
[11] G. Navarro and R. Baeza-Yates. A hybrid indexing method for approximate string matching.
Journal of Discrete Algorithms, 1(1):205-239, 2000.
[12] G. Navarro and M. Raffinot. Fast regular expression search. In Proc. 3rd Workshop on
Algorithm Engineering, LNCS 1668, pages 199-213, 1999.
[13] G. Navarro and M. Raffinot. Compact DFA representation for fast regular expression search.
In Proc. 5th Workshop on Algorithm Engineering (WAE'01), LNCS 2141, pages 1-12, 2001.
[14] G. Myers and W. Miller. Approximate matching of regular expressions. Bulletin of
Mathematical Biology, 51:7-37, 1989.
[15] S.Roy, P.Suryanarayan, “The relation …..convolution/relation” IETEJE ,2010,vol-51
[16] G. Myers. A four russians algorithm for regular expression pattern matching. Journal of
the ACM, 39(2):430-448, 1992.
[17] G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic
programming. Journal of the ACM, 46(3):395-415, 1999
[18] “GENERIC APPROACH FOR PREDICTING UNANNOTATED PROTEIN PAIR
FUNCTION USING PROTEIN” - Anjan Kumar Payra…, INTERNATIONAL JOURNAL
OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET), ISSN 0976 – 6367(Print)
ISSN 0976 – 6375(Online), Journal Impact Factor (2013): 6.1302 (Calculated by GISI).
[19] “FUNCTION PREDICTION USING CLUSTER ANALYSIS OF UNANNOTATED
ALIGN SEQUENCES”- Anjan Kumar Payra, INTERNATIONAL JOURNAL OF
CURRENT RESEARCH AND REVIEW (IJCRR), ISSN 2231-2196 (Print) ISSN 0975-
5241(Online), Journal IC Value: 4.18 (2013).