Ijetcas14 624

International Association of Scientific Innovation and Research (IASIR)
(An Association Unifying the Sciences, Engineering, and Applied Research)
International Journal of Emerging Technologies in Computational
and Applied Sciences (IJETCAS)
www.iasir.net
IJETCAS 14-624; © 2014, IJETCAS All Rights Reserved Page 286
ISSN (Print): 2279-0047
ISSN (Online): 2279-0055
A Survey on String Similarity Matching Search Techniques
S.Balan1, Dr. P.Ponmuthuramalingam2
1Ph.D. Research Scholar, 2Associate Professor & Head,
Department of Computer Science, Government Arts College (Autonomous), Coimbatore, Tamilnadu, INDIA.
Abstract: The string similarity matching search problem is mainly concerned with finding text that is present in documents. Although many tools are available in the modern world, people still struggle to find information correctly because of the huge amount of information stored on the World Wide Web. The field of information retrieval was born in the 1950s, and in 1957 H.P. Luhn proposed the basic idea of searching text with a computer. The problem of string matching is to find a pattern in a text in the presence of errors; for example, in online searching the user faces various problems and irrelevant information. The goal of this survey is to present an overview of string similarity matching and a comparison of different algorithms in order to identify those that perform better when searching text. The problem appears in many areas; one of the most demanding is information retrieval, where relevant information must be found in a text collection and string matching is the important tool.
Keywords: Information retrieval, String Matching, Similarity Search, Approximate String Match
I. Introduction
In recent years the string matching problem has received growing attention in the information retrieval and computational biology communities. The information retrieval problem can be addressed from different views. A string is a sequence of characters over a finite alphabet. Similarity search provides a list of data items similar to an input query. In the context of search engines such as Google or Yahoo, search is based on document similarity and query similarity. Document similarity is the overall similarity of an entire document to the given query. Query similarity suggests query strings during searching and is based on machine learning [Thomas Bocek, et al., 2007]. The first Text REtrieval Conference, or TREC [Harman 1993], was held in 1992; it was sponsored by the US government and aimed at encouraging research in information retrieval from large text collections.
Through TREC, many old techniques were modified and many new techniques were identified for retrieval over large text collections. The algorithms developed in information retrieval were the first to be used for searching the World Wide Web, during the years 1996 to 1998. Early on, various models and implementations were available for information retrieval systems. A Boolean system lets the user specify the information need as a combination of AND, OR and NOT operators. Such systems are less effective at producing the relevant information. Several models have been proposed for this process; the three most used are the vector space model, the probabilistic models, and the inference network model [Amit Singhal 2001]. In the vector space model, a text is represented by a vector of terms [Gerard Salton, 1975]. Terms are typically words or phrases. Any text can be represented by a vector in a high-dimensional space. If a term belongs to the text, it gets a non-zero value in the vector. Most systems assign only positive values to terms, and the vectors are used to assign a numeric score to a document for a query. The probabilistic models go back to Maron and Kuhns in 1960; they are based on the general principle that documents in a collection should be ranked by decreasing probability of their relevance to a query [Amit Singhal 2001]. Estimating this probability is the key part of the model. The inference network model treats document retrieval as an inference process in an inference network [Van Rijsbergen 1979]. Most techniques can be implemented under this model. Similarity search is important for time-sensitive applications. The amount of electronic information available on the web is increasing, and similarity search is used to improve data quality or to find all information relevant to a user request. However, a similarity search over a whole dictionary may be too slow for many applications.
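As an illustration of the vector space model described in the paragraph above, the following Python sketch represents texts as term vectors and scores each document against a query by the cosine of the angle between the two vectors. Whitespace tokenization, raw term frequencies as term weights, and the names term_vector, cosine and rank are assumptions made for this example; they are not taken from the cited papers.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a sparse vector of raw term frequencies."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, documents):
    """Return (score, document) pairs sorted by decreasing score."""
    q = term_vector(query)
    scored = [(cosine(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


docs = [
    "string matching in text collections",
    "probabilistic models of relevance",
    "approximate string matching with errors",
]
ranking = rank("approximate string matching", docs)
```

In this small collection the third document shares all three query terms and is ranked first, while the second shares none and receives a score of zero. Practical systems usually replace the raw frequencies with a term weighting scheme.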
Various methods exist for fast similarity search; for example, search performance has been compared on an English dictionary and a randomly generated dictionary for dynamic programming, a keyword tree, neighborhood generation, and n-grams with index lookup [Amit Chandel, 2006]. The extraction of structured and unstructured text is a challenging problem in many applications such as data warehousing, web data integration and bio-informatics.
For example, to identify book authors in HTML pages, text strings are matched against book author names and the accuracy of the string extraction is measured [Amit Chandel, 2006]. This paper is organized into four sections. Section 1 contains the introduction to information retrieval and string similarity search, Section 2 contains the literature survey, Section 3 contains the analysis of string similarity search techniques, and Section 4 gives the conclusion, followed by the references.
II. Literature Survey
S. Balan et al., International Journal of Emerging Technologies in Computational and Applied Sciences, 9(3), June-August, 2014, pp. 286-288
The Aho-Corasick technique constructs a finite state pattern matching machine from the keywords and then uses it to process the text string in a single pass. It improved the speed of a library bibliographic search program by a factor of 5 to 10. The main purpose of this technique is to allow a bibliographer to find in a citation index all titles satisfying some Boolean function of keywords and phrases. The machine m is a program which takes as input the text string s and produces as output the locations in s at which the keywords y appear as substrings. It consists of a set of states, each represented by a number. The behavior of the pattern matching machine is governed by three functions: a goto function go, a failure function fa and an output function out [Alfred V. Aho, et al., 1975].
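The pattern matching machine described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' original program: states are numbered from 0, the three functions are stored in the lists go, fa and out (named after the functions in the text), and the helper names build_machine and search are introduced only for this example.

```python
from collections import deque


def build_machine(keywords):
    """Build the goto (go), failure (fa) and output (out) functions."""
    go = [{}]       # go[state][char] -> next state; state 0 is the start state
    out = [set()]   # out[state] -> keywords recognised on entering the state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in go[state]:
                go.append({})
                out.append(set())
                go[state][ch] = len(go) - 1
            state = go[state][ch]
        out[state].add(word)

    fa = [0] * len(go)
    queue = deque(go[0].values())   # states at depth 1 fail to state 0
    while queue:
        r = queue.popleft()
        for ch, s in go[r].items():
            queue.append(s)
            f = fa[r]
            while f and ch not in go[f]:
                f = fa[f]
            fa[s] = go[f].get(ch, 0)
            out[s] |= out[fa[s]]    # inherit keywords that end at the failure state
    return go, fa, out


def search(text, keywords):
    """Return (end_index, keyword) for every occurrence, in a single pass."""
    go, fa, out = build_machine(keywords)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in go[state]:
            state = fa[state]
        state = go[state].get(ch, 0)
        for word in out[state]:
            hits.append((i, word))
    return hits


hits = search("ushers", ["he", "she", "his", "hers"])
```

For the text "ushers" the machine reports "she" and "he" ending at index 3 and "hers" ending at index 5, which shows that overlapping keywords are found without rescanning the text.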
Edit distance [Levenshtein V.I, 1966] is the minimum number of operations required to transform one string into another, where an operation is a deletion, an insertion or a replacement of a character. Navarro's NR-grep [Navarro.G, 2000] is an exhaustive online similarity search algorithm. NR stands for non-deterministic reverse pattern matching. It uses bit-parallelism together with forward and backward searching. An n-gram index is created by sliding a window of length n over the data and noting the content and position of all such windows. An extension of this approach for large text collections uses cosine similarity [Koudas, et al., 2004]; it is a global measure in which strings are represented by vectors of their n-gram frequencies.
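The two measures in the paragraph above can be made concrete with a short Python sketch. It is an illustration only: edit_distance uses the standard dynamic programming recurrence with unit costs, and ngram_cosine compares n-gram frequency vectors with n = 2 as an assumed default. The function names are hypothetical and none of this code comes from the cited tools.

```python
import math
from collections import Counter


def edit_distance(a, b):
    """Minimum number of deletions, insertions and replacements turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca by cb, or match
        prev = curr
    return prev[-1]


def ngrams(s, n):
    """Slide a window of length n over s, noting (position, content)."""
    return [(i, s[i:i + n]) for i in range(len(s) - n + 1)]


def ngram_cosine(a, b, n=2):
    """Cosine similarity of the n-gram frequency vectors of a and b."""
    freq_a = Counter(gram for _, gram in ngrams(a, n))
    freq_b = Counter(gram for _, gram in ngrams(b, n))
    dot = sum(count * freq_b[gram] for gram, count in freq_a.items())
    norm = (math.sqrt(sum(c * c for c in freq_a.values()))
            * math.sqrt(sum(c * c for c in freq_b.values())))
    return dot / norm if norm else 0.0
```

For example, edit_distance("kitten", "sitting") is 3 (two replacements and one insertion), and ngrams("abcd", 2) yields the windows "ab", "bc" and "cd" at positions 0, 1 and 2.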
Approximate similarity search based on hashing hashes the points from the database so that the probability of collision is much higher for objects that are close to one another than for those that are far apart. It targets data with a large number of dimensions, where it is compared against methods based on hierarchical tree decomposition. The topics covered are locality-sensitive hashing, the analysis of locality-sensitive hashing and nearest neighbor search. Approximate string matching is about finding a pattern in a text where one or both of them have suffered some kind of undesirable corruption. The existing indexing schemes, classified by data structure, are the suffix tree, the suffix array, q-grams and q-samples. The search approaches are classified into two kinds, namely partitioning into exact searching and intermediate partitioning, based on text and patterns [Kaushik Chakrabarti, et al., 2008].
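The hashing idea in the paragraph above can be sketched for binary vectors under Hamming distance, where each hash function reads a random subset of coordinates so that nearby vectors are likely to share a key. This is a simplified illustration, not the scheme of any cited paper: the class name BitSamplingLSH, the parameters bits_per_key and num_tables, and their default values are assumptions chosen for the example.

```python
import random
from collections import defaultdict


def hamming(u, v):
    """Number of coordinates in which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(u, v))


class BitSamplingLSH:
    """Locality-sensitive hashing for binary vectors under Hamming distance."""

    def __init__(self, dim, bits_per_key=8, num_tables=4, seed=0):
        rng = random.Random(seed)
        # Each table keys a vector by its values at a random subset of coordinates.
        self.projections = [rng.sample(range(dim), bits_per_key)
                            for _ in range(num_tables)]
        self.tables = [defaultdict(list) for _ in range(num_tables)]

    def _keys(self, vector):
        return [tuple(vector[i] for i in coords) for coords in self.projections]

    def add(self, label, vector):
        for table, key in zip(self.tables, self._keys(vector)):
            table[key].append((label, vector))

    def query(self, vector):
        """Return the label of the closest colliding point, or None."""
        candidates = {}
        for table, key in zip(self.tables, self._keys(vector)):
            for label, stored in table.get(key, []):
                candidates[label] = stored
        if not candidates:
            return None
        return min(candidates, key=lambda lab: hamming(candidates[lab], vector))


index = BitSamplingLSH(dim=16)
index.add("zeros", [0] * 16)
index.add("ones", [1] * 16)
nearest = index.query([0] * 16)
```

Only the points that collide with the query in at least one table are compared exactly, which is where the saving over a full scan comes from. The result is approximate: a true near neighbor can be missed if it collides in no table, which is why several tables are used.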
Other surveys on string similarity matching cover Hamming distance, reversals, block distance, q-gram distance, allowing swaps, approximate searching in multidimensional texts and in graphs, multi-pattern approximate matching, non-standard algorithms such as approximate or parallel algorithms, and indexed searching. There are various types of string matching, namely multiple string matching, extended string matching, regular expression matching and approximate matching. Approximate matching comprises various algorithms for finding the similarity of a given string, such as dynamic programming algorithms, computing edit distance, text searching, improving the average case, other algorithms based on dynamic programming, algorithms based on automata, bit-parallel algorithms, parallelizing the NFA, parallelizing the DP matrix, algorithms for fast filtering of the text, partitioning into k + 1 pieces, approximate BNDM, other filtration algorithms, multi-pattern approximate searching, a hashing-based algorithm for one error, and searching for extended strings and regular expressions.
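Among the approaches listed above, the dynamic programming method for text searching is the simplest to illustrate. The sketch below adapts the edit distance recurrence so that a match may start at any text position and reports every end position where the pattern occurs with at most k errors. The function name approximate_search and the choice to report end positions are assumptions made for this example.

```python
def approximate_search(pattern, text, k):
    """Return end positions in text where pattern matches with at most k errors."""
    m = len(pattern)
    # column[i] holds the least edit distance between pattern[:i] and any
    # substring of the text that ends at the current position.
    column = list(range(m + 1))
    matches = []
    for pos, ch in enumerate(text):
        prev_diag, column[0] = column[0], 0   # a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new = min(prev_diag + cost,       # replace or match
                      column[i] + 1,          # extra text character
                      column[i - 1] + 1)      # missing pattern character
            prev_diag, column[i] = column[i], new
        if column[m] <= k:
            matches.append(pos)
    return matches
```

The loop takes time proportional to the pattern length times the text length and keeps only one column of the matrix. The filtering and bit-parallel algorithms named above aim to improve on this baseline.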
III. Analysis of String Similarity Matching Techniques
| Sno | Author Name | Title | Methods | Advantages | Disadvantages |
|---|---|---|---|---|---|
| 1 | Alfred V. Aho and Margaret J. Corasick | Efficient String Matching: An Aid to Bibliographic Search | Pattern matching algorithm; Construction of goto, output and failure functions; Time complexity of algorithms | Locates keywords in a text string; Directed graph begins at state 0; Time complexity is large | Substrings may overlap with one another; Partially computed output function; Failure function stored in a one-dimensional array |
| 2 | Arvind Arasu, Venkatesh Ganti, et al. | Efficient Exact Set-Similarity Joins | Threshold-based SSJoin; Hamming SSJoin; Jaccard SSJoin | Threshold parameter is high; Vector representation between two sets; Similarity value is 0 or 1 | Different similarity sets; Dimensions differ; Common elements |
| 3 | Thomas Bocek, Burkhard Stiller, et al. | Fast Similarity Search in Large Dictionaries | Edit distance; NR-grep; N-grams and cosine similarity | Minimum operations required to transform one string into another; Reverse pattern matching; Offline approach | Dictionary size is low; Avoids number of searching words in NR-grep method; Similarity is shared |
| 4 | Kaushik Chakrabarti, Dong Xin, et al. | An Efficient Filter for Approximate Membership Checking | Pruning condition; Filtering by ISH; Weighted signatures | Three similarity measures are identified; Substring search is quick; Weighted signatures are in decreasing order | Lower bound value is not identified; String similarity is less; Different number of signatures |
| 5 | Amit Chandel, P.C. Nagesh, et al. | Efficient Batch Top-k for Dictionary-based Entity Recognition | Batch Top-K; Simple Top-K; Segmented algorithm | Finding the top-k scores; Decreasing IDF values; A token of the sub-query is strong or weak | Increasing run time for threshold values; Upper bound scoreless is removed; Existing tight features are not unique |
| 6 | Aristides Gionis, Piotr Indyk, et al. | Similarity Search in High Dimensions via Hashing | Locality-sensitive hashing; Color histograms; Texture features | Better run time; Dependence on data size; To measure the performance | Value is small and there is resort needed; One index is not sufficient; Comparison with SR-tree is low |
| 7 | Daniel Karch, Dennis Luxen, et al. | Improved Fast Similarity Search in Dictionaries | Preprocessing space; Preprocessing time; Query performance | String split parameter based on query time; Ten times faster; Maximum distance calculated | Speed is low; Does not store any information; Query time and search space size are average |
| 8 | Amit Singhal | Modern Information Retrieval: A Brief Overview | Vector space model; Probabilistic model; Inference network model | Calculated using term weighting; Relevance feedback based on user queries; Retrieval effectiveness | Boolean systems are less effective; Poor stemming; Style of phrase generation is not critical |
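The threshold-based and Jaccard set-similarity joins listed for entry 2 above can be illustrated with a naive Python sketch that compares every pair of records. It only shows what such a join computes; it is not the exact set-similarity join algorithm of the cited paper, which avoids the all-pairs comparison. The names jaccard and similarity_join, the whitespace tokenization and the sample records are assumptions made for this example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def similarity_join(records, threshold):
    """Naive join: all pairs of record ids whose token sets meet the threshold."""
    tokens = {rid: set(text.lower().split()) for rid, text in records.items()}
    return [(r, s) for r, s in combinations(sorted(tokens), 2)
            if jaccard(tokens[r], tokens[s]) >= threshold]


records = {
    1: "efficient string matching",
    2: "efficient string search",
    3: "color histograms",
}
pairs = similarity_join(records, threshold=0.5)
```

Records 1 and 2 share two of four distinct tokens, giving a Jaccard similarity of 0.5, so they form the only pair returned at this threshold.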
IV. Conclusion
In this paper, the survey focuses on various algorithms for string similarity matching based on search techniques. Some of the algorithms address set similarity, where the similarity property takes the value 0 or 1. This indicates that the earlier algorithms report more matches in many cases. The performance of the algorithms is analyzed and presented in tabular form. Additionally, the survey covers information retrieval and search engines on the World Wide Web. To improve the quality of word similarity search, the next step is to refine exact similarity on the basis of the semantic relationships of a word. This would further reduce the time required for a large database.
V. References
[1]. Alfred V. Aho and Margaret J. Corasick (Bell Laboratories), Efficient String Matching: An Aid to Bibliographic Search, Communications of the ACM, Vol. 18, No. 6, June 1975.
[2]. Amit Chandel, P.C. Nagesh, Sunita Sarawagi, Efficient Batch Top-k for Dictionary-based Entity Recognition, Proc. 22nd International Conference on Data Engineering, pp. 28, 2006.
[3]. Amit Singhal, Modern Information Retrieval: A Brief Overview, IEEE Computer Society Technical Committee on Data Engineering, pp. 1-9, 2001.
[4]. Aristides Gionis, Piotr Indyk, Rajeev Motwani, Similarity Search in High Dimensions via Hashing, Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, pp. 518, 1999.
[5]. Arvind Arasu, Venkatesh Ganti, Raghav Kaushik, Efficient Exact Set-Similarity Joins, VLDB '06, September 12-15, 2006, Seoul, Korea, VLDB Endowment, ACM 1-59593-385-9/06/09.
[6]. Daniel Karch, Dennis Luxen, Peter Sanders, Improved Fast Similarity Search in Dictionaries, presented at the 17th Symposium on String Processing and Information Retrieval, 2010.
[7]. Gerard Salton, A. Wong, and C. S. Yang, A Vector Space Model for Automatic Indexing, Communications of the ACM, 18(11):613–620, November 1975.
[8]. Harman D.K, Overview of the First Text REtrieval Conference (TREC-1), in Proceedings of the First Text REtrieval Conference (TREC-1), pp. 1–20, NIST Special Publication 500-207, March 1993.
[9]. Kaushik Chakrabarti, Dong Xin, et al., An Efficient Filter for Approximate Membership Checking, SIGMOD '08, June 9–12, 2008, Vancouver, BC, Canada, 2008 ACM 9781605581026/08/06.
[10]. Koudas N, A. Marathe, D. Srivastava, Flexible String Matching Against Large Databases in Practice, in VLDB, pp. 1078–1086, 2004.
[11]. Levenshtein V.I, Binary codes capable of correcting deletions, insertions and reversals, Sov. Phys. Dokl., 10:707–710, 1966.
[12]. Navarro G, NR-grep: A Fast and Flexible Pattern Matching Tool, Technical Report TR/DCC-2000-3, University of Chile, Departamento de Ciencias de la Computacion, Santiago, 2000, http://www.dcc.uchile.cl/gnavarro.
[13]. Thomas Bocek, Burkhard Stiller, et al., Fast Similarity Search in Large Dictionaries, University of Zurich, Department of Informatics (IFI), Binzmühlestrasse 14, CH-8050 Zürich, Switzerland, 2007.
[14]. Van Rijsbergen C.J, Information Retrieval, Butterworths, London, 1979.
