VCHUNK JOIN
An efficient algorithm for Edit similarity join
Vignesh Babu, M.R.J.
Computer Science and Engineering
K.L.N. College of Information Technology,
Sivagangai, India
vigneshbabumrj@gmail.com
Thanush Kodi, N. and Vijay Koushik, S.
Computer Science and Engineering
K.L.N. College of Information Technology,
Sivagangai, India
nthanushkodi@gmail.com
svijaykoushik@gmail.com
Abstract— Similarity join is an important technique in many
applications such as data integration, record linkage and
pattern recognition. Here we introduce a new algorithm for
similarity joins with edit distance constraints. Current
approaches extract overlapping grams from strings and consider
only strings that share a certain number of grams as candidates.
We propose extracting non-overlapping substrings, or chunks,
from strings. The chunking scheme is based on a tail-restricted
chunk boundary dictionary (CBD). This approach integrates
existing methods for calculating similarity with several new
filters unique to the chunk-based method. A greedy algorithm
automatically selects a good chunking scheme from a given
dataset. We then show that our method occupies less space and
computes the result faster.
Keywords—chunk boundary dictionary; edit similarity joins;
approximate string matching; prefix filtering; q-gram; edit distance
I. INTRODUCTION
This project designs an efficient algorithm that reduces the
space requirements of edit similarity joins, using a rank
method to reduce the overlap of filtered data from the
dataset during the data integration process.
A similarity join finds pairs of objects from two datasets
such that each pair's similarity value is no less than a given
threshold. Early research in this area focused on similarities
defined in the Euclidean space; current research has spread to
various similarity (or distance) functions. We consider
similarity joins with an edit distance constraint of no more
than a constant. The major technical difficulty lies in the fact
that the edit distance metric is complex and costly to compute.
In order to scale to large datasets, all existing algorithms
adopt the filter-and-verify paradigm: first eliminate
non-promising string pairs with relatively inexpensive
computation, then perform verification by the edit distance
computation.
Gram-based methods extract all substrings of a certain length
(q-grams), or substrings according to a dictionary (VGRAMs),
for each string. Subsequent grams therefore overlap and are
highly redundant. This results in a large index size and incurs
high query processing costs when the index cannot be entirely
accommodated in main memory. The total size of the q-grams
can be six times that of the text strings.
Our observation is that the main redundancy in the q-gram
based approach lies in the overlapping grams. We divide each
string into several disjoint substrings, or chunks, and only
need to process and index these chunks, which improves the
space usage. The chunk-based approach relies on a robust
chunking scheme, with a salient property for each string,
based on a tail-restricted Chunk Boundary Dictionary (CBD);
the edit distance constraint is no more than a constant
threshold. By applying prefix filtering and location-based
mismatch filtering methods, the number of signatures generated
by our algorithm is guaranteed to be fewer than or equal to
that of the q-gram-based method for q >= 2. Another feature of
the chunk-based method is that it uses existing filters
together with several new filters unique to the chunk-based
method, including rank, chunk number and virtual CBD
filtering. We also consider the problem of selecting a good
chunking scheme for a given dataset.
II. PROBLEM DEFINITION
A. Existing System
Previously, similarity joins with edit distance constraints
extracted overlapping grams from strings. A q-gram is a
contiguous substring of length q of the given string; we then
find the positions of the strings to be matched. If two grams
have the same content and their positions are within the edit
distance, an operation calculates similarity through length,
count, position, and prefix filtering of the data from the
dataset. The q-gram method's index values are very large, the
fixed-length chunk-based method's index values are also huge,
and the chunks of the variable-gram (VGRAM) dictionary-based
method are hard to predict, which complicates implementing
this approach.
B. Proposed System
Our proposed edit similarity join is based on extracting
non-overlapping substrings, or chunks, from strings. A class
of chunk-based methods built on the notion of a tail-restricted
chunk boundary dictionary uses several existing filters and
new powerful filtering techniques to reduce the size of the
index. Our proposed algorithm performs length, position,
prefix, content-based and location-based filtering, which
helps reduce the amount of space required by calculating
similarity only for the data that satisfy these filters. We
call the resulting algorithm VChunkJoin-NoLC. Here we
introduce rank, chunk number and virtual CBD filters to
improve efficiency. Only pairs whose edit distance is no more
than the threshold value have their similarity value
calculated.
III. INITIAL OPERATIONS
To perform a similarity join on data strings, the datasets
must first be gathered and parsed. The parsed data is then
added to the database.
A. Dataset Gathering
The first step is to gather datasets for the similarity
calculation. It was decided to use XML datasets for this
purpose. XML data sets are generic in nature and can be used
by software components of different platforms.
A dataset (or data set) is a collection of data.
Most commonly a dataset corresponds to the contents of a
single database table, or a single statistical data matrix, where
every column of the table represents a particular variable, and
each row corresponds to a given member of the dataset in
question. The dataset lists values for each of the variables, such
as height and weight of an object, for each member of the
dataset. Each value is known as a datum. The dataset may
comprise data for one or more members, corresponding to the
number of rows.
The term dataset may also be used more loosely, to refer to
the data in a collection of closely related tables, corresponding
to a particular experiment or event.
Here the term dataset refers to a collection of data in an XML
file. It was decided to use the IMDB, DBLP, TREC, UNIREF, and
ENRON datasets for this algorithm, as they usually contain a
large number of similar values.
B. Xml Parsing
XML pull parsing treats the document as a series of items
which are read in sequence using the Iterator design pattern.
This allows for writing recursive-descent
parsers in which the structure of the code performing the
parsing mirrors the structure of the XML being parsed, and
intermediate parsed results can be used and accessed as
local variables within the methods performing the parsing,
or passed down (as method parameters) into lower-level
methods, or returned (as method return values) to higher-
level methods.
Here we use XML pull parsing to extract the data from the
chosen datasets. These extracted data are then
inserted into the database using SQL queries. At least two
datasets are needed for similarity calculation.
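As a concrete sketch, Python's standard-library `xml.etree.ElementTree.iterparse` implements exactly this pull style: records are visited one at a time and can be released immediately, so a large dataset streams in bounded memory. The `movie`/`title` element names below are made-up placeholders, not the schema of any of the datasets above.

```python
import io
import xml.etree.ElementTree as ET

def extract_records(xml_source, record_tag):
    """Pull-parse an XML stream, yielding one {tag: text} dict per record."""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == record_tag:
            yield {child.tag: (child.text or "") for child in elem}
            elem.clear()  # release the subtree so large files stream in bounded memory

# Usage on a tiny in-memory sample; a real run would pass an open dataset file.
sample = io.StringIO(
    "<movies><movie><title>Heat</title></movie>"
    "<movie><title>Hear</title></movie></movies>"
)
records = list(extract_records(sample, "movie"))
```

Each yielded dict can then be inserted into the database with an ordinary SQL INSERT.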
IV. DATA FILTERING
The next step is to filter out the data that are likely to be
similar. Two data items can be judged similar by different
methods, the most used being prefix filtering. On analysis,
however, the prefix filtering method produces overlapping data
in its filtered result, and this overlapped data occupies a
large space in memory. To overcome this problem, it was decided
to use a rank-based filtering method to reduce the overlapping
of data.
A. Prefix Filtering
The prefix filtering method filters documents that have fields
containing terms with a specified prefix. It is similar to a
phrase query, except that it acts as a filter, and it can be
placed within queries that accept a filter.
This approach is applied to the two datasets taken
before. The first dataset is considered as the prefix and the
second dataset as the suffix. For the filtering, the length of
the prefix string is compared with that of the suffix string.
If the prefix length matches the suffix length, the pair is
filtered out and placed in a separate set.
Consider two datasets P and Q on which prefix
filtering is to be applied. The filtered elements are placed
in the set F such that
F = {x | length(P(m)) = length(Q(n))}
Where,
x – the filtered element.
length(P(m)) – the length of the string m of the set P.
length(Q(n)) – the length of the string n of the set Q.
The prefix filtering method was fast, but it yielded a
large number of overlapping data which occupied too much
space in memory.
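Taken literally, the set definition above pairs strings of equal length. A minimal sketch of that filter (the function and variable names are our own, not from the paper):

```python
from collections import defaultdict

def length_filter(P, Q):
    """Pair strings from P and Q with exactly matching lengths,
    i.e. F = {x | length(P(m)) = length(Q(n))} as defined above."""
    by_len = defaultdict(list)
    for q in Q:
        by_len[len(q)].append(q)          # index the second dataset by length
    return [(p, q) for p in P for q in by_len[len(p)]]

pairs = length_filter(["cat", "horse"], ["dog", "mouse", "ox"])
# "cat" pairs with "dog" (length 3), "horse" with "mouse" (length 5)
```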
B. Rank Based Filtering
The existing prefix filtering method occupied a large
amount of memory, so we needed to use a new approach to
reduce space occupancy by the filtered data. Thus we
introduced a rank based method to reduce the overlapping data.
In the rank-based approach we consider only the overlapped
data; the other filtered elements are kept as they are. Here we
apply a rank to the overlapped data in both datasets. Pairs of
elements whose rank difference is no less than the chosen
threshold are taken for further processing, and the rest are
ignored. Rank calculation involves certain combination
operations to obtain the rank of a string.
F = {x | |rank(P(m)) − rank(Q(n))| ≥ τ}
Where,
τ – the threshold value.
rank(P(m)) – the rank of the string m of the dataset P.
rank(Q(n)) – the rank of the string n of the dataset Q.
By this approach most of the overlapping data are ignored
for the next process.
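The rank condition above can be sketched directly; the ranks here are hypothetical values standing in for the combination-based rank calculation, which the text does not specify:

```python
def rank_filter(ranked_P, ranked_Q, tau):
    """Keep only overlapped pairs whose rank difference reaches the threshold,
    i.e. F = {x | |rank(P(m)) - rank(Q(n))| >= tau} as defined above."""
    return [
        (p, q)
        for p, rank_p in ranked_P.items()
        for q, rank_q in ranked_Q.items()
        if abs(rank_p - rank_q) >= tau
    ]

# Hypothetical ranks for a few overlapped strings:
kept = rank_filter({"abc": 1, "abd": 4}, {"abe": 2}, tau=2)
# only ("abd", "abe") survives: |4 - 2| >= 2, while |1 - 2| < 2
```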
V. SIMILARITY CALCULATION
The final step is to calculate the similarity between
the filtered strings. The similarity calculation involves the
following steps.
1. Edit distance calculation.
2. Chunk the strings.
3. Similarity calculation.
A. Edit Distance Calculation
In this module we calculate the edit distance value
for the matched text strings from the two datasets. The edit
distance calculation is performed by dynamic programming to
obtain the minimum number of insert, delete, substitute and
transform operations required to turn one string into the
other. If two strings have the same length and the same
characters at every position, the edit distance value is zero;
otherwise the edit distance is the number of operations needed
to transform one string into the other.
D(i, j) = min of:
D(i−1, j) + 1 (deletion)
D(i, j−1) + 1 (insertion)
D(i−1, j−1) + (0 if si = tj, else 1) (substitution)
D(i−2, j−2) + 1, if si = t(j−1) and s(i−1) = tj (transformation)
Where,
D(i, j) – the value of the leading diagonal
element in the matrix.
si – the ith element of the string s.
tj – the jth element of the string t.
The various operations are:
• Insertion
• Deletion
• Substitution
• Transformation
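A minimal dynamic-programming sketch of the recurrence above, interpreting the transformation operation as a Damerau-style transposition of adjacent characters:

```python
def edit_distance(s, t):
    """Dynamic-programming edit distance with insert, delete, substitute,
    and transformation (transposition of adjacent characters)."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        D[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,         # deletion
                D[i][j - 1] + 1,         # insertion
                D[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                D[i][j] = min(D[i][j], D[i - 2][j - 2] + 1)  # transformation
    return D[m][n]
```

Identical strings give distance zero, as stated above; for example, `edit_distance("kitten", "sitting")` needs three operations.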
B. String Chunking
Chunks are the basic prerequisite for computing the similarity
value between the two selected similar data items from the
datasets. Chunks are disjoint substrings of the given string.
For the TREC dataset, a book-oriented dataset used in this
paper, CBD selection is done with the character set
{a, e, g, l, s, v}. We split the string into chunks according
to this set and take the number of chunks in the string as an
input condition for the similarity value calculation; it
should be less than the constant threshold value.
Example:
Consider the string 'Adolescence'. According to the
chunking rule above, the string is chunked as follows
(underlined in the original).
A dol e s cence
Thus the string 'Adolescence' has been chunked into 5
parts.
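A simplified sketch of such a chunker: it cuts after every occurrence of a CBD character, so every chunk except possibly the last ends with one. This is only an approximation; the paper's full tail-restricted CBD selection can use multi-character boundary strings, so the exact boundaries (and the chunk count) may differ from the hand-worked example above.

```python
CBD_CHARS = set("aeglsv")  # the character set used for CBD selection on TREC

def vchunk(s):
    """Cut s into disjoint chunks after every CBD character, so each chunk
    except possibly the last ends with a CBD character (tail-restricted)."""
    chunks, start = [], 0
    for i, ch in enumerate(s.lower()):
        if ch in CBD_CHARS:
            chunks.append(s[start:i + 1])
            start = i + 1
    if start < len(s):
        chunks.append(s[start:])  # trailing chunk without a boundary character
    return chunks

chunks = vchunk("Adolescence")
# This single-character rule yields ["A", "dol", "e", "s", "ce", "nce"];
# the paper's full CBD selection instead keeps "cence" whole, giving 5 chunks.
```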
C. Similarity Calculation
Similarity calculation produces the final result of this
algorithm. We perform the similarity value calculation
between the two datasets, and it depends on the edit distance
value. The calculation is performed only for data whose edit
distance is less than the threshold and whose rank difference
between the two sets is less than the threshold value.
The similarity value is calculated by subtracting the edit
distance value from the longer length and then dividing by
that same length:
Similarity(s, t) = (max(|s|, |t|) − ED(s, t)) / max(|s|, |t|)
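The rule above (subtract the edit distance from the longer length, divide by that length) is a one-liner; this sketch takes a precomputed edit distance `ed` as an argument:

```python
def edit_similarity(s, t, ed):
    """Similarity(s, t) = (max(|s|, |t|) - ED(s, t)) / max(|s|, |t|):
    subtract the (precomputed) edit distance ed from the longer length,
    then divide by that same length."""
    longer = max(len(s), len(t))
    return (longer - ed) / longer

sim = edit_similarity("kitten", "sitting", 3)  # (7 - 3) / 7 = 0.571...
```

Identical strings (edit distance 0) score 1.0, and the score falls toward 0 as more operations are needed.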
VI. EXPERIMENTS
We use the following five categories of algorithms in
the experiment, all of which are implemented as in-memory
algorithms, with all their inputs loaded into memory before
running:
• Fixed-length gram based: Ed-Join [8] and Winnowing [16].
• VGRAM based: VGram-Join [11], [12].
• Tree based: Bed-tree [21] and Trie-Join [22].
• Enumeration based: PartEnum [7] and NGPP [23].
• Vchunk based: VChunkJoin and VChunkJoin-NoLC.
VChunkJoin is our proposed algorithm equipped with all the
filterings and optimizations. When comparing with the
VGRAM-based method, we remove location-based and
content-based filterings, and name the resulting algorithm
VChunkJoin-NoLC.
VII. CONCLUSION
Here we have introduced a novel approach to processing edit
similarity joins efficiently, based on partitioning strings
into non-overlapping chunks. We have designed an efficient
edit similarity join algorithm that incorporates all existing
filters, occupies less space than the existing filtering
methods, and is more powerful than existing algorithms at
calculating the similarity value between two sets of strings.
The existing q-gram and VGRAM approaches occupy a larger
amount of index space and retain unnecessary candidate
strings; our proposed algorithm VChunkJoin overcomes this when
performing the similarity value calculation. The CBD selection
algorithm helps reduce the index size of the strings, and
dynamic programming is used to compute the edit distance value
between the two datasets.
REFERENCES
[1] A. Arasu, V. Ganti, and R. Kaushik, "Efficient exact set-similarity
joins," in VLDB, 2006.
[2] A. Behm, S. Ji, C. Li, and J. Lu, "Space-constrained gram-based
indexing for efficient approximate string search," in ICDE, 2009.
[3] A. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, "Duplicate
record detection: A survey," TKDE, vol. 19, no. 1, pp. 1–16, 2007.
[4] C. Li, B. Wang, and X. Yang, "VGRAM: Improving performance of
approximate queries on string collections using variable-length grams,"
in VLDB, 2007.
[5] U. Manber, "Finding similar files in a large file system," in USENIX
Winter, 1994, pp. 1–10.
[6] K. Ramasamy, J. M. Patel, J. F. Naughton, and R. Kaushik, "Set
containment joins: The good, the bad and the ugly," in VLDB, 2000, pp.
351–362.
[7] S. Sarawagi and A. Kirpal, "Efficient set joins on similarity predicates,"
in SIGMOD, 2004.
[8] J. Wang, J. Feng, and G. Li, "Trie-Join: Efficient trie-based string
similarity joins with edit-distance constraints," in VLDB, 2010.
[9] C. Xiao, W. Wang, X. Lin, and J. X. Yu, "Efficient similarity joins
for near duplicate detection," in WWW, 2008.
[10] C. Xiao, W. Wang, and X. Lin, "Ed-Join: An efficient algorithm for
similarity joins with edit distance constraints," PVLDB, vol. 1, no. 1, pp.
933–944, 2008.
[11] X. Yang, B. Wang, and C. Li, "Cost-based variable-length-gram
selection for string collections to support approximate queries
efficiently," in SIGMOD, 2008, pp. 353–364.
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721
 

KĂĽrzlich hochgeladen (20)

Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
Steel Structures - Building technology.pptx
Steel Structures - Building technology.pptxSteel Structures - Building technology.pptx
Steel Structures - Building technology.pptx
 
The SRE Report 2024 - Great Findings for the teams
The SRE Report 2024 - Great Findings for the teamsThe SRE Report 2024 - Great Findings for the teams
The SRE Report 2024 - Great Findings for the teams
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
 
Internet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxInternet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptx
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating System
 
Industrial Safety Unit-I SAFETY TERMINOLOGIES
Industrial Safety Unit-I SAFETY TERMINOLOGIESIndustrial Safety Unit-I SAFETY TERMINOLOGIES
Industrial Safety Unit-I SAFETY TERMINOLOGIES
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm System
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.ppt
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
Vishratwadi & Ghorpadi Bridge Tender documents
Vishratwadi & Ghorpadi Bridge Tender documentsVishratwadi & Ghorpadi Bridge Tender documents
Vishratwadi & Ghorpadi Bridge Tender documents
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 

Vchunk join an efficient algorithm for edit similarity joins

The major technical difficulty lies in the fact that the edit distance metric is complex and costly to compute. To scale to large datasets, all existing algorithms adopt the filter-and-verify paradigm: non-promising string pairs are first eliminated with relatively inexpensive computation, and the surviving candidates are then verified by computing the edit distance. Gram-based methods extract, for each string, all substrings of a certain length (q-grams), or substrings chosen according to a dictionary (VGRAMs). Consecutive grams therefore overlap and are highly redundant. This results in a large index and incurs high query processing costs when the index cannot be entirely accommodated in main memory; the total size of the q-grams can be six times that of the text strings themselves.

Our observation is that the main redundancy of the q-gram-based approach lies in its overlapping grams. We instead divide each string into several disjoint substrings, or chunks, and only need to process and index these chunks, which improves space usage. We use a robust chunking scheme based on a tail-restricted Chunk Boundary Dictionary (CBD), which has salient properties for strings whose edit distance is no more than the constant threshold. By applying prefix filtering and location-based mismatch filtering, the number of signatures generated by our algorithm is guaranteed to be no more than that of the q-gram-based method for any q >= 2. Another feature of the chunk-based method is that it can use the existing filters as well as several new filters unique to chunk-based methods, including rank, chunk-number and virtual-CBD filtering. We also consider the problem of selecting a good chunking scheme for a given dataset.

II. PROBLEM DEFINITION

A. Existing System

Previous edit similarity joins extract overlapping grams from strings. A q-gram is a contiguous substring of length q of the given string; we then find the positions of the strings to be matched.
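To make the contrast concrete, here is a minimal Python sketch of the two signature schemes: overlapping q-grams as in the existing approach, and non-overlapping chunks. The single-character boundary set and the greedy split rule are illustrative assumptions of ours, not the paper's full tail-restricted CBD construction.

```python
def qgrams(s, q=2):
    """All overlapping q-grams of s (the existing, redundant scheme)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def vchunks(s, boundary=frozenset("aeglsv")):
    """Split s into disjoint chunks, ending a chunk after each boundary
    character (simplified single-character CBD; an assumption)."""
    out, start = [], 0
    for i, ch in enumerate(s):
        if ch in boundary:
            out.append(s[start:i + 1])
            start = i + 1
    if start < len(s):  # the last chunk need not end on a boundary
        out.append(s[start:])
    return out

grams = qgrams("vancouver", 2)   # 8 overlapping grams, 16 chars indexed
chunks = vchunks("vancouver")    # chunks partition the string exactly
```

Note that the grams cover 16 characters for a 9-character string, while the chunks reproduce the string with no overlap, which is the space saving the chunk-based index exploits.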
If two grams have the same content and their positions are within the edit distance threshold, the pair is carried forward to a similarity computation that uses length, count, position and prefix filtering on the data from the dataset. The q-gram method's index is very large, the fixed-length chunk-based method's index is also huge, and the chunks of the variable-length gram (VGRAM) dictionary-based method are not easily predictable, which complicates implementation.

B. Proposed System

Our proposed edit similarity join is based on extracting non-overlapping substrings, or chunks, from strings. A class of chunking schemes built on the notion of a tail-restricted chunk boundary dictionary combines several existing filters with new, powerful filtering techniques to reduce the index size. Our proposed algorithm performs length, position, prefix, content-based and location-based filtering, which helps reduce the space required, so that similarity is calculated only for data that satisfy the filters; we call the resulting algorithm VChunkJoin-NoLC. Here we introduce rank,
chunk-number and virtual-CBD filtering to improve efficiency. A similarity value is computed only for pairs whose edit distance is less than the threshold value.

III. INITIAL OPERATIONS

To perform the similarity join on data strings, datasets must first be gathered and parsed; the parsed data are then added to the database.

A. Dataset Gathering

The first step is to gather datasets for the similarity calculation. We decided to use XML datasets for this purpose: XML data are generic in nature and can be used by software components on different platforms. A dataset (or data set) is a collection of data. Most commonly a dataset corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable and each row corresponds to a given member of the dataset. The dataset lists values for each of the variables, such as the height and weight of an object, for each member; each value is known as a datum. A dataset may comprise data for one or more members, corresponding to the number of rows. The term is also used more loosely to refer to the data in a collection of closely related tables corresponding to a particular experiment or event. Here the term dataset refers to a collection of data in an XML file. We decided to use the IMDB, DBLP, TREC, UNIREF and ENRON datasets for this algorithm, which usually contain large numbers of similar values.

B. XML Parsing

XML pull parsing treats the document as a series of items which are read in sequence using the Iterator design pattern.
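As a sketch of this style of parsing, the following Python snippet streams records from an XML document with `xml.etree.ElementTree.iterparse`. The `<rec>`/`<title>` layout is a made-up example, not the schema of any of the datasets above.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical record layout; the real datasets use their own schemas.
doc = io.BytesIO(
    b"<records>"
    b"<rec><title>VChunk</title></rec>"
    b"<rec><title>Ed-Join</title></rec>"
    b"</records>"
)

titles = []
# iterparse yields (event, element) pairs one at a time (pull parsing),
# so the whole document never has to be materialised in memory at once.
for event, elem in ET.iterparse(doc, events=("end",)):
    if elem.tag == "rec":
        titles.append(elem.findtext("title"))
        elem.clear()  # discard the processed subtree to keep memory flat
```

Each extracted record would then be inserted into the database in the same loop.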
This allows recursive-descent parsers to be written in which the structure of the parsing code mirrors the structure of the XML being parsed, and intermediate parsed results can be used and accessed as local variables within the methods performing the parsing, passed down (as method parameters) into lower-level methods, or returned (as method return values) to higher-level methods. Here we use XML pull parsing to extract the data from the chosen datasets. The extracted data are then inserted into the database using SQL queries. At least two datasets are needed for the similarity calculation.

IV. DATA FILTERING

The next step is to filter out the data that are likely to be similar. Candidate pairs can be identified by different methods, the most widely used being prefix filtering. On analysis, however, prefix filtering produces overlapping data in its filtered result, and this overlapped data occupies a large amount of memory. To overcome this problem we decided to use a rank-based filtering method to reduce the overlap.

A. Prefix Filtering

Prefix filtering selects documents whose fields contain terms with a specified prefix. It is similar to a phrase query, except that it acts as a filter, and it can be placed within queries that accept a filter. This approach is applied to the two datasets chosen earlier: the first dataset is treated as the prefix and the second as the suffix. For filtering, the length of the prefix string is compared with that of the suffix string; if the lengths match, the pair is filtered out and placed in a separate set. Consider two datasets P and Q on which prefix filtering is to be applied. The filtered elements are placed in the set F such that

F = { x | length(P(m)) = length(Q(n)) }

where

x – the filtered element;
length(P(m)) – the length of string m of set P;
length(Q(n)) – the length of string n of set Q.

The prefix filtering method was fast, but it yielded a large number of overlapping data which occupied too much memory.

B. Rank-Based Filtering

Because the existing prefix filtering method occupied a large amount of memory, a new approach was needed to reduce the space occupied by the filtered data. We therefore introduced a rank-based method to reduce the overlapping data. In the rank-based approach we consider only the overlapped data; the other filtered elements are kept as they are. We assign a rank to the overlapped data in both datasets. Pairs whose rank difference is no less than the chosen threshold are taken for further processing; the rest are ignored. Rank calculation involves certain combination operations to obtain the rank of a string.

F = { x | x satisfies the rank condition }
where the rank condition is

|rank(P(m)) − rank(Q(n))| ≥ τ

τ – the threshold value;
rank(P(m)) – the rank of string m of dataset P;
rank(Q(n)) – the rank of string n of dataset Q.

By this approach most of the overlapping data are excluded from the next process.

V. SIMILARITY CALCULATION

The final step is to calculate the similarity between the filtered strings. The similarity calculation involves the following steps:
1. Edit distance calculation.
2. Chunking the strings.
3. Similarity calculation.

A. Edit Distance Calculation

In this module we calculate the edit distance between the matched text strings from the two datasets. The edit distance is computed by dynamic programming to find the minimum number of insert, delete, substitute and transform operations required to turn one string into the other. If two strings agree in length and in content at every position, the edit distance is zero; otherwise it is the number of operations needed to transform one string into the other:

D(i, j) = min( D(i−1, j) + 1, D(i, j−1) + 1, D(i−1, j−1) + cost(si, tj) )

where cost(si, tj) = 0 if si = tj and 1 otherwise, and

D(i, j) – the value of cell (i, j) of the dynamic-programming matrix;
si – the i-th element of string s;
tj – the j-th element of string t.

The operations are:
• Insertion.
• Deletion.
• Substitution.
• Transformation.

B. String Chunking

Chunks are the basic unit needed to compute the similarity value between two selected similar data items from the datasets. Chunks are disjoint substrings of the given string. For our proposed TREC book-oriented dataset, CBD selection is done with the character set {a, e, g, l, s, v}. We split the string into chunks over this set of characters, and the number of chunks in the string is taken as an input condition for the similarity calculation; it must be less than the constant threshold value.

Example: Consider the string 'Adolescence'.
According to the chunking rule above, the string is chunked as follows (chunk boundaries underlined in the original):

A dol e s cence

Thus the string 'Adolescence' is chunked into 5 parts.

C. Similarity Calculation

The similarity value is the final result of this algorithm. We perform the similarity calculation between the two datasets, and it depends on the edit distance value. The calculation is performed only for pairs whose edit distance is less than the threshold and whose rank condition between the two sets satisfies the threshold value. The similarity value is calculated by subtracting the edit distance from the length of the longer string and then dividing by that same length:

sim(s, t) = ( max(|s|, |t|) − ed(s, t) ) / max(|s|, |t|)

VI. EXPERIMENTS

We use the following five categories of algorithms in the experiments, all of which are implemented as in-memory algorithms with all their inputs loaded into memory before running:

• Fixed-length gram based: Ed-Join [8] and Winnowing [16].
• VGRAM based: VGram-Join [11], [12].
• Tree based: Bed-tree [21] and Trie-Join [22].
• Enumeration based: PartEnum [7] and NGPP [23].
• Vchunk based: VChunkJoin and VChunkJoin-NoLC.

VChunkJoin is our proposed algorithm equipped with all the filters and optimizations. When comparing with the VGRAM-based method, we remove location-based and content-based filtering and name the resulting algorithm VChunkJoin-NoLC.
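All of the compared algorithms share the same verification step: computing the edit distance by dynamic programming and deriving the similarity value of Section V. A minimal Python sketch of that step follows; the function names are ours, and it covers the standard insert/delete/substitute operations (the transformation operation listed above is omitted from this sketch).

```python
def edit_distance(s, t):
    """Standard DP: D[i][j] = min edits to turn s[:i] into t[:j]."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                      # delete all of s[:i]
    for j in range(n + 1):
        D[0][j] = j                      # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + cost) # match / substitution
    return D[m][n]

def similarity(s, t):
    """(max length - edit distance) / max length, as in Section V."""
    longest = max(len(s), len(t))
    return (longest - edit_distance(s, t)) / longest
```

For example, `edit_distance("kitten", "sitting")` is 3, giving a similarity of (7 − 3) / 7.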
VII. CONCLUSION

We have introduced a novel approach to processing edit similarity joins efficiently, based on partitioning strings into non-overlapping chunks. We have designed an efficient edit similarity join algorithm that incorporates all existing filters, occupies less space than the existing filtering methods, and is more powerful than existing algorithms at calculating the similarity value between two sets of strings. The existing q-gram and VGRAM approaches maintain a larger index and retain unnecessary candidate strings; our proposed algorithm VChunkJoin overcomes this when performing the similarity calculation. The CBD selection algorithm helps reduce the index size for the strings, and dynamic programming is used to compute the edit distance between the two datasets.

REFERENCES

[1] Arasu, A., Ganti, V. and Kaushik, R., "Efficient exact set-similarity joins," in VLDB, 2006.
[2] Behm, A., Ji, S., Li, C. and Lu, J., "Space-constrained gram-based indexing for efficient approximate string search," in ICDE, 2009.
[3] Elmagarmid, A., Ipeirotis, P. G. and Verykios, V. S., "Duplicate record detection: A survey," TKDE, vol. 19, no. 1, pp. 1–16, 2007.
[4] Li, C., Wang, B. and Yang, X., "VGRAM: Improving performance of approximate queries on string collections using variable-length grams," in VLDB, 2007.
[5] Manber, U., "Finding similar files in a large file system," in USENIX Winter, 1994, pp. 1–10.
[6] Ramasamy, K., Patel, J. M., Naughton, J. F. and Kaushik, R., "Set containment joins: The good, the bad and the ugly," in VLDB, 2000, pp. 351–362.
[7] Sarawagi, S. and Kirpal, A., "Efficient set joins on similarity predicates," in SIGMOD, 2004.
[8] Wang, J., Feng, J. and Li, G., "Trie-Join: Efficient trie-based string similarity joins with edit-distance constraints," in VLDB, 2010.
[9] Xiao, C., Wang, W., Lin, X. and Yu, J. X., "Efficient similarity joins for near duplicate detection," in WWW, 2008.
[10] Xiao, C., Wang, W. and Lin, X.,
"Ed-Join: an efficient algorithm for similarity joins with edit distance constraints," PVLDB, vol. 1, no. 1, pp. 933–944, 2008.
[11] Yang, X., Wang, B. and Li, C., "Cost-based variable-length-gram selection for string collections to support approximate queries efficiently," in SIGMOD Conference, 2008, pp. 353–364.