Int. Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, Vol. 4, Issue 5 (Version 6), May 2014, pp. 138-145, www.ijera.com
Complexity Analysis of Indexing Techniques for Scalable
Deduplication and Data Linkage
Koneru Naga Venkata Rajeev, Vanguru Radha Krishna, Kambhampati Vamsi
Krishna, Siddhavatam Thirumala Reddy
School of Information Technology and Engineering, VIT University, Vellore, India
ABSTRACT
The process of matching records that refer to the same entities across several databases is called data linkage; when the same process is applied within a single database, it is called deduplication. Record matching plays a major role in today's environment, as matched data is expensive to obtain. Cleaning the data in a database is the first and foremost step, since duplicate data severely affects the results of data mining. As the number and size of databases increase day by day, the difficulty of matching records becomes one of the most important challenges for data linkage. Many indexing techniques have been designed for the data linkage process. The main aim of these indexing techniques is to minimize the number of record pairs to be compared by eliminating obvious non-matches while maintaining maximum matching quality. Hence, in this paper, a survey of these indexing techniques is made to analyze their complexity and evaluate their scalability using synthetic and real data sets.
Keywords: Complexity, Data matching, Indexing techniques.
I. Introduction
Today, one task is gaining significance in a number of fields in the real world: matching the data that refers to the same objects across numerous databases. Data from these several sources should be integrated and unified to increase data quality. Record linkage is employed in many sectors, including government agencies, to recognize people who register for help or support multiple times. The technique is also useful in the domain of detecting fraud and crime: record linkage plays a major role when accessing the files of a specific person under enquiry or cross-checking that person's history across multiple databases.
Fig. 1 Record Linkage Process (Database A and Database B → Data Cleaning → Indexing → Record pair comparison → Similarity vector classification → Matches / Possible matches / Non matches → Evaluation)
In biotechnology, record linkage is used to identify genome sequences in huge data collections. In the area of information retrieval, it is important to eliminate identical documents. Health researchers call this way of matching data "data linkage", while other communities, such as the database and computer science communities, refer to it as field matching [1], duplicate detection [2], [3], information integration [4], or data scrubbing or cleaning [5]. Data linkage is also a component of ETL (extract, transform, load) tools. Some recent surveys have covered techniques and experiments regarding deduplication and linkage [3].
Fig. 1 describes the steps needed to link two databases. As real-world data contains noisy and inconsistent values, it first has to be cleaned; the absence of good quality data leads to unsuccessful record linkage [6]. The next step is indexing, which creates sets of candidate record pairs that are compared in detail in the following comparison step. After the comparison, record pairs are divided, based on their similarity, into three classes: matches, possible matches, and non-matches. Possible matches are then reviewed and classified into either matches or non-matches. Finally, the complexity and quality of the linkage are evaluated in the last step.
II. Indexing for Deduplication and Data Linkage
When two databases are to be compared, each record from one database has to be matched with every record of the other database. This leads to the full cross product, i.e., nA × nB comparisons between records, where nA and nB are the numbers of records in the two databases. In the same way, if deduplication is applied to a single database with n records, the number of comparisons is n(n − 1)/2.
Comparing record pairs in detail is an expensive operation, so this quadratic comparison space quickly becomes infeasible when large databases are involved; linking two databases of one million records each, for example, would require 10^12 detailed comparisons. On the other hand, if neither database contains duplicate records, the maximum possible number of true matches is only min(nA, nB), so the vast majority of these comparisons correspond to non-matches.
Table 1. Examples for Generation of Blocking Keys

Identifiers | Names (N) | Initials (I) | Codes (C) | Places (P) | Sndx(N)+C | Fi2D(C)+DME(I) | Sndx(P)+La2D(C)
A1 | Rahul | Christen | 1004 | Chennai | R190-1004 | 10-KRST | C345-04
A2 | Rania | Kristen | 1011 | Hyderabad | R190-1011 | 10-KRST | H470-11
A3 | Rohit | Smith | 1200 | Bangalore | R240-1200 | 12-BGLO | B217-00
A4 | Ravi | Smyth | 1300 | Bangalore | R470-1300 | 13-BGLO | B217-00

Sndx and DME are the Soundex and Double Metaphone phonetic encoding functions; Fi2D(C) denotes the first two digits of C, and La2D(C) denotes the last two digits of C.
The main objective of indexing is to reduce the total number of comparisons by eliminating record pairs that are obvious non-matches. A technique called blocking is used for this purpose: the database is divided into multiple blocks so that comparisons take place only within a particular block instead of across the whole database. A blocking key is used for this, and records are placed into blocks based on the similarity of their key values. Phonetic encoding functions such as Double Metaphone, NYSIIS, and Soundex are used to create blocking keys. Table 1 explains this in detail.
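To make the key generation concrete, the following is a minimal Python sketch in the spirit of Table 1. It is illustrative only, not the authors' implementation: the field names are assumptions, and a plain Soundex encoding stands in for Double Metaphone, so the exact key values will differ from those shown in Table 1.

```python
def soundex(name: str) -> str:
    """Standard Soundex: keep the first letter, encode the rest as up to three digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    if not name:
        return ""
    encoded, prev = name[0].upper(), codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            encoded += code
        if ch not in "hw":            # 'h' and 'w' do not reset the previous code
            prev = code
    return (encoded + "000")[:4]

def blocking_keys(record: dict) -> list[str]:
    """Build three blocking keys per record, analogous to the columns of Table 1
    (Soundex is used here where the table uses Double Metaphone)."""
    name, surname, code, place = (record["name"], record["surname"],
                                  record["code"], record["place"])
    return [soundex(name) + "-" + code,          # Sndx(N)+C
            code[:2] + "-" + soundex(surname),   # Fi2D(C) + phonetic(I)
            soundex(place) + "-" + code[-2:]]    # Sndx(P)+La2D(C)

print(blocking_keys({"name": "Rahul", "surname": "Christen",
                     "code": "1004", "place": "Chennai"}))
```

Running this on record A1 gives keys such as 'R400-1004', which differ from Table 1's 'R190-1004' because the exact encoding functions and parameters used by the authors are not specified.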
III. Indexing Techniques
Indexing techniques operate in two stages of the data linkage process: Build and Retrieve.
Build: All records in every database are read, their blocking keys are generated, and the records are inserted into the corresponding index. Usually a data structure called an inverted index is chosen for this purpose: the blocking keys act as the keys of the inverted index, and the unique identifiers of all records that share the same blocking key are added to the same index entry. Fig. 2 shows an example. When linking two or more databases, either a distinct data structure can be constructed for every database, or a single data structure with shared keys can be used.
Identifiers | Initials | Blocking keys (encoded)
A1 | Rahul | R490
A2 | Smith | S520
A3 | Jain | J170
A4 | Raahul | R490
A5 | Smyth | S520
A6 | Rahull | R490
A7 | Smith | S520
A8 | Smyth | S520
Inverted index: R490 → A1, A4, A6; S520 → A2, A5, A7, A8; J170 → A3
Fig. 2 Records with initials and encoded keys and equivalent inverted index data structure
Retrieve: For every block, the group of unique identifiers stored under its key is retrieved from the data structure, and candidate record pairs are created from it. In data linkage, the records in a block of one database are paired with the records from the other databases that share the same blocking key. In deduplication, all distinct pairs of records within the same block are formed. As shown in Fig. 2, block R490 yields the pairs (A1, A4), (A1, A6), and (A4, A6).
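The Build and Retrieve stages for deduplication can be sketched in a few lines of Python. This is a minimal illustration using the encoded keys of Fig. 2, not the authors' code.

```python
from collections import defaultdict
from itertools import combinations

def build_index(records: dict[str, str]) -> dict[str, list[str]]:
    """Build stage: map each blocking key to the identifiers that share it."""
    index = defaultdict(list)
    for rec_id, key in records.items():
        index[key].append(rec_id)
    return index

def candidate_pairs_dedup(index: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Retrieve stage (deduplication): all distinct record pairs within each block."""
    pairs = []
    for ids in index.values():
        pairs.extend(combinations(sorted(ids), 2))
    return pairs

# Records and encoded blocking keys taken from Fig. 2
records = {"A1": "R490", "A2": "S520", "A3": "J170", "A4": "R490",
           "A5": "S520", "A6": "R490", "A7": "S520", "A8": "S520"}
print(candidate_pairs_dedup(build_index(records)))
# Block R490 contributes (A1, A4), (A1, A6), (A4, A6); block S520 contributes six pairs.
```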
3.1 Traditional Blocking
Traditional blocking can be implemented efficiently using the inverted index data structure. Its main disadvantage is that errors in the record fields lead to incorrect blocking keys, which in turn place records into the wrong blocks. This can be overcome by using several different blocking key definitions. Another disadvantage is that the sizes of the generated blocks depend on the frequency distribution of the blocking key values.
For data linkage, the number of candidate record pairs generated is

c = Σ_{i=1..k} |P_i| × |Q_i|    (1)

where k is the number of distinct blocking key values and |P_i| and |Q_i| are the numbers of records in block i of databases P and Q, respectively. For deduplication of a single database, the number of candidate pairs is

c = Σ_{i=1..k} |P_i| (|P_i| − 1) / 2    (2)
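As a quick sanity check on equations (1) and (2), the following illustrative sketch computes the candidate pair counts from per-block record counts (the inputs are assumed for illustration, not taken from the paper).

```python
def candidate_count_linkage(blocks_p: dict[str, int], blocks_q: dict[str, int]) -> int:
    """Equation (1): sum over shared blocking keys of |P_i| * |Q_i|."""
    return sum(n * blocks_q.get(key, 0) for key, n in blocks_p.items())

def candidate_count_dedup(blocks: dict[str, int]) -> int:
    """Equation (2): sum over blocks of |P_i| * (|P_i| - 1) / 2."""
    return sum(n * (n - 1) // 2 for n in blocks.values())

# Block sizes taken from Fig. 2: R490 holds 3 records, S520 holds 4, J170 holds 1
print(candidate_count_dedup({"R490": 3, "S520": 4, "J170": 1}))  # 3 + 6 + 0 = 9
```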
3.2 Q-Gram-Based Indexing
In this technique, the database is indexed so that records with similar, but not identical, blocking keys are placed into the same block. The main idea is to generate variations of every blocking key using q-grams (substrings of length q), so that a record's unique identifier can be inserted into multiple blocks. Every blocking key is converted into a list of q-grams, and sub-lists of that list are then generated down to a minimum length determined by a user-defined threshold. These sub-lists are concatenated back into strings and used as the actual key values in the data structure, as described in Fig. 3.
Identifiers | Blocking keys | Sub-lists | Index values
A1 | Rahul | [ra, ah, hu, ul], [ra, ah, hu], [ra, ah, ul], [ra, hu, ul], [ah, hu, ul] | raahhuul, raahhu, raahul, rahuul, ahhuul
A2 | Raahul | [ra, aa, ah, hu, ul], [ra, aa, ah, hu], [ra, aa, ah, ul], [ra, aa, hu, ul], [ra, ah, hu, ul], [aa, ah, hu, ul] | raaaahhuul, raaaahhu, raaaahul, raaahuul, raahhuul, aaahhuul
A3 | Rahull | [ra, ah, hu, ul, ll], [ah, hu, ul, ll], [ra, hu, ul, ll], [ra, ah, ul, ll], [ra, ah, hu, ll], [ra, ah, hu, ul] | raahhuulll, ahhuulll, rahuulll, raahulll, raahhull, raahhuul
Example inverted index entries: raaahuul → A2; raahhuul → A1, A2, A3; raahulll → A3
Fig.3. Q-Gram Technique considering bigrams (q=2) and representation using inverted index
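A minimal sketch of the q-gram key generation described above. The threshold value of 0.8 is an assumption chosen so that the generated sub-lists match the lengths shown in Fig. 3; the paper does not state the exact parameter.

```python
from itertools import combinations

def qgrams(s: str, q: int = 2) -> list[str]:
    """Split a blocking key into its overlapping q-grams."""
    s = s.lower()
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def qgram_index_keys(blocking_key: str, q: int = 2, threshold: float = 0.8) -> set[str]:
    """Generate index key variations: all order-preserving sub-lists of the q-gram
    list, from the full length down to a minimum length set by the threshold."""
    grams = qgrams(blocking_key, q)
    min_len = max(1, int(len(grams) * threshold))
    keys = set()
    for length in range(len(grams), min_len - 1, -1):
        for sub_list in combinations(grams, length):   # combinations preserve order
            keys.add("".join(sub_list))
    return keys

print(sorted(qgram_index_keys("Rahul")))
# ['ahhuul', 'raahhu', 'raahhuul', 'raahul', 'rahuul'] -- the index values of row A1 in Fig. 3
```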
3.3 Sorted Neighborhood Indexing
In this technique, the databases are sorted according to the values of the blocking keys, and a window of records is then slid sequentially over the sorted values. Record pairs are generated from the records inside each window position. The technique has been developed in two ways.

3.3.1 Sorted Array-Based Approach
The blocking key values are sorted and inserted into an array, and the window is moved over this array to create record pairs. For data linkage, the sorted array contains nA + nB entries and the number of window positions is nA + nB − w + 1, where w is the window size; for deduplication of a single database with n records it is n − w + 1. This is represented in Fig. 4.
Window positions | Blocking keys | Identifiers
1 | Rahul | A1
2 | Raahul | A4
3 | Rahull | A6
4 | Jain | A3
5 | Smith | A2
6 | Smyth | A5
7 | Smyth | A8
8 | Smith | A7

Window range | Record pairs
1-3 | (A1, A4), (A4, A6), (A1, A6)
2-4 | (A4, A6), (A6, A3), (A4, A3)
3-5 | (A6, A3), (A3, A2), (A6, A2)
4-6 | (A3, A2), (A2, A5), (A3, A5)
5-7 | (A2, A5), (A5, A8), (A2, A8)
6-8 | (A5, A8), (A8, A7), (A5, A7)

Fig. 4. Sorted Neighborhood Technique with a window size of 3
The disadvantage of this approach is that if the chosen window size is very small, not all records sharing the same blocking key value will be covered. The best solution is to combine several fields so that the keys take a larger number of distinct values. Another problem is that sorting the blocking keys is sensitive to errors and variations in their values; this can be addressed by creating multiple blocking keys, for example by also generating keys from the reversed values. The time complexity of the sorting step is O(n log n), where n = nA + nB for data linkage and n = nA in the case of deduplication.
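A minimal sketch of the sorted array-based approach. Note that, unlike Fig. 4, which lists the keys in their original order, this sketch actually sorts them, so the generated pairs differ from the figure; it only illustrates the general mechanism.

```python
def sorted_neighborhood_pairs(records: dict[str, str], w: int = 3) -> set[tuple[str, str]]:
    """Sort record identifiers by their blocking key value, slide a window of
    size w over the sorted array, and pair all records within each window."""
    ordered = sorted(records, key=lambda rec_id: records[rec_id])
    pairs = set()
    for start in range(len(ordered) - w + 1):          # n - w + 1 window positions
        window = ordered[start:start + w]
        for i in range(len(window)):
            for j in range(i + 1, len(window)):
                pairs.add(tuple(sorted((window[i], window[j]))))
    return pairs

records = {"A1": "Rahul", "A4": "Raahul", "A6": "Rahull", "A3": "Jain",
           "A2": "Smith", "A5": "Smyth", "A8": "Smyth", "A7": "Smith"}
print(sorted(sorted_neighborhood_pairs(records, w=3)))
```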
3.3.2 Inverted Index-Based Approach
Instead of a sorted array, an inverted index is used in this approach. It works like the previous approach, with one difference: each distinct blocking key value appears only once in the sorted list, and all identifiers that share it are stored in the corresponding inverted index list, so the same key value is processed only once. This technique has some drawbacks. One is that a few large blocks can dominate the record pairs created. The other is that sorting the blocking keys assumes that their beginnings are error-free. The technique is described in Fig. 5; records with the same blocking key are represented by a single entry holding several identifiers, instead of separate entries as in the previous approach.
Window positions | Blocking keys | Identifiers
1 | Rahul | A1
2 | Raahul | A4
3 | Smith | A2, A7
4 | Jain | A3
5 | Rahull | A6
6 | Smyth | A5, A8

Window range | Record pairs
1-3 | (A1, A4), (A1, A2), (A1, A7), (A4, A2), (A4, A7), (A2, A7)
2-4 | (A4, A2), (A4, A7), (A2, A7), (A4, A3), (A2, A3), (A7, A3)
3-5 | (A2, A7), (A2, A3), (A7, A3), (A2, A6), (A7, A6), (A3, A6)
4-6 | (A3, A6), (A3, A5), (A3, A8), (A6, A5), (A6, A8), (A5, A8)

Fig. 5. Inverted Index-Based Approach with a window size of 3
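A sketch of the inverted index-based variant, under the same caveat as before (the keys are actually sorted here, so the pairs will not match Fig. 5 exactly). The window slides over the sorted distinct blocking key values, and each position pairs all records whose keys fall inside it.

```python
from collections import defaultdict

def inverted_index_snn_pairs(records: dict[str, str], w: int = 3) -> set[tuple[str, str]]:
    """Sorted neighborhood over an inverted index: each distinct blocking key
    value appears once in the sorted list, carrying all identifiers that share it."""
    index = defaultdict(list)
    for rec_id, key in records.items():
        index[key].append(rec_id)
    sorted_keys = sorted(index)
    pairs = set()
    for start in range(len(sorted_keys) - w + 1):
        ids = [i for key in sorted_keys[start:start + w] for i in index[key]]
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                pairs.add(tuple(sorted((ids[a], ids[b]))))
    return pairs

records = {"A1": "Rahul", "A4": "Raahul", "A2": "Smith", "A7": "Smith",
           "A3": "Jain", "A6": "Rahull", "A5": "Smyth", "A8": "Smyth"}
print(sorted(inverted_index_snn_pairs(records, w=3)))
```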
3.4 Indexing Based On Suffix-Arrays
In this indexing technique, the blocking keys and their suffixes are inserted into a suffix-array-based data structure. A suffix array is simply a collection of strings (here, suffixes) held in sorted order. If the length of a blocking key is k, then the number of generated suffixes is (k − l + 1), where l is the minimum suffix length. The disadvantage of this technique is that errors and variations at the end of blocking keys cause records to be placed into different blocks instead of the same block. The solution is to generate not only the true suffixes of the blocking keys but all substrings down to a particular length. This is explained in detail with the help of Fig. 6. The block size is simply the number of identifiers stored under a suffix.
Identifiers | Blocking keys | Suffixes
A1 | Fluorine | fluorine, luorine, uorine, orine, rine, ine
A2 | Chlorine | chlorine, hlorine, lorine, orine, rine, ine
A3 | Iodine | iodine, odine, dine, ine
A4 | Amoxine | amoxine, moxine, oxine, xine, ine

Suffix | Identifiers
amoxine | A4
chlorine | A2
dine | A3
fluorine | A1
hlorine | A2
ine | A1, A2, A3, A4
lorine | A2
luorine | A1
moxine | A4
odine | A3
orine | A1, A2
oxine | A4
rine | A1, A2
uorine | A1
xine | A4
Fig. 6. Suffix-Based Approach with minimum length l = 3 and maximum block size 3. The block for the suffix ine is removed because it contains more identifiers than the maximum block size.
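A minimal sketch of suffix-based indexing that reproduces Fig. 6: every suffix of each blocking key down to the minimum length becomes an index key, and blocks that grow beyond the maximum size are discarded.

```python
from collections import defaultdict
from itertools import combinations

def suffix_index(records: dict[str, str], min_len: int = 3, max_block: int = 3):
    """Insert every suffix (down to min_len characters) of each blocking key into
    an inverted index, then drop blocks larger than max_block identifiers."""
    index = defaultdict(set)
    for rec_id, key in records.items():
        key = key.lower()
        for i in range(len(key) - min_len + 1):        # k - l + 1 suffixes per key
            index[key[i:]].add(rec_id)
    return {suffix: ids for suffix, ids in index.items() if len(ids) <= max_block}

records = {"A1": "Fluorine", "A2": "Chlorine", "A3": "Iodine", "A4": "Amoxine"}
index = suffix_index(records)
assert "ine" not in index          # the 'ine' block holds all four records and is removed
pairs = {p for ids in index.values() for p in combinations(sorted(ids), 2)}
print(sorted(pairs))               # [('A1', 'A2')] via the shared suffixes 'orine' and 'rine'
```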
3.5 Canopy Clustering
Clusters (canopies) are generated by computing the similarities between blocking keys with a measure such as the cosine or Jaccard similarity. These measures operate on tokens, which can be words or characters. The approach can be implemented efficiently using an inverted index: each blocking key is translated into a group of tokens, the tokens act as the keys of the data structure, and every record whose blocking key contains a particular token is added to that token's entry. Two frequencies are calculated in this approach: the term frequency (TF), which counts how often a token occurs within a blocking key, and the document frequency (DF), which counts in how many records the token appears. This approach is represented in Fig. 7.
Identifiers | Blocking keys | Sorted lists
A1 | Ranjan | [(an, 2), (ja, 1), (nj, 1), (ra, 1)]
A2 | Raghanan | [(ag, 1), (an, 2), (gh, 1), (ha, 1), (na, 1), (ra, 1)]
A3 | Pragranth | [(ag, 1), (an, 1), (gr, 1), (nt, 1), (pr, 1), (ra, 2), (th, 1)]
Fig. 7. Canopy Clustering: bigram lists containing DF counts, with the inverted index data structure
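The sketch below illustrates canopy clustering over bigram tokens using an inverted index. It is a simplified illustration: the loose and tight thresholds are assumed values, unweighted Jaccard similarity is used (the paper also mentions cosine similarity and TF/DF weighting), and canopy centres are picked arbitrarily.

```python
from collections import defaultdict

def bigrams(s: str) -> set[str]:
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def canopy_clusters(records: dict[str, str], loose: float = 0.3, tight: float = 0.8):
    """Canopy clustering on bigram tokens. The inverted index (token -> record ids)
    restricts similarity computations to records that share at least one token."""
    tokens = {rec_id: bigrams(key) for rec_id, key in records.items()}
    index = defaultdict(set)
    for rec_id, toks in tokens.items():
        for tok in toks:
            index[tok].add(rec_id)
    remaining = set(records)
    canopies = []
    while remaining:
        centre = remaining.pop()                        # arbitrary canopy centre
        candidates = set().union(*(index[t] for t in tokens[centre]))
        canopy = {centre}
        for cand in candidates & remaining:
            sim = jaccard(tokens[centre], tokens[cand])
            if sim >= loose:
                canopy.add(cand)                        # loosely similar: joins the canopy
            if sim >= tight:
                remaining.discard(cand)                 # tightly similar: removed from the pool
        canopies.append(canopy)
    return canopies

# Blocking keys taken from Fig. 7; canopies may overlap, and the exact grouping
# depends on the chosen centres and thresholds.
print(canopy_clusters({"A1": "Ranjan", "A2": "Raghanan", "A3": "Pragranth"}))
```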
IV. Experimental Results
Several measures are used to evaluate both the complexity and the quality of indexing. Let Mn and Nn be the numbers of matching and non-matching record pairs in the full comparison space, and let TM and TN be the numbers of true matches and true non-matches among the candidate record pairs generated by an indexing technique, such that TM ≤ Mn and TN ≤ Nn.
An indexing technique eliminates a portion of the record pairs; the extent of this reduction is measured by the Reduction Ratio (RR):

RR = 1 − (TM + TN) / (Mn + Nn)    (3)

The ratio of the number of true matching candidate pairs to the total number of matching pairs is called Pairs Completeness (PC):

PC = TM / Mn    (4)

The ratio of the number of true matching candidate pairs to the total number of candidate record pairs generated is called Pairs Quality (PQ):

PQ = TM / (TM + TN)    (5)
If the PQ value is high, the indexing technique is efficient and generates mostly truly matching record pairs; if the PQ value is low, many non-matching pairs are generated.
The f-score is the harmonic mean of Pairs Completeness and Pairs Quality:

f-score = 2 × PC × PQ / (PC + PQ)    (6)
Hence by using these measures, performance is
evaluated.
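A small sketch computing the measures of equations (3) to (6); the input numbers in the example are hypothetical, not results from the paper.

```python
def evaluation_measures(tm: int, tn: int, mn: int, nn: int) -> dict[str, float]:
    """tm, tn: true matches / true non-matches among the candidate pairs;
    mn, nn: matches / non-matches in the full comparison space."""
    rr = 1.0 - (tm + tn) / (mn + nn)        # Reduction Ratio, eq. (3)
    pc = tm / mn                            # Pairs Completeness, eq. (4)
    pq = tm / (tm + tn)                     # Pairs Quality, eq. (5)
    f = 2 * pc * pq / (pc + pq)             # f-score, eq. (6)
    return {"RR": rr, "PC": pc, "PQ": pq, "f-score": f}

# Hypothetical example: 10,000 pairs in total, 100 of them matches; indexing keeps
# 500 candidate pairs that contain 90 of the true matches.
print(evaluation_measures(tm=90, tn=410, mn=100, nn=9_900))
# RR = 0.95, PC = 0.9, PQ = 0.18, f-score = 0.3 (up to floating point rounding)
```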
From Table 2, it is clearly observed that the q-gram-based indexing technique is very slow compared to the other techniques, whereas the array-based sorted neighborhood approach and traditional blocking are the fastest.
Table 2. Runtime per candidate record pair for each indexing technique

Indexing technique | Minimum (ms) | Maximum (ms)
Traditional blocking | 0.002 | 0.972
Q-gram-based indexing | 0.005 | 163,484.394
Array-based sorted neighborhood | 0.011 | 0.288
Inverted index-based sorted neighborhood | 0.002 | 3.040
Suffix-array-based indexing | 0.024 | 168.561
Canopy clustering | 0.003 | 380.214
V. Conclusion
In this paper, five indexing techniques are discussed. Candidate record pair generation is described for each technique, and its complexity is analyzed on that basis. Accurate indexing and efficiency are obtained only when the blocking keys are defined carefully. Because the data are always divided into matches and non-matches, suitable blocking keys can be identified accordingly. Future work is to develop new, more efficient techniques so that as few dissimilar record pairs as possible are compared within each block.
References
[1] A. Aizawa and K. Oyama, "A Fast Linkage Detection Scheme for Multi-Source Information Integration," Proc. Int'l Workshop on Challenges in Web Information Retrieval and Integration (WIRI '05), 2005.
[2] A.K. Elmagarmid, P.G. Ipeirotis, and V.S. Verykios, "Duplicate Record Detection: A Survey," IEEE Trans. Knowledge and Data Eng., vol. 19, no. 1, pp. 1-16, Jan. 2007.
[3] D.E. Clark, "Practical Introduction to Record Linkage for Injury Research," Injury Prevention, vol. 10, pp. 186-191, 2004.
[4] E. Rahm and H.H. Do, "Data Cleaning: Problems and Current Approaches," IEEE Technical Committee Data Eng. Bull., vol. 23, no. 4, pp. 3-13, Dec. 2000.
[5] M. Bilenko and R.J. Mooney, "On Evaluation and Training-Set Construction for Duplicate Detection," Proc. Workshop on Data Cleaning, Record Linkage and Object Consolidation (SIGKDD '03), pp. 7-12, 2003.
[6] W.W. Cohen, P. Ravikumar, and S. Fienberg, "A Comparison of String Distance Metrics for Name-Matching Tasks," Proc. Workshop on Information Integration on the Web (IJCAI '03), 2003.
  • 1. K Naga Venkata Rajeev et al Int. Journal of Engineering Research and Applications www.ijera.com ISSN : 2248-9622, Vol. 4, Issue 5( Version 6), May 2014, pp.138-145 www.ijera.com 138 | P a g e Complexity Anaylsis of Indexing Techniques for Scalable Deduplication and Data Linkage Koneru Naga Venkata Rajeev, Vanguru Radha Krishna, Kambhampati Vamsi Krishna, Siddhavatam Thirumala Reddy School of Information Technology and Engineering VIT UNIVERSITY Vellore, India School of Information Technology and Engineering VIT UNIVERSITY Vellore, India School of Information Technology and Engineering VIT UNIVERSITY Vellore, India School of Information Technology and Engineering VIT UNIVERSITY Vellore, India ABSTRACT The process of record or data matching from various databases regarding same entities is called Data Linkage. Similarly, when this process is applied to a particular database, then it is called Deduplication. Record matching plays a major role in today‟s environment as it is more expensive to obtain. The process of data cleaning from database is the first and foremost step since copy of data severely affects the results of data mining. As the amount of databases increasing day-by-day, the difficulty in matching records happens to be a most important confront for data linkage. So many indexing techniques are designed for the process of data linkage. The main aim of those indexing techniques is to minimize the number of data pairs by eliminating apparent non-matching data pairs by maintaining maximum quality of matching. Hence in this paper, a survey is made of these indexing techniques to analyze complexity and evaluate scalability using fake data sets and original data sets. Keywords: Complexity, Data matching, Indexing techniques. I. Introduction Today in real world one task is getting more significance in a number of fields. That is nothing but the data matching that connected to same objects from numerous databases. And this data from several sources should be integrated and unified to increase quality of data. Record linkage is employed in so many sectors including government agencies to recognize people who register for help or support multiple numbers of times. Even in the domain of detecting scams and crimes, this record linkage technique is useful. To access files for a specific person in enquiry or to cross check the histories of that person from multiple databases, record linkage plays a major role. RESEARCH ARTICLE OPEN ACCESS
  • 2. K Naga Venkata Rajeev et al Int. Journal of Engineering Research and Applications www.ijera.com ISSN : 2248-9622, Vol. 4, Issue 5( Version 6), May 2014, pp.138-145 www.ijera.com 139 | P a g e Fig. 1 Record Linkage Process In biotechnology, record linkage is used to identify genome sequences in a huge number of data collections. In the area of information retrieval, it is significant to eliminate identical documents. The way of matching data is called data linkage by health researchers. But in other communities such as database and computer science, it is referred to field matching [1], duplicate detection [2], [3], information integration [4], data scrubbing or cleaning [5]. Data linkage is nothing but a component of ETL tools. Some recent surveys have delivered techniques and experiments regarding deduplication and linkage [3]. In Fig. 1, necessary steps needed to link two databases are described. As real-world data contain noisy data, inconsistent data, first it should be cleaned. Absence of good quality data leads to unsuccessful record linkage [6]. The next step is indexing, in which it creates sets of candidate records that will be distinguished in the next comparison step. After the comparison, based on their equality, it is divided into three types, which are possible matches, matches and non matches. Based on the view of possible matches, they are classified either into match or non-match. Finally, the complexity is evaluated or analyzed in the last step. II. Indexing for Deduplication and Data Linkage When different databases are to be compared, each and every record from one particular database should be matched with every record of the other database. Hence it leads to product i.e., comparisons among the records where is number of records in a particular database. In the same way, if it is only for a single database then the number of comparisons Database A Database B Data Cleaning Indexing Record pair comparison Similarity vector classification Matches Non matches Possible matches E V A L U A T I O N Data Cleaning
  • 3. K Naga Venkata Rajeev et al Int. Journal of Engineering Research and Applications www.ijera.com ISSN : 2248-9622, Vol. 4, Issue 5( Version 6), May 2014, pp.138-145 www.ijera.com 140 | P a g e is . But the comparison of values is more expensive when it involves a large number of databases. If there are no matching records, then min is the number of comparisons. Table.1 Examples for Generation of Blocking Keys Record Fields Blocking Keys Identifiers Names( N) Initials (I) Codes (C) Places(P) Sndx(N)+C Fi2D(C)+D ME(I) Sndx(P) +La2D( C) A1 Rahul Christen 1004 Chennai R190-1004 10-KRST C345-04 A2 Rania Kristen 1011 Hyderabad R190-1011 10-KRST H470- 11 A3 Rohit Smith 1200 Bangalore R240-1200 12-BGLO B217-00 A4 Ravi Smyth 1300 Bangalore R470-1300 13-BGLO B217-00 Sndx, DME are two phonetic functions and Fi2D(C) ----- First two digits of C La2D(C) ---- Last two digits of C The important objective of indexing is to lessen the total number of comparisons by eliminating the records that are regarding non matches. Hence a technique is used for this approach and called blocking. In this technique, it divides the database into multiple blocks, so that the comparison happens only in a particular block instead of the whole database. A blocking key is used for this purpose. Based on their similarity, records are pushed into the blocks. Functions such as Double Metaphone, NYSIIS and Soundex are used to create blocking keys. In Table.1, it is explained in detail. III. Indexing Techniques The indexing techniques provide two stages in the process of data linkage. Build and Retrieve are the two stages in it. Build: The records that present in each and every database are completely read, Blocking Keys are produced accordingly, and these records are pushed into particular indexes. Mostly a data structure called inverted index is selected for this purpose. The Blocking Keys then act as keys of inverted index and unique identifiers of the records that contain equal blocking keys are pushed into the same data structure. Fig. 2 shows an example for it. When associating two or more databases, a distinct data structure for every database can be constructed or only one data structure with same keys can be created. Identifiers Initials Blocking keys(Encoded) A1 Rahul R490 A2 Smith S520 A3 Jain J170 A4 Raahul R490 A5 Smyth S520 A6 Rahull R490 A7 Smith S520 A8 Smyth S520
  • 4. K Naga Venkata Rajeev et al Int. Journal of Engineering Research and Applications www.ijera.com ISSN : 2248-9622, Vol. 4, Issue 5( Version 6), May 2014, pp.138-145 www.ijera.com 141 | P a g e R490 S520 J170 Fig. 2 Records with initials and encoded keys and equivalent inverted index data structure Retrieve: For every block, a group of unique identifiers are retrieved from the data structure, and then candidate pairs are created using this. In the process of data linkage, the records present in the block of a single database are paired with the records from remaining databases that contain equal blocking key. If it is deduplication, records from the same block contain different pairs. As given in Fig. 2, R490 contain the pairs such as (A1, A4), (A4, A6), (A1, A6). 3.1. Traditional Blocking It is employed competently by utilizing the inverted index data structure. The main disadvantage of this is flaws in the fields of records leads to incorrect blocking keys which in turn lead to wrong block placement of records. But this can be overwhelmed by consuming multiple descriptions for blocking keys. Another disadvantage of using this technique is block sizes created will be depending upon frequency delivery of blocking keys. Candidate record pairs are generated by using; (1) Where k is the number of distinct blocking keys, Is number of records in P Is number of records in Q (2) 3.2 Q-Gram-Based Indexing In this technique, the database is indexed so that records that contain related but not as it is as blocking keys are pushed into the same block. Hence the main goal of this is to generate variations for every blocking key using q-grams (subset of size q) and can push multiple unique identifiers into multiple blocks. Every blocking key is transformed into a group of q-grams and then sub lists are created up to a particular length that can be determined by the user. Hence these are transformed to strings and can be used as actual values in the data structure that is described in Fig. 3. Identifiers Blocking Keys Sub-lists Index values A1 Rahul [ra, ah, hu, ul], [ra, ah, hu], [ra, ah, ul], [ra, hu, ul], [ah, hu, ul] raahhuul, raahhu, raahul, rahuul, ahhuul A2 Raahul [ra, aa, ah, hu, ul], [ra, aa, ah, hu], [ra, aa, ah, ul], [ra, aa, hu, ul], [ra, ah, hu, ul], [aa, ah, hu, ul] raaaahhuul, raaaahhu, raaaahul, raaahuul, raahhuul, aaahhuul A3 Rahull [ra, ah, hu, ul, ll], [ah, hu, ul, ll], [ra, hu, ul, ll], [ra, ah, ul, ll], [ra, ah, hu, ll], [ra, ah, hu, ul] Raahhuulll, ahhuulll, rahuulll, raahulll, raahhull, raahhuul A2 A5 A7 A8 A1 A4 A6 A 3
3.3 Sorted Neighborhood Indexing
In this technique the databases are sorted according to the values of their blocking keys, and a window of fixed size is then moved sequentially over the sorted values. Record pairs are generated from the records that fall into the same window position. The technique has been developed in two variants.

3.3.1 Sorted Array-Based Approach
The blocking key values are sorted and stored in an array, and the window is moved over this array to create record pairs. If the array contains n entries and the window size is w, the window takes n - w + 1 positions, where n is the combined number of records of both databases for data linkage and the number of records of the single database for deduplication. Fig. 4 shows an example.

Window position | Blocking key | Identifier
1 | Rahul  | A1
2 | Raahul | A4
3 | Rahull | A6
4 | Jain   | A3
5 | Smith  | A2
6 | Smyth  | A5
7 | Smyth  | A8
8 | Smith  | A7

Window range | Record pairs
1-3 | (A1, A4), (A4, A6), (A1, A6)
2-4 | (A4, A6), (A6, A3), (A4, A3)
3-5 | (A6, A3), (A3, A2), (A6, A2)
4-6 | (A3, A2), (A2, A5), (A3, A5)
5-7 | (A2, A5), (A5, A8), (A2, A8)
6-8 | (A5, A8), (A8, A7), (A5, A7)

Fig. 4. Sorted neighborhood technique with a window size of 3.

One disadvantage of this approach is that if the chosen window size is too small, not all records sharing the same blocking key value are covered by a single window. The best remedy is to build the blocking key from several fields so that it takes many distinct values. A further problem is that sorting the blocking keys is sensitive to errors and variations in the values; this can be addressed by defining several blocking keys, or by additionally creating keys from the reversed values. The time complexity of the sort is O(n log n), where n is the total number of records of both databases for data linkage and the number of records of the single database for deduplication.

3.3.2 Inverted Index-Based Approach
Instead of a sorted array, an inverted index is used: the distinct blocking key values are sorted, and each of them points to the list of identifiers of all records with that value. The approach is otherwise the same as above, with the difference that every distinct blocking key value appears only once, so the comparisons for records sharing a key are generated from a single entry rather than from separate array positions as in the previous approach. Two drawbacks remain. First, the largest blocks dominate the number of record pairs that are created. Second, sorting the blocking key values assumes that their beginnings are error-free; an error in the first characters moves a value far away in the sorted order. Fig. 5 illustrates the approach.

Window position | Blocking key | Identifiers
1 | Rahul  | A1
2 | Raahul | A4
3 | Smith  | A2, A7
4 | Jain   | A3
5 | Rahull | A6
6 | Smyth  | A5, A8

Window range | Record pairs
1-3 | (A1, A4), (A1, A2), (A1, A7), (A4, A2), (A4, A7), (A2, A7)
2-4 | (A4, A2), (A4, A7), (A2, A7), (A4, A3), (A2, A3), (A7, A3)
3-5 | (A2, A7), (A2, A3), (A7, A3), (A2, A6), (A7, A6), (A3, A6)
4-6 | (A3, A6), (A3, A5), (A3, A8), (A6, A5), (A6, A8), (A5, A8)

Fig. 5. Inverted index-based sorted neighborhood approach.
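The following is a minimal sketch of the sorted array-based approach of Section 3.3.1 with a window size of w = 3. It sorts the records plainly by their blocking key, which differs from the ordering shown in Fig. 4, so the individual pairs differ, but the window mechanics are the same; the function name is illustrative, not from the paper.

```python
def sorted_neighbourhood_pairs(records, window=3):
    # Sort records by blocking key, slide a window of fixed size over the
    # sorted array, and pair every record in a window with every other one.
    srt = sorted(records, key=lambda rec: rec[1])        # rec = (identifier, key)
    pairs = set()
    for start in range(len(srt) - window + 1):           # n - w + 1 window positions
        ids = [rec_id for rec_id, _ in srt[start:start + window]]
        pairs.update((a, b) for i, a in enumerate(ids) for b in ids[i + 1:])
    return pairs

records = [("A1", "Rahul"), ("A4", "Raahul"), ("A6", "Rahull"), ("A3", "Jain"),
           ("A2", "Smith"), ("A5", "Smyth"), ("A8", "Smyth"), ("A7", "Smith")]
print(len(sorted_neighbourhood_pairs(records)))   # 13 candidate pairs out of 28 possible
```

The reduction from 28 possible pairs to 13 candidates on only eight records illustrates why the window size w controls the trade-off between completeness and the number of comparisons.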
3.4 Indexing Based on Suffix Arrays
In this technique, not only the blocking key values themselves but also their suffixes are inserted into a suffix array based inverted index. A suffix array is a data structure that holds strings and their suffixes in sorted order. For a blocking key value of length k and a minimum suffix length l, the number of generated suffixes is k - l + 1. The disadvantage of the technique is that errors and variations at the end of blocking key values cause records to be inserted into different blocks rather than the same one. A solution is to generate not only the true suffixes of a blocking key value but all of its substrings down to the minimum length. Fig. 6 illustrates the approach; the block size is the number of identifiers stored in an inverted index list, and lists that exceed a maximum block size are removed. (A short code sketch of suffix generation is given at the end of Section 3.5.)

Identifiers | Blocking keys | Suffixes
A1 | Fluorine | fluorine, luorine, uorine, orine, rine, ine
A2 | Chlorine | chlorine, hlorine, lorine, orine, rine, ine
A3 | Iodine   | iodine, odine, dine, ine
A4 | Amoxine  | amoxine, moxine, oxine, xine, ine

Suffix   | Identifiers
amoxine  | A4
chlorine | A2
dine     | A3
fluorine | A1
hlorine  | A2
ine      | A1, A2, A3, A4
lorine   | A2
luorine  | A1
moxine   | A4
odine    | A3
orine    | A1, A2
oxine    | A4
rine     | A1, A2
uorine   | A1
xine     | A4

Fig. 6. Suffix-based approach with minimum length l = 3 and maximum block size 3. The block for the suffix ine is removed because it contains more identifiers than the maximum block size allows.

3.5 Canopy Clustering
In canopy clustering, overlapping clusters are generated by measuring the similarity between blocking key values with a measure such as the cosine or Jaccard similarity. These measures operate on tokens, which can be words or character q-grams. The technique can be implemented efficiently with an inverted index: every blocking key value is converted into a set of tokens, each token becomes a key of the index, and the records whose blocking key contains that token are added to its list. Two frequencies are computed in this approach: the term frequency (TF) counts how often a token occurs within a blocking key value, while the document frequency (DF) counts in how many records the token occurs. Fig. 7 shows an example.

Identifiers | Blocking keys | Sorted token lists
A1 | Ranjan    | [(an, 2), (ja, 1), (nj, 1), (ra, 1)]
A2 | Raghanan  | [(ag, 1), (an, 2), (gh, 1), (ha, 1), (na, 1), (ra, 1)]
A3 | Pragranth | [(ag, 1), (an, 1), (gr, 1), (nt, 1), (pr, 1), (ra, 2), (th, 1)]

Fig. 7. Canopy clustering: sorted bigram lists with their frequency counts and the corresponding inverted index data structure.
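For the suffix-array based approach of Section 3.4, here is a minimal sketch with minimum suffix length l = 3 and maximum block size 3, matching Fig. 6; the function names are illustrative, not from the paper.

```python
from collections import defaultdict

def suffixes(value, min_len=3):
    # All suffixes of the value down to the minimum length:
    # a value of length k yields k - l + 1 suffixes.
    return [value[i:] for i in range(len(value) - min_len + 1)]

def suffix_array_index(records, min_len=3, max_block=3):
    # Insert every record identifier under each suffix of its blocking key,
    # then remove blocks that exceed the maximum block size.
    index = defaultdict(set)
    for rec_id, key in records:
        for suf in suffixes(key.lower(), min_len):
            index[suf].add(rec_id)
    return {suf: ids for suf, ids in index.items() if len(ids) <= max_block}

records = [("A1", "Fluorine"), ("A2", "Chlorine"), ("A3", "Iodine"), ("A4", "Amoxine")]
index = suffix_array_index(records)
print("ine" in index)              # False: the 'ine' block holds all four records and is removed
print(sorted(index["orine"]))      # ['A1', 'A2']
```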
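For the canopy clustering approach of Section 3.5, the following sketch shows only the tokenisation and similarity step: blocking key values are split into bigram tokens and compared with the Jaccard measure. A full canopy implementation would additionally use the TF/DF weights and loose and tight similarity thresholds; the function names here are illustrative assumptions.

```python
def bigram_tokens(value):
    # Set of character bigram tokens of a blocking key value.
    return {value[i:i + 2] for i in range(len(value) - 1)}

def jaccard(key_a, key_b):
    # Jaccard similarity of the two bigram token sets.
    a, b = bigram_tokens(key_a.lower()), bigram_tokens(key_b.lower())
    return len(a & b) / len(a | b)

# Names from Fig. 7: more shared bigrams give a higher similarity.
print(round(jaccard("Ranjan", "Raghanan"), 2))    # 0.25
print(round(jaccard("Ranjan", "Pragranth"), 2))   # 0.22
```

Record pairs whose similarity exceeds a chosen threshold are placed into the same canopy, and only those pairs are compared in detail.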
IV. Experimental Results
Several measures are used to evaluate both the complexity and the quality of indexing. Let M_n and N_n be the numbers of matched and non-matched candidate record pairs generated by an indexing technique, and let T_M and T_N be the total numbers of true matched and true non-matched record pairs, so that M_n <= T_M and N_n <= T_N. An indexing technique removes part of the comparison space; the fraction removed is measured by the reduction ratio (RR):

    RR = 1 - (M_n + N_n) / (T_M + T_N)          (3)

The ratio of the number of true matched candidate pairs to the total number of true matched pairs is called pairs completeness (PC):

    PC = M_n / T_M                              (4)

The ratio of the number of true matched candidate pairs to the total number of candidate pairs generated is called pairs quality (PQ):

    PQ = M_n / (M_n + N_n)                      (5)

A high PQ value means that the indexing technique is efficient and generates mostly truly matching record pairs, while a low PQ value means that many non-matches are generated. The f-score is the harmonic mean of pairs completeness and pairs quality:

    f-score = 2 * PC * PQ / (PC + PQ)           (6)

Performance is evaluated using these measures (a short code sketch computing them is given after Table 2). Table 2 shows clearly that q-gram based indexing is very slow compared to the other techniques, while the array-based sorted neighborhood approach and traditional blocking are very fast.

Table 2. Runtime evaluation for each indexing technique per candidate record pair.

Indexing technique                       | Minimum (ms) | Maximum (ms)
Traditional blocking                     | 0.002        | 0.972
Q-gram based indexing                    | 0.005        | 163,484.394
Array-based sorted neighborhood          | 0.011        | 0.288
Inverted index-based sorted neighborhood | 0.002        | 3.040
Suffix-array based indexing              | 0.024        | 168.561
Canopy clustering                        | 0.003        | 380.214
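As referenced above, the following minimal sketch computes the measures of Eqs. (3)-(6). The variable names follow the notation in the text, and the example counts are invented purely for illustration.

```python
def evaluate_indexing(m_n, n_n, t_m, t_n):
    # m_n, n_n: matched / non-matched candidate pairs produced by the index.
    # t_m, t_n: true matched / non-matched pairs in the full comparison space.
    rr = 1.0 - (m_n + n_n) / (t_m + t_n)   # reduction ratio, Eq. (3)
    pc = m_n / t_m                         # pairs completeness, Eq. (4)
    pq = m_n / (m_n + n_n)                 # pairs quality, Eq. (5)
    f = 2 * pc * pq / (pc + pq)            # f-score, Eq. (6)
    return rr, pc, pq, f

# Invented counts: 900 of 1,000 true matches survive indexing,
# together with 9,100 of 1,000,000 true non-matches.
print(evaluate_indexing(m_n=900, n_n=9_100, t_m=1_000, t_n=1_000_000))
```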
V. Conclusion
In this paper, five indexing techniques have been discussed. Candidate record pair generation was described for each technique, and their complexity was analyzed on that basis. Accurate and efficient indexing is obtained only when the blocking keys are defined carefully; because the data are always divided into matches and non-matches, suitable blocking keys can be identified accordingly. Future work is to develop new, more efficient techniques that place only highly similar records into the same block, so that few comparisons are wasted on record pairs with very low similarity.

References
[1] A. Aizawa and K. Oyama, "A Fast Linkage Detection Scheme for Multi-Source Information Integration," Proc. Int'l Workshop Challenges in Web Information Retrieval and Integration (WIRI '05), 2005.
[2] A.K. Elmagarmid, P.G. Ipeirotis, and V.S. Verykios, "Duplicate Record Detection: A Survey," IEEE Trans. Knowledge and Data Eng., vol. 19, no. 1, pp. 1-16, Jan. 2007.
[3] D.E. Clark, "Practical Introduction to Record Linkage for Injury Research," Injury Prevention, vol. 10, pp. 186-191, 2004.
[4] E. Rahm and H.H. Do, "Data Cleaning: Problems and Current Approaches," IEEE Technical Committee Data Eng. Bull., vol. 23, no. 4, pp. 3-13, Dec. 2000.
[5] M. Bilenko and R.J. Mooney, "On Evaluation and Training-Set Construction for Duplicate Detection," Proc. Workshop Data Cleaning, Record Linkage and Object Consolidation (SIGKDD '03), pp. 7-12, 2003.
[6] W.W. Cohen, P. Ravikumar, and S. Fienberg, "A Comparison of String Distance Metrics for Name-Matching Tasks," Proc. Workshop Information Integration on the Web (IJCAI '03), 2003.