Efficient Parallel Set-Similarity Joins Using MapReduce


Tilani Gunawardena

Content
• Introduction
• Preliminaries
• Self-Join case
• R-S Join case
• Handling insufficient memory
• Experimental evaluation
• Conclusions

Introduction

• Vast amounts of data:
– Google N-gram database: ~1 trillion records
– GenBank: 100 million records, size = 416 GB
– Facebook: 400 million active users

• Detecting similar pairs of records becomes a
challenging problem

Examples
• Detecting near-duplicate web pages in web crawling
• Document clustering
• Plagiarism detection
• Master data management
– “John W. Smith” , “Smith, John” , “John William Smith”
• Making recommendations to users based on
their similarity to other users in query refinement
• Mining in social networking sites
– Users [1,0,0,1,1,0,1,0,0,1] and [1,0,0,0,1,0,1,0,1,1] have similar interests
• Identifying coalitions of click fraudsters in online advertising

Preliminaries
• Problem Statement: Given two collections of
objects/items/records, a similarity metric
sim(o1,o2) and a threshold λ , find the pairs of
objects/items/records satisfying sim(o1,o2)≥ λ

Set-similarity functions
• Jaccard or Tanimoto coefficient
– Jaccard(x, y) = |x ∩ y| / |x ∪ y|

• “I will call back” = [I, will, call, back]
• “I will call you soon” = [I, will, call, you, soon]

• Jaccard similarity = 3/6 = 0.5
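
The worked example above can be checked with a few lines of Python. This is a minimal sketch that assumes whitespace tokenization; the names `tokenize` and `jaccard` are illustrative and do not come from the slides.

```python
def tokenize(text):
    """Split a string into a set of word tokens (whitespace tokenization)."""
    return set(text.split())


def jaccard(x, y):
    """Jaccard coefficient |x ∩ y| / |x ∪ y| of two token sets."""
    if not x and not y:
        return 1.0  # convention chosen here for two empty sets (an assumption)
    return len(x & y) / len(x | y)


a = tokenize("I will call back")
b = tokenize("I will call you soon")
print(jaccard(a, b))  # 3 shared tokens out of 6 distinct tokens: 0.5
```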

Set-similarity with MapReduce
• Why Hadoop ?
– Large amounts of data, shared-nothing architecture

• map (k1,v1) -> list(k2,v2);
• reduce (k2,list(v2)) -> list(k3,v3)
• Problems:
– Too much data to transfer
– Too many pairs to verify (two similar sets share at least
1 token)

Set-Similarity Filtering
• Efficient set-similarity join algorithms rely on
effective filters

• string s = “I will call back”
• global token ordering {back, call, will, I}
• prefix of length 2 of s = [back, call]

• The prefix filtering principle states that similar strings
need to share at least one common token in their
prefixes.

Prefix filtering: example

[Figure: token sets of Record 1 and Record 2]

• Each set has 5 tokens
• “Similar”: they share at least 4 tokens
• Prefix length: 2
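
The prefix length in this example follows from a counting argument: if two sets of n tokens must share at least k tokens, then the first n - k + 1 tokens of each (under the same global ordering) must overlap, because the remaining k - 1 tokens are too few to hold all shared tokens. Below is a minimal sketch of that rule for an overlap threshold; the function name `prefix` and the two sample records are hypothetical.

```python
def prefix(tokens, global_order, min_overlap):
    """Return the first len(tokens) - min_overlap + 1 tokens under global_order."""
    rank = {token: i for i, token in enumerate(global_order)}
    ordered = sorted(tokens, key=lambda t: rank[t])
    length = max(len(ordered) - min_overlap + 1, 0)
    return ordered[:length]


# Hypothetical 5-token records that share 4 tokens (B, C, D, E).
order = ["A", "B", "C", "D", "E", "F"]
p1 = prefix({"A", "B", "C", "D", "E"}, order, min_overlap=4)
p2 = prefix({"B", "C", "D", "E", "F"}, order, min_overlap=4)
print(p1, p2)             # two prefixes of length 5 - 4 + 1 = 2
print(set(p1) & set(p2))  # the prefixes share at least one token
```

The same function reproduces the earlier string example: with the ordering [back, call, will, I] and a required overlap of 3, the prefix of “I will call back” is [back, call].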

Parallel Set-Similarity Joins
• Stage I: Token Ordering
– Compute data statistics for good signatures
• Stage II: RID-Pair Generation
• Stage III: Record Join
– Generate actual pairs of joined records

Input Data
• RID = Row ID
• a : join column
• “A B C” is a string:
• Address: “14th Saarbruecker Strasse”
• Name: “John W. Smith”

Stage I: Token Ordering
• Basic Token Ordering (BTO)
• One-Phase Token Ordering (OPTO)

Token Ordering

• Creates a global ordering of the tokens in the
join column, based on their frequency

RID   a           b   c
1     A B D A A   …   …
2     B B D A E   …   …

Global ordering (based on frequency):
Token       E  D  B  A
Frequency   1  2  3  4

Basic Token Ordering(BTO)

• 2 MapReduce cycles:
– 1st : compute token frequencies
– 2nd: sort the tokens by their frequencies

Basic Token Ordering – 1st MapReduce cycle

map:
• tokenize the join value of each record
• emit each token with no. of occurrences 1
reduce:
• for each token, compute total count (frequency)

Basic Token Ordering – 2nd MapReduce cycle

map:
• interchange key with value
reduce (use only 1 reducer):
• emit the value
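
The two BTO cycles can be sketched in plain Python. This is an illustration, not Hadoop code: `run_mapreduce` is a hypothetical in-memory stand-in for one MapReduce cycle, the two sample records are the ones from the Token Ordering slide, and breaking frequency ties alphabetically is an assumption made here.

```python
from collections import defaultdict


def run_mapreduce(inputs, mapper, reducer):
    """Tiny in-memory stand-in for one MapReduce cycle (hypothetical helper)."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):  # shuffle phase: keys reach the reducer sorted
        output.extend(reducer(k2, groups[k2]))
    return output


# 1st cycle: compute token frequencies.
def count_map(rid, join_value):
    for token in join_value.split():
        yield token, 1


def count_reduce(token, ones):
    yield token, sum(ones)


# 2nd cycle: swap key and value so the sort orders tokens by frequency.
def swap_map(token, frequency):
    yield frequency, token


def emit_reduce(frequency, tokens):
    for token in sorted(tokens):  # tie-break is an assumption
        yield token


records = [(1, "A B D A A"), (2, "B B D A E")]
frequencies = run_mapreduce(records, count_map, count_reduce)
ordering = run_mapreduce(frequencies, swap_map, emit_reduce)
print(ordering)  # ['E', 'D', 'B', 'A']
```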

One-Phase Token Ordering (OPTO)
• alternative to Basic Token Ordering (BTO):
– Uses only one MapReduce Cycle (less I/O)
– In-memory token sorting, instead of using a
reducer

OPTO – Details
map:
• tokenize the join value of each record
• emit each token with no. of occurrences 1
reduce:
• for each token, compute total count (frequency)
• use the tear_down method to order the tokens in memory
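
A rough sketch of the one-phase variant follows. It assumes a single reducer that keeps the totals in memory and sorts them in a hook called once after the last `reduce` call; the class `OptoReducer` and the driver `opto` are hypothetical names, and the alphabetical tie-break is an assumption.

```python
from collections import defaultdict


class OptoReducer:
    """Hypothetical reducer that defers sorting to a tear_down hook."""

    def __init__(self):
        self.frequency = {}

    def reduce(self, token, counts):
        # Keep the total in memory instead of emitting it.
        self.frequency[token] = sum(counts)

    def tear_down(self):
        # Runs once after the last reduce call: sort tokens by frequency.
        return sorted(self.frequency, key=lambda t: (self.frequency[t], t))


def opto(records):
    grouped = defaultdict(list)
    for _rid, join_value in records:  # map: emit (token, 1)
        for token in join_value.split():
            grouped[token].append(1)
    reducer = OptoReducer()
    for token, ones in grouped.items():
        reducer.reduce(token, ones)
    return reducer.tear_down()


print(opto([(1, "A B D A A"), (2, "B B D A E")]))  # ['E', 'D', 'B', 'A']
```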

Stage II: RID-Pair Generation

• Basic Kernel (BK)
• Indexed Kernel (PK)

RID-Pair Generation
• scans the original input data (records)
• outputs the pairs of RIDs corresponding to records
satisfying the join predicate (sim)
• consists of only one MapReduce cycle

Global ordering of tokens obtained in the previous
stage

RID-Pair Generation: Map Phase

• scan input records and for each record:
– project it on RID & join attribute
– tokenize it
– extract prefix according to global ordering of tokens obtained in the Token
Ordering stage
– route tokens to appropriate reducer

Grouping/Routing Strategies

• Goal: distribute candidates to the right
reducers to minimize reducers’ workload
• Like hashing (projected)records to the
corresponding candidate-buckets
• Each reducer handles one/more candidate-
buckets
• 2 routing strategies:
– using individual tokens
– using grouped tokens

Routing: using individual tokens

• Treat each token as a key
• For each record, generates a (key, value) pair for each
of its prefix tokens:
Example:
• Given the global ordering:
Token A B E D G C F
Frequency 10 10 22 23 23 40 48

“A B C”
=> prefix of length 2: A,B
=> generate/emit 2 (key,value) pairs:
• (A, (1,A B C))
• (B, (1,A B C))
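
The example above can be reproduced with a short sketch of the map-side routing. The function name `route_by_token` is hypothetical, and the prefix length is passed in directly rather than derived from a threshold.

```python
def route_by_token(rid, join_value, global_order, prefix_length):
    """Emit one (token, (rid, join_value)) pair per prefix token."""
    rank = {token: i for i, token in enumerate(global_order)}
    tokens = sorted(set(join_value.split()), key=lambda t: rank[t])
    return [(token, (rid, join_value)) for token in tokens[:prefix_length]]


global_order = ["A", "B", "E", "D", "G", "C", "F"]  # rarest token first
for pair in route_by_token(1, "A B C", global_order, prefix_length=2):
    print(pair)
# ('A', (1, 'A B C'))
# ('B', (1, 'A B C'))
```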

Grouping/Routing: using individual tokens

• Advantage:
– high quality of grouping of candidates( pairs of
records that have no chance of being similar, are
never routed to the same reducer)
• Disadvantage:
– high replication of data (same records might be
checked for similarity in multiple reducers, i.e.
redundant work)

Routing: Using Grouped Tokens
• Multiple tokens mapped to one synthetic key
(different tokens can be mapped to the same key)
• For each record, generate a (key, value) pair for each
group of its prefix tokens:

Example:
• Given the global ordering:
Token A B E D G C F
Frequency 10 10 22 23 23 40 48

“A B C” => prefix of length 2: A,B
Suppose A belongs to group X and
B belongs to group Y
=> generate/emit 2 (key,value) pairs:
• (X, (1,A B C))
• (Y, (1,A B C))

Grouping/Routing: Using Grouped Tokens

• The groups of tokens (X, Y) are formed by assigning
tokens to groups in a round-robin manner
Token A B E D G C F
Frequency 10 10 22 23 23 40 48

Group1: A, D, F
Group2: B, G
Group3: E, C
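
A minimal sketch of round-robin grouping and the resulting routing, using the token ordering from the slide. The function names are hypothetical, and group keys are represented as integer indexes (0, 1, 2) rather than the labels used on the slides.

```python
def round_robin_groups(global_order, n_groups):
    """Assign tokens to groups in round-robin order of the global ordering."""
    groups = [[] for _ in range(n_groups)]
    for i, token in enumerate(global_order):
        groups[i % n_groups].append(token)
    return groups


def route_by_group(rid, join_value, global_order, prefix_length, n_groups):
    """Emit one (group, (rid, join_value)) pair per distinct prefix-token group."""
    rank = {token: i for i, token in enumerate(global_order)}
    tokens = sorted(set(join_value.split()), key=lambda t: rank[t])
    group_ids = []
    for token in tokens[:prefix_length]:
        group = rank[token] % n_groups
        if group not in group_ids:  # a record is sent to each group only once
            group_ids.append(group)
    return [(group, (rid, join_value)) for group in group_ids]


global_order = ["A", "B", "E", "D", "G", "C", "F"]
print(round_robin_groups(global_order, 3))
# [['A', 'D', 'F'], ['B', 'G'], ['E', 'C']]
print(route_by_group(1, "A B C", global_order, prefix_length=2, n_groups=3))
# [(0, (1, 'A B C')), (1, (1, 'A B C'))]
```

When two prefix tokens fall into the same group, the record is emitted once instead of twice, which is where the reduced replication comes from.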

Grouping/Routing: Using Grouped Tokens
• Advantage:
– less replication of record projections

• Disadvantage:
– Quality of grouping is lower (records having no
chance of being similar are sent to the same reducer,
which checks their similarity)

– “ABCD” (A, B belong to Group X; C belongs to Group Y)
• output: (X,_) & (Y,_)
– “EFG” (E belongs to Group Y)
• output: (Y,_)

RID-Pair Generation: Reduce Phase

• This is the core of the entire method
• Each reducer processes one/more buckets
• In each bucket, the reducer looks for pairs of join attribute values
satisfying the join predicate
If the similarity of the 2 candidates >= threshold
=> output their ids and also their similarity

Bucket of
candidates

RID-Pair Generation: Reduce Phase

• Computing similarity of the candidates in a
bucket comes in 2 flavors:

• Basic Kernel : uses 2 nested loops to verify each pair of
candidates in the bucket

• Indexed Kernel : uses a PPJoin+ index

RID-Pair Generation: Basic Kernel

• Straightforward method for finding candidates satisfying
the join predicate
• Quadratic complexity: O(#candidates²)
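
A minimal sketch of such a nested-loop kernel, assuming Jaccard similarity over whitespace tokens. The function names and the sample bucket are hypothetical.

```python
def jaccard(x, y):
    """Jaccard coefficient of two non-empty token sets."""
    return len(x & y) / len(x | y)


def basic_kernel(bucket, threshold):
    """Verify every pair in a bucket of (rid, join_value) candidates."""
    results = []
    for i in range(len(bucket)):
        rid_i, tokens_i = bucket[i][0], set(bucket[i][1].split())
        for j in range(i + 1, len(bucket)):
            rid_j, tokens_j = bucket[j][0], set(bucket[j][1].split())
            similarity = jaccard(tokens_i, tokens_j)
            if similarity >= threshold:
                results.append((rid_i, rid_j, similarity))
    return results


bucket = [
    (1, "I will call back"),
    (2, "I will call you soon"),
    (3, "see you soon"),
]
print(basic_kernel(bucket, threshold=0.5))  # [(1, 2, 0.5)]
```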

RID-Pair Generation: PPJoin+ Indexed Kernel
• Uses a special index data structure
• Not so straightforward to implement
• map(): same as in the BK algorithm
• Much more efficient than BK

Stage III: Record Join
• Until now we have only pairs of RIDs, but we need actual
records
• Use the RID pairs generated in the previous stage to join
the actual records
• Main idea:
– bring in the rest of each record (everything except the RID,
which we already have)
• 2 approaches:
– Basic Record Join (BRJ)
– One-Phase Record Join (OPRJ)

Record Join: Basic Record Join

• Uses 2 MapReduce cycles
– 1st cycle: fills in the record information for each half of each pair
– 2nd cycle: brings together the previously filled in records
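
The two cycles can be sketched in memory as follows. This is an illustration under assumptions: the function name `basic_record_join`, the tuple layouts, and the sample data are hypothetical, and the shuffle phases are simulated with dictionaries.

```python
from collections import defaultdict


def basic_record_join(records, rid_pairs):
    """Join (rid, record) rows with (rid1, rid2, similarity) pairs in two cycles."""
    # 1st cycle, map and shuffle: group each record with the pairs naming its RID.
    by_rid = defaultdict(lambda: {"record": None, "pairs": []})
    for rid, record in records:
        by_rid[rid]["record"] = record
    for r1, r2, sim in rid_pairs:
        by_rid[r1]["pairs"].append((r1, r2, sim))
        by_rid[r2]["pairs"].append((r1, r2, sim))

    # 1st cycle, reduce: emit one half-filled pair per (RID, pair) combination.
    half_filled = []
    for rid, group in by_rid.items():
        for r1, r2, sim in group["pairs"]:
            half_filled.append(((r1, r2), (rid, group["record"], sim)))

    # 2nd cycle: group the halves by pair key and bring them together.
    by_pair = defaultdict(list)
    for pair_key, half in half_filled:
        by_pair[pair_key].append(half)
    joined = []
    for (r1, r2), halves in sorted(by_pair.items()):
        record_of = {rid: record for rid, record, _sim in halves}
        joined.append((record_of[r1], record_of[r2], halves[0][2]))
    return joined


records = [(1, "I will call back"), (2, "I will call you soon"), (3, "see you soon")]
print(basic_record_join(records, [(1, 2, 0.5)]))
# [('I will call back', 'I will call you soon', 0.5)]
```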

Record Join: One-Phase Record Join

• Uses only one MapReduce cycle
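
One way to get down to a single cycle is to load the full list of similar-RID pairs into memory on every mapper, so that each record can be routed to its pairs directly. The sketch below assumes that design; the function name `one_phase_record_join` and the data layout are hypothetical.

```python
from collections import defaultdict


def one_phase_record_join(records, rid_pairs):
    """Join records to RID pairs in one cycle, with the pairs held in memory."""
    # Setup (assumed): every mapper loads an in-memory index RID -> pairs.
    pairs_of = defaultdict(list)
    for r1, r2, sim in rid_pairs:
        pairs_of[r1].append((r1, r2, sim))
        pairs_of[r2].append((r1, r2, sim))

    # map: send each record to every pair that mentions its RID.
    by_pair = defaultdict(dict)
    for rid, record in records:
        for pair in pairs_of.get(rid, []):
            by_pair[pair][rid] = record

    # reduce: each pair key receives both records and emits the joined row.
    joined = []
    for (r1, r2, sim), record_of in sorted(by_pair.items()):
        if r1 in record_of and r2 in record_of:
            joined.append((record_of[r1], record_of[r2], sim))
    return joined


records = [(1, "I will call back"), (2, "I will call you soon"), (3, "see you soon")]
print(one_phase_record_join(records, [(1, 2, 0.5)]))
# [('I will call back', 'I will call you soon', 0.5)]
```

The in-memory index is the trade-off: it saves a cycle, but its size grows with the number of similar pairs rather than shrinking as nodes are added.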

R-S Join

• Challenge: We now have 2 different record sources => 2
different input streams

• MapReduce can work on only 1 input stream

• 2nd and 3rd stage affected

• Solution: extend the (key, value) pairs so that they include a
relation tag for each record
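
A sketch of how a relation tag might be used inside a reducer: records from both relations arrive in one bucket, and the tag lets the kernel compare only R records against S records. The tag values "R" and "S", the function names, and the sample data are assumptions made for illustration.

```python
def jaccard(x, y):
    """Jaccard coefficient of two non-empty token sets."""
    return len(x & y) / len(x | y)


def tag_records(relation, records):
    """Attach a relation tag to each (rid, join_value) record."""
    return [(relation, rid, join_value) for rid, join_value in records]


def rs_kernel(bucket, threshold):
    """Verify only cross-relation pairs in a bucket of tagged records."""
    r_side = [rec for rec in bucket if rec[0] == "R"]
    s_side = [rec for rec in bucket if rec[0] == "S"]
    results = []
    for _tag_r, rid_r, value_r in r_side:
        for _tag_s, rid_s, value_s in s_side:
            similarity = jaccard(set(value_r.split()), set(value_s.split()))
            if similarity >= threshold:
                results.append((rid_r, rid_s, similarity))
    return results


bucket = tag_records("R", [(1, "I will call back")]) + tag_records(
    "S", [(1, "I will call you soon"), (2, "see you soon")]
)
print(rs_kernel(bucket, threshold=0.5))  # [(1, 1, 0.5)]
```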

Handling Insufficient Memory
• Map-Based Block Processing
• Reduce-Based Block Processing

Evaluation

• Cluster: 10-node IBM x3650, running Hadoop
• Data sets:
• DBLP: 1.2M publications
• CITESEERX: 1.3M publications
• Consider only the header of each paper (i.e., author, title, date of
publication, etc.)
• Data size synthetically increased (by various factors)
• Measure:
• Absolute running time
• Speedup
• Scaleup

Self-Join running time

• Best algorithm: BTO-PK-OPRJ
• Most expensive stage: the
RID-pair generation

Self-Join Speedup

• Fixed data size, vary the
cluster size
• Best time: BTO-PK-OPRJ

Self-Join Scaleup

• Increase data size and
cluster size together by the
same factor
• Best time: BTO-PK-OPRJ

Self-Join Summary
• Stage I: BTO was the best choice.
• Stage II: PK was the best choice.
• Stage III: the best choice depends on the amount
of data and the size of the cluster
– OPRJ was somewhat faster, but the cost of loading the
similar-RID pairs in memory stayed constant as the
cluster size increased, and grew as the
data size increased. For these reasons, we recommend
BRJ as a good alternative
• Best scaleup was achieved by BTO-PK-BRJ

R-S Join Speedup
• Stage I: R-S join performance was identical to
the first stage in the self-join case
• Stage II: similar (almost perfect) speedup
as in the self-join case
• Stage III: the OPRJ approach was initially the
fastest (for the 2- and 4-node cases), but it
eventually became slower than the BRJ
approach

Conclusions

• For both self-join and R-S join cases, we recommend BTO-
PK-BRJ as a robust and scalable method.

• Useful in many data cleaning scenarios

• SSJoin and MapReduce: one solution for huge datasets

• Very efficient when based on prefix-filtering and PPJoin+

• Scales up nicely
