Efficient Parallel Set-Similarity Joins Using MapReduce

Tilani Gunawardena
Content
• Introduction
• Preliminaries
• Self-Join case
• R-S Join case
• Handling insufficient memory
• Experimental evaluation
• Conclusions
Introduction

• Vast amounts of data:
  – Google N-gram database: ~1 trillion records
  – GenBank: 100 million records, size = 416 GB
  – Facebook: 400 million active users


• Detecting similar pairs of records becomes a
  challenging problem
Examples
• Detecting near-duplicate web pages in web crawling
• Document clustering
• Plagiarism detection
• Master data management
    – “John W. Smith”, “Smith, John”, “John William Smith”
• Making recommendations to users based on
  their similarity to other users in query refinement
• Mining in social networking sites
    – Users [1,0,0,1,1,0,1,0,0,1] & [1,0,0,0,1,0,1,0,1,1] have similar interests
• Identifying coalitions of click fraudsters in online advertising
Preliminaries
• Problem Statement: Given two collections of
  objects/items/records, a similarity metric
  sim(o1,o2) and a threshold λ, find the pairs of
  objects/items/records satisfying sim(o1,o2) ≥ λ
Set-similarity functions
• Jaccard or Tanimoto coefficient
   – Jaccard(x, y) = |x ∩ y| / |x ∪ y|


• “I will call back” =[I, will, call, back]
• “I will call you soon”=[I, will, call, you, soon]

• Jaccard similarity=3/6=0.5
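
The Jaccard computation above can be sketched in a few lines. This is a minimal illustration that assumes whitespace tokenization; the helper names `tokenize` and `jaccard` are hypothetical and not taken from the slides.

```python
def tokenize(text):
    """Split a string into a set of word tokens (whitespace split is an assumption)."""
    return set(text.split())


def jaccard(x, y):
    """Jaccard(x, y) = |x ∩ y| / |x ∪ y|; treated as 1.0 for two empty sets."""
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


a = tokenize("I will call back")
b = tokenize("I will call you soon")
print(jaccard(a, b))  # 3 shared tokens / 6 distinct tokens = 0.5
```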
Set-similarity with MapReduce
• Why Hadoop?
   – Large amounts of data, shared-nothing architecture




• map (k1,v1) -> list(k2,v2);
• reduce (k2,list(v2)) -> list(k3,v3)
• Problem:
   – Too much data to transfer
   – Too many pairs to verify (two similar sets share at least
     1 token, so any pair with a common token is a candidate)
Set-Similarity Filtering
• Efficient set-similarity join algorithms rely on
  effective filters

• string s = “I will call back”
• global token ordering {back, call, will, I}
• prefix of length 2 of s = [back, call]

• The prefix filtering principle states that similar strings
  need to share at least one common token in their
  prefixes.
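
A minimal sketch of prefix extraction for the example above: the tokens of the string are sorted by their position in the global ordering and the first few are kept. The function name `prefix` is an assumption made for illustration.

```python
def prefix(tokens, global_order, length):
    """Return the first `length` tokens of a record under the global token ordering."""
    rank = {tok: i for i, tok in enumerate(global_order)}
    ordered = sorted(set(tokens), key=lambda t: rank[t])
    return ordered[:length]


global_order = ["back", "call", "will", "I"]
s = "I will call back".split()
print(prefix(s, global_order, 2))  # ['back', 'call']
```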
Prefix filtering: example


   [Figure: token sets of Record 1 and Record 2]


• Each set has 5 tokens
• “Similar”: they share at least 4 tokens
• Prefix length: 2
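
The prefix length of 2 follows from the overlap requirement: with 5 tokens and at least 4 shared, the prefix needs 5 - 4 + 1 = 2 tokens. The sketch below assumes an overlap threshold (not Jaccard) and a hypothetical global ordering A..H; the function names are illustrative only. Passing the filter only makes a pair a candidate, so it still has to be verified.

```python
def prefix_length(set_size, min_overlap):
    """Prefix length for an overlap threshold: |x| - min_overlap + 1."""
    return set_size - min_overlap + 1


def share_prefix_token(x, y, global_order, min_overlap):
    """True if the prefixes of x and y (under global_order) have a common token."""
    rank = {tok: i for i, tok in enumerate(global_order)}
    px = sorted(x, key=rank.__getitem__)[:prefix_length(len(x), min_overlap)]
    py = sorted(y, key=rank.__getitem__)[:prefix_length(len(y), min_overlap)]
    return bool(set(px) & set(py))


order = list("ABCDEFGH")  # hypothetical global ordering
r1, r2, r3 = set("ABCDE"), set("BCDEF"), set("CDEFG")
print(prefix_length(5, 4))                   # 2
print(share_prefix_token(r1, r2, order, 4))  # True: overlap is 4, prefixes share B
print(share_prefix_token(r1, r3, order, 4))  # False: overlap is only 3, pair is pruned
```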
Parallel Set-Similarity Joins
• Stage I: Token Ordering
     – Compute data statistics for good signatures
• Stage II: RID-Pair Generation
• Stage III: Record Join
     – Generate actual pairs of joined records
Input Data
• RID = Row ID
• a : join column
• “A B C” is a string:
   • Address: “14th Saarbruecker Strasse”
   • Name: “John W. Smith”
Stage I: Token Ordering
• Basic Token Ordering (BTO)
• One Phase Token Ordering (OPTO)
Token Ordering

• Creates a global ordering of the tokens in the
  join column, based on their frequency
   RID   a            b    c
   1     A B D A A    …    …
   2     B B D A E    …    …

   Global ordering (based on frequency):
   Token       E   D   B   A
   Frequency   1   2   3   4
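
As a quick check of the table above, the ordering can be recomputed in memory. This is a single-machine sketch, not the MapReduce version; the alphabetical tie-break and the function name `global_token_order` are assumptions.

```python
from collections import Counter

records = {1: "A B D A A", 2: "B B D A E"}


def global_token_order(records):
    """Count token frequencies in the join column and sort tokens by ascending frequency."""
    freq = Counter(tok for value in records.values() for tok in value.split())
    # Ties are broken alphabetically so the result is deterministic (an assumption).
    return sorted(freq, key=lambda t: (freq[t], t)), freq


order, freq = global_token_order(records)
print(order)       # ['E', 'D', 'B', 'A']
print(dict(freq))  # A: 4, B: 3, D: 2, E: 1
```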
Basic Token Ordering (BTO)

• 2 MapReduce cycles:
  – 1st : compute token frequencies
  – 2nd: sort the tokens by their frequencies
Basic Token Ordering – 1st MapReduce cycle
map:
  • tokenize the join value of each record
  • emit each token with no. of occurrences 1

reduce:
  • for each token, compute total count (frequency)
Basic Token Ordering – 2nd MapReduce cycle




map:
  • interchange key with value

reduce (use only 1 reducer):
  • emits the value
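
The two BTO cycles can be sketched with a tiny in-memory stand-in for MapReduce. The driver `run_cycle` is hypothetical and only imitates the group-by-key and sort-by-key behaviour of a real job on one machine; the alphabetical tie-break in the second reducer is also an assumption.

```python
from collections import defaultdict


def run_cycle(inputs, map_fn, reduce_fn):
    """In-memory stand-in for one MapReduce cycle (keys are visited in sorted order)."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    out = []
    for k2 in sorted(groups):
        out.extend(reduce_fn(k2, groups[k2]))
    return out


# 1st cycle: compute token frequencies.
def count_map(rid, join_value):
    for token in join_value.split():
        yield token, 1


def count_reduce(token, ones):
    yield token, sum(ones)


# 2nd cycle: swap key and value so the framework sorts by frequency.
def swap_map(token, count):
    yield count, token


def emit_reduce(count, tokens):
    for token in sorted(tokens):  # tie-break within one frequency (assumption)
        yield token


records = [(1, "A B D A A"), (2, "B B D A E")]
frequencies = run_cycle(records, count_map, count_reduce)
ordering = run_cycle(frequencies, swap_map, emit_reduce)
print(ordering)  # ['E', 'D', 'B', 'A']
```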
One Phase Token Ordering (OPTO)
• alternative to Basic Token Ordering (BTO):
  – Uses only one MapReduce Cycle (less I/O)
  – In-memory token sorting, instead of using a
    reducer
OPTO – Details
map:
  • tokenize the join value of each record
  • emit each token with no. of occurrences 1

reduce:
  • for each token, compute total count (frequency)
  • use the tear_down method to order the tokens in memory
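
A rough sketch of the OPTO idea: the reducer keeps the counts in memory and sorts them once in `tear_down`, so no second cycle is needed. The class `OptoReducer`, the driver `opto`, the use of a single reducer, and the alphabetical tie-break are all assumptions for illustration; the token list is assumed to fit in memory.

```python
from collections import defaultdict


class OptoReducer:
    """Reducer that accumulates token counts and orders them when it is torn down."""

    def __init__(self):
        self.counts = {}

    def reduce(self, token, ones):
        # Called once per token: compute its total frequency and keep it in memory.
        self.counts[token] = sum(ones)

    def tear_down(self):
        # Called once after all reduce() calls: sort tokens by ascending frequency.
        return sorted(self.counts, key=lambda t: (self.counts[t], t))


def opto(records):
    """Single in-memory cycle: map emits (token, 1), the reducer counts and then sorts."""
    groups = defaultdict(list)
    for rid, join_value in records:
        for token in join_value.split():
            groups[token].append(1)
    reducer = OptoReducer()
    for token, ones in groups.items():
        reducer.reduce(token, ones)
    return reducer.tear_down()


print(opto([(1, "A B D A A"), (2, "B B D A E")]))  # ['E', 'D', 'B', 'A']
```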
Stage II: RID-Pair Generation

• Basic Kernel (BK)
• Indexed Kernel (PK)
RID-Pair Generation
• scans the original input data (records)
• outputs the pairs of RIDs corresponding to records
  satisfying the join predicate (sim)
• consists of only one MapReduce cycle
• uses the global ordering of tokens obtained in the previous
  stage
RID-Pair Generation: Map Phase

• scan input records and for each record:
   – project it on RID & join attribute
   – tokenize it
   – extract its prefix according to the global ordering of tokens obtained in the
     Token Ordering stage
   – route the projected record to the appropriate reducer(s), keyed by its prefix tokens
Grouping/Routing Strategies

• Goal: distribute candidates to the right
  reducers to minimize the reducers’ workload
• Like hashing (projected) records to the
  corresponding candidate buckets
• Each reducer handles one or more candidate buckets
• 2 routing strategies:
   – Using Individual Tokens
   – Using Grouped Tokens
Routing: using individual tokens

• Treat each token as a key
• For each record, generates a (key, value) pair for each
  of its prefix tokens

Example:
• Given the global ordering:
    Token      A   B   E   D   G   C   F
    Frequency  10  10  22  23  23  40  48

  “A B C”
  => prefix of length 2: A,B
  => generate/emit 2 (key,value) pairs:
       • (A, (1,A B C))
       • (B, (1,A B C))
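
The example above can be sketched as a map-side routing function. The name `route_individual` and its signature are hypothetical; the global ordering is the one shown on the slide.

```python
GLOBAL_ORDER = ["A", "B", "E", "D", "G", "C", "F"]  # ordering from the slide
RANK = {tok: i for i, tok in enumerate(GLOBAL_ORDER)}


def route_individual(rid, join_value, prefix_len):
    """Emit one (token, (rid, join_value)) pair per prefix token of the record."""
    tokens = sorted(set(join_value.split()), key=RANK.__getitem__)
    for token in tokens[:prefix_len]:
        yield token, (rid, join_value)


print(list(route_individual(1, "A B C", 2)))
# [('A', (1, 'A B C')), ('B', (1, 'A B C'))]
```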
Grouping/Routing: using individual tokens

• Advantage:
  – high quality of grouping of candidates (pairs of
    records that have no chance of being similar are
    never routed to the same reducer)
• Disadvantage:
  – high replication of data (the same records might be
    checked for similarity in multiple reducers, i.e.
    redundant work)
Routing: Using Grouped Tokens
• Multiple tokens mapped to one synthetic key
  (different tokens can be mapped to the same key)
• For each record, generates a (key, value) pair for each
  group of its prefix tokens

Example:
• Given the global ordering:
    Token      A   B   E   D   G   C   F
    Frequency  10  10  22  23  23  40  48

  “A B C” => prefix of length 2: A,B
  Suppose A,B belong to group X and
          C belongs to group Y
  => generate/emit 2 (key,value) pairs:
       • (X, (1,A B C))
       • (Y, (1,A B C))
Grouping/Routing: Using Grouped Tokens

• The groups of tokens (X,Y) are formed assigning
  tokens to groups in a Round-Robin manner
    Token      A   B   E   D   G   C   F
    Frequency  10  10  22  23  23  40  48

    Group1: A D F
    Group2: B G
    Group3: E C
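
A small sketch of the round-robin assignment and of routing by group. The function names and the integer group ids (1, 2, 3) are assumptions; the grouping it produces matches the three groups listed above.

```python
GLOBAL_ORDER = ["A", "B", "E", "D", "G", "C", "F"]  # ordering from the slide
RANK = {tok: i for i, tok in enumerate(GLOBAL_ORDER)}


def round_robin_groups(order, num_groups):
    """Assign tokens to groups 1..num_groups in round-robin order of the global ordering."""
    return {tok: i % num_groups + 1 for i, tok in enumerate(order)}


def route_grouped(rid, join_value, prefix_len, group_of):
    """Emit one (group, (rid, join_value)) pair per distinct group among the prefix tokens."""
    tokens = sorted(set(join_value.split()), key=RANK.__getitem__)
    for group in sorted({group_of[t] for t in tokens[:prefix_len]}):
        yield group, (rid, join_value)


group_of = round_robin_groups(GLOBAL_ORDER, 3)
print(group_of)  # A, D, F -> 1; B, G -> 2; E, C -> 3
print(list(route_grouped(1, "A B C", 2, group_of)))  # keys 1 and 2
print(list(route_grouped(2, "A D F", 2, group_of)))  # key 1 only: A and D share a group
```

When several prefix tokens fall into the same group, the record is emitted once for that group, which is where the reduced replication comes from.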
Grouping/Routing: Using Grouped Tokens
• Advantage:
  – less replication of record projections

• Disadvantage:
  – Quality of grouping is not so high (records having no
    chance of being similar are sent to the same reducer,
    which checks their similarity)

  – “ABCD” (A,B belong to Group X; C belongs to Group Y)
     • output: (X,_) & (Y,_)
  – “EFG” (E belongs to Group Y)
     • output: (Y,_)
RID-Pair Generation: Reduce Phase

  • This is the core of the entire method
  • Each reducer processes one or more buckets
  • In each bucket, the reducer looks for pairs of join attribute values
    satisfying the join predicate
  • If the similarity of 2 candidates >= threshold,
    output their IDs and also their similarity

  [Figure: a bucket of candidates]
RID-Pair Generation: Reduce Phase

• Computing similarity of the candidates in a
  bucket comes in 2 flavors:

     • Basic Kernel : uses 2 nested loops to verify each pair of
       candidates in the bucket



     • Indexed Kernel : uses a PPJoin+ index
RID-Pair Generation: Basic Kernel

• Straightforward method for finding candidates satisfying
  the join predicate
• Quadratic complexity: O(#candidates²)
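
A minimal sketch of the Basic Kernel: two nested loops over one bucket, verifying every pair. Jaccard is assumed as the similarity function, and the names `basic_kernel` and `jaccard` are illustrative.

```python
def jaccard(x, y):
    """Jaccard similarity of two token sets; 1.0 for two empty sets."""
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


def basic_kernel(bucket, threshold):
    """bucket: list of (rid, join_value). Returns (rid1, rid2, sim) for every similar pair."""
    results = []
    for i in range(len(bucket)):
        rid1, value1 = bucket[i]
        for j in range(i + 1, len(bucket)):
            rid2, value2 = bucket[j]
            sim = jaccard(set(value1.split()), set(value2.split()))
            if sim >= threshold:
                results.append((rid1, rid2, sim))
    return results


bucket = [(1, "A B C D"), (2, "A B C E"), (3, "A X Y Z")]
print(basic_kernel(bucket, 0.5))  # [(1, 2, 0.6)]
```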
RID-Pair Generation: PPJoin+ Indexed Kernel
• Uses a special index data structure
• Not so straightforward to implement
• map(): same as in the BK algorithm
• Much more efficient
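
To give a feel for why an index helps, here is a simplified sketch that is not PPJoin+ itself: it only builds an inverted index over prefix tokens and verifies the candidates it finds, whereas PPJoin+ additionally applies positional and suffix filtering. A Jaccard threshold is assumed, and all names are hypothetical.

```python
import math
from collections import defaultdict


def indexed_kernel(bucket, threshold, rank):
    """bucket: list of (rid, tokens). Returns (earlier_rid, later_rid, sim) for similar pairs."""
    recs = [(rid, sorted(set(toks), key=rank.__getitem__)) for rid, toks in bucket]
    index = defaultdict(list)  # prefix token -> positions of records already indexed
    results = []
    for pos, (rid, toks) in enumerate(recs):
        # Jaccard prefix length |x| - ceil(t * |x|) + 1; the epsilon guards against
        # floating-point round-up and can only make the prefix longer (still safe).
        plen = len(toks) - math.ceil(threshold * len(toks) - 1e-9) + 1
        candidates = set()
        for tok in toks[:plen]:
            candidates.update(index[tok])
        for c in candidates:
            other_rid, other = recs[c]
            sim = len(set(toks) & set(other)) / len(set(toks) | set(other))
            if sim >= threshold:
                results.append((other_rid, rid, sim))
        for tok in toks[:plen]:
            index[tok].append(pos)
    return results


rank = {tok: i for i, tok in enumerate("ABCDEXYZ")}  # hypothetical global ordering
bucket = [(1, "A B C D".split()), (2, "A B C E".split()), (3, "A X Y Z".split())]
print(indexed_kernel(bucket, 0.6, rank))  # [(1, 2, 0.6)]
```

Only records that share a prefix token are ever compared, so the number of verifications is typically far below the quadratic count of the nested-loop kernel.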
Stage III: Record Join
• Until now we have only pairs of RIDs, but we need actual
  records
• Use the RID pairs generated in the previous stage to join
  the actual records
• Main idea:
   – bring in the rest of each record (everything except the RID,
     which we already have)
• 2 approaches:
   – Basic Record Join (BRJ)
   – One-Phase Record Join (OPRJ)
Record Join: Basic Record Join

• Uses 2 MapReduce cycles
   – 1st cycle: fills in the record information for each half of each pair
   – 2nd cycle: brings together the previously filled in records
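
The two BRJ cycles can be imitated in memory as follows. This is only a sketch: the grouping dictionaries stand in for the shuffle phase of each cycle, and the function name and data layout are assumptions.

```python
from collections import defaultdict


def basic_record_join(records, rid_pairs):
    """records: dict rid -> full record; rid_pairs: list of (rid1, rid2, sim)."""
    # 1st cycle: group RID pairs by RID and attach the full record to each half.
    by_rid = defaultdict(list)
    for pair in rid_pairs:
        rid1, rid2, _ = pair
        by_rid[rid1].append(pair)
        by_rid[rid2].append(pair)
    halves = []
    for rid, pairs in by_rid.items():
        for pair in pairs:
            halves.append((pair, (rid, records[rid])))

    # 2nd cycle: group the filled-in halves by pair and bring them together.
    by_pair = defaultdict(dict)
    for pair, (rid, record) in halves:
        by_pair[pair][rid] = record
    return [(by_pair[p][p[0]], by_pair[p][p[1]], p[2]) for p in sorted(by_pair)]


records = {1: (1, "A B C D", "x"), 2: (2, "A B C E", "y"), 3: (3, "A X Y Z", "z")}
print(basic_record_join(records, [(1, 2, 0.6)]))
# [((1, 'A B C D', 'x'), (2, 'A B C E', 'y'), 0.6)]
```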
Record Join: One Phase Record Join

• Uses only one MapReduce cycle
R-S Join

• Challenge: We now have 2 different record sources => 2
  different input streams

• MapReduce can work on only 1 input stream

• 2nd and 3rd stage affected

• Solution: extend the (key, value) pairs so that they include a
  relation tag for each record
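
A sketch of how a relation tag might be carried through the second stage: the map side adds the tag to the value, and the reduce side only compares records from different relations. The tags "R" and "S", the function names, and the use of Jaccard with individual-token routing are assumptions.

```python
def jaccard(x, y):
    """Jaccard similarity of two token sets; 1.0 for two empty sets."""
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


def rs_map(relation, rid, join_value, prefix_len, rank):
    """Emit (token, (relation tag, rid, join_value)) for each prefix token."""
    tokens = sorted(set(join_value.split()), key=rank.__getitem__)
    for token in tokens[:prefix_len]:
        yield token, (relation, rid, join_value)


def rs_reduce(token, tagged_values, threshold):
    """Verify only R-S pairs inside one bucket; pairs within one relation are skipped."""
    r_side = [v for v in tagged_values if v[0] == "R"]
    s_side = [v for v in tagged_values if v[0] == "S"]
    for _, rid_r, val_r in r_side:
        for _, rid_s, val_s in s_side:
            sim = jaccard(set(val_r.split()), set(val_s.split()))
            if sim >= threshold:
                yield rid_r, rid_s, sim


rank = {tok: i for i, tok in enumerate("ABCDE")}  # hypothetical global ordering
emitted = []
emitted += rs_map("R", 1, "A B C D", 1, rank)
emitted += rs_map("R", 2, "A B C D", 1, rank)
emitted += rs_map("S", 7, "A B C E", 1, rank)
bucket_a = [value for key, value in emitted if key == "A"]
print(list(rs_reduce("A", bucket_a, 0.5)))  # [(1, 7, 0.6), (2, 7, 0.6)]
```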
Handling Insufficient Memory
• Map-Based Block Processing
• Reduce-Based Block Processing
Evaluation

• Cluster: 10-node IBM x3650, running Hadoop
• Data sets:
       • DBLP: 1.2M publications
       • CITESEERX: 1.3M publications
       • Consider only the header of each paper (i.e., author, title, date of
          publication, etc.)
       • Data size synthetically increased (by various factors)
• Measure:
       • Absolute running time
       • Speedup
       • Scaleup
Self-Join running time

• Best algorithm: BTO-PK-OPRJ
• Most expensive stage: the
  RID-pair generation
Self-Join Speedup

• Fixed data size, vary the
  cluster size
• Best time: BTO-PK-OPRJ
Self-Join Scaleup

• Increase data size and
  cluster size together by the
  same factor
• Best time: BTO-PK-OPRJ
Self-Join Summary
• Stage I: BTO was the best choice.
• Stage II: PK was the best choice.
• Stage III: the best choice depends on the amount
  of data and the size of the cluster
  – OPRJ was somewhat faster, but the cost of loading the
    similar-RID pairs in memory was constant as the
    cluster size increased, and the cost increased as the
    data size increased. For these reasons, we recommend
    BRJ as a good alternative
• Best scaleup was achieved by BTO-PK-BRJ
R-S Join Performance
Speedup
• Stage I: R-S join performance was identical to
  the first stage in the self-join case
• Stage II: we noticed a similar (almost perfect)
  speedup as for the self-join case.
• Stage III: the OPRJ approach was initially the
  fastest (for the 2- and 4-node cases), but it
  eventually became slower than the BRJ
  approach.
Conclusions

• For both self-join and R-S join cases, we recommend
  BTO-PK-BRJ as a robust and scalable method.

• Useful in many data cleaning scenarios

• SSJoin and MapReduce: one solution for huge datasets

• Very efficient when based on prefix-filtering and PPJoin+

• Scales up nicely
Thank You!

Weitere ähnliche Inhalte

Was ist angesagt?

The Big M Method - Operation Research
The Big M Method - Operation ResearchThe Big M Method - Operation Research
The Big M Method - Operation Research2013901097
 
Public Goods and Common Resources
Public Goods and Common ResourcesPublic Goods and Common Resources
Public Goods and Common ResourcesTuul Tuul
 
Les structures de données.pptx
Les structures de données.pptxLes structures de données.pptx
Les structures de données.pptxPROFPROF11
 
Sensitivity analysis in linear programming problem ( Muhammed Jiyad)
Sensitivity analysis in linear programming problem ( Muhammed Jiyad)Sensitivity analysis in linear programming problem ( Muhammed Jiyad)
Sensitivity analysis in linear programming problem ( Muhammed Jiyad)Muhammed Jiyad
 
Goal Programming
Goal ProgrammingGoal Programming
Goal ProgrammingEvren E
 
National Income Concepts
National Income ConceptsNational Income Concepts
National Income ConceptsArnab Ghosh
 
Microeconomics: Price Discrimination and Their Degrees
Microeconomics: Price Discrimination and Their DegreesMicroeconomics: Price Discrimination and Their Degrees
Microeconomics: Price Discrimination and Their DegreesJitendra Kumar
 
Income And Substitution Effect
Income And Substitution EffectIncome And Substitution Effect
Income And Substitution Effectnight seem
 
Economic Essential Diagrams 2
Economic Essential Diagrams 2Economic Essential Diagrams 2
Economic Essential Diagrams 2evangelxoxo
 
Meeting 4 - Stolper - Samuelson theorem (International Economics)
Meeting 4 - Stolper - Samuelson theorem (International Economics)Meeting 4 - Stolper - Samuelson theorem (International Economics)
Meeting 4 - Stolper - Samuelson theorem (International Economics)Albina Gaisina
 

Was ist angesagt? (11)

The Big M Method - Operation Research
The Big M Method - Operation ResearchThe Big M Method - Operation Research
The Big M Method - Operation Research
 
Public Goods and Common Resources
Public Goods and Common ResourcesPublic Goods and Common Resources
Public Goods and Common Resources
 
Game theory
Game theoryGame theory
Game theory
 
Les structures de données.pptx
Les structures de données.pptxLes structures de données.pptx
Les structures de données.pptx
 
Sensitivity analysis in linear programming problem ( Muhammed Jiyad)
Sensitivity analysis in linear programming problem ( Muhammed Jiyad)Sensitivity analysis in linear programming problem ( Muhammed Jiyad)
Sensitivity analysis in linear programming problem ( Muhammed Jiyad)
 
Goal Programming
Goal ProgrammingGoal Programming
Goal Programming
 
National Income Concepts
National Income ConceptsNational Income Concepts
National Income Concepts
 
Microeconomics: Price Discrimination and Their Degrees
Microeconomics: Price Discrimination and Their DegreesMicroeconomics: Price Discrimination and Their Degrees
Microeconomics: Price Discrimination and Their Degrees
 
Income And Substitution Effect
Income And Substitution EffectIncome And Substitution Effect
Income And Substitution Effect
 
Economic Essential Diagrams 2
Economic Essential Diagrams 2Economic Essential Diagrams 2
Economic Essential Diagrams 2
 
Meeting 4 - Stolper - Samuelson theorem (International Economics)
Meeting 4 - Stolper - Samuelson theorem (International Economics)Meeting 4 - Stolper - Samuelson theorem (International Economics)
Meeting 4 - Stolper - Samuelson theorem (International Economics)
 

Ähnlich wie Efficient Parallel Set-Similarity Joins Using MapReduce

Just Count the Love-Hate Squares
Just Count the Love-Hate SquaresJust Count the Love-Hate Squares
Just Count the Love-Hate SquaresKyle Teague
 
Big Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindBig Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindEMC
 
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)Matthew Lease
 
Hash Functions FTW
Hash Functions FTWHash Functions FTW
Hash Functions FTWsunnygleason
 
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)Matthew Lease
 
An overview of Peer-to-Peer technology new
An overview of Peer-to-Peer technology newAn overview of Peer-to-Peer technology new
An overview of Peer-to-Peer technology newchizhangufl
 
PRESENTATION ON DATA STRUCTURE AND THEIR TYPE
PRESENTATION ON DATA STRUCTURE AND THEIR TYPEPRESENTATION ON DATA STRUCTURE AND THEIR TYPE
PRESENTATION ON DATA STRUCTURE AND THEIR TYPEnikhilcse1
 
456589.-Compiler-Design-Code-Generation (1).ppt
456589.-Compiler-Design-Code-Generation (1).ppt456589.-Compiler-Design-Code-Generation (1).ppt
456589.-Compiler-Design-Code-Generation (1).pptMohibKhan79
 
Best Practices for Data Warehousing with Amazon Redshift | AWS Public Sector ...
Best Practices for Data Warehousing with Amazon Redshift | AWS Public Sector ...Best Practices for Data Warehousing with Amazon Redshift | AWS Public Sector ...
Best Practices for Data Warehousing with Amazon Redshift | AWS Public Sector ...Amazon Web Services
 
SRV405 Deep Dive on Amazon Redshift
SRV405 Deep Dive on Amazon RedshiftSRV405 Deep Dive on Amazon Redshift
SRV405 Deep Dive on Amazon RedshiftAmazon Web Services
 
Cache aware hybrid sorter
Cache aware hybrid sorterCache aware hybrid sorter
Cache aware hybrid sorterManchor Ko
 
Paris data-geeks-2013-03-28
Paris data-geeks-2013-03-28Paris data-geeks-2013-03-28
Paris data-geeks-2013-03-28Ted Dunning
 
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)Matthew Lease
 
GOLDSTRIKETM 1: COINTERRA’S FIRST GENERATION CRYPTO-CURRENCY PROCESSOR FOR BI...
GOLDSTRIKETM 1: COINTERRA’S FIRST GENERATION CRYPTO-CURRENCY PROCESSOR FOR BI...GOLDSTRIKETM 1: COINTERRA’S FIRST GENERATION CRYPTO-CURRENCY PROCESSOR FOR BI...
GOLDSTRIKETM 1: COINTERRA’S FIRST GENERATION CRYPTO-CURRENCY PROCESSOR FOR BI...Javed Barkatullah
 
Wrapper induction construct wrappers automatically to extract information f...
Wrapper induction   construct wrappers automatically to extract information f...Wrapper induction   construct wrappers automatically to extract information f...
Wrapper induction construct wrappers automatically to extract information f...George Ang
 

Ähnlich wie Efficient Parallel Set-Similarity Joins Using MapReduce (20)

Just Count the Love-Hate Squares
Just Count the Love-Hate SquaresJust Count the Love-Hate Squares
Just Count the Love-Hate Squares
 
Big Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindBig Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilind
 
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
 
Hash Functions FTW
Hash Functions FTWHash Functions FTW
Hash Functions FTW
 
Deep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDBDeep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDB
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
Hadoop classes in mumbai
Hadoop classes in mumbaiHadoop classes in mumbai
Hadoop classes in mumbai
 
1 DES.pdf
1 DES.pdf1 DES.pdf
1 DES.pdf
 
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
 
An overview of Peer-to-Peer technology new
An overview of Peer-to-Peer technology newAn overview of Peer-to-Peer technology new
An overview of Peer-to-Peer technology new
 
PRESENTATION ON DATA STRUCTURE AND THEIR TYPE
PRESENTATION ON DATA STRUCTURE AND THEIR TYPEPRESENTATION ON DATA STRUCTURE AND THEIR TYPE
PRESENTATION ON DATA STRUCTURE AND THEIR TYPE
 
456589.-Compiler-Design-Code-Generation (1).ppt
456589.-Compiler-Design-Code-Generation (1).ppt456589.-Compiler-Design-Code-Generation (1).ppt
456589.-Compiler-Design-Code-Generation (1).ppt
 
Best Practices for Data Warehousing with Amazon Redshift | AWS Public Sector ...
Best Practices for Data Warehousing with Amazon Redshift | AWS Public Sector ...Best Practices for Data Warehousing with Amazon Redshift | AWS Public Sector ...
Best Practices for Data Warehousing with Amazon Redshift | AWS Public Sector ...
 
SRV405 Deep Dive on Amazon Redshift
SRV405 Deep Dive on Amazon RedshiftSRV405 Deep Dive on Amazon Redshift
SRV405 Deep Dive on Amazon Redshift
 
Cache aware hybrid sorter
Cache aware hybrid sorterCache aware hybrid sorter
Cache aware hybrid sorter
 
Paris data-geeks-2013-03-28
Paris data-geeks-2013-03-28Paris data-geeks-2013-03-28
Paris data-geeks-2013-03-28
 
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
 
GOLDSTRIKETM 1: COINTERRA’S FIRST GENERATION CRYPTO-CURRENCY PROCESSOR FOR BI...
GOLDSTRIKETM 1: COINTERRA’S FIRST GENERATION CRYPTO-CURRENCY PROCESSOR FOR BI...GOLDSTRIKETM 1: COINTERRA’S FIRST GENERATION CRYPTO-CURRENCY PROCESSOR FOR BI...
GOLDSTRIKETM 1: COINTERRA’S FIRST GENERATION CRYPTO-CURRENCY PROCESSOR FOR BI...
 
Wrapper induction construct wrappers automatically to extract information f...
Wrapper induction   construct wrappers automatically to extract information f...Wrapper induction   construct wrappers automatically to extract information f...
Wrapper induction construct wrappers automatically to extract information f...
 
R user group meeting 25th jan 2017
R user group meeting 25th jan 2017R user group meeting 25th jan 2017
R user group meeting 25th jan 2017
 

Mehr von Tilani Gunawardena PhD(UNIBAS), BSc(Pera), FHEA(UK), CEng, MIESL

Mehr von Tilani Gunawardena PhD(UNIBAS), BSc(Pera), FHEA(UK), CEng, MIESL (20)

BlockChain.pptx
BlockChain.pptxBlockChain.pptx
BlockChain.pptx
 
Introduction to data mining and machine learning
Introduction to data mining and machine learningIntroduction to data mining and machine learning
Introduction to data mining and machine learning
 
Introduction to cloud computing
Introduction to cloud computingIntroduction to cloud computing
Introduction to cloud computing
 
Data analytics
Data analyticsData analytics
Data analytics
 
Hadoop Eco system
Hadoop Eco systemHadoop Eco system
Hadoop Eco system
 
Parallel Computing on the GPU
Parallel Computing on the GPUParallel Computing on the GPU
Parallel Computing on the GPU
 
evaluation and credibility-Part 2
evaluation and credibility-Part 2evaluation and credibility-Part 2
evaluation and credibility-Part 2
 
evaluation and credibility-Part 1
evaluation and credibility-Part 1evaluation and credibility-Part 1
evaluation and credibility-Part 1
 
Machine Learning and Data Mining
Machine Learning and Data MiningMachine Learning and Data Mining
Machine Learning and Data Mining
 
K Nearest Neighbors
K Nearest NeighborsK Nearest Neighbors
K Nearest Neighbors
 
Decision tree
Decision treeDecision tree
Decision tree
 
kmean clustering
kmean clusteringkmean clustering
kmean clustering
 
Covering algorithm
Covering algorithmCovering algorithm
Covering algorithm
 
Hierachical clustering
Hierachical clusteringHierachical clustering
Hierachical clustering
 
Assosiate rule mining
Assosiate rule miningAssosiate rule mining
Assosiate rule mining
 
Big data in telecom
Big data in telecomBig data in telecom
Big data in telecom
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
MapReduce
MapReduceMapReduce
MapReduce
 
Cheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduceCheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduce
 
Pig Experience
Pig ExperiencePig Experience
Pig Experience
 

Kürzlich hochgeladen

Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxAreebaZafar22
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfNirmal Dwivedi
 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17Celine George
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibitjbellavia9
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.MaryamAhmad92
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxRamakrishna Reddy Bijjam
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jisc
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structuredhanjurrannsibayan2
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Pooja Bhuva
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxPooja Bhuva
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...Poonam Aher Patil
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfDr Vijay Vishwakarma
 
Plant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxPlant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxUmeshTimilsina1
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxJisc
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...pradhanghanshyam7136
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxDr. Sarita Anand
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17Celine George
 

Kürzlich hochgeladen (20)

Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
ICT role in 21st century education and it's challenges.

Efficient Parallel Set-Similarity Joins Using MapReduce

  • 1. Efficient Parallel Set-Similarity Joins Using MapReduce Tilani Gunawardena
  • 2. Content • Introduction • Preliminaries • Self-Join case • R-S Join case • Handling insufficient memory • Experimental evaluation • Conclusions
  • 3. Introduction • Vast amount of data: – Google N-gram database: ~1 trillion records – GeneBank: 100 million records, size = 416 GB – Facebook: 400 million active users • Detecting similar pairs of records becomes a challenging problem
  • 4. Examples • Detecting near-duplicate web pages in web crawling • Document clustering • Plagiarism detection • Master data management – “John W. Smith”, “Smith, John”, “John William Smith” • Making recommendations to users based on their similarity to other users in query refinement • Mining in social networking sites – Users [1,0,0,1,1,0,1,0,0,1] & [1,0,0,0,1,0,1,0,1,1] have similar interests • Identifying coalitions of click fraudsters in online advertising
  • 5. Preliminaries • Problem Statement: Given two collections of objects/items/records, a similarity metric sim(o1,o2) and a threshold λ , find the pairs of objects/items/records satisfying sim(o1,o2)≥ λ
  • 6. Set-similarity functions • Jaccard or Tanimoto coefficient – Jaccard(x, y) = |x ∩ y| / |x ∪ y| • “I will call back” = [I, will, call, back] • “I will call you soon” = [I, will, call, you, soon] • Jaccard similarity = 3/6 = 0.5
  • 7. Set-similarity with MapReduce • Why Hadoop? – Large amounts of data, shared-nothing architecture • map (k1,v1) -> list(k2,v2); • reduce (k2,list(v2)) -> list(k3,v3) • Problem: – Too much data to transfer – Too many pairs to verify (two similar sets share at least 1 token)
  • 8. Set-Similarity Filtering • Efficient set-similarity join algorithms rely on effective filters • string s =“I will call back” • global token ordering {back,call, will, I} • prefix of length 2 of s= [back, call] • prefix filtering principle states that similar strings need to share at least one common token in their prefixes.
  • 9. Prefix filtering: example (Record 1, Record 2) • Each set has 5 tokens • “Similar”: they share at least 4 tokens • Prefix length: 2 (= 5 − 4 + 1)
  • 10. Parallel Set-Similarity Joins • Stage I: Token Ordering – Compute data statistics for good signatures • Stage II: RID-Pair Generation • Stage III: Record Join – Generate actual pairs of joined records
  • 11. Input Data • RID = Row ID • a : join column • “A B C” is a string: • Address: “14th Saarbruecker Strasse” • Name: “John W. Smith”
  • 12. Stage I: Token Ordering • Basic Token Ordering(BTO) • One Phase Token Ordering (OPTO)
  • 13. Token Ordering • Creates a global ordering of the tokens in the join column, based on their frequency • Example input (columns RID, a, b, c): RID 1: a = “A B D A A”, b = …, c = …; RID 2: a = “B B D A E”, b = …, c = … • Global ordering (based on frequency): E (1), D (2), B (3), A (4)
  • 14. Basic Token Ordering(BTO) • 2 MapReduce cycles: – 1st : compute token frequencies – 2nd: sort the tokens by their frequencies
  • 15. Basic Token Ordering – 1st MapReduce cycle • map: – tokenize the join value of each record – emit each token with no. of occurrences 1 • reduce: – for each token, compute total count (frequency)
  • 16. Basic Token Ordering – 2nd MapReduce cycle • map: – interchange key with value • reduce (use only 1 reducer): – emits the value
  • 17. One Phase Token Ordering (OPTO) • Alternative to Basic Token Ordering (BTO): – Uses only one MapReduce cycle (less I/O) – In-memory token sorting, instead of using a reducer
  • 18. OPTO – Details • map: – tokenize the join value of each record – emit each token with no. of occurrences 1 • reduce: – for each token, compute total count (frequency) • Use tear_down method to order the tokens in memory
  • 19. Stage II: RID-Pair Generation • Basic Kernel (BK) • Indexed Kernel (PK)
  • 20. RID-Pair Generation • scans the original input data(records) • outputs the pairs of RIDs corresponding to records satisfying the join predicate(sim) • consists of only one MapReduce cycle Global ordering of tokens obtained in the previous stage
  • 21. RID-Pair Generation: Map Phase • scan input records and for each record: – project it on RID & join attribute – tokenize it – extract prefix according to global ordering of tokens obtained in the Token Ordering stage – route tokens to appropriate reducer
  • 22. Grouping/Routing Strategies • Goal: distribute candidates to the right reducers to minimize reducers’ workload • Like hashing (projected) records to the corresponding candidate-buckets • Each reducer handles one or more candidate-buckets • 2 routing strategies: – Using Individual Tokens – Using Grouped Tokens
  • 23. Routing: using individual tokens • Treat each token as a key • For each record, generate a (key, value) pair for each of its prefix tokens • Example: given the global ordering, token (frequency): A (10), B (10), E (22), D (23), G (23), C (40), F (48) • “A B C” => prefix of length 2: A, B => generate/emit 2 (key, value) pairs: • (A, (1, A B C)) • (B, (1, A B C))
  • 24. Grouping/Routing: using individual tokens • Advantage: – high quality of grouping of candidates (pairs of records that have no chance of being similar are never routed to the same reducer) • Disadvantage: – high replication of data (the same records might be checked for similarity in multiple reducers, i.e. redundant work)
  • 25. Routing: Using Grouped Tokens • Multiple tokens mapped to one synthetic key (different tokens can be mapped to the same key) • For each record, generate a (key, value) pair for each of the groups of its prefix tokens • Example: given the global ordering, token (frequency): A (10), B (10), E (22), D (23), G (23), C (40), F (48) • “A B C” => prefix of length 2: A, B • Suppose A belongs to group X and B belongs to group Y => generate/emit 2 (key, value) pairs: • (X, (1, A B C)) • (Y, (1, A B C))
  • 26. Grouping/Routing: Using Grouped Tokens • The groups of tokens (X, Y) are formed by assigning tokens to groups in a round-robin manner • Token (frequency): A (10), B (10), E (22), D (23), G (23), C (40), F (48) • Group1: A, D, F • Group2: B, G • Group3: E, C
  • 27. Grouping/Routing: Using Grouped Tokens • Advantage: – less replication of record projections • Disadvantage: – Quality of grouping is not so high (records having no chance of being similar are sent to the same reducer, which checks their similarity) – “ABCD” (A, B belong to Group X; C belongs to Group Y) • output: (X,_) & (Y,_) – “EFG” (E belongs to Group Y) • output: (Y,_)
  • 28. RID-Pair Generation: Reduce Phase • This is the core of the entire method • Each reducer processes one/more buckets • In each bucket, the reducer looks for pairs of join attribute values satisfying the join predicate If the similarity of the 2 candidates >= threshold => output their ids and also their similarity Bucket of candidates
  • 29. RID-Pair Generation: Reduce Phase • Computing similarity of the candidates in a bucket comes in 2 flavors: • Basic Kernel : uses 2 nested loops to verify each pair of candidates in the bucket • Indexed Kernel : uses a PPJoin+ index
  • 30. RID-Pair Generation: Basic Kernel • Straightforward method for finding candidates satisfying the join predicate • Quadratic complexity: O(#candidates²)
  • 31. RID-Pair Generation: PPJoin+ Indexed Kernel • Uses a special index data structure • Not so straightforward to implement • map(): same as in the BK algorithm • Much more efficient
  • 32. Stage III: Record Join • Until now we have only pairs of RIDs, but we need actual records • Use the RID pairs generated in the previous stage to join the actual records • Main idea: – bring in the rest of each record (everything except the RID, which we already have) • 2 approaches: – Basic Record Join (BRJ) – One-Phase Record Join (OPRJ)
  • 33. Record Join: Basic Record Join • Uses 2 MapReduce cycles – 1st cycle: fills in the record information for each half of each pair – 2nd cycle: brings together the previously filled in records
  • 34. Record Join: One Phase Record Join • Uses only one MapReduce cycle
  • 35. R-S Join • Challenge: We now have 2 different record sources => 2 different input streams • MapReduce can work on only 1 input stream • 2nd and 3rd stages affected • Solution: extend the (key, value) pairs so that they include a relation tag for each record
  • 36. Handling Insufficient Memory • Map-Based Block Processing. • Reduce-Based Block Processing
  • 37. Evaluation • Cluster: 10-node IBM x3650, running Hadoop • Data sets: • DBLP: 1.2M publications • CITESEERX: 1.3M publications • Consider only the header of each paper (i.e., author, title, date of publication, etc.) • Data size synthetically increased (by various factors) • Measure: • Absolute running time • Speedup • Scaleup
  • 38. Self-Join running time • Best algorithm: BTO-PK-OPRJ • Most expensive stage: the RID-pair generation
  • 39. Self-Join Speedup • Fixed data size, vary the cluster size • Best time: BTO-PK-OPRJ
  • 41. Self-Join Scaleup • Increase data size and cluster size together by the same factor • Best time: BTO-PK-OPRJ
  • 43. Self-Join Summary • Stage I: BTO was the best choice. • Stage II: PK was the best choice. • Stage III: the best choice depends on the amount of data and the size of the cluster – OPRJ was somewhat faster, but the cost of loading the similar-RID pairs in memory was constant as the cluster size increased, and the cost increased as the data size increased. For these reasons, we recommend BRJ as a good alternative • Best scaleup was achieved by BTO-PK-BRJ
  • 45. Speed Up • Stage I: R-S Join performance was identical to the first stage in the self-join case • Stage II: we noticed a similar (almost perfect) speedup as for the self-join case • Stage III: the OPRJ approach was initially the fastest (for the 2- and 4-node cases), but it eventually became slower than the BRJ approach.
  • 46. Conclusions • For both self-join and R-S join cases, we recommend BTO-PK-BRJ as a robust and scalable method. • Useful in many data cleaning scenarios • SSJoin and MapReduce: one solution for huge datasets • Very efficient when based on prefix-filtering and PPJoin+ • Scales up nicely
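
Stages I and II of the self-join (token ordering, prefix extraction, routing by individual prefix tokens, and the Basic Kernel) can be sketched in plain Python, with the map, shuffle, and reduce steps simulated in a single process. This is only an illustrative sketch, not the Hadoop implementation behind the slides: the function names `token_order`, `prefix`, and `self_join` are hypothetical, ties in token frequency are broken alphabetically, and the prefix length `|x| - ceil(λ·|x|) + 1` is the bound commonly used for a Jaccard threshold λ in the prefix-filtering literature, which the slides do not state.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def jaccard(x, y):
    """Jaccard(x, y) = |x ∩ y| / |x ∪ y| on token sets."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)


def token_order(records):
    """Stage I: rank tokens by ascending frequency (assumed tie-break: alphabetical)."""
    freq = Counter(tok for _, tokens in records for tok in tokens)
    ordered = sorted(freq, key=lambda t: (freq[t], t))
    return {tok: rank for rank, tok in enumerate(ordered)}


def prefix(tokens, rank, threshold):
    """Rarest tokens of a record under the global ordering.

    Assumed prefix length for a Jaccard threshold: |x| - ceil(threshold * |x|) + 1.
    """
    uniq = sorted(set(tokens), key=rank.__getitem__)
    length = len(uniq) - math.ceil(threshold * len(uniq)) + 1
    return uniq[:length]


def self_join(records, threshold):
    """Stage II: return {(rid1, rid2): similarity} for all similar pairs."""
    rank = token_order(records)

    # Map phase + shuffle: one (key, value) pair per prefix token,
    # grouped by key into candidate buckets.
    buckets = defaultdict(list)
    for rid, tokens in records:
        for tok in prefix(tokens, rank, threshold):
            buckets[tok].append((rid, tokens))

    # Reduce phase (Basic Kernel): nested loops over each bucket.
    # The dict also removes pairs found in more than one bucket.
    results = {}
    for candidates in buckets.values():
        for (r1, t1), (r2, t2) in combinations(candidates, 2):
            sim = jaccard(t1, t2)
            if sim >= threshold:
                results[(min(r1, r2), max(r1, r2))] = sim
    return results


records = [
    (1, ["I", "will", "call", "back"]),
    (2, ["I", "will", "call", "you", "soon"]),
    (3, ["see", "you", "later"]),
]
print(self_join(records, 0.5))  # {(1, 2): 0.5}
```

Records 2 and 3 share the token "you", but it is not in either prefix, so the pair is never sent to the same bucket and is never verified.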

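
The Basic Record Join of Stage III (slide 33) can be sketched in the same single-process style. The first cycle attaches the full record to each half of every RID pair, and the second cycle brings the two halves together. The function name `basic_record_join` and the tuple layouts are assumptions made for illustration, not taken from the slides.

```python
from collections import defaultdict


def basic_record_join(records, rid_pairs):
    """records: iterable of (rid, record); rid_pairs: iterable of (rid1, rid2, sim)."""
    # Cycle 1, map + shuffle: group each record and every pair mentioning it by RID.
    record_of = {}
    pairs_of = defaultdict(list)
    for rid, record in records:
        record_of[rid] = record
    for rid1, rid2, sim in rid_pairs:
        pairs_of[rid1].append((rid1, rid2, sim))
        pairs_of[rid2].append((rid1, rid2, sim))

    # Cycle 1, reduce: for each RID, emit a half-filled pair (pair, (rid, record)).
    half_filled = []
    for rid, pairs in pairs_of.items():
        for pair in pairs:
            half_filled.append((pair, (rid, record_of[rid])))

    # Cycle 2, shuffle + reduce: bring the two halves of each pair together.
    halves = defaultdict(dict)
    for pair, (rid, record) in half_filled:
        halves[pair][rid] = record
    return [
        (halves[(r1, r2, sim)][r1], halves[(r1, r2, sim)][r2], sim)
        for (r1, r2, sim) in sorted(halves)
    ]


records = [
    (1, "I will call back"),
    (2, "I will call you soon"),
    (3, "see you later"),
]
print(basic_record_join(records, [(1, 2, 0.5)]))
```

Records whose RID appears in no pair (record 3 here) are dropped in the first cycle, so only the records that actually take part in the join result are carried into the second cycle.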
Editor's notes

  1. Before publishing a journal, editors have to make sure there is no plagiarized paper among the hundreds of papers to be included in the journal. Different hosts may hold redundant copies of the same page. Detecting such similar pairs is challenging today, as there is an increasing trend of applications being expected to deal with vast amounts of data that usually do not fit in the main memory of one machine.
  2. 2 MapReduce phases
  3. map: tokenize the join value of each record; emit each token with no. of occurrences 1. reduce: for each token, compute total count (frequency)
  4. Instead of using MapReduce to sort the tokens, we can explicitly sort the tokens in memory
  5. For each token, the function computes its total count and stores the information locally
  6. i = (i + 1) mod n
  7. Bring in the records for each ID in each pair, then join the two half-filled records
  8. Stage 3 is the most expensive. Reason: this stage had to scan two datasets instead of one
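
The round-robin rule in note 6, `i = (i + 1) mod n`, can be sketched as follows, using the token ordering from slide 26. The helper names `round_robin_groups` and `route_keys` are hypothetical, and groups are numbered from 0, so Group1 on the slide is group 0 here.

```python
def round_robin_groups(ordered_tokens, n_groups):
    """Assign tokens to groups in round-robin order, following the global ordering."""
    groups = {}
    i = 0
    for tok in ordered_tokens:
        groups[tok] = i
        i = (i + 1) % n_groups  # the rule from note 6
    return groups


def route_keys(prefix_tokens, groups):
    """Synthetic keys for a record: one per distinct group among its prefix tokens."""
    return sorted({groups[t] for t in prefix_tokens})


groups = round_robin_groups(["A", "B", "E", "D", "G", "C", "F"], 3)
print(groups)                          # A, D, F -> 0; B, G -> 1; E, C -> 2
print(route_keys(["A", "B"], groups))  # [0, 1]
print(route_keys(["A", "D"], groups))  # [0]
```

A record whose prefix tokens all fall in one group is emitted once instead of once per token, which is where the lower replication of grouped routing comes from.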