SlideShare ist ein Scribd-Unternehmen logo
1 von 57
Downloaden Sie, um offline zu lesen
Efficient Parallel Set-Similarity Joins Using
                       MapReduce

                Rares Vernica        Michael J. Carey           Chen Li
                            Department of Computer Science
                             University of California, Irvine


         Special Interest Group on Management of Data, 2010




Rares Vernica (UC Irvine)         Fuzzy-Join in MapReduce             SIGMOD 2010   1 / 22
Outline



1   Motivation


2   Problem Statement


3   Algorithms


4   Experimental Evaluation




    Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   2 / 22
Example: Data Cleaning/Master-Data-Management

Customer data from two departments
                   Sales                                         Returns
     ID        Name                   ...           ID        Name                ...
    S10        John W Smith           ...          R20        Smith John          ...
      .
      .                                              .
                                                     .
      .                                              .


Master customer data across two departments
                                      Customers
                               ID   Name                       ...
                              C30   John W Smith               ...
                                .
                                .
                                .




  Rares Vernica (UC Irvine)         Fuzzy-Join in MapReduce                SIGMOD 2010   3 / 22
Example: Data Cleaning/Master-Data-Management

Customer data from two departments
                   Sales                                         Returns
     ID        Name                   ...           ID        Name                ...
    S10        John W Smith           ...          R20        Smith John          ...
      .
      .                                              .
                                                     .
      .                                              .


Master customer data across two departments
                                      Customers
                               ID   Name                       ...
                              C30   John W Smith               ...
                                .
                                .
                                .




  Rares Vernica (UC Irvine)         Fuzzy-Join in MapReduce                SIGMOD 2010   3 / 22
Example: Data Cleaning/Master-Data-Management

Customer data from two departments
                   Sales                                         Returns
     ID        Name                   ...           ID        Name                ...
    S10        John W Smith           ...          R20        Smith John          ...
      .
      .                                              .
                                                     .
      .                                              .


Master customer data across two departments
                                      Customers
                               ID   Name                       ...
                              C30   John W Smith               ...
                                .
                                .
                                .




  Rares Vernica (UC Irvine)         Fuzzy-Join in MapReduce                SIGMOD 2010   3 / 22
Example: Data Cleaning/Master-Data-Management

Customer data from two departments
                   Sales                                         Returns
     ID        Name                   ...           ID        Name             ...
    S10        John W Smith           ...          R20        John W Smith     ...
      .
      .                                              .
                                                     .
      .                                              .


Master customer data across two departments
                                      Customers
                               ID   Name                       ...
                              C30   John W Smith               ...
                                .
                                .
                                .




  Rares Vernica (UC Irvine)         Fuzzy-Join in MapReduce             SIGMOD 2010   3 / 22
Example: Data Cleaning/Master-Data-Management


String → Set
    S10: “John W Smith” → {John, Smith, W}
    R20: “Smith John” → {John, Smith}

Set-Similarity Metric
                                                               |x∩y |
    Jaccard similarity/Tanimoto coefficient: jaccard(x, y ) =   |x∪y |
                              2
    jaccard(S10, R20) =       3


Set-Similarity Join
Set-Similarity Join “Sales” and “Returns” on “Name”



  Rares Vernica (UC Irvine)       Fuzzy-Join in MapReduce   SIGMOD 2010   4 / 22
Example: Data Cleaning/Master-Data-Management


String → Set
    S10: “John W Smith” → {John, Smith, W}
    R20: “Smith John” → {John, Smith}

Set-Similarity Metric
                                                               |x∩y |
    Jaccard similarity/Tanimoto coefficient: jaccard(x, y ) =   |x∪y |
                              2
    jaccard(S10, R20) =       3


Set-Similarity Join
Set-Similarity Join “Sales” and “Returns” on “Name”



  Rares Vernica (UC Irvine)       Fuzzy-Join in MapReduce   SIGMOD 2010   4 / 22
Example: Data Cleaning/Master-Data-Management


String → Set
    S10: “John W Smith” → {John, Smith, W}
    R20: “Smith John” → {John, Smith}

Set-Similarity Metric
                                                               |x∩y |
    Jaccard similarity/Tanimoto coefficient: jaccard(x, y ) =   |x∪y |
                              2
    jaccard(S10, R20) =       3


Set-Similarity Join
Set-Similarity Join “Sales” and “Returns” on “Name”



  Rares Vernica (UC Irvine)       Fuzzy-Join in MapReduce   SIGMOD 2010   4 / 22
Example: Data Cleaning/Master-Data-Management


String → Set
    S10: “John W Smith” → {John, Smith, W}
    R20: “Smith John” → {John, Smith}

Set-Similarity Metric
                                                               |x∩y |
    Jaccard similarity/Tanimoto coefficient: jaccard(x, y ) =   |x∪y |
                              2
    jaccard(S10, R20) =       3


Set-Similarity Join
Set-Similarity Join “Sales” and “Returns” on “Name”



  Rares Vernica (UC Irvine)       Fuzzy-Join in MapReduce   SIGMOD 2010   4 / 22
Example: Data Cleaning/Master-Data-Management


String → Set
    S10: “John W Smith” → {John, Smith, W}
    R20: “Smith John” → {John, Smith}

Set-Similarity Metric
                                                               |x∩y |
    Jaccard similarity/Tanimoto coefficient: jaccard(x, y ) =   |x∪y |
                              2
    jaccard(S10, R20) =       3


Set-Similarity Join
Set-Similarity Join “Sales” and “Returns” on “Name”



  Rares Vernica (UC Irvine)       Fuzzy-Join in MapReduce   SIGMOD 2010   4 / 22
Example: Data Cleaning/Master-Data-Management


String → Set
    S10: “John W Smith” → {John, Smith, W}
    R20: “Smith John” → {John, Smith}

Set-Similarity Metric
                                                               |x∩y |
    Jaccard similarity/Tanimoto coefficient: jaccard(x, y ) =   |x∪y |
                              2
    jaccard(S10, R20) =       3


Set-Similarity Join
Set-Similarity Join “Sales” and “Returns” on “Name”



  Rares Vernica (UC Irvine)       Fuzzy-Join in MapReduce   SIGMOD 2010   4 / 22
Example: Data Cleaning/Master-Data-Management


String → Set
    S10: “John W Smith” → {John, Smith, W}
    R20: “Smith John” → {John, Smith}

Set-Similarity Metric
                                                               |x∩y |
    Jaccard similarity/Tanimoto coefficient: jaccard(x, y ) =   |x∪y |
                              2
    jaccard(S10, R20) =       3


Set-Similarity Join
Set-Similarity Join “Sales” and “Returns” on “Name”



  Rares Vernica (UC Irvine)       Fuzzy-Join in MapReduce   SIGMOD 2010   4 / 22
Problem Statement: Set-Similarity Join




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   5 / 22
Problem Statement: Set-Similarity Join




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   5 / 22
Problem Statement: Set-Similarity Join




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   5 / 22
Problem Statement: Set-Similarity Join




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   5 / 22
Problem Statement: Set-Similarity Join



Input
    Two files of records e.g., R(RID, a, b) and S(RID, c, d)
    A join column on each file e.g., R.a and S.c
    A similarity function, sim e.g., Jaccard
    A similarity threshold, τ

Output
All pairs of records from R and S where sim(R.a, S.c) ≥ τ




  Rares Vernica (UC Irvine)     Fuzzy-Join in MapReduce   SIGMOD 2010   6 / 22
Parallelizing Set-Similarity Joins


Large amounts of data
    E.g., GeneBank: 100M, Google N-gram: 1T
    Data or processing does not fit in one machine
    Use a cluster of machines and a parallel algorithm
    MapReduce: shared-nothing data-processing platform

Challenges
    Partition problem for parallelism
    Solve using Map, Sort, and Reduce
    Compute end-to-end set-similarity joins
    Deal with out-of-memory situations


  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce    SIGMOD 2010   7 / 22
Parallelizing Set-Similarity Joins


Large amounts of data
    E.g., GeneBank: 100M, Google N-gram: 1T
    Data or processing does not fit in one machine
    Use a cluster of machines and a parallel algorithm
    MapReduce: shared-nothing data-processing platform

Challenges
    Partition problem for parallelism
    Solve using Map, Sort, and Reduce
    Compute end-to-end set-similarity joins
    Deal with out-of-memory situations


  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce    SIGMOD 2010   7 / 22
MapReduce Review

map    (k1,v1)       → list(k2,v2);
reduce (k2,list(v2)) → list(k3,v3).




combine (k2,list(v2)) → list(k2,v2).

  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   8 / 22
MapReduce Review

map    (k1,v1)       → list(k2,v2);
reduce (k2,list(v2)) → list(k3,v3).




combine (k2,list(v2)) → list(k2,v2).

  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   8 / 22
MapReduce Review

map    (k1,v1)       → list(k2,v2);
reduce (k2,list(v2)) → list(k3,v3).




combine (k2,list(v2)) → list(k2,v2).

  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   8 / 22
Parallel Set-Similarity Joins in MapReduce

Main idea
       Hash-partition data across the network based on keys
       Join values cannot be directly used as keys
       Generate signatures from join values and use them as keys
               e.g., “John W Smith” → {John, Smith, W}

Three Stages
 1     Compute set-element statistics
       Records → Statistics
 2     Generate similar record-ID (“RID”) pairs
       {Records, Statistics} → RID-pairs
 3     Generate pairs of joined records
       {Records, RID-pairs} → Record-pairs


     Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   9 / 22
Parallel Set-Similarity Joins in MapReduce

Main idea
       Hash-partition data across the network based on keys
       Join values cannot be directly used as keys
       Generate signatures from join values and use them as keys
               e.g., “John W Smith” → {John, Smith, W}

Three Stages
 1     Compute set-element statistics
       Records → Statistics
 2     Generate similar record-ID (“RID”) pairs
       {Records, Statistics} → RID-pairs
 3     Generate pairs of joined records
       {Records, RID-pairs} → Record-pairs


     Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   9 / 22
Set-Similarity Filtering
E.g., Prefix Filtering
    Pigeonhole principle
    Global order for set elements:




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   10 / 22
Set-Similarity Filtering
E.g., Prefix Filtering
    Pigeonhole principle
    Global order for set elements:




    E.g., sim is overlap size, τ = 4




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   10 / 22
Set-Similarity Filtering
E.g., Prefix Filtering
    Pigeonhole principle
    Global order for set elements:




    E.g., sim is overlap size, τ = 4
    Prefix length is 2




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   10 / 22
Set-Similarity Filtering
E.g., Prefix Filtering
    Pigeonhole principle
    Global order for set elements:




    E.g., sim is overlap size, τ = 4
    Prefix length is 2




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   10 / 22
Set-Similarity Filtering
E.g., Prefix Filtering
    Pigeonhole principle
    Global order for set elements:




    E.g., sim is overlap size, τ = 4
    Prefix length is 2




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   10 / 22
Processing Stages and Alternatives

Stage 1: Token Ordering
    Compute the token frequencies and sort
            Two MapReduce phases: sort in MapReduce (BTO)
            One MapReduce phase: sort in memory (OPTO)

Stage 2: Kernel (RID-Pair Generation)
    Use prefix-filter to divide, conquer using:
            Nested loops (BK)
            Single-machine set-similarity join algorithm (PK)

Stage 3: Record Join
    Generate pairs of similar records
            Two MapReduce phases: reduce-side join (BRJ)
            One MapReduce phase: map-side join (OPRJ)

   Rares Vernica (UC Irvine)      Fuzzy-Join in MapReduce       SIGMOD 2010   11 / 22
Processing Stages and Alternatives

Stage 1: Token Ordering
    Compute the token frequencies and sort
            Two MapReduce phases: sort in MapReduce (BTO)
            One MapReduce phase: sort in memory (OPTO)

Stage 2: Kernel (RID-Pair Generation)
    Use prefix-filter to divide, conquer using:
            Nested loops (BK)
            Single-machine set-similarity join algorithm (PK)

Stage 3: Record Join
    Generate pairs of similar records
            Two MapReduce phases: reduce-side join (BRJ)
            One MapReduce phase: map-side join (OPRJ)

   Rares Vernica (UC Irvine)      Fuzzy-Join in MapReduce       SIGMOD 2010   11 / 22
Processing Stages and Alternatives

Stage 1: Token Ordering
    Compute the token frequencies and sort
            Two MapReduce phases: sort in MapReduce (BTO)
            One MapReduce phase: sort in memory (OPTO)

Stage 2: Kernel (RID-Pair Generation)
    Use prefix-filter to divide, conquer using:
            Nested loops (BK)
            Single-machine set-similarity join algorithm (PK)

Stage 3: Record Join
    Generate pairs of similar records
            Two MapReduce phases: reduce-side join (BRJ)
            One MapReduce phase: map-side join (OPRJ)

   Rares Vernica (UC Irvine)      Fuzzy-Join in MapReduce       SIGMOD 2010   11 / 22
Stage 2: RID-Pair Generation




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   12 / 22
Stage 2: RID-Pair Generation




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   12 / 22
Stage 2: RID-Pair Generation




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   12 / 22
Stage 2: RID-Pair Generation




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   12 / 22
Stage 2: RID-Pair Generation




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   12 / 22
Stage 2: RID-Pair Generation




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   12 / 22
Additional Contributions




    Solve R-S-join case
    Optimize memory requirements for R-S join
    Handle insufficient memory
            Introduce additional filters
            Use external memory




  Rares Vernica (UC Irvine)      Fuzzy-Join in MapReduce   SIGMOD 2010   13 / 22
Experimental Setting


Hardware
    10-node IBM x3650 cluster
            Intel Xeon processor E5520 2.26GHz with four cores
            Four 300GB hard disks
            12GB RAM

Software
    Ubuntu 9.06, 64-bit, server edition OS
    Java 1.6, 64-bit, server
    Hadoop 0.20.1




  Rares Vernica (UC Irvine)    Fuzzy-Join in MapReduce       SIGMOD 2010   14 / 22
Experimental Setting


Datasets
    DBLP
            Average length: 259 bytes
            Number of records: 1.2M
            Total size: 300MB
    CITESEERX
            Average length: 1374 bytes
            Number of records: 1.3M
            Total size: 1.8GB
    Increased each up to ×25, preserving join properties
            DBLP: 31M records, 8.2GB
            CITESEERX: 32M records, 45GB




  Rares Vernica (UC Irvine)    Fuzzy-Join in MapReduce   SIGMOD 2010   15 / 22
Running Time




                                                        Self-join DBLP×n
                                                        n ∈ [5, 25]
                                                        10-node cluster
                                                        Best time
                                                        Bulk of the time




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce      SIGMOD 2010   16 / 22
Running Time




                                                        Self-join DBLP×n
                                                        n ∈ [5, 25]
                                                        10-node cluster
                                                        Best time
                                                        Bulk of the time




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce      SIGMOD 2010   16 / 22
Running Time




                                                        Self-join DBLP×n
                                                        n ∈ [5, 25]
                                                        10-node cluster
                                                        Best time
                                                        Bulk of the time




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce      SIGMOD 2010   16 / 22
Speedup


          5
                        BTO-BK-BRJ
                        BTO-PK-BRJ                                      Relative running time
          4             BTO-PK-OPRJ
                        Ideal                                           Self-join DBLP×10
Speedup




                                                                        Different cluster sizes
          3


          2


          1
              2     3      4     5    6   7   8    9 10
                                 # Nodes

          Rares Vernica (UC Irvine)           Fuzzy-Join in MapReduce            SIGMOD 2010   17 / 22
Speedup Breakdown


                      Stage 1                                            Stage 2                                             Stage 3
          5                                                 5                                                    5
                      BTO                                               BK                                                   BRJ
          4           OPTO                                  4           PK                                       4           OPRJ
                      Ideal                                             Ideal                                                Ideal
Speedup




                                                  Speedup




                                                                                                       Speedup
          3                                                 3                                                    3


          2                                                 2                                                    2


          1                                                 1                                                    1
              2   3    4   5   6   7   8   9 10                 2   3     4     5   6   7   8   9 10                 2   3    4   5    6   7   8   9 10
                           # Nodes                                              # Nodes                                              # Nodes

                  Relative running time
                  Self-join DBLP×10
                  Different cluster sizes


              Rares Vernica (UC Irvine)                         Fuzzy-Join in MapReduce                                        SIGMOD 2010         18 / 22
Scaleup


                  1000
                                                                                    Running time
                    800                                                             Self-joining DBLP×n
Time (seconds)




                                                                                    n ∈ [5, 25]
                    600
                                                                                    Proportional cluster
                    400                              BTO-BK-BRJ
                                                     BTO-PK-BRJ
                    200                              BTO-PK-OPRJ
                                                     Ideal
                        0
                            2      3         4   5    6     7     8     9 10
                                 # Nodes and Dataset Size

                 Rares Vernica (UC Irvine)                Fuzzy-Join in MapReduce             SIGMOD 2010   19 / 22
Scaleup Breakdown


                            Stage 1                                                    Stage 2                                                    Stage 3
                 160                                                        600                                                        200

                                                                            500
Time (seconds)




                                                           Time (seconds)




                                                                                                                      Time (seconds)
                 120                                                                                                                   150
                                                                            400

                 80                                                         300                                                        100

                                                BTO                         200                            BK                                                          BRJ
                 40                             OPTO                                                       PK                          50                              OPRJ
                                                Ideal                       100                            Ideal                                                       Ideal
                  0                                                           0                                                         0
                       2    3   4   5   6   7   8   9 10                          2    3   4   5   6   7   8   9 10                          2    3    4   5   6   7   8   9 10
                           # Nodes and Dataset Size                                   # Nodes and Dataset Size                                   # Nodes and Dataset Size

                       Running time
                       Self-joining DBLP×n, n ∈ [5, 25]
                       Proportional cluster


                  Rares Vernica (UC Irvine)                                   Fuzzy-Join in MapReduce                                                 SIGMOD 2010       20 / 22
Summary


   Set-similarity joins in MapReduce
           Three-stage approach
           Balance workload and minimize replication
   End-to-end algorithms
           Self-join
           R-S join
   Memory issues
           Optimize memory requirements
           Handle insufficient memory
   Experiments
           40 cores, 40 disks cluster
           Speedup and scaleup




 Rares Vernica (UC Irvine)      Fuzzy-Join in MapReduce   SIGMOD 2010   21 / 22
More Info



  A longer version of the paper, the source-code and the datasets:
  http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/

                              This work is part of the
                                     ASTERIX
                         http://asterix.ics.uci.edu/
                                        and
                                     Flamingo
                        http://flamingo.ics.uci.edu/
                               projects at UC Irvine.




  Rares Vernica (UC Irvine)      Fuzzy-Join in MapReduce   SIGMOD 2010   22 / 22
Stage 1: Basic Token Ordering (BTO)




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   22 / 22
Stage 1: One Phase Token Ordering (OPTO)




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   22 / 22
Stage 2: Basic Kernel (BK)




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   22 / 22
Stage 2: Indexed Kernel (PK)




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   22 / 22
Stage 3: Basic Record Join (BRJ)




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   22 / 22
Stage 3: One Phase Record Join (OPRJ)




  Rares Vernica (UC Irvine)   Fuzzy-Join in MapReduce   SIGMOD 2010   22 / 22

Weitere ähnliche Inhalte

Empfohlen

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by HubspotMarius Sescu
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTExpeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 

Empfohlen (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Efficient Parallel Set-Similarity Joins Using MapReduce - SIGMOD 2010 Slides

  • 1. Efficient Parallel Set-Similarity Joins Using MapReduce Rares Vernica Michael J. Carey Chen Li Department of Computer Science University of California, Irvine Special Interest Group on Management of Data, 2010 Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 1 / 22
  • 2. Outline 1 Motivation 2 Problem Statement 3 Algorithms 4 Experimental Evaluation Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 2 / 22
  • 3. Example: Data Cleaning/Master-Data-Management Customer data from two departments Sales Returns ID Name ... ID Name ... S10 John W Smith ... R20 Smith John ... . . . . . . Master customer data across two departments Customers ID Name ... C30 John W Smith ... . . . Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 3 / 22
  • 4. Example: Data Cleaning/Master-Data-Management Customer data from two departments Sales Returns ID Name ... ID Name ... S10 John W Smith ... R20 Smith John ... . . . . . . Master customer data across two departments Customers ID Name ... C30 John W Smith ... . . . Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 3 / 22
  • 5. Example: Data Cleaning/Master-Data-Management Customer data from two departments Sales Returns ID Name ... ID Name ... S10 John W Smith ... R20 Smith John ... . . . . . . Master customer data across two departments Customers ID Name ... C30 John W Smith ... . . . Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 3 / 22
  • 6. Example: Data Cleaning/Master-Data-Management Customer data from two departments Sales Returns ID Name ... ID Name ... S10 John W Smith ... R20 John W Smith ... . . . . . . Master customer data across two departments Customers ID Name ... C30 John W Smith ... . . . Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 3 / 22
  • 7. Example: Data Cleaning/Master-Data-Management String → Set S10: “John W Smith” → {John, Smith, W} R20: “Smith John” → {John, Smith} Set-Similarity Metric |x∩y | Jaccard similarity/Tanimoto coefficient: jaccard(x, y ) = |x∪y | 2 jaccard(S10, R20) = 3 Set-Similarity Join Set-Similarity Join “Sales” and “Returns” on “Name” Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 4 / 22
  • 8. Example: Data Cleaning/Master-Data-Management String → Set S10: “John W Smith” → {John, Smith, W} R20: “Smith John” → {John, Smith} Set-Similarity Metric |x∩y | Jaccard similarity/Tanimoto coefficient: jaccard(x, y ) = |x∪y | 2 jaccard(S10, R20) = 3 Set-Similarity Join Set-Similarity Join “Sales” and “Returns” on “Name” Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 4 / 22
  • 9. Example: Data Cleaning/Master-Data-Management String → Set S10: “John W Smith” → {John, Smith, W} R20: “Smith John” → {John, Smith} Set-Similarity Metric |x∩y | Jaccard similarity/Tanimoto coefficient: jaccard(x, y ) = |x∪y | 2 jaccard(S10, R20) = 3 Set-Similarity Join Set-Similarity Join “Sales” and “Returns” on “Name” Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 4 / 22
  • 10. Example: Data Cleaning/Master-Data-Management String → Set S10: “John W Smith” → {John, Smith, W} R20: “Smith John” → {John, Smith} Set-Similarity Metric |x∩y | Jaccard similarity/Tanimoto coefficient: jaccard(x, y ) = |x∪y | 2 jaccard(S10, R20) = 3 Set-Similarity Join Set-Similarity Join “Sales” and “Returns” on “Name” Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 4 / 22
  • 11. Example: Data Cleaning/Master-Data-Management String → Set S10: “John W Smith” → {John, Smith, W} R20: “Smith John” → {John, Smith} Set-Similarity Metric |x∩y | Jaccard similarity/Tanimoto coefficient: jaccard(x, y ) = |x∪y | 2 jaccard(S10, R20) = 3 Set-Similarity Join Set-Similarity Join “Sales” and “Returns” on “Name” Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 4 / 22
  • 12. Example: Data Cleaning/Master-Data-Management String → Set S10: “John W Smith” → {John, Smith, W} R20: “Smith John” → {John, Smith} Set-Similarity Metric |x∩y | Jaccard similarity/Tanimoto coefficient: jaccard(x, y ) = |x∪y | 2 jaccard(S10, R20) = 3 Set-Similarity Join Set-Similarity Join “Sales” and “Returns” on “Name” Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 4 / 22
  • 13. Example: Data Cleaning/Master-Data-Management String → Set S10: “John W Smith” → {John, Smith, W} R20: “Smith John” → {John, Smith} Set-Similarity Metric |x∩y | Jaccard similarity/Tanimoto coefficient: jaccard(x, y ) = |x∪y | 2 jaccard(S10, R20) = 3 Set-Similarity Join Set-Similarity Join “Sales” and “Returns” on “Name” Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 4 / 22
  • 14. Problem Statement: Set-Similarity Join Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 5 / 22
  • 15. Problem Statement: Set-Similarity Join Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 5 / 22
  • 16. Problem Statement: Set-Similarity Join Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 5 / 22
  • 17. Problem Statement: Set-Similarity Join Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 5 / 22
  • 18. Problem Statement: Set-Similarity Join Input Two files of records e.g., R(RID, a, b) and S(RID, c, d) A join column on each file e.g., R.a and S.c A similarity function, sim e.g., Jaccard A similarity threshold, τ Output All pairs of records from R and S where sim(R.a, S.c) ≥ τ Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 6 / 22
  • 19. Parallelizing Set-Similarity Joins Large amounts of data E.g., GeneBank: 100M, Google N-gram: 1T Data or processing does not fit in one machine Use a cluster of machines and a parallel algorithm MapReduce: shared-nothing data-processing platform Challenges Partition problem for parallelism Solve using Map, Sort, and Reduce Compute end-to-end set-similarity joins Deal with out-of-memory situations Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 7 / 22
  • 20. Parallelizing Set-Similarity Joins Large amounts of data E.g., GeneBank: 100M, Google N-gram: 1T Data or processing does not fit in one machine Use a cluster of machines and a parallel algorithm MapReduce: shared-nothing data-processing platform Challenges Partition problem for parallelism Solve using Map, Sort, and Reduce Compute end-to-end set-similarity joins Deal with out-of-memory situations Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 7 / 22
  • 21. MapReduce Review map (k1,v1) → list(k2,v2); reduce (k2,list(v2)) → list(k3,v3). combine (k2,list(v2)) → list(k2,v2). Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 8 / 22
  • 22. MapReduce Review map (k1,v1) → list(k2,v2); reduce (k2,list(v2)) → list(k3,v3). combine (k2,list(v2)) → list(k2,v2). Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 8 / 22
  • 23. MapReduce Review map (k1,v1) → list(k2,v2); reduce (k2,list(v2)) → list(k3,v3). combine (k2,list(v2)) → list(k2,v2). Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 8 / 22
  • 24. Parallel Set-Similarity Joins in MapReduce Main idea Hash-partition data across the network based on keys Join values cannot be directly used as keys Generate signatures from join values and use them as keys e.g., “John W Smith” → {John, Smith, W} Three Stages 1 Compute set-element statistics Records → Statistics 2 Generate similar record-ID (“RID”) pairs {Records, Statistics} → RID-pairs 3 Generate pairs of joined records {Records, RID-pairs} → Record-pairs Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 9 / 22
  • 25. Parallel Set-Similarity Joins in MapReduce Main idea Hash-partition data across the network based on keys Join values cannot be directly used as keys Generate signatures from join values and use them as keys e.g., “John W Smith” → {John, Smith, W} Three Stages 1 Compute set-element statistics Records → Statistics 2 Generate similar record-ID (“RID”) pairs {Records, Statistics} → RID-pairs 3 Generate pairs of joined records {Records, RID-pairs} → Record-pairs Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 9 / 22
  • 26. Set-Similarity Filtering E.g., Prefix Filtering Pigeonhole principle Global order for set elements: Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 10 / 22
  • 27. Set-Similarity Filtering E.g., Prefix Filtering Pigeonhole principle Global order for set elements: E.g., sim is overlap size, τ = 4 Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 10 / 22
  • 28. Set-Similarity Filtering E.g., Prefix Filtering Pigeonhole principle Global order for set elements: E.g., sim is overlap size, τ = 4 Prefix length is 2 Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 10 / 22
  • 29. Set-Similarity Filtering E.g., Prefix Filtering Pigeonhole principle Global order for set elements: E.g., sim is overlap size, τ = 4 Prefix length is 2 Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 10 / 22
  • 30. Set-Similarity Filtering E.g., Prefix Filtering Pigeonhole principle Global order for set elements: E.g., sim is overlap size, τ = 4 Prefix length is 2 Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 10 / 22
  • 31. Processing Stages and Alternatives Stage 1: Token Ordering Compute the token frequencies and sort Two MapReduce phases: sort in MapReduce (BTO) One MapReduce phase: sort in memory (OPTO) Stage 2: Kernel (RID-Pair Generation) Use prefix-filter to divide, conquer using: Nested loops (BK) Single-machine set-similarity join algorithm (PK) Stage 3: Record Join Generate pairs of similar records Two MapReduce phases: reduce-side join (BRJ) One MapReduce phase: map-side join (OPRJ) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 11 / 22
  • 32. Processing Stages and Alternatives Stage 1: Token Ordering Compute the token frequencies and sort Two MapReduce phases: sort in MapReduce (BTO) One MapReduce phase: sort in memory (OPTO) Stage 2: Kernel (RID-Pair Generation) Use prefix-filter to divide, conquer using: Nested loops (BK) Single-machine set-similarity join algorithm (PK) Stage 3: Record Join Generate pairs of similar records Two MapReduce phases: reduce-side join (BRJ) One MapReduce phase: map-side join (OPRJ) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 11 / 22
  • 33. Processing Stages and Alternatives Stage 1: Token Ordering Compute the token frequencies and sort Two MapReduce phases: sort in MapReduce (BTO) One MapReduce phase: sort in memory (OPTO) Stage 2: Kernel (RID-Pair Generation) Use prefix-filter to divide, conquer using: Nested loops (BK) Single-machine set-similarity join algorithm (PK) Stage 3: Record Join Generate pairs of similar records Two MapReduce phases: reduce-side join (BRJ) One MapReduce phase: map-side join (OPRJ) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 11 / 22
  • 34. Stage 2: RID-Pair Generation Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 12 / 22
  • 35. Stage 2: RID-Pair Generation Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 12 / 22
  • 36. Stage 2: RID-Pair Generation Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 12 / 22
  • 37. Stage 2: RID-Pair Generation Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 12 / 22
  • 38. Stage 2: RID-Pair Generation Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 12 / 22
  • 39. Stage 2: RID-Pair Generation Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 12 / 22
  • 40. Additional Contributions Solve R-S-join case Optimize memory requirements for R-S join Handle insufficient memory Introduce additional filters Use external memory Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 13 / 22
  • 41. Experimental Setting Hardware 10-node IBM x3650 cluster Intel Xeon processor E5520 2.26GHz with four cores Four 300GB hard disks 12GB RAM Software Ubuntu 9.06, 64-bit, server edition OS Java 1.6, 64-bit, server Hadoop 0.20.1 Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 14 / 22
  • 42. Experimental Setting Datasets DBLP Average length: 259 bytes Number of records: 1.2M Total size: 300MB CITESEERX Average length: 1374 bytes Number of records: 1.3M Total size: 1.8GB Increased each up to ×25, preserving join properties DBLP: 31M records, 8.2GB CITESEERX: 32M records, 45GB Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 15 / 22
  • 43. Running Time Self-join DBLP×n n ∈ [5, 25] 10-node cluster Best time Bulk of the time Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 16 / 22
  • 44. Running Time Self-join DBLP×n n ∈ [5, 25] 10-node cluster Best time Bulk of the time Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 16 / 22
  • 45. Running Time Self-join DBLP×n n ∈ [5, 25] 10-node cluster Best time Bulk of the time Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 16 / 22
  • 46. Speedup 5 BTO-BK-BRJ BTO-PK-BRJ Relative running time 4 BTO-PK-OPRJ Ideal Self-join DBLP×10 Speedup Different cluster sizes 3 2 1 2 3 4 5 6 7 8 9 10 # Nodes Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 17 / 22
  • 47. Speedup Breakdown Stage 1 Stage 2 Stage 3 5 5 5 BTO BK BRJ 4 OPTO 4 PK 4 OPRJ Ideal Ideal Ideal Speedup Speedup Speedup 3 3 3 2 2 2 1 1 1 2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10 # Nodes # Nodes # Nodes Relative running time Self-join DBLP×10 Different cluster sizes Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 18 / 22
  • 48. Scaleup 1000 Running time 800 Self-joining DBLP×n Time (seconds) n ∈ [5, 25] 600 Proportional cluster 400 BTO-BK-BRJ BTO-PK-BRJ 200 BTO-PK-OPRJ Ideal 0 2 3 4 5 6 7 8 9 10 # Nodes and Dataset Size Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 19 / 22
  • 49. Scaleup Breakdown Stage 1 Stage 2 Stage 3 160 600 200 500 Time (seconds) Time (seconds) Time (seconds) 120 150 400 80 300 100 BTO 200 BK BRJ 40 OPTO PK 50 OPRJ Ideal 100 Ideal Ideal 0 0 0 2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10 # Nodes and Dataset Size # Nodes and Dataset Size # Nodes and Dataset Size Running time Self-joining DBLP×n, n ∈ [5, 25] Proportional cluster Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 20 / 22
  • 50. Summary Set-similarity joins in MapReduce Three-stage approach Balance workload and minimize replication End-to-end algorithms Self-join R-S join Memory issues Optimize memory requirements Handle insufficient memory Experiments 40 cores, 40 disks cluster Speedup and scaleup Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 21 / 22
  • 51. More Info A longer version of the paper, the source-code and the datasets: http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/ This work is part of the ASTERIX http://asterix.ics.uci.edu/ and Flamingo http://flamingo.ics.uci.edu/ projects at UC Irvine. Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 22 / 22
  • 52. Stage 1: Basic Token Ordering (BTO) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 22 / 22
  • 53. Stage 1: One Phase Token Ordering (OPTO) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 22 / 22
  • 54. Stage 2: Basic Kernel (BK) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 22 / 22
  • 55. Stage 2: Indexed Kernel (PK) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 22 / 22
  • 56. Stage 3: Basic Record Join (BRJ) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 22 / 22
  • 57. Stage 3: One Phase Record Join (OPRJ) Rares Vernica (UC Irvine) Fuzzy-Join in MapReduce SIGMOD 2010 22 / 22