Tamer Elsayed, Jimmy Lin, and Douglas Oard


         Niveda Krishnamoorthy
 Pairwise Similarity
 MapReduce Framework
 Proposed algorithm
  • Inverted Index Construction
  • Pairwise document similarity calculation
 Results
 PubMed – “More like this”
 Similar blog posts
 Google – Similar pages
 Framework that supports distributed computing on clusters of computers
 Introduced by Google in 2004
 Map step
 Reduce step
 Combine step (optional)
 Applications
 Consider two files:

      File 1     File 2      Word count output
      Hello      Hello       Hello, 2
      World      Hadoop      World, 2
      Bye        Goodbye     Bye, 1
      World      Hadoop      Hadoop, 2
                             Goodbye, 1
      Map 1 (first file)
      Hello      <Hello,1>
      World      <World,1>
      Bye        <Bye,1>
      World      <World,1>

      Map 2 (second file)
      Hello      <Hello,1>
      Hadoop     <Hadoop,1>
      Goodbye    <Goodbye,1>
      Hadoop     <Hadoop,1>
      Map output: <Hello,1> <World,1> <Bye,1> <World,1> <Hello,1> <Hadoop,1> <Goodbye,1> <Hadoop,1>

      Shuffle & sort       Reducer      Output
      <Hello,(1,1)>        Reduce 1     Hello, 2
      <World,(1,1)>        Reduce 2     World, 2
      <Bye,(1)>            Reduce 3     Bye, 1
      <Hadoop,(1,1)>       Reduce 4     Hadoop, 2
      <Goodbye,(1)>        Reduce 5     Goodbye, 1
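
The word count walk-through can be imitated on a single machine. The sketch below is only an in-memory illustration of the map, shuffle & sort, and reduce stages, not Hadoop code. The function names (`map_words`, `shuffle_and_sort`, `reduce_counts`) are made up for this example.

```python
from collections import defaultdict

def map_words(text):
    """Map step: emit a <word, 1> pair for every word in one file."""
    return [(word, 1) for word in text.split()]

def shuffle_and_sort(pairs):
    """Group the emitted values by key, with keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_counts(word, ones):
    """Reduce step: sum the 1s collected for a single word."""
    return word, sum(ones)

# The two input files from the slide.
files = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]

emitted = [pair for text in files for pair in map_words(text)]
grouped = shuffle_and_sort(emitted)
word_counts = dict(reduce_counts(word, ones) for word, ones in grouped.items())
print(word_counts)
```

In a real cluster each mapper and reducer would run on a different machine and the framework would perform the grouping; here everything runs in one process so the data flow is easy to follow.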
MAPREDUCE ALGORITHM: Scalable and Efficient
•Inverted Index Computation
•Pairwise Similarity
      Input                      Mapper     Output
      Document 1: A A B C        Map 1      <A,(d1,2)> <B,(d1,1)> <C,(d1,1)>
      Document 2: B D D          Map 2      <B,(d2,1)> <D,(d2,2)>
      Document 3: A B B E        Map 3      <A,(d3,1)> <B,(d3,2)> <E,(d3,1)>
      Map output: <A,(d1,2)> <B,(d1,1)> <C,(d1,1)> <B,(d2,1)> <D,(d2,2)> <A,(d3,1)> <B,(d3,2)> <E,(d3,1)>

      Shuffle & sort                    Reducer      Output
      <A,[(d1,2),(d3,1)]>               Reduce 1     <A,[(d1,2),(d3,1)]>
      <B,[(d1,1),(d2,1),(d3,2)]>        Reduce 2     <B,[(d1,1),(d2,1),(d3,2)]>
      <C,[(d1,1)]>                      Reduce 3     <C,[(d1,1)]>
      <D,[(d2,2)]>                      Reduce 4     <D,[(d2,2)]>
      <E,[(d3,1)]>                      Reduce 5     <E,[(d3,1)]>
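
The inverted index example can be reproduced with a small in-memory sketch. It assumes raw term frequency as the posting weight, as in the slide's toy documents. The helper names `map_document` and `build_index` are invented for illustration and are not taken from the paper.

```python
from collections import Counter, defaultdict

def map_document(doc_id, terms):
    """Map step: emit <term, (doc_id, term_frequency)> for each distinct term."""
    return [(term, (doc_id, tf)) for term, tf in Counter(terms).items()]

def build_index(docs):
    """Shuffle & sort plus reduce: collect the postings of each term into one list."""
    postings = defaultdict(list)
    for doc_id, terms in docs.items():
        for term, posting in map_document(doc_id, terms):
            postings[term].append(posting)
    # Sort terms and postings so the output matches the slide's ordering.
    return {term: sorted(plist) for term, plist in sorted(postings.items())}

# The three toy documents from the slide.
docs = {
    "d1": ["A", "A", "B", "C"],
    "d2": ["B", "D", "D"],
    "d3": ["A", "B", "B", "E"],
}

index = build_index(docs)
for term, plist in index.items():
    print(term, plist)
```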
 Group by document ID, not pairs
 Golomb compression for postings
 Individual postings
 List of postings
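
The slide does not say how the postings are laid out before compression. As a rough sketch, the code below assumes integer document IDs stored as gaps between consecutive IDs (a common practice, assumed here) and encodes each gap with a textbook Golomb code. The parameter `m` and the example IDs are arbitrary choices for illustration.

```python
def golomb_encode(n, m):
    """Encode a non-negative integer n with Golomb parameter m >= 2 as a bit string."""
    q, r = divmod(n, m)
    b = (m - 1).bit_length()           # ceil(log2(m)) for m >= 2
    cutoff = (1 << b) - m              # remainders below this use b - 1 bits
    unary = "1" * q + "0"              # quotient in unary, terminated by 0
    if r < cutoff:
        return unary + format(r, "b").zfill(b - 1)
    return unary + format(r + cutoff, "b").zfill(b)

def golomb_decode(bits, m):
    """Decode a concatenation of Golomb codewords back into integers."""
    b = (m - 1).bit_length()
    cutoff = (1 << b) - m
    values, i = [], 0
    while i < len(bits):
        q = 0
        while bits[i] == "1":          # read the unary quotient
            q += 1
            i += 1
        i += 1                         # skip the terminating 0
        r = int(bits[i:i + b - 1], 2) if b > 1 else 0
        i += b - 1
        if r >= cutoff:                # long remainder: read one more bit
            r = r * 2 + int(bits[i]) - cutoff
            i += 1
        values.append(q * m + r)
    return values

doc_ids = [3, 7, 8, 15]                # hypothetical sorted document IDs
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
encoded = "".join(golomb_encode(gap, 3) for gap in gaps)
decoded = golomb_decode(encoded, 3)
print(gaps, encoded, decoded)
```

Small gaps get short codewords, which is why gap encoding pairs well with Golomb codes on long postings lists.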
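      Input (postings list)             Mapper     Output
      <A,[(d1,2),(d3,1)]>               Map 1      <(d1,d3),2>
      <B,[(d1,1),(d2,1),(d3,2)]>        Map 2      <(d1,d2),1> <(d2,d3),2> <(d1,d3),2>
      <C,[(d1,1)]>                                 (single posting, no pairs)
      <D,[(d2,2)]>                                 (single posting, no pairs)
      <E,[(d3,1)]>                                 (single posting, no pairs)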
      Map output: <(d1,d3),2> <(d1,d2),1> <(d2,d3),2> <(d1,d3),2>

      Shuffle & sort        Reducer      Output
      <(d1,d2),[1]>         Reduce 1     <(d1,d2),1>
      <(d2,d3),[2]>         Reduce 2     <(d2,d3),2>
      <(d1,d3),[2,2]>       Reduce 3     <(d1,d3),4>
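
The pairwise similarity step can be sketched the same way. The code below assumes the similarity of two documents is the sum, over shared terms, of the product of their term weights, with raw term frequency as the weight (matching the toy numbers on the slide, not the BM25 weights used in the experiments). The names `map_postings` and `pairwise_similarity` are illustrative only.

```python
from collections import defaultdict
from itertools import combinations

def map_postings(postings):
    """Map step: for one term, emit <(di, dj), wi * wj> for every document pair."""
    return [((di, dj), wi * wj)
            for (di, wi), (dj, wj) in combinations(sorted(postings), 2)]

def pairwise_similarity(index):
    """Reduce step: sum the per-term contributions of each document pair."""
    scores = defaultdict(int)
    for postings in index.values():
        for pair, contribution in map_postings(postings):
            scores[pair] += contribution
    return dict(scores)

# Inverted index from the slide's example.
index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}

similarity = pairwise_similarity(index)
print(similarity)
```

Terms that occur in a single document contribute nothing, which is why only the postings lists for A and B produce output in the example.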
 Hadoop 0.16.0
 20 machines (4 GB memory, 100 GB disk)
 Similarity function: BM25
 Dataset: AQUAINT-2 (newswire text)
  • 2.5 GB
  • 906k documents
 Tokenization
 Stop word removal
 Stemming
 Df-cut
  • The fraction of terms with the highest document frequency is eliminated: 99% cut (9093)

 Linear space and time complexity

  • 3.7 billion pairs (vs) 81. trillion pairs
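
To see why a df-cut shrinks the number of intermediate pairs, consider this small sketch. It assumes a term with document frequency df makes the map step emit df(df-1)/2 pairs, and that ties in document frequency are broken alphabetically (an arbitrary choice made here). The function names and the toy index are illustrative, not taken from the paper.

```python
def apply_df_cut(index, cut):
    """Keep the `cut` fraction of terms with the lowest document frequency."""
    by_df = sorted(index, key=lambda term: (len(index[term]), term))
    keep = int(len(by_df) * cut)
    return {term: index[term] for term in by_df[:keep]}

def intermediate_pairs(index):
    """Number of <pair, contribution> tuples the pairwise map step would emit."""
    return sum(len(p) * (len(p) - 1) // 2 for p in index.values())

# Toy inverted index: term -> list of (doc_id, weight) postings.
toy_index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}

pruned = apply_df_cut(toy_index, 0.8)   # drop the 20% most frequent terms
print(intermediate_pairs(toy_index), intermediate_pairs(pruned))
```

Because the pair count grows quadratically with document frequency, removing even a few very frequent terms cuts the intermediate output sharply.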
 Complexity: O(n²)
 A df-cut of 99 percent eliminates meaning-bearing terms as well as some irrelevant terms
  • Cornell, arthritis
  • sleek, frail
 The df-cut can be relaxed to 99.9 percent
 The exact algorithms used for inverted index construction and pairwise document similarity are not specified.
 Df-cut: does a df-cut of 99 percent significantly affect the quality of the results?
 The results have not been evaluated.
MapReduce Framework for Scalable Pairwise Document Similarity Calculation
