SlideShare ist ein Scribd-Unternehmen logo
1 von 21
Tamer Elsayed, Jimmy Lin, and Douglas Oard


         Niveda Krishnamoorthy
 PairwiseSimilarity
 MapReduce Framework
 Proposed algorithm
  • Inverted Index Construction
  • Pairwise document similarity calculation
 Results
 PubMed   – “More like this”
 Similar blog posts
 Google – Similar pages
 Framework   that supports distributed
  computing on clusters of computers
 Introduced by Google in 2004
 Map step
 Reduce step
 Combine step (Optional)
 Applications
 Consider    two files:

      Hello                Hello
                                      Hello ,2
      World                Hadoop     World ,2
      Bye                  Goodbye     Bye,1
                                     Hadoop ,2
      World                Hadoop    Goodbye ,1
Hello             <Hello,1>

World             <World,1>
          Map 1
Bye               <Bye,1>

World             <World,1>


Hello             <Hello,1>

Hadoop            <Hadoop,1>
          Map 2
Goodbye           <Goodbye,1>

Hadoop            <Hadoop,1>
<Hello,1>
              S   <Hello (1,1)>   Reduce 1    Hello ,2
<World,1>
              H
              U
<Bye,1>           <World(1,1)>    Reduce 2    World ,2
              F
              F
<World,1>
              L    <Bye(1)>       Reduce 3     Bye,1
              E
<Hello,1>         <Hadoop(1,1)>   Reduce 4   Hadoop ,2
              &
<Hadoop,1>
              S   <Goodbye(1)>    Reduce 5   Goodbye ,1
<Goodbye,1>   O
              R
<Hadoop,1>    T
MAPREDUCE ALGORITHM           Scalable
•Inverted Index Computation      and
•Pairwise Similarity          Efficient
Document 1
A                    <A,(d1,2)>
A
B            Map 1   <B,(d1,1)>
C
                     <C,(d1,1)>
Document 2
B                    <B,(d2,1)>
D
D            Map 2
                     <D,(d2,2)>


Document 1           <A,(d3,1)>
A
B                    <B,(d3,2)>
             Map 3
B
E                    <E,(d3,1)>
<A,(d1,2)>
             S     <A,[(d1,2),                   <A,[(d1,2),
<B,(d1,1)>   H      (d3,1)]>        Reduce 1      (d3,1)]>
             U
<C,(d1,1)>   F   <B,[(d1,1), (d2,              <B,[(d1,1), (d2,
             F                      Reduce 2
                 1),(d3,2)]>                   1),(d3,2)]>
             L
<B,(d2,1)>   E     <C,[(d1,1)]>     Reduce 3    <C,[(d1,1)]>

<D,(d2,2)>   &
                   <D,[(d2,2)]>     Reduce 4    <D,[(d2,2)]>
             S
<A,(d3,1)>   O
             R     <E,[(d3,1)]>     Reduce 5    <E,[(d3,1)]>
<B,(d3,2)>   T

<E,(d3,1)>
 Group   by document ID, not pairs




 Golomb’s   compression for postings
 Individual Postings
 List of Postings
<(d1,d3),2>
  <A,[(d1,2),     Map 1
   (d3,1)]>
                          <(d1,d2),1
<B,[(d1,1),
                  Map 2   (d2,d3),2
(d2,1),(d3,2)]>
                          (d1,d3),2>
 <C,[(d1,1)]>


 <D,[(d2,2)]>


 <E,[(d3,1)]>
S
              H
<(d1,d3),2>   U
              F   <(d1,d2)[1]>                <(d1,d2)[1]>
                                   Reduce 1
              F
<(d1,d2),1    L
              E   <(d2,d3)[2]>     Reduce 2   <(d2,d3)[2]>
(d2,d3),2
(d1,d3),2>
              &
                                   Reduce 3
                  <(d1,d3)[2,2]>              <(d1,d3)[4]>
              S
              O
              R
              T
 Hadoop   0.16.0
 20 machine (4GB memory, 100GB disk)
 Similarity function - BM25
 Dataset: AQUAINT-2 (newswire text)
  • 2.5 GB
  • 906k documents
 Tokenization
 Stop word removal
 Stemming
 Df-cut
  • Fraction of terms with highest document
   frequency is eliminated – 99% cut (9093)

            Linear space and time complexity

  • 3.7 billion pairs (vs) 81. trillion pairs
 Complexity:      O(n2)



 Df-cut
       of 99 percent eliminates meaning bearing
 terms and some irrelevant terms
  • Cornell, arthritis
  • sleek, frail
 Df-cut   can be relaxed to 99.9 percent
 Exact  algorithms used for inverted index
  construction and pair-wise document
  similarity are not specified.
 Df-cut – Does a df-cut of 99 percent affect
  the quality of the results significantly?
 The results have not been evaluated.
Pairwise document similarity in large collections with map reduce

Weitere ähnliche Inhalte

Was ist angesagt?

JenkinsとCodeBuildとCloud Buildと私
JenkinsとCodeBuildとCloud Buildと私JenkinsとCodeBuildとCloud Buildと私
JenkinsとCodeBuildとCloud Buildと私Shoji Shirotori
 
3分でわかるAzureでのService Principal
3分でわかるAzureでのService Principal3分でわかるAzureでのService Principal
3分でわかるAzureでのService PrincipalToru Makabe
 
Qlik Tips 20220315 チャートデザイン~アドバンストスタイリング
Qlik Tips 20220315 チャートデザイン~アドバンストスタイリングQlik Tips 20220315 チャートデザイン~アドバンストスタイリング
Qlik Tips 20220315 チャートデザイン~アドバンストスタイリングQlikPresalesJapan
 
MongoDB概要:金融業界でのMongoDB
MongoDB概要:金融業界でのMongoDBMongoDB概要:金融業界でのMongoDB
MongoDB概要:金融業界でのMongoDBippei_suzuki
 
Google BigQuery クエリの処理の流れ - #bq_sushi
Google BigQuery クエリの処理の流れ - #bq_sushi Google BigQuery クエリの処理の流れ - #bq_sushi
Google BigQuery クエリの処理の流れ - #bq_sushi Google Cloud Platform - Japan
 
지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)
지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)
지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)Myungjin Lee
 
Azure Cosmos DB のキホンと使いドコロ
Azure Cosmos DB のキホンと使いドコロAzure Cosmos DB のキホンと使いドコロ
Azure Cosmos DB のキホンと使いドコロKazuyuki Miyake
 
AAD authentication for azure app v0.1.20.0317
AAD authentication for azure app v0.1.20.0317AAD authentication for azure app v0.1.20.0317
AAD authentication for azure app v0.1.20.0317Ayumu Inaba
 
Bits of Advice for the VM Writer, by Cliff Click @ Curry On 2015
Bits of Advice for the VM Writer, by Cliff Click @ Curry On 2015Bits of Advice for the VM Writer, by Cliff Click @ Curry On 2015
Bits of Advice for the VM Writer, by Cliff Click @ Curry On 2015curryon
 
ASP. NET Core 汎用ホスト概要
ASP. NET Core 汎用ホスト概要ASP. NET Core 汎用ホスト概要
ASP. NET Core 汎用ホスト概要TomomitsuKusaba
 
MySQLとPostgreSQLと日本語全文検索 - Azure DatabaseでMroonga・PGroongaを使いたいですよね!?
MySQLとPostgreSQLと日本語全文検索 - Azure DatabaseでMroonga・PGroongaを使いたいですよね!?MySQLとPostgreSQLと日本語全文検索 - Azure DatabaseでMroonga・PGroongaを使いたいですよね!?
MySQLとPostgreSQLと日本語全文検索 - Azure DatabaseでMroonga・PGroongaを使いたいですよね!?Kouhei Sutou
 
NOSQLの基礎知識(講義資料)
NOSQLの基礎知識(講義資料)NOSQLの基礎知識(講義資料)
NOSQLの基礎知識(講義資料)CLOUDIAN KK
 
Ingressの概要とLoadBalancerとの比較
Ingressの概要とLoadBalancerとの比較Ingressの概要とLoadBalancerとの比較
Ingressの概要とLoadBalancerとの比較Mei Nakamura
 
Googleのインフラ技術から考える理想のDevOps
Googleのインフラ技術から考える理想のDevOpsGoogleのインフラ技術から考える理想のDevOps
Googleのインフラ技術から考える理想のDevOpsEtsuji Nakai
 
The Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge GraphThe Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge GraphTrey Grainger
 
今更聞けない電子認証入門 -OAuth 2.0/OIDCからFIDOまで- <改定2版>
今更聞けない電子認証入門 -OAuth 2.0/OIDCからFIDOまで- <改定2版>今更聞けない電子認証入門 -OAuth 2.0/OIDCからFIDOまで- <改定2版>
今更聞けない電子認証入門 -OAuth 2.0/OIDCからFIDOまで- <改定2版>Naoto Miyachi
 
Layout lm paper review
Layout lm paper review Layout lm paper review
Layout lm paper review taeseon ryu
 
An Overview of Temporal Features in SQL:2011
An Overview of Temporal Features in SQL:2011An Overview of Temporal Features in SQL:2011
An Overview of Temporal Features in SQL:2011Craig Baumunk
 

Was ist angesagt? (20)

JenkinsとCodeBuildとCloud Buildと私
JenkinsとCodeBuildとCloud Buildと私JenkinsとCodeBuildとCloud Buildと私
JenkinsとCodeBuildとCloud Buildと私
 
3分でわかるAzureでのService Principal
3分でわかるAzureでのService Principal3分でわかるAzureでのService Principal
3分でわかるAzureでのService Principal
 
Qlik Tips 20220315 チャートデザイン~アドバンストスタイリング
Qlik Tips 20220315 チャートデザイン~アドバンストスタイリングQlik Tips 20220315 チャートデザイン~アドバンストスタイリング
Qlik Tips 20220315 チャートデザイン~アドバンストスタイリング
 
MongoDB概要:金融業界でのMongoDB
MongoDB概要:金融業界でのMongoDBMongoDB概要:金融業界でのMongoDB
MongoDB概要:金融業界でのMongoDB
 
Google BigQuery クエリの処理の流れ - #bq_sushi
Google BigQuery クエリの処理の流れ - #bq_sushi Google BigQuery クエリの処理の流れ - #bq_sushi
Google BigQuery クエリの処理の流れ - #bq_sushi
 
지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)
지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)
지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)
 
Azure Cosmos DB のキホンと使いドコロ
Azure Cosmos DB のキホンと使いドコロAzure Cosmos DB のキホンと使いドコロ
Azure Cosmos DB のキホンと使いドコロ
 
AAD authentication for azure app v0.1.20.0317
AAD authentication for azure app v0.1.20.0317AAD authentication for azure app v0.1.20.0317
AAD authentication for azure app v0.1.20.0317
 
Bits of Advice for the VM Writer, by Cliff Click @ Curry On 2015
Bits of Advice for the VM Writer, by Cliff Click @ Curry On 2015Bits of Advice for the VM Writer, by Cliff Click @ Curry On 2015
Bits of Advice for the VM Writer, by Cliff Click @ Curry On 2015
 
ASP. NET Core 汎用ホスト概要
ASP. NET Core 汎用ホスト概要ASP. NET Core 汎用ホスト概要
ASP. NET Core 汎用ホスト概要
 
MySQLとPostgreSQLと日本語全文検索 - Azure DatabaseでMroonga・PGroongaを使いたいですよね!?
MySQLとPostgreSQLと日本語全文検索 - Azure DatabaseでMroonga・PGroongaを使いたいですよね!?MySQLとPostgreSQLと日本語全文検索 - Azure DatabaseでMroonga・PGroongaを使いたいですよね!?
MySQLとPostgreSQLと日本語全文検索 - Azure DatabaseでMroonga・PGroongaを使いたいですよね!?
 
Db2 V11 GUIツール
Db2 V11 GUIツールDb2 V11 GUIツール
Db2 V11 GUIツール
 
NOSQLの基礎知識(講義資料)
NOSQLの基礎知識(講義資料)NOSQLの基礎知識(講義資料)
NOSQLの基礎知識(講義資料)
 
Ingressの概要とLoadBalancerとの比較
Ingressの概要とLoadBalancerとの比較Ingressの概要とLoadBalancerとの比較
Ingressの概要とLoadBalancerとの比較
 
Googleのインフラ技術から考える理想のDevOps
Googleのインフラ技術から考える理想のDevOpsGoogleのインフラ技術から考える理想のDevOps
Googleのインフラ技術から考える理想のDevOps
 
The Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge GraphThe Apache Solr Semantic Knowledge Graph
The Apache Solr Semantic Knowledge Graph
 
今更聞けない電子認証入門 -OAuth 2.0/OIDCからFIDOまで- <改定2版>
今更聞けない電子認証入門 -OAuth 2.0/OIDCからFIDOまで- <改定2版>今更聞けない電子認証入門 -OAuth 2.0/OIDCからFIDOまで- <改定2版>
今更聞けない電子認証入門 -OAuth 2.0/OIDCからFIDOまで- <改定2版>
 
Kafka・Storm・ZooKeeperの認証と認可について #kafkajp
Kafka・Storm・ZooKeeperの認証と認可について #kafkajpKafka・Storm・ZooKeeperの認証と認可について #kafkajp
Kafka・Storm・ZooKeeperの認証と認可について #kafkajp
 
Layout lm paper review
Layout lm paper review Layout lm paper review
Layout lm paper review
 
An Overview of Temporal Features in SQL:2011
An Overview of Temporal Features in SQL:2011An Overview of Temporal Features in SQL:2011
An Overview of Temporal Features in SQL:2011
 

Ähnlich wie Pairwise document similarity in large collections with map reduce

Geoff Rothman Presentation on Parallel Processing
Geoff Rothman Presentation on Parallel ProcessingGeoff Rothman Presentation on Parallel Processing
Geoff Rothman Presentation on Parallel ProcessingGeoff Rothman
 
Introduction to HADOOP
Introduction to HADOOPIntroduction to HADOOP
Introduction to HADOOPShital Kat
 
10th Maths model3 question paper
10th Maths model3 question paper10th Maths model3 question paper
10th Maths model3 question papersingarls19
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraSomnath Mazumdar
 
Graph analysis platform comparison, pregel/goldenorb/giraph
Graph analysis platform comparison, pregel/goldenorb/giraphGraph analysis platform comparison, pregel/goldenorb/giraph
Graph analysis platform comparison, pregel/goldenorb/giraphAndrew Yongjoon Kong
 
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)Matthew Lease
 
Distributed batch processing with Hadoop
Distributed batch processing with HadoopDistributed batch processing with Hadoop
Distributed batch processing with HadoopFerran Galí Reniu
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkDuyhai Doan
 
A gentle introduction to functional programming through music and clojure
A gentle introduction to functional programming through music and clojureA gentle introduction to functional programming through music and clojure
A gentle introduction to functional programming through music and clojurePaul Lam
 
Large Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part ILarge Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part IMarin Dimitrov
 
Visual Api Training
Visual Api TrainingVisual Api Training
Visual Api TrainingSpark Summit
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceDr Ganesh Iyer
 
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other OptimizationsMastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other Optimizationsscottcrespo
 

Ähnlich wie Pairwise document similarity in large collections with map reduce (19)

Geoff Rothman Presentation on Parallel Processing
Geoff Rothman Presentation on Parallel ProcessingGeoff Rothman Presentation on Parallel Processing
Geoff Rothman Presentation on Parallel Processing
 
Intro to Map Reduce
Intro to Map ReduceIntro to Map Reduce
Intro to Map Reduce
 
LalitBDA2015V3
LalitBDA2015V3LalitBDA2015V3
LalitBDA2015V3
 
Introduction to HADOOP
Introduction to HADOOPIntroduction to HADOOP
Introduction to HADOOP
 
Maths`
Maths`Maths`
Maths`
 
10th Maths model3 question paper
10th Maths model3 question paper10th Maths model3 question paper
10th Maths model3 question paper
 
10th Maths
10th Maths10th Maths
10th Maths
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
 
Graph analysis platform comparison, pregel/goldenorb/giraph
Graph analysis platform comparison, pregel/goldenorb/giraphGraph analysis platform comparison, pregel/goldenorb/giraph
Graph analysis platform comparison, pregel/goldenorb/giraph
 
End sem solution
End sem solutionEnd sem solution
End sem solution
 
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
 
Distributed batch processing with Hadoop
Distributed batch processing with HadoopDistributed batch processing with Hadoop
Distributed batch processing with Hadoop
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
A gentle introduction to functional programming through music and clojure
A gentle introduction to functional programming through music and clojureA gentle introduction to functional programming through music and clojure
A gentle introduction to functional programming through music and clojure
 
MapReduce
MapReduceMapReduce
MapReduce
 
Large Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part ILarge Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part I
 
Visual Api Training
Visual Api TrainingVisual Api Training
Visual Api Training
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other OptimizationsMastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
 

Kürzlich hochgeladen

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 

Kürzlich hochgeladen (20)

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 

Pairwise document similarity in large collections with map reduce

  • 1. Tamer Elsayed, Jimmy Lin, and Douglas Oard Niveda Krishnamoorthy
  • 2.  PairwiseSimilarity  MapReduce Framework  Proposed algorithm • Inverted Index Construction • Pairwise document similarity calculation  Results
  • 3.  PubMed – “More like this”  Similar blog posts  Google – Similar pages
  • 4.  Framework that supports distributed computing on clusters of computers  Introduced by Google in 2004  Map step  Reduce step  Combine step (Optional)  Applications
  • 5.
  • 6.  Consider two files: Hello Hello Hello ,2 World Hadoop World ,2 Bye Goodbye Bye,1 Hadoop ,2 World Hadoop Goodbye ,1
  • 7. Hello <Hello,1> World <World,1> Map 1 Bye <Bye,1> World <World,1> Hello <Hello,1> Hadoop <Hadoop,1> Map 2 Goodbye <Goodbye,1> Hadoop <Hadoop,1>
  • 8. <Hello,1> S <Hello (1,1)> Reduce 1 Hello ,2 <World,1> H U <Bye,1> <World(1,1)> Reduce 2 World ,2 F F <World,1> L <Bye(1)> Reduce 3 Bye,1 E <Hello,1> <Hadoop(1,1)> Reduce 4 Hadoop ,2 & <Hadoop,1> S <Goodbye(1)> Reduce 5 Goodbye ,1 <Goodbye,1> O R <Hadoop,1> T
  • 9. MAPREDUCE ALGORITHM Scalable •Inverted Index Computation and •Pairwise Similarity Efficient
  • 10. Document 1 A <A,(d1,2)> A B Map 1 <B,(d1,1)> C <C,(d1,1)> Document 2 B <B,(d2,1)> D D Map 2 <D,(d2,2)> Document 1 <A,(d3,1)> A B <B,(d3,2)> Map 3 B E <E,(d3,1)>
  • 11. <A,(d1,2)> S <A,[(d1,2), <A,[(d1,2), <B,(d1,1)> H (d3,1)]> Reduce 1 (d3,1)]> U <C,(d1,1)> F <B,[(d1,1), (d2, <B,[(d1,1), (d2, F Reduce 2 1),(d3,2)]> 1),(d3,2)]> L <B,(d2,1)> E <C,[(d1,1)]> Reduce 3 <C,[(d1,1)]> <D,(d2,2)> & <D,[(d2,2)]> Reduce 4 <D,[(d2,2)]> S <A,(d3,1)> O R <E,[(d3,1)]> Reduce 5 <E,[(d3,1)]> <B,(d3,2)> T <E,(d3,1)>
  • 12.  Group by document ID, not pairs  Golomb’s compression for postings  Individual Postings  List of Postings
  • 13. <(d1,d3),2> <A,[(d1,2), Map 1 (d3,1)]> <(d1,d2),1 <B,[(d1,1), Map 2 (d2,d3),2 (d2,1),(d3,2)]> (d1,d3),2> <C,[(d1,1)]> <D,[(d2,2)]> <E,[(d3,1)]>
  • 14. S H <(d1,d3),2> U F <(d1,d2)[1]> <(d1,d2)[1]> Reduce 1 F <(d1,d2),1 L E <(d2,d3)[2]> Reduce 2 <(d2,d3)[2]> (d2,d3),2 (d1,d3),2> & Reduce 3 <(d1,d3)[2,2]> <(d1,d3)[4]> S O R T
  • 15.  Hadoop 0.16.0  20 machine (4GB memory, 100GB disk)  Similarity function - BM25  Dataset: AQUAINT-2 (newswire text) • 2.5 GB • 906k documents
  • 16.  Tokenization  Stop word removal  Stemming  Df-cut • Fraction of terms with highest document frequency is eliminated – 99% cut (9093) Linear space and time complexity • 3.7 billion pairs (vs) 81. trillion pairs
  • 17.
  • 18.
  • 19.  Complexity: O(n2)  Df-cut of 99 percent eliminates meaning bearing terms and some irrelevant terms • Cornell, arthritis • sleek, frail  Df-cut can be relaxed to 99.9 percent
  • 20.  Exact algorithms used for inverted index construction and pair-wise document similarity are not specified.  Df-cut – Does a df-cut of 99 percent affect the quality of the results significantly?  The results have not been evaluated.