SlideShare ist ein Scribd-Unternehmen logo
1 von 15
CSMR: A Scalable Algorithm for 
Text Clustering with Cosine 
Similarity and MapReduce 
Giannakouris – Salalidis Victor - Undergraduate Student 
Plerou Antonia - PhD Candidate 
Sioutas Spyros - Associate Professor
Introduction 
• Big Data: Massive amount of data as a result of the huge 
rate of growth 
• Big Data need to be faced in various domains: Business 
Intelligence, Bioinformatics, Social Media Analytics etc. 
• Text Mining: Classification/Clustering in digital libraries, 
e-mail, Sentiment Analysis on Social Media 
• CSMR: Performs pairwise text similarity, represents text 
data in a vector space and measures similarity in parallel 
manner using MapReduce
Background 
• Vector Space Model: An algebraic model for representing 
text documents as vectors 
• Efficient method for text similarity measurement
TF-IDF 
• Term Frequency – Inverse Document Frequency 
• A numerical statistic that reflects the significance of a 
term in a corpus of documents 
• Usually used in search engines, text mining, text 
similarity in the vector space 
푇퐹 × 퐼퐷퐹 = 
푛푖,푗 
푡 ∈ 푑푗 
× 푙표푔 
|퐷| 
|푑 ∈ 퐷: 푡 ∈ 푑|
Cosine Similarity 
• Cosine Similarity: A measure of similarity between two 
documents represented as vector 
• Measuring of the angle between two vectors 
A  B A  
B 
  
1 
1 2 2 
A  
B 
1 1 
cos(A,B) 
|| A|| || B|| 
( ) ( ) 
n 
i i 
n 
i 
i i 
i i 
 
  
 
 
Hadoop 
• Framework developed by Apache 
• Large-Scale Data Processing and Analytics 
• Scalable and parallel processing of data on large 
computer clusters using MapReduce 
• Runs on commodity, low-end hardware 
• Main Components: HDFS (Hadoop Distributed File 
System), MapReduce 
• Currently used by: Adobe, Yahoo!, Amazon, eBay, 
Facebook and many other companies
MapReduce 
• Programming Paradigm running on Apache Hadoop 
• The main component of Hadoop 
• Useful for processing of large data-sets 
• Breaks the data into key-value pairs 
• Model derived from map and reduce functions of 
Functional Programming 
• Every MR program constitutes of Mappers and Reducers
MapReduce Diagram
CSMR 
• The purposed method, CSMR combines all the above 
mentioned techniques 
• Scalable Algorithm for text clustering using MapReduce model 
• Applies MR model on TF-IDF and Cosine Similarity 
• 4 Phases: 
1. Word Counting 
2. Text Vectorization using term frequencies 
3. Apply TF-IDF on document vectors 
4. Cosine Similarity Measurement
Phase 1: Word Counting 
Algorithm 1: Word Count 
1: class Mapper 
2: method Map( document ) 
3: for each term ∈ document 
4: write ( ( term , docId ) , 1 ) 
5: 
6: class Reducer 
7: method Reduce( ( term , docId ) , ones[ 1 , 1 , … , n ] ) 
8: sum = 0 
9: for each one ∈ ones do 
10: sum = sum +1 
11: return ( ( term , docId ) , o ) 
12: 
13: /* { o ∈ N : the number of occurrences } */
Phase 2: Term Frequency 
Algorithm 2: Term Frequency 
1: class Mapper 
2: method Map( ( term , docId ) , o ) 
3: for each element ∈ ( term , docId ) 
4: write ( docId, ( term, o ) ) 
5: 
6: class Reducer 
7: method Reduce( docId, (term, o) ) 
8: N = 0 
9: for each tuple ∈ ( term, o ) do 
10: N = N + o 
return ( (docId, N), (term, o) )
Phase 3: TF-IDF 
Algorithm 3: Tf-Idf 
1: class Mapper 
2: method Map( ( docId , N ), ( term , o ) ) 
3: for each element ∈ ( term , o ) 
4: write ( term, ( docId, o, N ) ) 
5: 
6: class Reducer 
7: method Reduce( term, ( docId , o , N ) ) 
8: n = 0 
9: for each element ∈ ( docId , o , N ) do 
10: n = n + 1 
11: tf = o / N 
12: idf = log|D| /(1n) 
13: return ( docId, ( term , tf×idf ) ) 
14: 
15: /* Where |D| is the number of documents in the corpus */
Phase 4: Cosine Similarity 
Algorithm 4: Cosine Similarity 
1: class Mapper 
2: method Map( docs ) 
3: n = docs.length 
4: 
5: for i = 0 to docs.length 
6: for j = i+1 to docs.length 
7: write ( ( docs[i].id, docs[j].id ),( docs[i].tfidf, docs[j].tfidf ) ) 
8: 
9: class Reducer 
10: method Reduce( ( docId_A, docId_B ),( docA.tfidf, docB.tfidf ) ) 
11: A = docA.tfidf 
12: B = docB.tfidf 
13: cosine = sum( A×B )/ (sqrt( sum(A2) )× sqrt( sum(B2) )) 
14: return ( (docId_A, docId_B), cosine )
Phase 4: Diagram 
Map 
Doc1,Doc2 
[Doc1 TF-IDF], [Doc2 TF-IDF] 
Doc1,Doc3 
[Doc1 TF-IDF], [Doc3 TF-IDF] 
Doc1,Doc4 
Input [Doc1 TF-IDF], [Doc4 TF-IDF] 
Output 
Doc4,Doc10 
[Doc4 TF-IDF], [Doc10 TF-IDF] 
DocM,DocN 
[DocM TF-IDF], [DocN TF-IDF] 
Reduce 
Doc1,Doc3 
Cosine(Doc1, Doc3) 
Doc1,Doc4 
Cosine(Doc1 ,Doc4) 
Doc4,Doc10 
Cosine(Doc4, Doc10) 
DocM,DocN 
Cosine(DocM, DocN) 
Doc1,Doc2 
Cosine(Doc1, Doc2)
Conclusions & Future Work 
• Finalized proposed method 
• Implementation of the method 
• Experimental tests on real data and computer clusters 
• Deployment of an open-source project 
• Additional implementation using more efficient tools such 
as Apache Spark and Scala 
• Publication of test results

Weitere ähnliche Inhalte

Was ist angesagt?

Vchunk join an efficient algorithm for edit similarity joins
Vchunk join an efficient algorithm for edit similarity joinsVchunk join an efficient algorithm for edit similarity joins
Vchunk join an efficient algorithm for edit similarity joinsVijay Koushik
 
Scoring, term weighting and the vector space
Scoring, term weighting and the vector spaceScoring, term weighting and the vector space
Scoring, term weighting and the vector spaceUjjawal
 
Web clustering engines
Web clustering enginesWeb clustering engines
Web clustering enginesYash Darak
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clusteringKrish_ver2
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesFarzad Nozarian
 
Current clustering techniques
Current clustering techniquesCurrent clustering techniques
Current clustering techniquesPoonam Kshirsagar
 
Text Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor AlgorithmText Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor AlgorithmIJTET Journal
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringIJMER
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methodsKrish_ver2
 
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...marxliouville
 
A survey of web clustering engines
A survey of web clustering enginesA survey of web clustering engines
A survey of web clustering enginesunyil96
 
IRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research PapersIRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research PapersSriTeja Allaparthi
 
Algorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionAlgorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionDeeksha thakur
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithmhadifar
 
3.6 constraint based cluster analysis
3.6 constraint based cluster analysis3.6 constraint based cluster analysis
3.6 constraint based cluster analysisKrish_ver2
 

Was ist angesagt? (20)

Ir 08
Ir   08Ir   08
Ir 08
 
Vchunk join an efficient algorithm for edit similarity joins
Vchunk join an efficient algorithm for edit similarity joinsVchunk join an efficient algorithm for edit similarity joins
Vchunk join an efficient algorithm for edit similarity joins
 
Scoring, term weighting and the vector space
Scoring, term weighting and the vector spaceScoring, term weighting and the vector space
Scoring, term weighting and the vector space
 
Web clustering engines
Web clustering enginesWeb clustering engines
Web clustering engines
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
 
Final proj 2 (1)
Final proj 2 (1)Final proj 2 (1)
Final proj 2 (1)
 
Web clustring engine
Web clustring engineWeb clustring engine
Web clustring engine
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And Strategies
 
Lect4
Lect4Lect4
Lect4
 
Current clustering techniques
Current clustering techniquesCurrent clustering techniques
Current clustering techniques
 
Text Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor AlgorithmText Categorization Using Improved K Nearest Neighbor Algorithm
Text Categorization Using Improved K Nearest Neighbor Algorithm
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
 
Ghost
GhostGhost
Ghost
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
 
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
 
A survey of web clustering engines
A survey of web clustering enginesA survey of web clustering engines
A survey of web clustering engines
 
IRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research PapersIRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research Papers
 
Algorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionAlgorithm Name Detection & Extraction
Algorithm Name Detection & Extraction
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithm
 
3.6 constraint based cluster analysis
3.6 constraint based cluster analysis3.6 constraint based cluster analysis
3.6 constraint based cluster analysis
 

Andere mochten auch

OUTDATED Text Mining 4/5: Text Classification
OUTDATED Text Mining 4/5: Text ClassificationOUTDATED Text Mining 4/5: Text Classification
OUTDATED Text Mining 4/5: Text ClassificationFlorian Leitner
 
Optimization for iterative queries on Mapreduce
Optimization for iterative queries on MapreduceOptimization for iterative queries on Mapreduce
Optimization for iterative queries on Mapreducemakoto onizuka
 
Seeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringSeeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringIJRES Journal
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clusteringSubhas Kumar Ghosh
 
Spark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, alticSpark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, alticALTIC Altic
 
Lec4 Clustering
Lec4 ClusteringLec4 Clustering
Lec4 Clusteringmobius.cn
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLMLconf
 
Information retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi MohammadzadehInformation retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi MohammadzadehHadi Mohammadzadeh
 
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Varad Meru
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduceVarad Meru
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Milind Bhandarkar
 
Temporal Pattern Mining
Temporal Pattern MiningTemporal Pattern Mining
Temporal Pattern MiningPrakhar Dhama
 
Exploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in ClassicsExploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in ClassicsMatteo Romanello
 
How many citations are there in the Data Citation Index?
How many citations are there in the Data Citation Index?How many citations are there in the Data Citation Index?
How many citations are there in the Data Citation Index?Nicolas Robinson-Garcia
 
Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Frequent Pattern Mining - Krishna Sridhar, Feb 2016Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Frequent Pattern Mining - Krishna Sridhar, Feb 2016Seattle DAML meetup
 
Cloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBaseCloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBaseDATAVERSITY
 

Andere mochten auch (20)

OUTDATED Text Mining 4/5: Text Classification
OUTDATED Text Mining 4/5: Text ClassificationOUTDATED Text Mining 4/5: Text Classification
OUTDATED Text Mining 4/5: Text Classification
 
Optimization for iterative queries on Mapreduce
Optimization for iterative queries on MapreduceOptimization for iterative queries on Mapreduce
Optimization for iterative queries on Mapreduce
 
MachineLearning_MPI_vs_Spark
MachineLearning_MPI_vs_SparkMachineLearning_MPI_vs_Spark
MachineLearning_MPI_vs_Spark
 
Seeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text ClusteringSeeds Affinity Propagation Based on Text Clustering
Seeds Affinity Propagation Based on Text Clustering
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
 
Spark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, alticSpark Bi-Clustering - OW2 Big Data Initiative, altic
Spark Bi-Clustering - OW2 Big Data Initiative, altic
 
Lec4 Clustering
Lec4 ClusteringLec4 Clustering
Lec4 Clustering
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
 
Information retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi MohammadzadehInformation retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi Mohammadzadeh
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
 
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
Large-scale Parallel Collaborative Filtering and Clustering using MapReduce f...
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
 
Parallel-kmeans
Parallel-kmeansParallel-kmeans
Parallel-kmeans
 
Temporal Pattern Mining
Temporal Pattern MiningTemporal Pattern Mining
Temporal Pattern Mining
 
IntelliGO semantic similarity measure for Gene Ontology annotations
IntelliGO semantic similarity measure for Gene Ontology annotationsIntelliGO semantic similarity measure for Gene Ontology annotations
IntelliGO semantic similarity measure for Gene Ontology annotations
 
Exploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in ClassicsExploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in Classics
 
How many citations are there in the Data Citation Index?
How many citations are there in the Data Citation Index?How many citations are there in the Data Citation Index?
How many citations are there in the Data Citation Index?
 
Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Frequent Pattern Mining - Krishna Sridhar, Feb 2016Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Frequent Pattern Mining - Krishna Sridhar, Feb 2016
 
Cloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBaseCloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBase
 

Ähnlich wie CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and MapReduce

Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersXiao Qin
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on HadoopPaco Nathan
 
Standardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonStandardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonRalf Gommers
 
Algorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fmAlgorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fmMark Levy
 
MapReduceAlgorithms.ppt
MapReduceAlgorithms.pptMapReduceAlgorithms.ppt
MapReduceAlgorithms.pptCheeWeiTan10
 
Expressiveness, Simplicity and Users
Expressiveness, Simplicity and UsersExpressiveness, Simplicity and Users
Expressiveness, Simplicity and Usersgreenwop
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onDony Riyanto
 
CityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tablesCityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tablesEnrico Daga
 
ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...
ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...
ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...Matthäus Zloch
 
PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...Feng Li
 
Understanding Hadoop through examples
Understanding Hadoop through examplesUnderstanding Hadoop through examples
Understanding Hadoop through examplesYoshitomo Matsubara
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Introduction to Data Structures Sorting and searching
Introduction to Data Structures Sorting and searchingIntroduction to Data Structures Sorting and searching
Introduction to Data Structures Sorting and searchingMvenkatarao
 
Graph db - Pramati Technologies [Meetup]
Graph db - Pramati Technologies [Meetup]Graph db - Pramati Technologies [Meetup]
Graph db - Pramati Technologies [Meetup]Pramati Technologies
 

Ähnlich wie CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and MapReduce (20)

Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on Hadoop
 
Standardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonStandardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for Python
 
Algorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fmAlgorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fm
 
MapReduceAlgorithms.ppt
MapReduceAlgorithms.pptMapReduceAlgorithms.ppt
MapReduceAlgorithms.ppt
 
Hadoop
HadoopHadoop
Hadoop
 
Expressiveness, Simplicity and Users
Expressiveness, Simplicity and UsersExpressiveness, Simplicity and Users
Expressiveness, Simplicity and Users
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
 
CityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tablesCityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tables
 
ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...
ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...
ESWC 2019 - A Software Framework and Datasets for the Analysis of Graphs Meas...
 
PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...
 
Understanding Hadoop through examples
Understanding Hadoop through examplesUnderstanding Hadoop through examples
Understanding Hadoop through examples
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
 
Unit 2
Unit 2Unit 2
Unit 2
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Data Science
Data ScienceData Science
Data Science
 
Introduction to Data Structures Sorting and searching
Introduction to Data Structures Sorting and searchingIntroduction to Data Structures Sorting and searching
Introduction to Data Structures Sorting and searching
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
 
Graph db - Pramati Technologies [Meetup]
Graph db - Pramati Technologies [Meetup]Graph db - Pramati Technologies [Meetup]
Graph db - Pramati Technologies [Meetup]
 

Kürzlich hochgeladen

ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxellehsormae
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 

Kürzlich hochgeladen (20)

ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 

CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and MapReduce

  • 1. CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and MapReduce Giannakouris – Salalidis Victor - Undergraduate Student Plerou Antonia - PhD Candidate Sioutas Spyros - Associate Professor
  • 2. Introduction • Big Data: Massive amount of data as a result of the huge rate of growth • Big Data need to be faced in various domains: Business Intelligence, Bioinformatics, Social Media Analytics etc. • Text Mining: Classification/Clustering in digital libraries, e-mail, Sentiment Analysis on Social Media • CSMR: Performs pairwise text similarity, represents text data in a vector space and measures similarity in parallel manner using MapReduce
  • 3. Background • Vector Space Model: An algebraic model for representing text documents as vectors • Efficient method for text similarity measurement
  • 4. TF-IDF • Term Frequency – Inverse Document Frequency • A numerical statistic that reflects the significance of a term in a corpus of documents • Usually used in search engines, text mining, text similarity in the vector space 푇퐹 × 퐼퐷퐹 = 푛푖,푗 푡 ∈ 푑푗 × 푙표푔 |퐷| |푑 ∈ 퐷: 푡 ∈ 푑|
  • 5. Cosine Similarity • Cosine Similarity: A measure of similarity between two documents represented as vector • Measuring of the angle between two vectors A  B A  B   1 1 2 2 A  B 1 1 cos(A,B) || A|| || B|| ( ) ( ) n i i n i i i i i      
  • 6. Hadoop • Framework developed by Apache • Large-Scale Data Processing and Analytics • Scalable and parallel processing of data on large computer clusters using MapReduce • Runs on commodity, low-end hardware • Main Components: HDFS (Hadoop Distributed File System), MapReduce • Currently used by: Adobe, Yahoo!, Amazon, eBay, Facebook and many other companies
  • 7. MapReduce • Programming Paradigm running on Apache Hadoop • The main component of Hadoop • Useful for processing of large data-sets • Breaks the data into key-value pairs • Model derived from map and reduce functions of Functional Programming • Every MR program constitutes of Mappers and Reducers
  • 9. CSMR • The purposed method, CSMR combines all the above mentioned techniques • Scalable Algorithm for text clustering using MapReduce model • Applies MR model on TF-IDF and Cosine Similarity • 4 Phases: 1. Word Counting 2. Text Vectorization using term frequencies 3. Apply TF-IDF on document vectors 4. Cosine Similarity Measurement
  • 10. Phase 1: Word Counting Algorithm 1: Word Count 1: class Mapper 2: method Map( document ) 3: for each term ∈ document 4: write ( ( term , docId ) , 1 ) 5: 6: class Reducer 7: method Reduce( ( term , docId ) , ones[ 1 , 1 , … , n ] ) 8: sum = 0 9: for each one ∈ ones do 10: sum = sum +1 11: return ( ( term , docId ) , o ) 12: 13: /* { o ∈ N : the number of occurrences } */
  • 11. Phase 2: Term Frequency Algorithm 2: Term Frequency 1: class Mapper 2: method Map( ( term , docId ) , o ) 3: for each element ∈ ( term , docId ) 4: write ( docId, ( term, o ) ) 5: 6: class Reducer 7: method Reduce( docId, (term, o) ) 8: N = 0 9: for each tuple ∈ ( term, o ) do 10: N = N + o return ( (docId, N), (term, o) )
  • 12. Phase 3: TF-IDF Algorithm 3: Tf-Idf 1: class Mapper 2: method Map( ( docId , N ), ( term , o ) ) 3: for each element ∈ ( term , o ) 4: write ( term, ( docId, o, N ) ) 5: 6: class Reducer 7: method Reduce( term, ( docId , o , N ) ) 8: n = 0 9: for each element ∈ ( docId , o , N ) do 10: n = n + 1 11: tf = o / N 12: idf = log|D| /(1n) 13: return ( docId, ( term , tf×idf ) ) 14: 15: /* Where |D| is the number of documents in the corpus */
  • 13. Phase 4: Cosine Similarity Algorithm 4: Cosine Similarity 1: class Mapper 2: method Map( docs ) 3: n = docs.length 4: 5: for i = 0 to docs.length 6: for j = i+1 to docs.length 7: write ( ( docs[i].id, docs[j].id ),( docs[i].tfidf, docs[j].tfidf ) ) 8: 9: class Reducer 10: method Reduce( ( docId_A, docId_B ),( docA.tfidf, docB.tfidf ) ) 11: A = docA.tfidf 12: B = docB.tfidf 13: cosine = sum( A×B )/ (sqrt( sum(A2) )× sqrt( sum(B2) )) 14: return ( (docId_A, docId_B), cosine )
  • 14. Phase 4: Diagram Map Doc1,Doc2 [Doc1 TF-IDF], [Doc2 TF-IDF] Doc1,Doc3 [Doc1 TF-IDF], [Doc3 TF-IDF] Doc1,Doc4 Input [Doc1 TF-IDF], [Doc4 TF-IDF] Output Doc4,Doc10 [Doc4 TF-IDF], [Doc10 TF-IDF] DocM,DocN [DocM TF-IDF], [DocN TF-IDF] Reduce Doc1,Doc3 Cosine(Doc1, Doc3) Doc1,Doc4 Cosine(Doc1 ,Doc4) Doc4,Doc10 Cosine(Doc4, Doc10) DocM,DocN Cosine(DocM, DocN) Doc1,Doc2 Cosine(Doc1, Doc2)
  • 15. Conclusions & Future Work • Finalized proposed method • Implementation of the method • Experimental tests on real data and computer clusters • Deployment of an open-source project • Additional implementation using more efficient tools such as Apache Spark and Scala • Publication of test results