CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and MapReduce
1. CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and MapReduce
Giannakouris-Salalidis Victor - Undergraduate Student
Plerou Antonia - PhD Candidate
Sioutas Spyros - Associate Professor
2. Introduction
• Big Data: massive volumes of data resulting from an enormous rate of growth
• Big Data challenges arise in various domains: Business Intelligence, Bioinformatics, Social Media Analytics, etc.
• Text Mining: Classification/Clustering in digital libraries,
e-mail, Sentiment Analysis on Social Media
• CSMR: performs pairwise text similarity, representing text data in a vector space and measuring similarity in a parallel manner using MapReduce
3. Background
• Vector Space Model: An algebraic model for representing
text documents as vectors
• Efficient method for text similarity measurement
4. TF-IDF
• Term Frequency – Inverse Document Frequency
• A numerical statistic that reflects the significance of a
term in a corpus of documents
• Usually used in search engines, text mining, text
similarity in the vector space
TF × IDF = ( n_{i,j} / Σ_{t ∈ d_j} n_{t,j} ) × log( |D| / |{ d ∈ D : t ∈ d }| )

where n_{i,j} is the number of occurrences of term t_i in document d_j and |D| is the number of documents in the corpus
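As a minimal Python sketch (not part of the slides) of the weighting defined above; the corpus and helper names here are invented for the example:

```python
import math

def tf(term, doc):
    # Term frequency: occurrences of the term divided by document length
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse document frequency: log( |D| / |{ d in D : term in d }| )
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

corpus = [
    ["big", "data", "mining"],
    ["text", "mining", "tools"],
    ["big", "clusters"],
]
# "mining" appears in 2 of the 3 documents, so its IDF is log(3/2)
weight = tf("mining", corpus[0]) * idf("mining", corpus)
```

A rare term concentrated in few documents gets a higher weight than a term spread across the whole corpus.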
5. Cosine Similarity
• Cosine Similarity: a measure of similarity between two documents represented as vectors
• Measured as the cosine of the angle between the two vectors
cos(A, B) = (A · B) / ( ‖A‖ × ‖B‖ ) = Σ_{i=1..n} A_i B_i / ( √( Σ_{i=1..n} A_i² ) × √( Σ_{i=1..n} B_i² ) )
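A short Python sketch of the measure (not part of the slides), written directly from the definition above:

```python
import math

def cosine_similarity(a, b):
    # cos(A,B) = (A . B) / ( ||A|| * ||B|| )
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Parallel vectors have similarity 1; orthogonal vectors have similarity 0
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orth = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because only the angle matters, documents of very different lengths can still score as highly similar.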
6. Hadoop
• Framework developed by Apache
• Large-Scale Data Processing and Analytics
• Scalable and parallel processing of data on large
computer clusters using MapReduce
• Runs on commodity, low-end hardware
• Main Components: HDFS (Hadoop Distributed File
System), MapReduce
• Currently used by: Adobe, Yahoo!, Amazon, eBay,
Facebook and many other companies
7. MapReduce
• Programming Paradigm running on Apache Hadoop
• The main component of Hadoop
• Useful for processing of large data-sets
• Breaks the data into key-value pairs
• Model derived from map and reduce functions of
Functional Programming
• Every MR program consists of Mappers and Reducers
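The map → shuffle → reduce flow described above can be illustrated with Python's own functional primitives; this toy word-count example (not from the slides) stands in for what Hadoop does at cluster scale:

```python
from functools import reduce

words = ["big", "data", "big", "text"]

# Map: turn every word into a (key, value) pair
pairs = list(map(lambda w: (w, 1), words))

# Shuffle: group values by key (Hadoop performs this between map and reduce)
groups = {}
for key, value in pairs:
    groups.setdefault(key, []).append(value)

# Reduce: fold each key's list of values down to a single count
counts = {key: reduce(lambda a, b: a + b, values)
          for key, values in groups.items()}
```

The same three stages, distributed over many machines, are what a Hadoop job runs.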
9. CSMR
• The proposed method, CSMR, combines all of the aforementioned techniques
• Scalable Algorithm for text clustering using MapReduce model
• Applies MR model on TF-IDF and Cosine Similarity
• 4 Phases:
1. Word Counting
2. Text Vectorization using term frequencies
3. Apply TF-IDF on document vectors
4. Cosine Similarity Measurement
10. Phase 1: Word Counting
Algorithm 1: Word Count
1: class Mapper
2: method Map( document )
3: for each term ∈ document
4: write ( ( term , docId ) , 1 )
5:
6: class Reducer
7: method Reduce( ( term , docId ) , ones[ 1 , 1 , … , n ] )
8: sum = 0
9: for each one ∈ ones do
10: sum = sum + 1
11: return ( ( term , docId ) , sum )
12:
13: /* { sum = o ∈ ℕ : the number of occurrences of term in docId } */
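The mapper/reducer pair of Algorithm 1 can be simulated in plain Python (the grouping dict stands in for Hadoop's shuffle; function names are ours, not the paper's):

```python
from collections import defaultdict

def map_word_count(doc_id, document):
    # Mapper: emit ( ( term, docId ), 1 ) for every term occurrence
    for term in document.split():
        yield (term, doc_id), 1

def reduce_word_count(key, ones):
    # Reducer: sum the ones into the occurrence count o
    return key, sum(ones)

docs = {0: "big data big", 1: "data mining"}

# Simulated shuffle: gather every mapper output under its key
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for key, one in map_word_count(doc_id, text):
        grouped[key].append(one)

counts = dict(reduce_word_count(key, ones) for key, ones in grouped.items())
```

Each ( term, docId ) key ends up with its occurrence count, exactly the input Phase 2 expects.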
11. Phase 2: Term Frequency
Algorithm 2: Term Frequency
1: class Mapper
2: method Map( ( term , docId ) , o )
3: for each element ∈ ( term , docId )
4: write ( docId, ( term, o ) )
5:
6: class Reducer
7: method Reduce( docId , tuples[ ( term , o ) , … ] )
8: N = 0
9: for each ( term , o ) ∈ tuples do
10: N = N + o
11: return ( ( docId , N ) , ( term , o ) )
12. Phase 3: TF-IDF
Algorithm 3: Tf-Idf
1: class Mapper
2: method Map( ( docId , N ), ( term , o ) )
3: for each element ∈ ( term , o )
4: write ( term, ( docId, o, N ) )
5:
6: class Reducer
7: method Reduce( term, ( docId , o , N ) )
8: n = 0
9: for each element ∈ ( docId , o , N ) do
10: n = n + 1
11: tf = o / N
12: idf = log( |D| / n )
13: return ( docId, ( term , tf×idf ) )
14:
15: /* Where |D| is the number of documents in the corpus */
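The reducer's computation in Algorithm 3 can be sketched as a single Python function (names and the sample posting list are ours):

```python
import math

def tfidf_reduce(term, postings, num_docs):
    # postings: the ( docId, o, N ) tuples gathered for one term, where
    # o is the term's count in the document and N the document's length.
    # n = number of documents containing the term (its document frequency)
    n = len(postings)
    idf = math.log(num_docs / n)
    # tf = o / N for each document; emit ( docId, ( term, tf * idf ) )
    return [(doc_id, (term, (o / N) * idf)) for doc_id, o, N in postings]

# A term occurring once in a 2-term document, in a 2-document corpus
weights = tfidf_reduce("mining", [(1, 1, 2)], 2)
```

A term present in every document gets idf = log(1) = 0, so its weight vanishes, as intended.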
13. Phase 4: Cosine Similarity
Algorithm 4: Cosine Similarity
1: class Mapper
2: method Map( docs )
3: n = docs.length
4:
5: for i = 0 to n - 1
6: for j = i + 1 to n - 1
7: write ( ( docs[i].id, docs[j].id ),( docs[i].tfidf, docs[j].tfidf ) )
8:
9: class Reducer
10: method Reduce( ( docId_A, docId_B ),( docA.tfidf, docB.tfidf ) )
11: A = docA.tfidf
12: B = docB.tfidf
13: cosine = sum( A × B ) / ( sqrt( sum( A² ) ) × sqrt( sum( B² ) ) )
14: return ( (docId_A, docId_B), cosine )
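Putting the Phase 4 mapper and reducer together, the all-pairs pattern can be sketched in Python with `itertools.combinations` playing the role of the nested i/j loops (document ids and vectors below are invented for the example):

```python
import itertools
import math

def cosine(a, b):
    # cos(A,B) = sum(A*B) / ( sqrt(sum(A^2)) * sqrt(sum(B^2)) )
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def pairwise_cosine(docs):
    # docs: list of ( doc_id, tfidf_vector ); emit the cosine of every
    # unordered pair, mirroring the mapper's nested loops
    return {
        (id_a, id_b): cosine(vec_a, vec_b)
        for (id_a, vec_a), (id_b, vec_b) in itertools.combinations(docs, 2)
    }

docs = [("d0", [1.0, 0.0]), ("d1", [0.0, 1.0]), ("d2", [1.0, 1.0])]
sims = pairwise_cosine(docs)
```

For n documents this emits n(n-1)/2 pairs, which is exactly the work MapReduce distributes across the cluster.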
15. Conclusions & Future Work
• Finalization of the proposed method
• Implementation of the method
• Experimental tests on real data and computer clusters
• Deployment of an open-source project
• Additional implementation using more efficient tools such
as Apache Spark and Scala
• Publication of test results