CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and MapReduce
1. CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and MapReduce
Giannakouris-Salalidis Victor - Undergraduate Student
Plerou Antonia - PhD Candidate
Sioutas Spyros - Associate Professor
2. Introduction
• Big Data: massive volumes of data resulting from an enormous rate of growth
• Big Data challenges arise in various domains: Business Intelligence, Bioinformatics, Social Media Analytics, etc.
• Text Mining: Classification/Clustering in digital libraries,
e-mail, Sentiment Analysis on Social Media
• CSMR: performs pairwise text similarity, representing text data in a vector space and measuring similarity in a parallel manner using MapReduce
3. Background
• Vector Space Model: An algebraic model for representing
text documents as vectors
• Efficient method for text similarity measurement
4. TF-IDF
• Term Frequency – Inverse Document Frequency
• A numerical statistic that reflects the significance of a
term in a corpus of documents
• Usually used in search engines, text mining, text
similarity in the vector space
TF × IDF = ( n_{i,j} / Σ_{t ∈ d_j} n_{t,j} ) × log( |D| / |{ d ∈ D : t ∈ d }| )

where n_{i,j} is the number of occurrences of term t_i in document d_j and |D| is the number of documents in the corpus
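As a minimal Python sketch (not part of the slides) of the weighting defined above; the corpus and helper names here are invented for the example:

```python
import math

def tf(term, doc):
    # Term frequency: occurrences of the term divided by document length
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse document frequency: log( |D| / |{ d in D : term in d }| )
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

corpus = [
    ["big", "data", "mining"],
    ["text", "mining", "tools"],
    ["big", "clusters"],
]
# "mining" appears in 2 of the 3 documents, so its IDF is log(3/2)
weight = tf("mining", corpus[0]) * idf("mining", corpus)
```

A rare term concentrated in few documents gets a higher weight than a term spread across the whole corpus.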
5. Cosine Similarity
• Cosine Similarity: a measure of similarity between two documents represented as vectors
• Measured as the cosine of the angle between the two vectors
cos(A, B) = (A · B) / ( ‖A‖ × ‖B‖ ) = Σ_{i=1..n} A_i B_i / ( √( Σ_{i=1..n} A_i² ) × √( Σ_{i=1..n} B_i² ) )
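A short Python sketch of the measure (not part of the slides), written directly from the definition above:

```python
import math

def cosine_similarity(a, b):
    # cos(A,B) = (A . B) / ( ||A|| * ||B|| )
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Parallel vectors have similarity 1; orthogonal vectors have similarity 0
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orth = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because only the angle matters, documents of very different lengths can still score as highly similar.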
6. Hadoop
• Framework developed by Apache
• Large-Scale Data Processing and Analytics
• Scalable and parallel processing of data on large
computer clusters using MapReduce
• Runs on commodity, low-end hardware
• Main Components: HDFS (Hadoop Distributed File
System), MapReduce
• Currently used by: Adobe, Yahoo!, Amazon, eBay,
Facebook and many other companies
7. MapReduce
• Programming Paradigm running on Apache Hadoop
• The main component of Hadoop
• Useful for processing of large data-sets
• Breaks the data into key-value pairs
• Model derived from map and reduce functions of
Functional Programming
• Every MR program consists of Mappers and Reducers
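The map → shuffle → reduce flow described above can be illustrated with Python's own functional primitives; this toy word-count example (not from the slides) stands in for what Hadoop does at cluster scale:

```python
from functools import reduce

words = ["big", "data", "big", "text"]

# Map: turn every word into a (key, value) pair
pairs = list(map(lambda w: (w, 1), words))

# Shuffle: group values by key (Hadoop performs this between map and reduce)
groups = {}
for key, value in pairs:
    groups.setdefault(key, []).append(value)

# Reduce: fold each key's list of values down to a single count
counts = {key: reduce(lambda a, b: a + b, values)
          for key, values in groups.items()}
```

The same three stages, distributed over many machines, are what a Hadoop job runs.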
9. CSMR
• The proposed method, CSMR, combines all of the aforementioned techniques
• Scalable Algorithm for text clustering using MapReduce model
• Applies MR model on TF-IDF and Cosine Similarity
• 4 Phases:
1. Word Counting
2. Text Vectorization using term frequencies
3. Apply TF-IDF on document vectors
4. Cosine Similarity Measurement
10. Phase 1: Word Counting
Algorithm 1: Word Count
1: class Mapper
2: method Map( document )
3: for each term ∈ document
4: write ( ( term , docId ) , 1 )
5:
6: class Reducer
7: method Reduce( ( term , docId ) , ones[ 1 , 1 , … , n ] )
8: sum = 0
9: for each one ∈ ones do
10: sum = sum + 1
11: return ( ( term , docId ) , sum )
12:
13: /* { sum = o ∈ ℕ : the number of occurrences of term in docId } */
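The mapper/reducer pair of Algorithm 1 can be simulated in plain Python (the grouping dict stands in for Hadoop's shuffle; function names are ours, not the paper's):

```python
from collections import defaultdict

def map_word_count(doc_id, document):
    # Mapper: emit ( ( term, docId ), 1 ) for every term occurrence
    for term in document.split():
        yield (term, doc_id), 1

def reduce_word_count(key, ones):
    # Reducer: sum the ones into the occurrence count o
    return key, sum(ones)

docs = {0: "big data big", 1: "data mining"}

# Simulated shuffle: gather every mapper output under its key
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for key, one in map_word_count(doc_id, text):
        grouped[key].append(one)

counts = dict(reduce_word_count(key, ones) for key, ones in grouped.items())
```

Each ( term, docId ) key ends up with its occurrence count, exactly the input Phase 2 expects.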
11. Phase 2: Term Frequency
Algorithm 2: Term Frequency
1: class Mapper
2: method Map( ( term , docId ) , o )
3: for each element ∈ ( term , docId )
4: write ( docId, ( term, o ) )
5:
6: class Reducer
7: method Reduce( docId , tuples[ ( term , o ) , … ] )
8: N = 0
9: for each ( term , o ) ∈ tuples do
10: N = N + o
11: return ( ( docId , N ) , ( term , o ) )
12. Phase 3: TF-IDF
Algorithm 3: Tf-Idf
1: class Mapper
2: method Map( ( docId , N ), ( term , o ) )
3: for each element ∈ ( term , o )
4: write ( term, ( docId, o, N ) )
5:
6: class Reducer
7: method Reduce( term, ( docId , o , N ) )
8: n = 0
9: for each element ∈ ( docId , o , N ) do
10: n = n + 1
11: tf = o / N
12: idf = log( |D| / n )
13: return ( docId, ( term , tf×idf ) )
14:
15: /* Where |D| is the number of documents in the corpus */
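The reducer's computation in Algorithm 3 can be sketched as a single Python function (names and the sample posting list are ours):

```python
import math

def tfidf_reduce(term, postings, num_docs):
    # postings: the ( docId, o, N ) tuples gathered for one term, where
    # o is the term's count in the document and N the document's length.
    # n = number of documents containing the term (its document frequency)
    n = len(postings)
    idf = math.log(num_docs / n)
    # tf = o / N for each document; emit ( docId, ( term, tf * idf ) )
    return [(doc_id, (term, (o / N) * idf)) for doc_id, o, N in postings]

# A term occurring once in a 2-term document, in a 2-document corpus
weights = tfidf_reduce("mining", [(1, 1, 2)], 2)
```

A term present in every document gets idf = log(1) = 0, so its weight vanishes, as intended.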
13. Phase 4: Cosine Similarity
Algorithm 4: Cosine Similarity
1: class Mapper
2: method Map( docs )
3: n = docs.length
4:
5: for i = 0 to n - 1
6: for j = i + 1 to n - 1
7: write ( ( docs[i].id, docs[j].id ),( docs[i].tfidf, docs[j].tfidf ) )
8:
9: class Reducer
10: method Reduce( ( docId_A, docId_B ),( docA.tfidf, docB.tfidf ) )
11: A = docA.tfidf
12: B = docB.tfidf
13: cosine = sum( A × B ) / ( sqrt( sum( A² ) ) × sqrt( sum( B² ) ) )
14: return ( (docId_A, docId_B), cosine )
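Putting the Phase 4 mapper and reducer together, the all-pairs pattern can be sketched in Python with `itertools.combinations` playing the role of the nested i/j loops (document ids and vectors below are invented for the example):

```python
import itertools
import math

def cosine(a, b):
    # cos(A,B) = sum(A*B) / ( sqrt(sum(A^2)) * sqrt(sum(B^2)) )
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def pairwise_cosine(docs):
    # docs: list of ( doc_id, tfidf_vector ); emit the cosine of every
    # unordered pair, mirroring the mapper's nested loops
    return {
        (id_a, id_b): cosine(vec_a, vec_b)
        for (id_a, vec_a), (id_b, vec_b) in itertools.combinations(docs, 2)
    }

docs = [("d0", [1.0, 0.0]), ("d1", [0.0, 1.0]), ("d2", [1.0, 1.0])]
sims = pairwise_cosine(docs)
```

For n documents this emits n(n-1)/2 pairs, which is exactly the work MapReduce distributes across the cluster.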
15. Conclusions & Future Work
• Finalization of the proposed method
• Implementation of the method
• Experimental tests on real data and computer clusters
• Deployment of an open-source project
• Additional implementation using more efficient tools such
as Apache Spark and Scala
• Publication of test results