Canopy Clustering and K-Means Clustering
Machine Learning Big Data at Hacker Dojo
Anandha L Ranganathan (Anand) analog76@gmail.com
Movie Dataset
Download the movie dataset from http://www.grouplens.org/node/73
The data is in the format UserID::MovieID::Rating::Timestamp
    1::1193::5::978300760
    2::1194::4::978300762
    7::1123::1::978300760
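A minimal sketch of reading one line of this format, assuming all four fields are whole numbers as in the sample lines. The class name Rating and its field names are hypothetical and not part of the dataset or the talk.

```java
// Hypothetical value class for one "UserID::MovieID::Rating::Timestamp" line.
final class Rating {
    final int userId;
    final int movieId;
    final int rating;
    final long timestamp;

    Rating(int userId, int movieId, int rating, long timestamp) {
        this.userId = userId;
        this.movieId = movieId;
        this.rating = rating;
        this.timestamp = timestamp;
    }

    // Splits on the "::" separator and converts each field to a number.
    static Rating parse(String line) {
        String[] fields = line.trim().split("::");
        if (fields.length != 4) {
            throw new IllegalArgumentException("Expected 4 fields: " + line);
        }
        return new Rating(
                Integer.parseInt(fields[0]),
                Integer.parseInt(fields[1]),
                Integer.parseInt(fields[2]),
                Long.parseLong(fields[3]));
    }
}
```

For example, Rating.parse("1::1193::5::978300760") yields userId 1, movieId 1193 and rating 5.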
Similarity Measure
Jaccard similarity coefficient
Cosine similarity
Jaccard Index
Similarity = (# of movies watched by both User A and User B) / (total # of movies watched by either user).
In other words, |A ∩ B| / |A ∪ B|.
For our application I am going to compare subsets of users z₁ and z₂, where z₁, z₂ ∈ Z.
http://en.wikipedia.org/wiki/Jaccard_index
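Jaccard Similarity Coefficient
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    static double similarity(String[] s1, String[] s2) {
        List<String> lstSx = Arrays.asList(s1);
        List<String> lstSy = Arrays.asList(s2);
        // Union: every distinct element of s1 and s2.
        Set<String> unionSxSy = new HashSet<String>(lstSx);
        unionSxSy.addAll(lstSy);
        // Intersection: elements present in both s1 and s2.
        Set<String> intersectionSxSy = new HashSet<String>(lstSx);
        intersectionSxSy.retainAll(lstSy);
        // Cast to double so the division is not truncated to an integer.
        return intersectionSxSy.size() / (double) unionSxSy.size();
    }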
Cosine Similarity
similarity = dot product (A, B) / (||A|| * ||B||)
A simple distance calculation will be used for Canopy clustering.
An expensive distance calculation will be used for K-means clustering.
Canopy Clustering - Mapper
Canopy clusters are subsets of the total population.
The points in a cluster are movies.
If a subset z₁ of the whole population rated movie M1 and the same subset also rated movie M2, then M1 and M2 belong to the same canopy cluster.
Canopy Cluster – Mapper
1. The first point received is the center of a canopy.
2. Receive the second point; if its distance from the canopy center is less than T1, it is a point of that canopy.
3. If d(P1,P2) > T1, that point becomes a new canopy center.
4. If d(P1,P2) < T1, the point belongs to the canopy with center P1.
5. Repeat steps 2, 3 and 4 until the mapper completes its job.
Distance is measured between 0 and 1.
The T1 value is 0.005 and I expect around 200 canopy clusters.
The T2 value is 0.0010.
Canopy Cluster – Mapper
Pseudo code:
    boolean pointStronglyBoundToCanopyCenter = false;
    for (Canopy canopy : canopies) {
        double centerPoint = canopy.getPoint();
        // The point is bound to an existing canopy if it passes the T1 test against that canopy's center.
        if (distanceMeasure.similarity(centerPoint, movie_id) > T1) {
            pointStronglyBoundToCanopyCenter = true;
        }
    }
    // No existing canopy claimed the point, so it starts a new canopy.
    if (!pointStronglyBoundToCanopyCenter) {
        canopies.add(new Canopy(0.0d));
    }
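The mapper steps can also be sketched as a small in-memory routine. This is only an illustration under stated assumptions: each movie is represented by the set of user IDs that rated it, the distance is taken as 1 minus the Jaccard similarity, and a point within T1 of any existing center is treated as bound to it. The class and method names (CanopyCenters, pickCenters) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CanopyCenters {
    // Jaccard distance = 1 - |A ∩ B| / |A ∪ B|, always between 0 and 1.
    static double jaccardDistance(Set<Integer> a, Set<Integer> b) {
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0.0; // two empty sets are treated as identical
        }
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return 1.0 - intersection.size() / (double) union.size();
    }

    // The first point becomes a center; each later point becomes a new
    // center only if it is farther than t1 from every existing center.
    static List<Set<Integer>> pickCenters(List<Set<Integer>> points, double t1) {
        List<Set<Integer>> centers = new ArrayList<>();
        for (Set<Integer> point : points) {
            boolean boundToExistingCenter = false;
            for (Set<Integer> center : centers) {
                if (jaccardDistance(center, point) <= t1) {
                    boundToExistingCenter = true;
                    break;
                }
            }
            if (!boundToExistingCenter) {
                centers.add(point);
            }
        }
        return centers;
    }
}
```

The threshold passed as t1 here is on the same 0 to 1 scale as the distance; the value used in the talk (0.005) depends on the distance measure actually chosen.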
Data Massaging
Convert the data into the required format. In this case the converted data is laid out as <MovieId, List of Users>:
    <MovieId, List<userId, ranking>>
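One way this conversion could look in memory, as a sketch rather than the actual MapReduce job: the input lines are assumed to be in the UserID::MovieID::Rating::Timestamp format, and each {userId, rating} pair is held in a two-element int array. MovieGrouper and groupByMovie are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MovieGrouper {
    // Returns movieId -> list of {userId, rating} pairs, keeping input order.
    static Map<Integer, List<int[]>> groupByMovie(List<String> lines) {
        Map<Integer, List<int[]>> byMovie = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.trim().split("::");
            int userId = Integer.parseInt(fields[0]);
            int movieId = Integer.parseInt(fields[1]);
            int rating = Integer.parseInt(fields[2]);
            // Create the list for a movie the first time it is seen.
            byMovie.computeIfAbsent(movieId, k -> new ArrayList<>())
                   .add(new int[] {userId, rating});
        }
        return byMovie;
    }
}
```

In a MapReduce setting the same grouping would typically fall out of the shuffle: the mapper emits the movie ID as the key and the reducer receives all (user, rating) pairs for that movie.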
Canopy Cluster – Mapper A
Threshold value
Reducer
Mapper A: red center
Mapper B: green center
Redundant centers within the threshold of each other.
Add a small error => Threshold + ξ
So far we have found only the canopy centers.
Run another MR job to find the points that belong to each canopy center.
The canopy clusters are ready when that job completes.
What would it look like?
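A rough in-memory sketch of that second pass, under several assumptions not stated in the slides: movies are sets of user IDs, the distance is 1 minus the Jaccard similarity, and a movie joins every canopy whose center lies within T1 (so canopies may overlap). CanopyAssignment and assign are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CanopyAssignment {
    // Jaccard distance = 1 - |A ∩ B| / |A ∪ B|.
    static double jaccardDistance(Set<Integer> a, Set<Integer> b) {
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return 1.0 - intersection.size() / (double) union.size();
    }

    // Returns center index -> indices of all points within t1 of that center.
    static Map<Integer, List<Integer>> assign(List<Set<Integer>> centers,
                                              List<Set<Integer>> points,
                                              double t1) {
        Map<Integer, List<Integer>> canopies = new LinkedHashMap<>();
        for (int c = 0; c < centers.size(); c++) {
            canopies.put(c, new ArrayList<>());
        }
        for (int p = 0; p < points.size(); p++) {
            for (int c = 0; c < centers.size(); c++) {
                if (jaccardDistance(centers.get(c), points.get(p)) <= t1) {
                    canopies.get(c).add(p);
                }
            }
        }
        return canopies;
    }
}
```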
Canopy Cluster - Before MR job: Sparse Matrix
Canopy Cluster – After MR job
Cells with value 1 are grouped together and users are moved from their original location.
K-Means Clustering
The output of Canopy clustering becomes the input of K-means clustering.
Apply the cosine similarity metric to find similar users.
To compute cosine similarity, create a vector in the format <UserId, List<Movies>>:
    <UserId, {m1, m2, m3, m4, m5}>
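A small sketch of turning a user's movie list into a 0/1 vector suitable for cosine similarity. It assumes a fixed, precomputed ordering of movie IDs that defines the vector's columns; UserVectors, toBinaryVector and movieColumns are hypothetical names.

```java
import java.util.List;
import java.util.Set;

final class UserVectors {
    // Entry i is 1 if the user watched the movie in column i, else 0.
    static int[] toBinaryVector(Set<Integer> watchedMovieIds, List<Integer> movieColumns) {
        int[] vector = new int[movieColumns.size()];
        for (int i = 0; i < movieColumns.size(); i++) {
            vector[i] = watchedMovieIds.contains(movieColumns.get(i)) ? 1 : 0;
        }
        return vector;
    }
}
```

With seven movie columns and a user who watched the first four, this produces the vector 1111000.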
Vector(A) - 1111000
Vector(B) - 0100111
Vector(C) - 1110010
similarity(A,B) = Vector(A) * Vector(B) / (||A|| * ||B||)
Vector(A) * Vector(B) = 1
||A|| * ||B|| = 2 * 2 = 4
1/4 = .25
Similarity(A,B) = .25
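The calculation above can be checked with a short sketch. It assumes the vectors are plain int arrays of equal length; the class name Cosine is hypothetical.

```java
final class Cosine {
    // Cosine similarity = (a · b) / (||a|| * ||b||).
    static double similarity(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normASquared = 0.0;
        double normBSquared = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normASquared += a[i] * a[i];
            normBSquared += b[i] * b[i];
        }
        if (normASquared == 0.0 || normBSquared == 0.0) {
            return 0.0; // an all-zero vector has no direction; treat as dissimilar
        }
        return dot / (Math.sqrt(normASquared) * Math.sqrt(normBSquared));
    }

    public static void main(String[] args) {
        int[] a = {1, 1, 1, 1, 0, 0, 0};
        int[] b = {0, 1, 0, 0, 1, 1, 1};
        int[] c = {1, 1, 1, 0, 0, 1, 0};
        System.out.println(similarity(a, b)); // 0.25, matching the slide
        System.out.println(similarity(a, c)); // 0.75
    }
}
```

By this measure user A is much closer to user C (0.75) than to user B (0.25).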
Find the k-neighbors from the same canopy cluster.
Do not take any point from another canopy cluster if you want a small number of neighbors.
# of K-means clusters > # of canopy clusters.
After a couple of map-reduce jobs the K-means clusters are ready.
Find Nearest Cluster of a Point - Map
    public void addPointToCluster(Point point, Iterable<KMeansCluster> lstKMeansCluster) {
        KMeansCluster closestCluster = null;
        double closestDistance = CanopyThresholdT1 / 3;
        for (KMeansCluster cluster : lstKMeansCluster) {
            double distance = distance(cluster.getCenter(), point);
            // Take the first cluster unconditionally, then any cluster that is strictly closer.
            if (closestCluster == null || closestDistance > distance) {
                closestCluster = cluster;
                closestDistance = distance;
            }
        }
        // Guard against an empty cluster list.
        if (closestCluster != null) {
            closestCluster.add(point);
        }
    }
Find Convergence and Compute Centroid - Reduce
    public void computeConvergence(Iterable<KMeansCluster> clusters) {
        for (KMeansCluster cluster : clusters) {
            // Assumes centroids are Point values with a value-based equals().
            Point newCentroid = cluster.computeCentroid(cluster);
            // Compare by value; == would only compare object references.
            if (cluster.getCentroid().equals(newCentroid)) {
                cluster.converged = true;
            } else {
                cluster.setCentroid(newCentroid);
            }
        }
    }
Run the process of finding the nearest cluster for each point and recomputing the centroid until the centroids become static.
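The slides do not show what computeCentroid does. A common choice, assumed here, is the component-wise mean of the cluster's points, with convergence declared once no coordinate moves by more than a small tolerance. Centroids, mean and converged are hypothetical names, and the tolerance is an added assumption (the code above compares for exact equality).

```java
import java.util.List;

final class Centroids {
    // Component-wise mean of the points assigned to one cluster.
    static double[] mean(List<double[]> points) {
        if (points.isEmpty()) {
            throw new IllegalArgumentException("Cluster has no points");
        }
        int dimensions = points.get(0).length;
        double[] centroid = new double[dimensions];
        for (double[] point : points) {
            for (int i = 0; i < dimensions; i++) {
                centroid[i] += point[i];
            }
        }
        for (int i = 0; i < dimensions; i++) {
            centroid[i] /= points.size();
        }
        return centroid;
    }

    // True when no coordinate moved by more than epsilon.
    static boolean converged(double[] oldCentroid, double[] newCentroid, double epsilon) {
        for (int i = 0; i < oldCentroid.length; i++) {
            if (Math.abs(oldCentroid[i] - newCentroid[i]) > epsilon) {
                return false;
            }
        }
        return true;
    }
}
```

Using a tolerance instead of exact equality avoids extra iterations caused by floating-point noise.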
All points – before clustering
Canopy clustering
Canopy clustering and K-means clustering
?
