Canopy Clustering and K-Means Clustering
Machine Learning Big Data at Hacker Dojo
Anandha L Ranganathan (Anand), analog76@gmail.com

Movie Dataset
Download the movie dataset from http://www.grouplens.org/node/73
The data is in the format UserID::MovieID::Rating::Timestamp
    1::1193::5::978300760
    2::1194::4::978300762
    7::1123::1::978300760

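A minimal sketch of reading one line of this format, assuming every line has exactly four "::"-separated numeric fields. The class name RatingLineSketch and its field names are hypothetical and not part of the original slides.

```java
final class RatingLineSketch {
    final int userId;
    final int movieId;
    final int rating;
    final long timestamp;

    RatingLineSketch(int userId, int movieId, int rating, long timestamp) {
        this.userId = userId;
        this.movieId = movieId;
        this.rating = rating;
        this.timestamp = timestamp;
    }

    /** Parses one line of the form UserID::MovieID::Rating::Timestamp. */
    static RatingLineSketch parse(String line) {
        String[] parts = line.trim().split("::");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected 4 fields separated by '::' but got: " + line);
        }
        return new RatingLineSketch(
                Integer.parseInt(parts[0]),
                Integer.parseInt(parts[1]),
                Integer.parseInt(parts[2]),
                Long.parseLong(parts[3]));
    }
}
```

For example, RatingLineSketch.parse("1::1193::5::978300760") yields user 1, movie 1193, rating 5.
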
Similarity Measure
- Jaccard similarity coefficient
- Cosine similarity

Jaccard Index
Jaccard index = number of movies watched by both User A and User B / total number of movies watched by either user.
In other words, |A ∩ B| / |A ∪ B|.
For our application I am going to compare the subsets of users z₁ and z₂, where z₁, z₂ ∈ Z.
http://en.wikipedia.org/wiki/Jaccard_index

Jaccard Similarity Coefficient
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Returns |s1 ∩ s2| / |s1 ∪ s2|; the result is NaN when both arrays are empty.
    static double similarity(String[] s1, String[] s2) {
        List<String> lstSx = Arrays.asList(s1);
        List<String> lstSy = Arrays.asList(s2);
        // Union of the two sets
        Set<String> unionSxSy = new HashSet<String>(lstSx);
        unionSxSy.addAll(lstSy);
        // Intersection of the two sets
        Set<String> intersectionSxSy = new HashSet<String>(lstSx);
        intersectionSxSy.retainAll(lstSy);
        return intersectionSxSy.size() / (double) unionSxSy.size();
    }

Cosine Similarity
similarity = dot product(A, B) / (||A|| * ||B||)
A simple (cheap) distance calculation is used for canopy clustering.
An expensive distance calculation is used for K-means clustering.

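The cosine formula can be sketched as follows, assuming both vectors are plain double arrays of equal length. The class name CosineSimilaritySketch and the choice to return 0 for an all-zero vector are assumptions made for illustration, not something the slides specify.

```java
final class CosineSimilaritySketch {
    /** Cosine similarity: dot(a, b) / (||a|| * ||b||). */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // Assumption: an all-zero vector is treated as having similarity 0.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```
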
Canopy Clustering - Mapper
Canopy clusters are subsets of the total population. The points in a cluster are movies.
If z₁, a subset of the whole population, rated movie M1, and the same subset also rated movie M2, then movies M1 and M2 belong to the same canopy cluster.

Canopy Cluster - Mapper
1. The first point received is the center of a canopy. Call it P1.
2. Receive the second point, P2, and measure its distance from the canopy center. If the distance is less than T2, P2 is a point of that canopy.
3. If d(P1,P2) > T2, then P2 is a new canopy center.
4. If d(P1,P2) < T2, then P2 is a point of the canopy centered at P1.
5. Repeat steps 2, 3 and 4 until the mapper completes its job.
Distances are measured between 0 and 1.
The T2 value is 0.005, and I expect around 200 canopy clusters. The T1 value is 0.0010.

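The mapper steps above can be sketched as a single pass over the incoming points. This is an illustration under assumptions, not the author's implementation: the class name CanopyCenterSketch, the generic point type P and the pluggable distance function are hypothetical, and a point exactly T2 away from a center (a case the slide leaves open) is treated here as a new center.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

final class CanopyCenterSketch {
    /** Returns the canopy centers chosen from the points, in arrival order. */
    static <P> List<P> pickCenters(Iterable<P> points,
                                   ToDoubleBiFunction<P, P> distance,
                                   double t2) {
        List<P> centers = new ArrayList<>();
        for (P p : points) {
            boolean covered = false;
            for (P c : centers) {
                // Within T2 of an existing center: p joins that canopy.
                if (distance.applyAsDouble(c, p) < t2) {
                    covered = true;
                    break;
                }
            }
            // The first point, or a point not within T2 of any center, starts a new canopy.
            if (!covered) {
                centers.add(p);
            }
        }
        return centers;
    }
}
```

With one-dimensional points and absolute difference as the distance, the points 0.0, 0.001, 0.01, 0.012, 0.5 and T2 = 0.005 give the centers 0.0, 0.01 and 0.5.
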
Canopy Cluster - Mapper
Pseudo code:
    boolean pointStronglyBoundToCanopyCenter = false;
    for (Canopy canopy : canopies) {
        double centerPoint = canopy.getPoint();
        // A similarity above T1 means the movie is strongly bound to this canopy.
        if (distanceMeasure.similarity(centerPoint, movie_id) > T1) {
            pointStronglyBoundToCanopyCenter = true;
        }
    }
    // A movie that is not strongly bound to any existing canopy starts a new one.
    if (!pointStronglyBoundToCanopyCenter) {
        canopies.add(new Canopy(0.0d));
    }

Data Massaging
Convert the data into the required format. In this case the converted data is laid out as <MovieId, List of Users>:
    <MovieId, List<userId, rating>>

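A minimal sketch of this conversion, assuming the input is a list of lines in the UserID::MovieID::Rating::Timestamp format and that an in-memory map is acceptable for illustration (the real job would run as map-reduce). The class name GroupByMovieSketch is hypothetical, and the nested map stands in for the List<userId, rating> structure.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class GroupByMovieSketch {
    /** Builds movieId -> (userId -> rating) from raw rating lines. */
    static Map<Integer, Map<Integer, Integer>> group(List<String> lines) {
        Map<Integer, Map<Integer, Integer>> byMovie = new TreeMap<>();
        for (String line : lines) {
            String[] f = line.trim().split("::");
            int userId = Integer.parseInt(f[0]);
            int movieId = Integer.parseInt(f[1]);
            int rating = Integer.parseInt(f[2]);
            // Create the per-movie map on first sight of the movie, then record the rating.
            byMovie.computeIfAbsent(movieId, k -> new TreeMap<>()).put(userId, rating);
        }
        return byMovie;
    }
}
```
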
Canopy Cluster - Mapper A

Threshold value
Correction: T1 and T2 are labeled the wrong way round. The inner circle is T2 and the outer circle is T1.

Reducer
Mapper A: red center. Mapper B: green center.
Redundant centers are within the threshold of each other.
Add a small error => Threshold + ξ

So far we have found only the canopy centers. Run another MR job to find the points that belong to each canopy center. The canopy clusters are ready when that job completes. What would it look like?

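A sketch of what that second job computes, shown in memory rather than as map-reduce. The class name CanopyAssignSketch, the generic point type P and the single threshold parameter are assumptions; the slides do not say which of T1 or T2 this job uses, so the caller picks the value. A point can fall into more than one canopy.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

final class CanopyAssignSketch {
    /** Maps each center to the points lying within the threshold of it. */
    static <P> Map<P, List<P>> assign(List<P> centers, List<P> points,
                                      ToDoubleBiFunction<P, P> distance,
                                      double threshold) {
        Map<P, List<P>> canopies = new LinkedHashMap<>();
        for (P c : centers) {
            canopies.put(c, new ArrayList<>());
        }
        for (P p : points) {
            for (P c : centers) {
                // No break here: overlapping canopies may share a point.
                if (distance.applyAsDouble(c, p) < threshold) {
                    canopies.get(c).add(p);
                }
            }
        }
        return canopies;
    }
}
```
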
Canopy Cluster - Before MR job: Sparse Matrix

Canopy Cluster - After MR job
Cells with the value 1 are grouped together, and users are moved from their original locations.

K-Means Clustering
The output of canopy clustering becomes the input of K-means clustering.
Apply the cosine similarity metric to find similar users.
To compute cosine similarity, create a vector in the format
    <UserId, List<Movies>>
    <UserId, {m1, m2, m3, m4, m5}>

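One way to turn a user's movie list into a vector suitable for cosine similarity is a 0/1 vector over a fixed movie ordering. This is a sketch under that assumption; the class name UserVectorSketch and the string movie IDs are hypothetical.

```java
import java.util.List;
import java.util.Set;

final class UserVectorSketch {
    /** Returns a 0/1 vector with a 1 at position i when the user has watched allMovies.get(i). */
    static double[] toVector(List<String> allMovies, Set<String> watched) {
        double[] v = new double[allMovies.size()];
        for (int i = 0; i < v.length; i++) {
            v[i] = watched.contains(allMovies.get(i)) ? 1.0 : 0.0;
        }
        return v;
    }
}
```

With the ordering m1..m7, a user who watched m1, m2, m3 and m4 maps to the vector 1111000.
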
Vector(A) = 1111000
Vector(B) = 0100111
Vector(C) = 1110010
similarity(A, B) = Vector(A) · Vector(B) / (||A|| * ||B||)
Vector(A) · Vector(B) = 1
||A|| * ||B|| = 2 * 2 = 4
Similarity(A, B) = 1/4 = 0.25

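The same calculation can be carried through for the other two pairs of the vectors A, B and C given above. This worked example is an addition for illustration; each vector has four ones, so every norm is 2.

```latex
\begin{aligned}
A \cdot C &= 3, & \|A\|\,\|C\| &= 2 \cdot 2 = 4, & \text{Similarity}(A, C) &= \tfrac{3}{4} = 0.75 \\
B \cdot C &= 2, & \|B\|\,\|C\| &= 2 \cdot 2 = 4, & \text{Similarity}(B, C) &= \tfrac{2}{4} = 0.5
\end{aligned}
```

So C is closer to A than to B, and A and B are the least similar pair.
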
Find the k neighbors from within the same canopy cluster.
Do not take any point from another canopy cluster if you want a small number of neighbors.
# of K-means clusters > # of canopy clusters.
After a couple of map-reduce jobs the K-means clusters are ready.

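Find Nearest Cluster of a Point - Map
    // Point, KMeansCluster, distance() and CanopyThresholdT1 are defined elsewhere in the job.
    public void addPointToCluster(Point point, Iterable<KMeansCluster> lstKMeansCluster) {
        KMeansCluster closestCluster = null;
        double closestDistance = CanopyThresholdT1 / 3;
        for (KMeansCluster cluster : lstKMeansCluster) {
            double distance = distance(cluster.getCenter(), point);
            // The first cluster is always accepted; later ones must be strictly closer.
            if (closestCluster == null || closestDistance > distance) {
                closestCluster = cluster;
                closestDistance = distance;
            }
        }
        // closestCluster stays null only when no clusters were supplied.
        if (closestCluster != null) {
            closestCluster.add(point);
        }
    }
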
Compute the centroid until it converges.
    public void computeConvergence(Iterable<KMeansCluster> clusters) {
        for (KMeansCluster cluster : clusters) {
            // Assumes the centroid type is Point and that Point implements equals().
            Point newCentroid = cluster.computeCentroid(cluster);
            // Compare by value; == would only compare object references.
            if (cluster.getCentroid().equals(newCentroid)) {
                cluster.converged = true;
            } else {
                cluster.setCentroid(newCentroid);
            }
        }
    }
Repeat the two steps (find the nearest cluster of each point, then recompute the centroids) until the centroids become static.

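The assign-then-recompute loop can be sketched end to end on plain arrays. This is an in-memory illustration, not the map-reduce job from the slides: the class name KMeansLoopSketch is hypothetical, it uses squared Euclidean distance instead of cosine similarity to keep the code short, and it assumes at least one point and one initial centroid.

```java
import java.util.Arrays;

final class KMeansLoopSketch {
    /** Index of the centroid closest to p (squared Euclidean distance). */
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0.0;
            for (int i = 0; i < p.length; i++) {
                double diff = p[i] - centroids[c][i];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    /** Repeats assign + recompute until no centroid changes or maxIter is reached. */
    static double[][] run(double[][] points, double[][] initial, int maxIter) {
        double[][] centroids = new double[initial.length][];
        for (int c = 0; c < initial.length; c++) {
            centroids[c] = initial[c].clone();
        }
        for (int iter = 0; iter < maxIter; iter++) {
            double[][] sums = new double[centroids.length][points[0].length];
            int[] counts = new int[centroids.length];
            // "Map" step: assign every point to its nearest centroid.
            for (double[] p : points) {
                int c = nearest(p, centroids);
                counts[c]++;
                for (int i = 0; i < p.length; i++) {
                    sums[c][i] += p[i];
                }
            }
            // "Reduce" step: recompute each centroid as the mean of its points.
            boolean changed = false;
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its old centroid
                }
                double[] mean = new double[sums[c].length];
                for (int i = 0; i < mean.length; i++) {
                    mean[i] = sums[c][i] / counts[c];
                }
                if (!Arrays.equals(mean, centroids[c])) {
                    centroids[c] = mean;
                    changed = true;
                }
            }
            if (!changed) {
                break; // centroids are static: converged
            }
        }
        return centroids;
    }
}
```

Exact equality is used as the convergence test to mirror the slide; a tolerance would be the more usual choice for real-valued data.
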
All points - before clustering

Canopy clustering

Canopy clustering and K-means clustering

Questions?

References
- Apache Mahout: https://cwiki.apache.org/MAHOUT/canopy-clustering.html
- Canopy Clustering: http://code.google.com/p/canopy-clustering/
- Google Lectures: http://www.youtube.com/watch?v=1ZDybXl212Q
- http://cs.boisestate.edu/~amit/research/makho_ngazimbi_project.pdf

Canopy k-means using Hadoop
