Determining the k in k-means
with MapReduce
Thibault Debatty, Pietro Michiardi,
Wim Mees & Olivier Thonnard
Algorithms for MapReduce and Beyond 2014
Clustering & k-means
● Clustering
● K-means
[Stuart P. Lloyd. Least squares quantization in PCM. IEEE
Transactions on Information Theory, 28:129–137, 1982.]
– 1982 (a great year!)
– But still widely used
– Drawbacks (amongst others):
● Local minimum
● K is a parameter!
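A minimal Python sketch of Lloyd's iteration may make the two drawbacks concrete: k is fixed by the number of initial centers passed in, and the result depends on where those centers start. The names `kmeans` and `squared_distance` and the fixed iteration count are illustrative assumptions, not taken from the slides.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centers, iterations=10):
    """Plain Lloyd iteration; k is simply len(centers) (a parameter!)."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its points.
        # An empty group keeps its previous center.
        centers = [
            tuple(sum(col) / len(group) for col in zip(*group)) if group else center
            for group, center in zip(groups, centers)
        ]
    return centers


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
result = kmeans(points, centers=[(0, 0), (10, 0)])
```

Starting the same call from different initial centers can converge to a different (local) minimum, which is the first drawback listed above.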
Clustering & k-means
● Determine k:
– VERY difficult
[Anil K. Jain. Data Clustering: 50 Years Beyond K-Means.
Pattern Recognition Letters, 2009]
– Using cluster evaluation metrics:
Dunn's index, Elbow, Silhouette, “jump method” (based on
information theory), “Gap statistic”,...
O(k²)
G-means
● G-means
[Greg Hamerly and Charles Elkan. Learning the k in k-means.
In Neural Information Processing Systems. MIT Press, 2003]
● K-means: points in each cluster are
spherically distributed around the center
Source: scikit-learn
● G-means: normality test & recursion
G-means
Dataset
1. Pick 2 centers
2. k-means
3. Project
4. Normality test
Normal? No => recursion
5. Recursion
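Steps 3 and 4 above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the projection is taken onto the vector joining the two child centers, the normality test is an Anderson-Darling statistic with mean and variance estimated from the data, and both the small-sample correction factor and the default critical value `1.8692` are assumptions that should be checked against the G-means paper before being relied on.

```python
import math
import random


def project(point, center_a, center_b):
    """Scalar projection of a point onto the vector joining two child centers."""
    v = [a - b for a, b in zip(center_a, center_b)]
    norm_sq = sum(x * x for x in v)
    return sum(p * x for p, x in zip(point, v)) / norm_sq


def anderson_darling(values):
    """Corrected A^2 statistic for normality (mean and variance estimated)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    z = sorted((x - mean) / std for x in values)
    # Standard normal CDF, clamped so that log() never sees 0.
    eps = 1e-12
    cdf = [min(max(0.5 * (1.0 + math.erf(x / math.sqrt(2.0))), eps), 1.0 - eps)
           for x in z]
    s = sum((2 * i + 1) * (math.log(cdf[i]) + math.log(1.0 - cdf[n - 1 - i]))
            for i in range(n))
    a2 = -n - s / n
    # Assumed small-sample correction; verify before use.
    return a2 * (1.0 + 4.0 / n - 25.0 / (n * n))


def looks_gaussian(values, critical_value=1.8692):
    """True if the cluster passes the test (no further split needed).

    The default critical value is an assumption, not taken from the slides.
    """
    return anderson_darling(values) <= critical_value


random.seed(0)
one_blob = [random.gauss(0.0, 1.0) for _ in range(2000)]
two_blobs = ([random.gauss(-3.0, 1.0) for _ in range(1000)]
             + [random.gauss(3.0, 1.0) for _ in range(1000)])
```

With these samples, `looks_gaussian(one_blob)` should hold while `looks_gaussian(two_blobs)` should not, which is the case where G-means keeps the two new centers and recurses.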
MapReduce G-means
● Challenges:
1. Reduce I/O operations
2. Reduce number of jobs
3. Maximize parallelism
4. Limit memory usage
MapReduce G-means
2. Reduce number of jobs
PickInitialCenters
while Not ClusteringCompleted do
    KMeans
    KMeansAndFindNewCenters
    TestClusters
end while
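The loop above can be read as a driver that launches one job per line. The Python sketch below only counts job launches to show the cost per round; the `jobs` object, its method names, and the `CountingJobs` stand-in are hypothetical and do not correspond to any Hadoop API.

```python
def run_gmeans(jobs, max_iterations=50):
    """Launch the job sequence from the pseudocode; return the launch log."""
    launched = ["PickInitialCenters"]
    jobs.pick_initial_centers()
    iterations = 0
    while not jobs.clustering_completed() and iterations < max_iterations:
        jobs.kmeans()
        jobs.kmeans_and_find_new_centers()
        jobs.test_clusters()
        launched += ["KMeans", "KMeansAndFindNewCenters", "TestClusters"]
        iterations += 1
    return launched


class CountingJobs:
    """Hypothetical stand-in: reports completion after a fixed number of rounds."""

    def __init__(self, rounds_needed):
        self.rounds_needed = rounds_needed
        self.rounds_done = 0

    def pick_initial_centers(self):
        pass

    def clustering_completed(self):
        return self.rounds_done >= self.rounds_needed

    def kmeans(self):
        pass

    def kmeans_and_find_new_centers(self):
        pass

    def test_clusters(self):
        self.rounds_done += 1


log = run_gmeans(CountingJobs(rounds_needed=3))
```

In this sketch every round costs three job launches, so the total grows linearly with the number of rounds, which is why the number of jobs per round matters.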
MapReduce G-means
TestClusters
Map(key, point)
    Find cluster
    Find vector
    Project point on vector
    Emit(cluster, projection)
end procedure
Reduce(cluster, projections)
    Build a vector
    ADtest(vector)
    if normal then
        Mark cluster
    end if
end procedure
3. Maximize parallelism
4. Limit memory usage
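The TestClusters job above can be simulated locally in Python to see where the data goes. This is a sketch only: the shuffle is replaced by a dictionary, and `find_cluster`, `find_vector`, `ad_test` and `critical_value` are hypothetical callables and parameters supplied by the caller, not names from the slides.

```python
from collections import defaultdict


def map_test_clusters(point, find_cluster, find_vector):
    """Map side: emit (cluster, projection) for one point."""
    cluster = find_cluster(point)
    vector = find_vector(cluster)
    norm_sq = sum(x * x for x in vector)
    projection = sum(p * x for p, x in zip(point, vector)) / norm_sq
    return cluster, projection


def reduce_test_clusters(cluster, projections, ad_test, critical_value):
    """Reduce side: all projections of one cluster are held in memory at once."""
    vector = list(projections)
    is_normal = ad_test(vector) <= critical_value
    return cluster, is_normal


def run_locally(points, find_cluster, find_vector, ad_test, critical_value):
    """Stand-in for the shuffle: group map output by cluster, then reduce."""
    shuffled = defaultdict(list)
    for point in points:
        cluster, projection = map_test_clusters(point, find_cluster, find_vector)
        shuffled[cluster].append(projection)
    return dict(
        reduce_test_clusters(cluster, values, ad_test, critical_value)
        for cluster, values in shuffled.items()
    )


# Toy usage: the spread of the projections stands in for a real normality test.
marked = run_locally(
    points=[(-1.0, 0.0), (-1.5, 0.0), (1.0, 0.0), (5.0, 0.0)],
    find_cluster=lambda p: 0 if p[0] < 0 else 1,
    find_vector=lambda cluster: (1.0, 0.0),
    ad_test=lambda values: max(values) - min(values),
    critical_value=1.0,
)
```

Note that each reduce call receives every projection of its cluster, so its memory use grows with cluster size, and there can be no more busy reducers than clusters.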
Bottleneck:
3. Maximize parallelism
4. Limit memory usage (risk of crash)
TestFewClusters
Map(key, point)
    Find cluster
    Find vector
    Project point on vector
    Add projection to list
end procedure
Close()
    For each list do
        Build a vector
        A2 = ADtest(vector)
        Emit(cluster, A2)
    End for each
end procedure
In-memory combiner
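The in-memory combiner pattern used by TestFewClusters can be sketched in Python as a mapper object that buffers projections and only emits when it is closed. The class name `FewClustersMapper` and the injected callables `find_cluster`, `find_vector` and `ad_test` are hypothetical; this is not the Hadoop Mapper API.

```python
from collections import defaultdict


class FewClustersMapper:
    """Buffers projections per cluster and tests them when the mapper closes."""

    def __init__(self, find_cluster, find_vector, ad_test):
        self.find_cluster = find_cluster
        self.find_vector = find_vector
        self.ad_test = ad_test
        self.projections = defaultdict(list)  # cluster -> list of projections

    def map(self, key, point):
        """Project the point and keep the result in memory; emit nothing yet."""
        cluster = self.find_cluster(point)
        vector = self.find_vector(cluster)
        norm_sq = sum(x * x for x in vector)
        projection = sum(p * x for p, x in zip(point, vector)) / norm_sq
        self.projections[cluster].append(projection)

    def close(self):
        """Run the test once per buffered list and emit (cluster, A2) pairs."""
        for cluster, values in self.projections.items():
            yield cluster, self.ad_test(values)


# Toy usage: len() stands in for the Anderson-Darling statistic.
mapper = FewClustersMapper(
    find_cluster=lambda p: 0 if p[0] < 0 else 1,
    find_vector=lambda cluster: (1.0, 0.0),
    ad_test=len,
)
for key, point in enumerate([(-1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]):
    mapper.map(key, point)
emitted = dict(mapper.close())
```

Each mapper sees only its own input split, so the value it emits covers that split's share of a cluster; how the per-mapper values are combined afterwards is not shown on the slide and is left out of this sketch.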
#clusters > #reducers
&
Estimated required memory < Java heap
Experimentally: 64 bytes / point
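The memory condition can be turned into a small back-of-the-envelope check. In this Python sketch only the 64 bytes per point figure comes from the slide; the function names and the example heap sizes are assumptions chosen for illustration.

```python
BYTES_PER_POINT = 64  # experimental figure quoted on the slide


def estimated_memory_bytes(n_points, bytes_per_point=BYTES_PER_POINT):
    """Rough memory needed to hold the projections of n_points in one JVM."""
    return n_points * bytes_per_point


def fits_in_heap(n_points, heap_bytes, bytes_per_point=BYTES_PER_POINT):
    """True if the estimate stays strictly below the available Java heap."""
    return estimated_memory_bytes(n_points, bytes_per_point) < heap_bytes


# Example heap sizes are illustrative, not values from the experiments.
ten_million = 10_000_000
needed = estimated_memory_bytes(ten_million)    # 640,000,000 bytes
ok_with_1_gib = fits_in_heap(ten_million, 1024 ** 3)
ok_with_512_mib = fits_in_heap(ten_million, 512 * 1024 ** 2)
```

Under this estimate, ten million points need about 640 MB, which fits in a 1 GiB heap but not in a 512 MiB one.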
Comparison
MR multi-k-means
● All possible values of k in a single job
● Speed: O(nk²) computations
MR G-means
● Speed: O(nk) computations
  But:
  – more iterations
  – more dataset reads
  – log2(k)
● Quality: new centers added if and where needed
  But: tends to overestimate k!
Experimental results : Speed
● Hadoop
● Synthetic dataset
● 10M points in R^10
● Euclidean distance
● 8 machines
Experimental results : Quality
● Hadoop
● Synthetic dataset
● 10M points in R^10
● Euclidean distance
● 8 machines

k                  100    200    400
k found            150    279    639    (x ~1.5)

Within-Cluster Sum of Squares (less is better)
MR G-means         3.34   3.33   3.23
multi-k-means      3.71   3.6    3.39   (with same k)
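For reference, the within-cluster sum of squares can be computed as below. This Python sketch uses the plain textbook definition with squared Euclidean distance; whether the figures in the table were normalized in some way is not stated on the slide, so the function is not claimed to reproduce them. The name `wcss` and the toy data are illustrative.

```python
def wcss(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    total = 0.0
    for p in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c))
            for c in centers
        )
    return total


points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
two_centers = wcss(points, [(1.0, 0.0), (10.0, 0.0)])
three_centers = wcss(points, [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)])
```

Adding centers can only keep this value equal or lower it, which is why the comparison in the table is made with the same k for both algorithms.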
Conclusions & future work...
● MapReduce algorithm to determine k
● Running time proportional to k
● Future:
– Overestimation of k
– Test on real data
– Test scalability
– Reduce I/O (using Spark)
– Consider skewed data
– Consider impact of machine failure
Thank you!

Weitere ähnliche Inhalte

Was ist angesagt?

3D Watershed Celebes
3D Watershed Celebes3D Watershed Celebes
3D Watershed CelebesHartanto Sanjaya
 
3D Analyst - Watershed and Stream Network
3D Analyst - Watershed and Stream Network3D Analyst - Watershed and Stream Network
3D Analyst - Watershed and Stream NetworkHartanto Sanjaya
 
3D Analyst - Cut and Fill
3D Analyst - Cut and Fill3D Analyst - Cut and Fill
3D Analyst - Cut and FillHartanto Sanjaya
 
Presentation for Numerical Field Theory
Presentation for Numerical Field TheoryPresentation for Numerical Field Theory
Presentation for Numerical Field TheoryIndraneel Pole
 
3D Analyst - Watershed from SRTM
3D Analyst - Watershed from SRTM3D Analyst - Watershed from SRTM
3D Analyst - Watershed from SRTMHartanto Sanjaya
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce scriptHaripritha
 
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ..."MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...Adrian Florea
 
MapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large ClustersMapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large Clusterskazuma_sato
 
Map-Side Merge Joins for Scalable SPARQL BGP Processing
Map-Side Merge Joins for Scalable SPARQL BGP ProcessingMap-Side Merge Joins for Scalable SPARQL BGP Processing
Map-Side Merge Joins for Scalable SPARQL BGP ProcessingAlexander Schätzle
 
BDC-presentation
BDC-presentationBDC-presentation
BDC-presentationPavel Popa
 
3D Analyst - Watershed
3D Analyst - Watershed3D Analyst - Watershed
3D Analyst - WatershedHartanto Sanjaya
 
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...GISRUK conference
 
3D Analyst - Watershed, Padang
3D Analyst - Watershed, Padang3D Analyst - Watershed, Padang
3D Analyst - Watershed, PadangHartanto Sanjaya
 
3D Analyst - Watershed, Tomohon
3D Analyst - Watershed, Tomohon3D Analyst - Watershed, Tomohon
3D Analyst - Watershed, TomohonHartanto Sanjaya
 
Mapreduce total order sorting technique
Mapreduce total order sorting techniqueMapreduce total order sorting technique
Mapreduce total order sorting techniqueUday Vakalapudi
 
SparkNet presentation
SparkNet presentationSparkNet presentation
SparkNet presentationSneh Pahilwani
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepSubhas Kumar Ghosh
 
3D Analyst Watershed Lombok
3D Analyst Watershed  Lombok3D Analyst Watershed  Lombok
3D Analyst Watershed LombokHartanto Sanjaya
 
Optimization of graph storage using GoFFish
Optimization of graph storage using GoFFishOptimization of graph storage using GoFFish
Optimization of graph storage using GoFFishAnushree Prasanna Kumar
 

Was ist angesagt? (20)

3D Watershed Celebes
3D Watershed Celebes3D Watershed Celebes
3D Watershed Celebes
 
3D Analyst - Watershed and Stream Network
3D Analyst - Watershed and Stream Network3D Analyst - Watershed and Stream Network
3D Analyst - Watershed and Stream Network
 
3D Analyst - Cut and Fill
3D Analyst - Cut and Fill3D Analyst - Cut and Fill
3D Analyst - Cut and Fill
 
Presentation for Numerical Field Theory
Presentation for Numerical Field TheoryPresentation for Numerical Field Theory
Presentation for Numerical Field Theory
 
3D Analyst - Watershed from SRTM
3D Analyst - Watershed from SRTM3D Analyst - Watershed from SRTM
3D Analyst - Watershed from SRTM
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce script
 
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ..."MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
 
MapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large ClustersMapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large Clusters
 
Map-Side Merge Joins for Scalable SPARQL BGP Processing
Map-Side Merge Joins for Scalable SPARQL BGP ProcessingMap-Side Merge Joins for Scalable SPARQL BGP Processing
Map-Side Merge Joins for Scalable SPARQL BGP Processing
 
BDC-presentation
BDC-presentationBDC-presentation
BDC-presentation
 
3D Analyst - Watershed
3D Analyst - Watershed3D Analyst - Watershed
3D Analyst - Watershed
 
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...
 
3D Analyst - Watershed, Padang
3D Analyst - Watershed, Padang3D Analyst - Watershed, Padang
3D Analyst - Watershed, Padang
 
3D Analyst - Watershed, Tomohon
3D Analyst - Watershed, Tomohon3D Analyst - Watershed, Tomohon
3D Analyst - Watershed, Tomohon
 
Mapreduce total order sorting technique
Mapreduce total order sorting techniqueMapreduce total order sorting technique
Mapreduce total order sorting technique
 
SparkNet presentation
SparkNet presentationSparkNet presentation
SparkNet presentation
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by step
 
On Skyline Groups
On Skyline GroupsOn Skyline Groups
On Skyline Groups
 
3D Analyst Watershed Lombok
3D Analyst Watershed  Lombok3D Analyst Watershed  Lombok
3D Analyst Watershed Lombok
 
Optimization of graph storage using GoFFish
Optimization of graph storage using GoFFishOptimization of graph storage using GoFFish
Optimization of graph storage using GoFFish
 

Ă„hnlich wie Determining the k in k-means with MapReduce

MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm DesignGabriela Agustini
 
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)Austin Benson
 
Scalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceScalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceKyong-Ha Lee
 
141222 graphulo ingraphblas
141222 graphulo ingraphblas141222 graphulo ingraphblas
141222 graphulo ingraphblasMIT
 
141205 graphulo ingraphblas
141205 graphulo ingraphblas141205 graphulo ingraphblas
141205 graphulo ingraphblasgraphulo
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on HadoopVivian S. Zhang
 
Optimization Techniques
Optimization TechniquesOptimization Techniques
Optimization TechniquesAjay Bidyarthy
 
Tall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceTall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceDavid Gleich
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduceDavid Gleich
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAprithan
 
parameterized complexity for graph Motif
parameterized complexity for graph Motifparameterized complexity for graph Motif
parameterized complexity for graph MotifAMR koura
 
141031_Lagrange Relaxation Based Method for the QoS Routing Problem
141031_Lagrange Relaxation Based Method for the QoS Routing Problem 141031_Lagrange Relaxation Based Method for the QoS Routing Problem
141031_Lagrange Relaxation Based Method for the QoS Routing Problem Eunice Lin
 
K means clustering algorithm
K means clustering algorithmK means clustering algorithm
K means clustering algorithmDarshak Mehta
 
Oracle-based algorithms for high-dimensional polytopes.
Oracle-based algorithms for high-dimensional polytopes.Oracle-based algorithms for high-dimensional polytopes.
Oracle-based algorithms for high-dimensional polytopes.Vissarion Fisikopoulos
 
Big data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsBig data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsDavid Gleich
 
Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...
Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...
Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...Austin Benson
 

Ă„hnlich wie Determining the k in k-means with MapReduce (20)

MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
 
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
 
Scalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceScalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduce
 
Fa18_P2.pptx
Fa18_P2.pptxFa18_P2.pptx
Fa18_P2.pptx
 
141222 graphulo ingraphblas
141222 graphulo ingraphblas141222 graphulo ingraphblas
141222 graphulo ingraphblas
 
141205 graphulo ingraphblas
141205 graphulo ingraphblas141205 graphulo ingraphblas
141205 graphulo ingraphblas
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
 
Dp idp exploredb
Dp idp exploredbDp idp exploredb
Dp idp exploredb
 
Optimization Techniques
Optimization TechniquesOptimization Techniques
Optimization Techniques
 
Tall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceTall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduce
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduce
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
 
parameterized complexity for graph Motif
parameterized complexity for graph Motifparameterized complexity for graph Motif
parameterized complexity for graph Motif
 
141031_Lagrange Relaxation Based Method for the QoS Routing Problem
141031_Lagrange Relaxation Based Method for the QoS Routing Problem 141031_Lagrange Relaxation Based Method for the QoS Routing Problem
141031_Lagrange Relaxation Based Method for the QoS Routing Problem
 
Mypreson 27
Mypreson 27Mypreson 27
Mypreson 27
 
K means clustering algorithm
K means clustering algorithmK means clustering algorithm
K means clustering algorithm
 
Oracle-based algorithms for high-dimensional polytopes.
Oracle-based algorithms for high-dimensional polytopes.Oracle-based algorithms for high-dimensional polytopes.
Oracle-based algorithms for high-dimensional polytopes.
 
Big data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsBig data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphs
 
LSH
LSHLSH
LSH
 
Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...
Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...
Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...
 

Mehr von Thibault Debatty

An introduction to similarity search and k-nn graphs
An introduction to similarity search and k-nn graphsAn introduction to similarity search and k-nn graphs
An introduction to similarity search and k-nn graphsThibault Debatty
 
Blockchain for dummies
Blockchain for dummiesBlockchain for dummies
Blockchain for dummiesThibault Debatty
 
Building a Cyber Range for training Cyber Defense Situation Awareness
Building a Cyber Range for training Cyber Defense Situation AwarenessBuilding a Cyber Range for training Cyber Defense Situation Awareness
Building a Cyber Range for training Cyber Defense Situation AwarenessThibault Debatty
 
Design and analysis of distributed k-nearest neighbors graph algorithms
Design and analysis of distributed k-nearest neighbors graph algorithmsDesign and analysis of distributed k-nearest neighbors graph algorithms
Design and analysis of distributed k-nearest neighbors graph algorithmsThibault Debatty
 
A comparative analysis of visualisation techniques to achieve CySA in the mi...
A comparative analysis of visualisation techniques to achieve CySA in the  mi...A comparative analysis of visualisation techniques to achieve CySA in the  mi...
A comparative analysis of visualisation techniques to achieve CySA in the mi...Thibault Debatty
 
Easy Server Monitoring
Easy Server MonitoringEasy Server Monitoring
Easy Server MonitoringThibault Debatty
 
Graph based APT detection
Graph based APT detectionGraph based APT detection
Graph based APT detectionThibault Debatty
 
Multi-Agent System for APT Detection
Multi-Agent System for APT DetectionMulti-Agent System for APT Detection
Multi-Agent System for APT DetectionThibault Debatty
 
Building k-nn Graphs From Large Text Data
Building k-nn Graphs From Large Text DataBuilding k-nn Graphs From Large Text Data
Building k-nn Graphs From Large Text DataThibault Debatty
 
Parallel SPAM Clustering with Hadoop
Parallel SPAM Clustering with HadoopParallel SPAM Clustering with Hadoop
Parallel SPAM Clustering with HadoopThibault Debatty
 

Mehr von Thibault Debatty (15)

An introduction to similarity search and k-nn graphs
An introduction to similarity search and k-nn graphsAn introduction to similarity search and k-nn graphs
An introduction to similarity search and k-nn graphs
 
Blockchain for dummies
Blockchain for dummiesBlockchain for dummies
Blockchain for dummies
 
Building a Cyber Range for training Cyber Defense Situation Awareness
Building a Cyber Range for training Cyber Defense Situation AwarenessBuilding a Cyber Range for training Cyber Defense Situation Awareness
Building a Cyber Range for training Cyber Defense Situation Awareness
 
Design and analysis of distributed k-nearest neighbors graph algorithms
Design and analysis of distributed k-nearest neighbors graph algorithmsDesign and analysis of distributed k-nearest neighbors graph algorithms
Design and analysis of distributed k-nearest neighbors graph algorithms
 
A comparative analysis of visualisation techniques to achieve CySA in the mi...
A comparative analysis of visualisation techniques to achieve CySA in the  mi...A comparative analysis of visualisation techniques to achieve CySA in the  mi...
A comparative analysis of visualisation techniques to achieve CySA in the mi...
 
Cyber Range
Cyber RangeCyber Range
Cyber Range
 
Easy Server Monitoring
Easy Server MonitoringEasy Server Monitoring
Easy Server Monitoring
 
Data diode
Data diodeData diode
Data diode
 
USB Portal
USB PortalUSB Portal
USB Portal
 
Smart Router
Smart RouterSmart Router
Smart Router
 
Web shell detector
Web shell detectorWeb shell detector
Web shell detector
 
Graph based APT detection
Graph based APT detectionGraph based APT detection
Graph based APT detection
 
Multi-Agent System for APT Detection
Multi-Agent System for APT DetectionMulti-Agent System for APT Detection
Multi-Agent System for APT Detection
 
Building k-nn Graphs From Large Text Data
Building k-nn Graphs From Large Text DataBuilding k-nn Graphs From Large Text Data
Building k-nn Graphs From Large Text Data
 
Parallel SPAM Clustering with Hadoop
Parallel SPAM Clustering with HadoopParallel SPAM Clustering with Hadoop
Parallel SPAM Clustering with Hadoop
 

KĂĽrzlich hochgeladen

Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
Hire đź’• 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire đź’• 9907093804 Hooghly Call Girls Service Call Girls AgencyHire đź’• 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire đź’• 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSSLeenakshiTyagi
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSĂ©rgio Sacani
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.Nitya salvi
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
fundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyfundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyDrAnita Sharma
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSĂ©rgio Sacani
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsSumit Kumar yadav
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSĂ©rgio Sacani
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 

KĂĽrzlich hochgeladen (20)

Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Hire đź’• 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire đź’• 9907093804 Hooghly Call Girls Service Call Girls AgencyHire đź’• 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire đź’• 9907093804 Hooghly Call Girls Service Call Girls Agency
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 

Determining the k in k-means with MapReduce

  • 1. Determining the k in k-means with MapReduce Thibault Debatty, Pietro Michiardi, Wim Mees & Olivier Thonnard Algorithms for MapReduce and Beyond 2014
  • 2. Clustering & k-means
      – Clustering
      – K-means [Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:129–137, 1982.]
          – 1982 (a great year!)
          – But still largely used
          – Drawbacks (amongst others): local minimum; k is a parameter!
  • 3. Clustering & k-means
      – Determine k: VERY difficult [Anil K. Jain. Data Clustering: 50 Years Beyond K-Means. Pattern Recognition Letters, 2009]
      – Using cluster evaluation metrics: Dunn's index, Elbow, Silhouette, “jump method” (based on information theory), “Gap statistic”, ... O(k²)
  • 4. G-means
      – G-means [Greg Hamerly and Charles Elkan. Learning the k in k-means. In Neural Information Processing Systems. MIT Press, 2003]
      – K-means: points in each cluster are spherically distributed around the center (figure source: scikit-learn)
  • 5. G-means: same citation and k-means assumption as slide 4, plus: normality test & recursion
  • 6. G-means: Dataset
  • 7. G-means: 1. Pick 2 centers
  • 8. G-means: 2. k-means
  • 9. G-means: 3. Project
  • 10. G-means: 3. Project (continued)
  • 11. G-means: 4. Normality test. Normal? No => recursion
  • 12. G-means: 5. Recursion
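Steps 3 and 4 above (project, then test for normality) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the projection onto the vector joining the two child centers and the corrected Anderson-Darling statistic follow the usual description of G-means, and the default critical value 1.8692 is assumed to be the one Hamerly and Elkan report for a very strict significance level.

```python
import math
from statistics import NormalDist, mean, stdev


def project(points, c1, c2):
    """Project each point onto the vector v = c1 - c2 joining the two child centers."""
    v = [a - b for a, b in zip(c1, c2)]
    norm_sq = sum(x * x for x in v)
    return [sum(p_i * v_i for p_i, v_i in zip(p, v)) / norm_sq for p in points]


def anderson_darling(values):
    """Corrected Anderson-Darling statistic for normality (mean and variance estimated)."""
    n = len(values)
    mu, sigma = mean(values), stdev(values)
    z = sorted((x - mu) / sigma for x in values)
    eps = 1e-12  # clamp the CDF so that log() never receives 0
    cdf = [min(max(NormalDist().cdf(t), eps), 1 - eps) for t in z]
    s = sum((2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
            for i in range(n))
    a2 = -n - s / n
    return a2 * (1 + 4 / n - 25 / n ** 2)  # small-sample correction


def looks_gaussian(values, critical_value=1.8692):
    """True if the projections pass the test (assumed critical value, see lead-in)."""
    return anderson_darling(values) <= critical_value
```

A cluster whose projections fail `looks_gaussian` would be split and processed again (step 5); one that passes keeps its single center.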
  • 13. MapReduce G-means. Challenges:
      1. Reduce I/O operations
      2. Reduce number of jobs
      3. Maximize parallelism
      4. Limit memory usage
  • 14. MapReduce G-means. Challenges: same list as slide 13
  • 15. MapReduce G-means. 2. Reduce number of jobs:
      PickInitialCenters
      while Not ClusteringCompleted do
          KMeans
          KMeansAndFindNewCenters
          TestClusters
      end while
  • 16. MapReduce G-means: TestClusters (3. Maximize parallelism, 4. Limit memory usage)
      Map(key, point)
          Find cluster
          Find vector
          Project point on vector
          Emit(cluster, projection)
      end procedure
      Reduce(cluster, projections)
          Build a vector
          ADtest(vector)
          if normal then
              Mark cluster
          end if
      end procedure
  • 17. MapReduce G-means: TestClusters (same pseudocode as slide 16). Bottleneck: 3. Maximize parallelism, 4. Limit memory usage (risk of crash)
  • 18. MapReduce G-means: TestClusters (as on slide 16) shown next to TestFewClusters (in-memory combiner)
      Map(key, point)
          Find cluster
          Find vector
          Project point on vector
          Add projection to list
      end procedure
      Close()
          For each list do
              Build a vector
              A2 = ADtest(vector)
              Emit(cluster, A2)
          End for each
      end procedure
  • 19. MapReduce G-means: TestClusters and TestFewClusters (as on slides 16 and 18). #clusters > #reducers & Estimated required memory < Java heap
  • 20. MapReduce G-means: same as slide 19. Experimentally: 64 Bytes / point
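The difference between the two test jobs can be illustrated with a small in-process simulation (plain Python, not Hadoop). It is a sketch under assumptions: all function names are hypothetical, `statistic` stands in for the ADtest of the slides, and each inner list of `splits` plays the role of one mapper's input. The last function encodes one reading of the condition on slides 19 and 20, namely that TestClusters is chosen once there are more clusters than reducers and the largest cluster fits in the reducer heap at about 64 bytes per point; the slides do not spell this out.

```python
from collections import defaultdict


def nearest(point, centers):
    """Index of the closest center (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))


def projection(point, vector):
    """Scalar projection of a point on the cluster's vector."""
    return sum(p * v for p, v in zip(point, vector)) / sum(v * v for v in vector)


def run_test_clusters(splits, centers, vectors, statistic):
    """TestClusters: Map emits one record per point, Reduce tests each cluster."""
    shuffled, emitted = defaultdict(list), 0
    for split in splits:                      # one mapper per split
        for point in split:
            c = nearest(point, centers)
            shuffled[c].append(projection(point, vectors[c]))  # Emit(cluster, projection)
            emitted += 1
    return {c: statistic(values) for c, values in shuffled.items()}, emitted


def run_test_few_clusters(splits, centers, vectors, statistic):
    """TestFewClusters: each mapper keeps lists in memory and tests them in Close()."""
    results, emitted = defaultdict(list), 0
    for split in splits:
        lists = defaultdict(list)             # in-memory combiner
        for point in split:
            c = nearest(point, centers)
            lists[c].append(projection(point, vectors[c]))
        for c, values in lists.items():       # Close(): Emit(cluster, A2)
            results[c].append(statistic(values))
            emitted += 1
    return dict(results), emitted


def can_use_test_clusters(num_clusters, num_reducers, largest_cluster_points,
                          heap_bytes, bytes_per_point=64):
    """Assumed reading of the switch condition (see lead-in)."""
    return (num_clusters > num_reducers
            and largest_cluster_points * bytes_per_point < heap_bytes)
```

In this simulation TestClusters sends one record per point through the shuffle, while TestFewClusters sends at most one record per mapper and cluster, at the price of holding every projection of a split in mapper memory.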
  • 21. Comparison: MR multi-k-means vs. MR G-means, on Speed and Quality. MR multi-k-means: all possible values of k in a single job
  • 22. Comparison
                  MR multi-k-means       MR G-means
      Speed       O(nk²) computations    O(nk) computations. But: more iterations, more dataset reads, log2(k)
      Quality                            New centers added if and where needed. But: tends to overestimate k!
  • 23. Experimental results: Speed
      – Hadoop
      – Synthetic dataset
      – 10M points in R^10
      – Euclidean distance
      – 8 machines
  • 24. Experimental results: Quality (same setup as slide 23)
      k                              100     200     400
      k found                        150     279     639     (x ~1.5)
      Within Cluster Sum of Square (less is better):
      MR G-means                     3.34    3.33    3.23
      multi-k-means (with same k)    3.71    3.6     3.39
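The quality metric in the table can be computed as in the minimal sketch below. It assumes the plain, un-normalised definition (sum over all points of the squared Euclidean distance to the nearest center); the slides do not say how the reported values were scaled, and the function name is hypothetical.

```python
def wcss(points, centers):
    """Within-cluster sum of squares: squared distance of each point to its nearest center."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total
```

For example, `wcss([(0, 0), (0, 2), (10, 0)], [(0, 1), (10, 0)])` adds 1 + 1 + 0. Comparing two clusterings on this metric is only meaningful for the same k, which is why the table evaluates multi-k-means at the k found by G-means.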
  • 25. Conclusions & future work
      – MapReduce algorithm to determine k
      – Running time proportional to k
      – Future: overestimation of k; test on real data; test scalability; reduce I/O (using Spark); consider skewed data; consider impact of machine failure
  • 26. Thank you!