Given at PyDataSV 2014
In machine learning, clustering is a good way to explore your data and pull out patterns and relationships. Scikit-learn has some great clustering functionality, including the k-means clustering algorithm, which is among the easiest to understand. Let's take an in-depth look at k-means clustering and how to use it. This mini-tutorial/talk will cover what sort of problems k-means clustering is good at solving, how the algorithm works, how to choose k, how to tune the algorithm's parameters, and how to implement it on a set of data.
2. About Me
• Today: graduated from the University of Michigan!
• Soon: data scientist at Reonomy
• PyLadies co-organizer
• @sarah_guido
3. Outline
• What is k-means clustering?
• How it works
• When to use it
• K-means clustering in scikit-learn
• Basic implementation
• Implementation with tuned parameters
5. K-means clustering
• Formally: a method of vector quantization
• Partitions the space into Voronoi cells
• Separates samples into n groups of equal variance
• Uses the Euclidean distance metric
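The objective isn't spelled out on the slide; for reference, k-means minimizes the within-cluster sum of squares (what scikit-learn calls inertia):

\[
\min_{C_1,\dots,C_k}\ \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
\]

where \(\mu_i\) is the centroid (mean) of cluster \(C_i\).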
6. K-means clustering
• Iterative refinement
• Three basic steps
• Step 1: Choose k
• Iterate over:
• Step 2: Assignment
• Step 3: Update
• Repeat until convergence is reached
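A minimal NumPy sketch of those three steps (illustrative only; scikit-learn's implementation adds smarter seeding, tolerances, and restarts):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k. Here, k random rows of X become initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2 (assignment): attach each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3 (update): move each centroid to the mean of its points,
        # keeping the old centroid if a cluster goes empty
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Converged once the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```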
8. K-means clustering
• Advantages
• Scales well
• Efficient
• Will always converge
• Disadvantages
• Choosing the wrong k
• Convergence to local minimum
9. K-means clustering
• When to use
• Normally distributed data
• Large number of samples
• Not too many clusters
• Distances between points are meaningful in a linear (Euclidean) sense
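A quick way to see these conditions in the ideal case: make_blobs (my choice here, not from the talk) generates spherical, normally distributed clusters, which is exactly the geometry k-means assumes:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Spherical Gaussian blobs: the best case for k-means
X, _ = make_blobs(n_samples=10_000, centers=4, cluster_std=1.0, random_state=0)
labels = KMeans(n_clusters=4, random_state=0).fit_predict(X)
```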
11. Scikit-Learn
• Model = EstimatorObject()
• Unsupervised:
• Model.fit(dataset.data)
• dataset.data holds the feature matrix
• Supervised would pass the labels as a second parameter
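A sketch of the contrast, using the bundled iris data as a stand-in (the talk's dataset differs):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

dataset = load_iris()

# Unsupervised: fit() sees only the data
KMeans(n_clusters=3, random_state=0).fit(dataset.data)

# Supervised: fit() takes the labels as a second argument
LogisticRegression(max_iter=1000).fit(dataset.data, dataset.target)
```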
12. K-means in scikit-learn
• Efficient and fast
• You pick n clusters; k-means finds n initial centroids
• Run clustering jobs in parallel
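A basic run might look like this (synthetic data for illustration; in scikit-learn versions of the talk's era, n_jobs=-1 parallelized the n_init restarts, an argument later versions removed):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1_000, centers=8, random_state=0)

# You pick n_clusters; KMeans picks the initial centroids (k-means++)
km = KMeans(n_clusters=8, random_state=0).fit(X)

km.cluster_centers_   # one centroid per cluster: shape (8, n_features)
km.labels_            # cluster index assigned to each sample
km.inertia_           # within-cluster sum of squares
```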
13. Dataset
• UC Irvine (UCI) Machine Learning Repository
• Individual household electric power consumption
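Loading it might look like the sketch below; the file name, semicolon separator, '?' missing-value marker, and column names are assumptions based on the UCI distribution of this dataset:

```python
import pandas as pd

# UCI "Individual household electric power consumption" data:
# semicolon-separated, with '?' marking missing values
df = pd.read_csv("household_power_consumption.txt", sep=";",
                 na_values="?", low_memory=False)
df = df.dropna()

# Two of the numeric columns as a feature matrix for clustering
X = df[["Global_active_power", "Global_reactive_power"]].to_numpy()
```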
18. n_clusters: choosing k
• Graphing the variance
• from scipy.spatial.distance import cdist, pdist
• cdist: distance computation between two collections of observations
• pdist: pairwise distances between observations in the same collection
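One common version of this "elbow" plot, sketched here with cdist (pdist would let you normalize by the total dispersion of the data); the synthetic data is just for illustration:

```python
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1_000, centers=4, random_state=0)

ks = range(1, 11)
avg_within = []
for k in ks:
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    # Distance from every point to every centroid, then keep the
    # distance to the assigned (nearest) centroid
    d = cdist(X, km.cluster_centers_)
    avg_within.append(d.min(axis=1).mean())

# Look for the "elbow": the k where adding clusters stops helping much
plt.plot(list(ks), avg_within, "o-")
plt.xlabel("k")
plt.ylabel("average within-cluster distance")
plt.show()
```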
24. init
• k-means++
• Default
• Selects initial cluster centers in a way that speeds up convergence
• random
• Choose k rows at random for initial centroids
• An ndarray of shape (n_clusters, n_features) can be passed as explicit initial centers
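All three options in one sketch (the data and the explicit centers are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1_000, centers=3, random_state=0)

# Default: k-means++ spreads the initial centroids apart
KMeans(n_clusters=3, init="k-means++", random_state=0).fit(X)

# Random: k rows of X drawn at random as initial centroids
KMeans(n_clusters=3, init="random", random_state=0).fit(X)

# Explicit: an ndarray of shape (n_clusters, n_features);
# n_init=1 since there is nothing to re-randomize
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
KMeans(n_clusters=3, init=centers, n_init=1).fit(X)
```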
28. Comparing results: silhouette score
• Silhouette coefficient
• No ground truth
• a: mean distance between an observation and all other points in its cluster
• b: mean distance between an observation and all points in the next nearest cluster
• Coefficient: (b - a) / max(a, b)
• Silhouette score in scikit-learn
• Mean of silhouette coefficient for all of the observations
• Closer to 1, the better the fit
• Large dataset == long time
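In scikit-learn this is sklearn.metrics.silhouette_score; its sample_size argument addresses the large-dataset problem by scoring a random subset (synthetic data below just for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=1_000, centers=4, random_state=0)
labels = KMeans(n_clusters=4, random_state=0).fit_predict(X)

# Mean silhouette coefficient over the samples; closer to 1 is better.
# Quadratic in the number of points, so subsample large datasets.
score = silhouette_score(X, labels, sample_size=500, random_state=0)
print(score)
```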
30. What does this tell us?
• Patterns exist
• Groups of similar observations exist
• Sometimes, the defaults work
• We need more exploration!
31. A few tips
• Clustering is a good way to explore your data
• Intuition fails in high dimensions
• Use dimensionality reduction
• Combine with other models
• Know your data
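For the dimensionality-reduction tip, one possible pattern (the digits data stands in for any wide feature matrix; the component count is arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 64 raw features: distances grow less informative as dimensions pile up
X = load_digits().data

# Project to a handful of components before clustering
X_reduced = PCA(n_components=10).fit_transform(X)
labels = KMeans(n_clusters=10, random_state=0).fit_predict(X_reduced)
```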