Teaching k-Means New Tricks
Sergei Vassilvitskii
Google
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16

Teaching K-Means New Tricks: Over 50 years old, the k-means algorithm remains one of the most popular clustering algorithms. In this talk we’ll cover some recent developments, including better initialization, the notion of coresets, clustering at scale, and clustering with outliers.

  1. Teaching k-Means New Tricks. Sergei Vassilvitskii, Google
  2. k-Means Algorithm. The k-Means Algorithm [Lloyd '57] – Clusters points into groups – Remains a workhorse of machine learning even in the age of deep networks
  3. Lloyd's Method: k-means. Initialize with random clusters
  4. Lloyd's Method: k-means. Assign each point to nearest center
  5. Lloyd's Method: k-means. Recompute optimum centers (means)
  6. Lloyd's Method: k-means. Repeat: Assign points to nearest center
  7. Lloyd's Method: k-means. Repeat: Recompute centers
  8. Lloyd's Method: k-means. Repeat...
  9. Lloyd's Method: k-means. Repeat... until clustering does not change
  10. Lloyd's Method: k-means. Repeat... until clustering does not change. Total error reduced at every step - guaranteed to converge.
  11. Lloyd's Method: k-means. Repeat... until clustering does not change. Total error reduced at every step - guaranteed to converge. Minimizes: $\phi(X, C) = \sum_{x \in X} d(x, C)^2$
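For concreteness, a minimal sketch of the loop the slides walk through (Lloyd's method with random initialization), written in plain NumPy; the function and variable names are illustrative, not from the talk:

import numpy as np

def lloyd(X, k, max_iters=100, seed=0):
    # Plain Lloyd's method: assign, recompute means, repeat until stable.
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()  # random initialization
    assign = None
    for _ in range(max_iters):
        # Assign each point to the nearest center (squared Euclidean distance).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_assign = d2.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # clustering did not change
        assign = new_assign
        # Recompute optimum centers: the mean of each cluster.
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, assign

Each pass reduces (or leaves unchanged) the objective above, which is why the loop is guaranteed to terminate.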
  12. New Tricks for k-Means. Initialization: – Is random initialization a good idea? Large data: – Clustering many points (in parallel) – Clustering into many clusters
  13. k-means Initialization. Random?
  14. k-means Initialization. Random?
  15. k-means Initialization. Random? A bad idea
  16. k-means Initialization. Random? A bad idea. Even with many random restarts!
  17. Easy Fix. Select centers using a furthest point algorithm (2-approximation to k-Center clustering).
  18. Easy Fix. Select centers using a furthest point algorithm (2-approximation to k-Center clustering).
  19. Easy Fix. Select centers using a furthest point algorithm (2-approximation to k-Center clustering).
  20. Easy Fix. Select centers using a furthest point algorithm (2-approximation to k-Center clustering).
  21. Easy Fix. Select centers using a furthest point algorithm (2-approximation to k-Center clustering).
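A small NumPy sketch of that furthest-point rule (Gonzalez-style seeding, a 2-approximation for k-Center); the slides show it only pictorially, and the names below are illustrative:

import numpy as np

def furthest_point_init(X, k, seed=0):
    # Pick the first center at random, then repeatedly take the point
    # that is furthest from all centers chosen so far.
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    dist = np.linalg.norm(X - centers[0], axis=1)  # distance to nearest chosen center
    for _ in range(k - 1):
        nxt = int(dist.argmax())                   # the furthest point becomes a center
        centers.append(X[nxt])
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.stack(centers)

As the next slides illustrate, the argmax rule is exactly what makes this seeding fragile: a single far-away outlier is guaranteed to be picked as a center.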
  22. Sensitive to Outliers
  23. Sensitive to Outliers
  24. Sensitive to Outliers
  25. Sensitive to Outliers
  26. Sensitive to Outliers
  27. k-means++. Interpolate between two methods. Give preference to further points. Let D(p) be the distance between p and the nearest cluster center. Sample next center proportionally to D^α(p).
  28. k-means++. Interpolate between two methods. Give preference to further points. Let D(p) be the distance between p and the nearest cluster center. Sample next center proportionally to D^α(p). kmeans++: Select first point uniformly at random; for (int i = 1; i < k; ++i) { Select next point p with probability D^α(p) / Σ_x D^α(x); UpdateDistances(); }
  29. k-means++. Interpolate between two methods. Give preference to further points. Let D(p) be the distance between p and the nearest cluster center. Sample next center proportionally to D^α(p). Original Lloyd's: α = 0; Furthest Point: α = ∞; k-means++: α = 2. kmeans++: Select first point uniformly at random; for (int i = 1; i < k; ++i) { Select next point p with probability D^α(p) / Σ_x D^α(x); UpdateDistances(); }
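Turning the sampling rule into code: a hedged NumPy sketch of k-means++ seeding, with α exposed so that α = 0 recovers uniform selection and α = 2 gives k-means++ (defaults and names are mine, not the talk's):

import numpy as np

def kmeans_pp_init(X, k, alpha=2.0, seed=0):
    # Sample each new center with probability proportional to D(p)**alpha.
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]            # first point uniformly at random
    D = np.linalg.norm(X - centers[0], axis=1)     # D(p): distance to nearest center
    for _ in range(k - 1):
        w = D ** alpha
        probs = w / w.sum()                        # Pr[p] = D(p)^alpha / sum_x D(x)^alpha
        nxt = rng.choice(len(X), p=probs)
        centers.append(X[nxt])
        D = np.minimum(D, np.linalg.norm(X - X[nxt], axis=1))  # UpdateDistances()
    return np.stack(centers)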
  30. k-means++
  31. k-means++
  32. k-means++
  33. k-means++
  34. k-means++. Theorem [AV '07]: k-means++ guarantees a Θ(log k) approximation.
  35. New Tricks for k-Means. Initialization: – Is random initialization a good idea? Large data: – Clustering many points (in parallel) – Clustering into many clusters
  36. Dealing with large data. The new initialization approach: – Leads to very good clusterings – But is very sequential! • Must select one cluster at a time, then update the distribution we are sampling from – How to adapt it in the world of parallel computing?
  37. Speeding up initialization. Initialization: kmeans++: Select first point uniformly at random; for (int i = 1; i < k; ++i) { Select next point p with probability D²(p) / Σ_x D²(x); UpdateDistances(); } Improving the speed: – Instead of selecting a single point, sample many points at a time – Oversample: select more than k centers, and then select the best k out of them.
  38. k-means||. kmeans++: Select first point uniformly at random; for (int i = 1; i < k; ++i) { Select next point p with probability D²(p) / Σ_p D²(p); UpdateDistances(); }
  39. k-means||. Select first point c uniformly at random; for (int i = 1; i < log_ℓ(φ(X, c)); ++i) { Select each point p independently with probability k · ℓ · D^α(p) / Σ_x D^α(x); UpdateDistances(); } Prune to k points total by clustering the clusters.
  40. k-means||. Same algorithm, annotated: the independent selection is what makes each round easy to run as a MapReduce pass.
  41. k-means||. Same algorithm, annotated: ℓ is the oversampling parameter.
  42. k-means||. Same algorithm, annotated: the final prune to k points is the re-clustering step.
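A single-machine NumPy sketch of the same idea, assuming the oversampling parameter ℓ and a fixed number of rounds; the important property is that the per-point coin flips are independent, so each round is one MapReduce pass. Names, defaults, and the treatment of the final prune are mine, not the talk's:

import numpy as np

def kmeans_parallel_candidates(X, ell, rounds=5, seed=0):
    # k-means|| sketch: each round, every point joins the candidate set
    # independently with probability ~ ell * D(p)^2 / phi, where phi is the
    # current total cost.  The weighted candidates returned at the end are
    # meant to be pruned down to k centers by a small clustering step
    # (for example, the k-means++ sketch above).
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    C = [X[rng.integers(len(X))]]               # first center, uniform at random
    D2 = ((X - C[0]) ** 2).sum(axis=1)          # squared distance to nearest center
    for _ in range(rounds):                     # ~ log_ell(phi) rounds on the slides
        phi = D2.sum()
        p = np.minimum(1.0, ell * D2 / phi)     # independent coin flips: easy MapReduce
        for x in X[rng.random(len(X)) < p]:
            C.append(x)
            D2 = np.minimum(D2, ((X - x) ** 2).sum(axis=1))
    C = np.stack(C)
    # Weight each candidate by how many input points are closest to it;
    # the caller then clusters these weighted candidates down to k.
    closest = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    weights = np.bincount(closest, minlength=len(C))
    return C, weights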
  43. k-means||: Analysis. How Many Rounds? – Theorem: After O(log_ℓ(nφ)) rounds, guarantee O(1) approximation – In practice: fewer iterations are needed – Need to re-cluster O(k · ℓ · log_ℓ(nφ)) intermediate centers. Discussion: – Number of rounds independent of k – Tradeoff between number of rounds and memory.
  44. How well does this work? [Plots: clustering cost vs. number of rounds on the KDD dataset (k = 65) for k-means|| with ℓ/k = 1, 2, 4, and a comparison of final cost for random initialization, k-means++, and k-means|| with ℓ = 1, 2, 4.]
  45. Performance vs. k-means++ – Even better on small datasets: 4600 points, 50 dimensions (SPAM) – Accuracy: – Time (iterations):
  46. New Tricks for k-Means. Initialization: – Is random initialization a good idea? Large data: – Clustering many points (in parallel) – Clustering into many clusters
  47. Large k. How do you run k-means when k is large? – For every point, need to find the nearest center
  48. Large k. How do you run k-means when k is large? – For every point, need to find the nearest center – Naive approach: linear scan
  49. Large k. How do you run k-means when k is large? – For every point, need to find the nearest center – Naive approach: linear scan – Better approach [Elkan]: • Use triangle inequality to see if the center could have possibly gotten closer • Still expensive when k is large
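The heart of the triangle-inequality trick is a cheap test that rules a center out without computing the distance to it: if d(c, c') >= 2·d(x, c) for the currently assigned center c, then d(x, c') >= d(x, c), so c' can be skipped. A simplified NumPy sketch of just that test, not the full bookkeeping of Elkan's algorithm (in a real implementation the center-to-center distances are computed once per Lloyd iteration, not per point):

import numpy as np

def nearest_center_pruned(x, centers, assigned):
    # Find x's nearest center, skipping centers the triangle inequality rules out.
    d_assigned = np.linalg.norm(x - centers[assigned])
    cc = np.linalg.norm(centers - centers[assigned], axis=1)  # center-to-center distances
    best, best_d = assigned, d_assigned
    for j in range(len(centers)):
        # If d(c_assigned, c_j) >= 2 * d(x, c_assigned), then by the triangle
        # inequality d(x, c_j) >= d(x, c_assigned), so c_j cannot win.
        if j == assigned or cc[j] >= 2.0 * d_assigned:
            continue
        d = np.linalg.norm(x - centers[j])
        if d < best_d:
            best, best_d = j, d
    return best, best_d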
  50. Using Nearest Neighbor Data Structures. Expensive step of k-Means: – For every point, find the nearest center. But we have many algorithms for nearest neighbors!
  51. Using Nearest Neighbor Data Structures. Expensive step of k-Means: – For every point, find the nearest center. But we have many algorithms for nearest neighbors! First idea: – Index the centers. Then do a query into this data structure for every point – Need to rebuild the NN data structure every time
  52. Using Nearest Neighbor Data Structures. Expensive step of k-Means: – For every point, find the nearest center. But we have many algorithms for nearest neighbors! First idea: – Index the centers. Then do a query into this data structure for every point – Need to rebuild the NN data structure every time. Better idea: – Index the points! – For every center, query the nearest points
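A hedged sketch of the "index the points" idea, using scikit-learn's NearestNeighbors purely as a stand-in for the ranked-retrieval index in the talk: each center retrieves its m nearest points, every point keeps the closest center that retrieved it, and unclaimed points fall back to a linear scan. The fallback and the parameter m are simplifications of mine:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def assign_via_point_index(X, centers, m=100):
    # Index the points once, then for every center retrieve its m nearest points.
    index = NearestNeighbors().fit(X)
    dist, idx = index.kneighbors(centers, n_neighbors=m)
    best_d = np.full(len(X), np.inf)
    assign = np.full(len(X), -1, dtype=int)
    for c in range(len(centers)):
        for d, i in zip(dist[c], idx[c]):
            if d < best_d[i]:
                best_d[i], assign[i] = d, c     # point i keeps the closest center so far
    # Fallback: points retrieved by no center get a brute-force assignment.
    missed = np.where(assign < 0)[0]
    if len(missed) > 0:
        d2 = ((X[missed, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign[missed] = d2.argmin(axis=1)
        best_d[missed] = np.sqrt(d2.min(axis=1))
    return assign, best_d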
  53. Performance. Two large datasets: – 1M points in each – 7-25M features in each (very high dimensionality) – Clustering into k=1000 clusters.
  54. Performance. Two large datasets: – 1M points in each – 7-25M features in each (very high dimensionality) – Clustering into k=1000 clusters. Index-based k-means: – Simple implementation: 2-7x faster than traditional k-means – No degradation in quality (same objective function value) – More complex implementation: • An additional 8-50x speed improvement!
  55. k-Means Algorithm. Almost 60 years on, still an incredibly popular and useful approach. It has gotten better with age: – Better initialization approaches that are fast and accurate – Parallel implementations to handle large datasets – New implementations that handle points in many dimensions and clustering into many clusters – New approaches for online clustering
  56. k-Means Algorithm. Almost 60 years on, still an incredibly popular and useful approach. It has gotten better with age: – Better initialization approaches that are fast and accurate – Parallel implementations to handle large datasets – New implementations that handle points in many dimensions and clustering into many clusters – New approaches for online clustering. More work remains! – Non-spherical clusters – Other metric spaces – Dealing with outliers
  57. Thank You. Arthur, D., Vassilvitskii, S. k-means++: The advantages of careful seeding. SODA 2007. Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassilvitskii, S. Scalable k-means++. VLDB 2012. Broder, A., Garcia, L., Josifovski, V., Vassilvitskii, S., Venkatesan, S. Scalable k-means by ranked retrieval. WSDM 2014.
