4. Goals
Cluster very large data sets
Facilitate large nearest neighbor search
Allow very large number of clusters
Achieve good quality
– low average distance to nearest centroid on held-out data
Based on Mahout Math
Runs on Hadoop (really MapR) cluster
FAST – cluster tens of millions in minutes
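The quality goal above (low average distance to the nearest centroid on held-out data) can be written down directly. The following is a minimal sketch, not code from the project; the function name and the toy data are assumptions made for illustration, and L2 distance is used because the deck treats other metrics as a non-goal.

```python
import math

def mean_distance_to_nearest_centroid(points, centroids):
    """Average L2 distance from each held-out point to its closest centroid."""
    total = 0.0
    for p in points:
        total += min(math.dist(p, c) for c in centroids)
    return total / len(points)

# Toy held-out set and centroids (illustrative values only).
held_out = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
centroids = [(0.0, 0.0), (10.0, 0.0)]
score = mean_distance_to_nearest_centroid(held_out, centroids)
```

Lower is better; comparing this score across clusterings of the same held-out set gives a simple quality check.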
©MapR Technologies - Confidential 4
5. Non-goals
Use map-reduce (but it is there)
Minimize the number of clusters
Support metrics other than L2
6. Anti-goals
Multiple passes over original data
Scale as O(k n)
9. What’s that?
Find the k nearest training examples
Use the average value of the target variable from them
This is easy … but hard
– easy because it is so conceptually simple and you don’t have knobs to turn
or models to build
– hard because of the stunning amount of math
– also hard because we need top 50,000 results
Initial prototype was massively too slow
– 3K queries x 200K examples takes hours
– needed 20M x 25M in the same time
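The k-nearest-neighbor averaging described on this slide can be sketched as a brute-force scan. This is an illustrative sketch under assumed names (`knn_average`, the toy `examples` and `targets`), not the prototype mentioned above. Every query touches every training example, which is why the cost grows as queries times examples.

```python
import heapq
import math

def knn_average(query, examples, targets, k):
    """Average the target values of the k training examples nearest to query (L2)."""
    nearest = heapq.nsmallest(
        k, range(len(examples)), key=lambda i: math.dist(query, examples[i])
    )
    return sum(targets[i] for i in nearest) / len(nearest)

# Toy one-dimensional training set (illustrative values only).
examples = [(0.0,), (1.0,), (2.0,), (10.0,)]
targets = [1.0, 2.0, 3.0, 100.0]
prediction = knn_average((0.9,), examples, targets, k=2)
```

Here the two nearest examples are `(1.0,)` and `(0.0,)`, so the prediction is the mean of their targets.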
11. How We Did It
2-week hackathon with 6 developers from customer bank
Agile-ish development
To avoid IP issues
– all code is Apache Licensed (no ownership question)
– all data is synthetic (no question of private data)
– all development done on individual machines, hosting on GitHub
– open is easier than closed (in this case)
Goal is new open technology to facilitate new closed solutions
Ambitious goal of ~ 1,000,000 x speedup
– well, really only 100-1000x after basic hygiene
12. What We Did
Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid
Shared memory matrix
– FileBasedMatrix uses mmap to share very large dense matrices
Searcher interface
– ProjectionSearch, KmeansSearch, LshSearch, Brute
Super-fast clustering
– Kmeans, StreamingKmeans
15. K-means Search
Simple Idea
– pre-cluster the data
– to find the nearest points, search the nearest clusters
Recursive application
– to search a cluster, use a Searcher!
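The idea above can be sketched as follows: bucket the data by nearest centroid once, then answer a query by scanning only the buckets whose centroids are closest. This is a hypothetical illustration, not the KmeansSearch class; the class name `ClusterSearch`, the `probes` parameter, and the toy data are assumptions.

```python
import heapq
import math

class ClusterSearch:
    """Hypothetical sketch: approximate nearest-neighbor search over pre-clustered data."""

    def __init__(self, points, centroids):
        self.centroids = centroids
        # Pre-clustering step: bucket every point under its nearest centroid.
        self.buckets = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            self.buckets[nearest].append(p)

    def search(self, query, k, probes=2):
        # Look only inside the `probes` clusters whose centroids are closest to the query.
        closest = heapq.nsmallest(
            probes, range(len(self.centroids)),
            key=lambda j: math.dist(query, self.centroids[j]),
        )
        candidates = [p for j in closest for p in self.buckets[j]]
        return heapq.nsmallest(k, candidates, key=lambda p: math.dist(query, p))

# Toy data with hand-picked centroids (illustrative values only).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
searcher = ClusterSearch(points, centroids=[(0.0, 0.5), (5.0, 5.5), (20.0, 20.0)])
result = searcher.search((4.0, 4.0), k=2, probes=1)
```

In this sketch each bucket is scanned by brute force. The recursive application mentioned above would replace that scan with another searcher over the bucket's contents.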
21. But This Requires k-means!
Need a new k-means algorithm to get speed
– Hadoop is very slow at iterative map-reduce
– Maybe Pregel clones like Giraph would be better
– Or maybe not
Streaming k-means is
– One pass (through the original data)
– Very fast (20 μs per data point with threads)
– Very parallelizable
22. Basic Method
Use a single pass of k-means with very many clusters
– output is a bad-ish clustering but a good surrogate
Use weighted centroids from step 1 to do in-memory clustering
– output is a good clustering with fewer clusters
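The second step above, in-memory clustering of weighted centroids, can be sketched as Lloyd's algorithm in which every input carries a weight. This is a minimal sketch under assumptions: the function name, the fixed iteration count, and the hand-picked initial centers are illustrative choices, not the deck's implementation.

```python
import math

def weighted_kmeans(weighted_points, initial_centers, iterations=10):
    """Lloyd's algorithm where each input point is a (vector, weight) pair."""
    centers = list(initial_centers)
    for _ in range(iterations):
        sums = [[0.0] * len(c) for c in centers]
        weights = [0.0] * len(centers)
        for x, w in weighted_points:
            # Assign the weighted point to its nearest current center.
            j = min(range(len(centers)), key=lambda i: math.dist(x, centers[i]))
            weights[j] += w
            for d, xd in enumerate(x):
                sums[j][d] += w * xd
        # Move each center to the weighted mean of its points; keep empty centers in place.
        centers = [
            tuple(s / weights[j] for s in sums[j]) if weights[j] > 0 else centers[j]
            for j in range(len(centers))
        ]
    return centers

# A tiny "sketch" of weighted centroids from step 1 (illustrative values only).
sketch = [((0.0, 0.0), 10.0), ((1.0, 0.0), 30.0), ((10.0, 10.0), 5.0)]
centers = weighted_kmeans(sketch, initial_centers=[(0.0, 0.0), (10.0, 10.0)])
```

Because the weights record how many original points each surrogate centroid absorbed, the weighted means approximate what clustering the original data would produce.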
23. Algorithmic Details
For each data point x_n
    compute distance to nearest centroid, δ
    sample u uniformly from [0, 1); if u > δ/β, add x_n to nearest centroid
    else create new centroid from x_n
    if number of centroids > 10 log n
        recursively cluster centroids
        set β = 1.5 β if number of centroids did not decrease
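The steps above can be sketched in a few lines. This is an illustrative reading of the slide, not the project's StreamingKmeans code: the names `absorb` and `streaming_kmeans` are made up, the centroid limit is passed in as `max_centroids` (the slide uses 10 log n), and the "recursive" clustering is done as one more pass of the same merge rule over the shuffled centroids. `beta` is the distance threshold from the slide and is assumed to be positive.

```python
import math
import random

def absorb(centroids, x, w, beta, rng):
    """Fold weighted point (x, w) into its nearest centroid, or start a new centroid."""
    if centroids:
        j = min(range(len(centroids)), key=lambda i: math.dist(x, centroids[i][0]))
        c, cw = centroids[j]
        delta = math.dist(x, c)
        # Sample u; a nearby point (small delta / beta) is almost always merged.
        if rng.random() > delta / beta:
            total = cw + w
            centroids[j] = (tuple((a * cw + b * w) / total for a, b in zip(c, x)), total)
            return
    centroids.append((tuple(x), w))

def streaming_kmeans(points, max_centroids, beta, rng):
    """Single pass over points; returns (weighted centroids, final beta)."""
    centroids = []
    for x in points:
        absorb(centroids, x, 1.0, beta, rng)
        if len(centroids) > max_centroids:
            # Too many centroids: cluster the centroids themselves with the same rule.
            before = len(centroids)
            old = centroids
            rng.shuffle(old)
            centroids = []
            for c, cw in old:
                absorb(centroids, c, cw, beta, rng)
            # Loosen the threshold if collapsing did not reduce the count.
            if len(centroids) >= before:
                beta *= 1.5
    return centroids, beta

# Synthetic data: three Gaussian blobs (illustrative values only).
rng = random.Random(42)
points = [
    (rng.gauss(cx, 0.1), rng.gauss(cy, 0.1))
    for cx, cy in [(0, 0), (5, 5), (0, 5)]
    for _ in range(200)
]
rng.shuffle(points)
limit = int(10 * math.log(len(points)))
centroids, beta = streaming_kmeans(points, max_centroids=limit, beta=0.5, rng=rng)
```

Each original point is read once, and the total weight of the centroids always equals the number of points seen, so the weighted centroids stand in for the original data in later steps.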
24. How It Works
Result is large set of centroids
– these provide approximation of original distribution
– we can cluster centroids to get a close approximation of clustering original
– or we can just use the result directly
25. Parallel Speedup?
[Figure: time per point (μs) plotted against number of threads, comparing the non-threaded version, the threaded version, and perfect scaling.]
28. Warning, Recursive Descent
Inner loop requires finding nearest centroid
With lots of centroids, this is slow
But wait, we have classes to accelerate that!
(Let’s not use k-means searcher, though)
Empirically, projection search beats 64-bit LSH by a bit
29. Moving to Scale
Map-reduce implementation nearly trivial
Map: rough-cluster input data, output β, weighted centroids
Reduce:
– single reducer gets all centroids
– if too many centroids, merge using recursive clustering
– optionally do final clustering in-memory
Combiner possible, but essentially never important
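The map and reduce roles above can be mimicked in plain Python to show the data flow. This is a sketch under assumptions, not the Hadoop job: `rough_cluster`, `map_split`, and `reduce_all` are hypothetical names, the reducer takes the largest threshold reported by any mapper (an assumption; the slide does not say how thresholds are combined), and "merge using recursive clustering" is modelled as repeated passes of the same merge rule.

```python
import math
import random

def rough_cluster(weighted_points, beta, rng):
    """Single pass: merge each point into its nearest centroid or start a new one."""
    centroids = []
    for x, w in weighted_points:
        if centroids:
            j = min(range(len(centroids)), key=lambda i: math.dist(x, centroids[i][0]))
            c, cw = centroids[j]
            if rng.random() > math.dist(x, c) / beta:
                total = cw + w
                centroids[j] = (tuple((a * cw + b * w) / total for a, b in zip(c, x)), total)
                continue
        centroids.append((tuple(x), w))
    return centroids

def map_split(split, beta, rng):
    # Mapper: rough-cluster one input split, emit the threshold and weighted centroids.
    return beta, rough_cluster([(x, 1.0) for x in split], beta, rng)

def reduce_all(mapper_outputs, max_centroids, rng):
    # Single reducer: gather every mapper's centroids, merge while there are too many.
    beta = max(b for b, _ in mapper_outputs)  # assumption: use the largest threshold
    centroids = [c for _, cs in mapper_outputs for c in cs]
    while len(centroids) > max_centroids:
        before = len(centroids)
        rng.shuffle(centroids)
        centroids = rough_cluster(centroids, beta, rng)
        if len(centroids) >= before:
            beta *= 1.5
    return centroids

# Synthetic data divided into four splits (illustrative values only).
rng = random.Random(7)
data = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(400)]
splits = [data[i::4] for i in range(4)]
outputs = [map_split(s, beta=0.5, rng=rng) for s in splits]
final = reduce_all(outputs, max_centroids=20, rng=rng)
```

Only centroids and weights cross the map/reduce boundary, so the reducer's input is far smaller than the original data. The optional final in-memory clustering would run on `final`.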
30. Contact:
– tdunning@maprtech.com
– @ted_dunning
Slides and such:
– http://info.mapr.com/ted-mlconf
Hash tags: #mlconf #mahout #mapr