London Data Science - Super-Fast Clustering Report

Super-Fast Clustering
Report from MapR workshop

 For Book Discount: @ellen_friedman
 Contact:
– tdunning@maprtech.com
– @ted_dunning
 Twitter for this talk
– #mapr_uk
 Slides and such:
– http://info.mapr.com/ted-uk-05-2012

Company Background
 MapR provides the industry’s best Hadoop Distribution
– Combines the best of the Hadoop community
contributions with significant internally
financed infrastructure development
 Background of Team
– Deep management bench with extensive analytic,
storage, virtualization, and open source experience
– Google, EMC, Cisco, VMware, Network Appliance, IBM,
Microsoft, Apache Foundation, Aster Data, Brio, ParAccel
 Proven
– MapR used across industries (Financial Services, Media,
Telecom, Health Care, Internet Services, Government)
– Strategic OEM relationship with EMC and Cisco
– Over 1,000 installs

We Also Do …
 Open source development
– Zookeeper
– Hadoop
– Mahout
– Stuff
 Partner workshops
– Machine learning
– Information architecture
– Cluster design

The Problem
 A certain bank
– had lots of customers
– had lots of prospective customers
– had a non-trivial number of fraudulent customers
– had a non-trivial number of fraudulent merchants
 They also
– collected data
– built models
– collected more data
– built more models

But …
 These models were arduous to build
 And hard to test
 So people suggested something simpler
 Like k-nearest neighbor

What’s that?
 Find the k nearest training examples
 Use the average value of the target variable from them
 This is easy … but hard
– easy because it is so conceptually simple and you don’t have knobs to turn
or models to build
– hard because of the stunning amount of math
– also hard because we need top 50,000 results
 Initial prototype was massively too slow
– 3K queries x 200K examples takes hours
– needed 20M x 25M in the same time
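
As a minimal sketch of the idea above (not the prototype from the workshop), brute-force k-nearest neighbor can be written in a few lines. The function name `knn_predict` and the toy data are assumptions made for illustration. The last two lines count distance computations for the two workloads quoted on the slide.

```python
import heapq

def knn_predict(query, examples, targets, k):
    """Average the target values of the k training examples nearest to query."""
    # Squared Euclidean distance from the query to every training example.
    scored = (
        (sum((q - x) ** 2 for q, x in zip(query, example)), target)
        for example, target in zip(examples, targets)
    )
    nearest = heapq.nsmallest(k, scored, key=lambda pair: pair[0])
    return sum(target for _, target in nearest) / len(nearest)

# Toy data (hypothetical): three points near the origin, one far away.
examples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
targets = [0.0, 1.0, 1.0, 5.0]
prediction = knn_predict((0.1, 0.1), examples, targets, k=3)

# Brute force costs one distance computation per (query, example) pair.
prototype_pairs = 3_000 * 200_000          # 3K queries x 200K examples
required_pairs = 20_000_000 * 25_000_000   # 20M queries x 25M examples
```

The required workload is roughly 800,000 times larger than the prototype's, which is why brute force alone cannot meet the target and a faster search structure is needed.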

What We Did
 Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid
 Searcher interface
– ProjectionSearch, KmeansSearch, LshSearch, Brute
 Super-fast clustering
– Kmeans, StreamingKmeans

Projection Search

K-means Search

But These Require k-means!
 Need a new k-means algorithm to get speed
 Streaming k-means is
– One pass (through the original data)
– Very fast (20 μs per data point with threads)
– Very parallelizable

How It Works
 For each point
– Find approximately nearest centroid (distance = d)
– If d > threshold, new centroid
– Else possibly new centroid anyway
– Otherwise add the point to the nearest centroid
 If the number of centroids > K ≈ C log N
– Recursively cluster centroids with higher threshold
 Result is large set of centroids
– these provide approximation of original distribution
– we can cluster centroids to get a close approximation of clustering original
– or we can just use the result directly
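
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Mahout code: the slide only says a nearby point "possibly" starts a new centroid, so the rule used here (probability d / threshold) is an assumption, as are the names `streaming_kmeans`, `_insert`, and the `growth` factor. The nearest-centroid lookup is brute force for brevity, where the talk uses an approximate searcher.

```python
import math
import random

def _insert(centroids, vector, weight, threshold, rng):
    """Add one weighted point: start a new centroid or merge into the nearest."""
    if not centroids:
        centroids.append([weight, list(vector)])
        return
    # Brute-force nearest centroid (an approximate search would go here).
    nearest = min(centroids, key=lambda c: math.dist(c[1], vector))
    d = math.dist(nearest[1], vector)
    # d > threshold always creates a centroid; closer points do so with
    # probability d / threshold (assumed rule, see the lead-in).
    if rng.random() < d / threshold:
        centroids.append([weight, list(vector)])
    else:
        total = nearest[0] + weight
        nearest[1] = [(nearest[0] * c + weight * x) / total
                      for c, x in zip(nearest[1], vector)]
        nearest[0] = total

def streaming_kmeans(points, max_centroids, threshold=1.0, growth=1.5, seed=0):
    """One pass over points; returns a list of [weight, centroid] pairs."""
    rng = random.Random(seed)
    centroids = []
    for point in points:
        _insert(centroids, point, 1.0, threshold, rng)
        # Too many centroids: raise the threshold and re-cluster the centroids.
        while len(centroids) > max_centroids:
            threshold *= growth
            old, centroids = centroids, []
            for weight, vector in old:
                _insert(centroids, vector, weight, threshold, rng)
    return centroids

# Two hypothetical blobs of 200 points each.
data_rng = random.Random(42)
points = [(data_rng.gauss(cx, 0.1), data_rng.gauss(cy, 0.1))
          for cx, cy in [(0.0, 0.0), (5.0, 5.0)] for _ in range(200)]
sketch = streaming_kmeans(points, max_centroids=20)
```

The weights in the result always sum to the number of input points, so the weighted centroids can stand in for the original data in a later clustering step.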

Parallel Speedup?
[Figure: time per point (μs) versus number of threads, comparing the threaded version, the non-threaded version, and perfect scaling]

Warning, Recursive Descent
 Inner loop requires finding nearest centroid
 With lots of centroids, this is slow
 But wait, we have classes to accelerate that!
(Let’s not use k-means searcher, though)
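
One way to accelerate the nearest-centroid lookup is a search based on random projections. The sketch below is a generic illustration of that idea and makes no claim to match Mahout's ProjectionSearch class; the class name `ProjectionSearchSketch` and the parameters `num_projections` and `window` are assumptions. Each vector is projected onto a few random directions, the projections are kept sorted, and only vectors whose projections fall near the query's are checked with a true distance computation.

```python
import bisect
import math
import random

class ProjectionSearchSketch:
    def __init__(self, vectors, num_projections=3, window=8, seed=0):
        rng = random.Random(seed)
        dim = len(vectors[0])
        self.vectors = [tuple(v) for v in vectors]
        self.window = window
        # Random directions drawn from a Gaussian.
        self.directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                           for _ in range(num_projections)]
        # Per direction: (projection, index) pairs sorted by projection.
        self.tables = [
            sorted((self._dot(d, v), i) for i, v in enumerate(self.vectors))
            for d in self.directions
        ]

    @staticmethod
    def _dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def nearest(self, query):
        """Return (index, distance) of the approximately nearest vector."""
        candidates = set()
        for d, table in zip(self.directions, self.tables):
            pos = bisect.bisect_left(table, (self._dot(d, query), -1))
            lo = max(0, pos - self.window)
            hi = min(len(table), pos + self.window)
            candidates.update(i for _, i in table[lo:hi])
        # True distances are computed only for the candidate set.
        best = min(candidates, key=lambda i: math.dist(self.vectors[i], query))
        return best, math.dist(self.vectors[best], query)

# Hypothetical centroids; query with one of them.
rng = random.Random(1)
centroids = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(500)]
searcher = ProjectionSearchSketch(centroids)
index, distance = searcher.nearest(centroids[123])
```

The result is approximate in general: a larger `window` or more projections checks more candidates and trades speed for accuracy. Unlike a k-means based searcher, this structure does not itself need a clustering to be built first.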

Thank You
