SlideShare ist ein Scribd-Unternehmen logo
1 von 55
Fast Single-pass k-means
        Clustering
whoami – Ted Dunning
• Chief Application Architect, MapR Technologies
• Committer, member, Apache Software
  Foundation
  – particularly Mahout, Zookeeper and Drill

• Contact me at
 tdunning@maprtech.com
 tdunning@apache.com
 ted.dunning@gmail.com
 @ted_dunning
Agenda
• Rationale
• Theory
  – clusterable data, k-mean failure modes, sketches
• Algorithms
  – ball k-means, surrogate methods
• Implementation
  – searchers, vectors, clusterers
• Results
• Application
RATIONALE
Why k-means?
• Clustering allows fast search
  – k-nn models allow agile modeling
  – lots of data points, 108 typical
  – lots of clusters, 104 typical
• Model features
  – Distance to nearest centroids
  – Poor man’s manifold discovery
What is Quality?
• Robust clustering not a goal
  – we don’t care if the same clustering is replicated
• Generalization to unseen data critical
  – number of points per cluster
  – distance distributions
  – target function distributions
  – model performance stability
• Agreement to “gold standard” is a non-issue
An Example
The Problem
• Spirals are a classic “counter” example for k-
  means
• Classic low dimensional manifold with added
  noise

• But clustering still makes modeling work well
An Example
An Example
The Cluster Proximity Features
• Every point can be described by the nearest
  cluster
  – 4.3 bits per point in this case
  – Significant error that can be decreased (to a point)
    by increasing number of clusters
• Or by the proximity to the 2 nearest clusters (2
  x 4.3 bits + 1 sign bit + 2 proximities)
  – Error is negligible
  – Unwinds the data into a simple representation
Diagonalized Cluster Proximity
Lots of Clusters Are Fine
The Limiting Case
• Too many clusters lead to over-fitting
• Which we mediate by averaging over several
  nearby clusters
• In the limit we get k-nn modeling
  – and probably use k-means to speed up search
THEORY
Intuitive Theory
• Traditionally, minimize over all distributions
  – optimization is NP-complete
  – that isn’t like real data
• Recently, assume well-clusterable data

            s 2 D 2 (X) > D 2 (X)
                  k-1       k


• Interesting approximation bounds provable
                 1+ O(s 2 )
For Example
            1
  D (X) >
   2
                    D (X)
                     2

            s
   4            2    5




        Grouping these
           two clusters
         seriously hurts
       squared distance
ALGORITHMS
Lloyd’s Algorithm
• Part of CS folk-lore
• Developed in the late 50’s for signal quantization, published
  in 80’s

  initialize k cluster centroids somehow
  for each of many iterations:
    for each data point:
         assign point to nearest cluster
    recompute cluster centroids from points assigned to clusters

• Highly variable quality, several restarts recommended
Typical k-means Failure

  Selecting two seeds
       here cannot be
     fixed with Lloyds

                 Result is that these two
                       clusters get glued
                                 together
Ball k-means
• Provably better for highly clusterable data
• Tries to find initial centroids in each “core” of each real
  clusters
• Avoids outliers in centroid computation

  initialize centroids randomly with distance maximizing
  tendency
  for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute centroids using only points much closer than
  closest cluster
Still Not a Win
• Ball k-means is nearly guaranteed with k = 2
• Probability of successful seeding drops
  exponentially with k
• Alternative strategy has high probability of
  success, but takes O(nkd + k3d) time
Not good enough
Surrogate Method
• Start with sloppy clustering into κ = k log n
  clusters
• Use this sketch as a weighted surrogate for the
  data
• Cluster surrogate data using ball k-means
• Results are provably good for highly clusterable
  data
• Sloppy clustering is on-line
• Surrogate can be kept in memory
• Ball k-means pass can be done at any time
Algorithm Costs
• O(k d log n) per point per iteration for Lloyd’s
  algorithm
• Number of iterations not well known
• Iteration > log n reasonable assumption
Algorithm Costs
• Surrogate methods
  – fast, sloppy single pass clustering with κ = k log n
  – fast sloppy search for nearest cluster,
     O(d log κ) = O(d (log k + log log n)) per point
  – fast, in-memory, high-quality clustering of κ weighted
    centroids
     O(κ k d + k3 d) = O(k2 d log n + k3 d) for small k, high quality
     O(κ d log k) or O(d log κ log k) for larger k, looser quality
  – result is k high-quality centroids
     • Even the sloppy surrogate may suffice
Algorithm Costs
• How much faster for the sketch phase?
  – take k = 2000, d = 10, n = 100,000
  – k d log n = 2000 x 10 x 26 = 500,000
  – d (log k + log log n) = 10(11 + 5) = 170
  – 3,000 times faster is a bona fide big deal
Pragmatics
•   But this requires a fast search internally
•   Have to cluster on the fly for sketch
•   Have to guarantee sketch quality
•   Previous methods had very high complexity
How It Works
• For each point
  – Find approximately nearest centroid (distance = d)
  – If (d > threshold) new centroid
  – Else if (u > d/threshold) new cluster
  – Else add to nearest centroid
• If centroids > κ ≈ C log N
  – Recursively cluster centroids with higher threshold
Resulting Surrogate
• Result is large set of centroids
  – these provide approximation of original
    distribution
  – we can cluster centroids to get a close
    approximation of clustering original
  – or we can just use the result directly


• Either way, we win
IMPLEMENTATION
How Can We Search Faster?
• First rule: don’t do it
   – If we can eliminate most candidates, we can do less work
   – Projection search and k-means search

• Second rule: don’t do it
   – We can convert big floating point math to clever bit-wise
     integer math
   – Locality sensitive hashing

• Third rule: reduce dimensionality
   – Projection search
   – Random projection for very high dimension
Projection Search
               total ordering!
How Many Projections?
LSH Search
• Each random projection produces independent sign bit
• If two vectors have the same projected sign bits, they
  probably point in the same direction (i.e. cos θ ≈ 1)
• Distance in L2 is closely related to cosine

      x - y 2 = x - 2(x × y) + y
                   2               2



             = x 2 - 2 x y cosq + y 2
• We can replace (some) vector dot products with long
  integer XOR
1
                  LSH Bit-match Versus Cosine
           0.8


           0.6


           0.4


           0.2
Y Ax is




             0
                  0   8   16   24    32       40   48   56   64

          - 0.2


          - 0.4


          - 0.6


          - 0.8


            -1

                                    X Ax is
Results with 32 Bits
The Internals
• Mechanism for extending Mahout Vectors
  – DelegatingVector, WeightedVector, Centroid

• Searcher interface
  – ProjectionSearch, KmeansSearch, LshSearch, Brute

• Super-fast clustering
  – Kmeans, StreamingKmeans
Parallel Speedup?
                       200


                                                                      Non- threaded




                                                    ✓
                       100
                                    2
Tim e per point (μs)




                                                                       Threaded version
                                            3

                       50
                                                      4
                       40                                                6
                                                              5

                                                                               8
                       30
                                                                                   10        14
                                                                                        12
                       20                       Perfect Scaling                                   16




                       10
                             1          2       3         4       5                                    20


                                                    Threads
What About Map-Reduce?
• Map-reduce implementation is nearly trivial
  – Compute surrogate on each split
  – Total surrogate is union of all partial surrogates
  – Do in-memory clustering on total surrogate
• Threaded version shows linear speedup
  already
• Map-reduce speedup shows same linear
  speedup
How Well Does it Work?
• Theoretical guarantees for well clusterable
  data
  – Shindler, Wong and Meyerson, NIPS, 2011


• Evaluation on synthetic data
  – Rough clustering produces correct surrogates
  – Ball k-means strategy 1 performance is very good
    with large k
How Well Does it Work?
• Empirical evaluation on 20 newsgroups
• Alternative algorithms include ball k-means
  versus streaming k-means|ball k-means
• Results
  Average distance to nearest cluster on held-out data
  same or slightly smaller
  Median distance to nearest cluster is smaller
  > 10x faster (I/O and encoding limited)
APPLICATION
The Business Case
• Our customer has 100 million cards in
  circulation

• Quick and accurate decision-making is key.
  – Marketing offers
  – Fraud prevention
Opportunity
• Demand of modeling is increasing rapidly

• So they are testing something simpler and
  more agile

• Like k-nearest neighbor
What’s that?
• Find the k nearest training examples – lookalike
  customers

• This is easy … but hard
   – easy because it is so conceptually simple and you don’t
     have knobs to turn or models to build
   – hard because of the stunning amount of math
   – also hard because we need top 50,000 results

• Initial rapid prototype was massively too slow
   – 3K queries x 200K examples takes hours
   – needed 20M x 25M in the same time
K-Nearest Neighbor Example
Required Scale and Speed and
               Accuracy
• Want 20 million queries against 25 million
  references in 10,000 s
• Should be able to search > 100 million
  references
• Should be linearly and horizontally scalable
• Must have >50% overlap against reference
  search
How Hard is That?
• 20 M x 25 M x 100 Flop = 50 P Flop

• 1 CPU = 5 Gflops

• We need 10 M CPU seconds => 10,000 CPU’s

• Real-world efficiency losses may increase that by
  10x

• Not good!
K-means Search
• First do clustering with lots (thousands) of clusters

• Then search nearest clusters to find nearest points

• We win if we find >50% overlap with “true” answer

• We lose if we can’t cluster super-fast
   – more on this later
Lots of Clusters Are Fine
Lots of Clusters Are Fine
Some Details
• Clumpy data works better
  – Real data is clumpy 


• Speedups of 100-200x seem practical with
  50% overlap
  – Projection search and LSH give additional 100x


• More experiments needed
Summary
• Nearest neighbor algorithms can be blazing
  fast

• But you need blazing fast clustering
  – Which we now have
Contact Me!
• We’re hiring at MapR in US and Europe

• MapR software available for research use

• Come get the slides at
   http://www.mapr.com/company/events/acmsf-2-25-13

• Get the code as part of Mahout trunk

• Contact me at tdunning@maprtech.com or @ted_dunning

Weitere ähnliche Inhalte

Was ist angesagt?

Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...MLconf
 
Introduction to Diffusion Models
Introduction to Diffusion ModelsIntroduction to Diffusion Models
Introduction to Diffusion ModelsSangwoo Mo
 
Deep Learning for AI (2)
Deep Learning for AI (2)Deep Learning for AI (2)
Deep Learning for AI (2)Dongheon Lee
 
HPTS talk on micro-sharding with Katta
HPTS talk on micro-sharding with KattaHPTS talk on micro-sharding with Katta
HPTS talk on micro-sharding with KattaTed Dunning
 
Paper study: Learning to solve circuit sat
Paper study: Learning to solve circuit satPaper study: Learning to solve circuit sat
Paper study: Learning to solve circuit satChenYiHuang5
 
Paper study: Attention, learn to solve routing problems!
Paper study: Attention, learn to solve routing problems!Paper study: Attention, learn to solve routing problems!
Paper study: Attention, learn to solve routing problems!ChenYiHuang5
 
Paper Study: OptNet: Differentiable Optimization as a Layer in Neural Networks
Paper Study: OptNet: Differentiable Optimization as a Layer in Neural NetworksPaper Study: OptNet: Differentiable Optimization as a Layer in Neural Networks
Paper Study: OptNet: Differentiable Optimization as a Layer in Neural NetworksChenYiHuang5
 
Hands-on Tutorial of Machine Learning in Python
Hands-on Tutorial of Machine Learning in PythonHands-on Tutorial of Machine Learning in Python
Hands-on Tutorial of Machine Learning in PythonChun-Ming Chang
 
Technical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal WabbitTechnical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal Wabbitjakehofman
 
Compressed learning for time series classification
Compressed learning for time series classificationCompressed learning for time series classification
Compressed learning for time series classification學翰 施
 
Exploring Optimization in Vowpal Wabbit
Exploring Optimization in Vowpal WabbitExploring Optimization in Vowpal Wabbit
Exploring Optimization in Vowpal WabbitShiladitya Sen
 
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...MLconf
 
Probabilistic Data Structures and Approximate Solutions
Probabilistic Data Structures and Approximate SolutionsProbabilistic Data Structures and Approximate Solutions
Probabilistic Data Structures and Approximate SolutionsOleksandr Pryymak
 
Paper Study: Transformer dissection
Paper Study: Transformer dissectionPaper Study: Transformer dissection
Paper Study: Transformer dissectionChenYiHuang5
 
Large scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkLarge scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkMila, Université de Montréal
 
High-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingHigh-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingNesreen K. Ahmed
 
DMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical ClusteringDMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical ClusteringPier Luca Lanzi
 
Speaker Diarization
Speaker DiarizationSpeaker Diarization
Speaker DiarizationHONGJOO LEE
 
Explicit Density Models
Explicit Density ModelsExplicit Density Models
Explicit Density ModelsSangwoo Mo
 
Score-Based Generative Modeling through Stochastic Differential Equations
Score-Based Generative Modeling through Stochastic Differential EquationsScore-Based Generative Modeling through Stochastic Differential Equations
Score-Based Generative Modeling through Stochastic Differential EquationsSangwoo Mo
 

Was ist angesagt? (20)

Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLcon...
 
Introduction to Diffusion Models
Introduction to Diffusion ModelsIntroduction to Diffusion Models
Introduction to Diffusion Models
 
Deep Learning for AI (2)
Deep Learning for AI (2)Deep Learning for AI (2)
Deep Learning for AI (2)
 
HPTS talk on micro-sharding with Katta
HPTS talk on micro-sharding with KattaHPTS talk on micro-sharding with Katta
HPTS talk on micro-sharding with Katta
 
Paper study: Learning to solve circuit sat
Paper study: Learning to solve circuit satPaper study: Learning to solve circuit sat
Paper study: Learning to solve circuit sat
 
Paper study: Attention, learn to solve routing problems!
Paper study: Attention, learn to solve routing problems!Paper study: Attention, learn to solve routing problems!
Paper study: Attention, learn to solve routing problems!
 
Paper Study: OptNet: Differentiable Optimization as a Layer in Neural Networks
Paper Study: OptNet: Differentiable Optimization as a Layer in Neural NetworksPaper Study: OptNet: Differentiable Optimization as a Layer in Neural Networks
Paper Study: OptNet: Differentiable Optimization as a Layer in Neural Networks
 
Hands-on Tutorial of Machine Learning in Python
Hands-on Tutorial of Machine Learning in PythonHands-on Tutorial of Machine Learning in Python
Hands-on Tutorial of Machine Learning in Python
 
Technical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal WabbitTechnical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal Wabbit
 
Compressed learning for time series classification
Compressed learning for time series classificationCompressed learning for time series classification
Compressed learning for time series classification
 
Exploring Optimization in Vowpal Wabbit
Exploring Optimization in Vowpal WabbitExploring Optimization in Vowpal Wabbit
Exploring Optimization in Vowpal Wabbit
 
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
 
Probabilistic Data Structures and Approximate Solutions
Probabilistic Data Structures and Approximate SolutionsProbabilistic Data Structures and Approximate Solutions
Probabilistic Data Structures and Approximate Solutions
 
Paper Study: Transformer dissection
Paper Study: Transformer dissectionPaper Study: Transformer dissection
Paper Study: Transformer dissection
 
Large scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkLarge scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using spark
 
High-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingHigh-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and Modeling
 
DMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical ClusteringDMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical Clustering
 
Speaker Diarization
Speaker DiarizationSpeaker Diarization
Speaker Diarization
 
Explicit Density Models
Explicit Density ModelsExplicit Density Models
Explicit Density Models
 
Score-Based Generative Modeling through Stochastic Differential Equations
Score-Based Generative Modeling through Stochastic Differential EquationsScore-Based Generative Modeling through Stochastic Differential Equations
Score-Based Generative Modeling through Stochastic Differential Equations
 

Andere mochten auch

Campeonato brasileiro artilharia
Campeonato brasileiro   artilhariaCampeonato brasileiro   artilharia
Campeonato brasileiro artilhariaRafael Passos
 
91715307 huong dan_su_dung_wireshark_9292
91715307 huong dan_su_dung_wireshark_929291715307 huong dan_su_dung_wireshark_9292
91715307 huong dan_su_dung_wireshark_9292Mai Thanh Binh
 
De Luis Almagro a Tibisay Lucena
De Luis Almagro a Tibisay LucenaDe Luis Almagro a Tibisay Lucena
De Luis Almagro a Tibisay LucenaYsrrael Camero
 
Whatsapp
WhatsappWhatsapp
Whatsappvic0116
 
Unsur golongan ii a (logam alkali tanah
Unsur golongan ii a (logam alkali tanahUnsur golongan ii a (logam alkali tanah
Unsur golongan ii a (logam alkali tanahNur Huda
 
DOLLAR NAGARAM BOOK FUNCTION AT TIRUPUR 27 JAN 2013 ( PDF 1)
DOLLAR NAGARAM BOOK FUNCTION AT TIRUPUR 27 JAN 2013 ( PDF 1)DOLLAR NAGARAM BOOK FUNCTION AT TIRUPUR 27 JAN 2013 ( PDF 1)
DOLLAR NAGARAM BOOK FUNCTION AT TIRUPUR 27 JAN 2013 ( PDF 1)Jothi Ganesan
 
Efecto invernadero
Efecto invernaderoEfecto invernadero
Efecto invernaderobryvite
 
Actividad nro 3 de agrario. bases sustentables.
Actividad nro 3 de agrario. bases sustentables.Actividad nro 3 de agrario. bases sustentables.
Actividad nro 3 de agrario. bases sustentables.Astrid Rivero
 
Kickstarting a Video Game
Kickstarting a Video GameKickstarting a Video Game
Kickstarting a Video GameRuss Clarke
 
Gestion de imagen
Gestion de imagenGestion de imagen
Gestion de imagennelsonepv
 
Презентація Гугл Білецького Данила
Презентація Гугл Білецького ДанилаПрезентація Гугл Білецького Данила
Презентація Гугл Білецького ДанилаDanylo Biletskyi
 
Маркетинговое агентство Брусника. Кто мы и что мы делаем?
Маркетинговое агентство Брусника. Кто мы и что мы делаем?Маркетинговое агентство Брусника. Кто мы и что мы делаем?
Маркетинговое агентство Брусника. Кто мы и что мы делаем?Tehhi #brusnyka Polonskaya
 
Pengorganisasian dan struktur
Pengorganisasian dan strukturPengorganisasian dan struktur
Pengorganisasian dan strukturHANI KHAIRUNISA
 
eNeo Labs and the Spanish Connected Home Market - J Zamora
eNeo Labs and the Spanish Connected Home Market - J ZamoraeNeo Labs and the Spanish Connected Home Market - J Zamora
eNeo Labs and the Spanish Connected Home Market - J Zamoramfrancis
 

Andere mochten auch (20)

Campeonato brasileiro artilharia
Campeonato brasileiro   artilhariaCampeonato brasileiro   artilharia
Campeonato brasileiro artilharia
 
91715307 huong dan_su_dung_wireshark_9292
91715307 huong dan_su_dung_wireshark_929291715307 huong dan_su_dung_wireshark_9292
91715307 huong dan_su_dung_wireshark_9292
 
Ruwatan
RuwatanRuwatan
Ruwatan
 
I.a.
I.a.I.a.
I.a.
 
De Luis Almagro a Tibisay Lucena
De Luis Almagro a Tibisay LucenaDe Luis Almagro a Tibisay Lucena
De Luis Almagro a Tibisay Lucena
 
Whatsapp
WhatsappWhatsapp
Whatsapp
 
Unsur golongan ii a (logam alkali tanah
Unsur golongan ii a (logam alkali tanahUnsur golongan ii a (logam alkali tanah
Unsur golongan ii a (logam alkali tanah
 
Tipos de archivos
Tipos de archivosTipos de archivos
Tipos de archivos
 
Task 9
Task 9Task 9
Task 9
 
DOLLAR NAGARAM BOOK FUNCTION AT TIRUPUR 27 JAN 2013 ( PDF 1)
DOLLAR NAGARAM BOOK FUNCTION AT TIRUPUR 27 JAN 2013 ( PDF 1)DOLLAR NAGARAM BOOK FUNCTION AT TIRUPUR 27 JAN 2013 ( PDF 1)
DOLLAR NAGARAM BOOK FUNCTION AT TIRUPUR 27 JAN 2013 ( PDF 1)
 
Efecto invernadero
Efecto invernaderoEfecto invernadero
Efecto invernadero
 
Buildings
BuildingsBuildings
Buildings
 
Actividad nro 3 de agrario. bases sustentables.
Actividad nro 3 de agrario. bases sustentables.Actividad nro 3 de agrario. bases sustentables.
Actividad nro 3 de agrario. bases sustentables.
 
Early idea's (2)
Early idea's (2)Early idea's (2)
Early idea's (2)
 
Kickstarting a Video Game
Kickstarting a Video GameKickstarting a Video Game
Kickstarting a Video Game
 
Gestion de imagen
Gestion de imagenGestion de imagen
Gestion de imagen
 
Презентація Гугл Білецького Данила
Презентація Гугл Білецького ДанилаПрезентація Гугл Білецького Данила
Презентація Гугл Білецького Данила
 
Маркетинговое агентство Брусника. Кто мы и что мы делаем?
Маркетинговое агентство Брусника. Кто мы и что мы делаем?Маркетинговое агентство Брусника. Кто мы и что мы делаем?
Маркетинговое агентство Брусника. Кто мы и что мы делаем?
 
Pengorganisasian dan struktur
Pengorganisasian dan strukturPengorganisasian dan struktur
Pengorganisasian dan struktur
 
eNeo Labs and the Spanish Connected Home Market - J Zamora
eNeo Labs and the Spanish Connected Home Market - J ZamoraeNeo Labs and the Spanish Connected Home Market - J Zamora
eNeo Labs and the Spanish Connected Home Market - J Zamora
 

Ähnlich wie Fast Single-pass k-means Clustering

Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)Matthew Lease
 
clustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdfclustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdfp_manimozhi
 
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...PyData
 
Graph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear AlgebraGraph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear AlgebraJason Riedy
 
Fuzzy c means clustering protocol for wireless sensor networks
Fuzzy c means clustering protocol for wireless sensor networksFuzzy c means clustering protocol for wireless sensor networks
Fuzzy c means clustering protocol for wireless sensor networksmourya chandra
 
Image-Based E-Commerce Product Discovery: A Deep Learning Case Study - Denis ...
Image-Based E-Commerce Product Discovery: A Deep Learning Case Study - Denis ...Image-Based E-Commerce Product Discovery: A Deep Learning Case Study - Denis ...
Image-Based E-Commerce Product Discovery: A Deep Learning Case Study - Denis ...Lucidworks
 
Moving Toward Deep Learning Algorithms on HPCC Systems
Moving Toward Deep Learning Algorithms on HPCC SystemsMoving Toward Deep Learning Algorithms on HPCC Systems
Moving Toward Deep Learning Algorithms on HPCC SystemsHPCC Systems
 
Applying your Convolutional Neural Networks
Applying your Convolutional Neural NetworksApplying your Convolutional Neural Networks
Applying your Convolutional Neural NetworksDatabricks
 
generalized_nbody_acs_2015_challacombe
generalized_nbody_acs_2015_challacombegeneralized_nbody_acs_2015_challacombe
generalized_nbody_acs_2015_challacombeMatt Challacombe
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithmsSandeep Joshi
 
OWASP Netherlands -- ML vs Cryptocoin Miners
OWASP Netherlands -- ML vs Cryptocoin MinersOWASP Netherlands -- ML vs Cryptocoin Miners
OWASP Netherlands -- ML vs Cryptocoin MinersJonn Callahan
 
Meetup Julio Algoritmos Genéticos
Meetup Julio Algoritmos GenéticosMeetup Julio Algoritmos Genéticos
Meetup Julio Algoritmos GenéticosDataLab Community
 
Clustering of graphs and search of assemblages
Clustering of graphs and search of assemblagesClustering of graphs and search of assemblages
Clustering of graphs and search of assemblagesData-Centric_Alliance
 

Ähnlich wie Fast Single-pass k-means Clustering (20)

Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25
 
Paris Data Geeks
Paris Data GeeksParis Data Geeks
Paris Data Geeks
 
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011)
 
clustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdfclustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdf
 
Data Mining Lecture_7.pptx
Data Mining Lecture_7.pptxData Mining Lecture_7.pptx
Data Mining Lecture_7.pptx
 
Sparksummitny2016
Sparksummitny2016Sparksummitny2016
Sparksummitny2016
 
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
 
Class3
Class3Class3
Class3
 
Graph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear AlgebraGraph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear Algebra
 
Fuzzy c means clustering protocol for wireless sensor networks
Fuzzy c means clustering protocol for wireless sensor networksFuzzy c means clustering protocol for wireless sensor networks
Fuzzy c means clustering protocol for wireless sensor networks
 
Image-Based E-Commerce Product Discovery: A Deep Learning Case Study - Denis ...
Image-Based E-Commerce Product Discovery: A Deep Learning Case Study - Denis ...Image-Based E-Commerce Product Discovery: A Deep Learning Case Study - Denis ...
Image-Based E-Commerce Product Discovery: A Deep Learning Case Study - Denis ...
 
Moving Toward Deep Learning Algorithms on HPCC Systems
Moving Toward Deep Learning Algorithms on HPCC SystemsMoving Toward Deep Learning Algorithms on HPCC Systems
Moving Toward Deep Learning Algorithms on HPCC Systems
 
09 placement
09 placement09 placement
09 placement
 
Applying your Convolutional Neural Networks
Applying your Convolutional Neural NetworksApplying your Convolutional Neural Networks
Applying your Convolutional Neural Networks
 
generalized_nbody_acs_2015_challacombe
generalized_nbody_acs_2015_challacombegeneralized_nbody_acs_2015_challacombe
generalized_nbody_acs_2015_challacombe
 
Modern Cryptography
Modern CryptographyModern Cryptography
Modern Cryptography
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithms
 
OWASP Netherlands -- ML vs Cryptocoin Miners
OWASP Netherlands -- ML vs Cryptocoin MinersOWASP Netherlands -- ML vs Cryptocoin Miners
OWASP Netherlands -- ML vs Cryptocoin Miners
 
Meetup Julio Algoritmos Genéticos
Meetup Julio Algoritmos GenéticosMeetup Julio Algoritmos Genéticos
Meetup Julio Algoritmos Genéticos
 
Clustering of graphs and search of assemblages
Clustering of graphs and search of assemblagesClustering of graphs and search of assemblages
Clustering of graphs and search of assemblages
 

Mehr von Ted Dunning

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxTed Dunning
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with KubernetesTed Dunning
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in KubernetesTed Dunning
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forTed Dunning
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningTed Dunning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning LogisticsTed Dunning
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTed Dunning
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logisticsTed Dunning
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real DataTed Dunning
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteTed Dunning
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoopTed Dunning
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Ted Dunning
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data SecurelyTed Dunning
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeTed Dunning
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownTed Dunning
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopTed Dunning
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015Ted Dunning
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossibleTed Dunning
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningTed Dunning
 

Mehr von Ted Dunning (20)

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptx
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with Kubernetes
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in Kubernetes
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine Learning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning Logistics
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworks
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logistics
 
T digest-update
T digest-updateT digest-update
T digest-update
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real Data
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC Keynote
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoop
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data Securely
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossible
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine Learning
 

Kürzlich hochgeladen

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 

Kürzlich hochgeladen (20)

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 

Fast Single-pass k-means Clustering

  • 2. whoami – Ted Dunning • Chief Application Architect, MapR Technologies • Committer, member, Apache Software Foundation – particularly Mahout, Zookeeper and Drill • Contact me at tdunning@maprtech.com tdunning@apache.com ted.dunning@gmail.com @ted_dunning
  • 3. Agenda • Rationale • Theory – clusterable data, k-mean failure modes, sketches • Algorithms – ball k-means, surrogate methods • Implementation – searchers, vectors, clusterers • Results • Application
  • 5. Why k-means? • Clustering allows fast search – k-nn models allow agile modeling – lots of data points, 108 typical – lots of clusters, 104 typical • Model features – Distance to nearest centroids – Poor man’s manifold discovery
  • 6. What is Quality? • Robust clustering not a goal – we don’t care if the same clustering is replicated • Generalization to unseen data critical – number of points per cluster – distance distributions – target function distributions – model performance stability • Agreement to “gold standard” is a non-issue
  • 8. The Problem • Spirals are a classic “counter” example for k- means • Classic low dimensional manifold with added noise • But clustering still makes modeling work well
  • 11. The Cluster Proximity Features • Every point can be described by the nearest cluster – 4.3 bits per point in this case – Significant error that can be decreased (to a point) by increasing number of clusters • Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities) – Error is negligible – Unwinds the data into a simple representation
  • 13. Lots of Clusters Are Fine
  • 14. The Limiting Case • Too many clusters lead to over-fitting • Which we mediate by averaging over several nearby clusters • In the limit we get k-nn modeling – and probably use k-means to speed up search
  • 16. Intuitive Theory • Traditionally, minimize over all distributions – optimization is NP-complete – that isn’t like real data • Recently, assume well-clusterable data s 2 D 2 (X) > D 2 (X) k-1 k • Interesting approximation bounds provable 1+ O(s 2 )
  • 17. For Example 1 D (X) > 2 D (X) 2 s 4 2 5 Grouping these two clusters seriously hurts squared distance
  • 19. Lloyd’s Algorithm • Part of CS folk-lore • Developed in the late 50’s for signal quantization, published in 80’s initialize k cluster centroids somehow for each of many iterations: for each data point: assign point to nearest cluster recompute cluster centroids from points assigned to clusters • Highly variable quality, several restarts recommended
  • 20. Typical k-means Failure Selecting two seeds here cannot be fixed with Lloyds Result is that these two clusters get glued together
  • 21. Ball k-means • Provably better for highly clusterable data • Tries to find initial centroids in each “core” of each real clusters • Avoids outliers in centroid computation initialize centroids randomly with distance maximizing tendency for each of a very few iterations: for each data point: assign point to nearest cluster recompute centroids using only points much closer than closest cluster
  • 22. Still Not a Win • Ball k-means is nearly guaranteed with k = 2 • Probability of successful seeding drops exponentially with k • Alternative strategy has high probability of success, but takes O(nkd + k3d) time
  • 24. Surrogate Method • Start with sloppy clustering into κ = k log n clusters • Use this sketch as a weighted surrogate for the data • Cluster surrogate data using ball k-means • Results are provably good for highly clusterable data • Sloppy clustering is on-line • Surrogate can be kept in memory • Ball k-means pass can be done at any time
  • 25. Algorithm Costs • O(k d log n) per point per iteration for Lloyd’s algorithm • Number of iterations not well known • Iteration > log n reasonable assumption
  • 26. Algorithm Costs • Surrogate methods – fast, sloppy single pass clustering with κ = k log n – fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point – fast, in-memory, high-quality clustering of κ weighted centroids O(κ k d + k3 d) = O(k2 d log n + k3 d) for small k, high quality O(κ d log k) or O(d log κ log k) for larger k, looser quality – result is k high-quality centroids • Even the sloppy surrogate may suffice
  • 27. Algorithm Costs • How much faster for the sketch phase? – take k = 2000, d = 10, n = 100,000 – k d log n = 2000 x 10 x 26 = 500,000 – d (log k + log log n) = 10(11 + 5) = 170 – 3,000 times faster is a bona fide big deal
  • 28. Pragmatics • But this requires a fast search internally • Have to cluster on the fly for sketch • Have to guarantee sketch quality • Previous methods had very high complexity
  • 29. How It Works • For each point – Find approximately nearest centroid (distance = d) – If (d > threshold) new centroid – Else if (u > d/threshold) new cluster – Else add to nearest centroid • If centroids > κ ≈ C log N – Recursively cluster centroids with higher threshold
  • 30. Resulting Surrogate • Result is large set of centroids – these provide approximation of original distribution – we can cluster centroids to get a close approximation of clustering original – or we can just use the result directly • Either way, we win
  • 32. How Can We Search Faster? • First rule: don’t do it – If we can eliminate most candidates, we can do less work – Projection search and k-means search • Second rule: don’t do it – We can convert big floating point math to clever bit-wise integer math – Locality sensitive hashing • Third rule: reduce dimensionality – Projection search – Random projection for very high dimension
  • 33. Projection Search total ordering!
  • 35. LSH Search • Each random projection produces independent sign bit • If two vectors have the same projected sign bits, they probably point in the same direction (i.e. cos θ ≈ 1) • Distance in L2 is closely related to cosine x - y 2 = x - 2(x × y) + y 2 2 = x 2 - 2 x y cosq + y 2 • We can replace (some) vector dot products with long integer XOR
  • 36. 1 LSH Bit-match Versus Cosine 0.8 0.6 0.4 0.2 Y Ax is 0 0 8 16 24 32 40 48 56 64 - 0.2 - 0.4 - 0.6 - 0.8 -1 X Ax is
  • 38. The Internals • Mechanism for extending Mahout Vectors – DelegatingVector, WeightedVector, Centroid • Searcher interface – ProjectionSearch, KmeansSearch, LshSearch, Brute • Super-fast clustering – Kmeans, StreamingKmeans
  • 39. Parallel Speedup? 200 Non- threaded ✓ 100 2 Tim e per point (μs) Threaded version 3 50 4 40 6 5 8 30 10 14 12 20 Perfect Scaling 16 10 1 2 3 4 5 20 Threads
  • 40. What About Map-Reduce? • Map-reduce implementation is nearly trivial – Compute surrogate on each split – Total surrogate is union of all partial surrogates – Do in-memory clustering on total surrogate • Threaded version shows linear speedup already • Map-reduce speedup shows same linear speedup
  • 41. How Well Does it Work? • Theoretical guarantees for well clusterable data – Shindler, Wong and Meyerson, NIPS, 2011 • Evaluation on synthetic data – Rough clustering produces correct surrogates – Ball k-means strategy 1 performance is very good with large k
  • 42. How Well Does it Work? • Empirical evaluation on 20 newsgroups • Alternative algorithms include ball k-means versus streaming k-means|ball k-means • Results Average distance to nearest cluster on held-out data same or slightly smaller Median distance to nearest cluster is smaller > 10x faster (I/O and encoding limited)
  • 44. The Business Case • Our customer has 100 million cards in circulation • Quick and accurate decision-making is key. – Marketing offers – Fraud prevention
  • 45. Opportunity • Demand of modeling is increasing rapidly • So they are testing something simpler and more agile • Like k-nearest neighbor
  • 46. What’s that? • Find the k nearest training examples – lookalike customers • This is easy … but hard – easy because it is so conceptually simple and you don’t have knobs to turn or models to build – hard because of the stunning amount of math – also hard because we need top 50,000 results • Initial rapid prototype was massively too slow – 3K queries x 200K examples takes hours – needed 20M x 25M in the same time
  • 48. Required Scale and Speed and Accuracy • Want 20 million queries against 25 million references in 10,000 s • Should be able to search > 100 million references • Should be linearly and horizontally scalable • Must have >50% overlap against reference search
  • 49. How Hard is That? • 20 M x 25 M x 100 Flop = 50 P Flop • 1 CPU = 5 Gflops • We need 10 M CPU seconds => 10,000 CPU’s • Real-world efficiency losses may increase that by 10x • Not good!
  • 50. K-means Search • First do clustering with lots (thousands) of clusters • Then search nearest clusters to find nearest points • We win if we find >50% overlap with “true” answer • We lose if we can’t cluster super-fast – more on this later
  • 51. Lots of Clusters Are Fine
  • 52. Lots of Clusters Are Fine
  • 53. Some Details • Clumpy data works better – Real data is clumpy  • Speedups of 100-200x seem practical with 50% overlap – Projection search and LSH give additional 100x • More experiments needed
  • 54. Summary • Nearest neighbor algorithms can be blazing fast • But you need blazing fast clustering – Which we now have
  • 55. Contact Me! • We’re hiring at MapR in US and Europe • MapR software available for research use • Come get the slides at http://www.mapr.com/company/events/acmsf-2-25-13 • Get the code as part of Mahout trunk • Contact me at tdunning@maprtech.com or @ted_dunning

Hinweis der Redaktion

  1. The basic idea here is that I have colored slides to be presented by you in blue. You should substitute and reword those slides as you like. In a few places, I imagined that we would have fast back and forth as in the introduction or final slide where we can each say we are hiring in turn.The overall thrust of the presentation is for you to make these points:Amex does lots of modelingit is expensivehaving a way to quickly test models and new variables would be awesomeso we worked on a new project with MapRMy part will say the following:Knn basic pictorial motivation (could move to you if you like)describe knn quality metric of overlapshow how bad metric breaks knn (optional)quick description of LSH and projection searchpicture of why k-means search is coolmotivate k-means speed as tool for k-means searchdescribe single pass k-means algorithmdescribe basic data structuresshow parallel speedupOur summary should state that we have achievedsuper-fast k-means clusteringinitial version of super-fast knn search with good overlap
  2. The sub-bullets are just for reference and should be deleted later
  3. This slide is red to indicate missing data
  4. The idea here is to guess what color a new dot should be by looking at the points within the circle. The first should obviously be purple. The second cyan. The third is uncertain, but probably isn’t green or cyan and probably is a bit more likely to be red than purple.