SlideShare ist ein Scribd-Unternehmen logo
1 von 52
Mahout, New and Improved
Now with Super Fast Clustering

©MapR Technologies - Confidential   1
Agenda

     What happened in Mahout 0.7
       –   less bloat
       –   simpler structure
       –   general cleanup




©MapR Technologies - Confidential   2
To Cut Out Bloat




©MapR Technologies - Confidential   3
©MapR Technologies - Confidential   4
Bloat is Leaving in 0.7

     Lots of abandoned code in Mahout
       –   average code quality is poor
       –   no users
       –   no maintainers
       –   why do we care?
     Examples
       –   old LDA
       –   old Naïve Bayes
       –   genetic algorithms
     If you care, get on the mailing list



©MapR Technologies - Confidential         5
Bloat is Leaving in 0.7

     Lots of abandoned code in Mahout
       –   average code quality is poor
       –   no users
       –   no maintainers
       –   why do we care?
     Examples
       –   old LDA
       –   old Naïve Bayes
       –   genetic algorithms
     If you care, get on the mailing list
       –   oops, too late since 0.7 is already released


©MapR Technologies - Confidential              6
Integration of
  Collections




©MapR Technologies - Confidential   7
Nobody Cares about Collections

     We need it, math is built on it


     Pull it into math


     Broke the build (battle of the code expanders)


     Fixed now (thanks to Grant)




©MapR Technologies - Confidential       8
Pig Vector


©MapR Technologies - Confidential   9
What is it?

     Supports access to Mahout functionality from Pig


     So far -- text vectorization


     And classification


     And model saving




©MapR Technologies - Confidential    10
What is it?

     Supports Pig access to Mahout functions


     So far text vectorization


     And classification


     And model saving


     Kind of works (see pigML from twitter for better function)



©MapR Technologies - Confidential     11
Compile and Install

     Start by compiling and installing mahout in your local repository:
           cd ~/Apache
           git clone https://github.com/apache/mahout.git
           cd mahout
           mvn install -DskipTests


     Then do the same with pig-vector
           cd ~/Apache
           git clone git@github.com:tdunning/pig-vector.git
           cd pig-vector
           mvn package




©MapR Technologies - Confidential              12
Tokenize and Vectorize Text

     Tokenized is done using a text encoder
       –   the dimension of the resulting vectors (typically 100,000-1,000,000
       –   a description of the variables to be included in the encoding
       –   the schema of the tuples that pig will pass together with their data types
     Example:
           define EncodeVector
           org.apache.mahout.pig.encoders.EncodeVector
           ('10','x+y+1', 'x:numeric, y:word, z:text');


     You can also add a Lucene 3.1 analyzer in parentheses if you want
      something fancier



©MapR Technologies - Confidential               13
The Formula

     Not normal arithmetic


     Describes which variables to use, whether offset is included


     Also describes which interactions to use




©MapR Technologies - Confidential     14
The Formula

     Not normal arithmetic


     Describes which variables to use, whether offset is included


     Also describes which interactions to use
       –   but that doesn’t do anything yet!




©MapR Technologies - Confidential              15
Load and Encode Data

     Load the data
            a = load '/Users/tdunning/Downloads/NNBench.csv' using PigStorage(',')
                      as (x1:int, x2:int, x3:int);
     And encode it
           b = foreach a generate 1 as key, EncodeVector(*) as v;
     Note that the true meaning of * is very subtle
     Now store it
           store b into 'vectors.dat' using com.twitter.elephantbird.pig.store.SequenceFileStorage
           (
           '-c com.twitter.elephantbird.pig.util.IntWritableConverter’,     '-c
           com.twitter.elephantbird.pig.util.GenericWritableConverter
           -t org.apache.mahout.math.VectorWritable’);




©MapR Technologies - Confidential                  16
Train a Model

     Pass previously encoded data to a sequential model trainer
                  define train org.apache.mahout.pig.LogisticRegression(
                  'iterations=5, inMemory=true, features=100000, categories=alt.atheism
                  comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns
                  comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast
                  comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space
                  talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt
                  soc.religion.christian talk.religion.misc');
     Note that the argument is a string with its own syntax




©MapR Technologies - Confidential                     17
Reservations and Qualms

     Pig-vector isn’t done


     And it is ugly


     And it doesn’t quite work


     And it is hard to build


     But there seems to be promise



©MapR Technologies - Confidential     18
Potential

     Add Naïve Bayes Model?


     Somehow simplify the syntax?


     Try a recent version of elephant-bird?


     Switch to pigML?




©MapR Technologies - Confidential     19
Large-scale k-Means Clustering


©MapR Technologies - Confidential   20
Goals

     Cluster very large data sets
     Facilitate large nearest neighbor search
     Allow very large number of clusters
     Achieve good quality
       –   low average distance to nearest centroid on held-out data
     Based on Mahout Math
     Runs on Hadoop (really MapR) cluster
     FAST – cluster tens of millions in minutes




©MapR Technologies - Confidential            21
Non-goals

     Use map-reduce (but it is there)
     Minimize the number of clusters
     Support metrics other than L2




©MapR Technologies - Confidential        22
Anti-goals

     Multiple passes over original data
     Scale as O(k n)




©MapR Technologies - Confidential     23
Why?




©MapR Technologies - Confidential    24
K-nearest Neighbor with
  Super Fast k-means




©MapR Technologies - Confidential   25
What’s that?

     Find the k nearest training examples
     Use the average value of the target variable from them


     This is easy … but hard
       –   easy because it is so conceptually simple and you have few knobs to turn
           or models to build
       –   hard because of the stunning amount of math
       –   also hard because we need top 50,000 results, not just single nearest


     Initial prototype was massively too slow
       –   3K queries x 200K examples takes hours
       –   needed 20M x 25M in the same time

©MapR Technologies - Confidential            26
Modeling with k-nearest Neighbors




                                        a



                                    b            c




©MapR Technologies - Confidential           27
Subject to Some Limits




©MapR Technologies - Confidential   28
Log Transform Improves Things




©MapR Technologies - Confidential   29
Neighbors Depend on Good Presentation




©MapR Technologies - Confidential   30
How We Did It

     2 week hackathon with 6 developers from MapR customer
     Agile-ish development
     To avoid IP issues
       –   all code is Apache Licensed (no ownership question)
       –   all data is synthetic (no question of private data)
       –   all development done on individual machines, hosting on Github
       –   open is easier than closed (in this case)
     Goal is new open technology to facilitate new closed solutions


     Ambitious goal of ~ 1,000,000 x speedup



©MapR Technologies - Confidential           31
How We Did It

     2 week hackathon with 6 developers from customer bank
     Agile-ish development
     To avoid IP issues
       –   all code is Apache Licensed (no ownership question)
       –   all data is synthetic (no question of private data)
       –   all development done on individual machines, hosting on Github
       –   open is easier than closed (in this case)
     Goal is new open technology to facilitate new closed solutions


     Ambitious goal of ~ 1,000,000 x speedup
       –   well, really only 100-1000x after basic hygiene


©MapR Technologies - Confidential             32
What We Did

     Mechanism for extending Mahout Vectors
       –   DelegatingVector, WeightedVector, Centroid


     Shared memory matrix
       –   FileBasedMatrix uses mmap to share very large dense matrices


     Searcher interface
       –   Brute, ProjectionSearch, KmeansSearch, LshSearch


     Super-fast clustering
       –   Kmeans, StreamingKmeans

©MapR Technologies - Confidential         33
Projection Search


                                         java.lang.TreeSet!




©MapR Technologies - Confidential   34
Projection Search

     Projection onto a line provides a total order on data
     Nearby points stay nearby
     Some other points also wind up close


     Search points just before or just after the query point




©MapR Technologies - Confidential      35
How Many Projections?




©MapR Technologies - Confidential   36
K-means Search

     Simple Idea
       –   pre-cluster the data
       –   to find the nearest points, search the nearest clusters


     Recursive application
       –   to search a cluster, use a Searcher!




©MapR Technologies - Confidential                 37
©MapR Technologies - Confidential   38
x




©MapR Technologies - Confidential       39
©MapR Technologies - Confidential   40
©MapR Technologies - Confidential   41
x




©MapR Technologies - Confidential       42
But This Requires k-means!

     Need a new k-means algorithm to get speed
       –   Hadoop is very slow at iterative map-reduce
       –   Maybe Pregel clones like Giraph would be better
       –   Or maybe not


     Streaming k-means is
       –   One pass (through the original data)
       –   Very fast (20 us per data point with threads on one node)
       –   Very parallelizable




©MapR Technologies - Confidential            43
Basic Method

     Use a single pass of k-means with very many clusters
       –   output is a bad-ish clustering but a good surrogate
     Use weighted centroids from step 1 to do in-memory clustering
       –   output is a good clustering with fewer clusters




©MapR Technologies - Confidential             44
Algorithmic Details

Foreach data point xn
           compute distance to nearest centroid, ∂
           sample u, if u > ∂/ß add to nearest centroid
           else create new centroid

           if number of centroids > k log n
                       recursively cluster centroids
                       set ß = 1.5 ß if number of centroids did not decrease




©MapR Technologies - Confidential                        45
How It Works


     Result is large set of centroids
       –   these provide approximation of original distribution
       –   we can cluster centroids to get a close approximation of clustering original
       –   or we can just use the result directly




©MapR Technologies - Confidential             46
Parallel Speedup?

                                        200


                                                                                     Non- threaded




                                                                  ✓
                                        100
                                                  2
                 Tim e per point (μs)




                                                                                      Threaded version
                                                          3

                                        50
                                                                    4
                                        40                                              6
                                                                             5

                                                                                              8
                                        30
                                                                                                  10        14
                                                                                                       12
                                        20                    Perfect Scaling                                    16




                                        10
                                              1       2       3         4        5                                    20


                                                                  Threads
©MapR Technologies - Confidential                                       47
Warning, Recursive Descent

     Inner loop requires finding nearest centroid


     With lots of centroids, this is slow


     But wait, we have classes to accelerate that!




©MapR Technologies - Confidential       48
Warning, Recursive Descent

     Inner loop requires finding nearest centroid


     With lots of centroids, this is slow


     But wait, we have classes to accelerate that!


                       (Let’s not use k-means searcher, though)




©MapR Technologies - Confidential                49
Warning, Recursive Descent

     Inner loop requires finding nearest centroid


     With lots of centroids, this is slow


     But wait, we have classes to accelerate that!


                       (Let’s not use k-means searcher, though)


     Empirically, projection search beats 64 bit LSH by a bit
       –   More optimization may change this story


©MapR Technologies - Confidential                50
Moving to Ultra Mega Super Scale

     Map-reduce implementation nearly trivial


     Map: rough-cluster input data, output ß, weighted centroids


     Reduce:
       –   single reducer gets all centroids
       –   if too many centroids, merge using recursive clustering
       –   optionally do final clustering in-memory


     Combiner possible, but not important



©MapR Technologies - Confidential             51
     Contact:
       –   tdunning@maprtech.com
       –   @ted_dunning


     Slides and such:
       –   http://info.mapr.com/ted-boston-2012-07

       Hash tags: #boston-hug #mahout #mapr




©MapR Technologies - Confidential          52

Weitere ähnliche Inhalte

Was ist angesagt?

Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clusteringTed Dunning
 
Bda-dunning-2012-12-06
Bda-dunning-2012-12-06Bda-dunning-2012-12-06
Bda-dunning-2012-12-06Ted Dunning
 
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ..."Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...Edge AI and Vision Alliance
 
Real-time and Long-time Together
Real-time and Long-time TogetherReal-time and Long-time Together
Real-time and Long-time TogetherMapR Technologies
 
Strata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionStrata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionTed Dunning
 
"How to Test and Validate an Automated Driving System," a Presentation from M...
"How to Test and Validate an Automated Driving System," a Presentation from M..."How to Test and Validate an Automated Driving System," a Presentation from M...
"How to Test and Validate an Automated Driving System," a Presentation from M...Edge AI and Vision Alliance
 
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ..."Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...Edge AI and Vision Alliance
 
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co..."New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...Edge AI and Vision Alliance
 
Spine net learning scale permuted backbone for recognition and localization
Spine net learning scale permuted backbone for recognition and localizationSpine net learning scale permuted backbone for recognition and localization
Spine net learning scale permuted backbone for recognition and localizationDevansh16
 
Optimization of Resource Provisioning Cost in Cloud Computing
Optimization of Resource Provisioning Cost in Cloud Computing Optimization of Resource Provisioning Cost in Cloud Computing
Optimization of Resource Provisioning Cost in Cloud Computing Sivadon Chaisiri
 
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...Edge AI and Vision Alliance
 
GTC Taiwan 2017 企業端深度學習與人工智慧應用
GTC Taiwan 2017 企業端深度學習與人工智慧應用GTC Taiwan 2017 企業端深度學習與人工智慧應用
GTC Taiwan 2017 企業端深度學習與人工智慧應用NVIDIA Taiwan
 
Implementing AI: High Performance Architectures: A Universal Accelerated Comp...
Implementing AI: High Performance Architectures: A Universal Accelerated Comp...Implementing AI: High Performance Architectures: A Universal Accelerated Comp...
Implementing AI: High Performance Architectures: A Universal Accelerated Comp...KTN
 
Talk on commercialising space data
Talk on commercialising space data Talk on commercialising space data
Talk on commercialising space data Alison B. Lowndes
 
Academic cloud experiences cern v4
Academic cloud experiences cern v4Academic cloud experiences cern v4
Academic cloud experiences cern v4Tim Bell
 
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ..."Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...Edge AI and Vision Alliance
 

Was ist angesagt? (19)

Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clustering
 
Dunning strata-2012-27-02
Dunning strata-2012-27-02Dunning strata-2012-27-02
Dunning strata-2012-27-02
 
Bda-dunning-2012-12-06
Bda-dunning-2012-12-06Bda-dunning-2012-12-06
Bda-dunning-2012-12-06
 
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ..."Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
 
Real-time and Long-time Together
Real-time and Long-time TogetherReal-time and Long-time Together
Real-time and Long-time Together
 
Strata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionStrata 2014 Anomaly Detection
Strata 2014 Anomaly Detection
 
Devoxx Real-Time Learning
Devoxx Real-Time LearningDevoxx Real-Time Learning
Devoxx Real-Time Learning
 
Hcj 2013-01-21
Hcj 2013-01-21Hcj 2013-01-21
Hcj 2013-01-21
 
"How to Test and Validate an Automated Driving System," a Presentation from M...
"How to Test and Validate an Automated Driving System," a Presentation from M..."How to Test and Validate an Automated Driving System," a Presentation from M...
"How to Test and Validate an Automated Driving System," a Presentation from M...
 
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ..."Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
 
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co..."New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
 
Spine net learning scale permuted backbone for recognition and localization
Spine net learning scale permuted backbone for recognition and localizationSpine net learning scale permuted backbone for recognition and localization
Spine net learning scale permuted backbone for recognition and localization
 
Optimization of Resource Provisioning Cost in Cloud Computing
Optimization of Resource Provisioning Cost in Cloud Computing Optimization of Resource Provisioning Cost in Cloud Computing
Optimization of Resource Provisioning Cost in Cloud Computing
 
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
 
GTC Taiwan 2017 企業端深度學習與人工智慧應用
GTC Taiwan 2017 企業端深度學習與人工智慧應用GTC Taiwan 2017 企業端深度學習與人工智慧應用
GTC Taiwan 2017 企業端深度學習與人工智慧應用
 
Implementing AI: High Performance Architectures: A Universal Accelerated Comp...
Implementing AI: High Performance Architectures: A Universal Accelerated Comp...Implementing AI: High Performance Architectures: A Universal Accelerated Comp...
Implementing AI: High Performance Architectures: A Universal Accelerated Comp...
 
Talk on commercialising space data
Talk on commercialising space data Talk on commercialising space data
Talk on commercialising space data
 
Academic cloud experiences cern v4
Academic cloud experiences cern v4Academic cloud experiences cern v4
Academic cloud experiences cern v4
 
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ..."Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
 

Ähnlich wie Boston hug-2012-07

Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012MapR Technologies
 
Graphlab Ted Dunning Clustering
Graphlab Ted Dunning  ClusteringGraphlab Ted Dunning  Clustering
Graphlab Ted Dunning ClusteringMapR Technologies
 
CMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceCMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceMapR Technologies
 
London Data Science - Super-Fast Clustering Report
London Data Science - Super-Fast Clustering ReportLondon Data Science - Super-Fast Clustering Report
London Data Science - Super-Fast Clustering ReportMapR Technologies
 
What is the past future tense of data?
What is the past future tense of data?What is the past future tense of data?
What is the past future tense of data?Ted Dunning
 
predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21Ted Dunning
 
Predictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural NetworksPredictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural NetworksJustin Brandenburg
 
Cmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop PerformanceCmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop PerformanceTed Dunning
 
Predictive Analytics San Diego
Predictive Analytics San DiegoPredictive Analytics San Diego
Predictive Analytics San DiegoMapR Technologies
 
The Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a MissionThe Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a Missioninside-BigData.com
 
The power of hadoop in business
The power of hadoop in businessThe power of hadoop in business
The power of hadoop in businessMapR Technologies
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Technologies
 
Spark and MapR Streams: A Motivating Example
Spark and MapR Streams: A Motivating ExampleSpark and MapR Streams: A Motivating Example
Spark and MapR Streams: A Motivating ExampleIan Downard
 
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...Mathieu Dumoulin
 
Storm Users Group Real Time Hadoop
Storm Users Group Real Time HadoopStorm Users Group Real Time Hadoop
Storm Users Group Real Time HadoopMapR Technologies
 
Storm users group real time hadoop
Storm users group real time hadoopStorm users group real time hadoop
Storm users group real time hadoopTed Dunning
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
 
Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Carol McDonald
 

Ähnlich wie Boston hug-2012-07 (20)

Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012
 
New directions for mahout
New directions for mahoutNew directions for mahout
New directions for mahout
 
Graphlab Ted Dunning Clustering
Graphlab Ted Dunning  ClusteringGraphlab Ted Dunning  Clustering
Graphlab Ted Dunning Clustering
 
CMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceCMU Lecture on Hadoop Performance
CMU Lecture on Hadoop Performance
 
London Data Science - Super-Fast Clustering Report
London Data Science - Super-Fast Clustering ReportLondon Data Science - Super-Fast Clustering Report
London Data Science - Super-Fast Clustering Report
 
What is the past future tense of data?
What is the past future tense of data?What is the past future tense of data?
What is the past future tense of data?
 
predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21
 
News From Mahout
News From MahoutNews From Mahout
News From Mahout
 
Predictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural NetworksPredictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural Networks
 
Cmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop PerformanceCmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop Performance
 
Predictive Analytics San Diego
Predictive Analytics San DiegoPredictive Analytics San Diego
Predictive Analytics San Diego
 
The Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a MissionThe Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a Mission
 
The power of hadoop in business
The power of hadoop in businessThe power of hadoop in business
The power of hadoop in business
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
 
Spark and MapR Streams: A Motivating Example
Spark and MapR Streams: A Motivating ExampleSpark and MapR Streams: A Motivating Example
Spark and MapR Streams: A Motivating Example
 
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
Converged and Containerized Distributed Deep Learning With TensorFlow and Kub...
 
Storm Users Group Real Time Hadoop
Storm Users Group Real Time HadoopStorm Users Group Real Time Hadoop
Storm Users Group Real Time Hadoop
 
Storm users group real time hadoop
Storm users group real time hadoopStorm users group real time hadoop
Storm users group real time hadoop
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 
Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1
 

Mehr von Ted Dunning

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxTed Dunning
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with KubernetesTed Dunning
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in KubernetesTed Dunning
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forTed Dunning
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningTed Dunning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning LogisticsTed Dunning
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTed Dunning
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logisticsTed Dunning
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real DataTed Dunning
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteTed Dunning
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoopTed Dunning
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Ted Dunning
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data SecurelyTed Dunning
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeTed Dunning
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownTed Dunning
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopTed Dunning
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015Ted Dunning
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossibleTed Dunning
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningTed Dunning
 

Mehr von Ted Dunning (20)

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptx
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with Kubernetes
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in Kubernetes
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine Learning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning Logistics
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworks
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logistics
 
T digest-update
T digest-updateT digest-update
T digest-update
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real Data
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC Keynote
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoop
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data Securely
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossible
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine Learning
 

Kürzlich hochgeladen

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 

Kürzlich hochgeladen (20)

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 

Boston hug-2012-07

  • 1. Mahout, New and Improved Now with Super Fast Clustering ©MapR Technologies - Confidential 1
  • 2. Agenda  What happened in Mahout 0.7 – less bloat – simpler structure – general cleanup ©MapR Technologies - Confidential 2
  • 3. To Cut Out Bloat ©MapR Technologies - Confidential 3
  • 4. ©MapR Technologies - Confidential 4
  • 5. Bloat is Leaving in 0.7  Lots of abandoned code in Mahout – average code quality is poor – no users – no maintainers – why do we care?  Examples – old LDA – old Naïve Bayes – genetic algorithms  If you care, get on the mailing list ©MapR Technologies - Confidential 5
  • 6. Bloat is Leaving in 0.7  Lots of abandoned code in Mahout – average code quality is poor – no users – no maintainers – why do we care?  Examples – old LDA – old Naïve Bayes – genetic algorithms  If you care, get on the mailing list – oops, too late since 0.7 is already released ©MapR Technologies - Confidential 6
  • 7. Integration of Collections ©MapR Technologies - Confidential 7
  • 8. Nobody Cares about Collections  We need it, math is built on it  Pull it into math  Broke the build (battle of the code expanders)  Fixed now (thanks to Grant) ©MapR Technologies - Confidential 8
  • 10. What is it?  Supports access to Mahout functionality from Pig  So far -- text vectorization  And classification  And model saving ©MapR Technologies - Confidential 10
  • 11. What is it?  Supports Pig access to Mahout functions  So far text vectorization  And classification  And model saving  Kind of works (see pigML from twitter for better function) ©MapR Technologies - Confidential 11
  • 12. Compile and Install  Start by compiling and installing mahout in your local repository: cd ~/Apache git clone https://github.com/apache/mahout.git cd mahout mvn install -DskipTests  Then do the same with pig-vector cd ~/Apache git clone git@github.com:tdunning/pig-vector.git cd pig-vector mvn package ©MapR Technologies - Confidential 12
  • 13. Tokenize and Vectorize Text  Tokenized is done using a text encoder – the dimension of the resulting vectors (typically 100,000-1,000,000 – a description of the variables to be included in the encoding – the schema of the tuples that pig will pass together with their data types  Example: define EncodeVector org.apache.mahout.pig.encoders.EncodeVector ('10','x+y+1', 'x:numeric, y:word, z:text');  You can also add a Lucene 3.1 analyzer in parentheses if you want something fancier ©MapR Technologies - Confidential 13
  • 14. The Formula  Not normal arithmetic  Describes which variables to use, whether offset is included  Also describes which interactions to use ©MapR Technologies - Confidential 14
  • 15. The Formula  Not normal arithmetic  Describes which variables to use, whether offset is included  Also describes which interactions to use – but that doesn’t do anything yet! ©MapR Technologies - Confidential 15
  • 16. Load and Encode Data  Load the data a = load '/Users/tdunning/Downloads/NNBench.csv' using PigStorage(',') as (x1:int, x2:int, x3:int);  And encode it b = foreach a generate 1 as key, EncodeVector(*) as v;  Note that the true meaning of * is very subtle  Now store it store b into 'vectors.dat' using com.twitter.elephantbird.pig.store.SequenceFileStorage ( '-c com.twitter.elephantbird.pig.util.IntWritableConverter’, '-c com.twitter.elephantbird.pig.util.GenericWritableConverter -t org.apache.mahout.math.VectorWritable’); ©MapR Technologies - Confidential 16
  • 17. Train a Model  Pass previously encoded data to a sequential model trainer define train org.apache.mahout.pig.LogisticRegression( 'iterations=5, inMemory=true, features=100000, categories=alt.atheism comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt soc.religion.christian talk.religion.misc');  Note that the argument is a string with its own syntax ©MapR Technologies - Confidential 17
  • 18. Reservations and Qualms  Pig-vector isn’t done  And it is ugly  And it doesn’t quite work  And it is hard to build  But there seems to be promise ©MapR Technologies - Confidential 18
  • 19. Potential  Add Naïve Bayes Model?  Somehow simplify the syntax?  Try a recent version of elephant-bird?  Switch to pigML? ©MapR Technologies - Confidential 19
  • 20. Large-scale k-Means Clustering ©MapR Technologies - Confidential 20
  • 21. Goals  Cluster very large data sets  Facilitate large nearest neighbor search  Allow very large number of clusters  Achieve good quality – low average distance to nearest centroid on held-out data  Based on Mahout Math  Runs on Hadoop (really MapR) cluster  FAST – cluster tens of millions in minutes ©MapR Technologies - Confidential 21
  • 22. Non-goals  Use map-reduce (but it is there)  Minimize the number of clusters  Support metrics other than L2 ©MapR Technologies - Confidential 22
  • 23. Anti-goals  Multiple passes over original data  Scale as O(k n) ©MapR Technologies - Confidential 23
  • 24. Why? ©MapR Technologies - Confidential 24
  • 25. K-nearest Neighbor with Super Fast k-means ©MapR Technologies - Confidential 25
  • 26. What’s that?  Find the k nearest training examples  Use the average value of the target variable from them  This is easy … but hard – easy because it is so conceptually simple and you have few knobs to turn or models to build – hard because of the stunning amount of math – also hard because we need top 50,000 results, not just single nearest  Initial prototype was massively too slow – 3K queries x 200K examples takes hours – needed 20M x 25M in the same time ©MapR Technologies - Confidential 26
  • 27. Modeling with k-nearest Neighbors a b c ©MapR Technologies - Confidential 27
  • 28. Subject to Some Limits ©MapR Technologies - Confidential 28
  • 29. Log Transform Improves Things ©MapR Technologies - Confidential 29
  • 30. Neighbors Depend on Good Presentation ©MapR Technologies - Confidential 30
  • 31. How We Did It  2 week hackathon with 6 developers from MapR customer  Agile-ish development  To avoid IP issues – all code is Apache Licensed (no ownership question) – all data is synthetic (no question of private data) – all development done on individual machines, hosting on Github – open is easier than closed (in this case)  Goal is new open technology to facilitate new closed solutions  Ambitious goal of ~ 1,000,000 x speedup ©MapR Technologies - Confidential 31
  • 32. How We Did It  2 week hackathon with 6 developers from customer bank  Agile-ish development  To avoid IP issues – all code is Apache Licensed (no ownership question) – all data is synthetic (no question of private data) – all development done on individual machines, hosting on Github – open is easier than closed (in this case)  Goal is new open technology to facilitate new closed solutions  Ambitious goal of ~ 1,000,000 x speedup – well, really only 100-1000x after basic hygiene ©MapR Technologies - Confidential 32
  • 33. What We Did  Mechanism for extending Mahout Vectors – DelegatingVector, WeightedVector, Centroid  Shared memory matrix – FileBasedMatrix uses mmap to share very large dense matrices  Searcher interface – Brute, ProjectionSearch, KmeansSearch, LshSearch  Super-fast clustering – Kmeans, StreamingKmeans ©MapR Technologies - Confidential 33
  • 34. Projection Search java.lang.TreeSet! ©MapR Technologies - Confidential 34
  • 35. Projection Search  Projection onto a line provides a total order on data  Nearby points stay nearby  Some other points also wind up close  Search points just before or just after the query point ©MapR Technologies - Confidential 35
  • 36. How Many Projections? ©MapR Technologies - Confidential 36
  • 37. K-means Search  Simple Idea – pre-cluster the data – to find the nearest points, search the nearest clusters  Recursive application – to search a cluster, use a Searcher! ©MapR Technologies - Confidential 37
  • 38. ©MapR Technologies - Confidential 38
  • 39. x ©MapR Technologies - Confidential 39
  • 40. ©MapR Technologies - Confidential 40
  • 41. ©MapR Technologies - Confidential 41
  • 42. x ©MapR Technologies - Confidential 42
  • 43. But This Requires k-means!  Need a new k-means algorithm to get speed – Hadoop is very slow at iterative map-reduce – Maybe Pregel clones like Giraph would be better – Or maybe not  Streaming k-means is – One pass (through the original data) – Very fast (20 us per data point with threads on one node) – Very parallelizable ©MapR Technologies - Confidential 43
  • 44. Basic Method  Use a single pass of k-means with very many clusters – output is a bad-ish clustering but a good surrogate  Use weighted centroids from step 1 to do in-memory clustering – output is a good clustering with fewer clusters ©MapR Technologies - Confidential 44
  • 45. Algorithmic Details Foreach data point xn compute distance to nearest centroid, ∂ sample u, if u > ∂/ß add to nearest centroid else create new centroid if number of centroids > k log n recursively cluster centroids set ß = 1.5 ß if number of centroids did not decrease ©MapR Technologies - Confidential 45
  • 46. How It Works  Result is large set of centroids – these provide approximation of original distribution – we can cluster centroids to get a close approximation of clustering original – or we can just use the result directly ©MapR Technologies - Confidential 46
  • 47. Parallel Speedup? 200 Non- threaded ✓ 100 2 Tim e per point (μs) Threaded version 3 50 4 40 6 5 8 30 10 14 12 20 Perfect Scaling 16 10 1 2 3 4 5 20 Threads ©MapR Technologies - Confidential 47
  • 48. Warning, Recursive Descent  Inner loop requires finding nearest centroid  With lots of centroids, this is slow  But wait, we have classes to accelerate that! ©MapR Technologies - Confidential 48
  • 49. Warning, Recursive Descent  Inner loop requires finding nearest centroid  With lots of centroids, this is slow  But wait, we have classes to accelerate that! (Let’s not use k-means searcher, though) ©MapR Technologies - Confidential 49
  • 50. Warning, Recursive Descent  Inner loop requires finding nearest centroid  With lots of centroids, this is slow  But wait, we have classes to accelerate that! (Let’s not use k-means searcher, though)  Empirically, projection search beats 64 bit LSH by a bit – More optimization may change this story ©MapR Technologies - Confidential 50
  • 51. Moving to Ultra Mega Super Scale  Map-reduce implementation nearly trivial  Map: rough-cluster input data, output ß, weighted centroids  Reduce: – single reducer gets all centroids – if too many centroids, merge using recursive clustering – optionally do final clustering in-memory  Combiner possible, but not important ©MapR Technologies - Confidential 51
  • 52. Contact: – tdunning@maprtech.com – @ted_dunning  Slides and such: – http://info.mapr.com/ted-boston-2012-07 Hash tags: #boston-hug #mahout #mapr ©MapR Technologies - Confidential 52