SlideShare a Scribd company logo
1 of 21
Scalable Machine Learning with
                   Hadoop (most of the time)

                   Grant Ingersoll
                   Chief Scientist

                   October 2, 2012




© Copyright 2012
Anyone Here Use Machine Learning?

• Any users of:
  • Google?
     • Search
     • Translation
     • Priority Inbox

  • Facebook?
                               Google Translate

  • Twitter?

  • LinkedIn?
Proprietary
© 2012 LucidWorks
Topics

    • What is scalable machine learning?

    • Use Cases

    • Approaches
      • Hadoop-based
      • Alternatives

    • What is Apache Mahout?


    Proprietary
3   © 2012 LucidWorks
Machine Learning



  • “Machine Learning is programming computers to
     optimize a performance criterion using example data
     or past experience”
     • Intro. To Machine Learning by E. Alpaydin


  • Lots of related fields:
     • Information Retrieval
     • Stats
     • Biology
     • Linear algebra
     • Many more
Proprietary
© 2012 LucidWorks
What does scalable mean for us?


  • As data grows linearly, either scale linearly in time or in
     machines
     • 2X data requires 2X time or 2X machines (or less!)
  • Goal: Be as fast and efficient as possible given the
     intrinsic design of the algorithm
     • Some algorithms won’t scale to massive machine clusters
     • Others fit logically on a Map Reduce framework like Apache
       Hadoop
     • Still others will need different distributed programming models
     • Be pragmatic




Proprietary
© 2012 LucidWorks
Common Use Cases




                    http://www.readwriteweb.com/archives/linkedin_plots_your_profession
                    al_network_with_inma.php




Proprietary
© 2012 LucidWorks
My Use Cases


                                             Search




                                            Relevance
                                       Recommendations
                                         Related Items
                                    Content/User Classification
                                             Phrases
                                              Topics
                        Analytics                                 Discovery




    Proprietary
7   © 2012 LucidWorks
Scalable Approaches

    • Mind the Gap
      • Algorithms are the fun stuff, but you’ll spend more time on
        ETL, feature selection and post-processing
      • Simpler is usually better at scale

    1. Scale Data Pipeline -> Sample -> Sequential

    2. Hadoop

    3. Ensemble (distribute many sequential models)

    4. Spark, MPI & BSP, Others

    Proprietary
8   © 2012 LucidWorks
Open Source Machine Learning Libraries


    • Apache Mahout

    • Vowpal Wabbit

    • R Stats Project

    • Weka

    • LibSVM, SVMLight

    • Many, many more




    Proprietary
9   © 2012 LucidWorks
Apache Mahout


• An Apache Software Foundation project to
  create scalable machine learning libraries
  under the Apache Software License
  • http://mahout.apache.org
• Why Mahout?
  • Many Open Source ML libraries are either:
     • Lack Community
     • Lack Documentation and Examples
     • Lack Scalability
     • Lack the Apache License
     • Or are research-oriented
                          http://dictionary.reference.com/browse/mahout
Proprietary
© 2012 LucidWorks
Who uses Mahout?




 https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
Proprietary
© 2012 LucidWorks
What Can I do with Mahout Right Now?

                    3 “C”s + Extras




Proprietary
© 2012 LucidWorks
Collaborative Filtering

• Recommender Approaches
  • User based
  • Item based

• Online and Offline support
  • Offline can utilize Hadoop

• Many different Similarity measures
  • Cosine, LLR, Tanimoto, Pearson, others



Proprietary
© 2012 LucidWorks
Hadoop Recommenders

     • Alternating Least Squares
          • Iterative, but scales well
          • Deals well with sparseness
          • “Large-scale Parallel Collaborative Filtering for the Netflix
            Prize” by Zhou et. al
          • https://cwiki.apache.org/MAHOUT/collaborative-filtering-with-
            als-wr.html

     • Slope One
       • Simple yet effective

     • Pseudo
       • Distribute sequential approach across Hadoop nodes


     Proprietary
14   © 2012 LucidWorks
Clustering

• Document level
  • Group documents based
    on a notion of similarity
  • K-Means, Fuzzy K-
    Means, Dirichlet, Canopy,
    Mean-Shift, Spectral, Top-
    Down                                  • Topic Modeling
  • Pluggable Distance                     • Cluster words across
    Measures                                 documents to identify topics
                                           • Latent Dirichlet Allocation
                                             • Using Collapsed
                                               Variational Bayes

Proprietary
                    http://carrotsearch.com/foamtree-overview.html
© 2012 LucidWorks
Clustering In Hadoop

• Many people start with K-Means, but others can be
  more effective



• Challenges
  • Iterative nature of many clustering algorithms can be slow

  • Distance measures and other factors can have dramatic
     impact on performance and quality

  • When in doubt, experiment


Proprietary
© 2012 LucidWorks
Classification                    “This gives a raw
                                  classification rate
                                  requirement of tens of
                                  millions of
• Place new items into            classifications per
  predefined categories           second, which is, as
                                  they say in the old
• Online and Offline              country, a lot.”

  supported                       “Mahout in Action”
                                  http://awe.sm/5FyNe
• Hadoop                  • Sequential
  • Naïve Bayes            • Logistic Regression
  • Complementary Naïve
    Bayes                    • Stochastic Grad.
  • Decision Forests          Descent
  • Clustering-based       • Hidden Markov Model
                           • Winnow/Perceptron

Proprietary
© 2012 LucidWorks
Scaling Mahout Classification
                                 Client (Train/Test/Classify)



                                   LucidWorks Big Data


                                    Classifier   Model                 Model
                                                                ...
                                    Node 1       A                     N



                    Zookeeper                           ...

                                    Classifier   Model                 Model
                                                                ...
                                    Node N       C                     Q




                         Model                                Model
                         Model            HDFS                Model




                         Offline (Map/Reduce?) Training
Proprietary
© 2012 LucidWorks
Other Mahout Features

• Apache Licensed:
  • Primitive Collections!
  • Extensive Math library
     • Vectors, Matrices, Statistics, etc.
     • Vector Encoding options

• Singular Value Decomposition

• Frequent Pattern Mining

• Collocations (statistically interesting phrases)

• I/O: Lucene, Cassandra, MongoDB and others
Proprietary
© 2012 LucidWorks
What’s Next for Mahout?

     • Streaming K-Means

     • Map/Reduce Training for HMM?

     • Clean Up towards 1.0 release

     • 1.0?




     Proprietary
20   © 2012 LucidWorks
Resources

     • http://www.lucidworks.com


     • grant@lucidworks.com
     • @gsingers




     Proprietary
21   © 2012 LucidWorks

More Related Content

What's hot

Stockage des données : quel système pour quel usage ?
Stockage des données : quel système pour quel usage ?Stockage des données : quel système pour quel usage ?
Stockage des données : quel système pour quel usage ?Zouheir Cadi
 
Machine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and FutureMachine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and FutureData Science London
 
Toronto jaspersoft meetup
Toronto jaspersoft meetupToronto jaspersoft meetup
Toronto jaspersoft meetupPatrick McFadin
 
Hortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics ApplicationsHortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics Applicationsrussell_jurney
 
Thousands of Threads and Blocking I/O
Thousands of Threads and Blocking I/OThousands of Threads and Blocking I/O
Thousands of Threads and Blocking I/OGeorge Cao
 
Recommendations in Drupal (Drupal DevDays Barcelona 2012)
Recommendations in Drupal (Drupal DevDays Barcelona 2012)Recommendations in Drupal (Drupal DevDays Barcelona 2012)
Recommendations in Drupal (Drupal DevDays Barcelona 2012)Klokie Grossfeld
 

What's hot (6)

Stockage des données : quel système pour quel usage ?
Stockage des données : quel système pour quel usage ?Stockage des données : quel système pour quel usage ?
Stockage des données : quel système pour quel usage ?
 
Machine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and FutureMachine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and Future
 
Toronto jaspersoft meetup
Toronto jaspersoft meetupToronto jaspersoft meetup
Toronto jaspersoft meetup
 
Hortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics ApplicationsHortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics Applications
 
Thousands of Threads and Blocking I/O
Thousands of Threads and Blocking I/OThousands of Threads and Blocking I/O
Thousands of Threads and Blocking I/O
 
Recommendations in Drupal (Drupal DevDays Barcelona 2012)
Recommendations in Drupal (Drupal DevDays Barcelona 2012)Recommendations in Drupal (Drupal DevDays Barcelona 2012)
Recommendations in Drupal (Drupal DevDays Barcelona 2012)
 

Similar to Scalable Machine Learning with Hadoop

Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Andrew Brust
 
Leveraging Solr and Mahout
Leveraging Solr and MahoutLeveraging Solr and Mahout
Leveraging Solr and MahoutGrant Ingersoll
 
Model-Driven Cloud Data Storage
Model-Driven Cloud Data StorageModel-Driven Cloud Data Storage
Model-Driven Cloud Data Storagejccastrejon
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSatish Mohan
 
Large scale computing
Large scale computing Large scale computing
Large scale computing Bhupesh Bansal
 
Big Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersBig Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersAdaryl "Bob" Wakefield, MBA
 
Hadoop and Machine Learning
Hadoop and Machine LearningHadoop and Machine Learning
Hadoop and Machine Learningjoshwills
 
Machine Learning and Hadoop: Present and future
Machine Learning and Hadoop: Present and futureMachine Learning and Hadoop: Present and future
Machine Learning and Hadoop: Present and futureCloudera, Inc.
 
Using mruby in the nosql database Avocadodb
Using mruby in the nosql database AvocadodbUsing mruby in the nosql database Avocadodb
Using mruby in the nosql database Avocadodbavocadodb
 
Lviv EDGE 2 - NoSQL
Lviv EDGE 2 - NoSQLLviv EDGE 2 - NoSQL
Lviv EDGE 2 - NoSQLzenyk
 
Java scalability considerations yogesh deshpande
Java scalability considerations   yogesh deshpandeJava scalability considerations   yogesh deshpande
Java scalability considerations yogesh deshpandeIndicThreads
 
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveJan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveYahoo Developer Network
 
Big Data Paris : Hadoop and NoSQL
Big Data Paris : Hadoop and NoSQLBig Data Paris : Hadoop and NoSQL
Big Data Paris : Hadoop and NoSQLTugdual Grall
 
Apache Cayenne: a Java ORM Alternative
Apache Cayenne: a Java ORM AlternativeApache Cayenne: a Java ORM Alternative
Apache Cayenne: a Java ORM AlternativeAndrus Adamchik
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Dataconomy Media
 

Similar to Scalable Machine Learning with Hadoop (20)

Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World
 
Leveraging Solr and Mahout
Leveraging Solr and MahoutLeveraging Solr and Mahout
Leveraging Solr and Mahout
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Model-Driven Cloud Data Storage
Model-Driven Cloud Data StorageModel-Driven Cloud Data Storage
Model-Driven Cloud Data Storage
 
50 Shades of SQL
50 Shades of SQL50 Shades of SQL
50 Shades of SQL
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform Concept
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Big Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersBig Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R Users
 
Hadoop and Machine Learning
Hadoop and Machine LearningHadoop and Machine Learning
Hadoop and Machine Learning
 
Machine Learning and Hadoop: Present and future
Machine Learning and Hadoop: Present and futureMachine Learning and Hadoop: Present and future
Machine Learning and Hadoop: Present and future
 
Using mruby in the nosql database Avocadodb
Using mruby in the nosql database AvocadodbUsing mruby in the nosql database Avocadodb
Using mruby in the nosql database Avocadodb
 
Lviv EDGE 2 - NoSQL
Lviv EDGE 2 - NoSQLLviv EDGE 2 - NoSQL
Lviv EDGE 2 - NoSQL
 
Java scalability considerations yogesh deshpande
Java scalability considerations   yogesh deshpandeJava scalability considerations   yogesh deshpande
Java scalability considerations yogesh deshpande
 
Cloud patterns
Cloud patternsCloud patterns
Cloud patterns
 
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and HiveJan 2013 HUG: Cloud-Friendly Hadoop and Hive
Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
 
Big Data Paris : Hadoop and NoSQL
Big Data Paris : Hadoop and NoSQLBig Data Paris : Hadoop and NoSQL
Big Data Paris : Hadoop and NoSQL
 
Apache Cayenne: a Java ORM Alternative
Apache Cayenne: a Java ORM AlternativeApache Cayenne: a Java ORM Alternative
Apache Cayenne: a Java ORM Alternative
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
 

More from Grant Ingersoll

This Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineThis Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineGrant Ingersoll
 
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Grant Ingersoll
 
Crowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopCrowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopGrant Ingersoll
 
What's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xWhat's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xGrant Ingersoll
 
Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionLarge Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionGrant Ingersoll
 
OpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemOpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemGrant Ingersoll
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrGrant Ingersoll
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrGrant Ingersoll
 
Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Grant Ingersoll
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsGrant Ingersoll
 
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopGrant Ingersoll
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and SolrGrant Ingersoll
 
Apache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantApache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantGrant Ingersoll
 
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsIntelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsGrant Ingersoll
 
TriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopTriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopGrant Ingersoll
 

More from Grant Ingersoll (20)

Solr for Data Science
Solr for Data ScienceSolr for Data Science
Solr for Data Science
 
This Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineThis Ain't Your Parent's Search Engine
This Ain't Your Parent's Search Engine
 
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4
 
Intro to Search
Intro to SearchIntro to Search
Intro to Search
 
Open Source Search FTW
Open Source Search FTWOpen Source Search FTW
Open Source Search FTW
 
Crowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopCrowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and Hadoop
 
What's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xWhat's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.x
 
Taming Text
Taming TextTaming Text
Taming Text
 
Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionLarge Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in Action
 
Apache Lucene 4
Apache Lucene 4Apache Lucene 4
Apache Lucene 4
 
OpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemOpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene Ecosystem
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
 
Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Bet you didn't know Lucene can...
Bet you didn't know Lucene can...
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
 
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC Hadoop
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
 
Apache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantApache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow Elephant
 
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsIntelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
 
TriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopTriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr Hadoop
 

Scalable Machine Learning with Hadoop

  • 1. Scalable Machine Learning with Hadoop (most of the time) Grant Ingersoll Chief Scientist October 2, 2012 © Copyright 2012
  • 2. Anyone Here Use Machine Learning? • Any users of: • Google? • Search • Translation • Priority Inbox • Facebook? Google Translate • Twitter? • LinkedIn? Proprietary © 2012 LucidWorks
  • 3. Topics • What is scalable machine learning? • Use Cases • Approaches • Hadoop-based • Alternatives • What is Apache Mahout? Proprietary 3 © 2012 LucidWorks
  • 4. Machine Learning • “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” • Intro. To Machine Learning by E. Alpaydin • Lots of related fields: • Information Retrieval • Stats • Biology • Linear algebra • Many more Proprietary © 2012 LucidWorks
  • 5. What does scalable mean for us? • As data grows linearly, either scale linearly in time or in machines • 2X data requires 2X time or 2X machines (or less!) • Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm • Some algorithms won’t scale to massive machine clusters • Others fit logically on a Map Reduce framework like Apache Hadoop • Still others will need different distributed programming models • Be pragmatic Proprietary © 2012 LucidWorks
  • 6. Common Use Cases http://www.readwriteweb.com/archives/linkedin_plots_your_profession al_network_with_inma.php Proprietary © 2012 LucidWorks
  • 7. My Use Cases Search Relevance Recommendations Related Items Content/User Classification Phrases Topics Analytics Discovery Proprietary 7 © 2012 LucidWorks
  • 8. Scalable Approaches • Mind the Gap • Algorithms are the fun stuff, but you’ll spend more time on ETL, feature selection and post-processing • Simpler is usually better at scale 1. Scale Data Pipeline -> Sample -> Sequential 2. Hadoop 3. Ensemble (distribute many sequential models) 4. Spark, MPI & BSP, Others Proprietary 8 © 2012 LucidWorks
  • 9. Open Source Machine Learning Libraries • Apache Mahout • Vowpal Wabbit • R Stats Project • Weka • LibSVM, SVMLight • Many, many more Proprietary 9 © 2012 LucidWorks
  • 10. Apache Mahout • An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License • http://mahout.apache.org • Why Mahout? • Many Open Source ML libraries are either: • Lack Community • Lack Documentation and Examples • Lack Scalability • Lack the Apache License • Or are research-oriented http://dictionary.reference.com/browse/mahout Proprietary © 2012 LucidWorks
  • 11. Who uses Mahout? https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout Proprietary © 2012 LucidWorks
  • 12. What Can I do with Mahout Right Now? 3 “C”s + Extras Proprietary © 2012 LucidWorks
  • 13. Collaborative Filtering • Recommender Approaches • User based • Item based • Online and Offline support • Offline can utilize Hadoop • Many different Similarity measures • Cosine, LLR, Tanimoto, Pearson, others Proprietary © 2012 LucidWorks
  • 14. Hadoop Recommenders • Alternating Least Squares • Iterative, but scales well • Deals well with sparseness • “Large-scale Parallel Collaborative Filtering for the Netflix Prize” by Zhou et. al • https://cwiki.apache.org/MAHOUT/collaborative-filtering-with- als-wr.html • Slope One • Simple yet effective • Pseudo • Distribute sequential approach across Hadoop nodes Proprietary 14 © 2012 LucidWorks
  • 15. Clustering • Document level • Group documents based on a notion of similarity • K-Means, Fuzzy K- Means, Dirichlet, Canopy, Mean-Shift, Spectral, Top- Down • Topic Modeling • Pluggable Distance • Cluster words across Measures documents to identify topics • Latent Dirichlet Allocation • Using Collapsed Variational Bayes Proprietary http://carrotsearch.com/foamtree-overview.html © 2012 LucidWorks
  • 16. Clustering In Hadoop • Many people start with K-Means, but others can be more effective • Challenges • Iterative nature of many clustering algorithms can be slow • Distance measures and other factors can have dramatic impact on performance and quality • When in doubt, experiment Proprietary © 2012 LucidWorks
  • 17. Classification “This gives a raw classification rate requirement of tens of millions of • Place new items into classifications per predefined categories second, which is, as they say in the old • Online and Offline country, a lot.” supported “Mahout in Action” http://awe.sm/5FyNe • Hadoop • Sequential • Naïve Bayes • Logistic Regression • Complementary Naïve Bayes • Stochastic Grad. • Decision Forests Descent • Clustering-based • Hidden Markov Model • Winnow/Perceptron Proprietary © 2012 LucidWorks
  • 18. Scaling Mahout Classification Client (Train/Test/Classify) LucidWorks Big Data Classifier Model Model ... Node 1 A N Zookeeper ... Classifier Model Model ... Node N C Q Model Model Model HDFS Model Offline (Map/Reduce?) Training Proprietary © 2012 LucidWorks
  • 19. Other Mahout Features • Apache Licensed: • Primitive Collections! • Extensive Math library • Vectors, Matrices, Statistics, etc. • Vector Encoding options • Singular Value Decomposition • Frequent Pattern Mining • Collocations (statistically interesting phrases) • I/O: Lucene, Cassandra, MongoDB and others Proprietary © 2012 LucidWorks
  • 20. What’s Next for Mahout? • Streaming K-Means • Map/Reduce Training for HMM? • Clean Up towards 1.0 release • 1.0? Proprietary 20 © 2012 LucidWorks
  • 21. Resources • http://www.lucidworks.com • grant@lucidworks.com • @gsingers Proprietary 21 © 2012 LucidWorks

Editor's Notes

  1. A few things come to mind