KNITTING BOAR
    Machine Learning, Mahout, and Parallel Iterative Algorithms




    Josh Patterson
    Principal Solutions Architect




✛ Josh Patterson
   > Master’s Thesis: self-organizing mesh networks
       ∗   Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
   > Conceived, built, and led Hadoop integration for openPDC project
      at Tennessee Valley Authority (TVA)
   > Twitter: @jpatanooga

   > Email:    josh@floe.tv
✛ Introduction to Machine Learning
✛ Mahout
✛ Knitting Boar and YARN
✛ Parting Thoughts
Introduction to
    MACHINE LEARNING




✛ What is Data Mining?
  > “the process of extracting patterns from data”
✛ Why are we interested in Data Mining?
  > Raw data essentially useless
      ∗ Data is simply recorded facts
      ∗ Information is the patterns underlying the data

✛ Machine Learning
  > Algorithms for acquiring structural descriptions from
    data “examples”
      ∗ Process of learning “concepts”
✛ Information Retrieval
   > information science, information architecture,
     cognitive psychology, linguistics, and statistics.
✛ Natural Language Processing
  > grounded in machine learning, especially statistical
    machine learning
✛ Statistics
  > Math and stuff
✛ Machine Learning
  > Considered a branch of artificial intelligence
✛ ETL
✛ Joining multiple disparate data sources
✛ Filtering data
✛ Aggregation
✛ Cube materialization



        “Descriptive Statistics”
✛ Data collection performed with Flume
✛ Data cleansing / ETL performed with Hive
  or Pig
✛ ML work performed with
  >   SAS
  >   SPSS
  >   R
  >   Mahout
Introduction to
    MAHOUT
✛ Classification
   > “Fraud detection”
 ✛ Recommendation
   > “Collaborative
     Filtering”
 ✛ Clustering
   > “Segmentation”
 ✛ Frequent Itemset
     Mining


Copyright 2010 Cloudera Inc. All rights reserved
✛ Stochastic Gradient Descent
   > Single process
   > Logistic Regression Model Construction
 ✛ Naïve Bayes
   > MapReduce-based
   > Text Classification
 ✛ Random Forests
   > MapReduce-based




✛ An algorithm that looks at a user’s past actions
  and suggests
   > Products
   > Services
   > People
✛ Advertisement
  > Cloudera has a great Data Science training course on
    this topic
  > http://university.cloudera.com/training/data_science/introduction_to_data_science_-_building_recommender_systems.html
✛ Cluster words across docs to identify topics
✛ Latent Dirichlet Allocation
✛   Why Machine Learning?
    >   Growing interest in predictive modeling

✛   Linear Models are Simple, Useful
    >   Stochastic Gradient Descent is a very popular tool for
        building linear models like Logistic Regression

✛   Building Models Is Still Time-Consuming
    >   The “Need for speed”
    >   “More data beats a cleverer algorithm”
Introducing
KNITTING BOAR




✛ Parallelize Mahout’s Stochastic Gradient Descent
  >   With as few extra dependencies as possible

✛ Wanted to explore parallel iterative algorithms
   using YARN
  >   Wanted a first-class Hadoop YARN citizen
  >   Work through dev progressions towards a stable state
  >   Worry about “frameworks” later
✛ We Need
     > Hypothesis about data
     > Cost function
     > Update function



✛ Basic Algorithm:




     Andrew Ng’s Tutorial:
     https://class.coursera.org/ml/lecture/preview_view/11
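
The figure for the basic algorithm did not survive in this transcript. As a hedged stand-in, the standard textbook form of gradient descent for logistic regression is written out below. The symbols (θ for the parameter vector, α for the learning rate, J for the cost, m for the number of examples) are assumed notation, not taken from the slide.

```latex
% Update function: repeat until convergence, updating every \theta_j simultaneously
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

% Hypothesis about the data (logistic regression)
h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}}

% Cost function over m training examples (x^{(i)}, y^{(i)})
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
  \left[ y^{(i)} \log h_\theta(x^{(i)})
       + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]
```

The three formulas line up with the three needs listed above: a hypothesis, a cost function, and an update function.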

✛ Training
    > Simple gradient descent
      procedure
    > Loss function needs to be
      convex
 ✛ Prediction
    > Logistic Regression:
        ∗ Sigmoid function using
          parameter vector (dot)
          example as exponential
          parameter
[Diagram: Training Data → SGD → Model]
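
A minimal sketch of the training and prediction steps described above, in plain Python. This is not Mahout's OnlineLogisticRegression API: the function names (`sigmoid`, `predict`, `sgd_train`), the learning rate, the number of passes, and the toy data are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(weights, example):
    """Probability of the positive class: sigmoid(weights . example)."""
    return sigmoid(sum(w * x for w, x in zip(weights, example)))


def sgd_train(examples, labels, learning_rate=0.1, passes=50, seed=0):
    """Train weights with one gradient step per example (stochastic gradient descent)."""
    rng = random.Random(seed)
    weights = [0.0] * len(examples[0])
    order = list(range(len(examples)))
    for _ in range(passes):
        rng.shuffle(order)
        for i in order:
            error = labels[i] - predict(weights, examples[i])
            # Gradient step for the log loss on a single example.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, examples[i])]
    return weights


# Toy data (made up): first feature is a constant bias term, label is 1 when x > 0.
examples = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
labels = [0, 0, 1, 1]
model = sgd_train(examples, labels)
```

Calling `predict(model, [1.0, 2.0])` gives the estimated probability that a new example belongs to the positive class. The log loss used here is convex, which is the property the training bullet asks for.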


Current Limitations
 ✛ Sequential algorithms on a single node only
   go so far
 ✛ The “Data Deluge”
     > Presents algorithmic challenges when combined with
       large data sets
     > Need to design algorithms that are able to perform in
       a distributed fashion


 ✛ MapReduce only fits certain types of algorithms



Distributed Learning Strategies
 ✛ Langford, 2007
    > Vowpal Wabbit
 ✛ McDonald, 2010
   > Distributed Training Strategies for the Structured
     Perceptron




[Diagram: MapReduce (Input → Map, Map, Map → Reduce, Reduce → Output) shown beside a parallel iterative model in which a row of Processors runs Superstep 1, then Superstep 2, and so on]

“Are the gains gotten from using X worth the
     integration costs incurred in building the end-to-
     end solution?

     If no, then operationally, we can consider the
     Hadoop stack …

     there are substantial costs in knitting together a
     patchwork of different frameworks, programming
     models, etc.”
     –– Lin, 2012



✛ Parallel Iterative implementation of SGD on
     YARN

 ✛ Workers
   > work on partitions of the data
   > Stay active over supersteps
 ✛ Master
   > Performs superstep
   > Averages parameter vector


✛ Collects all parameter vectors at each pass /
   superstep
 ✛ Produces new global parameter vector
     > By averaging workers’ vectors
 ✛ Sends update to all workers
   > Workers replace local parameter vector with new
     global parameter vector
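
The master's role described above can be sketched in a few lines of plain Python. This is an illustration only, not Knitting Boar's or IterativeReduce's actual code; `average_parameter_vectors` and `master_superstep` are hypothetical names, and parameter vectors are represented as plain lists of floats.

```python
def average_parameter_vectors(worker_vectors):
    """Element-wise mean of the workers' parameter vectors."""
    if not worker_vectors:
        raise ValueError("need at least one worker vector")
    size = len(worker_vectors[0])
    if any(len(v) != size for v in worker_vectors):
        raise ValueError("all parameter vectors must have the same length")
    count = len(worker_vectors)
    return [sum(v[j] for v in worker_vectors) / count for j in range(size)]


def master_superstep(worker_vectors):
    """Collect, average, and return one copy of the global vector per worker."""
    global_vector = average_parameter_vectors(worker_vectors)
    return [list(global_vector) for _ in worker_vectors]


# Three workers report their local vectors; each gets the same global vector back.
updates = master_superstep([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Each worker would then replace its local vector with its entry from `updates` before starting the next pass.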




✛ Each given a split of the total dataset
   > Similar to a map task
 ✛ Performs local logistic regression run
 ✛ Local parameter vector sent to master at
     superstep
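
To show how the worker and master steps fit together, here is a single-process simulation in plain Python. It is a sketch under assumptions, not the project's implementation: the workers run one after another in a loop rather than in YARN containers, and the names (`worker_pass`, `train`), the learning rate, the superstep count, and the toy splits are all invented for the example.

```python
import math


def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme inputs.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def worker_pass(weights, split, learning_rate=0.1):
    """One local SGD pass over this worker's split; returns a new local vector."""
    local = list(weights)
    for features, label in split:
        error = label - sigmoid(sum(w * x for w, x in zip(local, features)))
        local = [w + learning_rate * error * x for w, x in zip(local, features)]
    return local


def train(splits, num_features, supersteps=20):
    """Run workers over their splits, averaging their vectors after each superstep."""
    global_weights = [0.0] * num_features
    for _ in range(supersteps):
        # In a real deployment each call would run on a separate worker process.
        local_vectors = [worker_pass(global_weights, split) for split in splits]
        global_weights = [sum(col) / len(local_vectors) for col in zip(*local_vectors)]
    return global_weights


# Toy splits (made up): (features, label) pairs, features = [bias, x], label 1 when x > 0.
splits = [
    [([1.0, -2.0], 0), ([1.0, 1.5], 1)],
    [([1.0, -1.0], 0), ([1.0, 2.0], 1)],
]
weights = train(splits, num_features=2)
```

Every worker starts each superstep from the same global vector, so the only communication needed per superstep is one parameter vector sent in each direction.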




[Diagram: OnlineLogisticRegression trains a single Model from the Training Data. Knitting Boar’s POLR divides the Training Data into Split 1, Split 2, Split 3; Worker 1 … Worker N each produce a Partial Model; the Master combines them into the Global Model]

[Chart: Input Size vs Processing Time. x-axis: Input Size in MB (4.1 to 41); y-axis: seconds (0 to 300); series: OLR, POLR]

Knitting Boar
     PARTING THOUGHTS




✛ Parallel SGD
   > The Boar is temperamental, experimental
       ∗ Linear speedup (roughly)

 ✛ Developing YARN Applications
   > More complex than just MapReduce
   > Requires lots of “plumbing”
 ✛ IterativeReduce
    > Great native-Hadoop way to implement algorithms
    > Easy to use and well integrated



✛ Knitting Boar
   > https://github.com/jpatanooga/KnittingBoar
   > 100% Java
   > ASF 2.0 Licensed
   > Quick Start
       ∗ https://github.com/jpatanooga/KnittingBoar/wiki/Quick-Start

 ✛ IterativeReduce
    > https://github.com/emsixteeen/IterativeReduce
    > 100% Java
    > ASF 2.0 Licensed


✛ Machine Learning is hard
       > Don’t believe the hype
       > Do the work
     ✛ Model development takes
       time
       > Lots of iterations
       > Speed is key here


        Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg



✛ “Parallel Linear Regression on Iterative
     Reduce and YARN”

 ✛ Hadoop Summit Europe 2013
   > March 20, 21
   > http://hadoopsummit.org/amsterdam/




✛ Strata / Hadoop World 2012 Slides
   > http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html
 ✛ McDonald, 2010
   > http://dl.acm.org/citation.cfm?id=1858068
 ✛ MapReduce is Good Enough? If All You Have is
     a Hammer, Throw Away Everything That’s Not a
     Nail!
     > http://arxiv.org/pdf/1209.2191v1.pdf


✛ http://eteamjournal.files.wordpress.com/2011/03/photos-of-mount-everest-pictures.jpg
 ✛ http://images.fineartamerica.com/images-medium-large/-say-hello-to-my-little-friend--luis-ludzska.jpg
 ✛ http://freewallpaper.in/wallpaper2/2202-2-2001_space_odyssey_-_5.jpg





Modeling Electronic Health Records with Recurrent Neural NetworksModeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural Networks
 
Building Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JBuilding Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4J
 
How to Build Deep Learning Models
How to Build Deep Learning ModelsHow to Build Deep Learning Models
How to Build Deep Learning Models
 
Deep learning with DL4J - Hadoop Summit 2015
Deep learning with DL4J - Hadoop Summit 2015Deep learning with DL4J - Hadoop Summit 2015
Deep learning with DL4J - Hadoop Summit 2015
 
Enterprise Deep Learning with DL4J
Enterprise Deep Learning with DL4JEnterprise Deep Learning with DL4J
Enterprise Deep Learning with DL4J
 
Deep Learning Intro - Georgia Tech - CSE6242 - March 2015
Deep Learning Intro - Georgia Tech - CSE6242 - March 2015Deep Learning Intro - Georgia Tech - CSE6242 - March 2015
Deep Learning Intro - Georgia Tech - CSE6242 - March 2015
 
Vectorization - Georgia Tech - CSE6242 - March 2015
Vectorization - Georgia Tech - CSE6242 - March 2015Vectorization - Georgia Tech - CSE6242 - March 2015
Vectorization - Georgia Tech - CSE6242 - March 2015
 
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
 
Georgia Tech cse6242 - Intro to Deep Learning and DL4J
Georgia Tech cse6242 - Intro to Deep Learning and DL4JGeorgia Tech cse6242 - Intro to Deep Learning and DL4J
Georgia Tech cse6242 - Intro to Deep Learning and DL4J
 
Intro to Vectorization Concepts - GaTech cse6242
Intro to Vectorization Concepts - GaTech cse6242Intro to Vectorization Concepts - GaTech cse6242
Intro to Vectorization Concepts - GaTech cse6242
 
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on HadoopHadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
 
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARNMLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
 
LA HUG Dec 2011 - Recommendation Talk
LA HUG Dec 2011 - Recommendation TalkLA HUG Dec 2011 - Recommendation Talk
LA HUG Dec 2011 - Recommendation Talk
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on Hadoop
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
 

Knitting boar atl_hug_jan2013_v2

  • 1. KNITTING BOAR Machine Learning, Mahout, and Parallel Iterative Algorithms Josh Patterson Principal Solutions Architect
  • 2. ✛ Josh Patterson > Master’s Thesis: self-organizing mesh networks ∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm > Conceived, built, and led Hadoop integration for openPDC project at Tennessee Valley Authority (TVA) > Twitter: @jpatanooga > Email: josh@floe.tv
  • 3. ✛ Introduction to Machine Learning ✛ Mahout ✛ Knitting Boar and YARN ✛ Parting Thoughts
  • 4. Introduction to MACHINE LEARNING
  • 5. ✛ What is Data Mining? > “the process of extracting patterns from data” ✛ Why are we interested in Data Mining? > Raw data is essentially useless ∗ Data is simply recorded facts ∗ Information is the patterns underlying the data ✛ Machine Learning > Algorithms for acquiring structural descriptions from data “examples” ∗ Process of learning “concepts”
  • 6. ✛ Information Retrieval > information science, information architecture, cognitive psychology, linguistics, and statistics. ✛ Natural Language Processing > grounded in machine learning, especially statistical machine learning ✛ Statistics > Math and stuff ✛ Machine Learning > Considered a branch of artificial intelligence
  • 7. ✛ ETL ✛ Joining multiple disparate data sources ✛ Filtering data ✛ Aggregation ✛ Cube materialization “Descriptive Statistics”
  • 8. ✛ Data collection performed with Flume ✛ Data cleansing / ETL performed with Hive or Pig ✛ ML work performed with > SAS > SPSS > R > Mahout
  • 10. ✛ Classification > “Fraud detection” ✛ Recommendation > “Collaborative Filtering” ✛ Clustering > “Segmentation” ✛ Frequent Itemset Mining 10 Copyright 2010 Cloudera Inc. All rights reserved
  • 11. ✛ Stochastic Gradient Descent > Single process > Logistic Regression Model Construction ✛ Naïve Bayes > MapReduce-based > Text Classification ✛ Random Forests > MapReduce-based
  • 12. ✛ An algorithm that looks at a user’s past actions and suggests > Products > Services > People ✛ Advertisement > Cloudera has a great Data Science training course on this topic > http://university.cloudera.com/training/data_science/introduction_to_data_science_-_building_recommender_systems.html
  • 13. ✛ Cluster words across docs to identify topics ✛ Latent Dirichlet Allocation
  • 14. Why Machine Learning? > Growing interest in predictive modeling ✛ Linear Models are Simple, Useful > Stochastic Gradient Descent is a very popular tool for building linear models like Logistic Regression ✛ Building Models Is Still Time Consuming > The “Need for speed” > “More data beats a cleverer algorithm”
  • 16. ✛ Parallelize Mahout’s Stochastic Gradient Descent > With as few extra dependencies as possible ✛ Wanted to explore parallel iterative algorithms using YARN > Wanted a first-class Hadoop YARN citizen > Work through dev progressions towards a stable state > Worry about “frameworks” later
  • 17. ✛ We Need > Hypothesis about data > Cost function > Update function ✛ Basic Algorithm: Andrew Ng’s Tutorial: https://class.coursera.org/ml/lecture/preview_view/11
  • 18. ✛ Training > Simple gradient descent procedure > Loss function needs to be convex ✛ Prediction > Logistic Regression: ∗ Sigmoid function using parameter vector (dot) example as exponential parameter [Diagram labels: Training Data, SGD, Model]
  • 19. Current Limitations ✛ Sequential algorithms on a single node only go so far ✛ The “Data Deluge” > Presents algorithmic challenges when combined with large data sets > Need to design algorithms that are able to perform in a distributed fashion ✛ MapReduce only fits certain types of algorithms
  • 20. Distributed Learning Strategies ✛ Langford, 2007 > Vowpal Wabbit ✛ McDonald 2010 > Distributed Training Strategies for the Structured Perceptron
  • 21. [Diagram: Input, then Superstep 1 and Superstep 2 (each a row of three Processors), then Output; additional labels: Map, Map, Map, Reduce, Reduce]
  • 22. “Are the gains gotten from using X worth the integration costs incurred in building the end-to-end solution? If no, then operationally, we can consider the Hadoop stack … there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc.” (Lin, 2012)
  • 23. ✛ Parallel Iterative implementation of SGD on YARN ✛ Workers > Work on partitions of the data > Stay active over supersteps ✛ Master > Performs superstep > Averages parameter vector
  • 24. ✛ Collects all parameter vectors at each pass / superstep ✛ Produces new global parameter vector > By averaging workers’ vectors ✛ Sends update to all workers > Workers replace local parameter vector with new global parameter vector
  • 25. ✛ Each given a split of the total dataset > Similar to a map task ✛ Performs local logistic regression run ✛ Local parameter vector sent to master at superstep
  • 26. [Diagram: OnlineLogisticRegression (Training Data, Model) next to Knitting Boar’s POLR (Split 1, Split 2, Split 3; Worker 1, Worker 2, … Worker N; Partial Model per worker; Master; Global Model)]
  • 27. [Chart: Input Size vs Processing Time. x-axis: Input Size in MB (4.1 to 41); y-axis: seconds (0 to 300); series: OLR, POLR]
  • 28. Knitting Boar PARTING THOUGHTS
  • 29. ✛ Parallel SGD > The Boar is temperamental, experimental ∗ Linear speedup (roughly) ✛ Developing YARN Applications > More complex than just MapReduce > Requires lots of “plumbing” ✛ IterativeReduce > Great native-Hadoop way to implement algorithms > Easy to use and well integrated
  • 30. ✛ Knitting Boar > https://github.com/jpatanooga/KnittingBoar > 100% Java > ASF 2.0 Licensed > Quick Start ∗ https://github.com/jpatanooga/KnittingBoar/wiki/Quick-Start ✛ IterativeReduce > https://github.com/emsixteeen/IterativeReduce > 100% Java > ASF 2.0 Licensed
  • 31. ✛ Machine Learning is hard > Don’t believe the hype > Do the work ✛ Model development takes time > Lots of iterations > Speed is key here Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg
  • 32. ✛ “Parallel Linear Regression on Iterative Reduce and YARN” ✛ Hadoop Summit Europe 2013 > March 20, 21 > http://hadoopsummit.org/amsterdam/
  • 33. ✛ Strata / Hadoop World 2012 Slides > http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html ✛ McDonald, 2010 > http://dl.acm.org/citation.cfm?id=1858068 ✛ MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail! > http://arxiv.org/pdf/1209.2191v1.pdf
  • 34. ✛ http://eteamjournal.files.wordpress.com/2011/03/photos-of-mount-everest-pictures.jpg ✛ http://images.fineartamerica.com/images-medium-large/-say-hello-to-my-little-friend--luis-ludzska.jpg ✛ http://freewallpaper.in/wallpaper2/2202-2-2001_space_odyssey_-_5.jpg
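
Slides 17, 18 and 23 to 25 describe the core loop: each worker runs a local logistic regression SGD pass over its split, and the master averages the workers' parameter vectors at every superstep and sends the result back. The following is a minimal single-process sketch of that idea in Python. It is not Knitting Boar's or Mahout's code: the function names, the fixed learning rate, the toy data, and the fact that the "workers" run one after another in a loop are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, x):
    """Prediction: sigmoid of (parameter vector dot example)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def local_sgd_pass(weights, examples, learning_rate=0.1):
    """One worker's pass over its split: a gradient step on the log loss per example."""
    w = list(weights)  # work on a local copy of the global vector
    for x, y in examples:
        error = predict(w, x) - y
        for j, xj in enumerate(x):
            w[j] -= learning_rate * error * xj
    return w


def average_vectors(vectors):
    """Master step: element-wise average of the workers' parameter vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def train_parallel(splits, num_features, supersteps=20, learning_rate=0.1):
    """Simulate supersteps: local passes, then averaging, then redistribution."""
    global_w = [0.0] * num_features
    for _ in range(supersteps):
        # Hypothetical stand-in for workers: run sequentially here.
        local = [local_sgd_pass(global_w, split, learning_rate) for split in splits]
        global_w = average_vectors(local)  # every worker restarts from this vector
    return global_w


# Toy data (an assumption): a bias term plus one feature, label 1 when the feature is positive.
rng = random.Random(0)
examples = []
for _ in range(90):
    f = rng.uniform(-1.0, 1.0)
    examples.append(([1.0, f], 1.0 if f > 0 else 0.0))
splits = [examples[0:30], examples[30:60], examples[60:90]]

model = train_parallel(splits, num_features=2)
```

Calling `predict(model, [1.0, 0.8])` then gives the model's estimated probability for a clearly positive example. A real implementation would run the workers as separate YARN containers and would likely add regularization and a learning-rate schedule, which this sketch leaves out.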

Editor's Notes

  1. Examples of key information: selecting embryos based on 60 features. You may be asking “why aren’t we talking about Mahout?” What we want to do here is look at the fundamentals that will underlie all of the systems, not just Mahout. Some of the wording may be different, but it’s the same.
  2. Frequent itemset mining – what appears together
  3. “What do other people w/ similar tastes like?” “Strength of associations”
  4. “say hello to my leeeeetle friend….”
  5. Vowpal Wabbit: doesn’t natively run on Hadoop. Spark: Scala, overhead, integration issues.
  6. “Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems.” (Bottou, 2010) SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners, with no loss of model accuracy.
  8. At current disk capacity and bandwidth (2 TB at 100 MB/s throughput), it takes roughly 6 hours to read the contents of a single hard drive.
  9. Bottou similar to Xu2010 in the 2010 paper
  10. Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures. Acyclic data flow is a powerful abstraction, but is not efficient for applications that repeatedly reuse a working set of data: iterative algorithms (many in machine learning). No single programming model or framework can excel at every problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
  11. Some of these are in progress towards being ready on YARN, some not; wanted to focus on OLR and not framework for now
  12. POLR: Parallel Online Logistic Regression. Talking points: wanted to start with a tool known to the Hadoop community, with expected characteristics. Mahout’s SGD is well known, and so we used that as a base point.
  13. 3 major costs of BSP-style computations: max unit compute time; cost of global communication; cost of barrier sync at the end of a superstep.
  14. Multi-dimensional: need to constantly think about the Client, the Master, and the Worker, how they interact and the implications of failures, etc.
  15. Basecamp: use story of how we get to basecamp to see how to climb some more
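
Notes 8 and 13 both rest on simple arithmetic, which can be sketched as follows. The disk figure assumes decimal units (2 TB = 2 × 10^12 bytes, 100 MB/s = 10^8 bytes per second). The superstep cost assumes the three costs listed in note 13 simply add up, as in the textbook BSP cost model; the function names and the sample timings are hypothetical and are not measurements from the talk.

```python
def disk_read_hours(capacity_bytes, throughput_bytes_per_sec):
    """Hours needed to read a whole disk sequentially at a fixed throughput."""
    return capacity_bytes / throughput_bytes_per_sec / 3600.0


def superstep_cost(compute_times, communication_cost, barrier_cost):
    """Cost of one BSP superstep: slowest worker + global communication + barrier sync."""
    return max(compute_times) + communication_cost + barrier_cost


# 2 TB at 100 MB/s, decimal units assumed.
hours = disk_read_hours(2e12, 100e6)

# Three workers with made-up compute times in seconds.
cost = superstep_cost([4.0, 6.5, 5.0], communication_cost=1.5, barrier_cost=0.5)
```

Because the superstep waits on the slowest worker, uneven splits raise the cost of every superstep even when the average compute time stays the same.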