KNITTING BOAR
    Building Machine Learning Tools with Hadoop's YARN




    Josh Patterson
    Principal Solutions Architect

    Michael Katzenellenbogen
    Principal Solutions Architect




✛ Josh Patterson - josh@cloudera.com
   > Master's Thesis: self-organizing mesh networks
       ∗   Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
   > Conceived, built, and led Hadoop integration for openPDC project
      at Tennessee Valley Authority (TVA)


✛ Michael Katzenellenbogen - michael@cloudera.com
   > Principal Solutions Architect @ Cloudera
   > Systems Guy ('nuff said)
✛ Intro / Background
✛ Introducing Knitting Boar
✛ Integrating Knitting Boar and YARN
✛ Results and Lessons Learned
Background and
    INTRODUCTION




✛   Why Machine Learning?
    >   Growing interest in predictive modeling

✛   Linear Models are Simple, Useful
    >   Stochastic Gradient Descent is a very popular tool for
        building linear models like Logistic Regression

✛   Building Models Is Still Time-Consuming
    >   The “Need for speed”
    >   “More data beats a cleverer algorithm”
✛ Parallelize Mahout’s Stochastic Gradient Descent
  >   With as few extra dependencies as possible

✛ Wanted to explore parallel iterative algorithms
   using YARN
  >   Wanted a first class Hadoop-Yarn citizen
  >   Work through dev progressions towards a stable state
  >   Worry about “frameworks” later
✛ Training
       > Simple gradient descent procedure
       > Loss function needs to be convex
    ✛ Prediction
      > Logistic Regression:
          ∗ Sigmoid function using parameter vector (dot)
            example as exponential parameter
    [Diagram: Training Data -> SGD -> Model]
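
The training and prediction steps above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not Mahout's implementation: the class name `LogisticSgdSketch`, its method names, and the fixed learning rate are all hypothetical, and the examples are dense arrays with labels in {0, 1}.

```java
/** Minimal logistic-regression SGD sketch (illustrative only, not Mahout's code). */
final class LogisticSgdSketch {

    /** Prediction: sigmoid of the dot product of the parameter vector and one example. */
    static double predict(double[] w, double[] x) {
        double dot = 0.0;
        for (int i = 0; i < w.length; i++) {
            dot += w[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    /** One SGD step on the convex log loss for a single example with label y in {0, 1}. */
    static void sgdStep(double[] w, double[] x, int y, double learningRate) {
        double error = y - predict(w, x);
        for (int i = 0; i < w.length; i++) {
            w[i] += learningRate * error * x[i];
        }
    }

    /** Several passes of plain SGD over a small in-memory data set. */
    static double[] train(double[][] xs, int[] ys, int passes, double learningRate) {
        double[] w = new double[xs[0].length];
        for (int p = 0; p < passes; p++) {
            for (int n = 0; n < xs.length; n++) {
                sgdStep(w, xs[n], ys[n], learningRate);
            }
        }
        return w;
    }
}
```

Calling `LogisticSgdSketch.train(xs, ys, 200, 0.1)` on a handful of labeled rows returns a parameter vector that `predict` can then apply to new examples.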


✛ Currently Single Process
      > Multi-threaded parallel, but not cluster parallel
      > Runs locally, not deployed to the cluster
    ✛ Defined in:
      > https://cwiki.apache.org/MAHOUT/logistic-regression.html




Current Limitations
    ✛ Sequential algorithms on a single node only go so far
    ✛ The “Data Deluge”
      > Presents algorithmic challenges when combined with
        large data sets
      > Need to design algorithms that are able to perform in
        a distributed fashion
    ✛ MapReduce only fits certain types of algorithms




Distributed Learning Strategies
 ✛ Langford, 2007
    > Vowpal Wabbit
 ✛ McDonald, 2010
   > Distributed Training Strategies for the Structured Perceptron
 ✛ Dekel, 2010
   > Optimal Distributed Online Prediction Using Mini-Batches




[Diagram: MapReduce (Input -> Map, Map, Map -> Reduce, Reduce -> Output) next to a superstep model (Processor, Processor, Processor in Superstep 1, then again in Superstep 2, . . .)]


“Are the gains gotten from using X worth the integration costs incurred in
     building the end-to-end solution?

     If no, then operationally, we can consider the Hadoop stack …

     there are substantial costs in knitting together a patchwork of different
     frameworks, programming models, etc.”

     –– Lin, 2012




Introducing
KNITTING BOAR




✛ Parallel Iterative implementation of SGD on
     YARN

 ✛ Workers work on partitions of the data
 ✛ Master keeps global copy of merged parameter
     vector




✛ Each worker is given a split of the total dataset
   > Similar to a map task
 ✛ Using a modified OLR
   > Process N samples in a batch (subset of split)
 ✛ Batched gradient accumulation updates sent to
     master node
     > Gradient influences future model vectors towards
       better predictions
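
The worker behavior above can be sketched as follows. This is a hedged illustration, not Knitting Boar's actual worker: `BatchWorkerSketch` and its methods are hypothetical names, the split is held in memory as dense arrays, and the update sent to the master is assumed to be a copy of the local parameter vector after each batch of N samples.

```java
/** Illustrative worker that trains on one batch of its split per call (hypothetical names). */
final class BatchWorkerSketch {
    private final double[][] splitX;   // examples in this worker's split
    private final int[] splitY;        // labels (0 or 1) for those examples
    private final int batchSize;       // N samples processed per batch
    private final double learningRate;
    private double[] local;            // local copy of the parameter vector
    private int cursor = 0;            // next unprocessed example in the split

    BatchWorkerSketch(double[][] splitX, int[] splitY, int batchSize, double learningRate) {
        this.splitX = splitX;
        this.splitY = splitY;
        this.batchSize = batchSize;
        this.learningRate = learningRate;
        this.local = new double[splitX[0].length];
    }

    /** True while the split still has unprocessed examples. */
    boolean hasMoreBatches() {
        return cursor < splitX.length;
    }

    /** Runs SGD over the next batch and returns a copy of the local vector for the master. */
    double[] computeBatch() {
        int end = Math.min(cursor + batchSize, splitX.length);
        for (; cursor < end; cursor++) {
            double[] x = splitX[cursor];
            double dot = 0.0;
            for (int i = 0; i < x.length; i++) {
                dot += local[i] * x[i];
            }
            double error = splitY[cursor] - 1.0 / (1.0 + Math.exp(-dot));
            for (int i = 0; i < x.length; i++) {
                local[i] += learningRate * error * x[i];
            }
        }
        return local.clone();
    }

    /** Replaces the local vector with the global vector received from the master. */
    void applyGlobal(double[] global) {
        local = global.clone();
    }
}
```

Returning a copy rather than the live array keeps the master from mutating worker state by accident, which matters once the two run in separate threads or processes.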




✛ Accumulates gradient updates
   > From batches of worker OLR runs
 ✛ Produces new global parameter vector
   > By averaging workers' vectors
 ✛ Sends update to all workers
   > Workers replace local parameter vector with new
     global parameter vector
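
The averaging step on the master can be sketched as an element-wise mean. This is an illustration only: `ParameterAveragingSketch` is a hypothetical name, and equal weighting of every worker is an assumption, since the slides do not say whether workers are weighted by the number of samples they processed.

```java
/** Illustrative master-side merge: averages the workers' parameter vectors. */
final class ParameterAveragingSketch {

    /** Returns the element-wise mean of the workers' vectors as the new global vector. */
    static double[] average(double[][] workerVectors) {
        if (workerVectors.length == 0) {
            throw new IllegalArgumentException("need at least one worker vector");
        }
        int dim = workerVectors[0].length;
        double[] global = new double[dim];
        for (double[] v : workerVectors) {
            if (v.length != dim) {
                throw new IllegalArgumentException("vector lengths differ");
            }
            for (int i = 0; i < dim; i++) {
                global[i] += v[i];
            }
        }
        for (int i = 0; i < dim; i++) {
            global[i] /= workerVectors.length;
        }
        return global;
    }
}
```

The returned array is what would be broadcast back so that each worker can replace its local parameter vector.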




[Diagram: OnlineLogisticRegression: Training Data -> OnlineLogisticRegression -> Model. Knitting Boar's POLR: Split 1, Split 2, Split 3 -> Worker 1, Worker 2, … Worker N -> Partial Model (one per worker) -> Master -> Global Model]

Integrating Knitting Boar with
YARN




✛ Yet Another Resource Negotiator

 ✛ Framework for scheduling distributed applications
 ✛ Typically runs on top of an HDFS cluster
    > Though not required,
      nor is it coupled to HDFS
 ✛ MRv2 is now a distributed application

 [Diagram: two Clients, one Resource Manager, and three Node Managers hosting Containers and App Mstrs. Legend: MapReduce Status, Job Submission, Node Status, Resource Request]




✛ High setup / teardown costs
 ✛ Not designed for super-step operations
 ✛ Need to refactor the problem to fit MapReduce
   > We can now just launch a distributed application




✛ Designed specifically for parallel iterative
     algorithms on Hadoop
     > Implemented directly on top of YARN
 ✛ Intrinsic Parallelism
    > Easier to focus on problem
    > Not focusing on the distributed application part




✛ ComputableMaster
   > Setup()
   > Compute()
   > Complete()
 ✛ ComputableWorker
   > Setup()
   > Compute()

 [Diagram: Worker, Worker, Worker -> Master -> Worker, Worker, Worker -> Master -> . . .]
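
One way to picture the master and worker contracts is the sketch below. Only the method names (setup, compute, complete) come from the slide; the interface names with the `Sketch` suffix, the generic type `T`, every parameter and return type, and the single-process driver loop are assumptions made for illustration and are not the IterativeReduce API.

```java
/** Hypothetical master contract; only the three method names come from the slide. */
interface ComputableMasterSketch<T> {
    void setup();
    T compute(java.util.List<T> workerUpdates);  // merge one superstep of worker updates
    void complete();                              // called once after the last superstep
}

/** Hypothetical worker contract; only the two method names come from the slide. */
interface ComputableWorkerSketch<T> {
    void setup();
    T compute(T global);                          // one superstep of local work
}

/** Single-process stand-in for the alternating worker/master loop. */
final class SuperstepLoopSketch {

    static <T> T run(ComputableMasterSketch<T> master,
                     java.util.List<ComputableWorkerSketch<T>> workers,
                     T initial,
                     int supersteps) {
        master.setup();
        for (ComputableWorkerSketch<T> w : workers) {
            w.setup();
        }
        T global = initial;
        for (int step = 0; step < supersteps; step++) {
            java.util.List<T> updates = new java.util.ArrayList<>();
            for (ComputableWorkerSketch<T> w : workers) {
                updates.add(w.compute(global));   // each worker computes on its partition
            }
            global = master.compute(updates);     // master merges; result goes back to workers
        }
        master.complete();
        return global;
    }
}
```

In a real deployment the inner loop would run in separate containers with network synchronization between supersteps; the sequential loop here only shows the order of calls.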




✛ Client
   > Launches the YARN ApplicationMaster
 ✛ Master
   > Computes required resources
   > Obtains resources from YARN
   > Launches Workers
 ✛ Workers
   > Computation on partial data (input split)
   > Synchronizes with Master



[Diagram: layered stack, top to bottom: Pig, Hive, Scala, Java, Crunch / Algorithms / MapReduce, IterativeReduce, BranchReduce, Giraph, … / HDFS / YARN]




Knitting Boar
     PERFORMANCE, SCALING, AND RESULTS




[Chart: Input Size vs Processing Time, series OLR and POLR; x-axis ticks 4.1 to 41, y-axis 0 to 300]


✛ Parallel SGD
   > The Boar is temperamental, experimental
       ∗ Linear speedup (roughly)

 ✛ Developing YARN Applications
   > More complex than just MapReduce
   > Requires lots of “plumbing”
 ✛ IterativeReduce
    > Great native-Hadoop way to implement algorithms
    > Easy to use and well integrated



✛ Knitting Boar
   > 100% Java
   > ASF 2.0 Licensed
   > https://github.com/jpatanooga/KnittingBoar
   > Quick Start
       ∗ https://github.com/jpatanooga/KnittingBoar/wiki/Quick-Start

 ✛ IterativeReduce
    > [ coming soon ]




The Road Ahead

                  ✛ SGD
                    > More testing
                    > Demo use cases
                  ✛ IterativeReduce
                     > Reliability
                     > Durability



                  Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg



✛ Mahout's SGD implementation
   > http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
 ✛ Hadoop AllReduce and Terascale Learning
   > http://hunch.net/?p=2094
 ✛ MapReduce is Good Enough? If All You Have is
     a Hammer, Throw Away Everything That's Not a
     Nail!
     > http://arxiv.org/pdf/1209.2191v1.pdf



✛ Langford
    > http://hunch.net/~vw/
 ✛ Zinkevich, 2011
    > http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf
 ✛ McDonald, 2010
   > http://dl.acm.org/citation.cfm?id=1858068
 ✛ Dekel, 2010
   > http://arxiv.org/pdf/1012.1367.pdf



✛ http://eteamjournal.files.wordpress.com/2011/03/photos-of-mount-everest-pictures.jpg
 ✛ http://images.fineartamerica.com/images-medium-large/-say-hello-to-my-little-friend--luis-ludzska.jpg
 ✛ http://agileknitter.com/wp-content/uploads/2010/06/Pictures_-_Misc_-_Knitting_Needles.jpg




Strata + Hadoop World 2012: Knitting Boar

Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNHadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNJosh Patterson
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5RojaT4
 
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Databricks
 
XML Parsing with Map Reduce
XML Parsing with Map ReduceXML Parsing with Map Reduce
XML Parsing with Map ReduceEdureka!
 
A simulation-based approach for straggler tasks detection in Hadoop MapReduce
A simulation-based approach for straggler tasks detection in Hadoop MapReduceA simulation-based approach for straggler tasks detection in Hadoop MapReduce
A simulation-based approach for straggler tasks detection in Hadoop MapReduceIRJET Journal
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce FrameworkEdureka!
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scalesamthemonad
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 
Big Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdfBig Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdfWasyihunSema2
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map ReduceUrvashi Kataria
 
Learn what is Hadoop-and-BigData
Learn  what is Hadoop-and-BigDataLearn  what is Hadoop-and-BigData
Learn what is Hadoop-and-BigDataThanusha154
 
Hadoop live online training
Hadoop live online trainingHadoop live online training
Hadoop live online trainingHarika583
 
Enhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop ClusterEnhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop ClusterIRJET Journal
 
Characterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learningCharacterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learningJoão Gabriel Lima
 
Juniper Innovation Contest
Juniper Innovation ContestJuniper Innovation Contest
Juniper Innovation ContestAMIT BORUDE
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing enginebigdatagurus_meetup
 

Ähnlich wie Strata + Hadoop World 2012: Knitting Boar (20)

Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNHadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
 
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
 
XML Parsing with Map Reduce
XML Parsing with Map ReduceXML Parsing with Map Reduce
XML Parsing with Map Reduce
 
A simulation-based approach for straggler tasks detection in Hadoop MapReduce
A simulation-based approach for straggler tasks detection in Hadoop MapReduceA simulation-based approach for straggler tasks detection in Hadoop MapReduce
A simulation-based approach for straggler tasks detection in Hadoop MapReduce
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
 
E031201032036
E031201032036E031201032036
E031201032036
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 
mapReduce.pptx
mapReduce.pptxmapReduce.pptx
mapReduce.pptx
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
Big Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdfBig Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdf
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Learn what is Hadoop-and-BigData
Learn  what is Hadoop-and-BigDataLearn  what is Hadoop-and-BigData
Learn what is Hadoop-and-BigData
 
Hadoop live online training
Hadoop live online trainingHadoop live online training
Hadoop live online training
 
Enhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop ClusterEnhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop Cluster
 
Characterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learningCharacterization of hadoop jobs using unsupervised learning
Characterization of hadoop jobs using unsupervised learning
 
YARN (2).pptx
YARN (2).pptxYARN (2).pptx
YARN (2).pptx
 
Juniper Innovation Contest
Juniper Innovation ContestJuniper Innovation Contest
Juniper Innovation Contest
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
 

Mehr von Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

Strata + Hadoop World 2012: Knitting Boar

  • 1. KNITTING BOAR: Building Machine Learning Tools with Hadoop’s YARN. Josh Patterson, Principal Solutions Architect; Michael Katzenellenbogen, Principal Solutions Architect
  • 2. ✛ Josh Patterson - josh@cloudera.com > Master’s thesis: self-organizing mesh networks ∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm > Conceived, built, and led Hadoop integration for openPDC project at Tennessee Valley Authority (TVA) ✛ Michael Katzenellenbogen - michael@cloudera.com > Principal Solutions Architect @ Cloudera > Systems Guy (’nuff said)
  • 3. ✛ Intro / Background ✛ Introducing Knitting Boar ✛ Integrating Knitting Boar and YARN ✛ Results and Lessons Learned
  • 4. Background and INTRODUCTION
  • 5. ✛ Why Machine Learning? > Growing interest in predictive modeling ✛ Linear Models are Simple, Useful > Stochastic Gradient Descent is a very popular tool for building linear models like Logistic Regression ✛ Building Models Is Still Time Consuming > The “Need for speed” > “More data beats a cleverer algorithm”
  • 6. ✛ Parallelize Mahout’s Stochastic Gradient Descent > With as few extra dependencies as possible ✛ Wanted to explore parallel iterative algorithms using YARN > Wanted a first-class Hadoop-YARN citizen > Work through dev progressions towards a stable state > Worry about “frameworks” later
  • 7. ✛ Training > Simple gradient descent procedure > Loss function needs to be convex ✛ Prediction > Logistic Regression: ∗ Sigmoid function using parameter vector (dot) example as exponential parameter [Diagram: Training Data → SGD → Model]
  • 8. ✛ Currently Single Process > Multi-threaded parallel, but not cluster parallel > Runs locally, not deployed to the cluster ✛ Defined in: > https://cwiki.apache.org/MAHOUT/logistic-regression.html
  • 9. Current Limitations ✛ Sequential algorithms on a single node only go so far ✛ The “Data Deluge” > Presents algorithmic challenges when combined with large data sets > Need to design algorithms that are able to perform in a distributed fashion ✛ MapReduce only fits certain types of algorithms
  • 10. Distributed Learning Strategies ✛ Langford, 2007 > Vowpal Wabbit ✛ McDonald, 2010 > Distributed Training Strategies for the Structured Perceptron ✛ Dekel, 2010 > Optimal Distributed Online Prediction Using Mini-Batches
  • 11. [Diagram: Input → Superstep 1 (Processors running Map) → Superstep 2 (Processors running Reduce) → . . . → Output]
  • 12. “Are the gains gotten from using X worth the integration costs incurred in building the end-to-end solution? If no, then operationally, we can consider the Hadoop stack … there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc.” (Lin, 2012)
  • 14. ✛ Parallel Iterative implementation of SGD on YARN ✛ Workers work on partitions of the data ✛ Master keeps global copy of merged parameter vector
  • 15. ✛ Each given a split of the total dataset > Similar to a map task ✛ Using a modified OLR > Process N samples in a batch (subset of split) ✛ Batched gradient accumulation updates sent to master node > Gradient influences future model vectors towards better predictions
  • 16. ✛ Accumulates gradient updates > From batches of worker OLR runs ✛ Produces new global parameter vector > By averaging workers’ vectors ✛ Sends update to all workers > Workers replace local parameter vector with new global parameter vector
  • 17. [Diagram: OnlineLogisticRegression (Training Data → OnlineLogisticRegression → Model) next to Knitting Boar’s POLR (Split 1, Split 2, Split 3 → Worker 1, Worker 2, … Worker N → Partial Models → Master → Global Model)]
  • 19. ✛ Yet Another Resource Negotiator ✛ Framework for scheduling distributed applications ✛ Typically runs on top of an HDFS cluster > Though not required, nor is it coupled to HDFS ✛ MRv2 is now a distributed application [Diagram: Clients, Resource Manager, Node Managers, App Masters, and Containers, with arrows for MapReduce Status, Job Submission, Node Status, and Resource Request]
  • 20. ✛ High setup / teardown costs ✛ Not designed for super-step operations ✛ Need to refactor the problem to fit MapReduce > We can now just launch a distributed application
  • 21. ✛ Designed specifically for parallel iterative algorithms on Hadoop > Implemented directly on top of YARN ✛ Intrinsic Parallelism > Easier to focus on problem > Not focusing on the distributed application part
  • 22. ✛ ComputableMaster > Setup() > Compute() > Complete() ✛ ComputableWorker > Setup() > Compute() [Diagram: Workers → Master → Workers → Master → . . .]
  • 23. ✛ Client > Launches the YARN ApplicationMaster ✛ Master > Computes required resources > Obtains resources from YARN > Launches Workers ✛ Workers > Computation on partial data (input split) > Synchronizes with Master
  • 24. [Diagram: layered stack. Top: Pig, Hive, Scala, Java, Crunch. Then: Algorithms. Then: MapReduce, IterativeReduce, BranchReduce, Giraph, … Base: HDFS / YARN]
  • 25. Knitting Boar: PERFORMANCE, SCALING, AND RESULTS
  • 26. [Chart: Input Size vs Processing Time, comparing OLR and POLR; x-axis ticks from 4.1 to 41, y-axis ticks from 0 to 300]
  • 27. ✛ Parallel SGD > The Boar is temperamental, experimental ∗ Linear speedup (roughly) ✛ Developing YARN Applications > More complex than just MapReduce > Requires lots of “plumbing” ✛ IterativeReduce > Great native-Hadoop way to implement algorithms > Easy to use and well integrated
  • 28. ✛ Knitting Boar > 100% Java > ASF 2.0 Licensed > https://github.com/jpatanooga/KnittingBoar > Quick Start ∗ https://github.com/jpatanooga/KnittingBoar/wiki/Quick-Start ✛ IterativeReduce > [ coming soon ]
  • 29. The Road Ahead ✛ SGD > More testing > Demo use cases ✛ IterativeReduce > Reliability > Durability Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg
  • 30. ✛ Mahout’s SGD implementation > http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf ✛ Hadoop AllReduce and Terascale Learning > http://hunch.net/?p=2094 ✛ MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail! > http://arxiv.org/pdf/1209.2191v1.pdf
  • 31. ✛ Langford > http://hunch.net/~vw/ ✛ Zinkevich, 2011 > http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf ✛ McDonald, 2010 > http://dl.acm.org/citation.cfm?id=1858068 ✛ Dekel, 2010 > http://arxiv.org/pdf/1012.1367.pdf
  • 32. ✛ http://eteamjournal.files.wordpress.com/2011/03/photos-of-mount-everest-pictures.jpg ✛ http://images.fineartamerica.com/images-medium-large/-say-hello-to-my-little-friend--luis-ludzska.jpg ✛ http://agileknitter.com/wp-content/uploads/2010/06/Pictures_-_Misc_-_Knitting_Needles.jpg
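
Slides 14 to 16 describe workers that each run SGD over batches from their own split, and a master that averages the workers' parameter vectors into a new global vector that every worker then adopts. The following is a minimal single-process Python sketch of that loop, not Knitting Boar's Java code: the function names, the toy dataset, the learning rate, and the batch size are all assumptions made for illustration, and the workers run one after another here rather than in YARN containers.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, x):
    """Logistic regression prediction: sigmoid of (parameter vector . example)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def worker_batch(w_global, batch, lr):
    """Worker step: copy the global vector, then run SGD over one batch."""
    w = list(w_global)
    for x, y in batch:
        err = y - predict(w, x)
        for i, xi in enumerate(x):
            w[i] += lr * err * xi
    return w


def master_average(worker_vectors):
    """Master step: average the workers' parameter vectors element-wise."""
    n = len(worker_vectors)
    return [sum(col) / n for col in zip(*worker_vectors)]


def train(splits, dim, batch_size=10, lr=0.3, passes=20):
    """Each superstep: every worker processes one batch, then the master merges."""
    w_global = [0.0] * dim
    longest = max(len(s) for s in splits)
    for _ in range(passes):
        for start in range(0, longest, batch_size):
            partials = [
                worker_batch(w_global, s[start:start + batch_size], lr)
                for s in splits
                if s[start:start + batch_size]
            ]
            w_global = master_average(partials)
    return w_global


def make_data(n, seed):
    """Hypothetical toy data: label is 1 when x1 + x2 > 1; x[0] is a bias term."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        rows.append(([1.0, x1, x2], 1 if x1 + x2 > 1.0 else 0))
    return rows


data = make_data(600, seed=42)
splits = [data[0::3], data[1::3], data[2::3]]  # three workers, one split each
w = train(splits, dim=3)
accuracy = sum((predict(w, x) > 0.5) == (y == 1) for x, y in data) / len(data)
```

Averaging whole parameter vectors keeps the master step trivial, at the price of one synchronization point per batch; a larger batch size trades fewer synchronizations for staler local vectors.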

Editor's notes

  1. Vowpal Wabbit: doesn’t natively run on Hadoop. Spark: Scala, overhead, integration issues.
  2. “Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems.” (Bottou, 2010). SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners, with no loss of model accuracy.
  3. The most important additions in Mahout’s SGD are: confidence-weighted learning rates per term; evolutionary tuning of hyper-parameters; mixed ranking and regression; grouped AUC. The implication of it being local is that you are limited to the compute capacity of the local machine, as opposed to even a single machine on the cluster.
  4. At current disk bandwidth and capacity (2 TB at 100 MB/s throughput), it takes roughly 6 hours to read the contents of a single hard drive (2 TB / 100 MB/s = 20,000 s, or about 5.6 hours).
  5. Bottou, in the 2010 paper, is similar to Xu 2010.
  6. Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures. Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data, such as iterative algorithms (many in machine learning). No single programming model or framework can excel at every problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
  7. Some of these are in progress towards being ready on YARN, some not; we wanted to focus on OLR and not on a framework for now.
  8. “say hello to my leeeeetle friend….”
  9. POLR: Parallel Online Logistic Regression. Talking points: we wanted to start with a tool known to the Hadoop community, with expected characteristics. Mahout’s SGD is well known, so we used it as a base point.
  10. Segue into YARN.
  11. Performance is still largely dependent on the implementation of the algorithm.
  12. Three major costs of BSP-style computations: max unit compute time; cost of global communication; cost of barrier sync at the end of a superstep.
  13. Multi-dimensional: need to constantly think about the Client, the Master, and the Worker, how they interact and the implications of failures, etc.
  14. Basecamp: use story of how we get to basecamp to see how to climb some more
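
Note 12 lists three costs of a BSP-style computation. One common way to read them, sketched below in Python, is that a superstep takes as long as its slowest worker plus the global communication and barrier overheads. The function names and the timing numbers are hypothetical and were not measured on Knitting Boar.

```python
def superstep_cost(worker_times, comm_cost, barrier_cost):
    """One superstep: slowest worker + global communication + barrier sync."""
    return max(worker_times) + comm_cost + barrier_cost


def job_cost(supersteps):
    """Total over supersteps given as (worker_times, comm_cost, barrier_cost)."""
    return sum(superstep_cost(*step) for step in supersteps)


# Hypothetical timings in seconds: two supersteps, three workers each.
steps = [
    ([3.0, 5.0, 4.0], 1.0, 0.5),  # slowest worker takes 5.0 s
    ([2.0, 2.0, 6.0], 1.0, 0.5),  # a single straggler takes 6.0 s
]
total = job_cost(steps)
```

Because the compute term is a maximum rather than an average, one slow worker delays every other worker at the barrier, which is why evenly sized splits matter in this model.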