Josh Patterson
 Email: josh@floe.tv
 Twitter: @jpatanooga
 Github: https://github.com/jpatanooga
 Past
   Published in IAAI-09: “TinyTermite: A Secure Routing Algorithm”
   Grad work in Meta-heuristics, Ant-algorithms
   Tennessee Valley Authority (TVA)
     Hadoop and the Smartgrid
   Cloudera
     Principal Solution Architect
 Today
   Independent Consultant
Sections

1. Modern Data Analytics
2. Parallel Linear Regression
3. Performance and Results
The World as Optimization
 Data tells us about our model/engine/product
   We take this data and evolve our product towards a
   state of minimal market error
 WSJ Special Section, Monday March 11, 2013
   Zynga changing games based on player behavior
   UPS cut fuel consumption by 8.4MM gallons
   Ford used sentiment analysis to look at how new car
   features would be received
The Modern Data Landscape
 Apps are coming but they need
   Platforms
   Components
   Workflows
 Lots of investment in Hadoop in this space
   Lots of ETL pipelines
   Lots of descriptive Statistics
   Growing interest in Machine Learning
Hadoop as The Linux of Data

 Hadoop has won the Cycle
  Gartner: Hadoop will be in 2/3s of advanced analytics
  products by 2015 [1]
 “Hadoop is the kernel of a distributed operating system,
 and all the other components around the kernel are now
 arriving on this stage”
    ---Doug Cutting
Today’s Hadoop ML Pipeline
 Data cleansing / ETL performed with Hive or Pig
 Data In Place Processed
    Mahout
    R
    Custom MapReduce Algorithm
  Or Externally Processed
    SAS
    SPSS
    KXEN
    Weka
As Focus Shifts to Applications

 Data rates have been climbing fast
   Speed at Scale becomes the new Killer App
 Companies will want to leverage the Big Data
 infrastructure they’ve already been working with
   Hadoop
   HDFS as main storage system
 A drive to validate big data investments with results
   Emergence of applications which create “data
   products”
Patterson’s Law

“As the percent of your total data held
in a storage system approaches 100%
the amount of in-system processing
and analytics also approaches 100%”
Tools Will Move onto Hadoop

 Already seeing this with Vendors
  Who hasn’t announced a SQL engine on Hadoop
  lately?
 Trend will continue with machine learning tools
  Mahout was the beginning
  More are following
  But what about parallel iterative algorithms?
Distributed Systems Are Hard
 Lots of moving parts
   Especially as these applications become more complicated
 Machine learning can be a non-trivial operation
   We need great building blocks that work well together
 I agree with Jimmy Lin [3]: “keep it simple”
   “make sure costs don’t outweigh benefits”
 Minimize “Yet Another Tool To Learn” (YATTL) as much as
 we can!
To Summarize
 Data moving into Hadoop everywhere
   Patterson’s Law
   Focus on Hadoop, build around next-gen “Linux of data”
 Need simple components to build next-gen data base apps
   They should work cleanly with the cluster that the Fortune
   500 has: Hadoop
   Also should be easy to integrate into Hadoop and with the
   Hadoop tool ecosystem
   Minimize YATTL
Linear Regression
 In linear regression, data is
 modeled using linear predictor
 functions

   unknown model parameters are
   estimated from the data.
 We use optimization techniques
 like Stochastic Gradient Descent to
 find the coefficients in the model


  Y = (1*x0) + (c1*x1) + … + (cN*xN)
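
The prediction above is a plain weighted sum. A minimal sketch, assuming the first input x0 is fixed at 1 so that its term acts as the intercept (an assumption about the slide's notation); the function name `predict` and the sample numbers are hypothetical:

```python
def predict(coefficients, features):
    """Return the linear combination sum(c_i * x_i)."""
    if len(coefficients) != len(features):
        raise ValueError("coefficients and features must have the same length")
    return sum(c * x for c, x in zip(coefficients, features))

# Hypothetical example: x0 = 1 (intercept input) plus two input variables.
coefficients = [1.0, 2.0, -0.5]
features = [1.0, 3.0, 4.0]
y = predict(coefficients, features)  # 1*1 + 2*3 + (-0.5)*4
```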
Machine Learning and Optimization

 Algorithms

 (Convergent) Iterative Methods

   Newton’s Method
   Quasi-Newton
   Gradient Descent
 Heuristics

   AntNet
   PSO
   Genetic Algorithms
Stochastic Gradient Descent

    Hypothesis about data

    Cost function

    Update function




Andrew Ng’s Tutorial:
https://class.coursera.org/ml/lecture/preview_view/11
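
The three pieces named on this slide were shown as formula images that did not survive in the transcript. As a hedged reconstruction, the usual textbook forms for squared-error linear regression are given below. The symbols c_i and x_i follow the linear regression slide; the learning rate α and the per-example cost J are assumptions, not taken from the slides.

```latex
\begin{aligned}
\text{Hypothesis:}\quad & h_c(x) = \sum_{i=0}^{N} c_i x_i \\
\text{Cost (one example):}\quad & J(c) = \tfrac{1}{2}\bigl(h_c(x) - y\bigr)^2 \\
\text{Update:}\quad & c_i \leftarrow c_i - \alpha\,\bigl(h_c(x) - y\bigr)\,x_i
\end{aligned}
```

The update is the cost's partial derivative with respect to c_i, scaled by α and subtracted, applied after each training example.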
Stochastic Gradient Descent
Training
  Simple gradient descent procedure
  Loss function needs to be convex (with exceptions)
Linear Regression
  Loss Function: squared error of prediction
  Prediction: linear combination of coefficients and input variables
[Diagram: Training Data → SGD → Model]
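
A minimal single-process sketch of the training loop described here: squared-error loss, with the prediction as a linear combination of coefficients and inputs. This is not Mahout's or Metronome's implementation; the function name, the default learning rate and the toy data are all hypothetical.

```python
import random

def sgd_linear_regression(rows, targets, learning_rate=0.01, epochs=200, seed=0):
    """Fit coefficients by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    coefficients = [0.0] * len(rows[0])
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a new random order each epoch
        for i in order:
            prediction = sum(c * x for c, x in zip(coefficients, rows[i]))
            error = prediction - targets[i]
            # Gradient of 0.5 * error**2 with respect to c_j is error * x_j.
            coefficients = [c - learning_rate * error * x
                            for c, x in zip(coefficients, rows[i])]
    return coefficients

# Hypothetical noiseless data generated from y = 2 + 3*x, with x0 = 1 as the intercept input.
rows = [[1.0, float(x)] for x in range(-5, 6)]
targets = [2.0 + 3.0 * r[1] for r in rows]
model = sgd_linear_regression(rows, targets)
```

On this toy data the fitted coefficients should land close to 2 and 3. With real data the learning rate usually needs tuning.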
Mahout’s SGD
 Currently Single Process
  Multi-threaded parallel, but not cluster parallel
  Runs locally, not deployed to the cluster
  Tied to logistic regression implementation
Current Limitations
Sequential algorithms on a single node only go so far

The “Data Deluge”
 Presents algorithmic challenges when combined with
 large data sets
 Need to design algorithms that are able to perform in a
 distributed fashion
MapReduce only fits certain types of algorithms
Distributed Learning Strategies

 McDonald, 2010
   Distributed Training Strategies for the Structured
   Perceptron
 Langford, 2007
   Vowpal Wabbit
 Jeff Dean’s Work on Parallel SGD
   DownPour SGD
   Sandblaster
MapReduce vs. Parallel Iterative
[Diagram: MapReduce flows Input → Map (x3) → Reduce (x2) → Output. Parallel Iterative runs a row of Processors through Superstep 1, then Superstep 2, and so on.]
YARN
 Yet Another Resource Negotiator
 Framework for scheduling distributed applications
 Allows for any type of parallel application to run natively on
 hadoop
 MRv2 is now a distributed application
[Diagram: Clients submit jobs to the Resource Manager; Node Managers host Containers and App Mstrs. Arrow legend: MapReduce Status, Job Submission, Node Status, Resource Request.]
IterativeReduce
 Designed specifically for parallel iterative
 algorithms on Hadoop
   Implemented directly on top of YARN
 Intrinsic Parallelism
   Easier to focus on problem
   Not focusing on the distributed application part
IterativeReduce API
 ComputableMaster
  Setup()
  Compute()
  Complete()
 ComputableWorker
  Setup()
  Compute()
[Diagram: a row of Workers feeds a Master, which feeds the next row of Workers, and so on.]
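
To make the two roles concrete, here is a hedged Python sketch that mirrors only the method names listed on the slide. The real IterativeReduce API is Java and its signatures are not shown here, so every argument, return value and helper below (`run_supersteps`, `CountWorker`, `SumMaster`) is hypothetical.

```python
from abc import ABC, abstractmethod

class ComputableWorker(ABC):
    """Hypothetical mirror of the worker role: setup once, compute each superstep."""
    @abstractmethod
    def setup(self, config):
        """Prepare the worker before the first superstep."""

    @abstractmethod
    def compute(self, records):
        """Process this worker's split and return a partial result."""

class ComputableMaster(ABC):
    """Hypothetical mirror of the master role: setup, compute per superstep, complete."""
    @abstractmethod
    def setup(self, config):
        """Prepare the master before the first superstep."""

    @abstractmethod
    def compute(self, worker_updates):
        """Combine the partial results from all workers."""

    @abstractmethod
    def complete(self):
        """Return the final result after the last superstep."""

class CountWorker(ComputableWorker):
    def setup(self, config):
        self.config = config

    def compute(self, records):
        return len(records)  # toy partial result: size of the split

class SumMaster(ComputableMaster):
    def setup(self, config):
        self.total = 0

    def compute(self, worker_updates):
        self.total += sum(worker_updates)
        return self.total

    def complete(self):
        return self.total

def run_supersteps(master, workers, splits, supersteps, config=None):
    """Drive workers and master in one process, one loop turn per superstep."""
    master.setup(config)
    for worker in workers:
        worker.setup(config)
    for _ in range(supersteps):
        updates = [w.compute(s) for w, s in zip(workers, splits)]
        master.compute(updates)
    return master.complete()

splits = [[1, 2, 3], [4, 5]]
result = run_supersteps(SumMaster(), [CountWorker(), CountWorker()], splits, supersteps=2)
```

The point of the split is that algorithm code lives in `compute`, while scheduling and message passing stay in the framework (here reduced to a plain loop).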
SGD Master
 Collects all parameter vectors at each pass /
 superstep
 Produces new global parameter vector
  By averaging workers’ vectors
 Sends update to all workers
  Workers replace local parameter vector with new
  global parameter vector
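
The averaging step can be sketched as an element-wise mean. This is an illustration only, with a hypothetical function name and made-up vectors, not code from Metronome:

```python
def average_parameter_vectors(worker_vectors):
    """Return the element-wise mean of the workers' parameter vectors."""
    if not worker_vectors:
        raise ValueError("need at least one worker vector")
    n = len(worker_vectors)
    return [sum(values) / n for values in zip(*worker_vectors)]

# Hypothetical parameter vectors reported by three workers in one superstep.
worker_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
global_vector = average_parameter_vectors(worker_vectors)
# Each worker would then replace its local vector with a copy of global_vector.
```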
SGD Worker
Each is given a split of the total dataset
  Similar to a map task
Performs local SGD pass

Local parameter vector sent to master at each
superstep

Stays active/resident between iterations
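
Putting the worker and master roles together, the following single-process simulation is a rough sketch of parameter-averaged SGD. It assumes every worker starts each superstep from the current global vector; the function names, learning rate, superstep count and toy data are hypothetical and not taken from Metronome.

```python
def local_sgd_pass(coefficients, rows, targets, learning_rate):
    """One SGD pass over a worker's split, starting from the given vector."""
    coefficients = list(coefficients)
    for x, y in zip(rows, targets):
        error = sum(c * xi for c, xi in zip(coefficients, x)) - y
        coefficients = [c - learning_rate * error * xi
                        for c, xi in zip(coefficients, x)]
    return coefficients

def parallel_sgd(splits, n_features, learning_rate=0.01, supersteps=500):
    """Simulate workers and a master in one process, one loop turn per superstep."""
    global_vector = [0.0] * n_features
    for _ in range(supersteps):
        # Each worker starts from the global vector and trains on its own split.
        partials = [local_sgd_pass(global_vector, rows, targets, learning_rate)
                    for rows, targets in splits]
        # The master averages the partial models into the new global vector.
        global_vector = [sum(v) / len(partials) for v in zip(*partials)]
    return global_vector

# Hypothetical noiseless data from y = 2 + 3*x, divided into three splits.
rows = [[1.0, float(x)] for x in range(-6, 6)]
targets = [2.0 + 3.0 * r[1] for r in rows]
splits = [(rows[i::3], targets[i::3]) for i in range(3)]
model = parallel_sgd(splits, n_features=2)
```

In a real deployment the list comprehension over `splits` would be replaced by workers running in separate containers, which is the part the framework takes care of.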
SGD: Serial vs Parallel
[Diagram: Serial: Training Data → Model. Parallel: Training Data divided into Split 1, Split 2, Split 3; Worker 1, Worker 2, ... Worker N each produce a Partial Model; the Master combines them into the Global Model.]
Parallel Linear Regression with IterativeReduce


  Based directly on work we did with Knitting Boar
    Parallel logistic regression
  Scales linearly with input size
  Can produce a linear regression model off large amounts
  of data
  Packaged in a new suite of parallel iterative algorithms
  called Metronome
    100% Java, ASF 2.0 Licensed, on github
Unit Testing and IRUnit
 Simulates the IterativeReduce parallel framework
   Uses the same app.properties file that YARN
   applications do
 Examples
   https://github.com/jpatanooga/Metronome/blob/master/src/test/java/tv/floe/metronome/linearregression/iterativereduce/TestSimulateLinearRegressionIterativeReduce.java
   https://github.com/jpatanooga/KnittingBoar/blob/master/src/test/java/com/cloudera/knittingboar/sgd/iterativereduce/TestKnittingBoar_IRUnitSim.java
Running the Job via YARN
 Build with Maven

 Copy Jar to host with cluster access

 Copy dataset to HDFS

 Run job
  yarn jar iterativereduce-0.1-SNAPSHOT.jar app.properties
Results
[Chart: Linear Regression - Parallel vs Serial. Y axis: Total Processing Time (0 to 200). X axis: Megabytes Processed Total (64, 128, 192, 256, 320). Series: Parallel Runs, Serial Runs.]
Lessons Learned
 Linear scale continues to be achieved with
 parameter averaging variations
 Tuning is critical
   Need to be good at selecting a learning rate
 YARN still experimental, has caveats
   Container allocation is still slow
   Metronome continues to be experimental
Special Thanks
 Michael Katzenellenbogen

 Dr. James Scott
  University of Texas at Austin
 Dr. Jason Baldridge
  University of Texas at Austin
Future Directions
 More testing, stability
 Cache vectors in memory for speed
 Metronome
   Take on properties of LibLinear
     Pluggable optimization, general linear models
   YARN-centric first class Hadoop citizen
   Focus on being a complement to Mahout
   K-means, PageRank implementations
Github
 IterativeReduce
  https://github.com/emsixteeen/IterativeReduce
 Metronome
  https://github.com/jpatanooga/Metronome
 Knitting Boar
  https://github.com/jpatanooga/KnittingBoar
References
1. http://www.infoworld.com/d/business-intelligence/gartner-hadoop-will-be-in-two-thirds-of-advanced-analytics-products-2015-211475

2. https://cwiki.apache.org/MAHOUT/logistic-regression.html

3. MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail!
   http://arxiv.org/pdf/1209.2191.pdf

Parallel Linear Regression in IterativeReduce and YARN

  • 1.
  • 2. Josh Patterson Email: josh@floe.tv Twitter: @jpatanooga Github: https://github.com/jpatanooga Past: Published in IAAI-09: “TinyTermite: A Secure Routing Algorithm”; Grad work in Meta-heuristics, Ant-algorithms; Tennessee Valley Authority (TVA): Hadoop and the Smartgrid; Cloudera: Principal Solution Architect Today: Independent Consultant
  • 3. Sections 1. Modern Data Analytics 2. Parallel Linear Regression 3. Performance and Results
  • 4.
  • 5. The World as Optimization Data tells us about our model/engine/product We take this data and evolve our product towards a state of minimal market error WSJ Special Section, Monday March 11, 2013 Zynga changing games based off player behavior UPS cut fuel consumption by 8.4MM gallons Ford used sentiment analysis to look at how new car features would be received
  • 6. The Modern Data Landscape Apps are coming but they need Platforms Components Workflows Lots of investment in Hadoop in this space Lots of ETL pipelines Lots of descriptive Statistics Growing interest in Machine Learning
  • 7. Hadoop as The Linux of Data Hadoop has won the Cycle Gartner: Hadoop will be in 2/3s of advanced analytics products by 2015 [1] “Hadoop is the kernel of a distributed operating system, and all the other components around the kernel are now arriving on this stage” ---Doug Cutting
  • 8. Today’s Hadoop ML Pipeline Data cleansing / ETL performed with Hive or Pig Data processed in place: Mahout, R, Custom MapReduce Algorithm Or externally processed: SAS, SPSS, KXEN, Weka
  • 9. As Focus Shifts to Applications Data rates have been climbing fast Speed at Scale becomes the new Killer App Companies will want to leverage the Big Data infrastructure they’ve already been working with Hadoop HDFS as main storage system A drive to validate big data investments with results Emergence of applications which create “data products”
  • 10. Patterson’s Law “As the percent of your total data held in a storage system approaches 100% the amount of in-system processing and analytics also approaches 100%”
  • 11. Tools Will Move onto Hadoop Already seeing this with Vendors Who hasn’t announced a SQL engine on Hadoop lately? Trend will continue with machine learning tools Mahout was the beginning More are following But what about parallel iterative algorithms?
  • 12. Distributed Systems Are Hard Lots of moving parts Especially as these applications become more complicated Machine learning can be a non-trivial operation We need great building blocks that work well together I agree with Jimmy Lin [3]: “keep it simple” “make sure costs don’t outweigh benefits” Minimize “Yet Another Tool To Learn” (YATTL) as much as we can!
  • 13. To Summarize Data moving into Hadoop everywhere Patterson’s Law Focus on Hadoop, build around next-gen “Linux of data” Need simple components to build next-gen data base apps They should work cleanly with the cluster that the Fortune 500 has: Hadoop Also should be easy to integrate into Hadoop and with the hadoop-tool ecosystem Minimize YATTL
  • 14.
  • 15. Linear Regression In linear regression, data is modeled using linear predictor functions, and unknown model parameters are estimated from the data. We use optimization techniques like Stochastic Gradient Descent to find the coefficients in the model Y = (1*x0) + (c1*x1) + … + (cN*xN)
  • 16. Machine Learning and Optimization Algorithms (Convergent) Iterative Methods Newton’s Method Quasi-Newton Gradient Descent Heuristics AntNet PSO Genetic Algorithms
  • 17. Stochastic Gradient Descent Hypothesis about data Cost function Update function Andrew Ng’s Tutorial: https://class.coursera.org/ml/lecture/preview_view/11
  • 18. Stochastic Gradient Descent Simple gradient descent procedure Loss function needs to be convex (with exceptions) Linear Regression Loss Function: squared error of prediction Prediction: linear combination of model coefficients and input variables [Diagram labels: Training Data, Training, SGD]
  • 19. Mahout’s SGD Currently Single Process Multi-threaded parallel, but not cluster parallel Runs locally, not deployed to the cluster Tied to logistic regression implementation
  • 20. Current Limitations Sequential algorithms on a single node only go so far The “Data Deluge” presents algorithmic challenges when combined with large data sets We need to design algorithms that are able to perform in a distributed fashion MapReduce only fits certain types of algorithms
  • 21. Distributed Learning Strategies McDonald, 2010 Distributed Training Strategies for the Structured Perceptron Langford, 2007 Vowpal Wabbit Jeff Dean’s Work on Parallel SGD DownPour SGD Sandblaster
  • 22. MapReduce vs. Parallel Iterative [Diagram: MapReduce: Input, Map tasks, Reduce tasks, Output; Parallel Iterative: a row of Processors at Superstep 1, a row of Processors at Superstep 2, . . .]
  • 23. YARN Yet Another Resource Negotiator Framework for scheduling distributed applications Allows for any type of parallel application to run natively on Hadoop MRv2 is now a distributed application [Diagram: Clients, Resource Manager, Node Managers, App Mstr, Containers; arrows labeled MapReduce Status, Job Submission, Node Status, Resource Request]
  • 24. IterativeReduce Designed specifically for parallel iterative algorithms on Hadoop Implemented directly on top of YARN Intrinsic Parallelism Easier to focus on problem Not focusing on the distributed application part
  • 25. IterativeReduce API ComputableMaster: Setup(), Compute(), Complete() ComputableWorker: Setup(), Compute() [Diagram: rows of Workers alternating with a Master across successive steps . . .]
  • 26. SGD Master Collects all parameter vectors at each pass / superstep Produces new global parameter vector By averaging workers’ vectors Sends update to all workers Workers replace local parameter vector with new global parameter vector
  • 27. SGD Worker Each given a split of the total dataset Similar to a map task Performs local SGD pass Local parameter vector sent to master at superstep Stays active/resident between iterations
  • 28. SGD: Serial vs Parallel [Diagram: serial side shows Training Data and Model; parallel side shows Split 1, Split 2, Split 3 with Worker 1, Worker 2, … Worker N, each with a Partial Model, plus a Master and a Global Model]
  • 29. Parallel Linear Regression with IterativeReduce Based directly on work we did with Knitting Boar Parallel logistic regression Scales linearly with input size Can produce a linear regression model off large amounts of data Packaged in a new suite of parallel iterative algorithms called Metronome 100% Java, ASF 2.0 Licensed, on github
  • 30. Unit Testing and IRUnit Simulates the IterativeReduce parallel framework Uses the same app.properties file that YARN applications do Examples https://github.com/jpatanooga/Metronome/blob/master/src/test/java/tv/floe/metronome/linearregression/iterativereduce/TestSimulateLinearRegressionIterativeReduce.java https://github.com/jpatanooga/KnittingBoar/blob/master/src/test/java/com/cloudera/knittingboar/sgd/iterativereduce/TestKnittingBoar_IRUnitSim.java
  • 31.
  • 32. Running the Job via YARN Build with Maven Copy Jar to host with cluster access Copy dataset to HDFS Run job: yarn jar iterativereduce-0.1-SNAPSHOT.jar app.properties
  • 33. Results [Chart: Linear Regression - Parallel vs Serial; y-axis: Total Processing Time (0 to 200); x-axis: Megabytes Processed Total (64 to 320); series: Parallel Runs, Serial Runs]
  • 34. Lessons Learned Linear scale continues to be achieved with parameter averaging variations Tuning is critical Need to be good at selecting a learning rate YARN still experimental, has caveats Container allocation is still slow Metronome continues to be experimental
  • 35. Special Thanks Michael Katzenellenbollen Dr. James Scott University of Texas at Austin Dr. Jason Baldridge University of Texas at Austin
  • 36. Future Directions More testing, stability Cache vectors in memory for speed Metronome Take on properties of LibLinear Pluggable optimization, general linear models YARN-centric first class Hadoop citizen Focus on being a complement to Mahout K-means, PageRank implementations
  • 37. Github IterativeReduce https://github.com/emsixteeen/IterativeReduce Metronome https://github.com/jpatanooga/Metronome Knitting Boar https://github.com/jpatanooga/KnittingBoar
  • 38. References 1. http://www.infoworld.com/d/business-intelligence/gartner-hadoop-will-be-in-two-thirds-of-advanced-analytics-products-2015-211475 2. https://cwiki.apache.org/MAHOUT/logistic-regression.html 3. MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail! http://arxiv.org/pdf/1209.2191.pdf
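
The SGD master and worker roles from slides 26 to 28 can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: it is written in Python rather than the Java used by Metronome, it does not use the IterativeReduce or YARN APIs, and every function name, the learning rate, the superstep count, and the synthetic data are hypothetical choices made for illustration. Each "worker" runs one local SGD pass on its split using squared-error loss, and the "master" averages the resulting parameter vectors into a new global vector.

```python
import random

def predict(weights, features):
    """Linear prediction: dot product of coefficients and input variables."""
    return sum(w * x for w, x in zip(weights, features))

def sgd_pass(weights, rows, learning_rate):
    """One local SGD pass over (features, target) rows with squared-error loss."""
    weights = list(weights)  # work on a copy; the caller's vector is unchanged
    for features, target in rows:
        error = predict(weights, features) - target
        # Gradient of 0.5 * error**2 with respect to weight j is error * x_j.
        for j, x in enumerate(features):
            weights[j] -= learning_rate * error * x
    return weights

def average(vectors):
    """Master step: element-wise mean of the workers' parameter vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def train_parallel(splits, n_features, supersteps=200, learning_rate=0.05):
    """Run the master/worker supersteps one after another in a single process."""
    global_weights = [0.0] * n_features
    for _ in range(supersteps):
        # Every worker starts from the current global vector and trains on its split.
        local = [sgd_pass(global_weights, split, learning_rate) for split in splits]
        # The averaged vector replaces each worker's local vector in the next superstep.
        global_weights = average(local)
    return global_weights

# Synthetic, noise-free data for y = 2 + 3*x1, with x0 = 1 as the bias input.
rng = random.Random(42)
rows = []
for _ in range(300):
    x1 = rng.uniform(-1.0, 1.0)
    rows.append(([1.0, x1], 2.0 + 3.0 * x1))
splits = [rows[i::3] for i in range(3)]  # three simulated workers, one split each

model = train_parallel(splits, n_features=2)
print(model)
```

On this noise-free toy data the averaged model settles close to the coefficients used to generate it. With real data the learning rate needs tuning, as the Lessons Learned slide points out.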

Editor's Notes

  1. Reference some thoughts on attribution pipelines
  2. Talk about how you normally would use the Normal equation, notes from Andrew Ng
  3. “Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems.” (Bottou, 2010) SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners, with no loss of model accuracy.
  4. “Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems.” (Bottou, 2010) SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners, with no loss of model accuracy.
  5. The most important additions in Mahout’s SGD are: confidence-weighted learning rates per term, evolutionary tuning of hyper-parameters, mixed ranking and regression, and grouped AUC. The implication of it being local is that you are limited to the compute capacity of the local machine, as opposed to even a single machine on the cluster.
  6. At current disk bandwidth and capacity (2 TB at 100 MB/s throughput), it takes about 6 hours to read the contents of a single hard disk.
  7. Bottou similar to Xu2010 in the 2010 paper
  8. Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures. Acyclic data flow is a powerful abstraction, but is not efficient for applications that repeatedly reuse a working set of data, such as iterative algorithms (many in machine learning). No single programming model or framework can excel at every problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
  9. Performance still largely dependent on implementation of algo
  10. POLR: Parallel Online Logistic Regression. Talking points: wanted to start with a tool known to the Hadoop community, with expected characteristics. Mahout’s SGD is well known, and so we used that as a base point.
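
The single-disk read time in note 6 can be checked with a short calculation. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and purely sequential reads at the quoted throughput; the note states neither, and the helper name is hypothetical.

```python
def hours_to_read(capacity_bytes, throughput_bytes_per_sec):
    """Hours needed to stream an entire disk sequentially at a fixed throughput."""
    return capacity_bytes / throughput_bytes_per_sec / 3600.0

TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

single_disk_hours = hours_to_read(2 * TB, 100 * MB)
print(f"{single_disk_hours:.1f} hours")  # prints "5.6 hours"
```

Under these assumptions the result is roughly five and a half hours, in line with the rounded figure in the note.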