In this session, we will introduce “Knitting Boar”, an open-source Java library for performing distributed online learning on a Hadoop cluster under YARN. We will give an overview of how Knitting Boar works and examine the lessons learned from YARN application construction.
1. KNITTING BOAR
Building Machine Learning Tools with Hadoop's YARN
Josh Patterson
Principal Solutions Architect
Michael Katzenellenbogen
Principal Solutions Architect
2. ✛ Josh Patterson - josh@cloudera.com
> Master's Thesis: self-organizing mesh networks
∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
> Conceived, built, and led Hadoop integration for the openPDC project at the Tennessee Valley Authority (TVA)
✛ Michael Katzenellenbogen - michael@cloudera.com
> Principal Solutions Architect @ Cloudera
> Systems Guy ('nuff said)
3. ✛ Intro / Background
✛ Introducing Knitting Boar
✛ Integrating Knitting Boar and YARN
✛ Results and Lessons Learned
5. ✛ Why Machine Learning?
> Growing interest in predictive modeling
✛ Linear Models are Simple, Useful
> Stochastic Gradient Descent is a very popular tool for building linear models like Logistic Regression
✛ Building Models Is Still Time-Consuming
> The “Need for speed”
> “More data beats a cleverer algorithm”
6. ✛ Parallelize Mahout’s Stochastic Gradient Descent
> With as few extra dependencies as possible
✛ Wanted to explore parallel iterative algorithms using YARN
> Wanted a first-class Hadoop YARN citizen
> Work through dev progressions towards a stable state
> Worry about “frameworks” later
7. ✛ Training
> Simple gradient descent procedure
> Loss function needs to be convex
✛ Prediction
> Logistic Regression:
∗ Sigmoid function with (parameter vector · example) in the exponential
[Diagram: Training Data → SGD → Model (parameter vector)]
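To make those two bullets concrete, here is a minimal hand-rolled sketch (not Mahout's OnlineLogisticRegression code): prediction is the sigmoid of the parameter vector dotted with the example, and a training step moves the parameters along the gradient of the convex logistic loss.

```java
// Illustrative sketch only (not Mahout's OnlineLogisticRegression): logistic
// regression prediction and one online SGD step on a single training example.
public class SgdSketch {

    // p(y = 1 | x) = 1 / (1 + exp(-(theta . x)))
    static double predict(double[] theta, double[] x) {
        double dot = 0.0;
        for (int i = 0; i < theta.length; i++) {
            dot += theta[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    // One SGD step on the (convex) logistic loss:
    // theta <- theta + learningRate * (y - p) * x
    static void trainOne(double[] theta, double[] x, double y, double learningRate) {
        double p = predict(theta, x);
        for (int i = 0; i < theta.length; i++) {
            theta[i] += learningRate * (y - p) * x[i];
        }
    }
}
```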
8. ✛ Currently Single Process
> Multi-threaded parallel, but not cluster parallel
> Runs locally, not deployed to the cluster
✛ Defined in:
> https://cwiki.apache.org/MAHOUT/logistic-regression.html
9. Current Limitations
✛ Sequential algorithms on a single node only go so far
✛ The “Data Deluge”
> Presents algorithmic challenges when combined with large data sets
> Need to design algorithms that are able to perform in a distributed fashion
✛ MapReduce only fits certain types of algorithms
10. Distributed Learning Strategies
✛ Langford, 2007
> Vowpal Wabbit
✛ McDonald, 2010
> Distributed Training Strategies for the Structured Perceptron
✛ Dekel, 2010
> Optimal Distributed Online Prediction Using Mini-Batches
12. “Are the gains gotten from using X worth the integration costs incurred in building the end-to-end solution? If no, then operationally, we can consider the Hadoop stack … there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc.”
– Lin, 2012
14. ✛ Parallel iterative implementation of SGD on YARN
✛ Workers work on partitions of the data
✛ Master keeps a global copy of the merged parameter vector
15. ✛ Each worker is given a split of the total dataset
> Similar to a map task
✛ Uses a modified OLR
> Processes N samples in a batch (a subset of its split)
✛ Batched gradient accumulation updates are sent to the master node
> The gradient influences future model vectors towards better predictions
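A minimal sketch of the worker-side step described above, with hypothetical class and method names (this is not Knitting Boar's actual API): train OLR over one batch of the local split, expose the resulting parameter vector as the update for the master, and accept the merged global vector back.

```java
import java.util.List;

// Hypothetical worker: processes one batch (a subset of its split) using the
// SGD step from the earlier sketch, then exposes its parameter vector so the
// batched update can be sent to the master node.
class PolrWorkerSketch {
    private final double[] theta;
    private final double learningRate;

    PolrWorkerSketch(int numFeatures, double learningRate) {
        this.theta = new double[numFeatures];
        this.learningRate = learningRate;
    }

    /** One mini-batch of examples (features plus labels) from this worker's split. */
    void processBatch(List<double[]> batchFeatures, List<Double> batchLabels) {
        for (int i = 0; i < batchFeatures.size(); i++) {
            SgdSketch.trainOne(theta, batchFeatures.get(i), batchLabels.get(i), learningRate);
        }
    }

    /** The batched update that would be sent to the master node. */
    double[] parameterVector() {
        return theta.clone();
    }

    /** Replace the local parameter vector with the new global vector from the master. */
    void applyGlobalUpdate(double[] globalTheta) {
        System.arraycopy(globalTheta, 0, theta, 0, theta.length);
    }
}
```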
16. ✛ Accumulates gradient updates
> From batches of worker OLR runs
✛ Produces new global parameter vector
> By averaging workers' vectors
✛ Sends update to all workers
> Workers replace local parameter vector with new global parameter vector
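The corresponding master-side step, again as a hypothetical sketch of the averaging scheme the slide describes rather than the real implementation:

```java
import java.util.List;

// Hypothetical master: averages the workers' parameter vectors into a new
// global parameter vector, which is then sent back to every worker.
class PolrMasterSketch {
    double[] averageWorkerVectors(List<double[]> workerVectors) {
        double[] global = new double[workerVectors.get(0).length];
        for (double[] w : workerVectors) {
            for (int i = 0; i < global.length; i++) {
                global[i] += w[i];
            }
        }
        for (int i = 0; i < global.length; i++) {
            global[i] /= workerVectors.size();
        }
        return global;   // each worker replaces its local vector with this one
    }
}
```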
17. Knitting Boar's POLR
[Diagram: the Training Data is divided into Split 1, Split 2, Split 3, …; Worker 1 through Worker N each run OnlineLogisticRegression over their split to produce a Partial Model; the Master combines the partial models into the Global Model.]
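Putting the two sketches together, a single-process simulation of one POLR round might look like the following (purely illustrative; in Knitting Boar the workers and the master run in separate containers on the cluster):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Single-process simulation of one POLR round using the earlier sketches.
public class PolrRoundSketch {
    public static void main(String[] args) {
        int numFeatures = 3;
        List<PolrWorkerSketch> workers = new ArrayList<>();
        for (int w = 0; w < 3; w++) {
            workers.add(new PolrWorkerSketch(numFeatures, 0.1));
        }

        // Each worker trains on a tiny hard-coded "batch" standing in for its split.
        for (PolrWorkerSketch worker : workers) {
            List<double[]> features = Arrays.asList(new double[] {1.0, 0.5, -0.2});
            List<Double> labels = Arrays.asList(1.0);
            worker.processBatch(features, labels);
        }

        // The master averages the workers' vectors into the new global vector...
        List<double[]> updates = new ArrayList<>();
        for (PolrWorkerSketch worker : workers) {
            updates.add(worker.parameterVector());
        }
        double[] global = new PolrMasterSketch().averageWorkerVectors(updates);

        // ...and every worker replaces its local vector with the global one.
        for (PolrWorkerSketch worker : workers) {
            worker.applyGlobalUpdate(global);
        }
    }
}
```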
19. ✛ Yet Another Resource Negotiator
✛ Framework for scheduling distributed applications
✛ Typically runs on top of an HDFS cluster
> Though not required, nor is it coupled to HDFS
✛ MRv2 is now a distributed application
[Diagram: YARN architecture. Clients handle Job Submission to the Resource Manager; an App Mstr runs in a Container on a Node Manager, sends Resource Requests, and receives MapReduce Status; Node Managers report Node Status and host the application's Containers.]
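To give a feel for the client-side plumbing involved in launching such a distributed application, here is a minimal sketch against the Hadoop 2.x YarnClient API; the application name, jar, and App Master command are placeholders, and this is not Knitting Boar's actual launcher.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class MinimalYarnSubmit {
    public static void main(String[] args) throws Exception {
        // Connect to the Resource Manager.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask for a new application id and fill in the submission context.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("polr-sketch");   // placeholder name

        // Describe the container that will run the Application Master.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
                "java -cp app.jar SketchApplicationMaster"));   // placeholder command
        appContext.setAMContainerSpec(amContainer);

        // Resources the App Master container needs.
        Resource capability = Records.newRecord(Resource.class);
        capability.setMemory(512);
        capability.setVirtualCores(1);
        appContext.setResource(capability);

        // Hand the application to the Resource Manager for scheduling.
        yarnClient.submitApplication(appContext);
    }
}
```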
20. ✛ High setup / teardown costs
✛ Not designed for super-step operations
✛ Need to refactor the problem to fit MapReduce
> With YARN, we can now just launch a distributed application instead
21. ✛ Designed specifically for parallel iterative algorithms on Hadoop
> Implemented directly on top of YARN
✛ Intrinsic Parallelism
> Easier to focus on the problem
> Not focusing on the distributed application part
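One way to picture the programming model this enables is a pair of hypothetical interfaces (illustrative only, not the actual IterativeReduce API): the framework owns the YARN plumbing, and the algorithm author fills in the per-worker compute step and the master-side merge step.

```java
import java.util.List;

// Hypothetical interfaces sketching the division of labor: the framework
// handles containers, messaging, and barriers; the algorithm supplies these.
interface WorkerStep<UPDATE> {
    UPDATE compute(List<String> recordsFromSplit);   // e.g. one OLR pass over a batch
    void applyGlobalUpdate(UPDATE global);           // replace local state with merged state
}

interface MasterStep<UPDATE> {
    UPDATE merge(List<UPDATE> workerUpdates);        // e.g. average parameter vectors
}
```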
27. ✛ Parallel SGD
> The Boar is temperamental, experimental
∗ Linear speedup (roughly)
✛ Developing YARN Applications
> More complex than just MapReduce
> Requires lots of “plumbing”
✛ IterativeReduce
> Great native-Hadoop way to implement algorithms
> Easy to use and well integrated
29. The Road Ahead
✛ SGD
> More testing
> Demo use cases
✛ IterativeReduce
> Reliability
> Durability
Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg
30. ✛ Mahout's SGD implementation
> http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
✛ Hadoop AllReduce and Terascale Learning
> http://hunch.net/?p=2094
✛ MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!
> http://arxiv.org/pdf/1209.2191v1.pdf
Vowpal Wabbit: doesn't natively run on Hadoop. Spark: Scala, overhead, integration issues.
“Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems.” – Bottou, 2010. SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners, with no loss in model accuracy.
The most important additions in Mahout's SGD are: confidence-weighted learning rates per term, evolutionary tuning of hyper-parameters, mixed ranking and regression, and grouped AUC. The implication of it being local is that you are limited to the compute capacity of the local machine, as opposed to even a single machine on the cluster.
At current disk bandwidth and capacity (2 TB at 100 MB/s throughput), it takes roughly 6 hours to read the contents of a single hard drive (2,000,000 MB / 100 MB/s ≈ 20,000 s ≈ 5.6 hours).
Bottou, in the 2010 paper, is similar to Xu, 2010.
Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures. Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data: iterative algorithms (many in machine learning). No single programming model or framework can excel at every problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
Some of these are in progress towards being ready on YARN, some not; wanted to focus on OLR and not framework for now
“say hello to my leeeeetle friend….”
POLR: Parallel Online Logistic Regression. Talking points: we wanted to start with a tool known to the Hadoop community, with expected characteristics. Mahout's SGD is well known, so we used that as a base point.
Segue into YARN.
Performance is still largely dependent on the implementation of the algorithm.
3 major costs of BSP-style computations: max unit compute time, cost of global communication, and cost of the barrier sync at the end of each super step.
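(A rough back-of-the-envelope model: the wall-clock time of one super step is approximately T_superstep ≈ max_i(T_compute_i) + T_global_comm + T_barrier, so the slowest worker plus the synchronization overhead set the pace.)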
Multi-dimensional: need to constantly think about the Client, the Master, and the Worker, how they interact and the implications of failures, etc.
Basecamp: use the story of how we get to basecamp to see how to climb some more.