In this session, we will introduce “Knitting Boar”, an open-source Java library for performing distributed online learning on a Hadoop cluster under YARN. We will give an overview of how Knitting Boar works and examine the lessons learned from YARN application construction.
1. KNITTING BOAR
Building Machine Learning Tools with Hadoop's YARN
Josh Patterson
Principal Solutions Architect
Michael Katzenellenbogen
Principal Solutions Architect
2. ✛ Josh Patterson - josh@cloudera.com
> Master's Thesis: self-organizing mesh networks
∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
> Conceived, built, and led Hadoop integration for the openPDC project at the Tennessee Valley Authority (TVA)
✛ Michael Katzenellenbogen - michael@cloudera.com
> Principal Solutions Architect @ Cloudera
> Systems Guy ('nuff said)
3. ✛ Intro / Background
✛ Introducing Knitting Boar
✛ Integrating Knitting Boar and YARN
✛ Results and Lessons Learned
5. ✛ Why Machine Learning?
> Growing interest in predictive modeling
✛ Linear Models are Simple, Useful
> Stochastic Gradient Descent is a very popular tool for building linear models like Logistic Regression
✛ Building Models Is Still Time-Consuming
> The “Need for speed”
> “More data beats a cleverer algorithm”
6. ✛ Parallelize Mahout’s Stochastic Gradient Descent
> With as few extra dependencies as possible
✛ Wanted to explore parallel iterative algorithms using YARN
> Wanted a first-class Hadoop YARN citizen
> Work through dev progressions towards a stable state
> Worry about “frameworks” later
7. ✛ Training
> Simple gradient descent procedure
> Loss function needs to be convex
✛ Prediction
> Logistic Regression:
∗ Sigmoid function with (parameter vector · example) in the exponential
[Diagram: Training Data → SGD → Model (parameter vector)]
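To make those two bullets concrete, here is a minimal hand-rolled sketch (not Mahout's OnlineLogisticRegression code): prediction is the sigmoid of the parameter vector dotted with the example, and a training step moves the parameters along the gradient of the convex logistic loss.

```java
// Illustrative sketch only (not Mahout's OnlineLogisticRegression): logistic
// regression prediction and one online SGD step on a single training example.
public class SgdSketch {

    // p(y = 1 | x) = 1 / (1 + exp(-(theta . x)))
    static double predict(double[] theta, double[] x) {
        double dot = 0.0;
        for (int i = 0; i < theta.length; i++) {
            dot += theta[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    // One SGD step on the (convex) logistic loss:
    // theta <- theta + learningRate * (y - p) * x
    static void trainOne(double[] theta, double[] x, double y, double learningRate) {
        double p = predict(theta, x);
        for (int i = 0; i < theta.length; i++) {
            theta[i] += learningRate * (y - p) * x[i];
        }
    }
}
```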
8. ✛ Currently Single Process
> Multi-threaded parallel, but not cluster parallel
> Runs locally, not deployed to the cluster
✛ Defined in:
> https://cwiki.apache.org/MAHOUT/logistic-regression.html
9. Current Limitations
✛ Sequential algorithms on a single node only go so far
✛ The “Data Deluge”
> Presents algorithmic challenges when combined with large data sets
> Need to design algorithms that are able to perform in a distributed fashion
✛ MapReduce only fits certain types of algorithms
10. Distributed Learning Strategies
✛ Langford, 2007
> Vowpal Wabbit
✛ McDonald, 2010
> Distributed Training Strategies for the Structured Perceptron
✛ Dekel, 2010
> Optimal Distributed Online Prediction Using Mini-Batches
12. “Are the gains gotten from using X worth the integration costs incurred in building the end-to-end solution? If no, then operationally, we can consider the Hadoop stack … there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc.”
– Lin, 2012
14. ✛ Parallel iterative implementation of SGD on YARN
✛ Workers work on partitions of the data
✛ Master keeps a global copy of the merged parameter vector
15. ✛ Each worker is given a split of the total dataset
> Similar to a map task
✛ Uses a modified OLR
> Processes N samples in a batch (a subset of its split)
✛ Batched gradient accumulation updates are sent to the master node
> The gradient influences future model vectors towards better predictions
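A minimal sketch of the worker-side step described above, with hypothetical class and method names (this is not Knitting Boar's actual API): train OLR over one batch of the local split, expose the resulting parameter vector as the update for the master, and accept the merged global vector back.

```java
import java.util.List;

// Hypothetical worker: processes one batch (a subset of its split) using the
// SGD step from the earlier sketch, then exposes its parameter vector so the
// batched update can be sent to the master node.
class PolrWorkerSketch {
    private final double[] theta;
    private final double learningRate;

    PolrWorkerSketch(int numFeatures, double learningRate) {
        this.theta = new double[numFeatures];
        this.learningRate = learningRate;
    }

    /** One mini-batch of examples (features plus labels) from this worker's split. */
    void processBatch(List<double[]> batchFeatures, List<Double> batchLabels) {
        for (int i = 0; i < batchFeatures.size(); i++) {
            SgdSketch.trainOne(theta, batchFeatures.get(i), batchLabels.get(i), learningRate);
        }
    }

    /** The batched update that would be sent to the master node. */
    double[] parameterVector() {
        return theta.clone();
    }

    /** Replace the local parameter vector with the new global vector from the master. */
    void applyGlobalUpdate(double[] globalTheta) {
        System.arraycopy(globalTheta, 0, theta, 0, theta.length);
    }
}
```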
16. ✛ Accumulates gradient updates
> From batches of worker OLR runs
✛ Produces new global parameter vector
> By averaging workers' vectors
✛ Sends update to all workers
> Workers replace local parameter vector with new global parameter vector
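The corresponding master-side step, again as a hypothetical sketch of the averaging scheme the slide describes rather than the real implementation:

```java
import java.util.List;

// Hypothetical master: averages the workers' parameter vectors into a new
// global parameter vector, which is then sent back to every worker.
class PolrMasterSketch {
    double[] averageWorkerVectors(List<double[]> workerVectors) {
        double[] global = new double[workerVectors.get(0).length];
        for (double[] w : workerVectors) {
            for (int i = 0; i < global.length; i++) {
                global[i] += w[i];
            }
        }
        for (int i = 0; i < global.length; i++) {
            global[i] /= workerVectors.size();
        }
        return global;   // each worker replaces its local vector with this one
    }
}
```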
17. Knitting Boar's POLR
[Diagram: the Training Data is divided into Split 1, Split 2, Split 3, …; Worker 1 through Worker N each run OnlineLogisticRegression over their split to produce a Partial Model; the Master combines the partial models into the Global Model.]
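Putting the two sketches together, a single-process simulation of one POLR round might look like the following (purely illustrative; in Knitting Boar the workers and the master run in separate containers on the cluster):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Single-process simulation of one POLR round using the earlier sketches.
public class PolrRoundSketch {
    public static void main(String[] args) {
        int numFeatures = 3;
        List<PolrWorkerSketch> workers = new ArrayList<>();
        for (int w = 0; w < 3; w++) {
            workers.add(new PolrWorkerSketch(numFeatures, 0.1));
        }

        // Each worker trains on a tiny hard-coded "batch" standing in for its split.
        for (PolrWorkerSketch worker : workers) {
            List<double[]> features = Arrays.asList(new double[] {1.0, 0.5, -0.2});
            List<Double> labels = Arrays.asList(1.0);
            worker.processBatch(features, labels);
        }

        // The master averages the workers' vectors into the new global vector...
        List<double[]> updates = new ArrayList<>();
        for (PolrWorkerSketch worker : workers) {
            updates.add(worker.parameterVector());
        }
        double[] global = new PolrMasterSketch().averageWorkerVectors(updates);

        // ...and every worker replaces its local vector with the global one.
        for (PolrWorkerSketch worker : workers) {
            worker.applyGlobalUpdate(global);
        }
    }
}
```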
19. ✛ Yet Another Resource Negotiator
✛ Framework for scheduling distributed applications
✛ Typically runs on top of an HDFS cluster
> Though not required, nor is it coupled to HDFS
✛ MRv2 is now a distributed application
[Diagram: YARN architecture. Clients handle Job Submission to the Resource Manager; an App Mstr runs in a Container on a Node Manager, sends Resource Requests, and receives MapReduce Status; Node Managers report Node Status and host the application's Containers.]
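To give a feel for the client-side plumbing involved in launching such a distributed application, here is a minimal sketch against the Hadoop 2.x YarnClient API; the application name, jar, and App Master command are placeholders, and this is not Knitting Boar's actual launcher.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class MinimalYarnSubmit {
    public static void main(String[] args) throws Exception {
        // Connect to the Resource Manager.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask for a new application id and fill in the submission context.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("polr-sketch");   // placeholder name

        // Describe the container that will run the Application Master.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
                "java -cp app.jar SketchApplicationMaster"));   // placeholder command
        appContext.setAMContainerSpec(amContainer);

        // Resources the App Master container needs.
        Resource capability = Records.newRecord(Resource.class);
        capability.setMemory(512);
        capability.setVirtualCores(1);
        appContext.setResource(capability);

        // Hand the application to the Resource Manager for scheduling.
        yarnClient.submitApplication(appContext);
    }
}
```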
20. ✛ High setup / teardown costs
✛ Not designed for super-step operations
✛ Need to refactor the problem to fit MapReduce
> With YARN, we can now just launch a distributed application instead
21. ✛ Designed specifically for parallel iterative algorithms on Hadoop
> Implemented directly on top of YARN
✛ Intrinsic Parallelism
> Easier to focus on the problem
> Not focusing on the distributed application part
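One way to picture the programming model this enables is a pair of hypothetical interfaces (illustrative only, not the actual IterativeReduce API): the framework owns the YARN plumbing, and the algorithm author fills in the per-worker compute step and the master-side merge step.

```java
import java.util.List;

// Hypothetical interfaces sketching the division of labor: the framework
// handles containers, messaging, and barriers; the algorithm supplies these.
interface WorkerStep<UPDATE> {
    UPDATE compute(List<String> recordsFromSplit);   // e.g. one OLR pass over a batch
    void applyGlobalUpdate(UPDATE global);           // replace local state with merged state
}

interface MasterStep<UPDATE> {
    UPDATE merge(List<UPDATE> workerUpdates);        // e.g. average parameter vectors
}
```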
27. ✛ Parallel SGD
> The Boar is temperamental, experimental
∗ Linear speedup (roughly)
✛ Developing YARN Applications
> More complex than just MapReduce
> Requires lots of “plumbing”
✛ IterativeReduce
> Great native-Hadoop way to implement algorithms
> Easy to use and well integrated
29. The Road Ahead
✛ SGD
> More testing
> Demo use cases
✛ IterativeReduce
> Reliability
> Durability
Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg
30. ✛ Mahout's SGD implementation
> http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
✛ Hadoop AllReduce and Terascale Learning
> http://hunch.net/?p=2094
✛ MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!
> http://arxiv.org/pdf/1209.2191v1.pdf
Vowpal Wabbit: doesn't natively run on Hadoop. Spark: Scala, overhead, integration issues.
“Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems.” – Bottou, 2010. SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners, with no loss in model accuracy.
The most important additions in Mahout's SGD are: confidence-weighted learning rates per term, evolutionary tuning of hyper-parameters, mixed ranking and regression, and grouped AUC. The implication of it being local is that you are limited to the compute capacity of the local machine, as opposed to even a single machine on the cluster.
At current disk bandwidth and capacity (2 TB at 100 MB/s throughput), it takes roughly 6 hours to read the contents of a single hard drive (2,000,000 MB / 100 MB/s ≈ 20,000 s ≈ 5.6 hours).
Bottou, in the 2010 paper, is similar to Xu, 2010.
Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures. Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data: iterative algorithms (many in machine learning). No single programming model or framework can excel at every problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
Some of these are in progress towards being ready on YARN, some not; wanted to focus on OLR and not framework for now
“say hello to my leeeeetle friend….”
POLR: Parallel Online Logistic Regression. Talking points: we wanted to start with a tool known to the Hadoop community, with expected characteristics. Mahout's SGD is well known, so we used that as a base point.
Segue into YARN.
Performance is still largely dependent on the implementation of the algorithm.
3 major costs of BSP-style computations: max unit compute time, cost of global communication, and cost of the barrier sync at the end of each super step.
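(A rough back-of-the-envelope model: the wall-clock time of one super step is approximately T_superstep ≈ max_i(T_compute_i) + T_global_comm + T_barrier, so the slowest worker plus the synchronization overhead set the pace.)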
Multi-dimensional: need to constantly think about the Client, the Master, and the Worker, how they interact and the implications of failures, etc.
Basecamp: use the story of how we get to basecamp to see how to climb some more.