1. KNITTING BOAR
Machine Learning, Mahout, and Parallel Iterative Algorithms
Josh Patterson
Principal Solutions Architect
2. ✛ Josh Patterson
> Master’s Thesis: self-organizing mesh networks
∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
> Conceived, built, and led Hadoop integration for the openPDC project at the Tennessee Valley Authority (TVA)
> Twitter: @jpatanooga
> Email: josh@floe.tv
3. ✛ Introduction to Machine Learning
✛ Mahout
✛ Knitting Boar and YARN
✛ Parting Thoughts
5. ✛ What is Data Mining?
> “the process of extracting patterns from data”
✛ Why are we interested in Data Mining?
> Raw data essentially useless
∗ Data is simply recorded facts
∗ Information is the patterns underlying the data
✛ Machine Learning
> Algorithms for acquiring structural descriptions from data “examples”
∗ Process of learning “concepts”
6. ✛ Information Retrieval
> Information science, information architecture, cognitive psychology, linguistics, and statistics
✛ Natural Language Processing
> Grounded in machine learning, especially statistical machine learning
✛ Statistics
> Math and stuff
✛ Machine Learning
> Considered a branch of artificial intelligence
7. ✛ ETL
✛ Joining multiple disparate data sources
✛ Filtering data
✛ Aggregation
✛ Cube materialization
“Descriptive Statistics”
8. ✛ Data collection performed with Flume
✛ Data cleansing / ETL performed with Hive or Pig
✛ ML work performed with
> SAS
> SPSS
> R
> Mahout
10. ✛ Classification
> “Fraud detection”
✛ Recommendation
> “Collaborative Filtering”
✛ Clustering
> “Segmentation”
✛ Frequent Itemset Mining
11. ✛ Stochastic Gradient Descent
> Single process
> Logistic Regression Model Construction (usage sketch after this slide)
✛ Naïve Bayes
> MapReduce-based
> Text Classification
✛ Random Forests
> MapReduce-based
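For the sequential SGD entry above, a minimal usage sketch. This assumes the Mahout 0.7-era org.apache.mahout.classifier.sgd API (OnlineLogisticRegression with an L1 prior); the feature values, learning rate, and lambda are illustrative, and exact method names may differ across Mahout versions.

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class OlrExample {
    public static void main(String[] args) {
        // Binary classifier over 4 features, L1-regularized
        // (assumes the Mahout 0.7-era fluent API)
        OnlineLogisticRegression olr =
            new OnlineLogisticRegression(2, 4, new L1())
                .learningRate(0.1)
                .lambda(1.0e-4);

        // One labeled training example: label in {0, 1}
        Vector features = new DenseVector(new double[] {1.0, 0.5, -0.2, 0.8});
        olr.train(1, features);                  // a single sequential SGD step

        double p = olr.classifyScalar(features); // estimated P(label == 1)
        System.out.println("P(y=1) = " + p);
    }
}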
12. ✛ An algorithm that looks at a user’s past actions and suggests
> Products
> Services
> People
✛ Advertisement
> Cloudera has a great Data Science training course on this topic
> http://university.cloudera.com/training/data_science/introduction_to_data_science_-_building_recommender_systems.html
13. ✛ Cluster words across docs to identify topics
✛ Latent Dirichlet Allocation
14. ✛ Why Machine Learning?
> Growing interest in predictive modeling
✛ Linear Models are Simple, Useful
> Stochastic Gradient Descent is a very popular tool for building linear models like Logistic Regression
✛ Building Models Is Still Time-Consuming
> The “Need for speed”
> “More data beats a cleverer algorithm”
16. ✛ Parallelize Mahout’s Stochastic Gradient Descent
> With as few extra dependencies as possible
✛ Wanted to explore parallel iterative algorithms using YARN
> Wanted a first-class Hadoop YARN citizen
> Work through dev progressions towards a stable state
> Worry about “frameworks” later
17. ✛ We Need
> Hypothesis about data
> Cost function
> Update function
✛ Basic Algorithm: Andrew Ng’s Tutorial (the three ingredients are worked out below)
> https://class.coursera.org/ml/lecture/preview_view/11
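For logistic regression, those three ingredients work out as follows, in the notation of Ng’s lecture; the log-loss cost and the learning rate α are the standard choices rather than something this slide fixes:

% Hypothesis: sigmoid of the parameter vector dotted with the example
h_\theta(x) = \frac{1}{1 + e^{-\theta^{\top} x}}

% Cost over m examples: the convex log loss
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]

% Stochastic update: step against the gradient of ONE example at a time
\theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}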
18. ✛ Training
> Simple gradient descent procedure
> Loss function needs to be convex
✛ Prediction
> Logistic Regression:
∗ Sigmoid of the parameter vector (dot) example, as an exponential (see the sketch below)
[Diagram: Training Data → SGD → Model parameters]
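A minimal, self-contained sketch of the prediction and update just described, in plain Java rather than Mahout’s classes; the class name, learning rate handling, and array-based vectors are illustrative assumptions, standing in for what OnlineLogisticRegression does internally.

// Simplified stand-in for Mahout's OnlineLogisticRegression (assumption, not its real API)
public class SgdLogisticRegression {
    private final double[] theta;      // parameter vector, i.e. the model
    private final double learningRate; // fixed step size here; Mahout anneals it over time

    public SgdLogisticRegression(int numFeatures, double learningRate) {
        this.theta = new double[numFeatures];
        this.learningRate = learningRate;
    }

    // Prediction: sigmoid of (parameter vector dot example)
    public double predict(double[] x) {
        double dot = 0.0;
        for (int j = 0; j < theta.length; j++) {
            dot += theta[j] * x[j];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    // One stochastic update from a single labeled example (label is 0 or 1);
    // (predict(x) - label) * x[j] is the gradient of the convex log loss for that example
    public void train(double[] x, int label) {
        double error = predict(x) - label;
        for (int j = 0; j < theta.length; j++) {
            theta[j] -= learningRate * error * x[j];
        }
    }
}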
19. Current Limitations
✛ Sequential algorithms on a single node only go so far
✛ The “Data Deluge”
> Presents algorithmic challenges when combined with large data sets
> Need to design algorithms that can perform in a distributed fashion
✛ MapReduce only fits certain types of algorithms
20. Distributed Learning Strategies
✛ Langford, 2007
> Vowpal Wabbit
✛ McDonald, 2010
> Distributed Training Strategies for the Structured Perceptron
22. “Are the gains gotten from using X worth the integration costs incurred in building the end-to-end solution? If no, then operationally, we can consider the Hadoop stack … there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc.”
–– Lin, 2012
23. ✛ Parallel Iterative implementation of SGD on YARN
✛ Workers
> Work on partitions of the data
> Stay active over supersteps
✛ Master
> Performs superstep
> Averages parameter vector
24. ✛ Collects all parameter vectors at each pass / superstep
✛ Produces new global parameter vector
> By averaging workers’ vectors (see the averaging sketch below)
✛ Sends update to all workers
> Workers replace local parameter vector with the new global parameter vector
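A sketch of the master’s merge step just described, in plain Java; representing parameter vectors as double[] is an assumption of this example, not Knitting Boar’s actual vector type.

import java.util.List;

public class ParameterAverager {
    // Master superstep: average the workers' parameter vectors into one global vector.
    // double[] stands in for the real parameter vector type (assumption).
    public static double[] average(List<double[]> workerVectors) {
        int numFeatures = workerVectors.get(0).length;
        double[] global = new double[numFeatures];
        for (double[] w : workerVectors) {
            for (int j = 0; j < numFeatures; j++) {
                global[j] += w[j];
            }
        }
        for (int j = 0; j < numFeatures; j++) {
            global[j] /= workerVectors.size(); // element-wise mean
        }
        return global; // broadcast back to every worker
    }
}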
25. ✛ Each given a split of the total dataset
> Similar to a map task
✛ Performs a local logistic regression run
✛ Local parameter vector sent to the master at each superstep (see the worker sketch below)
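And a matching sketch of one worker’s side of a superstep; the split representation, learning rate, and method names are illustrative, and the network plumbing that YARN would provide is elided.

// Illustrative worker, not Knitting Boar's actual worker API (assumption)
public class Worker {
    private final double[][] split;  // this worker's partition of the training data
    private final int[] labels;      // 0/1 labels for the split
    private double[] theta;          // local parameter vector
    private final double learningRate = 0.1; // illustrative step size

    public Worker(double[][] split, int[] labels, int numFeatures) {
        this.split = split;
        this.labels = labels;
        this.theta = new double[numFeatures];
    }

    // Local logistic regression pass over this worker's split (the SGD step from slide 18);
    // the returned vector is what gets shipped to the master at the superstep
    public double[] runLocalPass() {
        for (int i = 0; i < split.length; i++) {
            double dot = 0.0;
            for (int j = 0; j < theta.length; j++) {
                dot += theta[j] * split[i][j];
            }
            double error = 1.0 / (1.0 + Math.exp(-dot)) - labels[i];
            for (int j = 0; j < theta.length; j++) {
                theta[j] -= learningRate * error * split[i][j];
            }
        }
        return theta;
    }

    // Called when the master broadcasts the averaged global vector:
    // the worker replaces its local vector and continues into the next superstep
    public void receiveGlobalVector(double[] global) {
        this.theta = global.clone();
    }
}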
26. Knitting Boar’s POLR
[Diagram: Training Data is divided into Split 1 … Split N; Worker 1 … Worker N each run OnlineLogisticRegression over their split to produce a partial model; the Master merges the partial models into the Global Model]
29. ✛ Parallel SGD
> The Boar is temperamental, experimental
∗ Linear speedup (roughly)
✛ Developing YARN Applications
> More complex than just MapReduce
> Requires lots of “plumbing”
✛ IterativeReduce
> Great native-Hadoop way to implement algorithms
> Easy to use and well integrated
31. ✛ Machine Learning is hard
> Don’t believe the hype
> Do the work
✛ Model development takes time
> Lots of iterations
> Speed is key here
Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg
32. ✛ “Parallel Linear Regression on Iterative Reduce and YARN”
✛ Hadoop Summit Europe 2013
> March 20, 21
> http://hadoopsummit.org/amsterdam/
33. ✛ Strata / Hadoop World 2012 Slides
> http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html
✛ McDonald, 2010
> http://dl.acm.org/citation.cfm?id=1858068
✛ MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail!
> http://arxiv.org/pdf/1209.2191v1.pdf
Examples of key information: selecting embryos based on 60 features. You may be asking, “Why aren’t we talking about Mahout?” What we want to do here is look at the fundamentals that underlie all of these systems, not just Mahout. Some of the wording may be different, but it’s the same.
Frequent itemset mining – what appears together
“What do other people with similar tastes like?” “Strength of associations.”
“say hello to my leeeeetle friend….”
Vowpal Wabbit: doesn’t natively run on Hadoop. Spark: Scala, overhead, integration issues.
“Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems.” – Bottou, 2010. SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners, with no loss in model accuracy.
At current disk bandwidth and capacity (2 TB at 100 MB/s throughput), it takes about 6 hours to read the contents of a single hard drive.
Bottou’s approach is similar to Xu (2010) in the 2010 paper.
Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures. Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data, such as iterative algorithms (many in machine learning). No single programming model or framework can excel at every problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
Some of these are in progress toward being ready on YARN, some are not; we wanted to focus on OLR and not the framework for now.
POLR: Parallel Online Logistic Regression. Talking points: we wanted to start with a tool known to the Hadoop community, with expected characteristics; Mahout’s SGD is well known, so we used it as a base point.
Three major costs of BSP-style computations: maximum unit compute time, cost of global communication, and cost of the barrier sync at the end of a superstep.
Multi-dimensional: need to constantly think about the Client, the Master, and the Worker, how they interact and the implications of failures, etc.
Basecamp: use the story of how we get to basecamp to see how to climb some more.