MACHINE LEARNING, MAHOUT, AND PARALLEL ITERATIVE ALGORITHMS

KNITTING BOAR
Machine Learning, Mahout, and Parallel Iterative Algorithms

Josh Patterson
Principal Solutions Architect

✛ Josh Patterson
> Master’s Thesis: self-organizing mesh networks
∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
> Conceived, built, and led Hadoop integration for openPDC project
at Tennessee Valley Authority (TVA)
> Twitter: @jpatanooga

> Email: josh@cloudera.com

✛ Introduction to Machine Learning
✛ Mahout
✛ Knitting Boar and YARN
✛ Parting Thoughts

Introduction to
MACHINE LEARNING

✛ What is Data Mining?
> “the process of extracting patterns from data”
✛ Why are we interested in Data Mining?
> Raw data essentially useless
∗ Data is simply recorded facts
∗ Information is the patterns underlying the data

✛ Machine Learning
> Algorithms for acquiring structural descriptions from
data “examples”
∗ Process of learning “concepts”

✛ Information Retrieval
> information science, information architecture,
cognitive psychology, linguistics, and statistics.
✛ Natural Language Processing
> grounded in machine learning, especially statistical
machine learning
✛ Statistics
> Math and stuff
✛ Machine Learning
> Considered a branch of artificial intelligence

✛ ETL
✛ Joining multiple disparate data sources
✛ Filtering data
✛ Aggregation
✛ Cube materialization

“Descriptive Statistics”

✛ Don’t always assume you need “scale” and
parallelization
> Try it out on a single machine first
> See if it becomes a bottleneck!
✛ Will the data fit in memory on a beefy
machine?
✛ We can always use the constructed model
back in MapReduce to score a ton of new
data
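The last point can be sketched in a few lines. This is only an illustration, assuming a logistic regression model whose weights were already fitted on one machine; the weight values, the CSV input format, and the names `score_record` and `map_partition` are hypothetical and not part of the talk.

```python
import math

# Hypothetical weights, standing in for a model trained on a single machine.
WEIGHTS = [0.5, -1.0, 2.0]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score_record(line, weights=WEIGHTS):
    """Parse one comma-separated feature line and return (record, probability)."""
    record = line.strip()
    features = [float(v) for v in record.split(",")]
    z = sum(w * x for w, x in zip(weights, features))
    return record, sigmoid(z)

def map_partition(lines, weights=WEIGHTS):
    """Map-style scoring: each record is scored independently, so no reduce step is needed."""
    for line in lines:
        if line.strip():  # skip blank lines
            yield score_record(line, weights)

results = list(map_partition(["0,0,0\n", "\n", "1,0,1\n"]))
```

Because every record is scored on its own, the same function could run in as many map tasks as there are input splits.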

✛ http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
> Looks to study data with descriptive statistics in the hopes of building models for
predictive analytics

✛ Does the majority of ML work via custom Pig integrations
> Pipeline is very “Pig-centric”
> Example: https://github.com/tdunning/pig-vector
> Mostly uses SGD and ensemble methods, which are conducive
to large-scale data mining
✛ Questions they try to answer
> Is this tweet spam?
> What star rating might this user give this movie?

✛ Data collection performed with Flume
✛ Data cleansing / ETL performed with Hive
or Pig
✛ ML work performed with
> SAS
> SPSS
> R
> Mahout

Introduction to
MAHOUT

✛ Classification
> “Fraud detection”
✛ Recommendation
> “Collaborative
Filtering”
✛ Clustering
> “Segmentation”
✛ Frequent Itemset
Mining

Copyright 2010 Cloudera Inc. All rights reserved

✛ Stochastic Gradient Descent
> Single process
> Logistic Regression Model Construction
✛ Naïve Bayes
> MapReduce-based
> Text Classification
✛ Random Forests
> MapReduce-based

✛ An algorithm that looks at a user’s past actions
and suggests
> Products
> Services
> People
✛ Advertisement
> Cloudera has a great Data Science training course on
this topic
> http://university.cloudera.com/training/data_science/introduction_to_data_science_-_building_recommender_systems.html

✛ Cluster words across docs to identify topics
✛ Latent Dirichlet Allocation

✛ Why Machine Learning?
> Growing interest in predictive modeling

✛ Linear Models are Simple, Useful
> Stochastic Gradient Descent is a very popular tool for
building linear models like Logistic Regression

✛ Building Models Is Still Time Consuming
> The “Need for speed”
> “More data beats a cleverer algorithm”

Introducing
KNITTING BOAR

✛ Parallelize Mahout’s Stochastic Gradient Descent
> With as few extra dependencies as possible

✛ Wanted to explore parallel iterative algorithms
using YARN
> Wanted a first class Hadoop-Yarn citizen
> Work through dev progressions towards a stable state
> Worry about “frameworks” later

✛ Training
> Simple gradient descent procedure
> Loss function needs to be convex
✛ Prediction
> Logistic Regression:
∗ Sigmoid function with parameter vector (dot) example as the exponent
[Diagram: Training Data feeding SGD to produce the Model parameters]
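A minimal single-process sketch of the procedure on this slide, for illustration only. It is not Mahout's OnlineLogisticRegression; the learning rate, epoch count, toy data, and the names `predict` and `sgd_train` are assumptions made here.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(weights, x):
    """Prediction: sigmoid of the dot product of parameter vector and example."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

def sgd_train(examples, n_features, learning_rate=0.1, epochs=200, seed=0):
    """Training: one gradient step per example on the (convex) logistic loss."""
    rng = random.Random(seed)
    weights = [0.0] * n_features
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit examples in random order each pass
        for x, y in data:
            error = y - predict(weights, x)
            for j in range(n_features):
                weights[j] += learning_rate * error * x[j]
    return weights

# Toy data (made up): the first feature is a constant bias term.
toy = [([1.0, 0.0], 0), ([1.0, 1.0], 0), ([1.0, 3.0], 1), ([1.0, 4.0], 1)]
w = sgd_train(toy, n_features=2)
```

Calling `predict(w, [1.0, 4.0])` then gives the estimated probability that an example with feature value 4 belongs to class 1.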

Current Limitations
✛ Sequential algorithms on a single node only
go so far
✛ The “Data Deluge”
> Presents algorithmic challenges when combined with
large data sets
> Need to design algorithms that are able to perform in
a distributed fashion
✛ MapReduce only fits certain types of algorithms

Distributed Learning Strategies
✛ Langford, 2007
> Vowpal Wabbit
✛ McDonald, 2010
> Distributed Training Strategies for the Structured
Perceptron
✛ Dekel, 2010
> Optimal Distributed Online Prediction Using Mini-
Batches

[Figure: MapReduce data flow (Input, Map, Reduce, Output) shown beside an iterative model in which a row of processors runs Superstep 1, Superstep 2, . . .]

“Are the gains gotten from using X worth the integration costs incurred in
building the end-to-end solution?

If no, then operationally, we can consider the Hadoop stack …

there are substantial costs in knitting together a patchwork of different
frameworks, programming models, etc.”

–– Lin, 2012

✛ Parallel Iterative implementation of SGD on
YARN

✛ Workers work on partitions of the data
✛ Master keeps global copy of merged parameter
vector

✛ Each given a split of the total dataset
> Similar to a map task
✛ Using a modified OLR
> process N samples in a batch (subset of split)
✛ Batched gradient accumulation updates sent to
master node
> Gradient influences future model vectors towards
better predictions
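The worker side can be sketched as follows, under stated assumptions: each worker starts from the current global parameter vector, runs SGD over one batch of its split, and hands back its updated local copy. The names `train_batch` and `batches`, the learning rate, and the choice to return the whole vector rather than an accumulated gradient are illustrative and not taken from Knitting Boar's code.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def batches(split, batch_size):
    """Cut a worker's split into consecutive batches of at most batch_size samples."""
    for i in range(0, len(split), batch_size):
        yield split[i:i + batch_size]

def train_batch(global_weights, batch, learning_rate=0.1):
    """Run SGD over one batch on a local copy of the global parameter vector."""
    local = list(global_weights)  # never mutate the vector received from the master
    for x, y in batch:
        p = sigmoid(sum(w * xi for w, xi in zip(local, x)))
        for j, xj in enumerate(x):
            local[j] += learning_rate * (y - p) * xj
    return local  # this is what the worker would send to the master

# Toy split (made up): first feature is a constant bias term.
split = [([1.0, 2.0], 1), ([1.0, -2.0], 0), ([1.0, 3.0], 1)]
global_weights = [0.0, 0.0]
local_weights = train_batch(global_weights, split[:1])
```

Keeping the batch small relative to the split lets workers synchronize with the master several times per pass over their data.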

✛ Accumulates gradient updates
> From batches of worker OLR runs
✛ Produces new global parameter vector
> By averaging workers’ vectors
✛ Sends update to all workers
> Workers replace local parameter vector with new
global parameter vector
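The master's role described above amounts to an averaging step followed by a broadcast. A minimal sketch, assuming every worker reports a parameter vector of the same length; `average_vectors` and `superstep` are hypothetical names used only for this illustration.

```python
def average_vectors(worker_vectors):
    """Produce the new global parameter vector as the element-wise mean."""
    n = len(worker_vectors)
    if n == 0:
        raise ValueError("need at least one worker vector")
    dim = len(worker_vectors[0])
    return [sum(v[j] for v in worker_vectors) / n for j in range(dim)]

def superstep(worker_vectors):
    """One synchronization round: merge on the master, then send the result to all workers."""
    global_vector = average_vectors(worker_vectors)
    # Each worker replaces its local vector with its own copy of the global one.
    new_locals = [list(global_vector) for _ in worker_vectors]
    return global_vector, new_locals

global_vector, new_locals = superstep([[1.0, 2.0], [3.0, 4.0]])
```

Plain averaging weights every worker equally, which is a reasonable default when splits are of similar size.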

[Figure: OnlineLogisticRegression compared with Knitting Boar’s POLR. OnlineLogisticRegression: Training Data, OnlineLogisticRegression, Model. POLR: Training Data divided into Split 1, Split 2, Split 3; Worker 1 through Worker N each produce a Partial Model; the Master produces the Global Model.]

[Figure: Input Size vs Processing Time, comparing OLR and POLR. Processing-time axis runs from 0 to 300 and input-size axis from 4.1 to 41; units are not given.]

Knitting Boar
PARTING THOUGHTS

✛ Parallel SGD
> The Boar is temperamental, experimental
∗ Linear speedup (roughly)

✛ Developing YARN Applications
> More complex than just MapReduce
> Requires lots of “plumbing”
✛ IterativeReduce
> Great native-Hadoop way to implement algorithms
> Easy to use and well integrated

✛ Knitting Boar
> https://github.com/jpatanooga/KnittingBoar
> 100% Java
> Apache License 2.0
> Quick Start
∗ https://github.com/jpatanooga/KnittingBoar/wiki/Quick-Start

✛ IterativeReduce
> https://github.com/emsixteeen/IterativeReduce
> 100% Java
> Apache License 2.0

✛ Machine Learning is hard
> Don’t believe the hype
> Do the work
✛ Model development takes
time
> Lots of iterations
> Speed is key here

Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg

✛ Strata / Hadoop World 2012 Slides
> http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html
✛ Mahout’s SGD implementation
> http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
✛ MapReduce is Good Enough? If All You Have is
a Hammer, Throw Away Everything That’s Not a
Nail!
> http://arxiv.org/pdf/1209.2191v1.pdf

✛ Langford
> http://hunch.net/~vw/
✛ McDonald, 2010
> http://dl.acm.org/citation.cfm?id=1858068

✛ http://eteamjournal.files.wordpress.com/2011/03/photos-of-mount-everest-pictures.jpg
✛ http://images.fineartamerica.com/images-medium-large/-say-hello-to-my-little-friend--luis-ludzska.jpg
✛ http://freewallpaper.in/wallpaper2/2202-2-2001_space_odyssey_-_5.jpg
