1. KNITTING BOAR
Machine Learning, Mahout, and Parallel Iterative Algorithms
Josh Patterson
Principal Solutions Architect
2. ✛ Josh Patterson
> Master’s Thesis: self-organizing mesh networks
∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
> Conceived, built, and led Hadoop integration for the openPDC project at the Tennessee Valley Authority (TVA)
> Twitter: @jpatanooga
> Email: josh@floe.tv
3. ✛ Introduction to Machine Learning
✛ Mahout
✛ Knitting Boar and YARN
✛ Parting Thoughts
5. ✛ What is Data Mining?
> “the process of extracting patterns from data”
✛ Why are we interested in Data Mining?
> Raw data essentially useless
∗ Data is simply recorded facts
∗ Information is the patterns underlying the data
✛ Machine Learning
> Algorithms for acquiring structural descriptions from data “examples”
∗ Process of learning “concepts”
6. ✛ Information Retrieval
> Information science, information architecture, cognitive psychology, linguistics, and statistics
✛ Natural Language Processing
> Grounded in machine learning, especially statistical machine learning
✛ Statistics
> Math and stuff
✛ Machine Learning
> Considered a branch of artificial intelligence
7. ✛ ETL
✛ Joining multiple disparate data sources
✛ Filtering data
✛ Aggregation
✛ Cube materialization
“Descriptive Statistics”
8. ✛ Data collection performed with Flume
✛ Data cleansing / ETL performed with Hive or Pig
✛ ML work performed with
> SAS
> SPSS
> R
> Mahout
10. ✛ Classification
> “Fraud detection”
✛ Recommendation
> “Collaborative Filtering”
✛ Clustering
> “Segmentation”
✛ Frequent Itemset Mining
11. ✛ Stochastic Gradient Descent
> Single process
> Logistic Regression Model Construction (usage sketch after this slide)
✛ Naïve Bayes
> MapReduce-based
> Text Classification
✛ Random Forests
> MapReduce-based
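For the sequential SGD entry above, a minimal usage sketch. This assumes the Mahout 0.7-era org.apache.mahout.classifier.sgd API (OnlineLogisticRegression with an L1 prior); the feature values, learning rate, and lambda are illustrative, and exact method names may differ across Mahout versions.

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class OlrExample {
    public static void main(String[] args) {
        // Binary classifier over 4 features, L1-regularized
        // (assumes the Mahout 0.7-era fluent API)
        OnlineLogisticRegression olr =
            new OnlineLogisticRegression(2, 4, new L1())
                .learningRate(0.1)
                .lambda(1.0e-4);

        // One labeled training example: label in {0, 1}
        Vector features = new DenseVector(new double[] {1.0, 0.5, -0.2, 0.8});
        olr.train(1, features);                  // a single sequential SGD step

        double p = olr.classifyScalar(features); // estimated P(label == 1)
        System.out.println("P(y=1) = " + p);
    }
}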
12. ✛ An algorithm that looks at a user’s past actions and suggests
> Products
> Services
> People
✛ Advertisement
> Cloudera has a great Data Science training course on this topic
> http://university.cloudera.com/training/data_science/introduction_to_data_science_-_building_recommender_systems.html
13. ✛ Cluster words across docs to identify topics
✛ Latent Dirichlet Allocation
14. ✛ Why Machine Learning?
> Growing interest in predictive modeling
✛ Linear Models are Simple, Useful
> Stochastic Gradient Descent is a very popular tool for building linear models like Logistic Regression
✛ Building Models Is Still Time-Consuming
> The “Need for speed”
> “More data beats a cleverer algorithm”
16. ✛ Parallelize Mahout’s Stochastic Gradient Descent
> With as few extra dependencies as possible
✛ Wanted to explore parallel iterative algorithms using YARN
> Wanted a first-class Hadoop YARN citizen
> Work through dev progressions towards a stable state
> Worry about “frameworks” later
17. ✛ We Need
> Hypothesis about data
> Cost function
> Update function
✛ Basic Algorithm: Andrew Ng’s Tutorial (the three ingredients are worked out below)
> https://class.coursera.org/ml/lecture/preview_view/11
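For logistic regression, those three ingredients work out as follows, in the notation of Ng’s lecture; the log-loss cost and the learning rate α are the standard choices rather than something this slide fixes:

% Hypothesis: sigmoid of the parameter vector dotted with the example
h_\theta(x) = \frac{1}{1 + e^{-\theta^{\top} x}}

% Cost over m examples: the convex log loss
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]

% Stochastic update: step against the gradient of ONE example at a time
\theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}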
18. ✛ Training
> Simple gradient descent procedure
> Loss function needs to be convex
✛ Prediction
> Logistic Regression:
∗ Sigmoid of the parameter vector (dot) example, as an exponential (see the sketch below)
[Diagram: Training Data → SGD → Model parameters]
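A minimal, self-contained sketch of the prediction and update just described, in plain Java rather than Mahout’s classes; the class name, learning rate handling, and array-based vectors are illustrative assumptions, standing in for what OnlineLogisticRegression does internally.

// Simplified stand-in for Mahout's OnlineLogisticRegression (assumption, not its real API)
public class SgdLogisticRegression {
    private final double[] theta;      // parameter vector, i.e. the model
    private final double learningRate; // fixed step size here; Mahout anneals it over time

    public SgdLogisticRegression(int numFeatures, double learningRate) {
        this.theta = new double[numFeatures];
        this.learningRate = learningRate;
    }

    // Prediction: sigmoid of (parameter vector dot example)
    public double predict(double[] x) {
        double dot = 0.0;
        for (int j = 0; j < theta.length; j++) {
            dot += theta[j] * x[j];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    // One stochastic update from a single labeled example (label is 0 or 1);
    // (predict(x) - label) * x[j] is the gradient of the convex log loss for that example
    public void train(double[] x, int label) {
        double error = predict(x) - label;
        for (int j = 0; j < theta.length; j++) {
            theta[j] -= learningRate * error * x[j];
        }
    }
}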
19. Current Limitations
✛ Sequential algorithms on a single node only go so far
✛ The “Data Deluge”
> Presents algorithmic challenges when combined with large data sets
> Need to design algorithms that can perform in a distributed fashion
✛ MapReduce only fits certain types of algorithms
20. Distributed Learning Strategies
✛ Langford, 2007
> Vowpal Wabbit
✛ McDonald, 2010
> Distributed Training Strategies for the Structured Perceptron
22. “Are the gains gotten from using X worth the integration costs incurred in building the end-to-end solution? If no, then operationally, we can consider the Hadoop stack … there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc.”
–– Lin, 2012
23. ✛ Parallel Iterative implementation of SGD on YARN
✛ Workers
> Work on partitions of the data
> Stay active over supersteps
✛ Master
> Performs superstep
> Averages parameter vector
24. ✛ Collects all parameter vectors at each pass / superstep
✛ Produces new global parameter vector
> By averaging workers’ vectors (see the averaging sketch below)
✛ Sends update to all workers
> Workers replace local parameter vector with the new global parameter vector
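A sketch of the master’s merge step just described, in plain Java; representing parameter vectors as double[] is an assumption of this example, not Knitting Boar’s actual vector type.

import java.util.List;

public class ParameterAverager {
    // Master superstep: average the workers' parameter vectors into one global vector.
    // double[] stands in for the real parameter vector type (assumption).
    public static double[] average(List<double[]> workerVectors) {
        int numFeatures = workerVectors.get(0).length;
        double[] global = new double[numFeatures];
        for (double[] w : workerVectors) {
            for (int j = 0; j < numFeatures; j++) {
                global[j] += w[j];
            }
        }
        for (int j = 0; j < numFeatures; j++) {
            global[j] /= workerVectors.size(); // element-wise mean
        }
        return global; // broadcast back to every worker
    }
}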
25. ✛ Each given a split of the total dataset
> Similar to a map task
✛ Performs a local logistic regression run
✛ Local parameter vector sent to the master at each superstep (see the worker sketch below)
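And a matching sketch of one worker’s side of a superstep; the split representation, learning rate, and method names are illustrative, and the network plumbing that YARN would provide is elided.

// Illustrative worker, not Knitting Boar's actual worker API (assumption)
public class Worker {
    private final double[][] split;  // this worker's partition of the training data
    private final int[] labels;      // 0/1 labels for the split
    private double[] theta;          // local parameter vector
    private final double learningRate = 0.1; // illustrative step size

    public Worker(double[][] split, int[] labels, int numFeatures) {
        this.split = split;
        this.labels = labels;
        this.theta = new double[numFeatures];
    }

    // Local logistic regression pass over this worker's split (the SGD step from slide 18);
    // the returned vector is what gets shipped to the master at the superstep
    public double[] runLocalPass() {
        for (int i = 0; i < split.length; i++) {
            double dot = 0.0;
            for (int j = 0; j < theta.length; j++) {
                dot += theta[j] * split[i][j];
            }
            double error = 1.0 / (1.0 + Math.exp(-dot)) - labels[i];
            for (int j = 0; j < theta.length; j++) {
                theta[j] -= learningRate * error * split[i][j];
            }
        }
        return theta;
    }

    // Called when the master broadcasts the averaged global vector:
    // the worker replaces its local vector and continues into the next superstep
    public void receiveGlobalVector(double[] global) {
        this.theta = global.clone();
    }
}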
26. Knitting Boar’s POLR
[Diagram: Training Data is divided into Split 1 … Split N; Worker 1 … Worker N each run OnlineLogisticRegression over their split to produce a partial model; the Master merges the partial models into the Global Model]
29. ✛ Parallel SGD
> The Boar is temperamental, experimental
∗ Linear speedup (roughly)
✛ Developing YARN Applications
> More complex than just MapReduce
> Requires lots of “plumbing”
✛ IterativeReduce
> Great native-Hadoop way to implement algorithms
> Easy to use and well integrated
31. ✛ Machine Learning is hard
> Don’t believe the hype
> Do the work
✛ Model development takes time
> Lots of iterations
> Speed is key here
Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg
32. ✛ “Parallel Linear Regression on Iterative Reduce and YARN”
✛ Hadoop Summit Europe 2013
> March 20, 21
> http://hadoopsummit.org/amsterdam/
33. ✛ Strata / Hadoop World 2012 Slides
> http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html
✛ McDonald, 2010
> http://dl.acm.org/citation.cfm?id=1858068
✛ MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail!
> http://arxiv.org/pdf/1209.2191v1.pdf
Examples of key information: selecting embryos based on 60 features. You may be asking, “Why aren’t we talking about Mahout?” What we want to do here is look at the fundamentals that underlie all of these systems, not just Mahout. Some of the wording may be different, but it’s the same.
Frequent itemset mining – what appears together
“What do other people with similar tastes like?” “Strength of associations.”
“say hello to my leeeeetle friend….”
Vowpal Wabbit: doesn’t natively run on Hadoop. Spark: Scala, overhead, integration issues.
“Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems.” – Bottou, 2010. SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners, with no loss in model accuracy.
At current disk bandwidth and capacity (2 TB at 100 MB/s throughput), it takes about 6 hours to read the contents of a single hard drive.
Bottou’s approach is similar to Xu (2010) in the 2010 paper.
Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures. Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data, such as iterative algorithms (many in machine learning). No single programming model or framework can excel at every problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
Some of these are in progress toward being ready on YARN, some are not; we wanted to focus on OLR and not the framework for now.
POLR: Parallel Online Logistic Regression. Talking points: we wanted to start with a tool known to the Hadoop community, with expected characteristics; Mahout’s SGD is well known, so we used it as a base point.
Three major costs of BSP-style computations: maximum unit compute time, cost of global communication, and cost of the barrier sync at the end of a superstep.
Multi-dimensional: need to constantly think about the Client, the Master, and the Worker, how they interact and the implications of failures, etc.
Basecamp: use the story of how we get to basecamp to see how to climb some more.