MACHINE LEARNING, MAHOUT, AND PARALLEL ITERATIVE ALGORITHMS
1. KNITTING BOAR
Machine Learning, Mahout, and Parallel Iterative Algorithms
Josh Patterson
Principal Solutions Architect
2. ✛ Josh Patterson
> Master’s Thesis: self-organizing mesh networks
∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
> Conceived, built, and led Hadoop integration for openPDC project
at Tennessee Valley Authority (TVA)
> Twitter: @jpatanooga
> Email: josh@cloudera.com
3. ✛ Introduction to Machine Learning
✛ Mahout
✛ Knitting Boar and YARN
✛ Parting Thoughts
5. ✛ What is Data Mining?
> “the process of extracting patterns from data”
✛ Why are we interested in Data Mining?
> Raw data essentially useless
∗ Data is simply recorded facts
∗ Information is the patterns underlying the data
✛ Machine Learning
> Algorithms for acquiring structural descriptions from
data “examples”
∗ Process of learning “concepts”
6. ✛ Information Retrieval
> information science, information architecture,
cognitive psychology, linguistics, and statistics.
✛ Natural Language Processing
> grounded in machine learning, especially statistical
machine learning
✛ Statistics
> Math and stuff
✛ Machine Learning
> Considered a branch of artificial intelligence
7. ✛ ETL
✛ Joining multiple disparate data sources
✛ Filtering data
✛ Aggregation
✛ Cube materialization
“Descriptive Statistics”
8. ✛ Don’t always assume you need “scale” and
parallelization
> Try it out on a single machine first
> See if it becomes a bottleneck!
✛ Will the data fit in memory on a beefy
machine?
✛ We can always use the constructed model
back in MapReduce to score a ton of new
data
9. ✛ http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
> Looks to study data with descriptive statistics in the hopes of building models for
predictive analytics
✛ Does the majority of ML work via custom Pig integrations
> Pipeline is very “Pig-centric”
> Example: https://github.com/tdunning/pig-vector
> They mostly use SGD and ensemble methods, which are conducive
to large-scale data mining
✛ Questions they try to answer
> Is this tweet spam?
> What star rating might this user give this movie?
10. ✛ Data collection performed with Flume
✛ Data cleansing / ETL performed with Hive
or Pig
✛ ML work performed with
> SAS
> SPSS
> R
> Mahout
12. ✛ Classification
> “Fraud detection”
✛ Recommendation
> “Collaborative
Filtering”
✛ Clustering
> “Segmentation”
✛ Frequent Itemset
Mining
Copyright 2010 Cloudera Inc. All rights reserved
13. ✛ Stochastic Gradient Descent
> Single process
> Logistic Regression Model Construction
✛ Naïve Bayes
> MapReduce-based
> Text Classification
✛ Random Forests
> MapReduce-based
14. ✛ An algorithm that looks at a user’s past actions
and suggests
> Products
> Services
> People
✛ Advertisement
> Cloudera has a great Data Science training course on
this topic
> http://university.cloudera.com/training/data_science/introduction_to_data_science_-_building_recommender_systems.html
15. ✛ Cluster words across docs to identify topics
✛ Latent Dirichlet Allocation
16. ✛ Why Machine Learning?
> Growing interest in predictive modeling
✛ Linear Models are Simple, Useful
> Stochastic Gradient Descent is a very popular tool for
building linear models like Logistic Regression
✛ Building Models is Still Time Consuming
> The “Need for speed”
> “More data beats a cleverer algorithm”
18. ✛ Parallelize Mahout’s Stochastic Gradient Descent
> With as few extra dependencies as possible
✛ Wanted to explore parallel iterative algorithms
using YARN
> Wanted a first class Hadoop-Yarn citizen
> Work through dev progressions towards a stable state
> Worry about “frameworks” later
19. ✛ Training
> Simple gradient descent procedure
> Loss function needs to be convex
✛ Prediction
> Logistic Regression:
∗ Sigmoid function using parameter vector (dot) example as
exponential parameter
Figure: Training Data, SGD, Model
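The training and prediction steps on this slide can be sketched in a few lines of Python. This is a minimal single-process illustration, not Mahout's OnlineLogisticRegression: the function names, the learning rate, the epoch count, and the toy data are all assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(weights, features):
    """Prediction: sigmoid of (parameter vector dot example)."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)))

def train_sgd(examples, num_features, learning_rate=0.1, epochs=100, seed=0):
    """Training: plain SGD on the convex logistic loss, one example per update."""
    rng = random.Random(seed)
    weights = [0.0] * num_features
    examples = list(examples)
    for _ in range(epochs):
        rng.shuffle(examples)
        for features, label in examples:
            error = label - predict(weights, features)
            for j, x in enumerate(features):
                weights[j] += learning_rate * error * x
    return weights

# Hypothetical toy data: the first feature is a constant bias term.
TOY = [([1.0, 0.0], 0), ([1.0, 1.0], 0), ([1.0, 3.0], 1), ([1.0, 4.0], 1)]
model = train_sgd(TOY, num_features=2)
```

Calling `predict(model, [1.0, 4.0])` gives the estimated probability that the example belongs to class 1. Because the logistic loss is convex, the simple update rule does not get stuck in local minima.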
20. Current Limitations
✛ Sequential algorithms on a single node only
go so far
✛ The “Data Deluge”
> Presents algorithmic challenges when combined with
large data sets
> Need to design algorithms that can run in
a distributed fashion
✛ MapReduce only fits certain types of algorithms
21. Distributed Learning Strategies
✛ Langford, 2007
> Vowpal Wabbit
✛ McDonald 2010
> Distributed Training Strategies for the Structured
Perceptron
✛ Dekel 2010
> Optimal Distributed Online Prediction Using Mini-Batches
23. “Are the gains gotten from using X worth the integration costs incurred in
building the end-to-end solution?
If no, then operationally, we can consider the Hadoop stack …
there are substantial costs in knitting together a patchwork of different
frameworks, programming models, etc.”
–– Lin, 2012
24. ✛ Parallel Iterative implementation of SGD on
YARN
✛ Workers work on partitions of the data
✛ Master keeps global copy of merged parameter
vector
25. ✛ Each given a split of the total dataset
> Similar to a map task
✛ Using a modified OLR
> process N samples in a batch (subset of split)
✛ Batched gradient accumulation updates sent to
master node
> Gradient influences future model vectors towards
better predictions
26. ✛ Accumulates gradient updates
> From batches of worker OLR runs
✛ Produces new global parameter vector
> By averaging workers’ vectors
✛ Sends update to all workers
> Workers replace local parameter vector with new
global parameter vector
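The worker and master roles described on the last two slides can be simulated in a single process. The sketch below is not Knitting Boar's code: the workers run one after another rather than in parallel on YARN, each worker sends back its updated local vector rather than an accumulated gradient, and every name, the batch size, and the toy splits are assumptions for illustration.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(weights, features):
    return sigmoid(sum(w * x for w, x in zip(weights, features)))

def worker_batch(global_weights, batch, learning_rate=0.1):
    """Worker: run SGD over one batch, starting from the global vector."""
    local = list(global_weights)
    for features, label in batch:
        error = label - predict(local, features)
        for j, x in enumerate(features):
            local[j] += learning_rate * error * x
    return local

def master_average(worker_vectors):
    """Master: element-wise mean of the workers' parameter vectors."""
    n = len(worker_vectors)
    return [sum(column) / n for column in zip(*worker_vectors)]

def train_parallel(splits, num_features, batch_size=2, passes=100):
    """One averaging round per batch; workers then restart from the new global vector."""
    global_weights = [0.0] * num_features
    for _ in range(passes):
        num_batches = max(math.ceil(len(s) / batch_size) for s in splits)
        for b in range(num_batches):
            partials = [
                worker_batch(global_weights, split[b * batch_size:(b + 1) * batch_size])
                for split in splits
            ]
            global_weights = master_average(partials)
    return global_weights

# Hypothetical splits, one per worker; the first feature is a bias term.
SPLITS = [
    [([1.0, 0.0], 0), ([1.0, 3.0], 1)],
    [([1.0, 1.0], 0), ([1.0, 4.0], 1)],
]
global_model = train_parallel(SPLITS, num_features=2)
```

A worker whose split has run out of batches simply returns the global vector unchanged, so it does not distort the average for that round.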
27. OnlineLogisticRegression vs. Knitting Boar’s POLR
Figure: OnlineLogisticRegression: Training Data, OnlineLogisticRegression, Model
Figure: POLR: Split 1, Split 2, Split 3; Worker 1, Worker 2, … Worker N; Partial Model (one per worker); Master; Global Model
30. ✛ Parallel SGD
> The Boar is temperamental, experimental
∗ Linear speedup (roughly)
✛ Developing YARN Applications
> More complex than just MapReduce
> Requires lots of “plumbing”
✛ IterativeReduce
> Great native-Hadoop way to implement algorithms
> Easy to use and well integrated
32. ✛ Machine Learning is hard
> Don’t believe the hype
> Do the work
✛ Model development takes
time
> Lots of iterations
> Speed is key here
Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg
33. ✛ Strata / Hadoop World 2012 Slides
> http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html
✛ Mahout’s SGD implementation
> http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
✛ MapReduce is Good Enough? If All You Have is
a Hammer, Throw Away Everything That’s Not a
Nail!
> http://arxiv.org/pdf/1209.2191v1.pdf
Examples of key information: selecting embryos based on 60 features. You may be asking, “Why aren’t we talking about Mahout?” What we want to do here is look at the fundamentals that underlie all of these systems, not just Mahout. Some of the wording may be different, but the fundamentals are the same.
Yeah? OK, let’s look at doing ETL in Hadoop and then running the model construction phase in another tool like R.
No? We need to think of a way to either:
> Refactor the algorithm into MapReduce
> Partition the data such that a reducer can work on each subset
Frequent itemset mining – what appears together
“What do other people with similar tastes like?” “Strength of associations”
“say hello to my leeeeetle friend….”
Vowpal Wabbit: doesn’t natively run on Hadoop. Spark: Scala, overhead, integration issues.
“Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems.” (Bottou, 2010) SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners, with no loss in model accuracy.
At current disk bandwidth and capacity (2 TB at 100 MB/s throughput), it takes about 6 hours to read the contents of a single hard drive.
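The disk-scan figure is simple arithmetic and can be checked directly. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and sustained sequential throughput; the function name is made up for the example.

```python
def full_scan_hours(capacity_bytes, throughput_bytes_per_sec):
    """Hours needed to read an entire disk once at a sustained throughput."""
    return capacity_bytes / throughput_bytes_per_sec / 3600.0

TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

hours = full_scan_hours(2 * TB, 100 * MB)
print(f"{hours:.1f} hours")  # prints "5.6 hours", i.e. roughly 6 hours
```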
Bottou similar to Xu2010 in the 2010 paper
Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures. Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data, such as iterative algorithms (many in machine learning). No single programming model or framework can excel at every problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
Some of these are in progress towards being ready on YARN, some not; wanted to focus on OLR and not framework for now
POLR: Parallel Online Logistic Regression. Talking points: wanted to start with a tool known to the Hadoop community, with expected characteristics. Mahout’s SGD is well known, so we used it as a base point.
3 major costs of BSP-style computations:
> Max unit compute time
> Cost of global communication
> Cost of barrier sync at end of super step
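The three BSP costs add up per super step, which is why one slow worker holds back the whole round. A minimal sketch of that cost model follows; the function names and the timings are hypothetical, not measurements from Knitting Boar.

```python
def superstep_cost(worker_compute_secs, communication_secs, barrier_secs):
    """Cost of one super step: slowest worker + global communication + barrier sync."""
    return max(worker_compute_secs) + communication_secs + barrier_secs

def total_cost(supersteps):
    """Total runtime of an iterative job as the sum of its super step costs."""
    return sum(superstep_cost(*step) for step in supersteps)

# Hypothetical timings: three workers, the slowest taking 9 seconds.
one_step = superstep_cost([4.0, 5.0, 9.0], communication_secs=1.5, barrier_secs=0.5)
two_steps = total_cost([([4.0, 5.0, 9.0], 1.5, 0.5), ([4.0, 4.0, 4.0], 1.5, 0.5)])
```

In this model, balancing the splits so that the workers finish at similar times reduces the max term, while sending fewer, larger batches reduces how often the communication and barrier terms are paid.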
Multi-dimensional: need to constantly think about the Client, the Master, and the Worker, how they interact and the implications of failures, etc.
Basecamp: use story of how we get to basecamp to see how to climb some more