This document summarizes Josh Patterson's work on parallel machine learning algorithms. It discusses his past publications and work on routing algorithms and metaheuristics. It then outlines his work developing parallel versions of algorithms like linear regression, logistic regression, and neural networks using Hadoop and YARN. It presents performance results showing these parallel algorithms can achieve close to linear speedup. It also discusses techniques used like vector caching and unit testing frameworks. Finally, it discusses future work on algorithms like Adagrad and parallel quasi-Newton methods.
5. Machine Learning and Optimization
Direct Methods
Normal Equation
Iterative Methods
Newton's Method
Quasi-Newton
Gradient Descent
Heuristics
AntNet
PSO
Genetic Algorithms
6. Linear Regression
In linear regression, data is modeled using linear predictor functions, and the unknown model parameters are estimated from the data.
We use optimization techniques like Stochastic Gradient Descent to find the coefficients in the model
Y = (1*x0) + (c1*x1) + … + (cN*xN)
8. Stochastic Gradient Descent
[Diagram: Training Data feeds a Training step, which produces a Model]
Simple gradient descent procedure
Loss function needs to be convex (with exceptions)
Linear Regression via SGD:
Loss function: squared error of prediction
Prediction: linear combination of coefficients and input variables
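A minimal, self-contained sketch of SGD for linear regression as described above, assuming a squared-error loss and a fixed learning rate (illustrative Java, not the Mahout or Metronome implementation):

// Minimal SGD for linear regression with squared-error loss (illustrative only).
public class LinearRegressionSGD {

    private final double[] weights;       // model coefficients c0..cN
    private final double learningRate;

    public LinearRegressionSGD(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    // Prediction: linear combination of coefficients and input variables.
    public double predict(double[] x) {
        double y = 0.0;
        for (int i = 0; i < weights.length; i++) {
            y += weights[i] * x[i];
        }
        return y;
    }

    // One SGD step on a single example (x, y), using the gradient of
    // the squared error 0.5 * (predict(x) - y)^2 with respect to the weights.
    public void train(double[] x, double y) {
        double error = predict(x) - y;
        for (int i = 0; i < weights.length; i++) {
            weights[i] -= learningRate * error * x[i];
        }
    }
}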
9. Mahout's SGD
Currently Single Process
Multi-threaded parallel, but not cluster parallel
Runs locally, not deployed to the cluster
Tied to logistic regression implementation
10. Distributed Learning Strategies
McDonald, 2010: Distributed Training Strategies for the Structured Perceptron
Langford, 2007
Vowpal Wabbit
Jeff Dean's Work on Parallel SGD
DownPour SGD
12. YARN
Yet Another Resource Negotiator
Framework for scheduling distributed applications
Allows any type of parallel application to run natively on Hadoop
MRv2 is now a distributed application
[Diagram: YARN architecture. Clients submit jobs to the ResourceManager; NodeManagers report node status and host Containers and per-application App Masters; App Masters send resource requests to the ResourceManager and track MapReduce status.]
14. SGD: Serial vs Parallel
[Diagram: Training Data is divided into Split 1..N; Workers 1..N each compute a Partial Model on their split; the Master combines the partial models into the Global Model]
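A minimal sketch of the master's parameter-averaging step shown in the diagram (illustrative Java; the actual IterativeReduce/Metronome code differs):

import java.util.List;

// Master step: average the workers' partial models into the global model.
public final class ParameterAveraging {

    public static double[] average(List<double[]> partialModels) {
        int numParams = partialModels.get(0).length;
        double[] global = new double[numParams];
        for (double[] partial : partialModels) {
            for (int i = 0; i < numParams; i++) {
                global[i] += partial[i];
            }
        }
        for (int i = 0; i < numParams; i++) {
            global[i] /= partialModels.size();
        }
        return global;   // broadcast back to the workers for the next pass
    }
}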
15. Parallel Iterative Algorithms on YARN
Based directly on work we did with Knitting Boar
Parallel logistic regression
And then added
Parallel linear regression
Parallel Neural Networks
Packaged in a new suite of parallel iterative algorithms called Metronome
100% Java, ASF 2.0 licensed, on GitHub
16. Linear Regression Results
[Chart: "Linear Regression - Parallel vs Serial", Total Processing Time vs Megabytes Processed Total (64-320 MB); series: Parallel Runs, Serial Runs]
18. Convergence Testing
Debugging parallel iterative algorithms during testing is hard
Processes on different hosts are difficult to observe
Using the unit test framework IRUnit, we can simulate the IterativeReduce framework
We know the plumbing of message passing works
Allows us to focus on parallel algorithm design/testing while still using standard debugging tools (see the sketch below)
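A toy, self-contained example of the idea behind such simulation tests: run several "workers" in-process, average their partial models each pass, and assert that the global model converges. The class names and data below are illustrative only and are not the actual IRUnit API (see the real test links on slide 31).

import org.junit.Assert;
import org.junit.Test;

// In-process simulation of a parameter-averaging training loop, in the spirit
// of the IRUnit tests. NOT the IRUnit API -- purely illustrative.
public class TestSimulatedParallelSGD {

    @Test
    public void testAveragedModelConverges() {
        int numWorkers = 3;
        double learningRate = 0.05;
        double[] global = new double[] {0.0, 0.0};    // [intercept, slope]

        for (int pass = 0; pass < 500; pass++) {
            double[][] partials = new double[numWorkers][];
            for (int w = 0; w < numWorkers; w++) {
                partials[w] = trainOnPartition(global.clone(), w, learningRate);
            }
            global = average(partials);               // master: parameter averaging
        }

        // Synthetic data comes from y = 1 + 2x, so the global model should recover it.
        Assert.assertEquals(1.0, global[0], 0.1);
        Assert.assertEquals(2.0, global[1], 0.1);
    }

    // One worker: SGD over its own partition of noiseless data from y = 1 + 2x.
    private static double[] trainOnPartition(double[] model, int workerId, double lr) {
        for (int i = 0; i < 10; i++) {
            double x = (workerId * 10 + i) / 30.0;    // each worker sees a distinct x range
            double y = 1.0 + 2.0 * x;
            double error = (model[0] + model[1] * x) - y;
            model[0] -= lr * error;                   // intercept term
            model[1] -= lr * error * x;               // slope term
        }
        return model;
    }

    // Average the workers' partial models into a new global model.
    private static double[] average(double[][] partials) {
        double[] avg = new double[partials[0].length];
        for (double[] p : partials) {
            for (int i = 0; i < avg.length; i++) {
                avg[i] += p[i];
            }
        }
        for (int i = 0; i < avg.length; i++) {
            avg[i] /= partials.length;
        }
        return avg;
    }
}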
20. What are Neural Networks?
Inspired by the nervous systems of biological organisms
Models layers of neurons in the brain
Can learn non-linear functions
Recently enjoying a surge in popularity
21. Multi-Layer Perceptron
First layer has input neurons
Last layer has output neurons
Each neuron in a layer is connected to all neurons in the next layer
Each neuron has an activation function, typically sigmoid / logistic
Input to a neuron is the sum of weight * input over its incoming connections (see the sketch below)
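A minimal sketch of that computation for a single neuron, assuming a sigmoid activation and an explicit bias term (the bias is an assumption; the slide only mentions the weighted sum):

// One neuron's forward pass: weighted sum of inputs (plus bias), then sigmoid.
public static double neuronOutput(double[] inputs, double[] weights, double bias) {
    double sum = bias;
    for (int i = 0; i < inputs.length; i++) {
        sum += weights[i] * inputs[i];    // weight * input over incoming connections
    }
    return 1.0 / (1.0 + Math.exp(-sum));  // sigmoid / logistic activation
}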
22. Backpropagation Learning
Calculates the gradient of the network's error with respect to the network's modifiable weights
Intuition:
Run a forward pass of the example through the network
Compute activations and output
Iterate from the output layer back to the input layer
For each neuron in the layer
Compute the node's responsibility for the error
Update the weights on its connections (see the sketch below)
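A minimal sketch of the output-layer part of this procedure, assuming sigmoid activations and a squared-error loss (illustrative Java; the hidden-layer recursion and the actual Metronome implementation are omitted):

// Backprop for the output layer: each node's "responsibility" for the error is
// delta = (output - target) * sigmoid'(net) = (output - target) * output * (1 - output),
// and each incoming weight is nudged against its contribution to that error.
public static void updateOutputLayer(double[] hiddenActivations,
                                     double[] outputs,
                                     double[] targets,
                                     double[][] weights,    // [output neuron][hidden neuron]
                                     double learningRate) {
    for (int o = 0; o < outputs.length; o++) {
        double delta = (outputs[o] - targets[o]) * outputs[o] * (1.0 - outputs[o]);
        for (int h = 0; h < hiddenActivations.length; h++) {
            weights[o][h] -= learningRate * delta * hiddenActivations[h];
        }
    }
}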
23. Parallelizing Neural Networks
Dean et al. (NIPS 2012)
First steps: focus on linear convex models, calculating a distributed gradient
Model parallelism must be combined with distributed optimization that leverages data parallelism:
simultaneously process distinct training examples in each of the many model replicas
periodically combine their results to optimize our objective function
Single-pass frameworks such as MapReduce are "ill-suited"
24. Costs of Neural Network Training
Connection count explodes quickly as neurons and layers increase
Example: a {784, 450, 10} network has 357,300 connections
Need a fast iterative framework
Example: with a 30-second MapReduce job setup cost over 10,000 epochs: 30s x 10,000 == 300,000 seconds of setup time alone
That is 5,000 minutes, or roughly 83 hours
3 ways to speed up training:
Subdivide the dataset between workers (data parallelism)
Max out disk transfer rates and use vector caching to maximize data throughput
Minimize inter-epoch setup times with a proper iterative framework
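To make the connection-count arithmetic above concrete, a small helper that computes it for any fully-connected layer configuration (illustrative):

// Number of weights in a fully-connected feed-forward network.
// For layer sizes {784, 450, 10}: 784*450 + 450*10 = 352,800 + 4,500 = 357,300.
public static long countConnections(int[] layerSizes) {
    long connections = 0;
    for (int i = 0; i < layerSizes.length - 1; i++) {
        connections += (long) layerSizes[i] * layerSizes[i + 1];
    }
    return connections;
}
// countConnections(new int[] {784, 450, 10}) == 357300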
25. Vector In-Memory Caching
Since we make many passes over the same dataset, in-memory caching makes sense here
Once a record is vectorized, it is cached in memory on the worker node
Speedup (single pass, "no cache" vs "cached"): ~12x
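A minimal sketch of the caching idea: vectorize each raw record once and serve all later passes from memory on the worker (illustrative Java; the record format and parser are assumptions, not Metronome's actual code):

import java.util.ArrayList;
import java.util.List;

// Worker-side vector cache: parse/vectorize each record only once, then reuse
// the in-memory vectors on every subsequent pass over the dataset.
public class VectorCache {

    private final List<double[]> cachedVectors = new ArrayList<double[]>();
    private boolean loaded = false;

    public List<double[]> getVectors(Iterable<String> rawRecords) {
        if (!loaded) {
            for (String record : rawRecords) {
                cachedVectors.add(vectorize(record));   // first pass: vectorize and cache
            }
            loaded = true;
        }
        return cachedVectors;                            // later passes: serve from memory
    }

    // Toy vectorizer: comma-separated numeric fields (assumed record format).
    private static double[] vectorize(String record) {
        String[] fields = record.split(",");
        double[] v = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            v[i] = Double.parseDouble(fields[i].trim());
        }
        return v;
    }
}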
26. Neural Networks Parallelization Speedup
[Chart: Training Speedup Factor (1.00-6.00x) vs Number of Parallel Processing Units (1-5); datasets: UCI Iris, UCI Lenses, UCI Wine, UCI Dermatology, NIST Handwriting Downsample]
28. Lessons Learned
Linear scaling continues to be achieved with parameter-averaging variations
Tuning is critical
Need to be good at selecting a learning rate
31. Unit Testing and IRUnit
Simulates the IterativeReduce parallel framework
Uses the same app.properties file that YARN applications do
Examples
https://github.com/jpatanooga/Metronome/blob/master/src/test/java/tv/floe/metronome/linearregression/iterativereduce/TestSimulateLinearRegressionIterativeReduce.java
https://github.com/jpatanooga/KnittingBoar/blob/master/src/test/java/com/cloudera/knittingboar/sgd/iterativereduce/TestKnittingBoar_IRUnitSim.java
Editor's notes
Talk about how you normally would use the Normal equation, notes from Andrew Ng
"Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems." (Bottou, 2010). SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners, with no loss of model accuracy.
The most important additions in Mahout's SGD are: confidence-weighted learning rates per term, evolutionary tuning of hyper-parameters, mixed ranking and regression, and grouped AUC. The implication of it being local is that you are limited to the compute capacity of the local machine, as opposed to even a single machine on the cluster.
Bottou is similar to Xu (2010) in the 2010 paper.
Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures. Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data, such as iterative algorithms (many in machine learning). No single programming model or framework can excel at every problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
POLR: Parallel Online Logistic Regression. Talking points: we wanted to start with a tool known to the Hadoop community, with expected characteristics. Mahout's SGD is well known, so we used that as a base point.
3 major costs of BSP-style computations: max unit compute time, cost of global communication, and cost of the barrier sync at the end of each superstep.