1. Scalable Machine Learning
A survey of large-scale machine learning frameworks.
Arnaud Rachez
rachez@ceremade.dauphine.fr
2. Intro - Cellule de Calcul
• Who?
– Engineers: Arnaud Rachez, Fabian Pedregosa (from Feb. 2015)
– Researchers: Stéphane Gaiffas (X), Robin Ryder (Dauphine)
• What for?
– Pooling computational resources for partners of the chair.
– Centralizing computational expertise for academic projects and industrial collaborations.
3. Context
Try to view Big Data from the perspective of a machine learning researcher.
Implementing algorithms at scale in a parallel and distributed fashion.
Big models trained with online optimisation (e.g. deep networks) or sampling (e.g. topic models).
4. Why all the hype?
A. Halevy, P. Norvig, F. Pereira. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, 2009.
[Max Welling, ICML 2014]
Big Data → Big Models
5. Outline
• Out of core
• Data parallel
• Graph parallel
• Model parallel
6. More details
Since this is a short talk, I’ll go very quickly over a lot of the interesting details of the frameworks.
If you are interested in knowing more and have ~2 hours to spare, you should definitely check J. Gonzalez’s talk:
J. Gonzalez @ ICML 2014
techtalks.tv/talks/emerging-systems-for-large-scale-machine-learning/60852/
8. Out of core!
• Problem: Training data does not fit in RAM.
• Solution: Lay out data efficiently on disk and load it into memory as needed.
Very fast online learning: one thread to read, one to train; hashing trick, online error, etc.
Sometimes extends to GPU computing: parallel matrix multiplication, where the bottleneck tends to be CPU-GPU memory transfer.
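A minimal out-of-core sketch (plain Python/NumPy, not VW’s actual implementation; the CSV layout is an assumption): the file is streamed in mini-batches, so only one batch is ever in RAM, whatever the size of the data on disk.

import numpy as np

def stream_batches(path, batch_size=1024):
    # One example per line: label,f0,f1,...  Only one batch is in RAM at a time.
    rows, labels = [], []
    with open(path) as f:
        for line in f:
            vals = line.rstrip().split(",")
            labels.append(float(vals[0]))
            rows.append([float(v) for v in vals[1:]])
            if len(rows) == batch_size:
                yield np.array(rows), np.array(labels)
                rows, labels = [], []
    if rows:
        yield np.array(rows), np.array(labels)

def sgd_logistic(path, n_features, lr=0.1):
    # Online logistic regression: one pass over the file, constant memory.
    w = np.zeros(n_features)
    for X, y in stream_batches(path):
        p = 1.0 / (1.0 + np.exp(-X.dot(w)))   # predictions on this batch
        w -= lr * X.T.dot(p - y) / len(y)     # gradient step, then the batch is discarded
    return w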
9. Playing with Vowpal Wabbit
• Criteo’s Display Advertising Challenge dataset: ~10GB, ~50M lines
• VW’s logistic regression run on one EC2 instance with an attached
EBS volume (3000 reserved IOPS):
– cross-entropy = 0.473 in 2’10” (one online pass)
– converged to 0.470 in 7 passes (9’4”)
Pure C++ code. Compiles without problem on Linux, but the latest version has trouble on Mac. Has recently added support for a cluster mode using allreduce.
Does not seem to support implementing new algorithms easily.
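For reference, the kind of invocation used in the run above (a sketch; train.vw and model.vw are hypothetical file names, the flags are standard VW options):

import subprocess

# Single online pass of logistic regression, streaming data from disk.
subprocess.run(["vw", "-d", "train.vw",
                "--loss_function", "logistic",
                "-b", "24",              # 2^24 hashed features (hashing trick)
                "-f", "model.vw"])       # save the trained model

# Several passes require caching the parsed data on disk.
subprocess.run(["vw", "-d", "train.vw", "--loss_function", "logistic",
                "--passes", "7", "--cache_file", "train.cache",
                "-f", "model.vw"])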
10. Scalability - A perspective on Big data
• Strong scaling: if you throw twice as many machines at
the task, you solve it in half the time.
Usually relevant when the task is CPU bound.
• Weak scaling: if the dataset is twice as big, throw twice
as many machines at it to solve the task in constant time.
Memory bound tasks… usually.
Most “big data” problems are I/O bound: it is hard to solve the task in an acceptable time independently of the size of the data (weak scaling).
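Stated compactly (a standard formulation, not from the slides), with T(p, n) the time to solve a problem of size n on p machines:

S(p) = T(1, n) / T(p, n)       (strong scaling speedup; ideal: S(p) = p)
E(p) = T(1, n) / T(p, p·n)     (weak scaling efficiency; ideal: E(p) = 1)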
12. Map-Reduce: Statistical query model
Algorithms whose updates can be written in summation form, Σ_i f(x_i, y_i), fit this model: f, the map function, is sent to every machine, and the sum corresponds to a reduce operation.
• D. Caragea et al. A Framework for Learning from Distributed Data Using Sufficient Statistics and Its Application to Learning Decision Trees. Int. J. Hybrid Intell. Syst., 2004.
• C.-T. Chu et al. Map-Reduce for Machine Learning on Multicore. NIPS’06.
13. Statistical query model - Example
Gradient of the loss: ∇L(w) = Σ_(x,y) ∇_w ℓ(w; x, y)
• For each (x, y) in the dataset, compute ∇_w ℓ(w; x, y) in parallel (Map step)
• Sum the gradients via Reduce
• Update w on the master
• Repeat until convergence
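A toy version of this loop with Python’s built-in map and reduce (single machine, but the structure is exactly what a cluster framework distributes; the logistic loss is an assumption):

import numpy as np
from functools import reduce

def grad_one(w, x, y):
    # Per-example gradient of the logistic loss: the Map step.
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
    return (p - y) * x

def fit(data, n_features, lr=0.1, n_iters=100):
    # data: list of (x, y) pairs, x a NumPy vector of length n_features.
    w = np.zeros(n_features)
    for _ in range(n_iters):                            # repeat until convergence
        grads = map(lambda ex: grad_one(w, *ex), data)  # Map: one gradient per (x, y)
        total = reduce(np.add, grads)                   # Reduce: sum of the gradients
        w -= lr * total / len(data)                     # update w on the master
    return w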
14. Map-Reduce
Pros:
• Resilient to failure: HDFS disk replication.
• Can run on huge clusters.
• Makes use of data locality: the program (query) is moved to the data, not the opposite.
Cons:
• Map functions must be stateless: state is lost between map iterations.
• Computation graph is very constrained by independencies: not ideal for computation on arbitrary graphs.
15. From Hadoop to Spark
Shamelessly stolen from J. Gonzalez’s presentation
16. Implemented algorithms in MLlib 1.1
• linear SVM and logistic regression
• classification and regression tree
• k-means clustering
• recommendation via alternating least squares
• singular value decomposition
• linear regression with L1- and L2-regularization
• multinomial naive Bayes
• basic statistics
• feature transformations
17. Playing with Spark
• Scala library with Java and Python interfaces. The Python version was not always responsive.
• Easy installation on both Linux and Mac. EC2 scripts allow for easy deployment in standalone cluster mode (give instances some additional time to be initialised correctly).
• Code base is under active development and MLlib seems a bit buggy at times. The Spark 1.1 version fixes an OutOfMemory error but crashes at the very end of the job.
• Strong scaling for logistic regression was superlinear… (probably due to a sub-optimal configuration)
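For reference, a minimal logistic regression in the MLlib 1.1-era Python API (the HDFS path and the CSV parsing are assumptions):

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

sc = SparkContext(appName="lr-example")

def parse(line):
    # Hypothetical layout: label,f0,f1,... per line.
    vals = [float(v) for v in line.split(",")]
    return LabeledPoint(vals[0], vals[1:])

# The RDD is partitioned across the cluster; training is data parallel.
data = sc.textFile("hdfs:///data/train.csv").map(parse).cache()
model = LogisticRegressionWithSGD.train(data, iterations=100)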
19. The Graph-parallel pattern
The model / algorithm state lives on the vertices of a graph; computation at each vertex depends only on its neighbors.
Shamelessly stolen from J. Gonzalez’s presentation, ICML ’14
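A schematic vertex program in the gather-apply-scatter style popularized by GraphLab (a toy single-machine sketch in Python, with PageRank as the example; the data layout is an assumption):

def pagerank(in_neighbors, out_degree, d=0.85, n_iters=20):
    # in_neighbors: vertex -> list of vertices linking to it (all of them keys of the dict)
    # out_degree:   vertex -> number of outgoing links
    rank = {v: 1.0 for v in in_neighbors}
    for _ in range(n_iters):
        new_rank = {}
        for v, nbrs in in_neighbors.items():
            # Gather: read only the neighbors' state.
            total = sum(rank[u] / out_degree[u] for u in nbrs)
            # Apply: update this vertex's state.
            new_rank[v] = (1 - d) + d * total
        # Scatter would signal neighbors to recompute; here we simply sweep all vertices.
        rank = new_rank
    return rank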
22. Playing with GraphLab
• C++ library using MPI for communication.
• Compiles without problem on Linux. Works on Mac but is a bit more involved (surprising, since it seems to be developed mainly on Mac).
• Easy deployment on a cluster. Basic ALS on a small Netflix subset works. No logistic regression implemented (it is a graph-oriented framework after all).
• Nice API for vertex programming. Would like to try collapsed Gibbs sampling on a larger dataset (Wikipedia?).
• Data input is constrained and preprocessing can be cumbersome (Spark could be used to take care of this part).
24. Big models
Data and models do not fit into memory anymore!
Deep learning: neural nets with 10B parameters.
PGMs: LDA with 1M words × 1K topics.
• Partition data on several machines
• Also partition the model!
[J. Gonzalez ICML 2014]
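To see why the model itself must be partitioned, a back-of-the-envelope count (my arithmetic, not from the slides):

1M words × 1K topics = 10^9 parameters
10^9 × 8 bytes (doubles) ≈ 8 GB for the topic matrix alone, before any replication.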
25. Parameter programming
IMO the most ambitious paradigm for large-scale ML:
1. asynchronous (for online learning),
2. flexible consistency models (for Hogwild!-style algorithms).
• With Hadoop/Spark you program on parallel collections.
• With GraphLab/Pregel you program on vertices.
• ParameterServer lets you program on parameters.
Two implementations, both from Carnegie Mellon University:
http://parameterserver.org and http://petuum.github.io
But it is for VERY large scale problems.
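Schematically, workers pull the current parameters, compute gradients on their shard of the data, and push updates back asynchronously. A toy in-process sketch (the push/pull names mirror the paradigm, not any specific API; the logistic gradient is an assumption):

import numpy as np

class ParameterServer:
    # Toy stand-in for a distributed key-value store of parameters.
    def __init__(self, n_params):
        self.w = np.zeros(n_params)

    def pull(self):
        return self.w.copy()        # a worker may read a slightly stale copy

    def push(self, grad, lr=0.1):
        self.w -= lr * grad         # updates may arrive asynchronously, out of order

def worker_step(server, X_shard, y_shard):
    w = server.pull()                                   # pull current parameters
    p = 1.0 / (1.0 + np.exp(-X_shard.dot(w)))
    grad = X_shard.T.dot(p - y_shard) / len(y_shard)    # local logistic gradient
    server.push(grad)                                   # push the update back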
27. Playing with parameter server
• Could not make ParameterServer work as of now…
• Petuum compiles easily on Linux.
• Neural network training works on a randomly generated dataset on my laptop.
• Supports cluster deployment too, but I haven’t tried it yet. Configuration will probably not be easy…
28. Summary
Spark, GraphLab and ParameterServer are complementary frameworks.
• Spark is easy to use and has a well-thought-out API that makes implementing new models quite easy (as long as they fit the Map-Reduce paradigm). It seems mainly targeted at companies already familiar with the Hadoop stack.
• GraphLab is designed for use by machine learning researchers. I am not certain vertex programming is convenient for all types of ML algorithms, but it certainly is appealing for MCMC methods.
• ParameterServer is the framework to rule them all. It is targeted at very large-scale machine learning and is still at a very early stage of development.