1. Scalable Machine Learning
A survey of large-scale machine learning frameworks.
Arnaud Rachez
rachez@ceremade.dauphine.fr
2. Intro - Cellule de Calcul
• Who?
– Engineers: Arnaud Rachez, Fabian Pedregosa (from Feb. 2015)
– Researchers: Stéphane Gaiffas (X), Robin Ryder (Dauphine)
• What for?
– Pooling computational resources for partners of the chair.
– Centralizing computational expertise for academic projects and industrial collaborations.
3. Context
Try to view Big Data from the perspective of a machine learning researcher.
Implementing algorithms at scale in a parallel and distributed fashion.
Big models trained with online optimisation (e.g. deep networks) or sampling (e.g. topic models).
4. Why all the hype?
A. Halevy, P. Norvig, F. Pereira. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, 2009.
[Max Welling, ICML 2014]
Big Data → Big Models
5. Outline
• Out of core
• Data parallel
• Graph parallel
• Model parallel
6. More details
Since this is a short talk, I’ll go very quickly over a lot of the interesting details of the frameworks.
If you are interested in knowing more and have ~2 hours to spare, you should definitely check J. Gonzalez’s talk:
J. Gonzalez @ ICML 2014
techtalks.tv/talks/emerging-systems-for-large-scale-machine-learning/60852/
8. Out of core!
• Problem: Training data does not fit in RAM.
• Solution: Lay out data efficiently on disk and load it into memory as needed.
Very fast online learning: one thread to read, one to train; hashing trick, online error, etc.
Sometimes extends to GPU computing: parallel matrix multiplication, where the bottleneck tends to be CPU-GPU memory transfer.
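A minimal out-of-core sketch (plain Python/NumPy, not VW’s actual implementation; the CSV layout is an assumption): the file is streamed in mini-batches, so only one batch is ever in RAM, whatever the size of the data on disk.

import numpy as np

def stream_batches(path, batch_size=1024):
    # One example per line: label,f0,f1,...  Only one batch is in RAM at a time.
    rows, labels = [], []
    with open(path) as f:
        for line in f:
            vals = line.rstrip().split(",")
            labels.append(float(vals[0]))
            rows.append([float(v) for v in vals[1:]])
            if len(rows) == batch_size:
                yield np.array(rows), np.array(labels)
                rows, labels = [], []
    if rows:
        yield np.array(rows), np.array(labels)

def sgd_logistic(path, n_features, lr=0.1):
    # Online logistic regression: one pass over the file, constant memory.
    w = np.zeros(n_features)
    for X, y in stream_batches(path):
        p = 1.0 / (1.0 + np.exp(-X.dot(w)))   # predictions on this batch
        w -= lr * X.T.dot(p - y) / len(y)     # gradient step, then the batch is discarded
    return w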
9. Playing with Vowpal Wabbit
• Criteo’s Display Advertising Challenge dataset: ~10GB, ~50M lines
• VW’s logistic regression run on one EC2 instance with an attached
EBS volume (3000 reserved IOPS):
– cross-entropy = 0.473 in 2’10” (one online pass)
– converged to 0.470 in 7 passes (9’4”)
Pure C++ code. Compiles without problem on Linux, but the latest version has trouble on Mac. Has recently added support for a cluster mode using allreduce.
Does not seem to support implementing new algorithms easily.
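For reference, the kind of invocation used in the run above (a sketch; train.vw and model.vw are hypothetical file names, the flags are standard VW options):

import subprocess

# Single online pass of logistic regression, streaming data from disk.
subprocess.run(["vw", "-d", "train.vw",
                "--loss_function", "logistic",
                "-b", "24",              # 2^24 hashed features (hashing trick)
                "-f", "model.vw"])       # save the trained model

# Several passes require caching the parsed data on disk.
subprocess.run(["vw", "-d", "train.vw", "--loss_function", "logistic",
                "--passes", "7", "--cache_file", "train.cache",
                "-f", "model.vw"])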
10. Scalability - A perspective on Big data
• Strong scaling: if you throw twice as many machines at
the task, you solve it in half the time.
Usually relevant when the task is CPU bound.
• Weak scaling: if the dataset is twice as big, throw twice
as many machines at it to solve the task in constant time.
Memory bound tasks… usually.
Most “big data” problems are I/O bound: it is hard to solve the task in an acceptable time independently of the size of the data (weak scaling).
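Stated compactly (a standard formulation, not from the slides), with T(p, n) the time to solve a problem of size n on p machines:

S(p) = T(1, n) / T(p, n)       (strong scaling speedup; ideal: S(p) = p)
E(p) = T(1, n) / T(p, p·n)     (weak scaling efficiency; ideal: E(p) = 1)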
12. Map-Reduce: Statistical query model
Algorithms whose updates can be written in summation form, Σ_i f(x_i, y_i), fit this model: f, the map function, is sent to every machine, and the sum corresponds to a reduce operation.
• D. Caragea et al. A Framework for Learning from Distributed Data Using Sufficient Statistics and Its Application to Learning Decision Trees. Int. J. Hybrid Intell. Syst., 2004.
• C.-T. Chu et al. Map-Reduce for Machine Learning on Multicore. NIPS’06.
13. Statistical query model - Example
Gradient of the loss: ∇L(w) = Σ_(x,y) ∇_w ℓ(w; x, y)
• For each (x, y) in the dataset, compute ∇_w ℓ(w; x, y) in parallel (Map step)
• Sum the gradients via Reduce
• Update w on the master
• Repeat until convergence
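A toy version of this loop with Python’s built-in map and reduce (single machine, but the structure is exactly what a cluster framework distributes; the logistic loss is an assumption):

import numpy as np
from functools import reduce

def grad_one(w, x, y):
    # Per-example gradient of the logistic loss: the Map step.
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
    return (p - y) * x

def fit(data, n_features, lr=0.1, n_iters=100):
    # data: list of (x, y) pairs, x a NumPy vector of length n_features.
    w = np.zeros(n_features)
    for _ in range(n_iters):                            # repeat until convergence
        grads = map(lambda ex: grad_one(w, *ex), data)  # Map: one gradient per (x, y)
        total = reduce(np.add, grads)                   # Reduce: sum of the gradients
        w -= lr * total / len(data)                     # update w on the master
    return w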
14. Map-Reduce
Pros:
• Resilient to failure: HDFS disk replication.
• Can run on huge clusters.
• Makes use of data locality: the program (query) is moved to the data, not the opposite.
Cons:
• Map functions must be stateless: state is lost between map iterations.
• Computation graph is very constrained by independencies: not ideal for computation on arbitrary graphs.
15. From Hadoop to Spark
Shamelessly stolen from J. Gonzalez’s presentation
16. Implemented algorithms in MLlib 1.1
• linear SVM and logistic regression
• classification and regression tree
• k-means clustering
• recommendation via alternating least squares
• singular value decomposition
• linear regression with L1- and L2-regularization
• multinomial naive Bayes
• basic statistics
• feature transformations
17. Playing with Spark
• Scala library with Java and Python interfaces. The Python version was not always responsive.
• Easy installation on both Linux and Mac. EC2 scripts allow for easy deployment in standalone cluster mode (give instances some additional time to be initialised correctly).
• Code base is under active development and MLlib seems a bit buggy at times. The Spark 1.1 version fixes an OutOfMemory error but crashes at the very end of the job.
• Strong scaling for logistic regression was superlinear… (probably due to a sub-optimal configuration)
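For reference, a minimal logistic regression in the MLlib 1.1-era Python API (the HDFS path and the CSV parsing are assumptions):

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

sc = SparkContext(appName="lr-example")

def parse(line):
    # Hypothetical layout: label,f0,f1,... per line.
    vals = [float(v) for v in line.split(",")]
    return LabeledPoint(vals[0], vals[1:])

# The RDD is partitioned across the cluster; training is data parallel.
data = sc.textFile("hdfs:///data/train.csv").map(parse).cache()
model = LogisticRegressionWithSGD.train(data, iterations=100)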
19. The Graph-parallel pattern
The model / algorithm state lives on the vertices of a graph; computation at each vertex depends only on its neighbors.
Shamelessly stolen from J. Gonzalez’s presentation, ICML ’14
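A schematic vertex program in the gather-apply-scatter style popularized by GraphLab (a toy single-machine sketch in Python, with PageRank as the example; the data layout is an assumption):

def pagerank(in_neighbors, out_degree, d=0.85, n_iters=20):
    # in_neighbors: vertex -> list of vertices linking to it (all of them keys of the dict)
    # out_degree:   vertex -> number of outgoing links
    rank = {v: 1.0 for v in in_neighbors}
    for _ in range(n_iters):
        new_rank = {}
        for v, nbrs in in_neighbors.items():
            # Gather: read only the neighbors' state.
            total = sum(rank[u] / out_degree[u] for u in nbrs)
            # Apply: update this vertex's state.
            new_rank[v] = (1 - d) + d * total
        # Scatter would signal neighbors to recompute; here we simply sweep all vertices.
        rank = new_rank
    return rank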
22. Playing with GraphLab
• C++ library using MPI for communication.
• Compiles without problem on Linux. Works on Mac but is a bit more involved (surprising, since it seems to be developed mainly on Mac).
• Easy deployment on a cluster. Basic ALS on a small Netflix subset works. No logistic regression implemented (it is a graph-oriented framework after all).
• Nice API for vertex programming. Would like to try collapsed Gibbs sampling on a larger dataset (Wikipedia?).
• Data input is constrained and preprocessing can be cumbersome (Spark could be used to take care of this part).
24. Big models
Data and models do not fit into memory anymore!
Deep learning: neural nets with 10B parameters.
PGMs: LDA with 1M words × 1K topics.
• Partition data on several machines
• Also partition the model!
[J. Gonzalez ICML 2014]
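To see why the model itself must be partitioned, a back-of-the-envelope count (my arithmetic, not from the slides):

1M words × 1K topics = 10^9 parameters
10^9 × 8 bytes (doubles) ≈ 8 GB for the topic matrix alone, before any replication.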
25. Parameter programming
IMO the most ambitious paradigm for large-scale ML:
1. asynchronous (for online learning),
2. flexible consistency models (for Hogwild!-style algorithms).
• With Hadoop/Spark you program on parallel collections.
• With GraphLab/Pregel you program on vertices.
• ParameterServer lets you program on parameters.
Two implementations, both from Carnegie Mellon University:
http://parameterserver.org and http://petuum.github.io
But it is for VERY large scale problems.
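Schematically, workers pull the current parameters, compute gradients on their shard of the data, and push updates back asynchronously. A toy in-process sketch (the push/pull names mirror the paradigm, not any specific API; the logistic gradient is an assumption):

import numpy as np

class ParameterServer:
    # Toy stand-in for a distributed key-value store of parameters.
    def __init__(self, n_params):
        self.w = np.zeros(n_params)

    def pull(self):
        return self.w.copy()        # a worker may read a slightly stale copy

    def push(self, grad, lr=0.1):
        self.w -= lr * grad         # updates may arrive asynchronously, out of order

def worker_step(server, X_shard, y_shard):
    w = server.pull()                                   # pull current parameters
    p = 1.0 / (1.0 + np.exp(-X_shard.dot(w)))
    grad = X_shard.T.dot(p - y_shard) / len(y_shard)    # local logistic gradient
    server.push(grad)                                   # push the update back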
27. Playing with parameter server
• Could not make ParameterServer work as of now…
• Petuum compiles easily on Linux.
• Neural network training works on a randomly generated dataset on my laptop.
• Supports cluster deployment too, but I haven’t tried it yet. Configuration will probably not be easy…
28. Summary
Spark, GraphLab and ParameterServer are complementary frameworks.
• Spark is easy to use and has a well-thought-out API that makes implementing new models quite easy (as long as they fit the Map-Reduce paradigm). It seems mainly targeted at companies already familiar with the Hadoop stack.
• GraphLab is designed for use by machine learning researchers. I am not certain vertex programming is convenient for all types of ML algorithms, but it certainly is appealing for MCMC methods.
• ParameterServer is the framework to rule them all. It is targeted at very large-scale machine learning and is still at a very early stage of development.