Scalable Machine Learning
A survey of large scale machine learning frameworks.
Arnaud Rachez
rachez@ceremade.dauphine.fr
Intro - Cellule de Calcul
• Who?
– Engineers: Arnaud Rachez, Fabian Pedregosa (from Feb. 2015)
– Researchers: Stéphane Gaiffas (X), Robin Ryder (Dauphine)
• What for?
– Pooling computational resources for partners of the chair.
– Centralizing computational expertise for academic projects and industrial collaborations.
Context
Try to view Big Data from the perspective of a machine learning researcher:
• Implementing algorithms at scale, in a parallel and distributed fashion.
• Big models trained with online optimisation (e.g. deep networks) or sampling (e.g. topic models).
Why all the hype?
Big Data, Big Models
A. Halevy, P. Norvig, F. Pereira. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, 2009.
[Max Welling, ICML 2014]
Outline
• Out of core
• Data parallel
• Graph parallel
• Model parallel
More details
Since this is a short talk, I'll go very quickly over a lot of the interesting details of the frameworks.
If you are interested in knowing more and have ~2 hours to spare, you should definitely check J. Gonzalez's talk:
J. Gonzalez @ ICML 2014
techtalks.tv/talks/emerging-systems-for-large-scale-machine-learning/60852/
Out of core
Scaling on one machine
Out of core!
• Problem: Training data does not fit in RAM.
• Solution: Lay out data efficiently on disk and load it into memory as needed.
Very fast online learning: one thread to read, one to train; hashing trick, online error, etc.
Sometimes extends to GPU computing: parallel matrix multiplication, where the bottleneck tends to be CPU-GPU memory transfer.
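To make the pattern concrete, here is a minimal out-of-core sketch (mine, not from the talk): logistic SGD with the hashing trick, streaming the training file one line at a time so memory use stays constant regardless of file size. The file name and the "label feat1 feat2 …" format are assumptions.

```python
import math

D = 2 ** 20  # fixed weight-vector size for the hashing trick (assumption)
w = [0.0] * D
lr = 0.1

def sgd_step(features, y):
    """One online logistic-regression update; returns the pre-update log loss."""
    idx = [hash(f) % D for f in features]           # hashing trick
    margin = sum(w[i] for i in idx)                 # sparse dot product
    p = 1.0 / (1.0 + math.exp(-max(min(margin, 35), -35)))
    for i in idx:                                   # gradient step on active weights
        w[i] -= lr * (p - y)
    return -math.log(p if y == 1 else 1.0 - p)

# Stream the training file: never more than one example in RAM.
loss, n = 0.0, 0
with open("train.txt") as fh:                       # hypothetical "label feat1 feat2 ..." file
    for line in fh:
        label, *features = line.split()
        loss += sgd_step(features, int(label)); n += 1
        if n % 100000 == 0:
            print(f"{n} examples, online loss {loss / n:.4f}")
```

The running average of the per-example loss computed before each update is the "online error" mentioned above: it is essentially a free progressive validation estimate.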
Playing with Vowpal Wabbit
• Criteo's Display Advertising Challenge dataset: ~10GB with ~50M lines.
• VW's logistic regression run on one EC2 instance with an attached EBS volume (3,000 reserved IOPS):
– cross-entropy = 0.473 in 2'10" (one online pass)
– converged to 0.470 in 7 passes (9'4")
Pure C++ code. Compiles without problem on Linux, but the latest version has trouble on Mac. Has recently added support for a cluster mode using allreduce.
Does not seem to support implementing new algorithms easily.
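For reference, VW reads a plain-text format of the form `label |namespace feature:value …`, with labels in {-1, +1} for logistic loss. A small hypothetical converter (the column names are made up; Criteo's real schema differs):

```python
import csv

def to_vw_line(label, row):
    """Format one example as a VW input line: 'label |f features...'."""
    y = 1 if int(label) == 1 else -1           # VW's logistic loss expects -1/+1
    feats = []
    for k, v in row.items():
        if not v:                               # skip missing values
            continue
        try:
            feats.append(f"{k}:{float(v)}")     # numeric feature: name:value
        except ValueError:
            feats.append(f"{k}={v}")            # categorical: one token, implicit value 1
    return f"{y} |f " + " ".join(feats)

# Hypothetical column names; Criteo's actual schema differs.
with open("train.csv") as src, open("train.vw", "w") as dst:
    for row in csv.DictReader(src):
        dst.write(to_vw_line(row.pop("click"), row) + "\n")
```

Training is then a single command along the lines of `vw train.vw --loss_function logistic --passes 7 -c` (the `-c` cache file is required for multiple passes).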
Scalability - A perspective on Big Data
• Strong scaling: if you throw twice as many machines at the task, you solve it in half the time. Usually relevant when the task is CPU bound.
• Weak scaling: if the dataset is twice as big, throw twice as many machines at it to solve the task in constant time. Usually relevant when the task is memory bound.
Most "big data" problems are I/O bound: it is hard to solve the task in an acceptable time independently of the size of the data (weak scaling).
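In symbols (standard definitions, not from the slides), writing T(p, n) for the time to process n examples on p machines:

```latex
% Strong scaling: fixed problem size n, growing machine count p
\mathrm{speedup}(p) = \frac{T(1, n)}{T(p, n)} \quad (\text{ideal: } p)

% Weak scaling: problem size grows with the machine count
\mathrm{efficiency}(p) = \frac{T(1, n)}{T(p,\; p \cdot n)} \quad (\text{ideal: } 1)
```

The super-linear strong scaling reported for Spark later in this deck corresponds to speedup(p) > p, which usually points to a configuration or measurement artifact rather than a genuine effect.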
Data parallel
Statistical query model
Map-Reduce: Statistical query model
In the statistical query model, the learning algorithm interacts with the data only through empirical sums of the form Σ(x,y) f(x, y):
• f, the map function, is sent to every machine;
• the sum corresponds to a reduce operation.
• D. Caragea et al. A Framework for Learning from Distributed Data Using Sufficient Statistics and Its Application to Learning Decision Trees. Int. J. Hybrid Intell. Syst., 2004.
• Chu et al. Map-Reduce for Machine Learning on Multicore. NIPS'06.
Statistical query model - Example
Gradient of the loss: ∇L(w) = Σ(x,y) ∇ℓ(w; x, y)
• For each (x, y) in the dataset, compute ∇ℓ(w; x, y) in parallel (Map step)
• Sum the gradients via Reduce
• Update w on the master
• Repeat until convergence (see the sketch below)
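A toy, single-machine rendition of this loop (my own sketch, with squared loss standing in for ℓ and a process pool standing in for the cluster):

```python
from multiprocessing import Pool
from functools import reduce

def grad_one(args):
    """Map step: gradient of the squared loss on one example, d/dw (w*x - y)^2 / 2."""
    w, (x, y) = args
    return (w * x - y) * x

if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]   # toy 1-d dataset (x, y), roughly y = 2x
    w, lr = 0.0, 0.05
    with Pool() as pool:
        for step in range(100):
            grads = pool.map(grad_one, [(w, d) for d in data])   # Map step
            total = reduce(lambda a, b: a + b, grads)            # Reduce step
            w -= lr * total / len(data)                          # update on the master
    print(f"fitted w = {w:.3f}")   # converges to roughly 2 for this data
```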
Map-Reduce
• Resilient to failure: HDFS disk replication.
• Can run on huge clusters.
• Makes use of data locality: the program (query) is moved to the data, not the opposite.
• Map functions must be stateless: state is lost between map iterations.
• The computation graph is very constrained by independence assumptions: not ideal for computation on arbitrary graphs.
From Hadoop to Spark
[Diagram shamelessly stolen from J. Gonzalez's presentation]
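The key difference for ML workloads: Hadoop writes intermediate results to disk after every Map-Reduce pass, while Spark can keep the working set in cluster memory across iterations. A sketch of the same gradient loop in PySpark (illustrative; parse_point and the HDFS path are assumptions):

```python
from pyspark import SparkContext

sc = SparkContext(appName="gd-sketch")

def parse_point(line):
    """Hypothetical parser: one 'x y' pair per line."""
    x, y = map(float, line.split())
    return x, y

# cache() keeps the parsed dataset in cluster memory across iterations,
# which is what makes iterative ML practical compared to plain Hadoop.
points = sc.textFile("hdfs:///data/train.txt").map(parse_point).cache()
n = points.count()

w, lr = 0.0, 0.05
for _ in range(100):
    total = points.map(lambda p: (w * p[0] - p[1]) * p[0]) \
                  .reduce(lambda a, b: a + b)      # Map + Reduce on the cluster
    w -= lr * total / n                            # update on the driver
print(w)
```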
Implemented algorithms (MLlib 1.1)
• linear SVM and logistic regression
• classification and regression tree
• k-means clustering
• recommendation via alternating least squares
• singular value decomposition
• linear regression with L1- and L2-regularization
• multinomial naive Bayes
• basic statistics
• feature transformations
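As a usage sketch against the MLlib 1.x API of the era (the path and data format are placeholders), the logistic regression from the list above is a few lines in PySpark:

```python
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

sc = SparkContext(appName="mllib-sketch")

def parse(line):
    """Assumed format: 'label f1 f2 ...', space-separated."""
    vals = [float(v) for v in line.split()]
    return LabeledPoint(vals[0], vals[1:])

data = sc.textFile("hdfs:///data/train.txt").map(parse).cache()
model = LogisticRegressionWithSGD.train(data, iterations=100)

# Training error: fraction of examples the model misclassifies.
err = data.filter(lambda p: model.predict(p.features) != p.label).count() \
      / float(data.count())
print("training error:", err)
```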
Playing with Spark
• Scala library with Java and Python interfaces. The Python version was not always responsive.
• Easy installation on both Linux and Mac. EC2 scripts allow for easy deployment in standalone cluster mode (give instances some additional time to be initialised correctly).
• The code base is under active development and MLlib seems a bit buggy at times. Spark 1.1 fixes an OutOfMemory error but crashes at the very end of the job.
• Strong scaling for logistic regression was super-linear… (probably due to a sub-optimal configuration).
Graph parallel
Vertex programming (Pregel)
The Graph-parallel pattern
The model/algorithm state lives on the vertices and edges of a graph, and computation at a vertex depends only on its neighbors.
[Diagram shamelessly stolen from J. Gonzalez's ICML '14 presentation]
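As a flavour of vertex programming, here is a toy, single-machine PageRank in the gather-apply-scatter style that GraphLab popularised (a sketch only; real frameworks partition the graph across machines and schedule vertices, possibly asynchronously):

```python
# Toy gather-apply-scatter (GAS) PageRank on a tiny hard-coded graph.
out_edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
in_edges = {"a": ["c"], "b": ["a"], "c": ["a", "b"]}
rank = {v: 1.0 for v in out_edges}           # per-vertex state

def gather(v):
    """Gather phase: sum rank contributions from in-neighbors only."""
    return sum(rank[u] / len(out_edges[u]) for u in in_edges[v])

for _ in range(30):                           # synchronous (BSP-style) rounds
    acc = {v: gather(v) for v in rank}        # gather for every vertex
    for v in rank:                            # apply: update vertex state
        rank[v] = 0.15 + 0.85 * acc[v]
    # A scatter phase would signal neighbors whose value changed; omitted here.

print(rank)
```

The point of the pattern is that gather touches only a vertex's neighborhood, so the runtime can execute many vertex programs in parallel as long as their neighborhoods do not conflict.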
BSP processing
Synchronous vs Asynchronous
[Figure from J. Gonzalez, Parallel Gibbs Sampling, 2010: synchronous parallel execution of Gibbs sampling (t = 0…3) turns a strong positive correlation between neighboring variables into a strong negative one, whereas sequential execution preserves it; a properly scheduled asynchronous execution is "ML approved".]
Many Graph-Parallel Algorithms
• Collaborative Filtering
– Alternating Least Squares
– Stochastic Gradient Descent
– Tensor Factorization
• Structured Prediction
– Loopy Belief Propagation
– Max-Product Linear Programs
– Gibbs Sampling
• Semi-supervised ML
– Graph SSL
– CoEM
• Community Detection
– Triangle-Counting
– K-core Decomposition
– K-Truss
• Graph Analytics
– PageRank
– Personalized PageRank
– Shortest Path
– Graph Coloring
• Classification
– Neural Networks
Playing with GraphLab
• C++ library using MPI for communication.
• Compiles without problem on Linux. Works on Mac but is a bit more involved (surprising, since it seems to be developed mainly on Mac).
• Easy deployment on a cluster. Basic ALS on a small Netflix subset works. No logistic regression implemented (it is a graph-oriented framework, after all).
• Nice API for vertex programming. Would like to try collapsed Gibbs sampling on a larger dataset (Wikipedia?).
• Data input is constrained and preprocessing can be cumbersome (Spark could be used to take care of this part).
Model parallel
Parameter programming
Big models
Data and models do not fit into memory anymore!
• Deep learning: neural nets with 10B parameters.
• Probabilistic graphical models: LDA with a 1M-word vocabulary × 1K topics.
• Partition the data on several machines.
• Also partition the model!
[J. Gonzalez, ICML 2014]
Parameter programming
IMO the most ambitious paradigm for large-scale ML:
1. asynchronous (for online learning),
2. flexible consistency models (for Hogwild!-style algorithms).
• With Hadoop/Spark you program on parallel collections.
• With GraphLab/Pregel you program on vertices.
• A parameter server lets you program on parameters.
Two implementations, both from Carnegie Mellon University: http://parameterserver.org and http://petuum.github.io
But it is for VERY large scale problems.
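The programming model boils down to a shared key-value store of parameters with pull (read) and push (update) operations, and workers that are allowed to see slightly stale values. A minimal single-process sketch of the idea (the server/worker API here is invented for illustration and is not either CMU system's actual interface):

```python
import threading, random

class ParameterServer:
    """Toy in-process parameter server: a locked key-value store of weights."""
    def __init__(self, dim):
        self.w = [0.0] * dim
        self.lock = threading.Lock()

    def pull(self):                      # workers read a (possibly stale) snapshot
        with self.lock:
            return list(self.w)

    def push(self, grad, lr=0.05):       # workers send gradients; server updates
        with self.lock:
            for i, g in enumerate(grad):
                self.w[i] -= lr * g

def worker(ps, shard, steps=200):
    """Asynchronous SGD on one data shard: pull, compute local gradient, push."""
    for _ in range(steps):
        w = ps.pull()
        x, y = random.choice(shard)                    # one (x, y) example
        pred = sum(wi * xi for wi, xi in zip(w, x))
        ps.push([(pred - y) * xi for xi in x])         # squared-loss gradient

# Toy data y = 3 + 2x with a bias feature, split into two shards.
ps = ParameterServer(dim=2)
shards = [[([1.0, xv], 3.0 + 2.0 * xv) for xv in (1.0, 2.0)],
          [([1.0, xv], 3.0 + 2.0 * xv) for xv in (3.0, 4.0)]]
threads = [threading.Thread(target=worker, args=(ps, s)) for s in shards]
for t in threads: t.start()
for t in threads: t.join()
print(ps.w)   # converges to roughly [3.0, 2.0]
```

Removing the lock and letting pushes race is essentially the Hogwild! setting mentioned above; the flexible consistency models are about controlling how stale a pulled snapshot may be.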
Implemented algorithms
Very, very beta…
• Linear and logistic regression
• Neural nets?
Playing with parameter server
• Could not make ParameterServer work as of now…
• Petuum compiles easily on Linux.
• Neural network training works on a randomly generated dataset on my laptop.
• Support for cluster deployment too, but I haven't tried it yet. Configuration will probably not be easy…
Summary
Spark, GraphLab and ParameterServer are complementary frameworks.
• Spark is easy to use and has a well-thought-out API that makes implementing new models quite easy (as long as they fit the Map-Reduce paradigm). It seems mainly targeted at companies already familiar with the Hadoop stack.
• GraphLab is designed for use by machine learning researchers. I am not certain vertex programming is convenient for all types of ML algorithms, but it is certainly appealing for MCMC methods.
• ParameterServer is the framework to rule them all. It is targeted at very large scale machine learning and is still at a very early development stage.
Thanks