A Scalable Implementation of Deep Learning on Spark
Alexander Ulanov (1)
Joint work with Xiangrui Meng (2), Bert Greevenbosch (3)
With help from Guoqiang Li (4), Andrey Simanovsky (1)
(1) Hewlett Packard Labs, (2) Databricks, (3) Huawei & Jules Energy, (4) Spark community
Outline
– Artificial neural network basics
– Implementation of Multilayer Perceptron (MLP) in Spark
– Optimization & parallelization
– Experiments
– Future work
– What's new compared to the Spark Summit talk
  – Update and more details about the parallelization heuristic
  – Experiments with a larger cluster
  – Slide design (now Hewlett Packard Enterprise)
Artificial neural network
– Basics
  – Statistical model that approximates a function of multiple inputs
  – Consists of interconnected "neurons" which exchange messages
  – A "neuron" produces an output by applying a transformation function to its inputs
  – A network with more than 3 layers of neurons is called "deep" and is an instance of deep learning
– Layer types & learning
  – A layer type is defined by a transformation function
  – Affine: $y_j = \sum_i w_{ij} x_i + b_j$, Sigmoid: $y_i = (1 + e^{-x_i})^{-1}$, Convolution, Softmax, etc.
  – Multilayer perceptron (MLP) – a network with several pairs of Affine & Sigmoid layers
  – Model parameters – weights that "neurons" use for transformations
  – Parameters are iteratively estimated with the backpropagation algorithm
– Multilayer perceptron
  – Speech recognition (phoneme classification), computer vision
  – Released in Spark 1.5.0
[Figure: network diagram with input $x$, a hidden layer, and output $y$]
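The Affine and Sigmoid formulas above can be illustrated with a minimal pure-Python sketch of a forward pass. This is not the Spark implementation; the function names and the toy weights are hypothetical and chosen only for illustration.

```python
import math

def affine(x, weights, bias):
    """Affine layer: y_j = sum_i weights[i][j] * x[i] + bias[j]."""
    return [sum(weights[i][j] * x[i] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]

def sigmoid(x):
    """Sigmoid layer: y_i = 1 / (1 + exp(-x_i)), applied element-wise."""
    return [1.0 / (1.0 + math.exp(-v)) for v in x]

def mlp_forward(x, layers):
    """Apply (Affine, Sigmoid) pairs in order; layers is a list of (weights, bias)."""
    for weights, bias in layers:
        x = sigmoid(affine(x, weights, bias))
    return x

# Hypothetical toy network: 2 inputs -> 2 hidden neurons -> 1 output.
toy_layers = [
    ([[0.5, -0.5], [0.25, 0.75]], [0.0, 0.1]),  # 2 x 2 weights, 2 biases
    ([[1.0], [-1.0]], [0.0]),                   # 2 x 1 weights, 1 bias
]
output = mlp_forward([1.0, 2.0], toy_layers)
```

Each `(weights, bias)` pair plays the role of one Affine layer followed by one Sigmoid layer, so the output always lies strictly between 0 and 1.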
Example of MLP in Spark
– Handwritten digits recognition
  – Dataset MNIST [LeCun et al. 1998]
  – 28x28 greyscale images of handwritten digits 0-9
  – MLP with 784 inputs, 10 outputs and two hidden layers of 300 and 100 neurons
Scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.sql.DataFrame

val digits: DataFrame = sqlContext.read.format("libsvm").load("/data/mnist")
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(784, 300, 100, 10))
  .setBlockSize(128)
val model = mlp.fit(digits)

[Figure: network diagram with 784 inputs, 1st hidden layer (300 neurons), 2nd hidden layer (100 neurons), output layer (10 neurons)]

Python
from pyspark.ml.classification import MultilayerPerceptronClassifier

digits = sqlContext.read.format("libsvm").load("/data/mnist")
mlp = MultilayerPerceptronClassifier(layers=[784, 300, 100, 10], blockSize=128)
model = mlp.fit(digits)
Pipeline with PCA+MLP in Spark
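Scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.feature.PCA
import org.apache.spark.sql.DataFrame

val digits: DataFrame = sqlContext.read.format("libsvm").load("/data/mnist")
// Reduce the 784 input features to 20 principal components
val pca = new PCA()
  .setInputCol("features")
  .setK(20)
  .setOutputCol("features20")
val mlp = new MultilayerPerceptronClassifier()
  .setFeaturesCol("features20")
  .setLayers(Array(20, 50, 10))
  .setBlockSize(128)
val pipeline = new Pipeline()
  .setStages(Array(pca, mlp))
val model = pipeline.fit(digits)

Python
from pyspark.ml import Pipeline
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.feature import PCA

digits = sqlContext.read.format("libsvm").load("/data/mnist8m")
pca = PCA(inputCol="features", k=20, outputCol="features20")
mlp = MultilayerPerceptronClassifier(featuresCol="features20", layers=[20, 50, 10],
                                     blockSize=128)
pipeline = Pipeline(stages=[pca, mlp])
model = pipeline.fit(digits)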
MLP implementation in Spark
– Requirements
  – Conform to Spark APIs
  – Provide an extensible interface (deep learning API)
  – Efficient and scalable (single node & cluster)
– Why conform to Spark APIs?
  – Spark can call any Java, Python or Scala library, not necessarily one designed for Spark
  – This results in expensive data movement from Spark RDDs to the library
  – It also prevents the library from being used in Spark ML Pipelines
– Extensible interface
  – Our implementation processes each layer as a black box, with backpropagation in general form
  – Allows further introduction of new layers and features
  – CNN, (Stacked) Autoencoder and RBM are currently under development by the community
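To make the "layer as a black box" idea concrete, here is a minimal pure-Python sketch. It is not Spark's internal API: the `Layer`, `Affine`, `Sigmoid` and `backprop` names are hypothetical, and the only assumption carried over from the slide is that the training loop calls each layer through a generic forward/backward interface.

```python
import math

class Layer:
    """Hypothetical black-box layer: the trainer only calls forward and backward."""
    def forward(self, x):
        raise NotImplementedError
    def backward(self, x, grad_out):
        """Return (gradient w.r.t. the input, gradient w.r.t. the parameters)."""
        raise NotImplementedError

class Affine(Layer):
    def __init__(self, weights, bias):
        self.w, self.b = weights, bias  # w is m x n, b has n entries
    def forward(self, x):
        return [sum(self.w[i][j] * x[i] for i in range(len(x))) + self.b[j]
                for j in range(len(self.b))]
    def backward(self, x, grad_out):
        grad_in = [sum(self.w[i][j] * grad_out[j] for j in range(len(grad_out)))
                   for i in range(len(x))]
        grad_w = [[x[i] * g for g in grad_out] for i in range(len(x))]
        return grad_in, (grad_w, list(grad_out))

class Sigmoid(Layer):
    def forward(self, x):
        return [1.0 / (1.0 + math.exp(-v)) for v in x]
    def backward(self, x, grad_out):
        y = self.forward(x)
        return [g * v * (1.0 - v) for g, v in zip(grad_out, y)], None

def backprop(layers, x, grad_loss):
    """Generic backpropagation that never looks inside a layer."""
    inputs = []
    for layer in layers:  # forward pass, remembering each layer's input
        inputs.append(x)
        x = layer.forward(x)
    grad, param_grads = grad_loss(x), []
    for layer, layer_in in zip(reversed(layers), reversed(inputs)):
        grad, pg = layer.backward(layer_in, grad)
        param_grads.append(pg)
    return x, list(reversed(param_grads))

# Toy usage: one Affine + Sigmoid pair with the loss 0.5 * (y - 1)^2.
net = [Affine([[0.5], [-0.25]], [0.1]), Sigmoid()]
out, grads = backprop(net, [1.0, 2.0], lambda y: [y[0] - 1.0])
```

Adding a new layer type in this sketch only requires implementing `forward` and `backward`; `backprop` itself stays unchanged.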
Efficiency
– Batch processing
  – A layer's affine transformation can be represented in vector form: $\mathbf{y} = W^T \mathbf{x} + \mathbf{b}$
    – $\mathbf{y}$ – output from the layer, vector of size $n$
    – $W$ – the matrix of layer weights, $m \times n$; $\mathbf{b}$ – bias, vector of size $n$
    – $\mathbf{x}$ – input to the layer, vector of size $m$
  – Vector-matrix multiplications are not as efficient as matrix-matrix multiplications
  – Stack $s$ input vectors (into a batch) to perform matrix multiplication: $Y = W^T X + B$
    – $X$ is $m \times s$, $Y$ is $n \times s$
    – $B$ is $n \times s$, each column contains a copy of $\mathbf{b}$
– We implemented batch processing in matrix form
  – Enabled the use of optimized native BLAS libraries
  – Memory is reused to limit GC overhead
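The stacking described above can be checked with a small pure-Python sketch. It only demonstrates the shapes and the equivalence between per-vector and batched evaluation; the function names are hypothetical, and plain Python lists give none of the speed benefit that a native BLAS matrix-matrix routine would.

```python
def matvec_t(w, x):
    """W^T x for an m x n matrix W (list of rows) and a vector x of size m."""
    m, n = len(w), len(w[0])
    return [sum(w[i][j] * x[i] for i in range(m)) for j in range(n)]

def affine_batch(w, xs, b):
    """Y = W^T X + B, where column s of X is xs[s] and every column of B is b.

    Returns the columns of Y as a list of s output vectors of size n.
    """
    m, n, s = len(w), len(w[0]), len(xs)
    y = [[b[j] for _ in range(s)] for j in range(n)]  # start from B (n x s)
    for j in range(n):
        for i in range(m):
            wij = w[i][j]
            for col in range(s):
                y[j][col] += wij * xs[col][i]
    return [[y[j][col] for j in range(n)] for col in range(s)]

# Hypothetical layer with m = 2 inputs, n = 3 outputs and a batch of s = 3 vectors.
w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [0.5, 0.0, -0.5]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
batched = affine_batch(w, xs, b)
one_by_one = [[v + bj for v, bj in zip(matvec_t(w, x), b)] for x in xs]
```

Both `batched` and `one_by_one` hold the same three output vectors; the batched form is the one that maps onto a single matrix-matrix call.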
– BLAS in Spark
  – BLAS – Basic Linear Algebra Subprograms
  – Hardware-optimized native implementations in C & Fortran
    – CPU: MKL, OpenBLAS, etc.
    – GPU: NVBLAS (F-BLAS interface to CUDA)
  – Used in Spark through netlib-java
– Experiments
  – Huge benefit from native BLAS vs pure Java f2jblas
  – GPU is faster (2x) only for large matrices
    – When compute time is larger than the copy to/from the GPU
– More details:
  – https://github.com/avulanov/scala-blas
  – "linalg: Matrix Computations in Apache Spark", Reza et al., 2015
[Chart: DGEMM performance, single node BLAS. Series: netlib-NVBLAS, netlib-MKL, netlib-OpenBLAS, netlib-f2jblas. y-axis: seconds, log scale from 1.00E-04 to 1.00E+04. x-axis: matrix sizes, from (1x1)*(1x1) to (10000x10000)*(10000x10000).]
CPU: 2x Xeon X5650 @ 2.67GHz, 32GB RAM
GPU: Tesla M2050 3GB, 575MHz, 448 CUDA cores
Scalability
Parallelization
– Each iteration $k$, each node $i$:
  1. Gets parameters $w^k$ from the master
  2. Computes a gradient $\nabla_i^k F(data_i)$
  3. Sends the gradient to the master
  4. Master computes $w^{k+1}$ based on the gradients
– Gradient type
  – Batch – process all data on each iteration
  – Stochastic – random point
  – Mini-batch – random batch
– How many workers to use?
  – Fewer workers – less compute
  – More workers – more communication
[Diagram: (1) the master sends $w^k$ to executors 1..N; (2) each executor computes $\nabla_i^k F(data_i)$ over its partitions 1..P; (3) the executors send their gradients to the master; (4) the master computes $w^{k+1}$ from the gradients $\nabla_i^k F$; go to step 1.]
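The four steps of one iteration can be simulated on a single machine. The sketch below is an illustration only: it uses a linear model with squared error instead of an MLP, the partitions are plain Python lists rather than Spark partitions, and the function names, step size and toy data are hypothetical.

```python
def partition_gradient(w, partition):
    """Step 2: gradient of 0.5 * (w.x - y)^2 summed over one partition."""
    grad = [0.0] * len(w)
    for x, y in partition:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            grad[i] += err * xi
    return grad

def train(partitions, n_features, step=0.5, iterations=200):
    w = [0.0] * n_features  # parameters held by the master
    n_points = sum(len(p) for p in partitions)
    for _ in range(iterations):
        # Steps 1-3: every worker gets w, computes its gradient, sends it back.
        grads = [partition_gradient(w, p) for p in partitions]
        # Step 4: the master sums the gradients and computes the next parameters.
        total = [sum(g[i] for g in grads) for i in range(n_features)]
        w = [wi - step * gi / n_points for wi, gi in zip(w, total)]
    return w

# Hypothetical data generated from y = 2*x0 - 1*x1, split into two partitions.
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
weights = train([data[:2], data[2:]], n_features=2)
```

Because this is batch gradient, the summed gradient does not depend on how the data is split; the number of partitions only changes how much work each worker does and how many gradients the master has to collect.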
Communication and computation trade-off
Parallelization of batch gradient
– There are $d$ data points, $f$ features and $k$ classes
– Assume we want to train logistic regression; it has $fk$ parameters
– Communication: $n$ workers send/receive $fk$ 64-bit parameters through a network with bandwidth $b$ and software overhead $c$. Using all-reduce:
  – $t_{cm} = 2\left(\frac{64fk}{b} + c\right)\log_2 n$
– Computation: each worker has $p$ FLOPS and processes $\frac{d}{n}$ of the data points, each of which needs $2fk$ operations
  – $t_{cp} \sim \frac{d}{n} \cdot \frac{2fk}{p}$
– What is the optimal number of workers $N$?
  – $\min_n \left(t_{cm} + t_{cp}\right) \Rightarrow N = \max\left(\frac{2dfk \ln 2}{p\left(128fk/b + 2c\right)}, 1\right)$
  – In general, $N = \max\left(\frac{d \cdot l \cdot \ln 2}{p\left(128w/b + 2c\right)}, 1\right)$, where $l$ is the number of floating-point operations per data point and $w$ is the number of parameters
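The cost model and the resulting heuristic can be written down directly. The sketch below is a plain transcription of the formulas on this slide; the function names are hypothetical. The example inputs are the MLP configuration from the scalability test later in the deck (60K samples, 12M parameters, $l = 6w$, 105.6 GFLOPS, 950 Mbit/s, 0.1 s overhead).

```python
import math

def comm_time(n, w, b, c):
    """t_cm = 2 * (64*w/b + c) * log2(n): all-reduce of w 64-bit parameters."""
    return 2.0 * (64.0 * w / b + c) * math.log2(n)

def compute_time(n, d, l, p):
    """t_cp = (d/n) * l / p: each worker handles d/n points, l operations each."""
    return (d / n) * l / p

def optimal_workers(d, l, w, p, b, c):
    """N = max(d*l*ln2 / (p * (128*w/b + 2*c)), 1), the minimizer of t_cm + t_cp."""
    return max(d * l * math.log(2) / (p * (128.0 * w / b + 2.0 * c)), 1.0)

# MLP scalability-test inputs from the deck: b in bit/s, p in FLOPS, c in seconds.
n_opt = optimal_workers(d=60e3, l=6 * 12e6, w=12e6, p=105.6e9, b=950e6, c=0.1)
```

With these inputs `n_opt` comes out at roughly 15.6, in line with the value of 15 reported on the scalability slide. The function returns a real number, so rounding to a whole number of workers is left to the caller.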
Analysis of the trade-off
Optimal number of workers for batch gradient
– Parallelism in a cluster
  – $N = \max\left(\frac{d \cdot l \cdot \ln 2}{p\left(128w/b + 2c\right)}, 1\right)$
– Analysis
  – More FLOPS $p$ means a lower degree of batch gradient parallelism in a cluster
  – More operations, i.e. more features and classes (or a deep network), means a higher degree
  – A small overhead $c$ for getting/sending a message means a higher degree
– Example: MNIST8M handwritten digit recognition dataset
  – 8.1M documents, 784 features, 10 classes, logistic regression
  – 32 GFLOPS double-precision CPU, 1 Gbit network, overhead ~0.1 s
  – $N = \max\left(\frac{2 \cdot 8.1\mathrm{M} \cdot 784 \cdot 10 \cdot 0.69}{32\mathrm{G}\left(128 \cdot 784 \cdot 10 / 1\mathrm{G} + 2 \cdot 0.1\right)}, 1\right) = 12$
Artificial neural network case
– Parallelization of batch gradient
  – General case
    – $N = \max\left(\frac{d \cdot l \cdot \ln 2}{p\left(128w/b + 2c\right)}, 1\right)$
  – Artificial neural network training:
    – Forward pass (each layer is a matrix-vector multiplication, $2mn$ operations): $l$ += $2w$
    – Back propagation (same): $l$ += $2w$
    – Gradient (vector-row matrix multiplication): $l$ += $2w$
    – Total: $l = 6w$
  – Artificial neural network prediction:
    – Forward pass, $l = 2w$
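The operation counts can be computed from the layer sizes alone. The sketch below counts only the weight matrices and ignores biases, which is an assumption of this example; the function names are hypothetical. The layer sizes are those of the 6-layer MLP used in the scalability test later in the deck.

```python
def weight_count(layer_sizes):
    """Number of weights w in an MLP, ignoring biases (assumption of this sketch)."""
    return sum(m * n for m, n in zip(layer_sizes, layer_sizes[1:]))

def training_flops(layer_sizes):
    """l = 2w (forward) + 2w (back propagation) + 2w (gradient) = 6w per data point."""
    return 6 * weight_count(layer_sizes)

def prediction_flops(layer_sizes):
    """l = 2w: prediction only needs the forward pass."""
    return 2 * weight_count(layer_sizes)

# Layer sizes of the 6-layer MLP from the scalability test.
sizes = [784, 2500, 2000, 1500, 1000, 500, 10]
w = weight_count(sizes)
```

For these sizes `w` is 11,965,000, which matches the "12M parameters" quoted for that network.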
Comparison with the best case
– What if we can't get the optimal number of workers?
  – After a quick drop, time decreases slowly and starts increasing at some point
  – We can use a smaller cluster that will be only $k$ times slower than the optimal
  – Time: $t = 2\left(\frac{64w}{b} + c\right)\log_2 n + \frac{d}{n} \cdot \frac{l}{p} = \alpha \log_2 n + \frac{\beta}{n}$
  – Find the number of nodes that is $k$ times slower than the optimal
    – $\alpha \log_2 n + \frac{\beta}{n} = k\,t_N$, where $t_N$ is the time with the optimal number of workers $N$
– Approximation
  – Let's approximate $\log_2 n$ with $\log_2 N$, substitute $t_N$ and solve the equation for $n$
  – $n = \frac{N}{(k-1)\ln N + k}$
  – Also, $k = \frac{\ln N + N/n}{\ln N + 1}$ (how much slower our configuration is than the optimal)
– Example: number of nodes that run the logistic regression example 10% slower than the optimal configuration
  – Optimal number $N = 12$
  – $n = \frac{12}{(1.1-1)\ln 12 + 1.1} \approx 9$
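The two approximation formulas are inverses of each other, which is easy to check numerically. The sketch below simply evaluates them; the function names are hypothetical, and the inputs are the values used in the slides ($N = 12$ with a 10% slowdown, and $N = 15$ with 5 workers).

```python
import math

def workers_for_slowdown(n_opt, k):
    """n = N / ((k - 1) * ln N + k): cluster size that is about k times slower."""
    return n_opt / ((k - 1.0) * math.log(n_opt) + k)

def slowdown(n_opt, n):
    """k = (ln N + N/n) / (ln N + 1): how much slower n workers are than N."""
    return (math.log(n_opt) + n_opt / n) / (math.log(n_opt) + 1.0)

n_small = workers_for_slowdown(12, 1.1)  # logistic regression example
k_five = slowdown(15, 5)                 # 5 workers against an optimum of 15
```

`n_small` evaluates to about 8.9, i.e. roughly 9 nodes, and `k_five` to about 1.54, consistent with the rounded values in the slides.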
[Chart: Spark MLP vs Caffe MLP. Series: MLP (total), MLP (compute), Caffe CPU, Caffe GPU. x-axis: 1 to 13; y-axis: 0 to 120.]
Scalability testing
– Setup
  – MNIST handwritten digit recognition, 60K samples
  – 6-layer MLP (784, 2500, 2000, 1500, 1000, 500, 10)
  – 12M parameters
  – CPU: Xeon E3-1240, 3.3GHz, 105.6 GFLOPS
  – GPU: Tesla M2050 3GB, 575MHz
  – Caffe (deep learning framework from Berkeley): 1 node
  – Spark: 1 master + 5 workers
– Results per iteration
  – Single node (both tools double precision)
    – 1.7x slower than Caffe CPU (Scala vs C++)
  – Scalability
    – 5 nodes give 4.7x speedup, beats Caffe, close to GPU
    – 7 nodes on par with GPU by compute
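[Chart axes: seconds (y) vs. nodes = workers (x); annotation: communication & scheduler cost]
$N = \max\left(\frac{60\mathrm{K} \cdot 6 \cdot 12\mathrm{M} \cdot 0.69}{105.6\mathrm{G}\left(128 \cdot 12\mathrm{M} / 950\mathrm{M} + 2 \cdot 0.1\right)}, 1\right) = 15$
$k = \frac{\ln 15 + 15/5}{\ln 15 + 1} \approx 1.5$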
Conclusions & future work
– Conclusions
  – Scalable multilayer perceptron is available in Spark 1.5.0
  – Extensible internal API for artificial neural networks
  – Further contributions are welcome!
  – Native BLAS (and GPU) speeds up Spark
  – Heuristics for parallelization of batch gradient
– Work in progress [SPARK-5575]
  – (Stacked) Autoencoder(s)
  – Restricted Boltzmann Machines
  – Drop-out
  – Convolutional neural networks
– Further work
  – Adaptive batch LBFGS
  – SGD & parameter server
© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Thank you

More Related Content

What's hot

Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
MLconf
ย 

What's hot (19)

Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
ย 
Predicting rainfall using ensemble of ensembles
Predicting rainfall using ensemble of ensemblesPredicting rainfall using ensemble of ensembles
Predicting rainfall using ensemble of ensembles
ย 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone ML
ย 
Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!
ย 
Distributed Deep Learning on AWS with Apache MXNet
Distributed Deep Learning on AWS with Apache MXNetDistributed Deep Learning on AWS with Apache MXNet
Distributed Deep Learning on AWS with Apache MXNet
ย 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
ย 
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
ย 
Applying your Convolutional Neural Networks
Applying your Convolutional Neural NetworksApplying your Convolutional Neural Networks
Applying your Convolutional Neural Networks
ย 
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16
ย 
AWS re:Invent 2016: Using MXNet for Recommendation Modeling at Scale (MAC306)
AWS re:Invent 2016: Using MXNet for Recommendation Modeling at Scale (MAC306)AWS re:Invent 2016: Using MXNet for Recommendation Modeling at Scale (MAC306)
AWS re:Invent 2016: Using MXNet for Recommendation Modeling at Scale (MAC306)
ย 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
ย 
Intro to Scalable Deep Learning on AWS with Apache MXNet
Intro to Scalable Deep Learning on AWS with Apache MXNetIntro to Scalable Deep Learning on AWS with Apache MXNet
Intro to Scalable Deep Learning on AWS with Apache MXNet
ย 
Distributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta MeetupDistributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta Meetup
ย 
Parikshit Ram โ€“ Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram โ€“ Senior Machine Learning Scientist, Skytree at MLconf ATLParikshit Ram โ€“ Senior Machine Learning Scientist, Skytree at MLconf ATL
Parikshit Ram โ€“ Senior Machine Learning Scientist, Skytree at MLconf ATL
ย 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFrames
ย 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
ย 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
ย 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
ย 
How to win data science competitions with Deep Learning
How to win data science competitions with Deep LearningHow to win data science competitions with Deep Learning
How to win data science competitions with Deep Learning
ย 

Viewers also liked

Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterSpark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
DataWorks Summit
ย 
Dynamic Reconfiguration of Apache ZooKeeper
Dynamic Reconfiguration of Apache ZooKeeperDynamic Reconfiguration of Apache ZooKeeper
Dynamic Reconfiguration of Apache ZooKeeper
DataWorks Summit
ย 
MongoDB Shell Tips & Tricks
MongoDB Shell Tips & TricksMongoDB Shell Tips & Tricks
MongoDB Shell Tips & Tricks
MongoDB
ย 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practices
DataWorks Summit
ย 

Viewers also liked (20)

Alpine Spark Implementation - Technical
Alpine Spark Implementation - TechnicalAlpine Spark Implementation - Technical
Alpine Spark Implementation - Technical
ย 
Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark
ย 
Training MongoDB - Monitoring and Operability
Training MongoDB - Monitoring and OperabilityTraining MongoDB - Monitoring and Operability
Training MongoDB - Monitoring and Operability
ย 
Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012
ย 
20140202 fosdem-nosql-devroom-hadoop-yarn
20140202 fosdem-nosql-devroom-hadoop-yarn20140202 fosdem-nosql-devroom-hadoop-yarn
20140202 fosdem-nosql-devroom-hadoop-yarn
ย 
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterSpark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
ย 
Dynamic Reconfiguration of Apache ZooKeeper
Dynamic Reconfiguration of Apache ZooKeeperDynamic Reconfiguration of Apache ZooKeeper
Dynamic Reconfiguration of Apache ZooKeeper
ย 
MongoDB Shell Tips & Tricks
MongoDB Shell Tips & TricksMongoDB Shell Tips & Tricks
MongoDB Shell Tips & Tricks
ย 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
ย 
Improving Mobile Payments With Real time Spark
Improving Mobile Payments With Real time SparkImproving Mobile Payments With Real time Spark
Improving Mobile Payments With Real time Spark
ย 
ApacheCon North America 2014 - Apache Hadoop YARN: The Next-generation Distri...
ApacheCon North America 2014 - Apache Hadoop YARN: The Next-generation Distri...ApacheCon North America 2014 - Apache Hadoop YARN: The Next-generation Distri...
ApacheCon North America 2014 - Apache Hadoop YARN: The Next-generation Distri...
ย 
So we're running Apache ZooKeeper. Now What? By Camille Fournier
So we're running Apache ZooKeeper. Now What? By Camille Fournier So we're running Apache ZooKeeper. Now What? By Camille Fournier
So we're running Apache ZooKeeper. Now What? By Camille Fournier
ย 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
ย 
Lambda Architecture with Spark
Lambda Architecture with SparkLambda Architecture with Spark
Lambda Architecture with Spark
ย 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
ย 
Apache Hadoop YARN โ€“ Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
Apache Hadoop YARN โ€“ Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...Apache Hadoop YARN โ€“ Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
Apache Hadoop YARN โ€“ Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
ย 
Apache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS FederationApache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS Federation
ย 
Harnessing the power of YARN with Apache Twill
Harnessing the power of YARN with Apache TwillHarnessing the power of YARN with Apache Twill
Harnessing the power of YARN with Apache Twill
ย 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practices
ย 
Apache Hadoop YARN
Apache Hadoop YARNApache Hadoop YARN
Apache Hadoop YARN
ย 

Similar to A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)

MLconf NYC Xiangrui Meng
MLconf NYC Xiangrui MengMLconf NYC Xiangrui Meng
MLconf NYC Xiangrui Meng
MLconf
ย 
MLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reducedMLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reduced
Chao Chen
ย 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
inside-BigData.com
ย 

Similar to A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov) (20)

A Scaleable Implemenation of Deep Leaning on Spark- Alexander Ulanov
A Scaleable Implemenation of Deep Leaning on Spark- Alexander UlanovA Scaleable Implemenation of Deep Leaning on Spark- Alexander Ulanov
A Scaleable Implemenation of Deep Leaning on Spark- Alexander Ulanov
ย 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache Spark
ย 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
ย 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
ย 
Deep learning with kafka
Deep learning with kafkaDeep learning with kafka
Deep learning with kafka
ย 
MLconf NYC Xiangrui Meng
MLconf NYC Xiangrui MengMLconf NYC Xiangrui Meng
MLconf NYC Xiangrui Meng
ย 
Java Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame GraphsJava Performance Analysis on Linux with Flame Graphs
Java Performance Analysis on Linux with Flame Graphs
ย 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
ย 
MLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reducedMLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reduced
ย 
Achitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and ExascaleAchitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and Exascale
ย 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache Spark
ย 
System mldl meetup
System mldl meetupSystem mldl meetup
System mldl meetup
ย 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFrames
ย 
Inferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on SparkInferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on Spark
ย 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
ย 
Drizzleโ€”Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzleโ€”Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Drizzleโ€”Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzleโ€”Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
ย 
byteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurationsbyteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurations
ย 
A Platform for Accelerating Machine Learning Applications
 A Platform for Accelerating Machine Learning Applications A Platform for Accelerating Machine Learning Applications
A Platform for Accelerating Machine Learning Applications
ย 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
ย 
H2O Open Source Deep Learning, Arno Candel 03-20-14
H2O Open Source Deep Learning, Arno Candel 03-20-14H2O Open Source Deep Learning, Arno Candel 03-20-14
H2O Open Source Deep Learning, Arno Candel 03-20-14
ย 

Recently uploaded

Call Girls In Hsr Layout โ˜Ž 7737669865 ๐Ÿฅต Book Your One night Stand
Call Girls In Hsr Layout โ˜Ž 7737669865 ๐Ÿฅต Book Your One night StandCall Girls In Hsr Layout โ˜Ž 7737669865 ๐Ÿฅต Book Your One night Stand
Call Girls In Hsr Layout โ˜Ž 7737669865 ๐Ÿฅต Book Your One night Stand
amitlee9823
ย 
Junnasandra Call Girls: ๐Ÿ“ 7737669865 ๐Ÿ“ High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: ๐Ÿ“ 7737669865 ๐Ÿ“ High Profile Model Escorts | Bangalore...Junnasandra Call Girls: ๐Ÿ“ 7737669865 ๐Ÿ“ High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: ๐Ÿ“ 7737669865 ๐Ÿ“ High Profile Model Escorts | Bangalore...
amitlee9823
ย 
Call Girls In Bellandur โ˜Ž 7737669865 ๐Ÿฅต Book Your One night Stand
Call Girls In Bellandur โ˜Ž 7737669865 ๐Ÿฅต Book Your One night StandCall Girls In Bellandur โ˜Ž 7737669865 ๐Ÿฅต Book Your One night Stand
Call Girls In Bellandur โ˜Ž 7737669865 ๐Ÿฅต Book Your One night Stand
amitlee9823
ย 
Call Girls Jalahalli Just Call ๐Ÿ‘— 7737669865 ๐Ÿ‘— Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call ๐Ÿ‘— 7737669865 ๐Ÿ‘— Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call ๐Ÿ‘— 7737669865 ๐Ÿ‘— Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call ๐Ÿ‘— 7737669865 ๐Ÿ‘— Top Class Call Girl Service Ban...
amitlee9823
ย 
Call Girls Hsr Layout Just Call ๐Ÿ‘— 7737669865 ๐Ÿ‘— Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call ๐Ÿ‘— 7737669865 ๐Ÿ‘— Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call ๐Ÿ‘— 7737669865 ๐Ÿ‘— Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call ๐Ÿ‘— 7737669865 ๐Ÿ‘— Top Class Call Girl Service Ba...
amitlee9823
ย 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
ย 
Escorts Service Kumaraswamy Layout โ˜Ž 7737669865โ˜Ž Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout โ˜Ž 7737669865โ˜Ž Book Your One night Stand (B...Escorts Service Kumaraswamy Layout โ˜Ž 7737669865โ˜Ž Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout โ˜Ž 7737669865โ˜Ž Book Your One night Stand (B...
amitlee9823
ย 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
ย 

Recently uploaded (20)

April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
ย 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
ย 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
ย 
Call Girls In Hsr Layout โ˜Ž 7737669865 ๐Ÿฅต Book Your One night Stand
Call Girls In Hsr Layout โ˜Ž 7737669865 ๐Ÿฅต Book Your One night StandCall Girls In Hsr Layout โ˜Ž 7737669865 ๐Ÿฅต Book Your One night Stand
Call Girls In Hsr Layout โ˜Ž 7737669865 ๐Ÿฅต Book Your One night Stand
ย 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
ย 
Junnasandra Call Girls: ๐Ÿ“ 7737669865 ๐Ÿ“ High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: ๐Ÿ“ 7737669865 ๐Ÿ“ High Profile Model Escorts | Bangalore...Junnasandra Call Girls: ๐Ÿ“ 7737669865 ๐Ÿ“ High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: ๐Ÿ“ 7737669865 ๐Ÿ“ High Profile Model Escorts | Bangalore...
ย 
Call Girls In Bellandur โ˜Ž 7737669865 ๐Ÿฅต Book Your One night Stand
Call Girls In Bellandur โ˜Ž 7737669865 ๐Ÿฅต Book Your One night StandCall Girls In Bellandur โ˜Ž 7737669865 ๐Ÿฅต Book Your One night Stand
Call Girls In Bellandur โ˜Ž 7737669865 ๐Ÿฅต Book Your One night Stand
ย 
Call Girls Jalahalli Just Call ๐Ÿ‘— 7737669865 ๐Ÿ‘— Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call ๐Ÿ‘— 7737669865 ๐Ÿ‘— Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call ๐Ÿ‘— 7737669865 ๐Ÿ‘— Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call ๐Ÿ‘— 7737669865 ๐Ÿ‘— Top Class Call Girl Service Ban...
ย 

A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)

  • 1. A Scalable Implementation of Deep Learning on Spark
    Alexander Ulanov (1)
    Joint work with Xiangrui Meng (2), Bert Greevenbosch (3)
    With help from Guoqiang Li (4), Andrey Simanovsky (1)
    Affiliations: (1) Hewlett Packard Labs, (2) Databricks, (3) Huawei & Jules Energy, (4) Spark community
  • 2. Outline
    – Artificial neural network basics
    – Implementation of Multilayer Perceptron (MLP) in Spark
    – Optimization & parallelization
    – Experiments
    – Future work
    – What's new compared to the Spark Summit talk
      – Update and more details about the parallelization heuristic
      – Experiments with a larger cluster
      – Slide design (now Hewlett Packard Enterprise)
  • 3. Artificial neural network
    – Basics
      – Statistical model that approximates a function of multiple inputs
      – Consists of interconnected "neurons" which exchange messages
      – A "neuron" produces an output by applying a transformation function to its inputs
      – A network with more than 3 layers of neurons is called "deep", an instance of deep learning
    – Layer types & learning
      – A layer type is defined by a transformation function
      – Affine: $y_j = \sum_i w_{ij} x_i + b_j$, Sigmoid: $y_i = (1 + e^{-x_i})^{-1}$, Convolution, Softmax, etc.
      – Multilayer perceptron (MLP): a network with several pairs of Affine & Sigmoid layers
      – Model parameters: weights that "neurons" use for transformations
      – Parameters are iteratively estimated with the backpropagation algorithm
    – Multilayer perceptron
      – Speech recognition (phoneme classification), computer vision
      – Released in Spark 1.5.0
    [Diagram: input $x$, hidden layer, output $y$]
  • 4. Example of MLP in Spark
    – Handwritten digit recognition
    – Dataset: MNIST [LeCun et al. 1998]
    – 28x28 greyscale images of handwritten digits 0-9
    – MLP with 784 inputs, 10 outputs and two hidden layers of 300 and 100 neurons
    [Diagram: 784 inputs, 300 neurons (1st hidden layer), 100 neurons (2nd hidden layer), 10 neurons (output layer)]
    Scala:
        import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
        import org.apache.spark.sql.DataFrame
        // sqlContext is the SQLContext provided by the Spark shell
        val digits: DataFrame = sqlContext.read.format("libsvm").load("/data/mnist")
        val mlp = new MultilayerPerceptronClassifier()
          .setLayers(Array(784, 300, 100, 10))  // sizes of input, hidden and output layers
          .setBlockSize(128)                    // number of input vectors stacked per batch
        val model = mlp.fit(digits)
    Python:
        from pyspark.ml.classification import MultilayerPerceptronClassifier
        digits = sqlContext.read.format("libsvm").load("/data/mnist")
        mlp = MultilayerPerceptronClassifier(layers=[784, 300, 100, 10], blockSize=128)
        model = mlp.fit(digits)
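  • 5. Pipeline with PCA+MLP in Spark
    Scala:
        import org.apache.spark.ml.Pipeline
        import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
        import org.apache.spark.ml.feature.PCA
        import org.apache.spark.sql.DataFrame
        val digits: DataFrame = sqlContext.read.format("libsvm").load("/data/mnist")
        // Reduce the 784 input features to 20 principal components
        val pca = new PCA()
          .setInputCol("features")
          .setK(20)
          .setOutputCol("features20")
        // The MLP reads the reduced features, so its input layer has 20 units
        val mlp = new MultilayerPerceptronClassifier()
          .setFeaturesCol("features20")
          .setLayers(Array(20, 50, 10))
          .setBlockSize(128)
        val pipeline = new Pipeline()
          .setStages(Array(pca, mlp))
        val model = pipeline.fit(digits)
    Python:
        from pyspark.ml import Pipeline
        from pyspark.ml.classification import MultilayerPerceptronClassifier
        from pyspark.ml.feature import PCA
        digits = sqlContext.read.format("libsvm").load("/data/mnist8m")
        pca = PCA(inputCol="features", k=20, outputCol="features20")
        mlp = MultilayerPerceptronClassifier(featuresCol="features20", layers=[20, 50, 10], blockSize=128)
        pipeline = Pipeline(stages=[pca, mlp])
        model = pipeline.fit(digits)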
  • 6. MLP implementation in Spark
    – Requirements
      – Conform to Spark APIs
      – Provide an extensible interface (deep learning API)
      – Efficient and scalable (single node & cluster)
    – Why conform to Spark APIs?
      – Spark can call any Java, Python or Scala library, not necessarily designed for Spark
      – This results in expensive data movement from Spark RDDs to the library
      – It also prevents use in Spark ML Pipelines
    – Extensible interface
      – Our implementation processes each layer as a black box, with backpropagation in general form
      – Allows further introduction of new layers and features
      – CNN, (Stacked) Autoencoder, RBM are currently under development by the community
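To make the "layer as a black box" idea concrete, here is a minimal NumPy sketch. It is not Spark's internal API: the `Layer`, `Affine`, `Sigmoid` and `backprop` names are hypothetical, and a squared-error loss is assumed purely for illustration. The point is that the backpropagation loop only calls `forward` and `backward`, so a new layer type can be added without touching it.

```python
import numpy as np


class Layer:
    """Hypothetical layer interface: a layer only knows its own forward/backward."""

    def forward(self, x):
        raise NotImplementedError

    def backward(self, x, grad_out):
        """Return (gradient w.r.t. the input, list of gradients w.r.t. parameters)."""
        raise NotImplementedError


class Affine(Layer):
    def __init__(self, w, b):
        self.w, self.b = w, b              # w: (m, n), b: (n,)

    def forward(self, x):
        return self.w.T @ x + self.b       # y = W^T x + b

    def backward(self, x, grad_out):
        # dL/dx = W g, dL/dW = x g^T, dL/db = g
        return self.w @ grad_out, [np.outer(x, grad_out), grad_out]


class Sigmoid(Layer):
    def forward(self, x):
        return 1.0 / (1.0 + np.exp(-x))

    def backward(self, x, grad_out):
        y = self.forward(x)
        return grad_out * y * (1.0 - y), []  # no parameters in this layer


def backprop(layers, x, target):
    """Generic backpropagation for the assumed loss 0.5 * ||output - target||^2."""
    inputs = []
    for layer in layers:                   # forward pass, remembering each layer's input
        inputs.append(x)
        x = layer.forward(x)
    grad = x - target                      # dLoss/dOutput for the squared error
    param_grads = []
    for layer, inp in zip(reversed(layers), reversed(inputs)):
        grad, pg = layer.backward(inp, grad)
        param_grads.append(pg)
    return list(reversed(param_grads))     # one list of parameter gradients per layer
```

A network is then just a list such as `[Affine(w1, b1), Sigmoid(), Affine(w2, b2), Sigmoid()]`, and `backprop` returns the parameter gradients layer by layer.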
  • 7. Efficiency
    – Batch processing
      – A layer's affine transformation can be represented in vector form: $\boldsymbol{y} = W^T \boldsymbol{x} + \boldsymbol{b}$
      – $\boldsymbol{y}$: output from the layer, vector of size $n$
      – $W$: the matrix of layer weights, $m \times n$; $\boldsymbol{b}$: bias, vector of size $n$
      – $\boldsymbol{x}$: input to the layer, vector of size $m$
      – Vector-matrix multiplications are not as efficient as matrix-matrix multiplications
      – Stack $s$ input vectors (into a batch) to perform matrix multiplication: $\boldsymbol{Y} = W^T \boldsymbol{X} + \boldsymbol{B}$
      – $\boldsymbol{X}$ is $m \times s$, $\boldsymbol{Y}$ is $n \times s$
      – $\boldsymbol{B}$ is $n \times s$, each column contains a copy of $\boldsymbol{b}$
    – We implemented batch processing in matrix form
      – Enabled the use of optimized native BLAS libraries
      – Memory is reused to limit GC overhead
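The stacking described on this slide can be checked with a few lines of NumPy. This is only an illustrative sketch with small made-up sizes (`m`, `n`, `s` are arbitrary here), not the Spark implementation: it confirms that one matrix-matrix product over the batch gives the same outputs as `s` separate matrix-vector products.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s = 4, 3, 5                      # input size, output size, batch size (illustrative)
W = rng.normal(size=(m, n))            # layer weights, m x n
b = rng.normal(size=n)                 # bias, vector of size n
xs = [rng.normal(size=m) for _ in range(s)]

# One vector at a time: y = W^T x + b
ys_single = [W.T @ x + b for x in xs]

# Batched: stack the inputs as columns of X (m x s); B repeats b in every column (n x s)
X = np.column_stack(xs)
B = np.tile(b[:, None], (1, s))
Y = W.T @ X + B                        # n x s, computed with a single matrix-matrix product

# Column j of Y should equal the output for the j-th input vector
batched_matches = all(np.allclose(Y[:, j], ys_single[j]) for j in range(s))
```

The batched form is the one that maps onto a single BLAS matrix-matrix call (DGEMM for double precision).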
  • 8. BLAS in Spark
    – BLAS: Basic Linear Algebra Subprograms
      – Hardware-optimized native libraries in C & Fortran
      – CPU: MKL, OpenBLAS, etc.
      – GPU: NVBLAS (F-BLAS interface to CUDA)
      – Used in Spark through netlib-java
    – Experiments
      – Huge benefit from native BLAS vs pure Java f2jblas
      – GPU is faster (2x) only for large matrices
      – When compute is larger than the copy to/from the GPU
    – More details:
      – https://github.com/avulanov/scala-blas
      – "linalg: Matrix Computations in Apache Spark", Reza et al., 2015
    [Figure: DGEMM performance, single-node BLAS. Y axis: seconds (log scale, 1.00E-04 to 1.00E+04). X axis: matrix sizes, from (1X1)*(1X1) to (10000X10000)*(10000X10000). Series: netlib-NVBLAS, netlib-MKL, netlib-OpenBLAS, netlib-f2jblas. CPU: 2x Xeon X5650 @ 2.67GHz, 32GB RAM. GPU: Tesla M2050 3GB, 575MHz, 448 CUDA cores.]
  • 9. Scalability: parallelization
    – At each iteration $k$, each node $i$:
      – 1. Gets parameters $w^k$ from the master
      – 2. Computes a gradient $\nabla_i^k F(data_i)$
      – 3. Sends the gradient to the master
      – 4. The master computes $w^{k+1}$ based on the gradients
    – Gradient type
      – Batch: process all data on each iteration
      – Stochastic: random point
      – Mini-batch: random batch
    – How many workers to use?
      – Fewer workers: less compute
      – More workers: more communication
    [Diagram: (1) the master sends $w^k$ to executors 1..N; (2) each executor computes $\nabla_i^k F(data_i)$ over its partitions 1..P; (3) the executors send their gradients to the master; (4) the master computes $w^{k+1}$ from the gradients, then go to step 1.]
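The four steps can be simulated on one machine to see why the scheme is exact for batch gradient. The sketch below makes several assumptions that are not in the slides: a squared-error loss, a plain gradient-descent update with a fixed step, and list elements standing in for executors. The function names are hypothetical.

```python
import numpy as np


def partition_gradient(w, X, y):
    """Gradient of the squared-error loss 0.5 * ||X w - y||^2 on one data partition."""
    return X.T @ (X @ w - y)


def train(partitions, w, step, iterations):
    for _ in range(iterations):
        # Steps 1-3: every worker receives w and returns the gradient on its partition
        grads = [partition_gradient(w, X, y) for X, y in partitions]
        # Step 4: the master aggregates the gradients and updates the parameters
        w = w - step * np.sum(grads, axis=0)
    return w


rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
# Split the data into 4 partitions, standing in for 4 executors
partitions = list(zip(np.array_split(X, 4), np.array_split(y, 4)))

w0 = np.zeros(3)
w_fit = train(partitions, w0, step=0.005, iterations=200)
```

Because the loss is a sum over data points, the sum of the per-partition gradients equals the gradient over the whole dataset, so the number of workers changes the running time but not the result of a batch iteration.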
  • 10. Communication and computation trade-off: parallelization of batch gradient
    – There are $d$ data points, $f$ features and $k$ classes
    – Assume we want to train logistic regression, which has $fk$ parameters
    – Communication: $n$ workers get/send $fk$ 64-bit parameters through the network with bandwidth $b$ and software overhead $c$. Use all-reduce:
      – $t_{cm} = 2\left(\frac{64fk}{b} + c\right)\log_2 n$
    – Computation: each worker has $p$ FLOPS and processes $\frac{d}{n}$ of the data, where each data point needs $2fk$ operations
      – $t_{cp} \sim \frac{d}{n}\,\frac{2fk}{p}$
    – What is the optimal number of workers $N$?
      – $\min_n (t_{cm} + t_{cp}) \Rightarrow N = \max\left(\frac{2dfk \ln 2}{p\left(\frac{128fk}{b} + 2c\right)},\, 1\right)$
      – $N = \max\left(\frac{d \cdot l \cdot \ln 2}{p\left(\frac{128w}{b} + 2c\right)},\, 1\right)$, if $l$ is the number of floating point operations per data point and $w$ is the number of parameters
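The slide jumps from the two time estimates straight to the optimum. One way to fill in the step, treating $n$ as a continuous variable (an assumption made here for the derivation), is to write the total time with two shorthand constants $\alpha$ and $\beta$ and set its derivative to zero:

```latex
\begin{align*}
t(n) &= t_{cm} + t_{cp} = \alpha \log_2 n + \frac{\beta}{n},
  \qquad \alpha = 2\left(\frac{64fk}{b} + c\right),\quad \beta = \frac{2dfk}{p} \\
\frac{dt}{dn} &= \frac{\alpha}{n \ln 2} - \frac{\beta}{n^2} = 0
  \;\Longrightarrow\; n = \frac{\beta \ln 2}{\alpha} \\
N &= \max\left(\frac{2dfk \ln 2}{p\left(\frac{128fk}{b} + 2c\right)},\, 1\right)
\end{align*}
```

The derivative is negative below this point and positive above it, so the stationary point is a minimum; the outer $\max(\cdot, 1)$ covers the case where communication is so expensive that a single worker is best.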
  • 11. Analysis of the trade-off: optimal number of workers for batch gradient
    – Parallelism in a cluster
      – $N = \max\left(\frac{d \cdot l \cdot \ln 2}{p\left(\frac{128w}{b} + 2c\right)},\, 1\right)$
    – Analysis
      – More FLOPS $p$ means a lower degree of batch gradient parallelism in a cluster
      – More operations, i.e. more features and classes (or a deep network), means a higher degree
      – A small overhead $c$ for getting/sending a message means a higher degree
    – Example: MNIST8M handwritten digit recognition dataset
      – 8.1M documents, 784 features, 10 classes, logistic regression
      – 32 GFLOPS double-precision CPU, 1 Gbit network, overhead ~ 0.1 s
      – $N = \max\left(\frac{2 \cdot 8.1\text{M} \cdot 784 \cdot 10 \cdot 0.69 \,/\, 32\text{G}}{128 \cdot 784 \cdot 10 \,/\, 1\text{G} + 2 \cdot 0.1},\, 1\right) = 12$
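The heuristic is easy to turn into a small helper for experimenting with the parameters. This is a sketch, with a hypothetical function name, that evaluates the formula as written; the exact figure it prints for the MNIST8M example depends on how the prefixes (decimal or binary) and the rounding are treated, so it need not match the slide's 12 exactly.

```python
import math


def optimal_workers(d, l, w, p, b, c):
    """N = max(d * l * ln 2 / (p * (128 * w / b + 2 * c)), 1).

    d: data points, l: floating point operations per data point,
    w: number of 64-bit parameters, p: FLOPS per worker,
    b: network bandwidth in bits/s, c: software overhead in seconds.
    """
    return max(d * l * math.log(2) / (p * (128 * w / b + 2 * c)), 1.0)


# Logistic regression on MNIST8M: w = f*k parameters, l = 2*f*k operations per point.
# Decimal prefixes are assumed here (M = 1e6, G = 1e9).
f, k = 784, 10
n_opt = optimal_workers(d=8.1e6, l=2 * f * k, w=f * k, p=32e9, b=1e9, c=0.1)
print(f"optimal number of workers ~ {n_opt:.1f}")
```

Varying one argument at a time reproduces the qualitative analysis on the slide: raising `p` lowers the result, while lowering `c` or raising `l` increases it.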
  • 12. Artificial neural network case: parallelization of batch gradient
    – General case
      – $N = \max\left(\frac{d \cdot l \cdot \ln 2}{p\left(\frac{128w}{b} + 2c\right)},\, 1\right)$
    – Artificial neural network training:
      – Forward pass (each layer is a matrix-vector multiplication, $2mn$ operations): $l \mathrel{+}= 2w$
      – Back propagation (same): $l \mathrel{+}= 2w$
      – Gradient (vector-row matrix multiplication): $l \mathrel{+}= 2w$
      – Total: $l = 6w$
    – Artificial neural network prediction:
      – Forward pass, $l = 2w$
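As a worked instance of the operation count, the sketch below computes $w$ and $l$ for the 784-300-100-10 network used in the MNIST example earlier in the deck. The helper names are hypothetical, and bias terms are left out of the count, which is an assumption of this sketch.

```python
def mlp_weights(layers):
    """Number of weights in a fully connected network (biases not counted)."""
    return sum(m * n for m, n in zip(layers, layers[1:]))


def training_ops(layers):
    # forward pass (2w) + back propagation (2w) + gradient (2w)
    return 6 * mlp_weights(layers)


def prediction_ops(layers):
    # forward pass only
    return 2 * mlp_weights(layers)


layers = [784, 300, 100, 10]           # layer sizes from the MNIST example
w = mlp_weights(layers)                # 784*300 + 300*100 + 100*10 = 266200
l_train = training_ops(layers)         # 6 * 266200 = 1597200
l_predict = prediction_ops(layers)     # 2 * 266200 = 532400
```

These values of `w` and `l` are what would be plugged into the general formula for $N$.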
  • 13. Comparison with the best case
    – What if we can't get the optimal number of workers?
      – After a quick drop, time decreases slowly and starts increasing at some point
      – We can use a smaller cluster that will be only $k$ times slower than the optimal one
    – Time: $t = 2\left(\frac{64w}{b} + c\right)\log_2 n + \frac{d}{n}\,\frac{l}{p} = \alpha \log_2 n + \frac{\beta}{n}$
    – Find the number of nodes that is $k$ times slower than the optimal
      – $\alpha \log_2 n + \frac{\beta}{n} = k\,t_N$
    – Approximation
      – Let's approximate $\log_2 n$ with $\log_2 N$, substitute $t_N$ and solve the equation for $n$
      – $n = \frac{N}{(k-1)\ln N + k}$
      – Also, $k = \frac{\ln N + \frac{N}{n}}{\ln N + 1}$ (how much slower our configuration is than the optimal)
    – Example: number of nodes that run the logistic regression example 10% slower than the optimal configuration
      – Optimal number $N = 12$
      – $n = \frac{12}{(1.1 - 1)\ln 12 + 1.1} \approx 9$
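Both approximations are one-liners, so they are convenient to keep as helpers. The sketch below (function names are hypothetical) evaluates the slide's example and checks that the two formulas are inverses of each other.

```python
import math


def nodes_for_slowdown(N, k):
    """n = N / ((k - 1) * ln N + k): nodes needed to be about k times slower than optimal."""
    return N / ((k - 1) * math.log(N) + k)


def slowdown(N, n):
    """k = (ln N + N / n) / (ln N + 1): approximate slowdown of n nodes versus the optimal N."""
    return (math.log(N) + N / n) / (math.log(N) + 1)


n = nodes_for_slowdown(N=12, k=1.1)    # the slide's example, about 9 nodes
k_back = slowdown(N=12, n=n)           # inverts the first formula, giving k = 1.1 again
```

Note that both expressions rely on approximating $\log_2 n$ by $\log_2 N$, so they are most trustworthy when $n$ is not far from $N$.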
  • 14. Scalability testing
    – Setup
      – MNIST handwritten digit recognition, 60K samples
      – 6-layer MLP: 784, 2500, 2000, 1500, 1000, 500, 10
      – 12M parameters
      – CPU: Xeon E31240, 3.3GHz, 105.6 GFLOPS
      – GPU: Tesla M2050 3GB, 575MHz
      – Caffe (deep learning framework from Berkeley): 1 node
      – Spark: 1 master + 5 workers
    – Results per iteration
      – Single node (both tools double precision)
        – 1.7x slower than Caffe CPU (Scala vs C++)
      – Scalability
        – 5 nodes give a 4.7x speedup, beat Caffe CPU, close to GPU
        – 7 nodes are on par with GPU by compute
    – $N = \max\left(\frac{60\text{K} \cdot 6 \cdot 12\text{M} \cdot 0.69 \,/\, 105.6\text{G}}{128 \cdot 12\text{M} \,/\, 950\text{M} + 2 \cdot 0.1},\, 1\right) = 15$
    – $k = \frac{\ln 15 + \frac{15}{5}}{\ln 15 + 1} \approx 1.5$
    [Figure: Spark MLP vs Caffe MLP. Y axis: seconds (0 to 120). X axis: nodes = workers (1 to 13). Series: MLP (total), MLP (compute), Caffe CPU, Caffe GPU. Annotation: communication & scheduler cost.]
  • 15. Conclusions & future work
    – Conclusions
      – Scalable multilayer perceptron is available in Spark 1.5.0
      – Extensible internal API for artificial neural networks
        – Further contributions are welcome!
      – Native BLAS (and GPU) speeds up Spark
      – Heuristics for parallelization of batch gradient
    – Work in progress [SPARK-5575]
      – (Stacked) Autoencoder(s)
      – Restricted Boltzmann Machines
      – Drop-out
      – Convolutional neural networks
    – Further work
      – Adaptive batch LBFGS
      – SGD & parameter server
  • 16. © Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Thank you