Artificial neural networks (ANN) are one of the most popular models in machine learning, in particular for deep learning. The models that are used in practice for image classification and speech recognition contain a huge number of weights and are trained with big datasets. Training such models is challenging in terms of computation and data processing. We propose a scalable implementation of deep neural networks for Spark. We address the computational challenge with batch operations, using BLAS for vector and matrix computations and reusing memory to reduce garbage collector activity. Spark provides the data parallelism that enables scaling of training. As a result, our implementation is on par with widely used C++ implementations like Caffe on a single machine and scales nicely on a cluster. The developed API makes it easy to configure your own network and to run experiments with different hyperparameters. Our implementation is easily extensible and we invite other developers to contribute new types of neural network functions and layers. Also, the optimizations that we applied and our experience with GPU CUDA BLAS might be useful for other machine learning algorithms being developed for Spark.
These slides were presented at the Spark SF Friends meetup on December 2, 2015, organized by Alex Khrabrov @Nitro. The content is based on my talk at Spark Summit Europe. However, there are a few major updates: an updated and more detailed parallelism heuristic, experiments with a larger cluster, and a new slide design.
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
1. A Scalable Implementation
of Deep Learning on Spark
Alexander Ulanov 1
Joint work with Xiangrui Meng2, Bert Greevenbosch3
With the help from Guoqiang Li4, Andrey Simanovsky1
1Hewlett Packard Labs 2Databricks
3Huawei & Jules Energy 4Spark community
2. Outline
– Artificial neural network basics
– Implementation of Multilayer Perceptron (MLP) in Spark
– Optimization & parallelization
– Experiments
– Future work
– What's new compared to the Spark Summit talk
– Update and more details about the parallelization heuristic
– Experiments with a larger cluster
– Slide design (now Hewlett Packard Enterprise)
3. Artificial neural network
– Basics
– Statistical model that approximates a function of multiple inputs
– Consists of interconnected "neurons" which exchange messages
– A "neuron" produces an output by applying a transformation function to its inputs
– A network with more than 3 layers of neurons is called "deep", an instance of deep learning
– Layer types & learning
– A layer type is defined by its transformation function
– Affine: y_i = Σ_j w_ij·x_j + b_i; Sigmoid: y_i = (1 + e^(−x_i))^(−1); Convolution, Softmax, etc.
– Multilayer perceptron (MLP) – a network with several pairs of Affine & Sigmoid layers
– Model parameters – weights that "neurons" use for transformations
– Parameters are iteratively estimated with the backpropagation algorithm
– Multilayer perceptron
– Speech recognition (phoneme classification), computer vision
– Released in Spark 1.5.0
[Figure: MLP diagram – input x, hidden layer, output y]
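As an illustration of these transformation functions, here is a minimal NumPy sketch of an MLP forward pass (the layer sizes match the MNIST example later in the deck; the random weights and helper names are mine, not the Spark implementation):

```python
import numpy as np

def affine(x, W, b):
    # Affine layer: y = W^T x + b
    return W.T @ x + b

def sigmoid(z):
    # Elementwise logistic function: y_i = (1 + e^(-z_i))^(-1)
    return 1.0 / (1.0 + np.exp(-z))

# A tiny MLP: 784 inputs -> 300 -> 100 -> 10 outputs (random weights for illustration)
rng = np.random.default_rng(0)
sizes = [784, 300, 100, 10]
params = [(rng.standard_normal((m, n)) * 0.01, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

x = rng.standard_normal(784)
for W, b in params:
    x = sigmoid(affine(x, W, b))
print(x.shape)  # (10,)
```

Training then amounts to estimating the (W, b) pairs by backpropagation, as the slide says.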
4. Example of MLP in Spark
– Handwritten digits recognition
– Dataset: MNIST [LeCun et al. 1998]
– 28x28 greyscale images of handwritten digits 0-9
– MLP with 784 inputs, 10 outputs and two hidden layers of 300 and 100 neurons
val digits: DataFrame = sqlContext.read.format("libsvm").load("/data/mnist")
val mlp = new MultilayerPerceptronClassifier()
.setLayers(Array(784, 300, 100, 10))
.setBlockSize(128)
val model = mlp.fit(digits)
784 inputs 300 neurons 100 neurons 10 neurons
1st hidden layer 2nd hidden layer Output layer
digits = sqlContext.read.format("libsvm").load("/data/mnist")
mlp = MultilayerPerceptronClassifier(layers=[784, 300, 100, 10], blockSize=128)
model = mlp.fit(digits)
Scala
Python
5. Pipeline with PCA+MLP in Spark
val digits: DataFrame = sqlContext.read.format("libsvm").load("/data/mnist")
val pca = new PCA()
.setInputCol("features")
.setK(20)
.setOutputCol("features20")
val mlp = new MultilayerPerceptronClassifier()
.setFeaturesCol(โfeatures20โ)
.setLayers(Array(20, 50, 10))
.setBlockSize(128)
val pipeline = new Pipeline()
.setStages(Array(pca, mlp))
val model = pipeline.fit(digits)
digits = sqlContext.read.format("libsvm").load("/data/mnist8m")
pca = PCA(inputCol="features", k=20, outputCol="features20")
mlp = MultilayerPerceptronClassifier(featuresCol="features20", layers=[20, 50, 10],
blockSize=128)
pipeline = Pipeline(stages=[pca, mlp])
model = pipeline.fit(digits)
Scala
Python
6. MLP implementation in Spark
– Requirements
– Conform to Spark APIs
– Provide an extensible interface (deep learning API)
– Efficient and scalable (single node & cluster)
– Why conform to Spark APIs?
– Spark can call any Java, Python or Scala library, not necessarily designed for Spark
– This results in expensive data movement from Spark RDD to the library
– It prevents the use of Spark ML Pipelines
– Extensible interface
– Our implementation processes each layer as a black box, with backpropagation in general form
– Allows further introduction of new layers and features
– CNN, (Stacked) Autoencoder, RBM are currently under development by the community
7. Efficiency
– Batch processing
– A layer's affine transformation can be represented in vector form: y = Wᵀx + b
– y – output of the layer, vector of size n
– W – matrix of layer weights (m × n), b – bias, vector of size n
– x – input to the layer, vector of size m
– Vector-matrix multiplications are not as efficient as matrix-matrix multiplications
– Stack s input vectors into a batch to perform a matrix-matrix multiplication: Y = WᵀX + B
– X is m × s, Y is n × s
– B is n × s, each column contains a copy of b
– We implemented batch processing in matrix form
– Enabled the use of optimized native BLAS libraries
– Memory is reused to limit GC overhead
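The batching trick is easy to verify with a NumPy sketch (the sizes here are assumed for illustration): stacking s input vectors as columns of X replaces s matrix-vector products with a single matrix-matrix product, i.e. one BLAS GEMM call:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s = 784, 300, 128          # input size, output size, batch size
W = rng.standard_normal((m, n))
b = rng.standard_normal(n)
xs = [rng.standard_normal(m) for _ in range(s)]

# One vector at a time: s matrix-vector products
ys = [W.T @ x + b for x in xs]

# Batched: stack the s inputs as columns of X, then one matrix-matrix product;
# B repeats the bias b in every column
X = np.stack(xs, axis=1)          # m x s
B = np.tile(b[:, None], (1, s))   # n x s
Y = W.T @ X + B                   # n x s, a single GEMM

print(np.allclose(np.stack(ys, axis=1), Y))  # True
```

The results are identical, but the batched form lets an optimized BLAS exploit cache blocking and vectorization.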
8. BLAS in Spark
– BLAS – Basic Linear Algebra Subprograms
– Hardware-optimized native implementations in C & Fortran
– CPU: MKL, OpenBLAS, etc.
– GPU: NVBLAS (F-BLAS interface to CUDA)
– Used in Spark through netlib-java
– Experiments
– Huge benefit from native BLAS vs pure Java f2jblas
– GPU is faster (2x) only for large matrices
– When compute is larger than the copy to/from GPU
– More details:
– https://github.com/avulanov/scala-blas
– "linalg: Matrix Computations in Apache Spark", Reza et al., 2015
[Chart: DGEMM performance – seconds (log scale, 1e-4 to 1e4) vs matrix sizes from (1x1)*(1x1) up to (10000x10000)*(10000x10000), for netlib-NVBLAS, netlib-MKL, netlib-OpenBLAS and netlib-f2jblas. Single node BLAS; CPU: 2x Xeon X5650 @ 2.67GHz, 32GB RAM; GPU: Tesla M2050 3GB, 575MHz, 448 CUDA cores]
9. Scalability
Parallelization
– Each iteration i, each node j:
– 1. Gets parameters w_i from the master
– 2. Computes a gradient ∇ʲᵢF(data_j)
– 3. Sends the gradient to the master
– 4. Master computes w_{i+1} based on the gradients
– Gradient type
– Batch – process all data on each iteration
– Stochastic – one random point
– Mini-batch – a random batch
– How many workers to use?
– Fewer workers – less compute
– More workers – more communication
[Diagram: master–executor loop over partitions 1..P and executors 1..N: (1) executors get w_i from the master, (2) each executor j computes ∇ʲᵢF(data_j) on its partitions, (3) gradients are sent to the master, (4) the master computes w_{i+1} = w_i − α·Σ_j ∇ʲᵢF; go to step 1]
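The four steps can be simulated in a single process with NumPy for batch-gradient logistic regression (a toy sketch: the data, the partitioning and the learning rate are mine, and in real Spark each partition would live on an executor):

```python
import numpy as np

def grad(w, X, y):
    # Gradient of the logistic loss summed over one partition: X^T (sigma(Xw) - y)
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y)

rng = np.random.default_rng(0)
n, m, workers = 1000, 20, 4
X = rng.standard_normal((n, m))
true_w = rng.standard_normal(m)
y = (X @ true_w > 0).astype(float)     # linearly separable toy labels

parts = np.array_split(np.arange(n), workers)  # one partition per "executor"
w = np.zeros(m)
lr = 0.1 / n
for it in range(100):
    # steps 1-3: each "executor" reads w, computes a gradient on its partition
    grads = [grad(w, X[p], y[p]) for p in parts]
    # step 4: the "master" sums the gradients and updates the parameters
    w = w - lr * np.sum(grads, axis=0)

acc = np.mean((X @ w > 0).astype(float) == y)
print(round(acc, 2))
```

Because the batch gradient is a sum over data points, summing per-partition gradients at the master is exactly equivalent to a single-machine gradient step.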
10. Communication and computation trade-off
Parallelization of batch gradient
– There are n data points, m features and k classes
– Assume we want to train logistic regression; it has km parameters
– Communication: N workers get/receive km 64-bit parameters through a network with bandwidth B and software overhead c. Using all-reduce:
– t_cm = 2·(64km/B + c)·log₂N
– Computation: each worker has F FLOPS and processes n/N of the data; each point needs 2km operations
– t_cp ≈ (n/N)·(2km/F)
– What is the optimal number of workers N?
– min_N (t_cm + t_cp) ⇒ N = max(2·n·k·m·ln2 / (F·(128km/B + 2c)), 1)
– In general, N = max(n·f·ln2 / (F·(128w/B + 2c)), 1), where w is the number of parameters and f is the number of floating-point operations per data point
11. Analysis of the trade-off
Optimal number of workers for batch gradient
– Parallelism in a cluster
– N = max(n·f·ln2 / (F·(128w/B + 2c)), 1)
– Analysis
– More FLOPS F means a lower degree of batch gradient parallelism in a cluster
– More operations, i.e. more features and classes (or a deep network), means a higher degree
– A small overhead c for sending/receiving a message means a higher degree
– Example: MNIST8M handwritten digit recognition dataset
– 8.1M documents, 784 features, 10 classes, logistic regression
– 32 GFLOPS double precision CPU, 1 Gbit network, overhead ≈ 0.1s
– N = max(2·8.1M·784·10·0.69 / (32G·(128·784·10/1G + 2·0.1)), 1) = 12
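The heuristic can be written as a small Python function (the function and argument names are mine; f is FLOPs per data point and w the number of parameters, as above):

```python
import math

def optimal_workers(n, f, w, F, B, c):
    """Heuristic from the slides: n data points, f FLOPs per point,
    w parameters, F FLOPS per worker, B network bandwidth (bit/s),
    c per-message software overhead (s)."""
    return max(n * f * math.log(2) / (F * (128 * w / B + 2 * c)), 1)

# MNIST8M logistic regression: f = 2km FLOPs per point, w = km parameters
k, m = 10, 784
N = optimal_workers(n=8.1e6, f=2 * k * m, w=k * m, F=32e9, B=1e9, c=0.1)
print(round(N))  # ~14 with these round constants (the slide reports 12)
```

With these round constants it returns about 14, close to the slide's 12 (which was presumably computed with slightly different bandwidth/overhead values).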
13. Comparison with the best case
– What if we can't get the optimal number of workers?
– After a quick drop, time decreases slowly and starts increasing at some point
– We can use a smaller cluster that will be only a times slower than the optimal
– Time: t = 2·(64w/B + c)·log₂N + (n/N)·(f/F) = α·log₂N + β/N
– Find the number of nodes N′ that is a times slower than the optimal N
– α·log₂N′ + β/N′ = a·t(N)
– Approximation
– Let's approximate log₂N′ with log₂N, substitute t(N) and solve the equation for N′
– N′ = N / ((a−1)·ln N + a)
– Also, a = (ln N + N/N′) / (ln N + 1) (how much slower our configuration is than the optimal)
– Example: number of nodes that runs the logistic regression example 10% slower than the optimal configuration
– Optimal number N = 12
– N′ = 12 / ((1.1−1)·ln 12 + 1.1) ≈ 9
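Both approximations are a one-liner each in Python (the function names are mine):

```python
import math

def smaller_cluster(N_opt, a):
    # Number of nodes that is a times slower than the optimal N_opt
    return N_opt / ((a - 1) * math.log(N_opt) + a)

def slowdown(N_opt, N):
    # How much slower a cluster of N nodes is than the optimal N_opt
    return (math.log(N_opt) + N_opt / N) / (math.log(N_opt) + 1)

print(round(smaller_cluster(12, 1.1)))  # 9: run 10% slower with 9 nodes instead of 12
```

The second function is used on the next slide to estimate how far a 5-node cluster is from its computed optimum.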
14. [Chart: Spark MLP vs Caffe MLP – seconds per iteration (0–120) vs number of nodes (1–13): MLP (total), MLP (compute), Caffe CPU, Caffe GPU]
Scalability testing
– Setup
– MNIST handwritten digit recognition, 60K samples
– 6-layer MLP (784, 2500, 2000, 1500, 1000, 500, 10)
– 12M parameters
– CPU: Xeon E3-1240, 3.3GHz, 105.6 GFLOPS
– GPU: Tesla M2050 3GB, 575MHz
– Caffe (Deep Learning from Berkeley): 1 node
– Spark: 1 master + 5 workers
– Results per iteration
– Single node (both tools double precision)
– 1.7x slower than Caffe CPU (Scala vs C++)
– Scalability
– 5 nodes give 4.7x speedup, beats Caffe, close to GPU
– 7 nodes on par with GPU by compute
– N = max(60K·6·12M·0.69 / (105.6G·(128·12M/950M + 2·0.1)), 1) = 15
– a = (ln 15 + 15/5) / (ln 15 + 1) ≈ 1.5
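Plugging the slide's MLP numbers into the two heuristics confirms the arithmetic (0.69 stands for ln 2, and 950M is the effective bandwidth of the 1 Gbit link, as on the slide):

```python
import math

# Slide-14 MLP numbers: n = 60K samples, ~6 FLOPs per parameter per sample,
# w = 12M parameters, F = 105.6 GFLOPS, B = 950 Mbit/s, overhead c = 0.1 s
n, w, F, B, c = 60e3, 12e6, 105.6e9, 950e6, 0.1
N = max(n * 6 * w * 0.69 / (F * (128 * w / B + 2 * c)), 1)

# How much slower the actual 5-worker cluster is than the optimum of 15
a = (math.log(15) + 15 / 5) / (math.log(15) + 1)
print(int(N), round(a, 1))  # 15 1.5
```

So the 5-worker cluster used in the experiment is predicted to be about 1.5x slower per iteration than the 15-worker optimum.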
15. Conclusions & future work
– Conclusions
– A scalable multilayer perceptron is available in Spark 1.5.0
– Extensible internal API for artificial neural networks
– Further contributions are welcome!
– Native BLAS (and GPU) speeds up Spark
– Heuristics for parallelization of batch gradient
– Work in progress [SPARK-5575]
– (Stacked) Autoencoder(s)
– Restricted Boltzmann Machines
– Drop-out
– Convolutional neural networks
– Further work
– Adaptive batch LBFGS
– SGD & parameter server
16. © Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Thank you