1. © Cloudera, Inc. All rights reserved.
Parallel/Distributed Deep Learning
and CDSW
Rafael Arana - Senior Solutions Architect
Zuling Kang - Senior Solutions Architect
2. © Cloudera, Inc. All rights reserved. 2
TABLE OF CONTENTS
● Motivation for distributed deep learning and distributed model training
● Distributing the model training processes
● Integrating the distributed model training into CDSW
● Discussions and future
3. © Cloudera, Inc. All rights reserved. 3
BACKGROUND
[Diagram: CONNECT products & services (IoT) / PROTECT / DRIVE customer insights]
4. © Cloudera, Inc. All rights reserved. 4
Are we there yet?
QUAID Where am I?
JOHNNY (cheerful) You're in a
JohnnyCab!
QUAID I mean...what am I doing
here?
JOHNNY I'm sorry. Would you
please rephrase the question.
QUAID (impatient, enunciates)
How did I get in this taxi?!
JOHNNY The door opened. You
got in.
5. © Cloudera, Inc. All rights reserved. 5
Increase in compute
Source: https://blog.openai.com/ai-and-compute/
7. © Cloudera, Inc. All rights reserved. 7
The Power-law Region
More compute + more training data -> Better Accuracy
Reference: https://arxiv.org/abs/1712.00409
9. © Cloudera, Inc. All rights reserved. 9
PROBLEM:
LABELED
TRAINING
DATA
• Supervised learning
• Reuse public data sets
• Data Augmentation
• Enterprise Data and data privacy
regulations
10. © Cloudera, Inc. All rights reserved. 10
TRANSFER
LEARNING
• Low budget (computation, data set labelling, …)
• Use transfer learning to transfer knowledge from large public data sets to your own problem (see the Keras sketch below):
• Small data set: replace the softmax layer
• Medium data set: replace the last layers
• Large data set: use the pre-trained weights just for initialization
• Sample object detection based on RetinaNet using Keras:
• Person, car, …
• But… what is that prediction on Ringo’s left leg?
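A minimal Keras transfer-learning sketch (an illustration only, not the RetinaNet demo from the slide; the backbone, class count, and data are placeholders): reuse an ImageNet-pretrained backbone, freeze it, and replace just the final softmax layer for a small data set.

import tensorflow as tf

# Pretrained backbone from a large public data set (ImageNet), without its classifier head.
base = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                         input_shape=(224, 224, 3), pooling="avg")
base.trainable = False          # small data set: keep the pretrained weights frozen

# Replace only the final softmax layer with one for our own classes (e.g. 5 classes).
outputs = tf.keras.layers.Dense(5, activation="softmax")(base.output)
model = tf.keras.Model(inputs=base.input, outputs=outputs)

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)   # train_images/train_labels: your own data

With a medium-sized data set you would instead unfreeze and retrain the last few layers; with a large one you would train everything and use the pretrained weights only as initialization.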
11. © Cloudera, Inc. All rights reserved. 11
Neural Network Architectures
Training Data Set
12. © Cloudera, Inc. All rights reserved. 12
Neural Network
Architecture and
Accuracy
Do DNN models with more parameters produce higher classification accuracy?
• Example: popular computer vision convnets
• VGG and AlexNet each have more than 150 MB of fully-connected layer parameters; GoogLeNet has smaller fully-connected layers, and NiN has no fully-connected layers at all
• GoogLeNet and NiN make heavy use of convolution filters with a resolution of 1x1 instead of 3x3 or larger
• Models with fewer parameters are more amenable to scaling, while still delivering high accuracy.
Reference: https://arxiv.org/pdf/1511.00175
13. © Cloudera, Inc. All rights reserved. 13
Let’s put our model in production!!!!
Photos by Unsplash
14. © Cloudera, Inc. All rights reserved. 14
Industrialization of ML – Efficient training
Photos by Unsplash
15. © Cloudera, Inc. All rights reserved. 15
Machine Learning Development Life Cycle
17. © Cloudera, Inc. All rights reserved. 17
Cloudera Data Science Workbench
Architecture
[Architecture diagram: CDSW master and engine containers on gateway node(s), alongside CDH nodes (Hive, HDFS, ...) managed by Cloudera Manager, plus a container registry and Git repo]
18. © Cloudera, Inc. All rights reserved. 18
Cloudera Data Science Workbench
Architecture
[Architecture diagram: the same layout on HDP, with CDSW master and engine containers on gateway node(s), HDP nodes (Hive, HDFS, ...) managed by Ambari, plus a container registry and Git repo]
19. © Cloudera, Inc. All rights reserved. 19
Adding GPUs
Step 1. Admin > Engines > Engine Images
20. © Cloudera, Inc. All rights reserved. 20
Adding GPUs
Step 2. Project > Settings > Engine
21. © Cloudera, Inc. All rights reserved. 21
Adding GPUs
GPU Support
[Diagram: GPUs on the CDSW nodes for single-node training; CPUs on the CDH/HDP nodes for distributed training and scoring; GPU support on CDH coming in C6]
22. © Cloudera, Inc. All rights reserved. 22
Distributed Tensorflow Package
• Main concepts (a minimal setup is sketched below)
• Workers
• Parameter servers
• tf.train.Server()
• tf.train.ClusterSpec()
• tf.train.replica_device_setter()
• tf.train.SyncReplicasOptimizer()
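A minimal sketch of how these pieces fit together in the classic TF 1.x parameter-server setup (host names, ports, and the toy model are assumptions): a ClusterSpec describes the workers and parameter servers, each process starts a tf.train.Server, and replica_device_setter places variables on the PS tasks.

import tensorflow as tf

# Hypothetical cluster: 2 workers and 1 parameter server (host:port pairs are placeholders).
cluster = tf.train.ClusterSpec({
    "worker": ["worker0:2222", "worker1:2222"],
    "ps": ["ps0:2222"],
})

# Each process runs one task; here we pretend to be worker 0.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Variables go to the parameter server, ops stay on this worker.
with tf.device(tf.train.replica_device_setter(cluster=cluster,
                                              worker_device="/job:worker/task:0")):
    x = tf.placeholder(tf.float32, shape=[None, 1])
    y = tf.placeholder(tf.float32, shape=[None, 1])
    pred = tf.layers.dense(x, 1)
    loss = tf.losses.mean_squared_error(y, pred)
    opt = tf.train.GradientDescentOptimizer(0.01)
    # Synchronous updates: aggregate gradients from all replicas before applying them.
    opt = tf.train.SyncReplicasOptimizer(opt, replicas_to_aggregate=2, total_num_replicas=2)
    train_op = opt.minimize(loss, global_step=tf.train.get_or_create_global_step())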
23. © Cloudera, Inc. All rights reserved. 23
Local Multi-GPU Training - TF Distribution Strategy
Keras API
distribution = tf.contrib.distribute.MirroredStrategy()
with distribution.scope():
    inputs = tf.keras.layers.Input(shape=(1,))
    predictions = tf.keras.layers.Dense(1)(inputs)
    model = tf.keras.models.Model(inputs=inputs, outputs=predictions)
    model.compile(loss='mean_squared_error',
                  optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.2))
# train_dataset is assumed to be a tf.data.Dataset of (features, labels) batches
model.fit(train_dataset, epochs=5, steps_per_epoch=10)
[Diagram: a single CDSW node with CPU and multiple GPUs]
24. © Cloudera, Inc. All rights reserved. 24
Local Multi-GPU Training - TF Distribution Strategy
Estimator API
def model_fn(features, labels, mode):
    layer = tf.layers.Dense(1)
    logits = layer(features)
    # (truncated on the slide) return an EstimatorSpec with a loss and a train op
    loss = tf.losses.mean_squared_error(labels=labels, predictions=tf.reshape(logits, []))
    train_op = tf.train.GradientDescentOptimizer(0.2).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

def input_fn():
    features = tf.data.Dataset.from_tensors([[1.]]).repeat(100)
    labels = tf.data.Dataset.from_tensors(1.).repeat(100)
    return tf.data.Dataset.zip((features, labels))

distribution = tf.contrib.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=distribution)
classifier = tf.estimator.Estimator(model_fn=model_fn, config=config)
classifier.train(input_fn=input_fn)
classifier.evaluate(input_fn=input_fn)
[Diagram: a TF Estimator running on a single CDSW node with CPU and multiple GPUs]
25. © Cloudera, Inc. All rights reserved.
Distributing the Model Training Processes
26. © Cloudera, Inc. All rights reserved. 26
PROCEDURE FOR TRAINING A DEEP LEARNING MODEL
Repeat the following for num_epoch times:
    For each mini_batch(x, y) in dataset:
        Set pred_tensor = model(x)                        // feeding forward
        Set diff_tensor = L2_loss(y, pred_tensor)
        // OR: Set diff_tensor = cross_entropy_loss(y, pred_tensor)
        Set grad = gradient of diff_tensor
        Update the model using grad
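A minimal runnable sketch of this loop in PyTorch (the model, random data, and hyperparameters are placeholders, not from the slides):

import torch

model = torch.nn.Linear(10, 1)                      # stand-in model
loss_fn = torch.nn.MSELoss()                        # L2 loss (or CrossEntropyLoss)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Hypothetical dataset: 1000 random samples, mini-batches of 32.
dataset = torch.utils.data.TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1))
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

num_epoch = 5
for epoch in range(num_epoch):
    for x, y in loader:                             # each mini_batch(x, y) in dataset
        pred = model(x)                             # feeding forward
        diff = loss_fn(pred, y)                     # L2 loss
        optimizer.zero_grad()
        diff.backward()                             # gradient of diff
        optimizer.step()                            # update the model using grad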
27. © Cloudera, Inc. All rights reserved. 27
FOUR MAJOR ISSUES IN DISTRIBUTED MODEL TRAINING
• Shall we use data parallelism or model parallelism?
• How do we efficiently distribute the model parameters, which are normally huge in number?
• How do we aggregate the model parameters on different training nodes into a global model?
• Model updating algorithms
• How do we efficiently scale the training load and give it efficient access to the huge amount of training data?
• The first three issues are covered in this section; the fourth will be addressed in the next section.
28. © Cloudera, Inc. All rights reserved. 28
TENSORFLOW AND MODEL PARALLELISM
● DistBelief was originally proposed by Google
● The idea was first published in a research paper in 2012
● Used as the basis of the built-in distributed implementation in Tensorflow
● Parameter server (PS)
● A centralized server for sharing neural network parameters
● Model parallelism
● A method to distribute the training parameters across worker nodes
● Model updating algorithm
● Downpour SGD
Jeffrey Dean, et al. “Large Scale Distributed Deep Networks”, Advances in Neural Information Processing Systems (NIPS), 2012.
29. © Cloudera, Inc. All rights reserved. 29
FROM MODEL TO DATA PARALLELISM
• Strength of model parallelism
• Applicable to models larger than the memory or GPU capacity of ONE worker node
• Weakness
• Unable to take full advantage of the hardware resources
• For models whose parameters can be held in the GPUs of ONE worker node
• Data parallelism is the better choice
30. © Cloudera, Inc. All rights reserved. 30
HARDWARE USE RATE AS TRAINING NODE INCREASES
https://eng.uber.com/horovod/
31. © Cloudera, Inc. All rights reserved. 31
WHOLE PICTURE OF DATA PARALLELISM
https://eng.uber.com/horovod/
32. © Cloudera, Inc. All rights reserved. 32
FROM PS TO MPI ALLREDUCE
● Based on Baidu's ring-allreduce algorithm (see http://andrew.gibiansky.com/ for details)
● Uses the HPC/MPI framework, originally written in C and nowadays wrapped with Python (a tiny averaging example follows below)
● Implemented in Uber Horovod, Baidu, PyTorch, MXNet, etc.
● Found to be faster for small numbers of nodes (8-64)
○ https://cwiki.apache.org/confluence/display/MXNET/Extend+MXNet+Distributed+Training+by+MPI+AllReduce
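A tiny sketch of the allreduce operation these frameworks build on (using mpi4py, which is an assumption for concreteness since the slides do not name a specific binding): every rank contributes its local gradients and receives the element-wise sum, which is then averaged.

# Run with e.g.: mpirun -np 4 python allreduce_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Pretend these are local gradients computed from this worker's mini-batch.
local_grad = np.full(5, float(rank), dtype=np.float64)

# Allreduce sums the buffers across all ranks (Baidu/Horovod implement this as a
# ring-allreduce); divide by the number of ranks to average.
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= size

if rank == 0:
    print("averaged gradient:", global_grad)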
33. © Cloudera, Inc. All rights reserved. 33
PERFORMANCE GAINS OF INFINIBAND/RDMA
https://eng.uber.com/horovod/
34. © Cloudera, Inc. All rights reserved. 34
MODEL UPDATING ALGORITHMS
[Figure: synchronized vs. asynchronous model updating]
From: Strategies and Principles of Distributed Machine Learning on
Big Data, https://doi.org/10.1016/J.ENG.2016.02.008
35. © Cloudera, Inc. All rights reserved. 35
UPDATING ALGORITHM: SYNCHRONIZED VS. ASYNCHRONOUS
• Synchronized algorithms lead to a more precise and consistent model; however, some workers may have to wait for a long time at the synchronization barrier, which leads to longer training time.
• When the minibatch is large, this inefficiency is largely reduced.
• From: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour,
https://arxiv.org/abs/1706.02677
• Asynchronous algorithms are said to be stochastic in their descent directions, which can make the model imprecise.
• In practice, however, they exhibit a momentum that makes the model converge to a point very close to that of its synchronized counterpart.
36. © Cloudera, Inc. All rights reserved. 36
MODEL ERROR VS. BATCH SIZE
Priya Goyal, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1
Hour. https://arxiv.org/abs/1706.02677
37. © Cloudera, Inc. All rights reserved. 37
FAMOUS SYNCHRONIZED AND ASYNCHRONOUS EXAMPLES
• Synchronized updating algorithms
• Microsoft CNTK: model averaging after a certain number of iterations.
• Uber Horovod: synchronous updates using large minibatches.
• Asynchronous updating algorithms
• Google Tensorflow: Downpour SGD.
38. © Cloudera, Inc. All rights reserved. 38
ALGORITHM FRAMEWORK FOR SYNCHRONIZED SGD
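A common formulation of synchronous data-parallel SGD, given here as an illustrative sketch (using mpi4py and PyTorch as an assumption for concreteness; Horovod and CNTK implement the same pattern): each worker computes gradients on its own mini-batch, the gradients are averaged across all workers at the barrier, and every replica applies the identical update.

# Illustrative synchronous data-parallel SGD loop (placeholder model and random data).
import numpy as np
from mpi4py import MPI
import torch

comm = MPI.COMM_WORLD
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

for step in range(100):
    # 1. Each worker draws its own local mini-batch (random placeholder data here).
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()

    # 2. Synchronization barrier: average the gradients across all workers.
    for p in model.parameters():
        grad = p.grad.numpy()
        avg = np.empty_like(grad)
        comm.Allreduce(grad, avg, op=MPI.SUM)
        p.grad.copy_(torch.from_numpy(avg / comm.Get_size()))

    # 3. Every replica applies the identical averaged update.
    optimizer.step()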
39. © Cloudera, Inc. All rights reserved.
Integrating the Distributed Model Training into CDSW
41. © Cloudera, Inc. All rights reserved. 41
USING CDSW API TO SPAWN TRAIN-WORKERS
Use cdsw.launch_workers() to spawn worker containers; each worker then connects back to the master using the IP address from the
CDSW_MASTER_IP environment variable. The trainer master can then distribute the collected IP addresses so that all workers can
communicate with one another (a possible continuation is sketched after the code).
master.py:

import cdsw, socket
import threading
import time

workers = cdsw.launch_workers(n=2, cpu=0.2, memory=0.5,
                              script="worker.py")

# Attempt to get workers' IP addresses by accepting connections
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("0.0.0.0", 6000))
s.listen(1)
conns = dict()
for i in range(2):
    conn, addr = s.accept()
    print("IP address of %d: %s" % (i, addr[0]))
    conns[i] = (conn, addr[0])

worker.py:

import os, time, socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((os.environ["CDSW_MASTER_IP"], 6000))
data = s.recv(1024).decode()
print("Response from the master:", data)
s.close()
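As a possible continuation (an assumption, not shown on the slide), the master could broadcast the collected IP list back over the open connections so every worker learns its peers:

# Hypothetical continuation of master.py: send the peer IP list to each worker.
import json

ip_list = [ip for (_, ip) in conns.values()]
for conn, _ in conns.values():
    conn.sendall(json.dumps(ip_list).encode())
    conn.close()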
42. © Cloudera, Inc. All rights reserved. 42
CREATING CDSW DOCKER IMAGES
• CDSW Docker images for distributed model training can be created by extending the following base image (a sample Dockerfile is sketched below).
• docker.repository.cloudera.com/cdsw/engine:7
• Run the base image and install OpenMPI 4.0.0 from source inside the Docker instance.
• Do not install the OS-provided OpenMPI package, as its version is below Horovod's requirement.
• Install the core packages.
• pip install petastorm tensorflow torch horovod
• If you wish to use GPUs for model training, make sure to install the NVIDIA driver and use the GPU versions of Tensorflow and/or PyTorch.
• See: https://www.cloudera.com/documentation/data-science-workbench/latest/topics/cdsw_gpu.html
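A minimal Dockerfile sketch of these steps (the Open MPI download URL, the availability of wget and a build toolchain in the base image, and the package pins are assumptions; adjust to your environment):

FROM docker.repository.cloudera.com/cdsw/engine:7

# Build OpenMPI 4.0.0 from source (the OS-provided packages are too old for Horovod).
RUN wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.0.tar.gz && \
    tar zxf openmpi-4.0.0.tar.gz && \
    cd openmpi-4.0.0 && ./configure --prefix=/usr/local && \
    make -j"$(nproc)" && make install && ldconfig && \
    cd .. && rm -rf openmpi-4.0.0 openmpi-4.0.0.tar.gz

# Core Python packages for distributed training.
RUN pip install petastorm tensorflow torch horovod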
43. © Cloudera, Inc. All rights reserved. 43
USING PRE-BUILT IMAGES
• You can also use our pre-built Docker image from our public Docker repo.
• docker pull rarana73/cdsw-7-horovod-gpu:1
• Content:
• CDSW Base Image v7
• CUDA_VERSION 9.0.176
• NCCL_VERSION 2.4.2
• CUDNN_VERSION 7.4.2.24
• Tensorflow 1.12.0
• Open MPI 4.0.0
44. © Cloudera, Inc. All rights reserved. 44
INITIALIZING OPEN-MPI PEERS
• Normally, OpenMPI peers are initialized by directly spawning Python/OpenMPI processes via the mpirun command.
• Similarly, for CDSW-Horovod processes this can be done by invoking the mpirun command from Python (see the sketch below).
• When doing so, however, make sure the train-worker containers are still alive.
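A minimal sketch of invoking mpirun from the master session (the worker IPs, slot counts, and the train_horovod.py script name are placeholders):

import subprocess

# IP addresses collected from the cdsw.launch_workers() containers (see master.py).
worker_ips = ["10.0.0.11", "10.0.0.12"]
hosts = ",".join("%s:1" % ip for ip in worker_ips)   # one slot (e.g. one GPU) per worker

subprocess.run([
    "mpirun", "-np", str(len(worker_ips)),
    "-H", hosts,
    "-bind-to", "none", "-map-by", "slot",
    "-x", "NCCL_DEBUG=INFO", "-x", "PATH",
    "python", "train_horovod.py",                    # hypothetical Horovod training script
], check=True)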
45. © Cloudera, Inc. All rights reserved. 45
Horovod in Action
• Applying Horovod to a WideResNet model trained on the Fashion MNIST dataset (the core Horovod changes are sketched below)
• 2 NVIDIA Quadro P600 GPUs
• CUDA cores: 384 / 2 GB GDDR5 each
horovodrun -np 1 python fashion_mnist/fashion_mnist_solution.py --log-dir log/np-1
horovodrun -np 2 python fashion_mnist/fashion_mnist_solution.py --log-dir log/np-2
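A minimal sketch of the core Horovod changes to a Keras training script (a simple dense model as a stand-in, not the actual fashion_mnist_solution.py from the slide): initialize Horovod, pin one GPU per process, scale the learning rate, wrap the optimizer, and broadcast the initial weights from rank 0.

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU

# Pin each process to a single local GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
tf.keras.backend.set_session(tf.Session(config=config))

(x_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train.reshape(-1, 28 * 28).astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(28 * 28,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Scale the learning rate by the number of workers and wrap the optimizer so
# gradients are averaged with ring-allreduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

# Make sure all workers start from the same initial weights.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(x_train, y_train, batch_size=128, epochs=5,
          callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)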
47. © Cloudera, Inc. All rights reserved. 47
Around the corner
• Spark on K8s & GPU support
• Horovod in Spark
• TensorFlow 2.0 & Distribution Strategy
• Apache Submarine: https://hadoop.apache.org/submarine/
• …