1. © Cloudera, Inc. All rights reserved.
Parallel/Distributed Deep Learning
and CDSW
Rafael Arana - Senior Solutions Architect
Zuling Kang - Senior Solutions Architect
2. © Cloudera, Inc. All rights reserved. 2
TABLE OF CONTENTS
● Motivation for distributed deep learning and distributed model training
● Distributing the model training processes
● Integrating the distributed model training into CDSW
● Discussions and future
3. © Cloudera, Inc. All rights reserved. 3
BACKGROUND
[Diagram: CONNECT products & services (IoT) / PROTECT / DRIVE customer insights]
4. © Cloudera, Inc. All rights reserved. 4
Are we there yet?
QUAID Where am I?
JOHNNY (cheerful) You're in a
JohnnyCab!
QUAID I mean...what am I doing
here?
JOHNNY I'm sorry. Would you
please rephrase the question.
QUAID (impatient, enunciates)
How did I get in this taxi?!
JOHNNY The door opened. You
got in.
5. © Cloudera, Inc. All rights reserved. 5
Increase in compute
Source: https://blog.openai.com/ai-and-compute/
7. © Cloudera, Inc. All rights reserved. 7
The Power-law Region
More compute + more training data -> Better Accuracy
Reference: https://arxiv.org/abs/1712.00409
9. © Cloudera, Inc. All rights reserved. 9
PROBLEM:
LABELED
TRAINING
DATA
• Supervised learning
• Reuse public data sets
• Data Augmentation
• Enterprise Data and data privacy
regulations
10. © Cloudera, Inc. All rights reserved. 10
TRANSFER
LEARNING
• Low budget (computation, data set labelling, …)
• Use transfer learning to transfer knowledge from large public data sets to your own problem (see the Keras sketch below):
• Small data set: replace the softmax layer
• Medium data set: replace the last layers
• Large data set: use the pre-trained weights just for initialization
• Sample object detection based on RetinaNet using Keras:
• Person, car, …
• But… what is that prediction on Ringo’s left leg?
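A minimal Keras transfer-learning sketch (an illustration only, not the RetinaNet demo from the slide; the backbone, class count, and data are placeholders): reuse an ImageNet-pretrained backbone, freeze it, and replace just the final softmax layer for a small data set.

import tensorflow as tf

# Pretrained backbone from a large public data set (ImageNet), without its classifier head.
base = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                         input_shape=(224, 224, 3), pooling="avg")
base.trainable = False          # small data set: keep the pretrained weights frozen

# Replace only the final softmax layer with one for our own classes (e.g. 5 classes).
outputs = tf.keras.layers.Dense(5, activation="softmax")(base.output)
model = tf.keras.Model(inputs=base.input, outputs=outputs)

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)   # train_images/train_labels: your own data

With a medium-sized data set you would instead unfreeze and retrain the last few layers; with a large one you would train everything and use the pretrained weights only as initialization.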
11. © Cloudera, Inc. All rights reserved. 11
Neural Network Architectures
Training Data Set
12. © Cloudera, Inc. All rights reserved. 12
Neural Network
Architecture and
Accuracy
Do DNN models with more parameters produce higher classification accuracy?
• Example: popular computer vision convnets
• VGG and AlexNet each have more than 150 MB of fully-connected layer parameters; GoogLeNet has smaller fully-connected layers, and NiN has no fully-connected layers at all
• GoogLeNet and NiN make heavy use of convolution filters with a resolution of 1x1 instead of 3x3 or larger
• Models with fewer parameters are more amenable to scaling, while still delivering high accuracy.
Reference: https://arxiv.org/pdf/1511.00175
13. © Cloudera, Inc. All rights reserved. 13
Let’s put our model in production!!!!
Photos by Unsplash
14. © Cloudera, Inc. All rights reserved. 14
Industrialization of ML – Efficient training
Photos by Unsplash
15. © Cloudera, Inc. All rights reserved. 15
Machine Learning Development Life Cycle
17. © Cloudera, Inc. All rights reserved. 17
Cloudera Data Science Workbench
Architecture
[Architecture diagram: CDSW master and engine containers on gateway node(s), alongside CDH nodes (Hive, HDFS, ...) managed by Cloudera Manager, plus a container registry and Git repo]
18. © Cloudera, Inc. All rights reserved. 18
Cloudera Data Science Workbench
Architecture
[Architecture diagram: the same layout on HDP, with CDSW master and engine containers on gateway node(s), HDP nodes (Hive, HDFS, ...) managed by Ambari, plus a container registry and Git repo]
19. © Cloudera, Inc. All rights reserved. 19
Adding GPUs
Step 1. Admin > Engines > Engine Images
20. © Cloudera, Inc. All rights reserved. 20
Adding GPUs
Step 2. Project > Settings > Engine
21. © Cloudera, Inc. All rights reserved. 21
Adding GPUs
GPU Support
[Diagram: GPUs on the CDSW nodes for single-node training; CPUs on the CDH/HDP nodes for distributed training and scoring; GPU support on CDH coming in C6]
22. © Cloudera, Inc. All rights reserved. 22
Distributed Tensorflow Package
• Main concepts (a minimal setup is sketched below)
• Workers
• Parameter servers
• tf.train.Server()
• tf.train.ClusterSpec()
• tf.train.replica_device_setter()
• tf.train.SyncReplicasOptimizer()
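A minimal sketch of how these pieces fit together in the classic TF 1.x parameter-server setup (host names, ports, and the toy model are assumptions): a ClusterSpec describes the workers and parameter servers, each process starts a tf.train.Server, and replica_device_setter places variables on the PS tasks.

import tensorflow as tf

# Hypothetical cluster: 2 workers and 1 parameter server (host:port pairs are placeholders).
cluster = tf.train.ClusterSpec({
    "worker": ["worker0:2222", "worker1:2222"],
    "ps": ["ps0:2222"],
})

# Each process runs one task; here we pretend to be worker 0.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Variables go to the parameter server, ops stay on this worker.
with tf.device(tf.train.replica_device_setter(cluster=cluster,
                                              worker_device="/job:worker/task:0")):
    x = tf.placeholder(tf.float32, shape=[None, 1])
    y = tf.placeholder(tf.float32, shape=[None, 1])
    pred = tf.layers.dense(x, 1)
    loss = tf.losses.mean_squared_error(y, pred)
    opt = tf.train.GradientDescentOptimizer(0.01)
    # Synchronous updates: aggregate gradients from all replicas before applying them.
    opt = tf.train.SyncReplicasOptimizer(opt, replicas_to_aggregate=2, total_num_replicas=2)
    train_op = opt.minimize(loss, global_step=tf.train.get_or_create_global_step())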
23. © Cloudera, Inc. All rights reserved. 23
Local Multi-GPU Training - TF Distribution Strategy
Keras API
distribution = tf.contrib.distribute.MirroredStrategy()
with distribution.scope():
    inputs = tf.keras.layers.Input(shape=(1,))
    predictions = tf.keras.layers.Dense(1)(inputs)
    model = tf.keras.models.Model(inputs=inputs, outputs=predictions)
    model.compile(loss='mean_squared_error',
                  optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.2))
# train_dataset is assumed to be a tf.data.Dataset of (features, labels) batches
model.fit(train_dataset, epochs=5, steps_per_epoch=10)
[Diagram: a single CDSW node with CPU and multiple GPUs]
24. © Cloudera, Inc. All rights reserved. 24
Local Multi-GPU Training - TF Distribution Strategy
Estimator API
def model_fn(features, labels, mode):
    layer = tf.layers.Dense(1)
    logits = layer(features)
    # (truncated on the slide) return an EstimatorSpec with a loss and a train op
    loss = tf.losses.mean_squared_error(labels=labels, predictions=tf.reshape(logits, []))
    train_op = tf.train.GradientDescentOptimizer(0.2).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

def input_fn():
    features = tf.data.Dataset.from_tensors([[1.]]).repeat(100)
    labels = tf.data.Dataset.from_tensors(1.).repeat(100)
    return tf.data.Dataset.zip((features, labels))

distribution = tf.contrib.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=distribution)
classifier = tf.estimator.Estimator(model_fn=model_fn, config=config)
classifier.train(input_fn=input_fn)
classifier.evaluate(input_fn=input_fn)
[Diagram: a TF Estimator running on a single CDSW node with CPU and multiple GPUs]
25. © Cloudera, Inc. All rights reserved.
Distributing the Model Training Processes
26. © Cloudera, Inc. All rights reserved. 26
PROCEDURE FOR TRAINING A DEEP LEARNING MODEL
Repeat the following for num_epoch times:
    For each mini_batch(x, y) in dataset:
        Set pred_tensor = model(x)                        // feeding forward
        Set diff_tensor = L2_loss(y, pred_tensor)
        // OR: Set diff_tensor = cross_entropy_loss(y, pred_tensor)
        Set grad = gradient of diff_tensor
        Update the model using grad
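A minimal runnable sketch of this loop in PyTorch (the model, random data, and hyperparameters are placeholders, not from the slides):

import torch

model = torch.nn.Linear(10, 1)                      # stand-in model
loss_fn = torch.nn.MSELoss()                        # L2 loss (or CrossEntropyLoss)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Hypothetical dataset: 1000 random samples, mini-batches of 32.
dataset = torch.utils.data.TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1))
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

num_epoch = 5
for epoch in range(num_epoch):
    for x, y in loader:                             # each mini_batch(x, y) in dataset
        pred = model(x)                             # feeding forward
        diff = loss_fn(pred, y)                     # L2 loss
        optimizer.zero_grad()
        diff.backward()                             # gradient of diff
        optimizer.step()                            # update the model using grad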
27. © Cloudera, Inc. All rights reserved. 27
FOUR MAJOR ISSUES IN DISTRIBUTED MODEL TRAINING
• Shall we use data parallelism or model parallelism?
• How do we efficiently distribute the model parameters, which are normally huge in number?
• How do we aggregate the model parameters on different training nodes into a global model?
• Model updating algorithms
• How do we efficiently scale the training load and give it efficient access to the huge amount of training data?
• The first three issues are covered in this section; the fourth will be addressed in the next section.
28. © Cloudera, Inc. All rights reserved. 28
TENSORFLOW AND MODEL PARALLELISM
● DistBelief was originally proposed by Google
● The idea was first published in a research paper in 2012
● Used as the basis of the built-in distributed implementation in Tensorflow
● Parameter server (PS)
● A centralized server for sharing neural network parameters
● Model parallelism
● A method to distribute the training parameters across worker nodes
● Model updating algorithm
● Downpour SGD
Jeffrey Dean, et al. “Large Scale Distributed Deep Networks”, Advances in Neural Information Processing Systems (NIPS), 2012.
29. © Cloudera, Inc. All rights reserved. 29
FROM MODEL TO DATA PARALLELISM
• Strength of model parallelism
• Applicable to models larger than the memory or GPU capacity of ONE worker node
• Weakness
• Unable to take full advantage of the hardware resources
• For models whose parameters can be held in the GPUs of ONE worker node
• Data parallelism is the better choice
30. © Cloudera, Inc. All rights reserved. 30
HARDWARE USE RATE AS TRAINING NODE INCREASES
https://eng.uber.com/horovod/
31. © Cloudera, Inc. All rights reserved. 31
WHOLE PICTURE OF DATA PARALLELISM
https://eng.uber.com/horovod/
32. © Cloudera, Inc. All rights reserved. 32
FROM PS TO MPI ALLREDUCE
● Based on Baidu's ring-allreduce algorithm (see http://andrew.gibiansky.com/ for details)
● Uses the HPC/MPI framework, originally written in C and nowadays wrapped with Python (a tiny averaging example follows below)
● Implemented in Uber Horovod, Baidu, PyTorch, MXNet, etc.
● Found to be faster for small numbers of nodes (8-64)
○ https://cwiki.apache.org/confluence/display/MXNET/Extend+MXNet+Distributed+Training+by+MPI+AllReduce
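A tiny sketch of the allreduce operation these frameworks build on (using mpi4py, which is an assumption for concreteness since the slides do not name a specific binding): every rank contributes its local gradients and receives the element-wise sum, which is then averaged.

# Run with e.g.: mpirun -np 4 python allreduce_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Pretend these are local gradients computed from this worker's mini-batch.
local_grad = np.full(5, float(rank), dtype=np.float64)

# Allreduce sums the buffers across all ranks (Baidu/Horovod implement this as a
# ring-allreduce); divide by the number of ranks to average.
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= size

if rank == 0:
    print("averaged gradient:", global_grad)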
33. © Cloudera, Inc. All rights reserved. 33
PERFORMANCE GAINS OF INFINIBAND/RDMA
https://eng.uber.com/horovod/
34. © Cloudera, Inc. All rights reserved. 34
MODEL UPDATING ALGORITHMS
[Figure: synchronized vs. asynchronous model updating]
From: Strategies and Principles of Distributed Machine Learning on
Big Data, https://doi.org/10.1016/J.ENG.2016.02.008
35. © Cloudera, Inc. All rights reserved. 35
UPDATING ALGORITHM: SYNCHRONIZED VS. ASYNCHRONOUS
• Synchronized algorithms lead to a more precise and consistent model; however, some workers may have to wait for a long time at the synchronization barrier, which leads to longer training time.
• When the minibatch is large, this inefficiency is largely reduced.
• From: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour,
https://arxiv.org/abs/1706.02677
• Asynchronous algorithms are said to be stochastic in their descent directions, which can make the model imprecise.
• In practice, however, they exhibit a momentum that makes the model converge to a point very close to that of its synchronized counterpart.
36. © Cloudera, Inc. All rights reserved. 36
MODEL ERROR VS. BATCH SIZE
Priya Goyal, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1
Hour. https://arxiv.org/abs/1706.02677
37. © Cloudera, Inc. All rights reserved. 37
FAMOUS SYNCHRONIZED AND ASYNCHRONOUS EXAMPLES
• Synchronized updating algorithms
• Microsoft CNTK: model averaging after a certain number of iterations.
• Uber Horovod: synchronous updates using large minibatches.
• Asynchronous updating algorithms
• Google Tensorflow: Downpour SGD.
38. © Cloudera, Inc. All rights reserved. 38
ALGORITHM FRAMEWORK FOR SYNCHRONIZED SGD
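A common formulation of synchronous data-parallel SGD, given here as an illustrative sketch (using mpi4py and PyTorch as an assumption for concreteness; Horovod and CNTK implement the same pattern): each worker computes gradients on its own mini-batch, the gradients are averaged across all workers at the barrier, and every replica applies the identical update.

# Illustrative synchronous data-parallel SGD loop (placeholder model and random data).
import numpy as np
from mpi4py import MPI
import torch

comm = MPI.COMM_WORLD
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

for step in range(100):
    # 1. Each worker draws its own local mini-batch (random placeholder data here).
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()

    # 2. Synchronization barrier: average the gradients across all workers.
    for p in model.parameters():
        grad = p.grad.numpy()
        avg = np.empty_like(grad)
        comm.Allreduce(grad, avg, op=MPI.SUM)
        p.grad.copy_(torch.from_numpy(avg / comm.Get_size()))

    # 3. Every replica applies the identical averaged update.
    optimizer.step()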
39. © Cloudera, Inc. All rights reserved.
Integrating the Distributed Model Training into CDSW
41. © Cloudera, Inc. All rights reserved. 41
USING CDSW API TO SPAWN TRAIN-WORKERS
Use cdsw.launch_workers() to spawn worker containers; each worker then connects back to the master using the IP address from the
CDSW_MASTER_IP environment variable. The trainer master can then distribute the collected IP addresses so that all workers can
communicate with one another (a possible continuation is sketched after the code).
master.py:

import cdsw, socket
import threading
import time

workers = cdsw.launch_workers(n=2, cpu=0.2, memory=0.5,
                              script="worker.py")

# Attempt to get workers' IP addresses by accepting connections
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("0.0.0.0", 6000))
s.listen(1)
conns = dict()
for i in range(2):
    conn, addr = s.accept()
    print("IP address of %d: %s" % (i, addr[0]))
    conns[i] = (conn, addr[0])

worker.py:

import os, time, socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((os.environ["CDSW_MASTER_IP"], 6000))
data = s.recv(1024).decode()
print("Response from the master:", data)
s.close()
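As a possible continuation (an assumption, not shown on the slide), the master could broadcast the collected IP list back over the open connections so every worker learns its peers:

# Hypothetical continuation of master.py: send the peer IP list to each worker.
import json

ip_list = [ip for (_, ip) in conns.values()]
for conn, _ in conns.values():
    conn.sendall(json.dumps(ip_list).encode())
    conn.close()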
42. © Cloudera, Inc. All rights reserved. 42
CREATING CDSW DOCKER IMAGES
• CDSW Docker images for distributed model training can be created by extending the following base image (a sample Dockerfile is sketched below).
• docker.repository.cloudera.com/cdsw/engine:7
• Run the base image and install OpenMPI 4.0.0 from source inside the Docker instance.
• Do not install the OS-provided OpenMPI package, as its version is below Horovod's requirement.
• Install the core packages.
• pip install petastorm tensorflow torch horovod
• If you wish to use GPUs for model training, make sure to install the NVIDIA driver and use the GPU versions of Tensorflow and/or PyTorch.
• See: https://www.cloudera.com/documentation/data-science-workbench/latest/topics/cdsw_gpu.html
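A minimal Dockerfile sketch of these steps (the Open MPI download URL, the availability of wget and a build toolchain in the base image, and the package pins are assumptions; adjust to your environment):

FROM docker.repository.cloudera.com/cdsw/engine:7

# Build OpenMPI 4.0.0 from source (the OS-provided packages are too old for Horovod).
RUN wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.0.tar.gz && \
    tar zxf openmpi-4.0.0.tar.gz && \
    cd openmpi-4.0.0 && ./configure --prefix=/usr/local && \
    make -j"$(nproc)" && make install && ldconfig && \
    cd .. && rm -rf openmpi-4.0.0 openmpi-4.0.0.tar.gz

# Core Python packages for distributed training.
RUN pip install petastorm tensorflow torch horovod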
43. © Cloudera, Inc. All rights reserved. 43
USING PRE-BUILT IMAGES
• You can also use our pre-built Docker image from our public Docker repo.
• docker pull rarana73/cdsw-7-horovod-gpu:1
• Content:
• CDSW Base Image v7
• CUDA_VERSION 9.0.176
• NCCL_VERSION 2.4.2
• CUDNN_VERSION 7.4.2.24
• Tensorflow 1.12.0
• Open MPI 4.0.0
44. © Cloudera, Inc. All rights reserved. 44
INITIALIZING OPEN-MPI PEERS
• Normally, OpenMPI peers are initialized by directly spawning Python/OpenMPI processes via the mpirun command.
• Similarly, for CDSW-Horovod processes this can be done by invoking the mpirun command from Python (see the sketch below).
• When doing so, however, make sure the train-worker containers are still alive.
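A minimal sketch of invoking mpirun from the master session (the worker IPs, slot counts, and the train_horovod.py script name are placeholders):

import subprocess

# IP addresses collected from the cdsw.launch_workers() containers (see master.py).
worker_ips = ["10.0.0.11", "10.0.0.12"]
hosts = ",".join("%s:1" % ip for ip in worker_ips)   # one slot (e.g. one GPU) per worker

subprocess.run([
    "mpirun", "-np", str(len(worker_ips)),
    "-H", hosts,
    "-bind-to", "none", "-map-by", "slot",
    "-x", "NCCL_DEBUG=INFO", "-x", "PATH",
    "python", "train_horovod.py",                    # hypothetical Horovod training script
], check=True)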
45. © Cloudera, Inc. All rights reserved. 45
Horovod in Action
• Applying Horovod to a WideResNet model trained on the Fashion MNIST dataset (the core Horovod changes are sketched below)
• 2 NVIDIA Quadro P600 GPUs
• CUDA cores: 384 / 2 GB GDDR5 each
horovodrun -np 1 python fashion_mnist/fashion_mnist_solution.py --log-dir log/np-1
horovodrun -np 2 python fashion_mnist/fashion_mnist_solution.py --log-dir log/np-2
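A minimal sketch of the core Horovod changes to a Keras training script (a simple dense model as a stand-in, not the actual fashion_mnist_solution.py from the slide): initialize Horovod, pin one GPU per process, scale the learning rate, wrap the optimizer, and broadcast the initial weights from rank 0.

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU

# Pin each process to a single local GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
tf.keras.backend.set_session(tf.Session(config=config))

(x_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train.reshape(-1, 28 * 28).astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(28 * 28,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Scale the learning rate by the number of workers and wrap the optimizer so
# gradients are averaged with ring-allreduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

# Make sure all workers start from the same initial weights.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(x_train, y_train, batch_size=128, epochs=5,
          callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)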
47. © Cloudera, Inc. All rights reserved. 47
Around the corner
• Spark on K8s & GPU support
• Horovod in Spark
• TensorFlow 2.0 & Distribution Strategy
• Apache Submarine: https://hadoop.apache.org/submarine/
• …