Slides for the talk at the O'Reilly AI Conference San Francisco 2017 - https://conferences.oreilly.com/artificial-intelligence/ai-ca/public/schedule/detail/59613
Using Deep Learning Toolkits with Kubernetes clusters
1. Using Deep Learning Toolkits with Kubernetes clusters
Wee Hyong, Joy Qiao
Cloud AI, Microsoft
Credits: Jin Li, Sanjeev Mehrotra, Hongzhi Li, Lachie Evenson, William Buchwalter, Mathew Salvaris, Ilia Karmanov, Taifeng Wang, CNTK Team
O'Reilly Artificial Intelligence Conference 2017
Sept 17–20, San Francisco, CA
2. Tips & tricks learned from using Deep Learning Toolkits on Kubernetes
1. Getting the K8S cluster to run in the cloud using acs-engine
2. Scaling your Deployments
3. Distributed Deep Learning
4. Best Practices for High-Performance Models
5. Distributed Training Performance on Kubernetes
3. Deep Learning Common Patterns
CNN: Convolutional Neural Network
RNN: Recurrent Neural Network
4. How long does it take to train DNN models?
Models: ResNet (ImageNet), GoogleNet (ImageNet), 2000h Speech LSTM Model, Neural Translational Model
Training times: 130 hours, 570 hours, 1,100 hours, 2,000 hours
Hardware: K40 x 8, K40, K40, K40
ImageNet: 1M images, 1K classes
5. Getting Started with Deep Learning Toolkits Environment
• Desktop / Laptop
• Virtual Machine
• Devices / Edge
• Cloud
• And more…
8. Simplified View of Kubernetes Concepts
[Diagram: a Client (kubectl get nodes) talks to the Master; the Master manages three Nodes; each Node runs a kubelet and hosts two Pods, each with a Container.]
Where can you run K8S? Servers / Virtual Machines, Cloud
10. Getting Started with Kubernetes and Deep Learning Toolkit
3 Easy Steps
1. Deploy the Kubernetes Cluster
2. Use an existing image or prep a new Docker Image
3. Choose storage to persist the data (logs, checkpoint files, model, etc.)
13. Checking that the Nvidia drivers are used
Running nvidia-smi to display the GPU info
[Screenshot: output logs from nvidia-smi]
14. Dockerfile
1 Specify base image from NVidia
2 Define entry file that is run on startup
3 Install relevant tools
4 Install CNTK 2.1
5 Specify entry point and port to be exposed
18. Defining a Training Job
1 Run this as a K8S Job
2 Secret for Azure Storage
3 Specify the image to use
4 Run the download and train script
5 Mount a folder on Azure File
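The five callouts above can be sketched as a Job manifest built in Python and printed as JSON, which `kubectl` accepts alongside YAML. This is a minimal sketch, not the manifest from the talk: the job name, image, script name, secret name, share name, and mount path are all hypothetical placeholders.

```python
import json


def training_job_manifest(name, image, command, secret_name, share_name, mount_path):
    """Build a Kubernetes Job manifest as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",  # 1. run this as a K8S Job
        "metadata": {"name": name},
        "spec": {
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": name,
                        "image": image,      # 3. the image to use
                        "command": command,  # 4. the download and train script
                        "volumeMounts": [
                            {"name": "azurefile", "mountPath": mount_path},
                        ],
                    }],
                    "volumes": [{
                        "name": "azurefile",
                        "azureFile": {                   # 5. folder on Azure File
                            "secretName": secret_name,   # 2. secret for Azure Storage
                            "shareName": share_name,
                            "readOnly": False,
                        },
                    }],
                },
            },
        },
    }


# All values below are hypothetical placeholders.
manifest = training_job_manifest(
    name="cntk-train",
    image="example.azurecr.io/cntk-train:latest",
    command=["/bin/bash", "-c", "./download_and_train.sh"],
    secret_name="azure-storage-secret",
    share_name="training",
    mount_path="/mnt/azure",
)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and submitted with `kubectl create -f`.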
19. Creating a Deployment for Serving
1 Specify this as a K8S Deployment
2 Specify the image to use
3 Mount a folder on Azure File
26. Node-Level AutoScaling
Sequence over time:
1. The user creates a pod (kubectl create pod); the request goes to the kube API server, backed by etcd.
2. The kube scheduler (kube master) asks: anything to schedule? Pending pods = { X }, Nodes = {A, B, C}, Nodes with free capacity = { }, so Pod X stays pending.
3. kubernetes-acs-autoscaler asks: do we have pending pods? Pending pods = { X }.
4. The autoscaler gets the current state of all agents from Azure Container Services and sets the size (Set size = 20).
5. Azure Container Services creates a VM: a new agent (Azure VM).
6. The kubelet on the new VM registers: "I am Node D".
7. The kube scheduler (kube master) asks again: anything to schedule? Pending pods = { X }, Nodes = {A, B, C}, Nodes with free capacity = { D }. Put Pod X on Node D.
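The decision the autoscaler makes in this flow can be sketched as a small function. This is an illustrative sketch only, not the logic of kubernetes-acs-autoscaler itself: the function name, the `max_agents` cap, and the `pods_per_node` parameter are assumptions introduced here.

```python
def desired_agent_count(pending_pods, free_nodes, current_agents,
                        max_agents, pods_per_node=1):
    """Return the agent pool size a node-level autoscaler would request.

    Hypothetical helper: scale up only when pods are pending and no node
    has free capacity for them.
    """
    if not pending_pods or free_nodes:
        # Nothing is waiting, or the scheduler can still place pods.
        return current_agents
    needed = -(-len(pending_pods) // pods_per_node)  # ceiling division
    return min(current_agents + needed, max_agents)


# Situation from the slide: Pod X is pending and nodes A, B, C are full.
new_size = desired_agent_count(
    pending_pods={"X"}, free_nodes=set(), current_agents=3, max_agents=20)
print(new_size)
```

Once the new agent registers as Node D, the same call with `free_nodes={"D"}` leaves the pool size unchanged and the scheduler places Pod X.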
27. Sample YAML for a TensorFlow worker pod with GPUs
1. GPU setup scripts
source: https://github.com/Microsoft/DLWorkspace/blob/master/src/ClusterBootstrap/scripts/prepare_acs.sh
2. Check to make sure your K8s has your GPU resources data.
kubectl describe nodes
28. Scheduling GPUs
• Each node needs to be pre-installed with Nvidia drivers
• Resource name to use Nvidia GPUs: alpha.kubernetes.io/nvidia-gpu
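As a rough illustration of how that resource name is used, the sketch below builds a pod manifest that requests GPUs through the container's resource limits. The pod name, image, and command are hypothetical placeholders, and a real pod on this Kubernetes version may also need the Nvidia driver libraries mounted from the host, which is omitted here.

```python
import json

GPU_RESOURCE = "alpha.kubernetes.io/nvidia-gpu"  # resource name from the slide


def gpu_pod_manifest(name, image, command, num_gpus):
    """Build a Pod manifest that asks the scheduler for `num_gpus` GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": name,
                "image": image,
                "command": command,
                # The scheduler only places the pod on a node that
                # advertises at least this many GPUs.
                "resources": {"limits": {GPU_RESOURCE: num_gpus}},
            }],
        },
    }


# Hypothetical names, for illustration only.
pod = gpu_pod_manifest(
    name="tf-worker-0",
    image="example.azurecr.io/tf-worker:latest",
    command=["python", "train.py"],
    num_gpus=4,
)
print(json.dumps(pod, indent=2))
```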
31. Distributed Training Architecture
Data Parallelism
1. Parallel training on different machines
2. Update the parameter server synchronously/asynchronously
3. Refresh the local model with new parameters, go to 1 and repeat
Model Parallelism
1. The global model is partitioned into K sub-models without overlap.
2. The sub-models are distributed over K local workers and serve as their local models.
3. In each mini-batch, the local workers compute the gradients of the local weights by back propagation.
Credits: Taifeng Wang, DMTK team
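The model-parallel steps can be illustrated with a toy linear model whose weight vector is split across K "workers". This is a sketch under simplifying assumptions (workers are simulated in one process, the model is a single linear layer, and the weight count divides evenly by K); it is not DMTK code, and all names are made up for the example.

```python
def partition(values, k):
    """Split a sequence into k contiguous, non-overlapping parts."""
    size = len(values) // k
    return [list(values[i * size:(i + 1) * size]) for i in range(k)]


def train_step(sub_models, x, y, lr):
    """One mini-batch (here a single sample) of model-parallel SGD."""
    x_parts = partition(x, len(sub_models))
    # Forward: each worker computes the partial output of its sub-model.
    partial = [sum(w * xi for w, xi in zip(ws, xs))
               for ws, xs in zip(sub_models, x_parts)]
    error = sum(partial) - y  # partial outputs are combined across workers
    # Backward: each worker updates only its own local weights.
    for ws, xs in zip(sub_models, x_parts):
        for j, xi in enumerate(xs):
            ws[j] -= lr * 2.0 * error * xi
    return error ** 2


true_weights = [1.0, -2.0, 3.0, 0.5]
samples = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1), (1, 1, 1, 1)]
targets = [sum(w * xi for w, xi in zip(true_weights, x)) for x in samples]

sub_models = partition([0.0] * 4, 2)  # K = 2 workers, two weights each
for epoch in range(200):
    for x, y in zip(samples, targets):
        train_step(sub_models, x, y, lr=0.1)

learned = [w for ws in sub_models for w in ws]
print([round(w, 4) for w in learned])
```

Only the scalar partial outputs and the error cross worker boundaries in this sketch; the weights themselves never leave their worker.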
32. TensorFlow Training on Multi-GPU single node
• Places an individual model replica on each GPU.
• Splits the batch across the GPUs.
• Updates model parameters synchronously by waiting for all GPUs to finish processing a batch of data.
Each tower computes the gradients for a portion of the batch and the gradients are combined and averaged across the multiple towers in order to provide a single update of the Variables stored on the CPU.
Source: https://www.tensorflow.org/tutorials/deep_cnn#launching_and_training_the_model_on_multiple_gpu_cards
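The tower scheme can be sketched without TensorFlow. In the toy example below, each "tower" computes the gradient for its portion of the batch, and the averaged gradient produces a single update of the shared variable. It assumes a one-parameter model y = w * x and equal-sized portions; the function names are illustrative, not TensorFlow APIs.

```python
def split_batch(batch, num_towers):
    """Split a batch into equal contiguous portions, one per tower."""
    size = len(batch) // num_towers
    return [batch[i * size:(i + 1) * size] for i in range(num_towers)]


def tower_gradient(w, portion):
    """Mean-squared-error gradient of y = w * x over one tower's portion."""
    return sum(2.0 * (w * x - y) * x for x, y in portion) / len(portion)


def synchronous_step(w, batch, num_towers, lr):
    """Wait for every tower, average the gradients, update the variable once."""
    grads = [tower_gradient(w, p) for p in split_batch(batch, num_towers)]
    return w - lr * sum(grads) / len(grads)


batch = [(float(x), 3.0 * x) for x in range(1, 9)]  # true weight is 3.0

# With equal portions, the averaged tower gradient matches the full-batch one.
averaged = sum(tower_gradient(0.5, p) for p in split_batch(batch, 4)) / 4
full = tower_gradient(0.5, batch)

w = 0.0
for _ in range(100):
    w = synchronous_step(w, batch, num_towers=4, lr=0.01)
print(round(w, 4))
```

Because the averaged gradient equals the full-batch gradient here, splitting the batch changes where the work runs but not the update that is applied.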
33. Distributed TensorFlow Architecture
For Variable Distribution &
Gradient Aggregation
• Parameter_server
Source: https://www.tensorflow.org/performance/performance_models
35. Best Practices
• Input Pipeline
o Do not use feed_dict; it is the slowest way of reading data
o Use the Dataset API
o Use the native parallelism in TensorFlow
➢ Parallelize I/O Reads
➢ Parallelize Image Processing
➢ Parallelize CPU-to-GPU Data Transfer
➢ Software Pipelining
Source: https://www.tensorflow.org/performance/performance_models
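Software pipelining means the next inputs are being read and preprocessed while the current ones are consumed. The sketch below shows the idea with a background thread and a bounded queue from the Python standard library. It is not the TensorFlow Dataset API, which is what the slide recommends in practice; `read_records` and `preprocess` are hypothetical stand-ins for I/O and image processing.

```python
import queue
import threading


def prefetch(source, buffer_size=4):
    """Yield items from `source` while a background thread reads ahead."""
    buf = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for item in source:
            buf.put(item)  # blocks when the buffer is full
        buf.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is done:
            return
        yield item


def read_records(n):
    """Stand-in for reading records from storage."""
    for i in range(n):
        yield i


def preprocess(record):
    """Stand-in for image decoding and augmentation."""
    return record * 2


# The consumer (the training step) overlaps with reading and preprocessing.
results = list(prefetch(map(preprocess, read_records(10))))
print(results)
```

The bounded buffer keeps the producer from running arbitrarily far ahead of the consumer, which caps memory use.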
36. Best Practices
• Preprocessing on the CPU
• Use large files, e.g. large TFRecord files.
o E.g. the training files in TensorFlow’s official benchmark are 140MB each, in TFRecord format
• Place shared parameters on CPU vs GPU
• NCCL vs TensorFlow’s implicit copy mechanism
o NCCL is an NVIDIA® library that can efficiently broadcast and aggregate data across different GPUs, with optimized utilization of the underlying hardware topology
37. Best Practices
• Build the model with both NHWC and NCHW
o NCHW is the optimal format when training with GPUs.
o A flexible model can be trained on GPUs using NCHW, with inference done on CPU using NHWC with the weights obtained from training.
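The two layouts differ only in axis order: NHWC is (batch, height, width, channels) and NCHW is (batch, channels, height, width). The sketch below converts between them using nested lists so that it runs without any framework; in TensorFlow the same reordering is usually done with a transpose. The helper names are made up for this example.

```python
def nhwc_to_nchw(t):
    """Reorder a nested-list tensor from [N][H][W][C] to [N][C][H][W]."""
    n, h, w, c = len(t), len(t[0]), len(t[0][0]), len(t[0][0][0])
    return [[[[t[i][y][x][ch] for x in range(w)]
              for y in range(h)]
             for ch in range(c)]
            for i in range(n)]


def nchw_to_nhwc(t):
    """Reorder a nested-list tensor from [N][C][H][W] to [N][H][W][C]."""
    n, c, h, w = len(t), len(t[0]), len(t[0][0]), len(t[0][0][0])
    return [[[[t[i][ch][y][x] for ch in range(c)]
              for x in range(w)]
             for y in range(h)]
            for i in range(n)]


# One 2x2 image with 3 channels; each value encodes its (y, x, channel).
image_nhwc = [[[[100 * y + 10 * x + ch for ch in range(3)]
                for x in range(2)]
               for y in range(2)]]
image_nchw = nhwc_to_nchw(image_nhwc)
print(image_nchw[0][2])  # the channel-2 plane of the image
```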
39. Training Environment on Azure
• VM SKU
o NC24r for workers
▪ 4x NVIDIA® Tesla® K80 GPU
▪ 24 CPU cores, 224 GB RAM
o D14_v2 for parameter server
▪ 16 CPU cores, 112 GB RAM
• Kubernetes: 1.6.6 (created using ACS-Engine)
• GPU: NVIDIA® Tesla® K80
• Benchmark scripts: https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks
• OS: Ubuntu 16.04 LTS
• TensorFlow: 1.2
• CUDA / cuDNN: 8.0 / 6.0
• Disk: Local SSD
• DataSet: ImageNet (real data, not synthetic)
40. Training on Single node, Multi-GPU
• Linear scalability
• GPUs are fully saturated
• variable_update mode: parameter_server
• local_parameter_device: cpu
[Chart: ResNet-50 with batch size 64; y-axis: images/sec; x-axis: No. of GPUs; bar values: 48, 96, 190]
41. Training on Single node, Multi-GPU
For Tesla K80:
• If the GPUs can use NVIDIA GPUDirect Peer to Peer, place the variables equally across the GPUs used for training.
• If the GPUs cannot use GPUDirect, place the variables on the CPU.
(source: https://www.tensorflow.org/performance/performance_guide)
[Chart: ResNet-50 with batch size 64; y-axis: images/sec; x-axis: No. of GPUs; Series1: 96, 190; Series2: 95, 182]
42. Training on Single node, Multi-GPU
• Larger batch size helps with training performance
• Batch size is limited by GPU memory (e.g. 12GB RAM for NVIDIA® Tesla® K80)
• variable_update mode: parameter_server
[Chart: VGG16; y-axis: images/sec; x-axis: batch size; bar values: 135, 124]
[Chart: GoogLeNet; y-axis: images/sec; x-axis: batch size; bar values: 440, 413]
43. Distributed Training
Settings:
• Topology: 1 ps and 2 workers
• Async variables update
• Using cpu as the local_parameter_device
• Each ps/worker pod has its own dedicated host
• variable_update mode: parameter_server
• Network protocol: gRPC
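A topology like this one is usually launched by starting the same benchmark script once per task, varying only the job name and task index. The sketch below builds those command lines for tf_cnn_benchmarks. The host names and port are hypothetical, and the flag set is a plausible subset rather than the exact invocation used for these measurements.

```python
def benchmark_commands(ps_hosts, worker_hosts, model, batch_size, num_gpus):
    """Return one tf_cnn_benchmarks command line per ps/worker task."""
    common = [
        "python", "tf_cnn_benchmarks.py",
        "--model=" + model,
        "--batch_size=%d" % batch_size,
        "--num_gpus=%d" % num_gpus,
        "--variable_update=parameter_server",
        "--local_parameter_device=cpu",
        "--ps_hosts=" + ",".join(ps_hosts),
        "--worker_hosts=" + ",".join(worker_hosts),
    ]
    commands = {}
    for job, hosts in (("ps", ps_hosts), ("worker", worker_hosts)):
        for index in range(len(hosts)):
            commands[(job, index)] = common + [
                "--job_name=" + job,
                "--task_index=%d" % index,
            ]
    return commands


# Hypothetical host names: 1 ps and 2 workers, each on its own host.
commands = benchmark_commands(
    ps_hosts=["ps-0:2222"],
    worker_hosts=["worker-0:2222", "worker-1:2222"],
    model="resnet50",
    batch_size=64,
    num_gpus=4,
)
for task, argv in sorted(commands.items()):
    print(task, " ".join(argv))
```

On Kubernetes, each of these command lines would typically become the container command of one pod, with the host names resolved through per-task services.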
Single-node Training with 4 GPUs vs Distributed Training with 2 workers with 8 GPUs in total
[Chart: y-axis: images/sec; Series1: 440, 107.6, 190, 73, 135; Series2: 818, 172.6, 296, 93, 84.5]
Observations on distributed training:
• Linear scalability largely depends on the model and network bandwidth.
• GPUs are not fully saturated on the worker nodes, likely due to a network bottleneck.
• VGG16 performed worse than in single-node training; its GPUs were “starved” most of the time.
• Running directly on host VMs rather than in K8s pods did not make a huge difference in this particular test environment.
44. Distributed Training
Distributed training scalability depends on the compute/bandwidth ratio of the model
[Chart: Training Speedup on 2 nodes vs single-node; y-axis: speedup; bar values: 1.86, 1.60, 1.56, 1.27, 0.63]
Source: https://arxiv.org/abs/1704.04560
The model with a higher ratio scales better.
GoogLeNet scales pretty well.
VGG16 is suboptimal, due to its large size.
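The speedup figures follow from the throughput numbers on the previous slide: each one is the distributed images/sec divided by the single-node images/sec. The short check below assumes that the first series on that chart is single-node training and the second is distributed training, which is the reading that reproduces the plotted values.

```python
# Throughput in images/sec, in the order the values appear on the chart.
single_node = [440, 107.6, 190, 73, 135]    # assumed: 4 GPUs on one node
distributed = [818, 172.6, 296, 93, 84.5]   # assumed: 2 workers, 8 GPUs in total

speedups = [round(d / s, 2) for s, d in zip(single_node, distributed)]
# Doubling the GPUs would ideally double the throughput (speedup of 2.0).
efficiency = [round(s / 2.0, 2) for s in speedups]

print(speedups)    # [1.86, 1.6, 1.56, 1.27, 0.63]
print(efficiency)
```

A speedup below 1.0, as in the last entry, means the two-node run was slower than a single node.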
45. Distributed Training
• Sync vs Async variable updates
• parameter_server vs distributed_replicated mode
[Chart: GoogLeNet with 128 batch size; y-axis: images/sec; bar values: 814, 800, 764]
46. Distributed Training
Observations on different cluster topologies in this test environment
• Adding more ps servers does not seem to make much difference.
• Having ps servers running on the same pods as the workers seems to give worse performance
o Don’t forget to “export CUDA_VISIBLE_DEVICES=” before starting the ps job session if running the ps server on the same pods with GPUs
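When the ps process is started from a launcher script rather than an interactive shell, the same effect as the export can be had by passing a modified environment to the child process. This is a small sketch; `ps_environment` is a hypothetical helper, and the launch command in the comment is only indicative.

```python
import os


def ps_environment(base_env=None):
    """Copy an environment and hide all GPUs from the process that uses it."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ""  # empty value: CUDA sees no devices
    return env


env = ps_environment({"PATH": "/usr/bin"})
print(env)

# Indicative usage (not executed here):
#   subprocess.Popen(["python", "tf_cnn_benchmarks.py", "--job_name=ps", ...],
#                    env=ps_environment())
```

The worker processes on the same pod keep their original environment, so they still see the GPUs.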
[Chart: ResNet-50 with 64 batch size; variable_update mode: parameter_server; y-axis: images/sec; bar values: 296, 296, 274]
47. Demo
Deep Learning Workspace from Microsoft Research
Powered by Kubernetes
• Alpha release available at https://github.com/microsoft/DLWorkspace/
Documentation at https://microsoft.github.io/DLWorkspace/
• Note that DL Workspace is NOT a MS product/service. It’s an open source solution, and we welcome contributions!
48. Summary
Tips & tricks learned from using Deep Learning Toolkits on Kubernetes
1. Getting the K8S cluster to run in the cloud using acs-engine
2. Scaling your Deployments
3. Distributed Deep Learning
4. Best Practices for High-Performance Models
5. Distributed Training Performance on Kubernetes
49. Resources
• Getting Started with Kubernetes on Azure
https://github.com/Azure/acs-engine
https://docs.microsoft.com/en-us/azure/container-service/kubernetes/
• Running Distributed TensorFlow on Kubernetes using ACS-Engine
https://github.com/joyq-github/TensorFlowonK8s
• Using CNTK and Kubernetes
https://aka.ms/cntkkubernetes
• Distributed CNTK and TensorFlow resources
• https://docs.microsoft.com/en-us/cognitive-toolkit/multiple-gpus-and-machines
• https://www.tensorflow.org/performance/
• https://arxiv.org/abs/1704.04560
• Deep Learning Workspace powered by Kubernetes
https://github.com/microsoft/DLWorkspace/
https://microsoft.github.io/DLWorkspace/