Using Deep Learning Toolkits with Kubernetes clusters

Wee Hyong, Joy Qiao
Cloud AI, Microsoft
Credits: Jin Li, Sanjeev Mehrotra, Hongzhi Li, Lachie Evenson, William Buchwalter,
Mathew Salvaris, Ilia Karmanov, Taifeng Wang, CNTK Team
O'Reilly Artificial Intelligence Conference 2017
Sept 17–20, San Francisco, CA

Tips & tricks learned from using Deep Learning Toolkits on Kubernetes
1. Getting the K8S cluster to run in the cloud using acs-engine
2. Scaling your Deployments
3. Distributed Deep Learning
4. Best Practices for High-Performance Models
5. Distributed Training Performance on Kubernetes

Deep Learning Common Patterns
CNN: Convolutional Neural Network
RNN: Recurrent Neural Network

How long does it take to train DNN models?
[Chart: training time in hours. Models: ResNet (ImageNet), GoogleNet (ImageNet), 2000h Speech LSTM Model, Neural Translational Model. Bar labels: 130 hours, 570 hours, 1,100 hours, 2,000 hours. Hardware labels: K40 x 8, K40, K40, K40]
Imagenet: 1M Images, 1K Classes

Getting Started with Deep Learning
Toolkits Environment
Desktop / Laptop
Virtual Machine Devices / Edge
Cloud
And more….

Infrastructure that allows you to
do lots of experimentation

Infrastructure that enables you to
scale up/down as needed

Simplified View of Kubernetes Concepts
[Diagram: a Client (kubectl get nodes) talks to the Master. There are three Nodes; each Node runs a kubelet and hosts Pods, and each Pod holds a Container.]
Where can you run K8S? Servers / Virtual Machines, Cloud

1. CNTK + K8S
Using acs-engine to set up K8S

Getting Started with Kubernetes and Deep Learning Toolkit: 3 Easy Steps
1. Deploy the Kubernetes Cluster
2. Use an existing image or prep a new Docker Image
3. Choose storage to persist the data (logs, checkpoint files, model, etc.)

Using acs-engine to set up K8S on Azure
[Diagram elements: K8s cluster definition file; acs-engine; Azure Resource Manager (ARM) templates; ssh keys; kubeconfig file; Deploy to Azure]
Resources: https://github.com/Azure/acs-engine

Deployment to Azure
Virtual network: K8s-vnet-27473156
VM1: k8s-master-27473156-0
VM2, VM3: k8s-agentpool1-27473156-1, k8s-agentpool2-27473156-0
VM sizes: NC6 (GPU Enabled), DS_v2 (Non-GPU)
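A cluster like the one pictured might be described to acs-engine with a definition file along the following lines. This is only a sketch: the field names follow the acs-engine Kubernetes examples as best recalled and should be checked against the repository, the pairing of agentpool1 with NC6 and agentpool2 with a DS_v2 size is an assumption, the exact sizes and counts are assumptions, and the DNS prefix, user name, and key are placeholders. Service principal settings are left out.

```python
import json


def build_cluster_definition(dns_prefix, admin_user, ssh_public_key):
    """Return a dict shaped like an acs-engine Kubernetes cluster definition.

    Field names are assumptions to verify against the acs-engine examples.
    """
    return {
        "apiVersion": "vlabs",
        "properties": {
            "orchestratorProfile": {"orchestratorType": "Kubernetes"},
            "masterProfile": {
                "count": 1,
                "dnsPrefix": dns_prefix,
                "vmSize": "Standard_D2_v2",  # assumed master size
            },
            "agentPoolProfiles": [
                # GPU pool (assumed to be the NC6 pool from the diagram)
                {"name": "agentpool1", "count": 2, "vmSize": "Standard_NC6",
                 "availabilityProfile": "AvailabilitySet"},
                # Non-GPU pool (assumed DS_v2-family size)
                {"name": "agentpool2", "count": 1, "vmSize": "Standard_DS2_v2",
                 "availabilityProfile": "AvailabilitySet"},
            ],
            "linuxProfile": {
                "adminUsername": admin_user,
                "ssh": {"publicKeys": [{"keyData": ssh_public_key}]},
            },
        },
    }


if __name__ == "__main__":
    definition = build_cluster_definition("demo-k8s", "azureuser", "<ssh-public-key>")
    print(json.dumps(definition, indent=2))
```

Writing the printed JSON to a file gives a starting point that can then be completed and passed to acs-engine.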

Checking that the Nvidia drivers are used
Running nvidia-smi to display the GPU info
Output logs from nvidia-smi

dockerfile
1 Specify base image from NVIDIA
2 Define entry file that is run on startup
3 Install relevant tools
4 Install CNTK 2.1
5 Specify entry point and port to be exposed

docker build -t <image-name> -f <path-to-dockerfile>/dockerfile <src-folder>
Example: https://hub.docker.com/r/weehyong/cntkresnetgpu/

Demo
Prep Kubernetes Cluster using
acs-engine

Defining a Training Job
1 Run this as a K8S Job
2 Secret for Azure Storage
3 Specify the image to use
4 Run the download and train script
5 Mount a folder on Azure File
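The five parts of the training job can be sketched as a manifest built in Python and dumped as JSON, which kubectl accepts alongside YAML. This is an illustration under assumptions, not the manifest from the talk: the job name, script name, secret name, share name, and mount path are hypothetical, and the image is the example image linked earlier in the deck.

```python
import json


def build_training_job(name, image, command, secret_name, share_name, mount_path):
    """Build a Kubernetes Job manifest that mounts an Azure File share."""
    volume_name = "azurefile"
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",                      # 1. run this as a K8S Job
        "metadata": {"name": name},
        "spec": {
            "template": {
                "metadata": {"name": name},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,     # 3. the image to use
                        "command": command,  # 4. download and train script
                        "volumeMounts": [
                            {"name": volume_name, "mountPath": mount_path},
                        ],
                    }],
                    "volumes": [{
                        "name": volume_name,
                        "azureFile": {      # 5. folder on Azure File
                            "secretName": secret_name,  # 2. storage secret
                            "shareName": share_name,
                            "readOnly": False,
                        },
                    }],
                    "restartPolicy": "OnFailure",
                },
            },
        },
    }


if __name__ == "__main__":
    job = build_training_job(
        name="cntk-training",                       # hypothetical
        image="weehyong/cntkresnetgpu",
        command=["/bin/sh", "-c", "./download_and_train.sh"],  # hypothetical
        secret_name="azure-storage-secret",         # hypothetical
        share_name="training-data",                 # hypothetical
        mount_path="/mnt/azure",                    # hypothetical
    )
    print(json.dumps(job, indent=2))
```

Saving the output as job.json and running kubectl create -f job.json would submit it, assuming the secret already exists in the cluster.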

Creating a Deployment for Serving
1 Specify this as a K8S Deployment
2 Specify the image to use
3 Mount a folder on Azure File

Have the GPU resources when
you need them

Auto-Scaling Deployment
1. To handle more load for serving, I want to scale my deployment
2. Having more pods to run different training jobs

Auto-Scaling Deployment
Pod-Level: Horizontal Pod Autoscaling
kubectl autoscale
Node-Level: Autoscaling
aka.ms/k8sautoscaleazure
Walkthrough by @wbuchwalter
Based on OpenAI kubernetes-ec2-autoscaler

Horizontal Pod AutoScaling
[Diagram: the Horizontal Pod Autoscaler reads pod CPU utilization (CPU% = 70%) and sets the Scale of the RC / Deployment, which runs Pods (each with a Container) on a Node.]
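The scaling decision can be sketched with the rule described in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric). The function below is a simplified illustration; the real controller also applies a tolerance band and stabilization delays, which are omitted here, and the min/max defaults are arbitrary.

```python
import math


def desired_replicas(current_replicas, current_cpu_percent, target_cpu_percent,
                     min_replicas=1, max_replicas=10):
    """Return the replica count an HPA-style rule would ask for."""
    if target_cpu_percent <= 0:
        raise ValueError("target_cpu_percent must be positive")
    # Scale proportionally to how far the observed metric is from the target.
    raw = math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Two pods averaging 140% of requested CPU against a 70% target.
    print(desired_replicas(2, 140, 70))
```

With a 70% target, pods running at twice the target double in number, and pods running at half the target are halved.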

What if the nodes are maxed out?

Node-Level AutoScaling
[Diagram: timeline of a scale-up]
1. User creates pod (kubectl create pod); the request goes to the kube API Server, backed by ETCD.
2. kube scheduler (kube master) asks: anything to schedule? Pending pods = { X }, Nodes = {A, B, C}, Nodes with free capacity = { }.
3. kubernetes-acs-autoscaler asks: do we have pending pods? Pending pods = { X }. It gets the current state of all agents and tells Azure Container Services: set size = 20.
4. Azure creates a VM as a new agent; its kubelet registers: I am Node D.
5. kube scheduler (kube master) asks again: anything to schedule? Pending pods = { X }, Nodes = {A, B, C}, Nodes with free capacity = { D }. Put Pod X on Node D.
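The autoscaler's decision in this flow can be sketched as a small planning function: place pending pods on existing free capacity first, and count how many new agents are needed for the rest. This is a toy model under assumptions (a single GPU-count resource, first-fit placement, hypothetical function and parameter names), not the logic of the actual autoscaler.

```python
def plan_pool_size(pending, free_capacity, new_node_capacity, max_size):
    """Return the agent pool size needed to place the pending pods.

    pending: list of GPU counts requested by each pending pod.
    free_capacity: dict of node name -> free GPUs on existing nodes.
    new_node_capacity: GPUs available on each newly created node.
    max_size: upper bound on the pool size.
    """
    free = dict(free_capacity)   # copy so the caller's view is untouched
    new_nodes = []               # free GPUs left on each planned node
    for request in sorted(pending, reverse=True):
        # Prefer an existing node that still has room.
        target = next((n for n, cap in free.items() if cap >= request), None)
        if target is not None:
            free[target] -= request
            continue
        # Otherwise try a node we already plan to add.
        for i, cap in enumerate(new_nodes):
            if cap >= request:
                new_nodes[i] -= request
                break
        else:
            # Plan one more node, unless the pod could never fit on one.
            if request <= new_node_capacity:
                new_nodes.append(new_node_capacity - request)
    return min(len(free) + len(new_nodes), max_size)


if __name__ == "__main__":
    # The scenario in the diagram: pod X is pending, nodes A, B, C are full.
    print(plan_pool_size([1], {"A": 0, "B": 0, "C": 0}, 1, 20))
```

In the diagram's scenario the plan grows the pool from three nodes to four, which corresponds to Node D joining the cluster.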

Sample YAML for a TensorFlow worker pod with GPUs
1. GPU setup scripts
source: https://github.com/Microsoft/DLWorkspace/blob/master/src/ClusterBootstrap/scripts/prepare_acs.sh
2. Check to make sure your K8s has your GPU resources data.
kubectl describe nodes

Scheduling GPUs
Each node needs to be pre-installed with Nvidia drivers
Resource name to use Nvidia GPUs: alpha.kubernetes.io/nvidia-gpu
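A pod requests GPUs by listing that resource name under its container's resource limits. The sketch below builds such a pod manifest in Python; the pod name and command are hypothetical, and the image is left as a placeholder. Depending on how the cluster was set up, the driver libraries may also need to be mounted into the container, which is omitted here.

```python
import json

GPU_RESOURCE = "alpha.kubernetes.io/nvidia-gpu"  # resource name from the slide


def build_gpu_pod(name, image, command, gpu_count):
    """Build a Pod manifest whose single container requests gpu_count GPUs."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": name,
                "image": image,
                "command": command,
                # The scheduler only places the pod on a node that
                # advertises at least this many GPUs.
                "resources": {"limits": {GPU_RESOURCE: gpu_count}},
            }],
            "restartPolicy": "Never",
        },
    }


if __name__ == "__main__":
    pod = build_gpu_pod("tf-worker-0", "<image-name>", ["python", "train.py"], 4)
    print(json.dumps(pod, indent=2))
```

Running kubectl describe nodes, as noted above, is a quick way to confirm that the nodes actually advertise this resource before submitting the pod.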

Demo
Node-level Scaling for Deep
Learning Jobs on k8s

Distributed Training Architecture
Data Parallelism
1. Parallel training on different machines
2. Update the parameter server synchronously/asynchronously
3. Refresh the local model with new parameters, go to 1 and repeat
Model Parallelism
1. The global model is partitioned into K sub-models without overlap.
2. The sub-models are distributed over K local workers and serve as their local models.
3. In each mini-batch, the local workers compute the gradients of the local weights by back propagation.
Credits: Taifeng Wang, DMTK team
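The three data-parallelism steps can be sketched with a toy one-parameter model. This is an illustration only: the workers run one after another in a single process rather than on separate machines, the update is synchronous, and the model (y = w * x), data, and learning rate are made up for the example.

```python
def shard_gradient(w, shard):
    """Gradient of the mean squared error of y = w * x over one data shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def train_data_parallel(shards, w=0.0, lr=0.01, steps=50):
    """Synchronous data-parallel SGD with a single shared parameter."""
    for _ in range(steps):
        # 1. Each worker computes a gradient on its own shard of the data.
        grads = [shard_gradient(w, shard) for shard in shards]
        # 2. The parameter server applies one update from the averaged gradients.
        w -= lr * sum(grads) / len(grads)
        # 3. Workers pick up the new w on the next iteration and repeat.
    return w


# Toy data generated from y = 3x, split across two workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[:4], data[4:]]

if __name__ == "__main__":
    print(train_data_parallel(shards))
```

An asynchronous variant would apply each worker's gradient as soon as it arrives instead of waiting to average them.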

TensorFlow Training on Multi-GPU single node
• Places an individual model replica on each GPU.
• Splits the batch across the GPUs.
• Updates model parameters synchronously by waiting for all GPUs to finish processing a batch of data.
Each tower computes the gradients for a portion of the batch, and the gradients are combined and averaged across the multiple towers in order to provide a single update of the Variables stored on the CPU.
Source:
https://www.tensorflow.org/tutorials/deep_cnn#launching_and_training_the_model_on_multiple_gpu_cards
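The batch split and the per-variable averaging across towers can be sketched without TensorFlow. The helper names and the dict-of-floats gradient representation are assumptions made for the illustration; in TensorFlow the gradients are tensors and the averaging is done with graph ops.

```python
def split_batch(batch, num_towers):
    """Split a batch into num_towers equal, contiguous slices (one per GPU)."""
    if len(batch) % num_towers != 0:
        raise ValueError("batch size must be divisible by the number of towers")
    size = len(batch) // num_towers
    return [batch[i * size:(i + 1) * size] for i in range(num_towers)]


def average_gradients(tower_grads):
    """Average gradients variable by variable across towers.

    tower_grads: one dict per tower, mapping variable name -> gradient value.
    """
    names = tower_grads[0].keys()
    return {
        name: sum(grads[name] for grads in tower_grads) / len(tower_grads)
        for name in names
    }


if __name__ == "__main__":
    print(split_batch(list(range(8)), 4))
    print(average_gradients([{"w": 1.0, "b": 2.0}, {"w": 3.0, "b": 4.0}]))
```

The averaged dict is what would be applied once to the shared variables, which is why every tower has to finish before the update happens.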

Distributed TensorFlow Architecture
For Variable Distribution &
Gradient Aggregation
• Parameter_server
Source: https://www.tensorflow.org/performance/performance_models

4. Best Practices for High-Performance Models

Best Practices
• Input Pipeline
o Do not use feed_dict; it is the slowest way of reading data
o Use the Dataset API
o Use the native parallelism in TensorFlow
➢ Parallelize I/O Reads
➢ Parallelize Image Processing
➢ Parallelize CPU-to-GPU Data Transfer
➢ Software Pipelining
Source: https://www.tensorflow.org/performance/performance_models
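The idea behind software pipelining is that reading the next batch overlaps with computing on the current one. The sketch below shows that idea with a background thread and a bounded queue from the Python standard library. It is not the TensorFlow Dataset API, and the function name and buffer size are assumptions for the illustration.

```python
import queue
import threading


def prefetch(batches, buffer_size=2):
    """Yield items from `batches` while a background thread reads ahead."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the input

    def producer():
        try:
            for batch in batches:
                buffer.put(batch)  # blocks while the buffer is full
        finally:
            buffer.put(sentinel)   # always unblock the consumer

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


if __name__ == "__main__":
    for batch in prefetch(iter(range(5))):
        print(batch)
```

While the consumer works on one item, the producer can already be loading the next, up to buffer_size items ahead. If the consumer stops early, the producer thread stays blocked until the process exits, which is acceptable for a sketch but not for production code.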

Best Practices
• Preprocessing on the CPU
• Use large files, e.g. large TFRecord files.
o E.g. TensorFlow’s official benchmark training files are 140MB each, in TFRecord format
• Place shared parameters on CPU vs GPU
• NCCL vs TensorFlow’s implicit copy mechanism
o NCCL is an NVIDIA® library that can efficiently broadcast and aggregate data across different GPUs, with optimized utilization of the underlying hardware topology

Best Practices
• Build the model with both NHWC and NCHW
o NCHW is the optimal format when training with GPUs.
o A flexible model can be trained on GPUs using NCHW, with inference done on
CPU using NHWC with the weights obtained from training.
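The two formats hold the same values and differ only in axis order: NHWC is batch, height, width, channels, while NCHW is batch, channels, height, width. The sketch below converts between them using nested lists so that the index mapping is explicit; the function names are made up for the illustration.

```python
def nhwc_to_nchw(x):
    """Reorder nested lists indexed [n][h][w][c] into [n][c][h][w]."""
    n_dim, h_dim, w_dim, c_dim = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    return [[[[x[n][h][w][c] for w in range(w_dim)]
              for h in range(h_dim)]
             for c in range(c_dim)]
            for n in range(n_dim)]


def nchw_to_nhwc(x):
    """Reorder nested lists indexed [n][c][h][w] into [n][h][w][c]."""
    n_dim, c_dim, h_dim, w_dim = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    return [[[[x[n][c][h][w] for c in range(c_dim)]
              for w in range(w_dim)]
             for h in range(h_dim)]
            for n in range(n_dim)]


if __name__ == "__main__":
    # One image, 1 pixel high, 2 pixels wide, 3 channels.
    image_nhwc = [[[[1, 2, 3], [4, 5, 6]]]]
    print(nhwc_to_nchw(image_nhwc))
```

With NumPy arrays the same reordering is a single transpose with axes (0, 3, 1, 2), and the inverse uses axes (0, 2, 3, 1).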

5. Distributed Training Performance on Kubernetes

Training Environment on Azure
• VM SKU
o NC24r for workers
▪ 4x NVIDIA® Tesla® K80 GPU
▪ 24 CPU cores, 224 GB RAM
o D14_v2 for parameter server
▪ 16 CPU cores, 112 GB RAM
• Kubernetes: 1.6.6 (created using ACS-Engine)
• GPU: NVIDIA® Tesla® K80
• Benchmark scripts: https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks
• OS: Ubuntu 16.04 LTS
• TensorFlow: 1.2
• CUDA / cuDNN: 8.0 / 6.0
• Disk: Local SSD
• Dataset: ImageNet (real data, not synthetic)

Training on Single node, Multi-GPU
• Linear scalability
• GPUs are fully saturated
• variable_update mode: parameter_server
• local_parameter_device: cpu
[Chart: Resnet-50 with batchsize=64. Images/sec by No. of GPUs; bar labels: 48, 96, 190]

For Tesla K80:
• If the GPUs can use NVIDIA GPUDirect Peer to Peer, place the variables equally across the GPUs used for training.
• If the GPUs cannot use GPUDirect, place the variables on the CPU.
(source: https://www.tensorflow.org/performance/performance_guide)
[Chart: Resnet-50 with batchsize=64. Images/sec by No. of GPUs, two series (Series1, Series2); bar labels: 96, 190, 95, 182]

• Larger batch size helps with training performance
• Batch size is limited by GPU memory (e.g. 12GB RAM for NVIDIA® Tesla® K80)
variable_update mode: parameter_server
[Chart: VGG16. Images/sec by batch size; bar labels: 135, 124]
[Chart: GoogLeNet. Images/sec by batch size; bar labels: 440, 413]

Distributed Training
Settings:
• Topology: 1 ps and 2 workers
• Async variables update
• Using cpu as the local_parameter_device
• Each ps/worker pod has its own dedicated host
• variable_update mode: parameter_server
• Network protocol: gRPC
Single-node Training with 4 GPUs vs Distributed Training with 2 workers with 8 GPUs in total
[Chart: Images/sec for five models, two series. Series1: 440, 107.6, 190, 73, 135. Series2: 818, 172.6, 296, 93, 84.5]
Observations on distributed training:
• Linear scalability largely depends on the model and network bandwidth.
• GPUs were not fully saturated on the worker nodes, likely due to a network bottleneck.
• VGG16 performed worse than in single-node training. GPUs were “starved” most of the time.
• Running directly on Host VMs rather than K8s pods did not make a huge difference in this particular test environment.
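The 1 ps and 2 workers topology could be launched by running the benchmark script once per pod with matching host lists. The sketch below only assembles the command lines. The flag names are those of tf_cnn_benchmarks as best recalled and should be checked against the script's help output; the host names and port are hypothetical, and the async-update setting is not expressed here.

```python
def benchmark_commands(ps_hosts, worker_hosts, model, batch_size, num_gpus):
    """Return {host: argv} for every ps and worker task in the cluster."""
    shared = [
        "--model=" + model,
        "--batch_size=" + str(batch_size),
        "--num_gpus=" + str(num_gpus),
        "--variable_update=parameter_server",
        "--local_parameter_device=cpu",
        "--ps_hosts=" + ",".join(ps_hosts),
        "--worker_hosts=" + ",".join(worker_hosts),
    ]
    commands = {}
    for job_name, hosts in (("ps", ps_hosts), ("worker", worker_hosts)):
        for index, host in enumerate(hosts):
            # Each task gets its own role and index; everything else is shared.
            commands[host] = [
                "python", "tf_cnn_benchmarks.py",
                "--job_name=" + job_name,
                "--task_index=" + str(index),
            ] + shared
    return commands


if __name__ == "__main__":
    cmds = benchmark_commands(
        ps_hosts=["ps-0:2222"],                         # hypothetical hosts
        worker_hosts=["worker-0:2222", "worker-1:2222"],
        model="resnet50", batch_size=64, num_gpus=4,
    )
    for host, argv in cmds.items():
        print(host, " ".join(argv))
```

Every task receives the full ps and worker host lists, so each process can build the same view of the cluster.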

Distributed training scalability depends on the compute/bandwidth ratio of the model
[Chart: Training Speedup on 2 nodes vs single-node. Speedup for five models: 1.86, 1.60, 1.56, 1.27, 0.63]
Source: https://arxiv.org/abs/1704.04560
The model with a higher ratio scales better.
GoogLeNet scales pretty well.
VGG16 is suboptimal, due to its large size
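The speedup figures are the ratio of distributed throughput to single-node throughput for each model. The sketch below recomputes them from the images/sec values in the single-node vs distributed comparison; the models are numbered rather than named because the chart's category labels are not part of the recovered text. The efficiency helper is an added illustration that assumes the doubling from 4 to 8 GPUs is the relevant resource factor.

```python
# Images/sec in chart order.
single_node = [440.0, 107.6, 190.0, 73.0, 135.0]   # 1 node, 4 GPUs
two_workers = [818.0, 172.6, 296.0, 93.0, 84.5]    # 2 workers, 8 GPUs in total


def speedups(baseline, scaled):
    """Throughput ratio per model, rounded to two decimals."""
    return [round(s / b, 2) for b, s in zip(baseline, scaled)]


def scaling_efficiency(speedup, resource_factor=2.0):
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    return speedup / resource_factor


if __name__ == "__main__":
    for i, s in enumerate(speedups(single_node, two_workers), start=1):
        print("model %d: speedup %.2f, efficiency %.0f%%"
              % (i, s, 100 * scaling_efficiency(s)))
```

A speedup below 1.0, as for the last model, means that adding the second node made training slower than staying on a single node.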

• Sync vs Async variable updates
• parameter_server vs distributed_replicated mode
[Chart: GoogLeNet with 128 batch size. Images/sec for three configurations: 814, 800, 764]

Observations on different cluster topologies in this test environment
• Adding more ps servers does not seem to make much difference.
• Having ps servers running on the same pods as the workers seems to give worse performance
o Don’t forget to “export CUDA_VISIBLE_DEVICES=” before starting the ps job session if running the ps server on the same pods with GPUs
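When the ps process is launched from Python rather than from a shell, the same effect can be had by passing it an environment in which CUDA_VISIBLE_DEVICES is empty. This is a minimal sketch; the helper name is hypothetical.

```python
import os


def ps_environment(base_env=None):
    """Return a copy of the environment with all GPUs hidden from CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    # An empty value means the process sees no CUDA devices, so the ps job
    # does not claim GPU memory that the workers on the same pod need.
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


if __name__ == "__main__":
    print(repr(ps_environment()["CUDA_VISIBLE_DEVICES"]))
```

The returned dict can be passed as the env argument of subprocess.Popen when starting the ps job, leaving the parent process and the worker jobs unaffected.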
[Chart: Resnet-50 with 64 batch size, variable_update mode: parameter_server. Images/sec for three topologies: 296, 296, 274]

Demo
Deep Learning Workspace from Microsoft Research
Powered by Kubernetes
• Alpha release available at https://github.com/microsoft/DLWorkspace/
Documentation at https://microsoft.github.io/DLWorkspace/
• Note that DL Workspace is NOT a MS product/service.
It’s an open source solution, and we welcome contributions!

Summary
Tips & tricks learned from using Deep Learning Toolkits on Kubernetes
1. Getting the K8S cluster to run in the cloud using acs-engine
2. Scaling your Deployments
3. Distributed Deep Learning
4. Best Practices for High-Performance Models
5. Distributed Training Performance on Kubernetes

Resources
• Getting Started with Kubernetes on Azure
https://github.com/Azure/acs-engine
https://docs.microsoft.com/en-us/azure/container-service/kubernetes/
• Running Distributed TensorFlow on Kubernetes using ACS-Engine
https://github.com/joyq-github/TensorFlowonK8s
• Using CNTK and Kubernetes
https://aka.ms/cntkkubernetes
• Distributed CNTK and TensorFlow resources
• https://docs.microsoft.com/en-us/cognitive-toolkit/multiple-gpus-and-machines
• https://www.tensorflow.org/performance/
• https://arxiv.org/abs/1704.04560
• Deep Learning Workspace powered by Kubernetes
https://github.com/microsoft/DLWorkspace/
https://microsoft.github.io/DLWorkspace/
