Brain in the Cloud: Machine Learning on OpenStack & Kubernetes Done Right - Erez Cohen, Mellanox - Cloud Native Day Tel Aviv 2018
1. 1© 2018 Mellanox Technologies
Machine Learning on OpenStack and K8s Done Right!
2018
Brain In The Cloud
Erez Cohen, VP CloudX & Artificial Intelligence
2.
Data is Growing Faster Than Ever
An autonomous vehicle generates ~4,000 GB of data per day
Camera: ~20-40 MB/s
LIDAR (Light Detection & Ranging): ~10-70 MB/s
Radar: ~10-100 KB/s
Sonar: ~10-100 KB/s
GPS: ~50 KB/s
Data will grow by a factor of 10 over the next decade, to 160 zettabytes in 2025 (source: IDC)
Faster data processing requires faster interconnect speeds
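The per-sensor rates above can be rolled up into a rough per-day total. A minimal back-of-the-envelope sketch, assuming one sensor of each type at the upper end of each quoted range and an assumed 8 hours of operation per day (both are illustrative assumptions, not figures from the slide):

```python
# Rough per-day data volume for one autonomous vehicle, using the
# upper-bound rates from the slide. Sensor counts (one of each) and
# hours of operation are assumptions for illustration only.
rates_mb_s = {
    "camera": 40,   # ~20-40 MB/s
    "lidar": 70,    # ~10-70 MB/s
    "radar": 0.1,   # ~10-100 KB/s
    "sonar": 0.1,   # ~10-100 KB/s
    "gps": 0.05,    # ~50 KB/s
}
hours_driven = 8  # assumption: hours of operation per day
total_mb = sum(rates_mb_s.values()) * 3600 * hours_driven
print(f"~{total_mb / 1e6:.1f} TB/day")
```

This lands in the low terabytes per day, the same ballpark as the ~4,000 GB/day figure above; a real vehicle with multiple cameras and lidars reaches it easily.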
3.
Machine Learning Is Everywhere!
Fraud Detection
4.
What Is Machine Learning?
Machine Learning
Machine learning is the subfield of computer science that, according to Arthur Samuel in 1959, gives "computers the ability to learn without being explicitly programmed."
Source: https://en.wikipedia.org/wiki/Machine_learning
5.
Deep Learning
Also known as Deep Neural Network (DNN)
Subset of Artificial Neural Network (ANN)
Deep Learning
Deep Learning is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain, called artificial neural networks.
Source: http://machinelearningmastery.com/what-is-deep-learning/
6.
Why Deep Learning And Why Now?
Deep Learning makes it possible to solve difficult problems
In some cases, problems that cannot be solved any other way
Deep Learning is not new
1943: “A logical calculus of the ideas immanent in nervous activity”, McCulloch & Pitts
So why now?
Infrastructure
Recent developments in GPU and network technology make large-scale machine learning practical
Data
More data is generated than ever before, which is critical for the training process
Software
A wave of open source machine learning frameworks (TensorFlow, Caffe2, Microsoft Cognitive Toolkit, and others)
7.
Deep Learning Demands Highest Performance
TRAINING: ingests the training dataset; intensive computing (billions of TFLOPS), hence GPUs; ultra-fast networking for scalability (RDMA, GPUDirect, collective acceleration); fast, distributed storage.
INFERENCING: applies the trained model to new data: images, video, text, speech.
8.
Neural Networks Complexity Growth
[Chart] Image recognition (2013-2016): AlexNet → GoogleNet → ResNet → Inception-V2 → Inception-V4 → PolyNet, ~350X growth in complexity.
[Chart] Speech recognition (2014-2017): DeepSpeech → DeepSpeech-2 → DeepSpeech-3, ~30X growth in complexity.
9.
Training Challenges
Training with large data sets and growing networks can take a long time
In some cases, even weeks
In many cases, training needs to happen frequently
Model development and tuning
Real-life use cases may require regular retraining
Training time is reduced with a scale-out architecture
Add workers (nodes) to reduce training time
Two types of parallelism are now popular
Data parallelism
Model parallelism
The network is a critical element in accelerating distributed training!
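The scale-out argument above can be made concrete with a toy throughput estimate. A minimal sketch, where the per-step compute time, synchronization cost, and per-worker batch size are all illustrative assumptions:

```python
# Back-of-the-envelope data-parallel scaling estimate.
# All numbers are illustrative assumptions, not measurements.

def step_time(workers, compute_s=1.0, comm_s=0.2):
    """Time per global step: per-worker compute stays constant (each
    worker keeps its own mini-batch), but with more than one worker
    every step also pays a gradient-synchronization cost."""
    return compute_s + (comm_s if workers > 1 else 0.0)

def samples_per_second(workers, batch_per_worker=64, **kw):
    # Global throughput: total samples per step / time per step.
    return workers * batch_per_worker / step_time(workers, **kw)

for n in (1, 2, 4, 8):
    print(n, round(samples_per_second(n), 1))
```

The sketch shows why the network matters: throughput scales with worker count only as long as the synchronization term stays small, which is exactly what faster interconnects buy.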
10.
Model and Data Parallelism
[Diagram] Model parallelism: the model is partitioned across workers; each worker holds part of the main model (coordinated via a parameter server or allreduce) and processes mini-batches of the data.
[Diagram] Data parallelism: each worker holds a full local copy of the model and processes its own mini-batches of the data; gradients are combined through a parameter server or allreduce.
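The data-parallel scheme can be sketched in a few lines. This is a toy illustration in plain Python (no real framework): four workers each compute a gradient on their own mini-batch, and the gradients are averaged before the shared update, as a parameter server or allreduce would do.

```python
# Toy data parallelism: each worker computes a gradient on its own
# mini-batch; gradients are averaged (parameter-server / allreduce
# style) and the synchronized update is applied to the model.

def gradient(w, batch):
    # Stand-in gradient for a 1-parameter linear model y = w * x
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def train_step(w, shards, lr=0.03):
    grads = [gradient(w, shard) for shard in shards]  # done in parallel
    avg = sum(grads) / len(grads)                     # allreduce step
    return w - lr * avg                               # synchronized update

data = [(x, 2.0 * x) for x in range(1, 9)]   # target weight: w = 2
shards = [data[i::4] for i in range(4)]      # split across 4 workers
w = 0.0
for _ in range(50):
    w = train_step(w, shards)
print(round(w, 3))  # → 2.0
```

Real frameworks do exactly this with tensors instead of scalars, which is why the gradient exchange dominates network traffic during training.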
11.
Accelerates Distributed Training
Data parallelism communication pattern
Gradient updates to parameter servers or among workers
Model parameters distributed among workers
Frequent: every training step, due to the sequential nature of SGD
High bandwidth is needed as models become larger and larger and the number of parameters keeps increasing
Usually characterized by bursts on the network, since workers are synchronized
RDMA and GPU Direct Accelerates Distributed Training
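To see why bandwidth matters, consider the gradient bytes moved per step. A minimal sketch, assuming a ResNet-50-class model of ~25M FP32 parameters (an illustrative assumption) and comparing wire time at two link speeds:

```python
# Gradient traffic per training step and its wire time at two link
# speeds. Model size and speeds are illustrative assumptions.
params = 25_000_000          # ~ResNet-50-class model (assumption)
bytes_per_step = params * 4  # FP32 gradients: 4 bytes per parameter
for gbps in (10, 100):
    seconds = bytes_per_step * 8 / (gbps * 1e9)
    print(f"{gbps:>3} Gb/s: {seconds * 1e3:.1f} ms per exchange")
```

At 10 Gb/s the exchange alone costs tens of milliseconds every step; at 100 Gb/s it drops by an order of magnitude, and RDMA/GPUDirect further remove the CPU and copy overhead on top of the raw wire time.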
12.
Machine Learning on the Cloud
GPU provisioning to VMs
Advanced Networking
Advanced Storage
13.
GPU Provisioning with OpenStack
Today – PCI Passthrough or Ironic
PCI passthrough requires hardware support and has some caveats…
https://wiki.openstack.org/wiki/GPUs
Good performance requires pinning and NUMA topology support configured too
Tomorrow – vGPU
mdev framework introduced in Linux 4.10 by Red Hat, NVIDIA, Intel
~$ openstack flavor show 56cd053c-b6a2-4103-b870-a83dd5d27ec1
+----------------------------+--------------------------------------------+
| Field | Value |
+----------------------------+--------------------------------------------+
| OS-FLV-DISABLED:disabled | False |
| OS-FLV-EXT-DATA:ephemeral | 1000 |
| disk | 30 |
| id | 56cd053c-b6a2-4103-b870-a83dd5d27ec1 |
| name | mon.m3.c24r120.2gpu-p100.mlx |
| os-flavor-access:is_public | False |
| properties | pci_passthrough:alias='P100:2,MlxCX4-VF:1' |
| ram | 122880 |
| rxtx_factor | 1.0 |
| swap | |
| vcpus | 24 |
+----------------------------+--------------------------------------------+
~$ openstack server list --all-projects --project d99… --flavor 56c…
+--------------------------------------+------------+--------+----------------------------------+
| ID | Name | Status | Networks |
+--------------------------------------+------------+--------+----------------------------------+
| 1d77bf12-0099-4580-bf6f-36c42225f2c0 | massive003 | ACTIVE | monash-03-internal=10.16.201.20 |
+--------------------------------------+------------+--------+----------------------------------+
14.
What Is RDMA?
Remote Direct Memory Access (RDMA)
Advanced transport protocol (same layer as TCP and UDP)
Main features
Remote memory read/write semantics in addition to send/receive
Kernel bypass / direct user space access
Full hardware offload
Secure, channel-based I/O
Application advantages
Low latency
High bandwidth
Low CPU consumption
RoCE: RDMA over Converged Ethernet
Available for all Ethernet speeds, 10-100G
Verbs: the RDMA software interface (equivalent to sockets)
16.
Para-Virtualized vs. SR-IOV
Enable Advanced Networking for VMs & Containers
Single Root I/O Virtualization (SR-IOV)
PCIe device presents multiple instances to the OS/hypervisor
Enables direct application access
Bare-metal performance for VMs
Reduces CPU overhead
Enables many advanced NIC features (e.g. DPDK, RDMA, ASAP2)
[Diagram] Para-virtualized path: NIC → hypervisor vSwitch → VMs. SR-IOV path: SR-IOV NIC with embedded switch (eSwitch) exposing a Physical Function (PF) and Virtual Functions (VFs) directly to the VMs.
17.
ASAP2 Direct: Full OVS Offload
Enables the SR-IOV data path with the OVS control plane
In other words, enables support for most SDN controllers with an SR-IOV data plane
Open vSwitch remains the management interface, while the OVS data plane is offloaded to the Mellanox embedded switch (eSwitch) using ASAP2 Direct
Allows RDMA, GPUDirect, and other advanced network services directly from a VM or container
[Diagram] VMs connect through SR-IOV VFs to the ConnectX-5 eSwitch data path; OVS in the hypervisor controls the eSwitch via the PF.
18.
Comprehensive OpenStack Integration
Integrated with major OpenStack distributions, in-box
Neutron ML2 support for mixed environments (VXLAN, PV, SR-IOV) over Ethernet
Neutron: data-plane acceleration and isolation
iSER and NVMe-oF: accelerating storage access
OpenStack plugins create seamless integration, control, and management
19.
Container Networking Acceleration
Enable RoCE and DPDK networking technologies to accelerate
cloud-native apps and workloads
20.
Containers and Kubernetes Integration
[Diagram] A Mellanox ConnectX adapter with SR-IOV enabled exposes a Physical Function (PF) and Virtual Functions (VF-1, VF-2, VF-3). The SR-IOV CNI plugin and the SR-IOV/RDMA device plugin in Kubernetes/Docker assign each container its own device and network namespace: Container1 gets ibdev=mlx5_1, netdev=eth0, net_ns=1; Container2 gets ibdev=mlx5_2, netdev=eth1, net_ns=2; Container3 gets ibdev=mlx5_3, netdev=eth2, net_ns=3.
Every container/pod has its own IB device (mlx5_1, mlx5_2, mlx5_3)
Isolation is at the driver level
RDMA applications in each container use Verbs directly
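From the user's side, a pod requests an RDMA-capable VF through the device plugin's extended resource, roughly as below. This is a hedged sketch: the resource name (`rdma/hca`) and the image name are illustrative assumptions that depend on the specific device plugin deployed, not fixed values from this deck.

```yaml
# Hypothetical pod spec. The extended-resource name exposed by the
# SR-IOV/RDMA device plugin varies by deployment; "rdma/hca" and the
# image are placeholders, not guaranteed names.
apiVersion: v1
kind: Pod
metadata:
  name: rdma-app
spec:
  containers:
  - name: rdma-app
    image: my-rdma-app:latest   # placeholder image
    resources:
      limits:
        rdma/hca: 1             # request one RDMA-capable VF
```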
21.
All Major Machine Learning Frameworks Support RDMA
TensorFlow: several implementations upstream
Native (verbs): https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/verbs
MPI and Horovod (contributed by Uber), among others
Caffe2 / PyTorch: over MPI or the Gloo library
Microsoft Cognitive Toolkit: native support
NVIDIA NCCL2: native RDMA support in NCCL
22.
TensorFlow with Mellanox RDMA Test Report
System Configuration
8 x x86 servers
4 x NVIDIA P100 per server
Mellanox 100G RDMA network
NVMe drive per server
TensorFlow v1.4
RDMA vs. TCP: Up to 50% Better Performance
Advanced RDMA vs. TCP: Up to 173% Better Performance
Reference Deployment Guide
23.
NVIDIA® DGX-1™ Deep Learning Server
8 x NVIDIA® Tesla® P100/V100 GPUs
5.3 TFLOPS, 16nm FinFET, NVLink
4 x ConnectX®-4 EDR 100G InfiniBand adapters
24.
Mellanox Enables the Most Efficient Machine Learning Platforms
Highest Performance, Scalability and Productivity for Deep Learning