The ever-growing influx of data pushes every component of a system to its seams. GPUs and ASICs are helping on the compute side, while in-memory and flash storage devices keep up with local IOPS. All of these can perform extremely well in smaller setups and under contained workloads. However, today's workloads demand ever more power, which translates directly into higher scale: training major AI models no longer fits into humble setups, and streaming ingestion systems are barely keeping up with the load. These are just a few examples of why enterprises require massive, versatile infrastructure that continuously grows and scales. The problems start when workloads are scaled out, revealing how hard it is for traditional network infrastructure to cope with bandwidth-hungry, latency-sensitive applications. In this talk, we dive into how intelligent hardware offloads can mitigate network bottlenecks in Big Data and AI platforms, and compare the offerings and performance of the major public clouds as well as a la carte on-premise solutions.
Tackling Network Bottlenecks with Hardware Accelerations: Cloud vs. On-Premise
1. Yuval Degani, LinkedIn
Dr. Jithin Jose, Microsoft Azure
2. Intro
• Infinite loop of removing performance roadblocks
• With faster storage devices (DRAM, NVMe, SSD) and stronger-than-ever processing power (CPU, GPU, ASIC), a traditional network just can't keep up with the I/O flow
• Upgrading to higher wire speeds will rarely do the trick
• This is where co-designed hardware acceleration can be used to truly utilize the power of a compute cluster
3. Previous talks
Spark Summit Europe 2017: First open-source stand-alone RDMA-accelerated shuffle plugin for Spark (SparkRDMA)
Spark+AI Summit North America 2018: First preview of SparkRDMA on Azure HPC nodes, demonstrating a 2.6x job speed-up on cloud VMs
5. Network Bottlenecks in the Wild
• Not always caused by lack of bandwidth
• Network I/O imposes overhead in many system components:
– Memory management
– Memory copy
– Garbage Collection
– Serialization/Compression/Encryption
• Overhead = CPU cycles: cycles that are not available for the actual job at hand
• Hardware acceleration can reduce overhead and allow better utilization of compute and network resources
6. Network Bottlenecks: Shuffle
• Most expensive non-storage network I/O in compute clusters
• Blocking, massive movement of transient data
• Acceleration opportunities:
– Efficient serving with reduced server-side logic
– Serialization/Compression/Encryption
– Reduced I/O overhead and latency by employing modern transport protocols
[Pie chart: HiBench TeraSort on Spark, time breakdown: Shuffle Read 57%, Output 28%, Input 11%, Partitioning 4%]
7. Network Bottlenecks: Distributed Training
• Model updates create massive network traffic
• Model update frequency rises as GPUs get faster
• Acceleration opportunities:
– Inter-GPU RDMA communication
– Lower-latency network transport
– Collectives offloads (see the MPI sketch after the chart)
[Chart: ResNet 269 distributed training, total time vs. GPU active time on K80, M60, and V100 GPUs*]
* "Parameter Hub: High Performance Parameter Servers for Efficient Distributed Deep Neural Network Training" by Luo et al.
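To make the "collectives offloads" bullet concrete, below is a minimal MPI sketch in C of the aggregation step behind data-parallel training: every worker contributes its local gradients and receives the global sum. This is our own illustration under stated assumptions, not code from the talk; production stacks typically use NCCL/Horovod-style collectives, and InfiniBand fabrics can execute this very reduction inside the switches.

/* Minimal MPI allreduce sketch (illustrative): sum a toy gradient
 * buffer across all ranks, as one step of data-parallel training.
 * Build: mpicc allreduce.c -o allreduce && mpirun -np 4 ./allreduce */
#include <mpi.h>
#include <stdio.h>

#define NGRAD 4 /* toy gradient size */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each worker's local gradients (values are arbitrary here). */
    float grad[NGRAD] = {0.1f * rank, 0.2f, 0.3f, 0.4f};
    float sum[NGRAD];

    /* Element-wise sum across all workers; this is the collective
     * that switch offloads can run inside the fabric. */
    MPI_Allreduce(grad, sum, NGRAD, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("aggregated grad[0] = %f\n", sum[0]);

    MPI_Finalize();
    return 0;
}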
8. Network Bottlenecks: Storage
• Massive data movement
• Premium devices (DRAM, Flash) provide storage access speeds that were never seen before
• Acceleration opportunities:
– Higher bandwidth
– Reduced transport overhead
– OS/CPU bypass: direct storage access from network devices
10. Speeds
• 1, 10, 25, 40, 100, 200Gbps
• A faster network doesn't necessarily mean a faster runtime
• Many workloads consist of relatively short bursts rather than sustained throughput: higher bandwidth may not have any effect (a back-of-envelope example follows the chart)
[Bar chart: effect of network speed (1GbE, 10GbE, 40GbE) on workload runtime for Flink TeraSort, Flink PageRank, PowerGraph PageRank, and Timely PageRank*]
* "On The [Ir]relevance of Network Performance for Data Processing" by Trivedi et al.
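A back-of-envelope illustration of the point above (numbers are ours, not from the talk): a job that shuffles 100 GB over 10GbE spends at most 800 Gbit / 10 Gbps = 80 seconds on the wire. Upgrading to 40GbE cuts that to 20 seconds, a saving of one minute that is barely visible in an hour-long job, and smaller still when the traffic arrives in short bursts that never saturate either link.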
11. InfiniBand
• De-facto standard in the HPC world
• FDR: 56Gbps, EDR: 100Gbps, HDR: 200Gbps
• Sub-microsecond latency
• Native support for RDMA
• HW-accelerated transport layer
• True SDN: standard fabric components are developed as open-source and are cross-platform
• Native support for switch collectives offload
[Pie chart: TOP500 supercomputers, interconnect performance share*: InfiniBand 38%, Custom 28%, Ethernet 23%, Omnipath 10%, Proprietary 1%]
* www.top500.org
12. RDMA
• Remote Direct Memory Access
– Read/write from/to remote memory locations
• Zero-copy
• Direct hardware interface: bypasses the kernel and TCP/IP in the I/O path
• Flow control and reliability are offloaded to hardware
• Supported on almost all mid-range/high-end network adapters, both InfiniBand and Ethernet (a minimal verbs sketch follows the diagram)
[Diagram: traditional socket path (Java app buffer → Socket → OS TCP/IP → driver → network adapter, with context switches) vs. RDMA path going directly from the application buffer to the network adapter]
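To ground the kernel-bypass idea, here is a minimal libibverbs sketch in C that opens an RDMA device and registers a buffer for zero-copy access. It is our own illustration, not code from the talk: it assumes an RDMA-capable NIC and libibverbs (link with -libverbs), and it stops short of the queue-pair setup required for an actual transfer.

/* Minimal libibverbs sketch: open an RDMA device and register a
 * memory region. Once registered, the NIC can DMA to/from the buffer
 * directly, bypassing the kernel and avoiding intermediate copies.
 * Queue-pair creation and connection exchange are omitted. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void) {
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
    if (!pd) {
        fprintf(stderr, "device open / PD allocation failed\n");
        return 1;
    }

    /* Pin and register 1 MiB for local and remote RDMA access. */
    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
        IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
        IBV_ACCESS_REMOTE_WRITE);

    /* lkey/rkey are the handles a peer uses to target this buffer. */
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
           len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}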
13. NVIDIA GPUDirect
• Direct DMA over PCIe
• RDMA devices can write/read directly to/from GPU memory over the network (see the CUDA-aware MPI sketch after the diagram)
• No CPU overhead
• Zero-copy
[Diagram: GPUDirect vs. non-GPUDirect data path between NIC and GPU; without GPUDirect, transfers are staged through the CPU and host memory]
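As a sketch of how GPUDirect surfaces to applications (again our own illustration, with assumptions): a CUDA-aware MPI build accepts device pointers directly, so a collective can move GPU memory over the NIC without staging through host buffers when GPUDirect RDMA is available; otherwise the library falls back to host-memory staging.

/* CUDA-aware MPI sketch: allreduce directly on a GPU buffer.
 * Assumes an MPI built with CUDA support (e.g. Open MPI configured
 * --with-cuda) and a GPUDirect-capable NIC/GPU pair; without them,
 * transfers are staged through host memory instead. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20; /* 1M floats of "gradients" */
    float *d_grad = NULL;
    cudaMalloc((void **)&d_grad, n * sizeof(float));
    cudaMemset(d_grad, 0, n * sizeof(float));

    /* The device pointer goes straight to MPI; with GPUDirect RDMA
     * the NIC reads/writes GPU memory with no CPU copy in the path. */
    MPI_Allreduce(MPI_IN_PLACE, d_grad, n, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("allreduce on device buffer complete\n");

    cudaFree(d_grad);
    MPI_Finalize();
    return 0;
}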
16. NVMeOF
• Network protocol for NVM Express (NVMe) disks (PCIe)
• Uses RDMA to provide direct NIC<->disk access
• Completely bypasses the host
• Minimal latency differences between local and remote access
[Diagram: NVMeOF vs. traditional remote storage path; with NVMeOF, data moves NIC<->disk without traversing the host CPU]
18. Offer ‘Bare Metal’ Experience – Azure HPC Solution
Eliminate Jitter
• Host holdback is a start, but must completely isolate guest from host
• Minroot & CPU Groups; separated host and guest VM sandboxes
Full Network Experience
• Enable customers to use Mellanox or OFED drivers
• Supports all MPI types and versions
• Leverage hardware offload to Mellanox InfiniBand ASIC
Transparent Exposure of Hardware
• Core N in guest VM should = Core N in silicon
• 1:1 mapping between physical pNUMA topology and vNUMA topology
19. Latest Azure HPC Offerings – HB/HC
HB Series (AMD EPYC) vs. HC Series (Intel Xeon Platinum)
• Workload targets: bandwidth-intensive (HB) / compute-intensive (HC)
• Core count: 60 (HB) / 44 (HC)
• System memory: 240 GB (HB) / 352 GB (HC)
• Network: 100 Gbps EDR InfiniBand, 40 Gbps Ethernet
• Storage support: Standard/Premium Azure Storage, and 700 GB local SSD
• OS support for RDMA: CentOS/RHEL, Ubuntu, SLES 12, Windows
• MPI support: OpenMPI, HPC-X, MVAPICH2, MPICH, Intel MPI, PlatformMPI, Microsoft MPI
• Hardware collectives: enabled
• Access model: Azure CLI, ARM template, Azure CycleCloud, Azure Batch, Partner Platform
20. Other Azure HPC Highlights
• SR-IOV going broad
– All HPC SKUs will support SR-IOV
– Driver/SKU Performance Optimizations
• GPUs
– Latest NDv2 Series
• 8 NVIDIA Tesla V100 NVLink-interconnected GPUs
• Intel Skylake, 672 GB Memory
• Excellent platform for HPC and AI workloads
• Azure FPGA
– Based on Project Brainwave
– Deploy a model to Azure FPGA; reconfigure for different models
– Supports ResNet 50, ResNet 152, DenseNet-121, and VGG-16