High-Performance Input Pipelines for Scalable Deep Learning

Joshua Robinson
Pure Storage

© 2019 PURE STORAGE INC.
QUESTION ON EVERYONE’S MIND:
WHY IS A STORAGE COMPANY HERE?

“We don’t have better algorithms,
we just have more data”
PETER NORVIG
Engineering Director, Google

The AI “Hierarchy of Needs”
credit: Monica Rogati
ML algorithms: linear & logistic
regression, k-means clustering,
decision trees, etc.
Validation: A/B testing, detecting
model drift over time
Data preparation: cleaning, feature
identification, exploration, etc.
Data acquisition: ingest,
transformation, and representation of
data for analysis

THIS IS NOT THE FIRST AI HYPE WAVE
[Timeline, 1950 to 2020: Birth of AI, AI winter I, Re-birth I, AI winter II, Re-birth II]
Common theme of the winters: compute and data couldn’t
match the needs of the problems being hyped
Common theme of the re-births: focus on specific problems
where available compute & data are sufficient

DEEP LEARNING = MASSIVE DATA & COMPUTE
[Figure: accuracy as a function of data & compute, comparing deep learning with previous methods]
STATE-OF-THE-ART RESULTS ACROSS VISION, SPEECH, LANGUAGE, AND MORE
Sources: https://arxiv.org/abs/1506.01497; https://arxiv.org/abs/1703.06870; https://shubhangdesai.github.io/blog/Neural-Style.html; https://cs.stanford.edu/people/karpathy/cnnembed/

THE INTUITION BEHIND DEEP LEARNING
[Figure: a deep neural net maps an input image to class probabilities (Pr{dog} = 0.903, Pr{cat} = 0.072, …) and outputs “dog”]
Successive layers learn: primitives, rough shapes, macro features

TRAINING A DEEP NEURAL NETWORK
Training loop: evaluate, compute gradients, apply gradients
[Figure: network output Pr{dog} = 0.903, Pr{cat} = 0.072, …; layers learn primitives, rough shapes, macro features]
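
The evaluate, compute gradients, apply gradients loop can be sketched in a few lines. This is a minimal illustration on a one-parameter toy model in plain Python, not the network or framework used in the talk; the function name `train_step`, the learning rate, and the toy data are assumptions made for this example.

```python
def train_step(w, batch, lr=0.1):
    """One training iteration for the toy model y = w * x with squared error."""
    # 1. Evaluate: forward pass and mean squared error on the batch.
    preds = [w * x for x, _ in batch]
    loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / len(batch)
    # 2. Compute gradients: d(loss)/dw for the mean squared error.
    grad = sum(2 * (p - y) * x for p, (x, y) in zip(preds, batch)) / len(batch)
    # 3. Apply gradients: plain gradient descent update.
    return w - lr * grad, loss


# Toy data generated from y = 3x (hypothetical, for illustration only).
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w = 0.0
losses = []
for _ in range(50):
    w, loss = train_step(w, data)
    losses.append(loss)
```

A real network repeats the same three phases per batch, with many parameters and with the gradients computed by backpropagation on the GPU.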

DISTRIBUTED TRAINING
Each GPU: evaluate, compute gradients
Across GPUs: merge gradients
Each GPU: apply gradients
[Figure: the per-GPU loop replicated across # GPUs]
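
The merge step is the only part of the loop that crosses GPU boundaries. The sketch below simulates it in a single process on the same kind of toy model, assuming the merge is a plain average of per-worker gradients; the names `local_gradient`, `merge_gradients`, and `distributed_step` are hypothetical. Real systems perform this merge with a collective operation over the interconnect rather than a Python list.

```python
def local_gradient(w, shard):
    """Evaluate and compute gradients on one worker's shard (toy model y = w * x)."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def merge_gradients(grads):
    """Merge step: average the per-worker gradients."""
    return sum(grads) / len(grads)


def distributed_step(w, shards, lr=0.1):
    # On real hardware each shard's gradient is computed in parallel on its own GPU.
    grads = [local_gradient(w, shard) for shard in shards]
    merged = merge_gradients(grads)
    # Every worker applies the same merged gradient, so the replicas stay in sync.
    return w - lr * merged


data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
shards = [data[0:2], data[2:4]]  # two simulated workers with equal-sized shards

w_distributed = distributed_step(0.0, shards)
w_single = distributed_step(0.0, [data])  # one worker seeing the full batch
```

With equal-sized shards, averaging the per-worker gradients gives the same update as one worker processing the combined batch, which is why adding GPUs increases the effective batch size.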

MORE, FASTER GPUs + MORE DATA

CAN WE KEEP GPUs FED WITH DATA?
INPUT PIPELINE = POTENTIAL BOTTLENECK

INPUT PIPELINES
CAN IT BE THAT SIMPLE?
Source: K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, CVPR 2016

REAL INPUT PIPELINES
CAN YOU SPOT THE BOTTLENECK?
Source: K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, CVPR 2016

FROM IMAGES TO TENSORS
[Figure: images in the classes PLANE, DOG, BOAT, and CAT move through the steps below]
1. Enumerate
2. Associate labels
3. Shuffle
4. Read, crop, distort
5. Copy to GPU
ANY OF THESE STEPS CAN BE
A POTENTIAL BOTTLENECK
Other domains (NLP, speech, etc.)
will follow a similar(ish) flow
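
The five steps can be sketched end to end with the standard library. This is an illustration of the flow only, not the TensorFlow "Datasets" API used for the measurements in this deck: the file names, the fake 8x8 "decode", and the helper names (`enumerate_files`, `associate_labels`, `load_and_distort`, `batches`) are all hypothetical.

```python
import random

# Hypothetical directory layout: one folder per class.
FILES = {
    "plane": ["plane/0.jpg", "plane/1.jpg"],
    "dog": ["dog/0.jpg", "dog/1.jpg"],
    "boat": ["boat/0.jpg", "boat/1.jpg"],
    "cat": ["cat/0.jpg", "cat/1.jpg"],
}
CLASS_IDS = {name: i for i, name in enumerate(sorted(FILES))}


def enumerate_files():
    """Step 1: list every (class name, path) pair."""
    for name, paths in FILES.items():
        for path in paths:
            yield name, path


def associate_labels(items):
    """Step 2: turn the class name into an integer label."""
    return [(path, CLASS_IDS[name]) for name, path in items]


def load_and_distort(path, rng, size=4):
    """Step 4 stand-in: 'read' an image, take a random crop, maybe flip it."""
    # Fake decode: a real pipeline would read and decode `path` here.
    image = [[(r * 8 + c) % 256 for c in range(8)] for r in range(8)]
    top = rng.randrange(8 - size + 1)
    left = rng.randrange(8 - size + 1)
    crop = [row[left:left + size] for row in image[top:top + size]]
    if rng.random() < 0.5:
        crop = [row[::-1] for row in crop]  # horizontal flip
    return crop


def batches(batch_size=4, seed=0):
    rng = random.Random(seed)
    examples = associate_labels(enumerate_files())
    rng.shuffle(examples)  # Step 3: shuffle
    for i in range(0, len(examples), batch_size):
        chunk = examples[i:i + batch_size]
        images = [load_and_distort(path, rng) for path, _ in chunk]
        labels = [label for _, label in chunk]
        yield images, labels  # Step 5 would copy this batch to the GPU


all_batches = list(batches())
```

Steps 1 to 3 touch only metadata, step 4 is where storage reads and CPU preprocessing happen, and step 5 crosses the host-to-device boundary, so each step stresses a different resource.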

EVALUATION METHODOLOGY
Dataset: 1.3M images, 1000 categories

HARDWARE STACK (AIRI)
Compute: 4x NVIDIA DGX-1, each with
  8x Tesla V100 GPUs (SXM2)
  2x Intel E5-2698 v4 @ 2.20GHz
  4x Mellanox MT27700 100Gb/s VPI adapters
  512GB DDR4-2400
Storage: Pure Storage FlashBlade, 15x 17TB,
  179T usable before data reduction
Switch: Arista DCS-7060CX2-32S,
  32x 100Gb/s QSFP100 ports
Network links: 40Gb Ethernet; 100Gb Ethernet w/ RDMA (RoCE)

SOFTWARE STACK
Container: nvcr.io/nvidia/tensorflow:17.12
OS: DGX-OS (Ubuntu 16.04)
Libraries: CUDA 9.0, NCCL 2.1.2, CUDNN v7, OpenMPI 3.0
Frameworks: TensorFlow 1.4.0+, Horovod
Benchmark: alsrgv/tf_cnn_benchmarks
Using TensorFlow “Datasets” API for input pipelines

TRAINING WITH 1 GPU
Images per second when training Inception3 (batch size = 64)
How do we know what good looks like?
Defaults (“default” training pipeline: input pipeline, forward, backward): 216 i/s
Synthetic (replace the input pipeline with synthetic data: forward, backward only): 228 i/s

TRAINING WITH 1 GPU
Images per second when training Inception3 (batch size = 64)
Defaults (“default” training pipeline: input pipeline, forward, backward): 216 i/s
Defaults + Prefetch: 225 i/s
Synthetic: 228 i/s
Adding a prefetch queue improves scheduler behavior
SHOULD WE CARE ABOUT 5%?
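
A prefetch queue decouples the producer (the input pipeline) from the consumer (the training step) so that the next batch is prepared while the current one is being processed. The sketch below shows the idea with a background thread and a bounded queue from the standard library. It is not the TensorFlow implementation; the names `prefetch` and `input_pipeline` and the buffer size are assumptions, and for brevity it does not propagate exceptions raised by the producer.

```python
import queue
import threading


def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable`, produced ahead of time by a background thread."""
    buf = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the input

    def producer():
        for item in iterable:
            buf.put(item)  # blocks while the buffer is full
        buf.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()  # blocks only if the producer has fallen behind
        if item is done:
            return
        yield item


def input_pipeline(n):
    for i in range(n):
        yield [i] * 4  # stand-in for a decoded, preprocessed batch


consumed = [batch[0] for batch in prefetch(input_pipeline(5))]
```

In the single-GPU numbers above, the default pipeline runs about 5% below the synthetic baseline (216 vs. 228 i/s), and prefetching recovers most of that difference (225 i/s).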

SCALING TO 32 GPUs (4x DGX-1s)
Images per second when training Inception3 (batch size = 64/GPU)
Defaults: 4143 i/s
+ Prefetch: 5335 i/s
+ Thread Pool Limit: 5527 i/s
- Distortions: 6440 i/s
Synthetic: 6580 i/s
Linear: 7200 i/s
Thread pool limits: Avoid over-subscribing CPU with too many threads (inter_op_parallelism_threads).
No Distortions: Skip preprocessing step from input pipeline. This is an unrealistic configuration, but it shows the bottleneck.
EXCELLENT SCALABILITY, BUT STILL MORE WORK TO BE DONE
42% gap!
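
The gap figure can be reproduced from the numbers on this slide. The calculation below assumes that the "Linear" target is the single-GPU prefetch result (225 i/s) multiplied by 32 GPUs, and that the 42% gap compares the default pipeline against that target. Both readings are inferred from the arithmetic rather than stated on the slide.

```python
single_gpu_prefetch = 225  # i/s, 1 GPU with defaults + prefetch
gpus = 32

# Ideal throughput if scaling from 1 to 32 GPUs were perfect.
linear = single_gpu_prefetch * gpus

measured = {
    "Defaults": 4143,
    "+ Prefetch": 5335,
    "+ Thread Pool Limit": 5527,
    "- Distortions": 6440,
    "Synthetic": 6580,
}

# Shortfall of each configuration relative to the linear target, in percent.
gap_pct = {name: 100 * (linear - ips) / linear for name, ips in measured.items()}

for name, pct in gap_pct.items():
    print(f"{name:22s} {measured[name]:5d} i/s  gap {pct:4.1f}%")
```

Under these assumptions the synthetic run is itself below the linear target, so part of the remaining shortfall is not attributable to the input pipeline.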

2.5X Performance Improvement

SCALE OF REAL-WORLD DATA
ImageNet: 143 GB
Zenuity: 20 PB

SINGLE-GPU TRAINING
Training loop: evaluate, compute gradients, apply gradients
[Figure: network output Pr{dog} = 0.903, Pr{cat} = 0.072, …]

LINEAR SCALING FOR CONVNETS
             1 DGX-1     2 DGX-1     4 DGX-1
RESNET-50    2540 i/s    4870 i/s    10244 i/s
INCEPTION3   1600 i/s    3160 i/s    6440 i/s
VGG16        1640 i/s    3110 i/s    6300 i/s
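
One way to quantify "linear" is scaling efficiency: measured throughput divided by the single-node throughput times the node count. The short calculation below applies that definition to the figures on this slide; the definition is a common convention assumed here, not one stated in the deck.

```python
throughput = {  # images/s at 1, 2, and 4 DGX-1 systems
    "RESNET-50": [2540, 4870, 10244],
    "INCEPTION3": [1600, 3160, 6440],
    "VGG16": [1640, 3110, 6300],
}
nodes = [1, 2, 4]

# Efficiency = measured throughput / (single-node throughput * node count).
efficiency = {
    model: [ips / (rates[0] * n) for ips, n in zip(rates, nodes)]
    for model, rates in throughput.items()
}

for model, values in efficiency.items():
    print(model, " ".join(f"{v:.3f}" for v in values))
```

On these numbers every configuration lands within roughly 6% of linear. Values slightly above 1.0 are plausibly run-to-run variation rather than true super-linear scaling.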

RDMA OVER ETHERNET
RDMA is essential for peak performance

KEEPING GPUs FED WITH DATA
Input queue is full: need more/faster GPUs?

[Figure: images in the classes PLANE, DOG, BOAT, and CAT move through the steps below]
1. Enumerate
2. Associate labels
3. Crop and distort
