Hacking GPUs for Deep Learning: GPUs have revolutionized machine learning in recent years, and have made both massive and deep multi-layer neural networks feasible. However, misunderstandings about why they seem to be winning persist. Many of deep learning’s workloads are in fact “too small” for GPUs, and require significantly different approaches to take full advantage of their power. There are many differences between traditional high-performance computing workloads, long the domain of GPUs, and those used in deep learning. This talk will cover these issues by looking into various quirks of GPUs, how they are exploited (or not) in current model architectures, and how Facebook AI Research is approaching deep learning programming through our recent work.
Jeff Johnson, Research Engineer, Facebook at MLconf NYC
1. Hacking GPUs for Deep Learning
MLConf New York
Jeff Johnson
Facebook AI Research
jhj@fb.com
2. Deep (convolutional) Neural Networks
Revolution in machine learning
Convolution: since the 1980s. Deep: enough flops since the 2000s
Avoid feature engineering
▪ With enough data, let the network discover feature representations
▪ Can work even for NLP: no word segmentation, use raw character data
3. 2D Convolutional Nets (images)
LeCun, Bottou,
Bengio and Haffner,
1998
Krizhevsky,
Sutskever and
Hinton, 2012
4. 2D Convolutional Nets
Progress towards smaller kernels and deeper nets
Network architecture: ImageNet 1000-class top-5 error
AlexNet: ~15%
OverFeat: ~13%
ZeilerNet: ~11%
Oxford-VGG: ~7%
GoogLeNet: ~6%, ~4.5%
PReLU (MSR): ~4.9%
Human performance: 3-5%
5. 3D Convolutional Nets (videos)
C3D (Tran et al., 2014)
DeepVideo (Karpathy et al., 2014)
10. Deep nets are flop eaters
Convolutions are expensive
Pointwise calculations (log/exp, ReLU, */+, ...)
Neighborhood reductions (pooling, convolution)
Scaling network parameters:
▪ increased learning capacity, but also overfitting
▪ more training data (real or synthetic) and regularization required
11. Deep nets are bandwidth eaters
More parameters = more memory, data to exchange
Barrier to cross-machine parallelism
▪ periodic exchanges, compression, quantization
Increase reuse of memory while local?
▪ interspersed reductions are resistant to fusion of computations
▪ a generalized programming language problem
12. Deep nets are latency sensitive
Serial dependency of training
fprop => bprop => fprop => ...
Serial dependency of multi-layer networks
layer 1 => layer 2 => layer 3 => ...
Multiple path-dependent networks (RNNs, multi-layer LSTMs)
13. Deep nets are also small?
Deeper = smaller feature planes, more of them
input R^m => expand to R^n => non-linearity => reduce to R^k
Problems are tiny in HPC terms
4096×4096 FFT, FE/PDE on massive grids, ...
NLP tasks can be sparse
Setup/kernel launch latency on GPU can dominate compute
15. Vector processors
SIMD: Single Instruction,
Multiple Data
Serial processor with the ability to operate on more than one piece of data concurrently
Cray-1 (1976)
16. Vector processors
Hard to use: instructions only operate on 4, 8, 16, ... pieces of data at a time. Boundary/alignment effects. Great if your vectors are large, but...
float* a = ...; // is this aligned (a % 16 == 0)?
float* b = ...; // is this aligned (b % 16 == 0)?
for (int i = 0; i < 18; ++i) { // how to handle [16, 17]?
  b[i] += a[i]; // SIMD this?!? needs masking or a loop epilogue
}
17. “Vector cores”?
SIMD variant: NVIDIA calls it “SIMT”
Lots of simple cores (CM)
Hide latency through many threads + switching (Tera)
“Pixel/vertex shaders” in 2000s
GPUs => GPGPU
CM-1 (1983)
Tera MTA (1995)
18. GPU versus CPU
GPUs represent a different form of vector
programming (“vector cores”)
▪ 32-wide vector of threads (“warp”)
Sufficiently optimized CPU code can be on par with GPU perf (Tflop range with AVX2/512; exploit multi-level caches, deep pipelines, prefetch, ...)
Vector programming: easier with GPUs than CPUs
CPU sweet spot is different from GPU codes
19. Parallelization + vectorization
The serial nature of commonly used CPU programming languages sometimes hides opportunities
Auto-vectorizing/parallelizing compilers + DSLs can’t yet compete with expert hand-rolled code
▪ DSLs like Halide (Ragan-Kelley et al., 2013) show promise but need a few more generations
Sprinkling in pragmas (OpenMP) doesn’t cut it
20. Who wins
flops: CPU ✔ (vectorize: AVX2/512 gives Tflop range); GPU ✔ (Tesla K40: 2880 fp32 ALU pipelines)
main memory b/w: CPU ✖ (Xeon Phi improves); GPU ✔
latency: CPU ✔ (high clock, reordering; caches are large and work if you obey them); GPU ✖ (threads slow, non-smem caches irrelevant, CPU -> GPU control overhead)
boundary effects, small/irregular sizes: CPU ✔✖ (branches easy, vectorization hard); GPU ✖ (warp divergence, load imbalance)
parallel programming model: CPU ✖ (vectorization hard, perf a black box); GPU ✔✖ (CUDA is very different, domain knowledge)
22. Dive into 2D Convolutional Nets
Somewhat computationally expensive
O(b × f × f’ × n² × k²)
1st layer AlexNet:
▪ 13.493 Gflop (1 flop here = fp32 multiply-add)
▪ 77.2 Mbyte in, 63.7 Mbyte out (fp32)
▪ With perfect caching + reuse: 175 flop/byte in
▪ With no caching + reuse: 0.125 flop/byte in
23. The problem
Programmable caches (shared memory, registers, ...) not large enough for perfect reuse
Space of all possible square 2D convolution problems is 5- or 6-dimensional
Parameter: size
minibatch size (b): 128
input feature maps (f): 3
output feature maps (f’): 96
input feature size (n × n): 224
convolution kernel size (k × k): 11
convolution kernel stride (S × S) (optional): 4
24. Converting
Space of all possible matrix multiplications = 3-dimensional (A(N×M) × B(M×P) = C(N×P))
NVIDIA, Intel, and others have put lots of effort into optimizing many parts of this space
▪ Rephrase convolution as a matrix multiplication!
▪ NVIDIA’s cuDNN
25. But:
Sgemm was originally optimized for large problems
13×13 * 3×3 is a small convolution; unrolling it 192 times might be enough to feed the GPU
Large convolutions are intractable?
Small feature maps/convolutions = boundary effects, bad for GPUs
26. Facebook AI Research work
2D convolution via FFT
Fast convolutional nets with fbfft: A GPU Performance Evaluation (Vasilache, Johnson et al., ICLR 2015 conference track oral)
Convolution => pointwise × in Fourier basis
Choice of basis is wide open! Power-of-2 sizes (2^i) give great perf
O(b f f’ n² k²) => O(b f f’ n² + (b f + f f’ + b f’) n² log n)
▪ for kernels >= 5×5, faster than cuDNN
28. Data layout
Different problem sizes => different data layouts
▪ cudaconv: DHWB (optimal for large b)
▪ deeper layers: HWBD/BHWD (many feature maps)
▪ b=1 faster convergence?
▪ b=128 better compute utilization
Smaller problems, exploit different layout/batching
▪ fbcunn 1D convolution
29. Latency hiding: what holds you back?
▪ Compute bound? (math)
▪ Memory b/w bound? (streaming)
▪ Memory latency bound? (sparse)
Almost all “deep learning” algorithms are b/w bound on GPU. Low math intensity!
cuBLAS: Sgemm is b/w bound; Dgemm is compute bound
30. Kernel fusion: CPU vs GPU
Reduces memory b/w pressure
Exploits cache locality and register reuse
CPU: fusion not necessary
Kernel tiling + interleaving works due to caches
GPU: fusion necessary
Tiling + interleaving doesn’t work: smem not persistent, caches too small/irrelevant
31. Kernel fusion
CUDA kernel = hard optimization boundary on GPU
Loop interchange, lifting, better fusion on CPU
CUDA: parallelization layer not visible to the optimizer
Auto-tuning desired; HW-specific non-linear tradeoffs
Scripting languages are a further barrier to fusion on both CPU and GPU (Torch)
32. Kernel fusion
Torch: transposition is common operation
▪ size (80, 40) stride (40, 1) => size (40, 80) stride (1, 40)
▪ Old approach: transpose in memory, perform work, copy back
▪ New approach: rewrite the kernel to handle transpositions; optimize if non-transposed
Runtime fusion (CUDA 7.0, Theano)