35. Benchmark (DGX-1)
• Two fully connected quads, connected at corners
• 160 GB/s bidirectional peer bandwidth per GPU
• Load/store access to peer memory
• Full atomics to peer GPUs
• High-speed copy engines for bulk data copy
• PCIe to/from CPU
DGX-1: Dual 20-core Intel® Xeon® E5-2698 v4 @ 2.2 GHz, 8x Tesla GP100
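The peer load/store and copy paths above are what CUDA peer-to-peer access exposes to software. As a minimal sketch of checking and exercising a peer copy from Python, using PyTorch purely as an illustrative wrapper (the slide does not prescribe any library; this assumes a machine with at least two CUDA GPUs):

```python
import torch

assert torch.cuda.device_count() >= 2, "needs at least two GPUs"
# Can GPU 0 load/store GPU 1's memory directly? On DGX-1 this path is NVLink.
print("P2P 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))

x = torch.randn(1 << 20, device="cuda:0")   # ~4 MB of floats on GPU 0
y = x.to("cuda:1")                          # device-to-device copy; uses P2P when enabled
torch.cuda.synchronize()
print(torch.allclose(x.cpu(), y.cpu()))     # True: same data, now resident on GPU 1
```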
36. TensorFlow
Deep Learning Training
An open-source software library for numerical computation using data flow graphs.
VERSION: 1.0
ACCELERATED FEATURES: Full framework accelerated
SCALABILITY: Multi-GPU and multi-node
More information: https://www.tensorflow.org/
TensorFlow Deep Learning Framework: Training on 8x P100 GPU Server vs. 8x K80 GPU Server
[Chart: training speedup vs. a server with 8x K80, shown for AlexNet, GoogleNet, ResNet-50, ResNet-152, and VGG-16; the server with 8x P100 (16 GB, PCIe) averages a 2.5x speedup, the server with 8x P100 (16 GB, NVLink) 3x.]
GPU servers: single Xeon E5-2690 v4 @ 2.6 GHz with GPU configs as shown; Ubuntu 14.04.5, CUDA 8.0.42, cuDNN 6.0.5, NCCL 1.6.1; dataset: ImageNet; batch sizes: AlexNet (128), GoogleNet (256), ResNet-50 (64), ResNet-152 (32), VGG-16 (32)
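For context on what multi-GPU data parallelism looked like in TensorFlow 1.0's graph-mode API, here is a minimal, hedged sketch of the classic "tower" pattern: one replica of a toy linear model per GPU, with gradients averaged before a single synchronous update. The toy model and all names (`tower_grads`, etc.) are illustrative, not taken from the benchmark above.

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x graph-mode API

NUM_GPUS = 2
x = tf.placeholder(tf.float32, [64, 10])
t = tf.placeholder(tf.float32, [64, 1])
w = tf.get_variable("w", [10, 1])
opt = tf.train.GradientDescentOptimizer(0.01)

# One "tower" per GPU: same weights, different slice of the batch.
tower_grads = []
for i, (xs, ts) in enumerate(zip(tf.split(x, NUM_GPUS), tf.split(t, NUM_GPUS))):
    with tf.device("/gpu:%d" % i):
        loss = tf.reduce_mean(tf.square(tf.matmul(xs, w) - ts))
        tower_grads.append(opt.compute_gradients(loss, [w]))

# Average the per-tower gradients, then apply one synchronous update.
grad = tf.add_n([g[0][0] for g in tower_grads]) / NUM_GPUS
train_op = opt.apply_gradients([(grad, w)])

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op, {x: np.random.randn(64, 10), t: np.random.randn(64, 1)})
```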
39. Data Parallelism (Synchronous)
[Diagram: the same model (Inputs → Layer 1 → Layer 2 → … → Layer N → loss function, with weights w at each layer) replicated on GPU 1 and GPU 2, each fed a different part of the batch with its own labels ("cat", "monkey").]
Copy the model; assign different data to each GPU.
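As a toy numpy sketch of this setup (sizes and names are mine, and a single linear layer stands in for the whole network): replicate the weights, then shard the batch so each replica sees different data.

```python
import numpy as np

num_gpus = 2
w = np.random.randn(10, 4)                        # one linear layer stands in for the model

replicas = [w.copy() for _ in range(num_gpus)]    # copy the model to every GPU
batch = np.random.randn(32, 10)
labels = np.random.randn(32, 4)
x_shards = np.split(batch, num_gpus)              # assign different data to each replica
t_shards = np.split(labels, num_gpus)
```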
40. Data Parallelism (Synchronous)
[Diagram: both replicas run the forward pass (x → y through Layers 1…N to the loss) and the backward pass (the output error propagated as Δy and Δx back through the layers), producing per-layer weight gradients Δw on each GPU.]
Forward and backward run independently on each GPU.
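Continuing the toy numpy sketch, with a squared-error loss so the backward pass fits in a few lines; each "GPU" computes its own Δw from its own shard, with no communication.

```python
import numpy as np

def forward(w, x):
    return x @ w                      # toy linear layer

def backward(w, x, y, t):
    dy = y - t                        # output error for 0.5 * ||y - t||^2
    return x.T @ dy                   # Δw for this replica's shard

num_gpus = 2
w = np.random.randn(10, 4)
shards = [np.random.randn(16, 10) for _ in range(num_gpus)]
targets = [np.random.randn(16, 4) for _ in range(num_gpus)]

# Each "GPU" runs forward and backward on its own data, independently.
grads = [backward(w, x, forward(w, x), t) for x, t in zip(shards, targets)]
```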
41. Data Parallelism (Synchronous)
[Diagram: as before, each GPU computes its own per-layer Δw; a per-layer all-reduce then combines the Δw across GPUs so both replicas hold the same combined gradients.]
Combine Δw across the GPUs with an all-reduce per layer.
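Semantically, the all-reduce just has to leave every replica holding the same combined Δw. A minimal sketch of the naive form (gather everything in one place, reduce, broadcast back), with averaging as the usual data-parallel choice of reduction:

```python
import numpy as np

grads = [np.random.randn(10, 4) for _ in range(4)]   # per-replica Δw

# Naive all-reduce: gather all Δw in one place, reduce, broadcast back.
combined = sum(grads) / len(grads)                   # reduce (mean over replicas)
grads = [combined.copy() for _ in grads]             # broadcast: all replicas now agree
```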
42. Data Parallelism (Synchronous)
[Diagram: each GPU applies the combined Δw to its own copy of the weights w, layer by layer.]
Update the weights independently; since every replica applies the same combined Δw, the copies stay in sync.
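The final step of the toy sketch: no further communication is needed, because identical inputs to the update rule give identical replicas.

```python
import numpy as np

lr = 0.1
num_gpus = 4
ws = [np.ones((10, 4)) for _ in range(num_gpus)]   # identical replicas of w
combined = np.random.randn(10, 4)                  # the same all-reduced Δw on every GPU

# Each GPU updates its own copy; identical inputs give bit-identical replicas.
ws = [w - lr * combined for w in ws]
```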
43. Multi-GPU Training Performance
NVIDIA DGX-1, Chainer 1.17.0 with multi-process patch
[Left chart: speed-up vs. 1 GPU against number of GPUs (1 to 8) for AlexNet, VGG-D, and ResNet; batch size per GPU: AlexNet 768, VGG-D 32, ResNet 12. Right chart: time per batch relative to 1 GPU for VGG-D on 1, 2, 4, and 8 GPUs, broken down into forward, backward, all-reduce, and update.]
DGX-1's NVLink is not well utilized: Chainer's all-reduce implementation is a naïve "gather and broadcast".
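The bottleneck is easy to see from a back-of-the-envelope traffic model (my own sketch, not Chainer code): with gather-and-broadcast, all bytes funnel through one GPU's links, while the ring algorithm on the next slides spreads them across every link.

```python
def naive_bytes_at_root(n_gpus, buf_bytes):
    # Gather-and-broadcast: the root receives n-1 buffers, then sends n-1
    # copies back out, so all traffic serializes on the root GPU's links.
    return 2 * (n_gpus - 1) * buf_bytes

def ring_bytes_per_gpu(n_gpus, buf_bytes):
    # Ring all-reduce: reduce-scatter plus all-gather each move (n-1)/n of
    # the buffer per GPU, spread evenly across all links at once.
    return 2 * (n_gpus - 1) / n_gpus * buf_bytes

buf = 1e6  # a hypothetical 1 MB gradient buffer
for n in (2, 4, 8):
    print(n, naive_bytes_at_root(n, buf), ring_bytes_per_gpu(n, buf))
```

The naive root's traffic grows linearly with the GPU count, while the ring's per-GPU traffic stays below 2x the buffer size no matter how many GPUs join.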
48. NCCL Implementation
• 1 CPU and 4 GPUs (PCIe)
Ring algorithm: most collectives are amenable to a bandwidth-optimal implementation on rings, and many topologies can be interpreted as one or more rings [P. Patarasuk and X. Yuan].
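A minimal, single-process numpy sketch of a ring all-reduce in the spirit of Patarasuk & Yuan (a "send" is modeled as a copy between list entries; on real hardware each step would be a neighbor-to-neighbor transfer):

```python
import numpy as np

def ring_allreduce(bufs):
    """Sum-allreduce over equal-length 1-D arrays, one per "GPU" on a ring."""
    n = len(bufs)
    chunks = [np.array_split(b.astype(float), n) for b in bufs]

    # Phase 1, reduce-scatter: after n-1 steps, rank r holds the complete
    # sum for chunk (r + 1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n) for r in range(n)]
        data = [chunks[r][c].copy() for r, c in sends]   # snapshot = send buffers
        for (r, c), d in zip(sends, data):
            chunks[(r + 1) % n][c] += d

    # Phase 2, all-gather: each rank forwards its most recently completed chunk.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        data = [chunks[r][c].copy() for r, c in sends]
        for (r, c), d in zip(sends, data):
            chunks[(r + 1) % n][c] = d

    return [np.concatenate(c) for c in chunks]

bufs = [np.arange(8.0) + r for r in range(4)]
expected = sum(bufs)
assert all(np.allclose(out, expected) for out in ring_allreduce(bufs))
```

Each GPU sends and receives only 2(n-1)/n of the buffer in total, and every link in the ring is busy at every step, which is what makes the mapping bandwidth-optimal.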
49. NCCL Implementation
• 2 CPUs and 8 GPUs (QPI and PCIe)
Ring algorithm: most collectives are amenable to a bandwidth-optimal implementation on rings, and many topologies can be interpreted as one or more rings [P. Patarasuk and X. Yuan].
50. NCCL Performance
[Chart: bandwidth at different problem sizes on 4 Maxwell GPUs, for All-Gather, All-Reduce, Reduce-Scatter, and Broadcast.]
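For driving these collectives from Python today, NCCL is exposed as a backend of `torch.distributed`; a hedged sketch of All-Reduce and Broadcast with one process per GPU (the launch command is an assumption about your setup, and Reduce-Scatter and All-Gather have analogous calls):

```python
import torch
import torch.distributed as dist

# One process per GPU, e.g. launched with: torchrun --nproc_per_node=4 this_script.py
dist.init_process_group(backend="nccl")        # NCCL supplies the collectives
rank = dist.get_rank()
torch.cuda.set_device(rank)

t = torch.full((1 << 20,), float(rank), device="cuda")
dist.all_reduce(t, op=dist.ReduceOp.SUM)       # All-Reduce: every rank gets the sum
dist.broadcast(t, src=0)                       # Broadcast from rank 0
dist.destroy_process_group()
```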