Ikuro Sato's slides presented at the International Conference on Neural Information Processing (ICONIP) 2017. This work proposes a new update rule for asynchronous, distributed SGD training.
1. Asynchronous, Data-Parallel Deep Convolutional Neural Network Training with Linear Prediction Model for Parameter Transition
Ikuro Sato1), Ryo Fujisaki1),
Yosuke Oyama2), Akihiro Nomura2), and Satoshi Matsuoka2)
Deep Learning 3 (Nov. 16, 2017)
ICONIP 2017
1) Denso IT Laboratory,
2) Tokyo Institute of Technology, Japan
3. Common practices in state-of-the-art CNNs
Recent trend
Computationally intensive models tend to perform well.

  Model       #multiplications per parameter   top-5 error rate @ILSVRC   Reference
  AlexNet       11                             16.4%                      [Krizhevsky+, NIPS2012]
  VGG-19       137                              7.32%                     [Simonyan+, ICLR2015]
  GoogLeNet    221                              6.67%                     [Szegedy+, CVPR2015]
  ResNet       179                              3.57%                     [He+, CVPR2016]
4. Data-parallel, mini-batch SGD to boost training
What is it?
  Model optimization with many processors (GPUs) used in parallel.
How fast is it to train computationally intensive CNNs?
  GoogLeNet training on ImageNet boosted by 16x with 32 GPUs [Iandola+, CVPR2016]
  ResNet training on ImageNet within 1 h with 256 GPUs [Goyal+, 2017]
  ResNet training on ImageNet within 15 min with 1024 GPUs [Akiba+, 2017]
5. Two approaches: SSGD and ASGD
SSGD: Synchronous Stochastic Gradient Descent
  Allows a parameter update only after completing all gradient computations.
  Basic update rule:
    W^{t+1} = W^t - \lambda \sum_{\text{all GPUs}} \left. \frac{\partial J}{\partial W} \right|_{W^t}

ASGD: Asynchronous Stochastic Gradient Descent
  Allows a parameter update without completing all gradient computations.
  Basic update rule:
    W^{t+1} = W^t - \lambda \sum_{\text{some GPUs}} \left. \frac{\partial J}{\partial W} \right|_{W^\tau}
  Gradients are evaluated at old parameters: staleness = t - \tau > 0.
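As a rough illustration (not from the slides), the following NumPy sketch contrasts the two basic rules on a toy per-worker quadratic loss; the loss, worker count, and learning rate are made-up placeholders.

    import numpy as np

    # Toy setup (assumption, not from the slides): worker w holds the loss
    # J_w(W) = 0.5 * ||W - c_w||^2, so dJ_w/dW = W - c_w.
    rng = np.random.default_rng(0)
    num_workers, dim, lam = 8, 4, 0.1
    centers = rng.normal(size=(num_workers, dim))

    def grad(worker, W):
        """Gradient of one worker's loss evaluated at parameters W."""
        return W - centers[worker]

    def ssgd_step(W_t):
        """SSGD: sum gradients from ALL workers, all evaluated at the current W^t."""
        return W_t - lam * sum(grad(w, W_t) for w in range(num_workers))

    def asgd_step(W_t, ready_grads):
        """ASGD: sum whatever gradients have finished; each was computed at an
        older snapshot W^tau, so staleness = t - tau > 0."""
        return W_t - lam * sum(ready_grads)

    W = np.zeros(dim)
    W_stale = W.copy()                          # snapshot some workers started from
    W = ssgd_step(W)                            # one synchronous update
    ready = [grad(w, W_stale) for w in (0, 3)]  # two workers finished, using W^tau
    W = asgd_step(W, ready)                     # one asynchronous update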
6. Which is faster, SSGD or ASGD?
[Figure: conceptual chart spanning low to high update frequency (horizontal) and low to high cost-drop per update (vertical), with steepest descent as the reference; SSGD sits at low update frequency but high cost-drop per update, ASGD at high update frequency but low cost-drop per update.]

"Sync is faster" group: [Chen+, ICLR 2016 workshop], [Jin+, NIPS2016 workshop]
"Async is faster" group: [Zheng+, arXiv:1609.08326], [Gupta+, ICDM2016], [Zhang+, IJCAI2016]

No conclusion yet.
7. Our contributions
Proposes a new update rule, PP-ASGD (Parameter-Predicted ASGD).
  Mitigates the adverse effect of staleness.
  Outperforms ASGD and conditionally outperforms SSGD in speed.

[Figure: same conceptual chart as the previous slide; PP-ASGD combines better gradient "quality" (higher cost-drop per update) than ASGD with a much higher update frequency than SSGD.]
9. SSGD (with collective communication)
[Figure: per-node timeline; each node loads data, computes gradients, then sends gradients & updates in a synchronous collective step.]

Update rule (SSGD with momentum):
  M^t     = \mu M^{t-1} - \lambda \sum_{\text{all nodes}} \left. \frac{\partial J}{\partial W} \right|_{W^t}
  W^{t+1} = W^t + M^t
10. ASGD (with collective communication)
[Figure: per-node timeline; each node loads data and computes gradients asynchronously, flagging when its gradient is ready; in the synchronous communication step, flagged nodes send their gradients & update (then unflag), the others send zeros & update.]

Update rule (ASGD with momentum) [Oyama+, IEEE BigData 2016]:
  M^t     = \mu M^{t-1} - \lambda \sum_{\text{some nodes}} \left. \frac{\partial J}{\partial W} \right|_{W^\tau}
  W^{t+1} = W^t + M^t
  Gradients are evaluated at stale parameters.
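A minimal sketch (same toy assumptions as the earlier snippet) of this momentum form; only which gradients enter the sum, and where they were evaluated, distinguishes SSGD from ASGD.

    mu, lam = 0.99, 0.1   # placeholder momentum and learning rate

    def momentum_step(W_t, M_prev, grads):
        """M^t = mu*M^{t-1} - lam*sum(grads);  W^{t+1} = W^t + M^t.
        SSGD: `grads` come from ALL nodes, evaluated at W^t.
        ASGD: `grads` come from the flagged nodes, evaluated at stale W^tau."""
        M_t = mu * M_prev - lam * sum(grads)
        return W_t + M_t, M_t

    W, M = momentum_step(1.0, 0.0, [0.3, -0.1])   # toy scalar example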
11. PP-ASGD (proposed)
[Figure: same per-node timeline as ASGD, with an added "predict parameters" step before each gradient computation.]

Update rule (PP-ASGD):
  M^t     = \mu M^{t-1} - \lambda \sum_{\text{some nodes}} \left. \frac{\partial J}{\partial W} \right|_{W^\tau + M^{\tau-1} \sum_{s'=1}^{s+1} \mu^{s'}}
  W^{t+1} = W^t + M^t
  Gradients are evaluated at predicted parameters (s = measured staleness).
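A sketch of the two pieces this rule adds, using the same placeholder hyperparameters as above; gradient computation itself is left to the worker.

    mu, lam = 0.99, 0.1   # placeholder hyperparameters

    def predicted_params(W_tau, M_prev, s):
        """Predict where the parameters will be when this gradient is applied:
        W^tau + M^{tau-1} * sum_{s'=1}^{s+1} mu^{s'}   (s = measured staleness)."""
        coeff = sum(mu ** k for k in range(1, s + 2))
        return W_tau + coeff * M_prev

    def pp_asgd_step(W_t, M_prev, grads_at_predicted):
        """Master-side update: M^t = mu*M^{t-1} - lam*sum(grads); W^{t+1} = W^t + M^t.
        Each entry of `grads_at_predicted` was evaluated at predicted_params(...)."""
        M_t = mu * M_prev - lam * sum(grads_at_predicted)
        return W_t + M_t, M_t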
12. PP-ASGD (proposed)
(Same flow and update rule as the previous slide.)

If staleness is zero (s = 0), PP-ASGD becomes Nesterov's Accelerated Gradient method (NAG).
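To make the reduction explicit (a one-line check against the update rule above): with s = 0 (so \tau = t), the staleness-aware coefficient collapses to a single term, \sum_{s'=1}^{s+1} \mu^{s'} = \mu, and the update reads

  M^t     = \mu M^{t-1} - \lambda \sum \left. \frac{\partial J}{\partial W} \right|_{W^t + \mu M^{t-1}}
  W^{t+1} = W^t + M^t

i.e. the gradient is taken at the momentum look-ahead point, which is exactly NAG.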
13. Proposed prediction model for param. transition
Parameter transition is modeled as the stale momentum multiplied by a staleness-aware coefficient.

[Figure: parameter space showing the current parameters W^t. Example: staleness of 2.]
14. Proposed prediction model for param. transition
Parameter transition is modeled as the stale momentum multiplied by a staleness-aware coefficient.

[Figure: parameter space, staleness-of-2 example; the transition by momentum leads to W^t + \mu M^{t-1}, the predicted transition leads to W^t + M^{t-1} \sum_{s'=1}^{2+1} \mu^{s'}, followed by the transition by (stale) gradients; the gradient is still being computed.]
15. Proposed prediction model for param. transition
Parameter transition is modeled as the stale momentum multiplied by a staleness-aware coefficient.

[Figure: same staleness-of-2 example.] For \mu = 0.99, the coefficient is \mu + \mu^2 + \mu^3 = 2.94.
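The coefficient is just a short geometric sum; a two-line check:

    mu, s = 0.99, 2                                  # the example's momentum and staleness
    coeff = sum(mu ** k for k in range(1, s + 2))    # mu + mu^2 + mu^3
    print(round(coeff, 2))                           # 2.94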
16. Proposed prediction model for param. transition
Parameter transition is modeled as the stale momentum multiplied by a staleness-aware coefficient.

[Figure: the parameters advance to W^{t+1} (momentum plus a stale gradient from another node); the gradient at the predicted point is still being computed.]
17. Proposed prediction model for param. transition
Parameter transition is modeled as the stale momentum multiplied by a staleness-aware coefficient.

[Figure: from W^{t+1}, the next prediction is W^{t+1} + M^t \sum_{s'=1}^{2+1} \mu^{s'}; the gradient is still being computed.]
18. Proposed prediction model for param. transition
Parameter transition is modeled as the stale momentum multiplied by a staleness-aware coefficient.

[Figure: the parameters advance further to W^{t+2}; the gradient is still being computed.]
19. Proposed prediction model for param. transition
Parameter transition is modeled as the stale momentum multiplied by a staleness-aware coefficient.

[Figure: the gradient computation finishes at W^{t+2}, and the parameters advance to W^{t+3}.]
20. Proposed prediction model for param. transition
Parameter transition is modeled as the stale momentum multiplied by a staleness-aware coefficient.

[Figure: the predicted point W^t + M^{t-1} \sum_{s'=1}^{2+1} \mu^{s'} is compared with the parameters actually reached, W^{t+3}.]

Hypothesis: they're close!
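A toy check (not from the slides) of why the hypothesis is plausible: if the per-step gradient contribution is small relative to the momentum, three momentum updates from W^t land near the predicted point.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, lam, dim = 0.99, 0.01, 4                   # placeholder hyperparameters
    W = rng.normal(size=dim)
    M_prev = rng.normal(size=dim)

    W0 = W.copy()
    prediction = W0 + M_prev * sum(mu ** k for k in range(1, 4))   # staleness 2 -> 3 terms

    M = M_prev
    for _ in range(3):                             # reach W^{t+1}, W^{t+2}, W^{t+3}
        g = rng.normal(scale=0.1, size=dim)        # stand-in for the summed gradients
        M = mu * M - lam * g
        W = W + M

    print(np.linalg.norm(W - prediction))          # small relative to ||W - W0||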
22. Training speed: PP-ASGD vs ASGD
Proposed PP-ASGD outperforms ASGD by ~2x on (randomly chosen) 32-class ImageNet.

[Figure: validation error rate curves. Resource: 32 GPUs (4 nodes x 8 GPUs); measured staleness ~8.5.]
23. Training speed: PP-ASGD vs SSGD
Proposed PP-ASGD consistently outperforms SSGD by a factor of 1.8-1.9 on 1000-class ImageNet.

[Figure: validation error rate curves on 1000-class ImageNet; PP-ASGD is 1.9x faster, measured as relative speed to reach a 0.6 error rate. Measured staleness: 1.9-2.6.]

Update frequency (Hz):
  GPUs   PP-ASGD (ours)   SSGD
   32        13.4          4.8
   64        12.1          4.7
  128         9.9          4.5
  256         8.2          3.9
24. Parameter prediction accuracy
The proposed parameter transition model
  is most accurate when s = the measured staleness, and
  outperforms ASGD (no prediction) in prediction accuracy for s > 0.

  W_pred(s) \equiv W^\tau + M^{\tau-1} \sum_{s'=1}^{s+1} \mu^{s'}

[Figure: \| W_pred(s) - W_future \|^2, the distance between the (s_0-step) future parameters W_future and the predicted parameters W_pred(s), plotted as a function of s; plot annotations include "No prediction (ASGD)", \| W_pred(0) - W_future \|^2, and "Case of SSGD".]
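A sketch of the quantity plotted here; the variable names are hypothetical, and the snapshots W_tau, M_prev, W_future are assumed to have been logged during training.

    import numpy as np

    def prediction_error(W_tau, M_prev, W_future, mu, s):
        """|| W_pred(s) - W_future ||^2 with
        W_pred(s) = W^tau + M^{tau-1} * sum_{s'=1}^{s+1} mu^{s'}."""
        coeff = sum(mu ** k for k in range(1, s + 2))
        W_pred = W_tau + coeff * M_prev
        return float(np.sum((W_pred - W_future) ** 2))

    # Hypothetical sweep over s, as in the plot:
    # errors = [prediction_error(W_tau, M_prev, W_future, 0.99, s) for s in range(8)]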
25. Conclusion
Proposes a new update rule, PP-ASGD (Parameter-Predicted ASGD).
  Mitigates the adverse effect of staleness by parameter prediction.
  Outperforms ASGD and conditionally outperforms SSGD in speed.

[Figure: same conceptual chart as slide 7; PP-ASGD combines better gradient "quality" (higher loss-drop per update) than ASGD with a much higher update frequency than SSGD.]