Ikuro Sato's slides presented at the International Conference on Neural Information Processing (ICONIP) 2017. This work proposes a new update rule for asynchronous, distributed SGD training.
1. Asynchronous, Data-Parallel Deep Convolutional Neural Network Training with Linear Prediction Model for Parameter Transition
Ikuro Sato1), Ryo Fujisaki1),
Yosuke Oyama2), Akihiro Nomura2), and Satoshi Matsuoka2)
Deep Learning 3 (Nov. 16, 2017)
ICONIP 2017
1) Denso IT Laboratory,
2) Tokyo Institute of Technology, Japan
3. Common practices in state-of-the-art CNNs
Recent trend
Computationally intensive models tend to perform well.

  Model       #multiplications per parameter   top-5 error rate @ILSVRC   Reference
  AlexNet       11                             16.4%                      [Krizhevsky+, NIPS2012]
  VGG-19       137                              7.32%                     [Simonyan+, ICLR2015]
  GoogLeNet    221                              6.67%                     [Szegedy+, CVPR2015]
  ResNet       179                              3.57%                     [He+, CVPR2016]
4. Data-parallel, mini-batch SGD to boost training
What is it?
  Model optimization with many processors (GPUs) used in parallel.
How fast is it to train computationally intensive CNNs?
  GoogLeNet training on ImageNet boosted by 16x with 32 GPUs [Iandola+, CVPR2016]
  ResNet training on ImageNet within 1 h with 256 GPUs [Goyal+, 2017]
  ResNet training on ImageNet within 15 min with 1024 GPUs [Akiba+, 2017]
5. Two approaches: SSGD and ASGD
SSGD: Synchronous Stochastic Gradient Descent
  Allows a parameter update only after completing all gradient computations.
  Basic update rule:
    W^{t+1} = W^t - \lambda \sum_{\text{all GPUs}} \left. \frac{\partial J}{\partial W} \right|_{W^t}

ASGD: Asynchronous Stochastic Gradient Descent
  Allows a parameter update without completing all gradient computations.
  Basic update rule:
    W^{t+1} = W^t - \lambda \sum_{\text{some GPUs}} \left. \frac{\partial J}{\partial W} \right|_{W^\tau}
  Gradients are evaluated at old parameters: staleness = t - \tau > 0.
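As a rough illustration (not from the slides), the following NumPy sketch contrasts the two basic rules on a toy per-worker quadratic loss; the loss, worker count, and learning rate are made-up placeholders.

    import numpy as np

    # Toy setup (assumption, not from the slides): worker w holds the loss
    # J_w(W) = 0.5 * ||W - c_w||^2, so dJ_w/dW = W - c_w.
    rng = np.random.default_rng(0)
    num_workers, dim, lam = 8, 4, 0.1
    centers = rng.normal(size=(num_workers, dim))

    def grad(worker, W):
        """Gradient of one worker's loss evaluated at parameters W."""
        return W - centers[worker]

    def ssgd_step(W_t):
        """SSGD: sum gradients from ALL workers, all evaluated at the current W^t."""
        return W_t - lam * sum(grad(w, W_t) for w in range(num_workers))

    def asgd_step(W_t, ready_grads):
        """ASGD: sum whatever gradients have finished; each was computed at an
        older snapshot W^tau, so staleness = t - tau > 0."""
        return W_t - lam * sum(ready_grads)

    W = np.zeros(dim)
    W_stale = W.copy()                          # snapshot some workers started from
    W = ssgd_step(W)                            # one synchronous update
    ready = [grad(w, W_stale) for w in (0, 3)]  # two workers finished, using W^tau
    W = asgd_step(W, ready)                     # one asynchronous update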
6. Which is faster, SSGD or ASGD?
[Figure: conceptual chart spanning low to high update frequency (horizontal) and low to high cost-drop per update (vertical), with steepest descent as the reference; SSGD sits at low update frequency but high cost-drop per update, ASGD at high update frequency but low cost-drop per update.]

"Sync is faster" group: [Chen+, ICLR 2016 workshop], [Jin+, NIPS2016 workshop]
"Async is faster" group: [Zheng+, arXiv:1609.08326], [Gupta+, ICDM2016], [Zhang+, IJCAI2016]

No conclusion yet.
7. Our contributions
Proposes a new update rule, PP-ASGD (Parameter-Predicted ASGD).
  Mitigates the adverse effect of staleness.
  Outperforms ASGD and conditionally outperforms SSGD in speed.

[Figure: same conceptual chart as the previous slide; PP-ASGD combines better gradient "quality" (higher cost-drop per update) than ASGD with a much higher update frequency than SSGD.]
9. SSGD (with collective communication)
[Figure: per-node timeline; each node loads data, computes gradients, then sends gradients & updates in a synchronous collective step.]

Update rule (SSGD with momentum):
  M^t     = \mu M^{t-1} - \lambda \sum_{\text{all nodes}} \left. \frac{\partial J}{\partial W} \right|_{W^t}
  W^{t+1} = W^t + M^t
10. ASGD (with collective communication)
[Figure: per-node timeline; each node loads data and computes gradients asynchronously, flagging when its gradient is ready; in the synchronous communication step, flagged nodes send their gradients & update (then unflag), the others send zeros & update.]

Update rule (ASGD with momentum) [Oyama+, IEEE BigData 2016]:
  M^t     = \mu M^{t-1} - \lambda \sum_{\text{some nodes}} \left. \frac{\partial J}{\partial W} \right|_{W^\tau}
  W^{t+1} = W^t + M^t
  Gradients are evaluated at stale parameters.
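A minimal sketch (same toy assumptions as the earlier snippet) of this momentum form; only which gradients enter the sum, and where they were evaluated, distinguishes SSGD from ASGD.

    mu, lam = 0.99, 0.1   # placeholder momentum and learning rate

    def momentum_step(W_t, M_prev, grads):
        """M^t = mu*M^{t-1} - lam*sum(grads);  W^{t+1} = W^t + M^t.
        SSGD: `grads` come from ALL nodes, evaluated at W^t.
        ASGD: `grads` come from the flagged nodes, evaluated at stale W^tau."""
        M_t = mu * M_prev - lam * sum(grads)
        return W_t + M_t, M_t

    W, M = momentum_step(1.0, 0.0, [0.3, -0.1])   # toy scalar example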
11. PP-ASGD (proposed)
[Figure: same per-node timeline as ASGD, with an added "predict parameters" step before each gradient computation.]

Update rule (PP-ASGD):
  M^t     = \mu M^{t-1} - \lambda \sum_{\text{some nodes}} \left. \frac{\partial J}{\partial W} \right|_{W^\tau + M^{\tau-1} \sum_{s'=1}^{s+1} \mu^{s'}}
  W^{t+1} = W^t + M^t
  Gradients are evaluated at predicted parameters (s = measured staleness).
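A sketch of the two pieces this rule adds, using the same placeholder hyperparameters as above; gradient computation itself is left to the worker.

    mu, lam = 0.99, 0.1   # placeholder hyperparameters

    def predicted_params(W_tau, M_prev, s):
        """Predict where the parameters will be when this gradient is applied:
        W^tau + M^{tau-1} * sum_{s'=1}^{s+1} mu^{s'}   (s = measured staleness)."""
        coeff = sum(mu ** k for k in range(1, s + 2))
        return W_tau + coeff * M_prev

    def pp_asgd_step(W_t, M_prev, grads_at_predicted):
        """Master-side update: M^t = mu*M^{t-1} - lam*sum(grads); W^{t+1} = W^t + M^t.
        Each entry of `grads_at_predicted` was evaluated at predicted_params(...)."""
        M_t = mu * M_prev - lam * sum(grads_at_predicted)
        return W_t + M_t, M_t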
12. PP-ASGD (proposed)
(Same flow and update rule as the previous slide.)

If staleness is zero (s = 0), PP-ASGD becomes Nesterov's Accelerated Gradient method (NAG).
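To make the reduction explicit (a one-line check against the update rule above): with s = 0 (so \tau = t), the staleness-aware coefficient collapses to a single term, \sum_{s'=1}^{s+1} \mu^{s'} = \mu, and the update reads

  M^t     = \mu M^{t-1} - \lambda \sum \left. \frac{\partial J}{\partial W} \right|_{W^t + \mu M^{t-1}}
  W^{t+1} = W^t + M^t

i.e. the gradient is taken at the momentum look-ahead point, which is exactly NAG.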
13. Proposed prediction model for param. transition
Parameter transition is modeled as the stale momentum multiplied by a staleness-aware coefficient.

[Figure: parameter space showing the current parameters W^t. Example: staleness of 2.]
14. Proposed prediction model for param. transition
Parameter transition is modeled as the stale momentum multiplied by a staleness-aware coefficient.

[Figure: parameter space, staleness-of-2 example; the transition by momentum leads to W^t + \mu M^{t-1}, the predicted transition leads to W^t + M^{t-1} \sum_{s'=1}^{2+1} \mu^{s'}, followed by the transition by (stale) gradients; the gradient is still being computed.]
15. Proposed prediction model for param. transition
Parameter transition is modeled as the stale momentum multiplied by a staleness-aware coefficient.

[Figure: same staleness-of-2 example.] For \mu = 0.99, the coefficient is \mu + \mu^2 + \mu^3 = 2.94.
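The coefficient is just a short geometric sum; a two-line check:

    mu, s = 0.99, 2                                  # the example's momentum and staleness
    coeff = sum(mu ** k for k in range(1, s + 2))    # mu + mu^2 + mu^3
    print(round(coeff, 2))                           # 2.94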
16. Proposed prediction model for param. transition
Parameter transition is modeled as the stale momentum multiplied by a staleness-aware coefficient.

[Figure: the parameters advance to W^{t+1} (momentum plus a stale gradient from another node); the gradient at the predicted point is still being computed.]
17. Proposed prediction model for param. transition
Parameter transition is modeled as the stale momentum multiplied by a staleness-aware coefficient.

[Figure: from W^{t+1}, the next prediction is W^{t+1} + M^t \sum_{s'=1}^{2+1} \mu^{s'}; the gradient is still being computed.]
18. Proposed prediction model for param. transition
Parameter transition is modeled as the stale momentum multiplied by a staleness-aware coefficient.

[Figure: the parameters advance further to W^{t+2}; the gradient is still being computed.]
19. Proposed prediction model for param. transition
Parameter transition is modeled as the stale momentum multiplied by a staleness-aware coefficient.

[Figure: the gradient computation finishes at W^{t+2}, and the parameters advance to W^{t+3}.]
20. Proposed prediction model for param. transition
Parameter transition is modeled as the stale momentum multiplied by a staleness-aware coefficient.

[Figure: the predicted point W^t + M^{t-1} \sum_{s'=1}^{2+1} \mu^{s'} is compared with the parameters actually reached, W^{t+3}.]

Hypothesis: they're close!
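A toy check (not from the slides) of why the hypothesis is plausible: if the per-step gradient contribution is small relative to the momentum, three momentum updates from W^t land near the predicted point.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, lam, dim = 0.99, 0.01, 4                   # placeholder hyperparameters
    W = rng.normal(size=dim)
    M_prev = rng.normal(size=dim)

    W0 = W.copy()
    prediction = W0 + M_prev * sum(mu ** k for k in range(1, 4))   # staleness 2 -> 3 terms

    M = M_prev
    for _ in range(3):                             # reach W^{t+1}, W^{t+2}, W^{t+3}
        g = rng.normal(scale=0.1, size=dim)        # stand-in for the summed gradients
        M = mu * M - lam * g
        W = W + M

    print(np.linalg.norm(W - prediction))          # small relative to ||W - W0||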
22. Training speed: PP-ASGD vs ASGD
Proposed PP-ASGD outperforms ASGD by ~2x on (randomly chosen) 32-class ImageNet.

[Figure: validation error rate curves. Resource: 32 GPUs (4 nodes x 8 GPUs); measured staleness ~8.5.]
23. Training speed: PP-ASGD vs SSGD
Proposed PP-ASGD consistently outperforms SSGD by a factor of 1.8-1.9 on 1000-class ImageNet.

[Figure: validation error rate curves on 1000-class ImageNet; PP-ASGD is 1.9x faster, measured as relative speed to reach a 0.6 error rate. Measured staleness: 1.9-2.6.]

Update frequency (Hz):
  GPUs   PP-ASGD (ours)   SSGD
   32        13.4          4.8
   64        12.1          4.7
  128         9.9          4.5
  256         8.2          3.9
24. Parameter prediction accuracy
The proposed parameter transition model
  is most accurate when s = the measured staleness, and
  outperforms ASGD (no prediction) in prediction accuracy for s > 0.

  W_pred(s) \equiv W^\tau + M^{\tau-1} \sum_{s'=1}^{s+1} \mu^{s'}

[Figure: \| W_pred(s) - W_future \|^2, the distance between the (s_0-step) future parameters W_future and the predicted parameters W_pred(s), plotted as a function of s; plot annotations include "No prediction (ASGD)", \| W_pred(0) - W_future \|^2, and "Case of SSGD".]
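A sketch of the quantity plotted here; the variable names are hypothetical, and the snapshots W_tau, M_prev, W_future are assumed to have been logged during training.

    import numpy as np

    def prediction_error(W_tau, M_prev, W_future, mu, s):
        """|| W_pred(s) - W_future ||^2 with
        W_pred(s) = W^tau + M^{tau-1} * sum_{s'=1}^{s+1} mu^{s'}."""
        coeff = sum(mu ** k for k in range(1, s + 2))
        W_pred = W_tau + coeff * M_prev
        return float(np.sum((W_pred - W_future) ** 2))

    # Hypothetical sweep over s, as in the plot:
    # errors = [prediction_error(W_tau, M_prev, W_future, 0.99, s) for s in range(8)]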
25. Conclusion
Proposes a new update rule, PP-ASGD (Parameter-Predicted ASGD).
  Mitigates the adverse effect of staleness by parameter prediction.
  Outperforms ASGD and conditionally outperforms SSGD in speed.

[Figure: same conceptual chart as slide 7; PP-ASGD combines better gradient "quality" (higher loss-drop per update) than ASGD with a much higher update frequency than SSGD.]