Asynchronous, Data-Parallel
Deep Convolutional Neural Network Training
with Linear Prediction Model
for Parameter Transition
Ikuro Sato1), Ryo Fujisaki1),
Yosuke Oyama2), Akihiro Nomura2), and Satoshi Matsuoka2)
Deep Learning 3 (Nov. 16, 2017)
ICONIP 2017
1) Denso IT Laboratory,
2) Tokyo Institute of Technology, Japan
Ikuro Sato, Denso IT Laboratory, Inc. 1/25
1. Introduction
2. Method
3. Experiment

Common practices in state-of-the-art CNNs
Recent trend: computationally intensive models tend to perform well.

Model       #multiplications per parameter   Top-5 error rate @LSVRC   Reference
AlexNet     11                               16.4%                     [Krizhevsky+, NIPS2012]
VGG-19      137                              7.32%                     [Simonyan+, ICLR2015]
GoogLeNet   221                              6.67%                     [Szegedy+, CVPR2015]
ResNet      179                              3.57%                     [He+, CVPR2016]

Data-parallel, mini-batch SGD to boost training
What is it?
- Model optimization with many processors (GPUs) used in parallel
How fast is it to train computationally intensive CNNs?
- GoogLeNet training on ImageNet boosted by 16x with 32 GPUs [Iandola+, CVPR2016]
- ResNet training on ImageNet within 1h with 256 GPUs [Goyal+, 2017]
- ResNet training on ImageNet within 15 min with 1024 GPUs [Akiba+, 2017]

Two approaches: SSGD and ASGD

SSGD: Synchronous Stochastic Gradient Descent
Allows parameter update only after completing all gradient computations.
Basic update rule:
$$W^{(t+1)} = W^{(t)} - \lambda \sum_{\text{all GPUs}} \left.\frac{\partial J}{\partial W}\right|_{W^{(t)}}$$

ASGD: Asynchronous Stochastic Gradient Descent
Allows parameter update without completing all gradient computations.
Basic update rule:
$$W^{(t+1)} = W^{(t)} - \lambda \sum_{\text{some GPUs}} \left.\frac{\partial J}{\partial W}\right|_{W^{(\tau)}}$$
Gradients are evaluated at old parameters: staleness $= t - \tau > 0$.
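
The two basic update rules differ only in which GPUs contribute and at which parameters their gradients were evaluated. A minimal sketch in plain Python follows, assuming a toy loss J(w) = 0.5 * sum(w_i^2) whose gradient is w itself. The loss, the function names, and the numbers are illustrative assumptions, not taken from the slides.

```python
def grad(w):
    """Gradient of the assumed toy loss J(w) = 0.5 * sum(w_i^2)."""
    return list(w)

def sgd_step(w, grad_points, lr):
    """Apply W <- W - lr * (sum of gradients).

    grad_points lists, for each contributing GPU, the parameters at which
    that GPU evaluated its gradient.
    """
    dim = len(w)
    total = [sum(grad(p)[i] for p in grad_points) for i in range(dim)]
    return [wi - lr * gi for wi, gi in zip(w, total)]

w_t = [1.0, -2.0]     # current parameters W^(t)
w_tau = [1.5, -3.0]   # older parameters W^(tau), tau < t

# SSGD: all 4 GPUs evaluate their gradients at W^(t).
w_sync = sgd_step(w_t, [w_t] * 4, lr=0.01)

# ASGD: only 2 GPUs have finished, and their gradients are from W^(tau).
w_async = sgd_step(w_t, [w_tau] * 2, lr=0.01)

print(w_sync, w_async)
```

In this sketch the asynchronous update is applied to the current parameters but uses gradient information from the older point, which is the staleness effect described above.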

Which is faster, SSGD or ASGD?
[Figure: schematic comparing SSGD (low update frequency, high cost drop per update) and ASGD (high update frequency, low cost drop per update) against steepest descent.]
“Sync is faster” group: [Chen+, ICLR 2016 workshop] [Jin+, NIPS2016 workshop]
“Async is faster” group: [Zheng+, arxiv1609.08326] [Gupta+, ICDM2016] [Zhang+, IJCAI2016]
No conclusion yet.

Our contributions
- Proposes a new update rule, PP-ASGD (Parameter-Predicted ASGD).
- Mitigates the adverse effect of staleness.
- Outperforms ASGD and conditionally outperforms SSGD in speed.
[Figure: schematic of SSGD, ASGD, and PP-ASGD against steepest descent, in terms of update frequency (low/high) and cost drop per update (low/high). PP-ASGD is annotated with better gradient “quality” and much higher update frequency.]

1. Introduction
2. Method
   - SSGD
   - ASGD
   - PP-ASGD (proposed)
3. Experiment

SSGD (with collective communication)
[Flowchart (synchronous): Load, Comp. grad., Send grad. & update.]
Update rule (SSGD with momentum):
$$M^{(t)} = \mu M^{(t-1)} - \lambda \sum_{\text{all nodes}} \left.\frac{\partial J}{\partial W}\right|_{W^{(t)}}$$
$$W^{(t+1)} = W^{(t)} + M^{(t)}$$

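ASGD (with collective communication) [Oyama+, IEEE BigData 2016]
[Flowchart: an asynchronous part runs Load and Comp. grad. and sets a flag (Flag/Unflag). A synchronous part checks “Flagged?”: if yes, the node sends its gradient and updates; if no, it sends zero and updates.]
Update rule (ASGD with momentum):
$$M^{(t)} = \mu M^{(t-1)} - \lambda \sum_{\text{some nodes}} \left.\frac{\partial J}{\partial W}\right|_{W^{(\tau)}}$$
$$W^{(t+1)} = W^{(t)} + M^{(t)}$$
Gradients are evaluated at stale parameters.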
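
The flag mechanism can be read as follows: in every synchronous communication round each node contributes either its finished gradient or zero, so the sum only covers "some nodes". The sketch below simulates one such round in plain Python. It is an illustration under assumptions (the function name, the flag handling, and the numbers are hypothetical), not the implementation from [Oyama+, IEEE BigData 2016].

```python
def collective_update(w, m_prev, node_grads, flagged, lr, mu):
    """One synchronous round of the flag-based ASGD sketch.

    node_grads[i] is the latest (possibly stale) gradient of node i;
    flagged[i] tells whether node i finished a gradient since the last round.
    Unflagged nodes contribute zero to the sum.
    """
    dim = len(w)
    total = [0.0] * dim
    for g, ready in zip(node_grads, flagged):
        if ready:
            for i in range(dim):
                total[i] += g[i]
    # Momentum update: M^(t) = mu * M^(t-1) - lr * (sum over flagged nodes)
    m = [mu * mp - lr * t for mp, t in zip(m_prev, total)]
    # Parameter update: W^(t+1) = W^(t) + M^(t)
    w_next = [wi + mi for wi, mi in zip(w, m)]
    # Assumption: flags are cleared once the gradients have been sent.
    cleared = [False] * len(flagged)
    return w_next, m, cleared

w_next, m_t, flags = collective_update(
    w=[1.0, 2.0],
    m_prev=[0.1, -0.1],
    node_grads=[[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    flagged=[True, False, True],   # node 1 is still computing
    lr=0.1,
    mu=0.9,
)
print(w_next, m_t, flags)
```

Because unflagged nodes send zero, the communication step never waits for a slow node, which is where the higher update frequency of the asynchronous scheme comes from.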

PP-ASGD (proposed)
[Flowchart: the ASGD flowchart (Load, Comp. grad., Flag/Unflag, “Flagged?” yes: send grad. & update, no: send zero & update) with an added “Predict param.” step.]
Update rule (PP-ASGD):
$$M^{(t)} = \mu M^{(t-1)} - \lambda \sum_{\text{some nodes}} \left.\frac{\partial J}{\partial W}\right|_{W^{(\tau)} + M^{(\tau-1)} \sum_{s'=1}^{s+1} \mu^{s'}}$$
$$W^{(t+1)} = W^{(t)} + M^{(t)}$$
Gradients are evaluated at predicted parameters ($s$ = measured staleness).
If staleness is zero ($s = 0$), PP-ASGD becomes Nesterov’s Accelerated Gradient method (NAG).
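
A minimal sketch of the prediction step may help here. It computes the staleness-aware coefficient and the predicted parameters, and checks two properties: with s = 0 the predicted point is the NAG look-ahead point W + mu * M, and if all gradients were zero the prediction for staleness s would coincide exactly with the parameters s + 1 momentum-only updates later. All function names and numbers are illustrative assumptions, and the zero-gradient rollout is only an idealized check, not part of the method.

```python
def staleness_coefficient(mu, s):
    """Sum of mu**k for k = 1 .. s+1, applied to the stale momentum."""
    return sum(mu ** k for k in range(1, s + 2))

def predict_params(w_tau, m_prev, mu, s):
    """Predicted parameters: W^(tau) + M^(tau-1) * sum_{k=1}^{s+1} mu^k."""
    c = staleness_coefficient(mu, s)
    return [w + c * m for w, m in zip(w_tau, m_prev)]

def momentum_rollout(w, m_prev, mu, steps):
    """Apply M <- mu * M, W <- W + M for `steps` updates with zero gradient."""
    w, m = list(w), list(m_prev)
    for _ in range(steps):
        m = [mu * mi for mi in m]
        w = [wi + mi for wi, mi in zip(w, m)]
    return w

w_tau = [1.0, -2.0]    # stale parameters W^(tau)
m_prev = [0.1, 0.05]   # stale momentum M^(tau-1)
mu = 0.99

nag_point = predict_params(w_tau, m_prev, mu, s=0)   # equals W + mu * M
pred_s2 = predict_params(w_tau, m_prev, mu, s=2)     # staleness of 2
rollout_3 = momentum_rollout(w_tau, m_prev, mu, steps=3)
print(staleness_coefficient(mu, 2), nag_point, pred_s2, rollout_3)
```

In real training the gradients are not zero, so the prediction is only approximate. How close it is in practice is an empirical question.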

Proposed prediction model for param. transition
Parameter transition is modeled as the stale momentum multiplied by a staleness-aware coefficient.
[Figure: parameter space showing $W^{(t)}$; example with a staleness of 2.]
[Figure, continued (staleness of 2): starting from $W^{(t)}$, the transition by momentum alone leads to $W^{(t)} + \mu M^{(t-1)}$, while the predicted transition leads to
$$W^{(t)} + M^{(t-1)} \sum_{s'=1}^{2+1} \mu^{s'}$$
which is labeled “grad (computing)”. Legend: transition by momentum; predicted transition; transition by (stale) gradients.]
With $\mu = 0.99$, the coefficient is $\mu + \mu^2 + \mu^3 = 2.94$.
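
The coefficient is a finite geometric series, so it can also be written in closed form. The worked step below is a supplementary check and is not taken from the slides. It assumes $0 \le \mu < 1$ and uses $s'$ for the summation index.

```latex
\sum_{s'=1}^{s+1} \mu^{s'} = \frac{\mu \left(1 - \mu^{s+1}\right)}{1 - \mu}

% Example with \mu = 0.99 and s = 2:
\frac{0.99 \left(1 - 0.99^{3}\right)}{1 - 0.99}
  = \frac{0.99 \times 0.029701}{0.01}
  = 2.940399 \approx 2.94

% Upper bound as the staleness grows:
\lim_{s \to \infty} \sum_{s'=1}^{s+1} \mu^{s'} = \frac{\mu}{1 - \mu} = 99
  \quad \text{for } \mu = 0.99
```

The closed form shows that the coefficient grows with staleness but stays bounded by $\mu / (1 - \mu)$.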
[Figure, subsequent animation frames (staleness of 2), in order:
- $W^{(t+1)}$ appears, reached from $W^{(t)}$ by an update with a stale gradient (“stale grad”).
- The next predicted point, $W^{(t+1)} + M^{(t)} \sum_{s'=1}^{2+1} \mu^{s'}$, appears; the gradient is still being computed (“grad (computing)”).
- $W^{(t+2)}$ appears; the gradient is still being computed.
- $W^{(t+3)}$ appears; the gradient computation is done (“grad (DONE!)”).]
[Figure, final frame: the predicted points $W^{(t)} + M^{(t-1)} \sum_{s'=1}^{2+1} \mu^{s'}$ and $W^{(t+1)} + M^{(t)} \sum_{s'=1}^{2+1} \mu^{s'}$ shown together with $W^{(t+1)}$, $W^{(t+2)}$, and $W^{(t+3)}$.]
Hypothesis: the predicted parameters are close to the actual future parameters.

1. Introduction
2. Method
3. Experiment

Training speed: PP-ASGD vs ASGD
Proposed PP-ASGD outperforms ASGD by ~2x on (randomly chosen) 32-class ImageNet.
[Figure: validation error rate curves.]
resource: 32-GPU (4-node x 8-GPU)
staleness: 8.5

Training speed: PP-ASGD vs SSGD
Proposed PP-ASGD consistently outperforms SSGD by a factor of 1.8-1.9 on 1000-class ImageNet.
[Figure: validation error rate curves on 1000-class ImageNet; annotation: 1.9x faster (relative speed to reach 0.6 error rate).]
staleness: 1.9-2.6

Update frequency (Hz):
#GPUs   PP-ASGD (ours)   SSGD
32      13.4             4.8
64      12.1             4.7
128     9.9              4.5
256     8.2              3.9
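
As a quick arithmetic check on the update-frequency figures, the snippet below computes how many times more often PP-ASGD updates than SSGD at each GPU count. The numbers are copied from the slide; the variable names are illustrative.

```python
# Update frequencies in Hz, copied from the slide: GPUs -> (PP-ASGD, SSGD)
update_hz = {32: (13.4, 4.8), 64: (12.1, 4.7), 128: (9.9, 4.5), 256: (8.2, 3.9)}

# Ratio of PP-ASGD update frequency to SSGD update frequency
ratios = {gpus: pp / ss for gpus, (pp, ss) in update_hz.items()}

for gpus, r in ratios.items():
    print(f"{gpus:>3} GPUs: PP-ASGD updates {r:.2f}x as often as SSGD")
```

The ratios fall from roughly 2.8 at 32 GPUs to roughly 2.1 at 256 GPUs, all above the reported 1.8-1.9x speedup. One plausible reading, offered here as an interpretation rather than a claim from the slides, is that each asynchronous update lowers the loss somewhat less than a synchronous one.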

Parameter prediction accuracy
Distance between the ($s_0$-step) future param $W_{future}$ and the predicted param $W_{pred}(s)$, as a function of $s$:
$$W_{pred}(s) \equiv W^{(\tau)} + M^{(\tau-1)} \sum_{s'=1}^{s+1} \mu^{s'}$$
[Plot: $\|W_{pred}(s) - W_{future}\|_2$ versus $s$; annotations: “No prediction (ASGD)”, $\|W_{pred}(0) - W_{future}\|_2$, “Case of SSGD”.]
The proposed parameter transition model
- is most accurate when $s$ = measured staleness.
- outperforms ASGD in prediction accuracy ($s > 0$).
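
The measurement itself is simple to express in code. The sketch below computes the L2 distance between the predicted parameters and a given future parameter vector for a range of s values, plus the no-prediction baseline. It uses synthetic data only: the "future" parameters are constructed by hand as a prediction for s = 3 plus a small perturbation, so the output illustrates the procedure and does not reproduce the experiment. All names and numbers are assumptions.

```python
import math

def predicted(w_tau, m_prev, mu, s):
    """W_pred(s) = W^(tau) + M^(tau-1) * sum_{k=1}^{s+1} mu^k."""
    coeff = sum(mu ** k for k in range(1, s + 2))
    return [w + coeff * m for w, m in zip(w_tau, m_prev)]

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_curve(w_tau, m_prev, w_future, mu, s_values):
    """Distance ||W_pred(s) - W_future||_2 for each s in s_values."""
    return [l2_distance(predicted(w_tau, m_prev, mu, s), w_future)
            for s in s_values]

w_tau = [1.0, -2.0]
m_prev = [0.1, 0.05]
mu = 0.9

# Synthetic future parameters: the s = 3 prediction plus a small perturbation.
base = predicted(w_tau, m_prev, mu, 3)
w_future = [base[0] + 0.001, base[1] - 0.001]

s_values = list(range(7))
curve = distance_curve(w_tau, m_prev, w_future, mu, s_values)
no_prediction = l2_distance(w_tau, w_future)   # baseline: use W^(tau) as is
best_s = s_values[curve.index(min(curve))]
print(best_s, no_prediction, curve)
```

With real training runs, `w_future` would be the parameters recorded a measured number of steps after `w_tau`, and the curve would be averaged over many such pairs.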

Conclusion
- Proposes a new update rule, PP-ASGD (Parameter-Predicted ASGD).
- Mitigates the adverse effect of staleness by parameter prediction.
- Outperforms ASGD and conditionally outperforms SSGD in speed.
[Figure: schematic of SSGD, ASGD, and PP-ASGD against steepest descent, in terms of update frequency (low/high) and loss drop per update (low/high). PP-ASGD is annotated with better gradient “quality” and much higher update frequency.]
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 

Ikuro Sato's slide presented at ICONIP2017

  • 1. Asynchronous, Data-Parallel Deep Convolutional Neural Network Training with Linear Prediction Model for Parameter Transition. Ikuro Sato 1), Ryo Fujisaki 1), Yosuke Oyama 2), Akihiro Nomura 2), and Satoshi Matsuoka 2). 1) Denso IT Laboratory, 2) Tokyo Institute of Technology, Japan. Deep Learning 3 (Nov. 16, 2017), ICONIP 2017. Ikuro Sato, Denso IT Laboratory, Inc.
  • 3. Common practices in state-of-the-art CNNs. Recent trend: computationally intensive models tend to perform well. Top-5 error rate @LSVRC: AlexNet 16.4% [Krizhevsky+, NIPS2012]; VGG-19 7.32% [Simonyan+, ICLR2015]; GoogLeNet 6.67% [Szegedy+, CVPR2015]; ResNet 3.57% [He+, CVPR2016]. #multiplications per parameter: 137 11 221 179
  • 4. Data-parallel, mini-batch SGD to boost training. What is it? Model optimization with many processors (GPUs) used in parallel. How fast is it to train computationally intensive CNNs? GoogLeNet training on ImageNet boosted by 16x with 32 GPUs [Iandola+, CVPR2016]. ResNet training on ImageNet within 1h with 256 GPUs [Goyal+, 2017]. ResNet training on ImageNet within 15 min with 1024 GPUs [Akiba+, 2017].
  • 5. Two approaches: SSGD and ASGD. SSGD (Synchronous Stochastic Gradient Descent) allows a parameter update after completing all gradient computations. Basic update rule: $W^{(t+1)} = W^{(t)} - \lambda \sum_{\text{all GPUs}} \left.\frac{\partial J}{\partial W}\right|_{W^{(t)}}$. ASGD (Asynchronous Stochastic Gradient Descent) allows a parameter update without completing all gradient computations. Basic update rule: $W^{(t+1)} = W^{(t)} - \lambda \sum_{\text{some GPUs}} \left.\frac{\partial J}{\partial W}\right|_{W^{(\tau)}}$. In ASGD the gradients are evaluated at old parameters: staleness $= t - \tau > 0$.
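The two basic update rules on slide 5 can be sketched as follows. This is a minimal illustration, assuming a toy one-dimensional cost J(W) = W²/2 on every GPU (so the gradient is simply W); the function names, the learning rate and the cost are hypothetical and do not come from the slides.

```python
def grad(w):
    # Gradient of the toy cost J(W) = 0.5 * W**2 (an assumption for illustration).
    return w

def ssgd_step(w_t, lr, n_gpus):
    # SSGD: every GPU evaluates its gradient at the current parameters W(t),
    # and the update waits for all of them.
    return w_t - lr * sum(grad(w_t) for _ in range(n_gpus))

def asgd_step(w_t, stale_params, lr):
    # ASGD: only the GPUs that have finished contribute, and each gradient
    # was evaluated at older parameters W(tau) with tau < t.
    return w_t - lr * sum(grad(w_tau) for w_tau in stale_params)

w_sync = ssgd_step(1.0, lr=0.1, n_gpus=2)
w_async = asgd_step(1.0, stale_params=[1.2], lr=0.1)
```

Here `stale_params` holds the parameter values each finished GPU actually used, which is where staleness enters the update.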
  • 6. Which is faster, SSGD or ASGD? Diagram: SSGD and ASGD placed on axes of update frequency (low to high) and cost-drop per update (low to high), with the steepest-descent direction marked. "Sync is faster" group and "Async is faster" group: [Zheng+, arxiv1609.08326] [Gupta+, ICDM2016] [Zhang+, IJCAI2016] [Chen+, ICLR 2016 workshop] [Jin+, NIPS2016 workshop]. No conclusion yet.
  • 7. Our contributions. Proposes a new update rule, PP-ASGD (Parameter-Predicted ASGD). Mitigates the harm caused by staleness. Outperforms ASGD and conditionally outperforms SSGD in speed. Diagram: SSGD, ASGD and PP-ASGD placed on axes of update frequency (low to high) and cost-drop per update (low to high), with the steepest-descent direction marked; annotations: better gradient "quality", much higher update frequency.
  • 9. SSGD (with collective communication). Flow-chart labels: Load; Comp. grad.; Send grad. & update (Grad); synchronous. Update rule (SSGD with momentum): $M^{(t)} = \mu M^{(t-1)} - \lambda \sum_{\text{all nodes}} \left.\frac{\partial J}{\partial W}\right|_{W^{(t)}}$, $W^{(t+1)} = W^{(t)} + M^{(t)}$.
  • 10. ASGD (with collective communication) [Oyama+, IEEE BigData 2016]. Flow-chart labels: Load; Comp. grad.; Flag; Unflag; Flagged? (yes: Send grad. & update, Grad; no: Send zero & update, Zero); asynchronous and synchronous parts. Update rule (ASGD with momentum): $M^{(t)} = \mu M^{(t-1)} - \lambda \sum_{\text{some nodes}} \left.\frac{\partial J}{\partial W}\right|_{W^{(\tau)}}$, $W^{(t+1)} = W^{(t)} + M^{(t)}$. Gradients are evaluated at stale parameters.
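One way to read the flow chart on slide 10 is that every node takes part in each collective step, contributing its gradient if it has finished (flagged) and zeros otherwise. The sketch below follows that reading on scalar parameters; the helper names and the numbers are hypothetical, and the actual scheme in [Oyama+, IEEE BigData 2016] may differ in detail.

```python
def contribution(flagged, grad_value):
    # A flagged node has a finished gradient to send; an unflagged node sends zero.
    return grad_value if flagged else 0.0

def asgd_momentum_step(w, m, contributions, lr, mu):
    # Summing the contributions mimics the collective communication step:
    # zeros from unfinished nodes leave only "some nodes" in the sum.
    m_next = mu * m - lr * sum(contributions)
    return w + m_next, m_next

# Two nodes: the first has finished its gradient, the second has not.
sent = [contribution(True, 0.5), contribution(False, 0.7)]
w_next, m_next = asgd_momentum_step(w=1.0, m=0.0, contributions=sent, lr=0.1, mu=0.9)
```

Sending zeros keeps the communication pattern identical to SSGD while letting slow nodes skip an update.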
  • 11. PP-ASGD (proposed). Flow-chart labels: Load; Predict param.; Comp. grad.; Flag; Unflag; Flagged? (yes: Send grad. & update, Grad; no: Send zero & update, Zero); asynchronous and synchronous parts. Update rule (PP-ASGD): $M^{(t)} = \mu M^{(t-1)} - \lambda \sum_{\text{some nodes}} \left.\frac{\partial J}{\partial W}\right|_{W^{(\tau)} + M^{(\tau-1)} \sum_{k=1}^{s+1} \mu^{k}}$, $W^{(t+1)} = W^{(t)} + M^{(t)}$. Gradients are evaluated at predicted parameters ($s$ = measured staleness).
  • 12. PP-ASGD (proposed). Flow-chart labels: Load; Predict param.; Comp. grad.; Flag; Unflag; Flagged? (yes: Send grad. & update, Grad; no: Send zero & update, Zero); asynchronous and synchronous parts. Update rule (PP-ASGD): $M^{(t)} = \mu M^{(t-1)} - \lambda \sum_{\text{some nodes}} \left.\frac{\partial J}{\partial W}\right|_{W^{(\tau)} + M^{(\tau-1)} \sum_{k=1}^{s+1} \mu^{k}}$, $W^{(t+1)} = W^{(t)} + M^{(t)}$. Gradients are evaluated at predicted parameters ($s$ = measured staleness). If staleness is zero ($s = 0$), PP-ASGD becomes Nesterov's Accelerated Gradient method (NAG).
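The PP-ASGD rule differs from ASGD only in where the gradient is evaluated. A minimal sketch on scalar parameters, with hypothetical function names, shows the staleness-aware coefficient, the predicted parameters and the momentum update; it is an illustration of the formula on the slide, not the authors' implementation.

```python
def staleness_coefficient(mu, s):
    # Sum of mu**k for k = 1 .. s+1, where s is the measured staleness.
    return sum(mu ** k for k in range(1, s + 2))

def predict_params(w_tau, m_prev, mu, s):
    # Point at which the gradient is evaluated:
    # W(tau) + M(tau-1) * sum_{k=1}^{s+1} mu**k
    return w_tau + m_prev * staleness_coefficient(mu, s)

def pp_asgd_step(w, m, predicted_grads, lr, mu):
    # Same momentum update as ASGD; only the gradient evaluation point changed.
    m_next = mu * m - lr * sum(predicted_grads)
    return w + m_next, m_next

# With s = 0 the evaluation point is W + mu * M, the NAG look-ahead.
lookahead = predict_params(w_tau=1.0, m_prev=0.5, mu=0.9, s=0)
```

For `s = 0` the coefficient reduces to `mu`, which is why the slide identifies this case with NAG.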
  • 13. Proposed prediction model for param. transition. Parameter transition is modeled as stale momentum multiplied by a staleness-aware coefficient. Diagram in parameter space, starting at $W^{(t)}$. Ex) staleness of 2.
  • 14. Proposed prediction model for param. transition. Diagram labels added: $W^{(t)} + \mu M^{(t-1)}$ (transition by momentum); $W^{(t)} + M^{(t-1)} \sum_{k=1}^{2+1} \mu^{k}$ (predicted transition); transition by (stale) gradients; grad (computing).
  • 15. Proposed prediction model for param. transition. Diagram labels added: $\mu + \mu^2 + \mu^3 = 2.94$ for $\mu = 0.99$.
  • 16. Proposed prediction model for param. transition. Diagram labels added: $W^{(t+1)}$; stale grad.
  • 17. Proposed prediction model for param. transition. Diagram labels: $W^{(t)}$; $W^{(t)} + M^{(t-1)} \sum_{k=1}^{2+1} \mu^{k}$; $W^{(t+1)}$; $W^{(t+1)} + M^{(t)} \sum_{k=1}^{2+1} \mu^{k}$; grad (computing).
  • 18. Proposed prediction model for param. transition. Diagram labels added: $W^{(t+2)}$; grad (computing).
  • 19. Proposed prediction model for param. transition. Diagram labels added: $W^{(t+3)}$; grad (DONE!).
  • 20. Proposed prediction model for param. transition. Diagram labels: $W^{(t)}$; $W^{(t)} + M^{(t-1)} \sum_{k=1}^{2+1} \mu^{k}$; $W^{(t+1)}$; $W^{(t+1)} + M^{(t)} \sum_{k=1}^{2+1} \mu^{k}$; $W^{(t+2)}$; $W^{(t+3)}$. Hypothesis: They're close!
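The staleness-aware coefficient is a finite geometric series, so it has a closed form. The worked step below uses the slide's own example values (staleness 2, $\mu = 0.99$); the closed form itself is standard algebra rather than something stated on the slides.

```latex
\sum_{k=1}^{s+1} \mu^{k} = \frac{\mu\,\bigl(1-\mu^{s+1}\bigr)}{1-\mu}, \qquad \mu \neq 1.

% Example: s = 2, \mu = 0.99
\sum_{k=1}^{3} 0.99^{k}
  = \frac{0.99\,(1 - 0.970299)}{0.01}
  = 0.99 \times 2.9701
  = 2.940399 \approx 2.94 .
```

As $\mu \to 1$ the coefficient approaches $s+1$, so with a momentum coefficient near 1 the prediction moves roughly $s+1$ momentum steps ahead.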
  • 22. Training speed: PP-ASGD vs ASGD. Proposed PP-ASGD outperforms ASGD by ~2x on (randomly chosen) 32-class ImageNet. Figure: validation error rate curves. Resource: 32 GPUs (4 nodes x 8 GPUs). Staleness: 8.5.
  • 23. Training speed: PP-ASGD vs SSGD. Proposed PP-ASGD consistently outperforms SSGD by a factor of 1.8-1.9 on 1000-class ImageNet (relative speed to reach 0.6 error rate; figure annotation: 1.9x faster). Figure: validation error rate curves on 1000-class ImageNet. Staleness: 1.9-2.6. Update frequency (Hz):
      GPUs | PP-ASGD (ours) | SSGD
      32   | 13.4           | 4.8
      64   | 12.1           | 4.7
      128  | 9.9            | 4.5
      256  | 8.2            | 3.9
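The update-frequency figures on slide 23 can be turned into ratios with a few lines of arithmetic. The sketch below only divides the numbers given on the slide; the variable names are hypothetical, and the ratio measures how often parameters are updated, not how fast the error rate falls.

```python
# Update frequencies (Hz) from slide 23: GPUs -> (PP-ASGD, SSGD).
update_hz = {32: (13.4, 4.8), 64: (12.1, 4.7), 128: (9.9, 4.5), 256: (8.2, 3.9)}

# Ratio of PP-ASGD to SSGD update frequency for each GPU count.
ratios = {gpus: pp / ss for gpus, (pp, ss) in update_hz.items()}

for gpus, ratio in ratios.items():
    print(f"{gpus} GPUs: PP-ASGD updates {ratio:.2f}x as often as SSGD")
```

At every GPU count the frequency ratio is above the reported 1.8-1.9x speed-up, which would be consistent with a single asynchronous update lowering the cost less than a single synchronous one.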
  • 24. Parameter prediction accuracy. Figure: distance $\lVert W_{pred}^{(s)} - W_{future} \rVert_2$ between the ($s_0$-step) future param $W_{future}$ and the predicted param $W_{pred}^{(s)}$, as a function of $s$, where $W_{pred}^{(s)} \equiv W^{(\tau)} + M^{(\tau-1)} \sum_{k=1}^{s+1} \mu^{k}$. Figure labels: No prediction (ASGD); $\lVert W_{pred}^{(0)} - W_{future} \rVert_2$; Case of SSGD. The proposed parameter transition model is most accurate when $s$ = measured staleness, and outperforms ASGD in prediction accuracy ($s > 0$).
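The distance measured on slide 24 can be reproduced on a toy trajectory. The sketch below runs momentum SGD on a scalar parameter and compares $W_{pred}^{(s)}$ with a future parameter. Two assumptions are mine rather than the slides': the future parameter is taken as $W^{(\tau + s_0 + 1)}$, and the gradients are set to zero so that the trajectory is pure momentum, the case in which the prediction formula is exact.

```python
def coefficient(mu, s):
    # Staleness-aware coefficient: sum of mu**k for k = 1 .. s+1.
    return sum(mu ** k for k in range(1, s + 2))

def simulate(w0, m_init, mu, lr, steps, grad):
    # Momentum SGD on a scalar parameter.
    # ws[t] holds W(t); ms[t] holds M(t-1), the momentum available at step t.
    ws, ms = [w0], [m_init]
    for _ in range(steps):
        m = mu * ms[-1] - lr * grad(ws[-1])
        ws.append(ws[-1] + m)
        ms.append(m)
    return ws, ms

def prediction_error(ws, ms, mu, tau, s, s0):
    # |W_pred(s) - W_future| with W_future = W(tau + s0 + 1) (an assumption).
    w_pred = ws[tau] + ms[tau] * coefficient(mu, s)
    return abs(w_pred - ws[tau + s0 + 1])

# Zero gradients: the trajectory is driven by momentum alone.
ws, ms = simulate(w0=0.0, m_init=1.0, mu=0.9, lr=0.1, steps=10, grad=lambda w: 0.0)
errors = [prediction_error(ws, ms, 0.9, tau=2, s=s, s0=2) for s in range(5)]
```

In this idealized case the error is smallest at `s = s0`; with real gradients the match is only approximate, which is what the slide's measurement examines.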
  • 25. Conclusion. Proposes a new update rule, PP-ASGD (Parameter-Predicted ASGD). Mitigates the harm caused by staleness by parameter prediction. Outperforms ASGD and conditionally outperforms SSGD in speed. Diagram: SSGD, ASGD and PP-ASGD placed on axes of update frequency (low to high) and loss-drop per update (low to high), with the steepest-descent direction marked; annotations: better gradient "quality", much higher update frequency.