Beyond Data and Model Parallelism for Deep Neural Networks
Outline
• Background
  • Existing parallelization strategies
  • Automatically generated strategies
• Overview
  • Deep learning engine “FlexFlow”
  • How to find the best strategy
• Evaluation
  • Comparison with existing parallelization strategies
• Challenge
Background
Large-scale and Complex Deep Neural Network (DNN) Models
• Training large-scale DNN models is computationally expensive.
• Training time can be reduced by parallelizing across devices.
[Figure: Inception v3 model]
“models/research/inception at master · tensorflow/models”. GitHub.
https://github.com/tensorflow/models/tree/master/research/inception , (2019-06-03)
Existing Parallelization Approaches
• Data parallelism: splitting the data across workers
• Model parallelism: splitting the model across workers
Dean et al. (2012). Large Scale Distributed Deep Networks.
In Neural Information Processing Systems Conference.
Data Parallelism
• A replica of the entire DNN is placed on each device.
• Each device processes a subset of the training data.
• The devices synchronize network parameters at the end of each iteration (synchronous training).
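The bullets above can be sketched in a few lines. This is a minimal illustration, assuming a toy one-parameter linear model with a squared-error loss; the function names, the data, and the learning rate are hypothetical and do not come from the slides. Each simulated worker computes a gradient on its own shard, and the averaged gradient is applied once per iteration.

```python
def grad(w, batch):
    """Mean gradient of 0.5 * (w * x - y)^2 over a batch of (x, y) pairs."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(w, data, num_workers, lr):
    """One synchronous step: shard the batch, then average per-worker gradients."""
    shard = len(data) // num_workers  # assumes len(data) is divisible by num_workers
    shards = [data[i * shard:(i + 1) * shard] for i in range(num_workers)]
    local_grads = [grad(w, s) for s in shards]  # every replica sees only its shard
    synced = sum(local_grads) / num_workers     # synchronization at the end of the iteration
    return w - lr * synced


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_parallel = data_parallel_step(0.0, data, num_workers=2, lr=0.1)
w_single = 0.0 - 0.1 * grad(0.0, data)  # the same step on a single device
```

With equal-sized shards, the averaged gradient equals the full-batch gradient, so `w_parallel` and `w_single` agree.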
Model Parallelism
• Each device is assigned a disjoint subset of the DNN.
• This eliminates parameter synchronization but requires data transfers between operators.
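A minimal sketch of model parallelism, assuming a two-layer network without biases or activations; the `Device` class is a hypothetical stand-in for a real accelerator and only holds its own layers. The intermediate activation is the data that has to move between devices.

```python
def dense(x, weights):
    """Multiply a vector by a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class Device:
    """Hypothetical stand-in for one device holding a disjoint set of layers."""

    def __init__(self, layers):
        self.layers = layers

    def forward(self, x):
        for weights in self.layers:
            x = dense(x, weights)
        return x


layer1 = [[1.0, 2.0], [0.0, 1.0]]
layer2 = [[1.0, -1.0]]

dev_a, dev_b = Device([layer1]), Device([layer2])
activation = dev_a.forward([3.0, 4.0])  # computed on the first device
output = dev_b.forward(activation)      # the activation is transferred, then consumed
```

No parameters are shared between `dev_a` and `dev_b`, so nothing has to be synchronized; the cost moves to the transfer of `activation`.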
ImageNet competition
[Figure: timeline with entries for 2016, 2017, 2018, and 2019, including Yamazaki et al.]
Yamazaki et al. (2019). Yet Another Accelerated SGD: ResNet-50 Training on
ImageNet in 74.7 seconds.
Present
The optimal parallelization strategy varies with several factors:
• Hardware architecture
• DNN model architecture
• Training data
As a result, a specialized parallelization strategy has to be designed manually for each case.
Automatically Generated Strategies
• ColocRL (Mirhoseini et al., 2017) uses reinforcement learning
to learn efficient operator assignments for model parallelism.
  • It executes each strategy in the hardware environment to get reward signals, and takes
12-27 hours to find the best placement.
• OptCNN (Jia et al., 2018) uses dynamic programming
to parallelize DNNs with linear operator graphs.
  • It cannot be applied to Recurrent Neural Networks (RNNs).
Overview
• The deep learning engine “FlexFlow” automatically finds
parallelization strategies for arbitrary DNNs and hardware.
• FlexFlow increases training throughput by up to 3.3× over
state-of-the-art approaches.
Z. Jia, M. Zaharia, A. Aiken. (2019). Beyond Data and Model Parallelism for Deep Neural
Networks. In SysML Conference.
Overview “FlexFlow”
1. Input information
  • Operator graph
  • Device topology
2. Search for the optimal parallelization strategy
  • The SOAP search space
  • Generating strategies & simulation
3. Execute the best found strategy
Operator Graph & Device Topology
Operator graph:
• Node = operator in the DNN
  • Convolution
  • Matrix multiplication etc.
• Edge = tensor
  • Output of an operator
  • Input of an operator
Device topology:
• Node = device
  • GPU
  • CPU etc.
• Edge = hardware connection
  • NVLink
  • PCI-e
  • Network link etc.
The SOAP search space
• FlexFlow introduces a comprehensive search space, SOAP, named after its four dimensions:
  • Sample
  • Operator
  • Attribute
  • Parameter
Sample dimension in SOAP
• Sample: partitioning training samples (data parallelism)
[Figure: parallelizing a 1D convolution across GPU1 to GPU4; axes: Sample, Parameter]
Operator dimension in SOAP
• Operator: partitioning the operators in a DNN
[Figure: Convolution#1, Convolution#2, and Convolution#3 placed on GPU1, GPU2, and GPU3; axes: Sample, Parameter]
Attribute dimension in SOAP
• Attribute: partitioning the attributes in a sample
[Figure: parallelizing a 1D convolution across GPU1 to GPU4; axes: Sample, Parameter]
Parameter dimension in SOAP
• Parameter: partitioning the parameters in an operator
[Figure: parallelizing a 1D convolution across GPU1 to GPU4; axes: Sample, Parameter]
Parallelizable dimensions in SOAP space
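To make the sample, attribute, and parameter dimensions concrete, the sketch below partitions a small 1D convolution three ways and reassembles the result. It is an illustration under stated assumptions, not FlexFlow code: the input has a single channel, the convolution is a "valid" cross-correlation, every split is assumed to divide evenly, and all function names are hypothetical. The operator dimension (placing different operators on different devices) is not shown.

```python
def conv1d(x, filters):
    """Valid 1D cross-correlation. x: [sample][position]; filters: [out_channel][tap]."""
    k = len(filters[0])
    return [[[sum(f[t] * s[p + t] for t in range(k)) for p in range(len(s) - k + 1)]
             for f in filters] for s in x]


def split_sample(x, filters, parts):
    """Sample dimension: each part gets a slice of the samples and all filters."""
    n = len(x) // parts
    out = []
    for i in range(parts):
        out += conv1d(x[i * n:(i + 1) * n], filters)
    return out


def split_parameter(x, filters, parts):
    """Parameter dimension: each part gets all samples and a slice of the filters."""
    n = len(filters) // parts
    pieces = [conv1d(x, filters[i * n:(i + 1) * n]) for i in range(parts)]
    return [sum((piece[s] for piece in pieces), []) for s in range(len(x))]


def split_attribute(x, filters, parts):
    """Attribute dimension: each part computes a slice of the output positions."""
    k = len(filters[0])
    n = (len(x[0]) - k + 1) // parts
    # Each input slice carries k - 1 extra positions (a halo) needed at its right edge.
    pieces = [conv1d([s[i * n:(i + 1) * n + k - 1] for s in x], filters)
              for i in range(parts)]
    return [[sum((piece[s][c] for piece in pieces), []) for c in range(len(filters))]
            for s in range(len(x))]


x = [[1, 2, 3, 4, 5, 6], [0, 1, 0, 1, 0, 1], [2, 2, 2, 2, 2, 2], [6, 5, 4, 3, 2, 1]]
filters = [[1, 0, -1], [1, 1, 1]]
reference = conv1d(x, filters)
```

All three partitionings reproduce `reference`; they differ in which tensors each part must hold and which data must be exchanged between parts.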
How to search for the optimal strategy
1. Generate strategy: decide the parallelization for each operator
2. Simulate execution: full simulation & delta simulation
3. Improve strategy: Markov Chain Monte Carlo (MCMC) search algorithm
Generate Strategy
• Define the parallelizable dimensions for each operator.
• One strategy = the combination of the parallelization configurations chosen for every operator.
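A strategy in this sense can be sketched as a mapping from each operator to one configuration. The example below is a simplification under assumptions: a configuration is only a degree of parallelism per dimension, the device assignment of each task is left out, and the `PARALLEL_DIMS` table and operator names are hypothetical.

```python
from itertools import product
from math import prod
import random

# Hypothetical table of parallelizable dimensions per operator type.
PARALLEL_DIMS = {
    "conv1d": ("sample", "attribute", "parameter"),
    "matmul": ("sample", "parameter"),
}


def configs(op_type, num_devices):
    """All degree assignments for one operator that fit on the devices."""
    dims = PARALLEL_DIMS[op_type]
    degrees = [d for d in range(1, num_devices + 1) if num_devices % d == 0]
    return [dict(zip(dims, combo))
            for combo in product(degrees, repeat=len(dims))
            if prod(combo) <= num_devices]


def random_strategy(operators, num_devices, rng):
    """One strategy: a configuration chosen for every operator in the graph."""
    return {name: rng.choice(configs(op_type, num_devices))
            for name, op_type in operators.items()}


operators = {"O1": "conv1d", "O2": "conv1d", "O3": "matmul"}
strategy = random_strategy(operators, num_devices=4, rng=random.Random(0))
```

Even in this reduced form the space grows multiplicatively with the number of operators, which is why the search is guided rather than exhaustive.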
Simulate execution
• Challenge
  • Measuring distributed executions on real hardware is slow.
• Observation
  • The performance of DNN operators is highly predictable because
most DNN operators use dense linear algebra.
  • DNN models use only a small number of distinct operators.
• Execution simulator
  • Measure each distinct operator once.
  • Use the measurements to estimate the cost of different parallelization strategies.
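The "measure once, reuse everywhere" idea can be sketched as a cache keyed by operator type and shape. This is a minimal illustration, not the FlexFlow simulator: `OperatorProfiler` and `fake_measure` are hypothetical names, and `fake_measure` stands in for a real timing run on hardware.

```python
class OperatorProfiler:
    """Caches one measurement per distinct (operator type, shape) pair."""

    def __init__(self, measure):
        self.measure = measure  # hypothetical callable that times an operator on hardware
        self.cache = {}
        self.measurements = 0

    def cost(self, op_type, shape):
        key = (op_type, shape)
        if key not in self.cache:
            self.measurements += 1
            self.cache[key] = self.measure(op_type, shape)
        return self.cache[key]


def fake_measure(op_type, shape):
    """Stand-in for a real timing run: cost proportional to tensor volume."""
    volume = 1
    for extent in shape:
        volume *= extent
    return volume * 1e-6


profiler = OperatorProfiler(fake_measure)
graph = [("conv1d", (32, 128)), ("conv1d", (32, 128)), ("matmul", (32, 64))]
estimated = sum(profiler.cost(op, shape) for op, shape in graph)
```

The graph has three operators but only two distinct ones, so only two measurements are taken.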
Simulate Execution: Generate Task Graph
Parallelization strategy:
• O1, O2: Degree(sample) = 2, GPU1
• O3, O4: Degree(sample) = 2, GPU2
• O5, O6: Degree(sample) = 1, GPU3
[Figure: operator graph with Embedding (O1, O2), Recurrent (O3, O4), and Linear (O5, O6) layers, and the task graph generated from it]
Assumption: data transfer time = tensor size / channel bandwidth
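The transfer-time assumption can be combined with per-task execution times to estimate the end-to-end time of a task graph. The sketch below is a simplification under assumptions: tasks are given in topological order, each device runs its tasks one after another, a single bandwidth value applies to every link, and transfers do not compete for a link. The task names follow the slide; the times and sizes are made up.

```python
def transfer_time(tensor_bytes, bandwidth_bytes_per_s):
    """Assumed communication model: tensor size divided by channel bandwidth."""
    return tensor_bytes / bandwidth_bytes_per_s


def simulate(tasks, edges, bandwidth):
    """Estimate the finish time of the last task.

    tasks: list of (name, device, exe_time) in topological order.
    edges: dict mapping (src, dst) to the size in bytes of the tensor passed along.
    """
    device_of = {name: dev for name, dev, _ in tasks}
    finish, device_free = {}, {}
    for name, dev, exe in tasks:
        ready = 0.0
        for (src, dst), size in edges.items():
            if dst == name:
                # Tensors that stay on the same device cost no transfer time.
                comm = 0.0 if device_of[src] == dev else transfer_time(size, bandwidth)
                ready = max(ready, finish[src] + comm)
        start = max(ready, device_free.get(dev, 0.0))
        finish[name] = start + exe
        device_free[dev] = finish[name]
    return max(finish.values())


edges = {("O1", "O3"): 100.0, ("O3", "O5"): 50.0}
spread = [("O1", "GPU1", 2.0), ("O3", "GPU2", 3.0), ("O5", "GPU3", 1.0)]
packed = [("O1", "GPU1", 2.0), ("O3", "GPU1", 3.0), ("O5", "GPU1", 1.0)]
time_spread = simulate(spread, edges, bandwidth=100.0)
time_packed = simulate(packed, edges, bandwidth=100.0)
```

For this purely sequential chain, spreading the tasks over three devices only adds transfer time, which is the kind of trade-off the simulator is meant to expose.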
Improve Strategy: Full & Delta Simulation
• Full simulation (initial simulation)
  • Predicts the execution time of an initial strategy.
• Delta simulation
  • Does not have to build and simulate a new task graph from scratch.
• The Markov Chain Monte Carlo search proposes a new strategy by
updating the previous strategy.
  • It proposes new candidates until one of the following two criteria is satisfied:
  1. The search time budget for the current initial strategy is exhausted.
  2. The search procedure cannot improve the best strategy for half of the search time.
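The search loop can be sketched with the standard Metropolis acceptance rule. This is a toy illustration under assumptions: a fixed number of steps replaces the two stopping criteria above, a configuration is a single degree per operator, and the cost comes from a made-up, separable table (`COST`) rather than from a simulator. The proposal step changes the configuration of one randomly chosen operator.

```python
import math
import random

# Hypothetical per-operator cost for each degree of parallelism.
COST = {
    "O1": {1: 4.0, 2: 2.0, 4: 3.0},
    "O2": {1: 5.0, 2: 3.0, 4: 1.0},
    "O3": {1: 1.0, 2: 2.0, 4: 4.0},
}


def cost(strategy):
    return sum(COST[op][degree] for op, degree in strategy.items())


def propose(strategy, rng):
    """Replace the configuration of one random operator with a random configuration."""
    candidate = dict(strategy)
    op = rng.choice(sorted(candidate))
    candidate[op] = rng.choice(sorted(COST[op]))
    return candidate


def mcmc_search(initial, steps, beta, rng):
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    for _ in range(steps):
        candidate = propose(current, rng)
        cand_cost = cost(candidate)
        # Metropolis rule: always accept improvements, sometimes accept regressions.
        if cand_cost <= current_cost or rng.random() < math.exp(beta * (current_cost - cand_cost)):
            current, current_cost = candidate, cand_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost


initial = {"O1": 1, "O2": 1, "O3": 1}
best, best_cost = mcmc_search(initial, steps=2000, beta=10.0, rng=random.Random(0))
```

Accepting an occasional worse candidate is what lets the search leave a local minimum; a larger `beta` makes the search greedier.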
Delta Simulation
• An operator in the current parallelization strategy is selected at random,
and its parallelization configuration is replaced by a random configuration.
[Figure: task graphs of the previous simulation and the new simulation]
Evaluation
FlexFlow is evaluated on six real-world DNN benchmarks with two device topologies.
Software dependencies of FlexFlow
Software library | Version | Contributors
cuDNN | 7.3 | NVIDIA
cuBLAS | 9.0 | NVIDIA
Legion | 18.02.0 | LANL, NVIDIA, SLAC, Stanford *
GASNet (optional) | 1.28.0 | Lawrence Berkeley National Laboratory
* Los Alamos National Laboratory (LANL), SLAC National Accelerator Laboratory (SLAC)
Device topologies in experiments
Component | The P100 Cluster | The K80 Cluster
Main memory | 56 GB | 256 GB
CPU | Intel 10-core E5-2600 CPUs × 2 | Intel 10-core E5-2680 CPUs × 2
GPU | NVIDIA Tesla P100 GPUs × 4 | NVIDIA Tesla K80 GPUs × 4
CPU - GPU | shared PCI-e switch | shared PCI-e switch
GPU - GPU | NVLink | separate PCI-e switch
Node - Node | over 100 GB/s EDR Infiniband | over 56 GB/s EDR Infiniband
DNNs in experiments
• Two of the six DNN benchmarks are introduced here.
DNN | Description | Dataset
Convolutional Neural Networks (CNN)
Inception-v3 | A 102-layer CNN with inception modules | ImageNet
Recurrent Neural Networks (RNN)
Neural Machine Translation (NMT) | 4 recurrent layers followed by an attention and a softmax layer | WMT English-German
Per-iteration training performance
[Figure: two panels plotting number of samples/second/GPU against number of devices; higher is better]
Expert-designed strategy for CNN = Krizhevsky (2014)
Expert-designed strategy for RNN = Wu et al. (2016)
Comparison of parallelization performance
[Figure: parallelization performance for NMT on 64 K80 GPUs (16 nodes); lower is better]
Expert-designed strategy = Wu et al. (2016)
Comparison with Different Automated Frameworks
[Figure: comparison with other automated frameworks; higher is better]
Full simulation & Delta simulation
[Figure: search performance with the full and delta simulation algorithms for
the NMT model on 16 P100 GPUs (4 nodes); lower is better]
Simulation time & Real execution time
[Figure: simulation time and real execution time]
Challenge
• FlexFlow does not consider memory constraints.
• MCMC may not be the best search algorithm.
• The simulator's assumptions might be relaxed or even eliminated:
  • Data transfer time is tensor size / channel bandwidth.
  • Execution time is independent of the input data.
Conclusion
• Deep learning engine “FlexFlow”
  • Automatically finds parallelization strategies for arbitrary DNNs and hardware.
  • Increases training throughput by up to 3.3× over state-of-the-art approaches.
• Challenges of FlexFlow
  • Memory constraints
  • Search algorithm
  • Simulator assumptions

Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...Shane Coughlan
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesVictorSzoltysek
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
ManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide DeckManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide DeckManageIQ
 

Kürzlich hochgeladen (20)

Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Pharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodologyPharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodology
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
ManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide DeckManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide Deck
 

Beyond data and model parallelism for deep neural networks

  • 1. Beyond data and model parallelism for deep neural networks
  • 2. Outline • Background • Existing parallelization strategies • Automatically generated strategies • Overview • Deep learning engine “FlexFlow” • How to find the best strategy • Evaluation • Comparison with existing parallelization strategies • Challenges
  • 3. Background • Training large-scale DNN models is computationally expensive: deep neural network (DNN) models are becoming large and complex. • Training time can be reduced by parallelization across devices. [Figure: the Inception v3 model.] Source: “models/research/inception at master · tensorflow/models”. GitHub. https://github.com/tensorflow/models/tree/master/research/inception (2019-06-03)
  • 4. Existing Parallelization Approach • Data parallelism: splitting the data per worker • Model parallelism: splitting the model per worker. Dean et al. (2012). Large Scale Distributed Deep Networks. In Neural Information Processing Systems Conference.
  • 5. Data Parallelism • Each device holds a replica of the entire DNN. • Each device processes a subset of the training data. • Each device synchronizes network parameters at the end of each iteration (synchronous). Source: Dean et al. (2012).
  • 6. Model Parallelism • Each device is assigned a disjoint subset of the DNN. • This eliminates parameter synchronization but requires data transfers between operators. Source: Dean et al. (2012).
  • 7. ImageNet competition [Figure: timeline of results from 2016 to 2019, ending with Yamazaki et al.] Yamazaki et al. (2019). Yet Another Accelerated SGD: ResNet-50 Training on ImageNet in 74.7 seconds.
  • 8. Present • The optimal parallelization strategy varies with several factors: hardware architecture, DNN model architecture, and training data. • As a result, a special parallelization strategy has to be designed manually for each case.
  • 9. Automatically Generated Strategies • ColocRL (Mirhoseini et al., 2017) uses reinforcement learning to learn efficient operator assignments for model parallelism. • It executes each strategy in the hardware environment to get reward signals, and takes 12-27 hours to find the best placement. • OptCNN (Jia et al., 2018) uses dynamic programming to parallelize linear DNNs. • It cannot be applied to recurrent neural networks (RNNs).
  • 10. Overview • The deep learning engine “FlexFlow” automatically finds parallelization strategies for arbitrary DNNs and hardware. • FlexFlow increases training throughput by up to 3.3× over state-of-the-art approaches. Z. Jia, M. Zaharia, A. Aiken. (2019). Beyond Data and Model Parallelism for Deep Neural Networks. In SysML Conference.
  • 11. Overview “FlexFlow” 1. Input information • Operator graph • Device topology 2. Search for the optimal parallelization strategy • The SOAP search space • Strategy generation and simulation 3. Execute the best strategy found
  • 12. Overview “FlexFlow” 1. Input information • Operator graph • Device topology 2. Search for the optimal parallelization strategy • The SOAP search space • Strategy generation and simulation 3. Execute the best strategy found
  • 13. Operator Graph & Device Topology • Operator graph: node = operator in the DNN (convolution, matrix multiplication, etc.); edge = tensor (output of one operator, input of another). • Device topology: node = device (GPU, CPU, etc.); edge = hardware connection (NVLink, PCI-e, network link, etc.). Source: Jia et al. (2019).
  • 14. Overview “FlexFlow” 1. Input information • Operator graph • Device topology 2. Search for the optimal parallelization strategy • The SOAP search space • Strategy generation and simulation 3. Execute the best strategy found
  • 15. The SOAP search space • FlexFlow introduces a comprehensive SOAP search space: Sample, Operator, Attribute, Parameter.
  • 16. Sample dimension in SOAP • Sample: partitioning training samples (data parallelism) • Operator • Attribute • Parameter [Figure: parallelizing a 1D convolution across GPU1 to GPU4; axes: sample, parameter.]
  • 17. Operator dimension in SOAP • Sample: partitioning training samples (data parallelism) • Operator: partitioning the operators in the DNN • Attribute • Parameter [Figure: Convolution #1 to #3 placed on GPU1 to GPU3; axes: sample, parameter.]
  • 18. Attribute dimension in SOAP • Sample: partitioning training samples (data parallelism) • Operator: partitioning the operators in the DNN • Attribute: partitioning the attributes within a sample • Parameter [Figure: parallelizing a 1D convolution across GPU1 to GPU4; axes: sample, parameter.]
  • 19. Parameter dimension in SOAP • Sample: partitioning training samples (data parallelism) • Operator: partitioning the operators in the DNN • Attribute: partitioning the attributes within a sample • Parameter: partitioning the parameters within an operator [Figure: parallelizing a 1D convolution across GPU1 to GPU4; axes: sample, parameter.]
  • 20. Parallelizable dimensions in SOAP space [Table.] Source: Jia et al. (2019).
  • 21. Overview “FlexFlow” 1. Input information • Operator graph • Device topology 2. Search for the optimal parallelization strategy • The SOAP search space • Strategy generation and simulation 3. Execute the best strategy found
  • 22. How to search for the optimal strategy • Three repeated steps: generate strategy, simulate execution, improve strategy • Markov Chain Monte Carlo (MCMC) search algorithm • Full simulation and delta simulation • Decision of parallelization for each operator
  • 23. Generate Strategy • Define the parallelizable dimensions for each operator. • One strategy = the combination of the parallelization choices made for every operator. Source: Jia et al. (2019).
  • 24. Simulate Execution • Challenge: measuring distributed executions on real hardware is slow. • Observations: the performance of DNN operators is highly predictable because most DNN operators use dense linear algebra; DNN models use only a small number of distinct operators. • Execution simulator: measure each distinct operator once, then use the measurements to estimate the cost of different parallelization strategies.
  • 25. Simulate Execution: Generate Task Graph • Parallelization strategy: O1, O2: Degree(sample) = 2, GPU1; O3, O4: Degree(sample) = 2, GPU2; O5, O6: Degree(sample) = 1, GPU3. [Figure: the operator graph (O1 to O6; embedding, recurrent, and linear layers) and the task graph generated from it.] • Assumption: data transfer time = tensor size / channel bandwidth.
  • 26. Improve Strategy: Full & Delta Simulation • Full simulation (initial simulation): predicts the execution time of an initial strategy. • Delta simulation: avoids building and simulating a new task graph from scratch. • The Markov Chain Monte Carlo search proposes a new strategy by updating the previous one. • It keeps proposing new candidates until one of two criteria is satisfied: 1. The search time budget for the initial strategy is exhausted. 2. The search has not improved the best strategy for half of the search time.
  • 27. Delta Simulation • An operator in the current parallelization strategy is selected at random, and its parallelization configuration is replaced by a random configuration. [Figure: task graphs of the previous simulation and the new simulation.]
  • 28. Evaluation • FlexFlow is evaluated on six real-world DNN benchmarks with two device topologies. • Software dependencies of FlexFlow (library, version, contributors): cuDNN 7.3 (NVIDIA); cuBLAS 9.0 (NVIDIA); Legion 18.02.0 (LANL, NVIDIA, SLAC, Stanford); GASNet 1.28.0, optional (Lawrence Berkeley National Laboratory). LANL = Los Alamos National Laboratory; SLAC = SLAC National Accelerator Laboratory.
  • 29. Device topologies in experiments • The P100 cluster: main memory 56 GB; CPU: Intel 10-core E5-2600 CPUs × 2; GPU: NVIDIA Tesla P100 GPUs × 4; CPU-GPU: shared PCI-e switch; GPU-GPU: NVLink; node-node: over 100 GB/s EDR Infiniband. • The K80 cluster: main memory 256 GB; CPU: Intel 10-core E5-2680 CPUs × 2; GPU: NVIDIA Tesla K80 GPUs × 4; CPU-GPU: shared PCI-e switch; GPU-GPU: separate PCI-e switch; node-node: over 56 GB/s EDR Infiniband. Source: Jia et al. (2019).
  • 30. DNNs in experiments • Two of the six DNN benchmarks are introduced here. • Convolutional neural network (CNN): Inception-v3, a 102-layer CNN with inception modules; dataset: ImageNet. • Recurrent neural network (RNN): Neural Machine Translation (NMT), 4 recurrent layers followed by an attention and a softmax layer; dataset: WMT English-German.
  • 31. Per-iteration training performance [Figure: number of samples per second per GPU versus number of devices; higher is better.] Expert-designed strategy for CNN = Krizhevsky (2014); expert-designed strategy for RNN = Wu et al. (2016). Source: Jia et al. (2019).
  • 32. Comparison of parallelization performance [Figure: parallelization performance for NMT on 64 K80 GPUs (16 nodes); lower is better.] Expert-designed strategy = Wu et al. (2016). Source: Jia et al. (2019).
  • 33. Comparison of Different Automated Frameworks [Figure; higher is better.] Source: Jia et al. (2019).
  • 34. Full simulation & Delta simulation [Figure: search performance with the full and delta simulation algorithms for the NMT model on 16 P100 GPUs (4 nodes); lower is better.] Source: Jia et al. (2019).
  • 35. Simulation time & Real execution time [Figure.] Source: Jia et al. (2019).
  • 36. Challenges • FlexFlow does not consider memory constraints. • MCMC may not be the best search algorithm. • The simulator's assumptions might be relaxed or even eliminated: data transfer time is tensor size / channel bandwidth; execution time is independent of the data.
  • 37. Conclusion • Deep learning engine “FlexFlow”: automatically finds parallelization strategies for arbitrary DNNs and hardware; increases training throughput by up to 3.3× over state-of-the-art approaches. • Challenges of FlexFlow: memory constraints, search algorithm, assumptions.
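
The execution-simulator idea on slides 24 and 25 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FlexFlow's implementation: `Task`, `OperatorProfiler`, `simulate_chain`, and the `measure` callback are hypothetical names, and the sketch handles only a simple chain of tasks rather than a full task graph. It shows the two points the slides make: each distinct operator is measured once and the measurement is reused, and a data transfer is costed as tensor size divided by channel bandwidth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    op_type: str   # kind of operator, e.g. "embedding", "recurrent", "linear"
    device: str    # device the task is placed on, e.g. "GPU1"
    batch: int     # number of samples this task processes


def transfer_time(tensor_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Assumption stated on the slides: tensor size / channel bandwidth."""
    return tensor_bytes / bandwidth_bytes_per_s


class OperatorProfiler:
    """Measures each distinct (op_type, batch) pair once and caches the result."""

    def __init__(self, measure):
        self._measure = measure   # hypothetical callback that runs the real measurement
        self._cache = {}
        self.calls = 0            # how many real measurements were taken

    def time(self, op_type: str, batch: int) -> float:
        key = (op_type, batch)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._measure(op_type, batch)
        return self._cache[key]


def simulate_chain(tasks, tensor_bytes, bandwidth, profiler) -> float:
    """Estimate the time of tasks executed one after another.

    A transfer is charged only when two consecutive tasks sit on
    different devices; compute time comes from the cached measurements.
    """
    total = 0.0
    previous = None
    for task in tasks:
        if previous is not None and previous.device != task.device:
            total += transfer_time(tensor_bytes, bandwidth)
        total += profiler.time(task.op_type, task.batch)
        previous = task
    return total
```

For example, with a stand-in `measure` that always returns one second, a chain of two recurrent tasks on GPU1 followed by a linear task on GPU2 needs only two real measurements, because the second recurrent task reuses the first one's result.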

Editor's notes

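  1. I am 工藤純, a first-year master's student in the 塙 laboratory, and I will now present under the title "Large-scale deep learning with optimal parallelization strategies." Thank you for your attention.
  2. This is today's outline. After explaining the background, I will describe the FlexFlow framework, then the evaluation of the experimental results obtained with it, and finally the challenges FlexFlow faces.
  3. First, the background. Deep learning is now applied in many fields, from image recognition to natural language processing. In pursuit of higher accuracy, training datasets and the number of model layers have become enormous, and many neural network models with complex structures have appeared. As a result, training neural network models has become very computationally expensive, and training can take several months in some cases. To address this, there is a trend toward distributing the work across multiple devices, that is, reducing training time through parallelization.
  4. Next, existing parallelization approaches for deep learning. They fall mainly into two categories: data parallelism and model parallelism. I will explain both in a little more detail.
  5. First, data parallelism. In short, data parallelism splits each batch across multiple devices during training. Each device holds a replica of the entire neural network, and each device trains on different data. Data parallelism can be further divided into synchronous and asynchronous variants. At present the asynchronous variant is rarely used, and synchronous data parallelism is adopted in most cases.
  6. Next, model parallelism. Unlike data parallelism, each device does not hold a replica of the entire neural network. Model parallelism is based on the idea of partitioning the computation that the DNN performs. Because the devices do not hold replicas of the DNN, a device does not need to synchronize its parameters with other devices. However, a device must receive the outputs of computations on other devices, and that communication becomes overhead. Data parallelism and model parallelism are the common existing parallelization approaches.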
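  7. Having explained data and model parallelism, I should add that the approach used in practice today is almost always synchronous data parallelism. As a concrete example, over the past several years groups have competed on how fast the ResNet-50 model can be trained on the ImageNet dataset without losing accuracy. To go beyond simple parallelization and speed up training further without losing accuracy, various techniques such as strategy tuning by experts have been applied, and optimized parallelization strategies have been proposed.
  8. The problem here is how to design the optimal parallelization strategy. Hardware architectures, DNN model architectures, and training data are all becoming larger and more complex. Even if a strategy is optimal for a particular combination of hardware, DNN model, and training data, it is likely to stop being optimal when any one of them changes, for example when only the hardware architecture changes. The strategy therefore has to be redesigned each time, taking these three factors and others into account. Moreover, as the factors themselves grow more complex, more complex strategies are needed, which is likely to be difficult unless experts spend time on it. At present, then, complex parallelization strategies are designed by experts who consider the factors case by case.
  9. To resolve this situation, in which experts produce strategies by hand, there is research on generating optimal parallelization strategies automatically. The first is a framework called ColocRL. It uses reinforcement learning to find a good model-parallel placement, but finding a strategy takes a very long time. The second is a framework called OptCNN. It uses dynamic programming, but it cannot be applied to networks with a recurrent structure such as RNNs. I will not go into the details; it is enough to understand that research on automatic strategy discovery is under way.
  10. Now the main topic. The paper I am introducing today is 〜〜〜. It was presented at the SysML conference in 2019 and proposes a framework that discovers parallelization strategies automatically. Expert-designed strategies and data parallelism often cannot be applied to every DNN and every hardware configuration. The FlexFlow framework introduced here can be applied to any DNN and hardware. FlexFlow recorded higher throughput than existing parallelization approaches.
  11. First, the FlexFlow workflow. An operator graph and a device topology are given as input. FlexFlow then searches for the optimal parallelization strategy based on them. Finally, it actually executes the strategy that the search predicted to give the best result.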
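  12. First, a brief explanation of the two inputs given to FlexFlow.
  13. Like TensorFlow and PyTorch, FlexFlow takes an operator graph (computation graph) as input. Nodes of the graph represent computations in the DNN, such as convolutions. Edges represent tensors that are inputs to or outputs of operators. The device topology describes the hardware architecture used for training. Its nodes are the CPUs and GPUs that perform computation, and its edges are the connections between devices. FlexFlow searches for the optimal parallelization strategy from these two inputs.
  14. Next, the strategy search.
  15. To reason about parallelization strategies, FlexFlow introduces a search space called SOAP. I will explain the SOAP search space first, because the later explanations depend on it.
  16. First, the Sample dimension of the SOAP search space. The Sample dimension concerns how the training data, in other words the batch, is partitioned. Parallelizing along this dimension therefore corresponds to data parallelism.
  17. Next, the Operator dimension. It allows the computations in the DNN to be partitioned. Operators in a DNN can be regarded as corresponding roughly to the computations of individual layers.
  18. Next, the Attribute dimension. It concerns partitioning the elements within a single sample. For example, if a sample is one image, a subset of the image's pixels is one such partition.
  19. Finally, the Parameter dimension. It concerns partitioning the parameters of a single operator. In other words, it partitions model parameters rather than samples. These are the four dimensions of the SOAP search space.
  20. The table on this slide shows which SOAP dimensions each parallelization approach actually considers. "Hybrid parallelism" indicates whether parallelization along multiple dimensions can be applied at the same time. "Supported DNNs" indicates the DNN models to which the approach applies. Data parallelism cannot run unless the entire DNN model and its parameters fit on a single device. The expert-designed strategy proposed by Krizhevsky applies only to CNNs, and the one proposed by Wu et al. applies only to RNNs. The two automatic approaches mentioned earlier, ColocRL and OptCNN, do not support all DNNs either. Very few strategies use hybrid parallelism.
  21. Now that the SOAP search space is clear, I will explain how FlexFlow searches for the optimal strategy.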
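  22. This diagram shows how FlexFlow searches for the optimal parallelization strategy. It first generates some strategy, simulates it, improves it, simulates again, and repeats. Strategy improvement uses a randomized algorithm, the Markov Chain Monte Carlo method. I will now explain this search flow in detail.
  23. When generating the initial strategy, the parallelizable dimensions are first defined for each operator of the DNN to be trained. The table lists the parallelizable dimensions of each operator. The Sample dimension only splits the batch, so it can of course be applied to every operator. A strategy is the set of configurations that specify which hardware each operator is assigned to and how it is parallelized. For load balancing, FlexFlow defines strategies so that all partitions have the same size. (Channels represent different neurons.)
  24. With a strategy generated, on to execution simulation. Measuring training time on real hardware is very slow, so the idea is to simulate rather than execute, and thereby quickly predict the execution time under a given strategy. The reinforcement learning approach mentioned earlier measured time by actually running training on hardware, which is why finding a strategy took so long. Most DNN operators work on dense matrices, so their execution time can be predicted with high accuracy. DNN models also use very few kinds of operators; for example, the Neural Machine Translation model has more than 100 layers but only 4 kinds of operators. Based on this, the execution simulator actually measures the execution time of each kind of operator under each configuration, and reuses the results when simulating strategies.
  25. Simulation requires a task graph, which is generated from the parallelization strategy. The figure is an example of a simple recurrent neural network. Degree(sample) = 2 means that o1, o2, o3, and o4 are parallelized in the Sample dimension with degree 2; in other words, the batch is split in two. From the operator graph on the left and the parallelization strategy, the task graph on the right is generated. Edges of the task graph represent data transfers between devices. When simulating, FlexFlow assumes that the data transfer time equals the tensor size divided by the channel bandwidth. The simulation also relies on three other assumptions. I will not cover them in this talk; please see the handout.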
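  26. With the task graph built, on to the simulation itself. Only the first simulation is a full simulation; from the second one on, a method called delta simulation is used. The delta simulation is the important part of FlexFlow. It was devised from the observation that there is no need to rebuild the task graph from scratch and simulate it again when considering a new strategy. FlexFlow updates part of the current strategy to propose a new one, and repeats this to improve the strategy step by step, using the Markov Chain Monte Carlo search algorithm. Delta simulation and strategy improvement continue until one of two conditions is met. The first is that the preset search time budget runs out; this budget is set by a human. The second is that no better strategy has been found after half of the search time. (In the full simulation, execution time is computed with Dijkstra's algorithm; in the delta simulation, it is computed with the Bellman-Ford algorithm.) The full simulation predicts the execution time when an existing strategy, such as data parallelism or an expert-designed strategy, is used as the initial strategy.
  27. As shown in the figure, the delta simulation picks one operator from the previous strategy at random and resets its parallelization configuration at random to produce a new strategy. Because only one operator's configuration has changed, the execution times of the other operators do not need a complex recomputation. Whether to accept the newly proposed strategy is decided by the MCMC method from the simulation result. The strategy keeps changing, and the final strategy is executed as the optimal parallelization strategy. That is how FlexFlow searches for the optimal parallelization strategy.
  28. Now the experimental results and their evaluation. The paper evaluates six DNN benchmarks on two device topologies. FlexFlow's software dependencies are as shown in the table. FlexFlow is also published on GitHub. (5 minutes left!)
  29. These are the two device topologies used in the experiments. The main memory, CPU, and GPU entries are per-node values.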
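  30. Six benchmarks were measured in the experiments, but I will pick two of them. One is Inception-v3, a 102-layer CNN with inception modules, trained on the ImageNet dataset. The other is Neural Machine Translation, an RNN, evaluated on the WMT English-German dataset.
  31. These are the result graphs. The vertical axis is the number of samples per second per GPU. The horizontal axis is the number of devices, with the number of nodes in parentheses. Each line corresponds to a parallelization approach and a device topology. The dotted lines show the ideal throughput. As the graphs show, for both Inception-v3 and Neural Machine Translation, FlexFlow achieves better throughput than the other two strategies on both device topologies.
  32. This compares execution for NMT on 64 K80 GPUs. The leftmost graph is the per-iteration execution time, where FlexFlow is the shortest. The middle graph shows that FlexFlow transfers far less data. The rightmost graph shows that the total task execution time is nearly on par with the expert-designed strategy.
  33. This compares FlexFlow with the automatic strategy search frameworks described near the beginning. FlexFlow shows higher throughput in both cases. (All recurrent layers sharing the same parameters were merged and evaluated as a single operator.)
  34. This compares strategy search using only full simulation with search that uses delta simulation. Both eventually converge to similar parallelization strategies. With delta simulation, the strategy converges and the search finishes about twice as fast as with full simulation.
  35. This graph compares simulated time with actual execution time. Note that both axes are in seconds and on a log scale. Neither shows values far from the prediction.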
  36. Finally, FlexFlow's challenges. FlexFlow does not consider memory constraints when simulating. So even if an optimal parallelization strategy is found, it may not be executable if it exceeds the memory limit. I said that strategy updates use the Markov Chain Monte Carlo method, but the authors themselves say there is no guarantee that it is the best algorithm. FlexFlow's simulation also rests on four assumptions, and they can easily fail to hold. Data transfer time equal to tensor size divided by channel bandwidth is only a theoretical value, and it is hard to say that execution time is independent of the data. The assumptions may therefore need to be revisited. Simulation gives a very good insight into what is worth spending time on executing.
  37. That concludes my presentation. One more point: the full simulation, that is, the initial simulation, starts from an existing approach, and it would be nice if that part could be searched as well.
  38. As you can see, it is quite complex. Vertical = sample, horizontal = parameter. Compared to data parallelism, this strategy reduces the parameter synchronization costs by 75% and the per-iteration execution time by 12%. For parallelizing the same Inception-v3 model on four K80 GPUs we observe that the best discovered strategy tends to parallelize operators on adjacent GPUs with a direct connection to reduce the communication costs.
  39. Parallelizing NMT on four P100 GPUs. First, for a layer with a large number of network parameters and little computation (e.g., embed layers), it performs the computation on a single GPU to eliminate parameter synchronization. Second, for a layer with a large number of parameters and heavy computation (e.g., softmax layers), FlexFlow uses parallelism in the parameter dimension and assigns the computation for a subset of parameters to each task. This reduces parameter synchronization costs while maintaining load balance. Third, for multiple recurrent layers (e.g., LSTM and attention layers), FlexFlow uses concurrency among different layers as well as parallelism within each operator to reduce parameter synchronization costs while balancing load.
  40. In short, it is something like a random walk with probabilities.
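
The search loop described in notes 26, 27, and 40 can be sketched as a small Metropolis-style random walk. This is a minimal illustration under assumptions, not FlexFlow's code: `mcmc_search`, `configs`, `cost`, and `beta` are hypothetical names; the acceptance rule is the standard Metropolis rule, which the slides do not spell out; and the two stopping criteria are expressed in proposal counts rather than wall-clock time. In FlexFlow the cost would come from the execution simulator; here it is any function of the strategy.

```python
import math
import random


def mcmc_search(initial, configs, cost, steps, beta=1.0, seed=0):
    """Random-walk search over per-operator configurations.

    initial: dict mapping operator name -> starting configuration
    configs: dict mapping operator name -> list of allowed configurations
    cost:    function returning the (simulated) execution time of a strategy
    steps:   proposal budget, standing in for the search time budget
    """
    rng = random.Random(seed)
    current = dict(initial)
    current_cost = cost(current)
    best, best_cost = dict(current), current_cost
    last_improvement = 0

    for step in range(1, steps + 1):
        # Propose: pick one operator at random and give it a random configuration.
        op = rng.choice(sorted(configs))
        proposal = dict(current)
        proposal[op] = rng.choice(configs[op])
        proposal_cost = cost(proposal)

        # Metropolis rule (an assumption here): always accept an improvement,
        # accept a worse strategy with probability exp(beta * (old - new)).
        if proposal_cost <= current_cost or rng.random() < math.exp(
            beta * (current_cost - proposal_cost)
        ):
            current, current_cost = proposal, proposal_cost

        if current_cost < best_cost:
            best, best_cost = dict(current), current_cost
            last_improvement = step

        # Second stopping criterion: no improvement for half of the budget.
        if step - last_improvement >= steps / 2:
            break

    return best, best_cost
```

Accepting an occasional worse strategy is what lets the walk leave a local minimum, while keeping `best` separately ensures that the strategy returned is never worse than the starting one.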