OpenCL caffe: Accelerating and enabling a
cross platform machine learning framework
Junli Gu gujunli@gmail.com
Yibing Liu (Tsinghua University)
Yuan Gao
Maohua Zhu (UCSB)
Presented by Hugh Perkins (ASAPP)
Deep learning brings challenges to system design
– Deep Learning: DNN model + Big Data
• Complex model: millions to billions of parameters
• Big Data input: OCR: 100M, Speech: 10B, CTR: 100B
– System is the final enabler
• Model training: takes weeks on CPU + GPU clusters
• Model deployment: trained model deployed for various application scenarios
[Figure: DNN model + Big Data input → recognition results]
DNNs Everywhere!
Supercomputers (1000s of GPUs), datacenters (100k-1M servers), tablets and smartphones (700M in China), wearable devices and IoTs (billions?)
Functionality | Offline DNN training   | Deploy DNN on cloud              | Deploy mobile DNN apps | Deploy on wearables and IoTs
H/W systems   | CPU + GPU clusters     | CPU clusters or CPU+GPU clusters | ARM/GPU/SoC            | ARM/SoC/FPGA
Scale         | Small scale (hundreds) | 100k-1M                          | 700M                   | Billions
Opportunities for OpenCL: cross platform DNN deployment
• Current trend: DNN will be everywhere.
• Cross platform compatibility is becoming a challenge for internet giants.
• However, most DNN frameworks are based on CUDA, a closed format, which limits the deployment breadth of DNN systems.
The goal of OpenCL caffe
• Hierarchical framework that serves as a machine learning OS
• Software level
  ‒ machine learning SDK and APIs
  ‒ CNN, MLP, RNN, LSTM, etc.
• Hardware level
  ‒ hardware resource allocation and utilization
  ‒ optimized DNN and math libraries
• Workload partition
  ‒ CPU: data processing and main loop iteration
  ‒ GPU: major DNN kernel computation
Original CUDA caffe from UC Berkeley: https://github.com/BVLC/caffe
Two-phase strategy
• Phase one: OpenCL backend porting and analysis
  – Not a straightforward engineering port; algorithm convergence can easily be broken
  – Re-architecting is required due to key differences between CUDA and OpenCL
• Phase two: OpenCL caffe performance optimizations
  – With algorithm correctness established, improve performance
  – Current BLAS libraries are not optimized for DNN computing: why, and how to improve without modifying BLAS?
OpenCL Caffe Framework
• Hybrid CPU and GPU implementation for each layer
• Caffe is the most popular framework in industry these days
  – Complexity: ~70k lines of code
  – Originally designed for C++ & CUDA
• Seamless switch between CPU and GPU layers (usage sketch below)
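The CPU/GPU switch is exposed through Caffe's global mode setting. A minimal usage sketch, assuming the standard BVLC-style caffe API (header paths and details may differ across caffe versions):

```cpp
// Minimal sketch of the seamless CPU/GPU switch (BVLC-style caffe API;
// treat header path and exact symbols as version-dependent).
#include "caffe/caffe.hpp"

int main() {
  // Run every layer's CPU implementation (Forward_cpu / Backward_cpu):
  caffe::Caffe::set_mode(caffe::Caffe::CPU);

  // ...or dispatch the same layers to the GPU backend (Forward_gpu / Backward_gpu):
  caffe::Caffe::SetDevice(0);
  caffe::Caffe::set_mode(caffe::Caffe::GPU);
  return 0;
}
```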
[Figure: prototype → training → deployment workflow, with the catalog of supported Caffe layers: Accuracy, Euclidean_loss, Hinge_loss, Infogain_loss, Loss, Memory_data, Multinomial_logistic_loss, Neuron, Data, BNLL, Concat, Conv, Dropout, Eltwise_product, Flatten, HDF5_data, HDF5_output, Im2col, Image_data, Inner_product, LRN, Pooling, Power, ReLU, Sigmoid_cross_entropy_loss, Sigmoid, Softmax, Softmax_loss, Split, Tanh, and Window_data layers]
[Figure: per-layer GPU call graph — each layer's Forward_gpu/Backward_gpu maps onto device kernels such as MaxPoolForward/MaxPoolBackward, AvePoolForward/AvePoolBackward, ReLUForward/ReLUBackward, Kernel_get_max, Kernel_softmax_div, im2col_gpu/col2im_gpu, and the BLAS wrappers caffe_gpu_gemm, caffe_gpu_gemv, caffe_gpu_axpy, caffe_gpu_axpby, caffe_gpu_scal, caffe_gpu_dot, caffe_gpu_asum, and caffe_gpu_scale]
OpenCL porting challenges and re-architecting
• Memory layout & data coherence
  – Mutable data structures
  – Optimal buffer allocation for each layer
• Hide data transfers inside the underlying hardware layers
• Added an extra OpenCL wrapper layer compared to CUDA
  – Hides low-level boilerplate such as clSetKernelArg calls (a wrapper sketch follows the layer diagram below)
[Figure: three-layer architecture — Layer 1: C++ machine learning interfaces; Layer 2: OpenCL wrappers (caffe_lrn, caffe_pool, caffe_max, caffe_relu, ...); Layer 3: GPU kernels]
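To make the wrapper layer concrete, here is a minimal, hypothetical helper in the spirit of caffe_relu/caffe_pool: it hides the clSetKernelArg and clEnqueueNDRangeKernel boilerplate behind a single call. This is a sketch, not the project's actual wrapper code.

```cpp
// Hypothetical Layer-2 style wrapper: hides clSetKernelArg / clEnqueueNDRangeKernel
// boilerplate behind one C++ call. Not the actual OpenCL-caffe code.
#include <CL/cl.h>
#include <cstddef>

// Launch a 1-D kernel over `count` elements with one input and one output buffer.
cl_int caffe_launch_1d(cl_command_queue queue, cl_kernel kernel,
                       size_t count, cl_mem in, cl_mem out) {
  cl_int err = CL_SUCCESS;
  cl_int n = static_cast<cl_int>(count);
  err |= clSetKernelArg(kernel, 0, sizeof(cl_int), &n);
  err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &in);
  err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &out);
  if (err != CL_SUCCESS) return err;

  const size_t local = 256;                                     // workgroup size
  const size_t global = ((count + local - 1) / local) * local;  // rounded up
  return clEnqueueNDRangeKernel(queue, kernel, 1, nullptr,
                                &global, &local, 0, nullptr, nullptr);
}

// A ReLU wrapper then reduces to one line at the C++ interface layer:
// caffe_launch_1d(queue, relu_fwd_kernel, blob_count, bottom_buf, top_buf);
```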
[Figure: forward and backward passes through an example network — data layer, three stages of conv 5x5 + ReLU + pooling 3x3 (stride 2), two inner-product layers, and SoftMax + log loss]
Layer-wise porting to guarantee correctness
• DNN is a deep layered structure and algorithm convergence is fragile; gradient checking is a well-known challenge.
  – Local correctness: per-layer unit tests (sketched below)
  – Global correctness: comparing convergence curves against the CPU/CUDA baseline
• When porting the conv layers, only the conv layers run in OpenCL; all other layers stay on the CPU.
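The per-layer unit test amounts to running the same layer on the CPU reference path and on the OpenCL path with identical inputs, then comparing outputs within a tolerance. A sketch of such a check (the tolerance and helper are illustrative, not the actual caffe test harness):

```cpp
// Sketch of a per-layer correctness check: compare a layer's OpenCL (GPU) output
// against the CPU reference output for the same input. Tolerance is illustrative.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

bool outputs_match(const std::vector<float>& cpu_out,
                   const std::vector<float>& gpu_out,
                   float tol = 1e-4f) {
  if (cpu_out.size() != gpu_out.size()) return false;
  for (size_t i = 0; i < cpu_out.size(); ++i) {
    const float ref = cpu_out[i];
    const float diff = std::fabs(ref - gpu_out[i]);
    // Relative tolerance for large magnitudes, absolute tolerance near zero.
    if (diff > tol * std::max(1.0f, std::fabs(ref))) {
      std::printf("mismatch at %zu: cpu=%f gpu=%f\n", i, ref, gpu_out[i]);
      return false;
    }
  }
  return true;
}
// Usage: run the layer's Forward_cpu and Forward_gpu on the same input blob,
// then call outputs_match(cpu_top, gpu_top) before porting the next layer.
```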
OpenCL backend bottleneck analysis
• OpenCL's online compilation frequently calls clBuildProgram
  – Too many DNN kernels to create!
  – Profiling shows 63 clBuildProgram calls, taking up 68% of the time
• DNN falls into BLAS' poor-performance area
  – Irregular, tall-and-skinny matrix sizes from different layers
  – The bottleneck exists for all BLAS implementations: cuBLAS, clBLAS, etc.
  – clBLAS is 3-5x slower than cuBLAS, the biggest performance gap to catch up
OpenCL backend bottleneck analysis (continued)
• AMD R9 Fury vs. NVIDIA GTX 980
  – Peak performance: 7.2 vs. 4.6 TFLOPS
  – Yet OpenCL caffe (before optimization) is 6x slower than CUDA caffe
OpenCL caffe performance optimizations
• Avoid OpenCL online compilation overheads
  – Precompile and save the kernels (see the sketch after this list)
  – Works as long as the hardware does not change
• Boost data parallelism
  – Batched data layout transformation
  – Brings DNN matrix sizes into BLAS' better-performing range
• Boost task parallelism
  – Multiple command queues
  – Increase the number of concurrent tasks
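One way to realize "precompile and save the kernels" with standard OpenCL calls — a sketch assuming a single device and with error handling trimmed, not the project's actual caching code: build from source once, dump the device binary via clGetProgramInfo(CL_PROGRAM_BINARIES), and reload it with clCreateProgramWithBinary on later runs. The cached binary is only valid while the device and driver stay the same.

```cpp
// Sketch of kernel-binary caching to avoid repeated clBuildProgram calls.
// Binaries are device/driver specific, so this only helps when the hardware
// (and driver) does not change between runs. Error handling trimmed for brevity.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

// First run: after building from source, dump the device binary to disk.
void save_program_binary(cl_program prog, const char* path) {
  size_t size = 0;
  clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, nullptr);
  std::vector<unsigned char> bin(size);
  unsigned char* ptr = bin.data();
  clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(ptr), &ptr, nullptr);
  FILE* f = std::fopen(path, "wb");
  std::fwrite(bin.data(), 1, size, f);
  std::fclose(f);
}

// Later runs: rebuild the program object from the cached binary instead of source.
cl_program load_program_binary(cl_context ctx, cl_device_id dev, const char* path) {
  FILE* f = std::fopen(path, "rb");
  std::fseek(f, 0, SEEK_END);
  const size_t size = static_cast<size_t>(std::ftell(f));
  std::fseek(f, 0, SEEK_SET);
  std::vector<unsigned char> bin(size);
  std::fread(bin.data(), 1, size, f);
  std::fclose(f);

  const unsigned char* ptr = bin.data();
  cl_int status = 0, err = 0;
  cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &size, &ptr,
                                              &status, &err);
  clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);  // finalize (fast)
  return prog;
}
```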
Batched data layout transformation optimization
• Batched data layout scheme
  – A pipeline packs small matrices into bigger ones (see the sketch below)
  – Increases data parallelism
  – Unleashes the GPU's computing power
• Notes
  – The optimization applies to machine learning frameworks in general
  – When integrated within sgemm, this is called batched sgemm
[Figure: original vs. batched matrix sizes]
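A conceptual sketch of the packing step (not the actual OpenCL caffe pipeline): instead of one GEMM per image with M = H_out × W_out rows, the im2col outputs of a whole minibatch are stacked so a single GEMM sees M' = batch × M rows — e.g. conv1's M = 3025 becomes 48400 with a batch of 16, matching the table that follows. The im2col callable and matrix layout here are illustrative assumptions.

```cpp
// Conceptual sketch of the batched data-layout transformation (not the actual
// OpenCL-caffe pipeline). Per-image im2col matrices of shape [M x K] are stacked
// into one [batch*M x K] matrix so a single large GEMM replaces `batch` small,
// tall-and-skinny GEMMs.
#include <algorithm>
#include <functional>
#include <vector>

// im2col is taken as a callable so the sketch stays self-contained; in caffe it
// expands each image's conv patches into an M x K matrix
// (M = H_out*W_out, K = C_in*kH*kW).
using Im2colFn = std::function<std::vector<float>(const std::vector<float>&)>;

std::vector<float> pack_batched_im2col(const std::vector<std::vector<float>>& images,
                                       int M, int K, const Im2colFn& im2col) {
  const int batch = static_cast<int>(images.size());
  std::vector<float> packed(static_cast<size_t>(batch) * M * K);
  for (int b = 0; b < batch; ++b) {
    std::vector<float> cols = im2col(images[b]);   // M x K, row-major
    std::copy(cols.begin(), cols.end(),
              packed.begin() + static_cast<size_t>(b) * M * K);
  }
  return packed;                                   // (batch*M) x K
}

// One sgemm on the packed matrix then computes the whole batch:
//   C[(batch*M) x N] = packed[(batch*M) x K] * W^T[K x N]
// e.g. conv1: M = 3025, K = 363, N = 96, batch = 16  ->  M' = 48400.
```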
• Batched transformation significantly unrolls the matrix size
  – Bigger, more regular matrices
  – M, N, K can be aligned to 4/8/16/32 (BLAS-preferred sizes)
  – In forward propagation M is scaled up; in backward propagation N and K are scaled up (algorithm limitations)
• Optimal batch number
  – Depends on hardware properties and input data size
  – 16 or 32 on AMD GPUs for the ImageNet data set
Batched data layout transformation optimization (matrix sizes for forward propagation)

Layers | Original M, N, K | Unrolled M', N', K' | Speedup
conv1  | 3025, 96, 363    | 48400, 96, 363      | 11x
conv2  | 729, 128, 1200   | 11664, 128, 1200    | 12x
conv3  | 169, 384, 2034   | 2704, 384, 2034     | 10x
conv4  | 169, 192, 1728   | 2704, 192, 1728     | 9x
conv5  | 169, 128, 1728   | 2704, 128, 1728     | 16x
Boost task parallelism
• Workload imbalance among DNN layers is inherent
• Luckily, we can make use of model parallelism across multiple command queues (sketched below)
• Performance improvement depends on layer structure, data size and hardware resources
[Figure: command queue 1 and command queue 2 run concurrently to improve GPU utilization]
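A minimal sketch of the multi-queue idea in plain OpenCL (illustrative only; the real scheduling in OpenCL caffe is more involved): create two in-order command queues on the same device, enqueue independent branches on each, and synchronize before dependent layers consume the results.

```cpp
// Sketch of task parallelism via two OpenCL command queues on one device.
// branch_a / branch_b are kernels of two independent layer branches; their
// arguments are assumed to be set elsewhere.
#include <CL/cl.h>

void run_branches_concurrently(cl_context ctx, cl_device_id dev,
                               cl_kernel branch_a, cl_kernel branch_b,
                               size_t global_a, size_t global_b) {
  cl_int err = CL_SUCCESS;
  // Two independent in-order queues; the device may overlap their work.
  cl_command_queue q1 = clCreateCommandQueue(ctx, dev, 0, &err);
  cl_command_queue q2 = clCreateCommandQueue(ctx, dev, 0, &err);

  // Independent branches go to different queues.
  clEnqueueNDRangeKernel(q1, branch_a, 1, nullptr, &global_a, nullptr,
                         0, nullptr, nullptr);
  clEnqueueNDRangeKernel(q2, branch_b, 1, nullptr, &global_b, nullptr,
                         0, nullptr, nullptr);

  // Synchronize both queues before dependent layers consume the results.
  clFinish(q1);
  clFinish(q2);

  clReleaseCommandQueue(q1);
  clReleaseCommandQueue(q2);
}
```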
Performance evaluation
• OpenCL batched vs. clBLAS
  – 4.5x speedup without modifying clBLAS
• OpenCL vs. CUDA caffe (apples-to-apples)
  – Similar performance
• OpenCL vs. cuDNN v2
  – 2x gap
  – Potential to catch up with low-level hardware optimization
Conclusions
• OpenCL caffe
  – Enables a cross-platform DNN framework
• Optimized towards competitive performance
  – Data parallelism: batched data layout transformation
  – Task parallelism: makes use of model parallelism
  – 4.5x speedup on top of the clBLAS library
• Existing challenges for OpenCL across platforms
  – Differences among hardware manufacturers' extensions
  – Queueing efficiency, command queue synchronization overheads, runtime efficiency
  – Low-level hardware optimization tool chains for highly optimized machine learning libraries
OpenCL Caffe is at: https://github.com/gujunli/OpenCL-caffe
Speaker notes
1. This is Hugh Perkins. Take the opportunity to say something about your OpenCL DNN work! I am delivering the talk on behalf of the author, Junli. She could not come to give the presentation due to visa issues. When the authors submitted the paper, Junli was at AMD, but she is now working at Tesla. The project is open source, so the talk is not on behalf of any company at this point.
2. Deep learning applications have posed great challenges to system design. They are both compute intensive and memory intensive, due to the high complexity of DNN models and Big Data input. Over the past three years, DNN models have kept increasing their number of parameters, from millions to billions. The Big Data samples used to train a model can range from hundreds of millions to hundreds of billions depending on the application, and the data is said to grow 10x per year. Thus deep learning applications pose great challenges to system design.
3. The transition of DNN systems: DNN has become one of the hottest internet trends in the past two years, and the current trend indicates that DNN applications will be everywhere. Cross-platform compatibility is becoming a challenge; I see this as an opportunity for OpenCL. Let me first introduce the DNN process a little: a DNN model is trained on supercomputers with Big Data input, then deployed on the cloud for online recognition; these days, DNN apps are also deployed on mobile and wearable devices. I have listed the scale of each category and the opportunity. Offline-training supercomputers are small scale, in the hundreds of CPU + GPU machines; performance is critical for internet players, but our performance is currently not competitive. The online cloud is much bigger scale, from 100k to 1M servers, and is more tolerant on performance; there is a potential opportunity for AMD to win with better performance per dollar. The timing is right: the online cloud is transitioning from CPU clusters to CPU + GPU clusters. In mobile we have no products.
4. Due to DNN models' high complexity, we use a two-phase strategy. First we introduce the OpenCL porting strategies that guarantee algorithm convergence; then we analyze OpenCL's performance bottlenecks in the DNN domain and propose a few optimization techniques to improve hardware resource utilization and boost OpenCL runtime efficiency.
5. Layer 1: C++ interfaces (for domain experts). Layer 2: OpenCL wrapper hides hardware details (for systems). Layer 3: underlying GPU kernels (for device-level optimizations).
6. DNN is a deep layered structure; if we port all layers together, algorithm convergence can easily be destroyed and it is very painful to debug, due to millions of parameters and stochastic gradients. Gradient checking is a well-known challenge for any machine learning framework development. When porting from CUDA to OpenCL, the hardware abstraction details, data transfers and other aspects are all different, so we port layer by layer.
7. Optimal batch size is 16 or 32 on AMD GPUs.
8. Even with fast math libraries, DNN workloads are inherently imbalanced across layers. As shown in the figure, in the AlexNet model each conv layer has a different number of channels, and as the layers go deeper the feature map sizes shrink; these factors mean the amount of computation differs for every layer. When we map the layers to the GPU, some layers keep the GPU busy while others use only part of it. So, how do we balance GPU utilization? Luckily, we can make use of model parallelism. What is model parallelism? It refers to dividing the whole model into smaller chunks or branches and running them on different computing nodes; this is useful when the whole model cannot fit on one GPU, or to accelerate by mapping to different nodes. The same parallelism can be used to balance the workload imbalance between layers. DNN frameworks usually compute each branch sequentially; we instead apply multiple command queues. As DNN models show more branches in their structure (for example the Google Inception network, where most layers have 3 or 4 branches), this technique can run each branch concurrently to improve utilization. Performance effect: for small CIFAR, about 10% speedup; for large ImageNet, 12% for a minibatch size of 10. For large minibatch sizes we sometimes observe a slowdown due to interference between the two queues and extra synchronization overheads. The final performance improvement depends on layer structure, data size and hardware resources.
9. Continuous community effort is needed to improve. It is hard to achieve high hardware utilization without low-level hardware languages and optimization. Cross-platform compatibility comes for free through OpenCL, but you cannot get the best performance for free; the most efficient implementations always come from customized low-level optimizations.