1. OpenCL caffe: Accelerating and enabling a cross-platform machine learning framework
Junli Gu gujunli@gmail.com
Yibing Liu (Tsinghua University)
Yuan Gao
Maohua Zhu (UCSB)
Presented by Hugh Perkins (ASAPP)
2. Deep learning brings challenges to system design
– Deep learning: DNN model + Big Data
• Complex models: millions to billions of parameters
• Big Data input: OCR: 100M samples, Speech: 10B, CTR: 100B
– The system is the final enabler
• Model training: takes weeks on CPU + GPU clusters
• Model deployment: trained models are deployed across diverse application scenarios
[Diagram: DNN model + Big Data input → Recognition results]
3. DNNs Everywhere!
| Platform | Functionality | H/W systems | Scale |
| Supercomputers | Offline DNN training | CPU + GPU clusters | Small scale (hundreds, 1000s of GPUs) |
| Datacenters | Deploy DNN on the cloud | CPU clusters or CPU + GPU clusters | 100k-1M servers |
| Tablets, smartphones | Deploy mobile DNN apps | ARM/GPU/SoC | 700M (in China) |
| Wearables, IoT | Deploy on wearables and IoT | ARM/SoC/FPGA | Billions? |
Opportunities for OpenCL: cross-platform DNN deployment
Current trend: DNNs will be everywhere.
Cross-platform compatibility is becoming a challenge for internet giants.
However, most DNN frameworks are based on CUDA, a closed format that limits the deployment breadth of DNN systems.
4. The goal of OpenCL caffe
• A hierarchical framework that serves as a machine learning OS
• Software level
‒ machine learning SDK and APIs
‒ CNN, MLP, RNN, LSTM, etc.
• Hardware level
‒ hardware resource allocation and utilization
‒ optimized DNN and math libraries
• Workload partition (a sketch follows below)
‒ CPU: data processing and main-loop iteration
‒ GPU: the bulk of DNN kernel computation
Original CUDA caffe from UC Berkeley: https://github.com/BVLC/caffe
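A minimal sketch of this CPU/GPU workload partition, with hypothetical DataLoader and Net types standing in for the actual Caffe classes:

// Minimal sketch of the CPU/GPU split described above. DataLoader, Net,
// and Batch are hypothetical stand-ins, not the actual Caffe classes.
#include <cstdio>

struct Batch { const float* data; const float* labels; };

struct DataLoader {            // CPU side: decode/augment input batches
  Batch next() { /* read and preprocess samples on the CPU */ return {}; }
};

struct Net {                   // GPU side: layer kernels via OpenCL
  float forward(const Batch& b)  { /* enqueue per-layer GPU kernels */ (void)b; return 0.f; }
  void  backward()               { /* enqueue gradient kernels */ }
  void  applyUpdate(float lr)    { /* SGD weight update on the GPU */ (void)lr; }
};

int main() {
  DataLoader loader;
  Net net;
  const float lr = 0.01f;
  for (int iter = 0; iter < 10000; ++iter) {  // main loop runs on the CPU
    Batch b = loader.next();                  // CPU: data processing
    float loss = net.forward(b);              // GPU: DNN kernel computation
    net.backward();
    net.applyUpdate(lr);
    if (iter % 100 == 0) std::printf("iter %d loss %f\n", iter, loss);
  }
}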
5. Two-phase strategy
• Phase one: OpenCL backend porting and analysis
– Not a straightforward engineering port; algorithm convergence can be destroyed
– Re-architecting is needed due to key differences between CUDA and OpenCL
• Phase two: OpenCL caffe performance optimizations
– With algorithm correctness established, improve performance
– Current BLAS libraries are not optimized for DNN computation: why, and how can we improve without modifying BLAS?
6. OpenCL Caffe Framework
• Hybrid CPU and GPU implementation for each layer
• Caffe is the most popular framework in industry these days
– Complexity: ~70k lines of code
– Originally designed for C++ & CUDA
• Seamless switch between CPU/GPU layers (see the dispatch sketch below)
• Supported layers: Accuracy, Euclidean_loss, Hinge_loss, Infogain_loss, Loss, Memory_data, Multinomial_logistic_loss, Neuron, Data, Bnll, Concat, Conv, Dropout, Eltwise_product, Flatten, Hdf5_data, Hdf5_output, Im2col, Image_data, Inner_product, Lrn, Pooling, Power, Relu, Sigmoid_cross_entropy_loss, Sigmoid, Softmax, Softmax_loss, Split, Tanh, Window_data
• Workflow: prototype → training → deployment
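The seamless CPU/GPU switch follows Caffe's dispatch pattern: each layer routes Forward/Backward to a _cpu or _gpu implementation based on a global mode flag (Caffe::mode() in the real code). A simplified, self-contained sketch, not the actual Caffe class hierarchy:

// Simplified sketch of how a Caffe-style layer switches between CPU and
// GPU paths at runtime; the real Caffe classes carry more machinery.
#include <vector>

enum class Mode { CPU, GPU };
static Mode g_mode = Mode::GPU;   // global mode flag, as in Caffe::mode()

struct Blob { std::vector<float> data; };

class ReLULayer {
 public:
  void Forward(const Blob& bottom, Blob& top) {
    if (g_mode == Mode::CPU) Forward_cpu(bottom, top);
    else                     Forward_gpu(bottom, top);
  }
 private:
  void Forward_cpu(const Blob& bottom, Blob& top) {
    top.data.resize(bottom.data.size());
    for (size_t i = 0; i < bottom.data.size(); ++i)
      top.data[i] = bottom.data[i] > 0 ? bottom.data[i] : 0;
  }
  void Forward_gpu(const Blob& bottom, Blob& top) {
    // In OpenCL caffe this enqueues the ReLUForward kernel through the
    // OpenCL wrapper layer instead of a CUDA kernel launch.
    Forward_cpu(bottom, top);  // placeholder so the sketch runs anywhere
  }
};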
7. [Diagram: per-layer GPU call graphs showing how each layer's Forward_gpu/Backward_gpu decomposes into kernels and BLAS calls:
– Pooling: MaxPoolForward/MaxPoolBackward, AvePoolForward/AvePoolBackward
– Convolution: im2col_gpu, col2im_gpu, caffe_gpu_gemm, caffe_gpu_gemv
– ReLU: ReLUForward, ReLUBackward
– Inner product: caffe_gpu_gemm, caffe_gpu_gemv
– Softmax: kernel_get_max, kernel_softmax_div, caffe_gpu_gemm, caffe_gpu_gemv
– BLAS helpers: caffe_gpu_axpy, caffe_gpu_axpby, caffe_gpu_scal, caffe_gpu_dot, caffe_gpu_asum, caffe_gpu_scale]
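To make the convolution branch of this call graph concrete, here is a hedged CPU-side sketch of the im2col + GEMM pattern (im2col_gpu followed by caffe_gpu_gemm in the diagram above); the naive loops below are illustrative stand-ins, not the GPU implementation:

// Reference sketch of the im2col + GEMM data flow used by the conv layer.
#include <vector>

// Unroll input patches into columns (stride 1, no padding):
// output is (C*K*K) x (OH*OW).
void im2col(const std::vector<float>& im, int C, int H, int W,
            int K, std::vector<float>& col) {
  int OH = H - K + 1, OW = W - K + 1;
  col.assign((size_t)C * K * K * OH * OW, 0.f);
  for (int c = 0; c < C; ++c)
    for (int kh = 0; kh < K; ++kh)
      for (int kw = 0; kw < K; ++kw)
        for (int oh = 0; oh < OH; ++oh)
          for (int ow = 0; ow < OW; ++ow)
            col[((((size_t)c * K + kh) * K + kw) * OH + oh) * OW + ow] =
                im[((size_t)c * H + oh + kh) * W + ow + kw];
}

// Naive SGEMM: C = A (MxK) * B (KxN); stands in for caffe_gpu_gemm.
void gemm(int M, int N, int Kdim, const float* A, const float* B, float* Cm) {
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n) {
      float acc = 0.f;
      for (int k = 0; k < Kdim; ++k) acc += A[m * Kdim + k] * B[k * N + n];
      Cm[m * N + n] = acc;
    }
}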
8. OpenCL porting challenges and re-architecting
• Memory layout & data coherence
– mutable data structures
– optimal buffer allocation for each layer
• Hide data transfer in the underlying hardware layers
• Added an extra OpenCL wrapper layer compared to CUDA
– hides boilerplate such as clSetKernelArg (see the wrapper sketch below)
• Three-layer architecture:
– Layer 1: C++ machine learning interfaces
– Layer 2: OpenCL wrappers (caffe_lrn, caffe_pool, caffe_max, caffe_relu, ...)
– Layer 3: GPU kernels
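As an illustration of what the Layer 2 wrapper hides, here is a minimal sketch, assuming a single work dimension and omitting error handling; runKernel and setArgs are hypothetical helper names, not the OpenCL caffe API:

// A variadic helper that hides the per-argument clSetKernelArg
// boilerplate behind a single wrapper call.
#include <CL/cl.h>
#include <cstddef>

inline void setArgs(cl_kernel, cl_uint) {}  // recursion base case

template <typename T, typename... Rest>
void setArgs(cl_kernel k, cl_uint index, const T& first, const Rest&... rest) {
  clSetKernelArg(k, index, sizeof(T), &first);
  setArgs(k, index + 1, rest...);
}

// One wrapper call replaces the clSetKernelArg/clEnqueueNDRangeKernel
// sequence that would otherwise be repeated in every layer.
template <typename... Args>
cl_int runKernel(cl_command_queue q, cl_kernel k,
                 size_t global, size_t local, const Args&... args) {
  setArgs(k, 0, args...);
  return clEnqueueNDRangeKernel(q, k, 1, nullptr, &global, &local,
                                0, nullptr, nullptr);
}

// Usage (hypothetical relu kernel):
//   runKernel(queue, reluKernel, n, 256, (cl_int)n, in, out);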
9. OpenCL backend bottleneck analysis
• OpenCL's online compilation frequently calls clBuildProgram
– Too many DNN kernels to create!
– Profiling shows 63 clBuildProgram calls, taking up 68% of runtime
• DNNs fall into BLAS' poor-performance area
– Irregular tall-and-skinny matrix sizes from different layers
– The bottleneck exists for all BLAS implementations: cuBLAS, clBLAS, etc.
– clBLAS is 3-5x slower than cuBLAS, the biggest performance gap to close
10. OpenCL backend bottleneck analysis (cont.)
AMD R9 Fury vs. NVIDIA GTX 980:
1. Peak performance: 7.2 vs. 4.6 TFLOPS
2. Yet OpenCL caffe is 6x slower than CUDA caffe
11. OpenCL caffe performance optimizations
• Avoid OpenCL online compilation overhead (see the caching sketch below)
– Precompile and save the kernels
– Works as long as the hardware does not change
• Boost data parallelism
– Batched data layout transformation
– Brings DNN matrix sizes into better-performing regions
• Boost task parallelism
– Multiple command queues
– Increase the number of concurrent tasks
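A hedged sketch of the precompile-and-save idea using the standard OpenCL binary APIs (clGetProgramInfo with CL_PROGRAM_BINARIES, then clCreateProgramWithBinary on later runs); buildAndDump and loadCached are illustrative helper names, and the cache is only valid while the target device and driver stay the same, as the slide notes:

#include <CL/cl.h>
#include <cstdio>
#include <vector>

std::vector<unsigned char> buildAndDump(cl_context ctx, cl_device_id dev,
                                        const char* src, const char* path) {
  cl_int err;
  cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
  clBuildProgram(prog, 1, &dev, "", nullptr, nullptr);  // the slow step
  size_t size = 0;
  clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, nullptr);
  std::vector<unsigned char> bin(size);
  unsigned char* ptr = bin.data();
  clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(ptr), &ptr, nullptr);
  if (FILE* f = std::fopen(path, "wb")) {               // cache for next run
    std::fwrite(bin.data(), 1, bin.size(), f);
    std::fclose(f);
  }
  clReleaseProgram(prog);
  return bin;
}

cl_program loadCached(cl_context ctx, cl_device_id dev,
                      const std::vector<unsigned char>& bin) {
  const unsigned char* ptr = bin.data();
  size_t size = bin.size();
  cl_int status, err;
  // Skips online source compilation entirely on later runs.
  cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &size, &ptr,
                                              &status, &err);
  clBuildProgram(prog, 1, &dev, "", nullptr, nullptr);  // fast finalize step
  return prog;
}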
12. Batched data layout transformation optimization
• Batched data layout scheme
– Design a pipeline that packs small matrices into bigger ones (see the packing sketch below)
– Increases data parallelism
– Unlocks the GPU's compute capacity
• Notes
– The optimization applies to machine learning frameworks in general
– When integrated into SGEMM, this is known as batched SGEMM
[Figure: original vs. batched matrix sizes]
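A minimal sketch of the packing step, assuming B small row-major M x K matrices (e.g. per-image im2col outputs) are stacked into one (B*M) x K matrix so a single large SGEMM replaces B small ones; packBatch is an illustrative helper, not the OpenCL caffe API:

#include <cstring>
#include <vector>

// Pack B small matrices (each M x K, row-major) into one (B*M) x K matrix.
void packBatch(const std::vector<std::vector<float>>& small, int M, int K,
               std::vector<float>& big) {
  const int B = (int)small.size();
  big.resize((size_t)B * M * K);
  for (int b = 0; b < B; ++b)
    std::memcpy(big.data() + (size_t)b * M * K, small[b].data(),
                sizeof(float) * M * K);
}

// After packing, one GEMM of size (B*M, N, K) computes all B outputs:
//   out(B*M x N) = big(B*M x K) * weightsT(K x N)
// e.g. conv1: M=3025, B=16 gives M'=48400, matching the table on the next slide.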
13. Batched data layout transformation optimization (cont.)
• Batched transformation significantly unrolls the matrix size
– Bigger, more regular matrices
– M, N, K can be aligned to 4/8/16/32 (BLAS-preferred sizes)
– Forward propagation scales up M; backward propagation scales up N and K (algorithm limitations)
• Optimal batch number
– depends on H/W properties and input data size
– 16 or 32 on AMD GPUs for the ImageNet data set

Matrix sizes shown are for forward propagation:
| Layer | Original M, N, K | Unrolled M', N', K' | Speedup |
| conv1 | 3025, 96, 363 | 48400, 96, 363 | 11x |
| conv2 | 729, 128, 1200 | 11664, 128, 1200 | 12x |
| conv3 | 169, 384, 2034 | 2704, 384, 2034 | 10x |
| conv4 | 169, 192, 1728 | 2704, 192, 1728 | 9x |
| conv5 | 169, 128, 1728 | 2704, 128, 1728 | 16x |
14. Boost task parallelism
• Workload imbalance is inherent across DNN layers
• Luckily, we can exploit model parallelism
• The performance improvement depends on layer structure, data size, and hardware resources
[Diagram: command queue 1 and command queue 2 run concurrently to improve GPU utilization; a sketch follows below]
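A minimal sketch of the multi-queue idea using standard OpenCL calls; the two branch kernels are assumed to be independent, and error handling is omitted:

#include <CL/cl.h>

void runBranchesConcurrently(cl_context ctx, cl_device_id dev,
                             cl_kernel branchA, cl_kernel branchB,
                             size_t globalA, size_t globalB) {
  cl_int err;
  cl_command_queue q1 = clCreateCommandQueue(ctx, dev, 0, &err);
  cl_command_queue q2 = clCreateCommandQueue(ctx, dev, 0, &err);

  // Independent branches go to different queues; the runtime may execute
  // them concurrently when the hardware has spare compute units.
  clEnqueueNDRangeKernel(q1, branchA, 1, nullptr, &globalA, nullptr,
                         0, nullptr, nullptr);
  clEnqueueNDRangeKernel(q2, branchB, 1, nullptr, &globalB, nullptr,
                         0, nullptr, nullptr);

  // Synchronize before any layer that consumes both branches' outputs;
  // this synchronization is part of the overhead the slides mention.
  clFinish(q1);
  clFinish(q2);

  clReleaseCommandQueue(q1);
  clReleaseCommandQueue(q2);
}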
15. Performance evaluation
– OpenCL batched vs. clBLAS
• 4.5x speedup without modifying clBLAS
– OpenCL vs. CUDA caffe (apples-to-apples)
• Similar performance
– OpenCL vs. cuDNN v2
• 2x gap
• Potential to catch up with low-level hardware optimization
16. Conclusions
• OpenCL caffe
– Enables a cross-platform DNN framework
• Optimized towards competitive performance
– Data parallelism: batched data layout transformation
– Task parallelism: exploiting model parallelism
– 4.5x speedup on top of the clBLAS library
• Remaining cross-platform challenges for OpenCL
– Differences among hardware manufacturers' extensions
– Queueing efficiency, command-queue synchronization overheads, runtime efficiency
– Low-level hardware optimization tool chains for highly optimized machine learning libraries
OpenCL Caffe is at: https://github.com/gujunli/OpenCL-caffe
Editor's notes
This is Hugh Perkins. Take the opportunity to say something about your OpenCL DNN work!
I am delivering the talk on behalf of the author Junli. She could not come to give the presentation due to visa issues.
When the authors submitted the paper, Junli was at AMD, but she is now working at Tesla. The project is an open-source project, so the talk is not on behalf of any company at this point.
Deep learning applications have posed great challenges to system design.
Deep learning applications are both compute-intensive and memory-intensive, due to the high complexity of DNN models and Big Data input.
Over the past 3 years, DNN models have kept increasing their number of parameters, from millions to billions.
The Big Data samples used to train a model range from hundreds of millions to hundreds of billions depending on the application, and the data is said to grow 10x per year.
Thus deep learning applications have posed great challenges to system design.
The transition of DNN systems:
DNN has become one of the hottest internet trends of the past two years.
The current trend indicates that DNN applications will be everywhere. Cross-platform compatibility is becoming a challenge. I see this as an opportunity for OpenCL.
Let me first introduce the DNN process a little: a DNN model is trained on supercomputers with Big Data input, then deployed on the cloud for online recognition; these days, DNN apps are also deployed on mobiles and wearables.
I have listed the scale and opportunity of each category. Offline-training supercomputers will be small scale, hundreds of CPU + GPU machines. Performance is critical for internet players, but our performance is currently not competitive. The online cloud is much bigger scale, from 100k to 1M servers, and is more tolerant of performance; there is a potential opportunity for AMD to win with better performance per dollar. The timing is right: the online cloud is transitioning from CPU clusters to CPU + GPU clusters. On mobile we have no products.
Due to DNN models' high complexity, we use a two-phase strategy.
First we introduce the OpenCL porting strategies that guarantee algorithm convergence;
then we analyze OpenCL's performance bottlenecks in the DNN domain and propose a few optimization techniques to improve hardware resource utilization and boost OpenCL runtime efficiency.
A DNN is a deeply layered structure; if we port all layers together, algorithm convergence can easily be destroyed, and it is very painful to debug due to the millions of parameters and stochastic gradients.
The challenge of gradients is a well-known challenge for any machine learning framework development.
For porting from CUDA to OpenCL, because the hardware abstraction details, data transfer, and other aspects are all different, we had to re-architect the framework.
The optimal batch size is 16 or 32 on AMD GPUs.
Even if you have fast math libraries, DNNs have an inherent workload imbalance across layers. As shown in the figure for the AlexNet model, each conv layer has a different number of channels, and as the layers go deeper, the feature map size shrinks; these factors mean that the amount of computation differs for every layer.
When we map these layers to the GPU, some layers keep the GPU busy while other layers may use only part of the GPU.
So, how do we balance GPU utilization?
Luckily, we can make use of model parallelism.
Can you explain what model parallelism is? Model parallelism refers to dividing the whole model into smaller chunks or branches and running them on different computing nodes. This is very useful when the whole model cannot fit into a single GPU, or to accelerate training by mapping the chunks to different computing nodes.
This same algorithmic parallelism can be used to balance the workload imbalance between layers. Usually DNN frameworks compute each block in a sequential manner. We exploit this by applying multiple command queues.
These days, DNN models are starting to show more branches in their structure, for example the Google Inception network, where most layers have 3 or 4 branches. This technique can be used to run each branch concurrently to improve utilization.
Performance effect: for the small CIFAR dataset, a speedup of about 10%; for the large ImageNet dataset, 12% for a minibatch size of 10. For large minibatch sizes, we sometimes observe a slowdown due to interference between the two queues and extra synchronization overheads. The final performance improvement depends on layer structure, data size, and hardware resources.
Continuous community effort is needed to improve this.
It is hard to achieve high hardware utilization without low-level hardware languages and optimization.
Cross-platform compatibility comes essentially for free through OpenCL,
but you cannot achieve the best performance for free: the most efficient implementations always come from customized low-level optimizations.