Copyright © 2016 ARM Ltd 1
Gian Marco Iodice, SW Engineer – ARM
May 3, 2016
Using SGEMM and FFTs to Accelerate Deep Learning
Contents
• About ARM
• Convolutional Neural Networks (CNN)
• Architecture and building blocks
• Convolutional Layer
• SGEMM-based convolution
• FFT-based convolution
• SGEMM vs FFT
• Limited Numerical Precision for CNN
• Lessons Learned
ARM Ltd
• ARM Holdings plc is a British multinational semiconductor and software
design company (www.arm.com)
• Headquarters in Cambridge, England
Architecture and Building Blocks of CNN
• Convolutional layer (core block of CNN)
• Number of convolution kernels (filter bank)
• Filter shape (width, height and depth)
• Pooling layer (typical values 2x2)
• Non-linear gating (ReLU)
• Classifier: Fully Connected Neural Network
[Figure labels: Learned Non-Linear Trainable Feature]
Why Are We Going to Study the Convolutional Layer?
Compute Load for AlexNet Inference*
• conv1: 16.9%
• relu: 0.7%
• pool: 1.0%
• conv2: 21.9%
• pool2: 0.7%
• norm2: 0.5%
• conv3: 17.8%
• relu3: 0.2%
• conv4: 17.8%
• conv5: 17.7%
• fc6: 1.8%
• fc7: 0.8%
*Learning Semantic Image Representations at a Large Scale, Yangqing Jia
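The percentages on this slide can be added up to see why the convolutional layers are the ones worth optimizing. The short sketch below uses only the values listed on the slide; the variable names are illustrative. The five conv entries come to roughly 92% of the compute load.

```python
# Percentages as listed on the slide (compute load for AlexNet inference).
load = {
    "conv1": 16.9, "relu": 0.7, "pool": 1.0, "conv2": 21.9, "pool2": 0.7,
    "norm2": 0.5, "conv3": 17.8, "relu3": 0.2, "conv4": 17.8, "conv5": 17.7,
    "fc6": 1.8, "fc7": 0.8,
}

# Share taken by the convolutional layers versus every other listed layer.
conv_share = sum(v for name, v in load.items() if name.startswith("conv"))
other_share = sum(v for name, v in load.items() if not name.startswith("conv"))

print(f"convolution layers: {conv_share:.1f}%")
print(f"all other listed layers: {other_share:.1f}%")
```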
From 2D Convolution to 3D Batched Convolution
• Most of the time for the convolution layers we have:
• Multiple input images
• Multiple convolution kernels (various dimensions and shapes)
• Multiple channels per image/kernel (not necessarily 3!)
[Figure labels: Input image, Kernels, Output images]
Why don’t we use a sliding-window approach?
SGEMM-based Convolution
SGEMM: Single-precision GEneral Matrix Multiply
C = α∙AB + β∙C
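As a reference for the SGEMM operation named on this slide, here is a minimal pure-Python sketch. The function name `sgemm` and the list-of-lists layout are illustrative choices, and Python floats are double precision, so this shows the arithmetic only, not single-precision behaviour or performance.

```python
def sgemm(alpha, A, B, beta, C):
    """Return alpha * (A x B) + beta * C for matrices stored as lists of rows."""
    M, K, N = len(A), len(B), len(B[0])
    assert all(len(row) == K for row in A), "inner dimensions must match"
    assert len(C) == M and all(len(row) == N for row in C), "C must be M x N"
    return [
        [alpha * sum(A[i][k] * B[k][j] for k in range(K)) + beta * C[i][j]
         for j in range(N)]
        for i in range(M)
    ]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]
result = sgemm(2.0, A, B, 0.5, C)
print(result)
```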
Im2col
• im2col stores in each row the pixels needed for one kernel application
• Costs in terms of memory requirements: pixels are duplicated
• col2im restores the output image structure
[Figure: im2col diagram. Labels: Input image, Output image, A, B, C, Output images, stride X]
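The im2col idea can be sketched in a few lines. This is a hedged illustration, not the implementation from the talk. It assumes a single-channel image, no padding, and the CNN convention of not flipping the kernel. The names `im2col`, `conv_via_im2col` and `conv_sliding_window` are made up for this example.

```python
def im2col(image, kh, kw, stride=1):
    """One row per kernel application, holding the kh*kw pixels it needs."""
    h, w = len(image), len(image[0])
    rows = []
    for y in range(0, h - kh + 1, stride):
        for x in range(0, w - kw + 1, stride):
            rows.append([image[y + dy][x + dx]
                         for dy in range(kh) for dx in range(kw)])
    return rows

def conv_via_im2col(image, kernel, stride=1):
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]
    cols = im2col(image, kh, kw, stride)
    # Matrix-vector product: one dot product per im2col row.
    flat_out = [sum(p * k for p, k in zip(row, flat_kernel)) for row in cols]
    out_w = (len(image[0]) - kw) // stride + 1
    # col2im step: restore the 2D structure of the output image.
    return [flat_out[i:i + out_w] for i in range(0, len(flat_out), out_w)]

def conv_sliding_window(image, kernel, stride=1):
    """Direct sliding-window reference used to check the im2col path."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    return [[sum(image[y + dy][x + dx] * kernel[dy][dx]
                 for dy in range(kh) for dx in range(kw))
             for x in range(0, w - kw + 1, stride)]
            for y in range(0, h - kh + 1, stride)]

image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
kernel = [[1.0] * 3 for _ in range(3)]
out = conv_via_im2col(image, kernel)
duplicated = sum(len(row) for row in im2col(image, 3, 3))
print(out, duplicated, "values stored for", 16, "pixels")
```

In this example the 16 input pixels expand to 36 stored values, which is the memory cost the slide warns about.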
SGEMM: Naïve implementation
• Each thread computes a single element of the output matrix
• Not cache friendly!
float c0 = 0.0f;
/* First accumulation */
ai = load(addr_a);
bi = load(addr_b);
c0 += ai * bi;
/* Second accumulation: B is read down a column, N elements apart */
ai = load(addr_a + 1);
bi = load(addr_b + 1 * N);
c0 += ai * bi;
…
store(c0, addr_c);
[Figure labels: Matrix A, Matrix B, Matrix C]
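To make the access pattern concrete, the sketch below emulates one "thread" of the naive kernel on flat row-major arrays and records which addresses of B it touches. The helper name `naive_element` and the test data are illustrative assumptions. The point is that B is read with a stride of N elements.

```python
def naive_element(a, b, n, row, col):
    """One 'thread': compute C[row][col] from flat row-major N x N arrays."""
    addr_a = row * n   # start of row `row` in A
    addr_b = col       # top of column `col` in B
    c0 = 0.0
    b_addresses = []
    for k in range(n):
        ai = a[addr_a + k]        # A is read with stride 1
        bi = b[addr_b + k * n]    # B is read with stride N
        b_addresses.append(addr_b + k * n)
        c0 += ai * bi
    return c0, b_addresses

n = 4
a = [float(i) for i in range(n * n)]
b = [float(i % 3) for i in range(n * n)]
c0, trace = naive_element(a, b, n, row=1, col=2)
print(c0, trace)
```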
Transpose Matrix B
[Figure: Matrix B Transposition]
float c0 = 0.0f;
/* First accumulation */
ai = load(addr_a);
bi = load(addr_b);
c0 += ai * bi;
/* Second accumulation: transposed B is read contiguously */
ai = load(addr_a + 1);
bi = load(addr_b + 1);
c0 += ai * bi;
...
store(c0, addr_c);
[Figure labels: Matrix A, Matrix B, Matrix C]
Speed-up achievable? 1.1x…
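A small sketch of the same dot product after transposing B, again on flat row-major arrays. The helper names and data are illustrative assumptions. Both operands are now read with stride 1, which is what the changed `addr_b + 1` in the pseudocode expresses.

```python
def transpose(b, n):
    """Return B^T for a flat row-major N x N matrix."""
    return [b[r * n + c] for c in range(n) for r in range(n)]

def element_with_bt(a, bt, n, row, col):
    """One 'thread': C[row][col] as a dot product of two contiguous rows."""
    addr_a = row * n   # start of row `row` in A
    addr_b = col * n   # start of row `col` in B^T (column `col` of B)
    c0 = 0.0
    b_addresses = []
    for k in range(n):
        c0 += a[addr_a + k] * bt[addr_b + k]   # both reads have stride 1
        b_addresses.append(addr_b + k)
    return c0, b_addresses

n = 4
a = [float(i) for i in range(n * n)]
b = [float(i % 3) for i in range(n * n)]
bt = transpose(b, n)
c0, trace = element_with_bt(a, bt, n, row=1, col=2)
print(c0, trace)
```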
Transpose Matrix B in Chunks of 1x4 (I)
• Each thread computes 1x4 elements of the output matrix
• Not cache friendly!
float4 out = 0.0f;
/* First accumulation */
ai = load(addr_a);
bi = vload4(addr_b);
out += (float4)ai * bi;
/* Second accumulation */
ai = load(addr_a + 1);
bi = vload4(addr_b + 1 * N);
out += (float4)ai * bi;
...
store4(out, addr_c);
[Figure labels: Matrix A, Matrix B, Matrix C]
Transpose Matrix B in Chunks of 1x4 (II)
float4 out = 0.0f;
/* First accumulation */
ai = load(addr_a);
bi = vload4(addr_b);
out += (float4)ai * bi;
/* Second accumulation */
ai = load(addr_a + 1);
bi = vload4(addr_b + 4);
out += (float4)ai * bi;
...
store4(out, addr_c);
[Figure: Matrix B reshaped into Matrix BT1x4]
[Chart: SGEMM Speed-Up plotted against N = 512, 1024, 2048, 4096, where A=NxN, B=NxN, C=NxN]
Speed-up achievable? 3.5x
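One layout that is consistent with the `vload4(addr_b + 4)` step above is sketched below. The exact ordering of the 1x4 chunks is an assumption made for this example and is not confirmed by the slides. For every block of four columns, the four values of each row of B are stored next to each other, so a thread that computes 1x4 outputs walks through its part of B with a stride of 4. The function names are illustrative.

```python
def reshape_b_1x4(b, k_dim, n):
    """Assumed BT1x4 layout: per block of 4 columns, one 1x4 chunk per row of B."""
    assert n % 4 == 0, "this sketch assumes N is a multiple of 4"
    out = []
    for block in range(n // 4):
        for k in range(k_dim):
            start = k * n + 4 * block
            out.extend(b[start:start + 4])
    return out

def thread_1x4(a, bt, k_dim, row, block):
    """One 'thread': the 4 outputs C[row][4*block : 4*block + 4]."""
    addr_a = row * k_dim
    addr_b = block * k_dim * 4
    out = [0.0] * 4                                   # float4 accumulator
    for k in range(k_dim):
        ai = a[addr_a + k]                            # scalar load
        bi = bt[addr_b + 4 * k:addr_b + 4 * k + 4]    # vload4, contiguous
        out = [o + ai * v for o, v in zip(out, bi)]
    return out

def reference(a, b, m, k_dim, n):
    """Plain triple-loop matrix product used as a check."""
    return [[sum(a[i * k_dim + k] * b[k * n + j] for k in range(k_dim))
             for j in range(n)] for i in range(m)]

m, k_dim, n = 2, 3, 8
a = [float(i + 1) for i in range(m * k_dim)]
b = [float(i % 5) for i in range(k_dim * n)]
bt = reshape_b_1x4(b, k_dim, n)
c_ref = reference(a, b, m, k_dim, n)
print(thread_1x4(a, bt, k_dim, row=1, block=1), c_ref[1][4:8])
```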
Reshaping Matrix A (I)
• We can do more: each thread can compute a block of 4x4 elements in order to re-use the values loaded from Matrix A
[Figure labels: Matrix A, Matrix BT1x4, Matrix C]
Reshaping Matrix A (II)
• Chunk = block of 4 rows
• Matrix A (8x8) is reshaped into Matrix AI (2x32), one row per chunk (Chunk 0, Chunk 1)
[Chart: SGEMM Speed-Up plotted against N = 512, 1024, 2048, 4096, where A=NxN, B=NxN, C=NxN]
Speed-up achievable? > 8.0x
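The reshaping of A can be sketched in the same way. The interleaving order used here, with the four row values of each column of a chunk stored together, is an assumption. It does reproduce the 8x8 to 2x32 shape shown on the slide. Each 'thread' then loads four values of A and four values of B per step and updates a 4x4 block of C, so every loaded value is used four times. All names are illustrative.

```python
def interleave_a_4rows(a, m, k_dim):
    """Assumed AI layout: per chunk of 4 rows, the 4 row values of each column."""
    assert m % 4 == 0, "this sketch assumes M is a multiple of 4"
    out = []
    for chunk in range(m // 4):
        for k in range(k_dim):
            out.extend(a[(4 * chunk + r) * k_dim + k] for r in range(4))
    return out

def reshape_b_1x4(b, k_dim, n):
    """Assumed BT1x4 layout: per block of 4 columns, one 1x4 chunk per row of B."""
    out = []
    for block in range(n // 4):
        for k in range(k_dim):
            out.extend(b[k * n + 4 * block:k * n + 4 * block + 4])
    return out

def thread_4x4(ai, bt, k_dim, chunk, block):
    """One 'thread': the 4x4 block of C for a given row chunk and column block."""
    addr_a = chunk * k_dim * 4
    addr_b = block * k_dim * 4
    acc = [[0.0] * 4 for _ in range(4)]
    for k in range(k_dim):
        a4 = ai[addr_a + 4 * k:addr_a + 4 * k + 4]   # 4 values of A, contiguous
        b4 = bt[addr_b + 4 * k:addr_b + 4 * k + 4]   # 4 values of B, contiguous
        for r in range(4):
            for c in range(4):
                acc[r][c] += a4[r] * b4[c]           # each load is reused 4 times
    return acc

n = 8
a = [float(i % 7) for i in range(n * n)]
b = [float(i % 5) for i in range(n * n)]
ai = interleave_a_4rows(a, n, n)
bt = reshape_b_1x4(b, n, n)
c_ref = [[sum(a[i * n + k] * b[k * n + j] for k in range(n)) for j in range(n)]
         for i in range(n)]
block_11 = thread_4x4(ai, bt, n, chunk=1, block=1)
print(block_11)
```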
FFT-based Convolution
• Convolution in the spatial domain is equivalent to an element-wise multiplication in the frequency domain
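The equivalence can be checked numerically on a small 1D example. The sketch below uses a naive O(N²) DFT from the standard library rather than a real FFT, and the function names are illustrative. Note that the frequency-domain product gives a circular convolution, so the inputs are zero-padded here to make it match the linear one.

```python
import cmath

def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(N^2)), enough for a small check."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out

def circular_convolution(x, h):
    """Convolution computed directly in the spatial domain."""
    n = len(x)
    return [sum(x[m] * h[(i - m) % n] for m in range(n)) for i in range(n)]

def frequency_domain_convolution(x, h):
    """Transform, multiply element by element, transform back."""
    X, H = dft(x), dft(h)
    Y = [a * b for a, b in zip(X, H)]
    return [v.real for v in dft(Y, inverse=True)]

# [1, 2, 3] convolved with [1, 1], both zero-padded to length 5.
x = [1.0, 2.0, 3.0, 0.0, 0.0]
h = [1.0, 1.0, 0.0, 0.0, 0.0]
spatial = circular_convolution(x, h)
spectral = frequency_domain_convolution(x, h)
print(spatial, [round(v, 6) for v in spectral])
```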
From Radix-2 to Mixed-Radix
• The best-known FFT is the radix-2 Cooley–Tukey algorithm (only for N a power of 2: N = 2 x 2 x 2…)
• In general, any factorization of N can be used (N = N1 x N2 x N3 x…)
• Mixed-radix is the generalization of the basic radix-2 FFT
Over 1.5x better performance than Radix-2
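A compact recursive sketch of the mixed-radix idea follows. It is not the OpenCL implementation from the talk. It peels off the smallest prime factor p of N, transforms p interleaved sub-sequences, and combines them with twiddle factors, falling back to a direct DFT when the length is prime. The function names are illustrative, and the combine step is written for clarity rather than speed.

```python
import cmath

def smallest_factor(n):
    """Smallest prime factor of n (n itself if n is prime or 1)."""
    for p in range(2, int(n ** 0.5) + 1):
        if n % p == 0:
            return p
    return n

def dft(x):
    """Direct DFT, used for prime lengths and as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def mixed_radix_fft(x):
    n = len(x)
    p = smallest_factor(n)
    if p == n:
        return dft(x)
    m = n // p
    # Decimation in time: p interleaved sub-sequences of length m.
    subs = [mixed_radix_fft(x[r::p]) for r in range(p)]
    # Radix-p combine: X[k] = sum_r W_n^(r*k) * Sub_r[k mod m].
    return [sum(subs[r][k % m] * cmath.exp(-2j * cmath.pi * r * k / n)
                for r in range(p))
            for k in range(n)]

signal = [float((3 * i) % 7) for i in range(12)]   # N = 12 = 2 x 2 x 3
fast = mixed_radix_fft(signal)
slow = dft(signal)
print(max(abs(f - s) for f, s in zip(fast, slow)))
```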
FFT Implementation
• Recursive FFT in-place computation*
• Each thread computes a single radix-N (floating point computation)
• Block-wise 2x2 in-place transposition
• ~2x better performance than 2x2 out-of-place transposition
• Out-of-place batched convolution
• High memory requirements as we have to keep the frequency representation for:
1. Input image
2. Convolution kernels
3. Result of convolutions
* https://community.arm.com/groups/arm-mali-graphics/blog/2016/02/01/speeding-up-fast-fourier-transform-mixed-radix-on-mobile-arm-mali-gpu-by-means-of-opencl-part-2
SGEMM vs FFT (I)
SGEMM-based convolution
• High memory requirements due to im2col:
• stride < kernel dimension
• large convolution kernel
• large input image
FFT-based convolution
• No efficient way to handle stride != 1
• High memory requirements for batched convolutions
• It could require considerable effort to optimize well
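The im2col memory cost can be estimated with simple arithmetic. The sketch below assumes no padding and counts elements rather than bytes. The function names and the 32x32 example are illustrative and are not taken from the talk.

```python
def im2col_elements(h, w, channels, kh, kw, stride):
    """Number of values in the im2col matrix for one image (no padding)."""
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    return out_h * out_w * kh * kw * channels

def duplication_factor(h, w, channels, kh, kw, stride):
    """im2col size relative to the size of the input image."""
    return im2col_elements(h, w, channels, kh, kw, stride) / (h * w * channels)

# A 32x32 single-channel image with a 5x5 kernel.
stride_1 = duplication_factor(32, 32, 1, 5, 5, stride=1)   # stride < kernel size
stride_5 = duplication_factor(32, 32, 1, 5, 5, stride=5)   # stride == kernel size
print(f"stride 1: {stride_1:.2f}x, stride 5: {stride_5:.2f}x")
```

With stride 1 the im2col matrix in this example is about 19 times larger than the image. With a stride equal to the kernel size there is no duplication at all.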
SGEMM vs FFT (II)
• Study limited to the inference problem
• Stride x = 1 and stride y = 1
• Number of channels = 1
• Pre-computed FFT for convolution kernels
Case 1: 1 input image, 64/128/256 convolution kernels
Case 2: 64 input images, 32 convolution kernels
[Charts for Case 1 and Case 2 comparing SGEMM and FFT. Axis labels: Image size; Kernel size / Number of convolutions; Kernel size]
…and using stride x = 2?
Limited Numerical Precision for CNN (I)
• Some papers ([1], [2]) have demonstrated the feasibility of using limited
numerical precision for CNN
• This opens an interesting computational scenario if, for instance, the HW has
accelerators for 16-bit half precision:
• Performance boost
• Reduced memory traffic to/from external memory
• Possible to dispatch fewer threads
• Energy saving
• Essentially due to the reduced memory traffic to/from the external memory
[1] Training Deep Neural Networks with Low Precision Multiplications, Matthieu Courbariaux, Jean-Pierre David
[2] Deep Learning with Limited Numerical Precision, Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, Pritish Narayanan
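The memory side of this argument is easy to quantify. The sketch below uses the `struct` module's IEEE 754 binary16 format code `"e"` to show the storage halving and the rounding that comes with it. The helper names and the 1024x1024 example are illustrative, and nothing here measures performance on real hardware.

```python
import struct

def to_half_and_back(value):
    """Round a Python float to IEEE 754 binary16 and convert it back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def matrix_bytes(n, fmt):
    """Bytes needed to store an n x n matrix of the given struct format."""
    return n * n * struct.calcsize(fmt)

n = 1024
fp32_bytes = matrix_bytes(n, "<f")   # single precision, 4 bytes per element
fp16_bytes = matrix_bytes(n, "<e")   # half precision, 2 bytes per element
rounded = to_half_and_back(0.1)

print(f"fp32: {fp32_bytes} bytes, fp16: {fp16_bytes} bytes")
print(f"0.1 stored as half precision: {rounded!r}")
```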
Limited Numerical Precision for CNN (II)
[Charts: SGEMM Speed-Up (N: A=NxN, B=NxN, C=NxN) and FFT Speed-Up, each plotted against N = 512, 1024, 2048, 4096; annotated speed-ups: > 2.0x and > 1.5x]
• SGEMM: it is possible to dispatch fewer threads, i.e. 8x4 elements per thread
• FFT: we cannot dispatch fewer threads, as each thread computes a single radix-N
Lessons Learned
1. A cache-efficient data layout has a huge impact on the performance of our algorithms,
also for GPU computing
2. Simple changes in data layout make it possible to:
• dispatch fewer threads
• make better use of vector instructions
3. Limited numerical precision plays a crucial role if it is hardware accelerated
4. Convolution is an embarrassingly parallel task which can be
easily and efficiently accelerated on mobile GPUs by means of OpenCL
Question Time
Thank you!
The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or
elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.
Copyright © 2016 ARM Limited
Thank you!

More Related Content

What's hot

整数計画法に基づく説明可能性な機械学習へのアプローチ
整数計画法に基づく説明可能性な機械学習へのアプローチ整数計画法に基づく説明可能性な機械学習へのアプローチ
整数計画法に基づく説明可能性な機械学習へのアプローチ
Kentaro Kanamori
 
20120729 ODbL勉強会
20120729 ODbL勉強会20120729 ODbL勉強会
20120729 ODbL勉強会
Shu Higashi
 

What's hot (20)

勉強会資料 Uml概要
勉強会資料 Uml概要勉強会資料 Uml概要
勉強会資料 Uml概要
 
整数計画法に基づく説明可能性な機械学習へのアプローチ
整数計画法に基づく説明可能性な機械学習へのアプローチ整数計画法に基づく説明可能性な機械学習へのアプローチ
整数計画法に基づく説明可能性な機械学習へのアプローチ
 
アンサンブル学習
アンサンブル学習アンサンブル学習
アンサンブル学習
 
第1回NIPS読み会・関西発表資料
第1回NIPS読み会・関西発表資料第1回NIPS読み会・関西発表資料
第1回NIPS読み会・関西発表資料
 
PRML10-draft1002
PRML10-draft1002PRML10-draft1002
PRML10-draft1002
 
敵対的サンプル・摂動サーベイ
敵対的サンプル・摂動サーベイ敵対的サンプル・摂動サーベイ
敵対的サンプル・摂動サーベイ
 
20170408cvsaisentan6 2 4.3-4.5
20170408cvsaisentan6 2 4.3-4.520170408cvsaisentan6 2 4.3-4.5
20170408cvsaisentan6 2 4.3-4.5
 
近接分離最適化によるブラインド⾳源分離(Blind source separation via proximal splitting algorithm)
近接分離最適化によるブラインド⾳源分離(Blind source separation via proximal splitting algorithm)近接分離最適化によるブラインド⾳源分離(Blind source separation via proximal splitting algorithm)
近接分離最適化によるブラインド⾳源分離(Blind source separation via proximal splitting algorithm)
 
Transformerを用いたAutoEncoderの設計と実験
Transformerを用いたAutoEncoderの設計と実験Transformerを用いたAutoEncoderの設計と実験
Transformerを用いたAutoEncoderの設計と実験
 
独立低ランク行列分析に基づくブラインド音源分離(Blind source separation based on independent low-rank...
独立低ランク行列分析に基づくブラインド音源分離(Blind source separation based on independent low-rank...独立低ランク行列分析に基づくブラインド音源分離(Blind source separation based on independent low-rank...
独立低ランク行列分析に基づくブラインド音源分離(Blind source separation based on independent low-rank...
 
[DL輪読会]Autonomous Reinforcement Learning: Formalism and Benchmarking
[DL輪読会]Autonomous Reinforcement Learning: Formalism and Benchmarking[DL輪読会]Autonomous Reinforcement Learning: Formalism and Benchmarking
[DL輪読会]Autonomous Reinforcement Learning: Formalism and Benchmarking
 
20120729 ODbL勉強会
20120729 ODbL勉強会20120729 ODbL勉強会
20120729 ODbL勉強会
 
Design of Two Way Ribbed slab 18 x 25 m clear span
Design of Two Way Ribbed slab 18 x 25 m clear spanDesign of Two Way Ribbed slab 18 x 25 m clear span
Design of Two Way Ribbed slab 18 x 25 m clear span
 
20180723 PFNの研究基盤 / PFN research system infrastructure
20180723 PFNの研究基盤 / PFN research system infrastructure20180723 PFNの研究基盤 / PFN research system infrastructure
20180723 PFNの研究基盤 / PFN research system infrastructure
 
実用Brainf*ckプログラミング
実用Brainf*ckプログラミング実用Brainf*ckプログラミング
実用Brainf*ckプログラミング
 
Systèmes coupe-feu fermacell
Systèmes coupe-feu fermacellSystèmes coupe-feu fermacell
Systèmes coupe-feu fermacell
 
深層ニューラルネットワークの積分表現(Deepを定式化する数学)
深層ニューラルネットワークの積分表現(Deepを定式化する数学)深層ニューラルネットワークの積分表現(Deepを定式化する数学)
深層ニューラルネットワークの積分表現(Deepを定式化する数学)
 
MLデザインパターン入門_Embeddings
MLデザインパターン入門_EmbeddingsMLデザインパターン入門_Embeddings
MLデザインパターン入門_Embeddings
 
闘病ブログからの医薬品奏功情報認識
闘病ブログからの医薬品奏功情報認識闘病ブログからの医薬品奏功情報認識
闘病ブログからの医薬品奏功情報認識
 
画像の基盤モデルの変遷と研究動向
画像の基盤モデルの変遷と研究動向画像の基盤モデルの変遷と研究動向
画像の基盤モデルの変遷と研究動向
 

Similar to "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM

“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD
“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD
“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD
Edge AI and Vision Alliance
 
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
Heiko Joerg Schick
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Akihiro Hayashi
 
Learning Erlang (from a Prolog dropout's perspective)
Learning Erlang (from a Prolog dropout's perspective)Learning Erlang (from a Prolog dropout's perspective)
Learning Erlang (from a Prolog dropout's perspective)
elliando dias
 

Similar to "Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM (20)

26_Fan.pdf
26_Fan.pdf26_Fan.pdf
26_Fan.pdf
 
Advanced High-Performance Computing Features of the Open Power ISA
Advanced High-Performance Computing Features of the Open Power ISAAdvanced High-Performance Computing Features of the Open Power ISA
Advanced High-Performance Computing Features of the Open Power ISA
 
Semiconductor overview
Semiconductor overviewSemiconductor overview
Semiconductor overview
 
SOC Application Studies: Image Compression
SOC Application Studies: Image CompressionSOC Application Studies: Image Compression
SOC Application Studies: Image Compression
 
Advanced High-Performance Computing Features of the OpenPOWER ISA
 Advanced High-Performance Computing Features of the OpenPOWER ISA Advanced High-Performance Computing Features of the OpenPOWER ISA
Advanced High-Performance Computing Features of the OpenPOWER ISA
 
Memory Requirements for Convolutional Neural Network Hardware Accelerators
Memory Requirements for Convolutional Neural Network Hardware AcceleratorsMemory Requirements for Convolutional Neural Network Hardware Accelerators
Memory Requirements for Convolutional Neural Network Hardware Accelerators
 
AVX512 assembly language in FFmpeg
AVX512 assembly language in FFmpegAVX512 assembly language in FFmpeg
AVX512 assembly language in FFmpeg
 
150807 Fast R-CNN
150807 Fast R-CNN150807 Fast R-CNN
150807 Fast R-CNN
 
Gunjae_ISCA15_slides.pdf
Gunjae_ISCA15_slides.pdfGunjae_ISCA15_slides.pdf
Gunjae_ISCA15_slides.pdf
 
Java on arm theory, applications, and workloads [dev5048]
Java on arm  theory, applications, and workloads [dev5048]Java on arm  theory, applications, and workloads [dev5048]
Java on arm theory, applications, and workloads [dev5048]
 
Haskell Symposium 2010: An LLVM backend for GHC
Haskell Symposium 2010: An LLVM backend for GHCHaskell Symposium 2010: An LLVM backend for GHC
Haskell Symposium 2010: An LLVM backend for GHC
 
“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD
“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD
“Programming Vision Pipelines on AMD’s AI Engines,” a Presentation from AMD
 
"Designing Deep Neural Network Algorithms for Embedded Devices," a Presentati...
"Designing Deep Neural Network Algorithms for Embedded Devices," a Presentati..."Designing Deep Neural Network Algorithms for Embedded Devices," a Presentati...
"Designing Deep Neural Network Algorithms for Embedded Devices," a Presentati...
 
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
 
Learning Erlang (from a Prolog dropout's perspective)
Learning Erlang (from a Prolog dropout's perspective)Learning Erlang (from a Prolog dropout's perspective)
Learning Erlang (from a Prolog dropout's perspective)
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloads
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)
 
B.tech_project_ppt.pptx
B.tech_project_ppt.pptxB.tech_project_ppt.pptx
B.tech_project_ppt.pptx
 

More from Edge AI and Vision Alliance

“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
Edge AI and Vision Alliance
 
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
Edge AI and Vision Alliance
 
“Vision-language Representations for Robotics,” a Presentation from the Unive...
“Vision-language Representations for Robotics,” a Presentation from the Unive...“Vision-language Representations for Robotics,” a Presentation from the Unive...
“Vision-language Representations for Robotics,” a Presentation from the Unive...
Edge AI and Vision Alliance
 
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
Edge AI and Vision Alliance
 
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
Edge AI and Vision Alliance
 
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
Edge AI and Vision Alliance
 
“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from Samsara“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from Samsara
Edge AI and Vision Alliance
 
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
Edge AI and Vision Alliance
 

More from Edge AI and Vision Alliance (20)

“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
 
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
 
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
 
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
 
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...
 
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
 
“Vision-language Representations for Robotics,” a Presentation from the Unive...
“Vision-language Representations for Robotics,” a Presentation from the Unive...“Vision-language Representations for Robotics,” a Presentation from the Unive...
“Vision-language Representations for Robotics,” a Presentation from the Unive...
 
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
 
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
 
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
 
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
 
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
 
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
 
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
 
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
 
“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from Samsara“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from Samsara
 
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
 
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
 
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
 
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
 

Recently uploaded

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Recently uploaded (20)

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM

  • 1. Copyright © 2016 ARM Ltd 1 Gian Marco Iodice, SW Engineer – ARM May 3, 2016 Using SGEMM and FFTs to Accelerate Deep Learning
  • 2. Copyright © 2016 ARM Ltd 2 • About ARM • Convolutional Neural Networks (CNN) • Architecture and building blocks • Convolutional Layer • SGEMM-based convolution • FFT-based convolution • SGEMM vs FFT • Limited Numerical Precision for CNN • Lesson Learned Contents
  • 3. Copyright © 2016 ARM Ltd 3 ARM Ltd • ARM Holdings plc is a British multinational semiconductor and software design company (www.arm.com) • Headquarters in Cambridge, England
  • 4. Copyright © 2016 ARM Ltd 4 Architecture and Building Blocks of CNN • Convolutional layer (core block of CNN) • Number of convolution kernels (filters bank) • Filter shape (width, height and depth) • Pooling layer (typical values 2x2) • Non-linear gating (ReLu) • Classifier: Fully Connected Neural Network Learned Non-Linear Trainable Feature
  • 5. Copyright © 2016 ARM Ltd 5 Why Are We Going to Study Convolutional Layer? *Learning Semantic Image Representations at a Large Scale, Yangqing Jia conv1 16.9% relu 0.7% pool 1.0% conv2 21.9% pool2 0.7% norm2 0.5% conv3 17.8% relu3 0.2% conv4 17.8% conv5 17.7% fc6 1.8% fc7 0.8% Compute Load for AlexNet Inference*
  • 6. From 2D Convolution to 3D Batched Convolution
      • Most of the time, for the convolution layers we have:
          • Multiple input images
          • Multiple convolution kernels (various dimensions and shapes)
          • Multiple channels per image/kernel (not necessarily 3!)
      • [Figure labels: Input image; Kernels; Output images]
      • Why don't we use a sliding-window approach?
  • 7. SGEMM-based Convolution
      • SGEMM: Single Precision GEneral Matrix Multiply
      • C = α∙AB + β∙C
  • 8. Im2col
      • im2col stores in each row the pixels needed for one kernel application
      • Costs in terms of memory requirements!
          • pixel duplication
      • col2im restores the output image structure
      • [Figure: input image, stride, matrices A, B and C, output images]
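The im2col step on the slide can be sketched in plain C as follows. This is a minimal sketch under stated assumptions, not the presenter's kernel: it handles a single channel, no padding, a row-major image, and the function names `conv_out_dim` and `im2col` are hypothetical.

```c
#include <stddef.h>

/* Output size along one axis for a convolution without padding.
 * Assumes in >= k and stride >= 1. */
size_t conv_out_dim(size_t in, size_t k, size_t stride)
{
    return (in - k) / stride + 1;
}

/* im2col for a single-channel, row-major image of h x w pixels.
 * Row r of `cols` receives the kh*kw pixels covered by the kernel at
 * output position r, so `cols` must hold oh * ow * kh * kw floats. */
void im2col(const float *img, size_t h, size_t w,
            size_t kh, size_t kw, size_t stride, float *cols)
{
    size_t oh = conv_out_dim(h, kh, stride);
    size_t ow = conv_out_dim(w, kw, stride);

    for (size_t oy = 0; oy < oh; ++oy) {
        for (size_t ox = 0; ox < ow; ++ox) {
            float *row = cols + (oy * ow + ox) * kh * kw;
            for (size_t ky = 0; ky < kh; ++ky) {
                for (size_t kx = 0; kx < kw; ++kx) {
                    /* Copy the pixel under kernel tap (ky, kx). */
                    row[ky * kw + kx] =
                        img[(oy * stride + ky) * w + (ox * stride + kx)];
                }
            }
        }
    }
}
```

For a 3x3 image and a 2x2 kernel with stride 1, the 9 input pixels become a 4x4 matrix of 16 values: this is the pixel duplication the slide warns about. Multiplying that matrix by the flattened kernels then yields the output pixels.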
  • 9. SGEMM: Naïve Implementation
      • Each thread computes a single element of the output matrix
      • Not cache friendly!
      • Per-thread pseudocode:
            /* First accumulation */
            ai = load(addr_a);
            bi = load(addr_b);
            c0 += ai * bi;
            /* Second accumulation: the next element of B is N floats away */
            ai = load(addr_a + 1);
            bi = load(addr_b + 1 * N);
            c0 += ai * bi;
            …
            store(c0, addr_c);
      • [Figure: Matrix A, Matrix B, Matrix C]
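Written out as a full loop nest, the naïve scheme looks roughly like the sketch below. It assumes row-major storage and uses a hypothetical function name, `sgemm_ref`. The two outer loops stand in for the thread index, and the inner loop is the chain of accumulations from the slide.

```c
#include <stddef.h>

/* Reference SGEMM: C = alpha * A * B + beta * C.
 * A is M x K, B is K x N, C is M x N, all row-major. */
void sgemm_ref(size_t M, size_t N, size_t K,
               float alpha, const float *A, const float *B,
               float beta, float *C)
{
    for (size_t i = 0; i < M; ++i) {
        for (size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k) {
                /* A is read contiguously; B is read with a stride of
                 * N floats, which is what hurts cache behaviour. */
                acc += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
}
```

With α = 1 and β = 0 this reduces to a plain matrix product, which is the case the later slides optimise.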
  • 10. Transpose Matrix B
      • Matrix B transposition: consecutive elements of a column of B become adjacent in memory
      • Per-thread pseudocode:
            /* First accumulation */
            ai = load(addr_a);
            bi = load(addr_b);
            c00 += ai * bi;
            /* Second accumulation: the next element of B is now adjacent */
            ai = load(addr_a + 1);
            bi = load(addr_b + 1);
            c00 += ai * bi;
            ...
            store(c00, addr_c);
      • [Figure: Matrix A, Matrix B, Matrix C]
      • Speed-up achievable? 1.1x…
  • 11. Transpose Matrix B in Chunks of 1x4 (I)
      • Each thread computes 1x4 elements of the output matrix
      • Not cache friendly!
      • Per-thread pseudocode:
            float4 out = 0.0f;
            /* First accumulation */
            ai = load(addr_a);
            bi = vload4(addr_b);
            out += (float4)ai * bi;
            /* Second accumulation: the next 4 elements of B are N floats away */
            ai = load(addr_a + 1);
            bi = vload4(addr_b + 1 * N);
            out += (float4)ai * bi;
            ...
            store4(out, addr_c);
      • [Figure: Matrix A, Matrix B, Matrix C]
  • 12. Transpose Matrix B in Chunks of 1x4 (II)
      • Per-thread pseudocode:
            float4 out = 0.0f;
            /* First accumulation */
            ai = load(addr_a);
            bi = vload4(addr_b);
            out += (float4)ai * bi;
            /* Second accumulation: the next 4 elements of B are now adjacent */
            ai = load(addr_a + 1);
            bi = vload4(addr_b + 4);
            out += (float4)ai * bi;
            ...
            store4(out, addr_c);
      • [Figure: Matrix B reshaped into Matrix BT1x4]
      • [Plot: SGEMM speed-up (axis from 2 to 4) against N = 512, 1024, 2048, 4096, where A, B and C are all NxN]
      • Speed-up achievable? 3.5x
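One way to read the 1x4 reshaping is sketched below in scalar C. The layout is an assumption inferred from the `addr_b + 4` step in the pseudocode: for each group of 4 columns, the 4 values for successive k are stored back to back. The names `reshape_b_1x4` and `matmul_bt1x4` are hypothetical, and N is assumed to be a multiple of 4.

```c
#include <stddef.h>

/* Repack row-major B (K x N) into N/4 rows of K*4 floats. Within a row,
 * the 4 values belonging to step k sit at offset k*4. */
void reshape_b_1x4(const float *B, size_t K, size_t N, float *Bt)
{
    for (size_t jb = 0; jb < N / 4; ++jb)
        for (size_t k = 0; k < K; ++k)
            for (size_t v = 0; v < 4; ++v)
                Bt[(jb * K + k) * 4 + v] = B[k * N + jb * 4 + v];
}

/* C = A * B using the repacked B. Each (i, jb) iteration plays the role
 * of one thread producing a 1x4 strip of the output. */
void matmul_bt1x4(const float *A, const float *Bt,
                  size_t M, size_t N, size_t K, float *C)
{
    for (size_t i = 0; i < M; ++i) {
        for (size_t jb = 0; jb < N / 4; ++jb) {
            float out[4] = {0.0f, 0.0f, 0.0f, 0.0f};
            const float *b = Bt + jb * K * 4;
            for (size_t k = 0; k < K; ++k) {
                float ai = A[i * K + k];
                /* b advances by 4 floats per step: contiguous reads. */
                for (size_t v = 0; v < 4; ++v)
                    out[v] += ai * b[k * 4 + v];
            }
            for (size_t v = 0; v < 4; ++v)
                C[i * N + jb * 4 + v] = out[v];
        }
    }
}
```

The reshaping costs one extra pass over B, but in the inference setting B typically holds the kernels, so it can be done once and reused.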
  • 13. Reshaping Matrix A (I)
      • We can do more: each thread can compute a 4x4 block of output elements, so that the values loaded from Matrix A are reused
      • [Figure: Matrix A, Matrix BT1x4, Matrix C]
  • 14. Reshaping Matrix A (II)
      • Chunk = block of 4 rows
      • [Figure: Matrix A (8x8), split into Chunk 0 and Chunk 1, reshaped into Matrix AI (2x32)]
      • [Plot: SGEMM speed-up (axis from 6.5 to 8.5) against N = 512, 1024, 2048, 4096, where A, B and C are all NxN]
      • Speed-up achievable? > 8.0x
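The slide gives only the shapes (8x8 becomes 2x32), not the element order inside a chunk. The sketch below assumes that the 4 rows of a chunk are interleaved column by column, so that the 4 values of A needed at step k of a 4x4 block are adjacent. The function name `interleave_a_4rows` is hypothetical, and M is assumed to be a multiple of 4.

```c
#include <stddef.h>

/* Reshape row-major A (M x K) into M/4 rows of K*4 floats.
 * Within chunk c, the values A[4c+0][k] .. A[4c+3][k] are stored
 * contiguously at offset k*4 (assumed interleaving order). */
void interleave_a_4rows(const float *A, size_t M, size_t K, float *Ai)
{
    for (size_t c = 0; c < M / 4; ++c)
        for (size_t k = 0; k < K; ++k)
            for (size_t r = 0; r < 4; ++r)
                Ai[(c * K + k) * 4 + r] = A[(c * 4 + r) * K + k];
}
```

With M = K = 8 this produces 2 rows of 32 floats, matching the shapes on the slide. Combined with a B matrix reshaped in 1x4 chunks, a thread can then read 4 values of A and 4 values of B per step and update all 16 accumulators of its 4x4 block.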
  • 15. FFT-based Convolution
      • Convolution in the spatial domain is equivalent to a pointwise (element-by-element) multiplication in the frequency domain
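For reference, the property behind this slide is the convolution theorem. Stated for 1D sequences of length N and the discrete Fourier transform (the symbols x, h, y and their transforms X, H, Y are introduced here for illustration and do not appear in the slides):

```latex
\[
  y[n] = \sum_{m=0}^{N-1} x[m]\, h[(n - m) \bmod N]
  \quad\Longleftrightarrow\quad
  Y[k] = X[k]\, H[k], \qquad k = 0, \dots, N-1
\]
```

The DFT form gives a circular convolution. To obtain an ordinary (linear) convolution, both the image and the kernel are zero-padded along each dimension to at least the image size plus the kernel size minus 1 before transforming.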
  • 16. From Radix-2 to Mixed-Radix
      • The most famous FFT is the Radix-2 Cooley–Tukey algorithm (only for N a power of 2: N = 2 x 2 x 2…)
      • In general, any factorization of N is possible (N = N1 x N2 x N3 x…)
      • Mixed-Radix is the generalization of the basic radix-2 FFT
      • Over 1.5x better performance than Radix-2
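The factorization N = N1 x N2 x N3 x… can be sketched as a greedy split of N into stage radices. The set of radices below (7, 5, 4, 3, 2) is an assumption made for illustration, as the slides do not list the radices actually implemented. The function name `factorize_radix` is also hypothetical.

```c
#include <stddef.h>

/* Split n into a product of supported radices, largest first.
 * Writes the radix of each FFT stage into `stages` and returns the
 * number of stages, or 0 if n is 0, does not factor into the supported
 * radices, or needs more than max_stages stages. */
size_t factorize_radix(size_t n, size_t *stages, size_t max_stages)
{
    /* Assumed radix set, for illustration only. */
    static const size_t radices[] = {7, 5, 4, 3, 2};
    size_t count = 0;

    if (n == 0)
        return 0;

    for (size_t i = 0; i < sizeof radices / sizeof radices[0]; ++i) {
        while (n % radices[i] == 0) {
            if (count == max_stages)
                return 0;
            stages[count++] = radices[i];
            n /= radices[i];
        }
    }
    return n == 1 ? count : 0;
}
```

Under these assumptions N = 120 splits into stages 5, 4, 3, 2, whereas a radix-2-only FFT would need the input padded to 128.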
  • 17. FFT Implementation
      • Recursive FFT in-place computation*
          • Each thread computes a single radix-N (floating point computation)
      • Block-wise 2x2 in-place transposition
          • ~2x better performance than 2x2 out-of-place transposition
      • Out-of-place batched convolution
          • High memory requirements, as we have to keep the frequency representation for:
              1. Input image
              2. Convolution kernels
              3. Result of convolutions
      * https://community.arm.com/groups/arm-mali-graphics/blog/2016/02/01/speeding-up-fast-fourier-transform-mixed-radix-on-mobile-arm-mali-gpu-by-means-of-opencl-part-2
  • 18. SGEMM vs FFT (I)
      • SGEMM-based convolution
          • High memory requirements due to im2col:
              • stride < kernel dimension
              • large convolution kernel
              • large input image
      • FFT-based convolution
          • No efficient way to handle stride != 1
          • High memory requirements for batched convolutions
          • It could require considerable effort to optimize well
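The im2col memory cost can be estimated with a few lines of arithmetic. The sketch below assumes no padding and uses a hypothetical function name, `im2col_elems`; it counts the floats in the im2col buffer for one input image.

```c
#include <stddef.h>

/* Number of floats in the im2col buffer for one image of h x w pixels
 * with `channels` channels. Assumes no padding, h >= kh, w >= kw and
 * stride >= 1. */
size_t im2col_elems(size_t h, size_t w, size_t channels,
                    size_t kh, size_t kw, size_t stride)
{
    size_t oh = (h - kh) / stride + 1;   /* output rows    */
    size_t ow = (w - kw) / stride + 1;   /* output columns */
    return oh * ow * kh * kw * channels;
}
```

For example, a 32x32 single-channel image (1024 values) with a 3x3 kernel expands to 8100 values at stride 1, but only to 900 at stride 3. This is why the slide singles out stride < kernel dimension as the expensive case.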
  • 19. SGEMM vs FFT (II)
      • Study limited to the inference problem
      • Stride x = 1 and stride y = 1
      • Number of channels = 1
      • Pre-computed FFT for convolution kernels
      • Case 1: 1 input image, 64/128/256 convolution kernels
      • Case 2: 64 input images, 32 convolution kernels
      • [Charts: image size against kernel size / number of convolutions (Case 1) and against kernel size (Case 2); legend: SGEMM, FFT]
      • And using stride x = 2?
  • 20. Limited Numerical Precision for CNN (I)
      • Some papers ([1], [2]) have demonstrated the feasibility of using limited numerical precision for CNN
      • This opens an interesting computational scenario if, for instance, the HW has accelerators for 16-bit half-precision:
          • Performance boosting
              • Reduced memory traffic to/from external memory
              • Possible to dispatch fewer threads
          • Energy saving
              • Essentially due to the reduced memory traffic to/from the external memory
      [1] Training Deep Neural Networks with Low Precision Multiplications, Matthieu Courbariaux, Jean-Pierre David
      [2] Deep Learning with Limited Numerical Precision, Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, Pritish Narayanan
  • 21. Limited Numerical Precision for CNN (II)
      • SGEMM speed-up: > 2.0x
          • It is possible to dispatch fewer threads, i.e. 8x4 elements per thread
      • FFT speed-up: > 1.5x
          • We cannot dispatch fewer threads: each thread computes a single radix-N
      • [Plots: SGEMM speed-up and FFT speed-up (axes from 1 to 2.5) against N = 512, 1024, 2048, 4096; for SGEMM, A, B and C are all NxN]
  • 22. Lessons Learned
      1. A cache-efficient data layout has a huge impact on the performance of our algorithm, in GPU computing too
      2. Simple changes in data layout can lead to:
          • dispatching fewer threads
          • better use of vector instructions
      3. Limited numerical precision plays a crucial role IF it is HW accelerated
      4. Convolution is an embarrassingly parallel task which can be easily and efficiently accelerated on a mobile GPU by means of OpenCL
  • 23. Question Time
  • 24. Thank you! The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. Copyright © 2016 ARM Limited