Copyright © 2016 ARM Ltd 1
Gian Marco Iodice, SW Engineer – ARM
May 3, 2016
Using SGEMM and FFTs to Accelerate
Deep Learning
Contents
• About ARM
• Convolutional Neural Networks (CNN)
• Architecture and building blocks
• Convolutional Layer
• SGEMM-based convolution
• FFT-based convolution
• SGEMM vs FFT
• Limited Numerical Precision for CNN
• Lessons Learned
ARM Ltd
• ARM Holdings plc is a British multinational semiconductor and software
design company (www.arm.com)
• Headquarters in Cambridge, England
Architecture and Building Blocks of CNN
• Convolutional layer (core block of CNN)
• Number of convolution kernels (filter bank)
• Filter shape (width, height and depth)
• Pooling layer (typical values 2x2)
• Non-linear gating (ReLU)
• Classifier: Fully Connected Neural Network
[Figure label: Learned Non-Linear Trainable Feature]
Why Are We Going to Study the Convolutional Layer?
Compute Load for AlexNet Inference*
• conv1: 16.9%
• relu: 0.7%
• pool: 1.0%
• conv2: 21.9%
• pool2: 0.7%
• norm2: 0.5%
• conv3: 17.8%
• relu3: 0.2%
• conv4: 17.8%
• conv5: 17.7%
• fc6: 1.8%
• fc7: 0.8%
* Learning Semantic Image Representations at a Large Scale, Yangqing Jia
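As a quick check on the chart, summing the listed shares shows how much of the inference compute falls in the convolution layers. This is a minimal Python sketch: the dictionary copies the percentages from the slide, and the variable names are illustrative.

```python
# Per-layer share (%) of AlexNet inference compute, copied from the slide.
compute_load = {
    "conv1": 16.9, "relu": 0.7, "pool": 1.0,
    "conv2": 21.9, "pool2": 0.7, "norm2": 0.5,
    "conv3": 17.8, "relu3": 0.2,
    "conv4": 17.8, "conv5": 17.7,
    "fc6": 1.8, "fc7": 0.8,
}

# Sum the shares of the convolution layers and of the fully connected layers.
conv_share = sum(v for name, v in compute_load.items() if name.startswith("conv"))
fc_share = sum(v for name, v in compute_load.items() if name.startswith("fc"))

print(f"convolution layers: {conv_share:.1f}%")
print(f"fully connected layers: {fc_share:.1f}%")
```

With the values listed on the slide, the five convolution layers add up to about 92% of the compute load.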
From 2D Convolution to 3D Batched Convolution
• Most of the time, the convolution layers have:
• Multiple input images
• Multiple convolution kernels (various dimensions and shapes)
• Multiple channels per image/kernel (not necessarily 3!)
[Figure: input image, kernels and output images]
Why don’t we use a sliding-window approach?
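For reference, the direct sliding-window form of a batched, multi-channel convolution can be sketched in Python as below. This is an illustrative sketch, not code from the talk: the function name and the nested-list layouts are assumptions, there is no padding, and the kernel is not flipped (the cross-correlation convention that CNN frameworks usually call convolution).

```python
def conv_sliding_window(images, kernels, stride=1):
    """Direct batched convolution (no padding, no kernel flip).

    images:  [n_images][channels][height][width]
    kernels: [n_kernels][channels][k_height][k_width]
    returns: [n_images][n_kernels][out_height][out_width]
    """
    channels = len(images[0])
    height, width = len(images[0][0]), len(images[0][0][0])
    k_h, k_w = len(kernels[0][0]), len(kernels[0][0][0])
    out_h = (height - k_h) // stride + 1
    out_w = (width - k_w) // stride + 1

    output = []
    for image in images:
        per_image = []
        for kernel in kernels:
            plane = [[0.0] * out_w for _ in range(out_h)]
            for oy in range(out_h):
                for ox in range(out_w):
                    acc = 0.0
                    # Slide the kernel window over every channel.
                    for c in range(channels):
                        for dy in range(k_h):
                            for dx in range(k_w):
                                acc += (image[c][oy * stride + dy][ox * stride + dx]
                                        * kernel[c][dy][dx])
                    plane[oy][ox] = acc
            per_image.append(plane)
        output.append(per_image)
    return output


image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # one channel, 3x3
kernel = [[[1, 0], [0, 1]]]                    # one channel, 2x2
result = conv_sliding_window([image], [kernel])
print(result[0][0])   # [[6.0, 8.0], [12.0, 14.0]]
```

The seven nested loops have short inner trip counts and read the image in small, scattered windows. That is one plausible reason to recast the problem as a single large matrix multiplication, as the following slides do.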
SGEMM-based Convolution
SGEMM: Single-precision GEneral Matrix Multiply
C = α∙AB + β∙C
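The SGEMM update can be written out as a small reference routine. The sketch below is illustrative only: the function name `sgemm` mirrors the BLAS routine but is a plain Python helper, and Python floats are double precision, so it shows the arithmetic rather than single-precision behaviour.

```python
def sgemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for row-major lists of lists."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    assert len(c) == m and all(len(row) == n for row in c)
    result = []
    for i in range(m):
        row = []
        for j in range(n):
            # Dot product of row i of A with column j of B.
            dot = sum(a[i][p] * b[p][j] for p in range(k))
            row.append(alpha * dot + beta * c[i][j])
        result.append(row)
    return result


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
print(sgemm(2.0, a, b, 3.0, c))   # [[41.0, 47.0], [89.0, 103.0]]
```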
Im2col
• im2col stores in each row the pixels needed for one kernel application
• This costs memory: pixels are duplicated
• col2im restores the output image structure
[Figure: im2col applied to the input image with stride X; labels: Input image, Output image, Output images, A, B, C]
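The im2col, matrix multiply and col2im steps can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names, the nested-list layout and the (channel, y, x) flattening order are choices made for the example, and there is no padding.

```python
def im2col(image, k_h, k_w, stride=1):
    """Unroll image[channel][y][x] into one row per kernel application."""
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    out_h = (height - k_h) // stride + 1
    out_w = (width - k_w) // stride + 1
    rows = []
    for oy in range(out_h):
        for ox in range(out_w):
            rows.append([image[c][oy * stride + dy][ox * stride + dx]
                         for c in range(channels)
                         for dy in range(k_h)
                         for dx in range(k_w)])
    return rows, out_h, out_w


def conv_im2col(image, kernels, stride=1):
    """Convolution as im2col, then matrix multiply, then col2im."""
    k_h, k_w = len(kernels[0][0]), len(kernels[0][0][0])
    cols, out_h, out_w = im2col(image, k_h, k_w, stride)
    # Flatten each kernel in the same (channel, y, x) order used by im2col.
    flat_kernels = [[w for plane in kernel for row in plane for w in row]
                    for kernel in kernels]
    # Matrix multiply: (positions x patch) times (patch x kernels).
    product = [[sum(p * w for p, w in zip(patch, flat)) for flat in flat_kernels]
               for patch in cols]
    # col2im: give each kernel's column back its 2D output image shape.
    return [[[product[oy * out_w + ox][k] for ox in range(out_w)]
             for oy in range(out_h)]
            for k in range(len(kernels))]


image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]          # 1 channel, 3x3
kernels = [[[[1, 0], [0, 1]]], [[[1, 1], [1, 1]]]]   # 2 kernels, 1 channel, 2x2
cols, _, _ = im2col(image, 2, 2)
print(cols)                          # 4 rows of 4 pixels
print(conv_im2col(image, kernels))
```

In this example the im2col matrix holds 16 values for a 9-pixel image, which is the pixel duplication the slide warns about.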
SGEMM: Naïve implementation
• Each thread computes a single element of the output matrix
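• Not cache friendly: consecutive loads from Matrix B are N elements apart
/* c0 accumulates one element of Matrix C */
float c0 = 0.0f;
/* First accumulation */
ai = load(addr_a);
bi = load(addr_b);
c0 += ai * bi;
/* Second accumulation: the next element of B's column is N elements away */
ai = load(addr_a + 1);
bi = load(addr_b + 1 * N);
c0 += ai * bi;
…
store(c0, addr_c);
[Figure: Matrix A, Matrix B, Matrix C]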
Transpose Matrix B
[Figure: Matrix B transposition]
/* First accumulation */
ai = load(addr_a);
bi = load(addr_b);
c0 += ai * bi;
/* Second accumulation: B is transposed, so the next element is adjacent */
ai = load(addr_a + 1);
bi = load(addr_b + 1);
c0 += ai * bi;
...
store(c0, addr_c);
[Figure: Matrix A, Matrix B, Matrix C]
Speed-up achievable? 1.1x…
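The difference between the naive kernel and the transposed-B kernel comes down to the index used to read Matrix B. The Python sketch below makes the two access patterns explicit on flat row-major arrays. The function names are illustrative, and Python itself will not show the cache effect: the point is only that both loops compute the same product while stepping through B with stride N versus stride 1.

```python
def gemm_naive(a, b, n):
    """C = A * B for row-major flat n x n arrays; B is read column-wise."""
    c = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                # Consecutive k values touch B addresses n elements apart.
                acc += a[i * n + k] * b[k * n + j]
            c[i * n + j] = acc
    return c


def transpose(b, n):
    """Return the transpose of a row-major flat n x n array."""
    return [b[k * n + j] for j in range(n) for k in range(n)]


def gemm_b_transposed(a, bt, n):
    """Same product, but reading the pre-transposed B row-wise."""
    c = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                # Consecutive k values now touch adjacent addresses in bt.
                acc += a[i * n + k] * bt[j * n + k]
            c[i * n + j] = acc
    return c


n = 3
a = [float(v) for v in range(1, 10)]
b = [float(v) for v in range(9, 0, -1)]
assert gemm_naive(a, b, n) == gemm_b_transposed(a, transpose(b, n), n)
```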
Transpose Matrix B in Chunks of 1x4 (I)
• Each thread computes 1x4 elements of the output matrix
• Not cache friendly: consecutive vector loads from Matrix B are still N elements apart
float4 out = 0.0f;
/* First accumulation */
ai = load(addr_a);
bi = vload4(addr_b);
out += (float4)ai * bi;
/* Second accumulation */
ai = load(addr_a + 1);
bi = vload4(addr_b + 1 * N);
out += (float4)ai * bi;
...
store4(out, addr_c);
[Figure: Matrix A, Matrix B, Matrix C]
Transpose Matrix B in Chunks of 1x4 (II)
float4 out = 0.0f;
/* First accumulation */
ai = load(addr_a);
bi = vload4(addr_b);
out += (float4)ai * bi;
/* Second accumulation */
ai = load(addr_a + 1);
bi = vload4(addr_b + 4);
out += (float4)ai * bi;
...
store4(out, addr_c);
[Figure: Matrix B reshaped into Matrix BT1x4]
[Chart: SGEMM speed-up (y-axis, 2 to 4) versus N = 512, 1024, 2048, 4096, where A=NxN, B=NxN, C=NxN]
Speed-up achievable? 3.5x
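One layout consistent with the pointer step of 4 in the kernel above places each 1x4 chunk of a row of B directly before the chunk beneath it. The Python sketch below builds that layout and uses it in a 1x4-per-thread product. The layout is inferred from the slide's code rather than stated in it, and the function names are illustrative.

```python
def transpose_1x4(b, rows, cols):
    """Reshape row-major B (rows x cols) so each 1x4 chunk of a row is
    followed by the chunk directly below it; cols must be a multiple of 4."""
    assert cols % 4 == 0
    out = []
    for jb in range(cols // 4):          # one block of 4 columns at a time
        for k in range(rows):
            out.extend(b[k * cols + 4 * jb: k * cols + 4 * jb + 4])
    return out


def gemm_1x4(a, bt, m, k_dim, n):
    """Each (i, jb) iteration plays the role of one thread computing 1x4 outputs."""
    c = [0.0] * (m * n)
    for i in range(m):
        for jb in range(n // 4):
            out = [0.0] * 4
            addr_b = jb * 4 * k_dim      # start of this column block in bt
            for k in range(k_dim):
                ai = a[i * k_dim + k]
                bi = bt[addr_b + 4 * k: addr_b + 4 * k + 4]   # like vload4
                for t in range(4):
                    out[t] += ai * bi[t]
            c[i * n + 4 * jb: i * n + 4 * jb + 4] = out       # like store4
    return c


# A is 1x2, B is 2x8 (rows 1..8 and 9..16).
a = [1.0, 2.0]
b = [float(v) for v in range(1, 17)]
bt = transpose_1x4(b, 2, 8)
c = gemm_1x4(a, bt, 1, 2, 8)
print(bt[:8])   # [1.0, 2.0, 3.0, 4.0, 9.0, 10.0, 11.0, 12.0]
print(c)
```

After the reshape, every vector load in the inner loop reads the 4 values that immediately follow the previous load.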
Reshaping Matrix A (I)
• We can do more: we can compute a block of 4x4 elements per thread in order to re-use the values loaded from Matrix A
[Figure: Matrix A, Matrix BT1x4, Matrix C]
Reshaping Matrix A (II)
[Figure: Matrix A (8x8) reshaped into Matrix AI (2x32); Chunk 0 and Chunk 1, where a chunk is a block of 4 rows]
[Chart: SGEMM speed-up (y-axis, 6.5 to 8.5) versus N = 512, 1024, 2048, 4096, where A=NxN, B=NxN, C=NxN]
Speed-up achievable? > 8.0x
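The reshaping of Matrix A can be sketched in the same spirit. The slide only states that an 8x8 matrix becomes 2x32, with one chunk per block of 4 rows. The column-by-column interleaving order used below is an assumption, as are the function names. The sketch combines the reshaped A with a 1x4-chunked B so that each inner step reads 4 contiguous values from each matrix and updates a 4x4 block of C.

```python
def interleave_4x4(a, rows, cols):
    """Reshape A so each chunk of 4 rows becomes one row, column by column."""
    assert rows % 4 == 0
    out = []
    for r0 in range(0, rows, 4):
        chunk = []
        for k in range(cols):
            chunk.extend(a[r0 + r][k] for r in range(4))
        out.append(chunk)
    return out


def transpose_1x4(b, rows, cols):
    """Reshape B so each block of 4 columns becomes one contiguous row."""
    assert cols % 4 == 0
    return [[b[k][4 * jb + t] for k in range(rows) for t in range(4)]
            for jb in range(cols // 4)]


def gemm_4x4(ai, bt, m, k_dim, n):
    """Each (ib, jb) iteration plays the role of one thread computing 4x4 outputs."""
    c = [[0.0] * n for _ in range(m)]
    for ib in range(m // 4):
        for jb in range(n // 4):
            acc = [[0.0] * 4 for _ in range(4)]
            for k in range(k_dim):
                a4 = ai[ib][4 * k: 4 * k + 4]   # 4 rows of A, same column k
                b4 = bt[jb][4 * k: 4 * k + 4]   # 4 columns of B, same row k
                for r in range(4):
                    for t in range(4):
                        acc[r][t] += a4[r] * b4[t]   # each a4 value is re-used 4 times
            for r in range(4):
                c[4 * ib + r][4 * jb: 4 * jb + 4] = acc[r]
    return c


n = 8
a = [[float(i * n + j) for j in range(n)] for i in range(n)]
b = [[float((i + 2 * j) % 5) for j in range(n)] for i in range(n)]
ai = interleave_4x4(a, n, n)
bt = transpose_1x4(b, n, n)
c = gemm_4x4(ai, bt, n, n, n)
reference = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
print(len(ai), len(ai[0]))   # 2 32
print(c == reference)        # True
```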
FFT-based Convolution
• Convolution in the spatial domain is equivalent to a point-wise multiplication in the frequency domain
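This equivalence (the convolution theorem) can be checked numerically. The Python sketch below is a 1D illustration with a naive O(N²) DFT, chosen for brevity rather than speed. The function names are illustrative. Note that the theorem gives circular convolution, so a real implementation would zero-pad the image and kernel to obtain a linear convolution.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (forward or inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def circular_convolution_direct(x, h):
    """Circular convolution computed directly in the spatial domain."""
    n = len(x)
    return [sum(x[m] * h[(i - m) % n] for m in range(n)) for i in range(n)]


def circular_convolution_fft(x, h):
    """Transform, multiply point-wise, transform back."""
    product = [a * b for a, b in zip(dft(x), dft(h))]
    return [v.real for v in dft(product, inverse=True)]


x = [1.0, 2.0, 3.0, 4.0]
h = [1.0, 0.0, 0.0, 1.0]
direct = circular_convolution_direct(x, h)
via_fft = circular_convolution_fft(x, h)
print(direct)                               # [3.0, 5.0, 7.0, 5.0]
print([round(v, 6) for v in via_fft])
```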
From Radix-2 to Mixed-Radix
• The most famous FFT is the radix-2 Cooley–Tukey algorithm (only for N a power of 2: N = 2 x 2 x 2…)
• In general, any factorization of N can be used (N = N1 x N2 x N3 x…)
• Mixed-radix is the generalization of the basic radix-2 FFT
• Over 1.5x better performance than radix-2
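A recursive mixed-radix Cooley–Tukey FFT can be sketched in a few lines of Python. This is an illustrative decimation-in-time version, not the implementation from the talk: it peels off the smallest factor of N at each level, and the function names are assumptions.

```python
import cmath


def smallest_factor(n):
    """Smallest factor of n greater than 1 (n itself when n is prime)."""
    f = 2
    while f * f <= n:
        if n % f == 0:
            return f
        f += 1
    return n


def mixed_radix_fft(x):
    """Recursive decimation-in-time FFT for any length N = N1 x N2 x ..."""
    n = len(x)
    if n == 1:
        return list(x)
    p = smallest_factor(n)      # radix used at this level
    m = n // p
    # p interleaved sub-sequences, each of length m, transformed recursively.
    subs = [mixed_radix_fft(x[r::p]) for r in range(p)]
    out = [0j] * n
    for k in range(m):
        for q in range(p):
            idx = k + m * q
            # Radix-p butterfly: combine the p sub-transforms with twiddle factors.
            out[idx] = sum(cmath.exp(-2j * cmath.pi * r * idx / n) * subs[r][k]
                           for r in range(p))
    return out


def naive_dft(x):
    """O(N^2) reference used only to check the result."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


signal = [complex(v) for v in range(12)]    # 12 = 2 x 2 x 3
error = max(abs(p - q) for p, q in zip(mixed_radix_fft(signal), naive_dft(signal)))
print(error < 1e-9)   # True
```

With N = 12 the recursion uses radix 2 twice and radix 3 once. For a power of 2 it reduces to the radix-2 algorithm.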
FFT Implementation
• Recursive FFT in-place computation*
• Each thread computes a single radix-N (floating point computation)
• Block-wise 2x2 in-place transposition
• ~2x better performance than 2x2 out-of-place transposition
• Out-of-place batched convolution
• High memory requirements as we have to keep the frequency representation for:
1. Input image
2. Convolution kernels
3. Result of convolutions
* https://community.arm.com/groups/arm-mali-graphics/blog/2016/02/01/speeding-up-fast-fourier-transform-mixed-radix-on-mobile-arm-mali-gpu-by-means-of-opencl-part-2
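A block-wise 2x2 in-place transposition can be sketched as below. This Python version only illustrates the idea of swapping 2x2 blocks across the diagonal without a second buffer. It assumes a square matrix with an even size, and it is not the OpenCL kernel from the talk.

```python
def transpose_inplace_2x2(matrix):
    """Transpose a square matrix in place, working on 2x2 blocks."""
    n = len(matrix)
    assert n % 2 == 0 and all(len(row) == n for row in matrix)
    for bi in range(0, n, 2):
        # Diagonal block: only its two off-diagonal elements move.
        matrix[bi][bi + 1], matrix[bi + 1][bi] = matrix[bi + 1][bi], matrix[bi][bi + 1]
        # Blocks right of the diagonal swap with their mirror blocks below it.
        for bj in range(bi + 2, n, 2):
            for r in range(2):
                for c in range(2):
                    (matrix[bi + r][bj + c],
                     matrix[bj + c][bi + r]) = (matrix[bj + c][bi + r],
                                                matrix[bi + r][bj + c])
    return matrix


grid = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
transpose_inplace_2x2(grid)
print(grid[0])   # [1, 5, 9, 13]
```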
SGEMM vs FFT (I)
SGEMM-based convolution
• High memory requirements due to im2col:
• stride < kernel dimension
• large convolution kernel
• large input image
FFT-based convolution
• No efficient way to handle stride != 1
• High memory requirements for batched convolutions
• It could require considerable effort to optimize well
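The im2col memory cost is easy to estimate: the unrolled matrix has one row per output position and one column per kernel element. The Python sketch below works through two illustrative cases. The sizes are made-up examples (not measurements from the talk), and the helper name is an assumption.

```python
def im2col_elements(height, width, channels, k_h, k_w, stride):
    """Number of values in the im2col matrix (no padding)."""
    out_h = (height - k_h) // stride + 1
    out_w = (width - k_w) // stride + 1
    return out_h * out_w * k_h * k_w * channels


input_elements = 32 * 32 * 1                      # 32x32 image, 1 channel
overlapping = im2col_elements(32, 32, 1, 5, 5, stride=1)
non_overlapping = im2col_elements(32, 32, 1, 5, 5, stride=5)

print(input_elements)     # 1024
print(overlapping)        # 19600: stride < kernel size, windows overlap
print(non_overlapping)    # 900: stride == kernel size, no duplication
```

With stride 1 and a 5x5 kernel, the im2col buffer is roughly 19 times larger than the input image. Once the stride equals the kernel dimension, the duplication disappears.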
SGEMM vs FFT (II)
• Study limited to the inference problem
• Stride x = 1 and stride y = 1
• Number of channels = 1
• Pre-computed FFT for convolution kernels
Case 1: 1 input image, 64/128/256 convolution kernels
[Chart: SGEMM vs FFT; image size against kernel size / number of convolutions]
Case 2: 64 input images, 32 convolution kernels
[Chart: SGEMM vs FFT; image size against kernel size]
And using stride x = 2?
Limited Numerical Precision for CNN (I)
• Some papers ([1], [2]) have demonstrated the feasibility of using limited numerical precision for CNNs
• This opens an interesting computational scenario if, for instance, the hardware has accelerators for 16-bit half precision:
• Performance boost
• Reduced memory traffic to/from external memory
• Possible to dispatch fewer threads
• Energy saving
• Essentially due to the reduced memory traffic to/from external memory
[1] Training Deep Neural Networks with Low Precision Multiplications, Matthieu Courbariaux, Jean-Pierre David
[2] Deep Learning with Limited Numerical Precision, Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, Pritish Narayanan
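The memory-traffic argument can be made concrete with a little arithmetic. The Python sketch below uses the standard `struct` module, whose `e` format is IEEE 754 half precision, to compare storage sizes and to show the rounding introduced by 16-bit storage. The matrix size and helper name are illustrative assumptions.

```python
import struct


def to_half_and_back(value):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


n = 1024                                           # illustrative N x N matrix
bytes_fp32 = n * n * struct.calcsize("<f")         # 4 bytes per element
bytes_fp16 = n * n * struct.calcsize("<e")         # 2 bytes per element

print(bytes_fp32, bytes_fp16)                      # 4194304 2097152
print(to_half_and_back(0.1))                       # close to 0.1, not exact
```

Halving the element size halves the bytes moved to and from external memory for the same matrix, at the cost of a small rounding error per value.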
Limited Numerical Precision for CNN (II)
[Chart: SGEMM speed-up (y-axis, 1 to 2.5) versus N = 512, 1024, 2048, 4096, where A=NxN, B=NxN, C=NxN]
• SGEMM speed-up: > 2.0x
• It is possible to dispatch fewer threads, i.e. 8x4 elements per thread
[Chart: FFT speed-up (y-axis, 1 to 2.5) versus N = 512, 1024, 2048, 4096]
• FFT speed-up: > 1.5x
• We cannot dispatch fewer threads: each thread computes a single radix-N
Lessons Learned
1. A cache-efficient data layout has a huge impact on the performance of our algorithms, for GPU computing as well
2. Simple changes in data layout can make it possible to:
• dispatch fewer threads
• make better use of vector instructions
3. Limited numerical precision plays a crucial role if it is hardware-accelerated
4. Convolution is an embarrassingly parallel task that can be easily and efficiently accelerated on a mobile GPU by means of OpenCL
Question Time
Thank you!
The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or
elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.
Copyright © 2016 ARM Limited
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
 
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
 
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
 

Recently uploaded

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 

Recently uploaded (20)

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 

"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM

  • 1. Using SGEMM and FFTs to Accelerate Deep Learning • Gian Marco Iodice, SW Engineer – ARM • May 3, 2016 • Copyright © 2016 ARM Ltd
  • 2. Contents • About ARM • Convolutional Neural Networks (CNN) • Architecture and building blocks • Convolutional Layer • SGEMM-based convolution • FFT-based convolution • SGEMM vs FFT • Limited Numerical Precision for CNN • Lessons Learned
  • 3. ARM Ltd • ARM Holdings plc is a British multinational semiconductor and software design company (www.arm.com) • Headquarters in Cambridge, England
  • 4. Architecture and Building Blocks of CNN • Convolutional layer (core block of CNN) • Number of convolution kernels (filter bank) • Filter shape (width, height and depth) • Pooling layer (typical values 2x2) • Non-linear gating (ReLU) • Classifier: Fully Connected Neural Network [Figure labels: Learned Non-Linear Trainable Feature]
  • 5. Why Are We Going to Study the Convolutional Layer? • Compute Load for AlexNet Inference*: conv1 16.9%, relu 0.7%, pool 1.0%, conv2 21.9%, pool2 0.7%, norm2 0.5%, conv3 17.8%, relu3 0.2%, conv4 17.8%, conv5 17.7%, fc6 1.8%, fc7 0.8% (the five convolutional layers together account for about 92% of the load) • *Learning Semantic Image Representations at a Large Scale, Yangqing Jia
  • 6. From 2D Convolution to 3D Batched Convolution • Most of the time, the convolution layers have: • Multiple input images • Multiple convolution kernels (various dimensions and shapes) • Multiple channels per image/kernel (not necessarily 3!) • Why don’t we use a sliding-window approach? [Figure: input image, kernels, output images]
  • 7. SGEMM-based Convolution • SGEMM: Single-precision GEneral Matrix Multiply • C = α∙AB + β∙C
  • 8. Im2col • im2col stores in each row the pixels needed for one kernel application • This costs memory, because pixels are duplicated • col2im restores the output image structure [Figure: input image, matrices A, B and C, output images; stride X]
  • 9. SGEMM: Naïve implementation • Each thread computes a single element of the output matrix • Not cache friendly! [Figure: Matrix A, Matrix B, Matrix C]
      /* Consecutive elements of a column of B are N values apart (row stride N) */
      /* First accumulation */
      ai = load(addr_a);
      bi = load(addr_b);
      c0 += ai * bi;
      /* Second accumulation */
      ai = load(addr_a + 1);
      bi = load(addr_b + 1 * N);
      c0 += ai * bi;
      …
      store(c0, addr_c);
  • 10. Transpose Matrix B • Speed-up achievable? 1.1x… [Figure: Matrix B transposition; Matrix A, Matrix B, Matrix C]
      /* After transposing B, consecutive accumulations read adjacent addresses */
      /* First accumulation */
      ai = load(addr_a);
      bi = load(addr_b);
      c00 += ai * bi;
      /* Second accumulation */
      ai = load(addr_a + 1);
      bi = load(addr_b + 1);
      c00 += ai * bi;
      ...
      store(c00, addr_c);
  • 11. Transpose Matrix B in Chunks of 1x4 (I) • Each thread computes 1x4 elements of the output matrix • Not cache friendly! [Figure: Matrix A, Matrix B, Matrix C]
      float4 out = 0.0f;
      /* First accumulation */
      ai = load(addr_a);
      bi = vload4(addr_b);
      out += (float4)ai * bi;
      /* Second accumulation */
      ai = load(addr_a + 1);
      bi = vload4(addr_b + 1 * N);
      out += (float4)ai * bi;
      ...
      store4(out, addr_c);
  • 12. Transpose Matrix B in Chunks of 1x4 (II) • Speed-up achievable? 3.5x [Figure: Matrix B reshaped into Matrix BT1x4] [Chart: SGEMM speed-up vs. N for N = 512, 1024, 2048, 4096, with A, B and C all NxN]
      float4 out = 0.0f;
      /* First accumulation */
      ai = load(addr_a);
      bi = vload4(addr_b);
      out += (float4)ai * bi;
      /* Second accumulation */
      ai = load(addr_a + 1);
      bi = vload4(addr_b + 4);
      out += (float4)ai * bi;
      ...
      store4(out, addr_c);
  • 13. Reshaping Matrix A (I) • We can do more: we can compute a block of 4x4 elements per thread in order to reuse the values loaded from Matrix A [Figure: Matrix A, Matrix BT1x4, Matrix C]
  • 14. Reshaping Matrix A (II) • Chunk = block of 4 rows • Example: Matrix A (8x8) holds Chunk 0 and Chunk 1 and is reshaped into Matrix AI (2x32) • Speed-up achievable? > 8.0x [Chart: SGEMM speed-up vs. N for N = 512, 1024, 2048, 4096, with A, B and C all NxN]
  • 15. FFT-based Convolution • Convolution in the spatial domain is equivalent to an element-wise (pointwise) multiplication in the frequency domain
  • 16. From Radix-2 to Mixed-Radix • The most famous FFT is the radix-2 Cooley–Tukey algorithm (only for N a power of 2: N = 2 x 2 x 2…) • In general, any factorization of N is possible (N = N1 x N2 x N3 x…) • Mixed-radix is the generalization of the basic radix-2 FFT • Over 1.5x better performance than radix-2
  • 17. FFT Implementation • Recursive FFT in-place computation* • Each thread computes a single radix-N (floating point computation) • Block-wise 2x2 in-place transposition • ~2x better performance than 2x2 out-of-place transposition • Out-of-place batched convolution • High memory requirements, as we have to keep the frequency representation for: 1. Input image 2. Convolution kernels 3. Result of convolutions • * https://community.arm.com/groups/arm-mali-graphics/blog/2016/02/01/speeding-up-fast-fourier-transform-mixed-radix-on-mobile-arm-mali-gpu-by-means-of-opencl-part-2
  • 18. SGEMM vs FFT (I) • SGEMM-based convolution: • High memory requirements due to im2col when: • stride < kernel dimension • large convolution kernel • large input image • FFT-based convolution: • No efficient way to handle stride != 1 • High memory requirements for batched convolutions • It could require considerable effort to optimize well
  • 19. SGEMM vs FFT (II) • Study limited to the inference problem • Stride x = 1 and stride y = 1 • Number of channels = 1 • Pre-computed FFT for convolution kernels • Case 1: 1 input image, 64/128/256 convolution kernels [Chart: image size vs. kernel size / number of convolutions; legend: SGEMM, FFT] • Case 2: 64 input images, 32 convolution kernels [Chart: image size vs. kernel size; legend: SGEMM, FFT] • And using stride x = 2?
  • 20. Limited Numerical Precision for CNN (I) • Some papers ([1], [2]) have demonstrated the feasibility of using limited numerical precision for CNN • This opens an interesting computational scenario if, for instance, the HW has accelerators for 16-bit half precision: • Performance boosting • Reduced memory traffic to/from external memory • Possible to dispatch fewer threads • Energy saving • Essentially due to the reduced memory traffic to/from the external memory • [1] Training Deep Neural Networks with Low Precision Multiplications, Matthieu Courbariaux, Jean-Pierre David • [2] Deep Learning with Limited Numerical Precision, Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, Pritish Narayanan
  • 21. Limited Numerical Precision for CNN (II) • SGEMM speed-up: > 2.0x • It is possible to dispatch fewer threads, i.e. 8x4 elements per thread [Chart: SGEMM speed-up vs. N for N = 512, 1024, 2048, 4096, with A, B and C all NxN] • FFT speed-up: > 1.5x • We cannot dispatch fewer threads: each thread computes a single radix-N [Chart: FFT speed-up vs. N for N = 512, 1024, 2048, 4096]
  • 22. Lessons Learned 1. A cache-efficient data layout has a huge impact on the performance of our algorithms, also for GPU computing 2. Simple changes in data layout can make it possible to: • dispatch fewer threads • better exploit vector instructions 3. Limited numerical precision plays a crucial role IF it is HW accelerated 4. Convolution is an embarrassingly parallel task which can be easily and efficiently accelerated on a mobile GPU by means of OpenCL
  • 23. Question Time
  • 24. Thank you! • The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. Copyright © 2016 ARM Limited
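
As a rough illustration of slides 7 and 8, the following pure-Python sketch lowers a single-channel convolution to one matrix multiply. The function names (`im2col`, `matmul`, `conv2d_gemm`) and the list-of-lists layout are assumptions made for this example, not the presenter's OpenCL code. The multiply corresponds to SGEMM with α = 1 and β = 0, and, as is usual in CNNs, the kernel is applied without flipping.

```python
def im2col(image, kh, kw, stride=1):
    """Return one row per kernel application, holding the kh*kw pixels it reads."""
    h, w = len(image), len(image[0])
    rows = []
    for y in range(0, h - kh + 1, stride):
        for x in range(0, w - kw + 1, stride):
            rows.append([image[y + dy][x + dx]
                         for dy in range(kh) for dx in range(kw)])
    return rows


def matmul(a, b):
    """Plain matrix multiply C = A * B (SGEMM with alpha = 1, beta = 0)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]


def conv2d_gemm(image, kernels, stride=1):
    """Apply every kernel to a single-channel image with one matrix multiply."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    a = im2col(image, kh, kw, stride)
    # Matrix B: one column per kernel, flattened in the same order as im2col rows.
    b = [[k[dy][dx] for k in kernels] for dy in range(kh) for dx in range(kw)]
    c = matmul(a, b)
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    # col2im: column j of C is output image j, stored row by row.
    return [[[c[y * out_w + x][j] for x in range(out_w)] for y in range(out_h)]
            for j in range(len(kernels))]
```

For a 3x3 image and 2x2 kernels at stride 1, `im2col` produces 4 rows of 4 pixels, so 9 input pixels become 16 stored values. This is the pixel duplication that slide 8 warns about, and it grows with the kernel size whenever the stride is smaller than the kernel dimension.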
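
The data-layout change on slides 11 and 12 can be sketched on the CPU as follows. This is one plausible reading of the "BT1x4" layout, assuming flat row-major lists and a column count that is a multiple of 4; `reshape_b_1x4`, `gemm_1x4` and `gemm_naive` are hypothetical names for this example. After the reshape, the four B values needed at step p sit at `addr_b + 4 * p`, matching the `vload4(addr_b + 4)` access on slide 12.

```python
def gemm_naive(a, b, m, k, n):
    """Reference C = A * B on flat row-major lists (A is m x k, B is k x n)."""
    c = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i * k + p] * b[p * n + j]  # B is read with stride n
            c[i * n + j] = acc
    return c


def reshape_b_1x4(b, k, n):
    """Regroup B so each block of 4 columns becomes one contiguous run of k*4 values."""
    assert n % 4 == 0, "this sketch assumes n is a multiple of 4"
    out = []
    for j4 in range(0, n, 4):
        for p in range(k):
            out.extend(b[p * n + j4: p * n + j4 + 4])
    return out


def gemm_1x4(a, b_r, m, k, n):
    """Each (i, j4) iteration plays the role of one thread producing 1x4 outputs."""
    c = [0.0] * (m * n)
    for i in range(m):
        for j4 in range(0, n, 4):
            acc = [0.0] * 4
            addr_a = i * k
            addr_b = (j4 // 4) * k * 4
            for p in range(k):
                ai = a[addr_a + p]
                bi = b_r[addr_b + 4 * p: addr_b + 4 * p + 4]  # contiguous "vload4"
                for lane in range(4):
                    acc[lane] += ai * bi[lane]
            c[i * n + j4: i * n + j4 + 4] = acc
    return c
```

Both functions perform the same arithmetic; only the order in which B is laid out in memory differs, which is the point of lesson 1 on slide 22.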
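
The equivalence stated on slide 15 can be checked numerically with a small 1-D sketch. It uses a textbook recursive radix-2 FFT rather than the mixed-radix OpenCL implementation described in the talk, and the function names are assumptions for this example. Note that it computes a true convolution (the kernel is flipped), and that the inputs are zero-padded to a power of two so that the circular convolution produced by the FFT matches the linear one.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    assert n % 2 == 0, "length must be a power of two"
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft_convolve(signal, kernel):
    """Linear convolution via element-wise multiplication in the frequency domain."""
    out_len = len(signal) + len(kernel) - 1
    n = 1
    while n < out_len:
        n *= 2
    fs = fft(list(signal) + [0] * (n - len(signal)))
    fk = fft(list(kernel) + [0] * (n - len(kernel)))
    prod = [s * k for s, k in zip(fs, fk)]
    res = fft(prod, inverse=True)
    return [v.real / n for v in res[:out_len]]  # unnormalised inverse, so divide by n


def direct_convolve(signal, kernel):
    """Reference convolution computed directly in the spatial domain."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out
```

When the kernels are fixed, as in inference, their transforms (`fk` here) can be computed once and reused, which corresponds to the "pre-computed FFT for convolution kernels" assumption on slide 19.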
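
The memory-traffic argument on slide 20 can be made concrete with Python's standard `struct` module, which supports the IEEE 754 half-precision format code `e`. This is only a storage-size and rounding sketch under that assumption; it says nothing about the speed-ups measured in the talk, which depend on hardware support. The helper names are hypothetical.

```python
import struct


def to_single_bytes(values):
    """Pack values as little-endian 32-bit floats."""
    return struct.pack(f"<{len(values)}f", *values)


def to_half_bytes(values):
    """Pack values as little-endian 16-bit half-precision floats."""
    return struct.pack(f"<{len(values)}e", *values)


def from_half_bytes(buf):
    """Unpack a buffer of 16-bit half-precision floats back to Python floats."""
    return list(struct.unpack(f"<{len(buf) // 2}e", buf))
```

Packing the same values both ways shows that the half-precision buffer is exactly half the size, at the cost of a relative rounding error of roughly one part in a thousand for values in the normal range.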