"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM

Copyright © 2016 ARM Ltd
Gian Marco Iodice, SW Engineer – ARM
May 3, 2016

Contents
• About ARM
• Convolutional Neural Networks (CNN)
• Architecture and building blocks
• Convolutional Layer
• SGEMM-based convolution
• FFT-based convolution
• SGEMM vs FFT
• Limited Numerical Precision for CNN
• Lessons Learned

ARM Ltd
• ARM Holdings plc is a British multinational semiconductor and software
design company (www.arm.com)
• Headquarters in Cambridge, England

Architecture and Building Blocks of CNN
• Convolutional layer (core block of CNN)
• Number of convolution kernels (filters bank)
• Filter shape (width, height and depth)
• Pooling layer (typical values 2x2)
• Non-linear gating (ReLU)
• Classifier: Fully Connected Neural Network
[Figure label: Learned Non-Linear Trainable Feature]

Why Are We Going to Study the Convolutional Layer?
Compute Load for AlexNet Inference*
Layer    Share of compute
conv1    16.9%
relu     0.7%
pool     1.0%
conv2    21.9%
pool2    0.7%
norm2    0.5%
conv3    17.8%
relu3    0.2%
conv4    17.8%
conv5    17.7%
fc6      1.8%
fc7      0.8%
* Learning Semantic Image Representations at a Large Scale, Yangqing Jia

From 2D Convolution to 3D Batched Convolution
• Most of the time for the convolution layers we have:
• Multiple input images
• Multiple convolution kernels (various dimensions and shapes)
• Multiple channels per image/kernel (not necessarily 3!)
[Figure: input image, kernels and output images]
Why don't we use a sliding-window approach?
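For reference, the sliding-window (direct) approach can be sketched as a plain loop nest. This is only an illustration, not code from the presentation: the function name `conv_direct`, the row-major layouts, stride 1 and the absence of padding are all assumptions made for the sketch. A batch of input images would add one more outer loop.

```c
#include <assert.h>
#include <stddef.h>

/* Direct ("sliding window") convolution, no padding, stride 1.
 * Assumed layouts (hypothetical, chosen for this sketch):
 *   in : channels x in_h x in_w        (one image, row-major per channel)
 *   w  : kernels x channels x k x k
 *   out: kernels x out_h x out_w, with out_h = in_h - k + 1 (same for width)
 * As in most CNN code, the kernel is not flipped (cross-correlation). */
void conv_direct(const float *in, const float *w, float *out,
                 int channels, int in_h, int in_w, int kernels, int k)
{
    assert(k <= in_h && k <= in_w);
    const int out_h = in_h - k + 1;
    const int out_w = in_w - k + 1;

    for (int f = 0; f < kernels; ++f)
        for (int y = 0; y < out_h; ++y)
            for (int x = 0; x < out_w; ++x) {
                float acc = 0.0f;
                /* Accumulate over every channel and every kernel tap. */
                for (int c = 0; c < channels; ++c)
                    for (int ky = 0; ky < k; ++ky)
                        for (int kx = 0; kx < k; ++kx) {
                            float pixel = in[((size_t)c * in_h + (y + ky)) * in_w + (x + kx)];
                            float coeff = w[(((size_t)f * channels + c) * k + ky) * k + kx];
                            acc += pixel * coeff;
                        }
                out[((size_t)f * out_h + y) * out_w + x] = acc;
            }
}
```

Each output pixel walks k x k x channels inputs spread over several image rows, which is the access pattern the SGEMM-based and FFT-based formulations restructure.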

SGEMM-based Convolution
SGEMM: Single-precision GEneral Matrix Multiply
C = α∙AB + β∙C
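As a point of reference, the SGEMM formula can be written as a straightforward triple loop. This is a minimal sketch, assuming row-major storage and a hypothetical function name `sgemm_ref`; it is meant as a correctness baseline, not as the optimized implementation discussed in the following slides.

```c
#include <assert.h>

/* Reference SGEMM: C = alpha * A * B + beta * C.
 * Row-major storage is assumed: A is M x K, B is K x N, C is M x N. */
void sgemm_ref(int M, int N, int K, float alpha, const float *A,
               const float *B, float beta, float *C)
{
    assert(M > 0 && N > 0 && K > 0);
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            /* Dot product of row i of A with column j of B. */
            for (int k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
}
```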

Im2col
• im2col stores in each row the pixels needed for one kernel application
• Costs in terms of memory requirements!
• pixel duplication
• col2im restores the output image structure
[Figure: input image with kernel applications A, B and C spaced by stride X, and the resulting output images]
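The im2col step described above can be sketched for a single-channel image as follows. The function name `im2col`, the row-major layout and the lack of padding are assumptions made for this illustration; a multi-channel version would append the pixels of every channel to each row.

```c
#include <assert.h>

/* im2col for one single-channel image (row-major, no padding).
 * Each output row holds the k*k pixels covered by one kernel application,
 * so `cols` must have room for out_h * out_w * k * k floats.
 * Returns the number of rows written (= number of output pixels). */
int im2col(const float *img, int h, int w, int k, int stride, float *cols)
{
    assert(k <= h && k <= w && stride >= 1);
    const int out_h = (h - k) / stride + 1;
    const int out_w = (w - k) / stride + 1;
    int row = 0;

    for (int y = 0; y < out_h; ++y)
        for (int x = 0; x < out_w; ++x, ++row)
            for (int ky = 0; ky < k; ++ky)
                for (int kx = 0; kx < k; ++kx)
                    cols[row * k * k + ky * k + kx] =
                        img[(y * stride + ky) * w + (x * stride + kx)];
    return row;
}
```

For a 3x3 image with a 2x2 kernel and stride 1, the centre pixel is copied into all four rows, which is the pixel duplication the slide warns about.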

SGEMM: Naïve implementation
• Each thread computes a single element of the output matrix
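Not cache friendly! Consecutive loads from Matrix B are N elements apart.
/* First accumulation */
ai = load(addr_a);
bi = load(addr_b);
c0 += ai * bi;
/* Second accumulation: A advances by 1 element, B by a whole row (N elements) */
ai = load(addr_a + 1);
bi = load(addr_b + 1 * N);
c0 += ai * bi;
…
store(c0, addr_c);
[Figure: Matrix A, Matrix B, Matrix C]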

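Transpose Matrix B
[Figure: Matrix B transposition]
/* First accumulation */
ai = load(addr_a);
bi = load(addr_b);
c0 += ai * bi;
/* Second accumulation: with B transposed, both operands advance by 1 element */
ai = load(addr_a + 1);
bi = load(addr_b + 1);
c0 += ai * bi;
...
store(c0, addr_c);
Speed-up achievable? 1.1x…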

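Transpose Matrix B in Chunks of 1x4 (I)
• Each thread computes 1x4 elements of the output matrix
Not cache friendly!
float4 out = 0.0f;
/* First accumulation: one scalar from A times four adjacent values of B */
ai = load(addr_a);
bi = vload4(addr_b);
out += (float4)ai * bi;
/* Second accumulation: the next row of B is still N elements away */
ai = load(addr_a + 1);
bi = vload4(addr_b + 1 * N);
out += (float4)ai * bi;
...
store4(out, addr_c);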

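Transpose Matrix B in Chunks of 1x4 (II)
float4 out = 0.0f;
/* First accumulation */
ai = load(addr_a);
bi = vload4(addr_b);
out += (float4)ai * bi;
/* Second accumulation: in the reshaped matrix the next chunk is only 4 elements away */
ai = load(addr_a + 1);
bi = vload4(addr_b + 4);
out += (float4)ai * bi;
...
store4(out, addr_c);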
[Figure: Matrix B reshaped into Matrix BT1x4]
[Chart: SGEMM speed-up vs N (512, 1024, 2048, 4096), where A = NxN, B = NxN, C = NxN]
Speed-up achievable? 3.5x
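One way to read the 1x4 reshaping is sketched below in plain C. The names `reshape_b_1x4` and `sgemm_row_1x4` and the exact layout are assumptions made for illustration; the slides only show the access pattern (`addr_b`, then `addr_b + 4`). Each group of four columns of B becomes one contiguous row, so the values a thread needs for its 1x4 output are read sequentially.

```c
#include <assert.h>

/* Reshape B (K x N, row-major) into BT (N/4 rows of 4*K floats).
 * Row jb of BT stores, for k = 0..K-1, the four values B[k][4*jb .. 4*jb+3]. */
void reshape_b_1x4(const float *B, int K, int N, float *BT)
{
    assert(N % 4 == 0);
    for (int jb = 0; jb < N / 4; ++jb)
        for (int k = 0; k < K; ++k)
            for (int v = 0; v < 4; ++v)
                BT[(jb * K + k) * 4 + v] = B[k * N + jb * 4 + v];
}

/* Compute four adjacent outputs of one row of C = A * B.
 * A_row points at one row of A (K floats), BT_row at one row of BT.
 * Both are walked with unit stride, mirroring vload4(addr_b + 4 * k). */
void sgemm_row_1x4(const float *A_row, const float *BT_row, int K, float out[4])
{
    out[0] = out[1] = out[2] = out[3] = 0.0f;
    for (int k = 0; k < K; ++k) {
        const float ai = A_row[k];
        for (int v = 0; v < 4; ++v)
            out[v] += ai * BT_row[4 * k + v];
    }
}
```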

Reshaping Matrix A (I)
• We can do more: we can compute a block of 4x4 elements per thread in order to reuse the values loaded from Matrix A
[Figure: Matrix A, Matrix BT1x4 and Matrix C]

Reshaping Matrix A (II)
Chunk = Block of 4 rows
[Figure: Matrix A (8x8) reshaped into Matrix AI (2x32), one row per chunk: Chunk 0 and Chunk 1]
[Chart: SGEMM speed-up vs N (512, 1024, 2048, 4096)]
Speed-up achievable? > 8.0x
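A plausible reading of the reshaping of Matrix A is a row interleave: every chunk of 4 rows is flattened into one row, with the 4 values of each column stored next to each other. The sketch below assumes that layout and uses a hypothetical name, `interleave_a_4x4`; the slide itself only gives the shapes (8x8 becomes 2x32).

```c
#include <assert.h>

/* Interleave A (M x K, row-major) in chunks of 4 rows.
 * AI has M/4 rows of 4*K floats; within a chunk, the four values of
 * column k (one from each of the 4 rows) are stored contiguously:
 *   AI[chunk][4*k + r] = A[4*chunk + r][k]
 * A thread computing a 4x4 block of C can then load them with one vload4. */
void interleave_a_4x4(const float *A, int M, int K, float *AI)
{
    assert(M % 4 == 0);
    for (int chunk = 0; chunk < M / 4; ++chunk)
        for (int k = 0; k < K; ++k)
            for (int r = 0; r < 4; ++r)
                AI[(chunk * K + k) * 4 + r] = A[(chunk * 4 + r) * K + k];
}
```

With an 8x8 input this produces the 2x32 shape shown on the slide.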

FFT-based Convolution
• Convolution in the spatial domain is equivalent to an element-wise (pointwise) multiplication in the frequency domain
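Written out, this is the convolution theorem. The symbols below are introduced only for this note: x is the input image, w the kernel, y the result, and F the 2D discrete Fourier transform. With a DFT the product corresponds to a circular convolution, so image and kernel are zero-padded to a common size first.

```latex
y = x \circledast w
\quad\Longleftrightarrow\quad
\mathcal{F}\{y\}[u,v] = \mathcal{F}\{x\}[u,v] \cdot \mathcal{F}\{w\}[u,v]
\qquad\Longrightarrow\qquad
y = \mathcal{F}^{-1}\!\bigl(\mathcal{F}\{x\} \odot \mathcal{F}\{w\}\bigr)
```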

From Radix-2 to Mixed-Radix
• The best-known FFT is the radix-2 Cooley–Tukey algorithm (only for N a power of 2: N = 2 x 2 x 2…)
• In general, any factorization of N is possible (N = N1 x N2 x N3 x…)
• Mixed-Radix is the generalization of the basic radix-2 FFT
Over 1.5x better performance than Radix-2
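The factorization N = N1 x N2 x N3 x… can be sketched as a small planning step that splits N into radix stages. The radix set {4, 2, 3, 5, 7} and the name `factor_radices` are assumptions for this illustration; the slide does not say which radices the implementation supports.

```c
#include <assert.h>

/* Hypothetical set of supported radices, tried in this order. */
static const int kRadices[] = {4, 2, 3, 5, 7};

/* Split n into a product of supported radices.
 * Writes the stages into `stages` and returns how many were needed,
 * or -1 if n has another prime factor or needs more than max_stages. */
int factor_radices(int n, int *stages, int max_stages)
{
    assert(n >= 1 && max_stages > 0);
    int count = 0;
    for (unsigned i = 0; i < sizeof kRadices / sizeof kRadices[0]; ++i) {
        while (n % kRadices[i] == 0) {
            if (count == max_stages)
                return -1;
            stages[count++] = kRadices[i];
            n /= kRadices[i];
        }
    }
    return n == 1 ? count : -1;
}
```

For example, N = 120 is split into the stages 4, 2, 3, 5, while a pure radix-2 FFT could not handle it without padding to 128.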

FFT Implementation
• Recursive FFT in-place computation*
• Each thread computes a single radix-N (floating point computation)
• Block-wise 2x2 in-place transposition
• ~2x better performance than 2x2 out-of-place transposition
• Out-of-place batched convolution
• High memory requirements as we have to keep the frequency representation for:
1. Input image
2. Convolution kernels
3. Result of convolutions
* https://community.arm.com/groups/arm-mali-graphics/blog/2016/02/01/speeding-up-fast-fourier-transform-mixed-radix-on-mobile-arm-mali-gpu-by-means-of-opencl-part-2

SGEMM vs FFT (I)
SGEMM-based convolution
• High memory requirements due to im2col:
• stride < kernel dimension
• large convolution kernel
• large input image
FFT-based convolution
• No efficient way to handle stride != 1
• High memory requirements for batched convolutions
• It could require considerable effort to optimize well
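The im2col memory cost can be estimated with a few lines of arithmetic. This is a sketch under stated assumptions: no padding, square kernels, and a hypothetical helper name `im2col_elems`. It counts the floats in the im2col matrix for one image.

```c
#include <assert.h>
#include <stddef.h>

/* Number of floats in the im2col matrix of one image (no padding):
 * one row per kernel application, k * k * channels values per row. */
size_t im2col_elems(int h, int w, int channels, int k, int stride)
{
    assert(k <= h && k <= w && stride >= 1 && channels >= 1);
    const size_t out_h = (size_t)((h - k) / stride + 1);
    const size_t out_w = (size_t)((w - k) / stride + 1);
    return out_h * out_w * (size_t)k * (size_t)k * (size_t)channels;
}
```

For an 8x8 single-channel image and a 2x2 kernel, stride 2 gives 64 values (no duplication, same as the input), while stride 1 gives 196, about three times the input. The overhead grows with the kernel size and with the image size, matching the three cases listed above.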

SGEMM vs FFT (II)
• Study limited to the inference problem
• Stride x = 1 and stride y = 1
• Number of channels = 1
• Pre-computed FFT for convolution kernels
Case 1: 1 input image, 64/128/256 convolution kernels
Case 2: 64 input images, 32 convolution kernels
[Charts: SGEMM vs FFT for each case; axes: image size, kernel size / number of convolutions, kernel size]
And using stride x = 2?

Limited Numerical Precision for CNN (I)
• Some papers ([1], [2]) have demonstrated the feasibility of using limited numerical precision for CNN
• This opens an interesting computational scenario if, for instance, the HW has accelerators for 16-bit half-precision:
• Performance boosting
• Reduced memory traffic to/from external memory
• Possible to dispatch fewer threads
• Energy saving
• Essentially due to the reduced memory traffic to/from the external memory
[1] Training Deep Neural Networks with Low Precision Multiplications, Matthieu Courbariaux, Jean-Pierre David
[2] Deep Learning with Limited Numerical Precision, Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, Pritish Narayanan

Limited Numerical Precision for CNN (II)
[Chart: SGEMM speed-up vs N (512, 1024, 2048, 4096): > 2.0x]
• It is possible to dispatch fewer threads, i.e. 8x4 elements per thread
[Chart: FFT speed-up vs N (512, 1024, 2048, 4096): > 1.5x]
• We cannot dispatch fewer threads: each thread computes a single radix-N

Lessons Learned
1. A cache-efficient data layout has a huge impact on the performance of our algorithm, also in GPU computing
2. Simple changes in data layout can make it possible to:
• dispatch fewer threads
• better exploit vector instructions
3. Limited Numerical Precision plays a crucial role IF it is HW accelerated
4. Convolution is an embarrassingly parallel task which can be easily and efficiently accelerated on mobile GPU by means of OpenCL

Question Time

Thank you!
The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.
