CNNECST: an FPGA-based approach for the hardware acceleration of Convolutional Neural Networks

A. Solazzo, E. Del Sozzo, G. C. Durelli, M. D. Santambrogio
emanuele.delsozzo@polimi.it
Politecnico di Milano
Samsung Research America, Mountain View, CA
06/09/17

Hardware Acceleration
• Different types of hardware accelerators have been proposed:
– GPUs
– FPGAs
– ASICs
• Finding the proper tradeoff is crucial

• Furthermore, reduced data precision allows FPGAs to reach performance in the range of tera-operations per second[2]
• However, FPGAs have a steep learning curve
GPU vs. FPGA comparison from [1]:
                         GPU     FPGA
Execution Time [s]       18.7    21.4
Power Consumption [W]    235     27.3
[1] W. Zhao, H. Fu, W. Luk, T. Yu, S. Wang, B. Feng, Y. Ma, G. Yang, "F-CNN: An FPGA-based framework for training Convolutional Neural Networks", 2016 IEEE 27th International Conference on Application-specific Systems, Architectures and Processors (ASAP)
[2] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, K. Vissers, "FINN: A Framework for Fast, Scalable Binarized Neural Network Inference", 2016

Proposed Solution
• CNNECST: an automated framework for the design and implementation of CNNs targeting FPGAs
• Bridges the gap between high-level frameworks for Machine Learning (ML) and the design process on FPGA
• The framework features:
– High-level APIs to design the network
– Integration with ML frameworks for CNN training
– Custom C++ libraries to design a dataflow architecture for the hardware acceleration of CNNs

Convolutional Neural Networks
[Figure: bar chart of classification probabilities (axis from 0 to 1) for the classes Airplane, Car, Dog, Horse, Ship]

Convolution Layer
• Most computationally intensive
• Intrinsic parallelism in the
computation pattern
• Highly suitable for hardware
acceleration
$$o_{k,i,j} = \sum_{c=0}^{Ch} \sum_{h=0}^{KH} \sum_{r=0}^{KW} \left( w_{k,c,h,r} \cdot x_{i+h,j+r,c} \right) + b_k$$
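The convolution formula above maps directly onto six nested loops. The following is a minimal Python sketch of that computation, not the framework's C++ implementation. It assumes stride 1, no padding, zero-based indices, and that Ch, KH and KW are counts (the loops stop before reaching them). The name `conv_layer` and the nested-list data layout are illustrative choices.

```python
def conv_layer(x, w, b):
    """Direct convolution following o[k][i][j] = sum(w * x) + b[k].

    x: input indexed as x[row][col][channel]
    w: weights indexed as w[k][c][h][r]
    b: one bias value per output feature map k
    Assumes stride 1 and no padding (hypothetical simplification).
    """
    height, width, channels = len(x), len(x[0]), len(x[0][0])
    num_kernels = len(w)
    kh, kw = len(w[0][0]), len(w[0][0][0])
    out_h, out_w = height - kh + 1, width - kw + 1

    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(num_kernels)]
    for k in range(num_kernels):            # output feature map
        for i in range(out_h):              # output row
            for j in range(out_w):          # output column
                acc = b[k]
                for c in range(channels):   # input channel
                    for h in range(kh):     # kernel row
                        for r in range(kw): # kernel column
                            acc += w[k][c][h][r] * x[i + h][j + r][c]
                out[k][i][j] = acc
    return out
```

Every (k, i, j) output is independent of the others, which is the intrinsic parallelism that makes the layer a good candidate for hardware acceleration.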

Proposed Methodology
• Integration with ML
frameworks for model
specification and training
• Custom C++ libraries for the
hardware accelerator
implementation
• Integration with Xilinx Vivado
toolchain for High Level
Synthesis and bitstream
generation
[Figure: CNNECST Framework Core, with blocks Python APIs, Intermediate Representation, Training, Hardware Generation, High Level CNN Description, Target Platform]

External Interface
[Diagram blocks: Python APIs, High Level CNN Description]
• High level description of the
CNN
• Compatibility with Caffe[3]
machine learning framework
name: "LeNet"
layer {
  name: "data" type: "Input"
  top: "data"
  input_param {
    channels : 1
    height : 28 width : 28
  }
}
layer {
  name: "conv1" type: "Convolution"
  bottom: "data" top: "conv1"
  convolution_param {
    num_output: 20
    kernel_size: 5
  }
}
[3] Caffe: http://caffe.berkeleyvision.org/
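As a worked example of what the LeNet description above implies, the sketch below computes the output shape and parameter count of the `conv1` layer. It assumes stride 1 and no padding, since the description sets neither. The helper names `conv_output_shape` and `conv_param_count` are hypothetical and not part of the CNNECST APIs.

```python
def conv_output_shape(in_shape, num_output, kernel_size, stride=1, pad=0):
    """Return (channels, height, width) produced by a convolution layer."""
    _, height, width = in_shape
    out_h = (height + 2 * pad - kernel_size) // stride + 1
    out_w = (width + 2 * pad - kernel_size) // stride + 1
    return (num_output, out_h, out_w)


def conv_param_count(in_shape, num_output, kernel_size):
    """Number of weights plus one bias per output feature map."""
    in_channels = in_shape[0]
    return num_output * in_channels * kernel_size * kernel_size + num_output


# Values taken from the "data" and "conv1" layers of the description.
data_shape = (1, 28, 28)  # channels, height, width
conv1_shape = conv_output_shape(data_shape, num_output=20, kernel_size=5)
conv1_params = conv_param_count(data_shape, num_output=20, kernel_size=5)
print(conv1_shape, conv1_params)  # prints "(20, 24, 24) 520"
```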

Intermediate Representation
• Lightweight, scalable representation of the network with Google Protocol Buffers[4]
• Integrated with Caffe network model representation
[Figure: Project ProtoBuf with Target Platform, Network Model, and Dataset information]
[4] Google Protocol Buffers: https://developers.google.com/protocol-buffers

Training
• APIs for automated CNN training with TensorFlow[5]
• Parametric training strategy with Stochastic Gradient Descent and a cross-entropy cost function
• Export of the trained weights as a CSV file for the hardware implementation
[5] TensorFlow: https://www.tensorflow.org
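To make the training strategy concrete, here is a minimal sketch of Stochastic Gradient Descent with a cross-entropy cost, followed by a CSV export of the weights. It is a pure-Python stand-in rather than the framework's TensorFlow-based code: it trains a single fully connected layer with softmax, and the function names, hyperparameters, and CSV layout (one row per output, bias last) are all assumptions made for illustration.

```python
import csv
import io
import math
import random


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(weights, biases, features):
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)


def sgd_train(samples, num_classes, lr=0.5, epochs=200, seed=0):
    """Per-sample SGD on one fully connected layer with softmax output."""
    rng = random.Random(seed)
    num_inputs = len(samples[0][0])
    weights = [[0.0] * num_inputs for _ in range(num_classes)]
    biases = [0.0] * num_classes
    samples = list(samples)
    for _ in range(epochs):
        rng.shuffle(samples)
        for features, label in samples:
            probs = forward(weights, biases, features)
            for k in range(num_classes):
                # Gradient of cross-entropy w.r.t. logit k is (p_k - y_k).
                grad = probs[k] - (1.0 if k == label else 0.0)
                for n in range(num_inputs):
                    weights[k][n] -= lr * grad * features[n]
                biases[k] -= lr * grad
    return weights, biases


def predict(weights, biases, features):
    probs = forward(weights, biases, features)
    return probs.index(max(probs))


def export_csv(weights, biases, stream):
    """Write one row per output: its weights followed by its bias."""
    writer = csv.writer(stream)
    for row, bias in zip(weights, biases):
        writer.writerow(row + [bias])


# Toy data set: the class equals the first feature.
toy = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]
weights, biases = sgd_train(toy, num_classes=2)
buffer = io.StringIO()
export_csv(weights, biases, buffer)
```

In the framework the exported file is what the hardware generation step consumes; here the CSV is written to an in-memory buffer to keep the sketch self-contained.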

Hardware Generation
• C++ libraries for hardware-oriented layer modules
• Dataflow architecture of the CNN accelerator
• Automatic code generation
• Integration with Xilinx Vivado Design Suite[6]
• Script generation for automated High Level Synthesis and bitstream generation
[Diagram blocks: Hardware Generation, Target Platform]
[6] Vivado Design Suite: https://www.xilinx.com/products/design-tools/vivado.html
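Script generation for High Level Synthesis can be pictured as filling in a Tcl template. The sketch below emits a Vivado HLS Tcl script from Python; it is an illustration, not the framework's generator. The Tcl commands are standard Vivado HLS commands, while the function name `make_hls_script`, the project, top-function, and file names, the clock period, and the exact part string (package and speed grade) are assumptions. It covers only the synthesis step, not bitstream generation.

```python
def make_hls_script(project, top_function, sources, part, clock_ns=10.0):
    """Return the text of a Vivado HLS Tcl script for one accelerator."""
    lines = [
        f"open_project {project}",
        f"set_top {top_function}",
    ]
    lines += [f"add_files {src}" for src in sources]
    lines += [
        "open_solution solution1",
        f"set_part {{{part}}}",  # braces are part of the Tcl syntax
        f"create_clock -period {clock_ns} -name default",
        "csynth_design",                      # run High Level Synthesis
        "export_design -format ip_catalog",   # package the result as an IP core
        "exit",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical project targeting a Zynq z7020 device.
script = make_hls_script(
    project="cnn_prj",
    top_function="cnn_top",
    sources=["cnn_top.cpp"],
    part="xc7z020clg484-1",
)
print(script)
```

A script like this is typically run in batch mode with `vivado_hls -f <script>.tcl`.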

Accelerator Architecture

Convolutional Layer Architecture
• Dataflow computation by means of ‘Window’ shift-registers
• Pipelined layer modules
• On-chip weights
• Buffering of partial results
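The window shift-register idea can be sketched in software as a generator that consumes one pixel at a time and emits a K x K window as soon as enough pixels have arrived. This is a behavioural Python model under stated assumptions (single channel, raster-order input, stride 1), not the framework's C++ module; the name `window_stream` is hypothetical.

```python
from collections import deque


def window_stream(pixels, width, k):
    """Yield (row, col, window) for each valid k x k window of a streamed image.

    pixels: iterable delivering one pixel at a time in raster order
    width:  image width in pixels
    k:      window (kernel) side length
    """
    # Keeps the last (k - 1) full rows plus k pixels, like chained line buffers.
    history = deque(maxlen=(k - 1) * width + k)
    for index, pixel in enumerate(pixels):
        history.append(pixel)
        row, col = divmod(index, width)
        if row >= k - 1 and col >= k - 1:
            # The buffer is full here and history[0] is the window's top-left pixel.
            window = [[history[r * width + c] for c in range(k)]
                      for r in range(k)]
            yield row - k + 1, col - k + 1, window


# A 3 x 3 image streamed pixel by pixel with a 2 x 2 window.
windows = list(window_stream(range(9), width=3, k=2))
```

Because each input pixel is read only once and windows are produced as the data flows through, layer modules built this way can be chained into a pipeline without storing whole feature maps.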

Experimental Results
Two boards powered by different FPGA fabrics:
• Zedboard (Zynq 7 SoC)
• VC707 (Virtex 7)
Available resources:
Resource   Zynq 7 (z7020)   Virtex 7 (VX485T)
LUT        53 200           303 600
FF         106 400          607 200
DSP        220              2 800
BRAM       280              2 060

Case Studies
Two different datasets:
• U.S. Postal Service dataset (16x16 single channel images)
• MNIST dataset (28x28 single channel images)
Network          # Conv Layers   # Pool Layers   # FC Layers   FLOPS
Small USPS-Net   1               1               1             49K
Large USPS-Net   2               1               1             65K
MNIST-Net        2               2               1             0.48M
LeNet-5          3               2               3             0.54M

USPS Case Study
Test execution:
Network  Device            Time (ARM)  Time (FPGA)  GFLOPS  Energy (ARM)  Energy (FPGA)  Accuracy
Small    Zedboard          1.67 s      0.028 s      1.76    3.67 J        0.12 J         96.3 %
Small    VC707             1.67 s      0.017 s      2.89    3.67 J        0.34 J         96.3 %
Small    VC707 (4 Cores)   1.67 s      0.004 s      11.12   3.67 J        0.09 J         96.3 %
Large    VC707             2.16 s      0.017 s      3.83    4.75 J        0.36 J         91.7 %
Test resources:
Network  Device            LUTs   Flip-Flops  DSP Slices  BRAMs
Small    Zedboard          62 %   29 %        62 %        8 %
Small    VC707             17 %   9 %         5 %         2 %
Small    VC707 (4 Cores)   48 %   27 %        20 %        6 %
Large    VC707             43 %   37 %        32 %        2 %

MNIST Case Study
Test execution:
Network    Device  Time (i7)  Time (FPGA)  GFLOPS  Energy (i7)  Energy (FPGA)  Accuracy
MNIST-Net  VC707   0.29 s     0.081 s      59      12.38 J      1.62 J         97.82 %
LeNet-5    VC707   0.3 s      0.087 s      62      13.14 J      1.66 J         98.17 %
Test resources:
Network    Device  LUTs   Flip-Flops  DSP Slices  BRAMs
MNIST-Net  VC707   43 %   18 %        18 %        2 %
LeNet-5    VC707   88 %   43 %        82 %        6.5 %

Experimental Results Summary
Energy efficiency [GFLOPS/W]:
Network          CPU     FPGA
Small USPS-Net   0.01    0.14
Large USPS-Net   0.01    0.18
MNIST-Net        0.39    2.96
LeNet-5          0.41    3.03
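The GFLOPS/W figures can be related to the case-study tables with a short calculation. The slides do not state how the metric was derived, so the sketch below assumes it is throughput divided by average power, with average power taken as energy over execution time. The function names are hypothetical; the inputs are the FPGA values reported for the VC707 in the USPS and MNIST case studies.

```python
def gflops_per_watt(gflops, exec_time_s, energy_j):
    """Energy efficiency: throughput divided by average power (energy / time)."""
    average_power_w = energy_j / exec_time_s
    return gflops / average_power_w


def speedup(baseline_time_s, accelerated_time_s):
    """How many times faster the accelerated run is than the baseline."""
    return baseline_time_s / accelerated_time_s


# Small USPS-Net and MNIST-Net on the VC707 (single core).
small_usps_vc707 = gflops_per_watt(gflops=2.89, exec_time_s=0.017, energy_j=0.34)
mnist_net_vc707 = gflops_per_watt(gflops=59, exec_time_s=0.081, energy_j=1.62)
mnist_speedup = speedup(baseline_time_s=0.29, accelerated_time_s=0.081)
print(round(small_usps_vc707, 2), round(mnist_net_vc707, 2), round(mnist_speedup, 1))
```

Under this assumption the two efficiencies come out at roughly 0.14 and 2.95 GFLOPS/W, close to the FPGA values in the summary, and the MNIST-Net speedup over the i7 is about 3.6x.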

Conclusion & Challenges
• An automated framework to bridge the gap between
productivity-level languages and FPGA-based design
• The framework features:
– A modular dataflow architecture for CNN hardware acceleration
– Integration with modern Machine Learning frameworks for CNNs
• Challenges:
– Support for other types of layers (e.g. activation layers)
– Support for reduced-precision data types
– Improved streaming data transfer

Thank You For The Attention
Andrea Solazzo
andrea.solazzo@mail.polimi.it
Emanuele Del Sozzo
emanuele.delsozzo@polimi.it
Gianluca C. Durelli
gianlucacarlo.durelli@polimi.it
Marco D. Santambrogio
marco.santambrogio@polimi.it
CNNECST-Convolutional Neural Network (https://www.facebook.com/cnn2ecst/)
@cnn2ecst (https://twitter.com/cnn2ecst)
