CNNECST: an FPGA-based approach for the
hardware acceleration of
Convolutional Neural Networks
A. Solazzo, E. Del Sozzo, G. C. Durelli, M. D. Santambrogio
emanuele.delsozzo@polimi.it
Politecnico di Milano
QUID, San Francisco, CA
06/06/17
Context Definition 2
Problems 3
Hardware Acceleration
• Different types of hardware accelerators have been proposed:
– GPUs
– FPGAs
– ASICs
• Finding the proper tradeoff is crucial
4
• Furthermore, reduced data precision allows FPGAs to reach
performance in the range of tera-operations per second [2]
• However, FPGAs have a steep learning curve
5
[1]                     GPU     FPGA
Execution Time [s]      18.7    21.4
Power Consumption [W]   235     27.3
[1] Wenlai Zhao, Haohuan Fu, Wayne Luk, Teng Yu, Shaojun Wang, Bo Feng, Yuchun Ma, Guangwen Yang, "F-CNN: An FPGA-based framework for training
Convolutional Neural Networks", 2016 IEEE 27th International Conference on Application-specific Systems, Architectures and Processors (ASAP)
[2] Umuroglu, Y., Fraser, N. J., Gambardella, G., Blott, M., Leong, P., Jahre, M., & Vissers, K. (2016). FINN: A Framework for Fast, Scalable Binarized Neural Network Inference.
Proposed Solution
• CNNECST: an automated framework for the design and
implementation of CNNs targeting FPGAs
• Bridge the gap between high level frameworks for Machine
Learning (ML) and the design process on FPGA
• The framework features:
– High level APIs to design the network
– Integration with ML frameworks for CNN training
– Custom C++ libraries to design a dataflow architecture for the
hardware acceleration of CNNs
6
Convolutional Neural Networks 7
[Figure: example CNN pipeline producing classification probabilities for Airplane, Car, Dog, Horse, Ship]
Convolution Layer
• Most computationally intensive
• Intrinsic parallelism in the
computation pattern
• Highly suitable for hardware
acceleration
8
o(k,i,j) = Σ_{c=0}^{Ch} Σ_{h=0}^{KH} Σ_{r=0}^{KW} w(k,c,h,r) · x(i+h, j+r, c) + b(k)
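The triple summation above maps directly onto a nested loop. The following pure-Python reference is only an illustration of the formula (the `conv2d` name and data layout are ours, not the framework's C++ code): it computes one output pixel per (k, i, j), valid padding, stride 1.

```python
# Naive reference implementation of the convolution formula above.
# x: input image [H][W][Ch]; w: weights [K][Ch][KH][KW]; b: biases [K].
def conv2d(x, w, b):
    H, W, Ch = len(x), len(x[0]), len(x[0][0])
    K, KH, KW = len(w), len(w[0][0]), len(w[0][0][0])
    OH, OW = H - KH + 1, W - KW + 1          # output size, valid padding
    # Initialise every output pixel with the bias of its output channel.
    out = [[[b[k] for _ in range(OW)] for _ in range(OH)] for k in range(K)]
    for k in range(K):                        # output channel
        for i in range(OH):                   # output row
            for j in range(OW):               # output column
                for c in range(Ch):           # input channel
                    for h in range(KH):       # kernel row
                        for r in range(KW):   # kernel column
                            out[k][i][j] += w[k][c][h][r] * x[i + h][j + r][c]
    return out
```

The six independent loops are exactly the intrinsic parallelism the slide mentions: the hardware can unroll and pipeline them.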
Proposed Methodology
• Integration with ML
frameworks for model
specification and training
• Custom C++ libraries for the
hardware accelerator
implementation
• Integration with Xilinx Vivado
toolchain for High Level
Synthesis and bitstream
generation
9
[Framework diagram: the CNNECST Framework Core (Python APIs, Intermediate Representation, Training, Hardware Generation) turns a High Level CNN Description into a design for the Target Platform, exposed through an External Interface]
10
Python APIs
High Level CNN Description
• High level description of the
CNN
• Compatibility with Caffe[3]
machine learning framework
name: "LeNet"
layer {
  name: "data"
  type: "Input"
  top: "data"
  input_param {
    channels: 1
    height: 28
    width: 28
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 20
    kernel_size: 5
  }
}
[3] Caffe: http://caffe.berkeleyvision.org/
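The Python APIs themselves are not shown in the deck; as a small stdlib-only illustration of how a flat prototxt-like description can be consumed, the `layers` helper below (purely illustrative, not part of CNNECST) extracts the (name, type) pair of each layer block:

```python
import re

# A flat prototxt-like network description, as on the slide above.
PROTO = '''
name: "LeNet"
layer { name: "data" type: "Input" top: "data" }
layer { name: "conv1" type: "Convolution" bottom: "data" top: "conv1" }
'''

def layers(prototxt):
    """Extract (name, type) pairs from a flat (non-nested) prototxt-like text."""
    return re.findall(
        r'layer\s*{[^}]*?name:\s*"(\w+)"[^}]*?type:\s*"(\w+)"', prototxt)
```

This only handles flat layer blocks; real Caffe prototxt has nested parameter messages and would need a proper parser.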
Intermediate Representation
• Lightweight, scalable
Intermediate Representation
of the network with Google
Protocol Buffers[4]
• Integrated with Caffe network
model representation
11
[4] Google Protocol Buffers: https://developers.google.com/protocol-buffers
[IR diagram: a Project ProtoBuf bundling the Target Platform, the Network Model, and the Dataset information]
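A Protocol Buffers schema matching the diagram might look as follows; this is a hypothetical sketch (all message and field names are illustrative, the actual CNNECST .proto is not shown in the deck):

```protobuf
// Hypothetical sketch of the Intermediate Representation schema.
syntax = "proto2";

message Project {
  optional TargetPlatform target  = 1;
  optional NetworkModel   network = 2;
  optional Dataset        dataset = 3;
}

message TargetPlatform {
  optional string board = 1;  // e.g. "Zedboard", "VC707"
}

message NetworkModel {
  repeated Layer layer = 1;
}

message Layer {
  optional string name = 1;
  optional string type = 2;   // e.g. "Convolution", "Pooling"
}

message Dataset {
  optional uint32 channels = 1;
  optional uint32 height   = 2;
  optional uint32 width    = 3;
}
```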
Training
• APIs for automated CNN
Training with TensorFlow[5]
• Parametric training strategy
with Stochastic Gradient
Descent and Cross-entropy
cost function
• Export trained weights as csv
file for the hardware
implementation
12
[5] TensorFlow: https://www.tensorflow.org
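Exporting trained weights as CSV can be done with the standard library alone. The layout below (one row per flattened kernel) is an assumption for illustration; the actual CNNECST export format is not specified in the deck:

```python
import csv
import io

def export_weights_csv(weights, fileobj):
    """Write trained weights to CSV: one row per kernel, flattened values.
    weights: list of kernels, each a flat list of floats."""
    writer = csv.writer(fileobj)
    for kernel in weights:
        writer.writerow(kernel)

# Example: two 1x2 kernels written to an in-memory buffer.
buf = io.StringIO()
export_weights_csv([[0.5, -0.25], [0.125, 1.0]], buf)
```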
Hardware Generation
• C++ Libraries for hardware-
oriented layer modules
• Dataflow architecture of the
CNN accelerator
• Automatic code generation
• Integration with Xilinx Vivado
Design Suite[6]
• Script generation for automated
High Level Synthesis and
bitstream generation
13
[6] Vivado Design Suite: https://www.xilinx.com/products/design-tools/vivado.html
Accelerator Architecture
[Diagram of the CNN accelerator architecture]
14
Convolutional Layer Architecture
• Dataflow computation by
means of ‘Window’ shift-
registers
• Pipelined layer modules
• On-chip weights
• Buffering of partial results
15
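The 'Window' shift-register behaviour can be modelled in a few lines. The sketch below is a pure-Python illustration of the sliding-window access pattern (not the actual C++ library code): for each clock step the layer module sees one full KxK receptive field.

```python
def windows(frame, K):
    """Yield every KxK window of a 2D frame, top-left to bottom-right,
    mimicking the window shift-registers sliding over a pixel stream."""
    H, W = len(frame), len(frame[0])
    for i in range(H - K + 1):
        for j in range(W - K + 1):
            yield [row[j:j + K] for row in frame[i:i + K]]
```

In hardware this avoids re-reading pixels from memory: each pixel enters the shift-register once and is reused by every window that covers it.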
Experimental Results
Two boards powered by different
FPGA fabrics:
• Zedboard (Zynq 7 SoC)
• VC707 (Virtex 7)
16
        Zynq 7 (z7020)   Virtex 7 (VX485T)
LUT     53,200           303,600
FF      106,400          607,200
DSP     220              2,800
BRAM    280              2,060
Case Studies
Two different datasets:
• U.S. Postal Service dataset
(16x16 single channel images)
• MNIST dataset
(28x28 single channel images)
17
Network          # Conv Layers  # Pool Layers  # FC Layers  FLOPS
Small USPS-Net   1              1              1            49K
Large USPS-Net   2              1              1            65K
MNIST-Net        2              2              1            0.48M
LeNet-5          3              2              3            0.54M
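A back-of-the-envelope way to obtain FLOP counts like those in the table is to charge one multiply and one add per weight per output pixel, per the convolution formula earlier. The helper below is illustrative only and not necessarily the exact accounting used for the table:

```python
def conv_flops(out_h, out_w, num_output, channels, k):
    """Approximate FLOPs of one convolutional layer:
    2 (mul + add) per weight per output pixel."""
    return 2 * out_h * out_w * num_output * channels * k * k
```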
USPS Case Study 18
Test Execution

Network  Device           Time (ARM)  Time (FPGA)  GFLOPS  Energy (ARM)  Energy (FPGA)  Accuracy
Small    Zedboard         1.67 s      0.028 s      1.76    3.67 J        0.12 J         96.3 %
Small    VC707            1.67 s      0.017 s      2.89    3.67 J        0.34 J         96.3 %
Small    VC707 (4 Cores)  1.67 s      0.004 s      11.12   3.67 J        0.09 J         96.3 %
Large    VC707            2.16 s      0.017 s      3.83    4.75 J        0.36 J         91.7 %
Test Resources

Network  Device           LUTs  Flip-Flops  DSP Slices  BRAMs
Small    Zedboard         62 %  29 %        62 %        8 %
Small    VC707            17 %  9 %         5 %         2 %
Small    VC707 (4 Cores)  48 %  27 %        20 %        6 %
Large    VC707            43 %  37 %        32 %        2 %
MNIST Case Study 19
Test Execution

Network    Device  Time (i7)  Time (FPGA)  GFLOPS  Energy (i7)  Energy (FPGA)  Accuracy
MNIST-Net  VC707   0.29 s     0.081 s      59      12.38 J      1.62 J         97.82 %
LeNet-5    VC707   0.3 s      0.087 s      62      13.14 J      1.66 J         98.17 %
Test Resources

Network    Device  LUTs  Flip-Flops  DSP Slices  BRAMs
MNIST-Net  VC707   43 %  18 %        18 %        2 %
LeNet-5    VC707   88 %  43 %        82 %        6.5 %
Experimental Results Summary 20
GFLOPS/W

Network          CPU   FPGA
Small USPS-Net   0.01  0.14
Large USPS-Net   0.01  0.18
MNIST-Net        0.39  2.96
LeNet-5          0.41  3.03
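The summary metric combines the measurements from the previous slides: GFLOPS is workload FLOPs over execution time, and GFLOPS/W divides that by average power (energy over time). The helpers below just encode those definitions for illustration (the deck's figures are over the whole test set, so they are not reproduced by plugging in single-image FLOP counts):

```python
def gflops(flops, seconds):
    """Throughput in giga floating-point operations per second."""
    return flops / seconds / 1e9

def gflops_per_watt(flops, seconds, joules):
    """Energy efficiency: throughput divided by average power (J/s)."""
    return gflops(flops, seconds) / (joules / seconds)
```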
Conclusion & Challenges
• An automated framework to bridge the gap between
productivity-level languages and FPGA-based design
• The framework features:
– A modular dataflow architecture for CNN hardware acceleration
– Integration with modern Machine Learning frameworks for CNNs
• Challenges:
– Support for other types of layers (e.g. activation layer)
– Support for reduced precision data types
– Improve streaming data transfer
21
Thank You For Your Attention 22
Andrea Solazzo
andrea.solazzo@mail.polimi.it
Emanuele Del Sozzo
emanuele.delsozzo@polimi.it
Gianluca C. Durelli
gianlucacarlo.durelli@polimi.it
Marco D. Santambrogio
marco.santambrogio@polimi.it
CNNECST-Convolutional Neural Network (https://www.facebook.com/cnn2ecst/)
@cnn2ecst (https://twitter.com/cnn2ecst)
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
Steel Structures - Building technology.pptx
Steel Structures - Building technology.pptxSteel Structures - Building technology.pptx
Steel Structures - Building technology.pptx
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event Scheduling
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...
 

CNNECST: an FPGA-based approach for the hardware acceleration of Convolutional Neural Networks

  • 1. CNNECST: an FPGA-based approach for the hardware acceleration of Convolutional Neural Networks A. Solazzo, E. Del Sozzo, G. C. Durelli, M. D. Santambrogio emanuele.delsozzo@polimi.it Politecnico di Milano QUID, San Francisco, CA 06/06/17
  • 4. Hardware Acceleration • Different types of hardware accelerators have been proposed: – GPUs – FPGAs – ASICs • Finding the proper tradeoff is crucial 4
  • 5. Hardware Acceleration • Different types of hardware accelerators have been proposed: – GPUs – FPGAs – ASICs • Furthermore, reduced data precision allows FPGAs to reach performance in the range of Tera-Operations per second [2] • However, FPGAs have a steep learning curve 5 [1] GPU FPGA Execution Time [s] 18.7 21.4 Power Consumption [W] 235 27.3 [1] Wenlai Zhao, Haohuan Fu, Wayne Luk, Teng Yu, Shaojun Wang, Bo Feng, Yuchun Ma, Guangwen Yang, "F-CNN: An FPGA-based framework for training Convolutional Neural Networks", 2016 IEEE 27th International Conference on Application-specific Systems, Architectures and Processors (ASAP) [2] Umuroglu, Y., Fraser, N. J., Gambardella, G., Blott, M., Leong, P., Jahre, M., & Vissers, K. (2016). FINN: A Framework for Fast, Scalable Binarized Neural Network Inference.
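The execution-time and power figures in the table above imply a large energy gap between the two devices; a quick back-of-the-envelope check (energy = execution time × power) makes it explicit:

```python
# Energy comparison from the F-CNN figures [1]: energy [J] = time [s] * power [W]
gpu_energy = 18.7 * 235    # GPU:  ~4394 J
fpga_energy = 21.4 * 27.3  # FPGA: ~584 J
ratio = gpu_energy / fpga_energy  # FPGA consumes roughly 7.5x less energy
```

So although the FPGA is slightly slower, its far lower power draw makes it much more energy-efficient on this workload.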
  • 6. Proposed Solution • CNNECST: an automated framework for the design and implementation of CNNs targeting FPGAs • Bridge the gap between high level frameworks for Machine Learning (ML) and the design process on FPGA • The framework features: – High level APIs to design the network – Integration with ML frameworks for CNN training – Custom C++ libraries to design a dataflow architecture for the hardware acceleration of CNNs 6
  • 7. Convolutional Neural Networks 7 0 0.2 0.4 0.6 0.8 1 Airplane Car Dog Horse Ship Classification Probabilities
  • 8. Convolution Layer • Most computationally intensive • Intrinsic parallelism in the computation pattern • Highly suitable for hardware acceleration 8 $o_{k,i,j} = \sum_{c=0}^{Ch} \sum_{h=0}^{KH} \sum_{r=0}^{KW} \left( w_{k,c,h,r} \cdot x_{i+h,j+r,c} \right) + b_k$
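The convolution formula on the slide translates directly into a reference implementation; a minimal pure-Python sketch (dimension names Ch, KH, KW follow the slide's notation) makes the nested, fully parallelizable loop structure explicit:

```python
def conv_layer(x, w, b):
    """Reference convolution following the slide's formula:
    o[k][i][j] = sum over c, h, r of w[k][c][h][r] * x[i+h][j+r][c], plus b[k].
    x: input image as x[i][j][c] (H x W x Ch)
    w: K kernels as w[k][c][h][r] (each Ch x KH x KW)
    b: one bias per kernel
    Returns K feature maps of size (H-KH+1) x (W-KW+1) (no padding, stride 1)."""
    H, W, Ch = len(x), len(x[0]), len(x[0][0])
    K, KH, KW = len(w), len(w[0][0]), len(w[0][0][0])
    out = []
    for k in range(K):                      # every kernel is independent...
        plane = []
        for i in range(H - KH + 1):         # ...and so is every output pixel,
            row = []
            for j in range(W - KW + 1):     # which is what the hardware exploits
                acc = b[k]
                for c in range(Ch):
                    for h in range(KH):
                        for r in range(KW):
                            acc += w[k][c][h][r] * x[i + h][j + r][c]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out
```

All iterations of the outer three loops are independent, which is exactly the intrinsic parallelism the slide refers to.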
  • 9. Proposed Methodology • Integration with ML frameworks for model specification and training • Custom C++ libraries for the hardware accelerator implementation • Integration with Xilinx Vivado toolchain for High Level Synthesis and bitstream generation 9 CNNECSTFrameworkCore Python APIs Intermediate Representation Training Hardware Generation High Level CNN Description Target Platform
  • 10. External Interface 10 Python APIs High Level CNN Description • High level description of the CNN • Compatibility with Caffe[3] machine learning framework name: "LeNet" layer { name: "data" type: "Input" top: "data" input_param { channels : 1 height : 28 width : 28 } } layer { name: "conv1" type: "Convolution" bottom: "data" top: "conv1" convolution_param { num_output: 20 kernel_size: 5 } } [3] Caffe: http://caffe.berkeleyvision.org/
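The Caffe-style fragment on the slide can be held as plain Python data, and the output size of each convolutional layer follows directly from its hyperparameters. A hedged sketch (the dict layout and helper below are illustrative, not the framework's actual API):

```python
# Illustrative only: the LeNet fragment from the slide as plain Python data,
# the way a high-level network description might be imported.
lenet = [
    {"name": "data",  "type": "Input",
     "params": {"channels": 1, "height": 28, "width": 28}},
    {"name": "conv1", "type": "Convolution",
     "params": {"num_output": 20, "kernel_size": 5}},
]

def conv_output_shape(in_h, in_w, kernel_size, stride=1):
    """Output height/width of a valid (unpadded) convolution."""
    return ((in_h - kernel_size) // stride + 1,
            (in_w - kernel_size) // stride + 1)

# conv1 applies 5x5 kernels to the 28x28 input, giving 20 feature maps of 24x24.
h, w = conv_output_shape(28, 28, 5)
```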
  • 11. Intermediate Representation • Lightweight, scalable Intermediate Representation of the network with Google Protocol Buffers[4] • Integrated with Caffe network model representation 11 Intermediate Representation [4] Google Protocol Buffers: https://developers.google.com/protocol-buffers Project ProtoBuf Target Platform Network Model Dataset information
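The slide names the three pieces the protobuf-based IR bundles together: target platform, network model, and dataset information. A minimal Python stand-in (field names are illustrative, not the actual .proto schema):

```python
from dataclasses import dataclass, field

@dataclass
class ProjectIR:
    """Illustrative stand-in for the Protocol Buffer IR on the slide:
    it groups the target platform, the network model, and dataset info."""
    target_platform: str = ""
    network_model: list = field(default_factory=list)  # ordered layer descriptions
    dataset: dict = field(default_factory=dict)        # e.g. image size, #classes

ir = ProjectIR(
    target_platform="VC707",
    network_model=[{"type": "Convolution", "num_output": 20, "kernel_size": 5}],
    dataset={"height": 28, "width": 28, "channels": 1, "classes": 10},
)
```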
  • 12. Training • APIs for automated CNN training with TensorFlow[5] • Parametric training strategy with Stochastic Gradient Descent and cross-entropy cost function • Export trained weights as a CSV file for the hardware implementation 12 Training [5] TensorFlow: https://www.tensorflow.org
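The training strategy the slide names, Stochastic Gradient Descent with a cross-entropy cost, can be sketched in a few lines for a single softmax layer. This is a pure-Python illustration of the update rule; the framework itself delegates training to TensorFlow:

```python
import math

def softmax(z):
    m = max(z)                               # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def sgd_step(w, b, x, label, lr=0.1):
    """One SGD step for a softmax classifier with cross-entropy loss.
    w: per-class weight rows, b: biases, x: input vector,
    label: index of the true class. Returns the loss before the update."""
    z = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]
    p = softmax(z)
    # For cross-entropy over softmax, the gradient wrt each logit is p - onehot(label)
    for k in range(len(w)):
        g = p[k] - (1.0 if k == label else 0.0)
        b[k] -= lr * g
        for j in range(len(x)):
            w[k][j] -= lr * g * x[j]
    return -math.log(p[label])
```

Repeating `sgd_step` over the training set drives the cross-entropy loss down, yielding the trained weights that are then exported for the hardware implementation.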
  • 13. Hardware Generation • C++ Libraries for hardware- oriented layer modules • Dataflow architecture of the CNN accelerator • Automatic code generation • Integration with Xilinx Vivado Design Suite[6] • Script generation for automated High Level Synthesis and bitstream generation 13 Hardware Generation Target Platform [6] Vivado Design Suite: https://www.xilinx.com/products/design-tools/vivado.html
  • 15. Convolutional Layer Architecture • Dataflow computation by means of ‘Window’ shift- registers • Pipelined layer modules • On-chip weights • Buffering of partial results 15
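The 'Window' shift-register scheme can be mimicked in software: image rows arrive as a stream, the window retains only the KH rows the kernel currently needs, and the oldest row is dropped as the window slides down. A behavioral sketch for one single-channel kernel (not the framework's C++ hardware library):

```python
from collections import deque

def stream_convolve_rows(rows, kernel, bias=0.0):
    """Behavioral model of the window shift-register dataflow: consume the
    image one row at a time, keep only the last KH rows, and emit one output
    row as soon as the window is full (valid convolution, stride 1)."""
    KH, KW = len(kernel), len(kernel[0])
    window = deque(maxlen=KH)      # the oldest row is dropped automatically
    for row in rows:               # rows arrive as a stream (e.g. over a FIFO)
        window.append(row)
        if len(window) == KH:      # window full: slide the kernel along it
            out_row = []
            for j in range(len(row) - KW + 1):
                acc = bias
                for h in range(KH):
                    for r in range(KW):
                        acc += kernel[h][r] * window[h][j + r]
                out_row.append(acc)
            yield out_row          # output row streams to the next layer
```

Only KH rows are buffered at any time, which is why the weights and partial results fit in on-chip memory and the layer modules can run as a pipeline.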
  • 16. Experimental Results Two boards powered by different FPGA fabrics: • Zedboard (Zynq 7 SoC) • VC707 (Virtex 7) 16 Zynq 7 (z7020) Virtex 7 (VX485T) LUT 53 200 303 600 FF 106 400 607 200 DSP 220 2 800 BRAM 280 2 060
  • 17. Case Studies Two different datasets: • U.S. Postal Service dataset (16x16 single channel images) • MNIST dataset (28x28 single channel images) 17 Network # Conv Layers # Pool Layers # FC Layers FLOPS Small USPS-Net 1 1 1 49K Large USPS-Net 2 1 1 65K MNIST-Net 2 2 1 0.48M LeNet-5 3 2 3 0.54M
  • 18. USPS Case Study 18 Test Execution GFLOPS Energy Accuracy Network Device ARM FPGA ARM FPGA Small Zedboard 1.67 s 0.028 s 1.76 3.67 J 0.12 J 96.3 % Small VC707 1.67 s 0.017 s 2.89 3.67 J 0.34 J 96.3 % Small VC707 (4 Cores) 1.67 s 0.004 s 11.12 3.67 J 0.09 J 96.3 % Large VC707 2.16 s 0.017 s 3.83 4.75 J 0.36 J 91.7 % Test Resources Network Device LUTs Flip-Flops DSP Slices BRAMs Small Zedboard 62 % 29 % 62 % 8 % Small VC707 17 % 9 % 5 % 2 % Small VC707 (4 Cores) 48 % 27 % 20 % 6 % Large VC707 43 % 37 % 32 % 2 %
  • 19. MNIST Case Study 19 Test Execution GFLOPS Energy Accuracy Network Device i7 FPGA i7 FPGA MNIST-Net VC707 0.29 s 0.081 s 59 12.38 J 1.62 J 97.82 % LeNet-5 VC707 0.3 s 0.087 s 62 13.14 J 1.66 J 98.17 % Test Resources Network Device LUTs Flip-Flops DSP Slices BRAMs MNIST-Net VC707 43 % 18 % 18 % 2 % LeNet-5 VC707 88 % 43 % 82 % 6.5 %
  • 20. Experimental Results Summary 20 0.01 0.01 0.39 0.41 0.14 0.18 2.96 3.03 SMALL USPS-NET LARGE USPS-NET MNIST-NET LENET-5 GFLOPS/W CPU FPGA
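The GFLOPS/W figures in this chart follow from the earlier tables: average power is energy divided by execution time, and efficiency is throughput divided by power. For example, for MNIST-Net on the VC707:

```python
# MNIST-Net on VC707, from the case-study table: 59 GFLOPS, 1.62 J in 0.081 s.
gflops, energy_j, time_s = 59.0, 1.62, 0.081
power_w = energy_j / time_s       # average power: 20 W
efficiency = gflops / power_w     # ~2.95 GFLOPS/W, consistent with the chart
```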
  • 21. Conclusion & Challenges • An automated framework to bridge the gap between productivity-level languages and FPGA-based design • The framework features: – A modular dataflow architecture for CNN hardware acceleration – Integration with modern Machine Learning frameworks for CNNs • Challenges: – Support for other types of layers (e.g. activation layer) – Support for reduced precision data types – Improve streaming data transfer 21
  • 22. Thank You For Your Attention 22 Andrea Solazzo andrea.solazzo@mail.polimi.it Emanuele Del Sozzo emanuele.delsozzo@polimi.it Gianluca C. Durelli gianlucacarlo.durelli@polimi.it Marco D. Santambrogio marco.santambrogio@polimi.it CNNECST-Convolutional Neural Network (https://www.facebook.com/cnn2ecst/) @cnn2ecst (https://twitter.com/cnn2ecst)

Editor's Notes

  1-2. Nowadays, one of the topics of greatest interest in both academic and industrial research is the class of algorithms called Convolutional Neural Networks, or CNNs for short. CNNs are one of the algorithms belonging to the field of Deep Learning. In recent years, given their effectiveness in image recognition, CNNs have been used in a variety of applications such as Big Data analysis, computer vision, video surveillance, and so on. In these particular contexts, the huge amount of data to analyze and the energy-consumption or real-time computation requirements make it necessary to find ways to accelerate this kind of algorithm.
  3-4. For this reason, different types of accelerators based on GPUs, FPGAs, and ASICs have been proposed in the literature. Among them, the right tradeoff must be found by considering several aspects, including energy consumption, cost, flexibility, and performance of the resulting solution. FPGAs represent an excellent tradeoff among these devices, as they are much more flexible and less expensive than an ASIC solution, while at the same time providing high performance with lower energy consumption than GPUs. However, the process of designing and implementing an accelerator for this kind of device can entail long development times, especially for developers who are not experts in hardware design.
  5. For these reasons, the goal of this thesis is to bridge the gap between the productivity level offered by languages such as Python in today's machine learning frameworks and the design process of FPGA-based accelerators. Indeed, the contribution of this work is a framework that automates the flow for implementing CNNs on FPGAs, providing high-level APIs for designing the network, a library for implementing accelerators with a dataflow architecture, and integration with a machine learning framework for training the network.
  6. Let us now briefly look at how CNNs work and at the particular computational pattern that allows them to be accelerated efficiently. As the picture shows, a network is composed of different types of layers placed in series. In the first part of the network we usually find the convolutional layers, which distinguish CNNs from other kinds of neural networks, together with pooling or sub-sampling layers. At a high level, this part of the network applies "filters" to an input image, extracting progressively more complex features. The last part of the network consists of the so-called fully-connected or linear layers, of the same type found in a standard neural network. This part is responsible for aggregating the information extracted by the convolutional part, and provides a kind of score for the membership of an image to a certain class of objects. Through mathematical transformations, such as the SoftMax operator, the network's output can be turned into a vector of membership probabilities for the various classes, as in this case where the most likely class is indeed the car.
  7. Let us now look in more detail at the operation of a convolutional layer, which represents the most computationally expensive part of the network in terms of number of operations. A convolutional layer is composed of an arbitrary number of filters, or kernels, which apply a convolution operator to the pixels of the input image. Each filter is composed of "weights" whose values are learned through the training process of the network. As the pseudocode shows, the computation performed by a convolutional layer exhibits a high degree of parallelism, and therefore lends itself very well to hardware acceleration. The number of filters in a convolutional layer, as well as their sizes and the number of layers in the network itself, are some of the configurable hyperparameters of a network.
  8. As mentioned before, the proposed framework is able to automate the generation flow of a CNN accelerator. The different modules that compose the toolchain carry out the various functions for training the CNN, generating the accelerator code, and producing the directives for synthesis on the FPGA.
  9. As input, the framework requires a high-level description of the network, compatible with that of the Caffe framework developed at Berkeley, like the one shown in the slide. This kind of description allows the hyperparameters of the network and of the individual layers to be defined straightforwardly. Moreover, compatibility with the Caffe description makes it possible to generate some previously implemented networks. The framework then exposes Python APIs to interact with its different modules and to import the description itself.
  10. Once the structure of the network has been defined, it is translated into an intermediate representation implemented through a Protocol Buffer specification, a data representation similar to JSON developed by Google. This representation incorporates the network description seen previously, and adds information on the datasets that will be used for training and on the target device.
  11. The next step is training the network, that is, generating the "weights" that allow the network to filter and classify the image correctly. This step is carried out by a module that integrates with the TensorFlow framework, generating the network model from the intermediate representation. Training is performed with the most commonly used learning algorithm, SGD, using cross-entropy as the cost function and parameters such as the learning rate specified by the user. Once training is complete, the final weights are exported so that they can be used during the generation of the accelerator code.
  12. Finally, the accelerator generation module uses the C++ libraries included in the framework to implement a dataflow architecture of a CNN. In particular, the library provides functions that implement the different types of layers in the network, and that contain directives defining the characteristics the hardware implementation must have, such as the number of processing elements operating in parallel. The transition from the generated code to the actual RTL of the accelerator is carried out through the Xilinx Vivado Design Suite, which includes the tools for High Level Synthesis and for generating the bitstream to configure the FPGA.
  13. The architecture of the generated accelerator is composed of several cascaded modules implementing the different layers of the network. The individual modules are connected through "Streams" implemented as FIFOs, in order to achieve a dataflow computation. The accelerator's input and output data are read from and written to memory through a DMA module to reduce the latency due to data transfer, and are sent to the CNN accelerator through interfaces using the AXI-Stream protocol.
  14. In addition to the stream connections between individual modules, the realization of a dataflow architecture is made possible by buffering the partial results in the modules implementing the layers of the convolutional part. While a fully streaming data flow is possible for the fully connected layer, the convolution requires the input pixels to be stored, since they will be reused by different positions of the filters that make up the layer. For this reason, the convolutional and pooling layers have been implemented with a window of shift registers that can be read and written in parallel. In particular, as the picture shows, the window stores the pixels needed to compute the convolution of a first row of the output image. When the kernel, in blue, has been slid over the whole window, the window is virtually shifted down, discarding the first row of the input image, which will no longer be used, and loading a new row to continue the computation. The layer weights are also stored in the FPGA's local memories and do not need to be loaded from external memory. In this way it is possible to obtain a chain of modules that work as a pipeline and allow a streaming computation of the implemented network.
  15. To validate the proposed approach, we used two boards equipped with two different FPGAs. The first, the Zedboard on the left, features a Zynq SoC, composed of an ARM processor and an FPGA derived from the Xilinx Artix series. The second is a board featuring a Virtex-7 FPGA. As can be seen, the two devices offer a very different amount of available resources, in particular regarding the number of LUTs and Digital Signal Processing slices, which, as we will see, will be the critical resource for the tested designs.
  16. As for the networks used, we trained two pairs of models on two different datasets of single-channel, i.e., black and white, images of handwritten digits. The first is the U.S. Postal Service dataset, whose images are derived from postal codes scanned from U.S. postal service parcels. The second is the MNIST dataset, one of the first datasets used when CNNs were proposed by LeCun, and still widely used in applications of this kind. The networks differ in the number of layers that compose them. In particular, it can be seen how the number of floating point operations increases significantly as the number of layers and the size of the data to classify grow. The models used were developed in-house, except for the last one, which is a slightly modified version of LeNet-5, a CNN proposed by LeCun himself in the first paper on CNNs.
  17. For the USPS dataset, four different implementations were realized: one of the smaller network on the Zedboard, thus implementing the accelerator on the FPGA of the Zynq SoC, and three others for the two different networks on the Virtex-7, one of which in particular was realized with four accelerators of the network, so as to exploit the larger amount of resources available on the FPGA and to obtain a coarser-grained parallelization for classifying the test dataset. Moreover, the higher execution speed made it possible to considerably reduce the energy consumption of the device. As for resource usage, this table reports the utilization for the four different configurations. As can be seen, in particular from the implementations of the small and large networks on the same FPGA, as the size of the network grows, and hence the number of operations performed in parallel by the accelerator, the number of LUTs and DSPs used for the floating point multiply and add units increases significantly.
  18. Moving on to the MNIST dataset, given the larger size of the two implemented networks, it was possible to implement them only on the Virtex-7.