2. HW
Binary Neural Network (BNN)
- Activations (Act.) and weights (Wgt.) are each +1/-1 (1 bit)
- High precision (32-bit) -> low precision (1 bit for each): information loss!
- DitherNN [Ando+, FPT’18], DeltaNet [Oba+, ASAP’19]
[Figure: quantization of an output channel (Out Ch); panels "Original", "Binary", "Binary w/ Dither" (+ Quant.)]
[Figure: BNN-HW, arrays of Σ and f units; panels "Binary NN" and "DeltaNet"]
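Why +1/-1 values are cheap in hardware can be shown with a small sketch. It assumes the common encoding of +1 as bit 1 and -1 as bit 0, under which a dot product reduces to XNOR plus popcount. The function names and sample values are illustrative and are not taken from DitherNN or DeltaNet.

```python
def binarize(values):
    """Map each real value to +1 or -1 with the sign function (0 maps to +1)."""
    return [1 if v >= 0 else -1 for v in values]


def to_bits(signs):
    """Pack a +1/-1 vector into an integer: +1 becomes bit 1, -1 becomes bit 0."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_bits, w_bits, n):
    """Dot product of two n-element +1/-1 vectors given as packed bit words."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ w_bits) & mask   # XNOR: 1 where the two signs match
    matches = bin(agree).count("1")     # popcount
    return 2 * matches - n              # matches minus mismatches


# Illustrative full-precision inputs (assumed values).
activations = [0.7, -1.2, 0.1, -0.4, 2.0, -0.3, 0.0, 0.9]
weights = [-0.5, -0.8, 0.3, 0.6, 1.1, -0.2, -0.7, 0.4]

a = binarize(activations)
w = binarize(weights)

reference = sum(x * y for x, y in zip(a, w))               # multiply-accumulate
fast = xnor_popcount_dot(to_bits(a), to_bits(w), len(a))   # XNOR + popcount
print(reference, fast)
```

Both results agree, so the multipliers can be replaced by XNOR gates. The price is the information discarded by `binarize`.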
3. SoC FPGA DNN
[Figure: All-in-One system on a SoC FPGA]
- SoC FPGA: CPU, GPU, NW, DRAM Ctrl (to external DRAM)
- On-chip Interconnect (AMBA AXI4)
- Programmable Logic (BRAM and DSP blocks): Soft CPU, DNN Processor, Camera Image Processor, Motor Controller, GPIO Controller
- External devices: Camera, Motor, GPIO Pin
4. SoC FPGA DNN
[Figure: same SoC FPGA system diagram as slide 3]
Annotations: (BRAM) (DSP) ☹, I/O ☺
5. NNgen
- Generates a model-specific hardware IP for a DNN
- Input: model definition (TensorFlow-like)
- Output: IP (or RTL)
  - Veriloggen Object
  - Verilog HDL
  - IP-XACT
[Figure: NNgen software stack]
Model Definition: layer0 = ng.conv2d(a0, w0, ...)
NNgen
- Scheduler: Graph Optimization, Task Scheduling
- Allocator: RAM Assignment, Stream-Op Assignment
- Pipeline Synthesis: Building Stream-Op via Veriloggen.Stream API
- Control Synthesis: Building FSM via Veriloggen.Thread API
- Code Synthesis: RTL and IP-XACT generation via Veriloggen/IPgen
Pyverilog: Verilog HDL AST Abstraction
Veriloggen
- Veriloggen.Thread: Procedural HLS (Python Source Code -> AST -> FSM)
- Veriloggen.Stream: Dataflow HLS (Dataflow Definition -> Scheduled Pipeline)
- Veriloggen.Core: Verilog HDL Abstraction and Meta-Programming API
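The "Dataflow Definition -> Scheduled Pipeline" step can be pictured with a toy scheduler. This is a minimal sketch in plain Python, not the Veriloggen.Stream implementation. The graph format, the operator latencies, and the as-soon-as-possible (ASAP) policy are all assumptions made for illustration.

```python
# Hypothetical dataflow definition: name -> (operator, inputs, latency in cycles).
dataflow = {
    "a":   ("input", [], 0),
    "w":   ("input", [], 0),
    "b":   ("input", [], 0),
    "mul": ("mul", ["a", "w"], 2),
    "add": ("add", ["mul", "b"], 1),
    "out": ("relu", ["add"], 1),
}


def schedule(graph):
    """Return the cycle at which each node's result becomes valid (ASAP)."""
    ready = {}

    def visit(name):
        if name not in ready:
            _, inputs, latency = graph[name]
            # A node starts once its slowest input is valid.
            start = max((visit(src) for src in inputs), default=0)
            ready[name] = start + latency
        return ready[name]

    for name in graph:
        visit(name)
    return ready


def delay_registers(graph, ready):
    """Registers needed per edge so that all operands of a node arrive together."""
    delays = {}
    for name, (_, inputs, latency) in graph.items():
        start = ready[name] - latency
        for src in inputs:
            delays[(src, name)] = start - ready[src]
    return delays


ready = schedule(dataflow)
delays = delay_registers(dataflow, ready)
for edge, d in delays.items():
    print(edge, "delay registers:", d)
```

In this toy graph the bias `b` must be delayed until the 2-cycle multiplier has finished. A pipeline scheduler inserts that kind of balancing register automatically.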
6.
- DNN
  - TensorFlow
- HLS
  - HLS: Veriloggen
    https://github.com/PyHDI/veriloggen
  - HLS: C++/C, HDL
- Veriloggen.Thread, Veriloggen.Stream
15. NNgen-DNN
[Figure: NNgen Accelerator IP-core (IP-XACT), connected to CPU and DRAM through the AXI4 Interconnect]
- Computing Unit Pool (each unit has a ThreadArg and a Stream)
  - conv2d 3x3, Parallel: 3x3x4x4x2x2
  - max_pool 2x2, Parallel: 4
  - matmul, Parallel: 4x4
- Substream Pool: Mul x12, Acc x4, AddTree x4, shared through the Substream Interconnect
- RAM Pool: BRAM (Width: 8x4-bit), BRAM (Width: 8x4-bit), BRAM, reached through the Memory Interconnect
- Main Thread
- DMA Controller, DMA Interconnect
- AXI4 Master I/F, AXI4 Slave I/F, Config Register
Highlighted: OP
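The "Parallel" and "Width" labels in the figure can be turned into numbers. The sketch below rests on two assumptions not stated on the slide: each factor in a "Parallel" label is an independent degree of parallelism, so their product is the number of lanes, and "8x4-bit" means four 8-bit elements packed into one 32-bit RAM word. The helper names are hypothetical.

```python
from math import prod

# Parallelism factors copied from the figure.
parallelism = {
    "conv2d 3x3": (3, 3, 4, 4, 2, 2),
    "max_pool 2x2": (4,),
    "matmul": (4, 4),
}

# Assumed reading: lanes per computing unit = product of its factors.
lanes = {name: prod(factors) for name, factors in parallelism.items()}


def pack_word(values, elem_bits=8):
    """Pack unsigned elem_bits-wide values into one word, element 0 in the low bits."""
    word = 0
    for i, v in enumerate(values):
        if not 0 <= v < (1 << elem_bits):
            raise ValueError("value does not fit in elem_bits")
        word |= v << (i * elem_bits)
    return word


def unpack_word(word, count=4, elem_bits=8):
    """Inverse of pack_word: split one word back into `count` elements."""
    mask = (1 << elem_bits) - 1
    return [(word >> (i * elem_bits)) & mask for i in range(count)]


word = pack_word([0x11, 0x22, 0x33, 0x44])
print(lanes)
print(hex(word), unpack_word(word))
```

Under these assumptions a single RAM read delivers four elements per cycle, which matches a unit that consumes four values in parallel.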
16. NNgen-DNN
[Figure: same accelerator architecture diagram as slide 15]
Highlighted: OP
17. NNgen-DNN
[Figure: same accelerator architecture diagram as slide 15]
Highlighted: NoC
18. NNgen-DNN
[Figure: same accelerator architecture diagram as slide 15]
Highlighted: RAM
19. NNgen-DNN
[Figure: same accelerator architecture diagram as slide 15]
Highlighted: RAM, NoC
20. NNgen-DNN
[Figure: same accelerator architecture diagram as slide 15]
Highlighted: AXI4-Master + DMA
21. NNgen-DNN
[Figure: same accelerator architecture diagram as slide 15]
Highlighted: FSM
22. NNgen-DNN
[Figure: same accelerator architecture diagram as slide 15]
23.
- FSM: built by NNgen (Control Synthesis)
- RTL + IP: generated via Veriloggen
- FSM: written as Python source code
[Figure: NNgen software stack (same as slide 5)]
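Deriving an FSM from Python source code can be pictured with a toy example. This is a minimal sketch built on Python's standard `ast` module (Python 3.9+ for `ast.unparse`), not Veriloggen.Thread's implementation. The function names in the parsed source (`dma_read`, `run_stream`, `wait_done`, `dma_write`) and the one-state-per-statement rule are assumptions made for illustration.

```python
import ast
import textwrap

# Hypothetical control procedure; it is only parsed, never executed.
source = textwrap.dedent("""
    def control():
        dma_read()
        run_stream()
        wait_done()
        dma_write()
""")


def to_fsm(src):
    """Build a linear FSM: one state per top-level statement of the first function."""
    func = ast.parse(src).body[0]
    states = []
    for index, stmt in enumerate(func.body):
        states.append({
            "state": index,
            "action": ast.unparse(stmt),   # statement text performed in this state
            "next": index + 1,             # sequential code: fall through
        })
    # Terminal state that loops on itself once the procedure has finished.
    last = len(func.body)
    states.append({"state": last, "action": "done", "next": last})
    return states


fsm = to_fsm(source)
for s in fsm:
    print(s["state"], "->", s["next"], ":", s["action"])
```

A real procedural HLS flow also has to handle branches, loops, and multi-cycle operations. Those produce conditional transitions rather than this straight chain of states.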
25. NNgen
- ONNX importer (nngen.onnx)
  - ONNX model -> NNgen model definition
  - Models from other frameworks (PyTorch etc.) -> Verilog HDL/IP-XACT, via ONNX and NNgen
- Quantizer (nngen.quantizer)
  - 1/2/4/8/16/32/64-bit
  - ✓ Experimental Implementation
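What quantizing to a chosen bit width involves can be sketched in plain Python. The example uses symmetric linear quantization with a single scale factor. That scheme is an assumption made for illustration, not the algorithm of nngen.quantizer, and the function names are hypothetical. The sketch covers widths of 2 bits and up; the 1-bit case reduces to taking the sign.

```python
def quantize(values, bits):
    """Symmetric linear quantization to signed `bits`-bit integers (bits >= 2)."""
    if bits < 2:
        raise ValueError("this sketch handles 2 bits and up")
    qmax = (1 << (bits - 1)) - 1                 # e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0   # real value of one integer step
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate real values."""
    return [x * scale for x in q]


def max_error(values, bits):
    """Largest absolute reconstruction error at the given bit width."""
    q, scale = quantize(values, bits)
    return max(abs(v - r) for v, r in zip(values, dequantize(q, scale)))


# Illustrative weights (assumed values).
weights = [0.82, -0.41, 0.07, -1.0, 0.33, 0.0]

for bits in (2, 4, 8, 16):
    print(bits, "bits: max error", max_error(weights, bits))
```

The reconstruction error shrinks as the width grows, which is the accuracy-versus-resource trade-off a configurable quantizer exposes.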