A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner Up Award at Digilent Design Contest 2014 Japan Region)

A High Performance Heterogeneous
FPGA-based Accelerator
with PyCoRAM
Team: PyCoRAMist
Shinya Takamaeda-Yamazaki
Tokyo Institute of Technology
JSPS Research Fellow (DC1)
February 21, 2014
Digilent Design Contest @TED Yokohama

The 1st IPSJ SIG-ARC High-Performance Processor Design Contest (Jan 2014 @Tokyo)
■  A competition to develop a fast computing system for the specified applications on the specified platform
■  FPGA board: Digilent Atlys
    ●  FPGA: Xilinx Spartan-6 LX45
    ●  DRAM: DDR2-800 (1.6GB/s)
2014-02-21 Shinya T-Y. Tokyo Tech
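As a rough cross-check of the 1.6GB/s figure, the peak bandwidth of a DDR2-800 interface can be estimated from its transfer rate and bus width. This is only a sketch: the 16-bit bus width is an assumption that the slide does not state, and the function name is illustrative.

```python
def peak_bandwidth_bytes_per_sec(transfers_per_sec, bus_width_bits):
    """Peak bandwidth = transfers per second * bytes moved per transfer."""
    return transfers_per_sec * bus_width_bits // 8


# DDR2-800 performs 800 million transfers per second.
# ASSUMPTION: a 16-bit data bus (not stated on the slide).
peak = peak_bandwidth_bytes_per_sec(800_000_000, 16)

# Fractions of the peak that the deck quotes for the two accelerators.
quarter = peak // 4     # matrix multiplication accelerator
twelfth = peak // 12    # stencil accelerator

print(peak / 1e6, "MB/s peak")
print(quarter / 1e6, "MB/s at 1/4 utilization")
print(round(twelfth / 1e6), "MB/s at 1/12 utilization")
```

Under that assumption the quarter and twelfth shares come out at 400MB/s and roughly 133MB/s, which is in line with the "about 400MB/s" and "about 130MB/s" quoted for the accelerators.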

4 Specified Contest Applications
Hybrid System of CPU core + HW Accelerator

Application   Description                    Requirements for Memory System
310_sort      Integer Sort                   Low Latency
320_mm        Matrix-Matrix Multiplication   High Bandwidth
330_stencil   9-Point Stencil (Integer)      High Bandwidth
340_spath     Shortest Path Search           Low Latency

Suitable for HW Accelerators: Matrix Mult & Stencil
Difficult for HW Accelerators: Sort & Shortest Path

How to Implement an Accelerator?
■  HDL? NO WAY! It’s so annoying ☹
    ●  Implementing the entire system in HDL is hard, because ...
        •  Scheduling logic for computations and memory accesses
            –  Double buffering requires complicated logic
            –  State machine implementation is annoying and error-prone
    ●  But we want to define the pipeline design at cycle level
        •  Essential for high performance of FPGA-based accelerators
            –  HDL is still a good weapon for writing just the computation logic
            –  Modern high-level synthesis tools are still not effective
■  Do memory abstractions make us happy?
    →  CoRAM Memory Architecture

CoRAM (Connected RAM) [Chung+,FPGA’11]
■  Abstract Memory System for FPGAs
    ●  High-level abstraction for memory management
        •  Decoupling computing logic from memory access behavior
        •  Memory access patterns described in a software model (C language)
[Figure: CoRAM architecture. HW Kernels (Computing Logics) read and write CoRAM Memories (abstracted on-chip memories) and communicate over CoRAM Channels (communication FIFOs / registers). Control Threads (memory access pattern in C) manage the CoRAM Memories, which read and write the off-chip memory.]

PyCoRAM [Takamaeda+,CARL’13]
■  Python-based implementation of the CoRAM memory architecture for modern FPGA EDKs
    ●  CoRAM memory abstraction for the EDK development flow
■  Key features
    ●  Control Thread in Python
        •  We developed a Python-to-Verilog HLS compiler from scratch
    ●  AMBA AXI4 for the on-chip interconnect
        •  For IP-core based development on Xilinx Platform Studio (XPS)

PyCoRAM Microarchitecture
[Figure: block diagram with the labels User I/O, User Logic, CoRAM Memory, CoRAM Stream, CoRAM Channel, CoRAM Register, two DMACs, Control Thread, FSM, GPIO.]

[Figure: the same block diagram with two annotations. User Logic: modeled in RTL (Verilog HDL). Control Thread: memory access pattern in Python.]
def calc_sum(times):
    ram = CoramMemory(idx=0, datawidth=32, size=1024)
    channel = CoramChannel(idx=0, datawidth=32)
    addr = 0
    sum = 0
    for i in range(times):
        ram.write(0, addr, 128)    # Transfer (off-chip DRAM to BRAM)
        channel.write(addr)        # Notification to User-logic
        sum += channel.read()      # Wait for Notification from User-logic
        addr += 128 * (32/8)
    print('sum=', sum)             # $display Verilog system task

calc_sum(8)
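To make the control flow of the control thread easier to follow, here is a pure-software sketch that can be run in ordinary Python. `MockCoramMemory`, `MockCoramChannel`, and `MockUserLogic` are hypothetical stand-ins written for this illustration, not the PyCoRAM API, and the assumption that the user logic replies with the sum of the 128 transferred words is ours.

```python
class MockCoramMemory:
    """Hypothetical stand-in for an on-chip CoRAM memory backed by a DRAM list."""

    def __init__(self, dram, size):
        self.dram = dram            # off-chip memory modeled as a list of 32-bit words
        self.data = [0] * size      # on-chip BRAM contents

    def write(self, ram_addr, dram_byte_addr, length):
        # Models the DMA transfer from off-chip DRAM into the on-chip memory.
        base = dram_byte_addr // 4  # byte address -> word index (32-bit words)
        self.data[ram_addr:ram_addr + length] = self.dram[base:base + length]


class MockUserLogic:
    """ASSUMPTION: the user logic sums the words that were just transferred."""

    def __init__(self, ram, words):
        self.ram = ram
        self.words = words

    def run(self):
        return sum(self.ram.data[:self.words])


class MockCoramChannel:
    """Hypothetical stand-in for the channel between control thread and user logic."""

    def __init__(self, logic):
        self.logic = logic
        self.reply = None

    def write(self, value):
        # Notification to the user logic; here it computes its reply immediately.
        self.reply = self.logic.run()

    def read(self):
        # In hardware this would block until the user logic responds.
        return self.reply


def calc_sum(times, ram, channel):
    addr = 0
    total = 0
    for _ in range(times):
        ram.write(0, addr, 128)     # transfer 128 words from DRAM to BRAM
        channel.write(addr)         # notify the user logic
        total += channel.read()     # wait for the partial sum
        addr += 128 * (32 // 8)     # advance by 128 words of 4 bytes each
    return total


dram = list(range(8 * 128))         # test pattern: 0, 1, 2, ...
ram = MockCoramMemory(dram, size=1024)
channel = MockCoramChannel(MockUserLogic(ram, words=128))
result = calc_sum(8, ram, channel)
print("sum=", result)
```

The point of the sketch is the division of labor: the control thread only moves data and exchanges notifications, while the arithmetic lives in the user logic.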

PyCoRAM IP
[Figure: on the FPGA, the PyCoRAM IP (User I/O, User Logic, CoRAM Memory, CoRAM Stream, CoRAM Channel, CoRAM Register, Control Thread, FSM, GPIO, and two DMACs, each with an AXI I/F) connects through the AXI4 Interconnect to the DRAM Controller.]

FPGA Accelerator for PROCON
■  6-stage MIPS-core + UART loader + Two accelerators
    ●  XPS automatically synthesizes the AXI4 interconnections
    ●  Clock frequency: logic and AXI: 100MHz, DRAM: 400MHz
[Figure: an AXI4 Interconnect (32-bit, shared bus) links the DRAM Controller to four blocks, each behind a PyCoRAM abstraction: L1-D Cache (2-way, 32KB, 64 bytes/line) for the 6-stage MIPS-core; Memory Loader (UART); Matrix Multiplication Accelerator; 9-point Stencil Accelerator.]

[Figure: the same system block diagram, annotated with per-block percentage labels: 9.8%, 4.5%, 0.4%, 2.5%, 28.1%, 22.5%, 6.3%.]

Matrix-Matrix Multiplication Accelerator
■  Each row of matrices A, B, and C is stored in CoRAM memories
    ●  Data movements between on-chip memory and DRAM are managed by PyCoRAM control threads
    ●  The pipeline is fully occupied in every cycle
    ●  Double buffering overlaps computation with the transfer of matrix B
        •  Matrix B is transposed in advance by another CoRAM hardware unit
        •  1/4 of the total memory bandwidth is utilized (about 400MB/s)
[Figure: Computing Logic (Verilog HDL). CoRAM Memories 0 to 2 hold A, B, and C; an 8-stage multiply pipeline (×, +) accumulates sum; a check sum is computed; Control Logic communicates with the Control Thread (Python) through CoRAM Channel 0.]
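The scheme described on this slide (transpose B in advance, then stream its rows through two alternating buffers while the pipeline computes dot products) can be sketched in software. This is a sequential model under our own assumptions, not the contest design: the function names are illustrative, and the "prefetch" that hardware would run concurrently with the computation happens here one step ahead in program order.

```python
def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul_double_buffered(a, b):
    """Compute C = A x B using row-wise dot products against transposed B.

    Two buffers alternate: one holds the row of B^T currently being consumed,
    the other receives the next row (in hardware, while the pipeline is busy).
    """
    bt = transpose(b)               # B is transposed in advance
    n_cols = len(bt)
    c = []
    for row_a in a:
        buffers = [list(bt[0]), None]   # initial load into buffer 0
        row_c = []
        for j in range(n_cols):
            current = buffers[j % 2]
            if j + 1 < n_cols:
                # Prefetch the next row of B^T into the other buffer.
                buffers[(j + 1) % 2] = list(bt[j + 1])
            # Multiply-accumulate over one row of A and one row of B^T.
            row_c.append(sum(x * y for x, y in zip(row_a, current)))
        c.append(row_c)
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul_double_buffered(a, b)
print(c)
```

Transposing B turns every inner product into two sequential row reads, which suits burst transfers from DRAM into on-chip memory.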

Stencil Computation Accelerator
■  3 source arrays and 1 result array are held in CoRAM memories
    ●  Data movements between on-chip memory and DRAM are managed by PyCoRAM control threads
    ●  The pipeline consumes 3 data points in every cycle
        •  Result = (sum of the input data from the latest 3 cycles) / 9
    ●  The result is written back, then the next array is read
        •  1/12 of the total memory bandwidth is utilized (about 130MB/s)
[Figure: Computing Logic (Verilog HDL). CoRAM Memories 0 to 3 provide the sources d0, d1, d2 and receive the result rslt; a 41-stage add-divide pipeline (+, /) produces rslt; a check sum is accumulated; Control Logic communicates with the Control Thread (Python) through CoRAM Channel 0.]
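The per-cycle rule on this slide (take 3 points per cycle, sum the inputs of the latest 3 cycles, divide by 9) can be modeled with a sliding window. This is a minimal sketch with assumptions of our own: the function name is illustrative, inputs are non-negative integers so that floor division matches a truncating divider, and no output is produced until three columns have been seen (the slide does not say how boundaries are handled).

```python
from collections import deque


def stencil_row(d0, d1, d2):
    """9-point integer stencil over three adjacent source rows.

    Each loop iteration models one cycle: one value from each of the three
    rows is consumed, and the column sums of the latest three cycles are
    combined and divided by 9.
    """
    window = deque(maxlen=3)        # column sums of the latest 3 cycles
    out = []
    for p0, p1, p2 in zip(d0, d1, d2):
        window.append(p0 + p1 + p2)
        if len(window) == 3:        # ASSUMPTION: interior points only
            out.append(sum(window) // 9)
    return out


rslt = stencil_row([1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12])
print(rslt)
```

Keeping only three column sums means each input value is read from on-chip memory once, which is what lets the pipeline consume exactly 3 points per cycle.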

L1 Data Cache for MIPS-core
■  CoRAM Memory as Data Memory
    ●  Data replacements are managed by the control thread
        •  When a cache miss occurs, a handling request is issued to the control thread (CT)
[Figure: Cache Logic (Verilog HDL). CoRAM Memory 0,1 hold the data ways D0 and D1; Tag0 and Tag1 are each compared (=) to drive Select and the MUX; signals: Addr, Write Data, Write Enable, Read Enable, Read Data, Stall; pipeline registers (reg); Control Logic communicates with the Control Thread (Python) through CoRAM Channel 0.]
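The hit/miss decision of a 2-way cache with the parameters given earlier in the deck (32KB, 64 bytes/line) can be sketched as follows. This is a software model under stated assumptions, not the contest hardware: the class and function names are illustrative, and the alternating victim selection is an assumption because the slides do not name a replacement policy. The refill on a miss stands in for the request that the cache logic issues to the control thread.

```python
LINE_BYTES = 64
WAYS = 2
CACHE_BYTES = 32 * 1024
SETS = CACHE_BYTES // (LINE_BYTES * WAYS)   # number of sets per way


def split_address(addr):
    """Split a byte address into (tag, set index, byte offset)."""
    offset = addr % LINE_BYTES
    index = (addr // LINE_BYTES) % SETS
    tag = addr // (LINE_BYTES * SETS)
    return tag, index, offset


class TwoWayCache:
    def __init__(self):
        self.tags = [[None] * SETS for _ in range(WAYS)]
        self.victim = [0] * SETS    # ASSUMPTION: alternate the replaced way

    def access(self, addr):
        """Return True on a hit; on a miss, refill one way and return False."""
        tag, index, _ = split_address(addr)
        for way in range(WAYS):
            if self.tags[way][index] == tag:    # the two tag comparators
                return True
        # Miss: in the real design the control thread would fetch the line.
        way = self.victim[index]
        self.tags[way][index] = tag
        self.victim[index] = 1 - way
        return False


cache = TwoWayCache()
trace = [0, 4, 16384, 0, 32768, 0]
hits = [cache.access(a) for a in trace]
print(hits)
```

Addresses 0, 16384, and 32768 map to the same set with different tags, so the third distinct line evicts one of the first two.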

Evaluation
■  Evaluation targets
    ●  Reference design provided by the contest committee (Ref)
    ●  6-stage MIPS-core + L1 Cache (6-stage)
    ●  6-stage MIPS-core + L1 Cache + Accelerators (6-stage+ACC)
■  Application dataset
    ●  Dataset provided for the first-round match
■  FPGA EDA tools
    ●  Xilinx Platform Studio 14.6, PlanAhead 14.6
        •  Optimization goal: Speed, Optimization Effort: High
        •  AXI4 Interconnect: 32-bit Shared bus (Area optimized)
■  Compiler for MIPS-core
    ●  gcc 4.3.3 (-O3)

Performance
■  Performance is measured by execution time (not including data transfer time)
■  Drastic speedup compared to the reference design
    ●  The 6-stage MIPS-core with L1 cache is 3.5 times faster on average (geometric mean)
    ●  The 6-stage MIPS-core with accelerators is 13.2 times faster on average (geometric mean) and 47.1 times faster at maximum
Relative performance (speedup over Ref; higher is better)
               310_sort   320_mm   330_stencil   340_spath   Gmean
6-stage           3.9       1.4        5.9          4.7        3.5
6-stage+ACC       3.9      35.2       47.1          4.7       13.2

Execution time [sec] (lower is better)
               310_sort   320_mm   330_stencil   340_spath
Ref              14.2      14.2       16.0         20.8
6-stage           3.6       9.8        2.7          4.4
6-stage+ACC       3.6       0.4        0.3          4.4
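The Gmean column of the chart is a geometric mean of the four per-application speedups. The short sketch below recomputes it from the chart values; the helper name `gmean` is ours, and the speedup lists are read off the chart above.

```python
import math


def gmean(values):
    """Geometric mean: the n-th root of the product, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Speedups over the reference design: sort, mm, stencil, spath.
six_stage = [3.9, 1.4, 5.9, 4.7]
six_stage_acc = [3.9, 35.2, 47.1, 4.7]

print(round(gmean(six_stage), 1))
print(round(gmean(six_stage_acc), 1))
```

A geometric mean is the usual choice for averaging ratios, since a single large speedup such as 47.1 dominates it less than it would an arithmetic mean.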

Conclusion
■  From the IPSJ SIG-ARC High-Performance Processor Design Contest
■  Development of a heterogeneous FPGA-based accelerator with PyCoRAM
    ●  Heterogeneous system of a MIPS-core and two accelerators
    ●  47.1 times faster than the reference design
■  The tool-chain and framework are available on GitHub
    ●  PyCoRAM: http://shtaxxx.github.io/PyCoRAM/
    ●  Pyverilog: http://shtaxxx.github.io/Pyverilog/
