A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner Up Award at Digilent Design Contest 2014 Japan Region)
1. A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM
Team: PyCoRAMist
Shinya Takamaeda-Yamazaki
Tokyo Institute of Technology
JSPS Research Fellow (DC1)
February 21, 2014
Digilent Design Contest @TED Yokohama
2. The 1st IPSJ SIG-ARC High-Performance Processor Design Contest (Jan 2014 @Tokyo)
■ A competition to develop a fast computing system for the specified applications on the specified platform
■ FPGA board: Digilent Atlys
  ● FPGA: Xilinx Spartan-6 LX45
  ● DRAM: DDR2-800 (1.6GB/s)
2014-02-21 Shinya T-Y. Tokyo Tech 2
3. 4 Specified Contest Applications
Hybrid System of CPU core + HW Accelerator
Matrix Mult & Stencil: Suitable for HW Accelerators
Sort & Shortest Path: Difficult for HW Accelerators

Application   Description                    Requirements for Memory System
310_sort      Integer Sort                   Low Latency
320_mm        Matrix-Matrix Multiplication   High Bandwidth
330_stencil   9-Point Stencil (Integer)      High Bandwidth
340_spath     Shortest Path Search           Low Latency
4. How to Implement an Accelerator?
■ HDL? NO WAY! It’s so annoying ☹
  ● Implementing the entire system in HDL is hard, because ...
    • Scheduling logic of computations and memory accesses
      – Double buffering requires complicated logic
      – State machine implementation is so annoying and error-prone
  ● But we want to define the pipeline design at the cycle level
    • Essential for high performance of FPGA-based accelerators
      – HDL is still a good weapon for writing just the computation logic
      – Modern high-level synthesis tools are still not effective
■ Memory abstractions make us happy?
CoRAM Memory Architecture
5. CoRAM (Connected RAM) [Chung+,FPGA’11]
■ Abstract Memory System for FPGAs
  ● High-level abstraction for memory management
    • Decoupling computing logic from memory access behavior
    • Memory access patterns are described in a software model (C language)
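[Figure: CoRAM architecture. HW kernels (computing logic) read and write CoRAM memories (abstracted on-chip memories). Control threads (memory access patterns in C) manage reads and writes between the CoRAM memories and off-chip memory, and communicate with the HW kernels through CoRAM channels (communication FIFOs / registers).]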
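The decoupling on this slide can be sketched as a toy software model in plain Python. This is not the CoRAM or PyCoRAM API: `ToyCoramMemory`, `ToyChannel`, `kernel_sum`, and `control_thread` are hypothetical names, DRAM is modelled as a Python list, and the kernel is called in sequence although real hardware would run it concurrently.

```python
from collections import deque


class ToyCoramMemory:
    """Hypothetical stand-in for an abstracted on-chip memory."""

    def __init__(self, size):
        self.data = [0] * size

    def load(self, dram, dram_addr, length):
        # Copy `length` words from off-chip memory into the on-chip memory.
        self.data[:length] = dram[dram_addr:dram_addr + length]


class ToyChannel:
    """Hypothetical stand-in for a communication FIFO."""

    def __init__(self):
        self.fifo = deque()

    def write(self, value):
        self.fifo.append(value)

    def read(self):
        return self.fifo.popleft()


def kernel_sum(mem, to_kernel, from_kernel):
    # Computing logic: it only sees on-chip data, never a DRAM address.
    length = to_kernel.read()
    from_kernel.write(sum(mem.data[:length]))


def control_thread(dram, block, mem, to_kernel, from_kernel):
    # Memory access pattern: stream DRAM through the on-chip memory block by block.
    total = 0
    for addr in range(0, len(dram), block):
        length = min(block, len(dram) - addr)
        mem.load(dram, addr, length)
        to_kernel.write(length)                  # tell the kernel how much data is ready
        kernel_sum(mem, to_kernel, from_kernel)  # sequential here, concurrent in hardware
        total += from_kernel.read()
    return total


dram = list(range(1000))
total = control_thread(dram, 128, ToyCoramMemory(128), ToyChannel(), ToyChannel())
```

Changing the access pattern (block size, stride, order) only touches `control_thread`; `kernel_sum` stays as it is, which is the point of the abstraction.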
6. PyCoRAM [Takamaeda+,CARL’13]
■ Python-based implementation of the CoRAM memory architecture for modern FPGA EDKs
  ● CoRAM memory abstraction for the EDK development flow
■ Key features
  ● Control thread in Python
    • We developed a Python-to-Verilog HLS compiler from scratch
  ● AMBA AXI4 for the on-chip interconnect
    • For IP-core based development on Xilinx Platform Studio (XPS)
12. Matrix-Matrix Multiplication Accelerator
■ Each row of matrices A, B, and C is stored in CoRAM memories
  ● Data movements between on-chip memory and DRAM are managed by control threads of PyCoRAM
  ● The pipeline is fully occupied on every cycle
  ● Double buffering of computation and transmission of mat B
    • Mat B is transposed in advance by the other CoRAM hardware
    • 1/4 of the total memory bandwidth is utilized (about 400MB/s)
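[Figure: matrix multiplication accelerator. Computing logic (Verilog HDL) with three CoRAM memories (0 to 2) for A, B, and C, an 8-stage multiply pipeline followed by an accumulator (sum), a check sum, and control logic connected to the control thread (Python) through CoRAM Channel 0.]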
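A minimal software sketch of the data flow on this slide, assuming row-major Python lists. The function names and the ping-pong `buffers` list are hypothetical; in hardware the prefetch of the next row of transposed B overlaps with the multiply-accumulate, while here the two steps simply run one after the other.

```python
def transpose(m):
    # Transposing B in advance turns its columns into sequentially readable rows.
    return [list(col) for col in zip(*m)]


def matmul_double_buffered(a, b):
    n_rows, n_cols = len(a), len(b[0])
    bt = transpose(b)
    c = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        row_a = a[i]                     # one row of A held on chip
        buffers = [bt[0], None]          # ping-pong buffers for rows of transposed B
        for j in range(n_cols):
            cur = j % 2
            if j + 1 < n_cols:
                buffers[1 - cur] = bt[j + 1]   # prefetch the next row into the idle buffer
            acc = 0
            for x, y in zip(row_a, buffers[cur]):
                acc += x * y             # multiply-accumulate, one pair per pipeline slot
            c[i][j] = acc
    return c


# Bandwidth figure quoted on the slide: a quarter of DDR2-800's 1.6 GB/s.
DRAM_MB_PER_S = 1600
mm_bandwidth_mb_per_s = DRAM_MB_PER_S / 4

product = matmul_double_buffered([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

With two buffers the compute side never waits for a row it is about to use, provided loading one row takes no longer than consuming one.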
13. Stencil Computation Accelerator
■ 3 arrays for the source and 1 array for the result, held in CoRAM memories
  ● Data movements between on-chip memory and DRAM are managed by control threads of PyCoRAM
  ● The pipeline consumes 3 data points on every cycle
    • (Sum of input data within the latest 3 cycles) / 9
  ● The result is written back, then the next array is read
    • 1/12 of the total memory bandwidth is utilized (about 130MB/s)
[Figure: stencil accelerator. Computing logic (Verilog HDL) with four CoRAM memories (0 to 3) for the source rows d0, d1, d2 and the result rslt, a 41-stage add-divide pipeline, a check sum, and control logic connected to the control thread (Python) through CoRAM Channel 0.]
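The "3 points per cycle" pipeline can be sketched in software as follows. This is only an illustration: the function names are hypothetical, boundary points are skipped, and floor division is assumed for the integer divide, none of which the slide specifies.

```python
from collections import deque


def stencil_row(d0, d1, d2):
    """Compute one result row from three source rows (interior points only)."""
    window = deque(maxlen=3)             # column sums from the latest 3 cycles
    out = []
    for p0, p1, p2 in zip(d0, d1, d2):   # one cycle: one point from each source row
        window.append(p0 + p1 + p2)
        if len(window) == 3:
            out.append(sum(window) // 9)  # (sum over the latest 3 cycles) / 9
    return out


def stencil(grid):
    # Slide the three-row window down the grid, one result row per position.
    return [stencil_row(grid[y - 1], grid[y], grid[y + 1])
            for y in range(1, len(grid) - 1)]


grid = [[4 * y + x for x in range(4)] for y in range(4)]
result = stencil(grid)
```

Each input point is read once per result row and the nine-point sum is rebuilt from three column sums, so the pipeline needs only three new values per cycle.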
14. L1 Data Cache for MIPS-core
■ CoRAM memory as data memory
  ● Data replacements are managed by the control thread (CT)
    • When a cache miss occurs, a handling request is issued to the CT
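[Figure: L1 data cache. Cache logic (Verilog HDL) with CoRAM Memories 0 and 1 holding data D0 and D1, tag comparators for Tag0 and Tag1 driving a select MUX, core-side signals (Addr, Write Data, Write Enable, Read Data, Read Enable, Stall), and control logic connected to the control thread (Python) through CoRAM Channel 0.]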
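A toy model of the miss-handling split between cache logic and control thread. Everything here is an assumption for illustration: the class `ToyCache`, two ways (suggested by the Tag0/Tag1 labels in the diagram), round-robin replacement, write-through stores, and the set and line sizes are not taken from the actual design.

```python
class ToyCache:
    """Hypothetical 2-way set-associative, write-through cache model."""

    def __init__(self, dram, num_sets=4, line_words=4):
        self.dram = dram
        self.num_sets = num_sets
        self.line_words = line_words
        self.tags = [[None, None] for _ in range(num_sets)]
        self.lines = [[None, None] for _ in range(num_sets)]
        self.victim = [0] * num_sets     # next way to replace (round-robin, assumed)
        self.hits = 0
        self.misses = 0

    def _split(self, addr):
        line_no, offset = divmod(addr, self.line_words)
        tag, index = divmod(line_no, self.num_sets)
        return tag, index, offset

    def _control_thread_refill(self, tag, index):
        # Stands in for the control thread: fetch the missing line from DRAM.
        way = self.victim[index]
        self.victim[index] = 1 - way
        base = (tag * self.num_sets + index) * self.line_words
        self.tags[index][way] = tag
        self.lines[index][way] = self.dram[base:base + self.line_words]
        return way

    def _lookup(self, addr):
        tag, index, offset = self._split(addr)
        for way in (0, 1):               # compare against Tag0 and Tag1
            if self.tags[index][way] == tag:
                self.hits += 1
                return index, way, offset
        self.misses += 1                 # the core would stall until the refill is done
        return index, self._control_thread_refill(tag, index), offset

    def read(self, addr):
        index, way, offset = self._lookup(addr)
        return self.lines[index][way][offset]

    def write(self, addr, value):
        index, way, offset = self._lookup(addr)
        self.lines[index][way][offset] = value
        self.dram[addr] = value          # write-through (assumed policy)


dram = list(range(64))
cache = ToyCache(dram)
first = cache.read(5)     # miss, line refilled
second = cache.read(6)    # hit in the same line
cache.write(5, 99)        # hit, also updates DRAM
```

Because the refill lives in one function, the replacement policy can be changed without touching the hit path, mirroring how the control thread owns data replacement.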
15. Evaluation
■ Evaluation targets
  ● Reference design provided by the contest committee (Ref)
  ● 6-stage MIPS-core + L1 Cache (6-stage)
  ● 6-stage MIPS-core + L1 Cache + Accelerators (6-stage+ACC)
■ Application dataset
  ● Dataset provided for the first-round match
■ FPGA EDA tools
  ● Xilinx Platform Studio 14.6, PlanAhead 14.6
    • Optimization goal: Speed, Optimization Effort: High
    • AXI4 Interconnect: 32-bit Shared bus (Area optimized)
■ Compiler for MIPS-core
  ● gcc 4.3.3 (-O3)
16. Performance
■ Measured by execution time (not including data transfer time)
■ Drastic speed-up compared to the reference design
  ● The 6-stage MIPS-core + L1 Cache (6-stage) is 3.5 times faster on geometric mean
  ● With the accelerators (6-stage+ACC) it is 13.2 times faster on geometric mean and 47.1 times faster at maximum
[Figure: relative performance (Ref = 1)]
              310_sort  320_mm  330_stencil  340_spath  Gmean
6-stage            3.9     1.4          5.9        4.7    3.5
6-stage+ACC        3.9    35.2         47.1        4.7   13.2
[Figure: execution time [sec]]
              310_sort  320_mm  330_stencil  340_spath
Ref               14.2    14.2         16.0       20.8
6-stage            3.6     9.8          2.7        4.4
6-stage+ACC        3.6     0.4          0.3        4.4
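The averages quoted on this slide can be reproduced as geometric means of the per-application speedups read from the chart labels. The helper `geometric_mean` and the list names are illustrative only.

```python
import math


def geometric_mean(values):
    # exp of the mean of logs; equals the n-th root of the product.
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Speedups over Ref for 310_sort, 320_mm, 330_stencil, 340_spath.
six_stage = [3.9, 1.4, 5.9, 4.7]
six_stage_acc = [3.9, 35.2, 47.1, 4.7]

print(round(geometric_mean(six_stage), 1))      # 3.5
print(round(geometric_mean(six_stage_acc), 1))  # 13.2
```

The geometric mean is the usual choice for averaging ratios: the two accelerated applications raise the average without dominating it the way they would in an arithmetic mean.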
17. Conclusion
■ From the IPSJ SIG-ARC High-Performance Processor Design Contest
■ Development of a heterogeneous FPGA-based accelerator with PyCoRAM
  ● Heterogeneous system of a MIPS-core and two accelerators
  ● Up to 47.1 times faster than the reference design
■ The tool-chain and framework are available on GitHub
  ● PyCoRAM: http://shtaxxx.github.io/PyCoRAM/
  ● Pyverilog: http://shtaxxx.github.io/Pyverilog/