A High Performance Heterogeneous
FPGA-based Accelerator
with PyCoRAM
Team: PyCoRAMist
Shinya Takamaeda-Yamazaki
Tokyo Institute of Technology
JSPS Research Fellow (DC1)
February 21, 2014
Digilent Design Contest @TED Yokohama
The 1st IPSJ SIG-ARC High-Performance Processor Design Contest (Jan 2014 @Tokyo)
■ A competition to develop a fast computing system for the specified applications on the specified platform
■ FPGA board: Digilent Atlys
  ● FPGA: Xilinx Spartan-6 LX45
  ● DRAM: DDR2-800 (1.6 GB/s)
2014-02-21 Shinya T-Y. Tokyo Tech 2
4 Specified Contest Applications
Hybrid System of CPU core + HW Accelerator
Suitable for HW accelerators: Matrix Mult & Stencil
Difficult for HW accelerators: Sort & Shortest Path

Application   Description                    Requirements for Memory System
310_sort      Integer Sort                   Low Latency
320_mm        Matrix-Matrix Multiplication   High Bandwidth
330_stencil   9-Point Stencil (Integer)      High Bandwidth
340_spath     Shortest Path Search           Low Latency
How to Implement an Accelerator?
■ HDL? NO WAY! It's so annoying :(
  ● Implementing the entire system in HDL is hard, because ...
    • Scheduling logic for computations and memory accesses
      – Double buffering requires complicated logic
      – State machine implementation is annoying and error-prone
  ● But we want to define the pipeline design at the cycle level
    • Essential for high performance of FPGA-based accelerators
      – HDL is still a good weapon for writing just the computation logic
      – Modern high-level synthesis tools are still not effective
■ Will memory abstractions make us happy?
CoRAM Memory Architecture
CoRAM (Connected RAM) [Chung+, FPGA'11]
■ Abstract memory system for FPGAs
  ● High-level abstraction for memory management
    • Decouples computing logic from memory access behavior
    • Memory access patterns are described in a software model (C language)
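[Figure: CoRAM architecture. HW kernels (computing logic) read and write CoRAM memories (abstracted on-chip memories). Control threads (memory access patterns in C) manage the CoRAM memories, which sit between the kernels and off-chip memory, and communicate with the kernels through CoRAM channels (communication FIFOs/registers).]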
PyCoRAM [Takamaeda+, CARL'13]
■ Python-based implementation of the CoRAM memory architecture for modern FPGA EDKs
  ● CoRAM memory abstraction for the EDK development flow
■ Key features
  ● Control threads written in Python
    • We developed a Python-to-Verilog HLS compiler from scratch
  ● AMBA AXI4 for the on-chip interconnect
    • For IP-core-based development on Xilinx Platform Studio (XPS)
PyCoRAM Microarchitecture
[Figure: block diagram. Labels: User I/O, User Logic, CoRAM Channel, CoRAM Register, Control Thread, FSM, CoRAM Memory (with DMAC), CoRAM Stream (with DMAC), GPIO.]
PyCoRAM Microarchitecture (continued)
[Figure: the same block diagram with two annotations: the user logic is modeled in RTL (Verilog HDL), and the control thread's memory access pattern is written in Python.]
Example control thread:
def calc_sum(times):
    ram = CoramMemory(idx=0, datawidth=32, size=1024)
    channel = CoramChannel(idx=0, datawidth=32)
    addr = 0
    sum = 0
    for i in range(times):
        ram.write(0, addr, 128)   # Transfer (off-chip DRAM to BRAM)
        channel.write(addr)       # Notification to User-logic
        sum += channel.read()     # Wait for Notification from User-logic
        addr += 128 * (32/8)
    print('sum=', sum)            # $display Verilog system task
calc_sum(8)
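To see what the control thread above does without an FPGA, here is a minimal software stand-in. It is only a sketch: MockCoramMemory, MockCoramChannel, and the DRAM list are hypothetical names invented for illustration and are not part of PyCoRAM. It assumes that ram.write takes (on-chip address, off-chip byte address, number of words) and that the user logic replies with the sum of the words just transferred.

```python
from collections import deque

WORD_BYTES = 32 // 8          # datawidth=32 in the slide's example
DRAM = list(range(1024))      # hypothetical off-chip contents, one int per word


class MockCoramMemory:
    """Hypothetical stand-in for an on-chip CoRAM memory filled from DRAM."""

    def __init__(self, size):
        self.data = [0] * size

    def write(self, ram_addr, ext_addr, size):
        # Copy `size` words from DRAM (byte address ext_addr) into on-chip RAM.
        first = ext_addr // WORD_BYTES
        self.data[ram_addr:ram_addr + size] = DRAM[first:first + size]


class MockCoramChannel:
    """Hypothetical stand-in for the FIFOs between control thread and user logic."""

    def __init__(self, user_logic):
        self.user_logic = user_logic
        self.replies = deque()

    def write(self, value):
        # In hardware the user logic runs concurrently; here it is just called.
        self.replies.append(self.user_logic(value))

    def read(self):
        return self.replies.popleft()


def calc_sum(times, burst=128):
    ram = MockCoramMemory(size=1024)
    # Assumed user logic: sum the words that were just transferred.
    channel = MockCoramChannel(user_logic=lambda _addr: sum(ram.data[:burst]))
    addr = 0
    total = 0
    for _ in range(times):
        ram.write(0, addr, burst)      # transfer (off-chip DRAM to on-chip RAM)
        channel.write(addr)            # notify the user logic
        total += channel.read()        # wait for its reply
        addr += burst * WORD_BYTES     # advance by one burst, in bytes
    return total
```

With these assumptions, calc_sum(8) walks 8 bursts of 128 words and returns the sum of all 1024 words in the mock DRAM.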
PyCoRAM Microarchitecture (continued)
[Figure: the same block diagram packaged as a PyCoRAM IP inside the FPGA. The DMACs of CoRAM Memory and CoRAM Stream each have an AXI I/F and connect through the AXI4 Interconnect to the DRAM Controller.]
FPGA Accelerator for PROCON
■ 6-stage MIPS core + UART loader + two accelerators
  ● XPS automatically synthesizes the AXI4 interconnections
  ● Clock frequencies: logic and AXI 100 MHz, DRAM 400 MHz
[Figure: system diagram. Four blocks, each behind a PyCoRAM abstraction, share an AXI4 Interconnect (32-bit, shared bus) to the DRAM Controller: 6-stage MIPS core with L1-D cache (2-way, 32 KB, 64 bytes/line); memory loader with UART; matrix multiplication accelerator; 9-point stencil accelerator.]
FPGA Accelerator for PROCON (continued)
[Figure: the same system diagram annotated with percentages: 9.8%, 4.5%, 0.4%, 2.5%, 28.1%, 22.5%, 6.3%. The extracted text does not preserve which block or metric each value belongs to.]
Matrix-Matrix Multiplication Accelerator
■ Each row of matrices A, B, and C is stored in CoRAM memories
  ● Data movement between on-chip memory and DRAM is managed by PyCoRAM control threads
  ● The pipeline is fully occupied every cycle
  ● Computation is double-buffered against the transfer of matrix B
    • Matrix B is transposed in advance by another CoRAM hardware unit
    • 1/4 of the total memory bandwidth is utilized (about 400 MB/s)
[Figure: computing logic (Verilog HDL) and control thread (Python). Labels: CoRAM Memory 0, 1, and 2; matrices A, B, C; 8-stage multiply pipeline (×) followed by an adder (+) accumulating "sum"; a second adder accumulating "check sum"; Control Logic; CoRAM Channel 0.]
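The reason for transposing matrix B in advance can be illustrated in software: once B is transposed, every element of C is a dot product of one row of A and one row of the transposed B, so both operands are read sequentially. The following is a plain-Python sketch of that data layout only, not the accelerator's Verilog; the function names are hypothetical.

```python
def transpose(m):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]


def matmul_rowwise(a, b):
    """Multiply a by b using only row-wise reads of a and of transposed b."""
    bt = transpose(b)              # done "in advance", as on the slide
    c = []
    for a_row in a:                # one row of A at a time
        c_row = []
        for bt_row in bt:          # one row of transposed B (a column of B)
            acc = 0
            for x, y in zip(a_row, bt_row):
                acc += x * y       # multiply, then accumulate into the sum
            c_row.append(acc)
        c.append(c_row)
    return c
```

For example, matmul_rowwise([[1, 2], [3, 4]], [[5, 6], [7, 8]]) gives [[19, 22], [43, 50]].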
Stencil Computation Accelerator
■ CoRAM memories hold 3 source arrays and 1 result array
  ● Data movement between on-chip memory and DRAM is managed by PyCoRAM control threads
  ● The pipeline consumes 3 data points every cycle
    • Result = (sum of the input data from the latest 3 cycles) / 9
  ● The result is written back, then the next array is read
    • 1/12 of the total memory bandwidth is utilized (about 130 MB/s)
[Figure: computing logic (Verilog HDL) and control thread (Python). Labels: CoRAM Memory 0, 1, 2, and 3; inputs d0, d1, d2; output rslt; 41-stage add-divide pipeline (+, /); a second adder accumulating "check sum"; Control Logic; CoRAM Channel 0.]
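The "sum of the latest 3 cycles divided by 9" rule can be sketched in software as follows. This is an illustrative model, not the 41-stage hardware pipeline: the function name is hypothetical, boundary points are simply skipped, and the division is assumed to be floor division on non-negative integers.

```python
def stencil_row(d0, d1, d2):
    """Compute one output row of a 9-point stencil from three source rows.

    Each loop iteration models one cycle: it consumes one point from each
    of the three rows, keeps the column sums of the latest 3 cycles, and
    emits (sum of those 9 inputs) // 9 once 3 cycles have been seen.
    """
    assert len(d0) == len(d1) == len(d2)
    window = []                     # column sums from the latest 3 cycles
    out = []
    for a, b, c in zip(d0, d1, d2):
        window.append(a + b + c)    # 3 points consumed this cycle
        if len(window) > 3:
            window.pop(0)           # forget the cycle that left the window
        if len(window) == 3:
            out.append(sum(window) // 9)
    return out
```

For three rows of length n this yields n - 2 interior results, one per cycle after the first two.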
L1 Data Cache for MIPS-core
■ CoRAM memory as data memory
  ● Data replacement is managed by the control thread (CT)
    • When a cache miss occurs, a handling request is issued to the CT
[Figure: cache logic (Verilog HDL) and control thread (Python). Labels: CoRAM Memory 0, 1 (data D0, D1); Tag0 and Tag1 with comparators (=); Select; MUX; registers (reg); Control Logic; CoRAM Channel 0; signals Addr, Write Data, Read Data, Write Enable, Read Enable, Stall.]
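The cache parameters given on the system diagram (2-way, 32 KB, 64 bytes/line) determine how an address splits into tag, set index, and line offset. The sketch below works through that arithmetic. It assumes the conventional tag/index/offset layout on byte addresses; the slides do not state the actual bit assignment, and split_address is a hypothetical helper.

```python
CACHE_BYTES = 32 * 1024     # total capacity, from the slides
WAYS = 2                    # 2-way set associative, from the slides
LINE_BYTES = 64             # bytes per line, from the slides

NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)   # 256 sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1       # 6 bits select a byte in a line
INDEX_BITS = NUM_SETS.bit_length() - 1          # 8 bits select a set


def split_address(addr):
    """Split a byte address into (tag, set index, line offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```

On a lookup, the tags stored in both ways of the selected set would be compared with the address tag, matching the two comparators in the diagram; a mismatch in both is the miss that gets handed to the control thread.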
Evaluation
■ Evaluation targets
  ● Reference design provided by the contest committee (Ref)
  ● 6-stage MIPS core + L1 cache (6-stage)
  ● 6-stage MIPS core + L1 cache + accelerators (6-stage+ACC)
■ Application dataset
  ● Dataset provided for the first-round match
■ FPGA EDA tools
  ● Xilinx Platform Studio 14.6, PlanAhead 14.6
    • Optimization goal: Speed; optimization effort: High
    • AXI4 Interconnect: 32-bit shared bus (area optimized)
■ Compiler for the MIPS core
  ● gcc 4.3.3 (-O3)
Performance
■ Metric: execution time (not including data transfer time)
■ Drastic speedup compared to the reference design
  ● The 6-stage MIPS core with L1 cache (6-stage) is 3.5 times faster on average
  ● With the accelerators added (6-stage+ACC), it is 13.2 times faster on average (geometric mean) and up to 47.1 times faster
[Chart: relative performance compared to Ref]
Application   6-stage   6-stage+ACC
310_sort      3.9       3.9
320_mm        1.4       35.2
330_stencil   5.9       47.1
340_spath     4.7       4.7
Gmean         3.5       13.2

[Chart: execution time in seconds]
Application   Ref    6-stage   6-stage+ACC
310_sort      14.2   3.6       3.6
320_mm        14.2   9.8       0.4
330_stencil   16.0   2.7       0.3
340_spath     20.8   4.4       4.4
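The "Gmean" column is the geometric mean of the four per-application speedups. A short check, using the rounded relative-performance values read from the chart (the variable names are illustrative):

```python
import math


def gmean(values):
    """Geometric mean: the n-th root of the product of n values."""
    return math.prod(values) ** (1.0 / len(values))


# Relative performance for 310_sort, 320_mm, 330_stencil, 340_spath.
six_stage = [3.9, 1.4, 5.9, 4.7]
six_stage_acc = [3.9, 35.2, 47.1, 4.7]

print(round(gmean(six_stage), 1))      # 3.5
print(round(gmean(six_stage_acc), 1))  # 13.2
```

The geometric mean is the usual choice for averaging ratios such as speedups, since one very large value (47.1 here) does not dominate it the way it would dominate an arithmetic mean.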
Conclusion
■ From the IPSJ SIG-ARC High-Performance Processor Design Contest
■ Development of a heterogeneous FPGA-based accelerator with PyCoRAM
  ● Heterogeneous system of a MIPS core and two accelerators
  ● Up to 47.1 times faster than the reference design
■ The tool-chain and framework are available on GitHub
  ● PyCoRAM: http://shtaxxx.github.io/PyCoRAM/
  ● Pyverilog: http://shtaxxx.github.io/Pyverilog/

A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner Up Award at Digilent Design Contest 2014 Japan Region)
