SlideShare ist ein Scribd-Unternehmen logo
1 von 27
Automatic generation of platform architectures
using OpenCL
FPGA roadmap
Department of Electrical and
Computer Engineering
University of Thessaly
Volos, Greece
Nikolaos Bellas
What is an FPGA?
• Field Programmable Gate Array (FPGA) is the best
known example of Reconfigurable Logic
• Hardware can be modified post chip fabrication
• Tailor the Hardware to the application
– Fixed logic processors (CPUs/GPUs) only modify
their software (via programming)
• FPGAs can offer superior performance,
performance/power, or performance/cost compared to
CPUs and GPUs.
2
FPGA architecture
• A generic island-style
FPGA fabric
• Configurable Logic Blocks
(CLB) and Programmable
Switch Matrices (PSM)
• Bitstream configures
functionality of each CLB
and interconnection
between logic blocks
3
The Xilinx Slice
• Xilinx slice features
– LUTs
– MUXF5, MUXF6,
MUXF7, MUXF8
(only the F5 and
F6 MUX are shown
in this diagram)
– Carry Logic
– MULT_ANDs
– Sequential Elements
•Detailed Structure
LUTLUT
Example 2-input LUT
• Lookup table: a b out
0 0
0 1
1 0
1 1
a
b
out
0
0
0
1
0 0 0 1
1
0
0
1
1 0 0 1
5
configuration
input
Modern FPGA architecture
Xilinx Virtex family
6
•Columns of on-chips SRAMs, hard IP cores (PPC 405), and
•DSP slices (Multiply-Accumulate) units
FPGA discussion
• Advantages
– Potential for (near) optimal performance for a given
application
– Various forms of parallelisms can be exploited
• Disadvantages
– Programmable mainly at the hardware level using
Hardware Description Languages (BUT, this can
change)
– Lower clock frequency (200-300 MHz) compared to
CPUs (~ 3GHz) and GPUs (~1.5 GHz)
7
MATENVMED
Silicon OpenCL: Automatic
generation of platform
architectures using OpenCL
8
18-19/7/2013 MATENVMED Plenary Meeting
Introduction
• Automatic generation of hardware at the
research forefront in the last 10 years.
• Variety of High Level Programming Models:
C/C++, C-like Languages, MATLAB
• Obstacles:
– Parallelism Extraction for larger applications
– Extensive Compiler Transformations & Optimizations
• Parallel Programming Models to the Rescue:
– CUDA, OpenCL.
9
18-19/7/2013 MATENVMED Plenary Meeting
Motivation
• Parallel programming models are for
reconfigurable platforms.
• A major shift of Computing industry toward
many-core computing systems.
• Reconfigurable fabrics bear a strong
resemblance to many core systems.
10
18-19/7/2013 MATENVMED Plenary Meeting
Vision
• Provide the tools and methodology to enable the large
pool of software developers and domain experts, who do
not necessarily have expertise on hardware design, to
architect whole accelerator-based systems
– Borrowed from advances in massively parallel programming
models
11
FPGA
PCIexpress
GPU
CPU
PCIexpress
18-19/7/2013 MATENVMED Plenary Meeting
Silicon OpenCL
• Silicon-OpenCL
“SOpenCL”.
• A tool flow to convert
an unmodified
OpenCL application
into a SoC design with
HW/SW components.
18-19/7/2013
Contribution
• Architectural Synthesis methodology:
– Code Transformations.
– Architectural Template.
13MATENVMED Plenary Meeting
18-19/7/2013 MATENVMED Plenary Meeting
OpenCL for Heterogeneous Systems
• OpenCL (Open Computing Language) : A unified programming
model aims at letting a programmer write a portable program once
and deploy it on any heterogeneous system with CPUs and GPUs.
• Became an important industry standard after release due to
substantial industry support.
14
18-19/7/2013 MATENVMED Plenary Meeting
OpenCL Platform Model
One host and one or more Compute Devices (CD)
Each CD consists of one or more Compute Units (CU)
Each CU is further divided into one or more Processing Elements (PE)
15
Main
Program
Computations
Kernels
18-19/7/2013 MATENVMED Plenary Meeting
OpenCL Kernel Execution Geometry
• OpenCL defines a geometric partitioning of grid of computations
• Grid consists of N dimensional space of work-groups
• Each work-group consists of N dimensional space of work-items.
work-group
grid work-item
16
18-19/7/2013 MATENVMED Plenary Meeting
OpenCL Simple Example
__kernel void vadd(
__global int* a,
__global int* b,
__global int* c) {
int idx= get_global_id(0);
c[idx] = a[idx] + b[idx];
}
• OpenCL kernel describes the computation of a work-
item
• Finest parallelism granularity
• e.g. add two integer vectors (N=1)
void add(int* a,
int* b,
int* c) {
for (int idx=0; idx<sizeof(a); idx++)
c[idx] = a[idx] + b[idx];
}
C code OpenCL kernel code
Run-time call
Used to differentiate execution
for each work-item
17
18-19/7/2013 MATENVMED Plenary Meeting
Why OpenCL as an HDL?
• OpenCL exposes parallelism at the finest
granularity
– Allows easy hardware generation at different levels of
granularity
– One accelerator per work-item, one accelerator per work-
group, one accelerator per multiple work-groups, etc.
• OpenCL exposes data communication
– Critical to transfer and stage data across platforms
• We target unmodified OpenCL to enable
hardware design to software engineers
– No need for hardware/architectural expertise
18
18-19/7/2013 MATENVMED Plenary Meeting
SOpenCL Tool Flow
Granularity Management
work-group
FPGA
Optimal thread granularity depends on hardware platform
GPU
CPU
We select a hardware accelerator to process one work-group per
invocation. Smaller invocation overhead18-19/7/2013 MATENVMED Plenary Meeting
18-19/7/2013 MATENVMED Plenary Meeting
Granularity Coarsening
Work-item
thread
Work-group
thread
18-19/7/2013 MATENVMED Plenary Meeting
Serialization of Work Items
__kernel void vadd(…) {
int idx = get_global_id(0);
c[idx] = a[idx] + b[idx];
}
__kernel void Vadd(…) {
int idx;
for( i = 0; i < get_local_size(2); i++)
for( j = 0; j < get_local_size(1); j++)
for( k = 0; k < get_local_size(0); k++)
{
idx = get_item_gid(0);
c[idx] = a[idx] + b[idx];
}
}
OpenCL
code
C code
22
idx = (global_id2(0) + i) * Grid_Width * Grid_Height +
(global_id1(0) + j) * Grid_Width +
(global_id0(0) + k);
18-19/7/2013 MATENVMED Plenary Meeting
Architectural Synthesis
• Exploit available parallelism and application specific
features.
• Apply a series of transformations to generate customized
hardware accelerators.
23
• Uses LLVM Compiler Infrastructure.
• Generate synthesizable Verilog & Test bench.
Feed Data
in Order
Write Data
in Order
•FU types,
•Bitwidths,
•I/O Bandwidth
2407/27/13
Verilog Generation: PE Architecture
•Predication
•Code
•slicing
•SMS mod
•scheduling
•Verilog
•generation
MATENVMED Kickc Off Meeting
Roadmap for FPGA
implementation
FPGA Implementation
• Our plan is to use the same code base
(e.g. OpenCL) to explore different
architectures
– OpenCL used for multicore CPU, GPU, FPGA
(SOpenCL)
• Fast exploration based on area,
performance and power requirements
18-19/7/2013 MATENVMED Plenary Meeting
FPGA Implementation
• Monte-Carlo simulations can exploit multi-level
parallelism of FPGAs
– Multiple MC simulations per point
– Multiple points simultaneously
– Double precision Trigonometric, Log, Additions,
Multiplications functions for each walk
– FP operations with double precision are not FPGAs
strong point, but still SOpenCL can handle it.
18-19/7/2013 MATENVMED Plenary Meeting

Weitere ähnliche Inhalte

Was ist angesagt?

Design of LDPC Decoder Based On FPGA in Digital Image Watermarking Technology
Design of LDPC Decoder Based On FPGA in Digital Image Watermarking TechnologyDesign of LDPC Decoder Based On FPGA in Digital Image Watermarking Technology
Design of LDPC Decoder Based On FPGA in Digital Image Watermarking TechnologyTELKOMNIKA JOURNAL
 
FPGA Intro
FPGA IntroFPGA Intro
FPGA Intronaito88
 
Implementation of Soft-core processor on FPGA (Final Presentation)
Implementation of Soft-core processor on FPGA (Final Presentation)Implementation of Soft-core processor on FPGA (Final Presentation)
Implementation of Soft-core processor on FPGA (Final Presentation)Deepak Kumar
 
Synthesizing HDL using LeonardoSpectrum
Synthesizing HDL using LeonardoSpectrumSynthesizing HDL using LeonardoSpectrum
Synthesizing HDL using LeonardoSpectrumHossam Hassan
 
Introduction to fpga synthesis tools
Introduction to fpga synthesis toolsIntroduction to fpga synthesis tools
Introduction to fpga synthesis toolsHossam Hassan
 
Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners Ravi Sony
 
Implementation of secure rfid on fpga
Implementation of secure rfid on fpgaImplementation of secure rfid on fpga
Implementation of secure rfid on fpgaansh1692
 
Synopsys Fusion Compiler-Comprehensive RTL-to-GDSII Implementation System
Synopsys Fusion Compiler-Comprehensive RTL-to-GDSII Implementation SystemSynopsys Fusion Compiler-Comprehensive RTL-to-GDSII Implementation System
Synopsys Fusion Compiler-Comprehensive RTL-to-GDSII Implementation SystemMostafa Khamis
 
Cloud, Distributed, Embedded: Erlang in the Heterogeneous Computing World
Cloud, Distributed, Embedded: Erlang in the Heterogeneous Computing WorldCloud, Distributed, Embedded: Erlang in the Heterogeneous Computing World
Cloud, Distributed, Embedded: Erlang in the Heterogeneous Computing WorldOmer Kilic
 
4.FPGA for dummies: Design Flow
4.FPGA for dummies: Design Flow4.FPGA for dummies: Design Flow
4.FPGA for dummies: Design FlowMaurizio Donna
 
Programmable logic device (PLD)
Programmable logic device (PLD)Programmable logic device (PLD)
Programmable logic device (PLD)Sɐɐp ɐɥɯǝp
 
Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloDeep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloLinaro
 

Was ist angesagt? (20)

Microblaze
MicroblazeMicroblaze
Microblaze
 
Asic vs fpga
Asic vs fpgaAsic vs fpga
Asic vs fpga
 
Design of LDPC Decoder Based On FPGA in Digital Image Watermarking Technology
Design of LDPC Decoder Based On FPGA in Digital Image Watermarking TechnologyDesign of LDPC Decoder Based On FPGA in Digital Image Watermarking Technology
Design of LDPC Decoder Based On FPGA in Digital Image Watermarking Technology
 
FPGA Intro
FPGA IntroFPGA Intro
FPGA Intro
 
Implementation of Soft-core processor on FPGA (Final Presentation)
Implementation of Soft-core processor on FPGA (Final Presentation)Implementation of Soft-core processor on FPGA (Final Presentation)
Implementation of Soft-core processor on FPGA (Final Presentation)
 
Synthesizing HDL using LeonardoSpectrum
Synthesizing HDL using LeonardoSpectrumSynthesizing HDL using LeonardoSpectrum
Synthesizing HDL using LeonardoSpectrum
 
Introduction to fpga synthesis tools
Introduction to fpga synthesis toolsIntroduction to fpga synthesis tools
Introduction to fpga synthesis tools
 
FPGA In a Nutshell
FPGA In a NutshellFPGA In a Nutshell
FPGA In a Nutshell
 
Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners
 
FPGA Introduction
FPGA IntroductionFPGA Introduction
FPGA Introduction
 
SoC FPGA Technology
SoC FPGA TechnologySoC FPGA Technology
SoC FPGA Technology
 
FPGA @ UPB-BGA
FPGA @ UPB-BGAFPGA @ UPB-BGA
FPGA @ UPB-BGA
 
Implementation of secure rfid on fpga
Implementation of secure rfid on fpgaImplementation of secure rfid on fpga
Implementation of secure rfid on fpga
 
Synopsys Fusion Compiler-Comprehensive RTL-to-GDSII Implementation System
Synopsys Fusion Compiler-Comprehensive RTL-to-GDSII Implementation SystemSynopsys Fusion Compiler-Comprehensive RTL-to-GDSII Implementation System
Synopsys Fusion Compiler-Comprehensive RTL-to-GDSII Implementation System
 
Cloud, Distributed, Embedded: Erlang in the Heterogeneous Computing World
Cloud, Distributed, Embedded: Erlang in the Heterogeneous Computing WorldCloud, Distributed, Embedded: Erlang in the Heterogeneous Computing World
Cloud, Distributed, Embedded: Erlang in the Heterogeneous Computing World
 
4.FPGA for dummies: Design Flow
4.FPGA for dummies: Design Flow4.FPGA for dummies: Design Flow
4.FPGA for dummies: Design Flow
 
Programmable logic device (PLD)
Programmable logic device (PLD)Programmable logic device (PLD)
Programmable logic device (PLD)
 
Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloDeep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
 
Fpga design flow
Fpga design flowFpga design flow
Fpga design flow
 
Digital Design Flow
Digital Design FlowDigital Design Flow
Digital Design Flow
 

Andere mochten auch

Parallel iterative solution of the hermite collocation equations on gpus
Parallel iterative solution of the hermite collocation equations on gpusParallel iterative solution of the hermite collocation equations on gpus
Parallel iterative solution of the hermite collocation equations on gpusManolis Vavalis
 
Optimization techniques for a model problem of saltwater intrusion in coastal...
Optimization techniques for a model problem of saltwater intrusion in coastal...Optimization techniques for a model problem of saltwater intrusion in coastal...
Optimization techniques for a model problem of saltwater intrusion in coastal...Manolis Vavalis
 
2η διάλεξη Γραμμικής άλγεβρας
2η διάλεξη Γραμμικής άλγεβρας2η διάλεξη Γραμμικής άλγεβρας
2η διάλεξη Γραμμικής άλγεβραςManolis Vavalis
 
4η διάλεξη Τεχνολογίες Παγκόσμιου Ιστού
4η διάλεξη Τεχνολογίες Παγκόσμιου Ιστού4η διάλεξη Τεχνολογίες Παγκόσμιου Ιστού
4η διάλεξη Τεχνολογίες Παγκόσμιου ΙστούManolis Vavalis
 
Ch5 beeing an application
Ch5   beeing an applicationCh5   beeing an application
Ch5 beeing an applicationManolis Vavalis
 
Ch. 9 jsp standard tag library
Ch. 9 jsp standard tag libraryCh. 9 jsp standard tag library
Ch. 9 jsp standard tag libraryManolis Vavalis
 
Ch6 conversational state
Ch6   conversational stateCh6   conversational state
Ch6 conversational stateManolis Vavalis
 
Απαλοιφή του Γκάους
Απαλοιφή του ΓκάουςΑπαλοιφή του Γκάους
Απαλοιφή του ΓκάουςManolis Vavalis
 
9η διάλεξη - Πράξεις με πίνακες
9η διάλεξη - Πράξεις με πίνακες9η διάλεξη - Πράξεις με πίνακες
9η διάλεξη - Πράξεις με πίνακεςManolis Vavalis
 
3η διάλεξη Γραμμικής Άλγεβρας
3η διάλεξη Γραμμικής Άλγεβρας3η διάλεξη Γραμμικής Άλγεβρας
3η διάλεξη Γραμμικής ΆλγεβραςManolis Vavalis
 
13η Διάλεξη - Ανάλυση LU (συνέχεια)
13η Διάλεξη - Ανάλυση LU (συνέχεια)13η Διάλεξη - Ανάλυση LU (συνέχεια)
13η Διάλεξη - Ανάλυση LU (συνέχεια)Manolis Vavalis
 
10η διάλεξη - Πράξεις με πίνακες
10η διάλεξη - Πράξεις με πίνακες10η διάλεξη - Πράξεις με πίνακες
10η διάλεξη - Πράξεις με πίνακεςManolis Vavalis
 
ο δήμος βόλου στον παγκόσμιο ιστό
ο δήμος βόλου στον παγκόσμιο ιστόο δήμος βόλου στον παγκόσμιο ιστό
ο δήμος βόλου στον παγκόσμιο ιστόManolis Vavalis
 
21η Διάλεξη - Βάση και διάσταση θεμελιωδών χώρων
21η Διάλεξη - Βάση και διάσταση θεμελιωδών χώρων21η Διάλεξη - Βάση και διάσταση θεμελιωδών χώρων
21η Διάλεξη - Βάση και διάσταση θεμελιωδών χώρωνManolis Vavalis
 
2η διάλεξη - Εισαγωγή
2η διάλεξη - Εισαγωγή2η διάλεξη - Εισαγωγή
2η διάλεξη - ΕισαγωγήManolis Vavalis
 

Andere mochten auch (19)

Parallel iterative solution of the hermite collocation equations on gpus
Parallel iterative solution of the hermite collocation equations on gpusParallel iterative solution of the hermite collocation equations on gpus
Parallel iterative solution of the hermite collocation equations on gpus
 
Position Statements
Position StatementsPosition Statements
Position Statements
 
Optimization techniques for a model problem of saltwater intrusion in coastal...
Optimization techniques for a model problem of saltwater intrusion in coastal...Optimization techniques for a model problem of saltwater intrusion in coastal...
Optimization techniques for a model problem of saltwater intrusion in coastal...
 
2η διάλεξη Γραμμικής άλγεβρας
2η διάλεξη Γραμμικής άλγεβρας2η διάλεξη Γραμμικής άλγεβρας
2η διάλεξη Γραμμικής άλγεβρας
 
4η διάλεξη Τεχνολογίες Παγκόσμιου Ιστού
4η διάλεξη Τεχνολογίες Παγκόσμιου Ιστού4η διάλεξη Τεχνολογίες Παγκόσμιου Ιστού
4η διάλεξη Τεχνολογίες Παγκόσμιου Ιστού
 
Ch. 11 deploying
Ch. 11 deployingCh. 11 deploying
Ch. 11 deploying
 
Ch5 beeing an application
Ch5   beeing an applicationCh5   beeing an application
Ch5 beeing an application
 
Ch. 9 jsp standard tag library
Ch. 9 jsp standard tag libraryCh. 9 jsp standard tag library
Ch. 9 jsp standard tag library
 
Ch6 conversational state
Ch6   conversational stateCh6   conversational state
Ch6 conversational state
 
Απαλοιφή του Γκάους
Απαλοιφή του ΓκάουςΑπαλοιφή του Γκάους
Απαλοιφή του Γκάους
 
9η διάλεξη - Πράξεις με πίνακες
9η διάλεξη - Πράξεις με πίνακες9η διάλεξη - Πράξεις με πίνακες
9η διάλεξη - Πράξεις με πίνακες
 
aaarrr
aaarrraaarrr
aaarrr
 
3η διάλεξη Γραμμικής Άλγεβρας
3η διάλεξη Γραμμικής Άλγεβρας3η διάλεξη Γραμμικής Άλγεβρας
3η διάλεξη Γραμμικής Άλγεβρας
 
13η Διάλεξη - Ανάλυση LU (συνέχεια)
13η Διάλεξη - Ανάλυση LU (συνέχεια)13η Διάλεξη - Ανάλυση LU (συνέχεια)
13η Διάλεξη - Ανάλυση LU (συνέχεια)
 
Larissa@28 02 2013
Larissa@28 02 2013Larissa@28 02 2013
Larissa@28 02 2013
 
10η διάλεξη - Πράξεις με πίνακες
10η διάλεξη - Πράξεις με πίνακες10η διάλεξη - Πράξεις με πίνακες
10η διάλεξη - Πράξεις με πίνακες
 
ο δήμος βόλου στον παγκόσμιο ιστό
ο δήμος βόλου στον παγκόσμιο ιστόο δήμος βόλου στον παγκόσμιο ιστό
ο δήμος βόλου στον παγκόσμιο ιστό
 
21η Διάλεξη - Βάση και διάσταση θεμελιωδών χώρων
21η Διάλεξη - Βάση και διάσταση θεμελιωδών χώρων21η Διάλεξη - Βάση και διάσταση θεμελιωδών χώρων
21η Διάλεξη - Βάση και διάσταση θεμελιωδών χώρων
 
2η διάλεξη - Εισαγωγή
2η διάλεξη - Εισαγωγή2η διάλεξη - Εισαγωγή
2η διάλεξη - Εισαγωγή
 

Ähnlich wie Automatic generation of FPGA architectures using OpenCL

L12_PROGRAMMABLE+LOGIC+DEVICES+(PLD).ppt
L12_PROGRAMMABLE+LOGIC+DEVICES+(PLD).pptL12_PROGRAMMABLE+LOGIC+DEVICES+(PLD).ppt
L12_PROGRAMMABLE+LOGIC+DEVICES+(PLD).pptMikeTango5
 
L12 programmable+logic+devices+(pld)
L12 programmable+logic+devices+(pld)L12 programmable+logic+devices+(pld)
L12 programmable+logic+devices+(pld)NAGASAI547
 
Cpld and fpga mod vi
Cpld and fpga   mod viCpld and fpga   mod vi
Cpld and fpga mod viAgi George
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLinaro
 
programmable_devices_en_02_2014
programmable_devices_en_02_2014programmable_devices_en_02_2014
programmable_devices_en_02_2014Svetozar Jovanovic
 
Fundamentals of FPGA
Fundamentals of FPGAFundamentals of FPGA
Fundamentals of FPGAvelamakuri
 
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...Cesar Maciel
 
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...Pradeep Singh
 
FPGA in outer space seminar report
FPGA in outer space seminar reportFPGA in outer space seminar report
FPGA in outer space seminar reportrahul kumar verma
 
GTC15-Manoj-Roge-OpenPOWER
GTC15-Manoj-Roge-OpenPOWERGTC15-Manoj-Roge-OpenPOWER
GTC15-Manoj-Roge-OpenPOWERAchronix
 
Mirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP LibraryMirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP LibraryDeepak Shankar
 
ProjectVault[VivekKumar_CS-C_6Sem_MIT].pptx
ProjectVault[VivekKumar_CS-C_6Sem_MIT].pptxProjectVault[VivekKumar_CS-C_6Sem_MIT].pptx
ProjectVault[VivekKumar_CS-C_6Sem_MIT].pptxVivek Kumar
 
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)byteLAKE
 
Basic Design Flow for Field Programmable Gate Arrays
Basic Design Flow for Field Programmable Gate ArraysBasic Design Flow for Field Programmable Gate Arrays
Basic Design Flow for Field Programmable Gate ArraysUsha Mehta
 
Reconfigurable ICs
Reconfigurable ICsReconfigurable ICs
Reconfigurable ICsAnish Goel
 
A Review of FPGA-based design methodologies for efficient hardware Area estim...
A Review of FPGA-based design methodologies for efficient hardware Area estim...A Review of FPGA-based design methodologies for efficient hardware Area estim...
A Review of FPGA-based design methodologies for efficient hardware Area estim...IOSR Journals
 
FPGAs : An Overview
FPGAs : An OverviewFPGAs : An Overview
FPGAs : An OverviewSanjiv Malik
 

Ähnlich wie Automatic generation of FPGA architectures using OpenCL (20)

L12_PROGRAMMABLE+LOGIC+DEVICES+(PLD).ppt
L12_PROGRAMMABLE+LOGIC+DEVICES+(PLD).pptL12_PROGRAMMABLE+LOGIC+DEVICES+(PLD).ppt
L12_PROGRAMMABLE+LOGIC+DEVICES+(PLD).ppt
 
L12 programmable+logic+devices+(pld)
L12 programmable+logic+devices+(pld)L12 programmable+logic+devices+(pld)
L12 programmable+logic+devices+(pld)
 
Cpld and fpga mod vi
Cpld and fpga   mod viCpld and fpga   mod vi
Cpld and fpga mod vi
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience Report
 
programmable_devices_en_02_2014
programmable_devices_en_02_2014programmable_devices_en_02_2014
programmable_devices_en_02_2014
 
Fundamentals of FPGA
Fundamentals of FPGAFundamentals of FPGA
Fundamentals of FPGA
 
Introduction to EDA Tools
Introduction to EDA ToolsIntroduction to EDA Tools
Introduction to EDA Tools
 
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...
Heterogeneous Computing on POWER - IBM and OpenPOWER technologies to accelera...
 
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...
Development of Signal Processing Algorithms using OpenCL for FPGA based Archi...
 
FPGA in outer space seminar report
FPGA in outer space seminar reportFPGA in outer space seminar report
FPGA in outer space seminar report
 
GTC15-Manoj-Roge-OpenPOWER
GTC15-Manoj-Roge-OpenPOWERGTC15-Manoj-Roge-OpenPOWER
GTC15-Manoj-Roge-OpenPOWER
 
14 284-291
14 284-29114 284-291
14 284-291
 
Mirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP LibraryMirabilis_Design AMD Versal System-Level IP Library
Mirabilis_Design AMD Versal System-Level IP Library
 
ProjectVault[VivekKumar_CS-C_6Sem_MIT].pptx
ProjectVault[VivekKumar_CS-C_6Sem_MIT].pptxProjectVault[VivekKumar_CS-C_6Sem_MIT].pptx
ProjectVault[VivekKumar_CS-C_6Sem_MIT].pptx
 
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
 
Basic Design Flow for Field Programmable Gate Arrays
Basic Design Flow for Field Programmable Gate ArraysBasic Design Flow for Field Programmable Gate Arrays
Basic Design Flow for Field Programmable Gate Arrays
 
Reconfigurable ICs
Reconfigurable ICsReconfigurable ICs
Reconfigurable ICs
 
A Review of FPGA-based design methodologies for efficient hardware Area estim...
A Review of FPGA-based design methodologies for efficient hardware Area estim...A Review of FPGA-based design methodologies for efficient hardware Area estim...
A Review of FPGA-based design methodologies for efficient hardware Area estim...
 
FPGAs memory synchronization and performance evaluation using the open compu...
FPGAs memory synchronization and performance evaluation  using the open compu...FPGAs memory synchronization and performance evaluation  using the open compu...
FPGAs memory synchronization and performance evaluation using the open compu...
 
FPGAs : An Overview
FPGAs : An OverviewFPGAs : An Overview
FPGAs : An Overview
 

Kürzlich hochgeladen

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 

Kürzlich hochgeladen (20)

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

Automatic generation of FPGA architectures using OpenCL

  • 1. Automatic generation of platform architectures using OpenCL FPGA roadmap Department of Electrical and Computer Engineering University of Thessaly Volos, Greece Nikolaos Bellas
  • 2. What is an FPGA? • Field Programmable Gate Array (FPGA) is the best known example of Reconfigurable Logic • Hardware can be modified post chip fabrication • Tailor the Hardware to the application – Fixed logic processors (CPUs/GPUs) only modify their software (via programming) • FPGAs can offer superior performance, performance/power, or performance/cost compared to CPUs and GPUs. 2
  • 3. FPGA architecture • A generic island-style FPGA fabric • Configurable Logic Blocks (CLB) and Programmable Switch Matrices (PSM) • Bitstream configures functionality of each CLB and interconnection between logic blocks 3
  • 4. The Xilinx Slice • Xilinx slice features – LUTs – MUXF5, MUXF6, MUXF7, MUXF8 (only the F5 and F6 MUX are shown in this diagram) – Carry Logic – MULT_ANDs – Sequential Elements •Detailed Structure
  • 5. LUTLUT Example 2-input LUT • Lookup table: a b out 0 0 0 1 1 0 1 1 a b out 0 0 0 1 0 0 0 1 1 0 0 1 1 0 0 1 5 configuration input
  • 6. Modern FPGA architecture Xilinx Virtex family 6 •Columns of on-chips SRAMs, hard IP cores (PPC 405), and •DSP slices (Multiply-Accumulate) units
  • 7. FPGA discussion • Advantages – Potential for (near) optimal performance for a given application – Various forms of parallelisms can be exploited • Disadvantages – Programmable mainly at the hardware level using Hardware Description Languages (BUT, this can change) – Lower clock frequency (200-300 MHz) compared to CPUs (~ 3GHz) and GPUs (~1.5 GHz) 7
  • 8. MATENVMED Silicon OpenCL: Automatic generation of platform architectures using OpenCL 8
  • 9. 18-19/7/2013 MATENVMED Plenary Meeting Introduction • Automatic generation of hardware at the research forefront in the last 10 years. • Variety of High Level Programming Models: C/C++, C-like Languages, MATLAB • Obstacles: – Parallelism Extraction for larger applications – Extensive Compiler Transformations & Optimizations • Parallel Programming Models to the Rescue: – CUDA, OpenCL. 9
  • 10. 18-19/7/2013 MATENVMED Plenary Meeting Motivation • Parallel programming models are for reconfigurable platforms. • A major shift of Computing industry toward many-core computing systems. • Reconfigurable fabrics bear a strong resemblance to many core systems. 10
  • 11. 18-19/7/2013 MATENVMED Plenary Meeting Vision • Provide the tools and methodology to enable the large pool of software developers and domain experts, who do not necessarily have expertise on hardware design, to architect whole accelerator-based systems – Borrowed from advances in massively parallel programming models 11 FPGA PCIexpress GPU CPU PCIexpress
  • 12. 18-19/7/2013 MATENVMED Plenary Meeting Silicon OpenCL • Silicon-OpenCL “SOpenCL”. • A tool flow to convert an unmodified OpenCL application into a SoC design with HW/SW components.
  • 13. 18-19/7/2013 Contribution • Architectural Synthesis methodology: – Code Transformations. – Architectural Template. 13MATENVMED Plenary Meeting
  • 14. 18-19/7/2013 MATENVMED Plenary Meeting OpenCL for Heterogeneous Systems • OpenCL (Open Computing Language) : A unified programming model aims at letting a programmer write a portable program once and deploy it on any heterogeneous system with CPUs and GPUs. • Became an important industry standard after release due to substantial industry support. 14
  • 15. 18-19/7/2013 MATENVMED Plenary Meeting OpenCL Platform Model One host and one or more Compute Devices (CD) Each CD consists of one or more Compute Units (CU) Each CU is further divided into one or more Processing Elements (PE) 15 Main Program Computations Kernels
  • 16. 18-19/7/2013 MATENVMED Plenary Meeting OpenCL Kernel Execution Geometry • OpenCL defines a geometric partitioning of grid of computations • Grid consists of N dimensional space of work-groups • Each work-group consists of N dimensional space of work-items. work-group grid work-item 16
  • 17. 18-19/7/2013 MATENVMED Plenary Meeting OpenCL Simple Example __kernel void vadd( __global int* a, __global int* b, __global int* c) { int idx= get_global_id(0); c[idx] = a[idx] + b[idx]; } • OpenCL kernel describes the computation of a work- item • Finest parallelism granularity • e.g. add two integer vectors (N=1) void add(int* a, int* b, int* c) { for (int idx=0; idx<sizeof(a); idx++) c[idx] = a[idx] + b[idx]; } C code OpenCL kernel code Run-time call Used to differentiate execution for each work-item 17
  • 18. 18-19/7/2013 MATENVMED Plenary Meeting Why OpenCL as an HDL? • OpenCL exposes parallelism at the finest granularity – Allows easy hardware generation at different levels of granularity – One accelerator per work-item, one accelerator per work- group, one accelerator per multiple work-groups, etc. • OpenCL exposes data communication – Critical to transfer and stage data across platforms • We target unmodified OpenCL to enable hardware design to software engineers – No need for hardware/architectural expertise 18
  • 19. 18-19/7/2013 MATENVMED Plenary Meeting SOpenCL Tool Flow
  • 20. Granularity Management work-group FPGA Optimal thread granularity depends on hardware platform GPU CPU We select a hardware accelerator to process one work-group per invocation. Smaller invocation overhead18-19/7/2013 MATENVMED Plenary Meeting
  • 21. 18-19/7/2013 MATENVMED Plenary Meeting Granularity Coarsening Work-item thread Work-group thread
  • 22. 18-19/7/2013 MATENVMED Plenary Meeting Serialization of Work Items __kernel void vadd(…) { int idx = get_global_id(0); c[idx] = a[idx] + b[idx]; } __kernel void Vadd(…) { int idx; for( i = 0; i < get_local_size(2); i++) for( j = 0; j < get_local_size(1); j++) for( k = 0; k < get_local_size(0); k++) { idx = get_item_gid(0); c[idx] = a[idx] + b[idx]; } } OpenCL code C code 22 idx = (global_id2(0) + i) * Grid_Width * Grid_Height + (global_id1(0) + j) * Grid_Width + (global_id0(0) + k);
  • 23. 18-19/7/2013 MATENVMED Plenary Meeting Architectural Synthesis • Exploit available parallelism and application specific features. • Apply a series of transformations to generate customized hardware accelerators. 23 • Uses LLVM Compiler Infrastructure. • Generate synthesizable Verilog & Test bench.
  • 24. Feed Data in Order Write Data in Order •FU types, •Bitwidths, •I/O Bandwidth 2407/27/13 Verilog Generation: PE Architecture •Predication •Code •slicing •SMS mod •scheduling •Verilog •generation MATENVMED Kickc Off Meeting
  • 26. FPGA Implementation • Our plan is to use the same code base (e.g. OpenCL) to explore different architectures – OpenCL used for multicore CPU, GPU, FPGA (SOpenCL) • Fast exploration based on area, performance and power requirements 18-19/7/2013 MATENVMED Plenary Meeting
  • 27. FPGA Implementation • Monte-Carlo simulations can exploit multi-level parallelism of FPGAs – Multiple MC simulations per point – Multiple points simultaneously – Double precision Trigonometric, Log, Additions, Multiplications functions for each walk – FP operations with double precision are not FPGAs strong point, but still SOpenCL can handle it. 18-19/7/2013 MATENVMED Plenary Meeting

Hinweis der Redaktion

  1. My thesis title is using a parallel programming model for architectural synthesis. I will talk about how we used a parallel programming model, particularly OpenCL programming model for architectural synthesis.
  2. The problem of architectural synthesis commonly know as high level synthesis concerns the automatic generation of hardware accelerators systems from high level programming languages. There have been intensive research efforts in the last few years to extend (mainly) sequential programming languages like C/C++, etc. to serve as hardware description languages and replace the likes of Verilog/VHDL. Using sequential languages presents some formidable challenges like (without going into much details): Parallelism extraction which is extremely difficult in sequential languages like C. Previous researches proposed a variety of language extensions and compiler directives to help expose parallelism. Others used restricted set of C. On the other hand parallel programming models like CUDA and OpenCL provides semantics that expose parallelism and data movement, features necessary for successful architectural synthesis
  3. So, there is an obvious lack of parallel programming models that could be used to describe an application and map it on reconfigurable platforms More important, The computing industry is moving toward many core computing systems with unified programming models. Parallel programming model are very suitable for reconfigurable platforms programming because of the strong resemblance between distributed logic cells of reconfigurable fabric and many core architectures.
  4. Our vision is to exploit parallel programming models and develop tools and methodologies that enable software engineers and parallel programmers to build a complete system based on hardware accelerators without the need for hardware design expertise, with a language from their own domain without any changes or modifications.
  5. The work of this research is developed as part of the Silicon OpenCL tool. Silicon OpenCL converts an unmodified OpenCL application into a system on chip with software and hardware Components. The tool flow converts an OpenCL kernel which represents the computational part in OpenCL applications into a hardware accelerator by first transforming the OpenCL kernel into an intermediate C function then the Hardware generation back end developed in this research converts the C function into RTL Verilog describing a hardware accelerator. The backend also generates Testbench for simulation and verification purposes.
  6. In The hardware generation backend we propose a variety of Transformations and Optimizations. Most importantly Code Slicing and Code customization. We propose an architectural template that supports complex and imperfect loop nests. Moreover, the architectural template decouples and overlaps data movement and data computations as we will see shortly. Another important feature of the architecture template, is its support of asynchronous execution model to parallelize the execution multiple components in the accelerator.
  7. OpenCL ( Open Computing Language ) presents a framework for writing programs that execute across heterogeneous platforms consisting of CPUs , GPUs , and other processors. OpenCL 1.0 out late 2008 Vision: write one portable application and execute in any processor or collection of processors. Strong industry support and drivers out for NVIDIA, Intel, AMD/ATI, IBM (Cell) chipsets etc.
  8. OpenCL adapts a generic multicore computing model. The model of OpenCL consists of a host connected to one or more compute devices . A compute device is divided into one or more compute units (CUs) which are further divided into one or more processing elements (PEs). Computations on a device occur within the processing elements. In this platform model, an OpenCL application consists of a host program and computational kernels. The host program runs on the Host initialize compute devices and submits kernels for executions. A computational kernel executes on the processing elements.
  9. When a kernel is submitted for execution by the host, a geometric grid encompassing all computations is defined. Grids consist of work-groups which further decomposed into work items, where a work item is the finest computation granularity and represents the work load of a single kernel. Work-groups are assigned a unique work-group ID, and each work item is assigned a unique ID within the work group. Combining the work group ID and the work item ID produces a unique global ID for each work item. work-item executes the same code but the specific execution pathway through the code and the data operated upon can vary per work-item. Work groups are independent and can run in parallel. The work-items in a given work-group execute concurrently on the compute elements of a single compute unit. It is important to know that it’s the responsibility of the programmer to define the geometry of the grid and work groups. Which means explicitly exposing the parallelism in the application.
  10. Consider a simple example of OpenCL kernel code : add two vectors. Important point: OpenCL code describes the computation of a work-item The C code to the left is equivalent to the OpenCL code to the right. Only the code for one loop iteration is needed (in this case 1 loop iteration = 1 work-item) Notice that the specific data processed depends on the work-item global-id (found by a run-time library call) This is how work-items execute the same code base, but differentiate (at run-time) their dynamic execution path and data structures they process
  11. The questions is: why would a developer use OpenCL as a hardware design language? 1) OpenCL expresses explicitly parallelism at the level of a work-item  Multiple work-items can execute in parallel by default Easier to coarsen parallelism (e.g. at the work-group level) by selecting multiple work-items Therefore, the designer/compiler has freedom to select an accelerator at various levels of granularity: from a single work-item to a work-group to a grid. 2) Exposure of data communication facilitates data movement, staging between accelerators and memory hierarchies. Sequential languages lack semantics for all that. 3) The third reason is that using unmodified OpenCL we open up platform design to the larger pool of software engineers/design experts. No need for hardware/architectural expertise. Ride the wave of parallel programming.
  12. So lets see how Silicon OpenCL do its Job
  13. The first concern we have to deal with is the computations granularity we should assign to a hardware accelerator. On modern GPUs with hundred of light weight cores, a work item executes on a single core as a thread. A Chip Multiprocessor (e.g. from Intel) has fewer, more complex cores  a thread executing in such a core may be a collection of work-items (may be a work-group) In the FPGA case, the jury is still out. In SOpenCL, we coarsen parallelism granularity by invoking an accelerator to execute a work-group. This reduces overhead per invocation. However, this is by no means the only answer to this problem. A lot of work in performance and area estimation is currently done to determine the best granularity for hardware accelerators for each application. The front end of Silicon OpenCL handles the task of coarsening the granularity of an OpenCL kernel and generating a C function represents the workload of a single work-group
  14. The first step is logical thread serialization . Work-items inside a work-group can be executed in any sequence, provided that no synchronization operation is present inside a kernel function. Based on this observation, we serialize the execution of work-items by enclosing the instructions in the body of a kernel function into a triple nested loop , given that the maximum number of dimensions in the abstract index space within a workgroup is three. The nested loop we get has the following property: all iterations can execute in parallel and out of order.
  15. The hardware generator uses LLVM (Low Level Virtual Machine), an open-source compiler infrastructure, to perform a series of optimizations and generate synthesizable Verilog (including the testbench). Note that the decoupling of front- and back-ends allows for Silicon OpenCL to tackle C code, and not only OpenCL. The following slides present the details of the back-end of the compiler
  16. The PE architecture consists of a datapath executes the computation kernel, and a streaming unit that handle address generation for input and output streaming kernels. Also the streaming unit feeds data in order to the data path and write back data in order by allocating separate aligning units. The streaming unit allocates an optional cache that the tool allocates if it detects potentials for temporal and spatial locality. I want to emphasize that this is just a footprint according to which all accelerators are generated. The number, type, bitwidth of functional units are application dependent and are configurable.