Resource to Performance Tradeoff
Adjustment for Fine-Grained Architectures
─A Design Methodology
When implementing computation-intensive algorithms on fine-grained
parallel architectures, adjusting the resource-to-performance
tradeoff is a major challenge. This paper proposes a
methodology for dealing with some of these performance tradeoffs
by adjusting parallelism at different levels. In a case study,
interpolation kernels are implemented on a fine-grained
architecture (FPGA) using a high level language (Mitrion-C).
For both cubic and bi-cubic interpolation, one single-kernel, one
cross-kernel and two multi-kernel parallel implementations are
designed and evaluated. Our results demonstrate that no single
level of parallelism can be used for trade-off adjustment. Instead,
the appropriate degree of parallelism on each level, according to
available resources and the performance requirements of the
application, needs to be found. Basing the design on high-level
programming simplifies the trade-off process. This research is a
step towards automation of the choice of parallelization based on
a combination of parallelism levels.
1. Resource to Performance Tradeoff
Adjustment for Fine-Grained Architectures
─A Design Methodology
Fahad Islam Cheema, Zain-Ul-Abdin,
Professor Bertil Svensson
Halmstad University, Halmstad, Sweden
2. Engr. Fahad Islam Cheema
4-Year Bachelor in Computer Engineering (BCE) from COMSATS Lahore in 2006
2-Year Industrial Experience as Embedded Software/System Engineer in Lahore
and Islamabad
Five Rivers Technologies Lahore
Streaming Networks Islamabad
Delta Indus Systems Lahore
Masters in Computer System Engineering from Halmstad University of Sweden in
2009
Masters standalone thesis accepted for publication in FPGAWorld 2010 international
conference
www.fpgaworld2010.com
Copenhagen, Denmark in September,10
1-Year Academic Experience
Halmstad University Sweden
LUMS Lahore
Bahria University Islamabad
3. Engr. Fahad Islam Cheema
3-Year Experience (Embedded Systems)
2-year Industrial (Streaming Networks)
1-Year Academic
Universities (Halmstad, LUMS, Bahria)
Courses
Linux Programming and shell Scripting, Administration of OS, Databases
Embedded Systems
System Programming
17-Year Education (Computer Engineering)
Masters From Sweden
Computer Engineering from COMSATS
Specialization in Embedded Systems
PEC # Comp/6774
1 Publication
Masters thesis accepted for publication in FPGAWorld 2010
5. Agenda
Overview and Problem Definition
Main Idea
Experimental Setup
Mitrion Parallel Architecture
Interpolation Kernels
Parallelization Levels
Conclusions
Future Work
6. Overview
Motivation
Computation intensive algorithms
Fine grained architectures
Problem Definition
Parallelism
Resource to Performance Tradeoffs
Hardware/logic gates to performance tradeoffs
Memory to performance tradeoffs
10. Main Idea
Parallelism Levels
Bit-Level Parallelism (BLP)
Kernel Level Parallelism (KLP)
Problem Level Parallelism (PLP)
Maximum parallelism at one level is not the ultimate
solution
Customized parallelism at different levels
Can better adjust resource-performance tradeoffs
Gates-performance tradeoff
11. Main Idea (Contd.)
Maximum parallelism at one level is not the ultimate
solution
Combine parallelism at different parallelism levels to
produce parallelization levels
Parallelization Levels
Single Kernel (SKZ)
Cross Kernel (CKZ)
Multi-SKZ
Multi-CKZ
Figure-3: Parallelism and Parallelization
Levels
12. Experimental Setup
Computation intensive algorithms
Interpolation Kernels
Fine Grained Architecture
FPGA
Fine Grained Parallelism
Mitrion virtual processor
Extract fine grained parallelism
Mitrion-C high level language (HLL)
Hardware Platform
Cray XD1 Supercomputer with Virtex-4 FPGA
13. Interpolation Kernels
What is interpolation
Process of calculating new values within the range of
available values [1]
Cubic interpolation
Bi-cubic interpolation
Applying cubic in 2D
5 cubic kernels
Figure-4: 2D Interpolation
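The slides do not give the interpolation formula. As a minimal sketch, assuming Neville's algorithm on four equally spaced samples (the labels P01, P12, P23, p02, p13 and P03 in Figure-8 resemble Neville's intermediate values, but that reading is an inference), a cubic kernel and a bi-cubic kernel built from five cubic kernels could look as follows in Python. The names cubic_kernel and bicubic_kernel and the unit sample spacing are illustrative assumptions and are not taken from the Mitrion-C implementation.

```python
def cubic_kernel(d, t):
    """Cubic interpolation through four samples d[0..3] assumed to sit at
    x = 0, 1, 2, 3, evaluated at position t (Neville's scheme)."""
    d0, d1, d2, d3 = d
    # Stage 1: three linear interpolants, mutually independent.
    p01 = d0 + (t - 0) * (d1 - d0)
    p12 = d1 + (t - 1) * (d2 - d1)
    p23 = d2 + (t - 2) * (d3 - d2)
    # Stage 2: two quadratic interpolants, each needs two stage-1 results.
    p02 = p01 + (t - 0) * (p12 - p01) / 2
    p13 = p12 + (t - 1) * (p23 - p12) / 2
    # Stage 3: the cubic result.
    return p02 + (t - 0) * (p13 - p02) / 3


def bicubic_kernel(grid, tx, ty):
    """Bi-cubic interpolation on a 4x4 grid (grid[y][x]) using 5 cubic kernels."""
    # Four independent cubic kernels, one per row, along x.
    rows = [cubic_kernel(row, tx) for row in grid]
    # Fifth cubic kernel combines the four row results along y.
    return cubic_kernel(rows, ty)
```

For example, cubic_kernel([0, 1, 8, 27], 1.5) reproduces x^3 at x = 1.5, since a cubic through four samples of a cubic polynomial is exact. The four row kernels in bicubic_kernel do not depend on each other, which is what makes them candidates for replication in hardware.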
14. Mitrion Parallel Architecture
Mitrion Virtual Processor (MVP)
Fine-Grained, Soft-Core Processor
Almost 60 IP blocks defined in HDL [2]
Non-von Neumann architecture
Mitrion-C
HLL for FPGA
Data dependence instead of order-of-execution
Parallelism Language Constructs [3]
Pipelining
15. Parallelization Levels
Single Kernel Parallelization (SKZ)
Only kernel level parallelism (KLP)
All data-independent blocks are internally parallel but externally
pipelined
Figure-5: SKZ
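The idea that data-independent blocks can run in parallel while dependent stages are pipelined can be sketched by grouping the operations of a dataflow graph into stages. The Python below is only an illustration: the function name dependency_levels and the graph CUBIC_DEPS are hypothetical, and the graph assumes a Neville-style cubic kernel inferred from the figure labels. It does not model Mitrion-C or the FPGA hardware.

```python
def dependency_levels(deps):
    """Group the nodes of a dataflow graph into stages.

    deps maps each computed node to the nodes it reads. Nodes that never
    appear as keys are treated as external inputs (stage 0). Nodes in the
    same stage have no data dependence on each other.
    """
    level = {}

    def depth(node):
        if node not in deps:          # external input, ready before stage 1
            return 0
        if node not in level:
            level[node] = 1 + max((depth(d) for d in deps[node]), default=0)
        return level[node]

    for node in deps:
        depth(node)

    stages = {}
    for node, lv in level.items():
        stages.setdefault(lv, []).append(node)
    return [sorted(stages[lv]) for lv in sorted(stages)]


# Assumed dataflow of one cubic kernel (hypothetical, Neville-style).
CUBIC_DEPS = {
    "P01": ["d0", "d1"],
    "P12": ["d1", "d2"],
    "P23": ["d2", "d3"],
    "P02": ["P01", "P12"],
    "P13": ["P12", "P23"],
    "P03": ["P02", "P13"],
}
```

Under these assumptions, dependency_levels(CUBIC_DEPS) yields three stages: the three linear blocks, then the two quadratic blocks, then the final cubic block. Blocks within a stage could be instantiated side by side, and the stages form the pipeline.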
16. Parallelization Levels (Contd.)
Cross Kernel Parallelization (CKZ)
Extend the kernel by mixing more than one kernel
Replicate computation-intensive, data-independent blocks
Resource-computation balance
Figure-6: CKZ
18. Parallelization Levels (Contd.)
Multi-CKZ
Replicate kernels that already have CKZ
Figure-8: Multi-CKZ (diagram of replicated CKZ kernels; block labels: Read from Memory, D values d0 to d3, a, b, P01, P12, P23, p02, p13, P03, Write to Memory, Go for next iteration)
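How the four parallelization levels trade resources for performance can be illustrated with a toy cost model. The Python sketch below is not from the paper: it assumes a kernel with one computation-heavy block plus remaining logic, treats block replication inside a kernel as CKZ and replication of whole kernels as the Multi- variants, and uses made-up cost and timing units. The function names and all default numbers are hypothetical.

```python
from itertools import product


def level_name(block_copies, kernel_copies):
    """Map replication factors to the level names used on the slides."""
    base = "CKZ" if block_copies > 1 else "SKZ"
    return base if kernel_copies == 1 else "Multi-" + base


def best_configuration(budget, block_cost=10, other_cost=20,
                       block_time=4.0, other_time=1.0, max_copies=4):
    """Pick the highest-throughput configuration that fits the budget.

    All parameters are hypothetical units, not measurements:
    block_cost / other_cost: resources of the heavy block / rest of kernel
    block_time / other_time: cycles per item of the heavy block / the rest
    """
    best = None
    for b, k in product(range(1, max_copies + 1), repeat=2):
        cost = k * (other_cost + b * block_cost)
        if cost > budget:
            continue
        # The pipeline accepts a new item as fast as its slowest stage allows;
        # b copies of the heavy block share its work.
        interval = max(block_time / b, other_time)
        throughput = k / interval
        key = (throughput, -cost)   # prefer throughput, then lower cost
        if best is None or key > best[0]:
            best = (key, {"level": level_name(b, k), "block_copies": b,
                          "kernel_copies": k, "cost": cost,
                          "throughput": throughput})
    return None if best is None else best[1]
```

With the default hypothetical numbers, a budget of 30 only fits SKZ, a budget of 60 selects CKZ with four copies of the heavy block, and a budget of 120 selects Multi-CKZ. If the kernel is balanced (block_time=1.0), a budget of 60 selects Multi-SKZ instead, which matches the qualitative point that the best level depends on both the resources and the computation distribution.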
20. Conclusions
Specific conclusions
For very limited resources, SKZ is better
CKZ is better for applications with a highly unbalanced distribution
of computation
SKZ and CKZ are better for large applications
Multi-CKZ can provide a high level of parallelism at the cost of design
complexity
Multi-SKZ and Multi-CKZ are attractive for small real-time
applications
Using parallelization levels
Can adjust trade-offs
Can achieve highly custom parallelism
Mix of parallelization levels can produce
Application-specific parallelism
Resource-specific parallelism
21. Future Work
Automation of parallelization levels
Parallelization levels to deal with other tradeoffs
Generalized parallelization levels for all
applications
Generalized parallelization levels for graphics
processors to adjust tradeoffs
Floating point and accuracy