Resource to Performance Tradeoff
Adjustment for Fine-Grained Architectures
A Design Methodology




   Fahad Islam Cheema, Zain-Ul-Abdin,
   Professor Bertil Svensson

   Halmstad University, Halmstad, Sweden
Engr. Fahad Islam Cheema
   4-Year Bachelor in Computer Engineering (BCE) from COMSATS Lahore in 2006
   2-Year Industrial Experience as Embedded Software/System Engineer in Lahore
    and Islamabad
      Five Rivers Technologies        Lahore
      Streaming Networks              Islamabad
      Delta Indus Systems             Lahore
   Master's in Computer System Engineering from Halmstad University, Sweden, in 2009
   Master's standalone thesis accepted for publication at the FPGAWorld 2010
    international conference
      www.fpgaworld2010.com
      Copenhagen, Denmark, September 2010
   1-Year Academic Experience
      Halmstad University Sweden
      LUMS Lahore
      Bahria University Islamabad


Engr. Fahad Islam Cheema
   3-Year Experience                           (Embedded Systems)
     2-year Industrial (Streaming Networks)
     1-Year Academic
            Universities          (Halmstad, LUMS, Bahria)
            Courses
                  Linux Programming and shell Scripting, Administration of OS, Databases
                  Embedded Systems
                  System Programming

   17-Year Education                            (Computer Engineering)
     Masters From Sweden
     Computer Engineering from COMSATS
     Specialization in Embedded Systems
     PEC # Comp/6774

   1 Publication
       Master's thesis accepted for publication at FPGAWorld 2010
Agenda
 Overview and Problem Definition
 Main Idea
 Experimental Setup
     Mitrion Parallel Architecture
     Interpolation Kernels

 Parallelization Levels
 Conclusions
 Future Work
Overview
   Motivation
     Computation-intensive algorithms
     Fine-grained architectures

   Problem Definition
     Parallelism
     Resource to performance tradeoffs
        Hardware/logic gates to performance tradeoffs
        Memory to performance tradeoffs
Problem Definition

Pipeline Step1 (first data-independent block)
        d0 = x_int - x0;
        d1 = x_int - x1;
        d2 = x_int - x2;
        d3 = x_int - x3;

Step2   p01 = (y0*d1 - y1*d0) / (x0 - x1);
        p12 = (y1*d2 - y2*d1) / (x1 - x2);
        p23 = (y2*d3 - y3*d2) / (x2 - x3);

Step3   p02 = (p01*d2 - p12*d0) / (x0 - x2);
        p13 = (p12*d3 - p23*d1) / (x1 - x3);

Step4   p03 = (p02*d3 - p13*d0) / (x0 - x3);

                             Figure-1: Problem at Kernel Level
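
The four pipeline steps above can be written as one plain C function. This is only a sketch for checking the arithmetic: the function name `cubic_kernel` and the array layout are assumptions made here, not taken from the slides. The recurrences have the form of Neville's algorithm for a cubic through four samples.

```c
/* Cubic interpolation through four samples (x[i], y[i]) at x_int.
 * Follows the four pipeline steps of Figure-1.
 * The name cubic_kernel and the array arguments are illustrative. */
double cubic_kernel(const double x[4], const double y[4], double x_int)
{
    /* Step 1: four mutually independent differences */
    double d0 = x_int - x[0];
    double d1 = x_int - x[1];
    double d2 = x_int - x[2];
    double d3 = x_int - x[3];

    /* Step 2: three independent first-order results */
    double p01 = (y[0] * d1 - y[1] * d0) / (x[0] - x[1]);
    double p12 = (y[1] * d2 - y[2] * d1) / (x[1] - x[2]);
    double p23 = (y[2] * d3 - y[3] * d2) / (x[2] - x[3]);

    /* Step 3: two independent second-order results */
    double p02 = (p01 * d2 - p12 * d0) / (x[0] - x[2]);
    double p13 = (p12 * d3 - p23 * d1) / (x[1] - x[3]);

    /* Step 4: final cubic result */
    return (p02 * d3 - p13 * d0) / (x[0] - x[3]);
}
```

Statements inside one step do not depend on each other, so they can run in parallel; each step depends only on results of earlier steps, so the steps form a pipeline.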
Problem Definition (Cont.)
   Points to consider for parallelism
     Performance: improved
     Required hardware resources: increased (tradeoffs)
           Hardware gates
           Memory interface
           Memory size
           Memory access speed

Fine-Grained Reconfigurable Architectures
Problem Definition (Cont.)
   Loop over all points: Problem Level Parallelism (PLP)
   Loop body (one kernel): Kernel Level Parallelism (KLP)

 for(i in <0 .. 90000>)
 {
Step1   d0 = x_int - x0;
        d1 = x_int - x1;
        d2 = x_int - x2;
        d3 = x_int - x3;

Step2   p01 = (y0*d1 - y1*d0) / (x0 - x1);
        p12 = (y1*d2 - y2*d1) / (x1 - x2);
        p23 = (y2*d3 - y3*d2) / (x2 - x3);

Step3   p02 = (p01*d2 - p12*d0) / (x0 - x2);
        p13 = (p12*d3 - p23*d1) / (x1 - x3);

Step4   p03 = (p02*d3 - p13*d0) / (x0 - x3);
 }
                           Figure-2: Parallelism at different levels
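
The loop around the kernel can be sketched in C as a batch driver. The names `interpolate_all` and `cubic_kernel`, and the flat array layout (four samples per point), are assumptions for illustration; the slides do not show how the data is stored. The point of the sketch is that no iteration reads a result of another iteration, which is the property problem level parallelism relies on.

```c
#include <stddef.h>

/* One cubic kernel, same arithmetic as the loop body in Figure-2. */
static double cubic_kernel(const double x[4], const double y[4], double x_int)
{
    double d0 = x_int - x[0], d1 = x_int - x[1];
    double d2 = x_int - x[2], d3 = x_int - x[3];
    double p01 = (y[0] * d1 - y[1] * d0) / (x[0] - x[1]);
    double p12 = (y[1] * d2 - y[2] * d1) / (x[1] - x[2]);
    double p23 = (y[2] * d3 - y[3] * d2) / (x[2] - x[3]);
    double p02 = (p01 * d2 - p12 * d0) / (x[0] - x[2]);
    double p13 = (p12 * d3 - p23 * d1) / (x[1] - x[3]);
    return (p02 * d3 - p13 * d0) / (x[0] - x[3]);
}

/* Hypothetical batch driver: point i uses samples xs[4*i .. 4*i+3] and
 * ys[4*i .. 4*i+3]. Iterations are independent, so they could be
 * pipelined or replicated in hardware. */
void interpolate_all(size_t n, const double *xs, const double *ys,
                     const double *x_int, double *out)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = cubic_kernel(xs + 4 * i, ys + 4 * i, x_int[i]);
    }
}
```

In this sequential C version the iterations simply run one after another; on a fine-grained architecture the same independence allows several iterations to be in flight at once.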
Main Idea
   Parallelism Levels
     Bit Level Parallelism (BLP)
     Kernel Level Parallelism (KLP)
     Problem Level Parallelism (PLP)
   Maximum parallelism at one level is not the ultimate
    solution
     Customized parallelism at different levels
     Can better adjust resource-performance tradeoffs
           Gates-performance tradeoff
Main Idea (Cont.)
   Maximum parallelism at one level is not the ultimate
    solution
     Combine parallelism at different parallelism levels to
      produce parallelization levels
     Parallelization Levels
          Single Kernel (SKZ)
          Cross Kernel (CKZ)
          Multi-SKZ
          Multi-CKZ

                                      Figure-3: Parallelism and Parallelization Levels
Experimental Setup
   Computation-intensive algorithms
     Interpolation kernels
   Fine-Grained Architecture
     FPGA
   Fine-Grained Parallelism
     Mitrion virtual processor
        Extracts fine-grained parallelism
     Mitrion-C high level language (HLL)
   Hardware Platform
     Cray XD1 supercomputer with Virtex-4 FPGA
Interpolation Kernels
   What is interpolation
     Process of calculating new values within the range of
      available values [1]
   Cubic interpolation
   Bi-cubic interpolation
     Applying cubic in 2D
     5 cubic kernels

                                        Figure-4: 2D Interpolation
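
One common way to apply cubic interpolation in 2D with five cubic kernels is four kernels along the rows of a 4x4 patch followed by one kernel along the resulting column. The sketch below assumes that decomposition; the slides state only the count of five. The names `bicubic` and `cubic_kernel` and the row-major layout of `z` are illustrative.

```c
/* One cubic kernel through four samples (x[i], y[i]) at x_int. */
static double cubic_kernel(const double x[4], const double y[4], double x_int)
{
    double d0 = x_int - x[0], d1 = x_int - x[1];
    double d2 = x_int - x[2], d3 = x_int - x[3];
    double p01 = (y[0] * d1 - y[1] * d0) / (x[0] - x[1]);
    double p12 = (y[1] * d2 - y[2] * d1) / (x[1] - x[2]);
    double p23 = (y[2] * d3 - y[3] * d2) / (x[2] - x[3]);
    double p02 = (p01 * d2 - p12 * d0) / (x[0] - x[2]);
    double p13 = (p12 * d3 - p23 * d1) / (x[1] - x[3]);
    return (p02 * d3 - p13 * d0) / (x[0] - x[3]);
}

/* Bi-cubic interpolation on a 4x4 patch.
 * z is row-major: z[4*j + i] is the sample at (x[i], y[j]). */
double bicubic(const double x[4], const double y[4], const double z[16],
               double x_int, double y_int)
{
    double row[4];
    /* Kernels 1 to 4: interpolate along x in each row.
     * The four calls are independent of each other. */
    for (int j = 0; j < 4; ++j) {
        row[j] = cubic_kernel(x, &z[4 * j], x_int);
    }
    /* Kernel 5: interpolate the four row results along y. */
    return cubic_kernel(y, row, y_int);
}
```

The four row kernels have no data dependence on each other, while the fifth kernel needs all four results, which makes this a natural case for replicating or mixing kernels.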
Mitrion Parallel Architecture
   Mitrion Virtual Processor (MVP)
     Fine-grained, soft-core processor
     Almost 60 IP blocks defined in HDL [2]
     Non-von Neumann architecture
   Mitrion-C
     HLL for FPGA
     Data dependence instead of order of execution
     Parallelism language constructs [3]
     Pipelining
Parallelization Levels
   Single Kernel Parallelization (SKZ)
     Only kernel level parallelism (KLP)
     All data-independent blocks are internally parallel but
      externally pipelined
                                Figure-5: SKZ
Parallelization Levels (Cont.)
   Cross Kernel Parallelization (CKZ)
     Extend the kernel by mixing more than one kernel
     Replicate computation-intensive, data-independent blocks
     Resource-computation balance
                                 Figure-6: CKZ
Parallelization Levels (Cont.)
   Multi-SKZ
     Replicate kernels which already have SKZ

                                Figure-7: Multi-SKZ
Parallelization Levels (Cont.)
   Multi-CKZ
     Replicate kernels which already have CKZ

                                Figure-8: Multi-CKZ
     (Diagram: replicated CKZ kernels side by side. Each replica reads a and b
      from memory, computes the D values d0..d3, then P01, P12, P23, then p02
      and p13, then P03, writes to memory and goes on to the next iteration.)
Results




          Table-1 : Results


Conclusions
   Specific conclusions
        For very limited resources, SKZ is better
        CKZ is better for applications with a highly unbalanced computation
         distribution
        SKZ and CKZ are better for large applications
        Multi-CKZ can provide a high level of parallelism at the cost of design
         complexity
        Multi-SKZ and Multi-CKZ are attractive for small real-time
         applications
   Using parallelization levels
        Can adjust tradeoffs
        Can achieve highly customized parallelism
   A mix of parallelization levels can produce
        Application-specific parallelism
        Resource-specific parallelism
Future Work
   Automation of parallelization levels
   Parallelization levels to deal with other tradeoffs
   Generalized parallelization levels for all
    applications
   Generalized parallelization levels for graphics
    processors to adjust tradeoffs
     Floating point and accuracy
References
[1] William H. Press, Brian P. Flannery, Saul A. Teukolsky, William T. Vetterling,
    “Numerical Recipes, Art of Scientific Computing”,
    Cambridge University Press

[2] Stefan Möhl, “The Mitrion Virtual Processor, Using FPGAs in HPC”,
    Sixteenth ACM/SIGDA International Symposium on FPGAs
    <http://www.ece.wisc.edu/~kati/fpga2008/fpga2008%20workshop%20-%2005%20Mitrionics%20-%20Mohl.pdf>
    Date 14-05-2009

[3] “Mitrion User Guide”, Copyright © 2005 - 2008 by Mitrionics AB.
    <http://www.mitrionics.com/?page=developers_resources> Date 03-03-2009
Thank you (Tack)
(Hope you enjoyed)

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

Resource to Performance Tradeoff Adjustment for Fine-Grained Architectures ─A Design Methodology

  • 1. Resource to Performance Tradeoff Adjustment for Fine-Grained Architectures ─A Design Methodology Fahad Islam Cheema, Zain-Ul-Abdin, Professor Bertil Svensson Halmstad University, Halmstad, Sweden
  • 2. Engr. Fahad Islam Cheema
      • 4-year Bachelor in Computer Engineering (BCE) from COMSATS Lahore in 2006
      • 2-year industrial experience as Embedded Software/System Engineer in Lahore and Islamabad
          • Five Rivers Technologies, Lahore
          • Streaming Networks, Islamabad
          • Delta Indus Systems, Lahore
      • Master's in Computer System Engineering from Halmstad University, Sweden, in 2009
      • Master's standalone thesis accepted for publication at the FPGAWorld 2010 international conference
          • www.fpgaworld2010.com
          • Copenhagen, Denmark in September,10
      • 1-year academic experience
          • Halmstad University, Sweden
          • LUMS, Lahore
          • Bahria University, Islamabad
  • 3. Engr. Fahad Islam Cheema
      • 3-year experience (Embedded Systems)
          • 2-year industrial (Streaming Networks)
          • 1-year academic
              • Universities (Halmstad, LUMS, Bahria)
              • Courses
                  • Linux Programming and Shell Scripting, Administration of OS, Databases
                  • Embedded Systems
                  • System Programming
      • 17-year education (Computer Engineering)
          • Master's from Sweden
          • Computer Engineering from COMSATS
          • Specialization in Embedded Systems
          • PEC # Comp/6774
      • 1 publication
          • Master's thesis accepted for publication at FPGAWorld 2010
  • 4. Resource to Performance Tradeoff Adjustment for Fine-Grained Architectures ─A Design Methodology Fahad Islam Cheema, Zain-Ul-Abdin, Professor Bertil Svensson Halmstad University, Halmstad, Sweden
  • 5. Agenda
      • Overview and Problem Definition
      • Main Idea
      • Experimental Setup
      • Mitrion Parallel Architecture
      • Interpolation Kernels
      • Parallelization Levels
      • Conclusions
      • Future Work
  • 6. Overview
      • Motivation
          • Computation-intensive algorithms
          • Fine-grained architectures
      • Problem Definition
          • Parallelism
          • Resource-to-performance tradeoffs
              • Hardware/logic gates to performance tradeoffs
              • Memory to performance tradeoffs
  • 7. Problem Definition
      Pipeline step 1 (first data-independent block):
          d0 = x_int - x0;
          d1 = x_int - x1;
          d2 = x_int - x2;
          d3 = x_int - x3;
      Step 2:
          p01 = (y0*d1 - y1*d0) / (x0 - x1);
          p12 = (y1*d2 - y2*d1) / (x1 - x2);
          p23 = (y2*d3 - y3*d2) / (x2 - x3);
      Step 3:
          p02 = (p01*d2 - p12*d0) / (x0 - x2);
          p13 = (p12*d3 - p23*d1) / (x1 - x3);
      Step 4:
          p03 = (p02*d3 - p13*d0) / (x0 - x3);
      Figure-1: Problem at Kernel Level
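
The recurrence in Figure-1 has the same form as Neville's algorithm for a cubic through four points. A minimal Python sketch of the four steps follows; the function name `neville_cubic` and the sample data are assumptions for illustration, not taken from the slides.

```python
def neville_cubic(xs, ys, x_int):
    """Evaluate the cubic through four points at x_int, step by step as in Figure-1."""
    x0, x1, x2, x3 = xs
    y0, y1, y2, y3 = ys

    # Step 1: four differences, mutually independent.
    d0, d1, d2, d3 = x_int - x0, x_int - x1, x_int - x2, x_int - x3

    # Step 2: three first-order terms; each needs only Step 1 results.
    p01 = (y0 * d1 - y1 * d0) / (x0 - x1)
    p12 = (y1 * d2 - y2 * d1) / (x1 - x2)
    p23 = (y2 * d3 - y3 * d2) / (x2 - x3)

    # Step 3: two second-order terms; each needs Step 2 results.
    p02 = (p01 * d2 - p12 * d0) / (x0 - x2)
    p13 = (p12 * d3 - p23 * d1) / (x1 - x3)

    # Step 4: the final cubic value.
    return (p02 * d3 - p13 * d0) / (x0 - x3)


# Hypothetical sample data: f(x) = x**3 - 2*x + 1 sampled at x = 0, 1, 2, 3.
XS = (0.0, 1.0, 2.0, 3.0)
YS = (1.0, 0.0, 5.0, 22.0)
value = neville_cubic(XS, YS, 1.5)
```

Operations within one step do not depend on each other, which is what allows each step to be evaluated in parallel while the steps themselves form a pipeline.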
  • 8. Problem Definition (Cont.)
      • Points to consider for parallelism
          • Performance improves
          • Required hardware resources increase
              • Hardware gates
              • Memory interface
              • Memory size
              • Memory access speed
      • Diagram labels: Tradeoffs; Fine-Grained Reconfigurable Architectures
  • 9. Problem Definition (Cont.)
      Problem Level Parallelism (PLP) applies across the loop iterations; Kernel Level Parallelism (KLP) applies to the pipeline steps inside the loop body.
          for(i in <0 .. 90000>)
          {
              d0 = x_int - x0;                         [Step 1]
              d1 = x_int - x1;
              d2 = x_int - x2;
              d3 = x_int - x3;
              p01 = (y0*d1 - y1*d0) / (x0 - x1);       [Step 2]
              p12 = (y1*d2 - y2*d1) / (x1 - x2);
              p23 = (y2*d3 - y3*d2) / (x2 - x3);
              p02 = (p01*d2 - p12*d0) / (x0 - x2);     [Step 3]
              p13 = (p12*d3 - p23*d1) / (x1 - x3);
              p03 = (p02*d3 - p13*d0) / (x0 - x3);     [Step 4]
          }
      Figure-2: Parallelism at different levels
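
Problem-level parallelism relies on the loop iterations in Figure-2 being independent of each other. The Python sketch below is only a software analogy, not the Mitrion-C implementation: the thread pool stands in for replicated hardware, and the names `kernel`, `run_sequential`, `run_plp` and the sample data are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample data: f(x) = x**3 - 2*x + 1 sampled at x = 0, 1, 2, 3.
XS = (0.0, 1.0, 2.0, 3.0)
YS = (1.0, 0.0, 5.0, 22.0)


def kernel(x_int):
    """One loop iteration: the four-step kernel from Figure-2."""
    x0, x1, x2, x3 = XS
    y0, y1, y2, y3 = YS
    d0, d1, d2, d3 = x_int - x0, x_int - x1, x_int - x2, x_int - x3
    p01 = (y0 * d1 - y1 * d0) / (x0 - x1)
    p12 = (y1 * d2 - y2 * d1) / (x1 - x2)
    p23 = (y2 * d3 - y3 * d2) / (x2 - x3)
    p02 = (p01 * d2 - p12 * d0) / (x0 - x2)
    p13 = (p12 * d3 - p23 * d1) / (x1 - x3)
    return (p02 * d3 - p13 * d0) / (x0 - x3)


def run_sequential(points):
    """Evaluate the kernel one iteration after another."""
    return [kernel(x) for x in points]


def run_plp(points, workers=4):
    """Evaluate independent iterations concurrently; map() keeps input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(kernel, points))


points = [i / 100 for i in range(300)]
results = run_plp(points)
```

Because no iteration reads a value written by another, both functions return identical results; on an FPGA the equivalent step is replicating the kernel hardware, which is where the resource cost comes from.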
  • 10. Main Idea
      • Parallelism levels
          • Bit Level Parallelism (BLP)
          • Kernel Level Parallelism (KLP)
          • Problem Level Parallelism (PLP)
      • Maximum parallelism at one level is not the ultimate solution
      • Customized parallelism at different levels
          • Can better adjust resource-performance tradeoffs
          • Gates-performance tradeoff
  • 11. Main Idea (Cont.)
      • Maximum parallelism at one level is not the ultimate solution
      • Combine parallelism at different parallelism levels to produce parallelization levels
      • Parallelization levels
          • Single Kernel (SKZ)
          • Cross Kernel (CKZ)
          • Multi-SKZ
          • Multi-CKZ
      Figure-3: Parallelism and Parallelization Levels
  • 12. Experimental Setup
      • Computation-intensive algorithms
          • Interpolation kernels
      • Fine-grained architecture
          • FPGA
      • Fine-grained parallelism
          • Mitrion Virtual Processor
          • Extract fine-grained parallelism
          • Mitrion-C high-level language (HLL)
      • Hardware platform
          • Cray XD1 supercomputer with Virtex-4 FPGA
  • 13. Interpolation Kernels
      • What is interpolation
          • Process of calculating new values within the range of available values [1]
      • Cubic interpolation
      • Bi-cubic interpolation
          • Applying cubic in 2D
          • 5 cubic kernels
      Figure-4: 2D Interpolation
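
One plausible reading of "5 cubic kernels" is four cubic interpolations along one axis of a 4×4 neighbourhood followed by a fifth along the other axis. The Python sketch below follows that reading; the function names, the grid layout `grid[j][i]` and the test function are assumptions for illustration.

```python
def cubic(xs, ys, x_int):
    """Cubic interpolation through four points (Neville's recurrence, in place)."""
    p = list(ys)
    for k in range(1, 4):            # order of the partial polynomials
        for i in range(4 - k):       # p[i + 1] still holds the previous order
            p[i] = ((x_int - xs[i + k]) * p[i] - (x_int - xs[i]) * p[i + 1]) / (
                xs[i] - xs[i + k]
            )
    return p[0]


def bicubic(xs, ys, grid, x_int, y_int):
    """Bi-cubic interpolation built from five cubic kernels.

    grid[j][i] is the sample at (xs[i], ys[j]).
    """
    # Kernels 1-4: one cubic per row; the four rows are independent of each other.
    row_values = [cubic(xs, row, x_int) for row in grid]
    # Kernel 5: one cubic across the four row results.
    return cubic(ys, row_values, y_int)


# Hypothetical test surface, cubic in each variable: f(x, y) = x**3 + x*y**2 - y.
XS = (0.0, 1.0, 2.0, 3.0)
YS = (0.0, 1.0, 2.0, 3.0)
GRID = [[x**3 + x * y**2 - y for x in XS] for y in YS]
value = bicubic(XS, YS, GRID, 1.5, 0.5)
```

The four row kernels can run in parallel, while the fifth depends on all of them.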
  • 14. Mitrion Parallel Architecture
      • Mitrion Virtual Processor (MVP)
          • Fine-grained, soft-core processor
          • Almost 60 IP blocks defined in HDL [2]
          • Non-von Neumann architecture
      • Mitrion-C
          • HLL for FPGA
          • Data dependence instead of order of execution
          • Parallelism language constructs [3]
          • Pipelining
  • 15. Parallelization Levels
      • Single Kernel Parallelization (SKZ)
          • Only kernel level parallelism (KLP)
          • All data-independent blocks are internally parallel but externally pipelined
      Figure-5: SKZ
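
The data-independent blocks of a kernel can be derived from its dependency graph: every operation goes into the earliest stage in which all of its inputs are available. The Python sketch below applies this to the cubic kernel; the `DEPS` table is transcribed from the kernel's equations, and the function name `pipeline_stages` is an assumption for illustration.

```python
# For each value of the cubic kernel, the intermediate values it depends on.
# Inputs (x0..x3, y0..y3, x_int) are available from the start and are omitted.
DEPS = {
    "d0": [], "d1": [], "d2": [], "d3": [],
    "p01": ["d0", "d1"],
    "p12": ["d1", "d2"],
    "p23": ["d2", "d3"],
    "p02": ["p01", "p12", "d0", "d2"],
    "p13": ["p12", "p23", "d1", "d3"],
    "p03": ["p02", "p13", "d0", "d3"],
}


def pipeline_stages(deps):
    """Group operations into stages: parallel inside a stage, pipelined across stages."""
    level = {}

    def depth(node):
        # A node sits one stage after its deepest dependency.
        if node not in level:
            level[node] = 1 + max((depth(d) for d in deps[node]), default=0)
        return level[node]

    stages = {}
    for node in deps:
        stages.setdefault(depth(node), []).append(node)
    return [stages[k] for k in sorted(stages)]


stages = pipeline_stages(DEPS)
```

For this kernel the grouping yields four stages with 4, 3, 2 and 1 operations, so the later stages use fewer parallel units than the first.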
  • 16. Parallelization Levels (Cont.)
      • Cross Kernel Parallelization (CKZ)
          • Extend the kernel by mixing more than one kernel
          • Replicate computation-intensive data-independent blocks
          • Resource-computation balance
      Figure-6: CKZ
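
The idea of replicating computation-intensive blocks to balance resources against computation can be illustrated with a simple greedy model. This Python sketch is not the authors' method: the cost model, the function name `balance_stages` and all numbers are hypothetical, chosen only to show how replicating the bottleneck block trades area for throughput.

```python
def balance_stages(stage_cost, unit_area, area_budget):
    """Greedily replicate the slowest stage until the next replica would not fit.

    stage_cost:  time per item for one copy of each stage (hypothetical units)
    unit_area:   positive area of one copy of each stage (hypothetical units)
    area_budget: total area available
    """
    replicas = {stage: 1 for stage in stage_cost}
    used = sum(unit_area[stage] for stage in stage_cost)
    while True:
        # With r copies a stage handles r items at once, so its effective
        # time per item is cost / r; the largest value is the bottleneck.
        bottleneck = max(stage_cost, key=lambda s: stage_cost[s] / replicas[s])
        if used + unit_area[bottleneck] > area_budget:
            return replicas, used
        replicas[bottleneck] += 1
        used += unit_area[bottleneck]


# Hypothetical numbers: step 2 is the most expensive block.
COST = {"step1": 1, "step2": 6, "step3": 4, "step4": 2}
AREA = {"step1": 1, "step2": 3, "step3": 2, "step4": 1}
replicas, used = balance_stages(COST, AREA, area_budget=14)
```

With these numbers the sketch duplicates steps 2 and 3 and leaves the cheap steps alone, which halves the bottleneck time per item while staying within the area budget.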
  • 17. Parallelization Levels (Cont.)
      • Multi-SKZ
          • Replicate kernels which already have SKZ
      Figure-7: Multi-SKZ
  • 18. Parallelization Levels (Cont.)
      • Multi-CKZ
          • Replicate kernels which already have CKZ
      Figure-8: Multi-CKZ (replicated kernel diagram; block labels: Read from Memory, D values d0 to d3, P01, P12, P23, p02, p13, P03, Write to Memory, Go for next iteration)
  • 19. Results
      Table-1: Results
  • 20. Conclusions
      • Specific conclusions
          • For very limited resources, SKZ is better
          • CKZ is better for applications with a highly unbalanced computation distribution
          • SKZ and CKZ are better for large applications
          • Multi-CKZ can provide a high level of parallelism at the cost of design complexity
          • Multi-SKZ and Multi-CKZ are attractive for small real-time applications
      • Using parallelization levels
          • Can adjust trade-offs
          • Can achieve highly custom parallelism
      • A mix of parallelization levels can produce
          • Application-specific parallelism
          • Resource-specific parallelism
  • 21. Future Work
      • Automation of parallelization levels
      • Parallelization levels to deal with other tradeoffs
      • Generalized parallelization levels for all applications
      • Generalized parallelization levels for graphical processors to adjust tradeoffs
      • Floating point and accuracy
  • 22. References
      [1] William H. Press, Brian P. Flannery, Saul A. Teukolsky, William T. Vetterling, "Numerical Recipes: The Art of Scientific Computing", Cambridge University Press
      [2] Stefan Möhl, "The Mitrion Virtual Processor, Using FPGAs in HPC", Sixteenth ACM/SIGDA International Symposium on FPGAs <http://www.ece.wisc.edu/~kati/fpga2008/fpga2008%20workshop%20-%2005%20Mitrionics%20-%20Mohl.pdf> Date 14-05-2009
      [3] "Mitrion User Guide", Copyright © 2005 - 2008 by Mitrionics AB. <http://www.mitrionics.com/?page=developers_resources> Date 03-03-2009