2012/12/07 The Third International Conference on Networking and Computing
International Workshop on Challenges on Massively Parallel Processors (CMPP) (11:00-11:30)
25-minute presentation and 5-minute question and discussion time




          Towards a Low-Power Accelerator of
          Many FPGAs for Stencil Computations



        ☆Ryohei Kobayashi†1 Shinya Takamaeda-Yamazaki†1 †2 Kenji Kise†1

                              †1 Tokyo Institute of Technology, Japan
                                  †2 JSPS Research Fellow, Japan
Motivation(1/2)
 GPU or FPGA ??

[Figure: GPU vs. FPGA]

                        1
FPGA Based Accelerator
 Growing demand for scientific computation with low power and high performance
 Various accelerators for scientific computing kernels have been designed using FPGAs
   ► CUBE (Mencer, O., et al., SPL 2009)
          ◇ Systolic array of 512 FPGAs
          ◇ For encryption and pattern matching
   ► Stencil computation accelerator composed of 9 FPGAs
          ◇ Scalable streaming-array with constant memory bandwidth
          ◇ Sano, K., et al., IEEE 19th Annual International Symposium on
            Field-Programmable Custom Computing Machines (FCCM), 2011

                                                                     2
2D Stencil Computation
 Iterative computation that updates a data set using nearest-neighbor
  values, called a stencil
 One method of obtaining approximate solutions of partial differential
  equations (e.g., thermodynamics, hydrodynamics, electromagnetism, …)
                                v1[i][j] =
                                (C0 * v0[i-1][j]) + (C1 * v0[i][j+1]) +
                                (C2 * v0[i][j-1]) + (C3 * v0[i+1][j]);

                          v1[i][j] is updated with the weighted sum of its four neighbors.
                          Cx : weighting factor
                          At each time-step k, the whole data set is updated.
                                                                                 3
Motivation(2/2)
 Small or Big ??

[Figure: a small FPGA vs. a big FPGA]

                         4
ScalableCore System             *Takamaeda-Yamazaki, S., et al. (ARC 2012)

 Tile-architecture simulator built from multiple low-end FPGAs
   ► High-speed simulation environment for many-core processor research
   ► We use the hardware components of this system as an infrastructure for
     HPC hardware accelerators.

[Figure: one FPGA node, consisting of an FPGA, a PROM, and an SRAM]

                                                                                           5
Our Plan

One node  →  4 nodes (2×2)  →  100 nodes (10×10)
             (now implementing)   (final goal)

                                                     6
Parallel Stencil Computation by
Using Multi-FPGA

                                  7
Block Division and Assignment to Each FPGA

[Figure: the grid of points is divided into blocks; each group of grid-points
 is assigned to one FPGA, and the data subsets on block boundaries are
 communicated with the neighboring FPGAs]

 ・The data set is divided into blocks according to the number of FPGAs
 ・Each FPGA performs the stencil computation on its block in parallel
                                                                              8
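A minimal sketch of this block division (hypothetical helpers `owner` and `needs_halo_exchange`, not part of the deck): map each grid point of an N×N data set to the FPGA that owns it in a P×P array, and flag the points whose values must be communicated.

```c
/* Which FPGA (fx, fy) in a P-by-P array owns grid point (x, y) of an
 * N-by-N data set?  Assumes N is divisible by P, as in the deck's
 * regular block division. */
void owner(int N, int P, int x, int y, int *fx, int *fy)
{
    int block = N / P;       /* points per block side */
    *fx = x / block;
    *fy = y / block;
}

/* A point lies on the block edge, so a stencil neighbor may belong to a
 * different FPGA and its value must be exchanged.  (Points on the outer
 * boundary of the whole grid are also flagged; a full implementation
 * would exclude them.) */
int needs_halo_exchange(int N, int P, int x, int y)
{
    int block = N / P;
    int bx = x % block, by = y % block;
    return bx == 0 || by == 0 || bx == block - 1 || by == block - 1;
}
```

For an 8×8 grid on a 2×2 array, point (5, 2) falls in the block of FPGA (1, 0), and point (3, 2) sits on a block edge and must be sent to the neighbor.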
The Computing Order of Grid-points on an FPGA

[Figure: (a) a naive computing order vs. (b) the proposed method]

Our proposed method increases the acceptable communication latency!
Next, let's compare the model of (a) with the proposed method.
                                                                  9
Comparison between (a) and (b) (1/2)

・"Iteration": the sequence of operations that computes all the grid-points once
for a time-step
・We assume that updating the value of one grid-point takes exactly one cycle.
・Each FPGA updates its assigned data of sixteen grid-points (0 to 15) during
every Iteration.

[Figure: (a) FPGA(A) above FPGA(B), both numbered row-major from the top-left
 (A0–A15, B0–B15), so A's bottom row A12–A15 borders B's top row B0–B3;
 vs. (b) the proposed method, FPGA(C) above FPGA(D), where each FPGA computes
 the row on the shared boundary first: C0–C3 is C's bottom row and D0–D3 is
 D's top row]
                                                                              10
Comparison between (a) and (b) (2/2)

[Timing diagram: in both (a) and (b), each FPGA computes its grid-points 0–15
 during cycles 1–16 of the first Iteration, then starts again at grid-point 0]
                                                                11
Comparison between (a) and (b) (2/2)

(a): In order not to stall the computation of B1, the value of A13 must be
communicated within three cycles (14, 15, 16) after it is computed.
                                                            12
Comparison between (a) and (b) (2/2)

(a): In order not to stall the computation of B1, the value of A13 must be
communicated within three cycles (14, 15, 16) after it is computed.

(b): In order not to stall the computation of D1 in Iteration 2 (the 17th
cycle), the margin to send the value of C1 (computed in the 1st cycle) is
15 cycles.
                                                                13
Comparison between (a) and (b) (N×M grid-points)

(a): If N×M grid-points are assigned to a single FPGA, every shared value
must be communicated within N−1 cycles.

(b) Proposed method: if N×M grid-points are assigned to a single FPGA, every
shared value may be communicated within N×M−1 cycles.

[Timing diagram: in (a) the margin between producing a boundary value and the
 neighbor needing it is N−1 cycles; in (b) it is N×M−1 cycles]
                                                                        14
Comparison between (a) and (b) (N×M grid-points)

The proposed method increases the acceptable latency of communication!!
                                                                        15
Computing Order with the Proposed Method Applied

[Figure: the computation order of grid-points within each FPGA block]

 This method ensures a margin of about one Iteration.
 As the number of grid-points increases, the acceptable latency scales with it.
                                                                          16
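One plausible realization of this computing order, consistent with slides 10–13 (an illustrative sketch with a hypothetical `proposed_order` helper, not the deck's exact schedule): compute the row on the shared boundary first, so the neighbor receives those values with the largest possible margin.

```c
/* Fill `order` with the linear indices of an n_rows-by-n_cols block in
 * the order they are computed.  If the neighbor that consumes our
 * boundary values sits below the block, start from the bottom row;
 * otherwise start from the top row.  Either way, the shared boundary
 * row is computed first, giving an n_rows*n_cols - 1 cycle margin. */
void proposed_order(int n_rows, int n_cols, int neighbor_below, int *order)
{
    int k = 0;
    for (int i = 0; i < n_rows; i++) {
        int r = neighbor_below ? (n_rows - 1 - i) : i;  /* boundary row first */
        for (int c = 0; c < n_cols; c++)
            order[k++] = r * n_cols + c;
    }
}
```

For a 4×4 block with the neighbor below (like FPGA(C) in slide 10), the first point computed is the bottom-left one, matching C0's position in the figure.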
Architecture and Implementation


                                  17
System Architecture

[Block diagram of one node: eight MADD units are fed through muxes from the
 West/North/South/East inputs and the Memory unit (BlockRAMs); output gates
 GATE[0]–GATE[3] drive the four directions via a mux8. The Computation unit
 sits on a Spartan-6 FPGA alongside a configuration ROM (XCF04S, programmed
 via the JTAG port), clock and reset, and four Ser/Des links to/from the
 adjacent units]
                                                                                18
Relationship between the Data Subset and
  BlockRAM (Memory unit)
                     BlockRAM: low-latency SRAM inside each FPGA.

[Figure: 4×4 FPGA array with the data assigned; BlockRAMs 0–7 inside one FPGA]

         The data set assigned to each FPGA is split in the
         vertical direction and stored across the BlockRAMs (0–7).

                If a 64×128 data set is assigned to one FPGA, each split piece
                (8×128) is stored in one BlockRAM (0–7).

                                                                                        19
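A sketch of that mapping (hypothetical `bram_location` helper; the bank count and sizes follow the 64×128 example above): the 64-point dimension is cut into eight 8-wide stripes, one per BlockRAM.

```c
#define BANKS   8     /* BlockRAMs per FPGA, one per MADD          */
#define X_SIZE 64     /* split dimension of the per-FPGA data set  */
#define Y_SIZE 128

/* BlockRAM bank and in-bank address of grid point (x, y).  Each bank
 * holds one 8-by-128 stripe of the 64-by-128 per-FPGA data set. */
void bram_location(int x, int y, int *bank, int *addr)
{
    int stripe = X_SIZE / BANKS;            /* = 8 points per stripe */
    *bank = x / stripe;
    *addr = (x % stripe) * Y_SIZE + y;
}
```

For example, point (17, 5) lands in bank 2 at in-bank address 133 (second row of the stripe, column 5).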
Relationship between MADD and
BlockRAM (Memory unit)
                       ・The data set stored in each
                       BlockRAM is computed by its own MADD.
                       ・The MADDs perform the
                       computation in parallel.
                       ・The computed data is stored back
                       into BlockRAM.

                                                       20
MADD Architecture (Computation unit)
 MADD
  ► Multiplier: seven pipeline stages
  ► Adder: seven pipeline stages
  ► Both multiplier and adder are single-precision floating-point units that
    conform to IEEE 754.

                                                                             21
Stencil Computation at MADD
 v1[i][j] = (C0 * v0[i-1][j]) + (C1 * v0[i][j-1]) + (C2 *
  v0[i][j+1]) + (C3 * v0[i+1][j]);

[Figure: MADD dataflow — an 8-stage multiplier feeding an 8-stage adder]
                                                             22
Stencil Computation at MADD
 v1[i][j] = (C0 * v0[i-1][j]) + (C1 * v0[i][j-1]) + (C2 *
  v0[i][j+1]) + (C3 * v0[i+1][j]);

[Animation, slides 23–30: the four neighbor values are fed into the 8-stage
 multiplier one after another and multiplied by C0, C1, C2, and C3 in turn;
 each multiplication takes 8 cycles, the products are then summed pairwise
 through the 8-stage adder, and the final sum is written out as v1[i][j]]
                                                             23–30
MADD Pipeline Operation (Computation unit)
 The computation of grid-points 11~18

[Pipeline diagram: the 8-stage multiplier and the 8-stage adder, with the
 adder's two operand ports labeled Input1(adder) and Input2(adder)]
                                                                31
MADD Pipeline Operation (in cycles 0~7)
   The computation of grid-points 11~18

Grid-points 1~8 are loaded from
BlockRAM and input to the
multiplier in cycles 0~7.

[Pipeline diagram]
                                                                                 32
MADD Pipeline Operation (in cycles 8~15)
 The computation of grid-points 11~18

 The multiplication results are output from the
 multiplier; at the same time, grid-points
 10~17 are input to the multiplier in
 cycles 8~15.

[Pipeline diagram]
                                                                                  33
MADD Pipeline Operation (in cycles 16~23)
 The computation of grid-points 11~18

Grid-points 12~19 are input to the
multiplier; at the same time, the values of grid-
points 1~8 and 10~17, each multiplied by a
weighting factor, are summed in cycles 16~23.

[Pipeline diagram]
                                                                                      34
MADD Pipeline Operation (in cycles 24~31)
  The computation of grid-points 11~18

Input2(adder): grid-points 1~8 and 10~17
Input1(adder): grid-points 12~19
Input(multiplier): grid-points 21~28

[Pipeline diagram]
                                                                                       35
MADD Pipeline Operation (in cycles 32~39)
    The computation of grid-points 11~18

Input2(adder): grid-points 1~8, 10~17, and 12~19
Input1(adder): grid-points 21~28
Input(multiplier): grid-points 11~18

[Pipeline diagram]
                                                                                              36
MADD Pipeline Operation (in cycles 40~48)
    The computation of grid-points 11~18

The final results, in which the values of the upper, lower,
left, and right grid-points have each been multiplied by a
weighting factor and summed, are output in
cycles 40~48.

[Pipeline diagram]
                                                                                              37
MADD Pipeline Operation (Computation unit)
The filling rate of the pipeline: ((N−8)/N)×100%               (N is the number of
  cycles taken by this computation.)
   ► Achieves high computation performance with a small circuit area
   ► This scheduling is valid only when the width of the computed grid equals the
     number of pipeline stages of the multiplier and adder.

                                                                                 38
Initialization Mechanism(1/2)

[Figure: 4×4 FPGA array; the Master is node (0,0), each node is labeled with
 its (x, y) position, and each node passes x-coordinate + 1 to the node on
 its right and y-coordinate + 1 to the node below]

                                     ・To determine the computation order
                                     of each FPGA, every FPGA uses its own
                                     position coordinates in the system.
                                                                           39
Initialization Mechanism(2/2)

[Figure: 4×4 FPGA array; the start signal of computation is distributed to
 every node]

      ・This array system must precisely synchronize the timing of
      the start of computation in the first Iteration.

      ・If there is any skew, the array cannot obtain the data of the
      communication region needed for the next Iteration.

                                                                                  40
Evaluation


             41
Environment
 FPGA:Xilinx Spartan-6 XC6SLX16
    ► BlockRAM: 72KB
 Design tool: Xilinx ISE webpack 13.3
 Hardware description language: Verilog HDL
 Implementation of MADD: IP cores generated by Xilinx CORE Generator
    ► A single MADD uses four of the 32 DSP blocks that a Spartan-6
      FPGA has.
        ◇ Therefore, at most eight MADDs can be implemented in a single
           FPGA.




                                                                         SRAM is not used.
       Hardware configuration of FPGA array                  ScalableCore board          42
Performance of Single FPGA Node(1/2)
 Grid-size:64×128
 Iteration:500,000
 Performance and Power Consumption(160MHz)
    ► Performance:2.24GFlop/s
    ► Power Consumption:2.37W

                                             Peak performance [GFlop/s]

                                Peak = 2 × F × N_FPGA × N_MADD × 7/8
                                  Peak: peak performance [GFlop/s]
                                      F: operating frequency [GHz]
                                  N_FPGA: the number of FPGAs
                                  N_MADD: the number of MADDs per FPGA
                                     7/8: average utilization of the MADD unit
                                  → four multiplications but only three additions per grid-point
                                       v1[i][j] = (C0 * v0[i-1][j]) + (C1 * v0[i][j+1]) +
                                                 (C2 * v0[i][j-1]) + (C3 * v0[i+1][j]);

                                                                                           43
Performance of Single FPGA Node(2/2)
 Performance and performance per watt (160MHz)
   ► Performance: 2.24GFlop/s
      26% of an Intel Core i7-2600 (single
        thread, 3.4GHz, -O3 option)
   ► Performance per watt: 0.95GFlop/s/W


                          The performance/W value is about six
                          times better than that of an Nvidia
                          GTX280 GPU card.

  Nvidia GTX 280 card

 Hardware Resource Consumption
   ► LUT: 50%
   ► Slice: 67%
   ► BlockRAM: 75%
   ► DSP48A1: 100%                                                 44
Estimation of Effective Performance in 256 FPGA Nodes

 Upper Limit of Effective Performance
   ► 573GFlop/s = (8 multipliers + 8 adders) × 256 FPGAs × 160MHz × 7/8
 Performance per watt
   ► 0.944GFlop/s/W
   [Log-log plot: effective performance [GFlop/s] (1 to 1000) versus the
   number of FPGA nodes (2 to 256), at a frequency of 0.16GHz.]

       Estimation of effective performance improvement rate.                 45
Conclusion
 Proposal of a high-performance stencil computation method
  and architecture
 Implementation result (one-FPGA node)
   ► Frequency: 160MHz (no communication)
   ► Effective performance: 2.24GFlop/s. Power consumption: 2.37W.
   ► Hardware resource consumption: Slices 67%
 Estimation of performance with 256 FPGA nodes
   ► Upper limit of effective performance: 573GFlop/s
   ► Effective performance per watt: 0.944GFlop/s/W
   An array system of low-end FPGAs is promising! (better performance
   per watt than an Nvidia GTX280 GPU card)
 Future works
   ► Implementation and evaluation of a larger-scale FPGA array
   ► Implementation toward lower power consumption
                                                                   46