1. 2012/12/07 The Third International Conference on Networking and Computing
International Workshop on Challenges on Massively Parallel Processors (CMPP) (11:00-11:30)
25-minute presentation, followed by 5 minutes for questions and discussion
Towards a Low-Power Accelerator of
Many FPGAs for Stencil Computations
☆Ryohei Kobayashi†1, Shinya Takamaeda-Yamazaki†1,†2, Kenji Kise†1
†1 Tokyo Institute of Technology, Japan
†2 JSPS Research Fellow, Japan
3. FPGA-Based Accelerators
There is growing demand to perform scientific computation with low power and high performance.
Various accelerators have been designed to solve scientific computing kernels using FPGAs:
► CUBE (Mencer, O., et al., SPL 2009)
◇ A systolic array of 512 FPGAs
◇ For encryption and pattern matching
► A stencil computation accelerator composed of 9 FPGAs
◇ A scalable streaming array with constant memory bandwidth (Sano, K., et al., IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines, 2011)
4. 2D Stencil Computation
An iterative computation that updates a data set using nearest-neighbor values, called a stencil.
One way to obtain approximate solutions of partial differential equations (e.g., thermodynamics, hydrodynamics, electromagnetism, ...).
v1[i][j] =
(C0 * v0[i-1][j]) + (C1 * v0[i][j+1]) +
(C2 * v0[i][j-1]) + (C3 * v0[i+1][j]);
v1[i][j] is updated by the weighted sum of its four neighbor values.
Cx: weighting factor
[Figure: the data set is updated at each time-step k]
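For reference, a minimal C sketch of one complete time-step of this update; the grid size, the weighting-factor values, and the fixed-boundary handling are illustrative assumptions, not taken from the slides.

#define NX 64            /* grid height (illustrative) */
#define NY 128           /* grid width  (illustrative) */

static float v0[NX][NY];                 /* values at time-step k     */
static float v1[NX][NY];                 /* values at time-step k + 1 */
static const float C0 = 0.25f, C1 = 0.25f, C2 = 0.25f, C3 = 0.25f;

/* One time-step: update every interior grid-point from its four
 * nearest neighbors, exactly as in the expression above. */
static void stencil_step(void)
{
    for (int i = 1; i < NX - 1; i++) {
        for (int j = 1; j < NY - 1; j++) {
            v1[i][j] = (C0 * v0[i-1][j]) + (C1 * v0[i][j+1]) +
                       (C2 * v0[i][j-1]) + (C3 * v0[i+1][j]);
        }
    }
}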
6. ScalableCore System (Takamaeda-Yamazaki, S., et al., ARC 2012)
A tile-architecture simulator built from multiple low-end FPGAs.
► A high-speed simulation environment for many-core processor research
► We use the hardware components of this system as the infrastructure for our HPC hardware accelerator.
[Figure: one FPGA node, consisting of an FPGA, a configuration PROM, and SRAM]
7. Our Plan
One node → 4 nodes (2×2) → 100 nodes (10×10). The 100-node array is the final goal; a smaller configuration is now being implemented.
9. Block Division and Assignment to FPGAs
[Figure legend: grid-point; data subset communicated with neighbor FPGAs; communication; group of grid-points assigned to one FPGA]
・The data set is divided into blocks according to the number of FPGAs.
・Each FPGA performs the stencil computation on its block in parallel; a sketch of the division follows below.
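For illustration, a minimal C sketch of this block division, assuming an N×M grid of grid-points and a P×Q array of FPGAs (the type and function names are ours, not from the slides); each FPGA gets a contiguous block, and the block's outermost rows and columns are the data subset exchanged with neighbor FPGAs.

typedef struct { int row0, row1, col0, col1; } block_t;   /* [row0,row1) x [col0,col1) */

/* Block of the N x M grid assigned to the FPGA at position (px, py) in a
 * P x Q FPGA array; the last row/column of FPGAs absorbs any remainder. */
static block_t assign_block(int N, int M, int P, int Q, int px, int py)
{
    block_t b;
    b.row0 = px * (N / P);
    b.row1 = (px == P - 1) ? N : (px + 1) * (N / P);
    b.col0 = py * (M / Q);
    b.col1 = (py == Q - 1) ? M : (py + 1) * (M / Q);
    return b;
}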
10. The Computing Order of Grid-points on an FPGA
[Figure: the proposed computation order]
The proposed method increases the acceptable communication latency!
Now, let us compare model (a) with the proposed method (b).
11. Comparison between (a) and (b) (1/2)
・"Iteration": the sequence of operations that computes all the grid-points for one time-step.
・We assume that updating the value of one grid-point takes exactly one cycle.
・Each FPGA updates its assigned data of sixteen grid-points (0 to 15) during every Iteration.
[Figure: two pairs of FPGAs, each FPGA holding a 4×4 block of grid-points. In (a), FPGA(A) and FPGA(B) number their grid-points in the same top-to-bottom order, so A's boundary grid-points (A12–A15) are computed last; in (b), the proposed method, FPGA(C) and FPGA(D) compute the grid-points on their shared boundary (C0–C3, D0–D3) first.]
13. Comparison between (a) and (b) (2/2)
[Figure: cycle-by-cycle computation timelines for FPGAs A and B under order (a) and FPGAs C and D under the proposed order (b), from cycle 0 to the end of the first Iteration (cycle 16)]
(a) In order not to stall the computation of B1, the value of A13 must be communicated within three cycles (cycles 14, 15, and 16) after it is computed.
14. Comparison between (a) and (b) (2/2)
[Figure: the same timelines as the previous slide]
(b) In order not to stall the computation of D1 in Iteration 2 (the 17th cycle), the margin for sending the value of C1 (computed in the 1st cycle) is 15 cycles.
15. Comparison between (a) and (b) (N×M grid-points)
(a) If N×M grid-points are assigned to a single FPGA, every shared value must be communicated within N−1 cycles.
(b) Proposed method: if N×M grid-points are assigned to a single FPGA, every shared value must be communicated within N×M−1 cycles.
[Figure: Iteration timelines of neighboring FPGAs, showing a margin of N−1 cycles in (a) and N×M−1 cycles in (b)]
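As a check against the 4×4 example on the previous slides: with N = M = 4, method (a) leaves a margin of N − 1 = 3 cycles, while the proposed method (b) leaves N × M − 1 = 15 cycles, matching the 3-cycle and 15-cycle margins shown there.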
16. Comparison between (a) and (b) (N×M grid-points)
The proposed method increases the acceptable communication latency from N−1 cycles to N×M−1 cycles!
17. Computing Order with the Proposed Method
[Figure: the computation order under the proposed method]
This method ensures a margin of about one Iteration.
As the number of grid-points increases, the acceptable latency scales with it.
19. System Architecture
[Figure: block diagram of one FPGA node (Xilinx Spartan-6). It contains a memory unit (BlockRAMs), a computation unit of eight MADD units fed through multiplexers, a configuration ROM (XCF04S) with a JTAG port, clock and reset, and Ser/Des links gated by GATE[0]–GATE[3] that connect to the adjacent nodes to the north, south, east, and west.]
20. Relationship between the Data Subset and BlockRAM (Memory unit)
BlockRAM: low-latency SRAM embedded in each FPGA.
[Figure: the 4×4 FPGA array and the BlockRAMs to which the data is assigned]
The data set assigned to each FPGA is split in the vertical direction and stored across BlockRAMs 0–7.
For example, if a 64×128 data set is assigned to one FPGA, each BlockRAM (0–7) stores an 8×128 slice, as sketched below.
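A minimal C sketch of this mapping for the 64×128 example; the function names and the choice of which index runs over the split dimension are ours, purely for illustration.

enum { BLOCK_W = 64, BLOCK_H = 128, N_BRAM = 8, SLICE_W = BLOCK_W / N_BRAM };

/* The 64-wide dimension is split into eight slices of width 8,
 * one slice (8 x 128) per BlockRAM. */
static int bram_index(int i)        /* which BlockRAM holds line i of the block   */
{
    return i / SLICE_W;             /* lines 0-7 -> BlockRAM 0, 8-15 -> 1, ...    */
}

static int offset_in_slice(int i)   /* position of line i inside its 8-wide slice */
{
    return i % SLICE_W;
}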
21. Relationship between MADD and BlockRAM (Memory unit)
・The data set stored in each BlockRAM is computed by its own MADD unit.
・The MADD units perform their computations in parallel.
・The computed data is stored back into BlockRAM.
22. MADD Architecture (Computation unit)
MADD
► Multiplier: seven pipeline stages
► Adder: seven pipeline stages
► Both the multiplier and the adder are single-precision floating-point units conforming to IEEE 754.
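Behaviorally, each MADD pairs one multiplier with one adder: grid-point values are multiplied by their weighting factors and accumulated into a partial sum. A trivial C model (ours, ignoring the pipeline latency) is shown below.

/* Behavioral model of one MADD step: multiply a grid-point value by its
 * weighting factor and accumulate into a partial sum. In hardware both
 * operations are pipelined (seven stages each); this model ignores that. */
static float madd(float value, float weight, float partial_sum)
{
    return partial_sum + (value * weight);
}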
33. MADD Pipeline Operation (cycles 0–7)
[Figure: pipeline diagram for the computation of grid-points 11–18]
Grid-points 1–8 are loaded from BlockRAM and input to the multiplier in cycles 0–7.
34. MADD Pipeline Operation (cycles 8–15)
[Figure: pipeline diagram for the computation of grid-points 11–18]
The multiplication results are output from the multiplier; at the same time, grid-points 10–17 are input to the multiplier in cycles 8–15.
35. MADD Pipeline Operation (cycles 16–23)
[Figure: pipeline diagram for the computation of grid-points 11–18]
Grid-points 12–19 are input to the multiplier; at the same time, the values of grid-points 1–8 and 10–17, each multiplied by a weighting factor, are summed in cycles 16–23.
38. MADD Pipeline Operation (cycles 40–48)
[Figure: pipeline diagram for the computation of grid-points 11–18]
The final results, in which the values of the upper, lower, left, and right grid-points have been multiplied by a weighting factor and summed, are output in cycles 40–48.
39. MADD Pipeline Operation (Computation unit)
The filling rate of the pipeline is ((N − 8) / N) × 100%, where N is the number of cycles the computation takes.
► Achieves high computation performance with a small circuit area
► This scheduling is valid only when the width of the computed grid equals the number of pipeline stages of the multiplier and adder.
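A small C helper illustrating the filling-rate formula; the example value of N is ours, purely for illustration.

/* Pipeline filling rate from the slide: ((N - 8) / N) * 100 %, where N is
 * the number of cycles the computation takes and the 8 cycles are the
 * fill overhead of the pipeline before results start flowing. */
static double filling_rate_percent(double n_cycles)
{
    return (n_cycles - 8.0) / n_cycles * 100.0;
}

/* For example, an (assumed) run of N = 1024 cycles gives about 99.2 %. */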
40. Initialization Mechanism (1/2)
[Figure: 4×4 FPGA array with coordinates (0,0)–(3,3); the master node is at (0,0), and coordinates propagate through the array via links labeled "x-coordinate + 1" and "y-coordinate + 1"]
・To determine the computation order of each FPGA, every FPGA uses its own position coordinate within the system.
41. Initialization Mechanism (2/2)
[Figure: 4×4 FPGA array; the computation-start signal is sent across the array]
・The array system must precisely synchronize the start of computation in the first Iteration.
・If there is a skew, the array cannot obtain the communication-region data needed for the next Iteration.
43. Environment
FPGA: Xilinx Spartan-6 XC6SLX16
► BlockRAM: 72 KB
Design tool: Xilinx ISE WebPACK 13.3
Hardware description language: Verilog HDL
Implementation of MADD: IP cores generated by the Xilinx CORE Generator
► A single MADD uses four of the 32 DSP blocks available on the Spartan-6 FPGA.
◇ Therefore, at most eight MADD units can be implemented on a single FPGA.
The SRAM is not used.
[Figure: hardware configuration of the FPGA array (ScalableCore board)]
44. Performance of a Single FPGA Node (1/2)
Grid size: 64×128
Iterations: 500,000
Performance and power consumption (160 MHz)
► Performance: 2.24 GFlop/s
► Power consumption: 2.37 W
Peak performance [GFlop/s]:
Peak = 2 × F × N_FPGA × N_MADD × 7/8
Peak: peak performance [GFlop/s]
F: operating frequency [GHz]
N_FPGA: the number of FPGAs
N_MADD: the number of MADD units per FPGA
7/8: average utilization of a MADD unit
→ four multiplications and three additions per grid-point update:
v1[i][j] = (C0 * v0[i-1][j]) + (C1 * v0[i][j+1]) +
(C2 * v0[i][j-1]) + (C3 * v0[i+1][j]);
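As a cross-check of the formula, a short C sketch (the function name is ours) that reproduces the 2.24 GFlop/s single-node figure and the 573 GFlop/s upper limit reported later for 256 nodes.

#include <stdio.h>

/* Peak [GFlop/s] = 2 * F * N_FPGA * N_MADD * 7/8
 *   2   : one multiplication + one addition issued per MADD per cycle
 *   7/8 : average MADD utilization (4 multiplications + 3 additions
 *         per grid-point update)                                      */
static double peak_gflops(double f_ghz, int n_fpga, int n_madd)
{
    return 2.0 * f_ghz * n_fpga * n_madd * 7.0 / 8.0;
}

int main(void)
{
    printf("1 node:    %.2f GFlop/s\n", peak_gflops(0.16, 1, 8));    /* 2.24   */
    printf("256 nodes: %.2f GFlop/s\n", peak_gflops(0.16, 256, 8));  /* 573.44 */
    return 0;
}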
45. Performance of a Single FPGA Node (2/2)
Performance and performance per watt (160 MHz)
► Performance: 2.24 GFlop/s
◇ 26% of an Intel Core i7-2600 (single thread, 3.4 GHz, -O3 option)
► Performance per watt: 0.95 GFlop/s/W
◇ About six times better than an Nvidia GTX 280 GPU card
Hardware resource consumption
► LUTs: 50%
► Slices: 67%
► BlockRAM: 75%
► DSP48A1: 100%
46. Estimation of Effective Performance with 256 FPGA Nodes
Upper limit of effective performance
► 573 GFlop/s = (8 multipliers + 8 adders) × 256 FPGAs × 160 MHz × 7/8
Performance per watt
► 0.944 GFlop/s/W
[Figure: estimated effective performance (GFlop/s, log scale) versus the number of FPGA nodes (2–256) at a frequency of 0.16 GHz]
47. Conclusion
We proposed a high-performance stencil computation method and architecture.
Implementation results (one FPGA node)
► Frequency: 160 MHz (no communication)
► Effective performance: 2.24 GFlop/s. Power consumption: 2.37 W.
► Hardware resource consumption: Slices 67%
Estimated performance with 256 FPGA nodes
► Upper limit of effective performance: 573 GFlop/s
► Effective performance per watt: 0.944 GFlop/s/W
An array of low-end FPGAs is promising! (Better than an Nvidia GTX 280 GPU card)
Future work
► Implementation and evaluation of a larger-scale FPGA array
► Implementation toward lower power