Designing High Performance Computing Architectures for Reliable Space Applications

Fisnik Kraja
PhD Defense
December 6, 2012

Advisors:
1st: Prof. Dr. Arndt Bode
2nd: Prof. Dr. Xavier Martorell
Outline

1. Motivation
2. The Proposed Computing Architecture
3. The 2DSSAR Benchmarking Application
4. Optimizations and Benchmarking Results
   – Shared memory multiprocessor systems
   – Distributed memory multiprocessor systems
   – Heterogeneous CPU/GPU systems
5. Conclusions
Motivation

• Future space applications will demand:
  – Increased on-board computing capabilities
  – Preserved system reliability

• Future missions:
  – Optical - IR Sounder: 4.3 GMult/s + 5.7 GAdd/s, 2.2 Gbit/s
  – Radar/Microwave - HRWS SAR: 1 Tera 16-bit fixed-point operations/s, 603.1 Gbit/s

• Challenges:
  – Costs (ASICs are very expensive)
  – Modularity (component change and reuse)
  – Portability (across various spacecraft platforms)
  – Scalability (hardware and software)
  – Programmability (compatible with various environments)
  – Efficiency (power consumption and size)
The Proposed Architecture

[Architecture diagram]

Legend:
  RHMU: Radiation-Hardened Management Unit
  PPN: Parallel Processing Node
  Control Bus
  Data Bus
The 2DSSAR Application
2-Dimensional Spotlight Synthetic Aperture Radar

[Figure: illuminated swath in side-looking spotlight SAR, showing the spacecraft flight path (azimuth), altitude, range, swath, and cross-range]

Synthetic Data Generation (SDG):
  Synthetic SAR returns from a uniform grid of point reflectors

SAR Sensor Processing (SSP):
  Read generated data
  Image Reconstruction (IR)
  Write reconstructed image

The reconstructed SAR image is obtained by applying a 2D Fourier Matched Filtering and Interpolation algorithm.
Profiling SAR Image Reconstruction

           Coverage (km)   Memory (GB)   FLOP (Giga)   Time (s)
Scale=10   3.8 x 2.5       0.25          29.54         23
Scale=30   11.4 x 7.5      2             115.03        230
Scale=60   22.8 x 15       8             1302          926

Goal: 30x speedup

IR profiling (share of reconstruction time):
  Interpolation loop:                   69%
  FFTs:                                 22%
  Compression and decompression loops:   7%
  Transposition and FFT-shifting:         2%
IR Optimizations for Shared Memory Multiprocessing

• OpenMP
  – General optimizations:
    • Thread pinning and first-touch policy
    • Static/dynamic scheduling
  – FFT:
    • Manual multithreading of the loops of 1D FFTs (not the FFT itself)
  – Interpolation loop (polar to rectangular coordinates), two variants sketched below:
    • Atomic operations
    • Replication and Reduction (R&R)

• Other programming models
  – OmpSs, MPI, MPI+OpenMP
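A minimal C/OpenMP sketch of these ideas follows. It is illustrative only, not the original 2DSSAR code: the array names, the one-target-cell-per-sample scatter, and the critical-section reduction are simplifying assumptions.

```c
#include <complex.h>
#include <stdlib.h>

/* First-touch placement: initialize the output grid with the same static
 * schedule that later writes it, so pages are allocated on the NUMA node
 * of the thread that will use them. */
void grid_first_touch(int ncells, float complex *grid)
{
    #pragma omp parallel for schedule(static)
    for (int c = 0; c < ncells; c++)
        grid[c] = 0.0f;
}

/* Variant 1: atomic updates on the shared grid. Each polar sample scatters
 * into the rectangular grid (one target cell here for brevity; the real
 * loop updates several neighboring cells). */
void interp_atomic(int nsamples, const float complex *val,
                   const int *cell, float complex *grid)
{
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < nsamples; i++) {
        float re = crealf(val[i]), im = cimagf(val[i]);
        /* no complex atomics: update real and imaginary parts separately */
        #pragma omp atomic
        ((float *)grid)[2 * cell[i]]     += re;
        #pragma omp atomic
        ((float *)grid)[2 * cell[i] + 1] += im;
    }
}

/* Variant 2: Replication and Reduction (R&R). Each thread accumulates into
 * a private copy of the grid; the copies are summed once at the end. */
void interp_rr(int nsamples, int ncells, const float complex *val,
               const int *cell, float complex *grid)
{
    #pragma omp parallel
    {
        float complex *priv = calloc(ncells, sizeof *priv);
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < nsamples; i++)
            priv[cell[i]] += val[i];          /* no synchronization needed */
        #pragma omp critical                  /* serialize the final reduction */
        for (int c = 0; c < ncells; c++)
            grid[c] += priv[c];
        free(priv);
    }
}
```

R&R trades memory (one grid copy per thread) for synchronization-free updates, which is why it generally outperforms the atomic variant in the measurements on the next slide.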
IR on a Shared Memory Node

The ccNUMA node:
  2 x Nehalem CPUs (6 cores / 12 threads each)
  2.93-3.33 GHz
  QPI: 25.6 GB/s
  IMC: 32 GB/s
  Lithography: 32 nm
  TDP: 95 W
  2 x 3 x 6 GB memory (36 GB total), DDR3 SDRAM, 1066 MHz

Speedup (Scale=60):

  Cores (Threads)   OpenMP   OpenMP   OmpSs    OmpSs   MPI     MPI+
                    Atomic   R&R      Atomic   R&R     R&R     OpenMP
  1                 1        1        1        1       1       1
  2                 1.55     1.78     1.61     1.93    1.92    1.89
  4                 3.05     3.51     3.12     3.73    3.65    3.54
  6                 4.45     5.02     4.62     5.52    5.30    4.88
  8                 5.81     6.36     5.92     7.13    6.57    6.40
  10                6.94     7.74     7.02     8.65    7.94    8.02
  12                7.98     9.03     8.13     10.37   9.81    9.94
  12 (24)           10.54    11.06    10.72    12.37   11.20   11.69
IR Optimizations for Distributed Memory Multiprocessing

• Programming paradigms
  – MPI
    • Data replication
    • Process creation overhead
  – MPI+OpenMP
    • 1 process/node
    • 1 thread/core

• Communication optimizations (a sketch of the collectives follows below)
  – Transposition (new: All-to-All)
  – FFT-shift (new: Peer-to-Peer)
  – Interpolation-loop Replication and Reduction

[Diagram: block-distributed matrix (blocks D00-D33 across processes PID 0-3) exchanged during the distributed transposition]

• Pipelined IR
  – Each node reconstructs a separate SAR image
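Below is a minimal sketch of two of the collective patterns named above, assuming a block-distributed complex matrix and an MPI library that provides the C complex datatype (MPI 2.2 or later). Buffer layout, counts, and function names are illustrative assumptions, not the original implementation.

```c
#include <mpi.h>
#include <complex.h>

/* Distributed transposition: every rank exchanges one block with every
 * other rank in a single collective (the "All-to-All" scheme) instead of
 * many point-to-point messages; a local in-block transpose follows. */
void transpose_alltoall(float complex *rowblocks, float complex *colblocks,
                        int block_elems, MPI_Comm comm)
{
    MPI_Alltoall(rowblocks, block_elems, MPI_C_FLOAT_COMPLEX,
                 colblocks, block_elems, MPI_C_FLOAT_COMPLEX, comm);
}

/* Interpolation-loop Replication and Reduction across ranks: each rank
 * interpolates into a full private copy of the output grid, and the
 * copies are summed element-wise on every rank. */
void reduce_replicated_grid(float complex *grid, int ncells, MPI_Comm comm)
{
    MPI_Allreduce(MPI_IN_PLACE, grid, ncells,
                  MPI_C_FLOAT_COMPLEX, MPI_SUM, comm);
}
```

With one process per node and OpenMP threads inside the node (the MPI+OpenMP configuration), the number of replicated grids and of ranks in the all-to-all exchange drops accordingly, which reduces both the data replication and the process creation overhead listed above.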
IR on the Distributed Memory System

The Nehalem cluster:
  Each node: 2 x 4 cores, 16 threads, 2.8-3.2 GHz, 12/24/48 GB RAM
  QPI: 25.6 GB/s
  IMC: 32 GB/s
  Lithography: 45 nm
  TDP: 95 W/CPU
  Infiniband network, fat-tree topology, 6 backbone switches, 24 leaf switches

Speedup (Scale=60):

  Nodes (Cores)   MPI            Hybrid                MPI_new               Hyb_new               Pipelined
                  (4 Proc/Node)  (1 Proc:16 Thr/Node)  (8 Proc/Node, 24 GB)  (1 Proc:16 Thr/Node)  (1 Proc:16 Thr/Node)
  1 (8)           3.54           6.68                  6.45                  5.66                  6.35
  2 (16)          5.46           10.19                 10.87                 9.68                  11.50
  4 (32)          7.92           14.41                 15.93                 17.13                 21.30
  8 (64)          8.52           17.11                 23.69                 26.92                 38.05
  12 (96)         7.69           17.19                 28.90                 30.72                 50.48
  16 (128)        7.37           17.73                 31.06                 32.08                 59.80
IR Optimizations for Heterogeneous CPU/GPU Computing

• ccNUMA Multi-Processor
  – Sequential optimizations
  – Minor load-balancing improvements

• Accelerator (GP-GPU)
  – CUDA tiling technique [diagram: tsize x tsize thread blocks addressed via blockIndex.x/y and threadIndex.x/y]
  – cuFFT library
  – Transcendental functions, such as sine and cosine
  – CUDA 3.2 lacks (see the device-helper sketch below):
    • Some complex operations (multiplication and CEXP)
    • Atomic operations for complex/float data
  – Memory limitation:
    • Atomic operations are used in the SFI loop (R&R is not an option)
    • The large-scale IR dataset does not fit into GPU memory

• Computing on CPU+GPU
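The sketch below illustrates, under stated assumptions, the kind of device helpers such gaps require: a CAS-based float atomic add, a complex atomic add built from it, and complex multiply / exp(i*phi) helpers using the device sine/cosine intrinsics. Names and signatures are illustrative, not the original kernels.

```cuda
#include <cuComplex.h>

/* Float atomic add emulated with compare-and-swap, usable where a native
 * float atomicAdd is not available. */
__device__ float atomicAddFloat(float *address, float val)
{
    int *address_as_int = (int *)address;
    int old = *address_as_int, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);
    return __int_as_float(old);
}

/* Complex atomic add: two independent float atomics on the parts. */
__device__ void atomicAddComplex(cuFloatComplex *addr, cuFloatComplex v)
{
    atomicAddFloat(&addr->x, v.x);
    atomicAddFloat(&addr->y, v.y);
}

/* Complex multiply and exp(i*phi), built from real device intrinsics. */
__device__ cuFloatComplex cmulf(cuFloatComplex a, cuFloatComplex b)
{
    return make_cuFloatComplex(a.x * b.x - a.y * b.y,
                               a.x * b.y + a.y * b.x);
}

__device__ cuFloatComplex cexpif(float phi)
{
    float s, c;
    sincosf(phi, &s, &c);   /* transcendental sine and cosine on the device */
    return make_cuFloatComplex(c, s);
}
```

Atomic updates are used in the interpolation (SFI) loop because a per-thread replicated grid, as in the CPU-side R&R scheme, would not fit into the 6 GB of GPU memory.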
IR on a Heterogeneous Node

The machine:
  ccNUMA module: 2 x 4 cores, 16 threads, 2.8-3.2 GHz, 12 GB RAM, TDP: 95 W/CPU
  PCIe 2.0 (8 GB/s)
  Accelerator module: 2 GPU cards, NVIDIA Tesla (Fermi), 1.15 GHz, 6 GB GDDR5, 144 GB/s, TDP: 238 W

Speedup over the sequential CPU baseline:

  Configuration                 Scale=10   Scale=30   Scale=60
  CPU sequential                1          1          1
  CPU sequential (optimized)    1.82       1.89       1.97
  CPU 8 threads                 14.46      11.41      10.27
  CPU best, 16 threads (SMT)    16.06      13.26      12.55
  GPU                           20.11      19.44      20.17
  CPU + GPU                     18.88      22.10      24.68
  2 GPUs                        4.27       16.71      22.26
  2 GPUs pipelined              15.86      25.40      34.46
Conclusions

• Shared memory nodes
  – Performance is limited by hardware resources
  – 1 node (12 cores / 24 threads): speedup = 12.4

• Distributed memory systems
  – Low efficiency in terms of performance per power consumption and size
  – 8 nodes (64 cores): speedup = 38.05

• Heterogeneous CPU/GPU systems
  – Perfect compromise:
    • Better performance than current shared memory nodes
    • Better efficiency than distributed memory systems
    • 1 CPU + 2 GPUs: speedup = 34.46

• Final design recommendations
  – Powerful shared memory PPN
  – PPN with ccNUMA CPUs and GPU accelerators
  – Distributed memory only if multiple PPNs are needed
Thank You

kraja@in.tum.de

  • 14. Thank Y Th k You kraja@in.tum.de