Designing High Performance Computing Architectures for Reliable Space Applications

Fisnik Kraja
PhD Defense
December 6, 2012

Advisors:
1st: Prof. Dr. Arndt Bode
2nd: Prof. Dr. Xavier Martorell
Outline

1. Motivation
2. The Proposed Computing Architecture
3. The 2DSSAR Benchmarking Application
4. Optimizations and Benchmarking Results
   – Shared memory multiprocessor systems
   – Distributed memory multiprocessor systems
   – Heterogeneous CPU/GPU systems
5. Conclusions
Motivation

• Future space applications will demand:
  – Increased on-board computing capabilities
  – Preserved system reliability

• Future missions:
  – Optical - IR Sounder: 4.3 GMult/s + 5.7 GAdd/s, 2.2 Gbit/s
  – Radar/Microwave - HRWS SAR: 1 Tera 16-bit fixed-point operations/s, 603.1 Gbit/s

• Challenges:
  – Costs (ASICs are very expensive)
  – Modularity (component change and reuse)
  – Portability (across various spacecraft platforms)
  – Scalability (hardware and software)
  – Programmability (compatible with various environments)
  – Efficiency (power consumption and size)
The Proposed Architecture

[Architecture diagram]

Legend:
  RHMU: Radiation-Hardened Management Unit
  PPN: Parallel Processing Node
  Control Bus
  Data Bus
The 2DSSAR Application
2-Dimensional Spotlight Synthetic Aperture Radar

[Figure: illuminated swath in side-looking spotlight SAR, showing the spacecraft flight path (azimuth), altitude, range, swath, and cross-range]

Synthetic Data Generation (SDG):
  Synthetic SAR returns from a uniform grid of point reflectors

SAR Sensor Processing (SSP):
  Read generated data
  Image Reconstruction (IR)
  Write reconstructed image

The reconstructed SAR image is obtained by applying a 2D Fourier Matched Filtering and Interpolation algorithm.
Profiling SAR Image Reconstruction

           Coverage (km)   Memory (GB)   FLOP (Giga)   Time (s)
Scale=10   3.8 x 2.5       0.25          29.54         23
Scale=30   11.4 x 7.5      2             115.03        230
Scale=60   22.8 x 15       8             1302          926

Goal: 30x speedup

IR profiling (share of reconstruction time):
  Interpolation loop:                   69%
  FFTs:                                 22%
  Compression and decompression loops:   7%
  Transposition and FFT-shifting:         2%
IR Optimizations for Shared Memory Multiprocessing

• OpenMP
  – General optimizations:
    • Thread pinning and first-touch policy
    • Static/dynamic scheduling
  – FFT:
    • Manual multithreading of the loops of 1D FFTs (not the FFT itself)
  – Interpolation loop (polar to rectangular coordinates), two variants sketched below:
    • Atomic operations
    • Replication and Reduction (R&R)

• Other programming models
  – OmpSs, MPI, MPI+OpenMP
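A minimal C/OpenMP sketch of these ideas follows. It is illustrative only, not the original 2DSSAR code: the array names, the one-target-cell-per-sample scatter, and the critical-section reduction are simplifying assumptions.

```c
#include <complex.h>
#include <stdlib.h>

/* First-touch placement: initialize the output grid with the same static
 * schedule that later writes it, so pages are allocated on the NUMA node
 * of the thread that will use them. */
void grid_first_touch(int ncells, float complex *grid)
{
    #pragma omp parallel for schedule(static)
    for (int c = 0; c < ncells; c++)
        grid[c] = 0.0f;
}

/* Variant 1: atomic updates on the shared grid. Each polar sample scatters
 * into the rectangular grid (one target cell here for brevity; the real
 * loop updates several neighboring cells). */
void interp_atomic(int nsamples, const float complex *val,
                   const int *cell, float complex *grid)
{
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < nsamples; i++) {
        float re = crealf(val[i]), im = cimagf(val[i]);
        /* no complex atomics: update real and imaginary parts separately */
        #pragma omp atomic
        ((float *)grid)[2 * cell[i]]     += re;
        #pragma omp atomic
        ((float *)grid)[2 * cell[i] + 1] += im;
    }
}

/* Variant 2: Replication and Reduction (R&R). Each thread accumulates into
 * a private copy of the grid; the copies are summed once at the end. */
void interp_rr(int nsamples, int ncells, const float complex *val,
               const int *cell, float complex *grid)
{
    #pragma omp parallel
    {
        float complex *priv = calloc(ncells, sizeof *priv);
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < nsamples; i++)
            priv[cell[i]] += val[i];          /* no synchronization needed */
        #pragma omp critical                  /* serialize the final reduction */
        for (int c = 0; c < ncells; c++)
            grid[c] += priv[c];
        free(priv);
    }
}
```

R&R trades memory (one grid copy per thread) for synchronization-free updates, which is why it generally outperforms the atomic variant in the measurements on the next slide.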
IR on a Shared Memory Node

The ccNUMA node:
  2 x Nehalem CPUs (6 cores / 12 threads each)
  2.93-3.33 GHz
  QPI: 25.6 GB/s
  IMC: 32 GB/s
  Lithography: 32 nm
  TDP: 95 W
  2 x 3 x 6 GB memory (36 GB total), DDR3 SDRAM, 1066 MHz

Speedup (Scale=60):

  Cores (Threads)   OpenMP   OpenMP   OmpSs    OmpSs   MPI     MPI+
                    Atomic   R&R      Atomic   R&R     R&R     OpenMP
  1                 1        1        1        1       1       1
  2                 1.55     1.78     1.61     1.93    1.92    1.89
  4                 3.05     3.51     3.12     3.73    3.65    3.54
  6                 4.45     5.02     4.62     5.52    5.30    4.88
  8                 5.81     6.36     5.92     7.13    6.57    6.40
  10                6.94     7.74     7.02     8.65    7.94    8.02
  12                7.98     9.03     8.13     10.37   9.81    9.94
  12 (24)           10.54    11.06    10.72    12.37   11.20   11.69
IR Optimizations for Distributed Memory Multiprocessing

• Programming paradigms
  – MPI
    • Data replication
    • Process creation overhead
  – MPI+OpenMP
    • 1 process/node
    • 1 thread/core

• Communication optimizations (a sketch of the collectives follows below)
  – Transposition (new: All-to-All)
  – FFT-shift (new: Peer-to-Peer)
  – Interpolation-loop Replication and Reduction

[Diagram: block-distributed matrix (blocks D00-D33 across processes PID 0-3) exchanged during the distributed transposition]

• Pipelined IR
  – Each node reconstructs a separate SAR image
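Below is a minimal sketch of two of the collective patterns named above, assuming a block-distributed complex matrix and an MPI library that provides the C complex datatype (MPI 2.2 or later). Buffer layout, counts, and function names are illustrative assumptions, not the original implementation.

```c
#include <mpi.h>
#include <complex.h>

/* Distributed transposition: every rank exchanges one block with every
 * other rank in a single collective (the "All-to-All" scheme) instead of
 * many point-to-point messages; a local in-block transpose follows. */
void transpose_alltoall(float complex *rowblocks, float complex *colblocks,
                        int block_elems, MPI_Comm comm)
{
    MPI_Alltoall(rowblocks, block_elems, MPI_C_FLOAT_COMPLEX,
                 colblocks, block_elems, MPI_C_FLOAT_COMPLEX, comm);
}

/* Interpolation-loop Replication and Reduction across ranks: each rank
 * interpolates into a full private copy of the output grid, and the
 * copies are summed element-wise on every rank. */
void reduce_replicated_grid(float complex *grid, int ncells, MPI_Comm comm)
{
    MPI_Allreduce(MPI_IN_PLACE, grid, ncells,
                  MPI_C_FLOAT_COMPLEX, MPI_SUM, comm);
}
```

With one process per node and OpenMP threads inside the node (the MPI+OpenMP configuration), the number of replicated grids and of ranks in the all-to-all exchange drops accordingly, which reduces both the data replication and the process creation overhead listed above.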
IR on the Distributed Memory System

The Nehalem cluster:
  Each node: 2 x 4 cores, 16 threads, 2.8-3.2 GHz, 12/24/48 GB RAM
  QPI: 25.6 GB/s
  IMC: 32 GB/s
  Lithography: 45 nm
  TDP: 95 W/CPU
  Infiniband network, fat-tree topology, 6 backbone switches, 24 leaf switches

Speedup (Scale=60):

  Nodes (Cores)   MPI            Hybrid                MPI_new               Hyb_new               Pipelined
                  (4 Proc/Node)  (1 Proc:16 Thr/Node)  (8 Proc/Node, 24 GB)  (1 Proc:16 Thr/Node)  (1 Proc:16 Thr/Node)
  1 (8)           3.54           6.68                  6.45                  5.66                  6.35
  2 (16)          5.46           10.19                 10.87                 9.68                  11.50
  4 (32)          7.92           14.41                 15.93                 17.13                 21.30
  8 (64)          8.52           17.11                 23.69                 26.92                 38.05
  12 (96)         7.69           17.19                 28.90                 30.72                 50.48
  16 (128)        7.37           17.73                 31.06                 32.08                 59.80
IR Optimizations for Heterogeneous CPU/GPU Computing

• ccNUMA Multi-Processor
  – Sequential optimizations
  – Minor load-balancing improvements

• Accelerator (GP-GPU)
  – CUDA tiling technique [diagram: tsize x tsize thread blocks addressed via blockIndex.x/y and threadIndex.x/y]
  – cuFFT library
  – Transcendental functions, such as sine and cosine
  – CUDA 3.2 lacks (see the device-helper sketch below):
    • Some complex operations (multiplication and CEXP)
    • Atomic operations for complex/float data
  – Memory limitation:
    • Atomic operations are used in the SFI loop (R&R is not an option)
    • The large-scale IR dataset does not fit into GPU memory

• Computing on CPU+GPU
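The sketch below illustrates, under stated assumptions, the kind of device helpers such gaps require: a CAS-based float atomic add, a complex atomic add built from it, and complex multiply / exp(i*phi) helpers using the device sine/cosine intrinsics. Names and signatures are illustrative, not the original kernels.

```cuda
#include <cuComplex.h>

/* Float atomic add emulated with compare-and-swap, usable where a native
 * float atomicAdd is not available. */
__device__ float atomicAddFloat(float *address, float val)
{
    int *address_as_int = (int *)address;
    int old = *address_as_int, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);
    return __int_as_float(old);
}

/* Complex atomic add: two independent float atomics on the parts. */
__device__ void atomicAddComplex(cuFloatComplex *addr, cuFloatComplex v)
{
    atomicAddFloat(&addr->x, v.x);
    atomicAddFloat(&addr->y, v.y);
}

/* Complex multiply and exp(i*phi), built from real device intrinsics. */
__device__ cuFloatComplex cmulf(cuFloatComplex a, cuFloatComplex b)
{
    return make_cuFloatComplex(a.x * b.x - a.y * b.y,
                               a.x * b.y + a.y * b.x);
}

__device__ cuFloatComplex cexpif(float phi)
{
    float s, c;
    sincosf(phi, &s, &c);   /* transcendental sine and cosine on the device */
    return make_cuFloatComplex(c, s);
}
```

Atomic updates are used in the interpolation (SFI) loop because a per-thread replicated grid, as in the CPU-side R&R scheme, would not fit into the 6 GB of GPU memory.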
IR on a Heterogeneous Node

The machine:
  ccNUMA module: 2 x 4 cores, 16 threads, 2.8-3.2 GHz, 12 GB RAM, TDP: 95 W/CPU
  PCIe 2.0 (8 GB/s)
  Accelerator module: 2 GPU cards, NVIDIA Tesla (Fermi), 1.15 GHz, 6 GB GDDR5, 144 GB/s, TDP: 238 W

Speedup over the sequential CPU baseline:

  Configuration                 Scale=10   Scale=30   Scale=60
  CPU sequential                1          1          1
  CPU sequential (optimized)    1.82       1.89       1.97
  CPU 8 threads                 14.46      11.41      10.27
  CPU best, 16 threads (SMT)    16.06      13.26      12.55
  GPU                           20.11      19.44      20.17
  CPU + GPU                     18.88      22.10      24.68
  2 GPUs                        4.27       16.71      22.26
  2 GPUs pipelined              15.86      25.40      34.46
Conclusions

• Shared memory nodes
  – Performance is limited by hardware resources
  – 1 node (12 cores / 24 threads): speedup = 12.4

• Distributed memory systems
  – Low efficiency in terms of performance per power consumption and size
  – 8 nodes (64 cores): speedup = 38.05

• Heterogeneous CPU/GPU systems
  – Perfect compromise:
    • Better performance than current shared memory nodes
    • Better efficiency than distributed memory systems
    • 1 CPU + 2 GPUs: speedup = 34.46

• Final design recommendations
  – Powerful shared memory PPN
  – PPN with ccNUMA CPUs and GPU accelerators
  – Distributed memory only if multiple PPNs are needed
Thank You

kraja@in.tum.de

  • 14. Thank Y Th k You kraja@in.tum.de