Parallelization Techniques for the 2D Fourier Matched Filtering and Interpolation SAR Algorithm
1. Parallelization Techniques for the 2D Fourier Matched Filtering and Interpolation SAR Algorithm
Fisnik Kraja, Georg Acher, Arndt Bode
Chair of Computer Architecture, Technische Universität München
kraja@in.tum.de, acher@in.tum.de, bode@in.tum.de
2012 IEEE Aerospace Conference, 3-10 March 2012, Big Sky, Montana
2. The main points will be:
• The motivation statement
• Description of the SAR 2DFMFI application
• Description of the benchmarked architectures
• Parallelization techniques and results on
  – shared-memory and
  – distributed-memory architectures
• Specific optimizations for distributed-memory environments
• Summary and conclusions
3. Motivation
• Current and future space applications with onboard high-performance requirements
  – Observation satellites with increased
    • image resolutions
    • data sets
    • computational requirements
• Novel and interesting research based on many-cores for space (Dependable Multiprocessor and Maestro)
• The tendency to fly COTS products to space
• The performance/power ratio depends directly on the scalability of applications.
4. SAR 2DFMFI Application
• Synthetic Data Generation (SDG): synthetic SAR returns from a uniform grid of point reflectors (the raw data).
• SAR Sensor Processing (SSP): the reconstructed SAR image is obtained by applying the 2D Fourier Matched Filtering and Interpolation algorithm.

Problem sizes:
SCALE   mc     n      m      nx
10      1600   3290   3808   2474
20      3200   6460   7616   4926
30      4800   9630   11422  7380
60      9600   19140  22844  14738
5. SAR Sensor Processing Profiling
SSP processing step (computation type, execution time in %, size & layout):
 1. Filter the echoed signal (1d_Fw_FFT, 1.1 %, [mc x n])
 2. Transposition (0.3 %, [n x mc])
 3. Signal compression along slow-time (CEXP, MAC, 1.1 %, [n x mc])
 4. Narrow-bandwidth polar format reconstruction along slow-time (1d_Fw_FFT, 0.5 %, [n x mc])
 5. Zero pad the spatial frequency domain's compressed signal (0.4 %, [n x mc])
 6. Transform back the zero-padded spatial spectrum (1d_Bw_FFT, 5.2 %, [n x m])
 7. Slow-time decompression (CEXP, MAC, 2.3 %, [n x m])
 8. Digitally-spotlighted SAR signal spectrum (1d_Fw_FFT, 5.2 %, [n x m])
 9. Generate the Doppler domain representation of the reference signal's complex conjugate (CEXP, MAC, 3.4 %, [n x m])
10. Circumvent edge processing effects (2D-FFT_shift, 0.4 %, [n x m])
11. 2D interpolation from a wedge to a rectangular area: input[n x m] -> output[nx x m] (MAC, Sin, Cos, 69 %, [nx x m])
12. Transform from the Doppler domain image into a spatial domain image: IFFT[nx x m] -> Transpose -> FFT[m x nx] (1d_Bw_FFT, 10 %, [m x nx])
13. Transform into a viewable image (CABS, 1.1 %, [m x nx])
6. The benchmarked ccNUMA (distributed shared memory) architecture
The ccNUMA machine consists of:
• 2 Nehalem CPUs: Intel(R) Xeon(R) CPU X5670
  – 2.93 GHz
  – 12 MB L3 Smart Cache
  – 6 Cores/CPU
  – TDP = 95 Watt
  – 6.4 Giga Transfers/s QPI (25.6 GB/s)
  – DDR3 1066 memory interfacing
• 36 Gigabytes of RAM
  – 18 GB/memory controller
[Diagram: two 6-core CPUs connected by QPI and an I/O controller, each CPU attached to three 6 GB memory banks]
8. Results on the ccNUMA machine
[Plot: speedup vs. number of cores (1-12) for Scale=10 and Scale=60]
9. The benchmarked distributed memory architecture
Nehalem cluster @ HLRS.de:
• Peak Performance: 62 TFlops
• Number of Nodes: 700 Dual Socket Quad Core
• Processor: Intel Xeon (X5560) Nehalem @ 2.8 GHz, 8 MB Cache
• Memory/node: 12 GB
• Disk: 80 TB shared scratch (Lustre)
• Node-node interconnect: Infiniband, Gigabit Ethernet
10. MPI Master-Worker Model
• In MPI: row-by-row send-and-receive
• In MPI2: send and receive chunks of rows
• No more than 4 processes/node (8 cores) because of memory overhead
[Plot: speedup vs. number of nodes (1-16, 8 cores/node) for MPI, MPI2, MPI(2Proc/Node), MPI2(2Proc/Node), MPI(4Proc/Node), and MPI2(4Proc/Node)]
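The following sketch, which is not from the presentation, contrasts the two distribution schemes named above in C with MPI: the "MPI" variant ships the matrix row by row, while the "MPI2" variant ships one contiguous chunk of rows per worker. The matrix dimensions, tags, and the assumption that the rows divide evenly among workers are illustrative.

```c
/* Hypothetical sketch (not the authors' code): row-by-row vs. chunked
 * master-to-worker distribution. Sizes and tags are illustrative. */
#include <mpi.h>
#include <stdlib.h>

#define N_ROWS 1024          /* illustrative image height */
#define N_COLS 2048          /* illustrative image width  */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int workers = size - 1;                 /* rank 0 is the master     */
    int rows_per_worker = N_ROWS / workers; /* assume it divides evenly */

    if (rank == 0) {
        float *image = calloc((size_t)N_ROWS * N_COLS, sizeof(float));

        /* "MPI" style: one message per row. */
        for (int r = 0; r < N_ROWS; r++) {
            int dest = 1 + r / rows_per_worker;
            MPI_Send(image + (size_t)r * N_COLS, N_COLS, MPI_FLOAT,
                     dest, /*tag=*/r, MPI_COMM_WORLD);
        }

        /* "MPI2" style: one message per worker, carrying a whole chunk. */
        for (int w = 0; w < workers; w++) {
            MPI_Send(image + (size_t)w * rows_per_worker * N_COLS,
                     rows_per_worker * N_COLS, MPI_FLOAT,
                     w + 1, /*tag=*/0, MPI_COMM_WORLD);
        }
        free(image);
    } else {
        float *rows = malloc((size_t)rows_per_worker * N_COLS * sizeof(float));

        /* Receive row by row ... */
        for (int r = 0; r < rows_per_worker; r++)
            MPI_Recv(rows + (size_t)r * N_COLS, N_COLS, MPI_FLOAT,
                     0, MPI_ANY_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* ... and then as one chunk. */
        MPI_Recv(rows, rows_per_worker * N_COLS, MPI_FLOAT,
                 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        free(rows);
    }
    MPI_Finalize();
    return 0;
}
```

The chunked variant trades many small messages for a few large ones, which is the usual reason it scales better until per-process memory becomes the limit.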
11. MPI Memory Overhead
• This overhead comes from the data replication and reduction needed in the Interpolation Loop.
• To improve scalability without increasing memory consumption, a hybrid (MPI+OpenMP) version is implemented.
[Bar chart: memory consumption in gigabytes (Worker_mem, Master_mem, Total_mem) vs. number of processes (1-8); total consumption grows with the process count, up to about 27.6 GB at 8 processes]
12. Hybrid (MPI+OpenMP) Versions
• Hyb1: 1 process (8 OpenMP threads)/node.
• Hyb2: OpenMP FFTW + HyperThreading.
• Hyb3: non-computationally intensive work is done only by the Master process.
• Hyb4: send and receive chunks of rows.
[Plot: speedup vs. number of nodes (1-16, 8 cores/node) for Hyb1, Hyb2, Hyb3, Hyb4, Hyb4(2Pr/8Thr), and Hyb4(4Pr/4Thr)]
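A minimal sketch of the hybrid structure, again not the authors' code: one MPI process per node owns a block of rows, and OpenMP threads share the compute-heavy loop inside that block, so the node's memory is not replicated per core. The kernel() routine and all sizes are placeholders for the real 2DFMFI work.

```c
/* Hypothetical MPI+OpenMP sketch: one rank per node, threads inside. */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

#define N_ROWS 1024
#define N_COLS 2048

/* Placeholder for the per-sample work (e.g., the MAC/sin/cos of the
 * interpolation step); not the actual 2DFMFI kernel. */
static float kernel(float x) { return x * 1.000001f + 0.5f; }

int main(int argc, char **argv)
{
    int provided, rank, size;
    /* FUNNELED suffices when only the master thread calls MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int my_rows = N_ROWS / size;            /* assume it divides evenly */
    float *block = malloc((size_t)my_rows * N_COLS * sizeof(float));
    for (long i = 0; i < (long)my_rows * N_COLS; i++)
        block[i] = (float)rank;

    /* OpenMP threads split the rows of this process's block, so a
     * single MPI rank per node can still use all cores of the node. */
    #pragma omp parallel for schedule(static)
    for (int r = 0; r < my_rows; r++)
        for (int c = 0; c < N_COLS; c++)
            block[(size_t)r * N_COLS + c] =
                kernel(block[(size_t)r * N_COLS + c]);

    free(block);
    MPI_Finalize();
    return 0;
}
```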
13. Master-Worker Bottlenecks
• In some steps of SSP, the data is collected by the Master process and then distributed again to the Workers after the respective step.
• Such steps are:
  – the 2-D FFT_SHIFT
  – transposition operations
  – the reduction operation after the Interpolation Loop
14. Inter-process Communication in the FFT_SHIFT
• Notional depiction of the fftshift operation: the quadrants [A B; C D] are swapped to [D C; B A].
• New communication pattern:
  – Nodes communicate in couples.
  – Nodes that hold the data of the first and second quadrants send and receive data only to and from the nodes holding the third and fourth quadrants, respectively.
[Diagram: with 4 processes, PID 0/1 initially hold the A/B halves and PID 2/3 the C/D halves; paired processes exchange their stripes and then swap the left and right halves locally, ending with PID 0/1 holding D/C and PID 2/3 holding B/A]
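A hypothetical sketch of this pairwise exchange, assuming a row-stripe decomposition and an even number of ranks (the stripe size and data are illustrative, not the authors' implementation): each rank swaps its stripe with the partner rank in the other half of the image via MPI_Sendrecv, then performs the left/right swap locally with no further communication.

```c
/* Hypothetical pairwise fftshift exchange over row stripes. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define STRIPE_ROWS 256
#define N_COLS 2048

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* assume size is even */

    size_t n = (size_t)STRIPE_ROWS * N_COLS;
    float *stripe  = malloc(n * sizeof(float));
    float *recvbuf = malloc(n * sizeof(float));
    for (size_t i = 0; i < n; i++) stripe[i] = (float)rank;

    /* Step 1: vertical swap, each rank exchanges its stripe with its
     * partner in the other half of the image. */
    int partner = (rank + size / 2) % size;
    MPI_Sendrecv(stripe,  (int)n, MPI_FLOAT, partner, 0,
                 recvbuf, (int)n, MPI_FLOAT, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Step 2: horizontal swap, purely local, no communication needed. */
    int half = N_COLS / 2;
    for (int r = 0; r < STRIPE_ROWS; r++) {
        float *row = recvbuf + (size_t)r * N_COLS;
        float *dst = stripe  + (size_t)r * N_COLS;
        memcpy(dst,        row + half, half * sizeof(float));
        memcpy(dst + half, row,        half * sizeof(float));
    }

    free(stripe);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

This avoids routing the whole array through the master: each rank talks to exactly one partner.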
16. Reduction in the Interpolation Loop
• To avoid a collective reduction, a local reduction is applied between neighboring processes.
• This reduces only the overlapped regions.
• The reduction is scheduled in an ordered way:
  – the first process sends its data to the second process, which accumulates the new values with the old ones and sends the results back to the first process.
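A minimal sketch of this ordered neighbor reduction, under illustrative assumptions (overlap size, 1-D neighbor layout) rather than the authors' actual code: each process sends its overlap region to its right neighbor, which accumulates it into its own values and returns the result, so only the overlapped rows are ever communicated.

```c
/* Hypothetical ordered neighbor reduction over overlap regions. */
#include <mpi.h>
#include <stdlib.h>

#define OVERLAP_ROWS 16
#define N_COLS 2048

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    size_t n = (size_t)OVERLAP_ROWS * N_COLS;
    float *overlap  = malloc(n * sizeof(float));  /* this rank's overlap region */
    float *incoming = malloc(n * sizeof(float));
    for (size_t i = 0; i < n; i++) overlap[i] = 1.0f;

    if (rank + 1 < size) {
        /* "First" process of the pair: send the overlap, then receive
         * the accumulated values back. */
        MPI_Send(overlap, (int)n, MPI_FLOAT, rank + 1, 0, MPI_COMM_WORLD);
        MPI_Recv(overlap, (int)n, MPI_FLOAT, rank + 1, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
    if (rank > 0) {
        /* "Second" process of the pair: accumulate and return the result. */
        MPI_Recv(incoming, (int)n, MPI_FLOAT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        for (size_t i = 0; i < n; i++) incoming[i] += overlap[i];
        MPI_Send(incoming, (int)n, MPI_FLOAT, rank - 1, 1, MPI_COMM_WORLD);
    }

    free(overlap);
    free(incoming);
    MPI_Finalize();
    return 0;
}
```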
17. Pipelining the SSP Steps
• Each node processes a single image:
  – less inter-process communication
• It takes longer to reconstruct the first image,
  – but less time for the subsequent images.
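A sketch of the idea under simple assumptions (round-robin assignment of whole images to ranks, with reconstruct_image() standing in for the full SSP chain): instead of splitting one image across all nodes, each node reconstructs complete images on its own, which removes inter-process communication from the steady state at the cost of a longer latency for the first image.

```c
/* Hypothetical sketch of processing whole images per rank. */
#include <mpi.h>
#include <stdio.h>

#define N_IMAGES 32

/* Placeholder for steps 1-13 of the SSP chain applied to one raw image. */
static void reconstruct_image(int image_id)
{
    printf("image %d reconstructed\n", image_id);
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Round-robin assignment: rank r handles images r, r+size, r+2*size, ... */
    for (int img = rank; img < N_IMAGES; img += size)
        reconstruct_image(img);

    MPI_Finalize();
    return 0;
}
```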
18. Speedup and Execution Time
p p
90
80 Hyb4
70 Hyb5
60 Pipelined
Speedup
50
40
30
20
100
10 90
0 80
1 8 16 32 64 96 128 70
psed Time in Seconds
Number f C
N b of Cores(8 Cores per Node)
(8 C N d ) 60
50
40
30
20
Ellap
10
0
Number of Cores 8 16 32 64 96 128
Hyb4 92.49 62.6 44.5 34.44 34.14 34.12
Hyb5 92.49 50.56 28.84 18.41 15.13 13.97
Pipelined 92.49 46.43 24.8 13.88 10.325 8.42
19. Summary and Conclusions
• In shared-memory systems, the application can be efficiently parallelized, but the performance will always be limited by hardware resources.
• In distributed-memory systems, hardware resources on non-local nodes become available at the cost of communication overhead.
• Performance improves with the number of resources,
  – but efficiency does not scale at the same rate.
• The duty of each designer is to find the right compromise between performance and other factors such as
  – power consumption
  – size
  – heat dissipation
20. Thank You!
Questions?
Fisnik Kraja
Chair of Computer Architecture
Technische Universität München
kraja@in.tum.de