Parallelization Techniques for the 2D Fourier Matched Filtering and Interpolation SAR Algorithm
1. Parallelization Techniques for the 2D Fourier Matched Filtering and Interpolation SAR Algorithm
Fisnik Kraja, Georg Acher, Arndt Bode
Chair of Computer Architecture, Technische Universität München
kraja@in.tum.de, acher@in.tum.de, bode@in.tum.de
2012 IEEE Aerospace Conference, 3-10 March 2012, Big Sky, Montana
2. The main points will be:
• The motivation statement
• Description of the SAR 2DFMFI application
• Description of the benchmarked architectures
• Parallelization techniques and results on
  – shared-memory and
  – distributed-memory architectures
• Specific optimizations for distributed-memory environments
• Summary and conclusions
3. Motivation
• Current and future space applications with onboard high-performance requirements
  – Observation satellites with increased
    • image resolutions
    • data sets
    • computational requirements
• Novel and interesting research based on many-cores for space (Dependable Multiprocessor and Maestro)
• The tendency to fly COTS products to space
• The performance/power ratio depends directly on the scalability of applications.
4. SAR 2DFMFI Application
• Synthetic Data Generation (SDG): synthetic SAR returns from a uniform grid of point reflectors (the raw data).
• SAR Sensor Processing (SSP): the reconstructed SAR image is obtained by applying the 2D Fourier Matched Filtering and Interpolation algorithm.

Problem sizes:
SCALE   mc     n      m      nx
10      1600   3290   3808   2474
20      3200   6460   7616   4926
30      4800   9630   11422  7380
60      9600   19140  22844  14738
5. SAR Sensor Processing Profiling
SSP processing step (computation type, execution time in %, size & layout):
 1. Filter the echoed signal (1d_Fw_FFT, 1.1 %, [mc x n])
 2. Transposition (0.3 %, [n x mc])
 3. Signal compression along slow-time (CEXP, MAC, 1.1 %, [n x mc])
 4. Narrow-bandwidth polar format reconstruction along slow-time (1d_Fw_FFT, 0.5 %, [n x mc])
 5. Zero pad the spatial frequency domain's compressed signal (0.4 %, [n x mc])
 6. Transform back the zero-padded spatial spectrum (1d_Bw_FFT, 5.2 %, [n x m])
 7. Slow-time decompression (CEXP, MAC, 2.3 %, [n x m])
 8. Digitally-spotlighted SAR signal spectrum (1d_Fw_FFT, 5.2 %, [n x m])
 9. Generate the Doppler domain representation of the reference signal's complex conjugate (CEXP, MAC, 3.4 %, [n x m])
10. Circumvent edge processing effects (2D-FFT_shift, 0.4 %, [n x m])
11. 2D interpolation from a wedge to a rectangular area: input[n x m] -> output[nx x m] (MAC, Sin, Cos, 69 %, [nx x m])
12. Transform from the Doppler domain image into a spatial domain image: IFFT[nx x m] -> Transpose -> FFT[m x nx] (1d_Bw_FFT, 10 %, [m x nx])
13. Transform into a viewable image (CABS, 1.1 %, [m x nx])
6. The benchmarked ccNUMA (distributed shared memory) architecture
The ccNUMA machine consists of:
• 2 Nehalem CPUs: Intel(R) Xeon(R) CPU X5670
  – 2.93 GHz
  – 12 MB L3 Smart Cache
  – 6 Cores/CPU
  – TDP = 95 Watt
  – 6.4 Giga Transfers/s QPI (25.6 GB/s)
  – DDR3 1066 memory interfacing
• 36 Gigabytes of RAM
  – 18 GB/memory controller
[Diagram: two 6-core CPUs connected by QPI and an I/O controller, each CPU attached to three 6 GB memory banks]
8. Results on the ccNUMA machine
[Plot: speedup vs. number of cores (1-12) for Scale=10 and Scale=60]
9. The benchmarked distributed memory architecture
Nehalem cluster @ HLRS.de:
• Peak Performance: 62 TFlops
• Number of Nodes: 700 Dual Socket Quad Core
• Processor: Intel Xeon (X5560) Nehalem @ 2.8 GHz, 8 MB Cache
• Memory/node: 12 GB
• Disk: 80 TB shared scratch (Lustre)
• Node-node interconnect: Infiniband, Gigabit Ethernet
10. MPI Master-Worker Model
• In MPI: row-by-row send-and-receive
• In MPI2: send and receive chunks of rows
• No more than 4 processes/node (8 cores) because of memory overhead
[Plot: speedup vs. number of nodes (1-16, 8 cores/node) for MPI, MPI2, MPI(2Proc/Node), MPI2(2Proc/Node), MPI(4Proc/Node), and MPI2(4Proc/Node)]
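The following sketch, which is not from the presentation, contrasts the two distribution schemes named above in C with MPI: the "MPI" variant ships the matrix row by row, while the "MPI2" variant ships one contiguous chunk of rows per worker. The matrix dimensions, tags, and the assumption that the rows divide evenly among workers are illustrative.

```c
/* Hypothetical sketch (not the authors' code): row-by-row vs. chunked
 * master-to-worker distribution. Sizes and tags are illustrative. */
#include <mpi.h>
#include <stdlib.h>

#define N_ROWS 1024          /* illustrative image height */
#define N_COLS 2048          /* illustrative image width  */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int workers = size - 1;                 /* rank 0 is the master     */
    int rows_per_worker = N_ROWS / workers; /* assume it divides evenly */

    if (rank == 0) {
        float *image = calloc((size_t)N_ROWS * N_COLS, sizeof(float));

        /* "MPI" style: one message per row. */
        for (int r = 0; r < N_ROWS; r++) {
            int dest = 1 + r / rows_per_worker;
            MPI_Send(image + (size_t)r * N_COLS, N_COLS, MPI_FLOAT,
                     dest, /*tag=*/r, MPI_COMM_WORLD);
        }

        /* "MPI2" style: one message per worker, carrying a whole chunk. */
        for (int w = 0; w < workers; w++) {
            MPI_Send(image + (size_t)w * rows_per_worker * N_COLS,
                     rows_per_worker * N_COLS, MPI_FLOAT,
                     w + 1, /*tag=*/0, MPI_COMM_WORLD);
        }
        free(image);
    } else {
        float *rows = malloc((size_t)rows_per_worker * N_COLS * sizeof(float));

        /* Receive row by row ... */
        for (int r = 0; r < rows_per_worker; r++)
            MPI_Recv(rows + (size_t)r * N_COLS, N_COLS, MPI_FLOAT,
                     0, MPI_ANY_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* ... and then as one chunk. */
        MPI_Recv(rows, rows_per_worker * N_COLS, MPI_FLOAT,
                 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        free(rows);
    }
    MPI_Finalize();
    return 0;
}
```

The chunked variant trades many small messages for a few large ones, which is the usual reason it scales better until per-process memory becomes the limit.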
11. MPI Memory Overhead
• This overhead comes from the data replication and reduction needed in the Interpolation Loop.
• To improve scalability without increasing memory consumption, a hybrid (MPI+OpenMP) version is implemented.
[Bar chart: memory consumption in gigabytes (Worker_mem, Master_mem, Total_mem) vs. number of processes (1-8); total consumption grows with the process count, up to about 27.6 GB at 8 processes]
12. Hybrid (MPI+OpenMP) Versions
• Hyb1: 1 process (8 OpenMP threads)/node.
• Hyb2: OpenMP FFTW + HyperThreading.
• Hyb3: non-computationally intensive work is done only by the Master process.
• Hyb4: send and receive chunks of rows.
[Plot: speedup vs. number of nodes (1-16, 8 cores/node) for Hyb1, Hyb2, Hyb3, Hyb4, Hyb4(2Pr/8Thr), and Hyb4(4Pr/4Thr)]
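A minimal sketch of the hybrid structure, again not the authors' code: one MPI process per node owns a block of rows, and OpenMP threads share the compute-heavy loop inside that block, so the node's memory is not replicated per core. The kernel() routine and all sizes are placeholders for the real 2DFMFI work.

```c
/* Hypothetical MPI+OpenMP sketch: one rank per node, threads inside. */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

#define N_ROWS 1024
#define N_COLS 2048

/* Placeholder for the per-sample work (e.g., the MAC/sin/cos of the
 * interpolation step); not the actual 2DFMFI kernel. */
static float kernel(float x) { return x * 1.000001f + 0.5f; }

int main(int argc, char **argv)
{
    int provided, rank, size;
    /* FUNNELED suffices when only the master thread calls MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int my_rows = N_ROWS / size;            /* assume it divides evenly */
    float *block = malloc((size_t)my_rows * N_COLS * sizeof(float));
    for (long i = 0; i < (long)my_rows * N_COLS; i++)
        block[i] = (float)rank;

    /* OpenMP threads split the rows of this process's block, so a
     * single MPI rank per node can still use all cores of the node. */
    #pragma omp parallel for schedule(static)
    for (int r = 0; r < my_rows; r++)
        for (int c = 0; c < N_COLS; c++)
            block[(size_t)r * N_COLS + c] =
                kernel(block[(size_t)r * N_COLS + c]);

    free(block);
    MPI_Finalize();
    return 0;
}
```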
13. Master-Worker Bottlenecks
• In some steps of SSP, the data is collected by the Master process and then distributed again to the Workers after the respective step.
• Such steps are:
  – the 2-D FFT_SHIFT
  – transposition operations
  – the reduction operation after the Interpolation Loop
14. Inter-process Communication in the FFT_SHIFT
• Notional depiction of the fftshift operation: the quadrants [A B; C D] are swapped to [D C; B A].
• New communication pattern:
  – Nodes communicate in couples.
  – Nodes that hold the data of the first and second quadrants send and receive data only to and from the nodes holding the third and fourth quadrants, respectively.
[Diagram: with 4 processes, PID 0/1 initially hold the A/B halves and PID 2/3 the C/D halves; paired processes exchange their stripes and then swap the left and right halves locally, ending with PID 0/1 holding D/C and PID 2/3 holding B/A]
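A hypothetical sketch of this pairwise exchange, assuming a row-stripe decomposition and an even number of ranks (the stripe size and data are illustrative, not the authors' implementation): each rank swaps its stripe with the partner rank in the other half of the image via MPI_Sendrecv, then performs the left/right swap locally with no further communication.

```c
/* Hypothetical pairwise fftshift exchange over row stripes. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define STRIPE_ROWS 256
#define N_COLS 2048

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* assume size is even */

    size_t n = (size_t)STRIPE_ROWS * N_COLS;
    float *stripe  = malloc(n * sizeof(float));
    float *recvbuf = malloc(n * sizeof(float));
    for (size_t i = 0; i < n; i++) stripe[i] = (float)rank;

    /* Step 1: vertical swap, each rank exchanges its stripe with its
     * partner in the other half of the image. */
    int partner = (rank + size / 2) % size;
    MPI_Sendrecv(stripe,  (int)n, MPI_FLOAT, partner, 0,
                 recvbuf, (int)n, MPI_FLOAT, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Step 2: horizontal swap, purely local, no communication needed. */
    int half = N_COLS / 2;
    for (int r = 0; r < STRIPE_ROWS; r++) {
        float *row = recvbuf + (size_t)r * N_COLS;
        float *dst = stripe  + (size_t)r * N_COLS;
        memcpy(dst,        row + half, half * sizeof(float));
        memcpy(dst + half, row,        half * sizeof(float));
    }

    free(stripe);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

This avoids routing the whole array through the master: each rank talks to exactly one partner.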
16. Reduction in the Interpolation Loop
• To avoid a collective reduction, a local reduction is applied between neighboring processes.
• This reduces only the overlapped regions.
• The reduction is scheduled in an ordered way:
  – the first process sends its data to the second process, which accumulates the new values with the old ones and sends the results back to the first process.
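A minimal sketch of this ordered neighbor reduction, under illustrative assumptions (overlap size, 1-D neighbor layout) rather than the authors' actual code: each process sends its overlap region to its right neighbor, which accumulates it into its own values and returns the result, so only the overlapped rows are ever communicated.

```c
/* Hypothetical ordered neighbor reduction over overlap regions. */
#include <mpi.h>
#include <stdlib.h>

#define OVERLAP_ROWS 16
#define N_COLS 2048

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    size_t n = (size_t)OVERLAP_ROWS * N_COLS;
    float *overlap  = malloc(n * sizeof(float));  /* this rank's overlap region */
    float *incoming = malloc(n * sizeof(float));
    for (size_t i = 0; i < n; i++) overlap[i] = 1.0f;

    if (rank + 1 < size) {
        /* "First" process of the pair: send the overlap, then receive
         * the accumulated values back. */
        MPI_Send(overlap, (int)n, MPI_FLOAT, rank + 1, 0, MPI_COMM_WORLD);
        MPI_Recv(overlap, (int)n, MPI_FLOAT, rank + 1, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
    if (rank > 0) {
        /* "Second" process of the pair: accumulate and return the result. */
        MPI_Recv(incoming, (int)n, MPI_FLOAT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        for (size_t i = 0; i < n; i++) incoming[i] += overlap[i];
        MPI_Send(incoming, (int)n, MPI_FLOAT, rank - 1, 1, MPI_COMM_WORLD);
    }

    free(overlap);
    free(incoming);
    MPI_Finalize();
    return 0;
}
```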
17. Pipelining the SSP Steps
• Each node processes a single image:
  – less inter-process communication
• It takes longer to reconstruct the first image,
  – but less time for the subsequent images.
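A sketch of the idea under simple assumptions (round-robin assignment of whole images to ranks, with reconstruct_image() standing in for the full SSP chain): instead of splitting one image across all nodes, each node reconstructs complete images on its own, which removes inter-process communication from the steady state at the cost of a longer latency for the first image.

```c
/* Hypothetical sketch of processing whole images per rank. */
#include <mpi.h>
#include <stdio.h>

#define N_IMAGES 32

/* Placeholder for steps 1-13 of the SSP chain applied to one raw image. */
static void reconstruct_image(int image_id)
{
    printf("image %d reconstructed\n", image_id);
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Round-robin assignment: rank r handles images r, r+size, r+2*size, ... */
    for (int img = rank; img < N_IMAGES; img += size)
        reconstruct_image(img);

    MPI_Finalize();
    return 0;
}
```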
18. Speedup and Execution Time
p p
90
80 Hyb4
70 Hyb5
60 Pipelined
Speedup
50
40
30
20
100
10 90
0 80
1 8 16 32 64 96 128 70
psed Time in Seconds
Number f C
N b of Cores(8 Cores per Node)
(8 C N d ) 60
50
40
30
20
Ellap
10
0
Number of Cores 8 16 32 64 96 128
Hyb4 92.49 62.6 44.5 34.44 34.14 34.12
Hyb5 92.49 50.56 28.84 18.41 15.13 13.97
Pipelined 92.49 46.43 24.8 13.88 10.325 8.42
19. Summary and Conclusions
• In shared-memory systems, the application can be efficiently parallelized, but the performance will always be limited by hardware resources.
• In distributed-memory systems, hardware resources on non-local nodes become available at the cost of communication overhead.
• Performance improves with the number of resources,
  – but efficiency does not scale at the same rate.
• The duty of each designer is to find the right compromise between performance and other factors such as
  – power consumption
  – size
  – heat dissipation
20. Thank You!
Questions?
Fisnik Kraja
Chair of Computer Architecture
Technische Universität München
kraja@in.tum.de