Parallelization Techniques for the 2D
Fourier Matched Filtering and
Interpolation SAR Algorithm

Fisnik Kraja, Georg Acher, Arndt Bode
Chair of Computer Architecture, Technische Universität München
kraja@in.tum.de, acher@in.tum.de, bode@in.tum.de

   2012 IEEE Aerospace Conference, 3-10 March 2012, Big Sky, Montana
The main points will be:

   •    The motivation statement
   •    Description of the SAR 2DFMFI application
   •    Description of the benchmarked architectures
   •    Parallelization techniques and results on
          – shared-memory and
          – distributed-memory architectures
   •    Specific optimizations for distributed-memory environments
   •    Summary and conclusions


February 24, 2012                                      2
Motivation
• Current and future space applications with onboard high-performance
  requirements
     – Observation satellites with increased
            • Image resolutions
            • Data sets
            • Computational requirements

• Novel and interesting research based on many-cores for space
  (Dependable Multiprocessor and Maestro)

• The tendency to fly COTS products to space

• The performance/power ratio depends directly on the scalability of
  applications.
February 24, 2012                                                3
SAR 2DFMFI Application

Synthetic Data Generation (SDG): synthetic SAR returns from a uniform
grid of point reflectors.

SAR Sensor Processing (SSP): the reconstructed SAR image is obtained by
applying the 2D Fourier Matched Filtering and Interpolation.

                        Raw Data            Reconstructed Image
      SCALE         mc         n            m            nx
         10         1600       3290         3808         2474
         20         3200       6460         7616         4926
         30         4800       9630         11422        7380
         60         9600       19140        22844        14738




February 24, 2012                                                                        4
SAR Sensor Processing Profiling

      SSP Processing Step                                             Computation    Execution   Size &
                                                                      Type           Time in %   Layout
1.    Filter the echoed signal                                        1d_Fw_FFT         1.1      [mc x n]
2.    Transposition is needed                                                           0.3      [n x mc]
3.    Signal compression along slow-time                              CEXP, MAC         1.1      [n x mc]
4.    Narrow-bandwidth polar format reconstruction along slow-time    1d_Fw_FFT         0.5      [n x mc]
5.    Zero-pad the spatial frequency domain's compressed signal                         0.4      [n x mc]
6.    Transform back the zero-padded spatial spectrum                 1d_Bw_FFT         5.2      [n x m]
7.    Slow-time decompression                                         CEXP, MAC         2.3      [n x m]
8.    Digitally-spotlighted SAR signal spectrum                       1d_Fw_FFT         5.2      [n x m]
9.    Generate the Doppler domain representation of the               CEXP, MAC         3.4      [n x m]
      reference signal's complex conjugate
10.   Circumvent edge processing effects                              2D-FFT_shift      0.4      [n x m]
11.   2D Interpolation from a wedge to a rectangular area:            MAC, Sin, Cos     69       [nx x m]
      input[n x m] -> output[nx x m]
12.   Transform from the Doppler domain image into a spatial domain   1d_Bw_FFT         10       [m x nx]
      image: IFFT[nx x m] -> Transpose -> FFT[m x nx]
13.   Transform into a viewable image                                 CABS              1.1      [m x nx]


 February 24, 2012                                                                                   5
The benchmarked ccNUMA
(distributed shared memory)

The ccNUMA machine consists of:
• 2 Nehalem CPUs: Intel(R) Xeon(R) CPU X5670
     – 2.93 GHz
     – 12 MB L3 Smart Cache
     – 6 Cores/CPU
     – TDP = 95 Watt
     – 6.4 Giga Transfers/s QPI (25.6 GB/s)
     – DDR3-1066 memory interfacing
• 36 Gigabytes of RAM
     – (18 GB/memory controller)

(Figure: two 6-core CPUs, each attached to three 6 GB memory banks and
connected to a shared I/O controller)




February 24, 2012                                                                                     6
Parallelization techniques on the
ccNUMA machine
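
As a hedged illustration of loop-level parallelization on the
shared-memory machine (assuming OpenMP, which the hybrid versions later
build on), here is a minimal sketch of parallelizing the dominant
interpolation step; the function name and loop structure are
illustrative, not the authors' code:

```c
/* Minimal OpenMP sketch (illustrative, assuming loop-level parallelism):
 * the dominant 2-D interpolation loop is split over output columns.
 * schedule(static) keeps each thread on the same data across phases,
 * which helps first-touch page placement on the two NUMA sockets. */
#include <omp.h>
#include <complex.h>

void interpolate_omp(double complex *out, int nx, int m)
{
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < m; j++) {
        for (int i = 0; i < nx; i++) {
            /* the MAC/sin/cos kernel mapping the wedge-shaped spectrum
               to a rectangular grid would go here; only the structure
               is shown */
            out[(long)j * nx + i] = 0;
        }
    }
}
```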




February 24, 2012                   7
Results on the ccNUMA machine

(Figure: speedup over the number of cores, 1 to 12, for Scale=60 and
Scale=10)


February 24, 2012                                                               8
The benchmarked distributed
memory architecture

                     Nehalem cluster @HLRS.de

 Peak Performance        62 TFlops
 Number of Nodes         700 Dual Socket Quad Core
 Processor               Intel Xeon (X5560) Nehalem @ 2.8 GHz, 8MB Cache
 Memory/node             12 GB
 Disk                    80 TB shared scratch (Lustre)
 Node-node interconnect  InfiniBand, Gigabit Ethernet

February 24, 2012                                   9
MPI Master-Worker Model
      • In MPI: row-by-row send-and-receive
      • In MPI2: send and receive chunks of rows (sketched below)
      • No more than 4 processes/node (8 cores) because of memory overhead

(Figure: speedup over 1-16 nodes, 8 cores/node, for MPI and MPI2 with
1, 2, and 4 processes per node)
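
As a hedged illustration of the chunked MPI2 variant, the following
sketch shows a master distributing image rows in chunks rather than one
row at a time; all names and sizes are illustrative, not the authors'
code:

```c
/* Illustrative sketch (not the authors' code): the master distributes
 * the rows of an [n_rows x row_len] double array to workers in chunks,
 * so each MPI message carries several rows and per-message latency is
 * amortized. chunk = 1 reproduces the row-by-row MPI variant. */
#include <mpi.h>

void distribute_rows(const double *image, int n_rows, int row_len,
                     int n_workers, int chunk)
{
    int row = 0, worker = 1;                 /* workers are ranks 1..n */
    while (row < n_rows) {
        int count = (n_rows - row < chunk) ? n_rows - row : chunk;
        MPI_Send(&image[(long)row * row_len], count * row_len,
                 MPI_DOUBLE, worker, row /* tag = first row index */,
                 MPI_COMM_WORLD);
        row += count;
        worker = worker % n_workers + 1;     /* round-robin assignment */
    }
}
```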

     February 24, 2012                                                                10
MPI Memory Overhead
• This overhead comes from the data replication and reduction needed
  in the Interpolation Loop
• To improve the scalability without increasing memory consumption, a
  hybrid (MPI+OpenMP) version is implemented.

(Figure: worker, master, and total memory consumption in gigabytes for
1-8 processes; total memory grows from 13 GB with one process to
27.6 GB with eight)
February 24, 2012                                                                                                                         11
Hybrid (MPI+OpenMP) Versions

Hyb1:
    – 1 Process (8 OpenMP threads)/Node.
Hyb2:
    – OpenMP FFTW + HyperThreading.
Hyb3:
    – Non-computationally intensive work is done only by the Master
      process.
Hyb4:
    – Send and receive chunks of rows.

(Figure: speedup over 1-16 nodes, 8 cores/node, for Hyb1, Hyb2, Hyb3,
Hyb4, Hyb4(2Pr/8Thr), and Hyb4(4Pr/4Thr))
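
A minimal structural sketch of such a hybrid worker follows, assuming
one MPI process per node with OpenMP threads inside; `local_rows` and
the loop body are placeholders, not the authors' code:

```c
/* Minimal hybrid MPI+OpenMP sketch (illustrative): one MPI process per
 * node communicates row chunks, while OpenMP threads share the per-row
 * computation. MPI_THREAD_FUNNELED suffices because only the main
 * thread calls MPI. */
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local_rows = 1024;      /* placeholder: rows received from master */

    /* ... main thread receives its chunk of rows here ... */

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < local_rows; i++) {
        /* per-row work (1-D FFTs, compression, interpolation) goes here */
    }

    /* ... main thread returns results; rank 0 performs the serial steps ... */
    MPI_Finalize();
    return 0;
}
```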
 February 24, 2012                                                                             12
Master-Worker Bottlenecks

• In some steps of SSP, the data is collected by the
  Master process and then distributed again to the
  Workers after the respective step.

• Such steps are:
       – The 2-D FFT_SHIFT
       – Transposition Operations
       – The Reduction Operation after the Interpolation Loop


February 24, 2012                                               13
Inter-process Communication in
the FFT_SHIFT

Notional depiction of the fftshift operation:

      A  B          D  C
      C  D    ->    B  A

• New Communication Pattern
       – Nodes communicate in couples
       – Nodes that have the data of the first and second quadrants
         send and receive data only to and from nodes with the third
         and fourth quadrants, respectively.

(Diagram: four processes, PID 0-3, initially hold row blocks
A1 B1 / A2 B2 / C1 D1 / C2 D2; after the pairwise exchange the halves
are swapped to C1 D1 / C2 D2 / A1 B1 / A2 B2, and a local left-right
swap yields the shifted layout D1 C1 / D2 C2 / B1 A1 / B2 A2)
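
A hedged sketch of this pairwise exchange: rank r swaps its block with
rank (r + P/2) mod P, so quadrant data moves directly between paired
nodes instead of through the master. Buffer names are illustrative:

```c
/* Illustrative sketch of the pairwise fftshift exchange: with P ranks
 * holding horizontal blocks of the image, rank r and its vertical
 * mirror partner (r + P/2) mod P swap blocks directly, matching the
 * PID tables above. The left-right (A<->B, C<->D) swap is then a
 * purely local operation on each row. */
#include <mpi.h>

void fftshift_exchange(const double *block, double *partner_block,
                       int block_elems, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int partner = (rank + size / 2) % size;  /* couples: 0<->2, 1<->3 for P=4 */

    MPI_Sendrecv((void *)block, block_elems, MPI_DOUBLE, partner, 0,
                 partner_block, block_elems, MPI_DOUBLE, partner, 0,
                 comm, MPI_STATUS_IGNORE);
}
```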
February 24, 2012                                                14
Inter-Process Transposition

Data Partitioning (Tiling) and Buffering:

(Diagram: each process PID 0-3 initially holds one row block D0-D3;
each block is split into tiles D00-D33. In the resulting communication
pattern, PID j receives tiles D0j, D1j, D2j, D3j, which are then
transposed locally.)
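
A hedged sketch of this tiled transposition: each rank packs its block
into P contiguous tiles and a single all-to-all realizes the tile
movement shown above; names are illustrative, not the authors' code:

```c
/* Illustrative sketch of the tiled transposition: tile j of each
 * rank's block goes to rank j, and each rank receives tile "my rank"
 * of every other block -- exactly the Dij pattern on this slide. */
#include <mpi.h>

void transpose_across_ranks(const double *tiles_out, double *tiles_in,
                            int tile_elems, MPI_Comm comm)
{
    /* tiles_out: P tiles of tile_elems doubles each, already packed so
       that tile j is contiguous; tiles_in receives one tile per rank */
    MPI_Alltoall((void *)tiles_out, tile_elems, MPI_DOUBLE,
                 tiles_in, tile_elems, MPI_DOUBLE, comm);

    /* each received tile is then transposed locally (cache-sized tiles
       keep this step fast), completing the global [n x m] -> [m x n]
       transposition */
}
```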



February 24, 2012                                                                                                        15
Reduction in the Interpolation Loop
• To avoid a collective reduction, a local reduction is applied
  between neighbor processes.
• This reduces only the overlapped regions.
• The reduction is scheduled in an ordered way:
       – the first process sends the data to the second process, which
         accumulates the new values with the old ones and sends the
         results back to the first process.
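
A hedged sketch of this ordered pairwise reduction, assuming each
rank's interpolation output overlaps its upper neighbor by
`halo_elems` values; buffer names are illustrative:

```c
/* Illustrative sketch of the ordered neighbor reduction: only the
 * overlapped halo region is exchanged between ranks r and r+1. The
 * lower rank sends its overlap, the upper rank accumulates new values
 * with its own, and the summed result is sent back, as described above. */
#include <mpi.h>

#define MAX_HALO 4096                     /* illustrative upper bound */

void reduce_overlap(double *my_lo, double *my_hi, int halo_elems,
                    int rank, int size, MPI_Comm comm)
{
    double tmp[MAX_HALO];

    if (rank + 1 < size) {                /* acts as "first" of the pair */
        MPI_Send(my_hi, halo_elems, MPI_DOUBLE, rank + 1, 0, comm);
        MPI_Recv(my_hi, halo_elems, MPI_DOUBLE, rank + 1, 1, comm,
                 MPI_STATUS_IGNORE);      /* accumulated values returned */
    }
    if (rank > 0) {                       /* acts as "second" of the pair */
        MPI_Recv(tmp, halo_elems, MPI_DOUBLE, rank - 1, 0, comm,
                 MPI_STATUS_IGNORE);
        for (int i = 0; i < halo_elems; i++)
            my_lo[i] += tmp[i];           /* new values + old ones */
        MPI_Send(my_lo, halo_elems, MPI_DOUBLE, rank - 1, 1, comm);
    }
}
```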




February 24, 2012                                                            16
Pipelining the SSP Steps

• Each node processes a single image:
       – less inter-process communication
• It takes longer to reconstruct the first image,
       – but less time for the other images
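
A hedged sketch of this scheme; `reconstruct_image` is a hypothetical
helper standing in for the full SSP chain:

```c
/* Illustrative sketch of the pipelined scheme: each image is
 * reconstructed entirely on one node, so the SSP steps require no
 * inter-node exchanges; nodes only receive raw data and return
 * finished images. */
void reconstruct_image(int image_id);    /* hypothetical: full SSP chain */

void process_stream(int n_images, int rank, int n_nodes)
{
    for (int k = 0; k < n_images; k++)
        if (k % n_nodes == rank)         /* round-robin image assignment */
            reconstruct_image(k);
}
```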




February 24, 2012                      17
Speedup and Execution Time

(Figure: speedup over 1-128 cores, 8 cores per node, for Hyb4, Hyb5,
and the pipelined version)

Elapsed time in seconds:

  Number of Cores        8      16      32      64      96      128
  Hyb4                 92.49   62.6    44.5    34.44   34.14   34.12
  Hyb5                 92.49   50.56   28.84   18.41   15.13   13.97
  Pipelined            92.49   46.43   24.8    13.88   10.325  8.42
          February 24, 2012                                                                                                                      18
Summary and Conclusions

• In shared-memory systems, the application can be efficiently parallelized, but
  the performance will always be limited by hardware resources.

• In distributed-memory systems, hardware resources on non-local nodes
  become available at the cost of communication overhead.

• Performance improves with the number of resources,
       – Efficiency is not on the same scale.

• The duty of each designer is to find the perfect compromise between
  performance and other factors like
       – power consumption
       – size
       – heat dissipation



February 24, 2012                                                                  19
Thank You!

Questions?

                         Fisnik Kraja
                    Chair of Computer Architecture
                   Technische Universität München
                                  kraja@in.tum.de
