ISUM 2012, Guanajuato, Mexico

            Hands on work on
     AMD technologies for HPC solutions
                   Joshua.Mora@amd.com

ABSTRACT:

The goal of this talk is to present in a practical way (through a hands-on
session) how the latest AMD technology works and meets current
high performance computing requirements. The session reviews
performance metrics such as GFLOP/s and GB/s, the efficiency of the
FPU and of the memory controllers/channels, the scalability of multi-socket
platforms, tuning tips such as process/thread affinity, multi-InfiniBand/GPU
configurations and their I/O affinity, the impact of appropriate math
libraries and compilers, and the power consumption characteristics of a
system when heavily stressed with different HPC workloads.
By the end of the talk/session you should walk away with
a good foundation on which building block technologies matter for
you and how to design and exploit your own HPC solutions.


       Performance metrics
– GFLOP/s (SP,DP) (SSE, FMA)
– GB/s (SP,DP) (streaming stores)
– Memory Latency (local/remote)
– Memory Bandwidth (local/remote)
– Network Latency
– Network Bandwidth
– Message rate (Network)
– IOPS, sustained reads/writes (storage)
– Roofline model (performance modeling)
Roofline model:
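The roofline model bounds attainable performance by the lower of two roofs: the compute rate and the memory bandwidth multiplied by the arithmetic intensity (FLOPs per byte moved). A minimal sketch in Python, assuming the per-numanode Interlagos figures quoted elsewhere in this deck (58.88 DP GF/s, 18.5 GB/s); the kernel names and intensities are illustrative assumptions, not measurements:

```python
def attainable_gflops(compute_roof_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline bound: the lower of the compute roof and bandwidth * intensity."""
    return min(compute_roof_gflops, bandwidth_gbs * flops_per_byte)


# Per-numanode Interlagos figures quoted elsewhere in this deck.
COMPUTE_ROOF = 58.88  # DP GF/s (2.3 GHz, 80% HPL efficiency)
BANDWIDTH = 18.5      # GB/s

# Arithmetic intensity (FLOPs per byte) at which the two roofs meet.
RIDGE = COMPUTE_ROOF / BANDWIDTH

# Illustrative intensities (assumptions, not measurements).
KERNELS = {
    "triad-like (2 FLOPs per 24 bytes)": 2 / 24,
    "moderate intensity": 1.0,
    "DGEMM-like (high intensity)": 16.0,
}

if __name__ == "__main__":
    print(f"ridge point: {RIDGE:.2f} FLOPs/byte")
    for name, intensity in KERNELS.items():
        bound = attainable_gflops(COMPUTE_ROOF, BANDWIDTH, intensity)
        limit = "memory" if intensity < RIDGE else "compute"
        print(f"{name}: {bound:.2f} GF/s ({limit} bound)")
```

Kernels to the left of the ridge point are limited by GB/s, so tuning memory placement matters more for them than tuning the FPU.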


                  Scalability
• Hardware based:
  – Multicore
  – Numanodes in socket package
  – Multisocket
  – Probe filter (HT assist)
  – Multichipset
• Software based:
  – Compiler, Math libraries, MPI, OpenMP, affinity.
  – Algorithm, computation/communication overlap,
    non blocking collectives.


                              Probe filter
Necessary for scaling of memory bound applications, since
it keeps track (as a cache directory held in L3) of where data
resides when cores request that data again.

Aggregated memory bandwidth (GB/s)

# numanodes      SHANGHAI    ISTANBUL    MAGNYCOURS    INTERLAGOS
Probe filter     No          Yes         Yes           Yes
1                   8           10          13            18.5
2                  16           20          26            37
4                  21           40          52            74
8                  22           80         104           148

Aggregated FLOPs (GF/s), assuming 2.3 GHz core frequency and 80% HPL efficiency

# numanodes      SHANGHAI    ISTANBUL    MAGNYCOURS    INTERLAGOS
Probe filter     No          Yes         Yes           Yes
1                 29.44       44.16       44.16         58.88
2                 58.88       88.32       88.32        117.76
4                117.76      176.64      176.64        235.52
8                235.52      353.28      353.28        471.04
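The two tables can be cross-checked with a few lines of arithmetic. The sketch below assumes 4, 6, 6 and 8 cores per numanode and an effective 4 DP FLOPs/clk per core (for Interlagos, 8 DP FLOPs/clk per compute unit shared by 2 cores); these per-processor figures are assumptions chosen because they reproduce the GF/s table, and the bandwidth numbers are copied from the slide:

```python
CLOCK_GHZ = 2.3
HPL_EFFICIENCY = 0.80

# name -> (cores per numanode, effective DP FLOPs/clk per core); assumed values.
PROCESSORS = {
    "SHANGHAI": (4, 4),
    "ISTANBUL": (6, 4),
    "MAGNYCOURS": (6, 4),
    "INTERLAGOS": (8, 4),  # 8 DP FLOPs/clk per compute unit, shared by 2 cores
}

# Aggregated memory bandwidth in GB/s for 1, 2, 4 and 8 numanodes (from the slide).
BANDWIDTH_GBS = {
    "SHANGHAI": [8, 16, 21, 22],       # no probe filter
    "ISTANBUL": [10, 20, 40, 80],
    "MAGNYCOURS": [13, 26, 52, 104],
    "INTERLAGOS": [18.5, 37, 74, 148],
}
NUMANODES = [1, 2, 4, 8]


def sustained_gflops(name, numanodes):
    """GF/s = numanodes * cores * FLOPs/clk * GHz * HPL efficiency."""
    cores, flops_per_clk = PROCESSORS[name]
    return numanodes * cores * flops_per_clk * CLOCK_GHZ * HPL_EFFICIENCY


def bandwidth_scaling_efficiency(name, numanodes):
    """Measured aggregate bandwidth divided by ideal linear scaling."""
    series = BANDWIDTH_GBS[name]
    return series[NUMANODES.index(numanodes)] / (numanodes * series[0])


if __name__ == "__main__":
    for name in PROCESSORS:
        gflops = [round(sustained_gflops(name, n), 2) for n in NUMANODES]
        eff = bandwidth_scaling_efficiency(name, 8)
        print(f"{name}: GF/s {gflops}, bandwidth scaling at 8 numanodes {eff:.0%}")
```

With these inputs, the processor without a probe filter reaches about a third of ideal bandwidth scaling at 8 numanodes, while the other three scale linearly in the table.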


         Bulldozer architecture
• Bulldozer compute unit
  – Core pair
• Core shared resources
  – L2 cache
  – Floating Point Unit
  – Instruction scheduler
  – Power management
• Core independent resources
  – L1 Data cache
  – Integer Unit


         Bulldozer block diagram
• HPC workloads use all
  the cores for the same
  kind of computation, in a
  synchronized fashion.
• High workload flexibility,
  such as in Cloud under a
  power budget

Example: Cloud workloads
can use 1 core for integer
work while the other core uses
the whole FPU for number crunching


           Socket block diagram




16 cores grouped by core pairs into 8 compute units, which are
grouped into 2 numanodes. Each numanode has 2 memory
channels. The numanodes are interconnected through
cHT. Delivers 18.5 GB/s x 2 and 60 DP GF/s x 2 under 130 W.


    Bulldozer architecture (cont)
• Flexible Floating Point Unit
  – 1 core alone can use the whole FPU: 8 DP FLOPs/clk
  – 2 cores sharing the FPU: 4 DP FLOPs/clk each
     • Example of DGEMM from ACML.
• FMA4 and FMA3 instructions
  – FMA4 on Interlagos: d = a (+/-) b*c
  – FMA3 on Abu Dhabi: c = a (+/-) b*c
• AVX instructions
  – Increase IPC by compacting instructions
Where are FMA instructions used?
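Fused multiply-add shows up wherever a loop body has the shape d = a + b*c. The Python sketch below only illustrates that pattern in two common kernels; Python itself does not emit FMA instructions, and the function names are hypothetical. A compiled dot product (the inner loop of DGEMM) or a Horner polynomial evaluation is where a compiler targeting FMA4/FMA3 would typically fuse the multiply and the add:

```python
def dot(x, y):
    """Dot product: each iteration is one multiply-add, acc = acc + x[i] * y[i]."""
    acc = 0.0
    for xi, yi in zip(x, y):
        acc = acc + xi * yi  # FMA-shaped update: d = a + b*c
    return acc


def horner(coeffs, x):
    """Evaluate a polynomial, coefficients given from highest degree to lowest."""
    acc = 0.0
    for c in coeffs:
        acc = c + acc * x  # FMA-shaped update: d = a + b*c
    return acc


if __name__ == "__main__":
    print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 1*4 + 2*5 + 3*6
    print(horner([2.0, 0.0, 1.0], 3.0))           # 2*x^2 + 1 at x = 3
```

Each loop iteration counts as 2 FLOPs (one multiply, one add), which is how the FLOPs/clk figures for an FMA-capable FPU are usually tallied.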


    Bulldozer architecture (cont)
• Power management:
  – Max power (e.g. 135 W), TDP (115 W), ACP (85 W)
  – Power capping (to limit power consumption)
• Boost states
  – P-states (HW and SW views)
• HPC mode (mostly for HPL benchmark)
• Throttling
  – Power (too much power consumption, HPL)
  – Thermal (too hot, not enough cooling, protection)


             Power management
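P-states, hardware (HW) view versus software (SW) view:

  HW view    SW view
  P0         (boost P-state)
  P1         (boost P-state)
  P2         P0 (base P-state)
  P3         P1
  P4         P2
  P5         P3
  P6         P4
  P7         P5

[Chart: measured dynamic power per workload as a percentage of TDP
(axis from 0% to 120%, TDP at 100%). The gap between a workload's
measured power and TDP is the power headroom available for boost.
Workloads shown: Twolf, Applu, HLT, Wupwise, MaxPower128, Galgel, Lucas,
Crafty, Vortex, Sixtrack, Eon, Perlbmk, Gzip, Equake, Bzip2, NOP, Art,
Mcf, Swim, Vpr, Parser, Gcc, Mesa, Facerec, Gap, Mgrid, Ammp, Fma3d, Apsi.]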
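The relation between the two P-state views, and the notion of boost headroom, can be written down in a few lines. This is a sketch under stated assumptions: two boost P-states sit above the base P-state (as in the HW/SW views on this slide), TDP is the 115 W quoted earlier in the deck, and the 80% load figure in the usage line is purely hypothetical:

```python
NUM_BOOST_PSTATES = 2  # HW P0 and P1 are boost states, hidden from software
TDP_WATTS = 115.0      # TDP value quoted earlier in the deck


def sw_pstate(hw_pstate):
    """Map a hardware P-state index to the software view (None for boost states)."""
    if hw_pstate < NUM_BOOST_PSTATES:
        return None
    return hw_pstate - NUM_BOOST_PSTATES


def boost_headroom_watts(fraction_of_tdp):
    """Power left below TDP for a workload drawing the given fraction of TDP."""
    return max(0.0, TDP_WATTS * (1.0 - fraction_of_tdp))


if __name__ == "__main__":
    for hw in range(8):
        print(f"HW P{hw} -> SW {sw_pstate(hw)}")
    # Hypothetical workload drawing 80% of TDP.
    print(f"headroom: {boost_headroom_watts(0.80):.1f} W")
```

A workload at or above 100% of TDP leaves no headroom, which is the situation the throttling bullets on the previous slide describe for HPL.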


 Coherent and non coherent fabric
• Coherent HyperTransport (cHT) fabric
  – Connects the numanodes with cache coherence
     • MOESI protocol
  – x8 cHT links, x16 cHT links
  – Scenic routing: reroutes traffic to balance the use of
    x8 / x16 link resources
• Non-coherent HyperTransport
  – RD890 chipset (PCIe Gen2)
  – Connects the numanodes with PCI devices
  – Multichipset


Coherent and non coherent fabric


              Software Ecosystem
• Operating Systems
• Compilers
  – Open64, GCC, PGI
• Math library
  – ACML, AMDlibM
• Profilers
  – CodeAnalyst
     • Instruction Based Profiling


  Operating systems for Interlagos
• Basic list of operating systems providing proper performance
  – Windows Server 2008 R2
  – RHEL 6.2
  – CentOS 6.2
  – SLES 11 SP2
  – Scientific Linux 6.2

Older versions need specific patches in order to
perform well.


                  Compiler flags
• Open64 version >= 4.2.5
• GCC version >= 4.6
• PGI version >= 11.9
• Open64 and GCC
  – Compile/link flags: -Ofast -march=bdver1
• PGI
  – Compile/link flags: -fast -tp Interlagos-64
   AMD Core Math Library,
download @ developer.amd.com
  AMD CodeAnalyst Profiler,
download @ developer.amd.com


NUMA definition


    Feeding locally versus remotely
• Locally: NUMA node 0 is fed from its own memory channels
  (Channel 0 and Channel 1). E.g. 12 GB/s.
• Remotely: NUMA node 0 is fed from the memory channels of NUMA node 1
  across a cHT link (x8 or x16). The link constrains bandwidth and adds
  latency (1 hop). E.g. 7 GB/s at x16, 5 GB/s at x8.
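To get a feel for what remote accesses cost, the example figures from this slide can be plugged into a simple model. The sketch assumes that local and remote transfers do not overlap, so the time per byte is the weighted sum of the two; that serial model and the function name are assumptions for illustration, not a measured behaviour of the platform:

```python
LOCAL_GBS = 12.0       # example local bandwidth from the slide
REMOTE_X16_GBS = 7.0   # example remote bandwidth over a x16 cHT link
REMOTE_X8_GBS = 5.0    # example remote bandwidth over a x8 cHT link


def effective_bandwidth(remote_fraction, remote_gbs, local_gbs=LOCAL_GBS):
    """Blend local and remote bandwidth, assuming transfers do not overlap."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    seconds_per_gb = (1.0 - remote_fraction) / local_gbs + remote_fraction / remote_gbs
    return 1.0 / seconds_per_gb


if __name__ == "__main__":
    for fraction in (0.0, 0.25, 0.5, 1.0):
        x16 = effective_bandwidth(fraction, REMOTE_X16_GBS)
        x8 = effective_bandwidth(fraction, REMOTE_X8_GBS)
        print(f"{fraction:4.0%} remote: {x16:5.2f} GB/s (x16), {x8:5.2f} GB/s (x8)")
```

Under this model even a modest share of remote traffic pulls the effective bandwidth well below the local figure, which is the motivation for the affinity tools on the next slide.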


                      Affinity
• numactl / numastat tools (Linux)
• Start tool (Windows)
• HWLOC toolset (Windows, Linux)
  – www.open-mpi.org/projects/hwloc
• LIKWID toolset (Windows, Linux)
  – http://code.google.com/p/likwid/
• OpenMP environment variables
  – E.g. Open64: O64_OMP_AFFINITY_MAP
• MPI runtime flags
  – E.g. OpenMPI: --bind-to-core
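All of the tools above end up restricting which cores a process may run on. A minimal sketch of the same idea from inside a program, using Python's `os.sched_setaffinity`, which is only available on Linux; the helper name `run_pinned` is hypothetical, and on other platforms the sketch simply runs the function unpinned:

```python
import os


def run_pinned(cpu, fn):
    """Run fn() with the calling process pinned to one core, then restore affinity."""
    if not hasattr(os, "sched_setaffinity"):
        return fn()  # platform without this API: run unpinned
    previous = os.sched_getaffinity(0)
    os.sched_setaffinity(0, {cpu})
    try:
        return fn()
    finally:
        os.sched_setaffinity(0, previous)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        first_allowed = min(os.sched_getaffinity(0))
        inside = run_pinned(first_allowed, lambda: os.sched_getaffinity(0))
        print(f"allowed cores while pinned: {sorted(inside)}")
        print(f"allowed cores afterwards:   {sorted(os.sched_getaffinity(0))}")
```

For MPI or OpenMP jobs the launcher flags and environment variables listed on the slide are the more practical route, since they pin every rank and thread consistently.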


 numactl --hardware and numastat
Detecting a wrong BIOS configuration of the system: if NODE INTERLEAVE
were ENABLED, there would be only 1 numa node, with core ids
0,1,2,...,30,31 and with 64 GB of memory.

Callouts on the tool output:
  – Physical memory on each numa node and how much of it is available (free)
  – Core ids for numa node 3
  – Good, no misses
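The check described on this slide can be automated by parsing the text that `numactl --hardware` prints. The sketch below works on a captured sample rather than calling the tool, so it runs anywhere; the sample text is a hypothetical illustration shaped like the 32-core, 64 GB system described above, not output taken from the talk:

```python
import re

# Hypothetical sample shaped like `numactl --hardware` output (not real data).
SAMPLE = """\
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16384 MB
node 0 free: 15210 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 16384 MB
node 1 free: 15877 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 16384 MB
node 2 free: 15902 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 16384 MB
node 3 free: 15899 MB
"""


def parse_numactl_hardware(text):
    """Return {node id: {"cpus": [...], "size_mb": int, "free_mb": int}}."""
    nodes = {}
    for line in text.splitlines():
        cpus = re.match(r"node (\d+) cpus:\s*(.*)", line)
        if cpus:
            ids = [int(c) for c in cpus.group(2).split()]
            nodes.setdefault(int(cpus.group(1)), {})["cpus"] = ids
            continue
        mem = re.match(r"node (\d+) (size|free):\s*(\d+) MB", line)
        if mem:
            nodes.setdefault(int(mem.group(1)), {})[mem.group(2) + "_mb"] = int(mem.group(3))
    return nodes


def looks_node_interleaved(nodes):
    """A single numa node holding every core suggests node interleaving is enabled."""
    return len(nodes) == 1


if __name__ == "__main__":
    topology = parse_numactl_hardware(SAMPLE)
    print(f"numa nodes: {len(topology)}, interleaved: {looks_node_interleaved(topology)}")
    print(f"core ids for numa node 3: {topology[3]['cpus']}")
```

On a real system the same parser could be fed the captured output of the command, for example via `subprocess.run(["numactl", "--hardware"], capture_output=True, text=True).stdout`.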
                   EXAMPLE using likwid
                   Hybrid MPI+OpenMP
• Build an application file (appfile) and launch an MPI job with hybrid OpenMP,
  with 1 thread per compute unit, on 2-socket nodes. Using 4 compute nodes.
• export OMP_NUM_THREADS=4
• mpirun --app ./appfile
• Where appfile is as follows. The first core id is repeated: it binds the
  MPI process, and the list then binds the 4 worker threads.
-host node1 -np 1 likwid-pin -q -c 0,0,2,4,6 ./application
-host node1 -np 1 likwid-pin -q -c 8,8,10,12,14 ./application
-host node1 -np 1 likwid-pin -q -c 16,16,18,20,22 ./application
-host node1 -np 1 likwid-pin -q -c 24,24,26,28,30 ./application
...
-host node4 -np 1 likwid-pin -q -c 0,0,2,4,6 ./application
-host node4 -np 1 likwid-pin -q -c 8,8,10,12,14 ./application
-host node4 -np 1 likwid-pin -q -c 16,16,18,20,22 ./application
-host node4 -np 1 likwid-pin -q -c 24,24,26,28,30 ./application
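Writing such an appfile by hand gets tedious beyond a few nodes, so it can be generated. The sketch below assumes 4 numanodes of 8 cores per host, one worker thread per compute unit (every second core id), Open MPI's `-host` option, and hypothetical hostnames `node1` to `node4`; the function name is also hypothetical:

```python
def appfile_lines(hosts, numanodes_per_host=4, cores_per_numanode=8,
                  binary="./application"):
    """Build one appfile line per numanode: one MPI rank plus its worker threads."""
    lines = []
    for host in hosts:
        for numanode in range(numanodes_per_host):
            first_core = numanode * cores_per_numanode
            # One worker per compute unit: every second core id in the numanode.
            workers = range(first_core, first_core + cores_per_numanode, 2)
            # The first core id is repeated so the MPI process itself is bound too.
            cpu_list = ",".join(str(c) for c in [first_core, *workers])
            lines.append(f"-host {host} -np 1 likwid-pin -q -c {cpu_list} {binary}")
    return lines


if __name__ == "__main__":
    hosts = [f"node{i}" for i in range(1, 5)]  # hypothetical hostnames
    for line in appfile_lines(hosts):
        print(line)
```

Redirecting the output to a file gives the appfile passed to `mpirun`; changing `cores_per_numanode` or the stride adapts the layout to other core counts.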


         Putting it all together
Pre-exascale (high computing density) system
  – Multicore
  – Multisocket
  – Multichipset
  – Multirail
  – MultiGPU
  – dynamically reconfigurable multi root PCI devices
    through workload analysis


More @ http://developer.amd.com
•   X86 Open64 Compilers Suite (http://developer.amd.com/tools/open64/)
•   AMD Developer Tools (http://developer.amd.com/tools/)
•   AMD Libraries (ACML, LibM, etc.) http://developer.amd.com/libraries/
•   AMD Opteron™ 4200/6200 Series processors Compiler Options Quick Guide
    (http://developer.amd.com/Assets/CompilerOptQuickRef-62004200.pdf)
•   AMD OpenCL™ Zone (http://developer.amd.com/zones/OpenCLZone/)
•   AMD HPC (www.amd.com/hpc)
•   AMD APP SDK Documentation
    (http://developer.amd.com/sdks/AMDAPPSDK/documentation/Pages/default.aspx)
•   Using the x86 Open64 Compiler Suite
    (http://developer.amd.com/tools/open64/Documents/open64.html)
•   x86 Open64 4.2.5.2 Release Notes
    (http://developer.amd.com/tools/open64/assets/ReleaseNotes.txt)
•   ACML 5.0 Information
    (http://developer.amd.com/libraries/acml/features/pages/default.aspx)
•   Software Optimization Guide for “Bulldozer” processors
    (http://support.amd.com/us/Processor_TechDocs/47414.pdf)
•   AMD64 Architecture Programmer’s Manual Volume 6: 128-Bit and 256-Bit XOP and FMA4
    Instructions
    (http://support.amd.com/us/Embedded_TechDocs/43479.pdf)
•   Links to the 2- and 4-socket results for the AMD Opteron™ 6276 Series processors (16 core,
    2.3 GHz). The SPEC runs used the X86 Open64 Compiler Suite.
    http://www.spec.org/cpu2006/results/res2011q4/cpu2006-20111025-18742.pdf
    http://www.spec.org/cpu2006/results/res2011q4/cpu2006-20111025-18748.pdf
