AMD technologies for HPC
1. ISUM 2012, Guanajuato, Mexico
Hands-on work on
AMD technologies for HPC solutions
Joshua.Mora@amd.com
ABSTRACT:
The goal of this talk is to present in a practical way (through a
hands-on session) how the latest AMD technology works and meets current
high performance computing requirements. The session reviews concepts
such as the performance metrics of GFLOPs and GB/s, the performance
efficiency of the FPU and of the memory controllers/channels, the
scalability of multi-socket platforms, tuning tips such as
process/thread affinity, multi InfiniBand/GPU setups and their I/O
affinity, the impact of appropriate math libraries and compilers, and
the power consumption characteristics of a system when heavily stressed
with different HPC workloads. By the end of the talk/session you should
walk away with a good foundation on which building-block technologies
matter for you and how to design and exploit your own HPC solutions.
5. ISUM 2012, Guanajuato, Mexico
Probe filter
Necessary for scaling memory-bound applications, since it
keeps track (through a cache directory held in L3) of which
memory bank holds the data when cores request that data again.
Aggregated memory bandwidth (GB/s) per processor
# numanodes     SHANGHAI   ISTANBUL   MAGNYCOURS   INTERLAGOS
Probe filter    No         Yes        Yes          Yes
1               8          10         13           18.5
2               16         20         26           37
4               21         40         52           74
8               22         80         104          148
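The effect of the probe filter is easiest to see as a scaling efficiency: the aggregated bandwidth divided by what ideal linear scaling from 1 numanode would give. The following is a minimal sketch that only reuses the bandwidth figures listed above; the function and variable names are illustrative, not part of the talk.

```python
# Aggregated memory bandwidth (GB/s) copied from the slide,
# indexed by processor and then by number of numanodes.
BANDWIDTH_GBS = {
    "SHANGHAI":   {1: 8.0,  2: 16.0, 4: 21.0, 8: 22.0},    # no probe filter
    "ISTANBUL":   {1: 10.0, 2: 20.0, 4: 40.0, 8: 80.0},
    "MAGNYCOURS": {1: 13.0, 2: 26.0, 4: 52.0, 8: 104.0},
    "INTERLAGOS": {1: 18.5, 2: 37.0, 4: 74.0, 8: 148.0},
}


def scaling_efficiency(processor, numanodes):
    """Aggregated bandwidth divided by ideal linear scaling from 1 numanode."""
    table = BANDWIDTH_GBS[processor]
    ideal = table[1] * numanodes
    return table[numanodes] / ideal


if __name__ == "__main__":
    for name in BANDWIDTH_GBS:
        row = ", ".join(
            f"{n} nodes: {scaling_efficiency(name, n):.0%}" for n in (1, 2, 4, 8)
        )
        print(f"{name:<11} {row}")
```

With these figures, the processors with a probe filter keep scaling linearly up to 8 numanodes, while Shanghai (no probe filter) falls to roughly a third of the ideal value at 8 numanodes.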
Aggregated FLOPs (GF/s) per processor, assuming 2.3 GHz core frequency and 80% HPL efficiency
# numanodes     SHANGHAI   ISTANBUL   MAGNYCOURS   INTERLAGOS
Probe filter    No         Yes        Yes          Yes
1               29.44      44.16      44.16        58.88
2               58.88      88.32      88.32        117.76
4               117.76     176.64     176.64       235.52
8               235.52     353.28     353.28       471.04
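The GF/s figures can be reproduced as cores x DP FLOPs/clk x frequency x HPL efficiency x numanodes. The sketch below does that arithmetic. The frequency and efficiency come from the slide; the cores per numanode and the FLOPs per clock per core are assumptions chosen because they reproduce the listed values, they are not stated on the slide.

```python
FREQ_GHZ = 2.3          # core frequency assumed on the slide
HPL_EFFICIENCY = 0.80   # HPL efficiency assumed on the slide

# ASSUMPTION (not on the slide): cores per numanode and DP FLOPs per
# clock per core, picked so that the results match the listed GF/s.
PROCESSORS = {
    "SHANGHAI":   {"cores": 4, "flops_per_clk": 4},
    "ISTANBUL":   {"cores": 6, "flops_per_clk": 4},
    "MAGNYCOURS": {"cores": 6, "flops_per_clk": 4},
    "INTERLAGOS": {"cores": 8, "flops_per_clk": 4},
}


def sustained_gflops(processor, numanodes):
    """Peak DP GF/s of the given numanodes scaled by the HPL efficiency."""
    p = PROCESSORS[processor]
    peak = p["cores"] * p["flops_per_clk"] * FREQ_GHZ * numanodes
    return peak * HPL_EFFICIENCY


if __name__ == "__main__":
    for name in PROCESSORS:
        row = "  ".join(f"{sustained_gflops(name, n):7.2f}" for n in (1, 2, 4, 8))
        print(f"{name:<11} {row}")
```

Unlike memory bandwidth, this quantity scales with the number of numanodes by construction, which is why the probe filter row makes no difference in this list.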
6. ISUM 2012, Guanajuato, Mexico
Bulldozer architecture
• Bulldozer compute unit
– Core pair
• Core shared resources
– L2 cache
– Floating Point Unit
– Instruction scheduler
– Power management
• Core independent resources
– L1 Data cache
– Integer Unit
7. ISUM 2012, Guanajuato, Mexico
Bulldozer block diagram
• HPC workloads use all the cores
for the same kind of computation,
in a synchronized way.
• High workload flexibility,
such as in Cloud under a
power budget
Example: Cloud workloads
can use 1 core for integer
work while the other core uses
the whole FPU for number crunching
8. ISUM 2012, Guanajuato, Mexico
Socket block diagram
16 cores grouped by core pairs into 8 compute units, which
are grouped into 2 numanodes. Each numanode has 2 memory
channels. The numanodes are interconnected through
cHT. Delivers 18.5 GB/s x 2 and 60 DP GF/s x 2 under 130 W
9. ISUM 2012, Guanajuato, Mexico
Bulldozer architecture (cont)
• Flexible Floating Point Unit
– Work that 1 core can do when it uses the FPU alone: 8 DP FLOPs/clk
– Work that each of the 2 cores can do when they share the FPU: 4 DP FLOPs/clk
• Example of DGEMM from ACML.
• FMA4 and FMA3 instructions
– FMA4 on Interlagos d = a (+/-) b*c
– FMA3 on Abu Dhabi c = a (+/-) b*c
• AVX instructions
– Increase IPC by compacting instructions
15. ISUM 2012, Guanajuato, Mexico
Software Ecosystem
• Operating Systems
• Compilers
– Open64, GCC, PGI
• Math library
– ACML, AMD LibM
• Profilers
– CodeAnalyst
• Instruction Based Profiling
16. ISUM 2012, Guanajuato, Mexico
Operating systems for Interlagos
• Basic list of operating systems that provide proper performance
– Windows Server 2008 R2
– RHEL 6.2
– CentOS 6.2
– SLES 11 SP2
– Scientific Linux 6.2
Older versions need specific patches in order to
perform well.
17. ISUM 2012, Guanajuato, Mexico
Compiler flags
• Open64 version >= 4.2.5
• GCC version >= 4.6
• PGI version >= 11.9
• Open64 and GCC
– Compile/link flags: -Ofast -march=bdver1
• PGI
– Compile/link flags: -fast -tp Interlagos-64
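As an illustration of how these settings might be applied in a build script, here is a small sketch that encodes the minimum versions and flags listed above and assembles a compile command. The flags and versions are copied from the slide; the driver names (opencc, gcc, pgcc) and the helper functions are assumptions for the example, so adjust them to your installation.

```python
# Minimum versions and compile/link flags as listed on the slide.
COMPILERS = {
    "open64": {"min_version": "4.2.5", "flags": ["-Ofast", "-march=bdver1"]},
    "gcc":    {"min_version": "4.6",   "flags": ["-Ofast", "-march=bdver1"]},
    "pgi":    {"min_version": "11.9",  "flags": ["-fast", "-tp", "Interlagos-64"]},
}

# ASSUMPTION: usual C driver names for each suite, not given on the slide.
DRIVERS = {"open64": "opencc", "gcc": "gcc", "pgi": "pgcc"}


def version_tuple(version):
    """Turn a dotted version such as '4.2.5.2' into (4, 2, 5, 2)."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(compiler, version):
    """True if `version` is at least the minimum listed for `compiler`."""
    minimum = COMPILERS[compiler]["min_version"]
    return version_tuple(version) >= version_tuple(minimum)


def compile_command(compiler, source, output):
    """Build the argument list for compiling and linking one source file."""
    return [DRIVERS[compiler], *COMPILERS[compiler]["flags"], "-o", output, source]


if __name__ == "__main__":
    print(" ".join(compile_command("gcc", "app.c", "app")))
    print(meets_minimum("gcc", "4.4.7"))
```

The returned list can be passed to subprocess.run as-is; the same flags are used for both the compile and the link step, as the slide indicates.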
23. ISUM 2012, Guanajuato, Mexico
numactl --hardware and numastat
Detecting wrong BIOS settings in the system configuration:
if NODE INTERLEAVED were ENABLED there would be only 1 numa node,
with core ids 0,1,2,...,30,31 and 64 GB of memory.
[Screenshot of the command output, with callouts:]
– Physical memory on each numa node and how much is available (free)
– Core ids for numa node 3
– Good, no misses
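The same check can be scripted. The sketch below parses text laid out like the output of numactl --hardware and flags the single-node case described above. The sample text is hypothetical (sizes and free amounts are made up to match a 32-core, 64 GB system), and the function names are illustrative.

```python
import re

# HYPOTHETICAL sample in the layout of `numactl --hardware` output;
# the memory sizes and free amounts are made up for illustration.
SAMPLE = """\
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16384 MB
node 0 free: 15000 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 16384 MB
node 1 free: 15500 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 16384 MB
node 2 free: 15800 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 16384 MB
node 3 free: 15900 MB
"""


def parse_numactl_hardware(text):
    """Return {node_id: {'cpus': [...], 'size_mb': int, 'free_mb': int}}."""
    nodes = {}
    for line in text.splitlines():
        match = re.match(r"node (\d+) (cpus|size|free): (.*)", line)
        if not match:
            continue  # skip the 'available:' line and anything else
        node, key, value = int(match.group(1)), match.group(2), match.group(3)
        entry = nodes.setdefault(node, {})
        if key == "cpus":
            entry["cpus"] = [int(c) for c in value.split()]
        else:
            entry[key + "_mb"] = int(value.split()[0])
    return nodes


def looks_node_interleaved(nodes):
    """On a multi-numanode system, a single reported node suggests that
    node interleaving is enabled in the BIOS."""
    return len(nodes) == 1


if __name__ == "__main__":
    nodes = parse_numactl_hardware(SAMPLE)
    print(len(nodes), "numa nodes; node 3 cpus:", nodes[3]["cpus"])
    print("node interleaving suspected:", looks_node_interleaved(nodes))
```

To use it on a real system, feed it the captured output of the command instead of SAMPLE, for example through subprocess.run(["numactl", "--hardware"], capture_output=True, text=True).stdout.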
24. ISUM 2012, Guanajuato, Mexico
EXAMPLE using likwid
Hybrid MPI+OpenMP
• Build application file and launch MPI job with hybrid OpenMP with 1 thread
per compute unit on 2 . Using 4 compute nodes.
• export OMP_NUM_THREADS=4
• mpirun -app ./appfile
• Where appfile is as follows. The first core id is repeated in each list
for the binding of the MPI process + 4 worker threads.
-h node 1 -np 1 likwid-pin -q -c 0,0,2,4,6 ./application
-h node 1 -np 1 likwid-pin -q -c 8,8,10,12,14 ./application
-h node 1 -np 1 likwid-pin -q -c 16,16,18,20,22 ./application
-h node 1 -np 1 likwid-pin -q -c 24,24,26,28,30 ./application
…………………………………………….
-h node 4 -np 1 likwid-pin -q -c 0,0,2,4,6 ./application
-h node 4 -np 1 likwid-pin -q -c 8,8,10,12,14 ./application
-h node 4 -np 1 likwid-pin -q -c 16,16,18,20,22 ./application
-h node 4 -np 1 likwid-pin -q -c 24,24,26,28,30 ./application
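Writing the appfile by hand is error-prone, so it can be generated. The sketch below reproduces the pattern of the lines above: one MPI process per numanode, the first core id repeated, then one worker thread on every second core (one per compute unit). The host names node1..node4 and the function names are hypothetical; the 8 cores per numanode and 4 numanodes per host follow the core id lists shown.

```python
def core_list(numanode, cores_per_numanode=8, threads=4):
    """Core ids for one MPI process: the first id of the numanode is
    listed twice, followed by every second core (one per compute unit)."""
    base = numanode * cores_per_numanode
    workers = [base + 2 * t for t in range(threads)]
    return [base] + workers


def appfile_lines(hosts, numanodes_per_host=4, binary="./application"):
    """One appfile line per numanode of every host."""
    lines = []
    for host in hosts:
        for numanode in range(numanodes_per_host):
            cores = ",".join(str(c) for c in core_list(numanode))
            lines.append(f"-h {host} -np 1 likwid-pin -q -c {cores} {binary}")
    return lines


if __name__ == "__main__":
    # HYPOTHETICAL host names; replace with the names of your compute nodes.
    hosts = ["node1", "node2", "node3", "node4"]
    print("\n".join(appfile_lines(hosts)))
```

Redirecting the printed output to a file named appfile gives the 16 lines (4 hosts x 4 numanodes) that the launch command expects, with the number of worker threads matching OMP_NUM_THREADS=4.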
25. ISUM 2012, Guanajuato, Mexico
Putting it all together
Pre-exascale (high computing density) system
– Multicore
– Multisocket
– Multichipset
– Multirail
– MultiGPU
– dynamically reconfigurable multi root PCI devices
through workload analysis
27. ISUM 2012, Guanajuato, Mexico
More @ http://developer.amd.com
• X86 Open64 Compilers Suite (http://developer.amd.com/tools/open64/)
• AMD Developer Tools (http://developer.amd.com/tools/)
• AMD Libraries (ACML, LibM, etc.) http://developer.amd.com/libraries/
• AMD Opteron™ 4200/6200 Series processors Compiler Options Quick Guide
(http://developer.amd.com/Assets/CompilerOptQuickRef-62004200.pdf)
• AMD OpenCL™ Zone (http://developer.amd.com/zones/OpenCLZone/)
• AMD HPC (www.amd.com/hpc)
• AMD APP SDK Documentation
(http://developer.amd.com/sdks/AMDAPPSDK/documentation/Pages/default.aspx)
• Using the x86 Open64 Compiler Suite
(http://developer.amd.com/tools/open64/Documents/open64.html)
• x86 Open64 4.2.5.2 Release Notes
(http://developer.amd.com/tools/open64/assets/ReleaseNotes.txt)
• ACML 5.0 Information
(http://developer.amd.com/libraries/acml/features/pages/default.aspx)
• Software Optimization Guide for “Bulldozer” processors
(http://support.amd.com/us/Processor_TechDocs/47414.pdf)
• AMD64 Architecture Programmer’s Manual Volume 6: 128-Bit and 256-Bit XOP and FMA4
Instructions
(http://support.amd.com/us/Embedded_TechDocs/43479.pdf)
• Links to the 2- and 4-socket results for the AMD Opteron™ 6276 Series processors (16 core,
2.3 GHz). The SPEC runs used the X86 Open64 Compiler Suite.
http://www.spec.org/cpu2006/results/res2011q4/cpu2006-20111025-18742.pdf
http://www.spec.org/cpu2006/results/res2011q4/cpu2006-20111025-18748.pdf