IBM Power System AC922: The Brain Behind Blazing Fast Supercomputers

1. IBM Power System AC922: The Brain Behind the Supercomputer
Pidad D'Souza (pidsouza@in.ibm.com)
Aditya Nitsure (anitsure@in.ibm.com)
Power System Performance, ISDL, IBM, Bengaluru
5. Heterogeneous Computing
[Diagram: the compute-intensive parts of the application code are offloaded for GPU acceleration; the rest of the sequential code runs on the CPU]
CPU
– Large and broad instruction set to perform complex operations
GPU
– High throughput: massive parallelism through a large number of cores
– Specialized for SIMD/SIMT execution
Heterogeneous computing combines the two to maximize performance and energy efficiency.
6. NVLink 2.0: High-Bandwidth Interconnect
– 150 GB/s bi-directional bandwidth (or 100 GB/s for the 6-GPU config) between CPU-GPU and GPU-GPU
– Coherent access to CPU memory
– CPU and GPU co-operate in the execution of work
– GPU coherently accesses CPU memory

Summit and Sierra supercomputer configurations:
[Diagram: Sierra (4-GPU half node) – one POWER9 CPU connected to two NVIDIA V100 GPUs over NVLink at 150 GB/s each, DDR4 memory at 170 GB/s, PCIe 4.0/CAPI 2.0 to InfiniBand (IB); Summit (6-GPU half node) – one POWER9 CPU connected to three NVIDIA V100 GPUs over NVLink at 100 GB/s each, DDR4 at 170 GB/s, PCIe 4.0/CAPI 2.0 to InfiniBand (IB). In both, GPUs have coherent access to system memory.]
7. IBM POWER9 AC922 Server
– Delivers unprecedented performance for modern HPC, analytics, and artificial intelligence (AI)
– Designed to fully exploit the capabilities of CPU and GPU accelerators
– Eliminates I/O bottlenecks and allows sharing memory across GPUs and CPUs
– Extraordinary POWER9 CPUs
– 2-6 NVIDIA® Tesla® V100 GPUs with NVLink
– Co-optimized hardware and software for deep learning and AI
– Supports up to 5.6x more I/O bandwidth than competitive servers
– Combines the cutting-edge AI innovation data scientists desire with the dependability IT requires
– Next-generation PCIe: PCIe Gen4, 2x faster than Gen3
8. NVIDIA Tesla V100 GPU
– Designed for AI computing and HPC
– Second-generation NVLink™
– HBM2 memory: faster, higher efficiency
– Enhanced unified memory and Address Translation Services
– Maximum Performance and Maximum Efficiency modes
– Number of SMs/cores: 80/5120
– Double-precision performance: 7.5 TFLOPS
– Single-precision performance: 15 TFLOPS
– Tensor performance: 125 Tensor TFLOPS
– GPU memory: 16 or 32 GB
– Memory bandwidth: 900 GB/s
https://devblogs.nvidia.com/inside-volta
10. CPU STREAM Bandwidth
Boost application performance with sustained peak memory bandwidth of ~280 GB/s.
STREAM benchmark (https://www.cs.virginia.edu/stream/) *not submitted
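The ~280 GB/s figure above is measured with the STREAM benchmark. Its core "triad" operation can be sketched in a few lines of host code; the array size, timing, and reporting below are illustrative, not the official benchmark code.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Simplified STREAM "triad" loop: a[i] = b[i] + scalar * c[i].
   The real benchmark repeats this (and copy/scale/add kernels)
   many times and reports the best sustained rate. */
int main(void) {
    const size_t n = 1UL << 26;                  /* ~64M doubles per array */
    double *a = (double *)malloc(n * sizeof(double));
    double *b = (double *)malloc(n * sizeof(double));
    double *c = (double *)malloc(n * sizeof(double));
    for (size_t i = 0; i < n; ++i) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    const double scalar = 3.0;
    for (size_t i = 0; i < n; ++i) a[i] = b[i] + scalar * c[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = 3.0 * n * sizeof(double);     /* two reads + one write */
    printf("Triad bandwidth: %.1f GB/s\n", bytes / sec / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```

On a real run, the loop should be parallelized (e.g. with OpenMP) across cores to approach the platform's sustained bandwidth.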
11. NVIDIA Volta V100 Compute – Single and Double Precision
– Applications get more compute power
– Shortened time to completion
– Accomplish more simulations/experiments
– 1.5x higher compute than NVIDIA P100 GPUs
[Chart: NVIDIA V100 SGEMM and DGEMM compute (TFLOPS). S822LC + P100: 4.8 DGEMM, 9.8 SGEMM; AC922 + V100: 7.45 DGEMM, 15.3 SGEMM – 1.5x higher]
13. CPU-to-GPU NVLink vs PCIe3 Bandwidth
– NVLink 2.0 is 5.6x better than PCIe3
– Removes CPU-GPU data-transfer bottlenecks
[Chart: CPU-to-GPU bandwidth (GB/s). Xeon E5-2640 v4 + P100 (PCIe3): 12; S822LC + P100: 34.16 (2.8x better); AC922 + 6 V100: 45.9 (3.8x better, 1.34x over S822LC); AC922 + 4 V100: 68 (5.6x better, 2x over S822LC)]
Note: NVIDIA bandwidth test used for measurement.
14. NVLink Bandwidth with Varied Data Sizes
– Minimize communication latencies
– Unlock PCIe bottlenecks
– Transfer larger data at high speed
– Ideal for data sizes larger than GPU memory
[Chart: NVLink 2.0 vs PCIe3 host-to-device bandwidth (MB/s) for data sizes from 1 KB to 1 GB, comparing 2-NVLink-per-GPU, 3-NVLink-per-GPU, and PCIe3 configurations]
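The host-to-device numbers above come from NVIDIA's bandwidth test; the same kind of measurement can be sketched with CUDA events. The buffer size below is illustrative, and pinned host memory (cudaMallocHost) is used so the transfer runs at full link speed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t size = 256UL << 20;              // 256 MiB transfer
    char *h, *d;
    cudaMallocHost(&h, size);                     // pinned host buffer
    cudaMalloc(&d, size);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d, h, size, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host-to-device bandwidth: %.1f GB/s\n",
           (size / 1e9) / (ms / 1e3));

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

Sweeping `size` from a few KB upward reproduces the shape of the chart: small transfers are latency-bound, large ones approach the link's peak.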
15. Workload Optimized Frequency (WOF)
– Boost performance of less active workloads through higher frequency
– Lower the frequency to save power or boost other cores
– Maximize performance by dynamically adjusting processor frequency
– Governing factors
  • Processor utilization, number of active cores & environmental conditions
– Power saver modes
  • Dynamic Performance Mode (DPM)
  • Maximum Performance Mode (MPM)
17. Bi-section Bandwidth & All-Reduce Scaling on Summit
– Good scaling at large scale due to ~74% of bisection bandwidth with adaptive routing enabled
– SMPI supports HCOLL (FCA) & SHARP, enabling applications to run with the best collective performance
The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems, SC18.
18. Burst Buffer Performance on Summit
– Improved application performance through the burst buffer
– Applications are not bottlenecked on I/O operations on parallel file systems
The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems, SC18.
21. CUDA Programming

cudaMallocHost(&h_data, size);  // Allocate pinned memory on the host
cudaMalloc(&d_data, size);      // Allocate memory on the GPU
init_dataCPU(h_data);
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);  // Move data to GPU
gpu_kernel<<<grid, block>>>(d_data);                       // GPU compute
cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);  // Move results back to CPU
cpu_processing(h_data);
22. Unified Memory Programming
– Single memory address space accessible to both CPU & GPU
– Enables oversubscribing GPU memory
  • Computation on data sizes larger than GPU memory
– System-wide atomic memory operations
– Transparent memory migration between CPU and GPU depending on which side accesses it
  • Explicit migration through cudaMemPrefetchAsync()
– Allocating unified memory
  • Replace "malloc" & "new" with "cudaMallocManaged"
[Diagram: GPU and CPU sharing a single Unified Memory address space]
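The points above can be condensed into a minimal managed-memory sketch. The kernel name and sizes are illustrative; on P9 + V100 the same allocation can in principle exceed GPU memory, since pages migrate on demand over NVLink.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *a, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

int main() {
    const size_t n = 1UL << 28;                      // ~1 GiB of floats (illustrative)
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));     // one CPU/GPU address space
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;   // CPU touches the pages
    scale<<<(n + 255) / 256, 256>>>(data, n);        // pages migrate to the GPU
    cudaDeviceSynchronize();
    printf("data[0] = %f\n", data[0]);               // pages migrate back on CPU access
    cudaFree(data);
    return 0;
}
```

Note there is no cudaMemcpy anywhere: migration is driven entirely by which processor touches the data, which is the contrast with the explicit-copy version on the previous slide.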
23. Unified Memory Advise
– ReadMostly
  • Data is mostly read, occasionally written
  • Pages are duplicated; writes are possible but expensive
– PreferredLocation
  • Specify the preferred location for data
  • "Resist" migrations away from the preferred location
– AccessedBy
  • Establish mappings to access data directly and avoid migrations

char *data;
cudaMallocManaged(&data, size);
init_dataCPU(data, size);
cudaMemPrefetchAsync(data, size, gpuID);   // Explicit migration to the GPU
cudaMemAdvise(data, size, cudaMemAdviseSetReadMostly, gpuID);
gpuKernel<<<grid, block>>>(data, size);    // Transparent data migration to GPU
cudaDeviceSynchronize();
use_dataCPU(data, size);                   // Data migrates back to CPU
24. Hardware Coherency & ATS
– Hardware coherency
  • CPU can directly access and cache GPU memory
  • Native atomics support
– Address Translation Services (ATS)
  • Allows the GPU to access the CPU's page tables directly
  • System allocator support – malloc, stack, global, file system

Simplified programming with ATS – kernels can consume any system allocation:

float *data = (float *)malloc(size);  // heap (system allocator)
gpu_kernel<<<grid, block>>>(data);

float data[1024];                     // stack
gpu_kernel<<<grid, block>>>(data);

extern float *data;                   // global
gpu_kernel<<<grid, block>>>(data);
26. GPU Direct RDMA
– Data exchange between the GPU and other peer devices using PCIe standards
– Network devices directly access GPU memory, bypassing the host
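From application code, GPUDirect RDMA is typically reached through a CUDA-aware MPI library (on Summit/Sierra, IBM Spectrum MPI). The sketch below assumes such a build; the buffer size is illustrative. The key point is that a device pointer is passed straight to MPI, so the InfiniBand adapter can read and write GPU memory without a staging copy through a host buffer.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));   // buffer lives in GPU memory

    // With a CUDA-aware MPI, the device pointer goes directly into the
    // MPI call; GPUDirect RDMA lets the NIC access GPU memory itself,
    // bypassing the host.
    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

Without CUDA-aware MPI, the same exchange would require a cudaMemcpy to a host buffer before the send and another after the receive, which is exactly the bottleneck GPUDirect RDMA removes.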
30. Monitoring and Profiling Tools
nvprof
• nvprof is a command-line profiling tool which enables you to collect and view profiling data
• Using nvprof one can collect:
  • kernel execution times
  • memory transfers
  • memory sets and CUDA API calls
  • events or metrics for CUDA kernels
NVVP (NVIDIA Visual Profiler)
• The Visual Profiler displays a timeline of your application's activity on both the CPU and GPU so that one can identify opportunities for performance improvement.
• Visualizes profile data collected from nvprof
• More documentation can be found at https://docs.nvidia.com/cuda/profiler-users-guide/index.html
32. Conclusion
➢ AC922 is designed for supercomputers
➢ Better performance for HPC applications
➢ High-speed NVLink interconnect between CPU & GPU
➢ Simplified programming using unified memory, ATS, and OpenMP
33. References
– IBM Power System AC922 Introduction and Technical Overview
– NVIDIA Volta GPU
– IBM Power Systems Proof Points
– Unified Memory on P9+V100
– Summit Supercomputer
– Sierra Supercomputer
35. Notices and Disclaimers (continued)
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM's products. IBM expressly disclaims all warranties, expressed or implied, including but not limited to, the implied warranties of merchantability and fitness for a particular purpose.

The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right.

IBM, the IBM logo, ibm.com and [names of other referenced IBM products and services used in the presentation] are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.