IBM Power System AC922: The Brain Behind Blazing Fast Supercomputers

1. IBM Power System AC922: The Brain Behind the Supercomputer
Pidad D'Souza (pidsouza@in.ibm.com)
Aditya Nitsure (anitsure@in.ibm.com)
Power System Performance, ISDL, IBM, Bengaluru
5. Heterogeneous Computing
[Diagram: the compute-intensive parts of the application code are offloaded for GPU acceleration; the rest of the sequential code runs on the CPU]
CPU
– Large and broad instruction set to perform complex operations
GPU
– High throughput: massive parallelism through a large number of cores
– Specialized for SIMD/SIMT execution
Heterogeneous computing combines the two to maximize performance and energy efficiency.
6. NVLink 2.0: High-Bandwidth Interconnect
– 150 GB/s bi-directional bandwidth (or 100 GB/s for the 6-GPU config) between CPU-GPU and GPU-GPU
– Coherent access to CPU memory
– CPU and GPU co-operate in the execution of work
– GPU coherently accesses CPU memory

Summit and Sierra supercomputer configurations:
[Diagram: Sierra (4-GPU half node) – one POWER9 CPU connected to two NVIDIA V100 GPUs over NVLink at 150 GB/s each, DDR4 memory at 170 GB/s, PCIe 4.0/CAPI 2.0 to InfiniBand (IB); Summit (6-GPU half node) – one POWER9 CPU connected to three NVIDIA V100 GPUs over NVLink at 100 GB/s each, DDR4 at 170 GB/s, PCIe 4.0/CAPI 2.0 to InfiniBand (IB). In both, GPUs have coherent access to system memory.]
7. IBM POWER9 AC922 Server
– Delivers unprecedented performance for modern HPC, analytics, and artificial intelligence (AI)
– Designed to fully exploit the capabilities of CPU and GPU accelerators
– Eliminates I/O bottlenecks and allows sharing memory across GPUs and CPUs
– Extraordinary POWER9 CPUs
– 2-6 NVIDIA® Tesla® V100 GPUs with NVLink
– Co-optimized hardware and software for deep learning and AI
– Supports up to 5.6x more I/O bandwidth than competitive servers
– Combines the cutting-edge AI innovation data scientists desire with the dependability IT requires
– Next-generation PCIe: PCIe Gen4, 2x faster than Gen3
8. NVIDIA Tesla V100 GPU
– Designed for AI computing and HPC
– Second-generation NVLink™
– HBM2 memory: faster, higher efficiency
– Enhanced unified memory and Address Translation Services
– Maximum Performance and Maximum Efficiency modes
– Number of SMs/cores: 80/5120
– Double-precision performance: 7.5 TFLOPS
– Single-precision performance: 15 TFLOPS
– Tensor performance: 125 Tensor TFLOPS
– GPU memory: 16 or 32 GB
– Memory bandwidth: 900 GB/s
https://devblogs.nvidia.com/inside-volta
10. CPU STREAM Bandwidth
Boost application performance with sustained peak memory bandwidth of ~280 GB/s.
STREAM benchmark (https://www.cs.virginia.edu/stream/) *not submitted
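The ~280 GB/s figure above is measured with the STREAM benchmark. Its core "triad" operation can be sketched in a few lines of host code; the array size, timing, and reporting below are illustrative, not the official benchmark code.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Simplified STREAM "triad" loop: a[i] = b[i] + scalar * c[i].
   The real benchmark repeats this (and copy/scale/add kernels)
   many times and reports the best sustained rate. */
int main(void) {
    const size_t n = 1UL << 26;                  /* ~64M doubles per array */
    double *a = (double *)malloc(n * sizeof(double));
    double *b = (double *)malloc(n * sizeof(double));
    double *c = (double *)malloc(n * sizeof(double));
    for (size_t i = 0; i < n; ++i) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    const double scalar = 3.0;
    for (size_t i = 0; i < n; ++i) a[i] = b[i] + scalar * c[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = 3.0 * n * sizeof(double);     /* two reads + one write */
    printf("Triad bandwidth: %.1f GB/s\n", bytes / sec / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```

On a real run, the loop should be parallelized (e.g. with OpenMP) across cores to approach the platform's sustained bandwidth.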
11. NVIDIA Volta V100 Compute – Single and Double Precision
– Applications get more compute power
– Shortened time to completion
– Accomplish more simulations/experiments
– 1.5x higher compute than NVIDIA P100 GPUs
[Chart: NVIDIA V100 SGEMM and DGEMM compute (TFLOPS). S822LC + P100: 4.8 DGEMM, 9.8 SGEMM; AC922 + V100: 7.45 DGEMM, 15.3 SGEMM – 1.5x higher]
13. CPU-to-GPU NVLink vs PCIe3 Bandwidth
– NVLink 2.0 is 5.6x better than PCIe3
– Removes CPU-GPU data-transfer bottlenecks
[Chart: CPU-to-GPU bandwidth (GB/s). Xeon E5-2640 v4 + P100 (PCIe3): 12; S822LC + P100: 34.16 (2.8x better); AC922 + 6 V100: 45.9 (3.8x better, 1.34x over S822LC); AC922 + 4 V100: 68 (5.6x better, 2x over S822LC)]
Note: NVIDIA bandwidth test used for measurement.
14. NVLink Bandwidth with Varied Data Sizes
– Minimize communication latencies
– Unlock PCIe bottlenecks
– Transfer larger data at high speed
– Ideal for data sizes larger than GPU memory
[Chart: NVLink 2.0 vs PCIe3 host-to-device bandwidth (MB/s) for data sizes from 1 KB to 1 GB, comparing 2-NVLink-per-GPU, 3-NVLink-per-GPU, and PCIe3 configurations]
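The host-to-device numbers above come from NVIDIA's bandwidth test; the same kind of measurement can be sketched with CUDA events. The buffer size below is illustrative, and pinned host memory (cudaMallocHost) is used so the transfer runs at full link speed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t size = 256UL << 20;              // 256 MiB transfer
    char *h, *d;
    cudaMallocHost(&h, size);                     // pinned host buffer
    cudaMalloc(&d, size);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d, h, size, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host-to-device bandwidth: %.1f GB/s\n",
           (size / 1e9) / (ms / 1e3));

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

Sweeping `size` from a few KB upward reproduces the shape of the chart: small transfers are latency-bound, large ones approach the link's peak.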
15. Workload Optimized Frequency (WOF)
– Boost performance of less active workloads through higher frequency
– Lower the frequency to save power or boost other cores
– Maximize performance by dynamically adjusting processor frequency
– Governing factors
  • Processor utilization, number of active cores & environmental conditions
– Power saver modes
  • Dynamic Performance Mode (DPM)
  • Maximum Performance Mode (MPM)
17. Bi-section Bandwidth & All-Reduce Scaling on Summit
– Good scaling at large scale due to ~74% of bisection bandwidth with adaptive routing enabled
– SMPI supports HCOLL (FCA) & SHARP, enabling applications to run with the best collective performance
The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems, SC18.
18. Burst Buffer Performance on Summit
– Improved application performance through the burst buffer
– Applications are not bottlenecked on I/O operations on parallel file systems
The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems, SC18.
21. CUDA Programming

cudaMallocHost(&h_data, size);  // Allocate pinned memory on the host
cudaMalloc(&d_data, size);      // Allocate memory on the GPU
init_dataCPU(h_data);
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);  // Move data to GPU
gpu_kernel<<<grid, block>>>(d_data);                       // GPU compute
cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);  // Move results back to CPU
cpu_processing(h_data);
22. Unified Memory Programming
– Single memory address space accessible to both CPU & GPU
– Enables oversubscribing GPU memory
  • Computation on data sizes larger than GPU memory
– System-wide atomic memory operations
– Transparent memory migration between CPU and GPU depending on which side accesses it
  • Explicit migration through cudaMemPrefetchAsync()
– Allocating unified memory
  • Replace "malloc" & "new" with "cudaMallocManaged"
[Diagram: GPU and CPU sharing a single Unified Memory address space]
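The points above can be condensed into a minimal managed-memory sketch. The kernel name and sizes are illustrative; on P9 + V100 the same allocation can in principle exceed GPU memory, since pages migrate on demand over NVLink.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *a, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

int main() {
    const size_t n = 1UL << 28;                      // ~1 GiB of floats (illustrative)
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));     // one CPU/GPU address space
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;   // CPU touches the pages
    scale<<<(n + 255) / 256, 256>>>(data, n);        // pages migrate to the GPU
    cudaDeviceSynchronize();
    printf("data[0] = %f\n", data[0]);               // pages migrate back on CPU access
    cudaFree(data);
    return 0;
}
```

Note there is no cudaMemcpy anywhere: migration is driven entirely by which processor touches the data, which is the contrast with the explicit-copy version on the previous slide.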
23. Unified Memory Advise
– ReadMostly
  • Data is mostly read, occasionally written
  • Pages are duplicated; writes are possible but expensive
– PreferredLocation
  • Specify the preferred location for data
  • "Resist" migrations away from the preferred location
– AccessedBy
  • Establish mappings to access data directly and avoid migrations

char *data;
cudaMallocManaged(&data, size);
init_dataCPU(data, size);
cudaMemPrefetchAsync(data, size, gpuID);   // Explicit migration to the GPU
cudaMemAdvise(data, size, cudaMemAdviseSetReadMostly, gpuID);
gpuKernel<<<grid, block>>>(data, size);    // Transparent data migration to GPU
cudaDeviceSynchronize();
use_dataCPU(data, size);                   // Data migrates back to CPU
24. Hardware Coherency & ATS
– Hardware coherency
  • CPU can directly access and cache GPU memory
  • Native atomics support
– Address Translation Services (ATS)
  • Allows the GPU to access the CPU's page tables directly
  • System allocator support – malloc, stack, global, file system

Simplified programming with ATS – kernels can consume any system allocation:

float *data = (float *)malloc(size);  // heap (system allocator)
gpu_kernel<<<grid, block>>>(data);

float data[1024];                     // stack
gpu_kernel<<<grid, block>>>(data);

extern float *data;                   // global
gpu_kernel<<<grid, block>>>(data);
26. GPU Direct RDMA
– Data exchange between the GPU and other peer devices using PCIe standards
– Network devices directly access GPU memory, bypassing the host
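From application code, GPUDirect RDMA is typically reached through a CUDA-aware MPI library (on Summit/Sierra, IBM Spectrum MPI). The sketch below assumes such a build; the buffer size is illustrative. The key point is that a device pointer is passed straight to MPI, so the InfiniBand adapter can read and write GPU memory without a staging copy through a host buffer.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));   // buffer lives in GPU memory

    // With a CUDA-aware MPI, the device pointer goes directly into the
    // MPI call; GPUDirect RDMA lets the NIC access GPU memory itself,
    // bypassing the host.
    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

Without CUDA-aware MPI, the same exchange would require a cudaMemcpy to a host buffer before the send and another after the receive, which is exactly the bottleneck GPUDirect RDMA removes.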
30. Monitoring and Profiling Tools
nvprof
• nvprof is a command-line profiling tool which enables you to collect and view profiling data
• Using nvprof one can collect:
  • kernel execution times
  • memory transfers
  • memory sets and CUDA API calls
  • events or metrics for CUDA kernels
NVVP (NVIDIA Visual Profiler)
• The Visual Profiler displays a timeline of your application's activity on both the CPU and GPU so that one can identify opportunities for performance improvement.
• Visualizes profile data collected from nvprof
• More documentation can be found at https://docs.nvidia.com/cuda/profiler-users-guide/index.html
32. Conclusion
➢ AC922 is designed for supercomputers
➢ Better performance for HPC applications
➢ High-speed NVLink interconnect between CPU & GPU
➢ Simplified programming using unified memory, ATS, and OpenMP
33. References
– IBM Power System AC922 Introduction and Technical Overview
– NVIDIA Volta GPU
– IBM Power Systems Proof Points
– Unified Memory on P9+V100
– Summit Supercomputer
– Sierra Supercomputer
35. Notices and Disclaimers (continued)
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM's products. IBM expressly disclaims all warranties, expressed or implied, including but not limited to, the implied warranties of merchantability and fitness for a particular purpose.

The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right.

IBM, the IBM logo, ibm.com and [names of other referenced IBM products and services used in the presentation] are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.