第12回配信講義　計算科学技術特論B（2022）

Akira Naruse, 7th July 2022
CUDAやOpenACCによる
GPUコンピューティング

2
AGENDA
 GPU Computing
 OpenACC
 CUDA
 Latest GPUs (NVIDIA H100)

4
Low latency + High throughput
CPU GPU

5
アプリケーション実行
アプリケーション・コード
GPU
CPU
並列部分は
GPUで実行
計算の
重い部分
逐次部分は
CPU上で実行
Do i=1,N
End do

6
GPUの構造 (NVIDIA A100)
6912 CUDA core/chip
この全てを活用できるかが鍵
108 Stream Multiprocessor (SM) /chip
64 CUDA core /SM

7
GPU APPLICATIONS
数百のアプリケーションがGPUに対応
www.nvidia.com/object/gpu-applications.html

9
アプリをGPU対応する方法
Application
CUDA
OpenACC
Library
GPU対応ライブラリにチェンジ
簡単に開始
主要処理をCUDAで記述
高い自由度
既存コードにディレクティブを挿入
簡単に加速

10
NVIDIA LIBRARIES
cuRAND Thrust
cuDNN
AmgX
cuBLAS
cuFFT
cuSPARSE cuSOLVER
Performance
Primitives NCCL
nvGRAPH
TensorRT

11
PARTNER LIBRARIES
Computer Vision
Sparse direct solvers
Sparse Iterative Methods
Linear Algebra
Linear Algebra
Graph
Audio and Video Matrix, Signal and Image
Math, Signal and Image Linear Algebra
Computational
Geometry
Real-time visual
simulation

12
Application
Library
簡単に開始
CUDA
OpenACC
高い自由度
簡単に加速

13
void saxpy(int n,
float a,
float *x,
float *restrict y)
{
#pragma acc parallel copy(y[:n]) copyin(x[:n])
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
saxpy(N, 3.0, x, y);
...
SAXPY (Y=A*X+Y)
OpenMP OpenACC
void saxpy(int n,
float a,
float *x,
float *restrict y)
{
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...
void saxpy(int n,
float a,
float *x,
float *restrict y)
{
#pragma omp parallel for
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...

14
Application
Library OpenACC CUDA
簡単に開始
高い自由度
簡単に加速

15
__global__ void saxpy(int n, float a,
float *x, float *y)
{
int i = threadIdx.x + blodkDim.x * blockIdx;
if (i < n)
y[i] += a*x[i];
}
...
size_t size = sizeof(float) * N;
cudaMemcpy(d_x, x, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, size, cudaMemcpyHostToDevice);
saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y);
cudaMemcpy(y, d_y, size, cudaMemcpyDeviceToHost);
...
SAXPY (Y=A*X+Y)
CPU CUDA
void saxpy(int n, float a,
float *x, float *y)
{
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...

17
Application
Library
簡単に開始
CUDA
OpenACC
高い自由度
簡単に加速

18
OPENACC
Program myscience
... serial code ...
!$acc kernels
do k = 1,n1
do i = 1,n2
... parallel code ...
enddo
enddo
!$acc end kernels
... serial code …
End Program myscience
CPU
GPU
既存のC/Fortranコード
簡単: 既存のコードに
コンパイラへのヒントを追加
強力: 相応の労力で、コンパイラが
コードを自動で並列化
オープン: 複数コンパイラベンダが、
様々なプロセッサをサポート
NVIDIA GPU, AMD GPU,
x86 CPU, ARM CPU
ヒントの
追加

19
アプリケーション・コード
GPU
CPU
並列部分は
GPUで実行
計算の
重い部分
逐次部分は
CPU上で実行
Do i=1,N
End do
OpenACC

20
void saxpy(int n,
float a,
float *x,
float *restrict y)
{
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...
SAXPY (y=a*x+y)
 omp  acc  データの移動
OpenMP OpenACC
void saxpy(int n,
float a,
float *x,
float *restrict y)
{
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...

21
subroutine saxpy(n, a, X, Y)
real :: a, Y(:), Y(:)
integer :: n, i
!$acc parallel copy(Y(:)) copyin(X(:))
do i=1,n
Y(i) = a*X(i)+Y(i)
enddo
!$acc end parallel
end subroutine saxpy
...
call saxpy(N, 3.0, x, y)
...
SAXPY (y=a*x+y, FORTRAN)
 FORTRANも同様
OpenMP OpenACC
subroutine saxpy(n, a, X, Y)
real :: a, X(:), Y(:)
integer :: n, i
!$omp parallel do
do i=1,n
Y(i) = a*X(i)+Y(i)
enddo
!$omp end parallel do
end subroutine saxpy
...
call saxpy(N, 3.0, x, y)
...

22
float *x,
float *restrict y)
{
for (int i = 0; i < n; ++i) {
y[i] += a*x[i];
}
}
...
...
OPENACC: PARALLEL DIRECTIVE
並列化領域を
指示する

23
float *x,
float *restrict y)
{
#pragma acc kernels copy(y[:n]) copyin(x[:n])
for (int i = 0; i < n; ++i) {
y[i] += a*x[i];
}
}
...
...
OPENACC: KERNELS DIRECTIVE
並列化領域を
指示する
Parallel: ユーザ主導
Kernels: コンパイラ主導

24
float *x,
float *restrict y)
{
#pragma acc loop independent
for (int i = 0; i < n; ++i) {
y[i] += a*x[i];
}
}
...
...
OPENACC: LOOP CLAUSE
ループの並列化方法を
指示する
independent: 並列化可能
seq: 逐次実行
collapse: ループ融合
gang/vector: 並列化粒度
...

25
float *x,
float *restrict y)
{
for (int i = 0; i < n; ++i) {
y[i] += a*x[i];
}
}
...
...
OPENACC: DATA CLAUSE
データの移動方法を
指示する
copyin: CPU  GPU
copyout: CPU  GPU
copy: (両方)
create: (メモリ確保)
...

26
float *x,
float *restrict y)
{
#pragma acc kernels present(x, y)
for (int i = 0; i < n; ++i) {
y[i] += a*x[i];
}
}
#pragma acc data copy(y[:N]) create(x[:N])
{
produce_x();
consume_y();
}
OPENACC DATA DIRECTIVE
データ領域を
指示する

27
OPENMP と OPENACC の共存
float *x,
float *restrict y)
{
for (int i = 0; i < n; ++i) {
y[i] += a*x[i];
}
}
...
...
(*) この場合、
OpenMP と
OpenACC の両方
を有効にするのは危
険です。

28
簡単にコンパイル
OpenACC
float *x,
float *restrict y)
{
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...
$ nvc –acc=gpu –Minfo=acc –gpu=... saxpy.c
saxpy:
16, Generating present_or_copy(y[:n])
Generating present_or_copyin(x[:n])
Generating Tesla code
19, Loop is parallelizable
Accelerator kernel generated
19, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

29
簡単に実行
OpenACC
float *x,
float *restrict y)
{
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...
$ nvc –acc=gpu –Minfo=acc –gpu=... saxpy.c
saxpy:
16, Generating present_or_copy(y[:n])
Generating present_or_copyin(x[:n])
Generating Tesla code
19, Loop is parallelizable
Accelerator kernel generated
19, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
$ nsys nvprof ./a.out
...
Time(%) Time Calls Avg Min Max Name
62.95% 3.0358ms 2 1.5179ms 1.5172ms 1.5186ms [CUDA memcpy HtoD]
31.48% 1.5181ms 1 1.5181ms 1.5181ms 1.5181ms [CUDA memcpy DtoH]
5.56% 268.31us 1 268.31us 268.31us 268.31us saxpy_19_gpu

30
OPENACCが加速する科学技術計算

31
Janus Juul Eriksen, PhD Fellow
qLEAP Center for Theoretical Chemistry, Aarhus University
“
OpenACC makes GPU computing approachable for
domain scientists. Initial OpenACC implementation
required only minor effort, and more importantly,
no modifications of our existing CPU implementation.
“
LSDALTON
Large-scale application for calculating high-
accuracy molecular energies
Lines of Code
Modified
# of Weeks
Required
# of Codes to
Maintain
<100 Lines 1 Week 1 Source
Big Performance
7.9x
8.9x
11.7x
ALANINE-1
13 ATOMS
ALANINE-2
23 ATOMS
ALANINE-3
33 ATOMS
Speedup
vs
CPU
Minimal Effort
LS-DALTON CCSD(T) Module
Benchmarked on Titan Supercomputer (AMD CPU vs Tesla K20X)
https://developer.nvidia.com/openacc/success-stories

32
Brad Sutton, Associate Professor of Bioengineering and
Technical Director of the Biomedical Imaging Center
University of Illinois at Urbana-Champaign
“
Now that we’ve seen how easy it is to program the
GPU using OpenACC and the PGI compiler, we’re
looking forward to translating more of our projects
“
CHALLENGE
• Produce detailed and accurate brain images by applying
computationally intensive algorithms to MRI data
• Reduce reconstruction time to make diagnostic use possible
SOLUTION
• Accelerated MRI reconstruction application with OpenACC
using NVIDIA GPUs
RESULTS
• Reduced reconstruction time for a single high-resolution MRI
scan from 40 days to a couple of hours
• Scaled on Blue Waters at NCSA to reconstruct 3000 images in
under 24 hours
POWERGRID
Advanced MRI Reconstruction Model

33
“
OpenACC enabled us to
target routines for GPU
acceleration without
rewriting code, allowing us
to maintain portability on a
code that is 20-years old
“
David Gutzwiller, Head of HPC
NUMECA International
CHALLENGE
• Accelerate 20 year old highly optimized code
on GPUs
SOLUTION
• Accelerated computationally intensive routines
with OpenACC
RESULTS
• Achieved 10x or higher speed-up on key
routines
• Full app speedup of up to 2x on the Oak Ridge
Titan supercomputer
• Total time spent on optimizing various routines
with OpenACC was just five person-months
NUMECA FINE/TURBO
Commercial CFD Application

34
The most significant result from
our performance studies is faster
computation with less energy
consumption compared with our
CPU-only runs.
Dr. Misun Min, Computation Scientist
Argonne National Laboratory
CHALLENGE
• Enable NekCEM to deliver strong scaling on
next-generation architectures while
maintaining portability
SOLUTION
• Use OpenACC to port the entire program
to GPUs
RESULTS
• 2.5x speedup over a highly tuned CPU-only
version
• GPU used only 39 percent of the energy
needed by 16 CPUs to do the same
computation in the same amount of time
NEKCEM
Computational Electromagnetics Application
“
“

35
Adam Jacobs, PhD candidate in the Department of Physics and
Astronomy at Stony Brook University
On the reactions side, accelerated calculations allow us to model larger
networks of nuclear reactions for similar computational costs as the
simple networks we model now
MAESTRO & CASTRO
Astrophysics 3D Simulation
“
“
CHALLENGE
• Pursuing strong scaling and portability to run
code on GPU-powered supercomputers
SOLUTION
• OpenACC compiler on OLCF’s Titan
supercomputer
RESULTS
• 4.4x faster reactions than on a multi-core
computer with 16 cores
• Accelerated calculations allow modeling larger
networks of nuclear reactions for similar
computational costs as simple networks
2 weeks to
learn OpenACC
2 weeks to
modify code

36
Wayne Gaudin and Oliver Perks
Atomic Weapons Establishment, UK
We were extremely impressed that we can run OpenACC on a CPUwith
no code change and get equivalent performance to our OpenMP/ MPI
implementation.
CLOVERLEAF
Performance Portability for a Hydrodynamics Application
Speedup
vs
1
CPU
Core
Benchmarked Intel(R) Xeon(R) CPUE5-2690 v2 @3.00GHz, Accelerator: Tesla K80
“
“
CHALLENGE
• Application code that runs across architectures
without compromising on performance
SOLUTION
• Use OpenACC to port the CloverLeaf mini app
to GPUs and then recompile and run the same
source code on multi-core CPUs
RESULTS
• Same performance as optimized OpenMP
version on x86 CPU
• 4x faster performance using the same code on
a GPU

37
Lixiang Luo, Researcher
Aerospace Engineering Computational Fluid Dynamics Laboratory
North Carolina State University (NCSU)
“
OpenACC is a highly effective tool for programming fully
implicit CFD solvers on GPU to achieve true 4X speedup
“
* CPU Speedup on 6 cores of Xeon E5645, with additional cores performance reduces due to partitioning and MPI overheads
CHALLENGE
• Accelerate a complex implicit CFD solver on GPU
SOLUTION
• Used OpenACC to run solver on GPU with
minimal code changes.
RESULTS
• Achieved up to 4X speedups on Tesla GPU over
parallel MPI implementation on x86 multicore
processor
• Structured approach to parallelism provided by
OpenACC allowed better algorithm design
without worrying about GPU architecture
INCOMP3D
3D Fully Implicit CFD Solver

38
Earthquake Simulation (理研、東大地震研)
 WACCPD 2016(SC16併催WS): Best Paper Award
http://waccpd.org/wp-content/uploads/2016/04/SC16_WACCPD_fujita.pdf より引用

39
NICAM
 気象・気候モデル by 理研AICS/東大
 膨大なコード (数十万行)
 ホットスポットがない (パレートの法則)
 特性の異なる2種類の処理
 力学系 … メモリバンド幅ネック
 物理系 … 演算ネック

40
NICAM: 力学系(NICAM-DC)
 OpenACCによるGPU化
 主要サブルーチンは、全てGPU
上で動作(50以上)
 MPI対応済み
 2週間
 良好なスケーラビリティ
 Tsubame 2.5, 最大2560
GPUs
 Scaling factor: 0.8
Weak scaling
1.E+01
1.E+02
1.E+03
1.E+04
1.E+05
1.E+00 1.E+01 1.E+02 1.E+03 1.E+04
Performance
(GFLOPS)
Number of CPUs or GPUs
Tsubame 2.5 (GPU:K20X)
K computer
Tsubame 2.5 (CPU:WSM)
(*) weak scaling
Courtesy of Dr. Yashiro from RIKEN AICS

41
NICAM: 物理系(SCALE-LES)
 Atmospheric radiation transfer
 物理系の中で、最も重い計算
 OpenACCによるGPU対応済
1.00 1.99 3.88 8.51
37.8
76.0
151
0
20
40
60
80
100
120
140
160
1 core 2 core 4 core 10 core 1 GPU 2 GPUs 4 GPUs
Xeon E5-2690v2(3.0GHz,10-core) Tesla K40
Speedup
vs.
CPU
1-core
(*) PCIデータ転送時間込み, グリッドサイズ:1256x32x32
Better

42
SEISM3D
 地震シミュレーション by 古村教授(東大地震研)
 主要サブルーチンのGPU対応が完了
 メモリバンド幅ネック、 3次元モデル(2次元分割)、隣接プロセス間通信
605
459
134
0
100
200
300
400
500
600
K: 8x SPARC64
VIIIfx
CPU: 8x Xeon
E5-2690v2
GPU: 8x Tesla
K40
Time
(sec)
SEISM3D (480x480x1024, 1K steps)
3.4x speedup
(アプリ全体)
0
20
40
60
80
100
120
140
GPU: 8x Tesla K40
Others (CPU, MPI and so on)
[CUDA memcpy DtoH]
[CUDA memcpy HtoD]
(other subroutines)
update_vel_pml
update_vel
update_stress_pml
update_stress
diff3d_*
GPUの実行時間内訳
Better

43
CM-RCM IC-CG
 IC-CG法のベンチマークコード by 中島教授(東大)
 CM-RCM法(Cyclic Multi-coloring Reverse Cuthill-Mckee)を使用
 メインループ内のサブルーチンを全てOpenACCでGPU化
3.36
1.58 1.53 1.39
0.655
0
1
2
3
4
OpenMP OpenMP OpenMP OpenMP OpenACC
Opteron 6386SE
(2.8GHz,16core)
SPARC64 Ixfx
(1.85GHz,16core)
Xeon E5-2680v2
(2.8GHz,10core)
Xeon-Phi 5110P Tesla K40
Time
(second)
CM-RCM ICCG (100x100x100)
Better
Courtesy of Dr. Ohshima from U-Tokyo

44
CCS-QCD
 QCDコード by 石川准教授(広島大)
 BiCGStab計算を全てOpenACCでGPU化
 データレイアウトを変更: AoS  SoA
32.3
24.3 22.1 20.9 19.7
42.5
53.3
57.9 60.8 63.4
0
10
20
30
40
50
60
70
80
90
8x8x8x32 8x8x8x64 8x8x16x64 8x16x16x64 16x16x16x64
GFLOPS
Problem Size
CCS-QCD: BiCGStab Total FLOPS
Xeon E5-2690v2(3.0GHz,10core,OpenMP) Tesla K40(OpenACC)
Better

46
CUDAプログラミング
 プログラミングモデル
 アーキテクチャ
 性能Tips

47
 高スループット指向のプロセッサ
 分離されたメモリ空間
GPU Memory
CPU Memory
PCI
CPU GPU

48
GPUプログラム
float *x, float *y)
{
if (i < n)
y[i] += a*x[i];
}
...
saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y);
cudaDeviceSynchronize();
...
float *x, float *y)
{
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...
CPU GPU

49
GPU実行の基本的な流れ
 GPUは、CPUからの制御
で動作
 入力データ: CPUからGPUに
転送 (H2D)
 GPUカーネル: CPUから投入
 出力データ: GPUからCPUに
転送 (D2H)
CPU GPU
入力データ
転送
GPUカーネル
投入
同期
出力データ
転送
GPU上で
演算

50
GPUプログラム
float *x, float *y)
{
if (i < n)
y[i] += a*x[i];
}
...
saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y);
...
float *x, float *y)
{
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...
CPU GPU
入力データ転送
カーネル起動
同期
出力データ転送

51
GPUプログラム (Unified Memory)
float *x, float *y)
{
if (i < n)
y[i] += a*x[i];
}
...
saxpy<<< N/128, 128 >>>(N, 3.0, x, y);
...
float *x, float *y)
{
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...
CPU GPU
カーネル起動
同期

52
GPUプログラム (Unified Memory)
float *x, float *y)
{
if (i < n)
y[i] += a*x[i];
}
...
int dev_id = 0;
cudaMemPrefetchAsync(x, size, dev_id);
cudaMemPrefetchAsync(y, size, dev_id);
saxpy<<< N/128, 128 >>>(N, 3.0, x, y);
...
float *x, float *y)
{
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...
CPU GPU
プリフェッチ

53
GPUカーネル
 GPUカーネル: 1つのGPUスレッドの処理内容を記述
 基本: 1つのGPUスレッドが、1つの配列要素を担当
float *x, float *y)
{
int i = threadIdx.x + blodkDim.x * blockIdx.x;
if (i < n)
y[i] += a*x[i];
}
...
saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y);
...
float *x, float *y)
{
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...
CPU GPU
Global スレッドID

54
Execution Configuration
(ブロック数とブロックサイズ)
float *x, float *y)
{
if (i < n)
y[i] += a*x[i];
}
...
saxpy<<< (N+127)/128, 128 >>>(N, 3.0, d_x, d_y);
...
ブロック数ブロックサイズ
ブロック数 x ブロックサイズ ≧ 配列要素数
スレッドID
ブロックID
ブロックサイズ

55
スレッド階層
(スレッド、ブロック、グリッド)
グリッド
x[] 0 127 128 255
y[] 0 127 128 255
y[i] = a*x[i] + y[i]
256 383 384
256 383 384
ブロック2
ブロック1
ブロック0
スレッド
(global)
0 127 128 255 256 383 384
 ブロックサイズ(スレッド数/ブロック)は、カーネル毎に設定可能
 推奨: 128 or 256 スレッド
スレッド
(local)
0 127 0 127 0 127 0

56
Execution Configuration
(ブロック数とブロックサイズ)
float *x, float *y)
{
if (i < n)
y[i] += a*x[i];
}
...
saxpy<<< (N+127)/128, 128 >>>(N, 3.0, d_x, d_y);
...
ブロック数ブロックサイズ
ブロック数 x ブロックサイズ ≧ 配列要素数
(N+ 63)/ 64, 64
(N+255)/256, 256
スレッドID
ブロックID
ブロックサイズ

57
2D配列のGPUカーネル例
 ブロックサイズ(ブロック形状)は、1D～3Dで表現可能
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
int j = threadIdx.y + blodkDim.y * blockIdy.y;
if ( i < N && j < N )
C[i][i] = A[i][j] + B[i][j];
}
...
dim3 sizeBlock( 16, 16 );
dim3 numBlocks( N/sizeBlock.x, N/sizeBlock.y );
MatAdd<<< numBlocks, sizeBlock >>>(A, B, C);
...
32, 8
64, 4
Globalスレッド ID (x)
GlobalスレッドID (y)
ブロックサイズ (x,y)
ブロック数 (x,y)

58
ブロック・マッピング、スレッド・マッピング
dim3 sizeBlock(16,16)
(0,0) (1,0) (2,0)
(0,1) (1,1) (2,1)
(0,2) (1,2) (2,2)
ブロックID(blockIdx)
dim3 sizeBlock(32,8)
(0,0) (1,0)
(0,1) (1,1)
(0,2) (1,2)
(0,3) (1,3)
(0,4) (1,4)
(0,0)
(15,15)
(0,0)
(31,7)
スレッドID(threadIdx)

59
 性能Tips

60
GPUアーキテクチャ概要 PCI I/F
 ホスト接続インタフェース
Giga Thread Engine
 SMに処理を割り振るスケジュー
ラ
DRAM I/F (HBM2)
 全SM、PCI I/Fからアクセス可
能なメモリ (デバイスメモリ, フ
レームバッファ)
L2 cache (40 MB)
 全SMからアクセス可能なR/W
キャッシュ
SM (Streaming Multiprocessor)
 「並列」プロセッサ、A100:108

61
Stream Multi-Processor
演算ユニット
 INT32: 64個
 FP32: 64個
 FP64: 32個
 TensorCore: 4個
その他ユニット
 LD/ST, SFU, etc.
レジスタ(32bit): 64K個 (256KB)
共有メモリ/L1キャッシュ: 192KB
NVIDIA A100

62
GPUカーネル実行の流れ
 CPUが、GPUに、グリッドを投入
 具体的な投入先は、Giga Thread Engine
グリッド
ブロック
ブロック
スレッド
スレッド

63
 Giga Thread Engine(GTE)が、SMに、ブロックを投入
 GTEは、ブロックスケジューラ
 グリッドをブロックに分解して、ブロックを、空いているSMに割当てる
グリッド
ブロック
ブロック
スレッド
スレッド

64
ブロックをSMに割り当て
 各ブロックは、互いに独立に実行
 ブロック間では同期しない、実行順序の保証なし
 1つのブロックは複数SMにまたがらない
 1つのSMに、複数ブロックが割当てられることはある
グリッド
ブロック
ブロック
ブロック
ブロック

65
 SM内のスケジューラが、スレッドをCUDAコアに投入
グリッド
ブロック
ブロック
スレッド
スレッド

66
 SM内のスケジューラが、ワープをCUDAコアに投入
 ワープ: 32スレッドの塊
 ブロックをワープに分割、実行可能なワープを、空CUDAコアに割当てる
グリッド
ブロック
ブロック
ワープ
ワープ

67
ワープのCUDAコアへの割り当て
 ワープ内の32スレッドは、同じ命令を同期して実行
 各ワープは、互いに独立して実行
 同じブロック内のワープは、明示的に同期可能(__syncthreads())
グリッド
ブロック
ブロック
ワープ
ワープ
ワープ
SIMT
(Single Instruction Multiple Threads)

68
GPUアーキの変化を問題としないプログラミングモデル
Kepler, CC3.5
192 cores /SM
Pascal, CC6.0
64 cores /SM
Maxwell, CC5.0
128 cores /SM

69
 性能Tips

70
デバイスメモリへのアクセスは、まとめて
 コアレス・アクセス
 32スレッド(ワープ)のロード/ストアをまとめて、メモリトランザクションを発行
 トランザクションサイズ: 32B, 64B or 128B
 トランザクション数は、少ないほど良い
配列 128B境界 128B境界
0 1 2 29 30 31
28
3
0 1 14 17 30 31
16
15
0 1 2
Best
Good
Bad
128B x1
64B x2
32B x32

71
リソース使用率 (Occupancy)
SMの利用効率を上げる
≒ SMに割当て可能なスレッド数を、
上限に近づける
 レジスタ使用量(/スレッド)
 できる限り減らす
 DP(64bit)は、2レジスタ消費
 レジスタ割り当て単位は8個
 レジスタ使用量と、割当て可能なスレッド数の関係
 32レジスタ: 2048(100%), 64レジスタ: 1024(50%)
 128レジスタ:512(25%), 256レジスタ: 256(12.5%)
CUDAコア数: 64
最大スレッド数: 2048
最大ブロック数: 32
共有メモリ: 160KB
レジスタ数(32-bit): 64K個
リソース量/SM (A100)

72
SMの利用効率を上げる
≒ SMに割当て可能なスレッド数を、
上限に近づける
 スレッド数(/ブロック)
 64以上にする
 64未満だと最大ブロック数がネックになる
 共有メモリ使用量(/ブロック)
 できる限り減らす
 共有メモリ使用量と、割当て可能なブロック数の関係
 32KB:2ブロック, 16KB:4ブロック, 8KB:8ブロック
CUDAコア数: 64
最大スレッド数: 2048
最大ブロック数: 32
共有メモリ: 160KB
レジスタ数(32-bit): 64K個
リソース量/SM (A100)

73
空き時間を埋める
 CUDAストリーム (≒キュー)
 同じCUDAストリームに投入した操作: 投入順に実行
 別のCUDAストリームに投入した操作: 非同期に実行 (オーバラップ実行)
(*) 操作: GPUカーネル, データ転送
[CUDAストリームの効果例]
GPUカーネルとデータ転送が
オーバラップして
同時に実行されている

74
分岐を減らす
 ワープ内のスレッドが別パスを選択すると遅くなる
 ワープ内のスレッドは、命令を共有 (SIMT)
 ワープ内のスレッドが選んだ全パスの命令を実行
 あるパスの命令を実行中、そのパスにいないスレッドはinactive状態
 Path divergenceを減らす
 できる限り、同ワープ内のスレッドは
同じパスを選択させる
A[id] > 0
処理 X
処理 Y
true false
0 1
2
3

76
HOPPER アーキテクチャ: 高性能、スケーラブル、セキュアな GPU
NVIDIA H100 (SXM5)

77
HOPPER アーキテクチャ
NVIDIA H100 GPU
132 SMs
16,896 FP32 units
528 Tensor Cores
Larger L2, 50 MB
HBM3 DRAM, 80 GB, 3 TB/s
4th-gen NVLink
900 GB/s (50 GB/s x 18 links)
SHARP, NVLink network
PCIe gen5
2nd-gen MIG
Confidential Computing
Thread Block
Clusters

78
HOPPER アーキテクチャ
GH100 Stream Multiprocessor
第 4 世代 Tensor Core
2 倍の FMA 性能 (FP32、FP64)
新 DPX 命令セット
L1/共有メモリのサイズ増, 256 KB
Thread Block Clusters
複数 SM 間の協調処理
Tensor Memory Accelerator (TMA)
テンソルデータの非同期コピー

79
Hopper Tensor Core
第 4 世代
約 3 倍の性能 UP
クロックあたり性能 2 倍
(/SM)
SM 数増: 108  132
クロック速度向上
FP8 対応
A100 H100
dense sparse dense sparse
FP64 TC 19.5 NA 60 NA
FP64 9.7 NA 30 NA
FP32 19.5 NA 60 NA
TF32 TC 156 312 500 1,000
FP/BF16 TC 312 624 1,000 2,000
FP16 78 NA 120 NA
FP8 TC NA NA 2,000 4,000
Peak TFLOPS
(*) H100 の性能値は現時点の概算値です
New

80
THREAD BLOCK CLUSTER
クラスタ
CUDA: 4階層
HW: 4階層
Grid Cluster Block Thread
GPU GPC SM CUDA core
CUDA に新しい階層を追加 (CUDA 12)
グリッド: クラスタの集合 (カーネル)
クラスタ: ブロックの集合
 ある一つの GPC 内の、複数 SM に割当
ブロック: スレッドの集合
クラスタの機能
 クラスタ内ブロックは、同時にスケジューリング
 分散共有メモリ: 互いの共有メモリにアクセス可
 高速なクラスタ内同期 (HWサポート) H100: 132 SM

81
まとめ
 GPU Computing
 OpenACC
 CUDA
 Latest GPUs (NVIDIA H100)

第12回配信講義　計算科学技術特論B（2022）

第12回配信講義　計算科学技術特論B（2022）

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie 第12回配信講義　計算科学技術特論B（2022）

Ähnlich wie 第12回配信講義　計算科学技術特論B（2022） (20)

Mehr von RCCSRENKEI

Mehr von RCCSRENKEI (15)