4. Overview of earthquake simulation
• Fault ~ crust ~ ground ~ structures
• Domain size: 100 km
• Spatial resolution: 0.1 m
• The finite element method is often used to solve cities and other domains with complex structures at high accuracy
• City-scale problems become extremely large, reaching hundreds of billions of degrees of freedom, and the finite element method is dominated by random memory access, which tends to lower computational efficiency; high-performance computing techniques are therefore essential
Structured grid (e.g., finite difference method), example code:
do i=1,nx
  do j=1,ny
    a(i,j)=b(i,j)+b(i+1,j)+...
  enddo
enddo

Unstructured grid (e.g., finite element method), example code:
do i=1,n
  do j=ptr(i)+1,ptr(i+1)
    a(i)=a(i)+b(index(j))
  enddo
enddo
5. Analysis example
• Target: fault ~ crust ~ ground ~ structures ~ social activity
[Figure: a) earthquake wave propagation (0 km to -7 km depth); b) city response simulation; c) resident evacuation: two million agents evacuating to the nearest safe site around Shinjuku, Tokyo station, Ikebukuro, Shibuya, Shinbashi, and Ueno (earthquake and post-earthquake phases)]
Example of an earthquake simulation using the full K computer system
T. Ichimura et al., Implicit Nonlinear Wave Simulation with 1.08T DOF and 0.270T Unstructured Finite Elements to Enhance Comprehensive
Earthquake Simulation, SC15 Gordon Bell Prize Finalist
8. Solver example 1: ground amplification analysis for earthquakes on the K computer
• Solve preconditioning matrix roughly to reduce number of CG loops
• Use geometric multi-grid method to reduce cost of preconditioner
• Use single precision in preconditioner to reduce computation & communication
Solver structure:
• Outer loop: CG iterations on the equation to be solved (double precision, second-order tetrahedra), with the computations of the outer loop wrapped around solving the preconditioning matrix
• Solving the preconditioning matrix (done roughly at each outer iteration):
  • Inner coarse loop: solve the system roughly using a CG solver in single precision on the linear-tetrahedron model; use the result as the initial solution of the inner fine loop
  • Inner fine loop: solve the system roughly using a CG solver in single precision on the second-order tetrahedron model; use the result as the preconditioner of the outer loop
T. Ichimura et al., Implicit Nonlinear Wave Simulation with 1.08T DOF and 0.270T Unstructured Finite Elements to Enhance Comprehensive Earthquake Simulation, SC15 Gordon Bell Prize Finalist
14. Peak performance
• Clock frequency x number of floating-point units (superscalar) x FMA x number of SIMD lanes x number of cores x number of nodes
• FMA: fused multiply-add, one instruction that performs a multiply and an add (a=b*c+d)
• Example: a compute node of Oakbridge-CX
  • 56-core SMP shared-memory node (28-core Intel Xeon Platinum 8280 CPU @ 2.7 GHz x 2 sockets)
  • Cascade Lake Xeon CPUs support the AVX-512 instruction set
  • 512-bit SIMD: 8 double-precision (or 16 single-precision) elements can be operated on at once
  • Peak performance: 2.7 GHz x 2 (floating-point units) x 2 (FMA) x 8 (double-precision SIMD) x 56 cores x 1 node = 4838 GFLOPS
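The same estimate written as a small program (a minimal sketch that simply evaluates the formula with the node parameters above; the variable names are illustrative):

! Peak FLOPS estimate for one Oakbridge-CX node
program peak_flops
  implicit none
  real :: clock_ghz, fpu, fma, simd_fp64, cores, nodes, peak_gflops
  clock_ghz = 2.7   ! clock frequency [GHz]
  fpu       = 2.0   ! superscalar floating-point units per core
  fma       = 2.0   ! fused multiply-add: 2 flops per instruction
  simd_fp64 = 8.0   ! FP64 elements per 512-bit SIMD operation
  cores     = 56.0  ! cores per node (2 sockets x 28 cores)
  nodes     = 1.0
  peak_gflops = clock_ghz * fpu * fma * simd_fp64 * cores * nodes
  print '(a,f8.1,a)', 'Peak performance: ', peak_gflops, ' GFLOPS'  ! prints 4838.4 GFLOPS
end program peak_flops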
34. Target problem
• Implicit dynamic nonlinear low-order unstructured finite element method
• Suited to solving domains with heterogeneously distributed nonlinear material properties and complex geometry, such as the seismic response of cities
• Solves large linear systems of equations many times
• Involves a large amount of random data access
Solve for each of a few thousand time steps:
  Ku = f
where K is a sparse symmetric positive definite matrix (changes every time step), u is the unknown vector (up to 1 trillion degrees of freedom), and f is the known outer force vector.
35. SC14 unstructured finite-element solver
• Designed for the CPU-based K computer
• Uses an algorithm that can obtain equal granularity on millions of cores
• Uses a matrix-free matrix-vector product (Element-by-Element method): good load balance when the number of elements per core is equal
  • Also attains high peak performance since the computation stays on cache
• Combines the Element-by-Element method with multi-grid, mixed-precision arithmetic, and an adaptive conjugate gradient method
• Scalability and peak performance are good (the key kernels are Element-by-Element) and convergence is good, thus time-to-solution is good
Element-by-Element method: f = Σe Pe Ke Pe^T u (Ke is generated on-the-fly)
[Figure: the element-wise products Ke u are computed for Element #0, Element #1, ..., Element #N-1 and added into the global vector f]
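A minimal sketch of such a matrix-free Element-by-Element matrix-vector product for 4-node tetrahedra (3 degrees of freedom per node) is shown below; the module, routine, and array names are illustrative, and the element stiffness routine is only a placeholder standing in for the on-the-fly computation:

module ebe_sketch
  implicit none
contains
  ! Placeholder element stiffness (identity matrix), standing in for the
  ! on-the-fly generation of Ke from nodal coordinates and material properties
  subroutine element_stiffness(ie, ke)
    integer, intent(in)  :: ie
    real,    intent(out) :: ke(12,12)
    integer :: k
    ke = 0.0
    do k = 1, 12
      ke(k,k) = 1.0
    enddo
  end subroutine element_stiffness

  ! f = sum_e Pe Ke Pe^T u
  subroutine ebe_matvec(ne, nn, cny, u, f)
    integer, intent(in)  :: ne, nn        ! number of elements / nodes
    integer, intent(in)  :: cny(4, ne)    ! element connectivity
    real,    intent(in)  :: u(3, nn)
    real,    intent(out) :: f(3, nn)
    real :: ue(12), fe(12), ke(12,12)
    integer :: ie, ia
    f = 0.0
    do ie = 1, ne
      do ia = 1, 4                        ! gather: ue = Pe^T u
        ue(3*ia-2:3*ia) = u(:, cny(ia, ie))
      enddo
      call element_stiffness(ie, ke)      ! Ke generated on-the-fly
      fe = matmul(ke, ue)                 ! fe = Ke ue (dense, on-cache computation)
      do ia = 1, 4                        ! scatter-add: f = f + Pe fe (random access)
        f(:, cny(ia, ie)) = f(:, cny(ia, ie)) + fe(3*ia-2:3*ia)
      enddo
    enddo
  end subroutine ebe_matvec
end module ebe_sketch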
39. Overview of the time-parallel algorithm
• Predict the solutions of future steps by solving multiple time steps simultaneously
• Reduces the computational cost per iteration
• The total number of operations is almost the same as with a standard solver, but a more efficient kernel can be used
[Figure: the kernel in the previous solver handles one vector (ui, fi) per element, while the kernel in the time-parallel solver handles the current time step and three future time steps (ui...ui+3, fi...fi+3) together for each element; the four vectors are contiguous in memory (i.e., SIMD efficient), giving less random access]
Tsuyoshi Ichimura, Kohei Fujita, Masashi Horikoshi, Larry Meadows, Kengo Nakajima, Takuma Yamaguchi, Kentaro Koyama, Hikaru Inoue, Akira Naruse, Keisuke Katsushima, Muneo Hori, Maddegedara Lalith, A Fast Scalable Implicit Solver with Concentrated Computation for Nonlinear Time-evolution Problems on Low-order Unstructured Finite Elements, 32nd IEEE International Parallel and Distributed Processing Symposium, 2018.
Kohei Fujita, Keisuke Katsushima, Tsuyoshi Ichimura, Masashi Horikoshi, Kengo Nakajima, Muneo Hori, Lalith Maddegedara, Wave Propagation Simulation of Complex Multi-Material Problems with Fast Low-Order Unstructured Finite-Element Meshing and Analysis, Proceedings of HPC Asia 2018 (Best Paper Award), 2018.
40. Standard solver algorithm vs. developed algorithm

Standard solver algorithm:
1: set x_{-1} ← 0
2: for( i = 0; i < n; i = i + 1 ){
3:   guess x̄_i using standard predictor
4:   set b_i using x_{i-1}
5:   solve x_i ← A^{-1} b_i using initial solution x̄_i (computed using iterative solver with SpMV kernel)
6: }

Developed algorithm:
1: set x_{-1} ← 0 and x̄_i ← 0 (i = 0, ..., m-2)
2: for( i = 0; i < n; i = i + 1 ){
3:   set b_i using x_{i-1}
4:   guess x̄_{i+m-1} using standard predictor
5:   b̄_i ← b_i
6:   while ( ||A x̄_i - b_i|| / ||b_i|| > ε ) do {
7:     guess b̄_j using x̄_{j-1} (j = i+1, ..., i+m-1)
8:     refine solutions { x̄_j ← A^{-1} b̄_j } with initial solutions x̄_j (j = i, ..., i+m-1) (computed using iterative solver with concentrated computation kernel)
9:   }
10:  x_i ← x̄_i
11: }
41. Naïve implementation of the time-parallel algorithm (m steps)
• Each core computes m vectors using SIMD
• However, m ≤ 4 is typical in actual problems: the full SIMD width cannot be used
• Temporary vectors are allocated to avoid data races between cores
  • Costly when using many-core machines
[Figure: core-wise computation (m = 4): 1. initialize the necessary components of the core-wise temporary vectors (ft, gray); 2. update components by EBE (black), computing fi...fi+3 (current and three future time steps) from ui...ui+3 using 4-wide SIMD; 3. add the necessary components into the global left-hand-side vector (f). Many-core computation shown for three cores]
42. Naïve implementation of the EBE kernel with m vectors

1 !$OMP PARALLEL DO
2 do iu=1,numberofthreads ! for each thread
3 do i=1,nnum(iu)
4 i1=nlist(i,iu)
5 do im=1,m
6 ft(im,1,i1,iu)=0.0 ! clear temporary vector
7 ft(im,2,i1,iu)=0.0
8 ft(im,3,i1,iu)=0.0
9 enddo
10 enddo
11 do ie=npl(iu)+1,npl(iu+1)
12 cny1=cny(1,ie)
13 cny2=cny(2,ie)
14 cny3=cny(3,ie)
15 cny4=cny(4,ie)
16 xe11=x(1,cny1)
17 xe21=x(2,cny1)
...
18 xe34=x(3,cny4)
19 do im=1,m
20 ! load ue11~ue34
21 ue11=u(im,1,cny1)
22 ue21=u(im,2,cny1)
...
23 ue34=u(im,3,cny4)
24 ! compute BDBu using ue11~ue34 and xe11~xe34
25 BDBu11=...
26 BDBu21=...
...
27 BDBu34=...
28 ! add to temporary vector
29 ft(im,1,cny1,iu)=BDBu11+ft(im,1,cny1,iu)
30 ft(im,2,cny1,iu)=BDBu21+ft(im,2,cny1,iu)
...
31 ft(im,3,cny4,iu)=BDBu34+ft(im,3,cny4,iu)
32 enddo ! im
33 enddo ! ie
34 enddo ! iu
35 !$OMP END PARALLEL DO
[Annotations: the im loop above is the SIMD computation with width m; lines 1-10 perform step 1 (initialize the necessary components of the core-wise temporary vectors ft) and lines 11-33 perform step 2 (update components by EBE)]
36 !$OMP PARALLEL DO
37 ! clear global vector
38 do i=1,n
39 do im=1,m
40 f(im,1,i)=0.0
41 f(im,2,i)=0.0
42 f(im,3,i)=0.0
43 enddo
44 enddo
45 !$OMP END PARALLEL DO
46 do iu=1,numberofthreads
47 !$OMP PARALLEL DO
48 ! add to global vector
49 do i=1,nnum(iu)
50 i1=nlist(i,iu)
51 do im=1,m
52 f(im,1,i1)=f(im,1,i1)+ft(im,1,i1,iu)
53 f(im,2,i1)=f(im,2,i1)+ft(im,2,i1,iu)
54 f(im,3,i1)=f(im,3,i1)+ft(im,3,i1,iu)
55 enddo
56 enddo
57 !$OMP END PARALLEL DO
58 enddo
[Annotation: lines 36-58 perform step 3 (add the necessary components of ft into the global left-hand-side vector f)]
43. Wide-SIMD CPUにおける効率的な計算方
法
• ベクトルをパック・アンパックすることでSIMD幅をすべて使う
43
…
Element #0
Element #1
ui
ke
ke
Future time steps
Current time step ui+1 ui+2 ui+3
fi+2
fi fi+1 fi+3
Element #1
ui
Pack
ui+1 ui+2 ui+3
Element #0
Pack
With packing: can use full SIMD width
Naïve implementation: use only 4 out of the
8-width SIMD
Compute
using 8-
width SIMD
fi+2
fi fi+1 fi+3
Example for time parallel kernel (m = 4) with 8 width FP32 SIMD architecture
Unpack
and add
44. EBE kernel with m vectors for wide-SIMD CPUs

1 !$OMP PARALLEL DO
2 do iu=1,numberofthreads ! for each thread
3 do i=1,nnum(iu)
4 i1=nlist(i,iu)
5 do im=1,m
6 ft(im,1,i1,iu)=0.0 ! clear temporary vector
7 ft(im,2,i1,iu)=0.0
8 ft(im,3,i1,iu)=0.0
9 enddo
10 enddo
11 ! block loop with blocksize NL/m
12 do ieo=npl(iu)+1,npl(iu+1),NL/m
13 ! load ue, xe
14 do ie=1,min(NL/m,npl(iu+1)-ieo+1)
15 cny1=cny(1,ieo+ie-1)
16 cny2=cny(2,ieo+ie-1)
17 cny3=cny(3,ieo+ie-1)
18 cny4=cny(4,ieo+ie-1)
19 do im=1,m
20 ue11(im+(ie-1)*m)=u(im,1,cny1)
21 ue21(im+(ie-1)*m)=u(im,2,cny1)
...
22 ue34(im+(ie-1)*m)=u(im,3,cny4)
23 xe11(im+(ie-1)*m)=x(1,cny1)
24 xe21(im+(ie-1)*m)=x(2,cny1)
...
25 xe34(im+(ie-1)*m)=x(3,cny4)
26 enddo
27 enddo
[Annotation: the loading loop above (the im loop) is a width-m SIMD computation; the packed BDBu computation below runs over NL entries and can use the full SIMD width]
28 ! compute BDBu
29 do i=1,NL
30 BDBu11(i)=...
31 BDBu21(i)=...
...
32 BDBu34(i)=...
33 enddo
34 ! add to temporary vector
35 do ie=1,min(NL/m, npl(iu+1)-ieo+1)
36 cny1=cny(1,ieo+ie-1)
37 cny2=cny(2,ieo+ie-1)
38 cny3=cny(3,ieo+ie-1)
39 cny4=cny(4,ieo+ie-1)
40 do im=1,m
41 ft(im,1,cny1,iu)=BDBu11(im+(ie-1)*m)+ft(im,1,cny1,iu)
42 ft(im,2,cny1,iu)=BDBu21(im+(ie-1)*m)+ft(im,2,cny1,iu)
...
43 ft(im,3,cny4,iu)=BDBu34(im+(ie-1)*m)+ft(im,3,cny4,iu)
44 enddo
45 enddo
46 enddo ! ieo
47 enddo ! iu
48 !$OMP END PARALLEL DO
49 Add ft into f (same as lines 36-58 of Fig. 2)
[Annotation: the store loop above (lines 35-45) is a width-m SIMD computation]
45. Thread partitioning for many-core machines
• No temporary arrays are needed
• Graph partitioning is used to promote cache reuse
[Figure: a) standard coloring method: the overall mesh is divided into Color #1, #2, #3 and all threads compute each color (threads 2, 3 partly idle); b) developed thread partitioning method: the mesh is decomposed using a graph partitioning method into Set #1, #2, #3, computed by Thread 1, 2, 3]
46. Coloring/thread partitioning of the EBE kernel with m vectors for wide-SIMD CPUs

1 !$OMP PARALLEL DO
2 ! clear global vector
3 do i=1,n
4 do im=1,m
5 f(im,1,i)=0.0
6 f(im,2,i)=0.0
7 f(im,3,i)=0.0
8 enddo
9 enddo
10 !$OMP END PARALLEL DO
11 do icolor=1,ncolor ! for each color or element set
12 !$OMP PARALLEL DO
13 do iu=1, numberofthreads
14 ! block loop with blocksize NL/m
15 do ieo=npl(icolor,iu)+1,npl(icolor,iu+1),NL/m
16 ! load ue, xe
17 do ie=1,min(NL/m,npl(icolor,iu+1)-ieo+1)
18 cny1=cny(1,ieo+ie-1)
19 cny2=cny(2,ieo+ie-1)
20 cny3=cny(3,ieo+ie-1)
21 cny4=cny(4,ieo+ie-1)
22 do im=1,m
23 ue11(im+(ie-1)*m)=u(im,1,cny1)
24 ue21(im+(ie-1)*m)=u(im,2,cny1)
25 ...
26 ue34(im+(ie-1)*m)=u(im,3,cny4)
27 xe11(im+(ie-1)*m)=x(1,cny1)
28 xe21(im+(ie-1)*m)=x(2,cny1)
...
29 xe34(im+(ie-1)*m)=x(3,cny4)
30 enddo
31 enddo
[Annotation: the loading loop above (the im loop) is a width-m SIMD computation; the packed BDBu computation below can use the full SIMD width]
32 ! compute BDBu
33 do i=1,NL
34 BDBu11(i)=...
35 BDBu21(i)=...
...
36 BDBu34(i)=...
37 enddo
38 ! add to global vector
39 do ie=1,min(NL/m, npl(icolor,iu+1)-ieo+1)
40 cny1=cny(1,ieo+ie-1)
41 cny2=cny(2,ieo+ie-1)
42 cny3=cny(3,ieo+ie-1)
43 cny4=cny(4,ieo+ie-1)
44 do im=1,m
45 f(im,1,cny1)=BDBu11(im+(ie-1)*m)+f(im,1,cny1)
46 f(im,2,cny1)=BDBu21(im+(ie-1)*m)+f(im,2,cny1)
...
47 f(im,3,cny4)=BDBu34(im+(ie-1)*m)+f(im,3,cny4)
48 enddo
49 enddo
50 enddo ! ieo
51 enddo ! iu
52 !$OMP END PARALLEL DO
53 enddo ! icolor
[Annotation: the store loop above (lines 39-49) is a width-m SIMD computation]
47. Mixed use of 4-wide and 16-wide SIMD
• For problems with m = 4 computed on a 16-wide SIMD architecture, 16-wide SIMD can be used for packing/unpacking the 4-wide vectors
• This enables a further reduction in the number of instructions
Use of only 4-wide SIMD instructions (SIMD width = 4 computation):
22 do im=1,m
23 ue11(im+(ie-1)*m)=u(im,1,cny1)
   ! Load u(1:4,1,cny1) to xmm1
   ! Store xmm1 to ue11(1+(ie-1)*m : 4+(ie-1)*m)
24 ue21(im+(ie-1)*m)=u(im,2,cny1)
   ! Load u(1:4,2,cny1) to xmm1
   ! Store xmm1 to ue21(1+(ie-1)*m : 4+(ie-1)*m)
25 ue31(im+(ie-1)*m)=u(im,3,cny1)
   ! Load u(1:4,3,cny1) to xmm1
   ! Store xmm1 to ue31(1+(ie-1)*m : 4+(ie-1)*m)
...
30 enddo

Mixed use of 4-wide and 16-wide SIMD instructions (SIMD width = 16 computation):
23-25 ! Load u(1:16,cny1) to zmm1
      ! Store zmm1(1:4) to ue11(1+(ie-1)*4 : 4+(ie-1)*4)
      ! Store zmm1(5:8) to ue21(1+(ie-1)*4 : 4+(ie-1)*4)
      ! Store zmm1(9:12) to ue31(1+(ie-1)*4 : 4+(ie-1)*4)
...

xmm denotes 128-bit registers (4 FP32 lanes) and zmm denotes 512-bit registers (16 FP32 lanes)
48. Problem setting
• Solve a nonlinear wave propagation problem in the two-layered ground shown below
• Measure performance on three machines with different characteristics
Layer            | 1           | 2
Vp (m/s)         | 700         | 2,100
Vs (m/s)         | 100         | 700
Density (kg/m3)  | 1,500       | 2,100
Damping          | 0.25 (hmax) | 0.05
Strain criterion | 0.007       | -
[Figure: two-layer ground model (Layer 1 over Layer 2) with dimensions 60 m, 20 m, 64 m, and 8 m, and coordinate axes x, y, z]
                           | K computer  | Oakforest-PACS | Intel Skylake Xeon Gold based server
Nodes                      | 8           | 1              | 1
Sockets/node               | 1           | 1              | 2
Cores/socket               | 8           | 68             | 20
FP32 SIMD width            | 2           | 16             | 16
Clock frequency            | 2.0 GHz     | 1.4 GHz        | 2.4 GHz
Total peak FP32 FLOPS      | 1024 GFLOPS | 6092 GFLOPS    | 6144 GFLOPS
Total DDR bandwidth        | 512 GB/s    | 80.1 GB/s      | 255.9 GB/s
Total MCDRAM bandwidth     | -           | 490 GB/s       | -
49. Kernel performance comparison

[Bar chart: kernel elapsed time per vector (s) for each case on the K computer (8 nodes), Oakforest-PACS (1 node), and Skylake Xeon Gold 6148 x 2 sockets; values range from about 2.0 s to 20.1 s. GFLOPS callouts (percentage of FP32 peak): 267 GFLOPS (26.1%) and 380 GFLOPS (37.1%) on the K computer, 121 GFLOPS (1.98%) and 1000 GFLOPS (16.3%) on Oakforest-PACS, 175 GFLOPS (2.85%) and 1287 GFLOPS (20.9%) on the Skylake system; the 16.3% and 20.9% values correspond to the developed kernel, cf. the Summary slide]

Case configuration:
                                     | Baseline (1 vector) | Baseline (4 vectors) | Case #1           | Case #2           | Case #3                | Case #4
# of vectors                         | 1                   | 4                    | 4                 | 4                 | 4                      | 4
SIMD packing                         | No                  | No                   | Yes               | Yes               | Yes                    | Yes
Many-core algorithm                  | Core-wise vectors   | Core-wise vectors    | Core-wise vectors | Standard coloring | Developed partitioning | Developed partitioning
Mixed use of 16-wide and 4-wide SIMD | No                  | No                   | No                | No                | No                     | Yes
50. Performance on an actual urban earthquake simulation problem
• Compute the seismic shaking of three-layered ground in central Tokyo
[Figure: a) model of a 1.25 km x 1.25 km area of Tokyo with 4066 structures; b) elevation of the interfaces of the three soil layers (10 to 40 m); c) response at the ground surface (merged horizontal component of SI value, 113 to 236 cm/s)]
51. Performance on an actual urban earthquake simulation problem

Solver algorithm     | SC14 solver (without time parallelism) | With time parallelism | With time parallelism
EBE kernel algorithm | Baseline (m=1)                         | Baseline (m=4)        | Developed (m=4)
Elapsed time (s)     | 247.2                                  | 125.6                 | 61.9

(1.97x faster with time parallelism, a further 2.03x faster with the developed EBE kernel, 3.99x faster in total)
52. Summary
• The Element-by-Element (EBE) kernel in matrix-vector products is the key kernel of unstructured implicit finite-element applications
• However, it is not straightforward to attain high performance in the EBE kernel due to random data access
• We developed methods to circumvent the resulting data races and attain high performance on many-core CPU architectures with wide SIMD units
• The developed EBE kernel attains
  • 16.3% of FP32 peak on the Intel Xeon Phi (Knights Landing) based Oakforest-PACS
  • 20.9% of FP32 peak on an Intel Skylake Xeon Gold processor based system
• This leads to a 2.88-fold speedup over the baseline kernel and a 2.03-fold speedup of the whole finite-element application on Oakforest-PACS
53. Accelerating the finite element method for GPUs
From: Tsuyoshi Ichimura, Kohei Fujita, Takuma Yamaguchi, Akira Naruse, Jack C. Wells, Thomas C. Schulthess, Tjerk P. Straatsma, Christopher J. Zimmer, Maxime Martinasso, Kengo Nakajima, Muneo Hori, Lalith Maddegedara, SC18 Gordon Bell Prize Finalist (material courtesy of Takuma Yamaguchi)
54. Porting to Piz Daint/Summit
• Communication and memory bandwidth are relatively lower than on the K computer
  • Reducing data transfer is required for performance
• We have been using FP32-FP64 variables
  • Transprecision computing is possible thanks to the adaptive preconditioning
                           | K computer            | Piz Daint                | Summit
CPU/node                   | 1x SPARC64 VIIIfx     | 1x Intel Xeon E5-2690 v3 | 2x IBM POWER9
GPU/node                   | -                     | 1x NVIDIA P100 GPU       | 6x NVIDIA V100 GPU
Peak FP32 performance/node | 0.128 TFLOPS          | 9.4 TFLOPS               | 93.6 TFLOPS
Memory bandwidth/node      | 64 GB/s               | 720 GB/s                 | 5400 GB/s
Inter-node throughput      | 5 GB/s each direction | 10.2 GB/s                | 25 GB/s
55. Introduction of FP16 variables
• Half precision can be used for reduction of data transfer size
• Using FP16 for whole matrix or vector causes overflow/underflow
or fails to converge
• Smaller exponent bits → small dynamic range
• Smaller fraction bits → no more than 4-digit accuracy
Single precision (FP32, 32 bits): 1-bit sign + 8-bit exponent + 23-bit fraction
Half precision (FP16, 16 bits): 1-bit sign + 5-bit exponent + 10-bit fraction
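For reference, the standard IEEE 754 binary16 limits behind these points (general facts, not from the slides): maximum finite value (2 - 2^-10) x 2^15 = 65504, smallest normal value 2^-14 ≈ 6.1 x 10^-5, and machine epsilon 2^-10 ≈ 9.8 x 10^-4, i.e., roughly 3-4 significant decimal digits.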
56. FP16 for point-to-point communication
• FP16 MPI buffers are used only for the boundary part
• To avoid overflow or underflow, the original vector x is split into a localized scaling factor Const and an FP16 vector x̄16
• The data transfer size is thereby reduced
• Const × x̄16 does not match x exactly, but the convergence characteristics are unchanged for most problems
[Figure: the boundary part of x shared between PE#0 and PE#1 is communicated as Const × x̄16]
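As a minimal sketch of the localized scaling (assuming the scaling factor is taken as the maximum absolute value over the boundary entries; the actual implementation may choose Const differently):

  Const = max_i |x_i| over the boundary part,   x̄16 = fp16(x / Const),   x ≈ Const × x̄16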
57. Overlap of computation and communication

Time-parallel CG loop for the i, i+1, i+2, i+3-th time steps:
1 : r = Au
2 : synchronize q by point-to-point comm.
3 : r = b - r; z = M^-1 r
4 : ρa = 1; α = 1; ρb = z·r; γ = z·q
5 : synchronize ρb, γ by collective comm.
6 : while ( |ri|/|bi| > tolerance ) do
7 :   β = -γρa/α
8 :   u = u + αp; p = z + βp
9 :   q = Ap
10:   synchronize q by point-to-point comm.
11:   ρa = p·q
12:   synchronize ρa by collective comm.
13:   α = ρb/ρa ; ρa = ρb
14:   r = r - αq; z = M^-1 r; ρb = z·r; γ = z·q
15:   synchronize ρb, γ by collective comm.
16: enddo
• Conjugate gradient method
• Introduce the time-parallel algorithm
  • Solve four time steps of the analysis in parallel
  • Compute 1 current time step and 3 future time steps
• Reduces the number of iterations in the solver
• Computation becomes dense and suitable for low-B/F architectures
58. Overlap of computation and communication

Loop for the i, i+1, i+2, i+3-th time steps:
1': while ( error_i > tolerance ) do
2':   Vector operation 1
3':   Matrix vector multiplication
4':   Point-to-point comm.
5':   Vector operation 2
6':   Collective comm.
7':   Vector operation 3
8':   Collective comm.
9': enddo
• Simplified loop
  • Computation part: 3 groups of vector operations and 1 sparse matrix-vector multiplication
  • Communication part: 1 point-to-point communication and 2 collective communications
• The point-to-point communication is overlapped with the matrix-vector multiplication:
  1. boundary part computation
  2. inner part computation & boundary part communication
• However, this communication is still a bottleneck of the solver
[Figure: for PE#0, the boundary part is sent/received between other MPI processes while the inner part is computed]
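A minimal sketch of this overlap pattern with non-blocking MPI is shown below; compute_boundary, compute_inner, add_boundary, and the single-neighbor buffer layout are illustrative stand-ins, not the actual solver's data structures:

! Overlap boundary-part communication with inner-part computation
subroutine matvec_overlapped(u, f, sendbuf, recvbuf, nbr, comm)
  use mpi
  implicit none
  real,    intent(in)    :: u(:)
  real,    intent(inout) :: f(:), sendbuf(:), recvbuf(:)
  integer, intent(in)    :: nbr, comm    ! neighbor rank and communicator
  integer :: req(2), ierr

  call compute_boundary(u, f, sendbuf)   ! 1. boundary part computation
  call mpi_irecv(recvbuf, size(recvbuf), mpi_real, nbr, 0, comm, req(1), ierr)
  call mpi_isend(sendbuf, size(sendbuf), mpi_real, nbr, 0, comm, req(2), ierr)
  call compute_inner(u, f)               ! 2. inner part computation, overlapped
                                         !    with the boundary communication
  call mpi_waitall(2, req, mpi_statuses_ignore, ierr)
  call add_boundary(f, recvbuf)          ! accumulate received boundary contributions
end subroutine matvec_overlapped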
59. Overlap of computation and communication

Loop for the i, i+1-th time steps:
1' : while ( error_i > tolerance ) do
2' :
3' :   Collective comm.
4' :   Vector operation 1
5' :   Matrix vector multiplication
6' :   Point-to-point comm.
7' :   Vector operation 2
8' :   Collective comm.
9' :
10':
11':   Vector operation 3
12': enddo

Loop for the i+2, i+3-th time steps:
1' : while ( error_i > tolerance ) do
2' :   Vector operation 2
3' :   Collective comm.
4' :
5' :
6' :   Vector operation 3
7' :
8' :   Collective comm.
9' :   Vector operation 1
10':   Matrix vector multiplication
11':   Point-to-point comm.
12': enddo
• The 4 vectors are divided into 2 vectors × 2 sets
• The point-to-point communication is overlapped with other vector operations
• The number of collective communications is unchanged
60. FP16 computation in Element-by-Element method
• Matrix-free matrix-vector multiplication: compute element-wise products and add them into the global vector
• Normalization of variables per element can be performed
  • Enables use of double-width FP16 variables in the element-wise computation
  • Achieved 71.9% of peak FP64 performance on a V100 GPU
• Similar normalization is used in communication between MPI partitions for FP16 communication
Element-by-Element (EBE) method: f = Σe Pe Ae Pe^T u (Ae is generated on-the-fly)
[Figure: element-wise contributions for Element #0, #1, ..., #N-1 are added into the global vector f; the figure marks parts of the computation as FP32 and FP16]
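A minimal sketch of the per-element normalization idea (the choice of scaling factor here is an assumption for illustration, not necessarily the one used in the actual code):

  ce = max_k |(Pe^T u)_k|,   ūe = fp16(Pe^T u / ce),   f += ce × Pe (Ae ūe)

with the element-wise product Ae ūe evaluated in FP16 and the rescaled result accumulated into the global vector in FP32.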
61. Introduction of custom data type: FP21
• Most computation in the CG loop is memory bound
• However, the exponent of FP16 is too small for use in the global vectors
• Use FP21 variables for memory-bound computation
  • Only used for storing data (3 x FP21 values are stored in a 64-bit array element)
  • Bit operations are used to convert FP21 to FP32 variables for computation
Single precision (FP32, 32 bits): 1-bit sign + 8-bit exponent + 23-bit fraction
FP21 (21 bits): 1-bit sign + 8-bit exponent + 12-bit fraction
Half precision (FP16, 16 bits): 1-bit sign + 5-bit exponent + 10-bit fraction
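A minimal sketch of such FP21 storage is shown below. It assumes an FP21 value is simply the top 21 bits of the FP32 bit pattern (1 + 8 + 12 bits) and that three FP21 values are packed into one 64-bit word; the module and routine names are illustrative and the packing layout is an assumption, not the actual implementation:

module fp21_sketch
  use iso_fortran_env, only: int32, int64, real32
  implicit none
contains
  ! Pack three FP32 values into one 64-bit word holding three FP21 values
  function pack_fp21(a, b, c) result(w)
    real(real32), intent(in) :: a, b, c
    integer(int64) :: w, ia, ib, ic
    ia = iand(ishft(int(transfer(a, 1_int32), int64), -11), int(z'1FFFFF', int64))
    ib = iand(ishft(int(transfer(b, 1_int32), int64), -11), int(z'1FFFFF', int64))
    ic = iand(ishft(int(transfer(c, 1_int32), int64), -11), int(z'1FFFFF', int64))
    w  = ior(ishft(ia, 42), ior(ishft(ib, 21), ic))   ! slots at bits 42-62, 21-41, 0-20
  end function pack_fp21

  ! Unpack slot 1, 2, or 3 back to FP32 by restoring 11 zero bits in the fraction
  function unpack_fp21(w, slot) result(x)
    integer(int64), intent(in) :: w
    integer,        intent(in) :: slot
    real(real32) :: x
    integer(int64) :: bits21
    integer(int32) :: bits32
    bits21 = iand(ishft(w, -(63 - 21*slot)), int(z'1FFFFF', int64))
    bits32 = ishft(int(bits21, int32), 11)
    x = transfer(bits32, 1.0_real32)
  end function unpack_fp21
end module fp21_sketch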
62. Performance on Piz Daint/Summit
• The developed solver demonstrates higher scalability compared to previous solvers
• Leads to 19.8% (on nearly the full Piz Daint) and 14.7% (on nearly the full Summit) of peak FP64 performance
[Bar charts: elapsed time (s) vs. number of MPI processes (# of GPUs) for the Developed, SC14, and PCGE (standard) solvers. On Piz Daint (288 to 4608 GPUs), the developed solver takes roughly 111-124 s, the SC14 solver roughly 373-401 s, and the standard PCGE solver roughly 2,759-3,065 s. On Summit (288 to 24576 GPUs), the developed solver takes roughly 76-100 s, the SC14 solver roughly 303-454 s, and the standard PCGE solver roughly 1,912-2,083 s]
63. Development of computational methods for Fugaku
Kohei Fujita, Kentaro Koyama, Kazuo Minami, Hikaru Inoue, Seiya Nishizawa,
Miwako Tsuji, Tatsuo Nishiki, Tsuyoshi Ichimura, Muneo Hori, Lalith Maddegedara,
High-fidelity nonlinear low-order unstructured implicit finite-element seismic
simulation of important structures by accelerated element-by-element method,
Journal of Computational Science, 2020
64. Performance of the SC14 solver on Fugaku
• 11.1% of peak performance on the K computer (8-core CPUs x 82,944 nodes)
• 1.5% of peak performance on Fugaku (48-core CPUs x 158,976 nodes)
• In the finite element method, random data access in the matrix-vector product is the bottleneck
  • How can these bottlenecks be avoided to accelerate the finite element method?
• Fugaku has more compute cores per node and adds assistant cores for system tasks
  • We also want to exploit these
65. Exploiting the machine architecture
• Making computations access memory contiguously
  • Reordering computations to make effective use of the SIMD units
• Efficient use of many cores
  • Multi-coloring that takes cache characteristics into account
• Overlap of computation and communication
  • Use the assistant cores to execute computation and communication simultaneously
• With these techniques, the matrix-vector product, the main computation, was sped up 13-fold
[Figure: the overall mesh is partitioned into MPI processes #0-#2, each with an inner domain, a boundary domain, and nodes on the MPI boundary; within a partition, multi-coloring into Color #1-#3 is computed by Threads 1-3 (threads 2, 3 partly idle)]