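Lecture 11 (streamed): Special Lectures on Computational Science and Technology B (2022)

Special Lectures on Computational Science and Technology B
Large-Scale Earthquake Simulation 2
2022/6/30
Earthquake Research Institute, The University of Tokyo
Kohei Fujita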

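Contents and schedule
• 6/23 13:00-
  • Overview of earthquake simulation
  • Basics of the finite-element method
  • Solving large-scale systems of linear equations (conjugate gradient method)
  • Parallelization of the conjugate gradient method
  • Preconditioning for the conjugate gradient method
  • Examples of large-scale earthquake simulation
• 6/30 13:00-
  • Review of the previous lecture
  • Overview of parallel computer architecture
  • Accelerating the finite-element matrix-vector product with SIMD
  • Finite-element algorithms and implementations suited to SIMD
  • Accelerating the finite-element method on GPUs
  • Accelerating the finite-element method on Fugaku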

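Overview of earthquake simulation
• Fault ~ crust ~ ground ~ structures
  • Domain size: 100 km
  • Spatial resolution: 0.1 m
• The finite-element method is often used so that targets with complex structure, such as cities, can be solved with high accuracy
• City-scale problems become extremely large, on the order of hundreds of billions of degrees of freedom, and the finite-element method is dominated by random memory access, which tends to lower computational efficiency; high-performance computing techniques are therefore essential
Structured grid (e.g., finite-difference method)
Example code:
do i=1,nx
  do j=1,ny
    a(i,j)=b(i,j)+b(i+1,j)+….
  enddo
enddo
Unstructured grid (e.g., finite-element method)
Example code:
do i=1,n
  do j=ptr(i)+1,ptr(i+1)
    a(i)=a(i)+b(index(j))
  enddo
enddo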

Analysis example
• Target: fault ~ crust ~ ground ~ structures ~ social activity
[Figure: example earthquake simulation using the full K computer. a) Earthquake wave propagation (depth 0 km to -7 km); b) City response simulation; c) Resident evacuation: two million agents evacuating to the nearest safe site, shown during the earthquake and post earthquake (Shinjuku, Tokyo station, Ikebukuro, Shibuya, Shinbashi, Ueno)]
T. Ichimura et al., Implicit Nonlinear Wave Simulation with 1.08T DOF and 0.270T Unstructured Finite Elements to Enhance Comprehensive Earthquake Simulation, SC15 Gordon Bell Prize Finalist

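Finite-element method (1D)
• Target problem (governing equation): $f(u(x)) = 0$
  • e.g.: $f(u(x)) = 1 - \frac{d^2 u(x)}{dx^2}$ for $0 < x < 1$ with boundary conditions $u(0) = 0$, $u(1) = 1/2$
• In the finite-element method, the unknown function $u(x)$ is split over non-overlapping intervals (elements)
  • $u(x) = \sum_i u_i \varphi_i(x)$
  • Here $u_i$ are constants (unknown) and $\varphi_i(x)$ are shape functions (known)
• Once the $u_i$ are determined, $u(x)$ is obtained → how do we compute the $u_i$?
  • Relations between adjacent nodes are derived → their values are added into a matrix
[Figure: piecewise-linear $u(x)$ over elements and nodes, and the shape function $\varphi_i(x)$, which takes values between 0 and 1 around node $i$]
• $A = \begin{pmatrix} 3 & -3 & 0 & 0 \\ -3 & \frac{27}{5} & -\frac{12}{5} & 0 \\ 0 & -\frac{12}{5} & \frac{32}{5} & -4 \\ 0 & 0 & -4 & 4 \end{pmatrix}$, $f = \begin{pmatrix} -1/6 \\ -3/8 \\ -1/3 \\ -1/8 \end{pmatrix}$
• All that remains is to solve this so that the boundary conditions $u_1 = 0$, $u_4 = \frac{1}{2}$ are satisfied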
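The assembly step above can be sketched in C as follows. This is only an illustration under stated assumptions: the node positions 0, 1/3, 3/4, 1 are not given on the slide and are inferred here from the matrix entries (1/h = 3, 12/5, 4), and the function names `assemble_1d` and `solve_dirichlet` are hypothetical.

```c
#include <assert.h>

#define NNODE 4

/* Assemble K and f for 1 - u'' = 0 using linear elements on nodes x[0..3]. */
void assemble_1d(const double x[NNODE], double K[NNODE][NNODE], double f[NNODE])
{
    for (int i = 0; i < NNODE; ++i) {
        f[i] = 0.0;
        for (int j = 0; j < NNODE; ++j) K[i][j] = 0.0;
    }
    for (int e = 0; e < NNODE - 1; ++e) {
        double h = x[e + 1] - x[e];          /* element length */
        assert(h > 0.0);
        /* add element stiffness (1/h) * [[1,-1],[-1,1]] to the two adjacent nodes */
        K[e][e]         += 1.0 / h;
        K[e][e + 1]     -= 1.0 / h;
        K[e + 1][e]     -= 1.0 / h;
        K[e + 1][e + 1] += 1.0 / h;
        /* constant source term: each node of the element receives -h/2 */
        f[e]     -= 0.5 * h;
        f[e + 1] -= 0.5 * h;
    }
}

/* Impose u[0] = u_left and u[3] = u_right, then solve the remaining 2x2 system. */
void solve_dirichlet(double K[NNODE][NNODE], const double f[NNODE],
                     double u_left, double u_right, double u[NNODE])
{
    double b1 = f[1] - K[1][0] * u_left - K[1][3] * u_right;
    double b2 = f[2] - K[2][0] * u_left - K[2][3] * u_right;
    double det = K[1][1] * K[2][2] - K[1][2] * K[2][1];
    u[0] = u_left;
    u[3] = u_right;
    u[1] = (b1 * K[2][2] - K[1][2] * b2) / det;   /* Cramer's rule */
    u[2] = (K[1][1] * b2 - K[2][1] * b1) / det;
}
```

With x = {0, 1/3, 3/4, 1} the assembled entries match the matrix and vector on the slide, and the interior values come out as 1/18 and 9/32, which agree with u(x) = x^2/2 at those nodes.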

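Key points of the finite-element method
• Discretizing a mathematical problem with the finite-element method yields a sparse matrix $\mathbf{A}$
• With $N$ the number of rows of the symmetric matrix:
  • → the number of nonzero entries is O(N)
  • → the cost and memory usage of a matrix-vector product are O(N)
• Because the matrix-vector product is cheap, the system is usually solved with an iterative method that applies it repeatedly
• The algorithm is built by combining the iterative method with a preconditioner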
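To make the O(N) cost concrete, here is a minimal C sketch of a matrix-vector product in compressed row storage (CRS). The storage format and the name `crs_matvec` are assumptions chosen for illustration; the lecture itself later moves to a matrix-free Element-by-Element product. Each stored nonzero is touched exactly once, so a tridiagonal matrix with N rows costs at most 3N - 2 multiply-adds.

```c
#include <assert.h>
#include <stddef.h>

/* y = A x for a matrix in compressed row storage (CRS):
   row i owns entries ptr[i] .. ptr[i+1]-1 of val[] and col[]. */
void crs_matvec(size_t n, const size_t *ptr, const size_t *col,
                const double *val, const double *x, double *y)
{
    assert(ptr && col && val && x && y);
    for (size_t i = 0; i < n; ++i) {
        double sum = 0.0;
        for (size_t j = ptr[i]; j < ptr[i + 1]; ++j)
            sum += val[j] * x[col[j]];   /* indirect (random) access to x */
        y[i] = sum;
    }
}
```

The indirect index `col[j]` is the random access that the later slides identify as the obstacle to SIMD efficiency.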

Solver example 1: seismic ground amplification analysis on the K computer
• Solve preconditioning matrix roughly to reduce number of CG loops
• Use geometric multi-grid method to reduce cost of preconditioner
• Use single precision in preconditioner to reduce computation & communication
[Figure: solver structure. Outer loop: CG loop on the equation to be solved (double precision, second-order tetrahedra). Solving the preconditioning matrix (single precision): the inner coarse loop (linear tetrahedra) solves the system roughly using a CG solver and its result is used as the initial solution of the inner fine loop (second-order tetrahedra), which again solves the system roughly using a CG solver; that result is used as the preconditioner of the outer loop]
T. Ichimura et al., Implicit Nonlinear Wave Simulation with 1.08T DOF and 0.270T Unstructured Finite Elements to Enhance Comprehensive Earthquake Simulation, SC15 Gordon Bell Prize Finalist

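Summary so far
• Discretizing a mathematical problem with the finite-element method yields a sparse matrix $\mathbf{A}$
• Because the matrix-vector product is cheap, the system is usually solved with an iterative method that applies it repeatedly
  • How fast can the matrix-vector product be executed?
  • Development that follows the computer architecture becomes important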

Overview of parallel computer architecture

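Performance improvement of computers
• Sources of performance improvement
  • Higher clock frequency; more devices (transistors) available for the same price, power, and area
• Computer architecture has been developed around the question of how to raise processing speed using the many devices that have become available
  • In-core parallelism (pipelining, superscalar, out-of-order, SIMD, etc.)
  • Multi-core parallelism
  • Distributed-memory parallelism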

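In-core parallelism
• SIMD (single instruction multiple data)
  • One instruction operates on multiple data elements
  • Reduces the number of instructions
  • Not suited to processing where the data are non-contiguous or where each element needs a different operation
• Example: Intel AVX (256-bit SIMD, operates on 8 single-precision floating-point numbers at once)
  • a256=_mm256_loadu_ps(&a[0]); // load 8 contiguous elements from memory into a register
  • b256=_mm256_loadu_ps(&b[0]); // load 8 contiguous elements from memory into a register
  • c256=_mm256_add_ps(a256,b256); // add 8 elements
  • _mm256_storeu_ps(&c[0],c256); // store 8 elements to a contiguous region of memory
• Reference: https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions
[Figure: arrays a, b, c in memory (elements 0 to 7) and registers a256, b256, c256; a and b are loaded into registers, added (a256 + b256 = c256), and c256 is stored to c]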
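The four AVX steps above can be wrapped into a complete function as in the following sketch. The function name `add_f32` and the scalar remainder loop are additions for illustration; the `#ifdef __AVX__` guard is an assumption that lets the same file build with or without `-mavx`, falling back to plain scalar code when AVX is not enabled.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* c[i] = a[i] + b[i] for i in [0, n). */
void add_f32(const float *a, const float *b, float *c, size_t n)
{
    assert(a && b && c);
    size_t i = 0;
#ifdef __AVX__
    for (; i + 8 <= n; i += 8) {
        __m256 a256 = _mm256_loadu_ps(&a[i]);     /* load 8 contiguous floats */
        __m256 b256 = _mm256_loadu_ps(&b[i]);     /* load 8 contiguous floats */
        __m256 c256 = _mm256_add_ps(a256, b256);  /* add 8 lanes at once */
        _mm256_storeu_ps(&c[i], c256);            /* store 8 contiguous floats */
    }
#endif
    for (; i < n; ++i)                            /* scalar remainder (or full fallback) */
        c[i] = a[i] + b[i];
}
```

The unaligned load/store variants are used so that the arrays need no special alignment; the remainder loop handles lengths that are not a multiple of 8.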

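Shared-memory and distributed-memory parallelism
• Shared-memory computer
  • Multiple cores each execute their own instruction stream
  • All cores share a single memory space
  • The system is managed by a single operating system
  • OpenMP, OpenACC, etc. are used to control the per-core instruction streams and the synchronization between cores
• Distributed-memory computer
  • Multiple computers (nodes) are connected by a network
  • Each node runs its own operating system, and the memory space is shared only among the cores within a node
  • Information is passed between nodes by communication over the network (requires a program with explicit communication, e.g., MPI)
[Figure: shared-memory parallel computer (Core 0 to Core 3 attached to one Memory) and distributed-memory parallel computer (Node 0, Node 1, …, Node N-1 connected by a Network)]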

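Peak performance
• Clock frequency x number of floating-point units (superscalar) x FMA x SIMD lanes x cores x nodes
  • FMA: one instruction performs a fused multiply-add (a=b*c+d)
• Example: compute node of Oakbridge-CX
  • 56-core SMP shared-memory computer (28-core Intel Xeon Platinum 8280 CPU @ 2.7 GHz x 2 sockets)
  • Cascade Lake Xeon series CPU: supports AVX-512 instructions
    • 512-bit SIMD: operates on 8 double-precision (16 single-precision) elements at once
  • Peak performance: 2.7 GHz x 2 (floating-point units) x 2 (FMA) x 8 (double-precision SIMD) x 56 cores x 1 node = 4838 GFLOPS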
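The peak-performance product can be written as a small helper, shown here as a sketch. The function name `peak_gflops` and its parameter names are hypothetical; the factors are exactly those listed on the slide.

```c
#include <assert.h>

/* Theoretical peak in GFLOPS; the arguments mirror the factors on the slide. */
double peak_gflops(double clock_ghz, int fp_units, int fma_factor,
                   int simd_lanes, int cores, int nodes)
{
    assert(clock_ghz > 0.0 && fp_units > 0 && fma_factor > 0);
    assert(simd_lanes > 0 && cores > 0 && nodes > 0);
    return clock_ghz * fp_units * fma_factor * simd_lanes * cores * nodes;
}
```

For the Oakbridge-CX node, `peak_gflops(2.7, 2, 2, 8, 56, 1)` gives about 4838 GFLOPS in double precision. For a single core in single precision (16 lanes), the same formula gives about 172.8 GFLOPS.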

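Performance improvement of computers
• Earlier computers let users benefit from performance improvements without being aware of them
  • Higher clock frequency, pipelining, superscalar, out-of-order execution
  • However, clock frequencies have stopped rising, and the gains from these techniques have become small
• Current computers also use their devices in ways that give no speedup unless the user understands the computer system (hardware + software)
  • SIMD, multi-core parallelism, distributed-memory parallelism
  • This knowledge is essential for developing high-performance programs
• Next, SIMD-focused speedup examples are explained using 1D and 3D finite-element methods
  • Data transfer capability (e.g., memory bandwidth, cache characteristics) must also be considered, but here it is sidestepped at the algorithm level by using the Element-by-Element method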

Applying SIMD to the matrix-vector product

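1D finite-element method
• Governing equation
  • $\frac{d}{dx}\left(E \frac{du}{dx}\right) = 0$
  • Here $0 \le x \le 1$, with Young's modulus $E = 1$ over the whole domain
• Boundary conditions
  • $u = 0$ at $x = 0$ (Dirichlet condition)
  • $E \frac{du}{dx} = 1$ at $x = 1$ (Neumann condition)
• Element stiffness matrix for linear elements, where $ds$ is the element length
  • $K_e = \frac{E}{ds} \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$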

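Analysis flow of the finite-element method
• Mesh generation
  • Here we use a mesh that divides $0 \le x \le 1$ into $n - 1$ equal elements
• Setting boundary conditions
  • The Neumann condition is reflected in f
• Solving Ku=f
  • Use the conjugate gradient method, applying the Dirichlet condition inside the matrix-vector product
• Visualization
  • Omitted here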

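Matrix-vector product (cal_amat.c)
• r ⇐ K u is computed as $r \Leftarrow \sum_e P_e^T (K_e (P_e u))$
  • $P_e$ is the mapping matrix between global node numbers and element node numbers
• Here,
  • $u_e \Leftarrow P_e u$
  • $K_e = \frac{E}{ds} \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$
  • $BDBu_e \Leftarrow K_e u_e$
[Code listing: cal_amat.c, annotated with the steps $u_e \Leftarrow P_e u$, $K_e$, $BDBu_e \Leftarrow K_e u_e$, and $r \mathrel{+}= P_e^T BDBu_e$; comments are embedded so that they appear in the assembly output]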
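Since the listing itself is not reproduced in this text, the following C sketch shows one way the element-by-element product described above could look. It is not the lecture's cal_amat.c: the function name, the argument list, and the layout `cny[2*ie]`, `cny[2*ie+1]` for the two nodes of element ie are assumptions, and the Dirichlet handling is left out.

```c
#include <assert.h>

/* r = K u for 1D linear elements, computed element by element.
   cny[2*ie], cny[2*ie+1]: global node numbers of element ie (assumed layout). */
void cal_amat_sketch(int n, int ne, const int *cny, const float *coor,
                     const float *young, const float *u, float *r)
{
    assert(n > 0 && ne >= 0);
    for (int i = 0; i < n; ++i) r[i] = 0.0f;
    for (int ie = 0; ie < ne; ++ie) {
        int n0 = cny[2 * ie], n1 = cny[2 * ie + 1];
        /* ue <= Pe u (gather load) */
        float ue0 = u[n0], ue1 = u[n1];
        /* Ke = (E/ds) [[1,-1],[-1,1]] */
        float k = young[ie] / (coor[n1] - coor[n0]);
        /* BDBue <= Ke ue */
        float bdbu0 = k * (ue0 - ue1);
        float bdbu1 = k * (ue1 - ue0);
        /* r += Pe^T BDBue (scatter add) */
        r[n0] += bdbu0;
        r[n1] += bdbu1;
    }
}
```

For a linear field u = x on a uniform mesh with E = 1, the interior entries of r cancel to zero and only the two end nodes carry -1 and +1, which is a quick sanity check for such a kernel.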

Compiling and running
• Explained using Oakbridge-CX at the Information Technology Center, The University of Tokyo
  • Intel Xeon Platinum 8280 @ 2.7 GHz
• Compile
  • $ gcc main.c cg.c cal_amat.c -O3 -mavx -fopenmp -fopt-info-vec-optimized -Wall -lm
  • -mavx: enable AVX instructions
  • -fopt-info-vec-optimized: report which loops were vectorized
  • -Wall: output warning messages
  • -lm: link the math library
  • Use option “-S” to output assembly
• Run
  • ./a.out

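Compile and run results
• When SIMD is applied to a function, "note: loop vectorized" is printed, but here no such message appears for cal_amat.c
  • SIMD is not applied
• Run time about 0.18 s, about 3.99 GFLOPS
  • The peak performance of this computer (1 core) is 172 GFLOPS, so the execution efficiency is about 2.3%
[user@obcx03 1D]$ gcc main.c cg.c cal_amat.c -O3 -mavx -fopenmp -fopt-info-vec-optimized -Wall -lm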
Analyzing loop at main.c:73
Analyzing loop at main.c:63
…
Analyzing loop at cal_amat.c:14
cal_amat.c:1: note: vectorized 0 loops in function.
[user@obcx03 1D]$ pjsub run.sh
[INFO] PJM 0000 pjsub Job 1017619 submitted.
[user@obcx03 1D]$ cat run.sh.o1017619 |tail
32766 1.160480e-06 1.000000e-06
32767 1.308662e-06 1.000000e-06
32768 1.348558e-06 1.000000e-06
32769 1.483470e-06 1.000000e-06
32770 1.365041e-06 1.000000e-06
32771 1.050433e-06 1.000000e-06
32772 1.327965e-06 1.000000e-06
32773 6.988487e-07 1.000000e-06
norm 1.000000e+00 2.731286e+03
0 took 0.184694 (3.991899 GFLOPS) 0.000000

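Requirements for applying SIMD
• Data dependency
• Branching
Contiguous access (SIMD possible):
do i=1,n
  a(i)=2.0*b(i)
enddo
Gather load:
do i=1,n
  a(i)=2.0*b(cny(i))
enddo
Scatter store:
do i=1,n
  a(cny(i))=2.0*b(i)
enddo
Branching:
do i=1,n
  if(b(i)>0.0)then
    a(i)=2.0*b(i)
  else
    a(i)=0.0
  endif
enddo
For the gather load, scatter store, and branching loops, SIMD is either not possible or possible only in some cases, and then usually at reduced efficiency.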

Example where SIMD is applied
func1.F90
subroutine func1(n,a,b)
  implicit none
  integer i,n
  real*4 a(n),b(n)
  do i=1,n
    a(i)=2.0*b(i)
  enddo
end
[user@obcx03 test]$ gfortran func1.F90 -O3 -mavx -fopenmp -fopt-info-vec-optimized -S
Analyzing loop at func1.F90:6
Vectorizing loop at func1.F90:6
func1.F90:6: note: === vect_do_peeling_for_alignment ===
func1.F90:6: note: niters for prolog loop: MIN_EXPR <(unsigned int) -(((unsigned long) vect_pb.12_18 & 31) >> 2) & 7, niters.9_22>
Setting upper bound of nb iterations for prologue loop to 7
func1.F90:6: note: === vect_update_inits_of_dr ===
func1.F90:6: note: === vect_do_peeling_for_loop_bound ===
Setting upper bound of nb iterations for epilogue loop to 6
func1.F90:6: note: LOOP VECTORIZED.
func1.F90:1: note: vectorized 1 loops in function.
"Vectorized" means that SIMD was applied
[Figure: b (elements 0 to 7) multiplied elementwise by a vector of 2s gives a (elements 0 to 7)]

Example where SIMD is not applied (Gather load)
func2.F90
subroutine func2(n,cny,a,b)
  implicit none
  integer i,n,cny(n)
  real*4 a(n),b(n)
  do i=1,n
    a(i)=2.0*b(cny(i))
  enddo
end
The data access pattern changes depending on cny
[Figure: elements of b selected through cny are multiplied by 2 and stored to contiguous elements 0 to 7 of a]

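Example where SIMD is not applied (Scatter store)
func3.F90
subroutine func3(n,cny,a,b)
  implicit none
  integer i,n,cny(n)
  real*4 a(n),b(n)
  do i=1,n
    a(cny(i))=2.0*b(i)
  enddo
end
Cannot be computed in parallel because several iterations may write to the same location
The data access pattern changes depending on cny
[Figure: contiguous elements 0 to 7 of b are multiplied by 2 and stored to elements of a selected through cny]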

Example where SIMD is not applied (Branching)
func4.F90
subroutine func4(n,cny,a,b)
  implicit none
  integer i,n,cny(n)
  real*4 a(n),b(n)
  do i=1,n
    if(b(i)>0.0)then
      a(i)=2.0*b(i)
    else
      a(i)=0.0
    endif
  enddo
end
The result depends on the value of b(i)
[Figure: each element of a (0 to 7) is either 2*b(i) or 0, chosen per element]

Why SIMD is not applied to the ie loop of cal_amat.c
• Reasons SIMD is not applied to the ie loop
  • Gather loads of u, coor, young
  • Scatter stores to r
• These should be separated from the main computation part

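SIMD-izing the ie loop (cal_amat_simd.c)
• Split the ie loop into a gather load part, a computation part, and a scatter store part
• To keep the temporary arrays small, change the loop into a nested form (loop blocking)
[Code listing: cal_amat_simd.c, annotated with the gather load part, the main computation part, and the scatter store part]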
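The split into three parts with loop blocking can be sketched as below. This is an illustration, not the lecture's cal_amat_simd.c: the function name, the block size `NBLOCK`, and the `cny[2*ie]` node layout are assumptions. Only the middle loop works on contiguous temporary arrays, so it is the one a compiler can be expected to vectorize.

```c
#include <assert.h>

#define NBLOCK 8   /* block size; hypothetical, chosen to match 8-wide FP32 SIMD */

void cal_amat_simd_sketch(int n, int ne, const int *cny, const float *coor,
                          const float *young, const float *u, float *r)
{
    float ue0[NBLOCK], ue1[NBLOCK], ds[NBLOCK], e[NBLOCK];
    float bdbu0[NBLOCK], bdbu1[NBLOCK];
    assert(n > 0 && ne >= 0);
    for (int i = 0; i < n; ++i) r[i] = 0.0f;
    for (int ieo = 0; ieo < ne; ieo += NBLOCK) {
        int nb = (ne - ieo < NBLOCK) ? ne - ieo : NBLOCK;
        /* gather load part: indirect reads into small contiguous arrays */
        for (int i = 0; i < nb; ++i) {
            int n0 = cny[2 * (ieo + i)], n1 = cny[2 * (ieo + i) + 1];
            ue0[i] = u[n0];
            ue1[i] = u[n1];
            ds[i] = coor[n1] - coor[n0];
            e[i] = young[ieo + i];
        }
        /* main computation part: contiguous and branch-free */
        for (int i = 0; i < nb; ++i) {
            float k = e[i] / ds[i];
            bdbu0[i] = k * (ue0[i] - ue1[i]);
            bdbu1[i] = k * (ue1[i] - ue0[i]);
        }
        /* scatter store part: indirect adds into r */
        for (int i = 0; i < nb; ++i) {
            r[cny[2 * (ieo + i)]]     += bdbu0[i];
            r[cny[2 * (ieo + i) + 1]] += bdbu1[i];
        }
    }
}
```

Blocking keeps the temporary arrays at a fixed small size regardless of the number of elements, at the price of the extra copies in the gather and scatter parts.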

Assembly (cal_amat_simd.s)
# 23 "cal_amat_simd.c" 1
# load right hand side vector
# 0 "" 2
#NO_APP
xorl %eax, %eax
movq %rdi, %rdx
.p2align 4,,10
.p2align 3
.L8:
movslq -4(%rdx), %rcx
movslq (%rdx), %r8
addq $8, %rdx
vmovss 0(%r13,%rcx,4), %xmm0
vmovss %xmm0, 16(%rsp,%rax)
vmovss 0(%r13,%r8,4), %xmm0
vmovss %xmm0, 80(%rsp,%rax)
vmovss (%r12,%r8,4), %xmm0
vsubss (%r12,%rcx,4), %xmm0, %xmm0
…
# compute determinant and BDBu
# 0 "" 2
#NO_APP
vdivps 336(%rsp), %ymm4, %ymm0
vmovaps 80(%rsp), %ymm3
vsubps %ymm3, %ymm2, %ymm5
vsubps %ymm2, %ymm3, %ymm2
…
# add BDBu into left side vector
# 0 "" 2
#NO_APP
movq %rdi, %rdx
xorb %al, %al
.p2align 4,,10
.p2align 3
.L10:
movslq -4(%rdx), %rcx
movslq (%rdx), %r8
addq $8, %rdx
leaq (%rbx,%rcx,4), %rcx
vmovss (%rcx), %xmm0
vaddss 208(%rsp,%rax), %xmm0, %xmm0
vmovss %xmm0, (%rcx)
…
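• Compiling with option “-S” generates the assembly file
• s (as in vmovss, vaddss): scalar instruction; xmm is a 128-bit register (4 single-precision elements), but a scalar instruction uses only one of those elements
• p (as in vdivps, vsubps): SIMD instruction; ymm is a 256-bit register (8 single-precision elements)
[Annotations: the listing shows, in order, the gather load part, the main computation part, and the scatter store part]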

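Run results
• The analysis result (norm) is identical
• In this case, the run time became longer
  • Because the SIMD-computed portion is small, overheads such as array manipulation are not hidden
After code modification (cal_amat_simd.c)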
[user@obcx03 1D]$ tail run.simd.sh.o1017636
32766 1.160480e-06 1.000000e-06
32767 1.308662e-06 1.000000e-06
32768 1.348558e-06 1.000000e-06
32769 1.483470e-06 1.000000e-06
32770 1.365041e-06 1.000000e-06
32771 1.050433e-06 1.000000e-06
32772 1.327965e-06 1.000000e-06
32773 6.988487e-07 1.000000e-06
norm 1.000000e+00 2.731286e+03
0 took 0.233609 (3.156046 GFLOPS) 0.000000
Before code modification (cal_amat.c)
[user@obcx03 1D]$ tail run.sh.o1017619
32766 1.160480e-06 1.000000e-06
32767 1.308662e-06 1.000000e-06
32768 1.348558e-06 1.000000e-06
32769 1.483470e-06 1.000000e-06
32770 1.365041e-06 1.000000e-06
32771 1.050433e-06 1.000000e-06
32772 1.327965e-06 1.000000e-06
32773 6.988487e-07 1.000000e-06
norm 1.000000e+00 2.731286e+03
0 took 0.184694 (3.991899 GFLOPS) 0.000000

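3D finite-element method
• Same structure as the 1D finite-element method
• Each element has 4 nodes, and each node has 3 components (x, y, z)
[Figure: data layout. u, coor, and r each store x y z x y z … per node (node 0, node 1, …); element ie refers to nodes cny[4*ie], cny[4*ie+1], cny[4*ie+2], cny[4*ie+3]]
[Code listing: matrix-vector product code (3D finite-element method)]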

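Run results
• The analysis result (norm) is identical
• In 3D, the computation inside the matrix-vector product is expensive, so SIMD computation gives a speedup
  • However, the data load and scatter-add parts cannot be SIMD-ized, so the performance gain is limited
After code modification (cal_amat_simd.c)
nblock,n,ne 20 9261 48000
0 took 0.432718 (13.643987 GFLOPS) 0.208333
norm 3.565104e+03 3.797010e+06
Before code modification (cal_amat.c)
nblock,n,ne 20 9261 48000
0 took 0.800056 (7.379483 GFLOPS) 0.208333
norm 3.565104e+03 3.797010e+06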

Example of developing and implementing a finite-element algorithm suited to SIMD
From: Kohei Fujita, Masashi Horikoshi, Tsuyoshi Ichimura, Larry Meadows, Kengo Nakajima, Muneo Hori, Lalith Maddegedara, Journal of Computational Science, 2020

Target problem
• Implicit, dynamic, nonlinear, low-order unstructured finite-element method
  • Suited to solving domains with heterogeneously distributed nonlinear material properties and complex geometry, such as the seismic response of a city
  • Solves large-scale systems of linear equations many times
  • Involves many random data accesses
Solve for each of a few thousand time steps: Ku = f
  K: sparse symmetric positive definite matrix (changes every time step)
  u: unknown vector (up to 1 trillion degrees of freedom)
  f: known vector (outer force vector)

SC14 unstructured finite-element solver
• Designed for the CPU-based K computer
  • Uses an algorithm that can obtain equal granularity on millions of cores
  • Uses a matrix-free matrix-vector product (Element-by-Element method): good load balance when the number of elements per core is equal
  • Also attains high peak performance because it is on-cache computation
  • Combines the Element-by-Element method with multi-grid, mixed-precision arithmetic, and the adaptive conjugate gradient method
  • Scalability and peak performance are good (the key kernels are Element-by-Element) and convergence is good; thus, time-to-solution is good
Element-by-Element method: $f = \sum_e P_e K_e P_e^T u$ [$K_e$ is generated on-the-fly]
[Figure: for Element #0, Element #1, …, Element #N-1, $K_e$ is applied to entries of u and the result is added (+=) into f]

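Performance of the SC14 solver on recent computers
• K computer (RIKEN)
  • 8-core CPU (SPARC64) x 82,944 compute nodes
  • Peak performance: 10.6 PFLOPS, memory bandwidth: 5.3 PB/s
  • SC14 solver performance: 11.1% of peak
• Oakforest-PACS (The University of Tokyo and University of Tsukuba)
  • 68-core CPU (Xeon Phi Knights Landing) x 8,208 compute nodes
  • Peak performance: 25 PFLOPS, memory bandwidth: 4 PB/s
  • SC14 solver performance: 2.26% of peak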

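Causes of the performance degradation
• SIMD width differs between computers
  • K computer: 2 for double precision (2 for single precision)
  • Oakforest-PACS: 8 for double precision (16 for single precision)
• Random access in the Element-by-Element method becomes the bottleneck on wide-SIMD computers
  • Random loads of the right-hand side vector (u)
  • Random additions into the left-hand side vector (f)
  • These are implemented with instructions that are less efficient than contiguous SIMD accesses
[Figure: for Element #0, Element #1, …, Element #N-1, $K_e$ is applied to entries of u and the result is added (+=) into f]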

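Using the "homogeneity" of the problem in the unstructured finite-element method
• Homogeneity and contiguity of computation lead directly to high data access performance
  • This is one reason why structured grids, such as those of the finite-difference method, and methods with structured meshes achieve high computational performance
• Even in unstructured finite elements, the mesh is invariant in the time direction
  • Using this property, solving in parallel in the time direction improves the efficiency of the dynamic finite-element method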

Overview of the time-parallel algorithm
• Solving multiple time steps simultaneously predicts the solutions of future steps
  • Reduces the computational cost per iteration
  • The total number of operations is almost the same as with a conventional solver, but a more efficient kernel can be used
[Figure: kernel in previous solver: for each element (Element #0, Element #1, …), $k_e$ is applied to $u_i$ and added into $f_i$. Kernel in time-parallel solver (less random access): $k_e$ is applied to $u_i, u_{i+1}, u_{i+2}, u_{i+3}$ (current time step and future time steps, contiguous in memory, i.e., SIMD efficient) and added into $f_i, f_{i+1}, f_{i+2}, f_{i+3}$ (contiguous in memory)]
Tsuyoshi Ichimura, Kohei Fujita, Masashi Horikoshi, Larry Meadows, Kengo Nakajima, Takuma Yamaguchi, Kentaro Koyama, Hikaru Inoue, Akira Naruse, Keisuke Katsushima, Muneo Hori, Maddegedara Lalith, A Fast Scalable Implicit Solver with Concentrated Computation for Nonlinear Time-evolution Problems on Low-order Unstructured Finite Elements, 32nd IEEE International Parallel and Distributed Processing Symposium, 2018.
Kohei Fujita, Keisuke Katsushima, Tsuyoshi Ichimura, Masashi Horikoshi, Kengo Nakajima, Muneo Hori, Lalith Maddegedara, Wave Propagation Simulation of Complex Multi-Material Problems with Fast Low-Order Unstructured Finite-Element Meshing and Analysis, Proceedings of HPC Asia 2018 (Best Paper Award), 2018.
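The idea of the time-parallel kernel can be illustrated with a reduced 1D sketch in C. Everything here is an assumption for illustration: the function name, the fixed `M = 4`, the storage order `[node][M]`, and E = 1. The point is that the element stiffness is built once and the random node indices are looked up once per element, while the inner loop runs over M contiguous values.

```c
#include <assert.h>

#define M 4  /* number of time steps solved together (m on the slide) */

/* Element-by-element product for M vectors at once.
   u and f are stored as [node][M], so the M values of one node are contiguous. */
void ebe_multi_vector(int n, int ne, const int *cny, const float *coor,
                      const float *u, float *f)
{
    assert(n > 0 && ne >= 0);
    for (int i = 0; i < n * M; ++i) f[i] = 0.0f;
    for (int ie = 0; ie < ne; ++ie) {
        int n0 = cny[2 * ie], n1 = cny[2 * ie + 1];   /* one random lookup per element */
        float k = 1.0f / (coor[n1] - coor[n0]);       /* Ke built once, reused M times */
        for (int im = 0; im < M; ++im) {              /* contiguous access over im */
            float d = k * (u[n0 * M + im] - u[n1 * M + im]);
            f[n0 * M + im] += d;
            f[n1 * M + im] -= d;
        }
    }
}
```

Compared with calling a single-vector kernel M times, the number of floating-point operations is essentially the same, but the indirect accesses are amortized over M values.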

Standard solver algorithm
set $x_{-1} \leftarrow 0$
for ($i = 0$; $i < n$; $i = i + 1$) {
  guess $\bar{x}_i$ using standard predictor
  set $b_i$ using $x_{i-1}$
  solve $x_i \leftarrow A^{-1} b_i$ using initial solution $\bar{x}_i$ (computed using iterative solver with SpMV kernel)
}
Developed algorithm
set $x_{-1} \leftarrow 0$ and $\bar{x}_i \leftarrow 0$ ($i = 0, \dots, m - 2$)
for ($i = 0$; $i < n$; $i = i + 1$) {
  set $b_i$ using $x_{i-1}$
  guess $\bar{x}_{i+m-1}$ using standard predictor
  $b_j \leftarrow \bar{b}_j$
  while ($\|A\bar{x}_i - b_i\| / \|b_i\| > \epsilon$) do {
    guess $\bar{b}_j$ using $\bar{x}_{j-1}$ ($j = i + 1, \dots, i + m - 1$)
    refine solution {$\bar{x}_j \leftarrow A^{-1}\bar{b}_j$} with initial solution $\bar{x}_j$ ($j = i, \dots, i + m - 1$) (computed using iterative solver with concentrated computation kernel)
  }
  $x_i \leftarrow \bar{x}_i$
}

Naïve implementation of the time-parallel algorithm (m steps)
• Each core computes m vectors using SIMD
  • However, m ≤ 4 is typical in practical problems, so the full SIMD width cannot be used
• Temporary vectors are allocated to avoid data races between cores
  • This is costly on many-core computers
[Figure: core-wise computation (in case of m=4): for each element, $k_e$ is applied to $u_i, …, u_{i+3}$ and added into $f_i, …, f_{i+3}$ (current and future time steps) using 4-wide SIMD. Many-core computation (in case of three cores): 1. initialize necessary components (gray) of the core-wise temporary vectors (ft); 2. update components by EBE (black); 3. add necessary components into the global left-hand side vector (f)]

1 !$OMP PARALLEL DO
2 do iu=1,numberofthreads ! for each thread
3 do i=1,nnum(iu)
4 i1=nlist(i,iu)
5 do im=1,m
6 ft(im,1,i1,iu)=0.0 ! clear temporary vector
7 ft(im,2,i1,iu)=0.0
8 ft(im,3,i1,iu)=0.0
9 enddo
10 enddo
11 do ie=npl(iu)+1,npl(iu+1)
12 cny1=cny(1,ie)
13 cny2=cny(2,ie)
14 cny3=cny(3,ie)
15 cny4=cny(4,ie)
16 xe11=x(1,cny1)
17 xe21=x(2,cny1)
...
18 xe34=x(3,cny4)
19 do im=1,m
20 ! load ue11~ue34
21 ue11=u(im,1,cny1)
...
24 ! compute BDBu using ue11~ue34 and xe11~xe34
25 BDBu11=...
26 BDBu21=...
...
27 BDBu34=...
28 ! add to temporary vector
29 ft(im,1,cny1,iu)=BDBu11+ft(im,1,cny1,iu)
...
32 enddo ! im
33 enddo ! ie
34 enddo ! iu
35 !$OMP END PARALLEL DO
[Annotations for lines 1-35: 1. initialize necessary components (gray) of the core-wise temporary vectors (ft); 2. update components by EBE (black); the im loop is SIMD computation with width m]
36 !$OMP PARALLEL DO
37 ! clear global vector
38 do i=1,n
39 do im=1,m
40 f(im,1,i)=0.0
41 f(im,2,i)=0.0
42 f(im,3,i)=0.0
43 enddo
44 enddo
46 do iu=1,numberofthreads
48 ! add to global vector
49 do i=1,nnum(iu)
50 i1=nlist(i,iu)
51 do im=1,m
52 f(im,1,i1)=f(im,1,i1)+ft(im,1,i1,iu)
55 enddo
56 enddo
58 enddo
[Annotations for lines 36-58: 3. add necessary components of the core-wise temporary vectors (ft) into the global left-hand side vector (f)]
[Listing: naïve implementation of EBE kernel with m vectors]

Efficient computation on wide-SIMD CPUs
• Packing and unpacking the vectors lets the full SIMD width be used
[Figure: example for the time-parallel kernel (m = 4) on an architecture with 8-wide FP32 SIMD. Naïve implementation: uses only 4 of the 8 SIMD lanes. With packing: $u_i, …, u_{i+3}$ of Element #0 and Element #1 are packed, computed using 8-wide SIMD (full SIMD width), then unpacked and added into $f_i, …, f_{i+3}$]
!$OMP PARALLEL DO
do iu=1,numberofthreads ! for each thread
  do i=1,nnum(iu)
    i1=nlist(i,iu)
    do im=1,m
      ft(im,1,i1,iu)=0.0 ! clear temporary vector
      ft(im,2,i1,iu)=0.0
      ft(im,3,i1,iu)=0.0
    enddo
  enddo
  ! block loop with blocksize NL/m
  do ieo=npl(iu)+1,npl(iu+1),NL/m
    ! load ue, xe (pack m vectors of up to NL/m elements)
    do ie=1,min(NL/m,npl(iu+1)-ieo+1)
      cny1=cny(1,ieo+ie-1)
      ...
      do im=1,m ! SIMD (width=m) computation
        ue11(im+(ie-1)*m)=u(im,1,cny1)
        ue21(im+(ie-1)*m)=u(im,2,cny1)
        ...
        ue34(im+(ie-1)*m)=u(im,3,cny4)
        xe11(im+(ie-1)*m)=x(1,cny1)
        xe21(im+(ie-1)*m)=x(2,cny1)
        ...
        xe34(im+(ie-1)*m)=x(3,cny4)
      enddo
    enddo
    ! compute BDBu (SIMD computation over the packed width NL)
    do i=1,NL
      BDBu11(i)=...
      BDBu21(i)=...
      ...
      BDBu34(i)=...
    enddo
    ! unpack and add to temporary vector
    do ie=1,min(NL/m,npl(iu+1)-ieo+1)
      ...
      do im=1,m ! SIMD (width=m) computation
        ft(im,1,cny1,iu)=BDBu11(im+(ie-1)*m)+ft(im,1,cny1,iu)
        ...
      enddo
    enddo
  enddo ! ieo
enddo ! iu
! add ft into f (same as lines 36-58 of the naïve implementation listing)
[Listing: EBE kernel with m vectors for wide-SIMD CPUs]

Thread partitioning for many-core computers
• No temporary arrays are needed
• Graph partitioning is used to promote cache reuse
[Figure: a) standard coloring method: the overall mesh is split into Color #1, Color #2, Color #3, …, and all threads compute each color. b) developed thread partitioning method: the mesh is decomposed using a graph partitioning method into Set #1, Set #2, Set #3, computed by Thread 1, Thread 2, Thread 3; one part is marked "(Threads 2,3 idle)"]
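As a minimal illustration of why coloring removes the need for temporary vectors, the following C sketch applies the standard coloring idea (panel a), not the developed graph-partitioning method, to a 1D chain mesh. The function name and the assumption that element ie joins nodes ie and ie+1 with E = 1 are hypothetical. Elements of the same color share no node, so their additions into f cannot collide.

```c
#include <assert.h>

/* Two-color EBE product for a 1D chain mesh where element ie joins nodes ie and ie+1.
   Within one color no two elements touch the same node, so the loop over
   elements can run in parallel and add directly into f. */
void ebe_colored(int ne, const float *coor, const float *u, float *f)
{
    assert(ne >= 1);
    for (int i = 0; i <= ne; ++i) f[i] = 0.0f;
    for (int color = 0; color < 2; ++color) {   /* colors are processed one after another */
#ifdef _OPENMP
        #pragma omp parallel for
#endif
        for (int ie = color; ie < ne; ie += 2) {
            float k = 1.0f / (coor[ie + 1] - coor[ie]);
            float d = k * (u[ie] - u[ie + 1]);
            f[ie]     += d;
            f[ie + 1] -= d;
        }
    }
}
```

The cost of this simplicity is that consecutive elements handled by a thread are far apart in the mesh, which is the cache-reuse issue the graph-partitioning variant on the slide is meant to address.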

!$OMP PARALLEL DO
! clear global vector
do i=1,n
  do im=1,m
    f(im,1,i)=0.0
    f(im,2,i)=0.0
    f(im,3,i)=0.0
  enddo
enddo
...
do icolor=1,ncolor ! for each color or element set
  ...
  do iu=1,numberofthreads
    ! block loop with blocksize NL/m
    do ieo=npl(icolor,iu)+1,npl(icolor,iu+1),NL/m
      ! load ue, xe
      do ie=1,min(NL/m,npl(icolor,iu+1)-ieo+1)
        ...
        do im=1,m ! SIMD (width=m) computation
          ue11(im+(ie-1)*m)=u(im,1,cny1)
          ue21(im+(ie-1)*m)=u(im,2,cny1)
          ...
          ue34(im+(ie-1)*m)=u(im,3,cny4)
          xe11(im+(ie-1)*m)=x(1,cny1)
          xe21(im+(ie-1)*m)=x(2,cny1)
          ...
          xe34(im+(ie-1)*m)=x(3,cny4)
        enddo
      enddo
      ! compute BDBu (SIMD computation)
      do i=1,NL
        BDBu11(i)=...
        BDBu21(i)=...
        ...
        BDBu34(i)=...
      enddo
      ! unpack and add directly to the global vector
      do ie=1,min(NL/m,npl(icolor,iu+1)-ieo+1)
        ...
        do im=1,m ! SIMD (width=m) computation
          f(im,1,cny1)=BDBu11(im+(ie-1)*m)+f(im,1,cny1)
          ...
        enddo
      enddo
    enddo ! ieo
  enddo ! iu
  ...
enddo ! icolor
[Listing: coloring/thread partitioning of EBE kernel with m vectors for wide-SIMD CPUs]

Mixed use of 4-wide and 16-wide SIMD
• For problems with m = 4 computed on a 16-wide SIMD architecture, 16-wide SIMD can be used for packing/unpacking of the 4-wide vectors
  • Enables further reduction of instructions
Use of only 4-width SIMD instructions:
do im=1,m ! SIMD width=4 computation
  ue11(im+(ie-1)*m)=u(im,1,cny1)
  ! Load u(1:4,1,cny1) to xmm1
  ! Store xmm1 to ue11(1+(ie-1)*m:4+(ie-1)*m)
  ue21(im+(ie-1)*m)=u(im,2,cny1)
  ue31(im+(ie-1)*m)=u(im,3,cny1)
  ...
enddo
Mixed use of 4-wide and 16-wide SIMD instructions (replaces the assignments to ue11, ue21, ue31 above):
! Load u(1:16,cny1) to zmm1 (SIMD width=16 computation)
! Store zmm1(1:4) to ue11(1+(i-1)*4:4+(i-1)*4) (SIMD width=4 computation)
! Store zmm1(5:8) to ue21(1+(i-1)*4:4+(i-1)*4)
! Store zmm1(9:12) to ue31(1+(i-1)*4:4+(i-1)*4)
...
xmm indicates 128-bit registers (4 FP32 values) and zmm indicates 512-bit registers (16 FP32 values)

Problem setting
• Solve a nonlinear wave propagation problem in the two-layer ground shown below
• Measure performance on three computers with different characteristics
[Figure: two-layer ground model (Layer 1 and Layer 2) with axes x, y, z and dimensions 60 m, 20 m, 64 m, 8 m]
Layer | 1 | 2
Vp (m/s) | 700 | 2,100
Vs (m/s) | 100 | 700
Density (kg/m3) | 1,500 | 2,100
Damping | 0.25 (hmax) | 0.05
Strain criteria | 0.007 | -

Item | K computer | Oakforest-PACS | Intel Skylake Xeon Gold based server
Nodes | 8 | 1 | 1
Sockets/node | 1 | 1 | 2
Cores/socket | 8 | 68 | 20
FP32 SIMD width | 2 | 16 | 16
Clock frequency | 2.0 GHz | 1.4 GHz | 2.4 GHz
Total peak FP32 FLOPS | 1024 GFLOPS | 6092 GFLOPS | 6144 GFLOPS
Total DDR bandwidth | 512 GB/s | 80.1 GB/s | 255.9 GB/s
Total MCDRAM bandwidth | - | 490 GB/s | -

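[Figure: bar chart of kernel elapsed time per vector (s), axis 0 to 25, for six kernel configurations on three computers. The values below are grouped by computer in the order extracted]
Case | Baseline (1 vector) | Baseline (4 vector) | Case #1 | Case #2 | Case #3 | Case #4
# of vectors | 1 | 4 | 4 | 4 | 4 | 4
SIMD packing | No | No | Yes | Yes | Yes | Yes
Many-core algorithm | Core-wise vectors | Core-wise vectors | Core-wise vectors | Standard coloring | Developed partitioning | Developed partitioning
Mixed use of 16-width and 4-width SIMD | No | No | No | No | No | Yes
K computer (8 nodes), time (s) | 9.45 | 7.52 | 7.52 | 11.52 | 6.62 | -
Oakforest-PACS (1 node), time (s) | 20.10 | 7.50 | 3.75 | 8.85 | 3.25 | 2.60
Skylake Xeon Gold 6148 x 2 socket, time (s) | 14.07 | 3.98 | 3.05 | 8.20 | 2.14 | 2.04
Performance annotated in the chart (percentage of FP32 peak), for the baseline (1 vector) and the fastest case on each computer:
K computer: 267 GFLOPS (26.1%) → 380 GFLOPS (37.1%)
Oakforest-PACS: 121 GFLOPS (1.98%) → 1000 GFLOPS (16.3%)
Skylake Xeon Gold: 175 GFLOPS (2.85%) → 1287 GFLOPS (20.9%)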

Performance on actual urban earthquake simulation problem
• Compute seismic shaking of three-layered ground in central Tokyo
[Figure: a) model of a 1.25 km x 1.25 km area of Tokyo with 4066 structures; b) elevation of interfaces of three soil layers (scale 10 to 40 m); c) response at ground surface (merged horizontal component of SI value; scale 113 to 236 cm/s)]

Performance on actual urban earthquake simulation problem
Solver algorithm | EBE kernel algorithm | Elapsed time (s)
SC14 solver (without time parallelism) | Baseline (m=1) | 247.2
With time parallelism | Baseline (m=4) | 125.6
With time parallelism | Developed (m=4) | 61.9
Speedups: 1.97 x faster (247.2 s → 125.6 s), 2.03 x faster (125.6 s → 61.9 s), 3.99 x faster (247.2 s → 61.9 s)

Summary
• The Element-by-Element (EBE) kernel in matrix-vector products is the key kernel of unstructured implicit finite-element applications
• However, it is not straightforward for the EBE kernel to attain high performance because of random data access
• We developed methods to circumvent these data races for high performance on many-core CPU architectures with wide SIMD units
• The developed EBE kernel attains
  • 16.3% of FP32 peak on the Intel Xeon Phi Knights Landing based Oakforest-PACS
  • 20.9% of FP32 peak on an Intel Skylake Xeon Gold processor based system
• This leads to a 2.88-fold speedup over the baseline kernel and a 2.03-fold speedup of the whole finite-element application on Oakforest-PACS

Accelerating the finite-element method on GPUs
From: Tsuyoshi Ichimura, Kohei Fujita, Takuma Yamaguchi, Akira Naruse, Jack C. Wells, Thomas C. Schulthess, Tjerk P. Straatsma, Christopher J. Zimmer, Maxime Martinasso, Kengo Nakajima, Muneo Hori, Lalith Maddegedara, SC18 Gordon Bell Prize Finalist (slides courtesy of Takuma Yamaguchi)

Porting to Piz Daint/Summit
• Communication and memory bandwidth are lower, relative to compute performance, than on the K computer
  • Reducing data transfer is required for performance
• We have been using FP32-FP64 variables
  • Transprecision computing is available due to adaptive preconditioning
Item | K computer | Piz Daint | Summit
CPU/node | 1×SPARC64 VIIIfx | 1×Intel Xeon E5-2690 v3 | 2×IBM POWER 9
GPU/node | - | 1×NVIDIA P100 GPU | 6×NVIDIA V100 GPU
Peak FP32 performance/node | 0.128 TFLOPS | 9.4 TFLOPS | 93.6 TFLOPS
Memory bandwidth/node | 64 GB/s | 720 GB/s | 5400 GB/s
Inter-node throughput | 5 GB/s in each direction | 10.2 GB/s | 25 GB/s

Introduction of FP16 variables
• Half precision can be used to reduce the data transfer size
• Using FP16 for a whole matrix or vector causes overflow/underflow or failure to converge
  • Fewer exponent bits → small dynamic range
  • Fewer fraction bits → no more than 4-digit accuracy
[Figure: bit layouts. Single precision (FP32, 32 bits): 1-bit sign + 8-bit exponent + 23-bit fraction. Half precision (FP16, 16 bits): sign, shorter exponent field, shorter fraction field]

FP16 for point-to-point communication
• The FP16 MPI buffer is used only for the boundary part
• To avoid overflow or underflow, the original vector $\mathbf{x}$ is split into one localized scaling factor $Const$ and an FP16 vector $\bar{\mathbf{x}}_{16}$
• Data transfer size can be reduced
• $Const \times \bar{\mathbf{x}}_{16}$ does not match $\mathbf{x}$ exactly, but the convergence characteristics are unchanged for most problems
[Figure: the boundary part of $\mathbf{x}$ on PE#0 is sent to PE#1 as $Const \times \bar{\mathbf{x}}_{16}$]

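Overlap of computation and communication
Conjugate gradient loop (i, i+1, i+2, i+3-th time step):
1 : $\mathbf{r} = \mathbf{A}\mathbf{u}$
2 : synchronize $\mathbf{q}$ by point-to-point comm.
3 : $\mathbf{r} = \mathbf{b} - \mathbf{r}$; $\mathbf{z} = \mathbf{M}^{-1}\mathbf{r}$
4 : $\rho_a = 1$; $\alpha = 1$; $\rho_b = \mathbf{z} \cdot \mathbf{r}$; $\gamma = \mathbf{z} \cdot \mathbf{q}$
5 : synchronize $\rho_b$, $\gamma$ by collective comm.
6 : while ($|\mathbf{r}_i|/|\mathbf{b}_i| > tolerance$) do
7 :   $\beta = -\gamma\rho_a/\alpha$
8 :   $\mathbf{u} = \mathbf{u} + \alpha\mathbf{p}$; $\mathbf{p} = \mathbf{z} + \beta\mathbf{p}$
9 :   $\mathbf{q} = \mathbf{A}\mathbf{p}$
10:   synchronize $\mathbf{q}$ by point-to-point comm.
11:   $\rho_a = \mathbf{p} \cdot \mathbf{q}$
12:   synchronize $\rho_a$ by collective comm.
13:   $\alpha = \rho_b/\rho_a$; $\rho_a = \rho_b$
14:   $\mathbf{r} = \mathbf{r} - \alpha\mathbf{q}$; $\mathbf{z} = \mathbf{M}^{-1}\mathbf{r}$; $\rho_b = \mathbf{z} \cdot \mathbf{r}$; $\gamma = \mathbf{z} \cdot \mathbf{q}$
15:   synchronize $\rho_b$, $\gamma$ by collective comm.
16: enddo
• Conjugate Gradient method
• Introduce time-parallel algorithm
  • Solve four time steps in the analysis in parallel
  • Compute 1 current time step and 3 future time steps
  • Reduce iterations in the solver
• Computation becomes dense and suitable for low B/F architectures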

Overlap of computation and communication
Simplified loop (i, i+1, i+2, i+3-th time step):
1’: while ($error_i > tolerance$) do
2’:   Vector operation 1
3’:   Matrix vector multiplication
4’:   Point-to-point comm.
6’:   Collective comm.
8’:   Collective comm.
9’: enddo
• Simplified loop
  • Computation part
    • 3 groups of vector operations
    • 1 sparse matrix vector multiplication
  • Communication part
    • 1 point-to-point communication
    • 2 collective communications
• Point-to-point communication is overlapped with matrix vector multiplication
  • However, this communication is still the bottleneck of the solver
[Figure: PE#0 divided into a boundary part (send/receive between other MPI processes) and an inner part. 1. boundary part computation; 2. inner part computation & boundary part communication]

Overlap of computation and communication
Loop for the i, i+1-th time step:
1’ : while ($error_i > tolerance$) do
2’ :
3’ :   Collective comm.
4’ :   Vector operation 1
5’ :   Matrix vector multiplication
6’ :   Point-to-point comm.
9’ :
10’:
12’: enddo
Loop for the i+2, i+3-th time step:
1’ : while ($error_i > tolerance$) do
4’ :
5’ :
7’ :
10’:   Matrix vector multiplication
11’:   Point-to-point comm.
12’: enddo
• 4 vectors are divided into 2 vectors × 2 sets
• Point-to-point communication is overlapped with other vector operations
• The number of collective communications is unchanged

FP16 computation in Element-by-Element method
• Matrix-free matrix-vector multiplication
  • Compute element-wise multiplication
  • Add into the global vector
• Normalization of variables per element can be performed
  • Enables use of doubled width FP16 variables in element wise computation
  • Achieved 71.9% peak FP64 performance on V100 GPU
• Similar normalization is used in communication between MPI partitions for FP16 communication
Element-by-Element (EBE) method: $f = \sum_e P_e A_e P_e^T u$ [$A_e$ is generated on-the-fly]
[Figure: for Element #0, Element #1, …, Element #N-1, $A_e$ is applied to u and the result is added (+=) into f; the figure labels the data types FP32, FP16, FP16]

Introduction of custom data type: FP21
• Most computation in the CG loop is memory bound
• However, the exponent of FP16 is too small for use in global vectors
• Use FP21 variables for memory-bound computation
  • Only used for storing data (FP21×3 are stored into a 64-bit array element)
  • Bit operations are used to convert FP21 to FP32 variables for computation
[Figure: bit layouts of single precision (FP32, 32 bits), FP21 (21 bits), and half precision (FP16, 16 bits: 1-bit sign + 5-bit exponent + 10-bit fraction)]
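The storage trick can be sketched in C as follows. The slide does not spell out the FP21 bit layout; this sketch assumes 1 sign bit, the full 8-bit FP32 exponent, and a 12-bit fraction, so that an FP21 value is simply the upper 21 bits of an FP32 value. The function names are hypothetical, and truncation (rather than rounding) is a simplification.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret float <-> 32-bit pattern without violating aliasing rules. */
static uint32_t f32_bits(float x) { uint32_t b; memcpy(&b, &x, sizeof b); return b; }
static float bits_f32(uint32_t b) { float x; memcpy(&x, &b, sizeof x); return x; }

/* Pack three values into one 64-bit word, 21 bits each (1 bit unused).
   Assumed FP21 layout: upper 21 bits of FP32 (sign, 8-bit exponent, 12-bit fraction). */
uint64_t fp21x3_pack(float a, float b, float c)
{
    uint64_t ha = f32_bits(a) >> 11;   /* drop the low 11 fraction bits */
    uint64_t hb = f32_bits(b) >> 11;
    uint64_t hc = f32_bits(c) >> 11;
    return (ha << 42) | (hb << 21) | hc;
}

/* Unpack to FP32 for computation; the dropped fraction bits become zero. */
void fp21x3_unpack(uint64_t w, float out[3])
{
    const uint64_t mask = (1u << 21) - 1u;
    assert(out != NULL);
    out[0] = bits_f32((uint32_t)((w >> 42) & mask) << 11);
    out[1] = bits_f32((uint32_t)((w >> 21) & mask) << 11);
    out[2] = bits_f32((uint32_t)(w & mask) << 11);
}
```

Because the exponent field is kept whole under this assumption, the dynamic range is the same as FP32 and only precision is reduced, while three values fit in the space of one double.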

Performance on Piz Daint/Summit
• The developed solver demonstrates higher scalability than previous solvers
• Leads to 19.8% (nearly full Piz Daint) & 14.7% (nearly full Summit) of peak FP64 performance
[Figure: elapsed time (s) versus # of MPI processes (# GPUs) for the Developed, SC14, and PCGE (Standard) solvers. Values are given in the order extracted, which appears to follow the axis labels from the largest to the smallest process count; series are assigned to solvers by magnitude]
Piz Daint (4608, 2304, 1152, 576, 288 processes), elapsed time (s)
PCGE (Standard): 2,867.1, 2,999.8, 3,034.6, 3,065.1, 2,759.3
SC14: 393.3, 401.0, 399.5, 378.5, 373.2
Developed: 123.7, 120.8, 121.1, 117.8, 110.7
Summit (24576, 12288, 6144, 4608, 2304, 1152, 576, 288 processes), elapsed time (s)
PCGE (Standard): 2,082.9, 1,922.1, 2,033.8, 1,912.2, 1,927.5, 1,939.5, 1,923.7 (seven values; the process count without a value cannot be recovered)
SC14: 454.2, 415.1, 380.2, 374.6, 349.8, 327.3, 311.7, 302.5
Developed: 100.4, 90.0, 83.7, 84.3, 82.9, 80.4, 77.6, 75.8

Development of computational methods for Fugaku
Kohei Fujita, Kentaro Koyama, Kazuo Minami, Hikaru Inoue, Seiya Nishizawa, Miwako Tsuji, Tatsuo Nishiki, Tsuyoshi Ichimura, Muneo Hori, Lalith Maddegedara, High-fidelity nonlinear low-order unstructured implicit finite-element seismic simulation of important structures by accelerated element-by-element method, Journal of Computational Science, 2020

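Performance of the SC14 solver on Fugaku
• 11.1% of peak on the K computer (8-core CPU x 82,944 nodes)
• 1.5% of peak on Fugaku (48-core CPU x 158,976 nodes)
  • In the finite-element method, random data access in the matrix-vector product is the bottleneck
  • How can these bottlenecks be avoided to speed up the finite-element method?
• More compute cores, plus additional assistant cores for the system
  • We want to make use of these as well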

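Making use of the computer's mechanisms
• Making accesses contiguous
  • Reorder the computation to use the SIMD units effectively
• Efficient use of many cores
  • Multi-coloring that accounts for cache characteristics
• Overlap of computation and communication
  • Use the assistant cores to run computation and communication simultaneously
• With these measures, the matrix-vector product, which is the main computation, becomes 13 times faster
[Figure: MPI process #0, #1, #2 with the inner domain, the boundary domain, and nodes on the MPI boundary; the overall mesh split into Color #1, Color #2, Color #3 and computed by Thread 1, Thread 2, Thread 3, with one part marked "(Threads 2,3 idle)"]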
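Performance measurement problem
• Wave propagation analysis in two-layer nonlinear ground
[Figure: two-layer ground model (Layer 1 and Layer 2) with dimensions 60 m, 20 m, 64 m, 8 m]
Layer | 1 | 2
Vp (m/s) | 700 | 2,100
Vs (m/s) | 100 | 700
Density (kg/m3) | 1,500 | 2,100
Damping | 0.25 (hmax) | 0.05
Strain criteria | 0.007 | -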

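Performance measurement results
• Weak scaling was measured (the problem size per node is fixed while the number of nodes is increased): ideally the run time does not increase
• On 147,456 nodes, corresponding to nearly the full Fugaku system, the method that accounts for the computer's mechanisms is 7 times faster than running the SC14 solver as is
  • Equivalent to 59 times faster than using the full K computer
[Figure: solver run time (s), axis 0 to 50, versus number of compute nodes (axis ticks 256, 2048, 16384, 131072); ten measurements per series, in the order extracted]
SC14solver_asis (1.5% of peak): 35.9, 36.8, 36.4, 36.0, 39.3, 37.1, 37.1, 36.8, 37.0, 38.8
SC14solver_proposed (9.9% of peak): 4.13, 4.20, 4.19, 4.17, 4.52, 4.65, 4.57, 4.73, 5.02, 5.54
The developed method gives a 7x speedup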

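Summary
• The finite-element method excels at problems with complex geometry, but it involves much random access, so developing algorithms and implementations matched to the characteristics of parallel computers is important for performance
  • Kernel-level implementation to make use of SIMD
  • Redesign of the solver algorithm to use SIMD even more efficiently
  • On computers with limited communication and memory bandwidth, measures such as low-precision data types enable highly efficient computation
• These methods are expected to be applicable to other methods dominated by random access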
