7. 7
Our Customers
▪ Millions of engineers and scientists worldwide use MATLAB and Simulink.
▪ 90,000+ business, government, and university sites
▪ All of the top 10 auto manufacturers¹
▪ All of the top 10 aerospace companies²
▪ Three of the top five internet companies
¹ OICA: 2016 World Motor Vehicle Production
² PwC: Aerospace and Defense 2017 Year in Review
15. 17
static __global__ void mykernel(A, X, Y, C, n)
{
    int k = getThreadIndex(n);
    int t = A[k] * X[k];
    C[k] = t + Y[k];
}
From for-loops to CUDA kernels
for k = 1:n
t = A(k) .* X(k);
C(k) = t + Y(k);
end
{ …
mykernel<<< f(n) >>>(A, X, Y, C, n);
…
}
Data dependency analysis: can the loop run in parallel? If yes (Y):
▪ Kernel creation: classify the kernel variables (input, output, local)
  Ins: A, X, Y, n
  Outs: C
  Local: t, k
▪ Kernel size calculation: f(n)
Extracting parallelism in MATLAB
1. Scalarized MATLAB (for loops)
2. Vectorized MATLAB
3. Composite functions
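The loop on this slide can become a kernel because no iteration reads a value written by another iteration. The following is a minimal Python sketch of that idea, not GPU Coder's actual analysis: `mykernel` mirrors the loop body with `k` standing in for the thread index, and `run_in_order` is a hypothetical helper that executes the iterations in an arbitrary order. Getting the same result for a shuffled order is only an informal check of independence.

```python
import random


def mykernel(k, A, X, Y, C):
    """One loop iteration; k plays the role of the CUDA thread index."""
    t = A[k] * X[k]   # local variable: t
    C[k] = t + Y[k]   # output: C; inputs: A, X, Y


def run_in_order(order, A, X, Y):
    """Hypothetical helper: run every iteration in the given order."""
    C = [0] * len(A)
    for k in order:
        mykernel(k, A, X, Y, C)
    return C


n = 8
A = list(range(n))
X = [2] * n
Y = [1] * n

sequential = run_in_order(range(n), A, X, Y)

shuffled_order = list(range(n))
random.Random(0).shuffle(shuffled_order)
shuffled = run_in_order(shuffled_order, A, X, Y)

# Each iteration touches only index k, so execution order does not matter.
assert sequential == shuffled
```

A loop such as `C(k) = C(k-1) + A(k)` would fail this kind of check, because iteration k depends on the result of iteration k-1.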
16. 18
CUDA kernel generation from vectorized MATLAB (array operations)
output(:, 1) = (input(:, 1) - x_im) .* factor;
Scalarization: convert the array expression into loop statements
for i = 1:M
    diff(i) = input(i, 1) - x_im(i);
end
for i = 1:M
    output(i, 1) = diff(i) * factor(i);
end
Loop fusion: create parallel loops that are as large as possible
for i = 1:M
    diff(i) = input(i, 1) - x_im(i);
    output(i, 1) = diff(i) * factor(i);
end
Scalar replacement: replace matrix-valued intermediate variables with scalars
for i = 1:M
    tmp = input(i, 1) - x_im(i);
    output(i, 1) = tmp * factor(i);
end
Assume the following sizes:
‘output’ : M x 3
‘input’ : M x 3
‘x_im’ : M x 1
‘factor’ : M x 1
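The three loop forms on this slide should compute the same first column of `output`. The sketch below restates them in Python to make that equivalence concrete. The function names (`scalarized`, `fused`, `scalar_replaced`) and the sample data are illustrative assumptions, and plain lists stand in for MATLAB arrays with 0-based indexing.

```python
def scalarized(inp, x_im, factor):
    """Two separate loops with a full-size intermediate array."""
    M = len(inp)
    diff = [0.0] * M
    out = [0.0] * M
    for i in range(M):
        diff[i] = inp[i][0] - x_im[i]
    for i in range(M):
        out[i] = diff[i] * factor[i]
    return out


def fused(inp, x_im, factor):
    """Loop fusion: one larger loop, intermediate array still present."""
    M = len(inp)
    diff = [0.0] * M
    out = [0.0] * M
    for i in range(M):
        diff[i] = inp[i][0] - x_im[i]
        out[i] = diff[i] * factor[i]
    return out


def scalar_replaced(inp, x_im, factor):
    """Scalar replacement: the intermediate array becomes a scalar."""
    M = len(inp)
    out = [0.0] * M
    for i in range(M):
        tmp = inp[i][0] - x_im[i]
        out[i] = tmp * factor[i]
    return out


# Sample data (assumed): 'inp' is M x 3, 'x_im' and 'factor' are M x 1.
inp = [[4.0, 0.0, 0.0], [6.0, 0.0, 0.0], [9.0, 0.0, 0.0]]
x_im = [1.0, 2.0, 3.0]
factor = [2.0, 0.5, 1.0]

reference = scalarized(inp, x_im, factor)
assert fused(inp, x_im, factor) == reference
assert scalar_replaced(inp, x_im, factor) == reference
```

In the final form each iteration needs only a scalar temporary, which on a GPU can typically live in a register rather than in an M-element array in memory.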
17. 19
Optimizing data transfers (memcpy) with GPU Coder
A(:) = ….
C(:) = ….
for i = 1:N
….
gB = kernel1(gA);
gA = kernel2(gB);
if (some_condition)
gC = kernel3(gA, gB);
end
….
end
…. = C;
cudaMemcpy *definitely* needed
cudaMemcpy *not* needed
cudaMemcpy *may be* needed
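To minimize data transfers:
• Use a status flag per variable to track where its current data resides
• Use use-def analysis to decide where to insert memcpy calls
• Equivalent to partial redundancy elimination (PRE)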
A(:) = …
A_isDirtyOnCpu = true;
…
for i = 1:N
if (A_isDirtyOnCpu)
cudaMemcpy(gA, A);
A_isDirtyOnCpu = false;
end
gB = kernel1(gA);
gA = kernel2(gB);
if (some_condition)
gC = kernel3(gA, gB);
C_isDirtyOnGpu = true;
end
…
end
…
if (C_isDirtyOnGpu)
cudaMemcpy(C, gC);
C_isDirtyOnGpu = false;
end
… = C;
gA, gB, and gC are assumed to reside in GPU memory
Generated (pseudo) code
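The flag-based scheme in the generated pseudocode can be simulated on the CPU to count how many transfers actually happen. The sketch below is only an illustration under stated assumptions: `TrackedBuffer`, its methods, and the three list comprehensions standing in for `kernel1`, `kernel2`, and `kernel3` are hypothetical and are not GPU Coder or CUDA APIs. A "copy" here is a Python list copy that stands in for `cudaMemcpy`.

```python
class TrackedBuffer:
    """Hypothetical host/device buffer pair with dirty flags."""

    def __init__(self, data):
        self.cpu = list(data)       # host copy
        self.gpu = None             # simulated device copy
        self.dirty_on_cpu = True    # host holds data the device has not seen
        self.dirty_on_gpu = False   # device holds data the host has not seen
        self.h2d = 0                # host-to-device copies performed
        self.d2h = 0                # device-to-host copies performed

    def on_gpu(self):
        """Return the device copy, transferring only if the host is newer."""
        if self.dirty_on_cpu:
            self.gpu = list(self.cpu)
            self.dirty_on_cpu = False
            self.h2d += 1
        return self.gpu

    def set_on_gpu(self, data):
        """Record a kernel result written on the device."""
        self.gpu = list(data)
        self.dirty_on_cpu = False
        self.dirty_on_gpu = True

    def on_cpu(self):
        """Return the host copy, transferring only if the device is newer."""
        if self.dirty_on_gpu:
            self.cpu = list(self.gpu)
            self.dirty_on_gpu = False
            self.d2h += 1
        return self.cpu


def run(n_iters, condition):
    A = TrackedBuffer([1, 2, 3])
    C = TrackedBuffer([0, 0, 0])
    for i in range(n_iters):
        gA = A.on_gpu()                        # copies only on the first pass
        gB = [a + 1 for a in gA]               # stand-in for kernel1
        A.set_on_gpu([2 * b for b in gB])      # stand-in for kernel2
        if condition(i):
            gC = [a + b for a, b in zip(A.on_gpu(), gB)]  # stand-in for kernel3
            C.set_on_gpu(gC)
    result = C.on_cpu()                        # copies back only if kernel3 ran
    return result, A.h2d, C.d2h


result_hit, a_copies_hit, c_copies_hit = run(4, lambda i: i == 1)
result_miss, a_copies_miss, c_copies_miss = run(4, lambda i: False)
```

With four iterations, `A` is copied to the device once rather than four times, and `C` is copied back only in the run where the condition fired.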
18. 21
GPU Coder: many analyses and transformations for generating optimized CUDA
Pipeline: Front-end → Control-flow graph intermediate representation (CFG-IR) → … → CUDA code emission
Passes on the CFG-IR:
• Traditional compiler optimizations
• MATLAB library function mapping
• Parallel loop creation
• CUDA kernel creation
• cudaMemcpy minimization
• Shared memory mapping
Pass groups named on the slide:
• Loop optimizations: scalarization, loop perfectization, loop interchange, loop fusion, scalar replacement
• CUDA kernel optimizations
24. 28
Deep Learning Workflow in MATLAB
▪ DNN design + training: train in MATLAB, or bring in a model through a model importer
▪ Trained DNN
▪ Application design: trained DNN plus application logic
▪ Standalone deployment:
  Coders: implementation on embedded devices
  Compiler/MPS: application distribution
25. 29
Check Out Deep Learning in MATLAB and GPU Coder
Deep learning in MATLAB
https://www.mathworks.com/solutions/deep-learning.html
Deep Learning Onramp: self-paced online training
https://jp.mathworks.com/training-schedule/deep-learning-onramp
GPU Coder
https://www.mathworks.com/products/gpu-coder.html
A MATLAB image is available on NVIDIA GPU Cloud (NGC)