2. Hello everyone, are you using PL/CUDA?
These English captions were not generated by machine learning; I wrote them by hand in advance.
PGconf.ASIA 2018 LT - In-database Analytics using GPU
3. PL/CUDA user-defined functions
▌What is PL/CUDA?
It lets you write CUDA C code that runs on the GPU as an SQL user-defined function.
▌Features
You can hand-write code that is optimized for the GPU.
You can use SQL, with its flexible data manipulation, for pre-processing and post-processing.
All In-database Analytics: Scan → Pre-Process → Analytics → Post-Process
CREATE FUNCTION
my_logic( reggstore, text )
RETURNS matrix
AS $$
  /* custom CUDA C code block (runs on GPU device) */
$$ LANGUAGE 'plcuda';
✓ Manual optimization for statistical analysis and machine learning
✓ Makes full use of thousands of compute cores and high-bandwidth memory
PL/CUDA allows a UDF to be written as a CUDA C program that executes on the GPU. Its value comes from combining manual (extreme) optimization for the GPU with flexible data manipulation in SQL.
4. PL/CUDA use case: similar-compound search in drug discovery
Data structure of chemical compounds
ID  NAME          Fingerprint (1024 bit)
1   CHEMBL153534  00000000000100000010000000000010001000000...
2   CHEMBL405398  00000000000000010010000000000000000100000...
3   CHEMBL503634  00000100000000000000010000000000000000000...
:   :             :
Database compounds (about 10 million) × query compounds (~1,000) = about 10 billion combinations to search
A query is sent to the DB server, which runs the similarity calculation logic and returns a list of similar compounds.
For similarity search in drug discovery, the GPU calculated 10 billion distances between chemical compounds 150 times faster than a C binary on the CPU. It is a very compute-intensive workload.
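The talk does not disclose the similarity logic, which was proprietary. Purely as an illustration of what "distance between fingerprints" can mean, the sketch below assumes the Tanimoto coefficient, a common choice for bit fingerprints. The function names, the integer encoding and the toy 8-bit data are hypothetical and are not taken from the talk.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto similarity of two bit fingerprints stored as integers."""
    common = bin(fp_a & fp_b).count("1")   # bits set in both fingerprints
    total = bin(fp_a | fp_b).count("1")    # bits set in at least one of them
    return common / total if total else 1.0


def top_matches(query: int, database: dict, k: int = 3) -> list:
    """Return the k database IDs most similar to the query fingerprint."""
    scored = sorted(database.items(),
                    key=lambda item: tanimoto(query, item[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]


# Toy 8-bit fingerprints; the ones in the talk are 1024 bits wide.
db = {"A": 0b11110000, "B": 0b11000000, "C": 0b00001111}
query = 0b11100000
best_two = top_matches(query, db, k=2)
```

Every (query, database) pair is scored independently of the others, which is why the 10 billion combinations on the slide parallelize well.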
5. Is there a sample program lying around somewhere?
Well... that use case relied on a proprietary algorithm, so for now there is no sample code available in public.
6. So I made one.
Topic: logistic regression analysis
9. Finding the parameters (1/3)
Determining the division surface is equivalent to finding the weights (slopes) of the explanatory variables and the intercept.
$0 = w_0 + w_1 x + w_2 y$
Generalizing:
Parameters: $w = (w_0, w_1, \dots, w_m)$
Explanatory variables: $\varphi_i = (1, x_{i1}, \dots, x_{im})$
Dependent variable: $t_i = 0 \text{ or } 1$
The training data, however, tells us only a boolean state for each combination of explanatory variables.
10. Finding the parameters (2/3)
Problem setting: maximize the probability of obtaining the training set.
With $z_i = \sigma(w^T \varphi_i)$:
$P = \prod_{i=1}^{N} P_i = \prod_{i=1}^{N} z_i^{t_i} (1 - z_i)^{1 - t_i}$
The farther a point is from the division surface, the more likely it is to be true (or false).
We assume the training set is the outcome that was most likely to be observed, and choose the parameter $w$ that maximizes this likelihood.
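To make the likelihood concrete, here is a minimal sketch that evaluates $P$ for a given $w$ on a toy one-variable data set. The helper names and the data are hypothetical; a real implementation would normally work with the logarithm of $P$ to avoid underflow on large training sets.

```python
import math


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def likelihood(w, rows, targets) -> float:
    """P = prod_i z_i^t_i * (1 - z_i)^(1 - t_i), with z_i = sigmoid(w . phi_i)."""
    p = 1.0
    for x, t in zip(rows, targets):
        phi = [1.0] + list(x)                      # constant term plus the variables
        z = sigmoid(sum(wj * pj for wj, pj in zip(w, phi)))
        p *= z if t else (1.0 - z)
    return p


rows = [(0.0,), (1.0,), (3.0,), (4.0,)]
targets = [0, 0, 1, 1]
good_w = [-4.0, 2.0]   # boundary at x = 2, true side on the right
bad_w = [4.0, -2.0]    # same boundary, true side on the wrong side
```

With these values `likelihood(good_w, rows, targets)` is far larger than `likelihood(bad_w, rows, targets)`, which is the sense in which the training set "picks" the parameters.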
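11. Finding the parameters (3/3)
Estimate the parameters by repeating the following update:
$\bar{w}_{new} = \bar{w}_{old} - (\Phi^T R \Phi)^{-1} \Phi^T (\bar{z} - \bar{t})$
where
$\Phi = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1m} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nm} \end{pmatrix}$
$\bar{t} = (t_1, \dots, t_n)$
$\bar{z} = (z_1, \dots, z_n)$
$R = \mathrm{diag}(z_1 (1 - z_1), \dots, z_n (1 - z_n))$
For more details, check out the book. In short, $\bar{w}$ is updated at each iteration so that $\bar{w}_{new}$ is a more reasonable parameter than $\bar{w}_{old}$. Eventually the difference between $\bar{w}_{new}$ and $\bar{w}_{old}$ becomes very small.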
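The update rule can be sketched in a few lines of plain Python. This is a serial illustration under stated assumptions, not the PL/CUDA implementation: the helper names, the stopping tolerance and the small non-separable data set are all hypothetical, and the linear system is solved with a hand-written Gaussian elimination instead of an explicit inverse.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def update(w, rows, targets):
    """One step: w_new = w_old - (Phi^T R Phi)^-1 Phi^T (z - t)."""
    width = len(w)
    p = [[0.0] * width for _ in range(width)]   # accumulates Phi^T R Phi
    g = [0.0] * width                           # accumulates Phi^T (z - t)
    for x, t in zip(rows, targets):
        phi = [1.0] + list(x)
        z = sigmoid(sum(wj * pj for wj, pj in zip(w, phi)))
        for i in range(width):
            g[i] += phi[i] * (z - t)
            for j in range(width):
                p[i][j] += phi[i] * z * (1.0 - z) * phi[j]
    delta = solve(p, g)
    return [wi - di for wi, di in zip(w, delta)]


def fit(rows, targets, width, tol=1e-9, max_iter=25):
    """Repeat the update until w_new and w_old differ by less than tol."""
    w = [0.0] * width
    for _ in range(max_iter):
        w_new = update(w, rows, targets)
        if max(abs(a - b) for a, b in zip(w_new, w)) < tol:
            return w_new
        w = w_new
    return w


rows = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
targets = [0, 0, 1, 0, 1, 1]        # deliberately not perfectly separable
w_hat = fit(rows, targets, width=2)
```

The two accumulations inside `update` are the only parts that touch every training row, so they are the natural candidates for GPU parallelization.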
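12. Thinking about the amount of computation
▌The number of explanatory variables is small: from a few up to about a hundred ... m
▌The number of training rows may be large: from hundreds up to tens of millions ... n
$\bar{w}_{new} = \bar{w}_{old} - \bar{w}_{\Delta} = \bar{w}_{old} - (\Phi^T R \Phi)^{-1} \Phi^T (\bar{z} - \bar{t})$
[Figure: shapes of the operands. $\Phi^T R \Phi$ sums over the long dimension $n$ and yields an $m \times m$ matrix, which is then inverted; $\Phi^T (\bar{z} - \bar{t})$ also sums over $n$ and yields an $m \times 1$ vector; their product $\bar{w}_{\Delta}$ is $m \times 1$.]
Estimating the amount of calculation: the number of explanatory variables is at most in the hundreds, but the training data set can have more than a million rows, so the work is well suited to parallel calculation on the GPU.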
13. Example code that computes the matrix product $\Phi^T R \Phi$ in parallel
KERNEL_FUNCTION_MAXTHREADS(void)
logregr_update_P(cl_double **Preg,      /* out */
                 cl_float  **Xp,
                 cl_int      width,
                 VectorTypeFloat *Z)
{
    cl_double  *P = Preg[0];
    __shared__ cl_float v[MAXTHREADS_PER_BLOCK];    // shared by the threads of a block
    /* declarations of nitems, nitems_bs, nloops, loop, i, j, k and sum are omitted on the slide */

    nitems_bs = TYPEALIGN(get_local_size(), nitems);
    nloops = width * width * nitems_bs;
    for (loop = get_global_id();        // unique identifier of this GPU thread
         loop < nloops;
         loop += get_global_size())     // advance by the total number of GPU threads
    {
        k = loop % nitems_bs;               // index of the R column/row
        i = (loop / nitems_bs) % width;     // row index of Φ^T
        j = loop / (nitems_bs * width);     // column index of Φ
        if (k < nitems)
        {
            cl_float z  = Z->values[k];
            cl_float x1 = (i == 0 ? 1.0 : Xp[i-1][k]);
            cl_float x2 = (j == 0 ? 1.0 : Xp[j-1][k]);
            v[get_local_id()] = x1 * z * (1.0 - z) * x2;
        }
        else
            v[get_local_id()] = 0.0;
        // total sum of the elements of v[], calculated together with the sibling threads
        sum = pgstromTotalSum(v, MAXTHREADS_PER_BLOCK);
        if (get_local_id() == 0)
            atomicAdd(&P[i + j * width], sum);
        __syncthreads();
    }
}
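The index arithmetic in the kernel is easier to follow in a serial form. The sketch below replays the same `loop` → `(i, j, k)` decomposition on the CPU; it is a reference for the indexing only, with hypothetical function names and a made-up block size, and it does not model threads, shared memory or `pgstromTotalSum`.

```python
def align_up(block: int, n: int) -> int:
    """Round n up to a multiple of block (the role TYPEALIGN plays in the kernel)."""
    return (n + block - 1) // block * block


def phi_t_r_phi(xp, z, block=4):
    """Serial reference for P = Phi^T R Phi using the kernel's loop indexing.

    xp[c][k] is explanatory variable c+1 of row k; z[k] is the sigmoid output.
    The result is a flat list with P[i + j * width], as in the kernel.
    """
    width = len(xp) + 1                     # +1 for the constant column of Phi
    nitems = len(z)
    nitems_bs = align_up(block, nitems)     # padded row count
    p = [0.0] * (width * width)
    for loop in range(width * width * nitems_bs):
        k = loop % nitems_bs
        i = (loop // nitems_bs) % width
        j = loop // (nitems_bs * width)
        if k < nitems:                      # padding slots contribute nothing
            x1 = 1.0 if i == 0 else xp[i - 1][k]
            x2 = 1.0 if j == 0 else xp[j - 1][k]
            p[i + j * width] += x1 * z[k] * (1.0 - z[k]) * x2
    return p


xp = [[1.0, 2.0, 3.0]]      # one explanatory variable, three rows
z = [0.5, 0.5, 0.5]
p = phi_t_r_phi(xp, z)
```

Padding the row count to a multiple of the block size means that all threads of one block share the same `(i, j)` pair, so the block-wide sum can be added to a single element of `P`.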
14. Computing on the GPU: example of a reduction algorithm
Calculating the array sum Σi=0...N-1 item[i] on the GPU
[Figure: tree reduction over item[0] ... item[15] in four steps (step.1 to step.4); partial sums are combined step by step until item[0] holds the total.]
The sum of items[] is calculated in log2N steps.
Hardware-assisted synchronization between cores
SELECT count(X),
       sum(Y),
       avg(Z)
  FROM my_table;
The same mechanism is used to calculate aggregate functions.
Values in shared memory can be accessed by multiple GPU cores simultaneously. The hardware supports inter-core synchronization, which makes it possible to calculate the total sum in log2N steps.
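The reduction can be simulated serially to see where the log2N step count comes from. This sketch assumes one common pairing pattern (stride 1, 2, 4, ...); the slide's exact layout may differ, and the function name is hypothetical.

```python
def tree_sum(items):
    """Sum a list by pairwise tree reduction; returns (total, number_of_steps)."""
    buf = list(items)
    n = len(buf)
    stride, steps = 1, 0
    while stride < n:
        # On a GPU each iteration of this inner loop would run on its own thread,
        # with a barrier between one step and the next.
        for idx in range(0, n, 2 * stride):
            if idx + stride < n:
                buf[idx] += buf[idx + stride]
        stride *= 2
        steps += 1
    return (buf[0] if buf else 0), steps


total, steps = tree_sum(range(16))
```

For 16 items the loop runs four times, matching step.1 to step.4 on the slide.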
15. Sample program for logistic regression analysis
$ git clone https://github.com/heterodb/toybox.git
$ cd toybox/logistic_regression/
$ make && make install
$ psql postgres
postgres=# create extension logregr;
CREATE EXTENSION
To get the sample code, open "heterodb/toybox" on GitHub, then go to "logistic_regression".
You can install it with CREATE EXTENSION if PG-Strom is set up correctly.
https://github.com/heterodb/toybox/ ➔ logistic_regression
16. Trying it out (1/4): creating artificial test data
postgres=# CREATE TABLE logreg (
t bool,
x1 float,
x2 float,
x3 float,
x4 float );
CREATE TABLE
-- ↓ Training data in which every row with 1 + 2*x1 - 3*x2 + x3 + 0.5*x4 > 0 is classified as true;
-- insert 40 million rows
postgres=# INSERT INTO logreg
(SELECT (1.0+2.0*x1-3.0*x2+x3+0.5*x4) > 0 t, x1, x2, x3, x4
FROM (SELECT random() x1,
random() x2,
random() x3,
random() x4
FROM generate_series(1,40000000)) x);
INSERT 0 40000000
OK, let's run the PL/CUDA function. First of all, create a normal table with 40M rows of random data.
All the rows that satisfy 1 + 2*x1 - 3*x2 + x3 + 0.5*x4 > 0 are marked as 'true'.
17. Trying it out (2/4): loading data into GPU device memory (1)
postgres=# CREATE FOREIGN TABLE ft (
t bool,
x1 real,
x2 real,
x3 real,
x4 real
) SERVER gstore_fdw
OPTIONS (pinning '0');
CREATE FOREIGN TABLE
postgres=# INSERT INTO ft
(SELECT * FROM logreg);
INSERT 0 40000000
Gstore_Fdw is an FDW extension that stands in for GPU device memory; the GPU device is selected by the 'pinning' option.
INSERT INTO the Gstore_Fdw table loads the 40M rows of the 'logreg' table.
[Figure: Foreign Table (gstore_fdw) → GPU device memory]
✓ Data format conversion
✓ Data compression
✓ Transaction control
18. Trying it out (3/4)
[kaigai@saba src]$ nvidia-smi
Thu Dec 6 12:10:56 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72 Driver Version: 410.72 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P40 Off | 00000000:02:00.0 Off | N/A |
| N/A 42C P0 52W / 250W | 817MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 27650 C ...bgworker: PG-Strom GPU memory keeper 807MiB |
+-----------------------------------------------------------------------------+
807MB of GPU device memory is reserved. The dataset consumes 680MB, in addition to about 120MB for device management.
Device management: about 120MB +
(sizeof(bool) + 4*sizeof(float)) * 40M = 680MB
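The 680MB figure follows directly from the row layout. The arithmetic below assumes a 1-byte bool and 4-byte float ("real") columns, as the slide's formula does; the variable names are only for illustration.

```python
ROWS = 40_000_000
SIZEOF_BOOL = 1     # bytes, assumed for the "t" column
SIZEOF_FLOAT = 4    # bytes, for each of the "real" columns x1..x4

bytes_per_row = SIZEOF_BOOL + 4 * SIZEOF_FLOAT
dataset_bytes = bytes_per_row * ROWS
dataset_mb = dataset_bytes / 1_000_000      # decimal megabytes, as on the slide
dataset_mib = dataset_bytes / (1 << 20)     # binary mebibytes, the unit nvidia-smi prints
```

Note that nvidia-smi reports MiB, so its 807MiB is not in the same unit as the decimal 680MB and the two numbers are only roughly comparable.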
20. Comparing with a CPU implementation (1/3)
Using MADLib's logregr_train() function
postgres=# SELECT madlib.logregr_train('logreg', 'hoge',
                                       't', 'ARRAY[1,x1,x2,x3,x4]',
                                       NULL, 20);
logregr_train
---------------
(1 row)
Time: 1301307.361 ms (21:41.307)
postgres=# SELECT coef FROM hoge;
coef
------------------------------------------------------
{3041.82722783601,6083.57794939209,-9125.44857123801,3041.73992459095,1520.98287953044}
(1 row)
For the same job, MADLib's logregr_train() took 21 min 41 sec. The PL/CUDA implementation was 356 times faster than the CPU-based implementation.
1301307.36 ms / 3647.06 ms = 356.8 times longer
21. Comparing with a CPU implementation (2/3): verification
[Figure: one line shows the "slopes" of the explanatory variables used when the test data was created; the other shows the slope given by the parameters that logregr_train() estimated.]
          w0       w1       w2        w3       w4
PL/CUDA   3376.4   6752.71  -10129.1  3376.3   1688.27
MADLib    3041.83  6083.58  -9125.45  3041.74  1520.98
The result of logregr_train() differs from the weights we used when generating the dataset artificially, because it returns the slope and intercept of a normal vector to the division surface, and such a vector is determined only up to scale.
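Multiplying every weight by the same positive constant leaves the surface $w \cdot \varphi = 0$ unchanged, so the two results can be compared with the generating weights after dividing by $w_0$. The check below uses only the numbers on the slide; the helper names are hypothetical.

```python
TRUE_W = [1.0, 2.0, -3.0, 1.0, 0.5]     # weights used to build the test data
PLCUDA = [3376.4, 6752.71, -10129.1, 3376.3, 1688.27]
MADLIB = [3041.83, 6083.58, -9125.45, 3041.74, 1520.98]


def normalize(w):
    """Scale a coefficient vector so that the intercept w0 becomes 1."""
    return [wi / w[0] for wi in w]


def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


plcuda_err = max_abs_diff(normalize(PLCUDA), TRUE_W)
madlib_err = max_abs_diff(normalize(MADLIB), TRUE_W)
```

After normalization both coefficient vectors agree with (1, 2, -3, 1, 0.5) to within about 1e-4, so the two implementations found essentially the same division surface.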
22. Comparing with a CPU implementation (3/3): verification
Caution: running inference on the training set is normally a no-no!
postgres=# SELECT COUNT(*)
FROM (SELECT t, logregr_predict(ARRAY[ 3376.4, 6752.71,
-10129.1, 3376.3,
1688.27]::float[],
ARRAY[x1,x2,x3,x4]) p
FROM logreg) data
WHERE t != p;
count
-------
90
(1 row)
postgres=# SELECT COUNT(*)
FROM (SELECT t, logregr_predict(hoge.coef,
ARRAY[x1,x2,x3,x4]) p
FROM logreg, hoge) data
WHERE t != p;
count
-------
70
(1 row)
Prediction with our PL/CUDA function got 90 of the 40M rows wrong, and MADLib got 70 of the 40M rows wrong.
Note that in real data analytics we usually do not run prediction on the training set.
The queries count the rows whose estimate is "wrong".