SlideShare ist ein Scribd-Unternehmen logo
1 von 49
Downloaden Sie, um offline zu lesen
PL/CUDA
~Fusion of HPC Grade Power with In-Database Analytics~
The PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei <kaigai@ak.jp.nec.com>
The PG-Strom Project
about myself...
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics2
▌KaiGai Kohei
 tw: @kkaigai
 https://github.com/kaigai
▌PostgreSQL
 SELinux, FDW, CustomScan, ...
▌PG-Strom
 GPU acceleration for PostgreSQL
▌Works
 NEC OSS Promotion Center
 Development of the software and
its business opportunity
The PG-Strom Project
PG-Strom Overview (1/2) – Architecture
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics3
Application
Storage
Query
Optimizer
Query
Executor
PG-Strom
Extension
SQL Parser
Storage Manager
GPU
No Storage
Changes
No Query
Changes
 Features
• automatic GPU code generation
from the supplied SQL
• asynchronous & massive
parallel execution on GPUs
• WHERE-clause, JOIN, GROUP
BY, and projection are
supported
 Advantages
• transparent acceleration by the
power of thousands cores
• Low cost solution for analytic
processing
The PG-Strom Project
Characteristics of GPU (Graphic Processor Unit)
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics4
GPU CPU
Model
NVIDIA
Tesla P100
Intel Xeon
E5-2699v4
Architecture Pascal Broadwell
Launch Q2-2016 Q1-2016
# of transistors 15billion 7.2billion
# of cores
3584
(simple)
22
(functional)
core clock
1.126GHz
~1.303GHz
2.20GHz
~3.60GHz
Perk FFLOPS
(FP32)
9.3 TFLOPS
1.2 TFLOPS
(with AVX2)
DRAM Size 16GB (HBM2) max 1.5TB (DDR4)
Memory Band 732GB/s 76.8GB/s
Power
Consumption
250W 145W
The PG-Strom Project
PG-Strom Overview (2/2) – GPU binary generation on the fly
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics
QUERY: SELECT cat, count(*), avg(x) FROM t0
WHERE x between y and y + 20.0 GROUP BY cat;
:
STATIC_FUNCTION(bool)
gpupreagg_qual_eval(kern_context *kcxt,
kern_data_store *kds,
size_t kds_index)
{
pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1);
pg_float8_t KVAR_3 = pg_float8_vref(kds,kcxt,2,kds_index);
pg_float8_t KVAR_4 = pg_float8_vref(kds,kcxt,3,kds_index);
return EVAL((pgfn_float8ge(kcxt, KVAR_3, KVAR_4) &&
pgfn_float8le(kcxt, KVAR_3,
pgfn_float8pl(kcxt, KVAR_4, KPARAM_1))));
} :
E.g) Transform of arithmetic operations
in the WHERE-clause to CUDA programs
Reference to input data
SQL expression in CUDA source code
Run-time
Compiler
(nvrtc)
Just-in-time
Compile
Parallel
Execution
5
The PG-Strom Project
GPU accelerates SQL performance
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics6
▌Test Query:
SELECT cat, count(*), avg(x)
FROM t0 NATURAL JOIN t1 [NATURAL JOIN t2 ...]
GROUP BY cat;
 t0 contains 100M rows, t1...t8 contains 100K rows (like a start schema)
40.44
62.79
79.82
100.50
119.51
144.55
201.12
248.95
9.96 9.93 9.96 9.99 10.02 10.03 9.98 10.00
0
50
100
150
200
250
300
2 3 4 5 6 7 8 9
QueryResponseTime[sec]
Number of joined tables
PG-Strom microbenchmark with JOIN/GROUP BY
PostgreSQL v9.5 PG-Strom v1.0
CPU: Xeon E5-2670v3
GPU: GTX1080
RAM: 384GB
OS: CentOS 7.2
DB: PostgreSQL 9.5 +
PG-Strom v1.0
The PG-Strom Project
Feedbacks from users during v1.0 development
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics7
Application
Storage
Query
Optimizer
Query
Executor
PG-Strom
Extension
SQL Parser
Storage Manager
GPU
Heavy computing
intensive workloads
• In-database Analytics
• Scientific R&D, Marketing, ...
 by PL/CUDA + Matrix-Array
Heavy I/O
intensive workloads
• Generic large OLAP
• ETL, Reporting, ...
 by SSD-to-GPU P2P DMA
The PG-Strom ProjectPGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics8
Introduction of PL/CUDA
The PG-Strom Project
Our Failure (1/3) – Write an algorithm logic in SQL
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics9
Apr-2016
The PG-Strom Project
Our Failure (2/3) – Performance benefit (?)
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics10
Apr-2016
The PG-Strom Project
Our Failure (3/3) – Problems
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics11
▌Problem.1 – Who writes algorithms in SQL?
 Majority of mathematical algorithms are developed based on
the manner of procedural programming language.
 Although users don’t need to write algorithm logics in CUDA,
they also have to write up the algorithm logic using SQL puzzle.
▌Problem.2 – Performance Benefit
 Yeah, PG-Strom is much faster than PostgreSQL to execute the core
logic of min-max method. It is an excellent result but nobody knows
how many people run the algorithm on PostgreSQL.
 Performance of GpuProjection is almost equivalent to the CPU version
of implementation which is designed to a particular problem. Why?
 Inefficient code due to SQL compatibility
 Inefficient data format due to PostgreSQL’s row-format
The PG-Strom Project
Our Answer – PL/CUDA + Matrix-like Array
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics12
CREATE FUNCTION my_logic(matrix, matrix)
RETURNS vector
AS $$
$$ LANGUAGE ‘plcuda’;
User define CUDA code block
Storage
GPU Kernel
User defined
CUDA code block
Post SQL Process
 Tables JOIN
 Window
Function
 ORDER BY
 GROUP BY
 etc....
Load function’s
arguments
Write-back
result set
PL/CUDA
method for manual optimization
ArrayType header a1 a2 aN… b1 b2 bN… c1 c2 cN… d1 d2 dN…
𝑎1 ⋯ 𝑑1
⋮ ⋱ ⋮
𝑎 𝑁 ⋯ 𝑑 𝑁
Matrix of
4cols x Nrows
Matrix-like Array
2D Array without NULL, to represent a matrix
The PG-Strom Project
Example of PL/CUDA function definition
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics13
CREATE OR REPLACE FUNCTION
knn_gpu_similarity(int, int[], int[])
RETURNS float4[]
AS $$
#plcuda_begin
cl_int k = arg1.value;
MatrixType *Q = (MatrixType *) arg2.value;
MatrixType *D = (MatrixType *) arg3.value;
MatrixType *R = (MatrixType *) results;
:
nloops = (ARRAY_MATRIX_HEIGHT(Q) + (part_sz - k - 1)) / (part_sz - k);
for (loop=0; loop < nloops; loop++) {
/* 1. calculation of the similarity */
for (i = get_local_id(); i < part_sz * part_nums; i += get_local_size()) {
j = i % part_sz; /* index within partition */
/* index of database matrix (D) */
dindex = part_nums * get_global_index() + (i / part_sz);
/* index of query matrix (Q) */
qindex = loop * (part_sz - k) + (j - k);
values[i] = knn_similarity_compute(D, dindex, Q, qindex);
}
}
:
#plcuda_end
$$ LANGUAGE 'plcuda';
CUDA code block
The PG-Strom Project
How GPU Kernels are built
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics14
CREATE OR REPLACE FUNCTION
my_cuda_func(float[])
RETURNS int[]
$$
#plcuda_sanity_check func_sc
#plcuda_begin
#plcuda_end
#plcuda_working_bufsz func_wb
$$ LANGUAGE ‘plcuda’;
User define CUDA code block
bool func_sc(float[])
helper function for sanity check
bigint func_wb(float[])
helper function for buffer size estimation
GPU
Binary
GPU
Kernel
Working
Buffer
Input
arguments
Run-time
compiler
input/output
of SQL data
User define
CUDA code block
Common Library
Routines
source program
of GPU code
Load the
arguments
The PG-Strom Project
Why automatic generated code was not best for performance
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics15
 NULL checks for each variable references
 Overflow checks for each primitive operators
 Function call instead of primitive operators
STATIC_FUNCTION(pg_float4_t)
pgfn_float4mul(kern_context *kcxt, pg_float4_t arg1, pg_float4_t arg2)
{
pg_float4_t result;
result.isnull = arg1.isnull | arg2.isnull;
if (!result.isnull)
{
result.value = arg1.value * arg2.value;
CHECKFLOATVAL(&kcxt->e, result,
isinf(arg1.value) || isinf(arg2.value),
arg1.value == 0.0 || arg2.value == 0.0);
}
return result;
}
select x*y from c_test;
The PG-Strom Project
Disadvantage because of data format (1/2)
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics16
▌Row-oriented data
× includes unreferenced values
× many steps for data references
〇 common data format with PostgreSQL
▌Column-oriented data
〇 load only referenced variables
〇 data reference by 1 step
× needs data format exchange
GPU
core
GPU
core
GPU
core
a c d feb
a c d feb
a d feb
a c d feb
GPU
core
eb
GPU
core
GPU
core
GPU
core
GPU
core
b
b
b
b
b
GPU
core
GPU
core
e
e
e
e
e
Usual SQL load cannot justify the cost for data transformation,
Advanced algorithm processing is improved by columnar format.
The PG-Strom Project
Disadvantage because of data format (2/2)
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics17
▌Case of random memory access
 Increase of memory transaction, but less usage rate of the data-bus
▌Case of coalesced memory access
 Least number of memory transaction, and maximum usage rate of the data-bus
32bit
Memory transaction width: 256bit
32bit 32bit32bit 32bit 32bit
32bit 32bit 32bit 32bit 32bit 32bit 32bit 32bit
Memory transaction width: 256bit
32bit x 8 = 256bit is valid data
in 256bit width memory transaction
(Bus usage ratio: 100.0%)
Only 32bit x 1 = 32bit is valid data
in 256bit width memory transaction
(Bus usage ratio: 12.5%)
GPU cores
GPU cores
The PG-Strom Project
2D-Array as Matrix
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics18
▌datatype[] array_matrix(variadic datatype[])
 An aggregate function to construct a 2D-array from the input stream.
 datatype is either of int2, int4, int8, float4 or float8.
 The 2D-array does not contain NULL.
▌SETOF record matrix_unnest(datatype[])
 Deform a 2D-array into stream of multiple records
▌Downside
 Unable to handle variable-length data type
 1GB limit of varlena data type in PostgreSQL
ArrayType header a1 a2 aN… b1 b2 bN… c1 c2 cN… d1 d2 dN…
𝑎1 ⋯ 𝑑1
⋮ ⋱ ⋮
𝑎 𝑁 ⋯ 𝑑 𝑁
Matrix of
4cols x Nrows
Matrix-like Array
2D Array without NULL, to represent a matrix
The PG-Strom Project
Example to call PL/CUDA function
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics19
SELECT row_number() OVER (),
float4_as_int4(R.key_id) key_id,
R.score
FROM matrix_unnest(
(SELECT my_plcuda_function(A.matrix,
B.matrix)
FROM (SELECT cbind(array_matrix(id),
array_matrix(x, y, z)) matrix
FROM normal_table
WHERE tag LIKE ‘%abc%’) A,
(SELECT matrix
FROM matrix_table) B
)
) AS R(key_id real, score real)
ORDER BY score DESC
LIMIT 1000;
Invocation of PL/CUDA
function with two
Matrix arguments
Construct, or load
pre-built array-matrix
Deform array-matrix
to generic records
Post-process by SQL
(JOIN, window-function)
The PG-Strom ProjectPGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics20
Case Study
similarity search on drug discovery
The PG-Strom Project
Background – relationship of disease and chemical compounds
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics21
target disease
relevant protein
chemical compounds
(= candidate of drugs)
Discovery of chemical compounds which are “active” to the target protein
inactive active
active
but toxicity
academic papers
The PG-Strom Project
k-NN Similarity Search on Chemical Compounds (1/2)
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics22
Database chemical compounds set
(D; 10M records scale)
Query chemical
compounds set
(Q; ~1000 records scale)
Search by
Similarity
Target Protein
“similar compounds” will
have higher probability of active
Picks up active
chemical compounds
to the target protein
from academic papers
The PG-Strom Project
k-NN Similarity Search on Chemical Compounds (2/2)
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics23
Similarity
is
definition of distance
ID NAME Fingerprint (1024bit)
1 CHEMBL153534 000000000001000000100000000000000100000000000001000000...
2 CHEMBL405398 000000000000000100100000000000000000000000000000100000...
3 CHEMBL503634 000001000000000000000000001000000100000000000000000000...
: : :
Data structure of the chemical compounds
Similarity by Jaccard index:
𝑆𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 𝐴, 𝐵 = 𝐴 𝐹𝑃 ∩ 𝐵 𝐹𝑃 𝐴 𝐹𝑃 ∪ 𝐵 𝐹𝑃
The PG-Strom Project
Scale of the computing
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics24
Database chemical compounds set
(D; 10M records scale)
Q: Query chemical
compounds set
average of
the top-3
𝑑𝑖 of D-compounds
Distance of Q-set
and 𝑑𝑖 compounds
𝑑𝑗 of D-compounds
average of
the top-3
Distance of Q-set
and 𝑑𝑗 compounds
Order of calculation:
𝑂 𝑄 × 𝐷 + 𝑂 𝐷 × 𝑄𝑙𝑜𝑔𝑄
(make distance) (sorting+average)
The PG-Strom Project
Implementation of PL/CUDA function (1/3)
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics25
Step-1
Split all the logical combination of Q and D
into multiple partitions, then assign them
on SMM; execution unit of GPU.
Step-2
Each GPU core calculates a similarity score
between a Q-compound and a D-compound,
then store the score on “shared memory”;
which is fast and close to SMM.
The PG-Strom Project
Implementation of PL/CUDA function (2/3)
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics26
Step-3
Bitonic-sorting by the similarity score, then
reorder the Q-compounds Step-5
Make an average by the
top-k values, then store it
on the result buffer.
Step-4
Repeat from Step-2, if # of Q-compounds is larger
than shared memory size. Top-k items are kept.
The PG-Strom Project
Implementation of PL/CUDA function (3/3)
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics27
CREATE OR REPLACE FUNCTION
knn_gpu_similarity(int, -- k-value
int[], -- ID+bitmap of Q
int[]) -- ID+bitmap of D
RETURNS float4[] -- result: ID+similarity
AS $$
#plcuda_decl
:
#plcuda_begin
#plcuda_kernel_blocksz ¥
knn_gpu_similarity_main_block_size
#plcuda_num_threads ¥
knn_gpu_similarity_main_num_threads
#plcuda_shmem_blocksz 8192
cl_int k = arg1.value;
MatrixType *Q = (MatrixType *) arg2.value;
MatrixType *D = (MatrixType *) arg3.value;
MatrixType *R = (MatrixType *) results;
:
for (loop=0; loop < nloops; loop++)
{
/* 1. calculation of the similarity */
for (i = get_local_id();
i < part_sz * part_nums;
i += get_local_size()) {
j = i % part_sz; /* index within partition */
dindex = part_nums * get_global_index()
+ (i / part_sz);
qindex = loop * (part_sz - k) + (j - k);
if (dindex < ARRAY_MATRIX_HEIGHT(D) &&
qindex < ARRAY_MATRIX_HEIGHT(Q)) {
values[i] = knn_similarity_compute(D, dindex,
Q, qindex);
}
}
__syncthreads();
/* 2. sorting by the similarity for each partition */
knn_similarity_sorting(values, part_sz, part_nums);
__syncthreads();
:
}
#plcuda_end
#plcuda_sanity_check knn_gpu_similarity_sanity_check
#plcuda_working_bufsz 0
#plcuda_results_bufsz knn_gpu_similarity_results_bufsz
$$ LANGUAGE 'plcuda';
real[] -- ID+Similarity of D-compounds (2xN)
knn_gpu_similarity(int k, -- k-value
int[] Q, -- ID+Fingerprint of Q-compounds (33xM)
int[] D); -- ID+Fingerprint of D-compounds (33xN)
The PG-Strom Project
Invocation of PL/CUDA function
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics28
PREPARE knn_sim_rand_10m_gpu_v2(int) -- arg1:@k-value
AS
SELECT row_number() OVER (),
fp.name,
similarity
FROM (SELECT float4_as_int4(key_id) key_id, similarity
FROM matrix_unnest(
(SELECT rbind( knn_gpu_similarity($1,Q.matrix,
D.matrix))
FROM (SELECT cbind(array_matrix(id),
array_matrix(bitmap)) matrix
FROM finger_print_query) Q,
(SELECT matrix
FROM finger_print_10m_matrix) D
)
) AS sim(key_id real, similarity real)
ORDER BY similarity DESC) sim,
finger_print_10m fp
WHERE fp.id = sim.key_id
LIMIT 1000;
Post-process by SQL; like lookup of compounds
name by compounds-id (tables JOIN), making
rank of similarity by window function.
Execution of PL/CUDA function
with Q-/D-matrix as argument
Transform the records read from tables
to Array-Matrix type.
(Pre-build is also possible)
Transform the Array-Matrix (3xN),
return value of PL/CUDA function,
into usual record data x Nrows.
The PG-Strom Project
Performance results
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics29
 For comparison to CPU cases, we implemented an equivalent SQL function by C.
 # of D-compounds is 10M records, # of Q-compounds is 10, 50, 100, 500 and 1000.
 Up to 10B combination search; almost equivalent size for real drug discovery research.
 HW) CPU: Xeon E5-2670v3, GPU: GTX980 / GTX1080, RAM:384GB
 SW) CentOS7, CUDA8.0, PostgreSQL v9.5 + PG-Strom v1.0
30.25
145.29
295.95
1503.31
3034.94
12.97 13.46 13.90 18.16 24.6513.00 13.23 13.59 16.01 19.13
0
500
1000
1500
2000
2500
3000
3500
10 50 100 500 1000
QueryResponseTime[sec]
Number of Query Compounds [Q]
Similarity search of chemical compounds by k-NN method (k=3, D=10M)
CPU(E5-2670v3) GTX980 GTX1080
x150 times
faster!
The PG-Strom ProjectPGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics30
Another Usage
k-means clustering in database system
The PG-Strom Project
Clustering Analysis
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics31
The PG-Strom Project
k-means clustering algorithm
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics32
1. Assign cluster randomly 2. Make centroid for each
cluster
3. Chose the nearest
cluster from the centroid
The PG-Strom Project
k-means clustering algorithm
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics33
1. Assign cluster randomly
5. Re-calculation of the
centroid based on the
new cluster assignment.
6. Finish the loop if it gets
converged.
4. Repeat until convergence
The PG-Strom Project
k-means clustering on PL/CUDA
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics34
CREATE OR REPLACE FUNCTION
gpu_kmeans(real[], -- ID + Data Matrix
int, -- k-value (number of clusters)
int = 10, -- max number of iteration
int = 1) -- seed of initial randomness
RETURNS int[]
AS $$
#plcuda_decl
:
KERNEL_FUNCTION_MAXTHREADS(void)
update_centroid(MatrixType *D,
MatrixType *R,
MatrixType *C)
{
:
/* accumulate the local centroid */
for (did = get_global_id();
did < nitems;
did += get_global_size())
{
/* pick up the target cluster */
cid = r_values[nitems + did];
atomicAdd(&l_cent[cid], 1.0);
for (index=1; index < width; index++)
atomicAdd(&l_cent[index * k_value + cid],
d_values[index * nitems + did]);
}
__syncthreads();
/* write back to the global C-matrix */
for (index = get_local_id();
index < width * k_value;
index += get_local_size())
atomicAdd(&c_values[index], l_cent[index]);
}
:
#plcuda_begin
:
status = pgstromLaunchDynamicKernel4((void *)
setup_initial_cluster,
(kern_arg_t)(D),
(kern_arg_t)(R),
(kern_arg_t)(C),
(kern_arg_t)(r_seed),
nitems, 0, 0);
if (status != cudaSuccess)
PLCUDA_RUNTIME_ERROR_RETURN(status);
for (loop=0; loop < nloops; loop++)
{
:
status = pgstromLaunchDynamicKernelMaxThreads3(
(void *)kmeans_update_cluster,
(kern_arg_t)(D),
(kern_arg_t)(R),
(kern_arg_t)(C),
(kern_arg_t)k_value,
nitems, 0,
sizeof(cl_int) + sizeof(cl_float));
if (status != cudaSuccess)
PLCUDA_RUNTIME_ERROR_RETURN(status);
:
}
#plcuda_sanity_check gpu_kmeans_sanity_check
#plcuda_working_bufsz gpu_kmeans_working_bufsz
#plcuda_results_bufsz gpu_kmeans_results_bufsz
#plcuda_end
$$ LANGUAGE 'plcuda';
The PG-Strom Project
Test data for k-means clustering
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics35
▌Dataset overview
 A collection of datasets of vehicle traffic,
observed between two points for a set duration
of time over a period of 6 months
 449 observation points, at Aarhus, Denmark.
 13.5M records from Feb to June of 2014
▌Data contains
 average speed
 average measured time
 number of vehicles
 latitude/longitude of the observation points
 etc...
▌What we did
 categorize the zone of road into 5-classes
according to the characteristics of vehicle’s
running style.
The PG-Strom Project
Invocation of GPU k-means function
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics36
SELECT report_id, k, c
FROM (SELECT report_id, k, c,
row_number() OVER (PARTITION BY report_id
ORDER BY c DESC) rank
FROM (SELECT report_id, k, count(*) c
FROM matrix_unnest(
(SELECT gpu_kmeans ( array_matrix(
int4_as_float4(report_id),
avg_measured_time,
avg_speed,
vehicle_count),
5)
FROM tr_rawdata)
) R(report_id int, k int)
GROUP BY report_id, k
) __summary_1
) __summary_2
WHERE rank = 1;
Make a matrix from the raw-data
Run k-means clustering logic
Pick-up most frequent cluster
The PG-Strom Project
GPU k-means (1/3) – clustering by all the data
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics37
$ wget -O map.png "`psql traffic -At -f ~/traffic.sql`"
bypass highway?
Road towards
downtown?
Beltway?
The PG-Strom Project
GPU k-means (2/3) – Daytime and Nighttime
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics38
daytime (8-17) nighttime (18-7)
The PG-Strom Project
GPU k-means (3/3) – Weekdays and Weekend
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics39
weekdays weekend
The PG-Strom Project
Invocation of GPU k-means function
PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics40
SELECT report_id, k, c
FROM (SELECT report_id, k, c,
row_number() OVER (PARTITION BY report_id
ORDER BY c DESC) rank
FROM (SELECT report_id, k, count(*) c
FROM matrix_unnest(
(SELECT gpu_kmeans ( array_matrix(
int4_as_float4(report_id),
avg_measured_time,
avg_speed,
vehicle_count),
5)
FROM tr_rawdata
WHERE extract('hour' from timestamp)
between 7 and 17
)
) R(report_id int, k int)
GROUP BY report_id, k
) __summary_1
) __summary_2
WHERE rank = 1;
Just add a line to select
different input data set.
Flexibility of SQL!
The PG-Strom Project
Performance (1/3)
PGconf.ASIA - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics41
We used MADLib’s kmeans_random()
as a usual implementation of k-means
based on CPU
The PG-Strom Project
Performance (2/3) – Invocation of MADLib’s k-means clustering
PGconf.ASIA - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics42
SELECT report_id, k, c
FROM (SELECT report_id, k, c,
row_number() OVER (PARTITION BY report_id
ORDER BY c DESC) rank
FROM (SELECT t.report_id,
(madlib.closest_column(centroids,
t.attrs)).column_id as k,
count(*) c
FROM tr_rawdata_madlib_s t,
(SELECT centroids
FROM madlib.kmeans_random('tr_rawdata_madlib',
'attrs',
5)
) km;
GROUP BY t.report_id, k
) __summary_1
) __summary_2
WHERE rank = 1;
Calculation of
the cluster centroid
Choice of the closest
cluster for each item
The PG-Strom Project
Performance (3/3) – k-means in PL/CUDA vs MADLib
PGconf.ASIA - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics43
 Environment of the benchmark
 HW) CPU: Xeon E5-2670v3, GPU: GTX1080, RAM: 384GB
 SW) CentOS7, CUDA8.0, PostgreSQL v9.5 + PG-Strom v1.0, MADLib 1.9
1.41 12.44
126.59
1668.49
0.21 0.29 0.94 8.41
0
200
400
600
800
1000
1200
1400
1600
1800
10,000 100,000 1,000,000 13,577,132
QueryResponseTime[sec]
(※Lowerisbetter)
Number of Items that were clustered based on the k-means algorithm
Performance comparison of in-database k-means clustering
MADLib PL/CUDA
x200 times
faster
The PG-Strom ProjectPGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics44
Summary
The PG-Strom Project
Summary (1/3)
PGconf.ASIA - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics45
▌What is PL/CUDA?
 PG-Strom’s original concept is automatic optimization and automatic
code generation.
 PL/CUDA is a way to pull out the maximum performance of GPU
instead of manual optimization.
Likely right decision, because nobody likely writes up advanced
algorithms in SQL.
▌Advantages
 TFLOPS grade computing engine for in-database analytics
 No need to export entire dataset from the database unlike a case when
we use external applications.
 Only “result” is all we need to export.
 Utilization of SQL’s flexibility for pre-/post-process of the advanced
analytics algorithms.
 JOIN, GROUP BY, Window function, etc...
The PG-Strom Project
Summary (2/3) – Target Area
PGconf.ASIA - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics46
▌Case Study 1 – Similarity Search on Drug Discovery
 Characteristics of chemical compounds are represented as fingerprint
(= feature vector). Similarity scores are calculated by GPU, in parallel,
then picks up top-k similarity.
▌Case Study 2 – Unsupervised Learning using Sensor Data
 Distance calculation of sensor generated dataset by GPU.
Automatic categorization to a few classes by the characteristics of roads;
automatically detected.
▌Other expected domain for applications
 Search of compounds ... medical drug, chemical, materials, ...
 Recommendation engine ... e-commerce
 Anomaly detection ... security
 Data mining ... marketing
 ...etc...
The PG-Strom Project
Summary (3/3) – Challenges and future
PGconf.ASIA - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics47
▌Technology
 Array-matrix larger than 1GB
 Because of the restriction of variable length fields in PostgreSQL, SQL-function
cannot have arguments larger than 1GB.
 Cost to use same array-matrix multiple times
 Idea: it may make sense to pin a static array-matrix on GPU side.
▌Operation
 Relatively small number of engineers are familiar both of CUDA and SQL.
 Package of well-known algorithms
 Support service by professional engineers
Let’s have joint evaluation with you!
The PG-Strom Project
Resources
PGconf.ASIA - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics48
▌Repository
https://github.com/pg-strom/devel
▌Today’s Slides
http://www.slideshare.net/kaigai/pgconfasia2016-plcuda-en
▌Contact
 e-mail: kaigai@ak.jp.nec.com
 Tw: @kkaigai
Any developers are welcome!
Question?

Weitere ähnliche Inhalte

Was ist angesagt?

PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated Asyncr
Kohei KaiGai
 

Was ist angesagt? (20)

GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated Asyncr
 
20171206 PGconf.ASIA LT gstore_fdw
20171206 PGconf.ASIA LT gstore_fdw20171206 PGconf.ASIA LT gstore_fdw
20171206 PGconf.ASIA LT gstore_fdw
 
20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English
 
Let's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdwLet's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdw
 
20201006_PGconf_Online_Large_Data_Processing
20201006_PGconf_Online_Large_Data_Processing20201006_PGconf_Online_Large_Data_Processing
20201006_PGconf_Online_Large_Data_Processing
 
PG-Strom v2.0 Technical Brief (17-Apr-2018)
PG-Strom v2.0 Technical Brief (17-Apr-2018)PG-Strom v2.0 Technical Brief (17-Apr-2018)
PG-Strom v2.0 Technical Brief (17-Apr-2018)
 
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
20210301_PGconf_Online_GPU_PostGIS_GiST_Index20210301_PGconf_Online_GPU_PostGIS_GiST_Index
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
 
PostgreSQL with OpenCL
PostgreSQL with OpenCLPostgreSQL with OpenCL
PostgreSQL with OpenCL
 
20181016_pgconfeu_ssd2gpu_multi
20181016_pgconfeu_ssd2gpu_multi20181016_pgconfeu_ssd2gpu_multi
20181016_pgconfeu_ssd2gpu_multi
 
20201128_OSC_Fukuoka_Online_GPUPostGIS
20201128_OSC_Fukuoka_Online_GPUPostGIS20201128_OSC_Fukuoka_Online_GPUPostGIS
20201128_OSC_Fukuoka_Online_GPUPostGIS
 
20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdw20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdw
 
PG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU devicePG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU device
 
20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDA
 
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
 
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
 
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
 
PGConf.ASIA 2019 Bali - Performance Analysis at Full Power - Julien Rouhaud
PGConf.ASIA 2019 Bali - Performance Analysis at Full Power - Julien RouhaudPGConf.ASIA 2019 Bali - Performance Analysis at Full Power - Julien Rouhaud
PGConf.ASIA 2019 Bali - Performance Analysis at Full Power - Julien Rouhaud
 

Andere mochten auch

Andere mochten auch (18)

20170127 JAWS HPC-UG#8
20170127 JAWS HPC-UG#820170127 JAWS HPC-UG#8
20170127 JAWS HPC-UG#8
 
SQL+GPU+SSD=∞ (Japanese)
SQL+GPU+SSD=∞ (Japanese)SQL+GPU+SSD=∞ (Japanese)
SQL+GPU+SSD=∞ (Japanese)
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
 
20170310_InDatabaseAnalytics_#1
20170310_InDatabaseAnalytics_#120170310_InDatabaseAnalytics_#1
20170310_InDatabaseAnalytics_#1
 
An Intelligent Storage?
An Intelligent Storage?An Intelligent Storage?
An Intelligent Storage?
 
pgconfasia2016 lt ssd2gpu
pgconfasia2016 lt ssd2gpupgconfasia2016 lt ssd2gpu
pgconfasia2016 lt ssd2gpu
 
並列クエリを実行するPostgreSQLのアーキテクチャ
並列クエリを実行するPostgreSQLのアーキテクチャ並列クエリを実行するPostgreSQLのアーキテクチャ
並列クエリを実行するPostgreSQLのアーキテクチャ
 
UX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and ArchivesUX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and Archives
 
Designing Teams for Emerging Challenges
Designing Teams for Emerging ChallengesDesigning Teams for Emerging Challenges
Designing Teams for Emerging Challenges
 
Visual Design with Data
Visual Design with DataVisual Design with Data
Visual Design with Data
 
3 Things Every Sales Team Needs to Be Thinking About in 2017
3 Things Every Sales Team Needs to Be Thinking About in 20173 Things Every Sales Team Needs to Be Thinking About in 2017
3 Things Every Sales Team Needs to Be Thinking About in 2017
 
How to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheHow to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your Niche
 
PostgreSQLのパラレル化に向けた取り組み@第30回(仮名)PostgreSQL勉強会
PostgreSQLのパラレル化に向けた取り組み@第30回(仮名)PostgreSQL勉強会PostgreSQLのパラレル化に向けた取り組み@第30回(仮名)PostgreSQL勉強会
PostgreSQLのパラレル化に向けた取り組み@第30回(仮名)PostgreSQL勉強会
 
TPC-DSから学ぶPostgreSQLの弱点と今後の展望
TPC-DSから学ぶPostgreSQLの弱点と今後の展望TPC-DSから学ぶPostgreSQLの弱点と今後の展望
TPC-DSから学ぶPostgreSQLの弱点と今後の展望
 
PostgreSQL 9.6 新機能紹介
PostgreSQL 9.6 新機能紹介PostgreSQL 9.6 新機能紹介
PostgreSQL 9.6 新機能紹介
 
PL/CUDA - GPU Accelerated In-Database Analytics
PL/CUDA - GPU Accelerated In-Database AnalyticsPL/CUDA - GPU Accelerated In-Database Analytics
PL/CUDA - GPU Accelerated In-Database Analytics
 
PostgreSQLアンチパターン
PostgreSQLアンチパターンPostgreSQLアンチパターン
PostgreSQLアンチパターン
 
In-Database Analyticsの必要性と可能性
In-Database Analyticsの必要性と可能性In-Database Analyticsの必要性と可能性
In-Database Analyticsの必要性と可能性
 

Ähnlich wie pgconfasia2016 plcuda en

Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
APSys Presentation Final copy2
APSys Presentation Final copy2APSys Presentation Final copy2
APSys Presentation Final copy2
Junli Gu
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 

Ähnlich wie pgconfasia2016 plcuda en (20)

RAPIDS Overview
RAPIDS OverviewRAPIDS Overview
RAPIDS Overview
 
20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to Accelerators
 
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfS51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
 
Stress your DUT
Stress your DUTStress your DUT
Stress your DUT
 
PLNOG20 - Paweł Małachowski - Stress your DUT–wykorzystanie narzędzi open sou...
PLNOG20 - Paweł Małachowski - Stress your DUT–wykorzystanie narzędzi open sou...PLNOG20 - Paweł Małachowski - Stress your DUT–wykorzystanie narzędzi open sou...
PLNOG20 - Paweł Małachowski - Stress your DUT–wykorzystanie narzędzi open sou...
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUs
 
Application Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systemsApplication Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systems
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
 
APSys Presentation Final copy2
APSys Presentation Final copy2APSys Presentation Final copy2
APSys Presentation Final copy2
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
NVIDIA CUDA
NVIDIA CUDANVIDIA CUDA
NVIDIA CUDA
 
Accelerating Data Science With GPUs
Accelerating Data Science With GPUsAccelerating Data Science With GPUs
Accelerating Data Science With GPUs
 
0507036
05070360507036
0507036
 
BlazingSQL & Graphistry - Netflow Demo
BlazingSQL & Graphistry - Netflow DemoBlazingSQL & Graphistry - Netflow Demo
BlazingSQL & Graphistry - Netflow Demo
 
Speeding up Programs with OpenACC in GCC
Speeding up Programs with OpenACC in GCCSpeeding up Programs with OpenACC in GCC
Speeding up Programs with OpenACC in GCC
 

Mehr von Kohei KaiGai

Mehr von Kohei KaiGai (20)

20221116_DBTS_PGStrom_History
20221116_DBTS_PGStrom_History20221116_DBTS_PGStrom_History
20221116_DBTS_PGStrom_History
 
20221111_JPUG_CustomScan_API
20221111_JPUG_CustomScan_API20221111_JPUG_CustomScan_API
20221111_JPUG_CustomScan_API
 
20211112_jpugcon_gpu_and_arrow
20211112_jpugcon_gpu_and_arrow20211112_jpugcon_gpu_and_arrow
20211112_jpugcon_gpu_and_arrow
 
20210928_pgunconf_hll_count
20210928_pgunconf_hll_count20210928_pgunconf_hll_count
20210928_pgunconf_hll_count
 
20210731_OSC_Kyoto_PGStrom3.0
20210731_OSC_Kyoto_PGStrom3.020210731_OSC_Kyoto_PGStrom3.0
20210731_OSC_Kyoto_PGStrom3.0
 
20210511_PGStrom_GpuCache
20210511_PGStrom_GpuCache20210511_PGStrom_GpuCache
20210511_PGStrom_GpuCache
 
20201113_PGconf_Japan_GPU_PostGIS
20201113_PGconf_Japan_GPU_PostGIS20201113_PGconf_Japan_GPU_PostGIS
20201113_PGconf_Japan_GPU_PostGIS
 
20200828_OSCKyoto_Online
20200828_OSCKyoto_Online20200828_OSCKyoto_Online
20200828_OSCKyoto_Online
 
20200806_PGStrom_PostGIS_GstoreFdw
20200806_PGStrom_PostGIS_GstoreFdw20200806_PGStrom_PostGIS_GstoreFdw
20200806_PGStrom_PostGIS_GstoreFdw
 
20200424_Writable_Arrow_Fdw
20200424_Writable_Arrow_Fdw20200424_Writable_Arrow_Fdw
20200424_Writable_Arrow_Fdw
 
20191211_Apache_Arrow_Meetup_Tokyo
20191211_Apache_Arrow_Meetup_Tokyo20191211_Apache_Arrow_Meetup_Tokyo
20191211_Apache_Arrow_Meetup_Tokyo
 
20191115-PGconf.Japan
20191115-PGconf.Japan20191115-PGconf.Japan
20191115-PGconf.Japan
 
20190926_Try_RHEL8_NVMEoF_Beta
20190926_Try_RHEL8_NVMEoF_Beta20190926_Try_RHEL8_NVMEoF_Beta
20190926_Try_RHEL8_NVMEoF_Beta
 
20190925_DBTS_PGStrom
20190925_DBTS_PGStrom20190925_DBTS_PGStrom
20190925_DBTS_PGStrom
 
20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai
 
20190516_DLC10_PGStrom
20190516_DLC10_PGStrom20190516_DLC10_PGStrom
20190516_DLC10_PGStrom
 
20190418_PGStrom_on_ArrowFdw
20190418_PGStrom_on_ArrowFdw20190418_PGStrom_on_ArrowFdw
20190418_PGStrom_on_ArrowFdw
 
20190314 PGStrom Arrow_Fdw
20190314 PGStrom Arrow_Fdw20190314 PGStrom Arrow_Fdw
20190314 PGStrom Arrow_Fdw
 
20181212 - PGconf.ASIA - LT
20181212 - PGconf.ASIA - LT20181212 - PGconf.ASIA - LT
20181212 - PGconf.ASIA - LT
 
20181211 - PGconf.ASIA - NVMESSD&GPU for BigData
20181211 - PGconf.ASIA - NVMESSD&GPU for BigData20181211 - PGconf.ASIA - NVMESSD&GPU for BigData
20181211 - PGconf.ASIA - NVMESSD&GPU for BigData
 

Kürzlich hochgeladen

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Kürzlich hochgeladen (20)

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 

pgconfasia2016 plcuda en

  • 1. PL/CUDA ~Fusion of HPC Grade Power with In-Database Analytics~ The PG-Strom Project / NEC OSS Promotion Center KaiGai Kohei <kaigai@ak.jp.nec.com>
  • 2. The PG-Strom Project about myself... PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics2 ▌KaiGai Kohei  tw: @kkaigai  https://github.com/kaigai ▌PostgreSQL  SELinux, FDW, CustomScan, ... ▌PG-Strom  GPU acceleration for PostgreSQL ▌Works  NEC OSS Promotion Center  Development of the software and its business opportunity
  • 3. The PG-Strom Project PG-Strom Overview (1/2) – Architecture PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics3 Application Storage Query Optimizer Query Executor PG-Strom Extension SQL Parser Storage Manager GPU No Storage Changes No Query Changes  Features • automatic GPU code generation from the supplied SQL • asynchronous & massive parallel execution on GPUs • WHERE-clause, JOIN, GROUP BY, and projection are supported  Advantages • transparent acceleration by the power of thousands cores • Low cost solution for analytic processing
  • 4. The PG-Strom Project Characteristics of GPU (Graphic Processor Unit) PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics4 GPU CPU Model NVIDIA Tesla P100 Intel Xeon E5-2699v4 Architecture Pascal Broadwell Launch Q2-2016 Q1-2016 # of transistors 15billion 7.2billion # of cores 3584 (simple) 22 (functional) core clock 1.126GHz ~1.303GHz 2.20GHz ~3.60GHz Perk FFLOPS (FP32) 9.3 TFLOPS 1.2 TFLOPS (with AVX2) DRAM Size 16GB (HBM2) max 1.5TB (DDR4) Memory Band 732GB/s 76.8GB/s Power Consumption 250W 145W
  • 5. The PG-Strom Project PG-Strom Overview (2/2) – GPU binary generation on the fly PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics QUERY: SELECT cat, count(*), avg(x) FROM t0 WHERE x between y and y + 20.0 GROUP BY cat; : STATIC_FUNCTION(bool) gpupreagg_qual_eval(kern_context *kcxt, kern_data_store *kds, size_t kds_index) { pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1); pg_float8_t KVAR_3 = pg_float8_vref(kds,kcxt,2,kds_index); pg_float8_t KVAR_4 = pg_float8_vref(kds,kcxt,3,kds_index); return EVAL((pgfn_float8ge(kcxt, KVAR_3, KVAR_4) && pgfn_float8le(kcxt, KVAR_3, pgfn_float8pl(kcxt, KVAR_4, KPARAM_1)))); } : E.g) Transform of arithmetic operations in the WHERE-clause to CUDA programs Reference to input data SQL expression in CUDA source code Run-time Compiler (nvrtc) Just-in-time Compile Parallel Execution 5
  • 6. The PG-Strom Project GPU accelerates SQL performance PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics6 ▌Test Query: SELECT cat, count(*), avg(x) FROM t0 NATURAL JOIN t1 [NATURAL JOIN t2 ...] GROUP BY cat;  t0 contains 100M rows, t1...t8 contains 100K rows (like a start schema) 40.44 62.79 79.82 100.50 119.51 144.55 201.12 248.95 9.96 9.93 9.96 9.99 10.02 10.03 9.98 10.00 0 50 100 150 200 250 300 2 3 4 5 6 7 8 9 QueryResponseTime[sec] Number of joined tables PG-Strom microbenchmark with JOIN/GROUP BY PostgreSQL v9.5 PG-Strom v1.0 CPU: Xeon E5-2670v3 GPU: GTX1080 RAM: 384GB OS: CentOS 7.2 DB: PostgreSQL 9.5 + PG-Strom v1.0
  • 7. The PG-Strom Project Feedbacks from users during v1.0 development PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics7 Application Storage Query Optimizer Query Executor PG-Strom Extension SQL Parser Storage Manager GPU Heavy computing intensive workloads • In-database Analytics • Scientific R&D, Marketing, ...  by PL/CUDA + Matrix-Array Heavy I/O intensive workloads • Generic large OLAP • ETL, Reporting, ...  by SSD-to-GPU P2P DMA
  • 8. The PG-Strom ProjectPGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics8 Introduction of PL/CUDA
  • 9. The PG-Strom Project Our Failure (1/3) – Write an algorithm logic in SQL PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics9 Apr-2016
  • 10. The PG-Strom Project Our Failure (2/3) – Performance benefit (?) PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics10 Apr-2016
  • 11. The PG-Strom Project Our Failure (3/3) – Problems PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics11 ▌Problem.1 – Who writes algorithms in SQL?  Majority of mathematical algorithms are developed based on the manner of procedural programming language.  Although users don’t need to write algorithm logics in CUDA, they also have to write up the algorithm logic using SQL puzzle. ▌Problem.2 – Performance Benefit  Yeah, PG-Strom is much faster than PostgreSQL to execute the core logic of min-max method. It is an excellent result but nobody knows how many people run the algorithm on PostgreSQL.  Performance of GpuProjection is almost equivalent to the CPU version of implementation which is designed to a particular problem. Why?  Inefficient code due to SQL compatibility  Inefficient data format due to PostgreSQL’s row-format
  • 12. The PG-Strom Project Our Answer – PL/CUDA + Matrix-like Array PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics12 CREATE FUNCTION my_logic(matrix, matrix) RETURNS vector AS $$ $$ LANGUAGE ‘plcuda’; User define CUDA code block Storage GPU Kernel User defined CUDA code block Post SQL Process  Tables JOIN  Window Function  ORDER BY  GROUP BY  etc.... Load function’s arguments Write-back result set PL/CUDA method for manual optimization ArrayType header a1 a2 aN… b1 b2 bN… c1 c2 cN… d1 d2 dN… 𝑎1 ⋯ 𝑑1 ⋮ ⋱ ⋮ 𝑎 𝑁 ⋯ 𝑑 𝑁 Matrix of 4cols x Nrows Matrix-like Array 2D Array without NULL, to represent a matrix
  • 13. The PG-Strom Project Example of PL/CUDA function definition PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics13 CREATE OR REPLACE FUNCTION knn_gpu_similarity(int, int[], int[]) RETURNS float4[] AS $$ #plcuda_begin cl_int k = arg1.value; MatrixType *Q = (MatrixType *) arg2.value; MatrixType *D = (MatrixType *) arg3.value; MatrixType *R = (MatrixType *) results; : nloops = (ARRAY_MATRIX_HEIGHT(Q) + (part_sz - k - 1)) / (part_sz - k); for (loop=0; loop < nloops; loop++) { /* 1. calculation of the similarity */ for (i = get_local_id(); i < part_sz * part_nums; i += get_local_size()) { j = i % part_sz; /* index within partition */ /* index of database matrix (D) */ dindex = part_nums * get_global_index() + (i / part_sz); /* index of query matrix (Q) */ qindex = loop * (part_sz - k) + (j - k); values[i] = knn_similarity_compute(D, dindex, Q, qindex); } } : #plcuda_end $$ LANGUAGE 'plcuda'; CUDA code block
  • 14. The PG-Strom Project How GPU Kernels are built PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics14 CREATE OR REPLACE FUNCTION my_cuda_func(float[]) RETURNS int[] $$ #plcuda_sanity_check func_sc #plcuda_begin #plcuda_end #plcuda_working_bufsz func_wb $$ LANGUAGE ‘plcuda’; User define CUDA code block bool func_sc(float[]) helper function for sanity check bigint func_wb(float[]) helper function for buffer size estimation GPU Binary GPU Kernel Working Buffer Input arguments Run-time compiler input/output of SQL data User define CUDA code block Common Library Routines source program of GPU code Load the arguments
  • 15. The PG-Strom Project Why automatic generated code was not best for performance PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics15  NULL checks for each variable references  Overflow checks for each primitive operators  Function call instead of primitive operators STATIC_FUNCTION(pg_float4_t) pgfn_float4mul(kern_context *kcxt, pg_float4_t arg1, pg_float4_t arg2) { pg_float4_t result; result.isnull = arg1.isnull | arg2.isnull; if (!result.isnull) { result.value = arg1.value * arg2.value; CHECKFLOATVAL(&kcxt->e, result, isinf(arg1.value) || isinf(arg2.value), arg1.value == 0.0 || arg2.value == 0.0); } return result; } select x*y from c_test;
  • 16. The PG-Strom Project Disadvantage because of data format (1/2) PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics16 ▌Row-oriented data × includes unreferenced values × many steps for data references 〇 common data format with PostgreSQL ▌Column-oriented data 〇 load only referenced variables 〇 data reference by 1 step × needs data format exchange GPU core GPU core GPU core a c d feb a c d feb a d feb a c d feb GPU core eb GPU core GPU core GPU core GPU core b b b b b GPU core GPU core e e e e e Usual SQL load cannot justify the cost for data transformation, Advanced algorithm processing is improved by columnar format.
  • 17. The PG-Strom Project Disadvantage because of data format (2/2) PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics17 ▌Case of random memory access  Increase of memory transaction, but less usage rate of the data-bus ▌Case of coalesced memory access  Least number of memory transaction, and maximum usage rate of the data-bus 32bit Memory transaction width: 256bit 32bit 32bit32bit 32bit 32bit 32bit 32bit 32bit 32bit 32bit 32bit 32bit 32bit Memory transaction width: 256bit 32bit x 8 = 256bit is valid data in 256bit width memory transaction (Bus usage ratio: 100.0%) Only 32bit x 1 = 32bit is valid data in 256bit width memory transaction (Bus usage ratio: 12.5%) GPU cores GPU cores
  • 18. The PG-Strom Project 2D-Array as Matrix PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics18 ▌datatype[] array_matrix(variadic datatype[])  An aggregate function to construct a 2D-array from the input stream.  datatype is either of int2, int4, int8, float4 or float8.  The 2D-array does not contain NULL. ▌SETOF record matrix_unnest(datatype[])  Deform a 2D-array into stream of multiple records ▌Downside  Unable to handle variable-length data type  1GB limit of varlena data type in PostgreSQL ArrayType header a1 a2 aN… b1 b2 bN… c1 c2 cN… d1 d2 dN… 𝑎1 ⋯ 𝑑1 ⋮ ⋱ ⋮ 𝑎 𝑁 ⋯ 𝑑 𝑁 Matrix of 4cols x Nrows Matrix-like Array 2D Array without NULL, to represent a matrix
  • 19. The PG-Strom Project Example to call PL/CUDA function PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics19 SELECT row_number() OVER (), float4_as_int4(R.key_id) key_id, R.score FROM matrix_unnest( (SELECT my_plcuda_function(A.matrix, B.matrix) FROM (SELECT cbind(array_matrix(id), array_matrix(x, y, z)) matrix FROM normal_table WHERE tag LIKE ‘%abc%’) A, (SELECT matrix FROM matrix_table) B ) ) AS R(key_id real, score real) ORDER BY score DESC LIMIT 1000; Invocation of PL/CUDA function with two Matrix arguments Construct, or load pre-built array-matrix Deform array-matrix to generic records Post-process by SQL (JOIN, window-function)
  • 20. The PG-Strom ProjectPGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics20 Case Study similarity search on drug discovery
  • 21. The PG-Strom Project Background – relationship of disease and chemical compounds PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics21 target disease relevant protein chemical compounds (= candidate of drugs) Discovery of chemical compounds which are “active” to the target protein inactive active active but toxicity academic papers
  • 22. The PG-Strom Project k-NN Similarity Search on Chemical Compounds (1/2) PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics22 Database chemical compounds set (D; 10M records scale) Query chemical compounds set (Q; ~1000 records scale) Search by Similarity Target Protein “similar compounds” will have higher probability of active Picks up active chemical compounds to the target protein from academic papers
  • 23. The PG-Strom Project k-NN Similarity Search on Chemical Compounds (2/2) PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics23 Similarity is definition of distance ID NAME Fingerprint (1024bit) 1 CHEMBL153534 000000000001000000100000000000000100000000000001000000... 2 CHEMBL405398 000000000000000100100000000000000000000000000000100000... 3 CHEMBL503634 000001000000000000000000001000000100000000000000000000... : : : Data structure of the chemical compounds Similarity by Jaccard index: 𝑆𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 𝐴, 𝐵 = 𝐴 𝐹𝑃 ∩ 𝐵 𝐹𝑃 𝐴 𝐹𝑃 ∪ 𝐵 𝐹𝑃
  • 24. The PG-Strom Project Scale of the computing PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics24 Database chemical compounds set (D; 10M records scale) Q: Query chemical compounds set average of the top-3 𝑑𝑖 of D-compounds Distance of Q-set and 𝑑𝑖 compounds 𝑑𝑗 of D-compounds average of the top-3 Distance of Q-set and 𝑑𝑗 compounds Order of calculation: 𝑂 𝑄 × 𝐷 + 𝑂 𝐷 × 𝑄𝑙𝑜𝑔𝑄 (make distance) (sorting+average)
  • 25. The PG-Strom Project Implementation of PL/CUDA function (1/3) PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics25 Step-1 Split all the logical combination of Q and D into multiple partitions, then assign them on SMM; execution unit of GPU. Step-2 Each GPU core calculates a similarity score between a Q-compound and a D-compound, then store the score on “shared memory”; which is fast and close to SMM.
  • 26. The PG-Strom Project Implementation of PL/CUDA function (2/3) PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics26 Step-3 Bitonic-sorting by the similarity score, then reorder the Q-compounds Step-5 Make an average by the top-k values, then store it on the result buffer. Step-4 Repeat from Step-2, if # of Q-compounds is larger than shared memory size. Top-k items are kept.
  • 27. The PG-Strom Project Implementation of PL/CUDA function (3/3) PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics27 CREATE OR REPLACE FUNCTION knn_gpu_similarity(int, -- k-value int[], -- ID+bitmap of Q int[]) -- ID+bitmap of D RETURNS float4[] -- result: ID+similarity AS $$ #plcuda_decl : #plcuda_begin #plcuda_kernel_blocksz ¥ knn_gpu_similarity_main_block_size #plcuda_num_threads ¥ knn_gpu_similarity_main_num_threads #plcuda_shmem_blocksz 8192 cl_int k = arg1.value; MatrixType *Q = (MatrixType *) arg2.value; MatrixType *D = (MatrixType *) arg3.value; MatrixType *R = (MatrixType *) results; : for (loop=0; loop < nloops; loop++) { /* 1. calculation of the similarity */ for (i = get_local_id(); i < part_sz * part_nums; i += get_local_size()) { j = i % part_sz; /* index within partition */ dindex = part_nums * get_global_index() + (i / part_sz); qindex = loop * (part_sz - k) + (j - k); if (dindex < ARRAY_MATRIX_HEIGHT(D) && qindex < ARRAY_MATRIX_HEIGHT(Q)) { values[i] = knn_similarity_compute(D, dindex, Q, qindex); } } __syncthreads(); /* 2. sorting by the similarity for each partition */ knn_similarity_sorting(values, part_sz, part_nums); __syncthreads(); : } #plcuda_end #plcuda_sanity_check knn_gpu_similarity_sanity_check #plcuda_working_bufsz 0 #plcuda_results_bufsz knn_gpu_similarity_results_bufsz $$ LANGUAGE 'plcuda'; real[] -- ID+Similarity of D-compounds (2xN) knn_gpu_similarity(int k, -- k-value int[] Q, -- ID+Fingerprint of Q-compounds (33xM) int[] D); -- ID+Fingerprint of D-compounds (33xN)
  • 28. The PG-Strom Project Invocation of PL/CUDA function PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics28 PREPARE knn_sim_rand_10m_gpu_v2(int) -- arg1:@k-value AS SELECT row_number() OVER (), fp.name, similarity FROM (SELECT float4_as_int4(key_id) key_id, similarity FROM matrix_unnest( (SELECT rbind( knn_gpu_similarity($1,Q.matrix, D.matrix)) FROM (SELECT cbind(array_matrix(id), array_matrix(bitmap)) matrix FROM finger_print_query) Q, (SELECT matrix FROM finger_print_10m_matrix) D ) ) AS sim(key_id real, similarity real) ORDER BY similarity DESC) sim, finger_print_10m fp WHERE fp.id = sim.key_id LIMIT 1000; Post-process by SQL; like lookup of compounds name by compounds-id (tables JOIN), making rank of similarity by window function. Execution of PL/CUDA function with Q-/D-matrix as argument Transform the records read from tables to Array-Matrix type. (Pre-build is also possible) Transform the Array-Matrix (3xN), return value of PL/CUDA function, into usual record data x Nrows.
  • 29. The PG-Strom Project Performance results PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics29  For comparison to CPU cases, we implemented an equivalent SQL function by C.  # of D-compounds is 10M records, # of Q-compounds is 10, 50, 100, 500 and 1000.  Up to 10B combination search; almost equivalent size for real drug discovery research.  HW) CPU: Xeon E5-2670v3, GPU: GTX980 / GTX1080, RAM:384GB  SW) CentOS7, CUDA8.0, PostgreSQL v9.5 + PG-Strom v1.0 30.25 145.29 295.95 1503.31 3034.94 12.97 13.46 13.90 18.16 24.6513.00 13.23 13.59 16.01 19.13 0 500 1000 1500 2000 2500 3000 3500 10 50 100 500 1000 QueryResponseTime[sec] Number of Query Compounds [Q] Similarity search of chemical compounds by k-NN method (k=3, D=10M) CPU(E5-2670v3) GTX980 GTX1080 x150 times faster!
  • 30. The PG-Strom ProjectPGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics30 Another Usage k-means clustering in database system
  • 31. The PG-Strom Project Clustering Analysis PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics31
  • 32. The PG-Strom Project k-means clustering algorithm PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics32 1. Assign cluster randomly 2. Make centroid for each cluster 3. Chose the nearest cluster from the centroid
  • 33. The PG-Strom Project k-means clustering algorithm PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics33 1. Assign cluster randomly 5. Re-calculation of the centroid based on the new cluster assignment. 6. Finish the loop if it gets converged. 4. Repeat until convergence
  • 34. The PG-Strom Project k-means clustering on PL/CUDA PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics34 CREATE OR REPLACE FUNCTION gpu_kmeans(real[], -- ID + Data Matrix int, -- k-value (number of clusters) int = 10, -- max number of iteration int = 1) -- seed of initial randomness RETURNS int[] AS $$ #plcuda_decl : KERNEL_FUNCTION_MAXTHREADS(void) update_centroid(MatrixType *D, MatrixType *R, MatrixType *C) { : /* accumulate the local centroid */ for (did = get_global_id(); did < nitems; did += get_global_size()) { /* pick up the target cluster */ cid = r_values[nitems + did]; atomicAdd(&l_cent[cid], 1.0); for (index=1; index < width; index++) atomicAdd(&l_cent[index * k_value + cid], d_values[index * nitems + did]); } __syncthreads(); /* write back to the global C-matrix */ for (index = get_local_id(); index < width * k_value; index += get_local_size()) atomicAdd(&c_values[index], l_cent[index]); } : #plcuda_begin : status = pgstromLaunchDynamicKernel4((void *) setup_initial_cluster, (kern_arg_t)(D), (kern_arg_t)(R), (kern_arg_t)(C), (kern_arg_t)(r_seed), nitems, 0, 0); if (status != cudaSuccess) PLCUDA_RUNTIME_ERROR_RETURN(status); for (loop=0; loop < nloops; loop++) { : status = pgstromLaunchDynamicKernelMaxThreads3( (void *)kmeans_update_cluster, (kern_arg_t)(D), (kern_arg_t)(R), (kern_arg_t)(C), (kern_arg_t)k_value, nitems, 0, sizeof(cl_int) + sizeof(cl_float)); if (status != cudaSuccess) PLCUDA_RUNTIME_ERROR_RETURN(status); : } #plcuda_sanity_check gpu_kmeans_sanity_check #plcuda_working_bufsz gpu_kmeans_working_bufsz #plcuda_results_bufsz gpu_kmeans_results_bufsz #plcuda_end $$ LANGUAGE 'plcuda';
  • 35. The PG-Strom Project Test data for k-means clustering PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics35 ▌Dataset overview  A collection of datasets of vehicle traffic, observed between two points for a set duration of time over a period of 6 months  449 observation points, at Aarhus, Denmark.  13.5M records from Feb to June of 2014 ▌Data contains  average speed  average measured time  number of vehicles  latitude/longitude of the observation points  etc... ▌What we did  categorize the zone of road into 5-classes according to the characteristics of vehicle’s running style.
  • 36. The PG-Strom Project Invocation of GPU k-means function PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics36 SELECT report_id, k, c FROM (SELECT report_id, k, c, row_number() OVER (PARTITION BY report_id ORDER BY c DESC) rank FROM (SELECT report_id, k, count(*) c FROM matrix_unnest( (SELECT gpu_kmeans ( array_matrix( int4_as_float4(report_id), avg_measured_time, avg_speed, vehicle_count), 5) FROM tr_rawdata) ) R(report_id int, k int) GROUP BY report_id, k ) __summary_1 ) __summary_2 WHERE rank = 1; Make a matrix from the raw-data Run k-means clustering logic Pick-up most frequent cluster
  • 37. The PG-Strom Project GPU k-means (1/3) – clustering by all the data PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics37 $ wget -O map.png "`psql traffic -At -f ~/traffic.sql`" bypass highway? Road towards downtown? Beltway?
  • 38. The PG-Strom Project GPU k-means (2/3) – Daytime and Nighttime PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics38 daytime (8-17) nighttime (18-7)
  • 39. The PG-Strom Project GPU k-means (3/3) – Weekdays and Weekend PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics39 weekdays weekend
  • 40. The PG-Strom Project Invocation of GPU k-means function PGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics40 SELECT report_id, k, c FROM (SELECT report_id, k, c, row_number() OVER (PARTITION BY report_id ORDER BY c DESC) rank FROM (SELECT report_id, k, count(*) c FROM matrix_unnest( (SELECT gpu_kmeans ( array_matrix( int4_as_float4(report_id), avg_measured_time, avg_speed, vehicle_count), 5) FROM tr_rawdata WHERE extract('hour' from timestamp) between 7 and 17 ) ) R(report_id int, k int) GROUP BY report_id, k ) __summary_1 ) __summary_2 WHERE rank = 1; Just add a line to select different input data set. Flexibility of SQL!
  • 41. The PG-Strom Project Performance (1/3) PGconf.ASIA - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics41 We used MADLib’s kmeans_random() as a usual implementation of k-means based on CPU
  • 42. The PG-Strom Project Performance (2/3) – Invocation of MADLib’s k-means clustering PGconf.ASIA - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics42 SELECT report_id, k, c FROM (SELECT report_id, k, c, row_number() OVER (PARTITION BY report_id ORDER BY c DESC) rank FROM (SELECT t.report_id, (madlib.closest_column(centroids, t.attrs)).column_id as k, count(*) c FROM tr_rawdata_madlib_s t, (SELECT centroids FROM madlib.kmeans_random('tr_rawdata_madlib', 'attrs', 5) ) km; GROUP BY t.report_id, k ) __summary_1 ) __summary_2 WHERE rank = 1; Calculation of the cluster centroid Choice of the closest cluster for each item
  • 43. The PG-Strom Project Performance (3/3) – k-means in PL/CUDA vs MADLib PGconf.ASIA - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics43  Environment of the benchmark  HW) CPU: Xeon E5-2670v3, GPU: GTX1080, RAM: 384GB  SW) CentOS7, CUDA8.0, PostgreSQL v9.5 + PG-Strom v1.0, MADLib 1.9 1.41 12.44 126.59 1668.49 0.21 0.29 0.94 8.41 0 200 400 600 800 1000 1200 1400 1600 1800 10,000 100,000 1,000,000 13,577,132 QueryResponseTime[sec] (※Lowerisbetter) Number of Items that were clustered based on the k-means algorithm Performance comparison of in-database k-means clustering MADLib PL/CUDA x200 times faster
  • 44. The PG-Strom ProjectPGconf.SV - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics44 Summary
  • 45. The PG-Strom Project Summary (1/3) PGconf.ASIA - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics45 ▌What is PL/CUDA?  PG-Strom’s original concept is automatic optimization and automatic code generation.  PL/CUDA is a way to pull out the maximum performance of GPU instead of manual optimization. Likely right decision, because nobody likely writes up advanced algorithms in SQL. ▌Advantages  TFLOPS grade computing engine for in-database analytics  No need to export entire dataset from the database unlike a case when we use external applications.  Only “result” is all we need to export.  Utilization of SQL’s flexibility for pre-/post-process of the advanced analytics algorithms.  JOIN, GROUP BY, Window function, etc...
  • 46. The PG-Strom Project Summary (2/3) – Target Area PGconf.ASIA - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics46 ▌Case Study 1 – Similarity Search on Drug Discovery  Characteristics of chemical compounds are represented as fingerprint (= feature vector). Similarity scores are calculated by GPU, in parallel, then picks up top-k similarity. ▌Case Study 2 – Unsupervised Learning using Sensor Data  Distance calculation of sensor generated dataset by GPU. Automatic categorization to a few classes by the characteristics of roads; automatically detected. ▌Other expected domain for applications  Search of compounds ... medical drug, chemical, materials, ...  Recommendation engine ... e-commerce  Anomaly detection ... security  Data mining ... marketing  ...etc...
  • 47. The PG-Strom Project Summary (3/3) – Challenges and future PGconf.ASIA - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics47 ▌Technology  Array-matrix larger than 1GB  Because of the restriction of variable length fields in PostgreSQL, SQL-function cannot have arguments larger than 1GB.  Cost to use same array-matrix multiple times  Idea: it may make sense to pin a static array-matrix on GPU side. ▌Operation  Relatively small number of engineers are familiar both of CUDA and SQL.  Package of well-known algorithms  Support service by professional engineers Let’s have joint evaluation with you!
  • 48. The PG-Strom Project Resources PGconf.ASIA - PL/CUDA / Fusion of HPC Grade Power with In-Database Analytics48 ▌Repository https://github.com/pg-strom/devel ▌Today’s Slides http://www.slideshare.net/kaigai/pgconfasia2016-plcuda-en ▌Contact  e-mail: kaigai@ak.jp.nec.com  Tw: @kkaigai Any developers are welcome!