PG-Strom is a module that utilizes GPUs to accelerate query processing in PostgreSQL. It uses a foreign data wrapper to push query execution to the GPU. Benchmark results show a query running 10 times faster on a table using the PG-Strom FDW compared to a regular PostgreSQL table. Future plans include supporting writable foreign tables, accelerating sort and aggregate operations using the GPU, and inheritance between regular and foreign tables. Help from the community is needed to review code, provide large real-world datasets, and understand common analytic queries.
1. PG-Strom
~A FDW module utilizing GPU device~
NEC Europe
SAP Global Competence Center
KaiGai Kohei <kohei.kaigai@emea.nec.com>
2. FDW is fun
Exec
Regular Exec
Table
Executor
Foreign MySQL
Table FDW
SELECT * FROM …
Foreign Oracle Exec
Table FDW
Run on
single PG-Strom
Foreign
thread FDW
Table
Utilizing
External
Regular Computing
Table Resource!
Page 2 PostgreSQL Conference 2012
3. Idea of Asynchronous Execution using CPU and GPU
vanilla PostgreSQL PostgreSQL with PG-Strom
CPU CPU GPU
Asynchronous memory
transfer and execution
Iteration of scan tuples and
evaluation of qualifiers
Synchronization
Larger “chunk” to scan
Earlier than
the database at once
“Only CPU” scan
: Red means, scan tuples from the database
: Green means, execution of the qualifiers
Page 3 PostgreSQL Conference 2012
4. Architecture of PG-Strom
World of CPU World of GPU
regular shadow
Data Exchange
tables tables
via shared chunk
Massive
shared buffer shared chunks Parallel
Execution
Exec Preload
PG-Strom PG-Strom GPU
GPU Kernel
Calculation Function
Executor
Server Exec
Backend
Backend
Backend Process
Process
Process
GPU Device
DMA Memory
Postmaster Transfer
PCI-E x16 Gen2 (16GB/sec)
Page 4 PostgreSQL Conference 2012
5. Data Density and Column-base structure
Foreign Table FT1
(a, b, c, d)
Table: my_schema.ft1.c.cs
column store of D
column store of C
column store of A
column store of B
Shadow 10100 {‘2010-10-21’, …}
rowid map
Tables 10200 {‘2011-01-23’, …}
10300 {‘2011-08-17’, …}
Table: my_schema.ft1.b.cs
10100 {2.4, 5.6, 4.95, … }
row store of FT1
10300 {10.23, 7.54, 5.43, … }
② Calculation
rowmap
Chunk Buffer of FT1
① Transfer
value a[] <not used>
value b[]
③ Write-Back
value c[]
value d[] <not used>
Page 5 PostgreSQL Conference 2012
6. Benchmark Result
postgres=# SELECT COUNT(*) FROM rtbl
WHERE sqrt((x-256)^2 + (y-128)^2) < 40;
count
-------
25069
(1 row)
GPU
Accelerated!
Time: 3739.492 ms
postgres=# SELECT COUNT(*) FROM ftbl
WHERE sqrt((x-256)^2 + (y-128)^2) < 40;
count
-------
25069 X10 times
(1 row) Faster
Time: 227.023 ms
▌CPU: Intel Xeon E5504 (2.0GHz/4core), GPU: Nvidia GeForce GTS450 (128 cuda core)
▌rtbl and ftbl contains 5 million tuples, with same values.
▌All the tuples are already in the shared buffers, so seldom disk i/o happen.
Page 6 PostgreSQL Conference 2012
7. Future Development
▌Git URL
https://github.com/kaigai/pg_strom
▌v9.3 development
Writable Foreign Table
Sort / Aggregate acceleration using GPU
Inheritance between regular and foreign tables
▌Need your help
Folks who can review the patches
Folks who can provide real-life big data
Folks who can know typical workload of analytic queries
Page 7 PostgreSQL Conference 2012