2. Self Introduction
■ Name: KaiGai Kohei
■ Company: NEC OSS Promotion Center
■ Like: processors with many cores
■ Dislike: processors with few cores
■ Background: HPC → OSS/Linux → SAP → GPU/PostgreSQL
■ Twitter: @kkaigai
■ My Jobs
  ● SELinux (2004~)
    • Lockless AVC, JFFS2 XATTR, ...
  ● PostgreSQL (2006~)
    • SE-PostgreSQL, Security Barrier View, Writable FDW, ...
  ● PG-Strom (2012~)
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQL
3. Approach for Performance Improvement
[Diagram] Two directions: Scale-Out, and Scale-Up. Scale-Up further splits
into homogeneous scale-up (more identical CPU cores) and heterogeneous
scale-up (CPU + accelerator such as a GPU).
4. Characteristics of GPU (Graphic Processor Unit)
■ Characteristics
  ● Larger percentage of ALUs on the chip
  ● Relatively smaller percentage of cache and control logic
  → Advantageous for simple calculations in parallel, but not for complicated logic
  ● Much higher number of cores per price
    • GTX750Ti (640 cores) for $150
                 | GPU                 | CPU
Model            | NVIDIA Tesla K20X   | Intel Xeon E5-2670 v3
Architecture     | Kepler              | Haswell
Launch           | Nov-2012            | Sep-2014
# of transistors | 7.1 billion         | 3.84 billion
# of cores       | 2688 (simple)       | 12 (functional)
Core clock       | 732MHz              | 2.6GHz, up to 3.5GHz
Peak FLOPS (SP)  | 3.95 TFLOPS         | 998.4 GFLOPS (with AVX2)
DRAM size        | 6GB, GDDR5          | 768GB/socket, DDR4
Memory bandwidth | 250GB/s             | 68GB/s
Power consumption| 235W                | 135W
Price            | $3,000              | $2,094
SOURCE: CUDA C Programming Guide (v6.5)
5. How GPU works – Example of reduction algorithm
Computing the sum of an array, Σ item[i] (i = 0 … N-1), with N cores of a GPU:
[Diagram] step 1 adds neighboring elements in parallel, step 2 adds the
resulting pairs, and so on; the total sum of item[0] … item[15] is obtained
in log2(N) steps. Inter-core synchronization is supported by hardware.
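The pairwise reduction above can be sketched on a CPU; this is an illustrative simulation (not PG-Strom code) of the access pattern a GPU would execute, where every addition inside one step is independent and runs on its own core:

```python
# Illustrative sketch: pairwise (tree) reduction as a GPU performs it.
# In step k, partial sums 2^k elements apart are added; after log2(N)
# steps the total lands in data[0].

def parallel_reduction(items):
    """Simulate the log2(N)-step pairwise reduction (N a power of two)."""
    data = list(items)
    n = len(data)
    stride = 1
    steps = 0
    while stride < n:
        # All additions in this loop are independent, so a GPU runs them
        # concurrently, one per core, with a hardware barrier afterwards.
        for i in range(0, n, 2 * stride):
            data[i] += data[i + stride]
        stride *= 2
        steps += 1
    return data[0], steps

total, steps = parallel_reduction(range(16))
print(total, steps)   # 120 4  (sum of 0..15, in log2(16) = 4 steps)
```

The same pattern underlies many GPU primitives (sum, min/max, and the partial aggregates described later).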
6. Semiconductor Trend (1/2) – Towards Heterogeneous
■ Move from multicore CPUs to CPU/GPU integrated architectures
■ The "free lunch" that hardware improvement gives software will finish soon
  → Semiconductor capability goes unused unless software is designed with
    awareness of hardware characteristics.
SOURCE: THE HEART OF AMD INNOVATION, Lisa Su, at AMD Developer Summit 2013
7. Semiconductor Trend (2/2) – Dark Silicon Problem
■ Background of the CPU/GPU integrated architecture
  ● Transistor density increases faster than power consumption falls
  ● Chip cooling makes it impossible to supply power to all the logic at once
  → A chip carries multiple logics with different features, to cap the peak
    power consumption.
SOURCE: Compute Power with Energy-Efficiency, Jem Davies, at AMD Fusion Developer Summit 2011
8. RDBMS and its bottleneck (1/2)
[Diagram] Data size > RAM: the processor reads data from storage through RAM.
Data size < RAM: the data fits entirely in RAM. In the future: a processor
backed by wide-band RAM plus non-volatile RAM that holds the data.
9. RDBMS and its bottleneck (2/2)
World of the current memory bottleneck:
  Join, Aggregation, Sort, Projection, ...
  [strategy] • burstable access patterns • parallel algorithms
World of the traditional disk-I/O bottleneck:
  SeqScan, IndexScan, ...
  [strategy] • reduction of I/O (size, count) • distribution across disks (RAID)
[Diagram] Processor ↔ RAM bandwidth: multiple hundreds of GB/s.
RAM ↔ Storage bandwidth: multiple GB/s.
10. PG-Strom
■ What is PG-Strom
  ● An extension designed for PostgreSQL
  ● Off-loads a part of SQL workloads to the GPU for parallel, rapid execution
  ● Three workloads are supported: Full-Scan, Hash-Join, Aggregate
    (as of Nov-2014, beta version)
■ Concept
  ● Automatic generation of GPU native code from the SQL query, by JIT compile
  ● Asynchronous execution with CPU/GPU co-operation
■ Advantage
  ● Works fully transparently for users
    • PostgreSQL's peripheral software remains usable: SQL syntax, backup,
      HA, drivers
  ● Performance improvement with GPU + OSS, at very low cost
■ Attention
  ● Right now, an in-memory data store is assumed
11. Architecture of PG-Strom (1/2)
[Diagram] A SQL query passes through PostgreSQL's Query Parser (breaks the
query down into a parse tree), Query Optimizer (makes the query execution
plan) and Query Executor (runs the query). Through the Custom-Plan APIs,
PG-Strom injects GpuScan, GpuHashJoin and GpuPreAgg nodes, which communicate
over a message queue with the PG-Strom OpenCL Server (GPU Code Generator,
GPU Program Manager). Tables are read via the Storage Manager and the
Shared Buffer on top of storage.
12. Architecture of PG-Strom (2/2)
■ Elemental technology① OpenCL
  ● Parallel computing framework for heterogeneous processors
  ● Usable for CPU parallelism as well, not only NVIDIA/AMD GPUs
  ● Includes a run-time compiler in the language specification
■ Elemental technology② Custom-Scan Interface
  ● Lets extensions implement scan/join as if they were built-in logic for
    SQL processing in PostgreSQL
  ● A part of the functionality got merged into v9.5; discussions are in
    progress for the full functionality
■ Elemental technology③ Row-oriented data structure
  ● Uses PostgreSQL's shared buffer as the DMA source
  ● Avoids data-format translation between row and column, even though
    column format is optimal for GPU performance
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
13. Elemental technology① OpenCL (1/2)
■ Characteristics of OpenCL
  ● Uses an abstracted "OpenCL device" model covering CPU/MIC, not only GPUs
    • even though it follows these characteristics of GPUs:
      - three types of memory layer (global, local, constant)
      - concept of workgroup; synchronous execution among threads
  ● Just-in-time compile from C-like source code to platform-specific
    native code
■ Comparison with CUDA
  → We don't need to develop the JIT compile functionality ourselves, and we
    want more people to try it, so we adopted OpenCL at this moment.
          | CUDA                         | OpenCL
Advantage | • Detailed optimization      | • Multi-platform support
          | • Latest features of NVIDIA  | • Built-in JIT compiler
          |   GPUs                       | • CPU parallel support
          | • Driver stability           |
Issues    | • Unavailable on AMD, Intel  | • Driver stability
14. Elemental technology① OpenCL (2/2)
[Diagram] Source code (text) is handed to the OpenCL runtime through the
OpenCL interface: clBuildProgram() just-in-time compiles it into a
cl_program object, and clCreateKernel() looks up the kernel functions to
obtain cl_kernel objects. A pair of kernel and data (DMA buffers 1-3) is
put into the command queue, which feeds the OpenCL devices: N computing
units working on global, local and constant memory.
15. Automatic GPU native code generation
postgres=# SET pg_strom.show_device_kernel = on;
SET
postgres=# EXPLAIN (verbose, costs off) SELECT cat, avg(x) from t0 WHERE x < y GROUP BY cat;
QUERY PLAN
--------------------------------------------------------------------------------------------
HashAggregate
Output: cat, pgstrom.avg(pgstrom.nrows(x IS NOT NULL), pgstrom.psum(x))
Group Key: t0.cat
-> Custom (GpuPreAgg)
Output: NULL::integer, cat, NULL::integer, NULL::integer, NULL::integer, NULL::integer,
NULL::double precision, NULL::double precision, NULL::text,
pgstrom.nrows(x IS NOT NULL), pgstrom.psum(x)
Bulkload: On
Kernel Source: #include "opencl_common.h"
#include "opencl_gpupreagg.h"
#include "opencl_textlib.h"
: <...snip...>
static bool
gpupreagg_qual_eval(__private cl_int *errcode,
__global kern_parambuf *kparams,
__global kern_data_store *kds,
__global kern_data_store *ktoast,
size_t kds_index)
{
pg_float8_t KVAR_7 = pg_float8_vref(kds,ktoast,errcode,6,kds_index);
pg_float8_t KVAR_8 = pg_float8_vref(kds,ktoast,errcode,7,kds_index);
return EVAL(pgfn_float8lt(errcode, KVAR_7, KVAR_8));
}
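To make the code-generation step above concrete, here is a deliberately tiny, hypothetical sketch of turning a parsed comparison into kernel source text. It mimics the shape of the generated `gpupreagg_qual_eval()` shown in the EXPLAIN output, but none of the names (`codegen_qual`, the simplified `pg_float8_vref` signature) are PG-Strom's real generator API:

```python
# Hypothetical, simplified sketch of SQL-expression -> kernel-source
# generation. PG-Strom's real generator emits OpenCL C with more
# arguments and error handling; this only illustrates the idea of
# building source text from the plan at query time.

OPS = {"<": "pgfn_float8lt", ">": "pgfn_float8gt"}

def codegen_qual(lhs_column, op, rhs_column):
    """Emit a qualifier function comparing two float8 columns by index."""
    fn = OPS[op]
    return (
        "static bool qual_eval(kern_data_store *kds, size_t idx)\n"
        "{\n"
        f"    pg_float8_t lhs = pg_float8_vref(kds, {lhs_column}, idx);\n"
        f"    pg_float8_t rhs = pg_float8_vref(kds, {rhs_column}, idx);\n"
        f"    return EVAL({fn}(lhs, rhs));\n"
        "}\n"
    )

# Corresponds to WHERE x < y, with x and y at column indexes 6 and 7:
print(codegen_qual(6, "<", 7))
```

The emitted text would then be concatenated with the common headers and handed to the OpenCL runtime's JIT compiler.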
16. How PG-Strom processes SQL workloads① – Hash-Join
[Diagram] Existing Hash-Join implementation: for each inner/outer relation
pair, the CPU searches the hash table and performs the projection, then
moves to the next step. GpuHashJoin implementation: hash-table search and
projection run in parallel on the GPU; all the CPU does is reference the
result of the relations join.
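The hash-join pattern that GpuHashJoin parallelizes can be sketched as plain sequential code; this is an illustration of the classic build/probe structure (not PG-Strom's implementation), where every probe iteration is independent and therefore maps to one GPU core per outer row:

```python
# Illustrative hash-join sketch: build a hash table on the (smaller)
# inner relation, then probe it for every outer row. On a GPU, the
# probe loop runs concurrently, one outer row per core.

def hash_join(inner, outer, inner_key, outer_key):
    # Build phase: hash table over the inner relation (done once).
    table = {}
    for row in inner:
        table.setdefault(row[inner_key], []).append(row)
    # Probe phase: each iteration is independent -> parallelizable.
    result = []
    for row in outer:
        for match in table.get(row[outer_key], []):
            result.append({**match, **row})
    return result

t1 = [{"aid": 1, "a": "x"}, {"aid": 2, "a": "y"}]       # inner
t0 = [{"aid": 1, "v": 10}, {"aid": 2, "v": 20},          # outer
      {"aid": 1, "v": 30}]
print(hash_join(t1, t0, "aid", "aid"))   # three joined rows
```

The build phase is cheap when the inner relation is small; the probe phase dominates on large outer relations, which is exactly the part off-loaded to the GPU.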
17. Benchmark (1/2) – Simple Tables Join
[conditions of measurement]
✓ INNER JOIN of 200M rows x 100K rows x 10K rows ... with an increasing number of tables
✓ Query in use: SELECT * FROM t0 NATURAL JOIN t1 [NATURAL JOIN t2 ...];
✓ All the tables are loaded in advance
✓ HW: Express5800 HR120b-1, CPU: Xeon E5-2640, RAM: 256GB, GPU: NVIDIA GTX980
[Chart] Query response time vs. number of joined tables:
tables:       2      3      4      5      6       7       8       9
PostgreSQL: 178.7  334.0  513.6  713.1  941.4  1181.3  1452.2  1753.4
PG-Strom:    29.1   30.5   35.6   36.6   43.6    49.5    50.1    60.6
18. How PG-Strom processes SQL workloads② – Aggregate
■ GpuPreAgg, prior to Aggregate/Sort, reduces the number of rows to be
  processed by the CPU
■ Applying the aggregate to all the data at once is not easy because of the
  GPU's RAM size; the GPU's advantage is making "partial aggregates" per
  chunk of about a million rows
■ The key performance factor is reducing the CPU's job
[Diagram] SELECT count(*), AVG(X), Y FROM Tbl_1 GROUP BY Y;
Without PG-Strom: SeqScan over Tbl_1 (X, Y, Z in the table definition) emits
several million rows (X, Y) → Sort → GroupAggregate computes count(*),
avg(X) → result set of several rows.
With PG-Strom: GpuScan emits several million rows → GpuPreAgg reduces them
to several hundred rows (Y, nrows, psum_X) → Sort → GroupAggregate computes
sum(nrows) and avg_ex(nrows, psum) → result set of several rows.
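The two-phase aggregation above can be sketched as follows; this is an illustrative model (function names `gpu_preagg` and `final_avg` are hypothetical, not PG-Strom's) of how per-chunk partial aggregates (nrows, psum) combine into an exact AVG:

```python
# Sketch of two-phase aggregation: phase 1 computes partial counts and
# sums per group for each chunk (on the GPU, in PG-Strom's design);
# phase 2 merges the partials on the CPU: AVG = sum(psum) / sum(nrows).

def gpu_preagg(chunk, group_key, x_key):
    """Phase 1 (per chunk): partial (count, sum) per group."""
    partial = {}
    for row in chunk:
        n, s = partial.get(row[group_key], (0, 0.0))
        partial[row[group_key]] = (n + 1, s + row[x_key])
    return partial

def final_avg(partials):
    """Phase 2: merge partial aggregates into AVG per group."""
    merged = {}
    for partial in partials:
        for key, (n, s) in partial.items():
            mn, ms = merged.get(key, (0, 0.0))
            merged[key] = (mn + n, ms + s)
    return {key: s / n for key, (n, s) in merged.items()}

chunks = [[{"y": "a", "x": 1.0}, {"y": "b", "x": 3.0}],
          [{"y": "a", "x": 2.0}]]
print(final_avg([gpu_preagg(c, "y", "x") for c in chunks]))
# {'a': 1.5, 'b': 3.0}
```

Because count and sum are associative, the result is exact no matter how the rows are split into chunks, which is what lets the GPU reduce millions of rows to hundreds before the CPU touches them.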
19. Benchmark (2/2) – Aggregation + Tables Join
[conditions of measurement]
✓ Aggregate and INNER JOIN of 200M rows x 100K rows x 10K rows ... with an increasing number of tables
✓ Query in use:
  SELECT cat, AVG(x) FROM t0 NATURAL JOIN t1 [NATURAL JOIN t2 ...] GROUP BY cat;
✓ Other conditions are the same as in the previous measurement
[Chart] Query response time vs. number of joined tables:
tables:       2      3      4      5      6      7      8      9
PostgreSQL: 157.4  238.2  328.3  421.7  525.5  619.3  712.8  829.2
PG-Strom:    44.0   43.2   42.8   45.0   48.5   50.8   61.0   66.7
20. As an aside...
Let's go back to the development history before covering the remaining
elemental technology.
21. Development History of PG-Strom
2011 Feb  KaiGai moved to Germany
     May  PGconf 2011
2012 Jan  The first prototype (announced)
     May  Tried to use pseudo code
     Aug  Proposal of background worker and writable FDW
2013 Jul  NEC approved PG-Strom development
     Nov  Proposal of the CustomScan API
2014 Feb  Moved from CUDA to OpenCL
     Jun  GpuScan, GpuHashJoin and GpuSort were implemented
     Sep  Dropped GpuSort in favor of GpuPreAgg
     Nov  A working beta-1 became available
"Strom" comes from a German word; I lived in Germany at that time.
22. Inspiration – May-2011
PGconf 2011 @ Ottawa, Canada
http://ja.scribd.com/doc/44661593/PostgreSQL-OpenCL-Procedural-Language
"You're brave."
"Couldn't it run on regular queries, not only PL functions?"
23. Inspired by the PgOpenCL Project
Parallel Image Searching Using PostgreSQL PgOpenCL @ PGconf 2011,
Tim Child (3DMashUp)
Summary: GPUs work well on large BLOBs, like image data.
Even for values of small width, GPUs will work well if we have transposition.
24. Made a prototype – Christmas vacation in 2011
  ● CUDA based (→ now OpenCL)
  ● FDW (Foreign Data Wrapper) based (→ now CustomScan API)
  ● Column-oriented data structure (→ now row-oriented data)
[Diagram] Vanilla PostgreSQL: the CPU iterates fetching a row from the
buffer and evaluating the WHERE clause, row by row. PostgreSQL + PG-Strom:
for SELECT * FROM table WHERE sqrt((x-256)^2 + (y-100)^2) < 10, the GPU
code is generated automatically and just-in-time compiled by the CUDA
compiler; multiple rows at once are processed via asynchronous DMA and
asynchronous kernel execution, with CPU/GPU synchronization at the end,
reducing the query response time.
25. What is FDW (Foreign Data Wrapper)
  ● A set of APIs to expose an external data source as if it were a
    PostgreSQL table
  ● The FDW driver generates rows on demand when foreign tables are referenced
  → An extension can have arbitrary data structures and logic to process the data
  → It is also allowed to scan internal data with GPU acceleration!
[Diagram] A SQL query goes through the Query Parser, Query Planner and
Query Executor; the executor scans regular tables from storage, and foreign
tables through FDW drivers such as File FDW (CSV files), Oracle FDW
(another DBMS) and the PG-Strom FDW.
26. Implementation at that time
■ Problem
  ● Only foreign tables could be accelerated with GPUs
  ● Only full table-scan was supported
  ● Foreign tables were read-only at that time
  ● Slow-down of the first query because of device initialization
  → Summary: not easy to use
[Diagram] The foreign table (pgstrom) consisted of a rowmap plus per-column
value arrays (value a[], b[], c[], d[]; unused columns left out). PG-Strom
managed shadow tables on behalf of each column of the foreign table, e.g.
my_schema.ft1.b.cs holding rowid 10100 with {2.4, 5.6, 4.95, ...} and
my_schema.ft1.c.cs holding rowids 10100/10200/10300 with date arrays.
Items are stored physically close together, using a large array data type
of the element type. Processing: ① transfer, ② calculation, ③ write-back.
PGconf.EU 2012 / PG-Strom - GPU Accelerated Asynchronous Execution Module
27. PostgreSQL Enhancement (1/2) – Writable FDW
■ Foreign tables had supported read access only (~v9.2)
  ● Data load onto the shadow tables was supported using a special function
  → Enhancement: APIs for INSERT/UPDATE/DELETE on foreign tables
[Diagram] SELECT, DELETE, UPDATE and INSERT flow through query planning
into the FDW driver toward the data source; this interface was enhanced.
28. PostgreSQL Enhancement (2/2) – Background Worker
■ Slowdown on the first run of each query
  ● Time for JIT compile of the GPU code → cache the built binaries
  ● Time for initialization of GPU devices → initialize once, in a worker process
  → Good side effect: it works with run-times that don't support concurrent use.
■ Background Worker
  ● An interface that allows extensions to launch managed background processes
  ● Dynamic registration is also supported in v9.4
[Diagram] The postmaster process forks backend processes on connection
(port 5432), built-in background workers on startup, and extensions'
background workers on startup or on demand; they communicate via IPC
(shared memory, ...).
29. Design Pivot
■ Is FDW really the appropriate framework for PG-Strom?
  ● It originated from a feature to show external data as if it were a table.
  [benefit]
  • It is already supported by PostgreSQL
  [issues]
  • Less transparency for users / applications
  • A more efficient execution path than the GPU may exist, like an
    index-scan, if available
  • Workloads that expect more than full-table scan, like table joins
■ CUDA or OpenCL?
  → First priority is many trials by a wider range of people
     | CUDA                          | OpenCL
Good | • Fine-grained optimization   | • Multi-platform support
     | • Latest features of NVIDIA   | • Built-in JIT compiler
     |   GPUs                        | • CPU parallel support
     | • Reliability of drivers      |
Bad  | • No support on AMD, Intel    | • Reliability of OpenCL drivers
30. Elemental Technology② – Custom-Scan Interface
It allows extensions to offer alternative scan or join logics, in addition
to the built-in ones.
[Diagram] For a scan of Table.A with clause id > 200, the planner compares
the built-in Path① SeqScan (cost=400), Path② TidScan (unavailable) and
Path③ IndexScan (cost=10) against Path④ CustomScan (GpuScan, cost=40);
IndexScan wins. For a join of Table.X and Table.Y with clause x.id = y.id,
it compares the built-in Path① NestLoop (cost=8000), Path② HashJoin
(cost=800) and Path③ MergeJoin (cost=1200) against Path④ CustomScan
(GpuHashJoin, cost=500); GpuHashJoin wins.
31. Yes!! Custom-Scan Interface got merged into v9.5
The minimum consensus is an interface that allows extensions to implement
custom logic to replace the built-in scan logics.
↓
Then we will discuss enhancing the interface for table joins in the
developers' community.
32. Enhancement of Custom-Scan Interface
■ Join replacement by foreign-/custom-scan
  ● It replaces a node to be joined by a foreign-/custom-scan.
  ● Example: an FDW that scans the result set of a remote join.
[Diagram] For the join of Table.X and Table.Y with clause x.id = y.id,
Path④ CustomScan (GpuHashJoin) competes with the built-in Path① NestLoop,
Path② HashJoin and Path③ MergeJoin; to the rest of the plan, it looks
like a relation scan on the result set of the tables join.
33. Abuse of Custom-Scan Interface
■ Usual interface contract
  ● GpuScan/SeqScan fetches rows, which are passed to the upper node, and so on
  ● TupleTableSlot is used to exchange records in a row-by-row manner
■ Beyond the interface contract
  ● When both the parent and child nodes are managed by the same extension,
    nothing prohibits exchanging chunks of data in its own data structure
  ● "Bulkload: On" → multiple rows (usually 10K-100K) at once
postgres=# EXPLAIN SELECT * FROM t0 NATURAL JOIN t1;
QUERY PLAN
-----------------------------------------------------------------------------
Custom (GpuHashJoin) (cost=2234.00..468730.50 rows=19907950 width=106)
hash clause 1: (t0.aid = t1.aid)
Bulkload: On
-> Custom (GpuScan) on t0 (cost=500.00..267167.00 rows=20000024 width=73)
-> Custom (MultiHash) (cost=734.00..734.00 rows=40000 width=37)
hash keys: aid
Buckets: 46000 Batches: 1 Memory Usage: 99.97%
-> Seq Scan on t1 (cost=0.00..734.00 rows=40000 width=37)
Planning time: 0.220 ms
(9 rows)
34. Element Technology③ – Row-oriented data structure
[Diagram] The prior implementation kept its own table cache (a T-Tree
columnar cache) beside PostgreSQL's shared buffer: cache construction
translated rows to columns, and materializing the result translated columns
back to rows. In the current architecture, data is DMA-transferred (via the
PCI-E bus) between the shared buffer and the GPU, with GpuScan, GpuHashJoin
and GpuPreAgg plugged in via the Custom-Plan APIs and served by the
PG-Strom OpenCL Server (GPU Code Generator, GPU Program Manager) over a
message queue.
35. Why GPUs fit column-oriented data structures well
[Diagram] A WARP is a set of processor cores that share an instruction
pointer, usually 32. With coalesced memory access, adjacent cores touch
adjacent addresses in global memory (DRAM), exploiting the wide memory
bandwidth (256-384 bits).
SOURCE: Maxwell: The Most Advanced CUDA GPU Ever Made
SOURCE: How to Access Global Memory Efficiently in CUDA C/C++ Kernels
36. Format of PostgreSQL tuples
struct HeapTupleHeaderData
{
union
{
HeapTupleFields t_heap;
DatumTupleFields t_datum;
} t_choice;
/* current TID of this or newer tuple */
ItemPointerData t_ctid;
/* number of attributes + various flags */
uint16 t_infomask2;
/* various flag bits, see below */
uint16 t_infomask;
/* sizeof header incl. bitmap, padding */
uint8 t_hoff;
/* ^ - 23 bytes - ^ */
/* bitmap of NULLs -- VARIABLE LENGTH */
bits8 t_bits[1];
/* MORE DATA FOLLOWS AT END OF STRUCT */
};
[Diagram] The HeapTupleHeader is followed by: the NULL bitmap (only if the
tuple has NULLs), the OID of the row (only if it exists; we usually have
none), padding, then the 1st, 2nd, ... Nth column values; a NULL column
stores no data. Variable-length fields make it impossible to predict the
offsets of later fields. This is the worst data structure for a GPU
processor.
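The offset problem above can be made concrete with a small sketch (illustrative only; the widths below are hypothetical, not a real tuple): with variable-length fields, locating column N requires summing the actual widths of columns 0..N-1, so each GPU core must walk its tuple sequentially, while a column store gives a fixed stride that every core can compute with one multiplication, enabling coalesced access:

```python
# Why the row format hurts the GPU: field offsets in a heap tuple depend
# on the data itself, while a fixed-width column array does not.

def row_field_offset(field_widths, n):
    """Offset of the n-th field in one tuple: needs every preceding
    width, which differs tuple by tuple for varlena columns."""
    return sum(field_widths[:n])

def column_field_offset(width, n):
    """Column store: pure arithmetic, identical for every core, so
    adjacent cores read adjacent addresses (coalesced access)."""
    return width * n

widths = [4, 13, 8, 2]   # hypothetical: int4, 13-byte varlena, float8, ...
print(row_field_offset(widths, 3))    # 25: depends on actual data widths
print(column_field_offset(8, 3))      # 24: one multiply, data-independent
```

This is the tension slides 37-40 explore: the column layout is ideal for the GPU, but producing and consuming it costs the CPU dearly.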
37. Nightmare of the columnar cache (1/2)
■ Columnar cache and row/column translation
postgres=# explain (analyze, costs off)
select * from t0 natural join t1 natural join t2;
QUERY PLAN
------------------------------------------------------------------------------
Custom (GpuHashJoin) (actual time=54.005..9635.134 rows=20000000 loops=1)
hash clause 1: (t0.aid = t1.aid)
hash clause 2: (t0.bid = t2.bid)
number of requests: 144
total time to load: 584.67ms
total time to materialize: 7245.14ms ← 70% of total execution time!!
average time in send-mq: 37us
average time in recv-mq: 0us
max time to build kernel: 1us
DMA send: 5197.80MB/sec, len: 2166.30MB, time: 416.77ms, count: 470
DMA recv: 5139.62MB/sec, len: 287.99MB, time: 56.03ms, count: 144
kernel exec: total: 441.71ms, avg: 3067us, count: 144
-> Custom (GpuScan) on t0 (actual time=4.011..584.533 rows=20000000 loops=1)
-> Custom (MultiHash) (actual time=31.102..31.102 rows=40000 loops=1)
hash keys: aid
-> Seq Scan on t1 (actual time=0.007..5.062 rows=40000 loops=1)
-> Custom (MultiHash) (actual time=17.839..17.839 rows=40000 loops=1)
hash keys: bid
-> Seq Scan on t2 (actual time=0.019..6.794 rows=40000 loops=1)
Execution time: 10525.754 ms
38. Nightmare of the columnar cache (2/2)
[Breakdown of execution time]
• Translation from the table (row format) to the columnar cache (only once)
• Hash-Join: search for the relevant records on the inner/outer relations
• Translation from PG-Strom's internal column format to TupleTableSlot
  (row format) for data exchange inside PostgreSQL
"You cannot see the wood for the trees."
Optimization for the GPU also adds a non-ignorable data-format translation
cost at the interface with PostgreSQL.
39. A significant point in GPU acceleration (1/2)
• The CPU is the scarce resource, so we should put less workload on CPUs, unlike GPUs
• We need to pay attention to the memory access pattern of the CPU's job
In case of column format (with the T-Tree columnar cache):
  ● Task of CPU: row→column translation on cache construction, done only
    once – very heavy load
  ● Task of PCI-E: thanks to the column format, only the referenced data
    is transferred
  ● Task of GPU: operations on column-format data, optimized for the
    GPU's nature
  ● Task of CPU: column→row translation every time results are returned
    to PostgreSQL – very heavy load
40. A significant point in GPU acceleration (2/2)
• The CPU is the scarce resource, so we should put less workload on CPUs, unlike GPUs
• We need to pay attention to the memory access pattern of the CPU's job
In case of row format (no own columnar cache):
  ● Task of CPU: visibility check prior to the DMA transfer – light load
  ● Task of PCI-E: due to the row format, all data is transferred – a
    little heavy load
  ● Task of GPU: operations on row data; although not the optimal format,
    the massive number of cores covers the performance
  ● Task of CPU: set pointers into the result buffer – very light load
41. Integration with the shared buffer of PostgreSQL
postgres=# explain (analyze, costs off)
select * from t0 natural join t1 natural join t2;
QUERY PLAN
-----------------------------------------------------------
Custom (GpuHashJoin) (actual time=111.085..4286.562 rows=20000000 loops=1)
hash clause 1: (t0.aid = t1.aid)
hash clause 2: (t0.bid = t2.bid)
number of requests: 145
total time for inner load: 29.80ms
total time for outer load: 812.50ms
total time to materialize: 1527.95ms ← reduced dramatically
average time in send-mq: 61us
average time in recv-mq: 0us
max time to build kernel: 1us
DMA send: 5198.84MB/sec, len: 2811.40MB, time: 540.77ms, count: 619
DMA recv: 3769.44MB/sec, len: 2182.02MB, time: 578.87ms, count: 290
proj kernel exec: total: 264.47ms, avg: 1823us, count: 145
main kernel exec: total: 622.83ms, avg: 4295us, count: 145
-> Custom (GpuScan) on t0 (actual time=5.736..812.255 rows=20000000 loops=1)
-> Custom (MultiHash) (actual time=29.766..29.767 rows=80000 loops=1)
hash keys: aid
-> Seq Scan on t1 (actual time=0.005..5.742 rows=40000 loops=1)
-> Custom (MultiHash) (actual time=16.552..16.552 rows=40000 loops=1)
hash keys: bid
-> Seq Scan on t2 (actual time=0.022..7.330 rows=40000 loops=1)
Execution time: 5161.017 ms ← much better response time, though GPU-side
performance degrades a little bit
42. PG-Strom's current status
■ Supported logics
  ● GpuScan ... full-table scan with qualifiers
  ● GpuHashJoin ... Hash-Join with GPUs
  ● GpuPreAgg ... preprocessing of aggregation with GPUs
■ Supported data types
  ● Integer (smallint, integer, bigint)
  ● Floating point (real, float)
  ● String (text, varchar(n), char(n))
  ● Date and time (date, time, timestamp)
  ● NUMERIC
■ Supported functions
  ● Arithmetic operators for the above data types
  ● Comparison operators for the above data types
  ● Mathematical functions on floating-point values
  ● Aggregates: MIN, MAX, SUM, AVG, STD, VAR, CORR
43. PG-Strom's future
■ Logics to be supported
  ● Sort
  ● Aggregate push-down
  ● ... anything else?
■ Data types to be supported
  ● ... anything else?
■ Functions to be supported
  ● LIKE, regular expressions?
  ● Date/time extraction?
  ● PostGIS?
  ● Geometrics?
  ● User-defined functions? (like PgOpenCL)
■ Parallel Scan
[Diagram] Currently a single CPU core scans the shared buffer from block-0
to block-N; in the future, range partitioning will let several cores scan
block-0..N/3, N/3..2N/3 and 2N/3..N in parallel. It is not easy to supply
the GPU with enough data from a single CPU core, even with an on-memory
store.
44. Our target segment
■ 1st target application: BI tools are expected
■ Up to 1.0TB of data: SMB companies, or branches of a large company
  ● because it needs to keep the data in-memory
■ GPU and PostgreSQL: both of them are inexpensive
[Diagram] Positioning by data size and response time: above 1.0TB, high-end
data analysis appliances (and perhaps Hadoop) serve batch processing with
large response times; below 1.0TB, commercial or OSS RDBMS serve small
response-time / real-time workloads. PG-Strom has a cost advantage against
the appliances and a performance advantage against a plain RDBMS.
45. (Ref) ROLAP and MOLAP
■ Expected PG-Strom usage
  ● Backend RDBMS for ROLAP / acceleration of ad-hoc queries
  ● Cube construction for MOLAP / acceleration of batch processing
[Diagram] ERP, SCM, CRM and Finance systems live in the world of
transactions (OLTP); their data moves via ETL into the DWH, the world of
analytics (OLAP). ROLAP: the DB summarizes the data on demand for BI tools.
MOLAP: BI tools reference information cubes constructed in advance.
46. Expected Usage (1/2) – OLTP/OLAP Integration
■ Why OLTP/OLAP systems are separated
  ● Integration of multiple data sources
  ● Optimization for analytic workloads
■ How PG-Strom will improve this
  ● Parallel execution of Join / Aggregate, the key to performance
  ● Elimination or reduction of the OLAP database / ETL, thus smaller TCO
[Diagram] Today, ERP/CRM/SCM feed an OLTP database, and ETL loads an OLAP
database holding OLAP cubes that BI tools query. With PG-Strom, BI tools
query the master / fact tables directly; the performance keys are joining
master and fact tables, and the aggregation functions.
47. Expected Usage (2/2) – Let's investigate with us
■ Which regions can PG-Strom work for?
■ Which workloads does PG-Strom fit?
■ What workloads do you care about?
→ The PG-Strom Project wants to learn from the field.
48. How to use (1/3) – Installation Prerequisites
■ OS: Linux (RHEL 6.x was validated)
■ PostgreSQL 9.5devel (with the Custom-Plan Interface)
■ PG-Strom module
■ OpenCL driver (like the NVIDIA run-time)

Minimum configuration of PG-Strom (postgresql.conf):
  shared_preload_libraries = '$libdir/pg_strom'
  shared_buffers = <enough size to load the whole database>

Turn PG-Strom on/off at run-time:
  postgres=# SET pg_strom.enabled = on;
  SET
49. How to use (2/3) – Build, Install and Startup
[kaigai@saba ~]$ git clone https://github.com/pg-strom/devel.git pg_strom
[kaigai@saba ~]$ cd pg_strom
[kaigai@saba pg_strom]$ make && make install
[kaigai@saba pg_strom]$ vi $PGDATA/postgresql.conf
[kaigai@saba ~]$ pg_ctl start
server starting
[kaigai@saba ~]$ LOG: registering background worker "PG-Strom OpenCL Server"
LOG: starting background worker process "PG-Strom OpenCL Server"
LOG: database system was shut down at 2014-11-09 17:45:51 JST
LOG: autovacuum launcher started
LOG: database system is ready to accept connections
LOG: PG-Strom: [0] OpenCL Platform: NVIDIA CUDA
LOG: PG-Strom: (0:0) Device GeForce GTX 980 (1253MHz x 16units, 4095MB)
LOG: PG-Strom: (0:1) Device GeForce GTX 750 Ti (1110MHz x 5units, 2047MB)
LOG: PG-Strom: [1] OpenCL Platform: Intel(R) OpenCL
LOG: PG-Strom: Platform "NVIDIA CUDA (OpenCL 1.1 CUDA 6.5.19)" was installed
LOG: PG-Strom: Device "GeForce GTX 980" was installed
LOG: PG-Strom: shmem 0x7f447f6b8000-0x7f46f06b7fff was mapped (len: 10000MB)
LOG: PG-Strom: buffer 0x7f34592795c0-0x7f44592795bf was mapped (len: 65536MB)
LOG: Starting PG-Strom OpenCL Server
LOG: PG-Strom: 24 of server threads are up
50. How to use (3/3) – Deployment on AWS
Search by "strom"!
AWS GPU instance (g2.2xlarge):
  CPU: Xeon E5-2670 (8 vCPU)
  RAM: 15GB
  GPU: NVIDIA GRID K2 (1536 cores)
  Storage: 60GB of SSD
  Price: $0.898/hour, $646.56/month
  (*) price of an on-demand instance in the Tokyo region as of Nov-2014
51. Future of PG-Strom
52. Dilemma of Innovation
SOURCE: The Innovator's Dilemma, Clayton M. Christensen
[Diagram] Product performance over time: the performance demanded at the
high end of the market vs. the performance demanded at the low end of the
market or in a new emerging segment.
53. Move forward with the community
54. (Additional comment after the conference)
Oops, I forgot to say it at the conference!
PG-Strom is open source. Check it out!
https://github.com/pg-strom/devel