2. Self Introduction
▌Name: KaiGai Kohei
▌Company: NEC OSS Promotion Center
▌Like: Processor with many cores
▌Dislike: Processor with little cores
▌Background:
HPC OSS/Linux SAP GPU/PostgreSQL
▌Tw: @kkaigai
▌My Jobs
SELinux (2004~)
• Lockless AVC、JFFS2 XATTR, ...
PostgreSQL (2006~)
• SE-PostgreSQL, Security Barrier View, Writable FDW, ...
PG-Strom (2012~)
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 2
3. Approach for Performance Improvement
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 3
Scale-Out
Scale-Up
Homogeneous Scale-up
Heterogeneous Scale-Up
+
4. Characteristics of GPU (Graphic Processor Unit)
▌Characteristics
Larger percentage of ALUs on chip
Relatively smaller percentage of cache
and control logic
Advantages to simple calculation in
parallel, but not complicated logic
Much higher number of cores per price
• GTX750Ti (640core) with $150
GPU CPU
Model Nvidia Tesla K20X
Intel Xeon
E5-2670 v3
Architecture Kepler Haswell
Launch Nov-2012 Sep-2014
# of transistors 7.1billion 3.84billion
# of cores 2688 (simple) 12 (functional)
Core clock 732MHz
2.6GHz,
up to 3.5GHz
Peak Flops
(single precision)
3.95TFLOPS
998.4GFLOPS
(with AVX2)
DRAM size 6GB, GDDR5
768GB/socket,
DDR4
Memory band 250GB/s 68GB/s
Power
consumption
235W 135W
Price $3,000 $2,094
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 4
SOURCE: CUDA C Programming Guide (v6.5)
5. How GPU works – Example of reduction algorithm
●item[0]
step.1 step.2 step.4step.3
Computing
the sum of array:
𝑖𝑡𝑒𝑚[𝑖]
𝑖=0…𝑁−1
with N-cores of GPU
◆
●
▲ ■ ★
● ◆
●
● ◆ ▲
●
● ◆
●
● ◆ ▲ ■
●
● ◆
●
● ◆ ▲
●
● ◆
●
item[1]
item[2]
item[3]
item[4]
item[5]
item[6]
item[7]
item[8]
item[9]
item[10]
item[11]
item[12]
item[13]
item[14]
item[15]
Total sum of items[]
with log2N steps
Inter core synchronization by HW support
6. Semiconductor Trend (1/2) – Towards Heterogeneous
▌Moves to CPU/GPU integrated architecture from multicore CPU
▌Free lunch for SW by HW improvement will finish soon
No utilization of semiconductor capability, unless SW is not designed
with conscious of HW characteristics.
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 6
SOURCE: THE HEART OF AMD INNOVATION, Lisa Su, at AMD Developer Summit 2013
7. Semiconductor Trend (2/2) – Dark Silicon Problem
▌Background of CPU/GPU integrated architecture
Increase of transistor density > Reduction of power consumption
Unable to supply power for all the logic because of chip cooling
A chip has multiple logics with different features, to save the peak power
consumption.
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 7
SOURCE: Compute Power with Energy-Efficiency, Jem Davies, at AMD Fusion Developer Summit 2011
8. RDBMS and its bottleneck (1/2)
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 8
Storage
Processor
Data
RAM
Data Size > RAM Data Size < RAM
Storage
Processor
Data
RAM
In the future?
Processor
Wide
Band
RAM
Non-
volatile
RAM
Data
9. World of current memory bottleneck
Join, Aggregation, Sort, Projection, ...
[strategy]
• burstable access pattern
• parallel algorithm
World of traditional disk-i/o bottleneck
SeqScan, IndexScan, ...
[strategy]
• reduction of i/o (size, count)
• distribution of disk (RAID)
RDBMS and its bottleneck (2/2)
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 9
Processor
RAM
Storage
bandwidth:
multiple
hundreds GB/s
bandwidth:
multiple GB/s
10. PG-Strom
▌What is PG-Strom
An extension designed for PostgreSQL
Off-loads a part of SQL workloads to GPU for parallel/rapid execution
Three workloads are supported: Full-Scan, Hash-Join, Aggregate
(At the moment of Nov-2014, beta version)
▌Concept
Automatic GPU native code generation from SQL query, by JIT compile
Asynchronous execution with CPU/GPU co-operation.
▌Advantage
Works fully transparently from users
• It allows to use peripheral software of PostgreSQL, including SQL syntax,
backup, HA or drivers.
Performance improvement with GPU+OSS; very low cost
▌Attention
Right now, in-memory data store is assumed
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 10
11. PostgreSQL
PG-Strom
Architecture of PG-Strom (1/2)
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 11
GPU Code Generator
Storage
Storage Manager
Shared
Buffer
Query
Parser
Query
Optimizer
Query
Executor
SQL Query
Breaks down
the query to
parse tree
Makes query
execution plan
Run the query
Custom-Plan
APIs
GpuScan
GpuHashJoin
GpuPreAgg
GPU Program
Manager
PG-Strom
OpenCL
Server
Message Queue
12. Architecture of PG-Strom (2/2)
▌Elemental technology① OpenCL
Parallel computing framework on heterogeneous processors
Can use for CPU parallel, not only GPU of NVIDIA/AMD
Includes run-time compiler in the language specification
▌Elemental technology② Custom-Scan Interface
Feature to implement scan/join by extensions, as if it is built-in logic for
SQL processing in PostgreSQL.
A part of functionalities got merged to v9.5, discussions are in-progress
for full functionalities.
▌Elemental technology③ Row oriented data structure
Shared buffer of PostgreSQL as DMA source
Data format translation between row and column, even though column
format is optimal for GPU performance.
Page. 12 The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
13. Elemental technology① OpenCL (1/2)
▌Characteristics of OpenCL
Use abstracted “OpenCL device” for CPU/MIC, not only GPUs
• Even though it follows characteristics of GPUs below....
Three types of memory layer (global, local, constant)
Concept of workgroup; synchronous execution inter-threads
Just-in-time compile from source code like C-language to
the platform specific native code
▌Comparison with CUDA
We don’t need to develop JIT compile functionality by ourselves, and
want more people to try, so adopted OpenCL at this moment.
Page. 13
CUDA OpenCL
Advantage • Detailed optimization
• Latest feature of NVIDIA GPU
• Driver stability
• Multiplatform support
• Built-in JIT compiler
• CPU parallel support
Issues • Unavailable on AMD, Intel • Driver stability
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
14. Elemental technology① OpenCL (2/2)
Page. 14
GPU
Source code
(text)
OpenCL runtime
OpenCL Interface
OpenCL runtime OpenCL runtime
OpenCL Devices
N x computing units
Global Memory
Local Memory
Constant Memory
cl_program
object
cl_kernel
object
DMA
buffer-3
DMA
buffer-2
DMA
buffer-1clBuildProgram()
clCreateKernel()
Command
Queue
Just-in-Time
Compile Look-up
kernel
functions Input a pair of kernel and
data into command queue
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
15. Automatic GPU native code generation
Page. 15
postgres=# SET pg_strom.show_device_kernel = on;
SET
postgres=# EXPLAIN (verbose, costs off) SELECT cat, avg(x) from t0 WHERE x < y GROUP BY cat;
QUERY PLAN
--------------------------------------------------------------------------------------------
HashAggregate
Output: cat, pgstrom.avg(pgstrom.nrows(x IS NOT NULL), pgstrom.psum(x))
Group Key: t0.cat
-> Custom (GpuPreAgg)
Output: NULL::integer, cat, NULL::integer, NULL::integer, NULL::integer, NULL::integer,
NULL::double precision, NULL::double precision, NULL::text,
pgstrom.nrows(x IS NOT NULL), pgstrom.psum(x)
Bulkload: On
Kernel Source: #include "opencl_common.h"
#include "opencl_gpupreagg.h"
#include "opencl_textlib.h"
: <...snip...>
static bool
gpupreagg_qual_eval(__private cl_int *errcode,
__global kern_parambuf *kparams,
__global kern_data_store *kds,
__global kern_data_store *ktoast,
size_t kds_index)
{
pg_float8_t KVAR_7 = pg_float8_vref(kds,ktoast,errcode,6,kds_index);
pg_float8_t KVAR_8 = pg_float8_vref(kds,ktoast,errcode,7,kds_index);
return EVAL(pgfn_float8lt(errcode, KVAR_7, KVAR_8));
}
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
16. How PG-Strom processes SQL workloads① – Hash-Join
Page. 16
Inner
relation
Outer
relation
Inner
relation
Outer
relation
Hash table Hash table
Next step Next step
All CPU does is
just references
the result of
relations join
Hash table
search by CPU
Projection
by CPU
Parallel
Projection
Parallel Hash-
table search
Existing Hash-Join implementation GpuHashJoin implementation
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
17. Benchmark (1/2) – Simple Tables Join
[condition of measurement]
▌ INNER JOIN of 200M rows x 100K rows x 10K rows ... with increasing number of tables
▌ Query in use: SELECT * FROM t0 natural join t1 [natural join t2 ...];
▌ All the tables are preliminary loaded
▌ HW: Express5800 HR120b-1, CPU: Xeon E5-2640, RAM: 256GB, GPU: NVIDIA GTX980
Page. 17
178.7
334.0
513.6
713.1
941.4
1181.3
1452.2
1753.4
29.1 30.5 35.6 36.6 43.6 49.5 50.1 60.6
0
200
400
600
800
1000
1200
1400
1600
1800
2000
2 3 4 5 6 7 8 9
Queryresponsetime
number of joined tables
Simple Tables Join
PostgreSQL PG-Strom
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
18. How PG-Strom processes SQL workloads② – Aggregate
▌GpuPreAgg, prior to Aggregate/Sort, reduces num of rows to be processed by CPU
▌Not easy to apply aggregate on all the data at once because of GPU’s RAM size,
GPU has advantage to make “partial aggregate” for each million rows.
▌Key performance factor is reducing the job of CPU
GroupAggregate
Sort
SeqScan
Tbl_1
Result Set
several rows
several millions rows
several millions rows
count(*), avg(X), Y
X, Y
X, Y
X, Y, Z (table definition)
GroupAggregate
Sort
GpuScan
Tbl_1
Result Set
several rows
several hundreds rows
several millions rows
sum(nrows),
avg_ex(nrows, psum), Y
Y, nrows, psum_X
X, Y
X, Y, Z (table definition)
GpuPreAgg Y, nrows, psum_X
several hundreds rows
SELECT count(*), AVG(X), Y FROM Tbl_1 GROUP BY Y;
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQLPage. 18
19. Benchmark (2/2) – Aggregation + Tables Join
[condition of measurement]
▌ Aggregate and INNER JOIN of 200M rows x 100K rows x 10K rows ... with increasing number of tables
▌ Query in use:
SELECT cat, AVG(x) FROM t0 natural join t1 [natural join t2 ...] GROUP BY CAT;
▌ Other conditions are same with the previous measurement
Page. 19
157.4
238.2
328.3
421.7
525.5
619.3
712.8
829.2
44.0 43.2 42.8 45.0 48.5 50.8 61.0 66.7
0
100
200
300
400
500
600
700
800
900
1000
2 3 4 5 6 7 8 9
Queryresponsetime
number of joined tables
Aggregation + Tables Join
PostgreSQL PG-Strom
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
20. As an aside...
Let’s back to the development history prior to
the remaining elemental technology
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 20
21. Development History of PG-Strom
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 21
2011 Feb KaiGai moved to Germany
May PGconf 2011
2012 Jan The first prototype
May Tried to use pseudo code
Aug Proposition of background worker
and writable FDW
2013 Jul NEC admit PG-Strom development
Nov Proposition of CustomScan API
2014 Feb Moved to OpenCL from CUDA
Jun GpuScan, GpuHashJoin and GpuSort
were implemented
Sep Drop GpuSort, instead of GpuPreAgg
Nov working beta-1 gets available
First
prototype
announced
“Strom” comes from Germany term; where I lived in at that time
22. Inspiration – May-2011
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 22
http://ja.scribd.com/doc/44661593/PostgreSQL-OpenCL-Procedural-Language
PGconf 2011 @ Ottawa, Canada
You’re brave.
Unavailable to run on
regular query, not
only PL functions?
23. Inspired by PgOpencl Project
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 23
A B C D E
summary:
GPU works well towards large BLOB,
like image data.
Parallel Image Searching Using PostgreSQL PgOpenCL @ PGconf 2011
Tim Child (3DMashUp)
Even small width of values,
GPU will work well
if we have transposition.
24. Made a prototype – Christmas vacation in 2011
CUDA based ( Now, OpenCL)
FDW (Foreign Data Wrapper) based ( Now, CustomScan API)
Column oriented data structure ( Now, row-oriented data)
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 24
CPU
vanilla PostgreSQL PostgreSQL + PG-Strom
Iterationofrowfetch
andevaluation
multiple rows
at once
Async DMA &
Async Kernek Exec
CUDA Compiler
: Fetch a row from buffer : Evaluation of WHERE clause
CPU GPU
Synchronization
SELECT * FROM table WHERE sqrt((x-256)^2 + (y-100)^2) < 10;
code for
GPU
Auto GPU code generation,
Just-in-time compile
Query response time
reduction
GPU
Device
25. A set of APIs to show external data source as like a table of PostgreSQL
FDW driver generates rows on demand when foreign tables are referenced.
Extension can have arbitrary data structure and logic to process the data.
It is also allowed to scan internal data with GPU acceleration!
What is FDW (Foreign Data Wrapper)
QueryExecutor
Regular
Table
Foreign
Table
Foreign
Table
Foreign
Table
File
FDW
Oracle
FDW
PG-Strom
FDW
storage
Regular
Table
storage
QueryPlanner
QueryParser
Exec
Exec
Exec
Exec
SQL
Query
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 25
CSV Files
....... .... .... ...
... ... . ... ... .
.... .. .... .. . ..
.. . . . .. .
Other
DBMS
26. Implementation at that time
▌Problem
Only foreign tables can be accelerated
with GPUs.
Only full table-scan can be supported.
Foreign tables were read-only at that
time.
Slow-down of 1st time query because of
device initialization.
summary: Not easy to use
PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module26
ForeignTable(pgstrom)
value a[]
rowmap
value b[]
value c[]
value d[] <not used>
<not used>
Table: my_schema.ft1.b.cs
10300 {10.23, 7.54, 5.43, … }
10100 {2.4, 5.6, 4.95, … }
② Calculation ① Transfer
③ Write-Back
Table: my_schema.ft1.c.cs
{‘2010-10-21’, …}
{‘2011-01-23’, …}
{‘2011-08-17’, …}
10100
10200
10300
Shadow tables on behalf of each column of the
foreign table managed by PG-Strom.
Each items are stored closely in physical, using large
array data type of element data.
27. PostgreSQL Enhancement (1/2) – Writable FDW
▌Foreign tables had supported read access only (~v9.2)
Data load onto shadow tables was supported using special function
Enhancement of APIs for INSERT/UPDATE/DELETE on foreign tables
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 27
PostgreSQL
Data Source
Query
Planning
SELECT DELETE UPDATE INSERT
FDW Driver
Enhancement
of this
interface
28. PostgreSQL Enhancement (2/2) – Background Worker
▌Slowdown on first time of each query
Time for JIT compile of GPU code caching the built binaries
Time for initialization of GPU devices initialization once by worker process
Good side effect: it works with run-time that doesn’t support concurrent usage.
▌Background Worker
An interface allows to launch background processes managed by extensions
Dynamic registration gets supported on v9.4 also.
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 28
PostmasterProcess
PostgreSQL
Backend
Process
Built-in
BG workers
Extension’s
BG workers
fork(2) on
connection
fork(2) on
startup
IPC:sharedmemory,...
fork(2) on
startup /
demand
5432
Port
29. Design Pivot
▌Is the FDW really appropriate framework for PG-Strom?
It originated from a feature to show external data as like a table.
[benefit]
• It is already supported on PostgreSQL
[Issue]
• Less transparency for users / applications
• More efficient execution path than GPU; like index-scan if available
• Workloads expects for full-table scan; like tables join
▌CUDA or OpenCL?
First priority is many trials by wider range of people
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 29
CUDA OpenCL
Good • Fine grained optimization
• Latest feature of NVIDIA GPU
• Reliability of OpenCL drivers
• Multiplatform support
• Built-in JIT compiler
• CPU parallel support
Bad • No support on AMD, Intel • Reliability of OpenCL drivers
30. Elemental Technology② – Custom-Scan Interface
Page. 30
Table.A
Table to be scanned
clause
id > 200
Path①
SeqScan
cost=400
Path②
TidScan
unavailable
Path③
IndexScan
cost=10
Path④
CustomScan
(GpuScan)
cost=40
Built-in
Logics
Table.X Table.Y
Tables to be joined
Path①
NestLoop
cost=8000
Path②
HashJoin
cost=800
Path③
MergeJoin
cost=1200
Path④
CustomScan
(GpuHashJoin)
cost=500
Built-in
Logics
clause
x.id = y.id
Path③
IndexScan
cost=10
Table.A
WINNER!
Table.X Table.Y
Path④
CustomScan
(GpuHashJoin)
cost=500
clause
x.id = y.id
WINNER!
It allows extensions to
offer alternative scan or
join logics, in addition to
the built-in ones
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
clause
id > 200
31. Yes!! Custom-Scan Interface got merged to v9.5
Page. 31 The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Minimum consensus is, an interface that
allows extensions to implement custom
logic to replace built-in scan logics.
↓
Then, we will discuss the enhancement
of the interface for tables join on the
developer’s community.
32. Enhancement of Custom-Scan Interface
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQLPage. 32
Table.X Table.Y
Tables to be joined
Path ①
NestLoop
Path ②
HashJoin
Path ③
MergeJoin
Path ④
CustomScan
(GpuHashJoin)
Built-in
logics
clause
x.id = y.id
X Y X Y
NestLoop HashJoin
X Y
MergeJoin
X Y
GpuHashJoin Result set
of tables join
It looks like a relation scan
on the result set of
tables join
It replaces a node to be join by foreign-/custom-scan.
Example: A FDW that scan on the result set of remote join.
▌Join replacement by foreign-/custom-scan
33. Abuse of Custom-Scan Interface
▌Usual interface contract
GpuScan/SeqScan fetches rows, then passed to upper node and more ...
TupleTableSlot is used to exchange record row by row manner.
▌Beyond the interface contract
In case when both of parent and child nodes are managed by same extension,
nobody prohibit to exchange chunk of data with its own data structure.
“Bulkload: On” multiple (usually, 10K~100K) rows at once
Page. 33
postgres=# EXPLAIN SELECT * FROM t0 NATURAL JOIN t1;
QUERY PLAN
-----------------------------------------------------------------------------
Custom (GpuHashJoin) (cost=2234.00..468730.50 rows=19907950 width=106)
hash clause 1: (t0.aid = t1.aid)
Bulkload: On
-> Custom (GpuScan) on t0 (cost=500.00..267167.00 rows=20000024 width=73)
-> Custom (MultiHash) (cost=734.00..734.00 rows=40000 width=37)
hash keys: aid
Buckets: 46000 Batches: 1 Memory Usage: 99.97%
-> Seq Scan on t1 (cost=0.00..734.00 rows=40000 width=37)
Planning time: 0.220 ms
(9 rows)
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
34. PostgreSQL
PG-Strom
Element Technology③ – Row oriented data structure
Page. 34
Storage
Storage Manager
Shared
Buffer
Query
Parser
Query
Optimizer
Query
Executor
SQL Query
Breaks down
the query to
parse tree
Makes query
execution plan
Run the query
Custom-Plan
APIs
GpuScan
GpuHashJoin
GpuPreAgg
DMA Transfer
(via PCI-E bus)
GPU Program
Manager
PG-Strom
OpenCL
Server
Message Queue
T-Tree
Columnar
Cache
GPU Code Generator
Cache
construction
(rowcolumn)
Materialize
the result
(columnrow)
Prior implementation
had table cache
with column-oriented
data structure
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
35. Why GPU well fit column-oriented data structure
Page. 35
SOURCE: Maxwell: The Most Advanced CUDA GPU Ever Made
Core Core Core Core Core Core Core Core Core Core
SOURCE: How to Access Global Memory Efficiently in CUDA C/C++ Kernels
coalesced memory access
Global Memory (DRAM)
Wide Memory
Bandwidth
(256-384bits)
WARP:
A set of processor cores
that share instruction
pointer. It is usually 32.
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
36. Format of PostgreSQL tuples
Page. 36
struct HeapTupleHeaderData
{
union
{
HeapTupleFields t_heap;
DatumTupleFields t_datum;
} t_choice;
/* current TID of this or newer tuple */
ItemPointerData t_ctid;
/* number of attributes + various flags */
uint16 t_infomask2;
/* various flag bits, see below */
uint16 t_infomask;
/* sizeof header incl. bitmap, padding */
uint8 t_hoff;
/* ^ - 23 bytes - ^ */
/* bitmap of NULLs -- VARIABLE LENGTH */
bits8 t_bits[1];
/* MORE DATA FOLLOWS AT END OF STRUCT */
};
HeapTupleHeader
NULL bitmap
(if has null)
OID of row
(if exists)
Padding
1st Column
2nd Column
4th Column
Nth Column
No data if NULL
Variable length fields
makes unavailable to
predicate offset of the
later fields.
No NULL Bitmap
if all the fields
are not NULL
We usually have
no OID of row
The worst data structure
for GPU processor
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
37. Nightmare of columnar cache (1/2)
▌列指向キャッシュと行列変換
Page. 37
postgres=# explain (analyze, costs off)
select * from t0 natural join t1 natural join t2;
QUERY PLAN
------------------------------------------------------------------------------
Custom (GpuHashJoin) (actual time=54.005..9635.134 rows=20000000 loops=1)
hash clause 1: (t0.aid = t1.aid)
hash clause 2: (t0.bid = t2.bid)
number of requests: 144
total time to load: 584.67ms
total time to materialize: 7245.14ms 70% of total execution time!!
average time in send-mq: 37us
average time in recv-mq: 0us
max time to build kernel: 1us
DMA send: 5197.80MB/sec, len: 2166.30MB, time: 416.77ms, count: 470
DMA recv: 5139.62MB/sec, len: 287.99MB, time: 56.03ms, count: 144
kernel exec: total: 441.71ms, avg: 3067us, count: 144
-> Custom (GpuScan) on t0 (actual time=4.011..584.533 rows=20000000 loops=1)
-> Custom (MultiHash) (actual time=31.102..31.102 rows=40000 loops=1)
hash keys: aid
-> Seq Scan on t1 (actual time=0.007..5.062 rows=40000 loops=1)
-> Custom (MultiHash) (actual time=17.839..17.839 rows=40000 loops=1)
hash keys: bid
-> Seq Scan on t2 (actual time=0.019..6.794 rows=40000 loops=1)
Execution time: 10525.754 ms
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
38. Nightmare of columnar cache (2/2)
Page. 38
Translation from table
(row-format) to the
columnar-cache
(only at once)
Translation from PG-
Strom internal
(column-format) to
TupleTableSlot
(row-format) for data
exchange in PostgreSQL
[Hash-Join]
Search the relevant
records on inner/outer
relations.
Breakdown of
execution time
You cannot see the wood for the trees
(木を見て森を見ず)
Optimization in GPU also makes additional data-format
translation cost (not ignorable) on the interface of PostgreSQL
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
39. A significant point in GPU acceleration (1/2)
Page. 39
• CPU is rare resource, so should put less workload on CPUs unlike GPUs
• We need to pay attention on access pattern of memory for CPU’s job
Storage
Shared
Buffer
T-Tree
columnar
cache
Result
Buffer
TupleTableSlot
Task of CPU
RowColumn translation
on cache construction
Very heavy loads
CPUのタスク
キャッシュ構築の際に、
一度だけ行列変換
極めて高い負荷
Task of PCI-E
Due to column-format,
only referenced data
shall be transformed.
Task of CPU
ColumnRow translation
everytime when we returns
the result to PostgreSQL.
Very heavy loads
Task of GPU
Operations on the data
with column-format.
It is optimized data
according to GPU’s nature
In case of column-format
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
40. A significant point in GPU acceleration (2/2)
Page. 40
• CPU is rare resource, so should put less workload on CPUs unlike GPUs
• We need to pay attention on access pattern of memory for CPU’s job
Storage
Shared
Buffer
Result
Buffer
TupleTableSlot
Task of CPU
Visibility check prior to
DMA translation
Light Load
CPUのタスク
キャッシュ構築の際に、
一度だけ行列変換
極めて高い負荷
Task of PCI-E
Due to row-format,
all data shall be transformed.
A little heavy load
Task of CPU
Set pointer of the result
buffer
Very Light Load
Task of GPU
Operations on row-data.
Although not an optimal data,
massive cores covers its
performance
In case of row-format
No own
columnar
cache
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
41. Integration with shared-buffer of PostgreSQL
▌列指向キャッシュと行列変換
Page. 41
postgres=# explain (analyze, costs off)
select * from t0 natural join t1 natural join t2;
QUERY PLAN
-----------------------------------------------------------
Custom (GpuHashJoin) (actual time=111.085..4286.562 rows=20000000 loops=1)
hash clause 1: (t0.aid = t1.aid)
hash clause 2: (t0.bid = t2.bid)
number of requests: 145
total time for inner load: 29.80ms
total time for outer load: 812.50ms
total time to materialize: 1527.95ms Reduction dramatically
average time in send-mq: 61us
average time in recv-mq: 0us
max time to build kernel: 1us
DMA send: 5198.84MB/sec, len: 2811.40MB, time: 540.77ms, count: 619
DMA recv: 3769.44MB/sec, len: 2182.02MB, time: 578.87ms, count: 290
proj kernel exec: total: 264.47ms, avg: 1823us, count: 145
main kernel exec: total: 622.83ms, avg: 4295us, count: 145
-> Custom (GpuScan) on t0 (actual time=5.736..812.255 rows=20000000 loops=1)
-> Custom (MultiHash) (actual time=29.766..29.767 rows=80000 loops=1)
hash keys: aid
-> Seq Scan on t1 (actual time=0.005..5.742 rows=40000 loops=1)
-> Custom (MultiHash) (actual time=16.552..16.552 rows=40000 loops=1)
hash keys: bid
-> Seq Scan on t2 (actual time=0.022..7.330 rows=40000 loops=1)
Execution time: 5161.017 ms Much better response time
Performance degrading little bit
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
42. PG-Strom’s current
▌Supported logics
GpuScan ... Full-table scan with qualifiers
GpuHashJoin ... Hash-Join with GPUs
GpuPreAgg ... Preprocess of aggregation with GPUs
▌Supported data-types
Integer (smallint, integer, bigint)
Floating-point (real, float)
String (text, varchar(n), char(n))
Date and time (date, time, timestamp)
NUMERIC
▌Supported functions
arithmetic operators for above data-types
comparison operators for above data-types
mathematical functions on floating-points
Aggregate: MIN, MAX, SUM, AVG, STD, VAR, CORR
Page. 42 The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
43. PG-Strom’s future
▌Logics will be supported
Sort
Aggregate Push-down
... anything else?
▌Data types will be supported
... anything else?
▌Functions will be supported
LIKE, Regular Expression?
Date/Time extraction?
PostGIS?
Geometrics?
User defined functions?
(like pgOpenCL)
▌Parallel Scan
Page. 43
Shared
Buffer
block-0 block-Nblock-N/3 block-2N/3
Range
Partitioning &
Parallel Scan
Current
Future
Not easy to supply GPU enough data with
single CPU core in spite of on-memory store.
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
44. Our target segment
▌1st target application: BI tools are expected
▌Up-to 1.0TB data: SMB company or branches of large company
Because it needs to keep the data in-memory
▌GPU and PostgreSQL: both of them are inexpensive.
Page. 44
High-end
data analysis
appliance
Commercial or
OSS RDBMS
large response-time / batch process small response-time / real-time
smallerdata-sizelargerdata-size
< 1.0TB
> 1.0TB
Cost
advantage
Performance
advantage
PG-Strom
Hadoop?
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
45. (Ref) ROLAP and MOLAP
▌Expected PG-Strom usage
Backend RDBMS for ROLAP / acceleration of ad-hoc queries
Cube construction for MOLAP / acceleration of batch processing
Page. 45
ERP
SCM
CRM
Finance
DWH
DWH
world of transaction
(OLTP)
world of analytics
(OLAP)
ETL
ROLAP:
DB summarizes the data
on demand by BI tools
MOLAP:
BI tools reference
information cube;
preliminary constructed
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
46. Expected Usage (1/2) – OLTP/OLAP Integration
▌Why separated OLTP/OLAP system
Integration of multiple data source
Optimization for analytic workloads
▌How PG-Strom will improve...
Parallel execution of Join / Aggregate; being key of performance
Elimination or reduction of OLAP / ETL, thus smaller amount of TCO
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 46
ERPCRMSCM BI
OLTP
database
OLAP
database
ETL
OLAP CubesMaster / Fact Tables
BI
PG-Strom
Performance Key
• Join master and
fact tables
• Aggregation
functions
47. Expected Usage (2/2) – Let’s investigate with us
▌Which region PG-Strom can work for?
▌Which workloads PG-Strom shall fit?
▌What workloads you concern about?
PG-Strom Project want to lean from the field.
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 47
48. How to use (1/3) – Installation Prerequisites
▌OS: Linux (RHEL 6.x was validated)
▌PostgreSQL 9.5devel (with Custom-Plan Interface)
▌PG-Strom Module
▌OpenCL Driver (like nVIDIA run-time)
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 48
shared_preload_libraries = '$libdir/pg_strom‘
shared_buffers = <enough size to load whole of the database>
Minimum configuration of PG-Strom
postgres=# SET pg_strom.enabled = on;
SET
Turn on/off PG-Strom at run-time
49. How to use (2/3) – Build, Install and Starup
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 49
[kaigai@saba ~]$ git clone https://github.com/pg-strom/devel.git pg_strom
[kaigai@saba ~]$ cd pg_strom
[kaigai@saba pg_strom]$ make && make install
[kaigai@saba pg_strom]$ vi $PGDATA/postgresql.conf
[kaigai@saba ~]$ pg_ctl start
server starting
[kaigai@saba ~]$ LOG: registering background worker "PG-Strom OpenCL Server"
LOG: starting background worker process "PG-Strom OpenCL Server"
LOG: database system was shut down at 2014-11-09 17:45:51 JST
LOG: autovacuum launcher started
LOG: database system is ready to accept connections
LOG: PG-Strom: [0] OpenCL Platform: NVIDIA CUDA
LOG: PG-Strom: (0:0) Device GeForce GTX 980 (1253MHz x 16units, 4095MB)
LOG: PG-Strom: (0:1) Device GeForce GTX 750 Ti (1110MHz x 5units, 2047MB)
LOG: PG-Strom: [1] OpenCL Platform: Intel(R) OpenCL
LOG: PG-Strom: Platform "NVIDIA CUDA (OpenCL 1.1 CUDA 6.5.19)" was installed
LOG: PG-Strom: Device "GeForce GTX 980" was installed
LOG: PG-Strom: shmem 0x7f447f6b8000-0x7f46f06b7fff was mapped (len: 10000MB)
LOG: PG-Strom: buffer 0x7f34592795c0-0x7f44592795bf was mapped (len: 65536MB)
LOG: Starting PG-Strom OpenCL Server
LOG: PG-Strom: 24 of server threads are up
50. How to use (3/3) – Deployment on AWS
Page. 50
Search by “strom” !
AWS GPU Instance (g2.2xlarge)
CPU Xeon E5-2670 (8 xCPU)
RAM 15GB
GPU NVIDIA GRID K2 (1536core)
Storage 60GB of SSD
Price $0.898/hour, $646.56/mon
(*) Price for on-demand instance
on Tokyo region at Nov-2014
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
51. Future of PG-Strom
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 51
52. Dilemma of Innovation
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 52
SOURCE: The Innovator's Dilemma, Clayton M. Christensen
Performance demanded
at the high end of the market
Performance demanded
at the low end of the marker
or in a new emerging segment
ProductPerformance
Time
53. Move forward with community
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 53
54. (Additional comment after the conference)
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 54
check it out!
https://github.com/pg-strom/devel
Oops, I forgot to say
in the conference!
PG-Strom is open source.