GPGPU Accelerates PostgreSQL (English)

GPGPU Accelerates PostgreSQL
NEC OSS Promotion Center
The PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
(Tw: @kkaigai)

Self Introduction
▌Name: KaiGai Kohei
▌Company: NEC OSS Promotion Center
▌Like: Processor with many cores
▌Dislike: Processor with little cores
▌Background:
HPC  OSS/Linux  SAP  GPU/PostgreSQL
▌Tw: @kkaigai
▌My Jobs
 SELinux (2004~)
• Lockless AVC、JFFS2 XATTR, ...
 PostgreSQL (2006~)
• SE-PostgreSQL, Security Barrier View, Writable FDW, ...
 PG-Strom (2012~)
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 2

Approach for Performance Improvement
Scale-Out
Scale-Up
Homogeneous Scale-up
Heterogeneous Scale-Up
+

Characteristics of GPU (Graphic Processor Unit)
▌Characteristics
 Larger percentage of ALUs on chip
 Relatively smaller percentage of cache
and control logic
Advantages to simple calculation in
parallel, but not complicated logic
 Much higher number of cores per price
• GTX750Ti (640core) with $150
GPU CPU
Model Nvidia Tesla K20X
Intel Xeon
E5-2670 v3
Architecture Kepler Haswell
Launch Nov-2012 Sep-2014
# of transistors 7.1billion 3.84billion
# of cores 2688 (simple) 12 (functional)
Core clock 732MHz
2.6GHz,
up to 3.5GHz
Peak Flops
(single precision)
3.95TFLOPS
998.4GFLOPS
(with AVX2)
DRAM size 6GB, GDDR5
768GB/socket,
DDR4
Memory band 250GB/s 68GB/s
Power
consumption
235W 135W
Price $3,000 $2,094
SOURCE: CUDA C Programming Guide (v6.5)

How GPU works – Example of reduction algorithm
●item[0]
step.1 step.2 step.4step.3
Computing
the sum of array:
𝑖𝑡𝑒𝑚[𝑖]
𝑖=0…𝑁−1
with N-cores of GPU
◆
●
▲ ■ ★
● ◆
●
● ◆ ▲
●
● ◆
●
● ◆ ▲ ■
●
● ◆
●
● ◆ ▲
●
● ◆
●
item[1]
item[2]
item[3]
item[4]
item[5]
item[6]
item[7]
item[8]
item[9]
item[10]
item[11]
item[12]
item[13]
item[14]
item[15]
Total sum of items[]
with log2N steps
Inter core synchronization by HW support

Semiconductor Trend (1/2) – Towards Heterogeneous
▌Moves to CPU/GPU integrated architecture from multicore CPU
▌Free lunch for SW by HW improvement will finish soon
 No utilization of semiconductor capability, unless SW is not designed
with conscious of HW characteristics.
SOURCE: THE HEART OF AMD INNOVATION, Lisa Su, at AMD Developer Summit 2013

Semiconductor Trend (2/2) – Dark Silicon Problem
▌Background of CPU/GPU integrated architecture
 Increase of transistor density > Reduction of power consumption
 Unable to supply power for all the logic because of chip cooling
A chip has multiple logics with different features, to save the peak power
consumption.
SOURCE: Compute Power with Energy-Efficiency, Jem Davies, at AMD Fusion Developer Summit 2011

RDBMS and its bottleneck (1/2)
Storage
Processor
Data
RAM
Data Size > RAM Data Size < RAM
Storage
Processor
Data
RAM
In the future?
Processor
Wide
Band
RAM
Non-
volatile
RAM
Data

World of current memory bottleneck
Join, Aggregation, Sort, Projection, ...
[strategy]
• burstable access pattern
• parallel algorithm
World of traditional disk-i/o bottleneck
SeqScan, IndexScan, ...
[strategy]
• reduction of i/o (size, count)
• distribution of disk (RAID)
RDBMS and its bottleneck (2/2)
Processor
RAM
Storage
bandwidth：
multiple
hundreds GB/s
bandwidth：
multiple GB/s

PG-Strom
▌What is PG-Strom
 An extension designed for PostgreSQL
 Off-loads a part of SQL workloads to GPU for parallel/rapid execution
 Three workloads are supported: Full-Scan, Hash-Join, Aggregate
(At the moment of Nov-2014, beta version)
▌Concept
 Automatic GPU native code generation from SQL query, by JIT compile
 Asynchronous execution with CPU/GPU co-operation.
▌Advantage
 Works fully transparently from users
• It allows to use peripheral software of PostgreSQL, including SQL syntax,
backup, HA or drivers.
 Performance improvement with GPU+OSS; very low cost
▌Attention
 Right now, in-memory data store is assumed

PostgreSQL
PG-Strom
Architecture of PG-Strom (1/2)
GPU Code Generator
Storage
Storage Manager
Shared
Buffer
Query
Parser
Query
Optimizer
Query
Executor
SQL Query
Breaks down
the query to
parse tree
Makes query
execution plan
Run the query
Custom-Plan
APIs
GpuScan
GpuHashJoin
GpuPreAgg
GPU Program
Manager
PG-Strom
OpenCL
Server
Message Queue

Architecture of PG-Strom (2/2)
▌Elemental technology① OpenCL
 Parallel computing framework on heterogeneous processors
 Can use for CPU parallel, not only GPU of NVIDIA/AMD
 Includes run-time compiler in the language specification
▌Elemental technology② Custom-Scan Interface
 Feature to implement scan/join by extensions, as if it is built-in logic for
SQL processing in PostgreSQL.
 A part of functionalities got merged to v9.5, discussions are in-progress
for full functionalities.
▌Elemental technology③ Row oriented data structure
 Shared buffer of PostgreSQL as DMA source
 Data format translation between row and column, even though column
format is optimal for GPU performance.
Page. 12 The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL

Elemental technology① OpenCL (1/2)
▌Characteristics of OpenCL
 Use abstracted “OpenCL device” for CPU/MIC, not only GPUs
• Even though it follows characteristics of GPUs below....
Three types of memory layer (global, local, constant)
Concept of workgroup; synchronous execution inter-threads
 Just-in-time compile from source code like C-language to
the platform specific native code
▌Comparison with CUDA
 We don’t need to develop JIT compile functionality by ourselves, and
want more people to try, so adopted OpenCL at this moment.
Page. 13
CUDA OpenCL
Advantage • Detailed optimization
• Latest feature of NVIDIA GPU
• Driver stability
• Multiplatform support
• Built-in JIT compiler
• CPU parallel support
Issues • Unavailable on AMD, Intel • Driver stability
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL

Elemental technology① OpenCL (2/2)
Page. 14
GPU
Source code
(text)
OpenCL runtime
OpenCL Interface
OpenCL runtime OpenCL runtime
OpenCL Devices
N x computing units
Global Memory
Local Memory
Constant Memory
cl_program
object
cl_kernel
object
DMA
buffer-3
DMA
buffer-2
DMA
buffer-1clBuildProgram()
clCreateKernel()
Command
Queue
Just-in-Time
Compile Look-up
kernel
functions Input a pair of kernel and
data into command queue

Automatic GPU native code generation
Page. 15
postgres=# SET pg_strom.show_device_kernel = on;
SET
postgres=# EXPLAIN (verbose, costs off) SELECT cat, avg(x) from t0 WHERE x < y GROUP BY cat;
QUERY PLAN
--------------------------------------------------------------------------------------------
HashAggregate
Output: cat, pgstrom.avg(pgstrom.nrows(x IS NOT NULL), pgstrom.psum(x))
Group Key: t0.cat
-> Custom (GpuPreAgg)
Output: NULL::integer, cat, NULL::integer, NULL::integer, NULL::integer, NULL::integer,
NULL::double precision, NULL::double precision, NULL::text,
pgstrom.nrows(x IS NOT NULL), pgstrom.psum(x)
Bulkload: On
Kernel Source: #include "opencl_common.h"
#include "opencl_gpupreagg.h"
#include "opencl_textlib.h"
: <...snip...>
static bool
gpupreagg_qual_eval(__private cl_int *errcode,
__global kern_parambuf *kparams,
__global kern_data_store *kds,
__global kern_data_store *ktoast,
size_t kds_index)
{
pg_float8_t KVAR_7 = pg_float8_vref(kds,ktoast,errcode,6,kds_index);
pg_float8_t KVAR_8 = pg_float8_vref(kds,ktoast,errcode,7,kds_index);
return EVAL(pgfn_float8lt(errcode, KVAR_7, KVAR_8));
}

How PG-Strom processes SQL workloads① – Hash-Join
Page. 16
Inner
relation
Outer
relation
Inner
relation
Outer
relation
Hash table Hash table
Next step Next step
All CPU does is
just references
the result of
relations join
Hash table
search by CPU
Projection
by CPU
Parallel
Projection
Parallel Hash-
table search
Existing Hash-Join implementation GpuHashJoin implementation

Benchmark (1/2) – Simple Tables Join
[condition of measurement]
▌ INNER JOIN of 200M rows x 100K rows x 10K rows ... with increasing number of tables
▌ Query in use: SELECT * FROM t0 natural join t1 [natural join t2 ...];
▌ All the tables are preliminary loaded
▌ HW: Express5800 HR120b-1, CPU: Xeon E5-2640, RAM: 256GB, GPU: NVIDIA GTX980
Page. 17
178.7
334.0
513.6
713.1
941.4
1181.3
1452.2
1753.4
29.1 30.5 35.6 36.6 43.6 49.5 50.1 60.6
0
200
400
600
800
1000
1200
1400
1600
1800
2000
2 3 4 5 6 7 8 9
Queryresponsetime
number of joined tables
Simple Tables Join
PostgreSQL PG-Strom

How PG-Strom processes SQL workloads② – Aggregate
▌GpuPreAgg, prior to Aggregate/Sort, reduces num of rows to be processed by CPU
▌Not easy to apply aggregate on all the data at once because of GPU’s RAM size,
GPU has advantage to make “partial aggregate” for each million rows.
▌Key performance factor is reducing the job of CPU
GroupAggregate
Sort
SeqScan
Tbl_1
Result Set
several rows
several millions rows
count(*), avg(X), Y
X, Y
X, Y
X, Y, Z (table definition)
GroupAggregate
Sort
GpuScan
Tbl_1
Result Set
several rows
several hundreds rows
sum(nrows),
avg_ex(nrows, psum), Y
Y, nrows, psum_X
X, Y
X, Y, Z (table definition)
GpuPreAgg Y, nrows, psum_X
several hundreds rows
SELECT count(*), AVG(X), Y FROM Tbl_1 GROUP BY Y;
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQLPage. 18

Benchmark (2/2) – Aggregation + Tables Join
[condition of measurement]
▌ Aggregate and INNER JOIN of 200M rows x 100K rows x 10K rows ... with increasing number of tables
▌ Query in use:
SELECT cat, AVG(x) FROM t0 natural join t1 [natural join t2 ...] GROUP BY CAT;
▌ Other conditions are same with the previous measurement
Page. 19
157.4
238.2
328.3
421.7
525.5
619.3
712.8
829.2
44.0 43.2 42.8 45.0 48.5 50.8 61.0 66.7
0
100
200
300
400
500
600
700
800
900
1000
2 3 4 5 6 7 8 9
Queryresponsetime
number of joined tables
Aggregation + Tables Join
PostgreSQL PG-Strom

As an aside...
Let’s back to the development history prior to
the remaining elemental technology

Development History of PG-Strom
2011 Feb KaiGai moved to Germany
May PGconf 2011
2012 Jan The first prototype
May Tried to use pseudo code
Aug Proposition of background worker
and writable FDW
2013 Jul NEC admit PG-Strom development
Nov Proposition of CustomScan API
2014 Feb Moved to OpenCL from CUDA
Jun GpuScan, GpuHashJoin and GpuSort
were implemented
Sep Drop GpuSort, instead of GpuPreAgg
Nov working beta-1 gets available
First
prototype
announced
“Strom” comes from Germany term; where I lived in at that time

Inspiration – May-2011
http://ja.scribd.com/doc/44661593/PostgreSQL-OpenCL-Procedural-Language
PGconf 2011 @ Ottawa, Canada
You’re brave.
Unavailable to run on
regular query, not
only PL functions?

Inspired by PgOpencl Project
A B C D E
summary:
GPU works well towards large BLOB,
like image data.
Parallel Image Searching Using PostgreSQL PgOpenCL @ PGconf 2011
Tim Child (3DMashUp)
Even small width of values,
GPU will work well
if we have transposition.

Made a prototype – Christmas vacation in 2011
 CUDA based ( Now, OpenCL)
 FDW (Foreign Data Wrapper) based ( Now, CustomScan API)
 Column oriented data structure ( Now, row-oriented data)
CPU
vanilla PostgreSQL PostgreSQL + PG-Strom
Iterationofrowfetch
andevaluation
multiple rows
at once
Async DMA &
Async Kernek Exec
CUDA Compiler
: Fetch a row from buffer : Evaluation of WHERE clause
CPU GPU
Synchronization
SELECT * FROM table WHERE sqrt((x-256)^2 + (y-100)^2) < 10;
code for
GPU
Auto GPU code generation,
Just-in-time compile
Query response time
reduction
GPU
Device

 A set of APIs to show external data source as like a table of PostgreSQL
 FDW driver generates rows on demand when foreign tables are referenced.
 Extension can have arbitrary data structure and logic to process the data.
 It is also allowed to scan internal data with GPU acceleration!
What is FDW (Foreign Data Wrapper)
QueryExecutor
Regular
Table
Foreign
Table
Foreign
Table
Foreign
Table
File
FDW
Oracle
FDW
PG-Strom
FDW
storage
Regular
Table
storage
QueryPlanner
QueryParser
Exec
Exec
Exec
Exec
SQL
Query
CSV Files
....... .... .... ...
... ... . ... ... .
.... .. .... .. . ..
.. . . . .. .
Other
DBMS

Implementation at that time
▌Problem
 Only foreign tables can be accelerated
with GPUs.
 Only full table-scan can be supported.
 Foreign tables were read-only at that
time.
 Slow-down of 1st time query because of
device initialization.
summary: Not easy to use
PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module26
ForeignTable(pgstrom)
value a[]
rowmap
value b[]
value c[]
value d[] <not used>
<not used>
Table: my_schema.ft1.b.cs
10300 {10.23, 7.54, 5.43, … }
10100 {2.4, 5.6, 4.95, … }
② Calculation ① Transfer
③ Write-Back
Table: my_schema.ft1.c.cs
{‘2010-10-21’, …}
{‘2011-01-23’, …}
{‘2011-08-17’, …}
10100
10200
10300
Shadow tables on behalf of each column of the
foreign table managed by PG-Strom.
Each items are stored closely in physical, using large
array data type of element data.

PostgreSQL Enhancement (1/2) – Writable FDW
▌Foreign tables had supported read access only (~v9.2)
 Data load onto shadow tables was supported using special function
Enhancement of APIs for INSERT/UPDATE/DELETE on foreign tables
PostgreSQL
Data Source
Query
Planning
SELECT DELETE UPDATE INSERT
FDW Driver
Enhancement
of this
interface

PostgreSQL Enhancement (2/2) – Background Worker
▌Slowdown on first time of each query
 Time for JIT compile of GPU code  caching the built binaries
 Time for initialization of GPU devices  initialization once by worker process
Good side effect: it works with run-time that doesn’t support concurrent usage.
▌Background Worker
 An interface allows to launch background processes managed by extensions
 Dynamic registration gets supported on v9.4 also.
PostmasterProcess
PostgreSQL
Backend
Process
Built-in
BG workers
Extension’s
BG workers
fork(2) on
connection
fork(2) on
startup
IPC:sharedmemory,...
fork(2) on
startup /
demand
5432
Port

Design Pivot
▌Is the FDW really appropriate framework for PG-Strom?
 It originated from a feature to show external data as like a table.
[benefit]
• It is already supported on PostgreSQL
[Issue]
• Less transparency for users / applications
• More efficient execution path than GPU; like index-scan if available
• Workloads expects for full-table scan; like tables join
▌CUDA or OpenCL?
 First priority is many trials by wider range of people
CUDA OpenCL
Good • Fine grained optimization
• Latest feature of NVIDIA GPU
• Reliability of OpenCL drivers
• Multiplatform support
• Built-in JIT compiler
• CPU parallel support
Bad • No support on AMD, Intel • Reliability of OpenCL drivers

Elemental Technology② – Custom-Scan Interface
Page. 30
Table.A
Table to be scanned
clause
id > 200
Path①
SeqScan
cost=400
Path②
TidScan
unavailable
Path③
IndexScan
cost=10
Path④
CustomScan
(GpuScan)
cost=40
Built-in
Logics
Table.X Table.Y
Tables to be joined
Path①
NestLoop
cost=8000
Path②
HashJoin
cost=800
Path③
MergeJoin
cost=1200
Path④
CustomScan
(GpuHashJoin)
cost=500
Built-in
Logics
clause
x.id = y.id
Path③
IndexScan
cost=10
Table.A
WINNER!
Table.X Table.Y
Path④
CustomScan
(GpuHashJoin)
cost=500
clause
x.id = y.id
WINNER!
It allows extensions to
offer alternative scan or
join logics, in addition to
the built-in ones
clause
id > 200

Yes!! Custom-Scan Interface got merged to v9.5
Minimum consensus is, an interface that
allows extensions to implement custom
logic to replace built-in scan logics.
↓
Then, we will discuss the enhancement
of the interface for tables join on the
developer’s community.

Enhancement of Custom-Scan Interface
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQLPage. 32
Table.X Table.Y
Tables to be joined
Path ①
NestLoop
Path ②
HashJoin
Path ③
MergeJoin
Path ④
CustomScan
(GpuHashJoin)
Built-in
logics
clause
x.id = y.id
X Y X Y
NestLoop HashJoin
X Y
MergeJoin
X Y
GpuHashJoin Result set
of tables join
It looks like a relation scan
on the result set of
tables join
 It replaces a node to be join by foreign-/custom-scan.
 Example: A FDW that scan on the result set of remote join.
▌Join replacement by foreign-/custom-scan

Abuse of Custom-Scan Interface
▌Usual interface contract
 GpuScan/SeqScan fetches rows, then passed to upper node and more ...
 TupleTableSlot is used to exchange record row by row manner.
▌Beyond the interface contract
 In case when both of parent and child nodes are managed by same extension,
nobody prohibit to exchange chunk of data with its own data structure.
 “Bulkload: On”  multiple (usually, 10K~100K) rows at once
Page. 33
postgres=# EXPLAIN SELECT * FROM t0 NATURAL JOIN t1;
QUERY PLAN
-----------------------------------------------------------------------------
Custom (GpuHashJoin) (cost=2234.00..468730.50 rows=19907950 width=106)
hash clause 1: (t0.aid = t1.aid)
Bulkload: On
-> Custom (GpuScan) on t0 (cost=500.00..267167.00 rows=20000024 width=73)
-> Custom (MultiHash) (cost=734.00..734.00 rows=40000 width=37)
hash keys: aid
Buckets: 46000 Batches: 1 Memory Usage: 99.97%
-> Seq Scan on t1 (cost=0.00..734.00 rows=40000 width=37)
Planning time: 0.220 ms
(9 rows)

PostgreSQL
PG-Strom
Element Technology③ – Row oriented data structure
Page. 34
Storage
Storage Manager
Shared
Buffer
Query
Parser
Query
Optimizer
Query
Executor
SQL Query
Breaks down
the query to
parse tree
Makes query
execution plan
Run the query
Custom-Plan
APIs
GpuScan
GpuHashJoin
GpuPreAgg
DMA Transfer
(via PCI-E bus)
GPU Program
Manager
PG-Strom
OpenCL
Server
Message Queue
T-Tree
Columnar
Cache
GPU Code Generator
Cache
construction
(rowcolumn)
Materialize
the result
(columnrow)
Prior implementation
had table cache
with column-oriented
data structure

Why GPU well fit column-oriented data structure
Page. 35
SOURCE: Maxwell: The Most Advanced CUDA GPU Ever Made
Core Core Core Core Core Core Core Core Core Core
SOURCE: How to Access Global Memory Efficiently in CUDA C/C++ Kernels
coalesced memory access
Global Memory (DRAM)
Wide Memory
Bandwidth
(256-384bits)
WARP:
A set of processor cores
that share instruction
pointer. It is usually 32.

Format of PostgreSQL tuples
Page. 36
struct HeapTupleHeaderData
{
union
{
HeapTupleFields t_heap;
DatumTupleFields t_datum;
} t_choice;
/* current TID of this or newer tuple */
ItemPointerData t_ctid;
/* number of attributes + various flags */
uint16 t_infomask2;
/* various flag bits, see below */
uint16 t_infomask;
/* sizeof header incl. bitmap, padding */
uint8 t_hoff;
/* ^ - 23 bytes - ^ */
/* bitmap of NULLs -- VARIABLE LENGTH */
bits8 t_bits[1];
/* MORE DATA FOLLOWS AT END OF STRUCT */
};
HeapTupleHeader
NULL bitmap
(if has null)
OID of row
(if exists)
Padding
1st Column
2nd Column
4th Column
Nth Column
No data if NULL
Variable length fields
makes unavailable to
predicate offset of the
later fields.
No NULL Bitmap
if all the fields
are not NULL
We usually have
no OID of row
The worst data structure
for GPU processor

Nightmare of columnar cache (1/2)
▌列指向キャッシュと行列変換
Page. 37
postgres=# explain (analyze, costs off)
select * from t0 natural join t1 natural join t2;
QUERY PLAN
------------------------------------------------------------------------------
Custom (GpuHashJoin) (actual time=54.005..9635.134 rows=20000000 loops=1)
hash clause 2: (t0.bid = t2.bid)
number of requests: 144
total time to load: 584.67ms
total time to materialize: 7245.14ms  70% of total execution time!!
average time in send-mq: 37us
average time in recv-mq: 0us
max time to build kernel: 1us
DMA send: 5197.80MB/sec, len: 2166.30MB, time: 416.77ms, count: 470
DMA recv: 5139.62MB/sec, len: 287.99MB, time: 56.03ms, count: 144
kernel exec: total: 441.71ms, avg: 3067us, count: 144
-> Custom (GpuScan) on t0 (actual time=4.011..584.533 rows=20000000 loops=1)
-> Custom (MultiHash) (actual time=31.102..31.102 rows=40000 loops=1)
hash keys: aid
-> Seq Scan on t1 (actual time=0.007..5.062 rows=40000 loops=1)
hash keys: bid
Execution time: 10525.754 ms

Nightmare of columnar cache (2/2)
Page. 38
Translation from table
(row-format) to the
columnar-cache
(only at once)
Translation from PG-
Strom internal
(column-format) to
TupleTableSlot
(row-format) for data
exchange in PostgreSQL
[Hash-Join]
Search the relevant
records on inner/outer
relations.
Breakdown of
execution time
You cannot see the wood for the trees
（木を見て森を見ず）
Optimization in GPU also makes additional data-format
translation cost (not ignorable) on the interface of PostgreSQL

A significant point in GPU acceleration (1/2)
Page. 39
• CPU is rare resource, so should put less workload on CPUs unlike GPUs
• We need to pay attention on access pattern of memory for CPU’s job
Storage
Shared
Buffer
T-Tree
columnar
cache
Result
Buffer
TupleTableSlot
Task of CPU
RowColumn translation
on cache construction
Very heavy loads
CPUのタスク
キャッシュ構築の際に、
一度だけ行列変換
極めて高い負荷
Task of PCI-E
Due to column-format,
only referenced data
shall be transformed.
Task of CPU
ColumnRow translation
everytime when we returns
the result to PostgreSQL.
Very heavy loads
Task of GPU
Operations on the data
with column-format.
It is optimized data
according to GPU’s nature
In case of column-format

A significant point in GPU acceleration (2/2)
Page. 40
• CPU is rare resource, so should put less workload on CPUs unlike GPUs
• We need to pay attention on access pattern of memory for CPU’s job
Storage
Shared
Buffer
Result
Buffer
TupleTableSlot
Task of CPU
Visibility check prior to
DMA translation
Light Load
CPUのタスク
キャッシュ構築の際に、
一度だけ行列変換
極めて高い負荷
Task of PCI-E
Due to row-format,
all data shall be transformed.
A little heavy load
Task of CPU
Set pointer of the result
buffer
Very Light Load
Task of GPU
Operations on row-data.
Although not an optimal data,
massive cores covers its
performance
In case of row-format
No own
columnar
cache

Integration with shared-buffer of PostgreSQL
▌列指向キャッシュと行列変換
Page. 41
postgres=# explain (analyze, costs off)
select * from t0 natural join t1 natural join t2;
QUERY PLAN
-----------------------------------------------------------
Custom (GpuHashJoin) (actual time=111.085..4286.562 rows=20000000 loops=1)
hash clause 2: (t0.bid = t2.bid)
number of requests: 145
total time for inner load: 29.80ms
total time for outer load: 812.50ms
total time to materialize: 1527.95ms  Reduction dramatically
average time in send-mq: 61us
average time in recv-mq: 0us
max time to build kernel: 1us
DMA send: 5198.84MB/sec, len: 2811.40MB, time: 540.77ms, count: 619
DMA recv: 3769.44MB/sec, len: 2182.02MB, time: 578.87ms, count: 290
proj kernel exec: total: 264.47ms, avg: 1823us, count: 145
main kernel exec: total: 622.83ms, avg: 4295us, count: 145
-> Custom (GpuScan) on t0 (actual time=5.736..812.255 rows=20000000 loops=1)
hash keys: aid
hash keys: bid
Execution time: 5161.017 ms  Much better response time
Performance degrading little bit

PG-Strom’s current
▌Supported logics
 GpuScan ... Full-table scan with qualifiers
 GpuHashJoin ... Hash-Join with GPUs
 GpuPreAgg ... Preprocess of aggregation with GPUs
▌Supported data-types
 Integer (smallint, integer, bigint)
 Floating-point (real, float)
 String (text, varchar(n), char(n))
 Date and time (date, time, timestamp)
 NUMERIC
▌Supported functions
 arithmetic operators for above data-types
 comparison operators for above data-types
 mathematical functions on floating-points
 Aggregate: MIN, MAX, SUM, AVG, STD, VAR, CORR

PG-Strom’s future
▌Logics will be supported
 Sort
 Aggregate Push-down
 ... anything else?
▌Data types will be supported
 ... anything else?
▌Functions will be supported
 LIKE, Regular Expression?
 Date/Time extraction?
 PostGIS?
 Geometrics?
 User defined functions?
(like pgOpenCL)
▌Parallel Scan
Page. 43
Shared
Buffer
block-0 block-Nblock-N/3 block-2N/3
Range
Partitioning &
Parallel Scan
Current
Future
Not easy to supply GPU enough data with
single CPU core in spite of on-memory store.

Our target segment
▌1st target application: BI tools are expected
▌Up-to 1.0TB data: SMB company or branches of large company
 Because it needs to keep the data in-memory
▌GPU and PostgreSQL: both of them are inexpensive.
Page. 44
High-end
data analysis
appliance
Commercial or
OSS RDBMS
large response-time / batch process small response-time / real-time
smallerdata-sizelargerdata-size
< 1.0TB
> 1.0TB
Cost
advantage
Performance
advantage
PG-Strom
Hadoop?

(Ref) ROLAP and MOLAP
▌Expected PG-Strom usage
 Backend RDBMS for ROLAP / acceleration of ad-hoc queries
 Cube construction for MOLAP / acceleration of batch processing
Page. 45
ERP
SCM
CRM
Finance
DWH
DWH
world of transaction
(OLTP)
world of analytics
(OLAP)
ETL
ROLAP:
DB summarizes the data
on demand by BI tools
MOLAP:
BI tools reference
information cube;
preliminary constructed

Expected Usage (1/2) – OLTP/OLAP Integration
▌Why separated OLTP/OLAP system
 Integration of multiple data source
 Optimization for analytic workloads
▌How PG-Strom will improve...
 Parallel execution of Join / Aggregate; being key of performance
 Elimination or reduction of OLAP / ETL, thus smaller amount of TCO
ERPCRMSCM BI
OLTP
database
OLAP
database
ETL
OLAP CubesMaster / Fact Tables
BI
PG-Strom
Performance Key
• Join master and
fact tables
• Aggregation
functions

Expected Usage (2/2) – Let’s investigate with us
▌Which region PG-Strom can work for?
▌Which workloads PG-Strom shall fit?
▌What workloads you concern about?
 PG-Strom Project want to lean from the field.

How to use (1/3) – Installation Prerequisites
▌OS: Linux (RHEL 6.x was validated)
▌PostgreSQL 9.5devel (with Custom-Plan Interface)
▌PG-Strom Module
▌OpenCL Driver (like nVIDIA run-time)
shared_preload_libraries = '$libdir/pg_strom‘
shared_buffers = <enough size to load whole of the database>
Minimum configuration of PG-Strom
postgres=# SET pg_strom.enabled = on;
SET
Turn on/off PG-Strom at run-time

How to use (2/3) – Build, Install and Starup
[kaigai@saba ~]$ git clone https://github.com/pg-strom/devel.git pg_strom
[kaigai@saba ~]$ cd pg_strom
[kaigai@saba pg_strom]$ make && make install
[kaigai@saba pg_strom]$ vi $PGDATA/postgresql.conf
[kaigai@saba ~]$ pg_ctl start
server starting
[kaigai@saba ~]$ LOG: registering background worker "PG-Strom OpenCL Server"
LOG: starting background worker process "PG-Strom OpenCL Server"
LOG: database system was shut down at 2014-11-09 17:45:51 JST
LOG: autovacuum launcher started
LOG: database system is ready to accept connections
LOG: PG-Strom: [0] OpenCL Platform: NVIDIA CUDA
LOG: PG-Strom: (0:0) Device GeForce GTX 980 (1253MHz x 16units, 4095MB)
LOG: PG-Strom: (0:1) Device GeForce GTX 750 Ti (1110MHz x 5units, 2047MB)
LOG: PG-Strom: [1] OpenCL Platform: Intel(R) OpenCL
LOG: PG-Strom: Platform "NVIDIA CUDA (OpenCL 1.1 CUDA 6.5.19)" was installed
LOG: PG-Strom: Device "GeForce GTX 980" was installed
LOG: PG-Strom: shmem 0x7f447f6b8000-0x7f46f06b7fff was mapped (len: 10000MB)
LOG: PG-Strom: buffer 0x7f34592795c0-0x7f44592795bf was mapped (len: 65536MB)
LOG: Starting PG-Strom OpenCL Server
LOG: PG-Strom: 24 of server threads are up

How to use (3/3) – Deployment on AWS
Page. 50
Search by “strom” !
AWS GPU Instance (g2.2xlarge)
CPU Xeon E5-2670 (8 xCPU)
RAM 15GB
GPU NVIDIA GRID K2 (1536core)
Storage 60GB of SSD
Price $0.898/hour, $646.56/mon
(*) Price for on-demand instance
on Tokyo region at Nov-2014

Future of PG-Strom

Dilemma of Innovation
SOURCE: The Innovator's Dilemma, Clayton M. Christensen
Performance demanded
at the high end of the market
Performance demanded
at the low end of the marker
or in a new emerging segment
ProductPerformance
Time

Move forward with community

(Additional comment after the conference)
check it out!
https://github.com/pg-strom/devel
Oops, I forgot to say
in the conference!
PG-Strom is open source.

GPGPU Accelerates PostgreSQL (English)

GPGPU Accelerates PostgreSQL (English)

More Related Content

GPGPU Accelerates PostgreSQL (English)