Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
SlideShare a Scribd company logo
GPGPU Accelerates PostgreSQL
NEC OSS Promotion Center
The PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
(Tw: @kkaigai)
Self Introduction
▌Name: KaiGai Kohei
▌Company: NEC OSS Promotion Center
▌Like: Processor with many cores
▌Dislike: Processor with little cores
▌Background:
HPC  OSS/Linux  SAP  GPU/PostgreSQL
▌Tw: @kkaigai
▌My Jobs
 SELinux (2004~)
• Lockless AVC、JFFS2 XATTR, ...
 PostgreSQL (2006~)
• SE-PostgreSQL, Security Barrier View, Writable FDW, ...
 PG-Strom (2012~)
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 2
Approach for Performance Improvement
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 3
Scale-Out
Scale-Up
Homogeneous Scale-up
Heterogeneous Scale-Up
+
Characteristics of GPU (Graphic Processor Unit)
▌Characteristics
 Larger percentage of ALUs on chip
 Relatively smaller percentage of cache
and control logic
Advantages to simple calculation in
parallel, but not complicated logic
 Much higher number of cores per price
• GTX750Ti (640core) with $150
GPU CPU
Model Nvidia Tesla K20X
Intel Xeon
E5-2670 v3
Architecture Kepler Haswell
Launch Nov-2012 Sep-2014
# of transistors 7.1billion 3.84billion
# of cores 2688 (simple) 12 (functional)
Core clock 732MHz
2.6GHz,
up to 3.5GHz
Peak Flops
(single precision)
3.95TFLOPS
998.4GFLOPS
(with AVX2)
DRAM size 6GB, GDDR5
768GB/socket,
DDR4
Memory band 250GB/s 68GB/s
Power
consumption
235W 135W
Price $3,000 $2,094
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 4
SOURCE: CUDA C Programming Guide (v6.5)
How GPU works – Example of reduction algorithm
●item[0]
step.1 step.2 step.4step.3
Computing
the sum of array:
𝑖𝑡𝑒𝑚[𝑖]
𝑖=0…𝑁−1
with N-cores of GPU
◆
●
▲ ■ ★
● ◆
●
● ◆ ▲
●
● ◆
●
● ◆ ▲ ■
●
● ◆
●
● ◆ ▲
●
● ◆
●
item[1]
item[2]
item[3]
item[4]
item[5]
item[6]
item[7]
item[8]
item[9]
item[10]
item[11]
item[12]
item[13]
item[14]
item[15]
Total sum of items[]
with log2N steps
Inter core synchronization by HW support
Semiconductor Trend (1/2) – Towards Heterogeneous
▌Moves to CPU/GPU integrated architecture from multicore CPU
▌Free lunch for SW by HW improvement will finish soon
 No utilization of semiconductor capability, unless SW is not designed
with conscious of HW characteristics.
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 6
SOURCE: THE HEART OF AMD INNOVATION, Lisa Su, at AMD Developer Summit 2013
Semiconductor Trend (2/2) – Dark Silicon Problem
▌Background of CPU/GPU integrated architecture
 Increase of transistor density > Reduction of power consumption
 Unable to supply power for all the logic because of chip cooling
A chip has multiple logics with different features, to save the peak power
consumption.
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 7
SOURCE: Compute Power with Energy-Efficiency, Jem Davies, at AMD Fusion Developer Summit 2011
RDBMS and its bottleneck (1/2)
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 8
Storage
Processor
Data
RAM
Data Size > RAM Data Size < RAM
Storage
Processor
Data
RAM
In the future?
Processor
Wide
Band
RAM
Non-
volatile
RAM
Data
World of current memory bottleneck
Join, Aggregation, Sort, Projection, ...
[strategy]
• burstable access pattern
• parallel algorithm
World of traditional disk-i/o bottleneck
SeqScan, IndexScan, ...
[strategy]
• reduction of i/o (size, count)
• distribution of disk (RAID)
RDBMS and its bottleneck (2/2)
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 9
Processor
RAM
Storage
bandwidth:
multiple
hundreds GB/s
bandwidth:
multiple GB/s
PG-Strom
▌What is PG-Strom
 An extension designed for PostgreSQL
 Off-loads a part of SQL workloads to GPU for parallel/rapid execution
 Three workloads are supported: Full-Scan, Hash-Join, Aggregate
(At the moment of Nov-2014, beta version)
▌Concept
 Automatic GPU native code generation from SQL query, by JIT compile
 Asynchronous execution with CPU/GPU co-operation.
▌Advantage
 Works fully transparently from users
• It allows to use peripheral software of PostgreSQL, including SQL syntax,
backup, HA or drivers.
 Performance improvement with GPU+OSS; very low cost
▌Attention
 Right now, in-memory data store is assumed
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 10
PostgreSQL
PG-Strom
Architecture of PG-Strom (1/2)
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 11
GPU Code Generator
Storage
Storage Manager
Shared
Buffer
Query
Parser
Query
Optimizer
Query
Executor
SQL Query
Breaks down
the query to
parse tree
Makes query
execution plan
Run the query
Custom-Plan
APIs
GpuScan
GpuHashJoin
GpuPreAgg
GPU Program
Manager
PG-Strom
OpenCL
Server
Message Queue
Architecture of PG-Strom (2/2)
▌Elemental technology① OpenCL
 Parallel computing framework on heterogeneous processors
 Can use for CPU parallel, not only GPU of NVIDIA/AMD
 Includes run-time compiler in the language specification
▌Elemental technology② Custom-Scan Interface
 Feature to implement scan/join by extensions, as if it is built-in logic for
SQL processing in PostgreSQL.
 A part of functionalities got merged to v9.5, discussions are in-progress
for full functionalities.
▌Elemental technology③ Row oriented data structure
 Shared buffer of PostgreSQL as DMA source
 Data format translation between row and column, even though column
format is optimal for GPU performance.
Page. 12 The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Elemental technology① OpenCL (1/2)
▌Characteristics of OpenCL
 Use abstracted “OpenCL device” for CPU/MIC, not only GPUs
• Even though it follows characteristics of GPUs below....
Three types of memory layer (global, local, constant)
Concept of workgroup; synchronous execution inter-threads
 Just-in-time compile from source code like C-language to
the platform specific native code
▌Comparison with CUDA
 We don’t need to develop JIT compile functionality by ourselves, and
want more people to try, so adopted OpenCL at this moment.
Page. 13
CUDA OpenCL
Advantage • Detailed optimization
• Latest feature of NVIDIA GPU
• Driver stability
• Multiplatform support
• Built-in JIT compiler
• CPU parallel support
Issues • Unavailable on AMD, Intel • Driver stability
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Elemental technology① OpenCL (2/2)
Page. 14
GPU
Source code
(text)
OpenCL runtime
OpenCL Interface
OpenCL runtime OpenCL runtime
OpenCL Devices
N x computing units
Global Memory
Local Memory
Constant Memory
cl_program
object
cl_kernel
object
DMA
buffer-3
DMA
buffer-2
DMA
buffer-1clBuildProgram()
clCreateKernel()
Command
Queue
Just-in-Time
Compile Look-up
kernel
functions Input a pair of kernel and
data into command queue
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Automatic GPU native code generation
Page. 15
postgres=# SET pg_strom.show_device_kernel = on;
SET
postgres=# EXPLAIN (verbose, costs off) SELECT cat, avg(x) from t0 WHERE x < y GROUP BY cat;
QUERY PLAN
--------------------------------------------------------------------------------------------
HashAggregate
Output: cat, pgstrom.avg(pgstrom.nrows(x IS NOT NULL), pgstrom.psum(x))
Group Key: t0.cat
-> Custom (GpuPreAgg)
Output: NULL::integer, cat, NULL::integer, NULL::integer, NULL::integer, NULL::integer,
NULL::double precision, NULL::double precision, NULL::text,
pgstrom.nrows(x IS NOT NULL), pgstrom.psum(x)
Bulkload: On
Kernel Source: #include "opencl_common.h"
#include "opencl_gpupreagg.h"
#include "opencl_textlib.h"
: <...snip...>
static bool
gpupreagg_qual_eval(__private cl_int *errcode,
__global kern_parambuf *kparams,
__global kern_data_store *kds,
__global kern_data_store *ktoast,
size_t kds_index)
{
pg_float8_t KVAR_7 = pg_float8_vref(kds,ktoast,errcode,6,kds_index);
pg_float8_t KVAR_8 = pg_float8_vref(kds,ktoast,errcode,7,kds_index);
return EVAL(pgfn_float8lt(errcode, KVAR_7, KVAR_8));
}
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
How PG-Strom processes SQL workloads① – Hash-Join
Page. 16
Inner
relation
Outer
relation
Inner
relation
Outer
relation
Hash table Hash table
Next step Next step
All CPU does is
just references
the result of
relations join
Hash table
search by CPU
Projection
by CPU
Parallel
Projection
Parallel Hash-
table search
Existing Hash-Join implementation GpuHashJoin implementation
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Benchmark (1/2) – Simple Tables Join
[condition of measurement]
▌ INNER JOIN of 200M rows x 100K rows x 10K rows ... with increasing number of tables
▌ Query in use: SELECT * FROM t0 natural join t1 [natural join t2 ...];
▌ All the tables are preliminary loaded
▌ HW: Express5800 HR120b-1, CPU: Xeon E5-2640, RAM: 256GB, GPU: NVIDIA GTX980
Page. 17
178.7
334.0
513.6
713.1
941.4
1181.3
1452.2
1753.4
29.1 30.5 35.6 36.6 43.6 49.5 50.1 60.6
0
200
400
600
800
1000
1200
1400
1600
1800
2000
2 3 4 5 6 7 8 9
Queryresponsetime
number of joined tables
Simple Tables Join
PostgreSQL PG-Strom
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
How PG-Strom processes SQL workloads② – Aggregate
▌GpuPreAgg, prior to Aggregate/Sort, reduces num of rows to be processed by CPU
▌Not easy to apply aggregate on all the data at once because of GPU’s RAM size,
GPU has advantage to make “partial aggregate” for each million rows.
▌Key performance factor is reducing the job of CPU
GroupAggregate
Sort
SeqScan
Tbl_1
Result Set
several rows
several millions rows
several millions rows
count(*), avg(X), Y
X, Y
X, Y
X, Y, Z (table definition)
GroupAggregate
Sort
GpuScan
Tbl_1
Result Set
several rows
several hundreds rows
several millions rows
sum(nrows),
avg_ex(nrows, psum), Y
Y, nrows, psum_X
X, Y
X, Y, Z (table definition)
GpuPreAgg Y, nrows, psum_X
several hundreds rows
SELECT count(*), AVG(X), Y FROM Tbl_1 GROUP BY Y;
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQLPage. 18
Benchmark (2/2) – Aggregation + Tables Join
[condition of measurement]
▌ Aggregate and INNER JOIN of 200M rows x 100K rows x 10K rows ... with increasing number of tables
▌ Query in use:
SELECT cat, AVG(x) FROM t0 natural join t1 [natural join t2 ...] GROUP BY CAT;
▌ Other conditions are same with the previous measurement
Page. 19
157.4
238.2
328.3
421.7
525.5
619.3
712.8
829.2
44.0 43.2 42.8 45.0 48.5 50.8 61.0 66.7
0
100
200
300
400
500
600
700
800
900
1000
2 3 4 5 6 7 8 9
Queryresponsetime
number of joined tables
Aggregation + Tables Join
PostgreSQL PG-Strom
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
As an aside...
Let’s back to the development history prior to
the remaining elemental technology
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 20
Development History of PG-Strom
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 21
2011 Feb KaiGai moved to Germany
May PGconf 2011
2012 Jan The first prototype
May Tried to use pseudo code
Aug Proposition of background worker
and writable FDW
2013 Jul NEC admit PG-Strom development
Nov Proposition of CustomScan API
2014 Feb Moved to OpenCL from CUDA
Jun GpuScan, GpuHashJoin and GpuSort
were implemented
Sep Drop GpuSort, instead of GpuPreAgg
Nov working beta-1 gets available
First
prototype
announced
“Strom” comes from Germany term; where I lived in at that time
Inspiration – May-2011
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 22
http://ja.scribd.com/doc/44661593/PostgreSQL-OpenCL-Procedural-Language
PGconf 2011 @ Ottawa, Canada
You’re brave.
Unavailable to run on
regular query, not
only PL functions?
Inspired by PgOpencl Project
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 23
A B C D E
summary:
GPU works well towards large BLOB,
like image data.
Parallel Image Searching Using PostgreSQL PgOpenCL @ PGconf 2011
Tim Child (3DMashUp)
Even small width of values,
GPU will work well
if we have transposition.
Made a prototype – Christmas vacation in 2011
 CUDA based ( Now, OpenCL)
 FDW (Foreign Data Wrapper) based ( Now, CustomScan API)
 Column oriented data structure ( Now, row-oriented data)
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 24
CPU
vanilla PostgreSQL PostgreSQL + PG-Strom
Iterationofrowfetch
andevaluation
multiple rows
at once
Async DMA &
Async Kernek Exec
CUDA Compiler
: Fetch a row from buffer : Evaluation of WHERE clause
CPU GPU
Synchronization
SELECT * FROM table WHERE sqrt((x-256)^2 + (y-100)^2) < 10;
code for
GPU
Auto GPU code generation,
Just-in-time compile
Query response time
reduction
GPU
Device
 A set of APIs to show external data source as like a table of PostgreSQL
 FDW driver generates rows on demand when foreign tables are referenced.
 Extension can have arbitrary data structure and logic to process the data.
 It is also allowed to scan internal data with GPU acceleration!
What is FDW (Foreign Data Wrapper)
QueryExecutor
Regular
Table
Foreign
Table
Foreign
Table
Foreign
Table
File
FDW
Oracle
FDW
PG-Strom
FDW
storage
Regular
Table
storage
QueryPlanner
QueryParser
Exec
Exec
Exec
Exec
SQL
Query
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 25
CSV Files
....... .... .... ...
... ... . ... ... .
.... .. .... .. . ..
.. . . . .. .
Other
DBMS
Implementation at that time
▌Problem
 Only foreign tables can be accelerated
with GPUs.
 Only full table-scan can be supported.
 Foreign tables were read-only at that
time.
 Slow-down of 1st time query because of
device initialization.
summary: Not easy to use
PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module26
ForeignTable(pgstrom)
value a[]
rowmap
value b[]
value c[]
value d[] <not used>
<not used>
Table: my_schema.ft1.b.cs
10300 {10.23, 7.54, 5.43, … }
10100 {2.4, 5.6, 4.95, … }
② Calculation ① Transfer
③ Write-Back
Table: my_schema.ft1.c.cs
{‘2010-10-21’, …}
{‘2011-01-23’, …}
{‘2011-08-17’, …}
10100
10200
10300
Shadow tables on behalf of each column of the
foreign table managed by PG-Strom.
Each items are stored closely in physical, using large
array data type of element data.
PostgreSQL Enhancement (1/2) – Writable FDW
▌Foreign tables had supported read access only (~v9.2)
 Data load onto shadow tables was supported using special function
Enhancement of APIs for INSERT/UPDATE/DELETE on foreign tables
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 27
PostgreSQL
Data Source
Query
Planning
SELECT DELETE UPDATE INSERT
FDW Driver
Enhancement
of this
interface
PostgreSQL Enhancement (2/2) – Background Worker
▌Slowdown on first time of each query
 Time for JIT compile of GPU code  caching the built binaries
 Time for initialization of GPU devices  initialization once by worker process
Good side effect: it works with run-time that doesn’t support concurrent usage.
▌Background Worker
 An interface allows to launch background processes managed by extensions
 Dynamic registration gets supported on v9.4 also.
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 28
PostmasterProcess
PostgreSQL
Backend
Process
Built-in
BG workers
Extension’s
BG workers
fork(2) on
connection
fork(2) on
startup
IPC:sharedmemory,...
fork(2) on
startup /
demand
5432
Port
Design Pivot
▌Is the FDW really appropriate framework for PG-Strom?
 It originated from a feature to show external data as like a table.
[benefit]
• It is already supported on PostgreSQL
[Issue]
• Less transparency for users / applications
• More efficient execution path than GPU; like index-scan if available
• Workloads expects for full-table scan; like tables join
▌CUDA or OpenCL?
 First priority is many trials by wider range of people
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 29
CUDA OpenCL
Good • Fine grained optimization
• Latest feature of NVIDIA GPU
• Reliability of OpenCL drivers
• Multiplatform support
• Built-in JIT compiler
• CPU parallel support
Bad • No support on AMD, Intel • Reliability of OpenCL drivers
Elemental Technology② – Custom-Scan Interface
Page. 30
Table.A
Table to be scanned
clause
id > 200
Path①
SeqScan
cost=400
Path②
TidScan
unavailable
Path③
IndexScan
cost=10
Path④
CustomScan
(GpuScan)
cost=40
Built-in
Logics
Table.X Table.Y
Tables to be joined
Path①
NestLoop
cost=8000
Path②
HashJoin
cost=800
Path③
MergeJoin
cost=1200
Path④
CustomScan
(GpuHashJoin)
cost=500
Built-in
Logics
clause
x.id = y.id
Path③
IndexScan
cost=10
Table.A
WINNER!
Table.X Table.Y
Path④
CustomScan
(GpuHashJoin)
cost=500
clause
x.id = y.id
WINNER!
It allows extensions to
offer alternative scan or
join logics, in addition to
the built-in ones
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
clause
id > 200
Yes!! Custom-Scan Interface got merged to v9.5
Page. 31 The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Minimum consensus is, an interface that
allows extensions to implement custom
logic to replace built-in scan logics.
↓
Then, we will discuss the enhancement
of the interface for tables join on the
developer’s community.
Enhancement of Custom-Scan Interface
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQLPage. 32
Table.X Table.Y
Tables to be joined
Path ①
NestLoop
Path ②
HashJoin
Path ③
MergeJoin
Path ④
CustomScan
(GpuHashJoin)
Built-in
logics
clause
x.id = y.id
X Y X Y
NestLoop HashJoin
X Y
MergeJoin
X Y
GpuHashJoin Result set
of tables join
It looks like a relation scan
on the result set of
tables join
 It replaces a node to be join by foreign-/custom-scan.
 Example: A FDW that scan on the result set of remote join.
▌Join replacement by foreign-/custom-scan
Abuse of Custom-Scan Interface
▌Usual interface contract
 GpuScan/SeqScan fetches rows, then passed to upper node and more ...
 TupleTableSlot is used to exchange record row by row manner.
▌Beyond the interface contract
 In case when both of parent and child nodes are managed by same extension,
nobody prohibit to exchange chunk of data with its own data structure.
 “Bulkload: On”  multiple (usually, 10K~100K) rows at once
Page. 33
postgres=# EXPLAIN SELECT * FROM t0 NATURAL JOIN t1;
QUERY PLAN
-----------------------------------------------------------------------------
Custom (GpuHashJoin) (cost=2234.00..468730.50 rows=19907950 width=106)
hash clause 1: (t0.aid = t1.aid)
Bulkload: On
-> Custom (GpuScan) on t0 (cost=500.00..267167.00 rows=20000024 width=73)
-> Custom (MultiHash) (cost=734.00..734.00 rows=40000 width=37)
hash keys: aid
Buckets: 46000 Batches: 1 Memory Usage: 99.97%
-> Seq Scan on t1 (cost=0.00..734.00 rows=40000 width=37)
Planning time: 0.220 ms
(9 rows)
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
PostgreSQL
PG-Strom
Element Technology③ – Row oriented data structure
Page. 34
Storage
Storage Manager
Shared
Buffer
Query
Parser
Query
Optimizer
Query
Executor
SQL Query
Breaks down
the query to
parse tree
Makes query
execution plan
Run the query
Custom-Plan
APIs
GpuScan
GpuHashJoin
GpuPreAgg
DMA Transfer
(via PCI-E bus)
GPU Program
Manager
PG-Strom
OpenCL
Server
Message Queue
T-Tree
Columnar
Cache
GPU Code Generator
Cache
construction
(rowcolumn)
Materialize
the result
(columnrow)
Prior implementation
had table cache
with column-oriented
data structure
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Why GPU well fit column-oriented data structure
Page. 35
SOURCE: Maxwell: The Most Advanced CUDA GPU Ever Made
Core Core Core Core Core Core Core Core Core Core
SOURCE: How to Access Global Memory Efficiently in CUDA C/C++ Kernels
coalesced memory access
Global Memory (DRAM)
Wide Memory
Bandwidth
(256-384bits)
WARP:
A set of processor cores
that share instruction
pointer. It is usually 32.
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Format of PostgreSQL tuples
Page. 36
struct HeapTupleHeaderData
{
union
{
HeapTupleFields t_heap;
DatumTupleFields t_datum;
} t_choice;
/* current TID of this or newer tuple */
ItemPointerData t_ctid;
/* number of attributes + various flags */
uint16 t_infomask2;
/* various flag bits, see below */
uint16 t_infomask;
/* sizeof header incl. bitmap, padding */
uint8 t_hoff;
/* ^ - 23 bytes - ^ */
/* bitmap of NULLs -- VARIABLE LENGTH */
bits8 t_bits[1];
/* MORE DATA FOLLOWS AT END OF STRUCT */
};
HeapTupleHeader
NULL bitmap
(if has null)
OID of row
(if exists)
Padding
1st Column
2nd Column
4th Column
Nth Column
No data if NULL
Variable length fields
makes unavailable to
predicate offset of the
later fields.
No NULL Bitmap
if all the fields
are not NULL
We usually have
no OID of row
The worst data structure
for GPU processor
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Nightmare of columnar cache (1/2)
▌列指向キャッシュと行列変換
Page. 37
postgres=# explain (analyze, costs off)
select * from t0 natural join t1 natural join t2;
QUERY PLAN
------------------------------------------------------------------------------
Custom (GpuHashJoin) (actual time=54.005..9635.134 rows=20000000 loops=1)
hash clause 1: (t0.aid = t1.aid)
hash clause 2: (t0.bid = t2.bid)
number of requests: 144
total time to load: 584.67ms
total time to materialize: 7245.14ms  70% of total execution time!!
average time in send-mq: 37us
average time in recv-mq: 0us
max time to build kernel: 1us
DMA send: 5197.80MB/sec, len: 2166.30MB, time: 416.77ms, count: 470
DMA recv: 5139.62MB/sec, len: 287.99MB, time: 56.03ms, count: 144
kernel exec: total: 441.71ms, avg: 3067us, count: 144
-> Custom (GpuScan) on t0 (actual time=4.011..584.533 rows=20000000 loops=1)
-> Custom (MultiHash) (actual time=31.102..31.102 rows=40000 loops=1)
hash keys: aid
-> Seq Scan on t1 (actual time=0.007..5.062 rows=40000 loops=1)
-> Custom (MultiHash) (actual time=17.839..17.839 rows=40000 loops=1)
hash keys: bid
-> Seq Scan on t2 (actual time=0.019..6.794 rows=40000 loops=1)
Execution time: 10525.754 ms
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Nightmare of columnar cache (2/2)
Page. 38
Translation from table
(row-format) to the
columnar-cache
(only at once)
Translation from PG-
Strom internal
(column-format) to
TupleTableSlot
(row-format) for data
exchange in PostgreSQL
[Hash-Join]
Search the relevant
records on inner/outer
relations.
Breakdown of
execution time
You cannot see the wood for the trees
(木を見て森を見ず)
Optimization in GPU also makes additional data-format
translation cost (not ignorable) on the interface of PostgreSQL
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
A significant point in GPU acceleration (1/2)
Page. 39
• CPU is rare resource, so should put less workload on CPUs unlike GPUs
• We need to pay attention on access pattern of memory for CPU’s job
Storage
Shared
Buffer
T-Tree
columnar
cache
Result
Buffer
TupleTableSlot
Task of CPU
RowColumn translation
on cache construction
Very heavy loads
CPUのタスク
キャッシュ構築の際に、
一度だけ行列変換
極めて高い負荷
Task of PCI-E
Due to column-format,
only referenced data
shall be transformed.
Task of CPU
ColumnRow translation
everytime when we returns
the result to PostgreSQL.
Very heavy loads
Task of GPU
Operations on the data
with column-format.
It is optimized data
according to GPU’s nature
In case of column-format
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
A significant point in GPU acceleration (2/2)
Page. 40
• CPU is rare resource, so should put less workload on CPUs unlike GPUs
• We need to pay attention on access pattern of memory for CPU’s job
Storage
Shared
Buffer
Result
Buffer
TupleTableSlot
Task of CPU
Visibility check prior to
DMA translation
Light Load
CPUのタスク
キャッシュ構築の際に、
一度だけ行列変換
極めて高い負荷
Task of PCI-E
Due to row-format,
all data shall be transformed.
A little heavy load
Task of CPU
Set pointer of the result
buffer
Very Light Load
Task of GPU
Operations on row-data.
Although not an optimal data,
massive cores covers its
performance
In case of row-format
No own
columnar
cache
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Integration with shared-buffer of PostgreSQL
▌列指向キャッシュと行列変換
Page. 41
postgres=# explain (analyze, costs off)
select * from t0 natural join t1 natural join t2;
QUERY PLAN
-----------------------------------------------------------
Custom (GpuHashJoin) (actual time=111.085..4286.562 rows=20000000 loops=1)
hash clause 1: (t0.aid = t1.aid)
hash clause 2: (t0.bid = t2.bid)
number of requests: 145
total time for inner load: 29.80ms
total time for outer load: 812.50ms
total time to materialize: 1527.95ms  Reduction dramatically
average time in send-mq: 61us
average time in recv-mq: 0us
max time to build kernel: 1us
DMA send: 5198.84MB/sec, len: 2811.40MB, time: 540.77ms, count: 619
DMA recv: 3769.44MB/sec, len: 2182.02MB, time: 578.87ms, count: 290
proj kernel exec: total: 264.47ms, avg: 1823us, count: 145
main kernel exec: total: 622.83ms, avg: 4295us, count: 145
-> Custom (GpuScan) on t0 (actual time=5.736..812.255 rows=20000000 loops=1)
-> Custom (MultiHash) (actual time=29.766..29.767 rows=80000 loops=1)
hash keys: aid
-> Seq Scan on t1 (actual time=0.005..5.742 rows=40000 loops=1)
-> Custom (MultiHash) (actual time=16.552..16.552 rows=40000 loops=1)
hash keys: bid
-> Seq Scan on t2 (actual time=0.022..7.330 rows=40000 loops=1)
Execution time: 5161.017 ms  Much better response time
Performance degrading little bit
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
PG-Strom’s current
▌Supported logics
 GpuScan ... Full-table scan with qualifiers
 GpuHashJoin ... Hash-Join with GPUs
 GpuPreAgg ... Preprocess of aggregation with GPUs
▌Supported data-types
 Integer (smallint, integer, bigint)
 Floating-point (real, float)
 String (text, varchar(n), char(n))
 Date and time (date, time, timestamp)
 NUMERIC
▌Supported functions
 arithmetic operators for above data-types
 comparison operators for above data-types
 mathematical functions on floating-points
 Aggregate: MIN, MAX, SUM, AVG, STD, VAR, CORR
Page. 42 The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
PG-Strom’s future
▌Logics will be supported
 Sort
 Aggregate Push-down
 ... anything else?
▌Data types will be supported
 ... anything else?
▌Functions will be supported
 LIKE, Regular Expression?
 Date/Time extraction?
 PostGIS?
 Geometrics?
 User defined functions?
(like pgOpenCL)
▌Parallel Scan
Page. 43
Shared
Buffer
block-0 block-Nblock-N/3 block-2N/3
Range
Partitioning &
Parallel Scan
Current
Future
Not easy to supply GPU enough data with
single CPU core in spite of on-memory store.
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Our target segment
▌1st target application: BI tools are expected
▌Up-to 1.0TB data: SMB company or branches of large company
 Because it needs to keep the data in-memory
▌GPU and PostgreSQL: both of them are inexpensive.
Page. 44
High-end
data analysis
appliance
Commercial or
OSS RDBMS
large response-time / batch process small response-time / real-time
smallerdata-sizelargerdata-size
< 1.0TB
> 1.0TB
Cost
advantage
Performance
advantage
PG-Strom
Hadoop?
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
(Ref) ROLAP and MOLAP
▌Expected PG-Strom usage
 Backend RDBMS for ROLAP / acceleration of ad-hoc queries
 Cube construction for MOLAP / acceleration of batch processing
Page. 45
ERP
SCM
CRM
Finance
DWH
DWH
world of transaction
(OLTP)
world of analytics
(OLAP)
ETL
ROLAP:
DB summarizes the data
on demand by BI tools
MOLAP:
BI tools reference
information cube;
preliminary constructed
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Expected Usage (1/2) – OLTP/OLAP Integration
▌Why separated OLTP/OLAP system
 Integration of multiple data source
 Optimization for analytic workloads
▌How PG-Strom will improve...
 Parallel execution of Join / Aggregate; being key of performance
 Elimination or reduction of OLAP / ETL, thus smaller amount of TCO
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 46
ERPCRMSCM BI
OLTP
database
OLAP
database
ETL
OLAP CubesMaster / Fact Tables
BI
PG-Strom
Performance Key
• Join master and
fact tables
• Aggregation
functions
Expected Usage (2/2) – Let’s investigate with us
▌Which region PG-Strom can work for?
▌Which workloads PG-Strom shall fit?
▌What workloads you concern about?
 PG-Strom Project want to lean from the field.
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 47
How to use (1/3) – Installation Prerequisites
▌OS: Linux (RHEL 6.x was validated)
▌PostgreSQL 9.5devel (with Custom-Plan Interface)
▌PG-Strom Module
▌OpenCL Driver (like nVIDIA run-time)
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 48
shared_preload_libraries = '$libdir/pg_strom‘
shared_buffers = <enough size to load whole of the database>
Minimum configuration of PG-Strom
postgres=# SET pg_strom.enabled = on;
SET
Turn on/off PG-Strom at run-time
How to use (2/3) – Build, Install and Starup
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 49
[kaigai@saba ~]$ git clone https://github.com/pg-strom/devel.git pg_strom
[kaigai@saba ~]$ cd pg_strom
[kaigai@saba pg_strom]$ make && make install
[kaigai@saba pg_strom]$ vi $PGDATA/postgresql.conf
[kaigai@saba ~]$ pg_ctl start
server starting
[kaigai@saba ~]$ LOG: registering background worker "PG-Strom OpenCL Server"
LOG: starting background worker process "PG-Strom OpenCL Server"
LOG: database system was shut down at 2014-11-09 17:45:51 JST
LOG: autovacuum launcher started
LOG: database system is ready to accept connections
LOG: PG-Strom: [0] OpenCL Platform: NVIDIA CUDA
LOG: PG-Strom: (0:0) Device GeForce GTX 980 (1253MHz x 16units, 4095MB)
LOG: PG-Strom: (0:1) Device GeForce GTX 750 Ti (1110MHz x 5units, 2047MB)
LOG: PG-Strom: [1] OpenCL Platform: Intel(R) OpenCL
LOG: PG-Strom: Platform "NVIDIA CUDA (OpenCL 1.1 CUDA 6.5.19)" was installed
LOG: PG-Strom: Device "GeForce GTX 980" was installed
LOG: PG-Strom: shmem 0x7f447f6b8000-0x7f46f06b7fff was mapped (len: 10000MB)
LOG: PG-Strom: buffer 0x7f34592795c0-0x7f44592795bf was mapped (len: 65536MB)
LOG: Starting PG-Strom OpenCL Server
LOG: PG-Strom: 24 of server threads are up
How to use (3/3) – Deployment on AWS
Page. 50
Search by “strom” !
AWS GPU Instance (g2.2xlarge)
CPU Xeon E5-2670 (8 xCPU)
RAM 15GB
GPU NVIDIA GRID K2 (1536core)
Storage 60GB of SSD
Price $0.898/hour, $646.56/mon
(*) Price for on-demand instance
on Tokyo region at Nov-2014
The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
Future of PG-Strom
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 51
Dilemma of Innovation
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 52
SOURCE: The Innovator's Dilemma, Clayton M. Christensen
Performance demanded
at the high end of the market
Performance demanded
at the low end of the marker
or in a new emerging segment
ProductPerformance
Time
Move forward with community
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 53
(Additional comment after the conference)
DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 54
check it out!
https://github.com/pg-strom/devel
Oops, I forgot to say
in the conference!
PG-Strom is open source.
GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)

More Related Content

GPGPU Accelerates PostgreSQL (English)

  • 1. GPGPU Accelerates PostgreSQL NEC OSS Promotion Center The PG-Strom Project KaiGai Kohei <kaigai@ak.jp.nec.com> (Tw: @kkaigai)
  • 2. Self Introduction ▌Name: KaiGai Kohei ▌Company: NEC OSS Promotion Center ▌Like: Processor with many cores ▌Dislike: Processor with little cores ▌Background: HPC  OSS/Linux  SAP  GPU/PostgreSQL ▌Tw: @kkaigai ▌My Jobs  SELinux (2004~) • Lockless AVC、JFFS2 XATTR, ...  PostgreSQL (2006~) • SE-PostgreSQL, Security Barrier View, Writable FDW, ...  PG-Strom (2012~) DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 2
  • 3. Approach for Performance Improvement DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 3 Scale-Out Scale-Up Homogeneous Scale-up Heterogeneous Scale-Up +
  • 4. Characteristics of GPU (Graphic Processor Unit) ▌Characteristics  Larger percentage of ALUs on chip  Relatively smaller percentage of cache and control logic Advantages to simple calculation in parallel, but not complicated logic  Much higher number of cores per price • GTX750Ti (640core) with $150 GPU CPU Model Nvidia Tesla K20X Intel Xeon E5-2670 v3 Architecture Kepler Haswell Launch Nov-2012 Sep-2014 # of transistors 7.1billion 3.84billion # of cores 2688 (simple) 12 (functional) Core clock 732MHz 2.6GHz, up to 3.5GHz Peak Flops (single precision) 3.95TFLOPS 998.4GFLOPS (with AVX2) DRAM size 6GB, GDDR5 768GB/socket, DDR4 Memory band 250GB/s 68GB/s Power consumption 235W 135W Price $3,000 $2,094 DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 4 SOURCE: CUDA C Programming Guide (v6.5)
  • 5. How GPU works – Example of reduction algorithm ●item[0] step.1 step.2 step.4step.3 Computing the sum of array: 𝑖𝑡𝑒𝑚[𝑖] 𝑖=0…𝑁−1 with N-cores of GPU ◆ ● ▲ ■ ★ ● ◆ ● ● ◆ ▲ ● ● ◆ ● ● ◆ ▲ ■ ● ● ◆ ● ● ◆ ▲ ● ● ◆ ● item[1] item[2] item[3] item[4] item[5] item[6] item[7] item[8] item[9] item[10] item[11] item[12] item[13] item[14] item[15] Total sum of items[] with log2N steps Inter core synchronization by HW support
  • 6. Semiconductor Trend (1/2) – Towards Heterogeneous ▌Moves to CPU/GPU integrated architecture from multicore CPU ▌Free lunch for SW by HW improvement will finish soon  No utilization of semiconductor capability, unless SW is not designed with conscious of HW characteristics. DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 6 SOURCE: THE HEART OF AMD INNOVATION, Lisa Su, at AMD Developer Summit 2013
  • 7. Semiconductor Trend (2/2) – Dark Silicon Problem ▌Background of CPU/GPU integrated architecture  Increase of transistor density > Reduction of power consumption  Unable to supply power for all the logic because of chip cooling A chip has multiple logics with different features, to save the peak power consumption. DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 7 SOURCE: Compute Power with Energy-Efficiency, Jem Davies, at AMD Fusion Developer Summit 2011
  • 8. RDBMS and its bottleneck (1/2) DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 8 Storage Processor Data RAM Data Size > RAM Data Size < RAM Storage Processor Data RAM In the future? Processor Wide Band RAM Non- volatile RAM Data
  • 9. World of current memory bottleneck Join, Aggregation, Sort, Projection, ... [strategy] • burstable access pattern • parallel algorithm World of traditional disk-i/o bottleneck SeqScan, IndexScan, ... [strategy] • reduction of i/o (size, count) • distribution of disk (RAID) RDBMS and its bottleneck (2/2) DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 9 Processor RAM Storage bandwidth: multiple hundreds GB/s bandwidth: multiple GB/s
  • 10. PG-Strom ▌What is PG-Strom  An extension designed for PostgreSQL  Off-loads a part of SQL workloads to GPU for parallel/rapid execution  Three workloads are supported: Full-Scan, Hash-Join, Aggregate (At the moment of Nov-2014, beta version) ▌Concept  Automatic GPU native code generation from SQL query, by JIT compile  Asynchronous execution with CPU/GPU co-operation. ▌Advantage  Works fully transparently from users • It allows to use peripheral software of PostgreSQL, including SQL syntax, backup, HA or drivers.  Performance improvement with GPU+OSS; very low cost ▌Attention  Right now, in-memory data store is assumed DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 10
  • 11. PostgreSQL PG-Strom Architecture of PG-Strom (1/2) DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 11 GPU Code Generator Storage Storage Manager Shared Buffer Query Parser Query Optimizer Query Executor SQL Query Breaks down the query to parse tree Makes query execution plan Run the query Custom-Plan APIs GpuScan GpuHashJoin GpuPreAgg GPU Program Manager PG-Strom OpenCL Server Message Queue
  • 12. Architecture of PG-Strom (2/2) ▌Elemental technology① OpenCL  Parallel computing framework on heterogeneous processors  Can use for CPU parallel, not only GPU of NVIDIA/AMD  Includes run-time compiler in the language specification ▌Elemental technology② Custom-Scan Interface  Feature to implement scan/join by extensions, as if it is built-in logic for SQL processing in PostgreSQL.  A part of functionalities got merged to v9.5, discussions are in-progress for full functionalities. ▌Elemental technology③ Row oriented data structure  Shared buffer of PostgreSQL as DMA source  Data format translation between row and column, even though column format is optimal for GPU performance. Page. 12 The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 13. Elemental technology① OpenCL (1/2) ▌Characteristics of OpenCL  Use abstracted “OpenCL device” for CPU/MIC, not only GPUs • Even though it follows characteristics of GPUs below.... Three types of memory layer (global, local, constant) Concept of workgroup; synchronous execution inter-threads  Just-in-time compile from source code like C-language to the platform specific native code ▌Comparison with CUDA  We don’t need to develop JIT compile functionality by ourselves, and want more people to try, so adopted OpenCL at this moment. Page. 13 CUDA OpenCL Advantage • Detailed optimization • Latest feature of NVIDIA GPU • Driver stability • Multiplatform support • Built-in JIT compiler • CPU parallel support Issues • Unavailable on AMD, Intel • Driver stability The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 14. Elemental technology① OpenCL (2/2) Page. 14 GPU Source code (text) OpenCL runtime OpenCL Interface OpenCL runtime OpenCL runtime OpenCL Devices N x computing units Global Memory Local Memory Constant Memory cl_program object cl_kernel object DMA buffer-3 DMA buffer-2 DMA buffer-1clBuildProgram() clCreateKernel() Command Queue Just-in-Time Compile Look-up kernel functions Input a pair of kernel and data into command queue The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 15. Automatic GPU native code generation Page. 15 postgres=# SET pg_strom.show_device_kernel = on; SET postgres=# EXPLAIN (verbose, costs off) SELECT cat, avg(x) from t0 WHERE x < y GROUP BY cat; QUERY PLAN -------------------------------------------------------------------------------------------- HashAggregate Output: cat, pgstrom.avg(pgstrom.nrows(x IS NOT NULL), pgstrom.psum(x)) Group Key: t0.cat -> Custom (GpuPreAgg) Output: NULL::integer, cat, NULL::integer, NULL::integer, NULL::integer, NULL::integer, NULL::double precision, NULL::double precision, NULL::text, pgstrom.nrows(x IS NOT NULL), pgstrom.psum(x) Bulkload: On Kernel Source: #include "opencl_common.h" #include "opencl_gpupreagg.h" #include "opencl_textlib.h" : <...snip...> static bool gpupreagg_qual_eval(__private cl_int *errcode, __global kern_parambuf *kparams, __global kern_data_store *kds, __global kern_data_store *ktoast, size_t kds_index) { pg_float8_t KVAR_7 = pg_float8_vref(kds,ktoast,errcode,6,kds_index); pg_float8_t KVAR_8 = pg_float8_vref(kds,ktoast,errcode,7,kds_index); return EVAL(pgfn_float8lt(errcode, KVAR_7, KVAR_8)); } The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 16. How PG-Strom processes SQL workloads① – Hash-Join Page. 16 Inner relation Outer relation Inner relation Outer relation Hash table Hash table Next step Next step All CPU does is just references the result of relations join Hash table search by CPU Projection by CPU Parallel Projection Parallel Hash- table search Existing Hash-Join implementation GpuHashJoin implementation The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 17. Benchmark (1/2) – Simple Tables Join [condition of measurement] ▌ INNER JOIN of 200M rows x 100K rows x 10K rows ... with increasing number of tables ▌ Query in use: SELECT * FROM t0 natural join t1 [natural join t2 ...]; ▌ All the tables are preliminary loaded ▌ HW: Express5800 HR120b-1, CPU: Xeon E5-2640, RAM: 256GB, GPU: NVIDIA GTX980 Page. 17 178.7 334.0 513.6 713.1 941.4 1181.3 1452.2 1753.4 29.1 30.5 35.6 36.6 43.6 49.5 50.1 60.6 0 200 400 600 800 1000 1200 1400 1600 1800 2000 2 3 4 5 6 7 8 9 Queryresponsetime number of joined tables Simple Tables Join PostgreSQL PG-Strom The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 18. How PG-Strom processes SQL workloads② – Aggregate ▌GpuPreAgg, prior to Aggregate/Sort, reduces num of rows to be processed by CPU ▌Not easy to apply aggregate on all the data at once because of GPU’s RAM size, GPU has advantage to make “partial aggregate” for each million rows. ▌Key performance factor is reducing the job of CPU GroupAggregate Sort SeqScan Tbl_1 Result Set several rows several millions rows several millions rows count(*), avg(X), Y X, Y X, Y X, Y, Z (table definition) GroupAggregate Sort GpuScan Tbl_1 Result Set several rows several hundreds rows several millions rows sum(nrows), avg_ex(nrows, psum), Y Y, nrows, psum_X X, Y X, Y, Z (table definition) GpuPreAgg Y, nrows, psum_X several hundreds rows SELECT count(*), AVG(X), Y FROM Tbl_1 GROUP BY Y; The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQLPage. 18
  • 19. Benchmark (2/2) – Aggregation + Tables Join [condition of measurement] ▌ Aggregate and INNER JOIN of 200M rows x 100K rows x 10K rows ... with increasing number of tables ▌ Query in use: SELECT cat, AVG(x) FROM t0 natural join t1 [natural join t2 ...] GROUP BY CAT; ▌ Other conditions are same with the previous measurement Page. 19 157.4 238.2 328.3 421.7 525.5 619.3 712.8 829.2 44.0 43.2 42.8 45.0 48.5 50.8 61.0 66.7 0 100 200 300 400 500 600 700 800 900 1000 2 3 4 5 6 7 8 9 Queryresponsetime number of joined tables Aggregation + Tables Join PostgreSQL PG-Strom The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 20. As an aside... Let’s back to the development history prior to the remaining elemental technology DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 20
  • 21. Development History of PG-Strom DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 21 2011 Feb KaiGai moved to Germany May PGconf 2011 2012 Jan The first prototype May Tried to use pseudo code Aug Proposition of background worker and writable FDW 2013 Jul NEC admit PG-Strom development Nov Proposition of CustomScan API 2014 Feb Moved to OpenCL from CUDA Jun GpuScan, GpuHashJoin and GpuSort were implemented Sep Drop GpuSort, instead of GpuPreAgg Nov working beta-1 gets available First prototype announced “Strom” comes from Germany term; where I lived in at that time
  • 22. Inspiration – May-2011 DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 22 http://ja.scribd.com/doc/44661593/PostgreSQL-OpenCL-Procedural-Language PGconf 2011 @ Ottawa, Canada You’re brave. Unavailable to run on regular query, not only PL functions?
  • 23. Inspired by PgOpencl Project DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 23 A B C D E summary: GPU works well towards large BLOB, like image data. Parallel Image Searching Using PostgreSQL PgOpenCL @ PGconf 2011 Tim Child (3DMashUp) Even small width of values, GPU will work well if we have transposition.
  • 24. Made a prototype – Christmas vacation in 2011  CUDA based ( Now, OpenCL)  FDW (Foreign Data Wrapper) based ( Now, CustomScan API)  Column oriented data structure ( Now, row-oriented data) DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 24 CPU vanilla PostgreSQL PostgreSQL + PG-Strom Iterationofrowfetch andevaluation multiple rows at once Async DMA & Async Kernek Exec CUDA Compiler : Fetch a row from buffer : Evaluation of WHERE clause CPU GPU Synchronization SELECT * FROM table WHERE sqrt((x-256)^2 + (y-100)^2) < 10; code for GPU Auto GPU code generation, Just-in-time compile Query response time reduction GPU Device
  • 25.  A set of APIs to show external data source as like a table of PostgreSQL  FDW driver generates rows on demand when foreign tables are referenced.  Extension can have arbitrary data structure and logic to process the data.  It is also allowed to scan internal data with GPU acceleration! What is FDW (Foreign Data Wrapper) QueryExecutor Regular Table Foreign Table Foreign Table Foreign Table File FDW Oracle FDW PG-Strom FDW storage Regular Table storage QueryPlanner QueryParser Exec Exec Exec Exec SQL Query DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 25 CSV Files ....... .... .... ... ... ... . ... ... . .... .. .... .. . .. .. . . . .. . Other DBMS
  • 26. Implementation at that time ▌Problem  Only foreign tables can be accelerated with GPUs.  Only full table-scan can be supported.  Foreign tables were read-only at that time.  Slow-down of 1st time query because of device initialization. summary: Not easy to use PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module26 ForeignTable(pgstrom) value a[] rowmap value b[] value c[] value d[] <not used> <not used> Table: my_schema.ft1.b.cs 10300 {10.23, 7.54, 5.43, … } 10100 {2.4, 5.6, 4.95, … } ② Calculation ① Transfer ③ Write-Back Table: my_schema.ft1.c.cs {‘2010-10-21’, …} {‘2011-01-23’, …} {‘2011-08-17’, …} 10100 10200 10300 Shadow tables on behalf of each column of the foreign table managed by PG-Strom. Each items are stored closely in physical, using large array data type of element data.
  • 27. PostgreSQL Enhancement (1/2) – Writable FDW ▌Foreign tables had supported read access only (~v9.2)  Data load onto shadow tables was supported using special function Enhancement of APIs for INSERT/UPDATE/DELETE on foreign tables DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 27 PostgreSQL Data Source Query Planning SELECT DELETE UPDATE INSERT FDW Driver Enhancement of this interface
  • 28. PostgreSQL Enhancement (2/2) – Background Worker ▌Slowdown on first time of each query  Time for JIT compile of GPU code  caching the built binaries  Time for initialization of GPU devices  initialization once by worker process Good side effect: it works with run-time that doesn’t support concurrent usage. ▌Background Worker  An interface allows to launch background processes managed by extensions  Dynamic registration gets supported on v9.4 also. DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 28 PostmasterProcess PostgreSQL Backend Process Built-in BG workers Extension’s BG workers fork(2) on connection fork(2) on startup IPC:sharedmemory,... fork(2) on startup / demand 5432 Port
  • 29. Design Pivot ▌Is the FDW really appropriate framework for PG-Strom?  It originated from a feature to show external data as like a table. [benefit] • It is already supported on PostgreSQL [Issue] • Less transparency for users / applications • More efficient execution path than GPU; like index-scan if available • Workloads expects for full-table scan; like tables join ▌CUDA or OpenCL?  First priority is many trials by wider range of people DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 29 CUDA OpenCL Good • Fine grained optimization • Latest feature of NVIDIA GPU • Reliability of OpenCL drivers • Multiplatform support • Built-in JIT compiler • CPU parallel support Bad • No support on AMD, Intel • Reliability of OpenCL drivers
  • 30. Elemental Technology② – Custom-Scan Interface Page. 30 Table.A Table to be scanned clause id > 200 Path① SeqScan cost=400 Path② TidScan unavailable Path③ IndexScan cost=10 Path④ CustomScan (GpuScan) cost=40 Built-in Logics Table.X Table.Y Tables to be joined Path① NestLoop cost=8000 Path② HashJoin cost=800 Path③ MergeJoin cost=1200 Path④ CustomScan (GpuHashJoin) cost=500 Built-in Logics clause x.id = y.id Path③ IndexScan cost=10 Table.A WINNER! Table.X Table.Y Path④ CustomScan (GpuHashJoin) cost=500 clause x.id = y.id WINNER! It allows extensions to offer alternative scan or join logics, in addition to the built-in ones The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL clause id > 200
  • 31. Yes!! Custom-Scan Interface got merged to v9.5 Page. 31 The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL Minimum consensus is, an interface that allows extensions to implement custom logic to replace built-in scan logics. ↓ Then, we will discuss the enhancement of the interface for tables join on the developer’s community.
  • 32. Enhancement of Custom-Scan Interface The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQLPage. 32 Table.X Table.Y Tables to be joined Path ① NestLoop Path ② HashJoin Path ③ MergeJoin Path ④ CustomScan (GpuHashJoin) Built-in logics clause x.id = y.id X Y X Y NestLoop HashJoin X Y MergeJoin X Y GpuHashJoin Result set of tables join It looks like a relation scan on the result set of tables join  It replaces a node to be join by foreign-/custom-scan.  Example: A FDW that scan on the result set of remote join. ▌Join replacement by foreign-/custom-scan
  • 33. Abuse of Custom-Scan Interface ▌Usual interface contract  GpuScan/SeqScan fetches rows, then passed to upper node and more ...  TupleTableSlot is used to exchange record row by row manner. ▌Beyond the interface contract  In case when both of parent and child nodes are managed by same extension, nobody prohibit to exchange chunk of data with its own data structure.  “Bulkload: On”  multiple (usually, 10K~100K) rows at once Page. 33 postgres=# EXPLAIN SELECT * FROM t0 NATURAL JOIN t1; QUERY PLAN ----------------------------------------------------------------------------- Custom (GpuHashJoin) (cost=2234.00..468730.50 rows=19907950 width=106) hash clause 1: (t0.aid = t1.aid) Bulkload: On -> Custom (GpuScan) on t0 (cost=500.00..267167.00 rows=20000024 width=73) -> Custom (MultiHash) (cost=734.00..734.00 rows=40000 width=37) hash keys: aid Buckets: 46000 Batches: 1 Memory Usage: 99.97% -> Seq Scan on t1 (cost=0.00..734.00 rows=40000 width=37) Planning time: 0.220 ms (9 rows) The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 34. PostgreSQL PG-Strom Element Technology③ – Row oriented data structure Page. 34 Storage Storage Manager Shared Buffer Query Parser Query Optimizer Query Executor SQL Query Breaks down the query to parse tree Makes query execution plan Run the query Custom-Plan APIs GpuScan GpuHashJoin GpuPreAgg DMA Transfer (via PCI-E bus) GPU Program Manager PG-Strom OpenCL Server Message Queue T-Tree Columnar Cache GPU Code Generator Cache construction (rowcolumn) Materialize the result (columnrow) Prior implementation had table cache with column-oriented data structure The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 35. Why GPU well fit column-oriented data structure Page. 35 SOURCE: Maxwell: The Most Advanced CUDA GPU Ever Made Core Core Core Core Core Core Core Core Core Core SOURCE: How to Access Global Memory Efficiently in CUDA C/C++ Kernels coalesced memory access Global Memory (DRAM) Wide Memory Bandwidth (256-384bits) WARP: A set of processor cores that share instruction pointer. It is usually 32. The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 36. Format of PostgreSQL tuples Page. 36 struct HeapTupleHeaderData { union { HeapTupleFields t_heap; DatumTupleFields t_datum; } t_choice; /* current TID of this or newer tuple */ ItemPointerData t_ctid; /* number of attributes + various flags */ uint16 t_infomask2; /* various flag bits, see below */ uint16 t_infomask; /* sizeof header incl. bitmap, padding */ uint8 t_hoff; /* ^ - 23 bytes - ^ */ /* bitmap of NULLs -- VARIABLE LENGTH */ bits8 t_bits[1]; /* MORE DATA FOLLOWS AT END OF STRUCT */ }; HeapTupleHeader NULL bitmap (if has null) OID of row (if exists) Padding 1st Column 2nd Column 4th Column Nth Column No data if NULL Variable length fields makes unavailable to predicate offset of the later fields. No NULL Bitmap if all the fields are not NULL We usually have no OID of row The worst data structure for GPU processor The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 37. Nightmare of columnar cache (1/2) ▌列指向キャッシュと行列変換 Page. 37 postgres=# explain (analyze, costs off) select * from t0 natural join t1 natural join t2; QUERY PLAN ------------------------------------------------------------------------------ Custom (GpuHashJoin) (actual time=54.005..9635.134 rows=20000000 loops=1) hash clause 1: (t0.aid = t1.aid) hash clause 2: (t0.bid = t2.bid) number of requests: 144 total time to load: 584.67ms total time to materialize: 7245.14ms  70% of total execution time!! average time in send-mq: 37us average time in recv-mq: 0us max time to build kernel: 1us DMA send: 5197.80MB/sec, len: 2166.30MB, time: 416.77ms, count: 470 DMA recv: 5139.62MB/sec, len: 287.99MB, time: 56.03ms, count: 144 kernel exec: total: 441.71ms, avg: 3067us, count: 144 -> Custom (GpuScan) on t0 (actual time=4.011..584.533 rows=20000000 loops=1) -> Custom (MultiHash) (actual time=31.102..31.102 rows=40000 loops=1) hash keys: aid -> Seq Scan on t1 (actual time=0.007..5.062 rows=40000 loops=1) -> Custom (MultiHash) (actual time=17.839..17.839 rows=40000 loops=1) hash keys: bid -> Seq Scan on t2 (actual time=0.019..6.794 rows=40000 loops=1) Execution time: 10525.754 ms The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 38. Nightmare of columnar cache (2/2) Page. 38 Translation from table (row-format) to the columnar-cache (only at once) Translation from PG- Strom internal (column-format) to TupleTableSlot (row-format) for data exchange in PostgreSQL [Hash-Join] Search the relevant records on inner/outer relations. Breakdown of execution time You cannot see the wood for the trees (木を見て森を見ず) Optimization in GPU also makes additional data-format translation cost (not ignorable) on the interface of PostgreSQL The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 39. A significant point in GPU acceleration (1/2) Page. 39 • CPU is rare resource, so should put less workload on CPUs unlike GPUs • We need to pay attention on access pattern of memory for CPU’s job Storage Shared Buffer T-Tree columnar cache Result Buffer TupleTableSlot Task of CPU RowColumn translation on cache construction Very heavy loads CPUのタスク キャッシュ構築の際に、 一度だけ行列変換 極めて高い負荷 Task of PCI-E Due to column-format, only referenced data shall be transformed. Task of CPU ColumnRow translation everytime when we returns the result to PostgreSQL. Very heavy loads Task of GPU Operations on the data with column-format. It is optimized data according to GPU’s nature In case of column-format The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 40. A significant point in GPU acceleration (2/2) Page. 40 • CPU is rare resource, so should put less workload on CPUs unlike GPUs • We need to pay attention on access pattern of memory for CPU’s job Storage Shared Buffer Result Buffer TupleTableSlot Task of CPU Visibility check prior to DMA translation Light Load CPUのタスク キャッシュ構築の際に、 一度だけ行列変換 極めて高い負荷 Task of PCI-E Due to row-format, all data shall be transformed. A little heavy load Task of CPU Set pointer of the result buffer Very Light Load Task of GPU Operations on row-data. Although not an optimal data, massive cores covers its performance In case of row-format No own columnar cache The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 41. Integration with shared-buffer of PostgreSQL ▌列指向キャッシュと行列変換 Page. 41 postgres=# explain (analyze, costs off) select * from t0 natural join t1 natural join t2; QUERY PLAN ----------------------------------------------------------- Custom (GpuHashJoin) (actual time=111.085..4286.562 rows=20000000 loops=1) hash clause 1: (t0.aid = t1.aid) hash clause 2: (t0.bid = t2.bid) number of requests: 145 total time for inner load: 29.80ms total time for outer load: 812.50ms total time to materialize: 1527.95ms  Reduction dramatically average time in send-mq: 61us average time in recv-mq: 0us max time to build kernel: 1us DMA send: 5198.84MB/sec, len: 2811.40MB, time: 540.77ms, count: 619 DMA recv: 3769.44MB/sec, len: 2182.02MB, time: 578.87ms, count: 290 proj kernel exec: total: 264.47ms, avg: 1823us, count: 145 main kernel exec: total: 622.83ms, avg: 4295us, count: 145 -> Custom (GpuScan) on t0 (actual time=5.736..812.255 rows=20000000 loops=1) -> Custom (MultiHash) (actual time=29.766..29.767 rows=80000 loops=1) hash keys: aid -> Seq Scan on t1 (actual time=0.005..5.742 rows=40000 loops=1) -> Custom (MultiHash) (actual time=16.552..16.552 rows=40000 loops=1) hash keys: bid -> Seq Scan on t2 (actual time=0.022..7.330 rows=40000 loops=1) Execution time: 5161.017 ms  Much better response time Performance degrading little bit The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 42. PG-Strom’s current ▌Supported logics  GpuScan ... Full-table scan with qualifiers  GpuHashJoin ... Hash-Join with GPUs  GpuPreAgg ... Preprocess of aggregation with GPUs ▌Supported data-types  Integer (smallint, integer, bigint)  Floating-point (real, float)  String (text, varchar(n), char(n))  Date and time (date, time, timestamp)  NUMERIC ▌Supported functions  arithmetic operators for above data-types  comparison operators for above data-types  mathematical functions on floating-points  Aggregate: MIN, MAX, SUM, AVG, STD, VAR, CORR Page. 42 The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 43. PG-Strom’s future ▌Logics will be supported  Sort  Aggregate Push-down  ... anything else? ▌Data types will be supported  ... anything else? ▌Functions will be supported  LIKE, Regular Expression?  Date/Time extraction?  PostGIS?  Geometrics?  User defined functions? (like pgOpenCL) ▌Parallel Scan Page. 43 Shared Buffer block-0 block-Nblock-N/3 block-2N/3 Range Partitioning & Parallel Scan Current Future Not easy to supply GPU enough data with single CPU core in spite of on-memory store. The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 44. Our target segment ▌1st target application: BI tools are expected ▌Up-to 1.0TB data: SMB company or branches of large company  Because it needs to keep the data in-memory ▌GPU and PostgreSQL: both of them are inexpensive. Page. 44 High-end data analysis appliance Commercial or OSS RDBMS large response-time / batch process small response-time / real-time smallerdata-sizelargerdata-size < 1.0TB > 1.0TB Cost advantage Performance advantage PG-Strom Hadoop? The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 45. (Ref) ROLAP and MOLAP ▌Expected PG-Strom usage  Backend RDBMS for ROLAP / acceleration of ad-hoc queries  Cube construction for MOLAP / acceleration of batch processing Page. 45 ERP SCM CRM Finance DWH DWH world of transaction (OLTP) world of analytics (OLAP) ETL ROLAP: DB summarizes the data on demand by BI tools MOLAP: BI tools reference information cube; preliminary constructed The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 46. Expected Usage (1/2) – OLTP/OLAP Integration ▌Why separated OLTP/OLAP system  Integration of multiple data source  Optimization for analytic workloads ▌How PG-Strom will improve...  Parallel execution of Join / Aggregate; being key of performance  Elimination or reduction of OLAP / ETL, thus smaller amount of TCO DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 46 ERPCRMSCM BI OLTP database OLAP database ETL OLAP CubesMaster / Fact Tables BI PG-Strom Performance Key • Join master and fact tables • Aggregation functions
  • 47. Expected Usage (2/2) – Let’s investigate with us ▌Which region PG-Strom can work for? ▌Which workloads PG-Strom shall fit? ▌What workloads you concern about?  PG-Strom Project want to lean from the field. DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 47
  • 48. How to use (1/3) – Installation Prerequisites ▌OS: Linux (RHEL 6.x was validated) ▌PostgreSQL 9.5devel (with Custom-Plan Interface) ▌PG-Strom Module ▌OpenCL Driver (like nVIDIA run-time) DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 48 shared_preload_libraries = '$libdir/pg_strom‘ shared_buffers = <enough size to load whole of the database> Minimum configuration of PG-Strom postgres=# SET pg_strom.enabled = on; SET Turn on/off PG-Strom at run-time
  • 49. How to use (2/3) – Build, Install and Starup DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 49 [kaigai@saba ~]$ git clone https://github.com/pg-strom/devel.git pg_strom [kaigai@saba ~]$ cd pg_strom [kaigai@saba pg_strom]$ make && make install [kaigai@saba pg_strom]$ vi $PGDATA/postgresql.conf [kaigai@saba ~]$ pg_ctl start server starting [kaigai@saba ~]$ LOG: registering background worker "PG-Strom OpenCL Server" LOG: starting background worker process "PG-Strom OpenCL Server" LOG: database system was shut down at 2014-11-09 17:45:51 JST LOG: autovacuum launcher started LOG: database system is ready to accept connections LOG: PG-Strom: [0] OpenCL Platform: NVIDIA CUDA LOG: PG-Strom: (0:0) Device GeForce GTX 980 (1253MHz x 16units, 4095MB) LOG: PG-Strom: (0:1) Device GeForce GTX 750 Ti (1110MHz x 5units, 2047MB) LOG: PG-Strom: [1] OpenCL Platform: Intel(R) OpenCL LOG: PG-Strom: Platform "NVIDIA CUDA (OpenCL 1.1 CUDA 6.5.19)" was installed LOG: PG-Strom: Device "GeForce GTX 980" was installed LOG: PG-Strom: shmem 0x7f447f6b8000-0x7f46f06b7fff was mapped (len: 10000MB) LOG: PG-Strom: buffer 0x7f34592795c0-0x7f44592795bf was mapped (len: 65536MB) LOG: Starting PG-Strom OpenCL Server LOG: PG-Strom: 24 of server threads are up
  • 50. How to use (3/3) – Deployment on AWS Page. 50 Search by “strom” ! AWS GPU Instance (g2.2xlarge) CPU Xeon E5-2670 (8 xCPU) RAM 15GB GPU NVIDIA GRID K2 (1536core) Storage 60GB of SSD Price $0.898/hour, $646.56/mon (*) Price for on-demand instance on Tokyo region at Nov-2014 The PostgreSQL Conference 2014, Tokyo - GPGPU Accelerates PostgreSQL
  • 51. Future of PG-Strom DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 51
  • 52. Dilemma of Innovation DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 52 SOURCE: The Innovator's Dilemma, Clayton M. Christensen Performance demanded at the high end of the market Performance demanded at the low end of the marker or in a new emerging segment ProductPerformance Time
  • 53. Move forward with community DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 53
  • 54. (Additional comment after the conference) DB Tech Showcase 2014 Tokyo; PG-Strom - GPGPU acceleration on PostgreSQLPage. 54 check it out! https://github.com/pg-strom/devel Oops, I forgot to say in the conference! PG-Strom is open source.