diff --git a/README.md b/README.md
index 0bedc0b..ba30ec1 100644
--- a/README.md
+++ b/README.md
@@ -1,34 +1,972 @@
-Vectorized Operations extension for PostgreSQL.
-
-This extensions allows to increase more than 10 times speed of OLAP queries with filter and aggregation
-by storing data in tiles: group of column values. It allows to reduce deform tuple and executor overhead.
-Please look at tpch.sql example which shows how VOPS can be used to increase speed of TPC-H Q1/Q6 queries
-more than ten times.
-
-How to use VOPS? First of all you need to somehow load data in VOPS.
-It can be done in two ways:
-1. Load data from existed table. In this case you just need to create VOPS projection of this table (using VOPS types instead
-of original scalar types) and copy data to it using VOPS populate(...) function.
-2. If you data is not yet loaded in the database, you can import it directly from CSV file into VOPS table using VOPS import(...) function.
-
-Ok, now you have data in VOPS format. What you can do with it? VOPS manual (vops.html) explains many different ways of running
-VOPS queries. VOPS provides set of overloaded operators which allows you to write queries in more or less standard SQL.
-Operators which can not be overloaded (and, or, not, between) are handled by VOPS executor hook.
-
-VOPS is able to efficiently execute filter and aggregation queries. What about other kinds of queries? For examples queries with
-joins? There are once again two choices:
-1. You can use original table (if any) for such queries.
-2. You can use VOPS foreign data wrapper (FDW) to present VOPS table to PostgreSQL as normal table (with scalar column types).
-The parts of query which can be efficiently executed by VOPS (filtering and aggregation) will be pushed by Postgres query optimizer
-to VOPS FDW and will be executed using VOPS operators. Other query nodes will fetch data from VOPS as standard tuples
-and process them in the same way as in case of normal tables. VOPS FDW provides statistic (you need to do ANALYZE for FDW table)
-so query execution plan should be almost the same as for normal tables. The only exception is parallel processing:
-parallel processing is not currently supported by VOPS FDW.
-
-So what finally you get? By creating of VOPS projection of existed data or storing data in VOPS table you can speed-up execution
-of some queries more than ten times (mostly analytic queries with aggregation and without joins). And still be able to execute
-all other queries using VOPS FDW.
-
-Chinces version of VOPS documantation can be found here:
-https://github.com/digoal/blog/blob/master/201702/20170225_01.md
+## Motivation
+PostgreSQL looks very competitive with other mainstream databases on
+OLTP workloads (large numbers of simple queries). But on OLAP queries,
+which process large volumes of data, DBMSes oriented towards analytic
+query processing can be an order of magnitude faster. Let's investigate
+why this happens and whether something can be done to make PostgreSQL
+efficient for OLAP queries as well.
+
+## Where does a DBMS spend most of its time processing OLAP queries?
+
+Profiling query execution shows several main factors limiting Postgres
+performance:
+
+1. Tuple unpacking overhead (tuple\_deform). To access column values,
+    Postgres has to deform the tuple. Values can be compressed, stored
+    on some other page (TOAST), ... Also, since column sizes can vary,
+    extracting the N-th column requires unpacking the preceding N-1
+    columns.
+    So deforming a tuple is quite an expensive operation, especially
+    for tables with many attributes. In some cases rearranging the
+    columns of a table can significantly reduce query execution time.
+    Another universal approach is to split the table in two: a small
+    one with the frequently accessed scalar columns and another one
+    with the rarely used large columns (a short sketch follows this
+    list). This requires an extra join, but it can reduce the amount
+    of fetched data several times. In queries like TPC-H Q6, tuple
+    deforming takes about 40% of the total query execution time.
+2. Interpretation overhead. The Postgres parser and optimizer build a
+    tree representing the query execution plan, and the executor
+    recursively invokes evaluation functions for the nodes of this
+    tree. The implementation of some nodes also contains switches used
+    to select the requested action. So the query plan is interpreted
+    by the Postgres executor rather than directly executed. An
+    interpreter is usually about 10 times slower than native code,
+    which is why eliminating interpretation overhead can make queries
+    several times faster, especially queries with complex predicates
+    where most of the time is spent in expression evaluation.
+3. Abstraction penalty. Support for abstract (user-defined) types and
+    operations is one of the key features of Postgres. Its executor
+    can deal not only with the built-in scalar types (integer, real,
+    ...) but with any type defined by the user (for example complex or
+    point). The price of this flexibility is that every operation
+    requires a function call. Instead of adding two integers directly,
+    the Postgres executor invokes a function which performs the
+    addition, and the function call overhead is much larger than the
+    operation itself. This overhead is further increased by the
+    Postgres function call convention, which passes parameter values
+    through memory rather than in registers.
+4. Pull model overhead. The Postgres executor implements the classical
+    Volcano-style query execution model - the pull model: operand
+    values are pulled by the operator. This simplifies the
+    implementation of the executor and of the operators, but it has a
+    negative impact on performance, because the leaf nodes (fetching
+    tuples from heap or index pages) have to do a lot of extra work
+    saving and restoring their context.
+5. MVCC overhead. Postgres provides multiversion concurrency control,
+    which allows multiple transactions to work with the same record
+    concurrently without blocking each other. This is good for
+    frequently updated data (OLTP), but for the read-only or
+    append-only data of OLAP scenarios it just adds overhead: both
+    space overhead (about 20 extra bytes per tuple) and CPU overhead
+    (checking the visibility of each tuple).
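+
+Returning to the table-splitting technique mentioned in point 1 above,
+a minimal sketch of such a split (all table and column names here are
+hypothetical, not part of VOPS) could look like this:
+
+```
+-- "hot" part: small, frequently accessed scalar columns
+create table orders_hot(
+    order_id   bigint primary key,
+    order_date date,
+    amount     numeric
+);
+-- "cold" part: rarely used, large columns
+create table orders_cold(
+    order_id         bigint primary key references orders_hot(order_id),
+    shipping_address text,
+    comment          text
+);
+
+-- analytic queries scan only the narrow "hot" table and deform much smaller tuples
+select order_date, sum(amount) from orders_hot group by order_date;
+
+-- the "cold" columns are fetched through an extra join only when actually needed
+select h.order_id, h.amount, c.comment
+  from orders_hot h join orders_cold c using (order_id)
+ where h.order_id = 42;
+```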
+
+There are many different ways of addressing these issues. For example,
+we can use a JIT (Just-In-Time) compiler to generate native code for a
+query, eliminating interpretation overhead and speeding up tuple
+deforming. We can rewrite the executor from the pull model to the push
+model. We can try to optimize the tuple format to make heap deforming
+more efficient. Or we can generate byte code for the query execution
+plan, whose interpretation is more efficient than recursive invocation
+of evaluation functions for each node because of better access
+locality. But all these approaches require significant rewriting of the
+Postgres executor, and some of them also require changes to the whole
+Postgres architecture.
+
+There is, however, an approach which addresses most of these issues
+without radical changes to the executor: vector operations. It is
+explained in the next section.
+
+## Vertical storage
+
+A traditional query executor (such as the Postgres executor) deals with
+a single row of data at a time. To evaluate the expression (x+y) it
+fetches the value of "x", then the value of "y", performs the "+"
+operation and returns the resulting value to the upper node. A
+vectorized executor, in contrast, processes multiple values in one
+operation. Here "x" and "y" are not single scalar values but vectors of
+values, and the result is also a vector of values. In the vector
+execution model the interpretation and function call overhead is
+divided by the size of the vector: the cost of a function call stays
+the same, but since the function processes N values instead of one,
+this overhead becomes much less critical.
+
+What is the optimal size for the vector? From the explanation above it
+is clear that the larger the vector, the smaller the per-row overhead.
+So we could form a vector from all values of the corresponding table
+attribute. This is the so-called vertical data model, or columnar
+store. Unlike the classical "horizontal" data model, where the unit of
+storage is a row (tuple), here we store vertical columns. A columnar
+store has the following main advantages:
+
+  - Reduced amount of fetched data: only the columns used in the query
+    need to be fetched.
+  - Better compression: storing all values of the same attribute
+    together makes it possible to compress them much better and
+    faster, for example using delta encoding.
+  - Minimal interpretation overhead: each operation is performed not
+    on a single value but on a set of values.
+  - Use of CPU vector instructions (SIMD) to process the data.
+
+Several DBMSes implement the columnar store model; the most popular
+are [Vertica](https://vertica.com/) and
+[MonetDB](https://www.monetdb.org/Home). But performing an operation on
+a whole column at once is actually not such a good idea. The table can
+be very large (OLAP queries tend to work with large data sets), so the
+vector can also be very big and may not even fit in memory. And even if
+it fits in memory, working with such large vectors prevents efficient
+utilization of the CPU caches (L1, L2, ...). Consider the expression
+(x+y)\*(x-y). The vector executor adds the two vectors "x" and "y" and
+produces the result vector "r1". But by the time the last element of
+"r1" is produced, its first elements have already been evicted from the
+CPU cache, as have the first elements of "x" and "y". So when we
+calculate (x-y) we once again have to load the data for "x" and "y"
+from slow memory into the fast cache. Then we produce "r2" and multiply
+"r1" by "r2", but here too the vectors first have to be loaded into the
+CPU cache again.
+
+So it is more efficient to split a column into relatively small
+*chunks* (or *tiles* - there is no universally accepted term). Such a
+chunk is the unit of processing for the vectorized executor. The chunk
+size is chosen so that all operands of the vector operations stay in
+the cache even for complex expressions; typical chunk sizes are from
+100 to 1000 elements. So the expression (x+y)\*(x-y) is evaluated not
+over the whole column but over just 100 values at a time (assuming a
+chunk size of 100). Splitting columns into chunks, as done in the
+MonetDB successor X100 and in HyPer, allows speed to be increased up to
+ten times.
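+
+To make the cache argument concrete (the numbers below are purely
+illustrative and assume 8-byte elements and a 32 KB L1 data cache):
+evaluating (x+y)\*(x-y) keeps about five vectors alive at once - "x",
+"y", the two intermediate results and the final product. With a chunk
+of 100 elements this is 5 \* 100 \* 8 = 4 KB, which easily fits in L1;
+with a one-million-element column it is about 40 MB, which does not fit
+even in the last-level cache, so every intermediate vector has to be
+written to and re-read from main memory.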
+
+## VOPS
+
+### Overview
+
+There have been several attempts to integrate a columnar store into
+PostgreSQL. The best known is
+[CStore FDW](https://github.com/citusdata/cstore_fdw) by CitusDB. It is
+implemented as a foreign data wrapper (FDW) and is efficient for
+queries fetching a relatively small fraction of the columns. But it
+uses the standard row-based Postgres executor and so cannot take
+advantage of vector processing. There is also an interesting
+[project](https://github.com/citusdata/postgres_vectorization_test)
+done by a CitusDB intern, which implements vector operations on top of
+CStore using executor hooks for some nodes. It reports a 4-6 times
+speedup for grand aggregates and a 3 times speedup for aggregation with
+group by.
+
+Another project is [IMCS](https://github.com/knizhnik/imcs): In-Memory
+Columnar Store. Here the columnar store lives in memory and is accessed
+through special functions, so you cannot use standard SQL queries to
+work with this storage - you have to rewrite them using IMCS functions.
+IMCS provides vector operations (using tiles) and parallel execution.
+
+Both CStore and IMCS keep data outside Postgres. But what if we want to
+use vector operations for data kept in standard Postgres tables?
+Definitely, the best approach would be to implement an alternative heap
+format - or go even further and eliminate the notion of the heap
+altogether, treating the heap as just another access method, similar
+to indexes.
+
+Such radical changes, however, require a deep redesign of the whole
+Postgres architecture. It is better to first estimate the possible
+benefits we can expect from using vector operations. A vectorized
+executor is widely discussed in the Postgres forums, but an efficient
+vectorized executor is not possible without underlying support at the
+storage layer: the advantages of vector processing are annihilated if
+the vectors have to be formed from attributes of rows extracted from
+existing Postgres heap pages.
+
+The idea of the VOPS extension is to implement vector operations for
+tiles represented as special Postgres types. Tile types are used as
+table column types instead of scalar types: for example, instead of
+"real" we use "vops\_float4", a tile holding up to 64 values of the
+corresponding column. Why 64? There are several reasons for choosing
+this number:
+
+1. To provide efficient access to tiles, we want
+    size\_of\_tile\*size\_of\_attribute\*number\_of\_attributes to be
+    smaller than the page size. A typical record contains about 10
+    attributes and the default Postgres page size is 8kB; with 8-byte
+    values, 64 elements per tile give 64\*8\*10 = 5kB, which fits on
+    one page.
+2. 64 is the number of bits in a machine word. We need to maintain a
+    bitmask marking null values. It would certainly be possible to
+    store the bitmask in an array of arbitrary size, but manipulating
+    a single 64-bit integer is more efficient.
+3. Following the cache-utilization arguments above, the tile size
+    should be on the order of 100..1000 elements, and 64 is close to
+    this range.
+
+VOPS is implemented as a Postgres extension. It doesn't change anything
+in the Postgres executor or page format, it doesn't install any
+executor hooks, and it doesn't alter the query execution plan. The main
+goal of this project was to measure the speedup that can be reached by
+using vector operations with the existing executor and heap manager.
+VOPS provides a set of standard operators for the tile types, which
+allows SQL queries to be written in a more or less standard way. Right
+now vector operators can be used inside predicates and aggregate
+expressions; joins are not currently supported. The details of the VOPS
+architecture are described below.
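+
+The short examples in the following sections refer to a small `trades`
+table. It is not part of the TPC-H schema used later in this document;
+the definition below is only an assumed sketch, using the tile types
+described in the next section:
+
+```
+create table trades(
+    day    vops_date   not null,  -- trading day, stored as tiles of dates
+    price  vops_float4 not null,  -- trade price
+    volume vops_int8   not null   -- traded volume
+);
+```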
+
+### Types
+
+VOPS supports all basic Postgres numeric types: 1-, 2-, 4- and 8-byte
+integers and 4- and 8-byte floats. It also supports the `date` and
+`timestamp` types, but they use the same implementation as `int4` and
+`int8` respectively.
+
+| SQL type                 | C type    | VOPS tile type  |
+| ------------------------ | --------- | --------------- |
+| bool                     | bool      | vops\_bool      |
+| "char", char or char(1)  | char      | vops\_char      |
+| int2                     | int16     | vops\_int2      |
+| int4                     | int32     | vops\_int4      |
+| int8                     | int64     | vops\_int8      |
+| float4                   | float4    | vops\_float4    |
+| float8                   | float8    | vops\_float8    |
+| date                     | DateADT   | vops\_date      |
+| timestamp                | Timestamp | vops\_timestamp |
+
+VOPS doesn't support strings (char or varchar types), except for the
+single-character case. If strings are used as identifiers, it is in
+most cases preferable to place them in a dictionary and use integer
+identifiers instead of the original strings.
+
+### Vector operators
+
+VOPS provides implementations of all built-in SQL arithmetic operators
+for numeric types: **+ - / \***. Certainly it also implements all
+comparison operators: **= \<\> \> \>= \< \<=**. The operands of these
+operators can be either tiles or scalar constants: `x=y` or `x=1`.
+
+The boolean operators `and`, `or`, `not` cannot be overloaded, so VOPS
+provides the operators **& | \!** instead. Please notice that the
+precedence of these operators differs from that of `and`, `or`, `not`,
+so you cannot write a predicate as `x=1 | x=2` - it will cause a syntax
+error. To solve this problem, use parentheses: `(x=1) | (x=2)`.
+
+VOPS also provides an analog of the BETWEEN operator. In SQL the
+expression `(x BETWEEN a AND b)` is equivalent to
+`(x >= a AND x <= b)`, but since the AND operator cannot be overloaded,
+this substitution does not work for VOPS tiles. This is why VOPS
+provides a special function for range checks. Unfortunately `BETWEEN`
+is a reserved keyword, so no function with that name can be defined;
+the synonym `BETWIXT` is used instead.
+
+Postgres requires a predicate expression to have boolean type, but the
+result of the vector boolean operators is `vops_bool`, not `bool`, so
+the compiler does not allow them to be used in a predicate directly.
+The problem is solved by the special `filter` function. It is given an
+arbitrary vector boolean expression and returns a normal boolean
+which ... is always true. So from the point of view of the Postgres
+executor the predicate value is always true, but `filter` sets the
+`filter_mask` which is then used by the subsequent operators to
+determine the selected records. A VOPS query therefore looks something
+like this:
+
+```
+ select sum(price) from trades where filter(day >= '2017-01-01'::date);
+```
+
+Please notice one more difference from a normal query: we have to
+explicitly cast the string constant to the appropriate data type
+(`date` in this example). For the `betwixt` function this is not
+needed:
+
+```
+ select sum(price) from trades where filter(betwixt(day, '2017-01-01', '2017-02-01'));
+```
+
+For the `char`, `int2` and `int4` types VOPS provides the concatenation
+operator **||**, which produces the integer type of doubled width:
+`(char || char) -> int2`, `(int2 || int2) -> int4`,
+`(int4 || int4) -> int8`. It can be used for grouping by several
+columns (see below).
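+
+To illustrate the operator syntax as a whole - parentheses around each
+comparison, `|` instead of OR, and `filter()` around the vector boolean
+expression - here is one more illustrative query against the
+hypothetical `trades` table defined above:
+
+```
+-- each comparison must be parenthesized: unlike OR,
+-- the | operator binds more tightly than the comparisons
+select sum(price) from trades
+ where filter((volume > 1000) | (price < 10.0));
+```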
+
+| Operator              | Description                          |
+| --------------------- | ------------------------------------ |
+| `+`                   | Addition                             |
+| `-`                   | Binary subtraction or unary negation |
+| `*`                   | Multiplication                       |
+| `/`                   | Division                             |
+| `=`                   | Equals                               |
+| `<>`                  | Not equals                           |
+| `<`                   | Less than                            |
+| `<=`                  | Less than or equals                  |
+| `>`                   | Greater than                         |
+| `>=`                  | Greater than or equals               |
+| `&`                   | Boolean AND                          |
+| `\|`                  | Boolean OR                           |
+| `!`                   | Boolean NOT                          |
+| `betwixt(x,low,high)` | Analog of BETWEEN                    |
+| `is_null(x)`          | Analog of IS NULL                    |
+| `is_not_null(x)`      | Analog of IS NOT NULL                |
+| `ifnull(x,subst)`     | Analog of COALESCE                   |
+
+### Vector aggregates
+
+OLAP queries usually perform some kind of aggregation over large
+volumes of data. This includes `grand` aggregates, which are calculated
+over the whole table, and aggregates with `group by`, which are
+calculated for each group. VOPS implements all standard SQL aggregates:
+`count, min, max, sum, avg, var_pop, var_samp, variance, stddev_pop,
+stddev_samp, stddev`. They can be used exactly in the same way as in
+normal SQL queries:
+
+    select sum(l_extendedprice*l_discount) as revenue
+    from vops_lineitem
+    where filter(betwixt(l_shipdate, '1996-01-01', '1997-01-01')
+               & betwixt(l_discount, 0.08, 0.1)
+               & (l_quantity < 24));
+
+VOPS also provides the weighted average aggregate `wavg`, which can be
+used to calculate the volume-weighted average price (VWAP):
+
+    select wavg(l_extendedprice,l_quantity) from vops_lineitem;
+
+Using aggregation with group by is more complex. VOPS provides two
+functions for it: `map` and `reduce`. The actual work is done by
+**map**(*group\_by\_expression*, *aggregate\_list*, *expr* {, *expr* }).
+VOPS implements aggregation using a hash table whose entries collect
+the aggregate states for all groups. The set returning function
+`reduce` just iterates through the hash table constructed by `map`; it
+is needed because the result of an aggregate in Postgres cannot be a
+set. So an aggregate query with group by looks something like this:
+
+    select reduce(map(l_returnflag||l_linestatus, 'sum,sum,sum,sum,avg,avg,avg',
+                  l_quantity,
+                  l_extendedprice,
+                  l_extendedprice*(1-l_discount),
+                  l_extendedprice*(1-l_discount)*(1+l_tax),
+                  l_quantity,
+                  l_extendedprice,
+                  l_discount)) from vops_lineitem where filter(l_shipdate <= '1998-12-01'::date);
+
+Here we use the concatenation operator to group by two columns. Right
+now VOPS supports grouping only by integer types. Another serious
+restriction is that all aggregated expressions must have the same type,
+for example `vops_float4`. It is not possible to calculate aggregates
+for `vops_float4` and `vops_int8` columns in one call of the `map`
+function, because it accepts the aggregation arguments as a variadic
+array, so all elements of this array must have the same type.
+
+The aggregate string in the `map` function should contain the list of
+requested aggregate functions, separated by commas. The standard
+lowercase names should be used: `count, sum, avg, min, max`. Count is
+executed for a particular column: `count(x)`. There is no need to
+explicitly specify `count(*)`, because the number of records in each
+group is returned by the `reduce` function in any case.
+
+The `reduce` function returns a set of `vops_aggregates` type, which
+contains three components: the value of the group by expression, the
+number of records in the group, and an array of floats with the
+aggregate values. Please notice that the values of all aggregates,
+including `count` and `min/max`, are returned as floats.
+
+    create type vops_aggregates as(group_by int8, count int8, aggs float8[]);
+    create function reduce(bigint) returns setof vops_aggregates;
+
+But there is a much simpler and more straightforward way of performing
+grouped aggregation with VOPS. We can partition the table by the
+*group by* fields: the grouping keys are then stored in the normal way
+and the other fields inside tiles. Now the Postgres executor will
+execute the VOPS aggregates for each group:
+
+    select
+        l_returnflag,
+        l_linestatus,
+        sum(l_quantity) as sum_qty,
+        sum(l_extendedprice) as sum_base_price,
+        sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
+        sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
+        avg(l_quantity) as avg_qty,
+        avg(l_extendedprice) as avg_price,
+        avg(l_discount) as avg_disc,
+        countall(*) as count_order
+    from
+        vops_lineitem_projection
+    where
+        filter(l_shipdate <= '1998-12-01'::date)
+    group by
+        l_returnflag,
+        l_linestatus
+    order by
+        l_returnflag,
+        l_linestatus;
+
+In this example the `l_returnflag` and `l_linestatus` fields of the
+table vops\_lineitem\_projection have the `"char"` type while all other
+used fields have tile types (`l_shipdate` has type `vops_date`, the
+other fields `vops_float4`). The query above executes even faster than
+the query with `reduce(map(...))`. The main problem with this approach
+is that you have to create a projection for each combination of group
+by keys you want to use in queries.
+
+### Vector window functions
+
+VOPS provides limited support for Postgres window functions. It
+implements the `count, sum, min, max, avg` and `lag` functions.
+Unfortunately Postgres requires an aggregate to have the same final
+type for its moving (window) and plain implementations, which is why
+VOPS has to define these aggregates under different names: `mcount,
+msum, mmin, mmax, mavg`.
+
+There are also two important restrictions:
+
+1. Filtering, grouping and sorting can be done only by scalar
+    (non-tile) attributes.
+2. Only the `rows between unbounded preceding and current row` frame
+    is supported (but there is a special version of `msum` which
+    accepts an extra window size parameter).
+
+Example of using window functions with VOPS:
+
+    select unnest(t.*) from (select mcount(*) over w,mcount(x) over w,msum(x) over w,mavg(x) over w,mmin(x) over w,mmax(x) over w,x - lag(x) over w
+    from v window w as (rows between unbounded preceding and current row)) t;
+
+### Using indexes
+
+Analytic queries are usually performed on data for which no indexes are
+defined, and columnar store vector operations are most efficient in
+this case. But it is still possible to use indexes with VOPS.
+
+Since each VOPS tile represents multiple values, an index can be used
+only for preliminary, non-precise filtering of the data - something
+similar to BRIN indexes. VOPS provides four functions: `first, last,
+high, low`, which return the boundary values stored in a tile. The
+first two, `first` and `last`, should be used for sorted data sets: in
+this case the first value is the smallest value in the tile and the
+last value is the largest one. If the data is not sorted, the `low` and
+`high` functions should be used instead; they are more expensive
+because they have to inspect all values in the tile. Using these four
+functions it is possible to construct functional indexes on a VOPS
+table.
+A BRIN index seems to be the best choice for a VOPS table:
+
+    create index low_boundary on trades using brin(first(day)); -- trades table is ordered by day
+    create index high_boundary on trades using brin(last(day)); -- trades table is ordered by day
+
+Now it is possible to use these indexes in a query. Please notice that
+we have to recheck the precise condition, because the index gives only
+an approximate result:
+
+    select sum(price) from trades where first(day) >= '2015-01-01' and last(day) <= '2016-01-01'
+       and filter(betwixt(day, '2015-01-01', '2016-01-01'));
+
+### Preparing data for VOPS
+
+Now the most interesting question (perhaps the one we should have
+started with): how do we prepare data for VOPS queries? Who combines
+the attribute values of several rows into one VOPS tile, and how? This
+is done by the `populate` function provided by the VOPS extension.
+
+First of all you need to create a table whose columns have VOPS tile
+types. It can mirror all columns of the original table or just some
+frequently used subset of them. Such a table can be treated as a
+`projection` of the original table (the concept of projections is taken
+from Vertica). A projection should include the columns which are most
+frequently used together in queries.
+
+The original table from the TPC-H benchmark:
+
+    create table lineitem(
+       l_orderkey integer,
+       l_partkey integer,
+       l_suppkey integer,
+       l_linenumber integer,
+       l_quantity real,
+       l_extendedprice real,
+       l_discount real,
+       l_tax real,
+       l_returnflag char,
+       l_linestatus char,
+       l_shipdate date,
+       l_commitdate date,
+       l_receiptdate date,
+       l_shipinstruct char(25),
+       l_shipmode char(10),
+       l_comment char(44));
+
+The VOPS projection of this table:
+
+    create table vops_lineitem(
+       l_shipdate vops_date not null,
+       l_quantity vops_float4 not null,
+       l_extendedprice vops_float4 not null,
+       l_discount vops_float4 not null,
+       l_tax vops_float4 not null,
+       l_returnflag vops_char not null,
+       l_linestatus vops_char not null
+    );
+
+The original table can be treated as write-optimized storage (WOS): if
+it has no indexes, Postgres can provide very fast insertion speed,
+comparable with the raw disk write speed. The projection in VOPS format
+can be treated as read-optimized storage (ROS), most efficient for the
+execution of OLAP queries.
+
+Data can be transferred from the original table to the projection using
+the VOPS `populate` function:
+
+    create function populate(destination regclass,
+                             source regclass,
+                             predicate cstring default null,
+                             sort cstring default null) returns bigint;
+
+The first two mandatory arguments of this function specify the target
+and source tables. The optional predicate and sort clauses allow you to
+restrict the amount of imported data and to enforce the requested
+order. By specifying a predicate it is possible to update the VOPS
+table with only the most recently received records. The function
+returns the number of loaded records. Example of a populate invocation:
+
+    select populate(destination := 'vops_lineitem'::regclass, source := 'lineitem'::regclass);
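+
+If the source table keeps growing, the optional predicate makes it
+possible to append only the freshly arrived rows to the projection. The
+call below is only a sketch: it assumes that the predicate is passed as
+the text of a SQL condition over the source table's columns:
+
+```
+select populate(destination := 'vops_lineitem'::regclass,
+                source      := 'lineitem'::regclass,
+                predicate   := 'l_shipdate >= ''2017-01-01''');
+```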
+
+You can use the populated table in queries performing sequential scans.
+VOPS operators speed up the filtering of records and the calculation of
+aggregates. Aggregation with `group by` requires the `reduce + map`
+functions. But as mentioned above in the section describing aggregates,
+it is also possible to populate the table in such a way that the
+standard Postgres grouping algorithm can be used.
+
+We need to choose partitioning keys and sort the original table by
+these keys. The combination of partitioning keys is expected to be NOT
+unique - otherwise the tiles would only increase the used space and
+lead to performance degradation. But if there are a lot of duplicates,
+then "collapsing" them and storing the other fields in tiles helps to
+reduce space and speed up queries. Let's create the following
+projection of the `lineitem` table:
+
+    create table vops_lineitem_projection(
+       l_shipdate vops_date not null,
+       l_quantity vops_float4 not null,
+       l_extendedprice vops_float4 not null,
+       l_discount vops_float4 not null,
+       l_tax vops_float4 not null,
+       l_returnflag "char" not null,
+       l_linestatus "char" not null
+    );
+
+As you can see, in this table the `l_returnflag` and `l_linestatus`
+fields are scalars and the other fields are tiles. This projection can
+be populated using the following command:
+
+    select populate(destination := 'vops_lineitem_projection'::regclass, source := 'lineitem_projection'::regclass, sort := 'l_returnflag,l_linestatus');
+
+Now we can create a normal index on the partitioning keys, define
+standard predicates on them and use them in `group by` and `order by`
+clauses.
+
+Sometimes it is not possible or not desirable to store two copies of
+the same dataset. VOPS allows data to be loaded directly from a CSV
+file into a VOPS table with tiles, bypassing the creation of a normal
+(plain) table. It can be done using the `import` function:
+
+    select import(destination := 'vops_lineitem'::regclass, csv_path := '/mnt/data/lineitem.csv', separator := '|');
+
+The `import` function is defined in this way:
+
+    create function import(destination regclass,
+                           csv_path cstring,
+                           separator cstring default ',',
+                           skip integer default 0) returns bigint;
+
+It accepts the name of the target VOPS table, the path to the CSV file,
+an optional separator (default is ',') and the number of lines in the
+CSV header (no header by default). The function returns the number of
+imported rows.
+
+### Back to normal tuples
+
+A query on a VOPS projection returns a set of tiles. The output
+function of a tile type is able to print the content of the tile, but
+in some cases it is preferable to transform the result back to the
+normal (horizontal) format, where each tuple represents one record. It
+can be done using the `unnest` function:
+
+    postgres=# select unnest(l.*) from vops_lineitem l where filter(l_shipdate <= '1998-12-01'::date) limit 3;
+                    unnest
+    ---------------------------------------
+     (1996-03-13,17,33078.9,0.04,0.02,N,O)
+     (1996-04-12,36,38306.2,0.09,0.06,N,O)
+     (1996-01-29,8,15479.7,0.1,0.02,N,O)
+    (3 rows)
+
+### Back to normal tables
+
+As mentioned in the previous section, the `unnest` function can scatter
+records with VOPS types into normal records with scalar types, so these
+records can be used in arbitrary SQL queries. But there are two
+problems with the unnest function:
+
+1. It is not convenient to use. The function has no static knowledge
+    about the format of the output record, so the programmer has to
+    specify it manually if he wants to decompose this record.
+2. The PostgreSQL optimizer has no knowledge about the result of the
+    transformation performed by the unnest() function, so it cannot
+    choose an optimal execution plan for queries over data retrieved
+    from a VOPS table.
+
+Fortunately Postgres provides a solution to both of these problems:
+foreign data wrappers (FDW). In our case the data is not really
+"foreign" - it is stored inside our own database, just in an
+alternative (VOPS) format. The VOPS FDW "hides" the specifics of the
+VOPS format and lets normal SQL queries run on VOPS tables. The FDW
+does the following:
+
+1. Extract data from the VOPS table in the normal (horizontal) format
+    so that it can be processed by the upper nodes of the query
+    execution plan.
+2. Push down to VOPS the operations that can be efficiently executed
+    using vectorized operations on VOPS types: filtering and
+    aggregation.
+3. Provide statistics for the underlying table, which can be used by
+    the query optimizer.
+
+So, by placing a VOPS projection behind the FDW, we can efficiently
+perform sequential scan and aggregation queries as if they had been
+explicitly written for the VOPS table, and at the same time we are able
+to execute any other queries on this data, including joins, CTEs,...
+The query can be written in standard SQL without using any VOPS
+specific functions.
+
+Below is an example of creating a VOPS FDW and running some queries on
+it:
+
+    create foreign table lineitem_fdw (
+       l_suppkey int4 not null,
+       l_orderkey int4 not null,
+       l_partkey int4 not null,
+       l_shipdate date not null,
+       l_quantity float4 not null,
+       l_extendedprice float4 not null,
+       l_discount float4 not null,
+       l_tax float4 not null,
+       l_returnflag "char" not null,
+       l_linestatus "char" not null
+    ) server vops_server options (table_name 'vops_lineitem');
+
+    explain select
+        sum(l_extendedprice*l_discount) as revenue
+    from
+        lineitem_fdw
+    where
+        l_shipdate between '1996-01-01' and '1997-01-01'
+        and l_discount between 0.08 and 0.1
+        and l_quantity < 24;
+                           QUERY PLAN
+    ---------------------------------------------------------
+     Foreign Scan  (cost=1903.26..1664020.23 rows=1 width=4)
+    (1 row)
+
+    -- Filter was pushed down to FDW
+
+    explain select
+        n_name,
+        count(*),
+        sum(l_extendedprice * (1-l_discount)) as revenue
+    from
+        customer_fdw join orders_fdw on c_custkey = o_custkey
+        join lineitem_fdw on l_orderkey = o_orderkey
+        join supplier_fdw on l_suppkey = s_suppkey
+        join nation on c_nationkey = n_nationkey
+        join region on n_regionkey = r_regionkey
+    where
+        c_nationkey = s_nationkey
+        and r_name = 'ASIA'
+        and o_orderdate >= '1996-01-01'
+        and o_orderdate < '1997-01-01'
+    group by
+        n_name
+    order by
+        revenue desc;
+                                                                 QUERY PLAN
+    --------------------------------------------------------------------------------------------------------------------------------------
+     Sort  (cost=2337312.28..2337312.78 rows=200 width=48)
+       Sort Key: (sum((lineitem_fdw.l_extendedprice * ('1'::double precision - lineitem_fdw.l_discount)))) DESC
+       ->  GroupAggregate  (cost=2336881.54..2337304.64 rows=200 width=48)
+             Group Key: nation.n_name
+             ->  Sort  (cost=2336881.54..2336951.73 rows=28073 width=40)
+                   Sort Key: nation.n_name
+                   ->  Hash Join  (cost=396050.65..2334807.39 rows=28073 width=40)
+                         Hash Cond: ((orders_fdw.o_custkey = customer_fdw.c_custkey) AND (nation.n_nationkey = customer_fdw.c_nationkey))
+                         ->  Hash Join  (cost=335084.53..2247223.46 rows=701672 width=52)
+                               Hash Cond: (lineitem_fdw.l_orderkey = orders_fdw.o_orderkey)
+                               ->  Hash Join  (cost=2887.07..1786058.18 rows=4607421 width=52)
+                                     Hash Cond: (lineitem_fdw.l_suppkey = supplier_fdw.s_suppkey)
+                                     ->  Foreign Scan on lineitem_fdw  (cost=0.00..1512151.52 rows=59986176 width=16)
+                                     ->  Hash  (cost=2790.80..2790.80 rows=7702 width=44)
+                                           ->  Hash Join  (cost=40.97..2790.80 rows=7702 width=44)
+                                                 Hash Cond: (supplier_fdw.s_nationkey = nation.n_nationkey)
+                                                 ->  Foreign Scan on supplier_fdw  (cost=0.00..2174.64 rows=100032 width=8)
+                                                 ->  Hash  (cost=40.79..40.79 rows=15 width=36)
+                                                       ->  Hash Join  (cost=20.05..40.79 rows=15 width=36)
+                                                             Hash Cond: (nation.n_regionkey = region.r_regionkey)
+                                                             ->  Seq Scan on nation  (cost=0.00..17.70 rows=770 width=40)
+                                                             ->  Hash  (cost=20.00..20.00 rows=4 width=4)
+                                                                   ->  Seq Scan on region  (cost=0.00..20.00 rows=4 width=4)
+                                                                         Filter: ((r_name)::text = 'ASIA'::text)
+                               ->  Hash  (cost=294718.76..294718.76 rows=2284376 width=8)
+                                     ->  Foreign Scan on orders_fdw  (cost=0.00..294718.76 rows=2284376 width=8)
+                         ->  Hash  (cost=32605.64..32605.64 rows=1500032 width=8)
+                               ->  Foreign Scan on customer_fdw  (cost=0.00..32605.64 rows=1500032 width=8)
+
+    -- filter on orders range is pushed to FDW
+
+## Standard SQL query transformation
+
+The previous sections describe VOPS specific types, operators,
+functions, ... Good news\! You do not need to learn them - you can use
+normal SQL. Well, it is still the responsibility of the programmer or
+database administrator to create proper projections of the original
+table, and these projections need to use tile types for some attributes
+(vops\_float4, ...). But then you can query these tables using standard
+SQL, and the queries will be executed using vector operations\!
+
+How does it work? There is absolutely no magic here. There are four
+main pieces of the puzzle:
+
+1. User defined types
+2. User defined operators
+3. User defined implicit type casts
+4. A post parse analyze hook which performs query transformation
+
+VOPS defines the tile types and the standard SQL operators for these
+types. Then it defines an implicit type cast from `vops_bool` (the
+result of a boolean operation on tiles) to the boolean type. Now the
+programmer does not have to wrap vectorized boolean operations in a
+`filter()` function call. The final transformation is done by the post
+parse analyze hook defined by the VOPS extension. It replaces scalar
+boolean operations with vector boolean operations:
+
+| Original expression         | Result of transformation        |
+| --------------------------- | ------------------------------- |
+| `NOT filter(o1)`            | `filter(vops_bool_not(o1))`     |
+| `filter(o1) AND filter(o2)` | `filter(vops_bool_and(o1, o2))` |
+| `filter(o1) OR filter(o2)`  | `filter(vops_bool_or(o1, o2))`  |
+
+Now there is no need to use the VOPS specific `BETWIXT` operator: the
+standard SQL `BETWEEN` operator works (although using `BETWIXT` is
+still slightly more efficient, because it performs both comparisons in
+one function). There are also no problems with operator precedence, so
+extra parentheses are not needed. If a query includes vectorized
+aggregates, then `count(*)` is transformed into `countall(*)`.
+
+Only one difference is left between standard SQL and its vectorized
+extension: you still have to perform an explicit type cast when using a
+string literal. For example `l_shipdate <= '1998-12-01'` will not work
+for an `l_shipdate` column with a tile type, because Postgres has two
+overloaded versions of the \<= operator which can be applied here:
+
+1. `vops_date` **\<=** `vops_date`
+2. `vops_date` **\<=** `date`
+
+and it decides that it is better to convert the string to the tile type
+`vops_date`. In principle it would be possible to provide such a
+conversion operator, but it is not a good idea, because we would have
+to generate a dummy tile with all elements equal to the specified
+constant and perform a (*vector* **OP** *vector*) operation instead of
+the more efficient (*vector* **OP** *scalar*) one.
+
+There is one pitfall with the post parse analyze hook: it is installed
+in the extension's `_PG_init` function. But if the extension is not
+listed in `shared_preload_libraries`, it is loaded on demand when any
+function of the extension is first used - which unfortunately happens
+**after** parse analysis. So the first VOPS query you execute in a
+session will not be transformed and can produce a wrong result. Either
+take this into account, or add `vops` to the `shared_preload_libraries`
+configuration string. The VOPS extension also provides the special
+function `vops_initialize()` which can be invoked to force
+initialization of the extension. After invocation of this function, the
+extension is loaded and all subsequent queries are transformed normally
+and produce the expected results.
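+
+Both workarounds look like this (the `shared_preload_libraries` line
+goes into postgresql.conf; `vops_initialize()` is the function
+mentioned above):
+
+```
+-- in postgresql.conf, so that the hook is registered at server start:
+--     shared_preload_libraries = 'vops'
+
+-- or force initialization explicitly before the first VOPS query in a session:
+select vops_initialize();
+```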
+
+## Example
+
+The most popular benchmark for OLAP is [TPC-H](http://www.tpc.org/tpch).
+It contains 22 different queries. We adapted only two of them for VOPS:
+Q1 and Q6, which do not use joins. Most fragments of this code have
+already been shown above, but here we collect them together:
+
+```
+-- Standard way of creating extension
+create extension vops;
+
+-- Original TPC-H table
+create table lineitem(
+    l_orderkey integer,
+    l_partkey integer,
+    l_suppkey integer,
+    l_linenumber integer,
+    l_quantity real,
+    l_extendedprice real,
+    l_discount real,
+    l_tax real,
+    l_returnflag char,
+    l_linestatus char,
+    l_shipdate date,
+    l_commitdate date,
+    l_receiptdate date,
+    l_shipinstruct char(25),
+    l_shipmode char(10),
+    l_comment char(44),
+    l_dummy char(1)); -- this dummy column is needed because the generated data has a terminator after the last column
+
+-- Import data to it
+copy lineitem from '/mnt/data/lineitem.tbl' delimiter '|' csv;
+
+-- Create VOPS projection
+create table vops_lineitem(
+    l_shipdate vops_date not null,
+    l_quantity vops_float4 not null,
+    l_extendedprice vops_float4 not null,
+    l_discount vops_float4 not null,
+    l_tax vops_float4 not null,
+    l_returnflag vops_char not null,
+    l_linestatus vops_char not null
+);
+
+-- Copy data to the projection table
+select populate(destination := 'vops_lineitem'::regclass, source := 'lineitem'::regclass);
+
+-- For an honest comparison, create the same projection without VOPS types
+create table lineitem_projection as (select l_shipdate,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag::"char",l_linestatus::"char" from lineitem);
+
+-- Now create a mixed projection with scalar partitioning keys:
+create table vops_lineitem_projection(
+    l_shipdate vops_date not null,
+    l_quantity vops_float4 not null,
+    l_extendedprice vops_float4 not null,
+    l_discount vops_float4 not null,
+    l_tax vops_float4 not null,
+    l_returnflag "char" not null,
+    l_linestatus "char" not null
+);
+
+-- And populate it with data sorted by the partitioning keys:
+select populate(destination := 'vops_lineitem_projection'::regclass, source := 'lineitem_projection'::regclass, sort := 'l_returnflag,l_linestatus');
+
+
+-- Let's measure time
+\timing
+
+-- Original Q6 query performing filtering with calculation of a grand aggregate
+select
+    sum(l_extendedprice*l_discount) as revenue
+from
+    lineitem
+where
+    l_shipdate between '1996-01-01' and '1997-01-01'
+    and l_discount between 0.08 and 0.1
+    and l_quantity < 24;
+
+-- VOPS version of Q6 using VOPS specific operators
+select sum(l_extendedprice*l_discount) as revenue
+from vops_lineitem
+where filter(betwixt(l_shipdate, '1996-01-01', '1997-01-01')
+           & betwixt(l_discount, 0.08, 0.1)
+           & (l_quantity < 24));
+
+-- Yet another vectorized version of Q6, but now in standard SQL:
+select sum(l_extendedprice*l_discount) as revenue
+from vops_lineitem
+where l_shipdate between '1996-01-01'::date AND '1997-01-01'::date
+   and l_discount between 0.08 and 0.1
+   and l_quantity < 24;
+
+
+
+-- Original version of Q1: filter + group by + aggregation
+select
+    l_returnflag,
+    l_linestatus,
+    sum(l_quantity) as sum_qty,
+    sum(l_extendedprice) as sum_base_price,
+    sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
+    sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
+    avg(l_quantity) as avg_qty,
+    avg(l_extendedprice) as avg_price,
+    avg(l_discount) as avg_disc,
+    count(*) as count_order
+from
+    lineitem
+where
+    l_shipdate <= '1998-12-01'
+group by
+    l_returnflag,
+    l_linestatus
+order by
+    l_returnflag,
+    l_linestatus;
+
+-- VOPS version of Q1, sorry - no final sorting
+select reduce(map(l_returnflag||l_linestatus, 'sum,sum,sum,sum,avg,avg,avg',
+    l_quantity,
+    l_extendedprice,
+    l_extendedprice*(1-l_discount),
+    l_extendedprice*(1-l_discount)*(1+l_tax),
+    l_quantity,
+    l_extendedprice,
+    l_discount)) from vops_lineitem where filter(l_shipdate <= '1998-12-01'::date);
+
+-- Mixed mode: let Postgres do the group by and calculate VOPS aggregates for each group
+select
+    l_returnflag,
+    l_linestatus,
+    sum(l_quantity) as sum_qty,
+    sum(l_extendedprice) as sum_base_price,
+    sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
+    sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
+    avg(l_quantity) as avg_qty,
+    avg(l_extendedprice) as avg_price,
+    avg(l_discount) as avg_disc,
+    count(*) as count_order
+from
+    vops_lineitem_projection
+where
+    l_shipdate <= '1998-12-01'::date
+group by
+    l_returnflag,
+    l_linestatus
+order by
+    l_returnflag,
+    l_linestatus;
+
+
+```
+
+## Performance evaluation
+
+Now the most interesting part: comparing performance on the original
+table with vector operations on the VOPS projection. All measurements
+were performed on a desktop with 16Gb of RAM and a quad-core i7-4770
+CPU @ 3.40GHz with hyper-threading enabled. The data set for the
+benchmark was generated by the dbgen utility included in the TPC-H
+benchmark. The scale factor is 10, which corresponds to a database of
+about 8Gb. It fits completely in memory, so we are measuring the best
+query execution time for *warm* data. Postgres was configured with a
+shared buffer size of 8Gb. For each query we measured the time of
+sequential execution and of parallel execution with 8 parallel workers.
+
+| Query                                    | Sequential execution (msec) | Parallel execution (msec) |
+| ---------------------------------------- | --------------------------: | ------------------------: |
+| Original Q1 for lineitem                 |                       38028 |                     10997 |
+| Original Q1 for lineitem\_projection     |                       33872 |                      9656 |
+| Vectorized Q1 for vops\_lineitem         |                        3372 |                       951 |
+| Mixed Q1 for vops\_lineitem\_projection  |                        1490 |                       396 |
+| Original Q6 for lineitem                 |                       16796 |                      4110 |
+| Original Q6 for lineitem\_projection     |                        4279 |                      1171 |
+| Vectorized Q6 for vops\_lineitem         |                         875 |                       284 |
+
+## Conclusion
+
+As the performance results show, VOPS can make queries more than 10
+times faster, and this result is achieved without changing anything in
+the query planner or executor. It is better than any of the existing
+attempts to speed up the execution of OLAP queries using JIT (4 times
+for Q1, 3 times for Q6): [Speeding up query execution in PostgreSQL
+using LLVM JIT
+compiler](http://llvm.org/devmtg/2016-09/slides/Melnik-PostgreSQLLLVM.pdf).
+
+The VOPS extension is certainly just a prototype whose main role is to
+demonstrate the potential of a vectorized executor. But I hope that it
+can also be useful in practice to speed up the execution of OLAP
+aggregation queries on existing databases. In the future we should
+think about the best approach to integrating a vectorized executor into
+the Postgres core.
+
+All sources of the VOPS project can be obtained from this [GIT
+repository](https://github.com/postgrespro/vops). A Chinese version of
+the documentation is available
+[here](https://github.com/digoal/blog/blob/master/201702/20170225_01.md).
+Please send any feedback, complaints, bug reports and change requests
+to [Konstantin Knizhnik](mailto:k.knizhnik@postgrespro.ru).