Fast Serializable Multi-Version Concurrency Control For Main-Memory Database Systems
[Figure 1 shows the Accounts table (Owner, Bal) of the example: the latest version of each record is stored in-place (Sally's balance is already updated to the uncommitted value 7); version information is stored in an additional hidden column (the VersionVector) in the base relation; indexes only store references, i.e., rowIDs, to records in base relations; physical before-image deltas (i.e., column values) reside in undo buffers, e.g., Ty,Bal,8 in the undo buffer of Ty (Sally→...) and T5,Bal,9 and T5,Bal,10 in the undo buffer of T5; recentlyCommitted lists T3 Sally→Wendy and T5 Sally→Henry; activeTransactions lists Tx with startTime T4, read only: Σ.]

Figure 1: Multi-version concurrency control example: transferring $1 between Accounts (from → to) and summing all Balances (Σ)
is also the basis of our cheap serializability check, which exploits the structure of our versioning information. We further retain the very high scan performance of single-version systems using synopses of positions of versioned records in order to efficiently support analytical transactions.

In particular, the main contributions of this paper are:

1. A novel MVCC implementation that is integrated into our high-performance hybrid OLTP and OLAP main-memory database system HyPer [21]. Our MVCC model creates very little overhead for both transactional and analytical workloads and thereby enables very fast and efficient logical transaction isolation for hybrid systems that support these workloads simultaneously.

2. Based upon that, a novel approach to guarantee serializability for snapshot isolation (SI) that is both precise and cheap in terms of additional space consumption and validation time. Our approach is based on an adaptation of Precision Locking [42] and does not require explicit read locks, but still allows for more concurrency than 2PL.

3. A synopses-based approach (VersionedPositions) to retain the high scan performance of single-version systems for read-heavy and analytical transactions, which are common in today's workloads [33].

4. Extensive experiments that demonstrate the high performance and trade-offs of our MVCC implementation.
Our novel MVCC implementation is integrated into our HyPer main-memory DBMS [21], which supports SQL-92 query and ACID-compliant transaction processing (defined in a PL/SQL-like scripting language [20]). For queries and transactions, HyPer generates LLVM code that is then just-in-time compiled to optimized machine code [31]. In the past, HyPer relied on single-version concurrency control and thus did not efficiently support interactive and sliced transactions, i.e., transactions that are decomposed into multiple tasks such as stored procedure calls or individual SQL statements. Due to application roundtrip latencies and other factors, it is desirable to interleave the execution of these tasks. Our novel MVCC model enables this logical concurrency with excellent performance, even when maintaining serializability guarantees.

2. MVCC IMPLEMENTATION

We explain our MVCC model and its implementation initially by way of an example. The formalism of our serializability theory and proofs are then given in Section 3. Figure 1 illustrates the version maintenance using a traditional banking example. For simplicity, the database consists of a single Accounts table that contains just two attributes, Owner and Balance. In order to retain maximum scan performance we refrain from creating new versions in newly allocated areas as in Hekaton [8, 23]; instead we update in-place and maintain the backward delta between the updated (yet uncommitted) and the replaced version in the undo buffer of the updating transaction. Updating data in-place retains the contiguity of the data vectors that is essential for high scan performance. In contrast to positional delta trees (PDTs) [15], which were designed to allow more efficient updates in column stores, we refrain from using complex data structures for the deltas to allow for a high concurrent transactional throughput.
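To make the in-place update with backward deltas concrete, the following is a minimal C++ sketch; all names (VersionEntry, UndoBuffer, updateBalance) are hypothetical illustrations and not HyPer's actual code:

    #include <cstdint>
    #include <vector>

    // A version reconstruction delta: the physical before-image of one
    // column value, linked "newest-to-oldest" into the record's chain.
    struct VersionEntry {
        VersionEntry* pred;    // next-older delta, null if none
        uint64_t      ts;      // transactionID while uncommitted, commitTime afterwards
        int64_t       before;  // before-image of the column value
    };

    // All deltas of one transaction are clustered in its undo buffer,
    // which doubles as the rollback log.
    struct UndoBuffer {
        std::vector<VersionEntry*> entries;
    };

    // Update a Balance in-place: the replaced value goes into the undo
    // buffer and becomes the new head of the record's version chain.
    void updateBalance(std::vector<int64_t>& bal,
                       std::vector<VersionEntry*>& versionVector,
                       UndoBuffer& undo, uint64_t transactionID,
                       uint64_t rowId, int64_t newValue) {
        undo.entries.push_back(
            new VersionEntry{versionVector[rowId], transactionID, bal[rowId]});
        versionVector[rowId] = undo.entries.back();
        bal[rowId] = newValue;  // in-place value is now the (uncommitted) newest version
    }

Committing then only has to walk this single buffer to re-stamp all of the transaction's deltas, which is what the next paragraph describes.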
Upon committing a transaction, the newly generated version deltas have to be re-timestamped to determine their validity interval. Clustering all version deltas of a transaction in its undo buffer expedites this commit processing tremendously. Furthermore, using the undo buffers for version maintenance, our MVCC model incurs almost no storage overhead as we need to maintain the version deltas (i.e., the before-images of the changes) during transaction processing anyway for transactional rollbacks. The only difference is that the undo buffers are (possibly) maintained for a slightly longer duration, i.e., for as long as an active transaction may still need to access the versions contained in the undo buffer. Thus, the VersionVector shown in Figure 1 anchors a chain of version reconstruction deltas (i.e., column values) in "newest-to-oldest" direction, possibly spanning across undo buffers of different transactions. Even for our column store backend, there is a single VersionVector entry per record for the version chain, so the version chain in general connects before-images of different columns of one record. Actually, for garbage collection this chain is maintained bidirectionally, as illustrated for Sally's Bal-versions.

2.1 Version Maintenance

Only a tiny fraction of the database will be versioned, as we continuously garbage collect versions that are no longer needed. A version (reconstruction delta) becomes obsolete if all active transactions have started after this delta was timestamped. The VersionVector contains null whenever the corresponding record is unversioned and a pointer to the most recently replaced version in an undo buffer otherwise.
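A hedged sketch of that obsolescence test (names hypothetical): since a delta with timestamp ts is only ever applied by transactions whose startTime is at most ts, it can be reclaimed once even the oldest active transaction started after it:

    // A committed delta is obsolete iff every active transaction started
    // after the delta was (re-)timestamped; it can then be unlinked from
    // the version chain (kept bidirectional for exactly this purpose).
    bool isObsolete(uint64_t deltaTs, uint64_t oldestActiveStartTime) {
        return deltaTs < oldestActiveStartTime;
    }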
For our illustrative example only two transaction types are considered: transfer transactions are marked as "from → to" and transfer $1 from one account to another by first subtracting 1 from one account's Bal and then adding 1 to the other account's Bal. For brevity we omit the discussion of object deletions and creations in the example. Initially, all Balances were set to 10. The read-only transactions denoted Σ sum all Balances and — in our "closed world" example — should always compute $150, no matter under what startTime-stamp they operate.

All new transactions entering the system are associated with two timestamps: transactionID and startTime-stamps. Upon commit, update transactions receive a third timestamp, the commitTime-stamp that determines their serialization order. Initially all transactions are assigned identifiers that are higher than any startTime-stamp of any transaction. We generate startTime-stamps from 0 upwards and transactionIDs from 2^63 upwards to guarantee that transactionIDs are all higher than the startTimes.
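A minimal sketch of these two sequences, assuming simple atomic counters (the concrete mechanism in HyPer is not spelled out here):

    #include <atomic>
    #include <cstdint>

    // startTime- and commitTime-stamps come from one counter starting at 0;
    // transactionIDs start at 2^63, so every transactionID exceeds any
    // "real" timestamp of a committed transaction.
    static std::atomic<uint64_t> timestampCounter{0};
    static std::atomic<uint64_t> transactionIdCounter{1ull << 63};

    struct Transaction {
        uint64_t startTime     = timestampCounter.fetch_add(1);      // snapshot to read from
        uint64_t transactionID = transactionIdCounter.fetch_add(1);  // marks uncommitted versions
        uint64_t commitTime    = 0;                                  // assigned at commit
    };

    uint64_t commit(Transaction& t) {
        // the commitTime-stamp is taken from the same sequence counter
        // that generates the startTime-stamps
        return t.commitTime = timestampCounter.fetch_add(1);
    }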
Update transactions modify data in-place. However, they retain the old version of the data in their undo buffer. This old version serves two purposes: (1) it is needed as a before-image in case the transaction is rolled back (undone) and (2) it serves as a committed version that was valid up to now. This most recently replaced version is inserted in front of the (possibly empty) version chain starting at the VersionVector. While the updater is still running, the newly created version is marked with its transactionID, whereby the uncommitted version is only accessible by the update transaction itself (as checked in the second condition of the version access predicate, cf., Section 2.2). At commit time an update transaction receives a commitTime-stamp with which its version deltas (undo logs) are marked as being irrelevant for transactions that start from "now" on. This commitTime-stamp is taken from the same sequence counter that generates the startTime-stamps. In our example, the first update transaction that committed at timestamp T3 (Sally → Wendy) created in its undo buffer the version deltas timestamped T3 for Sally's and Wendy's balances, respectively. The timestamp indicates that these version deltas have to be applied for transactions whose startTime is below T3 and that the successor version is valid from there on for transactions starting after T3. In our example, at startTime T4 a reader transaction with transactionID Tx entered the system and is still active. It will read Sally's Balance at reconstructed value 9, Henry's at reconstructed value 10, and Wendy's at value 11. Another update transaction (Sally → Henry) committed at timestamp T5 and correspondingly marked the version deltas it created with the validity timestamp T5. Again, the versions belonging to Sally's and Wendy's balances that were valid just before T5's update are maintained as before-images in the undo buffer of T5. Note that a reconstructed version is valid from its predecessor's timestamp until its own timestamp. Sally's Balance version reconstructed with T5's undo buffer is thus valid from timestamp T3 until timestamp T5. If a version delta has no predecessor (indicated by a null pointer), such as Henry's balance version in T5's undo buffer, its validity is determined as from virtual timestamp "0" until timestamp T5. Any read access of a transaction with startTime below T5 applies this version delta and any read access with a startTime above or equal to T5 ignores it and thus reads the in-place version in the Accounts table. As said before, the deltas of not yet committed versions receive a temporary timestamp that exceeds any "real" timestamp of a committed transaction. This is exemplified for the update transaction (Sally → Henry) that is assigned the transactionID timestamp Ty of the updater. This temporary, very large timestamp is initially assigned to Sally's Balance version delta in Ty's undo buffer. Any read access, except for those of transaction Ty, with a startTime-stamp above T5 (and obviously below Ty) applies this version delta to obtain value 8. The uncommitted in-place version of Sally's balance with value 7 is only visible to Ty.

2.2 Version Access

To access its visible version of a record, a transaction T first reads the in-place record (e.g., from the column- or row-store) and then undoes all version changes along the (possibly empty) version chain of undo buffer entries — by overwriting updated attributes in the copied record with the before-images from the undo buffers — up to the first version v, for which the following condition holds (pred points to the predecessor; TS denotes the associated timestamp):

    v.pred = null ∨ v.pred.TS = T ∨ v.pred.TS < T.startTime

The first condition holds if there is no older version available because it never existed or it was (safely) garbage collected in the meantime. The second condition allows a transaction to access its own updates (remember that the initial transactionID timestamps assigned to an active transaction are very high numbers exceeding any start time of a transaction). The third condition allows reading a version that was valid at the start time of the transaction. Once the termination condition is satisfied, the visible version has been re-materialized by "having undone" all changes that have occurred in the meantime. Note that, as shown in Section 4, version "reconstruction" is actually cheap, as we store the physical before-image deltas and thus do not have to inversely apply functions on the in-place after-image.

Traversing the version chain guarantees that all reads are performed in the state that existed at the start of the transaction.
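A minimal sketch of this traversal, reusing the hypothetical VersionEntry layout from the earlier sketch; the loop condition on v is equivalent to the stopping predicate on v.pred above, just shifted by one chain position:

    #include <cstdint>

    // Start from the in-place (newest) value and undo deltas while they are
    // too new to be visible, i.e., until
    //   v == null  ∨  v->ts == transactionID  ∨  v->ts < startTime.
    int64_t visibleValue(int64_t inPlaceValue, const VersionEntry* v,
                         uint64_t transactionID, uint64_t startTime) {
        while (v != nullptr && v->ts != transactionID && v->ts >= startTime) {
            inPlaceValue = v->before;  // undo: overwrite with the before-image
            v = v->pred;
        }
        return inPlaceValue;
    }

With the chains of Figure 1, a reader with startTime T4 reconstructs Sally's balance as 9: it starts at the uncommitted in-place value 7, applies Ty's delta (8) and T5's delta (9), and stops at T3's delta because T3 < T4.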
[Two figures. Left: modifications, deletions, and creations of data objects relative to the lifetime of a transaction T (startTime to commitTime), distinguishing unchanged, modified, deleted, created (phantom), and created & deleted (phantom) objects. Right: a predicate space for 4 predicates over Interest (I) and Balance (B): P1: I = 1.7 and B = 15; P2: I = 1 and B = 15; P3: I between .1 and .2 and B between 10 and 20; P4: I = 1.6.]

Figure 2: Modifications/deletions/creations of data objects relative to the lifetime of transaction T
[Figure 5 shows a schedule of four transactions in explicit versioning notation (TS = startTime for read-only transactions, TS = commitTime for update transactions): S1 (T6): r(x0) w(x□) commit(x6); S2 (T4): r(y0) w(y□) commit(y4); S3 (T3): r(y0) r(x0) commit; S5 (T7): r(y4) w(y□) commit(y7).]

Figure 5: Example of the explicit versioning notation in our MVCC model

S2 committed earlier at timestamp T4. The read-only transaction that started at timestamp S3 logically also commits at timestamp T3.

Transactions read all the data in the version that was committed (i.e., created) most recently before their startTime-stamp. Versions are only committed at the end of a transaction and therefore receive the identifiers corresponding to the commitTime-stamps of the transaction that creates the version. The transaction schedule of Figure 5 creates the version chains y0 → y4 → y7 and x0 → x6. Note that versions are not themselves (densely) numbered because of our scheme of identifying versions with the commitTime-stamp of the creating transaction. As we will prove in Section 3.2, our MVCC model guarantees equivalence to a serial mono-version schedule in commitTime-stamp order. Therefore, the resulting schedule of Figure 5 is equivalent to the serial mono-version execution: r3(y), r3(x), c3, r4(y), w4(y), c4, r6(x), w6(x), c6, r7(y), w7(y), c7. Here all the operations are subscripted with the transaction's commitTime-stamp.

Local writing is denoted as w(x□). Such a "dirty" data object is only visible to the transaction that wrote it. In our implementation (cf., Section 2.1), we use the very large transaction identifiers to make the dirty objects invisible to other transactions. In our formal model we do not need these identifiers. As we perform updates in-place, other transactions trying to (over-)write x are aborted and restarted. Note that reading x is always possible, because a transaction's reads are directed to the version of x that was committed most recently before the transaction's startTime-stamp — with one exception: if a transaction updates an object x, i.e., w(x□), it will subsequently read its own update, i.e., r(x□). This is exemplified for transaction (S1, T2) on the upper left hand side of Figure 6(a). In our implementation this read-your-own-writes scheme is again realized by assigning very large transaction identifiers to dirty data versions.

Figure 6(a) further exemplifies a cycle of rw-dependencies, often also referred to as rw-antidependencies [11]. rw-antidependencies play a crucial role in non-serializable schedules that are compliant under SI. The first rw-antidependency involving r(y0) and w(y□) in the figure could not have been detected immediately as the write of y in (S4, T7) happens after the read of y; the second rw-antidependency involving r(x2) and w(x□) on the other hand could have been detected

(S4, T7) even though it read x after (S4, T7) wrote x. Obviously, with a single-version scheduler this degree of logical concurrency would not have been possible. The figure also illustrates the benefits of our MVCC scheme that keeps an arbitrary number of versions instead of only two as in [36]. The "long" read transaction (S1, T1) needs to access x0 even though in the meantime the two newer versions x3 and x7 were created. Versions are only garbage collected after they are definitely no longer needed by other active transactions.

Our novel use of precision locking, consisting of collecting the read predicates and validating recently committed versions against these predicates, is illustrated in Figure 6(c). Here, transaction (S2, T5) reads x0 with predicate P, denoted rP(x0). When the transaction that started at S2 tries to commit, it validates the before- and after-images of versions that were committed in the meantime. In particular, P(x0) is true and therefore leads to an abort and restart of the transaction. Likewise, phantoms and deletions are detected as exemplified for the insert i(o□) and the delete d(u□) of transaction (S1, T4). Neither the inserted object nor the deleted object are allowed to intersect with the predicates of concurrent transactions that commit after T4.
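The following C++ sketch shows this commit-time validation in the spirit of precision locking; the types are hypothetical, and the linear scan stands in for the predicate tree that our implementation actually builds (cf., Section 4.4). Inserts contribute only an after-image, deletes only a before-image:

    #include <cstdint>
    #include <functional>
    #include <optional>
    #include <vector>

    struct Tuple { int64_t interest, balance; };  // attribute values of one record

    // One committed change seen during the committing transaction's lifetime.
    struct Change { std::optional<Tuple> before, after; };

    using Predicate = std::function<bool(const Tuple&)>;  // stands in for the predicate tree

    // A transaction may commit only if no logged read predicate is satisfied
    // by the before- or after-image of any concurrently committed version.
    bool validate(const std::vector<Predicate>& readPredicates,
                  const std::vector<Change>& committedChanges) {
        for (const auto& p : readPredicates)
            for (const auto& c : committedChanges) {
                if (c.before && p(*c.before)) return false;  // read saw stale data
                if (c.after && p(*c.after)) return false;    // missed update or phantom
            }
        return true;  // predicate space untouched: safe to commit
    }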
[Figure 6: example schedules. (a) a write-write conflict and a read-write conflict: the transaction started at S3 is immediately aborted due to a write-write conflict with x6; the transaction started at S5 is aborted due to a read-write conflict, as its validation fails because {x2, x6, y0, y6} intersect the predicate space of S5; the read-only transaction (S6, T6) succeeds. (b) benefits of our MVCC model. (c) a read-write conflict, a read-delete conflict, and a phantom conflict: (S2, T5) aborts because P(x0) = true; the predicates P, Q, and S are tested against the before- and after-images of concurrently committed versions, inserts, and deletes, where a satisfied test such as S(o5) indicates a phantom.]

3.2 Proof of Serializability Guarantee

We will now prove that our MVCC scheme with predicate space validation guarantees that any execution is serializable in commit order.

Theorem. The committed projection of any multi-version schedule H that adheres to our protocol is conflict equivalent to a serial mono-version schedule H′ where the committed transactions are ordered according to their commitTime-stamps and the uncommitted transactions are removed.

Proof. Due to the nature of the MVCC protocol, the effects of any uncommitted transaction can never be seen by any other successful transaction (reads will ignore the uncommitted writes, writes will either not see the uncommitted writes or lead to aborts). Therefore, it is sufficient to consider only committed transactions in this proof.

Basically, we will now show that all dependencies are in the direction of the order of their commitTime-stamps and thus any execution is serializable in commit order. Read-only transactions see a stable snapshot of the database at time Sb, and get assigned the same commitTime-stamp Tb = Sb, or, in other words, they behave as if they were executed at the point in time of their commitTime-stamp, which is the same as their startTime-stamp.

Update transactions are started at Sb, and get assigned a commitTime-stamp Tc with Tc > Sb. We will now prove by contradiction that the transactions behave as if they were executed at the time point Tc. Assume T is an update transaction from the committed projection of H (i.e., T has committed successfully), but T could not have been delayed to the point Tc. That is, T performed an operation o1 that conflicts with another operation o2 by a second transaction T′ with o1 < o2 and T′ committed during T's lifetime, i.e., within the time period Sb ≤ Tc′ < Tc. If T′ committed after
T, i.e., Tc′ > Tc, we could delay T′ (and thus o2) until after Tc, thus we only have to consider the case Tc′ < Tc.

There are four possible combinations for the operations o1 and o2. If both are reads, we can swap the order of both, which is a contradiction to our assumption that o1 and o2 are conflicting. If both are writes, T′ would have aborted due to our protocol of immediately aborting upon detecting ww-conflicts. Thus, there is a contradiction to the assumption that both T and T′ are committed transactions. If o1 is a read and o2 is a write, the update o2 is already in the undo buffers when T commits, as Tc′ < Tc, and the predicate P of the read of o1 has been logged. The predicate validation at Tc then checks if o1 is affected by o2 by testing whether P is satisfied for either the before- or the after-image of o2 (i.e., if the read should have seen the write), as illustrated in Figure 6(c). If not, that is a contradiction to the assumption that o1 and o2 are conflicting. If yes, that is a contradiction to the assumption that T has committed successfully, as T would have been aborted when P was satisfied. If o1 is a write and o2 is a read, the read has ignored the effect of o1 in the MVCC mechanism, as Tc′ > Sb, which is a contradiction to the assumption that o1 and o2 are conflicting. The theorem follows.

4. EVALUATION

In this section we evaluate our MVCC implementation in our HyPer main-memory database system [21] that supports SQL-92 queries and transactions (defined in a PL/SQL-like scripting language [20]) and provides ACID guarantees. HyPer supports both column- and row-based storage of relations. Unless otherwise noted, we used the column-store backend, enabled continuous garbage collection, and stored the redo log in local memory. Redo log entries are generated in memory and log entries are submitted in small groups (group commit), which mitigates system call overheads and barely increases transaction latency. We evaluated HyPer with single-version concurrency control, our novel MVCC model, and a MVCC model similar to [23], which we mimicked by updating whole records and not using VersionedPositions synopses in our MVCC model. We further experimented with DBMS-X, a commercial main-memory DBMS with a MVCC implementation similar to [23]. DBMS-X was run in Windows 7 on our evaluation machine. Due to licensing agreements we cannot disclose the name of DBMS-X.

The experiments were executed on a 2-socket Intel Xeon E5-2660v2 2.20 GHz (3 GHz maximum turbo) NUMA system with 256 GB DDR3 1866 MHz memory (128 GB per CPU) running Linux 3.13. Each CPU has 10 cores and a 25 MB shared L3 cache. Each core has a 32 KB L1-I and L1-D cache as well as a 256 KB L2 cache.

4.1 Scan Performance

Initially we demonstrate the high scan performance of our MVCC implementation. We implemented a benchmark similar to the SIBENCH [7] benchmark and our bank accounts example (cf., Figure 1). The benchmark operates on a single relation that contains integer (key, value) pairs. The workload consists of update transactions which modify a (key, value) pair by incrementing the value and a scan transaction that scans the relation to sum up the values.

Figure 7 shows the per-core performance of scan transactions on a relation with 100M (key, value) records. To demonstrate the effect of scanning versioned records, we disable the continuous garbage collection and perform updates before scanning the relations. We vary both the number of dirty records and the number of versions per dirty record. Additionally, we distinguish two cases: (i) the scan transaction is started before the updates (scan oldest) and thus needs to undo the effects of the update transactions and (ii) the scan transaction is started after the updates (scan newest) and thus only needs to verify that the dirty records are visible to the scan transaction. For all cases, the results show that our MVCC implementation sustains the high scan throughput of our single-version concurrency control implementation for realistic numbers of dirty records, and even under high contention with multiple versions per record.
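The following sketch illustrates why such scans stay fast; the paper only states that VersionedPositions are synopses of the positions of versioned records, so the block-range structure below is a hypothetical rendering that reuses VersionEntry and visibleValue() from the earlier sketches:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Per block of the column, keep the position range that may contain
    // versioned rows; an empty range (first > last) means "all unversioned".
    struct BlockSynopsis { uint64_t firstVersioned = 1, lastVersioned = 0; };

    int64_t sumBalances(const std::vector<int64_t>& bal,
                        const std::vector<VersionEntry*>& versionVector,
                        const std::vector<BlockSynopsis>& syn, uint64_t blockSize,
                        uint64_t transactionID, uint64_t startTime) {
        int64_t sum = 0;
        for (uint64_t b = 0; b < syn.size(); ++b) {
            uint64_t end = std::min<uint64_t>((b + 1) * blockSize, bal.size());
            for (uint64_t i = b * blockSize; i < end; ++i) {
                if (i >= syn[b].firstVersioned && i <= syn[b].lastVersioned
                        && versionVector[i] != nullptr)
                    sum += visibleValue(bal[i], versionVector[i],
                                        transactionID, startTime);
                else
                    sum += bal[i];  // fast path: provably unversioned position
            }
        }
        return sum;
    }

Positions outside the synopsis range are summed without consulting the VersionVector at all, which matches the "find first versioned / find first unversioned" cost components reported in Figure 9.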
[Figure 7: scan throughput [records/s] for a varying number of dirty records out of 100M (10 to 100k), comparing a single-version system, scan newest, and scan oldest with 4 and 16 versions per dirty record; realistic scenarios with garbage collection are highlighted.]

Figure 9: Cycle breakdowns of scan-oldest transactions that need to undo 4 updates per dirty record [cycles split into scan, retrieve version, find first versioned, and find first unversioned, for 0 to 100k dirty records out of 100M]
[Figure 10: (a) TPC-C throughput [TX/s] of HyPer with single-version concurrency control, our MVCC model, and a MVCC model that mimics the behavior of [23] by updating whole records and not using VersionedPositions, for column-store and row-store backends; (b) throughput of interleaved transactions under serializability with attribute- (SR-AL) and record-level predicate logs (SR-RL) relative to SI throughput for a varying percentage of read-heavy top-customer transactions in the workload mix; (c) concurrency control abort rate running interleaved transactions with a varying number of warehouses under snapshot isolation (SI) and serializability with attribute- (SR-AL) and record-level predicate logs (SR-RL).]
[Throughput [TX/s] in three panels: (a) scalability with the number of threads (MPL), with and without RDMA log shipping; (b) 20 threads, varying percentage of partition-crossing transactions; (c) 20 threads, varying percentage of read-only transactions.]

Figure 11: Multi-threaded TPC-C experiments with 20 warehouses in HyPer with our MVCC model
We further executed the TPC-C with multiple threads. Like in H-Store [19]/VoltDB, we partition the database according to warehouses. Partitions are assigned to threads and threads are pinned to cores, similar to the DORA system [32]. These threads process transactions that primarily access data belonging to their partition. Unlike DORA, partition-crossing accesses, which, e.g., occur in 11% of the TPC-C transactions, are carried out by the primary thread to which the transaction was assigned. The scalability experiment (see Figure 11(a)) shows that our system scales near linearly up to 20 cores. Going beyond 20 cores might require the reduction of global synchronization like in the SILO system [41]. We further varied the contention on the partitions by varying the percentage of partition-crossing transactions as shown in Figure 11(b). Finally, as shown in Figure 11(c), we also measured the impact of read-only transactions and proportionally varied the percentage of the two read-only transactions in the workload mix.

Figure 11(a) also shows the scalability of HyPer when shipping the redo log using remote direct memory accesses (RDMA) over Infiniband. RDMA-based log shipping generates an overhead of 17% with 20 threads. Our evaluation system has a Mellanox ConnectX-3 Infiniband network adapter, which operates at 4×QDR. The maximum write bandwidth of our setup is 3.5 GB/s with a latency of 1.3 µs. This bandwidth is sufficient to ship the redo log entries: for 100k TPC-C transactions we generate 85 MB of redo log entries. In our setup, the receiving node can act as a high-availability failover but could also write the log to disk.

The Telecommunication Application Transaction Processing (TATP) benchmark simulates a typical telecommunications application. The workload mix consists of 80% read-only transactions and 20% write transactions. Thereby, the read transactions all perform point accesses and records are mostly updated as a whole. Thus, the TATP benchmark is a best-case scenario for the MVCC model of [23]. We ran the benchmark with a scale-factor of 1M subscribers. Compared to running the benchmark with single-version concurrency control (421,940 transactions/s), our MVCC implementation creates just a tiny overhead (407,564 transactions/s). As expected, the mimicked MVCC model of [23] also performs quite well in this benchmark, but still trails the performance of our MVCC implementation by around 20% (340,715 transactions/s).

4.4 Serializability

To determine the cost of our serializability validation approach (cf., Section 2.3), we first measured the cost of predicate logging in isolation from predicate validation by running the TPC-C and TATP benchmarks each in a single serial stream. Without predicate logging, i.e., under SI, we achieve a TPC-C throughput of 112,610 transactions/s, with record-level predicate logging (SR-RL) 107,365 transactions/s, and with attribute-level predicate logging (SR-AL) 105,030 transactions/s. This means that there is a mere 5% overhead for SR-RL and a mere 7% overhead for SR-AL. For the TATP benchmark, we measured an overhead of only 1% for SR-RL and 2% for SR-AL. We also measured the predicate logging overhead for the mostly read-only TPC-H decision support benchmark, which resulted in an even smaller overhead. In terms of size, predicate logs generated by our implementation are quite small. Unlike other serializable MVCC implementations, we do not need to track the entire read set of a transaction. To illustrate the difference in size,
imagine a top-customer transaction on the TPC-C schema that, for a specific warehouse and district, retrieves the customer with the highest account balance and a good credit rating (GC) and performs a small update:

    select
       c_w_id, c_d_id, max(c_balance)
    from
       customer
    where
       c_credit = 'GC'
       and c_w_id = :w_id
       and c_d_id = :d_id
    group by
       c_w_id, c_d_id
    update ...

For such a query, serializable MVCC models that need to track the read set then have to either copy all read records or set a flag (e.g., the SIREAD lock in PostgreSQL [34]). If we assume that at least one byte per read record is needed for book-keeping, then these approaches need to track at least 3 KB of data (a district of a warehouse serves 3k customers in TPC-C). Our SR-AL on the other hand just stores the read attributes and the aggregate that has been read, which is less than 100 bytes. This is 30× less than what traditional read set book-keeping consumes; and for true OLAP-style queries that read a lot of data, predicate logging saves even more. E.g., the read set of an analytical TPC-H query usually comprises millions of records and tracking the read set can easily consume multiple MBs of space.
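To illustrate why the SR-AL log stays this small here, a predicate-log entry for the top-customer transaction only needs the restricted attributes and the aggregate value that was read; the concrete encoding below is hypothetical:

    #include <cstdint>

    // Hypothetical SR-AL log entry for the top-customer query: the equality
    // restrictions on c_credit, c_w_id, c_d_id span the predicate space, and
    // the max(c_balance) result is the aggregate that validation re-checks
    // against the before-/after-images of concurrently committed versions.
    struct TopCustomerPredicateLog {
        char    c_credit[2];     // = 'GC'
        int32_t c_w_id;          // = :w_id
        int32_t c_d_id;          // = :d_id
        int64_t maxBalanceRead;  // aggregate value read by the transaction
    };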
To determine the cost of predicate validation, we again ran the TPC-C benchmark, but this time interleaved transactions (by decomposing the transactions into smaller tasks) such that transactions are running logically concurrent, which makes predicate validation necessary. We further added the aforementioned top-customer transaction to the workload mix and varied its share in the mix from 0% to 100%. The results are shown in Figure 10(b). As TPC-C transactions are very short, it would be an option to skip the step of building the predicate tree and instead just apply all the predicates as-is and thus make predicate validation much faster. However, the predicate tree has much better asymptotic behavior, and is therefore much faster and more robust when transaction complexity grows. We therefore use it all the time instead of optimizing for very cheap transactions. And the figure also shows that with more read-only transactions in the workload mix (as is the case in real-world workloads [33]), the overhead of serializability validation almost disappears. Building the predicate trees takes between 2 µs and 15 µs for the TPC-C transactions on our system; and between 4 µs and 24 µs for the analytical TPC-H queries (9.5 µs geometric mean). In comparison to traditional validation approaches that repeat all reads, our system has, as mentioned before, a much lower book-keeping overhead. A comparison of validation times by themselves is more complicated. Validation time in traditional approaches depends on the size of the read set of the committing transaction |R| and how fast reads can be repeated (usually scan speed and index lookup speed); in our approach, it mostly depends on the size of the write set |W| that has been committed during the runtime of the committing transaction. In our system, checking the predicates of a TPC-C transaction or a TPC-H query against a versioned record that has been reconstructed from undo buffers is a bit faster than an index lookup. In general, our approach thus favors workloads where |R| ≥ |W|. In our opinion this is mostly the case, as modern workloads tend to be read-heavy [33] and the time that a transaction is active tends to be short (long-running transactions would be deferred to a "safe snapshot").

Finally, we evaluated the concurrency control abort rates, i.e., the aborts caused by concurrency control conflicts, of our MVCC implementation in HyPer. We again ran TPC-C with logically interleaved transactions and varied the number of TPC-C warehouses. As the TPC-C is largely partitionable by warehouse, the intuition is that concurrency control conflicts are reduced with an increasing number of warehouses. The results are shown in Figure 10(c). We acknowledge that TPC-C does not show anomalies under SI [11], but of course the database system does not know this, and this benchmark therefore tests for false positive aborts. The aborts under SI are "real" conflicts, i.e., two transactions try to modify the same data item concurrently. Serializability validation with SR-AL creates almost no false positive aborts. The only false positive aborts stem from the minimum (min) aggregation in delivery, as it sometimes conflicts with concurrent inserts. Predicate logging of minimum and maximum aggregates is currently not implemented in our system but can easily be added in the future. SR-RL creates more false positives than SR-AL, because reads are not only checked against updated attributes but rather any change to a record is considered a conflict, even though the updated attribute might not even have been read by the original transaction.

5. RELATED WORK

Transaction isolation and concurrency control are among the most fundamental features of a database management system. Hence, several excellent books and survey papers have been written on this topic in the past [42, 3, 2, 38]. In the following we further highlight three categories of work that are particularly related to this paper, most notably multi-version concurrency control and serializability.

5.1 Multi-Version Concurrency Control

Multi-Version Concurrency Control (MVCC) [42, 3, 28] is a popular concurrency control mechanism that has been implemented in various database systems because of the desirable property that readers never block writers. Among these DBMSs are commercial systems such as Microsoft SQL Server's Hekaton [8, 23] and SAP HANA [10, 37] as well as open-source systems such as PostgreSQL [34].

Hekaton [23] is similar to our implementation in that it is based on a timestamp-based optimistic concurrency control [22] variant of MVCC and uses code generation [12] to compile efficient code for transactions at runtime. In the context of Hekaton, Larson et al. [23] compared a pessimistic locking-based with an optimistic validation-based MVCC scheme and proposed a novel MVCC model for main-memory DBMSs. Similar to what we have seen in our experiments, the optimistic scheme performs better in their evaluation. In comparison to Hekaton, our serializable MVCC model does not update records as a whole but in-place and at the attribute-level. Further, we do not restrict data accesses to index lookups and optimized our model for high scan speeds that are required for OLAP-style transactions. Finally, we use a novel serializability validation mechanism based on an adaptation of precision locking [17]. Lomet et al. [27] propose another MVCC scheme for main-memory
database systems where the main idea is to use ranges of timestamps for a transaction. In contrast to classical MVCC models, we previously proposed using virtual memory snapshots for long-running transactions [29], where updates are merged back into the database at commit-time. Snapshotting and merging, however, can be very expensive depending on the size of the database.

Hyder [5, 4] is a data-sharing system that stores indexed records as a multi-version log-structured database on shared flash storage. Transaction conflicts are detected by a meld algorithm that merges committed updates from the log into the in-memory DBMS cache. This architecture promises to scale out without partitioning. While our MVCC model uses the undo log only for validation of serializability violations, in Hyder, the durable log is the database. In contrast, our implementation stores data in a main-memory row- or column-store and writes a redo log for durability. OctopusDB [9] is another DBMS that uses the log as the database and proposes a unification of OLTP, OLAP, and streaming databases in one system.

5.2 Serializability

In contrast to PostgreSQL and our MVCC implementation, most other MVCC-based DBMSs only offer the weaker isolation level Snapshot Isolation (SI) instead of serializability. Berenson et al. [2], however, have shown that there exist schedules that are valid under SI but are non-serializable. In this context, Cahill and Fekete et al. [7, 11] developed a theory of SI anomalies. They further developed the Serializable Snapshot Isolation (SSI) approach [7], which has been implemented in PostgreSQL [34]. To guarantee serializability, SSI tracks commit dependencies and tests for "dangerous structures" consisting of rw-antidependencies between concurrent transactions. Unfortunately this requires keeping track of every single read, similar to read locks, which can be quite expensive for large read transactions. In contrast to SSI, our MVCC model proposes a novel serializability validation mechanism based on an adaptation of precision locking [17]. Our approach does not track dependencies but read predicates and validates the predicates against the undo log entries, which are retained for as long as they are visible. Jorwekar et al. [18] tackled the problem of automatically detecting SI anomalies. [14] proposes a scalable SSI implementation for multi-core CPUs. Checking updates against a predicate space is related to SharedDB [13], which optimizes the processing of multiple queries in parallel.

5.3 Scalability of OLTP Systems

Orthogonal to logical transaction isolation, there is also a plethora of research on how to scale transaction processing out to multiple cores on modern CPUs. H-Store [19], which has been commercialized as VoltDB, relies on static partitioning of the database. Transactions that access only a single partition are then processed serially and without any locking. Jones et al. [16] describe optimizations for partition-crossing transactions. Our HyPer [21] main-memory DBMS optimizes for OLTP and OLAP workloads and follows the partitioned transaction execution model of H-Store. Prior to our MVCC integration, HyPer, just like H-Store, could only process holistic pre-canned transactions. With the serializable MVCC model introduced in this work, we provide a logical transaction isolation mechanism that allows for interactive and sliced transactions.

Silo [41] proposes a scalable commit protocol that guarantees serializability. To achieve good scalability on modern multi-core CPUs, Silo's design is centered around avoiding most points of global synchronization. The proposed techniques can be integrated into our MVCC implementation in order to reduce global synchronization, which could allow for better scalability. Pandis et al. [32] show that the centralized lock manager of traditional DBMSs is often a scalability bottleneck. To solve this bottleneck, they propose the DORA system, which partitions a database among physical CPU cores and decomposes transactions into smaller actions. These are then assigned to threads that own the data needed for the action, such that the interaction with the lock manager is minimized during transaction processing. Very lightweight locking [35] reduces the lock-manager overhead by co-locating lock information with the records. The availability of hardware transactional memory (HTM) in recent mainstream CPUs enables a new promising transaction processing model that reduces the substantial overhead from locking and latching [24]. HTM further allows multi-core scalability without statically partitioning the database [24]. In future work we thus intend to employ HTM to efficiently scale out our MVCC implementation, even in the presence of partition-crossing transactions.

Deterministic database systems [39, 40] propose the execution of transactions according to a pre-defined serial order. In contrast to our MVCC model, transactions need to be known beforehand, e.g., by relying on holistic pre-canned transactions, and do not easily allow for interactive and sliced transactions. In the context of distributed DBMSs, [6] proposes a middleware for replicated DBMSs that adds global one-copy serializability for replicas that run under SI.

6. CONCLUSION

The multi-version concurrency control (MVCC) implementation presented in this work is carefully engineered to accommodate high-performance processing of both transactions with point accesses as well as read-heavy transactions and even OLAP scenarios. For the latter, the high scan performance of single-version main-memory database systems was retained by an update-in-place version mechanism and by using synopses of versioned record positions, called VersionedPositions. Furthermore, our novel serializability validation technique that checks the before-image deltas in undo buffers against a committing transaction's predicate space incurs only a marginal space and time overhead — no matter how large the read set is. This results in a very attractive and efficient transaction isolation mechanism for main-memory database systems. In particular, our serializable MVCC model targets database systems that support OLTP and OLAP processing simultaneously and in the same database, such as SAP HANA [10, 37] and our HyPer [21] system, but could also be implemented in high-performance transactional systems that currently only support holistic pre-canned transactions such as H-Store [19]/VoltDB. From a performance perspective, we have shown that the integration of our MVCC model in our HyPer system achieves excellent performance, even when maintaining serializability guarantees. Therefore, at least from a performance perspective, there is little need to prefer snapshot isolation over full serializability any longer. Future work focuses on better single-node scalability using hardware transactional memory [24].
7. REFERENCES

[1] A. Adya, B. Liskov, and P. O'Neil. Generalized Isolation Level Definitions. In ICDE, 2000.
[2] H. Berenson, P. A. Bernstein, J. Gray, J. Melton, E. J. O'Neil, et al. A Critique of ANSI SQL Isolation Levels. In SIGMOD, 1995.
[3] P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley Longman, 1986.
[4] P. A. Bernstein, C. W. Reid, and S. Das. Hyder - A Transactional Record Manager for Shared Flash. In CIDR, 2011.
[5] P. A. Bernstein, C. W. Reid, M. Wu, and X. Yuan. Optimistic Concurrency Control by Melding Trees. PVLDB, 4(11), 2011.
[6] M. A. Bornea, O. Hodson, S. Elnikety, and A. Fekete. One-Copy Serializability with Snapshot Isolation under the Hood. In ICDE, 2011.
[7] M. J. Cahill, U. Röhm, and A. D. Fekete. Serializable Isolation for Snapshot Databases. TODS, 34(4), 2009.
[8] C. Diaconu, C. Freedman, E. Ismert, P.-Å. Larson, P. Mittal, et al. Hekaton: SQL Server's Memory-optimized OLTP Engine. In SIGMOD, 2013.
[9] J. Dittrich and A. Jindal. Towards a One Size Fits All Database Architecture. In CIDR, 2011.
[10] F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and W. Lehner. SAP HANA Database: Data Management for Modern Business Applications. SIGMOD Record, 40(4), 2012.
[11] A. Fekete, D. Liarokapis, E. O'Neil, P. O'Neil, and D. Shasha. Making Snapshot Isolation Serializable. TODS, 30(2), 2005.
[12] C. Freedman, E. Ismert, and P.-Å. Larson. Compilation in the Microsoft SQL Server Hekaton Engine. DEBU, 37(1), 2014.
[13] G. Giannikis, G. Alonso, and D. Kossmann. SharedDB: Killing One Thousand Queries with One Stone. PVLDB, 5(6), 2012.
[14] H. Han, S. Park, H. Jung, A. Fekete, U. Röhm, et al. Scalable Serializable Snapshot Isolation for Multicore Systems. In ICDE, 2014.
[15] S. Héman, M. Zukowski, N. J. Nes, L. Sidirourgos, and P. Boncz. Positional Update Handling in Column Stores. In SIGMOD, 2010.
[16] E. P. Jones, D. J. Abadi, and S. Madden. Low Overhead Concurrency Control for Partitioned Main Memory Databases. In SIGMOD, 2010.
[17] J. R. Jordan, J. Banerjee, and R. B. Batman. Precision Locks. In SIGMOD, 1981.
[18] S. Jorwekar, A. Fekete, K. Ramamritham, and S. Sudarshan. Automating the Detection of Snapshot Isolation Anomalies. In VLDB, 2007.
[19] R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, et al. H-store: A High-performance, Distributed Main Memory Transaction Processing System. PVLDB, 1(2), 2008.
[20] A. Kemper et al. Transaction Processing in the Hybrid OLTP & OLAP Main-Memory Database System HyPer. DEBU, 36(2), 2013.
[21] A. Kemper and T. Neumann. HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots. In ICDE, 2011.
[22] H. T. Kung and J. T. Robinson. On Optimistic Methods for Concurrency Control. TODS, 6(2), 1981.
[23] P.-Å. Larson, S. Blanas, C. Diaconu, C. Freedman, J. M. Patel, et al. High-Performance Concurrency Control Mechanisms for Main-Memory Databases. PVLDB, 5(4), 2011.
[24] V. Leis, A. Kemper, and T. Neumann. Exploiting Hardware Transactional Memory in Main-Memory Databases. In ICDE, 2014.
[25] J. Levandoski, D. Lomet, and S. Sengupta. The Bw-Tree: A B-tree for New Hardware. In ICDE, 2013.
[26] Y. Li and J. M. Patel. BitWeaving: Fast Scans for Main Memory Data Processing. In SIGMOD, 2013.
[27] D. Lomet, A. Fekete, R. Wang, and P. Ward. Multi-Version Concurrency via Timestamp Range Conflict Management. In ICDE, 2012.
[28] C. Mohan, H. Pirahesh, and R. Lorie. Efficient and Flexible Methods for Transient Versioning of Records to Avoid Locking by Read-only Transactions. SIGMOD Record, 21(2), 1992.
[29] H. Mühe, A. Kemper, and T. Neumann. Executing Long-Running Transactions in Synchronization-Free Main Memory Database Systems. In CIDR, 2013.
[30] T. Mühlbauer, W. Rödiger, A. Reiser, A. Kemper, and T. Neumann. ScyPer: Elastic OLAP Throughput on Transactional Data. In DanaC, 2013.
[31] T. Neumann. Efficiently Compiling Efficient Query Plans for Modern Hardware. PVLDB, 4(9), 2011.
[32] I. Pandis, R. Johnson, N. Hardavellas, and A. Ailamaki. Data-oriented Transaction Execution. PVLDB, 3, 2010.
[33] H. Plattner. The Impact of Columnar In-Memory Databases on Enterprise Systems: Implications of Eliminating Transaction-Maintained Aggregates. PVLDB, 7(13), 2014.
[34] D. R. K. Ports and K. Grittner. Serializable Snapshot Isolation in PostgreSQL. PVLDB, 5(12), 2012.
[35] K. Ren, A. Thomson, and D. J. Abadi. Lightweight Locking for Main Memory Database Systems. PVLDB, 6(2), 2012.
[36] M. Sadoghi, M. Canim, B. Bhattacharjee, F. Nagel, and K. A. Ross. Reducing Database Locking Contention Through Multi-version Concurrency. PVLDB, 7(13), 2014.
[37] V. Sikka, F. Färber, W. Lehner, S. K. Cha, T. Peh, et al. Efficient Transaction Processing in SAP HANA Database: The End of a Column Store Myth. In SIGMOD, 2012.
[38] A. Thomasian. Concurrency Control: Methods, Performance, and Analysis. CSUR, 30(1), 1998.
[39] A. Thomson and D. J. Abadi. The Case for Determinism in Database Systems. PVLDB, 3, 2010.
[40] A. Thomson, T. Diamond, S.-C. Weng, K. Ren, P. Shao, and D. J. Abadi. Calvin: Fast Distributed Transactions for Partitioned Database Systems. In SIGMOD, 2012.
[41] S. Tu, W. Zheng, E. Kohler, B. Liskov, and S. Madden. Speedy Transactions in Multicore In-memory Databases. In SOSP, 2013.
[42] G. Weikum and G. Vossen. Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery. Morgan Kaufmann, 2002.
[43] T. Willhalm, N. Popovici, Y. Boshmaf, H. Plattner, A. Zeier, et al. SIMD-Scan: Ultra Fast in-Memory Table Scan using on-Chip Vector Processing Units. PVLDB, 2(1), 2009.