
Fast Serializable Multi-Version Concurrency Control

for Main-Memory Database Systems

Thomas Neumann Tobias Mühlbauer Alfons Kemper


Technische Universität München
{neumann, muehlbau, kemper}@in.tum.de

ABSTRACT

Multi-Version Concurrency Control (MVCC) is a widely employed concurrency control mechanism, as it allows for execution modes where readers never block writers. However, most systems implement only snapshot isolation (SI) instead of full serializability. Adding serializability guarantees to existing SI implementations tends to be prohibitively expensive. We present a novel MVCC implementation for main-memory database systems that has very little overhead compared to serial execution with single-version concurrency control, even when maintaining serializability guarantees. Updating data in-place and storing versions as before-image deltas in undo buffers not only allows us to retain the high scan performance of single-version systems but also forms the basis of our cheap and fine-grained serializability validation mechanism. The novel idea is based on an adaptation of precision locking and verifies that the (extensional) writes of recently committed transactions do not intersect with the (intensional) read predicate space of a committing transaction. We experimentally show that our MVCC model allows very fast processing of transactions with point accesses as well as read-heavy transactions and that there is little need to prefer SI over full serializability any longer.

Categories and Subject Descriptors

H.2 [Database Management]: Systems

Keywords

Multi-Version Concurrency Control; MVCC; Serializability

1. INTRODUCTION

Transaction isolation is one of the most fundamental features offered by a database management system (DBMS). It provides the user with the illusion of being alone in the database system, even in the presence of multiple concurrent users, which greatly simplifies application development. In the background, the DBMS ensures that the resulting concurrent access patterns are safe, ideally by being serializable.

Serializability is a great concept, but it is hard to implement efficiently. A classical way to ensure serializability is to rely on a variant of Two-Phase Locking (2PL) [42]. Using 2PL, the DBMS maintains read and write locks to ensure that conflicting transactions are executed in a well-defined order, which results in serializable execution schedules. Locking, however, has several major disadvantages: First, readers and writers block each other. Second, most transactions are read-only [33] and therefore harmless from a transaction-ordering perspective. Using a locking-based isolation mechanism, no update transaction is allowed to change a data object that has been read by a potentially long-running read transaction and thus has to wait until the read transaction finishes. This severely limits the degree of concurrency in the system.

Multi-Version Concurrency Control (MVCC) [42, 3, 28] offers an elegant solution to this problem. Instead of updating data objects in-place, each update creates a new version of that data object, such that concurrent readers can still see the old version while the update transaction proceeds concurrently. As a consequence, read-only transactions never have to wait, and in fact do not have to use locking at all. This is an extremely desirable property and the reason why many DBMSs implement MVCC, e.g., Oracle, Microsoft SQL Server [8, 23], SAP HANA [10, 37], and PostgreSQL [34]. However, most systems that use MVCC do not guarantee serializability, but the weaker isolation level Snapshot Isolation (SI). Under SI, every transaction sees the database in a certain state (typically the last committed state at the beginning of the transaction) and the DBMS ensures that two concurrent transactions do not update the same data object. Although SI offers fairly good isolation, some non-serializable schedules are still allowed [1, 2]. This is often reluctantly accepted because making SI serializable tends to be prohibitively expensive [7]. In particular, the known solutions require keeping track of the entire read set of every transaction, which creates a huge overhead for read-heavy (e.g., analytical) workloads. Still, it is desirable to detect serializability conflicts as they can lead to silent data corruption, which in turn can cause hard-to-detect bugs.

In this paper we introduce a novel way to implement MVCC that is very fast and efficient, both for SI and for full serializability. Our SI implementation is admittedly more carefully engineered than totally new, as MVCC is a well understood approach that recently received renewed interest in the context of main-memory DBMSs [23]. Careful engineering, however, matters as the performance of version maintenance greatly affects transaction and query processing.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
SIGMOD'15, May 31–June 4, 2015, Melbourne, Victoria, Australia.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-2758-9/15/05 ...$15.00.
http://dx.doi.org/10.1145/2723372.2749436.
[Figure 1 shows the Accounts relation as a main-memory column-store with the attributes Owner and Bal. Version information is stored in an additional hidden column in the base relation (indexes only store references, i.e., rowIDs, to records in base relations). The latest version is kept in-place; physical before-image deltas (i.e., column values) are kept in the undo buffers of the recentlyCommitted transactions T3 (Sally→Wendy) and T5 (Sally→Henry) and of the activeTransactions Tx/T4 (read only: Σ), Ty/T6 (Sally→Mike), and Tz/T7 (read only: Σ). The VersionVector anchors the per-record version chains, e.g., Sally's Bal: in-place value 7, undo buffer of Ty (Ty,Bal,8), undo buffer of T5 (T5,Bal,9), undo buffer of T3 (T3,Bal,10).]

Figure 1: Multi-version concurrency control example: transferring $1 between Accounts (from → to) and summing all Balances (Σ)
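To make the layout in Figure 1 concrete, the following minimal Python sketch (the names VersionDelta, Relation, and version_vector are our own, not from the paper) models the in-place column store plus the per-record chain of before-image deltas, and rebuilds the state of Sally's record: in-place Bal 7 (uncommitted), with before-images 8 (undo buffer of Ty), 9 (undo buffer of T5), and 10 (undo buffer of T3).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class VersionDelta:
    """One undo-buffer entry: the physical before-image of one column value."""
    ts: int                                # commitTime, or transactionID while uncommitted
    column: str                            # e.g., "Bal"
    before_image: int                      # the replaced column value
    pred: Optional["VersionDelta"] = None  # next-older delta ("newest-to-oldest")

@dataclass
class Relation:
    columns: Dict[str, list]                       # latest versions, stored in-place
    version_vector: List[Optional[VersionDelta]]   # None = record is unversioned

TXN_ID_BASE = 1 << 63   # transactionIDs are drawn from 2^63 upwards
T3, T5 = 3, 5           # commitTime-stamps of the two committed transfers
TY = TXN_ID_BASE + 1    # transactionID of the still-active updater Ty

# Sally's record from Figure 1: in-place Bal = 7, chain Ty(8) -> T5(9) -> T3(10).
delta_t3 = VersionDelta(T3, "Bal", 10)
delta_t5 = VersionDelta(T5, "Bal", 9, pred=delta_t3)
delta_ty = VersionDelta(TY, "Bal", 8, pred=delta_t5)
accounts = Relation(columns={"Owner": ["Sally"], "Bal": [7]},
                    version_vector=[delta_ty])
```

Walking the chain from the VersionVector entry yields the before-images 8, 9, 10 in "newest-to-oldest" order, matching the three undo buffers in the figure.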

It is also the basis of our cheap serializability check, which exploits the structure of our versioning information. We further retain the very high scan performance of single-version systems using synopses of positions of versioned records in order to efficiently support analytical transactions.

In particular, the main contributions of this paper are:

1. A novel MVCC implementation that is integrated into our high-performance hybrid OLTP and OLAP main-memory database system HyPer [21]. Our MVCC model creates very little overhead for both transactional and analytical workloads and thereby enables very fast and efficient logical transaction isolation for hybrid systems that support these workloads simultaneously.

2. Based upon that, a novel approach to guarantee serializability for snapshot isolation (SI) that is both precise and cheap in terms of additional space consumption and validation time. Our approach is based on an adaptation of Precision Locking [42] and does not require explicit read locks, but still allows for more concurrency than 2PL.

3. A synopses-based approach (VersionedPositions) to retain the high scan performance of single-version systems for read-heavy and analytical transactions, which are common in today's workloads [33].

4. Extensive experiments that demonstrate the high performance and trade-offs of our MVCC implementation.

Our novel MVCC implementation is integrated into our HyPer main-memory DBMS [21], which supports SQL-92 query and ACID-compliant transaction processing (defined in a PL/SQL-like scripting language [20]). For queries and transactions, HyPer generates LLVM code that is then just-in-time compiled to optimized machine code [31]. In the past, HyPer relied on single-version concurrency control and thus did not efficiently support interactive and sliced transactions, i.e., transactions that are decomposed into multiple tasks such as stored procedure calls or individual SQL statements. Due to application roundtrip latencies and other factors, it is desirable to interleave the execution of these tasks. Our novel MVCC model enables this logical concurrency with excellent performance, even when maintaining serializability guarantees.

2. MVCC IMPLEMENTATION

We explain our MVCC model and its implementation initially by way of an example. The formalism of our serializability theory and proofs are then given in Section 3. Figure 1 illustrates the version maintenance using a traditional banking example. For simplicity, the database consists of a single Accounts table that contains just two attributes, Owner and Balance. In order to retain maximum scan performance we refrain from creating new versions in newly allocated areas as in Hekaton [8, 23]; instead we update in-place and maintain the backward delta between the updated (yet uncommitted) and the replaced version in the undo buffer of the updating transaction. Updating data in-place retains the contiguity of the data vectors that is essential for high scan performance. In contrast to positional delta trees (PDTs) [15], which were designed to allow more efficient updates in column stores, we refrain from using complex data structures for the deltas to allow for a high concurrent transactional throughput.

Upon committing a transaction, the newly generated version deltas have to be re-timestamped to determine their validity interval. Clustering all version deltas of a transaction in its undo buffer expedites this commit processing tremendously. Furthermore, using the undo buffers for version maintenance, our MVCC model incurs almost no storage overhead as we need to maintain the version deltas (i.e., the before-images of the changes) during transaction processing anyway for transactional rollbacks. The only difference is that the undo buffers are (possibly) maintained for a slightly longer duration, i.e., for as long as an active transaction may still need to access the versions contained in the undo buffer. Thus, the VersionVector shown in Figure 1 anchors a chain of version reconstruction deltas (i.e., column values) in "newest-to-oldest" direction, possibly spanning across undo buffers of different transactions. Even for our column-store backend, there is a single VersionVector entry per record for the version chain, so the version chain in general connects before-images of different columns of one record. Actually, for garbage collection this chain is maintained bidirectionally, as illustrated for Sally's Bal versions.

2.1 Version Maintenance

Only a tiny fraction of the database will be versioned, as we continuously garbage collect versions that are no longer needed. A version (reconstruction delta) becomes obsolete if all active transactions have started after this delta was timestamped. The VersionVector contains null whenever the corresponding record is unversioned and a pointer to the most recently replaced version in an undo buffer otherwise.

For our illustrative example only two transaction types are considered: transfer transactions are marked as "from → to" and transfer $1 from one account to another by first subtracting 1 from one account's Bal and then adding 1 to the other account's Bal. For brevity we omit the discussion of object deletions and creations in the example. Initially, all Balances were set to 10. The read-only transactions denoted Σ sum all Balances and — in our "closed world" example — should always compute $150, no matter under what startTime-stamp they operate.

All new transactions entering the system are associated with two timestamps: transactionID and startTime-stamps. Upon commit, update transactions receive a third timestamp, the commitTime-stamp that determines their serialization order. Initially all transactions are assigned identifiers that are higher than any startTime-stamp of any transaction. We generate startTime-stamps from 0 upwards and transactionIDs from 2^63 upwards to guarantee that transactionIDs are all higher than the startTimes. Update transactions modify data in-place. However, they retain the old version of the data in their undo buffer. This old version serves two purposes: (1) it is needed as a before-image in case the transaction is rolled back (undone) and (2) it serves as a committed version that was valid up to now. This most recently replaced version is inserted in front of the (possibly empty) version chain starting at the VersionVector. While the updater is still running, the newly created version is marked with its transactionID, whereby the uncommitted version is only accessible by the update transaction itself (as checked in the second condition of the version access predicate, cf. Section 2.2). At commit time an update transaction receives a commitTime-stamp with which its version deltas (undo logs) are marked as being irrelevant for transactions that start from "now" on. This commitTime-stamp is taken from the same sequence counter that generates the startTime-stamps.

In our example, the first update transaction that committed at timestamp T3 (Sally → Wendy) created in its undo buffer the version deltas timestamped T3 for Sally's and Wendy's balances, respectively. The timestamp indicates that these version deltas have to be applied for transactions whose startTime is below T3 and that the successor version is valid from there on for transactions starting after T3. In our example, at startTime T4 a reader transaction with transactionID Tx entered the system and is still active. It will read Sally's Balance at reconstructed value 9, Henry's at reconstructed value 10, and Wendy's at value 11. Another update transaction (Sally → Henry) committed at timestamp T5 and correspondingly marked the version deltas it created with the validity timestamp T5. Again, the versions belonging to Sally's and Wendy's balances that were valid just before T5's update are maintained as before-images in the undo buffer of T5. Note that a reconstructed version is valid from its predecessor's timestamp until its own timestamp. Sally's Balance version reconstructed with T5's undo buffer is thus valid from timestamp T3 until timestamp T5. If a version delta has no predecessor (indicated by a null pointer), such as Henry's balance version in T5's undo buffer, its validity is determined as from virtual timestamp "0" until timestamp T5. Any read access of a transaction with startTime below T5 applies this version delta and any read access with a startTime above or equal to T5 ignores it and thus reads the in-place version in the Accounts table.

As said before, the deltas of not yet committed versions receive a temporary timestamp that exceeds any "real" timestamp of a committed transaction. This is exemplified for the update transaction (Sally → Mike) that is assigned the transactionID timestamp Ty of the updater. This temporary, very large timestamp is initially assigned to Sally's Balance version delta in Ty's undo buffer. Any read access, except for those of transaction Ty, with a startTime-stamp above T5 (and obviously below Ty) applies this version delta to obtain value 8. The uncommitted in-place version of Sally's balance with value 7 is only visible to Ty.

2.2 Version Access

To access its visible version of a record, a transaction T first reads the in-place record (e.g., from the column- or row-store) and then undoes all version changes along the (possibly empty) version chain of undo buffer entries — by overwriting updated attributes in the copied record with the before-images from the undo buffers — up to the first version v for which the following condition holds (pred points to the predecessor; TS denotes the associated timestamp):

v.pred = null ∨ v.pred.TS = T ∨ v.pred.TS < T.startTime

The first condition holds if there is no older version available because it never existed or it was (safely) garbage collected in the meantime. The second condition allows a transaction to access its own updates (remember that the initial transactionID timestamps assigned to an active transaction are very high numbers exceeding any start time of a transaction). The third condition allows reading a version that was valid at the start time of the transaction. Once the termination condition is satisfied, the visible version has been re-materialized by "having undone" all changes that have occurred in the meantime. Note that, as shown in Section 4, version "reconstruction" is actually cheap, as we store the physical before-image deltas and thus do not have to inversely apply functions on the in-place after-image.

Traversing the version chain guarantees that all reads are performed in the state that existed at the start of the transaction.
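As an illustration of this termination condition, the following small, self-contained Python sketch (our own simplification — the record is a plain dict and deltas are dicts with keys ts, col, before, pred) re-materializes the visible version by walking the chain until v.pred = null ∨ v.pred.TS = T ∨ v.pred.TS < T.startTime holds. On the state of Figure 1 it reproduces the values discussed above: reader Tx (startTime T4) sees Sally's Balance as 9, a reader with startTime above T5 sees 8, and the updater Ty itself sees the uncommitted in-place value 7.

```python
TXN_ID_BASE = 1 << 63   # transactionIDs are drawn from 2^63 upwards

def visible_value(in_place, chain_head, column, txn_id, start_time):
    """Reconstruct the version of `column` visible to transaction T.

    Starting from the in-place (newest) version, before-images are applied
    until the first version whose predecessor is null, carries T's own
    transactionID, or was committed before T started (Section 2.2).
    """
    value = in_place[column]
    pred = chain_head  # VersionVector entry; None = record unversioned
    while not (pred is None or pred["ts"] == txn_id or pred["ts"] < start_time):
        if pred["col"] == column:        # undo: overwrite with the before-image
            value = pred["before"]
        pred = pred["pred"]
    return value

# Sally's record from Figure 1: in-place Bal = 7 (uncommitted update by Ty),
# version chain Ty(before 8) -> T5(before 9) -> T3(before 10).
t3 = {"ts": 3, "col": "Bal", "before": 10, "pred": None}
t5 = {"ts": 5, "col": "Bal", "before": 9, "pred": t3}
ty_id = TXN_ID_BASE + 1
ty = {"ts": ty_id, "col": "Bal", "before": 8, "pred": t5}
sally = {"Owner": "Sally", "Bal": 7}

print(visible_value(sally, ty, "Bal", TXN_ID_BASE + 2, 4))  # reader Tx -> 9
```

Note how the second disjunct makes Ty stop immediately at the chain head and keep the uncommitted in-place value 7, while every other transaction first undoes Ty's update.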
[Figure 2 distinguishes, relative to the lifetime of transaction T (from startTime to commitTime), an unchanged object from a modified object, a deleted object, a created object (phantom), and a created & deleted object (phantom).]

Figure 2: Modifications/deletions/creations of data objects relative to the lifetime of transaction T

[Figure 3 shows a predicate space over the attributes Interest (I) and Balance (B) for four predicates: P1: I = 1.6; P2: I = 1 and B = 15; P3: I between 0.1 and 0.2 and B between 10 and 20; P4: I = 1.7 and B = 15. The data points (x) in the undo buffers under validation are intersected with these predicates; the point [I = 0.13, B = 14] clashes with P3.]

Figure 3: Checking data points in the undo buffers against the predicate space of a transaction

This is sufficient for serializability of read-only transactions. However, for update transactions we need a validation phase that (conceptually) verifies that its entire read set did not change during the execution of the transaction. In previous approaches, this task is inherently complex as the read set can be very large, especially for main-memory database systems that tend to rely on full-table scans much more frequently than traditional disk-based applications [33]. Fortunately, we found a way to limit this validation to the objects that were actually changed and are still available in the undo buffers.

2.3 Serializability Validation

We deliberately avoid write-write conflicts in our MVCC model, as they may lead to cascading rollbacks. If another transaction tries to update an uncommitted data object (as indicated by the large transactionID timestamp in its predecessor version), it is aborted and restarted. Therefore, the first VersionVector pointer always leads to an undo buffer that contains a committed version — except for unversioned records where the pointer is null. If the same transaction modifies the same data object multiple times, there is an internal chain of pointers within the same undo buffer that eventually leads to the committed version.

In order to retain a scalable lock-free system we rely on optimistic execution [22] in our MVCC model. To guarantee serializability, we thus need a validation phase at the end of a transaction. We have to ensure that all reads during transaction processing could have been (logically) at the very end of the transaction without any observable change (as shown for the object on the top of Figure 2). In terms of this figure, we will detect the four (lower) transitions: modification, deletion, creation, and creation & deletion of an object that is "really" relevant for the transaction T. For this purpose, transactions draw a commitTime-stamp from the counter that is also "giving out" the startTime-stamps. The newly drawn number determines the serialization order of the transaction. Only updates that were committed during T's lifetime, i.e., in between the startTime and the commitTime, are potentially relevant for the validation. In terms of Figure 2, all events except for the top-most may lead to an abort, but only if these modified/deleted/created objects really intersect with T's read predicate space.

In previous approaches for serializability validation, such as in Microsoft's Hekaton [8, 23] and PostgreSQL [34], the entire read set of a transaction needs to be tracked (e.g., by using SIREAD locks as in PostgreSQL) and needs to be re-checked at the end of the transaction — by redoing all the read accesses. This is prohibitively expensive for large read sets that are very typical for scan-heavy main-memory database applications [33], including analytical transactions. Here, our novel idea of using the undo buffers for validation comes into play. Thereby, we limit the validation to the number of recently changed and committed data objects, no matter how large the read set of the transaction was. For this purpose, we adapt an old (and largely "forgotten") technique called Precision Locking [17] that eliminates the inherent satisfiability test problem of predicate locking. Our variation of precision locking tests discrete writes (updates, deletions, and insertions of records) of recently committed transactions against predicate-oriented reads of the transaction that is being validated. Thus, a validation fails if such an extensional write intersects with the intensional reads of the transaction under validation [42]. The validation is illustrated in Figure 3, where we assume that transaction T has read objects under the four different predicates P1, P2, P3, and P4, which form T's predicate space. We need to validate the three undo buffers at the bottom and validate that their objects (i.e., data points) do not intersect with T's predicates. This is done by evaluating the predicates for those objects. If the predicates do not match, then there is no intersection and the validation passes; otherwise, there is a conflict. This object-by-predicate based validation eliminates the undecidability problem inherent in other approaches that require predicate-by-predicate validation.

In order to find the extensional writes of other transactions that committed during the lifetime of a transaction T, we maintain a list of recentlyCommitted transactions, which contains pointers to the corresponding undo buffers (cf. Figure 1). We start our validation with the undo buffers of the oldest transaction that committed after T's startTime and traverse to the youngest one (at the bottom of the list). Each of the undo buffers is examined as follows: For each newly created version, we check whether it satisfies any of T's selection predicates. If this is the case, T's read set is inconsistent because of the detected phantom and it has to be aborted. For a deletion, we check whether or not the deleted object belonged to T's read set. If so, we have to abort T. For a modification (update) we have to check both the before-image as well as the after-image.
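The object-by-predicate check can be sketched in a few lines of Python (our own simplification: predicates are plain callables over attribute dicts rather than HyPer's predicate log, and undo-buffer entries are labeled as creations, deletions, or modifications). Using the predicate space of Figure 3, the data point [I = 0.13, B = 14] from the left-most undo buffer clashes with P3 and forces an abort, while points outside all four predicates pass validation.

```python
# Predicate space of transaction T from Figure 3 (I = Interest, B = Balance).
predicates = [
    lambda o: o["I"] == 1.6,                                # P1
    lambda o: o["I"] == 1 and o["B"] == 15,                 # P2
    lambda o: 0.1 <= o["I"] <= 0.2 and 10 <= o["B"] <= 20,  # P3
    lambda o: o["I"] == 1.7 and o["B"] == 15,               # P4
]

def intersects(obj, predicates):
    """An (extensional) write intersects T's (intensional) reads if the
    written object satisfies any of T's read predicates."""
    return any(p(obj) for p in predicates)

def validate(undo_buffers, predicates):
    """Check the writes of recently committed transactions against T's
    predicate space; returns False (T must abort) on the first clash."""
    for buffer in undo_buffers:
        for kind, before, after in buffer:  # kind: "create"/"delete"/"modify"
            if kind == "create" and intersects(after, predicates):
                return False   # phantom: a new object T should have read
            if kind == "delete" and intersects(before, predicates):
                return False   # a deleted object belonged to T's read set
            if kind == "modify" and (intersects(before, predicates)
                                     or intersects(after, predicates)):
                return False   # before- or after-image clashes with T's reads
    return True

# Left-most undo buffer of Figure 3: the point [I=0.13, B=14] satisfies P3.
clash = [("modify", {"I": 0.13, "B": 14}, {"I": 0.5, "B": 14})]
print(validate([clash], predicates))  # -> False: T has to be aborted
```

Note that the loop touches only the recently committed writes, not T's (potentially huge) read set, which is exactly what makes the validation cheap for scan-heavy transactions.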
If either intersects with T's predicate space we abort T. This situation is shown in Figure 3, where the data point x of the left-most undo buffer satisfies predicate P3, meaning that it intersects with T's predicate space.

After successful validation, a transaction T is committed by first writing its commit into the redo-log (which is required for durability). Thereafter, all of T's transactionID timestamps are changed to its newly assigned commitTime-stamp. Due to our version maintenance in the undo buffers, all these changes are local and therefore very cheap. In case of an abort due to a failed validation, the usual undo-rollback takes place, which also removes the version delta from the version chain. Note that the serializability validation in our MVCC model can be performed in parallel by several transactions whose serialization order has been determined by drawing the commitTime-stamps.

2.3.1 Predicate Logging

Instead of the read set, we log the predicates during the execution of a transaction for our serializability validation. Note that, in contrast to Hekaton [23], HyPer not only allows to access records through an index, but also through a base table scan. We log predicates of both access patterns in our implementation. Predicates of a base table access are expressed as restrictions on one or more attributes of the table. We log these restrictions in our predicate log on a per-relation basis. Index accesses are treated similarly by logging the point and range lookups on the index.

Index nested-loop joins are treated differently. In this case, we log all values that we read from the index as predicates. As we potentially read many values from the index, we subsequently coarsen these values to ranges and store these ranges as predicates in the predicate log instead. Other join types are not treated this way. These joins are preceded by (potentially restricted) base table accesses.

2.3.2 Implementation Details

From an implementation perspective, a transaction logs its data accesses as read predicates on a per-relation basis in a designated predicate log. We always use 64 bit integer comparison summaries per attribute to allow for efficient predicate checks based on cheap integer operations and to keep the size of the predicate log small. Variable-length data objects such as strings are hashed to 64 bit summaries.

Traditional serializable MVCC models detect conflicts at the granularity of records (e.g., by "locking" the record). In our implementation we log the comparison summaries for restricted attributes (predicates), which is sufficient to detect serializability conflicts at the record-level (SR-RL). However, sometimes a record is too coarse. If the sets of read and written attributes of transactions do not overlap, a false positive conflict could be detected. To eliminate these false positives, which would lead to false aborts, we also implemented a way to check for serializability conflicts at the granularity of attributes (SR-AL): in addition to the restricted attributes we further log which attributes are accessed, i.e., read, without a restriction. During validation we then know which attributes were accessed and can thus skip the validation of versions that modified attributes that were not accessed. The evaluation in Section 4.4 shows that serializability checks at the attribute-level (SR-AL) reduce the number of false positives compared to serializability checks at the record-level (SR-RL) while barely increasing the overhead of predicate logging and validation.

[Figure 4 depicts the Predicate Tree for the predicate space of Figure 3: a root node P with child nodes "B between 10 and 20" (child: "I between 0.1 and 0.2"), "B = 15" (children: "I = 1" and "I = 1.7"), and "I = 1.6".]

Figure 4: Predicate Tree (PT) for the predicate space of Figure 3

Serializability validation works as follows: At the beginning of the validation of a committing transaction, a Predicate Tree (PT) is built on a per-relation basis from the predicate log. PTs are directed trees with a root node P. The PT for the predicate space in Figure 3 is exemplified in Figure 4. The nodes of a PT are single-attribute predicates, e.g., B = 15. Edges connect nodes with a logical AND, e.g., B = 15 ∧ I = 1. The logical OR of all paths in the tree then defines the predicate space. Nodes for the same predicate that share the same root are merged together, e.g., for B = 15 in Figure 4. During validation, data objects are checked whether they satisfy the PT, i.e., whether there is a path in the PT that the data object satisfies.

2.4 Garbage Collection

Garbage collection of undo buffers is continuously performed whenever a transaction commits. After each commit, our MVCC implementation determines the now oldest visible transactionID, i.e., the oldest timestamp of a transaction that has updates that are visible by at least one active transaction. Then, all committed transactions whose transactionID is older than that timestamp are removed from the list of recently committed transactions, the references to their undo buffers are atomically removed from the version lists, and the undo buffers themselves are marked with a tombstone. Note that it is not possible to immediately reuse the memory of a marked undo buffer, as other transactions can still have references to this buffer; although the buffer is definitely not relevant for these transactions, it may still be needed to terminate version chain traversals. It is safe to reuse a marked undo buffer as soon as the oldest active transaction has started after the undo buffer had been marked. As in our system, this can be implemented with very little overhead, e.g., by maintaining high water marks.

2.5 Handling of Index Structures

Unlike other MVCC implementations in Hekaton [8, 23] and PostgreSQL [34], our MVCC implementation does not use (predicate) locks and timestamps to mark read and modified keys in indexes. To guarantee SI and serializability, our implementation proceeds as follows: If an update updates only non-indexed attributes, updates are performed as usual. If an update updates an indexed attribute, the record is deleted and re-inserted into the relation, and both the deleted and the re-inserted record are stored in the index. Thus, indexes retain references to all records that are visible by any active transaction. Just like undo buffers, indexes are cleaned up during garbage collection.

We ensure the uniqueness of primary keys by aborting a transaction that inserts a primary key that exists either (i) in the snapshot that is visible to the transaction, (ii) in the last committed version of the key's record, or (iii) uncommitted as an insert in an undo buffer. Note that these are the only cases that need to be checked, as updates of indexed attributes are performed as a deletion and insertion.

For foreign key constraints we need to detect the case when an active transaction deletes a primary key and a concurrent transaction inserts a foreign key reference to that key. In this case, we abort the inserting transaction as it detects the (possibly uncommitted) delete. The inserting transaction is aborted pro-actively, even if the delete is uncommitted, because transactions usually commit and only rarely abort.

2.6 Efficient Scanning

Main-memory database systems for real-time business intelligence, i.e., systems that efficiently handle transactional and analytical workloads in the same database, rely heavily on "clock-rate" scan performance [43, 26]. Therefore, testing each data object individually (using a branch statement) whether or not it is versioned would severely jeopardize performance.

2.7 Synchronization of Data Structures

In this work, we focus on providing an efficient and elegant mechanism to allow for logical concurrency of transactions, which is required to support interactive and sliced transactions, i.e., transactions that are decomposed into multiple tasks such as stored procedure calls or individual SQL statements. Due to application roundtrips and other factors, it is desirable to interleave the execution of these decomposed tasks, and our serializable MVCC model enables this logical concurrency. Thread-level concurrency is a largely orthogonal topic. We thus only briefly describe how our MVCC data structures can be synchronized and how transactional workloads can be processed in multiple threads.

To guarantee thread-safe synchronization in our implementation, we obtain short-term latches on the MVCC data structures for the duration of one task (a transaction typically consists of multiple such calls). The commit processing of writing transactions is done in a short exclusive critical section by first drawing the commitTime-stamp, vali-
formance. Our MVCC implementation in HyPer uses LLVM dating the transaction, and inserting commit records into
code generation and just-in-time compilation [31] to gener- the redo log. Updating the validity timestamps in the undo
ate efficient scan code at runtime. To mitigate the negative buffers can be carried out unsynchronized thereafter by us-
performance implications of repeated version branches, the ing atomic operations. Our lock-free garbage collection that
generated code uses synopses of versioned record positions continuously reclaims undo log buffers has been detailed in
to determine ranges that can be scanned at maximum speed. Section 2.4. Currently we use conventional latching-based
The generated scan code proceeds under consideration of synchronization of index structures, but could adapt to lock-
these synopses, called VersionedPositions, shown on the left- free structures like the Bw-Tree [25] in the future.
hand side of Figure 1. These synopses maintain the position In future work, we want to optimize the thread-parallel-
of the first and the last versioned record for a fixed range ization of our implementation further. We currently still
of records (e.g., 1024) in a 32 bit integer, where the position rely on classical short-term latches to avoid race conditions
of the first versioned record is stored in the high 16 bit and between concurrent threads. These latches can largely be
the position of the last versioned record is stored in the low avoided by using hardware transactional memory (HTM) [24]
16 bit, respectively. Maintenance of VersionedPositions is during version retrieval, as it can protect a reader from the
very cheap as insertions and deletions of positions require unlikely event of a concurrent (i.e., racy) updater. Note that
only a few logical operations (cf., evaluation in Section 4). such a conflict is very unlikely as it has to happen in a time
Further, deletions are handled fuzzily and VersionedPosi- frame of a few CPU cycles. A combination of our MVCC
tions are corrected during the next scan where the necessary model and HTM is very promising and in initial experiments
operations can be hidden behind memory accesses. indeed outperforms our current implementation.
Note that the versions are continuously garbage collected;
therefore, most ranges do not contain any versions at all, 3. THEORY
which is denoted by an empty interval [x, x) (i.e., the lower In this section, we analyze our serializable MVCC model
and upper bound of the half-open interval are identical). more formally in the context of serializability theory.
E.g., this is the case for the synopsis for the first 5 records
in Figure 1. Using the VersionedPositions synopses, ad- 3.1 Discussion of our MVCC Model
jacent unversioned records are accumulated to one range In order to formalize our MVCC scheme we need to intro-
where version checking is not necessary. In this range, the duce some notation that is illustrated in Figure 5. On the
scan code proceeds at maximum speed without any branches top of the figure a schedule consisting of four transactions
for version checks. For modified records, the VersionVector is shown. These transactions start at times S1 , S2 , S3 , and
is consulted and the version of the record that is visible to S4 , respectively. As they access different versions of the data
the transaction is reconstructed (cf., Section 2.2). Again, objects, we need a version ordering/numbering scheme in or-
a range for modified records is determined in advance by der to differentiate their reads and their version creations.
scanning the VersionVector for set version pointers to avoid This is shown for the same four-transaction-schedule at the
repeated testing whether a record is versioned. bottom of the figure.
Looking at Figure 1, we observe that for strides 0 . . . 4 Transactions are allowed to proceed concurrently. They
and 6 . . . 10 the loop on the unversioned records scans the are, however, committed serially. An update transaction
Bal ance vector at maximum speed without having to check draws its commitTime-stamp from the same counter that
if the records are versioned. Given the fact that the strides generates the startTime-stamps. The commitTime-stamps
in between two versioned objects are in the order of millions determine the commit order and, as we will see, they also
in a practical setting, the scan performance penalty incurred determine the serialization order of the transactions. Read-
by our MVCC is marginal (as evaluated in Section 4.1). De- only transactions do not need to draw a commit order times-
termining the ranges of versioned objects further ensures tamp; they reuse their startTime-stamp. Therefore, in our
that the VersionedPositions synopses are not consulted in example the transaction that started at S1 obtained the com-
hotspot areas where all records are modified. mitTime-stamp T6 , because the transaction that started at
S1
r(x) w(x) commit immediately, but in our MVCC model we opted to validate
S2 all reads at commit time. After all, the rw-antidependencies
r(y) w(y) commit could have been resolved by an abort of (S4 , T7 ) or by a
S3
r(y) r(x) commit commit of the reading transaction before T7 .
S5 The benefits of MVCC are illustrated in Figure 6(b), where
r(y) w(y) commit
transaction (S5 , T6 ) managed to “slip in front” of transaction

S
im :
(S4 , T7 ) even though it read x after (S4 , T7 ) wrote x. Obvi-

rtT on
eT
explicit versioning

ta ti
ously, with a single-version scheduler this degree of logical

= s s ac
T6

TS tran
S1 concurrency would not have been possible. The figure also
r(x0) w( x ) commit(x6) illustrates the benefits of our MVCC scheme that keeps an

m ly
m n
T4

co ad-o
S2 arbitrary number of versions instead of only two as in [36].

it
r(y0) w( y ⃞) commit(y4)

re
The “long” read transaction (S1 , T1 ) needs to access x0 even
T3 though in the meantime the two newer versions x3 and x7
S3
r(y0) r(x0) commit were created. Versions are only garbage collected after they
T7
S5 are definitely no longer needed by other active transactions.
r(y4) w( y ⃞) commit(y7) Our novel use of precision locking consisting of collecting
Figure 5: Example of the explicit versioning nota- the read predicates and validating recently committed ver-
tion in our MVCC model sions against these predicates is illustrated in Figure 6(c).
Here, transaction (S2 , T5 ) reads x0 with predicate P , de-
noted rP (x0 ). When the transaction that started at S2 tries
S2 committed earlier at timestamp T4 . The read-only trans- to commit, it validates the before- and after-images of ver-
action that started at timestamp S3 logically also commits sions that were committed in the meantime. In particular,
at timestamp T3 . P (x0 ) is true and therefore leads to an abort and restart
Transactions read all the data in the version that was com- of the transaction. Likewise, phantoms and deletions are
mitted (i.e., created) most recently before their startTime- detected as exemplified for the insert i( o ) and the delete
stamp. Versions are only committed at the end of a trans- d( u ) of transaction (S1 , T4 ). Neither the inserted object
action and therefore receive the identifiers corresponding to nor the deleted object are allowed to intersect with the pred-
the commitTime-stamps of the transaction that creates the icates of concurrent transactions that commit after T4 .
version. The transaction schedule of Figure 5 creates the
version chains y0 → y4 → y7 and x0 → x6 . Note that ver-
3.2 Proof of Serializability Guarantee
sions are not themselves (densely) numbered because of our We will now prove that our MVCC scheme with predicate
scheme of identifying versions with the commitTime-stamp space validation guarantees that any execution is serializable
of the creating transaction. As we will prove in Section 3.2, in commit order.
our MVCC model guarantees equivalence to a serial mono-
Theorem. The committed projection of any multi-version
version schedule in commitTime-stamp order. Therefore,
schedule H that adheres to our protocol is conflict equiva-
the resulting schedule of Figure 5 is equivalent to the serial
lent to a serial mono-version schedule H 0 where the commit-
mono-version execution: r3 (y), r3 (x), c3 , r4 (y), w4 (y), c4 ,
ted transactions are ordered according to their commitTime-
r6 (x), w6 (x), c6 , r7 (y), w7 (y), c7 . Here all the operations
stamps and the uncommitted transactions are removed.
are subscripted with the transaction’s commitTime-stamp.
Local writing is denoted as w( x ). Such a “dirty” data Proof. Due to the nature of the MVCC protocol, the effects
object is only visible to the transaction that wrote it. In of any uncommitted transaction can never be seen by any
our implementation (cf., Section 2.1), we use the very large other successful transaction (reads will ignore the uncom-
transaction identifiers to make the dirty objects invisible to mitted writes, writes will either not see the uncommitted
other transactions. In our formal model we do not need these writes or lead to aborts). Therefore, it is sufficient to con-
identifiers. As we perform updates in-place, other transac- sider only committed transactions in this proof.
tions trying to (over-)write x are aborted and restarted. Basically, we will now show that all dependencies are in
Note that reading x is always possible, because a transac- the direction of the order of their commitTime-stamps and
tion’s reads are directed to the version of x that was commit- thus any execution is serializable in commit order. Read-
ted most recently before the transaction’s startTime-stamp only transactions see a stable snapshot of the database at
— with one exception: if a transaction updates an object time Sb , and get assigned the same commitTime-stamp Tb =
x, i.e., w( x ), it will subsequently read its own update, i.e., Sb , or, in other words, they behave as if they were executed
r( x ). This is exemplified for transaction (S1 , T2 ) on the up- at the point in time of their commitTime-stamp, which is
per left hand side of Figure 6(a). In our implementation this the same as their startTime-stamp.
read-your-own-writes scheme is again realized by assigning Update transactions are started at Sb , and get assigned a
very large transaction identifiers to dirty data versions. commitTime-stamp Tc with Tc > Sb . We will now prove by
Figure 6(a) further exemplifies a cycle of rw-dependencies, contradiction, that the transactions behave as if they were
often also referred to as rw-antidependencies [11]. rw-antide- executed at the time point Tc . Assume T is an update-
pendencies play a crucial role in non-serializable schedules transaction from the committed projection of H (i.e., T has
that are compliant under SI. The first rw-antidependency committed successfully), but T could not have been delayed
involving r(y0 ) and w( y ) in the figure could not have been to the point Tc . That is, T performed an operation o1 that
detected immediately as the write of y in (S4 , T7 ) happens conflicts with another operation o2 by a second transaction
after the read of y; the second rw-antidependency involving T 0 with o1 < o2 and T 0 committed during T ’s lifetime, i.e.,
r(x2 ) and w( x ) on the other hand could have been detected within the time period Sb ≤ Tc0 < Tc . If T 0 committed after
T2
S1
r(x0) w( x ⃞)r( x ⃞)w( x ⃞) commit(x2)
S3 immediate abort due to write-write conflict with x6
r(x2) ww w( x ⃞)
T7
S4
r(x2) r(y0) w( x ⃞)w( y ⃞) commit(x7,y7)
rw rw abort due to read-write conflict: validation fails as
S5 {x2,x6,y0,y6} intersect predicate space of S5
r(y0) r(x2) w( z ⃞)
T6
S6 succeeds because it is a read-only transaction
r(y0) r(x2) commit
(a) a write-write conflict and a read-write conflict
T1 T4
S1 S1
r(y0) r(z0) r(x0) commit r(x0) i( o ⃞)d( u ⃞)w( x ⃞) commit(x4,o4,u)
T3
S2 T5
r(x0) w( x ⃞) commit(x3) S2 abort because P(x0)=true
T7 rP(x0) rQ(z0) w( z ⃞) Test P(x0),Q(x0),P(x5),Q(x5),P(o5),Q(o5),P(u0),Q(u0)
S4
r(x3) w( x ⃞) r(z0) w( z ⃞) commit(x7,z7) abort if either Test is true:
T6
T6 S3 S(o5) indicates phantom
S5 rS(...) w( y ⃞)
r(x3) r(y0) w( y ⃞) commit(y6) Test S(x0),S(x5),S(o5),S(u0)
(b) benefits of our MVCC model (c) a read-write conflict, a read-delete conflict, and a phantom conflict

Figure 6: Example schedules in our MVCC model

T , i.e., Tc0 > Tc , we could delay T 0 (and thus o2 ) until after iced by updating whole records and not using VersionedPo-
Tc , thus we only have to consider the case Tc0 < Tc . sitions synopses in our MVCC model. We further experi-
There are four possible combinations for the operations mented with DBMS-X, a commercial main-memory DBMS
o1 and o2 . If both are reads, we can swap the order of both, with a MVCC implementation similar to [23]. DBMS-X was
which is a contradiction to our assumption that o1 and o2 are run in Windows 7 on our evaluation machine. Due to licens-
conflicting. If both are writes, T 0 would have aborted due to ing agreements we can not disclose the name of DBMS-X.
to our protocol of immediately aborting upon detecting ww - The experiments were executed on a 2-socket Intel Xeon
conflicts. Thus, there is a contradiction to the assumption E5-2660v2 2.20 GHz (3 GHz maximum turbo) NUMA sys-
that both, T and T 0 , are committed transactions. If o1 is a tem with 256 GB DDR3 1866 MHz memory (128 GB per
read and o2 is a write, the update o2 is already in the undo CPU) running Linux 3.13. Each CPU has 10 cores and a
buffers when T commits, as Tc0 < Tc and the predicate P 25 MB shared L3 cache. Each core has a 32 KB L1-I and
of the read of o1 has been logged. The predicate validation L1-D cache as well as a 256 KB L2 cache.
at Tc then checks if o1 is affected by o2 by testing whether
P is satisfied for either the before- or the after-image of o2
(i.e., if the read should have seen the write), as illustrated in
4.1 Scan Performance
Figure 6(c). If not, that is a contradiction to the assumption Initially we demonstrate the high scan performance of our
that o1 and o2 are conflicting. If yes, that is a contradiction MVCC implementation. We implemented a benchmark sim-
to the assumption that T has committed successfully as T ilar to the SIBENCH [7] benchmark and our bank accounts
would have been aborted when P was satisfied. If o1 is a example (cf., Figure 1). The benchmark operates on a single
write and o2 is a read, the read has ignored the effect of o1 relation that contains integer (key, value) pairs. The work-
in the MVCC mechanism, as Tc0 > Sb , which is a contradic- load consists of update transactions which modify a (key,
tion to the assumption that o1 and o2 are conflicting. The value) pair by incrementing the value and a scan transac-
theorem follows. tion that scans the relation to sum up the values.
Figure 7 shows the per-core performance of scan trans-
actions on a relation with 100M (key, value) records. To
4. EVALUATION demonstrate the effect of scanning versioned records, we dis-
In this section we evaluate our MVCC implementation in able the continuous garbage collection and perform updates
our HyPer main-memory database system [21] that supports before scanning the relations. We vary both, the number of
SQL-92 queries and transactions (defined in a PL/SQL-like dirty records and the number of versions per dirty record.
scripting language [20]) and provides ACID guarantees. Additionally, we distinguish two cases: (i) the scan trans-
HyPer supports both, column- and a row-based storage of action is started before the updates (scan oldest) and thus
relations. Unless otherwise noted, we used the column-store needs to undo the effects of the update transactions and
backend, enabled continuous garbage collection, and stored (ii) the scan transaction is started after the updates (scan
the redo log in local memory. Redo log entries are generated newest) and thus only needs to verify that the dirty records
in memory and log entries are submitted in small groups are visible to the scan transaction. For all cases, the results
(group commit), which mitigates system call overheads and show that our MVCC implementation sustains the high scan
barely increases transaction latency. We evaluated HyPer throughput of our single-version concurrency control imple-
with single-version concurrency control, our novel MVCC mentation for realistic numbers of dirty records; and even
model, and a MVCC model similar to [23], which we mim- under high contention with multiple versions per record.
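For the "scan oldest" case, the scan must reconstruct a record version that predates the updates by applying before-image deltas from the undo buffers (cf. Section 2.2: the newest version is stored in-place, older versions as deltas in a chain). A minimal sketch of that reconstruction; the names are illustrative, not HyPer's actual code:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UndoEntry:
    commit_ts: int                       # commitTime-stamp of the update this delta belongs to
    before_image: int                    # attribute value before that update
    older: Optional["UndoEntry"] = None  # next (older) delta in the version chain

@dataclass
class Record:
    value: int                           # newest version, stored in-place
    chain: Optional[UndoEntry] = None    # newest-first chain of before-image deltas

def visible_value(rec: Record, start_ts: int) -> int:
    """Undo every update that committed at or after the reader's
    startTime-stamp; timestamps come from one counter, so ties cannot occur."""
    value, entry = rec.value, rec.chain
    while entry is not None and entry.commit_ts >= start_ts:
        value = entry.before_image       # roll this update back
        entry = entry.older
    return value

# A record updated 10 -> 11 -> 12 by transactions committing at T4 and T7:
acct = Record(value=12, chain=UndoEntry(7, 11, UndoEntry(4, 10)))
```

A reader that started at timestamp 5 reconstructs 11, one that started at 3 still sees 10, and a reader starting after T7 returns the in-place value without touching the chain — which is why "scan newest" is cheaper than "scan oldest" in Figure 7.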
Figure 7: Scan performance with disabled garbage collection: the scan newest transaction only needs to verify the visibility of records, while the scan oldest transaction needs to undo updates

To validate our assumptions for the number of dirty records and versions, we consider Amazon.com as an example. 6.9 million copies of Harry Potter and the Half-Blood Prince, one of the best-selling books of all time, were sold within the first 24 hours in the United States. Even if we make the highly pessimistic assumptions that all books are sold through Amazon within 20 hours of that day and that Amazon operates only six warehouses, 16 copies of that book are sold per warehouse per second. Our experiment suggests that in order to measure a significant drop in scan performance there need to be hundreds of thousands of such best-selling items and a transaction that is open for a long period of time. Remember that in this case the long-running transaction can be aborted and restarted on a snapshot [29].

Figure 8 shows the performance effect of having VersionedPositions synopses (see Section 2.6) on scan performance. Our implementation maintains VersionedPositions per 1024 records. The experiment suggests that increasing or decreasing the number of records per VersionedPositions degrades scan performance. Compared to not using VersionedPositions at all, scan performance is improved by more than 5.5×. 1024 records seems to be a sweet spot where the size of the VersionedPositions vector is still reasonable and the synopses already encode meaningful ranges, i.e., ranges that include mostly modified records.

Figure 8: Effect of VersionedPositions (VP) synopses per s records on scan performance

A breakdown of CPU cycles in Figure 9 shows that our MVCC functions are very cheap for realistic numbers of versioned records. We measured 2.8 instructions per cycle (IPC) during the scans.

Figure 9: Cycle breakdowns of scan-oldest transactions that need to undo 4 updates per dirty record

We further compared the scan performance of our MVCC implementation to DBMS-X. DBMS-X achieves a scan speed of 7.4M records/s with no dirty records and 2.5M records/s with 10k dirty records (> 100× slower than our MVCC implementation). Of course, we "misused" DBMS-X, with its point-query-only-optimized model, for large full-table scans, which would be necessary for analytical transactions. The Hekaton model is only optimized for point queries and performs all accesses through an index, which severely degrades performance for scan-based analytics.

4.2 Insert/Update/Delete Benchmarks

We also evaluated the per-core performance of insert, update, and "delete and insert" (delin) operations on a relation with 10 integer attributes and 100M records. As expected, compared to our single-version concurrency control implementation (5.9M inserts/s, 3.4M updates/s, 1.1M delins/s), performance with our MVCC implementation is slightly degraded due to visibility checks and the maintenance of the VersionVector and the VersionedPositions (4M inserts/s, 2M updates/s, 1M delins/s). The number of logically concurrent active transactions, however, has no performance impact. As the newest version is stored in-place and the version record of the previous version is inserted at the beginning of the version chain, performance of updates is also independent of the total number of versions.

4.3 TPC-C and TATP Results

TPC-C is a write-heavy benchmark and simulates the principal activities of an order-entry environment. Its workload mix consists of 8% read-only transactions and 92% write transactions. Some of the transactions in the TPC-C perform aggregations and reads with range predicates. Figure 10(a) shows the per-core performance of our MVCC implementation for the TPC-C benchmark with 5 warehouses and no think times. Compared to our single-version concurrency control implementation, our MVCC implementation costs around 20% of performance. Still, more than 100k transactions/s are processed. This is true for both our column- and row-based storage backends. We also compared these numbers to a 2PL implementation in HyPer and a MVCC model similar to [23]. 2PL is prohibitively expensive and achieves a ∼5× smaller throughput. The MVCC model of [23] achieves a throughput of around 50k transactions/s.
Figure 10: Single-threaded TPC-C experiments with a default of 5 warehouses. (a) Throughput of HyPer with single-version concurrency control, our MVCC model, and a MVCC model that mimics the behavior of [23] by updating whole records and not using VersionedPositions, for the column- and row-store backends. (b) Throughput of interleaved transactions under serializability with attribute-level (SR-AL) and record-level (SR-RL) predicate logs, relative to SI throughput, for a varying percentage of read-heavy top-customer transactions in the workload mix. (c) Concurrency control abort rate when running interleaved transactions with a varying number of warehouses under snapshot isolation (SI) and serializability with attribute-level (SR-AL) and record-level (SR-RL) predicate logs.
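The SR-RL and SR-AL modes compared in Figure 10 differ only in what a committed write is checked against at validation time. A hedged sketch of the two checks (illustrative structures; the 64 bit comparison summaries of Section 2.3.2 are reduced to plain integers):

```python
from dataclasses import dataclass

@dataclass
class ReadPredicate:
    attr: str
    low: int
    high: int                       # closed bounds over the attribute's summary

    def matches(self, image: dict) -> bool:
        return self.attr in image and self.low <= image[self.attr] <= self.high

def validate_al(predicates, committed_writes) -> bool:
    """Attribute-level (SR-AL): abort only if a before- or after-image of a
    recently committed write satisfies one of the logged read predicates."""
    for before, after in committed_writes:
        for p in predicates:
            if (before and p.matches(before)) or (after and p.matches(after)):
                return False        # the read should have seen this write
    return True

def validate_rl(read_record_ids, written_record_ids) -> bool:
    """Record-level (SR-RL): abort on any overlap of read and written
    records, even if disjoint attributes were touched."""
    return not (set(read_record_ids) & set(written_record_ids))
```

A write that changes only c_credit of record 42 unconditionally aborts an SR-RL reader of record 42, while under SR-AL it passes as long as no logged predicate covers the changed before- or after-image — the source of SR-RL's extra false positives in Figure 10(c).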

Figure 11: Multi-threaded TPC-C experiments with 20 warehouses in HyPer with our MVCC model. (a) Scalability with the number of threads (MPL), with and without RDMA log shipping. (b) 20 threads, varying contention through the share of partition-crossing transactions. (c) 20 threads, impact of the share of read-only transactions.
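The multi-threaded runs in Figure 11 serialize commits as described in Section 2.7: a short exclusive critical section draws the commitTime-stamp, validates, and appends the commit record to the redo log, while the validity timestamps in the undo buffers are patched afterwards without holding the latch. A rough sketch of that structure (the counter, validation hook, and log are stand-ins, not HyPer's actual interfaces):

```python
import threading

class ConcurrencyControlAbort(Exception):
    pass

class TransactionManager:
    def __init__(self):
        self._latch = threading.Lock()
        self._clock = 0               # one counter feeds start and commit stamps
        self.redo_log = []

    def begin(self) -> int:
        with self._latch:
            self._clock += 1
            return self._clock        # startTime-stamp

    def commit(self, validate, redo_records, publish) -> int:
        # Short exclusive critical section: stamp, validate, log.
        with self._latch:
            self._clock += 1
            commit_ts = self._clock
            if not validate():        # predicate-space validation (Section 2.3)
                raise ConcurrencyControlAbort()
            self.redo_log.append((commit_ts, redo_records))
        publish(commit_ts)            # patch undo-buffer timestamps, unlatched
        return commit_ts
```

Read-only transactions skip commit() entirely and reuse their startTime-stamp, as described in Section 3.1.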

We further executed the TPC-C with multiple threads. only transactions and 20% write transactions. Thereby, the
Like in H-Store [19]/VoltDB, we partition the database ac- read transactions all perform point accesses and records are
cording to warehouses. Partitions are assigned to threads mostly updated as a whole. Thus, the TATP benchmark is
and threads are pinned to cores similar to the DORA sys- a best-case scenario for the MVCC model of [23]. We ran
tem [32]. These threads process transactions that primar- the benchmark with a scale-factor of 1M subscribers. Com-
ily access data belonging to their partition. Unlike DORA, pared to running the benchmark with single-version con-
partition-crossing accesses, which, e.g., occur in 11% of the currency control (421,940 transactions/s), our MVCC im-
TPC-C transactions, are carried out by the primary thread plementation creates just a tiny overhead (407,564 transac-
to which the transaction was assigned. The scalability ex- tions/s). As expected, the mimicked MVCC model of [23]
periment (see Figure 11(a)) shows that our system scales also performs quite well in this benchmark, but still trails
near linearly up to 20 cores. Going beyond 20 cores might performance of our MVCC implementation by around 20%
require the reduction of global synchronization like in the (340,715 transactions/s).
SILO system [41]. We further varied the contention on the
partitions by varying the percentage of partition-crossing 4.4 Serializability
transactions as shown in Figure 11(b). Finally, as shown To determine the cost of our serializability validation ap-
in Figure 11(c), we also measured the impact of read-only proach (cf., Section 2.3), we first measured the cost of pred-
transactions and proportionally varied the percentage of the icate logging in isolation from predicate validation by run-
two read-only transactions in the workload mix. ning the TPC-C and TATP benchmarks each in a single
Figure 11(a) also shows the scalability of HyPer when serial stream. Without predicate logging, i.e., under SI,
shipping the redo log using remote direct memory accesses we achieve a TPC-C throughput of 112,610 transactions/s,
(RDMA) over Infiniband. RDMA-based log shipping gen- with record-level predicate logging (SR-RL) 107,365 transac-
erates an overhead of 17% with 20 threads. Our evalua- tions/s, and with attribute-level predicate logging (SR-AL)
tion system has a Mellanox ConnectX-3 Infiniband network 105,030 transactions/s. This means that there is a mere 5%
adapter, which operates at 4×QDR. The maximum write overhead for SR-RL and a mere 7% overhead for SR-AL. For
bandwidth of our setup is 3.5 GB/s with a latency of 1.3 µs. the TATP benchmark, we measured an overhead of only 1%
This bandwidth is sufficient to ship the redo log entries: for for SR-RL and 2% for SR-AL. We also measured the predi-
100k TPC-C transactions we generate 85 MB of redo log cate logging overhead for the mostly read-only TPC-H deci-
entries. In our setup, the receiving node can act as a high- sion support benchmark, which resulted in an even smaller
availability failover but could also write the log to disk. overhead. In terms of size, predicate logs generated by our
The Telecommunication Application Transaction Process- implementation are quite small. Unlike other serializable
ing (TATP) benchmark simulates a typical telecommunica- MVCC implementations, we do not need to track the entire
tions application. The workload mix consists of 80% read- read set of a transaction. To illustrate the difference in size
imagine a top-customer transaction on the TPC-C schema where |R| ≥ |W |. In our opinion this is mostly the case, as
that, for a specific warehouse and district, retrieves the cus- modern workloads tend to be read-heavy [33] and the time
tomer with the highest account balance and a good credit that a transaction is active tends to be short (long-running
rating (GC) and performs a small update: transactions would be deferred to a “safe snapshot”).
select
Finally, we evaluated the concurrency control abort rates,
c_w_id, c_d_id, max(c_balance) i.e., the aborts caused by concurrency control conflicts, of
from our MVCC implementation in HyPer. We again ran TPC-C
customer with logically interleaved transactions and varied the num-
where ber of TPC-C warehouses. As the TPC-C is largely par-
c_credit = ’GC’ titionable by warehouse, the intuition is that concurrency
and c_w_id = :w_id
and c_d_id = :d_id
control conflicts are reduced with an increasing number of
group by warehouses. The results are shown in Figure 10(c). We
c_w_id, c_d_id acknowledge that TPC-C does not show anomalies under
update ... SI [11], but of course the database system does not know
this, and this benchmark therefore tests for false positive
For such a query, serializable MVCC models that need to
aborts. The aborts under SI are “real” conflicts, i.e., two
track the read set then have to either copy all read records
transaction try to modify the same data item concurrently.
or set a flag (e.g., the SIREAD lock in PostgreSQL [34]). If
Serializability validation with SR-AL creates almost no false
we assume that at least one byte per read record is needed for
positive aborts. The only false positive aborts stem from the
book-keeping, then these approaches need to track at least
minimum (min) aggregation in delivery, as it sometimes con-
3 KB of data (a district of a warehouse serves 3k customers
flicts with concurrent inserts. Predicate logging of minimum
in TPC-C). Our SR-AL on the other hand just stores the
and maximum aggregates is currently not implemented in
read attributes and the aggregate that has been read which
our system but can easily be added in the future. SR-RL
is less than 100 bytes. This is 30× less than what traditional
creates more false positives than SR-AL, because reads are
read set book-keeping consumes; and for true OLAP-style
not only checked against updated attributes but rather any
queries that read a lot of data, predicate logging saves even
change to a record is considered a conflict, even though the
more. E.g., the read set of an analytical TPC-H query usu-
updated attribute might not even have been read by the
ally comprises millions of records and tracking the read set
original transaction.
can easily consume multiple MBs of space.
To determine the cost of predicate validation, we again ran the TPC-C benchmark, but this time interleaved transactions (by decomposing the transactions into smaller tasks) such that transactions run logically concurrently, which makes predicate validation necessary. We further added the aforementioned top-customer transaction to the workload mix and varied its share in the mix from 0% to 100%. The results are shown in Figure 10(b). As TPC-C transactions are very short, it would be an option to skip the step of building the predicate tree and instead just apply all the predicates as-is, making predicate validation much faster. However, the predicate tree has much better asymptotic behavior and is therefore much faster and more robust when transaction complexity grows. We therefore use it all the time instead of optimizing for very cheap transactions. The figure also shows that with more read-only transactions in the workload mix (as is the case in real-world workloads [33]), the overhead of serializability validation almost disappears.

Building the predicate trees takes between 2 µs and 15 µs for the TPC-C transactions on our system, and between 4 µs and 24 µs for the analytical TPC-H queries (9.5 µs geometric mean). In comparison to traditional validation approaches that repeat all reads, our system has, as mentioned before, a much lower book-keeping overhead. A comparison of validation times by themselves is more complicated: validation time in traditional approaches depends on the size of the read set |R| of the committing transaction and on how fast reads can be repeated (usually scan speed and index lookup speed); in our approach, it mostly depends on the size of the write set |W| that has been committed during the runtime of the committing transaction. In our system, checking the predicates of a TPC-C transaction or a TPC-H query against a versioned record that has been reconstructed from undo buffers is a bit faster than an index lookup. In general, our approach thus favors workloads

5. RELATED WORK

Transaction isolation and concurrency control are among the most fundamental features of a database management system. Hence, several excellent books and survey papers have been written on this topic in the past [42, 3, 2, 38]. In the following we highlight three categories of work that are particularly related to this paper, most notably multi-version concurrency control and serializability.

5.1 Multi-Version Concurrency Control

Multi-Version Concurrency Control (MVCC) [42, 3, 28] is a popular concurrency control mechanism that has been implemented in various database systems because of the desirable property that readers never block writers. Among these DBMSs are commercial systems such as Microsoft SQL Server's Hekaton [8, 23] and SAP HANA [10, 37] as well as open-source systems such as PostgreSQL [34].

Hekaton [23] is similar to our implementation in that it is based on a timestamp-based optimistic concurrency control [22] variant of MVCC and uses code generation [12] to compile efficient code for transactions at runtime. In the context of Hekaton, Larson et al. [23] compared a pessimistic locking-based with an optimistic validation-based MVCC scheme and proposed a novel MVCC model for main-memory DBMSs. Similar to what we have seen in our experiments, the optimistic scheme performs better in their evaluation. In comparison to Hekaton, our serializable MVCC model does not update records as a whole but in-place and at the attribute level. Further, we do not restrict data accesses to index lookups, and we optimized our model for the high scan speeds that are required for OLAP-style transactions. Finally, we use a novel serializability validation mechanism based on an adaptation of precision locking [17]. Lomet et al. [27] propose another MVCC scheme for main-memory database systems where the main idea is to use ranges of timestamps for a transaction. In contrast to classical MVCC models, we previously proposed using virtual memory snapshots for long-running transactions [29], where updates are merged back into the database at commit-time. Snapshotting and merging, however, can be very expensive depending on the size of the database.

Hyder [5, 4] is a data-sharing system that stores indexed records as a multi-version log-structured database on shared flash storage. Transaction conflicts are detected by a meld algorithm that merges committed updates from the log into the in-memory DBMS cache. This architecture promises to scale out without partitioning. While our MVCC model uses the undo log only for the validation of serializability violations, in Hyder, the durable log is the database. In contrast, our implementation stores data in a main-memory row- or column-store and writes a redo log for durability. OctopusDB [9] is another DBMS that uses the log as the database and proposes a unification of OLTP, OLAP, and streaming databases in one system.

5.2 Serializability

In contrast to PostgreSQL and our MVCC implementation, most other MVCC-based DBMSs only offer the weaker isolation level Snapshot Isolation (SI) instead of serializability. Berenson et al. [2], however, have shown that there exist schedules that are valid under SI but are non-serializable. In this context, Cahill et al. [7] and Fekete et al. [11] developed a theory of SI anomalies. They further developed the Serializable Snapshot Isolation (SSI) approach [7], which has been implemented in PostgreSQL [34]. To guarantee serializability, SSI tracks commit dependencies and tests for "dangerous structures" consisting of rw-antidependencies between concurrent transactions. Unfortunately, this requires keeping track of every single read, similar to read locks, which can be quite expensive for large read transactions. In contrast to SSI, our MVCC model uses a novel serializability validation mechanism based on an adaptation of precision locking [17]. Our approach does not track dependencies but read predicates, and it validates these predicates against the undo log entries, which are retained for as long as they are visible. Jorwekar et al. [18] tackled the problem of automatically detecting SI anomalies. Han et al. [14] propose a scalable SSI implementation for multi-core CPUs. Checking updates against a predicate space is related to SharedDB [13], which optimizes the processing of multiple queries in parallel.

5.3 Scalability of OLTP Systems

Orthogonal to logical transaction isolation, there is also a plethora of research on how to scale transaction processing out to multiple cores on modern CPUs. H-Store [19], which has been commercialized as VoltDB, relies on static partitioning of the database. Transactions that access only a single partition are then processed serially and without any locking. Jones et al. [16] describe optimizations for partition-crossing transactions. Our HyPer [21] main-memory DBMS optimizes for OLTP and OLAP workloads and follows the partitioned transaction execution model of H-Store. Prior to our MVCC integration, HyPer, just like H-Store, could only process holistic pre-canned transactions. With the serializable MVCC model introduced in this work, we provide a logical transaction isolation mechanism that allows for interactive and sliced transactions.

Silo [41] proposes a scalable commit protocol that guarantees serializability. To achieve good scalability on modern multi-core CPUs, Silo's design is centered around avoiding most points of global synchronization. The proposed techniques can be integrated into our MVCC implementation in order to reduce global synchronization, which could allow for better scalability. Pandis et al. [32] show that the centralized lock manager of traditional DBMSs is often a scalability bottleneck. To solve this bottleneck, they propose the DORA system, which partitions a database among physical CPU cores and decomposes transactions into smaller actions. These are then assigned to threads that own the data needed for the action, such that the interaction with the lock manager is minimized during transaction processing. Very lightweight locking [35] reduces the lock-manager overhead by co-locating lock information with the records. The availability of hardware transactional memory (HTM) in recent mainstream CPUs enables a promising new transaction processing model that reduces the substantial overhead of locking and latching [24]. HTM further allows multi-core scalability without statically partitioning the database [24]. In future work we thus intend to employ HTM to efficiently scale out our MVCC implementation, even in the presence of partition-crossing transactions.

Deterministic database systems [39, 40] propose the execution of transactions according to a pre-defined serial order. In contrast to our MVCC model, transactions need to be known beforehand, e.g., by relying on holistic pre-canned transactions, and such systems do not easily allow for interactive and sliced transactions. In the context of distributed DBMSs, Bornea et al. [6] propose a middleware for replicated DBMSs that adds global one-copy serializability for replicas that run under SI.

6. CONCLUSION

The multi-version concurrency control (MVCC) implementation presented in this work is carefully engineered to accommodate high-performance processing of both transactions with point accesses and read-heavy transactions, and even OLAP scenarios. For the latter, the high scan performance of single-version main-memory database systems was retained by an update-in-place version mechanism and by using synopses of versioned record positions, called VersionedPositions. Furthermore, our novel serializability validation technique, which checks the before-image deltas in undo buffers against a committing transaction's predicate space, incurs only a marginal space and time overhead, no matter how large the read set is. This results in a very attractive and efficient transaction isolation mechanism for main-memory database systems. In particular, our serializable MVCC model targets database systems that support OLTP and OLAP processing simultaneously and in the same database, such as SAP HANA [10, 37] and our HyPer [21] system, but it could also be implemented in high-performance transactional systems that currently only support holistic pre-canned transactions, such as H-Store [19]/VoltDB. From a performance perspective, we have shown that the integration of our MVCC model in our HyPer system achieves excellent performance, even when maintaining serializability guarantees. Therefore, at least from a performance perspective, there is little need to prefer snapshot isolation over full serializability any longer. Future work focuses on better single-node scalability using hardware transactional memory [24] and the scale-out of our MVCC model [30].
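The commit-time validation described in this paper can be sketched in a few lines. This is a deliberately simplified illustration under assumed structures (Predicate, UndoEntry, CommittedTxn are all hypothetical names), not HyPer's actual code: a committing transaction's logged read predicates are tested against the before- and after-images in the undo buffers of every transaction that committed during its runtime.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Predicate:
    """A logged read predicate: a range restriction on one attribute."""
    attr: str
    lo: float
    hi: float

    def matches(self, image: Optional[dict]) -> bool:
        # A missing image (insert/delete side) or attribute cannot match.
        return (image is not None and self.attr in image
                and self.lo <= image[self.attr] <= self.hi)

@dataclass
class UndoEntry:
    before: Optional[dict]  # before-image delta (None for an insert)
    after: Optional[dict]   # after-image (None for a delete)

@dataclass
class CommittedTxn:
    commit_ts: int
    undo_buffer: list       # list[UndoEntry]

def validate(start_ts: int, predicates: list, recently_committed: list) -> bool:
    """Return True iff no write committed during the transaction's
    runtime intersects its predicate space (precision-locking style)."""
    for other in recently_committed:
        if other.commit_ts <= start_ts:
            continue  # committed before we started: already in our snapshot
        for entry in other.undo_buffer:
            for pred in predicates:
                # A record that was, or now is, inside the predicate space
                # would have changed the transaction's reads.
                if pred.matches(entry.before) or pred.matches(entry.after):
                    return False  # serializability violation: abort
    return True

# A reader of balances in [100, 200] conflicts with a concurrent commit
# that moved a balance from 150 to 50, but not with one it already saw:
preds = [Predicate("balance", 100, 200)]
writer = CommittedTxn(commit_ts=7,
                      undo_buffer=[UndoEntry({"balance": 150}, {"balance": 50})])
assert validate(start_ts=5, predicates=preds, recently_committed=[writer]) is False
assert validate(start_ts=9, predicates=preds, recently_committed=[writer]) is True
```

Note that the cost of this check scales with the committed write set |W| (the undo entries scanned), not with the committing transaction's read set |R|, which is the asymmetry the evaluation section exploits.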
7. REFERENCES

[1] A. Adya, B. Liskov, and P. O'Neil. Generalized isolation level definitions. In ICDE, 2000.
[2] H. Berenson, P. A. Bernstein, J. Gray, J. Melton, E. J. O'Neil, et al. A Critique of ANSI SQL Isolation Levels. In SIGMOD, 1995.
[3] P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley Longman, 1986.
[4] P. A. Bernstein, C. W. Reid, and S. Das. Hyder - A Transactional Record Manager for Shared Flash. In CIDR, 2011.
[5] P. A. Bernstein, C. W. Reid, M. Wu, and X. Yuan. Optimistic Concurrency Control by Melding Trees. PVLDB, 4(11), 2011.
[6] M. A. Bornea, O. Hodson, S. Elnikety, and A. Fekete. One-Copy Serializability with Snapshot Isolation under the Hood. In ICDE, 2011.
[7] M. J. Cahill, U. Röhm, and A. D. Fekete. Serializable Isolation for Snapshot Databases. TODS, 34(4), 2009.
[8] C. Diaconu, C. Freedman, E. Ismert, P.-Å. Larson, P. Mittal, et al. Hekaton: SQL Server's Memory-optimized OLTP Engine. In SIGMOD, 2013.
[9] J. Dittrich and A. Jindal. Towards a one size fits all database architecture. In CIDR, 2011.
[10] F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and W. Lehner. SAP HANA Database: Data Management for Modern Business Applications. SIGMOD Record, 40(4), 2012.
[11] A. Fekete, D. Liarokapis, E. O'Neil, P. O'Neil, and D. Shasha. Making Snapshot Isolation Serializable. TODS, 30(2), 2005.
[12] C. Freedman, E. Ismert, and P.-Å. Larson. Compilation in the Microsoft SQL Server Hekaton Engine. DEBU, 37(1), 2014.
[13] G. Giannikis, G. Alonso, and D. Kossmann. SharedDB: Killing One Thousand Queries with One Stone. PVLDB, 5(6), 2012.
[14] H. Han, S. Park, H. Jung, A. Fekete, U. Röhm, et al. Scalable Serializable Snapshot Isolation for Multicore Systems. In ICDE, 2014.
[15] S. Héman, M. Zukowski, N. J. Nes, L. Sidirourgos, and P. Boncz. Positional Update Handling in Column Stores. In SIGMOD, 2010.
[16] E. P. Jones, D. J. Abadi, and S. Madden. Low Overhead Concurrency Control for Partitioned Main Memory Databases. In SIGMOD, 2010.
[17] J. R. Jordan, J. Banerjee, and R. B. Batman. Precision Locks. In SIGMOD, 1981.
[18] S. Jorwekar, A. Fekete, K. Ramamritham, and S. Sudarshan. Automating the Detection of Snapshot Isolation Anomalies. In VLDB, 2007.
[19] R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, et al. H-store: A High-performance, Distributed Main Memory Transaction Processing System. PVLDB, 1(2), 2008.
[20] A. Kemper et al. Transaction Processing in the Hybrid OLTP & OLAP Main-Memory Database System HyPer. DEBU, 36(2), 2013.
[21] A. Kemper and T. Neumann. HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots. In ICDE, 2011.
[22] H. T. Kung and J. T. Robinson. On Optimistic Methods for Concurrency Control. TODS, 6(2), 1981.
[23] P.-Å. Larson, S. Blanas, C. Diaconu, C. Freedman, J. M. Patel, et al. High-Performance Concurrency Control Mechanisms for Main-Memory Databases. PVLDB, 5(4), 2011.
[24] V. Leis, A. Kemper, and T. Neumann. Exploiting hardware transactional memory in main-memory databases. In ICDE, 2014.
[25] J. Levandoski, D. Lomet, and S. Sengupta. The Bw-Tree: A B-tree for New Hardware. In ICDE, 2013.
[26] Y. Li and J. M. Patel. BitWeaving: Fast Scans for Main Memory Data Processing. In SIGMOD, 2013.
[27] D. Lomet, A. Fekete, R. Wang, and P. Ward. Multi-Version Concurrency via Timestamp Range Conflict Management. In ICDE, 2012.
[28] C. Mohan, H. Pirahesh, and R. Lorie. Efficient and Flexible Methods for Transient Versioning of Records to Avoid Locking by Read-only Transactions. SIGMOD Record, 21(2), 1992.
[29] H. Mühe, A. Kemper, and T. Neumann. Executing Long-Running Transactions in Synchronization-Free Main Memory Database Systems. In CIDR, 2013.
[30] T. Mühlbauer, W. Rödiger, A. Reiser, A. Kemper, and T. Neumann. ScyPer: Elastic OLAP Throughput on Transactional Data. In DanaC, 2013.
[31] T. Neumann. Efficiently Compiling Efficient Query Plans for Modern Hardware. PVLDB, 4(9), 2011.
[32] I. Pandis, R. Johnson, N. Hardavellas, and A. Ailamaki. Data-oriented Transaction Execution. PVLDB, 3, 2010.
[33] H. Plattner. The Impact of Columnar In-Memory Databases on Enterprise Systems: Implications of Eliminating Transaction-Maintained Aggregates. PVLDB, 7(13), 2014.
[34] D. R. K. Ports and K. Grittner. Serializable Snapshot Isolation in PostgreSQL. PVLDB, 5(12), 2012.
[35] K. Ren, A. Thomson, and D. J. Abadi. Lightweight Locking for Main Memory Database Systems. PVLDB, 6(2), 2012.
[36] M. Sadoghi, M. Canim, B. Bhattacharjee, F. Nagel, and K. A. Ross. Reducing Database Locking Contention Through Multi-version Concurrency. PVLDB, 7(13), 2014.
[37] V. Sikka, F. Färber, W. Lehner, S. K. Cha, T. Peh, et al. Efficient Transaction Processing in SAP HANA Database: The End of a Column Store Myth. In SIGMOD, 2012.
[38] A. Thomasian. Concurrency Control: Methods, Performance, and Analysis. CSUR, 30(1), 1998.
[39] A. Thomson and D. J. Abadi. The Case for Determinism in Database Systems. PVLDB, 3, 2010.
[40] A. Thomson, T. Diamond, S.-C. Weng, K. Ren, P. Shao, and D. J. Abadi. Calvin: Fast Distributed Transactions for Partitioned Database Systems. In SIGMOD, 2012.
[41] S. Tu, W. Zheng, E. Kohler, B. Liskov, and S. Madden. Speedy Transactions in Multicore In-memory Databases. In SOSP, 2013.
[42] G. Weikum and G. Vossen. Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery. Morgan Kaufmann, 2002.
[43] T. Willhalm, N. Popovici, Y. Boshmaf, H. Plattner, A. Zeier, et al. SIMD-Scan: Ultra Fast in-Memory Table Scan using on-Chip Vector Processing Units. PVLDB, 2(1), 2009.
