Romulus: A Case Study for Understanding Concurrency & Elastic Range Locks

Going beyond one-dimensional, serial programs and exploiting a multi-core world

Published in

Towards Data Science

26 min readMay 14, 2020

The following article hopes to take the reader on a journey through concurrency, elasticity, and the trends inside modern computing through the case study of building a scheduler (named Romulus) that automatically transforms any serial data structure into a high-performant, concurrent key-value store. By bringing concepts from distributed computing and database design into local memory, a new paradigm in the analysis and extensibility of range locks will form a series of simplistic yet powerful governance principles that partition and protect data as a function of target contention.

Introduction

As the best paper from ASPLOS ’17 describes:

“Concurrent data structures are used everywhere in the software stack, from the kernel (e.g., priority queues for scheduling), to application libraries (e.g., tries for memory allocation), to applications (e.g., balanced trees for indexing). These data structures, when inefficient, can cripple performance.” — Calciu et. al

Until recently, the power of computers compounded as a result of scale up improvements in single core machines. The bounds of performance were doubled every two years, and since applications at the time were not advanced enough to require the high-performance computations of the modern day, serial programs dominated the application layer. Gradually, the classical form of Moore’s law died. In order to revive its trends, programmers started to distribute their workloads across synchronized resources and into specialized computing units (hello, the GPU). Today, computations are no longer defined by one-dimensional, serial programs. Instead, software has extended beyond the logic of local memory registers and into multiple cores where they must follow some form of consistency and governance under variable locations and times in their concurrent transactions.

Naive synchronization techniques, such as course-grained locking, provides a simplistic yet unscalable approach to governing shared memory. On the other hand, fine-grained and asynchronous techniques can be too complex for the everyday developer. Exploiting these cores is critical in forming the bedrock of any high-performant computing environment. But why not exploit the full power of commodity hardware for all applications? Hopefully, this case study can provide everyday developers with a newfound understanding to build concurrent systems and analyze performance therein from a first-principles viewpoint. And for more experienced developers, the research proposes a stronger approach to range locks that leverages basic heuristics to chase and converge towards a contention level proactively under any skewed workload. If you are still reading, strap in for a long ride.

Formal Abstract

The following abstract will form the guide for developing and characterizing Romulus:

Romulus injects a scheduler into every transaction in order to dissipate contention across a series of elastic partitions, synchronizing these bins through an adaptation of read-copy-update (RCU) and providing a steady state of execution for all workloads. The inspiration for the design of Romulus came from MapReduce: a simple framework that orchestrates the processing of various tasks in parallel, leveraging a global view of the state space to distribute and manage communications across resources. The main purpose of Romulus is to automatically translate any serial API into a concurrent data structure while approaching the performance of hand-crafted, state-of-the-art algorithms. However, its most significant contribution is a set of contention heuristics that provide systems with the ability to converge towards target contention levels. In more detail, Romulus quickly adapts to skewed workloads by employing a novel lock design that splits and merges partitions based upon real-time thread conflicts. Accordingly, this methodology can be extended into other applications wherein scarce resources must adapt online as skewed workloads transcend their systems. Additionally, Romulus brings concepts from database design into local memory by providing an algorithm for efficient range queries in ordered key-value stores. Overall, Romulus presents a scheduling algorithm that enables predictable parallelism with a plug-and-play deployment model.

There are three fundamental goals that will drive the decisions in the architecture of this scheduler:

(1) a simplistic methodology extensible to any ordered key-value store that also approaches the performance of state-of-the-art, hand-crafted solutions (2) guaranteed serializability across all operations (including range queries)
(3) predictable contention through collision heuristics
(4) strong tolerance to skewed workloads

Theoretical Principles of Bins/Balls

The fundamental problem at hand is a scheduling problem: how to connect an input to an equitable distribution of computational resources (akin to MapReduce) that target a pre-defined contention level and scales with hundreds of threads. The abstract provides a hint that Romulus will separate data into a series of partitions. Thus, before moving forward into the design of Romulus, a first-principle analysis of how many partitions are necessary to target a collision rate can (i) create clarity on equilibrium/behaviors pursued (ii) quickly reveal performance expectations. Additionally, it will provide a (very loose) formal model for proving the convergence of contention by mapping expected conflicts across Romulus to the measurements in a single partition. As future sections reveal, when threads determine that a series of these discrete measurements violate the expected threshold for a partition, they will split the partition into two halves in order to dissipate contention and adapt to skewed workloads. Thus, the following section is critical to our third and fourth goals.

The mathematical analysis is formed through a vertical integration of previous work on the classical bins in balls problem: the probabilistic outcomes for placing M balls into N bins. This model has been previously been applied in many shapes and forms for load balancing in computer science.

In the most basic model, M balls pass through a random function — yielding a uniform distribution — and are placed into one of N bins. The following section focuses on understanding three important phenomena: (i) probability of observing no collisions (ii) probability of observing a height greater than K in a single bin (iii) probability of observing a height greater than K across all N bins. This will help us understand how to scale the number of partitions in comparison to threads.

First Case: Probability of observing no collisions

If we let X represent the probability that no collisions are observed across all bins and assume M ≤ N, then trivially M balls must be placed across N unique bins. By employing a Taylor series approximation and bounding it by an negligible error (knowing that u = 1/N is small), we estimate Pr[X] like so:

For a constant Pr[X] = C, one can see that C= O(m²/n); that is, the number of bins must scale at a rate of n = O(m²) in order to maintain a constant probability of no collisions

Second Case: Collisions at One Bin

Let Pr[Xi ≥ K] represent the probability that at least K collisions at a single bin Xi. This can be represented as:

For a constant Pr[xi ≥ K] = C, if N ≥ M, then the required bins to maintain a constant expected height must scale at N = O(M). This expected height stays relatively constant for N = M no matter the absolute value of the inputs.

Third Case: K Collisions for N Bins

Let Pr[X ≥k] represent the probability that at least K collisions happen across all N bins. Under the assumption that M= O(N) and each bin is independent from others, an upper bound is formed through the union rule:

We approximate the ratio between M:N to be 1. Though this may not be the case in a future discussion where M:N = 0.1, the upper bound will still hold under this scenario and provide enough context for further numerical analysis. Applying this ration, one can find how K changes as N (and therefore M) increase together towards larger values:

Why does the expected value of K stay constant in a single bin but increase for a series of bins as N and M scale together? From a logical standpoint, as we increase the number of opportunities for variance to occur, we increase the chance of K extending far beyond its expected height in any single bucket.

Outcome

These derivations are important for three expectations in Romulus:

(i) In order to maintain a constant probability of no conflicts, one must scale the number of bins exponentially with respect to the number of balls at a rate of N = O(M²). Romulus will not follow this approach as it is impracticable for high thread counts, and therefore, the scheduler will not attempt to remove all conflicts

(ii) If one scales the number of bins linearly to the number of balls, the expected average height of balls in each bin remains constant. Though, for large values of N and M, conflicts at any discrete point in time may be higher in hot spots due to the nature of variance. Note that Romulus will follow this model, and therefore, it will attempt to maintain a target conflict rate

(iii) When building a view of the entire system from an individual bin, one can create a confidence interval about the state of the system assuming a uniform workload. More specifically, by measuring K (i.e. the height of a queue of balls) on a single bin, one can determine whether its value consistently violates a threshold and exceeds a probability. If so, the scheduler should predict that a skewed workload has transcended the system and violated the uniform distribution assumption

Undoubtedly, this is a super rough model (with a number of unspoken assumptions) but serves well in understanding general trends to come.

Numerical Analysis

Let’s translate this old view into a new view of threads performing update operations against bins

The following section confirms the theoretical trends of the bins into balls problem in a theoretical context where threads perform update operations across a series of partitions, and it reveals two obvious phenomena of locks in concurrent environments: (1) reducing lock partitions towards more fine-grained ranges (i.e. increasing the number of bins) has a diminishing return for dissipating contention (2) as an outcome, the most optimal theoretical configuration given unlimited resources is to maximize the number of locks into the smallest partition possible. In many cases, though, systems only have a small number of resources to expend. Accordingly, range locks are utilized in order to enable decent parallelism without the memory overhead of a 40-byte pthread_lock_t. Most importantly, though, since reducing lock partitions towards more fine-grained ranges has a diminishing return, Romulus can theoretically utilize an order of magnitude of less locks while still approaching hand-crafted performance in many scenarios.

In order to model the classical bins into balls problem, we form the notion of range locks; there exists a lock table of N bins — representing a series of N abstract partitions — which is independent of the actual size of the underlying storage layer. The number of threads represents the number of balls, and they operate across these partitions with instantaneous inserts/removes. In order to observe contention overtime, we model an infinite, iterative game as follows:

1) At the start of the game, threads are randomly placed into a partition/bin (i.e. lock)

2) Each partition has a queue of operations waiting to be completed. For each round in the game, a thread acts as a parallel agent in the system. All threads at the front of a queue are flushed out, recycling themselves into another partition. They choose this next partition by following a random number generation with a uniform distribution. This represents a thread acquiring ownership over a partition, performing an operation, releasing the region, and moving onto the next operation. All threads not at the front of the queue (i.e. behind others in the queue) perform no actions and instead wait until they move to the front of the queue in future rounds.

3) The game repeats itself forever akin to a real system

This model has a number of important assumptions, most notably: all operations are instantaneous and completed once per round, the workload at play has a uniform distribution, and variability in memory access, inter-thread communication, and other architecture-specific overheads are removed. Of course, real systems have resource constraints such as memory and I/O which can change the value equation. However, as future testbeds reveal, the general trends remain the same and create a bedrock for analysis. The experiments were run in Python at 10K, 100K, and 1M rounds. No matter the length of the game, the results converged to the same values

The two figures to the left plots the height of queues for every round of the game across two extreme settings, hoping to provide a qualitative illustration of conflict variation in high-contention vs low-contention environments. The first image represents a game with 64 threads against 48 partitions. The second image represents a game with 64 threads against 640 partitions.

In the first game, one can see the disastrous effects that can take place when the number of partitions is too small. In the second game, although there exist many spots of contention, the overwhelming majority of operations proceed without conflict. The following experiments will plan for the second scenario which Romulus attempts to mimic for superior parallelism/performance.

In this figure, we target an average height of 1.05 for active queues and plot the number of partitions required in order to achieve this height at varying thread rates. This is calculated by playing a number of games with a constant 100K rounds and discovering the closest number of partitions that create the proper average height. The purpose of this experiment is to understand how to preserve a target contention level across threads; from the results, and prior analysis, it is confirmed that one must scale the number of partitions as a linear function of the number of threads for M / N <1.

In this Figure, we reiterate this point and express the average height as a function of the number of threads assuming the number of partitions at 10 * threads. The optimal number of locks (the size of the underlying structure, represented as 1M) is included to exhibit the consistent difference in the two approaches

Finally, this last experiment illustrates the diminishing benefit of increasing partitions in dissipating contention. Logically, this makes sense. With one bin and 64 balls, the one bin holds 64 balls. With an increase to two bins, the balls split evenly and the average height goes to 32. With four bins, the balls split evenly again but only reduce volume from 32 to 16 bins. Accordingly, the marginal volume from separating continues to decrease over time.

These graphs reveal important expectations in the first-principles of contention planning in Romulus. First and foremost, Romulus will theoretically never beat a hand-crafted counterpart in classical read/write workloads that manifests two characteristics (1) wait-free traversal (2) one lock for each element (3) search cost equivalent to the translation layer. This is because Romulus does not attempt to remove all contention in its system; it limits parallelism by some degree in taking ownership over a range of elements, spreading course-grained locks across smaller partitions. As a consequence, it will suffer from more contention which results in an order of magnitude degradation in performance for conflicting threads. Practically, though, the number of ranges required to reach a reasonable level of parallelism will be an order of magnitude smaller than the number of elements (assuming 10 * threads << map size)

Back to Designing Romulus: The Methodology

Romulus will represent a scheduler that splits an abstract data structure into a series of elastic partitions. The bins are not bound to any size but rather adapt on the fly to unequal forces from application threads. For ordered key-value stores, these partitions represent a range of keys. Assume hereafter that the programmer always provides an API for ordered key-value stores.

There exist three important layers required to shape Romulus: an adaptable translation layer, a data layer, and a synchronization layer. In the translation layer, an algorithm must produce the proper partition for a given input key. Though, it should not represent a static set of ranges akin to a Hash Map since one must modify partitions reflexively to flatten skewed inputs. In the data layer, the programmer supplies an abstract API for add, contains, and remove to store and access an element. The synchronization layer ensures strict serializability of operations and often integrates itself across the translation layer and the data layer; this is important for creating isolation and other serializability guarantees in each partition. For the methodology depicted in the figure, we leverage read-copy-update (RCU) and create two copies of data. The writers’ view (active section of data) is a state space for writers to perform update operations in isolation and requires a lock for mutual exclusion. The readers’ view (the replica section of data) enables readers to continue wait-free in traversing the system. Accordingly, contention in Romulus only comes from writer threads updating partitions. Full circle, this is why we did not include readers into the numerical analysis above.

Romulus’s methodology produces many variants for its specific implementation; that is, it can be seen as an abstract and modular approach to concurrency (but is most suited to ordered key-value stores). Therefore, the algorithms hereafter are focused on optimizing performance in the programmer’s view though many other adaptations still exist and provide correct semantics. Before defining a specific implementation of Romulus, the scheduling algorithm must first apply the principles of prior contention analysis to decided how to initially partition ranges and solve for a threshold that enables Romulus to adapt on the fly to skewed inputs.

Applying Contention Heuristics

In Romulus, a skewed input distribution can be characterized as an unequal weighting of input operations that lead to contention. This means that a skew is formed through two phenomena: (i) operations with different weights (ii) non-uniform access distribution. In splitting ranges based upon a violation of contention heuristics, the Romulus methodology is agnostic to the cause of askew and simply acts as a reflexive force

In order to target an average height of 1.05 in lock queues (theoretically), the system splits into 10 * t partitions where t is the number of threads in the application (this proved to perform well practically, though the constant will shift depending on the specifics of the machine). During execution, Romulus follows two simple heuristic to split/merge ranges:

(1) if application threads measure contention above a threshold K, then they split a range into two new partitions before applying their update operation. K is determined by applying the 95 percent confidence interval to previous theoretical analysis and a new assumption that N = 10 * M, where N = the number of partitions and M = the number of threads:

The measurement of the number of threads waiting in a queue has high variance, especially as the number of threads and partition increases in the system as seen beforehand. Accordingly, Romulus reduce the number of false positives by forcing threads to vote for a split and after that vote is approved, the split proceeds

(2) After acquiring an access distribution for a series of ranges, a background thread will merge two partitions when their neighborhood has been accessed 2x below the mean access of each partition. Although merging is important, it can be viewed as a way to conserve resources rather than improve throughput for any operation (future analysis shows it will not affect range query performance)

Romulus Implementation

This depicts a snapshot of the standard implementation of Romulus while a skewed workload transcends the system for a continuum of time. The goal of Romulus is not to create a Hash Map where one lock exists per element, but rather to conserve resources and perform well in the face of range queries.

The translation layer is implemented as a custom Hash Tree. The level of the tree for which a thread will hash into is the bottom most level with a full set of interior partitions.

The data layer is an ordered key-value store; a requirement in order to efficiently support range queries.

The synchronization layer leverages a custom adaptation of RCU, specifically in the commit protocol, to optimize for readers in both range queries and single element operations. The synchronization layer bleeds into the translation layer, where an atomic counter is stored in order to signal to writer threads the number of range queries executing a range of partitions. In the abstract requirements of Romulus, an upper tree in the translation layer — built upwards from the set of partitions — is not necessary. However, in order to synchronize range queries across partitions, this model extends the requirements of Romulus and forms an upper layer called CRRQS (Cross Region Range Queries). The following sections zoom into the important aspects for each of these layers, starting from the requirements of workloads and the implications deeper into the stack.

Bulk Operations (Range Queries)

In order to provide serializable range queries, Romulus brings in concepts from databases by integrating multi-granularity synchronization. There are two extreme approaches to optimize for: lock-based and lock-free range queries. For large range queries across many individual partitions, a lock-based implementation would cause significant performance degradation while enabling the potential for fairness across various types of operations. For example, the system could implement a reader-writer lock in the active partition, blocking updates from happening until completing their operation. The overhead in acquiring a series of these locks is too high for a practical application. As a result, at the beginning of the experiment, Romulus builds an upper tree layer on top of the partitions in which represents the granularity of ranges to synchronize. When a range query wants to operate across a range, it searches from the root of the upper tree layer downwards until finding the smallest range that fits its bounds. Thereafter, the range query atomically increments a counter in the node and proceeds to perform its operation. This signals to all other agents in the system that a range query is performing, synchronizing these operations in a lock-free mechanism with constant cost by centralizing necessary state into a single location. The implication to writer threads in committing their operations is discussed below. Nonetheless, two additional aspects to range query threads is that they must read a global epoch indicating the timestamp at the start of their operation, and they must use this timestamp to indicate whether or not to update their result with the most-recent-operation on a partition. This requirement, though, has constant cost and does not significantly impact larger queries.

RCU

The utilization of read-copy-update (RCU) as the synchronization mechanism in Romulus has critical second-order effects. First, it enables read operations to access partitions lock-free. Second, it requires additional overhead across both storage space and operations. A partition in the translation layer of Romulus is represented through the following metadata (excluding pointers to the left and right children in the tree):

As a result, the scheduler copies data into two structures called replica and active. The replica pointer provides entry for reader threads into an underlying storage layer, while the active pointer provides entry for writer threads into another. In a classical RCU algorithm, one would copy the memory at hand into a thread-local address and perform operations in isolation before committing. In this implementation, the data is already provided in order to avoid costs in both storage and compute from copying a large range of memory on every update. Instead, in order to preserve serializability of operations, a writer thread follows a different sequence of actions. First, the writer thread gains ownership to the partition through a custom mutex. This mutex contains an “invalid” state which indicates that a merge/split operation occurred previously as well as the number of writers contending for the lock. Assuming nothing has been invalidated, and no merge/split needs to occur, the writer thread performs its update operation on the active data structure. Thereafter, if and only if it is valid to commit the operation, the writer thread performs an atomic compare-and-swap (CAS) to replace the replica pointer with the fresh state. This represents the linearization point of the update operation. Since scheduler knows the outcome of the update operation on the active structure, if and only if the operation returned true (i.e. successfully updated an element), the thread will then wait for all reader threads to leave the old replica and then update its stale structure to the new state in performing the same operation. Upon completion, the thread completes the replication of the two partitions into a newer version and releases the lock. One of the critical points in the RCU operation is determining whether or not it is valid to commit the operation. In a classical RCU algorithm, the writer thread does not need to wait for other agents in the system to commit. Instead, the commit operation proceeds in a non-blocking fashion. When range queries are enabled Romulus, the writer thread must ensure it is not performing an update operation while it believes a range query may operate on its partition. In order to determine whether a range query does depend upon the partition, the writer traverses upwards through the translation layer and checks that each parent has no range queries performing like so:

If it proceeds through this dependency path and no range queries seem to be happening, the writer thread will believe that it is valid to commit the operation. An issue arises when the updating thread traverses past an element in the path from leaf-to-root, and a range query starts thereafter in that lower level rendering stale state in the writer’s view of the path:

There is a trivial solution to the problem at hand. Before committing the update, the range query will record the operation into the MRO of the partition with its corresponding global timestamp. After a range query records its read operations for a partition, it will check the timestamp. If it is seen that the timestamp of the node violates that of the range query, then it will undo the most recent operation in its local record. Therefore, a range query will either not experience a conflicting update, or it will be able to resolve it without much overhead. It is important to note that under the same conditions as above if another writer thread desires to update that partition, it will not commit another operation (which would override the MRO) until the range query finishes as it will certainly see the range query’s signal in the updated version of the path. Therefore, one can say that the writer thread will always delay itself until the MRO is no longer needed, and it can replace the MRO without further implications. Undoubtedly, the implementation above is optimized for read-heavy experiments where range queries execute across a series of partitions. Writes not only take longer to commit their operation by traversing a path from leaf-to-root, but they also can be starved because they block on range queries.

Hash Tree

The Hash Tree represents the translation layer in Romulus: given an input key, it provides the partition wherein that key is stored. For a read/write operation, the scheduler first hashes the key in order to provide a thread with an initial starting level to traverse from. The location of the hash layer does not affect correctness, but rather performance and hopes to reduce the translation layer search cost by starting at a lower node. From here, the thread must search downwards until it reaches a leaf node corresponding to a partition that holds its key. The leaves represent the data layer and contain the correct pointers to structures; they are also connected in order to enable range queries to continue linearly across partitions. Read operations enter into the replica lock-free, while writers follow the custom RCU algorithm above. Before acquiring the lock, though, writers increment a variable that represents the number of writer conflicts in the partition.

Step 1: Read the size of the waiting queue

Then, writers record this measurement for the number of writer conflicts in the partition. If the measurement is higher than the threshold pre-computed (as seen in the previous analysis), then they vote to split the partition. If enough votes are acquired, once any writer thread acquires ownership over the lock, they will perform a split operation

Step 2: Execute a split operation through the fork(K) API, where K represents the value to split against inside the range.

After the split operation completes inside the active data structure, it is propagated into the hash tree by appending two new leaves to the current partition. The two new active partitions will actually be placed into the leaves’ replica pointer as this represents a commit in the RCU primitive beforehand. From here, since all new readers have been redirected to the updated structure, the writer thread will wait for all older readers to finish and then perform the same operation on the stale replica structure. Finally, the thread indicates on the custom mutex that it is no longer valid and all contending threads restart their operations, trickling down into the updated state space and hopefully dispersing evenly across the two new regions.

Reflexive, Recursive Nature of Translation Layer

This process can be repeated as many times as necessary to dissipate contention below the threshold.

Merge/Split

In recent years, much work has been developed in designing merge/split operations across common data structures. SkipLists, Red Black Trees, Linked Lists, and many other structures can perform a split operation with the same time complexity as their search operations. As a result, when a programmer provides an implementation of fork(K) with this assumption, the split operation in Romulus doubles the amount of time algorithmically for one of the threads to finish their update operation (barring the presence of range queries). The implementation provided above for merging/splitting leverages application threads to identify contention themselves and perform operations within their critical path through a synchronous approach. This creates risk in the algorithm to misclassify hotspots and requires an accurate threshold model to measure against. One might argue that offloading this work to a background thread might make sense; however, further analysis indicates otherwise. An asynchronous, lazy reaction to contention through a helper thread might help the system reach a steady state with full accuracy, but the time and complexity of identifying hot-spots for large data structures with a single helper thread would fail to immediately resolve skewed workloads. A tale of a needle in a hay stack. As a result, the parallelism of these threads would be significantly reduced due to continuous conflicts. Accordingly, in most cases, the time to resolve would likely cost more than the application threads them-selves handling a split. Additionally, a more important point, is that if the skew continues to shift, then the asynchronous thread will likely acquire a misrepresentation of conflicts and potentially never resolve dynamic skews. This extends into the more general problem of an asynchronous estimation in that the approximation of contention becomes much more difficult to model for dynamic systems. This negligence towards skewed workloads would create a disaster for systems that require resilience across different scenarios, and therefore it would inhibit Romulus’s mainstream adoption.

Other Notes

The translation layer’s shape is reflexive to skewed distributions passing through Romulus as seen in prior figures and future analysis; that is, for hot spots wherein conflicts happen often, Romulus will deepen the translation layer in order to dissipate contention across more ranges. An interesting question to consider is whether Romulus’s translation layer converges to a steady state given a skewed distribution. The very loose formal analysis and experimental results prove that Romulus does converge to a steady state given a skewed distribution, and it does so quickly by proactively adapting to contention through application threads themselves in conflict. Thus, Romulus will achieve sequential behavior in the translation layer until a new skew transcends the system and forces an update of partitions.

One aspect of online systems that often plagues performance is the act of memory reclamation: freeing memory safely for reuse. Romulus already has the overhead of memory reclamation baked into its partition isolation; that is, one does not need to add any new metadata to Romulus in order to protect the freeing of memory. Additionally, memory can be reclaimed synchronously rather than asynchronously (a common approach which delays reclamation until other threads’ cannot proceed on a path towards a retired element)

The Results

As a scheduling algorithm, the general trends of the performance in Romulus should be known beforehand:
(1) without other bottlenecks, such as memory, Romulus will scale well to higher thread counts from a single-threaded baseline
(2) skewed inputs will be resolved quickly and dissipate contention to prior levels

These are confirmed in the graphs and analysis below. On the other hand, the relative performance in to competitors and corner cases therein is hard to quantify; one must follow a discovery process. To get an accurate sense of Romulus’s performance, one should deploy a serial Skip List and compare against two hand-crafted, state-of-the-art solutions concurrent Skip List solutions: NUMASK and Herihly. These represent two of the fastest lock-free and lock-based structures respectively (at least, for now). The testbed used in the following experiment consists of a server with 4 Intel Xeon Platinum 8160 processors, providing a total of 192 threads. There are 4 sockets hosting the 4 processors via 4 NUMA zones (one-to-one) and 768 GB of memory. All numbers are an average of ten trials. In the experiments, application threads are spread evenly across NUMA zones. Note: competitors do not provide a serializable range query solution, and therefore we benchmark against their upper-bound as they unsafely traverse the structure and present stale results in their bulk operations. Nonetheless, with 10% range queries of 500 prospective elements, 70% reads, and 20% update operations on a data structure of size 2 million, one observes the following performance:

The structure scales well with respect to the upper bounds in the performance of hand-crafted solutions until 60 threads. Nonetheless, Romulus does incredibly well for a plug-and-play accelerator. One area it would outperform Node Replication (voted best paper of ASPLOS ’17, the source of the quote from the beginning) is in their ordered key-value stores; though, it would be inferior in stacks/queues/other structures where operations solely pop off the front and back of structures — where they found their sweet spot.

NR does not scale well in Skip List Key-Value stores due to the single-writer requirement in their replication approach https://cs.brown.edu/people/irina/papers/asplos2017-final.pdf

Benchmarking Linked Lists is an interesting but not valid experiment given that the reduction in search cost from the translation layer’s superior algorithmic performance induces super linear speedup.

Now that we know we approach the performance of hand-crafted approaches in Skip Lists, the rest of Romulus’s claims must be investigated for the data structure at hand; specifically, (i) how well does Romulus perform under skewed distributions? (ii) how does varying buckets effect update/RQ performance?

The two most important measurements in determining the success of a merge/split algorithm is how fast it adapts to contention (i.e. the time to resolution) and the overall effect on performance in comparison to a uniform distribution. Romulus performs well in both respects. The following plot takes a snapshot of the throughput in every millisecond of an experiment in order to create a view of throughput against time. The application threads introduce a skew distribution at ~5 seconds in order to observe how (1) Romulus adapts to the workload with fork/merge operations employed (2) the performance gain from a naive version of Romulus where fork/merge is not employed

This confirms that even though Romulus sacrifices short-term performance for a few milliseconds, the long-term performance of the system returns close to prior levels quickly for the skew at hand.

The reflexive nature of the system for the same workload can be depicted by viewing three access distributions in the leaf nodes of Romulus at the end of the experiment. The following graph plots (1) an input skew (i.e. the access distribution under no fork/merge) (2) the experimental outcome (i.e. access distribution with fork/merge enabled) with 48 threads and 100% update operations of size 2 million, and (3) the targeted, theoretical outcome of that experiment given the set of heuristics Romulus is deployed through (K = 3). It does this by looking at the access percentage (observed accesses in single partition / total accesses across partition) for each partition

Romulus falls below the theoretical target, and has more overhead in the number of partitions than expected, most likely due to a relaxed contention heuristic that is pessimistic (K = 3). A well-defined threshold with strong guarantees is critical for Romulus to converge towards a targeted final equilibrium state of partitions. Too high of a threshold, and the heuristic fails to adapt properly to skewed workloads. Too low of a threshold and the system expends resources beyond an initial target and misclassifies the input distribution.

Finally, an interesting observation on the clashing forces of range queries and updates in desiring to merge and split ranges respectively. The following plot reveals how throughput changes when increasing buckets for a pure workload of range queries and then updates

Range Query v Update Throughput Growth against Buckets. (i) 64 Threads (ii) 1 Million Size (iii) Range Query of Full Data Structure (iv) Logarithmic axis (v) Normalized means dividing each number by the max obtained value to bound all points within 0 and 1

Overall, Romulus can scale well with hand-crafted, state-of-the art solutions and performs reasonably well under skewed distributions due to the reflexive nature of the merge/split algorithms and probabilistic approach to contention. I hope that more of the ideas found inside this piece of work will be formalized and make it into modern day systems looking for a simple approach to concurrency.