Harmonia: Near-Linear Scalability for Replicated Storage
with In-Network Conflict Detection
Hang Zhu1, Zhihao Bai1, Jialin Li2, Ellis Michael2, Dan Ports3, Ion Stoica4, Xin Jin1
1 Johns Hopkins University, 2 University of Washington, 3 Microsoft Research, 4 UC Berkeley
arXiv:1904.08964v1 [cs.DC] 18 Apr 2019
ABSTRACT
Distributed storage employs replication to mask failures
and improve availability. However, these systems typically
exhibit a hard tradeoff between consistency and performance. Ensuring consistency introduces coordination overhead, and as a result the system throughput does not scale
with the number of replicas. We present Harmonia, a replicated storage architecture that exploits the capability of new-generation programmable switches to obviate this tradeoff
by providing near-linear scalability without sacrificing consistency. To achieve this goal, Harmonia detects read-write
conflicts in the network, which enables any replica to serve
reads for objects with no pending writes. Harmonia implements this functionality at line rate, thus imposing no performance overhead. We have implemented a prototype of
Harmonia on a cluster of commodity servers connected by
a Barefoot Tofino switch, and have integrated it with Redis. We demonstrate the generality of our approach by supporting a variety of replication protocols, including primary-backup, chain replication, Viewstamped Replication, and
NOPaxos. Experimental results show that Harmonia improves the throughput of these protocols by up to 10× for
a replication factor of 10, providing near-linear scalability
up to the limit of our testbed.
1. Introduction
Replication is one of the fundamental tools in the modern
distributed storage developer’s arsenal. Failures are a regular
appearance in large-scale distributed systems, and strongly
consistent replication can transparently mask these faults to
achieve high system availability. However, it comes with a
high performance cost.
One might hope that adding more servers to a replicated
system would increase not just its reliability but also its system performance—ideally, providing linear scalability with
the number of replicas. The reality is quite the opposite:
performance decreases with more replicas, as an expensive
replication protocol needs to be run to ensure that all replicas are consistent. Despite much effort to reduce the cost of
these protocols, the best case is a system that approaches the
performance of a single node [35, 56].
Can we build a strongly consistent replicated system that
approaches linear scalability? Write operations inherently
need to be applied to all replicas, so more replicas cannot
increase the write throughput. However, many real-world
workloads are highly skewed towards reads [24, 43]—with
read:write ratios as high as 30:1 [7]. A scalable but naive approach is to allow any individual replica to serve a read, permitting the system to achieve near-linear scalability for such
workloads. Yet this runs afoul of a fundamental limitation.
Individual replicas may lag behind, or run ahead of, the consensus state of the group. Thus, serving reads from any storage replica has the potential to return stale or even uncommitted data, compromising the consistency guarantee of the
replicated system. Protocol-level solutions like CRAQ [55]
require extra coordination among servers, and thus inevitably impose additional performance overheads.
We show that it is possible to circumvent this limitation
and simultaneously achieve near-linear scalability and consistency for replicated storage. We do so with Harmonia, a
new replicated storage architecture that exploits the capability of new-generation programmable switches. Our key observation is that while individual replicas may constantly diverge from the consensus state, the set of inconsistent data at
any given time is small. A storage system may store millions
or billions of objects or files. Of these, only the ones that
have writes in progress—i.e., the set of objects actively being
updated—may be inconsistent. For the others, any replica
can safely serve a read. Two features of many storage systems make this an especially powerful observation: (i) read-intensive workloads in real-world deployments [7, 43] mean
that fewer objects are written over time, and (ii) emerging
in-memory storage systems [2, 3, 46, 57] complete writes
faster, reducing the interval of inconsistency.
The challenge in leveraging this insight lies in efficiently
detecting which objects are dirty, i.e., have pending updates.
Implementing this functionality in a server would make that
server the system bottleneck, preventing throughput from scaling out
with the number of storage replicas. Harmonia demonstrates
that this idea can be realized on-path in the network with
programmable switches at line rate, with no performance
penalties. The core component of Harmonia is a read-write
conflict detection module in the switch data plane that monitors all requests to the system and tracks the dirty set. The
switch detects whether a read conflicts with pending writes,
and if not, the switch sends it directly to one of the replicas.
Other requests are executed according to the normal protocol. This design exploits two principal benefits of in-network
processing: (i) the central location of the switch on the data
path that allows it to monitor traffic to the cluster, and (ii)
its capability for line-rate, near-zero overhead processing.
Harmonia can be viewed as a new take on network anycast
in the context of replicated storage. Different from recent
work that directly embeds application data and logic into
programmable switches [20, 21, 29, 30], Harmonia still uses
switches for forwarding, but in an application-aware manner
by tracking application metadata (contended object IDs, not
values) with much less switch memory.
Harmonia is a general approach. It augments existing
replication protocols with the ability to serve reads from any
replica, and does not sacrifice fault tolerance or strong consistency (i.e., linearizability). As a result, it can be applied
to both major classes of replication protocols: primary-backup and quorum-based. We have applied Harmonia
to a range of representative protocols, including primary-backup [14], chain replication [56], Viewstamped Replication [38, 44], and NOPaxos [35].
In summary, this paper demonstrates that:
• The Harmonia architecture can achieve near-linear scalability with near-zero overhead by moving conflict detection to an in-network component. (§4, §5)
• The Harmonia conflict detection mechanism can be implemented in the data plane of new-generation programmable
switches and run at line rate. (§6)
• Many replication protocols can be integrated with Harmonia while maintaining linearizability. (§7)
We implement a Harmonia prototype using a cluster of
servers connected by a Barefoot Tofino switch and integrate it with Redis. Our experiments show that Harmonia
improves the throughput of the replication protocols by up
to 10× for a replication factor of 10, providing near-linear
scalability up to the limit of our testbed. We provide a proof
of correctness and a model-checked TLA+ specification as
appendices. Of course, the highest possible write throughput
is that of one replica, since writes have to be processed by
all replicas. This can be achieved by chain replication [56]
and NOPaxos [35]. Harmonia fills in the missing piece for
reads: it demonstrates how to make reads scalable without
sacrificing either write performance or consistency.
2. Background
An ideal replicated system provides single-system linearizability [26]—it appears as though operations are being
executed, one at a time, on a single replica, in a way that corresponds to the timing of operation invocations and results.
Many replication protocols can be used to ensure this property. They fall primarily into two classes—primary-backup
protocols and quorum-based protocols.
Primary-backup. The primary-backup protocol [14] organizes a system into a primary replica, which is responsible
for determining the order and results of operations, and a set
of backup replicas that execute operations as directed by the
primary. This is typically achieved by having the primary
transfer state updates to the replicas after operation execution. At any time, only one primary is in operation. Should
it fail, one of the backup replicas is promoted to be the new
primary—a task often mediated by an external configuration
service [15, 28] or manual intervention. The primary-backup
protocol is able to tolerate f node failures with f +1 nodes.
The primary-backup protocol has many variants. Chain
replication [56] is a high-throughput variant, and is used in
many storage systems [4, 23, 48]. It organizes the replicas
into a chain. Writes are sent to the head and propagated to
the tail; reads are directly processed by the tail. The system
throughput is bounded by a single server—the tail.
Quorum-based protocols. Quorum-based protocols such as
Paxos [31, 32] operate by ensuring that each operation is executed at a quorum—typically a majority—of replicas before it is considered successful. While they seem quite different from primary-backup protocols, the conceptual gap
is not as wide as it appears in practice. Many Paxos deployments use the Multi-Paxos optimization [32] (or, equivalently, Viewstamped Replication [44] and Raft [45]). One
of the replicas runs the first phase of the protocol to elect itself as a stable leader until it fails. It can then run the second
phase repeatedly to execute multiple operations and commit to other replicas, which is very similar to the primary-backup protocol. System throughput is largely determined
by the number of messages that need to be processed by
the bottleneck node, i.e., the leader. A common optimization
allows the leader to execute reads without coordinating with
the others, by giving the leader a lease. Ultimately, however,
the system throughput is limited to that of one server.
3. Towards Linear Scalability
The replication protocols above can achieve, at best, the
throughput of a single server. With judicious optimization,
they can allow reads to be processed by one designated
replica: the tail in chain replication or the leader in Multi-Paxos. That single replica then becomes the bottleneck.
Read scalability, i.e., making system throughput scale with
the number of replicas, requires going further.
Could we achieve read scalability by allowing reads to
be processed by any replica, not just a single designated
one, without coordination? Unfortunately, naively doing so
could violate consistency. Updates cannot be applied instantly across all the replicas, so at any given moment some
of the replicas may not be consistent. We categorize the resulting anomalies into two kinds.
Read-ahead anomalies. A read-ahead anomaly occurs
when a replica returns a result that has not yet been committed. This might reflect a future state of the system, or show
updates that will never complete. Neither is correct.
Consider the case of chain replication, where each replica
would answer a read with its latest known state. Specifically,
suppose there are three nodes, and the latest update to an
object has been propagated to nodes 1 and 2. A read on this
object sent to either of these nodes would return the new
value. While this may not necessarily seem problematic, it
is a violation of linearizability. A subsequent request to node
3 could still return the old value—causing a client to see
an update appearing and disappearing depending on which
replica it contacts.
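The chain replication scenario above can be reproduced with a minimal Python sketch. The `Replica` class and its `latest` field are hypothetical names chosen for illustration; each replica naively answers a read with its latest known state.

```python
# Sketch of a read-ahead anomaly; all names here are illustrative assumptions.

class Replica:
    def __init__(self, value):
        self.latest = value  # latest known (possibly uncommitted) state

    def read(self):
        # Naive policy: answer with the latest known state.
        return self.latest


# Three-node chain, all initially holding the old value.
chain = [Replica("old") for _ in range(3)]

# The update has propagated to nodes 1 and 2 but not yet to node 3.
chain[0].latest = "new"
chain[1].latest = "new"

first_read = chain[1].read()   # the client contacts node 2
second_read = chain[2].read()  # the same client then contacts node 3

# The client observes the update and then sees it disappear.
print(first_read, second_read)
```

The second read returning the old value after the first returned the new one is exactly the ordering that linearizability forbids.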
Read-behind anomalies. One might hope that these anomalies could be avoided by requiring replicas to return the latest known committed value. Unfortunately, this introduces a
second class of anomalies, where some replicas may return
a stale result that does not reflect the latest updates.
Consider a Multi-Paxos deployment, in which replicas
only execute an operation once they have been notified by
the leader that consensus has been reached for that operation. Suppose that a client submits a write to an object, and
consider the instant just after the leader receives the last response in a quorum. It can then execute the operation and
respond to the client. However, the other replicas do not
know that the operation is successful. If the client then executes a read to one of the other replicas, and it responds—
unilaterally—with its latest committed value, the client will
not see the effect of its latest write.
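The Multi-Paxos scenario can be sketched in the same way. The `PaxosReplica` class below is a hypothetical simplification (no ballots, no real messaging); it only models the gap between a write being accepted by a quorum and followers learning that it committed.

```python
# Sketch of a read-behind anomaly; names are illustrative assumptions.

class PaxosReplica:
    def __init__(self):
        self.log = []          # accepted, not necessarily committed, writes
        self.committed = "v0"  # latest value this replica knows is committed

    def accept(self, value):
        self.log.append(value)

    def commit(self, value):
        self.committed = value

    def read_committed(self):
        # Policy from the text: return the latest known committed value.
        return self.committed


leader, follower_a, follower_b = PaxosReplica(), PaxosReplica(), PaxosReplica()

# The write is accepted by a quorum (leader and follower_a). The leader
# learns this, executes the write, and replies to the client.
for replica in (leader, follower_a):
    replica.accept("v1")
leader.commit("v1")

# The followers have not yet been told that "v1" committed.
fresh = leader.read_committed()
stale = follower_a.read_committed()
print(fresh, stale)
```

Here the client has already received its write acknowledgment, yet a read served unilaterally by `follower_a` still returns the previous value.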
Protocols. We classify replication protocols based on the
anomalies. We refer to protocols that exhibit each type of
anomaly as read-ahead protocols and read-behind protocols, respectively. Of the protocols we discuss in this paper, primary-backup and chain replication are read-ahead
protocols, and Viewstamped Replication/Multi-Paxos and
NOPaxos are read-behind protocols. Note that although the
primary-backup systems are read-ahead and the quorum systems are read-behind, this is not necessarily true in general;
read-ahead quorum protocols are also possible, for example.
3.1 Harmonia Approach
How, then, can we safely and efficiently achieve read scalability, without sacrificing linearizability? The key is to view
the system at the individual object level. The majority of objects are quiescent, i.e., have no modifications in progress.
These objects will be consistent across at least a majority of
replicas. In that case, any replica can unilaterally answer a
read for the object. While modifications to an object are in
progress, reads on the object must follow the full protocol.
Conceptually, Harmonia achieves read scalability by introducing a new component to the system, a request scheduler. The request scheduler monitors requests to the system
to detect conflicts between read and write operations. Abstractly, it maintains a table of objects in the system and
tracks whether they are contended or uncontended, i.e., the
dirty set. When there is no conflict, it directs reads to any
replica. The request is flagged so that the replica can respond
directly. When conflicts are detected, i.e., a concurrent write
is in progress, reads follow the normal protocol.
To allow the request scheduler to detect conflicts, it needs
to be able to interpose on all read and write traffic to the
system. This means that the request scheduler must be able
to achieve very high throughput; implementing the conflict
detection in a server would still leave the entire system
bottlenecked by that server. Instead, we implement the request scheduler in the network itself, leveraging the capability of programmable switches to run at line rate, imposing
no performance penalties.
[Figure 1: Harmonia architecture. Components shown: clients; a replicated storage rack containing the ToR switch data plane (L2/L3 routing and read-write conflict detection) and the storage servers.]
Conflict detection has been used before to achieve
read scalability for certain replicated systems. Specifically,
CRAQ [55] provides read scalability for chain replication
by tracking contended and uncontended objects at the protocol level. This requires introducing an extra phase to notify replicas when an object is clean vs. dirty. Harmonia’s
in-switch conflict detection architecture provides two main
benefits. First, it generalizes the approach to support many
different protocols—as examples, we have used Harmonia with primary-backup, chain replication, Viewstamped
Replication, and NOPaxos. Supporting the diverse range
of replication protocols in use today is important because they occupy different points in the design space:
latency-optimized vs. throughput-optimized, read-optimized
vs. write-optimized, storage overhead vs. performance under
failure, etc. CRAQ is specific to chain replication, and it is
not clear that it is possible to extend its protocol-level approach to other protocols. Second, Harmonia’s in-switch implementation avoids imposing additional overhead to track
the dirty set. As we show in Section 9.5, CRAQ is able to
achieve read scalability only at the expense of a decrease in
write throughput. Harmonia has no such cost.
3.2 Challenges
Translating the basic model of the request scheduler above
to a working implementation presents several challenges:
1. How can we expose system state to the request scheduler so that it can implement conflict detection?
2. How do we ensure the switch’s view of which objects
are contended matches the system’s reality, even as
messages are dropped or reordered by the network? Errors here may cause clients to see invalid results.
3. How do we implement this functionality fully within
a switch data plane, whose computational and storage
capacity is drastically limited?
4. What modifications are needed to replication protocols
to ensure they provide linearizability when integrated
with Harmonia?
4. Harmonia Architecture
Harmonia is a new replicated storage architecture that
achieves near-linear scalability without sacrificing consistency using in-network conflict detection. This is implemented using an in-switch request scheduler, which is located on the path between the clients and server nodes. In
many enterprise and cloud scenarios where storage servers
are located in a dedicated storage rack, this can be conveniently achieved by using the top-of-rack (ToR) switch, as
shown in Figure 1. We discuss other, more scalable deployment options in §6.3.
Switch. The switch implements the Harmonia request
scheduler. It is responsible for detecting read-write conflicts.
It behaves as a standard L2/L3 switch, but provides additional conflict detection functionality for packets with a reserved L4 port. This makes Harmonia fully compatible with
normal network functionality.
The read-write conflict detection module identifies
whether a read has a conflict with a pending write. It does
this by maintaining a sequence number, a dirty set and the
last-committed point (§5). We show how to handle requests
with this module while guaranteeing consistency (§5), and
how to use the register arrays to design a hash table supporting the necessary operations at line rate (§6).
While the Harmonia switch can be rebooted or replaced
and is not a single point of failure of the storage system,
there is only a single active Harmonia switch for conflict
detection at any time. The replication protocol enforces this
invariant by periodically agreeing on which switch to use for
each time slice (§5.3).
Storage servers. The storage servers store objects and serve
requests, using a replication protocol for consistency and
fault tolerance. Harmonia requires minimal modifications to
the replication protocol (§7). It incorporates a shim layer in
each server to translate custom Harmonia request packets to
API calls to the storage system.
Clients. Harmonia provides a client library for applications
to access the storage system, which provides an interface similar to that of existing storage systems, e.g., GET and SET for Redis [3] in our prototype. The library translates between API
calls and Harmonia packets. It exposes two important fields
in the packet header to the Harmonia switch: the operation type
(read or write), and the affected object ID.
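As a rough illustration of what the client library might put on the wire, the sketch below packs the two exposed fields into a fixed-size header. The layout (one byte of operation type followed by a 16-byte zero-padded object ID), the opcode values, and the function names are all assumptions made for this example; the text does not specify the actual packet format.

```python
import struct

# Hypothetical header layout (an assumption, not the paper's wire format):
#   1 byte   operation type
#   16 bytes object ID, zero-padded on the right
HEADER = struct.Struct("!B16s")

OP_READ, OP_WRITE = 0, 1  # illustrative opcode values


def encode_request(op, obj_id):
    # struct pads short IDs with NUL bytes and silently truncates IDs
    # longer than 16 bytes, so real code would validate the length first.
    return HEADER.pack(op, obj_id.encode())


def decode_request(data):
    op, raw_id = HEADER.unpack(data[:HEADER.size])
    return op, raw_id.rstrip(b"\x00").decode()


packet = encode_request(OP_READ, "user:42")
print(len(packet), decode_request(packet))
```

A fixed-width object ID is used here because a switch parser generally needs header fields at known offsets; variable-length keys would typically be hashed to a fixed size first.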
5. In-Network Conflict Detection
Key idea. Harmonia employs a switch as a conflict detector, which tracks the dirty set, i.e., the set of contended objects. While the available switch memory is limited, the set
of objects with outstanding writes is small compared to the
overall storage size of the system, making this set readily
implementable on the switch.
Algorithm 1 ProcessRequestSwitch(pkt)
– seq: sequence number at switch
– dirty_set: map containing largest sequence number for each object with pending writes
– last_committed: largest known committed sequence number
1: if pkt.op == WRITE then
2:     seq ← seq + 1
3:     pkt.seq ← seq
4:     dirty_set.put(pkt.obj_id, seq)
5: else if pkt.op == WRITE-COMPLETION then
6:     if pkt.seq ≥ dirty_set.get(pkt.obj_id) then
7:         dirty_set.delete(pkt.obj_id)
8:     last_committed ← max(last_committed, pkt.seq)
9: else if pkt.op == READ then
10:     if ¬dirty_set.contains(pkt.obj_id) then
11:         pkt.last_committed ← last_committed
12:         pkt.dst ← random replica
13: Update packet header and forward
To implement conflict detection, a Harmonia switch tracks
three pieces of state: (i) a monotonically increasing sequence number,¹ which is incremented and inserted into
each write, (ii) a dirty set, which additionally tracks the
largest sequence number of the outstanding writes to each
object, and (iii) the last-committed point, which tracks the
sequence number of the latest write committed by the system
known to the switch.
The dirty set allows the switch to detect when a read contends with ongoing writes. When it does not, Harmonia can
send the read to a single random replica for better performance. Otherwise, the read is passed unmodified to the
underlying replication protocol. The sequence number disambiguates concurrent writes to the same object, permitting
the switch to store only one entry per contended object in the
dirty set. The last-committed sequence number is used to ensure linearizability in the face of reordered messages, as will
be described in §5.2.
We now describe in detail the interface and how it is used.
We use the primary-backup protocol as an example in this
section, and describe adapting other protocols in §7.
5.1 Basic Request Processing
The Harmonia in-switch request scheduler processes
three types of operations: READ, WRITE, and WRITE-COMPLETION. For each replicated system, the switch is
initialized with the addresses of the replicas and tracks
the three pieces of state described above: the dirty set,
the monotonically increasing sequence number, and the sequence number of the latest committed write. The handling
of a single request is outlined in pseudocode in Algorithm 1.
¹ We use the term sequence number here for simplicity. Sequentiality is not necessary; a strictly increasing timestamp would suffice.
[Figure 2: Handling different types of storage requests. (a) Write. (b) Write completion is piggybacked in write reply. (c) Read and reply on object with pending writes. (d) Read and reply on object without pending writes.]
Writes. All writes are assigned a sequence number by Harmonia. The objects being written are added to the dirty set
in the switch, and associated with the sequence number assigned to the write (lines 1–4).
Write completions. Write completions are special messages
sent by the replication protocol once a write is fully committed. If a write is the last outstanding write to the object,
the object is removed from the dirty set in the switch. The
last-committed sequence number maintained by the switch
is then updated (lines 5–8).
Reads. Reads are routed by the switch, either through
the normal replication protocol or to a randomly selected
replica, based on whether the object being read is contended
or not. The switch checks whether the ID of the object being
read is in the dirty set. If so, the switch sends the request unmodified, causing it to be processed according to the normal
replication protocol; otherwise, the read is sent to a random
replica for better performance (lines 9–12). The request is
also stamped with the last-committed sequence number on
the switch for linearizability, as will be discussed in §5.2.
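The request handling described above can be mirrored in software. The following is a minimal Python sketch of Algorithm 1, not the switch implementation: the `Packet` and `Switch` classes are hypothetical, a `dst` of `None` stands for "forward unmodified through the normal protocol", and the last-committed update is applied to every write completion, which is our reading of line 8.

```python
import random

READ, WRITE, WRITE_COMPLETION = "READ", "WRITE", "WRITE-COMPLETION"


class Packet:
    def __init__(self, op, obj_id, seq=None):
        self.op = op
        self.obj_id = obj_id
        self.seq = seq
        self.dst = None             # None: follow the normal protocol path
        self.last_committed = None  # stamped only on single-replica reads


class Switch:
    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.seq = 0
        self.dirty_set = {}         # obj_id -> largest pending write seq
        self.last_committed = 0

    def process(self, pkt):
        if pkt.op == WRITE:                                   # lines 1-4
            self.seq += 1
            pkt.seq = self.seq
            self.dirty_set[pkt.obj_id] = self.seq
        elif pkt.op == WRITE_COMPLETION:                      # lines 5-8
            pending = self.dirty_set.get(pkt.obj_id)
            if pending is not None and pkt.seq >= pending:
                del self.dirty_set[pkt.obj_id]
            self.last_committed = max(self.last_committed, pkt.seq)
        elif pkt.op == READ:                                  # lines 9-12
            if pkt.obj_id not in self.dirty_set:
                pkt.last_committed = self.last_committed
                pkt.dst = random.choice(self.replicas)
        return pkt                                            # line 13


switch = Switch(["primary", "backup1", "backup2"])
write = switch.process(Packet(WRITE, "A"))
contended_read = switch.process(Packet(READ, "A"))
switch.process(Packet(WRITE_COMPLETION, "A", seq=write.seq))
clean_read = switch.process(Packet(READ, "A"))
```

In this run the first read of object A is left on the normal path because a write is pending, while the second read, issued after the completion, is stamped with the last-committed point and sent to a randomly chosen replica.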
Example. Figure 2 shows an example workflow. Figures 2(a)
and 2(b) show a write operation. The switch adds
obj_id=A to the dirty set when it observes the write. It removes
the object ID upon the write completion, which can be piggybacked on the write reply, and updates the last-committed
sequence number. The contents of the dirty set determine how
reads are handled. In Figure 2(c), the read is for object E,
which has pending writes, so the request is sent to the primary to guarantee consistency. In contrast, in
Figure 2(d), object C is not in the dirty set, so the read is sent to
the second backup for better performance.
5.2 Handling Network Asynchrony
In an ideal network, where messages are processed in order, only using the dirty set would be sufficient to guarantee
consistency. In a real, asynchronous network, just because a
read to an object was uncontended when the request passed
through the switch does not mean it will still be so when the
request arrives at a replica: the message might have been delayed so long that a new write to the same object has been
partially processed. Harmonia avoids this using the sequence
number and last-committed point.
Write order requirement. The key invariant of the dirty
set requires that an object not be removed until all concurrent writes to that object have completed. Since the Harmonia switch only keeps track of the largest sequence number
for each object, Harmonia requires that the replication protocol processes writes only in sequence number order. This is
straightforward to implement in a replication protocol, e.g.,
via tracking the last received sequence number and discarding any out-of-order writes.
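A replica-side check of this kind could look like the sketch below. The `OrderedReplica` class is a hypothetical stand-in for a replication protocol's write handler; it accepts gaps in the sequence (a write may have been dropped before reaching the replica) but discards anything at or below the last sequence number it has seen.

```python
class OrderedReplica:
    """Applies writes only in increasing sequence-number order (sketch)."""

    def __init__(self):
        self.last_seq = 0  # last received sequence number
        self.store = {}    # obj_id -> value

    def handle_write(self, seq, obj_id, value):
        # Discard any write that arrives out of order.
        if seq <= self.last_seq:
            return False
        self.last_seq = seq
        self.store[obj_id] = value
        return True


replica = OrderedReplica()
results = [
    replica.handle_write(1, "A", "x"),
    replica.handle_write(3, "B", "y"),
    replica.handle_write(2, "A", "z"),  # late arrival, discarded
]
print(results, replica.store)
```

How a discarded write is retried (for example, by the client resubmitting it so that it receives a fresh sequence number) is left to the replication protocol and is not modeled here.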
Dropped messages. If a WRITE-COMPLETION or forwarded
WRITE message is dropped, an object may remain in the
dirty set indefinitely. This is harmless in principle, since it is always safe to consider an uncontended object dirty, but it may
degrade performance. However, because writes
are processed in order, any stray entries in the dirty set can be
removed as soon as a WRITE-COMPLETION message with a
higher sequence number arrives. The switch can
remove these stray entries as it processes reads (i.e., by removing the object ID if its sequence number in the dirty set is
less than or equal to the last-committed sequence number).
This removal can also be done periodically.
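The cleanup rule can be stated in a few lines. The helper below is a sketch of the periodic variant, with a plain dictionary standing in for the switch's dirty set; the function name is our own.

```python
def purge_stray_entries(dirty_set, last_committed):
    """Remove entries whose pending sequence number is already committed."""
    stray = [obj for obj, seq in dirty_set.items() if seq <= last_committed]
    for obj in stray:
        del dirty_set[obj]
    return stray


# Completions for the writes to E and B were lost, but a completion with
# sequence number 3 has since arrived.
dirty = {"E": 1, "B": 2, "A": 5}
removed = purge_stray_entries(dirty, last_committed=3)
print(removed, dirty)
```

Object A stays in the dirty set because its pending write (sequence number 5) is newer than the last-committed point.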
[Figure 3: Switch data plane structure. (a) Switch multi-stage packet processing pipeline (Stages 1 to 3 operating on header and metadata, with ETH, IPv4, IPv6, and custom tables). (b) Example custom table: a match-action table backed by a register array (RA).
Match                       Action
dst_ip=10.0.0.1             egress_port=1
dst_port=2000, op=read      meta=RA[hash(ID)]
dst_port=2000, op=write     RA[hash(ID)]=ID
default                     drop()]
Last-committed point for linearizability. Harmonia aims
to maintain linearizability, even when the network can arbitrarily delay or reorder packets. The switch uses its dirty set
to ensure that a single-replica read does not contend with ongoing writes at the time it is processed by the switch. This is
not sufficient to entirely eliminate inconsistent reads. However, the last-committed sequence number stamped into the
read provides enough information for the recipient to
determine whether processing the read locally is safe.
In primary-backup, a write after a read on the same object has a higher sequence number than the last-committed point carried in the read. As such, a backup can
detect the conflict even if the write happens to be executed at
the backup before the read arrives, and then send the read to
the primary for linearizability. A detailed discussion of adapting protocols is presented in §7.
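For primary-backup, the backup-side check might be sketched as follows. The `Backup` class, the per-object `write_seq` map, and the return tags are assumptions made for illustration; the protocol-specific rules are the subject of §7.

```python
class Backup:
    """Backup replica that validates single-replica reads (sketch)."""

    def __init__(self):
        self.values = {}     # obj_id -> value
        self.write_seq = {}  # obj_id -> seq of the last write applied here

    def apply_write(self, seq, obj_id, value):
        self.values[obj_id] = value
        self.write_seq[obj_id] = seq

    def handle_single_replica_read(self, obj_id, last_committed):
        # A locally applied write newer than the stamped last-committed
        # point was issued after the read passed the switch and may not be
        # committed yet, so the read must go to the primary.
        if self.write_seq.get(obj_id, 0) > last_committed:
            return ("FORWARD_TO_PRIMARY", None)
        return ("REPLY", self.values.get(obj_id))


backup = Backup()
backup.apply_write(3, "A", "old")

# A read of A passes the switch while last_committed == 3, is delayed,
# and reaches the backup only after write 4 has been applied there.
backup.apply_write(4, "A", "new")
delayed = backup.handle_single_replica_read("A", last_committed=3)
timely = backup.handle_single_replica_read("A", last_committed=4)
print(delayed, timely)
```

The delayed read is redirected rather than answered with a value that might be uncommitted, which is the read-ahead case; a read stamped after write 4 committed can be served locally.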
5.3 Failure Handling
Harmonia would be of limited utility to replication protocols if the switch were a single point of failure. However,
because the switch only keeps soft state (i.e., the dirty set,
the sequence number and the last-committed point), it can
be rebooted or replaced. The failure of a switch will result in
temporary performance degradation. The Harmonia failure
handling protocol restores the ability for the new switch to
send writes and reads through the normal case first, and then
restores the single-replica read capability, limiting the downtime of the system to a minimum. As such, the switch is not
a single point of failure, and can be safely replaced without
sacrificing consistency.
Handling switch failures. When the switch fails, the operator either reboots it or replaces it with a backup switch.
While the switch only maintains soft state, care must be
taken to preserve consistency during the transition. As in
prior systems [34, 35], the sequence numbers in Harmonia
are augmented with the switch’s unique ID and ordered lexicographically considering the switch’s ID first. This ensures
that no two writes have the same sequence number. Next, before a newly initialized switch can process writes, Harmonia
must guarantee that single-replica reads issued by the previous switch will not be processed. Otherwise, in read-behind
protocols, the previous switch and a lagging replica could
bilaterally process a read without seeing the results of the
latest writes, resulting in read-behind anomalies. To prevent
these anomalies, the replication protocol periodically agrees
to allow single-replica reads from the current switch for a
time period. Before allowing the new switch to send writes,
the replication protocol must agree to refuse single-replica reads with
smaller switch IDs, and either the previous switch’s time period
should expire or all replicas should agree to cut it short. This
technique is similar in spirit to the leases used as an optimization to allow leader-only reads in many protocols. Finally, once the switch receives its first WRITE-COMPLETION
with the new switch ID, both its last-committed point and
dirty set will be up to date, and it can safely send single-replica reads.
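Two of the mechanisms above lend themselves to a short sketch: lexicographically ordered sequence numbers and the refusal of single-replica reads from superseded switches. The `SeqNum` and `ReadGate` names are hypothetical, and the time-based part of the agreement (the lease-like period) is deliberately left out.

```python
from typing import NamedTuple


class SeqNum(NamedTuple):
    """Sequence number ordered by switch ID first, then by counter."""
    switch_id: int
    counter: int


class ReadGate:
    """Per-replica filter for single-replica reads (sketch, no timing)."""

    def __init__(self):
        self.min_switch_id = 0

    def advance(self, switch_id):
        # Called once the replicas agree to switch over to a new switch.
        self.min_switch_id = max(self.min_switch_id, switch_id)

    def accepts(self, switch_id):
        return switch_id >= self.min_switch_id


# Tuples compare lexicographically, so any write sequenced by switch 2
# orders after every write sequenced by switch 1.
old_write = SeqNum(switch_id=1, counter=9000)
new_write = SeqNum(switch_id=2, counter=1)

gate = ReadGate()
gate.advance(2)
print(new_write > old_write, gate.accepts(1), gate.accepts(2))
```

Because the new switch restarts its counter, comparing counters alone would order its first write before the old switch's later writes; putting the switch ID first avoids that.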
Handling server failures. The storage system handles a
server failure based on the replication protocol, and notifies the switch control plane at the beginning and end of the
process. The switch first removes the failed replica from the
replica addresses in the data plane, so that subsequent requests
are not scheduled to it. After the failed replica is recovered or a replacement server is added, the switch adds
the corresponding address to the replica addresses, enabling
requests to be scheduled to the server.
6. Data Plane Design and Implementation
Can the Harmonia approach be supported by a real
switch? We answer this in the affirmative by showing how
to implement it for a programmable switch [12, 13] (e.g.,
Barefoot’s Tofino [9]), and evaluate its resource usage.
6.1 Data Plane Design
The in-network conflict detection module is implemented
in the data plane of a modern programmable switch. The sequence number and last-committed point can be stored in
two registers, and the dirty set can be stored in a hash table
implemented with register arrays. While previous work has
shown how to use the register arrays to store key-value data
in a switch [29, 30], our work has two major differences:
(i) the hash table only needs to store object IDs, instead of
both IDs and values; (ii) the hash table needs to support insertion, search and deletion operations at line rate, instead of
only search. We provide some background on programmable
switches, and then describe the hash table design.
Switch data plane structure. Figure 3 illustrates the basic data plane structure of a modern programmable switching ASIC. The packet processing pipeline contains multiple
stages, as shown in Figure 3(a). Packets are processed by
the stages one after another. Match-action tables are the basic
element used to process packets. If two tables have no dependencies, they can be placed in the same stage, e.g., IPv4
and IPv6 tables in Figure 3(a).
[Figure 4: Multi-stage hash table design that supports insertion, search and deletion in the switch data plane. (a) Insertion: Write Query (obj_id=A). (b) Search: Read Query (obj_id=A). (c) Deletion: Write Completion (obj_id=A, seq=6).]
A match-action table contains a list of rules that specifies
how packets are processed, as shown in Figure 3(b). A match
in a rule specifies a header pattern, and the action specifies
how the matched packets should be processed. For example, the first rule in Figure 3(b) forwards packets to egress
port 1 for packets with destination IP 10.0.0.1. Each stage
also contains register arrays that can be accessed at line rate.
Programmable switches allow developers to define custom
packet formats and match-action tables to realize their own
protocols. The example in Figure 3(b) assumes two custom
fields in the packet header, which are op for operation and
ID for object ID. The second and third rules perform read
and write on the register array based on op type, and the
index of the register array is computed by the hash of ID.
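The rule matching described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the packet is a dictionary, the field names `dst_ip`, `op`, `id` and `value`, the CRC32 hash, the 16-slot register array, and the default drop action are all hypothetical choices made for illustration and are not taken from Figure 3(b) or from any switch API.

```python
import zlib

NUM_SLOTS = 16                      # assumed register array length
register_array = [0] * NUM_SLOTS    # models one per-stage register array


def slot_index(obj_id: int) -> int:
    """Index of the register slot, computed as a hash of the object ID."""
    return zlib.crc32(obj_id.to_bytes(4, "big")) % NUM_SLOTS


def forward(port):
    """Build an action that forwards the packet to the given egress port."""
    def action(pkt):
        return ("forward", port)
    return action


def read_register(pkt):
    """Action: read the register slot selected by the hash of the ID."""
    return ("read", register_array[slot_index(pkt["id"])])


def write_register(pkt):
    """Action: write the packet's value into the slot selected by the hash."""
    register_array[slot_index(pkt["id"])] = pkt["value"]
    return ("write", pkt["value"])


# Each rule pairs a match (a header pattern) with an action.
RULES = [
    ({"dst_ip": "10.0.0.1"}, forward(1)),
    ({"op": "read"}, read_register),
    ({"op": "write"}, write_register),
]


def process(pkt):
    """Apply the first rule whose pattern matches the packet headers."""
    for match, action in RULES:
        if all(pkt.get(field) == value for field, value in match.items()):
            return action(pkt)
    return ("drop", None)           # assumed default when no rule matches
```

A real data plane expresses the same idea declaratively in P4 and executes it in hardware; the loop over rules here only models the lookup semantics.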
Developers use a domain-specific language such as
P4 [12] to write a program for a custom data plane, and then
use a compiler to compile the program to a binary that can be
loaded onto the switch. Each stage has resource constraints on
the size of match-action tables (depending on the complexity of matches and actions) and register arrays (depending
on the length and width).
Multi-stage hash table with register arrays. The switch
data plane provides basic constructs for the conflict detection module. A register array can be naturally used to store
the object IDs. We can use the hash of an object ID as the index of the register array, and store the object ID as the value
in the register slot. One challenge is to handle hash collisions, as the switch can only perform a limited, fixed number
of operations per stage. Collision resolution for hash tables
is a well-studied problem, and the multi-stage structure of
the switch data plane makes it natural to implement open
addressing techniques to handle collisions. Specifically, we
allocate a register array in each stage and use different hash
functions for different stages. In this way, if several objects
collide in one stage, they are less likely to collide in another
stage. Figure 4 shows the design.
• Insertion. For a write, the object ID is inserted into the first
stage that has an empty slot for the object (Figure 4(a)). The
write is dropped if no slot is available.
• Search. For a read, the switch iterates over all stages to
see if any slot contains the same object ID (Figure 4(b)).
• Deletion. For a write completion, the switch iterates over
all stages and removes the object ID (Figure 4(c)).
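The three operations above can be sketched in software as follows. This is a minimal model, not the switch implementation: the per-stage hash (CRC32 salted with the stage number) is an assumed stand-in for the hardware hash functions, and two behaviors are assumptions of the sketch rather than statements from this section, namely that a write to an object already in the table updates its sequence number in place, and that a completion removes an entry only if its sequence number is at least the stored one. The software model also searches before inserting, which a single pipeline pass would have to arrange differently.

```python
import zlib


class MultiStageHashTable:
    """Dirty set modeled as one register array per pipeline stage."""

    def __init__(self, num_stages=3, slots_per_stage=64000):
        self.num_stages = num_stages
        self.slots_per_stage = slots_per_stage
        # Each slot is either None (empty) or a tuple (obj_id, seq).
        self.stages = [[None] * slots_per_stage for _ in range(num_stages)]

    def _index(self, stage, obj_id):
        # A different hash function per stage (illustrative choice).
        data = bytes([stage]) + obj_id.to_bytes(4, "big")
        return zlib.crc32(data) % self.slots_per_stage

    def _find(self, obj_id):
        # Visit the stages in order and look for a slot holding obj_id.
        for stage in range(self.num_stages):
            idx = self._index(stage, obj_id)
            slot = self.stages[stage][idx]
            if slot is not None and slot[0] == obj_id:
                return stage, idx
        return None

    def insert(self, obj_id, seq):
        """Handle a write. Returns False if the write must be dropped."""
        found = self._find(obj_id)
        if found is not None:
            stage, idx = found
            self.stages[stage][idx] = (obj_id, seq)   # assumed in-place update
            return True
        for stage in range(self.num_stages):
            idx = self._index(stage, obj_id)
            if self.stages[stage][idx] is None:       # first stage with a free slot
                self.stages[stage][idx] = (obj_id, seq)
                return True
        return False                                  # no slot available

    def contains(self, obj_id):
        """Handle a read: is there a pending write to obj_id?"""
        return self._find(obj_id) is not None

    def delete(self, obj_id, seq):
        """Handle a write completion for obj_id with sequence number seq."""
        found = self._find(obj_id)
        if found is None:
            return
        stage, idx = found
        if self.stages[stage][idx][1] <= seq:         # assumed seq comparison
            self.stages[stage][idx] = None
```

With a tiny table, for example `MultiStageHashTable(num_stages=2, slots_per_stage=1)`, a third distinct object cannot be inserted while two writes are pending, which corresponds to the dropped write described in the insertion rule.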
Variable-length object IDs. Many systems use variable-length IDs, but due to switch limitations, Harmonia must
use fixed-length object IDs for conflict detection. However,
variable-length IDs can be accommodated by having the
clients store fixed-length hashes of the original ID in the Harmonia packet header; the original ID is sent in the packet
payload. Harmonia then uses the fixed-length hashes for
conflict detection. Hash collisions may degrade performance
but cannot introduce consistency issues; they can only cause
Harmonia to believe a key is contended when it is not, never the reverse.
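The one-sided nature of this error can be seen in a short sketch. The hash function (CRC32), the `bits` parameter, and the helper names are assumptions for illustration; the text does not say which hash the clients use.

```python
import zlib


def fixed_length_id(key: bytes, bits: int = 32) -> int:
    """Map a variable-length key to a fixed-length ID (CRC32 is an assumed choice)."""
    return zlib.crc32(key) & ((1 << bits) - 1)


def treated_as_contended(dirty_ids: set, key: bytes, bits: int = 32) -> bool:
    """True if the hashed ID of `key` is in the dirty set of hashed IDs."""
    return fixed_length_id(key, bits) in dirty_ids


# A pending write to one key marks its hashed ID as dirty.
dirty = {fixed_length_id(b"user:1001:profile")}

# A read of the same key always hashes to the same ID, so a real
# conflict can never be missed.
same_key = treated_as_contended(dirty, b"user:1001:profile")
```

If a different key happens to hash to the same ID, its read is also treated as contended and takes the normal protocol path, which costs performance but not correctness.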
6.2 Resource Usage
Switch on-chip memory is a limited resource. Will there
be enough memory to store the entire dirty set of pending
writes? Our key insight is that since the switch only performs
conflict detection, it does not need to store actual data, but
only the object IDs. This is in contrast to previous designs
like NetCache [30] and NetChain [29] that use switch memory for object storage directly. Moreover, while the storage
system can store a massive number of objects, the number of
writes at any given time is small, implying that the dirty set
is far smaller than the storage size.
Suppose we use n stages and each stage has a register
array with m slots. Let the hash table utilization be u to account for hash collisions. The switch is able to support up
to unm writes at a given time. Suppose the duration of each
write is t, and the write ratio is w . Then the switch is able to
support unm/t writes per second—or a total throughput of
unm/(wt)—before exhausting memory. As a concrete example, let n=3, m=64000, u=50%, t=1 ms and w =5%. The
switch can support a write throughput of 96 million requests
per second (MRPS), and a total throughput of 1.92 billion
requests per second (BRPS). Let both the object ID and sequence number be 32 bits, so that each slot takes 8 bytes; the nm = 192,000 slots then consume only about 1.5 MB of memory.
Given that a commodity switch has 10–20 stages and a few
tens of MB of memory [29, 30, 54], this example conservatively uses only a small fraction of switch memory.
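The arithmetic of this example can be reproduced directly from the formulas above. The function and key names below are hypothetical; the inputs are the values given in the text, and the 8-byte slot size follows from a 32-bit object ID plus a 32-bit sequence number.

```python
def dirty_set_capacity(n, m, u, t, w, slot_bytes=8):
    """Back-of-the-envelope capacity of the in-switch dirty set.

    n: number of stages, m: slots per stage, u: hash table utilization,
    t: duration of one write in seconds, w: write ratio.
    """
    pending_writes = u * n * m          # writes that can be pending at once
    write_rps = pending_writes / t      # supported write throughput
    total_rps = write_rps / w           # supported total throughput
    memory_bytes = n * m * slot_bytes   # memory for all slots
    return {
        "pending_writes": pending_writes,
        "write_rps": write_rps,
        "total_rps": total_rps,
        "memory_bytes": memory_bytes,
    }


example = dirty_set_capacity(n=3, m=64000, u=0.5, t=1e-3, w=0.05)
```

Here `example["write_rps"]` is about 96 million and `example["total_rps"]` about 1.92 billion requests per second, and `example["memory_bytes"]` is 1,536,000 bytes, matching the figures quoted in the text.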
6.3 Deployment Issues
We imagine two possible deployment scenarios for Harmonia. First, it can be easily integrated with clustered storage
systems, such as on-premise storage clusters for enterprises and specialized storage clusters in the cloud. As
shown in Figure 1, all servers are deployed in the same rack,
allowing the ToR switch to be the central location that sees
all the storage traffic. We only need to add Harmonia’s functionality to the ToR switch.
For cloud-scale storage, replicas may be distributed
among many different racks for fault tolerance. Placing the
Harmonia scheduler on a ToR switch, which only sees storage traffic to its own rack, does not suffice. Instead, we leverage a network serialization approach [35, 50], where all traffic destined for a replica group is redirected through a designated switch. Prior work has shown that, with careful selection of the switch (e.g., a spine switch in a two-layer leaf-spine network), this need not increase latency [35]. Nor does
it impose a throughput penalty: different replica groups can
use different switches as their request scheduler, and the capacity of a switch far exceeds that of a single replica group.
7. Adapting Replication Protocols
Safely using a replication protocol with Harmonia imposes three responsibilities on the protocol. It must:
1. process writes only in sequence number order;
2. allow single-replica reads only from one active switch
at a time; and
3. ensure that single-replica reads for uncontended objects still return linearizable results.
Responsibility (1) can be handled trivially by dropping
messages that arrive out of order, and responsibility (2) can
be implemented in the same manner as leader leases in traditional replication protocols. We therefore focus on responsibility (3) here. How this is handled is different for the two
categories of read-ahead and read-behind protocols.
To demonstrate the generality of our approach, we apply Harmonia to representative protocols from both classes:
primary-backup protocols (including chain replication),
as well as leader-based quorum protocols (Viewstamped
Replication/Multi-Paxos) and recent single-phase consensus
protocols (NOPaxos). For each, we explain the necessary
protocol modifications and give a brief argument for correctness. A full proof of correctness is in Appendix A, and
a model-checked TLA+ specification of Harmonia is in Appendix B.
7.1 Requirements for Linearizability
Let us first specify the requirements that must be satisfied
for a Harmonia-adapted protocol to be correct. We only consider systems where the underlying replication protocol is
linearizable. All write operations are processed by the replication protocol based on the sequence number order. We
need only, then, consider the read operations. The following
two properties are sufficient for linearizability.
• P1. Visibility. A read operation sees the effects of all write
operations that finished before it started.
• P2. Integrity. A read operation will not see the effects of
any write operation that had not committed at the time the
read finished.
In the context of Harmonia, read operations follow the
normal-case replication protocol if they refer to an object
in the dirty set, and hence we need only consider the fast-path read operations executed at a single replica. For these,
P1 can equivalently be stated as follows.
• P1. Visibility. The replication protocol must only send a
completion notification for a write to the scheduler if any
subsequent single-replica read sent to any replica will reflect the effect of the write operation.
7.2 Read-Ahead Protocols
Both primary-backup and chain replication are read-ahead
protocols that cannot have read-behind anomalies, because
they only reply to the client once an operation has been executed on all replicas. As a result, they inherently satisfy P1.
We adapt them to send a WRITE-COMPLETION notification
to the switch at the same time as responding to the client.
However, read-ahead anomalies are possible: reads
naively executed at a single replica can reflect uncommitted results. We use the last-committed sequence number provided by the Harmonia switch to prevent this. When a replica
receives a fast-path read for object o, it checks that the last-committed sequence number attached to the request is at
least as large as the sequence number of the latest write applied to o. If not, it forwards the request to the primary or tail, to be executed using
the normal protocol. Otherwise, this implies that all writes
to o processed by the replica were committed at the time the
read was handled by the switch, satisfying P2.
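The replica-side check for read-ahead protocols can be sketched as follows. The class, method and return-value names are hypothetical, and the sketch assumes sequence numbers start at 1 so that 0 can stand for "no write applied yet".

```python
class ReadAheadReplica:
    """Replica in a read-ahead protocol (e.g., primary-backup or chain replication)."""

    def __init__(self):
        self.store = {}            # obj_id -> latest applied value
        self.last_write_seq = {}   # obj_id -> seq of the latest applied write

    def apply_write(self, obj_id, value, seq):
        """Apply a write locally; it may not yet be committed on all replicas."""
        self.store[obj_id] = value
        self.last_write_seq[obj_id] = seq

    def handle_fast_path_read(self, obj_id, last_committed):
        """Serve a single-replica read only if every applied write to
        obj_id was already committed when the switch handled the read."""
        if self.last_write_seq.get(obj_id, 0) > last_committed:
            # The replica has applied a write the switch had not yet seen
            # committed: hand the read to the primary or tail instead.
            return ("FORWARD", None)
        return ("REPLY", self.store.get(obj_id))
```

For example, after `apply_write(7, "b", 6)`, a read stamped with last-committed point 5 is forwarded, while one stamped with 6 is answered locally.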
7.3 Read-Behind Protocols
We have applied Harmonia to two quorum protocols:
Viewstamped Replication [38, 44], a leader-based consensus protocol equivalent to Multi-Paxos [32] or Raft [45],
and NOPaxos [35], a network-aware, single-phase consensus protocol. Both are read-behind protocols. Because replicas in these protocols only execute operations once they have
been committed, P2 is trivially satisfied.
Furthermore, because the last committed point in the Harmonia switch is greater than or equal to the sequence numbers of all writes removed from its dirty set, replicas can ensure visibility (P1) by rejecting (and sending to the leader for
processing through the normal protocol) all fast-path reads
whose last committed points are larger than that of the last
locally committed and executed write.
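The corresponding check for read-behind protocols compares in the opposite direction. Again the names are hypothetical, and the sketch assumes that each replica executes committed writes in sequence number order and tracks the sequence number of the last one it executed.

```python
class ReadBehindReplica:
    """Replica in a read-behind protocol (e.g., VR or NOPaxos)."""

    def __init__(self):
        self.store = {}          # obj_id -> value of the last executed write
        self.last_executed = 0   # seq of the last committed and executed write

    def execute_committed_write(self, obj_id, value, seq):
        """Execute a write that the protocol has already committed."""
        self.store[obj_id] = value
        self.last_executed = seq

    def handle_fast_path_read(self, obj_id, last_committed):
        """Reject the read if the switch has seen commits this replica
        has not executed yet; otherwise the local state is fresh enough."""
        if last_committed > self.last_executed:
            return ("FORWARD", None)   # send to the leader, normal protocol
        return ("REPLY", self.store.get(obj_id))
```

A replica that has executed writes up to sequence number 3 therefore answers reads stamped with a last-committed point of 3 or lower, and forwards reads stamped with 4 or higher.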
In read-behind protocols, WRITE-COMPLETIONs can be
sent along with the response to the client. However, in order
to reduce the number of rejected fast-path reads, we delay
WRITE-COMPLETIONs until the write has likely been executed on all replicas.
Viewstamped Replication. For Viewstamped Replication,
we add an additional phase to operation processing that
ensures a quorum of replicas have committed and executed the operation. Concurrently with responding to the
client, the VR leader sends a COMMIT message to the
other replicas. Our additional phase calls for the replicas to respond with a COMMIT-ACK message. (These messages can be piggybacked on the next
PREPARE and PREPARE-OK messages, eliminating overhead.) Only once
the leader receives a quorum of COMMIT-ACK messages
for an operation with sequence number n does it send a
⟨WRITE-COMPLETION, object id, n⟩ notification.
NOPaxos. NOPaxos [35] uses an in-network sequencer to
enable a single-round, coordination-free consensus protocol. It is a natural fit for Harmonia, as both the sequencer
and Harmonia’s request scheduler can be deployed in the
same switch. Although NOPaxos replicas do not coordinate while handling operations, they already run a periodic synchronization protocol to ensure that all replicas
have executed a common, consistent prefix of the log [36];
this serves the same purpose as the additional phase in
VR. The only Harmonia modification needed is for the
leader, upon completion of a synchronization, to send
⟨WRITE-COMPLETION, object id, commit⟩ messages for all
affected objects.
8. Implementation
We have implemented a Harmonia prototype and integrated it with Redis [3]. The switch data plane is implemented in P4 [12] and is compiled to the Barefoot Tofino
ASIC [9] with the Barefoot Capilano software suite [8]. We use
32-bit object IDs, and use 3 stages for the hash table. Each
stage provides 64K slots to store the object IDs, resulting in
a total of 192K slots for the hash table.
The shim layer in the storage servers is implemented in
C++. It communicates with clients using Harmonia packets, and uses hiredis [1], which is the official C library of
Redis [3], to read from and write to Redis. In addition to
translating between Harmonia packets and Redis operations,
the shim layers in the servers also communicate with each
other to implement replication protocols. We have integrated
Harmonia with multiple representative replication protocols
(§9.5). We use the pipeline feature of Redis to batch requests
to Redis. Because Redis is single-threaded, we run eight Redis processes on each server to maximize per-server throughput. Our prototype is able to achieve about 0.92 MQPS for
reads and 0.8 MQPS for writes on a single Redis server.
The client library is implemented in C. It generates mixed
read and write requests to the storage system, and measures
the system throughput and latency.
9. Evaluation
We provide experimental results to demonstrate that Harmonia provides significant throughput and latency improvements (§9.2), scales out with the number of replicas (§9.3),
is resource efficient (§9.4), is general to many replication
protocols (§9.5), and handles failures gracefully (§9.6).
Figure 5: Throughput vs. latency for reads and writes, comparing CR and Harmonia. (a) Read-only workload. (b) Write-only workload. Axes: throughput (MRPS) vs. latency (ms).
9.1 Methodology
Testbed. Our experiments are conducted on a testbed consisting of twelve server machines connected by a 6.5 Tbps
Barefoot Tofino switch. Each server is equipped with an
8-core CPU (Intel Xeon E5-2620 @ 2.1GHz), 64 GB total memory, and one 40G NIC (Intel XL710). The server
OS is Ubuntu 16.04.3 LTS. Ten storage servers run Redis
v4.0.6 [3] as the storage backend; two generate client load
using a DPDK-based workload generator.
By default, we use three replicas and a uniform workload on one million objects with 5% write ratio. The 5%
write ratio is similar to that in many real-world storage systems [7, 43], previous studies [40], and standard benchmarks
like YCSB [19]. We vary the parameters in the experiments
to evaluate their impacts.
Comparison. Redis is a widely-used open-source in-memory storage system. However, Redis does not provide
native support for replication, only a cluster mode with weak
consistency. We use a shim layer to implement several representative replication protocols, including primary-backup
(PB) [14], chain replication (CR) [56], CRAQ [55] (a version of chain replication that makes reads more scalable at
the cost of more expensive writes), Viewstamped Replication (VR) [44] and NOPaxos [35]. As described in §8, we
run eight Redis processes on each server to maximize per-server throughput. The shim layer batches requests to Redis; the baseline (unreplicated) performance for one server
is 0.92 MQPS for reads and 0.8 MQPS for writes.
We compare system performance with and without Harmonia for each protocol. Due to space constraints, we show
the results of CR, which is a high-throughput variant of PB,
in most figures; §9.5 compares performance across all protocols, demonstrating the generality.
9.2 Latency vs. Throughput
We first conduct a basic throughput and latency experiment. The client generates requests to three replicas, and
measures the average latency at different throughput levels.
We consider read-only, write-only, and mixed workloads.
Figure 5(a) shows the relationship between throughput
and latency under a read-only workload. Although we have
three replicas, since CR only uses the tail node to handle read
requests, the throughput is bounded by that of one server. In
comparison, since Harmonia uses the switch to detect read-write conflicts, it is able to fully utilize the capacity of all
three replicas when there are no conflicts. The read latency
is a few hundred microseconds at low load, and increases as
throughput goes up. For write-only workloads (Figure 5(b)),
CR and Harmonia have identical performance, as Harmonia
simply passes writes to the normal protocol.
Figure 6: Throughput for mixed read-write workloads, comparing CR and Harmonia. (a) Read throughput (MRPS) as the write rate (MRPS) increases. (b) Total throughput (MRPS) under different write ratios (%).
To evaluate mixed workloads, the client fixes its rate of
generating write requests, and measures the maximum read
throughput that can be handled by the replicas. Figure 6(a)
shows the read throughput as a function of write rate. Since
CR can only leverage the capacity of the tail node, its read
throughput is no more than that of one storage server, even
when the write throughput is small. On the other hand, Harmonia can utilize nearly all the capacity of the three replicas to handle reads
when the write throughput is small. At low write rate, Harmonia improves the throughput by 3× over CR. At high
write rate, both systems have similar throughput, as Harmonia and CR process write requests in the same way.
Figure 6(b) evaluates the system performance for mixed
workloads from another angle. The client fixes the ratio of
writes and measures the saturated system throughput. The figure shows the total throughput as a function of write ratio.
Similarly, the throughput of CR is bounded by the tail node,
while Harmonia can leverage all replicas to process reads.
As in Figure 6(a), when the write ratio is high, Harmonia has little benefit, as both systems process writes in the same way.
9.3 Scalability
Harmonia offers near-linear read scalability for read-intensive workloads. We demonstrate this by varying the
number of replicas and measuring system throughput in several representative cases. The scale is limited by the size of
our twelve-server testbed: we can use up to ten servers as
replicas, and two servers as clients to generate requests. Our
high-performance client implementation written in C and
DPDK is able to saturate ten replicas with two client servers.
Harmonia offers dramatic improvements on read-only
workloads (Figure 7(a)). For CR, increasing the number of
replicas does not change the overall throughput, because it
only uses the tail to handle reads. In contrast, Harmonia
is able to utilize the other replicas to serve reads, causing
throughput to increase linearly with the number of replicas.
Harmonia improves the throughput by 10× with a replication factor of 10, limited by the testbed size. It can scale out
until the switch is saturated. Multiple switches can be used
for multiple replica groups to further scale out (§6.3).
On write-only workloads (Figure 7(b)), Harmonia has
no benefit regardless of the number of replicas used because Harmonia uses the underlying replication protocol for
writes. For CR, the throughput stays the same as more replicas are added since CR uses a chain to propagate writes.
Figure 7(c) considers throughput scalability under a
mixed read-write workload with a write ratio of 5%. Again,
CR does not scale with the number of replicas. In comparison, the throughput of Harmonia increases nearly linearly
with the number of replicas. Under a read-intensive workload, Harmonia can efficiently utilize the remaining capacity on the other nodes. The total throughput here is smaller
than that for read-only requests (Figure 7(a)), because handling writes is more expensive than handling reads and the
tail node becomes the bottleneck as the number of replicas
goes up to 8.
9.4 Resource Usage
We now evaluate how much switch memory is needed to
track the dirty set. As we have discussed in §6.2, Harmonia
requires much less memory than other systems such as NetCache [30] and NetChain [29] because Harmonia only needs
to store metadata (i.e., object IDs and sequence numbers).
In this experiment, we vary the size of Harmonia switch’s
hash table, and measure the total throughput of three replicas. Here, we use a write ratio of 5% and both uniform
and skewed (zipf-0.9) request distributions across one million keys. As shown in Figure 8, Harmonia only requires
about 2000 hash table slots to track all outstanding writes
and reach maximum throughput. Before reaching the maximum, the throughput of the uniform case increases faster
than for the skewed workload. This is because under the
skewed workload, a hot object would always occupy a slot
in the hash table, making the switch drop writes to other objects that collide on this slot, thus limiting throughput.
With 32-bit object IDs and 32-bit sequence numbers, 2000
slots consume only 16 KB of memory. Given that commodity
switches have tens of MB of on-chip memory [29, 30, 54], the
memory used by Harmonia accounts for only a tiny fraction of the total memory, e.g., about 0.16% (0.08%) for 10 MB
(20 MB) of memory. This result roughly matches the back-of-the-envelope calculations in §6.2, with differences coming from
table utilization, write duration and total throughput. Thus,
Harmonia can be added to the switch and co-exist with other
modules without significant resource consumption. It also
allows Harmonia to scale out to multiple replica groups with
one switch, as one group consumes only a little memory. This
is especially important if Harmonia is deployed in a spine
switch to support many replica groups across different racks.
Figure 7: Total throughput (MRPS) with increasing numbers of replicas for three workloads, comparing CR and Harmonia. (a) Read-only workload. (b) Write-only workload. (c) Mixed workload with 5% writes. Harmonia scales out with the
number of replicas in read-only and read-intensive workloads.
Figure 8: Impact of switch memory: throughput (MRPS) vs. number of slots in the hash table (log scale), for uniform and zipf-0.9 workloads. Harmonia only consumes a small amount of memory.
9.5 Generality
We show that Harmonia is a general approach by applying
it to a variety of replication protocols. For each replication
protocol, we examine throughput for a three-replica storage
system with and without Harmonia. Figure 9 shows the read
throughput as a function of write rate for different protocols.
Figure 9(a) shows the results for two primary-backup protocols, PB and CR. Both PB and CR are limited by the performance of one server. Harmonia makes use of all three
replicas to handle reads, and provides significantly higher
throughput than PB and CR. CR is able to achieve higher
write throughput than PB, as it uses a chain structure to propagate writes.
CRAQ, a modified version of CR, obtains higher read
throughput than CR, as shown in Figure 9(a). This is because
CRAQ allows reads to be sent to any replica (reads to dirty
objects are forwarded to the tail). However, CRAQ adds an
additional phase to write operations (first marking objects as
dirty, then committing the write). As a result, CRAQ’s write
throughput is much lower, hence the steeper curve. Harmonia (CR), which applies in-network conflict detection to
CR, performs much better than CRAQ, achieving the same
level of read scalability without degrading the performance
of writes.
Figure 9(b) shows the results for quorum-based protocols
VR and NOPaxos. For faithful comparison, we use the original
implementation of NOPaxos, including the middlebox-based sequencer prototype, which runs on a Cavium Octeon
II network processor. We integrate Harmonia with these,
rather than the Tofino switch and Redis-based backend. As
a result, the absolute numbers in Figures 9(a) and 9(b) are
not comparable. The trends, however, are the same. Harmonia
significantly improves throughput for VR and NOPaxos.
Taken together, these results demonstrate that Harmonia
can be applied broadly to a wide range of replication protocols. These experiments show the advantage of in-network
conflict detection, as it introduces no performance penalties,
unlike protocol-level optimizations such as CRAQ.
9.6 Performance Under Failures
Finally, we show how Harmonia handles failures. To simulate a failure, we first manually stop and then reactivate the
switch. Harmonia uses the mechanism described in §5.3 to
correctly recover from the failure.
Figure 10 shows the throughput during this period of failure and recovery. At time 20 s, we let the Harmonia switch
stop forwarding any packets, and the system throughput
drops to zero. We wait for a few seconds and then reactivate the switch to forward packets. Upon reactivation, the
switch retains none of its former state and uses a new switch
ID. The servers are notified with the new switch ID and
agree to drop single-replica reads from the old switch. In
the beginning, the switch forwards reads to the tail node and
writes to the head node. During this time, the system throughput is the same as without Harmonia. After the first WRITE-COMPLETION with the new switch ID passes the switch, the
switch has the up-to-date dirty set and last-committed point.
At this time, the switch starts scheduling single-replica reads
to the servers, and the system throughput is fully restored.
Because the servers complete requests quickly, the transition time is minimal, and we can see that system throughput
returns to pre-failure levels within a few seconds.
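The recovery sequence described above can be modeled with a small state machine. This is a sketch of the behavior as described in this section only, not of the mechanism in §5.3: the class, method and return-value names are hypothetical, and tagging each WRITE-COMPLETION with the ID of the switch that sequenced the write is an assumption made so that the model can tell old completions from new ones.

```python
class RecoveringSwitch:
    """Software model of the request scheduler after a switch restart."""

    def __init__(self, switch_id):
        self.switch_id = switch_id       # new ID used after reactivation
        self.dirty = {}                  # obj_id -> seq of latest pending write
        self.last_committed = 0
        self.fast_path_enabled = False   # no single-replica reads yet

    def on_write(self, obj_id, seq):
        """A write passes the switch and is recorded as pending."""
        self.dirty[obj_id] = seq

    def on_write_completion(self, obj_id, seq, switch_id):
        """A WRITE-COMPLETION passes the switch."""
        if switch_id != self.switch_id:
            return                       # completion for a write from the old switch
        if self.dirty.get(obj_id, seq + 1) <= seq:
            del self.dirty[obj_id]
        self.last_committed = max(self.last_committed, seq)
        self.fast_path_enabled = True    # state is now up to date

    def route_read(self, obj_id):
        """Decide how a read is scheduled."""
        if not self.fast_path_enabled or obj_id in self.dirty:
            return "NORMAL_PROTOCOL"
        return "SINGLE_REPLICA"


def replica_accepts_fast_read(active_switch_id, read_switch_id):
    """Replicas drop single-replica reads that come from an old switch."""
    return read_switch_id == active_switch_id
```

Until the first completion carrying the new switch ID arrives, every read takes the normal protocol path, which is why throughput in this window matches the system without Harmonia.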
Figure 9: Read throughput (MRPS) as write rate (MRPS) increases, for a variety of replication protocols, with and without Harmonia. (a) Primary-backup protocols: PB, CR, CRAQ, Harmonia(PB) and Harmonia(CR). (b) Quorum-based protocols: VR, NOPaxos, Harmonia(VR) and Harmonia(NOPaxos).
Figure 10: Total throughput (MRPS) over time (s) while the Harmonia switch
is stopped and then reactivated.
10. Related Work
Replication protocols. Replication protocols are widely
used by storage systems to achieve strong consistency and
fault tolerance. Dating back to classic storage systems (e.g.,
Andrew [27], Sprite [42], Coda [53], Harp [39], RAID [47],
Zebra [25], and xFS [5]), they are now a mainstay of
cloud storage services (e.g., GFS [24], BigTable [18], Dynamo [22], HDFS [6], Ceph [58], Haystack [10], f4 [41],
and Windows Azure Storage [16]).
The primary-backup protocol [14] and its variations like
chain replication [56] and CRAQ [55] assign replicas
different roles (e.g., primary node, head node, and tail node),
and require operations to be executed by the replicas in a
certain order. Quorum-based protocols, such as Paxos [31],
ZAB [51], Raft [45], Viewstamped Replication [44] and Virtual Synchrony [11], only require an operation to be executed at a quorum, instead of all replicas. While they do not
distinguish the roles of replicas, they often employ an optimization that first elects a leader and then uses the leader to
commit operations to other nodes, which is very similar to
the primary-backup protocol. Vertical Paxos [33] proposes
to incorporate these two classes of protocols into a single
framework, by dividing a replication protocol into two parts:
one is a steady state protocol like the primary-backup protocol that optimizes for high performance, and the other is
a reconfiguration protocol like Paxos which handles system
reconfigurations, e.g., electing a leader.
CRAQ [55] is most similar in spirit to our work. It adapts
chain replication to allow any replica to answer reads for
uncontended objects by adding a second phase to the write
protocol: objects are first marked dirty, then updated. Harmonia achieves the same goal without the write overhead by
moving the contention detection into the network, and also
supports more general replication protocols.
Query scheduling. A related approach is taken in a line
of database replication systems that achieve consistent
transaction processing atop multiple databases, such as C-JDBC [17], FAS [52], and Ganymed [49]. These systems use
a query scheduler to orchestrate queries among replicas with
different states. The necessary logic is more complex for
database transactions (and sometimes necessitates weaker
isolation levels). Harmonia provides a near-zero-overhead
scheduler implementation for replication using the network.
In-network computing. The emerging programmable
switches introduce new opportunities to move computation into the network. NetCache [30] and IncBricks [40] introduce in-network caching for key-value
stores. NetChain [29] builds a strongly consistent, fault-tolerant, in-network key-value store for coordination services. These designs store object data in the switch data
plane; Harmonia consciously avoids this in order to be more
resource efficient. SwitchKV [37] leverages programmable
switches to realize content-based routing for load balancing in key-value stores. Eris [34] exploits programmable
switches to realize concurrency control for distributed transactions. NetPaxos [20, 21] implements Paxos on switches.
SpecPaxos [50] and NOPaxos [35] use switches to order
messages to improve replication protocols. With NetPaxos,
SpecPaxos and NOPaxos, reads still need to be executed by
a quorum, or by a leader if the leader-based optimization is
used. Harmonia improves these solutions by allowing reads
not in the dirty set to be executed by any replica.
11. Conclusion
In conclusion, we present Harmonia, a new replicated
storage architecture that achieves near-linear scalability and
guarantees linearizability with in-network conflict detection.
Harmonia leverages new-generation programmable switches
to efficiently track the dirty set and detect read-write conflicts in the network data plane with no performance overhead. Such a powerful capability enables Harmonia to safely
schedule reads to the replicas without sacrificing consistency. Harmonia demonstrates that rethinking the division
of labor between the network and end hosts makes it possible to achieve performance properties beyond the grasp of
distributed systems alone.
Ethics. This work does not raise any ethical issues.
References
[1] Hiredis: Redis library. https://redis.io/.
[2] Memcached key-value store. https://memcached.org/.
[3] Redis data structure store. https://redis.io/.
[4] D. G. Andersen, J. Franklin, M. Kaminsky, A. Phanishayee, L. Tan, and V. Vasudevan. FAWN: A fast array
of wimpy nodes. In ACM SOSP, October 2009.
[5] T. E. Anderson, M. D. Dahlin, J. M. Neefe, D. A. Patterson, D. S. Roselli, and R. Y. Wang. Serverless network file systems. ACM Transactions on Computer
Systems, February 1996.
[6] Apache Hadoop Distributed File System (HDFS).
http://hadoop.apache.org/.
[7] B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and
M. Paleczny. Workload analysis of a large-scale key-value store. In ACM SIGMETRICS, June 2012.
[8] Barefoot Capilano. https://www.barefootnetworks.com/technology/#capilano.
[9] Barefoot Tofino. https://www.barefootnetworks.com/technology/#tofino.
[10] D. Beaver, S. Kumar, H. C. Li, J. Sobel, P. Vajgel, et al.
Finding a needle in Haystack: Facebook’s photo storage. In USENIX OSDI, October 2010.
[11] K. Birman and T. Joseph. Exploiting Virtual Synchrony
in Distributed Systems. SIGOPS Operating Systems
Review, November 1987.
[12] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese, and D. Walker. P4: Programming
protocol-independent packet processors. SIGCOMM
CCR, July 2014.
[13] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, M. Izzard, F. Mujica, and M. Horowitz. Forwarding metamorphosis: Fast programmable matchaction processing in hardware for SDN. In ACM SIGCOMM, August 2013.
[14] N. Budhiraja, K. Marzullo, F. B. Schneider, and
S. Toueg. The primary-backup approach. In Distributed systems, 1993.
[15] M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In USENIX OSDI, November 2006.
[16] B. Calder, J. Wang, A. Ogus, N. Nilakantan,
A. Skjolsvold, S. McKelvie, Y. Xu, S. Srivastav, J. Wu,
H. Simitci, J. Haridas, C. Uddaraju, H. Khatri, A. Edwards, V. Bedekar, S. Mainali, R. Abbasi, A. Agarwal,
M. F. u. Haq, M. I. u. Haq, D. Bhardwaj, S. Dayanand,
A. Adusumilli, M. McNett, S. Sankaran, K. Manivannan, and L. Rigas. Windows Azure storage: A highly
available cloud storage service with strong consistency.
In ACM SOSP, October 2011.
[17] E. Cecchet, J. Marguerite, and W. Zwaenepoel. C-JDBC: Flexible database clustering middleware. In Proceedings of the 2004 USENIX Annual Technical Conference, Boston, MA, June 2004. USENIX.
[18] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A.
Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E.
Gruber. Bigtable: A distributed storage system for
structured data. In USENIX OSDI, November 2006.
[19] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan,
and R. Sears. Benchmarking cloud serving systems
with YCSB. In ACM Symposium on Cloud Computing, June 2010.
[20] H. T. Dang, M. Canini, F. Pedone, and R. Soulé. Paxos
made switch-y. SIGCOMM CCR, April 2016.
[21] H. T. Dang, D. Sciascia, M. Canini, F. Pedone, and
R. Soulé. NetPaxos: Consensus at network speed. In
ACM SOSR, June 2015.
[22] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian,
P. Vosshall, and W. Vogels. Dynamo: Amazon’s highly
available key-value store. In ACM SOSP, October
2007.
[23] R. Escriva, B. Wong, and E. G. Sirer. HyperDex: A
distributed, searchable key-value store. In ACM SIGCOMM, August 2012.
[24] S. Ghemawat, H. Gobioff, and S.-T. Leung. The
Google file system. In ACM SOSP, October 2003.
[25] J. H. Hartman and J. K. Ousterhout. The Zebra striped
network file system. ACM Transactions on Computer
Systems, August 1995.
[26] M. P. Herlihy and J. M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, July
1990.
[27] J. H. Howard, M. L. Kazar, S. G. Menees, D. A.
Nichols, M. Satyanarayanan, R. N. Sidebotham, and
M. J. West. Scale and performance in a distributed
file system. ACM Transactions on Computer Systems,
February 1988.
[28] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed.
ZooKeeper: Wait-free coordination for Internet-scale
systems. In USENIX ATC, June 2010.
[29] X. Jin, X. Li, H. Zhang, N. Foster, J. Lee, R. Soulé,
C. Kim, and I. Stoica. NetChain: Scale-free sub-RTT
coordination. In USENIX NSDI, April 2018.
[30] X. Jin, X. Li, H. Zhang, R. Soulé, J. Lee, N. Foster,
C. Kim, and I. Stoica. NetCache: Balancing key-value
stores with fast in-network caching. In ACM SOSP,
October 2017.
[31] L. Lamport. The part-time parliament. ACM Transactions on Computer Systems, May 1998.
[32] L. Lamport. Paxos made simple. ACM SIGACT News,
December 2001.
[33] L. Lamport, D. Malkhi, and L. Zhou. Vertical paxos
and primary-backup replication. In ACM PODC, August 2009.
[34] J. Li, E. Michael, and D. R. K. Ports. Eris:
Coordination-free consistent transactions using in-network concurrency control. In ACM SOSP, October
2017.
[35] J. Li, E. Michael, N. K. Sharma, A. Szekeres, and
D. R. Ports. Just say NO to Paxos overhead: Replacing
consensus with network ordering. In USENIX OSDI,
November 2016.
[36] J. Li, E. Michael, A. Szekeres, N. K. Sharma, and
D. R. K. Ports. Just say NO to Paxos overhead: Replacing consensus with network ordering (extended version). Technical Report UW-CSE-TR-16-09-02, University of Washington CSE, Seattle, WA, USA, 2016.
[37] X. Li, R. Sethi, M. Kaminsky, D. G. Andersen, and
M. J. Freedman. Be fast, cheap and in control with
SwitchKV. In USENIX NSDI, March 2016.
[38] B. Liskov and J. Cowling. Viewstamped replication revisited. Technical Report MIT-CSAIL-TR-2012-021,
MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA, July 2012.
[39] B. Liskov, S. Ghemawat, R. Gruber, P. Johnson,
L. Shrira, and M. Williams. Replication in the Harp
file system. In ACM SOSP, October 1991.
[40] M. Liu, L. Luo, J. Nelson, L. Ceze, A. Krishnamurthy,
and K. Atreya. IncBricks: Toward in-network computation with an in-network cache. In ACM ASPLOS, April
2017.
[41] S. Muralidhar, W. Lloyd, S. Roy, C. Hill, E. Lin,
W. Liu, S. Pan, S. Shankar, V. Sivakumar, L. Tang,
et al. f4: Facebook’s warm BLOB storage system.
In USENIX OSDI, October 2014.
[42] M. N. Nelson, B. B. Welch, and J. K. Ousterhout.
Caching in the Sprite network file system. ACM Transactions on Computer Systems, February 1988.
[43] R. Nishtala, H. Fugal, S. Grimm, M. Kwiatkowski,
H. Lee, H. C. Li, R. McElroy, M. Paleczny, D. Peek,
P. Saab, D. Stafford, T. Tung, and V. Venkataramani.
Scaling Memcache at Facebook. In USENIX NSDI,
April 2013.
[44] B. M. Oki and B. H. Liskov. Viewstamped replication: A new primary copy method to support highly-available distributed systems. In ACM PODC, August
1988.
[45] D. Ongaro and J. Ousterhout. In search of an understandable consensus algorithm. In USENIX ATC, June
2014.
[46] J. Ousterhout, A. Gopalan, A. Gupta, A. Kejriwal,
C. Lee, B. Montazeri, D. Ongaro, S. J. Park, H. Qin,
M. Rosenblum, S. Rumble, R. Stutsman, and S. Yang.
The RAMCloud storage system. ACM Transactions on
Computer Systems, August 2015.
[47] D. A. Patterson, G. Gibson, and R. H. Katz. A case for
redundant arrays of inexpensive disks (RAID). In ACM
SIGMOD, June 1988.
[48] A. Phanishayee, D. G. Andersen, H. Pucha, A. Povzner,
and W. Belluomini. Flex-KV: Enabling high-performance and flexible KV systems. In Workshop on
Management of Big Data Systems (MBDS), September
2012.
[49] C. Plattner and G. Alonso. Ganymed: Scalable replication for transactional web applications. In Proceedings
of the International Middleware Conference, Toronto,
Ontario, Canada, October 2004.
[50] D. R. K. Ports, J. Li, V. Liu, N. K. Sharma, and
A. Krishnamurthy. Designing distributed systems using approximate synchrony in data center networks. In
USENIX NSDI, May 2015.
[51] B. Reed and F. P. Junqueira. A simple totally ordered
broadcast protocol. In ACM Large-Scale Distributed
Systems and Middleware, September 2008.
[52] U. Röhm, K. Böhm, H.-J. Schek, and H. Schuldt. FAS
— a freshness-sensitive coordination middleware for
a cluster of OLAP components. In Proceedings of
the 28th International Conference on Very Large Data
Bases (VLDB ’02), Hong Kong, China, August 2002.
[53] M. Satyanarayanan, J. J. Kistler, P. Kumar, M. E.
Okasaki, E. H. Siegel, and D. C. Steere. Coda: A highly
available file system for a distributed workstation environment. IEEE Transactions on Computers, April
1990.
[54] N. K. Sharma, A. Kaufmann, T. E. Anderson, A. Krishnamurthy, J. Nelson, and S. Peter. Evaluating the power
of flexible packet processing for network resource allocation. In USENIX NSDI, March 2017.
[55] J. Terrace and M. J. Freedman. Object storage on
CRAQ: High-throughput chain replication for read-mostly workloads. In USENIX ATC, June 2009.
[56] R. Van Renesse and F. B. Schneider. Chain replication for supporting high throughput and availability. In
USENIX OSDI, December 2004.
[57] V. Venkataramani, Z. Amsden, N. Bronson, G. Cabrera III, P. Chakka, P. Dimov, H. Ding, J. Ferris,
A. Giardullo, J. Hoon, S. Kulkarni, N. Lawrence,
M. Marchukov, D. Petrov, and L. Puzar. TAO: How
Facebook serves the social graph. In ACM SIGMOD,
May 2012.
[58] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. Long,
and C. Maltzahn. Ceph: A scalable, high-performance
distributed file system. In USENIX OSDI, November
2006.
APPENDIX
A. PROOF OF CORRECTNESS
Notation. Let Q be a request, Q.commit be the last-committed sequence number added to the request by the
switch, R be a replica, R.obj be the local copy of an object at the replica, R.obj.seq be the sequence number of the
most recent update of the object at the replica, and R.seq be
the sequence number of the most recent write executed by
the replica to any object.
THEOREM 1. Harmonia preserves linearizability of the
replication protocols.
PROOF. We prove that Harmonia preserves linearizability of
the replication protocols under both normal and failure scenarios. We use a fail-stop model. All write operations are
processed by the replication protocol, and are processed in
sequence number order. We need only, then, consider the
read operations. The following two properties are sufficient
for linearizability.
• P1. Visibility. A read operation sees the effects of all write
operations that finished before it started.
• P2. Integrity. A read operation will not see the effects of
any write operation that had not committed at the time the
read finished.
Normal scenario. Harmonia uses the dirty set to detect
potential conflicts, and if no conflicts are detected, reads
are scheduled to a random replica for better performance.
Harmonia leverages the last-committed sequence number to
guarantee linearizability.
For read-ahead protocols including primary-backup and
chain replication, writes are committed to all replicas. Therefore, P1 is satisfied. However, we must also verify that a read
will not see the effect of an uncommitted write. Consider a
single-replica read Q that arrives at a replica R to retrieve
object obj. It may be the case that R has applied uncommitted writes to obj. Therefore, R compares Q.commit with
R.obj.seq. If Q.commit < R.obj.seq, then R forwards Q
for handling by the normal protocol, which will return a consistent result. However, if Q.commit ≥ R.obj.seq, this implies
that the latest write to R.obj has already been committed,
since writes are applied by the replication protocol in sequence number order. Therefore, P2 is satisfied.
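The read-ahead check above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `ReadAheadReplica`, its fields, and the `"local"`/`"forward"` return tags are hypothetical names, and sequence numbers are modeled as plain integers rather than the (switch ID, counter) pairs used with multiple switches.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class ReadAheadReplica:
    """Hypothetical replica of a read-ahead protocol (e.g., primary-backup)."""
    values: Dict[str, str] = field(default_factory=dict)
    # R.obj.seq: sequence number of the latest write applied to each object.
    obj_seq: Dict[str, int] = field(default_factory=dict)

    def apply_write(self, obj: str, value: str, seq: int) -> None:
        # Read-ahead replicas may apply a write before it is committed.
        self.values[obj] = value
        self.obj_seq[obj] = seq

    def handle_single_replica_read(
        self, obj: str, q_commit: int
    ) -> Tuple[str, Optional[str]]:
        # Serve locally only if the latest applied write to obj is known
        # to be committed, i.e. Q.commit >= R.obj.seq.
        if q_commit >= self.obj_seq.get(obj, 0):
            return ("local", self.values.get(obj))
        # Otherwise hand the read to the normal replication protocol.
        return ("forward", None)
```

In this sketch, a replica that has applied write 5 to an object forwards a read stamped with Q.commit = 4 and serves one stamped with Q.commit = 5.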
For read-behind protocols including Viewstamped Replication and NOPaxos, the replicas first append writes to a local log. They execute and apply writes only after they have
been committed. As such, reading from local state will only
ever reflect the results of committed writes, and P2 is satisfied. However, we must also verify that a read will see
the effect of all committed writes. Again, consider a single-replica
read Q that arrives at a replica R to retrieve object
obj. It may be the case that there were writes to obj committed before Q was sent which, nevertheless, R has not yet
executed. Therefore, R compares Q.commit with R.seq. If
Q.commit > R.seq, then R forwards Q for handling by
the normal protocol, which will return a consistent result.
However, if Q.commit ≤ R.seq, this implies that R has executed all writes to obj which were committed at the time Q
was forwarded by the Harmonia switch; otherwise the last
committed sequence number on the switch would have been
larger or the dirty set would have contained obj. Therefore,
P1 is satisfied.
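A corresponding sketch for the read-behind case follows. As before, `ReadBehindReplica` and its method names are hypothetical, and sequence numbers are simplified to integers. The only state consulted is R.seq, the sequence number of the most recent write the replica has executed.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class ReadBehindReplica:
    """Hypothetical replica of a read-behind protocol (e.g., VR, NOPaxos)."""
    values: Dict[str, str] = field(default_factory=dict)
    seq: int = 0  # R.seq: latest write executed by this replica

    def execute_committed_write(self, obj: str, value: str, seq: int) -> None:
        # Read-behind replicas execute writes only after they are committed,
        # and in sequence number order.
        self.values[obj] = value
        self.seq = seq

    def handle_single_replica_read(
        self, obj: str, q_commit: int
    ) -> Tuple[str, Optional[str]]:
        # Serve locally only if the replica has caught up with every write
        # the switch knew to be committed, i.e. Q.commit <= R.seq.
        if q_commit <= self.seq:
            return ("local", self.values.get(obj))
        # The replica may be missing committed writes; use the normal protocol.
        return ("forward", None)
```

Note the asymmetry with the read-ahead check: here the comparison uses the replica-wide R.seq rather than a per-object sequence number, because the risk is a lagging replica, not an uncommitted local update.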
Switch failure. First, notice that in the above argument, we
relied on two key facts about the state held at the switch at
the time it forwards a single-replica read that will be served
by a replica. The first is that the dirty set contains all objects
with outstanding, uncommitted writes. The second is that for
all writes committed by the replication protocol, either the
sequence number of that write is less than or equal to the
last committed sequence number on the switch or the object
being written to is in the switch’s dirty set. Now, we want to
show that these two facts still hold when there are multiple
Harmonia switches, each receiving reads and writes.
In order for a switch to be able to forward single-replica
reads at all, the switch must first have received a single
WRITE-COMPLETION message with its switch ID. Since
sequence numbers are ordered lexicographically using the
switch ID first and writes are applied in order by the replication protocol, the switch’s dirty set must contain all uncommitted writes with switch ID less than or equal to its
own, and all committed writes with sequence numbers with
switch ID less than or equal to the switch’s must either have
matching entries in the dirty set or be less than or equal to the
last committed sequence number. Furthermore, if the switch
forwards a single-replica read that will actually be served
by a replica, that must mean that at the time it forwarded
that read, no switch with larger switch ID could have yet
sent any writes. Otherwise, the replicas would have already
agreed to permanently disallow single-replica reads from the
switch in question. Therefore, the switch’s state, at the time
it forwarded the single-replica read, was suitably up-to-date,
and both of our key properties held.
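The lexicographic ordering used in this argument can be illustrated with a small Python sketch. The names `SeqNum` and `ReplicaGuard` are hypothetical, and the explicit `activate` call is a simplifying assumption that stands in for the replicas agreeing on a new switch through the replication protocol.

```python
from typing import NamedTuple


class SeqNum(NamedTuple):
    """Sequence number ordered by switch ID first, then per-switch counter."""
    switch_id: int
    seq: int


class ReplicaGuard:
    """Hypothetical replica-side filter for single-replica reads."""

    def __init__(self) -> None:
        self.active_switch = 1

    def activate(self, switch_id: int) -> None:
        # Assumption: once a larger switch ID is agreed upon, older switches
        # are permanently barred from issuing single-replica reads.
        self.active_switch = max(self.active_switch, switch_id)

    def accepts_single_replica_read(self, switch_id: int) -> bool:
        return switch_id == self.active_switch


# Named tuples compare field by field, so any write from switch 2 orders
# after every write from switch 1, regardless of the per-switch counter.
assert SeqNum(2, 1) > SeqNum(1, 100)
assert max(SeqNum(1, 7), SeqNum(1, 9)) == SeqNum(1, 9)
```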
Server failure. A server failure is handled by the replication
protocol. However, before a failed server is removed from
the system, the protocol must ensure that the failed server is
first removed from the current switch’s routing information.
As long as this requirement is met, then all servers receiving
single-replica reads can return linearizable results.
B. HARMONIA SPECIFICATION
MODULE harmonia
Specifies the Harmonia protocol.
EXTENDS Naturals, FiniteSets, Sequences, TLC
Constants and Variables
CONSTANTS dataItems,       Set of model values representing data items
          numSwitches,     Number of total switches
          replicas,        Set of model values representing the replicas
          isReadBehind     Whether the replication protocol is read-behind
ASSUME ∧ IsFiniteSet(dataItems)
       ∧ IsFiniteSet(replicas)
       ∧ numSwitches > 0
       ∧ isReadBehind ∈ {TRUE, FALSE}
isReadAhead ≜ ¬isReadBehind
VARIABLE messages,               The network, a set of all messages sent
         switchStates,           The state of the Harmonia switches
         activeSwitch,           The switch allowed to send Harmonia reads
         sharedLog,              The main log decided on by replication protocol
         replicaCommitPoints     The latest write processed by each replica
A value smaller than all writes sent by switches
BottomWrite ≜ [switchNum ↦ 0, seq ↦ 0]
Message Schemas
Write (Switch to Replication Protocol)
[ mtype     ↦ MWrite,
  switchNum ↦ i ∈ (1 .. numSwitches),
  seq       ↦ i ∈ (1 .. ),
  dataItem  ↦ d ∈ dataItems ]
The ProtocolRead, HarmoniaRead, and ReadResponse messages contain a field (ghostLastReponse) which is not used in the protocol,
and is only present to aid in the definition of linearizability.
ProtocolRead (Switch to Replication Protocol)
[ mtype            ↦ MProtocolRead,
  dataItem         ↦ d ∈ dataItems,
  ghostLastReponse ↦ w ∈ WRITES (the set of all MWrite messages) ]
HarmoniaRead (Switch to Replicas)
[ mtype            ↦ MHarmoniaRead,
  dataItem         ↦ d ∈ dataItems,
  switchNum        ↦ i ∈ (1 .. numSwitches),
  lastCommitted    ↦ w ∈ WRITES,
  ghostLastReponse ↦ w ∈ WRITES ]
ReadResponse (Replicas/Replication Protocol to Client)
[ mtype            ↦ MReadResponse,
  write            ↦ w ∈ WRITES,
  ghostLastReponse ↦ w ∈ WRITES ]
CONSTANTS MWrite,
          MProtocolRead,
          MHarmoniaRead,
          MReadResponse
Init ≜ ∧ messages = {}
       ∧ switchStates = [i ∈ (1 .. numSwitches) ↦
                            [seq           ↦ 0,
                             dirtySet      ↦ [d ∈ {} ↦ 0],
                             lastCommitted ↦ BottomWrite]]
       ∧ activeSwitch = 1
       ∧ replicaCommitPoints = [r ∈ replicas ↦ 0]
       ∧ sharedLog = ⟨⟩
Helper and Utility Functions
Basic utility functions
Range(t) ≜ {t[i] : i ∈ DOMAIN t}
Min(S) ≜ CHOOSE s ∈ S : ∀ sp ∈ S : sp ≥ s
Max(S) ≜ CHOOSE s ∈ S : ∀ sp ∈ S : sp ≤ s
Sequence number functions
GTE(w1, w2) ≜ ∨ w1.switchNum > w2.switchNum
              ∨ ∧ w1.switchNum = w2.switchNum
                ∧ w1.seq ≥ w2.seq
GT(w1, w2) ≜ ∨ w1.switchNum > w2.switchNum
             ∨ ∧ w1.switchNum = w2.switchNum
               ∧ w1.seq > w2.seq
MinW(W) ≜ CHOOSE w ∈ W : ∀ wp ∈ W : GTE(wp, w)
MaxW(W) ≜ CHOOSE w ∈ W : ∀ wp ∈ W : GTE(w, wp)
Common log-processing functions
CommittedLog ≜ IF isReadBehind
               THEN sharedLog
               ELSE SubSeq(sharedLog, 1, Min(Range(replicaCommitPoints)))
MaxCommittedWriteForIn(d, log) ≜
    MaxW({BottomWrite} ∪ {m ∈ Range(log) : m.dataItem = d})
MaxCommittedWriteFor(d) ≜ MaxCommittedWriteForIn(d, CommittedLog)
MaxCommittedWrite ≜ MaxW({BottomWrite} ∪ Range(CommittedLog))
Short-hand way of sending a message
Send(m) ≜ messages′ = messages ∪ {m}
Main Spec
Using the ghost variables in reads and read responses, we can define our main safety property (linearizability) rather simply.
Linearizability ≜ ∀ m ∈ {mp ∈ messages : mp.mtype = MReadResponse} :
                      ∧ GTE(m.write, m.ghostLastReponse)
                      ∧ ∨ m.write ∈ Range(CommittedLog)
                        ∨ m.write = BottomWrite
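For readers less familiar with TLA+, the safety property can be paraphrased as a Python predicate. This is an informal sketch, not part of the specification: the types `Write` and `ReadResponse` and the function names are hypothetical, and the committed log is passed in explicitly rather than derived from replica state.

```python
from typing import Iterable, List, NamedTuple, Optional


class Write(NamedTuple):
    switch_num: int
    seq: int
    data_item: Optional[str] = None


class ReadResponse(NamedTuple):
    write: Write                # the write whose value the read returned
    ghost_last_response: Write  # newest write the read was obliged to see


BOTTOM_WRITE = Write(0, 0)  # smaller than every real write


def gte(w1: Write, w2: Write) -> bool:
    # Lexicographic order on (switch number, per-switch sequence number).
    return (w1.switch_num, w1.seq) >= (w2.switch_num, w2.seq)


def is_linearizable(responses: Iterable[ReadResponse],
                    committed_log: List[Write]) -> bool:
    # Every response must be at least as new as the ghost value (visibility)
    # and must come from the committed log or be the initial value (integrity).
    return all(
        gte(r.write, r.ghost_last_response)
        and (r.write in committed_log or r.write == BOTTOM_WRITE)
        for r in responses
    )
```

The two conjuncts correspond to properties P1 and P2 of the proof in Appendix A.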
Actions and Message Handlers
Switch s sends write for data item d
SendWrite(s, d) ≜
    LET
        nextSeq ≜ switchStates[s].seq + 1
    IN
        Only activated switches can send writes
        ∧ s ≤ activeSwitch
        ∧ switchStates′ = [switchStates EXCEPT ![s] =
                              [@ EXCEPT !.seq = nextSeq,
                                        !.dirtySet = (d :> nextSeq) @@ @]]
        ∧ Send([mtype     ↦ MWrite,
                switchNum ↦ s,
                seq       ↦ nextSeq,
                dataItem  ↦ d])
        ∧ UNCHANGED ⟨activeSwitch, replicaCommitPoints, sharedLog⟩
Add write w to the shared log
HandleWrite(w) ≜
    The replication protocol adds writes in order
    ∧ ∨ Len(sharedLog) = 0
      ∨ ∧ Len(sharedLog) > 0
        ∧ GTE(w, sharedLog[Len(sharedLog)])
    ∧ sharedLog′ = Append(sharedLog, w)
    ∧ UNCHANGED ⟨messages, switchStates, activeSwitch, replicaCommitPoints⟩
The switch processes a write completion for write w
ProcessWriteCompletion(w) ≜
    LET
        s ≜ w.switchNum
        ds ≜ switchStates[s].dirtySet
        dsp ≜ [dp ∈ {d ∈ DOMAIN ds : ds[d] > w.seq} ↦ ds[dp]]
    IN
        Write is committed (processed by all in read-ahead mode)
        ∧ GTE(MaxCommittedWrite, w)
        ∧ switchStates′ = [switchStates EXCEPT ![s] =
                              [@ EXCEPT !.dirtySet = dsp,
                                        !.lastCommitted = MaxW({@, w})]]
        ∧ UNCHANGED ⟨messages, activeSwitch, replicaCommitPoints, sharedLog⟩
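The switch-side bookkeeping in SendWrite and ProcessWriteCompletion can be mirrored in Python as follows. This is a single-switch sketch under simplifying assumptions: `SwitchState` and its method names are hypothetical, sequence numbers are plain integers, and the dirty set is an unbounded dictionary rather than a fixed-size data-plane structure.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SwitchState:
    """Hypothetical model of one switch's Harmonia state."""
    seq: int = 0
    # Maps each object with a pending write to that write's sequence number.
    dirty_set: Dict[str, int] = field(default_factory=dict)
    last_committed: int = 0

    def send_write(self, obj: str) -> int:
        # Stamp the write with the next sequence number and mark obj dirty.
        self.seq += 1
        self.dirty_set[obj] = self.seq
        return self.seq

    def process_write_completion(self, seq: int) -> None:
        # Keep only entries whose latest pending write is newer than the
        # completed one; a later write to the same object keeps it dirty.
        self.dirty_set = {o: s for o, s in self.dirty_set.items() if s > seq}
        # Completions may arrive out of order, so never move backwards.
        self.last_committed = max(self.last_committed, seq)
```

For example, after writes to x, y, and x again (sequence numbers 1, 2, 3), a completion for write 2 removes y from the dirty set but leaves x, whose newest write is still pending.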
Replica r locally commits the next write from the shared log
CommitWrite(r) ≜
    ∧ Len(sharedLog) > replicaCommitPoints[r]
    ∧ replicaCommitPoints′ = [replicaCommitPoints EXCEPT ![r] = @ + 1]
    ∧ UNCHANGED ⟨messages, switchStates, activeSwitch, sharedLog⟩
Switch s sends read for data item d
SendRead(s, d) ≜
    LET
        shouldSendHarmoniaRead ≜
            ∧ d ∉ DOMAIN switchStates[s].dirtySet
            Can send Harmonia reads after one completion
            ∧ GT(switchStates[s].lastCommitted, BottomWrite)
        returnedReads ≜ {m.write :
            m ∈ {mp ∈ messages : ∧ mp.mtype = MReadResponse
                                 ∧ mp.write ≠ BottomWrite
                                 ∧ mp.write.dataItem = d}}
        lr ≜ MaxW({MaxCommittedWriteFor(d)} ∪ returnedReads)
    IN
        ∧ ∨ ∧ shouldSendHarmoniaRead
            ∧ Send([mtype            ↦ MHarmoniaRead,
                    dataItem         ↦ d,
                    switchNum        ↦ s,
                    lastCommitted    ↦ switchStates[s].lastCommitted,
                    ghostLastReponse ↦ lr])
          ∨ ∧ ¬shouldSendHarmoniaRead
            ∧ Send([mtype            ↦ MProtocolRead,
                    dataItem         ↦ d,
                    ghostLastReponse ↦ lr])
        ∧ UNCHANGED ⟨switchStates, activeSwitch, replicaCommitPoints, sharedLog⟩
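The routing decision made by SendRead can be summarized by the following Python sketch. It omits the ghost field, which exists only to state the safety property, and models the last-committed point as an integer. The function name `route_read` and the dictionary message format are hypothetical.

```python
from typing import Dict, Union

Message = Dict[str, Union[str, int]]


def route_read(obj: str, dirty_set: Dict[str, int],
               last_committed: int) -> Message:
    """Choose between a single-replica read and a normal protocol read."""
    # A single-replica (Harmonia) read is allowed only if obj has no pending
    # write and the switch has seen at least one write completion.
    if obj not in dirty_set and last_committed > 0:
        return {
            "mtype": "HarmoniaRead",
            "dataItem": obj,
            # Stamped so the replica can run its local consistency check.
            "lastCommitted": last_committed,
        }
    # Otherwise fall back to the replication protocol's own read path.
    return {"mtype": "ProtocolRead", "dataItem": obj}
```

The second condition covers a freshly activated switch: until it has processed one completion, it cannot know that its dirty set and last-committed point are up to date, so all of its reads take the normal path.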
Process protocol read m
HandleProtocolRead(m) ≜
    ∧ Send([mtype            ↦ MReadResponse,
            write            ↦ MaxCommittedWriteFor(m.dataItem),
            ghostLastReponse ↦ m.ghostLastReponse])
    ∧ UNCHANGED ⟨switchStates, activeSwitch, replicaCommitPoints, sharedLog⟩
Replica r receives Harmonia read m
HandleHarmoniaRead(r, m) ≜
    LET
        cp ≜ replicaCommitPoints[r]
        w ≜ MaxCommittedWriteForIn(m.dataItem, SubSeq(sharedLog, 1, cp))
    IN
        Can only accept Harmonia reads from the active switch
        ∧ m.switchNum = activeSwitch
        ∧ ∨ ∧ isReadBehind
            Replica can only process read if it is up-to-date
            ∧ GTE(IF cp > 0 THEN sharedLog[cp] ELSE BottomWrite, m.lastCommitted)
          ∨ ∧ isReadAhead
            Replica can only process read if write was completed
            ∧ GTE(m.lastCommitted, w)
        ∧ Send([mtype            ↦ MReadResponse,
                write            ↦ w,
                ghostLastReponse ↦ m.ghostLastReponse])
        ∧ UNCHANGED ⟨switchStates, activeSwitch, replicaCommitPoints, sharedLog⟩
SwitchFailover ≜
    ∧ activeSwitch < numSwitches
    ∧ activeSwitch′ = activeSwitch + 1
    ∧ UNCHANGED ⟨messages, switchStates, replicaCommitPoints, sharedLog⟩
Main Transition Function
Next ≜ ∨ ∃ s ∈ (1 .. numSwitches) :
           ∃ d ∈ dataItems : ∨ SendWrite(s, d)
                             ∨ SendRead(s, d)
       ∨ ∃ m ∈ Range(sharedLog) : ProcessWriteCompletion(m)
       ∨ ∃ m ∈ messages : ∨ ∧ m.mtype = MWrite
                            ∧ HandleWrite(m)
                          ∨ ∧ m.mtype = MProtocolRead
                            ∧ HandleProtocolRead(m)
                          ∨ ∃ r ∈ replicas :
                              ∧ m.mtype = MHarmoniaRead
                              ∧ HandleHarmoniaRead(r, m)
       ∨ ∃ r ∈ replicas : CommitWrite(r)
       ∨ SwitchFailover