Intrusion-Tolerant Parsimonious State Machine Replication ∗
HariGovind V. Ramasamy Adnan Agbaria William H. Sanders
University of Illinois, Urbana, IL 61801, USA
{ramasamy, adnan, whs}@crhc.uiuc.edu
Abstract
We describe a Byzantine-fault-tolerant state machine replication algorithm that reduces computation and communication costs in the fault-free case, and is reasonably efficient even in the presence of faults. Such an algorithm is
practically significant, because failures are the exception rather than the norm, and much of a system’s runtime is fault-free.
The algorithm is geared towards applications that require Byzantine-fault tolerance, and also require that redundant
processing and wasteful resource use be reduced as much as possible (e.g., critical computations on the Grid).
1 Introduction
Intrusion tolerance recognizes the impracticability of defending a system against all attacks, and focuses on providing an acceptable QoS despite the compromise of some subsystems. A common way to construct an intrusion-tolerant
system is to structure the system as a state machine, replicate the system on multiple nodes, and coordinate the nodes
using a Byzantine-fault-tolerant replication protocol. Of course, this approach works only if the replica failure probabilities are not strongly correlated. Recent work in Byzantine-fault tolerance has focused on protocols that are practical
and relevant to the implementation of real services (e.g., [1, 2, 5]). SINTRA [1] implements protocols based on threshold public-key cryptography for intrusion-tolerant replication over the Internet. PBFT [2] is a replication protocol that
includes many optimizations, including the use of message authentication codes (MACs) instead of digital signatures.
Yin et al. [5] observed that separating agreement on the order of request executions from the actual execution of requests can help improve confidentiality and also decrease the number of replicas needed to execute the requests from
3t + 1 to 2t + 1, where t is the maximum number of replicas that could be simultaneously faulty.
Like [1, 2, 5], our work also deals with Byzantine-fault-tolerant state machine replication that does not rely on
synchrony assumptions for correctness. We explore another way to make Byzantine-fault tolerance useful for real
services: reduction of computation and communication costs in the fault-free case. A Byzantine-fault-tolerant replication algorithm that reduces costs in the fault-free case would be practically significant, because experience shows
that failures are the exception rather than the norm and that much of a system’s runtime is fault-free. Our algorithm
is geared towards applications that require Byzantine-fault tolerance, and also require that redundant processing and
wasteful resource use be minimized. For example, acquiring the 2t + 1 nodes required to
execute a t-Byzantine-fault-resilient critical computation on the Grid or SETI@HOME may not be a problem. However, it is desirable to reduce the overall CPU usage and network traffic due to that computation, so as to reduce the
computation’s performance impact felt by the regular users of those nodes.
While existing Byzantine-fault-tolerant state machine replication protocols (e.g., [1, 2, 5]) require all replicas to
process requests in all protocol runs (irrespective of whether the runs are fault-free or with faults), our approach
requires only a subset of the replicas, called the primary committee, to process requests in fault-free runs; hence our
approach is parsimonious. When any primary committee member exhibits faulty behavior, a reconfiguration of the
group occurs, resulting in a reselection of the primary committee. We do not rely on a group membership service to
remove the faulty primary committee members from the replication group in order to make progress.
2 The Algorithm
We consider a client-server system in which there are two kinds of processes: a trusted client called the gateway
(similar to a real-world corporate gateway) and n server replicas (denoted by the set S) that implement deterministic
state machines. The replicas start with the same initial state. The gateway and server replicas run on different nodes in
an asynchronous distributed system. Communication between the processes takes place through FIFO, quasi-reliable
(no message loss if the sender and receiver are correct), and asynchronous channels¹. The gateway can be viewed as
a serializing mechanism that accepts requests from multiple clients, enforces a total order among their requests, and
forwards the requests to all the server replicas in the same order. Our algorithm complements Yin et al.’s work [5],
in that our algorithm can be deployed in their execution cluster, which consists of replicas that actually execute the
totally ordered requests coming from the agreement cluster (which is the conceptual equivalent of our gateway).
∗ This research has been supported by DARPA contract F30602-00-C-0172.
¹ Indeed, our algorithm can be extended to handle unreliable channels that discard or reorder messages, but our assumptions allow us to avoid a discussion (about retransmissions and negative acks) that is not central to the contribution of the paper.
Correct replicas follow their specifications, while faulty ones can behave arbitrarily. A computationally bounded
adversary can coordinate faulty replicas in arbitrary ways. The minimum number of replicas needed to tolerate t
Byzantine faults among the replicas in such a system is n = 2t + 1. All messages sent in the algorithm are digitally
signed by the sender using public-key cryptography. If a message recipient is unable to verify the signature on the
message, it simply discards the message.
Any replication algorithm is required to guarantee safety and termination. Termination ensures that if the gateway
sends a request, it eventually accepts a response. Safety ensures that (1) at any two correct replicas i and j, the xth
updates to their internal states as a result of processing the xth gateway requests are the same (total order), (2) any
correct replica executes the xth update to its internal state at most once, and only if the gateway sent the xth request
(update integrity), and (3) for any response corresponding to the xth request accepted by the gateway, at least one
correct replica sent that response (response integrity). Our algorithm does not rely on synchrony assumptions for
safety. However, we rely on synchrony assumptions for liveness.
In addition to the usual safety and liveness properties, our algorithm also guarantees parsimony². This property
characterizes the reduced processing and communication activity involved during fault-free behavior of our algorithm,
and distinguishes our algorithm from other Byzantine-fault-tolerant state machine replication algorithms. Informally,
the property states that under normal circumstances, only the replicas in some (t + 1)-subset of S process the gateway
request and send the state update to other replicas and the response to the gateway. We call this (t + 1)-subset the
primary committee and the other replicas the backups. A committee size of t + 1 is the minimum needed to rule out the case in
which the committee consists entirely of Byzantine replicas that faithfully send the required messages to other processes
but carry out the wrong processing, resulting in wrong responses to gateway requests and bad updates to backups.
Suppose that PC denotes the set whose elements are (t + 1)-subsets of S that could constitute primary committees.
Consider two correct replicas i and j such that i ∈ P ∧ i ∉ Q, j ∈ Q ∧ j ∉ P, and P, Q ∈ PC. Then, the parsimony
property could be formally stated as follows: If both i and j process the same gateway request, then at least one of
the following is true: (1) at least one replica in P exhibited a commission fault³, or is suspected by t + 1 replicas of
having exhibited an omission fault⁴, and/or (2) at least one replica in Q exhibited a commission fault, or is suspected
by t + 1 replicas of having exhibited an omission fault.
All replicas are initialized with the same primary committee. The gateway sends a request message to all replicas.
The message carries a sequence number and the state machine operation to be executed. Request messages arrive at all
replicas in the same order without any gaps. Each replica maintains its own receive queue for incoming messages from
the gateway, and a handled set that contains the requests that have been processed. A committee member executes
requests in sequence number order. After executing the operation indicated by a gateway request, a committee member
generates a reply message and an update message. The reply message carries the result r of executing the operation
requested by the gateway. The update message carries the update u to the replica’s state as a result of processing the
request. A committee member sends the reply message to the gateway. The committee member also combines the reply
and update messages into a single reply-update message, which it sends to all the replicas. In the normal fault-free case,
the gateway will receive reply messages from t + 1 replicas with identical r values. The gateway can then accept r as
the result for its request message. Similarly, each replica j (whether it is a committee member or backup) will receive
reply and update messages with identical r and u values (respectively) from the t + 1 committee members. Replica j
can then perform the state update suggested by u, remove the head of the receive queue, and add it to the handled set.
Then the algorithm can move on to service the next gateway request if the receive queue is non-empty. For simplicity,
in the rest of this section, we will deal with a single gateway request that has the sequence number x.
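
As an illustration only, the gateway's acceptance rule in the fault-free case can be sketched in Python as follows. The function name accepted_result and the (replica_id, r) pair representation are assumptions made for this sketch, not part of the algorithm's specification, and signatures are assumed to have been verified before the replies reach this function.

```python
from collections import defaultdict


def accepted_result(replies, t):
    """Return the result r reported by t + 1 distinct replicas, or None.

    `replies` is an iterable of (replica_id, r) pairs (hypothetical
    representation); r must be hashable.
    """
    voters = defaultdict(set)  # r -> ids of the replicas that reported r
    for replica_id, r in replies:
        # A set ignores repeated replies from the same replica.
        voters[r].add(replica_id)
        if len(voters[r]) >= t + 1:
            return r
    return None
```

Because at most t replicas are faulty, any value reported by t + 1 distinct replicas was sent by at least one correct replica, which is what response integrity requires.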
As many as t out of the t + 1 committee members could be faulty. When the xth gateway request is being serviced,
a faulty committee member could try to disrupt the algorithm by sending bad messages (wrong r value in a reply
message or wrong u value in an update message) to one or more processes. It could also refuse to send reply and/or
update messages to one or more processes. Even a committee member that has not been corrupted by the adversary
may be experiencing transient (CPU or network) load conditions that prevent its reply and/or update messages from
being received in time. In such cases, a reconfiguration protocol has to be initiated to ensure liveness. At the same
time, a faulty replica must not be able to initiate the reconfiguration protocol when the committee indeed consists of
correct members. Hence, our algorithm allows the reconfiguration protocol to be initiated by a replica only through a
reselect message that carries proper “justification.” A valid reselect message that has proper justification will convince
a recipient of the message that at least one committee member has exhibited a commission fault or is suspected by
t + 1 replicas of having exhibited an omission fault. The justification for a commission fault must be one of the three
types of suspect-commission messages, and the justification for an omission fault must be suspect-omission messages
from t + 1 replicas (described below). The convinced recipient will send its own reselect message to other replicas.
² One can view our algorithm as an adaptation of the parsimony principle used in crash-tolerant semi-passive replication [3] to the Byzantine fault model. Semi-passive replication is a variant of the primary-backup approach, which does not rely on a group membership service to make progress in the presence of a faulty primary.
³ A commission fault occurs when a replica sends a message that it should not have sent according to the specifications of the algorithm.
⁴ An omission fault occurs when a replica does not send a required message.
A replica j will suspect a committee member i of having exhibited an omission fault if j has not received i’s reply-update message for the xth gateway request even after j’s local timer, which was started when the xth request became
the head of j’s receive queue, expired. Replica j conveys its suspicion of i to other replicas through a suspect-omission
message indicating replica i and request x. If there is a replica k that indeed received i’s reply-update message for
the xth request in a timely fashion, then k simply forwards i’s reply-update message to j upon receiving j’s suspect-omission message for i. This strategy ensures that (1) if at least one correct replica has received i’s reply-update
message in a timely fashion, then all correct replicas will eventually receive i’s message, and (2) if no correct replica
has received i’s reply-update message in a timely fashion, then the reconfiguration protocol will be initiated.
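
The handling of suspect-omission messages at a replica k might be sketched as below. The class OmissionTracker, its method names, and the returned pair are hypothetical names chosen for this sketch. Recording a suspicion even when k can forward the missing message is also an assumption, since the text does not say whether such suspicions still count towards the t + 1 threshold.

```python
class OmissionTracker:
    """Per-replica bookkeeping for suspect-omission messages (illustrative)."""

    def __init__(self, t):
        self.t = t
        self.received = {}    # (member, x) -> reply-update received from member
        self.suspecters = {}  # (member, x) -> ids of replicas suspecting member

    def on_reply_update(self, member, x, message):
        """Remember a reply-update that arrived in time for request x."""
        self.received[(member, x)] = message

    def on_suspect_omission(self, sender, member, x):
        """Handle sender's suspicion of `member` for request x.

        Returns (to_forward, justified): the stored reply-update to forward
        to `sender` (or None if we never received it), and whether t + 1
        distinct replicas now suspect `member`, which is the justification
        needed for a reselect message.
        """
        group = self.suspecters.setdefault((member, x), set())
        group.add(sender)  # duplicates from the same sender do not count twice
        to_forward = self.received.get((member, x))
        return to_forward, len(group) >= self.t + 1
```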
A committee member i can exhibit a commission fault by sending an improperly formed message with a valid
signature or by sending a reply-update message with the wrong r or u values to all or a subset of the replicas. In
the first case, a recipient of the improper message will send a type-1 suspect-commission message that carries i’s
improperly formed message. In the second case, when a replica j receives reply-update messages with different values
for the result or state update from different committee members for the xth gateway request, replica j will recognize
that some committee member is faulty; however, it will not know which committee member is the faulty one, unless
j is itself a committee member, has processed the xth request, and hence knows the correct result and state update
values. Replica j sends a type-2 suspect-commission message that carries reply-update messages from two or more
committee members with differing values for the result and/or state update as the justification. A recipient replica k of
j’s type-2 suspect-commission message will not know which committee member is faulty, but will be convinced that
the reconfiguration protocol must be initiated, as at least one committee member has exhibited a commission fault.
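
A minimal sketch of how a replica j could assemble the justification for a type-2 suspect-commission message is given below. The function name and the dictionary layout (committee member mapped to its (r, u) pair for request x) are assumptions made for illustration.

```python
def type2_justification(reply_updates, committee):
    """Return two conflicting reply-updates from committee members, or None.

    `reply_updates` maps a sender id to the (r, u) pair it reported for
    request x (hypothetical layout); messages from senders outside
    `committee` are ignored.
    """
    first = None
    for member, value in reply_updates.items():
        if member not in committee:
            continue
        if first is None:
            first = (member, value)
        elif value != first[1]:
            # Two committee members disagree, so at least one is faulty,
            # although this pair alone does not reveal which one.
            return [first, (member, value)]
    return None
```

The returned pair is enough to convince a recipient k that reconfiguration is warranted, without telling k which of the two senders is faulty.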
The reconfiguration protocol consists of two stages. In the first stage, all replicas temporarily become active replicas, service the xth gateway request, send the reply messages to the gateway, and exchange reply-update messages
among themselves. A replica moves to the second stage after it has received reply-update messages with identical
result and state update values from t + 1 replicas. That will eventually happen, since there are at least t + 1 correct
replicas. At the end of the first stage, previous committee members that sent wrong result and/or state update values
in their reply-update messages can be easily identified. For each such faulty replica, a type-3 suspect-commission
message will be sent carrying the replica’s “bad” reply-update message and the “good” reply-update messages from
t + 1 other replicas. In the second stage of the reconfiguration protocol, a reselection of the primary committee occurs.
Once committee reselection is completed, replicas that are not members of the new committee become backups and
do not service the next gateway request. If some members of the new committee exhibit faulty behavior, then the
reconfiguration protocol will be initiated again, and so on.
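
The end of the first stage can be illustrated with the following sketch, which looks for a (result, update) pair backed by t + 1 replicas and then flags previous committee members that reported something else. The function name and data layout are assumptions, and each replica is assumed to contribute at most one reply-update for request x.

```python
from collections import defaultdict


def stage_one_outcome(reply_updates, t, old_committee):
    """Return (good_value, faulty_members), or None if stage one is not done.

    `reply_updates` maps a replica id to the (r, u) pair it sent for
    request x (hypothetical layout).
    """
    support = defaultdict(set)  # (r, u) -> replicas that reported it
    for replica, value in reply_updates.items():
        support[value].add(replica)
    for value, backers in support.items():
        if len(backers) >= t + 1:
            # Previous committee members whose reply-update differs from the
            # agreed value are the candidates for type-3 suspect-commission.
            faulty = {m for m in old_committee
                      if m in reply_updates and reply_updates[m] != value}
            return value, faulty
    return None
```

With n = 2t + 1 replicas and one message per replica, two different values cannot both gather t + 1 backers, so the value returned is unique.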
Suppose that PC consists of all the (t + 1)-subsets of S. Then, $|PC| = \binom{2t+1}{t+1}$. Let PC_ord be an ordered list containing
all the elements of PC. PC_ord is known to all replicas. Each replica maintains a local integer called the committee
number. If the committee number has a value g, then the gth element in PC_ord represents the primary committee.
All replicas start with committee number 1. Each replica j maintains an omission-fault list and a commission-fault
list. A replica i is added to j’s omission-fault list after j has seen suspect-omission messages from t + 1 replicas
(possibly including j itself) for i. Replica i is added to j’s commission-fault list after j has seen a type-1 or type-3
suspect-commission message (possibly from itself) for i. At a correct replica j, if the committee number is g at the
start of the reconfiguration protocol, then the committee number is advanced to the smallest integer g′ greater than g,
such that the g′th element in PC_ord does not contain any replicas in j’s omission-fault list or commission-fault list.
Now, the g′th element in PC_ord is the new primary committee from j’s perspective. Correct replicas may temporarily
differ in their perspectives of the primary committee. However, their perspectives will eventually concur, since their
fault lists will eventually become the same. That is ensured because a correct replica sends a reselect message with
proper justification upon any addition to its fault lists. A convinced recipient will update its own fault lists, and send
its own reselect message with proper justification.
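
The committee reselection rule can be sketched as follows. The text only requires that the ordered list of committees be known to all replicas, so the lexicographic ordering used here is an assumption, as are the function names build_pc_ord and advance_committee.

```python
from itertools import combinations
from math import comb


def build_pc_ord(replicas, t):
    """Ordered list of all (t + 1)-subsets of the replica set.

    Lexicographic order over sorted replica ids is an assumption of this
    sketch; any order agreed on by all replicas would do.
    """
    return [frozenset(c) for c in combinations(sorted(replicas), t + 1)]


def advance_committee(pc_ord, g, omission_list, commission_list):
    """Smallest committee number g' > g (1-based) avoiding both fault lists.

    Returns None if no later element of pc_ord qualifies.
    """
    excluded = set(omission_list) | set(commission_list)
    for g_next in range(g + 1, len(pc_ord) + 1):
        if pc_ord[g_next - 1].isdisjoint(excluded):
            return g_next
    return None


# Example with t = 1 and replicas {1, 2, 3}: the list holds comb(3, 2) = 3
# committees, and listing replica 1 as faulty moves the committee number
# from 1 directly to 3, i.e., to the committee {2, 3}.
pc_ord = build_pc_ord([1, 2, 3], t=1)
assert len(pc_ord) == comb(3, 2)
assert advance_committee(pc_ord, 1, set(), {1}) == 3
```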
If the gateway has not accepted a result within a certain time after sending its request, it retransmits the request to
all the replicas. If any correct replica has received reply-update messages with identical result and state-update values
from t + 1 replicas for that request, then the replica simply forwards the reply messages to the gateway. Otherwise,
the reconfiguration protocol will eventually be initiated, causing all correct replicas to process the request and send the
reply message to the gateway. In either case, the gateway is guaranteed to accept a response eventually.
3 Discussion
Previous Byzantine-fault-tolerant state machine replication algorithms did not distinguish between fault-free and
faulty operation. Their principle was active replication in which all correct replicas always process all requests.
Our algorithm obtains a reduction of processing and communication activity in the normal fault-free case by using
only t + 1 out of 2t + 1 replicas (the primary committee) to process the client request. In the presence of faulty
committee members, a reconfiguration protocol is initiated, which causes all replicas to process the request (as in
active replication). When there is a faulty committee member, the additional overhead imposed by the algorithm
(relative to active replication) consists of the fault detection latency. However, this is a small price to pay considering
the improved performance obtained in the fault-free case, in which the system is expected to spend most of its runtime.
Our algorithm is not just a simple hybrid protocol that uses a smaller subset of replicas in the fault-free case and
switches to using all the replicas in the faulty case. Only faults in the primary committee can initiate the switch.
Also, faulty backups cannot force reconfiguration when the committee members are functioning properly (because the
reselect messages require proper justification). The eventual goal of the reconfiguration protocol is to make the primary
committee “settle down” at a (t + 1)-subset of S consisting of all correct replicas. Though there are $\binom{2t+1}{t+1}$ elements of
PC_ord, the number of times the reconfiguration protocol has to be initiated for the primary committee to settle down
on an all-correct (t + 1)-subset is only O(f), where f ≤ t is the actual number of faults. The reason is that while the
next primary committee is being chosen, the elements of PC_ord that contain any replicas in the commission-fault list
or the omission-fault list are skipped over.
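
The bound can be checked on small instances with a toy simulation. The sketch below makes several assumptions that go beyond the text: committees are ordered lexicographically, only genuinely faulty replicas ever enter a fault list, and each reconfiguration exposes exactly one faulty member of the current committee (the pessimistic case). The function name is hypothetical.

```python
from itertools import combinations


def reconfigurations_until_stable(t, faulty):
    """Count reconfigurations before an all-correct committee is reached.

    `faulty` is a set of at most t replica ids drawn from 1..2t+1.
    """
    replicas = range(1, 2 * t + 2)
    pc_ord = [set(c) for c in combinations(replicas, t + 1)]
    listed = set()  # union of the two fault lists
    g = 0           # 0-based index into pc_ord
    rounds = 0
    while pc_ord[g] & faulty:
        # Pessimistic assumption: only one culprit is exposed per round.
        listed.add(min(pc_ord[g] & faulty))
        rounds += 1
        # Skip every later committee that contains a listed replica.
        g = next(i for i in range(g + 1, len(pc_ord))
                 if not (pc_ord[i] & listed))
    return rounds
```

Each round adds a new faulty replica to the list, so under these assumptions the count never exceeds f. For instance, with t = 2 and faulty replicas {2, 3}, the committees visited are {1, 2, 3}, {1, 3, 4}, and {1, 4, 5}, which is two reconfigurations.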
Our algorithm is safe under denial-of-service (DoS) attacks. A DoS attack may delay the gateway’s acceptance
of a correct response, but it will not succeed in making the gateway accept a wrong response. The attack may also
cause non-corrupted replicas to be added to the omission-fault list. As a result, the number of replicas in either the
omission-fault list or commission-fault list may exceed t when a DoS attack is underway. When that happens, the
reconfiguration protocol is initiated as usual; however, in the second stage of the reconfiguration protocol, entries in
the omission-fault list are refreshed, and the committee number is reset to the smallest integer ĝ greater than or equal
to 1, such that the ĝth element in PC_ord does not contain any replicas in the commission-fault list. Such a refresh of
the omission-fault list may happen several times during the DoS attack. Obviously, the O(f) bound mentioned above
will not hold during such periods of instability in the system. However, as long as no more than t replicas have been
actually corrupted by the adversary, the algorithm will make progress after the DoS attack is over.
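
One possible reading of the refresh step is sketched below. Interpreting "refreshed" as "emptied" is an assumption of this sketch, as are the function name and the representation of the committee list as a Python list of frozensets.

```python
def refresh_if_overflow(pc_ord, t, omission_list, commission_list):
    """Apply the DoS refresh rule when more than t replicas are listed.

    Returns None if no refresh is needed; otherwise returns the pair
    (g_hat, new_omission_list), where g_hat is the smallest 1-based
    committee number whose committee avoids the commission-fault list.
    """
    listed = set(omission_list) | set(commission_list)
    if len(listed) <= t:
        return None
    for g_hat, committee in enumerate(pc_ord, start=1):
        if committee.isdisjoint(commission_list):
            return g_hat, set()  # omission-fault list emptied (assumption)
    raise ValueError("commission-fault list rules out every committee")


# Example with t = 1: replicas 2 and 3 were suspected of omission during
# the attack and replica 1 has a proven commission fault.
pc_ord = [frozenset(c) for c in [(1, 2), (1, 3), (2, 3)]]
assert refresh_if_overflow(pc_ord, 1, {2, 3}, {1}) == (3, set())
```

As long as the commission-fault list holds at most t replicas, at least t + 1 replicas remain unlisted there, so a qualifying committee always exists and the error branch is not reached.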
In our algorithm, backups do not actually process the requests; they are required only to monitor the progress of the
protocol, update their replicated states based on the reply-update messages from the committee members, and generate
suspect messages if a committee member is not behaving correctly. For some applications, updating the state may be a
very small portion of the processing load (e.g., read-only file systems). However, for applications in which updating the
state forms a significant part of the processing load (e.g., read-write file systems), this strategy may not be sufficient to
significantly reduce the processing load at backups in the fault-free case. In addressing this issue, we observe that the
state of a backup need not be up to date after every request, but only just before it starts processing the gateway request
during the first stage of the reconfiguration protocol. This suggests that state updates can be performed at backups in
a lazy (on-demand) fashion. In [4], we describe an extension to the algorithm that uses checkpointing to perform lazy
updates at backups. [4] also describes a modified version of the algorithm that uses MACs instead of digital signatures
during normal operation for the reply-update messages.
Acknowledgments We thank Christian Cachin for his helpful comments, and Jenny Applequist for her editorial help.
References
[1] C. Cachin and J. A. Poritz, “Secure Intrusion-Tolerant Replication on the Internet,” In Proc. DSN-2002, pp. 167-176, 2002.
[2] M. Castro and B. Liskov, “Practical Byzantine Fault Tolerance and Proactive Recovery,” ACM TOCS, 20(4):398–461, Nov.
2002.
[3] X. Defago, A. Schiper, and N. Sergent, “Semi-Passive Replication,” In Proc. SRDS-17, pp. 43-50, Oct. 1998.
[4] H. V. Ramasamy, A. Agbaria, and W. H. Sanders, “Intrusion-Tolerant Parsimonious State Machine Replication,” University
of Illinois Coordinated Science Laboratory technical report, to appear.
[5] J. Yin, J.-P. Martin, A. Venkataramani, L. Alvisi, and M. Dahlin, “Separating Agreement from Execution for Byzantine Fault
Tolerant Services,” In Proc. SOSP-2003, pp. 253-267, 2003.