Controlling The Cost of Reliability in Peer-to-Peer Overlays
Abstract— Structured peer-to-peer overlay networks provide a useful substrate for building distributed applications, but there are general concerns over the cost of maintaining these overlays. The current approach is to configure the overlays statically and conservatively to achieve the desired reliability even under uncommon adverse conditions. This results in high cost in the common case, or poor reliability in worse than expected conditions. We analyze the cost of overlay maintenance in realistic dynamic environments and design novel techniques to reduce this cost by adapting to the operating conditions. With our techniques, the concerns over the overlay maintenance cost are no longer warranted. Simulations using real traces show that they enable high reliability and performance even in very adverse conditions with low maintenance cost.

(This work was done while Ratul Mahajan was visiting Microsoft Research. Emails: ratul@cs.washington.edu, {mcastro,antr}@microsoft.com)

I. INTRODUCTION

Structured peer-to-peer (p2p) overlay networks (e.g., [6, 12, 7, 14]) are a useful substrate for building distributed applications because they are scalable, self-organizing and reliable. They provide a hash-table-like primitive to route messages using their keys. These messages are routed in a small number of hops using small per-node routing state. The overlays update routing state automatically when nodes join or leave, and can route messages correctly even when a large fraction of the nodes crash or the network partitions.

But scalability, self-organization, and reliability have a cost: nodes must consume network bandwidth to maintain routing state. There is a general concern over this cost [10, 11], but there has been little work studying it. The current approach is to configure the overlays statically and conservatively to achieve the desired reliability and performance even under uncommon adverse conditions. This results in high cost in the common case, or poor reliability in worse than expected conditions.

This paper studies the cost of overlay maintenance in realistic environments where nodes join and leave the system continuously. We derive analytic models for routing reliability and maintenance cost in these dynamic conditions.

We also present novel techniques that reduce the maintenance cost by observing and adapting to the environment. First, we describe a self-tuning mechanism that minimizes the overlay maintenance cost given a performance or reliability target. The current mechanism minimizes the probe rate for fault detection given a target message loss rate. It estimates both the failure rate and the size of the overlay, and uses the analytic models to compute the required probe rate. Second, we present mechanisms to deal effectively and efficiently with uncommon conditions such as network partitions and extremely high failure rates. These mechanisms enable the use of less conservative overlay configurations with lower maintenance cost. Though presented in the context of Pastry [7, 2], our results and techniques can be directly applied to other overlays.

Our results show that concerns over the maintenance cost in structured p2p overlays are no longer warranted. It is possible to achieve high reliability and performance even in adverse conditions with low maintenance cost. In simulations with a corporate network trace [1], over 99% of the messages were routed efficiently while control traffic was under 0.2 messages per second per node. With a much more dynamic Gnutella trace [10], similar performance levels were achieved with a maintenance cost below one message per second per node most of the time.

The remainder of the paper is organized as follows. We provide an overview of Pastry and our environment model in Section II, and present analytic reliability and cost models in Section III. Techniques to reduce maintenance cost appear in Section IV. In Section V we discuss how our techniques can be generalized to other structured p2p overlays. Related work is in Section VI, and conclusions in Section VII.

II. BACKGROUND

This section starts with a brief overview of Pastry, with a focus on the aspects relevant to this paper. Then, it presents our environment model.
PASTRY. Nodes and objects are assigned random identifiers from a large sparse 128-bit id space. These identifiers are called nodeIds and keys, respectively. Pastry provides a primitive to send a message to a key that routes the message to the live node whose nodeId is numerically closest to the key in the id space.

The routing state maintained by each node consists of the leaf set and the routing table. Each entry in the routing state contains the nodeId and IP address of a node. The leaf set contains the l/2 neighboring nodeIds on either side of the local node's nodeId in the id space. In the routing table, nodeIds and keys are interpreted as unsigned integers in base 2^b (where b is a parameter with typical value 4). The routing table is a matrix with 128/b rows and 2^b columns. The entry in row r and column c of the routing table contains a nodeId that shares the first r digits with the local node's nodeId, and has the (r+1)th digit equal to c. If there is no such nodeId or the local nodeId satisfies this constraint, the entry is left empty. On average only log_{2^b} N rows have non-empty entries.
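To make the layout concrete, here is a small Python sketch of this routing state. It is only an illustration of the description above; the class and helper names (RoutingState, shared_prefix_digits, table_slot) are ours and not part of Pastry's implementation.

    # Illustrative sketch of Pastry's per-node routing state (not the actual implementation).
    B = 4                       # digit size in bits (the parameter b), typical value 4
    ROWS = 128 // B             # 128/b routing table rows
    COLS = 2 ** B               # 2^b routing table columns

    def digits(node_id):
        # Split a 128-bit id into 128/b base-2^b digits, most significant first.
        return [(node_id >> (128 - B * (i + 1))) & (COLS - 1) for i in range(ROWS)]

    def shared_prefix_digits(x, y):
        # Number of leading digits shared by two ids.
        dx, dy = digits(x), digits(y)
        n = 0
        while n < ROWS and dx[n] == dy[n]:
            n += 1
        return n

    class RoutingState:
        def __init__(self, my_id):
            self.my_id = my_id
            self.leaf_set = []      # l/2 numerically closest nodeIds on each side
            self.table = [[None] * COLS for _ in range(ROWS)]

        def table_slot(self, other_id):
            # Row and column where other_id belongs (assumes other_id != my_id).
            r = shared_prefix_digits(self.my_id, other_id)
            return r, digits(other_id)[r]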
Pastry routes a message to a key using no more than log_{2^b} N hops on average. At each hop, the local node normally forwards the message to a node whose nodeId shares with the key a prefix that is at least one digit longer than the prefix that the key shares with the local node's nodeId. If no such node is known, the message is forwarded to a node whose nodeId is numerically closer to the key and shares a prefix with the key at least as long. If there is no such node, the message is delivered to the local node.
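A minimal sketch of this forwarding rule, reusing the RoutingState sketch above. It is not Pastry code: numerical closeness is simplified to absolute difference, whereas the real implementation works on a circular id space and handles details we omit here.

    def next_hop(state, key):
        # Illustrative forwarding rule: extend the shared prefix if possible,
        # otherwise move numerically closer to the key, otherwise deliver locally.
        p = shared_prefix_digits(state.my_id, key)

        # 1. Routing table entry that shares one more digit with the key.
        if p < ROWS:
            entry = state.table[p][digits(key)[p]]
            if entry is not None:
                return entry

        # 2. Any known node with an equally long prefix that is numerically closer.
        known = state.leaf_set + [e for row in state.table for e in row if e is not None]
        for n in known:
            if shared_prefix_digits(n, key) >= p and abs(n - key) < abs(state.my_id - key):
                return n

        # 3. No better node is known: deliver the message to the local node.
        return None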
Pastry updates routing state when nodes join and leave the overlay. Joins are handled as described in [2] and failures are handled as follows. Pastry uses periodic probing for failure detection. Every node sends a keep-alive to the members of its leaf set every Tls seconds. Since the leaf set membership is symmetric, each node should receive a keep-alive message from each of its leaf set members. If it does not, the node sends an explicit probe and assumes that a member is dead if it does not receive a response within Tout seconds. Additionally, every node sends a liveness probe to each entry in its routing table every Trt seconds. Since routing tables are not symmetric, nodes respond to these probes. If no response is received within Tout, another probe is sent. The node is assumed faulty if no response is received to the second probe within Tout.
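This probing scheme implies the worst-case detection delays used by the loss-rate model in Section III: a faulty leaf set member is detected at most Tls + Tout seconds after failing (one missed keep-alive period plus one probe timeout), and a faulty routing table entry at most Trt + 2Tout seconds after failing (one probe period plus two probe timeouts). The tiny helper below just spells this out; it is our own illustration, not Pastry code, and assumes the probe messages themselves are not lost.

    def max_detection_delay(T_ls, T_rt, T_out):
        # Worst-case time to detect a faulty entry under the probing scheme above.
        leaf_set_delay = T_ls + T_out            # missed keep-alive + explicit probe timeout
        routing_table_delay = T_rt + 2 * T_out   # probe period + two probe timeouts
        return leaf_set_delay, routing_table_delay

    # With Tls = 30s, Trt = 120s, Tout = 3s (settings used later in the paper):
    print(max_detection_delay(30, 120, 3))       # -> (33, 126)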
Faulty entries are removed from the routing state but it is necessary to replace them with other nodes. It is sufficient to replace leaf set entries to achieve correctness, but it is important to replace routing table entries to achieve logarithmic routing cost. Leaf set replacements are obtained by piggybacking information about leaf set membership in keep-alive messages. Routing table maintenance is performed by periodically asking a node in each row of the routing table for the corresponding row in its routing table, and when a routing table slot is found empty during routing, the next hop node is asked to return any entry it may have for that slot. These mechanisms are described in more detail in [2].

ENVIRONMENT MODEL. Our analysis and some of our cost reduction techniques assume that nodes join according to a Poisson process with rate λ and leave according to an exponential distribution with rate parameter µ (as in [4]). But we also evaluate our techniques using realistic node arrival and departure patterns and simulated massive correlated failures such as network partitions. We assume a fail-stop model and conservatively assume that all nodes leave ungracefully without informing other nodes and that nodes never return with the same nodeId.
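For readers who want to reproduce the analysis, this is one way the assumed arrival and departure model can be generated in a simulator: exponential inter-arrival times give a Poisson arrival process with rate λ, and exponential session lengths with rate µ give the departures. The function below is our own sketch, not part of the paper's simulator.

    import random

    def generate_churn(lam, mu, duration):
        # (join_time, leave_time) pairs: Poisson arrivals with rate lam,
        # exponentially distributed session lengths with rate mu.
        events, t = [], 0.0
        while True:
            t += random.expovariate(lam)         # exponential inter-arrival time
            if t >= duration:
                return events
            events.append((t, t + random.expovariate(mu)))

    # Example: one arrival per second, mean session time of two hours, for one simulated hour.
    nodes = generate_churn(lam=1.0, mu=1.0 / 7200, duration=3600)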
III. RELIABILITY AND COST MODELS

Pastry forwards messages using UDP with no acknowledgments by default. This is efficient and simple, but messages forwarded to a faulty node are lost. The probability of forwarding a message to a faulty node at each hop is Pf(T, µ) = 1 − (1/(Tµ))(1 − e^{−Tµ}), where T is the maximum time it takes to detect the fault. There are no more than log_{2^b} N overlay hops in a Pastry route on average. Typically, the last hop uses the leaf set and the others use the routing table. If we ignore messages lost by the underlying network, the message loss rate, L, is:

L = 1 − (1 − Pf(Tls + Tout, µ)) · (1 − Pf(Trt + 2Tout, µ))^{log_{2^b} N − 1}

Reliability can be improved by applications if required [9]. Applications can retransmit messages and set a flag indicating that they should be acknowledged at each hop. This provides very strong reliability guarantees [2] because nodes can choose an alternate next hop if the previously chosen one is detected to be faulty. But waiting for timeouts to detect that the next hop is faulty can lead to very bad routing performance. Therefore, we use the message loss rate, L, in this paper because it models both performance and reliability: the probability of being able to route efficiently without waiting for timeouts.

We can also derive a model to compute the cost of maintaining the overlay. Each node generates control traffic for five operations: leaf set keep-alives, routing table entry probes, node joins, background routing table maintenance, and locality probes. The control traffic in our setting is dominated by the first two operations. So for simplicity, we only consider the control traffic per second per node, C, due to leaf set keep-alives and routing table probes:

C = l/Tls + ( 2 × Σ_{r=0}^{128/b − 1} (2^b − 1) × (1 − b(0; N, 1/(2^b)^{r+1})) ) / Trt
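Both models translate directly into a few lines of code, which can be handy for reproducing the figures or for the self-tuning mechanism in Section IV-B. The sketch below is ours: the function names are made up, b(0; N, p) is read as the binomial probability of zero successes in N trials, i.e., (1 − p)^N, and the factor of 2 is read as a probe plus its response for each routing table entry.

    import math

    def p_f(T, mu):
        # Probability of forwarding to a faulty node at one hop, given the maximum
        # fault-detection time T and the failure rate mu.
        return 1.0 - (1.0 / (T * mu)) * (1.0 - math.exp(-T * mu))

    def loss_rate(T_ls, T_rt, T_out, mu, N, b=4):
        # Message loss rate L: the last hop uses the leaf set, the remaining
        # log_{2^b}(N) - 1 hops use the routing table.
        hops = math.log(N, 2 ** b)
        p_leaf = p_f(T_ls + T_out, mu)
        p_table = p_f(T_rt + 2 * T_out, mu)
        return 1.0 - (1.0 - p_leaf) * (1.0 - p_table) ** (hops - 1)

    def control_traffic(T_ls, T_rt, N, l=32, b=4):
        # Control messages per second per node, C: leaf set keep-alives plus
        # routing table probes and responses.
        keepalives = l / T_ls
        expected_entries = 0.0
        for r in range(128 // b):                     # routing table rows
            p_slot = 1.0 / (2 ** b) ** (r + 1)        # a random node fits a given slot in row r
            p_nonempty = 1.0 - (1.0 - p_slot) ** N    # 1 - b(0; N, p_slot)
            expected_entries += (2 ** b - 1) * p_nonempty
        return keepalives + 2 * expected_entries / T_rt

    # Example: 10,000 nodes, mean session time of 2 hours, Tls = 30s, Trt = 60s, Tout = 3s.
    print(loss_rate(T_ls=30, T_rt=60, T_out=3, mu=1 / 7200.0, N=10000))
    print(control_traffic(T_ls=30, T_rt=60, N=10000))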
We verified these equations using simulation. We started by creating a Pastry overlay with 10,000 nodes. Then we let new nodes arrive and depart according to Poisson processes with the same rate to keep the number of nodes in the system roughly constant. After ten simulated minutes, 500,000 messages were sent over the next ten minutes from randomly selected nodes to randomly selected keys. Figure 1a shows the message loss rate for three different values of Trt (10, 30 and 60 seconds) with Tls fixed at 30 seconds. The x-axis shows the mean session lifetime of a node (µ = 1/lifetime). The lines correspond to the values predicted with the loss rate equation and the dots correspond to the simulation results (three simulation runs for each parameter setting). Figure 1b shows a similar graph for control traffic. The results show that both equations are quite accurate. As expected, the loss rate decreases when Trt (or Tls) decreases, but the control traffic increases.
IV. REDUCING MAINTENANCE COST

This section describes our techniques to reduce the amount of control traffic required to maintain the overlay. We start by motivating the importance of observing and adapting to the environment by discussing the characteristics of realistic environments. Then, we explain the self-tuning mechanism and the techniques to deal with massive failures.

A. Node arrivals and departures in realistic environments

We obtained traces of node arrivals and failures from two recent measurement studies of p2p environments. The first study [10] monitored 17,000 unique nodes in the Gnutella overlay over a period of 60 hours. It probed each node every seven minutes to check if it was still part of the overlay. The average session time over the trace was

These traces show that failure rates vary significantly with both daily and weekly patterns, and the failure rate in the Gnutella overlay is more than an order of magnitude higher than in the corporate environment. Therefore, the current static configuration approach would require not only different settings for the two environments, but also expensive configurations if good performance is desired at all times. The next sections show how to achieve high reliability with lower cost in all scenarios.

B. Self-tuning

The goal of a self-tuning mechanism is to enable an overlay to operate at the desired trade-off point between cost and reliability. In this paper we show how to operate at one such point: achieve a target routing reliability while minimizing control traffic. The methods we use to do this can be easily generalized to make arbitrary trade-offs.

In the loss rate and control traffic equations, there are four parameters that we can set: Trt, Tls, Tout, and l. Currently, we choose fixed values for Tls and l that achieve the desired resilience to massive failures (Section IV-C). Tout is fixed at a value higher than the maximum expected round-trip delay between two nodes; it is set to 3 seconds in our experiments (the same as the TCP SYN timeout). We tune Trt to achieve the specified target loss rate by periodically recomputing it using the loss rate equation with the current estimates of N and µ. Below we describe mechanisms that, without any additional communication, estimate N and µ.
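The recomputation step can be sketched with the loss_rate helper from the Section III sketch: since the predicted loss rate grows as Trt grows, the largest (cheapest) Trt that still meets the target can be found numerically, for example by bisection. The bounds, iteration count and names below are our own choices, not values from Pastry.

    def tune_T_rt(target_loss, mu_est, N_est, T_ls=30, T_out=3,
                  lo=1.0, hi=3600.0, iters=50):
        # Largest routing-table probe period whose predicted loss rate meets the target.
        if loss_rate(T_ls, hi, T_out, mu_est, N_est) <= target_loss:
            return hi                     # even the cheapest setting considered meets the target
        for _ in range(iters):
            mid = (lo + hi) / 2.0
            if loss_rate(T_ls, mid, T_out, mu_est, N_est) <= target_loss:
                lo = mid                  # can afford a longer (cheaper) probe period
            else:
                hi = mid
        return lo

    # Example: aim for a 1% loss rate with the current estimates of mu and N.
    T_rt = tune_T_rt(target_loss=0.01, mu_est=1.0 / 7200, N_est=10000)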
We use the density of nodeIds in the leaf set to estimate N. Since nodeIds are picked randomly with uniform probability from the 128-bit id space, the average distance between nodeIds in the leaf set is 2^128/N. It can be shown that this estimate is within a small factor of N with very high probability, which is sufficient for our purposes since the loss rate depends only on log_{2^b} N.
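A sketch of this size estimate, assuming nodeIds are available as 128-bit integers; the names are ours. The leaf set members and the local node cover a small contiguous arc of the circular id space, so the ids are re-centred around the local nodeId before measuring the average gap.

    ID_SPACE = 2 ** 128

    def estimate_N(my_id, leaf_set_ids):
        # Average gap between adjacent nodeIds in the leaf set is about 2^128 / N.
        centred = sorted((i - my_id + ID_SPACE // 2) % ID_SPACE
                         for i in list(leaf_set_ids) + [my_id])
        span = centred[-1] - centred[0]          # arc covered by the l + 1 nodeIds
        avg_gap = span / (len(centred) - 1)
        return ID_SPACE / avg_gap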
Fig. 2. Self-tuning: (a) shows the measured failure rate (1/min) in the Gnutella network; (b) and (c) show the loss rate in the hand-tuned and self-tuned versions of Pastry; and (d) shows the control traffic in the two systems.
The value of µ is estimated by using node failures in the routing table and leaf set. If nodes fail with rate µ, a node with M unique nodes in its routing state should observe K failures in time K/(Mµ). Every node remembers the time of the last K failures. A node inserts its current time in the history when it joins the overlay. If there are only k < K failures in the history, we compute the estimate as if there was a failure at the current time. The estimate of µ is k/(M × Tkf), where Tkf is the time span between the first and the last failure in the history.

The accuracy of µ's estimate depends on K; increasing K increases accuracy but decreases responsiveness to changes in the failure rate. We improve responsiveness when the failure rate decreases by using the current estimate of µ to discard old entries from the failure history that are unlikely to reflect the current behavior of the overlay. When the probability of observing a failure given the current estimate of µ reaches a threshold (e.g., 0.90) without any new failure being observed, we drop the oldest failure time from the history and compute a new estimate for µ.
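The estimator and the pruning rule can be sketched as below. This is one consistent reading of the bookkeeping described above (the join time is stored as the first history entry, and the number of intervals spanned by the history plays the role of k); the class and method names are ours, and the pruning test uses the fact that, with M nodes failing at rate µ, the probability of seeing at least one failure within time t is 1 − e^{−Mµt}.

    import math

    class FailureRateEstimator:
        # Sketch: remember the times of the last K observed failures among the
        # M unique nodes in the routing state; the join time is the first entry.
        def __init__(self, join_time, K=10):
            self.K = K
            self.history = [join_time]

        def record_failure(self, t):
            self.history.append(t)
            self.history = self.history[-self.K:]     # keep only the last K times

        def estimate(self, M, now):
            # mu ~= k / (M * Tkf); with fewer than K entries, 'now' counts as a failure.
            times = self.history if len(self.history) >= self.K else self.history + [now]
            k = len(times) - 1                        # intervals spanned by Tkf
            T_kf = times[-1] - times[0]
            return k / (M * T_kf) if T_kf > 0 else float("inf")

        def prune_stale(self, M, now, threshold=0.90):
            # Drop the oldest entry once a new failure should have been observed
            # with probability >= threshold under the current estimate.
            while len(self.history) > 1:
                mu = self.estimate(M, now)
                if not math.isfinite(mu):
                    break
                if 1.0 - math.exp(-M * mu * (now - self.history[-1])) < threshold:
                    break
                self.history.pop(0)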
We evaluated our self-tuning mechanism using simulations driven by the Gnutella trace. We simulated two versions of Pastry: self-tuned uses the self-tuning mechanism to adjust Trt to achieve a loss rate of 1%; and hand-tuned sets Trt to a fixed value that was determined by trial and error to achieve the same average loss rate. Hand-tuning is not possible in real settings because it requires perfect knowledge about the future. Therefore, a comparison between these two versions of Pastry provides a conservative evaluation of the benefits of self-tuning.

Figures 2b and 2c show the loss rates achieved by the self-tuned and the best hand-tuned versions, respectively. The loss rate is averaged over 10-minute windows, and is measured by sending 1,000 messages per minute from random nodes to random keys. Tls was fixed at 30 seconds in both versions and Trt at 120 seconds for hand-tuned. The results show that self-tuning works well, achieving the target loss rate independent of the failure rate.

Figure 2d shows the control traffic generated by both the hand-tuned and the self-tuned versions. The control traffic generated by hand-tuned is roughly constant whereas the one generated by self-tuned varies according to the failure rate to meet the loss rate target. It is interesting to note that the loss rate of hand-tuned increases significantly above 1% between 52 and 58 hours due to an increased failure rate. The control traffic generated by the self-tuned version clearly increases during this period to achieve the target loss rate with the increased failure rate. If the hand-tuned version had instead been configured to always keep the loss rate below 1%, it would have generated over 2 messages per second per node all the time.

We also simulated the Microsoft corporate network trace and obtained similar results. The self-tuned version achieved the desired loss rate with under 0.2 messages per second per node. The hand-tuned version required a different setting for Trt, as the old value would have resulted in an unnecessarily high overhead.

C. Dealing with massive failures

Next we describe mechanisms to deal with massive but rare failures such as network partitions.

BROKEN LEAF SETS. Pastry relies on the invariant that each node has at least one live leaf set member on each side. This is necessary for the current leaf set repair mechanism to work. Chord relies on a similar assumption [12]. Currently, Pastry uses large leaf sets (l = 32) to ensure that the invariant holds with high probability even when there are massive failures and the overlay is large [2].

We describe a new leaf set repair algorithm that uses the entries in the routing table. It can repair leaf sets even when the invariant is broken, so it allows the use of smaller leaf sets, which require less maintenance traffic. The algorithm works as follows. When a node n detects that all members on one side of its leaf set are faulty, it selects the nodeId that is numerically closest to n's nodeId on that side from among all the entries in its routing state. Then it asks this seed node to return the entry in its routing state with the nodeId closest to n's nodeId that lies between
be used when a CAN node loses all its neighbors along one dimension (the current version uses flooding). Finally, partition detection can be done using the successor set in Chord and neighbors in CAN.

VI. RELATED WORK

Most previous work has studied overlay maintenance under static conditions, but the following studied dynamic environments where nodes continuously join and leave the overlay. Saia et al. use a butterfly network to build an overlay that routes efficiently even with large adversarial failures, provided that the network keeps growing [8]. Pandurangan et al. present a centralized algorithm to ensure connectivity in the face of node failures [5]. Liben-Nowell et al. provide an asymptotic analysis of the cost of maintaining Chord [12]. Ledlie et al. [3] present some simulation results of Chord in an idealized model with Poisson arrivals and departures. We too study the overlay maintenance cost in dynamic environments, but we provide an exact analysis in an idealized model together with simulations using real traces. Additionally, we describe new techniques to reduce this cost while providing high reliability and performance.

Weatherspoon and Kubiatowicz have looked at efficient node failure discovery [13]; they propose that nodes further away be probed less frequently to reduce wide-area traffic. In contrast, we reduce the cost of failure discovery by adapting to the environment. The two approaches can potentially be combined, though their approach makes the later hops in Tapestry (and Pastry) less reliable, with messages more likely to be lost after having been routed for a few initial hops.

VII. CONCLUSIONS AND FUTURE WORK

There are general concerns over the cost of maintaining structured p2p overlay networks. We examined this cost in realistic dynamic conditions, and presented novel techniques to reduce it by observing and adapting to the environment. These techniques adjust control traffic based on the observed failure rate, and they detect and recover from massive failures efficiently. We evaluated these techniques using mathematical analysis and simulation with real traces. The results show that concerns over the overlay maintenance cost are no longer warranted. Our techniques enable high reliability and performance even in adverse conditions with low maintenance cost. Though done in the context of Pastry, this work is relevant to other structured p2p networks such as CAN, Chord and Tapestry.

As part of ongoing work, we are exploring different self-tuning goals and methods. These include i) operating at arbitrary points in the reliability vs. cost curve and using different performance or reliability targets; ii) choosing a self-tuning target that takes into account the application's retransmission behavior, such that total traffic is minimized; and iii) varying Tls (which has implications for detecting leaf set failures, since keep-alives are unidirectional) along with Trt, under the constraint that Tls has an upper bound determined by the desired resilience to massive failures. We are also studying the impact of failures on other performance criteria such as locality.

ACKNOWLEDGEMENTS

We thank Ayalvadi Ganesh for help with the mathematical analysis, and John Douceur and Stefan Saroiu for the trace data used in this paper.

REFERENCES

[1] W. J. Bolosky, J. R. Douceur, D. Ely, and M. Theimer. Feasibility of a serverless distributed file system deployed on an existing set of desktop PCs. In ACM SIGMETRICS, June 2000.
[2] M. Castro, P. Druschel, Y. C. Hu, and A. Rowstron. Exploiting network proximity in peer-to-peer overlay networks. Technical Report MSR-TR-2002-82, Microsoft Research, May 2002.
[3] J. Ledlie, J. Taylor, L. Serban, and M. Seltzer. Self-organization in peer-to-peer systems. In ACM SIGOPS European Workshop, Sept. 2002.
[4] D. Liben-Nowell, H. Balakrishnan, and D. Karger. Analysis of the evolution of peer-to-peer systems. In ACM Principles of Distributed Computing (PODC), July 2002.
[5] G. Pandurangan, P. Raghavan, and E. Upfal. Building low-diameter peer-to-peer networks. In IEEE FOCS, Oct. 2001.
[6] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content-addressable network. In ACM SIGCOMM, Aug. 2001.
[7] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In IFIP/ACM Middleware, Nov. 2001.
[8] J. Saia, A. Fiat, S. Gribble, A. Karlin, and S. Saroiu. Dynamically fault-tolerant content addressable networks. In IPTPS, Mar. 2002.
[9] J. Saltzer, D. Reed, and D. Clark. End-to-end arguments in system design. ACM TOCS, 2(4), Nov. 1984.
[10] S. Saroiu, K. Gummadi, and S. Gribble. A measurement study of peer-to-peer file sharing systems. In MMCN, Jan. 2002.
[11] S. Sen and J. Wang. Analyzing peer-to-peer traffic across large networks. In ACM SIGCOMM Internet Measurement Workshop, Marseille, France, Nov. 2002.
[12] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for Internet applications. In ACM SIGCOMM, Aug. 2001.
[13] H. Weatherspoon and J. Kubiatowicz. Efficient heartbeats and repair of softstate in decentralized object location and routing systems. In ACM SIGOPS European Workshop, Sept. 2002.
[14] B. Y. Zhao, J. D. Kubiatowicz, and A. D. Joseph. Tapestry: An infrastructure for fault-resilient wide-area location and routing. Technical Report UCB-CSD-01-1141, U. C. Berkeley, Apr. 2001.