Latency and bandwidth-minimizing failure detectors

Authors: Kelvin C. W. So, Emin Gün Sirer

EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007

Pages 89 - 99

https://doi.org/10.1145/1272996.1273008

Published: 21 March 2007

Abstract

Failure detectors are fundamental building blocks in distributed systems. Multi-node failure detectors, where the detector is tasked with monitoring N other nodes, play a critical role in overlay networks and peer-to-peer systems. In such networks, failures need to be detected quickly and with low overhead. Achieving these properties simultaneously poses a difficult tradeoff between detection latency and resource consumption.

In this paper, we examine this central tradeoff, formalize it as an optimization problem, and analytically derive the optimal closed-form formulas for multi-node failure detectors. We provide two variants of the optimal solution, each using an optimality metric suited to a different deployment scenario. √s-LM is a latency-minimizing optimal failure detector that achieves the lowest average failure detection latency given a fixed bandwidth constraint for system maintenance. √s-BM is a bandwidth-minimizing optimal failure detector that meets a desired detection latency target with the least amount of bandwidth consumed. We evaluate our optimal results with node lifetimes chosen from bimodal and Pareto distributions, as well as real-world trace data from PlanetLab hosts, web sites, and Microsoft PCs. Compared to standard failure detectors in wide use, √s failure detectors reduce failure detection latencies by 40% on average for the same bandwidth consumption or, conversely, reduce the amount of bandwidth consumed by 30% for the same failure detection latency.
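
As a rough illustration of the latency/bandwidth tradeoff described in the abstract, the sketch below works through a deliberately simplified model. The model is an assumption of this example and is not taken from the paper: node i fails at a known rate `failure_rates[i]`, the detector probes it `probe_rates[i]` times per second, a failure is noticed on average half a probe interval after it occurs, and the total probe rate is capped at `budget`. Under these assumptions, a Lagrange-multiplier argument yields probe rates proportional to the square root of each node's failure rate. All function names and input numbers are hypothetical.

```python
import math


def sqrt_allocation(failure_rates, budget):
    """Split a total probe budget across nodes in proportion to sqrt(failure rate).

    Assumed toy model: minimizing sum(rate_i / (2 * f_i)) subject to
    sum(f_i) == budget gives f_i proportional to sqrt(rate_i).
    """
    roots = [math.sqrt(rate) for rate in failure_rates]
    total = sum(roots)
    return [budget * root / total for root in roots]


def uniform_allocation(failure_rates, budget):
    """Baseline: probe every node at the same rate."""
    n = len(failure_rates)
    return [budget / n] * n


def mean_detection_latency(failure_rates, probe_rates):
    """Failure-weighted mean detection latency in seconds.

    Assumes a failure is detected half a probe interval (0.5 / f_i)
    after it happens, and ignores network delay and packet loss.
    """
    total_rate = sum(failure_rates)
    weighted = sum(rate * 0.5 / f for rate, f in zip(failure_rates, probe_rates))
    return weighted / total_rate


if __name__ == "__main__":
    # Hypothetical inputs: one flaky node and three stable ones (failures/sec),
    # with a budget of 4 probes per second in total.
    rates = [1e-3, 1e-5, 1e-5, 1e-5]
    budget = 4.0
    for name, alloc in (("uniform", uniform_allocation), ("sqrt", sqrt_allocation)):
        probes = alloc(rates, budget)
        print(name, mean_detection_latency(rates, probes))
```

In this toy setting the square-root split probes the flaky node more often than the stable ones and lowers the failure-weighted latency relative to uniform probing at the same budget. A practical detector would also have to estimate failure rates and cope with message loss, which this sketch leaves out.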

Index Terms

Latency and bandwidth-minimizing failure detectors
1. Networks
  1. Network services

Published In

EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007

March 2007

431 pages

ISBN:9781595936363

DOI:10.1145/1272996

ACM SIGOPS Operating Systems Review Volume 41, Issue 3
EuroSys'07 Conference Proceedings
June 2007
386 pages
ISSN:0163-5980
DOI:10.1145/1272998

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

EuroSys07

Sponsor:

SIGOPS

EuroSys07: Eurosys 2007 Conference

March 21 - 23, 2007

Lisbon, Portugal

Acceptance Rates

Overall Acceptance Rate 241 of 1,308 submissions, 18%

Article Metrics

Total Citations: 16
Total Downloads: 475
Downloads (last 12 months): 9
Downloads (last 6 weeks): 1

Reflects downloads up to 10 Oct 2024

Cited By

Zhang Y, Li D, Guo C, Wu H, Xiong Y, Lu X (2017). CubicRing. IEEE/ACM Transactions on Networking 25:4 (2040-2053). DOI: 10.1109/TNET.2017.2669215. Online publication date: 1-Aug-2017
https://dl.acm.org/doi/10.1109/TNET.2017.2669215
Anderson J, Meling H, Rasmussen A, Vahdat A, Marzullo K (2017). Local Recovery for High Availability in Strongly Consistent Cloud Services. IEEE Transactions on Dependable and Secure Computing 14:2 (172-184). DOI: 10.1109/TDSC.2015.2443781. Online publication date: 1-Mar-2017
https://dl.acm.org/doi/10.1109/TDSC.2015.2443781
Zhang Y, Guo C, Li D, Chu R, Wu H, Xiong Y, Barham P, Krishnamurthy A (2015). CubicRing. Proceedings of the 12th USENIX Conference on Networked Systems Design and Implementation (529-542). DOI: 10.5555/2789770.2789807. Online publication date: 4-May-2015
https://dl.acm.org/doi/10.5555/2789770.2789807
Leners J, Gupta T, Aguilera M, Walfish M, Réveillère L, Harris T, Herlihy M (2015). Taming uncertainty in distributed systems with help from the network. Proceedings of the Tenth European Conference on Computer Systems (1-16). DOI: 10.1145/2741948.2741976. Online publication date: 17-Apr-2015
https://dl.acm.org/doi/10.1145/2741948.2741976
Tran N, Chiang F, Li J (2012). Efficient cooperative backup with decentralized trust management. ACM Transactions on Storage 8:3 (1-25). DOI: 10.1145/2339118.2339119. Online publication date: 20-Sep-2012
https://dl.acm.org/doi/10.1145/2339118.2339119
Liu D, Payton J (2011). Adaptive fault detection approaches for dynamic mobile networks. 2011 IEEE Consumer Communications and Networking Conference (CCNC) (735-739). DOI: 10.1109/CCNC.2011.5766588. Online publication date: Jan-2011
https://doi.org/10.1109/CCNC.2011.5766588
de Sá A, de Araújo Macêdo R (2010). QoS self-configuring failure detectors for distributed systems. Proceedings of the 10th IFIP WG 6.1 international conference on Distributed Applications and Interoperable Systems (126-140). DOI: 10.1007/978-3-642-13645-0_10. Online publication date: 7-Jun-2010
https://dl.acm.org/doi/10.1007/978-3-642-13645-0_10
Repantis T, Kalogeraki V, Baldoni R (2008). Replica placement for high availability in distributed stream processing systems. Proceedings of the second international conference on Distributed event-based systems (181-192). DOI: 10.1145/1385989.1386012. Online publication date: 1-Jul-2008
https://dl.acm.org/doi/10.1145/1385989.1386012
Pasin M, Fontaine S, Bouchenak S (2008). Failure Detection in Large Scale Systems: a Survey. NOMS Workshops 2008 - IEEE Network Operations and Management Symposium Workshops (165-168). DOI: 10.1109/NOMSW.2007.28. Online publication date: Apr-2008
https://doi.org/10.1109/NOMSW.2007.28
Yang Z, Dai Y, Li X (2008). The Neutralizer: a self-configurable failure detector for minimizing distributed storage maintenance cost. Concurrency and Computation: Practice and Experience 21:2 (185-204). DOI: 10.1002/cpe.1338. Online publication date: 6-Jun-2008
https://doi.org/10.1002/cpe.1338
