Article

Enforcing Perfect Failure Detection

ICDCS '01: Proceedings of the 21st International Conference on Distributed Computing Systems

Page 350

Published: 13 November 2019

Abstract

Perfect failure detectors can correctly decide whether a computer has crashed. However, it is impossible to implement a perfect failure detector in purely asynchronous systems. We show how to enforce perfect failure detection in timed distributed systems with hardware watchdogs. The two main system model assumptions are (1) each computer can measure time intervals with a known maximum error, and (2) each computer has a watchdog that crashes the computer unless the watchdog is periodically updated. We have implemented a system that satisfies both assumptions using a combination of off-the-shelf software and hardware.
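
As a rough illustration of the two assumptions above, the following Python sketch models a watchdog that must be updated within a fixed timeout, together with a check an observer could apply to a measured interval of silence. The names Watchdog, update, has_fired, and certainly_crashed are hypothetical and do not come from the paper. The sketch also assumes the observer already knows an upper bound on when the last watchdog update happened, which a real protocol would have to establish.

```python
from dataclasses import dataclass


@dataclass
class Watchdog:
    """Toy model of assumption (2): crashes the host unless updated in time."""
    timeout: float          # maximum allowed gap between updates (seconds)
    last_update: float = 0.0

    def has_fired(self, now: float) -> bool:
        # The watchdog fires (and crashes the host) once the gap exceeds timeout.
        return now - self.last_update > self.timeout

    def update(self, now: float) -> None:
        # A host that has already been crashed can no longer update its watchdog.
        if self.has_fired(now):
            raise RuntimeError("host was already crashed by the watchdog")
        self.last_update = now


def certainly_crashed(measured_elapsed: float, timeout: float, max_error: float) -> bool:
    """Toy model of assumption (1): intervals are measured with a known maximum error.

    measured_elapsed is the observer's local measurement of the time since the
    latest possible watchdog update (an assumed input). Subtracting max_error
    gives a lower bound on the true elapsed time; if even that lower bound
    exceeds the timeout, the watchdog must have fired.
    """
    return measured_elapsed - max_error > timeout
```

Under these assumptions an observer never reports a crash too early: it reports one only when the watchdog is guaranteed to have fired, which is the sense in which the watchdog "enforces" the correctness of the suspicion.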

Cited By

Sakib K. (2018). Asynchronous failed sensor node detection method for sensor networks. International Journal of Network Management 22:1 (27-49). DOI: 10.1002/nem.782. Online publication date: 26-Dec-2018.
https://dl.acm.org/doi/10.1002/nem.782
Zhou J., Chu L., Yang T. (2005). An Efficient Topology-Adaptive Membership Protocol for Large-Scale Cluster-Based Services. Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01. DOI: 10.1109/IPDPS.2005.92. Online publication date: 4-Apr-2005.
https://dl.acm.org/doi/10.1109/IPDPS.2005.92
Baldoni R., Marchetti C. (2003). Three-tier replication for FT-CORBA infrastructures. Software—Practice & Experience 33:8 (767-797). DOI: 10.1002/spe.525. Online publication date: 10-Jul-2003.
https://dl.acm.org/doi/10.1002/spe.525

Index Terms

Enforcing Perfect Failure Detection
1. Software and its engineering
  1. Software organization and properties
    1. Software system structures
      1. Distributed systems organizing principles

Index terms have been assigned to the content through auto-classification.

Information & Contributors

Information

Published In

ICDCS '01: Proceedings of the 21st International Conference on Distributed Computing Systems

April 2001

Publisher

IEEE Computer Society

United States
