Enforcing Perfect Failure Detection

Published: 13 November 2019

Abstract

Perfect failure detectors can correctly decide whether a computer has crashed. However, it is impossible to implement a perfect failure detector in a purely asynchronous system. We show how to enforce perfect failure detection in timed distributed systems with hardware watchdogs. The two main system model assumptions are (1) each computer can measure time intervals with a known maximum error, and (2) each computer has a watchdog that crashes the computer unless the watchdog is periodically updated. We have implemented a system that satisfies both assumptions using a combination of off-the-shelf software and hardware.
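The two assumptions can be illustrated with a small timing sketch. It is not the protocol from the paper, which the abstract does not describe. It only shows the arithmetic an observer might use to decide when a watchdog must have fired. The names `Watchdog`, `safe_wait`, and `rho` are hypothetical. The sketch assumes that every clock runs at a rate within [1 - rho, 1 + rho] of real time, and that the observer knows the watched computer performs no watchdog update after the last one the observer learned about.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class Watchdog:
    """Hypothetical watchdog: crashes its host unless updated in time."""
    timeout: Fraction      # timeout, measured on the host's local clock
    last_update: Fraction  # host-local time of the most recent update

    def update(self, local_now: Fraction) -> None:
        # The host calls this periodically to stay alive.
        self.last_update = local_now

    def has_fired(self, local_now: Fraction) -> bool:
        # True once a full timeout has elapsed without an update.
        return local_now - self.last_update >= self.timeout


def safe_wait(timeout: Fraction, rho: Fraction) -> Fraction:
    """Observer-local wait after which the watchdog must have fired.

    Worst case: the host clock runs slow (rate 1 - rho), so the watchdog
    fires late in real time, while the observer clock runs fast
    (rate 1 + rho), so the observer's wait ends early in real time.
    """
    return timeout * (1 + rho) / (1 - rho)


# Worked worst case: 1% maximum clock rate error, 10-unit watchdog timeout.
rho = Fraction(1, 100)
timeout = Fraction(10)
wait = safe_wait(timeout, rho)

real_elapsed = wait / (1 + rho)         # observer clock is fast
host_elapsed = real_elapsed * (1 - rho)  # host clock is slow
dog = Watchdog(timeout=timeout, last_update=Fraction(0))
fired_in_worst_case = dog.has_fired(host_elapsed)
```

Exact fractions are used so that the boundary case is not blurred by floating-point rounding. Waiting only `timeout` units on the observer's clock would not be enough in this worst case, which is why the known maximum error has to enter the bound.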



    Published In

    ICDCS '01: Proceedings of the 21st International Conference on Distributed Computing Systems
    April 2001

    Publisher

    IEEE Computer Society

    United States
