On the Impact of Fast Failure Detectors on Real-Time Fault-Tolerant Systems

Aguilera, Marcos K.; Le Lann, Gérard; Toueg, Sam

doi:10.1007/3-540-36108-1_24

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2508)

Included in the following conference series:

International Symposium on Distributed Computing

Abstract

We investigate whether fast failure detectors can be useful, and if so by how much, in the design of real-time fault-tolerant systems. Specifically, we show how fast failure detectors can speed up consensus and fault-tolerant broadcasts, by providing fast algorithms and deriving some matching lower bounds, for synchronous systems with crashes. These results show that a fast failure detector service (implemented using specialized hardware or expedited message delivery) can be an important tool in the design of real-time mission-critical systems.

Author information

Authors and Affiliations

Marcos K. Aguilera: HP Systems Research Center, 1501 Page Mill Road, Palo Alto, CA, 94304, USA
Gérard Le Lann: INRIA Rocquencourt, BP 105, F-78153, Le Chesnay, France
Sam Toueg: Department of Computer Science, University of Toronto, Toronto, Canada

Editor information

Editors and Affiliations

Dahlia Malkhi: School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, 91904, Israel

About this paper

Cite this paper

Aguilera, M.K., Le Lann, G., Toueg, S. (2002). On the Impact of Fast Failure Detectors on Real-Time Fault-Tolerant Systems. In: Malkhi, D. (eds) Distributed Computing. DISC 2002. Lecture Notes in Computer Science, vol 2508. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36108-1_24

DOI: https://doi.org/10.1007/3-540-36108-1_24
Published: 24 October 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00073-0
Online ISBN: 978-3-540-36108-4
eBook Packages: Springer Book Archive
