Article

Failure detection with booting in partially synchronous systems

Authors: Gérard Le Lann, Ulrich Schmid

EDCC'05: Proceedings of the 5th European conference on Dependable Computing

Pages 20 - 37

https://doi.org/10.1007/11408901_3

Published: 20 April 2005

Abstract

Unreliable failure detectors are a well-known means of enriching asynchronous distributed systems with time-free semantics that allow consensus to be solved in the presence of crash failures. Implementing unreliable failure detectors requires a system that provides some synchrony, typically an upper bound on end-to-end message delays. Recently, we introduced an implementation of the perfect failure detector in a novel partially synchronous model, referred to as the Θ-Model, where only the ratio Θ of maximum vs. minimum end-to-end delay of messages that are simultaneously in transit must be known a priori (the actual delays need not be known, nor even bounded). In this paper, we present an alternative failure detector algorithm, which is based on a clock synchronization algorithm for the Θ-Model. It not only surpasses our first implementation with respect to failure detection time, but also works during the system booting phase.
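
To make the Θ condition from the abstract concrete, the following Python sketch computes, for a recorded message trace, the largest ratio of maximum to minimum end-to-end delay among messages that are in transit at the same time. It is only one possible reading of the abstract's wording; the paper's formal definition may differ in detail. The trace format `(send_time, receive_time)` and the function name `observed_theta` are assumptions made here for illustration and do not come from the paper.

```python
from typing import Iterable, Tuple

# Hypothetical trace format: (send_time, receive_time) per message.
Message = Tuple[float, float]


def observed_theta(messages: Iterable[Message]) -> float:
    """Largest max/min delay ratio among messages in transit together.

    Illustrative only: this is not the paper's failure detector, just a
    check of the ratio that the Theta-Model assumes to be bounded.
    """
    msgs = list(messages)
    for sent, received in msgs:
        if received <= sent:
            raise ValueError("every message needs a positive delay")
    worst = 1.0
    # The in-transit set only grows at send events, so checking the
    # instant of each send covers every maximal set of concurrent messages.
    for t, _ in msgs:
        delays = [r - s for s, r in msgs if s <= t < r]
        worst = max(worst, max(delays) / min(delays))
    return worst


# Delays grow from 1 to 8 time units, yet concurrent messages never
# differ by more than a factor of 2.
trace = [(0.0, 1.0), (0.5, 2.5), (3.0, 7.0), (4.0, 12.0)]
print(observed_theta(trace))  # prints 2.0
```

In this trace the absolute delays keep growing, which would defeat any fixed timeout, while the ratio among concurrent messages stays at 2. That is the kind of system behaviour the abstract refers to when it says the actual delays need not be known or bounded.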

Published In

EDCC'05: Proceedings of the 5th European conference on Dependable Computing

April 2005

470 pages

ISBN: 3540257233

Editors:
Mario Dal Cin
Institute for Computer Sciences III, University of Erlangen-Nürnberg, Martensstr. 3, Erlangen, Germany
Mohamed Kaâniche
UPS, INSA, INP, ISAE; LAAS-CNRS, Université de Toulouse, Toulouse, France
András Pataricza
Department of Measurement and Information Systems, Budapest University of Technology and Economics

Publisher

Springer-Verlag

Berlin, Heidelberg

Cited By

Függer, M., Schmid, U. (2012). Reconciling fault-tolerant distributed computing and systems-on-chip. Distributed Computing 24:6 (323-355). Online publication date: 1-Jan-2012
https://dl.acm.org/doi/10.1007/s00446-011-0151-7
Biely, M., Widder, J. (2009). Optimal message-driven implementations of omega with mute processes. ACM Transactions on Autonomous and Adaptive Systems 4:1 (1-22). Online publication date: 9-Feb-2009
https://dl.acm.org/doi/10.1145/1462187.1462191
Moser, H. (2009). Towards a real-time distributed computing model. Theoretical Computer Science 410:6-7 (629-659). Online publication date: 20-Feb-2009
https://dl.acm.org/doi/10.1016/j.tcs.2008.10.012
Widder, J., Schmid, U. (2009). The Theta-Model. Distributed Computing 22:1 (29-47). Online publication date: 1-Apr-2009
https://dl.acm.org/doi/10.1007/s00446-009-0080-x
Guerraoui, R., Lynch, N. (2008). A general characterization of indulgence. ACM Transactions on Autonomous and Adaptive Systems 3:4 (1-19). Online publication date: 12-Dec-2008
https://dl.acm.org/doi/10.1145/1452001.1452010
Robinson, P., Schmid, U., Bazzi, R., Patt-Shamir, B. (2008). The asynchronous bounded-cycle model. Proceedings of the twenty-seventh ACM symposium on Principles of distributed computing (423-423). Online publication date: 18-Aug-2008
https://dl.acm.org/doi/10.1145/1400751.1400815
Robinson, P., Schmid, U. (2008). The Asynchronous Bounded-Cycle Model. Proceedings of the 10th International Symposium on Stabilization, Safety, and Security of Distributed Systems (246-262). Online publication date: 21-Nov-2008
https://dl.acm.org/doi/10.1007/978-3-540-89335-6_20
Widder, J., Schmid, U. (2007). Booting clock synchronization in partially synchronous systems with hybrid process and link failures. Distributed Computing 20:2 (115-140). Online publication date: 1-Aug-2007
https://dl.acm.org/doi/10.1007/s00446-007-0026-0
Biely, M., Widder, J. (2006). Optimal message-driven implementation of omega with mute processes. Proceedings of the 8th international conference on Stabilization, safety, and security of distributed systems (110-121). Online publication date: 17-Nov-2006
https://dl.acm.org/doi/10.5555/1759076.1759086
Hutle, M., Widder, J., Aguilera, M., Aspnes, J. (2005). Brief announcement. Proceedings of the twenty-fourth annual ACM symposium on Principles of distributed computing (208-208). Online publication date: 17-Jul-2005
https://dl.acm.org/doi/10.1145/1073814.1073852
