DOI: 10.1145/1244002.1244127

Evaluation of the QoS of crash-recovery failure detection

Published: 11 March 2007

Abstract

Crash failure detection is a key topic in fault tolerance, and it is important to be able to assess the quality of service (QoS) of failure detection services. Most previous work on crash failure detectors assumes a crash-stop or fail-free model. In this paper we study and model a crash-recovery service, that is, a service that can recover from the crash state. We analyse the QoS bounds for failure detection of such a crash-recovery service. Our results show that the dependability metrics of the monitored service affect the QoS of the failure detection service. Simulation results corroborate the analysis and illustrate the QoS bounds.
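The dependence described in the abstract can be explored with a small simulation. The sketch below is illustrative only and is not the model analysed in the paper: it assumes exponentially distributed up and down periods (means `mttf` and `mttr`), a fixed heartbeat interval `eta`, exponentially distributed message delay, independent message loss, and a simple fixed-timeout detector. All function names and parameter values are hypothetical. As a rough accuracy proxy, it reports the fraction of sampled instants at which the detector's output (trust or suspect) agrees with the true state of the monitored service.

```python
import bisect
import random


def up_intervals(mttf, mttr, horizon, rng):
    """Alternate up and down periods with exponential lengths (an assumption)."""
    t, out = 0.0, []
    while t < horizon:
        up_end = min(t + rng.expovariate(1.0 / mttf), horizon)
        out.append((t, up_end))
        # The service stays down for a random repair time, then recovers.
        t = up_end + rng.expovariate(1.0 / mttr)
    return out


def heartbeat_arrivals(intervals, eta, mean_delay, loss, rng):
    """Heartbeats are sent every eta while up; each may be lost or delayed."""
    arrivals = []
    for start, end in intervals:
        t = start
        while t < end:
            if rng.random() >= loss:
                arrivals.append(t + rng.expovariate(1.0 / mean_delay))
            t += eta
    return sorted(arrivals)


def match_fraction(intervals, arrivals, timeout, horizon, step):
    """Fraction of sampled instants where 'trusted' equals 'service is up'."""
    starts = [s for s, _ in intervals]
    n = int(horizon / step)
    hits = 0
    for k in range(n):
        t = k * step
        i = bisect.bisect_right(starts, t) - 1
        is_up = i >= 0 and t < intervals[i][1]
        # Trust the service if some heartbeat arrived in (t - timeout, t].
        j = bisect.bisect_right(arrivals, t) - 1
        trusted = j >= 0 and arrivals[j] > t - timeout
        hits += (is_up == trusted)
    return hits / n if n else 0.0


def simulate(mttf, mttr, eta=1.0, timeout=3.0, mean_delay=0.2,
             loss=0.01, horizon=10_000.0, seed=0):
    """Run one seeded simulation; every default value here is hypothetical."""
    rng = random.Random(seed)
    ivs = up_intervals(mttf, mttr, horizon, rng)
    arr = heartbeat_arrivals(ivs, eta, mean_delay, loss, rng)
    return match_fraction(ivs, arr, timeout, horizon, step=0.5)


if __name__ == "__main__":
    # Compare a rarely crashing service with a frequently crashing one.
    for mttf in (1000.0, 20.0):
        print(f"MTTF={mttf}: match fraction = {simulate(mttf, mttr=5.0):.4f}")
```

Running the script with different `mttf` and `mttr` values gives a feel for how the failure and recovery behaviour of the monitored service changes the detector's accuracy under these assumptions. It does not reproduce the paper's bounds or its simulation setup.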


Published In

SAC '07: Proceedings of the 2007 ACM symposium on Applied computing
March 2007
1688 pages
ISBN:1595934804
DOI:10.1145/1244002

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dependability
  2. failure detection
  3. fault tolerance
  4. quality of services
  5. reliability
  6. web services

Qualifiers

  • Article

Conference

SAC07

Acceptance Rates

Overall Acceptance Rate 1,650 of 6,669 submissions, 25%
