chapter

Architecting dependable systems with proactive fault management

Authors:

Miroslaw Malek

Architecting dependable systems VII

January 2010

Pages 171 - 200

Published: 01 January 2010

Abstract

Managing the ever-growing complexity of computing systems is an enduring challenge for computer system engineers. We argue that we need to resort to predictive technologies in order to harness the system's complexity and turn the vision of proactive system and failure management into reality. We describe proactive fault management, provide an overview and taxonomy of online failure prediction methods, and present a classification of failure-prediction-triggered methods. We present a model to assess the effects of proactive fault management on system reliability and show that overall dependability can be significantly enhanced. Having shown the methods and potential of proactive fault management, we describe a blueprint for how proactive fault management can be incorporated into a dependable system's architecture.

Cited By

Faqeh, R., Fetzer, C., Hermanns, H., Hoffmann, J., Klauck, M., Köhl, M., Steinmetz, M., Weidenbach, C. (2020). Towards Dynamic Dependable Systems Through Evidence-Based Continuous Certification. Leveraging Applications of Formal Methods, Verification and Validation: Engineering Principles, pp. 416-439. Online publication date: 20-Oct-2020.
https://dl.acm.org/doi/10.1007/978-3-030-61470-6_25

Architecting dependable systems with proactive fault management
1. Computer systems organization
2. General and reference
  1. Cross-computing tools and techniques

Published In

Architecting dependable systems VII

January 2010

322 pages

ISBN: 364217244X

Editors:
Antonio Casimiro, University of Lisbon, Faculty of Science, Lisbon, Portugal
Rogério de Lemos, University of Kent, School of Computing, Canterbury, Kent, UK
Cristina Gacek, City University, London, Centre for Software Reliability, London, UK

Publisher

Springer-Verlag

Berlin, Heidelberg

Qualifiers

Chapter
