DOI: 10.5555/1985596.1985607

Architecting dependable systems with proactive fault management

January 2010
Pages 171 - 200
Published: 01 January 2010


Managing the ever-growing complexity of computing systems is a persistent challenge for computer system engineers. We argue that predictive technologies are needed to harness this complexity and to turn the vision of proactive system and failure management into reality. We describe proactive fault management, provide an overview and taxonomy of online failure prediction methods, and present a classification of failure-prediction-triggered methods. We present a model to assess the effects of proactive fault management on system reliability and show that overall dependability can be significantly enhanced. Having shown the methods and potential of proactive fault management, we describe a blueprint for incorporating it into a dependable system's architecture.


Cited By

  • (2020) Towards Dynamic Dependable Systems Through Evidence-Based Continuous Certification. Leveraging Applications of Formal Methods, Verification and Validation: Engineering Principles, pp. 416-439. DOI: 10.1007/978-3-030-61470-6_25. Online publication date: 20-Oct-2020
      Published In

      Architecting dependable systems VII
      January 2010
      322 pages
      Editors: Antonio Casimiro, Rogério de Lemos, Cristina Gacek



      Berlin, Heidelberg
