Minimizing completion time of a program by checkpointing and rejuvenation

Authors: Chandra Kintala, Kishor S. Trivedi

ACM SIGMETRICS Performance Evaluation Review, Volume 24, Issue 1

Pages 252 - 261

https://doi.org/10.1145/233008.233050

Published: 15 May 1996

Abstract

Checkpointing with rollback-recovery is a well-known technique for reducing the completion time of a program in the presence of failures. While checkpointing is corrective in nature, rejuvenation refers to preventive maintenance of software aimed at reducing unexpected failures that mostly result from the "aging" phenomenon. In this paper, we show how both techniques may be used together to further reduce the expected completion time of a program. The idea of using checkpoints to reduce the amount of rollback upon a failure is taken a step further by combining it with rejuvenation. We derive equations for the expected completion time of a program with finite failure-free running time for three cases: (a) neither checkpointing nor rejuvenation is employed, (b) only checkpointing is employed, and (c) both checkpointing and rejuvenation are employed. We also present numerical results for a Weibull failure time distribution for these three cases and discuss the optimal checkpointing and rejuvenation that minimize the expected completion time. Using the numerical results, we draw some conclusions about the benefits of these techniques in relation to the nature of the failure distribution.
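The three cases in the abstract can be explored numerically with a small Monte Carlo sketch. The code below is not the paper's analytical derivation: it assumes a simple model (equal-length segments, a fixed checkpoint cost, a fixed recovery cost, a Weibull time to failure measured from the last restart, and rejuvenation after every few checkpoints). All function names, parameter names, and numeric values are hypothetical and chosen only for illustration.

```python
import random


def simulate_once(work, n_segments, ckpt_cost, recovery_cost, shape, scale,
                  rejuv_every=0, rejuv_cost=0.0, rng=random):
    """Simulate one run and return its total completion time.

    Assumed model (an illustration, not taken from the paper):
    - `work` units of failure-free running time, split into equal segments;
    - a checkpoint costing `ckpt_cost` follows every segment except the last;
    - the time to failure is Weibull(shape, scale), measured from the last
      restart or rejuvenation (the process "age");
    - a failure costs `recovery_cost`, rolls back to the last checkpoint
      and restarts the process, which resets its age;
    - if `rejuv_every` > 0, the process is rejuvenated after every
      `rejuv_every`-th checkpoint at cost `rejuv_cost`, resetting its age.
    """
    segment = work / n_segments
    elapsed = 0.0
    age = 0.0
    # random.weibullvariate takes (scale, shape) in that order.
    time_to_failure = rng.weibullvariate(scale, shape)
    done = 0
    while done < n_segments:
        is_last = done == n_segments - 1
        need = segment + (0.0 if is_last else ckpt_cost)
        if age + need <= time_to_failure:
            # Segment (and its checkpoint, if any) completes without failure.
            elapsed += need
            age += need
            done += 1
            if rejuv_every and not is_last and done % rejuv_every == 0:
                elapsed += rejuv_cost
                age = 0.0
                time_to_failure = rng.weibullvariate(scale, shape)
        else:
            # Failure: the partial segment is lost and the process restarts.
            elapsed += (time_to_failure - age) + recovery_cost
            age = 0.0
            time_to_failure = rng.weibullvariate(scale, shape)
    return elapsed


def mean_completion_time(runs, seed, **params):
    """Average completion time over `runs` independent simulated runs."""
    rng = random.Random(seed)
    return sum(simulate_once(rng=rng, **params) for _ in range(runs)) / runs


if __name__ == "__main__":
    # Hypothetical parameters; shape > 1 models an increasing failure rate.
    base = dict(work=100.0, recovery_cost=1.0, shape=2.0, scale=80.0)
    cases = {
        "(a) neither": dict(n_segments=1, ckpt_cost=0.0),
        "(b) checkpointing only": dict(n_segments=10, ckpt_cost=0.5),
        "(c) checkpointing + rejuvenation": dict(
            n_segments=10, ckpt_cost=0.5, rejuv_every=2, rejuv_cost=0.5),
    }
    for name, extra in cases.items():
        print(name, round(mean_completion_time(2000, 42, **base, **extra), 2))
```

Sweeping `n_segments` and `rejuv_every` over a grid and comparing the resulting means is one way to look for an approximate minimum under this assumed model. How the three cases rank depends on the chosen Weibull shape and on the cost parameters.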

Published In

ACM SIGMETRICS Performance Evaluation Review Volume 24, Issue 1

May 1996

273 pages

ISSN:0163-5999

DOI:10.1145/233008

Editors: Blaine E. Gaither (Hewlett-Packard, Fort Collins, CO), Daniel A. Reed (Univ. of Illinois, Urbana)

SIGMETRICS '96: Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
May 1996
279 pages
ISBN:0897917936
DOI:10.1145/233013
Chairman: Daniel A. Reed (Univ. of Illinois, Urbana)
Editor: Blaine D. Gaither (Hewlett-Packard, Fort Collins, CO)

Copyright © 1996 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Article Metrics

Total citations: 56
Total downloads: 518
Downloads (last 12 months): 28
Downloads (last 6 weeks): 1

Reflects downloads up to 06 Aug 2024

Cited By

Jafary B, Fiondella L, Chang P (2020). Optimal equidistant checkpointing of fault tolerant systems subject to correlated failure. Proceedings of the Institution of Mechanical Engineers, Part O: Journal of Risk and Reliability (1748006X1989356). Online publication date: 4-May-2020.
https://doi.org/10.1177/1748006X19893569
Carzaniga A, Gorla A, Perino N, Pezzè M (2015). Automatic Workarounds. ACM Transactions on Software Engineering and Methodology 24:3 (1-42). Online publication date: 13-May-2015.
https://dl.acm.org/doi/10.1145/2755970
Cui L, Hao Z, Li L, Fei H, Ding Z, Li B, Liu P (2015). Lightweight Virtual Machine Checkpoint and Rollback for Long-running Applications. Algorithms and Architectures for Parallel Processing (577-596). Online publication date: 16-Dec-2015.
https://doi.org/10.1007/978-3-319-27137-8_42
Alonso J, Trivedi K (2015). Software Rejuvenation and its Application in Distributed Systems. Quantitative Assessments of Distributed Systems (301-325). Online publication date: 17-Apr-2015.
https://doi.org/10.1002/9781119131151.ch11
Cotroneo D, Natella R, Pietrantuono R, Russo S (2014). A survey of software aging and rejuvenation studies. ACM Journal on Emerging Technologies in Computing Systems 10:1 (1-34). Online publication date: 13-Jan-2014.
https://dl.acm.org/doi/10.1145/2539117
Sudhakar C, Shah I, Ramesh T (2014). Software rejuvenation in cloud systems using neural networks. 2014 International Conference on Parallel, Distributed and Grid Computing (230-233). Online publication date: Dec-2014.
https://doi.org/10.1109/PDGC.2014.7030747
Koutras V, Malefaki S, Platis A (2014). Rejuvenation effects on the grid environment performance with response time delays using Monte Carlo simulation. Simulation Modelling Practice and Theory 40 (176-191). Online publication date: Jan-2014.
https://doi.org/10.1016/j.simpat.2013.10.001
Bouguerra M, Trystram D, Wagner F (2013). Complexity Analysis of Checkpoint Scheduling with Variable Costs. IEEE Transactions on Computers 62:6 (1269-1275). Online publication date: 1-Jun-2013.
https://dl.acm.org/doi/10.1109/TC.2012.57
Carzaniga A, Gorla A, Perino N, Pezzè M, Roman G, van der Hoek A (2010). Automatic workarounds for web applications. Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering (237-246). Online publication date: 7-Nov-2010.
https://dl.acm.org/doi/10.1145/1882291.1882327
Jin H, Chen Y, Zhu H, Sun X (2010). Optimizing HPC Fault-Tolerant Environment. Proceedings of the 2010 39th International Conference on Parallel Processing (525-534). Online publication date: 13-Sep-2010.
https://dl.acm.org/doi/10.1109/ICPP.2010.80
