
An On-Line Algorithm for Checkpoint Placement

Published: 01 September 1997

Abstract

Checkpointing enables us to reduce the time needed to recover from a fault by saving intermediate states of the program in reliable storage. The length of the intervals between checkpoints affects program execution time: long intervals lead to long reprocessing times after a fault, while overly frequent checkpointing leads to high checkpointing overhead. In this paper, we present an on-line algorithm for checkpoint placement. When deciding whether or not to place a checkpoint, the algorithm uses knowledge of the current cost of a checkpoint. The total execution-time overhead with the proposed algorithm is smaller than the overhead with fixed checkpoint intervals. Although the proposed algorithm uses only on-line knowledge of the checkpointing cost, its behavior is close to that of the optimal off-line algorithm, which has complete knowledge of the checkpointing cost.

Published In

IEEE Transactions on Computers, Volume 46, Issue 9
September 1997
96 pages
ISSN: 0018-9340

Publisher

IEEE Computer Society

United States

Author Tags

  1. Fault-tolerant computing
  2. checkpointing
  3. on-line algorithm
  4. performance optimization

Qualifiers

  • Research-article

Article Metrics

  • Downloads (last 12 months): 0
  • Downloads (last 6 weeks): 0
Reflects downloads up to 12 Nov 2024

Cited By

  • (2023) Taking 5G RAN Analytics and Control to a New Level. Proceedings of the 29th Annual International Conference on Mobile Computing and Networking, pp. 1-16. DOI: 10.1145/3570361.3592493. Online publication date: 2-Oct-2023.
  • (2022) A Genetic Algorithm-Based Approach to Identify Near-Optimal Non-Equidistant Checkpointing Strategies. 2022 Annual Reliability and Maintainability Symposium (RAMS), pp. 1-6. DOI: 10.1109/RAMS51457.2022.9894018. Online publication date: 24-Jan-2022.
  • (2022) Bounded DBM-based clock state construction for timed automata in Uppaal. International Journal on Software Tools for Technology Transfer (STTT), vol. 25, no. 1, pp. 19-47. DOI: 10.1007/s10009-022-00667-x. Online publication date: 8-Sep-2022.
  • (2020) NVGraph: Enforcing Crash Consistency of Evolving Network Analytics in NVMM Systems. IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 6, pp. 1255-1269. DOI: 10.1109/TPDS.2020.2965452. Online publication date: 23-Jan-2020.
  • (2017) The Optimal Checkpoint Interval for the Long-Running Application. International Journal of Advanced Pervasive and Ubiquitous Computing, vol. 9, no. 2, pp. 45-54. DOI: 10.4018/IJAPUC.2017040103. Online publication date: 1-Apr-2017.
  • (2017) Analysis of Relationship Between Modes of Intercomputer Communications and Fault Types in Redundant Computer Systems. LNCS on Transactions on Computational Science XXIX, Volume 10220, pp. 1-32. DOI: 10.1007/978-3-662-54563-8_1. Online publication date: 1-Jan-2017.
  • (2016) On channel failures, file fragmentation policies, and heavy-tailed completion times. IEEE/ACM Transactions on Networking, vol. 24, no. 1, pp. 529-541. DOI: 10.1109/TNET.2014.2375920. Online publication date: 1-Feb-2016.
  • (2016) Optimizing the Level of Confidence for Multiple Jobs. IEEE Transactions on Computers, vol. 65, no. 4, pp. 1239-1252. DOI: 10.1109/TC.2015.2439254. Online publication date: 1-Apr-2016.
  • (2015) An aperiodic checkpointing strategy in desktop grids. International Journal of Computational Science and Engineering, vol. 10, no. 3, pp. 244-252. DOI: 10.1504/IJCSE.2015.068833. Online publication date: 1-Apr-2015.
  • (2013) Fault detection and recovery efficiency co-optimization through compile-time analysis and runtime adaptation. Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems, pp. 1-10. DOI: 10.5555/2555729.2555751. Online publication date: 29-Sep-2013.
