DOI: 10.1007/978-3-642-40047-6_43
Article

Multi-criteria checkpointing strategies: response-time versus resource utilization

Published: 26 August 2013

Abstract

Failures increasingly threaten the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its limits at such scales. One reason for this unnerving situation is the focus that has been placed on per-application completion time rather than on platform efficiency. In this paper, we discuss the case of uncoordinated rollback recovery where the idle time spent waiting for recovering processors is used to progress a different, independent application from the system batch queue. We then propose an extended model of uncoordinated checkpointing that can discriminate between idle time and wasted computation. We instantiate this model in a simulator to demonstrate that, with this strategy, the per-application completion time of uncoordinated checkpointing is unchanged, while the strategy delivers near-perfect platform efficiency.

Cited By

  • (2014) Toward Exascale Resilience. Supercomputing Frontiers and Innovations: an International Journal 1:1 (5-28). DOI: 10.14529/jsfi140101. Online publication date: 6-Apr-2014


Published In

Euro-Par'13: Proceedings of the 19th international conference on Parallel Processing
August 2013
885 pages
ISBN: 9783642400469
Editors: Felix Wolf, Bernd Mohr, Dieter Mey

Sponsors

  • Intel Corporation
  • Deutsche Forschungsgemeinschaft
  • NVIDIA
  • Bull GmbH

Publisher

Springer-Verlag, Berlin, Heidelberg

