research-article

Checkpointing Workflows à la Young/Daly Is Not Good Enough

Authors:

Yves Robert, and

Hongyang Sun

ACM Transactions on Parallel Computing, Volume 9, Issue 4

Article No.: 14, Pages 1 - 25

https://doi.org/10.1145/3548607

Published: 16 December 2022

Abstract

This article revisits checkpointing strategies when workflows composed of multiple tasks execute on a parallel platform. The objective is to minimize the expectation of the total execution time. For a single task, the Young/Daly formula provides the optimal checkpointing period. However, when many tasks execute simultaneously, the risk that one of them is severely delayed increases with the number of tasks. To mitigate this risk, a possibility is to checkpoint each task more often than with the Young/Daly strategy. But is it worth slowing each task down with extra checkpoints? Does the extra checkpointing make a difference globally? This article answers these questions. On the theoretical side, we prove several negative results for keeping the Young/Daly period when many tasks execute concurrently, and we design novel checkpointing strategies that guarantee an efficient execution with high probability. On the practical side, we report comprehensive experiments that demonstrate the need to go beyond the Young/Daly period and to checkpoint more often for a wide range of application/platform settings.
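The abstract refers to the Young/Daly formula without stating it. As a minimal sketch, the standard first-order form of that formula is W = sqrt(2 * mu * C), where mu is the mean time between failures and C is the time to take one checkpoint. The function and parameter names below (`young_daly_period`, `mtbf`, `checkpoint_cost`, `prob_any_task_fails`) are illustrative assumptions, not names from the article, and the last function assumes independent tasks with exponentially distributed failures, which may differ from the model the article analyzes.

```python
import math


def young_daly_period(mtbf: float, checkpoint_cost: float) -> float:
    """First-order Young/Daly period W = sqrt(2 * mtbf * checkpoint_cost)."""
    if mtbf <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtbf and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


def first_order_overhead(period: float, mtbf: float, checkpoint_cost: float) -> float:
    """First-order expected overhead per unit of work.

    checkpoint_cost / period: time spent checkpointing.
    period / (2 * mtbf): expected re-execution, since a failure loses
    half a period of work on average.
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


def prob_any_task_fails(n_tasks: int, duration: float, mtbf: float) -> float:
    """Probability that at least one of n_tasks sees a failure within duration.

    Assumption (not from the article): tasks fail independently, each with
    exponentially distributed failures of mean mtbf.
    """
    return 1.0 - math.exp(-n_tasks * duration / mtbf)


if __name__ == "__main__":
    mtbf, cost = 20000.0, 100.0  # hypothetical values, in seconds
    w = young_daly_period(mtbf, cost)
    print(f"Young/Daly period: {w:.1f} s")
    print(f"Overhead at W: {first_order_overhead(w, mtbf, cost):.3f}")
    for n in (1, 10, 100):
        print(n, f"{prob_any_task_fails(n, w, mtbf):.3f}")
```

Under these assumptions, the chance that some task is hit by a failure within one period grows quickly with the number of concurrent tasks. This is the intuition behind the abstract's question of whether each task should checkpoint more often than the single-task optimum.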

Supplementary Material

3548607.supp (3548607.supp.pdf)

Cited By

Dongarra J., Tourancheau B., Bhowmick S., Bell P., Taufer M. (2023). A Survey of Graph Comparison Methods with Applications to Nondeterminism in High-Performance Computing. International Journal of High Performance Computing Applications 37:3-4 (306-327). DOI: 10.1177/10943420231166610. Online publication date: 1-Jul-2023.
https://dl.acm.org/doi/10.1177/10943420231166610

Published In

ACM Transactions on Parallel Computing Volume 9, Issue 4

December 2022

102 pages

ISSN:2329-4949

EISSN:2329-4957

DOI:10.1145/3572851

Editor:
David A. Bader
New Jersey Institute of Technology, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 December 2022

Online AM: 02 September 2022

Accepted: 11 July 2022

Revised: 07 July 2022

Received: 18 November 2021

Published in TOPC Volume 9, Issue 4

Qualifiers

Research-article
Refereed
