research-article

Checkpointing Workflows à la Young/Daly Is Not Good Enough

Published: 16 December 2022
    Abstract

    This article revisits checkpointing strategies when workflows composed of multiple tasks execute on a parallel platform. The objective is to minimize the expectation of the total execution time. For a single task, the Young/Daly formula provides the optimal checkpointing period. However, when many tasks execute simultaneously, the risk that one of them is severely delayed increases with the number of tasks. To mitigate this risk, a possibility is to checkpoint each task more often than with the Young/Daly strategy. But is it worth slowing each task down with extra checkpoints? Does the extra checkpointing make a difference globally? This article answers these questions. On the theoretical side, we prove several negative results for keeping the Young/Daly period when many tasks execute concurrently, and we design novel checkpointing strategies that guarantee an efficient execution with high probability. On the practical side, we report comprehensive experiments that demonstrate the need to go beyond the Young/Daly period and to checkpoint more often for a wide range of application/platform settings.
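For reference, the Young/Daly period mentioned in the abstract is classically given, to first order, by W = sqrt(2 * mu * C), where mu is the mean time between failures seen by the task and C is the time to take one checkpoint. The sketch below is a minimal illustration of that single-task formula and of the first-order overhead it minimizes. The function names and the sample values (one failure per day, a 60-second checkpoint) are assumptions chosen for illustration; they are not taken from the article or from its simulator.

```python
import math


def young_daly_period(mtbf: float, checkpoint_cost: float) -> float:
    """Return the first-order Young/Daly period sqrt(2 * mtbf * checkpoint_cost).

    Both arguments must use the same time unit (seconds here).
    """
    if mtbf <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtbf and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


def first_order_overhead(period: float, mtbf: float, checkpoint_cost: float) -> float:
    """First-order expected overhead per unit of work for a given period.

    checkpoint_cost / period accounts for the time spent checkpointing;
    period / (2 * mtbf) accounts for the work lost, on average half a
    period, each time a failure strikes.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    return checkpoint_cost / period + period / (2.0 * mtbf)


if __name__ == "__main__":
    mtbf = 86_400.0  # hypothetical: one failure per day, in seconds
    cost = 60.0      # hypothetical: 60 s to write one checkpoint
    w_yd = young_daly_period(mtbf, cost)
    print(f"Young/Daly period: {w_yd:.1f} s")
    # Compare the overhead at the Young/Daly period with shorter periods.
    for factor in (0.25, 0.5, 1.0):
        overhead = first_order_overhead(factor * w_yd, mtbf, cost)
        print(f"period = {factor:.2f} * W_YD -> overhead ~ {overhead:.4f}")
```

Under this single-task model, any period shorter than W_YD has a higher expected overhead. The abstract's point is that this per-task cost can still be worth paying when many tasks run concurrently, because the total execution time is driven by the most delayed task.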

    Supplementary Material

    3548607.supp (3548607.supp.pdf)
    Supplementary material

    Cited By

    • (2023) A Survey of Graph Comparison Methods with Applications to Nondeterminism in High-Performance Computing. International Journal of High Performance Computing Applications 37, 3-4 (306-327). DOI: 10.1177/10943420231166610. Online publication date: 1-Jul-2023

    Published In

    ACM Transactions on Parallel Computing, Volume 9, Issue 4
    December 2022
    102 pages
    ISSN:2329-4949
    EISSN:2329-4957
    DOI:10.1145/3572851

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 16 December 2022
    Online AM: 02 September 2022
    Accepted: 11 July 2022
    Revised: 07 July 2022
    Received: 18 November 2021
    Published in TOPC Volume 9, Issue 4

    Author Tags

    1. Checkpoint
    2. workflow
    3. concurrent tasks
    4. Young/Daly formula

    Qualifiers

    • Research-article
    • Refereed
