Abstract
The emergence of multiple resource management systems, such as SLURM and Kubernetes, for different computational purposes has created a desire to support a single workflow that spans multiple resource management domains, which can include multiple HPC systems, edge systems, and clouds across different network domains. Best-of-class tools developed in one domain often run poorly, or not at all, under a different resource management regime, which drives the demand for these hybrid environments. The resilience properties and concerns of workflows that cross resource management systems remain unexplored. Further, we lack tools and techniques to test this resilience and to understand how well systems, and systems of systems, work in the face of faults and failures. We propose the Fault Tolerance 500 (FT500) and a related set of benchmarks that test from the hardware layer through the software layers to create resilience scenarios. By making this a scored benchmark set, we offer a public ranking of systems and software, and a motivation for facilities to allow benchmarking. We also discuss potential approaches to enable fault-tolerant converged computing.
Acknowledgment
We thank the anonymous reviewers for their valuable feedback. This work was partially supported by the Pacific Northwest National Laboratory (PNNL), operated by Battelle for the U.S. Department of Energy (DOE) under contract DE-AC05-76RL01830. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. This work was authored in part by employees of Brookhaven Science Associates, LLC under Contract No. DE-SC0012704. This work was also supported in part by the National Science Foundation (NSF) under CCF-2114514.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Guo, L. et al. (2023). Understanding System Resilience for Converged Computing of Cloud, Edge, and HPC. In: Bienz, A., Weiland, M., Baboulin, M., Kruse, C. (eds) High Performance Computing. ISC High Performance 2023. Lecture Notes in Computer Science, vol 13999. Springer, Cham. https://doi.org/10.1007/978-3-031-40843-4_17
DOI: https://doi.org/10.1007/978-3-031-40843-4_17
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40842-7
Online ISBN: 978-3-031-40843-4
eBook Packages: Computer Science, Computer Science (R0)