Analyzing and predicting job failures from HPC system log

The Journal of Supercomputing

Abstract

In this paper, we analyze the scheduler log of a production supercomputer that contains complete job information, in contrast to many existing (publicly available) HPC logs that contain only limited job information. We not only provide an in-depth statistical analysis of failed jobs based on the scheduler log, but also demonstrate how such a detailed scheduler log can be leveraged to predict job failures. For the latter, we first conduct a feature analysis based on the framework of ‘weight of evidence’ and ‘information value’ to uncover the impact of each workload attribute (feature) on the failure or success of a job, thereby enabling us to identify key features. We then conduct a comparative performance study of six data-driven machine learning models for predicting job failures in an HPC system based on the scheduler log. Our experimental results show that tree-based models exhibit superior performance in terms of both prediction accuracy and computational cost. We also demonstrate that our feature analysis improves the computational efficiency of each machine learning model without sacrificing its prediction performance.
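
To make the comparison concrete, the sketch below shows how such a model comparison could be set up with scikit-learn [44]. It is illustrative only: the synthetic data, the particular classifiers, and the evaluation metrics are placeholders and are not the exact features, models, or protocol used in the paper.

  # Illustrative only: the features, models, and data below are placeholders,
  # not the exact setup used in the paper.
  import time
  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_validate
  from sklearn.neural_network import MLPClassifier
  from sklearn.tree import DecisionTreeClassifier

  # Stand-in for a job-level feature matrix derived from a scheduler log
  # (e.g., requested cores, requested walltime, queue, user, submit hour).
  X, y = make_classification(n_samples=20_000, n_features=10,
                             weights=[0.8, 0.2], random_state=0)  # y = 1: failed job (assumption)

  models = {
      "decision_tree": DecisionTreeClassifier(random_state=0),
      "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
      "logistic_regression": LogisticRegression(max_iter=1000),
      "mlp": MLPClassifier(hidden_layer_sizes=(100,), random_state=0),
  }

  for name, model in models.items():
      start = time.time()
      scores = cross_validate(model, X, y, cv=5, scoring=["accuracy", "f1"])
      print(f"{name:>20}: acc={scores['test_accuracy'].mean():.3f} "
            f"f1={scores['test_f1'].mean():.3f} "
            f"time={time.time() - start:.1f}s")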

Data availability

No additional data or materials are available.

Notes

  1. Tachyon is the fourth supercomputer at the National Supercomputing Center of the Korea Institute of Science and Technology Information (KISTI); it provided computing resources to support large-scale national research until 2017. While the fifth supercomputer, Nurion, is currently in operation, the logs of Nurion are not yet open to the public for security reasons. Thus, we focus on the log of Tachyon in this work.

  2. Note that a job can run on multiple nodes simultaneously, so it can be associated with multiple hostnames. Thus, when computing the IV value of hostname, we use the hostname of the first node associated with each job.

  3. In general, the values of IV have the following implications [4]: \(\text{IV} < 0.03\) (poor predictor), \(0.03 < \text{IV} < 0.1\) (weak predictor), \(0.1 < \text{IV} < 0.3\) (average predictor), \(0.3 < \text{IV} < 0.5\) (strong predictor), and \(0.5 < \text{IV}\) (very strong predictor). A small computation sketch is given below these notes.
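
To illustrate Notes 2 and 3, the following is a minimal sketch of how the weight of evidence and information value of a single categorical feature (e.g., hostname, taking the first node of each job) could be computed. The column names, the toy data, and the smoothing constant are assumptions for illustration, not the paper's exact procedure.

  # Minimal WoE/IV sketch for one categorical feature, following the standard
  # credit-scoring definitions; column names and the toy data are assumptions.
  import numpy as np
  import pandas as pd

  jobs = pd.DataFrame({
      # Per Note 2, only the first hostname of each job is used.
      "hostname": ["n001", "n002", "n001", "n003", "n002", "n001"],
      "failed":   [1, 0, 1, 0, 0, 1],   # 1 = failed job, 0 = successful job
  })

  def information_value(df, feature, target, eps=0.5):
      # Event/non-event counts per category (eps avoids division by zero
      # and log(0) for categories with no failures or no successes).
      grp = df.groupby(feature)[target].agg(bad="sum", total="count")
      grp["good"] = grp["total"] - grp["bad"]
      bad_rate = (grp["bad"] + eps) / (df[target].sum() + eps)
      good_rate = (grp["good"] + eps) / ((len(df) - df[target].sum()) + eps)
      woe = np.log(good_rate / bad_rate)          # weight of evidence per category
      iv = ((good_rate - bad_rate) * woe).sum()   # information value of the feature
      return iv

  iv = information_value(jobs, "hostname", "failed")
  # Interpretation thresholds from Note 3:
  if iv < 0.03:   strength = "poor"
  elif iv < 0.1:  strength = "weak"
  elif iv < 0.3:  strength = "average"
  elif iv < 0.5:  strength = "strong"
  else:           strength = "very strong"
  print(f"IV(hostname) = {iv:.3f} ({strength} predictor)")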

References

  1. Abdou HA, Pointon J (2011) Credit scoring, statistical techniques and evaluation criteria: a review of the literature. Intell Syst Accounting Financ Manag 18(2–3):59–88

  2. Abeyratne N, Chen HM, Oh B, et al (2016) Checkpointing exascale memory systems with existing memory technologies. In: International Symposium on Memory Systems (MEMSYS’16), ACM, pp 18–29

  3. Alharthi KA, Jhumka A, Di S, et al (2022) Clairvoyant: a log-based transformer-decoder for failure prediction in large-scale systems. In: Proceedings of the 36th ACM International Conference on Supercomputing, pp 1–14

  4. Bailey M (2001) Credit scoring: the principles and practicalities. White Box Publishing, Bristol

  5. Benoit A, Le Fèvre V, Raghavan P, et al (2020) Design and comparison of resilient scheduling heuristics for parallel jobs. In: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, pp 567–576

  6. Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press, Oxford

  7. Bishop CM, Nasrabadi NM (2006) Pattern recognition and machine learning. Springer, Berlin

  8. Borges G, David M, Gomes J, et al (2007) Sun Grid Engine, a new scheduler for EGEE middleware. In: IBERGRID–Iberian Grid Infrastructure Conference

  9. Breiman L (2001) Random forests. Mach Learn 45(1):5–32

  10. Burkov A (2019) The hundred-page machine learning book. Andriy Burkov, Quebec City

  11. Chen T, Guestrin C (2016) Xgboost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 785–794

  12. Cirne W, Berman F (2001) A comprehensive model of the supercomputer workload. In: IEEE International Workshop on Workload Characterization, pp 140–148

  13. Das A, Mueller F, Rountree B (2020) Aarohi: making real-time node failure prediction feasible. In: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), IEEE, pp 1092–1101

  14. Di S, Gupta R, Snir M, et al (2017) Logaider: a tool for mining potential correlations of HPC log events. In: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid’17), IEEE, pp 442–451

  15. Dongarra J, Herault T, Robert Y (2015) Fault tolerance techniques for high-performance computing. Springer, Cham, pp 3–85

  16. Egwutuoha IP, Levy D, Selic B et al (2013) A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. J Supercomput 65(3):1302–1326

  17. Feitelson D (2022) Parallel workloads archive and standard workload format. http://www.cs.huji.ac.il/labs/parallel/workload, Accessed Nov. 25, 2022

  18. Feitelson DG, Tsafrir D, Krakov D (2014) Experience with using the parallel workloads archive. J Parallel Distrib Comput 74(10):2967–2982

  19. Foss S, Korshunov D, Zachary S (2013) An introduction to heavy-tailed and subexponential distributions. Springer series in operations research and financial engineering, 2nd edn. Springer, New York

  20. Gainaru A, Cappello F, Snir M, et al (2012) Fault prediction under the microscope: a closer look into HPC systems. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC’12), IEEE, pp 1–11

  21. Gotoda S, Ito M, Shibata N (2012) Task scheduling algorithm for multicore processor system for minimizing recovery time in case of single node fault. In: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid’12), IEEE, pp 260–267

  22. Gupta S, Tiwari D, Jantzi C, et al (2015) Understanding and exploiting spatial properties of system failures on extreme-scale HPC systems. In: IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’15), IEEE, pp 37–44

  23. He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284

  24. Heien E, LaPine D, Kondo D, et al (2011) Modeling and tolerating heterogeneous failures in large parallel systems. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC’11)

  25. Hothorn T, Zeileis A (2015) partykit: a modular toolkit for recursive partytioning in R. J Mach Learn Res 16(1):3905–3909

  26. Huang S, Liu Y, Fung C et al (2020) Hitanomaly: hierarchical transformers for anomaly detection in system log. IEEE Trans Netw Serv Manage 17(4):2064–2076

  27. Jin H, Ke T, Chen Y, et al (2012) Checkpointing orchestration: toward a scalable HPC fault-tolerant environment. In: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid’12), IEEE, pp 276–283

  28. Lai CD, Xie M, Barlow RE (2006) Stochastic ageing and dependence for reliability. Springer-Verlag, New York

  29. León B, Franco D, Rexachs D et al (2021) Analysis of parallel application checkpoint storage for system configuration. J Supercomput 77(5):4582–4617

  30. León B, Méndez S, Franco D et al (2022) A model of checkpoint behavior for applications that have i/o. J Supercomput 78(13):15404–15436

  31. Li H, Groep D, Wolters L (2004) Workload characteristics of a multi-cluster supercomputer. In: Workshop on Job Scheduling Strategies for Parallel Processing, pp 176–193

  32. Li H, Groep D, Wolters L, et al (2006) Job failure analysis and its implications in a large-scale production grid. In: IEEE International Conference on E-Science and Grid Computing (E-Science’06), IEEE, pp 27–27

  33. Loh WY (2011) Classification and regression trees. Wiley Interdiscip Rev Data Min Knowl Discov 1(1):14–23

  34. Meng W, Liu Y, Zhang S et al (2021) Logclass: anomalous log identification and classification with partial labels. IEEE Trans Netw Serv Manage 18(2):1870–1884

  35. Min JH, Lee YC (2008) A practical approach to credit scoring. Expert Syst Appl 35(4):1762–1770

  36. Naksinehaboon N, Liu Y, Leangsuksun C, et al (2008) Reliability-aware approach: An incremental checkpoint/restart model in HPC environments. In: IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid’08), IEEE, pp 783–788

  37. Nanni L, Lumini A (2009) An experimental comparison of ensemble of classifiers for bankruptcy prediction and credit scoring. Expert Syst Appl 36(2):3028–3033

  38. Nguyen AT, Reiter S, Rigo P (2014) A review on simulation-based optimization methods applied to building performance analysis. Appl Energy 113:1043–1058

  39. Oliner A, Stearley J (2007) What supercomputers say: a study of five system logs. In: IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’07), pp 575–584

  40. Parasyris K, Keller K, Bautista-Gomez L, et al (2020) Checkpoint restart support for heterogeneous hpc applications. In: 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), pp 242–251

  41. Park JW (2019) Queue waiting time prediction for large-scale high-performance computing system. In: 2019 International Conference on High Performance Computing & Simulation (HPCS), IEEE, pp 850–855

  42. Park JW, Kim E (2017) Runtime prediction of parallel applications with workload-aware clustering. J Supercomput 73(11):4635–4651

  43. Park JW, Kim E (2018) Exploiting the behavior of the failed job in high performance computing system. In: 2018 18th International Conference on Computational Science and Applications (ICCSA), IEEE, pp 1–3

  44. Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830

  45. Rodrigo Álvarez GP, Östberg PO, Elmroth E, et al (2015) HPC system lifetime story: Workload characterization and evolutionary analyses on NERSC systems. In: ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC’15), pp 57–60

  46. Roux NL, Schmidt M, Bach F (2012) A stochastic gradient method with an exponential convergence rate for finite training sets. In: NIPS’12 - 26th Annual Conference on Neural Information Processing Systems

  47. Schneider D (2022) The exascale era is upon us: the frontier supercomputer may be the first to reach 1,000,000,000,000,000,000 operations per second. IEEE Spectr 59(1):34–35

  48. Schroeder B, Gibson G (2010) A large-scale study of failures in high-performance computing systems. IEEE Trans Depend Secur Comput 7(4):337–350

  49. Tiwari D, Gupta S, Vazhkudai SS (2014) Lazy checkpointing: Exploiting temporal locality in failures to mitigate checkpointing overheads on extreme-scale systems. In: Proceedings of IEEE/IFIP DSN, pp 25–36

  50. Wu M, Sun XH, Jin H (2007) Performance under failures of high-end computing. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC’07), ACM, p 48

  51. Yoon J, Hong T, Park C et al (2015) Stable HPC cluster management scheme through performance evaluation. In: Park JJJH, Stojmenovic I, Jeong HY et al (eds) Computer science and its applications. Springer, Berlin, pp 1017–1023

  52. You H, Zhang H (2012) Comprehensive workload analysis and modeling of a petascale supercomputer. In: Workshop on Job Scheduling Strategies for Parallel Processing, pp 253–271

  53. Yuan Y, Wu Y, Wang Q et al (2012) Job failures in high performance computing systems: a large-scale empirical study. Comput Math Appl 63(2):365–377

  54. Zheng Z, Yu L, Tang W, et al (2011) Co-analysis of RAS log and job log on Blue Gene/P. In: IEEE International Parallel & Distributed Processing Symposium (IPDPS’11), pp 840–851

Download references

Funding

Not applicable.

Author information

Authors and Affiliations

Authors

Contributions

J.P. and C.L. wrote the main manuscript. J.P. and C.L. conducted statistical analysis of failed jobs from the scheduler log. J.P. and X.H. conducted the performance comparison of machine learning models. All authors reviewed the manuscript.

Corresponding author

Correspondence to Ju-Won Park.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Appendix

We provide in Fig. 8 the results of MLP models with different structures, i.e., different numbers of neurons per layer and/or different numbers of layers. As can be seen from Fig. 8, when we increase the model complexity by using more neurons per layer and/or more MLP layers, the accuracy of the MLP models shows no clear improvement (and sometimes even degrades), while their running time can increase significantly. Therefore, we have used a three-layer MLP model with one hidden layer of 100 neurons for the performance comparison with the other machine learning models. Note that when different MLP models are trained, they may converge after different numbers of training iterations, as we stop the training when the training loss does not improve by at least a tolerance of 0.0001 for ten consecutive iterations. This is why the running time of an MLP model does not increase monotonically with the number of neurons per layer and/or the number of layers. Nonetheless, the three-layer model used in this work achieves the minimum running time in each case.
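
For reference, the stopping rule and structure comparison described above can be sketched with scikit-learn's MLPClassifier [44] roughly as follows; the synthetic data and the particular set of structures are assumptions, not the exact experimental setup.

  # Illustrative sketch of comparing MLP structures with the stopping rule
  # described above (training stops when the loss improves by less than 1e-4
  # for ten consecutive iterations); the data are a synthetic placeholder.
  import time
  from sklearn.datasets import make_classification
  from sklearn.neural_network import MLPClassifier

  X, y = make_classification(n_samples=20_000, n_features=10, random_state=0)

  structures = [(100,), (100, 100), (100, 100, 100, 100, 100)]  # hidden layer sizes
  for hidden in structures:
      clf = MLPClassifier(hidden_layer_sizes=hidden,
                          tol=1e-4,             # minimum required loss improvement
                          n_iter_no_change=10,  # ...over ten consecutive iterations
                          max_iter=500,
                          random_state=0)
      start = time.time()
      clf.fit(X, y)
      print(f"hidden={hidden}: iters={clf.n_iter_}, "
            f"train_acc={clf.score(X, y):.3f}, time={time.time() - start:.1f}s")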

Fig. 8 Results of different MLP structures. Here, we use [x] to refer to the sizes of hidden layers; for example, [100, 100, 100, 100, 100] indicates an MLP model with five hidden layers of 100 neurons each

In addition, we provide in Table 7 the detailed numeric values of the results shown in Fig. 7.

Table 7 Performance comparison of machine learning models

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Park, JW., Huang, X. & Lee, CH. Analyzing and predicting job failures from HPC system log. J Supercomput 80, 435–462 (2024). https://doi.org/10.1007/s11227-023-05482-y

Download citation

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11227-023-05482-y

Keywords