Abstract
We present AccaSim, a simulator for workload management in HPC systems. Thanks to AccaSim’s scalability to large workload datasets, support for easy customization, and practical automated tools to aid experimentation, users can easily represent various real HPC systems, develop novel advanced dispatchers and evaluate them in a convenient way across different workload sources. AccaSim is thus an attractive tool for conducting job dispatching research in HPC systems.
Acknowledgements
C. Galleguillos is supported by Postgraduate Grant PUCV 2018. A. Netti is supported by a research fellowship from the Oprecomp-Open Transprecision Computing project. R. Soto is supported by Grant CONICYT/FONDECYT/REGULAR/1160455. We are grateful to Åke Sandgren, Motoyoshi Kurokawa, and the Czech National Grid Infrastructure MetaCentrum for providing, respectively, the Seth, RICC, and MetaCentrum workload datasets. We thank Alina Sîrbu for fruitful discussions on the work presented here. Finally, we appreciate the valuable comments of the reviewers, which helped improve the paper significantly. We especially thank Millian Poquet for signing his review and giving us the opportunity to interact during the revision of the paper.
Cite this article
Galleguillos, C., Kiziltan, Z., Netti, A. et al. AccaSim: a customizable workload management simulator for job dispatching research in HPC systems. Cluster Comput 23, 107–122 (2020). https://doi.org/10.1007/s10586-019-02905-5
DOI: https://doi.org/10.1007/s10586-019-02905-5