
AccaSim: a customizable workload management simulator for job dispatching research in HPC systems

Cluster Computing

Abstract

We present AccaSim, a simulator for workload management in HPC systems. Thanks to AccaSim’s scalability to large workload datasets, support for easy customization, and practical automated tools to aid experimentation, users can easily represent various real HPC systems, develop novel advanced dispatchers and evaluate them in a convenient way across different workload sources. AccaSim is thus an attractive tool for conducting job dispatching research in HPC systems.



Notes

  1. http://www.prace-ri.eu/praceannualreports/.

  2. http://www2.itif.org/2016-high-performance-computing.pdf.

  3. https://www.python.org/events/python-events/.

  4. http://accasim.readthedocs.io/en/latest/.

  5. https://pypi.org.

  6. https://slurm.schedmd.com/.

  7. http://simgrid.gforge.inria.fr/.

  8. http://www.cloudbus.org/gridsim/.

  9. https://www.spec.org/power_ssj2008/.

  10. http://www.omnetpp.org/.

  11. http://frieda.lbl.gov/download.

  12. https://git.io/fhmbM.

  13. https://www.hpc2n.umu.se/resources/hardware/seth.

  14. http://www.cs.huji.ac.il/labs/parallel/workload/l_hpc2n/index.html.

  15. http://www.cs.huji.ac.il/labs/parallel/workload/l_ricc/index.html.

  16. http://www.cs.huji.ac.il/labs/parallel/workload/l_metacentrum2/index.html.

  17. https://metavo.metacentrum.cz/en/index.html.

  18. https://pypi.org/project/psutil/.

  19. https://github.com/oar-team/batsim.

  20. https://github.com/aleasimulator/alea/.

  21. https://git.io/fhmba.


Acknowledgements

C. Galleguillos is supported by Postgraduate Grant PUCV 2018. A. Netti is supported by a research fellowship from the Oprecomp-Open Transprecision Computing project. R. Soto is supported by Grant CONICYT/FONDECYT/REGULAR/1160455. We are grateful to Åke Sandgren, Motoyoshi Kurokawa, and the Czech National Grid Infrastructure MetaCentrum for providing, respectively, the Seth, RICC, and MetaCentrum workload datasets. We thank Alina Sîrbu for fruitful discussions on the work presented here. Finally, we appreciate the valuable comments of the reviewers, which helped improve the paper significantly. We especially thank Millian Poquet for signing his review and giving us the opportunity to interact during the revision of the paper.

Author information


Corresponding author

Correspondence to Cristian Galleguillos.


About this article


Cite this article

Galleguillos, C., Kiziltan, Z., Netti, A. et al. AccaSim: a customizable workload management simulator for job dispatching research in HPC systems. Cluster Comput 23, 107–122 (2020). https://doi.org/10.1007/s10586-019-02905-5


  • DOI: https://doi.org/10.1007/s10586-019-02905-5
