Improving HPC System Performance by Predicting Job Resources via Supervised Machine Learning

Authors:

Mohammed Tanash,

Daniel Andresen,

Adedolapo Okanlawon

PEARC '19: Practice and Experience in Advanced Research Computing 2019: Rise of the Machines (learning)

Article No.: 69, Pages 1 - 8

https://doi.org/10.1145/3332186.3333041

Published: 28 July 2019

Abstract

High-Performance Computing (HPC) systems are resources utilized for data capture, sharing, and analysis. The majority of our HPC users come from disciplines other than computer science. HPC users, including computer scientists, often do not feel proficient enough to decide how many resources their submitted jobs require on the cluster. Consequently, users are encouraged to overestimate the resources for their jobs so that the jobs will not be killed due to insufficient resources. This overestimation wastes HPC resources and leads to inefficient cluster utilization. We created a supervised machine learning model and integrated it into the Slurm resource manager simulator to predict the amount of memory a job requires and the time required to run the computation. Our model uses several different machine learning algorithms. Our goal is to integrate and test the proposed supervised machine learning model on Slurm. We used over 10,000 tasks selected from our HPC log files to evaluate the performance and accuracy of the integrated model. The purpose of our work is to increase the performance of Slurm by predicting the memory and time required for each job, thereby improving the utilization of the HPC system.

Our results indicate that, for larger jobs, our model dramatically reduces computational turnaround time (from five days to ten hours for large jobs), substantially increases utilization of the HPC system, and decreases the average waiting time for submitted jobs.
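The abstract does not specify which features or algorithms the model uses, so the following is only a minimal illustrative sketch of the general idea: learn from historical job records what a job is likely to use, then tighten the request handed to the scheduler. The feature set, the toy `HISTORY` records, the nearest-neighbor averaging, and the names `predict_usage` and `adjusted_request` are all assumptions made for illustration; they are not the authors' implementation.

```python
from math import sqrt

# Hypothetical historical job records (illustrative values, not real log data).
# Each entry pairs the user's request (mem_gb, hours, cores)
# with what the job actually used (mem_gb, hours).
HISTORY = [
    ((64.0, 48.0, 16), (6.0, 5.0)),
    ((64.0, 48.0, 16), (8.0, 7.0)),
    ((32.0, 24.0, 8), (4.0, 2.0)),
    ((128.0, 96.0, 32), (40.0, 30.0)),
    ((128.0, 96.0, 32), (44.0, 34.0)),
    ((16.0, 12.0, 4), (2.0, 1.0)),
]


def distance(a, b):
    """Euclidean distance between two request vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_usage(request, history=HISTORY, k=2):
    """Average the actual usage of the k most similar past requests."""
    nearest = sorted(history, key=lambda rec: distance(rec[0], request))[:k]
    mem = sum(used[0] for _, used in nearest) / len(nearest)
    hours = sum(used[1] for _, used in nearest) / len(nearest)
    return mem, hours


def adjusted_request(request, margin=1.2):
    """Scale the prediction by a safety margin, capped at the user's request."""
    mem, hours = predict_usage(request)
    return min(request[0], mem * margin), min(request[1], hours * margin)
```

In this sketch, the safety margin reflects the risk the abstract describes: an underestimated job is killed, so any predicted request would need headroom. A real deployment would train on actual accounting logs and validate the predictor before changing job requests.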




Published In

PEARC '19: Practice and Experience in Advanced Research Computing 2019: Rise of the Machines (learning)

July 2019

775 pages

ISBN: 9781450372275

DOI: 10.1145/3332186

General Chair:
Tom Furlani
Roswell Park Comprehensive Cancer Center

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

PEARC '19: Practice and Experience in Advanced Research Computing

July 28 - August 1, 2019

Chicago, IL, USA

Acceptance Rates

Overall Acceptance Rate 133 of 202 submissions, 66%
