research-article
DOI: 10.1145/3472456.3473521

PREP: Predicting Job Runtime with Job Running Path on Supercomputers

Published: 05 October 2021

Abstract

Supercomputers serve many parallel jobs by scheduling them and allocating computing resources. One popular scheduling strategy is First Come First Serve (FCFS). Under FCFS, however, some resources always sit idle: they are too few to start the head job in the waiting queue, yet they are reserved for it. A common way to improve resource utilization is backfilling, which allocates the reserved resources to a small, short job selected from the queue, provided that doing so does not delay the original head job. Backfilling depends on knowing how long each job will run, but the runtime estimates that users provide are often too high. Previous studies extract features from historical job logs and predict runtime with machine learning. However, traditional features (e.g., CPU, user, and submission time) are insufficient to describe the characteristics of jobs. In this paper, we propose a novel runtime prediction framework called PREP. It explores a new feature, the job running path, which carries important information about a job's characteristics, such as the project it belongs to and the data sets and parameters it uses. Because job runtime correlates strongly with the running path, PREP groups jobs into separate clusters according to their running paths and trains a runtime prediction model for each cluster. The results show that adding the new feature achieves a high prediction accuracy of 88% and outperforms other methods such as Last-2 and IRPA.
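The idea of grouping jobs by running path and fitting one predictor per group can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the running path is treated as a directory string, similarity is an edit distance over path components, the `max_distance` threshold is arbitrary, and the per-cluster "model" is just the mean of past runtimes. The names `assign_cluster`, `fit`, and `predict` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of path parts)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x with y
        prev = curr
    return prev[-1]


def split_path(path):
    """Split a running path into its non-empty components."""
    return [part for part in path.split("/") if part]


def assign_cluster(path, representatives, max_distance=1):
    """Return the first cluster whose representative is close enough.

    A new cluster is created (assumption: greedy, order-dependent grouping)
    when no representative lies within max_distance.
    """
    parts = split_path(path)
    for idx, rep in enumerate(representatives):
        if levenshtein(parts, rep) <= max_distance:
            return idx
    representatives.append(parts)
    return len(representatives) - 1


def fit(history, max_distance=1):
    """Cluster (path, runtime) records and fit one mean-runtime model each."""
    representatives = []
    runtimes = defaultdict(list)
    for path, runtime in history:
        idx = assign_cluster(path, representatives, max_distance)
        runtimes[idx].append(runtime)
    models = {idx: mean(values) for idx, values in runtimes.items()}
    return representatives, models


def predict(path, representatives, models, max_distance=1):
    """Predict runtime from the matching cluster, or None if no cluster fits."""
    parts = split_path(path)
    for idx, rep in enumerate(representatives):
        if levenshtein(parts, rep) <= max_distance:
            return models[idx]
    return None


# Hypothetical job history: (running path, runtime in seconds).
history = [
    ("/home/alice/climate/run01", 3600),
    ("/home/alice/climate/run02", 3800),
    ("/home/bob/md/protein/a", 600),
    ("/home/bob/md/protein/b", 640),
]
representatives, models = fit(history)
```

Calling `predict("/home/alice/climate/run03", representatives, models)` falls into the first cluster and returns that cluster's mean runtime, while a path unlike any seen before returns `None`, at which point a scheduler would presumably fall back to the user-supplied estimate. A real system would replace the mean with a trained regressor per cluster.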

Published In

ICPP '21: Proceedings of the 50th International Conference on Parallel Processing
August 2021
927 pages
ISBN: 9781450390682
DOI: 10.1145/3472456

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. backfilling
  2. machine learning
  3. running path
  4. runtime prediction

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2021

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 26
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 13 Jan 2025

Cited By

  • (2022) Towards scalable resource management for supercomputers. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 1-15. DOI: 10.5555/3571885.3571916. Online publication date: 13-Nov-2022
  • (2022) Towards Scalable Resource Management for Supercomputers. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-15. DOI: 10.1109/SC41404.2022.00029. Online publication date: Nov-2022
  • (2022) A Quantitative Study of the Spatiotemporal I/O Burstiness of HPC Application. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 1349-1359. DOI: 10.1109/IPDPS53621.2022.00133. Online publication date: May-2022
  • (2022) An Ensemble Learning-Based HPC Multi-Resource Demand Prediction Model for Hybrid Clusters. 2022 3rd International Conference on Computer Science and Management Technology (ICCSMT), 413-420. DOI: 10.1109/ICCSMT58129.2022.00094. Online publication date: Nov-2022
  • (2022) Temporal Staggering of Applications Based on Job Classification and I/O Burst Prediction. 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), 965-970. DOI: 10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00154. Online publication date: Dec-2022
  • (2022) PreF: Predicting job failure on supercomputers with job path and user behavior. Concurrency and Computation: Practice and Experience 34:23. DOI: 10.1002/cpe.7202. Online publication date: 21-Aug-2022
  • (2021) Analysis and Classification of Job Multiple Characteristics on Supercomputers. 2021 7th International Conference on Computer and Communications (ICCC), 853-857. DOI: 10.1109/ICCC54389.2021.9674250. Online publication date: 10-Dec-2021
