A Reinforcement Learning Based Backfilling Strategy for HPC Batch Jobs

Authors:

Elliot Kolker-Hicks,

Di Zhang,

Dong Dai

SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis

Pages 1316 - 1323

https://doi.org/10.1145/3624062.3624201

Published: 12 November 2023

Abstract

High Performance Computing (HPC) systems are used across a wide range of disciplines for large and complex computations. HPC systems often receive many thousands of computational tasks at a time, colloquially referred to as “jobs”. These jobs must be scheduled as effectively as possible so that they complete within a reasonable timeframe. HPC scheduling systems often employ a technique called “backfilling”, wherein lower-priority jobs are scheduled earlier to use resources that would otherwise sit idle while reserved for pending high-priority jobs. To make this work, backfilling relies largely on job runtime to calculate the start time of the ready-to-schedule jobs and to avoid delaying them. It is a common belief that better estimates of job runtime lead to better backfilling and more effective scheduling. However, our experiments show a different conclusion: there is an overlooked trade-off between prediction accuracy and backfilling opportunities. To learn how to achieve the best trade-off, we believe reinforcement learning (RL) can be effectively leveraged. RL relies on an “agent” that makes decisions by observing the environment and receives rewards or penalties based on the quality of its decisions. Based on this idea, we designed RLBackfilling, a reinforcement learning-based backfilling algorithm. We show how RLBackfilling can learn effective backfilling strategies via trial and error on existing job traces. Our evaluation results show up to 17x better scheduling performance (based on average bounded job slowdown) compared to EASY backfilling using user-provided job runtime, and 4.7x better performance compared with EASY using the ideal predicted job runtime (the actual job runtime).
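To make the two technical terms in the abstract concrete, the following is a minimal sketch of the textbook EASY backfilling test and of the bounded job slowdown metric. It is not the RLBackfilling implementation. The names `Job`, `shadow_time_and_extra`, `can_backfill`, and `bounded_slowdown` are hypothetical, and the 10-second threshold `tau` is a commonly used value assumed here, not one stated in the abstract.

```python
from dataclasses import dataclass


@dataclass
class Job:
    nodes: int          # nodes requested
    est_runtime: float  # runtime estimate seen by the scheduler, in seconds


def bounded_slowdown(wait: float, run: float, tau: float = 10.0) -> float:
    """Bounded slowdown of one job; tau (assumed 10 s) damps very short jobs."""
    return max((wait + run) / max(run, tau), 1.0)


def shadow_time_and_extra(now, free_nodes, running, head):
    """Find when the head-of-queue job can start, based on runtime estimates.

    running: list of (estimated_end_time, nodes) for jobs currently executing.
    Returns (shadow_time, extra_nodes), where extra_nodes are the nodes left
    over at shadow_time once the head job has taken its share.
    """
    avail = free_nodes
    if avail >= head.nodes:
        return now, avail - head.nodes
    for end, nodes in sorted(running):
        avail += nodes
        if avail >= head.nodes:
            return end, avail - head.nodes
    raise ValueError("head job requests more nodes than the system can free")


def can_backfill(now, free_nodes, running, head, cand):
    """EASY rule: cand may start now only if it cannot delay the head job."""
    if cand.nodes > free_nodes:
        return False
    shadow, extra = shadow_time_and_extra(now, free_nodes, running, head)
    finishes_before_shadow = now + cand.est_runtime <= shadow
    return finishes_before_shadow or cand.nodes <= extra


running = [(100.0, 4), (200.0, 8)]   # (estimated end time, nodes)
head = Job(nodes=8, est_runtime=500.0)
short = Job(nodes=4, est_runtime=50.0)
long_ = Job(nodes=4, est_runtime=150.0)
print(can_backfill(0.0, 4, running, head, short))  # fits before the reservation
print(can_backfill(0.0, 4, running, head, long_))  # would delay the head job
```

Because the shadow time is computed from estimated end times, the estimate fed to the scheduler directly changes which candidates pass this test. That is the lever behind the trade-off between prediction accuracy and backfilling opportunities that the abstract describes.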

Supplemental Material

MP4 File

Recording of "A Reinforcement Learning Based Backfilling Strategy for HPC Batch Jobs" presentation at PMBS23.
