DOI: 10.1145/2628071.2628117

Poster

Preemptive thread block scheduling with online structural runtime prediction for concurrent GPGPU kernels

Published: 24 August 2014

Abstract

    Recent NVIDIA Graphics Processing Units (GPUs) can execute multiple kernels concurrently. On these GPUs, the thread block scheduler (TBS) currently uses the FIFO policy to schedule thread blocks of concurrent kernels. We show that the FIFO policy leaves performance to chance, resulting in significant loss of performance and fairness. To improve performance and fairness, we propose using the preemptive Shortest Remaining Time First (SRTF) policy instead. Although SRTF requires an estimate of the runtime of GPU kernels, we show that such an estimate can be easily obtained using online profiling and exploiting a simple observation on GPU kernels' grid structure. Specifically, we propose a novel Structural Runtime Predictor. Using a simple Staircase model of GPU kernel execution, we show that the runtime of a kernel can be predicted by profiling only the first few thread blocks. We evaluate an online predictor based on this model on benchmarks from ERCBench, and find that it can estimate the actual runtime reasonably well after the execution of only a single thread block. Next, we design a thread block scheduler that is both concurrent kernel-aware and uses this predictor. We implement the Shortest Remaining Time First (SRTF) policy and evaluate it on two-program workloads from ERCBench. SRTF improves STP by 1.18x and ANTT by 2.25x over FIFO. When compared to MPMax, a state-of-the-art resource allocation policy for concurrent kernels, SRTF improves STP by 1.16x and ANTT by 1.3x. To improve fairness, we also propose SRTF/Adaptive, which controls resource usage of concurrently executing kernels to maximize fairness. SRTF/Adaptive improves STP by 1.12x, ANTT by 2.23x and Fairness by 2.95x compared to FIFO. Overall, our implementation of SRTF achieves system throughput to within 12.64% of Shortest Job First (SJF, an oracle optimal scheduling policy), bridging 49% of the gap between FIFO and SJF.
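    The Staircase model and SRTF policy described above can be sketched as follows. This is an illustrative reconstruction from the abstract's description only, not the authors' implementation: all names (`Kernel`, `blocks_per_wave`, `srtf_pick`) are hypothetical, and the model is the simple one the abstract states, where a kernel executes as successive "waves" of resident thread blocks whose duration is estimated from profiling the first block.

    ```python
    # Illustrative sketch (not the paper's code) of structural runtime
    # prediction via the Staircase model, plus SRTF kernel selection.
    import math
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Kernel:
        name: str
        total_blocks: int                 # number of thread blocks in the grid
        blocks_per_wave: int              # blocks resident on the GPU at once
        completed_blocks: int = 0
        profiled_block_time: Optional[float] = None  # seconds, from online
                                                     # profiling of 1st block

        def predicted_total_time(self) -> float:
            # Staircase model: execution proceeds in waves of resident
            # blocks; each wave takes roughly one profiled block time.
            assert self.profiled_block_time is not None
            waves = math.ceil(self.total_blocks / self.blocks_per_wave)
            return waves * self.profiled_block_time

        def predicted_remaining_time(self) -> float:
            assert self.profiled_block_time is not None
            remaining = self.total_blocks - self.completed_blocks
            waves_left = math.ceil(remaining / self.blocks_per_wave)
            return waves_left * self.profiled_block_time

    def srtf_pick(kernels):
        # Shortest Remaining Time First: dispatch thread blocks of the
        # kernel with the smallest predicted remaining runtime.
        return min(kernels, key=lambda k: k.predicted_remaining_time())

    a = Kernel("A", total_blocks=120, blocks_per_wave=16,
               profiled_block_time=2e-3)   # 8 waves -> ~16 ms predicted
    b = Kernel("B", total_blocks=32, blocks_per_wave=16,
               profiled_block_time=2e-3)   # 2 waves -> ~4 ms predicted
    print(srtf_pick([a, b]).name)  # prints "B"
    ```

    Under this model a single profiled block suffices to predict total runtime, which is why the scheduler can preempt in favour of the shorter kernel almost immediately; the per-wave granularity is a deliberate simplification of real GPU occupancy behaviour.
    
    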




    Published In

    cover image ACM Conferences
    PACT '14: Proceedings of the 23rd international conference on Parallel architectures and compilation
    August 2014
    514 pages
    ISBN:9781450328098
    DOI:10.1145/2628071
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 24 August 2014

    Author Tags

    1. concurrent kernels
    2. CUDA
    3. GPGPU
    4. staircase model
    5. thread block scheduler

    Qualifiers

    • Poster

    Conference

    PACT '14
    Sponsor:
    • IFIP WG 10.3
    • SIGARCH
    • IEEE CS TCPP
    • IEEE CS TCAA

    Acceptance Rates

    PACT '14 Paper Acceptance Rate 54 of 144 submissions, 38%;
    Overall Acceptance Rate 121 of 471 submissions, 26%


    Article Metrics

    • Downloads (last 12 months): 8
    • Downloads (last 6 weeks): 2
    Reflects downloads up to 26 Jul 2024.


    Cited By

    • (2024) GPARS: Graph predictive algorithm for efficient resource scheduling in heterogeneous GPU clusters. Future Generation Computer Systems 152, 127–137. DOI: 10.1016/j.future.2023.10.022. Online publication date: Mar 2024.
    • (2023) Mitigating GPU Core Partitioning Performance Effects. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 530–542. DOI: 10.1109/HPCA56546.2023.10070957. Online publication date: Mar 2023.
    • (2022) GPUPool. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 317–332. DOI: 10.1145/3559009.3569650. Online publication date: 8 Oct 2022.
    • (2021) Principal Kernel Analysis: A Tractable Methodology to Simulate Scaled GPU Workloads. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 724–737. DOI: 10.1145/3466752.3480100. Online publication date: 18 Oct 2021.
    • (2021) FlexSched: Efficient scheduling techniques for concurrent kernel execution on GPUs. The Journal of Supercomputing. DOI: 10.1007/s11227-021-03819-z. Online publication date: 19 May 2021.
    • (2020) Dissecting the CUDA scheduling hierarchy: a Performance and Predictability Perspective. 2020 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), 213–225. DOI: 10.1109/RTAS48715.2020.000-5. Online publication date: May 2020.
    • (2019) Linebacker. Proceedings of the 46th International Symposium on Computer Architecture, 183–196. DOI: 10.1145/3307650.3322222. Online publication date: 22 Jun 2019.
    • (2018) Share-a-GPU: Providing Simple and Effective Time-Sharing on GPUs. 2018 IEEE 25th International Conference on High Performance Computing (HiPC), 294–303. DOI: 10.1109/HiPC.2018.00041. Online publication date: Dec 2018.
    • (2017) Quality of Service Support for Fine-Grained Sharing on GPUs. ACM SIGARCH Computer Architecture News 45(2), 269–281. DOI: 10.1145/3140659.3080203. Online publication date: 24 Jun 2017.
    • (2017) Quality of Service Support for Fine-Grained Sharing on GPUs. Proceedings of the 44th Annual International Symposium on Computer Architecture, 269–281. DOI: 10.1145/3079856.3080203. Online publication date: 24 Jun 2017.
