DOI: 10.1145/2628071.2628117

Poster

Preemptive thread block scheduling with online structural runtime prediction for concurrent GPGPU kernels

Published: 24 August 2014

Abstract

    Recent NVIDIA Graphics Processing Units (GPUs) can execute multiple kernels concurrently. On these GPUs, the thread block scheduler (TBS) currently uses the FIFO policy to schedule thread blocks of concurrent kernels. We show that the FIFO policy leaves performance to chance, resulting in significant loss of performance and fairness. To improve performance and fairness, we propose using the preemptive Shortest Remaining Time First (SRTF) policy instead. Although SRTF requires an estimate of the runtime of GPU kernels, we show that such an estimate can be easily obtained using online profiling and exploiting a simple observation on GPU kernels' grid structure. Specifically, we propose a novel Structural Runtime Predictor. Using a simple Staircase model of GPU kernel execution, we show that the runtime of a kernel can be predicted by profiling only the first few thread blocks. We evaluate an online predictor based on this model on benchmarks from ERCBench, and find that it can estimate the actual runtime reasonably well after the execution of only a single thread block. Next, we design a thread block scheduler that is both concurrent kernel-aware and uses this predictor. We implement the Shortest Remaining Time First (SRTF) policy and evaluate it on two-program workloads from ERCBench. SRTF improves STP by 1.18x and ANTT by 2.25x over FIFO. When compared to MPMax, a state-of-the-art resource allocation policy for concurrent kernels, SRTF improves STP by 1.16x and ANTT by 1.3x. To improve fairness, we also propose SRTF/Adaptive, which controls resource usage of concurrently executing kernels to maximize fairness. SRTF/Adaptive improves STP by 1.12x, ANTT by 2.23x and Fairness by 2.95x compared to FIFO. Overall, our implementation of SRTF achieves system throughput to within 12.64% of Shortest Job First (SJF, an oracle optimal scheduling policy), bridging 49% of the gap between FIFO and SJF.
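    The Staircase model and SRTF policy described above can be sketched as follows. This is an illustrative reconstruction from the abstract's description only, not the authors' implementation: all names (`Kernel`, `blocks_per_wave`, `srtf_pick`) are hypothetical, and the model is the simple one the abstract states, where a kernel executes as successive "waves" of resident thread blocks whose duration is estimated from profiling the first block.

    ```python
    # Illustrative sketch (not the paper's code) of structural runtime
    # prediction via the Staircase model, plus SRTF kernel selection.
    import math
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Kernel:
        name: str
        total_blocks: int                 # number of thread blocks in the grid
        blocks_per_wave: int              # blocks resident on the GPU at once
        completed_blocks: int = 0
        profiled_block_time: Optional[float] = None  # seconds, from online
                                                     # profiling of 1st block

        def predicted_total_time(self) -> float:
            # Staircase model: execution proceeds in waves of resident
            # blocks; each wave takes roughly one profiled block time.
            assert self.profiled_block_time is not None
            waves = math.ceil(self.total_blocks / self.blocks_per_wave)
            return waves * self.profiled_block_time

        def predicted_remaining_time(self) -> float:
            assert self.profiled_block_time is not None
            remaining = self.total_blocks - self.completed_blocks
            waves_left = math.ceil(remaining / self.blocks_per_wave)
            return waves_left * self.profiled_block_time

    def srtf_pick(kernels):
        # Shortest Remaining Time First: dispatch thread blocks of the
        # kernel with the smallest predicted remaining runtime.
        return min(kernels, key=lambda k: k.predicted_remaining_time())

    a = Kernel("A", total_blocks=120, blocks_per_wave=16,
               profiled_block_time=2e-3)   # 8 waves -> ~16 ms predicted
    b = Kernel("B", total_blocks=32, blocks_per_wave=16,
               profiled_block_time=2e-3)   # 2 waves -> ~4 ms predicted
    print(srtf_pick([a, b]).name)  # prints "B"
    ```

    Under this model a single profiled block suffices to predict total runtime, which is why the scheduler can preempt in favour of the shorter kernel almost immediately; the per-wave granularity is a deliberate simplification of real GPU occupancy behaviour.
    
    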




    Published In

    cover image ACM Conferences
    PACT '14: Proceedings of the 23rd international conference on Parallel architectures and compilation
    August 2014
    514 pages
    ISBN:9781450328098
    DOI:10.1145/2628071
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 24 August 2014

    Author Tags

    1. concurrent kernels
    2. CUDA
    3. GPGPU
    4. staircase model
    5. thread block scheduler

    Qualifiers

    • Poster

    Conference

    PACT '14
    Sponsor:
    • IFIP WG 10.3
    • SIGARCH
    • IEEE CS TCPP
    • IEEE CS TCAA

    Acceptance Rates

    PACT '14 Paper Acceptance Rate 54 of 144 submissions, 38%;
    Overall Acceptance Rate 121 of 471 submissions, 26%


    Article Metrics

    • Downloads (last 12 months): 8
    • Downloads (last 6 weeks): 2
    Reflects downloads up to 26 Jul 2024.


    Cited By

    • (2024) GPARS: Graph predictive algorithm for efficient resource scheduling in heterogeneous GPU clusters. Future Generation Computer Systems 152, 127–137. DOI: 10.1016/j.future.2023.10.022. Online publication date: Mar 2024.
    • (2023) Mitigating GPU Core Partitioning Performance Effects. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 530–542. DOI: 10.1109/HPCA56546.2023.10070957. Online publication date: Mar 2023.
    • (2022) GPUPool. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 317–332. DOI: 10.1145/3559009.3569650. Online publication date: 8 Oct 2022.
    • (2021) Principal Kernel Analysis: A Tractable Methodology to Simulate Scaled GPU Workloads. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 724–737. DOI: 10.1145/3466752.3480100. Online publication date: 18 Oct 2021.
    • (2021) FlexSched: Efficient scheduling techniques for concurrent kernel execution on GPUs. The Journal of Supercomputing. DOI: 10.1007/s11227-021-03819-z. Online publication date: 19 May 2021.
    • (2020) Dissecting the CUDA scheduling hierarchy: a Performance and Predictability Perspective. 2020 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), 213–225. DOI: 10.1109/RTAS48715.2020.000-5. Online publication date: May 2020.
    • (2019) Linebacker. Proceedings of the 46th International Symposium on Computer Architecture, 183–196. DOI: 10.1145/3307650.3322222. Online publication date: 22 Jun 2019.
    • (2018) Share-a-GPU: Providing Simple and Effective Time-Sharing on GPUs. 2018 IEEE 25th International Conference on High Performance Computing (HiPC), 294–303. DOI: 10.1109/HiPC.2018.00041. Online publication date: Dec 2018.
    • (2017) Quality of Service Support for Fine-Grained Sharing on GPUs. ACM SIGARCH Computer Architecture News 45(2), 269–281. DOI: 10.1145/3140659.3080203. Online publication date: 24 Jun 2017.
    • (2017) Quality of Service Support for Fine-Grained Sharing on GPUs. Proceedings of the 44th Annual International Symposium on Computer Architecture, 269–281. DOI: 10.1145/3079856.3080203. Online publication date: 24 Jun 2017.
