Improving GPGPU concurrency with elastic kernels

Published: 16 March 2013

    Abstract

    Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models such as CUDA were designed to scale to these resources. However, we find that CUDA programs often do not scale to utilize all available resources: over 30% of resources go unused, on average, across the Parboil2 benchmarks we study. Current GPUs therefore allow concurrent execution of kernels to improve utilization. In this work, we study concurrent execution of GPU kernels using multiprogram workloads on current NVIDIA Fermi GPUs. On two-program workloads from the Parboil2 benchmark suite, we find that concurrent execution is often no better than serialized execution. We identify the lack of control over resource allocation to kernels as a major serialization bottleneck. We propose transformations that convert CUDA kernels into elastic kernels, which permit fine-grained control over their resource usage. We then propose several elastic-kernel-aware concurrency policies that offer significantly better performance and concurrency than the current CUDA policy. We evaluate our proposals on real hardware using multiprogrammed workloads constructed from the Parboil2 benchmarks. On average, for two-program workloads, our proposals increase system throughput (STP) by 1.21x and improve average normalized turnaround time (ANTT) by 3.73x compared to the current CUDA concurrency implementation.
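    The core idea the abstract describes, rewriting a kernel so that the resources it occupies (here, the number of physical thread blocks) can be chosen at launch time without changing its result, can be sketched as a plain-Python emulation. This is a hedged illustration, not the paper's actual CUDA transformation; all names (`elastic_kernel`, `launch`) are invented for the sketch.

    ```python
    # Hypothetical emulation of an "elastic kernel": a kernel written for a
    # logical grid of L blocks is re-expressed so it runs correctly on any
    # smaller physical grid of P blocks, with each physical block iterating
    # over the logical block IDs assigned to it (a grid-stride over blocks).

    def original_kernel(block_id, block_dim, data, out):
        """Logical per-block work: each block sums its slice of `data`."""
        start = block_id * block_dim
        out[block_id] = sum(data[start:start + block_dim])

    def elastic_kernel(phys_block_id, phys_grid_dim, logical_grid_dim,
                       block_dim, data, out):
        """Elastic version: a physical block loops over logical block IDs
        with a stride equal to the physical grid size, so a scheduler can
        pick any physical grid size without changing the result."""
        for logical_id in range(phys_block_id, logical_grid_dim, phys_grid_dim):
            original_kernel(logical_id, block_dim, data, out)

    def launch(logical_grid_dim, phys_grid_dim, block_dim, data):
        out = [0] * logical_grid_dim
        for pb in range(phys_block_id := 0, phys_grid_dim):  # stand-in for the GPU scheduler
            elastic_kernel(pb, phys_grid_dim, logical_grid_dim, block_dim, data, out)
        return out

    data = list(range(32))            # 8 logical blocks of 4 elements each
    full = launch(8, 8, 4, data)      # one physical block per logical block
    small = launch(8, 3, 4, data)     # only 3 physical blocks: same answer
    assert full == small
    ```

    This independence of the result from the physical grid size is what lets an elastic-kernel-aware scheduler shrink one kernel's resource allocation to make room for a concurrent kernel.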




    Published In

    ASPLOS '13: Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems
    March 2013, 574 pages
    ISBN: 9781450318709
    DOI: 10.1145/2451116

    Also published in:
    • ACM SIGPLAN Notices, Volume 48, Issue 4 (ASPLOS '13), April 2013, 540 pages. ISSN: 0362-1340, EISSN: 1558-1160, DOI: 10.1145/2499368
    • ACM SIGARCH Computer Architecture News, Volume 41, Issue 1 (ASPLOS '13), March 2013, 540 pages. ISSN: 0163-5964, DOI: 10.1145/2490301
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. concurrent kernels
    2. cuda
    3. gpgpu

    Qualifiers

    • Research-article

    Conference

    ASPLOS '13

    Acceptance Rates

    Overall Acceptance Rate 535 of 2,713 submissions, 20%

    Article Metrics

    • Downloads (Last 12 months)127
    • Downloads (Last 6 weeks)11
    Reflects downloads up to 26 Jul 2024

    Cited By
    • (2024) nvshare: Practical GPU Sharing without Memory Size Constraints. Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, pp. 16-20. DOI: 10.1145/3639478.3640034. Online publication date: 14-Apr-2024.
    • (2023) A Software-Only Approach to Enable Diverse Redundancy on Intel GPUs for Safety-Related Kernels. Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing, pp. 451-460. DOI: 10.1145/3555776.3577610. Online publication date: 27-Mar-2023.
    • (2023) Gemini: Enabling Multi-Tenant GPU Sharing Based on Kernel Burst Estimation. IEEE Transactions on Cloud Computing, 11(1), pp. 854-867. DOI: 10.1109/TCC.2021.3119205. Online publication date: 1-Jan-2023.
    • (2023) ISPA: Exploiting Intra-SM Parallelism in GPUs via Fine-Grained Resource Management. IEEE Transactions on Computers, 72(5), pp. 1473-1487. DOI: 10.1109/TC.2022.3214088. Online publication date: 1-May-2023.
    • (2023) KeSCo: Compiler-based Kernel Scheduling for Multi-task GPU Applications. 2023 IEEE 41st International Conference on Computer Design (ICCD), pp. 247-254. DOI: 10.1109/ICCD58817.2023.00046. Online publication date: 6-Nov-2023.
    • (2023) Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach. 2023 IEEE International Conference on Cluster Computing (CLUSTER), pp. 185-196. DOI: 10.1109/CLUSTER52292.2023.00023. Online publication date: 31-Oct-2023.
    • (2023) Optimum. Journal of Systems Architecture: the EUROMICRO Journal, 141(C). DOI: 10.1016/j.sysarc.2023.102901. Online publication date: 1-Aug-2023.
    • (2023) Memory-Aware Latency Prediction Model for Concurrent Kernels in Partitionable GPUs: Simulations and Experiments. Job Scheduling Strategies for Parallel Processing, pp. 46-73. DOI: 10.1007/978-3-031-43943-8_3. Online publication date: 15-Sep-2023.
    • (2022) Co-Concurrency Mechanism for Multi-GPUs in Distributed Heterogeneous Environments. IEEE Transactions on Parallel and Distributed Systems, 33(12), pp. 4935-4947. DOI: 10.1109/TPDS.2022.3208082. Online publication date: 1-Dec-2022.
    • (2022) A Survey of GPU Multitasking Methods Supported by Hardware Architecture. IEEE Transactions on Parallel and Distributed Systems, 33(6), pp. 1451-1463. DOI: 10.1109/TPDS.2021.3115630. Online publication date: 1-Jun-2022.
