Research article
DOI: 10.1145/2451116.2451160

Improving GPGPU concurrency with elastic kernels

Published: 16 March 2013

Abstract

Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models (like CUDA) were designed to scale to use these resources. However, we find that CUDA programs actually do not scale to utilize all available resources, with over 30% of resources going unused on average for programs of the Parboil2 suite that we used in our work. Current GPUs therefore allow concurrent execution of kernels to improve utilization. In this work, we study concurrent execution of GPU kernels using multiprogram workloads on current NVIDIA Fermi GPUs. On two-program workloads from the Parboil2 benchmark suite we find concurrent execution is often no better than serialized execution. We identify that the lack of control over resource allocation to kernels is a major serialization bottleneck. We propose transformations that convert CUDA kernels into elastic kernels which permit fine-grained control over their resource usage. We then propose several elastic-kernel aware concurrency policies that offer significantly better performance and concurrency compared to the current CUDA policy. We evaluate our proposals on real hardware using multiprogrammed workloads constructed from benchmarks in the Parboil2 suite. On average, our proposals increase system throughput (STP) by 1.21x and improve the average normalized turnaround time (ANTT) by 3.73x for two-program workloads when compared to the current CUDA concurrency implementation.
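The key idea behind an elastic kernel is to decouple the number of thread blocks that actually occupy the GPU from the logical grid the programmer wrote: each physical block serially iterates over a subset of the logical block IDs, so a scheduler can shrink or grow a kernel's resource footprint without changing its results. The following Python sketch illustrates only that index arithmetic (function and variable names are ours, not from the paper, and the serial loop stands in for blocks that would run concurrently on real hardware):

```python
def elastic_launch(kernel_body, logical_grid_size, physical_blocks):
    """Execute `logical_grid_size` logical blocks using only
    `physical_blocks` resident blocks: each physical block strides
    over the logical block IDs assigned to it."""
    for phys_id in range(physical_blocks):        # analogous to blockIdx.x
        # Physical block `phys_id` handles logical blocks
        # phys_id, phys_id + physical_blocks, phys_id + 2*physical_blocks, ...
        for logical_id in range(phys_id, logical_grid_size, physical_blocks):
            kernel_body(logical_id)               # original body, with the
                                                  # hardware block index
                                                  # replaced by logical_id

# Every logical block executes exactly once regardless of how few
# physical blocks a concurrency policy grants this kernel.
executed = []
elastic_launch(executed.append, logical_grid_size=12, physical_blocks=5)
assert sorted(executed) == list(range(12))
```

Because correctness no longer depends on the physical block count, a concurrency policy can pick `physical_blocks` per kernel so that two co-scheduled kernels together fit within the GPU's registers, shared memory, and thread slots, rather than letting one kernel's launch configuration monopolize them.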

Published In

ASPLOS '13: Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, March 2013, 574 pages. ISBN: 9781450318709. DOI: 10.1145/2451116.

Also published in:
  • ACM SIGPLAN Notices, Volume 48, Issue 4 (ASPLOS '13), April 2013, 540 pages. ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/2499368.
  • ACM SIGARCH Computer Architecture News, Volume 41, Issue 1 (ASPLOS '13), March 2013, 540 pages. ISSN: 0163-5964. DOI: 10.1145/2490301.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. concurrent kernels
  2. cuda
  3. gpgpu

Conference

ASPLOS '13

Acceptance Rates

Overall acceptance rate: 535 of 2,713 submissions (20%)

