Improving GPGPU concurrency with elastic kernels

Published: 16 March 2013

    Abstract

    Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models such as CUDA were designed to scale to these resources. However, we find that CUDA programs often do not scale to utilize all available resources: over 30% of resources go unused, on average, across the Parboil2 benchmarks we study. Current GPUs therefore allow concurrent execution of kernels to improve utilization. In this work, we study concurrent execution of GPU kernels using multiprogram workloads on current NVIDIA Fermi GPUs. On two-program workloads from the Parboil2 benchmark suite, we find that concurrent execution is often no better than serialized execution. We identify the lack of control over resource allocation to kernels as a major serialization bottleneck. We propose transformations that convert CUDA kernels into elastic kernels, which permit fine-grained control over their resource usage. We then propose several elastic-kernel-aware concurrency policies that offer significantly better performance and concurrency than the current CUDA policy. We evaluate our proposals on real hardware using multiprogrammed workloads constructed from the Parboil2 benchmarks. On average, for two-program workloads, our proposals increase system throughput (STP) by 1.21x and improve average normalized turnaround time (ANTT) by 3.73x compared to the current CUDA concurrency implementation.
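    The core idea the abstract describes, rewriting a kernel so that the resources it occupies (here, the number of physical thread blocks) can be chosen at launch time without changing its result, can be sketched as a plain-Python emulation. This is a hedged illustration, not the paper's actual CUDA transformation; all names (`elastic_kernel`, `launch`) are invented for the sketch.

    ```python
    # Hypothetical emulation of an "elastic kernel": a kernel written for a
    # logical grid of L blocks is re-expressed so it runs correctly on any
    # smaller physical grid of P blocks, with each physical block iterating
    # over the logical block IDs assigned to it (a grid-stride over blocks).

    def original_kernel(block_id, block_dim, data, out):
        """Logical per-block work: each block sums its slice of `data`."""
        start = block_id * block_dim
        out[block_id] = sum(data[start:start + block_dim])

    def elastic_kernel(phys_block_id, phys_grid_dim, logical_grid_dim,
                       block_dim, data, out):
        """Elastic version: a physical block loops over logical block IDs
        with a stride equal to the physical grid size, so a scheduler can
        pick any physical grid size without changing the result."""
        for logical_id in range(phys_block_id, logical_grid_dim, phys_grid_dim):
            original_kernel(logical_id, block_dim, data, out)

    def launch(logical_grid_dim, phys_grid_dim, block_dim, data):
        out = [0] * logical_grid_dim
        for pb in range(phys_block_id := 0, phys_grid_dim):  # stand-in for the GPU scheduler
            elastic_kernel(pb, phys_grid_dim, logical_grid_dim, block_dim, data, out)
        return out

    data = list(range(32))            # 8 logical blocks of 4 elements each
    full = launch(8, 8, 4, data)      # one physical block per logical block
    small = launch(8, 3, 4, data)     # only 3 physical blocks: same answer
    assert full == small
    ```

    This independence of the result from the physical grid size is what lets an elastic-kernel-aware scheduler shrink one kernel's resource allocation to make room for a concurrent kernel.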




    Published In

    ASPLOS '13: Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems
    March 2013, 574 pages
    ISBN: 9781450318709
    DOI: 10.1145/2451116

    Also published in:
    • ACM SIGPLAN Notices, Volume 48, Issue 4 (ASPLOS '13), April 2013, 540 pages. ISSN: 0362-1340, EISSN: 1558-1160, DOI: 10.1145/2499368
    • ACM SIGARCH Computer Architecture News, Volume 41, Issue 1 (ASPLOS '13), March 2013, 540 pages. ISSN: 0163-5964, DOI: 10.1145/2490301
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. concurrent kernels
    2. cuda
    3. gpgpu

    Qualifiers

    • Research-article

    Conference

    ASPLOS '13

    Acceptance Rates

    Overall Acceptance Rate 535 of 2,713 submissions, 20%

    Article Metrics

    • Downloads (Last 12 months)127
    • Downloads (Last 6 weeks)11
    Reflects downloads up to 26 Jul 2024

    Cited By
    • (2024) nvshare: Practical GPU Sharing without Memory Size Constraints. Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, pp. 16-20. DOI: 10.1145/3639478.3640034. Online publication date: 14-Apr-2024.
    • (2023) A Software-Only Approach to Enable Diverse Redundancy on Intel GPUs for Safety-Related Kernels. Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing, pp. 451-460. DOI: 10.1145/3555776.3577610. Online publication date: 27-Mar-2023.
    • (2023) Gemini: Enabling Multi-Tenant GPU Sharing Based on Kernel Burst Estimation. IEEE Transactions on Cloud Computing, 11(1), pp. 854-867. DOI: 10.1109/TCC.2021.3119205. Online publication date: 1-Jan-2023.
    • (2023) ISPA: Exploiting Intra-SM Parallelism in GPUs via Fine-Grained Resource Management. IEEE Transactions on Computers, 72(5), pp. 1473-1487. DOI: 10.1109/TC.2022.3214088. Online publication date: 1-May-2023.
    • (2023) KeSCo: Compiler-based Kernel Scheduling for Multi-task GPU Applications. 2023 IEEE 41st International Conference on Computer Design (ICCD), pp. 247-254. DOI: 10.1109/ICCD58817.2023.00046. Online publication date: 6-Nov-2023.
    • (2023) Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach. 2023 IEEE International Conference on Cluster Computing (CLUSTER), pp. 185-196. DOI: 10.1109/CLUSTER52292.2023.00023. Online publication date: 31-Oct-2023.
    • (2023) Optimum. Journal of Systems Architecture: the EUROMICRO Journal, 141(C). DOI: 10.1016/j.sysarc.2023.102901. Online publication date: 1-Aug-2023.
    • (2023) Memory-Aware Latency Prediction Model for Concurrent Kernels in Partitionable GPUs: Simulations and Experiments. Job Scheduling Strategies for Parallel Processing, pp. 46-73. DOI: 10.1007/978-3-031-43943-8_3. Online publication date: 15-Sep-2023.
    • (2022) Co-Concurrency Mechanism for Multi-GPUs in Distributed Heterogeneous Environments. IEEE Transactions on Parallel and Distributed Systems, 33(12), pp. 4935-4947. DOI: 10.1109/TPDS.2022.3208082. Online publication date: 1-Dec-2022.
    • (2022) A Survey of GPU Multitasking Methods Supported by Hardware Architecture. IEEE Transactions on Parallel and Distributed Systems, 33(6), pp. 1451-1463. DOI: 10.1109/TPDS.2021.3115630. Online publication date: 1-Jun-2022.
