POSTER: Pagoda: A Runtime System to Maximize GPU Utilization in Data Parallel Tasks with Limited Parallelism

Authors:

Timothy G. Rogers

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation

Pages 449 - 450

https://doi.org/10.1145/2967938.2974055

Published: 11 September 2016

Abstract

Massively multithreaded GPUs achieve high throughput by running thousands of threads in parallel. To fully utilize the hardware, contemporary workloads spawn work to the GPU in bulk by launching large tasks, where each task is a kernel that contains thousands of threads that occupy the entire GPU.

GPUs face severe underutilization, and their performance benefits vanish, if the tasks are narrow, i.e., they contain fewer than 512 threads. Latency-sensitive applications in network, signal, and image processing that generate a large number of tasks with relatively small inputs are examples of such limited parallelism. Recognizing the issue, CUDA now allows 32 simultaneous tasks on GPUs; however, that still leaves significant room for underutilization.
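The underutilization argument above can be checked with simple arithmetic. The sketch below is only an illustration: the function name `occupancy_bound` and the example GPU shape (24 multiprocessors with 2048 resident threads each) are assumptions and do not come from the abstract; only the 32-task limit and the 512-thread threshold do.

```python
def occupancy_bound(concurrent_tasks, threads_per_task, num_sms, max_threads_per_sm):
    """Upper bound on the fraction of resident-thread capacity in use.

    Assumes every concurrent task is resident at once and that thread
    count is the only limiting resource (registers and shared memory
    are ignored), so the real figure can only be lower.
    """
    capacity = num_sms * max_threads_per_sm
    resident = min(concurrent_tasks * threads_per_task, capacity)
    return resident / capacity


# Hypothetical GPU shape, chosen for illustration only.
NUM_SMS = 24
MAX_THREADS_PER_SM = 2048

# 32 simultaneous narrow tasks of 256 threads each.
narrow = occupancy_bound(32, 256, NUM_SMS, MAX_THREADS_PER_SM)

# Even at the top of the "narrow" range (511 threads) the bound stays low.
widest_narrow = occupancy_bound(32, 511, NUM_SMS, MAX_THREADS_PER_SM)
```

Under these assumptions, 32 tasks of 256 threads fill 8192 of 49152 thread slots, about one sixth of the device, and 32 tasks of 511 threads still fill only about one third.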

This paper presents Pagoda, a runtime system that virtualizes GPU resources, using an OS-like daemon kernel called MasterKernel. Tasks are spawned from the CPU onto Pagoda as they become available, and are scheduled by the MasterKernel at the warp granularity. This level of control enables the GPU to keep scheduling and executing tasks as long as free warps are found, dramatically reducing underutilization. Experimental results on real hardware demonstrate that Pagoda achieves a geometric mean speedup of 2.44x over PThreads running on a 20-core CPU, 1.43x over CUDA-HyperQ, and 1.33x over GeMTC, the state-of-the-art runtime GPU task scheduling system.
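To make the idea of warp-granularity scheduling concrete, here is a toy discrete-time model. It is not Pagoda's implementation and says nothing about the reported speedups. The names `Task`, `warps_needed`, and `simulate`, the FIFO admission policy, and the optional `max_concurrent_tasks` cap (a crude stand-in for a fixed limit on simultaneous tasks) are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


@dataclass
class Task:
    threads: int   # threads requested by the task
    duration: int  # abstract time steps the task runs for (must be >= 1)


def warps_needed(threads):
    """Number of warps required to hold `threads` threads (ceiling division)."""
    return -(-threads // WARP_SIZE)


def simulate(tasks, total_warps, max_concurrent_tasks=None):
    """Run tasks in FIFO order; return (makespan, mean warp utilization).

    A task is admitted as soon as enough warps are free. If
    `max_concurrent_tasks` is set, admission also stops once that many
    tasks are running, regardless of how many warps are idle.
    """
    pending = deque(tasks)
    running = []  # entries are [remaining_steps, warps_held]
    free = total_warps
    time = 0
    busy_warp_steps = 0

    while pending or running:
        # Admit tasks while warps (and, optionally, task slots) are available.
        while pending:
            need = warps_needed(pending[0].threads)
            if need > total_warps:
                raise ValueError("task does not fit on the device")
            slot_ok = (max_concurrent_tasks is None
                       or len(running) < max_concurrent_tasks)
            if need > free or not slot_ok:
                break
            task = pending.popleft()
            running.append([task.duration, need])
            free -= need

        # Advance one time step and account for the warps that were busy.
        time += 1
        busy_warp_steps += total_warps - free

        still_running = []
        for remaining, held in running:
            if remaining > 1:
                still_running.append([remaining - 1, held])
            else:
                free += held  # task finished; its warps become free again
        running = still_running

    utilization = busy_warp_steps / (time * total_warps) if time else 0.0
    return time, utilization


# 64 narrow tasks of 128 threads (4 warps) each, on a 256-warp device.
tasks = [Task(threads=128, duration=10) for _ in range(64)]
by_warp = simulate(tasks, total_warps=256)
capped = simulate(tasks, total_warps=256, max_concurrent_tasks=32)
```

In this model, admitting tasks whenever free warps exist finishes the batch in 10 steps at full utilization, while the 32-task cap leaves half the warps idle and takes 20 steps. The numbers follow directly from the chosen inputs and are not measurements.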

References

[1] S. J. Krieder, J. M. Wozniak, T. Armstrong, M. Wilde, D. S. Katz, B. Grimmer, I. T. Foster, and I. Raicu. Design and evaluation of the GeMTC framework for GPU-enabled many-task computing. In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, HPDC '14, pages 153–164, New York, NY, USA, 2014. ACM.

[2] NVIDIA. Hyper-Q Example. [Online]. Available: http://docs.nvidia.com/cuda/samples/6Advanced/simpleHyperQ/doc/HyperQ.pdf, 2012. (Accessed March 5, 2016.)

[3] K. Ousterhout, A. Panda, J. Rosen, S. Venkataraman, R. Xin, S. Ratnasamy, S. Shenker, and I. Stoica. The case for tiny tasks in compute clusters. Presented as part of the 14th Workshop on Hot Topics in Operating Systems, Berkeley, CA, 2013. USENIX.

[4] J. Subhlok and G. Vondran. Optimal use of mixed task and data parallelism for pipelined computations. J. Parallel Distrib. Comput., 60(3):297–319, 2000.

Index Terms

1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Runtime environments

Published In

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation

September 2016

474 pages

ISBN:9781450341219

DOI:10.1145/2967938

General Chairs: Ayal Zaks (Intel, Israel), Bilha Mendelson (Optitura, Israel)

Program Chairs: Lawrence Rauchwerger (Texas A&M University, USA), Wen-mei W. Hwu (University of Illinois at Urbana-Champaign, USA)

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Poster

Conference

PACT '16

Sponsor:

IFIP WG 10.3
IEEE TCCA
SIGARCH
IEEE CS TCPP

PACT '16: International Conference on Parallel Architectures and Compilation

September 11 - 15, 2016

Haifa, Israel

Acceptance Rates

PACT '16 Paper Acceptance Rate 31 of 119 submissions, 26%;

Overall Acceptance Rate 121 of 471 submissions, 26%
