Pagoda: Fine-Grained GPU Resource Virtualization for Narrow Tasks

Published: 26 January 2017

Abstract

Massively multithreaded GPUs achieve high throughput by running thousands of threads in parallel. To fully utilize the hardware, workloads spawn work to the GPU in bulk by launching large tasks, where each task is a kernel that contains thousands of threads that occupy the entire GPU.
GPUs suffer severe underutilization, and their performance benefits vanish, when tasks are narrow, i.e., contain fewer than 500 threads. Latency-sensitive applications in network, signal, and image processing, which generate large numbers of tasks with relatively small inputs, are examples of such limited parallelism.
This paper presents Pagoda, a runtime system that virtualizes GPU resources using an OS-like daemon kernel called MasterKernel. Tasks are spawned from the CPU onto Pagoda as they become available and are scheduled by the MasterKernel at warp granularity. Experimental results demonstrate that Pagoda achieves a geometric-mean speedup of 5.70x over PThreads running on a 20-core CPU, 1.51x over CUDA-HyperQ, and 1.69x over GeMTC, the state-of-the-art runtime GPU task scheduling system.
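The warp-granularity scheduling the abstract describes can be illustrated with a small CPU-side sketch. This is a hypothetical analogy, not Pagoda's implementation: the function names, the greedy placement policy, and the warp-pool size are illustrative assumptions, and the real MasterKernel runs as a persistent daemon on the GPU itself.

```python
# Hypothetical CPU-side sketch of warp-granularity task scheduling
# (NOT Pagoda's actual implementation; names and policy are assumed).
from collections import deque

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warps_needed(num_threads):
    """Warps a narrow task occupies when scheduled at warp granularity."""
    return -(-num_threads // WARP_SIZE)  # ceiling division


def schedule(tasks, total_warps=64):
    """Greedily pack a batch of narrow tasks onto a pool of warp slots.

    tasks: list of (name, num_threads) pairs.
    Returns {task name: list of warp IDs it occupies}. When the pool is
    exhausted, the sketch starts a new pass with all warps reclaimed,
    standing in for the daemon waiting on task completions.
    """
    free = deque(range(total_warps))
    placement = {}
    for name, num_threads in tasks:
        need = warps_needed(num_threads)
        if need > len(free):
            free = deque(range(total_warps))  # next pass: warps reclaimed
        placement[name] = [free.popleft() for _ in range(need)]
    return placement
```

Packing several 100-thread tasks this way occupies only a few warps each, whereas launching each task as its own kernel would leave nearly the entire GPU idle, which is the utilization gap the abstract describes.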




Published In

ACM SIGPLAN Notices, Volume 52, Issue 8 (PPoPP '17), August 2017, 442 pages. ISSN: 0362-1340, EISSN: 1558-1160. DOI: 10.1145/3155284
Also in PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2017, 476 pages. ISBN: 9781450344937. DOI: 10.1145/3018743

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 26 January 2017
    Published in SIGPLAN Volume 52, Issue 8


    Author Tags

    1. gpu runtime system
    2. task parallelism
    3. utilization

    Qualifiers

    • Research-article


Cited By

• (2024) ElasticRoom: Multi-Tenant DNN Inference Engine via Co-design with Resource-constrained Compilation and Strong Priority Scheduling. In Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, pp. 1-14. DOI: 10.1145/3625549.3658654
• (2023) Kernel-as-a-Service. In Proceedings of the 24th International Middleware Conference, pp. 192-206. DOI: 10.1145/3590140.3629115
• (2023) Towards a Machine Learning-Assisted Kernel with LAKE. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 846-861. DOI: 10.1145/3575693.3575697
• (2022) GPUPool. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pp. 317-332. DOI: 10.1145/3559009.3569650
• (2022) Task Fusion in Distributed Runtimes. In 2022 IEEE/ACM Parallel Applications Workshop: Alternatives To MPI+X (PAW-ATM), pp. 13-25. DOI: 10.1109/PAW-ATM56565.2022.00007
• (2022) Improving cryptanalytic applications with stochastic runtimes on GPUs and multicores. Parallel Computing, 112:C. DOI: 10.1016/j.parco.2022.102944
• (2021) CARE. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 4582-4590. DOI: 10.1145/3474085.3475617
• (2021) Memory-Augmented Neural Networks on FPGA for Real-Time and Energy-Efficient Question Answering. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 29(1):162-175. DOI: 10.1109/TVLSI.2020.3037166
• (2020) Performance Analysis of Thread Block Schedulers in GPGPU and Its Implications. Applied Sciences, 10(24):9121. DOI: 10.3390/app10249121
• (2019) EDGE: Event-Driven GPU Execution. In 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 337-353. DOI: 10.1109/PACT.2019.00034
