Pagoda: Fine-Grained GPU Resource Virtualization for Narrow Tasks

Published: 26 January 2017

Abstract

Massively multithreaded GPUs achieve high throughput by running thousands of threads in parallel. To fully utilize the hardware, workloads spawn work to the GPU in bulk by launching large tasks, where each task is a kernel that contains thousands of threads that occupy the entire GPU.
GPUs suffer severe underutilization, and their performance benefits vanish, when tasks are narrow, i.e., contain fewer than 500 threads. Latency-sensitive applications in network, signal, and image processing, which generate large numbers of tasks with relatively small inputs, are examples of such limited parallelism.
This paper presents Pagoda, a runtime system that virtualizes GPU resources using an OS-like daemon kernel called MasterKernel. Tasks are spawned from the CPU onto Pagoda as they become available and are scheduled by the MasterKernel at warp granularity. Experimental results demonstrate that Pagoda achieves a geometric-mean speedup of 5.70x over PThreads running on a 20-core CPU, 1.51x over CUDA-HyperQ, and 1.69x over GeMTC, the state-of-the-art runtime GPU task scheduling system.
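The warp-granularity scheduling the abstract describes can be illustrated with a small CPU-side sketch. This is a hypothetical analogy, not Pagoda's implementation: the function names, the greedy placement policy, and the warp-pool size are illustrative assumptions, and the real MasterKernel runs as a persistent daemon on the GPU itself.

```python
# Hypothetical CPU-side sketch of warp-granularity task scheduling
# (NOT Pagoda's actual implementation; names and policy are assumed).
from collections import deque

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warps_needed(num_threads):
    """Warps a narrow task occupies when scheduled at warp granularity."""
    return -(-num_threads // WARP_SIZE)  # ceiling division


def schedule(tasks, total_warps=64):
    """Greedily pack a batch of narrow tasks onto a pool of warp slots.

    tasks: list of (name, num_threads) pairs.
    Returns {task name: list of warp IDs it occupies}. When the pool is
    exhausted, the sketch starts a new pass with all warps reclaimed,
    standing in for the daemon waiting on task completions.
    """
    free = deque(range(total_warps))
    placement = {}
    for name, num_threads in tasks:
        need = warps_needed(num_threads)
        if need > len(free):
            free = deque(range(total_warps))  # next pass: warps reclaimed
        placement[name] = [free.popleft() for _ in range(need)]
    return placement
```

Packing several 100-thread tasks this way occupies only a few warps each, whereas launching each task as its own kernel would leave nearly the entire GPU idle, which is the utilization gap the abstract describes.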




Published In

ACM SIGPLAN Notices, Volume 52, Issue 8 (PPoPP '17), August 2017, 442 pages. ISSN: 0362-1340, EISSN: 1558-1160. DOI: 10.1145/3155284
Also in PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2017, 476 pages. ISBN: 9781450344937. DOI: 10.1145/3018743

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 26 January 2017
    Published in SIGPLAN Volume 52, Issue 8


    Author Tags

    1. gpu runtime system
    2. task parallelism
    3. utilization

    Qualifiers

    • Research-article


Cited By

• (2024) ElasticRoom: Multi-Tenant DNN Inference Engine via Co-design with Resource-constrained Compilation and Strong Priority Scheduling. In Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, pp. 1-14. DOI: 10.1145/3625549.3658654
• (2023) Kernel-as-a-Service. In Proceedings of the 24th International Middleware Conference, pp. 192-206. DOI: 10.1145/3590140.3629115
• (2023) Towards a Machine Learning-Assisted Kernel with LAKE. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 846-861. DOI: 10.1145/3575693.3575697
• (2022) GPUPool. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pp. 317-332. DOI: 10.1145/3559009.3569650
• (2022) Task Fusion in Distributed Runtimes. In 2022 IEEE/ACM Parallel Applications Workshop: Alternatives To MPI+X (PAW-ATM), pp. 13-25. DOI: 10.1109/PAW-ATM56565.2022.00007
• (2022) Improving cryptanalytic applications with stochastic runtimes on GPUs and multicores. Parallel Computing, 112:C. DOI: 10.1016/j.parco.2022.102944
• (2021) CARE. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 4582-4590. DOI: 10.1145/3474085.3475617
• (2021) Memory-Augmented Neural Networks on FPGA for Real-Time and Energy-Efficient Question Answering. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 29(1):162-175. DOI: 10.1109/TVLSI.2020.3037166
• (2020) Performance Analysis of Thread Block Schedulers in GPGPU and Its Implications. Applied Sciences, 10(24):9121. DOI: 10.3390/app10249121
• (2019) EDGE: Event-Driven GPU Execution. In 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 337-353. DOI: 10.1109/PACT.2019.00034
