Abstract
Graphics Processing Units (GPUs) are energy-efficient, massively parallel accelerators that are increasingly deployed in multi-tenant environments such as data centers, for general-purpose computing as well as graphics applications. Using GPUs in multi-tenant setups requires an efficient, low-overhead method for sharing the device among multiple users that improves system throughput while adapting to changes in the workload. This in turn requires mechanisms to control the resources allocated to each kernel, and an efficient policy for making allocation decisions.
In this paper, we propose adaptive simultaneous multi-tenancy to address these issues. Adaptive simultaneous multi-tenancy allows the GPU to be shared among multiple kernels, as opposed to single-kernel multi-tenancy, which runs only one kernel on the GPU at any given time, and static simultaneous multi-tenancy, which does not adapt to events in the system. Our proposed system dynamically adjusts the kernels' parameters at run-time when a new kernel arrives or a running kernel ends. Evaluations using our prototype implementation show that, compared to executing the kernels sequentially, system throughput is improved by an average of 9.8% (and up to 22.4%) for combinations of kernels that include at least one low-utilization kernel.
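The run-time adjustment described above can be sketched as a simple proportional repartitioning step triggered on every kernel arrival or departure. This is only a minimal illustration of the idea, not the paper's implementation; all names (`Kernel`, `rebalance`, `sm_count`, the `demand` field) are hypothetical.

```python
# Hypothetical sketch of adaptive simultaneous multi-tenancy: whenever a
# kernel arrives or finishes, the GPU's streaming multiprocessors (SMs)
# are repartitioned among the running kernels in proportion to each
# kernel's resource demand.
from dataclasses import dataclass


@dataclass
class Kernel:
    name: str
    demand: float  # relative resource demand (e.g., measured utilization)


def rebalance(kernels, sm_count):
    """Return {kernel name: SM share} after an arrival/departure event."""
    if not kernels:
        return {}
    total = sum(k.demand for k in kernels)
    # Proportional share, with at least one SM per running kernel.
    return {k.name: max(1, round(sm_count * k.demand / total))
            for k in kernels}


running = [Kernel("bfs", 0.3), Kernel("matmul", 0.9)]
print(rebalance(running, 12))  # {'bfs': 3, 'matmul': 9}
```

A real policy would also account for occupancy limits (registers, scratchpad memory) per SM, but the event-driven repartitioning structure is the same.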
Notes
1. The recently announced NVIDIA Volta architecture solves the head-of-line blocking at the GPU block scheduler by dividing the GPU into smaller virtual GPUs, but it lacks the flexibility provided by persistent threads.
2. Scratchpad memory in NVIDIA terminology is called shared memory.
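The persistent-threads style mentioned in note 1 keeps a fixed pool of long-lived workers resident on the device that pull work items from a shared queue, instead of launching one thread per item. A minimal CPU-side analogue, offered only as an assumption-laden sketch of the pattern (on a GPU the workers would be resident thread blocks and the queue would live in global memory):

```python
# CPU-side analogue of the persistent-threads pattern: a fixed pool of
# long-lived workers drains a shared work queue, rather than spawning
# one thread per work item.
import queue
import threading


def run_persistent(items, num_workers=4):
    work = queue.Queue()
    for it in items:
        work.put(it)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                it = work.get_nowait()
            except queue.Empty:
                return  # queue drained: worker retires
            r = it * it  # stand-in for the per-item kernel body
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)


print(run_persistent(range(5)))  # [0, 1, 4, 9, 16]
```

Because the worker count is fixed, a scheduler can resize a kernel's footprint simply by changing how many workers it launches, which is the flexibility the note contrasts against Volta's static partitioning.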
Acknowledgments
This work is supported in part by the National Science Foundation (CCF-1335443) and equipment donations from NVIDIA.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Bashizade, R., Li, Y., Lebeck, A.R. (2019). Adaptive Simultaneous Multi-tenancy for GPUs. In: Klusáček, D., Cirne, W., Desai, N. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2018. Lecture Notes in Computer Science, vol 11332. Springer, Cham. https://doi.org/10.1007/978-3-030-10632-4_5
Print ISBN: 978-3-030-10631-7
Online ISBN: 978-3-030-10632-4