DOI: 10.1145/3037697.3037707 · ASPLOS Conference Proceedings · Research article · Public Access

Dynamic Resource Management for Efficient Utilization of Multitasking GPUs

Published: 04 April 2017

Abstract

As graphics processing units (GPUs) are broadly adopted, running multiple applications on a GPU at the same time is attracting wide attention. Recent proposals on multitasking GPUs have focused on either spatial multitasking, which partitions GPU resources at streaming multiprocessor (SM) granularity, or simultaneous multikernel (SMK), which runs multiple kernels on the same SM. However, multitasking performance varies heavily depending on the resource partition within each scheme and on the application mix. In this paper, we propose GPU Maestro, which performs dynamic resource management for efficient utilization of multitasking GPUs. GPU Maestro can discover the best-performing GPU resource partition, exploiting both spatial multitasking and SMK. Furthermore, dynamism within a kernel and interference between kernels are automatically accounted for, because GPU Maestro finds the best-performing partition through direct measurements. Evaluations show that GPU Maestro can improve average system throughput by 20.2% and 13.9% over baseline spatial multitasking and SMK, respectively.
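The measurement-driven search the abstract describes can be sketched roughly as follows. This is a hypothetical illustration, not the paper's implementation: the partition granularity, the sublinear-scaling throughput model, and all function names (`candidate_partitions`, `measure_throughput`, `best_partition`) are my assumptions, standing in for the direct hardware measurements GPU Maestro would take per epoch.

```python
import math

def candidate_partitions(num_sms=16):
    """Candidate fractions of GPU resources given to kernel A,
    one per SM-granularity step (hypothetical granularity)."""
    return [a / num_sms for a in range(1, num_sms)]

def measure_throughput(frac_a, scale_a=1.0, scale_b=2.0):
    """Toy stand-in for a direct hardware measurement: each kernel's
    throughput grows sublinearly with its resource share, so the
    combined system throughput peaks at some intermediate split."""
    return math.sqrt(scale_a * frac_a) + math.sqrt(scale_b * (1.0 - frac_a))

def best_partition():
    """Keep the partition whose *measured* system throughput is highest,
    rather than predicting it from a static model."""
    return max(candidate_partitions(), key=measure_throughput)
```

In this toy model kernel B scales better with resources, so the search settles on a split that gives B the larger share. A real system would replace `measure_throughput` with actual per-interval hardware measurements, and would search both spatial (whole-SM) and SMK-style (intra-SM) splits.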




Published In

ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems
April 2017
856 pages
ISBN: 9781450344654
DOI: 10.1145/3037697

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. graphics processing unit
  2. multitasking
  3. resource management

Acceptance Rates

ASPLOS '17 paper acceptance rate: 53 of 320 submissions, 17%.
Overall acceptance rate: 535 of 2,713 submissions, 20%.


Cited By

  • (2024) POSTER: FineCo: Fine-grained Heterogeneous Resource Management for Concurrent DNN Inferences. In Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pages 451-453. DOI: 10.1145/3627535.3638485. Online publication date: 2-Mar-2024.
  • (2024) ElasticRoom: Multi-Tenant DNN Inference Engine via Co-design with Resource-constrained Compilation and Strong Priority Scheduling. In Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, pages 1-14. DOI: 10.1145/3625549.3658654. Online publication date: 3-Jun-2024.
  • (2024) IRIS: A Performance-Portable Framework for Cross-Platform Heterogeneous Computing. IEEE Transactions on Parallel and Distributed Systems, 35(10):1796-1809. DOI: 10.1109/TPDS.2024.3429010. Online publication date: Oct-2024.
  • (2024) Graft: Efficient Inference Serving for Hybrid Deep Learning With SLO Guarantees via DNN Re-Alignment. IEEE Transactions on Parallel and Distributed Systems, 35(2):280-296. DOI: 10.1109/TPDS.2023.3340518. Online publication date: Feb-2024.
  • (2023) Maximizing the Utilization of GPUs Used by Cloud Gaming through Adaptive Co-location with Combo. In Proceedings of the 2023 ACM Symposium on Cloud Computing, pages 265-280. DOI: 10.1145/3620678.3624660. Online publication date: 30-Oct-2023.
  • (2023) Enabling Efficient Spatio-Temporal GPU Sharing for Network Function Virtualization. IEEE Transactions on Computers, 72(10):2963-2977. DOI: 10.1109/TC.2023.3278541. Online publication date: Oct-2023.
  • (2023) ISPA: Exploiting Intra-SM Parallelism in GPUs via Fine-Grained Resource Management. IEEE Transactions on Computers, 72(5):1473-1487. DOI: 10.1109/TC.2022.3214088. Online publication date: 1-May-2023.
  • (2023) MSHGN: Multi-Scenario Adaptive Hierarchical Spatial Graph Convolution Network for GPU Utilization Prediction in Heterogeneous GPU Clusters. Journal of Parallel and Distributed Computing, 104796. DOI: 10.1016/j.jpdc.2023.104796. Online publication date: Nov-2023.
  • (2022) NaviSim. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pages 333-345. DOI: 10.1145/3559009.3569666. Online publication date: 8-Oct-2022.
  • (2022) GPUPool. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pages 317-332. DOI: 10.1145/3559009.3569650. Online publication date: 8-Oct-2022.
