DOI: 10.1145/3642961.3643799
research-article
Open access

GPU-Initiated Resource Allocation for Irregular Workloads

Published: 04 April 2024

Abstract

GPU kernels that are part of an irregular application may underutilize resources in multi-GPU systems because the workload is too small to saturate the devices. To better utilize these resources, we propose a GPU-side resource allocation method that increases or decreases the number of GPUs in use as the workload changes over time. Our method employs GPU-to-CPU callbacks so that one or more GPUs can request additional devices while kernel execution is in flight. We implemented and tested multiple callback methods required for GPU-initiated workload offloading to other devices and measured their overheads on Nvidia and AMD platforms. To showcase the use of callbacks in irregular applications, we implemented Breadth-First Search (BFS) with device-initiated workload offloading. Apart from allowing dynamic device allocation in persistently running kernels, this approach reduces time to solution by 15.7% on average, at the cost of callback overheads of at least 6.50 microseconds on AMD and 4.83 microseconds on Nvidia, depending on the chosen callback mechanism. Moreover, the proposed model can reduce total device usage by up to 35%, which is associated with higher energy efficiency.
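As a rough illustration of the callback idea described in the abstract, the following host-only Python sketch simulates a persistently running "kernel" that asks a CPU service thread to grow or shrink the set of active devices as a BFS-like frontier changes. It is not the authors' implementation: no GPU is involved, and `CallbackMailbox`, `host_service_loop`, `persistent_kernel`, `run_simulation`, and the thresholds `grow_above` and `shrink_below` are hypothetical names chosen for this example. The abstract does not specify which callback mechanisms were used, so a simple polled mailbox is assumed here.

```python
import threading


class CallbackMailbox:
    """Hypothetical stand-in for host-visible memory that a kernel writes and the CPU polls."""

    def __init__(self):
        self.delta = 0                      # +1 = request a device, -1 = release one
        self.request = threading.Event()    # raised by the "device" side
        self.ack = threading.Event()        # raised by the host once the request is served
        self.done = threading.Event()       # raised when the "kernel" finishes


def host_service_loop(mailbox, state):
    """CPU side: poll the mailbox and apply each request, clamped to [1, max_devices]."""
    while not mailbox.done.is_set():
        if mailbox.request.wait(timeout=0.001):
            mailbox.request.clear()
            wanted = state["active_devices"] + mailbox.delta
            state["active_devices"] = max(1, min(state["max_devices"], wanted))
            mailbox.ack.set()


def post_request(mailbox, delta):
    """'Device' side: publish a request and block until the host acknowledges it."""
    mailbox.ack.clear()
    mailbox.delta = delta
    mailbox.request.set()
    mailbox.ack.wait()


def persistent_kernel(mailbox, state, frontier_sizes, grow_above, shrink_below):
    """Simulated persistent kernel: one loop iteration per BFS level (assumed policy)."""
    for size in frontier_sizes:
        per_device = size / state["active_devices"]
        if per_device > grow_above:
            post_request(mailbox, +1)
        elif per_device < shrink_below and state["active_devices"] > 1:
            post_request(mailbox, -1)
        state["history"].append(state["active_devices"])
    mailbox.done.set()


def run_simulation(frontier_sizes, grow_above=100, shrink_below=20, max_devices=4):
    """Return the number of active devices used at each simulated BFS level."""
    mailbox = CallbackMailbox()
    state = {"active_devices": 1, "max_devices": max_devices, "history": []}
    host = threading.Thread(target=host_service_loop, args=(mailbox, state))
    host.start()
    persistent_kernel(mailbox, state, frontier_sizes, grow_above, shrink_below)
    host.join()
    return state["history"]


if __name__ == "__main__":
    print(run_simulation([50, 300, 900, 900, 120, 40]))
```

In this sketch the requester blocks until the host acknowledges, which keeps the simulation deterministic. A real device-side callback would also pay the latency that the abstract reports as callback overhead, and the grow and shrink policy shown here is only an assumed heuristic.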

Published In

ExHET '24: Proceedings of the 3rd International Workshop on Extreme Heterogeneity Solutions
March 2024
29 pages
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

PPoPP '24