GPU-Initiated Resource Allocation for Irregular Workloads

Authors: Ilyas Turimbetov, Muhammad Aditya Sasongko, Didem Unat

ExHET '24: Proceedings of the 3rd International Workshop on Extreme Heterogeneity Solutions

Pages 1 - 8

https://doi.org/10.1145/3642961.3643799

Published: 04 April 2024

Abstract

In multi-GPU systems, GPU kernels that are part of an irregular application may underutilize resources because the workload is too small to saturate the devices. To make better use of these resources, we propose a GPU-side resource allocation method that increases or decreases the number of GPUs in use as the workload changes over time. Our method employs GPU-to-CPU callbacks that allow GPU devices to request additional devices while kernel execution is in flight. We implemented and tested multiple callback methods required for GPU-initiated workload offloading to other devices and measured their overheads on Nvidia and AMD platforms. To showcase the use of callbacks in irregular applications, we implemented a Breadth-First Search (BFS) that uses device-initiated workload offloading. Apart from allowing dynamic device allocation in persistently running kernels, this approach reduces time to solution by 15.7% on average, at the cost of callback overheads of at least 6.50 microseconds on AMD and 4.83 microseconds on Nvidia, depending on the chosen callback mechanism. Moreover, the proposed model can reduce total device usage by up to 35%, which is associated with higher energy efficiency.
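
The request-and-grant pattern described in the abstract can be illustrated with a small CPU-only simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say which callback mechanisms were used, so a Python thread stands in for the host-side service loop and a pair of queues stands in for the GPU-to-CPU mailbox. The names `run_host` and `simulate`, the per-device capacity, and the frontier sizes are all hypothetical.

```python
import queue
import threading


def run_host(requests, grants, total_devices):
    """Hypothetical host loop: answer each device request with a grant."""
    while True:
        wanted = requests.get()          # blocks until the "device" calls back
        if wanted is None:               # sentinel: the persistent kernel finished
            break
        # Grant at least one device and never more than the system has.
        grants.put(max(1, min(wanted, total_devices)))


def simulate(frontier_sizes, per_device_capacity, total_devices):
    """Return the number of devices in use at each BFS level (illustrative)."""
    requests, grants = queue.Queue(), queue.Queue()
    host = threading.Thread(target=run_host,
                            args=(requests, grants, total_devices))
    host.start()

    active = 1                           # assumption: start on a single device
    history = []
    for size in frontier_sizes:
        # Devices needed to process this frontier (ceiling division).
        needed = max(1, -(-size // per_device_capacity))
        if needed != active:
            requests.put(needed)         # device-initiated callback to the host
            active = grants.get()        # wait for the host's decision
        history.append(active)

    requests.put(None)                   # tell the host loop to exit
    host.join()
    return history


if __name__ == "__main__":
    levels = [10, 250, 900, 120, 5]      # made-up frontier sizes per BFS level
    used = simulate(levels, per_device_capacity=100, total_devices=4)
    print("devices per level:", used)
    print("device-levels used:", sum(used), "of", 4 * len(levels))
```

In this toy run the device count follows the frontier size instead of staying at the maximum for every level, which is the kind of saving in total device usage the abstract refers to. The numbers here are invented for illustration and are unrelated to the paper's measurements.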

Index Terms

GPU-Initiated Resource Allocation for Irregular Workloads
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms
      1. Self-organization
    2. Parallel programming languages
2. Mathematics of computing
  1. Discrete mathematics
    1. Graph theory
      1. Graph algorithms


Published In

ExHET '24: Proceedings of the 3rd International Workshop on Extreme Heterogeneity Solutions

March 2024

29 pages

ISBN: 9798400705373

DOI: 10.1145/3642961

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

European Research Council

Conference

PPoPP '24

Sponsor:

SIGPLAN

PPoPP '24: The 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming

March 2 - 6, 2024

Edinburgh, United Kingdom
