DOI: 10.1145/3673038.3673091
research-article
Open access

DeInfer: A GPU resource allocation algorithm with spatial sharing for near-deterministic inferring tasks

Published: 12 August 2024

Abstract

In artificial intelligence applications, training models on GPUs has received wide attention, while the requirements of inference are often neglected. In many scenarios, it is critical to finish a deep learning inference (DLI) task and return the response in time, e.g., anomaly detection in AIOps or QoE (Quality of Experience) assurance for customers. However, GPU inference faces the following challenges: 1) the interference among inference tasks sharing a GPU is not well studied, or even not considered, which may cause a surge in inference latency due to hardware contention; 2) the effect of the task arrival rate, which often fluctuates significantly in real-world cases, on the deadline miss rate is not clearly accounted for. Therefore, both the interference among tasks and their arrival rates should be carefully modeled to decrease the deadline miss rate when sharing GPU resources. To tackle these issues, we propose DeInfer, which proceeds as follows: 1) we identify the key factors that lead to interference, conduct a systematic study, and develop a highly accurate interference prediction algorithm based on random forests, achieving a fourfold improvement over state-of-the-art interference prediction algorithms; 2) we use queueing theory to model the randomness of the arrival process and propose a GPU resource allocation algorithm that reduces the deadline miss rate by an average of over 30%.
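The abstract does not give the concrete interference features or model details; the sketch below only illustrates, under assumed inputs, how a random-forest interference predictor of the kind described might look. The feature names and the synthetic profiling data are hypothetical, not the factors identified in the paper.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    # Placeholder profiling data: each row holds assumed metrics of a target task and
    # its co-runner, e.g. [sm_util, mem_bw, l2_hit, co_sm_util, co_mem_bw, co_l2_hit].
    X = rng.uniform(size=(500, 6))
    # Synthetic slowdown factors standing in for measured co-location latency inflation.
    y = 1.0 + 0.8 * X[:, 3] + 0.5 * X[:, 4] + 0.05 * rng.standard_normal(500)

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X, y)

    # Predict the slowdown of a candidate pairing before placing it on a shared GPU.
    candidate = np.array([[0.45, 0.60, 0.72, 0.30, 0.40, 0.65]])
    print(model.predict(candidate)[0])

Similarly, for the queueing side, a minimal sketch assuming an M/G/1 view of the arrival process (the paper's actual model and formulas are not given in the abstract): the Pollaczek-Khinchine formula yields the mean queueing delay, and an M/M/1-style exponential tail gives a rough deadline-miss estimate.

    import math

    def mg1_mean_wait(lam, mean_s, second_moment_s):
        # Pollaczek-Khinchine mean waiting time for an M/G/1 queue:
        # W_q = lam * E[S^2] / (2 * (1 - rho)), with rho = lam * E[S].
        rho = lam * mean_s
        assert rho < 1.0, "the queue must be stable"
        return lam * second_moment_s / (2.0 * (1.0 - rho))

    def mm1_deadline_miss(lam, mu, deadline):
        # For an M/M/1 queue the response time is Exp(mu - lam),
        # so P(response time > deadline) = exp(-(mu - lam) * deadline).
        assert lam < mu
        return math.exp(-(mu - lam) * deadline)

    # Example: 80 req/s arriving at a task with 10 ms mean service time and a 25 ms deadline.
    print(mg1_mean_wait(80.0, 0.010, 0.00015))    # mean queueing delay (seconds)
    print(mm1_deadline_miss(80.0, 100.0, 0.025))  # rough deadline-miss probability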



    Information

    Published In

    ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing
    August 2024, 1279 pages
    ISBN: 9798400717932
    DOI: 10.1145/3673038
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. Queuing-Based model
    2. deep learning inference
    3. interference awareness
    4. spatial sharing

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICPP '24

    Acceptance Rates

    Overall Acceptance Rate 91 of 313 submissions, 29%

