DOI: 10.1145/3673038.3673091
research-article
Open access

DeInfer: A GPU resource allocation algorithm with spatial sharing for near-deterministic inferring tasks

Published: 12 August 2024

Abstract

In artificial intelligence applications, training models on GPUs has received wide attention, while the requirements of inference are often neglected. In many scenarios, it is critical to finish a deep learning inference (DLI) task and return the response in time, e.g., anomaly detection in AIOps or QoE (Quality of Experience) assurance for customers. However, GPU inference faces the following challenges: 1) the interference among inference tasks sharing a GPU is not well studied, or even not considered, which may cause a surge in inference latency due to hardware contention; 2) the effect of the task arrival rate, which often fluctuates significantly in real-world cases, on the deadline miss rate is not clearly accounted for. Therefore, both the interference among tasks and their arrival rates should be carefully modeled to decrease the deadline miss rate when sharing GPU resources. To tackle these issues, we propose DeInfer, which proceeds as follows: 1) we identify the key factors that lead to interference, conduct a systematic study, and develop a highly accurate interference prediction algorithm based on random forests, achieving a fourfold improvement over state-of-the-art interference prediction algorithms; 2) we use queueing theory to model the randomness of the arrival process and propose a GPU resource allocation algorithm that reduces the deadline miss rate by an average of over 30%.
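The abstract does not give the concrete interference features or model details; the sketch below only illustrates, under assumed inputs, how a random-forest interference predictor of the kind described might look. The feature names and the synthetic profiling data are hypothetical, not the factors identified in the paper.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    # Placeholder profiling data: each row holds assumed metrics of a target task and
    # its co-runner, e.g. [sm_util, mem_bw, l2_hit, co_sm_util, co_mem_bw, co_l2_hit].
    X = rng.uniform(size=(500, 6))
    # Synthetic slowdown factors standing in for measured co-location latency inflation.
    y = 1.0 + 0.8 * X[:, 3] + 0.5 * X[:, 4] + 0.05 * rng.standard_normal(500)

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X, y)

    # Predict the slowdown of a candidate pairing before placing it on a shared GPU.
    candidate = np.array([[0.45, 0.60, 0.72, 0.30, 0.40, 0.65]])
    print(model.predict(candidate)[0])

Similarly, for the queueing side, a minimal sketch assuming an M/G/1 view of the arrival process (the paper's actual model and formulas are not given in the abstract): the Pollaczek-Khinchine formula yields the mean queueing delay, and an M/M/1-style exponential tail gives a rough deadline-miss estimate.

    import math

    def mg1_mean_wait(lam, mean_s, second_moment_s):
        # Pollaczek-Khinchine mean waiting time for an M/G/1 queue:
        # W_q = lam * E[S^2] / (2 * (1 - rho)), with rho = lam * E[S].
        rho = lam * mean_s
        assert rho < 1.0, "the queue must be stable"
        return lam * second_moment_s / (2.0 * (1.0 - rho))

    def mm1_deadline_miss(lam, mu, deadline):
        # For an M/M/1 queue the response time is Exp(mu - lam),
        # so P(response time > deadline) = exp(-(mu - lam) * deadline).
        assert lam < mu
        return math.exp(-(mu - lam) * deadline)

    # Example: 80 req/s arriving at a task with 10 ms mean service time and a 25 ms deadline.
    print(mg1_mean_wait(80.0, 0.010, 0.00015))    # mean queueing delay (seconds)
    print(mm1_deadline_miss(80.0, 100.0, 0.025))  # rough deadline-miss probability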



    Information

    Published In

    ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing
    August 2024, 1279 pages
    ISBN: 9798400717932
    DOI: 10.1145/3673038
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. Queuing-Based model
    2. deep learning inference
    3. interference awareness
    4. spatial sharing

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICPP '24

    Acceptance Rates

    Overall Acceptance Rate 91 of 313 submissions, 29%

