
Advancing Serverless Computing for Scalable AI Model Inference: Challenges and Opportunities

Published: 02 December 2024

Abstract

Artificial Intelligence (AI) model inference has emerged as a crucial component across numerous applications. Serverless computing, known for its scalability, flexibility, and cost-efficiency, is an ideal paradigm for executing AI model inference tasks. This survey provides a comprehensive review of recent research on AI model inference systems in serverless environments, focusing on studies published since 2019. We investigate system-level advancements aimed at optimizing performance and cost-efficiency through a range of innovative techniques. By analyzing high-impact papers from leading venues in AI model inference and serverless computing, we highlight key breakthroughs and solutions. This survey serves as a valuable resource for both practitioners and academic researchers, offering critical insights into the current state and future trends in integrating AI model inference with serverless architectures. To the best of our knowledge, this is the first survey to cover Large Language Model (LLM) inference in the context of serverless computing.


Published In

WoSC10 '24: Proceedings of the 10th International Workshop on Serverless Computing
December 2024
46 pages
ISBN:9798400713361
DOI:10.1145/3702634
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

In-Cooperation

  • IFIP
  • USENIX

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. serverless computing
  2. LLMs inference
  3. DL inference
  4. ML inference

Qualifiers

  • Research-article

Conference

WoSC10 '24
