
Advancing Serverless Computing for Scalable AI Model Inference: Challenges and Opportunities

Published: 02 December 2024

Abstract

Artificial Intelligence (AI) model inference has emerged as a crucial component across numerous applications. Serverless computing, known for its scalability, flexibility, and cost-efficiency, is an ideal paradigm for executing AI model inference tasks. This survey provides a comprehensive review of recent research on AI model inference systems in serverless environments, focusing on studies published since 2019. We investigate system-level advancements aimed at optimizing performance and cost-efficiency through a range of innovative techniques. By analyzing high-impact papers from leading venues in AI model inference and serverless computing, we highlight key breakthroughs and solutions. This survey serves as a valuable resource for both practitioners and academic researchers, offering critical insights into the current state and future trends in integrating AI model inference with serverless architectures. To the best of our knowledge, this is the first survey to cover Large Language Model (LLM) inference in the context of serverless computing.


Published In

WoSC10 '24: Proceedings of the 10th International Workshop on Serverless Computing
December 2024
46 pages
ISBN:9798400713361
DOI:10.1145/3702634
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

In-Cooperation

  • IFIP
  • USENIX

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. serverless computing
  2. LLMs inference
  3. DL inference
  4. ML inference

Qualifiers

  • Research-article

Conference

WoSC10 '24
