Research article · DOI: 10.1145/3578356.3592578

Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference Serving Systems

Published: 08 May 2023

Abstract

The use of machine learning (ML) inference in applications is growing rapidly. ML inference services interact with users directly and must deliver fast, accurate responses. Moreover, these services face dynamic request workloads, which require their computing resources to be adjusted over time. Failing to right-size computing resources results either in violations of latency service level objectives (SLOs) or in wasted computing resources. Adapting to dynamic workloads while accounting for all three pillars of accuracy, latency, and resource cost is challenging. To address these challenges, we propose InfAdapter, which proactively selects a set of ML model variants, together with their resource allocations, to meet the latency SLO while maximizing an objective function composed of accuracy and cost. Compared to a popular industry autoscaler (the Kubernetes Vertical Pod Autoscaler), InfAdapter reduces SLO violations and cost by up to 65% and 33%, respectively.
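
To make the variant-selection idea concrete, the following is a minimal Python sketch of the kind of optimization the abstract describes. It is an illustration under stated assumptions, not InfAdapter's actual algorithm: the variant names, the accuracy/capacity/cost numbers, the proportional traffic split, the weights alpha and beta, and the brute-force solver are all hypothetical stand-ins introduced here.

    # Hypothetical sketch: choose replica counts per model variant so that
    # aggregate capacity covers a predicted request rate (a stand-in for
    # meeting the latency SLO) while maximizing a weighted
    # accuracy-minus-cost objective. All numbers are illustrative.
    from dataclasses import dataclass
    from itertools import product

    @dataclass(frozen=True)
    class Variant:
        name: str
        accuracy: float          # e.g., top-1 accuracy of the variant
        capacity_rps: float      # requests/sec one replica sustains within the SLO
        cost_per_replica: float  # e.g., CPU cores allocated per replica

    # Illustrative profiles only; a real system would measure these offline.
    VARIANTS = [
        Variant("resnet18",  0.70, 40.0, 1.0),
        Variant("resnet50",  0.76, 20.0, 2.0),
        Variant("resnet152", 0.78,  8.0, 4.0),
    ]

    def select(predicted_rps, alpha=10.0, beta=1.0, max_replicas=8):
        """Brute-force search over replica counts for each variant."""
        best, best_obj = None, float("-inf")
        for counts in product(range(max_replicas + 1), repeat=len(VARIANTS)):
            capacity = sum(n * v.capacity_rps for n, v in zip(counts, VARIANTS))
            if capacity < predicted_rps:
                continue  # under-provisioned: latency SLO would be at risk
            cost = sum(n * v.cost_per_replica for n, v in zip(counts, VARIANTS))
            # Average accuracy if traffic is split proportionally to capacity.
            avg_acc = sum(n * v.capacity_rps * v.accuracy
                          for n, v in zip(counts, VARIANTS)) / capacity
            obj = alpha * avg_acc - beta * cost
            if obj > best_obj:
                best, best_obj = counts, obj
        return best, best_obj

    if __name__ == "__main__":
        counts, obj = select(predicted_rps=100.0)
        for n, v in zip(counts, VARIANTS):
            if n:
                print(f"{v.name}: {n} replica(s)")
        print(f"objective = {obj:.2f}")

In a real deployment, measured latency profiles and an integer-programming solver would replace the toy numbers and the exhaustive loop; the sketch only shows the shape of the accuracy/cost/latency trade-off the paper targets.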



Index Terms

  1. Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference Serving Systems

      Recommendations

      Comments

      Information & Contributors

      Information

      Published In

      EuroMLSys '23: Proceedings of the 3rd Workshop on Machine Learning and Systems
      May 2023, 176 pages
      ISBN: 9798400700842
      DOI: 10.1145/3578356
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Publication History

      Published: 08 May 2023


      Author Tags

      1. inference serving systems
      2. autoscaling
      3. machine learning

      Qualifiers

      • Research-article

      Conference

      EuroMLSys '23

      Acceptance Rates

      Overall acceptance rate: 18 of 26 submissions (69%)


      Article Metrics

      • Downloads (last 12 months): 272
      • Downloads (last 6 weeks): 33
      Reflects downloads up to 30 Aug 2024

      Cited By
      • (2024) Challenges and Opportunities of Using Transformer-Based Multi-Task Learning in NLP Through ML Lifecycle: A Position Paper. Natural Language Processing Journal, 7, 100076. DOI: 10.1016/j.nlp.2024.100076. Online publication date: Jun 2024.
      • (2024) Resource Allocation of Industry 4.0 Micro-Service Applications across Serverless Fog Federation. Future Generation Computer Systems, 154:C, 479-490. DOI: 10.1016/j.future.2024.01.017. Online publication date: 25 Jun 2024.
      • (2023) Is Machine Learning Necessary for Cloud Resource Usage Forecasting? Proceedings of the 2023 ACM Symposium on Cloud Computing, 544-554. DOI: 10.1145/3620678.3624790. Online publication date: 30 Oct 2023.
      • (2023) Smart-Kube: Energy-Aware and Fair Kubernetes Job Scheduler Using Deep Reinforcement Learning. 2023 IEEE 8th International Conference on Smart Cloud (SmartCloud), 154-163. DOI: 10.1109/SmartCloud58862.2023.00035. Online publication date: 16 Sep 2023.
      • (2023) Enhancing Kubernetes Auto-Scaling: Leveraging Metrics for Improved Workload Performance. 2023 Global Conference on Information Technologies and Communications (GCITC), 1-7. DOI: 10.1109/GCITC60406.2023.10426170. Online publication date: 1 Dec 2023.
