DOI: 10.1145/3642970.3655833
Research article

Sponge: Inference Serving with Dynamic SLOs Using In-Place Vertical Scaling

Published: 22 April 2024

Abstract

Mobile and IoT applications increasingly adopt deep learning inference to provide intelligence. Inference requests are typically sent to a cloud infrastructure over a wireless network that is highly variable, leading to the challenge of dynamic Service Level Objectives (SLOs) at the request level.
This paper presents Sponge, a novel deep learning inference serving system that maximizes resource efficiency while guaranteeing dynamic SLOs. Sponge achieves this by combining in-place vertical scaling, dynamic batching, and request reordering. Specifically, we introduce an Integer Programming formulation that captures the resource allocation problem, providing a mathematical model of the relationship between latency, batch size, and resources. We demonstrate the potential of Sponge through a prototype implementation and preliminary experiments, and discuss future work.
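The abstract's core idea can be sketched as a toy allocator: choose the smallest CPU allocation such that a batch of queued requests finishes within the tightest remaining per-request SLO. Everything below is an illustrative assumption, not the paper's actual formulation: the linear-in-batch, inverse-in-cores latency model, the `alpha`/`beta` parameters, and the slack-based (EDF-style) reordering are hypothetical stand-ins, and a real system would solve the Integer Program with a solver rather than brute-force enumeration.

```python
# Hypothetical sketch of SLO-aware allocation in the spirit of Sponge.
# Latency model and parameters are illustrative assumptions only.

def batch_latency_ms(batch_size: int, cores: int,
                     alpha: float = 5.0, beta: float = 10.0) -> float:
    """Assumed profile: latency grows linearly with batch size and
    shrinks inversely with the number of allocated cores."""
    return (alpha * batch_size + beta) / cores

def allocate(remaining_slos_ms, max_cores: int = 16):
    """Serve all queued requests as one batch, reordered by remaining
    slack (tightest SLO first), using the fewest cores that still meet
    the tightest deadline. Returns (cores, batch_size, serve_order),
    or None if no feasible allocation exists within max_cores."""
    order = sorted(range(len(remaining_slos_ms)),
                   key=lambda i: remaining_slos_ms[i])
    batch = len(remaining_slos_ms)
    tightest = min(remaining_slos_ms)
    for cores in range(1, max_cores + 1):  # brute-force stand-in for an IP solver
        if batch_latency_ms(batch, cores) <= tightest:
            return cores, batch, order
    return None

cores, batch, order = allocate([100.0, 40.0, 75.0])
```

Under this model, tightening the SLOs (e.g. `allocate([10.0, 10.0])`) forces a larger core allocation, which is the behavior in-place vertical scaling exploits: resources are resized per decision without restarting the serving container.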


Cited By

  • (2024) "Towards Democratic Computing." In From Multimedia Communications to the Future Internet, 245--265. DOI: 10.1007/978-3-031-71874-8_17. Online publication date: 13-Sep-2024.

      Published In

      EuroMLSys '24: Proceedings of the 4th Workshop on Machine Learning and Systems
      April 2024
      218 pages
      ISBN:9798400705410
      DOI:10.1145/3642970

      Publisher

Association for Computing Machinery, New York, NY, United States



      Author Tags

      1. Inference Serving Systems
      2. Vertical Scaling

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      EuroSys '24

      Acceptance Rates

      Overall Acceptance Rate 18 of 26 submissions, 69%


      Article Metrics

      • Downloads (Last 12 months)125
      • Downloads (Last 6 weeks)12
      Reflects downloads up to 09 Nov 2024

