DOI: 10.1145/3135974.3135993
Research article · Open access

Swayam: distributed autoscaling to meet SLAs of machine learning inference services with resource efficiency

Published: 11 December 2017

Abstract

Developers use Machine Learning (ML) platforms to train ML models and then deploy these ML models as web services for inference (prediction). A key challenge for platform providers is to guarantee response-time Service Level Agreements (SLAs) for inference workloads while maximizing resource efficiency. Swayam is a fully distributed autoscaling framework that exploits characteristics of production ML inference workloads to deliver on the dual challenge of resource efficiency and SLA compliance. Our key contributions are (1) model-based autoscaling that takes into account SLAs and ML inference workload characteristics, (2) a distributed protocol that uses partial load information and prediction at frontends to provision new service instances, and (3) a backend self-decommissioning protocol for service instances. We evaluate Swayam on 15 popular services that were hosted on a production ML-as-a-service platform, for the following service-specific SLAs: for each service, at least 99% of requests must complete within the response-time threshold. Compared to a clairvoyant autoscaler that always satisfies the SLAs (i.e., even if there is a burst in the request rates), Swayam decreases resource utilization by up to 27%, while meeting the service-specific SLAs over 96% of the time during a three hour window. Microsoft Azure's Swayam-based framework was deployed in 2016 and has hosted over 100,000 services.
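The model-based autoscaling contribution can be illustrated with a toy queueing sketch. This is not Swayam's actual estimator (the paper derives provisioning from each service's measured inference-time characteristics); it instead assumes Poisson arrivals and exponential service times (an M/M/k model, with hypothetical parameter names) and picks the smallest backend count whose estimated tail response time satisfies a "99% of requests within the threshold" SLA:

```python
import math

def erlang_c(k, a):
    """Probability that an arriving request must queue in an M/M/k
    system with k servers and offered load a = lam / mu (requires k > a)."""
    idle_terms = sum(a ** n / math.factorial(n) for n in range(k))
    wait_term = (a ** k / math.factorial(k)) * (k / (k - a))
    return wait_term / (idle_terms + wait_term)

def backends_for_sla(lam, mu, sla, target=0.99):
    """Smallest backend count k such that an estimated fraction `target`
    of requests finish within `sla` seconds, approximating response time
    as mean service time plus M/M/k queueing delay, whose tail is
    P(wait > t) = ErlangC(k, a) * exp(-(k*mu - lam) * t)."""
    if sla <= 1.0 / mu:
        raise ValueError("SLA threshold must exceed the mean service time")
    a = lam / mu
    k = max(1, math.ceil(a))
    while True:
        if k > a:  # queue is stable only when capacity exceeds load
            wait_budget = sla - 1.0 / mu  # time left after mean service
            p_late = erlang_c(k, a) * math.exp(-(k * mu - lam) * wait_budget)
            if p_late <= 1.0 - target:
                return k
        k += 1

# e.g. 50 req/s, 100 ms mean inference time, 500 ms threshold at the 99th percentile
print(backends_for_sla(lam=50.0, mu=10.0, sla=0.5))
```

Under these assumptions the example provisions 7 backends where pure utilization would suggest 5, illustrating the headroom that an SLA-aware model must add over a load-only autoscaler.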




Published In

Middleware '17: Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference
December 2017
268 pages
ISBN:9781450347204
DOI:10.1145/3135974
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

In-Cooperation

  • USENIX Association
  • IFIP

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 December 2017


Author Tags

  1. SLAs
  2. distributed autoscaling
  3. machine learning

Qualifiers

  • Research-article

Conference

Middleware '17: 18th International Middleware Conference
December 11-15, 2017
Las Vegas, Nevada

Acceptance Rates

Middleware '17 paper acceptance rate: 20 of 85 submissions (24%)
Overall acceptance rate: 203 of 948 submissions (21%)

Article Metrics

  • Downloads (last 12 months): 318
  • Downloads (last 6 weeks): 37
Reflects downloads up to 22 Jan 2025


Cited By

  • (2024) On the Analysis of Inter-Relationship between Auto-Scaling Policy and QoS of FaaS Workloads. Sensors 24(12):3774. DOI: 10.3390/s24123774. Online: 10 Jun 2024
  • (2024) Evaluating DL Model Scaling Trade-Offs During Inference via an Empirical Benchmark Analysis. Future Internet 16(12):468. DOI: 10.3390/fi16120468. Online: 13 Dec 2024
  • (2024) Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving. Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 607-623. DOI: 10.1145/3694715.3695963. Online: 4 Nov 2024
  • (2024) Novel Contract-based Runtime Explainability Framework for End-to-End Ensemble Machine Learning Serving. Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI, 234-244. DOI: 10.1145/3644815.3644964. Online: 14 Apr 2024
  • (2024) Sponge. Proceedings of the 4th Workshop on Machine Learning and Systems, 184-191. DOI: 10.1145/3642970.3655833. Online: 22 Apr 2024
  • (2024) Machine Learning Systems are Bloated and Vulnerable. Proceedings of the ACM on Measurement and Analysis of Computing Systems 8(1):1-30. DOI: 10.1145/3639032. Online: 21 Feb 2024
  • (2024) Vertically Autoscaling Monolithic Applications with CaaSPER: Scalable Container-as-a-Service Performance Enhanced Resizing Algorithm for the Cloud. Companion of the 2024 International Conference on Management of Data, 241-254. DOI: 10.1145/3626246.3653378. Online: 9 Jun 2024
  • (2024) Loki: A System for Serving ML Inference Pipelines with Hardware and Accuracy Scaling. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 267-280. DOI: 10.1145/3625549.3658688. Online: 3 Jun 2024
  • (2024) Automating Cloud Deployment for Real-Time Online Foundation Model Inference. IEEE/ACM Transactions on Networking 32(2):1509-1523. DOI: 10.1109/TNET.2023.3321967. Online: Apr 2024
  • (2024) Reinforcement Learning Based Online Request Scheduling Framework for Workload-Adaptive Edge Deep Learning Inference. IEEE Transactions on Mobile Computing 23(12):13222-13239. DOI: 10.1109/TMC.2024.3429571. Online: Dec 2024
