DOI: 10.1145/3135974.3135993
Research article · Open access

Swayam: distributed autoscaling to meet SLAs of machine learning inference services with resource efficiency

Published: 11 December 2017

Abstract

Developers use Machine Learning (ML) platforms to train ML models and then deploy these ML models as web services for inference (prediction). A key challenge for platform providers is to guarantee response-time Service Level Agreements (SLAs) for inference workloads while maximizing resource efficiency. Swayam is a fully distributed autoscaling framework that exploits characteristics of production ML inference workloads to deliver on the dual challenge of resource efficiency and SLA compliance. Our key contributions are (1) model-based autoscaling that takes into account SLAs and ML inference workload characteristics, (2) a distributed protocol that uses partial load information and prediction at frontends to provision new service instances, and (3) a backend self-decommissioning protocol for service instances. We evaluate Swayam on 15 popular services that were hosted on a production ML-as-a-service platform, for the following service-specific SLAs: for each service, at least 99% of requests must complete within the response-time threshold. Compared to a clairvoyant autoscaler that always satisfies the SLAs (i.e., even if there is a burst in the request rates), Swayam decreases resource utilization by up to 27%, while meeting the service-specific SLAs over 96% of the time during a three hour window. Microsoft Azure's Swayam-based framework was deployed in 2016 and has hosted over 100,000 services.
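The model-based autoscaling contribution can be illustrated with a toy queueing sketch. This is not Swayam's actual estimator (the paper derives provisioning from each service's measured inference-time characteristics); it instead assumes Poisson arrivals and exponential service times (an M/M/k model, with hypothetical parameter names) and picks the smallest backend count whose estimated tail response time satisfies a "99% of requests within the threshold" SLA:

```python
import math

def erlang_c(k, a):
    """Probability that an arriving request must queue in an M/M/k
    system with k servers and offered load a = lam / mu (requires k > a)."""
    idle_terms = sum(a ** n / math.factorial(n) for n in range(k))
    wait_term = (a ** k / math.factorial(k)) * (k / (k - a))
    return wait_term / (idle_terms + wait_term)

def backends_for_sla(lam, mu, sla, target=0.99):
    """Smallest backend count k such that an estimated fraction `target`
    of requests finish within `sla` seconds, approximating response time
    as mean service time plus M/M/k queueing delay, whose tail is
    P(wait > t) = ErlangC(k, a) * exp(-(k*mu - lam) * t)."""
    if sla <= 1.0 / mu:
        raise ValueError("SLA threshold must exceed the mean service time")
    a = lam / mu
    k = max(1, math.ceil(a))
    while True:
        if k > a:  # queue is stable only when capacity exceeds load
            wait_budget = sla - 1.0 / mu  # time left after mean service
            p_late = erlang_c(k, a) * math.exp(-(k * mu - lam) * wait_budget)
            if p_late <= 1.0 - target:
                return k
        k += 1

# e.g. 50 req/s, 100 ms mean inference time, 500 ms threshold at the 99th percentile
print(backends_for_sla(lam=50.0, mu=10.0, sla=0.5))
```

Under these assumptions the example provisions 7 backends where pure utilization would suggest 5, illustrating the headroom that an SLA-aware model must add over a load-only autoscaler.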




Published In

Middleware '17: Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference
December 2017
268 pages
ISBN:9781450347204
DOI:10.1145/3135974
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

In-Cooperation

  • USENIX Association
  • IFIP

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 December 2017


Author Tags

  1. SLAs
  2. distributed autoscaling
  3. machine learning

Qualifiers

  • Research-article

Conference

Middleware '17: 18th International Middleware Conference
December 11-15, 2017
Las Vegas, Nevada

Acceptance Rates

Middleware '17 paper acceptance rate: 20 of 85 submissions (24%)
Overall acceptance rate: 203 of 948 submissions (21%)

Article Metrics

  • Downloads (last 12 months): 318
  • Downloads (last 6 weeks): 37
Reflects downloads up to 22 Jan 2025


Cited By

  • (2024) On the Analysis of Inter-Relationship between Auto-Scaling Policy and QoS of FaaS Workloads. Sensors 24(12):3774. DOI: 10.3390/s24123774. Online: 10 Jun 2024
  • (2024) Evaluating DL Model Scaling Trade-Offs During Inference via an Empirical Benchmark Analysis. Future Internet 16(12):468. DOI: 10.3390/fi16120468. Online: 13 Dec 2024
  • (2024) Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving. Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 607-623. DOI: 10.1145/3694715.3695963. Online: 4 Nov 2024
  • (2024) Novel Contract-based Runtime Explainability Framework for End-to-End Ensemble Machine Learning Serving. Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI, 234-244. DOI: 10.1145/3644815.3644964. Online: 14 Apr 2024
  • (2024) Sponge. Proceedings of the 4th Workshop on Machine Learning and Systems, 184-191. DOI: 10.1145/3642970.3655833. Online: 22 Apr 2024
  • (2024) Machine Learning Systems are Bloated and Vulnerable. Proceedings of the ACM on Measurement and Analysis of Computing Systems 8(1):1-30. DOI: 10.1145/3639032. Online: 21 Feb 2024
  • (2024) Vertically Autoscaling Monolithic Applications with CaaSPER: Scalable Container-as-a-Service Performance Enhanced Resizing Algorithm for the Cloud. Companion of the 2024 International Conference on Management of Data, 241-254. DOI: 10.1145/3626246.3653378. Online: 9 Jun 2024
  • (2024) Loki: A System for Serving ML Inference Pipelines with Hardware and Accuracy Scaling. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 267-280. DOI: 10.1145/3625549.3658688. Online: 3 Jun 2024
  • (2024) Automating Cloud Deployment for Real-Time Online Foundation Model Inference. IEEE/ACM Transactions on Networking 32(2):1509-1523. DOI: 10.1109/TNET.2023.3321967. Online: Apr 2024
  • (2024) Reinforcement Learning Based Online Request Scheduling Framework for Workload-Adaptive Edge Deep Learning Inference. IEEE Transactions on Mobile Computing 23(12):13222-13239. DOI: 10.1109/TMC.2024.3429571. Online: Dec 2024
