
Morphling: Fast, Near-Optimal Auto-Configuration for Cloud-Native Model Serving

Authors:

Liping Zhang

SoCC '21: Proceedings of the ACM Symposium on Cloud Computing

Pages 639 - 653

https://doi.org/10.1145/3472883.3486987

Published: 01 November 2021

Abstract

Machine learning models are widely deployed in production clouds to provide online inference services. Deploying these services efficiently requires careful tuning of hardware and runtime configurations (e.g., GPU type, GPU memory, batch size), which can significantly improve model-serving performance and reduce cost. However, existing auto-configuration approaches for general workloads, such as Bayesian optimization and white-box prediction, navigate the high-dimensional configuration space of model serving inefficiently and incur a high sampling cost.

In this paper, we present Morphling, a fast, near-optimal auto-configuration framework for cloud-native model serving. Morphling employs model-agnostic meta-learning to navigate the large configuration space. It trains a metamodel offline to capture the general performance trend under varying configurations. For a new inference service, Morphling quickly adapts the metamodel by sampling a small number of configurations and then uses the adapted model to find the optimal one. We have implemented Morphling as an auto-configuration service in Kubernetes and evaluated its performance with popular CV and NLP models, as well as production inference services at Alibaba. Compared with existing approaches, Morphling reduces the median search cost by 3x to 22x, quickly converging to the optimal configuration by sampling only 30 candidates in a large search space of 720 options.
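The adapt-then-search step described above can be illustrated with a minimal sketch. It is not Morphling's implementation: the knob names and values are hypothetical and chosen only so that the grid has 720 points, a linear model stands in for the learned regressor, `measure` is a synthetic stand-in for stress-testing a deployed configuration, and the offline meta-training that would produce `meta_weights` is assumed to have happened already. Only the few-shot adaptation from a given initialization is shown.

```python
import itertools
import random

# Hypothetical knobs; the values are assumptions chosen so the grid has 720 points.
SPACE = {
    "cpu_cores": [1, 2, 3, 4, 5],
    "gpu_type": [0, 1, 2, 3],  # index into an assumed list of GPU models
    "gpu_mem_gb": [1, 2, 4, 8, 12, 16],
    "batch_size": [1, 2, 4, 8, 16, 32],
}
CONFIGS = list(itertools.product(*SPACE.values()))


def features(cfg):
    """Scale each knob to [0, 1] and append a bias term."""
    scaled = [v / max(vals) for v, vals in zip(cfg, SPACE.values())]
    return scaled + [1.0]


def predict(weights, cfg):
    """Linear performance model (a stand-in for a learned regressor)."""
    return sum(w * x for w, x in zip(weights, features(cfg)))


def measure(cfg):
    """Synthetic stand-in for a real stress test of one configuration."""
    true_weights = [0.9, 1.6, 0.4, 2.2, 0.1]  # unknown to the search
    return predict(true_weights, cfg)


def adapt(meta_weights, samples, lr=0.1, steps=1000):
    """Fine-tune the given initialization by gradient descent on squared error."""
    w = list(meta_weights)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for cfg, y in samples:
            err = predict(w, cfg) - y
            for i, x in enumerate(features(cfg)):
                grad[i] += 2 * err * x / len(samples)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def search(meta_weights, budget=30, seed=0):
    """Measure `budget` random configurations, adapt, then rank the whole grid."""
    rng = random.Random(seed)
    samples = [(cfg, measure(cfg)) for cfg in rng.sample(CONFIGS, budget)]
    adapted = adapt(meta_weights, samples)
    best = max(CONFIGS, key=lambda cfg: predict(adapted, cfg))
    return best, adapted


if __name__ == "__main__":
    # The initialization below is an arbitrary placeholder for a metamodel.
    best, _ = search([1.0, 1.0, 1.0, 1.0, 0.0])
    print(len(CONFIGS), "configurations; predicted best:", best)
```

In this sketch only 30 of the 720 grid points (about 4%) are ever measured; the remaining points are ranked by the adapted model's predictions rather than by further measurements.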

Index Terms

Morphling: Fast, Near-Optimal Auto-Configuration for Cloud-Native Model Serving
1. Information systems
  1. Information systems applications
    1. Enterprise information systems
      1. Enterprise resource planning

Published In

SoCC '21: Proceedings of the ACM Symposium on Cloud Computing

November 2021

685 pages

ISBN:9781450386388

DOI:10.1145/3472883

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SoCC '21

SoCC '21: ACM Symposium on Cloud Computing

November 1 - 4, 2021

Seattle, WA, USA

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%
