DOI: 10.1145/3472883.3486987
research-article

Morphling: Fast, Near-Optimal Auto-Configuration for Cloud-Native Model Serving

Published: 01 November 2021

Abstract

Machine learning models are widely deployed in production clouds to provide online inference services. Efficiently deploying inference services requires careful tuning of hardware and runtime configurations (e.g., GPU type, GPU memory, batch size), which can significantly improve model serving performance and reduce cost. However, existing auto-configuration approaches for general workloads, such as Bayesian optimization and white-box prediction, are inefficient in navigating the high-dimensional configuration space of model serving, incurring high sampling cost.
In this paper, we present Morphling, a fast, near-optimal auto-configuration framework for cloud-native model serving. Morphling employs model-agnostic meta-learning to navigate the large configuration space. It trains a metamodel offline to capture the general performance trend under varying configurations. Morphling quickly adapts the metamodel to a new inference service by sampling a small number of configurations and uses it to find the optimal one. We have implemented Morphling as an auto-configuration service in Kubernetes and evaluated its performance with popular CV and NLP models, as well as production inference services at Alibaba. Compared with existing approaches, Morphling reduces the median search cost by 3x-22x, quickly converging to the optimal configuration by sampling only 30 candidates in a large search space of 720 options.
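
The offline-train, then few-shot-adapt workflow described in the abstract can be illustrated with a minimal sketch. This is not Morphling's implementation: it assumes a linear performance model, uses a Reptile-style first-order update in place of full model-agnostic meta-learning, and replaces real profiling with synthetic services. The knob names and values in `SPACE` are hypothetical and were chosen only so that the search space has 720 options; the budget of 30 samples matches the figure quoted above.

```python
import itertools
import random

# Hypothetical knobs, chosen only so that 6 * 5 * 6 * 4 = 720 candidates.
SPACE = {
    "cpu_cores": [1, 2, 3, 4, 5, 6],
    "gpu_mem_frac": [0.2, 0.4, 0.6, 0.8, 1.0],
    "batch_size": [1, 2, 4, 8, 16, 32],
    "gpu_type": [0, 1, 2, 3],
}
CONFIGS = list(itertools.product(*SPACE.values()))


def features(config):
    """Scale each knob to [0, 1] by its position in the value list; append a bias."""
    scaled = [values.index(v) / (len(values) - 1)
              for v, values in zip(config, SPACE.values())]
    return scaled + [1.0]


def predict(weights, config):
    return sum(w * x for w, x in zip(weights, features(config)))


def mse(weights, samples):
    return sum((predict(weights, c) - y) ** 2 for c, y in samples) / len(samples)


def adapt(weights, samples, steps=20, lr=0.05):
    """Full-batch gradient descent on squared error, starting from `weights`."""
    w = list(weights)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for config, y in samples:
            x = features(config)
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i, xi in enumerate(x):
                grad[i] += 2.0 * err * xi / len(samples)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def make_service(rng):
    """Synthetic stand-in for profiling a service: a noiseless linear score."""
    coeffs = [rng.uniform(0.5, 1.5) for _ in range(5)]
    return lambda config: sum(c * x for c, x in zip(coeffs, features(config)))


def meta_train(rng, n_services=50, shots=30, meta_lr=0.5):
    """Offline phase (Reptile-style): move shared weights toward each adapted model."""
    meta = [0.0] * 5
    for _ in range(n_services):
        measure = make_service(rng)
        samples = [(c, measure(c)) for c in rng.sample(CONFIGS, shots)]
        adapted = adapt(meta, samples)
        meta = [m + meta_lr * (a - m) for m, a in zip(meta, adapted)]
    return meta


def configure(meta, measure, rng, budget=30):
    """Online phase: profile `budget` configurations, adapt, then rank all candidates."""
    samples = [(c, measure(c)) for c in rng.sample(CONFIGS, budget)]
    adapted = adapt(meta, samples)
    best = max(CONFIGS, key=lambda c: predict(adapted, c))
    return best, samples, adapted


rng = random.Random(0)
meta = meta_train(rng)
new_service = make_service(rng)
best, samples, adapted = configure(meta, new_service, rng)
print(dict(zip(SPACE, best)),
      f"profiled {len(samples)} of {len(CONFIGS)} configurations")
```

The point of the sketch is the cost structure: only the 30 sampled configurations are ever measured, while the remaining candidates are ranked by the adapted model. A real system would use a nonlinear metamodel and measured latency or throughput instead of the synthetic score used here.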

Supplementary Material

MP4 File (Day4_13_3_LupingWang.mp4)
Presentation video

Cited By

  • (2024) DLRover-RM: Resource Optimization for Deep Recommendation Models Training in the Cloud. Proceedings of the VLDB Endowment 17:12 (4130-4144). DOI: 10.14778/3685800.3685832. Online publication date: 8-Nov-2024
  • (2024) Deep Configuration Performance Learning: A Systematic Survey and Taxonomy. ACM Transactions on Software Engineering and Methodology 34:1 (1-62). DOI: 10.1145/3702986. Online publication date: 5-Nov-2024
  • (2024) Optimizing GPU Sharing for Container-Based DNN Serving with Multi-Instance GPUs. Proceedings of the 17th ACM International Systems and Storage Conference (68-82). DOI: 10.1145/3688351.3689156. Online publication date: 16-Sep-2024

    Published In

    SoCC '21: Proceedings of the ACM Symposium on Cloud Computing
    November 2021
    685 pages
    ISBN: 9781450386388
    DOI: 10.1145/3472883
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Auto-Configuration
    2. Cloud Computing
    3. Meta-Learning
    4. Model Serving

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    SoCC '21: ACM Symposium on Cloud Computing
    November 1 - 4, 2021
    Seattle, WA, USA

    Acceptance Rates

    Overall Acceptance Rate 169 of 722 submissions, 23%

    Article Metrics

    • Downloads (Last 12 months): 187
    • Downloads (Last 6 weeks): 15
    Reflects downloads up to 12 Jan 2025

    Cited By

    • (2024) DeInfer: A GPU resource allocation algorithm with spatial sharing for near-deterministic inferring tasks. Proceedings of the 53rd International Conference on Parallel Processing (701-711). DOI: 10.1145/3673038.3673091. Online publication date: 12-Aug-2024
    • (2024) PISeL: Pipelining DNN Inference for Serverless Computing. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (1951-1960). DOI: 10.1145/3627673.3679824. Online publication date: 21-Oct-2024
    • (2024) Loki: A System for Serving ML Inference Pipelines with Hardware and Accuracy Scaling. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing (267-280). DOI: 10.1145/3625549.3658688. Online publication date: 3-Jun-2024
    • (2024) Reducing Datacenter Compute Carbon Footprint by Harnessing the Power of Specialization: Principles, Metrics, Challenges and Opportunities. IEEE Transactions on Semiconductor Manufacturing 37:4 (481-488). DOI: 10.1109/TSM.2024.3434331. Online publication date: Nov-2024
    • (2024) DeepCAT+: A Low-Cost and Transferrable Online Configuration Auto-Tuning Approach for Big Data Frameworks. IEEE Transactions on Parallel and Distributed Systems 35:11 (2114-2131). DOI: 10.1109/TPDS.2024.3459889. Online publication date: 1-Nov-2024
    • (2024) Fluid-Shuttle: Efficient Cloud Data Transmission Based on Serverless Computing Compression. IEEE/ACM Transactions on Networking 32:6 (4554-4569). DOI: 10.1109/TNET.2024.3402561. Online publication date: Dec-2024
    • (2024) Integrating Bayesian Optimization and Machine Learning for the Optimal Configuration of Cloud Systems. IEEE Transactions on Cloud Computing 12:1 (277-294). DOI: 10.1109/TCC.2024.3361070. Online publication date: Jan-2024
