DOI: 10.1145/3419111.3421284
Research Article | Open Access

GSLICE: controlled spatial sharing of GPUs for a scalable inference platform

Published: 12 October 2020

Abstract

The increasing demand for cloud-based inference services requires the use of Graphics Processing Units (GPUs). It is highly desirable to utilize GPUs efficiently by multiplexing different inference tasks on a GPU. Batched processing, CUDA streams, and the Multi-Process Service (MPS) help, but we find that these mechanisms are not adequate for achieving scalability through efficient GPU utilization, and they do not guarantee predictable performance.
GSLICE addresses these challenges with a dynamic GPU resource allocation and management framework that maximizes performance and resource utilization. We virtualize the GPU by apportioning its resources across different Inference Functions (IFs), thus providing isolation and guaranteeing performance. We develop self-learning, adaptive GPU resource allocation and batching schemes that account for network traffic characteristics while keeping inference latencies below service-level objectives (SLOs). GSLICE adapts quickly to the workload intensity of streaming data and to the variability of GPU processing costs. GSLICE scales IF processing on the GPU through efficient, controlled spatial multiplexing, coupled with a GPU resource re-allocation scheme with near-zero (< 100 μs) downtime. Compared to default MPS and TensorRT, GSLICE improves GPU utilization efficiency by 60--800% and achieves a 2--13X improvement in aggregate throughput.
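To make the "apportioning GPU resources across IFs" idea concrete: under NVIDIA MPS, a process's share of the GPU's streaming multiprocessors can be capped via the `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` environment variable. The sketch below is illustrative only — the proportional-to-demand policy and the IF names are assumptions, not GSLICE's actual adaptive allocator.

```python
import os

def apportion(demands):
    """Split 100% of the GPU's SM capacity across IFs in proportion to
    their demand. Illustrative static policy; GSLICE adapts shares online."""
    total = sum(demands.values())
    return {name: max(1, round(100 * d / total)) for name, d in demands.items()}

def mps_env(share_pct):
    """Build an environment for an IF worker process so that, when run
    under an MPS control daemon, it is capped to share_pct% of GPU threads."""
    env = dict(os.environ)
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(share_pct)
    return env

# Example: two hypothetical IFs with a 3:1 demand ratio.
shares = apportion({"resnet": 3, "mobilenet": 1})
```

Each IF worker would then be launched with `mps_env(shares[name])`, giving every IF an isolated slice of the GPU rather than letting MPS clients contend freely.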
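The abstract's batching-under-SLO idea can be sketched with a simple affine cost model (latency = fixed overhead + per-item cost × batch size): pick the largest batch that still meets the latency objective. This is a minimal sketch under an assumed cost model, not GSLICE's self-learning batching scheme.

```python
def best_batch(fixed_us, per_item_us, slo_us, queue_wait_us=0, max_batch=32):
    """Return the largest batch size whose estimated completion time
    (queueing + fixed kernel overhead + per-item cost * batch) fits the SLO.
    Falls back to batch size 1 if even that misses the objective."""
    best = 1
    for b in range(1, max_batch + 1):
        if queue_wait_us + fixed_us + per_item_us * b <= slo_us:
            best = b
    return best
```

Larger batches raise GPU utilization and throughput but add latency, so an SLO-aware controller like this trades one off against the other as traffic intensity and measured GPU processing costs change.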

Supplementary Material

MP4 File (p492-dhakal-presentation.mp4)


Cited By

  • (2025) Joint Configuration Optimization and GPU Allocation for Multi-Tenant Real-Time Video Analytics on Resource-Constrained Edge. IEEE Transactions on Mobile Computing 24(2), 794-811. DOI: 10.1109/TMC.2024.3465434 (Feb 2025)
  • (2024) MuxServe. Proceedings of the 41st International Conference on Machine Learning, 11905-11917. DOI: 10.5555/3692070.3692543 (21 Jul 2024)
  • (2024) Conspirator. Proceedings of the 2024 USENIX Annual Technical Conference, 767-784. DOI: 10.5555/3691992.3692039 (10 Jul 2024)
  • (2024) Corun: Concurrent Inference and Continuous Training at the Edge for Cost-Efficient AI-Based Mobile Image Sensing. Sensors 24(16), 5262. DOI: 10.3390/s24165262 (14 Aug 2024)
  • (2024) InferCool: Enhancing AI Inference Cooling through Transparent, Non-Intrusive Task Reassignment. Proceedings of the 2024 ACM Symposium on Cloud Computing, 487-504. DOI: 10.1145/3698038.3698556 (20 Nov 2024)
  • (2024) KACE: Kernel-Aware Colocation for Efficient GPU Spatial Sharing. Proceedings of the 2024 ACM Symposium on Cloud Computing, 460-469. DOI: 10.1145/3698038.3698555 (20 Nov 2024)
  • (2024) On-demand and Parallel Checkpoint/Restore for GPU Applications. Proceedings of the 2024 ACM Symposium on Cloud Computing, 415-433. DOI: 10.1145/3698038.3698510 (20 Nov 2024)
  • (2024) Optimizing GPU Sharing for Container-Based DNN Serving with Multi-Instance GPUs. Proceedings of the 17th ACM International Systems and Storage Conference, 68-82. DOI: 10.1145/3688351.3689156 (16 Sep 2024)
  • (2024) Improving Performance on Replica-Exchange Molecular Dynamics Simulations by Optimizing GPU Core Utilization. Proceedings of the 53rd International Conference on Parallel Processing, 1082-1091. DOI: 10.1145/3673038.3673097 (12 Aug 2024)
  • (2024) DeInfer: A GPU resource allocation algorithm with spatial sharing for near-deterministic inferring tasks. Proceedings of the 53rd International Conference on Parallel Processing, 701-711. DOI: 10.1145/3673038.3673091 (12 Aug 2024)


      Published In

      SoCC '20: Proceedings of the 11th ACM Symposium on Cloud Computing
      October 2020, 535 pages
      ISBN: 9781450381376
      DOI: 10.1145/3419111
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Funding Sources

      • US NSF

      Conference

      SoCC '20: ACM Symposium on Cloud Computing
      October 19-21, 2020
      Virtual Event, USA

      Acceptance Rates

      SoCC '20 paper acceptance rate: 35 of 143 submissions (24%)
      Overall acceptance rate: 169 of 722 submissions (23%)

      Article Metrics

      • Downloads (last 12 months): 950
      • Downloads (last 6 weeks): 132
      Reflects downloads up to 12 Jan 2025

