DOI: 10.1145/3419111.3421284
Research Article | Open Access

GSLICE: controlled spatial sharing of GPUs for a scalable inference platform

Published: 12 October 2020

Abstract

The increasing demand for cloud-based inference services requires the use of Graphics Processing Units (GPUs). It is highly desirable to utilize GPUs efficiently by multiplexing different inference tasks on a GPU. Batched processing, CUDA streams, and the Multi-Process Service (MPS) help, but we find that these mechanisms are not adequate for achieving scalability through efficient GPU utilization, and they do not guarantee predictable performance.
GSLICE addresses these challenges with a dynamic GPU resource allocation and management framework that maximizes performance and resource utilization. We virtualize the GPU by apportioning its resources across different Inference Functions (IFs), thus providing isolation and guaranteeing performance. We develop self-learning, adaptive GPU resource allocation and batching schemes that account for network traffic characteristics while keeping inference latencies below service-level objectives (SLOs). GSLICE adapts quickly to the workload intensity of streaming data and to the variability of GPU processing costs. GSLICE scales IF processing on the GPU through efficient, controlled spatial multiplexing, coupled with a GPU resource re-allocation scheme with near-zero (< 100 μs) downtime. Compared to default MPS and TensorRT, GSLICE improves GPU utilization efficiency by 60--800% and achieves a 2--13X improvement in aggregate throughput.
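To make the "apportioning GPU resources across IFs" idea concrete: under NVIDIA MPS, a process's share of the GPU's streaming multiprocessors can be capped via the `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` environment variable. The sketch below is illustrative only — the proportional-to-demand policy and the IF names are assumptions, not GSLICE's actual adaptive allocator.

```python
import os

def apportion(demands):
    """Split 100% of the GPU's SM capacity across IFs in proportion to
    their demand. Illustrative static policy; GSLICE adapts shares online."""
    total = sum(demands.values())
    return {name: max(1, round(100 * d / total)) for name, d in demands.items()}

def mps_env(share_pct):
    """Build an environment for an IF worker process so that, when run
    under an MPS control daemon, it is capped to share_pct% of GPU threads."""
    env = dict(os.environ)
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(share_pct)
    return env

# Example: two hypothetical IFs with a 3:1 demand ratio.
shares = apportion({"resnet": 3, "mobilenet": 1})
```

Each IF worker would then be launched with `mps_env(shares[name])`, giving every IF an isolated slice of the GPU rather than letting MPS clients contend freely.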
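The abstract's batching-under-SLO idea can be sketched with a simple affine cost model (latency = fixed overhead + per-item cost × batch size): pick the largest batch that still meets the latency objective. This is a minimal sketch under an assumed cost model, not GSLICE's self-learning batching scheme.

```python
def best_batch(fixed_us, per_item_us, slo_us, queue_wait_us=0, max_batch=32):
    """Return the largest batch size whose estimated completion time
    (queueing + fixed kernel overhead + per-item cost * batch) fits the SLO.
    Falls back to batch size 1 if even that misses the objective."""
    best = 1
    for b in range(1, max_batch + 1):
        if queue_wait_us + fixed_us + per_item_us * b <= slo_us:
            best = b
    return best
```

Larger batches raise GPU utilization and throughput but add latency, so an SLO-aware controller like this trades one off against the other as traffic intensity and measured GPU processing costs change.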

Supplementary Material

MP4 File (p492-dhakal-presentation.mp4)


Cited By

  • (2025) Joint Configuration Optimization and GPU Allocation for Multi-Tenant Real-Time Video Analytics on Resource-Constrained Edge. IEEE Transactions on Mobile Computing 24(2), 794-811. DOI: 10.1109/TMC.2024.3465434 (Feb 2025)
  • (2024) MuxServe. Proceedings of the 41st International Conference on Machine Learning, 11905-11917. DOI: 10.5555/3692070.3692543 (21 Jul 2024)
  • (2024) Conspirator. Proceedings of the 2024 USENIX Annual Technical Conference, 767-784. DOI: 10.5555/3691992.3692039 (10 Jul 2024)
  • (2024) Corun: Concurrent Inference and Continuous Training at the Edge for Cost-Efficient AI-Based Mobile Image Sensing. Sensors 24(16), 5262. DOI: 10.3390/s24165262 (14 Aug 2024)
  • (2024) InferCool: Enhancing AI Inference Cooling through Transparent, Non-Intrusive Task Reassignment. Proceedings of the 2024 ACM Symposium on Cloud Computing, 487-504. DOI: 10.1145/3698038.3698556 (20 Nov 2024)
  • (2024) KACE: Kernel-Aware Colocation for Efficient GPU Spatial Sharing. Proceedings of the 2024 ACM Symposium on Cloud Computing, 460-469. DOI: 10.1145/3698038.3698555 (20 Nov 2024)
  • (2024) On-demand and Parallel Checkpoint/Restore for GPU Applications. Proceedings of the 2024 ACM Symposium on Cloud Computing, 415-433. DOI: 10.1145/3698038.3698510 (20 Nov 2024)
  • (2024) Optimizing GPU Sharing for Container-Based DNN Serving with Multi-Instance GPUs. Proceedings of the 17th ACM International Systems and Storage Conference, 68-82. DOI: 10.1145/3688351.3689156 (16 Sep 2024)
  • (2024) Improving Performance on Replica-Exchange Molecular Dynamics Simulations by Optimizing GPU Core Utilization. Proceedings of the 53rd International Conference on Parallel Processing, 1082-1091. DOI: 10.1145/3673038.3673097 (12 Aug 2024)
  • (2024) DeInfer: A GPU resource allocation algorithm with spatial sharing for near-deterministic inferring tasks. Proceedings of the 53rd International Conference on Parallel Processing, 701-711. DOI: 10.1145/3673038.3673091 (12 Aug 2024)


      Published In

      SoCC '20: Proceedings of the 11th ACM Symposium on Cloud Computing
      October 2020, 535 pages
      ISBN: 9781450381376
      DOI: 10.1145/3419111
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Funding Sources

      • US NSF

      Conference

      SoCC '20: ACM Symposium on Cloud Computing
      October 19-21, 2020
      Virtual Event, USA

      Acceptance Rates

      SoCC '20 paper acceptance rate: 35 of 143 submissions (24%)
      Overall acceptance rate: 169 of 722 submissions (23%)

      Article Metrics

      • Downloads (last 12 months): 950
      • Downloads (last 6 weeks): 132
      Reflects downloads up to 12 Jan 2025

