
Baymax: QoS Awareness and Increased Utilization for Non-Preemptive Accelerators in Warehouse Scale Computers

Published: 25 March 2016

Abstract

Modern warehouse-scale computers (WSCs) are being outfitted with accelerators to provide the significant compute required by emerging intelligent personal assistant (IPA) workloads such as voice recognition, image classification, and natural language processing. It is well known that the diurnal access pattern of user-facing services provides a strong incentive to co-locate applications for better accelerator utilization and efficiency, and prior work has focused on enabling co-location on multicore processors. However, interference when co-locating applications on non-preemptive accelerators is fundamentally different from contention on multicore CPUs and introduces a new set of challenges for reducing QoS violations. To address this open problem, we first identify the underlying causes of QoS violations in accelerator-outfitted servers. Our experiments show that queuing delay for the compute resources and PCI-e bandwidth contention for data transfer are the two main factors that contribute to the long tails of user-facing applications. We then present Baymax, a runtime system that orchestrates the execution of compute tasks from different applications and mitigates PCI-e bandwidth contention to deliver the required QoS for user-facing applications and increase accelerator utilization. Using DjiNN, a deep neural network service, Sirius, an end-to-end IPA workload, and traditional applications on an Nvidia K40 GPU, our evaluation shows that Baymax improves accelerator utilization by 91.3% while achieving the desired 99%-ile latency target for user-facing applications. In fact, Baymax reduces the 99%-ile latency of user-facing applications by up to 195x over default execution.
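The queuing-delay problem the abstract describes can be illustrated with a toy scheduler sketch. This is not Baymax's actual algorithm, just a minimal illustration of why non-preemptive execution inflates tail latency and how duration-aware ordering helps; the task names, fields (`predicted_ms`, `deadline_ms`), and admission rule are hypothetical.

```python
# Illustrative sketch (not the paper's implementation): on a non-preemptive
# accelerator, a kernel that has started runs to completion, so a
# latency-critical task queued behind a long batch kernel pays the full
# queuing delay. A scheduler with predicted durations can defer batch
# kernels that would push a user-facing task past its deadline.
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    name: str
    predicted_ms: float              # predicted kernel duration
    user_facing: bool = False
    deadline_ms: float = float("inf")


def fifo_latency(queue: List[Task], target: str) -> float:
    """Completion time of `target` under plain FIFO, non-preemptive execution."""
    elapsed = 0.0
    for task in queue:
        elapsed += task.predicted_ms  # each task runs to completion once started
        if task.name == target:
            return elapsed
    raise KeyError(target)


def qos_aware_order(queue: List[Task]) -> List[Task]:
    """Reorder so a batch task is admitted only if its predicted duration
    leaves enough slack for the next user-facing task's deadline."""
    user = [t for t in queue if t.user_facing]
    batch = [t for t in queue if not t.user_facing]
    ordered: List[Task] = []
    elapsed = 0.0
    for u in user:
        # admit pending batch work only while the deadline still holds
        while batch and elapsed + batch[0].predicted_ms + u.predicted_ms <= u.deadline_ms:
            b = batch.pop(0)
            ordered.append(b)
            elapsed += b.predicted_ms
        ordered.append(u)
        elapsed += u.predicted_ms
    return ordered + batch           # remaining batch work runs afterwards


queue = [
    Task("batch_dnn_training", predicted_ms=500.0),
    Task("voice_query", predicted_ms=20.0, user_facing=True, deadline_ms=100.0),
]

print(fifo_latency(queue, "voice_query"))                   # 520.0 ms: deadline missed
print(fifo_latency(qos_aware_order(queue), "voice_query"))  # 20.0 ms: batch deferred
```

Under FIFO the 20 ms voice query waits out the entire 500 ms batch kernel, a 26x inflation, which mirrors the abstract's observation that queuing delay, not the task's own runtime, dominates the tail on non-preemptive accelerators.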




Published In

ACM SIGPLAN Notices, Volume 51, Issue 4 (ASPLOS '16)
April 2016, 774 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2954679
Editor: Andy Gill

ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems
March 2016, 824 pages
ISBN: 9781450340915
DOI: 10.1145/2872362
General Chair: Tom Conte; Program Chair: Yuanyuan Zhou
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published in SIGPLAN Volume 51, Issue 4

Author Tags

  1. non-preemptive accelerators
  2. quality of service
  3. scheduling
  4. warehouse scale computers

Qualifiers

  • Research-article

Article Metrics

  • Downloads (last 12 months): 585
  • Downloads (last 6 weeks): 92
Reflects downloads up to 08 Feb 2025

Cited By
  • (2025) Adaptive Kernel Fusion for Improving the GPU Utilization While Ensuring QoS. IEEE Transactions on Computers, 74(2):386-400. DOI: 10.1109/TC.2024.3477995. Feb 2025.
  • (2024) INS: Identifying and Mitigating Performance Interference in Clouds via Interference-Sensitive Paths. Proceedings of the 2024 ACM Symposium on Cloud Computing, pages 380-397. DOI: 10.1145/3698038.3698508. 20 Nov 2024.
  • (2024) MIGER: Integrating Multi-Instance GPU and Multi-Process Service for Deep Learning Clusters. Proceedings of the 53rd International Conference on Parallel Processing, pages 504-513. DOI: 10.1145/3673038.3673089. 12 Aug 2024.
  • (2024) BCEdge: SLO-Aware DNN Inference Services With Adaptive Batch-Concurrent Scheduling on Edge Devices. IEEE Transactions on Network and Service Management, 21(4):4131-4145. DOI: 10.1109/TNSM.2024.3409701. 5 Jun 2024.
  • (2024) D-STACK: High Throughput DNN Inference by Effective Multiplexing and Spatio-Temporal Scheduling of GPUs. IEEE Transactions on Cloud Computing, 12(4):1344-1358. DOI: 10.1109/TCC.2024.3476210. Oct 2024.
  • (2024) Multi-Objective Concurrent Kernel Scheduling for Multi-GPU Systems. 2024 32nd International Conference on Electrical Engineering (ICEE), pages 1-6. DOI: 10.1109/ICEE63041.2024.10667973. 14 May 2024.
  • (2024) MediatorDNN: Contention Mitigation for Co-Located DNN Inference Jobs. 2024 IEEE 17th International Conference on Cloud Computing (CLOUD), pages 502-512. DOI: 10.1109/CLOUD62652.2024.00063. 7 Jul 2024.
  • (2024) FairCIM: Fair Interference Mitigation by DNN Switching for Latency-Sensitive Inference Jobs. 2024 IEEE International Conference on Autonomic Computing and Self-Organizing Systems (ACSOS), pages 71-80. DOI: 10.1109/ACSOS61780.2024.00025. 16 Sep 2024.
  • (2023) Maximizing the Utilization of GPUs Used by Cloud Gaming through Adaptive Co-location with Combo. Proceedings of the 2023 ACM Symposium on Cloud Computing, pages 265-280. DOI: 10.1145/3620678.3624660. 30 Oct 2023.
  • (2023) iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud. IEEE Transactions on Parallel and Distributed Systems, 34(3):812-827. DOI: 10.1109/TPDS.2022.3232715. 1 Mar 2023.
