
Baymax: QoS Awareness and Increased Utilization for Non-Preemptive Accelerators in Warehouse Scale Computers

Published: 25 March 2016

Abstract

Modern warehouse-scale computers (WSCs) are being outfitted with accelerators to provide the significant compute required by emerging intelligent personal assistant (IPA) workloads such as voice recognition, image classification, and natural language processing. It is well known that the diurnal access pattern of user-facing services provides a strong incentive to co-locate applications for better accelerator utilization and efficiency, and prior work has focused on enabling co-location on multicore processors. However, interference when co-locating applications on non-preemptive accelerators is fundamentally different from contention on multicore CPUs and introduces a new set of challenges for reducing QoS violations. To address this open problem, we first identify the underlying causes of QoS violations in accelerator-outfitted servers. Our experiments show that queuing delay for the compute resources and PCI-e bandwidth contention for data transfer are the two main factors that contribute to the long tails of user-facing applications. We then present Baymax, a runtime system that orchestrates the execution of compute tasks from different applications and mitigates PCI-e bandwidth contention to deliver the required QoS for user-facing applications and increase accelerator utilization. Using DjiNN, a deep neural network service, Sirius, an end-to-end IPA workload, and traditional applications on an Nvidia K40 GPU, our evaluation shows that Baymax improves accelerator utilization by 91.3% while achieving the desired 99%-ile latency target for user-facing applications. In fact, Baymax reduces the 99%-ile latency of user-facing applications by up to 195x over default execution.
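The queuing-delay problem the abstract describes can be illustrated with a toy scheduler sketch. This is not Baymax's actual algorithm, just a minimal illustration of why non-preemptive execution inflates tail latency and how duration-aware ordering helps; the task names, fields (`predicted_ms`, `deadline_ms`), and admission rule are hypothetical.

```python
# Illustrative sketch (not the paper's implementation): on a non-preemptive
# accelerator, a kernel that has started runs to completion, so a
# latency-critical task queued behind a long batch kernel pays the full
# queuing delay. A scheduler with predicted durations can defer batch
# kernels that would push a user-facing task past its deadline.
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    name: str
    predicted_ms: float              # predicted kernel duration
    user_facing: bool = False
    deadline_ms: float = float("inf")


def fifo_latency(queue: List[Task], target: str) -> float:
    """Completion time of `target` under plain FIFO, non-preemptive execution."""
    elapsed = 0.0
    for task in queue:
        elapsed += task.predicted_ms  # each task runs to completion once started
        if task.name == target:
            return elapsed
    raise KeyError(target)


def qos_aware_order(queue: List[Task]) -> List[Task]:
    """Reorder so a batch task is admitted only if its predicted duration
    leaves enough slack for the next user-facing task's deadline."""
    user = [t for t in queue if t.user_facing]
    batch = [t for t in queue if not t.user_facing]
    ordered: List[Task] = []
    elapsed = 0.0
    for u in user:
        # admit pending batch work only while the deadline still holds
        while batch and elapsed + batch[0].predicted_ms + u.predicted_ms <= u.deadline_ms:
            b = batch.pop(0)
            ordered.append(b)
            elapsed += b.predicted_ms
        ordered.append(u)
        elapsed += u.predicted_ms
    return ordered + batch           # remaining batch work runs afterwards


queue = [
    Task("batch_dnn_training", predicted_ms=500.0),
    Task("voice_query", predicted_ms=20.0, user_facing=True, deadline_ms=100.0),
]

print(fifo_latency(queue, "voice_query"))                   # 520.0 ms: deadline missed
print(fifo_latency(qos_aware_order(queue), "voice_query"))  # 20.0 ms: batch deferred
```

Under FIFO the 20 ms voice query waits out the entire 500 ms batch kernel, a 26x inflation, which mirrors the abstract's observation that queuing delay, not the task's own runtime, dominates the tail on non-preemptive accelerators.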




Published In

ACM SIGPLAN Notices, Volume 51, Issue 4 (ASPLOS '16)
April 2016, 774 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2954679
Editor: Andy Gill

ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems
March 2016, 824 pages
ISBN: 9781450340915
DOI: 10.1145/2872362
General Chair: Tom Conte; Program Chair: Yuanyuan Zhou
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published in SIGPLAN Volume 51, Issue 4

Author Tags

  1. non-preemptive accelerators
  2. quality of service
  3. scheduling
  4. warehouse scale computers

Qualifiers

  • Research-article

Article Metrics

  • Downloads (last 12 months): 585
  • Downloads (last 6 weeks): 92
Reflects downloads up to 08 Feb 2025

Cited By
  • (2025) Adaptive Kernel Fusion for Improving the GPU Utilization While Ensuring QoS. IEEE Transactions on Computers, 74(2):386-400. DOI: 10.1109/TC.2024.3477995. Feb 2025.
  • (2024) INS: Identifying and Mitigating Performance Interference in Clouds via Interference-Sensitive Paths. Proceedings of the 2024 ACM Symposium on Cloud Computing, pages 380-397. DOI: 10.1145/3698038.3698508. 20 Nov 2024.
  • (2024) MIGER: Integrating Multi-Instance GPU and Multi-Process Service for Deep Learning Clusters. Proceedings of the 53rd International Conference on Parallel Processing, pages 504-513. DOI: 10.1145/3673038.3673089. 12 Aug 2024.
  • (2024) BCEdge: SLO-Aware DNN Inference Services With Adaptive Batch-Concurrent Scheduling on Edge Devices. IEEE Transactions on Network and Service Management, 21(4):4131-4145. DOI: 10.1109/TNSM.2024.3409701. 5 Jun 2024.
  • (2024) D-STACK: High Throughput DNN Inference by Effective Multiplexing and Spatio-Temporal Scheduling of GPUs. IEEE Transactions on Cloud Computing, 12(4):1344-1358. DOI: 10.1109/TCC.2024.3476210. Oct 2024.
  • (2024) Multi-Objective Concurrent Kernel Scheduling for Multi-GPU Systems. 2024 32nd International Conference on Electrical Engineering (ICEE), pages 1-6. DOI: 10.1109/ICEE63041.2024.10667973. 14 May 2024.
  • (2024) MediatorDNN: Contention Mitigation for Co-Located DNN Inference Jobs. 2024 IEEE 17th International Conference on Cloud Computing (CLOUD), pages 502-512. DOI: 10.1109/CLOUD62652.2024.00063. 7 Jul 2024.
  • (2024) FairCIM: Fair Interference Mitigation by DNN Switching for Latency-Sensitive Inference Jobs. 2024 IEEE International Conference on Autonomic Computing and Self-Organizing Systems (ACSOS), pages 71-80. DOI: 10.1109/ACSOS61780.2024.00025. 16 Sep 2024.
  • (2023) Maximizing the Utilization of GPUs Used by Cloud Gaming through Adaptive Co-location with Combo. Proceedings of the 2023 ACM Symposium on Cloud Computing, pages 265-280. DOI: 10.1145/3620678.3624660. 30 Oct 2023.
  • (2023) iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud. IEEE Transactions on Parallel and Distributed Systems, 34(3):812-827. DOI: 10.1109/TPDS.2022.3232715. 1 Mar 2023.
