
Prophet: Precise QoS Prediction on Non-Preemptive Accelerators to Improve Utilization in Warehouse-Scale Computers

Published: 04 April 2017

Abstract

Guaranteeing Quality-of-Service (QoS) of latency-sensitive applications while improving server utilization through application co-location is important yet challenging in modern datacenters. The key challenge is that when applications are co-located on a server, performance interference due to resource contention can be detrimental to application QoS. Although prior work has proposed techniques to identify "safe" co-locations, where application QoS is satisfied, by predicting performance interference on multicores, no such prediction technique exists for accelerators such as GPUs.
In this work, we present Prophet, an approach to precisely predict the performance degradation of latency-sensitive applications on accelerators due to application co-location. We analyzed performance interference on accelerators through a real-system investigation and found that, unlike on multicores where the key contentious resources are shared caches and main memory bandwidth, the key contentious resources on accelerators are instead processing elements, accelerator memory bandwidth, and PCIe bandwidth. Based on this observation, we designed interference models that enable precise prediction of processing element, accelerator memory bandwidth, and PCIe bandwidth contention on real hardware. By using a novel technique to forecast solo-run execution traces of the co-located applications with these interference models, Prophet can accurately predict the performance degradation of latency-sensitive applications on non-preemptive accelerators. Using Prophet, we can identify "safe" co-locations on accelerators to improve utilization without violating the QoS target. Our evaluation shows that Prophet predicts performance degradation with an average prediction error of 5.47% on real systems. Meanwhile, based on this prediction, Prophet achieves accelerator utilization improvements of 49.9% on average while maintaining the QoS target of latency-sensitive applications.
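To make the abstract's idea concrete, the sketch below shows the general shape of a QoS-driven admission check of the kind Prophet performs: per-resource contention estimates (processing elements, accelerator memory bandwidth, PCIe bandwidth) are combined into a predicted slowdown, and a co-location is admitted only if the latency-sensitive application's predicted latency stays within its QoS target. This is an illustrative sketch, not the paper's implementation; the function names, the multiplicative model form, and all numbers are hypothetical placeholders for the paper's calibrated interference models.

```python
# Hypothetical sketch of a Prophet-style "safe co-location" check.
# The per-resource contention inputs would come from calibrated
# interference models built on real-hardware measurements.

def predict_slowdown(pe_contention, mem_bw_contention, pcie_contention):
    """Combine per-resource contention estimates (each in [0, 1], where
    0 means no contention) into a single predicted slowdown factor.
    The multiplicative form here is an illustrative assumption."""
    slowdown = 1.0
    slowdown *= 1.0 + pe_contention        # processing-element contention
    slowdown *= 1.0 + mem_bw_contention    # accelerator memory bandwidth
    slowdown *= 1.0 + pcie_contention      # PCIe transfer bandwidth
    return slowdown

def is_safe_colocation(solo_latency_ms, qos_target_ms, pe, mem_bw, pcie):
    """Admit a batch application onto the accelerator only if the
    latency-sensitive application's predicted latency meets its QoS target."""
    predicted = solo_latency_ms * predict_slowdown(pe, mem_bw, pcie)
    return predicted <= qos_target_ms

# Example: 10 ms solo latency with a 15 ms QoS target.
print(is_safe_colocation(10.0, 15.0, pe=0.10, mem_bw=0.15, pcie=0.05))  # True
print(is_safe_colocation(10.0, 15.0, pe=0.30, mem_bw=0.30, pcie=0.20))  # False
```

The design point this illustrates is why prediction precision matters: an overestimate of slowdown rejects profitable co-locations and leaves the accelerator underutilized, while an underestimate admits co-locations that violate the QoS target.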




        Published In

        ACM SIGARCH Computer Architecture News, Volume 45, Issue 1 (ASPLOS '17), March 2017, 812 pages
        ISSN: 0163-5964
        DOI: 10.1145/3093337
        • Also published in ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, April 2017, 856 pages
          ISBN: 9781450344654
          DOI: 10.1145/3037697

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 04 April 2017
        Published in SIGARCH Volume 45, Issue 1


        Author Tags

        1. non-preemptive accelerators
        2. quality-of-service prediction
        3. warehouse-scale computers

        Qualifiers

        • Research-article


        Cited By

        • (2025) Retrospecting Available CPU Resources: SMT-Aware Scheduling to Prevent SLA Violations in Data Centers. IEEE Transactions on Parallel and Distributed Systems 36(1): 67-83, January 2025. DOI: 10.1109/TPDS.2024.3494879
        • (2024) Towards SLO-Compliant and Cost-Effective Serverless Computing on Emerging GPU Architectures. Proceedings of the 25th International Middleware Conference, 211-224, December 2024. DOI: 10.1145/3652892.3700760
        • (2024) Paldia: Enabling SLO-Compliant and Cost-Effective Serverless Computing on Heterogeneous Hardware. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 100-113, May 2024. DOI: 10.1109/IPDPS57955.2024.00018
        • (2024) Request Deadline Split and Interference-Aware Request Migration in Edge Cloud. Concurrency and Computation: Practice and Experience 37(1), October 2024. DOI: 10.1002/cpe.8315
        • (2023) MoCA: Memory-Centric, Adaptive Execution for Multi-Tenant Deep Neural Networks. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 828-841, February 2023. DOI: 10.1109/HPCA56546.2023.10071035
        • (2022) Toward QoS-Awareness and Improved Utilization of Spatial Multitasking GPUs. IEEE Transactions on Computers 71(4): 866-879, April 2022. DOI: 10.1109/TC.2021.3064352
        • (2021) ParaX. Proceedings of the VLDB Endowment 14(6): 864-877, April 2021. DOI: 10.14778/3447689.3447692
        • (2021) Device Hopping. ACM Transactions on Architecture and Code Optimization 18(4): 1-25, December 2021. DOI: 10.1145/3471909
        • (2020) URSA: Precise Capacity Planning and Fair Scheduling based on Low-level Statistics for Public Clouds. Proceedings of the 49th International Conference on Parallel Processing, 1-11, August 2020. DOI: 10.1145/3404397.3404451
        • (2020) PREMA: A Predictive Multi-Task Scheduling Algorithm For Preemptible Neural Processing Units. 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), 220-233, February 2020. DOI: 10.1109/HPCA47549.2020.00027
