
Prophet: Precise QoS Prediction on Non-Preemptive Accelerators to Improve Utilization in Warehouse-Scale Computers

Published: 04 April 2017

Abstract

Guaranteeing Quality-of-Service (QoS) of latency-sensitive applications while improving server utilization through application co-location is important yet challenging in modern datacenters. The key challenge is that when applications are co-located on a server, performance interference due to resource contention can be detrimental to application QoS. Although prior work has proposed techniques to identify "safe" co-locations, where application QoS is satisfied, by predicting performance interference on multicores, no such prediction technique exists for accelerators such as GPUs.
In this work, we present Prophet, an approach to precisely predict the performance degradation of latency-sensitive applications on accelerators due to application co-location. We analyzed performance interference on accelerators through a real-system investigation and found that, unlike on multicores where the key contentious resources are shared caches and main memory bandwidth, the key contentious resources on accelerators are instead processing elements, accelerator memory bandwidth, and PCIe bandwidth. Based on this observation, we designed interference models that enable precise prediction of processing element, accelerator memory bandwidth, and PCIe bandwidth contention on real hardware. By using a novel technique to forecast solo-run execution traces of the co-located applications with these interference models, Prophet can accurately predict the performance degradation of latency-sensitive applications on non-preemptive accelerators. Using Prophet, we can identify "safe" co-locations on accelerators to improve utilization without violating the QoS target. Our evaluation shows that Prophet predicts performance degradation with an average prediction error of 5.47% on real systems. Meanwhile, based on this prediction, Prophet achieves accelerator utilization improvements of 49.9% on average while maintaining the QoS target of latency-sensitive applications.
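To make the abstract's idea concrete, the sketch below shows the general shape of a QoS-driven admission check of the kind Prophet performs: per-resource contention estimates (processing elements, accelerator memory bandwidth, PCIe bandwidth) are combined into a predicted slowdown, and a co-location is admitted only if the latency-sensitive application's predicted latency stays within its QoS target. This is an illustrative sketch, not the paper's implementation; the function names, the multiplicative model form, and all numbers are hypothetical placeholders for the paper's calibrated interference models.

```python
# Hypothetical sketch of a Prophet-style "safe co-location" check.
# The per-resource contention inputs would come from calibrated
# interference models built on real-hardware measurements.

def predict_slowdown(pe_contention, mem_bw_contention, pcie_contention):
    """Combine per-resource contention estimates (each in [0, 1], where
    0 means no contention) into a single predicted slowdown factor.
    The multiplicative form here is an illustrative assumption."""
    slowdown = 1.0
    slowdown *= 1.0 + pe_contention        # processing-element contention
    slowdown *= 1.0 + mem_bw_contention    # accelerator memory bandwidth
    slowdown *= 1.0 + pcie_contention      # PCIe transfer bandwidth
    return slowdown

def is_safe_colocation(solo_latency_ms, qos_target_ms, pe, mem_bw, pcie):
    """Admit a batch application onto the accelerator only if the
    latency-sensitive application's predicted latency meets its QoS target."""
    predicted = solo_latency_ms * predict_slowdown(pe, mem_bw, pcie)
    return predicted <= qos_target_ms

# Example: 10 ms solo latency with a 15 ms QoS target.
print(is_safe_colocation(10.0, 15.0, pe=0.10, mem_bw=0.15, pcie=0.05))  # True
print(is_safe_colocation(10.0, 15.0, pe=0.30, mem_bw=0.30, pcie=0.20))  # False
```

The design point this illustrates is why prediction precision matters: an overestimate of slowdown rejects profitable co-locations and leaves the accelerator underutilized, while an underestimate admits co-locations that violate the QoS target.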




        Published In

        ACM SIGARCH Computer Architecture News, Volume 45, Issue 1 (ASPLOS '17), March 2017, 812 pages
        ISSN: 0163-5964
        DOI: 10.1145/3093337
        • Also published in ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, April 2017, 856 pages
          ISBN: 9781450344654
          DOI: 10.1145/3037697

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 04 April 2017
        Published in SIGARCH Volume 45, Issue 1


        Author Tags

        1. non-preemptive accelerators
        2. quality-of-service prediction
        3. warehouse-scale computers

        Qualifiers

        • Research-article


        Cited By

        • (2025) Retrospecting Available CPU Resources: SMT-Aware Scheduling to Prevent SLA Violations in Data Centers. IEEE Transactions on Parallel and Distributed Systems 36(1): 67-83, January 2025. DOI: 10.1109/TPDS.2024.3494879
        • (2024) Towards SLO-Compliant and Cost-Effective Serverless Computing on Emerging GPU Architectures. Proceedings of the 25th International Middleware Conference, 211-224, December 2024. DOI: 10.1145/3652892.3700760
        • (2024) Paldia: Enabling SLO-Compliant and Cost-Effective Serverless Computing on Heterogeneous Hardware. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 100-113, May 2024. DOI: 10.1109/IPDPS57955.2024.00018
        • (2024) Request Deadline Split and Interference-Aware Request Migration in Edge Cloud. Concurrency and Computation: Practice and Experience 37(1), October 2024. DOI: 10.1002/cpe.8315
        • (2023) MoCA: Memory-Centric, Adaptive Execution for Multi-Tenant Deep Neural Networks. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 828-841, February 2023. DOI: 10.1109/HPCA56546.2023.10071035
        • (2022) Toward QoS-Awareness and Improved Utilization of Spatial Multitasking GPUs. IEEE Transactions on Computers 71(4): 866-879, April 2022. DOI: 10.1109/TC.2021.3064352
        • (2021) ParaX. Proceedings of the VLDB Endowment 14(6): 864-877, April 2021. DOI: 10.14778/3447689.3447692
        • (2021) Device Hopping. ACM Transactions on Architecture and Code Optimization 18(4): 1-25, December 2021. DOI: 10.1145/3471909
        • (2020) URSA: Precise Capacity Planning and Fair Scheduling based on Low-level Statistics for Public Clouds. Proceedings of the 49th International Conference on Parallel Processing, 1-11, August 2020. DOI: 10.1145/3404397.3404451
        • (2020) PREMA: A Predictive Multi-Task Scheduling Algorithm For Preemptible Neural Processing Units. 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), 220-233, February 2020. DOI: 10.1109/HPCA47549.2020.00027
