research-article
Open access

A Model-Based Software Solution for Simultaneous Multiple Kernels on GPUs

Published: 04 March 2020

Editorial Notes

    A corrigendum was issued for this article on June 12, 2020. You can download the corrigendum from the supplemental material section of this citation page.

    Abstract

    A GPU, a critical computing resource in multiuser systems such as supercomputers, data centers, and cloud services, contains multiple compute units (CUs). GPU multitasking is an intuitive solution to underutilization in GPGPU computing. Recently proposed GPU multitasking solutions fall into two categories: (1) spatially partitioned sharing (SPS), which co-executes different kernels on disjoint sets of CUs, and (2) simultaneous multikernel (SMK), which runs multiple kernels simultaneously within a CU. Compared to SPS, SMK can improve resource utilization even further by interleaving instructions from kernels with low dynamic resource contention.
    However, SMK is hard to implement on current GPU architectures, because (1) techniques for applying SMK on top of the GPU hardware scheduling policy are scarce and (2) finding an efficient SMK scheme is difficult due to the complex interference among concurrently executing kernels. In this article, we propose a lightweight and effective performance model to evaluate the complex interference of SMK. Based on the probability of independent events, our performance model is built from a new angle and contains few parameters. We then propose a metric, the symbiotic factor, which evaluates an SMK scheme so that kernels with complementary resource utilization can co-run within a CU. We also analyze the advantages and disadvantages of the kernel slicing and kernel stretching techniques and integrate them to realize SMK on real GPUs rather than on simulators. We validate our model on 18 benchmarks. Compared to hardware-based concurrent kernel execution with the optimized kernel launch order (the order yielding the fastest execution time), co-running kernel pairs achieve average speedups of 11%, 18%, and 12% on the AMD R9 290X, RX 480, and Vega 64, respectively. Compared to the Warped-Slicer, the average speedups are 29%, 18%, and 51% on the same three GPUs.
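    The abstract does not define the symbiotic factor; as a hedged illustration of the general idea it names (pairing kernels whose per-CU resource demands are complementary so they can co-reside within one compute unit), one might score candidate pairs as below. The resource names, limits, and scoring rule are illustrative assumptions, not the paper's actual metric:

    ```python
    # Hypothetical sketch: prefer co-running kernels whose combined per-CU
    # footprints both fit and leave the least-contended resource mix.
    # CU_LIMITS values are assumptions, not real GCN hardware limits.

    CU_LIMITS = {"vgprs": 65536, "lds_bytes": 65536, "waves": 40}

    def fits(a, b):
        """Both kernels' per-CU footprints must fit on one CU together."""
        return all(a[r] + b[r] <= CU_LIMITS[r] for r in CU_LIMITS)

    def pairing_score(a, b):
        """Peak combined utilization across resources; lower is better,
        i.e., the pair's demands are more complementary."""
        return max((a[r] + b[r]) / CU_LIMITS[r] for r in CU_LIMITS)

    compute_bound = {"vgprs": 40000, "lds_bytes": 8192, "waves": 16}
    memory_bound  = {"vgprs": 16000, "lds_bytes": 4096, "waves": 20}

    if fits(compute_bound, memory_bound):
        print(pairing_score(compute_bound, memory_bound))  # prints 0.9
    ```

    A scheduler following this idea would enumerate candidate kernel pairs and co-schedule the pair with the lowest score, rather than pairing two kernels that saturate the same resource.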
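    Kernel slicing, one of the two techniques the abstract integrates, generally means splitting a kernel's launch grid into smaller slices so that slices from different kernels can be resident on the GPU at once. A minimal sketch of the bookkeeping, assuming a simple round-robin interleaving (the slice sizes, names, and schedule policy are assumptions for illustration; the actual enqueue mechanism, e.g. launching each slice with a global work offset, is not shown):

    ```python
    # Hedged sketch of kernel slicing: partition each kernel's work-groups
    # into (offset, count) slices, then interleave two kernels' slices.

    def slices(total_wgs, slice_wgs):
        """Yield (offset, count) pairs covering total_wgs work-groups."""
        for off in range(0, total_wgs, slice_wgs):
            yield off, min(slice_wgs, total_wgs - off)

    def interleave(a, b):
        """Round-robin two slice streams, draining the longer one."""
        a, b = list(a), list(b)
        out = []
        for i in range(max(len(a), len(b))):
            if i < len(a):
                out.append(("K1",) + a[i])
            if i < len(b):
                out.append(("K2",) + b[i])
        return out

    # A 1024-work-group kernel sliced by 256, co-run with a 300-work-group
    # kernel sliced by 128 (note the final partial slice of 44).
    schedule = interleave(slices(1024, 256), slices(300, 128))
    ```

    The trade-off the paper analyzes follows from this picture: finer slices give more interleaving opportunities but add launch overhead, which is presumably where kernel stretching complements slicing.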

    Supplementary Material

    a7-wu-corrigendum (a7-wu-corrigendum.pdf)
    Corrigendum to "A Model-Based Software Solution for Simultaneous Multiple Kernels on GPUs" by Wu et al., ACM Transactions on Architecture and Code Optimization, Volume 17, Issue 1 (TACO 17:1).

    References

    [1]
    P. Aguilera, K. Morrow, and N. S. Kim. 2014. Fair share: Allocation of GPU resources for both performance and fairness. In 2014 IEEE 32nd International Conference on Computer Design (ICCD’14). 440--447.
    [2]
    AMD. [n.d.]. CodeXL. http://gpuopen.com/compute-product/codexl/.
    [3]
    Joshua A. Anderson, Chris D. Lorenz, and A. Travesset. 2008. General purpose molecular dynamics simulations fully implemented on graphics processing units. J. Comput. Phys. 227, 10 (2008), 5342--5359.
    [4]
    Sara S. Baghsorkhi, Matthieu Delahaye, Sanjay J. Patel, William D. Gropp, and Wen-mei W. Hwu. 2010. An adaptive performance modeling tool for GPU architectures. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 105--114.
    [5]
    S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. Lee, and K. Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In 2009 IEEE International Symposium on Workload Characterization (IISWC’09). 44--54.
    [6]
    H. Dai, Z. Lin, C. Li, C. Zhao, F. Wang, N. Zheng, and H. Zhou. 2018. Accelerate GPU concurrent kernel execution by mitigating memory pipeline stalls. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA’18). 208--220.
    [7]
    S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos. 2012. Auto-tuning a high-level language targeted to GPU codes. In 2012 Innovative Parallel Computing (InPar’12). 1--10.
    [8]
    Khronos OpenCL Working Group et al. 2008. The OpenCL specification. Version 1, 29 (2008), 8.
    [9]
    Sunpyo Hong and Hyesoon Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. SIGARCH Comput. Archit. News 37, 3 (June 2009), 152--163.
    [10]
    Q. Hu, J. Shu, J. Fan, and Y. Lu. 2016. Run-time performance estimation and fairness-oriented scheduling policy for concurrent GPGPU applications. In 2016 45th International Conference on Parallel Processing (ICPP’16). 57--66.
    [11]
    Andrew Kerr, Gregory Diamos, and Sudhakar Yalamanchili. 2010. Modeling GPU-CPU workloads and systems. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units. ACM, 31--42.
    [12]
    Teng Li, Vikram K. Narayana, and Tarek El-Ghazawi. 2014. Symbiotic scheduling of concurrent GPU kernels for performance and energy optimizations. In Proceedings of the 11th ACM Conference on Computing Frontiers. ACM, Article 36, 10 pages.
    [13]
    Y. Liang, H. P. Huynh, K. Rupnow, R. S. M. Goh, and D. Chen. 2015. Efficient GPU spatial-temporal multitasking. IEEE Trans. Parallel Distrib. Syst. 26, 3 (March 2015), 748--760.
    [14]
    Zhen Lin, Hongwen Dai, Michael Mantor, and Huiyang Zhou. 2019. Coordinated CTA combination and bandwidth partitioning for GPU concurrent kernel execution. ACM Trans. Archit. Code Optim. 16, 3, Article 23 (June 2019), 27 pages.
    [15]
    Zhen Lin, Michael Mantor, and Huiyang Zhou. 2018. GPU performance vs. thread-level parallelism: Scalability analysis and a novel way to improve TLP. ACM Trans. Archit. Code Optim. 15, 1, Article 15 (March 2018), 21 pages.
    [16]
    Mike Mantor. 2012. AMD Radeon™ HD 7970 with graphics core next (GCN) architecture. In 2012 IEEE Hot Chips 24 Symposium (HCS’12). IEEE, 1--35.
    [17]
    Christos Margiolas and Michael F. P. O’Boyle. 2016. Portable and transparent software managed scheduling on accelerators for fair resource sharing. In Proceedings of the 2016 International Symposium on Code Generation and Optimization. ACM, 82--93.
    [18]
    X. Mei and X. Chu. 2017. Dissecting GPU memory hierarchy through microbenchmarking. IEEE Trans. Parallel Distrib. Syst. 28, 1 (Jan. 2017), 72--86.
    [19]
    Nvidia. [n.d.]. CUDA C Programming Guide. http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.
    [20]
    Nvidia. [n.d.]. Tuning CUDA Applications for Kepler. https://docs.nvidia.com/cuda/kepler-tuning-guide/index.html.
    [21]
    Sreepathi Pai, Matthew J. Thazhuthaveetil, and R. Govindarajan. 2013. Improving GPGPU concurrency with elastic kernels. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 407--418.
    [22]
    Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke. 2017. Dynamic resource management for efficient utilization of multitasking GPUs. In International Conference on Architectural Support for Programming Languages and Operating Systems. 527--540.
    [23]
    A. Sethia, D. A. Jamshidi, and S. Mahlke. 2015. Mascar: Speeding up GPU warps by reducing memory pitstops. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA’15). 174--185.
    [24]
    Mark Silberstein, Bryan Ford, Idit Keidar, and Emmett Witchel. 2013. GPUfs: Integrating a file system with GPUs. SIGARCH Comput. Arch. News 41, 1 (March 2013), 485--498.
    [25]
    Jaewoong Sim, Aniruddha Dasgupta, Hyesoon Kim, and Richard Vuduc. 2012. A performance analysis framework for identifying potential benefits in GPGPU applications. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 11--22.
    [26]
    Jeff A. Stuart and John D. Owens. 2011. Multi-GPU MapReduce on GPU clusters. In Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium. IEEE Computer Society, 1068--1079.
    [27]
    Weibin Sun and Robert Ricci. 2013. Fast and flexible: Parallel packet processing with GPUs and click. In Proceedings of the 9th ACM/IEEE Symposium on Architectures for Networking and Communications Systems. IEEE Press, 25--36.
    [28]
    H. Wang, F. Luo, M. Ibrahim, O. Kayiran, and A. Jog. 2018. Efficient and fair multi-programming in GPUs via effective bandwidth management. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA’18). 247--258.
    [29]
    Z. Wang, J. Yang, R. Melhem, B. Childers, Y. Zhang, and M. Guo. 2016. Simultaneous multikernel GPU: Multi-tasking throughput processors via fine-grained sharing. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA’16). 358--369.
    [30]
    Haicheng Wu, Gregory Diamos, Srihari Cadambi, and Sudhakar Yalamanchili. 2012. Kernel weaver: Automatically fusing database primitives for efficient GPU computation. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 107--118.
    [31]
    Q. Xu, H. Jeon, K. Kim, W. W. Ro, and M. Annavaram. 2016. Warped-Slicer: Efficient intra-SM slicing through dynamic resource partitioning for GPU multiprogramming. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA’16). 230--242.
    [32]
    Yao Zhang and John D. Owens. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture. IEEE Computer Society, 382--393. http://dl.acm.org/citation.cfm?id=2014698.2014875
    [33]
    J. Zhong and B. He. 2014. Kernelet: High-throughput GPU kernel executions with dynamic slicing and scheduling. IEEE Trans. Parallel Distrib. Syst. 25, 6 (June 2014), 1522--1532.

    Cited By

    • (2023) Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU. In Proceedings of the 21st ACM Conference on Embedded Networked Sensor Systems, 97--110. DOI: 10.1145/3625687.3625789. Online publication date: 12-Nov-2023.
    • (2022) A Survey of GPU Multitasking Methods Supported by Hardware Architecture. IEEE Transactions on Parallel and Distributed Systems 33, 6, 1451--1463. DOI: 10.1109/TPDS.2021.3115630. Online publication date: 1-Jun-2022.
    • (2020) Fair and cache blocking aware warp scheduling for concurrent kernel execution on GPU. Future Generation Computer Systems. DOI: 10.1016/j.future.2020.05.023. Online publication date: May-2020.


    Published In

    ACM Transactions on Architecture and Code Optimization, Volume 17, Issue 1
    March 2020, 206 pages
    ISSN: 1544-3566
    EISSN: 1544-3973
    DOI: 10.1145/3386454
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 04 March 2020
    Accepted: 01 December 2019
    Revised: 01 December 2019
    Received: 01 July 2019
    Published in TACO Volume 17, Issue 1


    Author Tags

    1. GPGPU
    2. concurrent kernel execution

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • The Research Grants Council of Hong Kong

