Research Article
Open Access

Improving Thread-level Parallelism in GPUs Through Expanding Register File to Scratchpad Memory

Published: 16 November 2018
Abstract

    Modern Graphics Processing Units (GPUs) have become pervasive computing devices in datacenters due to their high performance with massive thread-level parallelism (TLP). GPUs are equipped with a large register file (RF) to support fast context switching among massive threads, and with scratchpad memory (SPM) to support inter-thread communication within a cooperative thread array (CTA). However, the TLP of GPUs is usually limited by inefficient management of the register file and scratchpad memory, which also leaves both resources underutilized. To overcome this inefficiency, we propose EXPARS, a new resource management approach for GPUs. EXPARS logically provides a larger register file by expanding the register file into scratchpad memory: when the available register file becomes limited, our approach leverages the underutilized scratchpad memory to support additional register allocation. Therefore, more CTAs can be dispatched to each streaming multiprocessor (SM), which improves GPU utilization. Our experiments on representative benchmark suites show that the number of CTAs dispatched to each SM increases by 1.28× on average. In addition, our approach improves GPU resource utilization significantly, with register file utilization improved by 11.64% and scratchpad memory utilization improved by 48.20% on average. With better TLP, our approach achieves a 20.01% performance improvement on average with negligible energy overhead.
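
    For intuition, the following is a minimal sketch (not from the paper) of how the per-SM CTA limit is bounded by register file and scratchpad capacity, and how logically spilling overflow registers into otherwise unused SPM, in the spirit of EXPARS, can raise that limit. The function max_ctas_per_sm, the simplified spill model, and all capacity and per-CTA demand values are illustrative assumptions rather than the paper's actual mechanism or hardware parameters.

        # Illustrative occupancy arithmetic; the spill model and all values are hypothetical.
        def max_ctas_per_sm(rf_size, spm_size, regs_per_cta, spm_per_cta,
                            spill_regs_to_spm=False, reg_bytes=4):
            """CTA limit imposed by register file (RF) and scratchpad memory (SPM) capacity."""
            spm_limit = spm_size // spm_per_cta if spm_per_cta else float("inf")
            rf_limit = (rf_size // reg_bytes) // regs_per_cta

            if not spill_regs_to_spm:
                return min(rf_limit, spm_limit)

            # Simplified EXPARS-style expansion: register demand that overflows the RF
            # is placed in whatever SPM remains after the admitted CTAs' SPM demand.
            ctas = min(rf_limit, spm_limit)
            while ctas + 1 <= spm_limit:
                overflow_bytes = max(0, (ctas + 1) * regs_per_cta * reg_bytes - rf_size)
                if (ctas + 1) * spm_per_cta + overflow_bytes > spm_size:
                    break
                ctas += 1
            return ctas

        # Hypothetical SM: 256 KB RF, 96 KB SPM; each CTA needs 8K 32-bit registers and 4 KB SPM.
        baseline = max_ctas_per_sm(256 * 1024, 96 * 1024, 8 * 1024, 4 * 1024)
        expanded = max_ctas_per_sm(256 * 1024, 96 * 1024, 8 * 1024, 4 * 1024,
                                   spill_regs_to_spm=True)
        print(baseline, expanded)  # -> 8 9: the RF-limited SM can admit one more CTA

    With these example values the SM is register-limited at 8 CTAs, and placing the ninth CTA's overflow registers in leftover SPM admits one more CTA; the paper reports that its actual mechanism increases the number of dispatched CTAs by 1.28× on average.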


        Published In

        ACM Transactions on Architecture and Code Optimization, Volume 15, Issue 4
        December 2018
        706 pages
        ISSN:1544-3566
        EISSN:1544-3973
        DOI:10.1145/3284745
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 16 November 2018
        Accepted: 01 September 2018
        Revised: 01 August 2018
        Received: 01 May 2018
        Published in TACO Volume 15, Issue 4


        Author Tags

        1. GPU
        2. register file
        3. resource utilization
        4. scratchpad memory

        Qualifiers

        • Research-article
        • Research
        • Refereed
