Research Article
Open Access

Improving Thread-level Parallelism in GPUs Through Expanding Register File to Scratchpad Memory

Published: 16 November 2018
Abstract

    Modern Graphics Processing Units (GPUs) have become pervasive computing devices in datacenters due to their high performance with massive thread-level parallelism (TLP). GPUs are equipped with a large register file (RF) to support fast context switching among massive threads, and with scratchpad memory (SPM) to support inter-thread communication within a cooperative thread array (CTA). However, the TLP of GPUs is usually limited by inefficient management of the register file and scratchpad memory, which also leaves both resources underutilized. To overcome this inefficiency, we propose EXPARS, a new resource management approach for GPUs. EXPARS logically provides a larger register file by expanding the register file into scratchpad memory: when the available register file becomes limited, our approach leverages the underutilized scratchpad memory to support additional register allocation. Therefore, more CTAs can be dispatched to each streaming multiprocessor (SM), which improves GPU utilization. Our experiments on representative benchmark suites show that the number of CTAs dispatched to each SM increases by 1.28× on average. In addition, our approach improves GPU resource utilization significantly, with register file utilization improved by 11.64% and scratchpad memory utilization improved by 48.20% on average. With better TLP, our approach achieves a 20.01% performance improvement on average with negligible energy overhead.
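
    For intuition, the following is a minimal sketch (not from the paper) of how the per-SM CTA limit is bounded by register file and scratchpad capacity, and how logically spilling overflow registers into otherwise unused SPM, in the spirit of EXPARS, can raise that limit. The function max_ctas_per_sm, the simplified spill model, and all capacity and per-CTA demand values are illustrative assumptions rather than the paper's actual mechanism or hardware parameters.

        # Illustrative occupancy arithmetic; the spill model and all values are hypothetical.
        def max_ctas_per_sm(rf_size, spm_size, regs_per_cta, spm_per_cta,
                            spill_regs_to_spm=False, reg_bytes=4):
            """CTA limit imposed by register file (RF) and scratchpad memory (SPM) capacity."""
            spm_limit = spm_size // spm_per_cta if spm_per_cta else float("inf")
            rf_limit = (rf_size // reg_bytes) // regs_per_cta

            if not spill_regs_to_spm:
                return min(rf_limit, spm_limit)

            # Simplified EXPARS-style expansion: register demand that overflows the RF
            # is placed in whatever SPM remains after the admitted CTAs' SPM demand.
            ctas = min(rf_limit, spm_limit)
            while ctas + 1 <= spm_limit:
                overflow_bytes = max(0, (ctas + 1) * regs_per_cta * reg_bytes - rf_size)
                if (ctas + 1) * spm_per_cta + overflow_bytes > spm_size:
                    break
                ctas += 1
            return ctas

        # Hypothetical SM: 256 KB RF, 96 KB SPM; each CTA needs 8K 32-bit registers and 4 KB SPM.
        baseline = max_ctas_per_sm(256 * 1024, 96 * 1024, 8 * 1024, 4 * 1024)
        expanded = max_ctas_per_sm(256 * 1024, 96 * 1024, 8 * 1024, 4 * 1024,
                                   spill_regs_to_spm=True)
        print(baseline, expanded)  # -> 8 9: the RF-limited SM can admit one more CTA

    With these example values the SM is register-limited at 8 CTAs, and placing the ninth CTA's overflow registers in leftover SPM admits one more CTA; the paper reports that its actual mechanism increases the number of dispatched CTAs by 1.28× on average.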


        Published In

        ACM Transactions on Architecture and Code Optimization, Volume 15, Issue 4
        December 2018
        706 pages
        ISSN:1544-3566
        EISSN:1544-3973
        DOI:10.1145/3284745
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 16 November 2018
        Accepted: 01 September 2018
        Revised: 01 August 2018
        Received: 01 May 2018
        Published in TACO Volume 15, Issue 4


        Author Tags

        1. GPU
        2. register file
        3. resource utilization
        4. scratchpad memory

        Qualifiers

        • Research-article
        • Research
        • Refereed
