research-article

Open access

Software-Directed Techniques for Improved GPU Register File Utilization

Authors: Dani Voitsechov, Arslan Zulfiqar, Mark Stephenson, Stephen W. Keckler

ACM Transactions on Architecture and Code Optimization (TACO), Volume 15, Issue 3

Article No.: 38, Pages 1 - 23

https://doi.org/10.1145/3243905

Published: 24 September 2018

Abstract

Throughput architectures such as GPUs require substantial hardware resources to hold the state of a massive number of simultaneously executing threads. While GPU register files are already enormous, reaching capacities of 256KB per streaming multiprocessor (SM), we find that nearly half of the real-world applications we examined are register-bound and would benefit from a larger register file to enable more concurrent threads. This article seeks to increase the thread occupancy and improve the performance of these register-bound applications by making more efficient use of the existing register file capacity. Our first technique eagerly deallocates register resources during execution. We show that releasing register resources based on value liveness, as proposed in prior state-of-the-art work, leads to unreliable performance and undue design complexity. To address these deficiencies, our article presents a novel compiler-driven approach that identifies and exploits the last use of a register name (instead of the value contained within) to eagerly release register resources. Furthermore, while previous works have leveraged “scalar” and “narrow” operand properties of a program for various optimizations, their impact on thread occupancy has been relatively unexplored. Our article evaluates the effectiveness of these techniques in improving thread occupancy and demonstrates that while any one approach may fail to free many registers, together they free enough registers to launch additional parallel work. An in-depth evaluation on a large suite of applications shows that our early register release technique alone outperforms previous work on dynamic register allocation, and that together these approaches provide, on average, a 12% performance speedup (23% higher thread occupancy) on register-bound applications not already saturating other GPU resources.
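To make the term "register-bound" concrete, the following is a minimal sketch of how per-thread register usage can cap thread occupancy. It is not taken from the article. Only the 256KB register file size comes from the abstract; the 32-bit register width, the 32-thread warp, the 64-warp scheduler limit, and the function names are assumptions chosen for illustration. Allocation granularity and other per-SM limits (shared memory, thread blocks) are ignored.

```python
# Illustrative occupancy model. Only REGISTER_FILE_BYTES comes from the abstract;
# every other constant below is an assumption made for this sketch.
REGISTER_FILE_BYTES = 256 * 1024  # per-SM register file capacity quoted in the abstract
BYTES_PER_REGISTER = 4            # assumption: 32-bit registers
THREADS_PER_WARP = 32             # assumption: 32 threads per warp
MAX_WARPS_PER_SM = 64             # assumption: hardware limit on resident warps


def resident_warps(regs_per_thread: int) -> int:
    """Warps that fit on one SM when the register file is the only other limit."""
    total_regs = REGISTER_FILE_BYTES // BYTES_PER_REGISTER
    regs_per_warp = regs_per_thread * THREADS_PER_WARP
    return min(MAX_WARPS_PER_SM, total_regs // regs_per_warp)


def occupancy(regs_per_thread: int) -> float:
    """Fraction of the assumed warp limit that is actually resident."""
    return resident_warps(regs_per_thread) / MAX_WARPS_PER_SM


if __name__ == "__main__":
    for regs in (32, 48, 64, 128):
        print(regs, resident_warps(regs), occupancy(regs))
```

Under these assumptions, a kernel that needs 64 registers per thread fills the register file with 32 warps, half of the assumed limit. Freeing registers so that each thread effectively holds 48 lets 42 warps stay resident, which is the kind of occupancy gain the abstract describes.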

Cited By

Yao J, Zhou H, Zhang Y, Li Y, Feng C, Chen S, Chen J, Wang Y, Hu Q (2023). High Performance and Power Efficient Accelerator for Cloud Inference. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 1003-1016. Online publication date: Feb-2023.
https://doi.org/10.1109/HPCA56546.2023.10070941
Zhou S, Wang H, Tong D (2021). Prediction of Register Instance Usage and Time-sharing Register for Extended Register Reuse Scheme. Proceedings of the 26th Asia and South Pacific Design Automation Conference, 216-221. Online publication date: 18-Jan-2021.
https://dl.acm.org/doi/10.1145/3394885.3431412
Angerd A, Sintorn E, Stenstrom P (2020). A GPU Register File using Static Data Compression. Proceedings of the 49th International Conference on Parallel Processing, 1-10. Online publication date: 17-Aug-2020.
https://dl.acm.org/doi/10.1145/3404397.3404431

Index Terms

Software-Directed Techniques for Improved GPU Register File Utilization
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
  2. Embedded and cyber-physical systems
    1. Embedded systems
      1. Embedded hardware
2. Computing methodologies
  1. Computer graphics
    1. Graphics systems and interfaces
      1. Graphics processors

Published In

ACM Transactions on Architecture and Code Optimization, Volume 15, Issue 3

September 2018

322 pages

ISSN: 1544-3566

EISSN: 1544-3973

DOI: 10.1145/3274266

Editor: Koen De Bosschere, Ghent University

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 September 2018

Accepted: 01 July 2018

Revised: 01 June 2018

Received: 01 March 2018

Published in TACO Volume 15, Issue 3

Qualifiers

Research-article
Research
Refereed

Bibliometrics

Total Citations: 5
Total Downloads: 1,475
Downloads (Last 12 months): 469
Downloads (Last 6 weeks): 83

Reflects downloads up to 26 Sep 2024

Citations

Cited By

Esfeden H, Abdolrashidi A, Rahman S, Wong D, Abu-Ghazaleh N (2020). BOW: Breathing Operand Windows to Exploit Bypassing in GPUs. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 996-1008. Online publication date: Oct-2020.
https://doi.org/10.1109/MICRO50266.2020.00084
Gobieski G, Nagi A, Serafin N, Isgenc M, Beckmann N, Lucia B (2019). MANIC. Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 670-684. Online publication date: 12-Oct-2019.
https://dl.acm.org/doi/10.1145/3352460.3358277
