research-article
Open access

Software-Directed Techniques for Improved GPU Register File Utilization

Published: 24 September 2018

Abstract

Throughput architectures such as GPUs require substantial hardware resources to hold the state of a massive number of simultaneously executing threads. While GPU register files are already enormous, reaching capacities of 256KB per streaming multiprocessor (SM), we find that nearly half of the real-world applications we examined are register-bound and would benefit from a larger register file to enable more concurrent threads. This article seeks to increase the thread occupancy and improve the performance of these register-bound applications by making more efficient use of the existing register file capacity. Our first technique eagerly deallocates register resources during execution. We show that releasing register resources based on value liveness, as proposed in prior state-of-the-art work, leads to unreliable performance and undue design complexity. To address these deficiencies, our article presents a novel compiler-driven approach that identifies and exploits the last use of a register name (instead of the value contained within) to eagerly release register resources. Furthermore, while previous works have leveraged “scalar” and “narrow” operand properties of a program for various optimizations, their impact on thread occupancy has been relatively unexplored. Our article evaluates the effectiveness of these techniques in improving thread occupancy and demonstrates that while any one approach may fail to free many registers, together they synergistically free enough registers to launch additional parallel work. An in-depth evaluation on a large suite of applications shows that our early register release technique alone outperforms previous work on dynamic register allocation, and that together these approaches provide, on average, a 12% performance speedup (23% higher thread occupancy) on register-bound applications not already saturating other GPU resources.
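
To make the notion of a register-bound kernel concrete, the following is a minimal sketch of how per-thread register demand caps the number of resident warps on one SM. Only the 256KB register file capacity comes from the abstract. The function name `max_resident_warps`, the 32-bit register width, the 32-thread warp size, and the 64-warp scheduler limit are illustrative assumptions, and real hardware also applies allocation granularity and other resource limits that this sketch ignores.

```python
def max_resident_warps(regfile_bytes, regs_per_thread,
                       threads_per_warp=32, bytes_per_reg=4,
                       hw_warp_limit=64):
    """Upper bound on resident warps per SM imposed by the register file.

    Assumptions (not taken from the article): 32-bit registers, 32 threads
    per warp, a scheduler cap of 64 warps per SM, and no allocation
    granularity or other resource limits.
    """
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    # Register storage one warp needs for its whole lifetime.
    bytes_per_warp = regs_per_thread * threads_per_warp * bytes_per_reg
    # The register file bounds occupancy until the scheduler cap takes over.
    return min(hw_warp_limit, regfile_bytes // bytes_per_warp)


REGFILE_BYTES = 256 * 1024  # 256KB per SM, the capacity quoted in the abstract

# Hypothetical per-thread register counts, chosen only for illustration.
for regs in (64, 52, 32):
    print(regs, "registers/thread ->",
          max_resident_warps(REGFILE_BYTES, regs), "warps")
```

Under these assumptions, a kernel that needs 64 registers per thread fits 32 warps, while one that effectively holds 52 fits 39. Freeing a modest number of registers per thread can therefore admit additional warps without enlarging the register file.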

Published In

ACM Transactions on Architecture and Code Optimization, Volume 15, Issue 3
September 2018
322 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3274266
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 September 2018
Accepted: 01 July 2018
Revised: 01 June 2018
Received: 01 March 2018
Published in TACO Volume 15, Issue 3

Author Tags

  1. GPU
  2. register file
  3. thread occupancy

Qualifiers

  • Research-article
  • Research
  • Refereed

Article Metrics

  • Downloads (Last 12 months): 469
  • Downloads (Last 6 weeks): 83
Reflects downloads up to 26 Sep 2024

Cited By

  • (2023) High Performance and Power Efficient Accelerator for Cloud Inference. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). DOI: 10.1109/HPCA56546.2023.10070941. Pages 1003-1016. Online publication date: Feb-2023
  • (2021) Prediction of Register Instance Usage and Time-sharing Register for Extended Register Reuse Scheme. Proceedings of the 26th Asia and South Pacific Design Automation Conference. DOI: 10.1145/3394885.3431412. Pages 216-221. Online publication date: 18-Jan-2021
  • (2020) A GPU Register File using Static Data Compression. Proceedings of the 49th International Conference on Parallel Processing. DOI: 10.1145/3404397.3404431. Pages 1-10. Online publication date: 17-Aug-2020
  • (2020) BOW: Breathing Operand Windows to Exploit Bypassing in GPUs. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). DOI: 10.1109/MICRO50266.2020.00084. Pages 996-1008. Online publication date: Oct-2020
  • (2019) MANIC. Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. DOI: 10.1145/3352460.3358277. Pages 670-684. Online publication date: 12-Oct-2019
