research-article

A compile-time managed multi-level register file hierarchy

Authors:

Stephen W. Keckler,

William J. Dally

MICRO-44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture

Pages 465 - 476

https://doi.org/10.1145/2155620.2155675

Published: 03 December 2011

Abstract

As processors increasingly become power limited, performance improvements will be achieved by rearchitecting systems with energy efficiency as the primary design constraint. While some of these optimizations will be hardware-based, combined hardware and software techniques will likely be the most productive. This work redesigns the register file system of a modern throughput processor with a combined hardware and software solution that reduces register file energy without harming system performance. Throughput processors utilize a large number of threads to tolerate latency, requiring a large, energy-intensive register file to store thread context. Our results show that a compiler-controlled register file hierarchy can reduce register file energy by up to 54%, compared to a hardware-only caching approach that reduces register file energy by 34%. We explore register allocation algorithms that are specifically targeted to improve energy efficiency by sharing temporary register file resources across concurrently running threads, and we conduct a detailed limit study on the further potential to optimize operand delivery for throughput processors. Our efficiency gains represent a direct performance gain for power-limited systems, such as GPUs.

Index Terms

A compile-time managed multi-level register file hierarchy
1. Computer systems organization
  1. Architectures
    1. Parallel architectures

Published In

MICRO-44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture

December 2011

519 pages

ISBN:9781450310536

DOI:10.1145/2155620

Conference Chair: Carlo Galuzzi (Technische Universiteit Delft, The Netherlands)

General Chair: Luigi Carro (Universidade Federal do Rio Grande do Sul, Brazil)

Program Chairs: Andreas Moshovos (University of Toronto, Canada), Milos Prvulovic (Georgia Institute of Technology, United States)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

IEEE
ACM: Association for Computing Machinery
UFRGS: Universidade Federal do Rio Grande do Sul
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

MICRO-44

MICRO-44: The 44th Annual IEEE/ACM International Symposium on Microarchitecture

December 3 - 7, 2011

Porto Alegre, Brazil

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%
