DOI: 10.1145/2155620.2155675

A compile-time managed multi-level register file hierarchy

Published: 03 December 2011

Abstract

As processors increasingly become power limited, performance improvements will be achieved by rearchitecting systems with energy efficiency as the primary design constraint. While some of these optimizations will be hardware based, combined hardware and software techniques will likely be the most productive. This work redesigns the register file system of a modern throughput processor with a combined hardware and software solution that reduces register file energy without harming system performance. Throughput processors utilize a large number of threads to tolerate latency, requiring a large, energy-intensive register file to store thread context. Our results show that a compiler-controlled register file hierarchy can reduce register file energy by up to 54%, compared to a hardware-only caching approach that reduces register file energy by 34%. We explore register allocation algorithms that are specifically targeted to improve energy efficiency by sharing temporary register file resources across concurrently running threads, and we conduct a detailed limit study on the further potential to optimize operand delivery for throughput processors. Our efficiency gains represent a direct performance gain for power-limited systems, such as GPUs.
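
As a rough illustration of what compile-time placement in a register file hierarchy can look like, the sketch below assigns each value of a straight-line instruction sequence to a small upper-level register file when its live range is short and a slot is free, and to the main register file otherwise. It then compares total access energy against a single-level baseline. This is a simplified greedy heuristic, not the paper's allocation algorithm: the per-access energies, the capacity, the live-range threshold, and every name (`Value`, `place_values`, `access_energy`) are hypothetical and chosen only for illustration, and the model ignores multithreading entirely.

```python
from dataclasses import dataclass, field

# Hypothetical relative energy per access; illustrative only, not measured data.
ENERGY = {"small": 1.0, "main": 8.0}


@dataclass
class Value:
    name: str
    define: int                                 # index of the defining instruction
    uses: list = field(default_factory=list)    # indices of the reading instructions


def place_values(values, small_capacity=2, max_range=3):
    """Greedily place short-lived values in the small level while slots are free."""
    placement = {}
    active = []  # last-use indices of values currently held in the small level
    for v in sorted(values, key=lambda v: v.define):
        last_use = max(v.uses, default=v.define)
        # A slot is free again once its value has been read for the last time.
        active = [end for end in active if end > v.define]
        short_lived = last_use - v.define <= max_range
        if short_lived and len(active) < small_capacity:
            active.append(last_use)
            placement[v.name] = "small"
        else:
            placement[v.name] = "main"
    return placement


def access_energy(values, placement):
    """Total energy: one write per value plus one read per use, at its level's cost."""
    return sum(ENERGY[placement[v.name]] * (1 + len(v.uses)) for v in values)


values = [
    Value("a", 0, [1]),
    Value("b", 1, [2, 6]),   # long live range, stays in the main register file
    Value("c", 2, [3]),
    Value("d", 3, [4]),
]
placement = place_values(values)
hierarchy = access_energy(values, placement)
baseline = access_energy(values, {v.name: "main" for v in values})
print(placement)
print(f"energy saved vs. single level: {1 - hierarchy / baseline:.0%}")
```

The saving printed here depends entirely on the assumed energy ratio and the toy instruction sequence, so it says nothing about the 54% and 34% figures reported in the abstract.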


    Published In

    MICRO-44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
    December 2011
    519 pages
    ISBN: 9781450310536
    DOI: 10.1145/2155620
    Conference Chair: Carlo Galuzzi
    General Chair: Luigi Carro
    Program Chairs: Andreas Moshovos, Milos Prvulovic
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • Research-article

    Conference

    MICRO-44
    Acceptance Rates

    Overall Acceptance Rate 484 of 2,242 submissions, 22%

    Article Metrics

    • Downloads (last 12 months): 28
    • Downloads (last 6 weeks): 4

    Reflects downloads up to 20 Feb 2025

    Cited By

    • (2024) Memento: An Adaptive, Compiler-Assisted Register File Cache for GPUs. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 978-990. DOI: 10.1109/ISCA59077.2024.00075. Online publication date: 29-Jun-2024
    • (2024) PresCount: Effective Register Allocation for Bank Conflict Reduction. Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization, pp. 170-181. DOI: 10.1109/CGO57630.2024.10444841. Online publication date: 2-Mar-2024
    • (2024) GPU Architecture. Handbook of Computer Architecture, pp. 531-559. DOI: 10.1007/978-981-97-9314-3_66. Online publication date: 21-Dec-2024
    • (2023) Lightweight Register File Caching in Collector Units for GPUs. Proceedings of the 15th Workshop on General Purpose Processing Using GPU, pp. 27-33. DOI: 10.1145/3589236.3589245. Online publication date: 25-Feb-2023
    • (2023) Mitigating GPU Core Partitioning Performance Effects. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 530-542. DOI: 10.1109/HPCA56546.2023.10070957. Online publication date: Feb-2023
    • (2023) High Performance and Power Efficient Accelerator for Cloud Inference. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 1003-1016. DOI: 10.1109/HPCA56546.2023.10070941. Online publication date: Feb-2023
    • (2023) GPU Architecture. Handbook of Computer Architecture, pp. 1-29. DOI: 10.1007/978-981-15-6401-7_66-2. Online publication date: 25-Jun-2023
    • (2023) GPU Architecture. Handbook of Computer Architecture, pp. 1-29. DOI: 10.1007/978-981-15-6401-7_66-1. Online publication date: 16-May-2023
    • (2022) Survey of Shared Register File design for Unified Shader Array in GPUs. 2022 IEEE 9th International Conference on Cyber Security and Cloud Computing (CSCloud)/2022 IEEE 8th International Conference on Edge Computing and Scalable Cloud (EdgeCom), pp. 201-206. DOI: 10.1109/CSCloud-EdgeCom54986.2022.00042. Online publication date: Jun-2022
    • (2022) Design of Shared Register File of GPU Unified Shader Array. 2022 IEEE 9th International Conference on Cyber Security and Cloud Computing (CSCloud)/2022 IEEE 8th International Conference on Edge Computing and Scalable Cloud (EdgeCom), pp. 123-128. DOI: 10.1109/CSCloud-EdgeCom54986.2022.00030. Online publication date: Jun-2022
