Research article
DOI: 10.1145/3173162.3173211

LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching

Published: 19 March 2018

Abstract

Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file to reduce register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp's aggregate register working set within each interval. The key idea of LTRF is to prefetch the estimated register working set from the main register file to the register file cache under software control at the beginning of each interval, and to overlap the prefetch latency with the execution of other warps. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density, high-latency memory technologies, enabling 8X larger capacity and improving overall GPU performance by 31% while reducing register file power consumption by 46%.
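The abstract's core mechanism — partition execution into intervals at compile time, estimate each interval's register working set, and emit a software-controlled prefetch at each interval boundary — can be sketched as follows. This is a simplified illustration, not the paper's actual compiler pass: the interval-boundary rule (split at branches), the instruction format, and the `prefetch` pseudo-op are all assumptions made for the example.

```python
# Illustrative sketch of interval-based register prefetching:
# split a straight-line instruction stream into intervals at branch
# boundaries, compute each interval's register working set, and emit a
# "prefetch" pseudo-instruction so hardware can move those registers from
# the large main register file into the small register file cache up front.

def partition_intervals(instrs):
    """Split at control-flow instructions; each interval is a list of instrs."""
    intervals, cur = [], []
    for op, regs in instrs:
        cur.append((op, regs))
        if op == "branch":          # hypothetical interval-boundary rule
            intervals.append(cur)
            cur = []
    if cur:
        intervals.append(cur)
    return intervals

def insert_prefetches(instrs):
    """Prepend each interval with a prefetch of its register working set."""
    out = []
    for interval in partition_intervals(instrs):
        working_set = sorted({r for _, regs in interval for r in regs})
        out.append(("prefetch", working_set))   # software-controlled prefetch
        out.extend(interval)
    return out

program = [
    ("add",    ["r1", "r2", "r3"]),
    ("mul",    ["r3", "r4", "r1"]),
    ("branch", []),
    ("load",   ["r5", "r1"]),
    ("store",  ["r5", "r6"]),
]
annotated = insert_prefetches(program)
for op, regs in annotated:
    print(op, regs)
```

In the real design, the latency of each `prefetch` would be hidden by scheduling other warps while a warp's working set streams into the register file cache; this sketch only shows the compile-time side of that hardware/software cooperation.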



      Published In

ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems
March 2018, 827 pages
ISBN: 9781450349116
DOI: 10.1145/3173162

Also published in: ACM SIGPLAN Notices, Volume 53, Issue 2 (ASPLOS '18), February 2018, 809 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3296957

      Publisher

Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. GPUs
      2. energy efficiency
      3. latency tolerance
      4. memory latency
      5. memory technology
      6. register file design

      Qualifiers

      • Research-article

      Conference

      ASPLOS '18

      Acceptance Rates

ASPLOS '18 paper acceptance rate: 56 of 319 submissions (18%)
Overall acceptance rate: 535 of 2,713 submissions (20%)

      Cited By

• (2024) Memento: An Adaptive, Compiler-Assisted Register File Cache for GPUs. ISCA '24, pp. 978-990. DOI: 10.1109/ISCA59077.2024.00075
• (2024) PresCount: Effective Register Allocation for Bank Conflict Reduction. CGO '24, pp. 170-181. DOI: 10.1109/CGO57630.2024.10444841
• (2023) Lightweight Register File Caching in Collector Units for GPUs. GPGPU '23, pp. 27-33. DOI: 10.1145/3589236.3589245
• (2022) REMOC. ACM Computing Frontiers '22, pp. 1-11. DOI: 10.1145/3528416.3530229
• (2022) NURA. Proc. ACM Meas. Anal. Comput. Syst. 6(1), pp. 1-27. DOI: 10.1145/3508036
• (2022) OSM: Off-Chip Shared Memory for GPUs. IEEE TPDS 33(12), pp. 3415-3429. DOI: 10.1109/TPDS.2022.3154315
• (2020) Efficient Nearest-Neighbor Data Sharing in GPUs. ACM TACO 18(1), pp. 1-26. DOI: 10.1145/3429981
• (2020) FRF: Toward Warp-Scheduler Friendly STT-RAM/SRAM Fine-Grained Hybrid GPGPU Register File Design. IEEE TCAD 39(10), pp. 2396-2409. DOI: 10.1109/TCAD.2019.2946808
• (2020) Exploiting Zero Data to Reduce Register File and Execution Unit Dynamic Power Consumption in GPGPUs. DAC '20, pp. 1-6. DOI: 10.1109/DAC18072.2020.9218547
• (2020) DC-Patch: A Microarchitectural Fault Patching Technique for GPU Register Files. IEEE Access 8, pp. 173276-173288. DOI: 10.1109/ACCESS.2020.3025899
