Research article
DOI: 10.1145/3173162.3173211

LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching

Published: 19 March 2018

Abstract

Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file to reduce register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp's aggregate register working set within each interval. The key idea of LTRF is to prefetch the estimated register working set from the main register file to the register file cache under software control at the beginning of each interval, and to overlap the prefetch latency with the execution of other warps. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density, high-latency memory technologies, enabling 8X larger capacity and improving overall GPU performance by 31% while reducing register file power consumption by 46%.
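The abstract's core mechanism — partition execution into intervals at compile time, estimate each interval's register working set, and emit a software-controlled prefetch at each interval boundary — can be sketched as follows. This is a simplified illustration, not the paper's actual compiler pass: the interval-boundary rule (split at branches), the instruction format, and the `prefetch` pseudo-op are all assumptions made for the example.

```python
# Illustrative sketch of interval-based register prefetching:
# split a straight-line instruction stream into intervals at branch
# boundaries, compute each interval's register working set, and emit a
# "prefetch" pseudo-instruction so hardware can move those registers from
# the large main register file into the small register file cache up front.

def partition_intervals(instrs):
    """Split at control-flow instructions; each interval is a list of instrs."""
    intervals, cur = [], []
    for op, regs in instrs:
        cur.append((op, regs))
        if op == "branch":          # hypothetical interval-boundary rule
            intervals.append(cur)
            cur = []
    if cur:
        intervals.append(cur)
    return intervals

def insert_prefetches(instrs):
    """Prepend each interval with a prefetch of its register working set."""
    out = []
    for interval in partition_intervals(instrs):
        working_set = sorted({r for _, regs in interval for r in regs})
        out.append(("prefetch", working_set))   # software-controlled prefetch
        out.extend(interval)
    return out

program = [
    ("add",    ["r1", "r2", "r3"]),
    ("mul",    ["r3", "r4", "r1"]),
    ("branch", []),
    ("load",   ["r5", "r1"]),
    ("store",  ["r5", "r6"]),
]
annotated = insert_prefetches(program)
for op, regs in annotated:
    print(op, regs)
```

In the real design, the latency of each `prefetch` would be hidden by scheduling other warps while a warp's working set streams into the register file cache; this sketch only shows the compile-time side of that hardware/software cooperation.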



      Published In

ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems
March 2018, 827 pages
ISBN: 9781450349116
DOI: 10.1145/3173162

Also published in: ACM SIGPLAN Notices, Volume 53, Issue 2 (ASPLOS '18), February 2018, 809 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3296957

      Publisher

Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. GPUs
      2. energy efficiency
      3. latency tolerance
      4. memory latency
      5. memory technology
      6. register file design

      Qualifiers

      • Research-article

      Conference

      ASPLOS '18

      Acceptance Rates

ASPLOS '18 paper acceptance rate: 56 of 319 submissions (18%)
Overall acceptance rate: 535 of 2,713 submissions (20%)

      Cited By

• (2024) Memento: An Adaptive, Compiler-Assisted Register File Cache for GPUs. ISCA '24, pp. 978-990. DOI: 10.1109/ISCA59077.2024.00075
• (2024) PresCount: Effective Register Allocation for Bank Conflict Reduction. CGO '24, pp. 170-181. DOI: 10.1109/CGO57630.2024.10444841
• (2023) Lightweight Register File Caching in Collector Units for GPUs. GPGPU '23, pp. 27-33. DOI: 10.1145/3589236.3589245
• (2022) REMOC. ACM Computing Frontiers '22, pp. 1-11. DOI: 10.1145/3528416.3530229
• (2022) NURA. Proc. ACM Meas. Anal. Comput. Syst. 6(1), pp. 1-27. DOI: 10.1145/3508036
• (2022) OSM: Off-Chip Shared Memory for GPUs. IEEE TPDS 33(12), pp. 3415-3429. DOI: 10.1109/TPDS.2022.3154315
• (2020) Efficient Nearest-Neighbor Data Sharing in GPUs. ACM TACO 18(1), pp. 1-26. DOI: 10.1145/3429981
• (2020) FRF: Toward Warp-Scheduler Friendly STT-RAM/SRAM Fine-Grained Hybrid GPGPU Register File Design. IEEE TCAD 39(10), pp. 2396-2409. DOI: 10.1109/TCAD.2019.2946808
• (2020) Exploiting Zero Data to Reduce Register File and Execution Unit Dynamic Power Consumption in GPGPUs. DAC '20, pp. 1-6. DOI: 10.1109/DAC18072.2020.9218547
• (2020) DC-Patch: A Microarchitectural Fault Patching Technique for GPU Register Files. IEEE Access 8, pp. 173276-173288. DOI: 10.1109/ACCESS.2020.3025899
