A Write-Aware STTRAM-Based Register File Architecture for GPGPU

Authors:

Yuan Xie

ACM Journal on Emerging Technologies in Computing Systems (JETC), Volume 12, Issue 1

Article No.: 6, Pages 1 - 12

https://doi.org/10.1145/2700230

Published: 03 August 2015

Abstract

The massively parallel processing capacity of GPGPUs requires a large register file (RF), and the RF keeps growing from generation to generation to support more concurrent threads. Traditional SRAM-based RFs raise concerns about both area cost and energy consumption, and they will soon become impractical. In this work, we analyze the feasibility of STTRAM-based RF designs, which offer smaller silicon area and zero standby leakage power. However, the long write latency and high write energy of STTRAM bring new challenges. We therefore propose a write-aware STTRAM-based RF architecture (WarRF), which comprises two techniques: Split Bank Write modifies the arbitrator design to increase the parallelism of read and write accesses in the same bank, and Write Pool reduces the number of repeated write accesses to the RF. Our experiments show that WarRF improves the performance of the STTRAM-based RF by 13%, and by up to 23%. In addition, energy consumption is reduced by 38% on average compared to SRAM-based RFs.
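The abstract does not describe how Write Pool is built, so the following is only an illustrative sketch of the general idea of absorbing repeated writes before they reach a slow, write-expensive array. The class `WritePool`, its `capacity` parameter, the oldest-first eviction policy, and the `demo` trace are all assumptions made for this example; they are not taken from the paper.

```python
from collections import OrderedDict


class WritePool:
    """Hypothetical write-coalescing buffer in front of a register file.

    Assumption: a small pool keeps the newest value of recently written
    registers, so repeated writes to one register cost one array write.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = OrderedDict()  # register id -> newest pooled value
        self.rf = {}                  # stands in for the STTRAM array
        self.rf_writes = 0            # writes that actually reached the array

    def write(self, reg, value):
        # A repeated write replaces the pooled copy; the older value
        # never reaches the array.
        if reg in self.pending:
            self.pending[reg] = value
            self.pending.move_to_end(reg)
            return
        # Pool is full: push the oldest pooled entry to the array first.
        if len(self.pending) >= self.capacity:
            self._evict_oldest()
        self.pending[reg] = value

    def read(self, reg):
        # The pool always holds the newest value, so check it first.
        if reg in self.pending:
            return self.pending[reg]
        return self.rf.get(reg)

    def _evict_oldest(self):
        reg, value = self.pending.popitem(last=False)
        self.rf[reg] = value
        self.rf_writes += 1

    def flush(self):
        # Drain every pooled entry to the array.
        while self.pending:
            self._evict_oldest()


def demo():
    # Made-up trace: five writes, three of them to the same register.
    pool = WritePool(capacity=2)
    for reg, value in [("r1", 1), ("r1", 2), ("r2", 3), ("r1", 4), ("r3", 5)]:
        pool.write(reg, value)
    pool.flush()
    return pool.rf_writes, pool.rf
```

In this made-up trace, five issued writes result in three array writes, because the three writes to `r1` collapse into one. A real design would also have to handle pool sizing, bank conflicts, and timing, none of which this sketch models.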

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures


Published In

ACM Journal on Emerging Technologies in Computing Systems Volume 12, Issue 1

July 2015

210 pages

ISSN:1550-4832

EISSN:1550-4840

DOI:10.1145/2810396

Editor:
Krishnendu Chakrabarty
Duke University, USA

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 03 August 2015

Accepted: 01 October 2014

Revised: 01 September 2014

Received: 01 February 2014

Published in JETC Volume 12, Issue 1

Qualifiers

Research-article
Research
Refereed
