Architecting on-chip interconnects for stacked 3D STT-RAM caches in CMPs

Authors:

Asit K. Mishra,

N. Vijaykrishnan,

Chita R. Das

ISCA '11: Proceedings of the 38th annual international symposium on Computer architecture

Pages 69 - 80

https://doi.org/10.1145/2000064.2000074

Published: 04 June 2011

Abstract

Emerging memory technologies such as STT-RAM, PCRAM, and resistive RAM are being explored as potential replacements for existing on-chip caches or main memories in future multi-core architectures. This is due to the many attractive features these memory technologies possess: high density, low leakage, and non-volatility. However, the latency and energy overhead associated with the write operations of these emerging memories has become a major obstacle to their adoption. Previous works have proposed various circuit-level and architecture-level solutions to mitigate the write overhead. In this paper, we study the integration of STT-RAM in a 3D multi-core environment and propose solutions at the on-chip network level to circumvent the write overhead problem in a cache architecture built with STT-RAM technology. Our scheme is based on the observation that instead of staggering requests to a write-busy STT-RAM bank, the network should schedule requests to other idle cache banks to hide the latency effectively. Thus, we prioritize cache accesses to the idle banks by delaying accesses to the STT-RAM cache banks that are currently serving long-latency write requests. Through a detailed characterization of the cache access patterns of 42 applications, we propose an efficient mechanism to facilitate such delayed writes to cache banks by (a) accurately estimating the busy time of each cache bank through logical partitioning of the cache layer and (b) prioritizing packets in a router that request accesses to idle banks. Evaluations on a 3D architecture consisting of 64 cores and 64 STT-RAM cache banks show that our proposed approach provides a 14% average IPC improvement for multi-threaded benchmarks, 19% instruction throughput benefits for multi-programmed workloads, and a 6% latency reduction compared to a recently proposed write buffering mechanism.
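
The prioritization idea in the abstract can be illustrated with a toy arbiter. This is a minimal sketch under stated assumptions, not the authors' router design: the class `BankAwareArbiter`, the `Packet` fields, and the latencies `READ_CYCLES` and `WRITE_CYCLES` are all hypothetical, and the busy-time estimate is simple per-bank bookkeeping rather than the logical partitioning of the cache layer that the paper proposes.

```python
from dataclasses import dataclass

# Hypothetical latencies in cycles; these values are NOT taken from the paper.
READ_CYCLES = 3
WRITE_CYCLES = 30


@dataclass
class Packet:
    pid: int
    bank: int       # destination cache bank
    is_write: bool
    seq: int        # arrival order within the router's input queue


class BankAwareArbiter:
    """Toy arbiter that keeps an estimated busy-until cycle for each bank."""

    def __init__(self, num_banks):
        self.busy_until = [0] * num_banks

    def is_idle(self, bank, now):
        return self.busy_until[bank] <= now

    def pick(self, pending, now):
        """Forward one packet: idle-bank packets first, then oldest first."""
        if not pending:
            return None
        chosen = min(
            pending,
            key=lambda p: (not self.is_idle(p.bank, now), p.seq, p.pid),
        )
        pending.remove(chosen)
        # Update the estimate: the access starts once the bank is free.
        start = max(now, self.busy_until[chosen.bank])
        latency = WRITE_CYCLES if chosen.is_write else READ_CYCLES
        self.busy_until[chosen.bank] = start + latency
        return chosen


def simulate():
    """Forward one packet per cycle and return the order of packet ids."""
    arb = BankAwareArbiter(num_banks=2)
    pending = [
        Packet(0, bank=0, is_write=True, seq=0),   # long write to bank 0
        Packet(1, bank=0, is_write=False, seq=1),  # read behind the write
        Packet(2, bank=1, is_write=False, seq=2),  # read to an idle bank
    ]
    order = []
    now = 0
    while pending:
        order.append(arb.pick(pending, now).pid)
        now += 1
    return order
```

In this sketch, `simulate()` forwards the read to idle bank 1 ahead of the older read that targets write-busy bank 0, whereas a plain first-come-first-served arbiter would forward the packets in arrival order.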

Index Terms

Architecting on-chip interconnects for stacked 3D STT-RAM caches in CMPs
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Interconnection architectures
2. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Information & Contributors

Information

Published In

ISCA '11: Proceedings of the 38th annual international symposium on Computer architecture

June 2011

488 pages

ISBN:9781450304726

DOI:10.1145/2000064

General Chairs: Ravi Iyer (Intel), Qing Yang (University of Rhode Island)
Program Chair: Antonio González (Intel and UPC)

ACM SIGARCH Computer Architecture News Volume 39, Issue 3
ISCA '11
June 2011
462 pages
ISSN:0163-5964
DOI:10.1145/2024723

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ISCA '11

Sponsor:

SIGARCH

ISCA '11: The 38th Annual International Symposium on Computer Architecture

June 4 - 8, 2011

San Jose, California, USA

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 46
Total Downloads: 1,042
Downloads (Last 12 months): 81
Downloads (Last 6 weeks): 2

Reflects downloads up to 03 Oct 2024

Citations

Cited By

M. Cai, J. Shen, B. Tang, H. Huang, and B. Ye. Exploiting Flat Namespace to Improve File System Metadata Performance on Ultra-Fast, Byte-Addressable NVMs. ACM Transactions on Storage, 20(1):1-47, 2024. Online publication date: 30-Jan-2024.
https://dl.acm.org/doi/10.1145/3620673
C. Ye, M. Chen, Q. Jiang, and C. Wang. Hercules: Enabling Atomic Durability for Persistent Memory with Transient Persistence Domain. ACM Transactions on Embedded Computing Systems, 23(6):1-34, 2024. Online publication date: 11-Sep-2024.
https://dl.acm.org/doi/10.1145/3607473
S. Singh, N. Surana, K. Prasad, P. Jain, J. Mekie, and M. Awasthi. HyGain: High-performance, Energy-efficient Hybrid Gain Cell-based Cache Hierarchy. ACM Transactions on Architecture and Code Optimization, 20(2):1-20, 2023. Online publication date: 1-Mar-2023.
https://dl.acm.org/doi/10.1145/3572839
H. Zhao, X. Jia, and T. Watanabe. Router-Integrated Cache Hierarchy Design for Highly Parallel Computing in Efficient CMP Systems. Electronics, 8(11):1363, 2019. Online publication date: 17-Nov-2019.
https://doi.org/10.3390/electronics8111363
M. Ghane, S. Chandrasekaran, and M. Cheung. Gecko. In Proceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores, pages 21-30, 2019. Online publication date: 17-Feb-2019.
https://dl.acm.org/doi/10.1145/3303084.3309489
A. Asad, M. Fazeli, M. Jahed-Motlagh, M. Fathy, and F. Mohammadi. An Energy-efficient Reliable Heterogeneous Uncore Architecture for Future 3D Chip-multiprocessors. Journal of Circuits, Systems and Computers, 2019. Online publication date: 8-Feb-2019.
https://doi.org/10.1142/S0218126619502244
N. Kim, J. Ahn, K. Choi, D. Sanchez, D. Yoo, and S. Ryu. Benzene. ACM Transactions on Architecture and Code Optimization, 15(1):1-23, 2018. Online publication date: 22-Mar-2018.
https://dl.acm.org/doi/10.1145/3177963
S. Jain, A. Ranjan, K. Roy, and A. Raghunathan. Computing in Memory With Spin-Transfer Torque Magnetic RAM. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 26(3):470-483, 2018. Online publication date: 1-Mar-2018.
https://dl.acm.org/doi/10.1109/TVLSI.2017.2776954
H. Yan, L. Jiang, L. Duan, W. Lin, and E. John. FlowPaP and FlowReR. ACM Transactions on Embedded Computing Systems, 16(5s):1-20, 2017. Online publication date: 27-Sep-2017.
https://dl.acm.org/doi/10.1145/3126532
J. Yang and J. Seymour. Pmbench: A Micro-Benchmark for Profiling Paging Performance on a System with Low-Latency SSDs. In Information Technology - New Generations, pages 627-633, 2017. Online publication date: 18-Jul-2017.
https://doi.org/10.1007/978-3-319-54978-1_79
