Monolithically Integrating Non-Volatile Main Memory over the Last-Level Cache

Authors: Candace Walden, Meenatchi Jagasivamani, Mehdi Asnaashari, Sylvain Dubois, Donald Yeung

ACM Transactions on Architecture and Code Optimization (TACO), Volume 18, Issue 4

Article No.: 48, Pages 1 - 26

https://doi.org/10.1145/3462632

Published: 17 July 2021

Abstract

Many emerging non-volatile memories are compatible with CMOS logic, potentially enabling their integration into a CPU’s die. This article investigates such monolithically integrated CPU–main memory chips. We exploit non-volatile memories employing 3D crosspoint subarrays, such as resistive RAM (ReRAM), and integrate them over the CPU’s last-level cache (LLC). The regular structure of cache arrays enables co-design of the LLC and ReRAM main memory for area efficiency. We also develop a streamlined LLC/main memory interface that employs a single shared internal interconnect for both the cache and main memory arrays, and uses a unified controller to service both LLC and main memory requests.

We apply our monolithic design ideas to a many-core CPU by integrating 3D ReRAM over each core’s LLC slice. We find that co-design of the LLC and ReRAM saves 27% of the total LLC–main memory area at the expense of slight increases in delay and energy. The streamlined LLC/main memory interface saves an additional 12% in area. Our simulation results show that monolithic integration of CPU and main memory improves performance by 5.3× and 1.7× over HBM2 DRAM for several graph and streaming kernels, respectively. It also reduces the memory system’s energy by 6.0× and 1.7×, respectively. Moreover, we show that the area savings of co-design permit the CPU to have 23% more cores and main memory, and that streamlining the LLC/main memory interface incurs a small 4% performance penalty.
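The two area figures above (27% from co-design, an additional 12% from the streamlined interface) can be combined in more than one way, and the abstract does not say which baseline the 12% refers to. The following is a minimal sketch, not the authors' model, that works through both readings. The function names and the normalized baseline of 100 area units are hypothetical, introduced only for illustration.

```python
# Percentages quoted in the abstract.
CODESIGN_SAVING = 0.27    # share of total LLC + main memory area saved by co-design
INTERFACE_SAVING = 0.12   # additional share saved by the streamlined interface


def remaining_area_additive(baseline: float) -> float:
    """Assumption A: both percentages are measured against the same baseline."""
    return baseline * (1.0 - CODESIGN_SAVING - INTERFACE_SAVING)


def remaining_area_compounded(baseline: float) -> float:
    """Assumption B: the 12% applies to the area left after co-design."""
    return baseline * (1.0 - CODESIGN_SAVING) * (1.0 - INTERFACE_SAVING)


if __name__ == "__main__":
    baseline = 100.0  # hypothetical normalized area units
    print(f"additive reading:   {remaining_area_additive(baseline):.2f}")
    print(f"compounded reading: {remaining_area_compounded(baseline):.2f}")
```

Under either reading the combined design keeps roughly 61% to 64% of the baseline LLC and main memory area. The full article is the authority on which baseline the authors intend.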

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multicore architectures
2. Hardware
  1. Emerging technologies
    1. Memory and dense storage

Published In

ACM Transactions on Architecture and Code Optimization Volume 18, Issue 4

December 2021

497 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3476575

Editor:
David Kaeli
Northeastern University, USA

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 July 2021

Accepted: 01 April 2021

Revised: 01 April 2021

Received: 01 October 2020

Article Metrics

Total Citations: 3
Total Downloads: 1,156
Downloads (last 12 months): 271
Downloads (last 6 weeks): 28

Reflects downloads up to 11 Aug 2024

Cited By

Tang C, Nie C, Qian W, He Z. (2024). PIMLC: Logic Compiler for Bit-Serial Based PIM. 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), 1–6. Online publication date: 25-Mar-2024. https://doi.org/10.23919/DATE58400.2024.10546754

Singh I, Raj B, Khosla M. (2024). SOT-MRAM Memories for Energy Efficient Embedded and AI Applications. Innovations in VLSI, Signal Processing and Computational Technologies, 13–24. Online publication date: 28-Jan-2024. https://doi.org/10.1007/978-981-99-7077-3_2

Badri S, Saini M, Goel N. (2023). An Efficient NVM-Based Architecture for Intermittent Computing Under Energy Constraints. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 31:6, 725–737. Online publication date: 1-Jun-2023. https://dl.acm.org/doi/10.1109/TVLSI.2023.3266555

Rai S, Talawar B. (2023). Nonvolatile Memory Technologies: Characteristics, Deployment, and Research Challenges. Frontiers of Quality Electronic Design (QED), 137–173. Online publication date: 12-Jan-2023. https://doi.org/10.1007/978-3-031-16344-9_4
