research-article
Open access

Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost

Published: 06 January 2016 Publication History

Abstract

3D-stacked DRAM alleviates the limited memory bandwidth bottleneck that exists in modern systems by leveraging through-silicon vias (TSVs) to deliver higher external memory channel bandwidth. Today’s systems, however, cannot fully utilize the higher bandwidth offered by TSVs, due to the limited internal bandwidth within each layer of the 3D-stacked DRAM. We identify that the bottleneck to enabling higher bandwidth in 3D-stacked DRAM is now the global bitline interface, the connection between the DRAM row buffer and the peripheral IO circuits. The global bitline interface consists of a limited and expensive set of wires and structures, called global bitlines and global sense amplifiers, whose high cost makes it difficult to simply scale up the bandwidth of the interface within a single DRAM layer in the 3D stack. We alleviate this bandwidth bottleneck by exploiting the observation that several global bitline interfaces already exist across the multiple DRAM layers in current 3D-stacked designs, but only a fraction of them are enabled at the same time.
We propose a new 3D-stacked DRAM architecture, called Simultaneous Multi-Layer Access (SMLA), which increases the internal DRAM bandwidth by accessing multiple DRAM layers concurrently, thus making much greater use of the bandwidth that the TSVs offer. To avoid channel contention, the DRAM layers must coordinate with each other when simultaneously transferring data. We propose two approaches to coordination, both of which deliver four times the bandwidth for a four-layer DRAM, over a baseline that accesses only one layer at a time. Our first approach, Dedicated-IO, statically partitions the TSVs by assigning each layer to a dedicated set of TSVs that operate at a higher frequency. Unfortunately, Dedicated-IO requires a nonuniform design for each layer (increasing manufacturing costs), and its DRAM energy consumption scales linearly with the number of layers. Our second approach, Cascaded-IO, solves both issues by instead time-multiplexing all of the TSVs across layers. Cascaded-IO reduces DRAM energy consumption by lowering the operating frequency of higher layers. Our evaluations show that SMLA provides significant performance improvement and energy reduction across a variety of workloads (on average, a 55% performance improvement and an 18% energy reduction for multiprogrammed workloads) over a baseline 3D-stacked DRAM, with low overhead.
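
The fourfold bandwidth claim for a four-layer stack can be checked with a back-of-the-envelope model. The sketch below is only an illustration under stated assumptions, not the evaluation methodology of the paper: the names `Stack`, `baseline_bw`, `dedicated_io_bw`, and `cascaded_io_bw` are hypothetical, the TSV count and base frequency are made-up example values, each TSV is assumed to carry one bit per cycle, and the TSV frequency is assumed to scale by the number of layers in both schemes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stack:
    """Toy description of a 3D-stacked DRAM (all values are assumptions)."""
    layers: int            # number of DRAM layers in the stack
    tsvs: int              # data TSVs shared by the whole stack
    base_freq_mhz: float   # TSV frequency when only one layer is accessed


def baseline_bw(s: Stack) -> float:
    """One layer at a time drives every TSV at the base frequency (Mbit/s)."""
    return s.tsvs * s.base_freq_mhz


def dedicated_io_bw(s: Stack) -> float:
    """Static partitioning: each layer owns tsvs/layers TSVs, clocked faster."""
    assert s.tsvs % s.layers == 0, "TSVs must divide evenly among layers"
    tsvs_per_layer = s.tsvs // s.layers
    fast_freq = s.base_freq_mhz * s.layers   # assumed frequency scaling
    per_layer = tsvs_per_layer * fast_freq   # equals the baseline bandwidth
    return s.layers * per_layer


def cascaded_io_bw(s: Stack) -> float:
    """Time multiplexing: all layers share all TSVs, one time slot per layer."""
    fast_freq = s.base_freq_mhz * s.layers   # assumed frequency scaling
    per_layer = s.tsvs * fast_freq / s.layers  # each layer gets 1/layers of slots
    return s.layers * per_layer


if __name__ == "__main__":
    stack = Stack(layers=4, tsvs=128, base_freq_mhz=200.0)  # example values
    base = baseline_bw(stack)
    print("Dedicated-IO speedup:", dedicated_io_bw(stack) / base)
    print("Cascaded-IO speedup:", cascaded_io_bw(stack) / base)
```

In this toy model both schemes reach the same aggregate bandwidth, and each layer still supplies only one baseline's worth of data, which matches the point that the per-layer global bitline interface does not need to be widened. The model deliberately ignores the differences the abstract highlights between the two schemes, namely per-layer design uniformity and energy consumption.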

Supplementary Material

TACO1204-63 (taco1204-63.pdf)
Slide deck associated with this paper

    Published In

    ACM Transactions on Architecture and Code Optimization  Volume 12, Issue 4
    January 2016
    848 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/2836331

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 06 January 2016
    Accepted: 01 September 2015
    Revised: 01 September 2015
    Received: 01 June 2015
    Published in TACO Volume 12, Issue 4

    Author Tag

    1. 3D-stacked DRAM

    Qualifiers

    • Research-article
    • Research
    • Refereed
