DOI: 10.1109/MICRO.2014.57

Citadel: Efficiently Protecting Stacked Memory from Large Granularity Failures

Published: 13 December 2014

Abstract

Stacked memory modules are likely to be tightly integrated with the processor. It is vital that these memory modules operate reliably, as a memory failure can require the replacement of the entire socket. To make matters worse, stacked memory designs are susceptible to newer failure modes (for example, due to faulty through-silicon vias, or TSVs) that can cause large portions of memory, such as a bank, to become faulty. To avoid data loss from large-granularity failures, the memory system may use symbol-based codes that stripe the data for a cache line across several banks (or channels). Unfortunately, such data striping reduces memory-level parallelism, causing significant slowdown and higher power consumption.
This paper proposes Citadel, a robust memory architecture that allows the memory system to retain each cache line within one bank, thus allowing high performance and lower power while efficiently protecting the stacked memory from large-granularity failures. Citadel consists of three components: TSV-Swap, which can tolerate both faulty data-TSVs and faulty address-TSVs; Tri Dimensional Parity (3DP), which can tolerate column failures, row failures, and bank failures; and Dynamic Dual Granularity Sparing (DDS), which can mitigate permanent faults by dynamically sparing faulty memory regions at either a row granularity or a bank granularity. Our evaluations with real-world data for DRAM failures show that Citadel provides performance and power similar to maintaining the entire cache line in the same bank, and yet provides 700x higher reliability than ChipKill-like ECC.
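
As a rough illustration of how parity can protect against a bank failure without striping a cache line across banks, the sketch below keeps every word inside its own bank and stores one extra parity "bank" that is the XOR of all data banks. This is not the paper's 3DP design: it models only a single parity dimension, and the memory layout, the function names `bank_parity` and `rebuild_bank`, and the toy sizes are all assumptions made for the example.

```python
from functools import reduce
from operator import xor

def bank_parity(banks):
    """XOR the words at the same (row, col) position across all banks."""
    rows, cols = len(banks[0]), len(banks[0][0])
    return [[reduce(xor, (b[r][c] for b in banks)) for c in range(cols)]
            for r in range(rows)]

def rebuild_bank(banks, parity, failed):
    """Reconstruct bank `failed` from the stored parity and the surviving banks."""
    survivors = [b for i, b in enumerate(banks) if i != failed]
    rows, cols = len(parity), len(parity[0])
    return [[reduce(xor, (b[r][c] for b in survivors), parity[r][c])
             for c in range(cols)]
            for r in range(rows)]

# Toy memory (hypothetical sizes): 4 banks, 2 rows, 4 one-byte words per row.
banks = [[[(7 * b + 3 * r + c) & 0xFF for c in range(4)] for r in range(2)]
         for b in range(4)]
parity = bank_parity(banks)

# Simulate a whole-bank failure by wiping bank 2, then rebuild it.
original = banks[2]
banks[2] = [[0] * 4 for _ in range(2)]
recovered = rebuild_bank(banks, parity, failed=2)
```

Because each word lives in exactly one bank, an ordinary read touches a single bank; the other banks and the parity are read only when a failed bank has to be rebuilt. Per the abstract, the actual 3DP scheme extends protection to column and row failures as well, which this single-dimension sketch does not attempt.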

        Published In

        MICRO-47: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture
        December 2014
        697 pages
        ISBN:9781479969982

        Publisher

        IEEE Computer Society

        United States

        Author Tags

        1. DRAM
        2. Error Correcting Code
        3. Faults
        4. Resilience
        5. Stacked Memory
        6. Through Silicon Vias

        Qualifiers

        • Tutorial
        • Research
        • Refereed limited

        Conference

        MICRO-47

        Acceptance Rates

        MICRO-47 Paper Acceptance Rate 53 of 279 submissions, 19%;
        Overall Acceptance Rate 484 of 2,242 submissions, 22%

        Cited By

        • (2021) Dvé. Proceedings of the 48th Annual International Symposium on Computer Architecture, pp. 526-539. DOI: 10.1109/ISCA52012.2021.00048. Online publication date: 14-Jun-2021.
        • (2019) Touché. Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 453-465. DOI: 10.1145/3352460.3358281. Online publication date: 12-Oct-2019.
        • (2019) DRIS-3. Proceedings of the 56th Annual Design Automation Conference 2019, pp. 1-6. DOI: 10.1145/3316781.3317805. Online publication date: 2-Jun-2019.
        • (2018) Attaché. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, pp. 326-338. DOI: 10.1109/MICRO.2018.00034. Online publication date: 20-Oct-2018.
        • (2017) Understanding Reduced-Voltage Operation in Modern DRAM Devices. Proceedings of the ACM on Measurement and Analysis of Computing Systems 1:1, pp. 1-42. DOI: 10.1145/3084447. Online publication date: 13-Jun-2017.
        • (2016) XED. ACM SIGARCH Computer Architecture News 44:3, pp. 341-353. DOI: 10.1145/3007787.3001174. Online publication date: 18-Jun-2016.
        • (2016) Reliability and Performance Trade-off Study of Heterogeneous Memories. Proceedings of the Second International Symposium on Memory Systems, pp. 395-401. DOI: 10.1145/2989081.2989113. Online publication date: 3-Oct-2016.
        • (2016) RATT-ECC. ACM Transactions on Architecture and Code Optimization 13:3, pp. 1-24. DOI: 10.1145/2957758. Online publication date: 17-Sep-2016.
        • (2016) Citadel. ACM Transactions on Architecture and Code Optimization 12:4, pp. 1-24. DOI: 10.1145/2840807. Online publication date: 6-Jan-2016.
        • (2016) XED. Proceedings of the 43rd International Symposium on Computer Architecture, pp. 341-353. DOI: 10.1109/ISCA.2016.38. Online publication date: 18-Jun-2016.
