
Memory Errors in Modern Systems: The Good, The Bad, and The Ugly

Published: 14 March 2015
    Abstract

    Several recent publications have shown that hardware faults in the memory subsystem are commonplace. These faults are predicted to become more frequent in future systems that contain orders of magnitude more DRAM and SRAM than found in current memory subsystems. These memory subsystems will need to provide resilience techniques to tolerate these faults when deployed in high-performance computing systems and data centers containing tens of thousands of nodes. Therefore, it is critical to understand the efficacy of current hardware resilience techniques to determine whether they will be suitable for future systems. In this paper, we present a study of DRAM and SRAM faults and errors from the field. We use data from two leadership-class high-performance computer systems to analyze the reliability impact of hardware resilience schemes that are deployed in current systems. Our study has several key findings about the efficacy of many currently deployed reliability techniques such as DRAM ECC, DDR address/command parity, and SRAM ECC and parity. We also perform a methodological study, and find that counting errors instead of faults, a common practice among researchers and data center operators, can lead to incorrect conclusions about system reliability. Finally, we use our data to project the needs of future large-scale systems. We find that SRAM faults are unlikely to pose a significantly larger reliability threat in the future, while DRAM faults will be a major concern and stronger DRAM resilience schemes will be needed to maintain acceptable failure rates similar to those found on today's systems.
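    The distinction between counting errors and counting faults can be illustrated with a small sketch. It is not the authors' analysis: the log format, the node and DIMM names, and the rule that one fault equals one distinct (DIMM, address) location are all assumptions made for illustration, since the abstract does not say how errors are mapped to faults.

```python
from collections import Counter, defaultdict

# Hypothetical corrected-error log, one (node, dimm, address) tuple per logged error.
# node-a: a single faulty location that is hit repeatedly (e.g. a permanent fault).
# node-b: three different faulty locations, each observed once.
ERROR_LOG = (
    [("node-a", "dimm0", 0x1000)] * 500
    + [
        ("node-b", "dimm0", 0x2000),
        ("node-b", "dimm1", 0x3000),
        ("node-b", "dimm2", 0x4000),
    ]
)


def count_errors_per_node(log):
    """Count every logged error, regardless of its underlying cause."""
    return Counter(node for node, _dimm, _addr in log)


def count_faults_per_node(log):
    """Approximate faults as distinct (dimm, address) locations per node.

    This is an assumed simplification: a real fault can span many addresses
    (a row or bank, for example), which this grouping would overcount.
    """
    locations = defaultdict(set)
    for node, dimm, addr in log:
        locations[node].add((dimm, addr))
    return {node: len(locs) for node, locs in locations.items()}


if __name__ == "__main__":
    print("errors:", dict(count_errors_per_node(ERROR_LOG)))
    print("faults:", count_faults_per_node(ERROR_LOG))
```

    In this made-up log, the error count makes node-a look far worse than node-b (500 errors against 3), while the fault count ranks them the other way (1 fault against 3). That reversal is the kind of discrepancy the abstract warns about.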


        Published In

        ACM SIGPLAN Notices  Volume 50, Issue 4
        ASPLOS '15
        April 2015
        676 pages
        ISSN:0362-1340
        EISSN:1558-1160
        DOI:10.1145/2775054
        Editor: Andy Gill

        ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems
        March 2015
        720 pages
        ISBN:9781450328357
        DOI:10.1145/2694344

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 14 March 2015
        Published in SIGPLAN Volume 50, Issue 4

        Author Tags

        1. field studies
        2. large-scale systems
        3. reliability

        Qualifiers

        • Research-article

        Funding Sources

        • United States Department of Energy

        Article Metrics

        • Downloads (Last 12 months): 192
        • Downloads (Last 6 weeks): 12

        Cited By

        • (2024) HW-SW Interface Design and Implementation for Error Logging and Reporting for RAS Improvement. IEEE Access 12, 60081-60094. DOI: 10.1109/ACCESS.2024.3393844. Online publication date: 2024
        • (2023) Characterizing and Improving Resilience of Accelerators to Memory Errors in Autonomous Robots. ACM Transactions on Cyber-Physical Systems 8:3, 1-33. DOI: 10.1145/3627828. Online publication date: 23-Oct-2023
        • (2023) RISC-V-Based Evaluation and Strategy Exploration of MRAM Triple-Level Hybrid Cache Systems. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 31:7, 980-992. DOI: 10.1109/TVLSI.2023.3268108. Online publication date: 1-Jul-2023
        • (2022) CRP: Conditional Replacement Policy for Reliability Enhancement of STT-MRAM Caches. IEEE Transactions on Magnetics 58:7, 1-13. DOI: 10.1109/TMAG.2022.3175269. Online publication date: Jul-2022
        • (2022) An In-Depth Correlative Study Between DRAM Errors and Server Failures in Production Data Centers. 2022 41st International Symposium on Reliable Distributed Systems (SRDS), 262-272. DOI: 10.1109/SRDS55811.2022.00032. Online publication date: Sep-2022
        • (2022) Fault Management Framework and Multi-layer Recovery Methodology for Resilient System. 2022 6th International Conference on System Reliability and Safety (ICSRS), 32-39. DOI: 10.1109/ICSRS56243.2022.10067849. Online publication date: 23-Nov-2022
        • (2022) A Reliability-oriented Faults Taxonomy and a Recovery-oriented Methodological Approach for Systems Resilience. 2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC), 48-55. DOI: 10.1109/COMPSAC54236.2022.00016. Online publication date: Jun-2022
        • (2022) On-Die Dynamic Remapping Cache: Strong and Independent Protection Against Intermittent Faults. IEEE Access 10, 78970-78982. DOI: 10.1109/ACCESS.2022.3192879. Online publication date: 2022
        • (2022) Improving DRAM Energy-efficiency. Computing at the EDGE, 123-140. DOI: 10.1007/978-3-030-74536-3_5. Online publication date: 20-Sep-2022
        • (2021) Soteria: Towards Resilient Integrity-Protected and Encrypted Non-Volatile Memories. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 1214-1226. DOI: 10.1145/3466752.3480066. Online publication date: 18-Oct-2021
