DOI: 10.1145/2694344.2694348
research-article

Memory Errors in Modern Systems: The Good, The Bad, and The Ugly

Published: 14 March 2015

Abstract

    Several recent publications have shown that hardware faults in the memory subsystem are commonplace. These faults are predicted to become more frequent in future systems that contain orders of magnitude more DRAM and SRAM than found in current memory subsystems. These memory subsystems will need to provide resilience techniques to tolerate these faults when deployed in high-performance computing systems and data centers containing tens of thousands of nodes. Therefore, it is critical to understand the efficacy of current hardware resilience techniques to determine whether they will be suitable for future systems. In this paper, we present a study of DRAM and SRAM faults and errors from the field. We use data from two leadership-class high-performance computer systems to analyze the reliability impact of hardware resilience schemes that are deployed in current systems. Our study has several key findings about the efficacy of many currently deployed reliability techniques such as DRAM ECC, DDR address/command parity, and SRAM ECC and parity. We also perform a methodological study, and find that counting errors instead of faults, a common practice among researchers and data center operators, can lead to incorrect conclusions about system reliability. Finally, we use our data to project the needs of future large-scale systems. We find that SRAM faults are unlikely to pose a significantly larger reliability threat in the future, while DRAM faults will be a major concern and stronger DRAM resilience schemes will be needed to maintain acceptable failure rates similar to those found on today's systems.
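
The abstract's point about counting errors instead of faults can be illustrated with a small sketch. It assumes a made-up error log and, as a deliberate simplification, treats every distinct (node, DIMM) pair as one fault. The abstract does not describe how the paper itself identifies faults, so that method may differ. All names and values below are hypothetical.

```python
from collections import Counter

# Synthetic error log: one (node, dimm) entry per logged error.
# All values are invented for illustration only.
error_log = (
    [("node-A", "dimm0")] * 1000  # one permanent fault, re-reported on each access
    + [("node-B", "dimm3")]       # a transient fault observed once
    + [("node-C", "dimm1")]       # another transient fault observed once
)

# Error counting: every log entry counts.
errors_per_node = Counter(node for node, _ in error_log)

# Fault counting (simplified assumption): each distinct (node, dimm)
# location counts once, however many errors it produced.
faults_per_node = Counter(node for node, _ in set(error_log))

total_errors = sum(errors_per_node.values())
total_faults = sum(faults_per_node.values())

print("errors per node:", dict(errors_per_node))
print("faults per node:", dict(faults_per_node))
print("total errors:", total_errors, "total faults:", total_faults)
```

In this synthetic log, node-A accounts for nearly all errors yet has the same number of faults as each of the other nodes, so ranking nodes by error count would overstate its share of the underlying hardware problems.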


        Published In

        ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems
        March 2015
        720 pages
        ISBN: 9781450328357
        DOI: 10.1145/2694344

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. field studies
        2. large-scale systems
        3. reliability

        Qualifiers

        • Research-article

        Funding Sources

        • United States Department of Energy

        Conference

        ASPLOS '15

        Acceptance Rates

        ASPLOS '15 Paper Acceptance Rate 48 of 287 submissions, 17%;
        Overall Acceptance Rate 535 of 2,713 submissions, 20%
