research-article

Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design

Published: 03 March 2012

    Abstract

    Main memory is one of the leading hardware causes of machine crashes in today's datacenters. Designing, evaluating, and modeling systems that are resilient against memory errors requires a good understanding of the underlying characteristics of DRAM errors in the field. While there have recently been a few initial studies of DRAM errors in production systems, these have been too limited in either the size of the data set or the granularity of the data to conclusively answer many of the open questions about DRAM errors. Such questions include, for example, the prevalence of soft errors compared to hard errors and the typical patterns of hard errors. In this paper, we study data on DRAM errors collected on a diverse range of production systems, covering nearly 300 terabyte-years of main memory in total. As a first contribution, we provide a detailed analytical study of DRAM error characteristics, including both hard and soft errors. We find that a large fraction of DRAM errors in the field can be attributed to hard errors, and we characterize these hard errors in detail. As a second contribution, the paper uses the results of the measurement study to identify a number of promising directions for designing more resilient systems and evaluates the potential of different protection mechanisms in light of realistic error patterns. One of our findings is that simple page retirement policies might be able to mask a large number of DRAM errors in production systems while sacrificing only a negligible fraction of the total DRAM in the system.
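
To make the page retirement idea concrete, the following is a minimal sketch, not one of the policies evaluated in the paper. It assumes a hypothetical policy that retires a page once it has seen `threshold` errors, a 4 KiB page size, and an error log given as a plain list of physical addresses. The names `simulate_retirement` and `retired_fraction` are invented for illustration. Errors that land on an already retired page are counted as masked, which is where repeating hard errors would be absorbed.

```python
from collections import Counter

PAGE_SIZE = 4096  # bytes; assumed page size, not taken from the paper


def simulate_retirement(error_addresses, threshold=1, page_size=PAGE_SIZE):
    """Replay error addresses under a hypothetical retire-after-N-errors policy.

    Returns (masked, retired): the number of errors that fell on a page that
    had already been retired, and the set of retired page numbers.
    """
    counts = Counter()   # errors observed per page that is still in use
    retired = set()      # page numbers taken out of service
    masked = 0
    for addr in error_addresses:
        page = addr // page_size
        if page in retired:
            # The page is no longer in use, so this error would not occur.
            masked += 1
            continue
        counts[page] += 1
        if counts[page] >= threshold:
            retired.add(page)
    return masked, retired


def retired_fraction(retired, total_bytes, page_size=PAGE_SIZE):
    """Fraction of total memory given up to retired pages."""
    return len(retired) * page_size / total_bytes


if __name__ == "__main__":
    # Made-up log: four errors on page 1 (a repeating fault), one on page 9.
    log = [0x1000, 0x1008, 0x1010, 0x9000, 0x1020]
    masked, retired = simulate_retirement(log, threshold=1)
    print("masked errors:", masked)
    print("retired pages:", sorted(retired))
    print("fraction of 1 GiB retired:", retired_fraction(retired, 2**30))
```

In this toy log, the errors that repeat on the same page are masked after the first retirement, while the memory given up is a few pages out of an assumed 1 GiB. Raising `threshold` trades fewer retired pages for fewer masked errors.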

      Published In

      ACM SIGPLAN Notices, Volume 47, Issue 4
      ASPLOS '12
      April 2012
      453 pages
      ISSN:0362-1340
      EISSN:1558-1160
      DOI:10.1145/2248487
      • ACM Conferences
        ASPLOS XVII: Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
        March 2012
        476 pages
        ISBN:9781450307598
        DOI:10.1145/2150976
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. DRAM errors
      2. correctable errors
      3. field study
      4. reliability
      5. uncorrectable errors

      Qualifiers

      • Research-article

      Bibliometrics & Citations

      Article Metrics

      • Downloads (last 12 months): 73
      • Downloads (last 6 weeks): 6
      Reflects downloads up to 11 Aug 2024

      Citations

      Cited By

      • (2024) SoK: Rowhammer on Commodity Operating Systems. Proceedings of the 19th ACM Asia Conference on Computer and Communications Security, pages 436-452. DOI: 10.1145/3634737.3656998. Online publication date: 1-Jul-2024.
      • (2023) Mars Attacks! Proceedings of the 22nd ACM Workshop on Hot Topics in Networks, pages 245-253. DOI: 10.1145/3626111.3628199. Online publication date: 28-Nov-2023.
      • (2023) Predicting Future-System Reliability with a Component-Level DRAM Fault Model. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pages 944-956. DOI: 10.1145/3613424.3614294. Online publication date: 28-Oct-2023.
      • (2023) CSI:Rowhammer – Cryptographic Security and Integrity against Rowhammer. 2023 IEEE Symposium on Security and Privacy (SP), pages 1702-1718. DOI: 10.1109/SP46215.2023.10179390. Online publication date: May-2023.
      • (2023) Exploration of Bitflip’s Effect on Deep Neural Network Accuracy in Plaintext and Ciphertext. IEEE Micro, 43:5, pages 24-34. DOI: 10.1109/MM.2023.3273115. Online publication date: 5-May-2023.
      • (2023) Workload Failure Prediction for Data Centers. 2023 IEEE 16th International Conference on Cloud Computing (CLOUD), pages 479-485. DOI: 10.1109/CLOUD60044.2023.00064. Online publication date: Jul-2023.
      • (2022) GPU Devices for Safety-Critical Systems: A Survey. ACM Computing Surveys, 55:7, pages 1-37. DOI: 10.1145/3549526. Online publication date: 15-Dec-2022.
      • (2022) Hyperdimensional hashing. Proceedings of the 59th ACM/IEEE Design Automation Conference, pages 907-912. DOI: 10.1145/3489517.3530553. Online publication date: 10-Jul-2022.
      • (2022) Performance and Power Estimation of STT-MRAM Main Memory with Reliable System-level Simulation. ACM Transactions on Embedded Computing Systems, 21:1, pages 1-25. DOI: 10.1145/3476838. Online publication date: 14-Jan-2022.
      • (2022) ECMO: ECC Architecture Reusing Content-Addressable Memories for Obtaining High Reliability in DRAM. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 30:6, pages 781-793. DOI: 10.1109/TVLSI.2022.3153894. Online publication date: 1-Jun-2022.
