DOI: 10.1109/DSN.2015.57
Article

Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field

Published: 22 June 2015

    Abstract

    Computing systems use dynamic random-access memory (DRAM) as main memory. As prior works have shown, failures in DRAM devices are an important source of errors in modern servers. To reduce the effects of memory errors, error correcting codes (ECC) have been developed to help detect and correct errors when they occur. In order to develop effective techniques, including new ECC mechanisms, to combat memory errors, it is important to understand the memory reliability trends in modern systems. In this paper, we analyze the memory errors in the entire fleet of servers at Facebook over the course of fourteen months, representing billions of device days. The systems we examine cover a wide range of devices commonly used in modern servers, with DIMMs manufactured by 4 vendors in capacities ranging from 2 GB to 24 GB that use the modern DDR3 communication protocol. We observe several new reliability trends for memory systems that have not been discussed before in the literature. We show that (1) memory errors follow a power-law, specifically, a Pareto distribution with decreasing hazard rate, with average error rate exceeding median error rate by around 55×, (2) non-DRAM memory failures from the memory controller and memory channel cause the majority of errors, and the hardware and software overheads to handle such errors cause a kind of denial of service attack in some servers, (3) using our detailed analysis, we provide the first evidence that more recent DRAM cell fabrication technologies (as indicated by chip density) have substantially higher failure rates, increasing by 1.8× over the previous generation, (4) DIMM architecture decisions affect memory reliability: DIMMs with fewer chips and lower transfer widths have the lowest error rates, likely due to electrical noise reduction, (5) while CPU and memory utilization do not show clear trends with respect to failure rates, workload type can influence failure rate by up to 6.5×, suggesting certain memory access patterns may induce more errors, (6) we develop a model for memory reliability and show how system design choices such as using lower density DIMMs and fewer cores per chip can reduce failure rates of a baseline server by up to 57.7%, and (7) we perform the first implementation and real-system analysis of page offlining at scale, showing that it can reduce memory error rate by 67%, and identify several real-world impediments to the technique.
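
To make finding (1) concrete, the following is a minimal sketch of why a Pareto distribution has a decreasing hazard rate and a mean well above its median. It uses only the textbook closed forms for a Pareto distribution. The function names and the shape values `alpha` are illustrative assumptions and are not the parameters fitted in the paper.

```python
import math


def pareto_hazard(x, alpha, x_min=1.0):
    """Hazard rate pdf(x) / survival(x), which simplifies to alpha / x."""
    if x < x_min:
        raise ValueError("x must be >= x_min")
    # pdf(x) = alpha * x_min**alpha / x**(alpha + 1); survival(x) = (x_min / x)**alpha
    return alpha / x


def pareto_mean(alpha, x_min=1.0):
    """Mean of a Pareto distribution; infinite when alpha <= 1."""
    if alpha <= 1:
        return math.inf
    return alpha * x_min / (alpha - 1)


def pareto_median(alpha, x_min=1.0):
    """Median: the point where the survival function equals 1/2."""
    return x_min * 2 ** (1 / alpha)


def mean_to_median_ratio(alpha, x_min=1.0):
    """How far the mean sits above the median for a given shape."""
    return pareto_mean(alpha, x_min) / pareto_median(alpha, x_min)


if __name__ == "__main__":
    # Hypothetical shape values, chosen only to show the trend.
    for alpha in (2.0, 1.5, 1.1):
        print(alpha, mean_to_median_ratio(alpha))
    # The hazard rate falls as x grows, for any fixed alpha.
    print(pareto_hazard(2.0, 1.5), pareto_hazard(4.0, 1.5))
```

As the shape parameter approaches 1 from above, the tail grows heavier and the mean-to-median ratio grows without bound, which is the kind of skew that a large gap between average and median error rates reflects.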



    Information

    Published In

    DSN '15: Proceedings of the 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
    June 2015
    573 pages
    ISBN:9781479986293

    Publisher

    IEEE Computer Society

    United States


    Author Tags

    1. DRAM
    2. main memory
    3. reliability
    4. warehouse-scale computing

    Qualifiers

    • Article


    Cited By

    • (2024) Understanding GPU Memory Corruption at Extreme Scale: The Summit Case Study. Proceedings of the 38th ACM International Conference on Supercomputing, pp. 188-200. DOI: 10.1145/3650200.3656615. Online publication date: 30-May-2024
    • (2024) OsirisBFT: Say No to Task Replication for Scalable Byzantine Fault Tolerant Analytics. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp. 94-108. DOI: 10.1145/3627535.3638468. Online publication date: 2-Mar-2024
    • (2023) How to Kill the Second Bird with One ECC: The Pursuit of Row Hammer Resilient DRAM. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 986-1001. DOI: 10.1145/3613424.3623777. Online publication date: 28-Oct-2023
    • (2023) Impact of Voltage Scaling on Soft Errors Susceptibility of Multicore Server CPUs. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 957-971. DOI: 10.1145/3613424.3614304. Online publication date: 28-Oct-2023
    • (2023) Predicting Future-System Reliability with a Component-Level DRAM Fault Model. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 944-956. DOI: 10.1145/3613424.3614294. Online publication date: 28-Oct-2023
    • (2023) Space Microdatacenters. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 900-915. DOI: 10.1145/3613424.3614271. Online publication date: 28-Oct-2023
    • (2023) RowPress: Amplifying Read Disturbance in Modern DRAM Chips. Proceedings of the 50th Annual International Symposium on Computer Architecture, pp. 1-18. DOI: 10.1145/3579371.3589063. Online publication date: 17-Jun-2023
    • (2023) Predicting GPU Failures With High Precision Under Deep Learning Workloads. Proceedings of the 16th ACM International Conference on Systems and Storage, pp. 124-135. DOI: 10.1145/3579370.3594777. Online publication date: 5-Jun-2023
    • (2022) Understanding Memory Failures on a Petascale Arm System. Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing, pp. 84-96. DOI: 10.1145/3502181.3531465. Online publication date: 27-Jun-2022
    • (2021) Predicting Uncorrectable Memory Errors from the Correctable Error History: No Free Predictors in the Field. Proceedings of the International Symposium on Memory Systems, pp. 1-10. DOI: 10.1145/3488423.3519316. Online publication date: 27-Sep-2021
