Article

Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field

Authors: Justin Meza, Qiang Wu, Sanjeev Kumar, Onur Mutlu

DSN '15: Proceedings of the 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks

Pages 415 - 426

https://doi.org/10.1109/DSN.2015.57

Published: 22 June 2015

Abstract

Computing systems use dynamic random-access memory (DRAM) as main memory. As prior works have shown, failures in DRAM devices are an important source of errors in modern servers. To reduce the effects of memory errors, error correcting codes (ECC) have been developed to help detect and correct errors when they occur. In order to develop effective techniques, including new ECC mechanisms, to combat memory errors, it is important to understand the memory reliability trends in modern systems. In this paper, we analyze the memory errors in the entire fleet of servers at Facebook over the course of fourteen months, representing billions of device days. The systems we examine cover a wide range of devices commonly used in modern servers, with DIMMs manufactured by 4 vendors in capacities ranging from 2 GB to 24 GB that use the modern DDR3 communication protocol. We observe several new reliability trends for memory systems that have not been discussed before in the literature. We show that (1) memory errors follow a power-law distribution, specifically a Pareto distribution with decreasing hazard rate, with the average error rate exceeding the median error rate by around 55×, (2) non-DRAM memory failures from the memory controller and memory channel cause the majority of errors, and the hardware and software overheads to handle such errors cause a kind of denial of service attack in some servers, (3) using our detailed analysis, we provide the first evidence that more recent DRAM cell fabrication technologies (as indicated by chip density) have substantially higher failure rates, increasing by 1.8× over the previous generation, (4) DIMM architecture decisions affect memory reliability: DIMMs with fewer chips and lower transfer widths have the lowest error rates, likely due to electrical noise reduction, (5) while CPU and memory utilization do not show clear trends with respect to failure rates, workload type can influence failure rate by up to 6.5×, suggesting certain memory access patterns may induce more errors, (6) we develop a model for memory reliability and show how system design choices such as using lower density DIMMs and fewer cores per chip can reduce failure rates of a baseline server by up to 57.7%, and (7) we perform the first implementation and real-system analysis of page offlining at scale, showing that it can reduce memory error rate by 67%, and identify several real-world impediments to the technique.
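As a rough illustration of finding (1), the sketch below uses the textbook Pareto Type I distribution to show why its hazard rate decreases and why its mean can far exceed its median. The function names and the shape values (`alpha`) are assumptions chosen for illustration; they are not the parameters fitted in the paper.

```python
def pareto_hazard(x, alpha, x_min=1.0):
    """Hazard rate of a Pareto Type I distribution: h(x) = alpha / x.

    It follows from pdf / survival, where
    pdf(x) = alpha * x_min**alpha / x**(alpha + 1) and
    survival(x) = (x_min / x)**alpha, for x >= x_min.
    """
    if x < x_min:
        raise ValueError("x must be at least x_min")
    return alpha / x


def pareto_mean_to_median(alpha):
    """Ratio of mean to median for Pareto Type I (finite only if alpha > 1).

    mean = alpha * x_min / (alpha - 1), median = x_min * 2**(1 / alpha),
    so x_min cancels out of the ratio.
    """
    if alpha <= 1.0:
        raise ValueError("the mean is infinite for alpha <= 1")
    return alpha / ((alpha - 1.0) * 2.0 ** (1.0 / alpha))


if __name__ == "__main__":
    # The hazard falls as x grows: the longer the tail is survived,
    # the lower the instantaneous rate.
    for x in (1.0, 2.0, 4.0, 8.0):
        print(f"h({x}) = {pareto_hazard(x, alpha=2.0)}")

    # The mean-to-median gap widens sharply as alpha approaches 1.
    for alpha in (2.0, 1.5, 1.1, 1.01):
        print(f"alpha={alpha}: mean/median = {pareto_mean_to_median(alpha):.2f}")
```

In this idealized model, a mean-to-median ratio in the tens only arises when the shape parameter is very close to 1, which is one way to read a heavy tail in which a small number of servers account for most errors.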


Recommendations

A study of DRAM failures in the field
SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Most modern computer systems use dynamic random access memory (DRAM) as a main memory store. Recent publications have confirmed that DRAM errors are a common source of failures in the field. Therefore, further attention to the faults experienced by DRAM ...
Low Overhead Software Wear Leveling for Hybrid PCM + DRAM Main Memory on Embedded Systems
Phase change memory (PCM) is a promising DRAM replacement in embedded systems due to its attractive characteristics, such as low-cost, shock-resistivity, nonvolatility, high density, and low leakage power. However, relatively low endurance has limited its ...
A Novel Memory Block Management Scheme for PCM Using WOM-Code
HPCC-CSS-ICESS '15: Proceedings of the 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conf on Embedded Software and Systems

Phase Change Memory (PCM) is a promising DRAM replacement in embedded systems due to its attractive characteristics including low static power consumption and high density. However, long write latency is one of the major drawbacks in current PCM ...

Published In

DSN '15: Proceedings of the 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks

June 2015, 573 pages

ISBN: 9781479986293

Publisher: IEEE Computer Society, United States

Bibliometrics

Total Citations: 49
Total Downloads: 0
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 10 Aug 2024

Cited By

Oles V, Schmedding A, Ostrouchov G, Shin W, Smirni E, Engelmann C (2024). Understanding GPU Memory Corruption at Extreme Scale: The Summit Case Study. Proceedings of the 38th ACM International Conference on Supercomputing, pp. 188-200. Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3650200.3656615
Jamshidi K, Vora K, Lee I, Chabbi M, Steuwer M (2024). OsirisBFT: Say No to Task Replication for Scalable Byzantine Fault Tolerant Analytics. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp. 94-108. Online publication date: 2-Mar-2024
https://dl.acm.org/doi/10.1145/3627535.3638468
Kim M, Wi M, Park J, Ko S, Choi J, Nam H, Kim N, Ahn J, Lee E (2023). How to Kill the Second Bird with One ECC: The Pursuit of Row Hammer Resilient DRAM. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 986-1001. Online publication date: 28-Oct-2023
https://dl.acm.org/doi/10.1145/3613424.3623777
Agiakatsikas D, Papadimitriou G, Karakostas V, Gizopoulos D, Psarakis M, Belanger-Champagne C, Blackmore E (2023). Impact of Voltage Scaling on Soft Errors Susceptibility of Multicore Server CPUs. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 957-971. Online publication date: 28-Oct-2023
https://dl.acm.org/doi/10.1145/3613424.3614304
Jung J, Erez M (2023). Predicting Future-System Reliability with a Component-Level DRAM Fault Model. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 944-956. Online publication date: 28-Oct-2023
https://dl.acm.org/doi/10.1145/3613424.3614294
Bleier N, Mubarik M, Swenson G, Kumar R (2023). Space Microdatacenters. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 900-915. Online publication date: 28-Oct-2023
https://dl.acm.org/doi/10.1145/3613424.3614271
Luo H, Olgun A, Yağlıkçı A, Tuğrul Y, Rhyner S, Cavlak M, Lindegger J, Sadrosadati M, Mutlu O, Solihin Y, Heinrich M (2023). RowPress: Amplifying Read Disturbance in Modern DRAM Chips. Proceedings of the 50th Annual International Symposium on Computer Architecture, pp. 1-18. Online publication date: 17-Jun-2023
https://dl.acm.org/doi/10.1145/3579371.3589063
Liu H, Li Z, Tan C, Yang R, Cao G, Liu Z, Guo C, Gilad Y, Kostic D, Moatti Y, Biran O (2023). Predicting GPU Failures With High Precision Under Deep Learning Workloads. Proceedings of the 16th ACM International Conference on Systems and Storage, pp. 124-135. Online publication date: 5-Jun-2023
https://dl.acm.org/doi/10.1145/3579370.3594777
Ferreira K, Levy S, Hemmert J, Pedretti K, Weissman J, Chandra A, Gavrilovska A, Tiwari D (2022). Understanding Memory Failures on a Petascale Arm System. Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing, pp. 84-96. Online publication date: 27-Jun-2022
https://dl.acm.org/doi/10.1145/3502181.3531465
Du X, Li C (2021). Predicting Uncorrectable Memory Errors from the Correctable Error History: No Free Predictors in the Field. Proceedings of the International Symposium on Memory Systems, pp. 1-10. Online publication date: 27-Sep-2021
https://dl.acm.org/doi/10.1145/3488423.3519316
