DOI: 10.1145/2745844.2745848

A Large-Scale Study of Flash Memory Failures in the Field

Published: 15 June 2015

Abstract

Servers use flash memory based solid state drives (SSDs) as a high-performance alternative to hard disk drives to store persistent data. Unfortunately, recent increases in flash density have also brought about decreases in chip-level reliability. In a data center environment, flash-based SSD failures can lead to downtime and, in the worst case, data loss. As a result, it is important to understand flash memory reliability characteristics over flash lifetime in a realistic production data center environment running modern applications and system software.
This paper presents the first large-scale study of flash-based SSD reliability in the field. We analyze data collected across a majority of flash-based solid state drives at Facebook data centers over nearly four years and many millions of operational hours in order to understand failure properties and trends of flash-based SSDs. Our study considers a variety of SSD characteristics, including: the amount of data written to and read from flash chips; how data is mapped within the SSD address space; the amount of data copied, erased, and discarded by the flash controller; and flash board temperature and bus power.
Based on our field analysis of how flash memory errors manifest when running modern workloads on modern SSDs, this paper is the first to make several major observations: (1) SSD failure rates do not increase monotonically with flash chip wear; instead they go through several distinct periods corresponding to how failures emerge and are subsequently detected, (2) the effects of read disturbance errors are not prevalent in the field, (3) sparse logical data layout across an SSD's physical address space (e.g., non-contiguous data), as measured by the amount of metadata required to track logical address translations stored in an SSD-internal DRAM buffer, can greatly affect SSD failure rate, (4) higher temperatures lead to higher failure rates, but techniques that throttle SSD operation appear to greatly reduce the negative reliability impact of higher temperatures, and (5) data written by the operating system to flash-based SSDs does not always accurately indicate the amount of wear induced on flash cells due to optimizations in the SSD controller and buffering employed in the system software. We hope that the findings of this first large-scale flash memory reliability study can inspire others to develop other publicly-available analyses and novel flash reliability solutions.

    Published In

    SIGMETRICS '15: Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems
    June 2015
    488 pages
    ISBN:9781450334860
    DOI:10.1145/2745844
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. flash memory
    2. reliability
    3. warehouse-scale data centers

    Qualifiers

    • Research-article

    Acceptance Rates

    SIGMETRICS '15 Paper Acceptance Rate 32 of 239 submissions, 13%;
    Overall Acceptance Rate 459 of 2,691 submissions, 17%
