
Reliability Analysis of SSDs Under Power Fault

Published: 01 November 2016

Abstract

Modern storage technology (solid-state disks (SSDs), NoSQL databases, commoditized RAID hardware, etc.) brings new reliability challenges to the already-complicated storage stack. Among other things, the behavior of these new components during power faults—which happen relatively frequently in data centers—is an important yet mostly ignored issue in this dependability-critical area. Understanding how new storage components behave under power fault is the first step towards designing new robust storage systems.
In this article, we propose a new methodology to expose reliability issues in block devices under power faults. Our framework includes specially designed hardware to inject power faults directly into devices, workloads to stress storage components, and techniques to detect various types of failures. Applying our testing framework, we test 17 commodity SSDs from six different vendors using more than three thousand fault injection cycles in total. Our experimental results reveal that 14 of the 17 tested SSD devices exhibit surprising failure behaviors under power faults, including bit corruption, shorn writes, unserializable writes, metadata corruption, and total device failure.
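
To make two of the failure types named above concrete, the following is a minimal Python sketch of how a checker might tell a shorn write from bit corruption after a power fault. It is an illustration under stated assumptions, not the authors' record format or detection code: the 4096-byte block, the 512-byte sector granularity, the header layout, and the names `make_record`, `checksum_ok`, and `classify` are all hypothetical.

```python
import struct
import zlib

BLOCK_SIZE = 4096    # assumed size of one test record
SECTOR_SIZE = 512    # assumed granularity at which a write may be shorn
# Hypothetical header: sequence number, block address, CRC32 of the payload.
HEADER = struct.Struct("<QQI")


def make_record(seq: int, addr: int, fill: int) -> bytes:
    """Build a self-describing record padded to BLOCK_SIZE bytes."""
    payload = bytes([fill]) * (BLOCK_SIZE - HEADER.size)
    return HEADER.pack(seq, addr, zlib.crc32(payload)) + payload


def checksum_ok(record: bytes) -> bool:
    """Check the payload against the CRC32 stored in the record header."""
    _seq, _addr, crc = HEADER.unpack_from(record)
    return zlib.crc32(record[HEADER.size:]) == crc


def classify(read_back: bytes, old: bytes, new: bytes) -> str:
    """Compare a block read after the fault with its old and new contents."""
    if read_back == new:
        return "intact"
    if read_back == old:
        return "write not applied"
    # Shorn write: every sector is wholly old or wholly new, but they are mixed.
    offsets = range(0, BLOCK_SIZE, SECTOR_SIZE)
    if all(
        read_back[i:i + SECTOR_SIZE]
        in (old[i:i + SECTOR_SIZE], new[i:i + SECTOR_SIZE])
        for i in offsets
    ):
        return "shorn write"
    # Anything else contains bytes that neither version ever held.
    return "bit corruption"
```

In this sketch, `classify` assumes the checker knows both the previous and the intended contents of the block, while `checksum_ok` covers the case where only the record itself is available. Detecting unserializable writes or metadata corruption would need additional bookkeeping that is not shown here.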


      Published In

      ACM Transactions on Computer Systems, Volume 34, Issue 4
      January 2017
      93 pages
      ISSN: 0734-2071
      EISSN: 1557-7333
      DOI: 10.1145/3014162

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 01 November 2016
      Accepted: 01 August 2016
      Revised: 01 May 2016
      Received: 01 January 2015
      Published in TOCS Volume 34, Issue 4

      Author Tags

      1. SSD
      2. Storage systems
      3. fault injection
      4. flash memory
      5. power failure

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • Division of Computer and Network Systems, National Science Foundation
      • Division of Computing and Communication Foundations, National Science Foundation
