
Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you?

Published: 01 October 2007

Abstract

Component failure in large-scale IT installations is becoming an ever-larger problem as the number of components in a single cluster approaches a million.
This article is an extension of our previous study on disk failures [Schroeder and Gibson 2007] and presents and analyzes field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites. More than 110,000 disks are covered by this data, some for an entire lifetime of five years. The data includes drives with SCSI and FC, as well as SATA interfaces. The mean time-to-failure (MTTF) of those drives, as specified in their datasheets, ranges from 1,000,000 to 1,500,000 hours, suggesting a nominal annual failure rate of at most 0.88%.
We find that in the field, annual disk replacement rates typically exceed 1%, with 2–4% common and up to 13% observed on some systems. This suggests that field replacement is a fairly different process than one might predict based on datasheet MTTF.
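The nominal figure of 0.88% can be reproduced with a short calculation: divide the number of hours in a year by the datasheet MTTF. The sketch below is illustrative only. The function names are hypothetical, a year is assumed to be 8,760 hours, and the observed rates are the ones quoted in this abstract.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored (assumption)


def nominal_afr(mttf_hours: float) -> float:
    """Nominal annual failure rate implied by a datasheet MTTF, as a fraction."""
    return HOURS_PER_YEAR / mttf_hours


def excess_over_nominal(observed_rate: float, mttf_hours: float) -> float:
    """How many times larger an observed annual replacement rate is than the nominal AFR."""
    return observed_rate / nominal_afr(mttf_hours)


if __name__ == "__main__":
    # Datasheet MTTF range quoted in the abstract.
    for mttf in (1_000_000, 1_500_000):
        print(f"MTTF {mttf:>9,} h -> nominal AFR {nominal_afr(mttf):.2%}")

    # Observed annual replacement rates quoted in the abstract (2%, 4%, 13%),
    # compared against the most pessimistic datasheet value (1,000,000 hours).
    for observed in (0.02, 0.04, 0.13):
        ratio = excess_over_nominal(observed, 1_000_000)
        print(f"observed {observed:.0%} is {ratio:.1f}x the nominal rate")
```

Under these assumptions, an MTTF of 1,000,000 hours gives 0.88% and an MTTF of 1,500,000 hours gives 0.58%, so even the commonly observed 2–4% replacement rates are several times the datasheet-implied rate.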
We also find evidence, based on records of disk replacements in the field, that failure rate is not constant with age, and that rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation. In other words, the replacement rates in our data grew constantly with age, an effect often assumed not to set in until after a nominal lifetime of 5 years.
Interestingly, we observe little difference in replacement rates between SCSI, FC, and SATA drives, potentially an indication that disk-independent factors such as operating conditions affect replacement rates more than component-specific ones. On the other hand, we see only one instance of a customer rejecting an entire population of disks as a bad batch, in this case because of media error rates, and this instance involved SATA disks.
Time between replacement, a proxy for time between failure, is not well modeled by an exponential distribution and exhibits significant levels of correlation, including autocorrelation and long-range dependence.
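One common way to check claims of this kind is to compute two summary statistics over a sequence of times between replacements. For independent, exponentially distributed gaps, the squared coefficient of variation is 1 and the autocorrelation at any nonzero lag is close to 0. The sketch below is a minimal illustration, not the authors' analysis code: the function names are hypothetical, and the sample values are made up purely to exercise the functions.

```python
from statistics import mean, pvariance
from typing import Sequence


def squared_cv(gaps: Sequence[float]) -> float:
    """Squared coefficient of variation: variance divided by the squared mean.

    An exponential distribution has a value of exactly 1.
    """
    m = mean(gaps)
    return pvariance(gaps, m) / (m * m)


def autocorrelation(series: Sequence[float], lag: int) -> float:
    """Sample autocorrelation of the series at the given lag (0 <= lag < len)."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    m = mean(series)
    denom = sum((x - m) ** 2 for x in series)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    num = sum((series[i] - m) * (series[i + lag] - m) for i in range(n - lag))
    return num / denom


if __name__ == "__main__":
    # Hypothetical gaps (in days) between replacements: short gaps cluster
    # together and long gaps cluster together. NOT data from the study.
    gaps = [1, 1, 1, 1, 9, 9, 9, 9]
    print("squared CV:", squared_cv(gaps))
    print("lag-1 autocorrelation:", autocorrelation(gaps, 1))
```

A squared coefficient of variation far from 1, or autocorrelation well above 0 at small lags, would each be a sign that an independent exponential model fits poorly. Testing for long-range dependence requires other estimators (for example, of the Hurst exponent) that are not shown here.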

References

[1]
Bairavasundaram, L. N., Goodson, G. R., Pasupathy, S., and Schindler, J. 2007. An analysis of latent sector errors in disk drives. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS).
[2]
CFDR. 2007. The computer failure data repository. http://cfdr.usenix.org/.
[3]
Cole, G. 2000. Estimating drive reliability in desktop computers and consumer electronics systems. TP-338.1. Seagate Technology, November.
[4]
Corbett, P. F., English, R., Goel, A., Grcanac, T., Kleiman, S., Leong, J., and Sankar, S. 2004. Row-diagonal parity for double disk failure correction. In Proceedings of the Conference on File and Storage Technologies (FAST).
[5]
Drummer, D., Khurshudov, A., Riedel, E., and Watts, R. 2006. Personal communication.
[6]
Elerath, J. G. 2000a. AFR: Problems of definition, calculation and measurement in a commercial environment. In Proceedings of the Annual Reliability and Maintainability Symposium.
[7]
Elerath, J. G. 2000b. Specifying reliability in the disk drive industry: No more MTBFs. In Proceedings of the Annual Reliability and Maintainability Symposium.
[8]
Elerath, J. G. and Shah, S. 2004. Server class drives: How reliable are they? In Proceedings of the Annual Reliability and Maintainability Symposium.
[9]
Ghemawat, S., Gobioff, H., and Leung, S.-T. 2003. The Google file system. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP).
[10]
Gibson, G. A. 1992. Redundant disk arrays: Reliable, parallel secondary storage. Dissertation. MIT Press, New York.
[11]
Gray, J. 1990. A census of tandem system availability between 1985 and 1990. IEEE Trans. Reliabil. 39, 4.
[12]
Gray, J. 1986. Why do computers stop and what can be done about it. In Proceedings of the 5th Symposium on Reliability in Distributed Software and Database Systems.
[13]
Heath, T., Martin, R. P., and Nguyen, T. D. 2002. Improving cluster availability using workstation validation. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS).
[14]
Iyer, R. K., Rossetti, D. J., and Hsueh, M. C. 1986. Measurement and modeling of computer reliability as affected by system activity. ACM Trans. Comput. Syst. 4, 3.
[15]
Kalyanakrishnam, M., Kalbarczyk, Z., and Iyer, R. 1999. Failure data analysis of a LAN of Windows NT-based computers. In Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems.
[16]
Karagiannis, T. 2002. Selfis: A short tutorial. Tech. rep., University of California, Riverside.
[17]
Karagiannis, T., Molle, M., and Faloutsos, M. 2004. Long-range dependence: Ten years of internet traffic modeling. IEEE Internet Comput. 8, 5.
[18]
LANL. http://www.lanl.gov/projects/computerscience/data/.
[19]
Leland, W. E., Taqqu, M. S., Willinger, W., and Wilson, D. V. 1994. On the self-similar nature of ethernet traffic. IEEE/ACM Trans. Netw. 2, 1.
[20]
Lin, T.-T. Y. and Siewiorek, D. P. 1990. Error log analysis: Statistical modeling and heuristic trend analysis. IEEE Trans. Reliabil. 39, 4.
[21]
Meyer, J. and Wei, L. 1988. Analysis of workload influence on dependability. In Proceedings of the International Symposium on Fault-Tolerant Computing.
[22]
Murphy, B. and Gent, T. 1995. Measuring system and software reliability using an automated data collection process. Qual. Reliabil. Eng. Int. 11, 5.
[23]
NERSC. 2007. Systems disk failure. http://pdsi.nersc.gov/all_diskfailure.php.
[24]
Nurmi, D., Brevik, J., and Wolski, R. 2005. Modeling machine availability in enterprise and wide-area distributed computing environments. In International Euro-Par Conference on Parallel Processing.
[25]
Oppenheimer, D. L., Ganapathi, A., and Patterson, D. A. 2003. Why do internet services fail, and what can be done about it? In USENIX Symposium on Internet Technologies and Systems.
[26]
Patterson, D., Gibson, G., and Katz, R. 1988. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD).
[27]
Pinheiro, E., Weber, W. D., and Barroso, L. A. 2007. Failure trends in a large disk drive population. In Proceedings of the Conference on File and Storage Technologies (FAST).
[28]
Prabhakaran, V., Bairavasundaram, L. N., Agrawal, N., Gunawi, H. S., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. 2005. Iron file systems. In Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP).
[29]
Ross, S. M. Introduction to Probability Models. 6th edn. Academic Press.
[30]
Sahoo, R. K., Sivasubramaniam, A., Squillante, M. S., and Zhang, Y. 2004. Failure data analysis of a large-scale heterogeneous server environment. In Proceedings of the International Conference on Dependable Systems and Networks (DSN).
[31]
Schroeder, B. and Gibson, G. A. 2007. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proceedings of the Conference on File and Storage Technologies (FAST).
[32]
Schroeder, B. and Gibson, G. A. 2006. A large-scale study of failures in high-performance computing systems. In Proceedings of the International Conference on Dependable Systems and Networks (DSN).
[33]
Schwarz, T., Baker, M., Bassi, S., Baumgart, B., Flagg, W., van Ingen, C., Joste, K., Manasse, M., and Shah, M. 2006. Disk failure investigations at the internet archive. In NASA/IEEE Conference on Mass Storage Systems and Technologies (MSST) Work in Progress Session.
[34]
Talagala, N. and Patterson, D. 1999. An analysis of error behaviour in a large storage system. In The IEEE Workshop on Fault Tolerance in Parallel and Distributed Systems.
[35]
Tang, D., Iyer, R. K., and Subramani, S. S. 1990. Failure analysis and modelling of a VAX cluster system. In Proceedings of the International Symposium on Fault-tolerant Computing.
[36]
van Ingen, C. and Gray, J. 2005. Empirical measurements of disk failure rates and error rates. Tech. Rep. MSR-TR-2005-166, Microsoft Research, December.
[37]
Xu, J., Kalbarczyk, Z., and Iyer, R. K. 1999. Networked Windows NT system field failure data analysis. In Proceedings of the Pacific Rim International Symposium on Dependable Computing.
[38]
Yang, J. and Sun, F.-B. 1999. A comprehensive review of hard-disk drive reliability. In Proceedings of the Annual Reliability and Maintainability Symposium.



Published In

ACM Transactions on Storage, Volume 3, Issue 3
October 2007
183 pages
ISSN:1553-3077
EISSN:1553-3093
DOI:10.1145/1288783

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Hard drive replacements
  2. MTTF
  3. annual failure rates
  4. annual replacement rates
  5. datasheet MTTF
  6. failure correlation
  7. hard drive failure
  8. infant mortality
  9. storage reliability
  10. time between failure
  11. wear-out

Qualifiers

  • Article
