DOI: 10.1145/2807591.2807615
Research article

A practical approach to reconciling availability, performance, and capacity in provisioning extreme-scale storage systems

Published: 15 November 2015

Abstract

The increasing data demands from high-performance computing applications significantly accelerate the capacity, capability and reliability requirements of storage systems. As systems scale, component failures and repair times increase, significantly impacting data availability. A wide array of decision points must be balanced in designing such systems.
We propose a systematic approach that balances and optimizes both initial and continuous spare provisioning, based on a detailed investigation of the anatomy of extreme-scale storage systems and an analysis of their field failure data. We simultaneously consider each component's failure characteristics and its cost and impact at the system level. We build a tool to evaluate different provisioning schemes, and the results demonstrate that our optimized provisioning can reduce the duration of data unavailability by as much as 52% under a fixed budget. We also observe that non-disk components have much higher failure rates than disks and warrant careful consideration in the overall provisioning process.
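
The abstract does not describe the authors' tool in detail, so the following is only a minimal sketch of the general idea of provisioning spares under a fixed budget, not the paper's method. It assumes Poisson-distributed failures per component type and uses a simple greedy rule (buy the spare with the largest reduction in expected impact per unit cost). The component names, failure means, costs, and impact weights are all hypothetical.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k failures when failures are Poisson(mean)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def expected_shortage(spares, mean):
    """Expected number of failures not covered by `spares` on-hand units.

    Uses E[max(D - s, 0)] = E[D] - sum_{k=0}^{s-1} P(D > k).
    """
    shortage = mean
    cdf = 0.0
    for k in range(spares):
        cdf += poisson_pmf(k, mean)
        shortage -= 1.0 - cdf
    return max(shortage, 0.0)


def provision(components, budget):
    """Greedily allocate spares under a fixed budget (illustrative assumption).

    components maps a name to (mean_failures, unit_cost, impact_per_shortage).
    Returns a dict of spare counts per component type.
    """
    spares = {name: 0 for name in components}
    remaining = budget
    while True:
        best, best_gain = None, 0.0
        for name, (mean, cost, impact) in components.items():
            if cost > remaining:
                continue  # cannot afford another spare of this type
            s = spares[name]
            reduction = expected_shortage(s, mean) - expected_shortage(s + 1, mean)
            gain = impact * reduction / cost
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:
            return spares
        spares[best] += 1
        remaining -= components[best][1]


# Hypothetical inputs: (mean failures per period, unit cost, impact weight).
example = {
    "disk": (4.0, 100.0, 1.0),
    "controller": (1.0, 1000.0, 20.0),
}
plan = provision(example, budget=2000.0)
print(plan)
```

A greedy rule is chosen here only for brevity; it weighs cost against system-level impact for each component type, which is the trade-off the abstract highlights for non-disk components.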

Cited By

  • (2021) Systematically inferring I/O performance variability by examining repetitive job behavior. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15. DOI: 10.1145/3458817.3476186. Online publication date: 14-Nov-2021
  • (2021) Examining Failures and Repairs on Supercomputers with Multi-GPU Compute Nodes. 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 305-313. DOI: 10.1109/DSN48987.2021.00043. Online publication date: Jun-2021
  • (2017) Analysis and Modeling of the End-to-End I/O Performance on OLCF's Titan Supercomputer. 2017 IEEE 19th International Conference on High Performance Computing and Communications; IEEE 15th International Conference on Smart City; IEEE 3rd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pp. 1-9. DOI: 10.1109/HPCC-SmartCity-DSS.2017.1. Online publication date: Dec-2017

Published In

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2015
985 pages
ISBN: 9781450337236
DOI: 10.1145/2807591
  • General Chair: Jackie Kern
  • Program Chair: Jeffrey S. Vetter
© 2015 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Acceptance Rates

SC '15 Paper Acceptance Rate 79 of 358 submissions, 22%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
