A practical approach to reconciling availability, performance, and capacity in provisioning extreme-scale storage systems

Authors:

Sudharshan S. Vazhkudai,

Qing Cao

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 75, Pages 1 - 12

https://doi.org/10.1145/2807591.2807615

Published: 15 November 2015

Abstract

The increasing data demands from high-performance computing applications significantly accelerate the capacity, capability and reliability requirements of storage systems. As systems scale, component failures and repair times increase, significantly impacting data availability. A wide array of decision points must be balanced in designing such systems.

We propose a systematic approach that balances and optimizes both initial and continuous spare provisioning, based on a detailed investigation of the anatomy of extreme-scale storage systems and an analysis of their field failure data. We simultaneously consider component failure characteristics and their cost and impact at the system level. We build a tool to evaluate different provisioning schemes; the results demonstrate that our optimized provisioning can reduce the duration of data unavailability by as much as 52% under a fixed budget. We also observe that non-disk components have much higher failure rates than disks and warrant careful consideration in the overall provisioning process.

Published In

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2015

985 pages

ISBN: 9781450337236

DOI: 10.1145/2807591

General Chair: Jackie Kern, University of Illinois at Urbana-Champaign, Urbana, Illinois

Program Chair: Jeffrey S. Vetter, Oak Ridge National Laboratory and Georgia Institute of Technology, Oak Ridge, Tennessee

Copyright © 2015 ACM.

© 2015 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SC15: The International Conference for High Performance Computing, Networking, Storage and Analysis

Sponsors: SIGHPC, SIGARCH, IEEE-CS

November 15 - 20, 2015

Austin, Texas

Acceptance Rates

SC '15 paper acceptance rate: 79 of 358 submissions (22%)

Overall acceptance rate: 1,516 of 6,373 submissions (24%)
