research-article

Performance under failures of high-end computing

Authors:

Hui Jin

SC '07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing

Article No.: 48, Pages 1 - 11

https://doi.org/10.1145/1362622.1362687

Published: 10 November 2007

Abstract

Modern high-end computers are unprecedentedly complex, and faults are inevitable when solving large-scale applications on future petaflop machines. Many methods have been proposed in recent years to mask faults, but they impose various performance and production costs. A better understanding of how faults influence application performance is necessary to use existing fault-tolerant methods wisely. In this study, we first introduce practical and effective performance models that predict application completion time under system failures. These models separate the influence of failure rate, failure repair, checkpointing period, checkpointing cost, and parallel task allocation on parallel and sequential execution times. To benefit the end users of a given computing platform, we then develop fault-aware task scheduling algorithms that optimize application performance under system failures. Finally, we evaluate the prediction models and scheduling strategies through extensive simulations and experiments using actual failure traces.
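As a rough illustration of how failure rate, repair time, checkpointing period, and checkpointing cost interact, the sketch below combines Young's well-known first-order approximation for the checkpoint interval, sqrt(2 * C * MTBF), with a toy Monte Carlo estimate of completion time. It is not the performance model introduced in the paper. The function names, the exponential failure assumption, and the rule that a failure discards only the current segment are all assumptions made for this example.

```python
import math
import random


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def simulate_completion_time(work, interval, checkpoint_cost, mtbf,
                             repair_time, rng):
    """Simulate one run of a job under failures (illustrative assumptions).

    Assumptions (not taken from the paper):
    - time between failures is exponential with mean `mtbf`;
    - work proceeds in segments of at most `interval`, each followed by a
      checkpoint costing `checkpoint_cost` (including the last segment);
    - a failure during a segment or its checkpoint discards that segment,
      adds `repair_time`, and the job resumes from the last checkpoint.
    """
    elapsed = 0.0
    done = 0.0
    next_failure = rng.expovariate(1.0 / mtbf)  # uptime left before a failure
    while done < work:
        segment = min(interval, work - done)
        needed = segment + checkpoint_cost
        if next_failure >= needed:
            # Segment and checkpoint complete before the next failure.
            elapsed += needed
            next_failure -= needed
            done += segment
        else:
            # Failure: the partial segment is lost, then the node is repaired.
            elapsed += next_failure + repair_time
            next_failure = rng.expovariate(1.0 / mtbf)
    return elapsed


def mean_completion_time(work, interval, checkpoint_cost, mtbf,
                         repair_time, runs=1000, seed=0):
    """Average the simulated completion time over several seeded runs."""
    rng = random.Random(seed)
    total = sum(
        simulate_completion_time(work, interval, checkpoint_cost, mtbf,
                                 repair_time, rng)
        for _ in range(runs)
    )
    return total / runs
```

For example, `mean_completion_time(1000.0, young_interval(2.0, 100.0), 2.0, 100.0, 5.0)` estimates the completion time of 1000 time units of work when checkpointing at Young's interval. Repeating the call with other intervals shows how the estimate shifts as the checkpointing period changes.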




Published In

SC '07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing

November 2007

723 pages

ISBN: 9781595937643

DOI: 10.1145/1362622

General Chair:
Becky Verastegui
Oak Ridge National Laboratory

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SC '07

SC '07: International Conference for High Performance Computing, Networking, Storage and Analysis

November 10 - 16, 2007

Reno, Nevada

Acceptance Rates

SC '07 Paper Acceptance Rate: 54 of 268 submissions, 20%

Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%


Cited By

Park J, Huang X, Lee C (2023). Analyzing and predicting job failures from HPC system log. The Journal of Supercomputing 80:1, pp. 435-462. Online publication date: 24-Jun-2023.
https://doi.org/10.1007/s11227-023-05482-y
Kumar R, Jha S, Mahgoub A, Kalyanam R, Harrell S, Song X, Kalbarczyk Z, Kramer W, Iyer R, Bagchi S (2020). The Mystery of the Failing Jobs: Insights from Operational Data from Two University-Wide Computing Systems. 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 158-171. Online publication date: Jun-2020.
https://doi.org/10.1109/DSN48063.2020.00034
Rojas E, Meneses E, Jones T, Maxwell D (2020). Towards a Model to Estimate the Reliability of Large-Scale Hybrid Supercomputers. Euro-Par 2020: Parallel Processing, pp. 37-51. Online publication date: 18-Aug-2020.
https://doi.org/10.1007/978-3-030-57675-2_3
Noraziah A, Herawan T, Rahman M, Abdullah Z, Mustafa B, Fakharaldien M (2019). Fault Tolerance Impact on Near Field Communication for Data Storage of Mobile Commerce Technology in Cloud Computing. Proceedings of the International Conference on Data Engineering 2015 (DaEng-2015), pp. 489-497. Online publication date: 10-Aug-2019.
https://doi.org/10.1007/978-981-13-1799-6_51
Singh M (2017). Performance analysis of checkpoint based efficient failure-aware scheduling algorithm. 2017 International Conference on Computing, Communication and Automation (ICCCA), pp. 859-863. Online publication date: May-2017.
https://doi.org/10.1109/CCAA.2017.8229916
Qiu Z, Pérez J, Harrison P, Avritzer A, Iosup A, Zhu X, Becker S (2016). Tackling Latency via Replication in Distributed Systems. Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering, pp. 197-208. Online publication date: 12-Mar-2016.
https://dl.acm.org/doi/10.1145/2851553.2851562
Singh M (2016). Incremental checkpoint based failure-aware scheduling algorithm in grid computing. 2016 International Conference on Computing, Communication and Automation (ICCCA), pp. 772-778. Online publication date: Apr-2016.
https://doi.org/10.1109/CCAA.2016.7813820
Ziming Zheng, Li Yu, Zhiling Lan (2015). Reliability-Aware Speedup Models for Parallel Applications with Coordinated Checkpointing/Restart. IEEE Transactions on Computers 64:5, pp. 1402-1415. Online publication date: 3-Apr-2015.
https://dl.acm.org/doi/10.1109/TC.2014.2317182
Qiu Z, Perez J (2015). Enhancing reliability and response times via replication in computing clusters. 2015 IEEE Conference on Computer Communications (INFOCOM), pp. 1355-1363. Online publication date: Apr-2015.
https://doi.org/10.1109/INFOCOM.2015.7218512
Bosilca G, Bouteiller A, Brunet E, Cappello F, Dongarra J, Guermouche A, Herault T, Robert Y, Vivien F, Zaidouni D (2014). Unified model for assessing checkpointing protocols at extreme-scale. Concurrency and Computation: Practice & Experience 26:17, pp. 2772-2791. Online publication date: 10-Dec-2014.
https://dl.acm.org/doi/10.1002/cpe.3173
