DOI: 10.1145/1362622.1362687
research-article

Performance under failures of high-end computing

Published: 10 November 2007

Abstract

Modern high-end computers are unprecedentedly complex, and faults are inevitable when running large-scale applications on future petaflop machines. Many methods have been proposed in recent years to mask faults, but they impose various performance and production costs. Using existing fault-tolerance methods wisely requires a better understanding of how faults influence application performance. In this study, we first introduce practical and effective performance models that predict application completion time under system failures. These models separate the influence of failure rate, failure repair, checkpointing period, checkpointing cost, and parallel task allocation on parallel and sequential execution times. To benefit the end users of a given computing platform, we then develop fault-aware task scheduling algorithms that optimize application performance under system failures. Finally, we conduct extensive simulations and experiments with actual failure traces to evaluate our prediction models and scheduling strategies.
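The abstract names the factors the models separate but does not state the models themselves. As a rough illustration of how failure rate, repair time, checkpointing period, and checkpointing cost interact, the sketch below simulates a single sequential task under exponentially distributed failures and compares checkpoint intervals, using Young's well-known first-order approximation (interval = sqrt(2 × checkpoint cost × MTBF)) as one candidate. This is a textbook-style toy model, not the paper's method: the function names, the exponential failure assumption, the fixed repair time, and the rule that no failure occurs during repair are all assumptions made for illustration.

```python
import math
import random


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def simulate_completion_time(work, interval, checkpoint_cost, mtbf,
                             repair_time, rng):
    """Simulate one run of a sequential task (all times in seconds).

    Assumptions (illustrative only): failures arrive with exponential
    inter-arrival times of mean `mtbf`, a failure discards all work since
    the last checkpoint, repair takes a fixed `repair_time`, and no failure
    occurs while the node is being repaired.
    """
    elapsed = 0.0
    done = 0.0  # work already protected by a checkpoint
    if math.isfinite(mtbf):
        next_failure = rng.expovariate(1.0 / mtbf)
    else:
        next_failure = math.inf  # failure-free baseline
    while done < work:
        segment = min(interval, work - done)
        # The final segment is not followed by a checkpoint.
        is_last = done + segment >= work
        cost = segment + (0.0 if is_last else checkpoint_cost)
        if elapsed + cost <= next_failure:
            elapsed += cost
            done += segment
        else:
            # Failure mid-segment: roll back to the last checkpoint,
            # wait for the repair, then draw the next failure time.
            elapsed = next_failure + repair_time
            next_failure = elapsed + rng.expovariate(1.0 / mtbf)
    return elapsed


def mean_completion_time(work, interval, checkpoint_cost, mtbf,
                         repair_time, runs=200, seed=0):
    """Average the simulated completion time over several seeded runs."""
    rng = random.Random(seed)
    total = sum(
        simulate_completion_time(work, interval, checkpoint_cost, mtbf,
                                 repair_time, rng)
        for _ in range(runs)
    )
    return total / runs


if __name__ == "__main__":
    work, cost, mtbf, repair = 100_000.0, 60.0, 20_000.0, 300.0
    for interval in (300.0, young_interval(cost, mtbf), 10_000.0):
        avg = mean_completion_time(work, interval, cost, mtbf, repair)
        print(f"interval={interval:8.0f}s  mean completion={avg:10.0f}s")
```

Running the script prints the mean completion time for a short, a Young-approximated, and a long checkpoint interval, which makes the trade-off between checkpoint overhead and lost work visible. Parallel task allocation, which the abstract also lists, is deliberately left out of this sketch.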

Published In

SC '07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing
November 2007
723 pages
ISBN:9781595937643
DOI:10.1145/1362622
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. application performance
  2. failure modeling
  3. fault-tolerance

Qualifiers

  • Research-article

Conference

SC '07
Acceptance Rates

SC '07 Paper Acceptance Rate 54 of 268 submissions, 20%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Cited By

  • (2023) Analyzing and predicting job failures from HPC system log. The Journal of Supercomputing 80:1 (435-462). DOI: 10.1007/s11227-023-05482-y. Online publication date: 24-Jun-2023
  • (2020) The Mystery of the Failing Jobs: Insights from Operational Data from Two University-Wide Computing Systems. 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) (158-171). DOI: 10.1109/DSN48063.2020.00034. Online publication date: Jun-2020
  • (2020) Towards a Model to Estimate the Reliability of Large-Scale Hybrid Supercomputers. Euro-Par 2020: Parallel Processing (37-51). DOI: 10.1007/978-3-030-57675-2_3. Online publication date: 18-Aug-2020
  • (2019) Fault Tolerance Impact on Near Field Communication for Data Storage of Mobile Commerce Technology in Cloud Computing. Proceedings of the International Conference on Data Engineering 2015 (DaEng-2015) (489-497). DOI: 10.1007/978-981-13-1799-6_51. Online publication date: 10-Aug-2019
  • (2017) Performance analysis of checkpoint based efficient failure-aware scheduling algorithm. 2017 International Conference on Computing, Communication and Automation (ICCCA) (859-863). DOI: 10.1109/CCAA.2017.8229916. Online publication date: May-2017
  • (2016) Tackling Latency via Replication in Distributed Systems. Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering (197-208). DOI: 10.1145/2851553.2851562. Online publication date: 12-Mar-2016
  • (2016) Incremental checkpoint based failure-aware scheduling algorithm in grid computing. 2016 International Conference on Computing, Communication and Automation (ICCCA) (772-778). DOI: 10.1109/CCAA.2016.7813820. Online publication date: Apr-2016
  • (2015) Reliability-Aware Speedup Models for Parallel Applications with Coordinated Checkpointing/Restart. IEEE Transactions on Computers 64:5 (1402-1415). DOI: 10.1109/TC.2014.2317182. Online publication date: 3-Apr-2015
  • (2015) Enhancing reliability and response times via replication in computing clusters. 2015 IEEE Conference on Computer Communications (INFOCOM) (1355-1363). DOI: 10.1109/INFOCOM.2015.7218512. Online publication date: Apr-2015
  • (2014) Unified model for assessing checkpointing protocols at extreme-scale. Concurrency and Computation: Practice & Experience 26:17 (2772-2791). DOI: 10.1002/cpe.3173. Online publication date: 10-Dec-2014
