DOI: 10.1145/2834976.2834981
research-article

BAD-check: bulk asynchronous distributed checkpointing

Published: 15 November 2015

Abstract

Leadership-scale scientific simulations running as tens of thousands of tightly coupled MPI processes are vulnerable to interruption by a single process or node failure. Because each state calculation depends on the successful completion of every prior state calculation, checkpoint-restart is the most widely used technique for achieving fault tolerance. To write a consistent view of distributed state as a checkpoint, applications typically synchronize and pause while writing data to persistent media. In this paper, we present a transactional protocol that enables asynchronous, distributed creation of checkpoint data sets, and we describe the conditions under which it is beneficial. Using simulations, we demonstrate that scientific applications that exhibit computational variance and do not synchronize frequently can use our protocol either to reduce run time by up to 27% or to reduce the required storage system capability by up to 40%.
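The intuition behind the run-time claim can be illustrated with a toy model. The sketch below is not the paper's simulator and does not implement its transactional protocol; it only compares a run in which every process waits at a barrier before each checkpoint with a run in which each process checkpoints whenever it is ready. The function name `simulate`, the uniform compute-time distribution, the fixed `write_time`, and the omission of storage bandwidth contention are all assumptions made for illustration.

```python
import random


def simulate(num_procs=64, num_steps=20, write_time=1.0, seed=0):
    """Return (sync_total, async_total) run times for a toy checkpoint model.

    Assumptions (hypothetical, not from the paper): compute times are drawn
    uniformly from [8, 12], every checkpoint write costs `write_time`, and
    storage contention between processes is ignored.
    """
    rng = random.Random(seed)
    # compute[p][s] is the compute time of process p during step s.
    compute = [
        [rng.uniform(8.0, 12.0) for _ in range(num_steps)]
        for _ in range(num_procs)
    ]

    # Synchronous checkpointing: a barrier before each write means every
    # step costs as much as the slowest process in that step.
    sync_total = sum(
        max(compute[p][s] for p in range(num_procs)) + write_time
        for s in range(num_steps)
    )

    # Asynchronous checkpointing: each process writes as soon as it reaches
    # the checkpoint, so the run ends when the slowest process overall ends.
    async_total = max(
        sum(compute[p]) + num_steps * write_time
        for p in range(num_procs)
    )
    return sync_total, async_total


if __name__ == "__main__":
    sync_total, async_total = simulate()
    print(f"synchronous:  {sync_total:.1f}")
    print(f"asynchronous: {async_total:.1f}")
    print(f"relative saving: {1 - async_total / sync_total:.1%}")
```

In this model the synchronous total is a sum of per-step maxima and the asynchronous total is a maximum of per-process sums, so the asynchronous run can never take longer. The size of the gap depends entirely on the assumed variance and should not be read as reproducing the 27% figure reported in the abstract.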


Cited By

  • (2016) DAOS and friends. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12. DOI: 10.5555/3014904.3014971. Online publication date: 13-Nov-2016.
  • (2016) DAOS and Friends: A Proposal for an Exascale Storage System. SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 585-596. DOI: 10.1109/SC.2016.49. Online publication date: Nov-2016.

Published In

PDSW '15: Proceedings of the 10th Parallel Data Storage Workshop
November 2015
59 pages
ISBN: 9781450340083
DOI: 10.1145/2834976

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SC15

Acceptance Rates

PDSW '15 Paper Acceptance Rate 9 of 25 submissions, 36%;
Overall Acceptance Rate 17 of 41 submissions, 41%
