BAD-check: bulk asynchronous distributed checkpointing

J Bent, B Settlemyer, H Bao, S Faibish, J Sauer, J Zhang
Proceedings of the 10th Parallel Data Storage Workshop, 2015 - dl.acm.org
Leadership-scale scientific simulations running as tens of thousands of tightly coupled MPI processes are vulnerable to interruption by a single process or node failure. Because each state calculation depends on the successful completion of every prior state calculation, checkpoint-restart is the most widely used technique for achieving fault tolerance. To write a consistent view of distributed state as a checkpoint, applications typically synchronize and pause while writing data to persistent media. In this paper we present a transactional protocol that enables asynchronous distributed creation of checkpoint data sets, and describe the conditions under which it is beneficial. With simulations, we demonstrate that scientific applications exhibiting computational variance without frequent synchronization can use our protocol to either reduce run time by up to 27% or reduce required storage system capability by up to 40%.
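
The intuition behind the "computational variance without frequent synchronization" condition can be illustrated with a toy model. The sketch below is not the paper's protocol or its simulator; it is a minimal illustration under stated assumptions: per-rank compute times are drawn from a hypothetical uniform distribution, every checkpoint write costs the same fixed time, ranks do not otherwise synchronize, and the cost of the transactional commit and any storage contention are ignored. All names and parameter values (`simulate`, `jitter`, `write_time`, and so on) are invented for this example.

```python
import random


def simulate(num_ranks=64, num_intervals=20, mean=1.0, jitter=0.2,
             write_time=0.1, seed=0):
    """Return (sync_total, async_total) run times for a toy checkpoint model.

    All parameters are hypothetical; none come from the paper.
    """
    rng = random.Random(seed)
    # compute[r][i]: time rank r spends computing in interval i.
    compute = [
        [mean + rng.uniform(-jitter, jitter) for _ in range(num_intervals)]
        for _ in range(num_ranks)
    ]

    # Synchronous model: all ranks meet at a barrier before each checkpoint,
    # so every interval costs the slowest rank's compute time plus the write.
    sync_total = sum(
        max(compute[r][i] for r in range(num_ranks)) + write_time
        for i in range(num_intervals)
    )

    # Asynchronous model: each rank writes its piece when it is ready and
    # continues; the run ends when the slowest rank overall has finished.
    async_total = max(
        sum(c + write_time for c in compute[r])
        for r in range(num_ranks)
    )
    return sync_total, async_total


if __name__ == "__main__":
    sync_total, async_total = simulate()
    print(f"synchronous:  {sync_total:.2f}")
    print(f"asynchronous: {async_total:.2f}")
```

In this model the synchronous run pays the per-interval maximum every time, while the asynchronous run pays only the maximum of the per-rank totals, so the gap grows with the variance between ranks and disappears when `jitter` is zero. The magnitude of the gap here depends entirely on the assumed parameters and should not be compared with the figures reported in the abstract.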