Parallel scientific applications cope with machine unreliability through periodic checkpointing, in which all processes coordinate to dump their memory to stable storage simultaneously. However, on systems comprising tens of thousands of nodes, the total data volume can overwhelm the network and storage farm, creating an I/O bottleneck. Furthermore, a very large class of scientific applications fails on these systems if even one of the processes dies.
Poor checkpointing performance limits how frequently checkpoints can be taken and increases the time-to-solution of applications. Moreover, because large systems tend to fail often, applications spend more time in recovery and restart.
Diskless checkpointing is a viable approach that provides high-performance and reliable storage for intermediate or temporary data, such as checkpoint files. First, the data is stored in memory instead of on disk. Second, reliability and recoverability are guaranteed by redundancy codes (parity bits or Reed-Solomon codes), which are stored on spare nodes. Third, I/O is made scalable by partitioning the nodes and spares into small groups; each group handles its own redundancy-code generation and node failure recovery.
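As a rough illustration of the group-based redundancy scheme described above, the sketch below simulates a single group in plain Python. The function names (`take_group_checkpoint`, `recover_member`), the group size, and the choice of simple XOR parity rather than Reed-Solomon codes are illustrative assumptions, not the paper's implementation, which targets parallel applications with in-memory checkpoints and spares.

```python
# Minimal sketch of group-based parity checkpointing (assumption: the real
# system keeps checkpoints in node memory and parity on a spare node; here
# one group is simulated locally with byte strings standing in for memory).
import os
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized byte strings."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)


def take_group_checkpoint(member_states):
    """Return the parity block the group's spare would hold."""
    return xor_blocks(member_states)


def recover_member(surviving_states, parity):
    """Rebuild a single failed member's checkpoint from survivors + parity."""
    return xor_blocks(surviving_states + [parity])


if __name__ == "__main__":
    group_size, ckpt_bytes = 4, 1 << 10              # hypothetical sizes
    states = [os.urandom(ckpt_bytes) for _ in range(group_size)]

    parity = take_group_checkpoint(states)            # stored on the spare

    failed = 2                                        # pretend node 2 dies
    survivors = [s for i, s in enumerate(states) if i != failed]
    restored = recover_member(survivors, parity)

    assert restored == states[failed]
    print("recovered checkpoint of failed node matches original")
```

Because each group computes and stores its own parity independently, the redundancy work stays local and the scheme scales with the number of groups; tolerating more than one simultaneous failure per group would require Reed-Solomon codes and additional spares.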
We have implemented a diskless checkpointing and recovery system and assessed its performance with both I/O benchmarks and real scientific applications. The results show much greater I/O scalability and higher throughput than disk-based parallel file systems for a large number of clients.
As a technology projection, we have also developed an analytical model to investigate the performability of diskless checkpointing. Our model evaluation shows that the overhead of checkpoint/recovery is small on systems with thousands of nodes, and that, with appropriate partitioning of the nodes, the user application can survive several times longer.
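The following back-of-the-envelope sketch is not the paper's analytical model; it only illustrates the kind of performability estimate involved, using the classic first-order (Young-style) approximation for the checkpoint interval and made-up parameter values.

```python
# Rough performability estimate (assumption: Young's first-order model with
# hypothetical parameters, not the paper's own analytical model).
import math


def expected_runtime(work_hours, n_nodes, node_mtbf_hours,
                     ckpt_hours, restart_hours):
    """Estimate wall-clock time for `work_hours` of useful computation on
    `n_nodes` nodes, checkpointing at the approximately optimal interval."""
    system_mtbf = node_mtbf_hours / n_nodes               # failures compound
    interval = math.sqrt(2.0 * ckpt_hours * system_mtbf)  # Young's interval
    # Fraction of each checkpoint period spent writing the checkpoint.
    ckpt_overhead = ckpt_hours / (interval + ckpt_hours)
    eff_time = work_hours / (1.0 - ckpt_overhead)         # time incl. overhead
    # Each failure costs a restart plus, on average, half an interval of rework.
    per_failure_loss = restart_hours + interval / 2.0
    failures = eff_time / system_mtbf
    return eff_time + failures * per_failure_loss


if __name__ == "__main__":
    # Hypothetical numbers: 4096 nodes, 10-year node MTBF,
    # 30-second diskless checkpoint, 2-minute restart, 100 h of work.
    t = expected_runtime(work_hours=100, n_nodes=4096,
                         node_mtbf_hours=10 * 365 * 24,
                         ckpt_hours=30 / 3600, restart_hours=2 / 60)
    print(f"expected wall-clock time: {t:.1f} h")
```

Even this crude model shows why fast in-memory checkpointing matters at scale: as the checkpoint cost shrinks, the optimal interval shortens and both the overhead and the rework lost per failure drop.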
Cited By
- Dou W and Li Y (2018). A fault-tolerant computing method for Xdraw parallel algorithm, The Journal of Supercomputing, 74:6, (2776-2800), Online publication date: 1-Jun-2018.
- Bouteiller A, Herault T, Bosilca G, Du P and Dongarra J (2015). Algorithm-Based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures and Accuracy, ACM Transactions on Parallel Computing, 1:2, (1-28), Online publication date: 18-Feb-2015.
- Jia Y, Bosilca G, Luszczek P and Dongarra J Parallel reduction to Hessenberg form with algorithm-based fault tolerance Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, (1-11)
- Du P, Bouteiller A, Bosilca G, Herault T and Dongarra J Algorithm-based fault tolerance for dense matrix factorizations Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, (225-234)
- Du P, Bouteiller A, Bosilca G, Herault T and Dongarra J (2012). Algorithm-based fault tolerance for dense matrix factorizations, ACM SIGPLAN Notices, 47:8, (225-234), Online publication date: 11-Sep-2012.
- Gomez L, Nicolae B, Maruyama N, Cappello F and Matsuoka S Scalable Reed-Solomon-based reliable local storage for HPC applications on iaas clouds Proceedings of the 18th international conference on Parallel Processing, (313-324)
- Davies T, Karlsson C, Liu H, Ding C and Chen Z High performance linpack benchmark Proceedings of the international conference on Supercomputing, (162-171)
- Bautista-Gomez L, Tsuboi S, Komatitsch D, Cappello F, Maruyama N and Matsuoka S FTI Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, (1-32)
- Salah K, Al-Shaikh R and Sindi M Towards green computing using diskless high performance clusters Proceedings of the 7th International Conference on Network and Services Management, (456-459)
- Gomez L, Maruyama N, Cappello F and Matsuoka S Distributed Diskless Checkpoint for Large Scale Systems Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, (63-72)
- Jin H, Sun X, Zheng Z, Lan Z and Xie B Performance under Failures of DAG-based Parallel Computing Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, (236-243)
- Bosilca G, Delmas R, Dongarra J and Langou J (2009). Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, 69:4, (410-416), Online publication date: 1-Apr-2009.
- Cappello F (2009). Fault Tolerance in Petascale/Exascale Systems, International Journal of High Performance Computing Applications, 23:3, (212-226), Online publication date: 1-Aug-2009.
- Tabatabaee V and Hollingsworth J Automatic software interference detection in parallel applications Proceedings of the 2007 ACM/IEEE conference on Supercomputing, (1-12)
- Wu M, Sun X and Jin H Performance under failures of high-end computing Proceedings of the 2007 ACM/IEEE conference on Supercomputing, (1-11)