Scalable diskless checkpointing for large parallel systems
Publisher:
  • University of Illinois at Urbana-Champaign
  • Champaign, IL
  • United States
ISBN:978-0-542-44725-9
Order Number:AAI3199074
Pages:155
Abstract

Parallel scientific applications deal with machine unreliability by periodic checkpointing, in which all processes coordinate to dump memory to stable storage simultaneously. However, in systems comprising tens of thousands of nodes, the total data volume can overwhelm the network and storage farm, creating an I/O bottleneck. Furthermore, a very large class of scientific applications can fail on these systems if one of the processes dies.

Poor checkpointing performance limits checkpointing frequency and increases the time-to-solution of applications. Also, the application can spend more time in recovery and restart because large systems tend to fail often.

Diskless checkpointing is a viable approach that provides high-performance, reliable storage for intermediate or temporary data, such as checkpoint files. First, the data is stored in memory instead of on disk. Second, reliability and recoverability are guaranteed by redundancy codes (parity bits or Reed-Solomon codes), which are stored on spares. Third, I/O is made scalable by partitioning nodes and spares into small groups. Each group generates its own redundancy codes and handles its own node failures and recovery.
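
The parity-based variant of this scheme can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names (`partition`, `make_parity`, `recover`), the group size, and the use of in-memory byte strings as stand-ins for node checkpoints are all assumptions made for illustration. It shows only the single-failure XOR case; tolerating more failures per group would need a code such as Reed-Solomon.

```python
from functools import reduce
from typing import List


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def partition(node_ids: List[int], group_size: int) -> List[List[int]]:
    """Split nodes into small groups (hypothetical helper).

    Each group would keep its own parity on its own spare."""
    return [node_ids[i:i + group_size]
            for i in range(0, len(node_ids), group_size)]


def make_parity(checkpoints: List[bytes]) -> bytes:
    """Parity block a group's spare would hold: XOR of all member checkpoints."""
    return reduce(xor_blocks, checkpoints)


def recover(survivors: List[bytes], parity: bytes) -> bytes:
    """Rebuild the one lost checkpoint from the survivors and the parity."""
    return reduce(xor_blocks, survivors, parity)


# Illustrative data: three nodes in one group, equal-sized in-memory checkpoints.
group = [b"node0-state", b"node1-state", b"node2-state"]
parity = make_parity(group)

# Simulate losing node 1 and rebuilding its checkpoint from the rest.
rebuilt = recover([group[0], group[2]], parity)

groups = partition(list(range(8)), group_size=3)
```

Because each group computes and stores parity independently, the encoding traffic stays within the group rather than converging on a shared storage system, which is the scalability argument the abstract makes.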

We have implemented a diskless checkpointing and recovery system and assessed its performance with both I/O benchmarks and real scientific applications. The results show much greater I/O scalability and higher throughput than disk-based parallel file systems for a large number of clients.

As a technology projection, we have also developed an analytical model to investigate the performability of diskless checkpointing. Our model evaluation shows that the overhead of checkpoint/recovery is small on systems with thousands of nodes, and with appropriate partitioning of nodes, the user application can survive several times longer.

Cited By

  1. Dou W and Li Y (2018). A fault-tolerant computing method for Xdraw parallel algorithm, The Journal of Supercomputing, 74:6, (2776-2800), Online publication date: 1-Jun-2018.
  2. Bouteiller A, Herault T, Bosilca G, Du P and Dongarra J (2015). Algorithm-Based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures and Accuracy, ACM Transactions on Parallel Computing, 1:2, (1-28), Online publication date: 18-Feb-2015.
  3. Jia Y, Bosilca G, Luszczek P and Dongarra J Parallel reduction to hessenberg form with algorithm-based fault tolerance Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, (1-11)
  4. Du P, Bouteiller A, Bosilca G, Herault T and Dongarra J Algorithm-based fault tolerance for dense matrix factorizations Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, (225-234)
  5. Du P, Bouteiller A, Bosilca G, Herault T and Dongarra J (2012). Algorithm-based fault tolerance for dense matrix factorizations, ACM SIGPLAN Notices, 47:8, (225-234), Online publication date: 11-Sep-2012.
  6. Gomez L, Nicolae B, Maruyama N, Cappello F and Matsuoka S Scalable Reed-Solomon-based reliable local storage for HPC applications on iaas clouds Proceedings of the 18th international conference on Parallel Processing, (313-324)
  7. Davies T, Karlsson C, Liu H, Ding C and Chen Z High performance linpack benchmark Proceedings of the international conference on Supercomputing, (162-171)
  8. Bautista-Gomez L, Tsuboi S, Komatitsch D, Cappello F, Maruyama N and Matsuoka S FTI Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, (1-32)
  9. Salah K, Al-Shaikh R and Sindi M Towards green computing using diskless high performance clusters Proceedings of the 7th International Conference on Network and Services Management, (456-459)
  10. Gomez L, Maruyama N, Cappello F and Matsuoka S Distributed Diskless Checkpoint for Large Scale Systems Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, (63-72)
  11. Jin H, Sun X, Zheng Z, Lan Z and Xie B Performance under Failures of DAG-based Parallel Computing Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, (236-243)
  12. Bosilca G, Delmas R, Dongarra J and Langou J (2009). Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, 69:4, (410-416), Online publication date: 1-Apr-2009.
  13. Cappello F (2009). Fault Tolerance in Petascale/ Exascale Systems, International Journal of High Performance Computing Applications, 23:3, (212-226), Online publication date: 1-Aug-2009.
  14. Tabatabaee V and Hollingsworth J Automatic software interference detection in parallel applications Proceedings of the 2007 ACM/IEEE conference on Supercomputing, (1-12)
  15. Wu M, Sun X and Jin H Performance under failures of high-end computing Proceedings of the 2007 ACM/IEEE conference on Supercomputing, (1-11)
Contributors
  • University of Iowa
  • University of Illinois Urbana-Champaign
