Persistent Fault-Tolerance for Divide-and-Conquer Applications on the Grid

Wrzesinska, Gosia; Oprescu, Ana-Maria; Kielmann, Thilo; Bal, Henri

doi:10.1007/978-3-540-74466-5_46

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4641)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

Grid applications need to be fault tolerant, malleable, and migratable. In previous work, we presented orphan saving, an efficient mechanism that addresses these issues for divide-and-conquer applications. In this paper, we present a mechanism for writing partial results to checkpoint files. It adds the ability to tolerate the total loss of all processors and to suspend and later resume an application.

Both mechanisms have negligible overhead in the absence of faults, even with checkpointing intervals as short as one minute. When faults occur, the new checkpointing mechanism outperforms orphan saving by 10% to 15%. Suspending and resuming an application also incurs little overhead, which makes our approach very attractive for writing grid applications.
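The idea of writing partial results to a checkpoint file and resuming from it can be illustrated with a minimal, single-process sketch. This is not the authors' implementation: the names `CheckpointStore`, `dc_sum`, `SimulatedCrash`, and `demo`, the JSON file format, and the keying of subtasks by their input range are all assumptions made for illustration. The default interval of 60 seconds only mirrors the one-minute interval mentioned in the abstract.

```python
import json
import os
import tempfile
import time


class SimulatedCrash(Exception):
    """Stands in for the loss of all processors (illustration only)."""


class CheckpointStore:
    """Hypothetical helper: holds finished subtask results and writes them to a file."""

    def __init__(self, path, interval_s=60.0):
        self.path = path
        self.interval_s = interval_s
        self.results = {}
        self.last_flush = time.monotonic()
        # Resuming: reload any partial results left by an earlier run.
        if os.path.exists(path):
            with open(path) as f:
                self.results = json.load(f)

    def lookup(self, key):
        return self.results.get(key)

    def record(self, key, value):
        self.results[key] = value
        if time.monotonic() - self.last_flush >= self.interval_s:
            self.flush()

    def flush(self):
        # Write to a temporary file, then rename it over the old checkpoint,
        # so an interrupted write does not leave a truncated checkpoint file.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.results, f)
        os.replace(tmp, self.path)
        self.last_flush = time.monotonic()


def dc_sum(lo, hi, store, stats, crash_after=None):
    """Divide-and-conquer sum of squares over [lo, hi), reusing checkpointed subtasks."""
    key = f"{lo}:{hi}"  # assumption: a subtask is identified by its input range
    saved = store.lookup(key)
    if saved is not None:
        return saved  # result survived in the checkpoint, skip the whole subtree
    if hi - lo <= 4:  # leaf task
        if crash_after is not None and stats["leaves"] >= crash_after:
            raise SimulatedCrash()
        stats["leaves"] += 1
        result = sum(i * i for i in range(lo, hi))
    else:
        mid = (lo + hi) // 2
        result = (dc_sum(lo, mid, store, stats, crash_after)
                  + dc_sum(mid, hi, store, stats, crash_after))
    store.record(key, result)
    return result


def demo():
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    first = {"leaves": 0}
    try:
        # interval_s=0.0 flushes after every finished subtask, for the demo only.
        dc_sum(0, 64, CheckpointStore(path, interval_s=0.0), first, crash_after=10)
    except SimulatedCrash:
        pass  # everything in memory is gone; only the checkpoint file remains
    second = {"leaves": 0}
    total = dc_sum(0, 64, CheckpointStore(path, interval_s=0.0), second)
    return total, first["leaves"], second["leaves"]


if __name__ == "__main__":
    print(demo())
```

In this sketch the second run recomputes only the leaf tasks whose results were not yet in the checkpoint file. A real grid runtime would additionally have to handle distributed workers and the cost of transferring results to stable storage, which the sketch leaves out.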

Author information

Authors and Affiliations

Vrije Universiteit Amsterdam,
Gosia Wrzesinska, Ana-Maria Oprescu, Thilo Kielmann & Henri Bal

Editor information

Anne-Marie Kermarrec, Luc Bougé, Thierry Priol

Cite this paper

Wrzesinska, G., Oprescu, AM., Kielmann, T., Bal, H. (2007). Persistent Fault-Tolerance for Divide-and-Conquer Applications on the Grid. In: Kermarrec, AM., Bougé, L., Priol, T. (eds) Euro-Par 2007 Parallel Processing. Euro-Par 2007. Lecture Notes in Computer Science, vol 4641. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74466-5_46

DOI: https://doi.org/10.1007/978-3-540-74466-5_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74465-8
Online ISBN: 978-3-540-74466-5
eBook Packages: Computer Science, Computer Science (R0)