Reinit++: Evaluating the Performance of Global-Restart Recovery Methods For MPI Fault Tolerance

Georgakoudis, Giorgis; Guo, Luanzheng; Laguna, Ignacio

arXiv:2102.06896 (cs)

[Submitted on 13 Feb 2021]

Abstract: Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient by checkpointing data and, after a failure occurs, restarting execution to resume from the latest checkpoint. However, re-deploying an application incurs overhead by tearing down and re-instating execution, and possibly limits checkpoint retrieval to slow permanent storage. In this paper, we present Reinit++, a new design and implementation of the Reinit approach for global-restart recovery, which avoids application re-deployment. We extensively evaluate Reinit++ against the leading MPI fault-tolerance approach, ULFM, used to implement global-restart recovery, and against the typical practice of restarting an application, to derive new insight on performance. Experimentation with three different HPC proxy applications made resilient to withstand process and node failures shows that Reinit++ recovers much faster than restarting (up to 6x) or ULFM (up to 3x), and that it scales excellently as the number of MPI processes grows.
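
The checkpoint-and-resume pattern the abstract refers to can be illustrated with a small, single-process sketch. This is not the Reinit++ or ULFM interface and involves no MPI: the function name `run_with_global_restart`, the `SimulatedFailure` exception, and the in-memory checkpoint are all assumptions made for illustration. It only shows the idea of rolling back to the latest checkpoint and continuing within the same running program, rather than re-deploying and starting from step 0.

```python
import copy


class SimulatedFailure(Exception):
    """Hypothetical stand-in for a process or node failure."""


def run_with_global_restart(total_steps, checkpoint_every, fail_at_steps):
    """Toy iterative computation that rolls back to the latest checkpoint
    on failure instead of starting over from the beginning.

    Returns (final_value, steps_executed), where steps_executed counts
    redone work as well.
    """
    state = {"step": 0, "value": 0}
    checkpoint = copy.deepcopy(state)      # in-memory checkpoint (assumption)
    pending_failures = set(fail_at_steps)  # each injected failure fires once
    executed = 0

    while state["step"] < total_steps:
        try:
            next_step = state["step"] + 1
            if next_step in pending_failures:
                pending_failures.discard(next_step)
                raise SimulatedFailure(next_step)
            # The "computation": accumulate the step index.
            state["value"] += next_step
            state["step"] = next_step
            executed += 1
            if next_step % checkpoint_every == 0:
                checkpoint = copy.deepcopy(state)
        except SimulatedFailure:
            # Recovery: discard progress made since the last checkpoint
            # and resume from it, without leaving the running program.
            state = copy.deepcopy(checkpoint)

    return state["value"], executed


if __name__ == "__main__":
    print(run_with_global_restart(10, 4, [7]))
```

In this sketch a failure injected at step 7 rolls the state back to the checkpoint taken at step 4, so steps 5 and 6 are executed twice. In a real MPI setting the rollback would apply to all ranks together, and the cost of re-establishing the MPI environment is what the paper measures.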

Comments:	International Conference on High Performance Computing (ISC 2020)
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2102.06896 [cs.DC]
	(or arXiv:2102.06896v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2102.06896

Submission history

From: Luanzheng Guo
[v1] Sat, 13 Feb 2021 10:41:19 UTC (81 KB)
