Fault tolerance for remote memory access programming models

Authors:

Torsten Hoefler

HPDC '14: Proceedings of the 23rd international symposium on High-performance parallel and distributed computing

Pages 37 - 48

https://doi.org/10.1145/2600212.2600224

Published: 23 June 2014

Abstract

Remote Memory Access (RMA) is an emerging mechanism for programming high-performance computers and datacenters. However, little work exists on resilience schemes for RMA-based applications and systems. In this paper we analyze fault tolerance for RMA and show that it is fundamentally different from resilience mechanisms targeting the message passing (MP) model. We design a model for reasoning about fault tolerance for RMA, addressing both flat and hierarchical hardware. We use this model to construct several highly-scalable mechanisms that provide efficient low-overhead in-memory checkpointing, transparent logging of remote memory accesses, and a scheme for transparent recovery of failed processes. Our protocols take into account diminishing amounts of memory per core, one of the major features of future exascale machines. The implementation of our fault-tolerance scheme entails negligible additional overheads. Our reliability model shows that in-memory checkpointing and logging provide high resilience. This study enables highly-scalable resilience mechanisms for RMA and fills a research gap between fault tolerance and emerging RMA programming models.


Index Terms

Fault tolerance for remote memory access programming models
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks

Published In

HPDC '14: Proceedings of the 23rd international symposium on High-performance parallel and distributed computing

June 2014

334 pages

ISBN:9781450327497

DOI:10.1145/2600212

General Chairs: Beth Plale (Indiana University, USA), Matei Ripeanu (University of British Columbia, CA)

Program Chairs: Franck Cappello (Argonne National Lab and INRIA, USA), Dongyan Xu (Purdue University, USA)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

HPDC'14

Sponsor:

SIGARCH

HPDC'14: The 23rd International Symposium on High-Performance Parallel and Distributed Computing

June 23 - 27, 2014

Vancouver, BC, Canada

Acceptance Rates

HPDC '14 Paper Acceptance Rate 21 of 130 submissions, 16%;

Overall Acceptance Rate 166 of 966 submissions, 17%

Bibliometrics

Total Citations: 17

Total Downloads: 287

Downloads (Last 12 months): 19

Downloads (Last 6 weeks): 2

Reflects downloads up to 08 Feb 2025

Cited By

Hubner L, Hespe D, Sanders P, Stamatakis A (2022). ReStore: In-Memory REplicated STORagE for Rapid Recovery in Fault-Tolerant Algorithms. 2022 IEEE/ACM 12th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), pages 24-35. Online publication date: Nov-2022.
https://doi.org/10.1109/FTXS56515.2022.00008

Besta M, Fischer M, Kalavri V, Kapralov M, Hoefler T (2021). Practice of Streaming Processing of Dynamic Graphs: Concepts, Models, and Systems. IEEE Transactions on Parallel and Distributed Systems, pages 1-1. Online publication date: 2021.
https://doi.org/10.1109/TPDS.2021.3131677

Besta M, Domke J, Schneider M, Konieczny M, Girolamo S, Schneider T, Singla A, Hoefler T (2021). High-Performance Routing With Multipathing and Path Diversity in Ethernet and HPC Networks. IEEE Transactions on Parallel and Distributed Systems, 32:4, pages 943-959. Online publication date: 1-Apr-2021.
https://doi.org/10.1109/TPDS.2020.3035761

Besta M, Fischer M, Ben-Nun T, Stanojevic D, Licht J, Hoefler T (2020). Substream-Centric Maximum Matchings on FPGA. ACM Transactions on Reconfigurable Technology and Systems, 13:2, pages 1-33. Online publication date: 24-Apr-2020.
https://dl.acm.org/doi/10.1145/3377871

Besta M, Carigiet A, Janda K, Vonarburg-Shmaria Z, Gianinazzi L, Hoefler T (2020). High-Performance Parallel Graph Coloring with Strong Guarantees on Work, Depth, and Quality. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-17. Online publication date: Nov-2020.
https://doi.org/10.1109/SC41405.2020.00103

Besta M, Schneider M, Konieczny M, Cynk K, Henriksson E, Girolamo S, Singla A, Hoefler T (2020). FatPaths: Routing in Supercomputers and Data Centers when Shortest Paths Fall Short. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-18. Online publication date: Nov-2020.
https://doi.org/10.1109/SC41405.2020.00031

Shahneous Bari M, Basu D, Lu W, Curtis T, Chapman B (2020). Checkpointing OpenSHMEM Programs Using Compiler Analysis. 2020 IEEE/ACM 10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), pages 51-60. Online publication date: Nov-2020.
https://doi.org/10.1109/FTXS51974.2020.00011

Lee K, Sullivan M, Hari S, Tsai T, Keckler S, Erez M, Eigenmann R, Ding C, McKee S (2019). GPU snapshot. Proceedings of the ACM International Conference on Supercomputing, pages 171-183. Online publication date: 26-Jun-2019.
https://dl.acm.org/doi/10.1145/3330345.3330361

Sultana N, Rüfenacht M, Skjellum A, Laguna I, Mohror K (2019). Failure recovery for bulk synchronous applications with MPI stages. Parallel Computing, 84:C, pages 1-14. Online publication date: 1-May-2019.
https://dl.acm.org/doi/10.1016/j.parco.2019.02.007

Shahzad F, Kreutzer M, Zeiser T, Machado R, Pieper A, Hager G, Wellein G (2018). Building and utilizing fault tolerance support tools for the GASPI applications. International Journal of High Performance Computing Applications, 32:5, pages 613-626. Online publication date: 1-Sep-2018.
https://dl.acm.org/doi/10.1177/1094342016677085