Research article, DOI: 10.1145/2600212.2600224

Fault tolerance for remote memory access programming models

Published: 23 June 2014

Abstract

Remote Memory Access (RMA) is an emerging mechanism for programming high-performance computers and datacenters. However, little work exists on resilience schemes for RMA-based applications and systems. In this paper we analyze fault tolerance for RMA and show that it is fundamentally different from resilience mechanisms targeting the message passing (MP) model. We design a model for reasoning about fault tolerance for RMA, addressing both flat and hierarchical hardware. We use this model to construct several highly-scalable mechanisms that provide efficient low-overhead in-memory checkpointing, transparent logging of remote memory accesses, and a scheme for transparent recovery of failed processes. Our protocols take into account diminishing amounts of memory per core, one of the major features of future exascale machines. The implementation of our fault-tolerance scheme entails negligible additional overheads. Our reliability model shows that in-memory checkpointing and logging provide high resilience. This study enables highly-scalable resilience mechanisms for RMA and fills a research gap between fault tolerance and emerging RMA programming models.
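
As a rough illustration of how in-memory checkpointing and logging of remote accesses can combine to recover a failed process, the following Python sketch models each process's memory window as a plain list. It is not the paper's protocol: the `ToyRMA` class, its method names, and the centrally held checkpoint and log are all assumptions made for this example. A real system would keep checkpoints and logs in the memory of other, surviving processes, and would also have to handle local updates and remote reads, which this sketch ignores.

```python
import copy


class ToyRMA:
    """Hypothetical toy model: each rank owns a memory window (a list)."""

    def __init__(self, nprocs, size):
        self.windows = [[0] * size for _ in range(nprocs)]
        self.checkpoints = {}  # rank -> in-memory copy of that rank's window
        self.put_log = {}      # rank -> remote puts applied since its checkpoint

    def checkpoint(self, rank):
        # In-memory checkpoint: copy the window and truncate the log,
        # because logged puts are now contained in the checkpoint.
        self.checkpoints[rank] = copy.copy(self.windows[rank])
        self.put_log[rank] = []

    def put(self, src, dst, offset, value):
        # Log the remote write before applying it, so it can be replayed.
        self.put_log.setdefault(dst, []).append((src, offset, value))
        self.windows[dst][offset] = value

    def fail(self, rank):
        # Simulate a crash: the rank's window contents are lost.
        self.windows[rank] = None

    def recover(self, rank):
        # Restore the last checkpoint, then replay logged puts in order.
        window = copy.copy(self.checkpoints[rank])
        for _src, offset, value in self.put_log.get(rank, []):
            window[offset] = value
        self.windows[rank] = window


if __name__ == "__main__":
    rma = ToyRMA(nprocs=2, size=4)
    rma.checkpoint(1)
    rma.put(src=0, dst=1, offset=2, value=7)
    rma.fail(1)
    rma.recover(1)
    print(rma.windows[1])
```

In this sketch the log lets recovery reach the state just before the failure rather than only the state at the last checkpoint, at the cost of memory that grows until the next checkpoint truncates it.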

Published In

HPDC '14: Proceedings of the 23rd international symposium on High-performance parallel and distributed computing
June 2014
334 pages
ISBN: 9781450327497
DOI: 10.1145/2600212

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. algorithms
2. performance
3. reliability

Qualifiers

• Research-article

Conference

HPDC '14

Acceptance Rates

HPDC '14 paper acceptance rate: 21 of 130 submissions, 16%
Overall acceptance rate: 166 of 966 submissions, 17%
