DOI: 10.1145/2807591.2807665

Research article

Practical scalable consensus for pseudo-synchronous distributed systems

Published: 15 November 2015

Abstract

The ability to consistently handle faults in a distributed environment requires, among a small set of basic routines, an agreement algorithm that allows surviving entities to reach a consensual decision among a bounded set of volatile resources. This paper presents an algorithm that implements an Early Returning Agreement (ERA) in pseudo-synchronous systems, which optimistically allows a process to resume its activity while still guaranteeing strong progress. We prove the correctness of our ERA algorithm and expose its logarithmic behavior, an extremely desirable property for any algorithm that targets future exascale platforms. We detail a practical implementation of this consensus algorithm in the context of an MPI library, and evaluate both its efficiency and scalability through a set of benchmarks and two fault-tolerant scientific applications.
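The abstract's claim of logarithmic behavior can be illustrated with a toy simulation. This is a hedged sketch, not the authors' ERA algorithm (which additionally handles failures occurring mid-agreement): it only shows how a binary-tree reduce-then-broadcast reaches a common decision over `n` boolean proposals in O(log n) message rounds. The function name and round accounting are illustrative assumptions.

```python
def tree_agreement(proposals):
    """Toy round simulation of a tree-based agreement: boolean
    proposals (commit/abort) are combined pairwise up a binary
    reduction tree, and the decision at the root is then broadcast
    back down. Message rounds grow as O(log n) in the number of
    participants, the scaling property the paper targets."""
    n = len(proposals)
    vals = list(proposals)
    rounds, stride = 0, 1
    while stride < n:                        # up-sweep toward rank 0
        for r in range(0, n - stride, 2 * stride):
            vals[r] = vals[r] and vals[r + stride]
        stride *= 2
        rounds += 1
    # The down-sweep broadcast of the decision mirrors the up-sweep,
    # so total rounds are twice the tree depth.
    return vals[0], 2 * rounds
```

With 8 participants the simulation completes in 6 rounds (2 x log2(8)); doubling the participant count adds only two more rounds, which is what makes a tree topology attractive at exascale process counts.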




Published In

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2015, 985 pages
ISBN: 9781450337236
DOI: 10.1145/2807591
General Chair: Jackie Kern; Program Chair: Jeffrey S. Vetter
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. MPI
  2. agreement
  3. fault-tolerance

Qualifiers

  • Research-article

Conference

SC15
Acceptance Rates

SC '15 Paper Acceptance Rate 79 of 358 submissions, 22%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Cited By

  • Implicit Actions and Non-blocking Failure Recovery with MPI. 2022 IEEE/ACM 12th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), pp. 36-46, Nov. 2022. DOI: 10.1109/FTXS56515.2022.00009
  • MATCH: An MPI Fault Tolerance Benchmark Suite. 2020 IEEE International Symposium on Workload Characterization (IISWC), pp. 60-71, Oct. 2020. DOI: 10.1109/IISWC50251.2020.00015
  • Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance. High Performance Computing, pp. 536-554, June 2020. DOI: 10.1007/978-3-030-50743-5_27
  • Tree-based fault-tolerant collective operations for MPI. Concurrency and Computation: Practice and Experience, 33(14), June 2020. DOI: 10.1002/cpe.5826
  • Corrected trees for reliable group communication. Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 287-299, Feb. 2019. DOI: 10.1145/3293883.3295721
  • Running resilient MPI applications on a Dynamic Group of Recommended Processes. Journal of the Brazilian Computer Society, 24(1), Mar. 2018. DOI: 10.1186/s13173-018-0069-z
  • Resilient gossip-inspired all-reduce algorithms for high-performance computing. The International Journal of High Performance Computing Applications, Apr. 2018. DOI: 10.1177/1094342018762531
  • A failure detector for HPC platforms. International Journal of High Performance Computing Applications, 32(1):139-158, Jan. 2018. DOI: 10.1177/1094342017711505
  • System Software for Many-Core and Multi-core Architecture. Advanced Software Technologies for Post-Peta Scale Computing, pp. 59-75, Dec. 2018. DOI: 10.1007/978-981-13-1924-2_4
  • Modeling and Simulating Multiple Failure Masking Enabled by Local Recovery for Stencil-Based Applications at Extreme Scales. IEEE Transactions on Parallel and Distributed Systems, 28(10):2881-2895, Oct. 2017. DOI: 10.1109/TPDS.2017.2696538
