DOI: 10.1145/2807591.2807665

Research article

Practical scalable consensus for pseudo-synchronous distributed systems

Published: 15 November 2015

Abstract

The ability to consistently handle faults in a distributed environment requires, among a small set of basic routines, an agreement algorithm that allows surviving entities to reach a consensual decision among a bounded set of volatile resources. This paper presents an algorithm that implements an Early Returning Agreement (ERA) in pseudo-synchronous systems, which optimistically allows a process to resume its activity while still guaranteeing strong progress. We prove the correctness of our ERA algorithm and expose its logarithmic behavior, an extremely desirable property for any algorithm that targets future exascale platforms. We detail a practical implementation of this consensus algorithm in the context of an MPI library, and evaluate both its efficiency and scalability through a set of benchmarks and two fault-tolerant scientific applications.
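The abstract's claim of logarithmic behavior can be illustrated with a toy simulation. This is a hedged sketch, not the authors' ERA algorithm (which additionally handles failures occurring mid-agreement): it only shows how a binary-tree reduce-then-broadcast reaches a common decision over `n` boolean proposals in O(log n) message rounds. The function name and round accounting are illustrative assumptions.

```python
def tree_agreement(proposals):
    """Toy round simulation of a tree-based agreement: boolean
    proposals (commit/abort) are combined pairwise up a binary
    reduction tree, and the decision at the root is then broadcast
    back down. Message rounds grow as O(log n) in the number of
    participants, the scaling property the paper targets."""
    n = len(proposals)
    vals = list(proposals)
    rounds, stride = 0, 1
    while stride < n:                        # up-sweep toward rank 0
        for r in range(0, n - stride, 2 * stride):
            vals[r] = vals[r] and vals[r + stride]
        stride *= 2
        rounds += 1
    # The down-sweep broadcast of the decision mirrors the up-sweep,
    # so total rounds are twice the tree depth.
    return vals[0], 2 * rounds
```

With 8 participants the simulation completes in 6 rounds (2 x log2(8)); doubling the participant count adds only two more rounds, which is what makes a tree topology attractive at exascale process counts.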




Published In

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2015, 985 pages
ISBN: 9781450337236
DOI: 10.1145/2807591
General Chair: Jackie Kern; Program Chair: Jeffrey S. Vetter
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. MPI
  2. agreement
  3. fault-tolerance

Qualifiers

  • Research-article

Conference

SC15
Acceptance Rates

SC '15 Paper Acceptance Rate 79 of 358 submissions, 22%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Cited By

  • Implicit Actions and Non-blocking Failure Recovery with MPI. 2022 IEEE/ACM 12th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), pp. 36-46, Nov. 2022. DOI: 10.1109/FTXS56515.2022.00009
  • MATCH: An MPI Fault Tolerance Benchmark Suite. 2020 IEEE International Symposium on Workload Characterization (IISWC), pp. 60-71, Oct. 2020. DOI: 10.1109/IISWC50251.2020.00015
  • Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance. High Performance Computing, pp. 536-554, June 2020. DOI: 10.1007/978-3-030-50743-5_27
  • Tree-based fault-tolerant collective operations for MPI. Concurrency and Computation: Practice and Experience, 33(14), June 2020. DOI: 10.1002/cpe.5826
  • Corrected trees for reliable group communication. Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 287-299, Feb. 2019. DOI: 10.1145/3293883.3295721
  • Running resilient MPI applications on a Dynamic Group of Recommended Processes. Journal of the Brazilian Computer Society, 24(1), Mar. 2018. DOI: 10.1186/s13173-018-0069-z
  • Resilient gossip-inspired all-reduce algorithms for high-performance computing. The International Journal of High Performance Computing Applications, Apr. 2018. DOI: 10.1177/1094342018762531
  • A failure detector for HPC platforms. International Journal of High Performance Computing Applications, 32(1):139-158, Jan. 2018. DOI: 10.1177/1094342017711505
  • System Software for Many-Core and Multi-core Architecture. Advanced Software Technologies for Post-Peta Scale Computing, pp. 59-75, Dec. 2018. DOI: 10.1007/978-981-13-1924-2_4
  • Modeling and Simulating Multiple Failure Masking Enabled by Local Recovery for Stencil-Based Applications at Extreme Scales. IEEE Transactions on Parallel and Distributed Systems, 28(10):2881-2895, Oct. 2017. DOI: 10.1109/TPDS.2017.2696538
