Software-Based Replication for Fault Tolerance

Published: 01 April 1997

Abstract

Developers of early distributed systems took a simplistic approach to providing fault tolerance: They just used another copy of the same hardware as a backup. Later, others developed replication software to work on off-the-shelf hardware. Since neither of these methods is especially economical, a logical course is to take it one step further and eliminate the extra hardware altogether. Fully software-based replication relies on sophisticated techniques to keep track of server communications and ensure the consistency of information across several server replicas. How do you know that each server shares the same view of the data or program semantics? What happens if a server replica crashes? How do you make sure that a system processes invocations in the correct order? These are all problems that a replication technique has to handle. The authors describe two fundamental techniques, primary-backup and active replication, and illustrate how they handle these problems. At this point, both have advantages and disadvantages that depend on the application. The authors also propose that group communication provides a sufficient framework for implementing software-based replication. The concept of static and dynamic groups proves useful in thinking about how to implement replication techniques. Replication techniques can also use total-order and view-synchronous multicast primitives from group communication.
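
The difference between the two techniques named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code. The `Replica` class, the operation tuples, and both driver functions are hypothetical names introduced here. The sketch assumes deterministic operations, no crashes, and an in-process loop standing in for a total-order multicast. Failure detection, view changes, and primary election are deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A deterministic key-value state machine (hypothetical, for illustration)."""
    state: dict = field(default_factory=dict)

    def apply(self, op):
        # An operation is a tuple: ("set", key, value) or ("incr", key, amount).
        kind, key, arg = op
        if kind == "set":
            self.state[key] = arg
        elif kind == "incr":
            self.state[key] = self.state.get(key, 0) + arg
        else:
            raise ValueError(f"unknown operation: {kind}")
        return self.state[key]


def active_replication(replicas, ops):
    """Every replica executes every invocation, in one agreed order.

    Iterating over `ops` stands in for a total-order multicast: all
    replicas see the same sequence, so deterministic replicas stay identical.
    """
    results = []
    for op in ops:
        replies = [replica.apply(op) for replica in replicas]
        results.append(replies[0])  # the client can take any reply
    return results


def primary_backup(primary, backups, ops):
    """Only the primary executes; backups receive the resulting state update."""
    results = []
    for op in ops:
        result = primary.apply(op)
        key = op[1]
        for backup in backups:
            # Backups copy the new value instead of re-executing the operation.
            backup.state[key] = primary.state[key]
        results.append(result)
    return results


ops = [("set", "x", 1), ("incr", "x", 2), ("incr", "y", 5)]

active = [Replica() for _ in range(3)]
active_results = active_replication(active, ops)

primary, backups = Replica(), [Replica(), Replica()]
pb_results = primary_backup(primary, backups, ops)
```

In this sketch both approaches leave every replica with the same state. What differs is where the work happens: active replication spends execution on every replica and needs determinism plus ordered delivery, while primary-backup executes once and ships state, at the cost of having to handle a primary crash.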


Published In

Computer, Volume 30, Issue 4
April 1997
95 pages

Publisher

IEEE Computer Society Press

Washington, DC, United States
