Implementing fault-tolerant services using the state machine approach: a tutorial

Author:

Fred B. Schneider

ACM Computing Surveys (CSUR), Volume 22, Issue 4

Pages 299 - 319

https://doi.org/10.1145/98163.98167

Published: 01 December 1990

Abstract

The state machine approach is a general method for implementing fault-tolerant services in distributed systems. This paper reviews the approach and describes protocols for two different failure models: Byzantine and fail-stop. System reconfiguration techniques for removing faulty components and integrating repaired components are also discussed.

Reviews

Reviewer: Valentin Cristea

Distributed software structured in terms of clients and servers is considered. Replicas of a single server are executed on separate processors of a distributed system, and protocols coordinate client interactions with these replicas. The paper describes how a system can be viewed in terms of a state machine, clients, and output devices. In this context, Schneider considers two representative classes of faulty behavior: Byzantine failures and fail-stop failures. The core sections of the paper present algorithms that cope with these failures. An important class of optimizations and dynamic reconfiguration are also covered. A separate section discusses related work. The paper is intended for people working in the domain of distributed systems and real-time systems. It systematically presents protocols that involve replication of components using the state machine approach, although few of these protocols were originally obtained in this manner. The paper was received in November 1987 and the final revision was accepted in January 1990. Unfortunately, this long delay is easily perceived by the reader.
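The replication idea summarized in the review can be illustrated with a minimal sketch. It is not taken from the paper: the key-value service, the class and function names, and the in-process "replicas" are all hypothetical, and the sketch assumes that every replica already receives the same requests in the same order (the agreement and ordering protocols are the hard part and are not shown). It also assumes the usual sizing for Byzantine failures, 2t + 1 replicas to tolerate t faulty ones, with the client accepting an output once t + 1 replicas report it.

```python
from collections import Counter


class KVStateMachine:
    """Deterministic state machine: the same requests in the same
    order always produce the same outputs (hypothetical example service)."""

    def __init__(self):
        self.state = {}

    def apply(self, request):
        op, key, *rest = request
        if op == "put":
            self.state[key] = rest[0]
            return "ok"
        if op == "get":
            return self.state.get(key)
        raise ValueError(f"unknown operation: {op}")


class FaultyReplica(KVStateMachine):
    """Hypothetical faulty replica that always reports a wrong output."""

    def apply(self, request):
        super().apply(request)
        return "garbage"


def vote(outputs, t):
    """Return the output reported by at least t + 1 replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count < t + 1:
        raise RuntimeError("no output has t + 1 matching replies")
    return value


def run(replicas, requests, t):
    """Give every replica the same request sequence and vote on each output."""
    results = []
    for request in requests:
        outputs = [replica.apply(request) for replica in replicas]
        results.append(vote(outputs, t))
    return results


t = 1  # number of faulty replicas to tolerate (assumed)
replicas = [KVStateMachine(), KVStateMachine(), FaultyReplica()]
requests = [("put", "x", 1), ("get", "x")]
results = run(replicas, requests, t)
print(results)  # ['ok', 1]
```

With three replicas and one faulty, the two correct replicas outvote the wrong answer on every request. Under a fail-stop assumption a replica never reports a wrong output, so voting would be unnecessary and any single reply could be used.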


Information

Published In

ACM Computing Surveys Volume 22, Issue 4

Dec. 1990

109 pages

ISSN:0360-0300

EISSN:1557-7341

DOI:10.1145/98163

Publisher

Association for Computing Machinery

New York, NY, United States
