Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
article
Free access

Implementing fault-tolerant services using the state machine approach: a tutorial

Published: 01 December 1990 Publication History
  • Get Citation Alerts
  • Abstract

    The state machine approach is a general method for implementing fault-tolerant services in distributed systems. This paper reviews the approach and describes protocols for two different failure models—Byzantine and fail stop. Systems reconfiguration techniques for removing faulty components and integrating repaired components are also discussed.

    References

    [1]
    AIZIKOWITZ, J. 1989. Designing distributed services using refinement mappings. Ph.D. dissertation, Computer Science Dept., Cornell Univ., Ithaca, New York. Also available as Tech. Rep. TR 89-1040.
    [2]
    BERNSTEIN, A. J. 1985. A loosely coupled system for reliably storing data. IEEE Trans. Softw. Eng. SE-11, 5 (May), 446-454.
    [3]
    BIRMAN, K. P. 1985. Replication and fault tolerance in the ISIS system. In Proceedings of the lOth A CM Symposium on Operating Systems Principles (Orcas Island, Washington, Dec. 1985), A CM, pp. 79-86.
    [4]
    BIRMAN, K. P., AND JOSEPH, T. 1987. Reliable communication in the presence of failures. ACM TOCS 5, 1 (Feb. 1987), 47-76.
    [5]
    CRISTIAN, F., AGHILI, H., STRONG, H. R., AND DOLEV, D. 1985. Atomic broadcast: From simple message diffusion to Byzantine agreement. In Proceedings of the 15th International Conference on Fault-tolerant Computing (Ann Arbor, Mich., June 1985), IEEE Computer Society.
    [6]
    DIJKSTRA, E. W. 1974. Self stabilization in spite of distributed control. Commun. A CM I7, 11 (Nov.), 643-644.
    [7]
    FISCHER, M., LYNCH, N., AND PATERSON, M. 1985. Impossibility of distributed consensus with one faulty process, d. ACM 32, 2 (Apr. 1986), 374-382.
    [8]
    GARCIA-MOLINA, H., PITTELLI, F., AND DAVIDSON, S. 1986. Application of Byzantine agreement in database systems. ACM TODS 11, 1 (Mar. 1986), 27-47.
    [9]
    GOPAL, A., STRONG, R., TOUEG, S., AND CRISTIAN, F., 1990. Early-delivery atomic broadcast. To appear in Proceedings of the 9th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (Quebec City, Quebec, Aug. 1990).
    [10]
    GRAY, J. 1978. Notes on data base operating systems. In Operating Systems: An Advanced Course, Lecture Notes in Computer Science. Vol. 60. Springer- Verlag, New York, pp. 393-481.
    [11]
    HALPERN, J., SIMONS, B., STRONG, R., AND DOLEV, D. 1984. Fault-tolerant clock synchronization. In Proceedings of the 3rd A CM SIGA CT-SIGOPS Symposium on Principles of Distributed Computing (Vancouver, Canada, Aug.), pp. 89-102.
    [12]
    HUTCHINSON, N., AND PETERSON, L. 1988. Design of the x-kernel. In Proceedings of SIGCOMM '88--Symposium on Communication Architectures and Protocols (Stanford, Calif., Aug.), pp. 65-75.
    [13]
    LAMPORT, L. 1978a. Time, clocks and the ordering of events in a distributed system. Commun. ACM 21, 7 (July), 558-565.
    [14]
    LAMPORT, L. 1979b. The implementation of reliable distributed multiprocess systems. Comput. Networks 2, 95-114.
    [15]
    LAMPORT, L. 1984. Using time instead of timeout for fault-tolerance in distributed systems. ACM TOPLAS 6, 2 (Apr.), 254-280.
    [16]
    LAMPORT, L. 1989. The part-time parliament. Tech. Rep. 49. Digital Equipment Corporation Systems Research Center, Palo Alto, Calif.
    [17]
    LAMPORT, L., AND MELLIAR-SMITH, P. M. 1984. Byzantine clock synchronization. In Proceedings of the 3rd A CM SIGA CT-SIGOPS Symposium on Principles of Distributed Computing (Vancouver, Canada, Aug.), 68-74.
    [18]
    LAMPORT, L., SHOSTAK, R., AND PEASE, M. 1982. The Byzantine generals problem. ACM TOPLAS 4, 3 (July), 382-401.
    [19]
    LISKOV, B., AND LADIN, R. 1986. Highly available distributed services and fault-tolerant distributed garbage collection. In Proceedings of the 5th A CM Symposium on Principles of Distributed Computing (Calgary, Alberta, Canada, Aug.), ACM, pp. 29-39.
    [20]
    MANCINI, L., AND PAPPALARDO, G. 1988. Towards a theory of replicated processing. Formal Techniques in Real-Time and Fault-Tolerant Systems. Lecture Notes in Computer Science, Vol. 331. Springer-Verlag, New York, pp. 175-192.
    [21]
    MARZULLO, K. 1989. Implementing fault-tolerant sensors. Tech. Rep. TR 89-997. Computer Science Dept., Cornell Univ., Ithaca, New York.
    [22]
    MARZULLO, K., AND SCHMUCK, F. 1988. Supplying high availability with a standard network file system. In Proceedings of the 8th International Conference on Distributed Computing Systems (San Jose, CA, June), IEEE Computer Society, pp. 447-455.
    [23]
    PETERSON, L. L., BUCHOLZ, N. C., AND SCHLICHT- ING, R. D. 1989. Preserving and using context information in interprocess communication. ACM TOCS 7, 3 (Aug.), 217-246.
    [24]
    PITTELLI, F. M., AND GARCIA-MOLINA, S. 1989. Reliable scheduling in a TMR database system. ACM TOCS 7, 1 (Feb.), 25-60.
    [25]
    SCHLICHTING, R. D., AND SCHNEIDER, F. B. 1983. Fail-Stop processors: An approach to designing fault-tolerant computing systems. ACM TOCS I, 3 (Aug.), 222-238.
    [26]
    SCHNEIDER, F. B. 1980. Ensuring consistency on a distributed database system by use of distributed semaphores. In Proceedings of International Symposium on Distributed Data Bases (Paris, France, Mar.), INRIA, pp. 183-189.
    [27]
    SCHNEIDER, F. B. 1982. Synchronization in distributed programs. ACM TOPLAS 4, 2 (Apr.), 179-195.
    [28]
    SCHNEIDER, F. B. 1984. Byzantine generals in action: Implementing fail-stop processors. ACM TOCS 2, 2 (May), 145-154.
    [29]
    SCHNEIDER, F. B. 1985. Paradigms for distributed programs. Distributed Systems. Methods and Tools for Specification. Lecture Notes in Computer Science, Vol. 190. Springer-Verlag, New York, pp. 343-430.
    [30]
    SCHNEIDER, F. B. 1986. A paradigm for reliable clock synchronization. In Proceedings of the Advanced Seminar on Real-Time Local Area Networks (Bandol, France, Apr.), INRIA, pp. 85-104.
    [31]
    SCHNEIDER, F. B., GRIES, D., AND SCHLICHTING, R. D. 1984. Fault-tolerant broadcasts. Sci. Comput. Program. 4, 1-15.
    [32]
    SIEWIOREK, D. P., AND SWARZ, R. S. 1982. The Theory and Practice of Reliable System Design. Digital Press, Bedford, Mass.
    [33]
    SKEEN, D. 1982. Crash recovery in a distributed database system. Ph.D. dissertation, Univ. of California at Berkeley, May.
    [34]
    STRONG, H. R., AND DOLEV, D. 1983. Byzantine agreement. Intellectual Leverage for the Information Society, Digest of Papers. (Compcon 83, IEEE Computer Society, Mar.), IEEE Computer Society, pp. 77-82.
    [35]
    WENSLEY, J., WENSKY, J. H., LAMPORT, L., GOLDBERG, J., GREEN, M. W., LEVITT, K. N., MELLIAR-SMITH, P. M., SHOSTAK, R. E., and WEINSTOCK, C. B. 1978. SIFT: Design and analysis of a fault-tolerant computer for aircraft control. Proc. IEEE 66, 10 (Oct.), 1240-1255.

    Cited By

    View all
    • (2025)A survey on scalable consensus algorithms for blockchain technologyCyber Security and Applications10.1016/j.csa.2024.1000653(100065)Online publication date: Dec-2025
    • (2024)Reducing Persistence Overhead in Parallel State Machine Replication through Time-Phased Partitioned CheckpointJournal of Internet Services and Applications10.5753/jisa.2024.389115:1(194-211)Online publication date: 26-Jul-2024
    • (2024)IONIAProceedings of the 22nd USENIX Conference on File and Storage Technologies10.5555/3650697.3650711(225-242)Online publication date: 27-Feb-2024
    • Show More Cited By

    Recommendations

    Reviews

    Valentin Cristea

    Distributed software structured in terms of clients and servers is considered. Replicas of a single server are executed on separate processors of a distributed system, and protocols coordinate client interactions with these replicas. The paper describes how a system can be viewed in terms of a state machine, clients, and output devices. In this context, Schneider considers two representative classes of faulty behavior: Byzantine failures and fail-stop failures. The core sections of the paper present algorithms that cope with these failures. An important class of optimizations and the dynamic reconfiguration are also tackled. A separate section discusses related work . The paper is intended for people working in the domain of distributed systems and real-time systems. It systematically presents protocols that involve replication of components using the state machine approach, although few of these protocols were obtained in this manner. The paper was received in November 1987 and the final revision was accepted in January 1990. Unfortunately, this long delay is easily perceived by the reader.

    Access critical reviews of Computing literature here

    Become a reviewer for Computing Reviews.

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Computing Surveys
    ACM Computing Surveys  Volume 22, Issue 4
    Dec. 1990
    109 pages
    ISSN:0360-0300
    EISSN:1557-7341
    DOI:10.1145/98163
    Issue’s Table of Contents

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 December 1990
    Published in CSUR Volume 22, Issue 4

    Permissions

    Request permissions for this article.

    Check for updates

    Qualifiers

    • Article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)1,084
    • Downloads (Last 6 weeks)92
    Reflects downloads up to 11 Aug 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2025)A survey on scalable consensus algorithms for blockchain technologyCyber Security and Applications10.1016/j.csa.2024.1000653(100065)Online publication date: Dec-2025
    • (2024)Reducing Persistence Overhead in Parallel State Machine Replication through Time-Phased Partitioned CheckpointJournal of Internet Services and Applications10.5753/jisa.2024.389115:1(194-211)Online publication date: 26-Jul-2024
    • (2024)IONIAProceedings of the 22nd USENIX Conference on File and Storage Technologies10.5555/3650697.3650711(225-242)Online publication date: 27-Feb-2024
    • (2024)Modelling the Raft Distributed Consensus Protocol in mCRL2Electronic Proceedings in Theoretical Computer Science10.4204/EPTCS.399.4399(7-20)Online publication date: 27-Mar-2024
    • (2024)RHCA: Robust HCA via Consistent RevotingMathematics10.3390/math1204059312:4(593)Online publication date: 17-Feb-2024
    • (2024)Asynchronous Consensus Quorum Read: Pioneering Read Optimization for Asynchronous Consensus ProtocolsElectronics10.3390/electronics1303048113:3(481)Online publication date: 23-Jan-2024
    • (2024)A Survey of Consortium Blockchain and Its ApplicationsCryptography10.3390/cryptography80200128:2(12)Online publication date: 22-Mar-2024
    • (2024)Rashnu: Data-Dependent Order-FairnessProceedings of the VLDB Endowment10.14778/3665844.366586117:9(2335-2348)Online publication date: 1-May-2024
    • (2024)Towards Full Stack Adaptivity in Permissioned BlockchainsProceedings of the VLDB Endowment10.14778/3641204.364121617:5(1073-1080)Online publication date: 1-Jan-2024
    • (2024)The Key Ideas Behind Boki's Shared LogsACM SIGOPS Operating Systems Review10.1145/3689051.368905458:1(7-14)Online publication date: 14-Aug-2024
    • Show More Cited By

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Get Access

    Login options

    Full Access

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media