Fault tolerance under UNIX

Published: 01 January 1989
  • Abstract

    The initial design for a distributed, fault-tolerant version of UNIX based on three-way atomic message transmission was presented in an earlier paper [3]. The implementation effort then moved from Auragen Systems to Nixdorf Computer, where it was completed. This paper describes the working system, now known as the TARGON/32.
    The original design left open questions in at least two areas: fault tolerance for server processes and recovery after a crash were sketched only briefly and inaccurately, and rebackup after recovery was not discussed at all. The fundamental design involving three-way message transmission has remained unchanged. However, in addition to important changes in the implementation, server backup has been redesigned and is now more consistent with that of normal user processes. Recovery and rebackup have been completed in a less centralized and thus more efficient manner than previously envisioned.
    In this paper we review important aspects of the original design and note how the implementation differs from our original ideas. We then focus on the backup and recovery for server processes and the changes and additions in the design and implementation of recovery and rebackup.

    References

    [1]
    ARNOW, D., AND GLAZER, S. A fast safe file system for UNIX. Unpublished paper written in 1984 for Auragen Systems Corp., Ft. Lee, N.J.
    [2]
    BARTLETT, J. A nonstop kernel. In Proceedings of the Eighth Symposium on Operating Systems Principles (Asilomar, Calif., Dec. 1981). ACM, New York, 1981.
    [3]
    BORG, A., BAUMBACH, J., AND GLAZER, S. A message system supporting fault tolerance. In Proceedings of the Ninth Symposium on Operating Systems Principles (Bretton Woods, N.H., Oct. 1983). ACM, New York, 1983.
    [4]
    GRAY, J., MCJONES, P., BLASGEN, M., LINDSAY, B., LORIE, R., PRICE, T., PUTZOLU, F., AND TRAIGER, I. The recovery manager of the system R database manager. ACM Comput. Surv. 13, 2 (June 1981), 223-242.
    [5]
    KIM, W. Highly available systems for database applications. ACM Comput. Surv. 16, 1 (June 1984), 71-98.
    [6]
    LISKOV, B., AND SCHEIFLER, R. Guardians and actions: Linguistic support for robust, distributed programs. ACM Trans. Program. Lang. Syst. 5, 3 (July 1983), 381-404.
    [7]
    LISKOV, B., AND LADIN, R. Highly-available distributed services and fault-tolerant distributed garbage collection. Programming Methodology Group Memo 48, MIT Laboratory for Computer Science, May, 1986.
    [8]
    POWELL, M., AND MILLER, B. Process migration in DEMOS/MP. In Proceedings of the Ninth Symposium on Operating Systems Principles (Bretton Woods, N.H., Oct. 1983). ACM, New York, 1983.
    [9]
    POWELL, M., AND PRESOTTO, D. Publishing: A reliable broadcast communication mechanism. In Proceedings of the Ninth Symposium on Operating Systems Principles (Bretton Woods, N.H., Oct. 1983). ACM, New York, 1983.
    [10]
    RASHID, R., AND ROBERTSON, G. Accent: A communication-oriented network operating system kernel. Tech. Rep. CMU-CS-81-123, Dept. of Computer Science, Carnegie-Mellon Univ., Apr. 1981.
    [11]
    SCHNEIDER, F. B. Byzantine generals in action: Implementing fail-stop processors. ACM Trans. Comput. Syst. 2, 2 (May 1984), 145-154.
    [12]
    Stratus/32, VOS Reference Manual. Stratus Computers, Inc., Marlborough, Mass., 1982.
    [13]
    STROM, R. E., AND YEMINI, S. Optimistic recovery in distributed systems. ACM Trans. Comput. Syst. 3, 3 (Aug. 1985), 204-226.
    [14]
    TOLERANT SYSTEMS, INC. Eternity series: Technology brief. Internal publication, July 1988, Tolerant Systems, San Jose, Calif.
    [15]
    VERHOFSTAD, J. Recovery techniques for database systems. ACM Comput. Surv. 10, 2 (June 1978), 167-196.
    [16]
    WALTER, B. A robust and efficient protocol for checking the availability of remote sites. In Proceedings of the Sixth Workshop on Distributed Data Management and Computer Networks, (Berkeley, Calif., Feb. 1982), pp. 45-68.

    Reviews

    Paul Siegel

    After many years of relatively quiet use in Bell Labs, universities, and a few commercial development centers, the UNIX operating system has recently become more popular. Its portability is a major asset: applications developed on a machine running UNIX can generally be made to run on another machine running UNIX with very little effort, even if the two machines are made by different manufacturers. Around the time IBM came out with the System/360 in the 1960s, users began to balk at the necessity of rewriting all their applications whenever they changed hardware. This reluctance was heightened when CP/M, followed closely by MS-DOS, made thousands of applications portable across many microcomputers from different manufacturers. Meanwhile, UNIX had been implemented on a variety of machines of widely differing architectures. Although UNIX is much more powerful than MS-DOS, it ran principally on minis. As hardware prices fell and interest grew, however, UNIX became available on the largest machines as well as on micros.

    Simultaneously, the market for fault tolerance began to increase. First launched in a big way by Tandem in 1977, fault tolerance has virtually become a requirement in certain application areas, particularly transaction processing. As other companies came into this competitive market, some failed, including Auragen Systems, which tried to capitalize on the advantages of UNIX and fault tolerance to create a product for the transaction processing market. Nixdorf has apparently taken that product, revised it, and produced TARGON/32, a distributed, fault-tolerant version of UNIX.

    I enjoyed reading this paper because my background is in fault tolerance and UNIX, and I have followed the evolution of both with great interest. Despite its occasional lapses into UNIX jargon, I think that this paper would enlighten anyone who had an operating system background.
    It discusses some history and how the goals of TARGON/32 differ from those of other currently available systems. The authors then cover TARGON/32's architecture, server processes, process families, interprocess communication, backup and synchronization of user processes and peripheral servers, crash detection and recovery, and finally the impact of fault tolerance on performance. I was surprised to learn that the performance penalty of fault tolerance in the TARGON distributed architecture is only ten percent over non-fault-tolerant processes.

    I found a few minor errors that I wish had been corrected:

    • Not all TOCS readers are necessarily conversant with the problems of fault tolerance. Therefore, the authors should have pointed out that restoring a module to service can be an expensive operation, arguing against automatic restoration of modules immediately after they become available.
    • The few remaining assumptions about the reader's understanding of UNIX internals such as inodes and index blocks could easily have been removed.
    • TARGON/32 is an asymmetrical multiprocessor system, but the authors never examine the performance impact of the asymmetry.
    • The authors do not discuss how unrelated processes that share such resources as memory and semaphores are backed up.

    These are minor points. In general, I appreciate the value of this paper and the system it describes. I am in favor of any work that helps integrate UNIX with modern system requirements.

    Published In

    ACM Transactions on Computer Systems, Volume 7, Issue 1
    Feb. 1989
    116 pages
    ISSN:0734-2071
    EISSN:1557-7333
    DOI:10.1145/58564
    Publisher

    Association for Computing Machinery

    New York, NY, United States