Fault-tolerant computer system design

Dhiraj Kumar Pradhan
University of Bristol

Reviews

Reviewer: Claudiu Bulaceanu

Now that designing a functional system is routine, the emphasis has shifted to designing mission-critical systems with enhanced reliability and a high degree of safety. This textbook covers the architecture and design of fault-tolerant and high-availability systems, from both the theoretical and the practical points of view. The book is divided into eight parts.

The first part comprises four chapters, the first of which is an introduction. The second deals with hardware, software, and time redundancy techniques. Chapter 3 treats evaluation techniques. Chapter 4 deals with design methodologies.

Part 2, comprising seven chapters, discusses the architecture of fault-tolerant computers. The first chapter is an introduction to fault-tolerant architectures. The second chapter treats the categorization of applications and related systems. The third chapter is dedicated to several well-known computer architectures such as IBM and VAX. The fourth chapter briefly analyzes several fault-tolerant implementations, such as the AT&T, Tandem, Stratus, and VAXft systems. The fifth chapter discusses some examples of long-life systems, like the ones used in space for Voyager and Galileo. The sixth chapter highlights the features and characteristics of critical systems, like the one used for the space shuttle. The seventh chapter is a summary of the part.

Part 3, consisting of 13 chapters, presents the principles of fault-tolerant multiprocessor and distributed systems. After an introduction and a review of types of parallel processing, interconnection topology, and programming models, the following chapters deal with types of fault tolerance, namely static redundancy and dynamic redundancy. Other chapters deal with fault detection, recovery strategies, recovery techniques and schemes (such as rollback and forward recovery), and reconfiguration in multiprocessors.

Part 4 deals with several case studies regarding fault-tolerant multiprocessors and distributed systems. This part contains ten chapters that include several well-known implementations, such as the Tandem systems, the Stratus XA/r 300 Systems, the Sequoia systems, the Byzantine resilience model, and the space shuttle system.

The six chapters of Part 5 are dedicated to experimental analysis of system dependability. Statistical techniques, the design phase, and the prototype phase are presented, and the operational phase is discussed at length. Part 6 includes five chapters and covers reliability estimation, including element and system reliability, behavioral decomposition, and examples.

Part 7, including seven chapters, focuses on fault tolerance in software. It covers the motivation for this area of study, how to deal with faulty programs, fault tolerance in software, reliability models for software, acceptance tests, exception handling, and a case study of an extended distributed recovery block. The final part comprises four chapters and is dedicated to system diagnosis. Different flavors of diagnosis are discussed, including diagnosis under probabilistic models and under bounded models.

This is a classroom textbook intended for advanced undergraduate and graduate students in computer science or a related field of study; IT researchers who want to round out their knowledge of systems reliability; designers of reliable computer systems; and IT project managers. The level and structure of the text, as well as its approach to the fault-tolerant systems field, are typical of the university environment. Each part ends with suitable exercises.

The best part of the book is its comprehensive and clear explanation of a field that is not yet well established and in which much remains to be uncovered. The real-life examples and theoretical mathematical support are additional outstanding features. On the downside, the book would have been improved by following at least one example of the design of a fault-tolerant system from beginning to end, through all the phases outlined in the book.

Recommendations

Fault tolerant computing in computer design

This paper is presented as an attempt to cover the basic practices and methodologies involved in the area of contemporary fault tolerant computing in a computer design course. Most computer design courses cover design of various components of a computer ...

Fault Injection and Dependability Evaluation of Fault-Tolerant Systems

The authors describe a dependability evaluation method based on fault injection that establishes the link between the experimental evaluation of the fault tolerance process and the fault occurrence process. The main characteristics of a fault injection ...

Fault tolerant system design and SEU injection based testing

The methodology for the design and testing of fault tolerant systems implemented into an FPGA platform with different types of diagnostic techniques is presented in this paper. Basic principles of partial dynamic reconfiguration are described together ...

Sections

An introduction to the design and analysis of fault-tolerant systems

Architecture of fault-tolerant computers

Fault-tolerant multiprocessor and distributed systems: principles

Case studies in fault-tolerant multiprocessor and distributed systems

Experimental analysis of computer system dependability

Reliability estimation

Fault-tolerance in software

System diagnosis
