System and application failures are all too common. In this dissertation we argue that operating systems should provide the fundamental abstraction we call failure transparency: the illusion that systems and applications do not fail. Systems that provide failure transparency attempt to completely mask failures from users, and failure handling from programmers. We construct a theory of consistent recovery that provides the fundamental rules for recovering transparently after a failure. In addition to aiding our quest for failure transparency, the theory unifies all existing recovery protocols: they are all simply variations on the theme of the theory's central invariant. Using the theory as a launching point, we construct a series of systems that get us closer to providing failure transparency. The first such system is Vista, a lightweight transaction library. Vista is built on reliable memory, and as a result realizes remarkable performance and simplicity. Vista improves transaction performance by three orders of magnitude over a similar disk-based system, yet has 1/10th the code. Vista exposes the high cost in complexity of disk's slow performance. We use Vista to construct Vistagrams, a distributed system that can provide distributed recovery with almost no overhead. However, both Vista and Vistagrams depend on the programmer's help in guaranteeing consistent recovery. Therefore, they cannot be said to provide failure transparency. To get us closer to that goal, we construct Discount Checking, a lightweight checkpointing system. Discount Checking can preserve and recover the complete state of a running process, including significant kernel state, despite being itself a user-level system. Using Discount Checking's fast checkpoints, we construct seven recovery protocols and show the performance of each on a wide variety of real, interactive applications. We find that we can provide failure transparency with overhead of 0–2%. We conclude that failure transparency is feasible, even for the challenging application domain we target.