Theory and practice of failure transparency

January 1999

Author: David Ellis Lowell
Chair: Peter M. Chen

Publisher:
University of Michigan
Dept. 72 Ann Arbor, MI
United States

ISBN: 978-0-599-63353-7

Order Number: AAI9959810

Pages: 116

Abstract

System and application failures are all too common. In this dissertation we argue that operating systems should provide the fundamental abstraction we call failure transparency: the illusion that systems and applications do not fail. Systems that provide failure transparency attempt to completely mask failures from users, and failure handling from programmers. We construct a theory of consistent recovery that provides the fundamental rules for recovering transparently after a failure. In addition to aiding our quest for failure transparency, the theory unifies all existing recovery protocols: they are all simply variations on the theme of the theory's central invariant. Using the theory as a launching point, we construct a series of systems that get us closer to providing failure transparency. The first such system is Vista, a lightweight transaction library. Vista is built on reliable memory, and as a result realizes remarkable performance and simplicity. Vista improves transaction performance by three orders of magnitude over a similar disk-based system, yet has 1/10th the code. Vista exposes the high cost in complexity of disk's slow performance. We use Vista to construct Vistagrams, a distributed system that can provide distributed recovery with almost no overhead. However, both Vista and Vistagrams depend on the programmer's help in guaranteeing consistent recovery. Therefore, they cannot be said to provide failure transparency. To get us closer to that goal, we construct Discount Checking, a lightweight checkpointing system. Discount Checking can preserve and recover the complete state of a running process, including significant kernel state, despite being a user-level system itself. Using Discount Checking's fast checkpoints, we construct seven recovery protocols and show the performance of each on a wide variety of real, interactive applications. We find that we can provide failure transparency with overhead of 0–2%. We conclude that failure transparency is feasible, even for the challenging application domain we target.

Cited By

Lowell D, Chandra S and Chen P. Exploring failure transparency and the limits of generic recovery. Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4

Contributors

David Ellis Lowell
HP Labs
- Publication Years: 1996 - 2004
- Publication counts: 9
- Citation count: 1,106
- Available for Download: 12
- Downloads (cumulative): 24,673
- Downloads (12 months): 974
- Downloads (6 weeks): 66
- Average Downloads per Article: 2,056
- Average Citation per Article: 123
Peter M. Chen
University of Michigan, Ann Arbor
- Publication Years: 1988 - 2020
- Publication counts: 78
- Citation count: 8,178
- Available for Download: 72
- Downloads (cumulative): 100,734
- Downloads (12 months): 8,138
- Downloads (6 weeks): 822
- Average Downloads per Article: 1,399
- Average Citation per Article: 105

Recommendations

Exploring failure transparency and the limits of generic recovery
OSDI'00: Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4

We explore the abstraction of failure transparency in which the operating system provides the illusion of failure-free operation. To provide failure transparency, an operating system must recover applications after hardware, operating system, and ...
Failure Transparency in Remote Procedure Calls

A model of remote procedure call (RPC) which reflects certain generic properties of the application layer that can be exploited by the RPC layer during failure recovery is presented. A technique of adopting orphans caused by failures, which is based on ...