Theory and practice of failure transparency
Publisher:
  • University of Michigan
  • Dept. 72 Ann Arbor, MI
  • United States
ISBN: 978-0-599-63353-7
Order Number: AAI9959810
Pages: 116
Abstract

System and application failures are all too common. In this dissertation we argue that operating systems should provide the fundamental abstraction we call failure transparency: the illusion that systems and applications do not fail. Systems that provide failure transparency attempt to completely mask failures from users, and failure handling from programmers. We construct a theory of consistent recovery that provides the fundamental rules for recovering transparently after a failure. In addition to aiding our quest for failure transparency, the theory unifies all existing recovery protocols: they are all simply variations on the theme of the theory's central invariant. Using the theory as a launching point, we construct a series of systems that get us closer to providing failure transparency. The first such system is Vista, a lightweight transaction library. Vista is built on reliable memory, and as a result realizes remarkable performance and simplicity. Vista improves transaction performance by three orders of magnitude over a similar disk-based system yet has 1/10th the code. Vista exposes the high cost in complexity of disk's slow performance. We use Vista to construct Vistagrams, a distributed system that can provide distributed recovery with almost no overhead. However, both Vista and Vistagrams depend on the programmer's help in guaranteeing consistent recovery. Therefore, they cannot be said to provide failure transparency. To get us closer to that goal, we construct Discount Checking, a lightweight checkpointing system. Discount Checking can preserve and recover the complete state of a running process, including significant kernel state, despite being itself a user-level system. Using Discount Checking's fast checkpoints, we construct seven recovery protocols and show the performance of each on a wide variety of real, interactive applications. We find that we can provide failure transparency with an overhead of 0–2%. We conclude that failure transparency is feasible, even for the challenging application domain we target.

Contributors
  • University of Michigan, Ann Arbor
