DOI: 10.5555/829524.831061
Article

Fault-Tolerant Replication Management in Large-Scale Distributed Storage Systems

Published: 18 October 1999

Abstract

Failures of all forms happen: from losing single network packets to site-wide disasters. Since businesses rely heavily on their data, it is imperative that failures require minimal time and effort to repair and that the service interruption during the failure or repair period be as short as possible. To this end, the ideal system should repair itself, relying on humans only when absolutely necessary in the repair process. This paper describes one component of a self-healing storage system: the component that allows for automatic recovery of access to data when the power comes back on after a large-scale outage. Our failure recovery protocol is part of a suite of modular protocols that make up the Palladio distributed storage system. This protocol guarantees that service will be restored quickly and automatically once enough failures have been repaired.

Cited By

  • (2006) Walking toward moving goalposts. Proceedings of the First international conference on Hot topics in autonomic computing, 10.5555/1973393.1973397, (4-4). Online publication date: 16-Jun-2006

Published In

SRDS '99: Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
October 1999
ISBN: 0769502903

Publisher

IEEE Computer Society

United States
