Independent Recovery in Large-Scale Distributed Systems

Author: Peter Triantafillou

IEEE Transactions on Software Engineering, Volume 22, Issue 11

Pages 812 - 826

https://doi.org/10.1109/32.553700

Published: 01 November 1996

Abstract

In large systems, replication can become an important means of improving data access times and availability. Existing recovery protocols, on the other hand, were proposed for small-scale distributed systems. Such protocols typically update stale, newly recovered sites with replicated data and resolve the commit uncertainty of recovering sites. Given that failures are more frequent and data accesses are costlier in large systems, such protocols can introduce large overheads there and must be avoided, if possible. We call these protocols dependent recovery protocols, since they require a recovering site to consult with other sites. Independent recovery has been studied in the context of one-copy systems and has been proven unattainable. This paper offers independent recovery protocols for large-scale systems with replicated data. It shows how the protocols can be incorporated into several well-known replication protocols and proves that these protocols continue to ensure data consistency. The paper then addresses the issue of nonblocking atomic commitment. It presents mechanisms that can reduce the overhead of termination protocols and the probability of blocking. Finally, the performance impact of the proposed recovery protocols is studied through simulation and analytical studies. The results show that the significant benefits of independent recovery can be enjoyed with a very small loss in data availability and a very small increase in the number of transaction abortions.

Index Terms

Independent Recovery in Large-Scale Distributed Systems
1. Information systems
  1. Data management systems
    1. Database administration
      1. Database utilities and tools
    2. Database management system engines

Published In

IEEE Transactions on Software Engineering Volume 22, Issue 11

November 1996

62 pages

ISSN:0098-5589

Editor:
Richard A. Kemmerer
Univ. of California, Santa Barbara, CA

Copyright © 1996 IEEE. All Rights Reserved.

Publisher

IEEE Press
