DOI: 10.5555/1267680.1267697

Subtleties in tolerating correlated failures in wide-area storage systems

Published: 08 May 2006

Abstract

High availability is widely accepted as an explicit requirement for distributed storage systems. Tolerating correlated failures is a key issue in achieving high availability in today's wide-area environments. This paper systematically revisits previously proposed techniques for addressing correlated failures. Using several real-world failure traces, we qualitatively answer four important questions regarding how to design systems to tolerate such failures. Based on our results, we identify a set of design principles that system builders can use to tolerate correlated failures. We show how these lessons can be effectively used by incorporating them into IrisStore, a distributed read-write storage layer that provides high availability. Our results using IrisStore on the PlanetLab over an 8-month period demonstrate its ability to withstand large correlated failures and meet preconfigured availability targets.

Published In

NSDI'06: Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
May 2006
54 pages

Sponsors

  • USENIX Association

Publisher

USENIX Association

United States

Qualifiers

  • Article

Cited By

  • (2020) Check before you change. Proceedings of the 17th Usenix Conference on Networked Systems Design and Implementation, pp. 575-590. DOI: 10.5555/3388242.3388285. Online publication date: 25-Feb-2020
  • (2018) Fault-tolerance, fast and slow. Proceedings of the 13th USENIX conference on Operating Systems Design and Implementation, pp. 391-408. DOI: 10.5555/3291168.3291197. Online publication date: 8-Oct-2018
  • (2016) Correlated crash vulnerabilities. Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation, pp. 151-167. DOI: 10.5555/3026877.3026890. Online publication date: 2-Nov-2016
  • (2015) Mojim. ACM SIGARCH Computer Architecture News 43:1, pp. 3-18. DOI: 10.1145/2786763.2694370. Online publication date: 14-Mar-2015
  • (2015) Mojim. ACM SIGPLAN Notices 50:4, pp. 3-18. DOI: 10.1145/2775054.2694370. Online publication date: 14-Mar-2015
  • (2015) Mojim. Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 3-18. DOI: 10.1145/2694344.2694370. Online publication date: 14-Mar-2015
  • (2015) On Replica Placement in High-Availability Storage Under Correlated Failure. Proceedings of the 9th International Conference on Combinatorial Optimization and Applications - Volume 9486, pp. 348-363. DOI: 10.1007/978-3-319-26626-8_26. Online publication date: 18-Dec-2015
  • (2014) Heading off correlated failures through independence-as-a-service. Proceedings of the 11th USENIX conference on Operating Systems Design and Implementation, pp. 317-334. DOI: 10.5555/2685048.2685073. Online publication date: 6-Oct-2014
  • (2012) Themis. Proceedings of the Third ACM Symposium on Cloud Computing, pp. 1-14. DOI: 10.1145/2391229.2391242. Online publication date: 14-Oct-2012
  • (2012) Understanding data survivability in archival storage systems. Proceedings of the 5th Annual International Systems and Storage Conference, pp. 1-12. DOI: 10.1145/2367589.2367605. Online publication date: 4-Jun-2012
