DOI: 10.1145/2637166.2637235
research-article

Machine fault tolerance for reliable datacenter systems

Published: 25 June 2014

Abstract

Although rare in absolute terms, undetected CPU, memory, and disk errors occur often enough at datacenter scale to significantly affect overall system reliability and availability. In this paper, we propose a new failure model, called Machine Fault Tolerance, and a new abstraction, a replicated write-once trusted table, to provide improved resilience to these types of failures. Since most machine failures manifest in application server and operating system code, we assume a Byzantine model for those parts of the system. However, by assuming that the hypervisor and network are trustworthy, we are able to reduce the overhead of machine-fault masking to be close to that of non-Byzantine Paxos.
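
To make the write-once idea concrete, the following is a minimal single-process sketch, assuming a table whose slots accept exactly one value and reject any conflicting rewrite. The abstract does not give the paper's interface, so `WriteOnceTable`, `put`, `get`, and `ConflictError` are hypothetical names chosen for illustration; replication and the trusted hypervisor are not modeled here.

```python
class ConflictError(Exception):
    """Raised when a slot already holds a different value (hypothetical name)."""


class WriteOnceTable:
    """Hypothetical in-memory table: each slot can be bound to one value only."""

    def __init__(self):
        self._slots = {}

    def put(self, key, value):
        # Re-writing the same value is accepted (idempotent);
        # writing a different value to a filled slot is rejected.
        if key in self._slots:
            if self._slots[key] != value:
                raise ConflictError(f"slot {key!r} already written")
            return
        self._slots[key] = value

    def get(self, key):
        # Returns None for a slot that has not been written yet.
        return self._slots.get(key)


table = WriteOnceTable()
table.put(1, "request-A")
table.put(1, "request-A")      # same value: accepted
try:
    table.put(1, "request-B")  # conflicting value: rejected
    rejected = False
except ConflictError:
    rejected = True
```

In this sketch, a writer cannot bind two different values to the same slot, so every reader of a filled slot sees the same value. How the paper replicates the table and enforces this against faulty machines is not described in the abstract.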

Cited By

  • (2023) NeoBFT: Accelerating Byzantine Fault Tolerance Using Authenticated In-Network Ordering. Proceedings of the ACM SIGCOMM 2023 Conference, pp. 239-254. DOI: 10.1145/3603269.3604874. Online publication date: 10-Sep-2023.
  • (2023) uBFT: Microsecond-Scale BFT using Disaggregated Memory. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 862-877. DOI: 10.1145/3575693.3575732. Online publication date: 27-Jan-2023.
  • (2015) ARCHISTAR. Proceedings of the 2015 IEEE 7th International Conference on Cloud Computing Technology and Science (CloudCom), pp. 371-378. DOI: 10.1109/CloudCom.2015.71. Online publication date: 30-Nov-2015.

Published In

APSys '14: Proceedings of 5th Asia-Pacific Workshop on Systems
June 2014
98 pages
ISBN:9781450330244
DOI:10.1145/2637166
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

  • Chinese Academy of Sciences

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

APSys'14: Asia-Pacific Workshop on Systems
June 25 - 26, 2014
Beijing, China

Acceptance Rates

APSys '14 Paper Acceptance Rate 14 of 35 submissions, 40%;
Overall Acceptance Rate 169 of 430 submissions, 39%
