Article

Rx: treating bugs as allergies---a safe method to survive software failures

Authors:

Jagadeesan Sundaresan,

Yuanyuan Zhou

SOSP '05: Proceedings of the twentieth ACM symposium on Operating systems principles

Pages 235 - 248

https://doi.org/10.1145/1095810.1095833

Published: 20 October 2005

Abstract

Many applications demand availability. Unfortunately, software failures greatly reduce system availability. Prior work on surviving software failures suffers from one or more of the following limitations: required application restructuring, inability to address deterministic software bugs, unsafe speculation on program execution, and long recovery time.

This paper proposes an innovative safe technique, called Rx, which can quickly recover programs from many types of software bugs, both deterministic and non-deterministic. Our idea, inspired by allergy treatment in real life, is to roll back the program to a recent checkpoint upon a software failure, and then to re-execute the program in a modified environment. We base this idea on the observation that many bugs are correlated with the execution environment, and therefore can be avoided by removing the "allergen" from the environment. Rx requires few to no modifications to applications and provides programmers with additional feedback for bug diagnosis.

We have implemented Rx on Linux. Our experiments with four server applications that contain six bugs of various types show that Rx can survive all six software failures and provide transparent, fast recovery within 0.017-0.16 seconds, 21-53 times faster than the whole-program restart approach for all but one case (CVS). In contrast, the two tested alternatives, a whole-program restart approach and a simple rollback and re-execution without environmental changes, cannot successfully recover the three servers (Squid, Apache, and CVS) that contain deterministic bugs, and have only a 40% recovery rate for the server (MySQL) that contains a non-deterministic concurrency bug. Additionally, Rx's checkpointing system is lightweight, imposing small time and space overheads.
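The rollback-and-re-execute idea described in the abstract can be illustrated with a toy sketch. This is not the Rx implementation: the `ToyServer` class, its simulated overflow bug, the `padding` setting, and the `survive` function are all hypothetical names chosen for illustration, and an in-memory deep copy stands in for a real process checkpoint.

```python
import copy


class ToyServer:
    """Hypothetical server with a deterministic overflow bug (illustration only)."""

    BUFFER_SIZE = 8

    def __init__(self):
        self.state = {"journal": [], "served": 0}

    def handle(self, request, env):
        # State is mutated before the bug fires, so a failed attempt
        # leaves debris that only a rollback can clean up.
        self.state["journal"].append(request)
        capacity = self.BUFFER_SIZE + env.get("padding", 0)
        if len(request) > capacity:
            raise MemoryError("simulated buffer overflow")
        self.state["served"] += 1
        return len(request)


# Assumed list of environment changes, mildest first (hypothetical values).
ENV_CHANGES = [{"padding": 16}, {"padding": 64}]


def survive(server, request):
    """Run a request; on failure, roll back and retry in a changed environment."""
    checkpoint = copy.deepcopy(server.state)  # stand-in for a process checkpoint
    try:
        return server.handle(request, {}), {}
    except MemoryError:
        pass
    for env in ENV_CHANGES:
        server.state = copy.deepcopy(checkpoint)  # roll back to the checkpoint
        try:
            return server.handle(request, env), env
        except MemoryError:
            continue
    server.state = copy.deepcopy(checkpoint)
    raise RuntimeError("no environment change avoided the failure")
```

In this sketch a 20-character request fails in the unmodified environment, is rolled back, and succeeds once allocations are padded. The environment that worked is returned alongside the result, which loosely mirrors the diagnostic feedback the abstract mentions. Plain re-execution without an environment change would fail again here because the simulated bug is deterministic.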

Index Terms

1. General and reference
  1. Cross-computing tools and techniques
    1. Reliability
2. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software reliability

Published In

SOSP '05: Proceedings of the twentieth ACM symposium on Operating systems principles

October 2005

259 pages

ISBN:1595930795

DOI:10.1145/1095810

General Chair: Andrew Herbert, Microsoft Research, UK

Program Chair: Ken Birman, Cornell University, USA

ACM SIGOPS Operating Systems Review Volume 39, Issue 5
SOSP '05
December 2005
290 pages
ISSN:0163-5980
DOI:10.1145/1095809

Copyright © 2005 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SOSP05

Sponsor:

SOSP05: ACM SIGOPS 20th Symposium on Operating Systems Principles 2005

October 23 - 26, 2005

Brighton, United Kingdom

Acceptance Rates

Overall Acceptance Rate 131 of 716 submissions, 18%
