Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article

ASSURE: automatic software self-healing using rescue points

Published: 07 March 2009 Publication History

Abstract

Software failures in server applications are a significant problem for preserving system availability. We present ASSURE, a system that introduces rescue points that recover software from unknown faults while maintaining both system integrity and availability, by mimicking system behavior under known error conditions. Rescue points are locations in existing application code for handling a given set of programmer-anticipated failures, which are automatically repurposed and tested for safely enabling fault recovery from a larger class of (unanticipated) faults. When a fault occurs at an arbitrary location in the program, ASSURE restores execution to an appropriate rescue point and induces the program to recover execution by virtualizing the program's existing error-handling facilities. Rescue points are identified using fuzzing, implemented using a fast coordinated checkpoint-restart mechanism that handles multi-process and multi-threaded applications, and, after testing, are injected into production code using binary patching. We have implemented an ASSURE Linux prototype that operates without application source code and without base operating system kernel changes. Our experimental results on a set of real-world server applications and bugs show that ASSURE enabled recovery for all of the bugs tested with fast recovery times, has modest performance overhead, and provides automatic self-healing orders of magnitude faster than current human-driven patch deployment methods.

References

[1]
M. Abadi, M. Budiu, U. Erlingsson, and J. Ligatti. Control-flow Integrity. In Proceedings of the ACM conference on Computer and Communications Security (CCS), pages 340--353, November 2005.
[2]
J. Boyd. Patterns of Conflict. Unpublished briefing, http://www.d-n-i.net/boyd/pdf/poc.pdf, 1986.
[3]
T. C. Bressoud and F. B. Schneider. Hypervisor-based fault tolerance. ACM Trans. Comput. Syst., 14(1):80--107, 1996.
[4]
D. Brumley, H. Wang, S. Jha, and D. Song. Creating vulnerability signatures using weakest pre-conditions. In Proceedings of the 2007 Computer Security Foundations Symposium, Venice, Italy, July 2007.
[5]
B. Buck and J. K. Hollingsworth. An API for runtime code patching. The International Journal of High Performance Computing Applications, 14(4):317--329, Winter 2000.
[6]
G. Candea and A. Fox. Crash-only software. In Proceedings of the 9th Workshop on Hot Topics in Operating Systems, May 2003.
[7]
S. Chandra. An evaluation of the recovery-related properties of Software Faults. PhD thesis, University of Michigan, 2000.
[8]
M. Costa, J. Crowcroft, M. Castro, and A. Rowstron. Vigilante: End-to-End Containment of Internet Worms. In Proceedings of the ACM Symposium on Systems and Operating Systems Principles (SOSP), December 2005.
[9]
B. Demsky and M. C. Rinard. Automatic detection and repair of errors in data structures. In Proceedings of the ACM Conference on Object Oriented Programming, Systems, Languages, and Applications (OOPSLA), October 2003.
[10]
J. Etoh. GCC extension for protecting applications from stack-smashing attacks. http://www.trl.ibm.com/projects/security/ssp/.
[11]
S. T. King, G. W. Dunlap, and P. M. Chen. Debugging operating systems with time-traveling virtual machines. In Proceedings of the USENIX Technical Conference, 2005.
[12]
V. Kiriansky, D. Bruening, and S. Amarasinghe. Secure execution via program shepherding. In Proceedings of the USENIX Security Symposium, August 2002.
[13]
N. Kolettis and N. D. Fulton. Software rejuvenation: analysis, module and applications. In FTCS '95: Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing, page 381, Washington, DC, USA, 1995. IEEE Computer Society.
[14]
O. Laadan and J. Nieh. Transparent checkpoint-restart of multiple processes on commodity operating systems. In Proceedings of the USENIX Technical Conference, 2007.
[15]
B. Miller, L. Fredriksen, and B. So. An empirical study of the reliability of unix utilities. Communications of the ACM, 33(12), December 1990.
[16]
J. Newsome, D. Brumley, and D. Song. Vulnerability-specific execution filtering for exploit prevention on commodity software. In Proceedings of the Symposium on Network and Distributed System Security (SNDSS), February 2006.
[17]
National Vulnerability Database. http://nvd.nist.gov/statistics.cfm, April 2006.
[18]
S. Osman, D. Subhraveti, G. Su, and J. Nieh. The design and implementation of Zap: A system for migrating computing environments. In Proceedings of the 5th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 361--376, December 2002.
[19]
PaX Project. Address space layout randomization, Mar 2003. http://pageexec.virtualave.net/docs/aslr.txt.
[20]
A. D. Roelker. Snort 2.0: Protocol flow analyzer.
[21]
S. Sidiroglou, Y. Giovanidis, and A. Keromytis. A dynamic mechanism for recovery from buffer overflow attacks. In Proceedings of the Information Security Conference (ISC), September 2005.
[22]
S. Sidiroglou, M. E. Locasto, S. W. Boyd, and A. D. Keromytis. Building a reactive immune system for software services. In Proceedings of the USENIX Technical Conference, April 2005.
[23]
Y. Song, M. E. Locasto, A. Stavrou, A. D. Keromytis, and S. J. Stolfo. On the infeasibility of modeling polymorphic shellcode. In Proceedings of the 14th ACM conference on Computer and communications security (CCS), 2007.
[24]
M. Sullivan and R. Chillarege. Software defects and their impact on system availability -- a study of field failures in operating systems. 21st Int. Symp. on Fault-Tolerant Computing (FTCS--21), pages 2--9, 1991.
[25]
J. Tucek, J. Newsome, S. Lu, C. Huang, S. Xanthos, D. Brumley, Y. Zhou, and D. Song. Sweeper: a lightweight end-to-end system for defending against fast worms. In Proceedings of the ACM SIGOPS/EuroSys European Conference on Computer Systems (EUROSYS), 2007.
[26]
H. J. Wang, C. Guo, D. R. Simon, and A. Zugenmaier. Shield: vulnerability-driven network filters for Preventing Known Vulnerability Exploits. In Proceedings of the ACM SIGCOMM Conference, August 2004.
[27]
V. Paxson. Bro: a system for detecting network intruders in real-time. Computer Networks (Amsterdam, Netherlands: 1999), 31(23-24):2435--2463, 1999.
[28]
F. Qin, J. Tucek, J. Sundaresan, and Y. Zhou. Rx: treating bugs as allergies -- a safe method to survive software failures. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), October 2005.
[29]
E. Rescorla. Security holes. Who cares? In Proceedings of the 12th USENIX Security Symposium, Washington, D.C., 2003.
[30]
M. Rinard. Acceptability-oriented Computing. In Proceedings of ACM Conference on Object Oriented Programming, Systems, Languages, and Applications, October 2003.
[31]
M. Rinard, C. Cadar, D. Dumitran, D. Roy, T. Leu, and J. W Beebee. Enhancing server availability and security through Failure-Oblivious Computing. In Proceedings of the Symposium on Operating Systems Design and Implementation (OSDI), December 2004.

Cited By

View all
  • (2019)DFix: automatically fixing timing bugs in distributed systemsProceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation10.1145/3314221.3314620(994-1009)Online publication date: 8-Jun-2019
  • (2019)Automatic Software RepairIEEE Transactions on Software Engineering10.1109/TSE.2017.275501345:1(34-67)Online publication date: 1-Jan-2019
  • (2018)A Survey of Techniques for Automatically Sensing the Behavior of a CrowdACM Computing Surveys10.1145/312934351:1(1-40)Online publication date: 19-Feb-2018
  • Show More Cited By

Recommendations

Reviews

Rafael Corchuelo

Software crashes due to programming bugs constitute a major problem for systems that need to be available 24 hours a day, seven days a week. Many authors are researching techniques that allow computer applications to detect their own failures and recover from them automatically. "Recovering" means that the application rolls back to a safe state and returns a reasonable error code to the client; recovering in such a way improves system availability, while an appropriate patch is created by a programmer. Sidiroglou et al. have devised a tool called ASSURE that allows server applications that run on Linux 2.6 systems to recover from their failures. ASSURE is innovative insofar as it can deal with applications that are available in binary form only, run on multiple threads and processes, handle polymorphic or encrypted input, or have deterministic bugs (not necessarily memory leaks); furthermore, it does not require any modifications to the underlying operating system. The authors have tested ASSURE on a number of actual bugs in well-known server applications, such as Apache, Squid, and MySQL. They prove that the tool is very efficient. ASSURE builds on so-called rescue points, which are functions that return integer error codes or null pointers when a known error is detected. Rescue points and error codes are identified automatically by running the application in a testing environment where it is fed invalid inputs. When a failure is detected for the first time, ASSURE analyzes the problem in a sandbox. First, it determines what function failed. Then, it identifies the closest rescue point to which the application can be rolled back to keep working well. A piece of code is then inserted at this rescue point that returns an error code, thus preventing the application from continuing and failing. Finally, the application is restarted. The paper is not at all difficult to read, although it does not provide enough details for other scientists to repeat the work. The authors' writing style is didactic and they get to the point very straightforwardly. They also make it very clear what their original contributions are. I recommend this paper and ASSURE for system administrators who need to keep their servers highly available. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Information & Contributors

Information

Published In

cover image ACM SIGPLAN Notices
ACM SIGPLAN Notices  Volume 44, Issue 3
ASPLOS 2009
March 2009
346 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/1508284
Issue’s Table of Contents
  • cover image ACM Conferences
    ASPLOS XIV: Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
    March 2009
    358 pages
    ISBN:9781605584065
    DOI:10.1145/1508244
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 March 2009
Published in SIGPLAN Volume 44, Issue 3

Check for updates

Author Tags

  1. binary patching
  2. chekpoint restart
  3. error recovery
  4. reliable software
  5. software self-healing

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)40
  • Downloads (Last 6 weeks)6
Reflects downloads up to 30 Aug 2024

Other Metrics

Citations

Cited By

View all
  • (2019)DFix: automatically fixing timing bugs in distributed systemsProceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation10.1145/3314221.3314620(994-1009)Online publication date: 8-Jun-2019
  • (2019)Automatic Software RepairIEEE Transactions on Software Engineering10.1109/TSE.2017.275501345:1(34-67)Online publication date: 1-Jan-2019
  • (2018)A Survey of Techniques for Automatically Sensing the Behavior of a CrowdACM Computing Surveys10.1145/312934351:1(1-40)Online publication date: 19-Feb-2018
  • (2018)Automatic Software RepairACM Computing Surveys10.1145/310590651:1(1-24)Online publication date: 23-Jan-2018
  • (2018)Fault Tolerance Based Comparative Analysis of Scheduling Algorithms in Cloud Computing2018 International Conference on Circuits and Systems in Digital Enterprise Technology (ICCSDET)10.1109/ICCSDET.2018.8821216(1-6)Online publication date: Dec-2018
  • (2017)HybrisACM Transactions on Storage10.1145/311989613:3(1-32)Online publication date: 28-Sep-2017
  • (2017)Systematic Erasure Codes with Optimal Repair Bandwidth and StorageACM Transactions on Storage10.1145/310947913:3(1-27)Online publication date: 28-Sep-2017
  • (2017)CPR: cross platform binary code reuse via platform independent trace programProceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3092703.3092707(158-169)Online publication date: 10-Jul-2017
  • (2016)Statistical similarity of binariesACM SIGPLAN Notices10.1145/2980983.290812651:6(266-280)Online publication date: 2-Jun-2016
  • (2016)Stratified synthesis: automatically learning the x86-64 instruction setACM SIGPLAN Notices10.1145/2980983.290812151:6(237-250)Online publication date: 2-Jun-2016
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media