Article

Flashback: a lightweight extension for rollback and deterministic replay for software debugging

Authors: Sudarshan M. Srinivasan, Srikanth Kandula, Christopher R. Andrews, Yuanyuan Zhou

ATEC '04: Proceedings of the annual conference on USENIX Annual Technical Conference

Page 3

Published: 27 June 2004

Abstract

Software robustness has a significant impact on system availability. Unfortunately, finding software bugs is very challenging because many bugs are hard to reproduce. While debugging a program, it would be very useful to roll back a crashed program to a previous execution point and deterministically re-execute the "buggy" code region. However, most previous work on rollback and replay support was designed to survive hardware or operating system failures, and is therefore too heavyweight for the fine-grained rollback and replay needed for software debugging.

This paper presents Flashback, a lightweight OS extension that provides fine-grained rollback and replay to help debug software. Flashback uses shadow processes to efficiently roll back the in-memory state of a process, and logs a process's interactions with the system to support deterministic replay. Both shadow processes and system call logging are implemented in a lightweight fashion designed specifically for software debugging.
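The two mechanisms named in the abstract can be illustrated with a small user-level sketch. It is not the Flashback implementation, which is an extension inside the Linux kernel. It only assumes a POSIX system where `os.fork` gives the child a copy-on-write view of memory. The names `survives_from_checkpoint` and `SyscallLog` are hypothetical and chosen for illustration. The first function treats the calling process as the "shadow" that keeps the checkpointed state while a forked child executes. The class records the results of nondeterministic calls and returns the recorded values during replay.

```python
import os


def survives_from_checkpoint(work):
    """Run work() in a forked child; return True if it finished cleanly.

    The caller plays the role of the shadow: its memory image stays as it
    was at the fork, because the child's writes go to copy-on-write pages.
    """
    pid = os.fork()
    if pid == 0:  # child: the live execution
        status = 1
        try:
            work()
            status = 0
        except BaseException:
            status = 1
        finally:
            os._exit(status)  # never fall back into the caller's code
    _, wait_status = os.waitpid(pid, 0)
    return os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0


class SyscallLog:
    """Record results of nondeterministic calls, then replay them in order."""

    def __init__(self):
        self.entries = []      # list of (name, result) in call order
        self.position = 0      # next entry to hand out during replay
        self.replaying = False

    def start_replay(self):
        self.replaying = True
        self.position = 0

    def call(self, name, fn, *args):
        if not self.replaying:
            result = fn(*args)             # really perform the call
            self.entries.append((name, result))
            return result
        if self.position >= len(self.entries):
            raise RuntimeError("replay ran past the end of the log")
        logged_name, result = self.entries[self.position]
        if logged_name != name:
            raise RuntimeError("replay diverged from the recorded run")
        self.position += 1
        return result                      # logged value, no real call


if __name__ == "__main__":
    state = {"counter": 1}

    def buggy():
        state["counter"] = 99              # visible only in the child
        raise RuntimeError("simulated crash")

    print(survives_from_checkpoint(buggy), state["counter"])
```

In this sketch a crash in the child leaves the caller's `state` unchanged, which is the property that makes rollback cheap: nothing has to be copied back. Replaying from `SyscallLog` returns the same values the original run observed, so code that depends on them follows the same path again.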

We have implemented a prototype of Flashback in the Linux operating system. Our experimental results with micro-benchmarks and real applications show that Flashback adds little overhead and can quickly roll back a debugged program to a previous execution point and deterministically replay from that point.

Published In

ATEC '04: Proceedings of the annual conference on USENIX Annual Technical Conference

June 2004

572 pages

Publisher

USENIX Association

United States
