DOI: 10.1145/3132747.3132768
Open access

Pensieve: Non-Intrusive Failure Reproduction for Distributed Systems using the Event Chaining Approach

Published: 14 October 2017


Complex and unforeseen failures in distributed systems must be diagnosed and replicated in a development environment so that developers can understand the underlying problem and verify the resolution. System logs often form the only source of diagnostic information, and developers reconstruct a failure using manual guesswork. This is an unpredictable and time-consuming process which can lead to costly service outages while a failure is repaired.
This paper describes Pensieve, a tool capable of reconstructing near-minimal failure reproduction steps from log files and system bytecode, without human involvement. Unlike existing solutions that use symbolic execution to search for the entire path leading to the failure, Pensieve is based on the Partial Trace Observation, which states that programmers do not simulate the entire execution to understand the failure, but follow a combination of control and data dependencies to reconstruct a simplified trace that only contains events that are likely to be relevant to the failure. Pensieve follows a set of carefully designed rules to infer a chain of causally dependent events leading to the failure symptom while aggressively skipping unrelated code paths to avoid the path-explosion overheads of symbolic execution models.
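To make the idea of chaining causally dependent events concrete, here is a toy sketch in Python. It is not Pensieve's algorithm: the event names, the `DEPENDS_ON` table, and the `chain_events` function are all hypothetical, and real dependencies would have to be inferred from log files and bytecode rather than written by hand. The sketch only shows the general shape of the approach: start at the failure symptom, follow dependencies backward, and never visit events the symptom does not depend on.

```python
# Hypothetical toy model: each event maps to the events it depends on
# (control or data dependencies). None of these names come from Pensieve.
DEPENDS_ON = {
    "log: write failed": ["datanode removed", "client issues write"],
    "datanode removed": ["heartbeat timeout"],
    "heartbeat timeout": [],
    "client issues write": [],
    "metrics flushed": [],                    # unrelated to the symptom
    "gc pause logged": ["metrics flushed"],   # unrelated to the symptom
}


def chain_events(symptom, depends_on):
    """Walk dependencies backward from the symptom; return causes first."""
    seen = set()
    order = []

    def visit(event):
        if event in seen:
            return
        seen.add(event)
        for cause in depends_on.get(event, []):
            visit(cause)
        order.append(event)  # post-order: every cause precedes its effect

    visit(symptom)
    return order


steps = chain_events("log: write failed", DEPENDS_ON)
print(steps)
```

The returned list contains only the four events reachable from the symptom, ordered so that each cause precedes its effect. The two unrelated events are never explored, which is the intuition behind skipping unrelated code paths instead of simulating the entire execution.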

Supplementary Material

MP4 File (pensieve.mp4)



Cited By

  • (2024) EXCHAIN. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, pp. 2047-2062. DOI: 10.5555/3691825.3691937. Online publication date: 16-Apr-2024
  • (2024) Demystifying the Fight Against Complexity: A Comprehensive Study of Live Debugging Activities in Production Cloud Systems. Proceedings of the 2024 ACM Symposium on Cloud Computing, pp. 341-360. DOI: 10.1145/3698038.3698568. Online publication date: 20-Nov-2024
  • (2024) Efficient Reproduction of Fault-Induced Failures in Distributed Systems with Feedback-Driven Fault Injection. Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pp. 46-62. DOI: 10.1145/3694715.3695979. Online publication date: 4-Nov-2024





Published In

SOSP '17: Proceedings of the 26th Symposium on Operating Systems Principles
October 2017
677 pages
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.



Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Failure reproduction
  2. debugging
  3. distributed systems
  4. log


  • Research-article
  • Research
  • Refereed limited


SOSP '17

Acceptance Rates

Overall Acceptance Rate 174 of 961 submissions, 18%

