DOI:10.1145/3318464.3389694
research-article

Causality-Guided Adaptive Interventional Debugging

Published: 31 May 2020

Abstract

Runtime nondeterminism is a fact of life in modern database applications. Previous research has shown that nondeterminism can cause applications to intermittently crash, become unresponsive, or experience data corruption. We propose Adaptive Interventional Debugging (AID) for debugging such intermittent failures. AID combines existing statistical debugging, causal analysis, fault injection, and group testing techniques in a novel way to (1) pinpoint the root cause of an application's intermittent failure and (2) generate an explanation of how the root cause triggers the failure. AID works by first identifying a set of runtime behaviors (called predicates) that are strongly correlated with the failure. It then utilizes temporal properties of the predicates to (over)-approximate their causal relationships. Finally, it uses fault injection to execute a sequence of interventions on the predicates and discover their true causal relationships. This enables AID to identify the true root cause and its causal relationship to the failure. We theoretically analyze how fast AID can converge to the identification. We evaluate AID with six real-world applications that intermittently fail under specific inputs. In each case, AID was able to identify the root cause and explain how the root cause triggered the failure, much faster than group testing and more precisely than statistical debugging. We also evaluate AID with many synthetically generated applications with known root causes and confirm that the benefits also hold for them.
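
To make the group-testing ingredient mentioned in the abstract concrete, the following is a minimal sketch of plain adaptive group testing driven by interventions. It is not the AID algorithm itself: AID, per the abstract, additionally uses the approximated causal relationships among predicates to choose interventions. The sketch assumes exactly one root-cause predicate and that blocking it always prevents the failure. The function names, the predicate names, and the simulated oracle are hypothetical, introduced only for illustration.

```python
from typing import Callable, FrozenSet, List, Tuple

# An "intervention oracle" takes the set of predicates that are blocked
# (for example via fault injection) and reports whether the failure still occurs.
Oracle = Callable[[FrozenSet[str]], bool]


def find_root_cause(candidates: List[str], fails_when_blocked: Oracle) -> Tuple[str, int]:
    """Binary-splitting group testing over candidate predicates.

    Assumption (not from the paper): exactly one candidate is the root cause,
    and blocking it deterministically prevents the failure.
    Returns the identified predicate and the number of interventions used.
    """
    pool = list(candidates)
    interventions = 0
    while len(pool) > 1:
        mid = len(pool) // 2
        blocked = pool[:mid]
        interventions += 1
        if fails_when_blocked(frozenset(blocked)):
            # Failure persists, so the root cause is outside the blocked group.
            pool = pool[mid:]
        else:
            # Failure disappeared, so the root cause is inside the blocked group.
            pool = blocked
    return pool[0], interventions


def make_oracle(root_cause: str) -> Oracle:
    """Simulated application: it fails unless the root cause is blocked."""
    def oracle(blocked: FrozenSet[str]) -> bool:
        return root_cause not in blocked
    return oracle


# Hypothetical predicates, listed in the order they were observed at runtime.
predicates = [f"p{i}" for i in range(8)]
result = find_root_cause(predicates, make_oracle("p5"))
print(result)
```

With eight candidates, this baseline needs three interventions regardless of which predicate is the root cause. The abstract's claim is that exploiting causal relationships lets AID do better than group testing of this kind.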




Published In

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
June 2020
2925 pages
ISBN:9781450367356
DOI:10.1145/3318464
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. concurrency bug
  2. group testing
  3. root-causing
  4. trace analysis

Qualifiers

  • Research-article

Funding Sources

  • Oracle Labs
  • National Science Foundation

Conference

SIGMOD/PODS '20
Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

  • Downloads (Last 12 months): 62
  • Downloads (Last 6 weeks): 12
Reflects downloads up to 09 Nov 2024

Cited By

  • (2024) Counterfactual Explanation at Will, with Zero Privacy Leakage. Proceedings of the ACM on Management of Data 2:3 (1-29). DOI:10.1145/3654933. Online publication date: 30-May-2024
  • (2024) Enabling Runtime Verification of Causal Discovery Algorithms with Automated Conditional Independence Reasoning. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-13). DOI:10.1145/3597503.3623348. Online publication date: 20-May-2024
  • (2023) CAMEO. Proceedings of the 2023 ACM Symposium on Cloud Computing (555-571). DOI:10.1145/3620678.3624791. Online publication date: 30-Oct-2023
  • (2023) XInsight: eXplainable Data Analysis Through The Lens of Causality. Proceedings of the ACM on Management of Data 1:2 (1-27). DOI:10.1145/3589301. Online publication date: 20-Jun-2023
  • (2023) Perfce: Performance Debugging on Databases with Chaos Engineering-Enhanced Causality Analysis. 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE) (1454-1466). DOI:10.1109/ASE56229.2023.00106. Online publication date: 11-Sep-2023
  • (2023) Applications of statistical causal inference in software engineering. Information and Software Technology 159:C. DOI:10.1016/j.infsof.2023.107198. Online publication date: 1-Jul-2023
  • (2022) How I stopped worrying about training data bugs and started complaining. Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning (1-5). DOI:10.1145/3533028.3533305. Online publication date: 12-Jun-2022
  • (2022) Unicorn. Proceedings of the Seventeenth European Conference on Computer Systems (199-217). DOI:10.1145/3492321.3519575. Online publication date: 28-Mar-2022
  • (2022) The resilience of conjunctive queries with inequalities. Information Sciences: an International Journal 613:C (982-1002). DOI:10.1016/j.ins.2022.08.049. Online publication date: 1-Oct-2022
  • (2022) BugDoc. The VLDB Journal 32:1 (75-101). DOI:10.1007/s00778-022-00733-5. Online publication date: 23-Feb-2022
