DOI: 10.1145/3395363.3397366
Research article · Public Access

Detecting flaky tests in probabilistic and machine learning applications

Published: 18 July 2020
Abstract

    Probabilistic programming systems and machine learning frameworks such as Pyro, PyMC3, TensorFlow, and PyTorch provide scalable and efficient primitives for inference and training. However, these operations are non-deterministic, which makes it challenging for developers to write tests for applications that depend on such frameworks. The result is often flaky tests: tests that fail non-deterministically when run on the same version of the code (a minimal example appears after the abstract).
    In this paper, we conduct the first extensive study of flaky tests in this domain. In particular, we study projects that depend on four frameworks: Pyro, PyMC3, TensorFlow-Probability, and PyTorch. We identify 75 bug reports and commits that deal with flaky tests, and we categorize their common causes and fixes. This study provides developers with useful insights for dealing with flaky tests in this domain.
    Motivated by our study, we develop a technique, FLASH, to systematically detect flaky tests whose assertions pass in some runs and fail in others on the same code. These assertions fail because different runs of the same test observe different sequences of random numbers. FLASH exposes such failures (a detection sketch follows below), and our evaluation on 20 projects uncovers 11 previously unknown flaky tests that we reported to developers.
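
    To make the failure mode concrete, below is a minimal, hypothetical example of the kind of flaky test described above; it is not taken from the studied projects. The assertion bounds a statistic of freshly drawn random samples, so whether it holds depends on the random number sequence of that particular run.

    ```python
    import torch

    def test_sample_mean_close_to_zero():
        # Draw fresh samples from a standard normal distribution; no seed
        # is fixed, so every run observes a different random sequence.
        samples = torch.randn(1000)
        # The sample mean is distributed roughly N(0, 1/1000), so |mean|
        # exceeds 0.05 in about 11% of runs: the test passes most of the
        # time but fails non-deterministically on the same code.
        assert abs(samples.mean().item()) < 0.05
    ```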
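
    The detection sketch below is in the spirit of the approach outlined above, not the authors' FLASH implementation: rerun a test under many distinct seeds and flag it as flaky if it both passes and fails on the same code. The run_test wrapper and the fixed budget of 100 runs are assumptions made for illustration.

    ```python
    import torch

    def run_test(seed: int) -> bool:
        """Rerun the test above with a pinned seed; True means it passed."""
        torch.manual_seed(seed)  # fix this run's random number sequence
        samples = torch.randn(1000)
        return abs(samples.mean().item()) < 0.05

    def looks_flaky(budget: int = 100) -> bool:
        # Collect pass/fail outcomes across seeds; observing both a pass
        # and a fail proves the assertion depends on the random sequence.
        outcomes = {run_test(seed) for seed in range(budget)}
        return len(outcomes) == 2
    ```

    On the example above, looks_flaky() returns True quickly, since roughly one seed in ten makes the assertion fail. A test that never fails within the budget is only probabilistically cleared, which is why a statistical stopping criterion is a more principled choice than a fixed run budget.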



Information

    Published In

    ISSTA 2020: Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis
    July 2020
    591 pages
    ISBN:9781450380089
    DOI:10.1145/3395363

    Publisher

Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 18 July 2020

    Author Tags

    1. Flaky Tests
    2. Machine Learning
    3. Non-Determinism
    4. Probabilistic Programming
    5. Randomness

    Qualifiers

    • Research-article

    Conference

    ISSTA '20

    Acceptance Rates

    Overall Acceptance Rate 58 of 213 submissions, 27%


    Article Metrics

    • Downloads (last 12 months): 236
    • Downloads (last 6 weeks): 33
    Reflects downloads up to 27 Jul 2024.

Cited By
    • (2024) Flakiness Repair in the Era of Large Language Models. Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, pp. 441-443. https://doi.org/10.1145/3639478.3641227. Online publication date: 14-Apr-2024.
    • (2024) Taming Timeout Flakiness: An Empirical Study of SAP HANA. Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice, pp. 69-80. https://doi.org/10.1145/3639477.3639741. Online publication date: 14-Apr-2024.
    • (2024) FlakeSync: Automatically Repairing Async Flaky Tests. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1-12. https://doi.org/10.1145/3597503.3639115. Online publication date: 20-May-2024.
    • (2024) WEFix: Intelligent Automatic Generation of Explicit Waits for Efficient Web End-to-End Flaky Tests. Proceedings of the ACM on Web Conference 2024, pp. 3043-3052. https://doi.org/10.1145/3589334.3645628. Online publication date: 13-May-2024.
    • (2024) FlakyRank: Predicting Flaky Tests Using Augmented Learning to Rank. 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 872-883. https://doi.org/10.1109/SANER60148.2024.00095. Online publication date: 12-Mar-2024.
    • (2024) A Survey of Detecting Flakiness in Automated Test Regression Suite. 2024 21st Learning and Technology Conference (L&T), pp. 330-336. https://doi.org/10.1109/LT60077.2024.10469624. Online publication date: 15-Jan-2024.
    • (2024) Evaluating the Impact of Flaky Simulators on Testing Autonomous Driving Systems. Empirical Software Engineering, vol. 29, no. 2. https://doi.org/10.1007/s10664-023-10433-5. Online publication date: 21-Feb-2024.
    • (2023) Transforming Test Suites into Croissants. Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 1080-1092. https://doi.org/10.1145/3597926.3598119. Online publication date: 12-Jul-2023.
    • (2023) Test Maintenance for Machine Learning Systems: A Case Study in the Automotive Industry. 2023 IEEE Conference on Software Testing, Verification and Validation (ICST), pp. 410-421. https://doi.org/10.1109/ICST57152.2023.00045. Online publication date: Apr-2023.
    • (2023) Practical Flaky Test Prediction Using Common Code Evolution and Test History Data. 2023 IEEE Conference on Software Testing, Verification and Validation (ICST), pp. 210-221. https://doi.org/10.1109/ICST57152.2023.00028. Online publication date: Apr-2023.
