DOI: 10.1145/3395363.3397366
Research article · Public Access

Detecting flaky tests in probabilistic and machine learning applications

Published: 18 July 2020
Abstract

    Probabilistic programming systems and machine learning frameworks such as Pyro, PyMC3, TensorFlow, and PyTorch provide scalable and efficient primitives for inference and training. However, these operations are non-deterministic, which makes it challenging for developers to write tests for applications that depend on such frameworks. The result is often flaky tests: tests that fail non-deterministically when run on the same version of the code (a minimal example appears after the abstract).
    In this paper, we conduct the first extensive study of flaky tests in this domain. In particular, we study projects that depend on four frameworks: Pyro, PyMC3, TensorFlow-Probability, and PyTorch. We identify 75 bug reports and commits that deal with flaky tests, and we categorize their common causes and fixes. This study provides developers with useful insights for dealing with flaky tests in this domain.
    Motivated by our study, we develop a technique, FLASH, to systematically detect flaky tests whose assertions pass in some runs and fail in others on the same code. These assertions fail because different runs of the same test observe different sequences of random numbers. FLASH exposes such failures (a detection sketch follows below), and our evaluation on 20 projects uncovers 11 previously unknown flaky tests that we reported to developers.
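
    To make the failure mode concrete, below is a minimal, hypothetical example of the kind of flaky test described above; it is not taken from the studied projects. The assertion bounds a statistic of freshly drawn random samples, so whether it holds depends on the random number sequence of that particular run.

    ```python
    import torch

    def test_sample_mean_close_to_zero():
        # Draw fresh samples from a standard normal distribution; no seed
        # is fixed, so every run observes a different random sequence.
        samples = torch.randn(1000)
        # The sample mean is distributed roughly N(0, 1/1000), so |mean|
        # exceeds 0.05 in about 11% of runs: the test passes most of the
        # time but fails non-deterministically on the same code.
        assert abs(samples.mean().item()) < 0.05
    ```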
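
    The detection sketch below is in the spirit of the approach outlined above, not the authors' FLASH implementation: rerun a test under many distinct seeds and flag it as flaky if it both passes and fails on the same code. The run_test wrapper and the fixed budget of 100 runs are assumptions made for illustration.

    ```python
    import torch

    def run_test(seed: int) -> bool:
        """Rerun the test above with a pinned seed; True means it passed."""
        torch.manual_seed(seed)  # fix this run's random number sequence
        samples = torch.randn(1000)
        return abs(samples.mean().item()) < 0.05

    def looks_flaky(budget: int = 100) -> bool:
        # Collect pass/fail outcomes across seeds; observing both a pass
        # and a fail proves the assertion depends on the random sequence.
        outcomes = {run_test(seed) for seed in range(budget)}
        return len(outcomes) == 2
    ```

    On the example above, looks_flaky() returns True quickly, since roughly one seed in ten makes the assertion fail. A test that never fails within the budget is only probabilistically cleared, which is why a statistical stopping criterion is a more principled choice than a fixed run budget.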



Information

    Published In

    ISSTA 2020: Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis
    July 2020
    591 pages
    ISBN:9781450380089
    DOI:10.1145/3395363

    Publisher

Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 18 July 2020

    Author Tags

    1. Flaky Tests
    2. Machine Learning
    3. Non-Determinism
    4. Probabilistic Programming
    5. Randomness

    Qualifiers

    • Research-article

    Conference

    ISSTA '20

    Acceptance Rates

    Overall Acceptance Rate 58 of 213 submissions, 27%


    Article Metrics

    • Downloads (last 12 months): 236
    • Downloads (last 6 weeks): 33
    Reflects downloads up to 27 Jul 2024.

Cited By
    • (2024) Flakiness Repair in the Era of Large Language Models. Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, pp. 441-443. https://doi.org/10.1145/3639478.3641227. Online publication date: 14-Apr-2024.
    • (2024) Taming Timeout Flakiness: An Empirical Study of SAP HANA. Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice, pp. 69-80. https://doi.org/10.1145/3639477.3639741. Online publication date: 14-Apr-2024.
    • (2024) FlakeSync: Automatically Repairing Async Flaky Tests. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1-12. https://doi.org/10.1145/3597503.3639115. Online publication date: 20-May-2024.
    • (2024) WEFix: Intelligent Automatic Generation of Explicit Waits for Efficient Web End-to-End Flaky Tests. Proceedings of the ACM on Web Conference 2024, pp. 3043-3052. https://doi.org/10.1145/3589334.3645628. Online publication date: 13-May-2024.
    • (2024) FlakyRank: Predicting Flaky Tests Using Augmented Learning to Rank. 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 872-883. https://doi.org/10.1109/SANER60148.2024.00095. Online publication date: 12-Mar-2024.
    • (2024) A Survey of Detecting Flakiness in Automated Test Regression Suite. 2024 21st Learning and Technology Conference (L&T), pp. 330-336. https://doi.org/10.1109/LT60077.2024.10469624. Online publication date: 15-Jan-2024.
    • (2024) Evaluating the Impact of Flaky Simulators on Testing Autonomous Driving Systems. Empirical Software Engineering, vol. 29, no. 2. https://doi.org/10.1007/s10664-023-10433-5. Online publication date: 21-Feb-2024.
    • (2023) Transforming Test Suites into Croissants. Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 1080-1092. https://doi.org/10.1145/3597926.3598119. Online publication date: 12-Jul-2023.
    • (2023) Test Maintenance for Machine Learning Systems: A Case Study in the Automotive Industry. 2023 IEEE Conference on Software Testing, Verification and Validation (ICST), pp. 410-421. https://doi.org/10.1109/ICST57152.2023.00045. Online publication date: Apr-2023.
    • (2023) Practical Flaky Test Prediction Using Common Code Evolution and Test History Data. 2023 IEEE Conference on Software Testing, Verification and Validation (ICST), pp. 210-221. https://doi.org/10.1109/ICST57152.2023.00028. Online publication date: Apr-2023.
