DOI: 10.1145/3524481.3527227
research-article

What do developer-repaired Flaky tests tell us about the effectiveness of automated Flaky test detection?

Published: 19 July 2022

Abstract

Because they pass or fail without code changes, flaky tests cause serious problems such as spuriously failing builds and the erosion of developers' trust in tests. Many previous evaluations of automated flaky test detection techniques do not accurately assess their usefulness for the developers who identify the flaky tests to repair. This is because researchers evaluate detection techniques against baselines that are not derived from past developer behavior or against no baselines at all. To study the effectiveness of an automated test rerunning technique, a common baseline for other approaches to detection, this paper uses 75 commits, authored by human software developers, that repair test flakiness in 31 real-world Python projects. Surprisingly, automated rerunning detects the developer-repaired flaky tests in only 40% of the studied commits. This result suggests that automated rerunning does not often find those flaky tests that developers fix, implying that it makes an unsuitable baseline for assessing a detection technique's usefulness for developers.
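
As a rough illustration of what rerunning-based detection means, the following minimal Python sketch runs a test function repeatedly and labels it flaky when both passing and failing outcomes are observed. It is not the tool used in the paper; the names `classify_by_rerunning`, `make_intermittent_test`, and `stable_test`, and the rerun counts, are hypothetical and chosen only for this example.

```python
from typing import Callable


def classify_by_rerunning(test: Callable[[], None], reruns: int = 100) -> str:
    """Run `test` `reruns` times with no code changes and classify it.

    Returns "flaky" if both passes and failures were observed,
    otherwise "pass" or "fail" according to the single outcome seen.
    """
    if reruns < 1:
        raise ValueError("reruns must be at least 1")
    passes = 0
    failures = 0
    for _ in range(reruns):
        try:
            test()
            passes += 1
        except AssertionError:
            failures += 1
    if passes and failures:
        return "flaky"
    return "pass" if passes else "fail"


def make_intermittent_test() -> Callable[[], None]:
    """Build a hypothetical test that fails on every third execution."""
    state = {"calls": 0}

    def test() -> None:
        state["calls"] += 1
        assert state["calls"] % 3 != 0

    return test


def stable_test() -> None:
    """A hypothetical test that always passes."""
    assert 1 + 1 == 2


if __name__ == "__main__":
    print(classify_by_rerunning(make_intermittent_test(), reruns=10))
    print(classify_by_rerunning(stable_test, reruns=10))
    # With too few reruns, the intermittent failure is never observed.
    print(classify_by_rerunning(make_intermittent_test(), reruns=2))
```

The last call shows one way in which rerunning can miss a flaky test: if the failing condition does not occur within the rerun budget, the test looks stable.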

    Published In

    AST '22: Proceedings of the 3rd ACM/IEEE International Conference on Automation of Software Test
    May 2022
    180 pages
    ISBN:9781450392860
    DOI:10.1145/3524481

    Sponsors

    In-Cooperation

    • IEEE TCSC: IEEE Technical Committee on Scalable Computing

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Flaky tests
    2. automated detection
    3. software testing

    Qualifiers

    • Research-article

    Funding Sources

    • EPSRC Doctoral Training Partnership

    Conference

    AST '22