research-article

What do developer-repaired flaky tests tell us about the effectiveness of automated flaky test detection?

Authors:

Gregory M. Kapfhammer,

Michael Hilton,

Phil McMinn

AST '22: Proceedings of the 3rd ACM/IEEE International Conference on Automation of Software Test

Pages 160-164

https://doi.org/10.1145/3524481.3527227

Published: 19 July 2022

Abstract

Because they pass or fail without code changes, flaky tests cause serious problems such as spuriously failing builds and the eroding of developers' trust in tests. Many previous evaluations of automated flaky test detection techniques do not accurately assess their usefulness for the developers who identify the flaky tests to repair. This is because researchers evaluate detection techniques against baselines that are not derived from past developer behavior or against no baselines at all. To study the effectiveness of an automated test rerunning technique, a common baseline for other approaches to detection, this paper uses 75 commits, authored by human software developers, that repair test flakiness in 31 real-world Python projects. Surprisingly, automated rerunning detects the developer-repaired flaky tests in only 40% of the studied commits. This result suggests that automated rerunning does not often find those flaky tests that developers fix, implying that it makes an unsuitable baseline for assessing a detection technique's usefulness for developers.

Index Terms

1. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        Software testing and debugging

Published In

AST '22: Proceedings of the 3rd ACM/IEEE International Conference on Automation of Software Test

May 2022

180 pages

ISBN: 9781450392860

DOI: 10.1145/3524481

Conference Chair:
Alejandra Garrido
General Chair:
W. Eric Wong
Program Chairs:
Guglielmo De Angelis,
Hyunsook Do,
Bao N. Nguyen

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering

In-Cooperation

IEEE TCSC: IEEE Technical Committee on Scalable Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

EPSRC Doctoral Training Partnership

Conference

AST '22

Sponsor:

SIGSOFT

AST '22: IEEE/ACM 3rd International Conference on Automation of Software Test

May 17 - 18, 2022

Pittsburgh, Pennsylvania
