DOI: 10.1145/3597503.3608138
Research Article | Open Access

Do Automatic Test Generation Tools Generate Flaky Tests?

Published: 06 February 2024

Abstract

Non-deterministic test behavior, or flakiness, is common and dreaded among developers. Researchers have studied the issue and proposed approaches to mitigate it. However, the vast majority of previous work has only considered developer-written tests; the prevalence and nature of flaky tests produced by test generation tools remain largely unknown. We ask whether such tools also produce flaky tests and how these differ from developer-written ones. Furthermore, we evaluate mechanisms that suppress flaky test generation. We sample 6,356 projects written in Java or Python. For each project, we generate tests using EvoSuite (Java) and Pynguin (Python) and execute each test 200 times, looking for inconsistent outcomes. Our results show that flakiness is at least as common in generated tests as in developer-written tests. Nevertheless, existing flakiness suppression mechanisms implemented in EvoSuite are effective at alleviating this issue (71.7% fewer flaky tests). Compared to developer-written flaky tests, the causes of generated flaky tests are distributed differently: their non-deterministic behavior is more frequently caused by randomness than by networking or concurrency. With flakiness suppression enabled, the remaining flaky tests differ significantly from any flakiness previously reported; most are attributable to runtime optimizations and EvoSuite-internal resource thresholds. These insights, together with the accompanying dataset, can help maintainers improve test generation tools, inform recommendations for developers using these tools, and serve as a foundation for future research on test flakiness and test generation.
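
The detection approach described in the abstract is, at its core, a rerun loop: execute each generated test many times in isolation and flag any test whose outcomes disagree. Below is a minimal sketch of that idea for the Python side (Pynguin-generated tests executed via pytest), assuming per-run process isolation; the helper names and the test id are hypothetical, and this illustrates the methodology rather than the authors' actual pipeline.

```python
# Minimal sketch of rerun-based flakiness detection, mirroring the study's
# 200-executions-per-test setup. Helper names and the test id are
# hypothetical; this is not the authors' tooling.
import subprocess
import sys


def run_once(test_id: str) -> bool:
    """Execute one pytest test in a fresh process; True if it passed."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-q", test_id],
        capture_output=True,
    )
    return result.returncode == 0


def is_flaky(test_id: str, reruns: int = 200) -> bool:
    """Flag a test as flaky if identical code yields mixed outcomes."""
    outcomes = {run_once(test_id) for _ in range(reruns)}
    return len(outcomes) > 1  # observed both a pass and a fail


if __name__ == "__main__":
    test = "tests/test_generated.py::test_case_0"  # hypothetical test id
    print("flaky" if is_flaky(test) else "consistent across reruns")
```

Since the abstract attributes most generated flakiness to randomness, a hedged example of how such a test can arise: a generator that records a regression assertion against an unseeded random value captures whatever the value happened to be at generation time, so the assertion holds only on reruns where that value recurs. The test below is a contrived illustration, not taken from the study's dataset.

```python
import random


def test_generated_regression():
    # Assertion recorded against an unseeded random value at generation
    # time: it passes on roughly 10% of reruns and fails otherwise.
    value = random.randint(0, 9)
    assert value == 3
```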


Cited By

  • Exploring Pseudo-Testedness: Empirically Evaluating Extreme Mutation Testing at the Statement Level. 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME), 587–598. DOI: 10.1109/ICSME58944.2024.00059. Online publication date: 6 Oct 2024.
  • Private-Keep Out? Understanding How Developers Account for Code Visibility in Unit Testing. 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME), 312–324. DOI: 10.1109/ICSME58944.2024.00037. Online publication date: 6 Oct 2024.
  • Automated testing of metamodels and code co-evolution. Software and Systems Modeling. DOI: 10.1007/s10270-024-01245-2. Online publication date: 9 Dec 2024.


Published In

ICSE '24: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering
May 2024
2942 pages
ISBN: 9798400702174
DOI: 10.1145/3597503
This work is licensed under a Creative Commons Attribution 4.0 International License.

In-Cooperation

  • Faculty of Engineering of University of Porto

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 February 2024

Author Tags

  1. test generation
  2. flaky tests
  3. empirical study

Qualifiers

  • Research-article

Funding Sources

  • EPSRC grant Test FLARE
  • EPSRC Doctoral Training Partnership with the University of Sheffield
  • DFG project STUNT
  • BMWK project ANUKI

Conference

ICSE '24

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%

Article Metrics

  • Downloads (last 12 months): 683
  • Downloads (last 6 weeks): 79
Reflects downloads up to 28 Dec 2024

