DOI: 10.1145/3597503.3608138
Research Article | Open Access

Do Automatic Test Generation Tools Generate Flaky Tests?

Published: 06 February 2024

Abstract

Non-deterministic test behavior, or flakiness, is common and dreaded among developers. Researchers have studied the issue and proposed approaches to mitigate it. However, the vast majority of previous work has only considered developer-written tests; the prevalence and nature of flaky tests produced by test generation tools remain largely unknown. We ask whether such tools also produce flaky tests and how these differ from developer-written ones. Furthermore, we evaluate mechanisms that suppress flaky test generation. We sample 6,356 projects written in Java or Python. For each project, we generate tests using EvoSuite (Java) and Pynguin (Python) and execute each test 200 times, looking for inconsistent outcomes. Our results show that flakiness is at least as common in generated tests as in developer-written tests. Nevertheless, existing flakiness suppression mechanisms implemented in EvoSuite are effective at alleviating this issue (71.7% fewer flaky tests). Compared to developer-written flaky tests, the causes of generated flaky tests are distributed differently: their non-deterministic behavior is more frequently caused by randomness than by networking or concurrency. With flakiness suppression enabled, the remaining flaky tests differ significantly from any flakiness previously reported; most are attributable to runtime optimizations and EvoSuite-internal resource thresholds. These insights, together with the accompanying dataset, can help maintainers improve test generation tools, inform recommendations for developers using these tools, and serve as a foundation for future research on test flakiness and test generation.
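
The detection approach described in the abstract is, at its core, a rerun loop: execute each generated test many times in isolation and flag any test whose outcomes disagree. Below is a minimal sketch of that idea for the Python side (Pynguin-generated tests executed via pytest), assuming per-run process isolation; the helper names and the test id are hypothetical, and this illustrates the methodology rather than the authors' actual pipeline.

```python
# Minimal sketch of rerun-based flakiness detection, mirroring the study's
# 200-executions-per-test setup. Helper names and the test id are
# hypothetical; this is not the authors' tooling.
import subprocess
import sys


def run_once(test_id: str) -> bool:
    """Execute one pytest test in a fresh process; True if it passed."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-q", test_id],
        capture_output=True,
    )
    return result.returncode == 0


def is_flaky(test_id: str, reruns: int = 200) -> bool:
    """Flag a test as flaky if identical code yields mixed outcomes."""
    outcomes = {run_once(test_id) for _ in range(reruns)}
    return len(outcomes) > 1  # observed both a pass and a fail


if __name__ == "__main__":
    test = "tests/test_generated.py::test_case_0"  # hypothetical test id
    print("flaky" if is_flaky(test) else "consistent across reruns")
```

Since the abstract attributes most generated flakiness to randomness, a hedged example of how such a test can arise: a generator that records a regression assertion against an unseeded random value captures whatever the value happened to be at generation time, so the assertion holds only on reruns where that value recurs. The test below is a contrived illustration, not taken from the study's dataset.

```python
import random


def test_generated_regression():
    # Assertion recorded against an unseeded random value at generation
    # time: it passes on roughly 10% of reruns and fails otherwise.
    value = random.randint(0, 9)
    assert value == 3
```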


Cited By

  • Exploring Pseudo-Testedness: Empirically Evaluating Extreme Mutation Testing at the Statement Level. 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME), 587–598. DOI: 10.1109/ICSME58944.2024.00059. Online publication date: 6 Oct 2024.
  • Private-Keep Out? Understanding How Developers Account for Code Visibility in Unit Testing. 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME), 312–324. DOI: 10.1109/ICSME58944.2024.00037. Online publication date: 6 Oct 2024.
  • Automated testing of metamodels and code co-evolution. Software and Systems Modeling. DOI: 10.1007/s10270-024-01245-2. Online publication date: 9 Dec 2024.


Published In

ICSE '24: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering
May 2024
2942 pages
ISBN: 9798400702174
DOI: 10.1145/3597503
This work is licensed under a Creative Commons Attribution 4.0 International License.

In-Cooperation

  • Faculty of Engineering of University of Porto

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 February 2024

Author Tags

  1. test generation
  2. flaky tests
  3. empirical study

Qualifiers

  • Research-article

Funding Sources

  • EPSRC grant Test FLARE
  • EPSRC Doctoral Training Partnership with the University of Sheffield
  • DFG project STUNT
  • BMWK project ANUKI

Conference

ICSE '24

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%

Article Metrics

  • Downloads (last 12 months): 683
  • Downloads (last 6 weeks): 79
Reflects downloads up to 28 Dec 2024

