DOI: 10.1145/3605098.3636058

From Fine-tuning to Output: An Empirical Investigation of Test Smells in Transformer-Based Test Code Generation

Published: 21 May 2024

Abstract

Researchers have recently leveraged transformer-based test code generation models to improve testing-related tasks such as assert completion and test method generation. One such model, AthenaTest, has been well received by developers because it generates test cases that resemble developer-written ones. Although AthenaTest provides adequate test coverage and improves test code readability, concerns remain about the quality of its generated test code, particularly the presence of test smells, which can degrade the comprehension, readability, performance, and maintainability of test code. In this paper, we investigate whether the test cases generated by a transformer-based test code generation model, AthenaTest, contain test smells, and whether such smells are also present in its fine-tuning dataset, Methods2Test. We evaluate seven test smells in both AthenaTest's and Methods2Test's test cases. Our results reveal that 65% of Methods2Test's test cases contain test smells, which influences the output of AthenaTest: 62% of its test cases contain at least one test smell. We also examine the design (test size and assertion frequency) of AthenaTest's and Methods2Test's test cases. Our findings show that AthenaTest tends to generate more assertions than Methods2Test's test cases contain, which increases the occurrence rate of the Assertion Roulette and Duplicate Assert smells in its output.
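To make the two smells named above concrete, here is a minimal JUnit 4 sketch; the ShoppingCart class and its methods are hypothetical illustrations, not taken from AthenaTest's output or the Methods2Test dataset.

```java
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.util.HashMap;
import java.util.Map;

import org.junit.Test;

// Hypothetical example: ShoppingCart is illustrative only, included as a
// stub so the test compiles on its own.
public class ShoppingCartTest {

    static class ShoppingCart {
        private final Map<String, Integer> items = new HashMap<>();

        void addItem(String name, int quantity) {
            items.merge(name, quantity, Integer::sum);
        }

        int distinctItems() { return items.size(); }

        int totalQuantity() {
            return items.values().stream().mapToInt(Integer::intValue).sum();
        }

        boolean contains(String name) { return items.containsKey(name); }
    }

    @Test
    public void testAddItem() {
        ShoppingCart cart = new ShoppingCart();
        cart.addItem("apple", 2);

        // Assertion Roulette: several assertions with no explanatory
        // message, so a failing run does not indicate which check broke.
        assertEquals(1, cart.distinctItems());
        assertEquals(2, cart.totalQuantity());
        assertTrue(cart.contains("apple"));

        // Duplicate Assert: the same assertion repeated verbatim, adding
        // no new information about the behavior under test.
        assertEquals(2, cart.totalQuantity());
    }
}
```

Assertion Roulette obscures which check failed in a multi-assertion test, while Duplicate Assert inflates test size without adding behavioral coverage; these are the kinds of design effects the study attributes to AthenaTest's tendency to over-generate assertions.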



    Published In

    SAC '24: Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing
    April 2024
    1898 pages
ISBN: 9798400702433
DOI: 10.1145/3605098


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. test smells
    2. AthenaTest
    3. Methods2Test
    4. unit test
    5. transformer-based test code

    Qualifiers

    • Research-article

    Conference

    SAC '24

    Acceptance Rates

    Overall Acceptance Rate 1,650 of 6,669 submissions, 25%

