Tests4Py: A Benchmark for System Testing

Published: 10 July 2024

Abstract

Benchmarks are among the main drivers of progress in software engineering research. However, many current benchmarks are limited by inadequate system oracles and sparse unit tests. Our Tests4Py benchmark, derived from the BugsInPy benchmark, addresses these limitations. It includes 73 bugs from seven real-world Python applications and six bugs from example programs. Each subject in Tests4Py is equipped with an oracle for verifying functional correctness and supports both system and unit test generation. This allows for comprehensive qualitative studies and extensive evaluations, making Tests4Py a cutting-edge benchmark for research in test generation, debugging, and automatic program repair.
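
To give a concrete impression of how a Tests4Py subject might be exercised, the sketch below drives the Tests4Py command-line tool from Python. It is a minimal sketch, not the benchmark's definitive interface: the t4p command and its checkout, build, and systemtest subcommands, as well as the pysnooper project identifier and bug index, are assumptions about the tool's CLI rather than verified usage.

    # Minimal sketch: exercising one Tests4Py subject via its CLI.
    # The "t4p" command, its subcommands, and flags are assumptions, not a verified API.
    import subprocess

    def t4p(*args):
        """Run an assumed Tests4Py CLI command and stop on failure."""
        cmd = ["t4p", *args]
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Check out the (assumed) buggy variant 2 of the PySnooper subject.
    t4p("checkout", "-p", "pysnooper", "-i", "2")

    # Install the subject and its dependencies into a dedicated environment.
    t4p("build")

    # Generate system tests and run them; the subject's oracle decides
    # whether each generated input reproduces the failure.
    t4p("systemtest", "generate")
    t4p("systemtest", "test")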

References

[1] Remita Amine. 2021. youtube-dl. https://github.com/ytdl-org/youtube-dl
[2] Marcel Böhme, Ezekiel O. Soremekun, Sudipta Chattopadhyay, Emamurho Ugherughe, and Andreas Zeller. 2017. Where is the Bug and How is It Fixed? An Experiment with Practitioners. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). 117–128. ISBN 9781450351058. https://doi.org/10.1145/3106237.3106255
[3] Martin Eberlein, Yannic Noller, Thomas Vogel, and Lars Grunske. 2020. Evolutionary Grammar-Based Fuzzing. In Search-Based Software Engineering - 12th International Symposium, SSBSE 2020, Bari, Italy, October 7-8, 2020, Proceedings, Aldeida Aleti and Annibale Panichella (Eds.) (Lecture Notes in Computer Science, Vol. 12420). Springer, 105–120. https://doi.org/10.1007/978-3-030-59762-7_8
[4] Martin Eberlein, Marius Smytzek, Dominic Steinhöfel, Lars Grunske, and Andreas Zeller. 2023. Semantic Debugging. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, San Francisco, CA, USA, December 3-9, 2023, Satish Chandra, Kelly Blincoe, and Paolo Tonella (Eds.). ACM, 438–449. https://doi.org/10.1145/3611643.3616296
[5] Audrey Roy Greenfeld. 2022. Cookiecutter. https://www.cookiecutter.io/
[6] Péter Gyimesi, Béla Vancsics, Andrea Stocco, Davood Mazinanian, Árpád Beszédes, Rudolf Ferenc, and Ali Mesbah. 2019. BugsJS: a Benchmark of JavaScript Bugs. In 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST). 90–101. https://doi.org/10.1109/ICST.2019.00019
[7] Yang Hu, Umair Z. Ahmed, Sergey Mechtaev, Ben Leong, and Abhik Roychoudhury. 2019. Re-Factoring Based Program Repair Applied to Programming Assignments. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). 388–398. https://doi.org/10.1109/ASE.2019.00044
[8] Vladimir Iakovlev. 2022. The Fuck. https://github.com/nvbn/thefuck
[9] René Just, Darioush Jalali, and Michael D. Ernst. 2014. Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis (ISSTA 2014). 437–440. ISBN 9781450326452. https://doi.org/10.1145/2610384.2628055
[10] Alexander Kampmann, Nikolas Havrikov, Ezekiel O. Soremekun, and Andreas Zeller. 2020. When Does My Program Do This? Learning Circumstances of Software Behavior. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2020). 1228–1239. ISBN 9781450370431. https://doi.org/10.1145/3368089.3409687
[11] Fernanda Madeiral, Simon Urli, Marcelo Maia, and Martin Monperrus. 2019. BEARS: An Extensible Java Bug Benchmark for Automatic Program Repair Studies. In 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). 468–478. https://doi.org/10.1109/SANER.2019.8667991
[12] Jonathan Metzman, László Szekeres, Laurent Simon, Read Sprabery, and Abhishek Arya. 2021. FuzzBench: An Open Fuzzer Benchmarking Platform and Service. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2021). 1393–1403. ISBN 9781450385626. https://doi.org/10.1145/3468264.3473932
[13] Sanic Community Organization. 2024. Sanic. https://sanic.dev
[14] Ram Rachum, Alex Hall, and Iori Yanokura. 2019. PySnooper: Never use print for debugging again. https://doi.org/10.5281/zenodo.10462459
[15] Sebastián Ramírez. 2018. FastAPI. https://fastapi.tiangolo.com/
[16] Jakub Roztocil. 2022. HTTPie. https://httpie.io/
[17] Ripon K. Saha, Yingjun Lyu, Wing Lam, Hiroaki Yoshida, and Mukul R. Prasad. 2018. Bugs.Jar: A Large-Scale, Diverse Dataset of Real-World Java Bugs. In Proceedings of the 15th International Conference on Mining Software Repositories (MSR ’18). 10–13. ISBN 9781450357166. https://doi.org/10.1145/3196398.3196473
[18] Marius Smytzek and Andreas Zeller. 2022. SFLKit: a workbench for statistical fault localization. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2022). Association for Computing Machinery, New York, NY, USA. 1701–1705. ISBN 9781450394130. https://doi.org/10.1145/3540250.3558915
[19] Ezekiel O. Soremekun, Esteban Pavese, Nikolas Havrikov, Lars Grunske, and Andreas Zeller. 2022. Inputs From Hell. IEEE Trans. Software Eng., 48, 4 (2022), 1138–1153. https://doi.org/10.1109/TSE.2020.3013716
[20] Shin Hwei Tan, Jooyong Yi, Yulis, Sergey Mechtaev, and Abhik Roychoudhury. 2017. Codeflaws: a programming competition benchmark for evaluating automated program repair tools. In 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C). 180–182. https://doi.org/10.1109/ICSE-C.2017.76
[21] David A. Tomassi, Naji Dmeiri, Yichen Wang, Antara Bhowmick, Yen-Chuan Liu, Premkumar T. Devanbu, Bogdan Vasilescu, and Cindy Rubio-González. 2019. BugSwarm: mining and continuously growing a dataset of reproducible failures and fixes. In ICSE. IEEE / ACM, 339–349.
[22] Ratnadira Widyasari, Sheng Qin Sim, Camellia Lok, Haodi Qi, Jack Phan, Qijin Tay, Constance Tan, Fiona Wee, Jodie Ethelda Tan, Yuheng Yieh, Brian Goh, Ferdian Thung, Hong Jin Kang, Thong Hoang, David Lo, and Eng Lieh Ouh. 2020. BugsInPy: a database of existing bugs in Python programs to enable controlled testing and debugging studies. In ESEC/FSE ’20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, November 8-13, 2020, Prem Devanbu, Myra B. Cohen, and Thomas Zimmermann (Eds.). 1556–1560. https://doi.org/10.1145/3368089.3417943
[23] Jinqiu Yang, Alexey Zhikhartsev, Yuefei Liu, and Lin Tan. 2017. Better Test Cases for Better Automated Program Repair. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). 831–841. ISBN 9781450351058. https://doi.org/10.1145/3106237.3106274

Information

Published In

FSE 2024: Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering
July 2024, 715 pages
ISBN: 9798400706585
DOI: 10.1145/3663529
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. Benchmark
  2. Python
  3. Test generation

Qualifiers

  • Research-article

Funding Sources

  • DFG

Conference

FSE '24

Acceptance Rates

Overall Acceptance Rate: 112 of 543 submissions, 21%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 49
  • Downloads (last 12 months): 49
  • Downloads (last 6 weeks): 11

Reflects downloads up to 21 Sep 2024
