Tests4Py: A Benchmark for System Testing

Published: 10 July 2024

Abstract

Benchmarks are among the main drivers of progress in software engineering research. However, many current benchmarks are limited by inadequate system oracles and sparse unit tests. Our Tests4Py benchmark, derived from the BugsInPy benchmark, addresses these limitations. It includes 73 bugs from seven real-world Python applications and six bugs from example programs. Each subject in Tests4Py is equipped with an oracle for verifying functional correctness and supports both system and unit test generation. This allows for comprehensive qualitative studies and extensive evaluations, making Tests4Py a cutting-edge benchmark for research in test generation, debugging, and automatic program repair.
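
To give a concrete impression of how a Tests4Py subject might be exercised, the sketch below drives the Tests4Py command-line tool from Python. It is a minimal sketch, not the benchmark's definitive interface: the t4p command and its checkout, build, and systemtest subcommands, as well as the pysnooper project identifier and bug index, are assumptions about the tool's CLI rather than verified usage.

    # Minimal sketch: exercising one Tests4Py subject via its CLI.
    # The "t4p" command, its subcommands, and flags are assumptions, not a verified API.
    import subprocess

    def t4p(*args):
        """Run an assumed Tests4Py CLI command and stop on failure."""
        cmd = ["t4p", *args]
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Check out the (assumed) buggy variant 2 of the PySnooper subject.
    t4p("checkout", "-p", "pysnooper", "-i", "2")

    # Install the subject and its dependencies into a dedicated environment.
    t4p("build")

    # Generate system tests and run them; the subject's oracle decides
    # whether each generated input reproduces the failure.
    t4p("systemtest", "generate")
    t4p("systemtest", "test")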

References

[1] Remita Amine. 2021. youtube-dl. https://github.com/ytdl-org/youtube-dl
[2] Marcel Böhme, Ezekiel O. Soremekun, Sudipta Chattopadhyay, Emamurho Ugherughe, and Andreas Zeller. 2017. Where is the Bug and How is It Fixed? An Experiment with Practitioners. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). 117–128. ISBN 9781450351058. https://doi.org/10.1145/3106237.3106255
[3] Martin Eberlein, Yannic Noller, Thomas Vogel, and Lars Grunske. 2020. Evolutionary Grammar-Based Fuzzing. In Search-Based Software Engineering - 12th International Symposium, SSBSE 2020, Bari, Italy, October 7-8, 2020, Proceedings, Aldeida Aleti and Annibale Panichella (Eds.) (Lecture Notes in Computer Science, Vol. 12420). Springer, 105–120. https://doi.org/10.1007/978-3-030-59762-7_8
[4] Martin Eberlein, Marius Smytzek, Dominic Steinhöfel, Lars Grunske, and Andreas Zeller. 2023. Semantic Debugging. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, San Francisco, CA, USA, December 3-9, 2023, Satish Chandra, Kelly Blincoe, and Paolo Tonella (Eds.). ACM, 438–449. https://doi.org/10.1145/3611643.3616296
[5] Audrey Roy Greenfeld. 2022. Cookiecutter. https://www.cookiecutter.io/
[6] Péter Gyimesi, Béla Vancsics, Andrea Stocco, Davood Mazinanian, Árpád Beszédes, Rudolf Ferenc, and Ali Mesbah. 2019. BugsJS: a Benchmark of JavaScript Bugs. In 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST). 90–101. https://doi.org/10.1109/ICST.2019.00019
[7] Yang Hu, Umair Z. Ahmed, Sergey Mechtaev, Ben Leong, and Abhik Roychoudhury. 2019. Re-Factoring Based Program Repair Applied to Programming Assignments. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). 388–398. https://doi.org/10.1109/ASE.2019.00044
[8] Vladimir Iakovlev. 2022. The Fuck. https://github.com/nvbn/thefuck
[9] René Just, Darioush Jalali, and Michael D. Ernst. 2014. Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis (ISSTA 2014). 437–440. ISBN 9781450326452. https://doi.org/10.1145/2610384.2628055
[10] Alexander Kampmann, Nikolas Havrikov, Ezekiel O. Soremekun, and Andreas Zeller. 2020. When Does My Program Do This? Learning Circumstances of Software Behavior. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2020). 1228–1239. ISBN 9781450370431. https://doi.org/10.1145/3368089.3409687
[11] Fernanda Madeiral, Simon Urli, Marcelo Maia, and Martin Monperrus. 2019. BEARS: An Extensible Java Bug Benchmark for Automatic Program Repair Studies. In 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). 468–478. https://doi.org/10.1109/SANER.2019.8667991
[12] Jonathan Metzman, László Szekeres, Laurent Simon, Read Sprabery, and Abhishek Arya. 2021. FuzzBench: An Open Fuzzer Benchmarking Platform and Service. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2021). 1393–1403. ISBN 9781450385626. https://doi.org/10.1145/3468264.3473932
[13] Sanic Community Organization. 2024. Sanic. https://sanic.dev
[14] Ram Rachum, Alex Hall, and Iori Yanokura. 2019. PySnooper: Never use print for debugging again. https://doi.org/10.5281/zenodo.10462459
[15] Sebastián Ramírez. 2018. FastAPI. https://fastapi.tiangolo.com/
[16] Jakub Roztocil. 2022. HTTPie. https://httpie.io/
[17] Ripon K. Saha, Yingjun Lyu, Wing Lam, Hiroaki Yoshida, and Mukul R. Prasad. 2018. Bugs.Jar: A Large-Scale, Diverse Dataset of Real-World Java Bugs. In Proceedings of the 15th International Conference on Mining Software Repositories (MSR ’18). 10–13. ISBN 9781450357166. https://doi.org/10.1145/3196398.3196473
[18] Marius Smytzek and Andreas Zeller. 2022. SFLKit: a workbench for statistical fault localization. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2022). Association for Computing Machinery, New York, NY, USA. 1701–1705. ISBN 9781450394130. https://doi.org/10.1145/3540250.3558915
[19] Ezekiel O. Soremekun, Esteban Pavese, Nikolas Havrikov, Lars Grunske, and Andreas Zeller. 2022. Inputs From Hell. IEEE Trans. Software Eng., 48, 4 (2022), 1138–1153. https://doi.org/10.1109/TSE.2020.3013716
[20] Shin Hwei Tan, Jooyong Yi, Yulis, Sergey Mechtaev, and Abhik Roychoudhury. 2017. Codeflaws: a programming competition benchmark for evaluating automated program repair tools. In 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C). 180–182. https://doi.org/10.1109/ICSE-C.2017.76
[21] David A. Tomassi, Naji Dmeiri, Yichen Wang, Antara Bhowmick, Yen-Chuan Liu, Premkumar T. Devanbu, Bogdan Vasilescu, and Cindy Rubio-González. 2019. BugSwarm: mining and continuously growing a dataset of reproducible failures and fixes. In ICSE. IEEE / ACM, 339–349.
[22] Ratnadira Widyasari, Sheng Qin Sim, Camellia Lok, Haodi Qi, Jack Phan, Qijin Tay, Constance Tan, Fiona Wee, Jodie Ethelda Tan, Yuheng Yieh, Brian Goh, Ferdian Thung, Hong Jin Kang, Thong Hoang, David Lo, and Eng Lieh Ouh. 2020. BugsInPy: a database of existing bugs in Python programs to enable controlled testing and debugging studies. In ESEC/FSE ’20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, November 8-13, 2020, Prem Devanbu, Myra B. Cohen, and Thomas Zimmermann (Eds.). 1556–1560. https://doi.org/10.1145/3368089.3417943
[23] Jinqiu Yang, Alexey Zhikhartsev, Yuefei Liu, and Lin Tan. 2017. Better Test Cases for Better Automated Program Repair. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). 831–841. ISBN 9781450351058. https://doi.org/10.1145/3106237.3106274

Information

Published In

FSE 2024: Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering
July 2024, 715 pages
ISBN: 9798400706585
DOI: 10.1145/3663529
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. Benchmark
  2. Python
  3. Test generation

Qualifiers

  • Research-article

Funding Sources

  • DFG

Conference

FSE '24

Acceptance Rates

Overall Acceptance Rate: 112 of 543 submissions, 21%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 49
  • Downloads (last 12 months): 49
  • Downloads (last 6 weeks): 11

Reflects downloads up to 21 Sep 2024
