Patch Correctness Assessment: A Survey

Authors:

Bin Luo

ACM Transactions on Software Engineering and Methodology, Volume 34, Issue 2

Article No.: 55, Pages 1 - 50

https://doi.org/10.1145/3702972

Published: 20 January 2025

Abstract

Most automated program repair methods rely on test cases to determine the correctness of the generated patches. However, because available test suites are incomplete, some patches that pass all the test cases may still be incorrect. This issue is known as the patch overfitting problem, and it is a longstanding problem in automated program repair. Because of overfitting, the patches produced by automated program repair tools require further validation to determine their correctness. Researchers have proposed many methods to automatically assess the correctness of patches, but no systematic review provides a detailed introduction to this problem, the existing solutions, and the challenges. To address this deficiency, we systematically review the existing approaches to patch correctness assessment. We first offer a few examples of overfitting patches to give a more detailed understanding of the problem. We then propose a comprehensive categorization of publicly available techniques and datasets, examine the commonly used evaluation metrics, and perform an in-depth analysis of the effectiveness of the existing models in addressing the challenge of overfitting. Based on our analysis, we present the difficulties encountered by current methodologies, along with possible avenues for future research.
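
The overfitting problem described in the abstract can be made concrete with a small sketch. The example below is purely illustrative and is not taken from the survey: the function names (abs_buggy, abs_overfitting, abs_correct), the test suite, and the held-out inputs are all hypothetical. It shows a patch that passes every available test case yet is still incorrect on inputs the suite does not cover.

```python
# Hypothetical illustration of an overfitting patch.
# None of these names or test cases come from the survey.

def abs_buggy(x):
    """Intended to return |x|, but the negative branch is wrong."""
    if x < 0:
        return x  # bug: should be -x
    return x


def abs_overfitting(x):
    """A plausible patch that only special-cases the failing test input."""
    if x == -3:
        return 3
    return x


def abs_correct(x):
    """The intended fix."""
    if x < 0:
        return -x
    return x


# Incomplete test suite available to the repair tool: (input, expected) pairs.
TEST_SUITE = [(0, 0), (5, 5), (-3, 3)]

# Inputs the test suite does not cover (assumed here to stand in for the
# full specification).
HELD_OUT = [(-7, 7), (-1, 1)]


def passes(fn, cases):
    """Return True if fn produces the expected output for every case."""
    return all(fn(x) == expected for x, expected in cases)


def is_overfitting(fn):
    """A patch overfits if it passes the suite but fails held-out inputs."""
    return passes(fn, TEST_SUITE) and not passes(fn, HELD_OUT)


if __name__ == "__main__":
    for fn in (abs_buggy, abs_overfitting, abs_correct):
        print(fn.__name__,
              "passes suite:", passes(fn, TEST_SUITE),
              "overfitting:", is_overfitting(fn))
```

In this sketch the held-out inputs play the role of the missing part of the specification. In practice that information is not available to the repair tool, which is why a separate patch correctness assessment step is needed.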

References

[1]

Valgrind. 2016. Retrieved from https://valgrind.org/

[2]

Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019. Code2Vec: Learning distributed representations of code. Proc. ACM Program. Lang. 3, POPL (2019), 40:1–40:29. DOI:

Digital Library

[3]

Gareth Bennett, Tracy Hall, and David Bowes. 2022. Some automatically generated patches are more likely to be correct than others: An analysis of Defects4J patch features. In Proceedings of the 3rd International Workshop on Automated Program Repair, 46–52.

Digital Library

[4]

Marcel Böhme and Abhik Roychoudhury. 2014. CoREBench: Studying complexity of regression errors. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA ’14). Corina S. Pasareanu and Darko Marinov (Eds.), ACM, 105–115. DOI:

Digital Library

[5]

Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 16 (2002), 321–357. DOI:

Digital Library

[6]

Liushan Chen, Yu Pei, and Carlo A. Furia. 2017. Contract-based program repair without the contracts. In Proceedings of the 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), 637–647. DOI:

[7]

Zimin Chen, Steve Kommrusch, Michele Tufano, Louis-Noël Pouchet, Denys Poshyvanyk, and Martin Monperrus. 2021. SequenceR: Sequence-to-sequence learning for end-to-end program repair. IEEE Trans. Software Eng. 47, 9 (2021), 1943–1959. DOI:

[8]

Zimin Chen and Martin Monperrus. 2019. The remarkable role of similarity in redundancy-based program repair. arxiv:1811.05703. Retrieved from https://arxiv.org/pdf/1811.05703

[9]

Davide Chicco and Giuseppe Jurman. 2020. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21 (2020), 1–13.

[10]

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20, 1 (1960), 37–46.

[11]

Daniela S. Cruzes and Tore Dyba. 2011. Recommended steps for thematic synthesis in software engineering. In Proceedings of the 2011 International Symposium on Empirical Software Engineering and Measurement. IEEE, 275–284.

Digital Library

[12]

Viktor Csuvik, Dániel Horváth, Ferenc Horváth, and László Vidács. 2020. Utilizing source code embeddings to identify correct patches. In Proceedings of the 2020 IEEE 2nd International Workshop on Intelligent Bug Fixing (IBF), 18–25. DOI:

[13]

Richard A. DeMillo, Richard J. Lipton, and Frederick G. Sayward. 1978. Hints on test data selection: Help for the practicing programmer. Computer 11, 4 (1978), 34–41.

Digital Library

[14]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT ’19), Vol. 1 (Long and Short Papers). Jill Burstein, Christy Doran, and Thamar Solorio (Eds.), Association for Computational Linguistics, 4171–4186. DOI:

[15]

Yukun Dong, Xiaotong Cheng, Yufei Yang, Lulu Zhang, Shuqi Wang, and Lingjie Kong. 2024. A method to identify overfitting program repair patches based on expression tree. Sci. Comput. Program. 235, 1 (2024), 103105. DOI:

Digital Library

[16]

Yukun Dong, Daolong Tang, Xiaotong Cheng, and Yufei Yang. 2022. Quality evaluation method of automatic software repair using syntax distance metrics. Symmetry 14, 8 (Aug. 2022), 1751. DOI:

[17]

Yukun Dong, Meng Wu, Li Zhang, Wenjing Yin, Mengying Wu, and Haojie Li. 2020. Priority measurement of patches for program repair based on semantic distance. Symmetry 12, 12 (Dec. 2020), 2102. DOI:

[18]

Thomas Durieux, Fernanda Madeiral, Matias Martinez, and Rui Abreu. 2019. Empirical review of Java program repair tools: A large-scale experiment on 2,141 bugs and 23,551 repair attempts. In Proceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/SIGSOFT FSE ’19). Marlon Dumas, Dietmar Pfahl, Sven Apel, and Alessandra Russo (Eds.), ACM, 302–313. DOI:

Digital Library

[19]

Thomas Durieux and Martin Monperrus. 2016. DynaMoth: Dynamic code synthesis for automatic program repair. In Proceedings of the 11th International Workshop on Automation of Software Test (AST@ICSE ’16). Christof J. Budnik, Gordon Fraser, and Francesca Lonetti (Eds.), ACM, 85–91. DOI:

Digital Library

[20]

Ahmed Elnaggar, Wei Ding, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Silvia Severini, Florian Matthes, and Burkhard Rost. 2021. Codetrans: Towards cracking the language of silicon’s code through self-supervised deeplearning and high performance computing. arXiv:2104.02443. Retrieved from https://arxiv.org/pdf/2104.02443

[21]

Gordon Fraser and Andrea Arcuri. 2011. EvoSuite: Automatic test suite generation for object-oriented software. In Proceedings of the SIGSOFT/FSE ’11 19th ACM SIGSOFT Symposium on the Foundations of Software Engineering (FSE-19) and ESEC ’11: 13th European Software Engineering Conference (ESEC-13). Tibor Gyimóthy and Andreas Zeller (Eds.), ACM, 416–419. DOI:

Digital Library

[22]

Xiang Gao, Sergey Mechtaev, and Abhik Roychoudhury. 2019. Crash-avoiding program repair. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA ’19). ACM, New York, NY, 8–18. DOI:

Digital Library

[23]

Luca Gazzola, Daniela Micucci, and Leonardo Mariani. 2019. Automatic software repair: A survey. IEEE Trans. Software Eng. 45, 1 (2019), 34–67. DOI:

Digital Library

[24]

Ali Ghanbari. 2020. ObjSim: Lightweight automatic patch prioritization via object similarity. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA ’20). ACM, New York, NY, 541–544. DOI:

Digital Library

[25]

Ali Ghanbari. 2022. Revisiting object similarity-based patch ranking in automated program repair: An extensive study. In Proceedings of the 3rd IEEE/ACM International Workshop on Automated Program Repair (APR@ICSE ’22). IEEE, 16–23. DOI:

Digital Library

[26]

Ali Ghanbari and Andrian Marcus. 2022. Patch correctness assessment in automated program repair based on the impact of patches on production and test code. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA ’22). Sukyoung Ryu and Yannis Smaragdakis (Eds.), ACM, 654–665. DOI:

Digital Library

[27]

Ali Ghanbari and Andrian Marcus. 2022. Shibboleth: Hybrid patch correctness assessment in automated program repair. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, 1–4.

Digital Library

[28]

Ali Ghanbari and Lingming Zhang. 2019. PraPR: Practical program repair via bytecode mutation. In Proceedings of the 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), 1118–1121. DOI:

Digital Library

[29]

Patrice Godefroid. 2014. Micro execution. In Proceedings of the 36th International Conference on Software Engineering (ICSE ’14). Pankaj Jalote, Lionel C. Briand, and André van der Hoek (Eds.), ACM, 539–549. DOI:

Digital Library

[30]

Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, and Westley Weimer. 2012. GenProg: A generic method for automatic software repair. IEEE Trans. Software Eng. 38, 1 (2012), 54–72. DOI:

Digital Library

[31]

Claire Le Goues, Michael Pradel, and Abhik Roychoudhury. 2019. Automated program repair. Commun. ACM 62, 12 (2019), 56–65.

Digital Library

[32]

Sylvain Hallé. 2022. Test suite generation for Boolean conditions with equivalence class partitioning. In Proceedings of the 2022 IEEE/ACM 10th International Conference on Formal Methods in Software Engineering (FormaliSE), 23–33. DOI:

Digital Library

[33]

Abram Hindle, Earl T. Barr, Mark Gabel, Zhendong Su, and Premkumar Devanbu. 2016. On the naturalness of software. Commun. ACM 59, 5 (2016), 122–131.

Digital Library

[34]

Thong Hoang, Hong Jin Kang, David Lo, and Julia Lawall. 2020. CC2Vec: Distributed representations of code changes. In Proceedings of the 42nd International Conference on Software Engineering (ICSE ’20). Gregg Rothermel and Doo-Hwan Bae (Eds.), ACM, 518–529. DOI:

Digital Library

[35]

Jinru Hua, Mengshi Zhang, Kaiyuan Wang, and Sarfraz Khurshid. 2018. Towards practical program repair with on-demand candidate generation. In Proceedings of the 40th International Conference on Software Engineering (ICSE ’18). ACM, New York, NY, 12–23. DOI:

Digital Library

[36]

Xin Huang, He Zhang, Xin Zhou, Muhammad Ali Babar, and Song Yang. 2018. Synthesizing qualitative research in software engineering: A critical review. In Proceedings of the 40th International Conference on Software Engineering, 1207–1218.

Digital Library

[37]

Elkhan Ismayilzada, Md Mazba Ur Rahman, Dongsun Kim, and Jooyong Yi. 2023. Poracle: Testing patches under preservation conditions to combat the overfitting problem of program repair. ACM Trans. Softw. Eng. Methodol. 33, 2 (Dec 2023), Article 44, 39 pages. DOI:

Digital Library

[38]

Jiajun Jiang, Yingfei Xiong, Hongyu Zhang, Qing Gao, and Xiangqun Chen. 2018. Shaping program repair space with existing patches and similar code. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA ’18). Frank Tip and Eric Bodden (Eds.), ACM, 298–309. DOI:

Digital Library

[39]

René Just, Darioush Jalali, and Michael D. Ernst. 2014. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA ’14). Corina S. Pasareanu and Darko Marinov (Eds.), ACM, 437–440. DOI:

Digital Library

[40]

Sungmin Kang and Shin Yoo. 2022. Language models can prioritize patches for practical program patching. In Proceedings of the 3rd IEEE/ACM International Workshop on Automated Program Repair (APR@ICSE ’22). IEEE, 8–15. DOI:

Digital Library

[41]

Rafael-Michael Karampatsis and Charles Sutton. 2020. How often do single-statement bugs occur? The Manysstubs4j dataset. In Proceedings of the 17th International Conference on Mining Software Repositories (MSR ’20). Sunghun Kim, Georgios Gousios, Sarah Nadi, and Joseph Hejderup (Eds.), ACM, 573–577. DOI:

Digital Library

[42]

Maria Kechagia, Sergey Mechtaev, Federica Sarro, and Mark Harman. 2022. Evaluating automatic program repair capabilities to repair API misuses. IEEE Trans. Softw. Eng. 48, 7 (2022), 2658–2679. DOI:

[43]

B. Kitchenham and S. Charters. 2007. Guidelines for performing systematic literature reviews in software engineering. Technical Report EBSE-2007-01. School of Computer Science and Mathematics, Keele University, Keele. Retrieved from https://legacyfileshare.elsevier.com/promis_misc/525444systematicreviewsguide.pdf

[44]

Anil Koyuncu, Kui Liu, Tegawendé F. Bissyandé, Dongsun Kim, Jacques Klein, Martin Monperrus, and Yves Le Traon. 2020. FixMiner: Mining relevant fix patterns for automated program repair. Empir. Softw. Eng. 25, 3 (2020), 1980–2024. DOI:

Digital Library

[45]

Quoc V. Le and Tomás Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31th International Conference on Machine Learning (ICML ’14). JMLR.org, 1188–1196. Retrieved from http://proceedings.mlr.press/v32/le14.html

[46]

Xuan-Bach Dinh Le, Lingfeng Bao, David Lo, Xin Xia, Shanping Li, and Corina S. Pasareanu. 2019. On reliability of patch correctness assessment. In Proceedings of the 41st International Conference on Software Engineering (ICSE ’19). Joanne M. Atlee, Tevfik Bultan, and Jon Whittle (Eds.), IEEE/ACM, 524–535. DOI:

Digital Library

[47]

Xuan-Bach Dinh Le, Duc-Hiep Chu, David Lo, Claire Le Goues, and Willem Visser. 2017. S3: Syntax- and semantic-guided repair synthesis via programming by examples. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE ’17). Eric Bodden, Wilhelm Schäfer, Arie van Deursen, and Andrea Zisman (Eds.), ACM, 593–604. DOI:

Digital Library

[48]

Xuan-Bach Dinh Le, David Lo, and Claire Le Goues. 2016. History driven program repair. In Proceedings of the IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER ’16). IEEE Computer Society, 213–224. DOI:

[49]

Thanh Le-Cong, Duc-Minh Luong, Xuan Bach D. Le, David Lo, Nhat-Hoa Tran, Bui Quang-Huy, and Quyet-Thang Huynh. 2023. Invalidator: Automated patch correctness assessment via semantic and syntactic reasoning. IEEE Trans. Softw. Eng. 49, 6 (2023), 3411–3429. DOI:

Digital Library

[50]

Claire Le Goues, Michael Dewey-Vogt, Stephanie Forrest, and Westley Weimer. 2012. A systematic study of automated program repair: Fixing 55 out of 105 bugs for \(\textdollar\)8 each. In Proceedings of the 2012 34th International Conference on Software Engineering (ICSE), 3–13. DOI:

[51]

Claire Le Goues, Michael Pradel, Abhik Roychoudhury, and Satish Chandra. 2021. Automatic program repair. IEEE Softw. 38, 4 (2021), 22–27.

Digital Library

[52]

Teven Le Scao, Thomas Wang, Daniel Hesslow, Stas Bekman, M. Saiful Bari, Stella Biderman, Hady Elsahar, Niklas Muennighoff, Jason Phang, Ofir Press, et al. 2022. What language model to train if you have one million GPU hours? In Findings of the Association for Computational Linguistics (EMNLP ’22). Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.), Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 765–782. DOI:

[53]

Yi Li, Shaohua Wang, and Tien N. Nguyen. 2020. DLFix: Context-based code transformation learning for automated program repair. In Proceedings of the 42nd International Conference on Software Engineering (ICSE ’20). Gregg Rothermel and Doo-Hwan Bae (Eds.), ACM, 602–614. DOI:

Digital Library

[54]

Jingjing Liang, Ruyi Ji, Jiajun Jiang, Shurui Zhou, Yiling Lou, Yingfei Xiong, and Gang Huang. 2021. Interactive patch filtering as debugging aid. In Proceedings of the 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME), 239–250. DOI:

[55]

Bo Lin, Shangwen Wang, Ming Wen, and Xiaoguang Mao. 2022. Context-aware code change embedding for better patch correctness assessment. ACM Trans. Softw. Eng. Methodol. 31, 3, (May 2022), Article 51, 29 pages. DOI:

Digital Library

[56]

Derrick Lin, James Koppel, Angela Chen, and Armando Solar-Lezama. 2017. QuixBugs: A multi-lingual program repair benchmark set based on the quixey challenge. In Proceedings Companion of the 2017 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity (SPLASH ’17). Gail C. Murphy (Ed.), ACM, 55–56. DOI:

Digital Library

[57]

Kui Liu, Anil Koyuncu, Tegawendé F. Bissyandé, Dongsun Kim, Jacques Klein, and Yves Le Traon. 2019. You cannot fix what you cannot find! An investigation of fault localization bias in benchmarking automated program repair systems. In Proceedings of the 12th IEEE Conference on Software Testing, Validation and Verification (ICST ’19). IEEE, 102–113. DOI:

[58]

Kui Liu, Anil Koyuncu, Dongsun Kim, and Tegawendé F. Bissyandé. 2019. AVATAR: Fixing semantic bugs with fix patterns of static analysis violations. In Proceedings of the 26th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER ’19). Xinyu Wang, David Lo, and Emad Shihab (Eds.), IEEE, 456–467. DOI:

[59]

Kui Liu, Anil Koyuncu, Dongsun Kim, and Tegawendé F. Bissyandé. 2019. TBar: Revisiting template-based automated program repair. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA ’19). Dongmei Zhang and Anders Møller (Eds.), ACM, 31–42. DOI:

Digital Library

[60]

Kui Liu, Li Li, Anil Koyuncu, Dongsun Kim, Zhe Liu, Jacques Klein, and Tegawendé F. Bissyandé. 2021. A critical review on the evaluation of automated program repair systems. J. Syst. Softw. 171 (2021), 110817.

[61]

Kui Liu, Shangwen Wang, Anil Koyuncu, Kisub Kim, Tegawendé F. Bissyandé, Dongsun Kim, Peng Wu, Jacques Klein, Xiaoguang Mao, and Yves Le Traon. 2020. On the efficiency of test suite based program repair: A systematic assessment of 16 automated repair systems for Java programs. In Proceedings of the 42nd International Conference on Software Engineering (ICSE ’20). Gregg Rothermel and Doo-Hwan Bae (Eds.), ACM, 615–627. DOI:

Digital Library

[62]

Xuliang Liu and Hao Zhong. 2018. Mining stackoverflow for program repair. In Proceedings of the 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), 118–129. DOI:

[63]

Yu Liu, Sergey Mechtaev, Pavle Subotic, and Abhik Roychoudhury. 2023. Program repair guided by datalog-defined static analysis. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’23). Satish Chandra, Kelly Blincoe, and Paolo Tonella (Eds.), ACM, 1216–1228. DOI:

Digital Library

[64]

Fan Long and Martin C. Rinard. 2015. Staged program repair with condition synthesis. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE ’15). Elisabetta Di Nitto, Mark Harman, and Patrick Heymans (Eds.), ACM, 166–178. DOI:

Digital Library

[65]

Thibaud Lutellier, Hung Viet Pham, Lawrence Pang, Yitong Li, Moshi Wei, and Lin Tan. 2020. CoCoNuT: Combining context-aware neural translation models using ensemble for program repair. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA ’20). Sarfraz Khurshid and Corina S. Pasareanu (Eds.), ACM, 101–114. DOI:

Digital Library

[66]

Fernanda Madeiral, Simon Urli, Marcelo de Almeida Maia, and Martin Monperrus. 2019. BEARS: An extensible Java bug benchmark for automatic program repair studies. In Proceedings of the 26th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER ’19). Xinyu Wang, David Lo, and Emad Shihab (Eds.), IEEE, 468–478. DOI:

[67]

Matias Martinez, Thomas Durieux, Romain Sommerard, Jifeng Xuan, and Martin Monperrus. 2017. Automatic repair of real bugs in Java: A large-scale experiment on the Defects4j dataset. Empir. Softw. Eng. 22, 4 (2017), 1936–1964. DOI:

Digital Library

[68]

Matias Martinez, Maria Kechagia, Anjana Perera, Justyna Petke, Federica Sarro, and Aldeida Aleti. 2024. Test-based patch clustering for automatically-generated patches assessment. Empir. Softw. Eng. 29, 5 (2024), 116. DOI:

Digital Library

[69]

Matias Martinez and Martin Monperrus. 2016. ASTOR: A program repair library for Java (demo). In Proceedings of the 25th International Symposium on Software Testing and Analysis (ISSTA ’16). Andreas Zeller and Abhik Roychoudhury (Eds.), ACM, 441–444. DOI:

Digital Library

[70]

Matias Martinez and Martin Monperrus. 2017. Open-ended exploration of the program repair search space with mined templates: The next 8935 patches for Defects4j. arXiv:1712.03854. Retrieved from https://arxiv.org/pdf/1712.03854v1

[71]

Matias Martinez and Martin Monperrus. 2018. Ultra-large repair search space with automatically mined templates: The cardumen mode of astor. In Proceedings of the 10th International Symposium on Search-Based Software Engineering (SSBSE ’18). Thelma Elita Colanzi and Phil McMinn (Eds.), Lecture Notes in Computer Science, Vol. 11036, Springer, 65–86. DOI:

[72]

Derrick McKee, Nathan Burow, and Mathias Payer. 2019. Software ethology: An accurate, resilient, and cross-architecture binary analysis framework. arXiv:1906.02928. Retrieved from https://arxiv.org/abs/1906.02928

[73]

Sergey Mechtaev, Manh-Dung Nguyen, Yannic Noller, Lars Grunske, and Abhik Roychoudhury. 2018. Semantic program repair using a reference implementation. In Proceedings of the 40th International Conference on Software Engineering, 129–139.

Digital Library

[74]

Sergey Mechtaev, Jooyong Yi, and Abhik Roychoudhury. 2016. Angelix: Scalable multiline program patch synthesis via symbolic analysis. In Proceedings of the 38th International Conference on Software Engineering (ICSE ’16). Laura K. Dillon, Willem Visser, and Laurie A. Williams (Eds.), ACM, 691–701. DOI:

Digital Library

[75]

Amirfarhad Nilizadeh, Marlon Calvo, Gary T. Leavens, and Xuan-Bach D. Le. 2021. More reliable test suites for dynamic APR by using counterexamples. In Proceedings of the 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE), 208–219. DOI:

[76]

Amirfarhad Nilizadeh, Gary T. Leavens, Xuan-Bach D. Le, Corina S. Păsăreanu, and David R. Cok. 2021. Exploring true test overfitting in dynamic automated program repair using formal methods. In Proceedings of the 2021 14th IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 229–240.

[77]

Carlos Pacheco and Michael D. Ernst. 2007. Randoop: Feedback-directed random testing for Java. In Proceedings Companion of the 22nd Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA ’07). Richard P. Gabriel, David F. Bacon, Cristina Videira Lopes, and Guy L. Steele Jr. (Eds.), ACM, 815–816. DOI:

Digital Library

[78]

Kexin Pei, Zhou Xuan, Junfeng Yang, Suman Jana, and Baishakhi Ray. 2020. Trex: Learning execution semantics from micro-traces for binary similarity. arXiv:2012.08680. Retrieved from https://arxiv.org/abs/2012.08680

[79]

Quang-Ngoc Phung, Misoo Kim, and Eunseok Lee. 2022. Identifying incorrect patches in program repair based on meaning of source code. IEEE Access 10 (2022), 12012–12030. DOI:

[80]

Yuhua Qi, Xiaoguang Mao, Yan Lei, Ziying Dai, and Chengsong Wang. 2014. The strength of random search on automated program repair. In Proceedings of the 36th International Conference on Software Engineering (ICSE ’14). Pankaj Jalote, Lionel C. Briand, and André van der Hoek (Eds.), ACM, 254–265. DOI:

Digital Library

[81]

Zichao Qi, Fan Long, Sara Achour, and Martin C. Rinard. 2015. An analysis of patch plausibility and correctness for generate-and-validate patch generation systems. In Proceedings of the 2015 International Symposium on Software Testing and Analysis (ISSTA ’15). Michal Young and Tao Xie (Eds.), ACM, 24–36. DOI:

Digital Library

[82]

Ripon K. Saha, Yingjun Lyu, Wing Lam, Hiroaki Yoshida, and Mukul R. Prasad. 2018. Bugs.jar: A large-scale, diverse dataset of real-world Java bugs. In Proceedings of the 15th International Conference on Mining Software Repositories (MSR ’18). Andy Zaidman, Yasutaka Kamei, and Emily Hill (Eds.), ACM, 10–13. DOI:

Digital Library

[83]

Ripon K. Saha, Yingjun Lyu, Hiroaki Yoshida, and Mukul R. Prasad. 2017. Elixir: Effective object-oriented program repair. In Proceedings of the 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), 648–659. DOI:

[84]

Sina Shamshiri, René Just, José Miguel Rojas, Gordon Fraser, Phil McMinn, and Andrea Arcuri. 2015. Do automatically generated unit tests find real faults? An empirical study of effectiveness and challenges (T). In Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE ’15). Myra B. Cohen, Lars Grunske, and Michael Whalen (Eds.), IEEE Computer Society, 201–211. DOI:

Digital Library

[85]

Ridwan Shariffdeen, Yannic Noller, Lars Grunske, and Abhik Roychoudhury. 2021. Concolic program repair. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI ’21). ACM, New York, NY, 390–405. DOI:

Digital Library

[86]

Edward K. Smith, Earl T. Barr, Claire Le Goues, and Yuriy Brun. 2015. Is the cure worse than the disease? Overfitting in automated program repair. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE ’15). Elisabetta Di Nitto, Mark Harman, and Patrick Heymans (Eds.), ACM, 532–543. DOI:

Digital Library

[87]

Ming Tan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2015. Lstm-based deep learning models for non-factoid answer selection. arXiv:1511.04108. Retrieved from https://arxiv.org/pdf/1511.04108

[88]

Ming Tan, Lin Tan, Sashank Dara, and Caleb Mayeux. 2015. Online defect prediction for imbalanced data. In Proceedings of the 37th IEEE/ACM International Conference on Software Engineering (ICSE ’15). Antonia Bertolino, Gerardo Canfora, and Sebastian G. Elbaum (Eds.), IEEE Computer Society, 99–108. DOI:

[89]

Shin Hwei Tan, Jooyong Yi, Yulis, Sergey Mechtaev, and Abhik Roychoudhury. 2017. Codeflaws: A programming competition benchmark for evaluating automated program repair tools. In Proceedings of the 39th International Conference on Software Engineering (ICSE ’17). Sebastián Uchitel, Alessandro Orso, and Martin P. Robillard (Eds.), IEEE Computer Society, 180–182. DOI:

Digital Library

[90]

Shin Hwei Tan, Hiroaki Yoshida, Mukul R. Prasad, and Abhik Roychoudhury. 2016. Anti-patterns in search-based program repair. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE ’16). ACM, New York, NY, 727–738. DOI:

Digital Library

[91]

Xunzhu Tang, Haoye Tian, Zhenghan Chen, Weiguo Pian, Saad Ezzini, Abdoul Kader Kabore, Andrew Habib, Jacques Klein, and Tegawende F. Bissyande. 2024. Learning to represent patches. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings (Lisbon, Portugal) (ICSE-Companion ’24). Association for Computing Machinery, New York, NY, 396–397. DOI:

Digital Library

[92]

Haoye Tian, Yinghua Li, Weiguo Pian, Abdoul Kader Kabore, Kui Liu, Andrew Habib, Jacques Klein, and Tegawendé F. Bissyandé. 2022. Predicting patch correctness based on the similarity of failing test cases. ACM Trans. Softw. Eng. Methodol. 31, 4 (2022), 1–30.

Digital Library

[93]

Haoye Tian, Kui Liu, Abdoul Kader Kaboré, Anil Koyuncu, Li Li, Jacques Klein, and Tegawendé F. Bissyandé. 2020. Evaluating representation learning of code changes for predicting patch correctness in program repair. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, 981–992.

Digital Library

[94]

Haoye Tian, Kui Liu, Yinghua Li, Abdoul Kader Kaboré, Anil Koyuncu, Andrew Habib, Li Li, Junhao Wen, Jacques Klein, and Tegawendé F. Bissyandé. 2023. The best of both worlds: Combining learned embeddings with engineered features for accurate prediction of correct patches. ACM Trans. Softw. Eng. Methodol. 32, 4, (May 2023), Article 92, 34 pages. DOI:

Digital Library

[95]

Haoye Tian, Xunzhu Tang, Andrew Habib, Shangwen Wang, Kui Liu, Xin Xia, Jacques Klein, and Tegawendé F. Bissyandé. 2022. Is this change the answer to that problem? Correlating descriptions of bug and code changes for evaluating patch correctness. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, 1–13.

[96]

Lewis Tunstall, Leandro Von Werra, and Thomas Wolf. 2022. Natural Language Processing with Transformers. O’Reilly Media, Inc.

[97]

Rijnard van Tonder and Claire Le Goues. 2018. Static automated program repair for heap properties. In Proceedings of the 40th International Conference on Software Engineering (ICSE ’18). ACM, New York, NY, 151–162. DOI:

Digital Library

[98]

Shangwen Wang, Ming Wen, Bo Lin, Hongjun Wu, Yihao Qin, Deqing Zou, Xiaoguang Mao, and Hai Jin. 2020. Automated patch correctness assessment: How far are we? In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, 968–980.

Digital Library

[99]

Westley Weimer, Zachary P. Fry, and Stephanie Forrest. 2013. Leveraging program equivalence for adaptive program repair: Models and first results. In Proceedings of the 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE ’13). Ewen Denney, Tevfik Bultan, and Andreas Zeller (Eds.), IEEE, 356–366. DOI:

Digital Library

[100]

Westley Weimer, ThanhVu Nguyen, Claire Le Goues, and Stephanie Forrest. 2009. Automatically finding patches using genetic programming. In Proceedings of the 2009 IEEE 31st International Conference on Software Engineering. IEEE, 364–374.

Digital Library

[101]

Ming Wen, Junjie Chen, Rongxin Wu, Dan Hao, and Shing-Chi Cheung. 2018. Context-aware patch generation for better automated program repair. In Proceedings of the 40th International Conference on Software Engineering (ICSE ’18). Michel Chaudron, Ivica Crnkovic, Marsha Chechik, and Mark Harman (Eds.), ACM, 1–11. DOI:

Digital Library

[102]

Martin White, Michele Tufano, Matías Martínez, Martin Monperrus, and Denys Poshyvanyk. 2019. Sorting and transforming program repair ingredients via deep learning code similarities. In Proceedings of the 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), 479–490. DOI:

[103]

Qiushi Wu and Kangjie Lu. 2021. On the feasibility of stealthily introducing vulnerabilities in open-source software via hypocrite commits. Proc. Oakland (2021). Retrieved from https://linuxreviews.org/images/d/d9/OpenSourceInsecurity.pdf

[104]

Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated program repair in the era of large pre-trained language models. In Proceedings of the 45th International Conference on Software Engineering (ICSE ’23). IEEE Press, 1482–1494. DOI:

Digital Library

[105]

Chunqiu Steven Xia and Lingming Zhang. 2022. Less training, more repairing please: Revisiting automated program repair via zero-shot learning. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’22). ACM, New York, NY, 959–971. DOI:

Digital Library

[106]

Chunqiu Steven Xia and Lingming Zhang. 2024. Automated program repair via conversation: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT. In Proceedings of the 33rd ACM SIGSOFT International Symposium on SoftwareTesting and Analysis (Vienna, Austria) (ISSTA 2024). Association for Computing Machinery, New York, NY, 819–831. DOI:

Digital Library

[107]

Qi Xin and Steven P. Reiss. 2017. Identifying test-suite-overfitted patches through test case generation. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis. Tevfik Bultan and Koushik Sen (Eds.), ACM, 226–236.

[108]

Qi Xin and Steven P. Reiss. 2017. Leveraging syntax-related code for automated program repair. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE ’17). Grigore Rosu, Massimiliano Di Penta, and Tien N. Nguyen (Eds.), IEEE Computer Society, 660–670.

[109]

Yingfei Xiong, Xinyuan Liu, Muhan Zeng, Lu Zhang, and Gang Huang. 2018. Identifying patch correctness in test-based program repair. In Proceedings of the 40th International Conference on Software Engineering (ICSE ’18). Michel Chaudron, Ivica Crnkovic, Marsha Chechik, and Mark Harman (Eds.), ACM, 789–799.

[110]

Yingfei Xiong, Jie Wang, Runfa Yan, Jiachen Zhang, Shi Han, Gang Huang, and Lu Zhang. 2017. Precise condition synthesis for program repair. In Proceedings of the 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), 416–426.

[111]

Yingfei Xiong, Jie Wang, Runfa Yan, Jiachen Zhang, Shi Han, Gang Huang, and Lu Zhang. 2017. Precise condition synthesis for program repair. In Proceedings of the 39th International Conference on Software Engineering (ICSE ’17). Sebastián Uchitel, Alessandro Orso, and Martin P. Robillard (Eds.), IEEE/ACM, 416–426.

[112]

Jifeng Xuan, Matias Martinez, Favio Demarco, Maxime Clement, Sebastian R. Lamelas Marcote, Thomas Durieux, Daniel Le Berre, and Martin Monperrus. 2017. Nopol: Automatic repair of conditional statement bugs in Java programs. IEEE Trans. Softw. Eng. 43, 1 (2017), 34–55.

[113]

Dapeng Yan, Kui Liu, Yuqing Niu, Li Li, Zhe Liu, Zhiming Liu, Jacques Klein, and Tegawendé F. Bissyandé. 2022. Crex: Predicting patch correctness in automated repair of C programs through transfer learning of execution semantics. Inf. Softw. Technol. 152 (2022), 107043.

[114]

Bo Yang and Jinqiu Yang. 2020. Exploring the differences between plausible and correct patches at fine-grained level. In Proceedings of the 2020 IEEE 2nd International Workshop on Intelligent Bug Fixing (IBF), 1–8.

[115]

Jun Yang, Yuehan Wang, Yiling Lou, Ming Wen, and Lingming Zhang. 2023. Attention: Not just another dataset for patch-correctness checking. arXiv:2207.06590. Retrieved from https://arxiv.org/pdf/2207.06590

[116]

Jinqiu Yang, Alexey Zhikhartsev, Yuefei Liu, and Lin Tan. 2017. Better test cases for better automated program repair. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE ’17). Eric Bodden, Wilhelm Schäfer, Arie van Deursen, and Andrea Zisman (Eds.), ACM, 831–841.

[117]

He Ye, Jian Gu, Matias Martinez, Thomas Durieux, and Martin Monperrus. 2022. Automated classification of overfitting patches with statically extracted code features. IEEE Trans. Softw. Eng. 48, 8 (2022), 2920–2938.

[118]

He Ye, Matias Martinez, Thomas Durieux, and Martin Monperrus. 2021. A comprehensive study of automatic program repair on the QuixBugs benchmark. J. Syst. Softw. 171 (2021), 110825.

[119]

He Ye, Matias Martinez, and Martin Monperrus. 2021. Automated patch assessment for program repair at scale. Empir. Softw. Eng. 26, 2 (2021), 20.

[120]

Zhongxing Yu, Matias Martinez, Benjamin Danglot, Thomas Durieux, and Martin Monperrus. 2017. Test case generation for program repair: A study of feasibility and effectiveness. arXiv:1703.00198. Retrieved from http://arxiv.org/abs/1703.00198

[121]

Zhongxing Yu, Matias Martinez, Benjamin Danglot, Thomas Durieux, and Martin Monperrus. 2019. Alleviating patch overfitting with automatic test generation: A study of feasibility and effectiveness for the Nopol repair system. Empir. Softw. Eng. 24, 1 (2019), 33–67.

[122]

Yuan Yuan and Wolfgang Banzhaf. 2020. ARJA: Automated repair of Java programs via multi-objective genetic programming. IEEE Trans. Softw. Eng. 46, 10 (2020), 1040–1067.

[123]

Yuan Yuan and Wolfgang Banzhaf. 2020. Toward better evolutionary program repair: An integrated approach. ACM Trans. Softw. Eng. Methodol. 29, 1 (2020), 1–53.

[124]

He Zhang, Muhammad Ali Babar, and Paolo Tell. 2011. Identifying relevant studies in software engineering. Inf. Softw. Technol. 53, 6 (2011), 625–637.

[125]

Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen. 2023. A survey of learning-based automated program repair. ACM Trans. Softw. Eng. Methodol. 33, 2, Article 55 (Dec. 2023), 69 pages.

[126]

Quanjun Zhang, Chunrong Fang, Weisong Sun, Yan Liu, Tieke He, Xiaodong Hao, and Zhenyu Chen. 2023. Boosting automated patch correctness prediction via pre-trained language model. arXiv:2301.12453.

[127]

Wenkang Zhong, Chuanyi Li, Jidong Ge, and Bin Luo. 2022. Neural program repair: Systems, challenges and solutions. In Proceedings of the 13th Asia-Pacific Symposium on Internetware, 96–106.

[128]

Xin Zhou, Bowen Xu, Kisub Kim, DongGyun Han, Thanh Le-Cong, Junda He, Bach Le, and David Lo. 2023. PatchZero: Zero-shot automatic patch correctness assessment. arXiv:2303.00202. Retrieved from https://arxiv.org/pdf/2303.00202

[129]

Qihao Zhu, Zeyu Sun, Yuan-an Xiao, Wenjie Zhang, Kang Yuan, Yingfei Xiong, and Lu Zhang. 2021. A syntax-guided edit decoder for neural program repair. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’21). ACM, New York, NY, 341–353.

Index Terms

Software and its engineering → Software creation and management → Software verification and validation → Software defect analysis → Software testing and debugging

Published In

ACM Transactions on Software Engineering and Methodology Volume 34, Issue 2

February 2025

904 pages

EISSN:1557-7392

DOI:10.1145/3703017

Editor:
Abhik Roychoudhury
National University of Singapore, Singapore

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 January 2025

Online AM: 08 November 2024

Accepted: 28 September 2024

Revised: 28 September 2024

Received: 14 August 2023

Published in TOSEM Volume 34, Issue 2

Qualifiers

Research-article

Funding Sources

National Key Research and Development Program of China
Natural Science Foundation of Jiangsu Province
Cooperation Fund of Huawei-NJU Creative Laboratory for the Next Programming
