Correlating Automated and Human Evaluation of Code Documentation Generation Quality

Published: 28 July 2022

Abstract

Automatic code documentation generation has been a crucial task in the field of software engineering. It not only relieves developers of writing code documentation but also helps them understand programs better. In particular, deep-learning-based techniques that leverage large-scale source code corpora have been widely used for code documentation generation. These works tend to use automatic metrics (such as BLEU, METEOR, ROUGE, CIDEr, and SPICE) to evaluate different models; the metrics compare generated documentation to reference texts by measuring word overlap. Unfortunately, there is no evidence demonstrating that these metrics correlate with human judgment. We conduct experiments on two popular code documentation generation tasks, code comment generation and commit message generation, to investigate whether such correlations exist. For each task, we replicate three state-of-the-art approaches and evaluate the generated documentation automatically in terms of BLEU, METEOR, ROUGE-L, CIDEr, and SPICE. We also ask 24 participants to rate the generated documentation on three aspects (i.e., language, content, and effectiveness); each participant is given Java methods or commit diffs along with the documentation to be rated. The results show that the ranking of generated documentation produced by the automatic metrics differs from the ranking produced by human annotators. Thus, these automatic metrics are not reliable enough to replace human evaluation for code documentation generation tasks. Among them, METEOR shows the strongest correlation with the human evaluation metrics (a moderate Pearson correlation of about r = 0.7). However, this is still much lower than both the correlation observed between different annotators (a high Pearson correlation of about r = 0.8) and the correlations reported in the literature for other tasks (e.g., Neural Machine Translation [39]). Our study points to the need for specialized automated evaluation metrics that correlate more closely with human evaluation for code generation tasks.
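
To make the correlation analysis concrete, the snippet below is a minimal, hypothetical sketch (using NLTK and SciPy, not the authors' released code or data): it scores each generated comment against its reference with sentence-level BLEU and METEOR, then computes the Pearson correlation between those automatic scores and the corresponding human ratings. The sample comments and ratings are invented for illustration, and the METEOR call assumes NLTK's WordNet data has been downloaded.

```python
# A minimal, hypothetical sketch of the correlation analysis (not the authors'
# released code): score each generated comment against its reference with
# sentence-level BLEU and METEOR, then correlate those scores with human ratings.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score  # assumes nltk.download('wordnet')
from scipy.stats import pearsonr

# Invented example data: (reference comment, generated comment, mean human rating).
samples = [
    ("returns the index of the given element", "return index of element", 4.3),
    ("closes the underlying output stream", "close the stream and log errors", 3.6),
    ("parses the configuration file into a map", "returns the map", 2.0),
]

smooth = SmoothingFunction().method4  # sentence-level BLEU needs smoothing for short texts
bleu_scores, meteor_scores, human_scores = [], [], []
for reference, generated, rating in samples:
    ref_tokens, gen_tokens = reference.split(), generated.split()
    bleu_scores.append(sentence_bleu([ref_tokens], gen_tokens, smoothing_function=smooth))
    meteor_scores.append(meteor_score([ref_tokens], gen_tokens))
    human_scores.append(rating)

# Pearson r between each automatic metric and the human ratings (toy data only).
for name, scores in [("BLEU", bleu_scores), ("METEOR", meteor_scores)]:
    r, p_value = pearsonr(scores, human_scores)
    print(f"{name} vs. human rating: r = {r:.2f} (p = {p_value:.2f})")
```

In the study itself, the same idea is applied per task and per replicated model, also covering ROUGE-L, CIDEr, and SPICE, with human ratings collected for language, content, and effectiveness; a metric would be a reliable proxy only if its correlation approached the roughly 0.8 agreement observed between annotators.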

References

[1]
Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2018. Code2seq: Generating sequences from structured representations of code. In Proceedings of the International Conference on Learning Representations.
[2]
Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. Spice: Semantic propositional image caption evaluation. In Proceedings of the European Conference on Computer Vision. Springer, 382–398.
[3]
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR’15), San Diego, CA, USA, May 7-9, 2015, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1409.0473.
[4]
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 65–72.
[5]
Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen. 2009. Pearson correlation coefficient. In Noise Reduction in Speech Processing. Springer, 1–4.
[6]
Raymond P. L. Buse and Westley R. Weimer. 2010. Automatically documenting program changes. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering. 33–42.
[7]
Qiuyuan Chen, Xin Xia, Han Hu, David Lo, and Shanping Li. 2021. Why my code summarization model does not work: Code comment improvement with category prediction. ACM Transactions on Software Engineering and Methodology 30, 2 (2021), 1–29.
[8]
Luis Fernando Cortés-Coy, Mario Linares-Vásquez, Jairo Aponte, and Denys Poshyvanyk. 2014. On automatically generating commit messages via summarization of source code changes. In Proceedings of the 2014 IEEE 14th International Working Conference on Source Code Analysis and Manipulation. IEEE, 275–284.
[9]
Deborah Coughlin. 2003. Correlating automated and human assessments of machine translation quality. In Proceedings of MT Summit IX. 63–70.
[10]
Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.
[11]
Robert Dyer, Hoan Anh Nguyen, Hridesh Rajan, and Tien N. Nguyen. 2013. Boa: A language and infrastructure for analyzing ultra-large-scale software repositories. In Proceedings of the 2013 35th International Conference on Software Engineering. IEEE, 422–431.
[12]
Joseph L. Fleiss, Bruce Levin, and Myunghee Cho Paik. 2013. Statistical Methods for Rates and Proportions. John Wiley & Sons.
[13]
Andrew Forward and Timothy C. Lethbridge. 2002. The relevance of software documentation, tools and technologies: A survey. In Proceedings of the 2002 ACM Symposium on Document Engineering. 26–33.
[14]
Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. Continuous measurement scales in human evaluation of machine translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse. 33–41.
[15]
David Gros, Hariharan Sezhiyan, Prem Devanbu, and Zhou Yu. 2020. Code to comment “translation”: Data, metrics, baselining & evaluation. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE’20). Association for Computing Machinery, 746–757.
[16]
Sonia Haiduc, Jairo Aponte, and Andrian Marcus. 2010. Supporting program comprehension with source code summarization. In Proceedings of the 2010 ACM/IEEE 32nd International Conference on Software Engineering. IEEE, 223–226.
[17]
Sonia Haiduc, Jairo Aponte, Laura Moreno, and Andrian Marcus. 2010. On the use of automated text summarization techniques for summarizing source code. In Proceedings of the 2010 17th Working Conference on Reverse Engineering. IEEE, 35–44.
[18]
Dennis E. Hinkle, William Wiersma, and Stephen G. Jurs. 2003. Applied Statistics for the Behavioral Sciences. Vol. 663. Houghton Mifflin College Division.
[19]
Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2018. Deep code comment generation. In Proceedings of the 2018 IEEE/ACM 26th International Conference on Program Comprehension. IEEE, 200–210.
[20]
Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2020. Deep code comment generation with hybrid lexical and syntactical information. Empirical Software Engineering 25, 3 (2020), 2179–2217.
[21]
Yuan Huang, Qiaoyang Zheng, Xiangping Chen, Yingfei Xiong, Zhiyong Liu, and Xiaonan Luo. 2017. Mining version control system for automatically generating commit comment. In Proceedings of the 2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. IEEE, 414–423.
[22]
Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2073–2083.
[23]
Siyuan Jiang, Ameer Armaly, and Collin McMillan. 2017. Automatically generating commit messages from diffs using neural machine translation. In Proceedings of the 2017 32nd IEEE/ACM International Conference on Automated Software Engineering. IEEE, 135–146.
[24]
Alexander LeClair, Sakib Haque, Lingfei Wu, and Collin McMillan. 2020. Improved code summarization via a graph neural network. In Proceedings of the 28th International Conference on Program Comprehension. 184–195.
[25]
Alexander LeClair, Siyuan Jiang, and Collin McMillan. 2019. A neural model for generating natural language summaries of program subroutines. In Proceedings of the 2019 IEEE/ACM 41st International Conference on Software Engineering. IEEE, 795–806.
[26]
Timothy C. Lethbridge, Janice Singer, and Andrew Forward. 2003. How software engineers use documentation: The state of the practice. IEEE Software 20, 6 (2003), 35–39.
[27]
Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out. 74–81.
[28]
Mario Linares-Vásquez, Luis Fernando Cortés-Coy, Jairo Aponte, and Denys Poshyvanyk. 2015. Changescribe: A tool for automatically generating commit messages. In Proceedings of the 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering. IEEE, 709–712.
[29]
Chia-Wei Liu, Ryan Lowe, Iulian Vlad Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2122–2132.
[30]
Qin Liu, Zihe Liu, Hongming Zhu, Hongfei Fan, Bowen Du, and Yu Qian. 2019. Generating commit messages from diffs using pointer-generator network. In Proceedings of the 2019 IEEE/ACM 16th International Conference on Mining Software Repositories. IEEE, 299–309.
[31]
Shangqing Liu, Yu Chen, Xiaofei Xie, Jing Kai Siow, and Yang Liu. 2020. Retrieval-augmented generation for code summarization via hybrid GNN. In Proceedings of the International Conference on Learning Representations.
[32]
Zhongxin Liu, Xin Xia, Ahmed E. Hassan, David Lo, Zhenchang Xing, and Xinyu Wang. 2018. Neural-machine-translation-based commit message generation: How far are we? In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 373–384.
[33]
Pablo Loyola, Edison Marrese-Taylor, Jorge Balazs, Yutaka Matsuo, and Fumiko Satoh. 2018. Content aware source code change description generation. In Proceedings of the 11th International Conference on Natural Language Generation. 119–128.
[34]
Pablo Loyola, Edison Marrese-Taylor, and Yutaka Matsuo. 2017. A neural architecture for generating natural language descriptions from source code changes. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, 287–292. https://aclanthology.org/P17-2045.
[35]
Paul W. McBurney and Collin McMillan. 2014. Automatic documentation generation via source code summarization of method context. In Proceedings of the 22nd International Conference on Program Comprehension. 279–290.
[36]
Laura Moreno, Jairo Aponte, Giriprasad Sridhara, Andrian Marcus, Lori Pollock, and K. Vijay-Shanker. 2013. Automatic generation of natural language summaries for Java classes. In Proceedings of the 2013 21st International Conference on Program Comprehension. IEEE, 23–32.
[37]
Laura Moreno, Andrian Marcus, Lori Pollock, and K. Vijay-Shanker. 2013. Jsummarizer: An automatic generator of natural language summaries for Java classes. In Proceedings of the 2013 21st International Conference on Program Comprehension. IEEE, 230–232.
[38]
Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2241–2252.
[39]
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311–318.
[40]
Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 379–389.
[41]
Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 1073–1083.
[42]
Shikhar Sharma, Layla El Asri, Hannes Schulz, and Jeremie Zumer. 2017. Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation. CoRR abs/1706.09799. http://arxiv.org/abs/1706.09799.
[43]
Xiaotao Song, Hailong Sun, Xu Wang, and Jiafei Yan. 2019. A survey of automatic generation of source code comments: Algorithms and techniques. IEEE Access 7 (2019), 111411–111428.
[44]
Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori Pollock, and K. Vijay-Shanker. 2010. Towards automatically generating summary comments for Java methods. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering. 43–52.
[45]
Sean Stapleton, Yashmeet Gambhir, Alexander LeClair, Zachary Eberhart, Westley Weimer, Kevin Leach, and Yu Huang. 2020. A human study of comprehension and code summarization. In Proceedings of the 28th International Conference on Program Comprehension. 2–13.
[46]
Amanda Stent, Matthew Marge, and Mohit Singhai. 2005. Evaluating evaluation methods for generation in the presence of variation. In Proceedings of the International Conference on Intelligent Text Processing and Computational Linguistics. Springer, 341–351.
[47]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[48]
Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4566–4575.
[49]
Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, and Philip S. Yu. 2018. Improving automatic source code summarization via deep reinforcement learning. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 397–407.
[50]
Bolin Wei, Yongmin Li, Ge Li, Xin Xia, and Zhi Jin. 2019. Retrieve and refine: Exemplar-based neural comment generation. In Proceedings of the 35th ACM/IEEE International Conference on Automated Software Engineering.
[51]
Xin Xia, Lingfeng Bao, David Lo, Zhenchang Xing, Ahmed E. Hassan, and Shanping Li. 2018. Measuring program comprehension: A large-scale field study with professionals. IEEE Transactions on Software Engineering 44, 10 (2018), 951–976.
[52]
Shengbin Xu, Yuan Yao, Feng Xu, Tianxiao Gu, Hanghang Tong, and Jian Lu. 2019. Commit message generation for source code changes. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. 3975–3981.
[53]
Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, and Xudong Liu. 2020. Retrieval-based neural source code summarization. In Proceedings of the 2020 IEEE/ACM 42nd International Conference on Software Engineering. IEEE, 1385–1397.




    Published In

    ACM Transactions on Software Engineering and Methodology, Volume 31, Issue 4
    October 2022
    867 pages
    ISSN: 1049-331X
    EISSN: 1557-7392
    DOI: 10.1145/3543992
    • Editor: Mauro Pezzè

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 28 July 2022
    Online AM: 23 May 2022
    Accepted: 01 November 2021
    Revised: 01 September 2021
    Received: 01 January 2021
    Published in TOSEM Volume 31, Issue 4


    Author Tags

    1. Code documentation generation
    2. evaluation metrics
    3. empirical study

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • National Science Foundation of China
    • National Research Foundation, Singapore


    Cited By

    • (2024) Neuron Sensitivity-Guided Test Case Selection. ACM Transactions on Software Engineering and Methodology 33, 7 (1–32). DOI: 10.1145/3672454. Online publication date: 12-Jun-2024.
    • (2024) Test Optimization in DNN Testing: A Survey. ACM Transactions on Software Engineering and Methodology 33, 4 (1–42). DOI: 10.1145/3643678. Online publication date: 27-Jan-2024.
    • (2024) Test Input Prioritization for 3D Point Clouds. ACM Transactions on Software Engineering and Methodology 33, 5 (1–44). DOI: 10.1145/3643676. Online publication date: 4-Jun-2024.
    • (2024) KADEL: Knowledge-Aware Denoising Learning for Commit Message Generation. ACM Transactions on Software Engineering and Methodology 33, 5 (1–32). DOI: 10.1145/3643675. Online publication date: 4-Jun-2024.
    • (2024) LUNA: A Model-Based Universal Analysis Framework for Large Language Models. IEEE Transactions on Software Engineering 50, 7 (1921–1948). DOI: 10.1109/TSE.2024.3411928. Online publication date: 1-Jul-2024.
    • (2024) Active Code Learning: Benchmarking Sample-Efficient Training of Code Models. IEEE Transactions on Software Engineering 50, 5 (1080–1095). DOI: 10.1109/TSE.2024.3376964. Online publication date: 13-Mar-2024.
    • (2024) AUTOGENICS: Automated Generation of Context-Aware Inline Comments for Code Snippets on Programming Q&A Sites Using LLM. 2024 IEEE International Conference on Source Code Analysis and Manipulation (SCAM), 24–35. DOI: 10.1109/SCAM63643.2024.00013. Online publication date: 7-Oct-2024.
    • (2024) Epic-Level Text Generation with LLM through Auto-prompted Reinforcement Learning. 2024 International Joint Conference on Neural Networks (IJCNN), 1–8. DOI: 10.1109/IJCNN60899.2024.10650358. Online publication date: 30-Jun-2024.
    • (2024) Can Developers Prompt? A Controlled Experiment for Code Documentation Generation. 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME), 574–586. DOI: 10.1109/ICSME58944.2024.00058. Online publication date: 6-Oct-2024.
    • (2024) Supporting Software Maintenance with Dynamically Generated Document Hierarchies. 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME), 426–437. DOI: 10.1109/ICSME58944.2024.00046. Online publication date: 6-Oct-2024.
