On the importance of building high-quality training datasets for neural code search

Authors:

Li Li

ICSE '22: Proceedings of the 44th International Conference on Software Engineering

Pages 1609 - 1620

https://doi.org/10.1145/3510003.3510160

Published: 05 July 2022

Abstract

The performance of neural code search depends heavily on the quality of the data used to train the neural models. Establishing a precise mapping from natural language to programming language requires a large corpus of high-quality query-code pairs. Because such pairs are scarce, most widely used code search datasets are built with compromises, such as using code comments in place of queries. Our empirical study of a well-known code search dataset reveals that over one-third of its queries contain noise that makes them deviate from natural user queries. Models trained on noisy data suffer severe performance degradation when applied in real-world scenarios. Improving dataset quality so that the queries of its samples are semantically identical to real user queries is therefore critical for the practical usability of neural code search. In this paper, we propose a data cleaning framework consisting of two successive filters: a rule-based syntactic filter and a model-based semantic filter. This is the first framework that applies semantic query cleaning to code search datasets. We evaluated the effectiveness of our framework on two widely used code search models and three manually annotated code retrieval benchmarks. Training the popular DeepCS model on the dataset filtered by our framework improves its performance by 19.2% in MRR and 21.3% in Answer@1, averaged over the three validation benchmarks.
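The abstract reports its results as MRR and Answer@1. As a reader aid, the following is a minimal Python sketch of how such ranking metrics are commonly computed; it is not the authors' evaluation code. It assumes each query is summarised by the 1-based rank of its first correct result (`None` if no correct result was retrieved) and that Answer@k denotes the fraction of queries answered within the top k results. The function names and the sample ranks are hypothetical.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(first_hit_ranks: Sequence[Optional[int]]) -> float:
    """Mean of 1/rank over all queries; an unanswered query contributes 0."""
    if not first_hit_ranks:
        raise ValueError("need at least one query")
    total = sum(1.0 / rank for rank in first_hit_ranks if rank is not None)
    return total / len(first_hit_ranks)


def answer_at_k(first_hit_ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of queries whose first correct result is ranked within the top k.

    Assumption: this fraction-based definition is used for illustration only;
    some papers report the raw count of answered queries instead.
    """
    if not first_hit_ranks:
        raise ValueError("need at least one query")
    hits = sum(1 for rank in first_hit_ranks if rank is not None and rank <= k)
    return hits / len(first_hit_ranks)


# Hypothetical ranks for four queries; the third query was never answered.
ranks = [1, 3, None, 2]
mrr = mean_reciprocal_rank(ranks)
answer_at_1 = answer_at_k(ranks, 1)
answer_at_3 = answer_at_k(ranks, 3)
print(f"MRR={mrr:.4f} Answer@1={answer_at_1:.2f} Answer@3={answer_at_3:.2f}")
```

Under these definitions, a relative improvement such as the one quoted above would be computed by evaluating the same benchmark queries with a model trained on the original data and with one trained on the filtered data, then comparing the two metric values.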

Index Terms

On the importance of building high-quality training datasets for neural code search
1. Software and its engineering
  1. Software creation and management
    1. Software development techniques
      1. Reusability

Published In

ICSE '22: Proceedings of the 44th International Conference on Software Engineering

May 2022

2508 pages

ISBN:9781450392211

DOI:10.1145/3510003

General Chair: Matthew B Dwyer (University of Virginia)

Program Chairs: Daniela Damian (University of Victoria, Canada), Andreas Zeller (CISPA, Germany)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering

In-Cooperation

IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ICSE '22

Sponsor:

SIGSOFT

ICSE '22: 44th International Conference on Software Engineering

May 21 - 29, 2022

Pittsburgh, Pennsylvania

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%

Bibliometrics & Citations

Article Metrics

Total citations: 27
Total downloads: 351
Downloads (last 12 months): 132
Downloads (last 6 weeks): 15

Reflects downloads up to 15 Oct 2024

Cited By

Hu H, Fang M, Liu J. (2025). An intent-enhanced feedback extension model for code search. Information and Software Technology 177 (107589). Online publication date: Jan-2025.
https://doi.org/10.1016/j.infsof.2024.107589
Hou X, Zhao Y, Liu Y, Yang Z, Wang K, Li L, Luo X, Lo D, Grundy J, Wang H. (2024). Large Language Models for Software Engineering: A Systematic Literature Review. ACM Transactions on Software Engineering and Methodology. Online publication date: 20-Sep-2024.
https://doi.org/10.1145/3695988
Kim K, Kim J, Park B, Kim D, Chong C, Wang Y, Sun T, Tang D, Klein J, Bissyande T, Filkov V, Ray B, Zhou M. (2024). DataRecipe --- How to Cook the Data for CodeLLM? Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (1206-1218). Online publication date: 27-Oct-2024.
https://dl.acm.org/doi/10.1145/3691620.3695593
Hu G, Zeng X, Yu W, Peng M, Yuan M, Duan L. (2024). Unsupervised and Supervised Co-learning for Comment-based Codebase Refining and its Application in Code Search. Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (1-12). Online publication date: 24-Oct-2024.
https://dl.acm.org/doi/10.1145/3674805.3686664
Liu S, Li Y, Xie X, Ma W, Meng G, Liu Y. (2024). Automated Commit Intelligence by Pre-training. ACM Transactions on Software Engineering and Methodology. Online publication date: 1-Jul-2024.
https://dl.acm.org/doi/10.1145/3674731
Chen Q, Yu C, Liu R, Zhang C, Wang Y, Wang K, Su T, Wang L. (2024). Evaluating the Effectiveness of Deep Learning Models for Foundational Program Analysis Tasks. Proceedings of the ACM on Programming Languages 8:OOPSLA1 (500-528). Online publication date: 29-Apr-2024.
https://dl.acm.org/doi/10.1145/3649829
Fan G, Chen S, Gao C, Xiao J, Zhang T, Feng Z. (2024). RAPID: Zero-Shot Domain Adaptation for Code Search with Pre-Trained Models. ACM Transactions on Software Engineering and Methodology 33:5 (1-35). Online publication date: 3-Jun-2024.
https://dl.acm.org/doi/10.1145/3641542
Liu Y, Tantithamthavorn C, Liu Y, Li L. (2024). On the Reliability and Explainability of Language Models for Program Generation. ACM Transactions on Software Engineering and Methodology 33:5 (1-26). Online publication date: 3-Jun-2024.
https://dl.acm.org/doi/10.1145/3641540
Wang W, Li Y, Li A, Zhang J, Ma W, Liu Y, Roychoudhury A, Paiva A, Abreu R, Storey M. (2024). An Empirical Study on Noisy Label Learning for Program Understanding. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-12). Online publication date: 20-May-2024.
https://dl.acm.org/doi/10.1145/3597503.3639217
Yang W, Song L, Xue Y, Roychoudhury A, Paiva A, Abreu R, Storey M. (2024). Rust-lancet: Automated Ownership-Rule-Violation Fixing with Behavior Preservation. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-13). Online publication date: 20-May-2024.
https://dl.acm.org/doi/10.1145/3597503.3639103