DOI: 10.1145/3510003.3510160
research-article

On the importance of building high-quality training datasets for neural code search

Published: 05 July 2022

Abstract

The performance of neural code search depends heavily on the quality of the training data from which the neural models are derived. Establishing a precise mapping from natural language to programming language requires a large corpus of high-quality query and code pairs. Because such pairs are scarce, most widely used code search datasets are built with compromises, such as using code comments in place of queries. Our empirical study of a well-known code search dataset reveals that over one-third of its queries contain noise that makes them deviate from natural user queries. Models trained on noisy data suffer severe performance degradation when applied in real-world scenarios. Improving dataset quality, so that the queries of its samples are semantically identical to real user queries, is therefore critical for the practical usability of neural code search. In this paper, we propose a data cleaning framework consisting of two successive filters: a rule-based syntactic filter and a model-based semantic filter. This is the first framework that applies semantic query cleaning to code search datasets. We evaluated the effectiveness of our framework on two widely used code search models and three manually annotated code retrieval benchmarks. Training the popular DeepCS model on the dataset filtered by our framework improves its performance by 19.2% in MRR and 21.3% in Answer@1, averaged over the three validation benchmarks.
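The two metrics reported above can be illustrated with a minimal sketch. It assumes the usual definitions: MRR is the mean over queries of the reciprocal rank of the first correct snippet, and Answer@k is the fraction of queries whose first correct snippet appears within the top k results. The function names and the sample ranks are hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(first_hit_ranks: Sequence[Optional[int]]) -> float:
    """Mean of 1/rank over all queries.

    Each entry is the 1-based rank of the first correct result for one query,
    or None when no correct result was returned (that query contributes 0).
    """
    if not first_hit_ranks:
        raise ValueError("at least one query is required")
    total = sum(1.0 / rank for rank in first_hit_ranks if rank is not None)
    return total / len(first_hit_ranks)


def answer_at_k(first_hit_ranks: Sequence[Optional[int]], k: int = 1) -> float:
    """Fraction of queries whose first correct result is ranked within the top k."""
    if not first_hit_ranks:
        raise ValueError("at least one query is required")
    hits = sum(1 for rank in first_hit_ranks if rank is not None and rank <= k)
    return hits / len(first_hit_ranks)


# Hypothetical ranks for four queries; the third query has no correct result.
ranks = [1, 3, None, 2]
mrr = mean_reciprocal_rank(ranks)   # (1 + 1/3 + 0 + 1/2) / 4
answer_1 = answer_at_k(ranks, k=1)  # only the first query is answered at rank 1
```

Under these definitions, a gain in Answer@1 means more queries are answered by the very first result, while a gain in MRR also rewards correct results that move up without reaching the top position.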



    Published In

    ICSE '22: Proceedings of the 44th International Conference on Software Engineering
    May 2022
    2508 pages
    ISBN:9781450392211
    DOI:10.1145/3510003

    Sponsors

    In-Cooperation

    • IEEE CS

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. code search
    2. data cleaning
    3. dataset
    4. deep learning

    Qualifiers

    • Research-article

    Conference

    ICSE '22

    Acceptance Rates

    Overall Acceptance Rate 276 of 1,856 submissions, 15%

