research-article

OCoR: an overlapping-aware code retriever

Authors:

Lu ZhangAuthors Info & Claims

ASE '20: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering

Pages 883 - 894

https://doi.org/10.1145/3324884.3416530

Published: 27 January 2021 Publication History

Abstract

Code retrieval helps developers reuse code snippets in the open-source projects. Given a natural language description, code retrieval aims to search for the most relevant code relevant among a set of code snippets. Existing state-of-the-art approaches apply neural networks to code retrieval. However, these approaches still fail to capture an important feature: overlaps. The overlaps between different names used by different people indicate that two different names may be potentially related (e.g., "message" and "msg"), and the overlaps between identifiers in code and words in natural language descriptions indicate that the code snippet and the description may potentially be related.

To address this problem, we propose a novel neural architecture named OCoR¹, where we introduce two specifically-designed components to capture overlaps: the first embeds names by characters to capture the overlaps between names, and the second introduces a novel overlap matrix to represent the degrees of overlaps between each natural language word and each identifier.

The evaluation was conducted on two established datasets. The experimental results show that OCoR significantly outperforms the existing state-of-the-art approaches and achieves 13.1% to 22.3% improvements. Moreover, we also conducted several in-depth experiments to help understand the performance of the different components in OCoR.

References

[1]

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016).

[2]

Miltos Allamanis, Daniel Tarlow, Andrew D. Gordon, and Yi Wei. 2015. Bimodal Modelling of Source Code and Natural Language. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6--11 July 2015 (JMLR Proceedings), Francis R. Bach and David M. Blei (Eds.), Vol. 37. Journal of Machine Learning Research: Workshop and Conference Proceedings, 2123--2132.

[3]

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).

[4]

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:cs.CL/1409.0473

[5]

Y. Bengio, P. Simard, and P. Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5, 2 (March 1994), 157--166.

Digital Library

[6]

Jose Cambronero, Hongyu Li, Seohyun Kim, Koushik Sen, and Satish Chandra. 2019. When Deep Learning Met Code Search. arXiv:cs.SE/1905.03813

[7]

Nick Craswell. 2009. Mean reciprocal rank. Encyclopedia of Database Systems (2009), 1703--1703.

[8]

Jian Fu, Xipeng Qiu, and Xuanjing Huang. 2016. Convolutional deep neural networks for document-based question answering. In Natural Language Understanding and Intelligent Applications. Springer, 790--797.

[9]

github. 2020. https://github.com/. github.

[10]

Alessandro Giusti, Dan C Cireşan, Jonathan Masci, Luca M Gambardella, and Jürgen Schmidhuber. 2013. Fast image scanning with deep max-pooling convolutional neural networks. In 2013 IEEE International Conference on Image Processing. IEEE, 4034--4038.

[11]

Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep code search. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). IEEE, 933--944.

Digital Library

[12]

Sonia Haiduc, Gabriele Bavota, Andrian Marcus, Rocco Oliveto, Andrea De Lucia, and Tim Menzies. 2013. Automatic query reformulations for text retrieval in software engineering. In 2013 35th International Conference on Software Engineering (ICSE). IEEE, 842--851.

[13]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv:cs.CV/1512.03385

[14]

Dan Hendrycks and Kevin Gimpel. 2016. Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units. CoRR abs/1606.08415 (2016). arXiv:1606.08415 http://arxiv.org/abs/1606.08415

[15]

Emily Hill, Manuel Roldan-Vega, Jerry Alan Fails, and Greg Mallet. 2014. NL-based query refinement and contextualized code search results: A user study. In 2014 Software Evolution Week-IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE). IEEE, 34--43.

[16]

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:cs.NE/1207.0580

[17]

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735--1780.

Digital Library

[18]

Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2015. Convolutional Neural Network Architectures for Matching Natural Language Sentences. arXiv:cs.CL/1503.03244

[19]

He Hua and Jimmy Lin. 2016. Pairwise Word Interaction Modeling with Deep Neural Networks for Semantic Similarity Measurement. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

[20]

Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. 2019. Music Transformer: Generating Music with Long-Term Structure. In ICLR.

[21]

Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. arXiv:cs.LG/1909.09436

[22]

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2073--2083.

[23]

Iman Keivanloo, Juergen Rilling, and Ying Zou. 2014. Spotting Working Code Examples. In Proceedings of the 36th International Conference on Software Engineering (ICSE 2014). ACM, New York, NY, USA, 664--675.

Digital Library

[24]

Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. arXiv:cs.CL/1408.5882

[25]

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv:cs.LG/1412.6980

[26]

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097--1105.

[27]

Edward Loper and Steven Bird. 2002. NLTK: the natural language toolkit. arXiv preprint cs/0205028 (2002).

Digital Library

[28]

Hong Mei and Lu Zhang. 2018. Can big data bring a breakthrough for software automation? Science China Information Sciences 61 (05 2018), 056101.

[29]

Meili Lu, X. Sun, S. Wang, D. Lo, and Yucong Duan. 2015. Query expansion via WordNet for effective code search. In 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER). 545--549.

[30]

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv:cs.CL/1301.3781

[31]

Stack Overflow. 2020. https://stackoverflow.com/. Stack Overflow.

[32]

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532--1543. http://www.aclweb.org/anthology/D14-1162

[33]

Xipeng Qiu and Xuanjing Huang. 2015. Convolutional Neural Tensor Network Architecture for Community-Based Question Answering. In IJCAI, Qiang Yang and Michael Wooldridge (Eds.). AAAI Press, 1305--1311. http://dblp.uni-trier.de/db/conf/ijcai/ijcai2015.html#QiuH15

[34]

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural Machine Translation of Rare Words with Subword Units. arXiv:cs.CL/1508.07909

[35]

Zeyu Sun, Qihao Zhu, Lili Mou, Yingfei Xiong, Ge Li, and Lu Zhang. 2019. A grammar-based structural cnn decoder for code generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 7055--7062.

Digital Library

[36]

Zeyu Sun, Qihao Zhu, Yingfei Xiong, Yican Sun, Lili Mou, and Lu Zhang. 2020. TreeGen: A Tree-Based Transformer Architecture for Code Generation. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7--12, 2020. AAAI Press, 8984--8991. https://aaai.org/ojs/index.php/AAAI/article/view/6430

[37]

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. 3104--3112.

[38]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998--6008.

[39]

Venkatesh Vinayakarao, Anita Sarma, Rahul Purandare, Shuktika Jain, and Saumya Jain. 2017. Anne: Improving source code search using entity retrieval approach. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. 211--220.

Digital Library

[40]

Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, and Philip S Yu. 2018. Improving automatic source code summarization via deep reinforcement learning. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 397--407.

Digital Library

[41]

Mingzhou Xu, Derek F Wong, Baosong Yang, Yue Zhang, and Lidia S Chao. 2019. Leveraging local and global patterns for self-attention networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 3069--3075.

[42]

Ziyu Yao, Jayavardhan Reddy Peddamail, and Huan Sun. 2019. CoaCor: Code Annotation for Code Retrieval with Reinforcement Learning. The World Wide Web Conference on - WWW '19 (2019).

Digital Library

[43]

Ziyu Yao, Daniel S Weld, Wei-Peng Chen, and Huan Sun. 2018. Staqc: A systematically mined question-code dataset from stack overflow. In Proceedings of the 2018 World Wide Web Conference. 1693--1703.

Digital Library

[44]

Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, and Bo Xu. 2016. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. arXiv preprint arXiv:1611.06639 (2016).

Cited By

Chen GLiu WXie X(2025)RFMC-CS: a representation fusion based multi-view momentum contrastive learning framework for code searchAutomated Software Engineering10.1007/s10515-025-00487-832:1Online publication date: 27-Jan-2025
https://doi.org/10.1007/s10515-025-00487-8
Zhang FLi MWu HWu T(2024)Intelligent code search aids edge software developmentJournal of Cloud Computing10.1186/s13677-024-00629-513:1Online publication date: 1-Apr-2024
https://doi.org/10.1186/s13677-024-00629-5
Wan YBi ZHe YZhang JZhang HSui YXu GJin HYu P(2024)Deep Learning for Code Intelligence: Survey, Benchmark and ToolkitACM Computing Surveys10.1145/3664597Online publication date: 18-May-2024
https://doi.org/10.1145/3664597
Show More Cited By

Index Terms

OCoR: an overlapping-aware code retriever

Recommendations

Opportunities and Challenges in Code Search Tools
Code search is a core software engineering task. Effective code search tools can help developers substantially improve their software development efficiency and effectiveness. In recent years, many code search studies have leveraged different techniques, ...
CRaDLe: Deep code retrieval based on semantic Dependency Learning
Abstract
Code retrieval is a common practice for programmers to reuse existing code snippets in the open-source repositories. Given a user query (i.e., a natural language description), code retrieval aims at searching the most relevant ones from a set of ...
CoaCor: Code Annotation for Code Retrieval with Reinforcement Learning
WWW '19: The World Wide Web Conference

To accelerate software development, much research has been performed to help people understand and reuse the huge amount of available code resources. Two important tasks have been widely studied: code retrieval, which aims to retrieve code snippets ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

ASE '20: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering

December 2020

1449 pages

ISBN:9781450367684

DOI:10.1145/3324884

General Chair:
John Grundy,
Program Chairs:
Claire Le Goues,
David Lo

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 January 2021

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

National Key Research and Development Program of China
National Natural Science Foundation of China

Conference

ASE '20

Sponsor:

ASE '20: 35th IEEE/ACM International Conference on Automated Software Engineering

December 21 - 25, 2020

Virtual Event, Australia

Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

23
Total Citations
View Citations
136
Total Downloads

Downloads (Last 12 months)14
Downloads (Last 6 weeks)1

Reflects downloads up to 08 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Chen GLiu WXie X(2025)RFMC-CS: a representation fusion based multi-view momentum contrastive learning framework for code searchAutomated Software Engineering10.1007/s10515-025-00487-832:1Online publication date: 27-Jan-2025
https://doi.org/10.1007/s10515-025-00487-8
Zhang FLi MWu HWu T(2024)Intelligent code search aids edge software developmentJournal of Cloud Computing10.1186/s13677-024-00629-513:1Online publication date: 1-Apr-2024
https://doi.org/10.1186/s13677-024-00629-5
Wan YBi ZHe YZhang JZhang HSui YXu GJin HYu P(2024)Deep Learning for Code Intelligence: Survey, Benchmark and ToolkitACM Computing Surveys10.1145/3664597Online publication date: 18-May-2024
https://doi.org/10.1145/3664597
Sun WFang CGe YHu YChen YZhang QGe XLiu YChen Z(2024)A Survey of Source Code Search: A 3-Dimensional PerspectiveACM Transactions on Software Engineering and Methodology10.1145/365634133:6(1-51)Online publication date: 28-Jun-2024
https://dl.acm.org/doi/10.1145/3656341
Sun WFang CChen YZhang QTao GYou YHan TGe YHu YLuo BChen Z(2024)An Extractive-and-Abstractive Framework for Source Code SummarizationACM Transactions on Software Engineering and Methodology10.1145/363274233:3(1-39)Online publication date: 14-Mar-2024
https://dl.acm.org/doi/10.1145/3632742
Xie YLin JDong HZhang LWu Z(2023)Survey of Code Search Based on Deep LearningACM Transactions on Software Engineering and Methodology10.1145/362816133:2(1-42)Online publication date: 23-Dec-2023
https://dl.acm.org/doi/10.1145/3628161
Wang SGeng MLin BSun ZWen MLiu YLi LBissyandé TMao XChandra SBlincoe KTonella P(2023)Natural Language to Code: How Far Are We?Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering10.1145/3611643.3616323(375-387)Online publication date: 30-Nov-2023
https://dl.acm.org/doi/10.1145/3611643.3616323
Huang XMa YZhou HJiang ZZhang YWang TLi S(2023)Towards Better Multilingual Code Search through Cross-Lingual Contrastive LearningProceedings of the 14th Asia-Pacific Symposium on Internetware10.1145/3609437.3609439(22-32)Online publication date: 4-Aug-2023
https://dl.acm.org/doi/10.1145/3609437.3609439
Yang YHe WGao CXu ZXia XLiu C(2023)TopicAns: Topic-informed Architecture for Answer Recommendation on Technical Q&A SiteACM Transactions on Software Engineering and Methodology10.1145/360718933:1(1-25)Online publication date: 11-Jul-2023
https://dl.acm.org/doi/10.1145/3607189
Wei HSu XGao CZheng WTao W(2023)A Hypothesis Testing-based Framework for Software Cross-modal Retrieval in Heterogeneous Semantic SpacesACM Transactions on Software Engineering and Methodology10.1145/359186832:5(1-28)Online publication date: 21-Jul-2023
https://dl.acm.org/doi/10.1145/3591868
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten