
OCoR: an overlapping-aware code retriever

Published: 27 January 2021


Code retrieval helps developers reuse code snippets from open-source projects. Given a natural language description, code retrieval aims to find the most relevant code among a set of code snippets. Existing state-of-the-art approaches apply neural networks to code retrieval. However, these approaches still fail to capture an important feature: overlaps. The overlaps between names chosen by different people indicate that two different names may be related (e.g., "message" and "msg"), and the overlaps between identifiers in code and words in natural language descriptions indicate that the code snippet and the description may be related.
To address this problem, we propose a novel neural architecture named OCoR, in which we introduce two specifically designed components to capture overlaps: the first embeds names character by character to capture the overlaps between names, and the second introduces a novel overlap matrix to represent the degree of overlap between each natural language word and each identifier.
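To make the overlap matrix idea concrete, here is a minimal sketch in Python. The abstract does not define how the degree of overlap is computed, so the measure below (a Dice-style score over the longest common subsequence of characters) is an assumption chosen for illustration, not the formula used in OCoR. The function names `lcs_length`, `overlap_degree`, and `overlap_matrix` are likewise hypothetical.

```python
from typing import List


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of characters in a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def overlap_degree(word: str, identifier: str) -> float:
    """Assumed overlap measure in [0, 1]: 2 * LCS / (len(word) + len(identifier)).

    This is an illustrative stand-in; the paper may define overlap differently.
    """
    w, i = word.lower(), identifier.lower()
    if not w or not i:
        return 0.0
    return 2 * lcs_length(w, i) / (len(w) + len(i))


def overlap_matrix(words: List[str], identifiers: List[str]) -> List[List[float]]:
    """Rows are natural language words, columns are code identifiers."""
    return [[overlap_degree(w, i) for i in identifiers] for w in words]


if __name__ == "__main__":
    words = ["send", "message"]
    identifiers = ["sendMsg", "buf"]
    for word, row in zip(words, overlap_matrix(words, identifiers)):
        print(word, [round(x, 2) for x in row])
```

Under this assumed measure, "message" and "msg" score 0.6 because all three characters of "msg" appear in order inside "message", while unrelated pairs such as "message" and "buf" score 0. In a neural model, a matrix like this would be one input feature alongside learned embeddings rather than a ranking signal on its own.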
The evaluation was conducted on two established datasets. The experimental results show that OCoR significantly outperforms the existing state-of-the-art approaches, improving on them by 13.1% to 22.3%. Moreover, we conducted several in-depth experiments to help understand the contribution of the different components in OCoR.






Information & Contributors


Published In

ASE '20: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering
December 2020
1449 pages
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]





Association for Computing Machinery

New York, NY, United States


Author Tags

  1. code retrieval
  2. neural network
  3. overlap


  • Research-article

Funding Sources

  • National Key Research and Development Program of China
  • National Natural Science Foundation of China


ASE '20

Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%


Other Metrics

Bibliometrics & Citations


Article Metrics

  • Downloads (last 12 months): 14
  • Downloads (last 6 weeks): 1
Reflects downloads up to 08 Feb 2025



Cited By

  • (2025) RFMC-CS: a representation fusion based multi-view momentum contrastive learning framework for code search. Automated Software Engineering 32:1. DOI: 10.1007/s10515-025-00487-8. Online publication date: 27-Jan-2025
  • (2024) Intelligent code search aids edge software development. Journal of Cloud Computing 13:1. DOI: 10.1186/s13677-024-00629-5. Online publication date: 1-Apr-2024
  • (2024) Deep Learning for Code Intelligence: Survey, Benchmark and Toolkit. ACM Computing Surveys. DOI: 10.1145/3664597. Online publication date: 18-May-2024
  • (2024) A Survey of Source Code Search: A 3-Dimensional Perspective. ACM Transactions on Software Engineering and Methodology 33:6 (1-51). DOI: 10.1145/3656341. Online publication date: 28-Jun-2024
  • (2024) An Extractive-and-Abstractive Framework for Source Code Summarization. ACM Transactions on Software Engineering and Methodology 33:3 (1-39). DOI: 10.1145/3632742. Online publication date: 14-Mar-2024
  • (2023) Survey of Code Search Based on Deep Learning. ACM Transactions on Software Engineering and Methodology 33:2 (1-42). DOI: 10.1145/3628161. Online publication date: 23-Dec-2023
  • (2023) Natural Language to Code: How Far Are We? Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (375-387). DOI: 10.1145/3611643.3616323. Online publication date: 30-Nov-2023
  • (2023) Towards Better Multilingual Code Search through Cross-Lingual Contrastive Learning. Proceedings of the 14th Asia-Pacific Symposium on Internetware (22-32). DOI: 10.1145/3609437.3609439. Online publication date: 4-Aug-2023
  • (2023) TopicAns: Topic-informed Architecture for Answer Recommendation on Technical Q&A Site. ACM Transactions on Software Engineering and Methodology 33:1 (1-25). DOI: 10.1145/3607189. Online publication date: 11-Jul-2023
  • (2023) A Hypothesis Testing-based Framework for Software Cross-modal Retrieval in Heterogeneous Semantic Spaces. ACM Transactions on Software Engineering and Methodology 32:5 (1-28). DOI: 10.1145/3591868. Online publication date: 21-Jul-2023
