Retrieval on source code: a neural code search

Authors:

Saksham Sachdev,

Satish Chandra

MAPL 2018: Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages

Pages 31 - 41

https://doi.org/10.1145/3211346.3211353

Published: 18 June 2018

Abstract

Searching over large code corpora can be a powerful productivity tool for both beginner and experienced developers because it helps them quickly find examples of code related to their intent. Code search becomes even more attractive if developers can express their intent in natural language, similar to the interaction that Stack Overflow supports.

In this paper, we investigate the use of natural language processing and information retrieval techniques to carry out natural language search directly over source code, i.e. without having a curated Q&A forum such as Stack Overflow at hand.

Our experiments using a benchmark suite derived from Stack Overflow and GitHub repositories show promising results. We find that while a basic search procedure based on word embeddings works acceptably, better results can be obtained by adding a layer of supervision, as well as by a customized ranking strategy.
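As a rough illustration of what a basic word-embedding search over code can look like, the sketch below averages word vectors for a query and for each snippet's tokens, then ranks snippets by cosine similarity. It is not the authors' procedure: the tiny `EMBEDDINGS` table, the `SNIPPETS` corpus, and the function names are all hypothetical, and the abstract does not specify how vectors are combined or ranked.

```python
import math

# Hypothetical 3-dimensional toy embeddings. A real system would learn
# vectors from a large code corpus instead of hard-coding them.
EMBEDDINGS = {
    "read": [0.9, 0.1, 0.0],
    "file": [0.8, 0.2, 0.1],
    "open": [0.7, 0.3, 0.0],
    "close": [0.6, 0.1, 0.3],
    "sort": [0.0, 0.9, 0.2],
    "list": [0.1, 0.8, 0.3],
    "http": [0.1, 0.1, 0.9],
    "request": [0.2, 0.0, 0.8],
}

# Hypothetical corpus: snippet name -> tokens extracted from its source.
SNIPPETS = {
    "readFile": ["open", "file", "read", "close"],
    "sortList": ["sort", "list"],
    "sendRequest": ["http", "request"],
}


def embed(tokens):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def search(query, snippets, top_k=1):
    """Return the names of the top_k snippets most similar to the query."""
    query_vector = embed(query.lower().split())
    scored = [
        (cosine(query_vector, embed(tokens)), name)
        for name, tokens in snippets.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


if __name__ == "__main__":
    print(search("read a file", SNIPPETS))
```

The supervision layer and customized ranking strategy mentioned in the abstract would replace or refine the plain average and cosine score used here; this sketch does not attempt either.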

Index Terms

Retrieval on source code: a neural code search
1. Software and its engineering
  1. Software creation and management
    1. Software development techniques
    2. Software post-development issues

Published In

MAPL 2018: Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages

June 2018

80 pages

ISBN:9781450358347

DOI:10.1145/3211346

General Chair:
Justin Gottschlich,
Program Chair:
Alvin Cheung

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PLDI '18

Sponsor:

SIGPLAN

PLDI '18: ACM SIGPLAN Conference on Programming Language Design and Implementation

June 18, 2018

Philadelphia, PA, USA
