Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/3211346.3211353acmconferencesArticle/Chapter ViewAbstractPublication PagespldiConference Proceedingsconference-collections
research-article

Retrieval on source code: a neural code search

Published: 18 June 2018 Publication History

Abstract

Searching over large code corpora can be a powerful productivity tool for both beginner and experienced developers because it helps them quickly find examples of code related to their intent. Code search becomes even more attractive if developers could express their intent in natural language, similar to the interaction that Stack Overflow supports.
In this paper, we investigate the use of natural language processing and information retrieval techniques to carry out natural language search directly over source code, i.e. without having a curated Q&A forum such as Stack Overflow at hand.
Our experiments using a benchmark suite derived from Stack Overflow and GitHub repositories show promising results. We find that while a basic word–embedding based search procedure works acceptably, better results can be obtained by adding a layer of supervision, as well as by a customized ranking strategy.

References

[1]
Miltiadis Allamanis, Hao Peng, and Charles Sutton. 2016. A Convolutional Attention Network for Extreme Summarization of Source Code. In International Conference on Machine Learning (ICML).
[2]
Miltos Allamanis, Daniel Tarlow, Andrew Gordon, and Yi Wei. 2015. Bimodal Modelling of Source Code and Natural Language. In Proceedings of the 32nd International Conference on Machine Learning (Proceedings of Machine Learning Research), Francis Bach and David Blei (Eds.), Vol. 37. PMLR, Lille, France, 2123–2132.
[3]
Sushil Bajracharya, Trung Ngo, Erik Linstead, Yimeng Dou, Paul Rigor, Pierre Baldi, and Cristina Lopes. 2006. Sourcerer: A Search Engine for Open Source Code Supporting Structure-based Search. In Companion to the 21st ACM SIGPLAN Symposium on Object-oriented Programming Systems, Languages, and Applications (OOPSLA ’06). ACM, New York, NY, USA, 681–682.
[4]
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. CoRR abs/1607.04606 (2016). arXiv: 1607.04606 http://arxiv.org/abs/1607. 04606
[5]
Wing-Kwan Chan, Hong Cheng, and David Lo. 2012. Searching Connected API Subgraph via Text Phrases. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering (FSE ’12). ACM, New York, NY, USA, Article 10, 11 pages.
[6]
Shaunak Chatterjee, Sudeep Juvekar, and Koushik Sen. 2009. SNIFF: A Search Engine for Java Using Free-Form Queries. In Fundamental Approaches to Software Engineering, Marsha Chechik and Martin Wirsing (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 385–400.
[7]
Zellig S. Harris. 1954. Distributional Structure. WORD 10, 2-3 (1954), 146–162.
[8]
Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing Source Code using a Neural Attention Model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2073–2083.
[9]
Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. CoRR abs/1702.08734 (2017). arXiv: 1702.08734
[10]
Fei Lv, Hongyu Zhang, Jian-guang Lou, Shaowei Wang, Dongmei Zhang, and Jianjun Zhao. 2015. CodeHow: Effective Code Search Based on API Understanding and Extended Boolean Model (E). In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE) (ASE ’15). IEEE Computer Society, Washington, DC, USA, 260–270.
[11]
Chris J. Maddison and Daniel Tarlow. 2014. Structured Generative Models of Natural Source Code. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32 (ICML’14). JMLR.org, II–649–II–657. http://dl.acm.org/citation.cfm? id=3044805.3044965
[12]
Collin McMillan, Mark Grechanik, Denys Poshyvanyk, Qing Xie, and Chen Fu. 2011. Portfolio: Finding Relevant Functions and Their Usage. In Proceedings of the 33rd International Conference on Software Engineering (ICSE ’11). ACM, New York, NY, USA, 111–120.
[13]
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. CoRR abs/1310.4546 (2013). arXiv: 1310.4546
[14]
Bhaskar Mitra and Nick Craswell. 2017. Neural Models for Information Retrieval. CoRR abs/1705.01509 (2017). arXiv: 1705.01509
[15]
Vijayaraghavan Murali, Swarat Chaudhuri, and Chris Jermaine. 2017. Bayesian Sketch Learning for Program Synthesis. CoRR abs/1703.05698 (2017). arXiv: 1703.05698 http://arxiv.org/abs/1703.05698
[16]
Mukund Raghothaman, Yi Wei, and Youssef Hamadi. 2016. SWIM: Synthesizing What I Mean: Code Search and Idiomatic Snippet Synthesis. In Proceedings of the 38th International Conference on Software Engineering (ICSE ’16). ACM, New York, NY, USA, 357–367.
[17]
Xin Ye, Hui Shen, Xiao Ma, Razvan Bunescu, and Chang Liu. 2016. From Word Embeddings to Document Similarities for Improved Information Retrieval in Software Engineering. In Proceedings of the 38th International Conference on Software Engineering (ICSE ’16). ACM, New York, NY, USA, 404–415.
[18]
Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. CoRR abs/1709.00103 (2017). arXiv: 1709.00103 http://arxiv.org/abs/1709.00103

Cited By

View all
  • (2025)Promises and perils of using Transformer-based models for SE researchNeural Networks10.1016/j.neunet.2024.107067184(107067)Online publication date: Apr-2025
  • (2024)C2B: A Semantic Source Code Retrieval Model Using CodeT5 and Bi-LSTMApplied Sciences10.3390/app1413579514:13(5795)Online publication date: 2-Jul-2024
  • (2024)A Combined Alignment Model for Code SearchIEICE Transactions on Information and Systems10.1587/transinf.2023MPP0002E107.D:3(257-267)Online publication date: 1-Mar-2024
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
MAPL 2018: Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages
June 2018
80 pages
ISBN:9781450358347
DOI:10.1145/3211346
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 June 2018

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. TF-IDF
  2. code search
  3. word-embedding

Qualifiers

  • Research-article

Conference

PLDI '18
Sponsor:

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)93
  • Downloads (Last 6 weeks)21
Reflects downloads up to 06 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2025)Promises and perils of using Transformer-based models for SE researchNeural Networks10.1016/j.neunet.2024.107067184(107067)Online publication date: Apr-2025
  • (2024)C2B: A Semantic Source Code Retrieval Model Using CodeT5 and Bi-LSTMApplied Sciences10.3390/app1413579514:13(5795)Online publication date: 2-Jul-2024
  • (2024)A Combined Alignment Model for Code SearchIEICE Transactions on Information and Systems10.1587/transinf.2023MPP0002E107.D:3(257-267)Online publication date: 1-Mar-2024
  • (2024)Intelligent code search aids edge software developmentJournal of Cloud Computing10.1186/s13677-024-00629-513:1Online publication date: 1-Apr-2024
  • (2024)Automating TODO-missed Methods Detection and PatchingACM Transactions on Software Engineering and Methodology10.1145/370079334:1(1-28)Online publication date: 6-Nov-2024
  • (2024)A Survey of Source Code Search: A 3-Dimensional PerspectiveACM Transactions on Software Engineering and Methodology10.1145/365634133:6(1-51)Online publication date: 28-Jun-2024
  • (2024)RAPID: Zero-Shot Domain Adaptation for Code Search with Pre-Trained ModelsACM Transactions on Software Engineering and Methodology10.1145/364154233:5(1-35)Online publication date: 18-Jan-2024
  • (2024)Dsn2Code: An automated approach for similarity-based Software Architecture selection for Code reuse2024 International Research Conference on Smart Computing and Systems Engineering (SCSE)10.1109/SCSE61872.2024.10550890(1-6)Online publication date: 4-Apr-2024
  • (2024)Code Search Oriented Node-Enhanced Control Flow Graph Embedding2024 IEEE International Conference on Source Code Analysis and Manipulation (SCAM)10.1109/SCAM63643.2024.00016(59-70)Online publication date: 7-Oct-2024
  • (2024)Performance of Traditional and Dense Vector Information Retrieval Models in Code Search2024 2nd International Conference on Software Engineering and Information Technology (ICoSEIT)10.1109/ICoSEIT60086.2024.10497512(52-57)Online publication date: 28-Feb-2024
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media