DOI: 10.1145/3180155.3180167
research-article

Deep code search

Published: 27 May 2018

Abstract

To implement a piece of program functionality, developers can reuse previously written code snippets by searching through a large-scale codebase. Over the years, many code search tools have been proposed to help developers. Existing approaches often treat source code as textual documents and use information retrieval models to retrieve code snippets that match a given query. These approaches rely mainly on the textual similarity between source code and the natural language query, and they lack a deep understanding of the semantics of queries and source code.
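The limitation of purely textual matching can be illustrated with a small sketch. It is not one of the tools the paper evaluates; it assumes a hypothetical scoring function, `jaccard`, that measures token overlap between a query and a snippet after splitting camelCase identifiers.

```python
import re


def tokens(text):
    """Split camelCase identifiers and return the set of lowercase words."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return set(re.findall(r"[a-z]+", spaced.lower()))


def jaccard(query, snippet):
    """Token-overlap score: |intersection| / |union| of the two token sets."""
    q, s = tokens(query), tokens(snippet)
    union = q | s
    return len(q & s) / len(union) if union else 0.0


# Hypothetical snippets: both are relevant to the query, but only one shares words with it.
print(jaccard("read file", "readFile(path)"))
print(jaccard("read file", "loadDocument(path)"))
```

Under this scoring, the second snippet gets no credit because "load" and "document" never appear in the query, even though the intent is the same.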
In this paper, we propose a novel deep neural network named CODEnn (Code-Description Embedding Neural Network). Instead of matching text similarity, CODEnn jointly embeds code snippets and natural language descriptions into a high-dimensional vector space, in such a way that a code snippet and its corresponding description have similar vectors. Using the unified vector representation, code snippets related to a natural language query can be retrieved according to their vectors. Semantically related words can also be recognized, and irrelevant or noisy keywords in queries can be handled.
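Retrieval in a shared vector space can be sketched as follows. This is a minimal illustration, not the CODEnn model: the three-dimensional vectors are hand-written placeholders standing in for learned embeddings, the snippet names are hypothetical, and cosine similarity is assumed as the measure of vector closeness.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(query_vec, code_vecs, k=1):
    """Return the names of the k snippets whose vectors are closest to the query."""
    ranked = sorted(
        code_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Placeholder embeddings (assumed, not produced by any trained model).
CODE_VECS = {
    "readFileToString": [0.9, 0.1, 0.0],
    "openHttpConnection": [0.1, 0.9, 0.1],
    "sortList": [0.0, 0.2, 0.9],
}

# Placeholder embedding for a query such as "read a text file".
query_vec = [0.8, 0.2, 0.1]

print(search(query_vec, CODE_VECS, k=2))
```

Because ranking depends only on vector closeness, a query and a snippet can match without sharing any words, provided the embedding model places them near each other.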
As a proof-of-concept application, we implement a code search tool named DeepCS using the proposed CODEnn model. We empirically evaluate DeepCS on a large-scale codebase collected from GitHub. The experimental results show that our approach can effectively retrieve relevant code snippets and outperforms previous techniques.

Information & Contributors

Information

Published In

ICSE '18: Proceedings of the 40th International Conference on Software Engineering
May 2018
1307 pages
ISBN: 9781450356381
DOI: 10.1145/3180155
Conference Chair: Michel Chaudron
General Chair: Ivica Crnkovic
Program Chairs: Marsha Chechik, Mark Harman
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. code search
  2. deep learning
  3. joint embedding

Qualifiers

  • Research-article

Conference

ICSE '18

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%

Article Metrics

  • Downloads (Last 12 months): 335
  • Downloads (Last 6 weeks): 31
Reflects downloads up to 08 Feb 2025

Cited By

  • (2025) On Inter-Dataset Code Duplication and Data Leakage in Large Language Models. IEEE Transactions on Software Engineering 51:1 (192-205). DOI: 10.1109/TSE.2024.3504286. Online publication date: 1-Jan-2025
  • (2025) Web-FTP: A Feature Transferring-Based Pre-Trained Model for Web Attack Detection. IEEE Transactions on Knowledge and Data Engineering 37:3 (1495-1507). DOI: 10.1109/TKDE.2024.3512793. Online publication date: Mar-2025
  • (2025) Promises and perils of using Transformer-based models for SE research. Neural Networks 184 (107067). DOI: 10.1016/j.neunet.2024.107067. Online publication date: Apr-2025
  • (2025) MITU: Locating relevant tutorial fragments of APIs with multi-source API knowledge. Journal of Systems and Software 222 (112296). DOI: 10.1016/j.jss.2024.112296. Online publication date: Apr-2025
  • (2025) RFMC-CS: a representation fusion based multi-view momentum contrastive learning framework for code search. Automated Software Engineering 32:1. DOI: 10.1007/s10515-025-00487-8. Online publication date: 27-Jan-2025
  • (2024) DOĞAL DİL METİNLERİNDEN PROGRAMLAMA DİLİ KODU OLUŞTURMA ÇALIŞMALARI: BİR DERLEME ÇALIŞMASI. İstanbul Ticaret Üniversitesi Fen Bilimleri Dergisi 23:45 (209-244). DOI: 10.55071/ticaretfbd.1354040. Online publication date: 26-Jun-2024
  • (2024) EMPIRICAL STUDY OF DOMAIN-ADAPTED PRETRAINING AND CROSS-LINGUAL TRANSFER FOR PROGRAMMING LANGUAGE UNDERSTANDING USING THE DOLMA STACK CODE SUBSET. Journal of Southwest Jiaotong University 59:2. DOI: 10.35741/issn.0258-2724.59.2.26. Online publication date: 2024
  • (2024) Requirement Dependency Extraction Based on Improved Stacking Ensemble Machine Learning. Mathematics 12:9 (1272). DOI: 10.3390/math12091272. Online publication date: 23-Apr-2024
  • (2024) I2R. Intelligent Data Analysis 28:3 (807-823). DOI: 10.3233/IDA-230082. Online publication date: 28-May-2024
  • (2024) Analysis of Decompiled Program Code Using Abstract Syntax Trees. Automatic Control and Computer Sciences 57:8 (958-967). DOI: 10.3103/S0146411623080060. Online publication date: 29-Feb-2024
