Deep Graph Matching and Searching for Semantic Code Retrieval

Published: 10 May 2021

Abstract

Code retrieval is the task of finding, in a large corpus of source code repositories, the code snippet that best matches a natural language query. Recent work mainly applies natural language processing techniques to both query texts (i.e., human natural language) and code snippets (i.e., machine programming language), but neglects the deep structural features of query texts and source code, both of which carry rich semantic information. In this article, we propose an end-to-end deep graph matching and searching (DGMS) model based on graph neural networks for semantic code retrieval. We first represent both natural language query texts and programming language code snippets as unified graph-structured data, and then use the proposed graph matching and searching model to retrieve the best-matching code snippet. In particular, DGMS not only captures more structural information for individual query texts and code snippets, but also learns the fine-grained similarity between them through cross-attention-based semantic matching operations. We evaluate DGMS on two public code retrieval datasets covering two representative programming languages (i.e., Java and Python). Experimental results demonstrate that DGMS outperforms state-of-the-art baseline models by a large margin on both datasets. Moreover, our extensive ablation studies systematically investigate and illustrate the impact of each part of DGMS.
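
To make the idea of cross-attention-based matching between a query graph and a code graph more concrete, the following is a minimal sketch and not the DGMS implementation. It assumes that node embeddings are already available as plain lists of floats (in the article they would be produced by a graph neural network), and it uses cosine-based attention, an element-wise product as the comparison step, max pooling, and a cosine score between the pooled vectors. All function names (`cross_attend`, `match_and_pool`, `graph_similarity`, `retrieve`) and the toy data are hypothetical choices made for this illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm_u, norm_v = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(source_nodes, target_nodes):
    """For each source node, build an attention-weighted average of the
    nodes of the *other* graph (assumed attention score: cosine)."""
    dim = len(target_nodes[0])
    attended = []
    for s in source_nodes:
        weights = softmax([cosine(s, t) for t in target_nodes])
        attended.append(
            [sum(w * t[d] for w, t in zip(weights, target_nodes)) for d in range(dim)]
        )
    return attended


def match_and_pool(source_nodes, target_nodes):
    """Compare each node with its cross-graph context (element-wise product,
    an assumed comparison function), then max-pool into one graph vector."""
    attended = cross_attend(source_nodes, target_nodes)
    compared = [[a * b for a, b in zip(s, c)] for s, c in zip(source_nodes, attended)]
    dim = len(compared[0])
    return [max(row[d] for row in compared) for d in range(dim)]


def graph_similarity(query_nodes, code_nodes):
    """Score a (query graph, code graph) pair from both matching directions."""
    query_vec = match_and_pool(query_nodes, code_nodes)
    code_vec = match_and_pool(code_nodes, query_nodes)
    return cosine(query_vec, code_vec)


def retrieve(query_nodes, corpus):
    """Rank the snippet names in `corpus` from best to worst match."""
    return sorted(
        corpus,
        key=lambda name: graph_similarity(query_nodes, corpus[name]),
        reverse=True,
    )


# Toy data: two-node graphs with three-dimensional node embeddings.
query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
corpus = {
    "unrelated": [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
    "match": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
}
ranking = retrieve(query, corpus)
print(ranking)
```

In this sketch the ranking step simply scores every candidate in the corpus against the query; a trained model would learn the embeddings and the matching parameters rather than use fixed vectors.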

Published In

ACM Transactions on Knowledge Discovery from Data, Volume 15, Issue 5
October 2021
508 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/3461317
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 May 2021
Accepted: 01 January 2021
Received: 01 September 2020
Published in TKDD Volume 15, Issue 5

Author Tags

  1. Neural networks
  2. graph representation
  3. source code retrieval

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Key R&D Program of China
  • Zhejiang Provincial Natural Science Foundation for Distinguished Young Scholars
  • Fundamental Research Funds for the Central Universities (Zhejiang University NGICS Platform)
  • NSFC
  • Key R&D Program of Zhejiang Province
  • Major Scientific Project of Zhejiang Laboratory

Cited By

  • (2024) Code Similarity Prediction Model for Industrial Management Features Based on Graph Neural Networks. Entropy 26:6 (505). DOI: 10.3390/e26060505. Online publication date: 9-Jun-2024
  • (2024) C2B: A Semantic Source Code Retrieval Model Using CodeT5 and Bi-LSTM. Applied Sciences 14:13 (5795). DOI: 10.3390/app14135795. Online publication date: 2-Jul-2024
  • (2024) Intelligent code search aids edge software development. Journal of Cloud Computing: Advances, Systems and Applications 13:1. DOI: 10.1186/s13677-024-00629-5. Online publication date: 1-Apr-2024
  • (2024) A Survey of Source Code Search: A 3-Dimensional Perspective. ACM Transactions on Software Engineering and Methodology 33:6 (1-51). DOI: 10.1145/3656341. Online publication date: 28-Jun-2024
  • (2024) Fusing Code Searchers. IEEE Transactions on Software Engineering 50:7 (1852-1866). DOI: 10.1109/TSE.2024.3403042. Online publication date: 1-Jul-2024
  • (2024) CodeFuse: Multimodal Code Search Model with Fine-Grained Attention Alignment. 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC) (1290-1299). DOI: 10.1109/COMPSAC61105.2024.00170. Online publication date: 2-Jul-2024
  • (2024) CMCS: contrastive-metric learning via vector-level sampling and augmentation for code search. Scientific Reports 14:1. DOI: 10.1038/s41598-024-64205-2. Online publication date: 24-Jun-2024
  • (2024) RRGcode: Deep hierarchical search-based code generation. Journal of Systems and Software 211 (111982). DOI: 10.1016/j.jss.2024.111982. Online publication date: May-2024
  • (2024) Query-oriented two-stage attention-based model for code search. Journal of Systems and Software 210:C. DOI: 10.1016/j.jss.2023.111948. Online publication date: 25-Jun-2024
  • (2024) On the impact of multiple source code representations on software engineering tasks: An empirical study. Journal of Systems and Software 210:C. DOI: 10.1016/j.jss.2023.111941. Online publication date: 25-Jun-2024
