research-article

A Hypothesis Testing-based Framework for Software Cross-modal Retrieval in Heterogeneous Semantic Spaces

Authors:

Wenxin Tao

ACM Transactions on Software Engineering and Methodology, Volume 32, Issue 5

Article No.: 123, Pages 1 - 28

https://doi.org/10.1145/3591868

Published: 21 July 2023

Abstract

Software cross-modal retrieval, which covers tasks such as bug localization and code search, is a popular yet challenging research direction. Previous studies generally map natural language text and code into a homogeneous semantic space for similarity measurement. However, the semantic gap between the two modalities makes it difficult to capture their shared semantics accurately in a single homogeneous space. We therefore propose mapping the multi-modal data into heterogeneous semantic spaces that capture the unique semantics of each modality. Specifically, we propose a novel software cross-modal retrieval framework named Deep Hypothesis Testing (DeepHT). In DeepHT, to capture the unique semantics of the code’s control flow structure, all control flow paths (CFPs) in the control flow graph are mapped to a CFP sample set in the sample space. Meanwhile, the text is mapped to a CFP correlation distribution in the distribution space to model its correlation with different CFPs. The matching score is then computed by hypothesis testing, according to how well the sample set follows the distribution. Experimental results on two text-to-code retrieval tasks (bug localization and code search) and two code-to-text retrieval tasks (vulnerability knowledge retrieval and historical patch retrieval) show that DeepHT outperforms the baseline methods.
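
The scoring idea in the abstract can be illustrated with a minimal sketch. The abstract does not say which distribution family or test statistic DeepHT uses, so everything below is an assumption made for illustration: the text is represented by a diagonal Gaussian with known variance, the code by a small set of CFP vectors, and the score is the p-value of a chi-square test on the sample mean. The names `chi2_sf_even` and `matching_score`, and all the numbers, are hypothetical.

```python
import math


def chi2_sf_even(x, dof):
    """Survival function P(X > x) of a chi-square variable with even dof.

    For even degrees of freedom the tail has a closed form, which keeps this
    sketch free of third-party dependencies.
    """
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this sketch only supports a positive, even dof")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def matching_score(cfp_samples, mean, var):
    """p-value of H0: the sample mean of the CFP vectors equals `mean`.

    ASSUMPTION: a diagonal Gaussian with known variance `var` and a test on
    the sample mean only; both are stand-ins chosen for illustration.
    """
    n, d = len(cfp_samples), len(mean)
    sample_mean = [sum(s[j] for s in cfp_samples) / n for j in range(d)]
    # Under H0 this statistic follows a chi-square distribution with d dof.
    stat = n * sum((sample_mean[j] - mean[j]) ** 2 / var[j] for j in range(d))
    return chi2_sf_even(stat, d)


# Toy 2-D "CFP embeddings" for one code snippet (made-up numbers).
cfps = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]]
related = matching_score(cfps, mean=[1.0, 2.0], var=[1.0, 1.0])
unrelated = matching_score(cfps, mean=[5.0, -3.0], var=[1.0, 1.0])
print(related > unrelated)
```

In this toy setup a text whose distribution is centered on the CFP samples receives a p-value close to 1, while a text whose distribution is far away receives a p-value close to 0, so ranking candidates by p-value plays the role of a matching score.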

Index Terms

1. Software and its engineering
  1. Software creation and management
    1. Search-based software engineering

Published In

ACM Transactions on Software Engineering and Methodology, Volume 32, Issue 5

September 2023

905 pages

ISSN: 1049-331X

EISSN: 1557-7392

DOI: 10.1145/3610417

Editor: Mauro Pezzè, USI Università della Svizzera italiana and SIT Schaffhausen Institute of Technology, Switzerland

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 July 2023

Online AM: 10 April 2023

Accepted: 15 February 2023

Revised: 27 December 2022

Received: 23 August 2022

Published in TOSEM Volume 32, Issue 5

Funding Sources

National Natural Science Foundation of China
