A Hypothesis Testing-based Framework for Software Cross-modal Retrieval in Heterogeneous Semantic Spaces

Published: 21 July 2023

Abstract

Software cross-modal retrieval, which covers tasks such as bug localization and code search, is a popular yet challenging research direction. Previous studies generally map natural language text and code into a homogeneous semantic space for similarity measurement. However, the semantic gap between the two modalities makes it difficult to capture their shared semantics accurately in a single homogeneous space. We therefore propose mapping the multi-modal data into heterogeneous semantic spaces that capture the unique semantics of each modality. Specifically, we propose a novel software cross-modal retrieval framework named Deep Hypothesis Testing (DeepHT). In DeepHT, to capture the unique semantics of the code’s control flow structure, all control flow paths (CFPs) in the control flow graph are mapped to a CFP sample set in the sample space. Meanwhile, the text is mapped to a CFP correlation distribution in the distribution space to model its correlation with different CFPs. The matching score is calculated by hypothesis testing, according to how well the sample set follows the distribution. Experimental results on two text-to-code retrieval tasks (i.e., bug localization and code search) and two code-to-text retrieval tasks (i.e., vulnerability knowledge retrieval and historical patch retrieval) show that DeepHT outperforms the baseline methods.
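
As a rough illustration of the pipeline described above, the following Python sketch enumerates the CFPs of a toy control flow graph and then scores a sample set against a distribution with a simple test statistic. It is not the DeepHT implementation: the abstract does not specify how CFPs are embedded or which hypothesis test is used. The function names `enumerate_cfps` and `mean_test_statistic`, the adjacency-dict graph format, and the choice of a one-sample mean test with a known diagonal variance are all assumptions made for this example.

```python
from math import fsum


def enumerate_cfps(cfg, entry, exit_node):
    """Return every entry-to-exit path of a CFG given as {node: [successors]}.

    Assumption: a node is visited at most once per path, so back edges
    (loops) are cut rather than unrolled.
    """
    paths = []
    stack = [(entry, [entry])]
    while stack:
        node, path = stack.pop()
        if node == exit_node:
            paths.append(path)
            continue
        for succ in cfg.get(node, []):
            if succ not in path:
                stack.append((succ, path + [succ]))
    return sorted(paths)


def mean_test_statistic(samples, mu, var):
    """Statistic for H0: the samples have mean `mu`, given known per-dimension
    variances `var` (hypothetical stand-in for the text-side distribution).

    For i.i.d. Gaussian samples, the value follows a chi-square distribution
    with len(mu) degrees of freedom under H0, so smaller values indicate
    that the sample set agrees better with the distribution.
    """
    n = len(samples)
    stat = 0.0
    for j in range(len(mu)):
        sample_mean = fsum(s[j] for s in samples) / n
        stat += n * (sample_mean - mu[j]) ** 2 / var[j]
    return stat


# Toy CFG for: if (c) { then } else { else }
cfg = {"entry": ["if"], "if": ["then", "else"], "then": ["exit"], "else": ["exit"]}
cfps = enumerate_cfps(cfg, "entry", "exit")

# Hypothetical 2-D embeddings, one per CFP; a real system would learn these.
cfp_samples = [[1.0, 2.0], [3.0, 2.0]]
good_match = mean_test_statistic(cfp_samples, mu=[2.0, 2.0], var=[1.0, 1.0])
poor_match = mean_test_statistic(cfp_samples, mu=[0.0, 2.0], var=[1.0, 1.0])
```

In this sketch a lower statistic means the CFP sample set is more consistent with the text-side distribution, so a retrieval system could rank candidates by the negated statistic or by the corresponding p-value.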

    Published In

    ACM Transactions on Software Engineering and Methodology, Volume 32, Issue 5
    September 2023
    905 pages
    ISSN:1049-331X
    EISSN:1557-7392
    DOI:10.1145/3610417
    • Editor: Mauro Pezzè

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 21 July 2023
    Online AM: 10 April 2023
    Accepted: 15 February 2023
    Revised: 27 December 2022
    Received: 23 August 2022
    Published in TOSEM Volume 32, Issue 5

    Author Tags

    1. Software cross-modal retrieval
    2. hypothesis testing
    3. deep learning

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
