RFMC-CS: a representation fusion based multi-view momentum contrastive learning framework for code search

Published in: Automated Software Engineering

Abstract

Code search is a crucial task in software engineering that aims to retrieve relevant code from a codebase given a natural language query. While deep-learning-based code search methods have demonstrated impressive performance, recent advances in contrastive learning have further enhanced the representation learning of these models. Despite these improvements, existing methods still have limitations in representation learning over multi-modal data. Specifically, they suffer from semantic loss when learning code representations and fail to fully exploit functionally relevant code pairs during representation learning. To address these limitations, we propose a Representation Fusion based Multi-View Momentum Contrastive Learning Framework for Code Search, named RFMC-CS. RFMC-CS effectively retains the semantic and structural information of code through multi-modal representation and fusion. Through its carefully designed multi-view momentum contrastive learning, RFMC-CS further learns the correlations between different modalities of a sample and between semantically relevant samples. Experimental results on the CodeSearchNet benchmark show that RFMC-CS outperforms seven advanced baselines on the MRR and Recall@k metrics. Ablation experiments illustrate the effectiveness of each component, and portability experiments show that RFMC-CS is readily portable.
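
As background, the sketch below shows the generic MoCo-style momentum update and InfoNCE objective that momentum contrastive methods build on (He et al. 2020). It is a minimal illustration under our own naming, not the authors' RFMC-CS implementation; the batch and queue shapes are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def momentum_update(query_encoder, key_encoder, m=0.999):
    # EMA update of the key (momentum) encoder from the query encoder.
    for q_p, k_p in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_p.data = m * k_p.data + (1.0 - m) * q_p.data

def info_nce_loss(query_emb, key_emb, queue, temperature=0.07):
    # InfoNCE over one positive per query plus a queue of cached negatives.
    q = F.normalize(query_emb, dim=1)                  # (B, D)
    k = F.normalize(key_emb, dim=1)                    # (B, D)
    l_pos = (q * k).sum(dim=1, keepdim=True)           # (B, 1) positive logits
    l_neg = q @ queue.t()                              # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)  # positives sit at index 0
    return F.cross_entropy(logits, labels)

# Toy usage: 8 query/key pairs of dim 16 and a queue of 64 negatives.
q_emb, k_emb = torch.randn(8, 16), torch.randn(8, 16)
queue = F.normalize(torch.randn(64, 16), dim=1)
loss = info_nce_loss(q_emb, k_emb, queue)
```

For the reported metrics, MRR averages the reciprocal rank 1/r of the correct code snippet over all queries, and Recall@k is the fraction of queries whose correct snippet appears in the top k results.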

Data availability

No datasets were generated or analysed during the current study.

Notes

  1. BGE: https://huggingface.co/BAAI/bge-large-en-v1.5.

  2. Tree-sitter: https://tree-sitter.github.io/tree-sitter/.
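
For readers unfamiliar with Tree-sitter, the snippet below sketches how source code can be parsed into a syntax tree with the py-tree-sitter bindings. The example code, the walk helper, and the assumed packages (py-tree-sitter >= 0.22 plus the tree-sitter-python grammar wheel) are illustrative, not necessarily the authors' exact setup.

```python
from tree_sitter import Language, Parser
import tree_sitter_python as tspython  # prebuilt Python grammar

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

tree = parser.parse(b"def add(a, b):\n    return a + b\n")

def walk(node, depth=0):
    # Print each node type of the syntax tree, indented by depth.
    print("  " * depth + node.type)
    for child in node.children:
        walk(child, depth + 1)

walk(tree.root_node)  # module -> function_definition -> parameters -> ...
```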

References

  • Cambronero, J., Li, H., Kim, S., Sen, K., Chandra, S.: When deep learning met code search. In: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 964–974. Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3338906.3340458

  • Chai, Y., Zhang, H., Shen, B., Gu, X.: Cross-domain deep code search with meta learning. In: Proceedings of the 44th International Conference on Software Engineering, pp. 487–498. Association for Computing Machinery, New York, NY, USA (2022). https://doi.org/10.1145/3510003.3510125

  • Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: Proceedings of the 37th International Conference on Machine Learning, vol. 119, pp. 1597–1607. PMLR (2020)

  • Cheng, Y., Kuang, L.: CSRS: code search with relevance matching and semantic matching. In: 2022 IEEE/ACM 30th International Conference on Program Comprehension, pp. 533–542 (2022). https://doi.org/10.1145/3524610.3527889

  • Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (2019). https://doi.org/10.18653/v1/N19-1423

  • Di Grazia, L., Pradel, M.: Code search: a survey of techniques for finding code. ACM Comput. Surv. (2023). https://doi.org/10.1145/3565971

  • Ding, Y., Buratti, L., Pujar, S., Morari, A., Ray, B., Chakraborty, S.: Contrastive learning for source code with structural and functional properties (2021). CoRR arXiv:2110.03868 [cs.PL]

  • Fang, H., Wang, S., Zhou, M., Ding, J., Xie, P.: CERT: contrastive self-supervised learning for language understanding (2020). CoRR arXiv:2005.12766 [cs.CL]

  • Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang, D., Zhou, M.: CodeBERT: a pre-trained model for programming and natural languages. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1536–1547. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.findings-emnlp.139

  • Gao, T., Yao, X., Chen, D.: SimCSE: simple contrastive learning of sentence embeddings. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6894–6910. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic (2021). https://doi.org/10.18653/v1/2021.emnlp-main.552

  • Giorgi, J., Nitski, O., Wang, B., Bader, G.: DeCLUTR: deep contrastive learning for unsupervised textual representations. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, vol. 1, pp. 879–895. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.acl-long.72

  • Gu, X., Zhang, H., Kim, S.: Deep code search. In: Proceedings of the 40th International Conference on Software Engineering, pp. 933–944. Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3180155.3180167

  • Guo, D., Lu, S., Duan, N., Wang, Y., Zhou, M., Yin, J.: UniXcoder: unified cross-modal pre-training for code representation. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 7212–7225. Association for Computational Linguistics, Dublin, Ireland (2022). https://doi.org/10.18653/v1/2022.acl-long.499

  • Guo, D., Ren, S., Lu, S., Feng, Z., Tang, D., Liu, S., Zhou, L., Duan, N., Svyatkovskiy, A., Fu, S., Tufano, M., Deng, S.K., Clement, C., Drain, D., Sundaresan, N., Yin, J., Jiang, D., Zhou, M.: GraphCodeBERT: pre-training code representations with data flow. In: International Conference on Learning Representations (2021)

  • He, X., Deng, K., Wang, X., Li, Y., Zhang, Y., Wang, M.: LightGCN: simplifying and powering graph convolution network for recommendation. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 639–648. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3397271.3401063

  • He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9726–9735 (2020). https://doi.org/10.1109/CVPR42600.2020.00975

  • Hjelm, R.D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., Bengio, Y.: Learning deep representations by mutual information estimation and maximization. In: International Conference on Learning Representations (2019)

  • Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735

  • Hu, Y., Jiang, H., Hu, Z.: Measuring code maintainability with deep neural networks. Front. Comput. Sci. 17(6), 176214 (2023). https://doi.org/10.1007/s11704-022-2313-0

  • Husain, H., Wu, H.-H., Gazit, T., Allamanis, M., Brockschmidt, M.: CodeSearchNet challenge: evaluating the state of semantic code search (2019). CoRR arXiv:1909.09436 [cs.LG]

  • Jiang, X., Zheng, Z., Lyu, C., Li, L., Lyu, L.: TreeBERT: a tree-based pre-trained model for programming language. In: Uncertainty in Artificial Intelligence, pp. 54–63. PMLR (2021)

  • Kim, K., Ghatpande, S., Kim, D., Zhou, X., Liu, K., Bissyandé, T.F., Klein, J., Le Traon, Y.: Big code search: a bibliography. ACM Comput. Surv. (2023). https://doi.org/10.1145/3604905

  • Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: 5th International Conference on Learning Representations, ICLR 2017. OpenReview.net (2017)

  • Li, X., Gong, Y., Shen, Y., Qiu, X., Zhang, H., Yao, B., Qi, W., Jiang, D., Chen, W., Duan, N.: CodeRetriever: a large scale contrastive pre-training method for code search. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 2898–2910. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (2022)

  • Li, J., Liu, F., Li, J., Zhao, Y., Li, G., Jin, Z.: MCodeSearcher: multi-view contrastive learning for code search. In: Proceedings of the 14th Asia-Pacific Symposium on Internetware, pp. 270–280. Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3609437.3609456

  • Li, Z., Yin, G., Wang, T., Zhang, Y., Yu, Y., Wang, H.: Correlation-based software search by leveraging software term database. Front. Comput. Sci. 12(5), 923–938 (2018). https://doi.org/10.1007/s11704-017-6573-z

  • Linstead, E., Bajracharya, S., Ngo, T., Rigor, P., Lopes, C., Baldi, P.: Sourcerer: mining and searching internet-scale software repositories. Data Min. Knowl. Discov. 18(2), 300–336 (2009). https://doi.org/10.1007/s10618-008-0118-x

  • Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: a robustly optimized BERT pretraining approach (2019). CoRR arXiv:1907.11692 [cs.CL]

  • Liu, C., Xia, X., Lo, D., Gao, C., Yang, X., Grundy, J.: Opportunities and challenges in code search tools. ACM Comput. Surv. (2021). https://doi.org/10.1145/3480027

  • Lv, F., Zhang, H., Lou, J.-G., Wang, S., Zhang, D., Zhao, J.: CodeHow: effective code search based on API understanding and extended Boolean model. In: 2015 30th IEEE/ACM International Conference on Automated Software Engineering, pp. 260–270 (2015). https://doi.org/10.1109/ASE.2015.42

  • McMillan, C., Grechanik, M., Poshyvanyk, D., Xie, Q., Fu, C.: Portfolio: finding relevant functions and their usage. In: 2011 33rd International Conference on Software Engineering, pp. 111–120 (2011). https://doi.org/10.1145/1985793.1985809

  • Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units (2016). CoRR arXiv:1508.07909 [cs.CL]

  • Shi, E., Wang, Y., Gu, W., Du, L., Zhang, H., Han, S., Zhang, D., Sun, H.: CoCoSoDa: effective contrastive learning for code search. In: Proceedings of the 45th International Conference on Software Engineering, pp. 2198–2210. IEEE Press (2023). https://doi.org/10.1109/ICSE48619.2023.00185

  • Shi, Z., Xiong, Y., Zhang, Y., Jiang, Z., Zhao, J., Wang, L., Li, S.: Improving code search with multi-modal momentum contrastive learning. In: 2023 IEEE/ACM 31st International Conference on Program Comprehension (ICPC), pp. 280–291 (2023). https://doi.org/10.1109/ICPC58990.2023.00043

  • Shuai, J., Xu, L., Liu, C., Yan, M., Xia, X., Lei, Y.: Improving code search with co-attentive representation learning. In: Proceedings of the 28th International Conference on Program Comprehension, pp. 196–207. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3387904.3389269

  • Sun, W., Fang, C., Chen, Y., Tao, G., Han, T., Zhang, Q.: Code search based on context-aware code translation. In: Proceedings of the 44th International Conference on Software Engineering, pp. 388–400. Association for Computing Machinery, New York, NY, USA (2022). https://doi.org/10.1145/3510003.3510140

  • Wan, Y., Shu, J., Sui, Y., Xu, G., Zhao, Z., Wu, J., Yu, P.: Multi-modal attention network learning for semantic source code retrieval. In: 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 13–25. IEEE, San Diego, CA, USA (2019). https://doi.org/10.1109/ASE.2019.00012

  • Wang, Y., Wang, W., Joty, S., Hoi, S.C.H.: CodeT5: identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 8696–8708. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic (2021). https://doi.org/10.18653/v1/2021.emnlp-main.685

  • Wang, X., Wang, Y., Mi, F., Zhou, P., Wan, Y., Liu, X., Li, L., Wu, H., Liu, J., Jiang, X.: SynCoBERT: syntax-guided multi-modal contrastive pre-training for code representation (2021). CoRR arXiv:2108.04556 [cs.CL]

  • Wu, Z., Xiong, Y., Yu, S., Lin, D.: Unsupervised feature learning via non-parametric instance-level discrimination (2018). CoRR arXiv:1805.01978 [cs.CV]

  • Xu, L., Yang, H., Liu, C., Shuai, J., Yan, M., Lei, Y., Xu, Z.: Two-stage attention-based model for code search with textual and structural features. In: 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering, pp. 342–353 (2021). https://doi.org/10.1109/SANER50967.2021.00039

  • Yan, Y., Li, R., Wang, S., Zhang, F., Wu, W., Xu, W.: ConSERT: a contrastive framework for self-supervised sentence representation transfer. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, vol. 1, pp. 5065–5075. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.acl-long.393

  • Zhu, Q., Sun, Z., Liang, X., Xiong, Y., Zhang, L.: OCoR: an overlapping-aware code retriever. In: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, pp. 883–894. Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3324884.3416530

Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (No. 62250610224), and CCF-Zhipu Large Model Innovation Fund (No. CCF-Zhipu202408).

Author information

Contributions

GC: Conceptualization of this study, Methodology, Software, Investigation, Writing—original draft. WL: Methodology, Formal analysis, Investigation, Validation, Writing—review. XX: Conceptualization of this study, Supervision, Funding acquisition, Writing—review.

Corresponding author

Correspondence to Xiaoyuan Xie.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, G., Liu, W. & Xie, X. RFMC-CS: a representation fusion based multi-view momentum contrastive learning framework for code search. Autom Softw Eng 32, 16 (2025). https://doi.org/10.1007/s10515-025-00487-8
