DOI: 10.1145/3524610.3527915
research-article

Unified abstract syntax tree representation learning for cross-language program classification

Published: 20 October 2022

Abstract

Program classification can be regarded as a high-level abstraction of code. It lays a foundation for many source code comprehension tasks and has wide application in software engineering, including code clone detection, code smell classification, and defect classification. Cross-language program classification supports code transfer between programming languages and promotes cross-language code reuse, helping developers write code quickly and reducing the time spent on code transfer. Most existing studies focus on the semantic learning of code, whilst few are devoted to cross-language tasks. The main challenge of cross-language program classification is how to extract semantic features from different programming languages. To address this challenge, we propose a Unified Abstract Syntax Tree neural network (UAST). The core idea of UAST consists of two unified mechanisms. First, UAST learns an AST representation by unifying the AST traversal sequence and the graph-like AST structure to capture semantic code features. Second, we construct a mechanism called unified vocabulary, which reduces the feature gap between different programming languages and thereby enables cross-language program classification. In addition, we collect a dataset containing 20,000 files in five programming languages, which can serve as a benchmark for the cross-language program classification task. Experiments on two datasets show that our approach outperforms the state-of-the-art baselines on four evaluation metrics (Precision, Recall, F1-score, and Accuracy).
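
As a rough illustration of the two mechanisms named in the abstract, the sketch below derives both a traversal sequence and a graph-like edge list from one AST, and passes node types through a shared vocabulary. It is not the authors' implementation: it uses Python's standard `ast` module as a stand-in parser, and the `UNIFIED_VOCAB` table, its token names, the non-Python node names, and the helpers `unify` and `ast_views` are all hypothetical choices made for this example.

```python
import ast

# Hypothetical unified vocabulary (an assumption, not taken from the paper):
# language-specific AST node type names are mapped to shared tokens so that
# equivalent constructs in different languages look the same to a model.
UNIFIED_VOCAB = {
    "FunctionDef": "FUNC_DECL",        # Python node name
    "MethodDeclaration": "FUNC_DECL",  # illustrative name from another parser
    "For": "LOOP",
    "While": "LOOP",
    "ForStatement": "LOOP",
    "If": "BRANCH",
    "IfStatement": "BRANCH",
    "Return": "RETURN",
    "ReturnStatement": "RETURN",
}


def unify(node_type):
    """Map a node type name to its shared token, or keep it unchanged."""
    return UNIFIED_VOCAB.get(node_type, node_type)


def ast_views(source):
    """Return two views of the same Python AST.

    sequence: unified node types in pre-order (depth-first) traversal order
    edges:    (parent_index, child_index) pairs over that same ordering
    """
    tree = ast.parse(source)
    sequence = []
    edges = []

    def visit(node, parent_idx):
        idx = len(sequence)
        sequence.append(unify(type(node).__name__))
        if parent_idx is not None:
            edges.append((parent_idx, idx))
        for child in ast.iter_child_nodes(node):
            visit(child, idx)

    visit(tree, None)
    return sequence, edges


seq, edges = ast_views("def f(n):\n    while n:\n        n -= 1\n    return n\n")
print(seq[:2])     # first two unified tokens of the traversal sequence
print(edges[:3])   # first few parent-child edges of the graph view
```

In a full model, the sequence view would typically feed a sequence encoder and the edge list a graph encoder; how UAST combines the two is described in the paper itself, not in this sketch.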

Published In

ICPC '22: Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension
May 2022
698 pages
ISBN: 9781450392983
DOI: 10.1145/3524610
  • Conference Chairs: Ayushi Rastogi, Rosalia Tufano
  • General Chair: Gabriele Bavota
  • Program Chairs: Venera Arnaoudova, Sonia Haiduc
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. code representation learning
  2. cross-language program classification
  3. program classification
  4. program comprehension

Qualifiers

  • Research-article

Funding Sources

  • the National Natural Science Foundation of China
  • the National Key Research and Development Project
  • the National Key Research and Development Program of China
  • the Chongqing Science and Technology Plan Project
  • the Key Research and Development Program of Jiangsu Province
  • the Research Council of Norway
  • the Natural Science Foundation of Chongqing

Cited By

  • (2024) Criação de dashboards analíticos em Python para tomada de decisão. Caderno Pedagógico 21:8 (e6539). DOI: 10.54033/cadpedv21n8-084. Online publication date: 8-Aug-2024
  • (2024) A Novel Source Code Representation Approach Based on Multi-Head Attention. Electronics 13:11 (2111). DOI: 10.3390/electronics13112111. Online publication date: 29-May-2024
  • (2024) AlloyASG: Alloy Predicate Code Representation as a Compact Structurally Balanced Graph. Proceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems (57-68). DOI: 10.1145/3640310.3674088. Online publication date: 22-Sep-2024
  • (2024) StateGuard: Detecting State Derailment Defects in Decentralized Exchange Smart Contract. Companion Proceedings of the ACM Web Conference 2024 (810-813). DOI: 10.1145/3589335.3651562. Online publication date: 13-May-2024
  • (2024) FMCS: Improving Code Search by Multi-Modal Representation Fusion and Momentum Contrastive Learning. 2024 IEEE 24th International Conference on Software Quality, Reliability and Security (QRS) (632-638). DOI: 10.1109/QRS62785.2024.00068. Online publication date: 1-Jul-2024
  • (2024) Source Code Representation Approach Based on Multi-Head Attention. 2024 10th International Symposium on System Security, Safety, and Reliability (ISSSR) (1-9). DOI: 10.1109/ISSSR61934.2024.00066. Online publication date: 16-Mar-2024
  • (2023) IRC-CLVul: Cross-Programming-Language Vulnerability Detection with Intermediate Representations and Combined Features. Electronics 12:14 (3067). DOI: 10.3390/electronics12143067. Online publication date: 13-Jul-2023
  • (2023) TCCCD: Triplet-Based Cross-Language Code Clone Detection. Applied Sciences 13:21 (12084). DOI: 10.3390/app132112084. Online publication date: 6-Nov-2023
  • (2023) Hierarchical Abstract Syntax Tree Representation Learning Based on Graph Coarsening for Program Classification. 2023 8th International Conference on Data Science in Cyberspace (DSC) (181-188). DOI: 10.1109/DSC59305.2023.00035. Online publication date: 18-Aug-2023
  • CCCS: Contrastive Cross-Language Code Search Using Code Graph Information. ACM Transactions on Asian and Low-Resource Language Information Processing. DOI: 10.1145/3628429
