DOI: 10.1145/3540250.3549094
Research article (Public Access)

Diet code is healthy: simplifying programs for pre-trained models of code

Published: 09 November 2022

Abstract

Pre-trained code representation models such as CodeBERT have demonstrated superior performance on a variety of software engineering tasks, yet they are computationally costly: their complexity grows quadratically with the length of the input sequence. Our empirical analysis of CodeBERT's attention reveals that it attends more heavily to certain types of tokens and statements, such as keywords and data-relevant statements. Based on these findings, we propose DietCode, which aims to leverage large pre-trained models of source code in a lightweight manner. DietCode simplifies the input program of CodeBERT with three strategies, namely word dropout, frequency filtering, and an attention-based strategy that selects the statements and tokens receiving the most attention weights during pre-training. This yields a substantial reduction in computational cost without hampering model performance. Experimental results on two downstream tasks show that DietCode achieves results comparable to CodeBERT's with 40% less computational cost in fine-tuning and testing.
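As an illustration of the attention-based strategy described in the abstract, the short Python sketch below prunes a tokenized code snippet by keeping only the tokens that received the highest attention weights, subject to a fixed length budget. It is a minimal sketch under stated assumptions, not the authors' DietCode implementation: the function name prune_by_attention, the example token list, and the per-token weights are hypothetical, and in practice the weights would be aggregated from a pre-trained encoder's self-attention maps.

    # Hedged sketch: attention-guided token pruning for a code snippet.
    # Not the DietCode implementation; names and weights are illustrative.
    from typing import List


    def prune_by_attention(tokens: List[str],
                           weights: List[float],
                           budget: int) -> List[str]:
        """Keep at most `budget` tokens with the largest attention weights,
        preserving their original order in the sequence."""
        if len(tokens) <= budget:
            return list(tokens)
        # Rank positions by attention weight (descending), take the top `budget`,
        # then restore source order so the pruned snippet still reads left to right.
        ranked = sorted(range(len(tokens)), key=lambda i: weights[i], reverse=True)
        keep = sorted(ranked[:budget])
        return [tokens[i] for i in keep]


    if __name__ == "__main__":
        toks = ["public", "int", "add", "(", "int", "a", ",", "int", "b", ")",
                "{", "return", "a", "+", "b", ";", "}"]
        # Hypothetical per-token attention scores; a real pipeline would derive
        # them from the encoder's attention matrices.
        w = [0.9, 0.8, 0.6, 0.2, 0.8, 0.5, 0.1, 0.8, 0.5, 0.2,
             0.3, 0.9, 0.6, 0.4, 0.6, 0.2, 0.3]
        print(prune_by_attention(toks, w, budget=10))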




          Published In

          ESEC/FSE 2022: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
          November 2022, 1822 pages
          ISBN: 9781450394130
          DOI: 10.1145/3540250
          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery, New York, NY, United States

          Publication History

          Published: 09 November 2022

          Author Tags

          1. Code intelligence
          2. Learning program representations
          3. Pre-trained models
          4. Program simplification

          Qualifiers

          • Research-article

          Conference

          ESEC/FSE '22

          Acceptance Rates

          Overall acceptance rate: 112 of 543 submissions, 21%

          Article Metrics

          • Downloads (last 12 months): 383
          • Downloads (last 6 weeks): 25
          Reflects downloads up to 03 Oct 2024

          Cited By

          • (2024) RepoMinCoder: Improving Repository-Level Code Generation Based on Information Loss Screening. Proceedings of the 15th Asia-Pacific Symposium on Internetware, 229-238. DOI: 10.1145/3671016.3674819. Online publication date: 24-Jul-2024.
          • (2024) AI Coders Are among Us: Rethinking Programming Language Grammar towards Efficient Code Generation. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 1124-1136. DOI: 10.1145/3650212.3680347. Online publication date: 11-Sep-2024.
          • (2024) CoEdPilot: Recommending Code Edits with Learned Prior Edit Relevance, Project-wise Awareness, and Interactive Nature. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 466-478. DOI: 10.1145/3650212.3652142. Online publication date: 11-Sep-2024.
          • (2024) Natural Is the Best: Model-Agnostic Code Simplification for Pre-trained Large Language Models. Proceedings of the ACM on Software Engineering, 1:FSE, 586-608. DOI: 10.1145/3643753. Online publication date: 12-Jul-2024.
          • (2024) Robustness-Enhanced Assertion Generation Method Based on Code Mutation and Attack Defense. Collaborative Computing: Networking, Applications and Worksharing, 281-300. DOI: 10.1007/978-3-031-54528-3_16. Online publication date: 23-Feb-2024.
          • (2023) Structural-semantics Guided Program Simplification for Understanding Neural Code Intelligence Models. Proceedings of the 14th Asia-Pacific Symposium on Internetware, 1-11. DOI: 10.1145/3609437.3609438. Online publication date: 4-Aug-2023.
          • (2023) Incorporating Signal Awareness in Source Code Modeling: An Application to Vulnerability Detection. ACM Transactions on Software Engineering and Methodology, 32:6, 1-40. DOI: 10.1145/3597202. Online publication date: 29-Sep-2023.
          • (2023) Probing Numeracy and Logic of Language Models of Code. 2023 IEEE/ACM International Workshop on Interpretability and Robustness in Neural Software Engineering (InteNSE), 8-13. DOI: 10.1109/InteNSE59150.2023.00006. Online publication date: May-2023.
          • (2023) Study of Distractors in Neural Models of Code. 2023 IEEE/ACM International Workshop on Interpretability and Robustness in Neural Software Engineering (InteNSE), 1-7. DOI: 10.1109/InteNSE59150.2023.00005. Online publication date: May-2023.
          • (2023) CCTest: Testing and Repairing Code Completion Systems. Proceedings of the 45th International Conference on Software Engineering, 1238-1250. DOI: 10.1109/ICSE48619.2023.00110. Online publication date: 14-May-2023.
