DOI: 10.1145/3540250.3549094
Research article (Public Access)

Diet code is healthy: simplifying programs for pre-trained models of code

Published: 09 November 2022

Abstract

Pre-trained code representation models such as CodeBERT have demonstrated superior performance on a variety of software engineering tasks, yet they are computationally costly: their complexity grows quadratically with the length of the input sequence. Our empirical analysis of CodeBERT's attention reveals that it attends more heavily to certain types of tokens and statements, such as keywords and data-relevant statements. Based on these findings, we propose DietCode, which aims to leverage large pre-trained models of source code in a lightweight manner. DietCode simplifies the input program of CodeBERT with three strategies, namely word dropout, frequency filtering, and an attention-based strategy that selects the statements and tokens receiving the most attention weights during pre-training. This yields a substantial reduction in computational cost without hampering model performance. Experimental results on two downstream tasks show that DietCode achieves results comparable to CodeBERT's with 40% less computational cost in fine-tuning and testing.
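As an illustration of the attention-based strategy described in the abstract, the short Python sketch below prunes a tokenized code snippet by keeping only the tokens that received the highest attention weights, subject to a fixed length budget. It is a minimal sketch under stated assumptions, not the authors' DietCode implementation: the function name prune_by_attention, the example token list, and the per-token weights are hypothetical, and in practice the weights would be aggregated from a pre-trained encoder's self-attention maps.

    # Hedged sketch: attention-guided token pruning for a code snippet.
    # Not the DietCode implementation; names and weights are illustrative.
    from typing import List


    def prune_by_attention(tokens: List[str],
                           weights: List[float],
                           budget: int) -> List[str]:
        """Keep at most `budget` tokens with the largest attention weights,
        preserving their original order in the sequence."""
        if len(tokens) <= budget:
            return list(tokens)
        # Rank positions by attention weight (descending), take the top `budget`,
        # then restore source order so the pruned snippet still reads left to right.
        ranked = sorted(range(len(tokens)), key=lambda i: weights[i], reverse=True)
        keep = sorted(ranked[:budget])
        return [tokens[i] for i in keep]


    if __name__ == "__main__":
        toks = ["public", "int", "add", "(", "int", "a", ",", "int", "b", ")",
                "{", "return", "a", "+", "b", ";", "}"]
        # Hypothetical per-token attention scores; a real pipeline would derive
        # them from the encoder's attention matrices.
        w = [0.9, 0.8, 0.6, 0.2, 0.8, 0.5, 0.1, 0.8, 0.5, 0.2,
             0.3, 0.9, 0.6, 0.4, 0.6, 0.2, 0.3]
        print(prune_by_attention(toks, w, budget=10))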




          Published In

          ESEC/FSE 2022: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
          November 2022, 1822 pages
          ISBN: 9781450394130
          DOI: 10.1145/3540250
          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery, New York, NY, United States

          Publication History

          Published: 09 November 2022

          Author Tags

          1. Code intelligence
          2. Learning program representations
          3. Pre-trained models
          4. Program simplification

          Qualifiers

          • Research-article

          Conference

          ESEC/FSE '22

          Acceptance Rates

          Overall acceptance rate: 112 of 543 submissions, 21%

          Article Metrics

          • Downloads (last 12 months): 383
          • Downloads (last 6 weeks): 25
          Reflects downloads up to 03 Oct 2024

          Cited By

          • (2024) RepoMinCoder: Improving Repository-Level Code Generation Based on Information Loss Screening. Proceedings of the 15th Asia-Pacific Symposium on Internetware, 229-238. DOI: 10.1145/3671016.3674819. Online publication date: 24-Jul-2024.
          • (2024) AI Coders Are among Us: Rethinking Programming Language Grammar towards Efficient Code Generation. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 1124-1136. DOI: 10.1145/3650212.3680347. Online publication date: 11-Sep-2024.
          • (2024) CoEdPilot: Recommending Code Edits with Learned Prior Edit Relevance, Project-wise Awareness, and Interactive Nature. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 466-478. DOI: 10.1145/3650212.3652142. Online publication date: 11-Sep-2024.
          • (2024) Natural Is the Best: Model-Agnostic Code Simplification for Pre-trained Large Language Models. Proceedings of the ACM on Software Engineering, 1:FSE, 586-608. DOI: 10.1145/3643753. Online publication date: 12-Jul-2024.
          • (2024) Robustness-Enhanced Assertion Generation Method Based on Code Mutation and Attack Defense. Collaborative Computing: Networking, Applications and Worksharing, 281-300. DOI: 10.1007/978-3-031-54528-3_16. Online publication date: 23-Feb-2024.
          • (2023) Structural-semantics Guided Program Simplification for Understanding Neural Code Intelligence Models. Proceedings of the 14th Asia-Pacific Symposium on Internetware, 1-11. DOI: 10.1145/3609437.3609438. Online publication date: 4-Aug-2023.
          • (2023) Incorporating Signal Awareness in Source Code Modeling: An Application to Vulnerability Detection. ACM Transactions on Software Engineering and Methodology, 32:6, 1-40. DOI: 10.1145/3597202. Online publication date: 29-Sep-2023.
          • (2023) Probing Numeracy and Logic of Language Models of Code. 2023 IEEE/ACM International Workshop on Interpretability and Robustness in Neural Software Engineering (InteNSE), 8-13. DOI: 10.1109/InteNSE59150.2023.00006. Online publication date: May-2023.
          • (2023) Study of Distractors in Neural Models of Code. 2023 IEEE/ACM International Workshop on Interpretability and Robustness in Neural Software Engineering (InteNSE), 1-7. DOI: 10.1109/InteNSE59150.2023.00005. Online publication date: May-2023.
          • (2023) CCTest: Testing and Repairing Code Completion Systems. Proceedings of the 45th International Conference on Software Engineering, 1238-1250. DOI: 10.1109/ICSE48619.2023.00110. Online publication date: 14-May-2023.
