Abstract
As software projects evolve rapidly, software artifacts become more complex and the defects hidden within them become harder to identify. Emerging Transformer-based approaches achieve remarkable performance but struggle with long code sequences because their self-attention mechanism scales quadratically with sequence length. This paper introduces SparseCoder, an approach that combines sparse attention with the learned token pruning (LTP) method, adapted from natural language processing, to address this limitation. Our experiments demonstrate that, compared to previous state-of-the-art models (CodeBERT, RoBERTa, and CodeT5), SparseCoder can handle significantly longer input sequences: at least twice as long, within the limits of our hardware resources and data statistics. Additionally, SparseCoder runs four times faster than the other methods and achieves a 50% reduction in floating-point operations (FLOPs) with a negligible performance drop of less than 1% compared to Transformers using sparse attention alone (Sparse Atten). Plotting the FLOPs of model inference against token length reveals that SparseCoder scales linearly, whereas the other methods, including the current state-of-the-art model CodeT5, scale quadratically. Moreover, SparseCoder improves interpretability by visualizing the non-trivial tokens retained at each layer.
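To make the core idea concrete, the sketch below illustrates threshold-based token pruning in the spirit of learned token pruning (Kim et al. 2021): tokens that receive little attention are dropped before later layers process them, which is what drives the linear FLOPs scaling noted above. This is a minimal PyTorch sketch under simplifying assumptions, not the SparseCoder implementation; in particular, the fixed `threshold` argument stands in for the per-layer thresholds that LTP learns during training, and the function names are illustrative.

```python
# Minimal sketch of threshold-based token pruning (illustrative only).
# Assumptions: batch size 1, a fixed pruning threshold instead of learned
# per-layer thresholds, and importance scored by received attention.
import torch


def token_importance(attn_probs: torch.Tensor) -> torch.Tensor:
    """Score each token by the attention it receives, averaged over
    heads and query positions. attn_probs: (batch, heads, seq, seq)."""
    return attn_probs.mean(dim=1).mean(dim=1)  # -> (batch, seq)


def prune_tokens(hidden: torch.Tensor,
                 attn_probs: torch.Tensor,
                 threshold: float = 0.01) -> tuple[torch.Tensor, torch.Tensor]:
    """Drop tokens whose importance falls below `threshold`.
    hidden: (batch, seq, dim). Returns pruned hidden states and the kept mask."""
    scores = token_importance(attn_probs)   # (batch, seq)
    keep = scores >= threshold              # boolean mask of tokens to retain
    keep[:, 0] = True                       # always keep the leading [CLS] token
    # With batch size 1 we can index directly; larger batches would need padding.
    pruned = hidden[keep].view(hidden.size(0), -1, hidden.size(-1))
    return pruned, keep


if __name__ == "__main__":
    batch, heads, seq, dim = 1, 12, 512, 768
    hidden = torch.randn(batch, seq, dim)
    attn = torch.softmax(torch.randn(batch, heads, seq, seq), dim=-1)
    pruned, kept = prune_tokens(hidden, attn, threshold=1.0 / seq)
    print(f"kept {kept.sum().item()} of {seq} tokens")
```

Because later layers only attend over the retained tokens, the per-layer attention cost shrinks with the pruned sequence length, which is how token pruning reduces total FLOPs for long inputs.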
Data Availability Statements
To facilitate further work by other researchers in this area, all of our scripts and datasets are available online at https://github.com/invisiblehead/Sparse_Attention_on_Transformer-based_model.
References
Ahmed T, Devanbu P (2022) Few-shot training llms for project-specific code-summarization. In: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pp. 1–5
Ainslie J, Ontanon S, Alberti C, Cvicek V, Fisher Z, Pham P, Ravula A, Sanghai S, Wang Q, Yang L (2020) Etc: Encoding long and structured inputs in transformers. arXiv:2004.08483
Ball T (1999) The concept of dynamic analysis. ACM SIGSOFT Software Engineering Notes 24(6):216–234
Beltagy I, Peters ME, Cohan A (2020) Longformer: The long-document transformer. arXiv:2004.05150
Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A et al (2020) Language models are few-shot learners. Adv Neural Inf Process Syst 33:1877–1901
Chen Y, Qian S, Tang H, Lai X, Liu Z, Han S, Jia J (2023) Longlora: Efficient fine-tuning of long-context large language models. arXiv:2309.12307
Chen Z, Kommrusch S, Tufano M, Pouchet LN, Poshyvanyk D, Monperrus M (2019) Sequencer: Sequence-to-sequence learning for end-to-end program repair. IEEE Trans Software Eng 47(9):1943–1959
Chen Z, Monperrus M (2019) A literature study of embeddings on source code. arXiv:1904.03061
Chirkova N, Troshin S (2021) Empirical study of transformers for source code. In: Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 703–715
Cho K, Van Merriënboer B, Bahdanau D, Bengio Y (2014) On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259
Chung J, Gulcehre C, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555
Ciniselli M, Cooper N, Pascarella L, Poshyvanyk D, Di Penta M, Bavota G (2021) An empirical study on the usage of bert models for code completion. In: 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), pp. 108–119. IEEE
Clark K, Luong MT, Le QV, Manning CD (2020) Electra: Pre-training text encoders as discriminators rather than generators. arXiv:2003.10555
Devlin J, Chang MW, Lee K, Toutanova K (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805
Do CX, Luu NT, Nguyen PTL (2024) Optimizing software vulnerability detection using roberta and machine learning. Autom Softw Eng 31(2):40
Dong L, Xu S, Xu B (2018) Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884–5888. IEEE
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S et al (2020) An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929
Fan A, Gokkaya B, Harman M, Lyubarskiy M, Sengupta S, Yoo S, Zhang JM (2023) Large language models for software engineering: Survey and open problems. arXiv:2310.03533
Feng Z, Guo D, Tang D, Duan N, Feng X, Gong M, Shou L, Qin B, Liu T, Jiang D et al (2020) Codebert: A pre-trained model for programming and natural languages. arXiv:2002.08155
Gao S, Zhang H, Gao C, Wang C (2023) Keeping pace with ever-increasing data: Towards continual learning of code intelligence models. In: 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp. 30–42. IEEE
Ghofrani J, Mohseni M, Bozorgmehr A (2017) A conceptual framework for clone detection using machine learning. In: 2017 IEEE 4th International Conference on Knowledge-Based Engineering and Innovation (KBEI), pp. 0810–0817. IEEE
Gholami A, Kim S, Dong Z, Yao Z, Mahoney MW, Keutzer K (2021) A survey of quantization methods for efficient neural network inference. arXiv:2103.13630
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT press
Goyal S, Choudhury AR, Raje S, Chakaravarthy V, Sabharwal Y, Verma A (2020) Power-bert: Accelerating bert inference via progressive word-vector elimination. In: International Conference on Machine Learning, pp. 3690–3699. PMLR
Gupta A, Berant J (2020) Gmat: Global memory augmentation for transformers. arXiv:2006.03274
Heckman S, Williams L (2011) A systematic literature review of actionable alert identification techniques for automated static code analysis. Inf Softw Technol 53(4):363–387
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Hou X, Zhao Y, Liu Y, Yang Z, Wang K, Li L, Luo X, Lo D, Grundy J, Wang H (2023) Large language models for software engineering: A systematic literature review. arXiv:2308.10620
Hu X, Li G, Xia X, Lo D, Jin Z (2018) Deep code comment generation. In: 2018 IEEE/ACM 26th International Conference on Program Comprehension (ICPC), pp. 200–210. IEEE
Jiang L, Misherghi G, Su Z, Glondu S (2007) Deckard: Scalable and accurate tree-based detection of code clones. In: 29th International Conference on Software Engineering (ICSE’07), pp. 96–105. IEEE
Jiao X, Yin Y, Shang L, Jiang X, Chen X, Li L, Wang F, Liu Q (2019) Tinybert: Distilling bert for natural language understanding. arXiv:1909.10351
Kim G, Cho K (2020) Length-adaptive transformer: Train once with length drop, use anytime with search. arXiv:2010.07003
Kim S, Gholami A, Yao Z, Mahoney MW, Keutzer K (2021) I-bert: Integer-only bert quantization. In: International conference on machine learning, pp. 5506–5518. PMLR
Kim S, Shen S, Thorsley D, Gholami A, Kwon W, Hassoun J, Keutzer K (2021) Learned token pruning for transformers. arXiv:2107.00910
Kim S, Zhao J, Tian Y, Chandra S (2021) Code prediction by feeding trees to transformers. In: 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), pp. 150–162. IEEE
LeCun Y, Denker J, Solla S (1989) Optimal brain damage. Adv Neural Inf Process Syst 2
Li L, Feng H, Zhuang W, Meng N, Ryder B (2017) Cclearner: A deep learning-based clone detection approach. In: 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 249–260. IEEE
Li Z, Lu S, Guo D, Duan N, Jannu S, Jenks G, Majumder D, Green J, Svyatkovskiy A, Fu S et al (2022) Codereviewer: Pre-training for automating code review activities. arXiv e-prints pp. arXiv–2203
Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) Roberta: A robustly optimized bert pretraining approach. arXiv:1907.11692
Marcus A, Maletic JI (2001) Identification of high-level concept clones in source code. In: Proceedings 16th Annual International Conference on Automated Software Engineering (ASE 2001), pp. 107–114. IEEE
Munkhdalai T, Faruqui M, Gopal S (2024) Leave no context behind: Efficient infinite context transformers with infini-attention. arXiv:2404.07143
Ozkaya I (2023) Application of large language models to software engineering tasks: Opportunities, risks, and implications. IEEE Softw 40(3):4–8
Rosenthal R, Cooper H, Hedges L et al (1994) Parametric measures of effect size. The handbook of research synthesis 621(2):231–244
Russell R, Kim L, Hamilton L, Lazovich T, Harer J, Ozdemir O, Ellingwood P, McConley M (2018) Automated vulnerability detection in source code using deep representation learning. In: 2018 17th IEEE international conference on machine learning and applications (ICMLA), pp. 757–762. IEEE
Sanh V, Debut L, Chaumond J, Wolf T (2019) Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv:1910.01108
Sawilowsky SS (2009) New effect size rules of thumb. J Mod Appl Stat Methods 8(2):26
Treviso M, Ji T, Lee JU, van Aken B, Cao Q, Ciosici MR, Hassid M, Heafield K, Hooker S, Martins PH et al (2022) Efficient methods for natural language processing: A survey. arXiv:2209.00099
Tufano M, Watson C, Bavota G, Di Penta M, White M, Poshyvanyk D (2018) Deep learning similarities from different representations of source code. In: 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR), pp. 542–553. IEEE
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 30
Voita E, Talbot D, Moiseev F, Sennrich R, Titov I (2019) Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv:1905.09418
Wan Y, Zhao Z, Yang M, Xu G, Ying H, Wu J, Yu PS (2018) Improving automatic source code summarization via deep reinforcement learning. In: Proceedings of the 33rd ACM/IEEE international conference on automated software engineering, pp. 397–407
Wang H, Zhang Z, Han S (2021) Spatten: Efficient sparse attention architecture with cascade token and head pruning. In: 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 97–110. IEEE
Wang J, Huang Y, Chen C, Liu Z, Wang S, Wang Q (2024) Software testing with large language models: Survey, landscape, and vision. IEEE Trans Software Eng
Wang J, Wang S, Wang Q (2018) Is there a "golden" feature set for static warning identification? An experimental evaluation. In: Proceedings of the 12th ACM/IEEE international symposium on empirical software engineering and measurement, pp. 1–10
Wang W, Wang Y, Joty S, Hoi SC (2023) Rap-gen: Retrieval-augmented patch generation with codet5 for automatic program repair. In: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 146–158
Wang Y, Wang W, Joty S, Hoi SC (2021) Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv:2109.00859
White M, Tufano M, Vendome C, Poshyvanyk D (2016) Deep learning code fragments for code clone detection. In: 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 87–98. IEEE
Wu H, Zhao H, Zhang M (2020) Code summarization with structure-induced transformer. arXiv:2012.14710
Yang X, Chen J, Yedida R, Yu Z, Menzies T (2021) Learning to recognize actionable static code warnings (is intrinsically easy). Empir Softw Eng 26(3):1–24
Yang X, Yu Z, Wang J, Menzies T (2021) Understanding static code warnings: An incremental ai approach. Expert Syst Appl 167:114134
Ye D, Lin Y, Huang Y, Sun M (2021) Tr-bert: Dynamic token reduction for accelerating bert inference. arXiv:2105.11618
Zaheer M, Guruganesh G, Dubey KA, Ainslie J, Alberti C, Ontanon S, Pham P, Ravula A, Wang Q, Yang L et al (2020) Big bird: Transformers for longer sequences. Adv Neural Inf Process Syst 33:17283–17297
Zhang J, Panthaplackel S, Nie P, Li JJ, Gligoric M (2022) Coditt5: Pretraining for source code and natural language editing. In: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pp. 1–12
Zhang J, Wang X, Zhang H, Sun H, Liu X (2020) Retrieval-based neural source code summarization. In: 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE), pp. 1385–1397. IEEE
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Additional information
Communicated by: Lei Ma.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yang, X., Jakubowski, M., Kang, L. et al. SparseCoder: Advancing source code analysis with sparse attention and learned token pruning. Empir Software Eng 30, 38 (2025). https://doi.org/10.1007/s10664-024-10558-1