An empirical study on the effectiveness of large language models for SATD identification and classification

Published in: Empirical Software Engineering

Abstract

Self-Admitted Technical Debt (SATD), a concept describing sub-optimal choices in software development that are documented in code comments or other project resources, poses challenges to the maintainability and evolution of software systems. Large language models (LLMs) have demonstrated significant effectiveness across a broad range of software tasks, especially software text generation. Nonetheless, their effectiveness in SATD-related tasks remains under-researched. In this paper, we investigate the efficacy of LLMs in both the identification and the classification of SATD. For both tasks, we investigate the performance gain from using more recent LLMs, specifically the Flan-T5 family, across different common usage settings. Our results show that for SATD identification, all fine-tuned LLMs outperform the best existing non-LLM baseline, i.e., the CNN model, with a 4.4% to 7.2% improvement in F1 score. For SATD classification, while our largest fine-tuned model, Flan-T5-XL, still leads in performance, the CNN model exhibits competitive results, even surpassing four of the six LLMs. We also found that the largest Flan-T5 model, i.e., Flan-T5-XXL, when used with a zero-shot in-context learning (ICL) approach that only provides instructions for SATD identification, yields results competitive with traditional approaches but performs 6.4% to 9.2% worse than fine-tuned LLMs. For SATD classification, the few-shot ICL approach, which incorporates examples and category descriptions in the prompt, outperforms the zero-shot approach and even surpasses the fine-tuned smaller Flan-T5 models. Moreover, our experiments show that incorporating contextual information, such as surrounding code, into the SATD classification task enables larger fine-tuned LLMs to improve their performance.
Our study highlights the capabilities and limitations of LLMs for SATD tasks and the role of contextual information in achieving higher performance with larger LLMs, setting a foundation for future efforts to enhance these models for more effective technical debt management.
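The few-shot ICL setup summarized in the abstract combines category descriptions and demonstration examples in a single prompt. A minimal prompt-builder sketch is shown below; the instruction wording, category names, and example comments are hypothetical illustrations, not the paper's exact prompts.

```python
def build_fewshot_prompt(comment, categories, examples):
    """Assemble a few-shot ICL prompt for SATD classification.

    comment:    the code comment to classify
    categories: dict mapping category name -> short description
    examples:   list of (comment_text, category) demonstration pairs
    """
    lines = ["Classify the software comment into one SATD category.", "", "Categories:"]
    for name, desc in categories.items():
        lines.append(f"- {name}: {desc}")
    lines.append("")
    for text, label in examples:
        lines.append(f"Comment: {text}")
        lines.append(f"Category: {label}")
        lines.append("")
    # End with an unanswered slot for the model to complete.
    lines.append(f"Comment: {comment}")
    lines.append("Category:")
    return "\n".join(lines)
```

Zero-shot prompting corresponds to calling the same builder with an empty `examples` list, leaving only the instruction and category descriptions.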



Data Availability Statements

The results, source code, and data related to this study are available at https://github.com/RISElabQueens/SATD_LLM.

Notes

  1. To create the prompt for the ICL approach, we followed the same method as presented in the RQ2 approach. We selected the best-performing configuration, obtained when we included the category descriptions and the five most relevant examples retrieved using the SentenceTransformer.
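The example-selection step in this note (retrieving the five most relevant demonstrations) amounts to a nearest-neighbour search over comment embeddings. The helper below is a minimal sketch assuming the embeddings have already been computed, e.g., with a SentenceTransformer model; the function names are illustrative, not the authors' code.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_examples(query_emb, example_embs, k=5):
    """Return indices of the k most similar examples, highest similarity first."""
    ranked = sorted(range(len(example_embs)),
                    key=lambda i: cosine(query_emb, example_embs[i]),
                    reverse=True)
    return ranked[:k]
```

The selected examples would then be formatted into the prompt together with the category descriptions, as described in the RQ2 approach.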


Acknowledgements

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), [funding reference number: RGPIN-2019-05071].

Author information


Corresponding author

Correspondence to Mohammad Sadegh Sheikhaei.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by: Romain Robbes.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Sheikhaei, M.S., Tian, Y., Wang, S. et al. An empirical study on the effectiveness of large language models for SATD identification and classification. Empir Software Eng 29, 159 (2024). https://doi.org/10.1007/s10664-024-10548-3
