An empirical study on the effectiveness of large language models for SATD identification and classification

Published in: Empirical Software Engineering

Abstract

Self-Admitted Technical Debt (SATD), a concept describing sub-optimal choices in software development that are documented in code comments or other project resources, poses challenges to the maintainability and evolution of software systems. Large language models (LLMs) have demonstrated significant effectiveness across a broad range of software tasks, especially software text generation. Nonetheless, their effectiveness in SATD-related tasks remains under-researched. In this paper, we investigate the efficacy of LLMs in both the identification and the classification of SATD. For both tasks, we investigate the performance gain from using more recent LLMs, specifically the Flan-T5 family, across different common usage settings. Our results show that for SATD identification, all fine-tuned LLMs outperform the best existing non-LLM baseline, i.e., the CNN model, with a 4.4% to 7.2% improvement in F1 score. For SATD classification, while our largest fine-tuned model, Flan-T5-XL, still leads in performance, the CNN model exhibits competitive results, even surpassing four of the six LLMs. We also found that the largest Flan-T5 model, i.e., Flan-T5-XXL, when used with a zero-shot in-context learning (ICL) approach that only provides instructions for SATD identification, yields results competitive with traditional approaches but performs 6.4% to 9.2% worse than fine-tuned LLMs. For SATD classification, the few-shot ICL approach, which incorporates examples and category descriptions in the prompt, outperforms the zero-shot approach and even surpasses the fine-tuned smaller Flan-T5 models. Moreover, our experiments show that incorporating contextual information, such as surrounding code, into the SATD classification task enables larger fine-tuned LLMs to improve their performance.
Our study highlights the capabilities and limitations of LLMs for SATD tasks and the role of contextual information in achieving higher performance with larger LLMs, setting a foundation for future efforts to enhance these models for more effective technical debt management.
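The few-shot ICL setup summarized in the abstract combines category descriptions and demonstration examples in a single prompt. A minimal prompt-builder sketch is shown below; the instruction wording, category names, and example comments are hypothetical illustrations, not the paper's exact prompts.

```python
def build_fewshot_prompt(comment, categories, examples):
    """Assemble a few-shot ICL prompt for SATD classification.

    comment:    the code comment to classify
    categories: dict mapping category name -> short description
    examples:   list of (comment_text, category) demonstration pairs
    """
    lines = ["Classify the software comment into one SATD category.", "", "Categories:"]
    for name, desc in categories.items():
        lines.append(f"- {name}: {desc}")
    lines.append("")
    for text, label in examples:
        lines.append(f"Comment: {text}")
        lines.append(f"Category: {label}")
        lines.append("")
    # End with an unanswered slot for the model to complete.
    lines.append(f"Comment: {comment}")
    lines.append("Category:")
    return "\n".join(lines)
```

Zero-shot prompting corresponds to calling the same builder with an empty `examples` list, leaving only the instruction and category descriptions.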



Data Availability Statements

The results, source code, and data related to this study are available at https://github.com/RISElabQueens/SATD_LLM.

Notes

  1. To create the prompt for the ICL approach, we followed the same method as presented in the RQ2 approach. We selected the best-performing configuration, obtained when we included the category descriptions and the five most relevant examples retrieved using the SentenceTransformer.
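The example-selection step in this note (retrieving the five most relevant demonstrations) amounts to a nearest-neighbour search over comment embeddings. The helper below is a minimal sketch assuming the embeddings have already been computed, e.g., with a SentenceTransformer model; the function names are illustrative, not the authors' code.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_examples(query_emb, example_embs, k=5):
    """Return indices of the k most similar examples, highest similarity first."""
    ranked = sorted(range(len(example_embs)),
                    key=lambda i: cosine(query_emb, example_embs[i]),
                    reverse=True)
    return ranked[:k]
```

The selected examples would then be formatted into the prompt together with the category descriptions, as described in the RQ2 approach.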


Acknowledgements

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), [funding reference number: RGPIN-2019-05071].

Author information


Corresponding author

Correspondence to Mohammad Sadegh Sheikhaei.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by: Romain Robbes.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Sheikhaei, M.S., Tian, Y., Wang, S. et al. An empirical study on the effectiveness of large language models for SATD identification and classification. Empir Software Eng 29, 159 (2024). https://doi.org/10.1007/s10664-024-10548-3
