research-article

Duplicate Bug Report detection using Named Entity Recognition

Authors:

Jingyuan Cheng

Volume 284, Issue C

https://doi.org/10.1016/j.knosys.2023.111258

Published: 25 January 2024

Abstract

Managing software bugs is a significant challenge. A bug tracking system (BTS) serves as the standard platform to record, oversee, and manage bugs throughout software development and maintenance. Although a BTS aggregates many bug reports for tracking, the same bug is often reported by different people. The resulting duplicate bug reports (DBRs) strain manual inspection, risk repeated bug assignment, and reduce the efficiency of bug resolution. Notably, many contemporary DBR detection techniques overlook the structured data that is abundant in the descriptive information about bug behavior in reports. To address this oversight, this study introduces a new method named CorNER, which improves DBR detection by converting unstructured textual content into structured data via named entity recognition (NER). Specifically, CorNER employs Random Forest with context (RNER) to annotate entities in the title and description sections of bug reports and then uses Text Convolutional Neural Networks (TextCNN) for feature extraction. Across five datasets, CorNER improves the F1-score by an average of 6.24% and 4.96%, respectively, over two prevalent DBR detection methods.
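As a rough illustration of what annotating entities "with context" could involve, the sketch below builds one feature record per token from the token itself and its neighbours in a fixed window. The abstract does not specify the features, window size, or entity labels that CorNER uses, so the function names (`context_features`, `featurize`), the chosen features, and the sample title are all assumptions made for illustration only.

```python
from typing import Dict, List

PAD = "<pad>"  # placeholder for positions outside the sentence (assumption)


def context_features(tokens: List[str], index: int, window: int = 2) -> Dict[str, object]:
    """Describe tokens[index] by itself plus its neighbours within `window`."""
    token = tokens[index]
    features: Dict[str, object] = {
        "word": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
    }
    # Add the surrounding words so a classifier can see local context.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = index + offset
        neighbour = tokens[j].lower() if 0 <= j < len(tokens) else PAD
        features[f"word[{offset:+d}]"] = neighbour
    return features


def featurize(tokens: List[str], window: int = 2) -> List[Dict[str, object]]:
    """Return one feature dictionary per token."""
    return [context_features(tokens, i, window) for i in range(len(tokens))]


# Hypothetical bug report title, not taken from the paper's datasets.
title = "Firefox crashes when opening PDF attachment".split()
rows = featurize(title, window=1)
```

Records like these could then be vectorized and passed to a token-level classifier such as a random forest, with the predicted entity tags forming the structured view of the report; how CorNER actually does this is described in the full paper, not in the abstract.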


Published In


Knowledge-Based Systems Volume 284, Issue C

Jan 2024

1456 pages


Copyright © 2023.

Publisher

Elsevier Science Publishers B. V.

Netherlands
