DOI: 10.1145/3382494.3422161

Automatic Identification of Code Smell Discussions on Stack Overflow: A Preliminary Investigation

Published: 23 October 2020

Abstract

Background: Code smells indicate potential design or implementation problems that may have a negative impact on programs. As with other software artefacts, developers use Stack Overflow (SO) to ask questions about code smells. However, given the high number of questions asked on the platform and the limitations of the default tagging system, extracting knowledge about code smells manually takes significant effort. Aim: We utilized supervised machine learning techniques to automatically identify code-smell discussions in SO posts. Method: We conducted an experiment using a manually labeled dataset of 3000 code-smell and 3000 non-code-smell posts to evaluate the performance of different classifiers in automatically identifying code smell discussions. Results: Our results show that Logistic Regression (LR) with parameter C=20 (inverse of regularization strength) and the Bag of Words (BoW) feature extraction technique achieved the best performance amongst the algorithms we evaluated, with a precision of 0.978, a recall of 0.965, and an F1-score of 0.971. Conclusion: Our results show that a machine learning approach can effectively locate code-smell posts even when a post's title and/or tags are not helpful. The technique can also be used to extract code smell discussions from other textual artefacts (e.g., code reviews) and, promisingly, to extract SO discussions on other topics.
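To make the best-performing setup concrete, the following is a minimal sketch of a BoW + Logistic Regression classification pipeline of the kind the abstract describes, assuming a scikit-learn implementation. The placeholder posts, labels, and preprocessing choices are illustrative assumptions and are not taken from the paper; only the classifier (Logistic Regression with C=20) and the BoW features correspond to the reported configuration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Hypothetical placeholder posts; the paper's dataset contains 3000 manually
# labeled code-smell posts and 3000 non-code-smell posts.
posts = [
    "My class has grown huge and does everything. Is this a god class smell?",
    "How should I refactor a long method with duplicated code blocks?",
    "This method mostly uses another object's data. Is that feature envy?",
    "How do I parse JSON with the Python standard library?",
    "What is the difference between a list and a tuple?",
    "How can I center a div with CSS flexbox?",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = code-smell discussion, 0 = other

X_train, X_test, y_train, y_test = train_test_split(
    posts, labels, test_size=1 / 3, stratify=labels, random_state=42)

# Bag of Words (BoW) feature extraction.
vectorizer = CountVectorizer(lowercase=True)
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

# Logistic Regression with C=20 (inverse of regularization strength),
# the best-performing configuration reported in the abstract.
clf = LogisticRegression(C=20, max_iter=1000)
clf.fit(X_train_bow, y_train)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, clf.predict(X_test_bow), average="binary", zero_division=0)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")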



Published In

ESEM '20: Proceedings of the 14th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), October 2020, 412 pages.
ISBN: 9781450375801
DOI: 10.1145/3382494

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. Automatic Classification
  2. Code Smell
  3. Discussion
  4. Stack Overflow

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • IBO Technology (Shenzhen) Co., Ltd., China
  • The National Key R&D Program of China

Conference

ESEM '20

Acceptance Rates

ESEM '20 Paper Acceptance Rate 26 of 123 submissions, 21%;
Overall Acceptance Rate 130 of 594 submissions, 22%


Cited By

  • (2024) The Use of AI in Software Engineering: A Synthetic Knowledge Synthesis of the Recent Research Literature. Information 15(6), 354. DOI: 10.3390/info15060354. Online publication date: 14-Jun-2024.
  • (2022) Asking about Technical Debt: Characteristics and Automatic Identification of Technical Debt Questions on Stack Overflow. Proceedings of the 16th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement, 45-56. DOI: 10.1145/3544902.3546245. Online publication date: 19-Sep-2022.
  • (2022) An Insight into the Reusability of Stack Overflow Code Fragments in Mobile Applications. 2022 IEEE 16th International Workshop on Software Clones (IWSC), 69-75. DOI: 10.1109/IWSC55060.2022.00020. Online publication date: Oct-2022.
  • (2022) An empirical study of question discussions on Stack Overflow. Empirical Software Engineering 27(6). DOI: 10.1007/s10664-022-10180-z. Online publication date: 1-Nov-2022.
  • (2021) A Machine Learning Based Ensemble Method for Automatic Multiclass Classification of Decisions. Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering, 40-49. DOI: 10.1145/3463274.3463325. Online publication date: 21-Jun-2021.
  • (2021) Understanding Code Smell Detection via Code Review: A Study of the OpenStack Community. 2021 IEEE/ACM 29th International Conference on Program Comprehension (ICPC), 323-334. DOI: 10.1109/ICPC52881.2021.00038. Online publication date: May-2021.
