DOI: 10.1145/3382494.3422161

Automatic Identification of Code Smell Discussions on Stack Overflow: A Preliminary Investigation

Published: 23 October 2020

Abstract

Background: Code smells indicate potential design or implementation problems that may have a negative impact on programs. As with other software artefacts, developers use Stack Overflow (SO) to ask questions about code smells. However, given the high number of questions asked on the platform and the limitations of the default tagging system, extracting knowledge about code smells manually takes significant effort. Aim: We utilized supervised machine learning techniques to automatically identify code-smell discussions in SO posts. Method: We conducted an experiment using a manually labeled dataset of 3000 code-smell and 3000 non-code-smell posts to evaluate the performance of different classifiers in automatically identifying code smell discussions. Results: Our results show that Logistic Regression (LR) with parameter C=20 (inverse of regularization strength) and the Bag of Words (BoW) feature extraction technique achieved the best performance amongst the algorithms we evaluated, with a precision of 0.978, a recall of 0.965, and an F1-score of 0.971. Conclusion: Our results show that a machine learning approach can effectively locate code-smell posts even when a post's title and/or tags are not helpful. The technique can also be used to extract code smell discussions from other textual artefacts (e.g., code reviews) and, promisingly, to extract SO discussions on other topics.
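To make the best-performing setup concrete, the following is a minimal sketch of a BoW + Logistic Regression classification pipeline of the kind the abstract describes, assuming a scikit-learn implementation. The placeholder posts, labels, and preprocessing choices are illustrative assumptions and are not taken from the paper; only the classifier (Logistic Regression with C=20) and the BoW features correspond to the reported configuration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Hypothetical placeholder posts; the paper's dataset contains 3000 manually
# labeled code-smell posts and 3000 non-code-smell posts.
posts = [
    "My class has grown huge and does everything. Is this a god class smell?",
    "How should I refactor a long method with duplicated code blocks?",
    "This method mostly uses another object's data. Is that feature envy?",
    "How do I parse JSON with the Python standard library?",
    "What is the difference between a list and a tuple?",
    "How can I center a div with CSS flexbox?",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = code-smell discussion, 0 = other

X_train, X_test, y_train, y_test = train_test_split(
    posts, labels, test_size=1 / 3, stratify=labels, random_state=42)

# Bag of Words (BoW) feature extraction.
vectorizer = CountVectorizer(lowercase=True)
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

# Logistic Regression with C=20 (inverse of regularization strength),
# the best-performing configuration reported in the abstract.
clf = LogisticRegression(C=20, max_iter=1000)
clf.fit(X_train_bow, y_train)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, clf.predict(X_test_bow), average="binary", zero_division=0)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")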



Published In

ESEM '20: Proceedings of the 14th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), October 2020, 412 pages.
ISBN: 9781450375801
DOI: 10.1145/3382494

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. Automatic Classification
  2. Code Smell
  3. Discussion
  4. Stack Overflow

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • IBO Technology (Shenzhen) Co., Ltd., China
  • The National Key R&D Program of China

Conference

ESEM '20

Acceptance Rates

ESEM '20 Paper Acceptance Rate 26 of 123 submissions, 21%;
Overall Acceptance Rate 130 of 594 submissions, 22%


Cited By

  • (2024) The Use of AI in Software Engineering: A Synthetic Knowledge Synthesis of the Recent Research Literature. Information 15(6), 354. DOI: 10.3390/info15060354. Online publication date: 14-Jun-2024.
  • (2022) Asking about Technical Debt: Characteristics and Automatic Identification of Technical Debt Questions on Stack Overflow. Proceedings of the 16th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement, 45-56. DOI: 10.1145/3544902.3546245. Online publication date: 19-Sep-2022.
  • (2022) An Insight into the Reusability of Stack Overflow Code Fragments in Mobile Applications. 2022 IEEE 16th International Workshop on Software Clones (IWSC), 69-75. DOI: 10.1109/IWSC55060.2022.00020. Online publication date: Oct-2022.
  • (2022) An empirical study of question discussions on Stack Overflow. Empirical Software Engineering 27(6). DOI: 10.1007/s10664-022-10180-z. Online publication date: 1-Nov-2022.
  • (2021) A Machine Learning Based Ensemble Method for Automatic Multiclass Classification of Decisions. Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering, 40-49. DOI: 10.1145/3463274.3463325. Online publication date: 21-Jun-2021.
  • (2021) Understanding Code Smell Detection via Code Review: A Study of the OpenStack Community. 2021 IEEE/ACM 29th International Conference on Program Comprehension (ICPC), 323-334. DOI: 10.1109/ICPC52881.2021.00038. Online publication date: May-2021.
