Abstract
Machine learning (ML), including deep learning, has recently gained tremendous popularity in a wide range of applications. However, like traditional software, ML applications are not immune to the bugs that result from programming errors. Explicit programming errors usually manifest through error messages and stack traces. These stack traces describe the chain of function calls that leads to an anomalous situation, or exception. Indeed, these exceptions may cross the entire software stack (including applications and libraries). Thus, studying the ML-related patterns in stack traces can help practitioners and researchers understand the causes of exceptions in ML applications and the challenges faced by ML developers. To that end, we mine Stack Overflow (SO) and study 18,538 ML-related stack traces related to seven popular Python ML libraries. First, we observe that ML questions that contain stack traces are less likely to get accepted answers than questions that don't, even though they gain more attention (i.e., more views and comments). Second, we observe that recurrent patterns exist in ML stack traces, even across different ML libraries, with a small portion of patterns covering many stack traces. Third, we derive five high-level categories and 26 low-level types from the stack trace patterns: most patterns are related to model training, Python basic syntax, parallelization, subprocess invocation, and external module execution. Furthermore, the patterns related to external dependencies (e.g., file operations) or manipulations of artifacts (e.g., model conversion) are among the least likely to get accepted answers on SO. Our findings provide insights for researchers, ML library developers, and technical forum moderators to better support ML developers in writing error-free ML code.
For example, future research can leverage the common patterns of stack traces to help ML developers locate solutions to problems similar to theirs or to identify experts who have experience solving similar patterns of problems. Researchers and ML library developers could prioritize efforts to help ML developers identify misuses of ML APIs, mismatches in data formats, and potential data/resource contentions so that ML developers can better avoid/fix model-related exception patterns, data-related exception patterns, and multi-process-related exception patterns, respectively.
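To illustrate the kind of processing the abstract describes, the sketch below parses a Python stack trace into its exception type and chain of called functions, as a first step toward grouping traces by recurrent patterns. This is a minimal, hypothetical example: the trace string and the parsing heuristics are illustrative assumptions, not the paper's actual mining pipeline or dataset.

```python
import re

# Illustrative (hypothetical) stack trace, similar in shape to those
# embedded in Stack Overflow questions about Python ML libraries.
TRACE = """Traceback (most recent call last):
  File "train.py", line 12, in <module>
    model.fit(x_train, y_train)
  File "/usr/lib/python3/site-packages/keras/engine/training.py", line 1184, in fit
    data_handler = data_adapter.get_data_handler(x=x, y=y)
ValueError: Data cardinality is ambiguous
"""

def parse_trace(trace: str):
    """Return (exception_type, chain of called functions) for one trace."""
    # Each frame line has the form: File "<path>", line <n>, in <function>
    frames = re.findall(r'File "([^"]+)", line \d+, in (\S+)', trace)
    # The last non-empty line carries the exception type and its message.
    last_line = trace.strip().splitlines()[-1]
    exc_type = last_line.split(":", 1)[0]
    return exc_type, [func for _path, func in frames]

exc, chain = parse_trace(TRACE)
print(exc)    # ValueError
print(chain)  # ['<module>', 'fit']
```

Representing each trace as an (exception type, call chain) pair is one simple way to make traces comparable across posts; sequence-pattern mining over such chains could then surface the recurrent patterns the study reports.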
Data Availability
The data and scripts used to produce this work are shared in a replication package (MOOSELab 2022).
Notes
We used a small time overlap between the databases to ensure that we did not miss any SO posts, and removed any duplicates from our final dataset.
Separating text blocks aids in determining the length of the descriptions in SO questions.
References
Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M et al (2016) TensorFlow: A system for large-scale machine learning. In: 12th USENIX symposium on operating systems design and implementation (OSDI 16), pp 265–283
Alharthi H, Outioua D, Baysal O (2016) Predicting questions’ scores on Stack Overflow. In: 2016 IEEE/ACM 3rd international workshop on crowdsourcing in software engineering (CSI-SE), pp 1–7
Alshangiti M, Sapkota H, Murukannaiah PK, Liu X, Yu Q (2019) Why is developing machine learning applications challenging? a study on Stack Overflow posts. In: 2019 ACM/IEEE international symposium on empirical software engineering and measurement (ESEM). IEEE, pp 1–11
Amershi S, Begel A, Bird C, DeLine R, Gall H, Kamar E, Nagappan N, Nushi B, Zimmermann T (2019) Software engineering for machine learning: A case study. In: 2019 IEEE/ACM 41st international conference on software engineering: software engineering in practice (ICSE-SEIP). IEEE, pp 291–300
Atwi H, Lin B, Tsantalis N, Kashiwa Y, Kamei Y, Ubayashi N, Bavota G, Lanza M (2021) Pyref: Refactoring detection in python projects. In: 2021 IEEE 21st international working conference on source code analysis and manipulation (SCAM), pp 136–141
Baltes S, Dumani L, Treude C, Diehl S (2018) Sotorrent: Reconstructing and analyzing the evolution of Stack Overflow posts. In: Proceedings of the 15th international conference on mining software repositories
Bangash AA, Sahar H, Chowdhury S, Wong AW, Hindle A, Ali K (2019) What do developers know about machine learning: a study of ml discussions on stackoverflow. In: 2019 IEEE/ACM 16th international conference on mining software repositories (MSR). IEEE, pp 260–264
Bhat V, Gokhale A, Jadhav R, Pudipeddi J, Akoglu L (2014) Min (e) d your tags: Analysis of question response time in Stack Overflow. In: 2014 IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM 2014). IEEE, pp 328–335
Borges H, Hora A, Valente MT (2016a) Predicting the popularity of github repositories. In: Proceedings of the The 12th international conference on predictive models and data analytics in software engineering, pp 1–10
Borges H, Hora A, Valente MT (2016b) Understanding the factors that impact the popularity of Github repositories. In: 2016 IEEE international conference on software maintenance and evolution (ICSME). IEEE, pp 334–344
Braiek HB, Khomh F, Adams B (2018) The open-closed principle of modern machine learning frameworks. In: Proceedings of the 15th international conference on mining software repositories, pp 353–363
Burnard P (1991) A method of analysing interview transcripts in qualitative research. Nurse Educ Today 11(6):461–466
Castelvecchi D (2016) Can we open the black box of ai? Nat News 538(7623):20
Chandrasekar P (2020) Scripting the future of Stack Overflow. [Online]. Available: https://stackoverflow.blog/2020/01/21/scripting-the-future-of-stack-2020-plans-vision/
Chollet F et al (2015) Keras. [Online]. Available: https://github.com/fchollet/keras
Debbarma MK, Debbarma S, Debbarma N, Chakma K, Jamatia A (2013) A review and analysis of software complexity metrics in structural testing. Int J Comput Commun Eng 2(2):129–133
Deloitte (2020) Annual report. https://www2.deloitte.com/content/dam/Deloitte/dk/Documents/about-deloitte/Impact_Report_20_21_web.pdf
Dilhara M, Ketkar A, Dig D (2021) Understanding software-2.0: A study of machine learning library usage and evolution. ACM Trans Softw Eng Methodol (TOSEM) 30(4):1–42
Gao W, Wu J, Xu G (2022) Detecting duplicate questions in Stack Overflow via source code modeling. Int J Softw Eng Knowl Eng 32(02):227–255
Gupta S (2021) What is the best language for machine learning? [Online]. Available: https://www.springboard.com/blog/data-science/best-language-for-machine-learning/
Hamidi A, Antoniol G, Khomh F, Di Penta M, Hamidi M (2021) Towards understanding developers’ machine-learning challenges: A multi-language study on Stack Overflow. In: 2021 IEEE 21st international working conference on source code analysis and manipulation (SCAM). IEEE, pp 58–69
Hayes AF, Krippendorff K (2007) Answering the call for a standard reliability measure for coding data. Commun Methods Meas 1(1):77–89
Huang C, Yao L, Wang X, Benatallah B, Sheng QZ (2017) Expert as a service: software expert recommendation via knowledge domain embeddings in Stack Overflow. In: 2017 IEEE international conference on web services (ICWS), pp 317–324
Humbatova N, Jahangirova G, Bavota G, Riccio V, Stocco A, Tonella P (2020) Taxonomy of real faults in deep learning systems. In: Proceedings of the ACM/IEEE 42nd international conference on software engineering, pp 1110–1121
Islam MJ, Nguyen HA, Pan R, Rajan H (2019) What do developers ask about ml libraries? a large-scale study using Stack Overflow. Preprint arXiv:1906.11940
Islam MJ, Nguyen G, Pan R, Rajan H (2019) A comprehensive study on deep learning bug characteristics. In: Proceedings of the 2019 27th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering (ESEC/FSE), pp 510–520
Jordan MI, Mitchell TM (2015) Machine learning: Trends, perspectives, and prospects. Science 349(6245):255–260
Khandkar SH (2009) Open coding. University of Calgary, vol 23, p 2009
Kou B, Di Y, Chen M, Zhang T (2022) Sosum: A dataset of Stack Overflow post summaries. In: Proceedings of the 19th international conference on mining software repositories, ser. MSR ’22. New York, NY, USA: Association for Computing Machinery, pp 247–251. [Online]. Available: https://doi.org/10.1145/3524842.3528487
Krippendorff K (2011) Computing Krippendorff's alpha-reliability
LaMorte WW (2017) Mann whitney u test (wilcoxon rank sum test). [Online]. Available: https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_nonparametric/BS704_Nonparametric4.html
Liu J, Baltes S, Treude C, Lo D, Zhang Y, Xia X (2021) Characterizing search activities on Stack Overflow. In: Proceedings of the 29th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering, ser. ESEC/FSE 2021. New York, NY, USA: Association for Computing Machinery, pp 919–931
Loper E, Bird S (2002) Nltk: The natural language toolkit. Preprint arXiv:cs/0205028
Lune H, Berg BL (2017) Qualitative research methods for the social sciences. Pearson
Lyu Y, Li H, Sayagh M, Jiang ZM, Hassan AE (2021) An empirical study of the impact of data splitting decisions on the performance of aiops solutions. ACM Trans Softw Eng Methodol (TOSEM) 30(4):1–38
Majidi F, Openja M, Khomh F, Li H (2022) An empirical study on the usage of automated machine learning tools. Preprint arXiv:2208.13116
Mann HB, Whitney DR (1947) On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat 18(1):50–60
Marzi G, Balzano M, Marchiori D (2024) K-Alpha Calculator - Krippendorff's alpha calculator: A user-friendly tool for computing Krippendorff's alpha inter-rater reliability coefficient. MethodsX 12:102545. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2215016123005411
Medeiros M, Kulesza U, Bonifacio R, Adachi E, Coelho R (2020) Improving bug localization by mining crash reports: An industrial study. In: 2020 IEEE international conference on software maintenance and evolution (ICSME). IEEE, pp 766–775
Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, Freeman J, Tsai D, Amde M, Owen S, Xin D, Xin R, Franklin MJ, Zadeh R, Zaharia M, Talwalkar A (2016) Mllib: Machine learning in apache spark. J Mach Learn Res 17(1):1235–1241
MOOSELab (2022) Replication: What causes exceptions in machine learning applications? Mining machine learning-related stack traces on Stack Overflow. https://github.com/mooselab/ML_StackTrace
NewVantage Partners (2022) The quest to achieve data-driven leadership: A progress report on the state of corporate data initiatives. [Online]. Available: https://c6abb8db-514c-4f5b-b5a1-fc710f1e464e.filesusr.com/ugd/e5361a_2f859f3457f24cff9b2f8a2bf54f82b7.pdf
Nguyen G, Dlugolinsky S, Bobák M, Tran V, Lopez Garcia A, Heredia I, Malík P, Hluchỳ L (2019) Machine learning and deep learning frameworks and libraries for large-scale data mining: a survey. Artif Intell Rev 52(1):77–124
Stack Overflow (2021) Stack Overflow developer survey 2021. [Online]. Available: https://insights.stackoverflow.com/survey/2021#overview
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, vol 32
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V et al (2011) Scikit-learn: Machine learning in python. J Mach Learn Res 12:2825–2830
Raschka S, Patterson J, Nolet C (2020) Machine learning in python: Main developments and technology trends in data science, machine learning, and artificial intelligence. Information 11(4):193. [Online]. Available: https://www.mdpi.com/2078-2489/11/4/193
Rubei R, Di Sipio C, Nguyen PT, Di Rocco J, Di Ruscio D (2020) Postfinder: Mining Stack Overflow posts to support software developers. Inf Softw Technol 127:106367
Sabor KK, Hamdaqa M, Hamou-Lhadj A (2020) Automatic prediction of the severity of bugs using stack traces and categorical features. Inf Softw Technol 123:106205
Stol K-J, Ralph P, Fitzgerald B (2016) Grounded theory in software engineering research: a critical review and guidelines. In: Proceedings of the 38th International conference on software engineering, pp 120–131
Sui L, Dietrich J, Tahir A (2017) On the use of mined stack traces to improve the soundness of statically constructed call graphs. In: 2017 24th Asia-Pacific software engineering conference (APSEC), pp 672–676
Sun X, Zhou T, Li G, Hu J, Yang H, Li B (2017) An empirical study on real bugs for machine learning programs. In: 2017 24th Asia-Pacific software engineering conference (APSEC). IEEE, pp 348–357
Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M et al (2019) Huggingface’s transformers: State-of-the-art natural language processing. Preprint arXiv:1910.03771
Xu S, Bennett A, Hoogeveen D, Lau JH, Baldwin T (2018) Preferred answer selection in Stack Overflow: Better text representations ... and metadata, metadata, metadata. In: Proceedings of the 2018 EMNLP workshop W-NUT: The 4th workshop on noisy user-generated text. Brussels, Belgium: Association for Computational Linguistics, pp 137–147. [Online]. Available: https://aclanthology.org/W18-6119
Zhang J, Wang Y, Yang D (2015) Ccspan: Mining closed contiguous sequential patterns. Knowl-Based Syst 89:1–13
Zhang T, Gao C, Ma L, Lyu M, Kim M (2019) An empirical study of common challenges in developing deep learning applications. In: 2019 IEEE 30th international symposium on software reliability engineering (ISSRE). IEEE, pp 104–115
Zhang R, Xiao W, Zhang H, Liu Y, Lin H, Yang M (2020) An empirical study on program failures of deep learning jobs. In: Proceedings of the ACM/IEEE 42nd international conference on software engineering, ser. ICSE ’20. New York, NY, USA: Association for Computing Machinery, pp 1159–1170
Acknowledgements
We would like to gratefully acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Fonds de recherche du Québec - Nature et technologies (FRQNT) for their funding support for this work.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Competing Interest
No financial support was provided by any institution for the compilation of this work.
Additional information
Communicated by: Davide Falessi.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ghadesi, A., Lamothe, M. & Li, H. What causes exceptions in machine learning applications? Mining machine learning-related stack traces on Stack Overflow. Empir Software Eng 29, 107 (2024). https://doi.org/10.1007/s10664-024-10499-9