Research article. DOI: 10.1145/3629479.3629514

A Case Study on Data Science Processes in an Academia-Industry Collaboration

Published: 06 December 2023

Abstract

Data Science (DS) is emerging in major software development projects and often needs to follow software development practices. Therefore, DS processes will likely continue to attract Software Engineering (SE) practices and vice versa. This case study aims to map and describe a software development process for Machine Learning (ML)-enabled applications, and the associated practices, used in a real DS project at the Recod.ai laboratory in collaboration with an industrial partner. The focus was to analyze the process and identify its strengths and primary challenges, considering the team's expertise in robust ML practices and how those practices can contribute to general software quality. To achieve this, we conducted semi-structured interviews and analyzed them using procedures from Straussian Grounded Theory. The results showed that the DS development process is iterative, with feedback between activities, which differs from the processes in the literature. Additionally, this process involves domain experts to a greater degree. The team also prioritizes software quality characteristics (attributes) in these DS projects to ensure some aspects of the final product's quality, namely functional correctness and robustness. To achieve those, they use regular accuracy metrics and include explainability and data leakage as quality metrics during training. Finally, the software engineer's role and responsibilities differ from those of a traditional industry software engineer, as they are involved in most of the process steps. These characteristics can contribute to high-quality models that meet the partner's needs and, consequently, to relevant contributions at the intersection of SE and DS.

Published In

SBQS '23: Proceedings of the XXII Brazilian Symposium on Software Quality
November 2023
391 pages
ISBN: 9798400707865
DOI: 10.1145/3629479

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. case study
  2. data science
  3. machine learning
  4. software processes

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SBQS '23: XXII Brazilian Symposium on Software Quality
November 7-10, 2023
Brasília, Brazil

Acceptance Rates

Overall Acceptance Rate 35 of 99 submissions, 35%

