A Case Study on Data Science Processes in an Academia-Industry Collaboration

Authors:

Breno Bernard Nicolau De França

SBQS '23: Proceedings of the XXII Brazilian Symposium on Software Quality

Pages 1 - 10

https://doi.org/10.1145/3629479.3629514

Published: 06 December 2023

Abstract

Data Science (DS) is emerging in major software development projects and often needs to follow software development practices. Therefore, DS processes will likely continue to draw on Software Engineering (SE) practices and vice versa. This case study aims to map and describe a software development process for Machine Learning (ML)-enabled applications, together with its associated practices, as used in a real DS project at the Recod.ai laboratory in collaboration with an industrial partner. The focus was to analyze the process and identify its strengths and primary challenges, considering the team’s expertise in robust ML practices and how those practices can contribute to general software quality. To achieve this, we conducted semi-structured interviews and analyzed them using procedures from Straussian Grounded Theory. The results showed that the DS development process is iterative, with feedback between activities, which differs from the processes in the literature. Additionally, this process involves domain experts to a greater degree. The team also prioritizes software quality characteristics (attributes) in these DS projects to ensure some aspects of the final product’s quality, i.e., functional correctness and robustness. To achieve those, they use regular accuracy metrics and include explainability and data leakage as quality metrics during training. Finally, the software engineer’s role and responsibilities differ from those of a traditional industry software engineer, as s/he is involved in most of the process steps. These characteristics can contribute to high-quality models that meet the partner’s needs and, consequently, to relevant contributions at the intersection of SE and DS.

Index Terms

A Case Study on Data Science Processes in an Academia-Industry Collaboration
1. Computing methodologies
  1. Machine learning
2. Software and its engineering
  1. Software creation and management
    1. Software development process management
      1. Software development methods

Published In

SBQS '23: Proceedings of the XXII Brazilian Symposium on Software Quality

November 2023

391 pages

ISBN:9798400707865

DOI:10.1145/3629479

Editors:
Edna Dias Canedo, University of Brasília (CIC, UnB), Brazil
Daniel de Paula Porto, University of Brasília (CIC, UnB), Brazil
Fábio Lúcio Lopes Mendonça, University of Brasília (FT, UnB), Brazil
Rafael Timóteo de Sousa Júnior, University of Brasília (FT, UnB), Brazil
Monalessa Perini Barcellos, Federal University of Espírito Santo (UFES), Brazil
Ismayle Sousa Santos, Ceará State University (UECE), Brazil
Sheila Reinehr, Pontifical Catholic University of Paraná (PUC-PR), Brazil
Sergio Soares, Federal University of Pernambuco (UFPE), Brazil
Uirá Kulesza, Federal University of Rio Grande do Norte (UFRN), Brazil
Érica Ferreira de Souza, Federal Technological University of Paraná (UTFPR), Brazil
Adriano Albuquerque, University of Fortaleza (UNIFOR), Brazil
Carla Bezerra, Federal University of Ceará (UFC), Brazil
Rodrigo Santos, Federal University of the State of Rio de Janeiro (UNIRIO), Brazil
Alessandro Garcia, Pontifical Catholic University of Rio de Janeiro (PUC-RJ), Brazil
Simone Dornelas Costa, Federal University of Espírito Santo (UFES), Brazil
Adolfo Gustavo Serra Seca Neto, Federal Technological University of Paraná (UTFPR), Brazil

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SBQS '23

SBQS '23: XXII Brazilian Symposium on Software Quality

November 7 - 10, 2023

Brasília, Brazil

Acceptance Rates

Overall Acceptance Rate 35 of 99 submissions, 35%
