DOI: 10.1145/3643991.3644915
research-article

How do Machine Learning Projects use Continuous Integration Practices? An Empirical Study on GitHub Actions

Published: 02 July 2024

Abstract

Continuous Integration (CI) is a well-established practice in traditional software development, but its nuances in Machine Learning (ML) projects remain relatively unexplored. Given the distinctive nature of ML development, understanding how CI practices are adopted in this context is crucial for tailoring effective approaches. In this study, we conduct a comprehensive analysis of 185 open-source projects on GitHub (93 ML and 92 non-ML projects). Our investigation comprises both quantitative and qualitative dimensions, aiming to uncover differences in CI adoption between ML and non-ML projects. Our findings indicate that ML projects often require longer build durations, and that medium-sized ML projects exhibit lower test coverage than their non-ML counterparts. Moreover, small and medium-sized ML projects show a higher prevalence of increasing build duration trends than their non-ML counterparts. Additionally, our qualitative analysis illuminates the discussions around CI in both ML and non-ML projects, encompassing themes such as CI Build Execution and Status, CI Testing, and CI Infrastructure. These insights shed light on the unique challenges ML projects face in adopting CI practices effectively.
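As a rough illustration of how a build-duration comparison between two project groups could be quantified, the sketch below computes Cliff's delta, a nonparametric effect size for two independent samples. The abstract does not say which statistic was used, so this choice is an assumption. The function names, the duration values, and the magnitude thresholds (a commonly used convention) are hypothetical and are not taken from the study's data.

```python
from itertools import product


def cliffs_delta(xs, ys):
    """Return Cliff's delta: P(x > y) - P(x < y) over all (x, y) pairs."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x, y in product(xs, ys) if x > y)
    less = sum(1 for x, y in product(xs, ys) if x < y)
    return (greater - less) / (len(xs) * len(ys))


def magnitude(delta):
    """Map |delta| to a label using commonly cited cut-offs (an assumption here)."""
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"


# Hypothetical build durations in minutes; NOT data from the paper.
ml_builds = [12.0, 18.5, 25.0, 31.2, 40.0]
non_ml_builds = [4.0, 6.5, 9.0, 12.0, 15.5]

delta = cliffs_delta(ml_builds, non_ml_builds)
label = magnitude(delta)
print(f"Cliff's delta = {delta:.2f} ({label})")
```

A positive delta means values in the first sample tend to be larger than those in the second. In practice such an effect size would usually accompany a significance test rather than replace it.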

    Published In

    MSR '24: Proceedings of the 21st International Conference on Mining Software Repositories
    April 2024
    788 pages
    ISBN:9798400705878
    DOI:10.1145/3643991
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. continuous integration
    2. machine learning
    3. GitHub actions
    4. mining software repositories

    Qualifiers

    • Research-article

    Funding Sources

    • CAPES
    • FACEPE
    • PRONEX
    • CNPq
