DOI: 10.1145/3510457.3513042
research-article
Open access

Strategies for reuse and sharing among data scientists in software teams

Published: 17 October 2022

Abstract

Effective sharing and reuse practices have long been hallmarks of proficient software engineering. Yet the exploratory nature of data science presents new challenges and opportunities to support sharing and reuse of analysis code. To better understand current practices, we conducted interviews (N=17) and a survey (N=132) with data scientists at Microsoft, and extracted five commonly used strategies for sharing and reuse of past work: personal analysis reuse, personal utility libraries, team shared analysis code, team shared template notebooks, and team shared libraries. We also identify factors that encourage or discourage data scientists from sharing and reusing. Our participants described obstacles to reuse and sharing, including a lack of incentives to create shared code, difficulties in making data science code modular, and a lack of tool interoperability. We discuss how future tools might help meet these needs.

Published In

ICSE-SEIP '22: Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice
May 2022
371 pages
ISBN: 9781450392266
DOI: 10.1145/3510457
This work is licensed under a Creative Commons Attribution International 4.0 License.

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. code reuse
  2. code sharing
  3. data science
  4. survey

Qualifiers

  • Research-article

Conference

ICSE '22

Cited By

  • (2024) Improving Steering and Verification in AI-Assisted Data Analysis with Interactive Task Decomposition. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, 1-19. DOI: 10.1145/3654777.3676345. Online publication date: 13-Oct-2024
  • (2024) DistilKaggle: A Distilled Dataset of Kaggle Jupyter Notebooks. Proceedings of the 21st International Conference on Mining Software Repositories, 647-651. DOI: 10.1145/3643991.3644882. Online publication date: 15-Apr-2024
  • (2023) A Model of Reusable Assets in AIE Software Systems. Journal of Computer Science and Technology 23:2, e13. DOI: 10.24215/16666038.23.e13. Online publication date: 25-Oct-2023
  • (2023) Dead or Alive: Continuous Data Profiling for Interactive Data Science. IEEE Transactions on Visualization and Computer Graphics 30:1, 197-207. DOI: 10.1109/TVCG.2023.3327367. Online publication date: 30-Oct-2023
  • (2023) A Meta-Summary of Challenges in Building Products with ML Components – Collecting Experiences from 4758+ Practitioners. 2023 IEEE/ACM 2nd International Conference on AI Engineering – Software Engineering for AI (CAIN), 171-183. DOI: 10.1109/CAIN58948.2023.00034. Online publication date: May-2023
